English Pages 554 Year 2005
Language Acquisition, Change and Emergence: Essays in Evolutionary Linguistics
Edited by James W. Minett and William S-Y. Wang
© 2005 by City University of Hong Kong All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, Internet or otherwise, without the prior written permission of the City University of Hong Kong Press
First published 2005 Printed in Hong Kong
ISBN: 962-937-111-1
Published by City University of Hong Kong Press Tat Chee Avenue, Kowloon, Hong Kong Website: www.cityu.edu.hk/upress E-mail: [email protected]
Table of Contents
Foreword
   William S-Y. Wang
1. Introduction
   James W. Minett
Part 1 — Language Emergence
2. Speech and Language — A Human Trait Defined by Molecular Genetics
   King L. Chow
      Language as a unique human trait
      Further implication of genes in the acquisition of language
      The “KE” family
      Is there a hardwired structure for language?
      Mapping of the language gene
      The breakpoint in CS and BRD
      Positional cloning of the FOXP2
      Does FOXP2 offer insight into human evolution?
      What is the role of FOXP2?
      Are there more language genes?
3. Conceptual Complexity and the Brain: Understanding Language Origins
   P. Thomas Schoenemann
      Introduction
      How evolution works
      Evidence for continuity
      Syntax and grammar in natural languages
      Grammar and syntax as emergent characteristics of semantic/conceptual complexity
      Conclusions
4. The Emergence of Grammar from Perspective
   Brian MacWhinney
      Empirical demonstrations
      Depictive and enactive modes
      Direct experience
      Space and time
      Plans
      Perspective and language acquisition
      Conclusion
5. Polygenesis of Linguistic Strategies: A Scenario for the Emergence of Languages
   Christophe Coupé & Jean-Marie Hombert
      The origin of languages
      Mathematical models and computer simulations
      Discussion
      Conclusion
Part 2 — Language Acquisition
6. Multiple-cue Integration in Language Acquisition: A Connectionist Model of Speech Segmentation and Rule-like Behavior
   Morten H. Christiansen, Christopher M. Conway & Suzanne Curtin
      Introduction
      The segmentation problem
      A computational model of multiple-cue integration in speech segmentation
      Simulation 1: A multiple-cue integration account of rule-like behavior
      Simulation 2: The role of segmentation in rule-like behavior
      Experiment 1: Replicating the Marcus et al. (1999) results
      Experiment 2: Segmentation and rule-like behavior
      General discussion
      Conclusion
7. Unsupervised Lexical Learning as Inductive Inference via Compression
   Chunyu Kit
      Introduction
      Goodness measure
      Algorithm
      Learning model
      Testing data and learning results
      Evaluation
      Discussion
      Conclusions
8. The Origin of Linguistic Irregularity
   Charles D. Yang
      The challenge of imperfection
      The reality of phonological rules
      The computation of rules
      The evolution of rules
      Interface conditions and language evolution
Part 3 — Language Change
9. The Language Organism: The Leiden Theory of Language Evolution
   George van Driem
      Language is an organism
      A meme is a meaning, not a unit of imitation
      Tertium datur
      Syntax is a consequence of meaning
10. Taxonomy, Typology and Historical Linguistics
   Merritt Ruhlen
      Introduction
      Taxonomy and historical linguistics
      Pronouns
      Lexical evidence
      Typology
      Conclusion
11. Modeling Language Evolution
   Felipe Cucker, Steve Smale & Ding-Xuan Zhou
      Introduction
      First examples and the basic model
      The learning dynamics
      Language-systems, idiolects, and language drift
      Fitness maximization
Part 4 — Language and Complexity
12. Language and Complexity
   Murray Gell-Mann
      Complexity
      The origins of human language
      Word order – A linguistic arrow of time?
13. Language Acquisition as a Complex Adaptive System
   John H. Holland
      Introduction
      Models
      Complex adaptive systems
      Properties of complex adaptive systems
      Adaptive agents
      Building blocks
      An agent-based model of language acquisition
      Discussion
14. How Many Meanings Does a Word Have? Meaning Estimation in Chinese and English
   Chienjer Charles Lin & Kathleen Ahrens
      Introduction
      Meanings in dictionaries, in language users, and in a semantic theory
      Experiment 1: Comparing different meaning measurements in Chinese
      Experiment 2: Comparing different meaning measurements in English
      General discussion
15. Typology and Complexity
   Randy LaPolla
      Complexity in what sort of system?
      Complexity in different subsets of human conventions
      Complex for whom?
      Background: Ostension and inference
      Is complexity necessary?
      We seem to be able to do well without some forms of complexity
      Complexity as a feature of categories, not language
      Complexity of language as a reflection of complexity of cognitive categories
      The development of language structure
      How languages differ in terms of complexity
      Conclusion
16. Creoles and Complexity
   Bernard Comrie
      Introduction
      Creole origins and complexity
      Conclusions
Index
Contributors
Foreword
William S-Y. Wang
Chinese University of Hong Kong
Starting in May 2001, the Language Engineering Laboratory hosted a series of workshops in linguistics at the City University of Hong Kong. For convenience of reference, we called them ACE workshops. Acquisition and Change in the acronym are well recognized hallmarks of language. The ‘E’ in ACE could stand for either Evolution or Emergence, two concepts of fundamental importance in the way we think about linguistics, the latter word suggesting an emergentist perspective, as opposed to an innatist one. The fifteen essays contributed by the visitors to our workshops are organized into a single volume here, though they were presented on different occasions. James Minett has grouped them into four distinct parts, and begins the volume with a comprehensive introduction.
While the authors include mainly linguists, several other disciplines are represented as well: anthropology, electronic engineering and computer science, genetics, physics, and psychology. Such an interdisciplinary perspective for our workshops makes good sense only within a broad evolutionary framework. The geneticist Dobzhansky once said that “nothing in biology makes sense except in the light of evolution.” Language is a product of biological and social forces acting individually as well as upon each other: a paradigm example of co-evolution. Only by sorting out the effects of these formative forces can we begin to illuminate the evolutionary trajectory which describes how language emerged in our species many millennia ago, and how it continues to change as it is acquired from generation to generation. These forces themselves must have changed their effects at major cultural transitions, e.g., the invention of writing, the advent of electronic communication, and the expansion of global languages. We are only beginning to understand these effects.
Change comes about primarily when the child constructs his own language from the bits and pieces in his linguistic environment, and also when the adult learns a foreign language and passes on his hybrid forms. In any case, a key feature of the evolutionary process is the tremendous amount of variability or heterogeneity at each and every linguistic level, for both the individual and the speech community. Some general principles of self-organization must be constantly at work to constrain the variability, to limit the amount of ambiguity, and to balance the needs of the speaker and the hearer. Some of these principles may be associated with complex adaptive systems in general and applied to other cognitive processes as well; others may be specific to language. A central goal in the study of language and cognition is to investigate such principles in an explicit and systematic way. One approach that we find particularly promising is the building of models and simulation by computer.
The Language Engineering Laboratory has received a warm helping hand since its inception from Po Chung, who was then Vice President of Research at the City University of Hong Kong. Additional support came from the Research Grants Council of Hong Kong and Academia Sinica of Taiwan. Murray Gell-Mann, Thomas Lee, and Stephen Smale were kind enough to serve as my co-Principal Investigators on some of these grants. Encouragement and collaboration from colleagues were invaluable, particularly Ron Chen and C.C. Cheng. At a greater distance, we received a great deal of intellectual and moral support from Luca Cavalli-Sforza, John Holland, Mieko Ogura, Merritt Ruhlen, and Vince Sarich. Several workshops I co-organized with John at the Santa Fe Institute were particularly stimulating for our research. To all these friends and to all the contributors of this volume, our heart-felt thanks.
In 2004, the Language Engineering Laboratory moved from the City University of Hong Kong a few miles north to the beautiful campus of the Chinese University of Hong Kong in Shatin, where we are affiliated with the DSP and Speech Technology Laboratory.
Meanwhile, some of the topics we have been studying over these several years are finding their way into various journals. The references below give an idea of the broad span of our efforts. We have also moved our webpage to the Chinese University, at the address given below, where the interested reader can download PDF files. We plan to continue research on language within an evolutionary and interdisciplinary perspective, and hope to continue to receive the intellectual and moral support from our friends near and far who share our interest in the acquisition, change and evolution of language.
William S-Y. Wang February 2005
References
Gong, T. and Wang, W. S-Y. (2005) Computational modeling on language emergence: A coevolution model of lexicon, syntax and social structure. Language and Linguistics 6.1, 1–41.
Gong, T., Minett, J. W., Wang, W. S-Y., Ke, J-Y. and Holland, J. H. (2005) Coevolution of lexicon and syntax from a simulation perspective. Complexity. In press.
Ke, J-Y., Minett, J. W., Au, C-P. and Wang, W. S-Y. (2002) Self-organization and selection in the emergence of vocabulary. Complexity 7.3, 41–54.
Ke, J-Y., Ogura, M. and Wang, W. S-Y. (2003) Optimization models of sound systems using genetic algorithms. Computational Linguistics 29.1, 1–18.
Peng, G. and Wang, W. S-Y. (2004) An innovative prosody modeling method for Chinese speech recognition. International Journal of Speech Technology 7, 129–140.
Peng, G. (2005) Temporal and tonal aspects of Chinese syllables. Journal of Chinese Linguistics. In press.
Wang, F. and Wang, W. S-Y. (2004) Basic words and language evolution. Language and Linguistics 5.3, 643–662.
Wang, W. S-Y. and Ke, J-Y. (2003) Language heterogeneity and self-organizing consciousness — commentary on Perruchet, P. and Vinter, A., The self-organizing consciousness. Behavioral and Brain Sciences 25.3, 358–359.
Wang, W. S-Y., Ke, J-Y. and Minett, J. W. (2004) Computational studies of language evolution. In Chu-ren Huang and Winfried Lenders (Eds.) Computational Linguistics and Beyond (pp. 65–106). Academia Sinica: Institute of Linguistics.
Wang, W. S-Y. and Gong, T. (2005) Categorization in artificial agents: guidance on empirical research? — Commentary on Steels, L. and Belpaeme, T., Coordinating perceptually grounded categories through language. Behavioral and Brain Sciences 28.4.
Wang, W. S-Y. and Minett, J. W. (2005) The invasion of language: emergence, change, and death. Trends in Ecology and Evolution 20.5, 263–269.
Wang, W. S-Y. and Minett, J. W. (2005) Vertical and horizontal transmission in language evolution. Transactions of the Philological Society 103.2, 121–146.
Whitehouse, P., Usher, T., Ruhlen, M. and Wang, W. S-Y. (2004) Kusunda: an Indo-Pacific language in Nepal. Proc. National Academy of Sciences (USA) 101, 5692–5695.
Webpage: http://dsp.ee.cuhk.edu.hk/lel/
1. Introduction: Essays in Evolutionary Linguistics
James W. Minett
Chinese University of Hong Kong
The human language faculty has evolved at multiple levels: from changes in the cognitive processes by which language is acquired in the individual, to language change by diffusion of acquired linguistic features across populations of individuals, to the emergence of linguistic features over phylogenetic time scales. Evolution of language at each of these levels interacts with that at each other level. Furthermore, the language faculty has developed as a product of the complex interactions among the genetic codes that determine our physical and cognitive capabilities, and the environments — both physical and cultural — within which we live and interact. In order to better understand how human language has come to be the way it is, a holistic approach to studying language evolution is essential. The essays that follow provide a broad coverage of current efforts to better understand how language evolves. Part I concentrates on the phylogenetic emergence of language. Chow opens this part by reviewing the literature on the discovery and subsequent mapping of the FOXP2 gene, which has been implicated as having a role in the emergence of the human language faculty. The gene was discovered primarily as a result of studies of the so-called “KE” family, some of whose members have impaired language abilities. The affected individuals were found to have grammar-specific difficulties, particularly with suffixation. For
some, this has been considered evidence that FOXP2 provides a genetic basis for grammar. However, the affected individuals were also found to have other deficits, such as poor motor control of the musculature of the vocal tract and impaired speech production and perception, suggesting that the gene supports a broader inventory of functions than just grammar. Chow goes on to explain that the human FOXP2 gene has undergone strong positive selection since the human lineage split from that of the other primates. In that time, the amino acid sequence of human FOXP2 has undergone two changes, becoming fixed sometime in the last 120,000 years, whereas the corresponding sequences of other primates have undergone no change (except in the case of orangutans). What drove the evolution and fixing of this gene? Chow suggests that the driving force might have been language. The time depth of the fixing of FOXP2 coincides quite well with both the estimated time depth of between 154,000 and 160,000 years for the earliest known remains of anatomically modern humans, found at Herto, Ethiopia in 1997, and the appearance of novel human behaviors, such as art and long-distance trade, some 50,000 years ago. One should be careful, however, not to infer that language necessarily emerged anew at that time. As Chow notes, only a few cases of language impairment have been accounted for by mutation of FOXP2. The emergence of language has probably been influenced over an extended period of time by numerous genetic factors, FOXP2 simply being the first such gene identified. The search is on for further genes that impact upon language function and proficiency. The mapping of these genes, the estimation of the time depth of their fixing, and the identification of their interactions with other genes are essential next steps in the study of the genetic basis for language.
Schoenemann discusses the origin of grammar in human language.
He begins by explaining that evolution tends to favor small, incremental changes to pre-existing functions, rather than the creation of novel, domain-specific functions, which would most often lead to a decrease in fitness of an organism and so be selected
against. He proposes that speech production and perception, semantics and syntax each evolved continuously in this way. Pointing also to the lack of clinical syndromes that affect only language, Schoenemann argues that grammar is not hard-wired into a language-specific cognitive module, but rather evolved as a result of incremental changes to pre-existing cognitive mechanisms that allowed for progressively complex conceptualization. He notes that the syntactic universals that exist in modern languages are just emergent properties of underlying semantic constraints, rather than encoded within a built-in Universal Grammar. For instance, the distinction of nouns from verbs is considered to follow directly from our general ability to conceptualize objects as distinct from actions. Schoenemann then relates the growth of cranial capacity, an indicator of brain size, in the hominid lineage to an increasing capability for complex conceptualization. He argues that the increasing capabilities of hominids in cognizing their environment led to the emergence of increasingly complex language. In other words, Schoenemann holds that semantics drove the emergence of syntax. Reporting his own previous work with computer simulations using populations of interacting artificial neural networks, Schoenemann has found that populations comprising larger networks evolve more complicated languages faster and with less error than those comprising smaller networks. This result demonstrates that an increase in the size of a neural net does lead to an increase in the ability to process information, and further supports his hypothesis that language emerged through a process of evolutionary continuity.
Like Schoenemann, MacWhinney believes that language emerged as a result of a series of gradual evolutionary adaptations. However, he goes further to suggest that grammar arose to support switching between different perspectives.
He reviews a broad range of the literature on perspective taking, in terms of both cognitive processes and grammar, to argue that perspective taking underlies language structure and much of higher-level cognition. After highlighting the frames of reference (egocentric, allocentric, geocentric and temporal) by which individuals cognize their
environments, MacWhinney explains how various linguistic constructions are influenced by the underlying perspective. For example, he presents a perspective-shifting account of constraints on the co-reference of pronouns that complements and, in some cases, obviates the need for the more complex innate principles hypothesized in theories such as Government and Binding. He also offers an explanation for the relative ease with which restrictive relative clauses of different types are comprehended in terms of the number of perspective shifts that must be undertaken: for example, in English, which has SVO word order, subject-subject relative clauses, which require no shift in perspective, are generally easier to comprehend than subject-object relative clauses, which require two shifts in perspective. MacWhinney outlines a set of tasks that should be undertaken in future research to understand better the role of perspective as a force shaping cognition and language processing. If its role can be well established, this would be an example of a simple cognitive factor leading to complex emergent structure in the surface form of language.
In the next chapter, Coupé and Hombert discuss arguments for and against the polygenesis of language and linguistic diversity. Rather than consider whether the human language faculty as a whole emerged at a single site or at multiple sites, they consider the relative likelihood of monogenetic and polygenetic emergence of distinct linguistic strategies akin to Hockett’s ‘design features’. They follow the approach adopted in a model by Freedman and Wang of using probabilistic arguments to investigate the likelihood of polygenetic emergence under various conditions, modeling a wide range of plausible demographic states of pre-historic hominid groups. However, they go beyond Freedman and Wang’s work by using simulation techniques to model the cultural transmission of linguistic strategies as a result of contact between groups.
The authors consider the role of a number of parameters, including population density, the rate at which groups traverse a designated hunting zone, and the threshold distance within which two groups must lie before they can come into contact with each
other. They observe strong relationships between the time taken for a strategy to diffuse across a population and both the threshold distance for contact and population density, as intuition would lead us to expect. For plausible estimates of the population density of hominid groups and other relevant parameters, the authors infer that the probability of complete diffusion of a strategy across a population, and so of monogenesis, is typically significantly lower than the probability of polygenesis. However, they do not rule out the possibility that some key strategies emerged monogenetically and were then transmitted culturally from group to group across the entire hominid population.
Part II treats language acquisition, with three papers describing approaches for modeling aspects of the acquisition process. In each essay, learning is seen to be achieved by distinguishing and encoding patterns of regularity. Christiansen, Conway and Curtin describe a connectionist model to suggest how children acquire language, in part, by integrating the information provided by multiple probabilistic cues present in caregiver speech. Their view is that not all linguistic structures can be acquired based on evidence from a single source; some structures require that evidence from multiple sources be integrated in order that acquisition take place. Even when certain cues have low validity, the use of multiple cues in combination can act to increase the robustness of the acquisition process. The authors focus on word segmentation, a necessary skill that a child must acquire in order to build up a compositional understanding of spoken utterances. They argue that word boundaries are discovered by the child from probabilistic sub-lexical cues of three types: phonology, lexical stress and utterance boundary information.
They present simulation results, training simple recurrent networks to segment utterances from the CHILDES corpus of child-directed speech, to demonstrate how multiple-cue integration in a connectionist network can perform robust speech segmentation. The comparison of their simulation results with empirical evidence for the word segmentation performance of both infants and adults supports their claim that humans do indeed
follow a probabilistic learning mechanism to integrate cues from multiple sources to learn to segment words. Christiansen et al. go on to reiterate their main thesis that multiple-cue integration facilitates language acquisition in general, and briefly review evidence supporting the existence of salient probabilistic cues for the learning of word meaning, grammatical class and syntactic structure.
Next, Kit examines the extent to which an unsupervised learning procedure can learn to perform word segmentation, adopting an alternative statistical learning procedure based on compression. This approach draws heavily upon concepts of complexity, a brief introduction to which is given by Gell-Mann in Part IV. In Kit’s model, regularities that are detected in the speech data allow the data to be compressed: the more the speech data can be compressed, he argues, the more closely the compressed representation reflects the mechanism by which the data was produced. Kit therefore assumes that the language learner seeks the representation for the input speech data that compresses the data the most. The learner is assumed to have no prior knowledge of the morphotactic constraints of the target language. Invoking the principle of least cost, measuring cost in terms of the number of bits required to represent the compressed data, he shows that an unsupervised learning procedure can perform accurate word segmentation on a corpus of child-directed speech, derived from the CHILDES corpus. In doing so, Kit’s work suggests that prior knowledge encoded within a built-in language acquisition device is not required in order for a child to learn to segment words from fluent speech.
Yang then looks at a different issue in the acquisition of language: the acquisition of irregular morphology. He derives a rule-based model to explain the widespread existence of irregularity in the lexicon, focusing on irregularity in verbs in the past tense of English: for example hold-held, which a child might say as hold-holded.
He argues against Pinker’s Words and Rules (WR) model, in which the past tense of all regular verbs is derived by applying the single rule ‘add -d’, whereas the past tense of each irregular verb is stored as an associated stem-past pair. Instead, he
favors a Rule over Words (RW) model, in which irregularities are encoded by over-riding default rules with more specific exception rules that operate on verb classes, e.g. throw-threw, know-knew. In particular, Yang points out that if Pinker’s WR model were correct, irregular verbs heard more frequently than others would be remembered and used correctly more often — he illustrates by example that this is not the case. In the RW framework, although specific verbs belonging to a particular class might be rare, instances of verbs belonging to that class might be sufficiently frequent for the class to be learned — Yang calls this the Free-rider Effect. Irregular verbs are recovered by applying the most specific rule when multiple competing rules are present, a principle he refers to as the Elsewhere Condition. This framework resembles the default hierarchy of a classifier system, discussed by Holland in Part IV. Yang presents a detailed procedure for the induction of phonological rules. In his model, a word is represented as a sequence of phonemes that are themselves represented distinctly in terms of a set of articulatory features. The rules encode context-specific phonological changes that are applied to verb stems in order to generate the corresponding past tense forms. For example, a default rule might specify that the past tense is formed by adding ‘-d’ to the stem, while an exception rule might specify that verbs ending ‘-ɪŋ’ form the past tense by changing ‘ɪ’ to ‘æ’, such as sing-sang. Rules are induced as a result of observing recurrent patterns among the articulatory features of strings of phonemes heard during acquisition. Rule competition gives rise to the empirical phenomena of analogical leveling and extension. By applying the model over successive generations of language learners, Yang is able to show how evolution of the stem-past system is influenced by such a rule-based learning procedure, inferring that irregularity is all but inevitable.
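The interplay between a default rule and more specific exception rules can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not Yang's model: it works on spelled verb stems rather than phonemes and articulatory features, it hard-codes the verb classes instead of inducing them, and the names `Rule`, `past_tense`, `OW_CLASS`, `ING_CLASS` and the numeric `specificity` field are hypothetical. The Elsewhere Condition is approximated by choosing the applicable rule with the highest specificity.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str
    applies: Callable[[str], bool]   # does the rule cover this stem?
    change: Callable[[str], str]     # stem -> past tense form
    specificity: int                 # higher means more specific (assumed scale)


def past_tense(stem: str, rules: List[Rule]) -> str:
    """Apply the most specific rule that covers the stem (toy Elsewhere Condition)."""
    candidates = [rule for rule in rules if rule.applies(stem)]
    best = max(candidates, key=lambda rule: rule.specificity)
    return best.change(stem)


# Hypothetical, hand-written verb classes; a learner would have to induce these.
OW_CLASS = {"throw", "know", "blow", "grow"}
ING_CLASS = {"sing", "ring"}

RULES = [
    # Default rule: covers every stem, spelled here as adding "ed".
    Rule("default", lambda s: True, lambda s: s + "ed", 0),
    # Exception rules: each covers one verb class and overrides the default.
    Rule("ow->ew", lambda s: s in OW_CLASS, lambda s: s[:-2] + "ew", 1),
    Rule("i->a", lambda s: s in ING_CLASS, lambda s: s.replace("i", "a", 1), 1),
]

for verb in ["walk", "throw", "sing", "hold"]:
    print(verb, "->", past_tense(verb, RULES))
```

Because no exception class in this toy contains hold, the default rule yields holded, the over-regularized form a child might produce; adding a class that covers hold would override the default for that verb.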
The three essays in Part III deal with language change. Van Driem begins by summarizing the Leiden theory of language evolution, which holds language to be a symbiotic organism. In this account, language change is driven by the self-replicating components of language: memes. He departs from Dawkins’s
account of the meme as a unit of cultural transmission or imitation, which van Driem calls a mime, reserving the term meme for a unit of linguistic meaning, whether lexical or grammatical. He goes on to discuss the nature of linguistic meaning, explaining that meanings do not behave according to the principles of conventional logic but as non-constructible sets. In his view, meanings evolve as a result of the applications to which they are put, constrained by the underlying structure of the neural circuitry of the brain. Van Driem also briefly summarizes the Leiden view of the emergence of syntax from meaning by splitting of holistic utterances. His views on the emergence of language coincide, in part, with those of Schoenemann: particularly that the emergence of syntax was driven by semantics. Ruhlen discusses language change from the perspective of historical linguistics, his aim being to clarify the different goals and methodologies of three distinct disciplines — taxonomy, typology and historical linguistics — which he argues have come to be confused by linguists during the last century. He begins by distinguishing taxonomy from historical linguistics, specifically reconstruction. He defines linguistic taxonomy as the identification of the hierarchical structure of languages and their families. This, he writes, should precede historical linguistics, which comprises tasks such as reconstruction of the proto-language, the identification of sound correspondences, and the like. He stresses that reconstruction is the task of inferring the proto-language of a family that has already been identified by taxonomy. Ruhlen equates taxonomy with classification and, controversially, multilateral comparison. 
While he admits that the existence of language families such as Indo-European, Algonquian and Austronesian can be verified rigorously by establishing regular sound correspondences among sibling languages, he maintains that these families were first recognized by linguists who observed the grammatical and lexical morphemes that characterize each essentially by the method of multilateral comparison that both he and Greenberg have advocated. Ruhlen suggests that the recent failure to distinguish these disciplines has led to a stagnation in the discovery of new
genetic relationships among languages and language families. Ruhlen then turns to the issue of using word lists to identify genetic relationship. He discusses in particular the pattern tVnV, having meanings such as ‘child’, ‘son’ or ‘daughter’, which he finds to be widespread among the Amerind languages of the Americas. Accepting that systematic sound correspondences for these putative cognates might be hard to come by, he maintains that the most parsimonious explanation for the pattern is genetic affiliation. Ruhlen distinguishes typological classification and genetic classification: the former, he comments, is based on “historically-independent structural traits” and should have no place in the identification of genetic relationships, while the latter is based on “historically-related genetic traits”. He also discusses in some detail the controversy regarding the use of 1st and 2nd person pronouns alone to detect genetic relationship, focusing on the Eurasian M/T pattern and the Amerind N/M pattern.
Cucker, Smale and Zhou close Part III with a presentation of a formal mathematical model for language evolution. For a population of speakers sharing a common language, the idiolect of each speaker is treated as a matrix whose elements indicate the probability of association between a particular meaning and a particular signal. They further define a communication matrix that models the influence of each speaker on the acquisition of language by other speakers. Speakers are assumed to interact iteratively by exchanging meaning-signal pairs with each other and updating their languages so as to improve the probability of successful communication. Their main result is to prove formally that a common language will emerge within a finite number of iterations with non-zero probability provided that speakers exchange sufficiently many meaning-signal pairs and that the communication matrix exhibits the mathematical property of weak irreducibility.
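The flavor of this convergence result can be illustrated with a deliberately simplified, deterministic stand-in for the model. The sketch below is an assumption-laden toy, not the authors' learning dynamics: instead of sampling meaning-signal pairs, each speaker's meaning-signal matrix is simply replaced by an average of all speakers' matrices, weighted by a row of a hypothetical communication matrix `comm`. The function names `average_step` and `spread`, the three example idiolects and the particular `comm` are all invented for illustration, and the requirement that every speaker influence every other, directly or indirectly, stands in loosely for the irreducibility condition.

```python
from typing import List

Matrix = List[List[float]]


def average_step(languages: List[Matrix], comm: Matrix) -> List[Matrix]:
    """Replace each speaker's matrix by a comm-weighted average of all matrices."""
    n = len(languages)
    n_meanings = len(languages[0])
    n_signals = len(languages[0][0])
    return [
        [
            [sum(comm[i][j] * languages[j][m][s] for j in range(n))
             for s in range(n_signals)]
            for m in range(n_meanings)
        ]
        for i in range(n)
    ]


def spread(languages: List[Matrix]) -> float:
    """Largest entry-wise gap between any two speakers' matrices."""
    gaps = [
        abs(a[m][s] - b[m][s])
        for a in languages
        for b in languages
        for m in range(len(a))
        for s in range(len(a[0]))
    ]
    return max(gaps)


# Three hypothetical speakers, two meanings (rows) and two signals (columns).
languages = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.1, 0.9], [0.7, 0.3]],
]

# Assumed communication matrix: rows sum to 1, influence flows around a cycle.
comm = [
    [0.6, 0.4, 0.0],
    [0.0, 0.6, 0.4],
    [0.4, 0.0, 0.6],
]

initial_spread = spread(languages)
for _ in range(200):
    languages = average_step(languages, comm)
final_spread = spread(languages)

print("spread before:", initial_spread)
print("spread after: ", final_spread)
```

In this toy the idiolects start far apart and end up numerically indistinguishable, while each row of every matrix remains a probability distribution over signals. Whether and how fast the sampled, stochastic process converges is exactly what the formal result addresses; the averaging shortcut here says nothing about that.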
They also prove that once a common language has emerged, a population will continue to maintain a common language — although language change might temporarily introduce a degree of heterogeneity among the idiolects of a population of agents, a globally
homogeneous language will eventually re-emerge. The authors also show how the model can be applied to the modeling of both language emergence and acquisition. Studying the behavior of this model as the number of exchanges between speaker and learner is reduced, thereby introducing a learning bottleneck, would be a useful next step that might yield new insights into language acquisition and change.
The final part of the book is concerned with the complexity of language. Gell-Mann opens this part with a discussion of the fundamental concepts of complexity, both in theory and as they pertain to language. He begins by summarizing several notions of complexity from a theoretical perspective. He then discusses arrows of time, or unidirectional processes, by which is meant the tendency of complex systems to evolve in certain directions or manners. In particular, Gell-Mann considers whether there are arrows of time in human language that might allow some of the features of a hypothesized universally shared ancestral language to be deduced. After reviewing the proposed language families of the world, including a discussion of the super-families that have been hypothesized by “lumpers” such as Greenberg and Ruhlen (see the chapter by Ruhlen in Part III for a discussion of evidence supporting the existence of the Amerind language family), Gell-Mann discusses the possibility that word order is just such a feature, believing the word order of early human language to be SOV, the most prevalent word order of extant languages. Gell-Mann then considers the relative complexity and simplicity of extant languages in terms of their typological features. For example, he questions whether there might be a general tendency towards reduced phonological complexity: reduction in the size of the phonological inventory, delaryngealization, and the loss of clicks. On the other hand, he notes, processes such as palatalization give rise to increased complexity.
With a number of unidirectional changes already known, for example the sound change /p/ to /f/ to /h/ to nothing, Gell-Mann suggests that there might be many more such unidirectional processes that are less obvious. Finding such arrows of time and determining the relative
rates of simplification and complexification may tell us a great deal about how modern language first arose, whether by monogenesis or by polygenesis, and how language subsequently evolved. Computational models, such as that proposed by Coupé and Hombert in Part I, may help us to better assess how the numerous processes of simplification and complexification fit together into a coherent whole. Holland then describes an agent-based model for investigating the acquisition of language by individuals possessing domain-general cognitive mechanisms but no language-specific learning strategies. Holland makes use of the classifier system, a rule-based learning framework that he himself developed for investigating complex adaptive systems in general. After giving a conceptual introduction to complex adaptive systems, Holland describes a classifier system that he proposes be used to model processes of social interaction and language acquisition by individuals in a community. In the model, agents representing the language users move about a shared two-dimensional environment in which are distributed the resources, such as food and shelter, that enable the agents to “survive”. The agents interact with this environment and each other by transmitting, receiving and acting upon messages. The messages comprise the sensory information received from the environment, the internal “cognition” of the agent, and the commands output to effectors by which agents interact with the environment. The survival of an agent is determined by its ability to acquire resources through appropriate action sequences. The agents are initialized with very general, default rules that provide general functionality, such as movement and resting, but which sometimes lead to inappropriate action. When a default rule has a tendency not to fulfill an agent’s designated needs and goals, an exception to it may occasionally be triggered. 
The resulting pair of rules, in which the initial, default rules are sometimes over-ridden by novel, exception rules, gives rise to behavior that is more efficient than either rule alone. As more and more exception rules are generated, forming a default hierarchy, increasingly complex
behavior is encoded that allows the agent to behave efficiently over a wider range of situations — Yang makes use of much the same concept in his modeling of the memorization of irregular verbs as exceptions to a default rule. In this framework, language is seen as an emergent behavior that results from the association of auditory input and output with fitness-enhancing behavior. As such, language is treated no differently from any other phenomenon — its emergence relies on efficient marshalling of domain-general capabilities. By adopting the classifier system as a framework for simulating language emergence and acquisition, we should be able to better understand how various cultural processes impact upon the robust acquisition of language, and the extent to which domain-general cognitive capabilities can generate artificial languages that incorporate some of the complexities of natural languages.
The next chapter, by Lin and Ahrens, focuses on lexical ambiguity, one aspect of complexity in natural language. The authors seek a definition for word meaning that is psychologically sound. They contrast three methods for defining word meaning: meanings listed in dictionaries, meanings provided by human subjects, and meanings constrained by a linguistic theory. Dictionary meanings are often used because they are easy to obtain. However, differences in the meanings that are presented in different dictionaries make it difficult for researchers to agree upon a particular set of meanings for a word. Furthermore, dictionary meanings do not keep close track of the productive language usage of current speakers: they include obsolete or rarely used meanings and lack novel ones. An alternative approach is to use semantic intuition, by sampling the accessible polysemy — the number of different meanings that subjects are able to think of for a word — of many people.
This method, however, can fail to reveal some meanings of a word, and obscures the criteria by which word meanings are distinguished. A third approach, advocated here by Lin and Ahrens, makes the criteria explicit by identifying and distinguishing senses of meanings from facets of meanings. As the authors write, “two meanings are distinct senses, when they involve
different conceptual domains, and when they occur primarily in distinct linguistic contexts.” This definition of word sense can be applied to both dictionary meanings and to the semantic intuition of subjects. Conducting experiments for both Chinese and English, they find a significant correlation between the numbers of meanings found by each method. Drawing upon their previous research into the ambiguity advantage, in which they found that words having many linguistic senses were recognized more quickly than words having fewer senses, they conclude that the sense-based definition of word meaning is psychologically sound.
LaPolla discusses the complexification of linguistic systems, arguing that the complexity of a language must be considered in terms of its sub-systems, a view shared by Gell-Mann and Comrie. Complexification of one system may lead, through extension, to simplification of another system. As an example, he cites the Qiang language of Northern Sichuan Province, China, in which a conventionalized set of orientation marking prefixes related to geophysical environment have been extended metaphorically to mark perfectives and imperatives. The increased complexity in the system for orientation marking has brought about increased simplicity for the marking of perfectives and imperatives because no separate set of markers need be developed. One important issue that LaPolla stresses is “complex for whom?” For example, Chinese orthography may be read from left to right, from right to left, from top to bottom, and even from left to right and from right to left — in other words, in just about any direction. Although this causes no added complexity for the writer, the reader’s task is made more complicated because there is no standard direction of reading. On the other hand, with a standardized word order, the job of the writer is more complex, but the task of the reader is simplified by the constraint to the inferential process.
While some linguists argue that “languages differ in terms of what you can say,” LaPolla prefers the position that “languages differ in terms of what you have to say.” For example, English
requires explicit mention of the subject of a sentence due to grammaticalization of a set of obligatory constraints on referent identification that have come to be associated with ‘subject’. Chinese, however, has not conventionalized these same constraints on referent identification, so the identification of the referent is not obligatory. Such conventions force particular interpretations of sentences, constraining what a language must say.
Comrie closes the volume by considering whether creoles tend to be less complex than other languages, as is widely held to be true by linguists. He begins by discussing the difficulty that linguists have encountered in defining exactly what is a creole. Like pidgins, creoles arise as a result of contact between speakers having insufficient exposure to each other’s languages to acquire them perfectly. However, creoles are distinguished from pidgins, which are the first language of no one. Historically, for creoles that have formed due to contact between speakers of a European language and one or more non-European languages, the European language has typically provided the majority of the lexicon. The creole grammar is typically very different from that of the lexifier language. However, the longer the period of contact between them, the greater the number of complexities of the lexifier language that tend to be found in the creole. Comrie agrees with LaPolla that it is not meaningful to consider the complexity of a language as a whole; rather, one should consider the complexity of a particular subsystem. Here, Comrie focuses on the complexity of the morphology of creoles. He distinguishes three types of morphological complexity. First, a language may exhibit agglutination, in which multiple affixes are attached to a single root, as in Turkish. Although there is evidence that such complexity poses no problem for first language learners, borrowing of this feature does not occur.
Second, a language may exhibit fusional morphology, in which multiple semantic oppositions are fused into a single morpheme. This can be observed in Italian, for example, where no separate adjectival suffixes encoding number and gender can be identified. Such fusional morphology is very rare in creole languages. Third, a language may exhibit morphological irregularity.
For example, German specifies several mechanisms for formation of the plural. Creole languages often lack such inflectional irregularity. Nevertheless, Comrie reminds us that creoles are “sufficiently complex to carry the full range of functions that are required of human language.” Comrie also reviews three particular accounts of the genesis of creole grammars: McWhorter’s view that creole grammars emerge as a result of universal principles, with little input from either the lexifier or substrate language; Bickerton’s Bioprogram Hypothesis, in which an innately specified, unmarked grammar tends to be acquired in the absence of consistent evidence for any particular marked target grammar; and Lefebvre’s analysis of Haitian Creole, which emphasizes the role of relexification. The essays included in this book represent a mosaic of current research into the evolution of language. The essays by Schoenemann and MacWhinney on language emergence, by Christiansen et al., Kit and Yang on language acquisition, and by Gell-Mann, Holland and Comrie on language complexity, for example, all present arguments for the shaping of language by relatively simple, cognitive processes, such as perspective taking, cue integration and the learning of default hierarchies. Some of these hypotheses make apparently contradictory claims about the underlying processes of language evolution. This is most evident in Part II, where compression-based processing, connectionism and rule-based processing are all invoked to explain features of language acquisition. But this diversity of approaches is a necessary step in expanding our understanding of the processes by which language evolves — without comparing the explanatory power of the various competing hypotheses, how is one to evaluate the extent to which they are each valid? 
As research into the structure of the brain advances, it is to be hoped that the biophysical correlates of the cognitive capabilities adduced in such hypotheses can be either verified or disproved. A further challenge for a holistic account of language evolution is that the various theories should meld into a consistent whole. Thus, the theories of the phylogenetic emergence of language should reflect established patterns of ontogenetic development (a case of ontogeny
recapitulating phylogeny) and historical change or else be explained in terms of advances in genetic, cognitive and social adaptations in the human lineage.
Part 1 Language Emergence
2 Speech and Language — A Human Trait Defined by Molecular Genetics
King L. Chow
Hong Kong University of Science and Technology
1. Language as a Unique Human Trait
Do human beings have a monopoly on the capability of using language and developing speech? The sounds made by Kanzi, an adult pygmy chimpanzee, may have caught us by surprise and posed the latest challenge to the idea that animals do not have language. Through the study of hours of videotape showing his daily interactions and the sounds he made on various occasions, researchers found that he is able to make consistent sounds with specific meaning attached. In addition, it was found that this pygmy chimpanzee could comprehend spoken English in lexical and grammatical detail comparable to that of a 2½-year-old human toddler. While the language capabilities of human toddlers continue to increase dramatically, Kanzi reached a capacity of 400–500 words, stopped learning new words, and stopped making progress in grammatical comprehension. The observations from various primate studies add to the growing body of evidence that language skills did not just show up suddenly in humans. Non-human primates do have an ability that
could be described as the use of primitive language. They indicate to us the abilities in language that early human beings once had. Language comprehension in the chimpanzee, however, stops short of advancement. The study of human language using animal models has therefore been of only limited use. The observations of primates, however, have hinted at the notion that the use of language may be an inheritable trait established over time in the primate lineage, with the complexity and sophistication of usage only recently evolved in the human branch.
The origin of language and the process of language acquisition have been puzzles for decades. How can children rapidly learn the grammatical structure of their native language without overt instruction or without explicit awareness of the underlying rules? Chomsky noted that a simple associative mechanism that detected statistical regularity between word sequences would be unable to extract the recursive structure of language (Chomsky, 1988). He put forward the idea that word meanings cannot be used to deduce grammatical rules, because we are able to distinguish grammatical from ungrammatical sequences, even when these are completely meaningless. Since non-human primates have difficulty acquiring grammatical language and there appear to be specialized regions in the left hemisphere of the brain controlling speech capability, he further argued that there exists an innate language faculty, implicitly implicating the unique anatomical organization of the human brain harnessed by the evolution of genetic components dictating speech function.
While Chomsky’s arguments remain influential, they are not accepted without challenge. Psychologists have questioned whether young children really do know abstract grammatical rules. Can language learning be simply rote learning of associations between word strings and meanings with the grammatical regularities recognized only at a late stage in the learning process?
The hardwiring of brain functions for speech may not be so different from that of reading, a capability established only very recently in human history. Yet, it is difficult to imagine how these cognitive abilities could arise through natural selection so rapidly. These
abilities may actually have nothing to do with the innate pre-specification of brain areas for specific functions and may represent examples of the cognitive plasticity of acquired skills. As implicitly stated in all these arguments, such cognitive plasticity has to be aligned with genetic influences.
2. Further Implication of Genes in the Acquisition of Language
Given the interaction between genetic components and environmental influences on the exhibition of animal behavior, no approach satisfies a geneticist better than understanding the underlying mechanism of a biological function through the identification of mutants and their subsequent phenotypic characterization. For a human geneticist, both qualitative and quantitative documentation of symptoms and various physical diagnoses of a human trait, an ability, or a behavior would represent a comprehensive phenotype profiling process of paramount importance. Accurate diagnostic description of features can help to determine the correct association of a specific phenotype with the inherited genetic composition. In the human population, about 2–5% of children who are otherwise normal have significant difficulties in acquiring expressive or receptive language, despite adequate intelligence and opportunity (Bishop et al., 1995). Although these reported cases often fall into different classes of language disorders, twin studies provide strong evidence of genetic influences in developmental disorders of speech and language, although environmental input cannot be totally excluded. There are cases where specific language impairment has been found to run in families. Nevertheless, most pedigree analyses were not consistent with a single defective gene, making the tracking of the genetic defect difficult, if not impossible. Interestingly, in these studies, there is strong evidence that the language impairment of the patients is influenced by the genetic inheritance as monitored
by non-word repetition tests, while there is no good correlation of genetic impact with the auditory processing ability (Bishop, 2002). The results imply that, at least, there is a clear genetic influence on the expressive subtypes of language impairment. In addition to the twin studies, there have been sporadic reports on language impairments in patients. These cases were largely caused by rare mutant alleles in the genes directly or indirectly responsible for the language function. Without a clear focus on a genomic region for analysis, the genetic lesions in these patients and the culprit locus of the defect could hardly be traced. In order to home in on one or a few genetic loci that can account for deficits in language ability, human geneticists are desperate to identify a good sample of patients for which data from individuals of multiple generations can be obtained.
3. The “KE” Family
In 1990, Hurst et al. reported a rare case of a large extended family in London (known as the KE family) in which a speech and language disorder runs in the family. The affected members of this family do possess a form of language. Their principal defect seems to lie in a lack of fine control over the muscles of the throat and mouth needed for rapid speech. In tests, affected members of the family find answering questions in writing as difficult as answering them verbally, suggesting that the defective gene, if there is one, causes a conceptual problem as well as a muscular control problem. Based on the pedigree analysis, the trait is clearly inherited as an autosomal dominant, monogenic trait with full penetrance (Figure 1). This KE family was referred for genetic counseling by the director of a school for children with speech and language problems, at which many of the KE family members had been pupils. The condition was characterized as developmental verbal dyspraxia (Hurst et al., 1990). Some reports have suggested that the core deficit of the disorder in this family is an inability to use grammatical suffixation rules,
Figure 1 Pedigree of the KE family. Affected individuals are indicated by filled symbols. Asterisks indicate those individuals who were unavailable for genetic analysis. Squares are males. Circles are females. A line through a symbol indicates that the person is deceased. (Adapted from Lai et al., 2001)
Figure 2 (a) Word and non-word repetition in the KE family. Bars indicate mean percent correct for the groups of affected (filled bar) and unaffected (open bar) family members. (b) Simultaneous and sequential orofacial movement. Bars indicate the mean percent correct for the group of affected family members (filled bar) and the normal control group (open bar). (Adapted from Vargha-Khadem et al., 1998)
such as those for tense, number, and gender, leading the genetic locus to be dubbed the “grammar gene” (Gopnik 1990; Gopnik and Crago, 1991). Other analyses have shown that the phenotype is not as selective and is characterized by difficulties with many aspects of grammar and expressive language (Hurst et al., 1990; Vargha-Khadem et al., 1995, 1998) (Figure 2). On nearly every test used to assess an aspect of their speech and language function and orofacial praxis, the mean scores of the affected members fell significantly below those of the unaffected members. Furthermore, the phenotype involves grossly defective articulation, such that the speech of affected individuals is largely incomprehensible to the naïve listener (Hurst et al., 1990; Vargha-Khadem et al., 1995). Also, there is evidence of moderate non-verbal impairment in some affected family members. They have trouble identifying basic speech sounds, understanding sentences, making grammatical judgments, and exercising other language skills (Vargha-Khadem et al., 1995). The affected members have a broad phenotype with impaired generation of syntactical rules, striking articulatory impairment as well as cognitive impairment with more profound deficits in the verbal domain, i.e., the linguistic and orofacial praxic functions generally.
4. Is There a Hardwired Structure for Language?
Developing a full understanding of the neural basis and associated cognitive/linguistic deficits in the KE family is clearly important, particularly with the hope of future gene-based diagnosis or therapy in mind. Brain-imaging studies with positron emission tomography (PET) of affected individuals from this family revealed functional and structural abnormalities in both cortical and subcortical motor related areas of the frontal lobe, particularly the basal ganglia (Vargha-Khadem et al., 1998). Quantitative analysis of magnetic resonance imaging (MRI) scans revealed structural abnormalities in several of these same areas. Mapping and comparison of the gray matter in affected and unaffected individuals showed several regions
where the affected group had either significantly more or significantly less gray matter than the unaffected group. Among the regions with more gray matter, the lentiform nucleus is a motor-related structure. Among the regions in which affected members had less gray matter were the cingulate cortex, Broca’s area and the caudate nucleus, all of which are motor-related structures implicated in speech capability. The caudate nucleus, particularly, was abnormally small bilaterally when compared with that of the unaffected group. These findings suggest that the central aspect of the disorder may be the disruption of selection and sequencing of fine orofacial movements, leading to the deficits in the development of language skills.
More recently, using an automated technique of voxel-based morphometry (VBM) supplemented by targeted manual volumetry in affected and unaffected members of the KE family, and a group of age-matched controls, Watkins and colleagues demonstrated clear abnormalities in the affected family members that were not present in behaviorally normal members of the family (Watkins et al., 2002a). However, the difference was not a simple matter of reduced cortical volume, as some regions were larger than normal. While the caudate nucleus and inferior frontal gyrus were reduced in size bilaterally, the left frontal opercular region and the putamen had a greater volume of gray matter bilaterally. Recent data have shown that in some instances, such as in stuttering individuals, the increase in gray matter volume is correlated with the severity of the abnormality. Stutterers often have an increase in cortical volume in two main speech areas (Foundas et al., 2001). One possible explanation for a bigger cortex in developmental disorders is a lack of the programmed cell death that occurs in a normally developing brain, a process thought to enhance cortex specialization and ensure appropriate cellular connections (Seldon, 1981).
It remains possible that the increase in cortical gyral volume in certain brain regions reflects the absence of an important developmental process, resulting in a lack of the form and structure crucial for cortical function. Watkins et al. also highlight the importance of subcortical structures, particularly the caudate nucleus and putamen, in language
Language Acquisition, Change and Emergence
development. Interestingly, the dorsal part of the caudate appeared to be involved in the language ability of the KE family, and its volume was significantly correlated, in a complex pattern, with the performance of KE family members in oral praxis and non-word repetition. The greater the reduction on the left, the poorer the performance on the test. Greater reduction on the right was correlated with better performance on a test of non-word repetition. Moreover, in a companion study, a widespread deficit in all aspects of speech and language, as well as in aspects of non-verbal intelligence, was observed in the affected members of the KE family (Watkins et al., 2002b). Longitudinal test scores available for a subset of the younger affected individuals also showed a progressive decline in IQ scores. These findings suggest that the speech disorder could have detrimental effects on various components of non-verbal intelligence, such as lexical development and articulation of common words.
5. Mapping of the Language Gene
Based on the characterization of the phenotypes of individuals in the KE family and the assertion of a simple genetic basis, Fisher and colleagues initiated a genome-wide search for linkage between genomic regions and this speech disorder (Fisher et al., 1998). Using a fluorescence-based genotyping approach with microsatellite markers spaced evenly throughout the genome, linkage was established for markers on the long arm of chromosome 7, in a well-characterized region to which the cystic fibrosis gene (CFTR) also maps. Co-segregation of the disorder with two closely linked markers, D7S486 and the CFTR locus, gave maximal LOD scores of 6.2 and 5.5, respectively, with no recombination occurring. With the threshold for declaring linkage set at 3, these high LOD scores confidently placed the speech and language disorder locus in this long-arm region. The locus, then designated SPCH1, was further mapped with available microsatellite markers to a 5.6 centiMorgan interval in 7q31 flanked by two
polymorphic markers, D7S2459 and D7S643. With this region well covered by existing large genomic DNA fragments in YAC (yeast artificial chromosome), BAC (bacterial artificial chromosome) and PAC (P1 phage-derived artificial chromosome) clones, with ESTs (expressed sequence tags) available, and with the region being actively sequenced at the time, potential genes associated with language ability were sought using various bioinformatic approaches. A transcript map of the critical interval, defining all the potential transcription units (genes) in this genomic region of about 8 megabases (8 × 10⁶ nucleotide base pairs), was also constructed. At the same time, this daunting task was made simpler by the identification of two new patients, CS and BRD (Lai et al., 2000).
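The LOD scores quoted for the linkage analysis can be made concrete with a short calculation. The sketch below assumes the simplest textbook case, in which every meiosis is fully informative and phase-known, so that the LOD score is the base-10 logarithm of the likelihood at a recombination fraction theta divided by the likelihood under free recombination (theta = 0.5). The function name and the example counts are illustrative assumptions; they are not the pedigree data or the software used by Fisher et al. (1998).

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """Two-point LOD score for phase-known, fully informative meioses.

    LOD(theta) = log10(L(theta) / L(0.5)), where
    L(theta) = theta**recombinants * (1 - theta)**(meioses - recombinants).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    non_recombinants = meioses - recombinants
    likelihood_linked = (theta ** recombinants) * ((1.0 - theta) ** non_recombinants)
    likelihood_unlinked = 0.5 ** meioses
    if likelihood_linked == 0.0:
        # Observed recombinants are impossible at theta = 0.
        return float("-inf")
    return math.log10(likelihood_linked / likelihood_unlinked)


# Hypothetical example: 20 informative meioses, none of them recombinant.
print(round(lod_score(0, 20, 0.0), 2))
```

Under these assumptions each non-recombinant meiosis contributes log10(2), or about 0.3, to the score at theta = 0, so roughly ten such meioses are needed to reach the conventional threshold of 3 (odds of 1000 to 1 in favor of linkage).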
6. The Breakpoint in CS and BRD
While the investigation of the KE family members has been central to the discovery of the innate aspects and molecular genetics of the language ability, the process was accelerated by the identification of two new unrelated patients. Lai et al. (2000) reported an independent case of a 5½-year-old boy (CS) with language impairments and verbal dyspraxia. This boy has a de novo balanced reciprocal translocation, t(5;7)(q22; q31) (segments of chromosome 5 and chromosome 7 were reciprocally exchanged, with the junction at position q22 of chromosome 5 and position q31 of chromosome 7). The patient was referred to the genetics team because of concerns about his delayed speech development and mild motor impairment. Subsequent assessment placed the patient in the mildly delayed range. While his non-verbal skills were in the normal range, both understanding and expression of speech were impaired. He had oral dyspraxia. A second case was an 8-year-old boy (BRD) with a history of receptive and expressive language problems, accompanied by behavioral difficulties and low-range intellectual abilities, despite normal physical motor development. While an MRI scan of this
patient revealed a small dysembryoplastic neuroepithelial tumor in his right temporal lobe, cytogenetic analysis uncovered yet another de novo balanced reciprocal translocation, t(2;7)(p23; q31.3) (Warburton et al., 2000). The identification of these two patients was instrumental in the cloning of the language gene. Because a balanced reciprocal translocation retains all the chromosomal loci except the one residing at the junction of the translocation, these two individuals survive with no major developmental defects. Depending on the position of the breakpoint (the junction), a gene there may have been disrupted, while the biological function of the other loci is not reduced. Therefore, the finding that both patients have language deficits corroborated the mapping data from the KE family, i.e., that the language gene, SPCH1, is on 7q31. In addition, the genomic DNA of these two patients would provide important reagents for mapping the precise breakpoint where the SPCH1 locus resides.
7. Positional Cloning of the FOXP2
With the confirmation of the genetic locus at around 7q31, active construction of the BAC/PAC-based sequence map spanning the SPCH1 interval of 7q31 proceeded, and novel polymorphic markers in the 7q31 region were generated. The translocation breakpoints of CS and BRD were mapped by two-color FISH (fluorescence in situ hybridization), using the BAC clones as probes on metaphase chromosome spreads of cells from the two patients. A gene defined by an uncharacterized transcript, CAGHH44, encoding a polyglutamine stretch in the coding region, maps to the breakpoint region in CS. It encodes a protein specifically expressed in the brain, which makes it a likely candidate for the SPCH1 locus (Lai et al., 2000). When the full-length cDNA of this CAGHH44 was isolated, it was found to encode a 2.5 kb transcript spanning 17 exons and containing a complete open reading frame of 2.1 kb. The encoded product is around 677 amino acids (Figure 3). The carboxy-terminal
Figure 3 The FOXP2 locus: (a) the exon organization and (b) alternative spliced transcripts. (Adapted from Lai et al., 2001)
portion of the predicted protein sequence contains a segment of 84 amino acids that shows high similarity to the characteristic DNA-binding domain of the forkhead/winged-helix (FOX) family of transcription factors. Based on the FOX protein classification, this protein most closely resembles the P subclass, and the gene was designated FOXP2. Transcripts of FOXP2 were detected in several adult and fetal human tissues, with particularly strong expression in the brain. A similar study of the expression of the ortholog of this gene in mouse embryos revealed specific expression in the central nervous system, including the
neopallial cortex and the developing cerebral hemisphere (Shu et al., 2001). FISH analysis showed that CS has a translocation breakpoint of chromosome 7 in a 200 bp region in the intron between exons 3b and 4. The disruption of FOXP2 is thus implicated in the etiology of the language disorder in this patient. Subsequently, a specific point mutation present only in the affected individuals of the KE family was defined. The mutation, which was absent in the unaffected individuals, results in an arginine-to-histidine substitution at amino acid residue 553 in the forkhead DNA-binding domain of FOXP2, possibly affecting the contact of this domain with the regulatory DNA sequences of the target genes (Lai et al., 2001). The dominant inheritance of the mutant allele in the KE family and the balanced reciprocal translocations in CS and BRD suggest that two functional copies of the FOXP2 gene are needed for the acquisition of normal spoken language. The disruption of the FOXP2 locus in these cases points to haploinsufficiency of this gene at key stages of neural development, affecting the establishment of the structures important for speech and language. Whether the brain circuitry is impaired directly by the reduction in the function of this gene, or indirectly through an embryonic developmental defect, remains to be established.
8. Does FOXP2 Offer Insight into Human Evolution?
Study of the genomes of humans and chimpanzees often yields deep insights into the origins of various human traits. Applying the same principle to the study of language, one of the most distinctive human attributes, a critical step in human evolution and a prerequisite for the development of human culture, inevitably invites scientific and social interest. The human ability to develop and articulate speech relies on capabilities, such as the fine control of the larynx and mouth, that are absent in chimpanzees and other great apes (Lieberman, 1984). The neural abnormality in the affected individuals with a defective FOXP2 gene and its neural-specific
expression profile apparently gave good evidence supporting such a notion. Interestingly, disregarding the region of the protein outside the forkhead domain, the mouse FOXP2 protein differs from the human protein at only three amino acid positions, putting this gene among the 5% most-conserved proteins in mammalian genomes. It would be of interest to know how this protein is linked to the establishment of structures in the brain and to the evolution of language capability. By sequencing the complementary DNAs that encode the FOXP2 orthologous proteins in chimpanzee, gorilla, orangutan, rhesus monkey and mouse, with the primary focus on the forkhead domain where the amino acid change was observed in the affected individuals in the KE family, it was found that this gene shows some variation within the human population and some differences among the species examined (Enard et al., 2002; Zhang et al., 2002). Whereas the chimpanzee and mouse FOXP2 protein structures were essentially identical, the orangutan FOXP2 showed a minor change in the secondary structure, with only one amino acid replaced. Thus, the FOXP2 gene remained largely unaltered during the evolution of mammals, but changed in humans after the hominid lineage split off from the chimpanzee line of descent (Figure 4). The two human-specific changes occurred at amino acid positions 303 and 325. The Thr to Asn change at residue 303 distinguishes the human protein from the orthologs of all the other mammalian taxa examined and generates a slight alteration in the shape of the protein. The Ser residue at position 325 creates a potential target site for phosphorylation by protein kinase C, together with a minor change in the predicted secondary structure. Indeed, a number of studies have indicated that phosphorylation of forkhead transcription factors can be an important mechanism mediating transcriptional regulation.
This second change potentially gives the protein a new role in the signaling circuitry of human cells. While these postulates on the protein function are based on the structural prediction, the impact of these human-specific changes in the FOXP2 protein remains to be substantiated biochemically in developmental models. In addition,
despite a pattern of nucleotide polymorphism found among a number of human samples from different populations, the human FOXP2 protein displays essentially no change, except in one case with two glutamine residues inserted into the polyglutamine stretch. Such an insertion, however, is unlikely to generate significant changes in protein function, as illustrated in a previous study of this domain (Enard et al., 2002; Zhang et al., 2002).
Figure 4 Silent and replacement nucleotide substitutions mapped on a phylogeny of primates. Bars represent nucleotide changes, whereas gray bars indicate amino acid changes. (Adapted from Enard et al., 2002)
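The kind of comparison described above, in which aligned orthologous protein sequences are scanned for amino acid replacements and the replacements are then checked for new candidate phosphorylation sites, can be sketched in a few lines. The two peptides below are invented toy sequences, not FOXP2 sequence, and the pattern used (Ser or Thr, any residue, then Arg or Lys) is a simplified consensus commonly quoted for protein kinase C sites; a match is only a prediction and says nothing about whether the site is used in vivo. The function names are hypothetical.

```python
import re


def substitutions(reference: str, other: str, start: int = 1):
    """Return (position, reference_residue, other_residue) for each difference.

    Both sequences must already be aligned to the same length.
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (start + i, a, b)
        for i, (a, b) in enumerate(zip(reference, other))
        if a != b
    ]


# Simplified PKC consensus: Ser/Thr, any residue, then Arg/Lys.
# The lookahead allows overlapping candidate sites to be reported.
PKC_SITE = re.compile(r"(?=[ST].[RK])")


def pkc_sites(sequence: str, start: int = 1):
    """Return the positions of Ser/Thr residues that match the simplified pattern."""
    return [start + match.start() for match in PKC_SITE.finditer(sequence)]


# Hypothetical toy peptides (NOT real FOXP2 sequence).
ancestral = "GQTANLRAE"
derived = "GQNASLRAE"

print(substitutions(ancestral, derived))  # two replacements
print(pkc_sites(ancestral))               # no candidate site
print(pkc_sites(derived))                 # one candidate site created
```

In this toy case one of the two replacements introduces a serine that completes the pattern, which mirrors the argument in the text that a single amino acid change can create a potential regulatory site.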
These comparative studies have shown that during the roughly 130 million years of separation between primates and rodents, a single amino acid change occurred in the FOXP2 protein. Two fixed amino acid changes occurred in the human lineage after it was separated from the chimpanzee 4.6–6.2 million years ago whereas no change occurred in the other primate lineages, suggesting a strong positive selection of amino acid changes in the human lineage. The results also imply that FOXP2 has been the target of selection during recent human evolution with the human version of the FOXP2 gene probably fixed less than 120,000 years ago (Enard et al., 2002). This date fits with the theory that accounts for the sudden appearance of novel behaviors in human ancestors 50,000 years ago, including art, ornamentation and long-distance trade.
Human remains from this period are physically indistinguishable from those of 100,000 years ago. Therefore, some genetically based cognitive change must have prompted the new behaviors. One with sufficient magnitude would then be the acquisition of language.
9. What is the Role of FOXP2?
While the association of FOXP2 with language ability helps us locate the molecular control mechanism of this trait, the potential complexity has been underemphasized. In most recent reports, FOXP2 has often been cited as a transcription factor with a forkhead domain (Lai et al., 2001; Zhang et al., 2002). Transcription factors are protein molecules that function in the nucleus via binding with regulatory DNA sequences of their target genes. They facilitate, and sometimes restrict, the association of components in the general transcriptional machinery to activate efficient transcription of the target genes. Without activation of the target genes, the corresponding encoded protein products cannot be made and the biological activities will be silenced. It is easy to recognize that the FOXP2 protein is a regulatory molecule by itself and has no direct involvement in the hardwiring of the anatomical structure in the nervous system. Instead, it acts through the control of other genes that encode structural components of the neural anatomy, through proteins that facilitate cellular association when the anatomy is assembled, as well as through proteins that enhance cellular communication (Figure 5). Indirectly, transcription factors and the products of their target genes may act through a cascade of regulatory steps to modulate the anatomical architecture of the brain and the functional integration of neural tissues for the cognitive output: the ability to comprehend and to speak. On the other hand, this portrait of FOXP2 as a transcription factor is oversimplified. In most eukaryotes, including humans, transcription factors often function in a network with one transcription factor regulating dozens of others. Feedback loops that maintain homeostasis and positive reinforcement auto-regulation are
Figure 5 Complexity of FOXP2 as a regulator of language ability. (Diagram labels: FOXP2 gene; FOXP2 mRNA with poly-A tail; FOXP2 transcription factor; target genes A to F, with functions including cell communication, cell shape, neurotransmitter, cell adhesion and synapse formation.)
frequently found in different cellular regulatory pathways and are often identified in developmental processes. Since FOXP2 is a developmentally regulated gene (Shu et al., 2001), there is every reason to believe that such a complex regulatory network is in place during the embryonic stage of a developing brain. Is FOXP2’s primary role perhaps to direct the regional neural establishment for the patterning of the brain structure essential for cognitive functions? How does FOXP2 interact with other genetic components to pattern our neural structures? As long as FOXP2 is viewed as acting as a single unit, a transcription factor, its potential subcellular function is easy to interpret. However, recent studies show that the scenario could be more complicated (Shu et al., 2001; Bruce and Margolis, 2002). The other domains present in the FOXP2-encoded protein have been under-emphasized. FOXP2, together with a related molecule encoded by FOXP1, constitutes a subfamily of the winged-helix/forkhead type molecules. They both contain a polyglutamine stretch near the
N-terminus, a putative zinc-finger motif thought to have DNA-binding capability, and a forkhead DNA-binding domain (Figures 3 and 6). The molecular characterization of the FOXP2 aberration in the KE family showed that the mutant FOXP2 gene has a missense mutation in the forkhead domain (exon 14) in all affected individuals of the family. The mutation suggests that the DNA-binding domain is a key domain for FOXP2 biological activity. The balanced reciprocal translocation t(5;7)(q22;q31.2) with a breakpoint between exons 3b and 4 in CS may result in the misregulation of FOXP2 and most likely leads to null activity of this gene, because the tissue-specific and temporal-specific regulatory sequences have been uncoupled from the coding region. In both cases, the gene function dependent on the forkhead domain is impaired. In addition, the encoded product carries a polyglutamine stretch, a hallmark of a number of gene products associated with psychiatric diseases, which suggests that FOXP2 might have additional neuropsychiatric functions on top of its role in neural development. Indeed, FOXP2 carries one of the longest polyglutamine stretches in the human genome. Is this polyglutamine stretch involved in any subcellular function through interaction with a polyglutamine-interacting protein, such as HIP, rather than acting inside the nucleus (Wanker et al., 1997; Tanno et al., 1999)? Is this polyglutamine stretch linked to programmed cell death, as in other polyglutamine-containing proteins? Further investigation is warranted, particularly because reduced programmed cell death could be the cause of the altered neural anatomy in affected individuals (Hackam et al., 2000; Watkins et al., 2002a).
Although molecular analysis detected little polymorphism of the polyglutamine stretch in the FOXP2 locus in individuals with progressive movement disorders, this protein domain may be an important structural motif for the normal activity of FOXP2 (Bruce and Margolis, 2002). Similarly, the zinc finger domain, which has been reported to interact with DNA in both in vivo and in vitro assays, may contribute to the activity of this protein by helping to select appropriate DNA target sites. Are these domains crucial in the
control of FOXP2 activity in the cell? All of these possibilities remain to be tested experimentally.
Figure 6 Structure of FOXP2. (A) Genomic structure of FOXP2 depicting novel exons in bold. (B) FOXP2 human cDNAs and ESTs. Dark gray marks the areas of the open reading frame; white indicates the untranslated region; and the dotted pattern marks the alternate donor/acceptor splice sites that extend the length of the given exons, while the poly-A tail of mRNA is indicated with the poly A stretch. (Adapted from Bruce and Margolis, 2002)
Figure 7 Schematic representation of FOXP2 and FOXP2-S proteins. The glutamine repeat region, zinc finger and forkhead domains are in gray, with their amino acid positions numbered. Above the full-length FOXP2 are the additional amino acids encoded when the three variably spliced exons with open reading frames are present. (Adapted from Bruce and Margolis, 2002)
To add another level of complexity, evidence was obtained for alternate splice variants and six previously undetected exons in various individuals (Bruce and Margolis, 2002) (Figure 6). This discovery of splice variants suggests that more than one protein product can be made (Figure 7). The choice of splice variant would influence the efficiency of mRNA generation, the stability of the mRNA and its coding capacity, and may ultimately result in a truncated version of the FOXP2 protein, FOXP2-S, whose stability is likely to be modulated differently. It is apparent that our understanding of FOXP2 function is far from complete. The regulation of this gene and its function is certainly more complicated than that presented in the initial report, with its single idea of this gene functioning as a transcription regulator. Indeed, based on the identification of these variants, the
FOXP2 locus spans at least 603 kb of genomic DNA, doubling the previously predicted size. It is one of the largest genes in the human genome. The additional FOXP2 exons and splice variants underscore how variable the tissue-specific regulation of this gene may be. Through alternative splicing, possibly in different tissues, the FOXP2 locus may encode multiple variant products with or without specific protein domains. Therefore, the dissection of the tissue-specific regulation of this gene and its translated products will be the key to understanding precisely its functional role in neural development as well as in language acquisition.
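To see why a handful of variably spliced exons multiplies the number of possible products, the combinatorics can be sketched as follows. This is only an upper-bound illustration under the assumption that each optional exon is included or skipped independently; real splicing is regulated and need not produce every combination. The exon names are placeholders and do not correspond to the exon numbering of FOXP2.

```python
from itertools import product


def enumerate_isoforms(exons):
    """List every exon combination obtained by including or skipping optional exons.

    `exons` is a list of (name, is_optional) pairs in genomic order.
    Constitutive exons are always kept; genomic order is preserved.
    """
    optional_count = sum(1 for _, is_optional in exons if is_optional)
    isoforms = []
    for choices in product([False, True], repeat=optional_count):
        picks = iter(choices)
        # For optional exons, consume the next include/skip decision.
        isoform = [
            name
            for name, is_optional in exons
            if not is_optional or next(picks)
        ]
        isoforms.append(isoform)
    return isoforms


# Hypothetical gene model: three constitutive and two optional exons.
toy_gene = [
    ("A", False),
    ("B", True),
    ("C", False),
    ("D", True),
    ("E", False),
]

for isoform in enumerate_isoforms(toy_gene):
    print("-".join(isoform))
```

With k optional exons the number of combinations is at most 2 to the power k, so the three variably spliced exons mentioned in the caption to Figure 7 would allow up to eight transcripts under this simplifying assumption.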
10. Are There More Language Genes?
With the success of identifying the FOXP2 gene as a key player in speech acquisition, can we argue that the arrival of language came with a single FOXP2 gene? The answer is a definite “NO”. While developmental disorders of speech and language are heritable, the ability to use language is probably influenced by several, and possibly many, different genetic factors. The unique three-generation KE family displaying a monogenic inheritance of speech deficits led us to the isolation of the first of such genes, FOXP2. This gene, however, is the cause of a rare language disorder and does not appear to be linked to the more common forms of language impairment. A significant number of individuals have difficulties with acquiring normal speech and language, despite adequate intelligence and environmental stimulation. These cases are not accounted for by mutations of the FOXP2 locus or any other specific gene (Meaburn et al., 2002; Fisher and DeFries, 2002; Newbury and Monaco, 2002). Thus, additional genetic loci responsible for the majority of language disorders remain to be identified. In addition, these genetic loci are likely to act in complex combinations. They may take the form of an oligogenic or multigenic inheritance (Figure 8). Epistatic effects, modifying interactions among genes, reduced penetrance, phenocopies and heterogeneity may all alter the wiring of the
Figure 8 Is language a monogenic or multigenic trait? (A) Single gene disorder (FOXP2). (B) Disorder caused by quantitative trait loci (N is large). (C) Oligogenic/multigenic disorder (complex trait disorder): FOXP2 and modifier genes. Each panel contrasts unaffected individuals with those with a language disorder.
language center and thus speech capability. In the situation of multigenic input, environmental impacts often factor in and alter the presentation of the phenotypes. This complexity of genetic interactions results in a reduced power for traditional parametric linkage analysis, where specification of the correct genetic model is important. The scarcity of families with a large number of affected individuals also limits the broad application of the strategy focusing on large multi-generational pedigrees for identifying the responsible
locus, as demonstrated in the isolation of FOXP2. A recently developed alternative, based on high-throughput genotyping of microsatellite and single nucleotide polymorphism (SNP) markers, has made it possible to analyze large numbers of sib-pairs using allele-sharing methodology. By monitoring the segregation and sharing of genomic markers between sib-pairs, coupled with a series of definitive psychometric tests to obtain reliable quantitative measures of language problems, genome-wide scans for mapping quantitative trait loci become possible. In fact, based on the application of these molecular tools, four chromosomal regions on chromosomes 2, 13, 16, and 19 have been implicated as harboring genes influencing language ability (Fisher et al., 2003). Positional cloning relying on a strategy similar to that used for FOXP2 can be employed to pinpoint the specific loci. The molecular genetic approach offers the potential for dissecting the specific functions of individual genes as well as the neurological pathways underlying speech disorders. With the combination of genetic, molecular genetic, neuroanatomical and cognitive analyses of patients with speech disorders, further advances can be expected in understanding the abnormalities in language ability. The same approach, in fact, has been extensively used in the study of other developmental functions, neurodegenerative diseases and diseases caused by complex loci. The results of those studies give reason to expect that this unique trait of human beings, speech and language, will one day be tracked and dissected in detail. The result will offer us not only a molecular understanding of this human trait, but also potential therapeutic actions to take with patients, and, more profoundly, insight into when and where we came from, which will allow us to define the uniqueness of human beings on the molecular level.
On the other hand, such investigations are just at the beginning. There is still a long way ahead of us.
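The allele-sharing logic for sib-pairs described in this section can be illustrated with a classical regression approach (commonly attributed to Haseman and Elston): for each sib-pair, the squared difference in a quantitative trait score is regressed on the proportion of alleles the pair shares identical by descent (IBD) at a marker. If a trait locus lies near the marker, pairs that share more alleles should have more similar scores, giving a negative slope. The sketch below is a minimal illustration with invented numbers; it is not the analysis pipeline of the studies cited above, and a real scan would also require a significance test and correction for testing many markers.

```python
def regression_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical data for six sib-pairs at one marker.
# Proportion of alleles shared IBD: 0.0, 0.5 or 1.0.
ibd_sharing = [0.0, 0.5, 0.5, 1.0, 1.0, 0.0]
# Squared difference between the two siblings' language test scores.
squared_difference = [9.0, 4.0, 5.0, 1.0, 2.0, 8.0]

slope = regression_slope(ibd_sharing, squared_difference)
print(slope)  # a negative value is the pattern expected under linkage
```

Because the method uses only pairs of siblings and a quantitative score, it does not depend on finding large multi-generational pedigrees or on specifying the mode of inheritance in advance.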
References
Belton, E., Salmond, C.H., Watkins, K.E., Vargha-Khadem, F., and Gadian, D.G. (2003) Bilateral brain abnormalities associated with dominantly inherited verbal and orofacial dyspraxia. Human Brain Mapping 18, 194–200.
Bishop, D.V.M., North, T., and Donlan, C. (1995) Genetic basis for specific language impairment: evidence from a twin study. Dev. Med. Child. Neurol. 37, 56–71.
Bishop, D.V. (2002) The role of genes in the etiology of specific language impairment. J. Commun. Disord. 35, 311–328.
Bruce, H.A. and Margolis, R.L. (2002) FOXP2: novel exons, splice variants, and CAG repeat length stability. Human Genet. 111, 136–144.
Chomsky, N. (1988) Language and Problems of Knowledge: the Managua Lectures. MIT Press.
Enard, W., Przeworski, M., Fisher, S.E., Lai, C.S.L., Wiebe, V., Kitano, T., Monaco, A.P., and Pääbo, S. (2002) Molecular evolution of FOXP2, a gene involved in speech and language. Nature 418, 869–872.
Fisher, S.E. and DeFries, J.C. (2002) Developmental dyslexia: genetic dissection of a complex cognitive trait. Nature Rev. Neuroscience 3, 767–780.
Fisher, S.E., Vargha-Khadem, F., Watkins, K.E., Monaco, A.P., and Pembrey, M.E. (1998) Localization of a gene implicated in a severe speech and language disorder. Nature Genetics 18, 168–170.
Fisher, S.E., Lai, C.S., and Monaco, A.P. (2003) Deciphering the genetic basis of speech and language disorder. Annu. Rev. Neurosci. 26, 57–80.
Foundas, A.L., Bollich, A.M., Corey, D.M., Hurley, M., and Heilman, K.M. (2001) Anomalous anatomy of speech-language areas in adults with persistent developmental stuttering. Neurology 57, 207–215.
Gopnik, M. (1990) Feature-blind grammar and dysphasia. Nature 344, 715.
Gopnik, M. and Crago, M.B. (1991) Familial aggregation of a developmental language disorder. Cognition 39, 1–50.
Hackam, A.S., Yassa, A.S., Singaraja, R., Metzler, M., Gutekunst, C.A., Gan, L., Warby, S., Wellington, C.L., Vaillancourt, J., Chen, N., Gervais, F.G., Raymond, L., Nicholson, D.W., and Hayden, M.R. (2000) Huntingtin interacting protein 1 induces apoptosis via a novel caspase-dependent death effector domain. J. Biol. Chem. 275, 41299–41308.
Hurst, J.A., Baraitser, M., Auger, E., Graham, F., and Norrell, S. (1990) An extended family with a dominantly inherited speech disorder. Dev. Med. Child. Neurol. 32, 347–355.
Lai, C.S., Fisher, S.E., Hurst, J.A., Levy, E.R., Hodgson, S., Fox, M., Jeremiah, S., Povey, S., Jamison, D.C., Green, E.D., Vargha-Khadem, F., and Monaco, A.P. (2000) The SPCH1 region on human 7q31: genomic characterization of the critical interval and localization of translocations associated with speech and language disorder. Am. J. Hum. Genet. 67, 357–368.
Lai, C.S.L., Fisher, S.E., Hurst, J.A., Vargha-Khadem, F., and Monaco, A.P. (2001) A forkhead-domain gene is mutated in a severe speech and language disorder. Nature 413, 519–523.
Lieberman, P. (1984) The Biology and Evolution of Language. Harvard Univ. Press, Cambridge, Massachusetts.
Meaburn, E., Dale, P.S., Craig, I.W., and Plomin, R. (2002) Language-impaired children: no sign of the FOXP2 mutation. Neuroreport 13, 1075–1077.
Newbury, D.F. and Monaco, A.P. (2002) Molecular genetics of speech and language disorders. Curr. Opinion in Pediatrics 14, 696–701.
Seldon, H.L. (1981) Structure of human auditory cortex. I. Cytoarchitectonics and dendritic distributions. Brain Res. 229, 277–294.
Shu, W., Yang, H., Zhang, L., Lu, M.M., and Morrisey, E.E. (2001) Characterization of a new subfamily of winged-helix/forkhead (Fox) genes that are expressed in the lung and act as transcriptional repressors. J. Biol. Chem. 276, 27488–27497.
Tanno, Y., Mori, T., Yokoya, S., Kanazawa, K., Honma, Y., Nikaido, T., Takeda, J., Tojo, M., Yamamoto, T., and Wanaka, A. (1999) Localization of huntingtin-interacting protein-2 (Hip-2) mRNA in the developing mouse brain. J. Chem. Neuroanat. 17, 99–107.
Vargha-Khadem, F., Watkins, K.E., Price, C.J., Ashburner, J., Alcock, K.J., Connelly, A., Frackowiak, R.S.J., Friston, K.J., Pembrey, M.E., Mishkin, M., Gadian, D.G., and Passingham, R.E. (1998) Neural basis of an inherited speech and language disorder. Proc. Natl. Acad. Sci. USA 95, 12695–12700.
Vargha-Khadem, F., Watkins, K.E., Alcock, K., Fletcher, P., and Passingham, R. (1995) Praxic and nonverbal cognitive deficits in a large family with a genetically transmitted speech and language disorder. Proc. Natl. Acad. Sci. USA 92, 930–933.
Wanker, E.E., Rovira, C., Scherzinger, E., Hasenbank, R., Walter, S., Tait, D., Colicelli, J., and Lehrach, H. (1997) HIP-I: a huntingtin interacting protein isolated by the yeast two-hybrid system. Hum. Mol. Genet. 6, 487–495.
Warburton, P., Baird, G., Chen, W., Morris, K., Jacobs, B.W., Hodgson, S., and Docherty, Z. (2000) Support for linkage of autism and specific language impairment to 7q3 from two chromosome rearrangements involving band 7q31. Am. J. Med. Genet. 96, 228–234.
Watkins, K.E., Vargha-Khadem, F., Ashburner, J., Passingham, R.E., Connelly, A., Friston, K.J., Frackowiak, R.S., Mishkin, M., and Gadian, D.G. (2002a) MRI analysis of an inherited speech and language disorder: structural brain abnormalities. Brain 125, 465–478.
Watkins, K.E., Dronkers, N.F., and Vargha-Khadem, F. (2002b) Behavioral analysis of an inherited speech and language disorder: comparison with acquired aphasia. Brain 125, 452–464.
Zhang, J., Webb, D.M., and Podlaha, O. (2002) Accelerated protein evolution and origin of human specific features: FOXP2 as an example. Genetics 162, 1825–1835.
3 Conceptual Complexity and the Brain: Understanding Language Origins
P. Thomas Schoenemann
University of Pennsylvania
Abstract
The evolutionary process works by modifying pre-existing mechanisms, which makes continuity likely. A review of the evidence available to date suggests that there are many aspects of language that show evolutionary continuity, though the direct evidence for syntax and grammar is less clear. However, the universal features of grammar in modern human languages appear to be essentially descriptions of aspects of our basic conceptual universe. It is argued that the most parsimonious model of language evolution involves an increase in conceptual/semantic complexity, which in turn drove the acquisition of syntax and grammar. In this model, universal features of grammar are actually simply reflections of our internal conceptual universe, which are manifested culturally in a variety of ways that are consistent with our pre-linguistic cognitive abilities. This explains both why grammatical rules vary so much across languages and why the commonalities appear to be inherently semantic in nature. An understanding of the way in which concepts are instantiated in the brain, combined with a comparative perspective on brain structure/function relationships, suggests a tight relationship between increasing brain size during hominid evolution and increasing conceptual complexity. A simulation using populations of interacting artificial neural-net agents illustrating this hypothesis is described. The association of brain size and conceptual complexity suggests that language has a deep ancestry.
1. Introduction
Since language is one of the defining characteristics of the human condition, the riddle of its origin and evolution is one of the most intriguing and fundamental questions in all of evolutionary biology. As with all evolutionary reconstructions, we are limited in the data available on which to build our explanatory models. But the problem of unraveling language evolution is of course made even harder by the fact that speech acts themselves are inherently ephemeral, and the fossil and archaeological clues relevant to language are only tantalizingly equivocal (Wang, 1991b). Language behavior, in short, does not fossilize (Hauser et al., 2002). Thus, we are even further removed from the direct behavior of interest than for other important hominid adaptive behaviors such as bipedalism or the use of fire. It is exactly for this reason that a believable explanation will rely even more critically on a clear understanding of exactly how the evolutionary process works. Not all scenarios are equally likely from an evolutionary perspective. We must of course understand the complexity of natural language in humans, and place it within the proper comparative, cross-species context. But a believable characterization of natural language itself will — whether we like it or not — necessarily be constrained by what is evolutionarily likely. A model of language which is evolutionarily implausible is not just “. . . a problem for the biologist . . . ” (Chomsky, 1972: 70), but actually calls the model itself into question. A consideration of the problem in this light shows that the
key to the puzzle is not the evolution of language-specific brain modules devoted solely to syntax. Instead, it is argued that language evolved through the modification and elaboration of pre-existing cognitive mechanisms, with non-genetic cultural evolutionary processes playing a key role.
2. How Evolution Works
The evolutionary process operating on biology creates and maintains complexity by capitalizing on random changes that are introduced into a population. While these changes can have large or small effects, the ones that happen to have small incremental effects also happen to be more likely to be retained. This is because the likelihood that a large mutation will have a positive effect on the fitness of an individual will decrease with the size of the change. Each of these intermediate incremental steps along an evolutionary pathway to some adaptation must be beneficial (ultimately with respect to reproduction). As Jacob (1977) notes, this means that “Evolution does not produce novelties from scratch. It works on what already exists, either transforming a system to give it new functions or combining several systems to produce a more elaborate one,” (p. 1164). This further implies that homologies will be the rule, rather than the exception (Schoenemann, 1999). That is, we should specifically be looking for them. It is also important to recognize that in an important sense, behavioral evolution drives biological evolution. Mayr (1978) points out that “there is little doubt that some of the most important events in the history of life, such as the conquest of land or of the air, were initiated by shifts in behavior.” (p. 55, quoted in Lieberman, 1984). It is true that the biology must already have been such that particular behavioral changes would be possible when the time came, but the appropriate behavioral flexibility necessarily existed prior to — and for reasons other than — the adaptive need of the organism to shift its behavior in any particular direction. The complete suite of biological changes that made terrestrial living
adaptive did not all occur prior to the emergence of the first land vertebrates. They occurred only as the organisms pushed the limits of their behavioral flexibility specifically in the direction of increasingly terrestrial living. In order to properly conceptualize the evolution of language, it is necessary to keep clearly in mind the two endpoints (Figure 1). In the beginning there existed a population of hominids lacking language, while at the end there exists a population with language. In order for this change to have occurred, it must necessarily have been true that there was some adaptive benefit of some kind to linguistic behavior (broadly defined). It does not matter for the present argument whether this benefit was related to communication or to some aspect of cognition or thinking, but some benefit must have accrued to individuals with better language abilities, or else we would not now be using language. Furthermore, this would have to have been true within each intermediate population, on average. Given this, it follows that if an individual within any one of these populations were able to use some pre-existing cognitive abilities to better accomplish some linguistically relevant processing, this individual would gain immediate advantages by doing so. Behavioral adaptations that require minimal genetic changes will be favored at each step. Given that this was always the case, the whole evolutionary process would necessarily have been biased towards incremental changes in pre-existing mechanisms, and decidedly not towards the evolution of completely new, language-specific cognitive modules. In general, the evolutionary process does not favor the evolution of domain-specific cognitive modules, particularly if any way can be found to accomplish the task by modifying pre-existing mechanisms. 
This is true in spite of the argument made by some evolutionary psychologists that domain-general mechanisms would necessarily be inferior to dedicated, domain-specific mechanisms, and hence will be inherently unlikely. The flaw in this argument is that it does not properly acknowledge the process of evolutionary change. Regardless of how much better a particular domain-specific mechanism might ultimately be if it could be perfectly engineered for
its assigned task, the evolutionary process itself is inevitably biased towards modifying mechanisms that are (by definition) more domain general. As a corollary to this, it is clear that meaningfully important continuities with other species are to be expected (Schoenemann, 1999). In fact, continuities are so ubiquitous in biology that the burden of proof must lie with models that deny continuities out of hand. Specifically with respect to language, there is in fact a great deal of evidence for continuity in the mechanisms involved in sound production and perception, those underlying semantics, and possibly even for syntax.
Figure 1 The evolutionary transition to language
The transition to language involved a series of populations, starting from one that lacked language and ending at one that had acquired fully modern language. Each intermediate population would have been incrementally closer to the modern condition, on average, compared to the one before it. Any behavioral changes that could have accomplished these incremental steps with pre-existing cognitive abilities and anatomical features would necessarily have been favored, thereby biasing the evolution of language toward the modification of existing abilities, and away from the creation of wholly new structures and abilities.
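The claim in Section 2 that large random changes are less likely than small ones to improve fitness can be illustrated numerically. The sketch below is not taken from the chapter: it assumes Fisher's geometric model as a stand-in, in which an organism sits at a fixed distance from an optimum in a space of several traits and a mutation is a step of a given size in a random direction. The function name, the number of traits, and the step sizes are all illustrative assumptions.

```python
import math
import random


def fraction_beneficial(step_size, n_traits=10, distance=1.0,
                        trials=20000, seed=0):
    """Estimate how often a random step of a given size moves an
    organism closer to the optimum (assumed to be the origin)."""
    rng = random.Random(seed)
    # Hypothetical starting phenotype: `distance` away from the optimum.
    start = [distance] + [0.0] * (n_traits - 1)
    improved = 0
    for _ in range(trials):
        # Draw a uniformly random direction by normalising a Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(n_traits)]
        norm = math.sqrt(sum(x * x for x in direction))
        mutant = [s + step_size * x / norm for s, x in zip(start, direction)]
        # The mutation counts as beneficial if it ends up nearer the optimum.
        if math.sqrt(sum(x * x for x in mutant)) < distance:
            improved += 1
    return improved / trials


if __name__ == "__main__":
    for size in (0.01, 0.5, 1.5, 2.5):
        print(size, fraction_beneficial(size))
```

Under these assumptions very small steps are beneficial nearly half the time, the fraction falls as the step grows, and a step larger than twice the distance to the optimum can never help. This matches the direction of the argument in the text, though the model itself is only one possible formalization.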
3. Evidence for Continuity
3.1 Continuity in sound production

Continuity is evident with respect to sound production in features of the structure of the larynx, the use of particular features of the speech signal to convey meaning, and the musculature and neurological control used to create distinctive acoustic features.

The larynx is responsible for producing the initial vibration that forms the foundation for speech. It turns out that our larynx is not fundamentally different from that of other mammals (Negus, 1949), and furthermore, animals which use their forelimbs for climbing generally have well developed larynges (Denes and Pinson, 1963). This is because the larynx functions not only to keep food from getting into the lungs, but also to seal air into the lungs under pressure, thereby strengthening the thorax considerably and allowing more effective use of the forelimbs. Humans are most closely related to the modern apes, with whom we share an upper body anatomy adapted to brachiation (a mode of locomotion characterized by swinging underneath tree branches with the forearms), which means that pre-linguistic hominids inherited a well developed larynx from their proto-ape ancestors.

The vibration imparted by the larynx is then filtered through the supralaryngeal vocal tract, emphasizing some frequency bands (which are called ‘formants’) and deemphasizing others. The use of formants to convey information is not unique to human language, however. An excellent example occurs in the mating calls of bullfrogs from the species Rana catesbeiana (Capranica, 1965; Lieberman, 1984). These bullfrogs will join in a chorus with a synthesized version of their mating call only if it has concentrations of acoustic energy at either the first or second formants, or both (Capranica, 1965).

It is also true that different animals, including humans, use similar sound characteristics to communicate the same kinds of underlying meanings.
Other animals use pitch to indicate relative submissiveness (high frequency bias) vs. dominance/aggressiveness (low frequency bias), and this bias is also found cross-linguistically
in humans (Kingston, 1991; Ohala, 1983). In order to produce the complex sound sequences of language, humans have evolved remarkable neural control over the muscles of the face, larynx, pharynx, tongue, mandible, diaphragm, and ribs. These muscles are innervated by motor portions of several cranial nerves: 1) the mandibular division of the trigeminal (Vth) which controls the muscles of mastication (i.e. the movement of the lower jaw), 2) the facial (VIIth) which controls the muscles of facial expression, 3) the glossopharyngeal (IXth) which controls the stylopharyngeus muscle (and may also innervate portions of the superior pharyngeal constrictor muscle), 4) the vagus (Xth) which controls the levator veli palatini, middle and inferior pharyngeal constrictors, salpingopharyngeus, and all the laryngeal muscles, and 5) the hypoglossal (XIIth) which controls the muscles of the tongue (Carpenter and Sutin, 1983). These motor fibers arise from various motor nuclei in the brainstem and constitute what may be considered the most basic level of speech control. The motor nuclei are in turn connected to various other neuroanatomical regions. The motor nuclei for muscles of the face, jaw, and tongue receive direct projections from the various motor regions of the cerebral cortex, as well as indirect connections (via the reticular formation and central gray regions of the brainstem) with the prefrontal cortex, cingulate cortex (considered part of the limbic system), and diencephalon (Deacon, 1989). The laryngeal musculature also appears to receive direct innervation from the motor cortex as well as indirect innervation (again, via the reticular formation and central gray regions of the brainstem) from the cingulate cortex and the diencephalon (Deacon, 1989; Jürgens and Zwirner, 2000). It is important to understand that this complexity is not unique to humans, however. 
The basic patterns of neural connections controlling the musculature involved in vocalization are the same in other primates (and mammals generally). The differences that have been documented so far occur only in the relative proportions and emphases of the different connections (Deacon, 1989). The basic rudiments of human neural connections are thought to be extremely old.
It is also known that the basic cortical connections relevant to language processing (as inferred from human clinical and electrical stimulation studies) match connections found in axonal tracer studies of monkey cortical connections (Deacon, 1988; Deacon, 1989; Galaburda and Pandya, 1982; Jürgens and Zwirner, 2000). For example, Broca’s and Wernicke’s areas (usually defined as including the posterior inferior frontal convexity and posterior third portion of the superior temporal gyrus, respectively), which were the first two areas found to be critical for language processing, are connected by a major tract known as the arcuate fasciculus. Since both these areas mediate different aspects of language, they must need to communicate in some way in order for language processing to proceed normally. An obvious place to look for human/non-human differences would be in the connection between these two areas. However, Deacon (1984) has shown that the Broca’s and Wernicke’s homologs in macaques share the same direct connections that are seen in humans. Deacon (1988) notes that “. . . all of the major pathways presumed to link language areas in humans are predicted by monkey tracer data,” (p. 368). It would appear that monkeys (which have been separate from the lineage leading to humans for ~25 million years, Sarich and Cronin, 1976) have the same basic set of neural connections even though they do not have similar behavioral abilities. What clearly has happened is a modification of existing architecture, not a major reorganization.
3.2 Continuity in perception

While it is generally assumed that speech perception in humans has required some sort of neuroanatomical evolutionary change, there is nevertheless unmistakable evidence of continuity here as well. This can be seen not only in the perception of formants, but also in many of the sounds that characterize language. In order to differentiate vowels, it is necessary to be able to perceive rapid changes in formant frequencies. It turns out that the structures in the cochlea of the inner ear responsible for translating air pressure fluctuations
(i.e., sound) into nerve impulses are almost ideally constructed to operate as a sound spectrogram analyzer (Denes and Pinson, 1963). Formants are thus exactly the kind of information that one would expect to be particularly salient. However, our auditory system did not appear in hominids for the express purpose of allowing the development of language. It is essentially the same as is found in all mammals, and thus likely dates back at least 200 million years (Lieberman, 1984). A number of studies of non-human animals provide behavioral evidence of the ability to extract the patterns of formant frequencies embedded in sound waves. For example, it has been shown that mynah birds “copy” human speech by mimicking the relative changes in formant frequencies (they produce two different tones at a time, one from each syrinx, Klatt and Stefanski, 1974; Lieberman, 1984). Obviously, if they can copy patterns of formant frequencies in some fashion, they must be able to perceive them. Fouts et al. (1976) have shown that common chimpanzees (Pan troglodytes) can understand spoken English. Savage-Rumbaugh et al. (1993) report that the pygmy chimpanzee (Pan paniscus) Kanzi correctly identifies a large array of spoken English words (even in strict double-blind experiments), and is also able to do this with computer-synthesized versions of the words. Although Kanzi might simply be doing some gestalt pattern-matching, his ability to perform these kinds of tasks suggests that pygmy chimps can hear at least some of the same kinds of phonemic distinctions that humans use, and thus have the auditory apparatus to distinguish the essential components of the rapid formant transitions (and other key acoustic features of speech). There are suggestions that the human acoustic perceptual abilities are fine-tuned to the specific features of speech. 
For example, it appears that humans are better able to follow streams of phonemes than series of non-phonemic sounds, and phonemes can be decoded by listeners even though they vary tremendously in acoustic characteristics from speaker to speaker (particularly in the specific frequencies of the formants, Lieberman, 1984; 1988). However, given the abilities of language-trained chimps such as Kanzi, it is not clear whether the human abilities in this regard are
unique features specifically evolved in humans for language (again, Kanzi might simply be doing some gestalt pattern-matching), or simply extensions of abilities found in other animals. Another possible example of continuity involves ‘categorical perception’, which occurs when auditory discrimination is greater at some points along an acoustic continuum than at others. These areas of greater discrimination often occur at phonemic boundaries, thereby facilitating speech perception (Liberman et al., 1957). This has been suggested for a number of features, including voice-onset-time (Kuhl, 1986; Kuhl and Miller, 1975), differences in the second formant transition (Mattingly et al., 1971), and even differences in the third formant transition, which is the acoustic basis for the distinction between /ra/ and /la/ in English (Miyawaki et al., 1975). The categorical nature of the perception of phonemes is fundamentally different from the perception of other dimensions of auditory stimuli, such as basic duration, frequency, and intensity of tones, which have been shown to be perceived in an essentially continuous fashion (Divenyi and Sachs, 1978; Kuhl, 1986; Snowdon, 1990). Although categorical perception was initially thought to indicate that humans had evolved unique neurological adaptations for decoding the speech signal (Kuhl, 1986), experiments reported on a range of animals have suggested non-linear discrimination functions similar to humans for at least some phonemic contrasts (Kluender et al., 1987; Kuhl and Padden, 1982; Kuhl and Padden, 1983; Kuhl and Miller, 1975; Kuhl and Miller, 1978; Morse and Snowdon, 1975). However, many of these studies are methodologically suspect, for example because they trained animals only on the end points of the continuum before testing intermediates (e.g., Kuhl and Miller, 1975; Kuhl and Miller, 1978). 
What is needed is to show that discrimination is greater in some parts of the continuum of interest without unintentionally inducing the animal to respond in this way as an artifact of the training method. At least some studies appear to have done this. Kuhl and Padden (1982) trained 3 macaques to indicate when they heard a change in stimuli (i.e., /a/ vs. /i/, the same vowel differing in pitch contour rise, and later syllable pairs differing only in initial consonants such as /va/ vs.
/sa/). The monkeys were then tested on how well they could detect pairs of computer-generated tokens along the /ba-pa/, /da-ta/, and /ga-ka/ continua, which all involve changes in voice-onset-time (VOT). Three pairs in each continuum were tested, with each pair differing by exactly 20 ms in VOT. The pairs were equally spaced along the VOT continua, but only one pair straddled the human phonemic boundary. The monkeys were significantly more likely to indicate they heard a difference if the pairs straddled the human phonemic boundaries. Thus, at least some studies suggest a continuity with respect to categorical perception. Regardless of the status of these studies, it is important to point out that the general prediction of evolutionary continuity is particularly clear for speech perception. During the earliest stages of evolution of language, sounds would have been adopted both for their ability to be clearly distinguished by existing perceptual systems, as well as for ease in being produced by the existing vocal apparatus. Selection would have operated on both of these systems simultaneously (Kuhl, 1986), and changes may well have occurred in both over the evolution of language, but the system would necessarily have been biased towards those features that were already salient to an ape perceptual auditory system.
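The logic of the stimulus design described above (three pairs per continuum, each pair 20 ms apart in VOT, only one pair crossing the human phonemic boundary) can be sketched in a few lines. This is an illustration only: the starting VOT of 0 ms, the adjacency of the pairs, and the boundary value of 25 ms are hypothetical placeholders, not values reported in the text.

```python
def make_pairs(start_ms, n_pairs=3, step_ms=20):
    """Build equally spaced stimulus pairs; the two members of each
    pair differ by exactly `step_ms` of voice-onset-time."""
    return [(start_ms + i * step_ms, start_ms + (i + 1) * step_ms)
            for i in range(n_pairs)]


def straddles(pair, boundary_ms):
    """True if the two tokens fall on opposite sides of the boundary."""
    low, high = pair
    return low < boundary_ms < high


# Hypothetical values chosen only to illustrate the design.
ASSUMED_BOUNDARY_MS = 25
pairs = make_pairs(start_ms=0)

for pair in pairs:
    label = "crosses boundary" if straddles(pair, ASSUMED_BOUNDARY_MS) else "within category"
    print(pair, label)
```

Because every pair has the same physical separation, better discrimination of the boundary-crossing pair cannot be attributed to a larger acoustic difference, which is the point of the design.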
3.3 Continuity in semantics

Complex organisms are able to make a larger number of distinctions in the varieties of perceptual information available to them than less complex organisms. These perceptual distinctions form the basis for conceptual categories. Bickerton (1990) argues that “The sea anemone . . . divides the world into ‘prey’ and ‘nonprey’, then divides the latter category into ‘potential predators’ and ‘others’, while the frog divides the world into ‘frogs’, ‘flying bugs’, ‘ponds’, and perhaps a few other categories like ‘large looming object (potential threat)’,” (p. 87). The kinds of categories that can be formed by complex organisms are not limited to specific sets of objects, like ‘flying bugs’ or ‘ponds’, of course. If they have multiple
senses interconnected to each other they can form more abstract categories such as ‘running’, ‘sleeping’, or ‘friendship’. It is true that the concepts recognized by one species may not be recognized by another. Dogs, for example, comprise one of many species that cannot differentiate as many colors as humans (Miller and Murphy, 1995; Neitz et al., 1989). Humans cannot hear the acoustic echoes that bats use to differentiate between an insect and a tree branch. Each species has evolved to pay attention to (i.e., form categories of) those parts of the environment that became most important for its own survival. Nevertheless, there is a substantial degree of overlap across species. Pigeons have been shown to have visual categories for such things as ‘people’, ‘trees’, ‘fish’, and even ‘Snoopy cartoons’ that are essentially the same as our own (Herrnstein, 1979). This clearly shows that, to a significant extent, human languages and cultures have made use of categories that are ‘real’ to a wide variety of animals. Furthermore, it is clear that other animals can use arbitrary symbols to communicate aspects of their conceptual worlds. A number of studies have demonstrated the ability of non-human species to use vocal calls to mark aspects of their internal motivation. For example, in several species more calls are given when a greater quantity or quality of food is found (Dittus, 1984; Hauser and Wrangham, 1987; Marler et al., 1986a; Marler et al., 1986b; Snowdon, 1990). These examples represent indexical (as opposed to truly symbolic) signs in Peirce’s semiotic framework (Agha, 1997), but they nevertheless indicate that an internal state can be marked with an external sign. These animals are not transmitting the emotion itself, they are transmitting a vocal sign of their emotional state. More impressively, Seyfarth et al. (1980) showed that vervet monkeys use three different alarm calls that are specific to three different types of predator: eagles, snakes, and leopards. 
The lack of transfer of habituation between calls for different predators (Cheney and Seyfarth, 1988) suggests that the signals really do carry semantic meaning. Subsequent work has shown that the vervet monkey case is not unique: several species of monkeys have been
shown to use specific predator alarm calls in essentially the same manner (Zuberbuhler, 2000a; Zuberbuhler, 2000b; Zuberbuhler, 2001). A number of studies clearly show that chimpanzees not only have semantic concepts, but also that they can assign and use arbitrary symbols to communicate information about them. Gardner and Gardner (1984) showed in double-blind tests that chimpanzees could correctly name (using sign language) objects that the experimenter/observer themselves could not see (thereby ruling out some form of ‘Clever Hans’ subtle cuing). Premack and Premack (1972) demonstrated that chimpanzees (Pan troglodytes) could use arbitrary symbols to communicate information about the concepts they represented. Asked to provide the color and shape of apples, for example, the chimp Sarah correctly chose the symbols for “red” and “circle”, even though her icon for apple (which was used to ask her the questions) was a blue triangle. Subsequent work by Savage-Rumbaugh and colleagues (1986) showed that chimps could be trained to use arbitrary symbols to ask for specific items from an array, to ask for items which were out of sight, to respond to symbols requesting items from another room, and ultimately to request another chimp to get items for them. The fact that chimps have been trained to communicate in these ways is evidence that they are able to 1) form mental concepts, 2) assign arbitrary symbols to these concepts, and 3) communicate specific ideas concerning these concepts via purely symbolic means. Their abilities are not identical to those of humans, it is true, but the differences are ones of degree, not of kind. The gap between what they demonstrate when reared as human children vs. in the wild as chimpanzees is not good evidence for discontinuity, moreover. The studies of captive animals show what is cognitively possible for an ape, given a humanlike learning environment.
3.4 Continuity in syntax and grammar

Of all aspects of language, syntax and grammar are the most difficult to demonstrate in non-human animals. Impressive abilities
have been shown for dolphins and sea lions (Schusterman and Gisinger, 1988), but these animals are quite distant from the human lineage, and therefore do not represent likely examples of evolutionary continuity. Zuberbuhler (2002) reports that Diana monkeys (Cercopithecus diana) behave differently to the alarm call of another primate, the Campbell’s monkey (Cercopithecus campbelli), if the call is first preceded by another kind of distinctive ‘boom’ call. This would appear to be a very primitive type of syntactic rule: one type of call appears to modify the meaning of another call. Premack and Premack (1972) showed that the chimp Sarah could mark argument relationships with an arbitrary device (in this case, serial order). While it is true that not all human languages require the use of serial order for this purpose, Sarah demonstrated that chimps have the cognitive structures that underlie the concept of “argument relationship” and furthermore, can use an arbitrary device to distinguish it. To argue that this is not evidence for continuity on the basis that human grammatical structures use many other devices in addition to serial order, is to misunderstand how evolution works. Perhaps the best evidence of continuity comes again from Kanzi, who has demonstrated in a number of tests that he can respond appropriately to verbal commands, even ones that he had never been exposed to before (e.g., “Pour the lemonade in the Coke.”). He responded correctly on 74% of 416 sentences in which the person giving the commands was not visible to Kanzi, and the person with him either covered their eyes (for the first 100 blind trials) or wore headphones playing loud music (for the remaining 316 blind trials) to ensure that they would not inadvertently cue him (Savage-Rumbaugh et al., 1993). At a minimum, Kanzi must have at least an incipient understanding that the relationship between sequences of symbols itself conveys meaning. 
These abilities are quite limited with respect to humans, although the seeds of possibility are quite clearly apparent. Given the degree of continuity in various other aspects of language, and specifically how existing structures have been modified for use in
language, the null hypothesis should be that syntax and grammar can be explained in this way as well. To what extent are natural language grammar and syntax fundamentally different from other types of cognition? What exactly does natural language grammar look like and how should it be properly characterized?
4. Syntax and Grammar in Natural Languages
4.1 Characterizing Universal Grammar

Cross-linguistic studies of grammar and syntax make it evident that a tremendous amount of variation exists across languages (Croft, 2003). Furthermore, grammatical structures are known to change over relatively short periods of time (e.g., Ogura, 1993; Traugott, 1972). In addition, there are differing views on how to characterize grammar in the first place. Some linguists reject the view that formal mathematical structures are the appropriate model to describe grammar and heavily emphasize the semantic basis of language (Lakoff, 1987; Langacker, 1987; O’Grady, 1987). Some models of language origins do not see the question of grammar as central at all (e.g., Urban, 2002). Formal linguistic models have in fact not even been able to characterize English — one of the most intensively studied languages — in a completely satisfactory manner (Croft, 1991; Jackendoff, 1994; Lieberman, 2002). Furthermore, because of the large degree of variation across languages in specific grammatical structures, descriptions of the underlying “Universal Grammar” (UG) common to all languages are limited to very general descriptions of the phenomena at issue. I have previously described in detail the ways in which published descriptions of the features of UG are fundamentally semantic in nature (Schoenemann, 1999). For example, Table 1 lists the putative features of UG derived from Pinker and Bloom (1990) and Bickerton (1990). Whenever a particular feature is accomplished differently in various languages, the phrases “mechanisms exist”, “constructions exist”, or “lexical
Table 1 Putative features of Universal Grammar, according to Pinker and Bloom (1990) and Bickerton (1990).
A) Hierarchical structure.
B) Grammatical rules are dependent on this hierarchical structure (‘structure dependency’).
C) Lexical categories (“noun,” “verb,” “adjective,” etc.) can be identified because of rules regarding their arrangement.
D) Individual lexical items are abstract general categories, which are combined in various ways to refer to specific events, things, states, locations, etc.
E) Rules specify how phrases should be combined, allowing the hearer to decode the underlying relationships between phrases, and hence the underlying meaning intended by the speaker.
F) Mechanisms exist which allow the hearer to distinguish among various possible argument relationships between the constituent phrases of a sentence.
G) Mechanisms exist to indicate temporal information.
H) Verbs take either one, two or three arguments.
I) Mechanisms exist to convey relations such as truth value, modality and illocutionary force (Steele et al., 1981).
J) Mechanisms exist to indicate the relationships between propositions in cases in which one proposition is an argument of another.
K) Constructions exist to refer to a specific entity simply by specifying its role within a proposition.
L) Lexical items exist (e.g., anaphoric items such as pronouns) which allow one to repeat a reference to something without having to repeat the entire noun phrase.
M) Mechanisms exist which license the omission of repeated phrases.
N) Mechanisms exist which allow the fixing of a tightly constrained co-occurrence pattern between an empty element and a sentence-peripheral quantifier.
items exist” are used to indicate this. It is obvious from this list that highly specific rules and constructions are missing. Instead, general characterizations regarding the types of information that grammars universally code are included, rather than the specific rules that are
used to code them. This is because the specific rules themselves are not universal. Furthermore, the types of information that grammars universally code can be seen as a reflection of our underlying conceptualization of the world. This raises the question, discussed below, of whether grammar is simply an epiphenomenon of semantics.
A few examples will illustrate this point (for a more detailed discussion, see Schoenemann, 1999). All natural language grammars are hierarchically structured (feature A). In a sentence like: “The poem made weak men blush and strong women cry”, we understand that men are weak and blushing, women are strong and crying, and that a poem had this effect on both, and not some other combination of the actions, actors and things mentioned in the sentence. The hierarchical structure of the sentence allows us to unravel these relationships. However, this is clearly a reflection of our underlying conceptual structure. We understand the world in this way. We organize social institutions hierarchically (often without clearly recognizing this as a choice, or even consciously planning them to look like their current state). Conceptual understanding of hierarchical structure is also something that evolved long before humans, and is not a result of language itself (cf. Bickerton, 1990). Primate social relationships are hierarchically structured in various ways (Cheney and Seyfarth, 1990; de Waal, 1989), for example. Furthermore, any complex structure built up from simpler beginnings is likely to be organized hierarchically (Sampson, 1978; Sampson, 1979; Sampson, 1980; Simon, 1962; Wang, 1984), regardless of how we conceptualize the world.
In addition, key differences between sentences (which serve to indicate alterations in meaning) respect this hierarchical structure (this is often referred to as ‘structure dependency’; feature B).
For example, the difference between the question “Is the boy who is angry here?” and the related statement “The boy who is angry is here” involves the location of only one of the two verbs in the sentences (plus differences in pitch contours when spoken). The question asks whether a specific boy (i.e., the one who is angry) is here or not; it does not ask whether the boy is angry. We
know this because the phrase “. . . the boy who is angry” is left unchanged between the two sentences. Thus, exactly where a change or difference occurs in the structure of a sentence indicates what the difference in intended meaning is: phrase structures represent units of conceptual understanding. In the example above, “. . . the boy who is angry . . .” is a complete conceptual unit. Non-structure-dependent grammars would break up these conceptual units, thereby requiring additional complexities to unravel the intended meanings. It does not matter that there are a few cases in which the semantic meaning of a particular unit may not be clear (as with “there” in “Is there any chocolate left?”). What matters is that in the vast majority of cases the structures in question are recognizable conceptual units. These cases form the basis for structure dependency rules, which then get applied even in cases where the meanings of particular units are unclear (for more detailed discussion, see Schoenemann, 1999).
Another feature of all grammars is that they distinguish nouns from verbs (feature C). This clearly reflects the fact that we conceptualize two facets of (our) reality: objects (or ‘things’, defined as loosely as one likes) and actions (acts, occurrences, or modes of being of these objects). Whether or not it is possible to neatly characterize every noun or verb in this way is not critical. It is clearly the core of the distinction (we do in fact conceptualize reality in this way), and we should expect language to reflect this.
In all languages, some mechanism exists for coding argument structure (e.g., who did what to whom; feature F). The mechanism varies, however, such that in some languages each noun is modified to indicate whether it is the direct object, indirect object, and so forth (case markings, as in Latin), while in other languages word order plays a more central role (as in English).
Thus, what is universal is not a specific set of rules, but the concept of argument structure itself. Similarly, all languages have some mechanisms for coding temporal information (feature G). In some languages this is accomplished via verb inflection (as in English), while in others the verb does not change and instead temporal information is indicated through the use of separate words (as in Chinese dialects). Again,
what is universal is not the specific rule structure, but simply that the concept of temporal information is coded in some way.
A perusal of the other features in Table 1 indicates that they also are not specific rules, but rather acknowledgements that particular kinds of conceptual information are coded by all natural language grammars. This suggests that the innate structures of language are actually semantic and conceptual, rather than grammatical and syntactic. In other words, it appears there has been a conflation of (1) the fact that grammatical rules exist (though they vary in their specifics), with (2) the fact that some key conceptual features are cross-cultural. Since the specific grammatical structures encoding various universal conceptual frameworks vary from language to language, we must assume these variants are cultural in origin.¹ The cultural evolution of grammar will necessarily involve the creation of structures that reflect underlying conceptual structure, and thus it is not necessary to propose a separate set of innate grammar-specific modules to guide this process. The fundamentally conceptual nature of the description of UG does not in and of itself prove that innate grammar-specific modules do not exist, of course, but it does suggest a more parsimonious proposition: the elaboration of semantic/conceptual complexity during human evolution drove the cultural evolution of grammar.
4.2 Evidence against the innateness of grammar
One of the key arguments for the specific innateness of grammar has
¹ One way around this conclusion would be to argue that a number of alternative grammatical structures are programmed genetically, but the specific features a child will learn are set by exposure to one or another grammatical structure (often referred to as “parameter setting”). I have previously explained why this idea is evolutionarily incoherent (Schoenemann, 1999). Essentially, it requires multiple adaptations to the same problem, akin to birds evolving the possibility of growing completely different kinds of wings depending on the environment they find themselves in during development.
been that a child could not possibly learn the correct set of rules for their language on the basis of positive examples only (e.g., Bowerman, 1988; Komarova et al., 2001). Children are not consistently corrected for speech errors (Brown and Hanlon, 1970) and rarely pay attention even when corrected (McNeill, 1966). Because there are an infinite number of possible grammars consistent with any finite set of example sentences, it is logically impossible for a child to determine the actual grammar without some form of constraints on learning (Gold, 1967; Nowak et al., 2001). However, there are several problems with this argument. First, it isn’t at all clear that even most children converge on the same grammar. Instead, given that adult grammaticality judgments vary so much (Ross, 1979), it seems they simply converge on a set of grammars that are “good enough” for communication. Second, positive evidence can actually be used as a weak form of negative evidence (i.e., “if this form is correct, then another is unlikely to be correct, barring future positive evidence to the contrary”). Chomsky (1981) has pointed out that if children notice that “…certain structures or rules fail to be exemplified in relatively simple expressions, where they would be expected to be found, then a (possibly marked) option is selected excluding them in the grammar, so that a kind of ‘negative evidence’ can be available even without corrections, adverse reactions, etc.” (p. 9). Regier (1996) showed that this can be implemented for learning word meanings as well. Third, there is nothing in Gold’s (1967) thesis that requires that constraints on learning must be specifically grammatical, or even specifically linguistic. All that is required is that there be constraints of some kind. 
Thus, the question has really always been “what is the nature of the constraints?” and not “are there any constraints on language learning at all?” Since, as discussed above, descriptions of the major features of UG appear to be essentially descriptions of key parts of our semantic/conceptual worlds, we must seriously consider the possibility that the constraints on UG are actually semantic/conceptual, rather than grammar-specific. In this respect, it is of interest to note that the developmental emergence of grammar
in children is apparently highly correlated with vocabulary size (Bates and Goodman, 1997). If vocabulary size can be seen as a proxy for conceptual complexity, then the connection between grammar and vocabulary development is consistent with the model suggesting that conceptual complexity drives grammatical complexity.
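The logic of the learnability argument discussed in this section can be made concrete with a toy sketch. Everything in it is an illustrative assumption rather than anything taken from Gold (1967) or the other studies cited: the made-up family of “grammars” (grammar k licenses strings of the letter a up to length k), the function name `consistent_hypotheses`, and the cap `max_bound`. It shows only that positive examples alone never eliminate the more permissive grammars, and that some constraint on the learner, here a simple preference for the tightest fit, is enough to select one.

```python
def consistent_hypotheses(sample, max_bound=8):
    """Return the toy 'grammars' that generate every observed sentence.

    Hypothesis k licenses the strings 'a', 'aa', ... up to length k.
    This family is invented for illustration; it models no real language.
    """
    longest = max(len(sentence) for sentence in sample)
    # Any grammar at least as permissive as the longest example fits the data.
    return [k for k in range(1, max_bound + 1) if k >= longest]


sample = ["a", "aaa", "aa"]  # positive examples only

# Every grammar with k >= 3 remains possible; raising max_bound adds more.
print(consistent_hypotheses(sample))  # [3, 4, 5, 6, 7, 8]

# One possible (non-grammatical) constraint: prefer the tightest grammar.
print(min(consistent_hypotheses(sample)))  # 3
```

The constraint used here is a general bias toward the least permissive hypothesis, not knowledge specific to grammar, which is the kind of possibility the argument above leaves open.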
4.3 Lack of syndromes solely affecting syntax or grammar
If grammar and syntax require innate structures that evolved solely for this purpose, it should be possible to find examples of clinical syndromes that only affect them. There are, however, apparently no such cases. It is true that Broca’s aphasics can have difficulties with certain kinds of grammatical structures (e.g., passive sentences, although the exact pattern of grammatical deficit appears to vary significantly across subjects) (Caramazza et al., 2001). However, the key question in the present context is whether Broca’s aphasia is associated with any non-linguistic deficits. This has, understandably, not been the focus of research attention in Broca’s aphasics. However, there is some research suggesting that Broca’s aphasics have difficulty with non-linguistic sequential learning (Christiansen et al., 2002). If confirmed, this would suggest that Broca’s aphasia is not specifically linguistic, though it obviously does affect language processing in important ways. An evolutionary perspective would predict that the neural substrate underlying Broca’s aphasia (as with any area relevant to language processing) would have been derived from pre-existing circuits that happened to process information in ways easy to modify for use in language. Our expectation should be that all ‘language’ areas will have important non-linguistic functions.
Work on Williams syndrome (sometimes offered as evidence of the innateness of grammar, e.g., Pinker, 1994) has not produced a consistent picture of intact linguistic abilities combined with non-linguistic deficits. These individuals are highly retarded (IQ of
~50), but are remarkably verbal. However, Bates (1992) notes that they have other spared abilities as well (e.g., face recognition). Furthermore, they show impairments of lexico-semantics, morphological feature analysis, and at least some syntactic abilities (Karmiloff-Smith et al., 1998).
4.4 Language genes
If grammar is only a reflection of conceptual structure, then genetic influences on grammar will either affect conceptualization more generally, or will affect mechanisms that are not specific solely to language. What little is concretely known at present regarding the genetics of language clearly supports this model, rather than the idea of language-specific genes. Recently a great deal of effort has been focused on a gene known as FOXP2 (the so-called “language gene”) (Enard et al., 2002). Individuals with a rare variant of this gene show various language deficits, including problems of grammar comprehension and production. Crucially, however, the pattern of deficits does not indicate specificity only to grammar: subjects with this variant show severe articulation difficulties, for example (Alcock et al., 2000; Watkins et al., 2002). Even more problematic, however, is that the grammatical problems they exhibit are not even features of Universal Grammar. For example, they have difficulty with tense markers, and verb inflection generally, yet many languages completely lack verb inflection (e.g., all dialects of Chinese; Wang, 1991a). This indicates that FOXP2, while it clearly affects language, is not evidence for an innately-specified language-specific grammar module. Instead, it fits perfectly with a model in which different languages utilize different pre-existing cognitive components, many of which may well be genetically specified to varying degrees, but none of which evolved solely for language.
5. Grammar and Syntax as Emergent Characteristics of Semantic/Conceptual Complexity
5.1 Sociality and the human condition
To place language, and grammar in particular, into proper context, it is critical that we recognize the fundamentally social nature of the human condition. There are many species which do not live in, or depend on, groups of conspecifics (e.g., bears, tigers). Many other species are found in groups, but are not particularly interactively social (e.g., schools of fish). They group together primarily to decrease their individual likelihood of being preyed upon. However, some species are much more obviously interactively social. Primates (with some exceptions) fall into this category, but humans are arguably the most interactively social of all. It is considered such a fundamental part of human nature that we actually identify, label and try to treat medically people who have problems with social interactions (e.g., autistics), and consider solitary confinement in prison to be one of the worst possible punishments. Much of what we do has been learned specifically through some form of imitation of others. Language is used in large measure, of course, to facilitate social interaction. It is also clear that large parts of language are learned, rather than innately given. This is obviously true for the specific sound sequences that make up our lexicon, but is also true for all the grammatical peculiarities that are not part of UG.
5.2 Emergence of grammar and syntax
If the universal features of grammar are really just reflections of our internal conceptualization of the world, while the specific rules and structures used by a given language are highly variable and not genetically coded themselves (instead borrowing and possibly elaborating on pre-existing cognitive abilities), and if at the same time the human condition is one of intense interactive sociality in
which learning through some form of imitation is ubiquitous, then an alternative model of the evolution of grammar and syntax may be considered. The specific rules and structures in a given language represent simply conventionalizations that allow languages to accurately communicate human semantics. Grammar and syntax would in this case be seen as behavioral adaptations that take advantage of different possible (i.e., pre-existing) cognitive abilities to accomplish the task of representing higher-order conceptual complexity. To take a concrete example, consider the use of serial order in some languages to mark argument relationships (Schoenemann and Wang, 1996). While the neural circuits involved in this process have not been identified in any detail, it is known that the prefrontal cortex is involved in marking temporal information generally. Subjects with prefrontal damage cannot plan and execute a complex set of motor movements, program a set of activities in correct temporal order, or remember the order of experiences (Fuster, 1985; Milner et al., 1985; Milner et al., 1991; Squire, 1987; Struss and Benson, 1986). This is true of non-human animals as well. A dissociation between item memory and order memory (with memory of sequential order localized in the frontal lobes) has been demonstrated in monkeys (Petrides, 1991; Squire, 1987) and even rats (Kesner and Holbrook, 1987; Kesner, 1990). The fact that the prefrontal cortex appears to be specifically involved in memory for serial order in species as far removed from humans as rats suggests that this specialization is very old (primate-rodent common ancestry dates to ~65 MYA, Sarich, 1985). Thus, in the earliest pre-linguistic hominids we can be confident that there were circuits already adapted specifically to processing serial information. An evolutionary perspective suggests that these circuits were capitalized upon by increasingly complicated language. 
The default hypothesis must be that these were adapted to use for language, not that wholly new circuits were created solely for language. Another profitable way to conceptualize this process is to think of language as a form of symbiont: language itself can be thought of as adapting to the human mind (Christiansen, 1994; Deacon, 1997).
By definition, natural languages that are too difficult for humans to learn would never have come into existence. Since this must have been true of every single hominid population extending back to the very origins of language, we must assume that language has molded itself to the hominid mind as much as the hominid mind has molded itself to allow increasingly sophisticated language. Over time, there would have been an increase in the complexity of the kinds of things the earliest linguistically-inclined hominids would have wanted to try to communicate. This, in turn, would have led to an increase in the kinds of grammatical forms that became conventionalized (Savage-Rumbaugh and Rumbaugh, 1993). Note that even those who believe that UG is innate still must believe that some (not necessarily conscious) cultural elaboration occurred. The question is really over the extent to which this happened, not whether it happened at all.
5.3 Evolution of semantic/conceptual complexity
If grammar is driven by semantic/conceptual complexity, what kind of evidence is there that semantic/conceptual complexity increased during human evolution? Ape language-learning studies suggest that vocabulary sizes of perhaps ~400 words are possible given a human-like developmental environment (Gardner and Gardner, 1989; Miles, 1990; Premack and Premack, 1972; Savage-Rumbaugh et al., 1993). This is at least two orders of magnitude smaller than that reported for the typical high-school senior (Miller and Gildea, 1991). It is not clear, however, how much of this difference is attributable to an underlying difference in semantic/conceptual complexity, rather than some other explanation (such as some inherent difficulty in connecting concepts with arbitrary signs).
However, there is another ape/human difference which strongly points toward a difference in conceptual complexity: overall brain size. The human brain is 3–4 times the size of that found in apes, even though apes are similar in body size (gorillas even being quite a bit larger, Deacon, 1992; Falk, 1987; Holloway, 1995; Jerison, 1973). Jerison (1985) has argued that brain size is an index of the
degree of sophistication of an animal’s internal representation of the world: “Grades of encephalization presumably correspond to grades of complexity of information processing. These, in turn, correspond in some way to the complexity of the reality created by the brain, which may be another way to describe intelligence.” (p. 30).
There are several reasons to suspect that this is in fact correct, and that the human difference indicates a major increase in conceptual complexity. First, concepts are instantiated in the brain as webs or networks of activation between primary sensory, secondary and association areas. This is directly evident from functional imaging studies of various kinds (see Pulvermüller, 2001), as well as behavioral work on correlations between word meanings (McRae et al., 1997). Functional imaging studies of people trying to imagine objects have shown that essentially the same patterns of activity are evident when people are imagining an image as when they are actually viewing it (Damasio et al., 1993; Kosslyn et al., 1993). These studies also demonstrate that information flows bi-directionally: activation of primary sensory areas is dependent not just on inputs from external sensors, but can occur purely as a result of inputs internally from other areas of the brain.
It is also clear that most of our subjectively experienced concepts are actually complex combinations of sensory information processed in various ways by the different cortical centers. This is clear for concepts like “ball,” or “fear,” which elicit an array of sensory information (e.g., “ball” has visual and sensorimotor components; “fear” has various visual, auditory, sensorimotor, and limbic components). But it is also true of apparently relatively simple concepts. It turns out that our experience of taste is actually the result of the interaction of olfactory (smell) and gustatory (taste) inputs.
This can be easily demonstrated by eating a banana while alternately holding one’s nose closed and opening it. The banana “flavor” disappears when the nose is blocked, because it is actually largely olfactory. The “McGurk” effect (McGurk and MacDonald, 1976), in which the auditory perception of a phoneme can be altered if it is paired with a mismatched visual input, similarly indicates that
conceptual awareness is the result of complex interactions between different inputs. This means there must be networks connecting differing regions as well as areas that mediate the integration of this information.
To what extent is brain size relevant to conceptual complexity? It is certainly reasonable to suppose that larger brained species have more complicated networks of interconnection, thereby leading to greater potential conceptual complexity (Lieberman, 2002). However, strong support for this idea comes from a consideration of brain structure/function relationships across species. It is well known that behaviorally specialized animals have correlated increases in areas of the brain known to mediate those behaviors (e.g., Krubitzer, 1995). For example, over half of the cortex of the echolocating ghost bat (Macroderma gigas) is devoted to processing auditory information, and approximately two-thirds of the cortex of the platypus (Ornithorhynchus anatinus) is devoted to processing electrosensory and mechanosensory information from its highly specialized bill (Krubitzer, 1995). The blind mole rat (Spalax ehrenbergi) spends essentially its entire life underground, placing somatosensory information at a premium, and making visual information useless. This species devotes a much larger portion of its cortex to somatosensory processing than rodents generally, while at the same time completely lacking a visual cortex (Mann et al., 1997). Raccoons (Procyon lotor) display highly developed manual dexterity, and their somatosensory cortex is concomitantly relatively large for carnivores, to the extent that individual digits are represented on distinct cortical gyri (Krubitzer, 1995). Furthermore, it has even been shown that selective breeding for more whiskers in mice leads to an increase in the cortical representation of the somatosensory area corresponding to whiskers (Van der Loos et al., 1986).
In fact, detailed mapping shows that each additional whisker is assigned its own additional cortical field (Van der Loos et al., 1986). These studies show that increased behavioral complexity is associated with increased neural resources. While we do not yet have detailed studies of the actual network complexity (i.e., circuit
diagrams) in these species, it is a reasonable assumption that the increase in neural resources is an index of an increase in fundamental network complexity.
In addition to the general association between brain region size and behavioral specialization, it appears that larger brained animals have greater degrees of cortical specialization. There appear to be basic structural constraints that influence changes in neural connectivity in important ways, which in turn are likely to have fundamental effects on conceptual complexity. Larger brains have more neurons (Haug, 1987), but in order for these neurons to remain equally well connected with each other (in the sense of a signal having the same average number of synapses to traverse between any two neurons), the number of connections (axons) must increase much faster than the number of neurons (Ringo, 1991). Comparative data from Hofman (1985) on volumes of white matter and grey matter (the latter composed primarily of neuron cell bodies and associated glial support cells) show that white matter in fact increases faster than grey matter with increasing brain size. However, the increase is apparently not fast enough to maintain equal degrees of connectivity between neurons (Ringo, 1991). This means that as brain size increases, there is a concomitant increase in the separation between existing areas.
There are a number of comparative studies that highlight this process. One particularly interesting example involves the separation of the motor and somatosensory areas of the cortex. In humans, as for primates generally, these two areas are separate. However, in the opossum the areas appear to be completely overlapping. In rats, only the forelimbs have separate motor and somatosensory cortical representations; the hindlimbs match the opossum pattern (Ebbesson, 1984). Having separate motor and somatosensory areas presumably allows for a more complicated response to a given sensory input.
Motor processing may proceed with greater independence of sensory input in primate brains compared to opossum brains. Another example of this pattern occurs in the connections between visual and motor cortex, in which smaller-brained animals
such as rats and mice have direct projections from their primary visual cortex to their primary motor cortex, whereas the much larger-brained anthropoid primates lack such direct projections (Northcutt and Kaas, 1995). This allows for a greater sophistication of processing of visual information before a motor action is taken. Detailed comparative studies of cortical areas confirm that larger brained species have a greater number of distinct cortical areas than smaller brained species (Northcutt and Kaas, 1995). For example, rodents have only 5–8 visual areas (Northcutt and Kaas, 1995), whereas primates appear to have perhaps 20–30 (Felleman and Van Essen, 1991; Northcutt and Kaas, 1995). Thus, empirical evidence shows that increasing brain size leads to increasing numbers of cortical areas, and increasing independence of these cortical areas. Cortical specialization is critical to conceptual complexity because it increases the potential ability to differentiate complex sensory information into diverse constituent parts. It is easy to see how these parts would help to magnify subtle differences between different streams of sensory input, thereby clarifying what would otherwise be inchoate. Given the large, obvious evolutionary costs of increasing neural tissue (Hofman, 1983; Smith, 1990), and the consequent impossibility of explaining brain size increases short of some sort of adaptive benefit, the best explanation would appear to be increasing conceptual complexity. At a more general level of behavior and anatomy, it is known that brain size is associated with differences in social complexity (Dunbar, 1992; Dunbar, 1995; Sawaguchi, 1988; Sawaguchi, 1990; Sawaguchi and Kudo, 1990), degree of innovation, social learning, tool use (Reader and Laland, 2002), and even rates of apparent deceptive behavior (scaled by the number of studies done on different species, Byrne, 1993). 
All of this suggests that increasing brain size is an index of increasing cognitive (and therefore conceptual) complexity.
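The scaling argument made earlier in this section (Ringo, 1991) can be illustrated with simple arithmetic. The sketch below assumes the most extreme case, in which every neuron is linked directly to every other. Real brains are far sparser than this, and the calculation is not Ringo's model; it is offered only to show why the number of connections must grow faster than the number of neurons if connectivity is to be held constant. The function name `full_connections` is hypothetical.

```python
def full_connections(n_neurons):
    """Number of links needed so every neuron reaches every other in one step."""
    return n_neurons * (n_neurons - 1) // 2


# Each tenfold increase in neurons needs roughly a hundredfold increase in links,
# so the number of links per neuron keeps rising.
for n in (10, 100, 1000):
    links = full_connections(n)
    print(n, links, links / n)
# prints:
# 10 45 4.5
# 100 4950 49.5
# 1000 499500 499.5
```

Because the cost per neuron rises with brain size under this idealization, a growing brain that cannot pay it would be expected to become more locally organized, which is consistent with the increased separation of cortical areas described above.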
5.4 Evolutionary simulations
Since evolutionary dynamics generally occur over a long time scale,
and have to be inferred from limited data, computer simulations of various kinds are useful tools for testing particular evolutionary models. The acquisition and emergence of vocabulary in populations of interacting agents has been studied in this way, leading to insights into possible evolutionary dynamics (Ke et al., 2002). A number of recent simulation studies have shown that the cultural emergence of grammar can be computationally modeled using agent-based simulations (e.g., Batali, 1998; Brighton and Kirby, 2001; Kirby, 2000). These models start with agents that do not have UG hardwired into them per se, but instead have simply general learning and pattern-matching algorithms. They are also required to communicate with one another, though they do not necessarily need to agree about which sets of symbols should be used for which concepts. While they clearly only model parts of the complete story, they nevertheless indicate that intuitions about what sorts of innate structures are necessary for UG-like features to emerge are, at the very least, suspect. One feature that has not been extensively studied in these agent-based studies is the relationship between brain size (general cognitive capacity) and conceptual complexity. To what extent is it possible to show that larger brain size has any relevance to increasing conceptual complexity, particularly in the context of a model of communication among agents? To address this question, Craig Martell (formerly of the Computer Science Department at the University of Pennsylvania, now at RAND) and I modified a simulation involving populations of interacting artificial neural nets. This particular simulation had been introduced originally by Batali (1998), and had been rewritten in the LISP programming language by Goroll (1999). The basic structure of the simulation is as follows. 30 virtual agents are constructed, each one composed of a single recurrent neural net (Elman, 1990). 
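A minimal sketch of one such agent may help fix ideas. It is not Batali's or Goroll's code. The layer sizes (4 new inputs, 30 hidden units and 10 outputs, with the hidden layer copied back each cycle) and the 10 x 10 meaning space are the ones given in this section; the class name `ElmanAgent`, the one-hot coding of letters, the random initial weights and the `correctness` measure are illustrative assumptions, and training by backpropagation is omitted entirely.

```python
import itertools
import math
import random

N_IN, N_HIDDEN, N_OUT = 4, 30, 10  # layer sizes given in the text

# 10 referents x 10 predicates = 100 meanings shared by every agent.
MEANINGS = list(itertools.product(range(10), range(10)))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class ElmanAgent:
    """Simple recurrent net: the hidden layer is copied back as extra input."""

    def __init__(self, n_hidden=N_HIDDEN, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        # Each hidden unit sees the 4 new inputs plus the previous hidden state.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_IN + n_hidden)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
                      for _ in range(N_OUT)]

    def hear(self, letters):
        """Feed letter indices (0-3) one at a time; return the 10 output activations."""
        context = [0.0] * self.n_hidden  # recurrent state starts at zero
        for letter in letters:
            # Assumed coding: one input node per letter (one-hot).
            new_input = [1.0 if i == letter else 0.0 for i in range(N_IN)]
            full_input = new_input + context  # 4 new + n_hidden copied-back values
            context = [sigmoid(sum(w * x for w, x in zip(row, full_input)))
                       for row in self.w_hidden]
        return [sigmoid(sum(w * h for w, h in zip(row, context)))
                for row in self.w_out]


def correctness(output, target):
    """Assumed measure: 1 minus the mean absolute error against the target vector."""
    return 1.0 - sum(abs(o - t) for o, t in zip(output, target)) / len(target)


agent = ElmanAgent()
meaning = agent.hear([0, 2, 1])  # a three-letter 'word'
print(len(MEANINGS), len(meaning), round(correctness(meaning, [1, 0] * 5), 3))
```

Passing a different `n_hidden` when constructing an agent changes only the size of the hidden layer, which is the sense in which hidden-unit count can stand in for 'brain' size in a model of this kind.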
These nets each have 4 input nodes, 30 hidden nodes, and 10 output nodes. The recurrent feature of the nets means that the hidden nodes are ‘copied back’ each cycle and are used as additional inputs (such that there are actually 30 recurrent + 4 new inputs each cycle). Each agent ‘knows’ the same 100 meanings: all the possible combinations of 10 referents and 10 predicates (e.g., ‘you angry’, ‘you excited’, ‘me angry’, etc.). ‘Words’ are produced by choosing a target meaning at random and finding the sequence of letters (introduced one at a time to the new input nodes) such that the output meaning vector of the ‘speaking’ agent most closely matches the target meaning. This input sequence is then given to a ‘hearer’ along with the original intended meaning.2 In each round, one agent is taught (i.e., its net is adjusted via standard backpropagation algorithms) to better understand 10 other agents’ words for each of their meanings (for details see Goroll, 1999). In Batali’s original simulation the population converged on a common code after about 15,000 rounds. Furthermore, the common code included strong suggestions of compositionality. For our simulations, two types of modifications were made. First, simulations were run using agents with either 100 meanings (as in the original simulation) or 200 meanings. Second, the sets of agents were given either 15, 30 or 60 hidden units, simulating simple changes in ‘brain’ size. The question we were interested in addressing was the extent to which the size of the artificial neural net affected the ability of the population to converge on a common code. Figure 2 shows the change in average ‘speaker correctness’ for populations of agents with 100 meanings, where speaker correctness is the degree to which a speaker’s output (meaning) nodes match the intended (randomly chosen) meaning after the best input sequence is determined. The simulations using populations of agents with 15, 30, and 60 hidden nodes are plotted together. Figure 3 shows the same information, but this time with agents who have 200 possible meanings. It is clear from both figures that populations of nets with larger numbers of hidden units evolve more complicated languages faster and with less error than populations with smaller numbers of hidden units. This is, to some extent, not surprising. It is generally
2 This obviously involves something of a cheat, in that perfect knowledge is assumed between speaker and hearer. However, there must be some way in which intended meanings are indicated, e.g., from parent to child, among humans.
Figure 2 Rates of convergence to a common code for 100 meanings.
Change in ‘speaker correctness’ for three populations of agents, each knowing 100 possible meanings. Light grey: agents with 15 nodes (‘neurons’) in the hidden layer; dark grey: agents with 30 nodes; black: agents with 60 nodes.
Figure 3 Rates of convergence to a common code for 200 meanings.
Change in ‘speaker correctness’ for three populations of agents, this time with each agent knowing 200 possible meanings.
known that nets with larger hidden layers can learn more complicated types of associations than nets with smaller hidden layers. Nevertheless, this simulation is useful in showing that the principle applies equally to populations of interacting agents. To the extent that artificial neural nets model important aspects of real neural nets appropriately, this simulation also demonstrates at a basic level the idea that larger brains would be a likely concomitant of increasing semantic complexity. Future ideas to be pursued include adding a more sophisticated and realistic learning process (in which ‘hearers’ don’t have perfect knowledge of the intended meanings), increasing meanings in a more interesting and realistic way (e.g., in which different dimensions of meaning are added, rather than simply more of the same basic type), analyzing the effects of net size on compositionality, and devising a better learning process that doesn’t bias sequences toward the smallest possible length (which tends to work against full compositionality, particularly if the nets are large enough to simply memorize large numbers of individual sequences).
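The basic ingredients of the simulation described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the LISP code of Goroll (1999) or the modified version used here: the layer sizes (4 letter inputs, 10 meaning outputs, 15/30/60 hidden nodes) and the 10 × 10 meaning space come from the text, whereas the binary meaning encoding, the greedy letter search, the word-length cap `max_len`, the random (untrained) weights, and the definition of correctness as one minus the mean absolute error are all assumptions made for illustration. Backpropagation training of the hearer is omitted.

```python
import itertools
import math
import random

N_LETTERS = 4    # one input node per letter (from the text)
N_OUTPUTS = 10   # length of the meaning vector (from the text)


def encode(referent, predicate):
    """Assumed encoding: 4 bits for the referent, 6 bits for the predicate."""
    bits = format(referent, "04b") + format(predicate, "06b")
    return [float(b) for b in bits]


def make_meanings(n_referents=10, n_predicates=10):
    """Every referent/predicate combination as a 10-element target vector."""
    pairs = itertools.product(range(n_referents), range(n_predicates))
    return [encode(r, p) for r, p in pairs]


def n_weights(n_hidden):
    """Connection count (biases ignored), a crude measure of 'brain' size."""
    return (N_LETTERS + n_hidden) * n_hidden + n_hidden * N_OUTPUTS


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ElmanNet:
    """Untrained simple recurrent net with random weights (illustrative)."""

    def __init__(self, n_hidden=30, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        self.w_hid = [[rng.uniform(-1, 1) for _ in range(N_LETTERS + n_hidden)]
                      for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden)]
                      for _ in range(N_OUTPUTS)]

    def step(self, letter, context):
        """Feed one letter plus the copied-back hidden state."""
        x = [1.0 if i == letter else 0.0 for i in range(N_LETTERS)] + context
        hidden = [_sigmoid(sum(w * v for w, v in zip(row, x)))
                  for row in self.w_hid]
        output = [_sigmoid(sum(w * h for w, h in zip(row, hidden)))
                  for row in self.w_out]
        return hidden, output


def correctness(output, target):
    """Assumed metric: 1 minus the mean absolute error over output nodes."""
    return 1.0 - sum(abs(o - t) for o, t in zip(output, target)) / len(target)


def speak(net, target, max_len=6):
    """Greedy search: append whichever letter moves the output closest to
    the target, and return the best-scoring prefix with its score."""
    context = [0.0] * net.n_hidden
    word, best_word, best_score = [], [], -1.0
    for _ in range(max_len):
        options = []
        for letter in range(N_LETTERS):
            hidden, output = net.step(letter, context)
            options.append((correctness(output, target), letter, hidden))
        score, letter, context = max(options, key=lambda opt: opt[0])
        word.append(letter)
        if score > best_score:
            best_word, best_score = list(word), score
    return best_word, best_score


if __name__ == "__main__":
    target = random.Random(1).choice(make_meanings())
    for size in (15, 30, 60):
        word, score = speak(ElmanNet(n_hidden=size), target)
        print(size, n_weights(size), word, round(score, 3))
```

Under these assumptions, `n_weights` gives 435, 1,320 and 4,440 connections for nets with 15, 30 and 60 hidden nodes, which makes concrete how quickly the capacity available for storing word-meaning associations grows with ‘brain’ size. Because the nets here are untrained, the scores printed by the demo say nothing about convergence; they only show how a ‘word’ and its speaker correctness would be computed for a given agent.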
5.5 Implications for the evolution of language
Whatever else increasing brain size led to in hominid evolution, it is difficult to escape the conclusion that conceptual complexity increased substantially during this time. Given the fundamentally socially interactive nature of humans, as well as the general association between degree of sociality and brain size, it is likewise difficult to believe that this increase in conceptual complexity could be unrelated to the evolution of language. The idea that the evolution of brain size and language are somehow related has a long history (e.g., Dunbar, 1996; Nadeau, 1991; Wang, 1991b; Washburn, 1960). Darwin himself (1882) argued for “. . . the relation between the continued use of language and the development of the brain . . .” (p. 87). This suggests that brain size itself may be an index of language evolution. If so, it would suggest that language has origins that are
substantially older than the appearance of anatomically modern Homo sapiens, which dates to only ~100,000 years ago (Tattersall, 1998). Figure 4 shows the change in hominid cranial capacity (a good proxy for brain size) over the last four million years for the majority of the fossil specimens that have been measured. It is evident that the major shift in brain size toward the modern condition began sometime between two and three million years ago. The earliest evidence for stone modification (presumably for tool use) is also found in this time range.
Figure 4 Cranial capacity in fossil hominids over time.
Species names are in italics; other labels refer to individual specimens. Extant chimpanzees (Pan troglodytes) and humans (Homo sapiens) are included for comparison. Horizontal bars reflect uncertainties in dating. Vertical bars indicate ranges in cranial capacity for given taxonomic groupings (data points without vertical bars indicate single specimens). Data compiled by Falk (1987), with minor changes (Schoenemann, 1997).
There are a number of proponents of a relatively late date for the origin of language (at least for completely modern, fully syntactic language, e.g., Bickerton, 1995; Klein and Edgar, 2002; Tattersall, 1998). Their arguments rely heavily on a postulated tight connection between language and the apparently sharp increases in the incidence of art and other forms of material culture that are evident in the archaeological record starting at ~35,000 years ago. The implication that language must be reflected in material culture is, however, inherently problematic (Dibble, 1987). There is a tremendous range of variation in the complexity of material culture left behind by different modern human groups, yet all have fully modern language. Material culture obviously can change dramatically without requiring fundamental changes in language. It thus remains purely speculative to suggest that fully modern syntactic language explains the Middle to Upper Paleolithic transition. These models also, it is important to note, generally argue that some form of communicative behavior was likely evident prior to this point (Bickerton, 1995; Klein and Edgar, 2002). There are, in fact, suggestions that hint toward a much older date of origin. One line of argument derives from studies of the basicranium of various fossil hominid specimens (Laitman, 1983; Laitman, 1985; Lieberman, 1984), which, it is argued, allow estimates of the extent to which the larynx had lowered from the ape condition, thereby allowing for a significantly greater range of vowel sounds. Since it would presumably also have increased the likelihood of choking on food, it would not have happened without some adaptive benefit, which is assumed to be language (but see Fitch and Reby, 2001). While it is true that this work suggests that Neanderthals (known from ~120,000 to ~35,000 years ago) did not have a fully lowered larynx, fossils even older than this appear to have a lower larynx than Neanderthal specimens.
To the extent that this work can accurately estimate laryngeal position, it actually suggests that the lowering of the larynx had begun at least as far back as Homo erectus, ~1.5 million years ago (Laitman, 1985). Other suggestive evidence for an early origin comes from studies of the endocasts of early hominids. At least one specimen, KNM-ER 1470 (Homo habilis), dating to ~1.8 million years ago, has been
claimed to have a sulcal pattern in the inferior frontal region (which includes the area where Broca’s area is located) that most closely matches modern humans rather than modern apes (Falk, 1983). This does not, of course, prove that Homo habilis had a Broca’s area, nor that it had language. Two other suggestions have been made in recent years regarding the date of the origin of language. One involves the possibility that narrow vertebral canals in one specimen of Homo erectus (KNM-WT 15000) indicate a lack of sophisticated control over the muscles involved in controlling sub-laryngeal air pressures, and hence a lack of language in this species (Walker and Shipman, 1996). However, this has recently been questioned on the grounds that the specimen appears pathological (Latimer and Ohman, 2001; Meyer, 2003) and that intercostal muscles do not play a significant role in language production (Meyer, 2003). Another suggestion has been that the relatively large size of the hypoglossal canal (which carries the nerves to most of the muscles that control the tongue) in some later Homo erectus specimens may indicate the presence of language in these specimens (Kay et al., 1998). However, this has also been questioned on the grounds that there is no clear evidence that the size of the canal is an index of the degree of motor control of the tongue, and that the range of hypoglossal canal size in modern humans overlaps substantially with that of non-human primates (DeGusta et al., 1999). Thus, while the evidence is clearly equivocal, there would appear to be reasonable evidence of a long history of language in hominid evolution. The evidence of the relationship between brain size and conceptual complexity suggests, at a minimum, that fundamental changes in human cognition critical to language evolution had begun prior to ~2 million years ago.
6. Conclusions
Evolutionary change is biased towards modification of pre-existing mechanisms, and away from the construction of wholly new devices.
This means that we should look for evidence of continuity, regardless of how unique a particular adaptation might appear. A review of the evidence available to date suggests there are in fact many areas of continuity. The aspect of language in which continuity is least evident is in syntax and grammar. However, since the universal features of grammar are essentially descriptions of aspects of our basic conceptual universe, while the specific rules vary from language to language, the most parsimonious model is that increasing conceptual/semantic complexity drove the acquisition of syntax and grammar (which would then be cultural manifestations of this pre-linguistic internal conceptual universe). An understanding of the way in which concepts are instantiated in the brain, combined with a comparative perspective on brain structure/function relationships, indicates that increasing brain size during hominid evolution reflects an increase in conceptual complexity. It is possible to simulate this process at a simple level with populations of interacting artificial neural-net agents. All of this suggests that the origins of language have a deep ancestry.
Acknowledgments
I am indebted to Professor William S.-Y. Wang not only for making it possible for me to participate in the ACE workshops, but for profoundly shaping my views on language evolution through his many contributions to this area of inquiry. This paper has also benefited from discussions with the many participants of the workshops, particularly Professor Wang, Professor Thomas Lee, Professor Morten Christiansen, Craig Martell, Jinyun Ke, James Minett, Ching Pong Au, and Feng Wang. In addition, the ideas in this paper have been shaped over the years by many discussions with Professor Vincent Sarich, Dr. John Allen, Dr. Karen Schmidt, and Reina Wong. I am particularly indebted to Craig Martell, now at RAND, for his intellectual contribution and programming expertise on the agent-based simulations described in this paper. Lastly, I would like to thank the City University of Hong Kong for graciously hosting the ACE workshops and helping make the events so productive, and the University of Pennsylvania for its sponsorship.
References
Agha, Asif. 1997. ‘Concept’ and ‘communication’ in evolutionary terms. Semiotica, 116.189–215. Alcock, K. J., Passingham, R. E., Watkins, K. E. and Vargha-Khadem, F. 2000. Oral dyspraxia in inherited speech and language impairment and acquired dysphasia [Oct 15]. Brain and Language, 75.17–33. Batali, J. 1998. Computational Simulations of the Emergence of Grammar. Approaches to the Evolution of Language: Social and Cognitive Bases, ed. by J. R. Hurford, M. Studdert-Kennedy and C. Knight, 405–26. Cambridge: Cambridge University Press. Bates, Elizabeth. 1992. Language development. Current Opinion in Neurobiology, 2.180–85.
Bates, Elizabeth and Goodman, Judith C. 1997. On the inseparability of grammar and the lexicon: Evidence from acquisition, aphasia, and real-time processing. Language & Cognitive Processes, 12.507–84. Bickerton, Derek. 1990. Language & Species. Chicago: University of Chicago Press. —. 1995. Language and Human Behavior. Seattle: University of Washington Press. Bowerman, Melissa. 1988. The ‘no negative evidence’ problem: How do children avoid constructing an overly general grammar? Explaining Language Universals, ed. by John A. Hawkins, 73–101. New York: Basil Blackwell Inc. Brighton, Henry and Kirby, Simon. 2001. The survival of the smallest: stability conditions for the cultural evolution of compositional language. Advances in artificial life: Proceedings of the 6th European Conference, ECAL 2001, 592–601. New York: Springer. Brown, R. and Hanlon, C. 1970. Derivational complexity and order of acquisition of syntax. Cognition and the Development of Language, ed. by J. R. Hayes, 11–53. New York: Wiley. Byrne, Richard. 1993. Do larger brains mean greater intelligence? Behavioral and Brain Sciences, 16.696–7. Capranica, R. R. 1965. The Evoked Vocal Response of the Bullfrog. Cambridge, Mass.: MIT Press. Caramazza, A., Capitani, E., Rey, A. and Berndt, R. S. 2001. Agrammatic Broca’s aphasia is not associated with a single pattern of comprehension performance [Feb]. Brain and Language, 76.158–84. Carpenter, Malcom B. and Sutin, Jerome. 1983. Human Neuroanatomy. Baltimore, Maryland: Williams & Wilkins.
Cheney, Dorothy L. and Seyfarth, Robert M. 1988. Assessment of meaning and the detection of unreliable signals by vervet monkeys. Animal Behavior, 36.477–86. Cheney, Dorothy L. and Seyfarth, Robert M. 1990. How Monkeys See the World. Chicago: University of Chicago Press. Chomsky, Noam. 1972. Language and Mind. New York: Harcourt Brace Jovanovich, Inc. —. 1981. Lectures on Government and Binding. Dordrecht: Foris Publications. Christiansen, Morten H. 1994. Infinite Languages, Finite Minds: Connectionism, Learning and Linguistic Structure, University of Edinburgh: Unpublished PhD dissertation. Christiansen, Morten H., Dale, Rick A., Ellefson, Michelle R. and Conway, Christopher M. 2002. The role of sequential learning in language evolution: Computational and experimental studies. Simulating the evolution of language, ed. by Angelo Cangelosi and Domenico Parisi, 165–87. New York: Springer-Verlag Publishing. Croft, William. 1991. Syntactic categories and grammatical relations. Chicago: University of Chicago Press. Croft, William. 2003. Typology and universals. Cambridge: Cambridge University Press. Damasio, H., Grabowski, T. J., Damasio, A., Tranel, D., Boles-Ponto, L., Watkins, G. L. and Hichwa, R. D. 1993. Visual recall with eyes closed and covered activates early visual cortices. Society for Neuroscience Abstracts, 19.1603. Darwin, Charles. 1882. The Descent of Man and Selection in Relation to Sex, 2nd Edition. London: John Murray. de Waal, Frans. 1989. Peacemaking Among Primates. Cambridge: Harvard University Press. Deacon, Terrence W. 1984. Connections of the Inferior Periarcuate Area in the Brain of Macaca fascicularis: An Experimental and Comparative Investigation of Language Circuitry and its Evolution: Harvard University: Unpublished PhD dissertation. —. 1988. Human brain evolution: I. Evolution of language circuits. Intelligence and Evolutionary Biology, ed. by H. J. Jerison and I. Jerison, 363–82. Berlin: Springer-Verlag.
—. 1989. The neural circuitry underlying primate calls and human language. Human Evolution, 4.367–401. —. 1992. Brain-language Coevolution. The Evolution of Human Languages, ed. by J. A. Hawkins and M. Gell-Mann, 49–83. Redwood City, CA: Addison-Wesley.
Deacon, Terrence W. 1997. The symbolic species: the co-evolution of language and the brain. New York: W.W. Norton. DeGusta, D., Gilbert, W. H. and Turner, S. P. 1999. Hypoglossal canal size and hominid speech. Proceedings of the National Academy of Sciences USA, 96.1800–4. Denes, Peter B. and Pinson, Elliot N. 1963. The Speech Chain. Garden City, New York: Anchor Press/Doubleday. Dibble, Harold L. 1987. Middle Paleolithic symbolism: a review of current evidence and interpretations. Journal of Anthropological Archaeology, 263–96. New York. Dittus, W. P. J. 1984. Toque macaque food calls: semantic communication concerning food distribution in the environment. Animal Behavior, 32.470–77. Divenyi, P. D. and Sachs, R. M. 1978. Discrimination of time intervals bounded by tone bursts. Perceptual Psychophysics, 24.429–36. Dunbar, Robin I. M. 1992. Neocortex size as a constraint on group size in primates. Journal of Human Evolution, 20.469–93. —. 1995. Neocortex size and group size in primates: A test of the hypothesis. Journal of Human Evolution, 28.287–96. —. 1996. Grooming, Gossip and the Evolution of Language. London: Faber and Faber. Ebbesson, Sven O. E. 1984. Evolution and ontogeny of neural circuits. Behavioral and Brain Sciences, 7.321–66. Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science, 14.179–211. Enard, W., Przeworski, M., Fisher, S. E., Lai, C. S., Wiebe, V., Kitano, T., Monaco, A. P. and Paabo, S. 2002. Molecular evolution of FOXP2, a gene involved in speech and language [Aug 22]. Nature, 418.869–72. Falk, Dean. 1983. Cerebral cortices of East African early hominids. Science, 221.1072–74. —. 1987. Hominid Paleoneurology. Annual Review of Anthropology, 16.13–30. Felleman, D. J. and Van Essen, D. C. 1991. Distributed hierarchical processing in the primate cerebral cortex [Jan-Feb]. Cereb Cortex, 1.1–47. Fitch, W. T. and Reby, D. 2001. The descended larynx is not uniquely human [Aug 22].
Proceedings of the Royal Society of London. B, Biological Sciences, 268.1669–75. Fouts, R. S., Chown, B. and Goodin, L. 1976. Transfer of signed responses in American Sign Language from vocal English stimuli to physical object stimuli by a chimpanzee (Pan). Learning and Motivation, 7.458–75.
Fuster, J. M. 1985. The prefrontal cortex, mediator of cross-temporal contingencies. Human Neurobiology, 4.169–79. Galaburda, A. M. and Pandya, D. N. 1982. Role of architectonics and connections in the study of primate brain evolution. Primate Brain Evolution, ed. by E. Armstrong and D. Falk, 203–17: Plenum Press. Gardner, R. A. and Gardner, B. T. 1984. A vocabulary test for chimpanzee (Pan troglodytes). Journal of Comparative Psychology, 98.381–404. —. 1989. Early signs of language in cross-fostered chimpanzees. Human Evolution, 4.337–65. Gold, E. M. 1967. Language identification in the limit. Information and Control, 10.447–74. Goroll, Nils. 1999. (The Deep Blue) Nile: Neuronal Influences on Language Evolution, University of Edinburgh: Master’s thesis. Haug, H. 1987. Brain sizes, surfaces, and neuronal sizes of the cortex cerebri: A stereological investigation of man and his variability and a comparison with some mammals (primates, whales, marsupials, insectivores, and one elephant). American Journal of Anatomy, 180.126–42. Hauser, Marc D. and Wrangham, Richard W. 1987. Manipulation of food calls in captive chimpanzees. Folia Primatologia, 48.207–10. Hauser, Marc D., Chomsky, Noam and Fitch, W. Tecumseh. 2002. The faculty of language: what is it, who has it, and how did it evolve? Science, 298.1569–79. Herrnstein, Richard J. 1979. Acquisition, generalization, and discrimination reversal of a natural concept. Journal of Experimental Psychology, 5.116–29. Hofman, Michel A. 1983. Energy metabolism, brain size, and longevity in mammals. Quarterly Review of Biology, 58.495–512. —. 1985. Size and shape of the cerebral cortex in mammals: I. The cortical surface. Brain, Behavior and Evolution, 27.28–40. Holloway, Ralph L. 1995. Toward a synthetic theory of human brain evolution. Origins of the Human Brain, ed. by Jean-Pierre Changeux and Jean Chavaillon, 42–54. Oxford: Clarendon Press. Jackendoff, Ray. 1994.
Patterns in the mind: language and human nature. New York: Basic Books. Jacob, François. 1977. Evolution and tinkering. Science, 196.1161–66. Jerison, H. J. 1973. Evolution of the Brain and Intelligence. New York: Academic Press. —. 1985. Animal intelligence as encephalization. Philosophical Transactions of the Royal Society of London, Series B, 308.21–35.
Jürgens, Uwe and Zwirner, Petra. 2000. Individual hemispheric asymmetry in vocal fold control of the squirrel monkey. Behavioural Brain Research, 109.213–17. Karmiloff-Smith, A., Tyler, L. K., Voice, K., Sims, K., Udwin, O., Howlin, P. and Davies, M. 1998. Linguistic dissociations in Williams syndrome: evaluating receptive syntax in on-line and off-line tasks. Neuropsychologia, 36.343–51. Kay, Richard F., Cartmill, Matt and Balow, Michelle. 1998. The hypoglossal canal and the origin of human vocal behavior. Proceedings of the National Academy of Sciences USA, 95.5417–19. Ke, Jinyun, Minett, James, Au, Ching-Pong and Wang, William S.-Y. 2002. Self-organization and selection in the emergence of vocabulary. Complexity, 7.41–54. Kesner, R. P. and Holbrook, T. 1987. Dissociation of item and order spatial memory in rats following medial prefrontal cortex lesions. Neuropsychologia, 25.653–64. Kesner, Raymond P. 1990. Memory for frequency in rats: Role of the hippocampus and medial prefrontal cortex. Behavioral and Neural Biology, 53.402–10. Kingston, John. 1991. Five exaptations in speech: Reducing the arbitrariness of the constraints on language. Behavioral and Brain Sciences, 13.738–39. Kirby, Simon. 2000. Syntax without natural selection: How compositionality emerges from vocabulary in a population of learners. The Evolutionary Emergence of Language, ed. by Chris Knight, Michael Studdert-Kennedy and James R. Hurford, 303–23. Cambridge: Cambridge University Press. Klatt, D. H. and Stefanski, R. A. 1974. How does a mynah bird imitate human speech? Journal of the Acoustical Society of America, 55.822–32. Klein, Richard G. and Edgar, Blake. 2002. The Dawn of Human Culture. New York: John Wiley & Sons. Kluender, D. R., Diehl, R. L. and Killeen, P. R. 1987. Japanese quail can learn phonetic categories. Science, 237.1195–97. Komarova, N. L., Niyogi, P. and Nowak, M. A. 2001. The evolutionary dynamics of grammar acquisition [Mar 7].
J Theor Biol, 209.43–59. Kosslyn, S. M., Alpert, N. M., Thompson, W. L., Maljkovic, V., Weise, S. B., Chabris, C. F., Hamilton, S. E., Rauch, S. L. and Buonanno, F. S. 1993. Visual mental imagery activates topographically organized visual cortex: PET investigations. Journal of Cognitive Neuroscience, 5.263–87. Krubitzer, Leah. 1995. The organization of neocortex in mammals: are species differences really so different? Trends in Neurosciences, 18.408–17. Kuhl, Patricia K. and Padden, D. M. 1982. Enhanced discriminability at the
phonetic boundaries for the voicing feature in macaques [Dec]. Perceptual Psychophysics, 32.542–50. —. 1983. Enhanced discriminability at the phonetic boundaries for the place feature in macaques [Mar]. Journal of the Acoustical Society of America, 73.1003–10. Kuhl, Patricia K. 1986. Theoretical contributions of tests on animals to the special-mechanisms debate in speech. Experimental Biology, 45.233–65. Kuhl, Patricia K. and Miller, J. D. 1975. Speech perception by the chinchilla: Voiced-voiceless distinction in alveolar plosive consonants. Science, 190.69–72. —. 1978. Speech perception by the chinchilla: Identification functions for synthetic VOT stimuli. Journal of the Acoustical Society of America, 63.905–17. Laitman, Jeffrey T. 1983. The evolution of the hominid upper respiratory system and implications for the origins of speech. Glossogenetics: The Origin and Evolution of Language. Proceedings of the International Transdisciplinary Symposium on Glossogenetics, ed. by Eric de Grolier, 63–90. Paris: Harwood Academic Publishers. —. 1985. Evolution of the hominid upper respiratory tract: The fossil evidence. Hominid Evolution: Past, Present and Future, ed. by Phillip V. Tobias, Valerie Strong and Heather White, 281–86. New York: Alan R. Liss. Lakoff, George. 1987. Women, Fire, and Dangerous Things. Chicago: University of Chicago Press. Langacker, Ronald W. 1987. Foundations of Cognitive Grammar. Vol. 1. Stanford: Stanford University Press. Latimer, Bruce and Ohman, James C. 2001. Axial dysplasia in Homo erectus. Journal of Human Evolution, 40.A12. Liberman, Alvin M., Harris, Katherine Safford, Hoffman, Howard S. and Griffith, Belver C. 1957. The discrimination of speech sounds within and across phoneme boundaries. Journal of Experimental Psychology, 54.358–68. Lieberman, Philip. 1984. The Biology and Evolution of Language. Cambridge, Massachusetts: Harvard University Press. —. 1988.
Language, intelligence, and rule-governed behavior. Intelligence and Evolutionary Biology, NATO ASI Series, Vol. G17, ed. by H. J. Jerison and I. Jerison, 143–56. Berlin: Springer-Verlag. —. 2002. On the nature and evolution of the neural bases of human language. Yearbook of Physical Anthropology, 45.36–62. Mann, M. D., Rehkamper, G., Reinke, H., Frahm, H. D., Necker, R. and Nevo,
E. 1997. Size of somatosensory cortex and of somatosensory thalamic nuclei of the naturally blind mole rat, Spalax ehrenbergi. Journal fur Hirnforschung, 38.47–59. Marler, P., Dufty, A. and Pickert, R. 1986a. Vocal communication in the domestic chicken: I. Does a sender communicate information about the quality of a food referent to a receiver? Animal Behavior, 34.188–93. —. 1986b. Vocal communication in the domestic chicken: II. Is a sender sensitive to the presence and nature of a receiver? Animal Behavior, 34.194–98. Mattingly, I. G., Liberman, A. M., Syrdal, A. K. and Halwes, T. 1971. Discrimination in speech and non-speech modes. Cognitive Psychology, 2.131–57. Mayr, Ernst. 1978. Evolution. Sci. Am., 239.47–55. McGurk, H. and MacDonald, J. 1976. Hearing lips and seeing voices [Dec 23–30]. Nature, 264.746–8. McNeill, D. 1966. Developmental psycholinguistics. The Genesis of Language: A Psycholinguistic Approach, ed. by F. Smith and G. Miller, 15–84. Cambridge, Mass.: MIT Press. McRae, Ken, de Sa, Virginia R. and Seidenberg, Mark S. 1997. On the nature and scope of featural representations of word meaning. Journal of Experimental Psychology: General, 126.99–130. Meyer, Marc R. 2003. Vertebrae and language ability in early hominids. Paper presented at Annual Meeting of the PaleoAnthropology Society, Tempe, Arizona. Miles, H. Lyn White. 1990. The cognitive foundations for reference in a signing orangutan. “Language” and Intelligence in Monkeys and Apes: Comparative Developmental Perspectives, ed. by Sue Taylor Parker and Kathleen Rita Gibson, 511–39. Cambridge: Cambridge University Press. Miller, George A. and Gildea, Patricia M. 1991. How children learn words. The Emergence of Language: Development and Evolution, ed. by William S-Y. Wang, 150–58. New York: W. H. Freeman. Miller, P. E. and Murphy, C. J. 1995. Vision in dogs [Dec 15]. J Am Vet Med Assoc, 207.1623–34. Milner, B., Petrides, M. and Smith, M. L. 1985.
Frontal lobes and the temporal organization of memory. Human Neurobiology, 4.137–42. Milner, Brenda, Corsi, Philip and Leonard, Gabriel. 1991. Frontal-lobe contribution to recency judgements. Neuropsychologia, 29.601–18. Miyawaki, K., Strange, W., Verbrugge, R., Liberman, A. M., Jenkins, J. J. and Fujimura, O. 1975. An effect of linguistic experience: The discrimination of /r/ and /l/ by native speakers of Japanese and English. Perceptual Psychophysics, 18.331–40.
Morse, P. A. and Snowden, C. T. 1975. An investigation of categorical speech discrimination by rhesus monkeys. Perceptual Psychophysics, 17.9–16. Nadeau, R. 1991. Minds, Machines and Human Consciousness. Chicago: Contemporary Books. Negus, V. E. 1949. The Comparative Anatomy and Physiology of the Larynx. New York: Hafner. Neitz, J., Geist, T. and Jacobs, G. H. 1989. Color vision in the dog [Aug]. Vis Neurosci, 3.119–25. Northcutt, R. G. and Kaas, J. H. 1995. The emergence and evolution of mammalian neocortex [Sep]. Trends Neurosci, 18.373–9. Nowak, M. A., Komarova, N. L. and Niyogi, P. 2001. Evolution of universal grammar [Jan 5]. Science, 291.114–8. O’Grady, William. 1987. Principles of Grammar & Learning. Chicago: University of Chicago Press. Ogura, Mieko. 1993. The development of periphrastic do in English: A case of lexical diffusion in syntax. Diachronica, 10.51–85. Ohala, J. J. 1983. Cross-language use of pitch: An ethological view. Phonetica, 40.1–18. Petrides, Michael. 1991. Functional specialization within the dorsolateral frontal cortex for serial order memory. Proceedings of the Royal Society of London. B, Biological Sciences, 246.299–306. Pinker, Steven. 1994. The Language Instinct: How the Mind Creates Language. New York: Harper Collins Publishers, Inc. Pinker, Steven and Bloom, Paul. 1990. Natural language and natural selection. Behavioral and Brain Sciences, 13.707–84. Premack, Ann James and Premack, David. 1972. Teaching language to an ape. Scientific American, 227.92–99. Pulvermuller, Friedemann. 2001. Brain reflections of words and their meaning [2001/12/1]. Trends in Cognitive Sciences, 5.517–24. Reader, S. M. and Laland, K. N. 2002. Social intelligence, innovation, and enhanced brain size in primates [Apr 2]. Proc Natl Acad Sci USA, 99.4436–41. Regier, Terry. 1996. The Human Semantic Potential: Spatial Language and Constrained Connectionism: Neural Network Modeling and Connectionism.
Cambridge, Massachusetts: MIT Press. Ringo, James L. 1991. Neuronal interconnection as a function of brain size. Brain, Behavior and Evolution, 38.1–6. Ross, John Robert. 1979. Where’s English? Individual Differences in Language
Ability and Language Behavior, ed. by Charles J. Fillmore, Daniel Kempler and William S.-Y. Wang, 127–63. New York: Academic Press. Sampson, Geoffrey. 1978. Linguistic universals as evidence for empiricism. Journal of Linguistics, 14.183–206. —. 1979. A non-nativist account of language universals. Linguistics and Philosophy, 3.99–104. —. 1980. Making Sense. Oxford: Oxford University Press. Sarich, Vincent M. 1985. Rodent macromolecular systematics. Evolutionary Relationships among Rodents, ed. by W. Patrick Luckett and Jean-Louis Hartenberger, 423–52. New York: Plenum Press. Sarich, Vincent M. and Cronin, John E. 1976. Molecular systematics of the Primates. Molecular Anthropology, ed. by Morris Goodman and R. E. Tashian, 141–70. New York: Plenum Press. Savage-Rumbaugh, E. Sue. 1986. Ape Language From Conditioned Response to Symbol. New York: Columbia University Press. Savage-Rumbaugh, E. Sue and Rumbaugh, Duane M. 1993. The emergence of language. Tools, Language and Cognition in Human Evolution, ed. by Kathleen R. Gibson and Tim Ingold, 86–108. Cambridge: Cambridge University Press. Savage-Rumbaugh, E. Sue, Murphy, Jeannine, Sevcik, Rose A., Brakke, Karen E., Williams, Shelly L. and Rumbaugh, Duane M. 1993. Language comprehension in ape and child. Monographs of the Society for Research in Child Development, 58.1–222. Sawaguchi, Toshiyuki. 1988. Correlations of cerebral indices for ‘extra’ cortical parts and ecological variables in primates. Brain, Behavior and Evolution, 32.129–40. —. 1990. Relative brain size, stratification, and social structure in Anthropoids. Primates, 31.257–72. Sawaguchi, Toshiyuki and Kudo, Hiroko. 1990. Neocortical development and social structure in Primates. Primates, 31.283–89. Schoenemann, P. Thomas. 1997. An MRI Study of the Relationship Between Human Neuroanatomy and Behavioral Ability, Anthropology, University of California, Berkeley: Unpublished PhD dissertation. —. 1999.
Syntax as an emergent characteristic of the evolution of semantic complexity. Minds and Machines, 9.309–46. Schoenemann, P. Thomas and Wang, William S.-Y. 1996. Evolutionary principles and the emergence of syntax. Behavioral and Brain Sciences, 19.646–47. Schusterman, R. J. and Gisinger, R. 1988. Artificial Language comprehension in
Conceptual Complexity and the Brain dolphins and sea lions: The essential cognitive skills. Psychol. Rec., 38.311–48. Seyfarth, Robert M., Cheney, Dorothy L. and Marler, P. 1980. Vervet monkey alarm calls: Semantic communication in a free-ranging primate. Animal Behavior, 28.1070–94. Simon, H. 1962. The architecture of complexity. Proceedings of the American Philosophical Society, 106.467–82. Smith, B. Holly. 1990. The cost of a large brain. Behavioral and Brain Sciences, 13.365–66. Snowdon, Charles T. 1990. Language capacities of nonhuman animals. Yearbook of Physical Anthropology, 33.215–43. Squire, L. R. 1987. Memory and Brain. New York: Oxford University Press, Inc. Struss, D. T. and Benson, D. F. 1986. The Frontal Lobes. New York: Raven Press. Tattersall, Ian. 1998. Becoming Human: Evolution and Human Uniqueness. New York: Harcourt Brace. Traugott, Elizabeth Closs. 1972. A History of English Syntax. New York: Holt, Rinehart and Winston. Urban, Greg. 2002. Metasignaling and language origins [Mar]. American Anthropologist, 104.233–46. Van der Loos, H., Welker, E., Dorfl, J. and Rumo, G. 1986. Selective breeding for variations in patterns of mystacial vibrissae of mice. Bilaterally symmetrical strains derived from ICR stock [Mar-Apr]. J Hered, 77.66–82. Walker, A. and Shipman, P. 1996. The Wisdom of the Bones: in Search of Human Origins. New York: Knopf. Wang, William S.-Y. 1984. Organum ex machina? Behavioral and Brain Sciences, 7.210–11. —. 1991a. Explorations in Language. Taipei, Taiwan: Pyramid Press. —. 1991b. Explorations in language evolution. Explorations in Language, 105–31. Taipei, Taiwan: Pyramid Press. Washburn, Sherwood L. 1960. Tools and evolution. Scientific American, 203.63–75. Watkins, K. E., Dronkers, N. F. and Vargha-Khadem, F. 2002. Behavioural analysis of an inherited speech and language disorder: comparison with acquired aphasia [Mar]. Brain, 125.452–64. Zuberbuhler, K. 2000a. Referential labelling in Diana monkeys [May]. 
Animal Behaviour, 59.917–27. —. 2000b. Interspecies semantic communication in two forest primates [Apr
93
94
Language Acquisition, Change and Emergence 7]. Proceedings of the Royal Society of London. B, Biological Sciences, 267.713–18. —. 2001. Predator-specific alarm calls in Campbell’s monkeys, Cercopithecus campbelli [Oct]. Behavioral Ecology and Sociobiology, 50.414–22. —. 2002. A syntactic rule in forest monkey communication [Feb]. Animal Behaviour, 63.293–99.
4 The Emergence of Grammar from Perspective
Brian MacWhinney
Carnegie Mellon University
Successful communication rests not just on shared knowledge and reference (Clark and Marshall, 1981), but also on a process of mutual perspective taking. By giving clear cues to our listeners about which perspectives they should assume and how they should move from one perspective to the next, we maximize the extent to which they can share our perceptions and ideas. When language is rich in cues for perspective taking and perspective shifting, it awakens the imagination of the listener and leads to successful sharing of ideas, impressions, attitudes, and narratives. When the process of perspective sharing is disrupted by interruptions, monotony, excessive complexity, or lack of shared knowledge, communication can break down.
Although we understand intuitively that perspective taking is central to communication, few psycholinguistic or cognitive models assign it more than a peripheral role. Linguistic theory typically views perspective as a secondary pragmatic filter (Kuno, 1986) that operates only after hard linguistic constraints have been fulfilled. This paper explores the hypothesis that, far from being peripheral or secondary, perspective taking is at the very core of language structure and higher-level cognition. This approach, which I will call the perspective hypothesis, makes the following basic claims:
1. Perspective taking operates online using images created in five systems: direct experience, space/time deixis, plans, social roles, and belief. In this paper, we will explore the first three of these systems, leaving a discussion of the last two to further work.
2. Language uses perspective taking to bind together these five imagery subsystems.
3. Grammar arose as a social convenience to support accurate tracking and switching of perspective.
4. Language comprehension and production use both depictive and enactive imagery. Depictive imagery relies on the ventral image processing system, whereas enactive imagery relies on the dorsal system for perception-action linkages. Perspective taking depends primarily on processing in the dorsal stream.
5. On the level of direct experience, perspective shifting depends on imagery grounded directly on body maps.
6. On the level of deixis in space and time, perspective shifting depends on the projection of the body image across egocentric, allocentric, and geocentric frames.
7. On the level of plans, perspective shifting assigns roles to referents through the transitivity system. Premotor working memory areas and inferior frontal action planning areas provide the processing capacity to control perspective shifts in action chains.
8. By tracing perspective shifts in language, children are able to learn the cognitive pathways and mental models sanctioned by their culture.
9. The emergence of language as a species-specific human skill depends on a series of gradual evolutionary adaptations (MacWhinney, 2003) that supported perspective taking in these subsystems, as well as additional adaptations for vocal control.
The perspective hypothesis relies heavily on a series of recent advances in cognitive psychology, cognitive neuroscience, and
cognitive linguistics. In particular, it builds on the following theoretical positions and empirical advances:
1. As Miller and Johnson-Laird (1976) and Fauconnier (1994) have shown, language allows us to construct and describe mental models and mental spaces.
2. As Schank and Abelson (1977) and Rumelhart (1975) have shown, we use mental models to elaborate schemata, frames, and stories in which people have specified social roles.
3. As Zwaan and Radvansky (1998) and Glenberg (1997) have demonstrated, discourse comprehension produces an embodied situational model that instantiates mental spaces and social frames.
4. As Lakoff (1987) and Lakoff and Johnson (1980) have shown, language uses metaphor and extension to represent the body in the mind. In the terms of Feldman et al. (1996), we can say that language produces a cognitive simulation of reality.
5. As Barsalou (1999) and Langacker (1987) have demonstrated, cognition manipulates a system of perceptually grounded symbols. These symbols derive their expressive power from a retrievable (Ballard, Hayhoe, Pook, and Rao, 1997) mapping to direct experience.
6. As Harnad (1990) has argued, the grounding of cognition in a body can solve the symbol-grounding problem (Searle, 1980).
7. As Talmy (2000) has shown, clausal packaging, conflation, and structuring express the ways in which we map our human understanding of force and causation onto the physical and social world.
8. As Holloway (1995), Deacon (1997), Donald (1991), and Dunbar (2000) have suggested, language and cognition have co-evolved across the full six million years of human evolution.
9. As Damasio (1999), Donald (1998), Shallice and Burgess
(1996), MacNeilage (1998) and others have argued, the most recent evolutionary changes have allowed language to link all aspects of cognition through functional neural circuitry.
10. As demonstrated in neuroimaging work (Jeannerod, 1997; Kosslyn, Thompson, Kim, and Alpert, 1995; Osman, Albert, and Heit, 1999; Pulvermüller, 1999), the construction of mental images relies on the same neural pathways that produce direct action and perception.
11. As Vygotsky (1962) and Tomasello (1999) have noted, language facilitates socialization of the child in accord with culturally specific frameworks for cognition.
Each of these positions is supported by a wide range of linguistic, psychological, and biological evidence. Together, these views have yielded a rich picture of the ways in which embodied perceptual symbol systems and situation models support language and cognition. By way of shorthand, I will refer to this emergent consensus as the theory of embodied cognition.
Although the theory of embodied cognition has led to important advances in our understanding of the relation between language and cognition, it has not yet provided an account of real-time processes in language comprehension and production. Without such an account, it will be difficult to analyze the grammatical systems of human languages from the viewpoint of embodied cognition. In this chapter, I argue that bridging this gap requires us to extend current situational model theory to deal with the construct of perspective. The articulation of a theory of perspective is not a minor afterthought in the formulation of the theory of embodied cognition. It forces a fundamental rethinking of the dynamics of mental models, the nature of sentence processing, the functional grounding of grammatical structures, the shape of language acquisition, and the co-evolution of language and cognition.
This rethinking is fundamental because perspective serves as a common thread that links together the four semi-modular cognitive systems governing direct experience, space-time deixis, plans, and social roles. Because perspective interacts with imagery on each of these four levels, it
provides a general rubric for knitting together all of cognition. By codifying ways of making these links for perspective taking, language provides us with smooth, controllable access to all of the objects of imagery and cognition.
Because perspective operates at the level of the sentence and not the word, it has little impact on processing or development on the auditory, articulatory, and lexical levels. Perspective does not provide a new way of understanding lexical processing mechanisms such as spreading activation, inhibition, and interference. On the contrary, the operation of perspective is itself an outgrowth of basic learning and processing mechanisms such as induction, self-organization, imagery, and generalization. It therefore makes little sense to propose a theory of embodied cognition grounded on perspective as a replacement for standard cognitive psychology. Instead, perspective can be viewed as an elaboration of better-understood, basic cognitive mechanisms.
1. Empirical Demonstrations
There is now a voluminous experimental literature documenting the impact of embodied situation models on discourse processing. Glenberg (1997), Zwaan and Radvansky (1998), and Zwaan, Kaup, Stanfield, and Madden (in press) have reviewed this work in detail. The findings of this work are extremely consistent. When we listen to sentences, even in laboratory experimental contexts, we actively generate images of the situation models described by these sentences. One method for demonstrating this effect involves giving subjects probes that are either consistent with these mental models or not. When the probes are consistent, subjects respond quickly; when they are not, responses are slower. For example, when we think about “aiming a dart” we imagine pinching together our fingers (Klatzky et al., 1989). To take a more complex example (Bransford, Barclay, and Franks, 1972), when subjects read (1), as opposed to (2), they are likely to judge that they had heard sentence (3), although it was never presented.
1. Three turtles rested on a floating log, and a fish swam beneath them.
2. Three turtles rested beside a floating log, and a fish swam beneath it.
3. Three turtles rested on a floating log, and a fish swam beneath it.
A second method involves giving subjects passages that produce coherent situation models, such as (1), and ones that do not, such as (2).
1. While measuring the wall, Fred laid the sheet of wallpaper on the table. Then he put his mug of coffee on the wallpaper.
2. After measuring the wall, Fred pasted the wallpaper on the wall. Then he put his mug of coffee on the wallpaper.
The prediction here is simply that (1) is easier to read and recall than (2). This method can also be used to show that situation models function online. For example, Hess, Foss, and Carroll (1995) showed that reading of the final word in a sentence was faster if it matched up with the situation model generated by the sentence. Other methods involve showing that graphs that are consistent with situation models facilitate processing (Glenberg and Langston, 1992), checking for the updating of situation models using new information (Ehrlich and Johnson-Laird, 1982), and checking at various points for the availability of a protagonist (Carreiras, Carriedo, Alonso, and Fernández, 1997). This work provides clear experimental evidence that we do indeed construct situation models as we process discourse and that these constructions involve the assumption of perspectives in terms of direct experience, spatial position, temporal location, causal action, and social roles.
2. Depictive and Enactive Modes
The perspective hypothesis holds that we can construct mental models in either depictive or enactive modes. When we construct images depictively, they appear as images on a visual screen and we
watch them as a spectator. Depiction relies primarily on processing in the ventral visual processing stream (Ungerleider and Haxby, 1994) that runs from the primary visual areas in the occipital lobe through the object recognition areas of the temporal lobe. Processing in the depictive mode allows only a minimal amount of perspective taking, perhaps just enough to focus attention on a figure over a ground, but not enough to become involved with the actions of that figure.
When we construct mental images in the enactive mode, they involve us not just as spectators, but also as participants. Processing in the enactive mode involves perspective taking, since we can adopt the enactive viewpoint of specific objects or participants. Processing in this mode relies on the dorsal visual stream (Goodale, 1993) that runs through the parietal lobe and eventually projects to supplementary eye field areas in the premotor cortex. This dorsal stream processes images and models in terms of links between perception and action. The ventral stream is older (Holloway, 1995) and processing in the depictive mode is relatively easier and more automatic. As Landau and Jackendoff (1993) have noted, image processing in the ventral stream provides more detail than spatial processing in the dorsal stream. Although processing in the dorsal stream is less precise and slower, it allows us to link perception to action in a way that will support perspective taking.
As an example of how processing differs in these two modes, consider this sentence: “The skateboarder vaulted over the railing.” In the depictive mode, we see the skateboarder vaulting over the railing, as if he were a figure in a video. However, if we process this sentence enactively, we take the perspective of “the skateboarder” and imagine the process of crouching down onto the skateboard, snapping up the tail, and jumping into the air, as both rider and skateboard fly through the air over a railing and land together on the other side.
Identifying with the skateboarder as the agent, we can evaluate the specific bodily actions involved in crouching, balancing, and jumping. Enactive processing allows us to construct a fuller and deeper (Craik and Lockhart, 1972) elaboration of the mental model.
Consider another example of the enactive-depictive contrast in the sentence, “the cat licked herself.” In the depictive mode, we see a movie of the cat raising her paw to her mouth and licking the fur with her tongue. In the enactive mode, we take the stance of the cat. We refer her paw to our hand and her tongue to our tongue. Most people would say that they are unlikely to employ the enactive mode in this case, as long as the sentence is presented by itself outside of context. However, if we embed the sentence in a larger discourse, we are more inclined to process enactively. Consider this passage:
The cat spotted a mockingbird perched on the feeder. She crouched down low in the grass, inching closer and closer with all her muscles tensed. Just as she pounced, the bird escaped. Disappointed, she leapt up to a garden chair, raised her paw to her tongue, and began licking it.
Here, each clause links to the previous one through the perspective of the cat as the protagonist. As we chain these references together, we induce the listener to assume a single enactive perspective. The longer and more vivid our descriptions, the more they stimulate enactive processes in comprehension.
Depictive and enactive processes run in parallel. As we go through the work of constructing a basic depictive mental model, we elaborate it enactively, as much as time and energy permit. Because our enactive interpretations may fail to reach completion, it is often the case that they are not fully available to our consciousness. When we look at sentence production, as opposed to sentence comprehension, the situation is different. In production, we often have direct access to memories that are encoded in the enactive mode. Unless we are describing events as we see them, we are usually retrieving events and referents from memory. According to the perspective-taking hypothesis, these events are most likely to be encoded enactively.
Sometimes we have failed to construct stored mental models in a fully coherent enactive framework. For example, when telling a joke, we might forget the punch line. This type of
failure indicates that, even within our own embodied mental models, enactive processing can be incomplete.
Perspective taking operates on five component subsystems. These subsystems process information in terms of (1) direct experience, (2) deictic spatio-temporal reference frames, (3) plans, (4) social roles, and (5) belief. Each of these subsystems can function rapidly and accurately without perspective taking. However, without perspective taking, their output is highly stimulus-bound (Hermer-Vazquez, Moffet, and Munkholm, 2001), depictive, limited, and modular. Our primate relatives display some basic abilities to perform perspective taking on each of these levels (MacWhinney, 2003). Even without language, perspective taking partially liberates primate cognition from a complete dependence on stimulus input and permits the construction of fragmentary and limited mental models. However, with language, we can use perspective taking across these five systems to build up a single, unified embodied situation model.
3. Direct Experience
Our basic mode of interaction with objects is through direct experience. Direct perception arises immediately as we interact with objects. We use vision, touch, smell, taste, kinesthesia, and proprioception to estimate the affordances (Gibson, 1977) that objects provide for action. As we use our arms, legs, and bodies to act upon objects, we derive direct feedback from these objects. This feedback loop between action and perception does not rely on symbols, perspective, or any other form of cognitive distancing. Instead, it is designed to give us immediate contact with the world in a way that leads to full embodiment and quick adaptive reactions. Because this system does not rely on memory, imagery, perspective, or other cognitive systems (Gibson, 1977), it remains fully grounded on the direct relation between the organism and the environment.
Consider the ways in which we perceive a banana. When we see a banana, we receive nothing more than an image of a yellow curved
object. However, as we interact directly with the banana, additional perceptions start to unfold. When we grab a banana, our hands experience the texture of the peel, the ridges along the peel, the smooth extensions between the ridges, and the rougher edges where the banana connects with other bananas into a bunch. When we hold or throw a banana, we appreciate its weight and balance. When we peel a banana, we encounter still further sensations involving the action of peeling, as well as the peel itself. With the peel removed, we can access new sensations from the meat of the banana. An overripe banana can assault us with its pungent smell. When we eat a banana, our whole body becomes involved in chewing, swallowing, and digestion. All of these direct interactions in vision, smell, taste, touch, skeletal postures, kinesthesia, proprioception, and locomotor feedback arise from a single object that we categorize as a “banana.” It is this rich and diverse set of sensations and motor plans that constitutes the fullest grounding for our understanding of the word “banana.” Of course, we know other things about bananas. We know that they are rich in potassium and Vitamin E, that they are grown in Central America by United Fruit cooperatives, and so on, but these are secondary, declarative facts (Paivio, 1971; Tabachneck-Schijf, Leonardo, and Simon, 1997) that rely on the primary notion of a banana that we derive from direct embodied perception.
3.1 Direct imagery
The development of mental imagery produces cognitive ungrounding or distancing. When we imagine a banana, we call up images of the shape, taste, and feel of a banana even when it is not physically present. This imagery does not depend exclusively on language. We might be hungry and think of a banana as a possible food source, or we might detect a smell that would lead us to construct a visual image of a banana. Recent research in neurophysiology has shown that, when we imagine objects and actions in this way, we typically activate the same neuronal
pathways that are used for direct perception and direct action. For example, when we imagine performing bicep curls, there are discharges to the biceps (Jeannerod, 1997). When a trained marksman imagines shooting a gun, the discharges to the muscles mimic those found in real target practice. When we imagine eating, there is an increase in salivation. Neuroimaging studies by Parsons et al. (1995), Martin, Wiggs, Ungerleider, and Haxby (1996), and Cohen et al. (1996) have shown that, when subjects are asked to engage in mental imagery, they use modality-specific sensorimotor cortical systems. For example, in the study by Martin et al., the naming of tool words specifically activated the areas of the left premotor cortex that control hand movements. Experimental work has shown repeatedly that switching between these sensorimotor systems during imagery tasks exacts a clear cost in reaction time (Klatzky, Pellegrino, McCloskey, and Doherty, 1989; Solomon and Barsalou, 2001).
Damasio (1999) has outlined the ways in which a distributed functional neural circuit involving mid-brain structures and the basal ganglia helps maintain the body image. Motor cortex (Kakei, Hoffman, and Strick, 1999) maintains as many as twelve separate maps of the human body. Additional body maps are located in the cerebellum (Middleton and Strick, 1998). Some of these maps can encode body orientation, head position, and the direction of eye movements. Others may be more linked to the dynamic actions discussed in the next section. The dynamic linkage between alternative body encodings on separate maps can be maintained by reverberation in functional circuits.
Although imagery relies on the pathways used by direct perception and direct action (Decety and Grèzes, 1999; Decety et al., 1994), it differs from direct experience in four important ways:
1. Temporal lag. Using ERP and EMG measurements, Osman et al. (1999) have shown that the image of a trigger release comes 100 msec later than the actual release.
2. Partial independence. Many patients with visual agnosia display good object recognition, but some damage to mental imagery. This shows that, although imagery
depends on the pathways used by direct perception, it also involves additional central resources that can be separately damaged. However, Behrmann, Moscovitch, and Winocur (1996) report findings from a patient who shows good imagery, but damaged object recognition. Similarly, Caplan and Waters (1995) have shown that patients with motor apraxia can use the phonological loop to remember word strings, even without being able to articulate normally. In order to explain patterns like these, we have to assume that imagery is partially independent of the final pathways for direct perception and action. Rather than being entirely linked to direct experience, imagery constructs internal fictive processes that are potentially separable from direct perception. In this sense, the homunculus that is used to produce imagery simulates reality, but it is not reality.
3. Decomposition. Imagery also differs from direct perception in the ways in which it can access the pieces of stored images. Barsalou (1999) has argued that our perception of an object such as an automobile allows us to enactively decompose the auto into its various pieces, such as the doors, the windows, the windshield, and the parts of the motor. The relation of each part to the others is traced through a top-down reenactment of direct experience, both perceptual and motoric. In this way, the full construct of an automobile relies on the vestiges of direct experience that are stored away as enactive images.
4. Generation. Imagery also differs from direct experience in the fact that the generation of images requires some form of active retrieval from memory. Studies using the verb generation task have pointed to an important role for frontal cortex in supporting strategic aspects of meaning access and generation (Petersen, Fox, Posner, Mintun, and Raichle, 1988; Posner, Petersen, Fox, and Raichle, 1988). In this task, subjects are shown pictures of objects and asked to think of actions they might
perform on these objects. In addition, lesion studies (Gainotti, Silveri, Daniele, and Giustolisi, 1995), PET studies (Posner et al., 1988), and fMRI analyses (Menard, Kosslyn, Thompson, Alpert, and Rauch, 1996) have shown that right frontal areas are involved in the generation or retrieval of action terms. Together, these studies point to an important role for frontal cortex in generating access cues for specific actions and the words that express those actions.
3.2 Partial ungrounding
Imagery works together with memory, planning, dreaming, and projection to allow us to move away from direct experience. Together, these processes allow us to move beyond a direct linkage to objects and actions and to imagine potential actions and their possible results. These processes lead to a partial ungrounding of cognition. However, the decomposable nature of perceptual symbol systems (Barsalou, 1999) allows us to recreate full grounding when needed for fuller comprehension. The fact that cognition can become partially ungrounded through imagery should not be construed as meaning that it is fully ungrounded (Burgess and Lund, 1997).
3.3 Direct experience and language
Words provide convenient methods for mapping direct experiences onto linguistic form. In most cases, words afford little opportunity for perspective switching. The word “banana” packages together all our experiences with this object into a single unanalyzed whole. In some cases, however, our experiences are decomposed, even on the level of the word. For example, in Navajo, a chair is “bikáá’dah’asdáhí” or “on-it-one-sits.” To take a more familiar example, many languages refer to a corkscrew as a “cork puller.” In such examples, objects are being characterized in terms of our action
upon them. Miller and Johnson-Laird (1976) showed that definitions of nouns in terms of criterial attributes were often not as predictive as definitions in terms of imagined affordances. For example, they found that attempts to define a “table” in terms of the number or the placement of its legs or the shape of the top often failed to capture the possible variation in the shape of what counts as a table. It works better to define a table instead as an object that provides a space upon which we can place work. In this way, Miller and Johnson-Laird eventually came to the same conclusion that the Navajo reached when they called a table “bikáá’dání” or “at-it-one-works.”
Languages can also capture aspects of direct experience through the projection of the body image. In English, we speak of the hands of a clock, the teeth of a zipper, and the foot of the mountain. In Apache, this penchant for body part metaphors carries over to describing the parts of an automobile. The tires are the feet of the car, the battery is its heart, and the headlights are its eyes. Such perspectival encodings combine with the direct experiences we discussed earlier in the case of “banana” to flesh out the meanings of words, even before they are placed into syntactic combination. The 18th century philosopher Giovanni Battista Vico understood this when he noted that:
In all languages the greater part of the expressions relating to inanimate things are formed by metaphor from the human body and its parts and from the human senses and passions.... for when man understands he extends his mind and takes in the things, but when he does not understand he makes the things out of himself and becomes them by transforming himself into them. (New Science, section 405)
Plato attributes an even earlier statement of this type to the first philosopher, Protagoras, who declared, “Man is the measure of all things.”
Adjectives encode images of direct perceptions for attributes such as weight, color, or smell.
Verbs encode images of direct
action, often in relation to movements of the body. When we hear the word “walk,” we immediately activate the basic elements of the physical components of walking (Narayanan, 1997). These include alternating motions of the legs, counterbalanced swinging of the arms, pressures on the knees and other joints, and the sense of our weight coming down on the earth. Because we have good access to the components of motor plans, these images can be decomposed. However, in practice they often function as unanalyzed images of integrated plans. More generally, although individual words can construct meaning by reference to the body and direct experience, they do not in themselves allow for shifts in perspective. For that, we need to move to the higher levels of the phrase and the clause.
4. Space and time
Perspective taking requires a different set of cognitive mechanisms at each level. For direct experience, perspective taking involves the projection of the body image onto the body and motions of other agents. For space, perspective taking involves the projection of a deictic center and map onto the position of another agent. Deictic centers can be constructed in three frameworks: egocentric, allocentric, and geocentric.
4.1 The egocentric frame
Egocentric deixis directly encodes the perspective of the speaker. The spatial position of the speaker becomes the deictic center or “here.” Locations away from this deictic center are “there.” In face-to-face conversation, a single deictic center can include both speaker and listener. In this case, “here” can refer to the general position of the speaker and listener, and “there” can refer to a position away from the speaker and listener. Other terms that are grounded in the self’s position and perspective include “forward”, “backward”, “up”, “down”, “left”, and “right”.
To map our local environment around a deictic center, we create a series of deictic codes to mark the locations of objects with respect to previous body postures and eye fixations (Ballard et al., 1997). By accessing these stored deictic codes, we avoid the many computations that would be involved in having to worry repeatedly about the locations of all of the possible objects in the world around us. Ballard argues that the brain does this by establishing an internal deictic code for each object in working memory. These codes are stored with reference to our images of our body and eye positions and movements. The establishment of deictic codes depends on a set of mechanisms for the neuronal encoding of eye movements, body image, and body maps. In an early study on this topic, Bossom (1965) gave monkeys special eyeglasses that inverted the visual field. After moving about with these eyeglasses for some days, the monkeys became readapted to the upside down view these glasses provided. When Bossom then lesioned the monkeys at various cortical locations, he found that only lesions to the supplementary eye fields resulted in damage to the readapted visual field. This finding suggests that these frontal structures support the construction of a dynamic and adaptable visual field. Using single-cell recording techniques with macaque monkeys, Olson and Gettner (1995) located cells in the supplementary eye field of prefrontal cortex that respond not to positions in the actual visual field, but to positions on objects in visual memory. These results suggest that the prefrontal visual area works together with parietal areas to facilitate the processing of spatial representations. Connections between posterior and frontal areas (Goldman-Rakic, 1987) provide a method for the temporary storage of deictic codes in premotor working memory areas and the accessing of previous attentional foci. 
Permanent traces may be stored by offline hippocampal processing and cortical downloading (McClelland, McNaughton, and O’Reilly, 1995; Redish and Touretzky, 1997). The fact that primates have a short-term memory capacity equal to that of humans (Levine and Prueitt, 1989) suggests that the basic deictic memory system for egocentric perspective can
operate smoothly without additional reliance on verbal memory systems such as the phonological loop (Baddeley, 1990; Gathercole and Baddeley, 1993).
4.2 The allocentric frame The second spatial frame is the allocentric frame, sometimes called the object-centered or intrinsic frame. This frame is constructed by projecting the deictic center onto an external object. To do this, the speaker assumes the perspective of another object and then judges locations from the viewpoint of that object. The basic activity is still deictic, but it is extended through perspective taking. For example, “in front of the house” defines a position relative to a house. In order to determine exactly where the front of the house is located, we need to assume the perspective of the house. We can do this by placing ourselves into the front door of the house where we would face people coming to the front door to “interact” with the house. Once its facing is determined, the house functions like a secondary human perspective, and we can use spatial terms that are designed specifically to work with the allocentric frame such as “under”, “behind”, or “next to”. If we use these terms to locate positions with respect to our own bodies as in “behind me” or “next to me,” we are treating our bodies as the centers of an allocentric frame. In both egocentric and allocentric frames, positions are understood relative to a figural perspective that is oriented like the upright human body (Bryant, Tversky, and Franklin, 1992; H. H. Clark, 1973). Shifts in spatial perspective can lead to strange alternations of the perspectival field. For example, if we are lying down on our backs in a hospital bed, we might refer to the area beyond our feet as “in front of me,” even though the area beyond the feet is usually referred to as “under me.” To do this, we may even imagine raising our head a bit to correct the reference field, so that at least our head is still upright. We may also override the normal shape of the allocentric field by our own egocentric perspective. For example,
when having a party in the back yard of a house, we may refer to the area on the other side of the house as “in back of the house,” thereby overriding the usual reference to this area as “the front of the house.” In this case, we are maintaining our current egocentric position and perspective as basic and locating the external object within that egocentric perspective. Prepositions often reflect the perspectival nature of allocentric reference. For example, the preposition “in back of” is based on taking the point of view of an object and locating what would correspond to its back, if it were viewed as having a body. Body parts such as the face, the stomach, the buttocks, the feet, and the head all serve as the grounding for prepositions in many languages. In fact, Heine (1993) found in a survey of African languages that over three-quarters of the relational terms derive from body parts. Historically, parts are first projected to regions of inanimate objects, such as the “back” of a car. Next, they come to refer to regions in contact with these parts, such as “back of the car.” Finally, they come to refer to areas detached from the objects, as in “in back of the car.” The projection from the body can also support the development of words for “to” or “in front” based on the human eye, since the eye glances toward things. Even abstract case-marking systems can be shown to derive historically from simple deictic markers such as “to” and “from” (Anderson, 1971). The computation of allocentric reference required an evolutionary adaptation to basic primate spatial processing. We know that the parietal cortex in primates maintains separate maps for body-referenced and world-referenced positions (Snyder, Grieve, Brotchie, and Andersen, 1998). Body-referenced positions are adequate for egocentric spatial representations. However, world-referenced positions must be elaborated by perspective-taking to form allocentric representations. 
This could be achieved by linking frontal mechanisms to these parietal mechanisms. In addition, hippocampal mechanisms are used in spatial computations (McClelland et al., 1995). These mechanisms would not need to be modified, since they would simply store codes from a revised deictic center. It is possible that the expansion of parietal cortex and
processing in the dorsal stream that occurred about 4MYA (Holloway, 1995) could have provided our hominid ancestors with the ability to construct fully shiftable allocentric deictic centers.
4.3 The geocentric frame The third deictic reference system, the geocentric frame, enforces a perspective based on fixed external landmarks, such as the position of a mountain range, the sun, the North Star, the North Pole, or a river. These landmarks must dominate a large part of the relevant spatial world, since they are used as the basis for a full-blown Cartesian coordinate system. The Guugu Yimithirr language in northeast Queensland (Haviland, 1996) makes extensive use of this form of spatial reference. In Guugu Yimithirr, rather than asking someone to “move back from the table,” one might say, “move a bit to the mountain.” We can use this type of geocentric reference in English too when we locate objects in terms of compass points. However, our uncertainty about whether our listener shares our judgments about which way is “west” in a given microenvironment makes use of this system far less common. On the other hand, we often make use of Cartesian grids centered on specific local landmarks in English. For example, we can describe a position as being “fifty yards behind the school.” In this case, we are adopting an initial perspective that is determined either by our own location (e.g., facing the school) or by the allocentric perspective of the school for which the entry door is the front. If we are facing the school, these two reference frames pick out the same location. When we describe the position as being located “fifty yards toward the mountain from the school,” we are taking the perspective of the mountain, rather than that of the speaker or the school. We then construct a temporary Cartesian grid based on the mountain and perform allocentric projection to the school. Then we compute a distance of 50 yards from the school in the direction of the mountain. As we have already noted, language uses a variety of
closed-class forms to express basic spatial relations. In English, much of this work is done through prepositions, pronouns, and tense markers. In other languages, there may be a greater reliance on expressions of topological relations, contact, shape, and enclosure. However, all languages provide a rich set of expressions for egocentric and allocentric construction of space and time. These devices can be chained together in expressions, such as “in the pond under the log across the stream.” Processing of these chains of spatial expressions requires the same perspective shifting mechanisms needed to process plans, as we will see in the next section.
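The three ways of locating a point relative to the school that were described earlier in this section can be sketched numerically. This is only an illustrative sketch, not a model proposed in the text: the flat two-dimensional map, the coordinates in yards, and the function names `beyond_from_viewer`, `behind_object`, and `toward_landmark` are all assumptions made for the example.

```python
import math

def unit(dx, dy):
    """Return the unit vector pointing in direction (dx, dy)."""
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("direction is undefined for a zero-length vector")
    return dx / length, dy / length

def beyond_from_viewer(anchor, viewer, distance):
    """Egocentric 'behind': continue along the viewer's line of sight past the anchor."""
    ux, uy = unit(anchor[0] - viewer[0], anchor[1] - viewer[1])
    return (anchor[0] + distance * ux, anchor[1] + distance * uy)

def behind_object(anchor, facing, distance):
    """Allocentric 'behind': move opposite to the direction the object itself faces."""
    ux, uy = unit(facing[0], facing[1])
    return (anchor[0] - distance * ux, anchor[1] - distance * uy)

def toward_landmark(anchor, landmark, distance):
    """Geocentric: move from the anchor in the direction of a fixed landmark."""
    ux, uy = unit(landmark[0] - anchor[0], landmark[1] - anchor[1])
    return (anchor[0] + distance * ux, anchor[1] + distance * uy)

# Hypothetical layout (yards): the school's entry door faces south,
# the speaker stands south of the school, the mountain lies to the north.
school = (0.0, 0.0)
school_facing = (0.0, -1.0)
speaker = (0.0, -100.0)
mountain = (0.0, 1000.0)

ego = beyond_from_viewer(school, speaker, 50)
allo = behind_object(school, school_facing, 50)
geo = toward_landmark(school, mountain, 50)
```

In this particular layout all three computations pick out the same point, which matches the observation that the egocentric and allocentric frames agree when the speaker faces the front of the school. Moving the speaker or the mountain makes the three results diverge.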
4.4 Temporal perspective In many ways, we conceive of time as analogous to space. Like space, time has an extent through which we track events and objects in terms of their relation to particular reference moments. Just as spatial objects have positions and extents, events have locations in time and durations. Time can also be organized egocentrically, allocentrically, or globally. When we use the egocentric frame, we relate events to our current speaking time (ST) (Vendler, 1957). Just as there is an ego-centered “here” in space, there is an ego-centered “now” in time. Just as we can project a deictic center onto another object spatially, we can also project a temporal center onto another time in the past or future. In this case, the central referent is not speaking time, but another reference time (RT). We can track the position of events in relation to either ST or RT or both using linguistic markings for tense. We can also encode various other properties of events such as completion, repetition, duration, and so on, using aspectual markers. When we come to depicting the duration of events, we can view them either as having an extent in a single dimension (“a long time”) or a relative size (“mucho tiempo”). Just as we tend to view events as occurring in front of us, rather than behind us, we also tend to view time as moving forwards from
past to future. As a result, it is easier to process sentences like (1) with an iconic temporal order than ones like (2) with a reversed order. However, sentences like (3), which require no foreshadowing of an upcoming event, are the most natural of all.
1. After we ate our dinner, we went to the movie.
2. Before we went to the movie, we ate our dinner.
3. We ate our dinner and then we went to the movie.
Temporal reference in narrative assumes a strict iconic relation between the flow of the discourse and the flow of time. Processing of sequences that violate temporal iconicity by placing the consequent before the antecedent is relatively more difficult. However, in reality, it is difficult to describe events in a fully linear fashion and we need to mark flashbacks and other diversions through tense, aspect, and temporal adverbials.
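The notion of temporal iconicity used above can be stated as a simple check: a sentence is iconic when the order in which its clauses are mentioned matches the order in which the events occurred. The following is a minimal sketch under that reading; the function name `is_iconic` and the hand-assigned event ranks are assumptions for illustration only.

```python
def is_iconic(mentions):
    """Return True if clauses are mentioned in the order their events occurred.

    `mentions` is a list of (clause, event_rank) pairs in order of mention,
    where a lower event_rank means the event happened earlier.
    """
    ranks = [rank for _, rank in mentions]
    return ranks == sorted(ranks)

# "After we ate our dinner, we went to the movie."
after_sentence = [("we ate our dinner", 1), ("we went to the movie", 2)]

# "Before we went to the movie, we ate our dinner."
before_sentence = [("we went to the movie", 2), ("we ate our dinner", 1)]

print(is_iconic(after_sentence))   # True
print(is_iconic(before_sentence))  # False
```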
5. Plans Primates can make sophisticated use of tools for single operations on objects, but they cannot form lengthier plans that combine these actions (Byrne, 1999). Donald (1998) has argued that the ability to formulate and execute plans was the centerpiece of the mimetic revolution that accompanied the geographical expansion of Homo erectus after 2MYA. During this period, brain mass doubled in allometric terms. Much of this expansion benefited prefrontal areas that support attention, memory, and plan organization. The expansion also benefited frontal areas that control vocal processes and temporal areas for auditory and lexical memory. Plans require not only perspective taking, but also perspective shifting. Shifts involve new combinations of actions and objects. For example, a plan for making an arrow will involve climbing a hill to find a suitable stone, returning to a work area, locating a chipping stone, chipping the point, plucking a branch from a tree, shaping the branch into a straight stick, slicing sinew, and tying the point. Although the self remains the protagonist throughout this plan,
there are continual shifts through direct experience, space, and causal action on objects. Representing perspective shifts requires a method for representing and accessing competing plans, resolving the competition, and developing optimal sequences of the components (Sacerdoti, 1977). It appears that the frontal lobes are uniquely adapted to construct plans in a manner that facilitates perspective tracking and switching. Dorsolateral prefrontal cortex plays a fundamental role in the storing of alternative representations in working memory (Barch et al., 1997; Braver et al., 1997; D. D. Cohen et al., 1997; Goldman-Rakic, 1987; Owen, Downes, Sahakian, Polkey, and Robbins, 1990). The ability to shift between perspectives requires a neural system for representing alternative perspectives, as well as a method for inhibiting one or more of the competing perspectives. For example, in the Stroop task (J. Cohen, Dunbar, and McClelland, 1990), the reader must inhibit the perception of the color of the word in order to quickly read the name of the word. In processing SO relative clauses, we need to move quickly from the viewpoint of the subject of the main clause to the viewpoint of the subject of the subordinate clause. In processing social relations, we need to quickly assess the viewpoints of other people, particularly as they conflict with our own views. Right frontal cortex supports memory for events that occur in discourse. The ability to store the traces of recent events in working memory is crucial for the construction of connected discourse. If we could not recall previously mentioned characters and actions, we would be unable to follow even the most basic descriptions and narratives. The complex interconnectivity between frontal, thalamic, and cingulate areas (Fuster, 1989; Kolb and Whishaw, 1995) suggests that the frontal system integrates a variety of mental facilities, all in the service of perspective taking and shifting. 
Mesulam (1990) asks, “Why does (prefrontal) area PG project to so many different patches of prefrontal cortex? Why are the various areas of prefrontal cortex interconnected in such intricate patterns?” The perspective hypothesis claims that frontal cortex is attempting to integrate perspective taking and shifting in verbally represented plans.
5.1 Dissecting events To formulate plans (Sacerdoti, 1977), we must have a way of representing the individual components of events. The ability to dissect events into their components involves a form of representation that is only incompletely attained in primates. Chimpanzees have no problems representing and naming individual objects or simple actions (Savage-Rumbaugh and Taglialatela, 2001). However, they are not able to combine these representations into fuller predicates (Terrace, Petitto, Sanders, and Bever, 1980). Some (Greenfield, 1991) have suggested that this failure arises from an inability to combine elements. However, others (Donald, 1998; Tomasello, 1999) see the deficit as involving an inability to segment events into their components. According to this view, events such as “shaking salt” are initially encoded as a single merged experience in which the shaking, the saltshaker, and the salt all form a single perceptual-action Gestalt. Similarly, in the act of cutting wood, there is no fundamental gap separating the wood, the axe, the chopping, the lifting, and the self as agent. In order to segment reality into separate events, language and cognition provide us with a system that orders nouns into role slots constellated around verbs. We use verbs to segment the flow of reality into bite-size actions and events. Then we flesh out the nature of the events by linking actors and objects to the verbs, as fillers of role slots. Item-based grammars (Hausser, 1999; Hudson, 1984; Kay and Fillmore, 1999; MacWhinney, 1988) derive syntactic structure from the ways in which individual words or groups of words combine with others. For example, the verb “fall” can combine with the perspective of “glass” to produce “the glass fell.” In this combination, we say that “fall” has an open slot or valency for the role of the perspective and that the nominal phrase “the glass” is able to fill that slot and thereby play the role of the perspective. 
In item-based grammars, this basic mechanism is used to produce the full range of human language. The specific phrasal structures of various languages emerge as a response to the process of combining words into appropriate role slots as we listen to sentences in real time (Hawkins, 1999).
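The slot-filling mechanism described above for “the glass fell” can be sketched as a small data structure in which each verb lists its open role slots and a clause is built only when every slot is filled. This is a toy illustration rather than any of the cited formalisms: the `LEXICON` entries, the role name `object`, and the function `combine` are assumptions introduced for the example.

```python
# Each lexical item names the role slots (valencies) it opens.
# "perspective" follows the text; "object" is a hypothetical second role.
LEXICON = {
    "fell": ("perspective",),
    "licked": ("perspective", "object"),
}

def combine(verb, fillers):
    """Fill the open role slots of `verb` with nominal phrases.

    Raises ValueError if a required slot is left open or if a filler
    is offered for a slot the verb does not have.
    """
    slots = LEXICON[verb]
    missing = [slot for slot in slots if slot not in fillers]
    extra = [slot for slot in fillers if slot not in slots]
    if missing or extra:
        raise ValueError(f"missing slots: {missing}, unexpected slots: {extra}")
    clause = {"predicate": verb}
    for slot in slots:
        clause[slot] = fillers[slot]
    return clause

glass_fell = combine("fell", {"perspective": "the glass"})
cat_licked = combine("licked", {"perspective": "the cat", "object": "herself"})
```

Keeping the slots on the individual verb, rather than in general phrase-structure rules, is what makes the sketch item-based: adding a new verb means adding one lexicon entry.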
5.2 Competing perspectives Much of the variation we find between languages involves alternative methods for marking perspective in syntactic combinations. These variations arise because of competition between alternative participants for case marking, agreement marking, and word order positioning (Bates and MacWhinney, 1989). When we describe an event such as “the farmer grew the corn,” there are two competing perspectives. The perspective of the corn is directly involved with the growing. If we wish to understand the changes that occur in the corn, we would have to assume this perspective. On the other hand, the perspective of the farmer is also relevant, since he cares for the corn in ways that make it grow. When dissecting events that have more than one participant, languages typically make a default commitment to one of these two perspectives. Accusative-nominative languages, like English, place focus on the actor by treating it as the default perspective for the clause. They then treat the activity of the patient as a secondary perspective contained within the scope of the larger perspective of the subject. In ergative-absolutive languages, like Basque or Djirbal, the default primary focus is typically on the participant undergoing the change, rather than on the participant causing the change. In the sentence “The farmer grew the corn,” the farmer is placed into the ergative case and the corn is in the absolutive case. The absolutive is also the case that is used for the word “corn” in the intransitive sentence “The corn grew.” This means that ergative languages place default focus on the patient, rather than the agent. They do this in order to focus not on the act of causation, but on the processes of change that occur in the patient. Variations in the marking of ergativity demonstrate three clear effects of perspective taking on event construction. 
These effects arise because we are more likely to assume the perspective of the agent when the action is immediately present and when the agent is closer to ego.
1. Tense. Gujarati (DeLancey, 1981) uses ergative-absolutive marking in the perfective, but not in the imperfective tense. Because of the ongoing nature of the imperfective (“was buying”), we tend to become involved in the action and therefore assume the perspective of the causer. Because of the completive nature of the perfective (“bought”), we are less involved in the action and more willing to treat the agent as the secondary or ergative perspective.
2. Person. In languages like Kham (DeLancey, 1981), the choice of ergative or nominative marking can also depend on the person of the agent. When the actor is in third person, many languages use ergative marking. However, when the actor is in first or second person, these same languages often shift to using accusative marking. This split reflects the fact that we are more deeply involved with the first and second person perspectives, for which we can more directly infer causality. For third person actors, we are often on safer ground to defocus their causal activities and focus instead on the perspective of the patient.
3. Intentionality. Ergative marking can also be used to mark intentionality. DeLancey (1981) describes this for the Caucasian language Batsbi, which uses ergative case for the subject of a verb like “fall” when the falling is intentional and absolutive marking for the subject when the falling is unintentional. Sentences with the absolutive could be read like “falling happened to me.”
Other factors that can lead to splits in ergative marking include inferential markers and certain discourse structures.
5.3 Constructions marking perspective The choice of absolutive or accusative marking is only one of many linguistic choices influenced by perspective. Other constructions (Kay and Fillmore, 1999), structures (Chomsky, 1981), or options (Halliday and Hasan, 1976) shaped by perspective include:
1. Passives. The choice of a passive over an active can be induced by the fact that the perspective is indefinite, unknown, or unidentifiable. Often we select the passive when we wish to avoid attributing responsibility (Seliger, 1989).
2. Double object. The decision to say either “I sent John the book” or “I sent the book to John” reflects the extent to which we wish to focus on the secondary perspective of the recipient (Zubin, 1979).
3. Inverse. Languages may require that nouns be placed in an order of relative animacy. When this happens, the verb can mark perspective inversion (Whistler, 1985).
4. Obviative. The marking of possession often involves expressions for equation, description, existence, and location (Heine, 1997). In the obviative, possession can be shifted to a non-perspectival owner, as in split ergative person marking.
5. Fictive agency. In “the library boasts three major collections,” we are taking the perspective of an inanimate object and treating it fictively (Talmy, 1988) as an agent. Other examples include “the path runs down to the river” and “the screws hold the legs onto the table.”
6. Conflation. We can join together “the car knocked the bicycle” and “the bicycle fell over” into the single clause “the car knocked over the bicycle.” When we do this, we subordinate the perspective of the bicycle in the second clause to the overall controlling perspective of the car.
7. Comparison. The directionality of comparison is governed by the projection of features from a perspective. Saying that “Bill speaks like my chimpanzee” is much different from saying that “My chimpanzee speaks like Bill.”
8. Complementation and control. The unmarked complement structure is one which maintains the perspective of the main clause, as in “Bill wanted to go” where “Bill” is the subject of the main verb “want” and the complement verb “go.” In addition, the contrast between “the doll is easy to see” and “the doll is eager to see” reflects alternative perspectival configurations of the verbal adjectives “easy” and “eager.”
9. Relativization. In “the cat the dog chased hissed,” the “cat” shifts from the role of agent to the object and back again to agent. These shifts of perspective must be clearly marked in the grammar.
10. Binding. Pronouns mark both co-reference and perspective. When we say “He said Bill won” we know that “Bill” is not co-referent with “he.” These facts about the grammar are determined by the system for marking perspective shifts.
This is only a partial list. Other syntactic processes influenced by perspective include adverbialization, phrasal attachment, dislocation, clefting, topicalization, possession, ellipsis, coordination, and reflexivization. In fact, it is difficult to find any syntactic process that is not at least partially impacted by perspective marking. Generalizing from this observation, we can say that syntax has two basic functions. The first function is attachment. The syntactic processor uses surface cues to link words together into a relatedness structure (Hausser, 1999; O’Grady, 2002) during online processing. The second function is perspective. The syntactic processor uses these same surface cues to assume a series of perspectives. For example, in the sentence, “The cat licked herself,” the processor links the subject and object into the slots required by the verb “lick.” At the same time the perspectival processor encourages us to assume the role of “the cat” to interpret this process enactively, much as we would interpret fictive agency as in “the screws hold the legs to the table.” In the following sections, we will examine these effects in detail for three selected areas: binding, relativization, and ambiguity marking. 
Given limitations in space, we will focus on these three areas because of their centrality in both linguistic and psycholinguistic work of the last 20 years. However, similar analyses apply to each of the syntactic domains we have mentioned.
5.4 Co-reference and c-command Perspective taking influences key aspects of the grammar of pronominal co-reference. These effects reflect a basic fact about language use, which is that starting points must be fully referential (MacWhinney, 1977). Gernsbacher (1990) has discussed this requirement in terms of the theory of “structure building.” The idea is that listeners attempt to build up a sentence’s interpretation incrementally. To do this, they need to have the starting point fully identified, since it is the basis for the rest of the interpretation. In dozens of psycholinguistic investigations, Gernsbacher has shown that the initial nominal phrase has the predicted “advantage of first mention.” This advantage makes the first noun more memorable and more accessible for further meaningful processing. When the first noun is low in referentiality (Ariel, 1990), the foundation is unclear and the process of comprehension through structure building is thwarted. If the starting point is a full nominal, referentiality is seldom at issue. However, if the starting point is a pronoun, then there must be a procedure for making it referential by finding an antecedent. One way of doing this is to link up the pronoun to an entity mentioned in the previous discourse. In a sequence like (1), it is easy to link up “he” with “John,” since John has already been established as an available discourse referent. However, in (2), the pronoun has no antecedent, and the sequence seems awkward and unlinked.
(1) Johnᵢ was trying to list the Ten Commandments. Heᵢ was unable to get past the first six.
(2) Only a few of the guests arrived on time. He says Bill came early.
The theory of perspective taking attributes these effects to the fact that starting points serve as the basis for the construction of an embodied situation model. The theory of Government and Binding (Chomsky, 1982; Grodzinsky and Reinhart, 1993; Reinhart, 1981) treats this phenomenon in terms of structural relations in a phrase-marker tree.
Principle C of the binding theory holds that a pronoun cannot c-command its referent. An element is said to c-command another element if it stands in a direct chain above it in a phrase tree. As a result, Principle C excludes a co-referential reading for (1), but not for (2).
(1) Heᵢ says Billᵢ came early.
(2) Billᵢ says heᵢ came early.
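The structural relation behind this contrast can be made concrete with a small tree computation. The sketch below uses the common textbook formulation of c-command (A c-commands B if neither dominates the other and the first branching node above A dominates B), which is an assumption that is more specific than the informal wording above. The `Node` class, the function names, and the simplified phrase structure for “He says Bill came early” are likewise hypothetical.

```python
class Node:
    """A phrase-tree node that records its parent when attached to one."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def dominates(a, b):
    """True if node a is a proper ancestor of node b."""
    node = b.parent
    while node is not None:
        if node is a:
            return True
        node = node.parent
    return False

def c_commands(a, b):
    """True if the first branching node above a dominates b,
    and neither a nor b dominates the other."""
    if a is b or dominates(a, b) or dominates(b, a):
        return False
    node = a.parent
    while node is not None and len(node.children) < 2:
        node = node.parent
    return node is not None and dominates(node, b)

# Simplified tree: [S [NP he] [VP [V says] [S [NP Bill] [VP came early]]]]
he = Node("he")
bill = Node("Bill")
embedded = Node("S", [Node("NP", [bill]), Node("VP", [Node("came"), Node("early")])])
tree = Node("S", [Node("NP", [he]), Node("VP", [Node("V", [Node("says")]), embedded])])

print(c_commands(he, bill))   # True: co-reference is excluded under Principle C
print(c_commands(bill, he))   # False
```

Swapping the two nominals, as in “Bill says he came early,” puts the pronoun in the embedded subject position, where it no longer c-commands the name.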
In (1) the pronoun c-commands its referent because it stands in a direct chain of dominance above it in the tree. In (2) the pronoun is down below its referent in the tree and therefore does not c-command “Bill.” The perspective hypothesis attributes the unavailability of the co-referential reading of (1) to the fact that starting points must be referential. Without further cues, the processor cannot wait for a subsequent identifying co-referent and chooses instead to force co-reference with some entity from previous discourse. In (2), on the other hand, “Bill” is available as a referent and therefore “he” can co-refer to “Bill.” This effect is not a simple matter of linear order, since co-reference between a pronoun and a following noun is perfectly good when the pronoun is in an initial subordinate clause. Consider this contrast, where the asterisk on (3) indicates that “he” cannot be co-referential with “Lester.”
(1) When heᵢ drank the vodka, Lesterᵢ started to feel dizzy.
(2) Lesterᵢ started to feel dizzy, when heᵢ drank the vodka.
(3) *Heᵢ started to feel dizzy, when Lesterᵢ drank the vodka.
(4) When Lesterᵢ drank the vodka, heᵢ started to feel dizzy.
Binding theory views the subordinate clause as generated within the VP where it is available for logical interpretation. In (1) and (2), “Lester” c-commands the pronoun and coreference is possible, even after the movement. In (3) and (4), it does not and coreference should be blocked. However, the acceptability of (4) is a problem for this version of binding theory. Reinhart (1983) explains the anomaly by arguing that coreference in (4) is supported by discourse constraints. In a sense, Reinhart’s discourse account is akin to the
discourse account developed within the perspective hypothesis. The perspective hypothesis attributes the acceptability of (1) to the presence of the subordinating conjunction “when” which gives the processor instructions that a subsequent NP can be used for co-reference to “he.” In (3), no such instructions are available. The referentiality requirement also applies in a somewhat weakened form to the direct and indirect objects of verbs. Van Hoek (1997) shows how availability for co-reference is determined by position in the argument chain (Givón, 1976). Although attention is first focused on the subject or trajector, it then moves secondarily to the object or other complements of the verb that are next in the “line of sight” (Langacker, 1995). This gradation of the perspectival effect as we move through the roles of subject, direct object, adjunct, and possessor is illustrated here:
(1) Heᵢ often said that Billᵢ was crazy.
(2) ? John often told himᵢ that Billᵢ was crazy.
(3) ? John often said to himᵢ that Billᵢ was crazy.
(4) John often said to hisᵢ mother that Billᵢ was crazy.
(5) The students who studied with himᵢ enjoyed Johnᵢ.
By the time we reach elements that are no longer in the main clause, as in (5), co-reference back to the main clause is not blocked, since elements in a subordinate clause are not crucial perspectives for the structure building process. This gradient pattern of acceptability for increasingly peripheral clausal participants matches up with the view that the process of perspective taking during structure building requires core participants to be referential. Solan (1983) has shown that even 4-year-olds prefer sentences like (2) to (1). Principle C can account for some of these patterns. For example, the acceptability of (5) above is in conformity with the fact that there is no c-command relation between “him” and “John.” It is often true that both the binding theory and the perspective hypothesis provide good parallel accounts of particular anaphoric patterns. In this sense, formal theory and the perspective account complement each other. The perspective hypothesis also provides an account of the
acceptability of certain types of forward co-reference that are not explained by the binding theory. Consider this pair:
(1) She_i had just come back from vacation, when Mary_i saw the stack of unopened mail piled up at her front door.
(2) *She_i came back from vacation, when Mary_i saw the stack of unopened mail piled up at her front door.
The presence of “had just” in (1) works to generate a sense of ongoing relevance that keeps the first clause in discourse focus long enough to permit co-reference between “she” and “Mary.” These sentences from Reinhart (1983) provide further examples of aspectual effects on perspective taking.
(1) In Carter’s_i hometown, he_i is still considered a genius.
(2) ? In Carter’s_i hometown, he_i is considered a genius.
Although both of these sentences can be given co-referential readings, it is relatively easier to do so for (1), because of the presence of the “still” which keeps the co-referent active in memory. Preposed prepositional phrases have often presented problems for binding theory accounts (Kuno, 1986). Consider these examples:
(1) Near John_i, he_i keeps a laser printer.
(2) Near John’s_i computer desk, he_i keeps a laser printer.
(3) *He_i keeps a laser printer near John_i.
(4) *He_i keeps a laser printer near John’s_i computer desk.
In (2) we have enough conceptual material in the prepositional phrase to enactively construct a temporary perspective for “John.” In (1) this is not true, and therefore “John” is not active enough to link to “he.” The binding theory attempts to explain patterns of this type by referring to the “unmoved” versions of the sentences in (3) and (4) above. Co-reference is clearly blocked in (3) and (4), despite the fact that it is possible in (2). This indicates that linear order is crucial for the establishment of perspective and that (2) does not derive either online or offline from (4). Two further examples from van Hoek (1997) illustrate a related point.
(1) In Tim’s_i play, he_i offers Mary a mansion.
(2) In Tim’s_i play, he_i promised Mary a role.
In (1), we take the role of an outside observer describing a creative act inside the frame of the play. In (2), on the other hand, we are less involved in Tim’s perspective. Here, the structural account would view the preposed phrase of (1) as an adverb and therefore not subject to blocked co-reference. Here, again, the formal and functional accounts both work descriptively. However, the functional account is more useful in terms of explanation. Just as markers of ongoing relevance such as “had just” or “still” can increase the openness of a pronoun in a main clause to co-reference, so indefinite marking can decrease the openness of a noun in a preposed subordinate clause to co-reference, as indicated by the comparison of (1) with (2).
(1) While Ruth argued with the man_i, he_i cooked dinner.
(2) ? While Ruth argued with a man_i, he_i cooked dinner.
(3) While Ruth was arguing with a man_i, he_i was cooking dinner.
The addition of an aspectual marker of current relevance in (3) overcomes the effect of indefiniteness in (2), making “man” available as a co-referent for “he”. Gradient patterning of this type provides further evidence that pronominal co-reference is under the control of pragmatic factors (Kuno, 1986). In this case, the specific pragmatic factors involve interactions between definiteness and perspective. The more definite the referent, the easier it is to assume its perspective. Wh-words introduce a further uncertainty into the process of structure building. In strong crossover (Postal, 1971) sentences like (1), the initial wh-word “who” indicates the presence of information that needs to be identified.
(1) *Who_i does he_i like most?
(2) Who_i does he_j like most?
(3) Who_i is hated by his_i brother most?
(4) Who_i thought that Mary loved him_i?
(5) Who_i likes his_i mother most?
(6) Who_i said Mary kissed him_i?
(7) Who_i likes himself_i most?
In (1) the listener has to set up “who” as an item that must be eventually bound to some argument slot. At the same time, the listener has to use “he” as the perspective for structure building. The wh-word is not a possible candidate for the binding of the crucial subject pronoun, so that pronoun must be bound to some other referent, as in (2). However, when there is a pronoun that is not in the crucial subject role, co-reference between the wh-word and the pronoun is possible, as in (3) through (7). In these examples, the wh-word can co-refer to non-central components, such as objects and elements from embedded clauses. Only co-reference with subjects, as in (1), is blocked. This brief discussion has sampled only a few of the ways in which the perspective hypothesis can illuminate the grammar of co-reference. Other areas of the binding theory in which the perspective hypothesis provides direct accounts include the contrast between strong and weak crossover, the binding of reflexives (Kuno, 1986), and the assignment of quantifier scopes.
5.5 Clitic assimilation
The English infinitive “to” typically assimilates with a preceding modal verb to produce contractions such as “wanna” from “want to” in cases such as (1). However, this assimilation is blocked in some environments, such as (2), leaving us with (3) instead.
(1) Why do you wanna go?
(2) *Who do you wanna go?
(3) Who do you want to go?
Chomsky (1981) and others have argued that the blocking of the assimilation in (3) is due to the presence of the trace of an empty category in the syntactic tree. However, there is reason to believe that the environment in which assimilation is favored is determined
not by syntactic forces, but by perspectival forces. In particular, we can contrast (1) and (2) below, in which the infinitive does not cliticize with the verb, with (3), where it does. In the case of (3), the subject has an immediate obligation to fulfill, whereas in (1) and (2), the fact that the subject receives the privilege of going is due presumably to the intercession of an outside party. Thus, the perspective continuation is less direct in (1) and (2) than it is in (3).
(1) I get ta go. (Privilege)
(2) I got ta go. (Privilege)
(3) I gotta go. (Obligation)
According to this account, cliticization occurs when a motivated subject engages in an action. When there is a shift to another actor, or a conflict of perspectives, as in “Who do you want ta go?”, cliticization is blocked.
5.6 Relativization
Restrictive relative clauses can require us to compute multiple shifts of perspective. Consider these four types:
SS: The dog that chased the cat kicked the horse. (0 switches)
OS: The dog chased the cat that kicked the horse. (1- switch)
OO: The dog chased the cat the horse kicked. (1+ switch)
SO: The dog the cat chased kicked the horse. (2 switches)
In the SS type, the perspective of the main clause is also the perspective of the relative clause. This means that there are no perspective switches in the SS relative type. In the OS type, perspective switches from the main clause subject (dog) to the relative clause subject (cat). However, this perspective shift is made less abrupt by the fact that “cat” is the object of the main clause and receives some secondary focus before the shift is made. In the OO type, perspective also switches once. However, in this case, it switches more abruptly to the subject of the relative clause. In the SO relative clause type, there is a double perspective shift.
Perspective begins with the main clause subject (dog). When the next noun (cat) is encountered, perspective shifts once. However, at the second verb (kicked) perspective has to shift back to the initial perspective (dog) to complete the construction of the interpretation. Sentences that have further embeddings have even more switches. For example, “the dog the cat the boy liked chased snarled” has four difficult perspective switches (dog -> cat -> boy -> cat -> dog). Sentences that have as much perspective shifting as this without additional lexical or pragmatic support are incomprehensible, at least at first hearing.1 The perspective account predicts this order of difficulty: SS > OO = OS > SO. Studies of the acquisition (MacWhinney, 1982) and adult processing (MacWhinney and Pléh, 1988) have provided support for these predictions. A reaction time study of Hungarian relative clause processing by MacWhinney and Pléh (1988) shows how perspective processing integrates topicalization and subjectivalization. In Hungarian, all six orders of subject, object, and verb are grammatical. In three of these orders (SOV, SVO, and VSO), the subject is the topic; in three other orders (OSV, OVS, and VOS), the object is the topic. When the main clause subject is the topic, the English pattern of difficulty appears (SS > OO = OS > SO). However, when the object is the topic, the order of difficulty is OO > OS = SO > SS. These sentences illustrate this contrast in Hungarian, using English words and with the relative clause in parentheses and NOM and ACC to mark the nominative subject and the accusative object:
1
The mere stacking of nouns is not enough to trigger perspective-shift overload. Consider the sentence, “My mother’s brother’s wife’s sister’s doctor’s friend had a heart attack.” Here, we do not really succeed in taking each perspective and switching to the next, but some form of minimalist comprehension is still possible. This is because we just allow ourselves to skip over each perspective and land on the last one mentioned. In the end, we just know that someone’s friend had a heart attack.
S(SV)OV: The boy-NOM (he chased car-ACC) liked girl-ACC.
“The boy who chased the car liked the girl.”
O(OV)SV: The boy-ACC (car-NOM chased him) girl-NOM liked.
“The girl liked the boy the car chased.”
The S(SV)OV pattern is the easiest type for processing in the SOV word order. It follows the English pattern observed above. The O(OV)SV pattern is the easiest type to process in the OSV word order. Here the consistent maintenance of an object perspective through the shift from the main to the relative clause is easy, since the processor can then smoothly shift later to the overall sentence perspective. This contrast illustrates the fundamental difference in the way topic-centered languages manage the processing of perspective.
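The switch counts given for the four English relative clause types can be reproduced with a very small sketch. It assumes that each sentence can be reduced to the ordered list of referents whose perspective the listener occupies; these lists are our reading of the prose description of the SS, OS, OO, and SO types, and the function name is hypothetical. The sketch counts discrete switches only, so it does not capture the graded difference between the smoother OS switch (1-) and the more abrupt OO switch (1+).

```python
def count_switches(perspectives):
    """Count how often the perspective holder changes between adjacent steps."""
    return sum(1 for prev, cur in zip(perspectives, perspectives[1:]) if prev != cur)


# Assumed perspective sequences, read off the description of each type.
RELATIVE_TYPES = {
    "SS": ["dog", "dog"],         # The dog that chased the cat kicked the horse.
    "OS": ["dog", "cat"],         # The dog chased the cat that kicked the horse.
    "OO": ["dog", "horse"],       # The dog chased the cat the horse kicked.
    "SO": ["dog", "cat", "dog"],  # The dog the cat chased kicked the horse.
}

switches = {name: count_switches(seq) for name, seq in RELATIVE_TYPES.items()}

# Doubly embedded example from the text: "the dog the cat the boy liked chased snarled"
double_embedding = count_switches(["dog", "cat", "boy", "cat", "dog"])
```

Sorting the types by these counts gives the predicted English ordering, with SS easiest, OS and OO tied, and SO hardest. Modeling the Hungarian object-topic pattern would require a different starting perspective, which this sketch does not attempt.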
5.7 Ambiguity
Syntactic ambiguities and garden paths typically arise from competition (MacDonald, Pearlmutter, and Seidenberg, 1994; MacWhinney, 1987) between alternative perspectives. With a preposed participial as in (1), we assume the default perspective of a speech act participant (“you” or “me”), although we can also entertain the perspective of the “relatives.” In (2), the preposed subordinate clause prepares us to quickly accept the perspective of the “visiting relatives.” However, even in this case, we can still shift, if we wish, to the perspective of a speech act participant.
(1) Visiting relatives can be a nuisance.
(2) If they arrive in the middle of a workday, visiting relatives can be a nuisance.
(3) Brendan saw the Grand Canyon flying to New York.
(4) Brendan saw the dogs running to the beach.
(5) The women discussed the dogs on the beach.
(6) The women discussed the dogs chasing the cats.
In (3), the initial perspective resides with “Brendan” and the shift to the perspective of “Grand Canyon” is difficult because it is
inanimate and immobile. The shift to the perspective of “the dogs” is easier in (4), although again we can maintain the perspective of “Brendan” if we wish. In cases of prepositional phrase attachment competitions, such as (5), we can maintain the perspective of the starting point or shift to the direct object. If we identify with “the women,” then we have to use the beach as the location of their discussion. If we shift perspective to “the dogs” then we can imagine the women looking out their kitchen window and talking about the dogs as they run around on the beach. In (6), we have a harder time imagining that the women, instead of the dogs, are chasing the cats. As these examples illustrate, the starting point is always the default perspective. In transitive sentences, there is always some attentional shift to the object, but this shift can be amplified, if there are additional cues, as in (6). In some syntactic contexts in English, it is possible to shift perspective even more abruptly by treating the verb as intransitive and the following noun as a new subject. These examples illustrate this effect:
(1) Although John frequently jogs, a mile is a long distance for him.
(2) Although John frequently jogs a mile, the marathon is too much for him.
(3) Although John frequently smokes, a mile is a long distance for him.
Detailed self-paced reading and eye-movement studies of sentences like (1), with the comma removed, show that subjects often slow down just after reading “a mile.” This slowdown has been taken as evidence for the garden-path theory of sentence processing (Mitchell, 1994). However, it can also be interpreted as indicating time spent in shifting to a new perspective when the cues preparing the processor for the shift are weak. Examples of this type show that perspective interpretation is an integral part of online, incremental sentence processing (Marslen-Wilson and Tyler, 1980).
Perspectival ambiguities also arise from competitions between alternative interpretations of quantifier scopes. Consider these two examples:
(1) Someone loves everyone.
(2) Everyone is loved by someone.
If we take the perspective of “someone” in (1), we derive an interpretation in which it is true of some person that that person loves all other people. However, if we take the perspective of “everyone,” we derive an interpretation in which everyone is loved by at least one person. This second interpretation is much more likely in (2), because there “everyone” is the starting point. However, both interpretations are potentially available in both cases, because it is always possible to switch perspective away from the starting point to subsequent referents in a sentence, given additional processing time and resources.
6. Perspective and Language Acquisition
The perspective hypothesis provides us with a new way of integrating old insights regarding processes in language acquisition. The view of language learning as an enactive process was articulated by Plato, Augustine, Vico, Dewey, Kant, Montessori, Piaget, Vygotsky, and others.
6.1 Direct experience and word learning
In the framework of the current account, we can say that language development begins with highly grounded mimetic symbols. Diary studies by Lewis (1936) and Halliday (1975) have shown how children express themselves through gestures, cries and other motions in the prelinguistic period. Others (Bates, 1976; Bower, 1974; Piaget, 1952) have noticed that the pointing gesture develops out of the attempt to grasp an object. Similarly, expressive sighs develop from the act of relaxing the muscles of the chest. Through symbolic distancing (Werner and Kaplan, 1963), fully grounded actions and perceptions slowly become ungrounded. When these early gestures and prosodies match up with established norms
sanctioned by the community, they are reinforced and retained. When they do not clearly match community norms, they are modified or dropped. In this way, even these highly grounded forms of reference become codified. The perspective hypothesis holds that the child must construct an enactive relation between meaning and sound. Saussurean doctrine holds that this relation is arbitrary (de Saussure, 1966). However, this arbitrariness may hold only in terms of the larger community. For the individual learner, learning is facilitated by the formation of covert, private links (Atkinson, 1975) between sound and meaning based on properties such as sound symbolism (Brown, Black, and Horowitz, 1955; Hinton, Nichols, and Ohala, 1994) or enactive matches (Meltzoff, 1988; Werner and Kaplan, 1963). When language fails to provide these matches, children simply construct their own. Once the links have generated strong reciprocal connections (Van Orden, Holden, Podgornik, and Aitchison, 1999) and once lexical access becomes automated (Keenan and MacWhinney, 1987), these ad hoc links can fade away. Sometimes, children will overtly display the internal enactive cues they have used to acquire new meanings and concepts. For example, Jon Fincham (personal communication) reports that, when his son Adam was just past 3, he had his first experience using scissors to cut down lines drawn on paper. Accompanying each full stroke of the scissors was a perfectly synchronized, corresponding mouth/jaw movement. When the scissors opened, so would his mouth. When the scissors closed, again so would his mouth. He would do this repeatedly with each cut of the scissors whenever he used them for several weeks. Recently, cognitive psychologists have explored the possibility that word meanings are acquired simply from the statistics of co-occurrence (Burgess and Lund, 1997; Landauer and Dumais, 1997). These models have been remarkably successful in capturing a variety of effects. 
The semantic vectors acquired in these models nicely mirror the distribution of patterns in human associative memory. As a result, it is plausible to imagine that these systems provide supplements to the basic processes of grounded learning.
They may also help the child in the solution of aspects of the bootstrapping problem for both word meaning (Li, Burgess, and Lund, 2001) and syntax (Gleitman, 1990). For adults, this type of learning can support development of the highly ungrounded use of language that predominates in areas such as academic or legal discourse. However, by itself, co-occurrence learning of this type cannot provide a satisfactory account of basic word meaning (Kaschak and Glenberg, 2000).
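The idea of acquiring word representations from the statistics of co-occurrence can be illustrated with a toy sketch. It is not an implementation of the cited models: the window size, the three-sentence corpus, and the function names are all assumptions made for illustration. Each word is represented by counts of the words that appear within a fixed window around it, and two words are compared by the cosine of their count vectors.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy corpus; a real model would use a large text collection.
corpus = [
    "the dog chased the cat",
    "the cat chased the dog",
    "the boy pushed the table",
]
vectors = cooccurrence_vectors(corpus)
sim_dog_cat = cosine(vectors["dog"], vectors["cat"])
sim_dog_table = cosine(vectors["dog"], vectors["table"])
```

In this corpus “dog” and “cat” occur in the same contexts and so receive a higher similarity than “dog” and “table.” Nothing in the vectors refers to actions or perceptions, which is the sense in which learning of this type is ungrounded.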
6.2 Marking spatial and temporal perspectives
Children first learn to make spatial reference by developing a basic egocentric understanding of the positions of objects in space. Piaget (1952) has described this development in terms of the development of the object concept and procedures for dealing with invisible displacements. In learning to remember the positions of objects, the preverbal child relies on each of the three spatial reference systems. However, as Piaget has observed, the egocentric frame is primary. At the end of the second year, when the child comes to the task of learning language, the first locative terms are primarily egocentric and deictic. Allocentric terms such as “in” or “on” are initially processed in terms of affordances and topological relations, rather than through a complete shift of perspective to the distal object. Slowly, the use of the allocentric frame takes on an independent existence and children learn to shift reference between these frames. Geocentric reference is acquired much later (de Leon, 1994). Weist (1986) has shown how children begin temporal reference with a tenseless system in which events are simply stated. They then move on to a deictic system in which the event time is coded in reference to speaking time. Finally, they acquire the ability to code event time with respect to reference time in accord with allocentric reference. Work on the development of spatial perspective shifting has tended to focus on the comprehension of instructions and maps. The work has shown that the ability to shift perspectives emerges gradually during the school years (Hardwick, McIntyre, and Pick, 1976; Rieser, Garing, and Young, 1994).
6.3 Perspective and item-based patterns
Children’s first sentences are produced through the use of what MacWhinney (1975; 1982) called item-based predicate patterns. These patterns are grounded on the individual syntax and conceptual structure of operators such as “my,” “give,” “more,” and “with.” Before age 3, there is little generalization over these argument structures and much of grammar is tightly grounded on the action schema underlying each of these predicates. Recent research (Goldberg, 1999; Lieven, Pine, and Baldwin, 1997; Tomasello, 1992) has emphasized the role of individual verbs in constructing syntax from the bottom up. In all of these accounts, the child begins with separated “verb islands” that are later linked together in larger construction types. Each verb encodes a slightly different pattern of muscle control, attentional movement, iteration, and goal direction; and each verb involves slightly different action perspectives. After age 3 (Tomasello, 2000), children begin to relate the various verb types into loosely coherent constructions. However, all aspects of this learning are still closely linked to the underlying physical realization of the verb. Researchers using the NTL framework (Bailey, Chang, Feldman, and Narayanan, 1998; Maia and Chang, 2001) have shown how one can construct detailed models of the components of verbs such as “walk,” “stumble,” “grab,” or “push.” An NTL model of “pushing” would refer in detail to the actions of the hands, back, and legs. If a rather small object were to be pushed across a short space, then only the hands would be involved, as when we push a salt shaker across the table. However, if we have to push a table against the wall, we will need to use our legs, our back and specific postures of our hands. Moreover, pushing is a process that has a beginning, duration, and possible end. All of these elements must be tightly specified in an embodied model of the verb.
Slobin (1985) argues that children use causal roles to express a perspective that he calls the manipulative activity scene. In this scene, children distinguish the role of the initial perspective from the role of the object of the action. For each verb, the nature of these
actions and changes is different. Some involve movements; others involve experiences; still others involve various forms of causation. As a result, children work within each individual verb frame to distinguish the initial perspective or starting point (arg1 or the first argument) from the final object of the action (arg2 or the second argument). In verbs with a single argument, there is only one perspective. In verbs with three arguments, there can be an additional secondary perspective (arg3 or the third argument). The specific semantic value of these three roles (arg1, arg2, and arg3) must be characterized separately for each verb. The NTL framework shows how this characterization can be grounded on the specific action schemas associated with body movements and intentional shifts for each verb. Children learning a language with clear accusative marking such as Russian (Gvozdev, 1949) or Hungarian (MacWhinney, 1974) first learn to mark the accusative on verbs that have clear manipulative activities, such as “break” or “hit”. Similarly, children learning Kaluli (Schieffelin, 1985) first mark the ergative when it occurs with high transitivity verbs. Because verb frame generalization is limited before age 3, there is little overgeneralization of these early markings to intransitives or verbs with low transitivity. Early on, children’s perspectives on individual verbs can lie between those of accusative and ergative languages. For example, when a child says “picky up,” we may initially assume that this means, “You pick me up.” However, the actual early meaning is probably more focused on the child than on the agent who does the picking up. In this sense, it is more like “me experience picking up.” In addition to these basic frames for causal roles, children also rely on figure-ground relations to code predicates for possession, sources, positions, and goals. 
What is interesting about these prototypical frames is the extent to which each is organized from the perspective of the child as actor. The fundamental quality of the egocentric perspective has its impact not only on the learning of spatial relations, but also on the acquisition of causal action expressions.
6.4 Perspective and the development of binding
There have also been many studies of children’s learning of anaphoric relations, particularly in the context of the binding theory (Chomsky, 1981). This research shows that children are sensitive early on to violations of Principle C, which blocks coreference in “He said Bill won.” This fact has been used to argue that Principle C is an innate component of Universal Grammar (UG). However, these facts can also be interpreted as evidence for the cognitive centrality of perspective taking. In the area of reflexives, the developmental results have been more problematic for proponents of the binding theory. For example, in a sentence such as (1), children tend to interpret “him” as co-referential with “horse,” as if it were (2).
(1) The dog said that the horse hit him.
(2) The dog said that the horse hit himself.
Sentence (2) obeys Principle A of the binding theory that a reflexive pronoun must have a more prominent antecedent in its minimal domain. Children have no trouble learning this rule, since it involves a clear cue and a local syntactic structure. However, Principle B, which requires that a pronominal must not have a more prominent antecedent in its minimal domain, causes children more problems. To get around this empirical failure, theorists (Chien and Wexler, 1990; Grodzinsky and Reinhart, 1993) have introduced a partition in the binding theory between referring expressions that trigger binding and co-reference and non-referring expressions that only trigger co-reference. However, evidence in support of this two-process account is incomplete and inconsistent (O’Grady, 1997). The perspective hypothesis provides a rather more direct account of children’s processing of these sentences. According to this account, the child starts processing (1) from the perspective of “the dog.” Perspective then shifts to “the horse” and does not return to the overall subject in time to bind “him” to “the dog.” In order to master the perspective shifting required by Principle B, children must
improve their methods for holding two subjects in mind and switching quickly between them2. Children with Specific Language Impairment (SLI) have a particularly difficult time mastering this switching (Franks and Connell, 1996; van der Lely, 1994). This suggests that the syntactic impairment in at least some children with SLI may well emerge from a deeper impairment of core processes in perspective taking and switching. Processing of (2) is less problematic, because there is a clear local cue that forces coreference to the current perspective.
6.5 Perspective and coordination
Perspective maintenance plays an important role in children’s imitations and productions of conjoined sentences (Ardery, 1979; Lust and Mervis, 1980; Slobin and Welsh, 1973). These studies have shown that young children find it easier to imitate a sentence like (1), as opposed to ones like (2).
(1) Mary cooked the meal and ate the bread.
(2) Mary cooked and John ate the bread.
In (1) there is no perspective shift, since the perspective of Mary is maintained throughout. In (2), on the other hand, perspective shifts
2
The perspective account also helps us to understand some aspects of reflexives that are problematic for the c-command account. In the perspective account, co-reference is blocked in (b) because reflexives need to bind to referents that are already mentioned or for which there is a cue that promises that they will be mentioned. (a) The journey exposed Tom to himself far more than he had hoped. (b) *The journey exposed himself to Tom far more than he had hoped. (c) Bill told John that pictures of himself were on display in the Post Office. (d) Alfred thinks he is a great cook, and Felix does too. In (c) the ability of “himself” to refer to either “Bill” or “John” is a reflection of the fact that both are perspectival. A similar effect arises in the very different structure of (d) in which both Alfred and Felix are possible perspectives.
from Mary to John. Moreover, in order to find out what Mary is cooking, we have to maintain the perspective of both Mary and John until the end of the sentence.
7. Conclusion
The perspective hypothesis offers a new way of understanding the linkage between language, society, and the brain. In this new formulation, communication is viewed as a social interaction that activates mental processes of perspective taking. Because perspective taking and shifting are fundamental to communication, language provides a wide array of grammatical devices for specifically marking perspective and perspective shift. The process of perspective shifting relies on at least four major neuronal systems that involve large areas of the cortex. Together, these systems allow us to store and produce images of previous direct experiences, spatial positions, plans, and social roles. Perspective allows us to thread together information from these four semimodular sources into a coherent integrated cognitive view. The perspective hypothesis generates a broad series of empirically testable claims about cognitive processing, language processing, language structure, and neuronal processing. However, the hypothesis must still be clarified in several ways:
1. The conditions governing the movement of attention during online processing need to be fully specified and simulated in the form of a processing model for a variety of languages. This work can build on cross-linguistic studies of sentence processing (MacWhinney and Bates, 1989) and the analyses of cognitive grammar (Langacker, 1987).
2. The management of perspective taking through grammatical devices needs to be specified for a wider variety of grammatical structures in a wider variety of languages.
3. The perspective hypothesis needs to be systematically applied to the sentence-processing literature to evaluate the extent to which it can provide alternative accounts to theories such as the garden-path model (Frazier, 1987) or capacity limitations (R. Lewis, 1998).
4. The implications of the hypothesis for sentence production need to be more fully specified.
5. The specific functional neural circuits that support perspective switching on the four proposed levels need to be more fully characterized and documented.
6. The development of ungrounded cognition through the growth of perspective, memory, and imagery needs to be documented in developmental terms.
7. We need more information about the emergence of perspective taking during language evolution.
This is a lengthy agenda. However, if examination of these issues helps us to better understand language, cognition, and the brain, then exploration of the perspective hypothesis will have been worthwhile.
References Anderson, J. M. (1971). The grammar of case: Towards a localist theory. London: Cambridge University Press. Ardery, G. (1979). The development of coordinations in child language. Journal of Verbal Learning and Verbal Behavior, 18, 745–756. Ariel, M. (1990). Accessing noun phrase antecedents. London: Routledge. Aristotle. (1932). The Rhetoric. New York: Appleton-Century-Crofts, Inc. Atkinson, R. (1975). Mnemotechnics in second-language learning. American Psychologist, 30, 821–828. Austin, J. L. (1962). How to Do Things with Words. Cambridge, MA: Harvard University Press. Baddeley, A. D. (1990). Human memory: Theory and practice. Needham Heights, MA: Allyn and Bacon.
Bailey, D., Chang, N., Feldman, J., and Narayanan, S. (1998). Extending embodied lexical development. Proceedings of the 20th Annual Meeting of the Cognitive Science Society, 64–69.
Ballard, D. H., Hayhoe, M. M., Pook, P. K., and Rao, R. P. (1997). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences, 20, 723–767.
Barch, D. M., Braver, T. S., Nystrom, L. E., Forman, S. D., Noll, D. C., and Cohen, J. D. (1997). Dissociating working memory from task difficulty in human prefrontal cortex. Neuropsychologia, 35, 1373–1380.
Barsalou, L. W. (1999). Perceptual symbol systems. Behavioral and Brain Sciences, 22, 577–660.
Bartsch, K., and Wellman, H. (1995). Children talk about the mind. New York: Oxford University Press.
Bates, E. (1976). Language and context: The acquisition of pragmatics. New York: Academic Press.
Bates, E., and MacWhinney, B. (1989). Functionalism and the Competition Model. In B. MacWhinney and E. Bates (Eds.), The crosslinguistic study of sentence processing. New York: Cambridge University Press.
Black, J., Turner, T., and Bower, G. (1979). Point of view in narrative comprehension, memory, and production. Journal of Verbal Learning and Verbal Behavior, 18, 187–198.
Bossom, J. (1965). The effect of brain lesions on adaptation in monkeys. Psychonomic Science, 2, 45–46.
Bower, T. G. R. (1974). Development In Infancy. San Francisco: Freeman.
Bransford, J., Barclay, R., and Franks, J. (1972). Sentence memory: A constructive vs. interpretive approach. Cognitive Psychology, 3, 193–209.
Braver, T. S., Cohen, J. D., Nystrom, L. E., Jonides, J., Smith, E. E., and Noll, D. C. (1997). A parametric study of prefrontal cortex involvement in human working memory. Neuroimage, 6, 49–62.
Brown, R. N., Black, A. H., and Horowitz, A. E. (1955). Phonetic symbolism in four languages. Journal of Abnormal and Social Psychology, 50, 388–393.
Bryant, D. J., Tversky, B., and Franklin, N. (1992). Internal and external spatial frameworks for representing described scenes. Journal of Memory and Language, 31, 74–98.
Burgess, C., and Lund, K. (1997). Modelling parsing constraints with high-dimension context space. Language and Cognitive Processes, 12, 177–210.
Byrne, M. (1999). Human cognitive evolution. In M. C. Corballis and S. E. G. Lea (Eds.), The descent of mind: Psychological perspectives on hominid evolution (pp. 71–87). Oxford: Oxford University Press.
Caplan, D., and Waters, G. S. (1995). On the nature of the phonological output planning processes involved in verbal rehearsal: Evidence from aphasia. Brain and Language, 48, 191–220.
Carreiras, M., Carriedo, N., Alonso, M. A., and Fernández, A. (1997). The role of verb tense and verb aspect in the foregrounding of information during reading. Memory and Cognition, 25, 438–446.
Chien, Y., and Wexler, K. (1990). Children’s knowledge of locality conditions in binding as evidence for the modularity of syntax and pragmatics. Language Acquisition, 1, 225–295.
Chomsky, N. (1981). Lectures on government and binding. Cinnaminson, NJ: Foris.
Chomsky, N. (1982). Some concepts and consequences of the theory of government and binding. Cambridge, MA: MIT Press.
Clark, H., and Marshall, C. (1981). Definite reference and mutual knowledge. In B. W. A. Joshi and I. Sag (Eds.), Elements of discourse understanding. Cambridge, MA: Cambridge University Press.
Clark, H. H. (1973). Space, time, semantics, and the child. In T. E. Moore (Ed.), Cognitive development and language acquisition (pp. 28–63). New York: Academic Press.
Cohen, D. D., Perlstein, W. M., Braver, T. S., Nystrom, L. E., Noll, D. C., Jonides, J., et al. (1997). Temporal dynamics of brain activation during a working memory task. Nature, 386, 604–608.
Cohen, J., Dunbar, K., and McClelland, J. (1990). On the control of automatic processes: A parallel distributed processing account of the Stroop effect. Psychological Review, 97, 332–361.
Cohen, M. S., Kosslyn, S. M., Breiter, H. C., DiGirolamo, G. J., Thompson, W. L., Anderson, A. K., et al. (1996). Changes in cortical activity during mental rotation. A mapping study using functional MRI. Brain, 119, 89–100.
Craik, F. I. M., and Lockhart, R. S. (1972). Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior, 11, 671–684.
Damasio, A. (1999). The feeling of what happens: Body and emotion in the making of consciousness. New York: Harcourt Brace.
de Leon, L. (1994). Exploration in the acquisition of geocentric location by Tzotzil children. Linguistics, 32, 857–884.
de Saussure, F. (1966). Course in general linguistics. New York: McGraw-Hill.
Deacon, T. (1997). The symbolic species: The co-evolution of language and the brain. New York: Norton.
Decety, J., and Grèzes, J. (1999). Neural mechanisms subserving the perception of human actions. Trends in Cognitive Sciences, 3, 172–241.
Decety, J., Perani, D., Jeannerod, M., Bettinardi, V., Tadary, B., Woods, R., et al. (1994). Mapping motor representations with positron emission tomography. Nature, 371, 600–602.
Delancey, S. (1981). An interpretation of split ergativity and related patterns. Language, 57, 626–658.
Donald, M. (1991). Origins of the Modern Mind. Cambridge, MA: Harvard University Press.
Donald, M. (1998). Mimesis and the Executive Suite: Missing links in language evolution. In J. R. Hurford, M. G. Studdert-Kennedy and C. Knight (Eds.), Approaches to the evolution of language. New York: Cambridge University Press.
Dunbar, R. (2000). Causal reasoning, mental rehearsal, and the evolution of primate cognition. In C. Heyes and L. Huber (Eds.), The evolution of cognition. Cambridge, MA: MIT Press.
Ehrlich, K., and Johnson-Laird, P. N. (1982). Spatial descriptions and referential continuity. Journal of Verbal Learning and Verbal Behavior, 21, 296–306.
Fauconnier, G. (1994). Mental spaces: Aspects of meaning construction in natural language. Cambridge: Cambridge University Press.
Fauconnier, G., and Turner, M. (1996). Blending as a central process of grammar. In A. Goldberg (Ed.), Conceptual structure, discourse, and language (pp. 113–130). Stanford, CA: CSLI.
Feldman, J., Lakoff, G., Bailey, D., Narayanan, S., Regier, T., and Stolcke, A. (1996). Lo: The first five years of an automated language acquisition project. AI Review, 10, 103–129.
Fourcin, A. J. (1975). Language development in the absence of expressive speech. In E. H. Lenneberg and E. Lenneberg (Eds.), Foundations of language development: A multidisciplinary approach (Vol. 2, pp. 263–268). New York: Academic Press.
Franks, S. L., and Connell, P. J. (1996). Knowledge of binding in normal and SLI children. Journal of Child Language, 23, 431–464.
Frazier, L. (1987). Sentence processing: A tutorial review. In M. Coltheart (Ed.), Attention and performance XII (pp. 601–681). London, UK: Lawrence Erlbaum Associates.
Frith, C. D., and Frith, U. (1999). Interacting minds: A biological basis. Science, 286, 1692–1695.
Fuster, J. M. (1989). The prefrontal cortex. New York: Raven Press.
Gainotti, G., Silveri, M. C., Daniele, A., and Giustolisi, L. (1995). Neuroanatomical correlates of category-specific semantic disorders: A critical survey. Memory, 3, 247–264.
Gathercole, V., and Baddeley, A. (1993). Working memory and language. Hillsdale, NJ: Lawrence Erlbaum Associates.
Gernsbacher, M. A. (1990). Language comprehension as structure building. Hillsdale, NJ: Lawrence Erlbaum.
Gibson, J. J. (1977). The theory of affordances. In R. E. Shaw and J. Bransford (Eds.), Perceiving, acting, and knowing: Toward an ecological psychology (pp. 67–82). Hillsdale, NJ: Lawrence Erlbaum.
Givón, T. (1976). Topic, pronoun, and grammatical agreement. In C. Li (Ed.), Subject and topic (pp. 149–188). New York: Academic Press.
Gleitman, L. (1990). The structural sources of verb meanings. Language Acquisition, 1, 3–55.
Glenberg, A. (1997). What memory is for. Behavioral and Brain Sciences, 20, 1–55.
Glenberg, A., and Langston, W. (1992). Comprehension of illustrated text: Pictures help to build mental models. Journal of Memory and Language, 31, 129–151.
Goffman, E. (1955). On face-work: An analysis of ritual elements in social interaction. Psychiatry, 18, 213–231.
Goldberg, A. E. (1999). The emergence of the semantics of argument structure constructions. In B. MacWhinney (Ed.), The emergence of language (pp. 197–213). Mahwah, NJ: Lawrence Erlbaum Associates.
Goldman-Rakic, P. S. (1987). Circuitry of primate prefrontal cortex and regulation of behavior by representational memory. In V. B. Mountcastle, F. Plum and S. R. Geiger (Eds.), Handbook of Physiology, vol. 5 (pp. 373–417). Bethesda, MD: American Physiological Society.
Goodale, M. A. (1993). Visual pathways supporting perception and action in the primate cerebral cortex. Current Opinion in Neurobiology, 3, 578–585.
Goodall, J. (1979). Life and death at Gombe. National Geographic, 155, 592–620.
Greenfield, P. (1991). Language, tools and brain: The ontogeny and phylogeny of hierarchically organized sequential behavior. Behavioral and Brain Sciences, 14, 531–595.
Grodzinsky, J., and Reinhart, T. (1993). The innateness of binding and coreference. Linguistic Inquiry, 24, 187–222.
Gvozdev, A. N. (1949). Formirovaniye u rebenka grammaticheskogo stroya. Moscow: Akademija Pedagogika Nauk RSFSR.
Halliday, M. (1975). Learning to mean: explorations in the development of language. London: Edward Arnold.
Halliday, M., and Hasan, R. (1976). Cohesion in English. London: Longman.
Hardwick, D., McIntyre, C., and Pick, H. (1976). The content and manipulation of cognitive maps in children and adults. Monographs of the Society for Research in Child Development, 41, Whole No. 3.
Harnad, S. (1990). The symbol grounding problem. Physica D, 42, 335–346.
Hausser, R. (1999). Foundations of computational linguistics: Man-machine communication in natural language. Berlin: Springer.
Haviland, J. (1996). Projections, transpositions, and relativity. In J. Gumperz and S. Levinson (Eds.), Rethinking linguistic relativity (pp. 271–323). New York: Cambridge University Press.
Hawkins, J. A. (1999). Processing complexity and filler-gap dependencies across grammars. Language, 75, 244–285.
Heine, B. (1997). Cognitive foundations of grammar. New York: Oxford University Press.
Heine, B., Güldemann, T., Kilian-Hatz, C., Lessau, D., Roberg, H., Schladt, M., et al. (1993). Conceptual shift: a lexicon of grammaticalization processes in African languages. Afrikanistische Arbeitspapier Köln, 34, 1–112.
Hermer-Vazquez, L., Moffet, A., and Munkholm, P. (2001). Language, space, and the development of cognitive flexibility in humans: The case of two spatial memory tasks. Cognition, 79, 263–299.
Hess, D. J., Foss, D. J., and Carroll, P. (1995). Effects of global and local context on lexical processing during language comprehension. Journal of Experimental Psychology: General, 124, 62–82.
Hinton, L., Nichols, J., and Ohala, J. (Eds.). (1994). Sound symbolism. Cambridge: Cambridge University Press.
Holloway, R. (1995). Toward a synthetic theory of human brain evolution. In J.-P. Changeux and J. Chavaillon (Eds.), Origins of the human brain (pp. 42–60). Oxford: Clarendon Press.
Horowitz, L., and Prytulak, L. (1969). Redintegrative memory. Psychological Review, 76, 519–531.
Hudson, R. (1984). Word grammar. Oxford: Blackwell.
Jeannerod, M. (1997). The cognitive neuroscience of action. Cambridge, MA: Blackwell.
Kakei, S., Hoffman, D. S., and Strick, P. L. (1999). Muscle and movement representations in the primary motor cortex. Science, 285, 2136–2139.
Kaschak, M. P., and Glenberg, A. M. (2000). Constructing meaning: The role of affordances and grammatical constructions in sentence comprehension. Journal of Memory and Language, 43, 508–529.
Kay, P., and Fillmore, C. J. (1999). Grammatical constructions and linguistic generalization: The “what’s X doing Y?” construction. Language, 75, 1–33.
Keenan, J., and MacWhinney, B. (1987). Understanding the relation between comprehension and production. In H. W. Dechert and M. Raupach (Eds.), Psycholinguistic models of production. Norwood, N.J.: ABLEX.
Klatzky, R. L., Pellegrino, J. W., McCloskey, B. P., and Doherty, S. (1989). Can you squeeze a tomato? The role of motor representations in semantic sensibility judgments. Journal of Memory and Language, 28, 56–77.
Kolb, B., and Whishaw, I. Q. (1995). Fundamentals of Human Neuropsychology. Fourth Edition. New York: W. H. Freeman.
Kosslyn, S. M., Thompson, W. L., Kim, I. J., and Alpert, N. M. (1995). Topographical representations of mental images in primary visual cortex. Nature, 378, 496–498.
Kuno, S. (1986). Functional syntax. Chicago: University of Chicago Press.
Lakoff, G. (1987). Women, fire, and dangerous things. Chicago: Chicago University Press.
Lakoff, G., and Johnson, M. (1980). Metaphors we live by. Chicago: Chicago University Press.
Landau, B., and Jackendoff, R. (1993). "What" and "where" in spatial language and spatial cognition. Behavioral and Brain Sciences, 16, 217–265.
Landauer, T., and Dumais, S. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211–240.
Langacker, R. (1987). Foundations of cognitive grammar: Vol. 1. Stanford, CA: Stanford University Press.
Langacker, R. (1995). Viewing in grammar and cognition. In P. W. Davis (Ed.), Alternative linguistics: Descriptive and theoretical models (pp. 153–212). Amsterdam: John Benjamins.
Levine, D. S., and Prueitt, P. S. (1989). Modelling some effects of frontal lobe damage: Novelty and perseveration. Neural Networks, 2, 103–116.
Lewis, M. M. (1936). Infant speech: A study of the beginnings of language. New York: Harcourt, Brace and Co.
Lewis, R. (1998). Reanalysis and limited repair parsing: Leaping off the Garden Path. In J. D. Fodor and F. Ferreira (Eds.), Reanalysis in sentence processing. Boston: Kluwer.
Li, P., Burgess, C., and Lund, K. (2001). The acquisition of word meaning through global lexical co-occurrences. Proceedings of the 23rd Annual Meeting of the Cognitive Science Society, 221–244.
Lieven, E. V. M., Pine, J. M., and Baldwin, G. (1997). Positional learning and early grammatical development. Journal of Child Language, 24, 187–219.
Luria, A. R. (1959). The directive function of speech in development and dissolution. Word, 15, 453–464.
Luria, A. R. (1975). Basic problems of language in the light of psychology and neurolinguistics. In E. H. Lenneberg and E. Lenneberg (Eds.), Foundations of language development: A multidisciplinary approach (Vol. 2, pp. 49–73). New York: Academic Press.
Lust, B., and Mervis, C. A. (1980). Development of coordination in the natural speech of young children. Journal of Child Language, 7, 279–304.
MacDonald, M. C., Pearlmutter, N. J., and Seidenberg, M. S. (1994). Lexical nature of syntactic ambiguity resolution. Psychological Review, 101(4), 676–703.
MacNeilage, P. (1998). The frame/content theory of evolution of speech production. Behavioral and Brain Sciences, 21, 499–546.
MacWhinney, B. (1974). How Hungarian children learn to speak. University of California, Berkeley.
MacWhinney, B. (1975). Pragmatic patterns in child syntax. Stanford Papers And Reports on Child Language Development, 10, 153–165.
MacWhinney, B. (1977). Starting points. Language, 53, 152–168.
MacWhinney, B. (1982). Basic syntactic processes. In S. Kuczaj (Ed.), Language acquisition: Vol. 1. Syntax and semantics (pp. 73–136). Hillsdale, NJ: Lawrence Erlbaum.
MacWhinney, B. (1987). Toward a psycholinguistically plausible parser. In S. Thomason (Ed.), Proceedings of the Eastern States Conference on Linguistics. Columbus, Ohio: Ohio State University.
MacWhinney, B. (1988). Competition and teachability. In R. Schiefelbusch and M. Rice (Eds.), The teachability of language (pp. 63–104). New York: Cambridge University Press.
MacWhinney, B. (2003). The gradual evolution of language. In B. Malle and T. Givón (Eds.), The evolution of language. Philadelphia: Benjamins.
MacWhinney, B., and Bates, E. (Eds.). (1989). The crosslinguistic study of sentence processing. New York: Cambridge University Press.
MacWhinney, B., and Pléh, C. (1988). The processing of restrictive relative clauses in Hungarian. Cognition, 29, 95–141.
Maia, T., and Chang. (2001). Grounded learning of grammatical constructions. 2001 AAAI Spring Symposium on learning grounded representations.
Marslen-Wilson, W. D., and Tyler, L. K. T. (1980). The temporal structure of spoken language understanding. Cognition, 8, 1–71.
Martin, A., Wiggs, C. L., Ungerleider, L. G., and Haxby, J. V. (1996). Neural correlates of category-specific knowledge. Nature, 379, 649–652.
McClelland, J. L., McNaughton, B. L., and O’Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102, 419–457.
McDonald, J. L., and MacWhinney, B. J. (1995). The time course of anaphor resolution: Effects of implicit verb causality and gender. Journal of Memory and Language, 34, 543–566.
Meltzoff, A. N. (1988). Infant Imitation and Memory: Nine-Month-Olds in Immediate and Deferred Tests. Child Development, 59, 217–225.
Menard, M. T., Kosslyn, S. M., Thompson, W. L., Alpert, N. M., and Rauch, S. L. (1996). Encoding words and pictures: A positron emission tomography study. Neuropsychologia, 34, 185–194.
Mesulam, M.-M. (1990). Large-scale neurocognitive networks and distributed processing for attention, language, and memory. Annals of Neurology, 28, 597–613.
Middleton, F. A., and Strick, P. L. (1998). Cerebellar output: Motor and cognitive channels. Trends in Cognitive Sciences, 2, 348–354.
Miller, G., and Johnson-Laird, P. (1976). Language and perception. Cambridge, MA: Harvard University Press.
Mitchell, D. C. (1994). Sentence parsing. In M. Gernsbacher (Ed.), Handbook of psycholinguistics. San Diego, CA: Academic Press.
Narayanan, S. (1997). Talking the talk is like walking the walk. Proceedings of the 19th Meeting of the Cognitive Science Society, 55–59.
O’Grady, W. (2002). An emergentist approach to syntax.
O’Grady, W. (1997). Syntactic development. Chicago: Chicago University Press.
Olson, C. R., and Gettner, S. N. (1995). Object-centered direction selectivity in the macaque supplementary eye field. Science, 269, 985–988.
Osman, A., Albert, R., and Heit, M. (1999). Motor cortex activation during overt, inhibited, and imagined movement. Paper presented at the Psychonomics, Los Angeles.
Owen, A. M., Downes, J. D., Sahakian, B. J., Polkey, C. E., and Robbins, T. W. (1990). Planning and spatial working memory following frontal lobe lesions in man. Neuropsychologia, 28, 1021–1034.
Paivio, A. (1971). Imagery and verbal processes. New York: Rinehart and Winston.
Parsons, L. M., Fox, P. T., Downs, J. H., Glass, T., Hirsch, T. B., Martin, C. C., et al. (1995). Use of implicit motor imagery for visual shape discrimination as revealed by PET. Nature, 375, 54–58.
Passingham, R. (1993). The frontal lobes and voluntary action. Oxford: Oxford University Press.
Petersen, S. E., Fox, P. T., Posner, M. I., Mintun, M., and Raichle, M. E. (1988). Positron emission tomographic studies of the cortical anatomy of single-word processing. Nature, 331, 585–589.
Piaget, J. (1952). The origins of intelligence in children. New York: International Universities Press.
Posner, M., Petersen, S., Fox, P., and Raichle, M. (1988). Localization of cognitive operations in the human brain. Science, 240, 1627–1631.
Postal, P. (1971). Cross-over phenomena. New York: Holt, Rinehart, and Winston.
Pulvermüller, F. (1999). Words in the brain’s language. Behavioral and Brain Sciences, 22, 253–336.
Redish, D., and Touretzky, D. S. (1997). Cognitive maps beyond the hippocampus. Hippocampus, 7, 15–35.
Reinhart, T. (1981). Definite NP anaphora and c-command domains. Linguistic Inquiry, 12, 605–635.
Reinhart, T. (1983). Anaphora and semantic interpretation. Chicago: University of Chicago Press.
Rieser, J. J., Garing, A. E., and Young, M. F. (1994). Imagery, action, and young children’s spatial orientation: It’s not being there that counts, it’s what one has in mind. Child Development, 65, 1262–1278.
Rizzolatti, G., Fadiga, L., Gallese, V., and Fogassi, L. (1996). Premotor cortex and the recognition of motor actions. Cognitive Brain Research, 3, 131–141.
Rumelhart, D. E. (1975). Notes on a schema for stories. In Bobrow and A. Collins (Eds.), Representation and understanding: Studies in cognitive science. New York: Academic Press.
Sacerdoti, E. (1977). A structure for plans and behavior. New York: Elsevier Computer Science Library.
Savage-Rumbaugh, E., and Taglialatela, J. (2001). Language, apes, and understanding speech. In T. Givón (Ed.), The evolution of language.
Schank, R., and Abelson, R. (1977). Scripts, plans, goals, and understanding: An inquiry into human knowledge structures. Hillsdale, N. J.: Lawrence Erlbaum.
Schieffelin, B. (1985). The acquisition of Kaluli. In D. Slobin (Ed.), The crosslinguistic study of language acquisition. Volume 1: The data. Hillsdale, NJ: Lawrence Erlbaum Associates.
Searle, J. R. (1970). Speech acts: An essay in the philosophy of language. Cambridge: University Press.
Searle, J. R. (1980). Minds, brains, and programs. Behavioral and Brain Sciences, 3, 417–424.
Seliger, H. (1989). Semantic transfer constraints on the production of English passives by Hebrew-English bilinguals. In H. Dechert and M. Raupach (Eds.), Transfer in language production (pp. 21–34). Norwood, NJ: Ablex.
Shallice, T., and Burgess, P. (1996). The domain of supervisory processes and temporal organization of behavior. Philosophical Transactions of the Royal Society of London B, 351, 1405–1412.
Slobin, D. (1985). Crosslinguistic evidence for the language-making capacity. In D. Slobin (Ed.), The crosslinguistic study of language acquisition. Volume 2: Theoretical issues (pp. 1157–1256). Hillsdale, N. J.: Lawrence Erlbaum.
Slobin, D. I., and Welsh, C. A. (1973). Elicited imitation as a research tool in developmental psycholinguistics. In C. A. Ferguson and D. I. Slobin (Eds.), Studies of child language development (pp. 485–497). New York: Holt, Rinehart and Winston.
Smyth, R. (1995). Conceptual perspective-taking and children’s interpretation of pronouns in reported speech. Journal of Child Language, 22, 171–187.
Snyder, L. H., Grieve, K. L., Brotchie, P., and Andersen, R. A. (1998). Separate body- and world-referenced representations of visual space in parietal cortex. Nature, 394, 887–891.
Sokolov, A. (1972). Inner speech and thought. New York: Plenum Press.
Solan, L. (1983). Pronominal reference: Child language and the theory of grammar. Boston: Reidel.
Solomon, K., and Barsalou, L. W. (2001). Representing properties locally. Cognitive Psychology, 43, 129–169.
Tabachneck-Schijf, H. J. M., Leonardo, A. M., and Simon, H. A. (1997). CaMeRa: A computational model of multiple representations. Cognitive Science, 21, 305–350.
Talmy, L. (1988). Force dynamics in language and cognition. Cognitive Science, 12, 59–100.
Talmy, L. (2000). Toward a cognitive semantics. Vol. 1: The concept structuring system. Cambridge, MA: MIT Press.
Terrace, H. S., Petitto, L. A., Sanders, R. J., and Bever, T. G. (1980). On the grammatical capacity of apes. In K. Nelson (Ed.), Children’s language: vol. 2. New York: Gardner.
Teuber, H.-L. (1964). The riddle of frontal lobe function in man. In J. M. Warren and K. Akert (Eds.), The frontal granular cortex and behavior (pp. 410–477). New York: McGraw-Hill.
Tomasello, M. (1992). First verbs: A case study of early grammatical development. Cambridge: Cambridge University Press.
Tomasello, M. (1999). The cultural origins of human communication. New York: Cambridge University Press.
Tomasello, M. (2000). Do young children have adult syntactic competence? Cognition, 74, 209–253.
Tucker, D. (2001). Embodied meaning: An evolutionary-developmental analysis of adaptive semantics. In B. Malle and T. Givón (Eds.), The evolution of language. Philadelphia: Benjamins.
Ungerleider, L. G., and Haxby, J. V. (1994). ‘What’ and ‘where’ in the human brain. Current Opinion in Neurobiology, 4, 157–165.
van der Lely, H. (1994). Canonical linking rules: Forward vs. reverse linking in normally developing and Specifically Language Impaired children. Cognition, 51, 29–72.
van Hoek, K. (1997). Anaphora and conceptual structure. Chicago: University of Chicago Press.
Van Orden, G., Holden, J. G., Podgornik, M., and Aitchison, C. S. (1999). What swimming says about reading: Coordination, context, and homophone errors. Ecological Psychology, 11, 45–79.
Vendler, Z. (1957). Verbs and times. Philosophical Review, 56, 143–160.
Vygotsky, L. (1962). Thought and language. Cambridge: MIT Press.
Weist, R. (1986). Tense and aspect. In P. Fletcher and M. Garman (Eds.), Language acquisition (2nd Ed.) (pp. 356–374). Cambridge: Cambridge University Press.
Werner, H., and Kaplan, B. (1963). Symbol formation: An organismic-developmental approach to language and the expression of thought. New York: Wiley.
Whistler, K. (1985). Focus, perspective, and inverse person marking in Nootkan. In J. Nichols and T. Woodbury (Eds.), Grammar inside and outside the clause (pp. 227–265). New York: Cambridge University Press.
Zubin, D. A. (1979). Discourse function of morphology: The focus system in German. In T. Givón (Ed.), Syntax and semantics: Discourse and syntax (Vol. 12). New York: Academic Press.
Zwaan, R. A. (1998). Situation models in language comprehension and memory. Psychological Bulletin, 123, 162–185.
Zwaan, R. A., Kaup, B., Stanfield, R. A., and Madden, C. J. (in press). Language comprehension as guided experience. Behavioral and Brain Sciences.
5 Polygenesis of Linguistic Strategies: A Scenario for the Emergence of Languages

Christophe Coupé and Jean-Marie Hombert
CNRS & Université Lyon 2
Abstract

On the one hand, numerous hypotheses have been put forward to account for the emergence of language during the last million years of human evolution. On the other hand, a large majority of linguists consider that, given our current knowledge of modern languages, nothing can be said about languages spoken more than 8,000 or 10,000 years ago. A large gap obviously separates these approaches and conceptions, and it has to be crossed to provide a better account of the development of our communicative system. To partially bridge the gap between these domains, we aim to propose a plausible scenario for the emergence of languages, with an emphasis on the development of linguistic diversity. The present study addresses the question of the monogenesis or polygenesis of modern languages, a question whose treatment is often implicitly biased toward the first hypothesis. Probabilistic and computational models, as well as palaeo-demographic data and evolutionary considerations, constitute the key points of our proposals.
1. The Origin of Languages

1.1 Language capacity and languages
Which definition for language?

Even if “language” is often considered as the capacity that separates Man from other species, a precise definition remains controversial. Most of the supposedly distinctive features proposed, for example, by Hockett (1960) have been called into question during the last decades by studies on “talking apes”, like the bonobo named Kanzi (Savage-Rumbaugh et al., 1998), or on other animals. What contributes, among other characteristics, to the specificity of language is the profound unity of its nature in terms of cognitive or informational features, combined with the extreme diversity of its superficial forms, namely the languages. Explaining this discontinuity has been one of the major tasks of linguists, leading to the development of elaborate and highly detailed constructions like the generative grammars, and of various fields such as typology, the study of language universals, etc. To what extent some features of language are genetically encoded remains at the heart of intense debates (Schoenemann, 1999; Enard et al., 2002; Lai et al., 2001).

While the following study does not deal primarily with this controversy, we rely on a classical, yet sometimes implicit, distinction. First, there is a faculty of language, language-specific or derived from more general cognitive abilities, which characterizes the human aptitude for sophisticated communication. Second, there are instantiations of this faculty, which are the approximately 6,000 languages spoken today. As a consequence, we will tend to use the former expressions in italic rather than the term “language” in isolation, and will focus on the similarities or divergences between these two notions, which are both related to the notion of linguistic diversity.
Emergence of the faculty of language and of modern languages

For the last twenty years, the question of the “origins of language” has been revitalized by the cooperative efforts of a number of disciplines. Partly in reaction to the proposal of a genetically-encoded Universal Grammar (Chomsky, 1975), their results and paradigms have provided new insights into this topic, among others:

• Progress in the appreciation of our predecessors’ cognitive capacities and behaviors (archaeology or palaeoanthropology)
• Discovery of substantial correlations between modern linguistic and genetic distributions (Cavalli-Sforza, 1994)
• Discovery of plausible neural bases for behaviors that could be related to the faculty of language, for example mirror neurons (Rizzolatti et al., 1996; Rizzolatti and Arbib, 1996), etc.

On the basis of these new data, a number of theories have been put forward to explain when, why and how our communication system developed to reach its current state. Following the distinction made earlier, a growing body of work deals with the origin of the faculty of language, while another body of research focuses on the origin of contemporary languages. Nevertheless, the temporal gap which separates these two fields is a large one, and the methods used to gather and analyze the data in each of them often have little in common. As Figure 1 summarizes, the emergence of a human-specific capacity of language presumably happened along our phylogenetic tree somewhere between some tens of thousands of years and a few million years ago, while the limit for the validity of reconstructions by historical linguists is most often assessed to be around 8,000 BP (Before Present). A consensus now seems to have gained ground which dates the origin of modern language between 50,000 and 100,000 years ago, in line with the modern behaviors of our species Homo sapiens. Taken together, these dates span several orders of magnitude.
Figure 1 The time scale of language evolution
Emergence of language
While a number of linguists either try to rebuild the story of recent human languages or the first steps of the development of the faculty of language, e.g. Bickerton (1990)’s protolanguage, few of them have taken an interest in the origin and development of linguistic diversity per se. This approach differs from both the work on ancestors of modern linguistic families (e.g. Proto-Indo-European, Nostratic, Austric, etc.) and the study of the origin of the faculty of language. Its intrinsic difficulty lies obviously in the absence of clues from the past (even less than for the faculty of language), but comparisons with contemporary human societies, models or relationships between languages and other cultural features may provide a valuable help. Considering the evolution of linguistic diversity in itself is not only interesting because it represents a key event to understand the duality between the superficial diversity and the deep uniqueness of the human communication system. It is also useful because since its ties with various variables, for example the size of populations, or
with cultural development, can shed light on the history of recent languages. By knowing whether the social or demographic conditions of our predecessors were likely to correlate with a large or small number of languages, we may address the plausibility of scenarios put forward by some linguists about the number of languages that would have been spoken 10,000 or 50,000 years ago (Pagel, 2000). For the sake of simplicity, we will often conflate linguistic diversity and diversity of languages, even if one should keep in mind that several concepts or problems differentiate these two notions, the most significant of which requires a precise definition of what a language is (Nettle, 1999a: 63). The origins of linguistic diversity as a matter of function or product (why?) can be partially differentiated from the origins in space and time (where and when?). Since we will mostly consider the second notion in this article, we will just briefly summarize below our position regarding the first issue. Several theories explain the emergence of the faculty of language (the “why?” question) by social causes. Dunbar (1996) proposed for example that this faculty emerged to replace grooming in its function of preserving social cohesion, because of the increase in social group size. This increase was suggested by the correlation between the size of the group and the volume of the neo-cortex in various primate species (Dunbar, 1993). Another example is Dessalles’ political theory of language, which focuses on social aspects in agreement with Darwinian evolution: the development of the faculty of language would have enabled individuals to better express their qualities in order to form coalitions (Dessalles, 2001). Such coalitions are often observed in chimpanzees (De Waal, 1998).
According to a different perspective, but still centered on the social context of linguistic usage, the data of sociolinguistics have largely underlined the weight of social and inter-individual interactions on the evolution of languages and dialects. The social game is in particular partly accountable for linguistic diversity, as highlighted by various examples like Labov’s pioneering study of Martha’s Vineyard (Labov, 1963). At the crossing of these various hypotheses, we assume that it is
reasonable to consider the origin of linguistic diversity and variability, and therefore of languages, together with the origin of the function of language. If the development of our communication system was interlaced from the beginning with the social life of the communities, and if this link has been preserved until today, then it seems likely that the social game on linguistic forms and the resulting diversity of languages were preserved throughout prehistory. This phenomenon was presumably modulated by a large number of parameters: size of the communities, expressiveness of early forms of languages, development of underlying cognitive capacities, etc. Moreover, the geographic distribution of populations presumably contributed to a very ancient diversification of the communication systems, in a way similar to the evolution of species. In this conceptual framework, the development of modern languages has to be integrated into the more general evolution of the human communication system, and only represents a “last step”. It therefore becomes interesting to ask what defines the modernity of contemporary languages, in other words what differentiates them from more archaic languages; one may refer here for example to Bickerton’s notion of proto-language (Bickerton, 1990) or to Coupé and Hombert’s (2002) proposals regarding “language” in the context of the first sea-crossings to Australia.
Linguistic components
The diversity of the world’s languages is naturally expressed by the differences between the structures and elements that compose these languages. Linguistic typology aims at classifying this variety of forms, which appear more or less frequently and in more or less independent ways; linguists often speak, for example, of (implicational) universals or tendencies. An obvious method to study the evolution of linguistic diversity is therefore to rely on a partial individuation of the linguistic forms, rather than studying languages as monolithic entities. Following this line, our work relies on the notion of linguistic item, as defined by Nettle (1999a: 5):
A linguistic item is any piece of structure that can be independently learned and therefore transmitted from one speaker to another, or from one language to another. Words are the most obvious linguistic items, but sounds and phonological processes are items too, as are grammatical patterns and constructions. . . The distributions of different items in the world's languages need not be statistically independent, and indeed very often are not.
It seems relevant to us to consider linguistic items as communicative tools. We will use the term linguistic strategies to reflect the fact that linguistic items above all address functional needs at a cognitive level: the typological elements represent different possible solutions or strategies to assemble and bridge external linguistic projections of mental representations:
• word-order or case-markers to express the thematic relationships between the syntagms of the sentence;
• phonemes to encode the acoustic forms of words and overcome the large variability of phonetic forms;
• words as conventions about meanings, etc.
While the term linguistic strategies has been defined and used, for example by Croft (1990: 27), in a typological context, the notion of strategy refers for us to the multiplicity of possible functions for the projection from the private cognitive level to the linguistic level, and the competition that may exist among them. In this line of thought, we may also insist on the fact that the emergence of the cognitive functions themselves has to be considered together with the corresponding linguistic strategies, and that the “grain” of these evolutions may not be as coarse as a single division between a syntactic and a non-syntactic state. Focusing on the precise linguistic correlates of phenomena like an increase in the size of the working memory (Baddeley, 1986), the emergence of a theory of mind or the development of inferential reasoning (Sperber, 1995) seems highly relevant.
From these few comments, it should be clear that we strongly favor a progressive and segmented emergence of the function of language and of languages, rather than a single step from a simpler stage (iconic, non-syntactic, etc.) to a fully modern one. As we have mentioned already, this article will mostly be concerned with the spatial and temporal characteristics of the various steps that have led to the current linguistic situation, and we will propose plausible hypotheses regarding these aspects in the final discussion.
1.2 Monogenesis or polygenesis of language
Monogenesis versus polygenesis of an innovation
Two scenarios can be put forward to describe the appearance of any innovation in a population. The first one is called monogenesis and corresponds to a single emergence of the innovation, possibly followed by its spread in the population. Several independent appearances define the second possible scenario of emergence, namely polygenesis. In this case, the different emergences take place at several distinct sites, provided that innovations only appear once at a site. At least two major cultural innovations of our species seem to have appeared by polygenesis. First, agriculture is believed to have emerged independently in different places around 10,000 years ago, with archaeological evidence yielding similar dates at sites separated by thousands of kilometers. Various regions, such as Mexico, New Guinea, Europe, the Near-East and China display such evidence. More recently, the development of two or perhaps three different writing systems also points to a polygenesis: the Chinese ideographic system around 3,500 years ago, with inscriptions on bones or turtle shells, and the cuneiform system of the Mesopotamians around 5,000 years ago (Wang, 1973: 50–52) seem too distinct and too far apart to have had a common origin¹. Such questions of
1. The possible links between the Egyptian and Mesopotamian systems are hard to trace.
course become more difficult to answer for more remote innovations, for example the domestication of fire around 500,000 years ago. In the linguistic framework of our topic, it seems natural to wonder about a possible polygenesis of “language”, especially if one considers it as a cultural product like agriculture or writing. To be more precise, it is possible to focus on either the emergence of the function of language or the different linguistic components. While we will not further consider the former question and its archaeological bases, we will try to defend the latter approach as the most relevant to study the development of linguistic diversity and languages. To follow this guideline, we shall begin by introducing some general ideas about the origin of current languages.
The monogenesis of languages
It is often more or less explicitly admitted that all modern languages originate from a single original language; this hypothesis is often referred to as the monogenesis of languages. Interestingly, this proposal brings together researchers who disagree on the possibility of reconstructing languages spoken more than 8,000 or 10,000 years ago. Some proponents of strong limitations of the methodology of historical linguistics do not reject the plausibility of a single ancestral language, but think that its content is beyond the reach of our current investigations. Monogenesis is, however, much more a hypothesis than something strongly demonstrated and validated. To this extent, the following points should be recalled. First of all, the principle of reconstruction itself introduces an implicit bias toward a single ancestor for all contemporary languages. The forms reconstructed from the current states are, above all, “terms of abstract comparisons”, as translated from Auroux (2001), and therefore do not necessarily represent the linguistic reality of the past. It is for example rather difficult and unreasonable to conclude that Proto-Indo-European was the only language spoken in Europe or Western Asia around 6,000 years ago.
Such a position would rule out the common dialectal variety of languages, as well as the possible existence of now extinct languages. If the linguistic context suggested by the reconstructions is not the one which really existed some thousands of years ago, it becomes dangerous to rely on a repeated application of the process of linguistic reconstruction to infer a decrease in the number of languages spoken by our ancestors as we go further back in the past. A smaller size of the meta-population in the past might have resulted in many fewer languages, but this argument is not sufficient to conclude the existence of a unique tree for all contemporary languages. A crucial event that plausibly separates the more recent Neolithic period from the Paleolithic situation is the demographic explosion which took place with the development of agriculture, after a slow initial growth at the end of the Paleolithic. The notion of punctuated equilibria, borrowed from the evolutionary theories in biology and introduced by Dixon (1997) in linguistics, can be applied to such transitions (ibid: 77). Such an approach points to the hidden ties between linguistic diversity and demographic contexts. Some researchers have begun to explore their relationship, either with computational models studying the role of population size (Nettle, 1999b), or with considerations about the correlations between densities of speakers and linguistic diversity (Jacquesson, 2001). A second point is that equating a monogenesis of today’s languages with the existence of a unique ancestor can be viewed as incorrect. As Figure 2 depicts, several distinct families could indeed have appeared independently and evolved before all languages but one disappeared. The remaining specimen would have given birth to all modern languages.
This scenario is not implausible, since languages often disappear, for example when populations come into contact under unbalanced social relationships (Nettle and Romaine, 2000:147–9). Imbalances may have been even stronger in the past due to the small size of populations (Marsico et al., 2000). However, given the large areas which have been inhabited by humans for tens of thousands of years, it seems unlikely that the descendants of several “initial languages” could have all disappeared
Figure 2 Polygenesis of languages and single origin of contemporary languages
but one. Large areas like Europe have seen the expansion of families like Indo-European, but such developments and replacements did not reach a world scale in recent times². A third argument relies on the link between the origin of contemporary languages and the origin of our species between 100,000 and 200,000 years ago. Two main hypotheses are still discussed to account for the origin of modern Man. The first one, the Out of Africa hypothesis, postulates that the speciation event which led to our species took place in East Africa, and that our ancestors subsequently migrated out of this region and colonized the entire Earth (Lahr and Foley, 1994), replacing all the previous species that were living in Africa, Europe or Asia. Based on congruent archaeological and population genetic data (Stringer and Andrews, 1988; Cann et al., 1987), this scenario is favored by most scholars over the Multiregional Continuity hypothesis. This second proposal, based on Asian fossils and other
2. One may argue that if Homo sapiens replaced all other Homo species, a single language could as likely have replaced all other languages. However, where in the first case the replacement may be due to a physiological or cognitive advantage, linguists agree that no language is functionally better than others, which may be partially extrapolated back in the past for modern humans.
genetic studies, argues that modern humans evolved locally from pre-sapiens species, and that gene flow was sufficient to preserve a single species despite the large geographical areas involved (Thorne and Wolpoff, 1992). In the framework of the Out of Africa theory, an emergence of modern languages, and some may even say of the true function of language, due to neuro-physiological changes as part of the speciation event, would have taken place in the small and geographically restricted population of “new-born” sapiens. A monogenesis would then have been more likely, followed by the spread of languages during the migrations leaving East Africa. On the other hand, the link between new cognitive abilities and new forms of language would rather lead to a polygenesis in the case of the Multiregional Continuity.
“Rethinking” the origins of modern languages
The aim of our work is not to reject the hypothesis of the monogenesis of modern languages as a whole, but to point at the complexity of this question, and to show how its nuances may partially empty the current consensus of its substance. Our goal is to propose a general sketch of the emergence of linguistic diversity, which could in particular be applied to the development of modern languages. Our starting observation is that it is often implicitly assumed that the putative ancestor of modern languages shares with them their degree of complexity and their structures. In opposition to proponents of a catastrophic emergence of the function of language, many scholars do not deny the evolution of our system of communication (e.g. the notion of protolanguage), but do not apply this evolutionary way of thinking to the history of modern languages. This is of course partly due to the fact that people working on the origin of “language” often do not tackle the prehistory of modern languages to this end, and vice-versa. We aim at bridging this gap by considering that a large number of differences may have existed between modern languages and their first ancestors in modern Man. We will defend a scenario where part
of the current linguistic diversity would have emerged during the several tens of thousands of years that have followed the emergence of our species. The second and third parts of this article will be devoted to this goal, introducing probabilistic and computational models and experiments abstracting the appearance of cultural innovations in group-fragmented populations, as well as considerations about palaeo-demographic and evolutionary data. But before beginning to describe our arguments, these introductory paragraphs shall be concluded with a brief discussion of some of the palaeo-demographic data on which we will rely hereafter.
1.3 Taking palaeodemography into account
Emergence in a population of individuals
The appearance of any new cultural feature in the human species must be considered in a realistic framework, partly built on the specificities of the human population at the period involved. More precisely, if we want to talk about “sites of emergence”, as in the definition of polygenesis mentioned above, a clear definition of the meaning of these words in a demographic context is necessary. We shall therefore succinctly describe the data and theories about the societal structures during prehistory.
The structure of the ancestral human population
Several sources of data provide clues regarding the structure and the size of past populations. First of all, palaeo-anthropology and archaeology, through the study of characteristics of prehistoric living places such as surface area, organization, etc., lead us to conclude that the human population was composed of small groups of some tens of individuals, mostly between 20 and 50, during most parts of the Paleolithic (Hassan, 1981: 93–94). The number of 25 individuals is regularly quoted in various studies on prehistoric populations, and appears to be independent of time, density or type of environment (ibid: 53). Comparison to recent populations of hunter-gatherers inhabiting
various ecosystems, e.g. Eskimos with either a caribou and sea mammal hunting economy or a caribou hunting and fishing economy, also gives clues about densities during prehistory. Table 1 is reproduced from Hassan (1981: 198) and summarizes Birdsell’s 1972 proposals for three successive periods of the Paleolithic.
Table 1 Estimates of world prehistoric population, reproduced from (Hassan, 1981:198)
Period                Pop. density (persons/km²)   Area occupied (1e6 km²)   World pop. (1e6 persons)
Lower Palaeolithic    0.015                        27.0                      ca. 0.4
Middle Palaeolithic   0.032                        38.3                      ca. 1.0
Upper Palaeolithic    0.039                        57.5                      ca. 2.2
The total surface that was inhabited by the meta-population of humans can be estimated from the distribution of the prehistoric living places, as well as from the carrying capacities of various environments: forests, cold or warm deserts, etc. (Bocquet-Appel and Demars, 2000). Australia and the Americas were, for example, colonized only very late, by our own species (and not by former Homo or pre-Homo species). Beyond these first data, population genetics studies propose some evaluations of the global population during the last million years. All the studies conclude that the population was very small, one or two million people at most, and point to the existence of a genetic “bottleneck” around 1,800,000 years ago with the speciation leading to Homo ergaster. However, their analyses of different genetic markers (mtDNA (Sherry et al., 1994), Alu insertions³ (Sherry et al., 1997), microsatellites (Zhivotovsky et al., 2000), etc.) feed the debate
3. Alu insertions are primate-specific genetic elements that mobilize via the process of retroposition. They are believed to be non-coding, and account for around 5% of the human genome by mass (which represents 500,000 Alu sequences) (Sherry et al., 1997).
about a possible second bottleneck 100,000 years ago and later expansions, with the appearance of our species Homo sapiens (Hawks et al., 2000). These disagreements exemplify the genetic side of the controversy between the Out of Africa hypothesis and multiregional continuity. From these data, it appears clear that the study of the monogenesis or polygenesis of cultural innovations, whether linguistic or not, can be based on the human group as a relevant “base unit”, at least until the appearance of larger communities with the development of agriculture around 9,000 years ago.
Table 2 Areas of various lands or continents
                Area in km²
Entire Earth    510,072,200
Emerged lands   148,939,800
Asia            44,547,800
Africa          30,043,900
Europe          10,404,000
Table 3 Density of human groups for different areas and population sizes
Macro-population size   10,000   25,000   125,000   250,000   1,250,000   2,500,000   5,000,000
Nb. of groups           400      1,000    5,000     10,000    50,000      100,000     200,000
Surface (km²)
1,000,000               4e-4     1e-3     5e-3      0.01      0.05        0.1         0.2
5,000,000               8e-5     2e-4     1e-3      2e-3      0.01        0.02        0.04
10,000,000              4e-5     1e-4     5e-4      1e-3      5e-3        0.01        0.02
25,000,000              1.6e-5   4e-5     2e-4      4e-4      2e-3        4e-3        8e-3
50,000,000              8e-6     2e-5     1e-4      2e-4      1e-3        2e-3        4e-3
100,000,000             4e-6     1e-5     5e-5      1e-4      5e-4        1e-3        2e-3
To get a better idea of the quantitative values that are involved in this framework, Tables 2 and 3 report the current sizes of land and the continents, and computations of the densities of human groups for different areas and sizes of the macro-population. In the next experiments and hypotheses, we will mostly investigate densities of population varying between 0.001 and 0.0001 human groups per km². They correspond to average values from 0.025 to 0.0025 individuals per km². While they will appear small compared to the values of Table 1 (especially the lower bound), we will explain later why we believe such values to be relevant in some situations.
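The computations behind Table 3 can be sketched in a few lines of Python. This is only an illustration: it assumes groups of 25 individuals, the figure quoted above from Hassan (1981), and the function names `group_density` and `individual_density` are hypothetical helpers introduced here, not part of the original work.

```python
GROUP_SIZE = 25  # assumed number of individuals per group (figure quoted from Hassan, 1981)


def group_density(population, area_km2, group_size=GROUP_SIZE):
    """Number of human groups per km² for a given macro-population and area."""
    n_groups = population / group_size
    return n_groups / area_km2


def individual_density(groups_per_km2, group_size=GROUP_SIZE):
    """Convert a density of groups per km² into individuals per km²."""
    return groups_per_km2 * group_size


if __name__ == "__main__":
    # A few (macro-population, surface in km²) pairs taken from Table 3.
    for population, area in [(10_000, 1_000_000),
                             (250_000, 10_000_000),
                             (5_000_000, 100_000_000)]:
        print(population, area, group_density(population, area))
```

With these assumptions, the range of 0.001 to 0.0001 groups per km² investigated below corresponds to 0.025 to 0.0025 individuals per km², as stated in the text.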
2. Mathematical Models and Computer Simulations
2.1. A mathematical model to evaluate the probabilities of monogenesis or polygenesis of language
Quantifying the probabilities of monogenesis or polygenesis
Except in a few cases for which deciding between monogenesis and polygenesis of an innovation is possible without ambiguity (as we have seen for the development of agriculture or writing systems), it becomes harder to estimate the “mode” of emergence of an innovation when one goes further back in the past and clues become rarer. The main difficulty, as we will see in the next paragraphs, lies in the possibility of an undetected diffusion of the innovation, which might lead to the wrong conclusion of a polygenesis. Even if concretely distinguishing monogenesis from polygenesis is a difficult task in some cases, it is nevertheless possible to compute the probabilities of these events. Indeed, if a mathematical model allows one to conclude that polygenesis of an innovation is much more likely to have taken place than monogenesis, it becomes relevant for the theories on this matter to consider the two possibilities, and not to reject the polygenetic hypothesis without a strong argument. This is all the more crucial as definitive proofs are lacking when one is interested in the origins of the faculty of language or of languages.
Description of Freedman and Wang’s model
Freedman and Wang (1994) investigated the possibility of studying the probabilities of the two scenarios of monogenesis and polygenesis. To this end, they proposed a purely mathematical model that we are now going to describe briefly. One of the aims of the paper was to reformulate correctly the “folk” intuition which assumes that if a rare event is already unlikely to happen once, it will be even less likely to occur twice. The model focuses on the link between the probability p of the emergence of language⁴ at one site and the probabilities of no emergence, monogenesis or polygenesis at n independent sites over a fixed period of time. The mathematical approach enabling the calculation of the probabilities depends on Poisson’s probability distribution, which characterizes the occurrences of rare events, that is those with a low probability of occurrence. The values studied for the probability of emergence at one site p are chosen such that the expected number of sites at which language emerges is 1, 2, 3, etc. The probability of emergence is integrated over the entire time period, and the expected number of sites is therefore equal to p × n. It should be clear here that this expected number reflects a statistical approach, i.e. the mean number of sites at which language emerges if one considers a large number of episodes. By episode, we mean a concrete instantiation of the model described above, at the end of which one of the three possible scenarios has occurred: no emergence, monogenesis or polygenesis. The model does not predict the outcome of a single episode, but rather, for a large number of episodes, the percentages of them which end up as no emergence, monogenesis or polygenesis. The issue is then to investigate how the probabilities of the three scenarios evolve with the product p × n. To ease the understanding of the simulations we develop in the
4. The authors were interested in the emergence of language 1 or 2 million years ago. This rather relates to the notion of function of language, but the model may in fact be applied to any cultural innovation.
remaining parts of this chapter, we find it useful here to slightly modify Freedman and Wang’s model, in a way that does not change any result or interpretation: we modify the meaning of the probability p, which was integrated over a whole time period in the original model, to introduce the notion of time step. Instead of considering an indivisible time period, we consider a number T of time steps, and p the probability of emergence at each time step. This new approach can therefore be seen as the discrete counterpart of Freedman and Wang’s model, since the global time period they considered has been cut into discrete time units. During an episode, at each time step, a random test is performed against the probability p to check whether the innovation emerges or not, provided that it has not emerged before. Within this new framework, it turns out that the relevant parameter, i.e. the expected number of sites, can be replaced by the mean number of times the test against the probability p is positive in the n groups during the T time steps. This new parameter is equal to the product λ = p × n × T. One should notice that since only the value of the product λ is relevant, the time variable and the probability of emergence at one site are not independent; this relationship allows us to focus on the relevant values of p given T: if the values of p are chosen such that the product λ is very small (well below 1.0), then the probabilities of monogenesis and polygenesis will be insignificant. For values of λ well above 1.0 (for example larger than 10.0), the probability of polygenesis will be close to 1.0, and the two other probabilities insignificant. In these two extreme situations, no qualitative transitions would be observed in the probabilities of the “modes” of emergence.
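The discrete reformulation can be illustrated with a small Monte Carlo sketch in Python. It is not the authors' program: the function names `run_episode` and `estimate_modes`, the parameter values and the fixed random seed are all assumptions chosen for illustration. Each group is tested against p at every time step until the innovation first emerges there, and each episode is then classified as no emergence, monogenesis or polygenesis.

```python
import random


def run_episode(p, n, T, rng):
    """Simulate one episode; return the number of groups where the innovation emerged."""
    emerged = 0
    for _ in range(n):
        # A group is tested against p at each time step until its first success.
        for _ in range(T):
            if rng.random() < p:
                emerged += 1
                break
    return emerged


def estimate_modes(p, n, T, episodes=2000, seed=0):
    """Estimate the frequencies of (no emergence, monogenesis, polygenesis)."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    counts = [0, 0, 0]
    for _ in range(episodes):
        k = run_episode(p, n, T, rng)
        counts[min(k, 2)] += 1  # two or more emergences count as polygenesis
    return [c / episodes for c in counts]


if __name__ == "__main__":
    # Illustrative values giving lambda = p * n * T = 1.0
    print(estimate_modes(p=0.001, n=50, T=20))
```

For small p, the estimated frequencies should stay close to those of a Poisson distribution with mean λ = p × n × T, which is the sense in which the time-step version leaves the original results unchanged.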
Results
The graph of Figure 3 provides a good understanding of the situation. For various values of λ = p × n × T, the probabilities that zero, one (monogenesis) or several emergences (polygenesis) take place are displayed.
Figure 3 Evolution of the probabilities of monogenesis and polygenesis of an innovation at several sites according to the probability of emergence at one site; adapted from Freedman and Wang (1996)
The evolution of the curves representing the probabilities of no emergence, monogenesis or polygenesis can be summarized in the following way:
• The probability $p_n$ of no emergence decreases to 0 as λ increases, according to the relation $p_n = \exp(-\lambda)$;
• The probability of monogenesis $p_m$ first increases for small values of the product, and then decreases to 0 (bell-shaped curve): $p_m = \lambda \exp(-\lambda)$;
• The probability of polygenesis $p_p$ increases as the product increases, with the relation $p_p = 1 - (\lambda + 1) \exp(-\lambda)$.
The last two behaviors lead to the existence of a threshold for the probability of emergence at one site (depending on the number of sites n and on T), over which polygenesis becomes more likely than monogenesis. In other terms, large values of p, n or T increase the likelihood of a polygenesis of the innovation, and make it more likely than monogenesis over a given threshold.
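The three relations above, and the threshold they imply, can be checked numerically. The following Python sketch is an illustration only: the function names `mode_probabilities` and `crossover_lambda` are hypothetical, and the bisection bounds are an assumption chosen so that the interval contains a single crossing point.

```python
import math


def mode_probabilities(lam):
    """Poisson probabilities of no emergence, monogenesis and polygenesis."""
    p_none = math.exp(-lam)
    p_mono = lam * math.exp(-lam)
    p_poly = 1.0 - (lam + 1.0) * math.exp(-lam)
    return p_none, p_mono, p_poly


def crossover_lambda(lo=0.1, hi=5.0, tol=1e-9):
    """Bisection for the value of lambda at which polygenesis becomes as likely as monogenesis."""
    def gap(lam):
        _, p_mono, p_poly = mode_probabilities(lam)
        return p_poly - p_mono

    # gap(lo) < 0 and gap(hi) > 0 for the default bounds, with a single sign change in between.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    for lam in (0.1, 1.0, 2.0, 10.0):
        print(lam, mode_probabilities(lam))
    print("crossover:", crossover_lambda())
```

Under these relations the crossing occurs for λ a little above 1.25, so for a given n and T the threshold on p is roughly this value divided by n × T.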
As a consequence, it is necessary to reformulate more precisely the intuition according to which a low probability of emergence at one site makes the emergence at several sites even less frequent, since no threshold is set for such a statement and the intuition does not take into account the way the probabilities at several sites combine. It should be pointed out here that it is nearly impossible to estimate the probability of emergence at one site, which widens the gap between the model and real situations. Nevertheless, the a posteriori knowledge of the situation may contribute toward a better estimation of this probability with the help of conditional probabilities: as the authors suggest, for small values of p, the model does not account for the fact that the innovation has emerged. This is especially meaningful in the case of linguistic components or the faculty of language, because all human populations possess languages with some recurrent linguistic items, a situation to be contrasted with those of agriculture or writing. This may falsely lead to the rejection of monogenesis; however, it remains impossible to conclude polygenesis or monogenesis for a single episode (see definition above), since for a sample of size 1, an unlikely event like monogenesis cannot be ruled out. The fact that for small values of p the model does not account for the emergence of the innovation does not interfere with this issue, but is rather a limitation of the model itself. In conclusion, polygenesis appears more likely than monogenesis for a large range of probabilities of emergence at one site. It also becomes even more likely as the number of sites increases for a fixed value of p. In this model, sites remain abstract entities and are, in particular, totally independent from each other.
But, as we have seen, the reality of prehistory is partly represented by human groups moving in large geographical areas, and as a consequence able to enter into contact and transmit cultural innovations. The frequency of such contacts is hard to estimate intuitively. Freedman and Wang’s model therefore calls for further enhancements to take this aspect into account.
2.2 Measuring the frequency of contacts
Relevant parameters
Estimating the frequency of contact between human groups knowing their density in a geographical area is a rather unintuitive task which calls for precise calculations. Were they meeting every month, every six months, every ten years or even less often? Of course, the density of groups alone, represented hereafter by the variable $d_e$, is not enough, and other parameters have to be considered. We shall restrict our attention to three of them:
• The threshold distance d between two groups for a contact to occur (when does one group detect another one?);
• The “geometrical” features of the groups’ movements;
• The speed of the groups v.
The first parameter may be linked to the surface of the catchment territory of a group, which may be roughly equated to the region where a second group could be detected, not taking such clues as distant sounds or smoke into account. The size of such territories for hunting or gathering food is estimated to be around several hundred square kilometers, as indicated by ethnographic studies on various groups of hunter-gatherers (Biraben, 1997: 46–7). Regarding the “geometrical” features of the movements, it is reasonable to assume more or less directional migrations, from fully random “Brownian-like” trajectories to extremely rectilinear motions, although the latter seem less likely. The speed of displacement may be approximated to several kilometers per year. Dates of archaeological sites related to the migration of the first farmers in Europe 10,000 years ago suggest a speed of 1 kilometer per year for the population wave (Cavalli-Sforza, 1994: 108–9). However, this net speed would only equal the average speed of human groups in the case of unidirectional movement in the same direction as the wave of migration (Hassan, 1981: 200–1). For more random movements, the average speed is necessarily higher.
Moreover, tribes of hunter-gatherers were presumably moving faster than farmers cultivating the ground, given their reduced sedentarization.
First theoretical approach

For some specific movements, a useful analogy allows us to estimate the frequency of contact between groups. Although we will not enter here into the details of the mathematical method (Coupé, 2003), we may summarize the approach by pointing to the similarities between the collisions of molecules in a perfect gas and the contacts between human groups: the diameter of the circular catchment territory of a human group can be likened to the diameter of a molecule. We rely on this similarity to estimate the frequency of contacts in the case of pseudo-rectilinear movements (in a perfect gas, molecules change direction only after a collision). The frequency of contact f for the molecules is approximated by the following formula:
f = de × d × v

A numerical application with a density of 0.001 groups per km², a speed of displacement of 4 km/year and a radius of 10 km (d = 20 km) for the catchment territory gives f = 0.08 contacts per year, that is, around one contact every 12.5 years. The frequency of contact for human groups therefore seems at first extremely low. This may be moderated by differences in regional densities, but one decade seems a reasonable order of magnitude for the period between two contacts. To assess whether such values are reasonable, we will further ground our approach with computer simulations.
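The numerical application above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function and variable names are illustrative and do not come from the original study.

```python
def contact_frequency(density, threshold_distance, speed):
    """Theoretical frequency of contact f = de * d * v (contacts per year).

    density            -- de, groups per km^2
    threshold_distance -- d, km (diameter of the catchment territory)
    speed              -- v, km per year
    """
    return density * threshold_distance * speed


# Values used in the text: de = 0.001 groups/km^2, d = 20 km, v = 4 km/year.
f = contact_frequency(0.001, 20.0, 4.0)
years_between_contacts = 1.0 / f
print(f, years_between_contacts)
```

The second printed value is the expected period between two contacts, which the text gives as about 12.5 years.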
Simulations for rectilinear motions

We programmed a simple computational multi-agent model to measure the frequency of contact experimentally. Agents representing human groups were allowed to move in a square two-dimensional space (with pseudo-rebounds on the boundaries of this space and an initial random distribution) over a long period of time; the number of contacts between them was averaged over this period. The parameters varied as follows:
• Number of agents n: 50, 100, 200 and 400. Additional values were also considered in specific cases (see Figure 5);
• Density of groups de: 0.001, 0.0005 and 0.0001 groups per km². The area of the space was computed from de and n;
• Speed v: 1, 5, 10, 15 or 20 km per year;
• Threshold distance for contact d: 5, 10, 15 and 20 km.

To estimate how close the experimental results were to the theoretical case, we computed the ratio of the experimental frequency of contact f over the theoretical frequency fth for various values of the parameters. Figure 4 displays curves for various values of the speed and size of the catchment territory, while Figures 5 and 6 allow us to estimate the impact of the density and number of groups. In addition to these graphs, various correlations were computed between series of values of f for different values of the parameters, in order to discover mathematical relationships. This approach was applied to all our simulation results and proved to be useful, as will appear later in this chapter.
Figure 4 Evolution of the frequency of contacts as a function of the speed of the agents and the threshold distance for contact (threshold distance for contact on the horizontal axis)

Figure 5 Evolution of the frequency of contacts as a function of the number of human groups. Frequency of contact (density: 0.005 groups/km², threshold distance for contact: 15 km, speed: 10 km/year)

Figure 6 Evolution of the frequency of contacts as a function of the density of human groups
The following results can be drawn from the figures above, from other graphs that were not included, and from analyses of the series of results:
• The density of groups de does not interact with the other parameters. Moreover, the curves displayed in Figure 6 show that the ratio of frequencies f / fth always seems to remain bounded in a narrow interval whatever the value of the density of groups. This means that the frequency of contact f is linearly related to the density of groups;
• The number of groups n does not interact with the other parameters, and the curve in Figure 5 shows that the ratio of frequencies seems to reach an asymptote for large values of n. Since relevant values of n are much larger than 1000, f can be considered independent of n to a first approximation;
• The frequency of contact f can be described by the following formula expressing the independence or interactions between the various parameters: f = d × v × de × g(d, v). For the values of interest to us, g takes values between 0.5 and 3.

It then appears that the former conclusions should be moderated, in that the experimental frequencies of contact f may differ from the theoretical frequency fth, with ratios from 0.5 to 3. However, even with this multiplicative factor, the frequency of contact can be said to remain very small, especially compared to the life span of an individual during prehistory.

Three reasons may be invoked to explain the discrepancies between the theoretical and experimental cases:
• The theoretical formula approximates reality by considering a molecule moving in a space containing static elements it can hit; considering a high velocity for these elements (according to a molecular speed distribution) leads to refining the formula by adding a √2 multiplicative factor to the right-hand side: f = √2 × de × d × v. This specific numerical factor is adapted to the speed distribution of molecules, but the general principle may explain why the ratios of frequencies we observed were mostly greater than 1.0.
• In the theoretical case, molecules move continuously, whereas in our case, the groups jump from one location to the next. For a small speed and a large catchment territory, the area covered by a group after a given time span is close to the area covered in the case of a continuous motion, since the areas covered by the group at each time step largely overlap. However, for larger speeds and smaller catchment territories, the motion cannot be treated as continuous, since the regions covered at each time step only slightly overlap or do not overlap at all. Such behavior modifies the frequency of contact, since on the one hand, a larger area is covered if the catchment territories overlap less, and on the other hand, close groups may miss each other in the case of distant jumps from one area to another.
• The theoretical case assumes an infinite space, whereas our agents were moving in a closed area. For agents close to the boundaries of this area, the frequency of contact is smaller than for agents in the center of the space. The larger the area, the closer the experiments get to the theoretical case.
Non-rectilinear motions

For more random movements, the total area covered by a group during a period of time is intuitively smaller, since this group will more frequently revisit the same places by often changing its direction. However, it remains unclear whether this affects the frequency of contact: groups may reduce their chance of meeting distant groups through more local movements, but this in turn increases their chance of meeting close groups again and again⁵. Once again, we relied on computer simulations to evaluate the situation.

To model the notion of linearity of movement, we introduced an angle α corresponding to the maximum deviation allowed for the direction of a group at each time step; an angle equal to 0 corresponds to a linear motion, whereas an angle of 2π is equivalent to a Brownian motion: at each time step, agents choose a deviation between −π and π. In the former experiments with rectilinear motions, the angle was simply set to 0. We ran experiments for the same values of parameters as above, crossing them with the following values of α: 2π, π, π/3, π/6 and 0.

As explained earlier, we computed correlations between the series of values obtained for these various angles, and compared the numerical values themselves. It appears that the frequency of contacts f obeys the behavior described for rectilinear motions whatever the value of α, and that this angle in fact plays no role in the frequency of contact. We interpret this result as a balance between the two phenomena introduced at the beginning of this section (distant versus local contacts). The frequency of contact can therefore still be computed according to the following formula: f = d × v × de × g(d, v).
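The kind of multi-agent experiment described in this section can be sketched as follows. This is a minimal illustration, not the original program: the one-year time step, the mirror-like rebounds, the rule that a contact is counted when a pair of groups newly comes within distance d, and all names are assumptions made for the example.

```python
import math
import random


def simulate_contact_frequency(n=50, density=0.001, speed=10.0, d=10.0,
                               alpha=0.0, years=200, seed=0):
    """Return the mean number of new contacts per group per year."""
    rng = random.Random(seed)
    side = math.sqrt(n / density)  # side of the square space (km), from de and n
    xs = [rng.uniform(0.0, side) for _ in range(n)]
    ys = [rng.uniform(0.0, side) for _ in range(n)]
    headings = [rng.uniform(-math.pi, math.pi) for _ in range(n)]

    def pairs_in_contact():
        close = set()
        for i in range(n):
            for j in range(i + 1, n):
                if math.hypot(xs[i] - xs[j], ys[i] - ys[j]) <= d:
                    close.add((i, j))
        return close

    in_contact = pairs_in_contact()  # pairs already close at the start are not counted
    new_contacts = 0
    for _ in range(years):  # assumption: one time step represents one year
        for i in range(n):
            # Deviation drawn in [-alpha/2, alpha/2]: alpha = 0 is rectilinear,
            # alpha = 2*pi gives a deviation between -pi and pi.
            headings[i] += rng.uniform(-alpha / 2.0, alpha / 2.0)
            x = xs[i] + speed * math.cos(headings[i])
            y = ys[i] + speed * math.sin(headings[i])
            # Assumed rebound rule: reflect position and heading at the boundary
            # (valid as long as the yearly step is shorter than the side).
            if x < 0.0 or x > side:
                x = -x if x < 0.0 else 2.0 * side - x
                headings[i] = math.pi - headings[i]
            if y < 0.0 or y > side:
                y = -y if y < 0.0 else 2.0 * side - y
                headings[i] = -headings[i]
            xs[i], ys[i] = x, y
        now = pairs_in_contact()
        new_contacts += len(now - in_contact)
        in_contact = now
    # Each contact involves two groups, hence the factor 2.
    return 2.0 * new_contacts / (n * years)
```

Dividing the returned value by de × d × v gives a ratio comparable in spirit to f / fth, although the exact figures depend on the counting conventions assumed here.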
⁵ For a Brownian motion in physics, the study of the intersections of two trajectories of particles (called Wiener sausages) is a hard and still unsolved problem.

Time to complete a diffusion

The former results may falsely lead to the conclusion that the randomness of movements plays no role in the impact of contacts on the emergence of an innovation. However, it is not so much the frequency of contact as the speed of diffusion of this innovation which will be relevant in the coming paragraphs. We ran a last series of simulations to measure the time needed for a diffusion to reach all the groups, once again for different sets of values of the parameters given above. For each set of parameters, the time T for the complete diffusion of the innovation (to all groups, starting from a single group in possession of the innovation) was averaged over 50 simulations run with identical parameters. An additional parameter was added to the model: pt represented the probability that, in case of contact, one agent in possession of the innovation transmits it to a second agent not in possession of it.

Clearly, T depends not only on d, de and v, but also on pt, α and n. Indeed, the local aspect of the contacts in the case of more random movements decreases the speed of diffusion, and the more agents for a given density, the bigger the space to conquer for an innovation. The first phenomenon may be interpreted as a weak coefficient of diffusion in the modeling of epidemics (Murray, 1994: 651–5).

According to the analyses of the numerical outputs of the simulations, and by comparing the values of the frequency of contact f and the time for complete diffusion T for identical sets of parameters, the following relationship links T to the other variables of the model:

$$ T = h_1(f, n, p_t) \times \exp\big(\alpha \, h_2(f, p_t)\big) $$
Figures 7 and 8 illustrate the progressive increase of T as α increases for two sets of values of f and pt. The different shapes of the exponential trend curves and their coefficients clearly point to the second term of the product in the former expression of T. The functions h1 and h2 required detailed investigations before their analytical expressions could be found. First, the analysis of series of values of T for α = 0 led to a further decomposition of the function h1 as follows:

$$ h_1(f, n, p_t) = \frac{\eta \, n^{\delta \, (f \, p_t)^{\beta}}}{(f \, p_t)^{\beta + 1}} $$

The coefficients η, β and δ were then estimated: η = 1.89, β = 0.17 and δ = 0.42.
Figure 7 Evolution of the time for complete diffusion as a function of the directionality of the movements (first set of parameters). Time to complete diffusion, speed: 10 km per year, threshold distance for contact: 10 km, density: 0.001 groups per km², pt = 0.7

Figure 8 Evolution of the time for complete diffusion as a function of the directionality of the movements (second set of parameters). Time to complete diffusion, speed: 10 km per year, threshold distance for contact: 5 km, density: 0.001 groups per km², pt = 0.1
Paying attention to the coefficients of the exponential trend curves for various values of α further led to the following expression for h2(f, pt):

$$ h_2(f, p_t) = \varepsilon \, f \, p_t^{\theta} $$
The coefficients ε and θ were estimated in the same way as η, β and δ previously. The following values were found: ε = 1.21 and θ = 0.69. The expression found for T is rather complex, and may take other simpler forms. It may also be simplified for large values of n, which were however computationally too expensive to be investigated with the necessary high number of simulation runs. Table 4 provides various values of f and T for different values of d, v, de, n, pt and α. Numbers marked with an asterisk are extrapolations, while other values are experimental results.

Table 4 Values of f and T for different sets of parameters of the model (extrapolated values marked with *)

| v (km/year) | d (km) | de (groups/km²) | n     | α (rad) | pt    | f     | T (years) |
|-------------|--------|-----------------|-------|---------|-------|-------|-----------|
| 10          | 10     | 0.0001          | 258   | 0       | 0.7   | 0.022 | 810       |
| 10          | 10     | 0.0001          | 94    | 0       | 0.7   | 0.022 | 695       |
| 8.81        | 13.69  | 0.0034          | 189   | 5.40    | 0.107 | 0.86  | 179.88    |
| 6.94        | 10.25  | 0.001           | 113   | 1.57    | 0.047 | 0.16  | 1359.94   |
| 10          | 10     | 0.001           | 10000 | 1.5     | 0.1   | 0.24  | 1263*     |
| 10          | 10     | 0.001           | 1000  | 1.5     | 0.1   | 0.24  | 756*      |
| 10          | 10     | 0.001           | 10000 | 1.5     | 0.01  | 0.24  | 8953*     |
| 10          | 10     | 0.001           | 1000  | 1.5     | 0.01  | 0.24  | 6329*     |
| 5           | 10     | 0.0005          | 10000 | 3       | 0.01  | 0.065 | 31055*    |
| 5           | 10     | 0.0005          | 1000  | 3       | 0.01  | 0.065 | 23524*    |
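The fitted expression for T can be evaluated numerically, which makes it easy to check it against Table 4. The sketch below assumes the reading h1 = η × n^(δ × (f × pt)^β) / (f × pt)^(β+1) and h2 = ε × f × pt^θ, with the coefficients quoted in the text; under that reading it returns values close to the large-n entries of the table. The function names are illustrative.

```python
import math

# Coefficients quoted in the text.
ETA, BETA, DELTA = 1.89, 0.17, 0.42
EPSILON, THETA = 1.21, 0.69


def h1(f, n, pt):
    """Time for complete diffusion when alpha = 0 (rectilinear motion)."""
    x = f * pt
    return ETA * n ** (DELTA * x ** BETA) / x ** (BETA + 1.0)


def h2(f, pt):
    """Coefficient of alpha in the exponential term."""
    return EPSILON * f * pt ** THETA


def diffusion_time(f, n, pt, alpha):
    """T = h1(f, n, pt) * exp(alpha * h2(f, pt)), in years."""
    return h1(f, n, pt) * math.exp(alpha * h2(f, pt))


# Two parameter sets taken from Table 4 (f, n, pt, alpha).
for params in [(0.24, 1000, 0.1, 1.5), (0.065, 10000, 0.01, 3.0)]:
    print(params, round(diffusion_time(*params)))
```

For the experimental rows of the table (small n), the same function gives values of the right order of magnitude rather than exact matches, as expected from a fitted relationship.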
On the nature of contacts between human groups

The opportunities for contact between human groups were presumably very rare. One question that may be raised is whether these encounters could have been non-violent and sources of cultural exchanges or transfers of individuals. To partially answer this question, the exchanges of lithic material (Marwick, 2002) are a first clue of non-aggressive contacts, unless we always assume fights for the acquisition of these resources. As appears from the analyses of databases of parietal art schemes in France and Spain during the Upper Paleolithic (50,000–10,000 BP) (Sauvet and Wlodarczyk, 1995), local stylistic heterogeneities in a globally homogeneous context for the graphic representations strongly suggest the existence of inter-group relationships. These relationships may be at the origin of both the large-scale homogenization and the preservation of local diversities contributing to the social position of a group among others. Finally, exchanges of genes through the exchanges of individuals would have usefully preserved the diversity of the gene pool of a group. Such exchanges between groups, namely exogamy, are still common among hunter-gatherers like Aboriginal Australians. Welcoming a man or a woman speaking a different language could presumably play a significant role in the linguistic evolution of a group.

Following the previous hypotheses raises the next question: could the contacts between groups have had an impact on the monogenesis or the polygenesis of innovations? This point is now going to be further investigated.
2.3 Combining independent discovery and transmission by contact

Intuitive statements

If one follows the results of sections 2.1 and 2.2, a group may either discover an innovation by itself, or receive it from another group it meets. Intuitively, the resulting probabilities of monogenesis or polygenesis will be the result of the interaction between the diffusion and the emergence of the innovation among the groups: fast diffusion may prevent polygenesis from occurring, since all the groups will have received the innovation by diffusion before having the opportunity to discover it by themselves. Conversely, a slower diffusion will preserve the possibility of polygenesis, and the probabilities of Freedman and Wang’s model will be more likely to be relevant for the real situation. The next simulations are aimed at checking whether this intuition is correct.
Description of the new model

To test the hypothesis of the last paragraph, both aspects of emergence and diffusion of an innovation among human groups were combined in a single multi-agent model. Contrary to the former models, it did not involve a spatial environment, and relied on a simple discrete probabilistic framework involving two parameters:
• The parameter r, expressing the probability for an agent to receive the innovation from groups already in possession of it;
• The parameter pc, giving the probability for an agent to discover the innovation by itself at each time step.

A total number of agents n and a time limit for the simulation Tmax were set for each run. As in Freedman and Wang’s model, the relevant value was the product λ = pc × n × Tmax. At each time step, two random tests were performed for each agent to decide whether it would discover the innovation by itself, or receive it from another group. In the first case, a random number rn1 between 0 and 1 was compared to pc, while in the second case, a second random number rn2 in the same interval was compared to the product of r and the proportion p of agents already in possession of the innovation. An agent not in possession of the innovation i) discovered the innovation by itself if rn1 < pc, or ii) received the innovation from other groups if rn2 < p × r. According to this second condition, no diffusion took place before a first agent had discovered the innovation. After this initial discovery, the diffusion followed a logistic growth, similar to what had been observed in the previous model of diffusion. This justified the use of a simpler and computationally cheaper non-spatial model. The relation between r and T was explored with additional experiments, and the following result was established:

$$ r = \frac{2 \ln(n)}{T} $$
This relation found experimentally can in fact be derived from the equations describing logistic growth, provided that the number of human groups n is large enough.
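A short derivation shows where this relation can come from. It is a sketch under two assumptions that are not spelled out in the text: the proportion p of groups holding the innovation follows the continuous logistic equation with rate r, and "complete diffusion" is read as going from one group (p = 1/n) to all but one group (p = 1 − 1/n), since the continuous curve never reaches 1 exactly.

```latex
\begin{align*}
\frac{dp}{dt} &= r\,p\,(1-p), \qquad p(0) = \frac{1}{n}
  \quad\Longrightarrow\quad p(t) = \frac{1}{1 + (n-1)\,e^{-rt}} \\[4pt]
p(T) = 1 - \frac{1}{n}
  &\quad\Longrightarrow\quad (n-1)\,e^{-rT} = \frac{1}{n-1}
  \quad\Longrightarrow\quad T = \frac{2\ln(n-1)}{r} \\[4pt]
\text{for large } n:\quad T &\approx \frac{2\ln(n)}{r}
  \quad\Longleftrightarrow\quad r \approx \frac{2\ln(n)}{T}
\end{align*}
```

The approximation ln(n − 1) ≈ ln(n) is what requires the number of groups to be large.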
Experiments and results

To test our model, large sets of values for the two parameters r and λ were built, either by crossing random values for both of them or by fixing one parameter and choosing random values for the second. For each set of values of λ and r, 200 runs were performed to measure the probabilities of no emergence, monogenesis and polygenesis. A run was stopped i) as soon as two independent discoveries had taken place, ii) when a single discovery had diffused to the whole population or iii) if the time limit was reached without any discovery of the innovation.

An initial series of runs was first computed to check the conformity of the computer model to the theoretical results established by Freedman and Wang in the case of no diffusion. A value of r equal to 0 reproduced the theoretical values, modulated by a small variability due to the limited number of experiments (an infinite number of simulations should reproduce the mathematical laws exactly).

In order to evaluate the impact of diffusion on Freedman and Wang’s model, we investigated the evolution of the ratios of the experimental probabilities over the theoretical probabilities for no emergence, monogenesis and polygenesis, respectively $r_n = p_n^{exp} / p_n^{theo}$, $r_m = p_m^{exp} / p_m^{theo}$ and $r_p = p_p^{exp} / p_p^{theo}$. It appeared that the first ratio always took values very close to 1, which means that the probability of no emergence is left unaffected by diffusion. This is rather intuitive, since diffusion can only occur when an emergence has already taken place.

In order to evaluate the evolution of the two other ratios, we relied on the following intuitions: i) large values of r should decrease the ratio rp until it eventually reaches 0: an extremely fast diffusion prevents polygenesis; ii) very small values of r should give a ratio close to 1, since there is then virtually no diffusion; iii) the larger the probability pc and λ, the closer to 1 the ratio rp: groups discover the innovation by themselves very quickly, which prevents diffusion from occurring; iv) before and after a phase of transition for the value of the ratio rp, variations in r or λ should have minimal impact. A possible relationship between them and rp could be:

$$ r_p = \frac{1 - \exp(-Cst \cdot f(\lambda, r))}{1 + \exp(-Cst \cdot f(\lambda, r))} = \tanh\left(\tfrac{1}{2}\, Cst \cdot f(\lambda, r)\right) $$

These hypotheses led us to compute for each set of values the expression $-\ln\big((1 - r_p)/(1 + r_p)\big)$, and compare it with the corresponding values of r and λ. We found a very strong positive correlation between the ratio λ / r and the former expression, and the following relation was established:

$$ r_p = \tanh\left(9.32 \times 10^{-5} \, \lambda / r\right) $$

Finally, the ratio $r_m = p_m^{exp} / p_m^{theo}$ could be derived simply from the knowledge of rn and rp.
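The non-spatial model and its stopping rules can be sketched in Python as follows. This is an illustrative reading of the description, not the original program: the proportion p is taken at the start of each time step, a run that reaches the time limit with a single discovery still diffusing is counted as monogenesis, and all names are hypothetical.

```python
import random


def run_once(n, pc, r, t_max, rng):
    """Simulate one run and return 'none', 'monogenesis' or 'polygenesis'."""
    has_innovation = [False] * n
    holders = 0       # agents currently in possession of the innovation
    discoveries = 0   # independent discoveries so far
    for _ in range(t_max):
        p = holders / n  # assumption: proportion of holders at the start of the step
        for i in range(n):
            if has_innovation[i]:
                continue
            if rng.random() < pc:        # independent discovery (rn1 < pc)
                has_innovation[i] = True
                holders += 1
                discoveries += 1
                if discoveries == 2:     # stop rule i): two independent discoveries
                    return "polygenesis"
            elif rng.random() < p * r:   # transmission by contact (rn2 < p x r)
                has_innovation[i] = True
                holders += 1
        if holders == n:                 # stop rule ii): one discovery reached everyone
            return "monogenesis"
    # Stop rule iii) when nothing was discovered; otherwise (assumption) a single
    # discovery still diffusing at the time limit is counted as monogenesis.
    return "monogenesis" if discoveries == 1 else "none"


def estimate_probabilities(n, pc, r, t_max, runs=200, seed=0):
    """Estimate the three probabilities over a number of runs."""
    rng = random.Random(seed)
    counts = {"none": 0, "monogenesis": 0, "polygenesis": 0}
    for _ in range(runs):
        counts[run_once(n, pc, r, t_max, rng)] += 1
    return {outcome: count / runs for outcome, count in counts.items()}
```

For instance, `estimate_probabilities(20, 5e-4, 0.0, 100)` corresponds to λ = pc × n × Tmax = 1 with no diffusion, so the estimated probability of no emergence should be close to exp(−1), up to sampling noise.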
Analysis

From the previous computational study, we are able to propose a relation between the probabilities of monogenesis pm and polygenesis pp, the number of groups n, the expected number λ of groups in which the innovation appears when no diffusion occurs, and the time T for the complete diffusion of the innovation in the population (this last parameter can itself be decomposed as seen previously):

$$ r_p = \tanh\left(9.32 \times 10^{-5} \, \lambda / r\right) $$

$$ \Rightarrow \; p_p = \tanh\left(9.32 \times 10^{-5} \, \frac{\lambda \, T}{2 \ln(n)}\right) \big(1 - \exp(-\lambda) - \lambda \exp(-\lambda)\big) $$

$$ \Rightarrow \; p_m = 1 - \exp(-\lambda) - \tanh\left(9.32 \times 10^{-5} \, \frac{\lambda \, T}{2 \ln(n)}\right) \big(1 - \exp(-\lambda) - \lambda \exp(-\lambda)\big) $$
As we have already made clear, it is often not possible to know the probability of emergence of an innovation at one site; in the former expressions, the parameter λ remains an unknown variable. However, we are able to compute the probabilities of monogenesis and polygenesis for various values of this parameter, n and T. The following table summarizes values of the ratio pp / pm for various values of these parameters. The values of T are identical to the values that were extrapolated in Table 4, and therefore may be related to specific sets of the basic parameters d, v, de, pt and α. Tmax was chosen equal to 50,000 years.

Table 5 Values of the ratio pp / pm for various values of λ, n and T

| T (years), n  | λ = 0.1 | λ = 1  | λ = 2  | λ = 5 | λ = 10 | λ = 50 | λ = 100 |
|---------------|---------|--------|--------|-------|--------|--------|---------|
| 1263, 10,000  | 3.14e-5 | 0.0027 | 0.0089 | 0.032 | 0.068  | 0.45   | 1.29    |
| 756, 1,000    | 2.5e-5  | 0.0021 | 0.0071 | 0.025 | 0.054  | 0.33   | 0.89    |
| 8953, 10,000  | 2.2e-4  | 0.019  | 0.066  | 0.27  | 0.74   | 45.9   | 4300    |
| 6329, 1,000   | 2.1e-4  | 0.018  | 0.062  | 0.25  | 0.67   | 35.2   | 2555    |
| 31055, 10,000 | 7.7e-4  | 0.070  | 0.26   | 1.73  | 11     | 3.3e6  | 2.2e13  |
| 23524, 1,000  | 4.7e-4  | 0.041  | 0.15   | 0.75  | 2.85   | 6822   | 9.3e7   |
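The entries of Table 5 follow directly from the expressions for pp and pm given in the analysis. The sketch below evaluates them; the function name is illustrative, and the values of n (10,000 and 1,000 groups) are those associated with each T in Table 4.

```python
import math

C = 9.32e-5  # constant of the fitted relation r_p = tanh(C * lambda / r)


def poly_mono_ratio(lam, n, T):
    """Ratio p_p / p_m for expected number of emergences lam, n groups
    and time T (years) for complete diffusion, using r = 2 ln(n) / T."""
    r_p = math.tanh(C * lam * T / (2.0 * math.log(n)))
    # Probability of polygenesis without diffusion: 1 - exp(-lam) - lam * exp(-lam).
    p_poly = r_p * (1.0 - math.exp(-lam) - lam * math.exp(-lam))
    p_mono = 1.0 - math.exp(-lam) - p_poly
    return p_poly / p_mono


# First row of Table 5: T = 1263 years, n = 10,000 groups.
for lam in (0.1, 1, 2, 5, 10, 50, 100):
    print(lam, poly_mono_ratio(lam, 10_000, 1263))
```

For very large ratios (the last columns of the slowest-diffusion rows), pm becomes so small that ordinary floating-point arithmetic loses precision, so those entries are better treated as orders of magnitude.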
As already mentioned in section 1.3, we have investigated values of the density which appear to be small compared to estimations by palaeo-anthropologists. However, a significant difference between our simple model of diffusion and reality is that in the former, agents could move without constraints. In reality, natural barriers like hot or cold deserts, mountains, oceans, etc. may have significantly increased the time for diffusion to the whole population. If one accepts that groups were living in restricted areas (along rivers or lakes or seas, near abundant resources, etc.), the transmission of innovations between these regions may have taken a long time⁶. We simulated this effect by considering smaller densities of groups. We then propose that the estimated densities for prehistory correspond to quite long periods for complete diffusion, which increases the likelihood of polygenesis.

It appears finally that for values of the parameters that are congruent with our knowledge of prehistory, the probability of polygenesis may still be comparable with or higher than the probability of monogenesis for small values of λ (for example 5 expected sites of emergence with 10,000 human groups). If one deals with a large number of innovations whose value of λ is large enough, it becomes statistically very likely that some of these innovations will emerge by polygenesis, and others by monogenesis. To rephrase this conclusion in our linguistic framework, if we assume a λ large enough for modern linguistic strategies, the model predicts that at least some of them appeared by polygenesis, possibly the majority.

In the third and last part of this article, in order to further assess the results of our simulations, we will try to adopt a broader perspective concerning “abstract” models and their implicit statements, and to place the polygenesis or monogenesis of linguistic innovations in a context centered on the cognitive capacities of our predecessors.

⁶ Extreme cases may be Australia, which required significant sea-crossings to be reached, or the Americas, which were only accessible through the northern path between Siberia and Alaska.
3. Discussion

3.1. Cognitive potential and structural polygenesis

An implicit assumption behind the former results

The model we have described can be applied to any kind of cultural innovation. As we have mentioned in the first part of the article, we think that linguistic items or “linguistic strategies” are the best candidates when it comes to the evolution of languages and the development of linguistic diversity. We propose indeed to view them as cultural innovations, and to apply the results of our model to write the first draft of a scenario of the development of linguistic diversity, and of modern languages. However, doing this requires a better understanding of the probability of emergence at one site for linguistic components.

A hidden assumption lies behind the former results of the model: the probability of emergence of the innovation has to remain constant during the whole time period considered. Interestingly, analyzing this assumption leads to various comments that may enrich our point of view.

A first aspect relates to the putative link we have already mentioned between the emergence of our species and the emergence of modern languages. Our entire former discussion becomes irrelevant if one cannot assume an approximately constant probability of emergence during the long period of time T considered (typically tens of thousands of years). This is exactly what happens if one assumes that the emergence of our species has to be followed immediately by the emergence of the innovations: all the strategies appear very shortly after the emergence of our species. If this emergence is very localized, as in the Out of Africa theory, then monogenesis is very likely. We will try to refute this hypothesis by introducing the notion of cognitive potential.

Another objection that may be raised against a constant probability of emergence at one site is that interactions between linguistic items create context-dependent probabilities of emergence: a specific strategy will be more likely to appear in some contexts made by other pre-existing strategies.
The notion of cognitive potential

Evolutionary biologists often quote the following saying: “The function makes the organ”. This means that an organ will not first appear randomly before being attributed a function, but will appear or change to satisfy a specific functional requirement. This assumption is of course related to the Darwinian law of natural selection. However, if a functional requirement may spawn an organ, this organ may also be used for a function other than the one it was first developed for. This phenomenon, which has received the name of exaptation, is another way to create a link between a physiological device and a given function. An example is the wing of the bat, which was formerly a forelimb like that of other mammals, but gradually changed to fulfill another function. More and more biologists estimate that most functions appear as exaptations, rather than with the primary appearance of an organ. This may especially be the case for the brain, where neural circuits and structures may have played different roles during evolution. One may refer here for example to MacNeilage’s Frame-Content theory (1998), centered on the evolution of Broca’s area.

We may slightly extend this last point of view by saying that the mechanism of exaptation could be carried over to the domain of higher cognitive functions, and not only restricted to low-level neural networks. A function could then take advantage of already existing cognitive mechanisms to become active in the behavior of an individual. Various authors have already defended this position. Wang (1991) promoted, for example, the idea of a “mosaic” of cognitive functions which would have led to the emergence of language. Writing is a telling example of a cognitive activity which is not the result of an evolutionary process, but relies on a collection of cognitive abilities and neural areas which primarily evolved for other functions.
We propose to use the term “cognitive potential” to describe the fact that some cognitive functions could potentially emerge given an organic or cognitive background. At a given time, what would make them concretely exist or not is a matter of external conditions or events rather than internal requirements.
Modern linguistic strategies may especially be considered as cognitive potentials, relying on various cognitive abilities such as memory, integration of various spatial and temporal frames, attribution of thematic roles, etc. A strong argument in favor of this analysis is that many linguistic strategies do not necessarily appear in all languages, but are readily learnt by young infants receiving them as part of their linguistic input.

This notion of potentiality is highly significant for our discussion, since it implies that the emergence of modern linguistic strategies did not necessarily take place at the same time as the emergence of our species. This opens the door to a progressive emergence of the features of today’s languages, according to cultural innovations taking place in different human groups. It then becomes more plausible to assume a relatively constant probability pc over a longer period of time, which in turn validates the results and hypotheses derived from our computational experiments.

What “events” may trigger the emergence of a new linguistic strategy? This question remains hard to answer, because many factors have to be taken into account. On the one hand, common linguistic phenomena, such as grammaticalization processes, may lead to new linguistic forms after a while: primary forms which were more likely to appear first gradually evolve into more complex states. Other elements can be related to cultural facts: behaviors requiring, for example, the sharing of cognitive representations about spatially and temporally distant situations may have led to the emergence of new linguistic forms to express time and space, just as many new words have been created throughout history to name new concepts or tools.

It is interesting to recall here the gap between the emergence of Homo sapiens and the emergence of behaviors like the first sea-crossings to Australia around 60,000 BP, the religious burial of the dead⁷ or rock painting. If one accepts that several tens of thousands of years separate the emergence of our species and the former behaviors, and that such behaviors require specific linguistic abilities, it is reasonable to assume that the linguistic evolution accompanying these deep changes occurred a long time after the emergence of our species. It is especially relevant to notice here that these behaviors appeared when Homo sapiens had already colonized a significant part of Eurasia and Africa, which implies large areas and multiple natural barriers that prevent fast diffusions from occurring.

⁷ Archaeologists still debate the first burials: whether Neanderthals or early H. sapiens were burying some of their dead remains very controversial. However, H. sapiens’ burials with offerings become obvious during the Upper Palaeolithic (50,000–10,000 BP) (Klein, 1999: 468–70, 550–3).
Structural genesis Linguistic strategies interact with each other. Some of them are exclusive or rarely appear together (Greenberg, 1978). One reason could be that once a cognitive demand is satisfied at a linguistic level, there is no need to have another strategy for the same purpose. Case-markers or word-order are such quite redundant strategies. Others may occur together with a high frequency in today’s languages. Once again, some explanations may be found in cognitive constraints or economy: two strategies may induce the same cognitive operations and hence save computational time for real speech processing. The X-bar rule of the Government and Binding Theory, which defines the position of all heads and specifiers, see (Black, 1999) for definition, may be explained by such computational savings or costs. Because of these interactions, all strategies are not as likely to occur in a context formed by already existing strategies. As a consequence, instead of considering steady probabilities of emergence for all strategies during the whole time period, probabilities at one site should in theory be recomputed at each new emergence to take the various interactions into account. Starting from one or several initial forms, families of languages follow their own pathway, and each new emergence modifies the possible directions of evolution the system may take. The term bifurcation, as introduced for example in linguistics by Ehala (1996: 2–3) to describe the evolution of linguistic systems, appropriately
characterizes the fact that each new emergence, depending on contingent external events, forces the system onto a restricted path toward some of the potential configurations that were initially open to it. This is very similar to the evolution of species, where changes can only occur within the frame defined by the biological characteristics of the organism (Maturana and Mpodozis, 2000). This scheme of interactions between strategies blurs the picture by adding many conditional relationships between the various paths that a linguistic system might take from an initial state. However, one feels intuitively that such structural constraints do not contradict the main result of the model, which is that linguistic strategies are likely to emerge by polygenesis at various locations. For some strategies, the constraints will increase the probability of emergence at one site, while others will become less likely because of pre-existing strategies. We propose the term structural genesis to summarize the fact that linguistic innovations appear under structural constraints.
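The idea that emergence probabilities at one site should be recomputed after each new emergence can be illustrated with a small simulation. The following is only a minimal sketch, not the model used in this chapter: the strategy names, the base probabilities and the interaction multipliers are all hypothetical values chosen for illustration, and interactions are assumed to act multiplicatively.

```python
import random

# Hypothetical base probabilities of emergence per time step at one site.
BASE = {"word_order": 0.05, "case_marking": 0.05, "tense_marking": 0.02}

# Hypothetical interaction factors: (existing strategy, candidate) -> multiplier.
# Values below 1 model redundancy; values above 1 model facilitation.
INTERACTIONS = {
    ("word_order", "case_marking"): 0.2,
    ("case_marking", "word_order"): 0.2,
    ("word_order", "tense_marking"): 1.5,
}


def emergence_probabilities(base, interactions, present):
    """Recompute emergence probabilities for the strategies not yet present."""
    probs = {}
    for strategy, p in base.items():
        if strategy in present:
            continue
        for existing in present:
            # Pairs without an explicit entry are assumed not to interact.
            p *= interactions.get((existing, strategy), 1.0)
        probs[strategy] = min(p, 1.0)
    return probs


def simulate_site(base, interactions, steps, rng):
    """Return the strategies of a single site in their order of emergence."""
    present = []
    for _ in range(steps):
        probs = emergence_probabilities(base, interactions, present)
        for strategy, p in probs.items():
            if rng.random() < p:
                present.append(strategy)
                break  # at most one emergence per step, then recompute
    return present


if __name__ == "__main__":
    print(emergence_probabilities(BASE, INTERACTIONS, ["word_order"]))
    print(simulate_site(BASE, INTERACTIONS, steps=200, rng=random.Random(0)))
```

Under these assumed values, a site that already has word order becomes less likely to develop case marking and more likely to develop tense marking, so different sites can follow different paths from the same initial state.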
3.2 A scenario for the development of linguistic diversity and the emergence of modern languages
Given all the previous proposals and hypotheses, we propose the following sketch for the development of languages:
• Due partially to its social origin, linguistic diversity is as ancient as what we may call the human function of language, likely long before the emergence of Homo sapiens, as suggested by the behavioral achievements of pre-sapiens species.
• Since we postulate that linguistic strategies represent devices to transfer data from a cognitive internal level to an external and shared one (or vice-versa) relying on general cognitive abilities, it is reasonable to assume that these strategies have become increasingly complex with the evolution of our cognitive capacities in the past, especially along the several speciation events which constitute the nodes of our phylogenetic tree.
• Once new cognitive capacities become available, their linguistic “appropriation” or usage is not immediate. Cognitive potentials appear that may be instantiated only after a period of time. The evolution of the probability of emergence at one site and in a given linguistic state is determined by the nature of the new cognitive abilities, the communicative benefit of instantiating the strategy, the structural constraints that weigh on the emergence of the innovation, and cultural evolution and requirements at a more general level.
• The former assumptions allow for either monogenesis or polygenesis of linguistic strategies. Until recently, around 10,000 years ago, population densities were very low, which led to rather slow diffusion of linguistic innovations through contact between human groups. The progressively increasing number of sites and the larger areas colonized seem to favor the polygenesis of various innovations, as was the case for both agriculture and writing.
• According to Freedman and Wang’s (1996) arguments, polygenesis is not necessarily less likely than monogenesis on probabilistic grounds. This proposal has to be qualified, because the diffusion of an innovation may pre-empt its polygenesis. However, the results of our experiments and the large number of linguistic strategies show that the polygenesis of at least part of these strategies cannot be ruled out. If the probability of emergence at one site is large, the model predicts that most of the strategies would have appeared by polygenesis (what the model cannot predict is the exact value of that probability).
If we now turn to the origin of contemporary languages, the former hypotheses can be reformulated in the following way (if one assumes the Out of Africa hypothesis):
• Prior to the emergence of our own species, our
predecessors already possessed a function of language. This function took diverse surface forms, according to their cognitive abilities (and physiological structures). Neither human language nor linguistic diversity appeared with our species.
• With the emergence of our species, new cognitive potentials opened the way to new linguistic strategies. However, these strategies did not all appear right after the speciation event, but over a long time span of several tens of thousands of years. The various linguistic strategies emerged in response to external events, especially those related to the more general cultural development of our ancestors.
• Whether there were one or several ancestors of modern languages soon after the first Homo sapiens, these systems of communication were less complex than today’s languages. Assuming a single ancestor for all modern languages makes little sense if this ancestor only shared a few of the characteristics of contemporary languages, most likely the simplest ones, and if most modern linguistic features appeared later in various populations (partially constrained by the structural interactions between them, and therefore following different paths from the initial emergence). Considering the polygenesis of numerous linguistic strategies, or their monogenesis through diffusion, partially empties this hypothesis of its substance and interest. Moreover, if the transition from pre-modern to modern humans took place slowly and in a large number of groups, various sources could have been at the origin of the linguistic families that later spread all over the world.
• Attempts to reconstruct a unique ancestor of today’s languages on the basis of comparisons of words or typological structures can only be validated if alternative scenarios built on the polygenesis of the considered features are shown to be significantly less likely than the monogenetic hypothesis.
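The probabilistic intuition behind the comparison of monogenesis and polygenesis can be sketched with a simple binomial calculation. This is an illustrative simplification and not necessarily the exact formulation of Freedman and Wang (1996): it assumes a hypothetical number of independent sites, each with the same probability p of producing the innovation, and it ignores diffusion, which the scenario above identifies as a factor that can pre-empt polygenesis.

```python
def genesis_probabilities(n_sites, p):
    """Return the probabilities of (no emergence, monogenesis, polygenesis).

    Assumes n_sites independent sites, each producing the innovation with
    probability p, and no diffusion between sites.
    """
    none = (1 - p) ** n_sites                       # no site innovates
    mono = n_sites * p * (1 - p) ** (n_sites - 1)   # exactly one site innovates
    poly = 1.0 - none - mono                        # two or more sites innovate
    return none, mono, poly


if __name__ == "__main__":
    # Hypothetical values: 100 sites and three per-site probabilities.
    for p in (0.001, 0.01, 0.05):
        none, mono, poly = genesis_probabilities(100, p)
        print(f"p={p}: none={none:.3f} mono={mono:.3f} poly={poly:.3f}")
```

In this simplified setting, monogenesis dominates when the per-site probability is small relative to the number of sites, while polygenesis becomes the more likely outcome as either the probability or the number of sites grows.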
4. Conclusion
The question of the monogenesis or polygenesis of languages is a difficult one, because it involves numerous factors as diffuse as the palaeo-demographic conditions of our ancestors, the mechanisms of a speciation event, the relationships between general cognitive abilities and linguistic “tools,” etc. Most of our proposals are not firm demonstrations, but they nevertheless aim to shed light on elements that are often forgotten in current debates. We have aimed to present this question in the more general framework of the evolution of linguistic diversity. By putting forward a scenario based on the possible polygenesis of linguistic strategies over long periods of time, we conclude that positing a single ancestor of all modern languages makes little sense if this ancestor was much simpler than today’s languages, and if many features evolved later and independently in many human groups scattered all over the Earth. Moreover, we strongly believe that language did not emerge with our species, and that many linguistic strategies, some of them already sophisticated, were used by our earlier ancestors. This position seems to run counter to the hypotheses of linguists such as Merritt Ruhlen (1994), who proposed words of Proto-Sapiens, Homo sapiens’ “first” language, or Murray Gell-Mann, who argues that there are various visible arrows of time in today’s languages, one of them being word order (see his contribution in this volume). Although these proposals have not been contrasted with alternative models based on polygenesis, they do not necessarily contradict our hypotheses, since some (central) features, such as word order or core lexicon, might have already existed in the language of pre-modern humans.⁸ It is actually reasonable to assume that strategies such as word order already existed long before the last 100,000 years: as experiments on monkeys have shown, sequential
8 We do not enter here into the debate whether word replacement totally erases the traces of such ancient lexicons or not (Ringe, 1992).
ordering is an ancient ability (Terrace, 2000), and does not seem to require highly sophisticated cognitive capacities. This contrasts with linguistic strategies that require more complex internal representations, such as expressions of time, aspect, mode, causation (Shibatani and Pardeshi, 2001), etc. This view disagrees with some scholars’ proposal of a catastrophic transition from a proto-language to a fully syntactic one, and calls for an assessment of the cognitive loads of the various linguistic strategies. Contact between small groups of humans and possible linguistic innovations play a central role in our proposals. This emphasis is close to the emphasis on contact in historical linguistics, where these elements are often put forward to criticize the simplified, but useful, Stammbaum model. We therefore propose to project the controversies about the recent evolution of languages back onto their origins. As with recent contacts and the language changes that follow from them, these considerations may help to refine our knowledge and to reconsider well-established theories of the prehistory of languages that may prove implausible in light of such arguments.
Acknowledgments
We thank James Minett for his valuable help and for useful comments that improved the quality of this paper. This work was partially supported by grant #9040781 from the Research Grants Council of the Hong Kong SAR, China.
References
Arbib, Michael A. and Rizzolatti, Giacomo. (1996) Neural expectations: a possible evolutionary path from manual skills to language. Communication and Cognition 29 (3/4), 393–424.
Auroux, Sylvain, and Mayet, Laurent. (2001) Entretien avec Sylvain Auroux, Le mystère des racines. Sciences et Avenir 125, 12–5.
Baddeley, Alan D. (1986) Working memory. Oxford: Clarendon Press.
Bickerton, Derek. (1990) Language and Species. Chicago: The University of Chicago Press.
Biraben, Jean-Noël, Masset, Claude, and Thillaud, Pierre L. (1997) Le peuplement préhistorique de l’Europe. In Histoire des populations de l’Europe, tome 1: Des origines aux prémices de la révolution démographique (pp. 39–92). Paris: Fayard.
Black, Cheryl A. (1999) A step-by-step introduction to the Government and Binding theory of syntax. Summer Institute of Linguistics, http://www.sil.org/mexico/ling/E002-IntroGB.pdf.
Bocquet-Appel, Jean-Pierre, and Demars, Pierre-Yves. (2000) Population kinetics in the upper Palaeolithic in Western Europe. Journal of Archaeological Science 27, 551–70.
Cann, Rebecca L., Stoneking, Mark, and Wilson, Allan C. (1987) Mitochondrial DNA and Human evolution. Nature 325, 31–6.
Cavalli-Sforza, Luigi Luca, Menozzi, Paolo, and Piazza, Alberto. (1994) The history and geography of human genes. Princeton: Princeton University Press.
Chomsky, Noam. (1975) Reflections on language. New York: Pantheon Books.
Coupé, Christophe. (2003) De l’origine du langage à l’origine des langues: Modélisations de l’émergence et de l’évolution des systèmes linguistiques. PhD dissertation in Cognitive Sciences, Univ. Lyon 2.
Coupé, Christophe and Hombert, Jean-Marie. (2002) Language at 70,000 BP: Evidence from sea-crossings. In Proceedings of the Fourth International Conference on the Evolution of Language (p. 27). Harvard.
Croft, William. (1990) Typology and Universals. Cambridge Textbooks in Linguistics. Cambridge: Cambridge University Press.
De Waal, Frans B. (1998) Chimpanzee Politics: Power and Sex among Apes. Baltimore: Johns Hopkins University Press.
D’Errico, Francesco, Henshilwood, Christopher, and Nilssen, Peter. (2001) An engraved bone fragment from c. 70,000-year-old Middle Stone Age levels at Blombos Cave, South Africa: implications for the origin of symbolism and language. Antiquity 75 (288), 309–18.
Dessalles, Jean-Louis. (2000) Aux origines du langage: Une histoire naturelle de la parole. Paris: Hermes Science Publications.
Dixon, Robert M.W. (1997) The rise and fall of languages. Cambridge: Cambridge University Press.
Dunbar, Robin I. M. (1993) Coevolution of neocortical size, group size and language in humans. Behavioral and Brain Sciences 16, 681–94.
Dunbar, Robin I. M. (1996) Grooming, Gossip and the Evolution of Language. London: Faber and Faber.
Ehala, Martin. (1996) Self-organisation and language change. Diachronica 13 (1), 1–28.
Enard, Wolfgang, Przeworski, Molly, Fisher, Simon E., Lai, Cecilia S. L., Wiebe, Victor, Kitano, Takashi, Monaco, Tony and Pääbo, Svante. (2002) Molecular evolution of FOXP2, a gene involved in speech and language. Nature 418, 869.
Freedman, David A. and Wang, William S.-Y. (1996) Language polygenesis: A probabilistic model. Anthropological Sciences 104 (2), 131–8.
Greenberg, Joseph H. (1978) Typology and cross-linguistic generalizations. In Joseph H. Greenberg (Ed.) Universals of human language. Method and Theory, volume 1 (pp. 33–61). Stanford: Stanford University Press.
Harpending, Henry C., Batzer, Mark A., Gurven, Michael, Jorde, Lynn B., Rogers, Alan R., and Sherry, Stephen T. (1998) Genetic traces of ancient demography. Proceedings of the National Academy of Sciences of the United States of America 95, 1961–7.
Harpending, Henry C., Sherry, Stephen T., Rogers, Alan R. and Stoneking, Mark. (1993) The genetic structure of ancient human populations. Current Anthropology 34 (4), 483–96.
Hassan, Fekri A. (1981) Demographic archaeology. Studies in Archaeology. Academic Press.
Hawks, John, Hunley, Keith, Lee, Sang-Hee, and Wolpoff, Milford. (2000) Population bottlenecks and Pleistocene human evolution. Molecular Biology and Evolution 17 (1), 2–22.
Hockett, Charles F. (1960) The origin of speech. Scientific American 203, 88–96.
Jacquesson, François. (2001) Pour une linguistique des quasi-déserts. In Anne-Marie Loffler-Laurian (Ed.) Etudes de linguistique générale et contrastive. Hommage à Jean Perrot (pp. 199–216). Paris: Centre de Recherche sur les Langues et les Sociétés.
Klein, Richard G. (1999) The human career, human biological and cultural origins. Chicago; London: The University of Chicago Press.
Labov, William. (1963) The social motivation of a sound change. Word 19, 273–303.
Lahr, Marta Mirazon and Foley, Robert. (1994) Multiple dispersals and modern human origins. Evolutionary Anthropology 3, 48–60.
Lai, Cecilia S., Fisher, Simon E., Hurst, Jane A., Vargha-Khadem, Faraneh, and Monaco, Anthony P. (2001) A forkhead-domain gene is mutated in a severe speech and language disorder. Nature 413, 519–23.
MacNeilage, Peter F. (1998) The frame/content theory of evolution of speech production. Behavioral and Brain Sciences 21 (4), 499–511.
Marsico, Egidio. (1999) What can a database of proto-languages tell us about the last 10,000 years of sound changes? In Proceedings of the XIVth International Congress of Phonetic Sciences. San Francisco.
Marsico, Egidio, Coupé, Christophe, and Pellegrino, François. (2000) Evaluating the influence of language contact on lexical changes. In Proceedings of the Third International Conference on the Evolution of Language (pp. 154–5). Paris: Ecole Nationale Supérieure des Télécommunications.
Marwick, Ben. (2002) Raw material transportation as an indicator of hominid symbolic linguistic capacity during the Pleistocene. In Proceedings of the Fourth International Conference on the Evolution of Language (p. 74). Harvard.
Maturana, Humberto and Mpodozis, Jorge. (2000) The origin of species by means of natural drift. Revista Chilena de Historia Natural 73, 261–310.
Murray, James D. (1994) Mathematical biology. Second, Corrected Edition. Berlin; Heidelberg; New York: Springer Verlag.
Naccache, Albert F. H. (2002) Sociolinguistic approaches to the prehistory of the Mashriqian (Semitic) languages. In Proceedings of the Fourth International Conference on the Evolution of Language (p. 80). Harvard.
Nettle, Daniel. (1999a) Linguistic Diversity. Oxford Linguistics. Oxford: Oxford University Press.
Nettle, Daniel. (1999b) Using social impact theory to simulate language change. Lingua 108, 95–117.
Nettle, Daniel. (1999c) Is the rate of linguistic change constant? Lingua 108, 119–136.
Nettle, Daniel and Romaine, Suzanne. (2000) Vanishing voices. Oxford: Oxford University Press.
Nichols, Johanna. (1992) Linguistic diversity in space and time. Chicago; London: University of Chicago Press.
Pagel, Mark. (2000) The history, rate and pattern of world linguistic evolution. In Chris Knight, Michael Studdert-Kennedy, and Jim Hurford (Eds.) The evolutionary emergence of language (pp. 391–416). Cambridge: Cambridge University Press.
Ringe, Donald A., Jr. (1992) On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82 (1), 1–109.
Rizzolatti, Giacomo, Fadiga, Luciano, Gallese, Vittorio, and Fogassi, Leonardo. (1996) Premotor cortex and the recognition of motor action. Cognitive Brain Research 3, 131–41.
Ruhlen, Merritt. (1994) The origin of language. Tracing the evolution of the mother tongue. New York: John Wiley and Sons.
Sauvet, Georges and Wlodarczyk, André. (1995) Eléments d’une grammaire formelle de l’art pariétal paléolithique. L’Anthropologie 99 (2–3), 193–211.
Savage-Rumbaugh, Sue, Shanker, Stuart G., and Taylor, Talbot J. (1998) Apes, language, and the human mind. Oxford: Oxford University Press.
Schoenemann, P. Thomas. (1999) Syntax as an emergent characteristic of the evolution of semantic complexity. Minds and Machines 9, 309–46.
Sherry, Stephen T., Rogers, Alan R., Harpending, Henry C., Soodyall, Himla, Jenkins, Trefor, and Stoneking, Mark. (1994) Mismatch distributions of mtDNA reveal recent human population expansions. Human Biology 66 (5), 761–75.
Sherry, Stephen T., Harpending, Henry C., Batzer, Mark A., and Stoneking, Mark. (1997) Alu evolution in human populations: Using the coalescent to estimate effective population size. Genetics 147, 1977–80.
Shibatani, Masayoshi and Pardeshi, Prashant. (2001) The causative continuum. The grammar of causation and interpersonal manipulation. In Masayoshi Shibatani (Ed.) Typological Studies in Language 48 (pp. 85–126). Amsterdam; Philadelphia: John Benjamins Publishing Company.
Sperber, Dan. (1995) How do we communicate? In John Brockman and Katinka Matson (Eds.) How things are: A science toolkit for the mind (pp. 191–9). New York: Morrow.
Stringer, Chris B. and Andrews, Peter. (1988) Genetic and fossil evidence for the origin of modern humans. Science 239, 1263–7.
Terrace, Herbert S. (2000) Serial expertise and the evolution of language. In Proceedings of the Third International Conference on the Evolution of Language (pp. 154–155). Paris: Ecole Nationale Supérieure des Télécommunications.
Thorne, Alan G. and Wolpoff, Milford H. (1992) The multiregional evolution of humans. Scientific American 266, 76–83.
Wang, William S.-Y. (1973) Chinese language. Scientific American 228, 50–60.
Wang, William S.-Y. (1991) Explorations in language evolution. In William S.-Y. Wang (Ed.) Explorations in Language (pp. 105–130). Taiwan: Pyramid Press.
Zhivotovsky, Lev A., Bennett, Lynda, Bowcock, Anne M., and Feldman, Marcus W. (2000) Human population expansion and microsatellite variation. Molecular Biology and Evolution 17 (5), 757–67.
Zubrow, Ezra B. (1989) The Demographic Modelling of Neanderthal Extinction. In Paul Mellars and Chris Stringer (Eds.) The Human Revolution: behavioural and biological perspectives on the origins of modern humans (pp. 212–31). Princeton, New Jersey: Princeton University Press.
Part 2 Language Acquisition
6 Multiple-Cue Integration in Language Acquisition: A Connectionist Model of Speech Segmentation and Rule-like Behavior
Morten H. Christiansen Cornell University
Christopher M. Conway Cornell University
Suzanne Curtin University of Pittsburgh
1. Introduction
Considerable research in language acquisition has addressed the extent to which basic aspects of linguistic structure might be identified on the basis of probabilistic cues in caregiver speech to children. In this chapter, we examine systems that have the capacity to extract and store various statistical properties of language. In particular, groups of overlapping, partially predictive cues are increasingly attested to in research on language development (e.g., Morgan and Demuth, 1996). Such cues tend to be probabilistic and violable, rather than categorical or rule-governed. Importantly, these systems incorporate mechanisms for integrating different sources of information, including cues that may not be very informative when
considered in isolation. We explore the idea that conjunctions of these cues provide evidence about aspects of linguistic structure that is not available from any single source of information, and that this process of integration reduces the potential for making false generalizations. Thus, we argue that there are mechanisms for efficiently combining cues of even very low validity, that such combinations of cues are the source of evidence about aspects of linguistic structure that would be opaque to a system insensitive to such combinations, and that these mechanisms are used by children acquiring languages (for a similar view, see Bates and MacWhinney, 1987). These mechanisms also play a role in skilled language comprehension and are the focus of so-called constraint-based theories of sentence processing (Cottrell, 1989; MacDonald, Pearlmutter and Seidenberg, 1994; Trueswell and Tanenhaus, 1994) that emphasize the use of probabilistic sources of information in the service of computing linguistic representations. Since the learners of a language grow up to use it, investigating these mechanisms provides a link between language learning and language processing (Seidenberg, 1997).
In the standard learnability approach, language acquisition is viewed in terms of the task of acquiring a grammar (e.g., Pinker, 1994; Gold, 1967). This type of learning mechanism presents classic learnability issues: there are aspects of language for which the input is thought to provide no evidence, and the evidence that does exist tends to be unreliable. Following Christiansen, Allen and Seidenberg (1998), we propose an alternative view in which language acquisition can be seen as involving several simultaneous tasks. The primary task, which is the language learner’s goal, is to comprehend the utterances to which she is exposed for the purpose of achieving specific outcomes.
In the service of this goal the child attends to the linguistic input, picking up different kinds of information, subject to perceptual and attentional constraints. There is a growing body of evidence that as a result of attending to sequential stimuli, both adults and children incidentally encode statistically salient regularities of the signal (e.g., Cleeremans, 1993; Saffran, Aslin and Newport, 1996; Saffran, Newport and Aslin, 1996). The child’s
immediate task, then, is to update her representation of these statistical aspects of language. Our claim is that knowledge of other, more covert aspects of language is derived as a result of how these representations are combined through multiple-cue integration. Linguistically relevant units (e.g., words, phrases, and clauses) emerge from statistical computations over the regularities induced via the immediate task. On this view, the acquisition of knowledge about linguistic structures that are not explicitly marked in the speech signal, on the basis of information that is, can be seen as a third, derived task. We address these issues in the specific context of learning to identify individual words in speech. In the research reported below, the immediate task is to encode statistical regularities concerning phonology, lexical stress and utterance boundaries. The derived task is to integrate these regularities in order to identify the boundaries between words in speech.
The remainder of this chapter presents our work on the modeling of early infant speech segmentation in connectionist networks trained to integrate multiple probabilistic cues. We first describe past work exploring the segmentation abilities of our model (Allen and Christiansen, 1996; Christiansen, 1998; Christiansen et al., 1998). Although we concentrate here on the relevance of combinatorial information to this specific aspect of acquisition, our view is that similar mechanisms are likely to be relevant to other aspects of acquisition and to skilled performance. Next, we present results from a new set of simulations¹ that extends the coverage of the model to include recent controversial data on purported rule-learning by infants (Marcus, Vijayan, Rao and Vishton, 1999). New empirical predictions concerning the role of segmentation in rule-like behavior are derived from the model, and confirmed by artificial language learning experiments with adult participants.
Finally, we discuss how multiple cue integration works and how this approach may be extended beyond speech segmentation.
1 Parts of the simulation results have previously been reported in conference proceedings: Christiansen, Conway and Curtin (2000).
2. The Segmentation Problem
Before an infant can even start to learn how to comprehend a spoken utterance, the speech signal must first be segmented into words. Thus, one of the initial tasks that the child is confronted with when embarking on language acquisition involves breaking the continuous speech stream into individual words. Discovering word boundaries is a nontrivial problem, as there are no acoustic correlates in fluent speech to the white spaces that separate words in written text. There are, however, a number of sub-lexical cues which could potentially be integrated in order to discover word boundaries. The segmentation problem therefore provides an appropriate domain for assessing our approach insofar as there are many cues to word boundaries, including prosodic and distributional information, none of which is sufficient for solving the task alone.
Early models of spoken language processing assumed that word segmentation occurs as a byproduct of lexical identification (e.g., Cole and Jakimik, 1978; Marslen-Wilson and Welsh, 1978). More recent accounts hold that adults use segmentation procedures in addition to lexical knowledge (Cutler, 1996). These procedures are likely to differ across languages, and presumably include a variety of sublexical skills. For example, adults tend to make consistent judgements about possible legal sound combinations that could occur in their native language (Greenberg and Jenkins, 1964). This type of phonotactic knowledge may aid in adult segmentation procedures (Jusczyk, 1993). Additionally, evidence from perceptual studies suggests that adults know about and utilize language-specific rhythmic segmentation procedures in processing utterances (Cutler, 1994). The assumption that children are not born with the knowledge sources that appear to subserve segmentation processes in adults seems reasonable since they have neither a lexicon nor knowledge of the phonological or rhythmic regularities underlying the words of the particular language being learned.
Therefore, one important developmental question concerns how the child comes to achieve
steady-state adult behavior. Intuitively, one might posit that children begin to build their lexicon by hearing words in isolation. A single-word strategy whereby children adopted entire utterances as lexical candidates would appear to be viable very early in acquisition. In the Bernstein-Ratner (1987) and the Korman (1984) corpora, 22–30% of child-directed utterances are made up of single words. However, many words, such as determiners, will never occur in isolation. Moreover, this strategy is hopelessly underpowered in the face of the increasing size of utterances directed toward infants as they develop. Instead, the child must develop viable strategies that will allow her to detect utterance internal word boundaries regardless of whether or not the words appear in isolation. A more realistic suggestion is that a bottom-up process exploiting sub-lexical units allows the child to bootstrap the segmentation process. This bottom-up mechanism must be flexible enough to function despite cross-linguistic variation in the constellation of cues relevant for the word segmentation task. Strategies based on prosodic cues (including pauses, segmental lengthening, metrical patterns, and intonation contour) have been proposed as a way of detecting word boundaries (Cooper and Paccia-Cooper, 1980; Gleitman, Gleitman, Landau and Wanner, 1988). Other recent proposals have focused on the statistical properties of the target language that might be utilized in early segmentation. Considerable attention has been given to lexical stress and sequential phonological regularities — two cues also utilized in the Christiansen et al. (1998) segmentation model. In particular, Cutler and her colleagues (e.g., Cutler and Mehler, 1993) have emphasized the potential importance of rhythmic strategies to segmentation. 
They have suggested that skewed stress patterns (e.g., the majority of words in English have strong initial syllables) play a central role in allowing children to identify likely boundaries. Evidence from speech production and perception studies with preverbal infants supports the claim that infants are sensitive to rhythmic structure and its relationship to lexical segmentation by nine months (Jusczyk, Cutler and Redanz, 1993). A potentially relevant source of information for determining word boundaries is
the phonological regularities of the target language. A recent study by Jusczyk, Friederici and Svenkerud (1993) suggests that, between 6 and 9 months, infants develop knowledge of phonotactic regularities in their language. Furthermore, there is evidence that both children and adults are sensitive to and can utilize such information to segment the speech stream. Work by Saffran, Newport and Aslin (1996) shows that adults are able to use phonotactic sequencing to determine possible and impossible words in an artificial language after only 20 minutes of exposure. They suggest that learners may be computing the transitional probabilities between sounds in the input and using the strengths of these probabilities to hypothesize possible word boundaries. Further research provides evidence that infants as young as 8 months show the same type of sensitivity after only three minutes of exposure (Saffran, Aslin and Newport, 1996). Thus, children appear to have sensitivity to the statistical regularities of potentially informative sublexical properties of their languages such as stress and phonotactics, consistent with the hypothesis that these cues could play a role in bootstrapping segmentation.
The issue of when infants are sensitive to particular cues and how strong a particular cue is to word boundaries has been addressed by Mattys, Jusczyk, Luce and Morgan (1999). They examined how infants would respond to conflicting information about word boundaries. Specifically, Mattys et al. (Experiment 4) found that when sequences which had good prosodic information but poor phonotactic cues were tested against sequences that had poor prosodic information but good phonotactic cues, the 9-month-old infants gave greater weight to the prosodic information. Nonetheless, the integration of these cues could potentially provide reliable segmentation information, since phonotactic and prosodic information typically align with word boundaries, thus strengthening the boundary information.
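The transitional-probability idea mentioned above can be made concrete with a short sketch. This is an illustration only, not the connectionist model presented in this chapter: the three trisyllabic nonsense words, their order in the stream and the boundary threshold are made-up assumptions. The transitional probability of syllable b given syllable a is estimated as the count of the pair (a, b) divided by the count of a as the first member of any pair, and a word boundary is hypothesized wherever that probability falls below the threshold.

```python
from collections import Counter

# Made-up nonsense words (as syllable lists) and an arbitrary word order.
LEXICON = {
    "A": ["bi", "da", "ku"],
    "B": ["pa", "do", "ti"],
    "C": ["go", "la", "bu"],
}
ORDER = ["A", "B", "C", "B", "A", "C", "A", "B"]
STREAM = [syllable for word in ORDER for syllable in LEXICON[word]]


def transitional_probabilities(syllables):
    """Estimate P(next syllable | current syllable) from a syllable stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


def segment(syllables, threshold):
    """Split a non-empty stream wherever the transition falls below threshold."""
    tps = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))  # low probability: hypothesize a boundary
            current = []
        current.append(b)
    words.append("".join(current))
    return words


if __name__ == "__main__":
    print(segment(STREAM, threshold=0.9))
```

In this toy stream every within-word transition has probability 1, whereas transitions across word boundaries are at most 2/3, so the threshold separates the two cases cleanly. Natural speech offers no such clean separation, which is one reason why a single cue is unlikely to suffice.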
2.1 Segmenting using multiple cues
The input to the process of language acquisition comprises a complex combination of multiple sources of information. Clusters of
Multiple-Cue Integration in Language Acquisition
such information sources appear to inform the learning of various linguistic tasks (see contributions in Morgan and Demuth, 1996). Each individual source of information, or cue, is only partially reliable with respect to the particular task in question. In addition to the previously mentioned cues (phonotactics and lexical stress), utterance boundary information has also been hypothesized to provide useful information for locating word boundaries (Aslin et al., 1996; Brent and Cartwright, 1996). These three sources of information provide the learner with cues to segmentation. As an example, consider the two unsegmented utterances (represented in orthographic format):

Therearenospacesbetweenwordsinfluentspeech#
Yeteachchildseemstograspthebasicsquickly#

There are sequential regularities found in the phonology (here represented as orthography) which can aid in determining where words may begin or end. The consonant cluster sp can be found both at word beginnings (spaces and speech) and at word endings (grasp). However, a language learner cannot rely solely on such information to detect possible word boundaries. This is evident when considering that the sp consonant cluster also can straddle a word boundary, as in cats pajamas, and occur word internally, as in respect.

Lexical stress is another useful cue to word boundaries. For example, in English most disyllabic words have a trochaic stress pattern with a strongly stressed syllable followed by a weakly stressed syllable. The two utterances above include four such words: spaces, fluent, basics, and quickly. Word boundaries can thus be postulated following a weak syllable. However, this source of information is only partially reliable, as is illustrated by the iambic stress pattern found in the word between from the above example.

The pauses at the end of utterances (indicated above by #) also provide useful information for the segmentation task.
If children realize that sound sequences occurring at the end of an utterance always form the end of a word, then they can utilize information about utterance final phonological sequences to postulate word boundaries whenever these sequences occur inside an utterance.
Thus, knowledge of the rhyme eech# from the first example utterance can be used to postulate a word boundary after the similar sounding sequence each in the second utterance. As with phonological regularities and lexical stress, utterance boundary information cannot be used as the only source of information about word boundaries because some words, such as determiners, rarely, if ever, occur at the end of an utterance. This suggests that information extracted from clusters of cues may be used by the language learner to acquire the knowledge necessary to perform the task at hand.
3. A Computational Model of Multiple-cue Integration in Speech Segmentation
Several computational models of word segmentation have been implemented to address the speech segmentation problem. However, these models tend to exploit solitary sources of information. For example, Cairns, Shillcock, Chater and Levy (1997) demonstrated that sequential phonotactic structure was a salient cue to word boundaries while Aslin, Woodward, LaMendola and Bever (1996) illustrated that a back-propagation model could identify word boundaries fairly accurately based on utterance final patterns. Perruchet and Vinter (1998) demonstrated that a memory-based model was able to segment small artificial languages, such as the one used in Saffran, Aslin and Newport (1996), given phonological input in syllabic format. More recently, Dominey and Ramus (2000) found that recurrent networks also show sensitivity to serial and temporal structure in similar miniature languages. On the other hand, Brent and Cartwright (1996) have shown that segmentation performance can be improved when a statistically-based algorithm is provided with phonotactic rules in addition to utterance boundary information. Along similar lines, Allen and Christiansen (1996) found that the integration of information about phonological sequences and the presence of utterance boundaries improved the segmentation of a small artificial language. Based on this work, we
suggest that the integration of multiple probabilistic cues may hold the key to solving the word segmentation problem, and discuss a computational model that implements this solution.
Figure 1 Illustration of the SRN used in Christiansen et al. (1998). Arrows with solid lines indicate trainable weights, whereas the arrow with the dashed line denotes the copy-back weights (which are always 1). UB refers to the unit coding for the presence of an utterance boundary. The presence of lexical stress is represented in terms of two units, S and P, coding for secondary and primary stress, respectively. (Adapted from Christiansen et al., 1998).
[Figure 1 diagram. Output layer (next segment): Phonemes, UB, S, P. Hidden Units, with a copy-back connection to the Context Units (previous internal state). Input layer (current segment): Phonetic Features, UB, S, P.]
Christiansen et al. (1998) provided a comprehensive computational model of multiple cue integration in early infant speech segmentation. They employed a Simple Recurrent Network (SRN; Elman, 1990) as illustrated in Figure 1. This network is essentially a standard feed-forward network equipped with an extra layer of so-called context units. At a particular time step, t, an input pattern is propagated through the hidden unit layer to the output layer (solid arrows). At the next time step, t+1, the activation of the hidden unit layer at the previous time step, t, is copied back to the context layer (dashed arrow) and paired with the current input (solid arrow). This means that the current state of the hidden units
can influence the processing of subsequent inputs, providing a limited ability to deal with integrated sequences of input presented successively.

The SRN model was trained on a single pass through a corpus consisting of 8181 utterances of child-directed speech. These utterances were extracted from the Korman (1984) corpus (a part of the CHILDES database, MacWhinney, 1991) consisting of speech directed at pre-verbal infants aged 6–16 weeks. The training corpus consisted of 24,648 words distributed over 814 types and had an average utterance length of 3.0 words (see Christiansen et al., 1998, for further details). A separate corpus consisting of 927 utterances and with the same statistical properties as the training corpus was used for testing. Each word in the utterances was transformed from its orthographic format into a phonological form, and lexical stress was assigned using a dictionary compiled from the MRC Psycholinguistic Database available from the Oxford Text Archive.²

As input, the network was provided with different combinations of three cues depending on the training condition. The cues were (a) phonology, represented in terms of 11 features on the input and 36 phonemes on the output;³ (b) utterance boundary information, represented as an extra feature (UB) marking utterance endings; and (c) lexical stress, coded over two units as either no stress, secondary or primary stress (see Figure 1). The network was trained on the immediate task of predicting the next phoneme in a sequence as well as the appropriate values for the utterance boundary and stress units. In learning to perform this task, it was expected that the network would also learn to integrate the cues such that it could carry out the derived task of segmenting the input into words. With respect to the network, the logic behind the derived task is
² Note that these phonological citation forms were unreduced (i.e., they do not include the reduced vowel schwa). The stress cue therefore provides additional information not available in the phonological input.
³ Phonemes were used as output in order to facilitate subsequent analyses of how much knowledge of phonotactics the net had acquired.
that the end of an utterance is also the end of a word. If the network is able to integrate the provided cues in order to activate the boundary unit at the ends of words occurring at the end of an utterance, it should also be able to generalize this knowledge so as to activate the boundary unit at the ends of words which occur inside an utterance (Aslin et al., 1996). Figure 2 shows a snapshot of SRN segmentation performance on the first 37 phoneme tokens in the training corpus. Activation of the boundary unit at a particular position corresponds to the network’s hypothesis that a boundary follows this phoneme. Black bars indicate the activation at lexical boundaries, whereas the grey bars correspond to activation at word internal positions. Activations above the mean boundary unit activation for the corpus as a whole (horizontal line) are interpreted as the postulation of a word boundary. As can be seen from the figure, the SRN performed well on this part of the training set, correctly segmenting out all of the 12 words save one (/slipI/ = sleepy).
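The copy-back mechanism of the SRN, and the way the boundary unit is read off the output layer, can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: the hidden layer size, the initial context state, the weight range, the absence of bias units, the position of the boundary unit in the output vector and the placeholder input pattern are all invented here, and training by back-propagation is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class SimpleRecurrentNetwork:
    """Forward pass of an Elman-style SRN; training is omitted."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Small random weights; the range is illustrative.
            return [[rng.uniform(-0.25, 0.25) for _ in range(cols)] for _ in range(rows)]

        self.w_in = init(n_hidden, n_in)       # input -> hidden (trainable)
        self.w_ctx = init(n_hidden, n_hidden)  # context -> hidden (trainable)
        self.w_out = init(n_out, n_hidden)     # hidden -> output (trainable)
        self.context = [0.0] * n_hidden        # assumed initial internal state

    def step(self, x):
        hidden = [
            sigmoid(sum(w * v for w, v in zip(row_in, x))
                    + sum(w * c for w, c in zip(row_ctx, self.context)))
            for row_in, row_ctx in zip(self.w_in, self.w_ctx)
        ]
        output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in self.w_out]
        self.context = list(hidden)  # copy-back: fixed one-to-one copy, never trained
        return output


N_IN = 11 + 1 + 2    # phonetic features + UB + stress units (S, P)
N_OUT = 36 + 1 + 2   # phonemes + UB + stress units (S, P)
UB_OUT = 36          # assumed index of the boundary unit in the output vector

net = SimpleRecurrentNetwork(N_IN, n_hidden=8, n_out=N_OUT)
segment_vector = [1.0, 0.0] * 7  # placeholder input pattern, not a real phoneme encoding
first = net.step(segment_vector)
second = net.step(segment_vector)
print(first[UB_OUT], second[UB_OUT])
```

Presenting the same input twice yields different boundary unit activations because the second step also receives the hidden state left behind by the first. That dependence on prior context is what lets a trained network treat the same phoneme differently at different positions in a sequence.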
Figure 2 The activation of the boundary unit during the processing of the first 37 phoneme tokens in the Christiansen et al. (1998) training corpus. A gloss of the input utterances is found beneath the input phoneme tokens. (Adapted from Christiansen et al., 1998).
[Figure 2 plot. Vertical axis: Boundary Unit Activation (0 to 0.7), with separate bars for Word Boundary Activation and Word Internal Activation. Horizontal axis: Phoneme Tokens, e l @U h e l @U # @U d I @ # @U k V m 0 n # A j u e I s l i p I h e d. Gloss: (H)ello hello # Oh dear # Oh come on # Are you a sleepy head?]
In order to provide a more quantitative measure of performance, accuracy and completeness scores (Brent and Cartwright, 1996) were calculated for the separate test corpus consisting of utterances not seen during training:

\[
\text{Accuracy} = \frac{\text{Hits}}{\text{Hits} + \text{FalseAlarms}}
\qquad
\text{Completeness} = \frac{\text{Hits}}{\text{Hits} + \text{Misses}}
\]
Accuracy provides a measure of how many of the words that the network postulated were actual words, whereas completeness provides a measure of how many of the actual words the net discovered. Consider the following hypothetical example:

#the#dog#s#chase#thec#at#

where # corresponds to a predicted word boundary. Here the hypothetical learner correctly segmented out two words, the and chase, but also falsely segmented out dog, s, thec, and at, thus missing the words dogs, the, and cat. This results in an accuracy of $\frac{2}{2+4} = 33.3\%$ and a completeness of $\frac{2}{2+3} = 40.0\%$.
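The arithmetic of the hypothetical example can be checked with a short sketch. It assumes that a hit is a postulated word whose start and end positions both coincide with those of an actual word; the function names are invented for this illustration.

```python
def spans(words):
    """Map a segmentation to the set of (start, end) character offsets of its words."""
    result, position = set(), 0
    for word in words:
        result.add((position, position + len(word)))
        position += len(word)
    return result


def accuracy_and_completeness(predicted, actual):
    """Score a predicted segmentation against the actual one at the word level."""
    predicted_spans, actual_spans = spans(predicted), spans(actual)
    hits = len(predicted_spans & actual_spans)
    false_alarms = len(predicted_spans - actual_spans)
    misses = len(actual_spans - predicted_spans)
    return hits / (hits + false_alarms), hits / (hits + misses)


predicted = "the dog s chase thec at".split()
actual = "the dogs chase the cat".split()
accuracy, completeness = accuracy_and_completeness(predicted, actual)
print(f"accuracy = {accuracy:.1%}, completeness = {completeness:.1%}")
```

Comparing positions rather than word strings matters here: the second actual the is not counted as found merely because the string the was postulated earlier in the utterance.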
Figure 3 Word accuracy (left) and completeness (right) scores for the net trained with three cues (phon-ub-stress — white bars) and the net trained with two cues (phon-ub — grey bars).
With these measures in hand, we compare the performance of nets trained using phonology and utterance boundary information, with or without the lexical stress cue, to illustrate the advantage of getting an extra cue. As illustrated by Figure 3, the phon-ub-stress network was significantly more accurate (42.71% vs. 38.67%: χ² = 18.27, p < .001) and had a significantly higher completeness score (44.87% vs. 40.97%: χ² = 11.51, p < .001) than the phon-ub network. These results thus demonstrate that having to integrate the additional stress cue with the phonology and utterance boundary cues during learning provides for better performance.

To test the generalization abilities of the networks, segmentation performance was recorded on the task of correctly segmenting novel words. The three-cue net was able to segment 23 of the 50 novel words, whereas the two-cue network was able to segment only 11 novel words. Thus, the phon-ub-stress network achieved a word completeness of 46%, which was significantly better (χ² = 4.23, p < .05) than the 22% completeness obtained by the phon-ub net. These results therefore support the supposition that the integration of three cues promotes better generalization than the integration of two cues. Furthermore, the three-cue net also developed a trochaic bias, and was nearly twice as good at segmenting out novel bisyllabic words with a trochaic stress pattern as novel words with an iambic stress pattern.

Overall, the simulation results from Christiansen et al. (1998) show that the integration of probabilistic cues forces the networks to develop representations that allow them to perform quite reliably on the task of detecting word boundaries in the speech stream.⁴
This result is encouraging given that the segmentation task shares many properties with other language acquisition problems which have been taken to require innate linguistic knowledge for their solution, and yet it seems clear that discovering the words of one’s native language must be an acquired skill. The simulations also
⁴ These results were replicated across different initial weight configurations and with different input/output representations.
demonstrated how a trochaic stress bias could emerge from the statistics in the input, without having anything like the “periodicity bias” of Cutler and Mehler (1993) built in. Below, we take our approach one step further demonstrating how our model can accommodate recent evidence regarding rule-like behavior in infancy.
4. Simulation 1: A Multiple-cue Integration Account of Rule-like Behavior
The nature of the learning mechanisms that infants bring to the task of language acquisition is a major focus of research in cognitive science. With the rise of connectionism, much of the scientific debate surrounding this research has focused on whether rules are necessary to explain language acquisition. All parties in the debate acknowledge that statistical learning mechanisms form a necessary part of the language acquisition process (e.g., Christiansen and Curtin, 1999; Marcus et al., 1999; Pinker, 1991). However, there is much disagreement over whether a statistical learning mechanism is sufficient to account for complex rule-like behavior, or whether additional rule-learning mechanisms are needed. In the past this debate has primarily taken place within specific areas of language acquisition, such as inflectional morphology (e.g., Pinker, 1991; Plunkett and Marchman, 1993) and visual word recognition (e.g., Coltheart, Curtis, Atkins and Haller, 1993; Seidenberg and McClelland, 1989). More recently, Marcus et al. (1999) have presented results from experiments with 7-month-olds, apparently showing that the infants acquire abstract algebraic rules after two minutes of exposure to habituation stimuli. The algebraic rules are construed as representing an open-ended relationship between variables for which one can substitute arbitrary values, “such as ‘the first item X is the same as the third item Y,’ or more generally, that ‘item I is the same as item J’” (Marcus et al., 1999:79). Marcus et al.
further claim that a connectionist single-mechanism approach based on statistical learning is unable to fit their experimental data. In Simulation 1, we present a detailed connectionist model of these infant data, supporting a single-mechanism approach employing multiple-cue integration while undermining the dual-mechanism account.

Marcus et al. (1999) used an artificial language learning paradigm to test their claim that the infant has two mechanisms for learning language. The subjects were seven-month-old infants randomly placed in one of two experimental conditions. In the first two experiments, the conditions were ABA or ABB. Each word in the sentence frame ABA or ABB consisted of a consonant and vowel sequence (e.g., ‘li wi li’ or ‘li wi wi’). During a two-minute-long familiarization phase the infants were exposed to three repetitions of each of 16 three-word sentences. The test phase in both experiments consisted of 12 sentences made up of words the infants had not previously been exposed to. The test items were divided into two groups for both experiments: consistent (items constructed with the same sentence frame as the familiarization phase) and inconsistent (items constructed from the sentence frame the infants were not trained on); see Table 1. In the second experiment the test items were altered in order to control for an overlap of phonetic features found in the first experiment. This was to prevent the infants from using this type of statistical information. The results of the first and second experiments showed that the infants preferred the inconsistent test items to the consistent ones. In the third experiment, which we focus on in this paper, the ABA grammar was replaced with an AAB grammar. The rationale was to ensure that infants could not distinguish between grammars based solely on reduplication information. Once again, the infants preferred the inconsistent items to the consistent items. The conclusion drawn by Marcus et al.
(1999) was that a single mechanism that relied on only statistical information could not account for the results because none of the test items appeared in the habituation part of the experiment. Instead they suggested that a dual mechanism was needed, comprising a statistical learning
Table 1 The Habituation and Test Stimuli for the Two Conditions in Marcus et al. (1999).

Condition   Habituation Stimuli                      Consistent Test Stimuli   Inconsistent Test Stimuli
AAB         de de di, de de je, de de li, de de we   ba ba po                  ba po po
            ji ji di, ji ji je, ji ji li, ji ji we   ko ko ga                  ko ga ga
            le le di, le le je, le le li, le le we
            wi wi di, wi wi je, wi wi li, wi wi we
ABB         de di di, de je je, de li li, de we we   ba po po                  ba ba po
            ji di di, ji je je, ji li li, ji we we   ko ga ga                  ko ko ga
            le di di, le je je, le li li, le we we
            wi di di, wi je je, wi li li, wi we we
component and an algebraic rule learning component. In addition, they claimed that an SRN would not be able to model their data because of the lack of phonological overlap between habituation and test items. Specifically, they state: “Such networks can simulate knowledge of grammatical rules only by being trained on all items to which they apply; consequently, such mechanisms cannot account for how humans generalise rules to new items that do not overlap with the items that appeared in training” (p. 79). We demonstrate that SRNs can indeed fit the data from Marcus et al. Other researchers have constructed neural network models specifically to simulate the Marcus et al. results (Altmann and Dienes, 1999; Elman, 1999; Shastri and Chang, 1999; Shultz, 1999). In contrast, we do not build a new model to accommodate the results but take the existing SRN model of speech segmentation
presented above and show how this model — without additional modification — provides an explanation for the results. The Christiansen et al. (1998) model acquired distributional knowledge about sequences of phonemes, the associated stress patterns, and the occurrence of utterance boundaries. This knowledge allowed it to perform well on the task of segmenting the speech stream into words. We suggest that this knowledge can be put to use in secondary tasks not directly related to speech segmentation — including artificial tasks used in psychological experiments such as Marcus et al. (1999). This suggestion resonates with similar perspectives in the word recognition literature (Seidenberg, 1995) where knowledge acquired for the primary task of learning to read can be used to perform other secondary tasks such as lexical decision. Marcus et al. (1999) state that they conducted simulations in which SRNs were unable to fit the experimental data. As they do not provide any details of the simulations, we assume (based on other simulations reported by Marcus, 1998) that these focused on some kind of phonological output that the SRNs produced. Given our characterization of the experimental task as a secondary task, we do not think that the basis for the infants’ differentiation between consistent and inconsistent stimuli should be modeled using the phonological output of an SRN. Instead, we focus on the model’s ability to integrate the phonological input with utterance boundary information in order to segment out the individual words in the test items.
4.1 Method
Networks. Corresponding to the 16 infants in the Marcus et al. study, we used 16 networks similar to the SRN used in Christiansen et al. (1998) with the exception that the original phonetic feature geometry was replaced by a new representation using 18 features (see Appendix). Each of the 16 SRNs had a different set of initial weights, randomized within the interval [–0.25, 0.25]. The learning
rate was set to 0.1 and the momentum to 0.95. These training parameters were identical to those used in the original Christiansen et al. model. The networks were trained using the standard back-propagation learning algorithm (Rumelhart, Hinton and Williams, 1986) to predict the next constellation of cues given the current input segment. Materials. The materials from Experiment 3 in Marcus et al. (1999) were transformed into the phoneme representation used by Christiansen et al. (1998). Two habituation sets were created: one for AAB items and one for ABB items (see Table 1). The habituation sets used here, and in Marcus et al., consisted of three blocks of 16 sentences in random order, yielding a total of 48 sentences in each habituation condition. As in Marcus et al. there were four different test sentences: ‘ba ba po’, ‘ko ko ga’ (consistent with AAB); ‘ba po po’ and ‘ko ga ga’ (consistent with ABB). The test set consisted of three blocks of randomly ordered test sentences, totaling 12 test items. Both the habituation and test sentences were treated as a single utterance with no explicit word boundaries marked between the individual words. The end of each utterance was marked by activating the utterance boundary unit. All habituation and test items were assigned the same level of primary stress. Procedure. The networks were first trained on a single pass through the Korman (1984) corpus as in the original Christiansen et al. model. This corresponds to the fact that the 7-month-olds in the Marcus et al. study already have had a considerable exposure to language, and have begun to develop their speech segmentation abilities (Jusczyk, 1997, 1999). Next, the networks were habituated on a single pass through one of the habituation corpora — one phoneme at a time — with learning parameters identical to the ones used during the pre-training on the Korman corpus. 
The networks were then tested on the test set (with the weights “frozen”) and the activation of the utterance boundary unit was recorded for every phoneme input in the test set for the purpose of scoring the network performance on the derived task. The boundary unit activations across the seven input tokens for each item were separated into two groups according to whether they were recorded
for test sentences consistent or inconsistent with the habituation pattern. For the purpose of measuring word segmentation performance, the mean utterance boundary activation was calculated across all the habituation items for each network. Following Christiansen et al. (1998), a network was said to have postulated a word boundary whenever the boundary unit activation in a test sentence was above its habituation mean cut-off. The word segmentation performance for consistent and inconsistent sentences was then quantified in terms of accuracy and completeness scores (Brent and Cartwright, 1996; Christiansen et al., 1998).
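The cut-off rule described in the procedure can be sketched as follows. The activation values are invented purely for illustration, each letter stands in for one phoneme, and dropping any trailing phonemes that are not closed off by a postulated boundary is an assumption of this sketch rather than a detail reported for the simulations.

```python
from statistics import mean


def postulated_words(phonemes, activations, cutoff):
    """Return the words closed off by boundary unit activations above the cut-off."""
    words, current = [], []
    for phoneme, activation in zip(phonemes, activations):
        current.append(phoneme)
        if activation > cutoff:  # a boundary is postulated after this phoneme
            words.append("".join(current))
            current = []
    return words  # trailing phonemes without a boundary are dropped (assumption)


# Invented boundary unit activations recorded over the habituation items.
habituation_activations = [0.2, 0.5, 0.3, 0.6]
cutoff = mean(habituation_activations)

# Test item 'ba ba po', with invented activations after each phoneme.
test_item = ["b", "a", "b", "a", "p", "o"]
test_activations = [0.1, 0.6, 0.2, 0.3, 0.1, 0.7]
print(postulated_words(test_item, test_activations, cutoff))
```

With these made-up values the network would be credited with one hit (the first ba), one false alarm (bapo) and two misses (the second ba and po), which is the kind of outcome the accuracy and completeness scores then summarize per network.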
4.2 Results
For each of the sixteen networks, accuracy and completeness scores were computed across all test items, and submitted to the same statistical analyses as used by Marcus et al. for their infant data. The accuracy scores were submitted to a repeated measures ANOVA with condition (AAB vs. ABB) as between network factor and test pattern (consistent vs. inconsistent) as within network factor. The left-hand side of Figure 4 shows the accuracy scores for the consistent and inconsistent items pooled across conditions. There was a main effect of test pattern (F(1,14) = 4.78, p < .05), indicating that the networks segmented significantly more actual words out from the inconsistent items (49.55%) compared to the consistent items (39.44%). Similarly to the infant data, neither the main effect of condition, nor the condition × test pattern interaction were significant (F's < 1). The completeness scores were submitted to a similar analysis, and the results are shown in the right-hand side of Figure 4. Again, there was a main effect of test pattern (F(1,14) = 5.76, p < .04), indicating that the networks were significantly better at segmenting out the words in the inconsistent items (35.76%) compared to the consistent items (28.82%). Neither the main effect of condition, nor the condition × test pattern interaction were significant (F's < 1). The higher accuracy and completeness scores
for the inconsistent items suggest that they would stand out more clearly in comparison with the consistent items, and thus explain why the infants looked longer towards the speaker playing the inconsistent items in the Marcus et al. study.
Figure 4 Word accuracy (left) and completeness (right) scores in Simulation 1 for the inconsistent (white bars) and the consistent test items (grey bars).
Marcus et al. claim that a dual-mechanism system — involving a statistical learning mechanism and a rule-learning mechanism — is needed to account for the infant data. In contrast, Simulation 1 shows that a separate rule-learning component is not necessary to account for the data. This simulation shows how our SRN model of word segmentation can fit the data from Marcus et al. (1999) without invoking explicit rules. The pre-training allowed the SRNs to learn to integrate the regularities governing the phonological, lexical stress, and utterance boundary information in child-directed speech. We suggest that during the habituation phase, the networks then developed weak attractors specific to the habituation pattern and the phonology of the syllables used. These attractors will at the same time both attract a consistent item (because of pattern similarity) and repel it (because of phonological dissimilarity), causing interference with the derived task of word segmentation. The inconsistent items, on the other hand, will tend to be repelled by the habituation attractors and therefore do not suffer from the same
kind of interference, making them easier for the network to process. Multiple-cue integration learning enabled the SRN model to fit the infant data. Importantly, the model, as a statistical learning mechanism, can explain both the distinction between consistent and inconsistent items and the preference for the inconsistent items. Note that a rule-learning mechanism by itself can explain only how infants may distinguish between items, not why they prefer inconsistent over consistent items. Extra machinery is needed in addition to the rule-learning mechanism to explain the preference for inconsistent items. Thus, the most parsimonious explanation is that only a statistical learning device is necessary to account for the infant data. The addition of a rule-learning device does not appear to be necessary.
5. Simulation 2: The Role of Segmentation in Rule-like Behavior
Segmentation plays a crucial role in our multiple-cue integration model of the Marcus et al. data. In contrast, the previous accounts of the infants' rule-like behavior do not couch their explanation in terms of such basic components of speech processing. Nevertheless, the previous connectionist models implicitly rely on pre-segmented input to model the infant data. All the models use syllabic input representations, and require that the input be segmented into three-syllable sentences. Sentential segmentation is accomplished outside of the models by way of marking the beginnings and endings of sentences (Altmann and Dienes, 1999; cf. Dienes et al., 1999), by resetting the network before each sentence (Dominey and Ramus, 2000), by only doing error correction after every third syllable (Elman, 1999), or by only having three nodes to encode variable position (Shastri and Chang, 1999) or syllable input (Shultz, 1999). The importance of this pre-segmentation is highlighted if we make the pauses between words (250 ms) the same length as the pauses between sentences (1000 ms). Leaving sentential segmentation aside,
an increase in the time between syllables should have little effect on the performance of the models — except perhaps for the Dominey and Ramus model in which the increased time between syllables may result in an inability to distinguish between consistent and inconsistent items (Dominey, personal communication). However, having same-length gaps between words and sentences is likely to make sentential segmentation harder. If this affects rule-like behavior then it has to be explained outside the models by some kind of segmentation device. Similar considerations apply to learning mechanisms that acquire explicit symbolic rules. Marcus et al. (1999) characterized algebraic rules as representing an open-ended relationship between variables for which one can substitute arbitrary values. Their Experiment 3 was designed to demonstrate that rule-learning is independent of the physical realization of variables in terms of phonological features. The same rule, AAB, applies to — and can be learned from — ‘le le we’ and ‘ko ko ga’ (with ‘le’ and ‘ko’ filling the same A slots and ‘we’ and ‘ga’ the same B slot). As the abstract relationships that this rule represents only pertain to the value of the three variables, the amount of time between them should not affect the application of the rule. Thus, just as the physical realization of a variable does not matter for the learning or application of a rule, neither should the time between variables. The same rule AAB, applies to — and can be learned from — ‘le [250ms] le [250ms] we’ and ‘le [1000ms] le [1000ms] we’ (the ‘le’s should still fill the A slots and the ‘we’s the B slot despite the increased duration of time between the occurrence of these variables). Nevertheless, even though the rule should in principle apply, performance constraints arising outside the rule-learning component may prevent it from being retrieved (Marcus, personal communication). 
Thus, if rule-like behavior is affected by same-length gaps between words and sentences, then a separate segmentation component will be needed. We expect, however, that this pause manipulation can be accommodated by our multiple-cue integration mechanism model — without any need for pre-segmentation machinery. In the model, the preference for inconsistent items is explained in terms of differential
segmentation performance. Lengthening the pauses between words, as indicated above, would in effect solve the derived task for the model, and should result in a disappearance of the preference for inconsistent items. Thus, we predict that the model should show no difference between the segmentation performance on the consistent and inconsistent items when pauses between words have the same length as pauses between sentences. To test this prediction, we carried out a new set of simulations.
5.1 Method
Networks. Sixteen SRNs as in Simulation 1. Materials. Same materials as in Simulation 1 except that utterance boundaries were inserted between the words in the habituation and test sentences, simulating a lengthening of pauses between words (from 250 ms to 1000 ms) such that they have the same length as the pauses between utterances. Procedure. Same procedure as in Simulation 1.
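The manipulation of the materials can be pictured as follows. This Python sketch is only a schematic of the idea, with hypothetical names (`BOUNDARY`, `to_stream`) and with whole words as tokens for brevity; it is not the input encoding used in the simulations, and the two sentences are simply the examples quoted earlier in the text. It contrasts an arrangement in which a boundary marker follows each sentence (as we assume for Simulation 1) with the present arrangement, in which a boundary marker also follows every word.

```python
from typing import List

BOUNDARY = "#"  # hypothetical marker for an utterance boundary


def to_stream(sentences: List[List[str]], boundary_after_each_word: bool) -> List[str]:
    """Flatten sentences into one token stream with boundary markers."""
    stream: List[str] = []
    for sentence in sentences:
        for position, word in enumerate(sentence):
            stream.append(word)
            is_last_word = position == len(sentence) - 1
            # A boundary always closes a sentence; in the manipulated
            # materials it also follows every sentence-internal word.
            if is_last_word or boundary_after_each_word:
                stream.append(BOUNDARY)
    return stream


sentences = [["le", "le", "we"], ["ko", "ko", "ga"]]

simulation_1 = to_stream(sentences, boundary_after_each_word=False)
simulation_2 = to_stream(sentences, boundary_after_each_word=True)

print(simulation_1)  # ['le', 'le', 'we', '#', 'ko', 'ko', 'ga', '#']
print(simulation_2)  # ['le', '#', 'le', '#', 'we', '#', 'ko', '#', 'ko', '#', 'ga', '#']
```

In the second stream every word is already delimited by a boundary marker, which is the sense in which lengthening the pauses hands the learner the word boundaries directly.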
5.2 Results
The accuracy and completeness scores were submitted to the same analyses as in Simulation 1. As illustrated by Figure 5, the segmentation performance on the test items was improved considerably by the inclusion of utterance boundary-length pauses between words. As predicted, there was no difference between accuracy scores for consistent (74.43%; SE: 6.92) and inconsistent items (72.26%; SE: 7.86) (F(1,14) = .71). Neither was there a difference between the completeness scores for consistent (70.14%; SE: 7.622) and inconsistent items (70.49%; SE: 7.966) (F(1,14) = .02). As before there were no other effects or interactions (F’s < 1), save for an interaction between condition and test pattern for accuracy (F(1,14) = 5.55, p < .04). This interaction was due to somewhat lower accuracy scores for the inconsistent condition in the AAB habituation pattern.
Figure 5 Word accuracy (left) and completeness (right) scores in Simulation 2 for the inconsistent (white bars) and the consistent test items (grey bars).
Simulation 2 thus confirms the predicted effect of same-length pauses between words and sentences in the dual-task single-mechanism model. Without including an additional segmentation component, the previous connectionist models would suggest that the pause manipulation should not affect the rule-like behavior.5 Similarly, learning mechanisms that acquire explicit symbolic rules would need to appeal to segmental performance constraints outside the rule component, in order to make the same predictions; otherwise, the pause manipulation would not be expected to affect rule-learning. To corroborate our model's predictions for the role of segmentation in rule-like behavior, we conducted an artificial language learning experiment using adult subjects.
5 Even though the Dominey and Ramus (2000) model is predicted to display similar behavior to our dual-task model (Dominey, personal communication), it is nevertheless still vulnerable to this problem because it requires pre-segmented input (i.e., resetting of internal states at the start of each sentence) to account for the original Marcus et al. (1999) results.
6. Experiment 1: Replicating the Marcus et al. (1999) Results
Before investigating the role of segmentation in rule-like behavior, we need to first establish whether adults in fact exhibit the same pattern of behavior as the infants in the Marcus et al. study. The first experiment therefore seeks to replicate Experiment 3 from Marcus et al. using adult subjects.
6.1 Method
Participants. Sixteen undergraduate students were recruited from introductory Psychology classes at Southern Illinois University. The participants earned course credit for their participation. Materials. We used the original stimuli that Marcus et al. (1999) created for their Experiment 3. Each word in a sentence was separated by 250 ms. The 16 habituation sentences for each condition were created by Marcus et al. using the Bell Labs speech synthesizer. The original habituation stimuli were limited to two predetermined sentence orders. To avoid potential order effects, we used the SoundEdit 16 version 2 software for the Macintosh to isolate each sentence as a separate sound file. This allowed us to present the habituation sentences in a random order for each subject. The stimuli for the test phase consisted of four additional sentences that were either consistent or inconsistent with the training grammar. As mentioned earlier, these sentences contained no phonological overlap with the habituation sentences. Like the habituation stimuli, each word in a sentence was separated by a 250 ms interval. As before, we stored the test stimuli as separate SoundEdit 16 version 2 sound files to allow a random presentation order for each subject. Procedure. The participants were seated in front of a Macintosh G3 PowerPC equipped with a New Micros button box. Participants were randomly assigned to one of two conditions, AAB or ABB. The experiment was run using the PsyScope presentation software
(Cohen, MacWhinney, Flatt, and Provost, 1993) with all stimuli played over stereo loudspeakers at 75dB. The participants were instructed that they were taking part in a pattern recognition experiment. They were told that in the first part of the experiment their task was to listen carefully to sequences of sounds and that their knowledge of these sound sequences would be tested afterwards. Participants listened to three blocks of the 16 randomly presented habituation sentences corresponding either to the AAB or the ABB sentence frame. A 1000 ms interval separated each sentence as was the case in the Marcus et al. experiment. After habituation, the participants were instructed that they would be presented with new sound patterns that they had not previously heard. They were asked to judge whether a pattern was “similar” or “dissimilar” to what they had been exposed to in the training phase by pressing an appropriately marked button. The instructions emphasized that because the sounds were novel, they should not base their decision on the sounds themselves but instead on the patterns derived from the sounds. The participants listened to three blocks of the four randomly presented test sentences. After the presentation of each test sentence, the participants were prompted for their response. Participants were allowed to take as long as they needed to respond. Each test trial was separated by a 1000 ms interval.
6.2 Results
For the purpose of our analyses, the correct response for consistent items is “similar” while the correct response for inconsistent items is “dissimilar”. The mean overall score for correct classification of test items was 8.81 (SE: 0.63) out of a perfect score of 12. A single-sample t-test showed that this classification performance was significantly better than the chance level performance of 6 (t(15) = 4.44, p < .0005). The participants’ responses were then submitted to the same statistical analysis as the infant data in Marcus et al. (and Simulations 1 and 2 above). Figure 6 (left) shows the mean number
of consistent and inconsistent test items that were rated as dissimilar to the habituation items. As expected, there was a main effect of test pattern (F(1,14) = 18.98, p < .001), such that significantly more inconsistent items were judged as dissimilar (4.5; SE: 0.40) than consistent items (1.69; SE: 0.40). Neither the main effect of condition nor the condition × test pattern interaction was significant (F’s < 1).
Figure 6 The mean proportion of inconsistent (white bars) and consistent (grey bars) test items rated as dissimilar to the habituation pattern in Experiments 1 (left) and 2 (right).
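The comparison with chance reported for Experiment 1 can be recomputed from the summary statistics alone. The sketch below is an illustrative check, not the analysis script used for the chapter: the function names are hypothetical, and because it starts from the rounded mean and standard error it only approximately reproduces the reported t(15) = 4.44. It also shows how the overall score follows from the two "dissimilar" counts, on the assumption that each participant judged six consistent and six inconsistent test items (three blocks of four sentences, split evenly).

```python
N_CONSISTENT = 6    # assumed: 2 consistent sentences x 3 blocks
N_INCONSISTENT = 6  # assumed: 2 inconsistent sentences x 3 blocks
CHANCE = (N_CONSISTENT + N_INCONSISTENT) / 2  # 6 correct out of 12


def one_sample_t(mean: float, standard_error: float, chance: float) -> float:
    """t statistic for a sample mean against a fixed chance level."""
    return (mean - chance) / standard_error


def overall_correct(consistent_dissimilar: float,
                    inconsistent_dissimilar: float) -> float:
    """Correct responses = consistent items judged 'similar'
    plus inconsistent items judged 'dissimilar'."""
    return (N_CONSISTENT - consistent_dissimilar) + inconsistent_dissimilar


# Summary values reported for Experiment 1.
score = overall_correct(consistent_dissimilar=1.69, inconsistent_dissimilar=4.5)
t_value = one_sample_t(mean=8.81, standard_error=0.63, chance=CHANCE)

print(f"overall correct = {score:.2f}")  # matches the reported mean of 8.81
print(f"t(15) = {t_value:.2f}")          # close to the reported 4.44
```

The small gap between the recomputed t and the published value is what one would expect from working with figures rounded to two decimals.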
Experiment 1 shows that adults perform similarly to the infants in Marcus et al.’s Experiment 3, thus demonstrating that it is possible to replicate their findings using adult participants instead of infants. This result is perhaps not surprising given that Saffran and colleagues were able to replicate statistical learning results obtained using adult participants (Saffran, Newport and Aslin, 1996) in experiments with 8-month-olds (Saffran, Aslin, et al., 1996). Their results and ours suggest that, despite small differences in the experimental methodologies used in infant and adult artificial language learning studies, both methodologies appear to tap into the same learning mechanisms. More generally, one would expect that the same learning mechanisms (statistical or rule-based) would be involved in both infancy and adulthood, and that similar results should be expected in both infant and adult studies with the kind of material used here.
7. Experiment 2: Segmentation and Rule-like Behavior
Having replicated the Marcus et al. (Experiment 3) infant data with adult participants, we now turn our attention to the effect of same-length pauses between words and sentences on the learning of rule-like behavior.
7.1 Method
Participants. Sixteen additional undergraduate students were recruited from introductory Psychology classes at Southern Illinois University. The participants earned course credit for their participation. Materials. The training and test stimuli were the same as in Experiment 1 except that the 250 ms interval between words in a sentence was replaced by a 1000 ms interval using the SoundEdit 16 version 2 software. The 1000 ms interval between sentences remained the same as before. Procedure. The procedure and instructions were identical to those used for Experiment 1.
7.2 Results
The mean overall classification score was 5.75 (SE: 0.32) out of 12. This was not significantly different from a chance level performance of 6 (t < 1). The responses of the participants were submitted to the same further analysis as in Experiment 1. Figure 6 (right) shows the mean number of consistent and inconsistent items rated as dissimilar. As predicted by Simulation 2, there was no main effect of test pattern in this experiment (F(1,14) = .56), suggesting that the participants were unable to distinguish between consistent (2.75; SE: 0.17) and inconsistent (2.5; SE: 0.24) items. As in Experiment 1,
both the main effect of condition and the interaction between condition and test pattern were not significant (F's = 0). These results show that the preference for inconsistent items disappears when the pauses between words and sentences have the same length. This corroborates the prediction from the dual-task, single-mechanism model, underscoring the role of segmentation in rule-like behavior. Crucially, our approach to the Marcus et al. (1999) study as tapping into the derived task of word segmentation allows the model to make the correct predictions without requiring additional machinery to perform sentential segmentation. The previous connectionist models, on the other hand, appear to require additional sentential segmentation components to account for the results from Experiment 2. This is also true for learning mechanisms that acquire explicit symbolic rules as suggested by Marcus et al. Without appealing to performance limitations arising from processing devices external to the rule-learning component, the lack of difference between consistent and inconsistent items in our artificial language learning study cannot be explained. The combination of simulation and experimental results presented here suggests that the multiple-cue integration model provides a compelling account of rule-like behavior in infants and adults.
8. General Discussion
In this chapter, we have suggested that the integration of multiple probabilistic cues may be one of the key elements involved in children’s acquisition of language. To support this suggestion, we have discussed the Christiansen et al. (1998) computational model of multiple cue integration in early infant speech segmentation. We have also shown through simulations and experiments that the model provides a single mechanism for learning the statistical structure of the speech input, while the representations acquired through multiple cue integration at the same time also allow the model to exhibit rule-like behavior, previously thought to be beyond
the scope of SRNs (cf. Marcus et al., 1999). Taken together, we find that the Christiansen et al. model in combination with the simulations and experiments reported here provide strong evidence in support of multiple cue integration in language acquisition. In the final part of this chapter, we discuss two outstanding issues with respect to multiple cue integration: how it works and how it can be extended beyond speech segmentation.
8.1 What makes multiple-cue integration work?
We have seen that integrating multiple probabilistic cues in a connectionist network results in more than just a sum of unreliable parts. But what is it about multiple cue integration that facilitates learning? The answer appears to lie in the way in which multiple cue integration can help constrain the search through weight space for a suitable set of weights for a given task (Christiansen, 1998; Christiansen et al., 1998). We can conceptualize the effect that the cue integration process has on learning by considering the following illustration. In Figure 7, each ellipse designates for a particular cue the set of weight configurations that will enable a network to learn the function denoted by that cue. For example, the ellipse marked A designates the set of weight configurations that allow for the learning of the function A described by the A cue. With respect to the simulations reported above, A, B and C can be construed as the phonology, utterance boundary, and lexical stress cues, respectively. If a network using gradient descent learning (e.g., the back-propagation learning algorithm) was only required to learn the regularities underlying, say, the A cue, it could settle on any of the weight configurations in the A set. However, if the net was also required to learn the regularities underlying cue B, it would have to find a weight configuration which would accommodate the regularities of both cues. The net would therefore have to settle on a set of weights from the intersection between A and B in order to minimize its error. This constrains the overall set of weight configurations that the net has to choose between, unless the cues are entirely overlapping (in which case there would not be any
added benefit from learning this redundant cue) or are disjoint (in which case the net would not be able to find an appropriate weight configuration). If the net furthermore had to learn the regularities associated with the third cue C, the available set of weight configurations would be constrained even further.
Figure 7 An abstract illustration of the reduction in weight configuration space that follows as a consequence of accommodating several partially overlapping cues within the same representational substrate. (Adapted from Christiansen et al., 1998).
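The set-intersection idea behind Figure 7 can be made concrete with a deliberately tiny example. The following Python sketch is an illustration under strong simplifying assumptions, not the SRN used in the simulations: a single threshold unit stands in for the network, its weights are restricted to a small hypothetical grid, and each "cue" is represented as a handful of input-output constraints on that shared unit. Enumerating the grid shows that the configurations satisfying both cues form the intersection of the sets satisfying each cue alone.

```python
from itertools import product
from typing import List, Set, Tuple

Weights = Tuple[float, float, float]    # (w1, w2, bias)
Example = Tuple[Tuple[int, int], int]   # ((x1, x2), desired output)

# Hypothetical, very coarse weight grid: 3 * 3 * 2 = 18 configurations.
GRID: List[Weights] = list(
    product((-1.0, 0.0, 1.0), (-1.0, 0.0, 1.0), (-0.5, 0.5))
)


def output(weights: Weights, x: Tuple[int, int]) -> int:
    """Threshold unit: fires (1) when the weighted sum exceeds zero."""
    w1, w2, bias = weights
    return int(w1 * x[0] + w2 * x[1] + bias > 0)


def satisfying(cue: List[Example]) -> Set[Weights]:
    """All grid configurations that meet every constraint of a cue."""
    return {w for w in GRID if all(output(w, x) == y for x, y in cue)}


# Two made-up, partially overlapping cues over the same unit.
cue_a: List[Example] = [((0, 0), 0), ((1, 0), 1)]
cue_b: List[Example] = [((0, 0), 0), ((0, 1), 1)]

set_a = satisfying(cue_a)
set_b = satisfying(cue_b)
both = satisfying(cue_a + cue_b)

print(len(GRID), len(set_a), len(set_b), len(both))  # 18 3 3 1
```

On this grid each cue on its own leaves three admissible configurations, and only one of them satisfies both. Had the two cues imposed identical constraints, the intersection would simply equal either set; had they been incompatible, it would be empty.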
Turning to the engineering literature on neural networks, it is possible to provide a mathematical basis for the advantages of multiple cue integration. Here multiple cue integration is known as “learning with hints”, where hints provide additional information that can constrain the learning process (e.g., Abu-Mostafa, 1990; Omlin and Giles, 1992; Suddarth and Holden, 1991). The type of hint most relevant to the current discussion is the so-called “catalyst hint”. This involves adding extra units to a network such that additional correlated functions can be encoded (in much the same way as the lexical stress units encode a function correlated with the information provided by the phonological input with respect to the
derived task of word segmentation). Thus, catalyst hints are introduced to reduce the overall weight configuration space that a network has to negotiate. This reduction is accomplished by forcing the network to acquire one or more additional related functions encoded over extra output units. These units are often ignored after they have served their purpose during training (hence the name “catalyst” hint). The learning process is facilitated by catalyst hints because fewer weight configurations can accommodate both the original target function as well as the additional catalyst function(s). As a consequence of reducing the weight space, hints have been shown to constrain the problem of finding a suitable set of weights, promoting faster learning and better generalization. Mathematical analyses in terms of the Vapnik-Chervonenkis (VC) dimension (Abu-Mostafa, 1993) and vector field analysis (Suddarth and Kergosien, 1991) have shown that learning with hints may reduce the number of hypotheses a learning system has to entertain. The VC dimension establishes an upper bound for the number of examples needed by a learning process that starts with a set of hypotheses about the task solution. A hint may lead to a reduction in the VC dimension by weeding out bad hypotheses and reduce the number of examples needed to learn the solution. Vector field analysis uses a measure of “functional” entropy to estimate the overall probability for correct rule extraction from a trained network. The introduction of a hint may reduce the functional entropy, improving the probability of rule extraction. The results from this approach demonstrate that hints may constrain the number of possible hypotheses to entertain, and thus lead to faster convergence. 
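The claim that a hint can reduce the number of hypotheses, and with it the number of examples needed, can be illustrated by brute-force enumeration. The sketch below is a toy under stated assumptions and does not reproduce the VC-dimension or vector field analyses cited above: the hypothesis space is simply all 16 Boolean functions of two inputs, the target (logical AND) is chosen arbitrarily, and the hint is a hypothetical symmetry constraint, f(x1, x2) = f(x2, x1), rather than a catalyst hint encoded over extra output units.

```python
from itertools import product
from typing import Dict, List, Tuple

Input = Tuple[int, int]
Hypothesis = Dict[Input, int]

INPUTS: List[Input] = list(product((0, 1), repeat=2))

# Every Boolean function of two inputs, stored as a truth table.
ALL_HYPOTHESES: List[Hypothesis] = [
    dict(zip(INPUTS, outputs))
    for outputs in product((0, 1), repeat=len(INPUTS))
]


def obeys_hint(h: Hypothesis) -> bool:
    """Hypothetical hint: the function is symmetric in its inputs."""
    return all(h[(a, b)] == h[(b, a)] for a, b in INPUTS)


def consistent(h: Hypothesis, examples: List[Tuple[Input, int]]) -> bool:
    """True if the hypothesis agrees with every training example."""
    return all(h[x] == y for x, y in examples)


def target(x: Input) -> int:
    """Arbitrary target function for the illustration: logical AND."""
    return x[0] & x[1]


# Two training examples drawn from the target.
examples = [((0, 0), target((0, 0))), ((0, 1), target((0, 1)))]

with_hint = [h for h in ALL_HYPOTHESES if obeys_hint(h)]
left_without_hint = [h for h in ALL_HYPOTHESES if consistent(h, examples)]
left_with_hint = [h for h in with_hint if consistent(h, examples)]

print(len(ALL_HYPOTHESES), len(with_hint))          # 16 8
print(len(left_without_hint), len(left_with_hint))  # 4 2
```

After the same two examples, the learner that uses the hint has half as many candidate functions left. Nothing guarantees such a benefit in general: a hint that the target does not actually satisfy would remove the correct hypothesis along with the incorrect ones.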
In sum, these mathematical analyses have revealed that the potential advantage of using multiple cue integration in neural network training is twofold: First, the integration of multiple cues may reduce learning time by reducing the number of steps necessary to find an appropriate implementation of the target function. Second, multiple cue integration may reduce the number of candidate functions for the target function being learned, thus potentially ensuring better generalization. As mentioned above, in
neural networks this amounts to reducing the number of possible weight configurations that the learning algorithm has to choose between.6 Thus, because the phonology, utterance boundary and lexical stress cues designate functions that correlate with respect to the derived task of word segmentation in our simulations, the reduction in weight space not only resulted in a better representational basis for solving this task, but also led to better learning and generalization. However, the mathematical analyses provide no guarantee that multiple cue integration will necessarily improve performance. Nevertheless, this is unlikely to be a problem with respect to language acquisition because, as we shall see next, the input to children acquiring their first language is filled with cues that reflect important and informative aspects of linguistic structure.
6 It should be noted that the results of the mathematical analyses apply independently of whether the extra catalyst units are discarded after training (as is typical in the engineering literature) or remain a part of the network, as in the simulations presented here.
8.2 Multiple cue integration beyond word segmentation
Recent research in developmental psycholinguistics has shown that there is a variety of probabilistic cues available for language acquisition (for a review, see contributions in Morgan and Demuth, 1996). These cues range from cues relevant to speech segmentation (as discussed above) to the learning of word meanings and to the acquisition of syntactic structure. We briefly discuss the two latter types of cues here. Golinkoff, Hirsh-Pasek and Hollich (1999) studied word learning in children of 12, 19 and 24 months of age. They found that perceptual salience and social information in the form of eye gaze are important cues for learning the meaning of words. The study also provided some insights into the developmental dynamics of multiple-cue integration. In particular, individual cues are weighted differently at different stages in development, changing the dynamics of the multiple cue integration process across time. At 12 months, perceptual salience dominates (only names for interesting objects are learned), and other cues need to correlate considerably for successful learning. Seven months later, eye gaze cues come into play, but the children have problems when eye gaze and perceptual salience conflict with each other (e.g., when the experimenter is naming and looking at a perceptually uninteresting object). Only at 24 months has the child’s lexical acquisition system developed sufficiently so that it can deal with conflicting cues. From the viewpoint of multiple cue integration, this study thus demonstrates how correlated cues are needed early in acquisition to build a basis for later performance based on individual cues.
There are a variety of cues available for the acquisition of syntactic structure. Phonology not only provides information helpful for word segmentation, but also includes important probabilistic cues to the grammatical classes of words. Lexical stress, for example, can be used to distinguish between nouns and verbs. In a 3,000 word sample, Kelly and Bock (1988) found that 90% of the bisyllabic trochaic words were nouns whereas 85% of the bisyllabic iambic words were verbs (e.g., the homograph record has stress on the first syllable when used as a noun and stress on the second syllable when used as a verb). They furthermore demonstrated that people are sensitive to this cue. More recent evidence shows that people are faster and more accurate at classifying words as nouns or verbs if the words have the prototypical stress patterns for their grammatical class (Davis and Kelly, 1997). The number of syllables that a word contains also provides information about its grammatical class.
Cassidy and Kelly (1991) showed that 3-year-olds are sensitive to the probabilistic cue that English nouns tend to have more syllables than verbs (e.g., gorp tended to be used as a verb, whereas gorpinlak tended to be used as a noun). Other important cues to noun-hood and verb-hood in English include differences in word duration, consonant voicing, and vowel types, and many of these cues have also been found in other languages, such as Hebrew, German, French, and Russian (see Kelly, 1992, for a review).
Multiple-Cue Integration in Language Acquisition
Sentence prosody can also provide important probabilistic cues to the discovery of grammatical word class. Morgan, Shi and Allopenna (1996) demonstrated using a multivariate procedure that content and function words can be differentiated with 80% accuracy by integrating distributional, phonetic and acoustic cues. More recently, Shi, Werker and Morgan (1999) found that infants are sensitive to such cue differences. Sentence prosody also provides cues to the acquisition of syntactic structure. Fisher and Tokura (1994) used multivariate analyses to integrate information about pauses, segmental variation and pitch and obtained 88% correct identification of clause boundaries. Other studies have shown that infants are sensitive to such cues (see Jusczyk, 1997, for a review). Additional cues to syntactic structure can be derived through distributional analyses of word combinations in everyday language (e.g., Redington, Chater and Finch, 1998), and from semantics (e.g., Pinker, 1989). As should be clear from this short review, there are many types of probabilistic information readily available to the language learner. We suggest that integrating these different types of information similarly to how the segmentation model was able to integrate phonology, utterance boundary and lexical stress information is also likely to provide a solid basis for learning aspects of language beyond speech segmentation. Indeed, a recent set of simulations inspired by the one described here has demonstrated that the learning of syntactic structure by an SRN is facilitated when it is allowed to integrate phonological and prosodic information in addition to distributional information (Christiansen and Dale, 2001). Specifically, an analysis of network performance revealed that learning with multiple-cue integration resulted in faster, better, and more uniform learning. 
The SRNs were also able to distinguish between relevant cues and distracting cues, and performance did not differ from networks that received only reliable cues. Overall, these simulations offer additional support for the multiple-cue integration hypothesis in language acquisition. They demonstrate that learners can benefit from multiple cues, and are not distracted by irrelevant information.
9. Conclusion
In this chapter, we have presented a number of simulation results that demonstrate how multiple cue integration in a connectionist network, such as the SRN, can provide a solid basis for solving the speech segmentation problem. We have also discussed how the process of integrating multiple cues may facilitate learning, and have reviewed evidence for the existence of a plethora of probabilistic cues for the learning of word meaning, grammatical class and syntactic structure. We conclude by drawing attention to the kind of learning mechanism needed for multiple cue integration. It seems clear that connectionist networks are well suited for accommodating multiple cue integration. First, our model of the integration of multiple cues in speech segmentation was implemented as an SRN. Second, and perhaps more importantly, the mathematical results regarding the advantages of multiple cue integration were couched in terms of neural networks (though they may also hold for certain other, non-connectionist statistical learning devices). Third, in the service of immediate tasks, such as encoding phonological information, connectionist networks can develop representations that can then form the basis for solving derived tasks, such as word segmentation. Symbolic, rule-based models, on the other hand, would appear to be ill equipped for accommodating the integration of multiple cues. First, the probabilistic nature of the various cues is not readily captured by rules. Second, the tendency for symbolic models to separate statistical and rule-based knowledge in dual-mechanism models is likely to hinder integration of information across the two types of knowledge. Third, the inherent modular nature of the symbolic approach to language acquisition further blocks the integration of multiple cues across different representational levels (e.g., preventing symbolic models from taking advantage of phonological cues to word class). 
Connectionism has shown itself to be a very fruitful — albeit controversial — paradigm for research on language (see, e.g., Christiansen and Chater, 2001b, for a review, or contributions in
Christiansen, Chater and Seidenberg, 1999; Christiansen and Chater, 2001a). Based on our work reported here, we further argue that connectionist networks may also hold the key to a better and more complete understanding of language acquisition because they allow for the integration of multiple probabilistic cues.
Author Note
Morten H. Christiansen, Department of Psychology, Cornell University; Christopher M. Conway, Department of Psychology, Cornell University; Suzanne Curtin, Department of Linguistics, University of Pittsburgh. Correspondence concerning this article should be addressed to Morten H. Christiansen, Department of Psychology, Cornell University, Ithaca, NY 14853. Electronic mail may be sent via Internet to [email protected], [email protected] or to [email protected]. This work was partially supported by a Human Frontiers Science Program Grant (RGP0177/2001-B) awarded to MHC.
Appendix The Phonemes from the MRC Psycholinguistics Database and Their Feature Representations
[Table: the cell values could not be recovered from the extracted text. For each phoneme the table gives its IPA character, its MRC symbol, and a binary (0/1) value for each of the features cons., son., labial, cor., dorsal, front, hi, low, mid, tense, cont., nasal, laminal, strid., post., lateral and voiced. Symbols listed: & 3 0 @ A I O U V a e i o u p b t d k g f v T D s z S Z h m n 9 l r w j.]
Note. Cons. = consonantal; son. = sonorant; cor. = coronal; cont. = continuant; strid. = strident; post. = posterior.
References Abu-Mostafa, Y.S. (1990) Learning from hints in neural networks. Journal of Complexity, 6, 192–198. Abu-Mostafa, Y.S. (1993) Hints and the VC Dimension. Neural Computation, 5, 278–288. Allen, J. and Christiansen, M.H. (1996) Integrating multiple cues in word segmentation: A connectionist model using hints. In Proceedings of the Eighteenth Annual Cognitive Science Society Conference (pp. 370–375). Mahwah, NJ: Lawrence Erlbaum Associates. Altmann, G.T.M. and Dienes, Z. (1999) Rule learning by seven-month-old infants and neural networks. Science, 284, 875. Aslin, R.N., Woodward, J.Z., LaMendola, N.P. and. Bever, T.G (1996) Models of word segmentation in fluent maternal speech to infants. In J.L. Morgan and K. Demuth (Eds.), Signal to Syntax (pp. 117–134). Mahwah, NJ: Lawrence Erlbaum Associates. Bates, E. and MacWhinney, B. (1987) Competition, variation, and language learning. In B. MacWhinney (Ed.), Mechanisms of language acquisition (pp. 157–193). Hillsdale, NJ: Lawrence Erlbaum Associates. Bernstein-Ratner, N. (1987) The phonology of parent-child speech. In K. Nelson and A. van Kleeck (Eds.), Children's language (Vol. 6). Hillsdale, NJ: Lawrence Erlbaum Associates. Brent, M.R. (1999) An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning, 34, 71–106. Brent, M.R. and Cartwright, T.A. (1996) Distributional regularity and phonotactic constraints are useful for segmentation. Cognition, 61, 93–125. Cairns, P., Shillcock, R.C., Chater, N. and Levy, J. (1997) Bootstrapping word boundaries: A bottom-up approach to speech segmentation. Cognitive Psychology, 33, 111–153. Carterette, E. and Jones, M. (1974) Informal speech: alphabetic and phonemic texts with statistical analyses and tables. Berkely, CA: University of California Press. Cassidy, K.W., and Kelly, M.H. (1991) Phonological information for grammatical category assignments. Journal of Memory and Language, 30, 348–369. Chater, N. and Conkey, P. 
(1992) Finding linguistic structure with recurrent neural networks. In Proceedings of the Fourteenth Annual Meeting of the Cognitive Science Society (pp. 402-407). Hillsdale, NJ: Lawrence Erlbaum Associates.
Chomsky, N. (1986) Knowledge of language. New York: Praeger. Christiansen, M.H. (1998) Improving learning and generalization in neural networks through the acquisition of multiple related functions. In J.A. Bullinaria, D.G. Glasspool and G. Houghton (Eds.), Proceedings of the Fourth Neural Computation and Psychology Workshop: Connectionist Representations (pp. 58–70). London: Springer-Verlag. Christiansen, M.H. and Allen, J. (1997) Coping with variation in speech segmentation. In A. Sorace, C. Heycock and R. Shillcock (Eds.), Proceedings of GALA 1997: Language Acquisition: Knowledge Representation and Processing (pp. 327–332). University of Edinburgh Press. Christiansen, M.H., Allen, J. and Seidenberg, M.S. (1998) Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes 13, 221–268. Christiansen, M.H. and Chater, N. (Eds.) (2001a) Connectionist psycholinguistics. Westport, CT: Ablex.
—. (2001b) Connectionist psycholinguistics: Capturing the empirical data. Trends in Cognitive Sciences 5, 82–88. Christiansen, M.H., Chater, N. and Seidenberg, M.S. (Eds.) (1999) Connectionist models of human language processing: Progress and prospects. Special issue of Cognitive Science 23 (4), 415–634. Christiansen, M.H., Conway, C.M. and Curtin, S. (2000) A connectionist single-mechanism account of rule-like behavior in infancy. Submitted for presentation at the 22nd Annual Conference of the Cognitive Science Society, Philadelphia, PA. Christiansen, M.H. and Curtin, S. (1999) The power of statistical learning: No need for algebraic rules. In The Proceedings of the 21st Annual Conference of the Cognitive Science Society (pp. 114–119). Mahwah, NJ: Lawrence Erlbaum Associates. Christiansen, M.H. and Dale, R.A.C. (2001) Integrating distributional, prosodic and phonological information in a connectionist model of language acquisition. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society (pp. 220–225). Mahwah, NJ: Lawrence Erlbaum. Cleeremans, A. (1993) Mechanisms of implicit learning: Connectionist models of sequence processing. Cambridge, MA: MIT Press. Cole, R.A. and Jakimik, J. (1978) How words are heard. In G. Underwood (Ed.), Strategies of information processing (pp. 67–117). London: Academic Press. Coltheart, M., Curtis, B., Atkins, P. and Haller, M. (1993) Models of reading aloud: Dual-route and parallel-distributed-processing approaches. Psychological Review 100, 589–608.
Cooper, W.E. and Paccia-Cooper, J.M. (1980) Syntax and speech. Cambridge, MA: Harvard University Press. Cottrell, G.W. (1989) A connectionist approach to word sense disambiguation. London: Pitman. Cutler, A. (1994) Segmentation problems, rhythmic solutions. Lingua 92, 81–104. Cutler, A. (1996) Prosody and the word boundary problem. In J.L. Morgan and K. Demuth (Eds.), From signal to syntax (pp. 87–99). Mahwah, NJ: Lawrence Erlbaum Associates. Cutler, A. and Mehler, J. (1993) The periodicity bias. Journal of Phonetics 21, 103–108. Davis, S.M. and Kelly, M.H. (1997) Knowledge of the English noun-verb stress difference by native and nonnative speakers. Journal of Memory and Language 36, 445–460. Demuth, K. and Fee, E.J. (1995) Minimal words in early phonological development. Unpublished manuscript. Brown University and Dalhousie University. Dominey, P.F. and Ramus, F. (2000) Neural network processing of natural language: I. Sensitivity to serial, temporal and abstract structure of language in the infant. Language and Cognitive Processes 15, 87–127. Elman, J.L. (1990) Finding structure in time. Cognitive Science 14, 179–211. Elman, J. (1999) Generalization, rules, and neural networks: A simulation of Marcus et al. (1999). Unpublished manuscript, University of California, San Diego. Fikkert, P. (1994) On the acquisition of prosodic structure. Holland Institute of Generative Linguistics. Fischer, C. and Tokura, H. (1996) Prosody in speech to infants: Direct and indirect acoustic cues to syntactic structure. In J.L. Morgan and K. Demuth (Eds.), Signal to Syntax (pp. 343–363). Mahwah, NJ: Lawrence Erlbaum Associates. Gleitman, L.R., Gleitman, H., Landau, B. and Wanner, E. (1988) Where learning begins: Initial representations for language learning. In F.J. Newmeyer (Ed.), Linguistics: The Cambridge Survey, Vol. 3 (pp. 150–193). Cambridge, U.K.: Cambridge University Press. Gold, E.M. (1969) Language identification in the limit.
Information and Control 10, 447-474. Golinkoff, Hirsh-Pasek and Hollich (1999) In J.L. Morgan and K. Demuth (Eds.), Signal to Syntax (pp. 305–329). Mahwah, NJ: Lawrence Erlbaum Associates.
Greenberg, J.H. and Jenkins, J.J. (1964) Studies in the psychological correlates of the sound system of American English. Word 20, 157–177. Hochberg, J.A. (1988) Learning Spanish stress. Language 64, 683–706. Jusczyk, P.W. (1993) From general to language-specific capacities: The WRAPSA model of how speech perception develops. Journal of Phonetics 21, 3–28. Jusczyk, P.W. (1997) The discovery of spoken language. Cambridge, MA: MIT Press. Jusczyk, P.W., Cutler, A. and Redanz, N.J. (1993) Infants’ preference for the predominant stress patterns of English words. Child Development 64, 675–687. Jusczyk, P.W., Friederici, A.D. and Svenkerud, V.Y. (1993) Infants’ sensitivity to the sound patterns of native language words. Journal of Memory and Language 32, 402–420. Jusczyk, P.W. and Thompson, E. (1978) Perception of a phonetic contrast in multisyllabic utterances by two-month-old infants. Perception and Psychophysics 23, 105–109. Kelly, M.H. (1992) Using sound to solve syntactic problems: The role of phonology in grammatical category assignments. Psychological Review 99, 349–364. Kelly, M.H. and Bock, J.K. (1988) Stress in time. Journal of Experimental Psychology: Human Perception and Performance 14, 389–403. Korman, M. (1984) Adaptive aspects of maternal vocalizations in differing contexts at ten weeks. First Language 5, 44–45. MacDonald, M.C., Pearlmutter, N.J. and Seidenberg, M.S. (1994) The lexical nature of syntactic ambiguity resolution. Psychological Review 101, 676–703. MacWhinney, B. (1991) The CHILDES Project. Hillsdale, NJ: Lawrence Erlbaum Associates. Marcus, G.F., Vijayan, S., Rao, S.B. and Vishton, P.M. (1999) Rule learning in seven-month-old infants. Science 283, 77–80. Marslen-Wilson, W.D. and Welsh, A. (1978) Processing interactions and lexical access during word recognition in continuous speech. Cognitive Psychology 10, 29–63. Mattys, S.L., Jusczyk, P.W., Luce, P.A. and Morgan, J.L.
(1999) Phonotactic and prosodic effects on word segmentation in infants. Cognitive Psychology 38, 465–494. Morgan, J.L. and Demuth, K. (Eds.) (1996) From Signal to Syntax. Mahwah, NJ: Lawrence Erlbaum Associates. Morgan, J.L. and Saffran, J.R. (1995) Emerging integration of sequential and suprasegmental information in preverbal speech segmentation. Child Development 66, 911–936. Morgan, J.L., Shi, R. and Allopenna, P. (1996) Perceptual bases of rudimentary grammatical categories: Toward a broader conceptualization of bootstrapping. In J.L. Morgan and K. Demuth (Eds.), From Signal to Syntax (pp. 263–281). Mahwah, NJ: Lawrence Erlbaum Associates. Nazzi, T., Bertoncini, J. and Mehler, J. (1998) Language discrimination by newborns: Towards an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance 24, 1–11. Omlin, C. and Giles, C. (1992) Training second-order recurrent neural networks using hints. In D. Sleeman and P. Edwards (Eds.), Proceedings of the Ninth International Conference on Machine Learning (pp. 363–368). San Mateo, CA: Morgan Kaufmann Publishers. Perruchet, P. and Vinter, A. (1998) PARSER: A model for word segmentation. Journal of Memory and Language 39, 246–263. Pinker, S. (1989) Learnability and cognition. Cambridge, MA: MIT Press. Pinker, S. (1991) Rules of language. Science 253, 530–535. Pinker, S. (1994) The language instinct: How the mind creates language. New York: William Morrow and Company. Plunkett, K. and Marchman, V. (1993) From rote learning to system building. Cognition 48, 21–69. Redington, M., Chater, N. and Finch, S. (1998) Distributional information: A powerful cue for acquiring syntactic categories. Cognitive Science 22, 425–469. Saffran, J.R., Aslin, R.N. and Newport, E.L. (1996) Statistical learning by 8-month-old infants. Science 274, 1926–1928. Saffran, J.R., Newport, E.L., Aslin, R.N., Tunick, R.A. and Barrueco, S. (1997) Incidental language learning: Listening (and learning) out of the corner of your ear. Psychological Science 8, 101–105. Seidenberg, M.S. (1995) Visual word recognition: An overview. In P.D. Eimas and J.L. Miller (Eds.), Speech, language, and communication. Handbook of perception and cognition (2nd ed.), Vol. 11.
San Diego: Academic Press. Seidenberg, M.S. (1997) Language acquisition and use: Learning and applying probabilistic constraints. Science 275, 1599–1603. Seidenberg, M. S. and McClelland, J. L. (1989) A distributed, developmental model of word recognition and naming. Psychological Review 96, 523–568. Shastri, L. and Chang, S. (1999) A spatiotemporal connectionist model of
algebraic rule-learning (TR-99-011). Berkeley, California: International Computer Science Institute. Shi, R., Werker, J.F. and Morgan, J.L. (1999) Newborn infants’ sensitivity to perceptual cues to lexical and grammatical words. Cognition 72, B11–B21. Shultz, T. (1999) Rule learning by habituation can be simulated by neural networks. In Proceedings of the 21st Annual Conference of the Cognitive Science Society (pp. 665–670). Mahwah, NJ: Lawrence Erlbaum Associates. Suddarth, S.C. and Holden, A.D.C. (1991) Symbolic-neural systems and the use of hints for developing complex systems. International Journal of Man-Machine Studies 35, 291–311. Suddarth, S.C. and Kergosien, Y.L. (1991) Rule-injection hints as a means of improving network performance and learning time. In L.B. Almeida and C.J. Wellekens (Eds.), Proceedings of the Networks/EURIP Workshop 1990 (Lecture Notes in Computer Science, Vol. 412, pp. 120–129). Berlin: Springer-Verlag. Trueswell, J.C. and Tanenhaus, M.K. (1994) Towards a lexicalist framework of constraint-based syntactic ambiguity resolution. In C. Clifton, L. Frazier and K. Rayner (Eds.), Perspectives on sentence processing (pp. 155–179). Hillsdale, NJ: Lawrence Erlbaum Associates.
7 Unsupervised Lexical Learning as Inductive Inference via Compression
Chunyu Kit
City University of Hong Kong
Abstract
This chapter presents a learning-via-compression approach to the unsupervised acquisition of word forms with no a priori knowledge. Following the basic ideas in Solomonoff’s theory of inductive inference and Rissanen’s MDL framework, learning is formulated as a process of inferring regularities, in the form of string patterns (i.e., words), from a given set of data. A segmentation algorithm is designed to segment each input utterance into a sequence of word candidates that gives an optimal sum of description length gain (DLG). The learning model has a lexical refinement module that exploits this algorithm to derive finer-grained word candidates from word clumps recursively until no further compression is available. Experimental results on a child-directed speech corpus show that this approach reaches state-of-the-art performance in terms of precision and recall of both words and word boundaries.
1. Introduction
Studies on lexical learning are concerned, in general, with how a learner exploits its innate learning mechanisms, existing knowledge (if any) and other available facilities, like supervision of various kinds, to acquire more knowledge about words — the very basic units in a language to make up utterances for human communication. The most essential lexical knowledge is word forms, with which some syntactic and semantic properties (e.g., part of speech, meaning) are associated so as to make our utterances meaningful. In this sense, word forms are the carrier of meaning. Therefore, at the very beginning of lexical learning learners must have some means to infer word forms before they can associate any meanings with them. In this research we take a computational approach to study the lexical acquisition problem with a focus on unsupervised learning of word forms, involving no other lexical properties (such as syntactic and semantic). Unsupervised learning assumes no prior knowledge at the starting point of the learning and no supervision during the learning. Different approaches assume different initial abilities (innate learning mechanisms) on the part of the learner. A very interesting assumption for this study is that the learner has an initial mechanism that does little more than string counting. It does not know there are words to learn. And it does not “learn”, in a sense, but simply attempts to derive a least-cost representation for the input data in terms of the string counts, each of which gives an indication of how many bits can be saved via extracting a string as a lexical item. This minimal initial ability is interesting. It seems to follow Chomsky’s notions of the minimality of grammar for natural language and his arguments for the minimally necessary innate structure (and ability) for the language learning faculty. These observations are scattered in his linguistic theories from (Chomsky, 1957) to (Chomsky, 1995). 
Interestingly, if unsupervised learning from this initial point could succeed, it would be a piece of evidence against his assumption of a powerful universal grammar. Our
purpose here, however, is simply to demonstrate the power of our learning approach: a counting machine can learn words with a considerable degree of success. The minimality of a learner’s initial ability is important because, of two learning approaches that achieve a similar learning performance, the one with less initial ability (and knowledge) represents the more effective learning.

The theoretical inspiration for this research comes from algorithmic information (or Kolmogorov complexity) theory (Solomonoff, 1964; Kolmogorov, 1965; Chaitin, 1996; Li and Vitányi, 1997), including (1) Solomonoff’s (1964) inductive inference theory, the first of the three origins of algorithmic information theory; (2) the Minimum Description Length (MDL) and Minimum Message Length (MML) principles by Rissanen (Rissanen, 1978, 1989; Rissanen and Ristad, 1994) and Wallace et al. (Wallace and Boulton, 1968; Wallace and Freeman, 1987), respectively;1 and (3) Vitányi and Li’s formulation of the ideal MDL in terms of Kolmogorov complexity (Vitányi and Li, 1997, 2000) and, in particular, their idea (or “intuition”, in their own terms) on how to conduct inductive inference via compression of a given dataset by squeezing out the embedded regularities piece by piece (Li and Vitányi, 1997: 351). In a sense, the work reported here can be thought of as an attempt to formulate and implement this nontrivial “intuition”.

An important theme in the study of learning is how to trace the underlying machinery that generates data by detecting regularities in the data. Compression is considered to be an effective approach to approximating the generally non-computable Kolmogorov complexity of a given dataset (Vitányi and Li, 1997, 2000). In the context of lexical learning, the dataset consists of
1 The subtle difference between MDL and MML is not critical to our research here. A lengthy discussion on the difference between the two can be found in a special issue of The Computer Journal (Vol. 42, No. 4, 1999) on Kolmogorov complexity. More important to our research is the underlying philosophy they share. Thus, the abbreviation MDL is assumed to subsume both within this paper, for the sake of convenience.
utterances, each of which is a sequence of atomic symbols in the language in question, i.e., phonemes in sound, or letters in script. Ideally, to learn from a dataset is to retrieve an achievable minimal representation for the data, extracting all regularities from the data such that the final result cannot be further compressed. Since this minimal representation is not reachable in general, the best we can do is to compress the data as much as possible under some constraints, e.g., the ones imposed by the representation format allowable in the learning. Following Solomonoff’s insight into the duality of compression and regularities (Solomonoff, 1964), i.e., anything that can compress data is a piece of regularity and any regularity can (be used to) compress the data, we may state that a model that can compress the data to a greater extent is a better model in general, in the sense that it captures more regularities in the data and thus comes closer to the true machinery that has generated the data. In this sense, unsupervised learning is a process of inductive inference to derive the optimal set of regularities from the data, namely the set that can compress the data the most. Notice, however, that it is a different issue whether the regularities so learnt are actually applied to carry out the compression. In this learning-via-compression approach to lexical learning, the compression is more a way of thinking about the computation involved in the learning process than a real procedure for compressing the input data.

As in other, more cognition-oriented studies, we also adopt the assumption that lexical learning follows the least-effort principle (Zipf, 1949) in learning from natural language data, which are themselves generated by human language behaviors that are observed to be governed by the least-effort principle in language production.
However, instead of interpreting the effort as the energy consumed by the learning process, we think of it as the cost in terms of the number of bits in the representation for the data. This point is critical, because from the point of view of a learning-via-compression approach, learning is merely a process to derive an economical representation for the data, and the regularities (e.g., string patterns) so obtained are not only the by-products of this
derivation but also the means towards the economical representation.

Computational studies on lexical learning fall into different categories, e.g., connectionist (or neural network) approaches, genetic algorithms and probabilistic models. Many probabilistic models adopt, in one way or another, the basic idea of learning-via-compression and the MDL principle. Representative models include the word grammar by Olivier (1968), a notable piece of early work on lexical learning; the distributional regularity (DR) model by Brent and Cartwright (1996); the concatenative model by de Marcken (1995, 1996); and Brent’s probabilistically sound model and its implementation, the MBDP-1 system (Brent, 1999), among many others. Olivier’s work demonstrates the formulation of lexical learning as an optimization process in terms of an objective function with dynamic programming techniques. De Marcken’s model outputs impressive tree structures, most of which appear to be consistent with morphological structures. Among such structures, however, only about two out of twelve are real words, rather close to the recall of the random baseline used in Brent’s work (Brent and Cartwright, 1996; Brent, 1999). Brent’s MBDP-1 system gives a learning performance with balanced precision and recall, both above 70%, demonstrating the state of the art of computational studies specifically dedicated to lexical learning. Venkataraman’s recent work (Venkataraman, 2001) shows that such performance can be achieved by available n-gram language modelling, and the learning curves of the two models are also highly similar. The learning model presented here achieves a slightly better average performance, and our approach appears significantly simpler than their models due to our straightforward formulation of the idea of learning via compression within the MDL framework.
This paper reports on our recent work in unsupervised lexical learning via compression to derive a least-effort representation within the theoretical framework of inductive inference following the MDL principle. The purpose of the research on lexical learning is multi-fold. First, it aims to test the hypothesis that there is a mechanism underlying language acquisition that seeks the least-effort representation for the input data. Secondly, it explores
machine intelligence with a focus on examining how much a computer can learn from natural language data given that it has only a minimum innate capacity. It can only differentiate between signals or, equivalently, characters in texts, together with other related capacities derived from this basic capacity, such as counting distinct signals and strings in a given corpus. It is highly significant to demonstrate that a counting machine can learn words from natural language data with little prior knowledge and supervision. Even more significantly, such success is not a result of the power of the learning mechanism involved. Rather, we show that linguistic regularities in real language data can be captured statistically and information-theoretically by a counting machine. Thirdly, we expect that the research on machine learning of natural language can shed light on the mechanism of human language acquisition, especially when a minimum innate capacity is assumed. If machine learning indicates that linguistic regularities embedded in language data play a critical role in facilitating language acquisition in addition to the innate mechanisms, then they may play a similar role in enabling human infants to learn a language so easily.

The paper is organized as follows. Section 2 defines a goodness measure for the compression effect of extracting a substring as a word candidate from a given set of data. Section 3 formulates a Viterbi algorithm for optimal segmentation of each input utterance in terms of the goodness measure. Section 4 presents a three-phase lexical learning model based on this algorithm. In Section 5 we present testing data for evaluation and the corresponding learning results output from the learner, including word clumps as intermediate results. These show how the learner works towards acquiring finer-grained lexical items step by step. In Section 6, we define a number of measures, including word, word boundary and word type precision and recall.
These provide a comprehensive evaluation for the learning performance. We report the evaluation results in terms of these measures. A number of interesting problems encountered in the learning are discussed in Section 7. Conclusions are presented in Section 8.
2. Goodness Measure
When a learning problem is formulated as an optimization problem, an objective function is needed to guide the learning. For our unsupervised learning-via-compression approach to carry out the optimization, a goodness measure is required to evaluate the benefit of extracting each possible word candidate from the input corpus and putting it into the lexicon. Such a goodness measure, termed description length gain (DLG), was formulated in (Kit and Wilks, 1999; Kit, 2000) to compute the compression effect of this kind, as follows. Given a corpus $X = x_1 x_2 \cdots x_n$ as a sequence of linguistic tokens (e.g., characters in our case), the DLG from extracting a subsequence $x_i x_{i+1} \cdots x_j$ (also denoted as $x_{i..j}$) ($i < j$) from $X$ as a rule in the form $r \to x_{i..j}$ is defined as

$$\mathrm{DLG}(x_{i..j} \in X) = \mathrm{DL}(X) - \mathrm{DL}(X[r \to x_{i..j}] \oplus x_{i..j}) \tag{1}$$

where $X[r \to x_{i..j}]$ represents the resultant corpus from the operation of replacing all instances of $x_{i..j}$ with the new symbol $r$ throughout $X$, and $\oplus$ denotes a string concatenation operation with a delimiter inserted in between its two operands. $\mathrm{DL}(\cdot)$ is the empirical description length that can be estimated by the Shannon-Fano code or Huffman code, following classic information theory (Shannon, 1948; Cover and Thomas, 1991):

$$\mathrm{DL}(X) = |X|\,\hat{H}(X) = -|X| \sum_{x \in V(X)} \hat{p}(x) \log_2 \hat{p}(x) = -\sum_{x \in V(X)} c(x) \log_2 \frac{c(x)}{|X|} \tag{2}$$

where $|\cdot|$ denotes the length of a corpus, $V(\cdot)$ the vocabulary of a corpus, $c(\cdot)$ the frequency of a token in the corpus, and $\hat{p}(\cdot)$ the relative frequency, which is conventionally estimated as $c(\cdot)/|X|$. The average DLG of $x_{i..j}$, i.e., the compression effect of extracting each individual instance of $x_{i..j}$, is
$$\mathrm{DLG}_{av}(x_{i..j} \in X) = \frac{\mathrm{DLG}(x_{i..j} \in X)}{c(x_{i..j})} \tag{3}$$
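Definitions (1) to (3) can be checked on a toy corpus by carrying out the transformation explicitly. The following is a minimal sketch, not the author's implementation: the symbols `<r>` and `<+>` are hypothetical stand-ins for $r$ and the delimiter, the function names are invented for illustration, and occurrences of the pattern are assumed to be matched left to right without overlap.

```python
import math
from collections import Counter

RULE, DELIM = "<r>", "<+>"  # hypothetical stand-ins for the symbols r and the delimiter


def description_length(tokens):
    """Empirical description length in bits, as in Eq. (2)."""
    total = len(tokens)
    return -sum(c * math.log2(c / total) for c in Counter(tokens).values())


def extract(tokens, pattern):
    """Build X[r -> pattern], then append the delimiter and one copy of pattern.

    Occurrences are matched left to right without overlap (an assumption).
    Returns the new token list and the number of replaced occurrences.
    """
    tokens, pattern = list(tokens), list(pattern)
    if not pattern:
        raise ValueError("pattern must be non-empty")
    out, hits, i, m = [], 0, 0, len(pattern)
    while i < len(tokens):
        if tokens[i:i + m] == pattern:
            out.append(RULE)
            hits += 1
            i += m
        else:
            out.append(tokens[i])
            i += 1
    return out + [DELIM] + pattern, hits


def dlg(tokens, pattern):
    """Return (DLG, average DLG) for extracting pattern, Eqs. (1) and (3)."""
    new_tokens, hits = extract(tokens, pattern)
    gain = description_length(list(tokens)) - description_length(new_tokens)
    return gain, (gain / hits if hits else float("-inf"))


# A frequent long pattern yields a positive gain; a rare short one does not.
print(dlg("abcdefgh" * 8, "abcdefgh"))
print(dlg("abcabc", "abc"))
```

On the second corpus the gain is negative: two occurrences of a three-character pattern do not pay for the extra rule symbol, the delimiter and the stored copy.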
A significant benefit of the above formulation is that we need not carry out a transformation to compute DL for the new corpus $X' = X[r \to x_{i..j}] \oplus x_{i..j}$. Following (2), we can formulate the following calculation in (4), using the new count $c'(x)$ in the new corpus $X'$:

$$\mathrm{DL}(X') = -\sum_{x \in V(X) \cup \{r, \oplus\}} c'(x) \log_2 \frac{c'(x)}{|X'|} \tag{4}$$
where $c'(\cdot)$ is a token count in $X'$. The focus of this computation is thus on how to derive $c'(\cdot)$ in the new corpus $X'$ without deriving $X'$. Notice that in order to derive $X'$ we must carry out the costly and undesirable transformation by extraction, replacement and concatenation. Instead, we prefer to derive $c'(\cdot)$ for any $x$ directly from the known counts $c(x)$ and $c(x_{i..j})$ in the original corpus $X$. $c(x_{i..j})$ is the count of an n-gram of arbitrary length. The derivation is straightforward, as given below in (5), where $c(\cdot)$ and $c(\cdot \in x_{i..j})$ denote an n-gram count in $X$ and in $x_{i..j}$, respectively:

$$c'(x) = \begin{cases} c(x_{i..j}) & \text{if } x = r; \\ c(\oplus) + 1 & \text{if } x = \oplus; \\ c(x) & \text{if } x \notin x_{i..j}; \\ c(x) - c(x_{i..j})\, c(x \in x_{i..j}) + c(x \in x_{i..j}) & \text{if } x \in x_{i..j}. \end{cases} \tag{5}$$
The first three cases are trivial. The last is the general case, in which the second item is the number of $x$’s reduced by the extraction of $x_{i..j}$ and the third item is the number of $x$’s remaining in the only instance of $x_{i..j}$ in the model part. Consequently, the length of the updated corpus after the transformation for extracting $x_{i..j}$ is

$$|X'| = |X| - c(x_{i..j})\,|x_{i..j}| + c(x_{i..j}) + |x_{i..j}| + 1 \tag{6}$$
The second item on the right-hand side is the length reduced by the extraction. The third and fourth items are, respectively, the number of $r$’s and the length of a copy of the extracted pattern $x_{i..j}$ that are put back into the corpus by the transformation, and 1 is the delimiter. Since all fragments of the input can be word candidates, the learner needs to examine all of them, i.e., all n-gram items, in the corpus. Because of the huge number of n-grams of arbitrary lengths in a large-scale corpus, we have developed the Virtual Corpus (VC) system (Kit and Wilks, 1998), based on the suffix array data structure (Manber and Myers, 1990), as a fairly efficient approach to handling them, including counting, storing and retrieval.
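The count-update shortcut of (4) to (6) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the Virtual Corpus implementation: the n-gram count $c(x_{i..j})$ is obtained here by a simple left-to-right, non-overlapping scan instead of a suffix array, and `<r>` and `<+>` are hypothetical stand-ins for $r$ and the delimiter. The new corpus $X'$ is never built; only its counts and length are derived.

```python
import math
from collections import Counter

RULE, DELIM = "<r>", "<+>"  # hypothetical stand-ins for r and the delimiter


def dl_from_counts(counts, total):
    """Eqs. (2)/(4): -sum of c(x) * log2(c(x) / total) over nonzero counts."""
    return -sum(c * math.log2(c / total) for c in counts.values() if c > 0)


def count_occurrences(tokens, pattern):
    """Left-to-right, non-overlapping occurrences of pattern (an assumption)."""
    hits, i, m = 0, 0, len(pattern)
    while i + m <= len(tokens):
        if tokens[i:i + m] == pattern:
            hits, i = hits + 1, i + m
        else:
            i += 1
    return hits


def dl_after_extraction(tokens, pattern):
    """DL(X') computed from updated counts only, without constructing X'."""
    tokens, pattern = list(tokens), list(pattern)
    k = count_occurrences(tokens, pattern)       # c(x_{i..j})
    old, inside = Counter(tokens), Counter(pattern)
    new = Counter()
    for x, c in old.items():
        # Eq. (5), cases 3 and 4: inside[x] is 0 when x is not in the pattern.
        new[x] = c - k * inside[x] + inside[x]
    new[RULE] = k                                 # Eq. (5), case 1
    new[DELIM] = old[DELIM] + 1                   # Eq. (5), case 2
    # Eq. (6): length of the updated corpus.
    new_len = len(tokens) - k * len(pattern) + k + len(pattern) + 1
    return dl_from_counts(new, new_len)


def dlg(tokens, pattern):
    """Eq. (1), using the shortcut for DL(X')."""
    tokens = list(tokens)
    old_dl = dl_from_counts(Counter(tokens), len(tokens))
    return old_dl - dl_after_extraction(tokens, pattern)


print(dlg("abcdefgh" * 8, "abcdefgh"))
```

Because only a handful of counts change, the cost of evaluating one candidate depends on the vocabulary size and the pattern length rather than on rebuilding the whole corpus.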
3. Algorithm
With the aid of the DLG formulated above, unsupervised lexical learning becomes an optimal segmentation problem of seeking the sequence of word candidates over an input utterance with the greatest sum of DLGs. Given an utterance $U = t_1 t_2 \cdots t_n$ as a string of some linguistic tokens (characters, phonemes or syllables) in a given corpus $C$, the optimal segmentation $OS(U)$ over $U$ such that the sum of DLGs over the word candidates is maximal can be formulated as follows.2

$$OS(U) = \mathop{\arg\max}_{s_1 \oplus s_2 \oplus \cdots \oplus s_k = U} \sum_{i=1}^{k} \mathrm{DLG}_{av}(s_i), \quad \text{for } 0 < k \le |U|. \tag{7}$$
The algorithm to implement this optimization is formulated by means of dynamic programming, following the basic idea of the Viterbi algorithm. It uses a list of intermediate variables $BS[i]$ (for $i = 1, 2, \ldots, n$) to store the best segmentation over $t_1 t_2 \cdots t_i$. A segmentation is a list (or chain) of adjacent segments (i.e., word candidates). The DLG over a list of segments, e.g., $\mathrm{DLG}(BS[j])$, is
2 We denote $\mathrm{DLG}(s_i \in C)$ and $\mathrm{DLG}_{av}(s_i \in C)$ as $\mathrm{DLG}(s_i)$ and $\mathrm{DLG}_{av}(s_i)$, respectively, in discussion concerning only a default corpus $C$, for the sake of simplicity.
the sum of the DLGs of the individual segments in the list, as defined in (8):

$$\mathrm{DLG}(BS[j]) \equiv \sum_{s \in BS[j]} \mathrm{DLG}_{av}(s) \tag{8}$$
Figure 1 An illustration for the Viterbi algorithm
Following the illustration in Figure 1, the optimal segmentation (OS) algorithm is straightforward, as follows, with $\cup$ denoting the operation of joining two lists.

1. Starting from $i = 1$,
2. Once $BS[i-1]$ is derived, let $BS[i] = BS[j] \cup \{[t_{j+1} \cdots t_i]\}$ for

$$j = \mathop{\arg\max}_{k < i} \mathrm{DLG}(BS[k] \cup \{[t_{k+1} \cdots t_i]\}) \tag{9}$$

Only candidate strings with a frequency $c > f_{min}$ (usually, 1) are considered in the search at each step. This helps the algorithm avoid fruitless iterations on strings with a too-low frequency, in particular, the frequency 1. Notice that all strings with a count $c = 1$ have a negative DLG, and they are all long strings that can be (or have been) broken into shorter ones with a positive DLG.
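The dynamic-programming search described above can be sketched in a few lines. This is a minimal illustration, not the author's code: the scoring function is supplied by the caller in place of $\mathrm{DLG}_{av}$, the toy scores below are invented for the example, and the frequency threshold is left out. The array `best` plays the role of the score of $BS[i]$, and `back` records the chosen split point $j$.

```python
def optimal_segmentation(utterance, score, max_len=None):
    """Return (best_total, segments) maximising the sum of score(segment).

    best[i] holds the best total over utterance[:i]; back[i] holds the start
    of the last segment in that best segmentation. max_len optionally bounds
    the candidate length (a convenience assumed here, not from the text).
    """
    n = len(utterance)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        start = 0 if max_len is None else max(0, i - max_len)
        for k in range(start, i):
            total = best[k] + score(utterance[k:i])
            if total > best[i]:
                best[i], back[i] = total, k
    # Follow the back pointers from the end to recover the segments.
    segments, i = [], n
    while i > 0:
        segments.append(utterance[back[i]:i])
        i = back[i]
    return best[n], segments[::-1]


# Hypothetical scores standing in for average DLG values.
TOY_GAINS = {"the": 3.0, "dog": 2.5, "do": 1.0}


def toy_score(segment):
    """Known candidates get their gain; single tokens cost nothing;
    unknown longer strings are penalised in proportion to their length."""
    if segment in TOY_GAINS:
        return TOY_GAINS[segment]
    return 0.0 if len(segment) == 1 else -1.0 * len(segment)


print(optimal_segmentation("thedog", toy_score))
```

Single tokens are given a finite score in the toy scorer so that every utterance has at least one admissible segmentation; a scorer based on real counts would need a comparable fallback.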
4. Learning Model
The learning model for unsupervised lexical learning exploiting the Viterbi segmentation consists of two phases, as depicted in Figure 2,
namely, optimal segmentation and lexical refinement. Each involves an application of the algorithm to infer lexical units at a different granularity. A further phase is word segmentation, using the result of learning as a lexicon. This is a post-learning application of the segmentation algorithm to identify individual words for the later process of language understanding.

Figure 2 The model of lexical learning behind the lexical learning algorithms
1. Induction of lexical candidates by optimal segmentation on input utterances,
2. Lexical refinement by optimal segmentation on individual lexical candidates, and
3. Word segmentation by optimal segmentation using the lexicon acquired.

The optimal segmentation algorithm is the underlying mechanism supporting all three phases, each with a lexicon of different granularity. The rationale behind this learning model is that it would be redundant to have distinct cognitive mechanisms for word discovery and word segmentation. Word segmentation is regarded as a special case of word discovery by determining word forms with a lexicon that is presumed to be adequate. If out-of-vocabulary words are encountered during word segmentation, the mechanism for word discovery will be invoked to infer unseen words. The learning performance of our learning model is evaluated based on the output from the third phase.
For purposes of exploring human lexical learning mechanisms by means of this computational approach, it is reasonable to assume that only those n-gram items containing at least one vowel can be word candidates. It is known that syllables are the basic units of speech representation and thus every word must contain at least one syllable. That is, every word contains at least one vowel (character). It is straightforward to implement this constraint within the optimal segmentation process: every chunk in the segmentation must contain at least one vowel character. Henceforth we denote the optimal segmentation with the vowel constraint as OS+V, and the optimal segmentation without such a constraint as OS. In cases where we also use the apostrophe [’] as a vowel character, in addition to [aeiouy], we denote the algorithm as OS+V´, a less constrained version of OS+V. This specification is to enable the learning algorithm to recognize bound morphemes like [-n’t] and [’ll] as individual lexical items in English. Without this specification, the OS+V algorithm would have no opportunity to show its ability to learn these morphemes as individual lexical items, because it would always have to attach them to lexical items with a vowel under the vowel constraint.

We can also iterate these algorithms, in the spirit of the EM algorithm, to test their learning capacity and see how far they can go in lexical learning on their own. The implementation is very simple: repeat each of the algorithms iteratively and update the frequency of the n-grams according to the segmentation result in each iteration. It can be expected that the learning performance will improve iteration by iteration. When there is no more improvement, the process stops. We refer to these algorithms as OS+EM, OS+V+EM and OS+V´+EM. The experimental results of these algorithms will be presented in the next section.
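The vowel constraint itself amounts to a simple predicate on each chunk. The sketch below is illustrative only: the function and constant names are hypothetical, and the plain ASCII apostrophe is assumed as the extra "vowel" character for the less constrained variant.

```python
OS_V_VOWELS = "aeiouy"          # vowel characters for OS+V
OS_V_PRIME_VOWELS = "aeiouy'"   # OS+V' additionally treats the apostrophe as a vowel


def has_vowel(chunk, vowels=OS_V_VOWELS):
    """True if the chunk contains at least one vowel character."""
    return any(ch in vowels for ch in chunk.lower())


def satisfies_vowel_constraint(segments, vowels=OS_V_VOWELS):
    """True if every chunk of a segmentation contains a vowel character."""
    return all(has_vowel(seg, vowels) for seg in segments)


# Under OS+V the bound morpheme cannot stand alone; under OS+V' it can.
print(satisfies_vowel_constraint(["do", "n't"]))
print(satisfies_vowel_constraint(["do", "n't"], OS_V_PRIME_VOWELS))
```

In a segmentation search, a check of this kind would be applied to each candidate chunk before it is scored, so that vowel-less chunks are never selected.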
5. Testing Data and Learning Results
In this section we describe the input data to the lexical learner for testing and the output learning results. It is based on this output that the learner’s performance is evaluated. We will first give rationales
for selecting a text corpus of child-directed speech transcription as testing data, and then present the details of data preparation, with samples of the input, output, and intermediate learning results. The intermediate results also show how the learner works at each phase of learning.
5.1 Input data
It is understood that written text corpora like the Brown corpus and the PTB corpus are not appropriate for testing an unsupervised lexical learning approach aimed at exploring language-learning infants’ lexical learning mechanisms. Instead, we must use language data that children actually receive in normal language-learning environments. The CHILDES database (MacWhinney and Snow, 1985; MacWhinney, 1991) is a collection of such data, contributed by a large number of scholars in the field of language acquisition. The Bernstein corpus (Bernstein-Ratner, 1987), a naturally occurring child-directed speech corpus from the CHILDES collection, is the most suitable dataset for the purpose of testing our lexical learner’s performance. Another reason for choosing the Bernstein corpus over a different corpus from CHILDES is that we intend to compare our work with the state-of-the-art approach that Brent has recently reported (Brent, 1999), where the Bernstein corpus was used as testing data.

The testing data will be illustrated below in detail with examples. It is a corpus of plain text transcribed from child-directed speech. However, this data is not yet the input for our lexical learner. We must first conduct the necessary pre-processing to filter out non-speech content, including commentary, punctuation marks,³ etc., and noise in the data. Next, we convert all capital letters into lowercase, except the initial letters in proper names, to prevent the learner from making an unnecessary distinction between a word and its capitalized version. We also add a special end-of-utterance symbol “#” to each utterance to tell the learner where an utterance ends. This symbol is not part of the data on which the learner will perform the learning, but it is necessary because we intend to feed the data to the learner utterance by utterance. After these steps, we have a text corpus consisting of a list of utterances, each of which is a sequence of characters ended by “#”.

3 The only exception is the apostrophe [’]. It is not removed from the input corpus, because it is used to represent a reduced vowel in bound morphemes in English orthographic texts, e.g., [-n’t] and [-’re]. In order to enable the lexical learner with a vowel constraint to detect individual bound morphemes, we have to inform the learner that an apostrophe is an equivalent vowel character. See later sections for more discussion.

The output text corpus from the pre-processing contains spaces as word delimiters. But to test the learning ability of an unsupervised learning algorithm, we must make such word delimiters invisible to the learner. The simplest way to do this is to delete them, which yields the input data to the learner. The first few lines of the input data are presented as follows:
she’sreallyintobooksrightnow#
getit#
youwanttoseethebook#
getit#
ohlookthere’saboywithhishat#
getit#
andadoggie#
isthatforthedoggie#
ohyouwanttolookatthis#
canyoufeedittothedoggie#
w’lookatthis#
feedit#
haveadrink#
oh#putitin#
oknow#
ok#
ohwhat’sthis#
whatareyougonnado#
what’sthat#
i’llletherplaywiththisforawhile#
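The pre-processing steps described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the function names `preprocess` and `unsegment` and the `proper_names` parameter are hypothetical, commentary is assumed to sit in square brackets, and the word-internal speech markings and special word forms found in the real corpus are not handled.

```python
import re

END = "#"  # end-of-utterance symbol used in the chapter


def preprocess(line, proper_names=frozenset()):
    """Clean one transcribed utterance but keep spaces as word delimiters."""
    # Drop commentary notes, assumed here to be enclosed in square brackets.
    line = re.sub(r"\[[^\]]*\]", " ", line)
    # Keep letters, apostrophes, '+' and spaces; everything else is dropped.
    line = re.sub(r"[^A-Za-z'’+ ]", " ", line)
    # Lowercase every word except the listed proper names.
    words = [w if w in proper_names else w.lower() for w in line.split()]
    return " ".join(words) + END


def unsegment(utterance):
    """Delete the spaces to obtain the learner's unsegmented input."""
    return utterance.replace(" ", "")


segmented = preprocess("She's really into books right now .")
learner_input = unsegment(segmented)
```

Keeping the segmented version alongside the unsegmented one is convenient, because the former serves as the standard against which the learner's output is later evaluated.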
The spaces are artificial delimiters in written texts, in the sense that there is nothing in continuous speech corresponding to such spaces. It is appropriate to remove them from the input corpus for purposes of testing the ability to learn words from an orthographic transcription of speech. A spaceless text such as the one given above is also known as an unsegmented text; its counterpart with spaces is accordingly called a segmented text. We will use an unsegmented corpus of child-directed speech to test our lexical learner’s performance. It is not easy for a human speaker to read off the
words in such an unsegmented text. Without a certain learning capacity, an unsupervised learner would not be able to infer words from a corpus of this type with an acceptable degree of success.

An obvious question about the testing data is: why use orthographic text as input, rather than speech input? In order to answer this question, we must clarify several points. First, we do not need real speech input in the form of sound waves as input data for our study. We know that human infants’ categorical speech perception turns the speech signals they receive into a sequence of sounds, known as phones, for each utterance. Our research aims to explore the learning mechanisms (or strategies) that language-learning infants exploit to map sound sequences to lexical items at the early stage of lexical learning, when they have little knowledge about words. Incorporating too many speech processing details would obscure the purpose of our study. Furthermore, for technical reasons, all speech data suitable for computational studies of language learning are actually encoded in text format as some kind of transcription, e.g., a phonetic transcript.

So why not use phonetic transcripts as input? The only reason is their unavailability. Were they available, we would use them for this test without hesitation. However, testing on orthographic transcripts of the same corpus is satisfactory, for the following reasons.

First, our learning approach is aimed at exploring the general learning mechanisms that human infants use for learning words from language data, no matter what format the data is in and what distributional regularities that data may have. We are not interested in a learning mechanism that can deal with input data in one format but not others, or in one language only. We believe all human infants use similar learning mechanisms to deal with the lexical learning problem at the initial stage of language acquisition, regardless of the language involved.
Second, our learning approach works through detecting regularities in the input data. Language data from different languages exhibit different regularities. No matter what regularities there are in the input, our learning approach must be able to capture them; otherwise, it is not a truly general approach. We know that
both the phonetic and orthographic transcripts are transcribed from the same speech data, and that most regularities in the original speech data are well maintained in both types of transcript, albeit in different forms. Furthermore, the orthographic transcript is more natural than an artificial phonetic transcription scheme, because it has evolved over years of practical use. Orthographic transcription carries its own inconsistency with the speech data, and its own ambiguities and irregularities (e.g., /ə/ appears as -er, -or, -ur, -ir or in some other forms in orthographic texts). All these provide more challenges to, and therefore a more credible test for, our learning algorithms.

When we perform pre-processing to filter out noise such as the non-speech content from the orthographic transcripts of the Bernstein corpus, we follow one essential principle: we make only the minimally necessary adjustments. In addition to erasing punctuation marks and commentary notes (originally in square brackets) and converting capitalized words to lowercase, some other alterations are necessary, e.g., adjusting some special forms of words or abbreviated words (e.g., the abbreviated form of negation) back to standard orthographic forms, e.g., [i (the)m] → [i am], [wouldn (i)t] → [wouldn’t]. Notably, the speech-related marking symbols, mainly braces and colons inside words, e.g., as in [(a)n(d)], [beau:ti:ful], [fa:v(o)rite] and [tel:e:phone], are all filtered out. The pre-processing carries out the necessary adjustments for hundreds of problems like these in the original corpus. Nevertheless, many non-standard word forms remain in the data, e.g., [w’] (we), [y’] (you), [ya] (you), [whatcha] (what do you) and [whatchya] (what do you). This kind of inconsistency plays a critical role in testing the learner’s learning ability.
We also allow one non-speech symbol, “+”, to remain in the data, as in [peek+a+boo] and [bye+bye], because it reflects the corpus constructors’ intention that these strings be considered as individual words or word-like compounds, rather than several words. Unknown words in the form [xxx], incomplete words such as [wh], onomatopoeia (e.g., [woof woof woof]) and interjections (e.g.,
[oh], [ooh] and [uh]) are also retained, in contrast to Brent’s testing
data. If all these irregularities were cleaned up, the lexical learner would have an artificially better performance. The entire testing corpus of child-directed speech extracted from the Bernstein corpus consists of 9702 utterances, 35K words and 143K characters. The average utterance length is 14.7 characters and 3.6 words; the average word length is 4.1 characters.
5.2 Learning result
Unsupervised lexical learning is aimed at acquiring lexical forms from speech data where word boundaries are not marked by any means. The expected outcome from the learning algorithms is a lexicon consisting of a list of individual word forms. This representation for both the intermediate and final results of the learning in our research is in fact a deterministic regular grammar. In this grammar, the right-hand sides of rules are lexical candidates that are represented plainly as strings of the atomic symbols in the input corpus. The left-hand sides, which merely function as indices for the lexical candidates, are skipped in the representation, because the positions of the candidates in the lexicon play the same role as the skipped indices. The choice of skipping the left-hand sides in the lexicon also reflects the principle of simplicity, the underlying philosophy of our learning approach. Recall that our learning approach seeks a lexicon as the simplest representation for the input data.

Here we will illustrate the representation formalism with fragments of the real output from our lexical learning experiments. The learning algorithms yielding such output were formulated in the preceding section. The output from the first step of the lexical learning (an optimal segmentation of input utterances into lexical candidates in terms of the DLG measure) consists of two parts. One is the result of the optimal segmentation of the input utterances. The other is the corresponding lexical candidates resulting from the segmentation, such as the ones in Table 1. Each candidate has its
count (or frequency), length, coverage and average DLG attached. Spacing between characters is for readability, and the symbol “+” comes from the original corpus. Many infrequent candidates in this lexicon, not shown here, are non-words. In contrast, the most frequent ones, as shown in Table 1, are real words or clumps of real words.

Table 1 Sample lexical candidates output from the optimal segmentation, in ascending order of coverage (= count × length).

Count  Length  Coverage      DLG   Candidate
   57       6       342  12.6261   [p r e t t y]
   44       8       352  15.8121   [t h o s e a r e]
   30      12       360  33.3633   [c l o s e t h e d o o r]
   52       7       364  12.9680   [t h i s o n e]
   73       5       365  12.5871   [m o m m y]
   73       5       365   7.4265   [h e l l o]
   37      10       370  25.6890   [w h e r e ’ s t h e]
   53       7       371  24.5830   [b y e + b y e]
   62       6       372  14.0373   [y o u c a n]
  188       2       376  -3.4687   [i t]
   67       6       402  14.2962   [d o g g i e]
   41      10       410  23.8961   [w h a t i s t h a t]
  207       2       414  -1.8734   [o k]
   52       8       416  18.1429   [w h a t i s i t]
   85       5       425   8.4561   [i t ’ s a]
   43      10       430  25.5103   [l o o k a t t h i s]
  108       4       432   6.5355   [b o o k]
   63       7       441  15.1237   [t h e r e ’ s]
   37      12       444  30.8000   [w h a t a r e t h o s e]
   76       6       456  12.7278   [w h a t ’ s]
  152       3       456  -1.0154   [t h e]
  477       1       477  -6.0098   [a]
  161       3       483  -0.3685   [s e e]
   81       6       486   9.8742   [i s t h a t]
  122       4       488   3.5881   [t h i s]
   60       9       540  23.9726   [t h e d r a g o n]
   98       6       588  14.9844   [c a n y o u]
  197       3       591   2.2711   [y o u]
   74       8       592  22.1664   [a l l r i g h t]
   66       9       594  24.5681   [t h e d o g g i e]
  149       4       596   3.7548   [h e r e]
  161       4       644   6.1727   [o k a y]
   65      10       650  41.8614   [p e e k + a + b o o]
  222       3       666   2.3890   [a n d]
   62      11       682  31.9754   [t h a t ’ s r i g h t]
  122       6       732  11.9507   [t h a t ’ s]
  110       7       770  15.4102   [t h a t ’ s a]
  157       5       785   7.2517   [t h e r e]
  198       4       792   3.1169   [t h a t]
  203       4       812   6.4316   [l o o k]
  452       2       904  -2.7021   [o h]
  230       4       920   5.0676   [w h a t]
  329       4      1316   5.3064   [y e a h]
  139      10      1390  29.2903   [w h a t ’ s t h i s]
  247      10      2470  29.1359   [w h a t ’ s t h a t]

A fragment of the optimal segmentation result is illustrated as follows.

[she’s][really][int][o][book][sright][now]
[youwantto][seethe][book]
[ohlook][there’s][abo][ywith][his][hat]
[and][a][doggie]
[oh][youwantto][lookatthis]
[w’][lookatthis]
[havea][drink]
[o][know]
[oh][what’sthis]
[what’sthat]
...
[i’ll][le][ther][playwithth][isfor][a][whi][le]
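The count, length and coverage attached to each candidate can be derived directly from a segmentation result. The sketch below shows one way to do so; the function name `lexicon_stats` is hypothetical, and the DLG column is left out because its computation depends on definitions given elsewhere in the chapter.

```python
from collections import Counter


def lexicon_stats(segmented_utterances):
    """Return (count, length, coverage, candidate) rows sorted by coverage.

    `segmented_utterances` is a list of utterances, each given as a list
    of chunks, e.g. [["oh", "look"], ["look", "at", "this"]].
    """
    counts = Counter(chunk for utt in segmented_utterances for chunk in utt)
    rows = [(c, len(w), c * len(w), w) for w, c in counts.items()]
    rows.sort(key=lambda row: row[2])  # ascending coverage = count x length
    return rows


rows = lexicon_stats([
    ["oh", "look"],
    ["look", "at", "this"],
    ["oh", "what'sthis"],
])
```

Coverage rewards candidates that are both frequent and long, which is why long clumps such as [what’sthat] can outrank short, very frequent items in this ordering.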
However, a lexical refinement process is necessary in order to turn the word clumps into individual words. Table 2 gives a number of decompositions of such word-clump lexical candidates into words (and other shorter clumps) during the lexical refinement process. The right-most column is the DLG of each decomposition. Table 3 is the finer-grained lexicon that is output from the lexical refinement. The final results of the lexical learning are the output from the optimal segmentation of the input utterances using the refined lexicon (obtained by the first two steps of the learning, namely, the optimal segmentation and lexical refinement). The output consists of two parts: the final segmentation result, and its corresponding lexicon, similar to Table 3.
Table 2 Sample decompositions of word-clump lexical candidates into finer-grained lexical items during the lexical refinement process

Round 1 (The first 12 in 503 decompositions):
[what’sthat]              →  [what’s][that]                          15.8447
[what’sthis]              →  [what’s][this]                          16.3159
[that’sright]             →  [that][’s][right]                        8.8198
[allright]                →  [all][right]                             8.4251
[canyou]                  →  [can][you]                               3.3597
[thedragon]               →  [the][dragon]                           11.5103
[whatarethose]            →  [what][arethose]                        14.4629
[there’s]                 →  [there][’s]                              4.8175
[lookatthis]              →  [look][at][this]                         5.2867
[whatisit]                →  [what][isit]                             7.5195
[whatisthat]              →  [what][isthat]                          14.9418
[youcan]                  →  [you][can]                               3.3597

Round 2 (The first 12 in 112 decompositions):
[isthat]                  →  [is][that]                               0.6786
[that’s]                  →  [that][’s]                               3.3377
[that’sa]                 →  [that][’sa]                              1.1489
[thedoggie]               →  [the][doggie]                           15.4579
[where’s]                 →  [where][’s]                              6.5925
[inthere]                 →  [in][there]                              5.4346
[thedog]                  →  [the][dog]                               0.2781
[thisis]                  →  [this][is]                               1.6062
[youlike]                 →  [you][like]                             10.3563
[thebunny]                →  [the][bunny]                            11.6377
[anotherone]              →  [a][nother][one]                         2.7744
[thebook]                 →  [the][book]                              8.0009

Round 3 (The first 10 in 35 decompositions):
[what’s]                  →  [what][’s]                               6.4455
[youwant]                 →  [you][want]                              8.6686
[areyou]                  →  [are][you]                               3.6702
[doyou]                   →  [do][you]                                1.4109
[dowith]                  →  [do][with]                               0.8707
[didn’t]                  →  [did][n’t]                               0.6807
[goodbye]                 →  [good][bye]                              4.4128
[withthe]                 →  [with][the]                              5.6365
[cowjumpingoverthemoon]   →  [cow][jump][ing][over][them][o][on]      1.2150
[knockedthem]             →  [knock][ed][them]                        0.1631
[putit]                   →  [put][it]                                0.4277
[what’]                   →  [what][’]                                0.1275

Round 4 (2 decompositions):
[doesshe]                 →  [does][she]                              0.0765
[anyou]                   →  [an][you]                                0.2196
Table 3 Sample finer-grained lexical items output from the lexical refinement process, in ascending order of coverage (= count × length).

Count  Length  Coverage      DLG   Candidate
   56       7       392  24.7372   [bye+bye]
   80       5       400   8.9886   [right]
   68       6       408  11.2601   [little]
  207       2       414  -1.8734   [ok]
   71       6       426  10.1195   [nother]
  144       3       432   2.0056   [him]
  145       3       435   0.4780   [his]
  109       4       436   4.6727   [he’s]
   91       5       455   8.5864   [it’sa]
   76       6       456  17.3798   [blocks]
   92       5       460   9.5032   [don’t]
  155       3       465   2.5998   [can]
  162       3       486  -0.1738   [one]
  122       4       488   5.4852   [it’s]
  168       3       504   2.8772   [ing]
   64       8       512  20.7181   [thankyou]
  106       5       530  11.9946   [daddy]
  268       2       536  -2.9071   [it]
  134       4       536   4.7049   [with]
  181       3       543   3.5252   [put]
  109       5       545   8.1970   [hello]
   93       6       558  14.9853   [doggie]
  144       4       576   6.0441   [good]
  116       5       580  13.5829   [mommy]
  147       4       588   7.6852   [have]
  629       1       629  -5.5904   [a]
  160       4       640   6.7353   [your]
  161       4       644   6.1727   [okay]
   65      10       650  41.8614   [peek+a+boo]
  372       2       744  -0.2737   [’s]
  194       4       776   7.1221   [like]
  275       3       825   2.7419   [and]
  179       5       895  10.0591   [wanna]
  224       4       896   7.8500   [book]
  329       3       987   0.7991   [see]
  560       2      1120  -2.3705   [oh]
  292       4      1168   4.8850   [here]
  342       4      1368   5.3697   [yeah]
  393       4      1572   5.5280   [this]
  455       4      1820   7.8423   [look]
  421       5      2105   8.9229   [there]
  776       3      2328   1.5757   [the]
  985       3      2955   5.0111   [you]
  784       4      3136   7.1216   [what]
  856       4      3424   5.5552   [that]
6. Evaluation
In this section we report the evaluation of our lexical learning algorithms’ performance based on the experimental results on the Bernstein corpus. We first define several empirical measures for the evaluation, and then present the learning performance in terms of these measures.
6.1 Evaluation measures
We use the following empirical measures to evaluate a lexical learner’s performance on a given testing corpus:
• Word precision and recall
• Word boundary precision and recall
• Correct character ratio
• Word type precision and recall
Word precision is defined as the proportion of correct words among the learned words, and word recall as the proportion of real words learned by the learner. Given an input corpus of $N$ words, if the learner learns $M$ words of which $C$ words are correct, the word precision and recall are $C/M$ and $C/N$, respectively, computed in terms of the numbers of word tokens.

However, the above measures of word precision and recall do not reflect the credit in a learning output like [ithink][itwill][comeout]. Here, although none of the chunks in the segmentation is a real word, the learner nevertheless detects some regularity within the input data and correctly discovers many word boundaries. In order to remedy the inadequacy of the conventional measures of word precision and recall, we need word boundary precision and recall as evaluation measures.

Word boundary precision is defined as the proportion of correct word boundaries among the word boundaries detected by the learner, and word boundary recall as the proportion of correct word boundaries that the learner detects. Given an input corpus with $N'$ word boundaries, if the learner detects $M'$ word boundaries of which $C'$ boundaries are correct, the word boundary precision and recall are $C'/M'$ and $C'/N'$, respectively.

Correct character ratio is defined as the proportion of characters in correctly learned words within the entire input corpus. Given an input corpus $L$ characters long, of which $L'$ characters are in the correctly learned words, the correct character ratio of the learning is $L'/L$.
Word type precision and recall are precision and recall in terms of learned and standard word types in the learned and standard lexicons, respectively, rather than word tokens in the original corpus and in the segmentation resulting from learning. Word type precision is also called lexicon precision in Brent’s (1999) recent work. The seven measures defined above form a systematic evaluation for computational lexical learning.
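The seven measures can be computed along the following lines. This is a minimal sketch rather than the evaluation code used in the experiments: the function names are hypothetical, a learned chunk is assumed to be correct when it occupies exactly the same character span as a word in the standard segmentation, and only utterance-internal boundaries are counted.

```python
def spans(chunks):
    """Return the set of (start, end) character spans of a chunk sequence."""
    out, pos = set(), 0
    for chunk in chunks:
        out.add((pos, pos + len(chunk)))
        pos += len(chunk)
    return out


def ratio(numerator, denominator):
    """Safe division: an empty denominator yields 0.0."""
    return numerator / denominator if denominator else 0.0


def evaluate(gold, learned):
    """Compare learned segmentations with the standard (gold) ones.

    Both arguments are lists of utterances, each a list of chunks that
    spell out the same character string.
    """
    C = M = N = Cb = Mb = Nb = Lc = L = 0
    gold_types, learned_types = set(), set()
    for g, l in zip(gold, learned):
        total = sum(len(w) for w in g)
        gs, ls = spans(g), spans(l)
        correct = gs & ls
        C, M, N = C + len(correct), M + len(ls), N + len(gs)
        # Boundaries are span ends, excluding the end of the utterance.
        gb = {end for _, end in gs} - {total}
        lb = {end for _, end in ls} - {total}
        Cb, Mb, Nb = Cb + len(gb & lb), Mb + len(lb), Nb + len(gb)
        Lc += sum(end - start for start, end in correct)
        L += total
        gold_types.update(g)
        learned_types.update(l)
    shared = gold_types & learned_types
    return {
        "word_precision": ratio(C, M),
        "word_recall": ratio(C, N),
        "boundary_precision": ratio(Cb, Mb),
        "boundary_recall": ratio(Cb, Nb),
        "correct_char_ratio": ratio(Lc, L),
        "type_precision": ratio(len(shared), len(learned_types)),
        "type_recall": ratio(len(shared), len(gold_types)),
    }


scores = evaluate(
    [["i", "think", "it", "will", "come", "out"]],
    [["ithink", "itwill", "comeout"]],
)
```

On the [ithink][itwill][comeout] example, word precision and recall are both zero while both detected boundaries are correct, which is the situation that motivates the word boundary measures.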
6.1.1 Words versus morphemes
The above evaluation measures are meaningless if it is not clear what constitutes a word in a given language. Thus what words are is a critical issue in this evaluation. The basic rule we opt to follow is to recognize orthographic words in the input corpus in terms of conventional word delimiters such as spaces in English. That is, any string separated by spaces in the input corpus is recognized as a word. This rule is consistent with the principle of minimum change in the data pre-processing: we respect the original corpus as much as possible. Therefore, although we realize that there are many non-conventional word forms in the Bernstein corpus, e.g., [i gotchyou gotchya gotchya], [what d’ya want to do], [didja knock’em over] and [wa’dja like to call it], where several words are wrapped up into one “word”, we use them as is. Unfortunately, this simple rule fails to address another problem caused by abbreviated (i.e., phonetically reduced) words, e.g., [-’s]
as in [that’s], [what’s] and [there’s], [-’re] as in [you’re] and [there’re], [-’ll] as in [i’ll] and [we’ll], and [-n’t] as in [can’t], [don’t], [doesn’t] and [isn’t]. How do we evaluate cases where the learner outputs segmentation results like [that][’s] and [does][n’t]? Should we count them as correct or as wrong? It is difficult to define correctness clearly. If we only count real words in the evaluation, such lexical candidates should certainly be counted as wrong, because [’s] and [n’t] are not words. However, our task here is not to determine real words, but to evaluate the performance of an unsupervised lexical learning approach that simulates human lexical learning mechanisms. What do infant learners learn in lexical acquisition? Only real words? No, they also learn many morphemes as lexical items in addition to words.

Morphemes are defined as the minimal meaningful units in a language. There are two types of morpheme: free and bound. Free morphemes are words, and most morphemes in a language like English are free morphemes. Bound morphemes cannot occur alone, e.g., [-ing] and [-ed]. Reduced word forms like [-’s] and [-n’t] are bound morphemes, because they lack a vowel to form independent syllables. It is understood that the apostrophe [’] in [-n’t] is used to represent a reduced vowel that is not qualified to make a syllable.⁴ The number of bound morphemes in a language like English is rather small, and they can only occur as part of a word, as in [don’t]. They are a closed class of lexical items in the lexicon of the English language.

Our original evaluation problem of “what are words?” becomes a different problem: how do we give credit to the two types of morpheme that a learner acquires in the learning? If we only count correct words in the learning output, the result is that credit is given only to free morphemes and not to any bound morphemes. Such an evaluation turns out to be imperfect, unfair, and even flawed.
4 Thus, in order to enable the OS+V algorithm to learn bound morphemes such as [-n’t], we have to tell the learner that [’] is something equivalent to a vowel character.
However, it is controversial to credit bound morphemes in the same way as words, because bound morphemes are not words! Nor is it easy to determine a fair percentage of credit. Should a bound morpheme count as 50% or 90% of a word? All these unresolved problems indicate that it is inappropriate and unrealistic to use a clear-cut means of evaluation for unsupervised lexical learning that involves both words and bound morphemes as learning output. The only conceivable way out of this dilemma is to conduct separate evaluations of the learner’s performance on learning words and on learning all lexical items, the latter including both words and bound morphemes.

To evaluate performance for learning words, we can simply count the chunks in the learning output that match the space-delimited words in the input corpus, and then generate evaluation results in terms of the above seven measures. Evaluation of learning of both words and bound morphemes is more complicated. We opt for a flexible approach. For example, the learner may recognize [that’s] as one word or [that][’s] as two lexical items. All cases like these are considered correct responses. This approach fails to address the contracted negative marker [-n’t], because sometimes the [V] component is not a well-formed word, e.g., [ca][n’t] and [wo][n’t]. Thus, we need to adjust it accordingly. It is considered correct if the learner recognizes either [Vn’t] as one word, or [V][n’t] as two lexical items, provided that the [V] component forms a well-formed word. That is, output chunks like [can’t] and [won’t] are all considered correct lexical items, but [ca][n’t] and [wo][n’t] are each recognized as one wrong and one correct item, rather than as two correct items. In contrast, [do][n’t] and [is][n’t] are each recognized as two correct items. English bound morphemes that we need to consider in the evaluation include [-’s], [-’d], [-’re], [-’ll], [-’ve] and [-n’t], all of which carry an apostrophe.
We divide these bound morphemes into two groups. One group, denoted as G1, includes all these items but the last. Each morpheme in G1 only co-occurs with a noun. The last item [-n’t] forms another group, denoted as G2, which only co-occurs with a verb.
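The flexible crediting rule for bound morphemes might be expressed as in the sketch below. It is only an illustration: the function name `credit` and the `well_formed` word set are hypothetical, the ASCII apostrophe stands in for the typographic one, and any split not covered by the rule described above is assumed to earn no credit.

```python
G1 = frozenset({"'s", "'d", "'re", "'ll", "'ve"})
G2 = frozenset({"n't"})


def credit(gold_word, chunks, well_formed):
    """Return (correct_items, total_items) for chunks covering gold_word.

    `well_formed` is an assumed set of stems that count as real words,
    e.g. {"do", "is", "can"}.
    """
    if "".join(chunks) != gold_word:
        raise ValueError("chunks must spell out the gold word")
    if len(chunks) == 1:
        return 1, 1  # the whole space-delimited word was recognized
    if len(chunks) == 2:
        stem, suffix = chunks
        if suffix in G1:
            return 2, 2  # e.g. [that]['s]
        if suffix in G2:
            # [do][n't] earns full credit; [ca][n't] only for [n't].
            return (2, 2) if stem in well_formed else (1, 2)
    return 0, len(chunks)  # assumption: other splits earn no credit


words = frozenset({"do", "is", "can"})
results = [
    credit("that's", ["that", "'s"], words),
    credit("don't", ["do", "n't"], words),
    credit("can't", ["ca", "n't"], words),
    credit("won't", ["won't"], words),
]
```

Restricting the `G1` and `G2` sets passed to such a function would be one way to obtain separate evaluations with and without the [-n’t] group.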
6.2 Learning performance
Table 4 and Table 5 present the learning performance of the twelve unsupervised lexical learning algorithms, each with a slightly different combination of our OS, LR, WS algorithms and the EM algorithm. These tables include measures of word and word boundary precision and recall. Table 4 presents performance in learning real words, and Table 5 presents performance in learning words and bound morphemes.

The performance of the OS and OS+V´ algorithms incorporated in the EM algorithm is presented in Figure 3. We can see that both programs converge very quickly towards their peak performance. In general, both algorithms’ performance drops significantly in the second iteration. The exception is OS+EM’s word precision and word boundary precision, which continue to rise. All measures increase rapidly in the next five iterations. After that, the growth slows significantly in the next ten iterations, after which all measures gradually reach their local maxima.

Table 4 Learning performance of the unsupervised lexical learning algorithms on words

                      Word (Token)      Word Boundary     Corr. char.
Learning algorithm    P (%)   R (%)     P (%)   R (%)     ratio (%)
OS                    32.43   31.03     70.48   67.43     33.14
OS+EM                 58.66   56.06     83.11   79.42     54.71
OS+LR+WS              61.60   67.99     81.28   89.72     64.10
OS+(LR+WS)x2          58.98   68.49     78.99   91.72     63.17
OS+V                  47.99   36.72     85.41   65.36     38.23
OS+V+EM               63.54   49.78     90.71   71.06     49.01
OS+V+LR+WS            74.99   70.92     90.29   85.39     70.04
OS+V+(LR+WS)x2        75.31   73.39     89.63   87.34     71.69
OS+V´                 47.42   36.62     84.83   65.50     38.12
OS+V´+EM              63.42   51.12     90.72   73.13     50.51
OS+V´+LR+WS           70.17   70.13     87.58   87.53     68.12
OS+V´+(LR+WS)x2       69.26   72.22     86.06   89.75     68.80
Table 5 Learning performance of the unsupervised lexical learning algorithms on words and bound morphemes

                      Counted       Word (Token)      Word Boundary     Corr. char.
Learning algorithm    morphemes     P (%)   R (%)     P (%)   R (%)     ratio (%)
OS                    +G1           33.86   32.17     71.19   67.65     34.33
                      +G1+G2        34.04   32.32     71.28   67.68     34.35
OS+EM                 +G1           65.52   60.50     86.57   79.74     59.86
                      +G1+G2        65.55   60.52     86.58   79.94     59.88
OS+LR+WS              +G1           69.89   73.77     85.42   90.17     71.33
                      +G1+G2        70.36   74.09     85.66   90.19     71.75
OS+(LR+WS)x2          +G1           67.08   74.40     83.04   92.10     70.60
                      +G1+G2        68.39   75.31     83.69   92.15     71.81
OS+V´                 +G1           49.33   37.81     85.78   65.76     39.28
                      +G1+G2        49.55   37.95     85.89   65.78     39.41
OS+V´+EM              +G1           69.45   54.66     93.68   73.73     54.38
                      +G1+G2        69.95   54.94     93.94   73.78     54.70
OS+V´+LR+WS           +G1           79.42   75.87     92.20   88.08     75.42
                      +G1+G2        79.92   76.17     92.45   88.11     75.82
OS+V´+(LR+WS)x2       +G1           78.76   78.25     90.81   90.23     76.62
                      +G1+G2        80.11   79.07     91.49   90.30     77.74

Bound morphemes in group G1: [-’s], [-’d], [-’re], [-’ve]
Bound morphemes in group G2: [-n’t]
We have observed that the OS and OS+EM algorithms have a better balance between precision and recall than the OS+V´ and OS+V´+EM algorithms, as shown in the top table in Table 6. The former two algorithms’ precision and recall for word and word boundary show a difference, defined as $|P - R|$, of at most 3.69 percentage points and a divergence rate, defined as $|P - R| / \min(P, R)$, of at most 4.65%; whereas the latter two algorithms have a difference of precision and recall in the range of 10 to 20 percentage points and a divergence rate in the range of 20% to 30%. But the EM algorithm seems not
to enlarge this difference and divergence rate: OS+EM has a balance of precision and recall as good as OS, and OS+V´+EM has a slightly better balance of precision and recall than OS+V´.

We can also see the effect of the EM algorithm on improving the learning performance of the OS and OS+V´ algorithms. As shown in the middle table in Table 6, the EM algorithm increases the OS algorithm’s word precision and recall both by about 81% and its word boundary precision and recall both by about 18%. In contrast, the EM algorithm appears less effective, but still quite effective, on the OS+V´ algorithm. It enables the algorithm to increase word precision and recall by about 34% and 40%, respectively, and word boundary precision and recall by about 7% and 12%, respectively. This loss in effectiveness is probably due to the effect of the vowel constraint.

The bottom table in Table 6 presents the effect of the vowel constraint on learning performance, with V´ (a looser constraint) as an example. It shows that, except for the case of the EM algorithm, the vowel constraint consistently enhances learning performance by improving word precision and recall and word boundary precision, at the price of lowering word boundary recall slightly. While working with the EM algorithm, the vowel constraint improves word precision and word boundary precision by about 8% and 9%, respectively, at the price of lowering the corresponding recalls by about 9% and 8%, respectively, and also lowering the correct character ratio by 7.68%. These results indicate that the vowel constraint does not combine well with the EM algorithm.

The EM algorithm can only reach a local minimum, and human learners do not learn in the same way by repeatedly iterating on the input data. We are in fact more interested in pursuing learning algorithms that simulate human lexical learning better than learning strategies and learning performance relying on the EM algorithm. We have implemented several such learning algorithms, including OS+LR+WS, OS+V+LR+WS and OS+V´+LR+WS. Their performance in learning words and in learning words and bound morphemes is presented in Tables 4 and 5, respectively. A comparison of their performance and the corresponding EM algorithms’ performance is presented in Table 7. We see that the unsupervised lexical learning algorithms have a much better performance than the EM algorithm: word precision is better by 5–18%, and word recall is better by 21–42%. Although word boundary precision is slightly lower, at most by 3.5%, word boundary recall is better by 13–20%, and the correct character ratio is better by 17–42%. The superiority of unsupervised lexical learning through optimal segmentation, lexical refinement and word segmentation is clearly overwhelming.

Figure 3 Performance of the OS and OS+V´ algorithms in EM iterations

Table 6 The effectiveness of the EM algorithm and the vowel constraint

(Top) Balance between precision and recall

               Word                                Word Boundary
Algorithm      P (%)   R (%)   |P – R|   D (%)     P (%)   R (%)   |P – R|   D (%)
OS             32.43   31.03     1.40     4.51     70.48   67.43     3.05     4.52
OS+EM          58.66   56.06     2.60     4.64     83.11   79.42     3.69     4.65
OS+V´          47.42   36.62    10.80    29.49     84.83   65.50    19.33    29.51
OS+V´+EM       63.42   51.12    12.30    20.06     90.72   73.13    17.59    24.05

(Middle) Effect of the EM algorithm

               OS+EM                               OS+V´+EM
               Word            Word boundary       Word            Word boundary
Algorithm      P (%)   R (%)   P (%)   R (%)       P (%)   R (%)   P (%)   R (%)
Beginning      32.43   31.03   70.48   67.43       47.42   36.62   84.83   65.50
End            58.66   56.06   83.11   79.42       63.42   51.12   90.72   73.13
Increment      26.23   25.03   12.63   11.99       16.00   14.50    5.89    7.63
Incr. Rate     80.88   80.66   17.76   17.78       33.74   39.60    6.94   11.65

(Bottom) Effect of the vowel constraint

                      Word (Token)      Word boundary     Corr. char.
Learning algorithm    P (%)   R (%)     P (%)   R (%)     ratio (%)
OS                    32.43   31.03     70.48   67.43     33.14
OS+V´                 47.42   36.62     84.83   65.50     38.12
Increment             14.99    5.59     14.35   -1.92      4.98
Incr. Rate (%)        46.22   18.01     20.36   -2.86     15.03

OS+EM                 58.66   56.06     83.11   79.42     54.71
OS+V´+EM              63.42   51.12     90.72   73.13     50.51
Increment              4.76   -4.94      7.61   -6.29     -4.20
Incr. Rate (%)         8.11   -8.81      9.16   -7.92     -7.68

OS+LR+WS              61.60   67.99     81.28   89.72     64.10
OS+V´+LR+WS           70.17   70.13     87.58   87.53     68.12
Increment              9.00    2.14      6.30   -2.19      4.02
Incr. Rate (%)        14.61    3.15      7.75   -2.44      6.27

OS+(LR+WS)x2          58.98   68.49     78.99   91.72     63.17
OS+V´+(LR+WS)x2       69.26   72.22     86.06   89.75     68.80
Increment             10.28    3.73      7.07   -1.97      5.63
Incr. Rate (%)        17.43    5.45      8.95   -2.15      8.91
Unsupervised Lexical Learning as Inductive Inference via Compression

Table 7 Comparison of learning performance on words and on words and bound morphemes: unsupervised lexical learning algorithms versus the EM algorithm

Words

Learning algorithm      Word (token)        Word boundary       Corr. char.
                        P (%)     R (%)     P (%)     R (%)     ratio (%)
OS + EM                 58.66     56.06     83.11     79.42     54.71
OS + LR + WS            61.60     67.99     81.28     89.72     64.10
  Increment              2.94     11.93     –1.83     10.30      9.39
  Incr. rate (%)         5.01     21.28     –2.20     12.97     17.16
OS + V + EM             63.54     49.78     90.71     71.06     49.01
OS + V + LR + WS        74.99     70.92     90.29     85.39     70.04
  Increment             11.45     21.14     –0.42     14.33     21.03
  Incr. rate (%)        18.02     42.47     –0.46     20.17     42.91
OS + V´ + EM            63.42     51.12     90.72     73.13     50.51
OS + V´ + LR + WS       70.17     70.13     87.58     87.53     68.12
  Increment              6.75     19.01     –3.14     14.40     17.61
  Incr. rate (%)        10.64     37.19     –3.46     19.69     34.86

Words and bound morphemes

Learning algorithm      Word (token)        Word boundary       Corr. char.
                        P (%)     R (%)     P (%)     R (%)     ratio (%)
OS + EM                 65.55     60.52     86.58     79.94     59.88
OS + LR + WS            70.36     74.09     85.66     90.19     71.75
  Increment              4.81     13.57     –0.92     10.25     11.87
  Incr. rate (%)         7.34     22.42     –1.06     12.82     19.82
OS + V´ + EM            69.95     54.94     93.94     73.78     54.70
OS + V´ + LR + WS       79.92     76.17     92.45     88.11     75.82
  Increment              9.97     21.23     –1.49     14.33     21.12
  Incr. rate (%)        14.25     38.64     –1.59     19.42     38.61
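The "Increment" and "Incr. rate" rows of Table 7 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `increment` is ours, and we assume that the increment rate is the increment divided by the EM baseline, which is consistent with the tabulated figures.

```python
def increment(baseline: float, improved: float) -> tuple[float, float]:
    """Return (absolute increment in points, relative increment rate in %).

    Assumption: the rate is taken relative to the baseline (EM) score.
    """
    diff = improved - baseline
    return round(diff, 2), round(100.0 * diff / baseline, 2)


if __name__ == "__main__":
    # Word precision and word recall, OS + EM versus OS + LR + WS (words only).
    print(increment(58.66, 61.60))
    print(increment(56.06, 67.99))
```

Applied to the first two rows of the upper table, this gives 2.94 points (5.01%) for word precision and 11.93 points (21.28%) for word recall.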
The effect of repeating the (LR+WS) part of the unsupervised lexical learning appears insignificant. It increases the recall slightly but sometimes decreases the precision on a similar scale. For example, in the experiment of learning words and bound morphemes, in comparison with the OS+V+LR+WS algorithm, the OS+V+(LR+WS)×2 algorithm increases word recall by 1.2 percentage points at the cost of lowering word precision by 2 points.
Table 8 Comparison of learning performance on words and on words and morphemes

Learning algorithm   Lexical type       Lexical item        Lexical boundary    Corr. char.
                                        P (%)     R (%)     P (%)     R (%)     ratio (%)
OS+LR+WS             Word               61.60     67.99     81.28     89.72     64.10
                     Word + Morph.      70.36     74.09     85.66     90.19     71.75
                     Increment           8.76      6.10      4.38      0.47      7.65
                     Incr. rate (%)     14.22      8.97      5.39      0.52     11.93
OS+V´+LR+WS          Word               70.17     70.13     87.58     87.53     68.12
                     Word + Morph.      79.92     76.17     92.45     88.11     75.82
                     Increment           9.75      6.04      4.87      0.58      7.70
                     Incr. rate (%)     13.89      8.61      5.56      0.66     11.30
Similarly, it increases word boundary recall by about 2 points at the cost of lowering word boundary precision by about 2 points. The OS+V´+(LR+WS)×2 algorithm demonstrates the best gain from (LR+WS)×2 in learning words and morphemes: a gain of 2 percentage points in word recall with no loss in word precision, a gain of 2.2 points in word boundary recall at the cost of a 1-point loss in word boundary precision, and a gain of about 2 points in correct character ratio.

When bound morphemes are counted as correctly learned lexical items in the learning output, the learning performance of the OS+LR+WS and OS+V´+LR+WS algorithms goes up significantly. These two algorithms’ precision and recall on lexical items, precision and recall on lexical boundaries, and correct character ratio are, respectively, about 14%, 9%, 5.5%, 0.5% and 11–12% higher than their counterparts on words, as shown in Table 8. Of all these increments, only the increment in word boundary recall is negligible.

In general, the learning performance of our algorithms compares favourably with the state-of-the-art performance of Brent’s MBDP-1 algorithm. The MBDP-1 algorithm is estimated to have an average word precision of around 71% and word recall of around 72%. The upper table in Table 9 shows the comparison in terms of the difference in word precision and recall.

Table 9 Comparison of our lexical learning algorithms’ performance with state-of-the-art performance

Learning algorithm   Lexical type      Lexical item        Difference from MBDP-1
                                       P (%)     R (%)     P (%)              R (%)
OS+LR+WS             Word              61.60     67.99     –9.40 (–13.2%)     –3.01 (–4.2%)
                     Word + Morph.     70.36     74.09     –0.64 (–0.9%)      +2.09 (+2.9%)
OS+V+LR+WS           Word              74.99     70.92     +3.99 (+5.6%)      –1.08 (–1.5%)
OS+V´+LR+WS          Word              70.17     70.13     –0.93 (–1.3%)      –0.97 (–1.3%)
                     Word + Morph.     79.92     76.17     +8.08 (+11.4%)     +4.17 (+5.8%)

Learning algorithm   Lexical type      Lexical item                  F difference
                                       P (%)     R (%)     F (%)     from MBDP-1
OS+LR+WS             Word              61.60     67.99     60.86     –10.64 (–14.9%)
                     Word + Morph.     70.36     74.09     72.18     +0.68 (+0.1%)
OS+V+LR+WS           Word              74.99     70.92     72.90     +1.40 (+2.0%)
OS+V´+LR+WS          Word              70.17     70.13     70.15     –1.35 (–1.9%)
                     Word + Morph.     79.92     76.17     78.00     +6.50 (+9.1%)
In order to clarify this comparison, we can use the F measure to combine the precision and recall for overall learning performance. The F measure is a variant of van Rijsbergen’s E measure introduced in (van Rijsbergen, 1979): F = 1 − E. The F measure is defined as

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} \tag{10}$$

In general, we consider precision and recall equally important, and consequently choose a value of α = 0.5. Accordingly, we have

$$F = \frac{2PR}{P + R} \tag{11}$$
Following this formula, the average overall learning performance of the MBDP-1 algorithm is estimated as F = 71.5%. Our best performance is about 6.5 percentage points higher, a relative improvement of about 9%. The result of this comparison indicates that our best algorithm for word learning, namely the OS+V+LR+WS algorithm, has an overall performance in learning words that compares favourably with MBDP-1’s overall performance. If bound morphemes are counted as correctly learned lexical items, the OS+LR+WS algorithm performs as well as MBDP-1, whereas the OS+V´+LR+WS algorithm performs significantly better.⁵

If Brent had evaluated MBDP-1’s learning performance with word boundary precision and recall and with the correct character ratio, a more thorough comparison between our learning algorithms and MBDP-1 would have been possible. Given the roughness of the above comparison, we cannot be certain that our learning approach really outperforms the MBDP-1 algorithm, because many factors in the learning algorithms and in the preparation of the testing data are different. For example, MBDP-1 is an incremental online learning algorithm, whereas ours are not; onomatopoeia and interjections are cleaned out of the testing data for MBDP-1 but retained in the testing data for our algorithms; and MBDP-1 learns from phonetic transcripts, whereas our algorithms learn from orthographic transcripts. All these factors make the above comparison a rough one. However, the comparison provides adequate evidence for the conclusion that our learning approach reaches the state-of-the-art level of unsupervised lexical learning.
5 Notice that OS+V based algorithms, including OS+V+LR+WS, do not learn any bound morphemes like those listed in G1 and G2, which have an apostrophe for the reduced vowel.
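The F measure of equations (10) and (11) is simple enough to check by hand, but a short sketch may help readers who wish to reproduce the comparison. The function name `f_measure` is our own; the inputs are the precision and recall figures quoted in the text (about 71% and 72% for MBDP-1) and in Table 9.

```python
def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    """F measure of equation (10); alpha = 0.5 reduces it to 2PR / (P + R)."""
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


if __name__ == "__main__":
    # Estimated MBDP-1 word precision and recall, as quoted in the text.
    print(round(f_measure(71.0, 72.0), 2))
    # OS+V+LR+WS on words, and OS+V'+LR+WS on words and bound morphemes.
    print(round(f_measure(74.99, 70.92), 2))
    print(round(f_measure(79.92, 76.17), 2))
```

With α = 0.5 these three calls give approximately 71.5, 72.9 and 78.0, matching the estimate for MBDP-1 and the corresponding F values in the lower table of Table 9.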
7. Discussion
Although our lexical learning approach shows outstanding performance in learning lexical items, including words and bound morphemes, there is still room for further improvement. In this section, we analyse a number of problems encountered by our unsupervised lexical learners in the experiments. We also discuss possible solutions.
7.1 Negative DLG segmentation
The first problem that still needs to be resolved is negative DLG segmentation: low frequency words in an utterance, which may or may not exist in the refined lexicon, can cause the segmentation of the entire utterance to have a negative DLG. Table 10 lists many examples of negative DLG segmentation output from the OS+V´+LR+WS algorithm, with the frequency of the problem-causing words given at the right.

In itself, it is not a problem that the learner is incapable of recognizing low frequency words. It is entirely normal for unsupervised lexical learning algorithms based on co-occurrence statistics to be weak at learning most of these “bad” words. The real problem in our DLG-based learning algorithm is that when a “bad” word is lost in this way, it appears to interfere with the recognition of other words. For example, in Table 10, [lets] in line 7 grabs seven words, [ok] and [get down and get to these], into the same clump; [idea] in line 10 grabs six other words, [oh i got i got an]. In almost all other utterances these six words are properly recognized by the unsupervised learner, as exemplified by the output from the OS+V´+LR+WS algorithm given below, shown alongside all utterances involving [i got] and [got an] in the original input corpus in the right column.
Table 10 Examples of segmentations with negative DLG
DLG (rows 1–37)
–35.1325 –32.6349 –14.0336 –29.7511 –35.1325 –22.5597 –34.0216 –35.1232 –27.7897 –35.1325 –5.3626 –29.5773 –22.5104 –22.7013 –26.2659 –15.7839 –0.7694 –24.4458 –33.1854 –26.0025 –12.6852 –35.1325 –18.1525 –25.2219 –22.7947 –22.0512 –28.0104 –35.1325 –35.1325 –29.7628 –13.5204 –21.7612 –31.6725 –16.0308 –7.4811 –20.2497 –19.2983
Examples of segmentation
Frequency
asmileithasasmile 2 1 i just realizedhowitworks that ’sawickedlaugh you monkey 1 1 one that willflapinthebreeze 3 iwanttoshowyousomethin’ does the chair partflattenitout 1 2 okletsgetdownandgettothese toys 6 ahaahaahaahawhatelsedoeshe say i don’t know whyshehatesitsomuch 2 1 ohigotigotanidea 1 you ’reintodestruction aren’t you Alice that ’swhywehadyourestrained 1 2 you can spankitifitbites you 3 and this boy ’s put tingonhisshirt could n’tpossiblybebecauseyourmother’shadaphoneattached 1, 2 1 yeah that ’s what the horsedoesyeahp wanna look at look at look this onehaspaper pages you might 2, 1 that sanowlohtheblocksfell over 3 3, 1 oh and the kidsaresayin’byeseethey’rewavin’ (noise) let’s tryawhthisisa snap you ’renotatallinterestedin this dragon ’ think 2 2, 2 igotchyougotchyagotchya 4 do you re member what wesaidthelast time mightbehavingtrouble onhisknees 1, 5 1 look sto me like you ’remakinghimdance 1 but he’s gonna triponhisshoelacestiesomebows like adolphinit’sawhale 1 6 hehassomuchhairyoucan’tseehisneck 4, 2 yastrapityagoanditsticks yeah Iwishitwereaburry 1 2 you know what it’s supposedtobe 2 it’s supposedtobea pieceof steak growlswhotookmy steak 1 1 maybe the drayon’dliketobe inthe high chair 1 ithink hemightbetoosm all for this high chair you know Michael’sgonnagoonvacation today 2, 1 it’s notakiteitthat’sthesteamcom ing outof the food 4, 2
Learning output                      Original input
ohigotigotanidea                     oh i got i got an idea
i got blocks athome                  i got blocks at home#
oh i got them                        oh i got them#
a hi got your nose                   ah i got your nose#
i got ya finger                      i got ya finger#
wait now i got ta fix it             wait now i got ta fix it#
i got it                             i got it#
i got you                            i got you#
i got your hand                      i got your hand#
i got you                            i got you#
i got you lemme open the door        i got you lemme open the door#
it’s got an a@lb@lc@l                it’s got an a@l b@l c@l #
Therefore, the key to solving this word-clumping problem is to find a principled way to protect the other words from being corrupted by erroneous recognition of a bad word. We say “principled way” because a cognitively sound strategy of word segmentation should be able to detach known words from unknown words. In our learning approach we assume the same DLG optimization based approach to word segmentation as to lexical learning. We need to further improve the DLG optimization strategy such that it can incorporate such a cognitively sound principle. An ideal strategy is one that can incorporate the advantages of a DLG-based approach but avoid its disadvantages in dealing with the negative DLG problem. It is quite possible that word segmentation by human subjects with an existing lexicon is an optimization process with a different goodness measure than the one for their lexical learning. A plausible strategy may be as simple as maximal match segmentation (MMS): scanning through the input, outputting the longest matched word and then moving on to the next word, and continuing to work this way through the entire input utterance. Thus it is worth exploring possibilities of incorporating the MMS strategy into the DLG optimization based segmentation for improvement. One possible approach to the incorporation would be to apply the MMS, whenever a negative DLG segmentation is encountered, in order to save as many good words as possible from
the clumping effect with “bad” word(s). According to our statistics, 5.76% of the learning output occurs in such word clumps. That means that, if we had a strategy that segmented this portion as well as the rest, our learning algorithms would improve their learning performance by 5.76% × (70% / (1 – 5.76%)) = 4.28%. This would be a very significant improvement. The problem of negative DLG segmentation deserves much research effort in our future work.
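The maximal match segmentation (MMS) strategy mentioned above can be sketched in a few lines. This is only an illustration of the greedy longest-match idea, not the procedure used in our experiments: the function name `maximal_match`, the toy lexicon, and the choice to emit an unmatched character as a segment of its own are all assumptions made for the example.

```python
def maximal_match(utterance: str, lexicon: set[str]) -> list[str]:
    """Greedy left-to-right segmentation: always output the longest known word.

    Assumption: a character that starts no known word is emitted on its own.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    segments = []
    i = 0
    while i < len(utterance):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(min(len(utterance), i + max_len), i, -1):
            if utterance[i:j] in lexicon:
                break
        else:
            j = i + 1  # no known word starts here
        segments.append(utterance[i:j])
        i = j
    return segments


if __name__ == "__main__":
    known = {"oh", "i", "got", "an"}  # hypothetical lexicon lacking [idea]
    print(maximal_match("ohigotigotanidea", known))
```

With this toy lexicon the six known words of utterance 10 in Table 10 are recovered and only the unknown [idea] is broken up, which is the kind of protection of good words that a fallback on MMS is meant to provide. The sketch also shows a weakness of plain MMS: the initial [i] of [idea] is matched as a known word.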
Table 11 Word type precision and recall of unsupervised lexical algorithms

Learning algorithm    Learned       Correct       Word type
                      word types    word types    Precision (%)   Recall (%)
OS                    1,922         394           20.50           22.25
OS+LR+WS              1,046         400           38.24           22.59
OS+(LR+WS)×2            909         390           42.90           22.02
OS+V                  3,036         601           19.64           33.94
OS+V+LR+WS            1,914         568           29.68           32.07
OS+V+(LR+WS)×2        1,792         557           31.08           31.43
OS+V´                 2,916         604           20.71           34.11
OS+V´+LR+WS           1,582         565           35.71           31.90
OS+V´+(LR+WS)×2       1,492         558           37.42           31.49

7.2 Word type precision and recall
In contrast to the word token precision and recall, which are usually at the level of 70%, the word type precision and recall of our learning algorithms appear low, at slightly above 30%, as listed in Table 11. The number of word types in the original input corpus is 1771. Word type precision and recall in unsupervised lexical learning are commonly at this level, or even lower. There have been few reports on them, except for the word type precision of the MBDP-1 algorithm reported in (Brent, 1999). MBDP-1’s word type precision starts at about 36% and grows to 54% in its incremental learning on the Bernstein corpus. All other algorithms reported in (Brent, 1999) have a word type precision below 30% on average.

From Table 11 we can see that the OS+V´ based algorithms have the best performance, and that when LR+WS is applied a second time, precision increases significantly and the corresponding recall decreases slightly. Both our OS+V and OS+V´ based algorithms learn about 1/3 of the word types in the input corpus. How can such a low word type recall yield word token precision and recall at the level of 70%, as reported above? The answer is given in Figure 4: the top 500 word types, in either the frequency or the coverage ranking, cover about 90% of the input corpus. Our learning algorithms, e.g., OS+V+LR+WS and OS+V´+LR+WS, learn about 550 words correctly, most of which are high frequency words. It is not surprising that these words cover more than 70% of the input corpus.
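The relation between type-level and token-level figures can be made concrete with a small sketch. The first function recomputes word type precision and recall from the counts in Table 11, assuming (as the figures suggest) that precision is the number of correct types over learned types and recall is correct types over the 1771 types in the corpus. The second function, with a made-up toy token list, illustrates how a few frequent types can cover most tokens; the function names and the toy data are ours.

```python
from collections import Counter

CORPUS_WORD_TYPES = 1771  # number of word types in the input corpus


def type_precision_recall(learned: int, correct: int,
                          corpus_types: int = CORPUS_WORD_TYPES) -> tuple[float, float]:
    """Word type precision and recall (in %) from type counts."""
    precision = 100.0 * correct / learned
    recall = 100.0 * correct / corpus_types
    return round(precision, 2), round(recall, 2)


def coverage_of_top_types(tokens: list[str], k: int) -> float:
    """Percentage of tokens covered by the k most frequent word types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return 100.0 * covered / len(tokens)


if __name__ == "__main__":
    print(type_precision_recall(1922, 394))   # OS
    print(type_precision_recall(1046, 400))   # OS+LR+WS
    toy = "the the the the a a a dog cat bird".split()  # made-up data
    print(coverage_of_top_types(toy, 2))
```

For the OS and OS+LR+WS rows this reproduces the tabulated 20.50%/22.25% and 38.24%/22.59%. In the toy list, two of the five types already cover 70% of the tokens, which is the effect that Figure 4 shows for the real corpus.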
Figure 4 Word coverage rate versus frequency and coverage rank of word
Like other statistically based learning algorithms, our algorithms make fewer errors in learning high frequency words than in learning low frequency words. The word type precision and recall of our lexical learning versus word frequency rank are presented in the two figures in Figure 5. The upper figure is plotted in terms of the word frequency in the learning output and the lower in terms of the frequency in the input corpus. The diamonds plot the precision or recall at each frequency rank, and the solid lines plot the average precision or recall over word types up to a certain frequency rank. We can see from the upper figure that the word type precision over frequent words is very high: the average precision up to the first 100 ranks (out of 147) is above 80%, and the learned words are less than 100% correct only in 12 ranks out of the first half, roughly 75, of all ranks. We can also see from the lower figure that the word type recall over frequent words is also very high: the average recall up to the first 100 ranks (out of 147) is above 80%, and there are only 16 ranks in the first half, roughly 75, of all ranks in which the words are not 100% correctly learned. The low overall word type precision and recall is determined by the fact that word type number increases dramatically in the last 10 frequency ranks, where precision and recall both drop very rapidly. So, the focus of enhancing the word type precision and recall is on the enhancement of precision and recall of learning low frequency words.
7.3 Other problems
In addition to the negative DLG segmentation problem and the problem of low word type precision and recall, there are also other problems that hinder our learning algorithms from scoring any better. Some of these problems are inherent in the input data, e.g., the inconsistency and data noise in the transcripts. Others are related to our evaluation criteria, e.g., [-ing] and [-ed] are not counted as creditable morphemes in the learning output, because they are another type of morpheme categorically different from abbreviated forms of existing words.
Figure 5 Word type precision and recall in terms of frequency rank of word
Some problems are directly related to the behavior of the DLG-based learner; e.g., in the Bernstein corpus, the word [balloon], with 46 occurrences (a very high frequency), is correctly recognized 42 times but erroneously divided into [ball][o][on] 3 times. A more frequent word, [another], with 86 occurrences, is always divided into [a][nother].⁶ Many frequent words or noun compounds in some other child-directed corpora, e.g., [instead] and [golfball], are also given abnormal segmentations by the DLG-based learning algorithms. The word [instead], with 25 occurrences, is always segmented into [i][nstead], and the compound [golfball], with 19 occurrences, is divided into [golf][ball] 8 times and recognized as a single word 11 times.

We still do not quite understand this unusual behavior in a DLG-based lexical learner. According to the DLG optimization, the learner should choose [another] and [instead] rather than [a][nother] and [i][nstead], because [another] and [instead] have the same frequency as [nother] and [nstead], respectively, but each has a greater length and therefore a greater (positive) DLG. A lexical learner based on the DLG optimization should select these longer words instead of the shorter ones. Why does it learn these unexpected words? These are interesting problems that deserve further research.

In the present study, we take a reasonable assumption, namely, the least effort principle, as the starting point for unsupervised learning. We have developed an elegant computational theory for unsupervised lexical learning based on the MDL principle and, accordingly, formulated the DLG goodness measure for selecting word candidates. We have also implemented a number of sound learning programs based on DLG optimization that can learn most words correctly from the input corpus. However, a DLG lexical learner still exhibits some unexpected behaviors that are beyond our understanding at the present time. We need to develop a thorough understanding of them in order to advance computational studies on human cognitive mechanisms for lexical learning based on our current achievements with the DLG optimization approach.

6 Native speakers actually say “that’s a whole nother thing”.
8. Conclusions
Unsupervised lexical learning is realized by an algorithm to achieve the DLG optimization over input utterances following the MDL principle. The representation formalism for the learning is trivially simple: each lexical item is represented as a string, with one parameter, namely, its frequency. Each lexical item’s DLG is calculated in terms of its frequency. The Viterbi algorithm is exploited to search for the segmentation of an utterance that gives the greatest sum of DLG over its segments. We have presented a novel approach to unsupervised lexical learning via compression, including its assumptions, underlying theories, a goodness measure for computing the compression effect of extracting word candidates, an optimal segmentation algorithm following this goodness measure to word candidates, and a learning model to apply this algorithm to derive finer-grained lexical items. Experiments on a large-scale corpus of child-directed speech show that its performance compares favourably to the state-of-the-art performance of unsupervised lexical learning. This performance indicates the validity and the effectiveness of the learning approach and the appropriateness of the implementation. The lexical learning process in our computational approach consists of three phases (or learning modules): DLG-based optimal segmentation of input utterances into lexical candidates, lexical refinement to divide the word-clump candidates into individual words, and word segmentation in terms of lexical items acquired in the previous two phases. This lexical learning model is consistent with human infants’ behaviors in lexical learning: they recognize many word clumps as individual lexical items and later divide them into individual words when they are exposed to more language evidence supporting the decomposability of the clumps (Peters, 1983). In our approach each of the three phases involves an
application of the same optimal segmentation algorithm for DLG optimization with a different set of word candidates, or a different word space.

We have developed twelve unsupervised lexical learning algorithms, each with a different combination of learning modules, parameters and constraints, and also developed a comprehensive evaluation approach based on seven evaluation measures to systematically examine their learning performance on the orthographic texts of the Bernstein corpus of child-directed speech. The evaluation measures are word precision and recall, word boundary precision and recall, word type precision and recall, and correct character ratio. This is the most comprehensive evaluation approach ever applied in the field of computational lexical learning.

The top performance of our DLG-based unsupervised learning of words and of words and bound morphemes is achieved by two typical unsupervised lexical learning algorithms involving the three phases, namely, OS+V+LR+WS and OS+V´+LR+WS, respectively. The best performance in learning words is 75% precision, 71% recall and, accordingly, F = 73%, comparing favourably with the state-of-the-art performance achieved by Brent’s MBDP-1 algorithm on the same child-directed speech corpus. Our best performance in learning words and bound morphemes is 80% precision, 76% recall and F = 78%; this F score is 5 percentage points higher than that for words alone, a relative increase of 6.85%.

In addition to the comprehensive evaluation described above, we have also analysed a number of problems encountered by the DLG-based learning approach, including negative DLG segmentation and low precision and recall of word type. This analysis points to directions for future work.
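The search summarized above, in which the Viterbi algorithm finds the segmentation of an utterance with the greatest sum of goodness scores over its segments, can be sketched as a small dynamic program. The sketch is illustrative only: the function name `best_segmentation`, the `max_len` bound, the zero score given to an unknown single character, and the scores in the toy table are all assumptions for the example and are not DLG values computed from any corpus.

```python
def best_segmentation(utterance: str, gain: dict[str, float],
                      max_len: int = 10) -> tuple[list[str], float]:
    """Return the segmentation with the greatest total score, and that score.

    Assumption: a single character absent from `gain` scores 0, so every
    utterance has at least one admissible segmentation.
    """
    n = len(utterance)
    best = [float("-inf")] * (n + 1)  # best[k]: top score for utterance[:k]
    back = [0] * (n + 1)              # back[k]: start of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = utterance[start:end]
            if piece in gain:
                score = gain[piece]
            elif len(piece) == 1:
                score = 0.0
            else:
                continue  # unknown multi-character string: not a candidate
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow the back-pointers from the end of the utterance.
    segments = []
    end = n
    while end > 0:
        segments.append(utterance[back[end]:end])
        end = back[end]
    return segments[::-1], best[n]


if __name__ == "__main__":
    toy_gain = {"i": 1.0, "got": 3.0, "it": 2.5, "igot": 2.0}  # made-up scores
    print(best_segmentation("igotit", toy_gain))
```

In this toy case the clump [igot] is a candidate, but the segmentation [i][got][it] has the greater total score and is therefore selected. Changing the set of candidates in `gain` corresponds to running the same search over a different word space, as in the three phases described above.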
Acknowledgments

The author wishes to thank Yorick Wilks for his enthusiastic support, advice and various kinds of help that have enabled this study, and to thank Ming Li and Paul Vitányi for helpful discussions on Kolmogorov complexity, learning via compression, MDL and other theoretical issues related to this work. Sincere thanks also go to Randy LaPolla and Lisa Raphals for their help, valuable comments and advice, which have improved this paper significantly. The author is responsible for all remaining errors.
References

Bernstein-Ratner, N. (1987) The phonology in parent child speech. In K. Nelson and A. van Kleeck (Eds.), Children’s Language, Vol. 6. Hillsdale, NJ: Erlbaum.
Brent, M. R. (1999) An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning 34, 71–106.
Brent, M. R., and Cartwright, T. A. (1996) Distributional regularity and phonological constraints are useful for segmentation. Cognition 61, 93–125.
Chaitin, G. (1966) On the length of programs for computing finite binary sequences. J. Assoc. Comput. Mach. 13, 547–569.
Chomsky, N. (1957) Syntactic structures. The Hague: Mouton.
Chomsky, N. (1995) The minimalist program. Cambridge, MA: MIT Press.
Cover, T. M., and Thomas, J. A. (1991) Elements of information theory. New York: John Wiley and Sons.
de Marcken, C. (1995) The unsupervised acquisition of a lexicon from continuous speech. Technical report, A.I. Memo No. 1558, AI Lab., MIT, Cambridge, Massachusetts.
de Marcken, C. (1996) Unsupervised language acquisition. PhD thesis, MIT, Cambridge, Massachusetts.
Kit, C. (2000) Unsupervised lexical learning as inductive inference. PhD thesis, University of Sheffield, UK.
Kit, C., and Wilks, Y. (1998) The Virtual Corpus approach to deriving n-gram statistics from large scale corpora. In Proceedings of 1998 International Conference on Chinese Information Processing (pp. 223–229).
Kit, C., and Wilks, Y. (1999) Unsupervised learning of word boundary with description length gain. In CoNLL-99 (pp. 1–6).
Kolmogorov, A. N. (1965) Three approaches for defining the concept of “information quantity”. Problems of Information Transmission 1, 4–7.
Li, M., and Vitányi, P. M. B. (1997) An introduction to Kolmogorov complexity and its applications, 2nd ed. New York: Springer-Verlag.
MacWhinney, B. (1991) The CHILDES database. Dublin, OH: Discovery Systems.
MacWhinney, B., and Snow, C. (1985) The child language data exchange system. Journal of Child Language 12, 171–296.
Manber, U., and Myers, E. (1990) Suffix arrays: a new method for on-line string searches. In First ACM-SIAM Symposium on Discrete Algorithms (pp. 319–327). Providence: American Mathematical Society.
Olivier, D. C. (1968) Stochastic grammars and language acquisition mechanisms. Cambridge, MA: Harvard University.
Peters, A. (1983) The units of language acquisition. Cambridge: Cambridge University Press.
Rissanen, J. (1978) Modelling by shortest data description. Automatica 14, 465–471.
Rissanen, J. (1989) Stochastic complexity in statistical inquiry. New Jersey: World Scientific.
Rissanen, J., and Ristad, E. S. (1994) Language acquisition in the MDL framework. In E. Ristad (Ed.), Language Computations. Philadelphia, PA: American Mathematical Society.
Shannon, C. (1948) A mathematical theory of communication. Bell System Technical Journal 27, 379–423; 623–656.
Solomonoff, R. J. (1964) A formal theory of inductive inference, part 1 and 2. Information and Control 7, 1–22; 224–256.
van Rijsbergen, C. J. (1979) Information Retrieval, 2nd ed. London: Butterworths.
Venkataraman, A. (2001) A statistical model for word discovery in transcribed speech. Computational Linguistics 27 (3), 352–372.
Vitányi, P. M. B., and Li, M. (1997) On prediction by data compression. In Proceedings of the 9th European Conference on Machine Learning, Lecture Notes in Artificial Intelligence, Vol. 1224 (pp. 14–30). Heidelberg: Springer-Verlag.
Vitányi, P. M. B., and Li, M. (2000) Minimum description length induction, Bayesianism, and Kolmogorov complexity. IEEE Transactions on Information Theory 46 (2), 446–464.
Wallace, C. S., and Boulton, D. M. (1968) An information measure for classification. The Computer Journal 11, 185–195.
Wallace, C. S., and Freeman, P. R. (1987) Estimation and inference by compact coding. Journal of the Royal Statistical Society 49, 240–251 (discussion pages 251–265).
Zipf, G. K. (1949) Human behavior and the principle of least effort. New York: Hafner.
8 The Origin of Linguistic Irregularity
Charles D. Yang
Yale University
1. The Challenge of Imperfection
The ban on the discussion of language evolution by the Société de Linguistique de Paris in 1866 surely ranks among the most defied gag orders ever issued. While there has never been a shortage of evolutionary speculations on the origin of language, recent years have seen an explosive growth of work, emerging from a high-profile biannual international conference, a monograph series at the Oxford University Press, and regular publications in leading journals such as Nature and Science. The renewed enthusiasm for the evolution of language was made possible by advances in the study of language and related topics. Fifty years of modern linguistic research have revealed more about the nature of human language than the Société could ever have imagined.

But it is important to keep in mind that a deeper understanding of language only makes the game of language evolution harder to play. Everyone can tell a story about free will because we have not the faintest idea how free will works, or what it is, for that matter. Games without rules are the easiest kind. Modern linguistics has given us a good idea of what the result of the evolution of language is; we must now reconstruct the process that might have led to it. Just as no one will take seriously a theory of how wings evolved if it is devoid of an explanation of why wings are the way they are and not of some other imaginable form or structure, no one will take seriously a theory of how language evolved if it is devoid of an explanation of why language is the way it is, and not some other logical, but empirically unattested, possibility.

The challenge is the classic problem of form and function in evolution. It is easy to tell post-hoc stories of how language is useful for this or that function;¹ it is far harder to understand why language, as we see it in the world around us, has this form but not that. The tension is particularly acute when the observed form appears to be functionally inferior to some conceivable alternative. One such feature is the phenomenon of irregularity in the lexicon. The English past tense is of course the best-known case. Of all English verbs, about 120 are irregular; the rest are regular, forming the past tense by adding -d, along the lines of walk-walked. Another example is the noun plurals in German. German plurals fall into five classes: Kind-er (“children”), Wind-e (“winds”), Ochs-en (“oxen”), Daumen-ø (“thumbs”, with a null suffix), and finally, Auto-s (“cars”), a class which, despite having the fewest members of all, is nevertheless the default (Marcus et al., 1995). The pattern of irregularity also holds for languages without (obvious) overt morphology. Chinese, for example, has a classifier system with irregulars and a default: yi tou zhu (a pig), yi zhi yang (a sheep), yi pi ma (a horse), each classifier being restricted to a specific kind of object or noun, whereas ge is the default, which can be used with novel or nonsense nouns.

Irregulars, by definition, are unpredictable, which means that special attention has to be paid to them by language learners and users. Yet one can easily imagine a lexicon without irregulars, in which the morphophonological uses of all words are completely
1 And it is true that many features of language do seem to be superbly adaptive, e.g., the arbitrary association between sound and meaning, the recursive mechanism that forms words and sentences to express thought, etc.
predictable. For example, we may imagine a language in which nouns ending in a vowel add -t to form plurals, and nouns ending in a consonant add -o. From every functional perspective, the imaginary system would be far easier to learn, produce, and process, and would thus pose considerably less cost to the cognitive and perceptual systems. Moreover, no principle of language, as far as I know, prohibits a regular lexicon like this. But lexical irregularity shows up, in one form or another, in all languages that we know of. This is a non-trivial fact of language that a non-trivial theory of language evolution should explain.

I would like to suggest that the presence of lexical irregularity results from the mechanisms by which words are learned. My argument (it’s a long one) takes the following form. First, based on the much-studied problem of the English past tense, we will argue for a model of word learning that sharply differs from previous approaches, in particular Steven Pinker’s dual-route Words and Rules model (1999). Section 2 summarizes the developmental evidence for our alternative model, which consists of two components: (a) how phonological rules are learned, and (b) how such rules are used in word learning. Section 3 explains the algorithmic processes that underlie these two components, which, we shall argue, may involve domain-general abilities. The possibility that the learning mechanism used in word learning is not unique to language suggests that this mechanism might have been present prior to the emergence of language, and thus would have served as a constraint that shaped the properties of language and the outcome of language evolution. Finally, the model of word learning is extended to a model of sound change over time, with which we will show that irregularity in words is (almost) an inevitable outcome of how words are learned.
2. The Reality of Phonological Rules
There is probably no problem in cognitive science that occupies more minds than the acquisition of English past tense; witness the 15 years of debate in Cognition, and the introduction intended for
Language Acquisition, Change and Emergence
the general public (Pinker, 1999). There are three key empirical findings. First, since Berko’s classic work (1958), we know that children, like adults, generally inflect novel verbs by adding the -d suffix. Second, Marcus et al. (1992) show that about 10% of English children’s irregular verb uses are overregularization errors, e.g., instead of hold-held, the child may say hold-holded. Finally, over-irregularization errors, such as bring-brang where the child misused an irregular pattern, are extremely rare; only 0.2% of all irregular past tense uses (Xu and Pinker, 1995).
2.1 Two models
According to the Words and Rules (WR) model (Pinker, 1999), the irregular verbs are memorized as associated stem-past pairs, following the connectionist literature (Rumelhart and McClelland, 1986), and the regulars are computed by the rule “add -d”, following the tradition of generative linguistics (Halle, 1962; Chomsky and Halle, 1968). This is illustrated in Figure 1. Since words don’t carry (ir)regularity tags, the WR model requires the learner to distinguish regulars from irregulars. In other words, the learner must learn that “add -d” is the regular rule while being exposed to a mixture of regulars and irregulars. Here, the case of German plurals becomes relevant (Marcus et al., 1995): how does the learner conclude that the smallest class is in fact the default? We will return to this question in Section 3.

Figure 1 Past tense according to Words and Rules. W0 … W5 denote irregular verbs, which are directly associated with their past tense, O0 … O5.

The approach in generative phonology is different. It asserts that the computation of all verbs, irregular or regular, is rule-based, and this is the approach we shall pursue. According to Lexical Phonology (Kiparsky, 1982), the rules for the irregulars are lexical/special; they are defined over particular words, and only these words. The default rule, in contrast, is general: it is not lexically restricted and can in principle apply to all words. The effect of irregularity is achieved by a principle of ordering, the Elsewhere Condition, which states that when multiple rules are applicable, the most specific will be used. Hence, an irregular lexical rule, which specifically refers to particular (irregular) words, will override the default, giving hold-held rather than hold-holded. Because both irregular and regular words are computed by rules in this approach, let’s call it the Rules over Words (RW) model. It is illustrated in Figure 2.

Figure 2 Past tense according to Rules over Words. The irregular rules take irregular verbs W0 … W5 as input, and produce output O0 … O5, the irregular past tense.
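The way the Elsewhere Condition lets a lexical rule pre-empt the default can be illustrated with a minimal sketch. The rule inventory, the function names, and the use of spelling in place of phonological representations are all assumptions made for illustration; nothing here is taken from an actual implementation of the model.

```python
# Toy illustration of the Elsewhere Condition: when several rules could apply,
# the most specific one (a rule that names the verb) is used.
# The rules below operate on spelling and are illustrative assumptions only.

# Lexical (special) rules: each is defined over a closed list of verbs.
LEXICAL_RULES = [
    {"name": "vowel change", "verbs": {"hold"},
     "apply": lambda stem: stem.replace("o", "e", 1)},
    {"name": "no change", "verbs": {"cut", "hurt", "put"},
     "apply": lambda stem: stem},
]


def default_rule(stem):
    """General rule: not lexically restricted, so it can apply to any stem."""
    return stem + "ed"


def past_tense(stem):
    """Try the lexical rules first; fall back on the default elsewhere."""
    for rule in LEXICAL_RULES:
        if stem in rule["verbs"]:
            return rule["apply"](stem)
    return default_rule(stem)


for verb in ["hold", "cut", "walk", "blick"]:
    print(verb, "->", past_tense(verb))
```

In this sketch the default never has to list its exceptions: a novel verb such as the made-up "blick" simply fails to match any lexical rule and is picked up by the default.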
For the purposes of this section, we will show that irregular verbs are indeed learned in classes, which are defined by rules that multiple verbs share, just as suggested in the traditional approach to phonology; we will turn to how such rules are learned in Section 3. The rules are, in fact, not so different from those stated in pedagogical grammars, along the lines of:

(1) a. bring, buy, catch, seek, teach, think: add -t & Rhyme → /a/
b. cost, cut, hurt, let, put, quit, set, . . . : add -ø & no change
c. blow, draw, grow, fly, know: add -ø & Rhyme → /u/
d. feed, shot, lose, leave, shoot, . . . : suffixation (-d, -t, and -ø) & Vowel Shortening
e. . . .

In what follows, we will present acquisition evidence in favor of the RW model over the WR model. For a detailed discussion of past tense learning and related issues, see Yang (2002a: Chapter 3). The acquisition data is taken from the longitudinal study of four American children in Marcus et al. (1992). The children and the percentage of their correct past tense use are: Adam (98.2% = 2446/2491), Eve (92.2% = 285/309), Sarah (96.5% = 1717/1780), and Abe (76% = 1786/2350).2 To evaluate the two models, and to quantify the linguistic data during past tense acquisition, we have obtained the frequencies of past tense irregulars from the adult sentences (transcribed in CHILDES) that these four children were exposed to.

2 Although Abe’s performance is considerably worse than the other children, it is not because a few high-frequency verbs were used badly. Rather, Abe’s past tense is problematic across the board; cf. Maratsos (2000).
2.2 The failure of frequency
In the WR model, an irregular verb is linked to its past tense by association, and association follows a simple principle of memory (Pinker, 1995, 1999): the more you hear, the better you remember. Indeed, Marcus et al. (1992) found a strong correlation (–.33) between adult frequency and child overregularization errors. However, having Bill Gates in a bar raises the average income of the patrons but does not make everybody rich. Similarly, collapsing all the irregular verbs together will very likely obscure important differences among these verbs. Figure 3 shows the frequency-performance correlation of the irregular verbs that appeared in the children’s production significantly often (> 25 times). Despite the fact that we are only looking at a subset of children’s irregular verbs, the frequency-overregularization correlation is still –0.32, comparable to that (–0.37) reported in Marcus et al. (1992) for all irregular verbs. We plot the (logarithm of) adult frequencies along the X-axis, and children’s performance along the Y-axis. If the WR model is correct, then we would at least expect a more or less monotonically upward-moving curve, following Pinker’s memory principle; that this is untrue you can see for yourself. Some of the verbs, such as bite-bit and shoot-shot, were heard 20–30 times less often than get-got and put-put, yet they were used nearly perfectly: 89.2% and 93.8% correct. Some, such as threw and knew, are a few times more frequent than bit and shot, but were used far worse: 32.4% and 73.9%. Such examples are abundant, as Figure 3 makes clear.

Let’s see how such frequency-performance disparity can be explained. In the RW model, learning an irregular verb consists of two parts: the learner will have to associate the verb with a specific lexical rule, which also has to be constructed as part of learning.
The verb-rule association is established upon exposure to the particular verb in past tense; this can be quantified by its token frequency. The rule, however, can be established whenever any verb
that falls under it is encountered. In other words, by virtue of being shared by multiple words, the learner’s experience with a rule is actually the sum of the token frequencies of all verbs that fall under it. This two-step process implies that the performance of a particular irregular verb, W, will be correlated with the product of two frequencies: that of W, and that of the rule R of which W is a member, or $f_W \cdot \sum_{i \in R} f_i$. This leads to two quantitative predictions:

(2) a. for verbs within the same class, token frequency will determine relative learning performance.
b. for verbs with comparable frequencies, the size of their respective classes, i.e., the sum of token frequencies, determines relative learning performance.

Figure 4 examines the prediction in (2a); we see that once verbs are examined in classes, frequency-performance correlation is perfect, with the exception of one verb (win-won) in one class. The reader is referred to Yang (2002a) for a detailed comparison of the individual verbs, including raw statistics.

Figure 3. Frequency effects under the WR model

Figure 4. Frequency effects within irregular classes
2.3 The Free-rider effect
The WR model fares far worse once we compare verbs that belong to different classes; here the frequency-performance correlation completely breaks down. First, there are verbs for which adult frequencies are comparable (about 20 out of 110,000 sentences), but children’s performance differs significantly: hurt and cut were used 80% correctly, and draw, blow, grow, and fly were used only 35% correctly. Second, some low-frequency verbs are used much better than higher-frequency ones. Again, hurt and cut were used about 20 times by adults but have a correct usage rate of 80%; in contrast, knew and threw were used by adults 58 and 31 times respectively, but were used only 49% correctly. Finally, for Abe, whose past tense learning is quite delayed compared to Adam, Eve, and Sarah, some of the most frequent irregular verbs were used worse than some with very low frequencies. For example, Abe used hurt and cut correctly 66% of the time: that’s better than go-went (64% correct), which was used by adults 557 times, and come-came, which was used by adults 262 times.
These frequency-performance, or input-output, disparities have a straightforward explanation in the RW model; see (2b) above. Verbs may have high performance if they belong to large classes; recall that exposure to every member of a class contributes to learning the shared rule. Once the effect of rules is taken into account, we see that hurt and cut, while rare in the input, nevertheless belong to a very large class: the “no change” class, which includes very frequent items such as let, set, put, etc., which tally up to over 3,000 occurrences in the input. That is even more frequent than went and came, two of the most frequent irregular verbs, which nevertheless act alone. The “Rhyme→/u/” class, which contains the problematic verbs drew, blew, grew, flew, knew, and threw, has only 125 tokens altogether. Hence, performance on verbs of lower or comparable frequency can be enhanced by class frequency, a Free-rider Effect, confirming the predictions of the RW model.3
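The Free-rider Effect can be made concrete by computing the product of verb frequency and class frequency for the verbs just discussed. The sketch below uses the rounded counts quoted in the text ("about 20" for hurt and cut, "over 3,000" for the no-change class, 125 for the Rhyme→/u/ class); the variable and function names are hypothetical, and the score is only meant to rank verbs, not to predict percentages.

```python
# Rounded adult-input counts quoted in the text (approximations, not raw data).
verb_freq = {"hurt": 20, "cut": 20, "knew": 58, "threw": 31}
verb_class = {"hurt": "no change", "cut": "no change",
              "knew": "rhyme->u", "threw": "rhyme->u"}
class_freq = {"no change": 3000, "rhyme->u": 125}


def rw_score(verb):
    """Token frequency of the verb times the summed frequency of its class."""
    return verb_freq[verb] * class_freq[verb_class[verb]]


# Rank verbs from highest to lowest score.
for verb in sorted(verb_freq, key=rw_score, reverse=True):
    print(f"{verb:6s} f_W={verb_freq[verb]:3d}  score={rw_score(verb)}")
```

Under these figures hurt scores 60,000 against 7,250 for knew, even though knew is heard almost three times as often: the ordering matches the children's performance where raw token frequency does not.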
2.4 General process in special words
Finally, there is a class of irregular verbs that was used very well by all children, almost irrespective of their frequencies. They are shown in Table 1.

Table 1. Vowel shortening irregular verbs

Word          Input Frequency   Percent Correct
lose-lost     63                98%
leave-left    53                96%
say-said      544               99%
shoot-shot    14                94%
bite-bit      13                90%

3 Also worth noting is the verb caught. It was used by adults 36 times — compared to 58 times for knew and 31 times for blew — but used 96% correctly by children. The reason is that the Rhyme→/a/ class is also large: it includes thought, which alone appears 363 times in the input.
All these verbs form past tense by adding a suffix (-t, -d, or -ø), which is followed by the process of Vowel Shortening (Halle and Mohanan, 1985). Vowel Shortening under suffixation happens to be a very general process in the English language, and can be observed in the following examples (Myers, 1987):

(3) a. [ay]-[I]: divine-divinity
b. [i]-[ε]: deep-depth
c. [e]-[æ]: nation-national
d. [o]-[a]: cone-conic
e. [u]-[ʌ]: deduce-deduction
It has been argued that Vowel Shortening falls out of the interaction between universal phonological principles and the way in which syllabification works in English (Myers, ibid; Halle, 1998). If this is the case, then the rule of Vowel Shortening is essentially given for free. Consequently, the task of learning Vowel Shortening verbs is greatly simplified, having been reduced to learning the particular suffixes.4 Given the fact that cross-linguistically, acquisition of affixal morphology is near perfect (Phillips, 1995; Guasti, 2002), children’s impressive performance on vowel shortening verbs is expected. It is important to note that, in order to explain the acquisition data of the Vowel Shortening verbs, one must appeal to the overall sound patterns of the language. This is something that the WR model is in principle incapable of.

Note that both RW and WR models make use of memorization to explain irregular acquisition. Irregulars are, by definition, unpredictable, and must be memorized, somehow, on an individual basis. The two models differ in how the verbs are memorized. The WR model memorizes the past tense of verbs directly, whereas the RW model memorizes word-rule associations, after which rules
4 It is certainly not the case that merely a dozen repetitions, as in the case of shoot-shot and bite-bit, suffice for near perfect learning; recall from earlier discussion that verbs in other classes are used far worse despite higher frequencies.
apply to stems to generate past tense forms. In both models, no memorization is used for regular verbs, and in both models, when a verb isn’t one of the irregulars, the default rule will pick it up. Much of the empirical work in the WR framework, ranging from language acquisition to online processing to cognitive neuroscience to aphasiology (see Pinker, 1999 for a review), focuses on the irregular vs. regular dichotomy, that is, the unpredictability of the irregulars, which calls for special memorization; on this score, both models fare equally well.

Summarizing the acquisition findings, we see that irregular verbs are learned and organized in groups, by means of irregular and regular rules. The word-rule association is statistical in nature, and this is an amendment to the classical conception of phonology, where rule application to words is categorical. Section 3 outlines a computational model that learns these rules and establishes word-rule associations.
3. The Computation of Rules
3.1 What makes a rule default
As noted earlier, in order for the WR model to work, the child must be able to identify, amongst the various sound change patterns in past tense, that “add -d” is special. Only after the -d rule is learned can the child sort verbs into the regular and irregular bins, and proceed to memorize the irregulars individually. Clearly, the default cannot be identified with the rule covering most verb tokens. Among verbs with highest frequencies in English, most are irregulars; indeed, irregular verbs make up 60% of the probability mass in English verbs (Grabowski and Mindt, 1995). It is then suggested (Pinker, 1999) that the default is one that covers most verb types; since there are only about 120 irregular verbs, this idea surely works for English.
But serious problems arise for German noun plural acquisition. Marcus et al. (1995) have established, using the Wug test and others on German speakers, that the “add -s” rule is the default. However, the -s class is the smallest among the five plural classes, four of which are irregular. Looking for the default as the rule with the dominant type frequency will get the German learner nowhere. In addition to the German challenge, Pinker’s view of default learning faces two other problems. First, it is unclear how much computational power will be needed for the child to keep track of frequencies of multiple classes to determine the statistically dominant default class. Second, and empirically, there are morphological systems with no defaults. For example, the genitives in Polish (Dabrowska, 2001) have three case markers, each restricted to a subset of nouns, and none is the default. That is, none of the three markers passes the standard suite of benchmarks, including the Wug test. This is an awkward problem for a learning model that looks for a default rule/class based on type frequency (or anything else, for that matter). So, a successful model for rule learning will have to meet the following conditions:

(4) a. It must handle the statistical minority of the default class in German plurals, and
b. it must not require the presence of a default rule to work and yet it must be equipped to learn it if present in the learning data.
3.2 Rule learning by induction
In an important series of papers, Sussman and Yip (1996, 1997) provide, as far as I know, the only model that fits the bill. The following discussion draws from Molnar (2001), which is an extension of that work. In the S&Y model, the learner constructs phonological rules as mapping relations between input (e.g., stem) and output (e.g., past tense, plural) forms. Both input and output
are represented as a linear sequence of phonemes. Each phoneme is represented by a universal set of distinctive features (Jakobson, Fant, and Halle, 1951; Chomsky and Halle, 1968; Halle, 1983). For instance, /ae p l z/ (“apples”) is represented as follows:
Table 2 The feature representation of “apples” in the Sussman-Yip model.

               [æ]   [p]   [l]   [z]
syllabic        1     0     0     0
consonantal     0     1     1     1
sonorant        1     0     1     0
high            0     0     0     0
back            0     0     0     0
low             1     0     0     0
round           0     0     0     0
tense           0     1     0     1
anterior        0     1     1     1
coronal         0     0     1     1
voice           1     0     1     1
continuant      1     0     1     1
nasal           0     0     0     0
strident        0     0     0     1
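As a minimal sketch of how such a representation might be held in a program, the columns of Table 2 can be stored as one feature vector per segment. The data structure and the helper names (`bundle`, `differing_features`) are assumptions for illustration; only the feature values come from the table.

```python
FEATURES = ["syllabic", "consonantal", "sonorant", "high", "back", "low",
            "round", "tense", "anterior", "coronal", "voice", "continuant",
            "nasal", "strident"]

# Columns of Table 2: one 14-value vector per segment, in FEATURES order.
SEGMENTS = {
    "æ": [1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
    "p": [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    "l": [0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0],
    "z": [0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
}


def bundle(segment):
    """Return the feature bundle of a segment as a name -> value mapping."""
    return dict(zip(FEATURES, SEGMENTS[segment]))


def differing_features(a, b):
    """List the features on which two segments carry conflicting values."""
    fa, fb = bundle(a), bundle(b)
    return [name for name in FEATURES if fa[name] != fb[name]]


# A word is a linear sequence of feature bundles.
apples = [bundle(seg) for seg in "æplz"]
print([seg for seg in "æplz" if bundle(seg)["voice"] == 1])
print(differing_features("p", "z"))
```

Comparing [p] and [z] feature by feature in this way yields the four features on which they conflict (coronal, voice, continuant, strident), which is the kind of comparison an inductive learner over feature matrices has to perform.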
The learning algorithm is simple, and its intuition is based on how phonological rules are represented. Traditionally, rules take the following form (Halle, 1962; Chomsky and Halle, 1968):

(5) A → B / C____D
That is, a computational process (A changes to B) takes place in a specific context (between C and D). In the present model, the learner constructs phonological rules inductively: for words that undergo identical sound change, i.e., A→B, it tries to find what they have in
common, i.e., C_____D. An example in Figure 5, adapted from Molnar (2001), illustrates how the -d rule is learned.
Figure 5. The induction of the default -d rule
R#1 for: walk do: -d
R#2 for: talk do: -d
R#3 for: *alk do: -d
R#7 for: ralk do: -d
R#8 for: *k do: -d
R#10 for: kill do: -d
R#11 for: * do: -d
When learning starts, there are no words or rules. Suppose now the first word, walk-walked, comes in. The model identifies the phonological change from stem to past by comparing the representation of walk and that of walked, and establishes that the relevant change is “add -d”.5 Since this is the only piece of learning data available, the learner can draw no generalization but store it by rote as a trivial rule: if walk then “add -d”. Suppose that the next word is talk-talked. The learner will follow the same procedure to obtain: if talk then “add -d”. Now the learner is ready to make inductive generalizations. The two statements thus far constructed share the then clause: they undergo identical phonological change in past tense. The learner then tries to discover what they have in common in their respective
5 In the implementation, the “ed” suffix in “walked” is actually a /t/, and the whole word is represented as a feature matrix.
phonological descriptions; namely, what can be generalized from walk and talk. The induction algorithm works maximally conservatively; conservatism follows from both logical (Berwick, 1985) and empirical (Clark, 1993) principles of acquisition. It compares two phonological representations, and for phonemes that have conflicting phonological features (+ and –), it places a * (“don’t care”) in the generalization. In our example, /walk/ and /talk/, which only differ in the initial phoneme, are generalized to /*alk/. The learner now considers all words that fit the description /*alk/, that is, all verbs that end in /alk/, to undergo the change “add -d” in past tense.

As illustrated in Figure 5, as more words come in, the learner will continue to carry out the procedure described above. It is clear that, after a few regular verbs are presented, the condition for “add -d” will get very general: conflicting feature values, due to the diverse phonological shapes of regular verbs, will lead to more *’s in the generalization. Eventually, the learner will determine that anything can take -d in past tense, where “anything” is represented by *’s across the board in the if clause. And this is what phonologists call the default rule. The computer simulation actually returns three rules for regular verbs:

(6) a. Verbs that end in a voiced phoneme but not a d:
[*.*.[+voice,+sonorant].d]
[*.*.[+voice,-coronal].d]
[*.*.[-low,-round,-tense,+continuant].d]
b. Verbs that end in an unvoiced phoneme but not a t:
[*.*.[-voice,+strident].t]
[*.*.[-voice,-coronal,-continuant].t]
c. Verbs that end in (d, t):
[*.(d,t).I.d]
They in fact match exactly the phonological rules that linguists would use to describe regular past tense.
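The conservative generalization step described above can be sketched in a few lines. This is a deliberately simplified illustration and not the Sussman-Yip implementation: it compares whole segments rather than feature values, uses spelling as a stand-in for phonemes, keeps a single context per change instead of several rules as in (6), and all names (`generalize`, `matches`, `RuleLearner`) are hypothetical.

```python
WILDCARD = "*"


def generalize(a, b):
    """Right-align two contexts; keep slots that agree, wildcard the rest."""
    n = max(len(a), len(b))
    pa = (WILDCARD,) * (n - len(a)) + tuple(a)
    pb = (WILDCARD,) * (n - len(b)) + tuple(b)
    return tuple(x if x == y else WILDCARD for x, y in zip(pa, pb))


def matches(context, stem):
    """A stem fits a context if every non-wildcard slot agrees, counted from the right edge."""
    stem = tuple(stem)
    for i in range(1, len(context) + 1):
        slot = context[-i]
        if slot == WILDCARD:
            continue
        if i > len(stem) or stem[-i] != slot:
            return False
    return True


class RuleLearner:
    """Stores, for each observed change, the most conservative context seen so far."""

    def __init__(self):
        self.contexts = {}  # change -> generalized context

    def observe(self, stem, change):
        stem = tuple(stem)
        if change not in self.contexts:
            # First example: stored by rote as a trivial rule.
            self.contexts[change] = stem
        else:
            self.contexts[change] = generalize(self.contexts[change], stem)
        return self.contexts[change]


learner = RuleLearner()
for verb in ["walk", "talk", "kill", "jump"]:
    context = learner.observe(verb, "-d")
    print(verb, "->", "".join(context))
```

After walk and talk the context is *alk, which admits a new verb such as chalk but not kill; once phonologically diverse verbs such as kill and jump have been seen, every slot has become a wildcard, which is this sketch's analogue of the default rule.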
Rule learning is very efficient in the S&Y model. Using the training data in MacWhinney and Leinbach (1993) and Ling and Marinov (1993), the S&Y model outperforms all previous implementations of past tense learning. When trained on regular verbs only, the model achieves 99.8% accuracy on prediction of regular past tense with only 30 examples. In contrast, the Ling-Marinov model and MacWhinney-Leinbach model only have a 90% prediction accuracy after 500 training examples, with the former learning faster than the latter. When trained and tested on both regular and irregular verbs, the S&Y model achieves 95% accuracy in prediction after merely 60 examples, the Ling-Marinov model, 76% with 500 examples, and the MacWhinney-Leinbach model, 57% with 500 examples. In addition, the Sussman-Yip model is able to learn verb past tense and noun pluralization at the same time, again, producing rules that are perfectly acceptable to phonologists.

Irregular rules can be similarly learned; some of the results from computer simulation are given below, with the rules and irregular verbs that follow them.

(7) a. [*.*.i->ae.ng] rang, sang
b. [*.*.E->a.t] forgot, got, shot
c. [*.E.n.->t] bent, lent, meant
d. [*.*.(r,l),*->u] blew, drew, grew, flew
e. [*.*.*->a.t] bought, brought, caught, taught, thought
f. [*.*.*->o.z] chose, froze, rose
The current implementation has a number of problems. First, it does not distinguish special and general rules, and this gives the impression that the irregular rules are also productive. For example, the statement in (7a), [*.*.i->ae.ng] for sing-sang, indicates that all verbs with an /ing/ ending will change /i/ to /ae/, which is obviously incorrect (“bring” and “wing” are counterexamples). The solution to this problem is beyond the scope of the present paper. It boils down to this: when is a rule lexically restricted to certain words that the child has to identify in the learning data, as opposed to becoming generally applicable (to novel items as well)?
As far as I know, the productivity problem of phonological rules has been addressed as a problem of learning. In other work (Yang, 2002b), I suggest that the learner strives to maintain a balance between the productivity of a rule and the number of exceptions it has to explicitly maintain. For example, if the rule [*.*.i->ae.ng] were productive, it must mark bring, wing, etc. as exceptions, because they do not fall under the pattern it describes. Under reasonable assumptions about how words are stored and accessed, it is possible to derive formal results on the mechanism the child learner may use to draw the special/general distinction. Roughly, if a rule has too many exceptions, the learner will regard it as lexically marked. Such, I suggest, is the fate of the irregular rules listed in (7). With an independent model of what makes a rule lexical, we can maintain the current model of rule learning. And interestingly, children do occasionally say “bring-brang” at a younger age, which is the only somewhat robust pattern of the very rare over-irregularizations (Xu and Pinker, 1995). This suggests that the [i->ae.ng] rule was productive early on in past tense acquisition. Presumably, if the only /ing/-ending words the child knows are sing, ring, and sting, and they all follow the [i->ae.ng] pattern, the child may well assume the rule to be productive for words with /ing/ endings. It is the accumulation of exceptions (bring, wing) to the rule that demotes it to special/lexical status.

The other problem with the learning model is that it amounts to a two-level model of phonology, with direct mapping between the underlying form (the stem) and the surface form (the past tense). It has no notion of rule ordering. Hence, all the problems associated with two-level phonology, e.g., KIMMO (Koskenniemi, 1983), will be implicated here as well (Anderson, 1988).
Moreover, the representation of words as a linear sequence of phonemes does not reflect the nonlinear representations adopted in modern phonological theories. And although the model can behaviorally replicate its effect, Vowel Shortening is not learned as a general and unified process in the language. Nevertheless, we believe that the basic algorithm of finding generalization through diversity of
phonological representation provides a basic framework suitable for rule learning and can be augmented with richer phonological principles and constraints.6 We can now address the two problematic lexical systems for the WR model, namely, Polish genitives and German plurals. First, the Polish problem is not a problem. The emergence of the default rule is facilitated by the learning data, not as a necessary part of the learning algorithm. If three rules provide a complete (and disjunctive) coverage of words, as in the case of Polish genitives, so be it; there is nothing in the learning model that forces the existence of a default. The German plural challenge is also straightforwardly resolved. The default, according to the rule learning model, is identified with the general rule that has *’s across the board, one which imposes no restrictions whatever on the word. Learning the default has nothing to do with statistics. In German, the default -s class consists of largely loan words, many from English, e.g., Auto-Autos, Radio-Radios, etc. Therefore, the nouns for the “add -s” class, being sufficiently diverse phonologically (like the regular nouns in English), will quickly lead the learner to recognize that the “add -s” rule has no phonological restrictions on the noun.7 The irregular nouns, in contrast, are associated with four other irregular rules, which are more restricted.8 The acquisition evidence reviewed in Section 2 strongly suggests that the mental lexicon is structured by morpho-phonological rules,
6 See Ristad (1994) for a theoretical formulation of learning ordered rules.
7 Indeed, in the S&Y model, the -s rule can be learned after on average only 10 English noun plurals.
8 However, they do seem to be productive, but only with respect to nouns with particular morpho-phonological properties (and are hence less general than the default -s rule). The claim that the -s rule is the default is correct, but slightly misleading. Such claims are established on the fact that when German speakers are presented with novel nouns, the -s rule is often used (Marcus et al., 1995). However, if the novel noun in fact obeys the general morpho-phonotactics of German, an irregular rule is used. See Pouplier and Yang (2003) for discussion.
much as suggested by classical generative phonology (Halle, 1962; Chomsky and Halle, 1968). The present section provides a companion model for learning rule-based phonology. It remains to be seen whether non-derivational approaches such as Optimality Theory can provide an answer to the developmental and learnability problems posed by past tense, a most basic fragment of the English language.
3.3 Possible origins of rule learning
We now turn to the cognitive basis of the ability to learn and use rules in phonological learning, with the suggestion that this ability may be due to general mechanisms of learning that apply to other, non-linguistic, domains of knowledge. Consider the inductive learning algorithm that finds commonality among words through their phonological feature descriptions. There are strongly parallel findings between past tense acquisition and classifier acquisition in Chinese (Hu, 1993; Myers and Tsay, 2000) and Japanese (Yamamoto and Keil, 2000). These classifier systems9 also have defaults and irregulars, and both Chinese and Japanese children overregularize the default. Given that
9 Eds.: Note that the term classifier system here refers to the noun classifiers used in such languages as Chinese. Classifiers are words that are used obligatorily to identify the category of a noun that is to be quantified; LaPolla briefly discusses classifiers with regard to the complexity of language in Part IV. Fortuitously, the term classifier system is also used to refer to the computationally complete, rule-based, message-parsing system described by Holland in Part IV. Yang’s Rules over Words framework and the classifier system have much in common: in particular, both are rule-based systems consisting of a set of rules that encode the default behavior of the agent, with exceptions to those rules acquired to improve the efficiency of the agent; furthermore, both Yang and Holland assume that specific, exceptional rules override general, default rules (a principle that Yang calls the Elsewhere Condition; see main text) to generate a default hierarchy (see Holland, Part IV).
the use of classifiers is largely determined by semantics and conceptual categorization of nouns, we may interpret classifier acquisition as the same algorithm at work: conservative generalization for nouns sharing classifiers, and probabilistic association between nouns and classifiers. However, in classifier acquisition, the algorithm will have to operate on different (semantic) feature representations. More generally, if there is inductive learning at all in human perception and cognition, or in other species for that matter, it must be carried out conservatively, for nothing useful can be learned otherwise. Suppose one observes that both Bush Sr. and Jr. are ready to start a war against Iraq; the rational, and conservative, conclusion is that only families representing the oil industry are inclined to conjure up B-52s for profit, not all father-son pairs.

Just learning rules is not enough; the learner also has to know how to use rules to organize words. Here the fundamental principle is the Elsewhere Condition, which asserts that more specific rules override the application of more general rules. Again I would like to suggest that the Elsewhere Condition has counterparts in other domains of human cognition. The most relevant example can be found in the Gricean conversational Maxim of Quantity, Be Informative. If I were to tell a fellow linguist about the ACE meeting, I would say “this is a conference about language acquisition, change, and evolution,” rather than “this is a conference about language.” When alternative ways of speaking are available, we pick the most specific, which is of the same character as the Elsewhere Condition.10 Specificity over generality may also be seen in non-linguistic tasks. According to the FIFA official rules (2002), if a player deliberately handles the ball, it is a free kick; but if one deliberately handles the ball in the box, it’s a spot kick. Interestingly, the rule for
10 Gregory Ward once lost a bet (to Larry Horn) when he tried to get college students to call a square a rectangle; they wouldn’t, even in contrived situations.
free kicks does not have the clause “ . . . except in the penalty area”. It is unwritten, but tacitly assumed, that human referees will adhere to the specificity over generality principle; it follows that handballs in the area are punished with penalty kicks. Or consider a poker hand: ♣K ♦K ♠K ♥Q ♦Q. It is a full house, although “two pairs” is also a possible, but not the most specific, description. Again, poker rules do not explicitly state that a hand with a pair and three of a kind cannot be two pairs; it is simply assumed that a more specific rule, if applicable, does apply.

These examples open up the possibility that the Elsewhere Condition was a derivative of a more general cognitive ability. This ability would have evolved before language, assuming that language was the latest major event in the evolution of cognition. It was then co-opted for learning and organization of words with rules, after the emergence of the faculty of language. The combination of this domain-general principle and the domain-specific knowledge of language gave us words and rules, and how they interact in the lexicon. Of course, the argument might just go the other way; it is also possible that the Elsewhere Condition was phylogenetically linguistic, and its use in other cognitive domains was a by-product of a fundamentally linguistic principle. A possible way to tease these alternatives apart is to see whether there is evidence for the principle of specificity over generality in other (non-linguistic) species.
4. The Evolution of Rules
4.1 Word drifts
The preceding sections suggest that word learning has two components: (a) rule construction, and (b) word-rule association, which is governed by the Elsewhere Condition. Under this view, an irregular verb V may be simultaneously pulled by two rules: (a) the lexical rule R that it is associated with, and (b) a productive rule R′ that may apply to V because V falls under R′’s phonological description, yet does not, following the Elsewhere Condition, because of the presence of the more specific R. Figure 6 illustrates.
The Origin of Linguistic Irregularity
Figure 6. A word with competing rules. [Diagram: the word V linked to both rule R and rule R′.]
For example, V = “catch”, R = “-t suffixation & Rhyme → /u/”, and R′ = default. The V-R association is established by fiat, purely on the basis of repeated exposure to “caught”, whereas the V-R′ association is automatic given the unrestricted applicability of R′.

Throughout the history of the English language, many words have drifted from rule to rule. So-called analogical leveling typically refers to an irregular verb becoming regular: for example, cleave/clove/cloven is now cleave/cleaved/cleaved, and for many speakers, strive/strove/striven has become strive/strived/strived; see Campbell (1998). Another kind of word drift, analogical extension, refers to a regular verb becoming irregular. For example, the past tense of wear was werede, which would have been regular had it survived into modern English, but in fact wear joined the bear-bore, swear-swore class.

What would attract a word W from the rule it currently falls under, say A, to a different rule, B?11 There are two logical possibilities. First, if A is general and productive, and if B is a more specific match for W than A, then the W-B association is assured by the Elsewhere Condition. That accounts for the pattern of analogical extension: for example, W = wear, A = default, and B = [er->or], as discussed above.12
11 The fact that W does drift from A to B clearly means that B is productive and applies to novel items; here it means that B matches the phonological description of W.
12 Which means that when wear made the shift, [er->or] was a productive rule that would automatically convert /er/ to /or/; this is certified by consulting the OED.
Second, if A is special and hence unproductive, then the W-A association is established by fiat, on the basis of quantitative linguistic data during learning. If there is not “enough” evidence, W would escape the binding of A and succumb to the attraction of B. This accounts for the pattern of analogical leveling: for example, W = strive, A = [i->o], B = default.

The RW model provides a novel, and precise, quantification of how much evidence is “enough” to bind a word to a lexical rule. Recall that the success of learning an irregular verb W is positively correlated with the product of its token frequency, $f_W$, and the token frequency of its class R, that is, $f_W \sum_i f_i$, where i and W belong to the same class. Hence, a verb can stay irregular due to either its own frequency or the high frequency of its class; the acquisition data analyzed in Section 2 provide strong evidence for this view, which can be called “Salvation by Volume”. It contradicts the claim of Bybee and Slobin (1983) and Pinker (1999) that irregularity is maintained by high frequency alone, which can be called “Salvation by Height”. While the frequency-based theory is largely correct for English past tense, it cannot be correct for German plurals. Recall that only about 8% of nouns in German are regular, which means the majority of nouns are irregular. According to the Salvation by Height theory, the majority of the irregular nouns, failing to match English irregular verbs in frequency, will have to drift to the default class; that is not what we see in German, however.
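To see how Salvation by Volume differs from Salvation by Height, the product of a word's token frequency and its class frequency can be computed for two toy words. The names and token frequencies below are invented for illustration and are not drawn from any corpus; the sketch also assumes that the class sum includes W itself.

```python
def volume(word, cls):
    """f_W times the summed token frequency of W's class (W included)."""
    return cls[word] * sum(cls.values())

# Hypothetical token frequencies, for illustration only.
big_class = {"a": 900, "b": 700, "rare": 5}
small_class = {"mid": 40, "c": 10}

print(volume("rare", big_class))   # 5 * 1605 = 8025
print(volume("mid", small_class))  # 40 * 50 = 2000
```

Under these made-up numbers, the rare word in the heavily used class has the larger product, even though its own frequency is lower than that of the word in the small class. A measure based on the word's frequency alone would rank the two the other way.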
4.2 Evolutionary model of words and rules
We are now equipped to develop a model of word and rule change over time. The algorithm runs as follows:

(8) a. Start with a random set of words, each with a “phonological” description (e.g., 100 bits of 0’s and 1’s). The words are grouped into arbitrary classes. Assume that the frequencies of the words follow a Zipfian distribution.
    b. For each generation:
       i. use the rule learning algorithm to derive a generalized rule based on the phonological descriptions of the words that belong to the same class.
       ii. for each word W, with W ∈ R, with a probability $1 - \exp[-V_W \tau]$, where $V_W = f_W \left( \sum_{i \in R} f_i \right)$ and $\tau$ is a decay constant:
           • find a rule R′ which matches the phonological description of W most closely without conflicting bits (Elsewhere Condition)
           • associate W with R′
    c. Repeat (8b).

An example will help the reader to understand how words and rules change. Suppose we have five words, and their phonological descriptions are W1 = [00001], W2 = [01101], W3 = [01001], W4 = [11110], W5 = [01100]. Their evolution over three generations may look like the history in Table 3. Suppose that initially, words 1, 2, and 5 are in the same class, and the learning algorithm collapses conflicting features to derive a Rule [0 * * 0 *]. Similarly, words 3 and 4 lead to another Rule [* 1 * * *]. In generation 2, suppose word 2 drifts to class 2. Now rules have to be relearned from scratch, and notice that the rules learned in generation 2 happen not to differ from those in generation 1. However, generation 3 sees the drift of word 3 to class 1; consequently two brand new rules emerge and the old rules disappear from the lexicon.

Figure 7 shows the result from a typical simulation with 500 words. Initially, they were randomly assigned to 50 classes. When the number of classes has stabilized, there are 6 classes. As Table 4 shows, the simulation converged on one very large class with a completely general description; this we can interpret as the default rule. We also have 5 small classes with specific phonological restrictions. This strongly resembles the organization of the English past tense system.
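The rule-learning step and the Elsewhere Condition in the five-word example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation behind the simulations: the function names are hypothetical, descriptions are plain strings of 0s and 1s, and "most closely" is read as "the matching rule with the most specified bits".

```python
def learn_rule(words):
    """Collapse conflicting bits to '*' across all words in one class."""
    return "".join(
        bits[0] if len(set(bits)) == 1 else "*"
        for bits in zip(*words)
    )

def matches(rule, word):
    """A rule matches when none of its specified bits conflicts with the word."""
    return all(r == "*" or r == w for r, w in zip(rule, word))

def specificity(rule):
    """Number of specified (non-wildcard) positions in a rule."""
    return sum(ch != "*" for ch in rule)

def most_specific_rule(rules, word):
    """Elsewhere Condition (assumed reading): prefer the most specific match."""
    candidates = [r for r in rules if matches(r, word)]
    return max(candidates, key=specificity) if candidates else None

# The five words of the example and their generation-1 classes.
WORDS = {1: "00001", 2: "01101", 3: "01001", 4: "11110", 5: "01100"}
classes = {1: [1, 2, 5], 2: [3, 4]}

rules = {c: learn_rule([WORDS[i] for i in members])
         for c, members in classes.items()}
print(rules)  # {1: '0**0*', 2: '*1***'}
print(most_specific_rule(list(rules.values()), WORDS[4]))  # *1***
```

The two rules printed for generation 1 agree with those given in the text. The probabilistic drift step (8b-ii) is deliberately left out of the sketch.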
Figure 7. A history of 500 words. [Plot: number of classes (vertical axis, 0 to 50) against iterations (horizontal axis, 0 to 40).]
Table 3. A history of 5 words

Generation   Class 1                  Class 2                  Change
1            (1, 2, 5) [0 * * 0 *]    (3, 4) [* 1 * * *]
2            (1, 5) [0 * * 0 *]       (2, 3, 4) [* 1 * * *]    W2 -> R2
3            (1, 3, 5) [0 * * * 0]    (2, 4) [* 1 1 * *]       W3 -> R1
Table 4. Six stable rules and their sizes

Rule                                                  Size
**************************************************    456
*******************1**0******************0********    12
****************************0*****************1***    9
****1******************************************1**    5
*******0******************************************    9
****************1**0******************************    9
It is interesting to note that the sharp reduction in the number of phonological classes in Figure 7 corresponds to the emergence of the default rule. Suddenly a completely general rule emerged, and it very quickly assimilated words from other classes.

Varying certain conditions in the evolutionary model can lead to results similar to German plurals and Polish genitives. In both cases, we may start with several very large irregular classes, i.e., classes that have many members that share part of their phonological descriptions. These classes may remain stable (though it is unlikely, as we shall see in Section 5). In the case of German, we simulate the effect of foreign imports by introducing a small class of random words that differ substantially from those already in the rule system. Because of the phonological diversity of words in this new class, a default rule can quickly emerge that covers only a small number of words. It is unable to attract words from the larger irregular classes, because the class size may prevent the drift of even low-frequency words: Salvation by Volume, as noted earlier.

The evolutionary model is a logical extension of the independently motivated word learning model. It is thus of considerable relevance to the study of historical phonology. Our study shows that words can drift on an individual basis while leaving phonological rules intact. Also, for words that fall under a shared rule, some may drift away and some may not; see Table 3. This is consistent with the theory of lexical diffusion (Wang, 1969), and the reconciliation of lexical diffusion with lexical phonology (Kiparsky, 1986), but not consistent with the Neogrammarian regularity principle.
5. Interface Conditions and Language Evolution
Finally, we can address the imperfection of lexical irregularity. Again, imagine a logically possible morphological system, where all nouns ending in a vowel add -t to form plurals, and all those ending in a consonant add -o; it is systematic, regular, and neat. But one of the trademarks of human language is its “leakiness”: in the lexicon
of the world’s languages irregularity and exceptions abound, as noted at the beginning of this paper. If the rule learning and evolution models provide an accurate description of reality, then we may have an explanation for the prevalence of irregularity. Suppose that a language did have the regular system just described, where two disjunctive rules give a complete and predictable coverage of the nouns. Suppose, as the result of language contact, a few foreign nouns entered the native lexicon. This is a highly plausible assumption, given that language is used and transmitted by humans, and humans are mobile and social animals. Suppose that, for example, all the foreign nouns add -s in plural forms regardless of their phonological properties. As we saw in the simulation, a small number of diverse words suffice to yield a general default rule: add -s no matter what a (native or foreign) word looks like. Now the two existing rules have a competitor, and native nouns may start drifting to the default. The -t and -o rules will get more and more specific, and smaller and smaller. Irregularity follows. As long as the learning data is diversified, through whatever means, there is a very good chance for a default to emerge, which subsequently may assimilate words from other rules and leave behind irregulars, a vestige of once regular rules. Hence, the “real” origin of linguistic irregularity may be historical and unpredictable:13 the model of learning and change presented here helps us pin down the predictable effects of such unpredictable causes.

Our approach to language evolution can be seen as an execution of Chomsky’s Minimalist Program (1995); see Hauser et al. (2002) for elaboration in an evolutionary framework. In the Minimalist Program, the faculty of language is viewed as a cognitive module that interfaces with the rest of human cognition; informally, the
13 Innovation may also, unpredictably, introduce novel patterns into the
lexicon, which can then lead to defaults and irregulars.
“meaning” module and the “sound” module, the so-called interface conditions. We suggest that an additional interface condition lies in the ability to learn. Clearly, for a language to be usable, it must be learnable by children under normal conditions. Here, and at other places (for the learning of syntax, see Yang, 1998, 2002a), I have suggested that the ability to learn language may be due to a learning/growth mechanism in other cognitive and perceptual domains.14 It then renders plausible the hypothesis that the learning mechanisms were earlier evolutionary products. In order for language to be usable at all, it must satisfy these interface conditions. To make an analogy, imagine the mind/brain as the motherboard of a computer. Many parts are old, and shared with other species. Language, a recent arrival, would have to work with these old parts. Fortunately, these interface conditions are directly accessible for empirical study. They may be taken as the design specifications or restrictions on the brand new combinatorial linguistic system, from which one may infer some properties of language that might have inevitably followed. This presents a novel and possibly fruitful approach to the study of language evolution: by studying the present, we might learn something about the past. Doing so may help us to understand why language is exactly the way it is, rather than what it might have been.
Acknowledgments
My thanks to Noam Chomsky, John Frampton, Sam Gutmann, Morris Halle, Julie Legate, Morgan Sonderegger, and Bill Wang for comments and discussion on this work. The audiences at ACE II (City University of Hong Kong), Johns Hopkins University, The Haskins Laboratory, University of Arizona, and Northwestern University have also been helpful.
14 This view of learning in no way marginalizes the innate and domain-specific
knowledge of language, the Universal Grammar; in the present context, it is the human phonological system with features and rules that the learning mechanism has to work with.
References
Anderson, S. R. (1988). Morphology as a Parsing Problem. Linguistics, 26: 521–544.
Berko, J. (1958). The Child’s Learning of English Morphology. Word, 14: 150–177.
Berwick, R. (1985). The Acquisition of Syntactic Knowledge. Cambridge, MA: MIT Press.
Bybee, J., & Slobin, D. (1983). Rules and Schemas in the Development and Use of the English Past Tense. Language, 58: 265–289.
Campbell, L. (1998). Historical Linguistics. Cambridge, MA: MIT Press.
Chomsky, N. (1995). The Minimalist Program. Cambridge, MA: MIT Press.
Chomsky, N., & Halle, M. (1968). The Sound Pattern of English. Cambridge, MA: MIT Press.
Clark, E. (1993). The Lexicon in Acquisition. Cambridge: Cambridge University Press.
Dabrowska, E. (2001). Learning a Morphological System without a Default: the Polish Genitive. Journal of Child Language, 28: 545–574.
Grabowski, E., & Mindt, D. (1995). A Corpus-based Learning List of Irregular Verbs in English. International Computer Archive of Modern and Medieval English Journal, 19: 5–22.
Guasti, M. T. (2002). Language Acquisition: The Growth of Grammar. Cambridge, MA: MIT Press.
Halle, M. (1962). Phonology in Generative Grammar. Word, 18: 54–72.
—. (1998). The Stress of English Words 1968-1998. Linguistic Inquiry, 29: 539–568.
Halle, M., & Mohanan, K.-P. (1985). Segmental Phonology of Modern English. Linguistic Inquiry, 16: 57–116.
Hauser, M., Chomsky, N., & Fitch, T. (2002). The Faculty of Language: What is it, Who has it, and How did it Evolve. Science, 298: 1569–1579.
Hu, Q. (1993). The Acquisition of Chinese Classifiers by Young Mandarin-speaking Children. Ph.D. Dissertation, Boston University.
Jakobson, R., Fant, G., & Halle, M. (1951). Preliminaries to Speech Analysis: The Distinctive Features and their Correlates. Cambridge, MA: MIT Press.
Kiparsky, P. (1982). From Cyclic Phonology to Lexical Phonology. In van der Hulst, H., & Smith, N. (eds.) The Structure of Phonological Representations. I. 131–175.
Kiparsky, P. (1988). Phonological Change. In Newmeyer, F. (ed.) The Cambridge Survey of Linguistics. I. Cambridge: Cambridge University Press. 363–415.
Koskenniemi, K. (1983). Two-Level Morphology: A General Computational Model for Word-form Recognition and Production. Publication No. 11. University of Helsinki: Department of General Linguistics.
Lightfoot, D. (1999). The Development of Language: Acquisition, Change, and Evolution. Oxford: Blackwell.
Ling, C., & Marinov, M. (1993). Answering the Connectionist Challenge: a Symbolic Model of Learning the Past Tense of English Verbs. Cognition, 49: 235–290.
MacWhinney, B., & Leinbach, J. (1991). Implementations are not Conceptualizations: Revising the Verb Learning Model. Cognition, 29: 121–157.
Maratsos, M. (2000). More Overregularizations after All: New Data and Discussion on Marcus, Pinker, Ullman, Hollander, Rosen, & Xu. Journal of Child Language, 27: 183–212.
Marcus, G., Brinkmann, U., Clahsen, H., Wiese, R., & Pinker, S. (1995). German Inflection: the Exception that Proves the Rule. Cognitive Psychology, 29: 189–256.
Marcus, G., Pinker, S., Ullman, M., Hollander, M., Rosen, J., & Xu, F. (1992). Overregularization in Language Acquisition. Monographs of the Society for Research in Child Development, No. 57.
Molnar, R. (2001). “Generalize and Sift” as a Model of Inflection Acquisition. Master’s thesis. Massachusetts Institute of Technology.
Myers, S. (1987). Vowel Shortening in English. Natural Language and Linguistic Theory, 5: 485–518.
Myers, J., & Tsay, J. (2000). The Acquisition of the Default Classifier in Taiwanese. In Proceedings of the 7th International Symposium on Chinese Languages and Linguistics. Chia-Yi: National Chung Cheng University. 87–106.
Ristad, E. (1994). Complexity of Morpheme Acquisition. In Ristad, E. (ed.) Language Computation. Philadelphia: American Mathematical Society. 185–198.
Phillips, C. (1995). Syntax At Age 2: Cross-Linguistic Differences. In MIT Working Papers In Linguistics 26. Cambridge, MA: MITWPL, 325–382.
Pinker, S. (1995). Why the Child Holded the Baby Rabbit: a Case Study in Language Acquisition. In L. Gleitman & M. Liberman (eds.) An Invitation to Cognitive Science: Language. Cambridge, MA: MIT Press, 107–133.
Pinker, S. (1999). Words and Rules: the Ingredients of Language. New York, NY: Basic Books.
Pouplier, M., & Yang, C. D. (2003). Regulars within Irregulars: The Finer Structure of German Plurals. Manuscript in progress, Yale University.
Rumelhart, D., & McClelland, J. (1986). On Learning the Past Tenses of English Verbs: Implicit Rules or Parallel Distributed Processing? In J. McClelland, D. Rumelhart, & the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press, 216–271.
Sussman, G., & Yip, K. (1996). A Computational Model for the Acquisition and Use of Phonological Knowledge. MIT Artificial Intelligence Laboratory, Memo 1575.
—. (1997). Sparse Representations for Fast, One-Shot Learning. Paper presented at the National Conference on Artificial Intelligence. Orlando, Florida.
Wang, W. S.-Y. (1969). Competing Changes as a Cause of Residue. Language, 45: 9–25.
Xu, F., & Pinker, S. (1995). Weird Past Tense Forms. Journal of Child Language, 22: 531–556.
Yamamoto, K., & Keil, F. (2000). The Acquisition of Japanese Numeral Classifiers: Linkage between Grammatical Forms and Conceptual Categories. Journal of East Asian Linguistics, 9: 379–409.
Yang, C. D. (1998). Toward a Variational Theory of Language Acquisition. Manuscript, Massachusetts Institute of Technology.
—. (2002a). Knowledge and Learning in Natural Language. Oxford: Oxford University Press.
—. (2002b). A Principle of Word Storage. Manuscript, Yale University.
Part 3 Language Change
9 The Language Organism: The Leiden Theory of Language Evolution*
George van Driem
Leiden University
1. Language is an Organism
Language is a symbiotic organism. Language is neither an organ, nor is it an instinct. In the past two and a half million years, we have acquired a genetic predisposition to serve as the host for this symbiont. Like any true symbiont, language enhances our reproductive fitness. We cannot change the grammatical structure of language or fundamentally change its lexicon by an act of will, even though we might be able to coin a new word or aid and abet the popularity of a turn of phrase. Language changes, but not because we want it to. We are inoculated with our native language in our
* The following is a synoptic statement on the Leiden theory of language evolution which I presented in a talk at the 2nd Workshop on Language Acquisition, Change and Emergence at the City University of Hong Kong on 24 November 2001 at the kind invitation of Bill Wang. The Leiden theory of language emergence is presented in greater detail in my handbook (van Driem, 2001b).
infancy. Like any other life form, language consists of a self-replicating core. The units of this self-replicating core are memes and their neural correlates. The Leiden theory of language evolution was developed in the early 1980s by Kortlandt (1985) and is further developed in my handbook of the greater Himalayan region (van Driem, 2001b). Meaning is the basis of language. The nature of meanings, understood in terms of the intuitionist set theory or constructivist mathematics developed by L.E.J. Brouwer, is a function of their neuroanatomy and their behavior as units in the Darwinian process of neuronal group selection. The Leiden conception of language evolution provides a linguistically informed definition of the meme (van Driem, 2000a, 2000b, 2001a). Previous characterizations of the meme by Dawkins (1976), Delius (1991) and Blackmore (1999) fall short of identifying the fecund high-fidelity replicators of extra-genetic evolution. The Leiden approach to linguistic forms as vehicles for the reproduction of meaningful elements in the hominid brain differs fundamentally from both the functionalist or European structuralist conception of language, whereby linguistic forms are seen as instruments used to convey meaningful elements, and the formalist or generative approach, whereby linguistic forms are treated as abstract structures which can be filled with meaningful elements. Naming and syntax can be shown to be two faces of the same phenomenon.
2. A Meme is a Meaning, Not a Unit of Imitation
What precisely is a meme? The Oxford English Dictionary defines a meme as ‘an element of culture that may be considered to be passed on by non-genetic means, esp. imitation’. This is a British lexicographer’s recapitulation of Richard Dawkins’ original coinage:

I think that a new kind of replicator has recently emerged on this very planet. It is staring us in the face. It is still in its infancy, still drifting about in its primaeval soup, but already it is achieving evolutionary change at a rate that leaves the old gene panting far behind. The new soup is the soup of human culture. We need a name for the new replicator, a noun that conveys the idea of a unit of cultural transmission, or a unit of imitation. (1976: 206)

This Oxford definition of the meme is incomplete and linguistically uninformed. Charles Darwin came closer to the Leiden definition of the meme when he wrote that ‘the survival or preservation of certain favoured words in the struggle for existence is natural selection’ (1871, I: 60-61). By contrast, Susan Blackmore’s memetics is essentially a linguistically naïve view:

Whether a particular sound is copied because it is easy to remember, easy to produce, conveys a pleasant emotion, or provides useful information, does not matter. . . . There is no such problem as the symbolic threshold with the memetic theory of language. The critical step was the beginning of imitation. . . . Once imitation evolved, something like two and a half to three million years ago, a second replicator, the meme, was born. A spoken grammatical language resulted from the success of copyable sounds. (1999: 103–104, 107)

Language is more than just copyable sounds. A unit of imitation is a mime, and a mime does not meet the criteria of fecundity, high-fidelity replication and longevity required to qualify as a successful life-sustaining replicator. There is an essential difference between pre-linguistic mimes, such as the rice washing of Japanese macaques, and post-linguistic mimes, such as music, clothing fashions and dancing styles, which are able to evoke a myriad of associations in the realm of memes. However, the theme of Beethoven’s 9th symphony is a mime, not a meme. Language exists through meaning. The Leiden school defines memes as meanings in the linguistic sense. Grammatical memes, i.e.
the meanings of grammatical categories, are the systemic memes of any given language and are demonstrably language-specific. The meanings of words, morphemes and fixed idiomatic expressions are lexical memes. Some lexical memes are systemic and structural for a given language. Some are free-wheeling and parasitic. Some occupy an intermediate status. The idea that America is one nation under God, indivisible with liberty and justice for all, is not a meme. It is a syntactically articulate idea composed of a number of constituent lexical and grammatical memes, and this idea and its constituent parts are subject to Darwinian natural selection. Researchers in the field of Artificial Intelligence fail to address the problem of meaning when they resort to the propositional logic developed by the English mathematician George Boole. The adequacy of this approach is claimed as long as the variables are ‘grounded’. By grounding, logicians mean that there is some determinate way in which variables or symbols refer to their referents. Yet natural meaning does not obey the laws of Aristotelian logic or Boolean propositional calculus. A meaning thrives by virtue of its applications, which cannot be deduced from its implications. The implications of a meaning must be derived by its applicability, rather than the other way around. By consequence, a meaning has the properties of a non-constructible set in the mathematical sense. The behavior of the English meaning open is such that ‘The door is open’ can be said of a shut but unlocked door, in that the door is not locked. Likewise, of the same door it can be said that ‘The door is not open’, for it is shut. It is a cop-out to postulate polysemy to clarify such usages because the meaning of English open remains unchanged in either case. The same situation can be truthfully referred to by a linguistic meaning as well as by its contradiction. 
Yet there is no way of formalising a contradiction in traditional logic because of the principle of the excluded middle, i.e. tertium non datur. This principle, which dates back to Aristotle, renders classical logic a powerful tool and simultaneously makes classical logic a mode of thought which is at variance with the logic of natural language. The insight that meaning operates according to
the mathematics of non-constructible sets was set forth by Frederik Kortlandt in 1985 in a seminal article entitled ‘On the parasitology of non-constructible sets’. The insight that human language operates independently of the principle of the excluded middle was appreciated by the Dutch mathematician L.E.J. Brouwer when he developed intuitionist set theory in the first quarter of the 20th century. Brouwer rejected the principle of the excluded middle for language and went as far as to warn mankind that linguistically-mediated ideas and language itself were inherently dangerous.
3. Tertium Datur
The fact that meanings have the nature of non-constructible sets does not mean that meanings are fuzzy. Rather, meanings correspond to sets which are indeterminate in that there is no a priori way of saying whether a particular referent can or cannot be identified as a member of a set. If a homeless person in Amsterdam calls a cardboard box a house, that box becomes a referent of the word house by his or her very speech act. The first bear most children are likely to see today is a cuddly doll from a toy store and not a member of a species of the Ursidae family. Errett Bishop, chief proponent of the school of constructivist mathematics which grew out of intuitionist set theory, also rejected the principle of the excluded middle. He observed that ‘a choice function exists in constructivist mathematics because it is implied by the very meaning of existence’ (1967: 9). Even though Willard Quine adhered to the principle of the excluded middle throughout his life because of its utility as ‘a norm governing efficient logical regimentation’, he conceded that this Aristotelian tenet was ‘not a fact of life’, and was in fact ‘bizarre’ (1987: 57). Classical logical analysis requires the identifiability of distinguishable elements as belonging to the same set. In the case of an extensional definition, it presupposes a sufficient degree of
similarity between the indicated and the intended elements. In the case of an intensional definition, it presupposes the applicability of a criterion, which depends on the degree of similarity between the indicated property and the perceptible characteristics of the intended objects. The constructibility of a set is determined by the identifiability of its elements. Language does not generally satisfy this fundamental requirement of logic. Ever since Gottlob Frege, logicians have focussed on problems of truth in their attempt to understand meaning and language, but this approach has been inherently flawed from the very outset. Once Frege had defined a Gedanke as something which can be subject to logical tests of truth (1918: 64), he was inexorably led to disregard grammatical sentences in language which cannot be reinterpreted as logical propositions and therefore embody no Gedanke (1923: 37). The inadequacy of classical logic for coming to terms with linguistic meaning underlies the failure of both the earlier and the later Wittgenstein to understand the workings of language. Instead, he remained perplexed by the nature of linguistic meaning throughout his life and saw the whole of philosophy as a battle against the bewitching of reason by language (1953: 47). The nature of meaning is a direct function of its neural microanatomy and the way neurons branch and establish their webs of circuitry in our brains. The parasitic nature of linguistically mediated meanings does not mean that there is no such thing as invariant meanings or Gesamtbedeutungen of individual lexical and grammatical categories within a given speech community. Invariant meanings are functionally equivalent within a speech community and can be empirically ascertained through Wierzbickian radical semantic analysis. Language began to live in our brains as an organismal memetic symbiont when these brains became host to the first replicating meaning.
The difference between a meaning and a signal such as a mating call or the predator-specific alarm calls of vervet monkeys is that a meaning can be used for the sake of argument, has the properties of a non-constructible set and has a temporal dimension.
4. Syntax is a Consequence of Meaning
Syntax arose from meaning. Syntax did not arise from combining labels or names for things. Syntax arose when a signal was first split. Hugo Schuchardt had already argued that the first utterance arose from the splitting of a holistic primaeval utterance, not from the concatenation of grunts or names. He argued that the first word was abstracted from a primordial sentence and that the first sentences did not arise from the concatenation of words (1919a, 1919b). First-order predication arose automatically when the first signal was split. For example, the splitting of a signal for ‘The baby has fallen out of the tree’ yields the meanings ‘That which has fallen out of the tree is our baby’ and ‘What the baby has done is to fall out of the tree’. Mária Ujhelyi has considered long-call structures in apes in this regard. The ability to intentionally deceive is a capacity that we share with other apes and even with monkeys. In using an utterance for the sake of argument, the first wordsmith went beyond the capacity to deceive. He or she used an utterance in good faith, splitting a signal so that meanings arose, yielding a projection of reality with a temporal dimension. Since when has language resided in our brains? The idea that the Upper Palaeolithic Horizon is the terminus ante quem for the emergence of language dates back at least to the 1950s. The sudden emergence of art, ritual symbolism, glyphs, rock paintings and animal and venus figurines 60,000 to 40,000 years ago set the world ablaze with new colours and forms. The collective neurosis of ritual activity is an unambiguous manifestation of linguistically mediated thought. However, rudimentary stages of language existed much earlier. What the Upper Palaeolithic Horizon offers is the first clear evidence of the existence of God. God is the quintessential prototype of the non-constructible set because it can mean anything. This makes God the meme almighty. 
The British anthropologist Verrier Elwin quotes the Anglican bishop Charles Gore:

I once had a talk with Bishop Gore and told him that I had doubts about, for example, the truth of the Bible, the Virgin Birth and the Resurrection. “All this, my dear boy, is nothing. The real snag in the Christian, or any other religion, is the belief in God. If you can swallow God, you can swallow anything.” (1964: 99)

The brain of our species has grown phenomenally as compared with that of gracile australopithecines or modern bonobos, even when we make allowances for our overall increase in body size. Initially the availability of a large brain provided the green pastures in which language could settle and flourish. Once meanings began to reproduce within the brain, hominid brain evolution came to be driven by language at least as radically as any symbiont determines the evolution of its host species. Language engendered a sheer tripling of brain volume from a mean brain size of 440 cc to 1400 cc in just two and a half million years. At the same time, the increasingly convoluted topography of our neocortex expanded the available surface area of the brain. The role of innate vs. learned behavior in the emergence of language is an artificial controversy when viewed in light of the relationship between a host and a memetic symbiont lodged in its bloated brain. In the past 2.5 million years, our species has evolved in such a way as to acquire the symbiont readily from earliest childhood. Our very perceptions and conceptualization of reality are shaped and moulded by the symbiont and the constellations of neuronal groups which language sustains and mediates.
References

Bishop, Errett. (1967) Foundations of Constructive Analysis. New York: McGraw-Hill.
Blackmore, Susan. (1999) The Meme Machine. Oxford: Oxford University Press.
Darwin, Sir Charles Robert. (1871) The Descent of Man and Selection in Relation to Sex (2 vols.). London: John Murray.
Dawkins, Richard. (1976) The Selfish Gene. Oxford: Oxford University Press.
Delius, Juan D. (1991) The nature of culture. In Marian Stamp Dawkins, Timothy R. Halliday and Richard Dawkins (Eds.) The Tinbergen Legacy (pp. 75–99). London: Chapman and Hall.
van Driem, George. (2000a) De evolutie van taal: Beginselen van de biologische taalwetenschap. Public lecture delivered in Paradiso, Amsterdam, 16 April 2000.
—. (2000b) The language organism: A symbiotic theory of language. Paper presented at the Belgian-Dutch Workshop on the Evolution of Language held at the Atomium in Brussels, 17 November 2000.
—. (2001a) Taal en Taalwetenschap. Leiden: Research School of Asian, African and Amerindian Studies CNWS.
—. (2001b) Languages of the Himalayas: An Ethnolinguistic Handbook of the Greater Himalayan Region with an Introduction to the Symbiotic Theory of Language (2 vols.). Leiden: Brill.
—. (Forthcoming) The Language Organism.
Elwin, Verrier. (1964) The Tribal World of Verrier Elwin. London: Oxford University Press.
Frege, Gottlob. (1918) Der Gedanke, eine logische Untersuchung. In Arthur Hoffmann and Horst Engert (Eds.) Beiträge zur Philosophie des deutschen Idealismus (pp. 58–77), 1. Band, 1. Heft (1918–1919). Erfurt: Verlag der Keyser’schen Buchhandlung.
—. (1919) Die Verneinung, eine logische Untersuchung. In Arthur Hoffmann and Horst Engert (Eds.) Beiträge zur Philosophie des deutschen Idealismus (pp. 143–157), 1. Band, 2. Heft (1918–1919). Erfurt: Verlag der Keyser’schen Buchhandlung.
—. (1923) Logische Untersuchungen. Dritter Teil: Gedankengefüge. In Arthur Hoffmann (Ed.) Beiträge zur Philosophie des deutschen Idealismus (pp. 36–51), 3. Band, 1. Heft. Erfurt: Verlag Kurt Stenger.
Kortlandt, Frederik Herman Henri. (1985) A parasitological view of non-constructible sets. In Pieper and Stickel (Eds.) Studia linguistica diachronica et synchronica: Werner Winter sexagenario anno MCMLXXXIII gratis animis ab eius collegis, amicis discipulisque oblata (pp. 477–483). Berlin: Mouton de Gruyter.
Quine, Willard Van Orman. (1987) Quiddities: An Intermittently Philosophical Dictionary. Cambridge, Massachusetts: Harvard University Press.
Schuchardt, Hugo. (1919a) Sprachursprung. I (vorgelegt am 17. Juli 1919). Sitzungsberichte der Preussischen Akademie der Wissenschaften XLII, 716–720.
—. (1919b) Sprachursprung. II (vorgelegt am 30. Oktober 1919). Sitzungsberichte der Preussischen Akademie der Wissenschaften XLII, 863–869.
Wittgenstein, Ludwig Josef Johann. (1953) [posthumous]. Philosophische Untersuchungen. London: Basil Blackwell.
10 Taxonomy, Typology and Historical Linguistics
Merritt Ruhlen
Stanford University
1. Introduction

The past decade has witnessed a renewed interest in historical linguistics, as the various controversies surrounding Amerind, Nostratic, and even broader proposed taxa well attest. Yet this renewed interest seems to have revealed as much the current state of confusion within historical linguistics as the validity of any of the newly proposed families. I will argue here that the comparative method was misunderstood by historical linguists in the twentieth century, with the result that the discovery of new genetic relationships among languages effectively ground to a halt — with the significant exceptions of the work of Joseph Greenberg and the Nostraticists.

What is equally distressing is that the borders between three distinct fields — taxonomy, typology, and historical linguistics — have become blurred. Each of these fields has its own goals and its own methodology, and they are not the same. This in no way implies that these fields are completely disconnected from one another. Certainly Greenberg’s enormous knowledge of diachronic typology informed his classification of Eurasiatic languages in many ways, most spectacularly in the explanation of the origin of the Indo-European ablaut system and its historical connections with the vowel harmony systems of Uralic and Altaic (Greenberg, 2000). In the same way, his knowledge of historical linguistics is utilized from the very first steps in taxonomy (Greenberg, 1995), allowing him to recognize the most obvious etymologies and to weed out some spurious ones. What characterizes his work is that he used all three fields in an appropriate manner, and did not confuse the goals and results of these three different fields.
2. Taxonomy and Historical Linguistics

The source of a great deal of the current controversy and confusion in historical linguistics resides in the fact that taxonomy¹ failed to develop in the twentieth century beyond the obvious, and in some instances even regressed. One of the more egregious examples of regression is Johanna Nichols’ (1992: 4) rejection of the Altaic family:

Three language families of central Eurasia . . . share striking similarities in morphosyntactic structure and pronominal roots: Turkic, Mongolian, and Tungusic. For a long time it was assumed that these three families were related as branches of a superstock called Altaic . . . . When the cognates proved not to be valid, Altaic was abandoned, and the received view now is that Turkic, Mongolian, and Tungusic are unrelated.

It is unlikely, however, that this ‘received view’ would be accepted by Roy Andrew Miller (1971, 1991a,b,c), Sergei Starostin (1991), Anna Dybo, Oleg Mudrak (Dybo, Mudrak, and Starostin, 2003), or Greenberg (2000–02), and indeed the Altaic family is in no more doubt today than it was a century ago.

¹ Also called classification, multilateral comparison, and mass comparison. All four terms are synonymous.
If one consults virtually any of the standard textbooks on historical linguistics the subject of taxonomy is not even mentioned and the precise means by which one discovers new language families is either not presented, or is presented in a completely fictitious manner in which reconstruction and sound correspondences are alleged to be the proof of language families. In fact, the fundamental error of the past century was that the comparative method came to mean, in linguistics, the reconstruction of a proto-language using regular sound correspondences. For example, in the index to Bynon (1977: 305) we find “comparative method, the see reconstruction, phonological.” In reality the comparative method in linguistics, biology, or any other field consists of essentially two stages. The first is taxonomy and the second is what is commonly called “historical linguistics,” as shown in Figure 1.
Figure 1. The comparative method
[Diagram: THE COMPARATIVE METHOD consists of two stages: TAXONOMY = CLASSIFICATION, followed by “HISTORICAL LINGUISTICS” (RECONSTRUCTION, SOUND CORRESPONDENCES, HOMELAND, …)]
These two stages are to a very great degree independent of one another and it is taxonomy, not historical linguistics, that defines language families at all levels. Taxonomy provides the wherewithal for the pursuits of most historical linguists: (1) the reconstruction of the proto-language, (2) the discovery of sound correspondences
among the constituent languages (or families), (3) the subgrouping of the family, (4) the location of the ancestral homeland, etc. In Figure 1 I have listed reconstruction and sound correspondences as independent parameters for the simple reason that reconstruction can be carried out in fields such as biology where sound correspondences, or anything analogous to them, are absent. It should also be noted that in biology reconstruction is in no way identified with the comparative method as it is in linguistics. In fact, in discussions of the comparative method in biology reconstruction is scarcely mentioned, and no biologist has ever demanded that Proto-Mammal be reconstructed, along with all of the intermediate stages leading from Proto-Mammal to all modern species of mammal, before he believes that mammals are a valid biological taxon. It has been alleged by a number of scholars that Greenberg has substituted for the comparative method an entirely different method for the investigation of linguistic prehistory. According to the traditional view “the comparative method does not apply at time depths much greater than about 8,000 years” (Nichols, 1992: 2). Greenberg’s methods supposedly begin at this cut-off point and produce families that are qualitatively different from those produced by the standard comparative method. Mark Durie and Malcolm Ross (1996: 5, 9) claim that “multilateral comparison is not a variant of the classical comparative method of historical linguistics. . . . Multilateral comparison . . . 
bears only the most superficial resemblance to the comparative method.” Goddard and Campbell (1994: 195) believe “the differences between Greenberg’s word-comparison approach and the standard historical-linguistic method are so vast that rational discussion between their respective proponents seems almost impossible.” Hans Hock and Brian Joseph (1996: 487) allege that “the American linguist Joseph Greenberg and some associates of his have claimed that long-distance relationships can be established more effectively — and more easily — by employing an approach totally different from the traditional methods. This is an approach of lexical ‘mass comparison’ or ‘multilateral comparison.’” And according to Nichols (1990: 477), “Greenberg (1987) makes clear that he believes such groupings [as
Altaic, Hokan, and Amerind] cannot be reached by the standard comparative method; a wholly different method, mass comparison, is required.” In reality, no such claim is made in Greenberg (1987), nor anywhere else in his writings, for the simple reason that in his view the comparative method applies in the same way at all levels of taxonomy, from the lowest to the highest. This confusion between the different stages of taxonomy and historical linguistics is apparent in the following quote from Bynon (1977: 271–72): “The use of basic vocabulary comparison not simply as a preliminary to reconstruction but as a substitute for it is more controversial . . . . It is clear that, as far as the historical linguist is concerned, [mass comparison] can in no way serve as a substitute for reconstruction.” But Greenberg never claimed that taxonomy is a substitute for reconstruction. Taxonomy and reconstruction are two separate and distinct enterprises. Taxonomy identifies families at all levels; reconstruction seeks to reconstruct the proto-language of a family that has already been identified by taxonomy. It is hard to see how one could even begin to reconstruct a proto-language of a family that hadn’t yet been discovered, much less that such a reconstruction would then somehow ‘prove’ the validity of that language family. A similar confusion of taxonomy with reconstruction is seen in Terrence Kaufman’s (1990: 23) claim that “a temporal ceiling of 7000 to 8000 years is inherent in the methods of comparative linguistic reconstruction. We can recover genetic relationships that are that old, but probably no earlier than that” (italics added). A third example of the confusion of reconstruction and classification appears in a recent textbook (Fox, 1995: 236): “One of the most controversial developments in the whole field of reconstruction in recent years has been the publication of Joseph Greenberg’s classification of the native languages of the Americas” (italics added). 
Many additional quotes could be adduced, but these three more than suffice to show that the distinct notions of genetic relationship and reconstruction have become almost synonymous in the minds of many linguists. In reality, genetic relationships are properties of classifications; they are not consequences of reconstruction.
Sometimes a distinct taxonomic stage is recognized as preceding the later stages that I have called historical linguistics. Yet even here the precise nature of taxonomy seems poorly understood. For example, Durie and Ross (1996: 6–7) have recently characterized the comparative method as consisting of seven stages, the first two of which correspond to what I have called taxonomy, and the last five to historical linguistics. The first two stages are:

1. Determine on the strength of diagnostic evidence that a set of languages are genetically related, that is, that they constitute a ‘family.’
2. Collect putative cognate sets for the family (both morphological paradigms and lexical items).

What is peculiar here is that the two initial steps are given in reverse order. One first uses some mysterious ‘diagnostic evidence’ to identify a language family and then one goes out and actually looks for evidence (grammatical and lexical) to support the validity of the family. But putative cognate sets are the diagnostic evidence for any family. It is the recognition of grammatical and lexical resemblances in both form and meaning that leads to the supposition that certain languages (or language families) are genetically related.

The reason that such putative cognate sets are the basis for detecting linguistic relationships is so obvious that it is often seemingly overlooked. The basis of genetic classification — and hence linguistic relationships — is, quite simply, the arbitrary nature of the sound/meaning relationship in human language. Since any meaning can be represented by any sequence of sounds, there are hundreds, if not thousands, of possible phonetic representations for each meaning in each language. If, then, a certain set of languages has the same, or similar, phonetic representation for a word, one assumes the languages may well be related. If further consideration of this set of languages shows additional similar words, to the exclusion of other surrounding languages, the hypothesis of genetic relationship becomes virtually certain. Any language may accidentally resemble another language once, but if the resemblances continually appear in the same set of languages, and not elsewhere, they are hardly likely to be accidental.
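The force of this reasoning can be illustrated with a back-of-the-envelope calculation. The sketch below is not part of the chapter's argument: it assumes, purely for illustration, that accidental resemblances are independent events with one fixed probability per meaning, and both the probability (0.05) and the size of the word list (100 meanings) are hypothetical figures, not estimates from any data set.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability of at least k chance matches among n compared meanings,
    assuming each meaning matches independently with probability p
    (a binomial model; the independence assumption is a simplification)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical illustration values, not taken from the source.
N_MEANINGS = 100   # size of the compared word list
P_CHANCE = 0.05    # chance that one meaning happens to look alike

for k in (1, 5, 10, 20):
    print(f"at least {k:2d} accidental matches: {prob_at_least(k, N_MEANINGS, P_CHANCE):.3g}")
```

Under these assumptions a single accidental resemblance is close to certain, while the probability of many resemblances recurring in the same set of languages shrinks very quickly, which is the intuition the paragraph above relies on.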
In addition to chance, there are three other possible explanations for resemblant words: sound symbolism, borrowing, and common origin. Sound symbolic words are quite exceptional, precisely because they violate the arbitrary sound/meaning relationship, and are not used in classification, though they are quite correctly reconstructed for proto-languages since all languages do have sound symbolic words. Borrowings can usually be recognized by well-known linguistic techniques, such as reliance on basic vocabulary (pronouns, body parts) and outgroup comparison with other related languages. If the languages concerned are never known to have been in contact — or if the languages concerned cover entire continents — then borrowing is quite improbable. In recent years historical linguists have come to regard common origin, that is, an evolutionary explanation for linguistic similarities, as the explanation of last resort, when in fact it is, as Vincent Sarich (1994) pointed out, the default explanation. Instead of recognizing the simple basis of genetic classification, and thus linguistic relationships, twentieth-century historical linguists put forth increasingly rigorous demands, generally involving reconstruction with regular sound correspondences, before genetic relationships will be acknowledged, demands in fact so rigorous that they could never be satisfied. According to Calvert Watkins (1990: 292–95), a genetic linguistic relationship is first assumed, or hypothesized, by inspection or whatever. At that point must begin the careful and above all systematic comparison, which will lead, if the hypothesis or supposition of genetic relationship is correct, to the reconstruction of the linguistic history of the languages concerned, including the discovery of the attendant sound laws, which are a part of that history. . . . 
If I believe in Indo-European, Algonquian, or Austronesian, it is because scholars have done the necessary systematic explanation and produced the requisite historical results. If I do not believe in an Amerind, Eurasiatic, or Nostratic, it is because scholars have so far neither done the one nor produced the other. To spell it out: because
scholars have neither done the necessary systematic explanation, nor produced the requisite historical results. And there is no other way. But there is another way. It is the way that was used by the founders of comparative linguistics in the nineteenth century. And it is the same method advocated, and utilized, by Greenberg in all his works. This ‘other way’ is known as taxonomy, classification, mass comparison, or multilateral comparison. To demand, as does Watkins, that one must reconstruct the entire proto-language and then explain with regular sound correspondences exactly how every word in every descendant language evolved to its present form is clearly far more than was ever demanded of the very families that Watkins cites approvingly: Indo-European, Algonquian, and Austronesian. All three of these families were recognized early on simply by the specific, and distinct, grammatical and lexical morphemes that characterize each. Reconstruction, regular sound correspondences, and a complete explanation of all linguistic prehistory was never demanded and in fact such concepts as reconstruction and the regularity of sound change first appear only in the second half of the nineteenth century, long after these particular families were accepted by everyone. The recent claim by Nichols (1996: 46) that “the philologer of [Sir William] Jones’ time had been trained . . . in the principles of comparative method and reconstruction” is so wildly anachronistic as to defy explanation. The goal of an historical linguist should not be to demand an explanation for everything before he believes anything, as Watkins implies. Rather a scientist should attempt to explain non-random phenomena, for example, the prevalence of the N/M ‘I/thou’ pronominal pattern in the Americas, and a different pronominal pattern, M/T ‘I/thou’ in northern Eurasia. 
It is all well and good for Watkins to demur on the Amerind and Eurasiatic hypotheses, but if common origin is not responsible for these different pronominal patterns, what is? The requirement of reconstruction with regular sound correspondences appears to have been an innovation of twentieth-century historical linguists. It is not found, so far as I
know, in any of the works of nineteenth-century pioneers of comparative grammar such as Karl Brugmann or Berthold Delbrück. In the late nineteenth century Indo-European was being reconstructed; no scholar thought he was ‘proving’ Indo-European, much less discovering it. The family was accepted by everyone, and that was why they were trying to reconstruct it. What then did the pioneers of comparative Indo-European take as the basis of genetic relationship, if not reconstruction? The following quote from Delbrück is instructive: My starting point is that specific result of comparative linguistics that is not in doubt and cannot be in doubt. It was proved by Bopp and others that the so-called Indo-European languages were related. The proof was produced by juxtaposing words and forms of similar meaning. When one considers that in these languages the formation of the inflectional forms of the verb, noun, and pronoun agrees in essentials and likewise that an extraordinary number of inflected words agree in their lexical parts, the assumption of chance agreement must appear absurd (Delbrück, 1880: 121–2). Greenberg has been accused of having attempted to substitute taxonomy for reconstruction, but what really happened in the twentieth century was that historical linguists sought to substitute reconstruction for taxonomy, thus confusing the goals of historical linguistics with the requirements of genetic classification.
3. Pronouns

It is not by accident that pronouns have figured as one of the major foci of taxonomic controversies in the twentieth century. As Dolgopolsky showed in 1964, the first- and second-person pronouns are the first and third most stable meanings in language (the numeral ‘two’ is second). The past decade witnessed an endless controversy over the alleged Amerind N/M pattern, and only slightly
less controversy over the Eurasiatic M/T pattern.² There have been two camps. The first camp sees both patterns as survivals of two different languages, Proto-Amerind in the first case and Proto-Eurasiatic in the second. The second camp has two subgroups. The first subgroup claims that the alleged patterns are specious and that both patterns occur in both the Americas and the Old World; the second subgroup admits the reality of the two patterns, but attempts to give a non-evolutionary explanation in terms of sound symbolism, language universals, diffusion, etc.

² The Eurasiatic family consists of Indo-European, Uralic, Altaic, Korean-Japanese-Ainu, Gilyak, Chukchi-Kamchatkan, and Eskimo-Aleut; see Greenberg (2000–02).

We should begin by noting that both the Eurasiatic and Amerind pronominal patterns were clearly recognized at the start of the twentieth century by, among others, Alfredo Trombetti (1905), who abundantly documented the Amerind pattern throughout the Americas in an appendix to his book and concluded:

As can be seen, from the most northern regions of the Americas the pronouns NI ‘I’ and M ‘thou’ reach all the way to the southern tip of the New World, to Tierra del Fuego. Although this sketch is far from complete, due to the insufficient materials at our disposal, it is certainly sufficient to give an idea of the broad distribution of these most ancient and essential elements (p. 208).

But Trombetti knew equally well that this American pattern was absent in northern Eurasia, where a totally different pattern, M/T, predominated, and he lamented in his book that “it is clear that in and of itself the comparison of Finno-Ugric me ‘I,’ te ‘you’ with Indo-European me- and te- is worth just as much as any comparison one might make between the corresponding pronominal forms in the Indo-European languages. The only difference is that the common origin of the Indo-European languages is accepted, while the connection between Indo-European and Finno-Ugric is denied” (p. 44).

Antoine Meillet also was aware of the Eurasian M/T pattern, but he proposed a universal explanation rather than a genetic one: “It goes without saying that, in order to establish linguistic affinity, one must ignore everything that can be explained by general conditions common to all languages. For example, pronouns must be short, clearly made up of consonants that are easy to pronounce, and usually without consonant clusters. It is for this reason that pronouns are similar almost everywhere, without this fact implying a common origin” (Meillet, 1965: 89). Apparently unaware of Trombetti’s appendix on the Amerind pronominal pattern, Edward Sapir a decade later noted the presence of both first-person N and second-person M throughout the Americas and wrote, in a personal letter, “how in the Hell are you going to explain general American n- ‘I’ except genetically?” (quoted in Greenberg, 1987). Franz Boas was also aware of the widespread American pattern, but opposed the genetic explanation given by Trombetti and Sapir: “the frequent occurrence of similar sounds for expressing related ideas (like the personal pronouns) may be due to obscure psychological causes rather than to genetic relationship” (quoted in Haas, 1966).

It would thus seem that at least the reality of the pattern — and its virtual restriction to the Americas — was beyond doubt by the beginning of the twentieth century. Such an assumption would, however, be incorrect, for during the final decade of the twentieth century there was a sharp debate, not just on the proper explanation of the Amerind pattern — genetic or non-genetic — but indeed on its very existence. According to Lyle Campbell (1994a: 47), “the n/m [‘I/you’] pattern is not nearly as common in the Americas as Greenberg claimed . . .
[and] his supposed m/t [‘I/you’] pattern for his Eurasiatic languages is also found abundantly in the Americas (despite his and Ruhlen’s assertions to the contrary).” Campbell also claims that “several Amerind groups exhibit pronoun forms (m/t [‘I/you’]) that Greenberg attributes to Europe and Northern Asia” and “the n ‘first person’ / m ‘second person’ is by no means unique to, diagnostic of, or ubiquitous in American Indian languages” (Campbell,
1994b: 3, 9). Campbell’s denial of the reality of both the Eurasiatic and Amerind pronominal patterns is by no means idiosyncratic. Many other scholars have endorsed this view. According to Nichols (1992: 261), “the root consonantism of personal pronouns turns out to have symbolic properties comparable, in both their universality and their basic structural design, to those of “mama”–“papa” vocabulary. . . . Specifically, personal pronoun systems the world over are symbolically identified by a high frequency of nasals in their roots.” In what can only be considered a comical coda to the pronoun controversy the final debate of the twentieth century was between two scholars, Nichols and Campbell, who had previously been on the same side, vigorously opposing a genetic explanation for the Amerind pattern and, in fact, vigorously opposing Amerind as well. In 1996 Nichols published a paper (with David Peterson) recognizing that her previous universal explanations were incorrect and that scholars from Trombetti to Greenberg, who identified different pronominal patterns in different areas of the world, were correct: “the n:m paradigm . . . clearly . . . cannot be due to universals or random chance. . . . The n:m pronominal system is exceedingly rare outside of the Americas and very common in a geographically limited, though large, part of the Americas” (pp. 336–337). What is left unexplained by Nichols and Peterson is how the methodology of ‘population typology’ can arrive at precisely opposite conclusions on the basis of the same language sample. Again following Greenberg (1987: 54), they reject borrowing as a source of the pronominal similarities, pointing out that “pronouns are almost always inherited” (p. 337). 
While it is welcome that Nichols and Peterson (henceforth, N&P) have now independently confirmed Greenberg’s conclusion that the Amerind pattern is essentially an American phenomenon and cannot be explained by universals, chance, or borrowing, it is distressing that they falsely claim that theirs is the first proper demonstration. Criticizing both Campbell and me, they claim that “both sides cite only the evidence supporting their claims, and neither cites enough of that positive evidence to convince the reader
of the distribution of the n:m pronominal system in Amerind or elsewhere; neither side offers a proper survey than can capture evidence, both positive and negative, without bias so that the field can assess the distribution and status of this pronominal system” (p. 337). In fact, this was precisely what I had done in the two papers that N&P cite (Ruhlen, 1994b, 1995a). In the first of these articles I attempted to survey all the pronominal patterns that had been posited for all the world’s language families. For families such as Indo-European or Uralic this merely entailed listing the reconstructions for those families. For families that have not been reconstructed, such as the sub-Saharan families, I listed those pronouns that have been identified by specialists in these families, though without reconstructions. The evidence given was, therefore, neither “positive” nor “negative.” In the second article, using Greenberg’s 21 Amerindian notebooks, I surveyed the existence of both the N/M and M/T patterns in the Americas. Specifically, I looked for languages in the Americas that had both first-person singular N and second-person singular M; at the same time I also looked for languages that had both first-person singular M and second-person singular T. The results of this search are shown in Tables 1 and 2.3 In addition to the enormous breadth of the distribution of the Amerind pattern, there is a significant depth to the Amerind pattern as well. As can be seen, the Amerind pattern has been reconstructed for many Amerind subgroups, sometimes at great time depths (e.g., Proto-Algic, Proto-Hokan, Proto-Penutian, Proto-Uto-Aztecan, Proto-Tanoan, Proto-Quechuan, Proto-Chibchan, Proto-Aruak, Proto-Guahiban), while the Eurasiatic pattern has never been reconstructed by anyone for any American family, no matter how shallow, except for Proto-Eskimo-Aleut, the easternmost branch of
3
Since the original article was published I have developed an independent database containing reconstructions from all levels of Amerind (Ruhlen, 2002) and the results of this database have been incorporated in Table 1. Table 2 remains unchanged because no one, to my knowledge, has ever reconstructed M/T for any Amerind group.
353
354
Language Acquisition, Change and Emergence
Eurasiatic. These results speak for themselves. Moreover, were I to cite languages that have either first-person N or second-person M, we would find the number of languages cited in Table 1 would increase dramatically, while citing languages with either first-person M or second-person T would increase Table 2 only modestly since neither pronoun is common in the Americas. Although N&P now agree with Greenberg on the facts of the distribution, as well as on the illegitimacy of explaining these facts in terms of universals, chance, or borrowing, they do not agree with Greenberg that the explanation is genetic: “If Amerind were a genetic reality and the n:m paradigm a marker of it, then the marker should have a fairly even distribution over all of Amerind and should be found only there” (p. 367). Yet N&P concede that “the n:m paradigm is attested in all six branches of Greenberg’s Amerind” (368), as Table 1 attests. They object apparently that the paradigm is not preserved uniformly in lower-level subgroups, but there is really no such requirement in historical linguistics and there should be no such expectation. N&P offer two reasons that the n:m paradigm is not a genetic marker of Amerind. First, this paradigm in the Americas must be older than the temporal limits of the comparative method and therefore cannot be considered genetic evidence. Secondly, they point out two examples of the n:m paradigm in the Old World, the Vanimo4 language (spoken on the coast of northern New Guinea) and Mongolian, and on the basis of these two languages they project the n:m paradigm back to an even earlier historical connection in Asia — the “Pacific Rim distribution” — though they are unable to say whether this historical connection was due to common origin, borrowing, or something else.
4 Campbell (1997: 339) incorrectly identifies this language as Austronesian. It belongs rather to the Indo-Pacific family (Greenberg, 1971: 822).
Table 1. Distribution of N ‘I’ – M ‘thou’ in the Americas.
ALMOSAN: Proto-Algic *-Vn/*-Vm, Kutenai -na:p-/-m;
PENUTIAN: Proto-Penutian *n-/*m-, Tsimshian n-/m-, Chinook n-/m-, Proto-Coos-Takelma *n-/*ma, Takelma -n/ma, Proto-Plateau-Penutian *ni/*mis, Proto-Sahaptian *ʔi:n/*ʔi:m, Nez Perce ʔí·n/ʔí·m, Klamath ni-/mi-, Proto-California Penutian *ni/*mVn, Wintu ni-/mi-, Colouse nat/mit, Patwin na-/mi-, Proto-Yokuts *na-ʔ/*ma-ʔ, Proto-Maiduan *nik/*min, Maidu ni/mi, Nisenan ni/mi, Proto-Miwok-Costanoan *ka-na/*mi, Tunica -ni/ma, Huave -na-/me-;
HOKAN: Proto-Hokan *nʸa/*ma, Chimariko no-/mam-, Karok na/ɪ̄m, Arra-arra na/im, Pehtsik naah/eehm, Washo le (< *na)/mi, Esselen niš-/miš-, Proto-Yuman *ñ-/*m-, Yuma nnyep/mañ, Mohave inyeč/manč, Walapai ãn/ma, Havasupai inya/ma-a, Yavapai nya-a/ma-a, Diegueño ʔenyaa/maa, Chontal ni/mi, Coahuilteco na/mai, Comecrudo na/emnã, Karankawa n’/m, Cotoname na/men;
CENTRAL AMERIND: Proto-Aztec-Tanoan *neʔ/*ʔeme, Proto-Tanoan *nõ/*ʔẽm, Kiowa nã/am, Jemez ne/ũmiš; Proto-Uto-Aztecan *n-/*m, Kawaiisu nɨʔɨ/ʔimi, Utah ne/yim, Opate ʔina-po/ʔeméʔe, Yaqui ʔinapo/ʔemeʔe, Tarahumara ni-hé/ʔyemi, Papago -ñ/-m, Hopi nuʔ/uma, Nahuatl no-/mo-, Pipil -neč-/-mitˢ;
CHIBCHAN: Proto-Chibchan *na-sV/*mue-ya, Cogui nə́s/má, Proto-Aruak *na/*ma, Ica nən/ma, Chimila náari/ámma, Bribri ñõ/ma, Rama na/ma, Miskito yan (? < *ñan)/man, Ulua yan/man, Sumu yan/man, Guamaca nerra/ma, Lenca una/amna;
ANDEAN: Proto-Quechuan *nuqa/*qam, Jaqaru na/huma, Aymara naya/huma, Mapudungu ta-ñi/ta-mi;
EQUATORIAL: Proto-Guahiban xánɨ/xámɨ, Mocochi an-/ma, Pakaasnovos naʔ/wum, Achuar wina/amin, Jitnu kan/kam, Cuiba xan/xam, Guayabero xan/xam, Itene ana-/ma-;
MACRO-PANOAN: Moseten ñu/mi, Nocten no-/em, Pacaguara no-/mi-, Chacobo no-/mina, Arazaire noena/mina;
MACRO-GE: Delbergia nũ/ma, Tibagi in/ama, Catarina enha/ahama, Kaingang ʔiñ/ʔã.
Table 2 Distribution of M ‘I’ – T ‘thou’ in the Americas.
ESKIMO-ALEUT: Proto-Eskimo *-ma ‘I, my,’ Sirinik məŋa ‘I,’ elsewhere: uvanga ‘I,’ *-t (> -n in Western dialects) ‘thy,’ *-ti-k ‘your dual,’ *-ti-t ‘your plural’; AMERIND: Siouan: Mandan mi/da, Hidatsa ma/da; Paezan: Millcayac mioiñ/tœz; Ge: Coroado make/teke.
Let us dispense first with the Mongolian example. N&P consider Mongolian nam- ‘me,’ čam ‘thee’ to be an example of the “n-:Vm paradigm,” itself a variant of the n:m paradigm. The problem here is that the -m of ča-m is an accusative marker, not a second-person pronoun; the second-person marker is ča-, which derives from Classical Mongolian čima-. This stem contains či ‘thou,’ itself deriving from Proto-Altaic *ti-, as comparison with the plural pronoun ta ‘you’ indicates (Illich-Svitych, 1976: 49). While most linguists would consider the comparison of an accusative ending with a second-person marker a rather serious error, apparently N&P do not, for in response to a similar criticism by Campbell (1997: 344) their reply was that “in sampling we do not purport to give accurate descriptions of the histories of language families; we want to make meaningful comparisons of the frequency of n:m systems in the Americas vs. elsewhere” (p. 609). Most linguists, however, will have a difficult time considering the comparison of an accusative marker with a pronoun a “meaningful comparison,” when the actual history of the form is well known. As for the sole New Guinea example, Campbell (1997: 346) is certainly correct that “chance congruence between parts of New Guinea and parts of America is a much more plausible account than that there is a mysterious historical connection between just these two regions that defies both time and space and is beyond standard notions of language change associated with the comparative method.” If one considers the historical implications of N&P’s Pacific Rim theory, one immediately encounters contradictions. For example, N&P claim that “the languages of eastern North America and especially those of eastern South America presumably descend from earlier colonizations” (p. 369), that is, earlier than the later migration that spread the N/M pattern throughout western North and South America.
Accordingly, N&P consider the Algonquian family to be the result of an earlier migration because it lacks the crucial N/M pronoun system. But Algonquian’s closest kin — Wiyot and Yurok — are (or were) spoken on the California coast and Paul Proulx (1985) has reconstructed the N/M paradigm for Proto-Algic
(= Algonquian, Wiyot, and Yurok) and Algonquian itself has preserved first-person N. Since Proto-Algic possessed the N/M pattern, as Sapir (1913) thought, this simply means that second-person M was lost in Algonquian.5 But how can the loss of a single pronoun and a movement to the eastern seaboard overturn this close genetic relationship with Wiyot and Yurok and transform Algonquian into an earlier migration to the Americas? Of course as soon as one looks at other traits one finds abundant resemblances between Algonquian and all the other Amerind groups in western North America. Algonquian is just a normal Amerind group that happened to lose the second-person Amerind pronoun M. Campbell (1997) makes a number of astute observations and appropriate criticisms of N&P’s article, including pointing out that the alleged absence of the N/M pattern in eastern North and South America has been greatly exaggerated by N&P on the basis of a poorly chosen sample. But what really disturbs Campbell is that he realizes N&P have inadvertently walked into a trap of their own making. Having eliminated borrowing, accident, and universals as possible explanations N&P have left themselves (and Campbell) only one remaining possible explanation, which Campbell notes with alarm: “In denying borrowing of pronoun patterns, N&P in effect rule out ‘areal affinity’ [diffusion] and thus limit, perhaps unwittingly, the interpretation of their ‘single historical development’ to only one possible explanation: genetic relationship, inheritance from a common ancestor” (p. 341). And N&P’s vague description of this mysterious unknown historical process that led to the N/M distribution, “a single historical development of some sort . . . some kind of shared history” (p. 337, italics added), is not likely to satisfy anyone. Nor is their final response to Campbell on this very question: “Something happened. 
We do not and cannot know just what happened, but this does not preclude establishing when and where” (p. 613). What happened is the simple genetic explanation: a single population entered the Americas with the N/M pronoun pattern, spread rapidly throughout both North and South America around 11,000 years ago according to the archaeological record (Klein, 1999), and left in its wake traces of the original N/M pattern and numerous other diagnostic Amerind traits.

5 For the origin of the second-person pronoun that replaced M in Algonquian, see Greenberg (1987: 287).
4. Lexical Evidence
If the evidentiary value of pronouns has been swept under the rug in recent decades, the value of lexical evidence has come to be entirely discounted by many scholars, who often refer to it disparagingly as the ‘laundry list’ approach. According to Goddard and Campbell (1994: 195), “Greenberg’s classification [of American Indian languages] is a codification of his judgements of inspectional similarity and is thus, in principle, ahistorical.” Robert Rankin (1992: 330) proclaims that “the days are gone when ‘word-list linguistics’ could be profitably practiced by American linguists and anthropologists.” The reality is just the opposite. Scholars such as Rankin, Goddard, and Campbell have become so specialized in a narrow subfield, preoccupied with small obvious language families such as Algonquian, Siouan, or Mayan, that they are unaware what roots are widespread in the Americas, and which are not. In other words, they have not even reached the word-comparison stage, much less surpassed it. A single lexical element from the Amerind family illustrates this point. As shown in Ruhlen (1994c), there is a plethora of forms throughout North and South America with the shape tVnV and the meaning ‘child, son, daughter’ or the like. Furthermore, a careful analysis of hundreds of such forms suggests that the first vowel of this root was originally correlated with the gender of the child, with i indicating masculine gender, u feminine, and a indeterminate sex. Thus Proto-Amerind must have had a morphologically complex root *t’ina ‘son, brother,’ *t’una ‘daughter, sister,’ and *t’ana ‘child, sibling.’ No extant Amerind language preserves all three grades of
this root intact, but a number do preserve two (e.g., Tiquie ten ‘son,’ ton ‘daughter’), and even more preserve one. All three grades are, however, preserved elsewhere, for example in the Tucano numeral for ‘one’: nik-e ‘one (masc.),’ nik-o ‘one (fem.),’ nik-a ‘one (indet.)’ (Giacone, 1949). In this case all three grades are retained in what is the general Amerind word for ‘one’ (Ruhlen, 1995b). From the hundreds of examples given in Ruhlen (1994c), Table 3 gives one example of each grade of the root for each of the 13 Amerind subgroups. For Greenberg’s critics there is no historical connection between any of these forms. In addition to the multitude of forms where the vowel still indicates the appropriate gender, there are many forms that are clearly cognate with these forms, but in which the gender appears anomalous from the perspective of the Proto-Amerind pattern. There are numerous well-known typological developments that can lead to this situation. One is when the indeterminate gender becomes specialized as either masculine (e.g., Yuchi tane ‘brother’) or feminine (e.g., Proto-Siouan *i-thã-ki ‘man’s sister’). Another example is Proto-Algonquian *ne-tāna ‘my daughter,’ but already in 1923 Sapir had recognized that “Proto-Algonquian *-tan- must be presumed to have originally meant ‘child’ . . . and to have become specialized in its significance either to ‘son’ (Wiyot) or ‘daughter’ (Algonkin proper), while in Yurok its close relative -ta-ts [‘child’] preserved a more primary genetic significance” (Sapir, 1923: 41). Furthermore, this morphologically complex item occurs with a whole variety of Amerind suffixes and prefixes, as shown in Ruhlen (1994c). Clearly it is much more likely that all these t’ina/t’ana/t’una forms are related historically than that they are not. And that is the question that must be kept in mind.
One should not demand that Proto-Amerind be reconstructed with regular sound correspondences; rather one should ask what is the most probable explanation for the presence of the Amerind pronominal pattern N/M in the Americas (and its absence elsewhere) and the presence of a morphologically-complex root *tVnV, exhibiting a distinctive gender ablaut system (not found elsewhere in the world), all in the same set of languages? Only common origin is a reasonable explanation for these and the many other grammatical and lexical items that characterize the Amerind family.

Table 3. Proto-Amerind *T’INA/*T’ANA/*T’UNA “son/child/daughter”
Each row lists, in order: *T’INA “son, brother, boy”; *T’ANA “child, sibling”; *T’UNA “daughter, sister, girl”.
Almosan: tˢin “young man” (Yurok); t’an’a “child” (Nootka); tune “niece” (Coeur d’Alene)
Keresiouan: -ɽtsin “male boy” (Mohawk); tane “brother” (Yuchi); -t’aona “sister” (Keres)
Penutian: pnç-t’in “my elder brother” (Molala); t’ána-t “grandchild” (Totonac); tŏne- “daughter” (Cent. Sierra Miwok)
Hokan: t’inı˱-si “child, son, daughter” (Yana); t’an-pam “child” (Coahuilteco); a-t’on “younger sister” (Salinan)
Central Amerind: ɽdɑˮɑˮnó “brother” (Cuicatec); *tana “daughter, son” (Uto-Aztecan); -t’ut’ina “older sister” (Taos)
Chibchan: sin “brother” (Changuena); tuk-tan “child, boy” (Miskito); tuntu-rusko “younger sister” (Lenca)
Paezan: tzhœng “son” (Millcayac); dani- “mother’s sister” (Warrau); tˢuh-ki “sister” (Cayapa)
Andean: den “brother” (Tehuelche); tayna “first-born child” (Aymara); thaun “sister” (Tehuelche)
Macro-Tucanoan: ten “son” (Tiquie); tani-mai “younger sister” (Masaca); ton “daughter” (Tiquie)
Equatorial: tin-gwa “son, boy” (Mocochi); taɽɎn “child” (Urubu-Kaapor); a-tune-sas “girl” (Morotoko)
Macro-Carib: dçnu “male child” (Yagua); tane “my son” (Pavishana); -tona “sister” (Nonuya)
Macro-Panoan: u-tse-kwa “grandchild” (Tacana); tawin “grandchild” (Lengua); -tóna “younger sister” (Tacana)
Macro-Ge: àina “older brother” (Guato); tog-tan “girl” (Tibagi); a-ton-kä “younger sister” (Piokobyé)
5. Typology
If the confusion between taxonomy and historical linguistics was pervasive throughout the twentieth century, the fundamental distinction between typological classification and genetic classification, the former based on historically-independent structural traits, the latter on historically-related genetic traits (i.e., those involving both sound and meaning), seemed securely established by Greenberg’s African classification in which, for the first time, typological traits were eliminated from consideration. Greenberg’s elimination of typological traits from genetic classification has been hailed by many as one of the primary achievements of this African classification (Dimmendaal, 1993: 801) and it is now universally accepted that Meinhof’s so-called “Hamitic” group, which was based in part on the presence or absence of gender in the various languages, is in no way a valid linguistic taxon. Recently Nichols has sought to resurrect the typological approach to historical linguistics in a book (1992) and series of articles (Nichols, 1990, 1995; Nichols and Peterson, 1996, 1998). According to Nichols (1992: 36), “one of the advantages claimed here for population typology is its ability to draw historical inferences from areal populations whose genetic classification is inadequate or incomplete from the perspective of standard historical work.” The results of Nichols’ enterprise, however, are no more valid than those of Meinhof. Space does not permit a complete examination of the numerous flaws in Nichols’ work (Greenberg, 1993; Ruhlen, 1994a). I would, however, like to point out a few of the fatal flaws in what Nichols calls “population typology” and to explain why it cannot possibly lead to any “historical inferences.”
The problem with the use of typological traits is three-fold: (1) Few traits are used: Meinhof used just one, Nichols uses ten. (2) Each trait has a small number of possible states: basic word order (SVO, SOV, and so on) yields six possible states, head/dependent marking just two. When we compare this method with normal historical linguistics it is evident why the typological approach is so weak. In classifying languages taxonomists use hundreds of words, not one or ten, and each of these words has hundreds, if not thousands, of possible states (i.e., phonetic representations). Therefore the resolution of traditional taxonomy is vastly superior to the typological approach because it uses hundreds of different traits (words), each of which has hundreds of possible states (phonetic shapes). (3) Typological traits are often highly correlated (Greenberg 1963), making them less valuable for taxonomy, whereas genetic traits (words and affixes) are independent of one another: the word for ‘hand’ does not influence what the word for ‘water’ might be. Finally, it should be pointed out that typological traits do not cluster within historically valid taxa, while lexical and grammatical formatives do (from Romance to Nilo-Saharan, Amerind, or Eurasiatic), and it is precisely this clustering of (different) genetic traits that defines each group. As a rule of thumb evolutionary biologists would like to have at least as many traits (polymorphisms) as populations under consideration. Yet Nichols employs just ten traits for 174 languages. In biology it has long been well known that the use of a small number of traits for a large number of populations, such as Nichols has used, leads to unreliable and even absurd results. And indeed Nichols has discovered that within the Americas “eastern North America is peripheral and isolated, with more affinities to New Guinea than to the rest of the New World” (p. 224).
This finding obviously does not jibe with N&P’s claim that it is western North America that is connected with New Guinea by the N/M pronoun pattern in Vanimo. In reality, of course, the Amerind populations of eastern and western North and South America are much more closely related, linguistically and biologically, to one another than either is to populations in New Guinea. Just as the use of a single
genetic trait, the N/M pronoun pattern, led to absurd results, so too does the use of 10 typological traits. Nichols also concludes (1992: 274f) that the ten typological traits she has surveyed provide evidence for the ‘Out of Africa’ hypothesis of human origins advocated by some paleontologists and geneticists, but there is in fact no such evidence in the book. The entire final chapter of the book, which is the only one in which human prehistory is mentioned, seems to have been grafted onto a book about typology, without any real connection to these typological data. Another alleged discovery of “population typology” is the existence of global clines for certain typological features. Thus Nichols claims that the inclusive/exclusive opposition constitutes a global cline. However, a cline is a reflection of an historical event. For example, there is a cline in certain gene frequencies, running from Turkey northwest through Europe, that has been associated with the spread of agriculture from Anatolia through Europe (Ammerman and Cavalli-Sforza, 1984). One example is the Rh polymorphism, where it appears that the Rh– gene was originally at very high levels in Europe6, and in fact even today the Basques have the highest incidence of this gene in the world. Ammerman and Cavalli-Sforza showed that the cline of the Rh– gene frequencies in Europe paralleled in remarkable fashion the wave of advance of the Middle Eastern agriculturalists as they spread through Europe, farmers who were presumably mostly, if not completely, Rh+. The cline for this gene (and for others) would reflect the progressive admixture of the genes as these two populations interbred. Returning to Nichols’ global cline for the inclusive/exclusive opposition, what historical process could possibly have been responsible for such a cline? Or, more to the point, how could historical events that are unrelated (i.e., the appearance of the inclusive/exclusive opposition in Australia, Oceania, Africa, and the Americas) constitute a global cline? For unrelated historical events to organize themselves into a global cline would seem to require either supernatural intervention or extremely good luck. In point of fact, however, the inclusive/exclusive opposition does not form a global cline. Rather, as Nichols herself admits, the inclusive/exclusive opposition is highest in Oceania, next highest in the Americas, and lowest in the Old World (Nichols, 1992: 206–7). This pattern thus in no way constitutes the east-to-west cline posited by Nichols. But this geographical inconvenience does not deter Nichols in her quest for global clines, for, according to Nichols, “water miles” are twice as long as “land miles” and therefore the Pacific Ocean is to be thought of as in the Atlantic Ocean and thus the pattern Oceania–Americas–Old World constitutes a global cline. And “water miles” are not Nichols’ only discovery. She also attributes the presence of noun class systems in Africa not to the historical process of diachronic typology outlined in Greenberg (1978), but rather to the fact that Africa is a “hotbed” for noun classes and therefore it is to be expected that African languages should have noun classes. For Nichols, an explanation is simply a neologism away.

6 The Rh– gene is thought to have arisen in Europe through a mutation; it is rare elsewhere in the world. For further discussion see Cavalli-Sforza, Menozzi, and Piazza (1994: 300) and Cavalli-Sforza (1996: 169f).
6. Conclusion
It is a sad commentary on the fate of historical linguistics in the twentieth century that families such as Eurasiatic and Amerind, whose outlines were already perceived by scholars such as Trombetti and Sapir at the beginning of the last century, are still considered controversial, at best, or completely spurious, at worst. Both families are in fact obvious if the proper basis of genetic classification (taxonomy) is understood and applied without preconception, rather than the caricature of the comparative method that became fashionable in the twentieth century. Equally deplorable was the retrogression in linguistic taxonomy during the twentieth century, where valid families that were well established and accepted at the start of that century (Altaic, Na-Dene, Hokan, Penutian) came to be considered dubious by many traditional historical linguists. According to Hock and Joseph (1996: 498, 502), “Greenberg’s methodology of mass comparison must be considered of dubious reliability. . . . There is no credible alternative to the cumbersome and time-consuming traditional method of comparative linguistics.” Unfortunately this cumbersome and time-consuming method made no contribution to linguistic taxonomy in the twentieth century and it should be clear by now that it never will. I have tried in this article to clarify the methodological basis of genetic relationship, a methodology that was worked out in the nineteenth century by the pioneers of comparative linguistics and then largely forgotten or ignored during the twentieth century when the true principles of linguistic taxonomy disappeared from what became known as historical linguistics. The time is past for historical linguists to continue to pretend that language can only be used to study human prehistory at very shallow time depths and that Indo-European represents the temporal limits of the comparative method. There is no evidence that this is true and there is a great deal of evidence that it is not. The reconstruction of human prehistory using comparative linguistics is not just a responsibility of the linguistic community; it is also a debt that we owe to the ancillary human sciences.
References
Ammerman, Albert J., and L. L. Cavalli-Sforza. 1984. The Neolithic transition and the genetics of populations in Europe. Princeton: Princeton University Press.
Bynon, Theodora. 1977. Historical linguistics. Cambridge: Cambridge University Press.
Campbell, Lyle. 1986. Comments. Current Anthropology 27.488.
__. 1994a. Inside the American Indian language classification debate. Mother Tongue 23.41–55.
__. 1994b. Putting pronouns in proper perspective in proposals of remote relationships among Native American languages. Survey of California and other Indian languages, Report 8, ed. by Margaret Langdon, 1–20. Berkeley: Survey of California and Other Indian Languages.
__. 1997. Amerindian personal pronouns: A second opinion. Language 73.339–51.
Cavalli-Sforza, L. Luca. 1996. Gènes, peuples et langues. Paris: Odile Jacob.
Cavalli-Sforza, L. Luca, Paolo Menozzi, and Alberto Piazza. 1994. The history and geography of human genes. Princeton: Princeton University Press.
Delbrück, Berthold. 1880. Einleitung in das Sprachstudium. Leipzig.
Dimmendaal, Gerrit J. 1993. Review of On language: Selected writings of Joseph H. Greenberg. Language 69.796–807.
Dolgopolsky, Aron B. 1964. Gipoteza drevnejshego rodstva jazykovyx semej severnoj evrazii s verojatnostnoj tochki zrenija. Voprosy jazykoznanija 2.53–63.
Durie, Mark, and Malcolm Ross. 1996. Introduction. The comparative method reviewed: Regularity and irregularity in language change, ed. by Mark Durie and Malcolm Ross, 3–38. New York: Oxford University Press.
Dybo, Anna, Oleg Mudrak, and Sergei A. Starostin. 2003. An Altaic etymological dictionary. Leiden: Brill.
Fox, Anthony. 1995. Linguistic reconstruction: An introduction to theory and method. New York: Oxford University Press.
Giacone, Antonio. 1949. Os Tucanos e outras tribus do Rio Uaupés afluente do Negro-Amazonas: notas etnográficas e folclóricas. São Paulo.
Goddard, Ives, and Lyle Campbell. 1994. The history and classification of American Indian languages: What are the implications for the peopling of the Americas? Method and theory for investigating the peopling of the Americas, ed. by Robson Bonnichsen and D. Gentry Steele, 189–207. Corvallis, OR: Center for the Study of the First Americans.
Greenberg, Joseph H. 1963. Some universals of grammar with particular reference to the order of meaningful elements. Universals of Language, ed. by Joseph H. Greenberg, 73–113. Cambridge: M.I.T. Press.
__. 1971. The Indo-Pacific Hypothesis. Current Trends in Linguistics, vol. 8, ed. by J. Donald Bowen, et al., 807–71. The Hague: Mouton.
__. 1978. How does a language acquire gender markers? Universals of human language, Vol. 3, ed. by Joseph H. Greenberg, 47–82. Stanford: Stanford University Press.
__. 1981. Amerindian Notebooks, 21 volumes, unpublished.
__. 1987. Language in the Americas. Stanford: Stanford University Press.
__. 1993. Review of Linguistic diversity in space and time, by Johanna Nichols. Current Anthropology 34.503–5.
__. 1995. The concept of proof in genetic linguistics. Mother Tongue 1.207–16.
__. 2000–02. Indo-European and its closest relatives, 2 vols. Stanford: Stanford University Press.
Haas, Mary. 1966. Wiyot-Yurok-Algonkian and problems of comparative Algonkian. International Journal of American Linguistics 32.101–7.
Hock, Hans Henrich, and Brian D. Joseph. 1996. Language history, language change, and language relationship: An introduction to historical and comparative linguistics. Berlin: Mouton de Gruyter.
Illich-Svitych, Vladislav M. 1971–84. Opyt sravnenija nostraticheskix jazykov, 3 vols. Moscow: Nauka.
Kaufman, Terrence. 1990. Language history in South America: What we know and how to know more. Amazonian linguistics: Studies in lowland South American languages, ed. by Doris L. Payne, 13–73. Austin: University of Texas Press.
Klein, Richard. 1999. The human career. Chicago: University of Chicago Press.
Meillet, Antoine. 1965 [1914]. Le problème de la parenté des langues. Linguistique historique et linguistique générale, by Antoine Meillet, 76–101. Paris: Champion.
Miller, Roy Andrew. 1971. Japanese and the other Altaic languages. Chicago: University of Chicago Press.
__. 1991a. Genetic connections among Altaic languages. Sprung from Some Common Source, ed. by Sydney M. Lamb and E. Douglas Mitchell, 293–327. Stanford: Stanford University Press.
__. 1991b. Anti-Altaicists contra Altaicists. Ural-Altaische Jahrbücher 63.5–62.
__. 1991c. How many Verner’s Laws does an Altaicist need? Studies in the Historical Phonology of Asian Languages, ed. by William G. Boltz and Michael C. Shapiro, 176–204. Amsterdam: John Benjamins.
Nichols, Johanna. 1990. Linguistic diversity and the first settlement of the New World. Language 66.475–521.
__. 1992. Linguistic diversity in space and time. Chicago: University of Chicago Press.
__. 1995. The spread of language around the Pacific rim. Evolutionary Anthropology 3.206–15.
__. 1996. The comparative method as heuristic. The comparative method reviewed: Regularity and irregularity in language change, ed. by Mark Durie and Malcolm Ross, 39–71. New York: Oxford University Press.
Nichols, Johanna, and David A. Peterson. 1996. The Amerind personal pronouns. Language 72.336–71.
__. 1998. A reply to Campbell. Language 74.605–14.
Proulx, Paul. 1985. Proto-Algic II: verbs. International Journal of American Linguistics 51.59–93.
Rankin, Robert L. 1992. Review of Language in the Americas, by Joseph H. Greenberg. International Journal of American Linguistics 58.324–49.
Ruhlen, Merritt. 1994a. Review of Linguistic diversity in space and time, by Johanna Nichols. Anthropos 89.640–41.
__. 1994b. First- and second-person pronouns in the world’s languages. On the origin of languages: Studies in linguistic taxonomy, 252–60. Stanford: Stanford University Press.
__. 1994c. Amerind T’ANA ‘child, sibling.’ On the origin of languages: Studies in linguistic taxonomy, 183–206. Stanford: Stanford University Press.
__. 1995a. A note on Amerind pronouns. Mother Tongue 25.60–61.
__. 1995b. Proto-Amerind numerals. Anthropological Science (Tokyo) 103.209–25.
__. 2002. An index to Amerind reconstructions, unpublished.
Sapir, Edward. 1913. Wiyot and Yurok, Algonkian languages of California. American Anthropologist 15.617–46.
__. 1923. The Algonkian affinity of Yurok and Wiyot kinship terms. Journal de la Société des Américanistes de Paris 15.36–74.
Sarich, Vincent M. 1994. Occam’s razor and historical linguistics. In honor of William S-Y. Wang: Interdisciplinary studies on language and language change, ed. by Matthew Y. Chen and Ovid J. L. Tzeng, 409–30. Taipei: Pyramid Press.
Starostin, Sergei A. 1991. Altajskaja problema i proisxozhdenie japonskogo jazyka. Moscow: Nauka.
Trombetti, Alfredo. 1905. L’unità d’origine del linguaggio. Bologna: Luigi Beltrami.
Watkins, Calvert. 1990. Etymologies, equations, and comparanda: Types and values, and criteria for judgement. Linguistic Change and Reconstruction Methodology, ed. by Philip Baldi, 289–303. Berlin: Mouton de Gruyter.
11 Modeling Language Evolution
Felipe Cucker, City University of Hong Kong
Steve Smale, University of California at Berkeley and Toyota Technological Institute at Chicago
Ding-Xuan Zhou, City University of Hong Kong
1. Introduction
A purpose of this paper is to understand the evolution of the languages used by the agents of a population. We focus on language features which vary in a continuous manner. In our model a language is a function from a set X of meanings to a set Y of signals, and it belongs to a prespecified class H in which a distance d is defined. In a linguistic population of k agents, a state consists of the set of k languages used by the individual agents. Such a state evolves with time as agents are exposed to meaning-signal pairs produced by other agents in the population and modify their own languages to improve communication with them. We may say that each agent learns from the present state of the population and that the iteration with time of these learning processes forms a learning dynamics. This dynamics depends on the meaning-signal pairs to which individual agents are actually exposed. In our model we will eventually assume that these pairs are randomly drawn from X × Y according to a probability measure in this product space reflecting both the frequency with which different meanings occur in the linguistic setting and the current state. Different forms of noise will also be modeled by this measure.

A key role in this dynamics is played by the strength with which each agent affects other agents’ language evolution through linguistic encounters. This set of mutual influences is modeled by a k × k matrix Γ whose entry γ_ij, a nonnegative real number, measures the impact of agent j on the development of the language of agent i. Thus, convergence to a common language is related to an irreducibility property of Γ, which we call weak irreducibility, and the speed of this convergence to a number associated to Γ. Weak irreducibility ensures the existence of sufficiently many linguistic connections (i.e., nonzero γ_ij’s), thus ruling out the possibility of partitioning the population into two disjoint subgroups which are isolated from one another.

Let Δ_H ⊂ H^k be the diagonal of H^k, i.e., Δ_H = {(f, f, . . . , f) : f ∈ H}. This is the set of states in which agents of the population share a common language. Our main result can be roughly stated as follows (for a precise statement see Theorem 1 below).

Main Result. Let P be a linguistic population whose matrix Γ is weakly irreducible. Let η > 0 and N(Δ_H, η) be the η-neighborhood of Δ_H. Assume that at each iteration the agents of P are exposed to m meaning-signal pairs for a sufficiently large m. Then, with high probability, the learning dynamics converges in a finite number t of steps to a state in N(Δ_H, η). The numbers m and t will depend on η and will become larger as η becomes smaller.
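The informal description of weak irreducibility given above (no partition of the population into two subgroups isolated from one another) can be illustrated with a small check. The following Python sketch rests on an assumption of ours, not on the paper's definition: it treats two agents as linked whenever either one influences the other (γ_ij > 0 or γ_ji > 0) and tests whether everyone can be reached from agent 0 along such links. The precise definition used in Theorem 1 may differ from this reading, and the function name `has_isolated_split` is hypothetical.

```python
from collections import deque


def has_isolated_split(gamma):
    """Return True if the agents can be split into two nonempty groups
    with no nonzero influence between the groups in either direction.

    `gamma` is a k x k list of lists of nonnegative numbers, where
    gamma[i][j] stands for the influence of agent j on agent i.
    Assumption: a link exists when gamma[i][j] > 0 or gamma[j][i] > 0.
    """
    k = len(gamma)
    if k == 0:
        return False
    # Breadth-first search over the undirected "some influence" graph.
    seen = {0}
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(k):
            if j not in seen and (gamma[i][j] > 0 or gamma[j][i] > 0):
                seen.add(j)
                queue.append(j)
    # If some agent was never reached, the reached and unreached agents
    # form two mutually isolated subgroups.
    return len(seen) < k


# A chain of influence 0 <- 1 <- 2 links all three agents.
chain = [[0, 1, 0],
         [0, 0, 1],
         [0, 0, 0]]

# Two pairs of agents that only talk within their own pair.
blocks = [[0, 1, 0, 0],
          [1, 0, 0, 0],
          [0, 0, 0, 1],
          [0, 0, 1, 0]]

print(has_isolated_split(chain))
print(has_isolated_split(blocks))
```

Under this reading, the `blocks` population could never be expected to converge to a single common language, since nothing that happens in one pair is ever observed by the other.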
2. First Examples and the Basic Model

In studying the way languages developed to become the shared communication system they are today, a reasonable simplified
model starts with the representation of a language as a continuous function from a set X of meanings into a set Y of signals (or words). The choice of the spaces X and Y and of the class of functions f will depend on the particular language evolution we are attempting to model. But we will always take X to be a closed and bounded subset of $\mathbb{R}^n$ and $Y = \mathbb{R}^l$ for some n and l.

Example 1 Consider the set of grey colors, i.e., different intensities of grey varying between white and black. We can model this set of meanings by taking X = [0, 1]. Here 0 corresponds to absolute white and 1 to absolute black. In normal speech, we associate absolute black with the word black and absolute white with the word white. Also, most of the grey tones are associated with the word grey. But some dark tones of grey are sometimes described as grey and sometimes as black, even by the same speaker (an ambiguity phenomenon). The frequency of, say, the former decreases with the darkness of the tone. A similar phenomenon happens with the light grey tones.

To model such a 3-word language we may take Y to be $\mathbb{R}^3$ and associate “pure words” white = (1, 0, 0), grey = (0, 1, 0), and black = (0, 0, 1). A possible language in this case is a function $f : [0, 1] \to \mathbb{R}^3$ which maps $[0, u_1]$ to (1, 0, 0), $[u_2, u_3]$ to (0, 1, 0), and $[u_4, 1]$ to (0, 0, 1), for some $0 < u_1 < u_2 < u_3 < u_4 < 1$. Meanings in the intervals $[u_1, u_2]$ and $[u_3, u_4]$ are mapped to points of the form (λ, 1–λ, 0) and (0, λ, 1–λ) respectively, for different values of λ ∈ [0, 1]. For $x \in [u_1, u_2]$, one may interpret the value of λ in f(x) = (λ, 1–λ, 0) as the proportion of times the language uses white for x (the value 1–λ being the proportion of times it uses grey). Similarly for points in $[u_3, u_4]$. This interpretation models the common phenomenon of synonymy, in which different words may be used for a given meaning (see Ke et al., 2002 for more on this). The additional requirement that f is continuous, i.e., that λ varies continuously with the points in $[u_1, u_2]$ and $[u_3, u_4]$, is reasonable.
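The three-word language of Example 1 can be written down concretely. The sketch below is only an illustration: the breakpoints 0.2, 0.4, 0.6, 0.8 and the linear interpolation on the two transition intervals are assumptions made here (the text only asks that λ vary continuously), and the name grey_language is hypothetical.

```python
def grey_language(x, u1=0.2, u2=0.4, u3=0.6, u4=0.8):
    """Map a grey intensity x in [0, 1] to a point of R^3.

    Coordinates are (white, grey, black). The breakpoints u1 < u2 < u3 < u4
    and the linear transitions are illustrative choices, not from the text.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if x <= u1:
        return (1.0, 0.0, 0.0)            # pure "white"
    if x < u2:
        lam = (u2 - x) / (u2 - u1)        # share of "white", falls from 1 to 0
        return (lam, 1.0 - lam, 0.0)
    if x <= u3:
        return (0.0, 1.0, 0.0)            # pure "grey"
    if x < u4:
        lam = (u4 - x) / (u4 - u3)        # share of "grey", falls from 1 to 0
        return (0.0, lam, 1.0 - lam)
    return (0.0, 0.0, 1.0)                # pure "black"


for x in (0.0, 0.3, 0.5, 0.7, 1.0):
    print(x, grey_language(x))
```

Every value returned is a convex combination of the pure words, so its coordinates are nonnegative and sum to 1.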
Example 2 An idealization (e.g., of Ke et al., 2002) in the study of language emergence assumes a situation in which members of a finite population eventually agree in associating the same utterances from a finite set of utterances to a finite set of meanings $\{x_1,\dots,x_r\}$. To model this situation one may take $X = \{x_1,\dots,x_r\}$ and $Y = \mathbb{R}^l$, where l is the number of utterances and the ith utterance is $e_i = (0,\dots,0,1,0,\dots,0)$, with the 1 in the ith place. A convex combination of some vectors $e_i$ may be interpreted in this context, for instance, as some phonetic compromise between these utterances. Other interpretations are also possible (see Knight et al., 2000). Then, a language is a continuous map f : X→Y giving this association.

We now become more formal.

Definition 1 A linguistic setting is a pair $((X, \rho_X), Y)$ where
(1) X is a closed and bounded domain in $\mathbb{R}^n$ and $\rho_X$ is a Borel probability measure on X. The pair $(X, \rho_X)$ is called the space of meanings.
(2) $Y = \mathbb{R}^l$ for some l ≥ 1 is the space of signals.
A language is a continuous function f : X→Y.

The measure $\rho_X$ will be interpreted as the relative frequency with which different meanings occur in the linguistic context at hand. The space of meanings will be thought to have an independent existence. Given two languages f, g : X→Y, the distance between them is given by

$$d(f, g) = \left( \int_X \| f(x) - g(x) \|_Y^2 \, d\rho_X \right)^{1/2}.$$
The distance d(f, g) will be interpreted as inversely related to the communication ability between two agents using f and g, respectively. Thus large communication ability corresponds to small distance. So for an agent using language f, communication with an agent using language g is maximal when g = f, that is, when d(f, g) = 0. But communication depends also on the richness of languages f and g. A space of languages H is a convex set of functions from X to Y.
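The distance between two languages can be approximated numerically. The following is a minimal sketch under two assumptions that are not in the text: X = [0, 1] with the uniform measure as the frequency of meanings, and a midpoint rule in place of the integral. The function names are hypothetical.

```python
import math


def distance(f, g, n_points=10_000):
    """Approximate d(f, g) on X = [0, 1] with uniform rho_X (an assumption).

    f and g return tuples in R^l; the integral of the squared Euclidean
    norm of f(x) - g(x) is replaced by a midpoint rule with n_points nodes.
    """
    total = 0.0
    for r in range(n_points):
        x = (r + 0.5) / n_points
        total += sum((a - b) ** 2 for a, b in zip(f(x), g(x)))
    return math.sqrt(total / n_points)


def identity_language(x):
    return (x,)          # signal equals the meaning, with l = 1


def zero_language(x):
    return (0.0,)        # every meaning gets the same signal


print(distance(identity_language, identity_language))  # identical languages
print(distance(identity_language, zero_language))      # close to sqrt(1/3)
```

The second value approximates the square root of the integral of x² over [0, 1], which is √(1/3).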
By linguistic population we understand a finite set P = {1,...,k} of k agents together with a common space H of languages. To model how languages are adjusted by agents learning from each other we need to take into account how the different agents affect each other’s language evolution. To this end, we introduce, for each pair $(i, j) \in \{1,\dots,k\}^2$, a nonnegative real number $\gamma_{ij}$ measuring the number of linguistic encounters between agents i and j. By “encounters” we mean nonsymmetric, effective encounters, so that $\gamma_{ij}$ measures the impact of agent j on the development of the language of agent i. This may be related to the frequency of physical encounters of these agents, to their social positions, ages, etc. The diagonal elements $\gamma_{ii}$ may be interpreted as an inertia, which could be expected to be small in the case of linguistic immaturity and large for an agent with full language development.

Example 3 Consider a population consisting of a mother M and a baby B. This is an instance of the problem of language acquisition. The assumption that the mother’s language is not affected by the baby’s is described by the equality $\gamma_{MB} = 0$, which yields a matrix Γ of the form (rows and columns ordered M, B)

$$\Gamma = \begin{pmatrix} 1 & 0 \\ 1-\theta & \theta \end{pmatrix}$$

where θ > 0 is small.

Example 4 Consider now a population consisting of two groups of agents, say the inhabitants of two islands, having no direct or indirect contact with each other. If $I = \{1,\dots,n_i\}$ and $J = \{n_i+1,\dots,n_i+n_j\}$ denote the inhabitants of these two islands, the matrix of linguistic encounters has the form

$$\Gamma = \begin{pmatrix} \Gamma_I & 0 \\ 0 & \Gamma_J \end{pmatrix}$$

where $\Gamma_I$ and $\Gamma_J$ are square matrices of dimension $n_i$ and $n_j$ respectively. Note that in this situation we cannot expect that all the agents will eventually speak the same language.
Remark 1 Throughout this paper we assume that $\sum_{j=1}^{k} \gamma_{ij} > 0$ for i = 1,...,k. This excludes the possibility of a completely immature agent (one with inertia zero) which does not receive any impact from the rest of the agents.

A state of a linguistic population is a k-tuple $(f_1,\dots,f_k)$, where $f_i : X \to Y$ is the language of the ith agent. The set $H^k$ of all k-tuples formed by languages from H will be called the state space. Recall that $\Delta_H = \{(f,\dots,f) \in H^k\}$. Given a state $(f_1,\dots,f_k)$, the linguistic context for agent i at that state can be succinctly expressed through the function

$$F_i(x) = \frac{\sum_{j=1}^{k} \gamma_{ij} f_j(x)}{\sum_{j=1}^{k} \gamma_{ij}}.$$

The numerator on the right-hand side is the sum, weighted according to Γ, of the different signals for x in the context at hand. The denominator uniformly scales the functions $F_1,\dots,F_k$, so that each $F_i(x)$ is a weighted mean. Let

$$\lambda_{ij} := \frac{\gamma_{ij}}{\sum_{j=1}^{k} \gamma_{ij}}, \qquad i, j \in \{1,\dots,k\}.$$

We can rewrite the expression for $F_i$ to obtain

$$F_i = \sum_{j=1}^{k} \lambda_{ij} f_j, \qquad i = 1,\dots,k. \qquad (1)$$
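Equation (1) is easy to compute once Γ is given. The sketch below row-normalizes a matrix of encounters into the communication weights and evaluates the linguistic context of one agent. The three-agent matrix, the real-valued languages, and the function names are hypothetical data chosen for illustration.

```python
def communication_matrix(gamma):
    """Row-normalize the encounter matrix Gamma into the weights lambda_ij."""
    lam = []
    for row in gamma:
        if any(g < 0 for g in row):
            raise ValueError("entries of Gamma must be nonnegative")
        total = sum(row)
        if total <= 0:
            raise ValueError("each row of Gamma needs a positive sum (Remark 1)")
        lam.append([g / total for g in row])
    return lam


def linguistic_context(lam, languages, i):
    """Return F_i = sum_j lambda_ij f_j as a function of x (signals in R)."""
    def F_i(x):
        return sum(l_ij * f_j(x) for l_ij, f_j in zip(lam[i], languages))
    return F_i


# Hypothetical data: three agents with real-valued languages on X = [0, 1].
gamma = [[2.0, 1.0, 1.0],
         [0.0, 3.0, 1.0],
         [1.0, 1.0, 2.0]]
languages = [lambda x: 0.0, lambda x: x, lambda x: 1.0]

lam = communication_matrix(gamma)
F_0 = linguistic_context(lam, languages, 0)
print(lam[0])      # weights that agent 0 gives to agents 0, 1, 2
print(F_0(0.5))    # 0.5 * 0 + 0.25 * 0.5 + 0.25 * 1
```

Because each row of weights is nonnegative and sums to 1, the value of F_i at any meaning lies between the smallest and largest signals used for that meaning.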
Remark 2 The way $F_i$ is defined makes use of both scalar multiplication and addition of languages. Considered on their own, these are only formal constructs; they do not have any linguistic interpretation. The way they are used to define $F_i$, however, has a particular structure. The function $F_i$ is a convex combination of the languages $f_1,\dots,f_k$. Thus $F_i$ can be seen as a creole of $f_1,\dots,f_k$, and for x ∈ X, $F_i(x)$ can be interpreted via ambiguity or as some phonetic compromise between $f_1(x),\dots,f_k(x)$ (cf. Example 2). Note also that $F_i \in H$ due to the convexity of H. This is not necessarily true for arbitrary sums (or scalar multiplications) of elements in H. These sums are only guaranteed to be in $L^2_\rho(X)$, showing the contrast between the linguistic nature of H in our model and the purely formal one of $L^2_\rho(X)$.

The matrix $\Lambda := (\lambda_{ij})_{i,j=1}^{k}$, in the sequel called the communication matrix, is an example of a stochastic matrix since $\sum_{j=1}^{k} \lambda_{ij} = 1$ for i = 1,...,k.

Definition 2 A k × k matrix $\Lambda = (\lambda_{ij})_{i,j=1}^{k}$ is said to be a stochastic matrix (or a Markov matrix) if $\lambda_{ij} \ge 0$ for all i, j and

$$\sum_{j=1}^{k} \lambda_{ij} = 1, \qquad \forall i = 1,\dots,k.$$
A main result in the theory of stochastic matrices is the following.

Proposition 1 (Perron-Frobenius) A stochastic matrix Λ has the eigenvalue 1 with the eigenvector (1,...,1). All its other eigenvalues are at most 1 in modulus.

In a stochastic matrix, the other eigenvalues with modulus 1 (if any) play an essential role in studying the limit of the powers $\Lambda^t$. The matrix is said to be weakly irreducible if 1 is a simple eigenvalue and all its other eigenvalues are less than 1 in modulus.
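Weak irreducibility can be probed numerically without computing eigenvalues. For a stochastic matrix, the condition holds exactly when the powers of the matrix converge to a matrix whose rows are all equal, so the sketch below raises the matrix to a high power and measures how far apart its rows are. This is a heuristic check under assumed values of the exponent and tolerance (it can fail when the second eigenvalue is very close to 1 in modulus); the function names and the two test matrices, modeled on Examples 3 and 4, are illustrative.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][s] * b[s][j] for s in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(a, t):
    """Compute the t-th power of a square matrix by repeated squaring."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in a]
    while t > 0:
        if t & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        t >>= 1
    return result


def row_spread(a):
    """Largest difference between two entries lying in the same column."""
    n = len(a)
    return max(max(a[i][j] for i in range(n)) - min(a[i][j] for i in range(n))
               for j in range(n))


def looks_weakly_irreducible(lam, t=4096, tol=1e-9):
    """Heuristic test: the rows of the t-th power of lam agree up to tol."""
    return row_spread(mat_power(lam, t)) < tol


mother_baby = [[1.0, 0.0],
               [0.9, 0.1]]                 # Example 3 with theta = 0.1
two_islands = [[0.5, 0.5, 0.0, 0.0],
               [0.5, 0.5, 0.0, 0.0],
               [0.0, 0.0, 0.5, 0.5],
               [0.0, 0.0, 0.5, 0.5]]       # Example 4, two islands of two agents

print(looks_weakly_irreducible(mother_baby))
print(looks_weakly_irreducible(two_islands))
```

The Mother-Baby matrix passes the check, while the two-island matrix does not, in line with the remark that isolated groups cannot be expected to end up with a common language.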
3. The Learning Dynamics

Let $z = \{(x_1, y_1),\dots,(x_m, y_m)\}$ be a sample of m meaning-signal pairs. If an agent is exposed to this sample he may maximize communication with the population by replacing his language by

$$f_z = \arg\min_{f \in H} \sum_{r=1}^{m} (f(x_r) - y_r)^2.$$

This replacement can be seen as produced by a learning algorithm (e.g., least squares) which attempts, from a finite amount of data (the m-tuple z), to approximate $F_i$. Let now
$f = (f_1,\dots,f_k)$ be a state. During a period of time (which we will take as our time unit) agents communicate with each other and this communication takes the form, for agent i, of a sample $z_i$ of meaning-signal pairs as above. After each agent has applied the learning algorithm, the state $(f_1,\dots,f_k)$ is replaced by a new state $(f_{z_1},\dots,f_{z_k})$. Iteration of the above yields a dynamics with discrete time. This dynamics depends on the sequence specifying, for all $t \in \mathbb{N}$ and 1 ≤ i ≤ k, the m-tuple of meaning-signal pairs to which agent i is exposed at time t. But in practice this sequence is not given a priori. Actually, the tuples at time t (being linguistic exchanges occurring at this time) will have to depend on the state at time t and therefore on the tuples that occurred previously. In our model, we will assume that these m-tuples are random, and this randomness will model that of the linguistic encounters, the frequency of the occurrence of different meanings, and even the possible existence of noise.

Let $f^{(t)} \in H^k$ be the language state after t time units. In what follows we will assume the existence of a constant $M \in \mathbb{R}$ and, for each $i \in \{1,\dots,k\}$ and each $t \in \mathbb{N}$, a probability measure $\rho_i^{(t)}$ on Z = X × Y satisfying:
(i) The marginal measure of $\rho_i^{(t)}$ on X is $\rho_X$.
(ii) The regression function of this measure (i.e., the function defined by $x \mapsto \int_Y y \, d\rho_i^{(t)}(y \mid x)$, where $\rho_i^{(t)}(y \mid x)$ is the conditional (w.r.t. x) probability measure induced by $\rho_i^{(t)}$ on Y) equals $F_i^{(t)} = \sum_{j=1}^{k} \lambda_{ij} f_j^{(t)}$.
(iii) For all f ∈ H, $\| f(x) - y \|_Y \le M$ almost everywhere.

We can now extend the dynamics above to a stochastic dynamics which is defined similarly but where now the m-tuple $z_i^{(t)} = (x_{ir}^{(t)}, y_{ir}^{(t)})_{r=1}^{m}$ is randomly drawn from Z = X × Y according to the measure $\rho_i^{(t)}$. Then, for i = 1,...,k,

$$f_i^{(t+1)} = \arg\min_{f \in H} \sum_{r=1}^{m} \left( f(x_{ir}^{(t)}) - y_{ir}^{(t)} \right)^2.$$
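The stochastic dynamics can be simulated in a toy setting. The sketch below is an illustration under assumptions that go beyond the text: X is a finite set of r meanings with uniform frequency, signals are real numbers, H is the set of all real functions on X, and each observed signal is the current signal of a speaker j chosen with probability λ_ij (so that the regression function is F_i, as in condition (ii)). With this H, least squares reduces to averaging the signals observed for each meaning; keeping the old value for a meaning that was never sampled is a tie-breaking choice made here.

```python
import math
import random


def least_squares_update(old_language, sample):
    """Least squares over all real functions on a finite X.

    At each sampled meaning the minimizer is the mean of the observed signals.
    Unsampled meanings keep their old signal (an assumption, the minimizer is
    not unique there).
    """
    sums, counts = {}, {}
    for x, y in sample:
        sums[x] = sums.get(x, 0.0) + y
        counts[x] = counts.get(x, 0) + 1
    return [sums[x] / counts[x] if x in counts else old_language[x]
            for x in range(len(old_language))]


def step(state, lam, m, rng):
    """One iteration: every agent draws m pairs and refits its language."""
    r = len(state[0])
    agents = list(range(len(state)))
    new_state = []
    for i in agents:
        sample = []
        for _ in range(m):
            x = rng.randrange(r)                          # uniform rho_X
            j = rng.choices(agents, weights=lam[i])[0]    # speaker j w.p. lambda_ij
            sample.append((x, state[j][x]))
        new_state.append(least_squares_update(state[i], sample))
    return new_state


def distance_to_diagonal(state):
    """d(f, Delta_H) for uniform rho_X; the nearest common language is the mean."""
    k, r = len(state), len(state[0])
    mean = [sum(f[x] for f in state) / k for x in range(r)]
    return math.sqrt(sum((f[x] - mean[x]) ** 2 for f in state for x in range(r)) / r)


rng = random.Random(0)
lam = [[0.5, 0.25, 0.25],
       [0.25, 0.5, 0.25],
       [0.25, 0.25, 0.5]]
state = [[float(i)] * 5 for i in range(3)]   # agent i starts with the constant signal i
d0 = distance_to_diagonal(state)
for t in range(30):
    state = step(state, lam, m=300, rng=rng)
print(d0, distance_to_diagonal(state))
```

In this run the distance to the diagonal starts at √2 and shrinks toward zero, which is the qualitative behavior the convergence results below describe.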
Before stating our main result we recall that, if X is a metric space and ε > 0, the covering number $\mathcal{N}(X, \varepsilon)$ is defined as the smallest $A \in \mathbb{N}$ such that there exist A disks of radius ε covering X. Also, note that $L^2_\rho(X)$ induces a metric in $H^k$ by

$$d(f, g) = \| f - g \|_{(L^2_\rho(X))^k} = \left( \sum_{r=1}^{k} \| f_r - g_r \|^2_{L^2_\rho(X)} \right)^{1/2}.$$

From this metric the distance from a state f to the diagonal $\Delta_H$ is defined by

$$d(f, \Delta_H) = \inf_{g \in \Delta_H} d(f, g).$$
Theorem 1 Let Λ be weakly irreducible with eigenvalues $1 = \alpha_1, \alpha_2, \dots, \alpha_k$. Then, for any $1 > \alpha_* > \max_{i=2,\dots,k} |\alpha_i|$ there exists a constant $C_{\Lambda,\alpha_*} > 0$ such that for each 0 < δ < 1, ε > 0, $t \in \mathbb{N}$, and

$$m \ge \frac{288 k M^2}{(1-\alpha_*)^2 \varepsilon^2} \left( \log t + \log\left( k\, \mathcal{N}\left( H, \frac{(1-\alpha_*)^2 \varepsilon^2}{24 k M} \right) \right) + \log \frac{1}{\delta} \right)$$

there holds

$$d(f^{(t)}, \Delta_H) \le C_{\Lambda,\alpha_*} \left( \varepsilon + d(f^{(0)}, \Delta_H)\, \alpha_*^t \right)$$

with confidence at least 1 − δ.

We are next interested in studying convergence properties of the stochastic dynamics for varying values of m. To clearly state these properties we will add a subscript [m] to the states so that $f^{(t)}_{[m]}$ denotes the state after t steps when the number of examples at each step in the dynamics is m.

Corollary 1 For any 1 > δ > 0 there exists an increasing function $m(t) : \mathbb{N} \to \mathbb{N}$ with $\lim_{t\to\infty} m(t) = \infty$ such that, with confidence 1 − δ,

$$\lim_{t\to\infty} d\left( f^{(t)}_{[m(t)]}, \Delta_H \right) = 0.$$
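To get a feeling for the sample size required by Theorem 1, the condition on m can be evaluated numerically. The sketch below reads that condition as m ≥ 288kM²/((1−α*)²ε²) · (log t + log(k·N(H, η)) + log(1/δ)) with η = (1−α*)²ε²/(24kM), and it replaces the covering number by an assumed bound ln N(H, η) ≤ C_H·η^(−ξ) of the kind used later in the chapter. The function name and the numerical values are hypothetical.

```python
import math


def sample_size_bound(t, k, M, alpha_star, eps, delta, C_H, xi):
    """Evaluate the right-hand side of the condition on m in Theorem 1.

    Assumes ln N(H, eta) <= C_H * eta**(-xi); since the covering number is
    replaced by an upper bound, the returned value is sufficient, not minimal.
    """
    gap = (1.0 - alpha_star) ** 2 * eps ** 2
    eta = gap / (24.0 * k * M)                 # radius inside the covering number
    log_covering = C_H * eta ** (-xi)          # assumed bound on ln N(H, eta)
    bracket = math.log(t) + math.log(k) + log_covering + math.log(1.0 / delta)
    return math.ceil(288.0 * k * M ** 2 / gap * bracket)


# Hypothetical parameters: 3 agents, bound M = 1, alpha_star = 0.5.
for eps in (0.2, 0.1, 0.05):
    print(eps, sample_size_bound(t=10, k=3, M=1.0, alpha_star=0.5,
                                 eps=eps, delta=0.05, C_H=1.0, xi=0.5))
```

The printed values grow quickly as ε shrinks, which illustrates why the number of examples per step must increase with the desired accuracy.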
By taking $\varepsilon = d(f^{(0)}, \Delta_H)\, \alpha_*^t$ and δ = 1/t in Theorem 1 we obtain that

$$\mathrm{Prob}\left\{ d\left( f^{(t)}_{[m(t)]}, \Delta_H \right) \ge 2 C_{\Lambda,\alpha_*}\, d(f^{(0)}, \Delta_H)\, \alpha_*^t \right\} \le \frac{1}{t}.$$

This result takes a particularly sharp form when the class H satisfies, for some ξ > 0, $\ln \mathcal{N}(H, \varepsilon) = O(\varepsilon^{-\xi})$ (a common feature in several of the classes H considered in learning theory (cf. Remark 3(i) below)).
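Corollary 2 Assume that, for some ξ > 0, $\ln \mathcal{N}(H, \varepsilon) = O(\varepsilon^{-\xi})$. Then for

$$m = O\left( \left( (1-\alpha_*)^{-2} \alpha_*^{-2t} \right) \left( \log t + (1-\alpha_*)^{-2\xi} \alpha_*^{-2t\xi} \right) \right)$$

we have

$$\lim_{t\to\infty} \mathrm{Prob}\left\{ d\left( f^{(t)}_{[m(t)]}, \Delta_H \right) \ge 2 C_{\Lambda,\alpha_*}\, d(f^{(0)}, \Delta_H)\, \alpha_*^t \right\} = 0.$$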
The proof of Theorem 1 is beyond the scope of this chapter. We only point out here that it relies on tools from two different subjects: stochastic matrices (see, e.g., Seneta, 1973) and learning theory (Cucker and Smale, 2002; Haussler, 1992; Niyogi, 1998; Vapnik, 1998).

Remark 3 (i) The choice of the hypothesis space in learning theory applications is an important issue. A common strategy consists of first choosing a linear space $\mathbb{H}$ of continuous functions endowed with a norm $\|\cdot\|_{\mathbb{H}}$ and then choosing a closed ball of radius R (w.r.t. $\|\cdot\|_{\mathbb{H}}$). The space H is this ball but endowed with the topology induced by the norm $\|\cdot\|_\infty$. For this strategy to work it is necessary that closed balls w.r.t. $\|\cdot\|_{\mathbb{H}}$ have a compact closure w.r.t. $\|\cdot\|_\infty$. Choices of $\mathbb{H}$ for which this happens are spaces of polynomials of bounded degree, say d, Sobolev spaces $H^s$ with s > n/2, and Reproducing Kernel Hilbert Spaces arising from a $C^\infty$ Mercer kernel (for details of these spaces see Cucker and Smale, 2002). In the last two cases we have the following bounds for the logarithm of covering numbers:

$$\ln \mathcal{N}(H, \varepsilon) \le \left( \frac{R C_s}{\varepsilon} \right)^{n/s} + 1, \qquad \text{and} \qquad \ln \mathcal{N}(H, \varepsilon) \le \left( \frac{R C_h}{\varepsilon} \right)^{2n/h},$$

where h is any number such that h > n, and $C_s$ and $C_h$ are constants independent of ε and R. In the first case, we have

$$\ln \mathcal{N}(H, \varepsilon) \le N \ln \left( \frac{R C_h}{\varepsilon} \right),$$

where N is the dimension of the space of polynomials, i.e., $\binom{n+d}{n}$.
(ii) Fixing a hypothesis space imposes a framework, present at the origin of any linguistic process, which bounds the set of possible linguistic choices. In this sense, it plays a role akin to the universal grammar of Chomsky.
4. Language-Systems, Idiolects, and Language Drift

The use of the word “language” in the expressions “the English language” and “John’s language” is not the same. Saussure used the words langue and parole to distinguish between them. In English, one may use the term language-system to denote the former.¹ We have already used the term “language” to denote the latter and made its definition a key piece in our model. We now give a definition for the Saussurean “langue.”

¹ What Saussure called a “langue” is any particular language that is the common possession of all the members of a given language community (i.e. of all those who are acknowledged to speak the same language). We will introduce the term language-system in place of it. A language-system is a social phenomenon, or institution, which of itself is purely abstract, in that it has no physical existence, but which is actualized on particular occasions in the language-behavior of individual members of the language community. (Lyons, 1981: 10).

Definition 3 We say that a linguistic population P = {1,...,k} shares a language-system when $d((f_1,\dots,f_k), \Delta_H) \le \tau_0$. Here $\tau_0$ is a constant, which may depend on the modeled situation. Agents of a linguistic population sharing a language-system are said to speak an idiolect of it.²

The aim of this section is to apply Theorem 1 to model language emergence, language change and language learning. All through this section we assume that the space H satisfies, for some $C_H, \xi > 0$, $\ln \mathcal{N}(H, \varepsilon) \le C_H \varepsilon^{-\xi}$.
4.1 Language emergence
Our goal is to show that, under certain conditions, with high probability, a language-system emerges in finite time. Before proceeding we note that when modeling the evolution of a linguistic context we may assume that the number of meaning-signal pairs to which an agent is exposed at each iteration is a certain fixed M. (This M, the number of examples per iteration, plays the role of m in Theorem 1 and is distinct from the bound M of condition (iii).) Note that this number is certainly bounded since an iteration corresponds to a finite amount of time. Thus, starting with an initial state $f^{(0)}$, at each iteration the k agents are exposed to M examples. These examples are used by the agents to update their languages according to the learning dynamics, thus obtaining a sequence of states $\{f^{(t)}\}_{t \in \mathbb{N}}$.
² “In the last resort, we should have to admit that everyone has his own individual dialect: that he has his own idiolect, as linguists put it. Every idiolect will differ from every other, certainly in vocabulary and pronunciation and perhaps also, to a smaller degree, in grammar.” (Lyons, 1981: 26–27).
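Theorem 2 Assume Λ is weakly irreducible. Then, starting from a state $f^{(0)} \in H^k$, the population P reaches a language-system in at most T iterations with probability at least 1 − δ, where

$$T = T(\Lambda, f^{(0)}) = \frac{\ln\left( 2 C_{\Lambda,\alpha_*}\, d(f^{(0)}, \Delta_H) \right) - \ln \tau_0}{|\ln \alpha_*|}$$

and

$$\delta \le C_H\, T k \left( \frac{(1-\alpha_*)^2\, d(f^{(0)}, \Delta_H)^2\, \alpha_*^{2T}}{24 k M} \right)^{-\xi} \exp\left( - \frac{M (1-\alpha_*)^2\, d(f^{(0)}, \Delta_H)^2\, \alpha_*^{2T}}{288 k M^2} \right).$$

Here the factor M in the numerator of the exponent is the number of examples per iteration, while M in the two denominators is the bound of condition (iii).

Proof. Take m = M and $\varepsilon = d(f^{(0)}, \Delta_H)\, \alpha_*^t$ in Theorem 1, bound $d(f^{(t)}, \Delta_H)$ by $\tau_0$, and solve for t and then for δ.

Remark 4 If $\alpha_*$ is small, $T(\Lambda, f^{(0)})$ will be small and convergence to a language-system will occur quickly with high probability. The smaller $\alpha_*$ is, the faster this convergence.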
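The dependence of T on α* described in Remark 4 can be tabulated directly from the formula of Theorem 2. In the sketch below the constant C, the initial distance d0, and the threshold tau0 are hypothetical inputs (the theorem only asserts that a suitable constant exists), and the result is rounded up to a whole number of iterations.

```python
import math


def iterations_to_language_system(C, d0, tau0, alpha_star):
    """T = (ln(2*C*d0) - ln(tau0)) / |ln(alpha_star)|, rounded up, at least 0."""
    if not 0.0 < alpha_star < 1.0:
        raise ValueError("alpha_star must lie in (0, 1)")
    T = (math.log(2.0 * C * d0) - math.log(tau0)) / abs(math.log(alpha_star))
    return max(0, math.ceil(T))


# Hypothetical values: C = 2, initial distance 1, threshold 0.01.
for alpha_star in (0.9, 0.5, 0.1):
    print(alpha_star,
          iterations_to_language_system(C=2.0, d0=1.0, tau0=0.01, alpha_star=alpha_star))
```

With these inputs the number of iterations drops sharply as α* decreases, in agreement with Remark 4.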
4.2 Language change
We will now be interested in the property that a population P will continue to share a language-system assuming it is already sharing one.

Definition 4 Let the stability confidence $\delta_*$ be the infimum of the $\delta \in \mathbb{R}$ such that, if

$$d(f, \Delta_H) \le \tau_0$$

then, when agents in P are provided with M meaning-signal pairs, with probability at least 1 − δ,

$$d(f^N, \Delta_H) \le \tau_0.$$

Here $f^N$ is the state resulting from applying one step of the learning dynamics to f.
Remark 5 The magnitude of $\delta_*$ is a measure of the stability of a population sharing a language-system. Note that if H, M, and k are fixed then $\delta_*$ depends only on Λ. If $\delta_*$ is small then population P will likely continue to share a language-system. Yet, this language-system will change over time. We call this phenomenon language drift.³

³ The use of the word drift is taken from genetics where it has a meaning akin to the one in our exposition. We note, however, that in linguistics the word is in general used differently. “The word ‘drift’ as used in evolutionary biology has a somewhat different, in fact almost opposite, meaning to the same word as used in physics or in linguistics . . . , but the same as in archeology . . . . In biology the term . . . expresses the changes due to random sampling processes in populations that are of finite size.” (Cavalli-Sforza and Feldman, 1981)

Theorem 3 Assume Λ is weakly irreducible and $\alpha_* C_{\Lambda,\alpha_*} < 1$. Then

$$\delta_* \le C_H\, k \left( \frac{(1-\alpha_*)^2\, \tau_0^2 \left( \frac{1}{C_{\Lambda,\alpha_*}} - \alpha_* \right)^2}{24 k M} \right)^{-\xi} \exp\left( - \frac{M (1-\alpha_*)^2\, \tau_0^2 \left( \frac{1}{C_{\Lambda,\alpha_*}} - \alpha_* \right)^2}{288 k M^2} \right).$$

Here the factor M in the numerator of the exponent is the number of examples per iteration, while M in the two denominators is the bound of condition (iii).

Proof. Take $\varepsilon = \tau_0 \left( \frac{1}{C_{\Lambda,\alpha_*}} - \alpha_* \right)$ and t = 1 in Theorem 1.

Remark 6 (i) Again, if $\alpha_*$ is small then $\delta_*$ is small also and P continues to share a language-system with high probability.
(ii) In Theorem 3 we required $\alpha_* C_{\Lambda,\alpha_*} < 1$. If this does not happen, we cannot take

$$\varepsilon = \tau_0 \left( \frac{1}{C_{\Lambda,\alpha_*}} - \alpha_* \right) > 0.$$

This is already a sign of instability. In this case, however, one may consider

$$t_* = \min\left\{ t \ge 1 \mid \alpha_*^t\, C_{\Lambda,\alpha_*} < 1 \right\}$$

and the corresponding $\delta_*$. Clearly, the larger $t_*$, the less stable is the population with respect to sharing a language-system.
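The quantity t* of Remark 6(ii) is the first power of α* that is small enough to offset the constant. A minimal sketch of its computation follows; the function name, the cap on the number of steps, and the sample inputs are assumptions made for illustration.

```python
def first_contracting_step(alpha_star, C, max_steps=10_000):
    """Smallest t >= 1 with alpha_star**t * C < 1, or None if none is found."""
    for t in range(1, max_steps + 1):
        if alpha_star ** t * C < 1.0:
            return t
    return None


# Hypothetical pairs (alpha_star, C).
for alpha_star, C in ((0.5, 1.5), (0.5, 3.0), (0.9, 2.618)):
    print(alpha_star, C, first_contracting_step(alpha_star, C))
```

When the product of α* and the constant is already below 1 the function returns 1, which is the situation covered by Theorem 3.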
4.3 Language acquisition
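We close this section by deriving a result on language learning. An idealized situation of language learning is that of Example 3. The linguistic population is composed only of a mother and a baby, and the matrix Λ is given by

$$\Lambda = \begin{pmatrix} 1 & 0 \\ 1-\theta & \theta \end{pmatrix}$$

where θ > 0 is small. Our results readily apply to this case.

Proposition 2 The pair Mother-Baby in Example 3 reaches a language-system in at most T(θ) iterations with probability at least 1 − δ, where

$$T(\theta) = \frac{\ln\left( \sqrt{2}\,(3+\sqrt{5})\, M \right) - \ln \tau_0}{|\ln \theta|} \approx \frac{2 + \ln M - \ln \tau_0}{|\ln \theta|}$$

and

$$\delta \le 2 C_H\, T(\theta) \left( \frac{(1-\theta)^2 M\, \theta^{2T(\theta)}}{24} \right)^{-\xi} \exp\left( - \frac{M (1-\theta)^2\, \theta^{2T(\theta)}}{288} \right).$$

Here M in the first factor and in T(θ) is the bound of condition (iii), while M in the exponent is the number of examples per iteration.

Proof. Let $f_M$ and $f_B$ be the original languages of the mother and the baby respectively. Then

$$\| f_B(x) - f_M(x) \|_Y \le \| f_B(x) - y \|_Y + \| f_M(x) - y \|_Y \le 2M$$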
almost everywhere. Therefore the distance from any of $f_M$ or $f_B$ to their mean $\bar{f} = (f_M + f_B)/2$ is at most M and it follows that $d((f_M, f_B), \Delta_H) \le \sqrt{2}\, M$. In addition, it can be seen that

$$C_{\Lambda,\alpha_*} = \frac{3+\sqrt{5}}{2}.$$

Now apply Theorem 2 with k = 2, $\alpha_* = \theta$ and $d(f^{(0)}, \Delta_H) = \sqrt{2}\, M$ to obtain the result.

Remark 7 Note that the numerator in the bound

$$\frac{2 + \ln M - \ln \tau_0}{|\ln \theta|}$$

is common to all Mother-Baby pairs. The speed with which the baby learns the mother’s language (an accomplishment which we recognize when $d((f_B^{(t)}, f_M^{(t)}), \Delta_H) \le \tau_0$) depends then only on the denominator $|\ln \theta|$, and the bound decreases with θ: the smaller θ is, the faster the learning. Variations in θ (due to differences in the innate ability of the baby, frequency of linguistic encounters with the mother, etc.) explain the variations in children’s learning speed.
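The role of θ in Remark 7 can be seen in the noise-free limit of the Mother-Baby dynamics. In that limit (an idealization in which the number of examples is unbounded, so that each agent learns its linguistic context exactly) the mother's language stays fixed and the baby's language becomes (1−θ)·f_M + θ·f_B, so the gap between the two languages is multiplied by θ at every step. The sketch below counts the steps needed for the gap to fall below a threshold; the function name and the numerical values are hypothetical, and the gap between the two languages is used as a simple proxy for the distance to the diagonal.

```python
def steps_until_shared(theta, initial_gap, tau0, max_steps=100_000):
    """Iterate gap -> theta * gap and count steps until gap <= tau0.

    Noise-free idealization of the Mother-Baby pair: the mother's language is
    fixed and the baby moves to (1 - theta) * f_M + theta * f_B at each step.
    """
    gap, steps = initial_gap, 0
    while gap > tau0 and steps < max_steps:
        gap *= theta
        steps += 1
    return steps


for theta in (0.5, 0.2, 0.05):
    print(theta, steps_until_shared(theta, initial_gap=1.0, tau0=0.01))
```

Smaller values of θ give fewer steps, matching the logarithmic dependence on θ in the bound for T(θ).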
5. Fitness Maximization

Given a state $(f_1,\dots,f_k)$ the language $F_i = \sum_{j \le k} \lambda_{ij} f_j$ was introduced as a succinct expression for the linguistic context for agent i at that state. In this section we present another way to characterize $F_i$. Define the linguistic fitness of language f for agent i at the state $(f_1,\dots,f_k)$ by

$$\Phi_i(f) = - \int_X \left( \sum_{j=1}^{k} \gamma_{ij} \| f(x) - f_j(x) \|_Y^2 \right) d\rho_X(x).$$

Fitness can be thought to measure the ability of agent i to communicate with members of the population he encounters when he uses the language f (note that the ith term in the sum above reflects an inertia acting on agent i). This motivates the problem of, at a given state, finding the language f ∈ H that maximizes the linguistic fitness, that is, computing

$$F_i^* = \arg\max_{f \in H} \Phi_i(f), \qquad i = 1,\dots,k.$$

Proposition 3 For i = 1,...,k, if $\sum_{j=1}^{k} \gamma_{ij} > 0$ then $F_i^*$ exists and is unique on $(X, \rho_X)$. In addition,

$$F_i^*(x) = F_i(x) = \sum_{j=1}^{k} \lambda_{ij} f_j(x).$$

Remark 8 (i) Proposition 3 justifies considering $F_i$ not only as a succinct expression for the linguistic context for agent i at a given state but also as the language which maximizes a fitness function. The latter is akin to a “survival of the fittest” viewpoint.
(ii) We say that the measure $\rho_X$ is non-degenerate when, for all nonempty open subsets U ⊂ X, $\rho_X(U) > 0$. Non-degeneracy is a mild assumption; if $\rho_X$ is degenerate one can replace $(X, \rho_X)$ by $(\tilde{X}, \tilde{\rho}_X)$ such that $\tilde{X} \subset X$, $\tilde{\rho}_X$ is non-degenerate and $\rho_X(\tilde{X}) = \tilde{\rho}_X(\tilde{X}) = 1$. We note now that if $\rho_X$ is non-degenerate then, in Proposition 3, $F_i^*$ is unique since it is continuous.
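The reason the weighted mean maximizes fitness can be seen from a pointwise identity. The derivation below is a sketch and not the authors' proof; it assumes that the norm on Y is the Euclidean one (so that an inner product is available) and introduces the shorthand $S_i = \sum_{j=1}^{k} \gamma_{ij}$, which does not appear in the text.

```latex
\begin{aligned}
\sum_{j=1}^{k} \gamma_{ij} \, \| f(x) - f_j(x) \|_Y^2
  &= \sum_{j=1}^{k} \gamma_{ij} \, \big\| \big(f(x) - F_i(x)\big) + \big(F_i(x) - f_j(x)\big) \big\|_Y^2 \\
  &= S_i \, \| f(x) - F_i(x) \|_Y^2
     + 2 \Big\langle f(x) - F_i(x), \; \sum_{j=1}^{k} \gamma_{ij} \big(F_i(x) - f_j(x)\big) \Big\rangle
     + \sum_{j=1}^{k} \gamma_{ij} \, \| F_i(x) - f_j(x) \|_Y^2 \\
  &= S_i \, \| f(x) - F_i(x) \|_Y^2 + \sum_{j=1}^{k} \gamma_{ij} \, \| F_i(x) - f_j(x) \|_Y^2 ,
\end{aligned}
\qquad \text{since } \sum_{j=1}^{k} \gamma_{ij} \big(F_i(x) - f_j(x)\big) = S_i F_i(x) - S_i F_i(x) = 0 .
```

Integrating against $\rho_X$ gives $\Phi_i(f) = \Phi_i(F_i) - S_i \int_X \| f(x) - F_i(x) \|_Y^2 \, d\rho_X(x)$. Since $S_i > 0$, the fitness is largest exactly when f agrees with $F_i$ almost everywhere, and $F_i$ belongs to H because H is convex.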
Acknowledgments

This work has been substantially funded by a grant from the Research Grants Council of the Hong Kong SAR (project number CityU 1002/99P). Also, the second named author expresses his appreciation to City University of Hong Kong for its support.
References

Cavalli-Sforza, L. L. and Feldman, M. W. (1981) Cultural Transmission and Evolution. Princeton University Press.
Cucker, F. and Smale, S. (2002) On the mathematical foundations of learning. Bulletin Amer. Math. Soc., 39:1–49.
Haussler, D. (1992) Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78–150.
Ke, J., Minett, J. W., Au, C. P., and Wang, W. S-Y. (2002) Self-organization and natural selection in the emergence of vocabulary. Complexity, 7.3:41–54.
Knight, C., Studdert-Kennedy, M., and Hurford, J. R. (2000) The Evolutionary Emergence of Language: Social Function and the Origins of Linguistic Form. Cambridge University Press.
Lyons, J. (1981) Language and Linguistics: An Introduction. Cambridge University Press.
Niyogi, P. (1998) The Informational Complexity of Learning. Kluwer Academic Publishers.
Seneta, E. (1973) Non-Negative Matrices. John Wiley & Sons.
Vapnik, V. (1998) Statistical Learning Theory. John Wiley & Sons.
Part 4 Language and Complexity
12 Language and Complexity Murray Gell-Mann* Santa Fe Institute
1. Complexity
When we talk about complexity, we need many different concepts in order to capture all the various notions of what is meant by complexity or its opposite, simplicity. Some researchers, for example, study the mathematical theory of what is called computational complexity. This refers more or less to the minimum number of steps, or the minimum length of time, that it takes some standard universal computer, U, to solve a certain problem. The question is: as the size of the problem increases toward infinity, what happens to the amount of time necessary to solve it? Is this time polynomial in the size of the problem, exponential, or otherwise? However, if you ask a person in the street what is complex, this definition will not usually be what that person means. The kind of complexity that best captures what is meant in ordinary conversation, and in most scientific discourse as well, can be termed effective complexity. Effective complexity is often defined, rather crudely, as the length of a very concise description of the regularities of the entity in question, not the features treated as random or incidental, because complexity does not mean randomness.

* This chapter is based on a public lecture, “From the Simple to the Complex,” that Professor Gell-Mann gave at the City University of Hong Kong on May 2, 2002. We are grateful to Professor Gell-Mann for his further discussion of that lecture during the Third Workshop on Language Acquisition, Change and Emergence.

In order to satisfy this definition, we require a distinction between the regular and the random. This is usually context dependent, and sometimes even subjective. Consider music and static on the radio, an example of signal and noise, with which we deal very often in science: the music is presumably the signal and the static is noise. The static would be treated as random or incidental, the music as regular. However, that is not true in all contexts. In the 1930s, Jansky and Bailey were looking for the origin of static. They found that a great deal of static came from a particular constellation in the sky. That type of static was regular in very important ways and founded a whole new science: radio astronomy.

There is in principle a mathematical way of distinguishing the regular from the random in an absolute fashion, although for most problems it admits too few regularities, as almost everything is treated as random. For most practical purposes, the distinction between the regular and the random depends on the existence of some kind of judge of what is and what is not important. The judge need not be human.

Today, with the widespread use of computers, many problems can be expressed in terms of strings of 0s and 1s. Any entity can be represented by such a string, called a bit string. To do so, four components must first be specified:
1. The coarse-graining, i.e., the level of detail at which the entity is treated.
2. The language for describing the entity (obviously the length of the description depends somewhat on the language).
3. The knowledge and understanding of the world that is assumed.
4. A system of coding from the language to the bit string.
The need for the third component can be understood better by considering the example of an anthropologist who has lived in an Indian village in the Amazon for a couple of years, has learned the language, and now is joining the group to visit another village which has previously been uncontacted but which speaks the same language. The anthropologist would be able to communicate with these people, but if he were to attempt to explain to them, say, a tax-managed mutual fund, their knowledge and understanding of the world would obviously determine how long the explanation would have to be.
1.1 Effective complexity
In 1965, three different authors working independently defined the algorithmic information content (AIC) of a bit string or the entity described by the bit string as the length of the shortest program that would cause a given universal computer, U, to print out the bit string and then halt. This is a way of making formal the idea that AIC is the length of a very brief description of some entity. I would modify the definition of algorithmic information content by setting a maximum time, T, or a maximum number of steps for the universal computer U to have the program print out the result and halt. The effective complexity of an entity can then be stated as the algorithmic information content of the regularities of the entity, as opposed to the algorithmic information content of the features that are treated as random or incidental. This definition is made more rigorous by splitting the AIC of a bit string, K, into two terms: the AIC of the regularities, i.e. the effective complexity, Y, and the AIC of the random features, I: K = Y + I.
For example, consider a perfectly regular bit string, e.g., 111111111 . . . . The regularities of such a perfectly regular string have very little algorithmic information content — the string consists of just 1s — so its effective complexity is low. Now consider
a bit string that has no regularities, i.e., an incompressible string. The shortest description of an incompressible string is just to state the string itself. An incompressible string has the largest possible algorithmic information content for its length. However, the effective complexity is again very small because there are no regularities except the length. We conclude then that if we plot algorithmic information content between zero and the maximum for the length, with order at one end and complete disorder at the other, as shown in Figure 1, it is only in the middle that high effective complexity occurs. The effective complexity of both a perfectly orderly and a perfectly disorderly entity is small.

Figure 1 Effective Complexity as a function of Algorithmic Information Content (horizontal axis: AIC, running from order to disorder; vertical axis: effective complexity).
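The contrast between a perfectly regular string and a string without visible regularities can be illustrated with an ordinary compressor. This is only a rough analogy: the length of a compressed file is a crude upper estimate of algorithmic information content, not the quantity itself, and the "random" string below is produced by a seeded pseudo-random generator, so its true algorithmic information content is in fact small even though the compressor finds nothing to exploit. The function name is hypothetical.

```python
import random
import zlib


def compressed_length(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a crude stand-in for AIC)."""
    return len(zlib.compress(data, 9))


n = 10_000
rng = random.Random(0)
regular = b"1" * n                                      # perfectly regular string
noisy = bytes(rng.getrandbits(8) for _ in range(n))     # pseudo-random string

print(compressed_length(regular), compressed_length(noisy))
```

The regular string shrinks to a few dozen bytes, while the noisy one stays at roughly its original length. Neither number measures effective complexity, which concerns only the description of the regularities and is small at both extremes.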
Although the mathematical apparatus just sketched is actually not all that useful in many cases, it does help to strengthen the notion of effective complexity as the length of a very brief description of the regularities. It is helpful when a rigorous presentation of complexity is required (although the distinction between what is regular and what is random is generally context dependent). Recall that the definition of effective complexity given above
requires that there be a judge to distinguish the regular from the random. Social scientists have been doing this for a long time. In archaeology, for example, there is a rather clear imposition of a set of rules on what is important and what is unimportant. Items found during an archaeological dig may allow a social scientist to draw conclusions regarding the social structure of the society being investigated, i.e. the social complexity, which has to do with the number of different roles and professions within a society, the ranking of people, if there is a ranking, and so on. Clearly these are the regularities: when social scientists talk about social complexity they mean exactly the length of a very concise description of those things that are regularities to them. This turns out to agree very well with the definition of effective complexity given above.
1.2 Potential complexity
Another aspect of complexity that is very important is potential complexity. Consider the game Go. Go is extremely simple; you can explain its rules in a minute. Chess is also rather simple; maybe it takes ten minutes to explain the rules of chess. But both games are complex too because they can lead to situations that are very complex. The way one arrives at any situation is by a sequence of moves. These moves are parameters with probabilities, forming a chain of probabilities. There can be a great deal of complexity in terms of those moves: consider, for example, the many books that have been written on the beginning game, the middle game, and the end game of chess. This potential complexity comes from the additional frozen accidents, events that, by chance, give rise to new regularities that increase the effective complexity. When considering the complexity of Go, or for that matter the complexity of language, one should include the potential complexity. The same is true when you look at biological evolution. If you consider the great apes, you notice that four or five million years ago one of these great apes became more destructive and much more of a nuisance than the others, and that led to us: Homo sapiens.
That particular great ape looked much like the others; but if you were to average over the probability chains in the future and consider the complexity of what results, you would find a much bigger complexity for that particular great ape because cultural complexity lay in the future, at least for certain paths. That cultural complexity is immense, much greater than for chimpanzees, which also have culture, but not so much. Potential complexity is an extremely important quantity which one has to distinguish from the complexity of the entity itself.
1.3 The arrow of time
The fundamental laws of physics which govern the behavior of the universe and everything in it seem to be simple. There are really just two laws of nature: One is the unified quantum theory of all the elementary particles and their interactions. The other is the initial condition of the universe, some thirteen billion years ago or so. Both of these are thought to be relatively simple, so complexity does not come from these basic laws. It may seem very unimportant for daily life that there was a particular initial condition for the universe thirteen billion years ago but that is absolutely wrong. The simple initial condition of the universe is responsible for the main arrow of time in the second law of thermodynamics which states that the average disorder in a closed system has the tendency to increase. You know this if you have children and you have both peanut butter and jelly in the kitchen. After a while, you will notice that there is jelly in the peanut butter jar and peanut butter in the jelly jar. As time goes on there will be more and more — if the children do not finish off these substances, eventually you would have equal amounts of peanut butter and jelly in both jars. The simple initial condition of the universe is what is ultimately responsible for this arrow of time, which is still pointing strongly forward nearly everywhere in the universe. I now turn to discuss the possibility that linguistic arrows of time might allow us to deduce something about the phylogenetic
emergence of language. I begin by discussing our current understanding of the origins of human language and the long-range relationships that have been identified among the world’s languages.
2. The Origins of Human Language
Although there were doubtless earlier forms of human communication, perhaps somewhat more complex than those of the other great apes, at a certain point there came along fully modern language with all its apparatus of grammar, vocabulary and so forth, which characterize all known languages that are the first languages of somebody now or in the past. Suppose there was a single ancestral language spoken from which all present day languages are descended — can we recover properties of that language? Is the system of modern languages so young that there are still features that characterize language across the world which we can trace back? Or, on the contrary, is language so old that there is hardly any trace left of what it was like at the beginning? There are some of us who believe that human language is a recent phenomenon that dates back to the time when Homo sapiens sapiens, which had not been around for very long, underwent remarkable behavioral development in the Upper Paleolithic, developing painting skills, engraving, sculpture, dance, very much better made tools, and so on. Homo sapiens neanderthalensis apparently did not engage in all these activities. In a slightly earlier period, you find so-called Mousterian tools made both by Homo sapiens sapiens and by Homo sapiens neanderthalensis. But Homo sapiens sapiens became completely different in its culture. Around thirty thousand years ago, the Neanderthal people died out. It is possible that modern language goes back to the time of this cultural explosion, something like forty thousand years ago. Is that young enough that features of the ancestral language can be recovered?
In 1989, I co-chaired a meeting in Stanford with Jack Hawkins titled “The Arrow of Time and Founder Effects in Linguistics”. We were looking for arrows of time, or unidirectional processes as they are termed in linguistics, particularly those unidirectional processes that might reveal something about the initial situation, assuming there was a recoverable initial situation. Supposing there is a single ancestral language, you can ask whether it has left any traces or whether it has become mixed up in the intervening time to such a degree that little about it can be deduced. Of course, the hope of people who look for these correlations is that there is plenty left. The question then is where would you look for it? The usual place to look for traces of a single ancestral language is in very conservative lexical items, like words for “sun”, “moon”, “water”. Although rules of grammar and certain phonological features are generally supposed to be much less conservative than lexical items, as well as being more subject to contamination by horizontal transmission, it is possible nevertheless that some such features may in fact exhibit the same kind of conservatism and general preference for vertical over horizontal transmission that these conservative lexical items present. One possibility is word order, which I will discuss later in Section 3. Another is the apparent tendency for phonological complexity to be reduced, as I discuss in Section 4.
2.1 Long range classification
The late Joseph Greenberg, whose work was celebrated in Stanford in 2001, was one of the great pioneers of long range research. The aim of his research was to group together recognized language families, such as Indo-European, into much older units, called super-families. Greenberg discovered many super-families, showing crudely the existence of etymologies. He left to future workers the task of finding more rigorous evidence for these families, more rigorous etymologies and sound correspondences — all the apparatus that we have for the recognized families.
Most professional historical linguists, especially in the USA, refuse to take seriously the work of so-called “long rangers”, except in Africa where they lost the battle some years ago. They consider long range research unscientific: if the time depth is greater than six or seven thousand years, they say, it is hard to be as rigorous as they would like to be in reconstructing the sound system, the sound correspondences, and so on. While they may be perfectly right in asking for greater rigor, rejecting the idea of long range relationships out of hand is obviously wrong, for the following reason. If it were correct that there is some upper bound to the time depth beyond which genetic relationship cannot be identified, then the evidence for recognized families such as Indo-European, Uralic and Sino-Tibetan would be marginal. But that is not true: the evidence for these families is overwhelming. There are dozens, perhaps hundreds, of etymologies that could be considered, so it is absurd to say that the evidence for the recognized families is marginal. Even conservative linguists agree that the evidence for the recognized language families is very strong. The idea that the evidence should suddenly drop to zero as you go to greater time depth could not possibly be true. What is true is that the research should be as scientific as possible, and some of the long rangers are trying to do that. With this aim in mind, we have initiated a project based at the Santa Fe Institute with a group of people who are trying to assemble a huge public database. The group is using reconstructed forms and sound systems for proto-languages wherever available to attempt to reconstruct forms and sound systems for the proto-languages of the super-families, and to see if there is a viable argument for proto-Sapiens, or as some people call it proto-World, the hypothetical ancestral language of all the languages that we know about.
Figure 2 maps out the proposed language super-families of the world as viewed by Greenberg. The super-family Eurasiatic includes the language families Indo-European and Uralic, the latter of which includes Finnish, Estonian, Hungarian, Lappish, Mordvin, Komi, Mari, and also the Samoyedic languages. Also included is Altaic, which includes
Turkic, Mongolic, Tungusic and so on, as well as Korean, Japanese and Ainu, although the inclusion of these three languages in Altaic is disputed by some linguists.1 Eurasiatic also includes Chukchi-Kamchatkan and Eskimo-Aleut, spoken in part of Siberia, throughout the northern tier of North America and over into Greenland. The time depth of the ancestral language of this super-family might be as great as nine or ten thousand years. Eurasiatic may also include Kartvelian or Dravidian.
Figure 2 The language families and super-families proposed by Joseph Greenberg. [Map labels: Khoisan, Dravidian, Austric, Niger-Kordofanian, Kartvelian, Indo-Pacific, Nilo-Saharan, Eurasiatic, Australian, Afro-Asiatic, Dene-Caucasian, Amerind.]
Another proposed super-family is Dene-Caucasian, which includes Sino-Tibetan, of which Chinese is one set of languages; the North Caucasian languages of the Northern Caucasus, including Chechen (Basque is very likely a member of this family); and Yeniseian, spoken in northern Siberia along the Yenisei river. Only one of the Yeniseian languages survives today but there are materials on several others that are now extinct. Then there are the Na-Dene languages of North America, including the Apachean languages such as Navajo, as well as Tlingit, Eyak, and so on. It looks very much as if this super-family is a valid taxon, which would go back something like ten thousand years. Afro-Asiatic includes Semitic, Ancient Egyptian, Berber and so on. Nilo-Saharan and Niger-Kordofanian are the two big families of black Africa. Then there is Khoisan, spoken mostly by the San, or Bushmen, and the Khoikhoi, also called the Hottentots. There are also two Khoisan languages in Tanzania. Then there is Austric, which includes languages of Southeast Asia and the Pacific islands. Greenberg includes all the American languages that are neither Na-Dene nor Eskimo-Aleut in a single, hypothetical super-family, Amerind. Finally, there are Indo-Pacific, which includes languages of the interior of New Guinea, Tasmania and a number of other islands nearby including the Andaman Islands, and Australian. These super-families, which cover the world, may in turn form bigger units, and maybe finally one single unit if you go back far enough to a single ancestral language. There is some evidence for world etymologies indicating descent from a single proto-Sapiens, but this evidence needs to be greatly improved with sound correspondences, skeptical treatment of the semantic and phonological shifts involved, and so on. This tracing back to an ultimate proto-language for all existing modern languages is coupled with the idea that the ancestral language, assuming it was there, was contemporary with the behavioral evolution in the Upper Paleolithic, forty thousand years ago.
1
Eds.: Ruhlen comments on this and other disputes concerning linguistic classification and historical linguistics in Part III.
Although forty thousand years is quite a long time, it is much shorter than the millions of years that people used to talk about. How likely is it that evidence for a single ancestral language would remain after such a period of time? A very rough method that has sometimes been used to attempt language classification is known
as glottochronology. In glottochronology, you specify a list of basic meanings that are known to be conservative and identify the main words for these meanings in languages that are hypothesized to be related. You then look at the proportion of related words among the whole set of meanings shared by each pair of languages. In the original rule of thumb of glottochronology, a fourteen percent replacement of words in each language per thousand years was assumed. If that were true, however, very few ancestral words would remain after forty thousand years. So there would seem to be little possibility of finding evidence for a single proto-Sapiens. However, multilateral comparison of languages across various subgroups may make it easier to detect remnants of a proto-Sapiens. As they have come to learn more about ancient languages, reconstructed languages, and so on, linguists have progressed from using just basic vocabulary with preserved meanings to using roots. For example, in language families that are known very well, linguists are able to use knowledge of the sound system and the processes of producing new words to study roots and combinations of roots with various particles. The problem then becomes completely different quantitatively from looking just at the basic vocabulary. Working with only basic vocabulary, a hundred words or so, and insisting that words retain their meanings, will not get us back very far, even using multilateral comparison. To discover traces of human language that existed something like forty thousand years ago, I think there is no alternative but to work with roots. While some linguists may disagree with the search for large super-families that go back to great time depth, not all regard it as a heresy. Now I do not necessarily believe that these notions are correct.
What I do think is a good idea, though, is to do much more research on these notions of large, deep super-families and possibly a single ancestral language, or maybe two, and to obtain funding for such an effort spread out over the world. The idea is to develop big databases on as many languages as possible, particularly on reconstructed proto-languages for existing, generally acknowledged language families and their subgroups, including also the work that has been done toward reconstructing the proto-languages of larger
units. Such research actually is going on. For example, Chris Ehret has attempted to apply the comparative method to a much deeper taxon than usual, Nilo-Saharan, a very deep family going back significantly more than ten thousand years. Sergei Starostin at the Santa Fe Institute is trying to do this with Eurasiatic and with Dene-Caucasian. He is also concerned that the accumulated database be public. There is this curious habit that some scholars have of preparing great works in linguistics and concealing them. We should try to develop a public database, with the notation as uniform as possible and with as much information as possible, so that people can try to carry out these tasks of reconstruction, check on these claimed super-families and check on the evidence that Ruhlen,2 Bengtson and others have presented toward such widespread etymologies that suggest monogenesis. Until these etymologies are demonstrated to be genuine, the search for a single ancestral proto-Sapiens will certainly continue to be controversial. But there is some evidence; why not pursue it and see if it works?
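The glottochronological rule of thumb discussed earlier in this section can be turned into a short calculation. The sketch below assumes, purely for illustration, that the fourteen percent replacement per thousand years applies at a constant rate to every word, and that two sister languages replace words independently of each other. The function names are hypothetical.

```python
def retained_fraction(years: float, replacement_per_millennium: float = 0.14) -> float:
    """Fraction of an original word list still present in one language
    after `years`, assuming a constant replacement rate per millennium."""
    return (1.0 - replacement_per_millennium) ** (years / 1000.0)


def shared_fraction(years: float, replacement_per_millennium: float = 0.14) -> float:
    """Fraction of the list retained by BOTH of two languages that split
    `years` ago, assuming (for illustration) independent replacement."""
    return retained_fraction(years, replacement_per_millennium) ** 2


for years in (1_000, 6_000, 10_000, 40_000):
    print(
        f"{years:>6} years: retained {retained_fraction(years):.4f}, "
        f"shared by a pair {shared_fraction(years):.6f}"
    )
```

Under these assumptions a single language keeps roughly a quarter of one percent of the original list after forty thousand years, that is, well under one word out of a hundred. This is the quantitative reason the text turns from basic vocabulary to roots and multilateral comparison.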
3. Word Order — A Linguistic Arrow of Time?
In 2001, I presented some evidence for the idea that word order in sentences, also studied by Greenberg in his series of brilliant papers, might be sufficiently conservative that we can trace it largely through vertical transmission, and that the patterns of word order in the world’s languages may have a different model from what most people have assumed. Linguists generally use very conservative lexical items as a way of establishing so-called genetic relationships among languages, but I think this is only because they believe that
2
Eds.: See the chapter by Ruhlen in Part III for a discussion of his position on the patterns observed for first and second person singular pronouns both among Amerind languages and among Eurasiatic languages (Section 3), and the t’ina/t’ana/t’una pattern observed for kinship terms among Amerind languages (Section 4).
they are conservative. If any linguistic feature shows a strong correlation with vertical transmission in preference to horizontal transmission, then that can be used in an attempt to establish genetic relationships. Perhaps word order is sufficiently conservative to be used. In typology, one often looks first at the subject, verb and object order. The most common type is SOV as in, for example, Japanese. This is head-final, with the verb at the end. There is a strong correlation with having the genitive before the noun (G-N), postpositions and various other such properties which are also head-final. Jack Hawkins among others has shown how this type of correlation is very favorable for efficiency and communication. What seems to happen is that when there is a transition in one of these characters, it drags along the other characters to some extent so that after a long time a consistent type emerges. The opposite type, V1, has the verb at the beginning, which is very strongly correlated with the opposite states of these characters: N-G instead of G-N, prepositions (PR) instead of postpositions (PO), and so on. The adjective-noun order is less reliable. Intermediate to SOV and V1 is SVO, as in English. SVO is fairly well correlated with the same characters — N-G and PR — but the correlations are not so strong as for SOV and V1. In other words, consistently head-final and consistently head-initial languages are more favored. Most people have assumed that so many changes have occurred in word order since the emergence of language that there now exists roughly an equilibrium situation. I worked for a time on mathematical models of equilibrium for word order types, looking at the transition probability from one word order to another and balancing them out against the numbers of languages with different word orders. In that way, you would find out what proportion there should be of different word orders in different languages. 
But I have lost faith in that approach and have become much more interested in the idea that word order follows an arrow of time and that there are a few small changes that have taken place since the beginning of modern language. We are now proposing that in fact what we see is the relic of the original word order modified here
and there in the world by just a few changes, perhaps up to three or four. We are assuming a young system where we see the original type — the most common one, SOV — along with all the other head-final characteristics. Other word orders are the results of a few moves from that. These moves come from mutation, changes with certain probabilities of changing from one type to another, possibly away from head-final toward head-initial. The transmission is in large part vertical — this is not usually assumed — so we expect high correlation with language families and super-families. Although there certainly is some horizontal transmission, I believe this just to be a perturbation rather than dominant, as is usually assumed. Take the Dene-Caucasian super-family, presumably including Basque, the North Caucasian languages, including Burushaski, Na-Dene in the New World, Yeniseian, and Sino-Tibetan. Here we have a very strong predominance of what may be the original type, SOV, with postpositions and genitive preceding the noun. But of course this is the most common type, so we have to be careful. Basque is SOV, as are most of the North Caucasian languages. There are a few Northeast Caucasian languages that have SVO, which represents one step away from SOV. But it is interesting that they have PO and G-N, tell-tale signs that the word order derived from SOV by mutation or possibly borrowing. Yeniseian is SOV. Burushaski, ancient Hurrian and Urartian, which are thought by Starostin to be Caucasian languages, are also SOV. The Na-Dene languages are all SOV. Sino-Tibetan is mostly SOV. The exceptions are the Chinese and Karen languages, which have undergone significant contact with neighbors and have possibly borrowed some SVO. So the model works extremely well for Dene-Caucasian. This correlation cannot be attributed to areal effects because these languages are spread out throughout the world (although presumably their proto-languages were neighbors twelve thousand years ago). 
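The contrast drawn above between an equilibrium picture and a young system can be sketched with a toy Markov chain over the three types SOV, SVO and V1. Everything numerical here is hypothetical: the transition probabilities are invented for illustration and are not estimates from any language sample, and the function names are assumptions. The point is only that after a few steps the distribution still remembers its starting state, whereas after many steps it does not.

```python
STATES = ("SOV", "SVO", "V1")

# Hypothetical per-period transition probabilities (each row sums to 1).
# Row i gives the probabilities of moving from STATES[i] to each state.
P = [
    [0.90, 0.10, 0.00],  # SOV -> SOV, SVO, V1
    [0.02, 0.93, 0.05],  # SVO -> SOV, SVO, V1
    [0.00, 0.05, 0.95],  # V1  -> SOV, SVO, V1
]


def step(dist, matrix):
    """Advance a probability distribution over STATES by one period."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def evolve(dist, matrix, periods):
    """Apply `step` repeatedly, `periods` times."""
    for _ in range(periods):
        dist = step(dist, matrix)
    return dist


all_sov = [1.0, 0.0, 0.0]
all_v1 = [0.0, 0.0, 1.0]

# Young system: only a few periods have elapsed since an all-SOV start.
young_from_sov = evolve(all_sov, P, 3)

# Old system: so many periods that the starting state no longer matters.
old_from_sov = evolve(all_sov, P, 2000)
old_from_v1 = evolve(all_v1, P, 2000)

for label, dist in [
    ("young, from SOV", young_from_sov),
    ("old, from SOV", old_from_sov),
    ("old, from V1", old_from_v1),
]:
    shares = ", ".join(f"{s}={p:.3f}" for s, p in zip(STATES, dist))
    print(f"{label:>16}: {shares}")
```

In the young case SOV still dominates because it was the starting type; in the old case the two runs converge on the same proportions whatever the start, which is the equilibrium assumption that the proposal above sets aside.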
Eurasiatic, including Eskimo-Aleut, Gilyak and Altaic, is also very strongly SOV. The only exception in Altaic is Gagauz, a Turkic language spoken in Romania that has picked up SVO from Romanian. Uralic is mostly SOV — those Uralic languages that have
undergone extensive contact in Europe, such as Finnish, Estonian and Hungarian, have changed due to contact. Chukchi-Kamchatkan has some SVO, which could be due to either mutation or borrowing. The only exception is Indo-European. Indo-European, of course, is also deviant in other ways. For example, it is the only family in Eurasiatic that has grammatical gender. It also has many exceptions to SOV in various branches. In Celtic, for example, the word order has even changed as far as V1, with the verb at the beginning. But this is also consistent with the model, representing a significant amount of correlation.3 It is difficult to know otherwise why there should be such correlation with these proposed genetic super-families. Afro-Asiatic is extremely interesting. Chris Ehret has performed a cladistic analysis for Afro-Asiatic. He finds that Omotic diverged first and then Cushitic split from the rest. Chadic then separated off from the North Afro-Asiatic languages: Semitic, Egyptian and Berber. What we see is suggestive of the model: Chadic has moved one step to SVO. In the northern group we see that word order has changed all the way to V1. We see this also in some ancient Semitic languages: Akkadian is not completely V1, but ancient biblical Hebrew is. Ancient Egyptian has the verb at the beginning, as do the Berber languages. So it looks as if one step was taken by Chadic and then another step taken by North Afro-Asiatic languages. There has been a mutation, or more likely borrowing, back to SVO in modern Israeli Hebrew. For the so-called Niger-Kordofanian group, Greenberg grouped together Niger-Congo in the largely western section of black Africa and Kordofanian in the oases in the Sahara not occupied by Nilo-Saharan. Some linguists today, however, think that maybe one should classify Nilo-Saharan, Niger-Congo and Kordofanian as three subgroups of one super-family called Congo-Saharan. In either
3
Some include Kartvelian and Dravidian in Eurasiatic. This is a remnant of the Nostratic movement. It does not alter the picture here since both are SOV.
case, these languages are mostly SVO. Here there has been a change according to the model, with only a few languages retaining the old order, SOV: a few Kordofanian languages, a few Gur languages, a few Kwa languages and all the Mande languages. All the other languages have changed to SVO or, in the case of a couple of Ubangian languages, all the way to V1. Austric, proposed by Schmidt many years ago, combines Austro-Asiatic, Miao-Yao, Daic, and Austronesian. Very little SOV remains. Many languages have SVO, so many that you might imagine that the mutation took place at the very beginning of Austric. In the case of Austronesian, including Polynesian, some languages have changed all the way to V1. This is a case of major deviation, but it is quite systematic and correlates with the proposed super-family to a reasonable degree. It is sometimes very difficult to determine whether certain word order changes have taken place spontaneously or by contact. There are many examples of languages becoming SOV through heavy contact with other languages that are SOV. That has happened to Munda, for example, in India where there are many languages with SOV. Also, the Austronesian languages that are SOV all came into contact with Papuan languages. As another example, Celtic shifted from verb-final directly to verb-initial, apparently not via SVO. The question is whether that was a spontaneous change or, as has been argued by Orin Gensler, was due to contact with speakers of Northern Afro-Asiatic. When the horizontal changes are fairly obvious, it is easy to eliminate them. We are then dealing with a system that is mainly mutation and vertical transmission. If the mutation is relatively rare, then we can actually work back to the original situation. While we have yet to test the significance of the correlation of word order with the proposed super-families, the evidence does appear to be strong.
3.1 Complexification and simplification
Let us now consider changes in the phonological complexity of language, that is, the number of different sounds that are
distinguished in various languages. Obviously, if you have only one or two vowel phonemes in a language then you must have many consonants. Otherwise, there are not enough sounds to make all the words. In Chinese, even with the tones, the phonological inventory is so small, especially in Mandarin, that you can barely find enough sounds to make all the words, resulting in a considerable amount of ambiguity. When you consider the way Japanese borrowed Chinese words, without the tones, and conflated many syllables into a smaller number, the ambiguity is tremendous; you find Japanese people drawing pictures in sand and so on while talking. There is a certain minimum sound treasury that is required in order to communicate, but the way this works out can be very different from one place to another. In the Caucasus, you find some languages having only one or two vowel phonemes but a huge number of consonants. In the Austronesian family, however, you find Hawaiian, which has very few consonants. Hawaiian has five vowels with two lengths, short and long, and the vowels can be put together to make diphthongs. The total sound resources in Hawaiian are therefore quite small compared to those in the Caucasus, so you have a loss of phonological complexity in the geographically isolated Hawaii. Is this a general phenomenon? It may be that when human beings populated a part of the world where there had not been human beings before, there was a tendency on average for phonological complexity to be reduced. If true, this would be an example of an arrow of time. Merritt Ruhlen claims to have observed that in the Americas, as people populated first North, then Central, and then South America with Amerind languages, the languages toward the end of this track tended to have a much simpler set of sounds than those at the beginning. We must examine that claim to see whether we can propose some set of general rules about phonological complexity.
It is certainly true that certain difficult sounds seem to have been reduced in frequency across the world. For example, it looks, from diachronic studies, as if glottalized consonants have been dropped in many languages. One way to drop them is to split off the glottal
stop from the consonant by inserting a vowel in between, and there is some evidence that that has happened in a certain number of places. There of course are plenty of glottalized consonants left — for example, in the Americas, the Indians near Santa Fe retain glottalized consonants — but they seem to be less common than in the past. Laryngeals also seem to have been lost over the years in many places. Another possible situation concerns the clicks in the Khoisan languages of South Africa and Tanzania. The San have five clicks, which they combine with a number of consonants to make about forty-five different sounds. Neighboring Bantu languages like isiZulu and isiXhosa have borrowed three of the clicks but without the consonants — they have combined the clicks with one or two consonants of their own. Except for that slight borrowing, clicks do not exist elsewhere in the world. Does that mean that there is some fundamental division between Khoisan and the rest of the world’s languages? Perhaps clicks were originally part of the sound apparatus for human language that were preserved only in this one family and died out everywhere else. If Ruhlen’s claim about loss of phonetic complexity is correct, it would be interesting because then language would have started out being more phonologically complex. We know of some other features of phonetics that have increased in complexity. For example, there are frequent instances of complexity derived from palatalization. For example, we know that palatalization is a relatively recent development in Slavic and one which, as far as we can tell, was not triggered by contact with some other language having such rich palatalization. Another example is the development of nasalized vowels in many languages, including some Romance languages, which developed from sequences of vowels plus nasal, which led to the vowels being nasalized and the loss of the nasal. We know of a lot of unidirectional processes in sound change.4
4
Such sound changes may not be one hundred percent unidirectional, but are close thereto.
The sound change of word-initial /p/ to /f/ to /h/ to nothing, for example, occurs frequently. The loss of word-initial /p/ in Celtic presumably occurred in this way. We can see in fact some of the transitional stages, such as in the language of the Hercynian Forest in Europe where the original word-initial /p/ became an /h/ that was eventually lost. However, there are some inscriptions in Spain where the Celtic /p/ is replaced by /h/ rather than by nothing. Very rarely do we observe the sound change nothing to /h/ to /f/ to /p/. We might expect then that /p/ would have disappeared over the last forty thousand years, and that there would no longer be any words beginning with /p/. Of course there are ways of restoring word-initial /p/. For example, in the P-Celtic languages, /kw/ changed to /p/, as also happened in Greek and in other languages. Another way that word-initial /p/ could be recovered is for both the initial /p/ and the subsequent vowel to go to zero, the next consonant, by chance, being a /p/. Although quite strikingly unidirectional, this process would not be evident without some deep study of the language involved. This may be true of a number of unidirectional processes which are also hidden in this way. An important question relates to phenomena which are difficult to develop but, once developed, are easily lost. One example where we may actually have some evidence, not from phonology but from morphology, is in languages that have a morphological dual. English, for example, has only singular and plural, “cat” / “cats”, but a number of languages have a special form, the dual, which is often obligatory when referring to exactly two items. We know of many languages, attested in the historical record, that have lost the dual, for example many of the Indo-European, Uralic and Afro-Asiatic languages. However, there are very few languages for which we can state explicitly from the historical record that they have recently acquired the dual.
One possible explanation for this would be that it takes a long time to develop the dual, but that a language can lose the dual far more quickly. This means that if there are phenomena that it is harder to develop than it is to lose, then we will end up with skewing in the synchronic evidence or in the evidence going back only to a shallow time depth. It would be worth investigating the rates of complexification and simplification. The
ratio may be very different for different phenomena. For example, it might take significantly more time for pharyngeals to be developed than tones or nasalized vowels. It is difficult to determine the direction of the arrow of time. Of course, we cannot simply state that the complex invariably becomes simple, or the other way around (although maybe some of the early nineteenth century Indo-Europeanists believed you could). But with certain phenomena we probably can talk about an arrow — it is just that the arrow relates to the processes rather than to the relationships between states. So if we observe that a certain language is tonal while a related language is not tonal, we cannot identify which is the archaic state. But if, for example, one language has tones and no opposition between voiced and voiceless consonants while the other has no tones but does have an opposition between voiced and voiceless consonants, then the second language is probably more archaic in that respect than the first. Cantonese, for example, has two registers which come from the voiced and voiceless consonant opposition. There may be some general rules of loss of complexity in language, particularly phonological complexity. In any case, I think that the hypothesis that human language is a very young system for which we can still detect some of the facts of its original state is one that should inspire a great deal of interesting research. Although this hypothesis may turn out to be false, I think in many cases it will serve to instigate a lot of very important investigations and may well turn out to be true.
13 Language Acquisition as a Complex Adaptive System John H. Holland University of Michigan
1. Introduction
My objective in this chapter is to describe an agent-based model that I believe is relevant to both language acquisition and language evolution. The model is an exploratory device designed for computer simulation, so it is more than descriptive, and it may be susceptible to mathematical analysis. If the model performs as expected, it will provide an existence proof that grammars can be acquired through ordinary cognitive mechanisms, despite the so-called “poverty of stimulus” barrier. In presenting this model, there are three concepts derived from the study of complex adaptive systems (cas) that have an important role:

1. The first concept concerns signaling between interacting agents. If we look at signaling in a biological cell, or in the immune system, we see that the signals need not have any intrinsic meaning or semantics. Signaling of this kind has been studied for some time with the help of rule-based, message-passing systems called classifier systems (cfs) (Holland, 2000). A cfs is designed for use with a genetic algorithm (ga) so that it lends itself to evolutionary studies as well.

2. The second concept concerns default hierarchies (Holland et al., 1986). Rule-based systems can be organized into levels of generality, going from rules of great generality called defaults to highly specialized rules called exceptions. The basic idea is that, at any time, the system calls upon the most specialized rule relevant to its current situation. To say it another way, in the face of a dearth of information, the system’s fallback is a default rule. A default rule offers an advantage over random action, but it is often wrong. There is one special kind of default rule that we will encounter later, called a bridging rule. A bridging rule sends its message over an interval of time, keeping the system focused. For example, a bridging rule may keep emitting the message “I’m hungry,” “I’m hungry,” “I’m hungry,” ..., staying active until the hunger ceases.

3. The third concept is that of a trigger. Triggers are a way of generating new rules in a cfs. Triggers are activated in specific circumstances, like the circumstances leading to imprinting in birds. In the model discussed here triggers are carefully constrained to depend only on general cognitive mechanisms; they are not language specific. Triggers must not be invoked often, otherwise the system becomes cluttered with useless or redundant rules.

In the model of language acquisition proposed here, the system learns initially by constructing a small set of default rules, then by adding exception rules as it accumulates experience. This is an efficient way to learn in complex situations: By definition a very general rule fits a lot of situations, so it is tested frequently, giving a good measure of its average usefulness. Exception rules use more information and are sampled less frequently, so they are added only as more experience accumulates.
It is easy to show that the resulting structure is much more compact than a system built entirely at the most detailed exception level.
In short, the proposed cfs model treats language acquisition as the progressive construction of a default hierarchy. A version of this model has been implemented in Mathematica to investigate the evolution of a signaling system in a biological cell (Holland, 2001). The model started with a single, simple default rule, in order to examine the evolution of complex signaling networks. The model was simple but it did demonstrate an increasingly complex, environment-relevant use of signals, without producing a clutter of useless signals. The object of this presentation is to provide an overview of the proposed model, referring to extant papers for details. The exposition begins with a description of the role of models (Section 2) followed by a description of complex adaptive systems (Section 3) and the properties that make them relevant to language acquisition (Section 4). This leads to a discussion of adaptive agents (Section 5) and building blocks for adaptive agents (Section 6). These tools and concepts can be used to build an agent-based model for studying language acquisition (Section 7). I will conclude with a discussion of what I think can be gained via this approach (Section 8).
2. Models
Models have three broad, quite distinct purposes, often confused or misunderstood in practice. The most familiar models are data driven: They are meant to generate outputs that mimic, and predict, data collected through experiment and observation. A weather prediction model is a data-driven model. A second kind of model is an existence-proof model which has quite different objectives. An existence-proof model shows that something is possible in principle. One of the best-known existence-proof models is von Neumann’s demonstration in the late 1940s of a machine that can reproduce itself. Most philosophers and scientists in the 1930’s took self-reproduction as a defining characteristic of life, so the model became an important counterexample. The model was not intended to mimic living organisms; it was only meant to show that you could
design a self-reproducing machine. A third kind of model is an exploratory model. In this case, you simply put together a set of mechanisms to see what happens, much as one might experiment with a Lego kit. No attempt is made to imitate data or provide an existence proof; the objective is just to explore the possibilities inherent in some set of mechanisms.

Models are central to planning and, as we’ll see, planning is central to maneuvering successfully in a cas. To appreciate the role of models in planning, consider the design of a new airplane. We are long past the day when the inventor tinkered with a design, hoping that a test pilot would prove out the result. Now we model the design on a computer, and test it virtually before we ever commit the design to actual hardware. Because of increasing complexity, engineering of the twenty-first century more and more depends upon our ability to construct models that allow virtual testing.

Note that a computer-based model, of the kind just described, and a theory have much the same purpose: they both tell us where to look. Consider Einstein’s theory of relativity at the time it was proposed in the early part of the twentieth century. It was a very odd theory at that time, predicting that light will bend when it passes close to a massive body, such as a star. How do you test that? Astronomers waited for an eclipse of the Sun so that they could observe a star along a line of sight close to the Sun. As the Sun moved closer and closer to the star’s position, the star was observed to be displaced slightly because of the bending of its light. This was the first real confirmation of the theory of relativity. Who would bother to look for the position of a star during an eclipse without the theory’s suggestion of displacement? In short, a theory makes predictions about the consequences of certain actions. If the predictions come true then we have a validation of the theory; if not, then we consider the reasons for the failure.
Models and theories are alike in this respect. Moreover, the computer-based model is more rigorous than the usual mathematical theory. We never put down all the steps of a mathematical proof. Instead we adopt conventions; otherwise even the simplest mathematical proofs would go on for pages, as is seen
in the classical treatise by Russell and Whitehead (1910) where every step is given. Conventions and shortcuts allow a proof with reasonable compactness. However, if we leave out even a single step in a computer program the result is garbage.
3. Complex Adaptive Systems
Interacting agents defined along the lines presented in the introduction act as a complex adaptive system (cas). Such systems have unique, subtle properties. Because these properties are relevant to language acquisition and evolution, I will give a brief exposition of cas at this point. A complex adaptive system consists of a set of interacting individuals, called agents, which adapt or learn as they interact. A market provides a familiar example of a cas. In a market people (agents) buy and sell goods, signaling each other about prices. They adapt to changing market conditions by altering their techniques for buying and selling. We come at once to a major observation: A cas continually changes and evolves, instead of hovering near an equilibrium or fixed point. While economists often make the assumption that markets clear, arriving at a stable set of prices, in fact we know that this rarely happens. Attempts to describe markets in terms of equilibria tell us little about what happens in a real market. Another example points up a different, but equally important characteristic of cas. Consider the delivery of food to stores and markets in, say, Hong Kong. The inventory of food in a large city is typically a 3- or 4-week supply. Contrast this with the inventory of a major industry, such as the automotive industry in Detroit. Even under just-in-time inventory controls, implemented by a large planning group headed by a vice president for “scheduling”, the inventory is typically a 3-month supply. Where is the central planning commission that enables Hong Kong to do so much better in inventory control? There is none! Somehow control is distributed over many agents with no central organization for scheduling the
interactions. Such distributed control is typical of cas. As we will see, distributed control is suggestive for the language acquisition process. If you question an economist about distributed control, the first reply you will get is that the efficiency results from competition. If you ask exactly how competition achieves such excellent results, you are unlikely to get a precise answer. This lack of a precise answer points up a central mystery about cas: Time and again cas operate in modes that are highly efficient and hard to imitate with a rational design that uses a central executive. We know much less than we should about such distributed control. If we translate this discussion to language acquisition, this emergent efficiency suggests that more is going on than is suggested by the usual “poverty of stimulus” arguments. A wide range of systems, from ecosystems to immune systems, can be described as cas. Because of unusual properties, such as distributed control, many of our most powerful tools — regression analysis and least mean squares approximations, sampling and polling, basins of attraction for partial differential equations, fixed points, and the like — are inadequate for studying cas. Much of this difficulty can be traced to conditional IF/THEN interactions between cas agents: The behavior of the whole cas is more than the sum of the actions of its parts. Because our familiar tools depend upon additivity (linearity), they are of little avail in trying to develop an understanding of cas. Even when linear techniques are applicable, the results have little bearing on the aspects of cas behavior that are of most interest. Consider, for instance, questions about sustainability or loss of species in an ecosystem. Such questions concern the changing “transient” behavior of the system; they are not well-answered by the properties of optima, fixed points, equilibria, and the like. Indeed for most cas we do not even have a definition of an optimal configuration. 
What is the optimal organization for an ecosystem? Moreover, the changes in a cas caused by perturbations — say the introduction of some exotic species in an established ecosystem — are usually quite rapid. Though we usually think of evolution as
a “slow” millennial process, we need only introduce some chemical, like DDT, or an antibiotic, to see evolutionary adaptation take place in a matter of years or, even, months. Our well-being as humans depends upon the rapidity with which our immune system adapts to newly encountered bacteria. So it is with most cas. We must understand the underlying mechanisms that yield these rapid “transients” if we are to have any chance of understanding, and controlling, cas. When we look at language acquisition in cas terms, it is the rapidity of these adaptations that leads us to take a closer look at the “poverty of stimulus” argument.
4. Properties of Complex Adaptive Systems
What are some properties of cas that are different from properties of more familiar systems? First of all, there are many different ways — different niches — in which agents can exist in the world. My favorite example is the niche occupied by the bee orchid: The petals of the bee orchid fold so they look like a bee sitting in the center of the flower. The likeness is so good that male bees attempt to copulate with the flower, being covered with the flower’s pollen in the process. Obviously this is a very specialized niche, wherein the flower’s continued existence depends upon its ability to imitate the bee. Examples of this kind abound in ecosystems. As a further example, tropical rainforests grow on the poorest soil in the world — heavy rainfall leaches all the nutrients into the rivers — yet rainforests have an unusually rich diversity of biological forms. There are many ways to recycle the nutrients before they reach the forest floor and this results in a great variety of niches. As in the case of language, great diversity is employed to exploit possibilities offered by the environment. A second property is that innovation is a regular feature of cas. In an ecosystem, new species and new activities arise regularly over evolutionary time. In our current economy, inventions and new products are a central source of revenue — things as simple as new video games can create cash flows of tens of millions of dollars.
Compare this to the rate of change during the Middle Ages when one could wait a hundred years for an innovation; nowadays, six months seems like a long time. Again, regular innovation is a well-recognized feature of language.

A third property is anticipation. The agents in a cas often anticipate future events. In an equity market, for instance, an agent will anticipate future activity. If that anticipation is that the market will go up, the agent will act accordingly and, even if the anticipation is not borne out, it still affects the market. We are most familiar with such anticipation in board games where, in a game like chess or Go, anticipation is often the difference between winning the game and losing the game. But even a simple bacterium like E. coli shows anticipation in swimming up a sugar gradient to reach a nutrient source. We will see that anticipations also have a central role in language acquisition.

In a cas there is always a tradeoff between exploration and exploitation. To see this tradeoff in a particular context, consider a twenty-first century firm dealing with innovation. The firm can emphasize research, but it will soon face bankruptcy if it never builds a product. Or the firm can commit all of its assets to production, but then its products will soon become obsolete because of the high innovation rate, and bankruptcy will again follow. Between these two extremes, there is a middle ground combining exploration and exploitation. How do we find that middle ground when dealing with cas? Because we currently have only fragments of a cas theory, finding this middle ground is largely an art form, laced with the anticipation that comes from planning. That observation leads to the next question: How does one build plans when confronted with a cas?
5. Adaptive Agents
The first step in searching for adaptive mechanisms relevant to language acquisition is to take a closer look at the notion of an adaptive agent. In language acquisition, as in all cas, these
mechanisms are embodied in the adaptive agents that are the components of the system. In the present model, I will designate the strategy of an agent by a set of rules that changes as the agent adapts. The particular formulation I will use is a rule-based, message-passing system called a classifier system (cfs) (Holland, 2001). It should be emphasized that these models are exploratory, so the interest centers more on mechanisms than data. If the models are fully successful they will serve as existence proofs. In more detail, each agent consists of a set of IF/THEN rules that process messages:

IF (a message of type m1 is present AND a message of type m2 is present)
THEN (produce a message m).

When the rule’s IF conditions are satisfied it is said to be active. The classic notion of conditioning in psychology provides us with a simple example of an IF/THEN rule, a stimulus-response rule. Consider the following rule:

IF there is a small flying object, a “fly”, in the left of the visual field
THEN send a message that causes the head to turn 15° to the left.

Notice that even this simple rule produces a rather sophisticated sequence of actions: If the “fly” is still to the left after the rule acts, then the rule will still be satisfied and it will cause a turn of an additional 15°. Thus, the rule will continue to act until the “fly” is centered in the visual field, at which point the condition of the rule is no longer satisfied. It is typical of cfs rule systems that simple rules can produce sophisticated actions, particularly when the rules interact via interior messages. Indeed, with a little care, classifier systems can be made computationally complete: any algorithm that can be written for a general-purpose computer can also be implemented by a cfs. Agents
in the language acquisition model employ both interior messages, for communication between rules, and exterior messages for communication to and from the environment. The only function of interior messages is to cause the activation of rules by satisfying their condition (IF) part; an interior message can be thought of as a bit string with no meaning beyond its ability to activate rules. Many rules can be active simultaneously, so the system may have many messages present at any given time. This “parallelism” allows breaking a situation into parts handled by combinations of rules, rather than requiring each distinct situation to be handled by one monolithic rule. Instead of using a single rule to describe what should be done when encountering “a red car by the side of the road with a flat tire”, the system combines rules for “cars”, “roadside”, “tires”, etc. to handle the situation. As with words in a language, relatively few rules can be combined to handle a broad range of situations. The individual rules serve as building blocks. Because the interior bit strings are uninterpreted there is no possibility of contradiction between different rules that are active simultaneously. This is a substantial advantage because checking rules for consistency is a difficult computational problem. It is a particular burden when rules are continually being added and deleted, as is the case when the agent is learning from experience. For a cfs, additional active rules simply mean more messages. We can think of all the messages as being collected on a kind of bulletin board called a message list. At any given time all rules check all messages to see if their IF parts are satisfied. Some messages cause actions in the agent’s environment by activating certain units called effectors (think of muscles). In the present model, the agent receives information about its environment via a single exterior message produced by a set of detectors. 
Each detector responds to some property of the environment: “Is there something moving out there?” Yes (1) or No (0). “Is there a small object in the center of the visual field?” Yes (1) or No (0). And so on. The result is a bit string describing the local environment. Unlike the interior messages, these exterior messages have an interpretation supplied by the detectors.
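The fly-tracking rule and the detector bit strings just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-bit detector layout, the function names, and the use of `#` as a "don't care" symbol in a rule's condition (a convention from the classifier-system literature, not defined in this chapter) are all introduced here for the example.

```python
# Hypothetical detector layout (an assumption made for this sketch):
#   bit 0 = "is there a small flying object?"   bit 1 = "is it left of center?"

def matches(condition: str, message: str) -> bool:
    """A condition matches a message bit by bit; '#' means "don't care"."""
    return len(condition) == len(message) and all(
        c == "#" or c == m for c, m in zip(condition, message)
    )

def detect(fly_offset_deg: int) -> str:
    """Detectors turn the environment into a bit string (the exterior message)."""
    present = "1"
    left = "1" if fly_offset_deg < 0 else "0"
    return present + left

# IF a fly is present AND it is to the left THEN turn the head 15 degrees left.
RULE_CONDITION = "11"
TURN_DEG = 15

def track(fly_offset_deg: int, max_steps: int = 20) -> int:
    """Fire the rule for as long as its condition is satisfied; return the number of turns."""
    turns = 0
    for _ in range(max_steps):
        if not matches(RULE_CONDITION, detect(fly_offset_deg)):
            break
        # Turning the head left shifts the fly toward the center of the visual field.
        fly_offset_deg += TURN_DEG
        turns += 1
    return turns
```

Calling `track(-45)` fires the single rule three times before its condition stops being satisfied, which is the repeated action the text attributes to one simple stimulus-response rule.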
For the study of language acquisition, it is important that the agents have motivations supplied by needs or goals. In the present model this is accomplished by providing each agent with a set of reservoirs, such as a “food” reservoir and a “shelter” reservoir, that the agent must keep near-full. In an environment involving scattered patches of food and shelter, the agent must execute sequences of rules that allow it to exploit these patches. Rules that belong to sequences providing input to the reservoirs will be called useful. It is clear that rule sequences that keep reservoirs near-full can become quite sophisticated. Indeed, one purpose of the model being proposed here is to show that language use can be an important addition to these useful rule sequences.

Now we come to the first kind of adaptation that such a rule-based system can undergo. It involves a difficult problem of credit assignment: The cfs must use experience to decide which rules are useful to the system and which are not. In effect, the agent treats the rules as hypotheses to be confirmed or disconfirmed by experience. Rules are assigned strengths and a credit assignment algorithm modifies these strengths so that over time they come to reflect usefulness. At any given time, satisfied rules make bids based on these strengths in a competition to become active. High bidders win this right, as in an auction. Though I will not go into the details here, the credit assignment algorithm used by a cfs is called the bucket brigade algorithm (Holland et al., 1986). As a simple example of the effects of credit assignment, consider the two rules:

IF there is an object to the left THEN turn the head to the left.
IF there is an object to the left THEN turn the head to the right.

In most circumstances, the second rule is maladaptive and, under credit assignment, its strength would steadily decrease, while the first rule would become progressively stronger.
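The competition between rules under credit assignment can be illustrated with a small auction. The sketch below is not the bucket brigade algorithm itself, which passes payments back along chains of rules; it is a one-step stand-in in which the winning rule pays its bid and is rewarded only if its action was appropriate. The `Rule` class, the bid fraction, and the reward value are assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    name: str
    condition: str          # the situation the rule responds to
    action: str             # what the rule does when it wins
    strength: float = 10.0  # running estimate of usefulness

BID_FRACTION = 0.1  # hypothetical: share of its strength a rule stakes as a bid

def step(rules, situation, correct_action, reward=2.0):
    """One round: satisfied rules bid, the high bidder acts, pays, and may be rewarded."""
    satisfied = [r for r in rules if r.condition == situation]
    if not satisfied:
        return None
    winner = max(satisfied, key=lambda r: BID_FRACTION * r.strength)
    winner.strength -= BID_FRACTION * winner.strength  # the bid is paid whether or not it pays off
    if winner.action == correct_action:
        winner.strength += reward                      # payoff for a useful action
    return winner

# Two competing hypotheses about the same situation.
good = Rule("turn-left", "object_left", "turn_left", strength=10.0)
bad = Rule("turn-right", "object_left", "turn_right", strength=12.0)
for _ in range(10):
    step([good, bad], "object_left", "turn_left")
```

Even though the maladaptive rule starts out stronger here, it wins the first rounds, pays its bids without reward, and is soon outbid by the rule whose action is confirmed by experience.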
A default hierarchy makes it particularly easy for a cfs to modify the competition between rules as it accumulates experience. Consider a cfs that has a very general rule of the form IF there is a moving object nearby THEN flee.
Such a rule, called a default rule, has a general utility, but it will often lead to inappropriate action. For example, if the cfs is modeling an ant, many moving objects are indeed threatening to the ant, but the ant would never approach another ant if the rule were always invoked. That would not bode well for the continued existence of ant colonies. To compensate for this shortcoming, the cfs has another rule, called an exception rule, of the form IF there is a moving object nearby that is small and emits a “friend” pheromone THEN approach. Notice that the exception rule uses more information than the default rule. Still both of these rules would be active simultaneously if there were a small, moving object nearby that emits a “friend” pheromone (for example, another ant). The conflict is resolved by having the rule that uses more information override the rule that uses less information. There is an interesting symbiosis that takes place between default rules and exception rules. Under credit assignment, a rule loses strength every time it is wrong. However, the exception rule prevents the default rule from becoming active in some situations in which it would be wrong. Thus the default rule makes fewer errors than it would if the exception rule were not present to override it. So the two rules give a better performance than either one alone. The default hierarchy can have many levels, with exceptions to the exceptions, and so on. A default hierarchy provides a natural way for an agent to add rules reflecting increasing experience. First the system learns very general rules — for example, rules that attend to few detectors — that, while often wrong, provide better-than-random responses to the environment. Because they are simple, such rules are easily discovered, and they are frequently tested so that their marginal utility is quickly confirmed. 
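The override of a default rule by an exception rule can be sketched as follows. The sketch assumes that "uses more information" can be measured by the number of properties a rule's condition tests; the property names and the `act` function are illustrative choices, not part of the model's specification.

```python
# Each rule is (properties its condition requires, action). Names are illustrative.
RULES = [
    (frozenset({"moving"}), "flee"),                                   # default rule
    (frozenset({"moving", "small", "friend_pheromone"}), "approach"),  # exception rule
]

def act(observed: set):
    """Return the action of the most specific rule whose conditions all hold."""
    satisfied = [(cond, action) for cond, action in RULES if cond <= observed]
    if not satisfied:
        return None
    # The rule that tests more properties overrides the one that tests fewer.
    _, action = max(satisfied, key=lambda rule: len(rule[0]))
    return action
```

For a large moving object only the default is satisfied and the agent flees; for another ant both rules are satisfied and the exception wins, so the default is never charged with that error.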
Exception rules are discovered and confirmed later because the particular situations that invoke them are less frequently encountered. However, once discovered, their symbiotic relation to the more general rules assures their incorporation. The situation is reminiscent of the acquisition of
regular and irregular verb forms in language acquisition.

There is a kind of rule, closely related to a default rule, that will play an important role in this model of language acquisition: It is called a bridge rule (or bridging rule). Consider a rule that has the form

IF the food reservoir is low THEN search for food.

This rule will stay active until the food reservoir is no longer low. Its message, then, serves to keep the system focused on a goal. Other rules having a condition satisfied by the bridge rule’s message will be activated whenever the reservoir is low. This provides the opportunity to combine and sequence rules to achieve a goal, rather than requiring one monolithic rule to carry out the whole mission. The somewhat fanciful counterpart of a bridge rule for a human would be a rule of the form: IF “at the office” and “end of the workday” and “hungry” THEN send message “drive home”.

1. The “drive home” message could activate a rule of the form: IF “at the office” and “drive home” THEN “go to parking structure”.
2. There could be another rule of the form: IF “in the parking structure” and “drive home” THEN “start car”.
3. Further rules in the sequence would direct the car to the expressway, etc.

The “drive home” message keeps the action focused toward the goal, while each rule sets up its successor by issuing the appropriate message. Note that many of these rules would be usable whenever the agent is hungry in different, but similar, situations. Also, a few rules like this provide the agent with considerable powers of anticipation, allowing the agent to take actions predicated on more than the instantaneous environmental situation. Bridging rules enable the goal-oriented ordering we expect of a grammar.
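How a bridge rule's persistent message sequences other rules can be sketched with the food-reservoir rule. Everything numeric in this sketch is an assumption made for illustration: the reservoir scale, the "low" threshold, the amount gained by eating, and the one-dimensional distance to food.

```python
def run(reservoir: int, steps_to_food: int, low_mark: int = 5, max_cycles: int = 20) -> list:
    """Return the actions taken until the bridge rule's condition stops holding."""
    log = []
    for _ in range(max_cycles):
        messages = set()
        # Detectors post messages describing the current situation.
        if reservoir < low_mark:
            messages.add("reservoir low")
        if steps_to_food == 0:
            messages.add("food here")
        # Bridge rule: IF the food reservoir is low THEN post "search for food".
        if "reservoir low" in messages:
            messages.add("search for food")
        # Ordinary rules conditioned on the bridge rule's message.
        if {"search for food", "food here"} <= messages:
            reservoir += 3
            log.append("eat")
        elif "search for food" in messages:
            steps_to_food -= 1
            log.append("move toward food")
        else:
            break  # the bridge message is gone, so the sequence ends
    return log
```

With `run(reservoir=1, steps_to_food=2)` the agent moves twice and eats twice, then stops; no single rule encodes that whole sequence, which is held together only by the repeated "search for food" message.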
As an aside: It is not difficult to design a “curiosity” reservoir that induces the agent to search its environment when it has no immediate needs. This reservoir only affects the system when all the other reservoirs are near-full. It receives input whenever the system experiences an input that does not activate a high-strength rule, this being the operational definition of a new experience.

Once the agent determines that some rules are useless or, worse, harmful, it is natural to replace them with new rules that amount to new hypotheses. This process, called rule discovery in the cfs framework, is the model’s most powerful adaptive technique. Rule discovery cannot be random; that approach is demonstrably too inefficient to be useful. Somehow the formation of new rules must be biased by prior experience. The strength assigned to a rule can be treated as its fitness, reflecting its usefulness in the context of the other rules in the system. We can then treat the rules as a population of individuals undergoing Darwinian selection, using a genetic algorithm (Mitchell, 1996). Strong rules become “parents”, producing offspring rules that replace the weak rules in the system. By crossbreeding strong rules, much as a breeder crossbreeds plants or animals, we implement Darwin’s insight that natural selection does naturally what a breeder does consciously. We will examine this approach to rule formation more closely in the next section.
6. Building Blocks
Usually a theory or a model is constructed in terms of parts or mechanisms. These building blocks, as I will call them, provide a principled way of repairing models that fail. We can ask which building blocks are responsible for error. In fact, we do this in other areas: Fourier analysis provides us with a set of building blocks, namely the frequencies, in control circuitry. We can then consider frequency response under different conditions, making corrections for inadequacies. The history of science, from classical Greece onward, centers on the discovery of relevant building blocks. Nowadays we think of
quarks and gluons as the parts of nucleons; nucleons in turn combine to form the nucleus of an atom; atoms in turn combine to form molecules; and so on. It’s interesting that the building blocks at one level, when put together in particular ways, become the building blocks one level up. It’s as if I have a Lego set and make structures that can serve as parts of a more complicated structure. The building blocks at one level constrain what I can do at the next level, but they do not determine what I can do at the next level. For that reason chemistry is a fully developed discipline, with its own laws and building blocks, even though the building blocks one level down, atoms, are well defined.

To see building blocks from another point of view, consider one of the most important inventions of the twentieth century, the mobile source of power called the internal combustion engine. What are the building blocks of an internal combustion engine? They were all well known before the twentieth century: gear wheels go back to classical times; Venturi’s perfume sprayer preceded the carburetor by a century and a half; sparking devices were invented by Volta, and so on. All the major parts of an internal combustion engine were well known long before we had the device itself. The invention consisted in putting these building blocks together in a new way.

Building blocks offer us the combinatorics we expect of a grammar: Relatively few parts can be combined in an enormous number of ways. Consider the set of 7 children’s building blocks shown in Figure 1. Even if you look only at constructions using three building blocks, there are roughly 4,000 possibilities. If you consider constructions using 20 blocks the number of possibilities is truly astronomical. There are so many possibilities that the “castle” shown in Figure 1 could never be achieved by selecting and combining building blocks at random. Yet a child will quickly build a “castle” if requested.
Two non-random elements are involved. First, note that small combinations of building blocks can serve as higher-level building blocks for larger structures. Witness the “towers” in Figure 1. Secondly, the child has a “vision” of the goal that allows planning. Both of these factors will enter into our later discussion of language acquisition.
Figure 1
Building Blocks [Generators]: 7 building blocks (multiple copies of each). Example constructions in the figure are labeled “1 of ~4000” and “1 of ~1,000,000,000,000”.
Now let us look at a different, more target-oriented set of building blocks. Figure 2 shows a set of building blocks for constructing faces. Here the face is divided into 10 distinct features: hairstyle, shape of forehead, shape of the eye, shape of the eye-brow, shape of the nose, and so on. There are ten alternatives for each feature. How many building blocks are there? And how many faces
Figure 2
Building Blocks and Recombination (feature alternatives indexed by Instance and Position)
Using the number string representation for faces, new faces can be constructed by using a crossover operator on pairs of faces.
can I make? There are only 10 × 10 = 100 building blocks, but I can build 10^10 faces. If you can find the appropriate building blocks for the area you are trying to study, you can immediately construct a
wide range of plausible possibilities. Because we acquire the right set of building blocks for faces through experience, we can easily distinguish and remember many faces. Unless you are a zoo keeper, the same is not true for chimpanzee faces. Though their faces are every bit as variable as human faces, chimpanzees all look the same to most of us. Having an appropriate set of building blocks makes a great difference.

I want to use these building blocks for human faces to make a point about Darwinian selection. Note first that any one of the 10^10 possible faces can be described by a 10-digit decimal number where each digit gives the particular building block chosen for the corresponding feature (see Figure 2). Let us start with ten faces produced by randomly generating ten 10-digit numbers. Then let’s ask someone to rank those ten faces as to some subtle characteristic, say beauty or honesty. Though honesty is a subtle human characteristic, most people will take up the challenge, using their instincts to produce a ranking. From a Darwinian point of view, if selection is based on honesty, we can think of the face ranked highest as most fit, and the face ranked lowest as least fit.

Let’s see what we can do about crossbreeding the faces in an attempt to produce faces that look more honest to the person who did the ranking. That is, let us do what breeders do when they try to produce better animals or plants by mating organisms with desirable characteristics. This procedure depends upon the exchange of segments between parent chromosomes, known in genetics as recombination or crossing over. In the present case we can think of each 10-digit number as a (short) chromosome, and crossing over causes an exchange of segments between the two strings, as shown at the bottom of Figure 2. Notice that the operation produces two new faces from the two “parent” faces, each new face having some of the characteristics from one face and some of the characteristics from the other.
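The crossing-over operation on 10-digit face strings can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `crossover` and `random_face` are hypothetical, the digits 0–9 are assumed to label the ten alternatives for each feature, and a single crossover point is assumed (the text does not say how many exchange points are used).

```python
import random

def crossover(parent_a: str, parent_b: str, point: int) -> tuple[str, str]:
    """Exchange the tails of two equal-length digit strings at `point`.

    Assumption: a single crossover point; the two children are complementary.
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if not 0 < point < len(parent_a):
        raise ValueError("crossover point must fall strictly inside the string")
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def random_face(rng: random.Random, n_features: int = 10, n_alternatives: int = 10) -> str:
    """Draw one face: one digit (alternative) per feature position."""
    return "".join(str(rng.randrange(n_alternatives)) for _ in range(n_features))

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
mother, father = random_face(rng), random_face(rng)
point = rng.randrange(1, 10)           # exchange point strictly inside the string
child_1, child_2 = crossover(mother, father, point)
print(mother, father, "->", child_1, child_2)
```

Every digit of each child comes from one parent or the other at the same position, which is the sense in which each new face has some characteristics of each parent.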
As an aside, though most biology texts emphasize mutation as the driving force of evolution, it is actually crossing over that produces most of the variation and innovation. Recombination takes place in every individual in every generation, while a mutation is something like
10,000,000 times less frequent. Recombination is the reason that two children from the same parents can look so different.

Now, using crossover on the more highly ranked “parent” faces, let us produce a new, second generation of ten faces. Repeat this ranking and recombination procedure three more times to produce a fifth generation of faces. It is almost certain that any face in this fifth generation will look more honest to the individual doing the ranking than any face in the first generation. This is a simple example of the way a genetic algorithm (Mitchell, 1996) finds combinations of building blocks that rank highly under a predetermined fitness criterion.

There is a further advantage in determining a set of building blocks for a system of interest. In the present case, it would be quite difficult to write a paragraph defining the notion of an honest face, but by looking at the faces in the fifth generation we could see what building blocks they have in common. The subtlety of the concept arises because no single building block (say, large eyes) typifies the honest face; it will be certain combinations of building blocks. Interacting combinations of building blocks are typical of genetics and cas in general. Let that be a warning to you when you read that they have just found the gene for language or Alzheimer’s. It’s not that easy: genes interact in the chromosome to produce the effects we observe. We’ll see similar effects when we concern ourselves with the progressive development of a grammar during language acquisition.
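The rank-and-recombine procedure for the faces can be sketched as a small loop. The sketch rests on several assumptions that are not in the text: the human judge is replaced by a hypothetical scoring function `judge_score` that counts matches against an arbitrary hidden string, the better-ranked half of each generation serves as parents, and there is no mutation. Because this placeholder score is additive, it does not capture the interactions among building blocks that the discussion above emphasizes; it only shows the mechanics.

```python
import random

N_FEATURES = 10        # feature positions per face
N_ALTERNATIVES = 10    # alternatives per feature, labeled 0-9 (assumption)
POPULATION_SIZE = 10   # ten faces per generation, as in the text

# Hypothetical stand-in for the person doing the ranking.
HIDDEN_IDEAL = "3141592653"

def judge_score(face: str) -> int:
    """Placeholder for a human ranking: count features matching HIDDEN_IDEAL."""
    return sum(a == b for a, b in zip(face, HIDDEN_IDEAL))

def random_face(rng: random.Random) -> str:
    return "".join(str(rng.randrange(N_ALTERNATIVES)) for _ in range(N_FEATURES))

def crossover(a: str, b: str, point: int) -> tuple[str, str]:
    """Single-point exchange of segments between two parent strings."""
    return a[:point] + b[point:], b[:point] + a[point:]

def next_generation(population: list[str], rng: random.Random) -> list[str]:
    """Rank the faces, then breed the better-ranked half by crossover."""
    ranked = sorted(population, key=judge_score, reverse=True)
    parents = ranked[: POPULATION_SIZE // 2]
    children: list[str] = []
    while len(children) < POPULATION_SIZE:
        a, b = rng.sample(parents, 2)              # two distinct parents
        point = rng.randrange(1, N_FEATURES)       # interior crossover point
        children.extend(crossover(a, b, point))
    return children[:POPULATION_SIZE]

def run(generations: int = 5, seed: int = 0) -> list[list[str]]:
    """Return every generation, starting from ten random faces."""
    rng = random.Random(seed)
    population = [random_face(rng) for _ in range(POPULATION_SIZE)]
    history = [population]
    for _ in range(generations - 1):
        population = next_generation(population, rng)
        history.append(population)
    return history

if __name__ == "__main__":
    for number, generation in enumerate(run(), start=1):
        print(number, max(judge_score(face) for face in generation))
```

Inspecting which digits recur in the final generation corresponds to the step of looking at the fifth-generation faces to see what building blocks they have in common.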
7.
An Agent-based Model of Language Acquisition
As stated earlier, the objective of the model proposed here is to explore the possibility that general cognitive mechanisms alone, rather than language-specific mechanisms, might be sufficient for acquisition of a useful grammar. To provide a specific target, I will examine an agent-based, computer-executable model that uses a
classifier system as its underlying formalism. This model uses only general cognitive mechanisms (described below) applied to communication between naïve agents, called Learners, and agents that already have a capacity for language, called Teachers. The model, if successful, should allow a Learner to form a proto-grammar (also described below) in feasible time. The proto-grammar, in turn, should enable the Learner to make a combinatoric use of signals (vocabulary) to evoke “useful” actions from other agents. This would show that the “poverty of stimulus” argument does not supply a necessary constraint on language acquisition. Even if this strong result is not attained, the model should provide insight into the syntactic and semantic rules that are inferred or computed from input in the process of language acquisition.

The target agent-based model has the following scenario for defining useful actions:
1. Agents move about in a shared 2-dimensional environment which offers patches of resources that enable the agents to survive. The needs of an agent are specified by a set of reservoirs that it must replenish through resource acquisition.
2. The environment is designed to offer a wide range of options for attaining the resources through sequential action (as in a board game).
3. The survival of an agent is determined by its ability to acquire resources through appropriate action sequences (e.g. “locate food patch” ⇒ “approach food patch” ⇒ “consume food”).
A useful action sequence is one that evokes goal-directed, reservoir-filling actions from agents attending to the sequence. There is no explicit value assigned to signals or communications.

Agents are assumed to have the following cognitive mechanisms:
1. Detectors for detecting local conditions in the environment, including the presence of other agents and signals emitted by other agents. Agents innately distinguish between
objects and actions; at any given time an agent has (at most) a single action and/or object as salient (the center of its detecting apparatus). When a pair of agents in the same locale attends the same salient object/action, both agents are aware of this fact.
2. Effectors that allow movement and searching through the environment, resource acquisition when in a resource patch, and emission of signals. An agent has a fixed set of utterances (“vocabulary”) for inter-agent signaling (one utterance per time-step) akin to the warning cries of primates; this model does not study the progressive construction of words from elemental utterances (“phonemes”).
3. Rule-based processing of internal and external signals.
4. General-purpose learning algorithms for rating the usefulness of extant rules and discovery of new rules; the learning algorithms (credit assignment and rule discovery) are not language specific, being designed to increase the ability to acquire resources even in the absence of signaling. There are three specific rule discovery algorithms that work by imitation (akin to imprinting).

With this background, I will now turn to the interaction of the two agents called Teacher and Learner; the discussion may be extended to multi-agent systems with ease. Let the agent Teacher, the agent competent in language, utter the word “cookie”. This utterance serves as part of an exterior message to Learner, formed via Learner’s detectors. We want to examine mechanisms that will enable Learner to build rules that make use of this utterance. There are three conditions that together indicate the appropriateness of a new rule:
1. A low reservoir: There must be an active bridging rule indicating a low reservoir.
2. A common salient object: For instance, Teacher may notice that both Learner and Teacher are observing the same object.
3. A relevant utterance: For instance, Teacher names the object.
When these conditions obtain, a trigger causes the formation of a new rule in Learner that enables Learner to repeat the utterance in the same context.

It is a common observation that between a mother and child there is a steady vocal exchange that starts with babbling and becomes increasingly refined over time. Moreover, the mother regularly simplifies both the environment and her utterances to provide clear associations and to remove ambiguity. Large numbers of samples are produced and if the child errs the mother commonly repeats the utterance, sometimes many times. Though overt negative reinforcement is rare, the error-originated repetition amounts to the same thing.

This brings up the key point for building an exploratory model: By using triggers like the one just described we can see how far the model will take us toward developing a language-like capacity. Model failures give us clues about where to look for more possibilities. This contrasts with even the most careful rhetorical discussions of language. If, in executing the model, you get so far and no farther, that provides useful information about the limitations of the mechanisms (for instance, triggers) you are using. This particular model will only be a success if it can produce a proto-grammar with just a few general-purpose triggers, say fewer than a dozen, designed to reflect general cognitive abilities. Then, instead of starting with an innate grammar corrected by tuning, the language acquisition model starts with this set of triggers and a blank slate. The triggers should generate an increasing default hierarchy that serves as a proto-grammar. It increases our hopes for success that we already have a proof-of-principle model of this process in the case of generalized signaling networks (Holland, 2001).
That model generated the bridges and sequences of actions under the bridges, so in that context we know that rule generation of this kind is possible. It is interesting to compare the use of utterances intended for this model with the pre-linguistic use of signals by primates. Certain
monkeys issue a “predator” cry whenever a predator is present, with variations in the cry according to the type of predator — leopard, eagle, or the like. Different cries cause different actions: The “leopard” cry causes the monkey to head to the canopy top, while the “eagle” cry causes the monkey to flee the canopy top. Interestingly, there is occasionally a smart monkey that learns to use this repertoire to “lie”: By uttering the “eagle” cry, in the absence of an eagle, this monkey clears the canopy, giving it uncontested access to the top of the canopy where the most desirable food is found. It is as if the monkey has a trigger that combines previous observations of the advantages of an empty canopy with the effect of the “eagle” cry. This trigger is a rather easy combination of strong motivation (a near-empty food reservoir), a salient observation (the desirability of an empty canopy top), and a relevant action (the “eagle” cry).
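The three-condition trigger described in this section (a low reservoir, a common salient object, and a relevant utterance) can be sketched as follows. This is a minimal illustration, not the classifier-system implementation the chapter proposes: the names `Learner`, `Rule`, `imitation_trigger` and the threshold `LOW_RESERVOIR_THRESHOLD` are hypothetical, reservoir levels are assumed to lie between 0 and 1, and any utterance emitted while attention is shared is treated as relevant.

```python
from dataclasses import dataclass, field
from typing import Optional

LOW_RESERVOIR_THRESHOLD = 0.25  # hypothetical cutoff for "low"

@dataclass(frozen=True)
class Rule:
    """IF this need is low AND this object is salient THEN emit this utterance."""
    need: str
    salient: str
    utterance: str

@dataclass
class Learner:
    reservoirs: dict[str, float]            # reservoir name -> level in [0, 1]
    salient: Optional[str] = None           # at most one salient object/action
    rules: list[Rule] = field(default_factory=list)

def imitation_trigger(learner: Learner,
                      teacher_salient: Optional[str],
                      teacher_utterance: Optional[str]) -> Optional[Rule]:
    """Form a new rule only when all three conditions hold; otherwise return None."""
    # Condition 1: a low reservoir.
    low = [name for name, level in learner.reservoirs.items()
           if level < LOW_RESERVOIR_THRESHOLD]
    if not low:
        return None
    # Condition 2: Teacher and Learner attend to the same salient object.
    if learner.salient is None or learner.salient != teacher_salient:
        return None
    # Condition 3: an utterance accompanies the shared attention
    # (simplifying assumption: any such utterance counts as relevant).
    if teacher_utterance is None:
        return None
    neediest = min(low, key=learner.reservoirs.get)   # lowest reservoir
    rule = Rule(need=neediest, salient=learner.salient, utterance=teacher_utterance)
    if rule not in learner.rules:                     # avoid storing duplicates
        learner.rules.append(rule)
    return rule

hungry = Learner(reservoirs={"food": 0.1, "water": 0.9}, salient="cookie")
print(imitation_trigger(hungry, teacher_salient="cookie", teacher_utterance="cookie"))
```

In a fuller model, the stored rule would then compete with other rules under credit assignment, so that it is retained only if repeating the utterance helps keep the reservoirs filled.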
8.
Discussion
As indicated earlier, the target model will only be deemed successful if, through the action of a small number of triggers, it acquires a proto-grammar. This should come about through initial retention and generalization of a default utterance order (akin to Subject/Verb/Object), followed by progressive elaboration of a default hierarchy (Holland et al., 1986) through accrual and generalization of exceptions. The target model has some specific advantages in this respect:
1. An examination of the rules being formed provides a direct intuitive check on what is being internalized (learned), in contrast with neural network approaches.
2. All feedback loops (message exchanges) between agents are explicit.
3. At the times preceding need-satisfaction (increment to reservoirs), the active rules provide insights into the role of inter-agent signaling (if any) in increasing agent survivability.
4. The ontogeny of the proto-grammar can be directly observed by examining the succession of stored rules.

A central concern in this study is the robustness of the mechanisms underlying these language acquisition processes which involve the interplay between social interaction, resource acquisition, and biological/cognitive machinery. Some motivating questions in this regard are:
1. What is the role of feedback and reinforcement in enabling a learner to develop the computational algorithms needed for robust acquisition of language?
2. What is the machinery for robust language acquisition by learners in situations of reduced input and socialization?
3. What are the mechanisms used by parents to increase children’s ability to communicate with a general population?
4. What are the factors in the evolution of language that increase the survivability of groups sharing a common language?

In the target model, the central feature is the social role of language in confronting the problems presented by the environment in which the agents operate. The criterion for rule retention is implicitly defined by these requirements. Rule retention does not depend upon some a priori assignment of fitness to rules. In particular, in this model, the retention of a rule depends upon the role it plays in keeping the reservoirs filled. This is directly testable through control experiments. For instance, you could run the model with and without certain rules, examining the average reservoir levels in the two cases.

Because a genetic algorithm can be applied to a classifier system, both acquisition and evolution can be investigated in this model. The model may actually be able to demonstrate a version of “ontogeny recapitulates phylogeny” in language acquisition and evolution. Though this idea was discredited in its original, naïve form, it is now being taken seriously again in genomics and proteomics. Micro-assay chips allow us to distinguish thousands of
characteristics of active genes and proteins in different biological cells. Through cladistic analysis we can begin to reconstruct the phylogeny of various proteins and genes. It is often the case that phylogenetically-early proteins are also ontogenetically early. With luck, the models of the kind being proposed here will let us begin similar reconstructions for language.
References
Holland, J. H. (2000) What is a learning classifier system? In P. L. Lanzi, W. Stolzmann, and S. W. Wilson (Eds.) Learning Classifier Systems (pp. 3–6). Springer: Berlin.
Holland, J. H. (2001) Exploring the evolution of complexity in signaling networks. Complexity 7 (2), 34–45.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., and Thagard, P. R. (1986) Induction: Processes of inference, learning, and discovery. MIT Press: Cambridge MA.
Mitchell, M. (1996) An introduction to genetic algorithms. MIT Press: Cambridge MA.
Russell, B. and Whitehead, A. N. (1910) Principia Mathematica. Cambridge University Press: Cambridge.
14
How Many Meanings Does a Word Have? Meaning Estimation in Chinese and English
Chienjer Charles Lin
University of Arizona
Kathleen Ahrens
National Taiwan University
Abstract
This chapter explores the psychological basis of lexical ambiguity. We compare three ways of meaning calculation, including meanings listed in dictionaries, meanings provided by human subjects, and meanings analyzed by a linguistic theory. Two experiments were conducted using both Chinese and English data. The results suggest that while the numbers of meanings obtained by different methods are significantly different from one another, they are also significantly correlated. Different ways of meaning calculation produce distinct numbers of meanings, though on a relative scale, words with more meanings tend to have greater numbers of meanings throughout. Dictionary meanings are distinguished from meanings obtained from subjects both in content and in number.
These results are then discussed with regard to their methodological implications for further research on psychosemantics and semantic change.
1.
Introduction
In human language, words and meanings do not always form one-to-one correspondences. The majority of the human lexicon is, in fact, extensively associated with multiple meanings — what we refer to as lexical ambiguity.1 A word like board means both a flat piece of wood, and a group of people who manage something together. Homophones such as board and bored can be confusing when spoken in isolation. The multiple meanings associated with a word can be etymologically associated, but language users do not necessarily have such knowledge. Another mismatch between words and meanings is synonymy, where several words mean roughly the same thing. For example, like, favor, admire, enjoy, and love are synonyms meaning “having preference.”2 The associations between meanings and words are thus many-to-many in nature. A word has many meanings, and many words can mean the same thing. If we see words as boxes and meaning as the content, then it is easy to understand the relation between words and meaning in the evolution of language. On the one hand, we do not want too many boxes because they occupy a lot of space for storage. On the other hand, we do not want to put too many different things in one box because this would make it difficult to find an item. There is thus
1 Britton (1978) estimated 32% of the words in English texts to be ambiguous. Huang (1994) surveyed the first thousand pages of entries from Longman Dictionary of Contemporary English and found 39% of the entries polysemic; the average number of senses of these words is 3.02. Hue, Yen, Just, and Carpenter (1994) estimated 11.43% of the Chinese words in a dictionary to be ambiguous.
2 They are also called near synonyms, if we take the position that no two words can be taken to mean exactly the same thing.
this tension between using the same linguistic symbol for different meanings (for economy’s sake) and using distinct symbols for different concepts (for clarity’s sake). The cost of economy is confusion; the cost of clarity is excessive burden on memory and processing. In the history of a language, a word tends to develop meanings and undergo semantic developments, so that it is sufficiently utilized in the mental lexicon. Well-attested semantic developments include metaphorization and grammaticalization (e.g. Traugott and Dasher, 2001). However, not too many unrelated meanings are allowed to associate with one single word; otherwise, it would be difficult to communicate without having to repetitively request further clarification.3 This article is concerned with the methodological issues concerning the numbers of meanings associated with words. Psycholinguistic studies have been interested in how the semantic ambiguity of a word affects its processing both in isolation (e.g. Azuma and Van Orden, 1997; Rodd, Gaskell and Marslen-Wilson, 2002) and in sentences (e.g. Onifer and Swinney, 1981). A critical issue when doing such research is to determine the number of meanings a word has, and the kind of ambiguity this word demonstrates. As outlined above, a word’s number of meanings is usually confounded with several factors. A word can have many meanings that are closely related to one another. Another word can have a few distinct senses. Which word is more ambiguous? Lexical ambiguity is itself an ambiguous notion. In this paper, we compare three approaches that researchers have adopted to determine the numbers of word meanings — meanings listed in standard dictionaries, meanings produced by language users, and meanings processed by lexical semantic theory. Through the present
3 There are two possibilities for unrelated meanings to be associated with one lexical item. It could be accidental; two different words somehow got pronounced the same way and then further spelled the same way. It could also be a distant development of the core sense, the relation of which might be hard to establish out of that context especially when the successive intermediate stages are no longer available.
investigation, we wish to provide future studies of lexical semantics, psycholinguistics, and natural language processing with the nature and limitations of different ways of semantic representation, and the compatibility among them. Section 2 of this chapter introduces these different ways of meaning calculation. Section 3 presents an experiment comparing meaning metrics of Chinese nouns using these different methods. Section 4 presents a similar experiment on English words as a cross-linguistic confirmation. Section 5 discusses the implications of the results and concludes.
2.
Meanings in Dictionaries, in Language Users, and in a Semantic Theory
To a psycholinguist, the goal of a semantic representation is to reflect how the meanings of a word are represented in the human mind. One way of testing these representations is to examine how meanings are accessed in isolated words or in sentential contexts. To show the diverse ways of meaning calculation and their effects, we take as an example the paradigm of ambiguity advantage in isolated word recognition. Since the 1970s, there has been continuing interest in ambiguity effect and lexical access. The effect of ambiguity advantage has been reported, stating that words with greater numbers of meanings are recognized faster than words with few meanings. How these researchers determined a word’s number of meanings is the issue we will focus on here. Some researchers consulted dictionaries (Gernsbacher, 1984; Jastrzembski, 1981; Jastrzembski and Stanners, 1975; Rodd et al., 2002); some asked language users to decide whether a word is ambiguous or not (Borowsky and Masson, 1996; Hino and Lupker, 1996; Kellas, Ferraro and Simpson, 1988); some collected definitions from language users (Azuma and Van Orden, 1997; Millis and Button, 1989); some used linguistic definitions to determine the numbers of different meanings language users provided (Lin, 1999; Lin and Ahrens, 2000). These researchers
found conflicting results as to whether ambiguity advantage exists. A reasonable question to ask is whether the meanings from these different sources are compatible if we want to compare the results of different research. We will now consider dictionary meanings, meanings provided by language users (i.e. semantic intuition), and meanings determined by consulting a linguistic theory (i.e. linguistic senses) in turn.
2.1
Dictionary meanings
Psycholinguistic research in the 70s and 80s usually used dictionaries as the source of a word’s number of meanings (e.g. Gernsbacher, 1984; Jastrzembski, 1981; Jastrzembski and Stanners, 1975). These researchers checked their materials in published dictionaries for meaning enumeration. For example, Azuma and Van Orden (1997) matched the meanings they collected from language users to those listed in a dictionary; Rodd et al. (2002) consulted the on-line Wordsmyth English Dictionary-Thesaurus to decide a word’s ambiguity. Dictionary meanings are favored by researchers because they are standardized, comprehensive, and easy to obtain. The use of dictionary meanings in experiments, however, has its limitations. First of all, researchers consult different dictionaries, which inevitably have distinct editing styles and meaning presentations.4 To name a few editorial differences in different dictionaries, some dictionaries list distinct meanings as separate lexical entries; others tend to group them under one entry. Some dictionaries treat
4 For instance, Jastrzembski and Stanners (1975) used Random House Dictionary of the English Language, Unabridged (1967). Jastrzembski (1981) used Webster’s Third New International Dictionary, Unabridged (1976). Gernsbacher (1984) used Webster’s New World Dictionary, Unabridged (1976). Azuma and Van Orden (1997) consulted Webster’s New World Dictionary (1980). And Rodd et al. (2002) used Wordsmyth English Dictionary-Thesaurus (1998).
meaning extensions under one semantic entry; others put them in separate entries. The line to separate closely related meanings is hard to draw. Different dictionaries inevitably make different decisions concerning these finer semantic distinctions. Therefore, researchers referring to different dictionaries would come up with different numbers of meanings for the same sets of words.

Second, dictionaries are designed for language users’ reference, thus representing the “standard” use of the lexicons. The definitions, hence, include archaic ones and lack the novel meanings that are emerging. Gernsbacher (1984) found in an informal survey that even well-educated subjects such as college professors could report only a small portion of the meanings actually listed in a dictionary.5 Our own survey of ten college students also showed that subjects frequently provided meanings that are quite different from dictionary definitions. Sometimes these meanings have such a high frequency that they should be considered well-established; however, they are not yet listed in the dictionary. For instance, the word xiaodi ‘little brother’ in Chinese has two meanings listed in Gwoyeuryhbaw Dictionary (國語日報辭典) (1989): (a) the youngest brother, and (b) a modest term for oneself. A survey of 10 students provided four meanings: (a) a young boy: 9, (b) a waiter: 7, (c) a person at a lower rank: 4, and (d) a modest term for oneself: 2. (The numbers following each definition represent the numbers of subjects who provided such a meaning.) Meanings such as ‘a waiter’ and ‘a person at a lower rank’, though not found in a dictionary, were even more frequently provided by the subjects than the dictionary meaning ‘a modest term for oneself’. Novel senses or emerging slang uses of a word that are not included in standard dictionaries may nonetheless be very important in a native speaker’s knowledge of a word.
There is an overt gap between dictionary meanings and the semantic knowledge subjects actually possess.
5 For the word gauge, which has 30 dictionary meanings, several college professors provided only 2 meanings in Gernsbacher’s (1984) survey.
2.2
Semantic intuition
The problems with dictionary meanings led researchers to turn to the production of meanings by language users. It is a reasonable move, since it is the semantic knowledge of language users that we are interested in. Millis and Button (1989) called this accessible polysemy — the number of different meanings that subjects are able to think of for a word. These meanings may be a subset of the meanings that a language user actually has, since there may be other meanings that are recognizable to a subject that he/she does not think of upon seeing the stimuli. However, the insufficiency can be compensated by collecting data from many people. We call these meanings collected from subjects the semantic intuition of language users. Millis and Button (1989) elicited three different kinds of accessible polysemy — first meanings, total numbers of meanings, and average numbers of meanings. First-meaning metric refers to the collection of the first meanings subjects think of for a word in a meaning generation task. This method, used by Rubenstein et al. (1970), Rubenstein et al. (1971), and Forster and Bednall (1976), has limitations. The first meanings subjects think of are the most dominant meanings. Since meanings other than the primary ones are overlooked, this method does not adequately represent the knowledge a subject has for a word. Millis and Button’s (1989) experiment also showed that lexical decision tasks using the first meanings did not produce the effect of ambiguity advantage. They are, therefore, not an appropriate choice to access people’s overall semantic knowledge. Both total meanings and the average numbers of meanings are collected by asking subjects to write down all the meanings they think of for a word without time limit. Then, researchers determine the numbers of different meanings. 
A total-meaning metric is the total numbers of different meanings subjects provide for a word; an average-meaning metric is the average numbers of different meanings each subject provides for each word. Total and average meaning metrics are two psychologically real estimations of people’s
accessible polysemy, since Millis and Button (1989) found ambiguity advantage using both metrics.6 In addition to the three metrics to estimate polysemy, some researchers simply asked their subjects to decide whether a word is ambiguous or not (Kellas et al., 1988; Borowsky and Masson, 1996; Hino and Lupker, 1996). Subjects are asked to circle whether a word has no meaning, one meaning, or more than one meaning. The limitations of this method are that (1) in choosing among the three options, subjects may not have considered the stimuli carefully enough, at least not as carefully as when asked to provide meanings, (2) the criteria subjects use in making such a decision are unknown, and (3) words that have more than one meaning could vary greatly in their numbers of meanings; simply asking the subjects to make a multiple choice overlooks the differences among words with many meanings. In summary, we consider that asking subjects to provide all the meanings they can think of for a word comes closest to capturing their semantic knowledge of the words.
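The total-meaning and average-meaning metrics can be made concrete with a short sketch. The function names are hypothetical, and the code assumes the subjects’ responses have already been coded into distinct meanings, which is itself the delimitation problem taken up in Section 2.3. The tallies used in the example are those reported for xiaodi in Section 2.1; applying the two metrics to that survey is an illustration, not a result reported in this chapter.

```python
def total_meaning_metric(responses):
    """Number of different meanings provided by all subjects together.

    `responses` holds one collection of coded meanings per subject.
    """
    return len(set().union(*responses))

def average_meaning_metric(responses):
    """Mean number of different meanings provided per subject."""
    if not responses:
        raise ValueError("need at least one subject")
    return sum(len(set(r)) for r in responses) / len(responses)

def metrics_from_tallies(tallies, n_subjects):
    """Same two metrics when only per-meaning subject counts are available.

    Assumes each subject is counted at most once per meaning.
    """
    total = sum(1 for count in tallies.values() if count > 0)
    average = sum(tallies.values()) / n_subjects
    return total, average

# Number of subjects (out of 10) who provided each meaning of xiaodi.
XIAODI_TALLIES = {
    "a young boy": 9,
    "a waiter": 7,
    "a person at a lower rank": 4,
    "a modest term for oneself": 2,
}

total, average = metrics_from_tallies(XIAODI_TALLIES, n_subjects=10)
print(total, average)  # 4 different meanings; 22 / 10 = 2.2 meanings per subject
```

The two metrics can diverge: a word for which many subjects each supply a different single meaning would score high on the total metric but near 1 on the average metric.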
2.3
Linguistic senses
Calculation of a word’s number of meanings does not end at meaning generation by subjects. Among the meanings subjects generate, how do we decide which meanings are distinct, and which meanings are the same? Lin (1999) provides an alternative to deal with the delimitation problem. He stresses the importance of linguistic knowledge in delimiting word meanings, and argues for sense delimitation based on a lexical semantic theory proposed by Ahrens et al. (1998). This theory distinguishes two levels of meaning representation among Chinese nominals: senses and meaning facets. The properties of senses and meaning facets can be distinguished
6 Azuma and Van Orden (1997) and Lin (1999) both adopted the total-meaning metric in their experiments.
based on (a) the conceptual domains involved, (b) the productivity and predictability of meaning relations, and (c) the linguistic context. The senses of a word have the following properties: (a) a sense is not an instance of metonymic or meronymic extension, but may be an instance of metaphorical extension;7 (b) the extension links between two senses cannot be inherited by a class of nouns; (c) senses cannot appear in the same context (unless the complexity is triggered). A meaning facet, as “an extension from a particular sense” (ibid: 53), has the following properties: (a) it is an instance of metonymic or meronymic extension; (b) nouns of the same semantic class have similar extension links to related meaning facets; (c) it can appear in the same context as other meaning facets. Therefore, two meanings are distinct senses when they involve different conceptual domains and occur primarily in distinct linguistic contexts. When the relation between two meanings is productively found among words of the same semantic class, these meanings are treated as meaning facets, which can be derived by inheritance rules. This makes the representation and processing of lexical semantics economical and efficient, since only the semantic information that cannot be derived by rules is listed as distinct
7 This theory captures the essential differences between metonymic and metaphorical extensions. Metonymic relations are within the meaning facet level because they are productive, predictable, and context-dependent. Metaphorical extensions are seen as relations among different word senses, because of the different conceptual domains involved. Meronymic and metonymic extensions are two main ways of deriving meaning facets. Meronymic extensions involve part/whole relations, by which part stands for whole or whole stands for part. Metonymic extensions include: (1) agentivization: from information media to information creator; (2) product instantiation: from institution to product; (3) grinding: from individual to mass; (4) portioning: from information media to information, from container to containee, from body part to function; (5) space mark-up: from landmark to space in vicinity, from structure to aperture, from institution to locus; (6) time mark-up: from event to temporal period, from object to process, from locus to duration (Ahrens et al., 1998: 57).
semantic entries (i.e. senses). To illustrate how this theory is put into practice, we take huoguo (火鍋) as an example. Subjects provided two senses, with two meaning facets under the first sense. The meanings of huoguo are represented in (1).

(1) HUOGUO
    Sense 1: a pot cooking on the fire
        Meaning facet 1: physical object, hot pot, the container
        Meaning facet 2: the food contained in the hot pot
    Sense 2: a blocked shot, a term in basketball games

The two senses of huoguo involve different conceptual domains: one in food, the other in sports. They cannot occur in the same linguistic context. However, the two meaning facets of the first sense are both in the food domain, and can co-occur in sentences like zuowan de huoguo hen bucuo ‘the hot pot last night was not bad’. In the following, we will use the definition of senses as the linguistic meanings.
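The two-level representation in (1) can be sketched as a small data structure. This is only an illustration and not part of the formalism of Ahrens et al. (1998): the class names `MeaningFacet`, `Sense` and `LexicalEntry`, and the `domain` field, are hypothetical names chosen here. The sketch shows that counting linguistic senses means counting only the top-level entries, while meaning facets are stored beneath the sense they extend.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MeaningFacet:
    """A predictable (metonymic or meronymic) extension of a sense."""
    gloss: str


@dataclass
class Sense:
    """A distinct sense, tied to one conceptual domain (hypothetical field)."""
    gloss: str
    domain: str
    facets: List[MeaningFacet] = field(default_factory=list)


@dataclass
class LexicalEntry:
    word: str
    senses: List[Sense] = field(default_factory=list)

    def number_of_senses(self) -> int:
        # Only senses are counted; facets are derivable and not counted.
        return len(self.senses)


# The entry for huoguo, following example (1).
huoguo = LexicalEntry(
    word="huoguo",
    senses=[
        Sense(
            gloss="a pot cooking on the fire",
            domain="food",
            facets=[
                MeaningFacet("physical object, hot pot, the container"),
                MeaningFacet("the food contained in the hot pot"),
            ],
        ),
        Sense(gloss="a blocked shot, a term in basketball games", domain="sports"),
    ],
)

print(huoguo.number_of_senses())
```

Under this representation huoguo counts as a word with two senses, even though its first sense carries two meaning facets.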
3. Experiment 1: Comparing Different Meaning Measurements in Chinese
In this section, the numbers of meanings derived from different measures are tested statistically. The numbers of meanings of 200 disyllabic Chinese nouns listed in three dictionaries, the raw meanings subjects provided, and the linguistic senses processed by a linguistic theory are compared.
3.1 Experiment 1a: Dictionary numbers of meanings
Experiment 1a compares the meanings listed in three Chinese dictionaries: the Gwoyeuryhbaw Dictionary (GD) 國語日報辭典 (1989), the Revised Chinese Dictionary (RCD) 重編國語辭典 (1997), and The Warmth Modern Chinese-English Dictionary (WCED) 旺文現代漢英辭典 (1997). A Chinese-English dictionary is included to see whether a cross-linguistic way of defining meaning leads to similar or different results.
3.1.1 Materials and procedures
Two hundred disyllabic Chinese nouns were selected from The Most Frequent Nouns in Journal Chinese and Their Classification: Corpus-Based Research Series No. 4, published by the Chinese Knowledge Information Processing Group (CKIP, 1993). These nouns were selected with an eye to including 100 potentially ambiguous nouns and 100 potentially unambiguous nouns.8 These 200 words were checked for their number of meanings listed in the three dictionaries. The experimenter did not make any subjective judgments; the calculation of meanings was completely based on the numbers enumerated in the dictionaries.
3.1.2 Results
Not all the items were found in all dictionaries. Forty-nine of the items were missing in GD, 9 in RCD, and 23 in WCED. The whole list of the stimulus items and the numbers of meanings in the dictionaries are given in the first four columns of Appendix 1. Paired-samples t-tests show that the numbers of meanings listed in these three dictionaries are significantly different from one another at the 0.01 level (GD-RCD: t(150) = -5.46, RCD-WCED: t(176) = 8.33, GD-WCED: t(141) = 3.68). Different dictionaries thus provide very different numbers of meanings for the same list of words. Further
8 The inclusion of approximately half ambiguous and half unambiguous nouns was due to the further use of these data in on-line lexical decision experiments. Those experiments were designed to examine the effects of ambiguity advantage, semantic relatedness, and relative meaning frequency. For details, refer to Lin (1999) and Lin and Ahrens (2000).
investigation into the correlations among the dictionary meanings shows that the numbers of meanings listed in different dictionaries are significantly correlated (p < .01). The lowest correlation is 0.378. Table 1 gives the correlation matrix among the dictionary numbers of meanings.

Table 1. Correlation matrix among the dictionary numbers of meanings

        GD       RCD      WCED
GD      1.000
RCD     0.378*   1.000
WCED    0.568*   0.606*   1.000

* p < .01
Note: GD = Gwoyeuryhbaw Dictionary (1989) 國語日報辭典; RCD = Revised Chinese Dictionary (1997) 重編國語辭典; WCED = The Warmth Modern Chinese-English Dictionary (1997) 旺文現代漢英辭典
The results suggest that different dictionaries produce different numbers of meanings for the same words, even though relatively speaking, the words with more meanings in one dictionary also have more meanings in another.
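The two statistics used in this experiment can be illustrated with a short sketch. It assumes pairwise deletion of items missing from either dictionary, which is consistent with the reported degrees of freedom (for example, 200 - 49 = 151 items are available in GD, matching t(150) for the GD-RCD comparison) but is not stated explicitly in the text. The function names are hypothetical, and the six items are a small excerpt from Appendix 1 used only for illustration, so the output does not reproduce the reported statistics.

```python
import math


def complete_pairs(xs, ys):
    """Keep only items listed in both dictionaries (None marks a missing item)."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]


def paired_t(xs, ys):
    """Return the paired-samples t statistic and its degrees of freedom."""
    diffs = [x - y for x, y in complete_pairs(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n), n - 1


def pearson_r(xs, ys):
    """Return the Pearson correlation over the complete pairs."""
    pairs = complete_pairs(xs, ys)
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# GD and RCD counts for jingdian, qianqi, daiyu, liangji, pinzhi, jiahuo.
gd = [2, None, 2, 2, 1, 2]
rcd = [3, 1, 2, 3, 1, 3]

t, df = paired_t(gd, rcd)
r = pearson_r(gd, rcd)
print(f"t({df}) = {t:.2f}, r = {r:.3f}")  # t(4) = -2.45, r = 0.875
```

Even in this toy excerpt the two dictionaries differ in their mean counts (RCD lists more meanings) while the counts remain positively correlated, which is the pattern described above.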
3.2 Experiment 1b: Semantic intuition and linguistic senses
This experiment collected meanings from subjects. From the meanings subjects provided, we derived three types of measures: subjects’ raw numbers of meanings (i.e. semantic intuition), average numbers of linguistic senses, and total numbers of linguistic senses. Subjects’ raw numbers of meanings are the average numbers of meanings each subject provided for each word. These raw numbers of meanings represent language users’ intuition about what the meanings of a word are. As outlined in Section 2.3, subjects’ average and total numbers of linguistic senses are the meanings generated by subjects and then analyzed by the definition of linguistic senses. The results will further be compared with the dictionary meanings in Experiment 1a.
3.2.1 Subjects
Two hundred undergraduates (126 females, 74 males) from National Chengchi University participated in the meaning generation task. All the subjects were native speakers of Mandarin who had been exposed only to Mandarin and Taiwan Southern Min before the age of seven. All of the subjects rated their general proficiency in Mandarin above 5 on a 7-point scale.
3.2.2 Materials and procedures
The stimulus items are the same as the two hundred disyllabic Chinese nouns used in Experiment 1a. These items were randomly assigned to ten booklets. Each item list in each booklet was organized in two random orders. Subjects were randomly given a booklet containing a set of instructions, a list of 20 words, and answer sheets. They were asked to write down all the meanings they could think of for each word with no time limit. They were instructed to use the word in a sentence, and define the word as they had used it in the sentence. Subjects took approximately 30 minutes to complete the booklet. Twenty subjects provided meanings for each word. At the end of each booklet, a sheet required the subjects to review the meanings they provided for each word. This helped ensure that subjects had responded to each item, and offered them a second chance to think over the meanings they provided. The experimenter calculated the average numbers of meanings based on the numbers of meanings each subject wrote on the review sheet for each word, and from this derived subjects’ intuitive raw numbers of meanings. The numbers of linguistic senses required the decision of experimenters. Two experimenters independently decided the
numbers of different senses each subject provided for each word based on the definition in Ahrens et al. (1998). They then went through the items on which their analyses differed and reached a decision that both agreed upon. The average numbers of senses are calculated by averaging the numbers of distinct senses each subject provided for each item. The total numbers of senses are the numbers of distinct senses among all the meanings that all subjects generated for each word.
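The average and total measures defined above can be sketched as follows. The responses are invented for illustration, and the function names and sense labels are hypothetical. The sketch also applies the inclusion criterion reported in the Results (only senses given by more than 15% of the subjects are counted); applying that criterion to the total but not to the average is an assumption made here, since the text does not say how it interacts with the average.

```python
from collections import Counter


def average_senses(responses):
    """Mean number of distinct senses per subject."""
    return sum(len(set(r)) for r in responses) / len(responses)


def total_senses(responses, threshold=0.15):
    """Number of distinct senses given by more than `threshold` of the subjects."""
    n = len(responses)
    counts = Counter(sense for r in responses for sense in set(r))
    return sum(1 for c in counts.values() if c / n > threshold)


# Invented responses from 20 subjects for one word (sense labels are hypothetical).
responses = (
    [["pot"]] * 12
    + [["pot", "blocked shot"]] * 6
    + [["pot", "rare sense"]] * 2
)

print(average_senses(responses), total_senses(responses))  # 1.4 2
```

Here the sense given by only two of the twenty subjects (10%) is excluded from the total, so the word counts as having two senses in total and 1.4 senses on average.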
3.2.3 Results
Twenty subjects provided meanings for each of the 200 words. To avoid idiosyncratic responses, only senses provided by more than 15% of the subjects were included. The data obtained are listed in Appendix 1. Subjects’ raw numbers of meanings are given in the fifth column, average numbers of senses in the sixth, and total numbers of senses in the seventh. The numbers of meanings obtained with the different measures are compared, including dictionary meanings, subjects’ raw numbers of meanings, and subjects’ average and total numbers of linguistic senses. Paired-samples t-tests suggest that all but three pairs are significantly different from one another (p < .01). Table 2 gives the t values, significance levels, and degrees of freedom for each pair. The three pairs of meaning measures that do not differ from one another are: RCD and the total numbers of senses (p = .359), GD and subjects’ raw numbers of meanings (p = .164), and WCED and subjects’ average numbers of senses (p = .143). Most of the measures differ from one another, which demonstrates that different measures often bring forth considerably different results. Different as most of the measures are from one another, further examination of the correlations among them shows that all the measures are significantly correlated (p < .01). Table 3 shows the correlations among these measures. Dictionaries vary in the enumeration of word meanings. The correlations between dictionary numbers of meanings and all other meaning measures vary from .738 to .461. The numbers of meanings in the RCD are relatively
Table 2. P-values, t-values, and degrees of freedom among different measures of number of meanings in Experiments 1a and 1b

                     GD                RCD               WCED              Ss’ raw meanings    Ss’ average senses   Ss’ total senses
GD                   –
RCD                  t(150)=-5.46 **   –
WCED                 t(141)=3.68 **    t(176)=8.33 **    –
Ss’ raw meanings     t(150)=1.40       t(190)=5.61 **    t(176)=3.77 **    –
Ss’ average senses   t(150)=3.32 *     t(190)=7.19 **    t(176)=1.47       t(199)=-10.90 **    –
Ss’ total senses     t(150)=-3.65 **   t(190)=0.92       t(176)=7.72 **    t(199)=-7.16 **     t(199)=-9.87 **      –

* p < .01   ** p < .001
Note: GD = Gwoyeuryhbaw Dictionary 國語日報辭典; RCD = Revised Chinese Dictionary 重編國語辭典; WCED = The Warmth Modern Chinese-English Dictionary 旺文現代漢英辭典
Table 3. Correlations among the different measures of numbers of meanings in Experiments 1a and 1b

                     GD       RCD      WCED     Ss’ raw meanings   Ss’ average senses   Ss’ total senses
GD                   1.000
RCD                  0.738*   1.000
WCED                 0.568*   0.606*   1.000
Ss’ raw meanings     0.645*   0.674*   0.514*   1.000
Ss’ average senses   0.634*   0.665*   0.471*   0.949*             1.000
Ss’ total senses     0.599*   0.616*   0.461*   0.793*             0.816*               1.000

* p < .01
Note: GD = Gwoyeuryhbaw Dictionary 國語日報辭典; RCD = Revised Chinese Dictionary 重編國語辭典; WCED = The Warmth Modern Chinese-English Dictionary 旺文現代漢英辭典
better correlated with subjects’ responses than those in WCED. RCD thus seems to be a dictionary that is closer to subjects’ semantic knowledge. The t-tests also showed no significant difference between RCD and the total numbers of linguistic senses (p = .359). In summary, we find that the different measurements of the numbers of meanings among Chinese nominals produce significantly different numbers of meanings. These numbers are, however, significantly correlated on a relative number-of-meanings scale. That is, words with many meanings have greater numbers of meanings in all measurements. In Section 4, we will examine whether such patterns can be found when measuring numbers of meanings among English words.
4. Experiment 2: Comparing Different Meaning Measurements in English
Are the results of Experiment 1 to be found in linguistic data other than those of Chinese? To investigate the different measures cross-linguistically, we conducted a similar experiment on English data, using definitions listed in two English dictionaries and the data collected by Azuma and Van Orden (1997).
4.1 Subjects, materials, and procedures
Azuma and Van Orden (1997) collected meanings from one hundred introductory psychology students at Arizona State University, who were all native speakers of English. Twenty participants provided meanings for each word. Sixty-nine English words were taken from the stimulus items used in the experiments of Azuma and Van Orden (1997). These words were checked for the numbers of lexical entries in the third edition of American Heritage Dictionary of the English Language (1992), and for the numbers of lexical entries and the numbers of semantic entries in Webster’s Third New International Dictionary of the English Language, Unabridged (1981). In addition to dictionary meanings, we
compared subjects’ raw, average, and total numbers of meanings.9 The average and total meanings were derived by matching subjects’ meanings with the definitions in Webster’s New World Dictionary (1980). Those not found in the dictionary were not included. The full list of English data and numbers of meanings is given in Appendix 2. These data represent the dictionary meanings, subjects’ raw intuition, and subjects’ meanings matched with dictionary meanings.
4.2 Results
All the measurements are significantly different from one another (p < .001), except the numbers of lexical entries listed in Webster’s Third New International Dictionary of the English Language and subjects’ total numbers of meanings (p = .088). This finding is similar to the results found for the Chinese data. Table 4 gives the matrix of paired-samples t-tests. Though different measures render rather different results, most of the numbers of meanings are correlated with one another. Table 5 shows that, except for the meanings found in American Heritage Dictionary, all other meaning measurements are significantly correlated with one another. The numbers of semantic entries in the unabridged Webster’s Third New International Dictionary of the English Language showed the highest correlations with the numbers of meanings provided by language users. They also correlated with the numbers of lexical entries found in the same dictionary, which suggests that when a word is associated with more lexical entries it is also more likely to be associated with more semantic entries. The numbers of meanings (raw, average, and total) provided by subjects are most highly correlated with one another.
9 Tamiko Azuma provided us with the raw, average, and total numbers of meanings of the English data, part of which were published in Azuma (1996) and Azuma and Van Orden (1997).
Table 4. P-values, t-values, and degrees of freedom among different measures of number of meanings in Experiment 2

                       American Heritage   Webster’s 1              Webster’s 2       Ss’ raw meanings   Ss’ average meanings   Ss’ total meanings
American Heritage      –
Webster’s 1            t(68)=-13.48 **     –
Webster’s 2            t(68)=-12.10 **     t(68)=-11.11 **          –
Ss’ raw meanings       t(68)=-4.64 **      t(68)=8.59 **            t(68)=11.89 **    –
Ss’ average meanings   t(68)=-4.48 **      t(68)=8.67 **            t(68)=-4.48 **    t(68)=4.54 **      –
Ss’ total meanings     t(68)=-10.69 **     t(68)=-1.73 (p = .088)   t(68)=11.23 **    t(68)=-10.19 **    t(68)=-10.27 **        –

** p < .001
Note: American Heritage = lexical entries in The American Heritage Dictionary of the English Language (1992); Webster’s 1 = lexical entries in Webster’s Third New International Dictionary of the English Language, Unabridged (1981); Webster’s 2 = semantic entries in Webster’s Third New International Dictionary of the English Language, Unabridged (1981)
Table 5. Correlations of different measures of the English data in Experiment 2

                       American Heritage   Webster’s 1   Webster’s 2   Ss’ raw meanings   Ss’ average meanings   Ss’ total meanings
American Heritage      1.000
Webster’s 1            0.719**             1.000
Webster’s 2            0.109               0.440**       1.000
Ss’ raw meanings       0.043               0.266*        0.461**       1.000
Ss’ average meanings   0.042               0.272*        0.459**       0.997**            1.000
Ss’ total meanings     0.090               0.301*        0.620**       0.621**            0.623**                1.000

* p < .05   ** p < .01
These results suggest that choosing different English dictionaries for reference may also lead to very different results. The unabridged version of Webster’s Dictionary showed more resemblance to the data provided by participants; however, a more recently published and learner-oriented abridged dictionary like the American Heritage showed less similarity to both the unabridged dictionary and the meanings provided by subjects.
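The significance markers in Table 5 can be checked against the standard t transformation of a correlation coefficient, t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom. The sketch below assumes n = 69 words for every pair (as implied by t(68) in Table 4) and uses approximate two-tailed critical values for 67 degrees of freedom read from standard t tables (about 2.00 at the .05 level and 2.65 at the .01 level). The function names and these rounded cut-offs are assumptions made here, not values given in the chapter.

```python
import math

N_WORDS = 69                   # implied by t(68) in Table 4
CRIT_05, CRIT_01 = 2.00, 2.65  # approximate two-tailed critical values, df = 67


def t_from_r(r, n=N_WORDS):
    """t statistic for testing a Pearson correlation r from n pairs against zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def stars(r, n=N_WORDS):
    """Return the significance marker implied by the approximate cut-offs."""
    t = abs(t_from_r(r, n))
    if t > CRIT_01:
        return "**"
    if t > CRIT_05:
        return "*"
    return ""


# A few correlations taken from Table 5.
for label, r in [
    ("American Heritage vs Webster's 2", 0.109),
    ("Webster's 1 vs Ss' raw meanings", 0.266),
    ("Webster's 1 vs Webster's 2", 0.440),
]:
    print(label, round(t_from_r(r), 2), stars(r))
```

With these assumptions the three sample correlations come out as non-significant, significant at .05, and significant at .01 respectively, in line with the markers printed in Table 5.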
5. General Discussion
In Experiment 1, we examined the relationship among dictionary meanings, linguistic senses, and participants’ semantic intuition for Chinese disyllabic words. Most of these different measures of word meanings led to significantly different numbers of meanings for the same set of words. Therefore, researchers who arbitrarily select a dictionary or any other method to measure word meanings, without considering the fundamental differences among these methods, may obtain rather different results that are not suitable for comparison. The significant correlations, however, indicate that even though these measures are very different from one another, they yield quite consistent results on a relative scale. Words with more meanings according to one way of calculation also have relatively more meanings according to another. That is, different measures of a word’s numbers of meanings reflect quite similar patterns, though qualitatively these meanings may be very different in content. The reason for such high correlations among these different methods is that the materials being delimited by these different methods are roughly the same across the board. An ambiguous word like mean, which has three meanings listed in American Heritage Dictionary, has nine meanings listed in Webster’s. A less ambiguous word like chest has only one meaning in American Heritage Dictionary and only two in Webster’s. Whatever the method, a highly ambiguous word is listed with more meanings, and a less ambiguous word likewise has fewer meanings listed in any dictionary. However, these different methods do produce different
numbers of meanings, depending on what is taken as a distinct meaning. Subjects’ meanings are most highly correlated with one another because they are from the same source, namely meanings provided by language users; dictionary meanings, however, differ from the senses held by current users for reasons that have been addressed in Section 2. The correlations between numbers of senses and subjects’ raw numbers of meanings (r = 0.949 and 0.793) suggest that language users are rather conscious of senses as a salient semantic level. It is noteworthy that the scales of meaning estimation differ across measures. The numbers of dictionary meanings are generally higher than the average numbers of linguistic senses. That is, dictionaries cut the meanings of the same lexical items into smaller pieces and include many that are not in a speaker’s active semantic consciousness. Figure 1 illustrates this point. Using different measures is like using different scales to cut pieces of paper of different lengths. The smaller the scale, the more pieces we get. Though the same paper might be cut into different numbers of pieces by different scales, the longer paper (words with more meanings) is generally cut into more pieces than the shorter one (words with fewer meanings). The English data show a similar trend. Certain dictionary meanings (such as those in American Heritage Dictionary) show less resemblance to the results of other measures, while meanings obtained from subjects show higher correlations with one another. Relying on lexical or semantic entries in the dictionaries also brings about very different results. The numbers of semantic entries are more highly correlated with subjects’ meanings than the numbers of lexical entries are. Overall, the results suggest that there is a distinction between dictionary meanings and the meanings obtained from subjects. They differ not only in the way meanings are calculated but also with regard to content.
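The paper-cutting analogy can be made concrete with a toy computation. The strip lengths and the two scales below are invented for illustration, and the function name is hypothetical. The sketch shows that a finer scale yields more pieces for every strip, while the ordering of strips from fewest to most pieces is preserved under both scales.

```python
import math


def pieces(length, scale):
    """Number of pieces obtained by cutting a strip of `length` every `scale` units."""
    return math.ceil(length / scale)


lengths = [2, 4, 6, 9]                    # invented strip lengths ("words")
coarse = [pieces(x, 3) for x in lengths]  # a coarse measure
fine = [pieces(x, 1) for x in lengths]    # a fine-grained measure

print(coarse)  # [1, 2, 2, 3]
print(fine)    # [2, 4, 6, 9]

# The fine scale never gives fewer pieces, and neither scale reverses the order.
print(all(f >= c for f, c in zip(fine, coarse)))
print(coarse == sorted(coarse) and fine == sorted(fine))
```

The absolute counts differ between the two scales, as the t-tests show for the actual measures, but the relative ordering is shared, which is what the correlations capture.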
Content-wise, meanings fluctuate in the history of a language. Some meanings in the dictionary are no longer in use, while language users are constantly developing novel uses of existing words. A closer look at the meanings listed in different dictionaries and those generated by language users gives us insight
Figure 1. Meaning delimitation using different methods.
on how similar and different they are. This also illustrates semantic changes in short-term language history. For example, the word danwei (單位) has three meanings in both subjects’ total number of senses and in the dictionary RCD. The three senses given by the subjects are (1) the basic unit for calculation, (2) an official unit or department in an institution, and (3) a single seat. The dictionary listed the first two meanings; the third meaning was not ‘a single seat’ but ‘the seats for monks’, a meaning rarely used anymore. The word menkan (門檻) also has three dictionary meanings: (1) a piece of wood or stone placed beneath a door, a doorsill, (2) a method or means of doing something, and (3) the capability of finding a method, among which only the first one was given by the subjects. Subjects provided a second sense not found in the dictionary: ‘the minimum, the lowest bounds permitted’, which is a metaphorical extension of the first sense. Examples like these illustrate the change of lexical meanings in progress. What used to be important meanings were included in dictionaries edited some time ago. These meanings may no longer be available to language users today. The current semantic knowledge of a word may differ both in content and in frequency from dictionary meanings. This is especially important to researchers interested in psychosemantic research. Meanings should be extracted from real language users if
our goal is to access the current semantic competence of the subjects. Additional supporting evidence for the use of linguistic senses comes from our experiments on ambiguity advantage (Lin, 1999; Lin and Ahrens, 2000). We ran lexical decision tasks (in which participants were instructed to decide if the stimuli they saw were words or non-words) using the same set of Chinese data as in this research. In three experiments differing in the timing of stimulus presentation, we consistently found the effect of ambiguity advantage: words with many linguistic senses are recognized faster than words with only one sense. Factors such as sense frequency and sense relatedness were controlled for. This suggests that senses as defined by our linguistic theory are psychologically valid, since the time subjects take to access a word is sensitive to the number of senses that a word has. The results converge with our finding that subjects’ raw numbers of meanings and the linguistic senses are highly correlated.
Acknowledgments
We are grateful to City University of Hong Kong for research grant 9010001 to the first author during his visiting scholarship, and to the National Science Council of Taiwan for a research grant to the second author (NSC88-2411-H-002-051-M8). We also thank the audience at the First Workshop on Language Acquisition, Change, and Evolution in 2001 for valuable discussions. We are thankful to Tamiko Azuma for providing the English raw numbers of meanings, making the comparison between Chinese and English data possible. All remaining errors are our own.
References
Ahrens, Kathleen, Chang, Lili, Chen, K. and Huang, Chu-Ren. (1998) Meaning representation and meaning instantiation for Chinese nominals. Computational Linguistics and Chinese Language Processing 3, 45–60.
Azuma, T. (1996) Familiarity and relatedness of word meanings: Ratings for 110 homographs. Behavior Research Methods, Instruments and Computers 28, 109–24.
Azuma, T. and Van Orden, G. C. (1997) Why safe is better than fast: The relatedness of a word’s meanings affects lexical decision times. Journal of Memory and Language 36, 484–504.
Borowsky, R. and Masson, M. E. (1996) Semantic ambiguity effects in word identification. Journal of Experimental Psychology: Learning, Memory, and Cognition 22, 63–85.
Britten, B. K. (1978) Lexical ambiguity of words used in English text. Behavior Research Methods and Instrumentation 10, 1–7.
Chinese Knowledge Information Processing Group. (1993) The Most Frequent Nouns in Journal Chinese and Their Classification: Corpus-Based Research. Taipei: Institute of Information Science, Academia Sinica.
Forster, K. I. and Bednall, E. S. (1976) Terminating and exhaustive search in lexical access. Memory and Cognition 4, 53–61.
Gernsbacher, M. A. (1984) Resolving 20 years of inconsistent interactions between lexical familiarity and orthography, concreteness and polysemy. Journal of Experimental Psychology: General 113, 256–81.
Gove, Philip Babcock (Ed.) (1976) Webster’s Third New International Dictionary, Unabridged. Springfield, MA: Merriam-Webster Inc.
Gove, Philip Babcock (Ed.) (1981) Webster’s Third New International Dictionary of the English Language, Unabridged. Springfield, MA: Merriam-Webster Inc.
Gwoyeuryhbaw Group (Ed.) (1989) Gwoyeuryhbaw Dictionary. Taipei: Gwoyeuryhbaw Press.
Hino, Y. and Lupker, S. J. (1996) Effects of polysemy in lexical decision and naming: An alternative to lexical access accounts. Journal of Experimental Psychology: Human Perception and Performance 22, 1331–56.
Huang, Shuanfan. (1994) Chinese as a metonymic language. In Matthew Chen and Ovid J. L. Tzeng (Eds.) In Honor of William S.-Y. Wang: Interdisciplinary Studies of Language and Language Change (pp. 223–52). Taipei: Pyramid Press.
Hue, C-W., Yen, N-S., Just, M. A. and Carpenter, P. A. (1994) Studies of homographs in Chinese. In H. W. Chang, J. T. Huang, C-W. Hue and Ovid J. L. Tzeng (Eds.) Advances in the Study of Chinese Language Processing, Volume 1 (pp. 375–81). Taipei: Department of Psychology, National Taiwan University.
Jastrzembski, J. E. (1981) Multiple meanings, number of related meanings, frequency of occurrence, and the lexicon. Cognitive Psychology 13, 278–305.
Jastrzembski, J. E. and Stanners, R. F. (1975) Multiple word meanings and lexical search speed. Journal of Verbal Learning and Verbal Behavior 14, 534–537.
Joordens, S. and Besner, D. (1994) When banking on money is not (yet) money in the bank: Explorations in connectionist modeling. Journal of Experimental Psychology: Learning, Memory, and Cognition 20, 1051–62.
Kellas, G., Ferraro, F. R. and Simpson, G. B. (1988) Lexical ambiguity and the time-course of attentional allocation in word recognition. Journal of Experimental Psychology: Human Perception and Performance 14, 601–09.
Li, Dian-Kui (Ed.) (1997) Revised Chinese Dictionary [Computerized CD-ROM Version]. Taipei: Ministry of Education, ROC.
Lin, Chienjer Charles. (1999) Multiple Senses of Mandarin Chinese Nominals: Implications for Lexical Access. Graduate Program in Linguistics, National Chengchi University.
Lin, Chienjer Charles and Ahrens, Kathleen. (2000) Calculating the number of senses: Implications for ambiguity advantage effect during lexical access. In Proceedings of the Seventh International Symposium of Chinese Languages and Linguistics (IsCLL-7) (pp. 141–155). National Chung-Cheng University, Jiayi.
Millis, M. L. and Button, S. B. (1989) The effect of polysemy on lexical decision time: Now you see it, now you don’t. Memory and Cognition 17, 141–47.
Onifer, W. and Swinney, David. (1981) Accessing lexical ambiguities during sentence comprehension: Effects of frequency-of-meaning and contextual bias. Memory and Cognition 9, 225–36.
Parks, R., Ray, J. and Bland, S. (1998) Wordsmyth English Dictionary-Thesaurus. [ONLINE]. (available at http://www.wordsmyth.net/) Accessed February 1, 1999. University of Chicago.
Pickett, Joseph P. (Ed.) (1992) The American Heritage Dictionary of the English Language. Boston: Houghton Mifflin Company.
Rodd, Jennifer, Gaskell, Gareth and Marslen-Wilson, William. (2002) Making sense of semantic ambiguity: Semantic competition in lexical access. Journal of Memory and Language 46, 245–66.
Rubenstein, H., Garfield, L. and Millikan, J. A. (1970) Homographic entries in the internal lexicon. Journal of Verbal Learning and Verbal Behavior 9, 487–94.
Rubenstein, H., Lewis, S. S. and Rubenstein, M. A. (1971) Homographic entries in the internal lexicon: Effects of systematicity and relative frequency of meanings. Journal of Verbal Learning and Verbal Behavior 10, 57–62.
Shannon, Scott (Ed.) (1980) Webster’s New World Dictionary. New York: Simon & Schuster.
Stein, J. (Ed.) (1967) Random House Dictionary of the English Language, Unabridged. New York: Random House.
Traugott, Elizabeth Closs and Dasher, Richard B. (2001) Regularity in Semantic Change. Cambridge, UK: Cambridge University Press.
Warmth Press Group (Ed.) (1997) The Warmth Modern Chinese-English Dictionary. Taipei: The Warmth Press.
Appendix 1. Numbers of meanings/senses of 200 Chinese nouns.

                    Dictionary meanings    Linguistic senses
Word                GD     RCD    WCED     Ss’ raw Mns   Average   Total
經典 jingdian       2      3      3        1.7           1.6       3
前妻 qianqi         –      1      2        1.05          1.05      1
待遇 daiyu          2      2      2        1.7           1.7       2
油價 youjia         –      –      –        1.4           1.6       2
兩極 liangji        2      3      2        2             1.85      3
車速 chesu          –      1      1        1.05          1         1
中央 zhongyang      2      3      2        2             1.95      3
品質 pinzhi         1      1      2        1.3           1         1
傢伙 jiahuo         2      3      –        2.15          1.95      2
螞蟻 mayi           1      1      1        1.5           1.05      1
元宵 yuanxiao       2      2      2        1.8           1.75      2
菜餚 caiyao         –      1      1        1.05          1         1
人馬 renma          –      3      1        1.35          1.25      2
言論 yanlun         1      1      1        1.25          1         1
花瓶 huaping        2      2      1        2             1.95      2
死屍 sishi          1      1      1        1.25          1.2       1
後台 houtai         3      2      2        1.85          1.85      2
書本 shuben         1      1      1        1.05          1         1
回音 huiyin         2      1      3        1.85          1.8       2
年次 nianci         –      –      –        1.05          1         1
點滴 diandi         1      3      2        2.05          1.9       2
眼淚 yanlei         1      1      1        1.45          1.15      1
悲劇 beijyu         2      2      1        1.9           1.7       2
慣例 guanli         1      1      1        1.5           1         1
裂痕 liehen         2      2      1        1.8           1.8       2
雜糧 zaliang        1      1      1        1.05          1.05      1
角度 jiaodu         2      2      2        2             1.8       2
工資 gongzi         1      1      1        1.15          1.05      1
手腕 shouwan        2      2      1        1.9           1.9       2
沙灘 shatan         1      1      1        1.1           1.05      1
黃金 huangjin       2      3      1        2.3           2.05      3
法官 faguan         1      2      1        1.2           1.05      1
跳板 tiaoban        2      2      2        1.95          1.9       3
居所 jusuo          –      1      –        1             1         1
捷徑 jiejing        2      2      1        1.9           1.85      2
遺址 yizhi          –      1      1        1.05          1         1
籌碼 chouma         2      2      1        1.75          1.7       2
清晨 qingchen       1      1      1        1.15          1         1
牛郎 niulang        3      4      –        2.4           2.45      4
勁敵 jindi          1      1      –        1             1         1
杜鵑 dujuan         2      2      2        2             2         3
睡眠 shuimian       1      1      1        1.05          1         1
指標 zhibiao        2      1      1        1.8           1.65      2
深夜 shenye         –      1      –        1.05          1         1
傳奇 chuanqi        2      4      2        1.75          1.6       2
國王 guowang        3      3      1        1.2           1.15      1
鴨蛋 yadan          2      2      1        2.2           2.05      2
墨鏡 mojing         1      1      1        1.25          1.1       1
出路 chulu          2      3      2        1.85          1.8       2
常態 changtai       1      2      1        1.4           1         1
丈夫 zhangfu        2      3      2        1.8           1.65      2
瓦斯 wasi           1      3      1        1.95          1.75      2
公安 gongan         1      3      1        1.65          1.65      2
定存 dingcun        –      –      –        1.05          1.05      2
明日 mingri         1      1      2        1.65          1.65      3
心願 xinyuan        1      1      1        1.05          1         1
呼聲 husheng        –      2      1        2.2           2.2       4
校友 xiaoyou        –      1      1        1.2           1.15      2
細胞 xibao          1      1      1        1.55          1.4       2
銀牌 yinpai         –      2      1        1.5           1.5       3
龍頭 longtou        3      4      2        1.8           1.8       4
分數 fenshu         2      3      2        1.95          1.85      2
模型 moxing         1      1      2        1.4           1.35      2
偶像 ouxiang        2      2      1        1.95          1.6       2
水災 shuizai        1      1      1        1.2           1.2       2
畫室 huashi         –      1      1        1.3           1.2       2
疑點 yidian         –      1      1        1.3           1         1
美女 meinyu         1      1      –        1.45          1.15      2
房租 fangzu         1      1      –        1             1         1
肥料 feiliao        1      1      1        1.2           1         1
口氣 kouqi          2      4      3        1.8           1.5       2
平原 pingyuan       1      1      1        1.4           1.35      2
果實 guoshi         2      2      2        1.95          1.85      2
財富 caifu          1      1      1        1.35          1.35      2
斷層 duanceng       –      1      1        2.05          2         3
體能 tineng         1      1      –        1             1         1
泡沫 paomo          1      1      1        1.9           1.55      2
鞭炮 bianpao        1      1      2        1.05          1.05      1
份量 fenliang       3      3      1        2.45          2.15      3
竹林 zhulin         –      2      1        1.4           1.3       2
空檔 kongdang       –      3      1        1.7           1.6       2
村落 cunluo         1      1      1        1.05          1         1
小弟 xiaodi         2      3      –        2.35          2.35      5
石塊 shikuai        –      1      1        1.3           1.05      1
半天 bantian        3      3      2        1.85          1.8       2
時光 shiguang       1      1      2        1.3           1         1
臉色 lianse         –      2      2        2.15          2.05      2
邦交 bangjiao       1      1      1        1.1           1.1       1
長短 changduan      3      4      3        1.95          1.9       3
節慶 jieqing        –      –      –        1.4           1         1
爵士 jueshi         1      1      2        2.1           1.95      2
魔術 moshu          1      1      1        1.45          1.4       2
軍機 junji          2      2      2        1.75          1.7       2
往事 wangshi        1      1      1        1.05          1         1
便衣 bianyi         2      2      2        1.8           1.45      2
信譽 xinyu          –      2      1        1.15          1         1
黑箱 heixiang       –      –      –        2.05          2.05      2
感觸 ganchu         1      1      1        1.15          1.15      1
低潮 dichao         –      2      1        1.95          1.9       3
雙腳 shuangjiao     –      –      –        1.25          1.1       1
世界 shijie         3      4      1        1.9           1.75      2
設備 shebei         2      3      1        1.35          1.1       1
商 場 shangchang
2
2
1
1.8
1.7
2
歲 月 suiyue
1
1
1
1.3
1.05
1
儀 表 yibiao
3
3
1
1.75
1.7
2
君 主 jyunzhu
–
1
1
1.25
1.05
1
先 生 xiansheng
6
7
4
2.95
2.75
4
營 運 yingyun
1
1
–
1.35
1.25
2
公 公 gonggong
3
5
3
2.95
2.9
4
通 則 tongze
1
1
1
1.05
1.05
1
同 志 tongzhi
2
4
1
2.05
2.35
3
作 物 zuou
1
1
1
1.25
1.1
1
單 位 danwei
2
3
2
2
2
3
政 府 zhengfu
1
2
1
1.1
1
1
地 方 difang
5
5
4
2
1.75
3
其 他 qita
1
1
1
1.2
1
1
壓 力 yali
2
4
2
1.95
1.85
2
股 票 gupiao
1
1
1
1.2
1
1
家 教 jiajiao
3
2
1
1.85
1.7
3
外 號 waihau
1
1
1
1.1
1.05
1 1
少 爺 shaoye
2
2
2
2.2
1.65
3
請 帖 qingtie
1
1
1
1
1
意 思 yisi
5
5
5
2.5
2
5
用 途 yongtu
1
1
1
1.1
1
1
麻 雀 maque
2
2
1
1.65
1.35
2
茶 壺 chahu
–
1
1
1.5
1.3
1
把 柄 babing
2
5
1
1.7
1.65
2
病 菌 bingjyun
1
1
1
1.15
1.15
1
排 骨 paigu
2
2
1
2
2.7
2
青 銅 qingtong
2
2
1
1.35
1.05
1
學 院 xueyuan
–
3
1
1.25
1.25
3
證 件 zhengjian
–
1
1
1.05
1
1
搖 籃 yaolan
2
2
1
1.8
1.85
2
圈 套 qyuantao
1
1
1
1.55
1.35
2
旋 風 xuanfong
1
3
1
1.95
1.75
2
泳 裝 yongzhuang
–
1
–
1.1
1
1
How Many Meanings Does a Word Have?
Word
Dictionary meanings G R W D C C D E D
Linguistic Senses Ss’ Aver- Total raw age Mns
Word
Dictionary meanings G R W D C C D E D
Linguistic senses Ss’ Aver- Total raw age Mns
曲 線 qyuxian
2
2
1
1.95
1.9
3
牙 科 yake
–
1
1
1.6
1.35
會 計 kuaiji
2
2
2
1.65
1.35
2
專 題 zhuanti
–
1
1
1
1
1
東 西 dongxi
2
4
4
2.55
2.3
3
期 限 qixian
1
1
1
1.3
1.2
2
2
架 子 jiazi
2
6
4
2.1
2.1
3
粉 狀 fenzhuang
–
–
–
1.25
1.1
1
格 局 gejyu
1
2
1
1.45
1.45
3
聯 考 liankao
–
1
–
1.15
1
1
火 鍋 huoguo
3
3
1
1.65
1.5
2
步 驟 buzou
1
2
1
1.1
1.05
1
禮 拜 libai
2
3
4
2.2
2.1
2
樓 梯 louti
1
1
1
1.05
1.1
1
算 盤 suanpan
2
2
1
1.95
1.8
2
面 容 mianrong
–
1
1
1.6
1.55
3
精 神 jingshen
3
5
4
2
1.95
3
市 民 shimin
1
1
1
1.1
1
1
飯 碗 fanwan
2
2
2
2
1.95
2
泥 沼 nizhao
–
2
1
1.7
1.7
2
老 大 laoda
3
4
3
2.65
2.4
3
當 晚 dangwan
1
1
–
1.1
1
1
背 景 beijing
4
4
1
2.1
2
2
勤 務 qinwu
1
1
1
1.05
1
1
藍 圖 lantu
2
2
1
1.7
1.55
2
坦 克 tanke
–
1
1
1.5
1.05
1
調 幅 tiaofu
2
1
1
1.45
1.45
2
布 條 butiao
–
–
–
1.25
1.25
2
輪 廓 lunkuo
2
2
1
1.65
1.6
3
雨 傘 yusan
1
1
1
1.05
1
1
逃 兵 taobing
2
2
1
1.9
1.8
3
傷 勢 shangshi
–
1
1
1
1
1
門 檻 menkan
3
3
1
1.85
1.85
2
範 疇 fanchou
1
1
1
1.25
1
1 1
陰 影 yinying
2
2
2
2.1
1.95
2
弊 端 biduan
1
1
1
1.1
1.05
靈 魂 linghun
3
3
1
2.4
2.3
5
鋼 筋 gangjin
1
1
1
1.1
1.1
1
師 父 shifu
3
3
2
1.85
1.75
3
新 片 xinpian
–
–
–
1.8
1.65
3 1
下 文 xiawen
2
2
2
1.65
1.6
2
真 理 zhenli
1
1
1
1.3
1.1
味 道 weidao
1
4
1
2.25
1.85
2
服 裝 fuzhuang
–
1
1
1.45
1
1
磁 性 cixing
2
2
1
1.9
1.85
2
獵 槍 lieqiang
–
1
1
1
1
1
惡 夢 emong
–
1
1
1.75
1.75
2
鬧 鐘 naozhong
1
1
1
1.2
1.1
1
妹 妹 meimei
1
3
1
2.37
2.15
3
喉 嚨 houlong
1
1
1
1.2
1.05
1
綠燈 lyudeng
–
2
2
2.1
2.1
3
歸 宿 guisu
1
3
1
1.45
1.5
3
嘴巴 zuiba
1
2
2
1.95
1.45
2
菁 英 jingying
–
1
–
1.15
1.15
1
八 卦 bagua
1
1
1
2.2
2.15
4
黨 魁 dangkui
–
1
1
1.05
1.05
1
假 名 jiaming
–
5
2
1.15
1.15
2
告 示 gaoshi
–
2
1
1.55
1.35
2
漏 洞 loudong
2
2
2
1.9
1.75
2
貴 賓 guibin
2
2
1
1.65
1.5
2
江 湖 jianghu
5
3
3
1.85
1.85
3
程 式 chengshi
2
3
2
1.55
1.5
3
頻 率 pinlyu
1
2
1
2.05
2.05
4
班 級 banji
–
2
1
1.15
1.05
1
Note: GD = Gwoyeuryhbaw Dictionary (1989) 國 語 日 報 辭 典 RCD = Revised Chinese Dictionary (1997) 重 編 國 語 辭 典 WCED = The Warmth Modern Chinese-English Dictionary (1997) 旺文現代漢英辭典
463
464
Appendix 2 English data of dictionary meanings, subjects’ raw numbers of meanings, subjects’ average numbers of meanings, and subjects’ total numbers of meanings (part of the data can be found in Azuma and Van Orden, 1997)

Columns: Word | Dictionary: AH, W1, W2 | Subjects: Raw, Average, Total

Word | AH | W1 | W2 | Raw | Average | Total
ball | 2 | 3 | 12 | 2 | 2 | 4
bark | 3 | 5 | 16 | 2.25 | 2.25 | 4
bill | 3 | 7 | 25 | 2.5 | 2.5 | 6
blank | 2 | 4 | 28 | 2.05 | 2 | 6
bomb | 1 | 2 | 9 | 2.35 | 2.35 | 4
bound | 4 | 7 | 21 | 2.65 | 2.65 | 9
calf | 2 | 2 | 7 | 2 | 2 | 2
card | 2 | 4 | 20 | 2.5 | 2.4 | 3
cast | 1 | 5 | 33 | 2.95 | 2.95 | 8
charm | 1 | 4 | 13 | 2 | 2 | 4
check | 1 | 5 | 43 | 3.45 | 3.45 | 9
chest | 1 | 2 | 7 | 2.05 | 2 | 2
chip | 3 | 6 | 24 | 2.8 | 2.8 | 6
clean | 1 | 4 | 17 | 2.15 | 2.1 | 7
club | 1 | 3 | 11 | 2.9 | 2.9 | 6
coat | 1 | 2 | 9 | 2.05 | 2.05 | 4
cover | 1 | 2 | 34 | 2.55 | 2.45 | 10
cross | 1 | 5 | 52 | 2.55 | 2.55 | 6
date | 2 | 3 | 20 | 2.95 | 2.85 | 4
draw | 1 | 2 | 19 | 2.8 | 2.8 | 10
drink | 1 | 2 | 14 | 1.75 | 1.7 | 3
dull | 1 | 2 | 12 | 2.3 | 2.3 | 4
dump | 1 | 5 | 22 | 2.6 | 2.6 | 6
dust | 1 | 2 | 23 | 1.7 | 1.7 | 3
faint | 1 | 5 | 12 | 1.85 | 1.85 | 4
fast | 2 | 7 | 23 | 1.8 | 1.75 | 3
file | 3 | 11 | 24 | 2.9 | 2.9 | 8
fine | 3 | 7 | 21 | 3.3 | 3.3 | 9
firm | 2 | 4 | 10 | 2.05 | 2 | 4
floor | 1 | 4 | 16 | 2.25 | 2.25 | 6
game | 2 | 4 | 9 | 1.95 | 1.95 | 4
hide | 3 | 5 | 11 | 1.75 | 1.75 | 2
land | 1 | 4 | 14 | 2.45 | 2.45 | 4
limit | 1 | 2 | 13 | 1.55 | 1.5 | 3
lock | 2 | 5 | 19 | 2.3 | 2.3 | 6
mean | 3 | 9 | 33 | 2.25 | 2.25 | 3
mine | 2 | 5 | 18 | 2.25 | 2.25 | 4
page | 2 | 4 | 11 | 2.35 | 2.35 | 3
park | 1 | 2 | 14 | 1.8 | 1.8 | 2
pitch | 2 | 4 | 38 | 2.1 | 2.1 | 7
plot | 1 | 3 | 15 | 2.2 | 2.2 | 6
pound | 3 | 6 | 15 | 2.55 | 2.55 | 4
rake | 3 | 8 | 32 | 1.5 | 1.5 | 2
rare | 2 | 5 | 8 | 2.1 | 2.05 | 3
rich | 1 | 2 | 13 | 2.3 | 2.3 | 4
ring | 2 | 5 | 59 | 2.95 | 2.95 | 9
round | 2 | 6 | 62 | 2 | 2 | 13
rule | 1 | 2 | 13 | 2.05 | 2 | 3
safe | 1 | 5 | 12 | 2.1 | 2.1 | 3
scale | 3 | 11 | 47 | 2.9 | 2.9 | 9
seal | 2 | 5 | 17 | 2.45 | 2.45 | 4
shape | 1 | 2 | 21 | 2.3 | 2.2 | 3
share | 2 | 2 | 11 | 2.25 | 2.25 | 6
sharp | 1 | 4 | 16 | 2.55 | 2.55 | 7
ship | 1 | 4 | 22 | 2 | 2 | 2
shop | 1 | 2 | 10 | 2.2 | 2.2 | 3
slip | 3 | 8 | 50 | 2.85 | 2.85 | 8
smoke | 1 | 2 | 26 | 2.2 | 2.2 | 8
soil | 3 | 6 | 18 | 1.85 | 1.85 | 2
sound | 4 | 8 | 30 | 2.05 | 2 | 8
spoke | 2 | 4 | 13 | 1.95 | 1.95 | 2
stick | 1 | 6 | 44 | 2.7 | 2.7 | 8
stock | 1 | 5 | 74 | 3.1 | 3.1 | 7
story | 2 | 3 | 14 | 2.2 | 2.2 | 3
strip | 2 | 7 | 54 | 2.75 | 2.7 | 7
tire | 3 | 8 | 16 | 1.9 | 1.85 | 3
trap | 3 | 6 | 25 | 2.15 | 2.05 | 4
trip | 1 | 3 | 28 | 2.6 | 2.6 | 4
watch | 1 | 3 | 25 | 2.15 | 2.1 | 4

Note: AH = lexical entries in The American Heritage Dictionary of the English Language (1992)
W1 = lexical entries in Webster’s Third New International Dictionary of the English Language, Unabridged (1981)
W2 = semantic entries in Webster’s Third New International Dictionary of the English Language, Unabridged (1981)
15
Typology and Complexity¹
Randy J. LaPolla
La Trobe University
1. Complexity in What Sort of System?
For the Workshop I was asked to talk about complexity in language from a typological perspective. My way of approaching this topic was to ask myself some questions, and then see where the answers led. The first one was of course, “What sort of system are we looking at complexity in? What kind of system is language?” There are at least three different kinds of system that we can talk about, and each kind of system is related to a different kind of phenomenon.

The first kind are natural phenomena, like weather systems and living organisms. In these systems you often find evolution towards greater complexity. Of course you can have simplification, but in general you have, at least in the history of evolution, like the evolution of man, greater and greater complexity.

Phenomena of the second kind are the intentionally man-made phenomena, such as the internal combustion engine, and here development can go either way. You can have development toward more complex things like the piston engine itself (earlier types of engines were somewhat simpler), but then we also have simplifications, like the intentionally simplified rotary engine; one of the pluses of the rotary engine is that it has fewer parts, and is an overall simpler system.

Phenomena of the third kind are man-made, but not created with the intention of creating the thing that is produced. Humans act according to goals, but the goals in the case of phenomena of the third kind are not like those in the case of phenomena of the second kind, that is, to create that particular structure or that particular system. It is a more local and personal goal, and the combined activity of all the people attempting to achieve their goals creates that particular phenomenon, like an economy or a path in a field.

1 This paper is an edited transcript of the talk I gave at the Workshop. I would like to thank James Minett for his excellent transcription of my talk.

Figure 1a School and bus stop separated by a field
Figure 1b Students begin crossing the field to get to the bus stop
Figure 1c The grass begins to wear away and a path emerges
Figure 1d The path is recognized and paved

Phenomena of the third kind are often called ‘invisible hand’
phenomena, as it is as if an invisible hand creates the phenomenon. An example is the creation of a path through a field (cf. Mauthner, 1912; Keller, 1994). Let’s say we have two streets separated by a field; there’s a school on one street at one end of the field and a bus stop on the other street on the other side of the field (Figure 1a). When the kids come out of the school they want to go to the bus stop. Their goal is to get to the bus stop, so they try to pick an easy way to get there — they cross the field (Figure 1b). Maybe at first one or two of them cross the field, and some other students see them doing it, and see that the ones who go through the field get to the bus stop faster and easier by going that way through the field, and so they too start doing it; they copy the first students. Then more and more students cross the field in the same way. Over time, the students trying to get to the bus stop start to wear away the grass, so a very rough path develops (Figure 1c). It’s not that somebody said “Let’s form a path.” It’s just that a lot of people tried to find the most efficient way to get to the bus stop from the school, and they ended up walking the same way through the field, trampled the same grass, killed the grass, and created a path. Eventually people start using the path just because it is there, without thinking about whether it is the best way to go through the field. At some point, either out of simple conventionalization or because of some social factor (e.g. attitudes towards preserving the grass that is left), it may become recognized as the “unmarked” way to go through the field and crossing any other way would be considered “marked”. What happens in society often is that a development like this can be recognized and then made official — you pave the path (Figure 1d) — and then it becomes prescriptive. The path thus created is a phenomenon of the third kind. Language is also a phenomenon of the third kind. 
It is not a natural phenomenon, it does not follow the same kind of natural laws; it is based on humans trying to do something, but not trying to create language. Its development is a type of evolution, but it can go toward greater or lesser complexity. Just as with the path, there can also be intentional manipulation of language, such as when we write prescriptive grammars, or standardize languages. There can be
planned economies and planned languages, as when Malay Pidgin was made into Bahasa Indonesia, the national language of Indonesia. In this case, they chose Malay Pidgin rather than Javanese to be the national language because Javanese is more complex than Malay Pidgin. Javanese has multiple politeness registers (five levels of politeness), and this makes it difficult to learn and use, so they chose Malay Pidgin, as they wanted a language that would be easier for everybody to learn and use.
2. Complexity in Different Subsets of Human Conventions
One of the things I want to talk about is complexity in different subsets of human conventions. Language is just one of many types of convention; it’s a tool that has developed, one of many tools that we have developed. Humans do things and, in the process of trying to do something, create systems and tools. One of the many types of tools that we have developed is the type of tools we use for eating. We can have a system of great complexity or a simple system in terms of the way we eat. Take for example the Western formal place setting presented in Figure 2a, which is from a web page² that was set up to tell people how to set a formal place setting at home. In a formal banquet in a restaurant there might be even more forks, or more knives and spoons. Here we’ve got a salad fork, a dinner fork, a soup spoon, a tea spoon, different glasses for different kinds of wine, one glass for water, a serving plate, a bread plate, a soup bowl, a bread knife, another knife, and if steak was being served, a steak knife would also be added. This is a relatively complex system for eating. However, you can also have a relatively simple system for eating, as in Figure 2b, which is only a bowl and a pair of chopsticks. In fact in many of the places where I go to do fieldwork in rural China you don’t even get the bowl, all you get is the chopsticks. In many places in the Philippines and Burma you just use your hands; that’s even simpler, but, of course, that’s not a developed tool. The minimal tool is the chopsticks.

2 http://www.visatablelinen.com/formal.html, Milliken Table Lines & Table Cloths:Table Setting:Formal Dinner Place Settings.
Figure 2a. Western formal dinner place setting
A. Napkin
B. Service plate
C. Soup bowl on a liner plate
D. Bread and Butter Plate with butter knife
E. Water glass
F. Wine glass
G. Wine glass
H. Salad Fork
I. Dinner Fork
J. Dessert Fork
K. Knife
L. Teaspoon
M. Soup Spoon

Figure 2b. Chinese informal place setting
So you can have complexity or the lack of it in different systems within your overall set of conventions. What happens in one system may influence what happens in other systems. For example, cutting up the food before it’s served, as the Chinese do, means it is not necessary to have a knife at the table. In a Western setting we have to have a steak knife when eating steak because the cook has not cut the steak up into bite-size pieces before serving it. In a Chinese setting, the cook has already cut the food up. So the conventions of cooking influence the conventions of eating. There are a lot of other types of conventions that influence each other. For example, the Jingpo people of Yunnan don’t fertilize their crops, and so they don’t save human manure like a lot of other peoples do to use as fertilizer for their crops. And since they don’t save human manure they don’t even build bathrooms, they just go to the woods. Because of this, they don’t have a native word for ‘bathroom’. Their conventions of agriculture influence their conventions of architecture, which in turn influence their conventions of language. There is influence in terms of complexity, as complexity in one system can mean simplification in another, for example complexity in the conventions of food preparation may result in simplicity in the tools that you need to eat with. Now let’s look at a linguistic example. The speakers of the Qiang language (Tibeto-Burman; northern Sichuan) conventionalized the set of orientation marking prefixes on the verb given in (1). (1) Qiang directional prefixes (ʁue ‘throw’)
təʁu    ‘throw up (the mountain)’
zəʁu    ‘throw towards the speaker’
ɦaʁu    ‘throw down (the mountain)’
daʁu    ‘throw away from the speaker’
səʁu    ‘throw down-river’
əʁu     ‘throw inside’
nəʁu    ‘throw up-river’
haʁu    ‘throw outside’
These prefixes (the first syllable of the forms given) are a system for marking the direction or orientation of the action, such as ‘throw up the mountain’, ‘throw down the mountain’, ‘throw down river’, ‘throw up river’. This system has also been extended to marking perfectives, as in (2) and (3), and imperatives, as in (4).

(2) the sə-tɕ-ȵike, ʁuatʂə χuəla-k
    3sg DIR-eat-following bowl wash-go
    ‘S/he finished eating and went to wash the bowl.’

(3) nəs qɑ ə-qɑ lai the: stuɑhɑ tɕhə
    yesterday 1sg DIR-go:1sg time 3sg food/rice eat
    ‘Yesterday when I entered the room, s/he was eating.’

(4) ə-z-nɑ!
    DIR-eat-IMP
    ‘Eat!’

In (2) and (3), the verb in the first clause has the direction prefix because the action was completed, while the verb in the second clause of each example does not have a prefix, as the action is not completed (and the direction of action is not important here). In (4) the directional prefix appears on the verb because it is an imperative clause (see LaPolla, 2003, for details). The point I’m making here is that even within language, once you have conventionalized a system, you can extend its use to marking some other functional domain. In Qiang a kind of marking which originally developed as a system of orientation or direction marking is now used for marking perfectives and imperatives. The complexity in this system now allows for simplicity in other types of marking: you don’t have to develop a separate set of perfective or imperative markers, you just use the same forms that already exist in the language for some other purpose. It can be said that having something in the language that could easily be metaphorically extended to another use encourages the development of the marking also, so it might not just be that it allows for the simplicity of the other system but that it actually encourages the development of that particular use, because you have something that could easily be extended that way.
3. Complex for Whom?
An important question that came up when I was thinking about this topic was, “Complex for whom?” In China, cutting up the food into small pieces makes the job of the cook more complex; the cook has to worry about how he or she is going to cut the food. In fact, in Chinese cooking, one test of a cook is how he or she cuts; in Western cooking, I don’t think they worry so much about cutting, but in Chinese cooking it is very important how you cut things because you have to cut up all the food before you serve it. This makes the job of the cook in China much more complex but it makes the job of the diner much simpler; again, you have the complexity of the cooking job making the eating much easier.

It is the same with language; a simple system of writing or language is less complicated for the writer or the speaker. For example, if you have a writing system that doesn’t have strong conventions about punctuation or a particular set word order, and has a set of other features that are relatively open to speaker or writer choice, this simplicity makes it easier for the writer, who doesn’t need to worry about having to follow some set of prescriptive forms, but it allows ambiguity, which makes it more complicated for the reader. Consider the following attested examples of Chinese writing. In Chinese, an author can choose various orders in which to write. In (5a) the writer wrote from left to right; in (5b) the author wrote from right to left. When you see these restaurant signs, as both are three characters long, and there is nothing in the writing system which tells you which way to read them, you have to use inference to figure out yourself which order is correct. So the job of the reader is more complicated because there is no standard direction of reading.
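As a rough illustration of the reader's task (a sketch that is not part of the original talk), the candidate readings of a sign can be enumerated mechanically; what the reader then has to do by inference is decide which candidate makes sense. The function names `horizontal_readings` and `vertical_readings` are hypothetical, and the character strings are the signs cited in (5a), (5b) and (5f).

```python
def horizontal_readings(sign: str) -> dict:
    """Return both candidate readings of a sign written in a single row."""
    return {"left-to-right": sign, "right-to-left": sign[::-1]}


def vertical_readings(rows: list) -> dict:
    """Read a character grid column by column (top to bottom),
    taking the columns in both possible orders."""
    # zip(*rows) transposes the grid, so each tuple is one column.
    columns = ["".join(chars) for chars in zip(*rows)]
    return {"left-to-right": columns, "right-to-left": columns[::-1]}


if __name__ == "__main__":
    # The two restaurant signs in (5a) and (5b).
    for sign in ("功德林", "苑河金"):
        print(sign, horizontal_readings(sign))

    # The minibus sign in (5f), typed in row by row as it appears.
    minibus = ["上不大", "車設小", "入找同", "錢贖價"]
    print(vertical_readings(minibus))
```

The code only generates the possibilities. Nothing in it can say which reading is the intended one, which is exactly the part of the job that the writing conventions leave to the reader.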
(5) a. Left to right: → 功德林 (restaurant sign in HK)
b. Right to left: ← 苑河金 (restaurant sign in HK)
c. Right to left and Left to right: → ← 勞保公 (clinic sign in Taipei)
d. Top to bottom / Right to left: 中國現當代 / 文學探研 (book cover; the two vertical columns are given in reading order)
e. Top to bottom / Left to right: 奮戰一百天奪取京劇 / 音配像的全面勝利 (Guangming Daily 2002/4/21; the two vertical columns are given in reading order)
f. Top to bottom / ?? Does it matter? (sign in Hong Kong minibus)
上不大
車設小
入找同
錢贖價

It can be even more complicated, as in the case of (5c), which is a sign in Taipei, where you have to read from left and right at the same time. It is a short version of two two-character names (the three characters are, from left to right, “lao bao gong”, representing “lao bao” and “gong bao”, two types of medical plans in Taiwan). To save space on the sign they just use three characters; instead of writing “lao bao” and “gong bao”, since one of the characters is the same in both, they just have you read it from both sides in at the same time. In Chinese it is also possible to write from top to bottom vertically, as in (5d, e). When you write vertically, you can write either from right to left, as in (5d), the title of a book, or left to right, as in (5e), a headline from a mainland Chinese newspaper. There is nothing in the script and no hard and fast
conventions, except for the convention that when it is written vertically it should be top to bottom,³ that tell you which way you are going to have to read it; you have to figure that out by trying different possibilities and then deciding which makes more sense. The simplicity of the conventions related to word order makes it easier for the writer, because the writer doesn’t have to follow many strict conventions, as a writer of English must. However, it makes the job of the reader more complex because the reader has to use a much more complicated inferential process to figure out which way makes sense. The process is not simplified for the reader. And sometimes, of course, you get to a situation like in (5f), which is a sign in the mini-buses in Hong Kong, where one may not be sure which way to read it. For the first six years that I lived in Hong Kong I always read this top to bottom and right to left, but when I was preparing the talk for the workshop I began to think maybe it should be read top to bottom and left to right, because it makes sense either way. But it’s just a matter of which one you think makes more sense, because the three lines are three independent sentences. You notice of course that there is no real separation of anything within the clauses as well, so there is a lot of inference going on when you are reading this. On the other hand, if you have a standardized word order and punctuation, the job of the writer is more complex, because the writer has to worry about using the right word order and punctuation, but it simplifies the task for the reader because it’s constraining the reader’s inferential process.

3 These patterns of writing go all the way back to the oldest form of Chinese writing, oracle bone inscriptions, texts written on ox scapulas and turtle plastrons that had been burned and cracked in divination rituals, where the writing relating to a particular divination had to be near the relevant divination crack, and the direction of the cracks influenced the direction of the inscription (see Keightley, 1978, §2.9.4 for details).

Now, there can be differences in terms of complexity between any two systems, and within a single system there are also different possibilities for complexity, so we might have a difference in complexity of the overall system, such as the difference in the systems of eating Chinese and Western food, but even within a single system, like the system of English language use, the speaker has choices in terms of how complex to make an utterance. Consider the following example:

(6) Q: Do you want something to drink?
A1: (points to soup bowl)
A2: I have soup.
A3: No. I have soup.
A4: No, because I have soup.
A5: No, since I have soup, I don’t need anything to drink.
A6: No, I don’t want anything to drink. Since I have soup, I don’t need anything else to drink right now.
This was a conversation I had with my wife while eating dinner. I asked her Do you want something to drink? — her answer was to point to her soup bowl; that was her answer and I had to figure out what that meant. Simply pointing like that means I have to figure out what she is pointing at, and if I guess it is the bowl that she is pointing at, then I have to notice that the bowl is full, and then I have to notice what kind of thing is in the bowl, then I have to somehow think that’s relevant, and then guess how it is relevant, and then I have to figure out that if it’s a full bowl of soup (broth), then think back that I’m asking her if she wants something to drink, and since soup is a liquid, maybe what she’s thinking is that since she has a bowl full of liquid she doesn’t need anything else to drink. So with pointing as her answer I have to do all of this very complicated inference. But if she says I have soup, at least the first part of my inferential process is constrained — figuring out what she is pointing at and what’s in the bowl, that part is made simpler. If she says No, I have soup, then my inferential process is constrained even more; it is made even more simple by the fact that she has added the word no, but I still have to infer the relationship between the word no and the concept “I have soup”. She could also constrain that part by putting in the word because. She could say No, because I have soup, and then my inference of the relationship between no and I have soup would also be constrained. The answers
in (6A5) and (6A6) would also be possible, and again, the more complex the utterance that she uses, the simpler my inference in determining her communicative intention. It is like the example of writing systems given above: the more complex it is for one of the two communicators, the simpler it is for the other, and vice versa.
4. Background: Ostension and Inference
Now I have to back up a little bit and talk about what human communication is all about. Human communication isn’t about language, and language is not what is most important to human communication; as I mentioned, language is just a tool. What happens in communication is somebody does something, what we call an ostensive act, that gets the other person’s attention and the other person then, having seen the purposefully done act, assumes that the other person did that act for a reason, and then tries to figure out what that reason was; that’s communication. Language is not crucial to communication. We communicate all the time without language, just like my wife pointing to her soup bowl. Another example is from one morning shortly before the Workshop. I wanted to communicate something to my wife, but there was a guest sleeping in the room, so I couldn’t say anything. Therefore I just pointed upward with my index finger. What I was trying to communicate was that I was going to go up to the roof to do my exercises, and she understood that. So language is not absolutely necessary for communication, communication can happen whether you use language or not. The thing that language does in communication is constrain the addressee’s inferential process. The ostensive act, which may be linguistic or not, draws the other person’s attention and makes them think that the act is done purposefully and that they should apply some inferential process to figure out what the communicator’s intention was in doing this. As we assume that people are rational (that’s the basis of Grice’s (1975) Co-operative Principle), when they do an ostensive act we assume they must be doing it for a reason and we should figure out what
that reason is. The way we figure it out is we create a context in which that ostensive act makes sense. Just like the example of pointing at the soup bowl, we have to figure out how pointing at the soup bowl could make sense in the context of expecting an answer to my question. I have to work through all the possible assumptions that I can put together and create a context of interpretation in which that particular ostensive act makes sense as an answer to my question. The thing that language can do is constrain the creation of this context of interpretation. In discussing the example of the soup bowl, I gave alternative responses with more complex forms, and showed how the more complex the linguistic form, the more constrained I would be in creating the context of interpretation and in figuring out what my wife’s communicative intention was, her intention to tell me that she didn’t want anything to drink.

I want to point out something in my view of language that is different from a lot of other people’s view of language. In most work on language and communication, even in pragmatics, the form of the utterance is taken as given and it is assumed that the context is variable, and that we use the context to disambiguate the form. I see it the other way around. The way I see it, when we are in a communicative situation, we don’t have a lot of choice about the context, we are in that context. What we can choose is what particular ostensive act, what particular utterance, we are going to use in that context, so that’s the thing that is variable and that is the thing that’s constraining the creation of the context of interpretation. Language and the rules for its use in a particular society are a set of social conventions that have evolved in a particular way in that society in response to the need to constrain the inferential process involved in communication in particular ways thought to be important in that society. Let me come back to this.
5. Is Complexity Necessary?
Let me first ask, “Is complexity necessary?” In some cases, like what we saw in the soup bowl example, in talking with me, my wife didn’t need to be any more complex than pointing at the soup bowl,
I could figure the rest out. If she was in a restaurant and the waiter asked her, “Do you want something to drink?” I don’t think she could get away with just pointing at her soup bowl. So whether or not you need a certain level of complexity will depend on where you are, and on the complexity of other systems. We use forms to fit the context, and if we are in a particular context often, and use particular forms in particular ways to fit that context, they can become conventionalized. Like the Qiang directionals mentioned earlier. I don’t think it is a coincidence that the Qiangs live on the sides of steep mountains overlooking river valleys, so they always have to be going up and down, towards the river and away from the river. Those are important aspects of their environment, and this fact has led to forms for constraining the hearer’s interpretation in ways relevant to these aspects becoming conventionalized in their language. The nature of a society, such as the size and complexity of the speech community, can influence the patterns of the language spoken, and this will in turn influence the form that the language takes. There has been a lot of work on this. In particular, Trudgill (1996, 1997) pointed out that in a small community you are more likely to have more complex phonological systems, whereas in a widespread homogeneous community you are going to have simpler phonological systems. So there are all kinds of factors that can influence the level of complexity of a system. Now another thing about complexity, as we saw with the soup bowl example, is that more complex generally means more specific or more exacting. So if I want to have two pieces of bread instead of one, I can rip it into two with my hands — that’s the simplest way to deal with the problem — or I can use a tool. It’s more complex to use a tool, but if I use a tool I get a more exact cut. 
This is the same with language; the use of more explicit language constrains the hearer’s interpretive process much more, and so the hearer’s interpretation is more likely to be exactly the one intended by the speaker. For example, consider the two sentences in (6):
(6) a. Peter’s not stupid.
    b. He can find his own way home.
(7) a. Peter’s not stupid; so he can find his own way home.
    b. Peter’s not stupid; after all, he can find his own way home.
    (from Wilson and Sperber, 1993:11)
If one were to say “Peter’s not stupid. He can find his own way home,” without anything marking the logical relationship between the two sentences, it would be up to the hearer to figure out what the relationship is. There are two logical possibilities at least. It isn’t obligatory to make explicit what the relationship is. But you could make it explicit; you could say Peter’s not stupid so he can find his own way home, as in (7a), or Peter’s not stupid; after all, he can find his own way home, as in (7b). The relationship between the two clauses can be made explicit by the use of so or after all, and this is parallel to using a knife to cut bread; it makes the action more exacting, finer in the case of cutting and more explicit in the case of linguistic actions. In doing that, by constraining the inferential process in the linguistic example, the speaker reduces the chances that the hearer will not be able to construct a context of interpretation in which the utterance makes sense. That is, it increases the likelihood that the hearer will correctly deduce the communicative intention of the speaker, just as you are more likely to get a nice neat cut of two even pieces of bread if you separate them with a knife rather than by hand.
Now, why might a language develop an obligatorily explicit form? For a pattern of explicitness to be used often enough by enough people for it to become conventionalized, it must be constraining the interpretation of some salient category. That is, it has a cultural motivation. In some cases it isn’t easy to find the cultural assumptions that lead to the conventionalization of a certain form of explicitness, but sometimes it is.
For example, when a speaker of Kalam (Pawley, 1993; Pawley and Lane, 1998), a language of Papua New Guinea, is reporting an event, he or she is expected to make reference to the whole sequence of situations and actions associated with the overall event, such as whether the actor was at the scene of the event or moved to the scene; what the actor
did; whether the actor then left the scene, and if so whether the actor took the affected object along or not; and what the final outcome of the event was. All of these are culturally required when you are describing some event. In English, you could just say The man fetched firewood, but in Kalam, you can’t just say ‘fetched firewood’; you have to say the whole series of events that happened in his going, his coming back, what happened in between, and so the narrative will be very complex, and this complexity is required by the culture. The interpretation of these aspects of the event is then generally more constrained in Kalam than in English. The narration of these sub-actions can take the form of many complex clauses, or, in the case of relatively commonly recurring multi-action events, can take the form of a conventionalized serial verb construction, as in (8) (from Pawley, 1993:95). In (9) is a conventional expression for ‘to massage’ in Kalam (Pawley, 1993:88).
(8) b    ak    am  mon   p-wk       d    ap    ay-a-k
    man  that  go  wood  hit-break  get  come  put-3sg-PAST
    ‘The man fetched some firewood.’
(9) pk      wyk  d     ap    tan     d     ap    yap      g-
    strike  rub  hold  come  ascend  hold  come  descend  do
    ‘to massage’
It is because of the requirement on the explicitness of narration that the language has developed the sets of serial verb constructions that code frequently occurring sets of action sequences. That is, because certain actions often were narrated in the same way, and repeated over and over again, what formerly took the form of several clauses became simplified to a serial verb construction. Now, whether or not we can find a smoking gun (in this case there’s a very clear smoking gun: they have a societal expectation that a speaker should narrate all these sub-actions of an event, and we can use that to explain the development of the serial verb constructions), the fact that the pattern of explicitness is repeated often enough to become conventionalized means that it has to be
culturally important. Some people argue that if you can’t find the motivation for some particular form, you can’t say it’s motivated. My point of view is that grammar, or any linguistic structure, develops out of patterns that have been repeatedly used over and over again so often that they became conventionalized, and the fact that they became conventionalized means that they had to have been repeated a lot, and the fact that they were repeated a lot means that they had to have been constraining some important aspect of the interpretation; a speaker is not going to repeat something often if it is not important to him or her to constrain the inference in that particular way.
6. We Seem to be Able to Do Well Without Some Forms of Complexity
Getting back to this question of whether complexity is necessary, sometimes it seems we can do without it. For example, in Old English there was a very complex system of declension of nouns and adjectives, but we do quite well without it now. Old English inflected nouns and adjectives for four different cases in singular and plural, and an adjective had three different forms for the three different genders (actually six, as there were different forms depending on whether the noun took a demonstrative or not). In (10) are examples of the nouns stān ‘stone’ (masculine a-stem), giefu ‘gift’ (feminine ō-stem), and hunta ‘hunter’ (masculine consonant stem):
(10)
Singular
  Nominative  stān     gief-u   hunt-a
  Genitive    stān-es  gief-e   hunt-an
  Dative      stān-e   gief-e   hunt-an
  Accusative  stān     gief-e   hunt-an
Plural
  Nominative  stān-as  gief-a   hunt-an
  Genitive    stān-a   gief-a   hunt-ena
  Dative      stān-um  gief-um  hunt-um
  Accusative  stān-as  gief-a   hunt-an
In (11) is the declension of gōd ‘good’ when preceded by a demonstrative (gender is neutralized in the plural when the form is preceded by a demonstrative, but not when not preceded by a demonstrative):
(11)
              Masculine  Feminine  Neuter
Singular
  Nominative  gōd-a      gōd-e     gōd-e
  Genitive    gōd-an     gōd-an    gōd-an
  Dative      gōd-an     gōd-an    gōd-an
  Accusative  gōd-an     gōd-an    gōd-e
Plural
  Nominative  gōd-an
  Genitive    gōd-ena or gōd-ra
  Dative      gōd-um
  Accusative  gōd-an
Modern forms:
  Singular (for all cases)  stone   gift   hunter   good
  Plural (for all cases)    stones  gifts  hunters  good
Speakers of the system of Old English had to choose one of the forms from these paradigms every time they wanted to mention a stone, a gift, a hunter, or say something was good, and these paradigms are quite complicated, whereas in the modern system the paradigm is much simpler, just stone/stones, gift/gifts, hunter/hunters, and only one form for the adjective. We do okay with this simple system; we don’t need a great deal of complexity. A language doesn’t have to develop towards more complexity. In the case of English, it developed away from that particular type of complexity.
7. Complexity as a Feature of Categories, Not Language
One of the things that I want to mention, when talking about linguistic complexity, is that it is not that we want to talk about a language as a whole as being complex or not complex; we need to think in terms of sub-systems or categories of the language. For example, Chinese has a simpler system in terms of not having conventionalized tense marking, so a speaker doesn’t have to worry about tense when speaking; one can just say, for example, Wǒ qù xuéxiào ‘I go school’ and not say whether it was in the past, in the future or whatever, so in terms at least of the speaker it’s an easier job. But Chinese has developed a complex system of lexical categories coded in taxonomic compounds such as lóng-xiā ‘lobster’ (dragon-shrimp), jīng-yú ‘whale’ (whale-fish), and sōng-shù ‘pine’ (pine-tree), where the second syllable identifies the taxonomic class that the referent belongs to. It also has a complex system of classification of nouns using what we call noun classifiers, so you don’t just say ‘one book’, like in English, where you don’t have to worry about what class of object you’re talking about when you want to quantify an object. In Chinese you have to worry about what category you are talking about, and add the classifier for that category when you quantify that object. Compare, for example, English one book vs. Chinese yī běn shū (one classifier.for.book-like.objects book), English one table vs. Chinese yī zhāng zhuōzi (one classifier.for.flat.rectangular.objects table). It’s more complex when you have to know what category each word is in in order to quantify it. The point is that Chinese has developed complex systems for constraining the interpretation of some functional domains, but not others, and so we can’t make blanket statements about languages; we need to look at each functional domain to see how the language deals with it. Different sub-systems of a language can also interact.
To give one example, Proto-Arawak, an Amazonian language, had several locative cases but no marking of grammatical relations. Later,
mainly through contact with other, unrelated, languages in the same area, Tariana, an Arawak language, developed a complex system for marking grammatical relations by restructuring the locative cases (Aikhenvald, 2003). Tariana originally had a complex locative system and a simple, or no, system of grammatical relations, but then it restructured the locative cases into a complex system for marking grammatical relations and certain other features, and at the same time simplified the locative markings so that it now has only one very general locative case marker as opposed to having several before. Sometimes this can go back and forth — this is why we need to think about complexity in terms of the particular categories, not in terms of whole languages.
8. Complexity of Language as a Reflection of Complexity of Cognitive Categories
The complexity of language is a reflection of the complexity of cognitive categories. The clearest example of course is phonemes; phonemes are categories. When we are babies, we can distinguish all kinds of sounds, but then later on we get into the habit of thinking that certain sounds go together in one category and other sounds get divided between two categories. For example, English speakers perceptually group together the voiced stop initials and voiceless unaspirated initials as one category, so they don’t hear the difference between [ba] and [pa]. Because of this, when a Chinese speaker says [peitɕiŋ] ‘Beijing’, with a voiceless unaspirated initial, an English speaker will hear it as if it is the same sound as the voiced initial [b], and will often pronounce the Chinese word as [beitɕiŋ], as they can’t hear the difference between the two sounds. Once you’ve made these categories, once you are habituated to these categories, the categories affect your perception. There’s a specialist in neuroscience at UCSD named Vilayanur Ramachandran. He summarizes his findings on perception by saying, “Perception is an opinion”, because when we hear, we don’t hear the different sounds, what we
hear is filtered through the different categories in the mind. This is true of vision as well. The complexity of the language, whether a language separates certain sounds or not, is a reflection of the complexity of the categories in our minds. Shanghainese distinguishes voiced stops, voiceless unaspirated stops and voiceless aspirated stops, so for speakers of Shanghainese these are three different cognitive categories. So they have a more complex set of categories, at least in terms of stop consonants, than most English speakers, who have only two different categories for the three sounds. Another example is the difference between English and Mandarin Chinese speakers in terms of the conception of possession. In English there is no obligatory distinction between ownership and temporary physical possession; the verb have is used for both. But in Mandarin, these two categories are distinguished. For example, if I pick up this disk, this is my floppy disk, in English I can say This is my disk, and if my disk is in the hands of someone else, I can say to that person, You have my disk. In Mandarin you can’t do that; you can’t say the equivalent of ‘You have my disk’, you have to say something like ‘My disk is at your place’, with a locative expression rather than a possessive expression (this is not true of Cantonese, possibly due to English influence). The point is that Mandarin makes a distinction between ownership and temporary possession. I have found that after many years of speaking Mandarin, this way of thinking has affected my English, so in situations where someone had something of mine, I have found myself saying things like My disk is with you, rather than You have my disk. So my cognitive categories are being influenced by the language that I was speaking all the time, in this case a second language. But on the other hand, my English category distinctions (and lack of them) also affect my Mandarin. 
For example, I often don’t make a distinction between second person singular and second person plural, because I’m a native English speaker; we just have you for both singular and plural. I find myself, when speaking Mandarin, using just nǐ (2sg pronoun) when I should use nǐmen (2pl pronoun) for the plural; I just forget about the plural because I am so used to thinking with just one
category, not two categories. When we learn a language that doesn’t make the same distinctions that we are used to making, distinctions that reflect the distinctions made in our cognitive categories, we will try to fill in the perceived gaps. For example, in English we have obligatory tense marking, but Mandarin doesn’t have tense marking, and so a lot of English speakers, when they learn Mandarin, will look for something that seems like tense marking; they’ll find the perfective marker le and then use it any time that they feel would require a past tense marker in English. Or they will over-specify. For example, in English, if you want to say something like I’m going to go wash my hair, you have to include a possessive pronoun to specify whose hair is going to be washed. In Mandarin you don’t have to add a possessive pronoun; you just say Wǒ qù xǐ tóufa (lit.: I go wash hair), and in most contexts it’s assumed that you know whose hair you are going to wash; you don’t have to be specific about that. Native English speakers will often add the possessive pronoun to such a clause when speaking Chinese, though, as they feel the need to constrain the interpretation of whose hair is being washed because they are used to doing so when speaking English. On the other hand, a Chinese speaker living in America for thirty years will often still make mistakes in the use of he vs. she when speaking English; it’s just not a categorical difference that they have internalized, as their native language does not make that distinction.⁴
⁴ The third person pronoun in spoken Chinese does not inflect for animacy or gender, but in the early 20th century many Chinese intellectuals learned English, French, or German, and came to feel the need to constrain, at least in writing, the interpretation of the referent of the third person pronoun, and so developed different ways of writing the third person pronoun in Chinese for male, female, inanimate, and godly referents.
9. The Development of Language Structure
Now back to the development of language structure. Grammar develops as the originally free collocations of lexical items used to
constrain the hearer's inference in a particular way become fixed in those particular structures. In communicating you want to constrain the hearer’s inferential process; in the beginning you can use any words to do that, any words are still better than no words. But then if you find that the particular pattern works, very often you repeat it again and again to constrain the hearer’s inference in that particular way, and then the pattern can become fixed. First it’s personal habit, and we are very much creatures of habit; all of our language use is really habit. And on a societal level, conventions are really just societal habits. For example, in Old English the word lic ‘like’ plus the instrumental suffix -e were used so often after an adjective to make explicit an adverbial relation to a verb that it became conventionalized and developed into the adverb-forming suffix -ly, as in quickly, used obligatorily in many contexts in English today (Lass, 1992). The frequent use of a demonstrative adjective to show that a referent was cognitively accessible conventionalized into definite marking in English (Pyles and Algeo, 1982). You can see this happening in Chinese; the demonstrative adjective in Chinese is being used so often as a way of showing identifiability that some people are arguing that this is now becoming a definite marker, just like in English. Or in Chinese, you had a locative phrase that was used very often with an implication that the action was on-going, so you would say things like Tā zài nàr chī fàn (3sg LOC there eat rice) “He is eating there”. Eventually, you could drop the “there”, and just say Tā zài chī fàn (3sg PROG eat rice), as the locative verb zài was reanalyzed as a progressive marker (Chao, 1968:333).
So what begins as a conversational implicature over time becomes conventionalized, and then becomes conventional implicature, and can then become further conventionalized until it becomes part of the grammar that forces a particular interpretation. Now what’s important is that grammatical structure that has become obligatory forces a particular interpretation. Some people say that languages differ in terms of what you can say, but another way to look at it is that languages differ in terms of what you have to say: English forces you to be much more explicit in certain contexts, for example, than Chinese, because English has grammaticalized a set of
obligatory constraints on referent identification we associate with “subject” and the use of the subject to mark particular speech act types. So we use the existence of subject in a clause and the position of subject in the clause to mark whether it is interrogative, imperative, or declarative. We have this as an obligatory part of every sentence, and because of that we then have to be explicit about who is the subject of the sentence. Chinese has not conventionalized these same constraints on referent identification (LaPolla, 1993), so you don’t have to be as explicit in terms of referent identification when you say something. Going back to the path through the field example, when you are going through the field you go a particular way because you find it expedient to go that way, but then other people start going that way and eventually the grass gets worn away to form a path, and the form of the path becomes fixed. At some point the path becomes recognized as the unmarked way to go through the field. This is true of other types of conventionalization as well. One method/tool/system for achieving a particular purpose becomes the unmarked way to achieve that purpose, and other ways are seen as marked. In language there are several ways language structure can develop. You can develop either a particular word to constrain inference in a particular way, like the use of lic “like”, which developed into the adverb marking -ly, or it can be an extension of some pre-existing morphology for some new use, like the Qiang prefixes being extended to marking perfectives and also to imperatives. Or you can just have the fixing of structures, like in the English case where you have obligatory cross-clause co-reference in conjoined clauses.
For example, Bernard Comrie once mentioned (1988:191) that if you have a sentence like The man dropped the melon and burst, [Audience laughs] — you laugh, because the interpretation of that pattern in English has to be that it is the man who burst, not the melon. The structure of this pattern in English has become fixed, to the point that you have this obligatory cross-clause co-reference; the subject of the second clause has to be the same as the subject of the first. The structure of The man dropped the melon and burst then forces a particular interpretation
by disallowing certain assumptions about what is likely or possible to be added to the context of interpretation. It has become so conventionalized it forces the listener to interpret the sentence in a particular way, even if that particular interpretation does not make sense. A lot of languages don’t do that. Even languages as closely related as Italian don’t have such obligatory co-reference. Chinese also doesn’t force such co-reference. I have asked many Chinese people over the years to translate that sentence into Chinese and tell me who or what burst, and they say “Of course it’s the melon that burst; the man’s not going to burst.” But in English it has to be the man who burst because the grammar forces that particular interpretation.
10. How Languages Differ in Terms of Complexity
So how do languages differ in terms of complexity? They can differ in terms of which functional domains they constrain the interpretation of. They can differ in terms of the extent to which they constrain it. And they can differ in terms of what mechanism they use to constrain it. So for example, in Chinese you can say the sentence in (12a), which is just “he/she go school”. You can leave it at that, you don’t have to add any tense marking, and you don’t have to specify if it is a man or a woman. In English you have to say “he went to school” or “she went to school”, or “he is going to school” or “she is going to school”, and so on, as in (12b–d); you have to be more specific, because the grammar (the conventions of English usage) forces you to be more specific. English then differs from Chinese in that English obligatorily constrains the interpretation of the time of an action relative to the time of speaking (i.e. has obligatory tense marking), and also has obligatory gender and animacy marking for 3rd person pronouns.
(12) a. Tā   qù  xuéxiào.  (Chinese)
        3sg  go  school
     b. She went to school. / He went to school.
     c. She is going to school. / He is going to school.
     d. She goes to school. / He goes to school.
Now while English obligatorily constrains the interpretation of past vs. present vs. future actions, it does not obligatorily mark a difference between recent past and distant past actions. How far in the past an action happened relative to the time of speaking is left up to inference; this aspect of the interpretation is not constrained. But in Rawang, a Tibeto-Burman language of northern Burma, you have four different past tenses, and it is obligatory to constrain the hearer’s interpretation of how far in the past the action was that you want to talk about. Compare the Rawang examples given in (13a–d) (from my own fieldwork).
(13) a. àng  dī  á:m-í
        3sg  go  DIR-Intrans.PAST
        ‘S/he left, went away (within the last 2 hours).’
     b. àng  dī  dár-í
        3sg  go  TMhrs-Intrans.PAST
        ‘S/he went (within today, but more than two hours ago).’
     c. àng  dī  ap-mí
        3sg  go  TMdys-Intrans.PAST
        ‘S/he went (within the last year).’
     d. àng  dī  yàng-í
        3sg  go  TMyrs-Intrans.PAST
        ‘S/he went (some time a year or more ago).’
We can see then that English and Rawang both constrain the inference related to the interpretation of the time of the event relative to the time of speaking, unlike Chinese, but Rawang constrains it to a much greater degree. (Notice, as I mentioned earlier, that it is particular functional domains, and not languages that we should look at in terms of complexity. Here we see Rawang has more complexity in its tense system than English, but less complexity in its pronoun system, as it does not make the gender and animacy distinctions English does.)
Now in terms of the type of marking you might have, we can go back to the example I mentioned earlier about washing one’s hair. As I said, in Chinese when talking about washing hair, you don’t have to say whose hair you are washing. You can just say the sentence in (14a). In most situations you wash your own hair. If you are a professional hair washer, it might mean you are washing someone else’s hair, but most of the time it would mean you are washing your own hair.
(14) a. Tā   zài   xǐ    tóufa.  (Chinese)
        3sg  PROG  wash  hair
        ‘S/he is washing (her/his) hair.’ (Lit.: ‘S/he is washing hair.’)
     b. He is washing his hair.
     c. àng  nī    zv́l-shì-ē  (Rawang)
        3sg  hair  wash-R/M-NPAST
        ‘S/he is washing her/his hair.’
In English, as in (14b), and Rawang, as in (14c), you have to be explicit, you have to say whose hair is being washed, but the way you are explicit differs between the two languages. The way you are explicit in English is to have a possessive adjective on the noun, as in his hair, whereas in Rawang you don’t put any marking on the noun itself, you put a reflexive/middle marker on the verb, which then marks the fact that the washer and the person whose hair is being washed are the same. So both Rawang and English are constraining the interpretation, unlike Chinese, but in this case they use very different types of morphology, in one language a pre-noun genitive modifier, and in the other a post-verbal reflexive suffix.
11. Conclusion
To conclude, language is not an absolute necessity for communication, though without language the addressee’s inferential
task in creating the context of interpretation can be quite complex. Therefore communicators attempt to simplify the addressee’s task by constraining the addressee’s inferential process with a more explicit ostensive act which includes the use of linguistic forms, and when the particular pattern they use to do so is repeated often enough and by enough people it can become fixed as language structure. The consequence of this is that simplifying the addressee’s task complicates the communicator’s task, as the ostensive act produced by the communicator has to be more complex. As each society views the world differently, communicators in different societies will differ in terms of which particular functional domains they feel the need to constrain the interpretation of, to what degree they constrain the interpretation of a particular functional domain, and what mechanism they use to constrain the interpretation. These are the differences that lead to the differences in the degree of complexity of the sub-systems of different languages.
References
Aikhenvald, Alexandra Y. (2003) Mechanisms of change in areal diffusion: New morphology and language contact. Journal of Linguistics 39 (1), 1–29.
Chao, Yuen Ren. (1968) A grammar of spoken Chinese. Berkeley; Los Angeles: University of California Press.
Keightley, David N. (1978) Sources of Shang history: The oracle bone inscriptions of Bronze Age China. Berkeley; Los Angeles: University of California Press.
Keller, Rudi. (1994) On language change: The invisible hand in language. Translated by Brigitte Nerlich. London: Routledge.
LaPolla, Randy J. (1993) Arguments against 'subject' and 'direct object' as viable concepts in Chinese. Bulletin of the Institute of History and Philology 63 (4), 759–813.
LaPolla, Randy J., with Huang, Chenglong. (2003) A grammar of Qiang. Berlin; New York: Mouton de Gruyter.
Lass, Roger (Ed.) (1992) The Cambridge history of the English language, Vol. III: 1476–1776. Cambridge: Cambridge University Press.
Pawley, Andrew. (1993) A language which defies description by ordinary means. In William A. Foley (Ed.) The role of theory in language description (pp. 87–130). Berlin; New York: Mouton de Gruyter.
Pawley, Andrew, and Lane, Jonathan. (1998) From event sequence to grammar: Serial verb constructions in Kalam. In Anna Siewierska and Jae Jung Song (Eds.) Case, typology, and grammar (pp. 201–227). Amsterdam; Philadelphia: John Benjamins Publishing Company.
Perkins, Revere D. (1980) The evolution of culture and grammar. PhD dissertation, State University of New York at Buffalo.
Pyles, Thomas, and Algeo, John. (1982) The origins and development of the English language, 3rd ed. New York: Harcourt, Brace, Jovanovich.
Trudgill, Peter. (1996) Dialect typology: isolation, social network and phonological structure. In Gregory R. Guy et al. (Eds.) Towards a social science of language, volume 1 (pp. 3–21). Amsterdam; Philadelphia: John Benjamins.
Trudgill, Peter. (1997) Typology and sociolinguistics: linguistic structure, social structure and explanatory comparative dialectology. Folia Linguistica 23 (3–4), 349–360.
Wilson, Deirdre, and Sperber, Dan. (1993) Linguistic form and relevance. Lingua 90, 1–25.
16 Creoles and Complexity
Bernard Comrie
Max Planck Institute for Evolutionary Anthropology, Leipzig, and University of California, Santa Barbara
1. Introduction
My aim in this chapter is to examine the relationship between creole languages, hereafter simply creoles, and the notion of complexity. In particular, I want to examine an intuition held by many, though by no means all, linguists who have concerned themselves with creoles, namely that creoles are, at least in certain respects, less complex than languages on average. In order to approach this problem, it is necessary first of all to address some definitional questions. I am aware that I am addressing an audience consisting partly of linguists and partly of non-linguists, the latter including readers with very different degrees of exposure to linguistics. In some places I will no doubt labor points that might seem obvious to linguists, and in others I will no doubt gloss over problems that strike me as relatively unimportant in the present context, but would certainly merit more extensive discussion in a different context. I can but crave indulgence, while expressing my readiness, where appropriate, to stand corrected or to be faced with clarification questions.
1.1 What are creoles?
First, what is a creole? While there have been times in the last few decades of research on creoles when linguists might have seemed at least near to a definitive characterization of the notion “creole”, the explosion of research on creoles in recent years has, if anything, made the question even more murky. One possible definition would be to say that creoles are those languages that have traditionally been called creoles, perhaps drawing a distinction between central cases, or canonical creoles, and more marginal cases whose status as creoles has been more contentious. This would clearly include the Caribbean creoles (such as Jamaican Creole, Guyanese Creole, Haitian Creole, and Sranan, the latter spoken in Surinam), the Indian Ocean creoles (such as Mauritian Creole, Seychelles Creole), and a number of South Pacific creoles (such as Hawaiian Creole English). Of course, such a definition provides no guarantee that the languages so included form a homogeneous group, other than in the sense that linguists have tended to group them together. A definition closer to a set of criteria for identifying a random language as being a creole or not would clearly be preferable, or even a set of criteria that arranged languages on a cline from most to least creole-like. For a general account of creole linguistics, Holm (1988–1989) may be recommended. The following list is my own attempt, in which it is important to note that some individual criteria are necessary, but far from sufficient, while others express tendencies rather than absolute boundaries between creoles and other languages; indeed, many linguists would question whether there is a clear dividing line between the two. A necessary condition for a language to be considered a creole is that it must be the native language of a community. All of the creoles listed above meet this criterion. 
The importance of this criterion is that it distinguishes creoles from another set of languages with which creoles are often grouped together, namely pidgins, which are auxiliary communication systems that are developed when speakers of different languages are brought together and which are different from any of the native
languages of the people involved. Another criterion, however, is shared by pidgins and creoles: Both arise as the result of contact between languages, more accurately between speakers of different languages who do not share knowledge of each other’s languages, and where a further crucial condition holds: Contact between the groups of speakers is not sufficiently intimate for them to acquire a full command of each other’s languages, so that a communication system distinct from any of the “input” languages arises. In the case of canonical creoles, which typically developed on plantations, we find limited contact between speakers of a European language and speakers of one or (usually) more non-European languages. The European language typically provides the bulk of the vocabulary, or lexicon of the creole, so that Jamaican Creole, Guyanese Creole, Sranan, and Hawaiian Creole English have a lexicon that is primarily English in origin, while Haitian Creole, Mauritian Creole, and Seychelles Creole have a lexicon that is primarily French in origin; it is partly on this basis that such languages are often regarded as varieties of the “lexifier” language, i.e. the language that provides most of the lexicon, especially in the case of creoles whose vocabulary is still largely recognizable as being of that origin — in the case of the creoles mentioned above this would include all except possibly Sranan. By contrast, the grammar of creoles is typically very different from that of the lexifier language, a topic to which I return below, since hypotheses about the origin of creole grammars not only form one of the main points of contention in creole studies but are also directly relevant to the issue of complexity in creoles. Many creoles are, and have for some time been, in contact with their lexifier language; indeed in some cases this has been the case since the development of the creole. 
In Jamaica, for instance, standard and other metropolitan varieties of English have always been present on the island, and have become more intrusive with the development of the mass media and education. Many creoles are thus in a sociolinguistic situation that has come to be called the “post-creole continuum”, with a range of socially conditioned linguistic varieties ranging from the most traditional creole (the basilect) through intermediate varieties (a reasonably stable variety
of which might be characterized as the mesolect) up to varieties close or identical to the metropolitan lexifier language (the acrolect). In some instances, contact between the creole and the lexifier language has been so intense and so continuous that even the most “basilectal” varieties seem close to what, in another creole situation, would be a mesolectal variety; this is the case, for instance, with Réunion Creole, spoken on the island of Réunion, a French overseas department in the Indian Ocean. In such a situation, it would not be surprising if more “complexities” of the lexifier language were to be found in the creole, and this is indeed the case, so that I will leave cases like Réunion Creole largely out of account in what follows. At the other extreme, one has creoles that, for historical reasons, have been cut off from their lexifier language for most of their history. This is the case with the Surinam creoles, which have English as their lexifier language from the brief period when Surinam was under English rule in the mid-seventeenth century; thereafter Surinam was under Dutch rule until independence in 1975, and Dutch continues as the official language. The most widely spoken creole of Surinam, Sranan, is not unexpectedly much less close to metropolitan English than, say, Jamaican Creole, though interestingly two other creoles of Surinam, Saramaccan and especially Ndjuka, are even more distinct, having developed in communities founded by escaped slaves with even less contact to European languages. So in looking at creoles, it is important to bear in mind the extent to which the situation as we observe it today may reflect continuing influence from the lexifier language, as opposed to the situation that would have obtained had there been little or no such contact since the genesis of the creole. 
As a brief digression, I would like to mention a criterion that was long thought to characterize creole languages but is now usually rejected as a necessary criterion and/or downplayed in its role. At one time it was thought that creoles necessarily arise from pidgins, with the usual development being as follows. Speakers of different languages being brought together without full access to each other’s languages develop a make-shift pidgin, which is not only no one’s native language but which is also subject to massive internal
variation, depending in part on the native language of individual speakers. For instance, word order might follow that of individual speakers’ native languages, with some speakers using Subject–Verb–Object, others Subject–Object–Verb, and yet others Verb–Subject–Object. Children growing up in the community where this “mess” is the normal means of communication would be forced to impose order on it, coming up with the degree of systematization and expressibility that is characteristic of any natively spoken language, and their resulting first language would be a creole. Some creoles may indeed have arisen in this way, and this is for instance one key aspect of Bickerton’s account of the development of Hawaii Creole English, a point to which I return below. But it is almost certainly not a necessary development, and it is now accepted that at least many creoles may have arisen without an intervening pidgin stage; such a more direct genesis of creoles is, for instance, part and parcel of Lefebvre’s relexification hypothesis, to which I also return below. (In addition, one should note that some pidgins become stabilized before acquiring native speakers. This is the case, for instance, with Tok Pisin in Papua New Guinea, which had stabilized in its grammatical structure as a second language of inter-ethnic communication well before it started being acquired as a first language by certain groups of children. Although natively spoken Tok Pisin is sometimes referred to as a creole, this is at least in a somewhat different sense from the use of the term with respect to canonical creole languages.) 
More generally, it should not be overlooked that many of the languages that we think of as creoles may have arisen in somewhat different social circumstances, and these differences may well account for some of the variation that we find among creoles and that has sometimes led different linguists to construct different origin hypotheses not only for the particular creole(s) on which they have specialized but also, by not necessarily justifiable extrapolation, for creoles in general. It should thus not be surprising if Réunion Creole, which has probably never been radically separated from metropolitan French in its history, is closer to its lexifier than are most creoles, and that this should lead some
linguists (most notably Robert Chaudenson) to view Réunion Creole as basically a variety of French; or if Hawaiian Creole English, whose development from Hawaiian Pidgin English is reasonably well documented, should lead other linguists (such as Derek Bickerton — see further below) to emphasize the importance of universal principles leading from a variable pidgin to a systematic creole; or if Haitian Creole arose in a community where there was a particularly strong presence of Fongbe and other closely related West African languages, thus leading some linguists (e.g. Claire Lefebvre, see below) to assign a major role to the influence of the non-lexifier, or substrate language. As has been emphasized by some creolists, such as Salikoko Mufwene and Pieter Muysken, the emergence of a creole can be a complex process both socially and linguistically in which competing strategies to achieve communication may be at play.
1.2 Morphological complexity
Crucial to our enterprise of asking whether creole languages are less complex than the average human language, and if so why, is a definition of the notion complexity with regard to human language. Unfortunately, a principled answer to the question how one measures the complexity of a random language is at best difficult and at worst insoluble. There is a truism in linguistics, which is often repeated by linguists and finds its way into introductory linguistics courses, that all languages are equally complex. In part this is a reaction, no doubt laudable in its intent, against a widespread misconception that languages of less technologically advanced communities can readily be qualified as less complex than those of more technologically advanced communities, and certainly we can point to complexities in languages spoken outside the technological super-club that would put most European languages to shame. For instance, the English speaker who has struggled with the two genders of French or the three of German will be faced with around a dozen genders when learning a Bantu language of Africa.
And if this same person has difficulties remembering how to make verbs agree with their subjects in Spanish or Greek, then many indigenous languages of northern Australia go way beyond this by requiring verbs to agree with both subjects and objects. But any linguist who maintains that all languages are equally complex would (or at least should) be embarrassed if asked actually to prove this claim, for the simple reason that we have no way of measuring the overall complexity of a language. One can, of course, maintain that a language that is simple in one area is likely to be complex in others, so that English has a relatively simple morphology but a relatively complex syntax, while Algonquian languages of North America have a relatively complex morphology but a relatively simple syntax. But how exactly does one weigh one up against the other? It is interesting that one of the most extensive recent attempts to measure linguistic complexity, Heggarty (2001), concludes that while we have a good chance of measuring relative complexity across languages in particular domains, there are no foreseeable prospects for measuring the relative complexity of one language against another globally. For the purposes of this paper, I have therefore chosen one particular area in which to examine complexity, namely morphology. (For some earlier moves of mine in this direction, see Comrie 1992.) The reason for this choice is far from random. First, morphology is one area where it is relatively straightforward (I emphasize “relatively”!) to measure complexity, so that, for instance, a language with only two cases in its noun declension, like English (nominative and genitive in -’s) is, other things being equal, simpler than one with around six (like Russian) or with eighteen or more (like Hungarian). Secondly, most creoles, including all canonical creoles, do have simple morphologies, indeed many come close to the canonical “isolating” type of language, i.e. 
one lacking all morphology altogether. To illustrate what is meant by an isolating language, it will be useful to compare plural formation in English and Vietnamese. In English, the plural of nouns is formed by means of a bound morpheme, -s in the case of regular nouns (the vast majority), so
that the plural of dog is dogs. In Vietnamese, by contrast, where plurality is marked this is not by means of morphology, but by means of a separate word, so that, for instance, the plural of tôi ‘I’ is chúng tôi ‘we’, literally ‘PLURAL I’. As an illustration with regard to creole languages, we may take an area that has been widely discussed in the literature on creoles, namely tense–mood–aspect formation. In European languages, tense–mood–aspect formation typically involves either just morphology or some combination of morphology and auxiliary verbs or particles. In English, for instance, the distinction between present and past tense involves morphology: I walk versus I walked. Progressive aspect requires the auxiliary verb be, as well as a special form (present participle) of the main verb, e.g. I walk versus I am walking. These can be combined to give I was walking, which is past progressive, incorporating the past tense of the auxiliary. In Jamaican Creole, by contrast, such semantic distinctions are indicated by means of invariable particles, such as en for past (or completive), a for progressive: mi waak ‘I walk’, mi en waak ‘I walked’, mi a waak ‘I am walking’, mi en a waak ‘I was walking’. Basilectal varieties of Jamaican Creole lack reflexes of English verb morphology completely. Two things should be clarified with respect to creoles and morphological complexity. First, while it is the case that most creoles, and certainly most canonical creoles, have very low levels of morphological complexity, it is not always the case that they lack bound morphology completely. For instance, Seychelles Creole makes a distinction, for a large number of verbs, between a consonant-final form and a vowel-final form, e.g. sât versus sât-e ‘sing’. The syntax–semantics of the distinction is complex, and not of direct relevance here, but clearly relates to inflectional morphology, i.e. different morphological forms of the same lexical item. Derivational morphology, i.e. 
forming different words from the same root, is even more widespread in some creoles — there is a list for Haitian Creole in Lefebvre (1998: 302–307), restricted to instances where the particular derived form is not found in metropolitan French, i.e. where one cannot simply assume that the whole word was taken from metropolitan French. Second, while a
low level of inflectional morphology may be viable as a necessary condition on creoles, it is certainly not a sufficient criterion, since many non-creole languages, such as Vietnamese, also lack or virtually lack inflectional morphology. In other words, while any language may accidentally happen to have little or no inflectional morphology, there seems to be something about creoles that more or less guarantees that they will have little or no morphology. While we have already seen what would characterize complete absence of morphological complexity, namely the complete absence of morphology, it is important to note that there are at least three directions in which a language might be said to have morphological complexity, and I want now to turn to these characterizations of morphological complexity. Although the three types are in principle readily distinguishable, and many languages are morphologically complex in only one direction, it is in principle possible for the different kinds of complexity to be combined in a language. First, a language might be morphologically complex in that it permits the accumulation of a large number of affixes on a single root, even though each of the affixes might be readily segmentable and might be invariable (subject to general phonological rules of the language in question) — I return to the last two points in the immediately following paragraphs. A paradigm example of such a language is Turkish, and the phenomenon may be illustrated by the word indirilemiyebilecekler, which means ‘it may be that they will not be able to be brought down’. Despite the undoubted complexity, the word is constructed on the basis of perfectly regular and transparent suffixation processes. The root is in ‘descend’ and, like each of the other intermediate stages, this could serve as a word in its own right, in this case as the singular imperative. The first suffix makes the verb causative, i.e. in-dir is ‘cause to descend, bring down’. 
The next suffix passivizes this, i.e. in-dir-il is ‘be caused to descend, be brought down’. Then comes the so-called impotential, expressing inability, so that in-dir-il-eme is ‘be unable to be brought down’. Adding the potential suffix -ebil to this, which requires two adjustments determined by rather general phonological rules, namely the insertion of y to break up the vowel hiatus and the
raising of the final vowel of -eme to give -emi before a y, we get to in-dir-il-emiy-ebil ‘be able to be unable to be brought down’. The suffix -ecek adds future tense, so that in-dir-il-emiy-ebil-ecek is ‘will be able to be unable to be brought down’. Finally, plural -ler specifies a third person plural subject, i.e. indirilemiyebilecekler ‘they will be able to be unable to be caused to descend’, or more idiomatically ‘it may be that they will not be able to be brought down’. Although the resultant word is long and complex in the sense that it contains a large number of bound morphemes, the formation is completely regular: One could in principle take any other Turkish verb and attach the same sequence of suffixes with the same meaning change, subject only to application of general phonological rules (such as vowel harmony, or the insertion of epenthetic y) and to the requirement that the result must, of course, make sense. Interestingly, this feature of Turkish morphology seems to give no particular problem to children acquiring Turkish as a first language. However, as noted by Johanson (2002: 84–85), even non-Turkic languages that have been in a situation of intense contact with a Turkic language and have undergone significant influence from that Turkic language never go to the extent of borrowing this extreme kind of “agglutination” (i.e. accumulating sequences of affixes). This does, incidentally, suggest an interesting facet of the general notion of complexity in linguistics. While one can define general formal measures of structural complexity, this is no necessary guarantee that phenomena measured as highly complex by such criteria will prove “difficult” in all practical situations. However complex Turkish agglutination may be, children do not find its acquisition difficult under conditions of first language acquisition. 
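The step-by-step build-up of indirilemiyebilecekler described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suffix shapes are copied from this one word, vowel harmony is not modelled, the function names are hypothetical, and the raising of e to i is applied to any e-final stem, which is a simplification of the actual phonological rule.

```python
# Suffix shapes are copied from the single word analysed in the text.
# Vowel harmony is deliberately not modelled, so this list is only valid
# for front-vowel stems such as "in"; the function names are illustrative.
SUFFIXES = [
    ("dir", "causative"),
    ("il", "passive"),
    ("eme", "impotential"),
    ("ebil", "potential"),
    ("ecek", "future"),
    ("ler", "third person plural"),
]

VOWELS = set("aeıioöuü")


def attach(stem, suffix):
    """Attach one suffix, applying the two adjustments mentioned in the text."""
    if stem[-1] in VOWELS and suffix[0] in VOWELS:
        # Raise a stem-final e to i, then insert y to break the vowel hiatus.
        # Assumption: applied to every e-final stem, which oversimplifies Turkish.
        if stem.endswith("e"):
            stem = stem[:-1] + "i"
        return stem + "y" + suffix
    return stem + suffix


def derive(root, suffixes=SUFFIXES):
    """Return every intermediate form, from the bare root to the full word."""
    forms = [root]
    for shape, _label in suffixes:
        forms.append(attach(forms[-1], shape))
    return forms


for form in derive("in"):
    print(form)
```

Each intermediate form is produced by the same concatenation step, which is the sense in which the word is long but entirely regular.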
Second, the morphological system of a language may be complex in that it fuses the expression of a number of semantic oppositions together in a single morpheme. We can illustrate this by comparing the formation of feminine and plural forms of the adjective caro ‘dear’ in Spanish and Italian — see table 1; in the masculine singular, used as the citation form, the word happens to be the same in both languages. Both languages have a number
opposition in adjectives between singular and plural, and a gender opposition between masculine and feminine. In Spanish, the combined expression of number and gender is agglutinative, in that one can identify one suffix position (immediately after the stem) that marks gender, masculine -o versus feminine -a, and another, at the end of the word, that marks number (zero for singular versus -s for plural). In Italian, by contrast, there is no single suffix that we can identify as masculine or feminine, and no single suffix that we can identify as singular or plural; rather we must say that -o is masculine singular, -a feminine singular, -i masculine plural, -e feminine plural. In other words, the expression of gender and of number is fused into a single, formally unsegmentable morpheme. In some languages, the number of categories that can be fused into a single morpheme can be quite large. In Ancient Greek, for instance, the ending -ou in lú-ou ‘ransom!’ expresses middle voice, imperative mood, and second person singular subject. Fusional morphology is rare to non-existent in creole languages.
Table 1 Gender and number forms of caro ‘dear’ in Spanish and Italian
                     Spanish    Italian
masculine singular   car-o      car-o
masculine plural     car-o-s    car-i
feminine singular    car-a      car-a
feminine plural      car-a-s    car-e
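The contrast in Table 1 can also be stated as two small lookup schemes, sketched here in Python. The function and variable names are hypothetical, and the sketch covers only the adjective caro: the Spanish forms are assembled from two independent suffix slots, whereas the Italian forms need one ending per combination of gender and number.

```python
# Agglutinative pattern (Spanish): one slot for gender, one for number.
SPANISH_GENDER = {"masculine": "o", "feminine": "a"}
SPANISH_NUMBER = {"singular": "", "plural": "s"}

# Fusional pattern (Italian): a single ending per gender-number pair.
ITALIAN_ENDING = {
    ("masculine", "singular"): "o",
    ("masculine", "plural"): "i",
    ("feminine", "singular"): "a",
    ("feminine", "plural"): "e",
}


def spanish_form(stem, gender, number):
    """Concatenate the gender suffix and the number suffix separately."""
    return stem + SPANISH_GENDER[gender] + SPANISH_NUMBER[number]


def italian_form(stem, gender, number):
    """Look up the one unsegmentable ending for this gender and number."""
    return stem + ITALIAN_ENDING[(gender, number)]


for gender in ("masculine", "feminine"):
    for number in ("singular", "plural"):
        print(gender, number,
              spanish_form("car", gender, number),
              italian_form("car", gender, number))
```

The Spanish scheme needs two tables of two entries each that combine freely, while the Italian scheme needs one table with an entry for every combination; that difference is what the term fusion refers to.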
Third and finally, morphological complexity may arise because some morphological forms are irregular, i.e. cannot be predicted by means of a general rule. In a reasonably strictly agglutinating language like Turkish, the rule for plural formation is simple: Just add -ler (or its vowel harmony variant -lar). Even in Italian, with fusion, the rule is relatively straightforward: replace singular -o by plural -i, replace singular -a by plural -e. By contrast, in German there are several ways of forming the plural, with for the most part at best statistical probabilities of getting the right form without
knowing in advance: A word may take no suffix in the plural, with sub-divisions according to whether or not the stem-vowel is umlauted (e.g. change from o to ö); a word may take the suffix -e, again with a distinction between words taking umlaut and those not taking umlaut; a word may take the suffix -er (in which case one can predict, mercifully, that if the stem vowel is umlautable, then it will be umlauted); a word may take -en (mercifully, never with umlaut); or it may take -s (again, always without umlaut) — and this still omits a handful of more restricted possibilities, for instance found in loans from Latin and Greek. Creole languages typically lack such complexities in their inflectional morphology, although especially in a continuum with the lexifier language one may encounter such forms taken from the lexifier language, such as Jamaican wos ‘worse’ as the irregular comparative of bad ‘bad’. Before turning to the question of morphological complexity and creoles, it may be worth spending a small amount of time examining how linguists, or at least some linguists, believe morphological complexity arises in languages in general. At least in a large number of cases, it can be shown how bound morphology arises from the reduction of originally distinct words through part of the process of grammaticalization (see, for instance, Heine et al. 1991), whereby items pass from the lexicon into the grammar. A particularly transparent example is provided by comparison of the expression of the comitative relation (‘together with’) in different Balto-Finnic languages, since different languages in this northern European branch of the Uralic family are at different stages of grammaticalization, more specifically at different stages in the reanalysis of a postposition (a separate word) into a suffix. 
In Finnish, one of the most conservative languages here, the comitative relation is expressed by means of the postposition kanssa, which requires the genitive case of the preceding noun, as in sepä-n kanssa ‘with the blacksmith’, literally ‘blacksmith-GENITIVE with’. In Estonian, here one of the more innovative languages, this postposition has become a comitative case suffix, so that the Estonian equivalent is simply sepa-ga, literally ‘blacksmith-COMITATIVE’. By extrapolation, one can hypothesize that bound
morphology, which is known in some cases to arise from the grammaticalization of separate words, quite generally has this origin. Of course, in many instances we have no evidence in favor of (or, indeed, against) this hypothesis, but nonetheless the hypothesis provides a rationale for what is otherwise a puzzling phenomenon, namely the existence of bound morphology. If one accepts this hypothesis, then one thing that follows is that it takes time for bound morphology to arise: The “original” human language would presumably have lacked bound morphology. Once bound morphology developed, however, phonetic attrition could well have removed it again, so that most of the world’s languages are somewhere on some cycle of development and/or loss of bound morphology. I am thus not suggesting that Vietnamese, for instance, has never, throughout its history and that of its ancestors, had bound morphology; indeed, comparison with other languages of the Austro-Asiatic family would suggest that it has lost bound morphology. Complications such as fusion and irregularities can also often be shown to arise through the reanalysis of erstwhile regular agglutinative morphology, so that these may be phenomena that in origin postdate regular, agglutinative bound morphology. For instance, the distinction between singular -o and plural -i in Italian goes back to an earlier distinction between singular -o(-s) and plural -o-i, with the diphthong in the plural coalescing to give a single vowel as early as classical Latin, thus disrupting the original agglutinative picture. One way in which irregularities develop can be seen by returning to another comparison between the phonetically and morphologically in general more conservative Finnish and more innovative Estonian. In Finnish, the genitive of nouns is formed in general simply by suffixing the ending -n, so that satama ‘harbor’ has genitive satama-n, suitsu ‘smoke’ has genitive suitsu-n. 
Estonian has lost final vowels, and has also lost final n, but the order of the historical processes means that the loss of final vowels was no longer a productive process when the final n had been lost, so that vowels that become final because of loss of final n remain. The Estonian equivalents of the Finnish word forms just cited are sadam,
genitive sadama, suits, genitive suitsu. One result of this is that, in order to form the genitive in Estonian, there is no way of predicting which vowel will be used, and one simply has to learn that sadam belongs to the “a-declension” and suits to the “u-declension”. For completeness, I should point out that there is a competing explanation for the mix of regularity and irregularity that one finds in the morphologies of most human languages, one that goes back at least to ideas of Otto Jespersen’s, but has been developed most explicitly in recent years by Alison Wray; see, for instance, Wray (1998). According to this approach, utterances in the earliest human language would have been holistic, somewhat like interjections, where ouch is not decomposable into meaningful elements in the way that I feel pain is; in a sense, this is the opposite of the line I have been following in this paper. Subsequently, similarities in form between such holistic utterances also sharing similarities in meaning would have led to abstraction of form–meaning similarities and the reinterpretation of holistic utterances into sequences of signs combining form with meaning, combinatorial both on formal and semantic levels. I have no qualms with accepting this as one kind of development that may have taken place in the history of human language, somewhat akin to folk etymology, although I remain skeptical of accepting it as the basic way in which articulated language arose, because of the sheer amount of coincidence that would be needed to get from a reasonably expressive holistic language to a reasonably articulated language.
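Returning to the Finnish and Estonian genitives discussed earlier in this subsection, the role of the ordering of the two sound changes can be sketched in Python. This is a rough illustration under loud assumptions: the Finnish forms are used as stand-ins for the earlier stage, only the two losses mentioned in the text are modelled (so the output has satam where Estonian has sadam), and the function names are hypothetical.

```python
VOWELS = set("aeiouäöüõ")


def drop_final_vowel(word):
    """First change in the text: loss of word-final vowels."""
    return word[:-1] if word and word[-1] in VOWELS else word


def drop_final_n(word):
    """Second change in the text: loss of word-final n."""
    return word[:-1] if word.endswith("n") else word


def apply_in_order(word, rules):
    """Apply each sound change once, in the order given."""
    for rule in rules:
        word = rule(word)
    return word


# Order described in the text: vowel loss has stopped operating by the
# time final n is lost, so the vowel exposed by n-loss survives.
ATTESTED_ORDER = [drop_final_vowel, drop_final_n]
# Hypothetical reverse order, shown only for contrast.
REVERSE_ORDER = [drop_final_n, drop_final_vowel]

for nominative, genitive in [("satama", "sataman"), ("suitsu", "suitsun")]:
    print(apply_in_order(nominative, ATTESTED_ORDER),
          apply_in_order(genitive, ATTESTED_ORDER),
          apply_in_order(nominative, REVERSE_ORDER),
          apply_in_order(genitive, REVERSE_ORDER))
```

Under the order described in the text, nominative and genitive stay distinct but the genitive vowel can no longer be predicted from the nominative; under the hypothetical reverse order the two forms would simply fall together.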
2. Creole Origins and Complexity
In this section, I want to examine three major current approaches to the origin of creoles with respect to their relevance for issues of complexity. A particular impetus for this investigation is the publication of McWhorter (2001), which confronted the linguistic world squarely with a hypothesis that the grammars of creole languages are the world’s most simple grammars; I turn to these
particular claims in subsection 2.1. Before doing this, it is perhaps worth clarifying one point. Until well into the second half of the twentieth century, creole languages were typically despised, being not only neglected by and large by linguists but also considered by the authorities in countries where they are spoken, often considered even by their native speakers, to be debased varieties of the respective lexifier languages. The work of the last half century has shown, however, that creole languages present many points of crucial interest to linguists, indeed to linguists of very different persuasions, ranging from the most formal grammarians to those who place most emphasis on social factors, not to forget typologically and functionally inclined grammarians along the way. The claim that creole grammars are the world’s simplest grammars has been interpreted by some as a return to those old ways of thinking, an interpretation perhaps not significantly countered by John McWhorter’s general reputation as one of the least “politically correct” commentators in the Afro-American community. McWhorter is well able to defend himself, but let me emphasize for my own purposes that the discussion of complexity in regard to language does not entail any disparagement of creole language. All languages used as first languages in speech communities have to be sufficiently complex to carry the full range of functions that are required of human language, including not only communicating complex ideas to others but also formulating complex ideas for oneself, or more generally coming to terms with the world. Creole languages are clearly no worse at doing this than any other language. Perhaps part of the problem is an identification of “more complex” with “better”, but a moment’s thought will show such an identification is far from necessary, indeed is in many instances clearly incorrect. 
In an ideal world, a language ought to be just as complex, and no more, as is necessary to carry out its communicative and cognitive functions. Of course, working out just what degree of complexity this entails is no easy matter, if indeed it is possible at all, but we can certainly point to features of non-creole languages that are not required by this specification, in particular
abundant arrays of morphological irregularities of the kind illustrated above for German plural formation. (Such “unnecessary” complexities may, however, play an important role in socialization. Children are clearly capable of acquiring such complexities in first language acquisition, so their successful acquisition is an important emblem of full socialization into the community. This factor may even be enhanced by the fact that it is notoriously difficult for second language learners successfully to master such complexities in their entirety. Such complexity may therefore serve a “shibboleth” function.)
2.1 Creole grammars as the world’s simplest grammars
While the origin of lexical items in creole languages is usually rather straightforward, coming from the lexifier language, one of the main controversies in current creole studies concerns the origin of the grammar of creole languages. In principle, at least three sources should be considered: the lexifier language, the substrate language (or languages), and universal principles. Although deriving creole grammars from the lexifier language does have some supporters, most notably Robert Chaudenson, the grammars of most creoles (with the possible exception of some that have been in intense contact with their lexifier language and are arguably not, or no longer, typical creole languages) are radically different from those of any metropolitan variety of the lexifier language, so that most linguists working on creole languages have sought the genesis of at least most of creole grammar elsewhere. In subsection 2.3, I will examine one current account that would derive creole grammar from the substrate. The present subsection is the first of two examining claims that would derive creole grammars primarily from universal principles. McWhorter (2001) is the lead article in an issue of the journal Linguistic Typology that is otherwise devoted to responses, including quite critical responses, to the lead article. Although it does not dwell in detail on problems of creole genesis, it does argue that creole grammars are “young” grammars, which excludes the
possibility that they are taken, in whole or in large part, either from the lexifier or the substrate language(s), leaving as the only viable possibility that they arise primarily through universal principles, even if these universal principles are not explicitly set out. Somewhat more specifically, McWhorter accepts the account of the origin of creoles, or at least some creoles, via pidgins, with pidgins being minimal communication systems stripped of everything that is not essential to practical communicative concerns, used typically in a limited set of situations, and which moreover have not yet had a sufficiently long history to develop the “‘ornament’ that encrusts older languages” (McWhorter 2001: 125) — most of the canonical creoles arose in the seventeenth and eighteenth centuries, with some, such as Hawaiian Creole English, even later (early twentieth century). Morphology is one of the areas where McWhorter’s hypothesis makes particularly good predictions. Although, as we have seen above and will see again below, it is not quite true that creoles lack all morphology, they are, in comparison with world standards, very much at the low-morphology end of the spectrum, having typically little to no inflectional morphology and quite limited derivational morphology (“limited” in the sense that there are few derivational affixes, although the derivational affixes that are found in a creole may be very productive). All of this fits in well with the account of the origin of morphological complexity set out in subsection 1.2. Bound morphemes arise, initially as part of an agglutinating structure, on the basis of the grammaticalization of separate words. Subsequent phonetic processes and reinterpretations may lead to the rise of more complex relations between the form and meaning of morphological elements, such as fusion and irregularities, but this requires time. 
Such complexities are absent or virtually absent in creoles: McWhorter (2001: 141) notes that Saramaccan has only one morphological irregularity: The imperfective marker tá is realized as nan- before the verb gó ‘go’, as in mi tá wáka ‘I am walking’ but mi nangó ‘I am going’. Perhaps not surprisingly, much of McWhorter’s paper is concerned with morphology and issues closely intertwined with morphology.
When one turns to other areas, the picture is arguably less clear, as we will see in subsection 2.2 and especially in subsection 2.3. The brief discussion of example (1) in subsection 2.3 will point out a complex syntactic feature of Haitian Creole that seems to be taken directly from the main substrate language, Fongbe, as is also discussed by Lefebvre in her response to McWhorter’s position paper in Linguistic Typology 5.2/3. Subsection 2.2 will discuss a semantic distinction between realis and irrealis complementizers that is widespread in creole languages but absent, as a strictly grammatical distinction, in a number of other languages, including Mandarin Chinese. With regard to morphology, I agree with McWhorter that creole languages are demonstrably simpler than the world norm and that this is because they started out with little or no morphology and have not had time to develop any significant amount of morphology. But whether this carries across to other areas of the grammar is less clear to me, and subsection 2.3 will present an account of creole genesis that predicts morphological simplicity without necessarily predicting simplicity in other areas of the grammar.
2.2 The bioprogram hypothesis
Bickerton’s bioprogram hypothesis was for several years the dominant approach to the question of creole genesis, and even if more recent years have seen the upsurge of alternative accounts, including accounts that combine different individual approaches, the bioprogram hypothesis remains important not only for historical reasons, but also for the insights it gives into such issues as complexity and creole genesis, both to linguists who are favorable to the hypothesis and to those who reject it or at least are skeptical. In what follows I will outline the hypothesis (for a more detailed exposition by the author, reference may be made to Bickerton (1984); the relevant issue of Behavioral and Brain Sciences also includes reactions by others to Bickerton’s position paper) without going into too much detail as to whether particular claims are
correct or not. The interest is rather in what light this approach throws on questions of complexity in creoles. The bioprogram hypothesis relates specifically to the development of the grammar of creoles. It is rooted firmly in a generative approach to language acquisition, more specifically the Principles and Parameters version of generative grammar, which assumes that the child brings to the task of language acquisition certain quite specific “innate ideas” that guide the acquisition process. More specifically, the child comes equipped with certain principles which hold across all languages, so that the child knows that whatever language s/he is faced with, that language must adhere to these principles. Then there are a set of parameters on which languages can vary. For instance, we know that languages can vary in terms of whether they have basic head-final word order (with, for instance, the object before the verb, the possessor before the head noun, as in Japanese) or head-initial word order (with, for instance, the object after the verb, the possessor after the head noun, as in Arabic). Thus the headedness order parameter would have two values, head-final and head-initial. The child would bring to the task of language acquisition the knowledge that this is a relevant parameter and that this parameter has these two values, but data from the input language would be needed in order to determine which of the two values a particular language has. A child growing up in a Japanese-speaking community would have to set the parameter to head-final; a child growing up in an Arabic-speaking community would have to set the parameter to head-initial. The bioprogram hypothesis takes this one step further by saying that for a given parameter, one setting is unmarked, i.e. this is the setting the child will select as default in the absence of evidence to the contrary. 
Let us suppose — and I emphasize that this is an arbitrary choice for expository purposes — that head-final is unmarked. Then the child would enter the relevant part of language acquisition assuming that the language to be acquired is head-final. If the language is Japanese, no problem, and the parameter remains set to that value. But if the language is Arabic, then data will contradict the unmarked setting, which will therefore be reset to the
marked value, head-initial. One might ask what evidence there could be for the unmarked setting, given that the child will always end up with the setting that is specified by the community language as spoken by adults. One source of evidence might be mistakes made by children: Thus, if head-final is the unmarked setting, one might assume that children acquiring Japanese as a first language would never make mistakes in this respect, while those acquiring Arabic might go through a stage of believing that Arabic is head-final, until they encounter crucial data that forces them to switch parameter settings. Another piece of evidence would be creole grammars, though only under the assumption that creole grammars arise not from the lexifier language nor from the substrate language(s), but rather on the basis of the interaction of universal principles of first language acquisition with input that is so varied as to be effectively useless as a model for the child; the hypothesis of creoles developing from grammatically unstable pidgins would be the most obvious hypothesis to link in this way with the bioprogram hypothesis. The child is faced with input that is so variable that the child has no solid evidence on how to set parameters. As a result, parameters remain set to their default values. Bickerton’s evidence in favor of this approach is a number of similarities that are claimed to occur across all or most creoles — some exceptions would be expected, since language change since the genesis of a particular creole might have changed the setting of some parameters — and which can be attributed systematically to neither the lexifier language nor the substrate language(s). It should be emphasized that under the bioprogram hypothesis, there is no reason to suppose that the unmarked parameter settings should necessarily be less complex than the marked settings, except in a circular way to which I return presently. 
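The default-and-reset logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name set_parameter, the dictionary UNMARKED, and the two thresholds are hypothetical and are not part of Bickerton's proposal, and head-final is treated as unmarked purely for exposition, as in the text.

```python
from collections import Counter

# Hypothetical unmarked (default) settings; head-final is an arbitrary
# expository choice, exactly as in the surrounding discussion.
UNMARKED = {"headedness": "head-final"}


def set_parameter(name, observations, min_share=0.8, min_count=5):
    """Return the setting a learner ends up with for one parameter.

    The learner starts from the unmarked value and resets it only when the
    input gives consistent evidence for another value. Both thresholds are
    illustrative assumptions, not claims made by the hypothesis.
    """
    default = UNMARKED[name]
    counts = Counter(observations)
    if not counts:
        return default  # no input at all: the default stays in place
    value, n = counts.most_common(1)[0]
    share = n / len(observations)
    if value != default and n >= min_count and share >= min_share:
        return value  # consistent counter-evidence: reset to marked value
    return default  # consistent with default, or too variable to reset


# Consistent head-final input (the Japanese case): default is kept.
japanese = ["head-final"] * 10
# Consistent head-initial input (the Arabic case): parameter is reset.
arabic = ["head-initial"] * 10
# Highly variable input (the unstable-pidgin case): default survives.
pidgin = ["head-initial"] * 6 + ["head-final"] * 4

results = {
    "japanese": set_parameter("headedness", japanese),
    "arabic": set_parameter("headedness", arabic),
    "pidgin": set_parameter("headedness", pidgin),
}
```

On this sketch the creole outcome falls out of the third case: input that is too variable never clears the threshold for resetting, so the learner retains the unmarked value whatever its complexity.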
One of Bickerton’s examples of a putative creole universal is a distinction between realis and irrealis complementizers, which can be illustrated by the contrast between Jamaican Creole im gaan go bied ‘he went and bathed’ (“etymologically” ‘him go.on go bathe’) and im gaan fi bied ‘he went to bathe’ (where fi is etymologically from English for). The use
of go in the first example entails that the person in question did bathe (realis), i.e. not only did he go with the intention of bathing, but he succeeded in carrying out this intention. The second only attributes the intention to the person in question, but leaves open whether or not this intention was realized (irrealis); thus, the second example, but not the first, would be compatible with a continuation bot im duon bied ‘but he didn’t bathe’ (“etymologically” ‘but him don’t bathe’). A distinction of this kind is certainly not a linguistic universal, so that, for instance, Mandarin Chinese would most naturally translate both constructions in the same way with a simple sequence of verbs. So if it is true that this distinction holds of all creoles, including creoles that can be shown to have completely different origins, and is found to be absent in some (and especially if in many, or most, or nearly all) other languages, then this would be strong evidence for the bioprogram hypothesis, with the more specific claim that expression of this particular semantic distinction is the unmarked parameter setting. More generally, creoles would be expected to have, or at least originally to have had, all unmarked parameter settings, while other languages, separated by millennia from their most remote ancestor, might happen to have the unmarked parameter setting, but might also, as the result of historical changes to parameter settings, have developed marked parameter settings which would then be acquired by future generations of children on the basis of explicit positive evidence for the marked parameter setting. 
Being required to indicate grammatically whether an intention was realized or whether this possibility is simply left open is arguably a complexity in the grammar of a language, since the same information could also be communicated more directly by lexical means, like the bot im duon bied that can be attached to the second Jamaican Creole example given above, and Bickerton makes clear that his hypothesis neither in principle nor in practice carries any assumption that the unmarked parameter setting will be less complex than the marked one. In a sense, the setting of parameters is a biological accident. This does, however, raise a problem that I have skirted so far, namely precisely what interpretation is to be
given to the notion complexity, in particular whether this can be taken as an abstractly identifiable notion irrespective of context or whether it is, rather, context bound, and thus perhaps assuming different values in different contexts. An example from outside language may help to clarify. Let us suppose that one has to descend a 45° incline. Which is more complex, to descend such an incline where steps have been dug into the incline or where the incline remains smooth? The answer may well depend on who or what is making the descent. If a healthy human being has to walk down the incline, then steps will almost certainly be the choice; indeed, one of the first things that humans do to natural steep slopes to make the lives of their fellow humans easier is to dig steps into them. On the other hand, if the descent is to be made by a machine, then the steps are almost certainly going to be a hindrance: It is difficult and expensive, for instance, to design a robot that can handle steps, and much easier to fit it with wheels. Returning to our example of parameter setting under the bioprogram hypothesis, one might argue that for a creature equipped with the unmarked parameter setting “distinguish realis and irrealis complementizers”, the Jamaican Creole pattern will actually be less complex than the one found in Mandarin Chinese. Effectively, this means that one is left with a choice, under the bioprogram hypothesis, of assuming either that some unmarked parameter settings are more complex than their alternatives (based on a context-free assessment of complexity), or that the question is devoid of empirical interest, because the unmarked parameter setting (however counterintuitive it might seem to a disinterested observer) is necessarily less complex because it is what children default to.
In practice, Bickerton opts for the former interpretation, in fact noting explicitly that some of his unmarked parameter settings, as found in creole grammar, are more complex than their alternatives. Although this approach could be extended to morphology, for instance by assuming that absence of morphology is an unmarked parameter setting and will therefore be adopted by the child in the absence of positive indication to the contrary, in fact Bickerton does not take any stand on this issue; certainly he does not include
absence of morphology under his list of unmarked parameter settings (which is not, however, claimed to be exhaustive); rather, absence of morphology, like the prevalent Subject–Verb–Object order found in creoles, is left in abeyance, perhaps to be accounted for by general principles of communicative pressures that fall outside the specific claims of the bioprogram hypothesis. In recent years, the bioprogram hypothesis has come in for a fair amount of criticism, and it certainly no longer holds the near-dominant position in studies of creole genesis that it once did. Skepticism (and I have not tried to hide the fact that I share much of it) concerns both the empirical claims (e.g. whether particular grammatical properties really are as widespread in creoles as Bickerton claims) and the more general, Chomskyan innatist framework within which the bioprogram is embedded. But skepticism about the particular version of the bioprogram espoused by Bickerton should not be taken as a general rejection of all conceivable approaches that would include, inter alia, the application of universal principles (whether of first language acquisition or by second language learners) in the genesis of creoles. I hope, however, to have shown that the bioprogram hypothesis is an intellectually interesting claim that has repercussions, perhaps not quite as clear as one would have hoped, for the issue of complexity and creole genesis.
2.3 Relexification as a cognitive process
The approach to creole genesis developed in greatest detail by Lefebvre (1998) looks specifically at the development of Haitian Creole, although with claims that go well beyond the specific analysis of Haitian Creole. It will nonetheless be useful to outline the demographic circumstances under which Haitian Creole is believed to have developed. Haiti developed as a typical plantation settlement, with the lexifier language being clearly French, the language of the plantation owners and their colonial administration. The linguistic composition of the slave population that developed the creole is reasonably well understood, with a preponderance of
speakers of Fongbe, a language of the Kwa subgroup of the Niger-Congo family from West Africa; more specifically, Fongbe is spoken in what is now Benin (formerly Dahomey), and forms part of the large Gbe cluster of languages spoken in Ghana, Togo, and Benin, and whose other best-known member is Ewe. Thus, claims about substrate influence from African languages on Haitian Creole can be pinpointed quite accurately: We would be looking primarily for substrate influence from Fongbe. And a crucial point of Lefebvre’s analysis is that the influence of the Fongbe substrate is pervasive in Haitian Creole, in particular in its grammar (and also, as we will see in passing, in the structure of its lexicon, though not in the actual shape of lexical items). Lefebvre argues that Haitian Creole is essentially Fongbe relexified by means of the forms of French lexical items. More specifically, as speakers of Fongbe attempted to come to terms with the new language, French, to which they were only very incompletely exposed, they did so by taking the basic structure of Fongbe and replacing its lexical items by lexical items from French. The net result of this is a language, Haitian Creole, that has lexical items that are French in origin, at least as regards their form, though the semantic range often corresponds rather to that of Fongbe (thus Haitian Creole plim covers both ‘feather’ and ‘hair’, like Fongbe fún, but unlike French plume, which is just ‘feather’; Lefebvre 1998: 71), but that has the grammatical structure of Fongbe. It should be emphasized that under Lefebvre’s treatment, the Fongbe-like properties of Haitian Creole can be quite specific, and thus not necessarily similar to patterns found in other West African languages, as in the structure of a noun phrase like (1), where the Haitian Creole structure almost exactly parallels that of the Fongbe (Lefebvre 1998: 78); I return below to the discrepancy at the ‘GENITIVE’ position.
(1)  Haitian Creole:  krab   mwen  Ø         sa             a           yo
     Fongbe:          àsɔ́n   nyɛ̀   tɔ̀n       élɔ́            ɔ́           lɛ́
                      crab   me    GENITIVE  DEMONSTRATIVE  DETERMINER  PLURAL
     ‘these/those crabs of mine (in question/that we know of)’
It is important to note that the structure of (1) is very specific to Fongbe, in particular with the combination of demonstrative and determiner, a construction that is quite impossible in French (or English); a more literal translation of (1), disregarding the word order, into English would be *the my these crabs (French *les mes ces crabes), with the asterisks indicating that such structures are impossible in these languages. Under the relexification hypothesis, speakers of Fongbe would have replaced each of the lexical items in the Fongbe version by what they took to be the closest French lexical item, while retaining the grammatical structure of Fongbe. It will be noted that in (1) there is one exception to the relexification process as described so far, namely there is no equivalent in the Haitian Creole version to the Fongbe genitive marker. Why should this be so? Again, it is necessary to think closely about how relexification would have taken place “on the ground”, bearing in mind that speakers of Fongbe had only limited exposure to French. This exposure was sufficient for them to identify French major-category lexical morphemes, and where such morphemes were available they were used. Note, for instance, that the Fongbe plural marker was relexified using the same element as is used for the third person plural pronoun ‘they’, namely yo, which derives ultimately from the French stressed third person masculine plural pronoun eux (phonetically [ø]). Where no French lexical item could be found corresponding to a Fongbe morpheme, the relexification process simply gives rise to a zero. Fongbe speakers could find no French lexical item sufficiently close to their genitive marker, so they relexified this morpheme with zero, as can be seen in the correspondence in (1). Those who know French might object that French does have a perfectly good correspondent to the Fongbe genitive marker, namely the preposition de ‘of’.
However, this French preposition is not phonetically salient, and moreover has a number of other functions, for instance as a partitive article, so that French de l’eau can mean not only ‘of the water’ but also ‘some water’ (the French morphemes in this example are de ‘of’, l’ ‘the’ [form used before a vowel], eau ‘water’). In fact, it is generally true that the French preposition de and the French definite article are not
sufficiently salient to have been identified as separate morphemes by speakers of Fongbe; Haitian Creole lexical items sometimes include fossilized definite or partitive articles, which were thus apparently analyzed as an integral part of the lexical item, e.g. dlo ‘water’ (for French de l’eau, phonetically [dlo]), lari ‘street’ (for French la rue, phonetically [lary]), a point to which I return below. The discussion of the preceding paragraph now provides an alternative explanation why creoles typically lack bound morphology. Relexification requires identification of lexical items in the lexifier language that correspond to morphemes in the substrate language, and in situations of limited exposure, it is difficult to impossible to identify bound morphemes, in particular bound inflectional morphemes in the lexifier language. And indeed, Haitian Creole lacks all productive reflexes of French inflectional morphology. So creoles turn out, on this scenario, to have simple morphologies, but not because of any general trend towards simplification; indeed, by world standards the internal structure of a Haitian noun phrase like (1) is relatively complex. Derivational morphology may, however, be more salient, since Haitian Creole does have some productive derivational affixes that are of French origin, for instance the reversive verb prefix de- (French dé- ‘dis-, un-’, in the sense of undoing an action or the opposite of an action), e.g. pasyante ‘be patient’, depasyante ‘be impatient’. Although some of these derivational affixes might conceivably have arisen through later interaction between Haitian Creole and more metropolitan varieties of French, particular formations are found in Haitian Creole that do not occur in French and must therefore have been created productively within Haitian Creole. 
Thus, French has patienter ‘be patient’, but not *dépatienter ‘be impatient’; the closest French verb would be s’impatienter (French impatienter is a transitive verb meaning ‘make impatient’). It is also not excluded that in situations of creole genesis where there was somewhat more exposure to the lexifier language, some salient inflectional morphology might have been adopted. As noted above, Indian Ocean creoles with French as their lexifier, like Seychelles Creole, do have a minimal morphological opposition in some verbs of the type
sât – sâte ‘sing’, representing respectively French chante (inter alia, singular imperative) and chanter (infinitive) or chanté (past participle) or possibly some other inflectional form in -[e]. Moreover, there are attested cases, albeit rare, of creoles carrying over bound morphemes directly from the substrate language, e.g. Berbice Dutch Creole takes its perfective marker -tɛ and its imperfective marker -a(rɛ) directly from Izon (Ijo) (Kouwenberg 1994: 559, 666); it is no doubt significant that Izon (Ijo) seems to have been virtually the sole substrate language in the formation of Berbice Dutch Creole, thus allowing it to have a more significant impact on the formation of the creole. Two final points are in order before leaving relexification. First, although the hypothesis predicts that the creole will follow the grammar of the substrate language down to quite fine details, such as the structure of the noun phrase in (1), it does not exclude the possibility that certain very salient grammatical properties of the lexifier language might find their way into the creole, and Lefebvre (1998: 38–40) indeed hypothesizes that basic word (lexical category) order properties will follow the lexifier language. Thus, Haitian Creole has the order with the interrogative adjective before the head noun in ki mounn ‘which person’ (“etymologically” qui monde) just as French does in quelle personne, which is the opposite of the Fongbe order mɛ̀ tɛ́, literally ‘person which’. A particularly striking example is provided by Berbice Dutch Creole, which has Subject–Verb–Object order, like Dutch (in simple clauses), and unlike what is known to have been the sole substrate language, Izon (Ijo), which, somewhat unusually for a West African language, especially of the Nigerian coast, has Subject–Object–Verb word order.
Second, it is important to distinguish the sense of the term relexification used here and an earlier proposal most closely associated with the name of Keith Whinnom, according to which the similarities among creoles are to be attributed to the hypothesis that they all or nearly all go back historically to a Portuguese-based creole (i.e. creole with Portuguese as its lexifier), in which words of Portuguese origin were subsequently replaced by lexical items taken
from other European languages (English, French, Dutch, Spanish — though it can often be difficult to tease apart Portuguese and Spanish as lexifier languages). The detailed comparison of grammatical features in Fongbe and Haitian Creole as illustrated in (1), especially if confirmed from studies of other creoles and different substrate languages, is an argument against Whinnom’s approach.
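The relexification procedure described in this subsection (in Lefebvre's sense, not Whinnom's) can be sketched as a simple mapping, using the noun phrase in (1) as data. This is a toy illustration, not Lefebvre's formalization: the function name relexify is hypothetical, and substrate morphemes are represented by their glosses from (1) rather than by their Fongbe forms.

```python
ZERO = "Ø"  # a substrate morpheme for which no lexifier form was identified


def relexify(substrate_morphemes, lexicon):
    """Relabel each substrate morpheme with a lexifier-derived form.

    Word order is carried over unchanged from the substrate; a morpheme
    with no identifiable lexifier counterpart is realized as zero.
    """
    return [lexicon.get(morpheme, ZERO) for morpheme in substrate_morphemes]


# Substrate noun phrase from (1), in Fongbe order, represented by glosses.
fongbe_np = ["crab", "me", "GENITIVE", "DEMONSTRATIVE", "DETERMINER", "PLURAL"]

# Lexifier-derived forms attested in the Haitian Creole line of (1).
# GENITIVE has no entry, reflecting the claim that French de was not
# salient enough to be identified as a separate morpheme.
french_derived = {
    "crab": "krab",
    "me": "mwen",
    "DEMONSTRATIVE": "sa",
    "DETERMINER": "a",
    "PLURAL": "yo",
}

haitian_np = relexify(fongbe_np, french_derived)
```

The sketch makes the division of labor explicit: the order and inventory of positions come entirely from the substrate, while the lexifier contributes only phonological forms, and gaps in what learners could identify surface as zeros.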
2.4 Reinterpretation of complexity in the lexifier language
In the last subsection of this section, I would like to discuss one specific way in which one sees morphological simplicity in creole languages, namely the way in which morphologically complex forms in the lexifier language are reanalyzed as morphologically simple forms in the creole. My first example is from Jamaican Creole. In the lexifier language, English, plurals are formed in a number of ways, by far the most frequent being the suffixing of -s, as in shoe, plural shoes. Jamaican has no reflex of this as a means of marking plurals. (Jamaican can form plurals of definite nouns by attaching the postposed element dem, as in di bwai dem ‘the boys’, which is “etymologically” ‘the boy them’ — compare the essentially identical strategy noted above for Haitian Creole; dem is also the third person plural pronoun, irrespective of syntactic function. When Jamaican Creole is written, the pluralizer dem is often written as a suffix, attached to the preceding noun by means of a hyphen, though the reason for this remains unclear to me. In any event, the pluralizer is a transparent grammaticalization of the plural pronoun.) However, basilectal Jamaican Creole does have reflexes of both shoe and shoes, namely shuu and shuuz. But in Jamaican Creole, these are distinct lexical items, with quite different meanings. The item shuuz denotes a human’s shoe, while the item shuu denotes a horseshoe; if one wants to refer to one human shoe, one says wan shuuz, and if one wants to refer to a defined group of horseshoes, one says di shuu dem, i.e. these nouns are, like all nouns in Jamaican Creole, number neutral. The reason for the differentiation is not hard to see:
Humans normally buy shoes in pairs, so those who created the creole would only have heard standard English plural shoes in this kind of situation. By contrast, it is normal to give a horse one new horseshoe if only one shoe is worn or has come loose, so English singular shoe would have been heard in this environment. Similar examples are found in other creoles. In Haitian Creole, for instance, the word for ‘water’ is dlo, cognate with standard French de l’eau, where the partitive article, here in the form de l’, means ‘some’, but is almost obligatory with a mass noun like eau if there is no other article. The pronunciations of the standard French and Haitian Creole versions are essentially identical, and it is probably only traditions of French spelling that might lead one to consider the standard French item to be two or three words, rather than one morphologically complex one. However, just as English bimorphemic shoes was taken over into Jamaican Creole as a single morpheme, so French dlo was analyzed as a single morpheme in Haitian Creole. Crucially, both shuuz and dlo are single morphemes within the systems of the respective creole languages, and there is no evidence that speakers of the creole ever analyzed them in any other way. But this does bear on the question of creole origins. It is evidence against the theory according to which creoles arose through simplification of the lexifier language by speakers of that language, the so-called foreigner talk theory of creole genesis. If this were the origin of creoles, then those who developed the creole would never have been exposed to morphologically complex forms like English shoes or French de l’eau. But from the fact that these forms surface in the creoles, we know that they must have been exposed to them. The exposure was not, however, sufficient for them to analyze them “correctly” from the viewpoint of the lexifier language. 
This suggests overall, then, that the populations that devised the creole were exposed to the lexifier language in a reasonably unmodified form, had sufficient exposure to extract lexical items (and probably at least a few salient grammatical features, such as basic word order), but not sufficient to be able to extract the morphological analyses appropriate in the lexifier language.
3. Conclusions
Although, given my brief at the workshop in Hong Kong on which this volume is based, I have concentrated on creoles and complexity in language, there are other instances of languages created, as it were, anew that might be relevant to our present concerns and that should also be examined by those interested more generally in the development of complexity in natural language. One area that has gained much interest in recent years is sign languages, or more specifically the sign languages used by deaf communities. These are fully fledged languages used for the same range of purposes by individuals in the respective communities as spoken languages are used in theirs. In many cases, we know that deaf sign languages have very short histories, comparable to those of, for instance, plantation creole languages, or in some instances even shorter, as in the case of Nicaraguan Sign Language, which arose during the last decades of the twentieth century. One could also consider the achievements (and limits on these achievements) by adult second language learners who arrive as immigrants in a foreign country, have limited access to the language of that country, but are nonetheless forced to acquire some practical degree of communicative ability with speakers of that language. Among the best studied examples here is the limited acquisition of German by so-called Gastarbeiter ‘guest workers’ — the morphological complexity of standard German is certainly one of the things that is largely sacrificed in the resulting second-language variety, although some bound morphology is nonetheless typically present (Klein and Perdue 1997). 
Finally, one might consider first language acquisition, especially the extent to which children create structures that are not admissible in the target adult language to which they are exposed, though here care is needed: The abilities that children bring to the language learning task and their social environment are in some, perhaps crucial, respects different from those of the other language creators we have been considering. For instance, as noted above, children acquiring Turkish as a first language have little difficulty with the long strings of agglutinatively attached morphemes that
give so much trouble to second language learners and seem so recalcitrant to borrowing into other languages; indeed, children quite generally find first language acquisition easy! Moreover, children, unlike the inventors of deaf sign languages or of fossilized second language varieties, do have extended access to a target which they eventually, in fact quite soon, reach.
Returning to the issue of creole genesis, I hope at least to have shown that creoles can play an important role in discussions of the development of complexity in natural language. The precise nature of this role and its precise importance for bigger issues, like the development of complexity in human language from its ultimate origin, depend at least in part on which particular theory of creole genesis one accepts, or on the proportions in which one combines these theories, always bearing in mind that the combination may be different for different creoles. McWhorter (2001) has the merit of opening up a serious discussion of complexity in relation to creoles, so that whatever the final outcome of this debate (and one must acknowledge that the responses published in the same issue of Linguistic Typology as McWhorter’s article tend to be more critical than supportive), we can expect to end up wiser with regard to the possible relevance of creole languages for our understanding of language complexity and more specifically for the origin and early development of complexity in human language.
Although I have not tried to impose a particular conclusion on the reader, my own sense is that in at least some significant respects creole languages are on the whole less complex than the general run of human languages, that to a certain extent this is because creole languages have not had the time-depth to develop some of these complexities, but that the reason for the absence of these complexities in creole languages is a question that still merits further study.
References
Bickerton, Derek. (1984) The language bioprogram hypothesis. Behavioral and Brain Sciences 7, 173–221.
Comrie, Bernard. (1992) Before complexity. In John A. Hawkins and Murray Gell-Mann (Eds.), The evolution of human languages (pp. 193–211). Redwood City, CA: Addison-Wesley.
Heggarty, Paul Andrew. (2001) Quantification and comparison in language structure – an exploration of new methodologies. PhD thesis, University of Cambridge. [Revised version entitled Measure language: putting numbers on language similarity – from first principles to new techniques. Oxford: Blackwell.]
Heine, Bernd, Ulrike Claudi, and Friederike Hünnemeyer. (1991) Grammaticalization: A conceptual framework. Chicago: University of Chicago Press.
Holm, John. (1988–1989) Pidgins and creoles (2 volumes). Cambridge: Cambridge University Press.
Johanson, Lars. (2002) Structural factors in Turkic language contacts. Richmond, Surrey: Curzon.
Klein, Wolfgang, and Clive Perdue. (1997) The basic variety (or: couldn’t natural languages be much simpler?). Second Language Research 13, 301–347.
Kouwenberg, Silvia. (1994) A grammar of Berbice Dutch Creole. Berlin: Mouton de Gruyter.
Lefebvre, Claire. (1998) Creole genesis and the acquisition of grammar: The case of Haitian creole. Cambridge: Cambridge University Press.
McWhorter, John. (2001) The world’s simplest grammars are creole grammars. Linguistic Typology 5, 125–166.
Wray, Alison. (1998) Protolanguage as a holistic system for social interaction. Language and Communication 18, 47–67.
Index
A accessible polysemy, 14, 443–444 accuracy, 216, 223–224, 227–228, 239, 313 adaptive agents, 413, 419 adaptive benefit, 50, 75, 81 adjective, 62, 402, 481–482, 487, 491, 504, 521 affixes, 16, 362, 503, 511, 520 Afro-Asiatic, 399, 404–405 agent-based simulation, 76, 83 agglutination, 16, 504 agglutinative morphology, 507 see also Turkish alarm calls, 58, 336 allocentric perspective, 113 Altaic, 342, 345, 350, 356, 365–367, 397, 403 ambiguity, 15, 121, 130, 168, 247, 249, 371, 374, 406, 432, 439–441, 443–444, 447, 458, 472 ambiguity advantage, 15, 440, 443–444, 447, 458 Amerind, 11–12, 341, 345, 347–354, 357–359, 361–362, 364, 368, 399, 401, 406 amino acid, 4, 30–31, 33–34, 39 analogical extension, 319 analogical leveling, 9, 319–320 ape, 52, 57, 59, 71, 81, 394 arcuate fasciculus, 54 argument structure, 64, 135 arrow of time, 394, 402, 406, 409 artificial neural network, 5 see also connectionist network; neural network aspect, 14, 26–27, 50, 83, 115, 172, 180, 189, 197, 207, 393, 481, 490, 499, 502 Australopithecines, 338 autism, 44, 69 axon, 74
B back-propagation, 212, 222, 234 balanced reciprocal translocation, 29–30, 32, 37 basic word order, 523 basilect, 497 bathe, 515 behavioral, 29, 49, 51, 54–55, 70, 72–74, 193, 395, 399 behavioral adaptation, 50, 70 Bernstein corpus, 263, 266–267, 272–273, 289, 292, 294 see also CHILDES corpus bifurcation, 192 binding, 6, 31, 35, 37, 121–125, 127, 137 Bioprogram Hypothesis, 17, 512–517 bipedalism, 48 blind mole rat (Spalax ehrenbergi), 73 body maps, 96, 105, 110 bonobo (Pan paniscus), 55, 154 Kanzi, 21, 55, 60, 154 see also chimpanzee bootstrapping problem, 134 borrowing, 16, 69, 347, 352, 354, 357, 403–404, 407, 504, 525 bound morphemes, 262–263, 274–277, 280–282, 284–285, 294, 504, 511, 520
bound morphology, 502, 506–507, 520, 524 brachiation, 52 brain, 5, 10, 17, 22, 26–27, 30–31, 33, 35–36, 47–49, 71–76, 79, 82–83, 105, 110, 115, 139–140, 190, 325, 332, 338, 512 arcuate fasciculus, 54 basal ganglia, 26, 105 brain mass, 115 brain size, 5, 48, 71, 73–76, 79, 82–83, 87, 338 brain stem, 53 brain volume, 338 Broca’s area, 27, 190 caudate nucleus, 27 cerebellum, 105 cerebral cortex, 53 cingulated cortex, 27, 53 dorsolateral prefrontal cortex, 116 frontal lobe (frontal cortex), 26, 70, 106–107, 116 gray matter (grey matter), 26–27, 74 inferior frontal (gyrus or cortex), 27, 54, 82, 96 left frontal opercular region, 27 left hemisphere, 22 lentiform nucleus, 27 neopallial cortex, 32 occipital lobe, 101 parietal, 101, 110, 112, 150, 183 prefrontal cortex, 53, 70, 110, 116, 144 premotor cortex, 101, 105, 149, 200 putamen, 27 regulatory pathways, 36 right frontal, 107, 116 superior temporal gyrus, 54 temporal lobe, 30, 101 Wernicke’s area, 54
working memory area, 96, 110 white matter, 74 bridging rule, 412, 423, 431 Broca’s aphasia, 67 Brownian motion, 179 bucket brigade algorithm, 421 building blocks, 413, 420, 424–429
C case marking, 64, 118, 159, 192 categorical perception, 56–57 categories, 57–58, 62, 255, 334, 336, 483–485, 505 c-command, 122–124, 138, 149 Celtic, 404–405, 408 child-directed speech, 7–8, 225, 251, 263–264, 267, 293–294 CHILDES corpus, 7–8, 214, 263 see also Bernstein corpus chimpanzee (Pan troglodytes), 21–22, 33–34, 55, 59, 80, 120, 428 Sarah, 59–60 chromosome 7, 28–29, 30, 32, 42, 428, 429 see also FOXP2 classification, 10, 31, 230, 232, 341– 342, 345, 347–348, 358, 361, 396, 398–399, 447, 483 see also multilateral comparison, taxonomy classifier systems, 316, 411–413, 419– 422, 424 classifiers, 316–317, 483 Clever Hans, 59 clicks, 12, 407 Clitic assimilation, 127 cochlea, 54 cognates, 11, 342 cognition, 5–6, 13, 50, 61, 82, 84, 95, 97–98, 103, 107, 109, 117, 197, 254, 299, 317–318, 324
cognitive modules, 50 domain-general, 13–14, 50–51, 299, 318 domain-specific, 4, 50, 318, 325 language-specific, 5, 13, 49–50, 68, 154, 334, 429 cognitive plasticity, 23 cognitive potential, 189–191, 194–195 comitative, 506 communication system, 155–156, 158, 370, 496, 511 comparative method, 341, 343–344, 346, 348, 354, 356, 364–367, 401 complementizers, 512, 514, 516 completeness, 216–217, 223–224, 227–228, 508 complex adaptive systems (cas), 13, 411, 413–418, 429 complexification, 13, 15, 405, 408 complexity, 8, 12, 14–17, 22, 36, 39, 41, 47–49, 53, 65, 67, 69, 70–71, 73–76, 79, 81–84, 95, 164, 201, 253, 294, 316, 387, 389–394, 396, 405–407, 409, 414, 445, 465, 467–468, 470–472, 474, 477–478, 480–484, 489–490, 492, 495, 497, 500–503, 505–506, 508–509, 511–512, 515, 517, 522, 524–525 writing system, 160, 168, 472, 476 Kolmogorov, 253, 294 network complexity, 73 potential complexity, 35, 393–394 social complexity, 75, 393 compression, 8, 17, 251, 253–257, 293–294 computational model, 150, 153, 162, 165, 212–213, 233, 308 computer simulation, 5, 76, 174, 179, 312–313, 411, 414
computer-based model, 414 connectionist network, 7, 207, 234, 240–241 see also artificial neural network; neural network constructions, 6, 61–62, 100, 119, 135, 144–147, 154, 159, 425, 515 context units, 213 see also recurrent neural network; simple recurrent network (SRN) conventionalization, 467, 479, 488 co-reference, 6, 121–127, 137–138, 488 cranial capacity, 5, 80 credit assignment, 421–422, 431 creoles, 16, 495–499, 501–502, 506, 508, 510–511, 513–515, 517, 520–521, 523–525 cross-clause co-reference, 488 crossover, 126–127, 428–429 see also mutation cue integration, 7–8, 17, 207, 213, 219, 225–226, 228, 233–237, 239–240 see also multiple cue integration cuing, 59 cultural emergence, 76 cultural evolution, 49, 65, 194 cultural innovation, 160, 165, 167, 169, 172, 189, 191 cultural transmission, 6, 10, 333 cystic fibrosis, 28
D data-driven model, 413 de l’eau, 519, 523 declension, 481–482, 501, 508 default hierarchy, 9, 13, 316, 413, 421– 422, 432–433
default rule, 9, 13, 300, 308–309, 312, 315–316, 321, 323–324, 412–413, 422–423 deixis, 96, 98, 109 demonstrative adjective, 487 derivational morphology, 502, 511, 520 Description Length Gain (DLG), 251, 257, 259, 260, 267–269, 271, 285–287, 290, 292–294 dictionary meaning, 14, 437, 441–443, 448–450, 452, 455–457, 461, 464 diffusion, 3, 7, 168, 179–181, 183–186, 188, 194–195, 323, 350, 357, 492 direct experience, 96–98, 100, 103, 105–109, 116, 132, 139 directionals, 478 discourse processing, 99 dlo, 520, 523 domain-general, 51 domain-specific, 4, 50, 318, 325 dorsolateral prefrontal cortex, 116 Dravidian, 398, 404
E effective complexity, 390–393 egocentric perspective, 110–111, 136 elsewhere condition, 9, 300, 316– 319, 321 embodied cognition, 98–99 emergence, 3–6, 10, 14, 17, 19, 50, 66, 69, 76, 95–96, 153, 155, 157, 159–161, 164–165, 168– 172, 179, 183–185, 187–195, 200, 299, 315, 318, 323, 331, 337–338, 380, 386, 389, 395, 402, 500 see also self-organization, invisible hand
English, 6, 8, 15, 21, 55–56, 61, 64, 108, 113–114, 118, 127, 129, 130–131, 209, 211, 238, 262– 263, 273–275, 298–299, 307– 308, 315–316, 319–321, 332, 334, 379, 402, 408, 437–438, 440–441, 447–448, 452–456, 458–461, 463–464, 474–475, 480–492, 496–497, 499–501, 511, 514, 519, 522–523 Eskimo-Aleut, 350, 353, 398–399, 403 etymologies, 342, 396–397, 399, 401 Eurasiatic, 341, 347–348, 350–351, 353, 362, 364, 397–398, 401, 403–404 exaptation, 190 exception rules, 9, 13, 412, 422 existence-proof model, 413 explanatory model, 48 eye gaze, 237
F facet, 445, 446, 504 Fongbe, 500, 512, 518–519, 521–522 forkhead domain, 33, 35, 37, 39 FOXP2, 3–4, 30–40, 42, 68 see also grammar gene, language gene, SPCH1 frames, 5, 39, 96–97, 103, 111, 113, 134, 136, 191 free morphemes, 274 French, 238, 486, 497–500, 502, 517–523 frequency of meanings, 460 frozen accident, 393 functional magnetic resonance imaging (fMRI), 107 fusional morphology, 16, 505
G gender, 16, 26, 358–359, 361, 404, 482, 486, 489–490, 505 gene, 3–4, 23–24, 26, 28, 30–34, 36–37, 39–40, 68, 86, 183, 206, 311, 333, 338, 363, 386, 429 gene pool, 183 generative grammar, 154, 513 genetic algorithm, 255, 412, 424, 429, 434–435 genetic classification, 346–347, 349, 361, 364 genetic defect, 23 genetic relationship, 11, 341, 345–347, 349, 351, 357, 401 German, 17, 151, 238, 298, 300, 309, 315, 320, 323, 486, 500, 505, 510, 524 gestalt pattern-matching, 55–56 gesture, 132 glottalized consonants, 406 Gold’s Theorem, 66 Government and Binding Theory, 192 grammar, 3–5, 16–17, 26, 47, 59, 61, 63, 65–71, 76, 83, 95–96, 121–122, 127, 135, 139, 201, 206, 219, 229, 252, 255, 267, 349, 380, 395–396, 423, 425, 429, 432–434, 481, 486–487, 489, 492, 497, 506, 510, 512–513, 515–516, 518, 521 grammar gene, 26 see also FOXP2, language gene, SPCH1 grammatical rule, 22, 47, 62, 65, 220 grammatical suffixation, 24 grammaticalization, 16, 191, 439, 506, 511, 522
H habituation, 220 hidden layer, 78–79 hierarchical structure, 10, 62–63 historical linguistics, 10, 161, 197, 341–346, 349, 354, 361–362, 364–365, 398 holistic, 3, 10, 17, 337, 508 holistic utterance, 10, 508 hominid, 5–7, 33, 48, 71, 79–83, 113, 200, 332, 338 Homo erectus, 81–82, 115 Homo ergaster, 166 Homo habilis, 81 Homo sapiens, 80, 155, 163, 167, 191, 193, 195–196, 393, 395 Homo sapiens neanderthalensis, 395 homophony, 371 hunter-gatherer, 165, 173, 183 hypoglossal canal, 82,
I image processing system, 96 dorsal (stream), 96, 101, 113 ventral stream, 96, 101 imagery, 96, 98–99, 103–107 depictive imagery, 96 enactive, 96, 100–103, 106, 121, 125, 132–133 imitation, 10, 69, 70, 332–333, 431 imperfective, 118, 511, 521 incompressible string, 392 incremental change, 4, 50 index, 71, 74–75, 79, 82, 343, 476 Indo-European, 10, 156, 161, 163, 342, 347–350, 353, 365, 396– 397, 404, 408–409 inductive inference, 251, 253–255 inferential process, 15, 474–477, 479, 487, 492 inflection, 64, 68
inflectional morphology, 218, 502, 506, 511, 520 information theory, 253, 257 inheritance, 23, 32, 40, 357, 445 multigenic inheritance, 40 oligogenic, 40–41 innate language faculty, 22 innateness, 65, 67, 144 innovation, 75, 160, 168, 170–172, 179–180, 183–187, 189, 194, 348, 417–418, 428 intercostal muscles, 82 invisible hand, 466 see also emergence, self organization irregularity, 8–9, 16, 297–300, 320, 323–324, 508, 511 isolating language, 501 item-based grammar, 117
K KE family, 24–30, 32–33, 37, 40
L language acquisition, 7–8, 12–13, 17, 22, 40, 98, 132, 142–143, 147, 149, 150–151, 205–206, 208, 210, 217–218, 234, 237, 239– 241, 255, 263, 265, 308, 317, 373, 383, 411–413, 415–418, 420–421, 423, 425, 429–430, 432, 434, 504, 510, 513–514, 517, 524 language acquisition device (LAD), 8 language change, 3, 9, 10–11, 197, 331, 356, 380–381, 492, 514 language contact, 200, 324, 492 language disorder, 23–24, 28, 32, 40, 43–45 language drift, 382
language emergence, 12, 14, 17, 331, 372, 380 language evolution, 3, 9, 11, 17, 47– 48, 79, 82–83, 143, 156, 201, 297, 299, 324–325, 331–332, 370–371, 373, 411 language families, 10–12, 342–343, 346, 353, 356, 358, 396–398, 400, 403 Afro-Asiatic, 399, 404–405 Amerind, 11–12, 341, 345, 347– 354, 357–359, 361–362, 364, 399, 401, 406 Altaic, 342, 345, 356, 365, 398, 403 Austric, 156, 399, 405 Austronesian, 10, 347–348, 354, 405–406 Chukchi-Kamachatkan, 398, 404 Dene-Caucasian, 398, 401, 403 Dravidian, 398, 404 Eskimo-Aleut, 350, 353, 355, 398–399, 405 Eurasiatic, 341–342, 347–348, 350–356, 364, 397–398, 401, 403–404 Indo-European, 10, 156, 161, 163, 342, 347–350, 353, 365, 396–397, 404, 408–409 Indo-Pacific, 354, 399 Kartvelian, 398, 404 Khoisan, 399, 407 Niger-Kordofanian, 399, 404 Nilo-Saharan, 362, 399, 401, 404 Nostratic, 156, 341, 347, 404 Sino-Tibetan, 397, 398, 403 Uralic, 342, 350, 353, 397, 403, 408, 506 language gene, 30, 68 see also FOXP2, grammar gene, SPCH1
language impairment, 4, 23–24, 29, 40 language-specific grammar module, 68 see also language acquisition device larynx, 32, 52–53, 81 learning algorithm, 222, 234, 237, 261–262, 264, 266–267, 272, 276–277, 278–285, 288–290, 292, 294, 310, 315–316, 321, 375, 431 least effort principle, 254 Lefebvre’s relexification hypothesis, 499 Leiden theory of language evolution, 9, 331–332 lexical ambiguity, 14, 437–439 lexical decision tasks, 443, 458 lexical diffusion, 323 lexical item, 62, 252, 256, 262, 265, 270–271, 274–275, 282–285, 293, 346, 361, 396, 401, 439, 456, 486, 502, 510, 518–523 lexicon, 8, 16, 69, 196, 208, 257, 261, 267, 269, 273–274, 285, 287, 298, 315, 318, 321, 323–324, 331, 438–439, 497, 506, 518 lexifier, 16–17, 497, 499, 506, 509–511, 514, 517, 520–523 linguistic diversity, 6, 153–154, 156–158, 161–162, 164–165, 189, 193, 195–196, 200 linguistic item, 158–159, 172, 189 linguistic senses, 441, 444, 446, 448–449, 450, 452, 455–456, 458 linguistic strategy, 191 logic fuzzy, 335 propositional, 334 logistic growth, 184–185
long range classification, 396
M magnetic resonance imaging (MRI), 26, 39, 45, 142 mammal, 166, 344 marker, 122, 126, 275, 354, 356, 484, 486–487, 491, 511, 519, 521 mating calls, 52 maximal match segmentation (MMS), 287 meaning, 10–11, 14, 21, 52, 58, 60, 62–63, 77, 79, 106– 107, 109, 133–134, 136, 143, 145, 151, 165, 170, 237, 252, 298, 325, 332–347, 349, 358, 361, 369– 371, 375–376, 380–382, 411, 420, 437–438, 440–441, 443– 447, 449–450, 452–453, 455– 458, 504, 508, 511, 520 meaning extension, 442 meaning facets, 444–446 meaning metrics, 440, 443–444 meaning-signal pairs, 11, 369–370, 375–376, 380–381 memes, 9, 332–333 grammatical memes, 333 lexical memes, 62, 283, 334 meronymic extension, 445 mesolect, 498 metaphorical extension, 445, 457 microsatellite markers, 28 mimes, 333 mimetic symbols, 132 mimicking, 55 minimal genetic change, 50 minimalist program, 324 minimum description length (MDL), 251, 253, 255, 292–294 mirror neurons, 155
monogenesis, 7, 13, 153, 160–162, 164, 167–172, 183, 185–189, 194–196, 401 mood, 502, 505 morphological dual, 408 see also plural formation, plurality morphology, 8, 16, 151, 218, 298, 307, 408, 488, 491–492, 501–507, 511–512, 516, 520, 524 multigenic inheritance, 40 multilateral comparison, 342, 344, 348 see also classification, taxonomy mutant allele, 24, 32 mutation, 4, 32, 37, 49, 363, 403–405, 428
N naming, 105, 117, 238, 332 nasalized vowels, 407, 409 native language, 22, 208, 217, 331, 345, 486, 496, 498 natural selection, 22, 190, 333–334, 386, 424 see also selection negative evidence, 66 Neolithic period, 162 see also Paleolithic period neural network, 190, 220, 235, 236, 240, 255, 433 see also artificial neural network, connectionist network neurons, 74, 336 n-grams, 259, 262 Nicaraguan Sign Language (NSL), 524 niches, 417 non-word repetition, 24–25, 28 nouns, 5, 64, 108, 117, 120, 129, 238, 298–299, 309, 315, 317, 320, 323–324, 440, 445–447, 449, 481, 483, 501, 507, 522
O ontogeny recapitulates phylogeny, 434 oral dyspraxia, 29 see also specific language impairment (SLI), verbal dyspraxia ostension, 476 Out of Africa hypothesis, 163, 167, 194 over-irregularization, 300, 314
P palaeodemography, 165 palatalization, 12, 407 Paleolithic period, 81, 162, 165–166, 183, 395, 399 see also Neolithic period parameter setting, 65, 514–516 pedigree analysis, 24 perfective marker, 486, 521 perspective, 5–6, 10, 12, 17, 48, 67, 70, 83, 95–96, 98–103, 107, 109, 111–140, 150–151, 157, 188, 299, 359, 361, 465 allocentric, 5, 96, 109, 111–114, 134 egocentric, 5, 96, 109–112, 114, 134, 136 geocentric, 5, 96, 109, 113, 134 temporal, 5, 114–115, 134 perspective shifting, 6, 95–96, 114–115, 129, 134, 137, 139 perspective taking, 5, 17, 95–96, 99, 101, 103, 109, 111, 115–116, 118, 122, 124–125, 137–140 phenotype, 23, 26
phoneme, 9, 55–56, 72, 159, 214–215, 221–222, 254, 259, 310, 312, 314, 406, 431, 484 phonetic attrition, 507 phonological inventory, 12, 406 phonological rules, 9, 299, 309, 310, 312, 314–315, 323, 503 phylogenetic tree, 155, 193 pidgins, 16, 496, 498, 511, 514 pigmy chimpanzee, 21 see also bonobo plans, 96, 98, 103–104, 109, 114–117, 139, 149, 418, 472 plural formation, 501, 505, 510 see also morphological dual, plurality plurality, 502 see also morphological dual, plural formation polygenesis, 6–7, 13, 153, 160–161, 163–165, 167–172, 183, 185–189, 193–196, 199 polyglutamine stretch, 30, 34, 36–37 population typology, 352, 361, 363 positive evidence, 66, 352, 515 positron emission tomography (PET), 26, 107 possessive pronoun, 486 precision, 42, 251, 255–256, 272–273, 276–278, 280–284, 288–291, 294 preposition, 112, 519 primates, 4, 21–22, 34, 53, 69, 74–75, 82, 110, 112, 115, 117, 431–432 Principle A, 137 Principle B, 137 Principle C, 123–124, 137 principle of the excluded middle, 334–335 see also Tertium non datur principles and parameters, 513
probabilistic cues, 7–8, 205, 207, 213, 217, 233–234, 237–239, 240–241 Procyon lotor, 73 productivity problem, 314 programmed cell death, 27, 37 projection, 96, 107–109, 112–113, 120, 159, 337 pronouns, 6, 11, 62, 114, 121, 150, 347, 349–353, 358, 401, 489 proteins, 33, 35, 37, 39, 435 Proto-Indo-European, 156, 161 see also Indo-European proto-language, 10, 158, 197, 200, 343, 345, 347–348, 397, 399– 400, 403 proto-Sapiens, 397, 399–401
R random, 49, 77, 170, 173–174, 178, 180, 184–185, 222, 229, 255, 320, 323, 348, 352, 376, 382, 390–393, 412, 422, 424–425, 449, 496, 500–501 Rawang, 490–491 recall, 100, 116, 191, 251, 255–256, 272–273, 276–278, 280–284, 288–291, 294, 306–307, 320, 374, 377, 393 word boundary recall, 272, 280, 282 word recall, 272, 280–283 word type recall, 289–290 recombination, 28, 428–429 see also cross over recurrent neural network, 212 see also simple recurrent network (SRN) regular, 8, 10, 267, 272, 297–301, 299–301, 308, 312–313, 315, 319–320, 323–324, 343, 347–348, 359, 390–393, 417, 423, 501, 503, 507 relativization, 121, 128 relexification, 17, 499, 517, 519, 520–521 replicators, 332 reservoirs, 421, 424, 430, 433–434 rule learning, 220, 309, 313–316, 321, 324 rule-like behavior, 207, 218, 225–226, 228–229, 232–233
S salience, 237 second law of thermodynamics, 394 secondary perspective, 118, 120, 136 segment (segmentation), 7–8, 31, 117, 205, 207–213, 215, 217, 220–229, 232–233, 236–240, 251, 256, 259–262, 267–269, 272–274, 280, 285–287, 290, 293–294 selection, 4, 27, 34, 39, 57, 332, 338, 424, 428 see also natural selection semantic ambiguity, 439 semantic entries, 446, 452–454, 456, 464 semantic intuition, 14, 441, 443, 448, 455 sequential learning, 67 serial order, 60, 70 serial verb constructions, 480 shoe, 522 short-term memory, 110 shuz, 522 sign language, 59, 524 see also Nicaraguan Sign Language (NSL) simple recurrent network (SRN), 7,
213–215, 220–221, 224–225, 239–240 see also recurrent neural network single nucleotide polymorphism (SNP), 42 slang, 442 social relationship, 63, 162 social structure, 393 sound correspondences, 10–11, 343, 348, 396–397, 399 sound symbolism, 133, 347, 350 Spalax ehrenbergi, 73 SPCH1, 28, 30 see also FOXP2, grammar gene, language gene specialization, 27, 70, 74–75 speciation, 163–164, 166, 193, 195–196 specific language impairment (SLI), 23, 43, 138, 143 see also oral dyspraxia, verbal dyspraxia speech disorder, 28, 42, 43 speech perception, 54, 56–57, 265 statistical learning, 8, 218–219, 220, 224–225, 231, 240 statistical regularity, 22 stimulus-response rule, 419 Stroop task, 116 stuttering, 27, 43 substrate, 17, 67, 235, 500, 510–512, 512, 514, 518, 520–522 substrate language, 17, 500, 510–512, 514, 520–522 symbiosis, 9, 70, 331, 336, 338–339, 422 symbol (symbolic), 25, 58–59, 97–98, 107, 132, 200, 226, 228, 233, 240, 257, 264, 266, 269, 333, 347, 352, 439 synonymy, 438 syntax, 5, 10, 47, 49, 51, 59, 61, 67, 69, 70, 83, 121, 134–135, 142, 146–148, 151, 325, 332, 337, 501–502
T taxonomy, 10, 341–343, 345–346, 348–349, 361–362, 364 see also classification, multilateral comparison temporal order, 70, 115 tense, 8–9, 26, 68, 114–115, 118– 119, 142, 151, 298–299, 301– 303, 305, 307–314, 316, 319– 321, 483, 486, 489–490, 502, 504 tense marking, 483, 486, 489 Tertium datur, 335 Tertium non datur, 334-335 see also principle of the excluded middle theory of mind, 159 T’ina/T’ana/T’una, 359, 401 Tok Pisin, 499 tones, 55–56, 371, 406, 409 tool use, 75, 80 trait, 21–24, 35, 41–43, 362–363 transcription factor, 31, 33, 35–36 trigger, 105, 129, 137, 191, 412, 432–433 Turkish, 16, 503, 505, 524 typology, 10, 154, 158, 341, 361, 363–364, 402, 465, 510, 512, 525 typological classification, 11, 361
U universal features of grammar, 47, 69, 83 universal grammar (UG), 5, 61–62,
65–66, 68–69, 71, 76, 137, 155, 252, 325, 379 unsupervised learning, 8, 252, 254, 257, 264, 294 Upper Paleolithic Horizon, 337 see also Paleolithic period
V Vapnik-Chervonenkis (VC) dimension, 236 verb islands, 135 verbal dyspraxia, 24, 29 see also oral dyspraxia, specific language impairment (SLI) verbs, 5, 8–9, 14, 62–64, 108, 117, 124, 135–136, 151, 238, 298, 300–308, 312–313, 320, 501– 502, 515, 520 visual stream, 101 Viterbi algorithm (Viterbi segmentation), 256, 259–260, 293 vocalization, 53 voice-onset-time (VOT), 56–57 vowel shortening, 306–307 voxel-based morphometry (VBM), 27
W weak irreducibility, 11, 370 William’s syndrome, 67 word boundaries, 7, 208–209, 211– 212, 217, 222, 251, 267, 272 word meaning, 8, 14–15, 22, 66, 72, 133, 146, 238, 241, 439, 444, 450, 455, 458 word-order, 159, 192, 196, 401 basic word order, 523 standardized word order, 15, 474 SOV, 12, 129–130, 402–405 SVO, 6, 129, 362, 402–405 VSO, 129
word strings, 22, 106 working memory, 96, 110, 116, 159, 197 Wug test, 309
Contributors
Kathleen Ahrens
Department of Foreign Languages & Literatures, National Taiwan University
King L. Chow
Department of Biology, Hong Kong University of Science and Technology
Morten H. Christiansen
Department of Psychology, Cornell University
Bernard Comrie
Department of Linguistics, Max Planck Institute for Evolutionary Anthropology, Leipzig; and Department of Linguistics, University of California, Santa Barbara
Christopher M. Conway
Department of Psychology, Cornell University
Christophe Coupé
Laboratoire Dynamique du Langage, CNRS and Université Lyon 2
Felipe Cucker
Department of Mathematics, City University of Hong Kong
Suzanne Curtin
Departments of Linguistics and Psychology, University of Pittsburgh
Murray Gell-Mann
Santa Fe Institute
John H. Holland
Departments of Psychology and Electrical Engineering & Computer Science, University of Michigan; and Santa Fe Institute
Jean-Marie Hombert
Laboratoire Dynamique du Langage, CNRS and Université Lyon 2
Chunyu Kit
Department of Chinese, Translation, & Linguistics, City University of Hong Kong
Randy J. LaPolla
Linguistics Program, La Trobe University
Chienjer Charles Lin
Departments of Anthropology and Linguistics, University of Arizona
Brian MacWhinney
Department of Psychology, Carnegie Mellon University
James W. Minett
Department of Electronic Engineering, Chinese University of Hong Kong
Merritt Ruhlen
Department of Anthropological Sciences, Stanford University
P. Thomas Schoenemann
Department of Anthropology, University of Pennsylvania
Steve Smale
Department of Mathematics, University of California, Berkeley; and Toyota Technological Institute at Chicago
George van Driem
Department of Comparative Linguistics, Leiden University
William S-Y. Wang
Departments of Electronic Engineering and Linguistics & Modern Languages, Chinese University of Hong Kong
Charles D. Yang
Department of Linguistics, Yale University
Ding-Xuan Zhou
Department of Mathematics, City University of Hong Kong