The Use of Databases in Cross-Linguistic Studies 9783110198744, 9783110193084

This book promotes the development of linguistic databases by describing a number of successful database projects, focus


English Pages 409 [415] Year 2009

Table of contents :
Frontmatter
Contents
Introduction
Designing linguistic databases: A primer for linguists
A typological database of personal and demonstrative pronouns
Databases designed for investigating specific phenomena
How to integrate databases without starting a typology war: The Typological Database System
A contribution to ‘two-dimensional’ language description: the Typological Database of Intensifiers and Reflexives
StressTyp: A database for word accentual patterns in the world’s languages
The typological database of the World Atlas of Language Structures
Typology of reduplication: The Graz database
The Romani Morpho-Syntax (RMS) database
A database on personal pronouns in African languages
Backmatter


The Use of Databases in Cross-Linguistic Studies


Empirical Approaches to Language Typology 41

Editors Georg Bossong Bernard Comrie Yaron Matras

Mouton de Gruyter Berlin · New York

The Use of Databases in Cross-Linguistic Studies

edited by
Martin Everaert
Simon Musgrave
Alexis Dimitriadis

Mouton de Gruyter (formerly Mouton, The Hague) is a Division of Walter de Gruyter GmbH & Co. KG, Berlin.

Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.

Library of Congress Cataloging-in-Publication Data
The use of databases in cross-linguistic studies / edited by Martin Everaert, Simon Musgrave, Alexis Dimitriadis. p. cm. -- (Empirical approaches to language typology ; 41) Includes bibliographical references and index. ISBN 978-3-11-019308-4 (hardcover : alk. paper) 1. Linguistics -- Databases. 2. Typology (Linguistics) I. Everaert, Martin. II. Musgrave, Simon. III. Dimitriadis, Alexis, 1963- P128.D37U74 2009 4101.285574 -- dc22 2009004481

Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at http://dnb.d-nb.de.

ISBN 978-3-11-019308-4 ISSN 0933-761X © Copyright 2009 by Walter de Gruyter GmbH & Co. KG, D-10785 Berlin. All rights reserved, including those of translation into foreign languages. No part of this book may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording or any information storage and retrieval system, without permission in writing from the publisher. Printed in Germany.

Contents

Introduction
  Simon Musgrave, Alexis Dimitriadis and Martin Everaert
Designing linguistic databases: A primer for linguists
  Alexis Dimitriadis and Simon Musgrave
A typological database of personal and demonstrative pronouns
  Heather Bliss and Elizabeth Ritter
Databases designed for investigating specific phenomena
  Dunstan Brown, Carole Tiberius, Marina Chumakina, Greville Corbett and Alexander Krasovitsky
How to integrate databases without starting a typology war: The Typological Database System
  Alexis Dimitriadis, Menzo Windhouwer, Adam Saulwick, Rob Goedemans and Tamás Bíró
A contribution to ‘two-dimensional’ language description: The Typological Database of Intensifiers and Reflexives
  Volker Gast
StressTyp: A database for word accentual patterns in the world’s languages
  Rob Goedemans and Harry van der Hulst
The typological database of the World Atlas of Language Structures
  Martin Haspelmath
Typology of reduplication: The Graz database
  Bernhard Hurch and Veronika Mattes
The Romani Morpho-Syntax (RMS) database
  Yaron Matras, Christopher White and Viktor Elšík
A database on personal pronouns in African languages
  Guillaume Segerer
Contributors
Index of subjects
Index of languages
Index of persons

Introduction
Simon Musgrave, Alexis Dimitriadis and Martin Everaert

1. Using databases in cross-linguistic research

The core of modern (formal and functionalist/cognitive) linguistic theory is concerned with phenomena that we find in many, if not all languages of the world. Cross-linguistic comparison has firmly established that linguistic representations do not in fact vary randomly across languages, and over the last twenty years, theoretical linguistics, psycholinguistics and computational linguistics have become more and more sensitive to cross-linguistic variation. However, this cross-linguistic orientation is still disproportionately focused on languages that are well-known and well-described, but are unlikely to be fully representative of linguistic diversity as manifested in the world’s more than 6000 languages. As theoretical treatises incorporate more and stronger claims about the limits of variation, the need for the systematic evaluation of theories goes hand in hand with the need for more systematic language data. This holds true for research where the difference between language in its various modalities is relevant, but also for research requiring knowledge of the types of variation found among human languages. For functional and formal typologists alike, a systematic body of collected data on languages is essential in order to gain a proper understanding of what is truly universal in language and what is determined by specific cultural settings. And such data sets are increasingly needed in order to make it possible to systematically evaluate contrasting theoretical claims. Databases are an ideal tool for supporting these activities.

Electronic databases have long been used as a tool in linguistic research. Nerbonne (1998) presented a collection of papers reporting work to that date, and also pointed to even earlier work in phonetics and psycholinguistics (Liberman 1997; MacWhinney 1995).
More recent workshops on the use of databases include those organised by the Institute for Research in Cognitive Science (University of Pennsylvania) in 2001 (IRCS 2001), by the Language Typology Resource Centre EU-project at CIL 17 (Prague) in 2003, and by E-MELD in 2004 (E-MELD 2004). At a more specialised level, we also note that a volume of the journal Sign Language & Linguistics was devoted in part to the particular issues which arise in dealing with sign language data in databases (Bergman et al. 2001).

Linguists’ interest in the use of databases is easily understood; as Nerbonne points out, the amount of linguistic data which exists is enormous and the use of computational tools can make handling these amounts of data significantly easier. A database is a general-purpose data-management tool, and can support a limitless variety of data-oriented enterprises; but while large language corpora, experimental apparatus, statistical analysis programs, and other similar applications can make use of databases, the focus of this book is what we might call the “pure” cross-linguistic research database: A body of collected linguistic research data with a user interface whose purpose is to present that data. Numerous typological databases have been developed by researchers in the field, often for personal or small-group use. Increasingly, these databases are being made available to the linguistic community over the Internet, providing the potential for enormous increases in the power of exploratory typological investigation.

The fit between database technology and the interests and needs of cross-linguistic researchers is especially close. An electronic database is an obvious and appropriate tool: it can store the attribute values (= grammatical properties) of entities (= languages, constructions, etc.), and execute queries which recover information about the entities meeting a set of criteria. Once an adequate amount of data has been entered into the database for a workable sample of languages, the researcher can quickly and easily find out which languages in the sample have any combination of the described properties. A well-designed database can simplify the process of collecting and managing data, reduce the chance of errors, and greatly facilitate analysis and presentation of the collected data.
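
The idea of storing attribute values for entities and then querying for combinations of properties can be made concrete with a small sketch. What follows is only an illustration under stated assumptions, not the design of any database described in this volume: the table layout, the function names (build_sample_db, languages_matching), the languages ("Lang A" and so on) and the properties (word_order, has_dual) are all invented for the example. It uses Python's standard sqlite3 module.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database holding invented languages and properties."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE language (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        -- One row per (entity, attribute) pair, so each language has at most
        -- one value for any given attribute.
        CREATE TABLE property (
            language_id INTEGER NOT NULL REFERENCES language(id),
            attribute   TEXT NOT NULL,
            value       TEXT NOT NULL,
            PRIMARY KEY (language_id, attribute)
        );
        """
    )
    # Hypothetical sample data: neither the languages nor the values are real.
    sample = {
        "Lang A": {"word_order": "SOV", "has_dual": "yes"},
        "Lang B": {"word_order": "SVO", "has_dual": "yes"},
        "Lang C": {"word_order": "SOV", "has_dual": "no"},
    }
    for name, props in sample.items():
        cur = conn.execute("INSERT INTO language (name) VALUES (?)", (name,))
        conn.executemany(
            "INSERT INTO property VALUES (?, ?, ?)",
            [(cur.lastrowid, attr, val) for attr, val in props.items()],
        )
    return conn


def languages_matching(conn, criteria):
    """Return the names of languages that satisfy every attribute/value criterion."""
    if not criteria:
        rows = conn.execute("SELECT name FROM language ORDER BY name")
        return [row[0] for row in rows]
    # Keep the property rows that match any one criterion, then retain only
    # the languages for which the number of matching rows equals the number
    # of criteria, i.e. the languages that match all of them.
    clauses = " OR ".join("(p.attribute = ? AND p.value = ?)" for _ in criteria)
    params = [item for pair in criteria.items() for item in pair]
    sql = f"""
        SELECT l.name
        FROM language AS l
        JOIN property AS p ON p.language_id = l.id
        WHERE {clauses}
        GROUP BY l.id, l.name
        HAVING COUNT(*) = ?
        ORDER BY l.name
    """
    return [row[0] for row in conn.execute(sql, params + [len(criteria)])]


conn = build_sample_db()
print(languages_matching(conn, {"word_order": "SOV", "has_dual": "yes"}))
```

In this toy sample only "Lang A" has both properties. A real typological database would typically add tables for sources, examples and the provenance of each judgment; the sketch leaves these out.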
Just as important, data in electronic form can be easily shared with others over the Internet; all of the databases described in this book are now accessible online.

The above description is neutral with regard to the type of research being undertaken. Most, if not all, of the research described in this volume falls under the label of linguistic typology, but similar strategies and procedures have been employed with great success in other areas, such as comparative linguistics. Researchers in that area are often interested in what form a language uses to express a particular meaning, and this information can easily be coded as an attribute of the language; that is, a wordlist can be seen as a simple database structure. An easily accessible example of a wordlist database is the Austronesian Basic Vocabulary Database (Greenhill, Blust and Gray 2003–2008; the website includes links to publications utilising the dataset), and more general information about techniques for utilising such data can be found in Kessler (2001) and in A. McMahon and R. McMahon (2005).

In addition to the practical advantages which come from the use of electronic databases, we would suggest that there can also be conceptual advantages. As with any type of computational modelling in linguistics, designing a database imposes a certain level of rigour and explicitness on our conceptualisation of the domain. A database embodies a very specific abstract model of some portion of reality, and the process of designing that model will inevitably lead the researchers to consider questions about their view of that reality which might otherwise have remained in the background. Answering those questions will be an important step in designing the database. In many cases the process will continue in a cyclical fashion: the data entered into the database will provide new insights into the questions asked at the design stage, and these insights may lead to further refinements in how the domain is modelled (and possibly concomitant changes to the database design – often a troublesome development!). Entering data into a database can be laborious, compared to informal records in a notebook or computer text document, but this is precisely because of the greater degree of precision that the structure of the database imposes. Of course a poorly designed database could ask the wrong questions, provide the wrong choice of answers, or fail to provide a place for storing important information; but a successful database will enhance, rather than impede, the quality and efficiency of data collection and utilization.

In the next section of this introduction, we compare two of the databases described in this volume in order to consider in detail how a linguistic domain can be conceptualised in different ways by different researchers. The section after that gives some thoughts on the (sometimes) vexed question of whether designing and populating a database should be considered a legitimate research activity, and whether it should be recognised as such by the institutions of academia. In the final section, we briefly introduce the chapters which make up the volume.

2. Databases as models of reality

We suggested earlier that a database is an abstract model of some portion of the world, that portion that is of interest to the creators of the database. In the contribution of Dimitriadis and Musgrave, there is an introduction to one well-accepted way of constructing such an abstraction. There is, however, no procedure which, given sufficient information about the relevant domain, will produce an optimal design modelling the domain. Rather, the theoretical commitments of the creators and the nature of the data which they take to be of interest will play a large part in influencing the design which is adopted. We will illustrate this point with a detailed examination of the way in which two different databases model a rather small domain of linguistic reality, pronoun systems.

The two databases in question are Les marques personnelles dans les langues africaines (MPLA for short; Segerer, this volume) and the Calgary Pronoun Database (CPD; Bliss and Ritter, this volume). Pronoun systems are, to quote Segerer, “in some respects ideal candidates for typological studies: they form closed sets with strong structural organization”. Nevertheless, there are interesting differences in the way in which the two databases model the domain, both in terms of the categories which are considered important, and also in terms of the possible values which can apply in the categories. The similarities and differences are summarised in Table 1.

Table 1. A comparison of the encoding schemes of the Calgary Pronoun Database (CPD) and the database Marques personnelles dans les langues africaines (MPLA).

Category        Values

CPD
Person          1st (speaker); inclusive (speaker + addressee); 2nd (addressee); 3rd (other); 4th (another)
Number          singular; dual; trial; paucal; plural; general
Gender          (open class)
Case            (open class)
Formality       (open class)

MPLA
Person          1; 2; 3
Number          s(ingular); p(lural); d(ual)
Specification   animacy; gender; inclusivity/exclusivity; definiteness; logophoricity
Function        Tonic; Subject; Object; Possessive; Reflexive

In part, these differences are due to the different coverage which the two databases aimed for: the most complex number system of any of the languages which provide data for MPLA has singular, plural and dual numbers only, whereas CPD has data from languages with more complex systems. In other cases, however, the differences are clearly conceptual. For example, CPD treats the inclusive/exclusive distinction as an aspect of the category of person, but MPLA groups this distinction with several others in another category entirely. Segerer characterises this Specification category as “nonsyntactic information that is somehow related to the notions of ‘person’ or ‘number’”, and justifies the grouping with the claim that the various types of specification are mutually incompatible. This claim does not correspond to the design adopted by CPD, where it is possible for a form to have a value for inclusivity and for gender. Whether such a combination is actually attested in the CPD data is not relevant here; what is important is that the two designs are making different claims about the possible co-occurrence of feature values.

Another difference in the underlying approaches is evident in the way that some categories are treated in the two databases. For the categories of Specification and Function, MPLA is making a claim that it is possible to infer cross-linguistically valid values for the features involved. There will be tricky cases, some of which Segerer discusses in his chapter, but nevertheless the design of the database implies that these can still be represented by the possibilities which are allowed, without obscuring relevant distinctions. The design of CPD, on the other hand, realizes the view that some categories, in this case Gender, Case and Formality, can be identified as valid categories cross-linguistically, but that the possible values of these features are not comparable across languages.
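
The two sets of design claims can be made tangible as record types. The sketch below is a hypothetical reading of Table 1, not the actual schema of either database: the class and field names are invented, the pronoun form "xa" is a placeholder, and modelling the MPLA Specification as a single optional (type, value) pair is an assumption chosen to reflect the stated claim that the types of specification are mutually incompatible.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class CPDPerson(Enum):
    """Closed set; inclusivity is treated as a value of Person."""
    FIRST = "1st"
    INCLUSIVE = "inclusive"
    SECOND = "2nd"
    THIRD = "3rd"
    FOURTH = "4th"


class CPDNumber(Enum):
    SINGULAR = "singular"
    DUAL = "dual"
    TRIAL = "trial"
    PAUCAL = "paucal"
    PLURAL = "plural"
    GENERAL = "general"


@dataclass(frozen=True)
class CPDForm:
    form: str
    person: CPDPerson
    number: CPDNumber
    # Open classes: free text, because the values are taken to be
    # language-specific and not comparable across languages.
    gender: Optional[str] = None
    case: Optional[str] = None
    formality: Optional[str] = None


class MPLASpecType(Enum):
    ANIMACY = "animacy"
    GENDER = "gender"
    INCLUSIVITY = "inclusivity/exclusivity"
    DEFINITENESS = "definiteness"
    LOGOPHORICITY = "logophoricity"


class MPLAFunction(Enum):
    """Closed set of values assumed to be valid across languages."""
    TONIC = "Tonic"
    SUBJECT = "Subject"
    OBJECT = "Object"
    POSSESSIVE = "Possessive"
    REFLEXIVE = "Reflexive"


@dataclass(frozen=True)
class MPLAForm:
    form: str
    person: int  # 1, 2 or 3
    number: str  # "s", "p" or "d"
    function: MPLAFunction
    # A single slot: a form can carry at most one type of specification.
    specification: Optional[Tuple[MPLASpecType, str]] = None

    def __post_init__(self):
        if self.person not in (1, 2, 3):
            raise ValueError(f"person must be 1, 2 or 3, got {self.person!r}")
        if self.number not in ("s", "p", "d"):
            raise ValueError(f"number must be 's', 'p' or 'd', got {self.number!r}")


# The CPD-style record can combine inclusivity with a gender value ...
cpd = CPDForm("xa", CPDPerson.INCLUSIVE, CPDNumber.DUAL, gender="feminine")
# ... whereas the MPLA-style record has room for one specification only.
mpla = MPLAForm("xa", 1, "p", MPLAFunction.SUBJECT,
                (MPLASpecType.INCLUSIVITY, "inclusive"))
print(cpd)
print(mpla)
```

Seen this way, the disagreement between the two designs is visible in the types themselves: the first record type permits a combination of feature values that the second cannot express, whether or not that combination is attested in the data.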
A difference of this type can also be seen at a higher level of abstraction in debates about whether it is useful for the community of linguists (or syntacticians, or typologists, etc.) to attempt to use a common scheme for encoding linguistic categories, or whether such attempts inevitably result in misrepresentation of data. Endeavours such as the General Ontology for Linguistic Description (GOLD; Farrar and Langendoen 2003) attempt to achieve improved mutual intelligibility within the linguistic community. An alternative view holds that the attempt may be counter-productive. The suggested standards can never be theoretically neutral, and their use will therefore at least obscure different theoretical approaches to linguistic data. A stronger version of this position would claim that important empirical distinctions in data are also at risk. An approach to these issues which attempts to achieve interoperability without sacrificing empirical or theoretical detail is described in the chapter by Dimitriadis et al.

3. Databases as a research activity

The material apparatus and the conventions which govern scholarly practice in publication and citation of printed material have evolved over centuries. Digital scholarship has only been with us for decades, and it is therefore not surprising that accepted standards are only beginning to emerge in that field. And beyond the questions about how to acknowledge electronic sources, there are questions about what sort of recognition, professional and institutional, should be accorded to the creators of electronic resources such as databases. Should such resources have the same status as more conventional forms of publication? Should they be regarded as “research output” at all? [1] Given that we have invested our efforts in editing the current volume, the position of the editors is probably predictable, but it nevertheless seems useful to discuss these issues.

A lively exchange on just this topic took place on the LINGTYP mailing list in April 2007 (the list traffic is archived at http://listserv.linguistlist.org/archives/lingtyp.html). One contributor to this discussion did indeed express the view that producing a database is not research, and that research is restricted to the kind of activity that results in published material in a book or a journal. In response to this perhaps extreme position, Nigel Vincent stated succinctly a viewpoint which we would strongly endorse: “In preparing a good database, corpus, digitised archive or whatever, those responsible make innumerable judgments based on their analytical skill, knowledge, and experience”. [2] Not only would we endorse this position, we would also suggest that the contents of this book directly support it. We have illustrated how the design of a linguistic database involves making theoretical presuppositions explicit, and how that process feeds a dialogue between theory and empirical reality during the data collection phase of a database project. It will be obvious to practising typologists that the process of finding the relevant data to be entered in a database and then interpreting it, taking into account both the theoretical commitments of the person who collected the data and our own theoretical commitments, and assessing its reliability, does indeed require analytical skill, knowledge and experience. It seems very clear to us that these activities are exactly those which lie at the core of what we understand as research.

One of the projects described in this volume, the World Atlas of Language Structures (WALS for short; see Haspelmath et al. 2005, Haspelmath, this volume), has had a history which puts these questions in a very sharp focus. WALS appeared initially in book form with an accompanying CD, which included all the material published on paper as well as software which allowed the user to interrogate the underlying data directly and to ask questions which were not discussed in the published material. All of the material published in the 2005 book, and more, is now available online (www.wals.info). This interesting situation allows us to pose various questions. Is the online version of WALS a separate publication for which the editors and authors can claim credit separately? Perhaps the appropriate answer to this is that the situation is analogous to the publication of a second edition of an existing book (indeed the online version of WALS has been revised and improved in various ways), or perhaps a translation: the author claims credit to the extent that it demonstrates interest in their work, but not as a novel piece of research. But what would the situation be if WALS had been published online first, or if it had only been published online? The content would have been identical to that which was in fact issued as a book, so in one sense we would say that the amount of credit accruing to the creators should be identical. However, there is the complicating factor of institutional legitimisation – in the actual sequence of events, book publication under the imprimatur of a respected publisher played an important part in the recognition which came to the creators. Various forms of validation of online publication are beginning to emerge, but nothing currently exists which would give legitimacy in the way that the physical publication of WALS did. We do not mention this as an argument against online publication; we would prefer to see it as a stimulus towards the development of new protocols for scholarly activity in the digital era.

We see a direct connection between the work described in this volume and work directed to the creation of those new protocols. We value work of the type described here as making a significant contribution to our research community, and we wish to encourage scholars to join in this type of work (that is one of the purposes of this book); in order to accomplish this, we need to work to ensure that suitable recognition of their work is given to the scholars who participate.

Notes
1. Of course such questions are being asked in many disciplines besides linguistics. See Coleman 2006 for a more general discussion.
2. We are grateful to Nigel Vincent for permission to quote a section of a posting to the LINGTYP mailing list (22 April 2007).

4. The chapters

The opening chapter in this collection, by Dimitriadis and Musgrave, aims to give the interested linguist some basic information about databases. Such a chapter cannot hope to be comprehensive, but the authors have tried to set out the basic principles and techniques of how a database should be designed, with particular reference to the problems which the linguist may encounter. The chapter also provides an overview of how databases work. This material may not be enough to guide the novice through the independent construction of a database, but it is our hope that it will at least be enough to enable that novice to discuss their needs in a useful fashion with those more expert in the computational implementation.

We have already referred to the two chapters, by Bliss and Ritter and by Segerer, which describe databases with information about pronominal systems. The first of these has a wide coverage, while the second is specific to African languages. Aspects of these two databases have been discussed above, and they will not be introduced here in any more detail.

Two chapters describe databases focusing on morphology. The first of these is the chapter by Brown, Tiberius, Chumakina, Corbett and Krasovitsky, which discusses a group of databases constructed by a research group based at the University of Surrey. Individual databases cover such areas as syncretism, suppletion, agreement and historical change. The characteristic feature of the databases designed by this research group is that they favour a deep view rather than a wide view. These databases do not contain data from a large language sample, instead being designed to allow access to very detailed data from a small number of languages. These databases reflect what Corbett (2005) called the ‘canonical’ approach to typology. The results which are presented show that this methodology is one that can produce valuable results.

The chapter by Dimitriadis, Windhouwer, Saulwick, Goedemans and Bíró discusses the problems encountered in designing and implementing a system that can provide simultaneous, integrated access to several diverse databases of typological information. The project brings up again the issues of partial and non-commensurable representations that have already been discussed in this introduction, since such an interface requires a way of dealing with the conceptual, structural and terminological differences between the component databases. The solution described in this chapter acknowledges that different theoretical frameworks are rarely equivalent or inter-convertible, and provides a thought-provoking glimpse of one direction in which linguistic typology, assisted by computers, may develop.

The database described by Gast is concerned with the expression of particular meanings manifested as the word classes intensifiers and reflexives. This project therefore is conceptually distinct from most of the others described here. Rather than looking at a particular area or areas of grammar and assessing the properties of that part of the grammar for various languages, this project starts with a small group of meanings and asks what grammatical resources are deployed in different languages to express those meanings. The author uses this orientation as the starting point for a discussion of the theoretical context of creating a cross-linguistic database. The database itself has a narrow focus which allows for the presentation of detailed data, although the results which can be obtained are qualitative rather than quantitative.

The chapter by Goedemans and van der Hulst is the only one in the volume which is concerned with phonological typology. StressTyp is a database of information about the stress systems of a large number of languages, and this chapter discusses a number of theoretical and representational challenges which are particular to the handling of this type of information. The authors give an illuminating presentation of successive implementations of their database, and conclude with a comparison of StressTyp with two other databases with similar (though more restricted) coverage.

The chapter by Haspelmath is a description of the database architecture which underpins the World Atlas of Language Structures (Haspelmath et al. 2005). That important work has already become an indispensable tool for the typologist, and it is therefore very revealing to have this view “under the hood.” This chapter is also important in that it describes a database which contains information about a wide range of language phenomena. In this respect it contrasts with most of the other projects described in the volume, which typically have a narrower focus, either in terms of the phenomena of interest or in terms of the language sample used.

The chapter by Hurch and Mattes introduces another database of morphology, this one dedicated to the phenomenon of reduplication. Although this database contains data from a larger sample of languages than the Surrey databases, it is nevertheless able to provide very detailed data about this phenomenon, often seen as somewhat marginal from a Eurocentric perspective. This database includes a balanced sample of languages, which is actually a subset of the total number of languages covered in the database; this sample can be used to provide quantitative results.

The chapter by Matras, White and Elšík describes an ambitious project which puts a particular meaning on the adjective “cross-linguistic” contained in the title of this volume. The database which they describe contains data from only one language, Romani, but that is a language which has considerable dialect diversity. Databases have become a valuable tool in dialect mapping based on phonetic variation (see for example Haimerl 1998), but the project described here aims to tackle the much more complex problem of tracking morphosyntactic variation across dialects. The project is also interesting for its use of a questionnaire as a data-gathering tool, and (in its latest incarnation) for the incorporation of audio examples into the web interface.

Appendix: URLs of the databases in the volume’s papers

a. Bliss and Ritter: Pronoun Database Project
   http://136.159.142.10:591/
b. Brown et al.: The Surrey Morphology Group databases
   http://www.smg.surrey.ac.uk/
c. Dimitriadis et al.: The Typological Database System
   http://languagelink.let.uu.nl/tds/
d. Gast: Typological Database of Intensifiers and Reflexives
   http://noam2.anglistik.fu-berlin.de/~gast/tdir/
e. Goedemans and van der Hulst: StressTyp
   http://stresstyp.leidenuniv.nl
   http://languagelink.let.uu.nl/tds/
f. Haspelmath: The World Atlas of Language Structures Online
   http://www.wals.info/
g. Hurch and Mattes: Graz database on Reduplication
   http://reduplication.uni-graz.at/
h. Matras, White and Elšík: Romani Morpho-Syntax database
   http://romani.humanities.manchester.ac.uk/
i. Segerer: Les marques personnelles dans les langues africaines
   http://sumale.vjf.cnrs.fr/pronoms/

References

Bergman, Brita, Penny Boyes Braem, Thomas Hanke, and Elena Pizzuto (eds.) 2001. Sign Transcription and Database Storage of Sign Information. Special volume of Sign Language and Linguistics 4.

Coleman, Ross 2006. Field, file, data, conference: towards new modes of scholarly publication. In Sustainable Data from Digital Fieldwork, Linda Barwick and Nick Thieberger (eds.). Sydney: Sydney University Press. [Available at http://hdl.handle.net/2123/1300/]

Corbett, Greville 2005. The canonical approach in typology. In Linguistic Diversity and Language Theories (Studies in Language Companion Series 72), Zygmunt Frajzyngier, Adam Hodges and David S. Rood (eds.), 25–49. Amsterdam: Benjamins.

E-MELD 2004. E-MELD Language Digitization Project Conference 2004, Workshop on Linguistic Databases and Best Practice, Wayne State University, Detroit, Michigan, July 15–18, 2004. [Available at http://www.linguistlist.org/emeld/workshop/2004/proceedings.html] (accessed 2008-09-05).

Farrar, Scott and D. Terence Langendoen 2003. Markup and the GOLD ontology. In Proceedings of EMELD-03. [Available at www.u.arizona.edu/~farrar/papers/FarLang03a.pdf]

Greenhill, Simon, Robert Blust, and Russell Gray 2003–08. The Austronesian Basic Vocabulary Database. [Available at http://language.psy.auckland.ac.nz/austronesian/] (accessed 2008-09-12).

Haimerl, Edgar 1998. A database application for the generation of phonetic atlas maps. In Linguistic Databases, John Nerbonne (ed.), 103–116. Stanford CA: CSLI Publications.

Haspelmath, Martin, Matthew S. Dryer, David Gil, and Bernard Comrie (eds.) 2005. The World Atlas of Language Structures. Oxford: Oxford University Press.

IRCS 2001. IRCS Workshop on Linguistic Databases. 11–13 December 2001, University of Pennsylvania, Philadelphia, USA. [Available at http://www.ldc.upenn.edu/annotation/database/] (accessed 2008-09-05).

Liberman, Mark 1997. Introduction to the Linguistic Data Consortium. [Available at http://www.ldc.upenn.edu/About/ldc_intro.shtml] (accessed 2008-09-03).

MacWhinney, Brian 1995. The CHILDES Project: Tools for Analysing Talk. Hillsdale, NJ: Lawrence Erlbaum Associates.

McMahon, April and Rob McMahon 2005. Language Classification by Numbers. Oxford: Oxford University Press.

Nerbonne, John 1998. Introduction. In Linguistic Databases, John Nerbonne (ed.), 1–12. Stanford CA: CSLI Publications.

Designing linguistic databases: A primer for linguists

Alexis Dimitriadis and Simon Musgrave

1. Introduction: What this is about

It is a commonplace, by now, to refer to the recent explosive growth in the power and availability of computers as an information revolution. The most casual of computer users, linguists included, have at their fingertips an enormous amount of computing power. Tasks such as writing a document or playing a movie now seem self-explanatory, thanks to the integration of computers into mass culture and more specifically to two related processes: On the one hand, to the development of sophisticated software specialized for such tasks; and on the other, to the emergence, even among casual users, of a common-sense understanding of what these tasks are and how they may be approached. It is the combination of advanced tools and an intuitive understanding of what they do that allows us to experience such software as nearly self-explanatory.

But the full potential of this new technology for linguistic research, or indeed for many other purposes, is still only beginning to be understood. Collecting and analyzing linguistic data is not like composing a text document – although many linguists, lacking a more appropriate paradigm, have no choice but to approach it as such. To fully realize the promise of linguistic databases, the subject of this book, it is necessary to understand the underlying concepts and principles; only then can the goals and problems associated with creating a linguistic database be properly identified, and appropriate solutions arrived at.

There are, of course, countless books on databases, at all levels of sophistication. But while a desktop database application is a standard component of “office” software, the world of databases revolves around commercial concepts such as inventories of merchandise, personnel lists, financial transactions, etc. A linguist who undertakes to understand databases is presented with documentation, textbooks and examples from this world of commerce, with few hints as to how such examples relate to the domain of linguistics. The principles, to be sure, are equally applicable; but while a cookbook approach suffices for adapting a sample database to a related use (for example, turning an example CD database into a personal book database customized for one’s own needs), a linguistic database needs to be created from scratch, and (therefore) requires a real understanding of the principles involved. This chapter is intended to help provide this conceptual understanding, and to make its acquisition easier by using examples from linguistics and focusing on topics that are relevant to typical linguistic databases.

1.1. What are databases, and why do we care?

Technically, a database is defined as “any structured collection of data”. The emphasis here should be on structured: a stack of old-fashioned index cards, like the one in the example below, is a database:[1]

[Figure: an index card from Norval Smith’s Phoneme Inventory Database]

The cards are organized in a consistent way to indicate the language name, phoneme inventory, allomorphy rules, a code summarizing the size of the inventory, etc. In this day of digital computers, of course, the term database is generally reserved for digital databases; but while a digital database application has enormous advantages compared to old-fashioned pen and paper, it is still the structure of its contents, not the digital presentation, that is the essence of a database.

[1] The card is part of Norval Smith’s Phoneme Inventory Database, which has been digitized and is available online as part of the Typological Database System at http://languagelink.let.uu.nl/tds/. (See Dimitriadis et al., this volume). Some of the information mentioned in the text is on the back of the card.
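To make the point about structure concrete, the following minimal sketch models a stack of index cards in Python. The field names, the size codes and the placeholder languages are assumptions invented for illustration; they are not taken from the Phoneme Inventory Database. What matters is that every card carries the same named fields, which is what makes the stack searchable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InventoryCard:
    """One index card: every card has the same named fields."""
    language: str
    phonemes: List[str] = field(default_factory=list)
    size_code: str = ""  # hypothetical code summarizing inventory size


# Placeholder data, not real language descriptions.
cards = [
    InventoryCard("Language A", ["p", "t", "k", "a", "i", "u"], "small"),
    InventoryCard("Language B", ["p", "b", "t", "d", "k", "g", "a", "i", "u"], "medium"),
]


def languages_with(phoneme, stack):
    """Because the cards share a structure, the stack can be queried."""
    return [card.language for card in stack if phoneme in card.phonemes]


print(languages_with("b", cards))
```

The same query over unstructured notes would require reading each note in full, which is the practical difference between a collection of data and a structured one.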


The stereotypical electronic database brings to mind tables of data, or a web form with fields for entering data or defining queries. But databases (the electronic kind) appear in many shapes and forms: Every bank withdrawal, library search, or airline ticket purchase makes use of a database. Less obviously, they are also behind every phone call, every train arrivals board, and every department store cash register.

The ideal database is invisible:[2] It is a way to manage the information involved in carrying out a certain task. A desktop computer’s calendar application, for example, stores appointments in a database; but the user sees a daily, weekly or monthly calendar, not a table with records, columns and keys.

To return to linguistics, the Ethnologue website (www.ethnologue.com) provides a profile page for each of the world’s more than six thousand languages, and language profiles for each country in the world. The site does not obviously look like a database: There are no complicated search forms, and not a table to be seen anywhere. But each report page is composed on the fly, from information stored in a database. It is an interesting exercise (after reading this introduction) to reconstruct the database design that must be behind the Ethnologue directory. The database of the World Atlas of Language Structures (Haspelmath, this volume) presents its data in the form of maps of the world, with the data for each language appearing as color-coded pins at the nominal location of the language.

1.2. The “database model”

Almost all computer applications engage in managing stored data of some sort. A text editor reads a document file and writes out an edited version, a computer game presents a series of encoded challenges and (if all goes well) records a high score for subsequent display, etc.
Since this data is inevitably stored in computer files, the most straightforward approach to creating applications is to let them read and write data from files in some suitable, custom-designed or generic format. For example, a text editor can read a file containing a document in some recognized format, and later write the modified document in the same file (or a different one). But this approach, called the file-based model, has serious shortcomings for complex data-intensive applications, which need to process large amounts of information in much more demanding ways. The software system that keeps track of a bank’s money, for example, must be able to:

– store and retrieve large amounts of data very quickly
– carry out “concurrent” queries and transactions for multiple users at the same time
– allow various sorts of operations on the data, keeping an “audit record” of why each one was carried out (“Who withdrew 1000 euros from my account last week?”)
– allow different applications to manipulate the same data (e.g., the software embedded in automatic teller machines, the computer terminals of bank tellers, management software for producing overview statistics, systems for carrying out transactions with other banks, etc.)
– selectively grant access to different subsets of information to different users (I should only be able to see the balance for my own account; clerks should not be able to move millions of euros at will).

While the file-based model can be made to work, it leads to expensive duplication of programming effort, not to mention errors and incompatibilities at all levels. The many software applications involved need to understand the same file format, cooperate to deal with problems of simultaneous access to the same data, and somehow prevent unauthorized users from gaining access to the wrong data.

The solution, called the database model, is to delegate the job of storing and managing data to a specialized entity called the database management system (DBMS). Instead of reading and writing from disk, applications request data from the DBMS or send data to it for storing. All the complicated issues of storing, searching and updating, and even the question of who should be granted access to the data, can be solved (and tested) just once at the DBMS level.

[2] The oft-quoted maxim “good typography is invisible” is applicable to functional design in general, and to database design in particular.

[Figure: a Windows database application and a web-based client, both connected to the DBMS, which manages the database]

The DBMS might function as a software library that is compiled into a larger program, or as a service that is contacted over an internet connection.


What is important is that it functions as the gatekeeper for the system’s data: External applications request data from the DBMS (by means of a suitable query explaining what data they want), or submit data to the DBMS for storage in the database. The DBMS must meet all the challenges discussed above: handling concurrent requests, high performance, access management, etc. But it is much easier to address these issues in one module, which is then relied upon for all data-related tasks by other software. Rather than create a database manager for the needs of each project or organization, today’s DBMSs are general-purpose engines that can handle any kind of data management task. They do this by supporting a very general model of data organization, which can then be customized according to each project’s needs. A bank or a linguistics research group, for example, can acquire such a database engine (perhaps Oracle or MySQL) and use it to create and run a database tailored to their particular needs.
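The gatekeeper role described above can be sketched with Python's built-in sqlite3 module, which embeds the SQLite engine as a software library inside the program. SQLite is our choice for the illustration because it needs no server; it is not one of the systems named in the text. The file name, the table and the placeholder rows are assumptions. Two independent "applications" work with the same data, and neither of them reads or parses the underlying file itself.

```python
import os
import sqlite3
import tempfile

# A throwaway database file; the path and table name are illustrative only.
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

# "Application 1" (say, a data-entry tool) submits data for storage.
entry_app = sqlite3.connect(db_path)
entry_app.execute("CREATE TABLE language (name TEXT PRIMARY KEY, word_order TEXT)")
entry_app.executemany(
    "INSERT INTO language VALUES (?, ?)",
    [("Language A", "SOV"), ("Language B", "SVO"), ("Language C", "SOV")],
)
entry_app.commit()

# "Application 2" (say, a reporting tool) requests data by means of a query.
# It knows nothing about how the rows are laid out on disk.
report_app = sqlite3.connect(db_path)
sov = [
    row[0]
    for row in report_app.execute(
        "SELECT name FROM language WHERE word_order = ? ORDER BY name", ("SOV",)
    )
]
print(sov)

entry_app.close()
report_app.close()
```

A server-based DBMS such as MySQL is used in the same way, except that the connection goes over the network and the DBMS additionally checks who is allowed to see or change what.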

1.3. Interacting with the database

The typical DBMS is never seen by the users of the database; it does not have a graphical user interface (GUI for short), but merely serves or stores data in response to requests expressed in a suitable command language. (The most common command language is SQL, for Structured Query Language; we’ll come back to it later). Large commercial DBMSs such as Oracle and Microsoft’s SQL Server, and free analogues such as MySQL and PostgreSQL are in this category; they must be used in combination with a so-called client application, created in conjunction with the database, that provides the user interface. In the simplest cases, such an application is a so-called “thin client” that provides a view of the database quite close to the structure of the underlying tables. In other cases, the client application has significant functionality of its own, and should be viewed as the heart of the application; the database is merely used to store the persistent data that supports the application.

On today’s computers, an interface application is almost certain to provide a graphical user interface, usually a network of forms with text fields for typing in data, and buttons or menus for various actions.


Often, read-only views of the database do not look like structured forms but like reports, which present their contents in a more text-like format. (We already mentioned the Ethnologue website in this connection). The interface application might be a machine that sells train tickets, a cash register, a bar-code reader, or perhaps an eye-tracking device that collects data for a psycholinguistics experiment. The database model makes it possible to use any of these interfaces, or several of them at once, with the same collection of data. Note that the end-user’s view of the data can be very different from the tables and columns that the DBMS uses to store the data. Even when relying on forms like the above, the interface presents the data in a way that is useful and informative to the user: Data from several tables can be combined, and various fields can be selectively hidden when they are not relevant. The complete application, in short, can be a lot more than just a way of viewing and editing the tables of a database. In this guide, we will not discuss the many issues involved in the design of the user interface. The problems are no different from those that arise for any software development task, and the topic is much too large to be addressed here. We will limit ourselves to the question of designing the underlying database, and give but a bare outline of the relationship between the database and the interface application it supports.
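The difference between the stored tables and the user's view of them can be illustrated with a small sketch. It again uses Python with an in-memory SQLite database as a stand-in for a full DBMS; the two tables, their columns and the placeholder rows are assumptions made up for the example. The report combines data from both tables and silently omits a field that has no content.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (id INTEGER PRIMARY KEY, name TEXT,
                       word_order TEXT, notes TEXT);
CREATE TABLE example  (id INTEGER PRIMARY KEY,
                       language_id INTEGER REFERENCES language(id),
                       sentence TEXT, gloss TEXT);
INSERT INTO language VALUES (1, 'Language A', 'SOV', NULL);
INSERT INTO example  VALUES (1, 1, 'example sentence 1', 'gloss 1');
INSERT INTO example  VALUES (2, 1, 'example sentence 2', 'gloss 2');
""")


def language_report(conn, language_id):
    """Build a text-like report from two tables for one language."""
    name, order, notes = conn.execute(
        "SELECT name, word_order, notes FROM language WHERE id = ?",
        (language_id,),
    ).fetchone()
    lines = [name, f"Basic word order: {order}"]
    if notes:  # an empty field is hidden rather than shown blank
        lines.append(f"Notes: {notes}")
    rows = conn.execute(
        "SELECT sentence, gloss FROM example WHERE language_id = ? ORDER BY id",
        (language_id,),
    )
    for number, (sentence, gloss) in enumerate(rows, start=1):
        lines.append(f"({number}) {sentence} '{gloss}'")
    return "\n".join(lines)


report = language_report(conn, 1)
print(report)
```

The user of such a report never sees identifiers such as language_id; these exist only to link the tables to each other.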


The interface application can be a Windows application, installed on the user’s desktop computer. But an extremely popular alternative is the combination of a DBMS with a way of generating web pages on the fly, providing a web-accessible database. The diagram shows a typical arrangement.

[Figure: a web browser communicates with a web server running PHP, which in turn communicates with a MySQL database]

End-users employ a web browser to view pages provided by a web server somewhere. Behind the scenes, the web server communicates with a DBMS that manages the actual data; a computer program (written in PHP in this example) interprets user actions, requesting data as needed from the DBMS and formatting the results for web display. Web server and DBMS may be on the same server computer, or on different computers – it makes no difference, since the web server has access to the data only through the DBMS, and the user has access to the data only through the web server. Desktop database applications such as Microsoft Access and FileMaker Pro combine a DBMS with a graphical user interface in a single package.

[Figure: a desktop database application combining a forms interface with stored tables]

Such applications are sometimes mistakenly referred to as “off the shelf” databases, in contrast to “custom” DBMSs of the type just discussed. In fact, a database must be defined in FileMaker or in Access just as it must be defined in MySQL; the difference is that the desktop databases come with a graphical interface for managing the creation of tables, and a specific environment (also with a GUI) for creating the user interface of the database (a process that can be very easy or even automatic, but also has its limitations), whereas a pure DBMS does not provide integrated tools for the creation of its user interface.[3]

1.4. Types of databases

General-purpose database management systems are based on some formal, general model for organizing data. By far the most common type of database in use today is the so-called relational database. All the well-known DBMSs are relational databases, including Oracle, MySQL, PostgreSQL, FileMaker Pro and Microsoft Access.[4] While we will not discuss non-relational databases further in this chapter, it is worth mentioning some alternatives in order to better understand what a relational database can, and cannot, do.

1) The simplest type of data model is to have a single table, or “file”. Each row corresponds to some object (e.g., a language) being described, and each column represents a property (“attribute”), such as name, location, or Basic Word Order.[5]
2) A relational database consists of several tables (“relations”) of this sort, linked to each other in complex ways.
3) A hierarchical database is organized not as a table but as a tree structure, similar to folders and subfolders on a computer disk drive. Each unit “belongs” to some larger unit, and contains smaller units. Think of a book divided into chapters, then sections, then subsections etc.
4) In an object-oriented database, data are modeled as “objects” of various types. Objects share or inherit properties according to their type; e.g., a database about word classes could let objects of the type transitive verb inherit properties of the type verb. While useful for very complex applications, this model need not concern us here.

[3] This does not mean that there is no GUI support for “pure” DBMSs. Database vendors or independent projects (in the case of open-source DBMSs) provide graphical tools for the management of each type of DBMS. For example, the application phpMyAdmin allows web-based administration of MySQL databases. Note that such applications are intended for the set-up and administration of databases, not for interaction with the end-user.

[4] FileMaker Pro has some unusual features, which somewhat obscure the fact that it is undoubtedly a relational database.

[5] Tabular information might also be stored in a format that does not look like a table; e.g., as a series of name-value pairs. Data files for Shoebox, the linguistic data management application, are in this format.
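The four data models listed above can be contrasted by writing the same small body of information in each of them. The sketch below uses plain Python data structures rather than any actual DBMS, and all names and values ("Language A", "Family X", the class attributes) are placeholders assumed for the illustration.

```python
# 1) Single table: one row per language, one column per attribute.
#    The family name is repeated in every row.
flat_table = [
    {"language": "Language A", "family": "Family X", "word_order": "SOV"},
    {"language": "Language B", "family": "Family X", "word_order": "SVO"},
]

# 2) Relational: several tables linked by keys. The family name is
#    stored once and referred to by its identifier.
family_table = {1: "Family X"}
language_table = [
    {"language": "Language A", "family_id": 1, "word_order": "SOV"},
    {"language": "Language B", "family_id": 1, "word_order": "SVO"},
]


def family_of(language):
    """Follow the link from the language table to the family table."""
    for row in language_table:
        if row["language"] == language:
            return family_table[row["family_id"]]
    return None


# 3) Hierarchical: each unit belongs to exactly one larger unit.
tree = {
    "Family X": {
        "Language A": {"word_order": "SOV"},
        "Language B": {"word_order": "SVO"},
    }
}


# 4) Object-oriented: types inherit properties from more general types.
class Verb:
    word_class = "verb"


class TransitiveVerb(Verb):
    takes_direct_object = True


print(family_of("Language B"), TransitiveVerb.word_class)
```

In the relational version, correcting the name of the family means changing one value in one table; in the single table, every row that mentions the family would have to be changed.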


The hierarchical model was among the very earliest database models; although it was largely supplanted by the relational model, it has become relevant again today because it corresponds to the natural structure of XML data, and is suitable for managing heterogeneous data. All of the linguistic databases presented in this volume are relational databases, with the exception of the Typological Database System (Dimitriadis et al., this volume) which uses a hierarchical model to unify a collection of independently developed linguistic databases. All of the component databases of the TDS are, in fact, relational databases.
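The correspondence between the hierarchical model and XML can be seen in a short sketch. It uses the ElementTree module from the Python standard library; the element and attribute names are invented for the example and do not reflect the actual format of the Typological Database System. Note that the two languages carry different sets of properties, a kind of heterogeneity that a tree accommodates without difficulty.

```python
import xml.etree.ElementTree as ET

# Illustrative document: element and attribute names are assumptions.
document = """
<family name="Family X">
  <language name="Language A">
    <property name="word_order">SOV</property>
  </language>
  <language name="Language B">
    <property name="word_order">SVO</property>
    <property name="tone">yes</property>
  </language>
</family>
"""

root = ET.fromstring(document)

# Walk the tree: each language belongs to the family, and each
# property belongs to a language.
by_language = {}
for language in root.findall("language"):
    by_language[language.get("name")] = {
        prop.get("name"): prop.text for prop in language.findall("property")
    }

print(root.get("name"), by_language)
```

In a single table, the extra property of the second language would require a new column that is empty for every other row.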

2. Choosing a database platform

Creating a linguistic database does not always come easy to linguists. For one thing, it involves technology and concepts that are not covered in the typical humanities curriculum. (Hopefully this chapter is already helping in this respect). For another, it involves making decisions and choices involving this technology, whose consequences are sometimes not felt (or understood) until much later. While this chapter is meant to be a conceptual introduction, not a technology guide, we will attempt in this section a very general discussion of some specific technical choices.

The first decision to be made is: Do you need to build your own database? Numerous ready-to-use applications now exist that can support linguistic data collection, from the linguistics-specific (the best known probably being SIL’s Shoebox, and its successor the Linguist’s Toolbox) to the general (such as Microsoft’s Excel and other spreadsheet applications). Ferrara and Moran (2004) present an evaluation of several such tools. If an existing application meets your needs, it is not necessary to embark on designing and creating a new database from the ground up.

Our discussion here cannot hope to provide definitive answers: The issues and trade-offs are complex, and depend on the nature of the specific task as well as the available resources (human, financial and technological), now and in the future. If you are pondering the creation of a new linguistic database and are unsure of the technical choices involved, it is probably best to solicit some expert advice from someone with a good understanding of the technology, being sure to discuss the specific requirements and resources available to your project.

Having said that, most of the (custom) linguistic databases we know of can be classified in one of the following categories:

1. The all-in-one desktop database, created with either Microsoft Access or FileMaker Pro.[6] The simplest solution, it is most suitable for one-person data collection projects.
2. The small, do-it-yourself web database. While considerably more complex than a desktop database, it is the best solution if multiple people must be able to enter data in parallel, or if there are plans to eventually make the database publicly available.
3. A sophisticated database for a large project with professional programming staff.

We will have little to say about the third category. While there is no sharp line between a “small” database project and a large one, our goal here is to address the concerns of linguists with limited technical resources; the professionals do not really need our advice.

2.1. The all-in-one desktop database

For a one-person research project without expert technical support, a desktop database application such as Microsoft Access or FileMaker Pro is often a very good solution. These applications store your entire database in a single disk file or folder,[7] allowing it to be copied, backed up, and moved about like an ordinary document. This arrangement greatly simplifies the initial set-up of the system, since no server or network configuration is necessary. This can sometimes be very important in institutional settings, where IT policies might forbid the operation of independent database servers.

[Figure: a desktop database application combining a forms interface with stored tables]

[6] These two applications are the best-known in this category. The free office software suite Open Office provides an open source desktop database application that is (largely) compatible with Microsoft Access.

[7] Microsoft Access stores the entire database in a single disk file, while FileMaker stores each table as a separate disk file.


The user interface of the desktop database application allows users to define tables and relationships for the database, and to create forms for its user interface. Some understanding of database principles (such as this chapter provides) should go a long way towards helping a new user understand how these programs are meant to be used. Each product includes a scripting language that can be used to extend the functions of the automatically generated forms – but only with a degree of arcane knowledge.

The advantages of using an all-in-one database can be summarized as follows:

1. A single product with a graphical user interface for both the database configuration and the user interface.
2. Automatic or interactive generation of the forms.
3. Everything fits in one file or folder, and can be backed up, sent by email, etc.
4. All that is needed is a desktop computer with the database application; software is easy to install or already present, and it is not necessary to set up a server.
5. Internet access is not required.

The last point means that the all-in-one database can be used on a laptop computer without internet access – an important consideration for linguists considering data collection in the field.[8]

On the other hand, the approach also has certain disadvantages:

1. The desktop databases discussed are proprietary software.[9] This limits the ease of distributing copies of the database, since the recipient must own the application software. It also means that the data is not highly portable – it is possible to export data from, e.g., Access to a non-proprietary file format, but doing so adds extra work.
2. The form creation facilities have their limits, which cannot be exceeded without a lot of programming knowledge.
3. It is not possible for multiple persons to enter data in parallel.
4. It is now a common (and highly recommended) practice to make one’s data available to other researchers over the internet. An all-in-one database generally needs additional effort in order to be made available on the web.

[8] While a web database can also be installed locally on a laptop for non-internet use, the process is considerably less trivial than simply copying a disk file.

[9] The database included in Open Office, which can read Access databases, is open source software.

The third point is particularly important for collaborative projects: One of the goals of the database model is to support concurrency, i.e., simultaneous editing sessions by different users. But since this kind of database is stored in a disk file, it must be ensured that only one person at a time edits it. In general, collaborative data entry with a desktop database raises issues similar to collaborative editing of a text document: Even if a common copy is kept in an accessible location (e.g., a network drive), only one person can modify it at a time.

These problems are not insurmountable, and FileMaker and Access each have mechanisms for addressing them. Although this chapter is not intended to be a review of software applications, we will comment briefly on two of them:

FileMaker has an easy mechanism for creating a web server interface to a database; once this is set up, multiple users can work on the single copy of the database by connecting to it with a web browser. The approach does require a workstation that can be configured as a database server, and of course all users must have an internet connection (or at least be on a common intranet). Also, the automatically generated web interface does not support all the user interface features of the full-fledged application.[10]

Microsoft Access, in turn, has a mechanism for “cloning” a database, so that modifications to the copies can later be merged together. But since there is no way to prevent two people from independently modifying the same data, the approach is not entirely safe unless project policies can guarantee that this will not happen.
There are other mechanisms and add-on products that can make a desktop database accessible through a web server. To the extent that they work properly, they have the effect of converting a desktop database into a web database of the kind included in our second category, to which we now turn.
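The risk involved in merging independently edited copies can be made concrete with a small sketch. The function below is a generic three-way merge written for this illustration in Python; it is not a description of how the Access mechanism works internally, and the record keys and values are placeholders. Each copy is represented as a mapping from a record key to a value.

```python
def merge_copies(base, copy_a, copy_b):
    """Merge two edited copies of the same records.

    Returns (merged, conflicts). A record is a conflict when both
    copies changed it, and changed it differently; in that case the
    original value is kept and the key is reported.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(copy_a) | set(copy_b)):
        old, a, b = base.get(key), copy_a.get(key), copy_b.get(key)
        if a == b:        # both copies agree (changed alike or untouched)
            value = a
        elif a == old:    # only copy B changed this record
            value = b
        elif b == old:    # only copy A changed this record
            value = a
        else:             # both changed it, in different ways
            conflicts.append(key)
            value = old
        if value is not None:   # None stands for "record absent"
            merged[key] = value
    return merged, conflicts


base = {"lang_a": "SOV", "lang_b": "SVO"}
copy_a = {"lang_a": "SOV", "lang_b": "VSO", "lang_c": "SOV"}  # edits b, adds c
copy_b = {"lang_a": "VOS", "lang_b": "SOV"}                   # edits a and b

merged, conflicts = merge_copies(base, copy_a, copy_b)
print(merged, conflicts)
```

Changes to different records merge cleanly, but the record that both users edited cannot be resolved automatically: someone has to decide which value is correct. A shared DBMS avoids the situation by letting all users work on one copy of the data.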

[10] Bliss and Ritter (this volume) discuss their experiences with a FileMaker web database.


2.2. The small web database

When multiple people must collaborate on data entry, the desktop database solution is inadequate. A way must be found for all users of the system to work with the same data store. As we have already seen, this means a DBMS that interacts with “client” application programs over the network. While the client programs could be stand-alone applications written in a variety of programming languages, a very popular solution is to set up the database on a web server, allowing ordinary web browsers to be used for display. We have already mentioned this arrangement, shown in the following diagram (repeated):

[Figure: a web browser communicates with a web server running PHP, which in turn communicates with a MySQL database]

The web server, in the middle, includes a set of programs that generate web pages on the fly, using data retrieved from the database. The scripting language PHP is one of the most common languages for writing such programs (scripts); it is specialized for use in web applications, providing extensive support for connecting to databases and communicating with web servers and browsers. In a web-accessible database, PHP scripts interpret and carry out user actions, including requests to view data, to log on or off the system (if authentication is required), and to insert or update data in the database. The necessary data is fetched from the database (which may or may not be on the same computer as the web server) and formatted into HTML pages. The web server then sends the generated pages to the user’s browser for display.

A very popular way to set up such a system is the so-called “LAMP stack”: Linux operating system, Apache web server, MySQL database management system, and PHP. But many variations are possible: PostgreSQL can be used instead of MySQL; the entire Apache-MySQL-PHP combination can be installed on a computer running Windows or Mac OS X; a PHP module can be embedded in a Windows machine running Microsoft’s web server, IIS; etc. One of our web databases consists of a PHP front end on a Linux machine, which talks to a Microsoft SQL Server DBMS running on a Windows server somewhere else. But the LAMP combination is a popular default because it is reliable, and the software is free and open-source. The popularity of the technology has another important consequence for linguists of limited means: It is relatively easy to find programmers, including amateur programmers (e.g., students) who know how to make a web database with PHP. For the technically inclined, introductory how-to guides are also available online.

The genius of this approach is that it relies on the user’s web browser to display the user interface of the database. A web browser, in addition to being already installed on every user’s computer, has the advantage of being an extremely sophisticated piece of software. For things that a web browser can do, it would be hard for a database project to create a stand-alone client application that does them equally well.

But a web database has one important limitation: A web page cannot provide the fonts required to display it properly; these must be already resident on the user’s system. Linguistic databases often need to use phonetic symbols, or text from languages with less common writing systems, for which the required fonts are far from universally available. In such cases, users of a web database may need to manually download and install a font before they can use it properly. Fortunately, this is much simpler (and less problem-prone) than installing a full-fledged application.

Some things are too complicated to do with HTML and a web browser. For example, we may want to display audio or video in a format that the browsers do not support, to draw syntax trees, to accurately measure reaction times, or to manipulate maps interactively. Modern browsers support a number of ways to extend their basic functionality; notably, they can execute JavaScript programs embedded in a web page. For more demanding uses, a stand-alone client application is sometimes the only option.

In short, the web-based database has the following important advantages:

1. Fully supports collaboration.
2. Allows open access to the database over the web (if, and when, desired).
3. Can be built entirely with free software, and generally relies on open standards rather than proprietary protocols or file formats.
4. The program that generates the user interface can be arbitrarily complex. The capabilities of browsers can be further extended with JavaScript and other browser technologies, if desired.

Disadvantages:

Designing linguistic databases: A primer for linguists


1. Compared to the all-in-one desktop database, the main disadvantage of the web database is that creating one requires knowledge of several different technical domains: facility with setting up software and servers, PHP programming, HTML and CSS (for designing the pages to be generated) and SQL (for communication between PHP and database). Fortunately, the popularity of the LAMP suite means that it is relatively easy to find skilled help. For the technically inclined, there are many free primers and reference manuals.
2. A second problem is that a server computer is needed. For linguists who only have access to their desktop workstation, this can mean negotiations with their IT department and/or the cost of buying a server. At some institutions, IT policies prohibit the operation of an independent server.

It should be added that knowing how to build a database is only the beginning; to actually do so, a successful design must be devised and carried out. This is as true for a desktop database as for a web database; but because of the greater complexity of web databases, the time investment required and the consequences of errors are proportionately larger.

2.3. Some recommendations

So how should you go about creating a database? We can only offer general suggestions here, and even these are limited to the kinds of situations we have experience with. But it should be clear that designing a database is a complex undertaking. If possible, get help from someone experienced with databases.

If you do get help, be sure to be actively involved in all stages. Get as good an understanding as possible of the relevant issues (reading the rest of this volume should help), and meet at least weekly to discuss the design. Don’t assume that your programmer understands how you, as a linguist, view the data you want to collect; they don’t. Only through sustained discussion can there be a convergence of visions.

Some more general suggestions:

1. Plan ahead: design your database carefully before you start using it in earnest.
2. Plan for change: As you collect data, your understanding of the phenomenon and the best way to study it will evolve.

3. Keep it simple. (But make sure it meets your foreseeable needs.)
4. Document your database: explain it in writing, to yourself and others.

3. The relational database model

In a relational database, data is formally represented as instances of one or more relations (the mathematical basis of this design was first set out in Codd 1970). Concretely, a relation is a table with named columns:¹¹

Language Name   ISO code  Speakers     Area              SourceID
English         eng       309,000,000  Europe            Eth15
Italian         ita       61,500,000   Europe            Eth15
Swahili         swh       772,000      Africa            Ashton47
Halh Mongolian  khk       2,337,000    Asia              Eth15
Dyirbal         dbl       40           Australia         Dixon83
Mongol          mgt       336          Papua New Guinea  SIL 2003

Each row of the table is a record, corresponding to some object being described; in this case, to a language. Each column is an attribute, representing a property. It can be seen that the cells of the table contain the data; each cell gives the value of an attribute, for the object corresponding to that row. Each value represents one unit of information about the object being described by the record in question.

It can be seen from the above example that rows and columns are not interchangeable: Each column is given a name (and meaning) by the database designer, while rows are added as the database is used. A table could have columns but contain no data (hence no rows); but there could never be a table with rows but no columns. As a database grows, it can come to contain thousands or even millions of records in some tables; there is in principle no limit. Most DBMSs, on the other hand, have comparatively low limits on the number of attributes that can be declared. For example, Microsoft Access allows a maximum of 255 columns for each table.

There are quite a few alternative terms for these database fundamentals: A record (table row) is formally known as a tuple.¹² A table (relation) is also known as a file. We will avoid this term since it invites confusion with files on a computer disk; for example, Microsoft Access stores all tables of a database in a single disk file, while some DBMSs (such as MySQL) use several disk files for information belonging to one table. Attributes are sometimes called fields (because they correspond to input fields in the user interface), or just properties.

¹¹ Data and citations in these examples are sometimes made up, and should not be assumed to be either correct or representative of what the cited sources actually write.

¹² Abstractly, a table represents a relation, defined mathematically as a collection of “tuples” (triples, quadruples, etc.) of values for some attributes. For example, the triple (Dyirbal, dbl, 40) represents the name, ISO code and recorded population of a language.

3.1. Keys and foreign keys

The notion of key is central to relational databases. A key for a table is a set of attributes (but usually just one attribute) that will always uniquely identify a record in that table. In the above example, the language name, ISO code, and number of speakers are all unique; but the number of speakers, and even the name, are not guaranteed to be unique (for instance, all extinct languages have the same number of speakers). Only the ISO code is, by design, guaranteed to be unique. The keys of a table are sometimes also called candidate keys; a table can have several.

Our real-world knowledge allowed us to identify a key in this table; since a DBMS cannot be expected to guess which sets of attributes can serve as keys, there is a way to declare them. The primary key is an attribute, or set of attributes, that the DBMS will use to uniquely identify records. The DBMS will enforce this uniqueness, refusing to create two records with the same key value. By convention, the primary key is indicated by underlining (here, ISO code):

Language Name  ISO code  Speakers     Area    SourceID
English        eng       309,000,000  Europe  Eth15
(etc.)

A key that consists of more than one attribute is called a composite key (as opposed to a simple key). While a database table can have several different candidate keys, it will only have one primary key (which might, of course, be composite).
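The uniqueness guarantee attached to a primary key can be tried out directly. The following is a minimal sketch rather than part of the example database itself: it assumes Python's built-in sqlite3 module with SQLite as the DBMS (chosen only because it needs no server), and the second record is invented purely to provoke the error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Declare "ISO code" as the primary key of the table.
conn.execute("""
    CREATE TABLE "Language Details" (
        "Language Name" TEXT,
        "ISO code"      TEXT PRIMARY KEY,
        Speakers        INTEGER,
        Area            TEXT,
        SourceID        TEXT
    )
""")
conn.execute(
    'INSERT INTO "Language Details" VALUES (?, ?, ?, ?, ?)',
    ("English", "eng", 309000000, "Europe", "Eth15"),
)

# An invented record that reuses the key value "eng" should be refused.
try:
    conn.execute(
        'INSERT INTO "Language Details" VALUES (?, ?, ?, ?, ?)',
        ("Duplicate entry", "eng", 0, "Europe", "Eth15"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute(
    'SELECT COUNT(*) FROM "Language Details"'
).fetchone()[0]
print(duplicate_rejected, row_count)
```

Other DBMSs report the violation in their own way, but the principle is the same: the declaration, not the data that happens to be present, is what makes an attribute a key.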

Keys are involved in expressing relationships between tables. (Note that a relationship should not be confused with a relation, which is just a table.) A relationship in a database expresses a real-world relationship between the objects described. For example, our table of languages contains the attribute SourceID, which indicates the bibliographic source of the information. Our example database also contains another table, whose records are not languages but bibliographic sources (books, articles, and other publications). A record in the Language Details table¹³ is now related to a record in the Bibliographic Source table, by a relationship we might describe as “contains information from.” To encode the relationship in the database, we store with each record in the Language Details table a value (in the attribute SourceID) that matches the primary key of a record in the Bibliographic Source table; in the example below, this is the value “Eth15”. We say that the attribute SourceID is a foreign key.

More generally, a foreign key is an attribute (or set of attributes, if it’s a composite key) within one table, that matches the primary key of some (other) table. A foreign key expresses a relationship between the two tables. The DBMS can ensure that every foreign key really matches the key of a record in the related table, e.g., by refusing to store non-matching values. In the following, we use a star after the attribute name to indicate that it is a foreign key. (This is not standard notation.)

Language Details

Language Name  ISO code  Speakers     Area    SourceID*
English        eng       309,000,000  Europe  Eth15
(etc.)

Bibliographic Source

ID        Title                                         Author                         Year  Publisher
Ashton47  Swahili grammar (including intonation)        Ashton, E.O.                   1947  Longmans
Eth15     Ethnologue: Languages of the world, 15th Ed.  Gordon, Raymond G., Jr. (ed.)  2005  SIL International
(etc.)

¹³ Now that we have more than one table, it is useful to identify them by name.
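The way a DBMS can refuse non-matching foreign key values can be sketched as follows. This is an illustration under stated assumptions, not a prescription: it uses Python's built-in sqlite3 module, and the PRAGMA line is specific to SQLite, which only checks foreign keys when asked to. The Dyirbal record is used because its source, Dixon83, has deliberately not been entered in the Bibliographic Source table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: switch checking on

conn.execute("""
    CREATE TABLE "Bibliographic Source" (
        ID TEXT PRIMARY KEY, Title TEXT, Author TEXT,
        Year INTEGER, Publisher TEXT
    )
""")
conn.execute("""
    CREATE TABLE "Language Details" (
        "Language Name" TEXT,
        "ISO code"      TEXT PRIMARY KEY,
        Speakers        INTEGER,
        Area            TEXT,
        SourceID        TEXT REFERENCES "Bibliographic Source"(ID)
    )
""")
conn.execute(
    'INSERT INTO "Bibliographic Source" VALUES (?, ?, ?, ?, ?)',
    ("Eth15", "Ethnologue: Languages of the world, 15th Ed.",
     "Gordon, Raymond G., Jr. (ed.)", 2005, "SIL International"),
)

insert_language = 'INSERT INTO "Language Details" VALUES (?, ?, ?, ?, ?)'

# SourceID "Eth15" matches an existing source, so this record is accepted.
conn.execute(insert_language, ("English", "eng", 309000000, "Europe", "Eth15"))

# SourceID "Dixon83" matches nothing in "Bibliographic Source".
try:
    conn.execute(insert_language, ("Dyirbal", "dbl", 40, "Australia", "Dixon83"))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

stored = [row[0] for row in conn.execute(
    'SELECT "Language Name" FROM "Language Details"')]
print(orphan_rejected, stored)
```

In practice this means that sources have to be entered before the records that cite them, which is usually the order one wants anyway.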


3.2. Retrieving data with SQL

The names of several DBMSs have already been mentioned, and the reader may have noticed that more than one of these includes the string SQL in its name. This is indicative of the importance of SQL in the field of databases. As already mentioned, SQL stands for Structured Query Language, a language used interactively and by programs to query and modify data and to manage databases. With the partial exception of FileMaker, all commonly used relational DBMSs support the use of SQL, and we recommend that anyone involved in a database project acquire familiarity with the basics of the language.¹⁴

Although SQL can be used for a variety of database operations, such as inserting data, deleting data and creating new tables, its most visible function (as the name suggests) is for expressing queries: specifying data to be retrieved from an existing table or tables. Queries minimally consist of two parts: A specification of the field or fields to be retrieved and a specification of the table in which those fields will be found. These two aspects are represented in SQL by a SELECT command containing a FROM clause. For example, the following SQL statement will retrieve all the language names recorded in the Language Details table shown above:¹⁵

    SELECT "Language Name"
    FROM "Language Details";

More than one field can be specified in the SELECT clause. Multiple fields are separated by commas, as in the following example which retrieves language names and speaker populations:

    SELECT "Language Name", Speakers
    FROM "Language Details";

¹⁴ SQL is recognized as a standard by ISO, the International Organization for Standardization; however, actual database implementations always have limitations, extensions or other differences from the standard, which are significant enough that complex SQL scripts are typically not portable from one DBMS to another. Nevertheless, the basics of manipulating tables are nearly the same, and an understanding of the workings of SQL allows one to work with any SQL implementation.

¹⁵ The quotation marks (") around table and field names are necessary for names that contain a space; otherwise, they can be omitted. Note that some DBMSs use the (non-standard) syntax `name` or [name] instead. A semicolon (;) signals the end of an SQL command, which can continue over several lines of text.
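Readers who want to try the two SELECT statements above can do so without installing a database server. The sketch below is one possible setup, not the only one: it assumes Python's built-in sqlite3 module, builds a cut-down copy of the Language Details table in memory, and runs both queries. The variable names are our own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE "Language Details" (
        "Language Name" TEXT,
        "ISO code"      TEXT PRIMARY KEY,
        Speakers        INTEGER,
        Area            TEXT,
        SourceID        TEXT
    )
""")
conn.executemany(
    'INSERT INTO "Language Details" VALUES (?, ?, ?, ?, ?)',
    [
        ("English", "eng", 309000000, "Europe", "Eth15"),
        ("Italian", "ita", 61500000, "Europe", "Eth15"),
        ("Swahili", "swh", 772000, "Africa", "Ashton47"),
        ("Dyirbal", "dbl", 40, "Australia", "Dixon83"),
    ],
)

# One field: each result row is a one-element tuple.
names = [row[0] for row in conn.execute(
    'SELECT "Language Name" FROM "Language Details"')]

# Two fields, separated by a comma in the SELECT clause.
names_and_speakers = conn.execute(
    'SELECT "Language Name", Speakers FROM "Language Details"').fetchall()

print(names)
print(names_and_speakers)
```

Note that the SQL text is passed to the DBMS as an ordinary string; this is also how a PHP script or any other program communicates with a database.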

Typically, when we construct a query, we are interested only in records which match a certain criterion. Such restrictions are expressed in SQL using a WHERE clause. WHERE clauses usually express a condition on some field of the table being queried; this need not be one of the fields from which data will be retrieved. The first of the following examples will retrieve the names of the languages in our table for which the number of speakers is more than 500,000; the second will retrieve both language names and speaker populations given the same condition:

    SELECT "Language Name"
    FROM "Language Details"
    WHERE Speakers > 500000;

    SELECT "Language Name", Speakers
    FROM "Language Details"
    WHERE Speakers > 500000;

The above queries do not specify the order in which the matched values should be presented. In this case they will probably be presented in the order in which they are stored in the table, but the rules of SQL do not guarantee this. If we wish to display the results of our query in a particular order, e.g., alphabetically by language or ordered by size of speaker population, this can be accomplished by using an ORDER BY clause. Such clauses allow us to specify which field (or fields) will be used to sort the data, and whether the order should be ascending or descending. The following example will return a list sorted in ascending order of speaker population (ascending order is the default and need not be specified explicitly):

    SELECT "Language Name", Speakers
    FROM "Language Details"
    WHERE Speakers > 500000
    ORDER BY Speakers;

All the examples given so far have retrieved data from a single table. One of the important features of a relational database is that queries can be carried out which access more than one table; in SQL this is accomplished using a JOIN clause. In its simplest form, such a clause specifies a combination of two tables from which data will be retrieved and a join condition based on a field which the two tables have in common, typically the foreign key field in one table and the primary key it matches in the other. The following example will retrieve the titles and authors of all sources used for the languages of Europe, along with the names of these languages. Because two tables might use the same name for some fields (e.g., “ID”), an extended syntax can be used to specify field names: TableName.FieldName.¹⁶

    SELECT "Language Details"."Language Name", Title, Author
    FROM "Bibliographic Source"
    JOIN "Language Details"
      ON "Bibliographic Source".ID = "Language Details".SourceID
    WHERE "Language Details".Area = 'Europe';

Conceptually, a JOIN operation creates a new, transient table, created by combining the rows of its constituent tables according to the join condition (the ON clause). The main SELECT clause then operates on this resulting table.¹⁷ The JOIN operation creates one row for each combination of records (rows) that satisfy the join condition; this means that any table row that matches multiple rows of the other table will be used multiple times. In our example, a single bibliographic source can be used for several languages, and the joined table will include rows like these:

From Language Details table           | From Bibliographic Source table
Language Name   ISO code  …  SourceID | ID        Title                                         Author                         …
English         eng       …  Eth15    | Eth15     Ethnologue: Languages of the world, 15th Ed.  Gordon, Raymond G., Jr. (ed.)  …
Italian         ita       …  Eth15    | Eth15     Ethnologue: Languages of the world, 15th Ed.  Gordon, Raymond G., Jr. (ed.)  …
Swahili         swh       …  Ashton47 | Ashton47  Swahili grammar (including intonation)        Ashton, E.O.                   …
Halh Mongolian  khk       …  Eth15    | Eth15     Ethnologue: Languages of the world, 15th Ed.  Gordon, Raymond G., Jr. (ed.)  …

¹⁶ Prefixing the table name is optional if the field name appears in only one table.

¹⁷ Note that this is a conceptual description; such tables exist only in the sense of representing the stages of the query. The DBMS can usually figure out which rows and columns of the join table are needed for the final result, and will construct those alone without generating the entire intermediate table. In any event, such tables are not permanently stored in the database.

The Bibliographic Source record with ID Eth15 has been paired with each language that gives Eth15 as its SourceID. From this temporary table, the SELECT query will retrieve only those rows that match the WHERE clause (in our example, those with Area = 'Europe'); finally, the result of the SELECT query is a new transient table that contains only the requested columns of these rows:

Language Name  Title                                         Author
English        Ethnologue: Languages of the world, 15th Ed.  Gordon, Raymond G., Jr. (ed.)
Italian        Ethnologue: Languages of the world, 15th Ed.  Gordon, Raymond G., Jr. (ed.)
(etc.)
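The JOIN query discussed above can be run against a small copy of the two example tables. The following sketch assumes Python's built-in sqlite3 module; the table contents are abbreviated to the columns the query needs, and the variable names are our own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Bibliographic Source" '
    '(ID TEXT PRIMARY KEY, Title TEXT, Author TEXT)'
)
conn.execute(
    'CREATE TABLE "Language Details" '
    '("Language Name" TEXT, "ISO code" TEXT PRIMARY KEY, Area TEXT, SourceID TEXT)'
)
conn.executemany(
    'INSERT INTO "Bibliographic Source" VALUES (?, ?, ?)',
    [
        ("Ashton47", "Swahili grammar (including intonation)", "Ashton, E.O."),
        ("Eth15", "Ethnologue: Languages of the world, 15th Ed.",
         "Gordon, Raymond G., Jr. (ed.)"),
    ],
)
conn.executemany(
    'INSERT INTO "Language Details" VALUES (?, ?, ?, ?)',
    [
        ("English", "eng", "Europe", "Eth15"),
        ("Italian", "ita", "Europe", "Eth15"),
        ("Swahili", "swh", "Africa", "Ashton47"),
        ("Halh Mongolian", "khk", "Asia", "Eth15"),
    ],
)

join_query = """
    SELECT "Language Details"."Language Name", Title, Author
    FROM "Bibliographic Source"
    JOIN "Language Details"
      ON "Bibliographic Source".ID = "Language Details".SourceID
    WHERE "Language Details".Area = 'Europe'
"""
# Sort in Python so that the printed order does not depend on the DBMS.
european_sources = sorted(conn.execute(join_query).fetchall())
for row in european_sources:
    print(row)
```

The single Eth15 record appears once for each European language that cites it, exactly as in the transient table described above.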

This sort of table manipulation, explicit or implicit, is the essence of the “relational algebra” underlying the operation of relational databases: Operations on tables return other tables, which can themselves be operated upon. Multiple JOIN and SELECT statements can be combined in a single query in various ways, allowing extremely powerful manipulations of the data. Data retrieval with SELECT is only one of the many functions of SQL; almost any database operation can be controlled with SQL commands. There are commands to create and delete databases, control access rights, and to add, delete or modify stored data. As already mentioned, we recommend that anyone involved in planning or implementing a database should acquire familiarity with the basics of SQL. This knowledge is very useful when considering the possible design of a database: One should be able to see what sort of queries would be of interest to the potential users of the database, and to express these queries in SQL using the names of tables and fields as specified in the design. If this proves difficult to do, then possibly the design needs rethinking.
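The idea that operations on tables return tables, which can themselves be operated upon, can be made concrete by nesting one query inside another. In the sketch below (again assuming Python's built-in sqlite3 module; the alias name large is our own), the inner SELECT produces a transient table of languages with more than 500,000 speakers, and the outer SELECT filters and sorts that result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Language Details" '
    '("Language Name" TEXT, "ISO code" TEXT PRIMARY KEY, Speakers INTEGER, Area TEXT)'
)
conn.executemany(
    'INSERT INTO "Language Details" VALUES (?, ?, ?, ?)',
    [
        ("English", "eng", 309000000, "Europe"),
        ("Italian", "ita", 61500000, "Europe"),
        ("Swahili", "swh", 772000, "Africa"),
        ("Halh Mongolian", "khk", 2337000, "Asia"),
        ("Dyirbal", "dbl", 40, "Australia"),
    ],
)

nested_query = """
    SELECT "Language Name", Speakers
    FROM (SELECT * FROM "Language Details" WHERE Speakers > 500000) AS large
    WHERE Area = 'Europe'
    ORDER BY Speakers
"""
large_european = conn.execute(nested_query).fetchall()
print(large_european)
```

The same result could be obtained with a single WHERE clause combining both conditions; the nested form is shown only because it makes the intermediate table visible.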

4. How data is stored in databases

4.1. Data types

DBMSs allow us to specify what type of data we wish to store in any particular column of a table. While all data is in the end stored in some sort of binary format, the data type of a field determines how it will be treated by the DBMS. There is a temptation for beginning database creators to ignore this factor, and to define all columns as fields holding text data. However, there are several reasons why this is not good practice, and we will discuss some of these as we introduce the most important data types.

The most important data types are text, numbers, and Boolean (or “logical”) fields; all databases have additional data types and, partly for technical reasons, numerous subtypes of the core data types. In particular, each DBMS will offer several types of integer fields, differing in their size and (therefore) the range of numbers they can store, and several types of text fields with options for controlling the maximum amount of text that can be stored in them. The specific types and subtypes offered depend on the DBMS, as do the names by which they are known.

Conceptually the simplest type of data is Boolean data, that is, a field which can only have the values True or False. Boolean data is efficient to store and use, since a Boolean field nominally needs only one bit of memory and can therefore be stored extremely compactly. This type of data is frequently used in typological databases, where many fields may contain information about whether a given language has a certain property or not.

Numeric fields are for storing numbers; linguistic databases generally make little use of numbers (except for automatically-generated IDs or as indices into a list of possible values), but they are of great importance for applications in business and the quantitative sciences. DBMSs provide a variety of numeric types, with different storage requirements.
Typical examples are small integers (x < 256), which require one byte of memory,¹⁸ large integers which use up to eight bytes, and one or more sizes of “floating point” fields, which store non-integer (“real”) numbers.¹⁹

For a database that will include numeric data, there are immediate and obvious disadvantages to storing numbers in a text field: Such fields are not considered numeric by the database, but are treated as simple sequences of characters (which happen to contain digits). If sorted, they will be alphabetized like names, with 1, 10 and 1,000,000 appearing before 2 or 20. They cannot be added, multiplied, or compared for magnitude with other numbers; for example, the following query (given as an example in the preceding section) is only meaningful if the field Speakers has a numeric data type:

    SELECT "Language Name"
    FROM "Language Details"
    WHERE Speakers > 500000;

¹⁸ The basic unit of storage for information in a digital computer is one binary state, known as a bit. These are then organized into groups of eight, known as bytes, which have 2⁸ or 256 possible values.

¹⁹ Floating point numbers are represented as an exponent plus a fixed number of “significant digits,” and can store extremely large or extremely small numbers, with some loss of precision. E.g., the number 602,214,141,070,409,084,099,072 can be represented as 6.0221414 × 10²³. (In a database, exponent and significant digits are in binary, not decimal form.)

The most complex core type, and the one that requires the most storage capacity, is text data. One strategy which DBMSs use to contain the demands on memory of fields containing text data is to allow the designer to specify the number of characters allowed in a given field. Databases historically provided fixed-length text fields, which always reserve space for a fixed number of characters whether it is needed or not. Today’s DBMSs also provide variable-length fields, which only take up the amount of space actually used for each string (plus some overhead for encoding the length of the field). The database designer can specify a maximum length, up to a limit imposed by the design of the DBMS; the most efficient text type (known as Varchar in MySQL, and as Text in Access) is limited to 255 characters. Every DBMS also provides a data type for long strings; this might allow up to 65,535 characters. Short and long text types may come in fixed-length and variable-length variants.²⁰

Because variable-length text fields only use as much storage space as they need, there is no need to severely restrict the maximum size of strings in the design of the database; we recommend making all text fields long enough to contain any foreseeable data. The short-text type should be preferred to long text, for strings that are not expected to exceed 255 characters. (You should consult your database manual to check that the data type you are using is indeed variable-length.)

The full inventory of database data types is quite a bit more complex than we have sketched here. Most DBMSs also have special numeric types for dates, times, and currency amounts. Access has a special text data type for hyperlinks.
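The sorting problem that arises when numbers are stored in a text field, described earlier in this section, is easy to reproduce. The sketch below assumes Python's built-in sqlite3 module; the table and column names (counts, as_text, as_number) are invented for the demonstration. The same five values are stored twice, once as text and once as integers, and each column is sorted by the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counts (as_text TEXT, as_number INTEGER)")

values = [2, 10, 1, 20, 1000000]
conn.executemany(
    "INSERT INTO counts VALUES (?, ?)",
    [(str(v), v) for v in values],  # the same value as a string and as a number
)

# Text is compared character by character, like names in an index.
text_order = [row[0] for row in conn.execute(
    "SELECT as_text FROM counts ORDER BY as_text")]

# Integers are compared by magnitude.
number_order = [row[0] for row in conn.execute(
    "SELECT as_number FROM counts ORDER BY as_number")]

print(text_order)    # ['1', '10', '1000000', '2', '20']
print(number_order)  # [1, 2, 10, 20, 1000000]
```

Only the numeric column gives the order a reader would expect from a list of speaker populations.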
Specifying a field to contain data of one of these specialized types has consequences for integrity at data entry (see below), and also allows a range of specialized operations to be performed on the data. For example, if a field is specified to contain only dates, then the exact format to be used can also be specified and the user interface will provide the user with a mask, a template which will only accept data of the required format. Many DBMSs also allow the database designer to declare the character encoding (see next section) used in a text field, or in an entire table or database. This can affect data validation, sorting alphabetically, and searching. Finally, many DBMSs provide a special data type for arbitrarily large blocks of binary data, which the database will store without knowing anything about their internal format.

²⁰ Long strings are known as Memo in Access, and as Text in MySQL. (Therefore, “Text” means very different types in these two DBMSs.) Access does not have fixed-length strings.

Another data type that is of particular interest to linguists is the enumerated type, which restricts a field to taking values from a list specified by the database designer. For example, the property Basic Word Order can be modeled as an enumerated attribute that takes values from the set (SVO, SOV, VSO, VOS, OSV, OVS, Free). This is an extremely useful construct, and one that is very frequently applicable in linguistics since a lot of properties take values from a fixed set of options. Relying on an enumerated type restricts the data which can be entered in a field, and therefore lessens the risk of input errors and inconsistencies (for example, if one types “S-V-O” instead of “SVO”, which is recognizable to a human but will be treated as a distinct value by the DBMS). While some databases define an enumerated data type, it is also easy to model in the user interface, or by using foreign keys. We will see how in section 7.

We should not end our discussion of data types without mentioning that DBMSs typically allow a field to be null, i.e., to have no value at all. Nulls require care by the database designer, since their behavior in queries and arithmetic operations can be surprising. A text field whose value is the empty string is different from a null text field, and a null numeric field is different from any numeric value (including zero).

We have seen, then, some of the properties of the various data types and the benefits of utilizing them properly.
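Both the enumerated type and the behavior of nulls can be illustrated briefly. The sketch below assumes Python's built-in sqlite3 module. SQLite has no enumerated type of its own, so the restriction is expressed here with a CHECK constraint, which is only one of several possible techniques; the table name WordOrder and the null entry for Dyirbal are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE WordOrder (
        "ISO code"         TEXT PRIMARY KEY,
        "Basic Word Order" TEXT CHECK ("Basic Word Order" IN
            ('SVO', 'SOV', 'VSO', 'VOS', 'OSV', 'OVS', 'Free'))
    )
""")
conn.execute("INSERT INTO WordOrder VALUES ('eng', 'SVO')")
conn.execute("INSERT INTO WordOrder VALUES ('dbl', NULL)")  # value not recorded

# A value outside the permitted list is refused by the DBMS.
try:
    conn.execute("INSERT INTO WordOrder VALUES ('ita', 'S-V-O')")
    typo_rejected = False
except sqlite3.IntegrityError:
    typo_rejected = True

# A null is neither equal nor unequal to 'SVO', so this finds nothing.
not_svo = conn.execute(
    'SELECT "ISO code" FROM WordOrder WHERE "Basic Word Order" <> ?',
    ("SVO",),
).fetchall()

# Records without a value have to be asked for explicitly.
unknown = conn.execute(
    'SELECT "ISO code" FROM WordOrder WHERE "Basic Word Order" IS NULL'
).fetchall()

print(typo_rejected, not_svo, unknown)
```

A user who asks for all languages that are not SVO will therefore silently miss the languages for which no word order has been entered, which is the kind of surprise the text above warns about.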
One other reason for adopting the discipline of data typing is that it can minimize problems in the processing of the data at the user interface. When routines are defined in a scripting or programming language, variables will be used to hold data for processing. Such variables must be typed in many scripting languages, and it is always good programming practice to do so even if it is not required. If the types assigned to the data fields in the tables which supply data for processing do not correspond to the types assigned to variables, runtime errors can occur. Tracking down and correcting such errors is very time-consuming, but they can be avoided by the correct use of data typing.

Finally, the benefit of data typing can also be viewed in a rather more conceptual sense. A theme which runs through this chapter is that being able to think precisely about the nature of one’s data is an essential skill in working with databases. Data typing can be viewed as one aspect of that skill: If we cannot be precise about the type of data we intend to store in a particular field of our database, then we are not thinking about the problem with sufficient precision.

4.2. Character encodings

Linguists often require access to characters beyond those normally available on a standard Roman alphabet keyboard. The characters of the International Phonetic Alphabet (IPA) are needed for phonetic transcription, and different character sets are needed for the standard written representation of many languages. Many linguists will have had the experience of carefully ensuring that the necessary characters are included in a document only to see them vanish completely when the file is opened on a different computer.

Such problems arise from the way in which characters are internally represented by computers. Each character is represented as a number, but the mapping between characters and numbers is arbitrary and may vary from one operating system to another, and even from one font to another. The basic characters of a Roman keyboard do have standardized codes, provided by the American Standard Code for Information Interchange (ASCII). Later systems based on ASCII, especially ISO 8859, provide standard mappings for other writing systems. ISO 8859 includes Cyrillic, Arabic, Greek, Hebrew and Thai characters, as well as various extensions of the core Latin character set. The IPA characters are not included.

However, all such schemes faced one basic obstacle, which is that they were designed to use one byte of data for each character. This limits the number of characters which can be encoded to a maximum of 256, the number of distinct values of an eight-bit binary number. Accordingly, ISO 8859 provides a family of separate encodings for the different alphabets it supports.
In each of those, the first 128 values (expressible as a seven-bit number) are identical with the ASCII encoding, and contain the standard English letters (capital and lower case), punctuation and digits, and a number of whitespace characters and control codes; the remaining 128 values, the so-called “high page” of the character table, are different for each encoding defined by ISO 8859. For example, ISO-8859-1 (also known as Latin-1) provides accented characters and other symbols needed for Western European languages; ISO-8859-5 covers Cyrillic, ISO-8859-7 Modern Greek, etc.

The problem with using a family of eight-bit encodings is that it is difficult to mix text from different alphabets. Since each encoding includes ASCII as a subset, a file that uses the ISO-8859-7 encoding, for example, can contain a mixture of Greek and English text. But there is no simple way to mix Cyrillic and Greek, or Cyrillic and French: ISO 8859 does not provide a way to indicate which encoding is being used, or to switch between encodings. To manage multi-lingual text, a database or other application must provide its own way of keeping track of the character encoding used.²¹ HTML pages and Microsoft Word documents each have their own way of accomplishing this; but desktop database applications are not designed to allow fine control over text encoding: database fields are generally meant to contain “flat” text, without invisible embedded codes, and the encoding can usually be set as a database-wide option, if at all. For a cross-linguistic database, this is a severely restrictive state of affairs.

Every font uses some encoding scheme to map character codes into character shapes (“glyphs”). Because ISO 8859 does not include an encoding standard for IPA characters, IPA fonts used various arbitrary mappings. In effect, each IPA font defined its own encoding, an alternative to the standard encodings of ISO 8859.²² All this made a change of fonts a potentially disastrous affair, since a string of Russian, French or IPA text could suddenly turn into nonsense when displayed with an incompatibly encoded font.

Such limitations, as well as the proliferation of other incompatible encodings, led to the formation of a working group in the mid 1980s whose aim was to establish a comprehensive and universally recognized scheme for the digital encoding of character data. The outcome of the work of that group and its successor, the Unicode Consortium, is Unicode. Unicode is a single, universal character encoding scheme, originally based on two-byte codes (giving it 65,536 potential “codepoints”) and later generalized to abstract codepoints that are independent of the number of bytes used to represent them. Unicode 5.1 covers over 100,000 symbols, and each is assigned its own codepoint.
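Two of the points made above can be checked with a few lines of code: that one and the same byte stands for different characters in different members of the ISO 8859 family, and that a Unicode codepoint is independent of the number of bytes used to store it. The sketch below uses only Python's built-in string handling; the choice of the byte value E9 (hexadecimal) and of the IPA schwa is ours.

```python
# One byte from the "high page" of the eight-bit character table.
raw = bytes([0xE9])

as_latin1 = raw.decode("iso-8859-1")    # Western European encoding
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic encoding
as_greek = raw.decode("iso-8859-7")     # Greek encoding

# The same stored byte yields three unrelated characters.
for label, char in [("ISO-8859-1", as_latin1),
                    ("ISO-8859-5", as_cyrillic),
                    ("ISO-8859-7", as_greek)]:
    print(label, char, "U+%04X" % ord(char))

# In Unicode, a character has one codepoint however it is stored.
schwa = "\u0259"  # IPA schwa
print("U+%04X" % ord(schwa))
print(len(schwa.encode("utf-8")), "bytes in UTF-8")
print(len(schwa.encode("utf-16-be")), "bytes in UTF-16")
print(len(schwa.encode("utf-32-be")), "bytes in UTF-32")
```

A file that contains the byte E9 but no record of its encoding is therefore ambiguous, whereas the codepoint of the schwa stays the same in all three Unicode storage formats.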
Unicode makes a principled distinction between character symbols and glyphs, the visual shapes used to represent them. For example, the character A (“capital A”) can be represented with any of a number of differently shaped glyphs, depending on the typeface and style used. A single glyph may also correspond to a combination of several characters: For example, many fonts include a glyph for the typographical “ligature” [fi], which represents [f] and [i] together.

²¹ Another standard, ISO 2022, includes a means of switching between encodings; this still requires the application to keep track of the “current” encoding at any point in the text. This system is not very widely supported, and is increasingly being supplanted by Unicode (see below).

²² SIL eventually adopted a consistent mapping for all their IPA fonts, which has also been used by some independent IPA fonts.

Unlike most earlier encoding schemes, Unicode in principle treats any element which can form part of a character as a separate symbol (“character”). For example, Unicode treats the letter [e] as one symbol and the acute accent as another one; each is assigned to a different codepoint. The character [é] is therefore made up of two symbols in Unicode.²³ This approach has admirable conceptual clarity, but it does also pose problems for rendering complex characters which are made up of several symbols. Fonts provide glyphs for common combinations of letters and accents (such as [é]), which often look better than the direct combination of the separate glyphs; some applications and display systems can automatically substitute such combined glyphs for the two-symbol combination, but others cannot.

The more than 100,000 symbols of Unicode 5.1 include coverage of all major scripts of the world: European alphabets, Middle Eastern scripts including Arabic and Hebrew, Indian scripts including Devanagari, Bengali, Tamil and Telugu, Thai, Chinese, Japanese and Korean characters, a number of historically important scripts such as Runic, Ogham and Gothic, and the symbols of the IPA. Unicode is therefore a development of great significance for linguists, providing a standardized scheme for encoding characters from most of the writing systems used by the world’s languages. In terms of the portability of data (Bird and Simons 2001), it is highly desirable that linguists should use Unicode for all of their work with language data.²⁴

There are however some practical problems which arise in using Unicode. Firstly, there is a crucial difference between encoding and rendering. As we have said, Unicode provides an encoding for a huge range of characters, and that encoding is stable across hardware and software platforms.
However, in order to make practical use of this capability, the computers on which files are opened must be equipped with fonts that include glyphs for the characters which have been encoded. Although all major manufacturers support Unicode in principle, wide-coverage fonts are proving slow in coming. Currently, for Windows machines, Arial Unicode MS is included as part of Microsoft Office (and thus is not installed on all computers by default). It contains 50,377 glyphs, which cover 38,917 characters, including IPA and most major scripts. Rendering is in general very accurate, but there is a known bug which affects the rendering of double-width diacritic characters: these consistently appear one character to the left of their true position. Such problems will eventually disappear as new software versions are released, but in the meantime they cause considerable inconvenience.

[23] In fact, in order to maintain heritage encodings from ISO 8859, Unicode also includes [é] as a single character. Unicode defines “normalization” tables that decompose such characters into the corresponding sequence of simple characters. There are also tables for normalization in the opposite direction (wherever such combined symbols exist).

[24] For those seeking more information about Unicode, Gillam 2003 provides a detailed but accessible introduction.

There are other Unicode fonts which are valuable for linguistic work, of which we will mention only a couple. Lucida Sans Unicode has well-designed IPA symbols, but limited coverage of non-Latin writing systems. The Doulos SIL font family, distributed freely by the Summer Institute of Linguistics (http://scripts.sil.org/TypeDesignResources), is designed specifically to render IPA characters. It covers all IPA symbols and diacritics, which are rendered very accurately, but otherwise has narrow coverage. Numerous other Unicode fonts are also available from SIL, and more are under development. The Gentium font, in particular, provides wide coverage but is still under development (it has no bold typeface yet, for example). A valuable resource for all issues related to Unicode fonts is the webpage authored by Alan Wood (http://www.alanwood.net/unicode/), and some specialist advice on Unicode and IPA is provided by John Wells (http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm). The SIL pages also provide a wealth of information.

The second issue which arises in using multilingual text, in Unicode or in any other encoding, is that of data entry. Standard software packages such as Microsoft Office provide only limited facilities for inputting data in non-Roman characters. The Insert-Symbol command in Word allows such characters to be accessed, but it is designed for inserting single characters and is far from adequate for working with sizeable bodies of text in non-Roman writing systems.
There are two basic approaches to this problem, although in practice the two sometimes overlap. Firstly, it is possible to redefine the mapping between the keyboard and the characters it inserts; this is most useful when a fixed set of characters will be used extensively (e.g., for IPA input), and the user can memorize the new layout or at least learn to find needed symbols quickly. For Windows computers, the best known keyboard-remapping utility is Keyman (http://www.tavultesoft.com/keyman/). This is a powerful tool but, at least in previous versions, it had some drawbacks.[25] Keyman is a system-level tool: changing to a different keyboard mapping should alter the behaviour of the keyboard for all software. But because of the way keyboard input is handled in Windows, this is not actually possible, and the versions with which we have had experience have not in fact operated consistently in this fashion. In particular, the keyboard remapping did not operate at all when using the Microsoft Access DBMS (for further details, see Musgrave 2002). In previous versions, it was also quite difficult to create new keyboard mappings; one was reliant on mappings which had been created by other users and made available online. The current version seems to address this problem.

[25] We do not currently use this program and therefore we cannot comment on the performance of later versions. We also note that the tool was previously distributed freely, but is now distributed commercially.

The solution which we prefer, therefore, is to use a specialized tool to edit Unicode text and then to import the prepared text into the database with which one is working. This approach, in our experience, allows much greater control over the process of producing accurate Unicode text.

For projects that do not involve intensive use of a single keyboard layout, it is sometimes most convenient to use an on-screen keyboard application. The user enters the desired text by “pressing” buttons displayed on the screen, and then pastes it into the database or another destination. Because it is easy to switch between sets of symbols and the symbol inserted by each key is always visible on the screen, this is a convenient method for moderate or light use. The TDS IPA Console (http://languagelink.let.uu.nl/tds/ipa/) is a Java application specialized for IPA text entry; the on-screen keys are arranged in the shape of the familiar IPA symbol charts, and the application can be easily customized with additional symbols not included in the standard layout. There are also webpages that provide on-screen keyboards with similar functionality, but these cannot be customized.
Two other tools which are valuable for this purpose, again for Windows computers only, are Sharmahd Unipad and ELAN. Unipad is a Unicode text editor. A basic version is available for free download, but registration and payment are necessary for full editing capability. The program offers two modes for entering data: one can either select characters from a tabular representation of the Unicode character set, laid out in planes, or one can use a keyboard redefinition. The editor comes with a large number of keyboards (approximately 50) pre-defined, and several others are available from the Unipad website. But it is also possible to create custom keyboard layouts very easily; all that is involved is dragging the desired characters from the Unicode planes and dropping them onto the desired positions on a keyboard layout on the screen. When selected as the active keyboard, a mapping affects the behavior of the keyboard, but only in the Unipad editor. The keyboard can also be viewed on the screen, and characters can be entered by clicking on them with the mouse. The great advantage of this tool is that it is so easy to create special keyboards. If one is entering phonemic transcriptions of data from one language, access is needed to some of the IPA symbol set, but typically only a subset of those characters is used. Rather than having to negotiate a keyboard mapping which provides access to all the IPA characters, it is possible to make a keyboard with only those that are needed, with a significant gain in efficiency.

ELAN is a tool for creating time-aligned annotations linked to media files, and is increasingly being used by linguists for transcribing primary data. We use it as an example of an indirect scenario for data entry: In many cases, the data entered into the database does not come directly from off-line sources, but from another electronic resource such as typed field notes, or (as in this case) a transcribed corpus. ELAN was designed and implemented by the technical team of the Max Planck Institute for Psycholinguistics, Nijmegen, and is distributed freely (http://www.lat-mpi.eu/tools/elan/). ELAN files are a specialized type of structured text file, saved in Unicode format; various aids to entering non-Roman characters are included in the interface. The user can access various keyboard remappings directly from the interface, with seven languages supported and some in multiple mappings. Two methods for entering IPA characters are also available, one using the SAMPA computer-readable phonetic alphabet (http://www.phon.ucl.ac.uk/home/sampa/), and the other using the Roman Typographic Root method. The latter requires the user to type the Roman character closest to the desired symbol and then select the symbol from a list which appears.
The capabilities of ELAN for entering non-Roman characters are less sophisticated than those of Unipad, but for linguists who already use it for transcription, it has the advantage of allowing data entry into the database to be incorporated in their existing workflow.

4.3. Multimedia in databases

Recent technical advances have made feasible the presentation of linguistic data as audio or video material, or graphics. This advance opens up significant new possibilities for the discipline, and it is desirable to include such material in linguistic databases. While it is possible to store such content directly in a database, the relatively large size of audio and video recordings (compared to the rest of the data) often makes it convenient to store such materials separately, and access them by storing links to them in the database.
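The arrangement of storing links rather than recordings can be sketched as follows. This is an illustration only: it assumes Python with its built-in sqlite3 module standing in for the DBMS, and the table name example_sentence, the column audio_url, the helper render_row and the example.org address are all hypothetical.

```python
import sqlite3
from html import escape

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE example_sentence (
           sentence_id   INTEGER PRIMARY KEY,
           transcription TEXT NOT NULL,
           audio_url     TEXT   -- a link to the recording, not the recording itself
       )"""
)
conn.executemany(
    "INSERT INTO example_sentence (transcription, audio_url) VALUES (?, ?)",
    [
        ("sentence one", "https://example.org/media/s1.wav"),
        ("sentence two", None),   # no recording available for this sentence
    ],
)

def render_row(transcription, audio_url):
    """Turn one database row into an HTML list item, with a link if one is stored."""
    text = escape(transcription)
    if audio_url is None:
        return f"<li>{text}</li>"
    return f'<li>{text} <a href="{escape(audio_url, quote=True)}">audio</a></li>'

rows = conn.execute(
    "SELECT transcription, audio_url FROM example_sentence ORDER BY sentence_id"
).fetchall()
html_items = [render_row(transcription, url) for transcription, url in rows]
print("\n".join(html_items))
```

The database itself stays small, and the recordings can be moved to another server by updating the stored links.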

Desktop DBMS packages now typically allow the content of a field to be a Uniform Resource Locator (URL) which points to a multimedia resource. The interface must include the necessary functionality to access the resources and present them to the user as required. A web-based database can easily handle multimedia, since web browsers already support multimedia content (sometimes via helper applications): Again, fields in the database can be used to store URLs, and the user interface scripts are programmed to present the relevant resources as hypertext links in the generated web pages.

A simpler application is the dynamic generation of maps or other graphics (e.g., statistical graphs or charts) on the basis of data in the database. On a web database, this can either be done “server-side” (a graphics manipulation plug-in on the webserver creates an image on the fly, which is sent to the user’s browser for display) or “client-side” (a fixed image, such as a map of the world, is sent to the user’s browser along with a script that controls the placement of additional information on it). A common use of the technique is the familiar typological map, which displays the distribution of one or more linguistic features as color-coded dots at the canonical geographical location of each language.[26]

5. Designing a database

5.1. Preliminaries

The relational database model is a simple but very powerful basis for organizing data in a database: A relational database is a collection of tables, related (linked) to each other by means of foreign keys. This so-called relational structure of tables and relationships determines the essential nature of a database: The user interface can present data in a very different way (or in many different ways, as we have seen), and there are additional features at the database level, such as indexes and access control directives; but these only build on the core properties of the database, which are determined by its relational structure.

[26] Currently, the state of the art for such maps is to embed in the database interface a service provided by Google Maps: The Google service creates and manages a map of the world (or the desired part of it), which appears as part of the database interface. Behind the scenes, the database supplies geographical locations and annotations which are placed on the map by the user’s browser. A skilled programmer can set up the complete system in a few days. The websites of the World Atlas of Language Structures (Haspelmath, this volume) and the Typological Database System (Dimitriadis et al., this volume) utilize this method.

So what is the right relational structure for a database? This depends on the situation, of course: on the nature of the data, and on its intended uses. Naturally, a database must have a way to store all information it is meant to contain; and this information must be organized in a way that allows it to be entered, viewed and modified in a straightforward way. There is a voluminous literature on the topic of how to design a database, with countless competing methodologies, procedures, and rules of thumb. While the database specialists disagree among themselves on the details, the underlying principle is simple: The structure of the database should reflect the logical organization of the data. In particular, it is not the end-application or the user interface that determines how the database should be organized. If a database has a relational structure that suits the data, it will support any desired use of it (possibly after technical elaborations such as indexes, etc.). Conversely, incorrect design can make it impossible to put the database to new uses, or even to store all collected data. As a simple example, suppose that we are collecting information on reflexives, and design a database that allows exactly one reflexive per language. This design will require awkward conventions and work-arounds as soon as we encounter a language with two reflexives we need to describe.

An appropriate database design will have the following properties:

1. It will represent the required data and appropriate relationships between the units of data.
2. It will provide a model of the data that supports the operations that will need to be performed: entering, searching, and modifying data in the database.
3. It should be the simplest way to accomplish these tasks with acceptable performance.
The last consideration has mostly escaped our attention until now, and to some extent goes against the first two: A database is not a theoretical exercise but a practical tool; it should not be more complex than it needs to be – but it should be complex enough. For example, a person’s name may consist of one or more last names, given names, connectives such as von, and perhaps other modifiers or honorific titles (Jr., Professor). How this should be represented in a database depends on its purposes; a telephone directory or employee database should probably represent each of these parts separately, while a table of contributors to a linguistic database can often get away with fields for just a “first” and “last” name (loosely construed), or even a single field Name.

For linguistic applications, the question is complicated by the theoretical perspective and goals of the database. In general, a database designer will model properties of the subject matter of the database much more precisely than other, peripheral properties. For example, while it is common to classify languages as having trochaic or iambic feet, this classification does not exhaust the range of variation in foot construction. The Stress Typology database (Goedemans and Van der Hulst, this volume), whose subject is the mechanisms of footing and stress assignment, provides a much more fine-grained set of classificatory parameters, from which the types iambic and trochaic can be computed as summary properties.

A database, then, is a model of the real world – specifically, of the part of the world we are interested in. But a model is not the same as the real thing: the data in a database, and hence its structure as well, represent an idealization of the world that is sufficiently detailed for our purposes but sufficiently abstract to be put into practice. Finding the right balance is something that comes with experience, and is arguably a bit of an art.

5.2. Normalization

To better understand the process of relational database design, we first consider two concrete properties that a correctly designed database must have:

1. Each cell in a database table should contain only one value (i.e., one piece of information).
2. Each piece of information should be entered in the database only once.

The process of discovering and correcting problems of this sort (and many similar ones) is known as normalization. A database that meets the first of the above criteria is said to be in first normal form. (The second of our criteria is addressed by several different normalization rules.)

Consider again our table of language details: Suppose that we want to add information about the basic word order for each language, and that our definition of “basic word order” allows a language to have more than one (or that we decide to record multiple categorizations in case of doubt); suppose, moreover, that we decide to treat German as having both SOV and SVO basic word order.

Language Details

Language Name   ISO code   …   Basic Word Order
English         eng            SVO
German          deu            SOV, SVO
(etc.)

The above table violates our first rule (is not in first normal form), since both word order values for German have been entered in a single cell. This practice creates all sorts of technical difficulties, because database searching and sorting operations are designed to treat each table cell as a complete value. This might seem harmless in this toy example, but the problems rapidly multiply as things get more complicated: If we add a field for the bibliographic source of the word order information, it may be necessary to add two sources in the cell for German, and to somehow indicate which source is responsible for which word order classification. We might just give these details in plain text in the bibliographic source field; or we might decide to use special notation, perhaps a backslash, between multiple values, and to adopt the convention that the first reference corresponds to the first word order value, and so on. But these conventions are opaque to the DBMS (i.e., unknown and unusable); and as they multiply, they add complexity to the system which quickly grows beyond what can be reasonably managed.

The solution is to structure the database in a way that allows multiple values for the word order attribute to be recorded for a single language. We replace our previous table, which provided one row for each language, with a new table Word Order, in which each row represents a basic word order of one language; German is then allowed to occupy two rows, solving our problem. Suppose, however, that our new table looks as follows:

Word Order

Language Name   ISO code   Area     Basic Word Order   Reference*
English         eng        Europe   SVO                Smith75
German          deu        Europe   SOV                Jones84
German          deu        Europe   SVO                Jones84
(etc.)

This table violates the second of our conditions: It contains three fields, Language Name, ISO code and Area, that must be the same for all records that concern a German word order. In other words, all copies of these values give the same information, which should appear only once. (On the other hand, the two values of Jones84 in the Reference column do not represent the same piece of information: the SOV and SVO values might have come from different publications, so the fact that they are the same could not have been predicted.)

The solution, here, is to have two tables: One for languages, containing information such as name, population etc. (as before), and a second one containing only information related to the basic word order. The two tables are linked by means of a foreign key, Language ID, that must match the primary key (ISO code) of the Language Details table.[27]

Language Details

Language Name   ISO code   Area     Reference*
English         eng        Europe   Smith75
German          deu        Europe   Eth15
(etc.)

Word Order

Language ID*   Basic Word Order   Reference*
eng            SVO                Smith75
deu            SOV                Jones84
deu            SVO                Jones84
(etc.)
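The difference between the unnormalized and the normalized design can be tried out directly. The sketch below assumes Python with its built-in sqlite3 module as a stand-in for whatever DBMS is actually used; the table and column names are our own renderings of the tables above, and the Reference column of Language Details is left out for brevity. It asks the same question of both designs: which languages have SVO as a basic word order?

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite checks foreign keys only on request

# Unnormalized design: both German values are packed into a single cell.
conn.execute(
    "CREATE TABLE flat_language (name TEXT, iso_code TEXT, basic_word_order TEXT)"
)
conn.executemany(
    "INSERT INTO flat_language VALUES (?, ?, ?)",
    [("English", "eng", "SVO"), ("German", "deu", "SOV, SVO")],
)
flat_svo = [row[0] for row in conn.execute(
    "SELECT name FROM flat_language WHERE basic_word_order = 'SVO' ORDER BY name"
)]

# Normalized design: one table per kind of information, linked by a foreign key.
conn.execute(
    """CREATE TABLE language_details (
           iso_code TEXT PRIMARY KEY,
           name     TEXT NOT NULL,
           area     TEXT
       )"""
)
conn.execute(
    """CREATE TABLE word_order (
           language_id      TEXT NOT NULL REFERENCES language_details(iso_code),
           basic_word_order TEXT NOT NULL,
           reference        TEXT
       )"""
)
conn.executemany(
    "INSERT INTO language_details VALUES (?, ?, ?)",
    [("eng", "English", "Europe"), ("deu", "German", "Europe")],
)
conn.executemany(
    "INSERT INTO word_order VALUES (?, ?, ?)",
    [("eng", "SVO", "Smith75"), ("deu", "SOV", "Jones84"), ("deu", "SVO", "Jones84")],
)
normalized_svo = [row[0] for row in conn.execute(
    """SELECT l.name
       FROM language_details AS l
       JOIN word_order AS w ON w.language_id = l.iso_code
       WHERE w.basic_word_order = 'SVO'
       ORDER BY l.name"""
)]

print(flat_svo)         # ['English']
print(normalized_svo)   # ['English', 'German']
```

The first query silently misses German, because the cell "SOV, SVO" is compared as a whole; the second finds both languages, and the name and area of each language are stored exactly once.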

These kinds of concerns, then, are among the issues we must address when designing a database. In the following sections we consider the process of database design, beginning with the general principles before we turn to the specifics.

[27] We have not discussed the choice of primary keys for the two tables; we return to this type of example in a later section.

5.3. The database design process

Building a database is a complex undertaking. We present here a commonly used, top-down process following Connolly et al. (1999): First, we collect sufficient information to gain a proper understanding of the problem; we then construct the database in several stages, beginning with a high-level model of the data domain.

1. Work out your needs, in writing. Examine real data and/or any paper forms or printed materials used for the same purpose in the past.
2. Carry out the high-level conceptual design (abstract design) of the database. This means thinking about how our data is organized, without worrying about the specifics of tables, keys and attributes.
3. In the logical design stage, we define tables and relationships that reflect the conceptual design.
4. The physical design stage involves the actual programming of the database, taking into account the features and limitations of our specific DBMS and user interface environment (with which we build the database client).
5. When the design steps have been completed, build the actual database.
6. Try it out.
7. Clarify your needs, and revise steps 1–6 as necessary.

Databases are by tradition complex entities that support very large enterprises. Under such conditions, the cost of change increases enormously once a system has gone into use: Imagine needing to take thousands of automatic teller machines off line in order to revise the database that keeps track of their transactions! While linguistic databases are much simpler, it is still essential that a database should be planned and built as carefully as possible before it goes into use. Testing will almost always reveal problems that can be easily corrected – as long as they are noticed at an early stage.

It should also be noted that database-creation applications and tools are not built with the assumption that the design of the database might change at any time. If you add a new attribute to a table in Microsoft Access, for example, it will not automatically appear in existing forms for that table; the forms will have to be re-generated and customized from scratch, or the new field will have to be manually added to existing forms (a more complicated process).

The importance of planning ahead presents linguists with a chicken-and-egg problem: A research survey database is intended to collect data about a phenomenon that is not yet completely understood, so it is almost certain that an ideal model for it cannot be constructed until after data collection has been completed. We cannot address this difficult question here, beyond recommending the obvious: Do as much advance preparation as possible, and plan for change. For example, if the values for a field come from a fixed list (menu) of possible values, make sure that the list can be easily extended during the course of the project – and that someone actively involved in the project knows how to do it.
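One way to keep such a list extensible is to store the permitted values as data, in a table of their own, rather than writing them into the table definition or the interface code. The following is a minimal sketch under that assumption, again using Python's built-in sqlite3 module; the table names, the helper try_insert and the language code "xxx" are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite checks foreign keys only on request

# The menu of permitted values is itself a table, so it can grow without
# any change to the structure of the database.
conn.execute("CREATE TABLE word_order_value (label TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO word_order_value (label) VALUES (?)",
    [("SOV",), ("SVO",), ("VSO",)],
)
conn.execute(
    """CREATE TABLE word_order (
           language_id      TEXT NOT NULL,
           basic_word_order TEXT NOT NULL REFERENCES word_order_value(label)
       )"""
)
conn.commit()

def try_insert(language_id, label):
    """Attempt to record a word order; return False if the label is not on the list."""
    try:
        with conn:   # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO word_order VALUES (?, ?)", (language_id, label)
            )
    except sqlite3.IntegrityError:
        return False
    return True

# A value that was not foreseen when the list was drawn up is rejected ...
rejected_before = try_insert("xxx", "VOS")

# ... until someone on the project extends the list with one ordinary insert.
with conn:
    conn.execute("INSERT INTO word_order_value (label) VALUES (?)", ("VOS",))
accepted_after = try_insert("xxx", "VOS")

print(rejected_before, accepted_after)   # False True
```

Extending the list then requires no redesign, only the ability to add a row, which is the kind of task a project member can be shown how to do.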

5.4. Preparation

Many databases are not designed by the people who are going to use them. It is important to collect the information necessary for the database designer to make the right decisions:

1. What kind of data will be stored in the database? Collect some real data, in electronic or in paper form, of the kind that will be entered into the database. If a substantial body of relevant data exists already, it could reveal patterns that will not be apparent from just one or two sample records (or, worse, from fake “data” made up by the designer). If the data is to be collected by questionnaire, prepare one or two completed questionnaires (again, with real data), before the database design is even started. The questionnaire itself is also a big source of information; but be sure to distinguish clearly between what will be written in the questionnaire and what will be entered into the database. For example, some exploratory questions might be intended to guide the consultant, and do not need to appear in the database; while some properties are not provided by the consultant, but will be determined by a database analyst on the basis of the completed questionnaire.
2. What will be done with the data? Think about specific scenarios you want to be able to carry out, including data entry as well as searching and browsing: Will people want to see a list of all the phonemes in a language? Should they be able to search by geographic region? Etc. Think about the research questions you hope to answer with the help of the database, and the kind of information you will need in order to do so. It is best to involve several people in discussing the use of the database.
3. What kinds of users will be using the database? Typically, there will be privileged users who can enter and edit data in the database, and external users who can only search or display the data. For larger projects, there may be external consultants who should be given editing access to a single language only, etc. For a web database, it is common to initially restrict all access to authenticated users only, and to open up read access for external users after the data has been double-checked, and perhaps utilized for a particular research paper or dissertation connected to the project.
4. What will happen to the database after the conclusion of the project period? The data may be archived or made public on a separate web database; such future considerations can be taken into account during the database design, for example by providing suitable export functions and ensuring that appropriate standards are followed.

6. Conceptual design

In the conceptual database design stage, we determine the organization of our data at an abstract level. This is necessary, since the particular properties of relational databases are too low-level, in ways that we will presently see. The core task in conceptual design is to identify the entities, i.e., idealized objects, that our database is about; and the relationships between them. More correctly, we must identify the entity types present in our database. An entity doesn’t need to be a physical object. Rather, it is a notion or concept that we, in our project, consider to have “an independent existence”, and can ask questions about. In other words, an entity is something that we collect data about.

For a cross-linguistic survey database, we can start by identifying languages as entities that we will document: Our database might have information about German, Yoruba, Japanese, etc. We thus have our first entity type, Language. (Note again that our model must involve types, such as Language, rather than single objects such as German.) In our example we provided directory information about languages, such as name, ISO code, population, etc. These are attributes of the entity Language. (Entity attributes, as we will see shortly, are similar to the attributes of database tables but are more general.)

We also named a source for the information on each language. A book, or other bibliographic source, can also be said to have “independent existence”: We can identify many attributes of a book, such as its title, author etc., independently of what we used the book for. We thus have a second entity type, Bibliographic Source. We also have a relationship between the two entity types: A bibliographic source can be Cited as the source of the demographic data on some language.

The entity-relationship model we have constructed can be displayed in a so-called Entity-Relationship diagram (E-R diagram). Each entity type is shown as a rectangle, with attached circles representing its attributes. The relationship is represented by a diamond, connected with lines to the two entities. The “crow’s foot” on the language side of the relationship indicates that multiple languages could cite the same source.
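The same model can also be written down in code, which some readers may find easier to experiment with than a diagram. The sketch below is only an illustration: it assumes Python dataclasses, the attribute names are our own, and the placeholder title and the choice to let both languages cite one source are invented for the example. Each entity type becomes a class, each entity an object, and the relationship Cited becomes a reference from a language to its source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BibliographicSource:
    """Entity type: a book or article, described independently of its use."""
    key: str
    author: str
    title: str

@dataclass(frozen=True)
class Language:
    """Entity type: a language, with the relationship Cited to one source."""
    name: str
    iso_code: str
    cited: BibliographicSource

# Single entities are instances of the entity types.
smith75 = BibliographicSource(key="Smith75", author="Smith", title="(placeholder title)")
english = Language(name="English", iso_code="eng", cited=smith75)
german = Language(name="German", iso_code="deu", cited=smith75)

# The "crow's foot": several languages may cite the same source.
languages = [english, german]
citing_smith = [lang.name for lang in languages if lang.cited is smith75]
print(citing_smith)   # ['English', 'German']
```

Note that the classes correspond to entity types and not to database tables; how the model is turned into tables is a separate, later decision.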

Some heuristics for conceptual design

Entities are things (or notions) that our database is about. If we record information about something, that “something” is a candidate for being modeled as an entity. To determine how to divide information into entities, and how to allocate attributes to them, consider:

1. What information goes together?
2. What information is dependent on other information?
3. What should be updated or deleted together?

For example, the vernacular text, gloss and translation of an example sentence clearly go together. It would make no sense to delete one of these while leaving the rest in the database (unless, of course, we are simply deleting an erroneous value until it can be entered correctly).

6.1. Relationships

The E-R diagram shown above also indicates the cardinality of the relationship between the two entities, which indicates the number of separate relationships that a single entity can participate in. A relationship is one-to-one if each entity can be related to at most one entity of the other type. For example, the relationship Is Capital Of always relates a city to a single country (or to none), and vice versa. A relationship is one-to-many if one entity type can be related to several entities of the other type, but not vice versa. An example is the relationship Location between cities and countries: each city is in only one country (if we ignore the possibility of divided cities, or treat such cases as two cities), but a country could be the location of many cities. Finally, a relationship is many-to-many if both entity types can be related to multiple entities of the other type. As an example, consider the relationship Flies To, between airlines and cities. For a linguistic example, consider a relationship Occurs In, between morphemic gloss labels and sentences. A particular gloss label can occur in many sentences, and each sentence can contain many labels in its gloss. The cardinality of a relationship determines (along with other factors) how it will be represented in the relational structure of our database, which we derive in the logical design stage.

A relationship can also be classified depending on whether it is obligatory for the participating entities. If every entity of some type must participate in a relationship, the relationship is total for that entity type; otherwise, the relationship is partial for that entity type. For example, every country has a capital, so Is Capital Of is a total relationship for countries; but some cities are not the capital of any country, hence Is Capital Of is a partial relationship for cities. Consider also the relationship Describes, between languages and reference grammars: every grammar describes some language, but there are languages for which no grammar exists; therefore the relationship is total for grammars, but partial for languages.

6.2. Entity attributes and relationship attributes

Entity attributes are a generalization of the attributes of tables. Like them, they express particular properties of the entity being described.
Once we have identified the entities in our database, we solidify the design by identifying the attributes that we are interested in recording. An attribute of an entity should depend only on the entity it will describe; for example, our language directory contains an entity type Language with information about languages, and an entity type Source with publication information (author, title, publisher etc.) about books and articles from which the information was drawn. In general, we also want to record the page number where a piece of information appears. Note that this is not a property (attribute) of the book itself; after all, we might have drawn many different pieces of information from different pages of a single book. Neither is the page number an attribute of the language. Rather, the page number is an attribute of the relationship Cited, which relates a language to a bibliographic source. In an entity-relationship model, relationships as well as entities can have attributes.

In defining an attribute, we should have some idea of the possible values it may take; these are the attribute domain, and might be a list of fixed strings, a number or monetary value (rare for linguistic applications), or perhaps free text. An attribute can also be classified as simple or complex, which refers to whether the attribute consists of multiple distinct parts. For example, an entity Person might have a complex attribute Name that we further subdivide into First Name, Surname, etc. An entity attribute is also allowed to be multi-valued; in our language directory example, Alternate Name is a multi-valued attribute. Similarly, Person might have an attribute Phone Number which we allow to take multiple values.[28] An attribute that can only hold one value (the usual case) is said to be single-valued.

Exercise: Consider the entity Book (or Bibliographic Source). What kind of attribute is the attribute Author?

We have just suggested that Author is an attribute of the entity Book. But isn’t Author a separate entity, a special case of Person? The answer depends on our intended use for the database. Entities are things that we collect information about. Unless we intend to record biographical information about book authors, there is no reason to elevate them to the status of separate entities. An author is simply an attribute of Book, as far as we are concerned. On the other hand, a native speaker consultant for some language is generally of more interest: We may want to record their particular dialect, language biography, name and contact details, etc. Therefore we might decide to treat Consultant as an entity type in its own right.

28 Recall that table attributes can only hold one value for each record. This is why we say that entity attributes are a generalization (abstraction) of table attributes. In our discussion of logical database design, we show how complex and multi-valued attributes can be encoded in a relational database.

6.3. A common pattern

Let us consider a linguistic example in more detail: A cross-linguistic database of reflexives includes data about reflexives in various languages; the entities in this database include Language and Reflexive; a language may have several distinct reflexives that need to be described, but a reflexive can only be considered as belonging to one language at a time (numerous

Designing linguistic databases: A primer for linguists


European languages have a reflexive se, for example; but the se of each language has its own distinct properties, hence we must speak of separate, homophonous reflexives that are described independently of each other).

Exercise: What is the cardinality of the relationship between languages and reflexives? And is it partial or total for each entity involved?29

Our database of reflexives will also include sentences exemplifying the use of reflexives; let us consider the status of these sentences in the database design. We can begin by thinking of them as “examples,” which suggests that each sentence we record is obligatorily an example of some reflexive in our database. This is a common way of organizing a linguistic database. But we should point out that “example” is really a relationship between a sentence and a reflexive; hence we can consider Sentence to be our third entity type (or Phrase, or Text, if our examples are not exactly one sentence in length), and we let Is Example Of be a relationship between sentences and reflexives.30 This point of view allows us to use the same sentence as an example of several things (of two reflexives, perhaps, or of several properties of a particular reflexive); it also allows us to record sentences in our database that do not contain any reflexive (maybe we include the sentences for contrast, or we will identify a reflexive construction later). In other words, we have decided to treat the relationship between sentences and reflexives as many-to-many (since we can also give multiple examples of the same reflexive, of course), and as partial for both entities. We arrive at the following design:

[Figure: E-R diagram with the entity types Language, Construction (Reflexive), and Sentence]

29 Answer: Every reflexive belongs to some language, hence the relationship is total for reflexives. However, we can conceive of a language with no reflexives (at least hypothetically), so the relationship is partial for languages. The relationship is many-to-one, since a language can have many reflexives, but a reflexive is by definition a property of a single language.

30 More precisely, we can define a family of relationships of the type Is example of property X, where X is some property of reflexives that we are collecting information about (e.g., a morphological characteristic, a locality property, etc.).

This is an extremely common state of affairs for linguistic survey databases: The identifiable entities include Language, some form or Construction type that a language may have multiple varieties of, and Sentences that are used to exemplify each construction or some of its properties. To these three common entities we might add Source (for bibliographic references) and perhaps Person, for analysts, informants or other human contributors to the database (we might also treat each class of contributor as a different entity type). If a sentence can only exist as an example of a reflexive, our database might allow users to browse sentences only by first selecting a reflexive, and then viewing the sentences that are linked to it. But if the relationship between sentences and reflexives is partial (for sentences), the user interface must provide a way to view or search for sentences which does not rely on a reflexive that the sentence is an example of.

Exercise: Is the relationship between sentences and languages total or partial for each entity?

7. Logical design: From entities to tables

The conceptual design of our database, represented as an Entity-Relationship model, indicates the important characteristics of our database at an abstract level, allowing us to focus on the foundational issues of database structure without getting caught up in the technical limitations of relational database tables. But once the conceptual design has been completed, it must be used as the basis for a concrete relational design involving tables, keys and relationships. This is the function of logical design, the second stage in the database design process. We can think of logical design as the process of converting an E-R model into a relational structure. In its simplest form, the conversion is as follows:

1. Each entity type normally becomes a table;
2. Simple entity attributes become table attributes;
3. Simple one-to-many relationships between entities become table relationships represented by a foreign key.

More complex elements of the E-R model must be handled in other ways, which often involve the introduction of additional tables in the database; these additional tables, therefore, do not correspond to entities in the E-R model.


Let us consider the familiar example of the language directory; we have modeled it according to the following E-R diagram, which assumes that all information about a language will come from a single source. In other words, the relationship is one-to-many, with a given source (e.g. the Ethnologue directory) potentially providing information about several languages.

The logical design for this model begins with two tables, one for Language Details and one for Bibliographic Source. Each entity attribute is represented simply as a table attribute. We then choose suitable primary keys: For the Language table, we chose the ISO code which is guaranteed to be unique for each language (assume that we do not plan to identify any languages not recognized by SIL, the ISO code authority). For the Source table, we define as the primary key an arbitrary key chosen by the analyst entering the data, and expected to represent an author-year identifier in the customary style. However, the exact form of the arbitrary key is not important; we could have used an automatically assigned numeric value, which would require fewer decisions by the analyst but would have no mnemonic value. Once we have the two tables, we want to indicate that each record in the Language table cites as its source a single record in the Source table. We do this by including a foreign key, SourceID, as an attribute of the Language table. This allows each language to be linked to just one source (since we are only allowed to enter one value for the SourceID attribute), with no restrictions on how many languages can be linked to a single source. In other words, this foreign key implements a many-to-one relationship between languages and sources: Many languages, one source.31 All told, we have transformed the E-R diagram above into the following (already familiar) tables: 31

If we had instead added a foreign key to the Bibliographic Source table, the relationship would allow many sources to be linked to a single language – not what we want, in this case.

Language Details

Language Name | ISO code | Speakers    | Area   | SourceID*
English       | eng      | 309,000,000 | Europe | Eth15
(etc.)

Bibliographic Source

ID       | Title                                        | Author                        | Year | Publisher
Ashton47 | Swahili grammar (including intonation)       | Ashton, E.O.                  | 1947 | Longmans
Eth15    | Ethnologue: Languages of the world, 15th Ed. | Gordon, Raymond G., Jr. (ed.) | 2005 | SIL International
…
(etc.)
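As a rough illustration of the two tables above, the following Python sketch builds them in an in-memory SQLite database. It is only one possible rendering under stated assumptions: the choice of SQLite, the table and column names (`source`, `language`, `source_id`), the row for German (whose speaker count is left empty), and the key `Missing00` are all ours and serve purely as illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked to

conn.execute("""
    CREATE TABLE source (
        id        TEXT PRIMARY KEY,  -- analyst-chosen key, e.g. 'Eth15'
        title     TEXT,
        author    TEXT,
        year      INTEGER,
        publisher TEXT
    )""")
conn.execute("""
    CREATE TABLE language (
        iso_code  TEXT PRIMARY KEY,
        name      TEXT,
        speakers  INTEGER,
        area      TEXT,
        source_id TEXT REFERENCES source(id)  -- foreign key: many languages, one source
    )""")

conn.execute("INSERT INTO source VALUES (?, ?, ?, ?, ?)",
             ("Eth15", "Ethnologue: Languages of the world, 15th Ed.",
              "Gordon, Raymond G., Jr. (ed.)", 2005, "SIL International"))
conn.execute("INSERT INTO language VALUES (?, ?, ?, ?, ?)",
             ("eng", "English", 309000000, "Europe", "Eth15"))
# Illustrative second row: a second language citing the same source.
conn.execute("INSERT INTO language VALUES (?, ?, ?, ?, ?)",
             ("deu", "German", None, "Europe", "Eth15"))

# Follow the foreign key from each language to its single source.
rows = conn.execute("""
    SELECT l.name, s.title
    FROM language AS l JOIN source AS s ON l.source_id = s.id
    ORDER BY l.iso_code""").fetchall()

# A language that cites a non-existent source is refused by the DBMS.
try:
    conn.execute("INSERT INTO language VALUES (?, ?, ?, ?, ?)",
                 ("swh", "Swahili", 772000, "Africa", "Missing00"))
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Because the foreign key sits in the language table, nothing limits how many languages point to `Eth15`, while each language row has room for exactly one source.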

In this simple example, each entity corresponds to a table and each relationship corresponds to a foreign key relationship; there is no real difference between the entity-relationship model and the relational schema. But now let us consider a database on some phenomenon, e.g., a database of focus constructions, which includes example sentences. As we have seen, such a database will have the entities Language, Focus Construction, and Sentence. In the logical design, we model each of them as a separate database table. But a single sentence might include instances of two or more focus constructions, and a construction can be exemplified by many sentences; i.e., the relationship is many-to-many. A foreign key in the Sentence table can only identify one construction, and hence is insufficient. To encode a many-to-many relationship, we create a new table whose primary key is a combination of the primary keys of the two tables, Focus Construction and Sentence. (This is a complex primary key, consisting of two attributes.) Each record in this new table indicates a relationship between one construction and one sentence. Because it is the combination of both keys that serves as the primary key (and must therefore be unique within the table), each sentence ID and each construction ID can be used multiple times.


Focus Construction

Name | Language ID*
wa   | jpn
…
(etc.)

Sentence

ID  | Language ID* | Original Text | Gloss      | Translation
101 | jpn          | hono wa, …    | book Top … | As for the book, …
…
(etc.)

Focus-sentence (relationship)

Focus ID* | Sentence ID*
wa        | 101
wa        | 113
(etc.)
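The intermediate table can be sketched as follows, again in Python with an in-memory SQLite database. This is a minimal sketch, not a prescribed implementation: the table and column names are our own, the Sentence table is reduced to its ID and language, and the construction `cleft` is a hypothetical second entry added only so that one sentence can be linked to two constructions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE focus_construction (name TEXT PRIMARY KEY, language_id TEXT)")
conn.execute("CREATE TABLE sentence (id INTEGER PRIMARY KEY, language_id TEXT)")
conn.execute("""
    CREATE TABLE focus_sentence (
        focus_id    TEXT    NOT NULL REFERENCES focus_construction(name),
        sentence_id INTEGER NOT NULL REFERENCES sentence(id),
        PRIMARY KEY (focus_id, sentence_id)  -- complex key: each pair at most once
    )""")

# 'cleft' is a hypothetical construction, used only for illustration.
conn.executemany("INSERT INTO focus_construction VALUES (?, ?)",
                 [("wa", "jpn"), ("cleft", "jpn")])
conn.executemany("INSERT INTO sentence VALUES (?, ?)", [(101, "jpn"), (113, "jpn")])
conn.executemany("INSERT INTO focus_sentence VALUES (?, ?)",
                 [("wa", 101), ("wa", 113), ("cleft", 101)])

# One construction, many sentences ...
sentences_for_wa = [sid for (sid,) in conn.execute(
    "SELECT sentence_id FROM focus_sentence WHERE focus_id = 'wa' ORDER BY sentence_id")]
# ... and one sentence, many constructions.
constructions_for_101 = [name for (name,) in conn.execute(
    "SELECT focus_id FROM focus_sentence WHERE sentence_id = 101 ORDER BY focus_id")]

# Recording the same pair a second time violates the complex primary key.
try:
    conn.execute("INSERT INTO focus_sentence VALUES ('wa', 101)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Each ID may recur freely in its own column; only a repeated pair is refused.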

Multi-valued attributes also require an additional table, which does not correspond to an entity in the entity-relationship model. Consider the language attribute Alternate Name, a multi-valued attribute of the Language Details table (since a language can have multiple alternate names). Each alternate name is a simple string, not an entity; but since we cannot store multiple values in a single attribute of the Language Details table, we create an additional table, Alternate Name, whose purpose is to store multiple names for a language. Each record in this table includes the language ID of the language that the record is about, declared as a foreign key that must match the primary key of the Language Details table, ISO code; because the language ID is not the primary key of the Alternate Name table, it is possible to have multiple records for the same language ID.

Language Details

Language Name | ISO code | Speakers | Area   | SourceID*
Swahili       | swh      | 772,000  | Africa | Ashton47
(etc.)

Alternate Name

ID  | Language ID* | Name
101 | swh          | Kiswahili
102 | swh          | Kisuaheli
103 | swh          | Arab-Swahili
(etc.)

For illustration, the above table uses an arbitrary numeric primary key, ID; this allows us to enter any values we wish – even to enter the same alternate name multiple times for the same language. We could have instead declared the two attributes Language ID and Name as a complex primary key, which would make it impossible to accidentally enter the same alternate name twice for the same language:32

Alternate Name

Language ID* | Name
swh          | Kiswahili
swh          | Kisuaheli
swh          | Arab-Swahili
(etc.)
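The effect of the complex primary key can be tried out with a short Python sketch using an in-memory SQLite database. The table and column names and the helper `add_alternate_name` are assumptions of ours, and the Language Details table is cut down to two columns for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE language (iso_code TEXT PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE alternate_name (
        language_id TEXT NOT NULL REFERENCES language(iso_code),
        name        TEXT NOT NULL,
        PRIMARY KEY (language_id, name)  -- no name twice for the same language
    )""")

conn.execute("INSERT INTO language VALUES ('swh', 'Swahili')")
conn.executemany("INSERT INTO alternate_name VALUES (?, ?)",
                 [("swh", "Kiswahili"), ("swh", "Kisuaheli"), ("swh", "Arab-Swahili")])

def add_alternate_name(language_id, name):
    """Hypothetical helper: True if the row was stored, False if the DBMS refused it."""
    try:
        conn.execute("INSERT INTO alternate_name VALUES (?, ?)", (language_id, name))
        return True
    except sqlite3.IntegrityError:
        return False

# Entering 'Kiswahili' for Swahili a second time is refused.
duplicate_stored = add_alternate_name("swh", "Kiswahili")

names = [name for (name,) in conn.execute(
    "SELECT name FROM alternate_name WHERE language_id = 'swh' ORDER BY name")]
```

With the numeric ID design, the same insert would have succeeded silently, and the duplicate would have to be caught by the interface or by a later clean-up query.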

32 It would still be possible to enter the same alternate name for different languages, of course.

Note that (with either version) the table Language Details makes no mention of the Alternate Name property; the relationship is encoded in one direction only, by declaring the field Language ID as a foreign key that matches the primary key of the Language Details table.

A complex attribute (e.g., Name) does not need to be modeled in an additional table. We can simply model each component of the complex attribute as a separate table attribute (e.g., First Name, Surname, etc.).

We do need a separate table in the case of a relationship with attributes, even if the relationship is not many-to-many. Recall our example of a relationship with attributes, the Citation relationship in our Language Details example. We have seen that the page number on which a particular language property is discussed is not a property of the bibliographic source, but of the relationship between a language and a source. Therefore, we can create a table for the relationship; its primary key is a combination of the primary keys for Language and Source, and it contains the additional attribute Page Numbers where the appropriate page range appears.

Language Details

Language Name | ISO code | Speakers    | Area
English       | eng      | 309,000,000 | Europe
(etc.)

Bibliographic Source

ID       | Title                                        | Author                        | Year | Publisher
Ashton47 | Swahili grammar (including intonation)       | Ashton, E.O.                  | 1947 | Longmans
Eth15    | Ethnologue: Languages of the world, 15th Ed. | Gordon, Raymond G., Jr. (ed.) | 2005 | SIL International
…
(etc.)

Citation (relationship)

Language ID* | SourceID* | Pages
eng          | Eth15     | 45
deu          | Eth15     | 42
swh          | Ashton47  | 1–21
(etc.)
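A relationship table with its own attribute might be set up along the following lines. The Python sketch uses an in-memory SQLite database; the table and column names, the trimmed-down Language and Source tables, and the helper `citations_for` are our assumptions, while the citation rows are those of the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE language (iso_code TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE source (id TEXT PRIMARY KEY, author TEXT, year INTEGER)")
conn.execute("""
    CREATE TABLE citation (
        language_id TEXT NOT NULL REFERENCES language(iso_code),
        source_id   TEXT NOT NULL REFERENCES source(id),
        pages       TEXT,  -- text rather than a number, so that ranges like '1–21' fit
        PRIMARY KEY (language_id, source_id)
    )""")

conn.executemany("INSERT INTO language VALUES (?, ?)",
                 [("eng", "English"), ("deu", "German"), ("swh", "Swahili")])
conn.executemany("INSERT INTO source VALUES (?, ?, ?)",
                 [("Ashton47", "Ashton, E.O.", 1947),
                  ("Eth15", "Gordon, Raymond G., Jr. (ed.)", 2005)])
conn.executemany("INSERT INTO citation VALUES (?, ?, ?)",
                 [("eng", "Eth15", "45"), ("deu", "Eth15", "42"),
                  ("swh", "Ashton47", "1–21")])

def citations_for(iso_code):
    """Hypothetical helper: (author, year, pages) for every source cited for a language."""
    return conn.execute("""
        SELECT s.author, s.year, c.pages
        FROM citation AS c JOIN source AS s ON c.source_id = s.id
        WHERE c.language_id = ?""", (iso_code,)).fetchall()

swahili_citations = citations_for("swh")
```

The page information lives in the citation row, so the same source can be cited for English and German with different pages without touching either the Language or the Source table.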

7.1. Enumerated attribute values

There is another important use for tables that do not correspond to an entity: To hold lists of possible values for attributes whose value must be taken from a fixed set of choices (enumerated attributes). For example, a cross-linguistic database might assign each language to a geographic macro-area from the following pre-determined set of options:

    Africa
    Eurasia
    SE Asia and Oceania
    Australia and New Guinea
    North America
    South America

Using an ordinary text field to hold these values would be inefficient and error prone – the analyst would have to type the above values over and over, and nothing would prevent them from entering minor variations (“S. America”) or even areas that are not on the above list (e.g., “Central America”). For this reason, it is much preferable for the user interface to show the above areas as a fixed set of choices from which the analyst must choose, in the form of a drop-down menu, “radio buttons”, or perhaps in some other way.

Any system used to create the user interface will provide ways of creating menus, etc., that display a list of possible values. But where do these values come from? The set of possible values is part of the structure of the database, since it represents the domain of the attribute in question. The best approach is to build such information into the database itself, as follows: We create a table whose sole purpose is to represent an attribute domain, e.g., “macro-areas”. This table could have a numeric primary key, or the names of the macro-areas could themselves be the primary key:

Linguistic Macro-area

ID | Area
1  | Africa
2  | Eurasia
(etc.)

To use this, we declare the Area attribute of our Language Details table to be a foreign key, pointing to this table. The DBMS can ensure that any foreign key value we enter matches a record in the Linguistic Macro-area table; the result is that our database is guaranteed to store only legal values for this attribute.

Language Details

Language Name | ISO code | Speakers    | Area* | SourceID*
English       | eng      | 309,000,000 | 2     | Eth15
(etc.)
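The following Python sketch shows how such a domain table might behave in practice, using an in-memory SQLite database. The table and column names are ours; the IDs 4 to 6 are assumed to follow the order of the list of macro-areas given earlier, and the language code `xxx` is a placeholder used only to provoke the error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE macro_area (id INTEGER PRIMARY KEY, area TEXT NOT NULL UNIQUE)")
conn.execute("""
    CREATE TABLE language (
        iso_code TEXT PRIMARY KEY,
        name     TEXT,
        area_id  INTEGER NOT NULL REFERENCES macro_area(id)
    )""")

# The attribute domain is filled in before any language data is entered.
AREAS = ["Africa", "Eurasia", "SE Asia and Oceania",
         "Australia and New Guinea", "North America", "South America"]
conn.executemany("INSERT INTO macro_area (id, area) VALUES (?, ?)",
                 list(enumerate(AREAS, start=1)))

conn.execute("INSERT INTO language VALUES ('eng', 'English', 2)")

# An area code outside the domain is refused by the DBMS itself.
try:
    conn.execute("INSERT INTO language VALUES ('xxx', 'Placeholder', 7)")
    bad_area_rejected = False
except sqlite3.IntegrityError:
    bad_area_rejected = True

# The interface can read its menu entries from the same table ...
menu = [area for (area,) in conn.execute("SELECT area FROM macro_area ORDER BY id")]
# ... and translate stored codes back into labels with a join.
english_area = conn.execute("""
    SELECT m.area FROM language AS l JOIN macro_area AS m ON l.area_id = m.id
    WHERE l.iso_code = 'eng'""").fetchone()[0]
```

Since both the menu and the validity check draw on one table, the list of legal values cannot drift apart between the interface and the stored data.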


It is important to realize that the table Linguistic Macro-area is part of the design of the database: The six records it contains are defined before data about any language has been entered into the database, and it will never grow beyond these six rows as we enter data for different languages – unless we decide to amend our list of macro-areas. In other words, this table does not contain data about any language or other entity; it is an auxiliary device representing an attribute domain. This arrangement is not the simplest one imaginable: Any environment for building our user interface could allow us to store the list of possible values in an internal list, without involving a database table. We do not recommend this approach, for two important reasons: First, it separates the attribute domain information from the database definition, and embeds it in the interface application. If we later need to use the database independently of this interface application, the domain information will be lost. For example, we might wish to dump our desktop database and create a web interface for it, or to create an improved interface application to replace our first effort. In many cases, the database will contain only a small numeric value (between 1 and 6, in our example), which the interface application must interpret and display as the name of a macro-area. In the above example, we might end up with just numeric codes for the area, with no indication of which part of the world corresponds to area code 2 – this information can only be retrieved by examining the original user interface, if that is still possible. The second reason to avoid this approach is that it makes it harder to extend or revise the enumerated values during the course of the project. 
While we can reasonably hope to arrive at a list of macro-areas that will not need to be changed during the lifetime of our project, the same is not the case with classifications about which we do not have a clear picture in advance. For example, the Sentence entity in our database of reflexives might have an attribute Clause Embedding Type, for the type of complementation involved in multi-clause sentences. Whatever classification of embedding types we might initially arrive at, it is quite likely that we will wish to amend or extend it during the course of the project, in the face of unexpected constructions from new languages. Such changes should be undertaken with care, of course, to avoid invalidating data already in the database; but they are very often necessary in research-oriented databases. Hiding this kind of information in the user interface makes it harder to amend; the change typically requires a programmer, and since each interface environment is different, it can be difficult to locate someone with the necessary skills. It is easy, on the other hand, to arrange for a way to modify the contents of a database table (e.g., by providing a suitable form); even if no such provision was made in advance, database tables are well-understood and easier to modify than an unfamiliar collection of scripts.

The techniques we have discussed can be combined as needed. For example, we might want an enumerated attribute that takes multiple values. In our reflexives database, we wish to indicate the alternative meanings that a reflexive construction might have: in addition to being reflexive, the Italian reflexive se, for instance, could have reciprocal, middle, impersonal, or certain other meanings. The possible meanings we identify are recorded on a list, as indicated in this section; but since we need to allow more than one alternative meaning to be selected, reflexives and meanings are in a many-to-many relationship. Accordingly, an intermediate table is used to relate reflexives and possible meanings, as described in the previous section.

7.2. Managing enumerated domains

Enumerated attributes are very common in linguistic applications; by creating a table to store the possible values of each domain, linguistic databases tend to become littered with dozens of small tables which can become difficult to manage – especially if we must make provisions for a form that will allow editing of the domain values, a means of activating this form, etc. While the approach we have described so far is quite commonplace, we will briefly present a refinement that can greatly simplify the management of large numbers of enumerated domains: Instead of creating a separate table for each domain, create a single table which will hold all enumerated values, listing the domain to which each value belongs alongside its ID:

Enumerated Value Definition

Domain    | Value ID | Value Label
MacroArea | 1        | Africa
MacroArea | 2        | Eurasia
MacroArea | 3        | SE Asia and Oceania
Embedding | 1        | Tensed Complement
Embedding | 2        | Infinitival complement
Embedding | 3        | Paratactic construction
(etc.)


The user interface, then, will provide as possible values of the Area attribute not the entire contents of some table, but only the values of the above table that have the Domain “MacroArea”. The above table has a complex primary key consisting of the attributes Domain and Value ID (as indicated by underlining). This ensures that we will not accidentally define the same value ID twice for the same domain. Other arrangements are also possible. In addition to reducing the number of distinct tables, this approach has the considerable advantage that a single form, or group of forms, can be used to manage all the enumerated values – even values for domain types that have not been created yet. In addition, it is easy to add a column to this table for the documentation of each value; this makes it possible to document the meaning of each possible value in such a way that it is part of the database. (How this information is presented to the users, of course, is up to the designers of the interface).
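A combined table of this kind could be set up and queried roughly as follows. The Python sketch uses an in-memory SQLite database; the table and column names, the optional `note` column for documentation, and the helper `choices` are our assumptions, while the rows are those listed in the Enumerated Value Definition table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enumerated_value (
        domain   TEXT    NOT NULL,
        value_id INTEGER NOT NULL,
        label    TEXT    NOT NULL,
        note     TEXT,                    -- optional documentation of the value
        PRIMARY KEY (domain, value_id)    -- no ID twice within one domain
    )""")

conn.executemany(
    "INSERT INTO enumerated_value (domain, value_id, label) VALUES (?, ?, ?)",
    [("MacroArea", 1, "Africa"),
     ("MacroArea", 2, "Eurasia"),
     ("MacroArea", 3, "SE Asia and Oceania"),
     ("Embedding", 1, "Tensed Complement"),
     ("Embedding", 2, "Infinitival complement"),
     ("Embedding", 3, "Paratactic construction")])

def choices(domain):
    """Hypothetical helper: the labels an interface would offer for one domain."""
    return [label for (label,) in conn.execute(
        "SELECT label FROM enumerated_value WHERE domain = ? ORDER BY value_id",
        (domain,))]

area_menu = choices("MacroArea")
embedding_menu = choices("Embedding")

# Defining value 1 a second time for the same domain is refused.
try:
    conn.execute("INSERT INTO enumerated_value (domain, value_id, label) "
                 "VALUES ('MacroArea', 1, 'Africa again')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The same `choices` call serves any domain added later, which is the practical gain of keeping all enumerated values in one place.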

8. Physical database design and beyond

Logical database design, when properly carried out, converts a high-level model of the information domain into a system of tables, keys and relationships that can be directly expressed in the “relational algebra” that underlies relational database management systems. The next step is to create an actual database that realizes this design (which is in any event preliminary, since it is certain to be modified down the road). This is not a trivial step; while all relational DBMSs are based on the same abstract mathematical model (the aforementioned relational algebra), they differ in a multitude of details that must be taken into account at this stage: The designer must choose among the several “table engines” that the DBMS may provide for the actual storage of data, each with different features and performance benefits; each database attribute must be given an appropriate data type (and maximum size, as appropriate), from the many varieties of numeric, text and binary data types provided by the DBMS (see section 4.1). A DBMS may also have limitations, such as being unable to use a certain type as a primary key, or extra features not supported by most other systems. From our perspective, issues of character encoding are particularly important; for a cross-linguistic database created today, use of Unicode is an absolute requirement if the data will include non-Latin text. There is no excuse for using a different character encoding method for a cross-linguistic database today.

66 Alexis Dimitriadis and Simon Musgrave The database designer should also consider indexes at this stage. An index allows the DBMS to find records with a certain value without examining all records in a table, and can provide a noticeable speed-up for large tables and complex queries; many DBMSs will automatically index certain kinds of attributes (e.g., primary keys), but additional indexes are sometimes useful. Indexes do slightly slow down the performance of the DBMS during data entry, since they need to be updated whenever indexed data is inserted or modified. However, this time cost is minuscule; if your database does not run a busy airport, you will not notice any delay due to updating indexes. More generally, this is the stage where performance issues should be considered. We have emphasized that the design process should focus on correct design, which is the most important factor in obtaining good performance. If this rule is observed, the comparatively small size of linguistic databases means that there is usually no need for further attention to performance: With a properly structured database, correct queries and a few indexes at the right places, few linguists are likely to encounter performance problems in the speed of database queries. If a problem does arise, it is almost certainly due to a flaw in the design of the tables or queries. Narrow attempts at “optimizing,” such as storing numeric codes instead of text values in the hope that this will reduce processing time, are likely to produce negligible runtime improvements; but the added complexity they introduce will require extra time to code and debug, which will easily exceed the expected time savings from speeding up the database. In practice, the major restriction on the results which can be obtained from a linguistic database is the existence of errors in the data which is entered. 
Input errors are inevitable, but some of the techniques which we have discussed previously, such as appropriate data typing and the use of enumerated values, can and should be used to minimize the impact of this problem. Based on all these considerations, the database developer will build the tables and relationships that constitute the database; depending on the DBMS and supporting applications used to manage it, the database might be defined with a series of SQL commands, or interactively through a graphical management interface provided by the database application.33 An actual, working database is only one half of the data management system. The other half is the interface application, or applications, that will 33

For the open-source stand-alone DBMSs MySQL and PostgreSQL, there are independently developed applications that provide a graphical, web-based management interface.


be used to get data into and out of the database. The interface application should provide forms, reports and/or other facilities for carrying out all the actions that the project needs: Entering data of various kinds, and searching for, displaying and summarizing information. We will not attempt even a cursory overview of the issues involved in designing the interface to a database, because these depend, much more than the relational design of the database, on the particular needs and procedures of the project; each project will have different needs and will therefore require different solutions. The view of the data presented by the user interface can be quite different from that developed during database design, because the two are based on different considerations: While the relational design of the database should be based on the internal logic of the data, the design of the user interface should be based on what we want to do. In particular, the user interface will often group in one form information from different tables, or will present only parts of a table in a form, as appropriate for the use at hand. Our database of reflexives might show, next to each sentence, the name and language family of the language that the sentence belongs to; this information is drawn from the Language table, by means of a suitable query or view. For a questionnaire-based survey, it could be useful to have an application that presents each question of the questionnaire, in the correct order. Whether such features are worth the effort, or whether they are even a good idea, really depends on the specific needs of a project, including the degree to which the questions and answers can be expected to change in the course of data collection. At a minimum, the user interface should display the data in a way that allows it to be easily interpreted, and should support the required operations (data entry, modifications, searches). 
Implementation features, such as numeric IDs or tables implementing many-to-many relationships, should be hidden from the user. Beyond this, the issues are no different from those that arise for any software development task; software development is a large and complex topic, and one that we have neither the resources nor the qualifications to offer concrete instruction on. In closing, therefore, we limit ourselves to the following general recommendations:

1. Get skilled help if possible; but stay actively involved if you do.
2. Consult several future users.
3. Test, and test again.
4. Anticipate the need for more revisions down the line.

68 Alexis Dimitriadis and Simon Musgrave 9. Some problems, solutions, and limitations While relational databases provide a powerful framework for data management, they are not equally well suited to all uses. In this section we touch on some potentially tricky issues, and perceived or real limitations of relational databases. Naturally the discussion in no way exhausts the possible sources of trouble.

9.1. Sequences and multiple values

A common source of difficulty involves multiple values. It is not uncommon, in linguistic description, to assign multiple values to a single property. For example, a certain construction may have more than one information-structure function, a language may be known by more than one name, a verb may have more than one meaning. Databases, as we have seen, should assign only one value to each attribute of a record; if we wish to allow more values, it is necessary either to create duplicate attributes (Function2, Function3, etc.) or to use an additional linked table for the multi-valued attribute. Both methods introduce complexity to the database that seems out of proportion with the modest goal of occasionally recording multiple values for an attribute.34

Suppose that in our database we normally record one basic word order for each language, but a few tricky cases require more than one order to be entered. In this case it might be enough to add an extra field Secondary Word Order, or some such. But note that such a field is not interchangeable with the first word order field; any queries must explicitly take into account the fact that there are two (or more) places to store a word order. Moreover, if there are any attributes that represent properties specific to each word order, they will need to be duplicated as well. In more complex cases, it is not usually a good idea to create duplicate attributes.

A variation of this problem arises when it is necessary to store sequences of atomic elements, especially if the maximum number of elements cannot be specified. This is the case if our database is to be used for the analysis of words (as sequences of morphemes), or of sentences (as sequences of words, or of words and morphemes).

34 For this reason, perhaps, Filemaker departs from the relational model and can handle multiple attribute values without requiring an additional table to be created.
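The linked-table option for a multi-valued attribute can be sketched as follows. This is only an illustration, written in Python with the standard sqlite3 module; the table and column names (Language, LanguageWordOrder, IsPrimary) and the two sample languages are hypothetical and do not come from any of the databases discussed here.

```python
import sqlite3


def build_word_order_db():
    """Create an in-memory database in which word order lives in a linked table."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE Language (
            LanguageID INTEGER PRIMARY KEY,
            Name       TEXT NOT NULL
        );
        -- One row per (language, word order) pair, so there is no upper
        -- limit on the number of orders a language may be assigned.
        CREATE TABLE LanguageWordOrder (
            LanguageID INTEGER NOT NULL REFERENCES Language(LanguageID),
            WordOrder  TEXT    NOT NULL,
            IsPrimary  INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (LanguageID, WordOrder)
        );
    """)
    # Hypothetical sample data: Language B has two recorded orders.
    con.executemany("INSERT INTO Language VALUES (?, ?)",
                    [(1, "Language A"), (2, "Language B")])
    con.executemany("INSERT INTO LanguageWordOrder VALUES (?, ?, ?)",
                    [(1, "SVO", 1), (2, "SOV", 1), (2, "SVO", 0)])
    return con


def languages_with_order(con, order):
    """Return the names of all languages recorded with the given word order."""
    rows = con.execute("""
        SELECT l.Name
        FROM Language AS l
        JOIN LanguageWordOrder AS w ON w.LanguageID = l.LanguageID
        WHERE w.WordOrder = ?
        ORDER BY l.Name
    """, (order,))
    return [name for (name,) in rows]


con = build_word_order_db()
print(languages_with_order(con, "SVO"))
```

A single query now finds a word order wherever it is recorded, which is exactly what a pair of fields such as Word Order and Secondary Word Order cannot offer without mentioning both fields explicitly.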


In designing a table to store the morphological structure of words, we might wish to store the entire decomposition of each word in one record; to do so we’d need to place each morpheme in a separate column, i.e., we’d need duplicate attributes Morpheme1, Morpheme2, Morpheme3, etc. Because columns need to be pre-defined when the database is designed, it is necessary to assign an arbitrary upper limit to the number of morphemes which can make up a word – but what do we do if a word with a greater number of constituent morphemes is found?

ID  Form           Language*  Morph1  Morph2  Morph3  Gloss1  Gloss2  …
37  wa-li-on-an-a  swh        wa      li      on      c2      Pst     (etc).

In this example, this design is in any event inadvisable because each morpheme will need to be described by additional attributes, each of which also needs to be duplicated for the maximum possible number of morphemes; maintaining duplicates of so many attributes will quickly make it impossible to manage the database, or to write useful queries.

The solution is provided by the design principles we have discussed: Morphemes should be treated as a separate entity type, in a many-to-one relationship with words; therefore there should be a separate table, in which each morpheme occupies a separate row. This eliminates the duplication of attributes and simultaneously removes the limit on the number of morphemes, albeit at the cost of considerable added complexity for the database programmer: The table of morphemes must include information about sequential position (rank), the morpheme itself and a foreign key pointing to the word being analyzed – all this in addition to the information we want to record about each morpheme. A single word is now spread over several rows of two tables (Word and Morpheme), which must be linked to other tables via foreign keys as necessary. The user interface must present this information in the familiar form of a string of morphemes, and convert in the opposite direction on data input.35

35 Note that this is a simplified example; in a real database of this sort, each morpheme might point to an entry in a dictionary of morphemes, so that the gloss and properties of the morpheme -an ‘Recip’, for example, need only be stored in one place. The words themselves would probably be linked to a text or sentence, and not directly to the language.

Word
ID  Form           Language*  …
37  wa-li-on-an-a  swh
(etc).

Morpheme
Word ID*  Rank  Form  Gloss
36        1     wa    c2
36        2     toto  child
37        1     wa    c2
37        2     li    Pst
37        3     on    see
37        4     an    Recip
37        5     a     FV
(etc).
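The two-table design can be tried out with a short sketch, here in Python with the standard sqlite3 module. The rows are those of the tables above, except that the Word row for ID 36 is inferred from its morpheme rows; the function names and the index are illustrative assumptions, and the joining of morphemes with hyphens stands in for what a real user interface would do.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Word (
        ID       INTEGER PRIMARY KEY,
        Form     TEXT NOT NULL,
        Language TEXT NOT NULL
    );
    CREATE TABLE Morpheme (
        WordID INTEGER NOT NULL REFERENCES Word(ID),  -- foreign key to the word
        Rank   INTEGER NOT NULL,                      -- position within the word
        Form   TEXT    NOT NULL,
        Gloss  TEXT    NOT NULL,
        PRIMARY KEY (WordID, Rank)
    );
    -- An index lets the database find a gloss without scanning any text.
    CREATE INDEX MorphemeGloss ON Morpheme (Gloss);
""")
con.executemany("INSERT INTO Word VALUES (?, ?, ?)", [
    (36, "wa-toto", "swh"),        # inferred from the morpheme rows below
    (37, "wa-li-on-an-a", "swh"),
])
con.executemany("INSERT INTO Morpheme VALUES (?, ?, ?, ?)", [
    (36, 1, "wa", "c2"), (36, 2, "toto", "child"),
    (37, 1, "wa", "c2"), (37, 2, "li", "Pst"), (37, 3, "on", "see"),
    (37, 4, "an", "Recip"), (37, 5, "a", "FV"),
])


def segmented(con, word_id):
    """Rebuild the familiar morpheme string and gloss string for one word."""
    rows = con.execute(
        "SELECT Form, Gloss FROM Morpheme WHERE WordID = ? ORDER BY Rank",
        (word_id,),
    ).fetchall()
    forms = "-".join(form for form, _ in rows)
    glosses = "-".join(gloss for _, gloss in rows)
    return forms, glosses


def words_with_gloss(con, gloss):
    """Return the IDs of all words containing a morpheme with this gloss."""
    rows = con.execute(
        "SELECT DISTINCT WordID FROM Morpheme WHERE Gloss = ? ORDER BY WordID",
        (gloss,),
    )
    return [word_id for (word_id,) in rows]


print(segmented(con, 37))
print(words_with_gloss(con, "Recip"))
```

The conversion in the opposite direction (splitting an entered string into ranked rows) is the part of the work that falls to the data-entry interface.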

This design is not as simple as one might have hoped for, but it is necessary if we wish to manage detailed information about each morpheme in our text. If we only need to know where the morphological boundaries are, it may be simpler to store each complete word as a single string, and let the user interface manage and display it appropriately.

For a concrete example, we turn to the equivalent task involving sentences. Consider a multilingual database that focuses on a certain phenomenon and includes glossed example sentences. The following structure is often sufficient for the sentences:36

Sentence
ID   Original          Morphemic Tier         Gloss                         Translation                  …
307  Watoto walionana  Wa-toto wa-li-on-an-a  c2-child c2-Pst-see-Recip-FV  The children saw each other
(etc.)

36 We suppress other properties of the sentence that we may be interested in.


Instead of treating every morpheme as a separate value, we have allocated one value (attribute) for each tier of the sentence. (Note that each table cell contains one text string: they are shown on two lines for typographical reasons only). The database interface can parse each tier and display them in a word-aligned manner in the customary way; gloss labels used in the gloss tier can even be extracted and looked up in a table of glosses, for validation or to display their definition. If all this is handled by the user interface application, there is no need to burden the database schema by modeling each morpheme as a distinct value. In the web interface, the underlined gloss labels are hyperlinks that bring up a definition from a table of glosses:

Swahili (swh), strategy: -ana    ID 544

(ok)  wa-toto   wa-li-sem-an-a
      c2-child  c2-Pst-speak-Recip-FV
      ‘The children spoke to each other’

The downside of this approach is that we lose some of the power of relational databases. For example, we must use a full-text search to find sentences in which some morpheme occurs; while text searches are more than fast enough for a linguistic database on today’s computers, they are slower than relying on an index, and might lead to slower performance for very large or complex applications. More importantly, it is no longer possible to store detailed information for each morpheme in a word or sentence. For such uses, the more complex design presented above is necessary. Note also that we have reduced the complexity of the database structure by shifting some responsibilities to the user interface. Our interface application must ensure, for example, that the sentence tiers contain equal numbers of words and morphemes. And if a new user interface is needed at some point (perhaps to convert a desktop database into a web database, or to replace an obsolete web interface), this functionality will have to be implemented again. The right design choice depends on the nature of our interest in the data: If our database is intended to support the detailed analysis of primary data, it is probably a good idea to fully represent the morphemic decomposition in the relational design. If our database is targeted to some morphosyntactic phenomenon (e.g., reflexives) and the sentences serve as supporting material, the simpler design just sketched may well be sufficient.
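The responsibilities that this design shifts to the interface can be sketched in a few lines of Python. The sketch assumes that words are separated by spaces and morphemes by hyphens, as in the examples above; the function names are hypothetical, and a real interface would need to handle other boundary symbols (clitics, infixes) as well.

```python
def tier_shape(tier):
    """Return the number of hyphen-separated morphemes in each word of a tier."""
    return [len(word.split("-")) for word in tier.split()]


def tiers_aligned(morphemic, gloss):
    """True if both tiers have the same words-and-morphemes shape."""
    return tier_shape(morphemic) == tier_shape(gloss)


def gloss_labels(gloss):
    """Extract the individual gloss labels, e.g. for lookup in a table of glosses."""
    labels = []
    for word in gloss.split():
        labels.extend(word.split("-"))
    return labels


morphemic = "wa-toto wa-li-sem-an-a"
gloss = "c2-child c2-Pst-speak-Recip-FV"

print(tiers_aligned(morphemic, gloss))
print(gloss_labels(gloss))

# A tier with a missing morpheme boundary is caught by the same check.
print(tiers_aligned("wa-toto wa-li-sem-ana", gloss))
```

Finding the sentences in which a given label occurs then amounts to testing membership in the list returned by gloss_labels for every stored sentence, which is the full-text search mentioned above rather than an indexed lookup.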

9.2. Boolean attributes in theory and practice

Boolean (true/false) attributes are frequently used in typological databases, where many fields may contain information about whether a given language has a certain property or not. It is also not uncommon in linguistic theory to model a multi-valued property as a combination of binary features; for example, Chomsky and Halle (1968) treat the three-way vowel height opposition low, mid, high in terms of two features (since two binary features can occur in four different combinations, one of the four must be excluded from occurring); Goedemans and Van der Hulst (this volume) classify primary accent placement patterns in terms of seven binary parameters; etc. These analyses were motivated by good linguistic reasons (theoretical and/or empirical), and linguists who use them in their work may well decide to design their database along the same lines. Care should be taken to document the behavior and meaning of the novel parameters (especially for a publicly accessible database), since their meaning will not be obvious to a viewer unfamiliar with the framework in question. Often the data can be presented in terms of both the formal features and the traditional classification (which the database, or the interface application, should be able to compute from the formal features).

While decomposing a linguistic property into features can be desirable for linguistic reasons, the technique should never be used purely as an attempt to “optimize” the performance of the database: A property that can take a restricted number of values should be modeled as an enumerated attribute, using the methods we have already discussed.
Working with a single multi-valued attribute is simpler, more convenient and more intuitive; and it is in fact faster than a query that must address several artificial Boolean features in order to reconstruct a multi-valued attribute value.37

A related technique is to encode a multi-valued property as a collection of binary “characters,” Boolean variables that each determine whether the property has a particular value; e.g., the property Basic Word Order could be represented as the Boolean variables “BWO = SVO”, “BWO = SOV”, and so on for all seven possible word orders (we include the word order “Free” among the possibilities). In this case the database, or the analyst, needs to ensure that only one of the seven can take the value True for any language (unless, of course, the project defines “basic word order” in such a way that a language could have more than one). While characters are useful for certain types of statistical analysis, we recommend always storing such properties in non-decomposed form, as an enumerated attribute; if characters are needed, it is straightforward to generate them dynamically with an SQL query.38

The general principle we have advocated applies here as well: The database should be designed so as to model the organization of the data (to the best of our understanding of it). While it may be justifiable to simplify the design in respects that are not important to our needs, modifications that make the database more complex than required by its conceptual model are never a good idea.

37 See Dunn et al. (2005) for an example of the decomposition of typological features as binary characters, and Wichmann and Kamholz (to appear) for additional comments on the dangers of decomposing multi-valued properties.
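To make this recommendation concrete, the following sketch stores basic word order as one enumerated attribute and derives the binary characters only on demand. It is written in Python with the standard sqlite3 module. The names WordOrder, Label, BasicWordOrder and character_query are assumptions made for this illustration, as are the two sample rows; the seven values are the six permutations of S, V and O plus “Free”.

```python
import sqlite3

WORD_ORDERS = ["SVO", "SOV", "VSO", "VOS", "OVS", "OSV", "Free"]

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the enumeration
con.executescript("""
    -- The permitted values live in a table, so the list can be extended
    -- later without changing the design.
    CREATE TABLE WordOrder (Label TEXT PRIMARY KEY);
    CREATE TABLE Language (
        Name           TEXT PRIMARY KEY,
        BasicWordOrder TEXT NOT NULL REFERENCES WordOrder(Label)
    );
""")
con.executemany("INSERT INTO WordOrder VALUES (?)", [(v,) for v in WORD_ORDERS])
con.executemany("INSERT INTO Language VALUES (?, ?)",
                [("Swahili", "SVO"), ("Japanese", "SOV")])
con.commit()


def character_query(values):
    """Build a SELECT that emits one 0/1 character column per enumerated value.

    The values are interpolated directly into the SQL text, which is only
    acceptable because they come from the fixed list above, not from users.
    """
    cases = ",\n       ".join(
        f"CASE BasicWordOrder WHEN '{v}' THEN 1 ELSE 0 END AS char{v}"
        for v in values
    )
    return f"SELECT Name,\n       {cases}\nFROM Language\nORDER BY Name"


characters = con.execute(character_query(WORD_ORDERS)).fetchall()
for row in characters:
    print(row)
```

Because each language has exactly one stored value, exactly one character per row is 1, and no separate check is needed to keep the Boolean columns mutually exclusive.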

38 If the basic word order is stored in a field called Order, we can generate characters named charSVO, charSOV, etc., as follows (consult an SQL manual for documentation on the CASE statement; because ORDER is a reserved word in SQL, the field name must be quoted):

    SELECT *,
      CASE "Order" WHEN 'SVO' THEN 1 ELSE 0 END AS charSVO,
      CASE "Order" WHEN 'SOV' THEN 1 ELSE 0 END AS charSOV,
      ...
    FROM "Language Details";

9.3. Design changes

The most important characteristic of a database is that it is intended to manage a structured collection of data. While we have seen that a database consisting of tables and relations can support an interface that is not at all table-like, it is still true that a database must be carefully designed and structured before it is put into use. This presents problems for the creators of research databases, who might expect to work with the data for some time in order to develop an understanding of how it should be analyzed and structured. In such cases, it is important not to build restrictions into the software, and especially into the database model, that may conflict with the eventual analysis that will need to be expressed. (This is another reason that unnecessary “optimizations” should be avoided).

Imagine a cross-linguistic database that includes phonotactic information; the database might have been built on the assumption that it is enough to characterize each language by its maximal syllable template (e.g., CVN), which can be stored as a simple attribute in the Language table. But suppose that the researchers, during the course of their work, find it necessary to record separate data about each phonotactic pattern in a language (e.g., to study the properties of CV and CVN separately). As we have just seen, this requires syllable templates to become a separate entity that is in a many-to-one relationship with languages. This will involve a fundamental change in the design of the database.

A simpler problem can arise if a database includes a hard-coded list of enumerated values; e.g., verb argument types such as direct object, indirect object, prepositional phrase, etc. In the course of data collection, the users of this database might encounter languages that require an argument type that is not on this list; if the programmer who created the database is gone and has not arranged for an easy way to extend the list, considerable inconvenience will result. (The design sketched in section 7.2 is an effective way to avoid this problem).
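The syllable-template scenario can be illustrated with a small sketch of what such a design change involves at the level of the tables. It uses Python with the standard sqlite3 module; every name in it (MaxSyllableTemplate, SyllableTemplate, IsMaximal, migrate_templates) and both sample languages are hypothetical. It covers only the data; queries, forms and reports that refer to the old attribute would need to be revised as well, which is where much of the real cost lies.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Original design: one maximal template per language, stored as an attribute.
    CREATE TABLE Language (
        LanguageID          INTEGER PRIMARY KEY,
        Name                TEXT NOT NULL,
        MaxSyllableTemplate TEXT
    );
    INSERT INTO Language VALUES (1, 'Language A', 'CVN');
    INSERT INTO Language VALUES (2, 'Language B', 'CV');
""")


def migrate_templates(con):
    """Turn syllable templates into a separate entity, many-to-one with languages."""
    con.executescript("""
        CREATE TABLE SyllableTemplate (
            TemplateID INTEGER PRIMARY KEY,
            LanguageID INTEGER NOT NULL REFERENCES Language(LanguageID),
            Template   TEXT    NOT NULL,
            IsMaximal  INTEGER NOT NULL DEFAULT 0,
            Notes      TEXT,
            UNIQUE (LanguageID, Template)
        );
        -- Carry the existing values over as the maximal template of each language.
        INSERT INTO SyllableTemplate (LanguageID, Template, IsMaximal)
            SELECT LanguageID, MaxSyllableTemplate, 1
            FROM Language
            WHERE MaxSyllableTemplate IS NOT NULL;
    """)


migrate_templates(con)

# Further patterns can now be recorded and described individually.
con.execute(
    "INSERT INTO SyllableTemplate (LanguageID, Template) VALUES (1, 'CV')")

templates = con.execute("""
    SELECT Template, IsMaximal
    FROM SyllableTemplate
    WHERE LanguageID = 1
    ORDER BY Template
""").fetchall()
print(templates)
```

The old column is deliberately left in place here; removing it is a separate step that should wait until nothing else depends on it.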

9.4. Hierarchical structure

An important limitation of relational databases is that, by the nature of the “relational algebra” they are based on, they are not very good at managing recursive hierarchical structure. Parsed syntactic trees, in particular, are not a good fit for the kinds of queries a relational database can support. In such cases the parsed tree might be stored in a database, perhaps along with indexing information, but many kinds of searches would have to rely on functionality outside the database proper. Other strategies are needed if representing hierarchical information is important on a project (see Van Halteren 1997: 42–46 for discussion). Similarly, database queries are not the most convenient tool for large text corpora (parsed or unparsed). In such cases, especially for small projects, it may be most straightforward not to rely on a DBMS at all.

10. Conclusion

This very brief introduction could not hope to provide a complete guide to designing or building databases. We have focused on conceptual rather than practical issues, especially those which in our opinion are easily overlooked during the hands-on process of learning one’s way around a database application. Our goal was not so much to allow the reader to be self-sufficient as to make it possible to think about the process of database design at the right level of abstraction. We hope that this will be of help to readers who can already create simple databases (e.g., with Microsoft Access or MySQL and PHP) but lacked the conceptual perspective developed here, as well as to those who need to discuss the design of their database with a programmer.

References

Bird, Steven and Gary Simons
  2003  Seven dimensions of portability for language documentation and description. Language 79: 557–582.
Chomsky, Noam and Morris Halle
  1968  The Sound Pattern of English. New York: Harper & Row.
Connolly, Thomas, Carolyn Begg and Anne Strachan
  1999  Database systems: A practical approach to design, implementation, and management. 2nd edition. Boston: Addison-Wesley.
Codd, Edgar F.
  1970  A relational model of data for large shared data banks. Communications of the ACM 13: 377–387.
Dunn, Michael, Angela Terrill, Ger Reesink, Robert A. Foley, and Stephen Levinson
  2005  Structural Phylogenetics and the Reconstruction of Ancient Language History. Science 309 (5743): 2072–2075.
Ferrara, Marisa and Steven Moran
  2004  Review of DBMS for Linguistic Purposes. Proceedings of E-MELD 2004. Available at http://www.linguistlist.org/emeld/workshop/2004/proceedings.html.
Gillam, Richard
  2003  Unicode Demystified. Boston: Addison-Wesley.
Musgrave, Simon
  2002  Taking Unicode to the (Microsoft) Office. Glot International 9/10: 316–319.
Van Halteren, Hans
  1997  Excursions into Syntactic Databases. Amsterdam: Rodopi.
Wichmann, Søren and David Kamholz
  to appear  A stability metric for typological features. Sprachtypologie und Universalienforschung 61(3): 251–262.

A typological database of personal and demonstrative pronouns

Heather Bliss and Elizabeth Ritter

1. Introduction

What is the inventory of morphosyntactic features available to the pronominal systems of the world’s languages? This is the question that we sought to answer in creating our typological database. Our database houses general language information, personal and demonstrative pronoun paradigms, and detailed summaries of the pronoun systems of 109 typologically diverse languages. The data is organized in such a way that it is highly searchable, and the database layout is accessible and user-friendly. As such, the pronoun database facilitates typological research as well as case studies of pronominal systems and their morphological composition.

Our database was created with limited financial resources, which restricted our access to technical expertise and resources. In particular, we were unable to hire technology consultants to construct or maintain our database. Instead, our project team was composed largely of linguistics undergraduate students, who had had little or no experience or training in database construction. Due to both budgetary constraints and our limited technical abilities, we chose to use a software application with a built-in user interface, FileMaker Pro, for our database, and this choice resulted in less flexibility than may have been ideal for our purposes. Again as a result of our limited technical capabilities, our web-hosted version of the database was auto-generated using a FileMaker Pro plug-in, and as such, was not as user-friendly as we would have hoped. Furthermore, due to changes in technology, the web version of the database is no longer current, and is no longer available online. In spite of these challenges, however, we succeeded in creating a typological database which enabled us to answer our original research questions. In this paper, we outline the main features of our database, with a view to providing a candid and thorough account of the strengths and weaknesses of our project.

This paper will proceed as follows: In §2, we discuss our research questions in detail, and explain why we considered a database to be the most effective way to address these questions. §3 details a number of issues related to the design of our database, ranging from the software considerations and overall relational structure to the organization of data fields and their presentation to database users. In §4, we turn from database design to database content, and here we look at our language sample, and some typological findings that our database gave rise to. §5 outlines the challenges we faced in building this database, and the effects that these challenges had on the final product. In §6, we look beyond the database itself to other benefits of our project. §7 is the conclusion.

2. The research questions

At the outset of our project, our research question was admittedly vague. We wanted to investigate the range of cross-linguistic variability in the morphosyntactic features that characterize personal pronoun paradigms. Personal pronouns are closed-class lexical items that are characterized by person distinctions (e.g. 1st, 2nd, 3rd). Our research question was focused on the types of person distinctions, as well as other distinctions, that can be made in personal pronoun systems of the world’s languages. Greenberg’s (1963) claim about the universality of person and number features provided a starting point:

(1) Greenberg’s (1963) Universal #42
    All languages have pronominal categories involving at least three persons and two numbers.

Greenberg’s Universal #42 speaks to the minimal requirements of a pronominal system. We wondered – what are the limits in person and number systems? If three persons and two numbers is the minimum, what is the maximum? Furthermore, how do person and number features interact? What kinds of gaps and syncretisms are common, rare and impossible? We predicted the range of variability to be relatively limited in person and number systems. However, person and number are not the only features available to pronoun systems. Again, we turn to Greenberg:

(2) Greenberg’s (1963) Universal #43
    If a language has gender categories in the noun, it has gender categories in the pronoun.

Greenberg’s Universal #43 identifies gender as a third pronominal feature, and suggests that, unlike person and number, gender features are optional for pronouns. The question that this raises is what other morphosyntactic features may characterize the pronominal systems of the world’s languages? And again, how do these different features interact in a language to generate unique pronominal forms?

With a view to collecting data for future projects, we also gathered information on demonstrative pronouns. For purposes of this project, our criteria for labelling a set of forms as demonstrative pronouns were that either the reference grammar presented them as such, or the pronouns in question exhibit proximity distinctions (e.g. proximal, distal, remote), and lack distinctions of person. This data would permit an exploration of the morphosyntactic relationship between 3rd person pronouns and demonstrative pronouns, which, in some languages, are closely related. In fact, the two classes of pronouns are not always clearly distinguished in the sources we consulted.

Although other classes of pronouns (logophoric, reflexive, etc.) are also predicted to interact with personal pronouns in interesting ways, we limited our study to personal and demonstrative pronouns only. These two classes of pronouns appear in almost all languages, unlike some other classes, such as logophors, which are more restricted. Furthermore, budgetary and time constraints restricted the scope of our project. Although we have not developed in detail the part of our study that looks at the relationship between personal and demonstrative pronouns, some preliminary findings are summarized in §4.3.2.

With these goals in mind, we began to compile a typological survey of pronouns, and their morphosyntactic features. Because we wanted to explore the range of morphological variability in pronoun systems, our data set included a large number of pronouns from a large number of languages. Not only were we interested in the pronominal forms themselves, but we were particularly interested in the features represented in these forms. Thus, the amount of data to be recorded for each pronoun was extensive, and for this reason, a database was the logical choice for the organization of this data. Not only was a database an effective way to answer our original research questions, but it also facilitated the preservation of this research material for use in future research projects.

3. Database design

In planning the database, a range of design issues had to be addressed, including the choice of software application, the overall structure of the data, the types of fields to be used for storing the data, as well as the user interfaces for accessing the data. This section goes through the details of each of these issues.

3.1. Database platform

The database was created using a software application called FileMaker Pro (version 5.5). A major advantage of using this type of software is that it does not require that database administrators know programming languages or code in order to effectively structure both the data and the user interface. With respect to the organization of data, fields in FileMaker Pro are relatively easy for the administrator to modify by means of option menus that are built right into the software. With respect to the structure of the user interface, once a database is constructed in FileMaker Pro, it is straightforward for users to input data with little room for error. Because of the accessibility of FileMaker Pro, linguistics students with little or no technical training could easily acquire the computer skills to work on the database. Issues related to student researchers will be further addressed in §6.

This is not to say, however, that FileMaker Pro is without its problems. Many of the challenges we have faced in building this database are directly related to our choice of software application. FileMaker Pro, because of its user-friendly, code-free platform, is somewhat restrictive in its application. Once our data was compiled, we found it quite difficult to set up layouts that allowed for pronoun paradigms to be displayed correctly. We also found that the web-hosted version of our FileMaker Pro database was significantly more limited than the desktop version, and due to changes in technology, is no longer available online today. Issues related to the challenges of working with FileMaker Pro will be further addressed in §5.

3.2. Relational structure

The pronoun database is a relational database with a one-to-many organizational structure. Each language in the database is associated with a unique identifier number and occupies one record in a parent file. Each record in the parent file is related to many records in one of two daughter files. The records in each daughter file are the pronouns themselves, with personal pronouns being stored in one file, and demonstrative pronouns in the other. As with the languages, each personal and demonstrative pronoun is associated with a unique identifier number and occupies one record in a daughter file. The relational structure of the database is depicted in (3) below:

(3) Relational database structure

    Language --1:N--> Personal Pronoun
    Language --1:N--> Demonstrative Pronoun
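For readers more familiar with SQL than with FileMaker Pro files, the structure in (3) corresponds roughly to the following sketch, written in Python with the standard sqlite3 module. It is an analogue only: the actual database was built in FileMaker Pro, and all table and column names here are assumptions made for illustration. The personal pronoun rows follow the feature categories named in this chapter; the demonstrative rows are invented English sample data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.executescript("""
    -- Parent: one record per language, with a unique identifier.
    CREATE TABLE Language (
        LanguageID INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    -- Daughters: one record per pronoun, each pointing back to its language.
    CREATE TABLE PersonalPronoun (
        PronounID  INTEGER PRIMARY KEY,
        LanguageID INTEGER NOT NULL REFERENCES Language(LanguageID),
        Form       TEXT NOT NULL,
        Person     TEXT,
        Number     TEXT,
        Gender     TEXT,
        CaseValue  TEXT,   -- "Case" itself is a reserved word in SQL
        Formality  TEXT
    );
    CREATE TABLE DemonstrativePronoun (
        PronounID  INTEGER PRIMARY KEY,
        LanguageID INTEGER NOT NULL REFERENCES Language(LanguageID),
        Form       TEXT NOT NULL,
        Proximity  TEXT,
        Number     TEXT
    );
""")
con.execute("INSERT INTO Language VALUES (1, 'English')")
con.executemany(
    "INSERT INTO PersonalPronoun "
    "(LanguageID, Form, Person, Number, Gender, CaseValue) "
    "VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "he", "3rd", "singular", "masculine", "nominative"),
     (1, "she", "3rd", "singular", "feminine", "nominative")])
con.executemany(
    "INSERT INTO DemonstrativePronoun (LanguageID, Form, Proximity, Number) "
    "VALUES (?, ?, ?, ?)",
    [(1, "this", "proximal", "singular"),
     (1, "that", "distal", "singular")])


def pronoun_counts(con):
    """Count the daughter records of each kind for every language."""
    return con.execute("""
        SELECT l.Name,
               (SELECT COUNT(*) FROM PersonalPronoun AS p
                 WHERE p.LanguageID = l.LanguageID),
               (SELECT COUNT(*) FROM DemonstrativePronoun AS d
                 WHERE d.LanguageID = l.LanguageID)
        FROM Language AS l
        ORDER BY l.Name
    """).fetchall()


print(pronoun_counts(con))
```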

In (3), a single parent record is related to many daughter records in two separate files. This relational structure reflects the nature of our data, in which a single language has many personal and demonstrative pronouns. This basic relational structure forms the foundation for the pronoun database, and determines the organization of the data fields and the user interfaces in the database.

3.3. Fields in the database

The parent file and two daughter files each have their own set of fields in which data on individual records (languages, personal pronouns and demonstrative pronouns) is stored.

3.3.1. Parent fields

The parent file contains text fields for general information about the language, calculation fields for summary information about the pronoun features, and comment fields for other kinds of relevant information. The text fields store information about a language’s alternate names, genetic affiliation, and countries where it is spoken. Data entered in these fields is standardized according to the 13th edition of the Ethnologue (www.ethnologue.com). A sample of the data in the text fields is shown in (4) for the language Balochi:

(4) General language information stored in text fields in the parent file
    [Screenshot: Language summary for Balochi, Western]

The calculation fields in the parent file compute a summary of the range of inflectional contrasts expressed in the personal and demonstrative pronouns for each language. The values in these fields are calculated for both personal and demonstrative pronouns together. As such, contrasts that were found in either the personal pronouns, or the demonstrative pronouns, or both, are included in the summary fields.1 In retrospect, this decision was not ideal, as the calculation fields provided a global summary for both paradigms that often did not accurately reflect the range of inflectional contrasts found within a single paradigm. A sample of the data in these fields is shown for the personal pronouns in the language Campa (Arawakan, Peru) below:

(5) Calculation fields in the parent file

1 Some inflectional categories are only found in either personal or demonstrative pronoun paradigms. Specifically, person and formality are found only in the personal pronouns, and proximity is found only in the demonstrative pronouns. Number, gender, and case, on the other hand, are found in both, and the calculations for these fields are cumulative, referencing both personal and demonstrative pronoun records.
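The difference between a cumulative summary and a per-paradigm summary can be shown with a small sketch in Python. It does not reproduce the FileMaker Pro calculation fields, whose definitions are not given here; the function names are hypothetical and the records are invented English sample data in which only the personal pronouns mark gender.

```python
# Invented sample records; None means the feature is not marked on the form.
personal = [
    {"Form": "he", "Person": "3rd", "Number": "singular", "Gender": "masculine"},
    {"Form": "she", "Person": "3rd", "Number": "singular", "Gender": "feminine"},
    {"Form": "they", "Person": "3rd", "Number": "plural", "Gender": None},
]
demonstrative = [
    {"Form": "this", "Proximity": "proximal", "Number": "singular"},
    {"Form": "these", "Proximity": "proximal", "Number": "plural"},
]

SHARED_FEATURES = ["Number", "Gender"]


def contrasts(records, feature):
    """Return the sorted set of values that a feature takes in these records."""
    return sorted({r[feature] for r in records if r.get(feature)})


def summary(records, features):
    """Summarize the inflectional contrasts found in one set of records."""
    return {feature: contrasts(records, feature) for feature in features}


# Cumulative summary over both paradigms, as in the design described above.
cumulative = summary(personal + demonstrative, SHARED_FEATURES)

# Per-paradigm summaries, which keep the two paradigms apart.
by_paradigm = {
    "personal": summary(personal, SHARED_FEATURES),
    "demonstrative": summary(demonstrative, SHARED_FEATURES),
}

print(cumulative)
print(by_paradigm)
```

In the cumulative summary the gender contrast appears as a property of the language as a whole, although in this sample the demonstratives show no gender contrast at all; the per-paradigm summaries make that difference visible.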

Also in the parent file are comment fields that accommodate large amounts of prose specifying information on various aspects of the data for each language. These fields are used for information thought to be important to the language profile, but for which we didn’t have a dedicated text or calculation field. Comments may include details on the transcription of pronominal forms, descriptions of typologically interesting characteristics of the pronoun paradigms, as well as keys to interpreting data in the pronoun files. The organization of prose within the comments is standard across language records, with respect to both the ordering of information, as well as the use of headings. This organization allows database users to quickly access the information stored in the comments fields for the different languages. We have found that the comments on typologically interesting or unusual characteristics are a particularly useful source of information for our research team.

3.3.2. Daughter fields

The fields in a given daughter file list the form of a single pronoun and its morphosyntactic feature specification. The form of the pronoun is stored in two fields, a text field and a container field with a jpeg image. The morphosyntactic features of the pronoun are listed in a second set of fields, one for each category (i.e. person, number, gender, case, and formality). These two types of data will be discussed in turn below.

3.3.2.1. Phonological forms

Although our primary interest is in morphosyntactic features, it was important to include the actual pronominal forms in the database for a number of reasons. First, we wanted to make it possible for anyone to verify the analysis in the database, or to develop alternate analyses of the pronoun paradigms. In assigning morphosyntactic feature values to the pronouns, we were analysing the data in a certain way, and we wanted to be certain that researchers using our database also had access to the raw data. Additionally, there are a number of interesting theoretical issues related to the pronominal forms themselves, including the question of whether the different features have distinct morpho-phonological content, and whether forms are suppletive or affixal. We included pronominal forms to facilitate research in these areas.

We were faced with a number of challenges in including actual pronominal forms in our database. One challenge was that of cross-linguistic uniformity versus individual language representation. Our data was collected from a wide variety of sources (largely reference grammars), and these sources use different orthographic conventions to represent the different languages’ pronoun forms. Some grammars employ the standard orthography of the language being described, other grammars employ an IPA (or near-IPA) transcription, and still other (mostly older) grammars employ other phonetic-based alphabets. In order to facilitate cross-linguistic comparison, we opted to present the pronouns in a standardized orthography, namely IPA, whenever possible.2 However, in the interest of representing forms as they are generally recognized in more commonly studied languages, we also included the standard orthographic forms for those languages that employ Roman-based orthographies. To avoid confusion, these languages each occupy two records in the database – one with IPA transcription, and one with orthographic forms. As a reviewer points out, it would have been simpler to have a single record with an extra field for orthographic forms. For more detail on the challenges we faced in including phonological forms in the database, see §5.2.1.

2 Our IPA transcription was based on the phonetic and phonological descriptions given in the grammar sources. Importantly, wherever our transcription diverged from that of the grammar, this was documented in the comments fields for that language.


3.3.2.2. Open vs. closed class features In addition to the text and graphical fields for the pronominal forms, a number of different fields in the daughter files house data on the pronominal features. Each feature category (i.e. person, number, gender, case and formality) occupies a different field in the database.3 These fields are filled in with a feature value (e.g. 1st person, plural number, or feminine gender) for each record, i.e. for each pronoun. For example, the English pronoun he would be coded as follows:

(6)  Form    Person    Number      Gender       Case
     he      3rd       singular    masculine    nominative
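As a rough illustration of the coding in (6), the sketch below models one daughter record as a Python dataclass with one attribute per field. The class name, attribute names, and the use of optional attributes are assumptions made for illustration; they are not taken from the FileMaker Pro implementation.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class PronounRecord:
    """One daughter record: a single pronoun with one field per feature category.

    Hypothetical layout, not the authors' actual file structure.
    """
    form: str
    person: str
    number: str
    gender: Optional[str] = None      # left empty when the pronoun has no value
    case: Optional[str] = None
    formality: Optional[str] = None


# The English pronoun "he", coded as in (6).
he = PronounRecord(form="he", person="3rd", number="singular",
                   gender="masculine", case="nominative")

# Each field and its value for this record, as a plain dictionary.
fields = asdict(he)
```

Keeping one attribute per feature category mirrors the one-field-per-category design described above, so a record with no formality value simply leaves that attribute empty.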

The data in (6) would occupy a single record in the daughter file, and each of the columns in (6) represents a single field, and its value for that particular record. Not surprisingly, we found that person and number behave quite differently cross-linguistically from gender, case, and formality. In particular, person and number features are essentially universal and closed class features: They occur in every language, exhibiting very limited variability. Gender, case, and formality, on the other hand, occur in a subset of languages and constitute open class features. In other words, they exhibit far greater variability in distribution and feature values. In §4, the distinction between open and closed class pronominal features is discussed in greater detail; here we focus on how this distinction influenced the design of our database. In the database, the fields for closed and open class features have a fundamentally different design. Closed class features (i.e. person and number) are pre-specified for all languages with a set number of values that can be selected for each pronoun record. For a database user entering data, this set of values is presented in the form of a drop-down menu. Open class features, on the other hand, do not have pre-specified values, but instead, the set of values for a particular feature is specified on a language by language basis. A database user entering data first lists the feature values for a particular language in the parent record, and then is presented with only these values (in a drop-down menu format) when entering forms in the daughter records. These different field structures in the database reflect the basic distinction between closed and open class features cross-linguistically. For details on how these different field structures were implemented using FileMaker Pro (version 5.5), see Bliss and Ritter (2001).

3 For purposes of succinctness, only personal pronoun features are discussed in this paper. For information on demonstrative pronoun features, see Bliss and Ritter (2001).

3.4. User interfaces

In the previous section, reference was made to how database users enter data into the database. In this section, we discuss design issues related to the presentation of data, and the user interfaces for the database. The completed database consisted of two versions: a desktop version and a web-based version. Although both contained the same data, they differed in terms of the user interfaces, with the desktop version having much greater functionality than the web-based version. In what follows, an overview is given of the two different versions of the database. 3.4.1. The desktop version In the desktop version of the database, users navigate through the data via a number of different layouts that each present different types of data. The first layout that is presented to users upon opening the database is a list of languages. A snapshot of this layout as it appears in the desktop database is given in (7) below:

(7) List of languages layout

All of the languages in the database are listed in this layout, and users can scroll through to view language names and genetic affiliations for the languages. The buttons on the right-hand side of this layout can be selected to view more information for each of the languages. By selecting the topmost button, the Language Summary button, users are taken to a layout that includes the text fields shown in (4), the calculation fields in (5), as well as comment fields. In short, this layout displays all of the parent fields in the database, one record at a time. The next two buttons take users to layouts that display the personal and demonstrative pronoun paradigms. Samples of these layouts are given in (8) and (9):

(8) Personal pronoun paradigm layout

(9) Demonstrative pronoun paradigm layout

In (8), the personal pronouns in Chinook (Penutian, United States) are shown as they appear in the database. Following typical conventions, the personal pronouns are displayed in a paradigm table in which each row displays all of the pronouns sharing the same person, case, gender, and formality features. Forms in a single row differ only in terms of their number features, with singular, dual, plural (and in some languages, trial or paucal) forms being displayed in different columns in the same row. The demonstrative pronoun layout shown in (9) differs from the personal pronoun layout in (8), in that each row in this layout represents a single demonstrative pronoun. We discuss the paradigm layout in (8) in greater detail in §5.2.2 below.

The final layout that is accessible from the List of Languages layout is one which compares 3rd person pronouns with demonstrative pronouns. A sample of this layout is shown for the language Paipai (Hokan, Mexico) in (10) below. On the left-hand side, 3rd person pronouns are listed with all of their marked features, and on the right-hand side, demonstrative pronouns are listed. The rightmost column in the demonstrative pronoun table displays a field entitled “Match?” This field looks for demonstrative pronouns that are identical in phonological form to 3rd person pronouns. The relationship between demonstrative and 3rd person pronouns is discussed in greater detail in §4.3.2.

(10) 3rd person/demonstrative pronouns comparison layout
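The comparison performed by the “Match?” field can be approximated as a membership test on phonological forms. The following is a minimal sketch under that assumption; the function name is hypothetical, and the forms are invented placeholders rather than data from the database.

```python
def match_flags(third_person_forms, demonstrative_forms):
    """Map each demonstrative form to True if it is identical in form
    to some 3rd person pronoun, and to False otherwise."""
    third = set(third_person_forms)
    return {form: form in third for form in demonstrative_forms}


# Placeholder forms, invented for illustration only.
flags = match_flags(["ta", "tan"], ["ta", "ki", "tan", "so"])
```

Exact string identity is the simplest reading of "identical in phonological form"; a comparison over transcriptions from different sources would presumably need the forms to be normalized first.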

3.4.2. The web-hosted version The pronoun database was available online during the years 2000 through 2005, with somewhat restricted functionality. Although the database is no longer hosted on the Internet, the process of developing an online version influenced many of our design choices, and affected the overall appearance and functionality of the database. Web-hosting itself was a relatively straightforward process. FileMaker Pro (version 5.5) came equipped with a web hosting plug-in component, which, when activated, allowed Internet users to access a server that hosted our database online. Users accessing the database online were presented with a homepage that gave an introduction to the project, had links to research papers associated with the database, and most importantly, had links to the database itself. There were, in fact, two different online versions of the pronoun database. Due to limitations in the FileMaker Pro web hosting plug-in (see §5.2.3 for details), the database appeared quite differently in Internet Explorer than in other browsers widely in use at the time. These two different online versions are described in turn below. When viewed in Internet Explorer, the database appeared almost exactly as it does in the desktop version. Upon entering this version of the database, users saw the List of Languages layout, as in the desktop version. The main difference in this version, however, was that only one record was shown at a time, and users had to use navigation buttons on the right hand side of the screen to view other records. A snapshot of this layout is shown in (11).

(11) List of languages layout as viewed in Internet Explorer

To see a true “list” of languages, users were able to view the records in a table, again by way of navigation buttons on the right hand side of the screen. The table view is shown in (12) below.


(12) Table view of list of languages layout as viewed in Internet Explorer

From these two layouts, users were able to access all of the other layouts described in §3.4.1 for the desktop version. When viewed in browsers other than Internet Explorer (such as the then-popular, but now obsolete, Netscape Navigator), the database looked quite different. A limitation of the web plug-in in FileMaker Pro (version 5.5) was that it had reduced functionality with most browsers other than Internet Explorer. In particular, the buttons that allowed users to access different layouts were not operative, and as such, different layouts could not be viewed. The navigation buttons that were auto-generated in FileMaker Pro allowed users to access three layouts: a table view of the data, a form view of the data, and a search layout. We configured this version of the database so that in these three layouts, all of the data could be accessed. The table view was identical to that in the Internet Explorer version, shown in (12) above. When users opened this version of the database, they were taken to the table view, and from there, they could see the language names, and their genetic and areal affiliations for all languages. All of the language names were hyperlinked to the form view for that language, and in the form view, users could view the personal and demonstrative pronoun paradigms, as well as all of the data that was stored in the Language Summary layout (basic language information, summaries of morphosyntactic features, and comments). Although there were subtle differences in their appearance, the search layouts in both the desktop and online versions of the database had the same fields and functionality. The search layout is discussed in detail in the following section.

3.4.3. Searching the pronoun database An important component of our database is its searchability. Our search function is localized to a search layout that can be accessed from the List of Languages layout in either the desktop or web version. In both versions of the database, the search layout has a similar appearance and identical functionality. The desktop version of the search layout is given in (13). As seen in (13), our search layout is not open-ended, but rather highly specific, and the main purpose of this design is to restrict searches to the available options. The search layout displays a number of fields for which researchers can select values from a series of drop-down menus. For example, researchers cannot type in any language name in the ‘Language Name’ field, but instead must select a language name from the list of languages in the database. Searches can be performed for any field or combination of fields in the database. Researchers can search general language information, such as language name, genetic affiliation and/or countries where the language is spoken, as well as pronoun-specific properties, such as the presence or absence of certain features or feature categories in the paradigms.
For example, researchers can search for all languages with formality, or all languages with formality and dual number, or all Indo-European languages with formality. The search layout, although not open-ended, is flexible, allowing researchers to search for a wide range of cross-linguistic properties.
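The combined searches described above amount to filtering language records on several fields at once. The sketch below shows one way this could look, assuming each language is summarized as a dictionary; the keys, the function name, and the three toy records are hypothetical and do not reproduce any entry in the database.

```python
# Toy language summaries (invented placeholders, not database entries).
LANGUAGES = [
    {"name": "Language A", "family": "Indo-European",
     "features": {"formality"}, "numbers": {"singular", "plural"}},
    {"name": "Language B", "family": "Indo-European",
     "features": set(), "numbers": {"singular", "plural"}},
    {"name": "Language C", "family": "Other family",
     "features": {"formality"}, "numbers": {"singular", "dual", "plural"}},
]


def search(records, family=None, features=(), numbers=()):
    """Return the names of languages that satisfy every supplied criterion."""
    hits = []
    for rec in records:
        if family is not None and rec["family"] != family:
            continue
        if not set(features) <= rec["features"]:
            continue
        if not set(numbers) <= rec["numbers"]:
            continue
        hits.append(rec["name"])
    return hits


with_formality = search(LANGUAGES, features=["formality"])
formality_and_dual = search(LANGUAGES, features=["formality"], numbers=["dual"])
ie_with_formality = search(LANGUAGES, family="Indo-European", features=["formality"])
```

Criteria that are left unspecified are simply ignored, which corresponds to leaving a drop-down menu in the search layout unselected.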


(13) Search layout in the desktop database

3.5. Summary In this section, we have given an overview of the various user interfaces for our database. The database was designed to be available in both a desktop and an online version, and both versions contain various layouts of the data, including a search layout. These interfaces are of crucial importance to the database, as they are the lens through which the data is viewed.

4. The completed product: a database of pronouns

In this section, we turn from the database structure to the data itself, and discuss the types of information in the database, as well as some of our findings. We begin with a brief discussion of the language sample in §4.1, based on the general information gathered about each language. In §4.2, we present an overview of the features that characterize the personal pronoun paradigms in the languages of the database, based on the summaries of inflectional contrasts calculated for each language. The comments on noteworthy properties document the fact that many of the pronominal systems included in this database are typologically interesting for a variety of reasons. In §4.3, we demonstrate the usefulness of this aspect of the database with a discussion of two kinds of rare systems.

4.1. Language sample Our language sample was determined to a large extent by the reference grammars that were available to us at the time of data collection. As a result, our sample is not genetically or areally balanced. Nevertheless, we chose a sampling method that enabled us to include as many genetically and areally diverse languages as possible, and as a result, our sample of 109 languages represents 52 different language families (including four isolates), from five continents and 80 different countries. For a list of all of the languages in the database, along with their genetic and areal affiliations, see Appendix A.

4.2. Overview of pronominal features: typological tendencies Personal pronouns in the languages of our database are specified for person, number, gender, case and/or formality. By and large, our database findings conform to well-known linguistic universals on morphosyntactic features, such as those in Greenberg (1963), Blake (1994), Corbett (1991, 2000) and Siewierska (2004). As mentioned in §3.3.2.2, we found that person and number had extremely limited variability, with only a small closed set of universal feature values. Gender, case, and formality, on the other hand, had a wider range of variability, with a non-universal open set of feature values. Each of these features will be discussed in turn.


4.2.1. Person PERSON refers to pronominal features that are based on speech-act role. Most of the languages in our database make distinctions for 1st (speaker), 2nd (addressee), and 3rd (other) persons in the pronoun paradigm. However, this is not the only person system found in our database. Here we describe the two types of exceptions we found. First, although, as stated in Greenberg (1963), all languages have pronominal categories for 1st, 2nd, and 3rd persons, not all languages have separate 3rd person personal pronouns. Instead, a small number of our database languages employ demonstrative pronouns for 3rd person. Examples include Basque (Basque, Spain) and Mongolian (Altaic, Mongolia). The existence of such languages provides support for the view that only 1st and 2nd person pronouns are in fact specified for this feature (cf. Benveniste 1971). Second, there are languages that make additional person distinctions. A significant subset of languages, such as Cubeo (Tucanoan, Colombia) or Telugu (Dravidian, India), have inclusive pronouns, whose referents include the speaker and one or more addressees.4 A small handful of languages, including Yup’ik (Eskimo-Aleut, United States), have distinct 4th person pronouns. Like 3rd person, 4th person refers to an entity that is not a participant in the speech act, and in fact 4th person is often treated as a subclass of 3rd person (e.g. Reed et al. 1977; Jacobson 1984). Functionally, the use of 4th person makes it possible to distinguish among individuals that are not participants in the speech act. For example, in Yup’ik, 4th person is used to refer back to an individual already mentioned in the discourse, while 3rd person introduces a new individual. (See Jacobson 1995 for detailed discussion of conditions on use of 4th person in Central Yup’ik.) In summary, the set of person features in our database, and cross-linguistically, is restricted, and consists of the following:5

4 Inclusive pronouns are generally analyzed as a type of 1st person, cf. Zwicky (1977) and McGinnis (2005) for discussion. On this view they contrast with 1st person exclusive pronouns, which refer to the speaker, or the speaker and others, but crucially do not refer to the addressee. However, since an inclusive pronoun refers to both speaker and addressee, it could alternatively be analyzed as a type of 2nd person, cf. Déchaine (1999), Harley and Ritter (2002). In order to avoid engaging in this debate, we list such pronouns as inclusive.

5 For a breakdown of the types of person systems in our database, see Appendix B.

(11) Universal Person Distinctions
– 1st (speaker)
– inclusive (speaker and addressee)
– 2nd (addressee)
– 3rd (other)
– 4th (another)

4.2.2. Number Like person, number features are restricted cross-linguistically to a small set of values:

(12) Universal Number Distinctions
– singular (1)
– dual (2)
– trial (3)
– paucal (a few)
– plural (many)
– general (any)

Not only is the number of distinctions restricted, but the combinatorial options for these distinctions are also highly restricted cross-linguistically.6 The overwhelming majority of languages in our database make only two number distinctions in their pronoun paradigms: singular and plural. Those languages that make more than two distinctions all conform to Greenberg’s Universal #34:

(13) Greenberg’s (1963) Universal #34
No language has a trial number unless it has a dual number. No language has a dual unless it has a plural.

Furthermore, we found that no language distinguished both trial and paucal number, and this is consistent with Corbett’s (2000) claim about the mutual exclusivity of these two distinctions. Note, however, that our observations about trial and paucal number are based on only three languages in our sample that have these distinctions (Tok Pisin (English-based Creole, Papua New Guinea), which has trial number, and Fijian (Austronesian, Fiji) and Yimas (Sepik-Ramu, Papua New Guinea), which have a paucal). Finally, two languages in our sample, namely Kiowa (Kiowa-Tanoan, United States) and Campa (Arawakan, Peru), do not make any number distinctions in their pronominal paradigms. Following Corbett (2000), we refer to the absence of number distinctions as GENERAL NUMBER.

6 See Appendix C for a summary of the number systems represented in our database.
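The two implications in Universal #34, and the observation that trial and paucal do not co-occur, can be stated as simple checks over a set of number values. The sketch below assumes a number system is represented as a set of value names drawn from (12); the function names and the schematic systems used to exercise them are hypothetical and are not the systems of any particular language in the sample.

```python
def violates_universal_34(numbers):
    """True if a number system has a trial without a dual,
    or a dual without a plural."""
    numbers = set(numbers)
    if "trial" in numbers and "dual" not in numbers:
        return True
    if "dual" in numbers and "plural" not in numbers:
        return True
    return False


def has_trial_and_paucal(numbers):
    """True if a number system distinguishes both trial and paucal."""
    return {"trial", "paucal"} <= set(numbers)


# Schematic number systems, used only to exercise the checks.
two_way = {"singular", "plural"}
with_trial = {"singular", "dual", "trial", "plural"}
trial_without_dual = {"singular", "trial", "plural"}

results = [violates_universal_34(s)
           for s in (two_way, with_trial, trial_without_dual)]
```

Running checks of this kind over every language record would be one way to confirm mechanically that no system in a sample departs from the universal.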

4.2.3. Gender Unlike person and number, gender features constitute an open class, with the type and number of gender distinctions varying widely from language to language. In the database, any open-ended distinctions based on fixed properties of the referent are classified as gender distinctions. Typical gender features (i.e. masculine, feminine, etc.) are not the only features that refer to fixed referential properties cross-linguistically. A small number of languages in our database make distinctions of animacy, in addition to, or instead of gender.7 Animacy-based gender classes in the languages of our database include animate, inanimate, human, supernatural, neuter, and vegetable. It is an open question whether animacy and gender can be collapsed into a single category, or are in fact morphosyntactically distinct types of features (cf. Bliss 2005; Hanson 2003; Harley and Ritter 2002).

7 For a breakdown of the types of gender systems in our database, see Appendix D. See §3.3.2.2 for discussion of coding open versus closed class features.

4.2.4. Case For purposes of our database, CASE is defined as a pronominal distinction expressing grammatical and other dependency relations within the clause. Like gender, case is an open class feature, and the set of case distinctions varies widely from language to language. However, we found that the case features in our database can be divided into three categories. Grammatical cases, such as nominative, ergative, or genitive, are found in all languages with case distinctions. Locative cases, such as ablative, elative, or inessive, are cross-linguistically less common than grammatical cases. Finally, a third category for all remaining cases, including benefactive, instrumental, or vocative, is much less common than the other types of cases found in the database. These findings are consistent with the case hierarchy posited by Blake (1994), in which grammatical cases outrank local cases, which in turn outrank other types of cases.8

4.2.5. Formality At the time when we first developed our database, there was little typological work on the morphosyntactic feature formality,9 and much of our understanding about how formality systems operate was based on formality systems found in Indo-European languages, in which there is a binary formality distinction in the 2nd person singular only. However, we found that the range of formality systems is much broader and more complex than we had originally assumed.10 Specifically, we found that some languages also make distinctions of formality in the 1st and/or 3rd persons, and that no language in our sample makes formality distinctions in the inclusive. In addition, in two of the languages in the database (Mixteco-Chalcatongo and Zapoteco (both Oto-Manguean, Mexico)), formality intersects with gender in the 3rd person. Finally, some languages make more than two distinctions of formality in their pronoun systems. These additional distinctions typically encode age, gender, occupation, and other socially constructed properties. Two notable examples of languages in our database with multiple formality distinctions in all persons are Thai (Daic, Thailand) and Vietnamese (Austro-Asiatic, Viet Nam). Pronominal systems in these two languages are described as formality-driven systems, and for this reason Campbell (1969) suggests that these languages have “noun substitutes” rather than personal pronouns. The representation of formality in our database proved to be a rather formidable challenge, notably because formality is a relative notion, and descriptive characterizations (used frequently in reference grammars) such as formal, polite, or casual, for example, do not capture this fact, or the similarities between different languages that have relative formality distinctions. 
As a solution to this problem, we adopted a convention of quantifying formality distinctions, with formal pronouns being coded with a positive value, informal pronouns being coded with a negative value, and neutral or unmarked pronouns coded as zero. Higher integers represented higher (or lower) degrees of formality, depending on the polarity of the integer. This allowed us to systematically represent formality in the pronoun paradigms. We also noted the characterization given in the reference grammar of each level of formality.11 This practice allowed us to compare formality systems cross-linguistically, while recognizing the fact that formality is a relative notion, not based on a fixed property of the referent. For example, the table in (14) shows the formality systems of four different languages. Each has a formal pronoun (Pos1) which contrasts with a neutral form (0) and at least one other degree of (in)formality. Though the characterization of the formal pronoun differs from language to language, the position it occupies in the system is essentially the same.

(14) Variation in formality systems

        Nama        Vietnamese    Telugu      Fijian
Neg1    familiar    abrupt        informal    —
0       (basic, neutral, or unmarked)
Pos1    polite      superior      formal      avoided relation
Pos2    respectful / very polite / special respect

8 See Appendix E for a summary of the case systems found in our database.

9 Notable exceptions include Brown and Levinson (1978, 1987) and Levinson (1983).

10 For a summary of the types of formality systems found in our database, see Appendix F.
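The integer coding of formality can be illustrated with a short sketch. It assumes that each level is stored as a signed integer alongside the grammar's own label; the function names and the dictionary layout are hypothetical, and only the Neg1 and Pos1 labels from (14) are used.

```python
def code_label(level):
    """Render a signed formality level in the Neg/0/Pos style used in (14)."""
    if level == 0:
        return "0"
    prefix = "Pos" if level > 0 else "Neg"
    return f"{prefix}{abs(level)}"


# Grammar-specific labels per level, taken from the Neg1 and Pos1 rows of (14).
GRAMMAR_LABELS = {
    "Nama": {-1: "familiar", 1: "polite"},
    "Vietnamese": {-1: "abrupt", 1: "superior"},
    "Telugu": {-1: "informal", 1: "formal"},
    "Fijian": {1: "avoided relation"},
}


def labels_at(level):
    """Collect, for one formality level, the label each grammar uses for it."""
    return {lang: labels[level]
            for lang, labels in GRAMMAR_LABELS.items()
            if level in labels}


pos1 = labels_at(1)     # same position in every system, different labels
neg1 = labels_at(-1)    # Fijian has no Neg1 level
```

Separating the integer position from the descriptive label is what lets systems be compared by position while each grammar's own terminology is retained.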

For further details on how formality is encoded in the database, see Bliss and Ritter (2001). 4.3. Some typologically interesting cases In this section, we turn from the typical to the exceptional, as we consider two 12 examples of rare pronoun systems which not only merit further in-depth investigation, but also raise interesting typological questions.

11 The descriptive characterization for each level of formality for a given language was listed in the comments field (under the heading ‘Formality’) of the language file.

12 For reasons of succinctness, we limit our discussion of typologically interesting cases to only two. For further examples, see Ritter and Bliss (2005).

4.3.1. Inclusive “singular” INCLUSIVE refers to a subtype of 1st person whose referent necessarily includes both the speaker and the addressee. In our database, approximately one-third of the languages (39 languages, or 35%) have inclusive pronouns. Inclusive, by its very nature, refers to a group of at least two persons: the speaker and the addressee. Consequently, inclusive pronouns most often have non-singular number, and in fact, in 29 of the 39 languages with inclusives, only non-singular inclusives exist. However, in our database there are a small number of languages with inclusive “singular” pronouns. Examples of these languages include Kalihna (Carib, Guyana), Ngandi (Australian, Australia), and Yaouré (Niger-Congo, Côte d’Ivoire). In these languages, inclusive “singular” pronouns refer to the speaker and exactly one addressee. Inclusive plural forms take the same plural morphology as other pronouns in the paradigm, and refer to the speaker plus two or more addressees. One language in our database, Dakota (Siouan, Canada), has only an inclusive singular form, and no inclusive plural. Riggs (1973) refers to this inclusive as a dual form, but as the language has no other dual pronouns, we elected to analyze it as an inclusive singular. The existence of inclusive singular pronouns in languages like Kalihna or Dakota raises a number of interesting typological questions regarding the relationship between person and number in inclusives and the morphological realization of clusivity cross-linguistically. 4.3.2. Demonstratives used as 3rd person A second typologically marked pronoun system is one in which there are no distinct 3rd person personal pronouns. These types of systems use demonstrative pronouns for this purpose. In our database, there are seven 13 languages of this type, including Basque (Basque, Spain), Kiowa (Kiowa-Tanoan, United States), Telugu (Dravidian, India), and Yimas (Sepik-Ramu, Papua New Guinea).
The existence of these pronoun systems raises interesting typological questions regarding the distinction between 3rd person personal pronouns and demonstrative pronouns. In most languages (where these two types of pronouns are distinct), demonstrative pronouns are coded for additional features that do not play a role in personal pronoun systems, such as proximity to the speech act or other deictic properties. In languages like Basque and Kiowa, for example, these types of distinctions are extended to the 3rd person personal pronouns. In his grammar of Yimas, Foley (1991: 114) explains that “[t]he choice between proximal, distal or remote to refer to a 3rd person most often reflects the “centrality of the participant in the discourse, or the speaker’s empathy towards it”. Foley goes on to explain that in Yimas, proximal forms are used for 3rd person if the referent is immediately involved in the speech act, or if the speaker has good feelings toward the referent. Remote forms are employed if the referent is of secondary interest or if the speaker has negative feelings towards the referent. It is an open question whether this characterization of Yimas can be extended to other languages that use demonstratives as 3rd person pronouns. Interestingly, the description of proximity distinctions in the 3rd person in Yimas is quite similar to descriptions of obviation in languages like Kutenai (isolate, Canada), Navaho (Na-Dene, United States), and Ojibwa (Algic, Canada and United States). The relationship between distinctions of proximity and obviation would seem to merit investigation in future research.

13 This number does not include Djingili (Australian, Australia). In Djingili demonstratives are used for nominative and ergative cases only as there exists a dedicated 3rd person objective pronoun.

4.4. Summary In this section, we have looked at the range of data that is housed in our database. Our sample of 109 genetically and areally diverse languages manifests a number of typological tendencies, and at the same time includes many typologically unusual phenomena. Both the regularities and the exceptions are important aspects of the investigation into pronominal systems and their morphosyntactic features. This discussion highlighted the fact that this research tool not only enabled us to answer our original question, viz.
what are the morphosyntactic features found in the personal pronominal systems of the world’s languages, but also provides the empirical foundation for a range of other research projects, including both broad typological studies and narrow case studies of particular languages. 5. Challenges In this section, we address two types of challenges we faced: limitations in skills and resources, and limitations in software capabilities. These will each be discussed in turn below.

102 Heather Bliss and Elizabeth Ritter 5.1. Limited technical skills and resources As mentioned above, our database project operated under a somewhat restrictive budget, and as a result, we were unable to hire information technology consultants to design, develop or maintain our database. Our project team consisted of a linguistics professor and linguistics undergraduate and graduate students, and none of our team members had any significant training or expertise in database construction. Naturally, this lack of technical expertise posed a very real challenge to us. After consulting with knowledgeable linguists and a computer specialist, we made the decision to use a database management system equipped with a form-based user interface, rather than design a user interface uniquely tailored to our project. This decision was entirely influenced by our collective inexperience in database design, and posed significant limitations on our database capabilities (as will be discussed in the following section). By and large, the pronoun database is the result of a lengthy process of trial and error, in which our project team worked to find innovative solutions for the technological hurdles we faced. The end result is an effective tool which not only enabled us to reach our immediate research goals, but also has the capacity to contribute to a broad range of future research projects. 5.2. Software limitations Our database was created using a software application called FileMaker Pro (version 5.5). As mentioned, our decision to use this application instead of constructing a unique user interface entailed that the pronoun database was less flexible than was required by the data. As with any software program there is always the risk that it will be redesigned or taken off the market, which would render our database obsolete. 
In this section, we detail three problems we encountered as a result of working in FileMaker Pro, as well as an overview of the solutions we created for these problems. 5.2.1. Pronominal forms An important feature of our database is that it includes the actual phonological forms for each pronoun, and not just the morphosyntactic features affiliated with them. Including the pronominal forms was not as simple a task as we originally thought, and this was largely due to the accessibility


of fonts at the time when the database was created, and their compatibility with FileMaker Pro software.14 As mentioned, only Roman-based orthographies were included in the database because the fonts required to represent the characters in other orthographies (such as Cyrillic or those employed by many Asian languages) were unavailable in the version of FileMaker Pro that we were using. In addition to these orthographical forms, IPA forms were included for all pronouns in all languages in the database. The web-hosting component of FileMaker Pro only supported two fonts: Times New Roman and Arial, and as a result, many of the IPA characters 15 could not be properly displayed on the web version of our database. Our solution to this problem was to create graphical versions of each of the pronominal forms in the database, and to display only the graphical forms in the online database. Adapting a technique outlined in Baron and Peck (2001), we copied and pasted the text forms into a paint program called Canvas, v. 2.1, and then pasted them back into a container field in FileMaker Pro. In the desktop version of the database, these graphics are stored as PICT files, the default graphic format for Canvas 2.1. In the web version, these PICT files are converted to temporary jpeg images when the graphic data is requested by the web browser (Schwartz 2000: 520). This solution was successful in that online users were able to view all of the pronominal forms correctly. However, due to the large number of pronoun forms in the database (approximately 6000 total), the text-to-graphic conversion was a tedious task for our project team to undertake, and more importantly, it significantly increased the size of our database, which in turn increased the upload time of the online version.

5.2.2. Paradigms In §3.4.1, we showed the paradigm layouts for the personal and demonstrative pronouns. These two layouts differed considerably, with the personal pronoun paradigm layout appearing in a more traditional paradigm format. In this layout, each row in the paradigm displays all of the pronouns sharing the same person, case, gender, and formality features. Forms in a single 14

15

Presumably, continued development of Unicode standards would alleviate this type of problem today. We chose to use the font SIL Doulos IPA for the IPA-transcribed pronominal forms. This font is not available in the web component of FileMaker Pro.

104 Heather Bliss and Elizabeth Ritter row differ only in terms of their number features, with singular, dual, plural (and in some languages, trial or paucal) forms being displayed in different columns in the same row. Displaying the personal pronouns in a paradigm format was a considerable challenge (and for this reason, a paradigm layout has yet to be implemented for the demonstrative pronouns). Each pronoun occupies a single record in the database, and it is not possible in FileMaker to create a layout in which multiple records are displayed in a single row. Our solution to this problem is complex and the precise details are beyond the scope of this paper. In short, configuring the paradigm layout involved re-coding the data to have multiple forms occupy a single record in a distinct daughter field used exclusively for paradigm displays. Although this solution has increased the size of our database considerably, it allows us to display and view the pronominal paradigms in a way that is easily analyzed by researchers working on pronouns and morphosyntactic features. 5.2.3. Web-Hosting As mentioned in §3.4.2, our database was available online for a number of years. The database was made accessible online via a FileMaker Pro web hosting plug-in component, which could be activated in order to allow internet users to access our database. Given our limited technical skill set, this was a relatively straightforward way to host our database online, as it did not require the use of HTML, Java or any other programming codes to create a web-based version of the database. However, because FileMaker Pro’s web hosting component came equipped with a built-in user interface, the way in which the database was displayed online was also pre-configured, and could not be manipulated. Furthermore, the web hosting plug-in encountered inexplicable problems that affected how well the online database operated. Each of these problems will be discussed in turn. 
In §3.4.2, we explained that the database appeared quite different depending on whether it was viewed with Internet Explorer or with other web browsers, such as the now obsolete Netscape Navigator. The Internet Explorer version was quite similar in appearance to the desktop version, and permitted users to access each of the layouts available in the desktop version. The other online version, on the other hand, was highly restricted in its functionality. Only two layouts were available, and as such, the data was not organized in a particularly user-friendly fashion in that version.

Although the data in the Internet Explorer version was much more accessible than in the other online version, the Explorer version was not without its problems. It was mentioned in §3.4.2 that there were navigation buttons at the right-hand side of the screen for viewing the database in different ways (i.e., table view versus form view, etc.). These navigation buttons were auto-generated in FileMaker Pro, and they appeared in addition to the navigation buttons we created for accessing different layouts. Having two series of navigation buttons was not only redundant but confusing.

With respect to the problems we experienced with the FileMaker Pro web hosting plug-in, it is as yet unclear to us whether the software is in some way defective, or whether the size or complexity of our database somehow exceeded the software’s capabilities. Either way, hosting the database online using this plug-in led to some significant problems. For instance, the database took an exceptionally long time to upload, and to switch between layouts and records. More importantly, the database crashed often, and needed to be re-booted quite regularly. These problems appear to be rooted in our choice of software application, and eventually led to the decision to disable the online version.

5.3. Summary

In this section, we have looked at the major challenges we encountered in creating the pronoun database. Our project team was not composed of people with training and expertise in computerized database design and maintenance, and as a result, we elected to use FileMaker Pro for the implementation of our database. This choice inevitably led to a less flexible user interface, but with some innovation, we found ways to work around the limitations of the FileMaker Pro program to meet the needs of our database. In the next section, we look at the flip-side of these issues, as we discuss the research and training benefits to the members of our project team arising from their work on the pronoun database.

6. Benefits

In this section, we address the benefits of this database for both research and training. These will each be discussed in turn below.

6.1. Research opportunities

As noted above, this database was constructed for a specific research project, whose goal was to develop a geometry of morphosyntactic features modeled on the phonological feature geometries of Clements (1985) and Sagey (1986). We soon realized that our model needed to be restricted to closed class categories, i.e. person and number features. The most complete version of our proposal is published in Harley and Ritter (2002). More recent projects have explored the role of animacy in gender and person features, resulting in revisions to our original proposal (Hanson 2003; Ritter 2005).

However, the database houses a wealth of accessible and organized information, and it was originally our intention to continue to develop the database for use in future projects. In particular we wanted to provide the empirical foundation for two further typological projects: (a) investigation into the properties of open class features, i.e. gender, case and formality, and (b) development of a model of demonstratives and the morphosyntactic relationship between personal and demonstrative pronouns. Although our current research interests have shifted, and consequently we have stopped development of the database, we have preserved the information gathered in two formats for future use: a soft copy of the database itself, stored on a desktop computer and on CD, and a paper copy of all pronoun paradigms and language summary information.[16]

The database fields that describe noteworthy properties were included with a view to identifying properties of specific languages that merit further investigation, such as those listed in §4.3. Investigation into the morphosyntactic properties of such languages enables us to better understand the contribution of language-specific constraints to the pronominal system, and to identify the limits of the universally determined options.
For example, a closer look at the grammar of Thai (Daic, Thailand) and/or Vietnamese (Austro-Asiatic, Viet Nam) would enable us to establish whether the forms listed in our database are indeed personal pronouns, or whether they are in fact nouns, which would in turn raise such fundamental questions as what a personal pronoun is and whether all languages have personal pronouns.

[16] At the time of revising this chapter, FileMaker is still commercially available, and it should be possible to upgrade the database to the current version of the software application (version 9). However, due to resource limitations, we have no immediate plans to do so.

The database contains information on pronominal features that have not received much attention in the literature, such as the relative features of formality and proximity. Questions such as how to represent relative feature systems, and how these relate to better studied categories of features, are as yet unanswered, but the database provides a point of departure for research. For example, in French the formal 2nd person singular vous is homophonous with the 2nd person plural, while in German the formal 2nd person singular Sie is identical to the 3rd person plural. Functionally, these are augmentation or distancing strategies. Are such strategies limited to languages with only two degrees of formality, or to languages with formality in 2nd person only?

Finally, the decision to include the pronouns themselves in the database means that it could also be used for phonological and morpho-phonological research. We included descriptive notes on morphological properties and on the phonological inventory of each language, which might well be used for a variety of projects outside our own interests and expertise.

6.2. Training for student researchers

As detailed in §5, our project team was composed largely of undergraduate linguistics students, who had little or no previous training in database design, construction, or maintenance. Working on the pronoun database gave the student members of our project team the opportunity to acquire these skills, as well as to develop other critical research skills. Four of the five undergraduate students decided to pursue graduate training in linguistics, and their positive experience on the project no doubt contributed to that decision, as is evident from the fact that they contributed to the substantive research using this tool, and that their own analyses were informed by our model (Bliss 2005; Bliss and Jesney 2004; Haberl and Hanson 2000; Hanson 2000; Hanson 2003; Hanson, Harley and Ritter 2000; Mills 2000). This experience not only contributed to their subsequent intellectual formation, but also enabled a number of them to gain employment in the private sector as computational linguists.

6.3. Summary

In short, this database proved to be an effective research tool for our original research project on the representation of pronominal features, but this has by no means exhausted its usefulness. Rather, this database was conceived of as a tool for long-term use by a variety of researchers. Unfortunately, its applicability to other projects will be limited by external considerations of accessibility and compatibility.

7. Conclusion

Our experience demonstrates the feasibility and effectiveness of computerized databases for linguistic research, even in the face of significant limitations in expertise and in technical and financial resources. We could not have completed our original research project without the database, and the experience gained has proven important for subsequent projects. In retrospect, the greatest limitation resulted from our choice of software application. Given the success of our efforts, we now believe that we perhaps underestimated our own abilities, and that our team could have created a customized database which would have better served our needs in both the immediate and the long term.

Appendix A: Languages in the database

Language | Language Family | Places Spoken [17]
Aceh | Austronesian | Indonesia; Sumatra
Acholi | Nilo-Saharan | Sudan; Uganda
Ainu | Language Isolate | Japan; Russia (Asia)
Albanian | Indo-European | Albania
Arabic, Gulf Spoken | Afro-Asiatic | Iran; Kuwait; Oman; Saudi Arabia; United Arab Emirates
Arapesh, Bumbita | Torricelli | Papua, New Guinea
Awtuw | Sepik-Ramu | Papua, New Guinea
Balochi, Western | Indo-European | Afghanistan; Iran; Pakistan
Bandjalang | Australian | Australia
Basque | Basque | Spain
Berik | Trans-New Guinea | Indonesia, Irian Jaya
Brahui | Dravidian | Afghanistan; Pakistan
Cahuilla | Uto-Aztecan | United States
Campa, Perené Ashéninca | Arawakan | Peru
Catalan-Valencian-Balear | Indo-European | France; Spain

[17] Only those countries for which the speaker population of a language is over 5% of the total speaker population for that language, or for which over 75% of the people of that country speak the language, are included here (Ethnologue, 13th Edition).

Chechen | North Caucasian | Russia (Europe)
Chinese, Mandarin | Sino-Tibetan | China
Chinook | Penutian | United States
Comanche | Uto-Aztecan | United States
Cubeo | Tucanoan | Colombia
Daga | Trans-New Guinea | Papua, New Guinea
Dakota | Siouan | Canada; United States
Dieri | Australian | Australia
Djingili | Australian | Australia
Dong | Daic | China
Dutch | Indo-European | Belgium; Netherlands
English | Indo-European | Canada, UK, USA …
Fijian | Austronesian | Fiji
Finnish | Uralic | Finland; Sweden
French | Indo-European | Canada, France …
Georgian | South Caucasian | Georgia
German | Indo-European | Austria; Germany; Liechtenstein; Luxembourg; United States
Gilyak | Language Isolate | Russia (Asia)
Godié | Niger-Congo | Côte d’Ivoire
Greek | Indo-European | Cyprus; Greece
Haitian Creole French | French-based Creole | Dominican Republic; Haiti; Puerto Rico
Halkomelem | Salishan | Canada
Hausa | Afro-Asiatic | Niger; Nigeria
Hebrew | Afro-Asiatic | Israel
Hmong Njua | Hmong Mien | China; Laos; United States
Ho | Austro-Asiatic | India
Iraqw | Afro-Asiatic | Tanzania
Japanese | Japanese | Japan
Kabardian | North Caucasian | Russia (Europe); Turkey
Kaingáng | Macro-Ge | Brazil
Kalihna | Carib | French Guiana; Guyana; Surinam; Venezuela

Kannada | Dravidian | India
Kiowa | Kiowa-Tanoan | United States
Koasati | Muskogean | United States
Kongo, San Salvador | Niger-Congo | Angola; Zaire
Kutenai | Language Isolate | Canada; United States
Kwakiutl | Wakashan | Canada; United States
Ladakhi | Sino-Tibetan | India
Latvian | Indo-European | Latvia
Lillooet | Salishan | Canada
Lithuanian | Indo-European | Lithuania; United States
Lugbara | Nilo-Saharan | Uganda; Zaire
Luiseño | Uto-Aztecan | United States
Makian, West | West Papuan | Indonesia (Maluku)
Marghi, Central | Afro-Asiatic | Nigeria
Marshallese | Austronesian | Marshall Islands, Nauru
Maxakalí | Macro-Ge | Brazil
Mískito | Misumalpan | Honduras; Nicaragua
Miwok, Central Sierra | Penutian | United States
Mixteco-Chalcatongo | Oto-Manguean | Mexico
Mohawk | Iroquoian | Canada; United States
Mongolian, Halh | Altaic | Mongolia
Nahuatl, Classical | Uto-Aztecan | Mexico
Nama | Khoisan | Botswana; Namibia; South Africa
Navaho | Na-Dene | United States
Ngandi | Australian | Australia
Ojibwa, Western | Algic | Canada; United States
Paipai | Hokan | Mexico
Pakaásnovos | Chapacura-Wanham | Brazil
Palauan | Austronesian | Guam; Palau Islands
Pidgin, Nigerian | English-based Creole | Nigeria
Polish | Indo-European | Poland; United States
Pomo, Eastern | Hokan | United States
Potawatomi | Algic | Canada; United States

Quechua, Huanuco Huallaga | Quechuan | Peru
Rikbaktsa | Macro-Ge | Brazil
Romanian | Indo-European | Moldova; Romania
Salish, Southern Puget Sound | Salishan | United States
Sirionó | Tupi | Bolivia
Somali | Afro-Asiatic | Ethiopia; Somalia
Sotho, Southern | Niger-Congo | Lesotho; South Africa
Spanish | Indo-European | Mexico; Spain; United States …
Swahili | Niger-Congo | Tanzania
Swedish | Indo-European | Sweden; United States
Tamazight, Central Atlas | Afro-Asiatic | Algeria; France; Morocco
Tauya | Trans-New Guinea | Papua, New Guinea
Telugu | Dravidian | India
Thai | Daic | Thailand
Tok Pisin | English-based Creole | Papua, New Guinea
Tonkawa | Coahuiltecan | United States
Tunica | Gulf | United States
Turkish | Altaic | Turkey
Tzutujil | Mayan | Guatemala
Vietnamese | Austro-Asiatic | Viet Nam
Wappo | Yuki | United States
Welsh | Indo-European | Wales
Wichita | Caddoan | United States
Wolaytta | Afro-Asiatic | Ethiopia
Xokleng | Macro-Ge | Brazil
Yaouré | Niger-Congo | Côte d’Ivoire
Yimas | Sepik-Ramu | Papua, New Guinea
Yupik, Central | Eskimo-Aleut | United States
Zapoteco, Yatzachi | Oto-Manguean | Mexico; United States
Zuni | Language Isolate | United States

Appendix B: Person systems in the database [18]

System | # of lgs | % of lgs | Examples
1/2/3 | 66 | 60% | Kabardian; Swahili
1/2(/3) | 4 | 4% | Basque; Kannada
1/incl/2/3 | 37 | 34% | Cubeo; Telugu
1/incl/2(/3) | 1 | 1% | Mongolian
1/2/3/4 | 1 | 1% | Yupik
1/incl/2/3/4 | 0 | 0% | --

[18] The symbol (3) indicates that there are no dedicated 3rd person pronouns. Rather, demonstratives are used for this purpose.

Appendix C: Number systems in the database [19]

System | # of lgs | % of lgs | Examples
sg/pl | 86 | 79% | Ojibwa; Turkish
sg/dl/pl | 18 | 16% | Lithuanian; Zuni
sg/dl/pa/pl | 2 | 2% | Fijian; Yimas
sg/dl/tl/pl | 1 | 1% | Tok Pisin
general | 2 | 2% | Campa; Kiowa

[19] The following abbreviations are used in Appendix C: sg = singular; dl = dual; tl = trial; pa = paucal; pl = plural.

Appendix D: Gender systems in the database

System | # of lgs | % of lgs | Examples
none | 59 | 54% | Lugbara; Georgian
gender-based | 37 | 34% | French; Hebrew
animacy-based | 9 | 8% | Ainu; Godié
gender and animacy based | 4 | 4% | Mixteco; Ngandi

Appendix E: Case systems in the database

System | # of lgs | % of lgs | Examples
none | 46 | 42% | Daga; Quechua
grammatical cases only | 40 | 37% | Hausa; Palauan
grammatical and local cases only | 6 | 6% | Dieri; Turkish
grammatical, local and other cases | 17 | 15% | Berik; Chechen

Appendix F: Formality systems in the database

i. Formality systems by degrees of formality

System | # of lgs | % of lgs | Examples
none | 74 | 68% | Acholi; Gulf Arabic
2 degrees of formality | 25 | 23% | Ho; Navaho
>2 degrees of formality | 10 | 9% | Mixteco; Thai

ii. Formality systems by person

System | # of lgs | % of lgs | Examples
none | 74 | 68% | Acholi; Gulf Arabic
2nd person only | 26 | 24% | Welsh; Gilyak
2nd, 3rd persons only | 2 | 2% | Ladakhi; Wolaytta
1st, 2nd, 3rd persons | 5 | 4% | Makian; Vietnamese
3rd person only | 2 | 2% | Kalihna; Zapoteco

References

Baron, Cynthia L. and Daniel Peck
  2001 Visual QuickPro Guide: FileMaker Pro Advanced for Windows and Macintosh. Berkeley: Peachpit Press.
Benveniste, Émile
  1971 The nature of pronouns. In Problems in General Linguistics, 217–222. Coral Gables, FL: University of Miami Press.
Blake, Barry J.
  1994 Case. Cambridge: Cambridge University Press.
Bliss, Heather
  2005 Formalizing point-of-view: The role of sentience in Blackfoot’s direct/inverse system. MA thesis, University of Calgary.
Bliss, Heather and Karen Jesney
  2005 Resolving hierarchy conflict: Local obviation in Blackfoot. Calgary Papers in Linguistics 26: 92–116.
Bliss, Heather and Elizabeth Ritter
  2001 Developing a database of personal and demonstrative pronoun paradigms: Conceptual and technical challenges. In Proceedings of the IRCS Workshop on Linguistic Databases, Steven Bird, Peter Buneman, and Mark Liberman (eds.), 38–47. Philadelphia, PA: Institute for Research in Cognitive Science.
Brown, Penelope and Stephen Levinson
  1978 Universals in Language Usage: Politeness Phenomena. In Questions and Politeness: Strategies in Social Interaction, Esther Goody (ed.), 56–311. Cambridge: Cambridge University Press.
  1987 Politeness: Some Universals in Language Usage. Studies in Interactional Sociolinguistics 4. Cambridge: Cambridge University Press.
Campbell, Russell N.
  1969 Noun Substitutes in Modern Thai: A Study in Pronominality. Paris: Mouton.
Clements, George N.
  1985 The geometry of phonological features. Phonology Yearbook 2: 225–252.
Corbett, Greville G.
  1991 Gender. Cambridge: Cambridge University Press.
  2000 Number. Cambridge: Cambridge University Press.
Déchaine, Rose-Marie
  1999 What Algonquian morphology is really like: Hockett revisited. In Papers from the Workshop on Structure and Constituency in Native American Languages, Leora Barel, Rose-Marie Déchaine, and Charlotte Reinholtz (eds.), 25–72. (MIT Occasional Papers in Linguistics 17.) Cambridge, MA: MIT Press.
Foley, William A.
  1991 The Yimas Language of New Guinea. Stanford, California: Stanford University Press.
Greenberg, Joseph H.
  1963 Some universals of grammar with particular reference to the order of meaningful elements. In Universals of Language, Joseph H. Greenberg (ed.), 73–113. Cambridge, MA: MIT Press.
Haberl, Heather and Rebecca Hanson
  2000 Gender specification in inclusive pronouns: An analysis of Nama. Paper presented at: The Truths About Pronouns and How True They Are Workshop. University of Konstanz, December 15–16, 2000.
Hanson, Rebecca
  2000 Pronoun acquisition and the morphological feature geometry. Calgary Working Papers in Linguistics 22: 1–14.
  2003 Why can’t we all just agree? Animacy and the person case constraint. MA thesis, University of Calgary.
Hanson, Rebecca, Heidi Harley, and Elizabeth Ritter
  2000 Underspecification and universal defaults for person and number features. In Proceedings of the 2000 Meeting of the Canadian Linguistics Association, J. T. Jensen and G. van Herk (eds.), 111–122. Ottawa, ON: Cahiers Linguistiques d’Ottawa.
Harley, Heidi and Elizabeth Ritter
  2002 Person and number in pronouns: A feature-geometric analysis. Language 78 (3): 482–526.
Jacobson, Steven A.
  1984 Yup’ik Eskimo Dictionary. Fairbanks: Alaska Native Language Center and Program, University of Alaska.
  1995 A Practical Grammar of the Central Alaskan Yup’ik Eskimo Language. Fairbanks: Alaska Native Language Center and Program, University of Alaska.
Levinson, Stephen
  1983 Pragmatics. Cambridge: Cambridge University Press.
McGinnis, Martha
  2005 On markedness asymmetries in person and number. Language 81 (3): 699–718.
Mills, Timothy Ian
  2000 Morley Stoney pronouns: A feature geometry. Calgary Working Papers in Linguistics 22: 15–28.
Reed, Irene, Osahito Miyaoka, Steven Jacobson, Paschal Afcan, and Michael Krauss
  1977 Yup’ik Eskimo Grammar. Fairbanks: Alaska Native Language Center and Yup’ik Language Workshop.
Riggs, Stephen R.
  1973 Dakota Grammar, Texts, and Ethnography. Minneapolis: Ross and Haines. [Original edition, Washington: Government Printing Office, 1893.]
Ritter, Elizabeth
  2005 Ambiguity, generality and the formalization of person and number features. Paper presented at the Workshop on Feature geometries: From phonology to syntax. Université de Paris 8, December 3, 2005.
Ritter, Elizabeth and Heather Bliss
  2005 A typology of personal pronouns. Paper presented at Atelier, typologie et universaux du langage. Paris, December 5, 2005.
Sagey, Elizabeth
  1986 The representation of features and relations in non-linear phonology. Ph.D. dissertation, Massachusetts Institute of Technology.
Siewierska, Anna
  2004 Person. Cambridge: Cambridge University Press.
Schwartz, Steven A.
  2000 FileMaker Pro 5 Bible. Foster City, CA: IDG Books Worldwide.
Zwicky, Arnold
  1977 Hierarchies of person. Chicago Linguistics Society 13: 714–733.

Databases designed for investigating specific phenomena

Dunstan Brown, Carole Tiberius, Marina Chumakina, Greville Corbett and Alexander Krasovitsky

1. Introduction [1]

In this chapter we consider databases which have been constructed to investigate particular linguistic phenomena. Data entered into a database with little thought or attention to its categorisation are at best useless, and in the worst case harmful if used to make spurious generalizations. So there is a requirement that we are explicit about our analyses and the phenomena under investigation, and that serious thought is given to the structure of the database. If we are dealing with just one language, in the first instance we are only required to make sure that there is consistency across the language, although it is good to adopt standards which eventually allow for comparison across languages (see, for example, Bird and Simons 2003). Alternatively, a database may take a particular linguistic phenomenon and investigate it cross-linguistically. This requires an explicit definition of the phenomenon to ensure consistency across the languages in the database and to allow others to determine whether something is an instance of it or not. There is the danger that things can be made to look the same when they are not. In all instances, the key is for the user to have access to the data (or, more accurately, a representation of the data) which are claimed to be an instance of the phenomenon.

In §2 we discuss the issues which arise when developing databases for morphological phenomena. We illustrate this by examining inflectional syncretism in §2.1, and then proceed to the extreme phenomenon of suppletion in §2.2. In §3 we consider briefly an ongoing project using a database to investigate historical change in one language, Russian, over a short time period, in this case 200 years. In §4 we discuss a database created for the cross-linguistic investigation of agreement, an important phenomenon for syntacticians. We then show how this database has sufficient detail for it to be used to give us a general typological sketch of Russian in terms of how it fits within the canonical typology of agreement developed by Corbett (2006).

[1] An earlier version of §2.1 appeared as Brown (2001). The description of the suppletion database in §2.2 is based on a section from Hippisley, Chumakina, Corbett and Brown (2004). The introductory paragraph to §4 and §4.1 are based on Tiberius, Brown, and Corbett (2002).

2. Morphological Phenomena

Typological databases tend to register broad generalizations about languages (e.g. word order), and morphology is a particularly challenging area which may not appear to be readily accessible to database-oriented study, because of its apparent language specificity.

2.1. Syncretism

The Surrey Morphology Group created a database for inflectional syncretism with two aims in mind. We wished to create a resource which would be useful for our detailed investigation of inflectional syncretism (see Baerman, Brown and Corbett 2005). In addition to this, we intended to make this resource available to the wider community. In order to do both of these things, we needed to be explicit about our use of the term ‘syncretism’ and also, because it is not really possible for data to be entered into a database in an entirely ‘theory-neutral’ way, to provide a potential user of the database with the chance to check how we had come to our analyses. As a result, the theory which underlies the database may be more inclusive than the one which a potential user may prefer. Three issues are particularly important: the definition of the phenomenon; the choice of language sample; the structure of the database. We deal with these in the following sections.

2.1.1. Terminology

A typical definition of inflectional syncretism is given here:

The relation between words which have different morphosyntactic features but are identical in form. (Matthews 1997: 367)

As an example, if we consider the Classical Armenian u-stem noun žam ‘time’ in Table 1, we see that the morphosyntactic feature combinations locative singular, dative singular, genitive singular and instrumental singular correspond to the single inflected form žamow.

Table 1. An instance of syncretism in Classical Armenian

u-stem ‘time’
sg nom, sg acc | žam
sg loc, sg dat, sg gen, sg instr | žamow
sg abl | žamē

Imagine that a potential user of a database on inflectional syncretism had queried the database and asked to be shown all instances where there was identity between dative singular, locative singular and genitive singular. Among other things, the query then gives the syncretism from Table 1. A user would need to know on what basis it was decided that there is syncretism of these forms in Classical Armenian. Some uses of the term ‘syncretism’ treat it in a very broad way, such that it is not established on a language-internal basis, but relies on global knowledge of the inventory of features. On this view dative, locative and instrumental could count as an instance of syncretism even in a language which never distinguishes them formally. There are a number of problems with this: i) we would need to know the inventory of features in advance; ii) determining which functions the totally absent function is syncretic with usually involves the assumption of prototypical semantics for related functions. Here, however, the definition of the phenomenon may already involve a particular assumption about why it comes about. In this case, a definition of the phenomenon in question is ultimately based on the view that its causes are underlyingly semantic (see §2.1.3). But within the theoretical space of possibilities we might consider a number of motivations for syncretism: semantic, syntactic, morphological, phonological. A database should not be biased toward any of these possible interpretations, and should ideally allow a researcher to discover possible motivations. Construction of a database on syncretism where the phenomenon was not defined in a language-internal way would, for the reasons given, obscure or hide morphological and phonological causes, for example.

Accordingly, we needed to make the definition more explicit so that syncretism is an instance of a single inflected form corresponding to more than one morphosyntactic description, where feature values in the morphosyntactic description are established on a distributional basis within the language. Comrie (1991: 46), for instance, makes a distinction between distributional and formal case:

For each nominal in the language, establish the distinct forms that this nominal can show [i.e. its array of formal cases]. Now compare the distributions of all nominals. If some distribution is of a distinct form for all nominals, then this is a [distributional] case. If the distribution (a) of some form of some nominal is a proper sub-set of the distribution (a+b) of some form of any nominal, then the distribution or subdistribution defined by a and b are distinct [distributional] cases for all nominals. If the distribution (c+d) of some form of some nominal mutually and nonexhaustively overlaps the distribution (d+e) of some form of any other nominal, then each of c, d, and e is a distinct distributional case for all nominals. (Comrie 1991: 46)

Hence, for a particular feature value, in this instance a value for case, to be involved in syncretism and therefore be entered into the database, the feature value must be formally distinguished somewhere within the language in question. On the basis of this we can define syncretism informally, but relatively explicitly, as follows:

Syncretism: A single inflected word form corresponds to more than one morphosyntactic description, where the feature values in the descriptions are established distributional values for the language in question.

So locative and instrumental are treated as valid cases for Classical Armenian, because they have distinct forms for some nouns (Table 2).

Table 2. Classical Armenian ‘book’

o-stem ‘book’
sg nom, sg acc | gir
sg loc | giri
sg dat, sg gen | giroy
sg instr | girov
sg abl | giroy

Equally, in the demonstrative and pronominal system there are distinct genitive singular forms. So the use of a formal and distributional definition yields clear instances of the phenomenon on a language-internal basis. We have illustrated here with examples of case (in the singular), but under the definition of syncretism given, we can include any morphosyntactic features.[2] Within the Surrey Syncretisms Database we defined ten possible feature sets, together with provision for two additional sets, when required. The sets are Number, Case, Gender, Definiteness, Person, Tense, Mood, Voice, Aspect, Negation. As we could not be sure of a complete a priori list of the possible feature values for these sets, we designed the database in such a way that the researcher could enter new values when appropriate.
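The language-internal criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the query logic of the Surrey Syncretisms Database: the names PARADIGMS, distinguished and syncretic_pairs are hypothetical, the data are limited to the singular forms in Tables 1 and 2, and the pairwise test is a deliberate simplification of Comrie's distributional procedure. Two case values are reported as syncretic for a noun only if they share a form in that noun and are formally distinguished by at least one noun in the sample.

```python
from itertools import combinations

# Singular paradigms copied from Tables 1 and 2 (Classical Armenian).
PARADIGMS = {
    "žam": {"nom": "žam", "acc": "žam", "loc": "žamow", "dat": "žamow",
            "gen": "žamow", "instr": "žamow", "abl": "žamē"},
    "gir": {"nom": "gir", "acc": "gir", "loc": "giri", "dat": "giroy",
            "gen": "giroy", "instr": "girov", "abl": "giroy"},
}


def distinguished(paradigms, a, b):
    """True if at least one lexeme has different forms for values a and b."""
    return any(forms[a] != forms[b] for forms in paradigms.values())


def syncretic_pairs(paradigms, lexeme):
    """Pairs of values that share a form in `lexeme` but are formally
    distinguished somewhere else in the sample (simplified criterion)."""
    forms = paradigms[lexeme]
    return [
        (a, b)
        for a, b in combinations(forms, 2)
        if forms[a] == forms[b] and distinguished(paradigms, a, b)
    ]


for noun in PARADIGMS:
    print(noun, syncretic_pairs(PARADIGMS, noun))
```

With only these two nouns as evidence, the sketch reports locative, dative, genitive and instrumental as syncretic for žam in every pairing except dative with genitive, and it does not report nominative with accusative at all, because neither noun distinguishes those pairs. In the full language the evidence for such distinctions comes from other items, such as the demonstrative and pronominal forms mentioned above, which this toy sample omits.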

2 See, for example, the complex person syncretisms in Dalabon (Evans, Brown and Corbett 2001).

2.1.2. Coverage

Given the definition of syncretism in §2.1.1, populating the database involved examination of the paradigms of related sets of lexical items and comparing them with the paradigms of other lexical items of the same or related word classes. This obviously requires grammatical sources which are quite detailed in their description of the language’s morphology. Typically, studies of syncretism have been biased toward Indo-European, despite the fact that it is found elsewhere. The Surrey Morphology Group has carried out larger scale surveys of Person and Case syncretism as part of the World Atlas of Language Structures project (Baerman and Brown 2005a–b).3 However, trying to do a detailed analysis of every type of syncretism within each language requires high quality data which will most probably not be available for a larger sample, and so the sample for such an undertaking is necessarily much smaller.

Table 3. Genetic Affiliations in the Database

Family                 Number of Languages
Indo-European          4
Afroasiatic            2
Niger-Kordofanian      2
Nilo-Saharan           2
Tibeto-Burman          2
Trans-New Guinea       2
Altaic                 1
Austronesian           1
Carib                  1
Chibchan               1
Chukotko-Kamchatkan    1
Dravidian              1
Eskimo-Aleut           1
Isolate                1
Kartvelian             1
Kiowa-Tanoan           1
Nakh-Daghestanian      1
Non-Pama-Nyungan       1
Pama-Nyungan           1
Sepik-Ramu             1
Tacanan                1
Uralic                 1
TOTAL                  30

3 Matthew Baerman has created a database specifically for Person syncretism with a coverage of 110 languages. (http://www.smg.surrey.ac.uk/personsyncretism/index.aspx)

For our purposes data collection was guided by the desideratum of instantiating as much as possible of the logical feature space. First, this meant that we included only languages with some inflectional morphology: we did not add in isolating languages, to which the question does not even apply. Second, within the families meeting the inflectional criterion, we aimed for an areal and genetic spread. Third, languages found to have patterns of syncretism not otherwise represented in the database were favoured over other candidates. The last two criteria were sometimes in conflict; in particular, the Indo-European languages are somewhat overrepresented (four in all), as they display a particular typological richness of patterns. The user can rebalance the sample if desired by reducing the number of Indo-European languages. However, our aim was to explore the theoretical space, and to make the relevant data widely accessible. Data was obtained from published grammars of sufficient detail and through personal communication with experts. These different constraints meant that we limited ourselves to 30 languages. In Table 3 we have shown how many languages there are from each genetic group.

Upon completion of the project the Surrey Syncretisms Database contained 1256 records on languages, syncretisms and their domains (counted from the LanguageSyncretismDomain table, see §2.1.4). That is on average 42 records per language, although some languages have much more information associated with them than others. In sum, rather than concentrate on a broad sample with little information for each language, we created a smaller scale detailed database to examine the logical space of syncretism using high quality grammars.

2.1.3. How theoretical issues and the database structure relate

From a theoretical viewpoint there are at least four ways in which inflectional syncretism can be interpreted. The first is to claim that it is accidental homophony and should be interpreted as a mere coincidence of form (i.e. the forms just happen to be the same, but are not identical).
The second is to treat the syncretism as an instance of one morphosyntactic combination borrowing its form from another (typically called ‘referrals’). The third way is to claim that particular morphosyntactic combinations share a common form to which they are both indexed (i.e. one does not borrow from the other). The fourth way involves some reinterpretation of morphosyntactic features: typically, it is assumed that, where two or more functions share the same form, there is some degree of underspecification involved. The assumption is that morphosyntactic features are themselves defeasible.

In its extreme form the first approach basically says that inflectional syncretism is a non-existent phenomenon. Of course, there are variants on the first approach which treat certain apparent identities as accidental, and others as not. The problem with the first interpretation of syncretism is that it is often very difficult to determine what is accidental and what is not. For instance, in Russian adjectives the (masculine and neuter) instrumental singular and dative plural always have the same form, a fact which would suggest that it is hardly accidental, and yet this may appear counterintuitive to proponents of particular theories. Because it is a matter of debate what is accidental and what is not, it is important that every instance of syncretism is entered in the database. It should be the decision of the database user whether a syncretism is accidental or not. In order to allow for this, however, we must build in a means for the user to check the data. As we see in §2.1.4 this is achieved by the use of two hyperlink fields in our database (one in the table of languages and one in the table which combines languages and syncretisms). These fields point to profile documents, written by the database creators.

Irrespective of its theoretical merits, the fourth approach, using underspecification, is unsatisfactory for the construction of a database, as it involves a specific interpretation of the data. In addition to this, there are two problems. We might start to construct a database and find over time that the feature ‘geometry’ we have assumed does not function for each new language we come across. Whether it would or not is an empirical matter, of course, but it cannot be guaranteed from the outset. Another potential problem is that it will leave out information that a potential user would find useful. In other words, the user would have to reconstruct the analysis in order to determine which features are syncretic.
It has been argued elsewhere that some of the assumptions associated with the relationship between syncretism and underspecification approaches do not necessarily apply over a broad sample of languages (see Baerman, Brown and Corbett 2005). Irrespective of the truth of this claim, use of an underspecification based approach to create a database of syncretism omits information that will be useful for a potential user. Having excluded on practical grounds the first and fourth theoretical approaches as a basis for the structure of our database, we now turn to the referrals and indexing approaches. These approaches are better, because they need not force a particular interpretation of the data. The difference between the second and third approach is more one of implementation than theory. Consider the Russian4 paradigm in Table 4.

4 Russian is not one of the languages in the Surrey Syncretisms Database. We have tried to minimize the number of Indo-European languages.

Table 4. Russian syncretisms (in transcription)

          zakon ‘law’   komnata ‘room’   kost′ ‘bone’   okno ‘window’
Sg Gen    zakona        komnati          kost′i         okna
Sg Dat    zakonu        komnate          kost′i         oknu
Sg Loc    zakone        komnate          kost′i         okne

One way of accounting for the syncretisms in Table 4 is to make appeal to rules of referral. These are rules which specify that one morphological form will be realized identically to another; the term is due to Zwicky (1985: 372). Rules of referral may be seen as comparable to Perlmutter and Orešnik’s (1973) ‘prediction rules’. Stump (1993, 2001) argues explicitly that underspecification can account for certain syncretisms, whereas referrals are the best way of dealing with others. Under the referrals approach we would say that the form of the dative singular is based on the locative singular (because the locative singular form ending in -e can be found on its own in the zakon and okno declensions). If we stated that there is a referral of dative singular and locative singular in Russian, this abstracts away from the particular forms, as in the kost′ type the actual form involved differs from that in the komnata type.5 So for Russian we have a syncretism stated as one binary pair of combinations (namely dative singular and locative singular). For the kost′ type we also need to say that the genitive singular is syncretic with both the dative and locative singular. Of course, stating that it is syncretic with either one should imply that it is syncretic with the other. However, for practical reasons this information may not be apparent to the user of a database. As we shall see in our discussion in §2.1.4 we introduce a particular distinction between ‘dependent’ and ‘independent’ syncretisms to get round this issue.

5 This kind of relationship is systematic. There are different ways of representing it. In the lexical knowledge representation language DATR, for example, which is default inheritance based, it is represented as ‘global inheritance’, because we need to know one form of a specific lexical item in order to determine another of its forms, while at the same time expressing that this relationship holds generally for a class of items. (See Evans and Gazdar 1996 for more on DATR.) In default unification approaches this type of relationship is represented using reentrancies which may be indefeasible or default (for example, Lascarides and Copestake 1999).

Treating forms as indexed to particular morphosyntactic combinations is another way that could be used to represent the syncretisms in Table 4.

Practically there is a difference from the referrals approach. Here we would say that there are two types of syncretism in Table 4: one (komnata type) involves the set of feature combinations SgDat, SgLoc; the other involves the set of feature combinations SgDat, SgLoc, SgGen. This approach is feasible, although implementing it would probably require some estimate on the upper limit of the number of combinations in a set. The practical issue for constructing a database using this approach is the use of values which enable us to search and find instances where SgDat and SgGen are syncretic, for example. Under the referrals approach we can do this straightforwardly, because we are always dealing with instances of binary pairs of feature combinations. The indexing approach can sometimes lose sight of what is shared. For example, the syncretism of SgDat and SgLoc is shared by both the komnata and kost′ types, but they are treated as separate sets – one subsuming the other – under the indexing approach. It is also desirable to constrain the data entry such that we can use such constraints as primary keys to say that once we have stated that a language has a particular syncretism (irrespective of the form) we do not repeat this. It is for these reasons that we adopted the ‘referrals’ approach to the design of the Surrey Syncretisms Database. Aronoff (1994: 83) criticizes the use of rules of referral – in certain analyses – precisely because of their directionality. While there is evidence for directional syncretisms (see Baerman, Brown and Corbett 2005), the Surrey Syncretisms database did not encode information about directionality, as this is a matter of interpretation.
Instead, the creation of a binary pair of combinations in the database only involved a commitment to the fact that the same form is used by the two feature combinations in question.6 Using an approach based on binary pairs of combinations still allows us to search for hierarchical ordering in the data. In the work of Hjelmslev (1943/1961), Carstairs[-McCarthy] (1984, 1987, 1992), Wunderlich and Fabri (1995), and Noyer (1997: xx–xxi, 45), for example, particular categories (feature sets) are ordered in relation to each other to capture dependencies, such as that between gender and number, for instance. Gender distinctions, for example, are lost in the presence of a particular number, typically plural (Greenberg 1963). Treating syncretisms as binary pairs of feature combinations allows us to look at the context in which they occur. Taking up the example of case syncretism within a particular number, we can easily construct queries in which we ask for instances where the number values are the same, but the case values differ. Hence, the adopted approach still allows for this interpretation, and it is possible to search for data which may indicate such hierarchical orderings.

6 As two feature combinations are involved we have to use an arbitrary convention to decide which one is on the left-hand side of the binary pair. But this has no theoretical import.
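The practical advantage of binary pairs can be sketched with the Russian data of Table 4. The following is only an illustration under stated assumptions: the cells are keyed by (number, case), and the names `binary_pairs`, `types_with` and `same_number_different_case` are invented here rather than taken from the database. The sketch decomposes each declension type into pairs of cells that share a form, then searches for a given pair across types and filters for pairs with the same number value but different case values.

```python
from itertools import combinations

# Table 4 (Russian, in transcription); each cell is keyed by (number, case).
TABLE_4 = {
    "zakon": {("sg", "gen"): "zakona", ("sg", "dat"): "zakonu", ("sg", "loc"): "zakone"},
    "komnata": {("sg", "gen"): "komnati", ("sg", "dat"): "komnate", ("sg", "loc"): "komnate"},
    "kost′": {("sg", "gen"): "kost′i", ("sg", "dat"): "kost′i", ("sg", "loc"): "kost′i"},
    "okno": {("sg", "gen"): "okna", ("sg", "dat"): "oknu", ("sg", "loc"): "okne"},
}


def binary_pairs(paradigm):
    """Return every pair of cells that share a form, lower cell first."""
    return [
        (left, right)
        for left, right in combinations(sorted(paradigm), 2)
        if paradigm[left] == paradigm[right]
    ]


PAIRS = {name: binary_pairs(cells) for name, cells in TABLE_4.items()}


def types_with(pair):
    """Return the declension types in which the given pair is syncretic."""
    return sorted(name for name, pairs in PAIRS.items() if pair in pairs)


def same_number_different_case(pairs):
    """Keep pairs whose number values match but whose case values differ."""
    return [(a, b) for a, b in pairs if a[0] == b[0] and a[1] != b[1]]


DAT_LOC = (("sg", "dat"), ("sg", "loc"))
DAT_GEN = (("sg", "dat"), ("sg", "gen"))
print(types_with(DAT_LOC))   # types sharing the SgDat-SgLoc pair
print(types_with(DAT_GEN))   # types sharing the SgDat-SgGen pair
print(same_number_different_case(PAIRS["kost′"]))
```

Because every syncretism is stored as a pair, the SgDat-SgLoc pair is found in both the komnata and kost′ types without any need to compare sets of different sizes.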

2.1.4. The database

The Surrey Morphology Group Syncretisms Database contains 18 tables. Of these, 10 tables contain the values for reasonably well-defined feature sets. Hence, there is a table Number which contains the values for number the researcher finds as he enters data from a chosen language. The 10 tables are: Number, Case, Gender, Definiteness, Person, Tense, Mood, Voice, Aspect, Negation. In addition to the 10 tables there are two others, Spare1 and Spare2 for additional morphosyntactic feature values which cannot be placed in any of the sets given in the 10 tables. These 12 tables are to be seen on the right side of Figure 1.

[Figure 1: diagram of linked tables. Language Table: languages. LanguageSyncretismDomain Table: languages, syncretisms, domains. Syncretism Table: syncretisms (pairs of combinations). Combination Table: combinations (two relationships with the Syncretism Table). 12 Tables: features. WordTypes Table: word classes. Domain Table: domains (word classes, semantics).]

Figure 1. Structure of the Surrey Syncretisms Database

A feature value from one of the twelve tables can be combined with other feature values from the same twelve to form a morphosyntactic combination. The Combination table is ringed in Figure 1.7 The relationship between any given feature table (e.g. Number) and the Combination table is one-to-many. The Combination table assigns a unique arbitrary index to the combination set, but it is the actual set of combinations which is the primary key for the Combination table. The table of syncretisms is in the middle of the top set of tables. The Combination table has two relationships with the Syncretism table. Both of these are one-to-many. A syncretism is treated as a binary pair of morphosyntactic combinations. The Syncretism table contains the indices for each morphosyntactic combination. These indices are automatically compared during data entry to check that they are not identical (as a form which shares identical morphosyntactic features is not a syncretism) and to determine which combination is put in the first field, and which in the second. This is done by placing the higher number in the second field. The index number for the combination is determined purely by the particular point in time at which the researcher entered the combination. If one wishes to compare all instances of, say, pairs of number values involved in syncretisms, irrespective of the other features they occur with, this may mean that queries have to reorder the values alphanumerically. Consider the following pairs of combinations taken from the database:

(1)  Combination 1: pl-masc    Combination 2: sg-fem

(2)  Combination 1: sg-masc    Combination 2: pl-fem

In both these examples the number values involved are sg and pl. However, in (1) pl is the Combination 1 value, and in (2) sg is the Combination 1 value. Hence, queries which may be interested in pairs of single categories such as Number also involve sorting the values alphanumerically. This has proved to be straightforward. If we had decided to adopt an approach in which forms were associated with an n-tuple of combinations, this ordering issue would be more complex, of course.

7 Figure 1 shows two tables ringed. In fact, these are just the table of combinations repeated. This is because the Combination table has two relationships with the Syncretism table: a combination occurs in both the left-hand side field and right-hand side fields of a syncretism.

Other fields in the LanguageSyncretismDomain table allow us to associate a binary pair with other binary pairs, if they are part of a greater syncretism. We do this by defining a syncretism as ‘independent’ if it may occur on its own (for instance, the SgDat-SgLoc syncretism in Russian is ‘independent’) and ‘dependent’ if the binary pair only occurs in the presence of another syncretism (e.g. the SgGen-SgLoc syncretism only occurs in the presence of the SgDat-SgLoc syncretism). A ‘realization’ field then allows us to find examples of syncretisms based on the same form.

Toward the bottom left of Figure 1 is a table which contains fields of word classes (nouns, adjectives, verbs etc.). This table can construct sets of word classes to be used in the Domain table. The primary key for this table is the field which contains the value which represents a set of word classes. A set consists of one or more word classes.8 This then enables us to see which word classes typically group together in syncretism domains. The Word Class table has a relationship with the Domain table, which combines three fields: word class, syntax, semantics. Hence, the relationship between the Word Class table and the Domain table is one-to-many, as a single set of word classes could occur under different syntactic or semantic restrictions.9

The table in the top left contains information about languages. There are three fields: Language, Family, Report. Introduction of the Family field enables us to make restrictions on queries where all examples may be from the same language family. Also, given the non-redundant design of the database, we can freely change a family affiliation, if required on the basis of new information, by just changing one field of one record in one table. The Report field is a hyperlink field which contains a detailed report on the language written by the researcher.
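Returning to the ordering convention for binary pairs described earlier in this section, the two steps involved can be sketched as follows. This is an illustration only: the function names are invented, and the combinations in examples (1) and (2) are represented here as simple `number-gender` strings, whereas in the database each feature value sits in its own table. The first function mirrors the data-entry check (indices must differ, and the higher index goes in the second field); the second mirrors the alphanumeric sorting that a query on Number values needs.

```python
def make_syncretism(index_a, index_b):
    """Order two combination indices: lower first, higher second."""
    if index_a == index_b:
        raise ValueError("identical combinations do not form a syncretism")
    return (min(index_a, index_b), max(index_a, index_b))


def number_pair(combination_1, combination_2):
    """Return the number values of a pair, sorted alphanumerically."""
    number_1 = combination_1.split("-")[0]
    number_2 = combination_2.split("-")[0]
    return tuple(sorted((number_1, number_2)))


# Examples (1) and (2) from the text, as (Combination 1, Combination 2).
EXAMPLE_1 = ("pl-masc", "sg-fem")
EXAMPLE_2 = ("sg-masc", "pl-fem")

# After sorting, both examples yield the same pair of number values,
# regardless of which combination happened to be entered first.
print(number_pair(*EXAMPLE_1) == number_pair(*EXAMPLE_2))
print(make_syncretism(17, 4))
```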
To the right of the Language table in Figure 1 is the LanguageSyncretismDomain table (LSD table), which is the heart of the database. It combines the information about languages with that about syncretisms and domains. The relationship between the Language table and the LSD table is one-to-many. The relationship between the Syncretism table and the LSD table is one-to-many, and the relationship between the Domain table and the LSD table is also one-to-many. The LSD table also contains a hyperlink field which links to a document containing illustrative paradigms for each syncretism in the database. This use of hyperlinks to illustrative paradigms allows for quality control by the user, who can see how the analysis, encoded by the choice of morphosyntactic combinations, is arrived at.

8 The set of word classes is labelled WordType. Individual word classes are labelled WordType1 and so on.

9 We have found that there are very few instances of this. A potential example is the word class (set) adjective, which may behave differently in predicative or attributive syntactic function. But empirically it appears that examples like this are rare. Hence, use of the database has shown that a theoretical possibility is not that common.

2.1.5. Summary for syncretism

In §2.1.1 we gave an explicit definition of syncretism and showed how important this is in order to guarantee the grounds of comparison. We then discussed theoretical issues and showed how particular theories are better suited for database design for practical reasons. Finally, in §2.1.4 we showed the design of our database and how the structure reflected, to a certain extent, one particular theoretical approach, at the same time allowing a user access to all the information they require to check analyses.
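The relational design summarized here can be sketched in a few lines using Python's built-in sqlite3 module. This is a simplified illustration under stated assumptions, not the actual Microsoft Access schema: all table and column names are invented, the feature tables are collapsed into a single `label` column, and the domain ‘nouns’ is a placeholder value. The sketch shows how constraints can stop a syncretism from being entered twice, from pairing a combination with itself, or from being entered with the higher index first.

```python
import sqlite3

# Hypothetical, simplified schema; names are illustrative assumptions.
SCHEMA = """
CREATE TABLE language (
    name   TEXT PRIMARY KEY,
    family TEXT NOT NULL,           -- stored once per language
    report TEXT                     -- hyperlink to the prose report
);
CREATE TABLE combination (
    id    INTEGER PRIMARY KEY,      -- arbitrary index
    label TEXT NOT NULL UNIQUE      -- stands in for the set of feature values
);
CREATE TABLE syncretism (
    comb1 INTEGER NOT NULL REFERENCES combination(id),
    comb2 INTEGER NOT NULL REFERENCES combination(id),
    PRIMARY KEY (comb1, comb2),
    CHECK (comb1 < comb2)           -- never identical; higher index second
);
CREATE TABLE lsd (
    language  TEXT NOT NULL REFERENCES language(name),
    comb1     INTEGER NOT NULL,
    comb2     INTEGER NOT NULL,
    domain    TEXT NOT NULL,
    paradigms TEXT,                 -- hyperlink to illustrative paradigms
    PRIMARY KEY (language, comb1, comb2, domain),
    FOREIGN KEY (comb1, comb2) REFERENCES syncretism(comb1, comb2)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)

conn.execute(
    "INSERT INTO language VALUES (?, ?, ?)",
    ("Classical Armenian", "Indo-European", None),
)
conn.executemany(
    "INSERT INTO combination VALUES (?, ?)", [(1, "sg-nom"), (2, "sg-acc")]
)
conn.execute("INSERT INTO syncretism VALUES (1, 2)")
conn.execute(
    "INSERT INTO lsd VALUES (?, 1, 2, ?, ?)",
    ("Classical Armenian", "nouns", None),
)


def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


DUPLICATE_REJECTED = rejected("INSERT INTO syncretism VALUES (1, 2)")
REVERSED_REJECTED = rejected("INSERT INTO syncretism VALUES (2, 1)")
IDENTICAL_REJECTED = rejected("INSERT INTO syncretism VALUES (1, 1)")
print(DUPLICATE_REJECTED, REVERSED_REJECTED, IDENTICAL_REJECTED)
```

Since the family is held in a single field of the language record, a change of affiliation would touch only that one record, which is the non-redundancy property described in §2.1.4.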

2.2. Suppletion

Even more so than syncretism, suppletion is an extreme morphological phenomenon where different inflectional forms of the same lexical item are not related phonologically (for example, in English go and went). Mel′čuk defines it in the following way: For the signs X and Y to be suppletive their semantic correlation should be maximally regular, while their formal correlation is maximally irregular. (Mel′čuk 1994: 358)

By definition it is particularly challenging to create a robust model to deal with a phenomenon which is part of the extreme end of the spectrum of irregularity. However, it is possible to use the fact that the semantic and functional side of the phenomenon are regular.

2.2.1. The structure of the database

In order to support consistency in data entry the Surrey Suppletion Database was structured as a relational database, just as the Surrey Syncretisms Database. By treating the data in terms of a number of tables with relationships between them, we were able to place constraints on the information entered. In Figure 2 we see the underlying structure of the database.

[Figure 2: diagram of linked tables. Feature Tables: tables of features. Language: information about languages. LexemeStem: associating lexemes and stems. StemCombination: associating stems with specific morphosyntactic combinations. Combination: morphosyntactic combinations.]

Figure 2. Structure of the Surrey Suppletion database

To the right of Figure 2 we see ten tables of features (Number, Case, etc.) plus an additional ‘spare’ table. These eleven tables are encircled in the figure. The values from these tables are combined in a morphosyntactic combination, as represented by the Combination table. As a value can occur in more than one morphosyntactic combination, the relationship between feature values and morphosyntactic combinations is one-to-many. We can use the morphosyntactic combinations to define the morphosyntax associated with particular stems. In the StemCombination table, stems are associated with the morphosyntactic content of which they are the expression in form. It is worth noting the consequences for database design of the contrast between language specific idiosyncrasy on the form side (of which suppletion is an excellent example) and the greater cross-linguistic applicability of morphosyntactic values on the function side. Because stems are language-specific items and morphosyntactic combinations are not, the same morphosyntactic combination (from the Combination table) can occur with many different stems. So the relationship between the Combination table and the StemCombination table is one-to-many. (Morphosyntactic combinations are listed only once, but potentially each combination can occur many times with different stems.)

The LexemeStem table provides triples of information: a lexeme name (lexeme being an abstraction over a whole paradigm); a stem name; a description of a stem. The relationship between the LexemeStem table and the StemCombination table is one-to-many, as a language specific stem is constrained to be described once, but this stem could in principle occur in more than one morphosyntactic combination (in the StemCombination table). The beauty of having a separate LexemeStem table and StemCombination table is that we can then describe stems both in terms of the morphosyntax and also in terms of their arbitrary morphological function (in the field ‘stem name’ in the LexemeStem table). This allows us to analyze a stem’s morphosyntactic and morphomic (i.e. purely morphological) properties, which will be language specific. Information about lexemes and stems is therefore represented through the use of the two linked tables discussed, the LexemeStem table and the StemCombination table. The latter associates stems with morphosyntax, and the LexemeStem table associates the stems with lexemes. These tables are linked using the stem field.

The LanguageLexemeSuppletion table – which is represented to the right of the Language table in Figure 2 – brings together the information about suppletion and languages and introduces further fields, such as semantic categories (the lexical semantics of the items involved), the syntactic categories (word classes), whether the suppletion alternates, and if the lexeme has additional suppletion.
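The link between the LexemeStem and StemCombination tables can be sketched with Python's built-in sqlite3 module. The schema below is a simplified assumption, not the actual Surrey Suppletion Database: table and column names are invented, the stem itself serves as the key (a real design would also need the language), and the rows use the English go/went pair mentioned in §2.2 plus an invented regular lexeme purely for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE combination (
    id    INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE       -- each combination is listed only once
);
CREATE TABLE lexeme_stem (
    stem        TEXT PRIMARY KEY,    -- each stem is described once
    lexeme      TEXT NOT NULL,
    description TEXT
);
CREATE TABLE stem_combination (
    stem        TEXT NOT NULL REFERENCES lexeme_stem(stem),
    combination INTEGER NOT NULL REFERENCES combination(id),
    PRIMARY KEY (stem, combination)
);
""")

# Illustrative rows only; labels such as 'present' are assumptions.
conn.executemany(
    "INSERT INTO combination VALUES (?, ?)", [(1, "present"), (2, "past")]
)
conn.executemany(
    "INSERT INTO lexeme_stem VALUES (?, ?, ?)",
    [
        ("go", "GO", "basic stem"),
        ("went", "GO", "suppletive stem"),
        ("walk", "WALK", "basic stem"),
    ],
)
conn.executemany(
    "INSERT INTO stem_combination VALUES (?, ?)",
    [("go", 1), ("went", 2), ("walk", 1), ("walk", 2)],
)

# Lexemes with more than one stem, with the morphosyntax of each stem.
SUPPLETIVE = conn.execute("""
    SELECT ls.lexeme, ls.stem, c.label
    FROM lexeme_stem AS ls
    JOIN stem_combination AS sc ON sc.stem = ls.stem
    JOIN combination AS c ON c.id = sc.combination
    WHERE ls.lexeme IN (
        SELECT lexeme FROM lexeme_stem GROUP BY lexeme HAVING COUNT(*) > 1
    )
    ORDER BY c.id
""").fetchall()
print(SUPPLETIVE)
```

The query searches on the morphosyntactic side (the shared combinations), which is what makes results comparable across languages even though the stems themselves are language specific.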
There is also a hyperlink to example paradigms. The language table provides information about the languages in the table, together with an individual report.

2.3. Morphological phenomena: summary

In §2.1 we considered three important issues for the design of a database dedicated to the morphological phenomenon of syncretism. In §2.2 we considered an even more extreme phenomenon, suppletion. When designing a database for morphological phenomena it is very important to take into consideration how it is to be searched. For instance, we have seen in our discussion of suppletion the structures available for encoding language-specific information about stems, but searches based on language-specific data are, of course, not going to be successful at yielding many results. This is why much thought must be given to how the morphosyntactic information is structured in the database so that it can be used effectively to find the specific examples.

3. Historical phenomena

In this section we consider a different challenge for databases geared toward specific phenomena. Here what is under investigation is the set of changes which have occurred in the use of the morphology in one language, namely Russian, over a 200 year period. We describe ongoing work and the decisions which have been taken so far.

3.1. Short-term morphosyntactic change

The task of the database is challenging: to give a detailed account of morphosyntactic shifts in Russian over the 19th and 20th centuries. It focuses on unstable areas in Russian morphosyntax that allow multiple alternative morphological values for one syntactic position. The following phenomena are under investigation: the genitive of negation (genitive vs. accusative objects with negated transitive verbs), predicative nouns (nominative vs. instrumental), predicative adjectives (long vs. short vs. instrumental forms), the choice of adjectives when occurring in a numeral phrase with 2–4 (nominative/accusative vs. genitive), predicates in quantified expressions (singular vs. plural), predicates with conjoined noun phrases (singular vs. plural). To capture the magnitude of factors that determine morphosyntactic variation and diachronic change, the database presents statistical analyses of the competition of functionally identical grammatical forms within equal time periods (20 year time slots). These are analyzed with respect to a number of morphological, syntactic, semantic and stylistic conditions. Statistics are derived from a corpus of fiction and non-fiction texts written between 1801 and 2000 (ca. 30 million tokens, compiled by Adrian Barentsen, University of Amsterdam). Extracted data are combined with secondary data from previous works and processed according to database requirements, so that both types of data provide full coverage of the ten 20 year time periods.

As with the Surrey Syncretisms and Surrey Suppletion databases, the Short-Term Change database is a relational database, created using Microsoft Access. It consists of six modules, in accordance with the phenomena under investigation; each of the modules consists of a set of linked tables. The modules have a similar information structure and hierarchy of attributes, which have three levels: 1) chronological (relates a phenomenon under investigation to a given period of time); 2) conditional (which provides factorial analysis and breaks down data with respect to factors that drive variation); and 3) statistical (which presents underlying numbers and percentages for each of the competing forms within a number of representative samples under each of the relevant conditions). Therefore the database allows various types of data manipulation, including factorial analysis of morphosyntactic variation per period per phenomenon, period-by-period tracing of diachronic change, and concurrent (vertical) analysis of different active processes in Russian morphosyntax.

To give an example, the variation of nominative/instrumental case marking on predicate nouns has been reflected in the database with respect to structural properties (type of governing verb), lexical semantics (e.g. animacy hierarchy, lexical class of the predicate noun) and sentence semantics (e.g. temporal or modal specification of the predicate). The list of relevant conditions (entries) was compiled on the basis of previous studies and original corpus-based research. The database allows examination of the significance of each of the conditions within different time periods on the basis of statistics. For instance, in this particular case statistics clearly indicate a drastic change in late 20th century Russian: semantically driven distribution of predicate cases, being characteristic of Russian until the middle of the 20th century, has been gradually replaced by a syntactically-determined model based on the instrumental in the second half of the century.
As entries associated with similar factors are unified across the modules, the database allows us to trace the rise or decline of the influence of certain language categories on morphosyntactic variation and change, for example the impact of animacy on case marking of predicate nouns or objects of negated transitive verbs, on predicate agreement with conjoined subjects, and so on.
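The chronological and statistical levels described above can be sketched as follows. This is a minimal illustration, not the database's own procedure: it assigns a year between 1801 and 2000 to one of the ten 20 year periods and computes the percentage of each competing form per period. The function names are invented, and the observations are made-up placeholder values with no relation to the actual corpus counts.

```python
from collections import Counter, defaultdict

FIRST_YEAR, LAST_YEAR, SPAN = 1801, 2000, 20


def period_of(year):
    """Return the (start, end) of the 20 year period containing the year."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"year {year} is outside the corpus range")
    start = FIRST_YEAR + ((year - FIRST_YEAR) // SPAN) * SPAN
    return (start, start + SPAN - 1)


def percentages(observations):
    """Return {period: {form: percentage}} for (year, form) observations."""
    counts = defaultdict(Counter)
    for year, form in observations:
        counts[period_of(year)][form] += 1
    return {
        period: {
            form: 100 * n / sum(forms.values()) for form, n in forms.items()
        }
        for period, forms in counts.items()
    }


# Invented observations for illustration only (NOT real corpus data).
OBSERVATIONS = [
    (1805, "nominative"),
    (1812, "instrumental"),
    (1819, "nominative"),
    (1820, "nominative"),
    (1995, "instrumental"),
    (2000, "instrumental"),
]

RESULT = percentages(OBSERVATIONS)
for period, shares in sorted(RESULT.items()):
    print(period, shares)
```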

4. The Agreement Database

The Surrey Database of Agreement is a typological database which includes detailed information on a syntactic phenomenon in a small, carefully chosen set of languages. Fifteen languages, taken from different families so as to maximize diversity, were investigated. These are Basque, Chichewa, Georgian, Hungarian, Kayardild, Mayali, Ojibwa, Palauan, Qafar, Russian, Tamil, Tsakhur, Turkana, Yimas, and Yup’ik.10 For each of these languages, data was gathered according to a consistent format and entered into a relational database for searching and further reference. An inclusive approach was adopted, based on the notion of canonical agreement (Corbett 2003, 2006). Thus all phenomena normally treated under agreement are included together with some which would count as instances of agreement for only some linguists. In §4.1 we describe the data format followed by a discussion of the structure of the database in §4.2. We conclude with an analysis of how the data in the agreement database can be used to determine the canonicity of Russian agreement in §4.3.

4.1. Data format

In the Surrey Database of Agreement, we used the following framework to describe agreement. We call the element which determines the agreement the controller. The element whose form is determined by agreement is the target (e.g. the verb). The syntactic environment in which agreement occurs is the domain of agreement (e.g. subject-predicate). And when we indicate in what respect there is agreement, we are referring to agreement categories (e.g. number, person). Finally, there may be conditions on agreement (e.g. definiteness). Our framework of terms is illustrated in Figure 3 (see Corbett 2006: 5), using number (singular) to illustrate, but this framework generalizes to all agreement features, of course:

[10] This was therefore a different enterprise as compared with the extensive database compiled by Anna Siewierska. By the time of the writing of Siewierska (1999), her database included 272 languages, which is an effective size for checking cross-linguistic claims. Naturally, the information on individual languages is less detailed than in the database that we constructed.

[Figure 3. Framework of terms. The example ‘the system works’ is labelled for domain, controller, target and condition, with feature: number and value: singular.]

Each of those areas was further investigated in our database. Thus for each language we defined its agreement domains with the respective controllers, targets, categories and their values, and conditions if present. For example, for Chichewa the following agreement domains are found in the database:

Agreement Domain        Controller-Target Pairs
Subject-Predicate       12
Antecedent-Anaphor      6
Head-Modifier           6
Possessor-Possessed     1
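The framework of terms and the Chichewa counts above can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `AgreementRecord`, its field names and the helper `domain_with_most_pairs` are hypothetical and are not taken from the Surrey database itself; only the four domain counts and the subject-predicate example (noun phrase subject, finite verb, number) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgreementRecord:
    """One agreement pattern, using the chapter's framework of terms.

    Field names are hypothetical; they mirror the terms controller,
    target, domain, category and condition.
    """
    language: str
    domain: str          # syntactic environment, e.g. "Subject-Predicate"
    controller: str      # element that determines the agreement
    target: str          # element whose form is determined by agreement
    categories: tuple    # respects in which there is agreement, e.g. ("number",)
    condition: str = ""  # optional condition on agreement, e.g. "definiteness"


# Controller-target pair counts per domain, as reported for Chichewa.
chichewa_pair_counts = {
    "Subject-Predicate": 12,
    "Antecedent-Anaphor": 6,
    "Head-Modifier": 6,
    "Possessor-Possessed": 1,
}


def domain_with_most_pairs(counts):
    """Return the domain that has the most controller-target pairs."""
    return max(counts, key=counts.get)


# The subject-predicate pattern described in the text: a noun phrase
# subject agreeing with the finite verb in number.
example = AgreementRecord(
    language="Chichewa",
    domain="Subject-Predicate",
    controller="Noun Phrase",
    target="Finite Verb",
    categories=("number",),
)

print(domain_with_most_pairs(chichewa_pair_counts))
```

Note that the count here is a count of types recorded in the database, not of occurrences in a corpus.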

The Subject-Predicate agreement domain in Chichewa occurs with twelve different controller-target pairs in the database. We have used the term ‘frequent’ to describe this situation in previous work (Tiberius, Brown and Corbett 2002), as it has the highest count in the database. However, this term is not being used to refer to a corpus-based count. (Although we might expect subject-predicate to be the most frequent in a corpus, of course.) For each of the domains the database contains a hyperlink to a file with example sentences illustrating the particular kind of agreement. For instance, the following Chichewa sentence illustrates subject-predicate agreement of the noun phrase subject with the finite verb in number:[11]

(3) tsamba      li-ku-bvunda
    G3SG.leaf   G3SG-PRES-rot
    ‘The leaf is rotting’ (Corbett and Mtenje 1987)

[11] G3 stands for gender 3. The traditional Bantu concord subclasses are organised into 10 genders in the database.


For each language in the database, there is also a prose report written by the researcher who established and entered the data for that language, giving sources. Data for the different languages is obtained from published grammars of sufficient detail and through personal communication with experts. The reports are intended to allow the user to see which analytical decisions were made and to treat the data accordingly. Since considerable disagreement exists in the literature, it is important that the user can see how choices were made. This is the same approach as we saw for the Surrey Syncretisms database in §2.1.4 and the Surrey Suppletion Database in §2.2.1.

4.2. Structure of the database

As with the other databases, the original version of the Agreement database was designed and implemented in Microsoft Access. That database contained 9 tables: Language, LanguageDomain, Domain, ControllerCategory, TargetCategory, Construction, Controller, Category, and Target. The relationships between these tables are illustrated in Figure 4.

Figure 4. Structure of the Surrey database of Agreement

The tables towards the right of this figure contain the basic elements that we distinguish to define agreement, i.e. possible controllers, possible targets, possible domains, and possible categories. Both controllers and targets can exhibit agreement categories. For the same construction, the value of the category on the controller and the value of the category on the target are not necessarily the same. For example, in British English examples such as The committee have decided, a singular controller (the committee) has plural agreement. Information about the combination of controllers with their agreement categories and targets with their agreement categories is contained in the tables ControllerCategory and TargetCategory.

The next table to the left is the Domain table. It defines unique combinations of agreement constructions with controller (category) and target (category) pairs. Each of these agreement domains is assigned a unique arbitrary index. The Domain table is linked to the LanguageDomain table, which forms the heart of the database. This table combines information about languages with that about agreement domains. The relationship between the Domain and LanguageDomain tables is one-to-many, as a particular agreement domain can occur in more than one language. The LanguageDomain table is also linked to the Language table by a one-to-many relationship, as a particular language can have several agreement domains. The Language table contains information about the languages in the database. It defines the language family to which a language belongs, and there is a hyperlink to the language report.
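The table layout just described can be approximated as a relational schema. The sketch below uses Python's built-in `sqlite3` module rather than Microsoft Access, and it is an assumption-laden illustration: only the table names and the one-to-many links between Domain, LanguageDomain and Language come from the text, while every column name, the sample row (the British English committee example, which is not claimed to be in the actual database) and the query are hypothetical.

```python
import sqlite3

# Table names follow the text; all column names are hypothetical.
SCHEMA = """
CREATE TABLE Language (id INTEGER PRIMARY KEY, name TEXT, family TEXT, report_link TEXT);
CREATE TABLE Construction (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Controller (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Target (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Category (id INTEGER PRIMARY KEY, name TEXT, value TEXT);
CREATE TABLE ControllerCategory (id INTEGER PRIMARY KEY,
    controller_id INTEGER REFERENCES Controller(id),
    category_id INTEGER REFERENCES Category(id));
CREATE TABLE TargetCategory (id INTEGER PRIMARY KEY,
    target_id INTEGER REFERENCES Target(id),
    category_id INTEGER REFERENCES Category(id));
CREATE TABLE Domain (id INTEGER PRIMARY KEY,
    construction_id INTEGER REFERENCES Construction(id),
    controller_category_id INTEGER REFERENCES ControllerCategory(id),
    target_category_id INTEGER REFERENCES TargetCategory(id));
CREATE TABLE LanguageDomain (id INTEGER PRIMARY KEY,
    language_id INTEGER REFERENCES Language(id),
    domain_id INTEGER REFERENCES Domain(id));
"""

# Domains in which the controller's value differs from the target's value.
MISMATCH_QUERY = """
SELECT l.name, c.name, cc_cat.value, tc_cat.value
FROM LanguageDomain ld
JOIN Language l ON l.id = ld.language_id
JOIN Domain d ON d.id = ld.domain_id
JOIN Construction c ON c.id = d.construction_id
JOIN ControllerCategory cc ON cc.id = d.controller_category_id
JOIN Category cc_cat ON cc_cat.id = cc.category_id
JOIN TargetCategory tc ON tc.id = d.target_category_id
JOIN Category tc_cat ON tc_cat.id = tc.category_id
WHERE cc_cat.value <> tc_cat.value
"""


def build_demo():
    """Create an in-memory database holding one illustrative agreement domain."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO Language VALUES "
                 "(1, 'English (illustration only)', 'Indo-European', NULL)")
    conn.execute("INSERT INTO Construction VALUES (1, 'Subject-Predicate')")
    conn.execute("INSERT INTO Controller VALUES (1, 'Noun Phrase')")
    conn.execute("INSERT INTO Target VALUES (1, 'Finite Verb')")
    conn.executemany("INSERT INTO Category VALUES (?, ?, ?)",
                     [(1, "number", "singular"), (2, "number", "plural")])
    conn.execute("INSERT INTO ControllerCategory VALUES (1, 1, 1)")  # NP, singular
    conn.execute("INSERT INTO TargetCategory VALUES (1, 1, 2)")      # verb, plural
    conn.execute("INSERT INTO Domain VALUES (1, 1, 1, 1)")
    conn.execute("INSERT INTO LanguageDomain VALUES (1, 1, 1)")
    return conn


conn = build_demo()
rows = conn.execute(MISMATCH_QUERY).fetchall()
print(rows)
```

Keeping the controller and target categories in separate link tables is what lets a single domain record a singular controller alongside plural agreement on the target.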

4.3. Using the database for a case study of agreement in Russian

In this section we show how the Agreement database can be used to determine to what extent agreement in Russian is canonical, following the set of criteria defined by Corbett (2006: 8–27). Corbett defines three general principles that characterize canonical agreement. These are:

Principle I: Canonical agreement is redundant rather than informative.
Principle II: Canonical agreement is syntactically simple.
Principle III: The closer the expression of agreement is to canonical (i.e. affixal) inflectional morphology, the more canonical it is as agreement.

These principles are operationalized by a set of nineteen criteria relating to the five components of our account of agreement, namely, controllers, targets, domains, features and conditions. The first principle relates specifically to the amount of information found on the targets of agreement when compared with the controllers of agreement (Corbett 2006: 10–12). The second principle is reflected in criteria related to all aspects of agreement (controllers, targets, domains, features, and conditions), whereas the third principle applies uniquely to targets. The principles and criteria are determined a priori (see Corbett 2007: 8f., for discussion).

As an instance of the type of logic involved, let us consider Principle I above. Imagine two contrasting situations, both involving a set of inflectional markers on a verb. In one the inflectional markers always provide new information, information that cannot be extracted from elsewhere in the clause. We might argue that this information must be determined by non-syntactic considerations (probably semantics), and there would then be no need to have recourse to a rule of agreement. In the second situation, the inflectional marker is always redundant: the information is available elsewhere in the clause. In this circumstance, we would have a strong argument in favour of a rule of agreement, to account for the systematic covariance between the verb form and the source of the information elsewhere in the clause. Since it is the latter circumstance which requires recourse to agreement, this is the more canonical situation for agreement. It does not follow, of course, that agreement is always redundant. The point is simply that the canonical, best, indisputable instances of agreement show redundancy. It is also important to bear in mind that the canonical need not be highly frequent. In fact, it may be very infrequent among languages. (That is, it may be impossible to find a language which exemplifies the canonical ideal in every respect.)

We now use evidence from the database to determine the canonicity of the agreement system of Russian. It is important to note that the percentage figures given are calculated from the database.
In most cases they result from a comparison of the different types of controllers, targets, domains, and features recorded in the database. The promise of this approach is that it offers a way of addressing the problems of holistic statements about languages by quantifying constructions within the language in terms of counts of items from the database. These counts can be based on particular tables, such as the Controller table, or queries resulting from joining tables. The Surrey Database of Agreement is based on high-quality grammars and consultation with experts on the languages in question, and the counts therefore come near to the best current understanding of the possible elements involved in agreement for those languages. The approach we present here is not based on corpus counts. However, there is an analogy to be drawn between this undertaking, where the possible types of a particular element, such as controllers, are counted, and enterprises which are more familiar to linguists, such as counting the number of nouns in a lexicon belonging to a particular inflection class, or realizing a particular case in a certain way. Naturally, as with the analogous situation with inflectional classes, we can also envisage corpus-based extensions of this method whereby the count of possible types of controller in the database, for example, is compared with the count of the actual instances of these types within a corpus. While the percentage figures we give can only be an approximation, the method presented here is as valid as the analogous more familiar ones discussed. In our case study below, we will contrast principles I and III and we will see that Russian obeys principle III more than I.

4.3.1. Canonical agreement in Russian

Principle I: Canonical agreement is redundant rather than informative

As we saw earlier, principle I relates mainly to controllers and features found on the targets of agreement. We determine its canonicity on the basis of seven criteria, i.e. C-1, C-2, C-10, C-16, C-17, C-18, C-19 (as set out in Corbett 2006: 8–27).[12] For illustration purposes, we will only discuss the calculations of the canonical rating for C-1 and C-17 in detail. A summary of the results for all the criteria is given in Table 6.

C-1: controller present > controller absent (canonical rating: 82 %)

This criterion states that agreement constructions where the controller is present are more canonical than those where it is absent. In the Surrey Database of Agreement, the following controllers are distinguished for Russian: Noun, Noun Phrase, Personal Pronoun, Comitative Noun Phrase, Comitative Personal Pronoun, Conjoined Noun Phrases, Conjoined Personal Pronouns, Quantified Noun Phrase, Defective Controller (infinitive phrase/that clause), No Possible Controller and No Overt Controller.
The controller types Comitative Personal Pronoun and Conjoined Personal Pronouns have been introduced to capture instances such as my s toboj ‘you and I (lit. we with you)’ and my i ty ‘you and I (lit. we and you)’. Note however, that conjoined personal pronouns are extremely rare in Russian. Only a few instances were found in the Uppsala corpus (Lönngren 1993; Maier 1994). The Russian controllers given above are typically present. For example:

(4) ty           čitaeš´
    you.SG.NOM   read.2SG
    ‘you are reading’

[12] In the criteria, ‘>’ indicates ‘more canonical than’.

So Russian appears to be canonical according to this criterion, but it does also allow for subjectless sentences. Consider, for example, weather predicates where no overt controller is possible.

(5) sveta-l-o
    dawn-PAST-NEUT.SG
    ‘Day was dawning.’

Other examples of subjectless sentences can be found in sentences expressing a causal relationship. Compare the following pair:

(6) ego         zasypa-l-o           sneg-om
    he.SG.ACC   cover-PAST-NEUT.SG   snow-MASC.SG.INSTR

(7) ego         zasypa-l             sneg
    he.SG.ACC   cover-PAST-MASC.SG   snow-MASC.SG.NOM
    (Wierzbicka 1988: 224)

In (6) there is no overt controller of agreement on the verb. In (7), by contrast, the noun ‘snow’ controls agreement on the verb. Both sentences express the same relationship, but there is a difference in emphasis according to Wierzbicka. In example (6), where what is the subject in (7) is instead expressed by an instrumental, the apparent causer of the event (the snow) is seen by the speaker as an instrument of another unspecifiable power. Subjectless sentences can also be used in indirect speech when the embedded subject is the same as the secondary speaker. For example:

(8) ona          skaza-l-a,         čto    razdenetsja                sama
    she.SG.NOM   say-PAST-FEM.SG    that   undress.PF.FUT.3.SG.REFL   self.FEM.SG
    ‘She said that (she) would undress by herself.’ (Timberlake 1993: 871)

This use of zero anaphora in the embedded clause is expected and the use of an overt pronoun as subject of the verb ‘undress’ might lead to the interpretation that there is non-identity of speaker and subject of the reported act.

Zero anaphora can also occur in connected speech, as long as the referent is uniquely identifiable.

(9) Ol´ga          Ivanovna          ne    ljubi-l-a          dumat´     o
    Ol’ga.SG.NOM   Ivanovna.SG.NOM   not   love-PAST-FEM.SG   to.think   about
    neprijatnom              i     počti    nikogda   ne    duma-l-a.
    unpleasant.NEUT.SG.LOC   and   almost   never     not   think-PAST-FEM.SG
    Izbega-l-a          razgovor-ov           o       bolezn-jax, …
    avoid-PAST-FEM.SG   conversation-PL.GEN   about   illness-PL.LOC
    ‘Olga Ivanovna did not like to think about anything unpleasant and almost never thought. (She) avoided conversation about illness, ....’ (Timberlake 1993: 871f.)

Let us now illustrate how we calculated the canonical rating of 82 % for this criterion. This figure is obtained by dividing the types of controllers into two categories, controller present and controller absent, as is illustrated in Table 5.

Table 5. Controller types for Russian in the database

Controller Present             Controller Absent
Comitative NP                  No Possible Controller
Conjoined NPs                  No Overt Controller
Noun
NP
Personal Pronoun
Conjoined Personal Pronouns
Comitative Personal Pronoun
Quantified Noun Phrase
Defective Controller

Of the 11 types of controllers that are distinguished for Russian, there are two where controllers are absent: No Possible Controller and No Overt Controller. This is 2/11, which is 18 % non-canonical, i.e. 82 % canonical.[13]
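The arithmetic behind the C-1 rating can be reproduced in a few lines. This is a minimal sketch, not the authors' own procedure: the dictionary `RUSSIAN_CONTROLLERS` and the function `canonical_rating` are hypothetical names, and the True/False flags simply restate the two columns of Table 5.

```python
# Controller types distinguished for Russian, flagged by whether the
# controller is present (True) or absent (False), following Table 5.
RUSSIAN_CONTROLLERS = {
    "Noun": True,
    "Noun Phrase": True,
    "Personal Pronoun": True,
    "Comitative Noun Phrase": True,
    "Comitative Personal Pronoun": True,
    "Conjoined Noun Phrases": True,
    "Conjoined Personal Pronouns": True,
    "Quantified Noun Phrase": True,
    "Defective Controller": True,
    "No Possible Controller": False,
    "No Overt Controller": False,
}


def canonical_rating(type_flags):
    """Percentage of types that count as canonical for a criterion."""
    canonical = sum(type_flags.values())  # True counts as 1
    return 100 * canonical / len(type_flags)


c1 = canonical_rating(RUSSIAN_CONTROLLERS)
print(f"C-1: {c1:.0f} % canonical")  # prints "C-1: 82 % canonical"
```

Each controller type counts once, whatever its frequency in actual texts, which is why the result is a rating over types and not a corpus statistic.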

[13] The next logical step would be to consider a weighting of the criteria. However, as the calculations of the individual criteria are approximations, we feel that this step should not be taken yet.

C-17: feature is lexical > non-lexical (canonical rating: 89 %)

Agreement in lexical features is considered more canonical than agreement in non-lexical features. By a lexical feature we mean a feature for which the target cannot be marked independently. The following feature-value pairs are distinguished for Russian: person (1, 2, 3), number (singular, plural), gender (masculine, feminine, neuter), animacy subgender (animate, inanimate), and case (nominative, genitive, dative, accusative, locative, instrumental). Of those, gender, number, person, and animacy are lexical features in Russian. Animacy forms a subgender. Agreeing modifiers differ for animate and inanimate nouns when in the accusative case. For animates, the accusative form is syncretic with the genitive form (e.g. čeloveka); for inanimates, the accusative form is syncretic with the nominative form (e.g. stol).

(10) Ja         vižu            ètot                    stol
     I.SG.NOM   see.1.SG.PRES   this.MASC.SG.ACC.INAN   table.INAN.MASC.SG.ACC
     ‘I see this table.’

(11) Ja         vižu            èt-ogo                  čelovek-a
     I.SG.NOM   see.1.SG.PRES   this-MASC.SG.ACC.ANIM   man.ANIM.MASC-SG.ACC
     ‘I see this man.’

Traditional accounts of Russian also include agreement in case. However, case is not a feature of the noun; it is imposed on the noun phrase by government by some other syntactic element, and is therefore less canonical than agreement in gender, number, person, and animacy. To calculate the canonical rating for this criterion, we consider the domain/(target)category combinations in the database. There are 18 unique domain/(target)category combinations for Russian. Of these, two involve case, which is non-lexical. This is 11 % (2/18). Therefore we estimate that the canonical rating for C-17 is 89 %.

Conclusion for Principle I: Table 6 summarizes the canonical ratings for the seven criteria that are relevant for principle I. Averaging the percentages for these seven criteria we obtain a canonical rating of 86 %.

Table 6. Canonical ratings for criteria defining principle I

Criterion                                                                                                         Canonical rating
C-1    controller present > controller absent                                                                     82 %
C-2    controller has overt expression of agreement features > controller has covert expression of agreement features   64 %
C-10   doubling > independent                                                                                     100 %
C-16   domain is one of set > single domain                                                                       100 %
C-17   feature is lexical > non-lexical                                                                           89 %
C-18   features have matching values > non-matching values                                                        79 %
C-19   no choice of feature value > choice of value                                                               89 %
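The 86 % figure for principle I is an unweighted mean of the seven ratings in Table 6. The sketch below reproduces it and leaves room for a weighting of the criteria, which the chapter deliberately does not apply. The function name `weighted_rating` and its `weights` parameter are hypothetical; with no weights supplied, every criterion counts equally, as in the text.

```python
# Canonical ratings (in percent) for the criteria defining principle I (Table 6).
PRINCIPLE_I = {
    "C-1": 82,
    "C-2": 64,
    "C-10": 100,
    "C-16": 100,
    "C-17": 89,
    "C-18": 79,
    "C-19": 89,
}


def weighted_rating(ratings, weights=None):
    """Average the ratings; with no weights given, each criterion counts equally."""
    if weights is None:
        weights = {name: 1.0 for name in ratings}
    total_weight = sum(weights[name] for name in ratings)
    weighted_sum = sum(ratings[name] * weights[name] for name in ratings)
    return weighted_sum / total_weight


principle_i = weighted_rating(PRINCIPLE_I)
print(f"Principle I: {principle_i:.0f} %")  # prints "Principle I: 86 %"
```

Any non-uniform weights passed to the function would be an analytical choice of the user and would need its own justification.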

Principle III: The closer the expression of agreement is to canonical (i.e. affixal) inflectional morphology, the more canonical it is as agreement

This principle is a generalization of criteria numbers 5, 6, 7, 8, and 9 (as set out in Corbett 2006: 8–27) which all relate to targets. We will discuss criteria 5, 7, and 8 in detail. In the database, the following targets are distinguished for Russian: Adjective, Demonstrative Determiner, Finite Verb, Numeral_1, Numeral_2, Numeral_3/4, Personal Pronoun, Possessive Pronoun, Relative Pronoun, Predicate Adjective LF (Long Form), Predicate Adjective SF (Short Form), and Special Pronominal Adjectives. The target type Special Pronominal Adjectives has been included to deal with instances of agreement with sam ‘oneself, alone’ and odin ‘alone’. We will see that targets form the area where Russian appears to be highly canonical. Bound marking is considered to be more canonical than free marking and this is further revised by the following cline of canonicity:

C-5’: inflectional marking (affix) > clitic > free word (canonical rating: 100 %)

There appear to be no free words in Russian which have purely an agreement function. If one considers the criteria of Lehmann (1982: 234) the personal pronoun still has a definiteness or referential function and so is not ‘pure’ agreement. In terms of its position as a Slavonic language, Russian does not perform that well in the clitic stakes. The shorter forms of the accusative, genitive and dative which had an enclitic function (Kiparsky 1967: 133f.) have all but disappeared from the contemporary language. Kiparsky (1967: 133) cites Gadolina (1963: 94) with regard to the colloquial short form te, saying this is an innovation.

(12) ja         te           dam              po      baške
     I.SG.NOM   you.SG.DAT   will.give.1.SG   along   head.FEM.SG.DAT
     ‘I’ll knock your block off.’ (Kiparsky 1967: 133)

This would, of course, be an instance of a non-canonical target, for an agreement domain which Russian would otherwise not have. Forms such as te do not occur in doubling constructions and can be ruled out for the purposes of criterion C-5’. Russian is therefore highly canonical in terms of the degree of morphological bonding in its agreement, relying on affixal marking.

C-7: regular > suppletive (canonical rating: 99 %)

Although Russian has some very interesting examples of suppletion, it does not have any examples of suppletive agreement targets. Or rather, Russian does not have agreement targets which are suppletive in terms of agreement features. Suppletion behaves as one would expect in Russian, occurring in instances of inherent inflection (see Booij 1996 for a definition of this term). Hence the verb idti ‘to go’ has a suppletive past tense šel / šla / šlo / šli and the noun čelovek ‘person’ has a suppletive plural form ljudi. But, as noted, these items are either not agreement targets (the noun) or the feature value in question is not an agreement feature for Russian (tense on the verb).

Criterion C-7, however, can be interpreted as a cline of regularity in which the regular and suppletive points are just ends of the scale. In this case there are examples of less regular marking of agreement. We can use the criteria for irregularity in Russian established in Corbett et al. (2001) to do this. For instance, irregularity in the stem ranks above irregularity in inflection. A number of verbs have consonantal alternations in the stem, but these are typically associated with the present tense as a whole, again a non-agreement feature. In the second conjugation, however, we find that the stem of the first person singular in the present may differ in form from that of the rest of the present tense. Perhaps it is significant that where Russian begins, albeit weakly, to approach non-canonicity with regard to criterion C-7, we find this in the most clearcut pronominal area, namely first person singular.

Table 7. Second conjugation verb ‘to see’

Features     Videt’ ‘to see’   Gloss
1.SG.PRES    viž-u             ‘I see’
2.SG.PRES    vid-iš            ‘you (sg) see’
3.SG.PRES    vid-it            ‘he/she/it sees’
1.PL.PRES    vid-im            ‘we see’
2.PL.PRES    vid-ite           ‘you (pl) see’
3.PL.PRES    vid-jat           ‘they see’

In addition to this, there may also be prosodic irregularities, which, if one considers agreement features, again tend to pick out the first person singular. Finally, in the remnants of the athematic verbs in Russian we find a split between singular and plural, where the older athematic forms remain in the singular, but the verb has adopted the regular second conjugation endings in the plural.[14] If one considers possessive adjectives of the rare otcov ‘father’s’ type, then it could be argued that these exhibit a kind of irregularity in that they mix noun-like inflection with typical adjectival inflection. Specifically, they use the noun forms in the masculine and neuter genitive and dative singular. Hence, this is a marginal example of gender involvement in irregularity of marking. But this is already a long way from suppletion. With regard to criterion C-7 Russian is basically canonical, with a number of marginal examples when one sees the criterion in terms of a cline of possibilities.

C-8: alliterative > opaque (canonical rating: ± 50 %)[15]

This is where Russian tends to be less than canonical. We have examples such as that in (13).

[14] We leave aside the question of the choice of first conjugation ending for the third person plural of the perfective verb ‘to give’.

[15] In order to calculate this figure we took the 10 target types in Russian where we find phonological identity and an approximate proportion of the typical paradigm which could involve alliteration. We then averaged the proportion to obtain the figure.

(13) Èt-a              ženščin-a          skaza-l-a.
     this-FEM.SG.NOM   woman.FEM-SG.NOM   said-PAST-FEM.SG
     ‘This woman said.’

Hence, we have phonological identity of the markers in this example. The complication here is that the marker on the noun, by virtue of its position within the paradigm, marks case. This phonological identity, accompanied by modification of morphosyntactic content as we change word class, is not found throughout, however:

(14) Èt-a              doč′                  skaza-l-a.
     this-FEM.SG.NOM   daughter.FEM-SG.NOM   said-PAST-FEM.SG
     ‘This daughter said.’

Examples of full alliteration are rare in the oblique cases. For example, in the genitive singular there is no alliteration between the adjective and head noun, as shown by (15).

(15) molod-ogo           mal′čik-a
     young-MASC.SG.GEN   boy.MASC-SG.GEN
     ‘… of the young boy…’

Elsewhere there may be partial, but not full, alliteration. For example, the oblique cases in the plural have partial alliteration between the noun and adjective. In (16) the initial vowel of the inflectional affix contrasts between the noun and the agreeing adjective.

(16) molod-ymi        mal′čik-ami
     young-PL.INSTR   boy.MASC-PL.INSTR
     ‘… with/by the young boys …’

The lack of total alliteration for agreement with nouns in oblique cases, together with examples in the core cases, suggests that Russian should be treated as somewhere between the two ends of this particular scale. Accordingly we treat Russian as approximately 50 % canonical with regard to criterion 8.

Conclusion for principle III: Table 8 summarizes the canonical ratings for the target criteria that are relevant for principle III.

Table 8. Canonical ratings for criteria defining principle III

Criterion                                                 Canonical rating
C-5’   inflectional marking (affix) > clitic > free word   100 %
C-6    obligatory > optional                               100 %
C-7    regular > suppletive                                99 %
C-8    alliterative > opaque                               50 %
C-9    productive > sporadic                               99 %

The target criteria discussed show that Russian obeys principle III in that its expression of agreement is affixal. Averaging the percentages of the 5 criteria results in an approximate canonical rating of 90 %.
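The comparison between the two principles can be made explicit by averaging each table and ranking the results. The following is a rough sketch: the names `RATINGS` and `principle_scores` are hypothetical, and the values are copied from Tables 6 and 8; the ranking logic is only one simple way of expressing the comparison.

```python
from statistics import mean

# Canonical ratings (in percent) copied from Tables 6 and 8.
RATINGS = {
    "Principle I": {
        "C-1": 82, "C-2": 64, "C-10": 100, "C-16": 100,
        "C-17": 89, "C-18": 79, "C-19": 89,
    },
    "Principle III": {
        "C-5'": 100, "C-6": 100, "C-7": 99, "C-8": 50, "C-9": 99,
    },
}


def principle_scores(ratings):
    """Unweighted mean rating for each principle."""
    return {principle: mean(criteria.values())
            for principle, criteria in ratings.items()}


scores = principle_scores(RATINGS)
# Order the principles from most to least canonical.
ranking = sorted(scores, key=scores.get, reverse=True)
for principle in ranking:
    print(f"{principle}: {scores[principle]:.1f} %")
# Principle III: 89.6 %
# Principle I: 86.1 %
```

The gap between the two means is small, and the approximate rating for C-8 has a large influence on the principle III figure, so the ordering should be read as indicative.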

4.3.2. Summary

If there is enough detail in the database it is possible to use it to give an overall characterization of the extent to which a language can be shown to have a particular phenomenon. In this section we considered the canonicity of agreement in Russian by comparing its controller types, target types, domain types, features and conditions against the relevant criteria defined by Corbett (2006). On the basis of our analysis we conclude that of the general principles of canonical agreement given in Corbett (2006), Russian obeys principle III more than principle I. So far, we have only applied this novel method to Russian, but it would certainly be possible to use it to characterize other languages. One possible caveat is that the data available for the majority of languages is not as great as for Russian. However, the beauty of this approach is that it also suggests what to look for and can lead to further refinement, if required.

5. Conclusion

When databases are created to study particular phenomena it is important that the researchers define those phenomena explicitly. This will facilitate the task of determining the database structure, and it will also help guarantee that any cross-linguistic investigation involves comparison of similar things. We saw how important this was in our discussion of the Surrey Syncretisms Database. Also, explicitness and consistency should not rule out inclusion of data which others would consider an instance of the phenomenon in question. The canonical approach to typology (for example, Corbett 2006) provides a way round this, as we can determine what are more or less canonical instances of a phenomenon. We also showed how the Surrey Database of Agreement could be used to create an approximate measure of how close a language comes to being canonical.

Our original motivation for creating the databases discussed here was to have detailed information on a small sample of languages for the purpose of our research on syncretism, suppletion and agreement. In the case of the short-term morphosyntactic change project the aim was to develop a detailed look at historical change in one language. These databases could therefore be seen as just a research tool toward that end. However, the design and development of a database involves much more than that. There is a level of theorising required. Furthermore, the creation of such databases allows us to derive typological sketches of a language, as illustrated for the Russian system of agreement in §4.3, which could otherwise not be easily created or quantified. And as the databases have links to the original examples, it is possible to determine how the analyses were arrived at.

Acknowledgements

The research reported here has been supported in part by the UK’s Economic and Social Research Council (grants R000237939, R000238228 and RES-062-23-0696) and the Arts and Humanities Research Council (grants RG/AN4375/APN18306 and B/RG/AN4375/APN10619). We have benefited from on-going discussions within the Surrey Morphology Group, in particular with Matthew Baerman. We would also like to thank David Bowers and Roger Gentry for help with the design and implementation of the Surrey Syncretisms Database and Surrey Database of Agreement. Harley Quilliam is to be thanked for his work on making the Surrey databases available over the web. We presented our work on using the Surrey Database of Agreement to calibrate Russian in terms of canonical agreement to the Workshop on Agreement held at the LAGB’s autumn meeting in 2002 in Manchester and would like to thank participants there for comments. We thank Martin Everaert and Simon Musgrave, as well as the anonymous reviewers for this volume, for further helpful comments. Links to the different databases discussed in this chapter can be found at www.smg.surrey.ac.uk.

References

Aronoff, Mark 1994 Morphology by Itself: Stems and Inflectional Classes. Cambridge, MA: MIT Press.
Baerman, Matthew and Dunstan Brown 2005a Case syncretism. In World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 118–121. Oxford: Oxford University Press.
Baerman, Matthew and Dunstan Brown 2005b Verbal person/number syncretism. In World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 122–125. Oxford: Oxford University Press.
Baerman, Matthew, Dunstan Brown, and Greville Corbett 2005 The Syntax-Morphology Interface: A Study of Syncretism. Cambridge: Cambridge University Press.
Bird, Stephen and Gary Simons 2003 Seven dimensions of portability for language documentation and description. Language 79: 557–582.
Booij, Geert 1996 Inherent versus contextual inflection and the split morphology hypothesis. In Yearbook of Morphology 1995, Geert Booij and Jaap van Marle (eds.), 1–16. Dordrecht: Kluwer.
Brown, Dunstan 2001 Constructing a typological database for inflectional morphology: the SMG database for syncretism. In Proceedings of the IRCS / National Science Foundation Workshop on Linguistic Databases. Institute for Research in Cognitive Science, University of Pennsylvania, December 11–13, 2001, Stephen Bird, Peter Buneman, and Mark Liberman (eds.), 56–64. [Available at .]
Carstairs, Andrew 1984 Outlines of a constraint on syncretism. Folia Linguistica 18: 73–85.
Carstairs, Andrew 1987 Allomorphy in Inflexion. London: Croom Helm.
Carstairs-McCarthy, Andrew 1992 Current Morphology. London: Routledge.
Comrie, Bernard 1991 Form and function in identifying cases. In Paradigms. The Economy of Inflection, Frans Plank (ed.), 41–55. Berlin/New York: Mouton de Gruyter.
Corbett, Greville G. 2003 Agreement: The range of the phenomenon and the principles of the Surrey Database of Agreement. In Agreement: A Typological Perspective (special number of Transactions of the Philological Society 101, no. 2), Dunstan Brown, Greville G. Corbett and Carole Tiberius (eds.), 155–202.
Corbett, Greville G. 2006 Agreement. Cambridge: Cambridge University Press.
Corbett, Greville G. 2007 Canonical typology, suppletion and possible words. Language 83: 8–42.
Corbett, Greville G., Andrew Hippisley, Dunstan Brown, and Paul Marriott 2001 Frequency, regularity and the paradigm: a perspective from Russian on a complex relation. In Frequency and the Emergence of Linguistic Structure, Joan Bybee and Paul Hopper (eds.), 201–226. Amsterdam: John Benjamins.
Corbett, Greville G. and Alfred D. Mtenje 1987 Gender agreement in Chicheŵa. Studies in African Linguistics 18 (1): 1–38.
Evans, Nicholas, Dunstan Brown, and Greville Corbett 2001 Dalabon pronominal prefixes and the typology of syncretism: a Network Morphology analysis. In Yearbook of Morphology 2000, Jaap van Marle and Geert Booij (eds.), 187–231. Dordrecht: Kluwer.
Evans, Roger and Gerald Gazdar 1996 DATR: A language for lexical knowledge representation. Computational Linguistics 22: 167–216.
Greenberg, Joseph H. 1963 Some universals of grammar with particular reference to the order of meaningful elements. In Joseph H. Greenberg (ed.), Universals of Language, 73–113. Cambridge, MA: MIT Press. [Paperback edition published 1966.]
Gadolina, Margarita Anatol’jevna 1963 Istorija form ličnyx i vozvratnogo mestoimenij v slavjanskix jazykax [History of the forms of the personal and reflexive pronouns in the Slavonic languages]. Moscow: AN SSSR, Institut Slavjanovedenija.
Hippisley, Andrew, Marina Chumakina, Greville G. Corbett, and Dunstan Brown 2004 Suppletion: frequency, categories and distribution of stems. Studies in Language 28: 387–418.
Hjelmslev, Louis 1943 Omkring sprogteoriens grundlæggelse [Toward the foundations of a theory of language]. Copenhagen: Ejnar Munksgaard.
Hjelmslev, Louis 1961 Prolegomena to a Theory of Language. Madison, WI: University of Wisconsin Press. [Translation of Hjelmslev 1943.]
Kiparsky, Valentin 1967 Russische historische Grammatik. Band II: Die Entwicklung des Formensystems. Heidelberg: Carl Winter.
Lascarides, Alex and Ann Copestake 1999 Default representation in constraint-based frameworks. Computational Linguistics 25: 55–105.

152 Brown, Tiberius, Chumakina, Corbett and Krasovitsky Lehmann, Christian 1982 Universal and typological aspects of agreement. In Apprehension: Das sprachliche Erfassen von Gegenständen, II: Die Techniken und ihr Zusammenhang in Einzelsprachen, Hansjakob Seiler and Franz Josef Stachowiak (eds.), 201–267. Tübingen: Narr. Lönngren, Lennart 1993 Častotnyj slovar´ sovremennogo russkogo jazyka [Frequency dictionary of the contemporary Russian language]. (Acta Universitatis Upsaliensis, Studia Slavica Upsaliensis 32.) University of Uppsala: Uppsala. Maier, Ingrid 1994 Review of Lennart Lönngren (ed.) ‘Častotnyj slovar´ sovremennogo russkogo jazyka’. Rusistika segodnja 1: 130 –136. Matthews, Peter H. 1997 The Concise Oxford Dictionary of Linguistics. Oxford: Oxford University Press. Mel’čuk, Igor 1994 Suppletion: Toward a logical analysis of the concept. Studies in Language 18: 339–410. Noyer, Rolf 1997 Features, Positions and Affixes in Autonomous Morphological Structure. New York: Garland. Perlmutter, David M., and Orešnik, Janez 1973 Language-particular rules and explanation in syntax. In A Festschrift for Morris Halle, Stephen R. Anderson and Paul Kiparsky (eds.), 419–59. New York: Holt, Rinehart and Winston. Siewierska, Anna 1999 From anaphoric pronoun to grammatical agreement marker: Why objects don’t make it. In Folia Linguistica 33 (2), Greville G. Corbett (ed.), Special Issue on Agreement: 225–251. Stump, Gregory T. 1993 On Rules of referral. Language 69: 449–79. 2001 Inflectional Morphology: A Theory of Paradigm Structure. Cambridge: Cambridge University Press. Tiberius, Carole, Dunstan Brown, and Greville Corbett 2002 A typological database of agreement. In Proceedings of LREC2002, The Third International Conference on Language Resources and Evaluation, Vol. VI, Manuel González Rodríguez and Carmen Paz Suarez Araujo (eds.), 1843–1846. Las Palmas, Spain. Timberlake, Alan 1993 Russian. In The Slavonic Languages, Bernard Comrie and Greville G. 
Corbett (eds.), 827–886. London /New York: Routledge.

How to integrate databases without starting a typology war: The Typological Database System

Alexis Dimitriadis, Menzo Windhouwer, Adam Saulwick, Rob Goedemans and Tamás Bíró

1. Introduction

The Typological Database System (henceforth TDS)[1] is a web-based service that provides integrated access to a collection of independently created typological databases. Thus it is not an original data collection, but an interface to the data contained in its component databases. The TDS web server can be accessed at the URL http://languagelink.let.uu.nl/tds/.

[1] The TDS Project is being carried out by a research group of the Netherlands Graduate School of Linguistics (LOT), with members from the University of Amsterdam, Leiden University, Radboud University Nijmegen, and Utrecht University: Tamás Bíró (linguistic design and database integration), Alexis Dimitriadis (project manager), Rob Goedemans (database integration and phonology domain expert), Ruth Lind (intern), Adam Saulwick (ontology developer, typologist and database integration), Eugenie Stapert (student assistant), Franca Wesseling (student assistant), Menzo Windhouwer (software system designer and developer). The TDS Project gratefully acknowledges the financial support of the Netherlands Organization for Scientific Research (NWO). Presentations of various aspects of the TDS have been given at the following conferences: E-MELD 2005 (Dimitriadis et al. 2005), DISWeb 2005 (Saulwick et al. 2005), DGfS 2006 (Saulwick et al. 2006), and elsewhere. We thank the audiences at these conferences for useful feedback and discussions which have contributed to improving the content and presentation of this chapter. We remain responsible for any deficiencies.

The main challenges in developing the TDS were not, as one might perhaps imagine, due to the technical problem of combining data residing in different software platforms; in fact this poses only minor obstacles. The real difficulties arise from (a) the very large total number of descriptive parameters included in the aggregated databases, and (b) the differences in structure, terminology, and theoretical assumptions among component databases. The typical typological database contains a very large number of data fields (parameters), often several hundred, about a large number of languages (again in the hundreds). The component databases were created independently of each other, and reflect a focus on diverse aspects of languages, research questions, and theoretical backgrounds.

The TDS Project’s approach to data management is central to addressing both challenges. The diversity and sheer quantity of the data make it impossible to deal with differences through some sort of “consensus” representation; indeed, for reasons that will be detailed below, we consider such a goal to be not only unattainable but flawed, with the potential to distort the real content of the data. Instead, the TDS aims to represent the diverse theoretical perspectives faithfully. Purely notational differences are resolved (e.g., if different databases use the values ‘yes’ and ‘+’ to mean the same thing, these can be converted to the single value ‘yes’); but for the rest, the TDS focuses on presenting the contents of the component databases as accurately as practicable. This means that in many cases, divergent classifications or analyses will be presented side by side.

While the total number of database parameters is large, the TDS is not a large-scale data integration project. Approximately a dozen databases are integrated in the initial phase of the Project, and the eventual size of the archive will be in the dozens rather than hundreds or thousands of databases. Hence it was possible to focus on integrating the semantics and encoding of a particular (and progressively extended) set of databases. This is a labour-intensive process, but is justified by the richness of information represented by the data contained in each component database. Such data have been collected one language and one parameter at a time, often following laborious study of the relevant sources.
The complex decision-making process involved in creating each component database is reflected in its presentation through the TDS.

In order to be included in the system, the data in each component database are restructured and recoded to the extent that this is possible without loss of information, and organized so as to produce an integrated whole with consistent structuring conventions (as far as this is possible given the diversity of the data). This permits the aggregated data to be effectively navigated. The user interface, data structuring and integration process are supported by an ontology of linguistic concepts developed by the Project for this purpose.

The TDS interface allows users to search for fields of interest, which are then used for querying the data. Searching is thus a two-step process: In the “pre-query” step, the user discovers fields relevant to topics of interest, by using one of several search and browsing options in the TDS interface. Selected fields are accumulated in a sort of shopping basket. In the second step, the collected fields are used to construct and execute a query.

In the next two sections, we introduce the goals of the system and the challenges it confronted. Section 4 presents an overview of the TDS. Sections 5, 6 and 7 describe the system in more detail, focusing first on the knowledge architecture (section 5) and then on the software implementation and user interface. Sections 8 and 9 focus on the practicalities of database integration, as carried out by TDS members. Finally, we close with some discussion and conclusions.
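The two-step search described in this section can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not the TDS implementation: the database names (db_one, db_two), the field names, the concept labels and the toy values are all hypothetical. Step 1 finds fields by their documentation or by linked concepts and collects them in a basket; step 2 queries only the collected fields and labels every value with its source database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One searchable field of a component database (all names hypothetical)."""
    database: str        # provenance: which component database defines the field
    name: str            # field identifier inside that database
    description: str     # database-specific documentation text
    concepts: frozenset  # linked ontology concepts


CATALOGUE = [
    Field("db_one", "has_agreement", "Verb agrees with its subject",
          frozenset({"agreement", "subject"})),
    Field("db_two", "agr_controller", "Arguments that control person marking",
          frozenset({"agreement"})),
    Field("db_two", "pred_position", "Position of the predicate in the clause",
          frozenset({"word order"})),
]

# Toy values, keyed by (database, field) and then by a made-up language label.
DATA = {
    ("db_one", "has_agreement"): {"lang1": "yes", "lang2": "no"},
    ("db_two", "agr_controller"): {"lang1": "S, A"},
}


def pre_query(term):
    """Step 1: find fields whose documentation or linked concepts mention the term."""
    term = term.lower()
    return [f for f in CATALOGUE
            if term in f.description.lower() or term in f.concepts]


def run_query(basket):
    """Step 2: collect values for the selected fields, labelled by source database."""
    rows = {}
    for f in basket:
        for lang, value in DATA.get((f.database, f.name), {}).items():
            rows.setdefault(lang, {})[f"{f.database}.{f.name}"] = value
    return rows


basket = pre_query("agreement")   # the "shopping basket" of selected fields
result = run_query(basket)
```

In this sketch both agreement fields are found through their concept links, although neither description contains the search term itself. Keeping the database name in every result column is one simple way to keep provenance visible.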

2. Goals of the system

The ubiquity of the internet, which allows data to be shared across large distances efficiently and practically for free, has gradually brought about a development of great importance to typologists: While collections of typological data were once considered a personal tool for the exclusive use of the researchers who compile them, researchers are increasingly coming to view them as resources that can, and therefore should, benefit the academic community. A growing number of typological databases, many originally created for personal or small-group use, are now being made publicly available.

But the growth in the number of such databases, welcome as it is, comes at a cost. As more and more typological databases become accessible to users other than their creators, colleagues, and others already familiar with them, the task of managing and using the information becomes more difficult. We can identify the following kinds of problems facing a linguist who looks for typological information on the web:

1. Resource discovery. This is simply the step of finding a data source with information on some topic of interest.

2. Correct and effective use. As already mentioned, databases use varying terminology, notation, organization of the data, and search commands. Even if these are documented in detail, they can be quite difficult for a new user to assimilate and employ properly; and proper documentation is not always available.

3. Efficiency of resource utilization. As the amount of online information grows, the time and effort involved in searching databases one by one and collating the results becomes an obstacle to their efficient utilization.

The problem of resource discovery is being addressed by language archives, which collect a large number of linguistic resources in one place (hopefully well-known or easy to find), and by numerous initiatives that are developing improved methods for resource description and discovery. Generally these include enriched standards for resource description, which can be utilized by existing or new tools.[2] The TDS does not directly address this problem, except by providing a (small) number of databases collected in one place.

[2] Metadata-oriented initiatives include the Dublin Core Metadata Initiative (DCMI, at http://dublincore.org/), the Open Archives Initiative (OAI, at http://www.openarchives.org/), and the linguistics-specific initiatives of the Open Language Archives Community (OLAC, at http://www.language-archives.org/) and the International Standards in Language Engineering (ISLE, at http://www.ilc.cnr.it/EAGLES/isle/). A different approach to resource discovery, based not on metadata but on sophisticated pattern matching, is followed by the Online Database of Interlinear Text (ODIN, at http://www.csufresno.edu/odin/; Lewis 2006).

The TDS directly addresses the second and third tasks. The goals of the system are (a) to provide an interface that will help users find relevant data, and (b) to enable users to interpret the data they are presented with. The TDS interface allows speedy combined searches over the data in its collection of typological databases, from a single user interface and using (as much as possible) consistent terminology and encoding conventions. The results are fully aggregated across databases, and can be displayed in a number of formats, including (so far) two export formats: an XML (Extensible Markup Language) document or a single table in comma-separated-values (CSV) format. Interpretation of the data is aided by presenting documentation for each database field, supported by documentation on any linked linguistic concepts in the TDS global ontology. Since data in the TDS are often separated from their original context, in all cases the interface must present the provenance of the data along with its database-specific description; this allows users to properly evaluate the information retrieved.

3. The problem

In creating a single data resource from the collection of typological databases, the TDS Project needed to address the differences among them. These are of several kinds:

1. Different types of content. So-called “analytical” typological databases consist of logical variables describing each language as a whole; for example, “language X has/does not have subject-verb agreement.” Other databases contain example sentences with detailed annotations (“sentence databases”), or a combination of both types of information. The Project attempts to integrate different types of content so that, for example, a single query can search both examples and logical variables for relevant information.[3]

2. Different theoretical commitments. There exists, of course, no universally accepted and descriptively exhaustive linguistic theory. Thus, the information in each database reflects the analytical and theoretical commitments of its creators. For example, the notions Subject and Object are used in several different ways by linguists, and there are alternative ways of expressing structural relationships, e.g., the S/A/P/R categorization as recently applied in Haspelmath (2005).[4] Conversions between theories, it turns out, cannot be automated with reliability; and database creators do not want their theoretical commitments to be misrepresented. Hence the TDS places a high priority on preserving and presenting to the user the framework of database-specific assumptions needed to properly interpret the data extracted from a component database. Such information will allow knowledgeable users to recognize both the descriptive content and the theoretical commitments of a statement, regardless of whether it matches their own theoretical orientation. It can be beneficial for users to view information even if it is not expressed in terms of their own theoretical framework. For example, information about properties of “subjects” can be useful even to those who do not believe that this is a typologically sound notion.

3. Constructed for different purposes. The focus and detail of coverage of individual databases vary depending on the creators’ own research interests, even where there are no theoretical disparities (or no significant ones). Such variability can lead to significant differences in the structure, content, and degree of detail in conceptually similar data.

[3] The system includes both kinds of data, but such cross-type searches are only partially supported at time of writing.

[4] This example is discussed in more detail in the following section.

4. Different notational conventions. In many cases the different databases use equivalent, or near-equivalent, ways of describing data. An obvious example is the use of different glossing labels for broadly accepted linguistic categories, such as “p” or “pl” for Plural. It is generally easy to reconcile purely notational differences, but the databases can also differ in the details of how such concepts are defined and applied. It is thus necessary to distinguish notational variation from theoretically important differences, and to resolve differences with respect to the former but not the latter.

5. Different design choices. There are a myriad of ways to organize a body of information into a database, and the component databases therefore differ markedly in their structure. As this source of variation is compensated for, it becomes easier to address the more troublesome types already discussed. Consistent design choices also make it easier for the end user to gain an understanding of the data, compared to dealing with multiple conventions that must each be learned separately.

6. Different software. The TDS component databases were developed with, or for, a variety of database management systems (DBMSs) that currently include Microsoft SQL Server, MySQL, Microsoft Access, Excel, 4th Dimension, and custom-made database software. Their origins in different software environments (operating systems, fonts, storage formats, etc.) introduce additional complications. The TDS uses a plug-in architecture to import data from this plethora of formats; fortunately, it has been possible to find interface modules or exchange formats for each database included. Occasional relatively small problems (such as font-encoding glitches) can require manual intervention in the form of ad hoc fine-tuning scripts, but these are quickly resolved and only need to be addressed once per database.
In practice, dealing with this kind of variation does not pose great difficulty.

These sources of variation must be dealt with in different ways. The TDS approach distinguishes between variation in structure or encoding, which is judged to be a design choice of no inherent linguistic significance, and variation in the choice of linguistic terms and (especially) categories and distinctions. Broadly, we can speak of differences in encoding and differences in meaning (semantics). While the metadata initiatives mentioned earlier might one day lead to more uniformity in structure and encoding among databases, they will have no effect on the divergence of theoretical viewpoints and research traditions that constitutes the most intractable source of heterogeneity. These diverse viewpoints are not only dearly held by their practitioners: They are the subject matter and outcome of linguistic analysis, and cannot (indeed, should not) be replaced by any uniform, agreed-upon framework.

During the early stages of the Project, consultation with the community of database creators and prospective users established that preservation of the specific claims made by the creators of the component databases is of the utmost importance. Full harmonization of the collected data into a common form would lead to unacceptable distortion in the accuracy or precision of the data. Conversely, a multiplicity of alternative models and classification schemes for data originating in different databases is acceptable, as long as the model applicable in each case is made explicit.

Accordingly, the TDS approach is as follows: encoding differences are compensated for wherever possible, by transforming the source data to adhere to, or at least be relatable to, a uniform design (“object model”). Semantic divergences are maintained, and are made explicit by suitable documentation and careful construction of relationships between various levels of metadata.

We should point out here that the Project takes a neutral standpoint towards the data in the component databases; that is, we do not consider it the job of the Project to check and correct the data in the component databases, or to make value judgments on the analyses they express. The developers of the component databases have devoted much time and effort to collecting information in their databases; in database terms, the component databases represent high-value information reservoirs, created through considerable human effort and utilizing extensive domain expertise. In short, each component database represents an extremely valuable resource. The Project must ensure that the contents of the component databases are accurately imported and presented in the TDS, but it is not responsible for the accuracy of those contents themselves. To do otherwise would be impossible, since the TDS could not assume responsibility for the accuracy of the data without undertaking to resolve the empirical and theoretical differences between its component databases. As long as the provenance of information provided by the TDS is explicit, end users will be able to assess its suitability and reliability for themselves.
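The division of labour between encoding and semantics can be made concrete with a minimal Python sketch. It is an illustration only, not the Project's actual transformation code: the database names, the field names and the equivalence table are hypothetical. Purely notational variants are mapped to one spelling, while values that express an analysis are passed through untouched, and every value keeps a record of where it came from.

```python
# Hypothetical per-database tables of purely notational equivalences.
# Anything not listed here is treated as meaningful and left unchanged.
NOTATION = {
    "db_one": {"+": "yes", "-": "no"},
    "db_two": {"p": "pl"},
}


def import_value(database, field, raw):
    """Normalize the encoding of one value and record its provenance."""
    cleaned = raw.strip()
    mapping = NOTATION.get(database, {})
    return {
        "database": database,                    # provenance is never dropped
        "field": field,
        "raw": raw,                              # original spelling, for reference
        "value": mapping.get(cleaned, cleaned),  # only notation is rewritten
    }


agreement = import_value("db_one", "has_agreement", " + ")
number = import_value("db_two", "number_gloss", "p")
role = import_value("db_two", "agr_controller", "A")
```

Here the values "+" and "p" are rewritten to "yes" and "pl", whereas the category label "A" is passed through as it stands. Deciding which differences count as mere notation is a linguistic judgment, which is why the table in this sketch is written by hand for each database rather than inferred.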

3.1. The notion Subject: A mini case study

To better illustrate the integration process across databases, we will consider a simple case involving a single descriptive parameter. Consider a query such as “which languages have subject-verb agreement.” Two of the component databases contain information on “subjects.” One database, the Typological Database Nijmegen (henceforth TDN),[5] contains a single boolean variable answering exactly this question; another database, the Person Agreement Database (henceforth PAD), includes a block of variables giving more information about subject-verb agreement, for languages in which this exists.

A complicating factor is that the notion subject is not a primitive in the PAD. Instead, the PAD relies on a common alternative, the four-way classification of grammatical functions as sole argument of an intransitive verb (S), and agent-like (A), patient-like (P), and recipient-like (R) arguments of a transitive verb.[6] How can this classification be used to get information on subject-verb agreement? We must define a query in terms of the available categories. The common notion subject is itself interpreted in different ways within the linguistic community, but in principle it would be expected to be co-extensive with the union of the categories S and A.[7] The most useful strategy, then, is to search for data on S and A controllers. However, there are enough differences in principle and in practice between theories (and between the practice of individual researchers), that an implicit mapping of S and A values to the value Subject could lead to erroneous information being reported (for example, if an A by someone’s definition is not a Subject by someone else’s). Therefore this decision is best left to the user.

[5] See section 4 for descriptions of the component databases.

[6] The introduction of a category S distinct from A and P is found in Dixon (1972) and Comrie (1978). A recent adaptation of the system can be found in Haspelmath (2005).

[7] For discussion of the problems associated with subject as a crosslinguistic category, see Comrie (1989: ch. 5). There is considerable disagreement among linguists as to what constitutes a subject, and little of it can be resolved by adopting the S/A/P/R system. For example, consider an experiencer predicate like Spanish me gustas tú ‘I like you’, which assigns dative to the experiencer and nominative to the stimulus. Does this predicate have a “dative subject,” as suggested by the thematic relationships and the word order, or should the title “subject” be applied to the nominative-marked stimulus, which also controls “subject agreement” on the verb? Should the answer be the same in all languages with a similar construction? The issue depends on one’s underlying principles, definitions, and the desired abstractness of analysis, and is amplified when one considers languages with fewer morphosyntactic clues, or with less well-understood grammars. Adoption of the S/A/P/R system allows ergative languages to be coherently discussed but does not resolve such questions, since one must still determine what is or is not “agent-like,” etc. Fortunately, these issues tend to have little impact on the determination of analytical properties of a language, such as pro-drop or basic word order; these generally depend on the typical or prevailing configurations, for which there is less disagreement on identifying the “subject” (or A).

Let us now consider the reverse situation: A linguist who subscribes to the four-way classification of grammatical roles wants to query for languages which have agreement with A (the agent-like argument). Since one of the component databases does provide this information, the TDS should make it available. But the information in TDN cannot answer this question directly, since TDN does not distinguish between transitive and intransitive subjects. Does that mean that the information in TDN is of no interest in this case? Only the user can answer this; information about the category Subject might be useful, although inexact, or it might be irrelevant to the user’s needs.

The TDS interface respects these considerations. A user who searches for the term “subject” during the pre-query stage is presented with a list of database fields that includes the relevant fields from both databases (this happens even though the documentation directly associated with the PAD fields does not use the word “subject”). The user can assess the relevance of each field for his or her purposes. Presented with the available options, a user interested only in agreement with A might decide to rely exclusively on PAD or to additionally look up information on subject agreement in TDN, later refining it by consulting other information on individual languages.

There is another reason to be conservative in transforming data from one descriptive framework into another: the creators of PAD have made a deliberate choice not to rely on the category Subject, which they consider problematic and inadequate; they might not be keen to endorse a statement to the effect that “according to the PAD, language X has subject-verb agreement,” since the PAD does not directly make any statements about the category Subject. In reporting the contents of its component databases, the TDS must be careful to do so accurately.[8]

[8] An accurate report is not necessarily verbatim, however; the issue here is the use of disputed terms or categories. It is sometimes necessary to develop descriptions where none exist, or to expand, in consultation with database developers, on descriptions in the original documentation so that they are interpretable when removed from their original context. We take care that the resulting statements do not violate the theoretical commitments of the database they describe.

Note finally that these issues were illustrated by reference to a very simple example involving a single descriptive parameter. The problem becomes even less tractable when numerous data fields are used to describe aspects of a phenomenon, or when more than one linguistic concept is involved. For example, the Typological Database Amsterdam (henceforth TDA) encodes subject-predicate order in terms of the following parameters:

1. Basic word order of the clause
2. Whether the order is fixed or variable
3. Whether non-canonical order is morphologically marked

The basic word order is not expressed in the traditional terms of subject, verb and object position (e.g., SVO or SOV), but in terms of the position of the predicate: Predicate-initial, -medial or -final, with the option of having multiple orders in a language (cf. Hengeveld et al. 2004). Because multiple orders are allowed, the approach contrasts with a system that identifies one basic word order, and perhaps identifies a secondary order on the basis of particular criteria. Note that the question of one versus many basic orders is independent of, and more complicated to resolve than, the choice between the traditional “constituent-based” and predicate-based order. Neither system can be converted to the other without loss of information (or the manual addition of information not already in the database).

4. The Typological Database System

The TDS Project began with the goal of unifying a number of typological databases, originally all by Dutch linguists.[9] The Project’s early attention to terminological and conceptual differences has evolved into the present focus on developing a software system whose architecture is fundamentally based on the principles of data integration described here. Monachesi et al. (2002) present an early vision of the system.

[9] The early participants in the Project included Paola Monachesi, Anne-Marie Mineur and Manuela Pinto, as well as several of the participants named in footnote 1. The developers of the initial group of component databases played an important role in defining the goals of the system, and identifying problems and areas of concern.

It has always been the intention to include a moderate number of databases, whose integration into the system would require specialist expertise. At present the following databases are searchable through the TDS interface, and a few others are being prepared for integration. 1. The Anaphora Typology database project (Utrecht University) is surveying the binding properties of reflexives and reciprocals, referred to collectively as “local coreference strategies.” This database is focused on reflexives, although some information is also provided on pronouns and reciprocals. It contains glossed example sentences (grammatical or ungrammatical) in a variety of syntactic configurations. A limited number of languages are examined in detail. Only a few languages and properties are currently included in the TDS. Developers: Alexis Dimitriadis, Martin Everaert, Eric Reuland, Tanya Reinhart; Utrecht institute of Linguistics OTS. 2. The Person Agreement Database (PAD) contains analytical (languagelevel) data for over 400 languages, on person agreement and some related areas such as word order. The information is coded in terms of over 250 variables. For a subset of these, citations to relevant pages in reference grammars are given. Developers: Anna Siewierska, Lancaster University; Dik Bakker, University of Amsterdam. 3. Smith’s Phoneme INventories (SPIN) is a collection of phoneme inventories and lexical tones in 110 languages based on published works. Developer: Norval Smith, University of Amsterdam. Digitized by the TDS Project. 4. The Stress Typology Database (StressTyp) contains information on the metrical systems (stress systems) of 510 languages, based on grammars and theoretical works. Notions covered include rule-based stress, lexical stress, extrametricality, foot types, etc. Developers: Rob Goedemans, Leiden University; Harry van der Hulst, University of Connecticut. See Goedemans and Van der Hulst (this volume). 5. 
The Syllable Typology Database (SylTyp) contains information on syllable structures. Restrictions and rules concerning possible syllabic structures are provided, as well as information pertaining to the content of these structures. Developers: Harry van der Hulst, University of Connecticut; Rob Goedemans, Leiden University. 6. The Typological Database Amsterdam (TDA) focuses on basic word order and constituent order systems. Information classifying the parts-of-

166 A. Dimitriadis, M. Windhouwer, A. Saulwick, R. Goedemans and T. Bíró speech system of the included languages is also provided. Developer: Kees Hengeveld, University of Amsterdam. 7. The Typological Database Nijmegen (TDN) contains analytical (language-level) information on a variety of topics, including: basic word order, intransitive predication, case marking, temporal sequencing, relative clause information, comparatives, possessive constructions, verbal morphology, tense/aspect, noun phrase coordination, manner adverb encoding, verbal derivation. The number of languages varies from topic to topic, with a minimum of 140 for all topics, and a maximum of 410 for some. Developer: Leon Stassen, Radboud University Nijmegen. 8. The Typological Database of Intensifiers and Reflexives (TDIR) provides information on intensifiers and reflexives as well as on some related domains of grammar such as the middle voice and scalar focus particles. Its focus is on accurate language description and documentation (“grammar fragments”). The data have been obtained from both native speaker consultation (primary data) and literature on the relevant topics and languages (secondary data). For secondary data, sources and references are generally indicated. The database contains information on more than a hundred languages, including approximately 600 examples. The sample is not balanced genetically or areally. Developers: Volker Gast, Daniel Hole, Ekkehard König, Peter Siemund, Stephen Töpper; Free University of Berlin. See Gast (this volume). 9. The UCLA Phonological Segment Inventory Database (UPSID) is a collection of phoneme inventories for 451 languages. Features such as manner, place, length, phonation type and secondary articulation are included. Developer: Ian Maddieson, UCLA. 10. The Graz Database on Reduplication provides data on reduplication in the world’s languages. 
The collected examples are described phonologically, morphologically and semantically, together with information on productivity and diachrony. Developer: Bernhard Hurch, University of Graz, Austria. See Hurch and Mattes (this volume).
11. The World Color Survey elucidates the relationship between color categories and basic color terms. The summary tables included in the TDS originally appeared in the World Atlas of Language Structures. Developers: Paul Kay, Luisa Maffi; University of California at Berkeley.

In addition, the TDS includes a number of supporting data resources that are not themselves considered linguistic databases. These include the table of languages and three-letter codes defined by ISO standard 639-3,[10] a table of genetic affiliations for each language as assigned by the Ethnologue directory (used by permission), and two collections of geographic locations (GIS coordinates) for several hundred languages: one originally compiled by Matthew Dryer and updated for use in the World Atlas of Language Structures (see Haspelmath, this volume), and another, limited to African languages, compiled by Guillaume Segerer. Both are used by kind permission of their creators. The Universal Phoneme Position Chart (described in Section 8.2) is based on data contained in the UPSID database, but involves significant additional processing by the TDS; it can be seen as an additional information source. Together, the component databases of the TDS contribute more than twelve hundred database fields (attributes), with varying amounts of information on almost one thousand languages.

4.1. System Architecture

Figure 1. Overview of the TDS architecture.

[10] The ISO 639-3 codes are the successor to the “SIL codes” used in the past by Ethnologue (http://www.ethnologue.org/). SIL is the registration authority for the new ISO codes, and provides an access point at http://www.sil.org/iso639-3/.

Figure 1 gives a global overview of the TDS architecture. Data are imported from the component databases (at the bottom of the diagram), and the user interacts with the system through its web interface (at the top). The core of the system is the metadata, which contain all TDS knowledge about the component databases and the linguistic domain. The knowledge base consists of various specifications which are linked to each other: a set of database-specific ontologies, one global linguistic ontology and (currently) two topic taxonomies. Maintenance of the knowledge base is supported by the TDS Workbench (right) and other custom-built or general-purpose tools, including an ontology editor.

The functions of the TDS are divided into two core processes: loading and integration of the data from the component databases, and querying the data in response to user requests. The first core process, integration of the data from the component databases, is implemented as a pre-processing step that is executed whenever the contents of the TDS change: for example, when a new database is added, when an updated version of an existing component database is received from its developers, or when changes are made to existing metadata. Data integration involves the following steps: importing, transforming, merging and (possibly) enriching.[11] The process is completely controlled by database-specific specifications written in the Data Transformation Language (DTL),[12] a declarative description language developed by the TDS Project for this purpose (see section 5.2). The DTL specifications tell the data integration engine about the implementation details of each component database, e.g., which DBMS is used, where the database is located, and which encoding is used. Using this information, the engine selects the proper plug-in to load the data from the component database. Once data are loaded, they undergo a transformation process.
The transformation rules are also part of the DTL specification and define the unification of different notations and structures, to the extent that this is possible without compromising the semantics of the component database. Special care is taken in the harmonization of key values, e.g., language and phoneme codes. This is necessary in order to recognize when different databases describe the same object (language, phoneme, etc.). Once the keys are properly harmonized, the next step is to merge all the data about one object, e.g., a specific language, from the various component databases. Next, cross-database data enrichment can take place.[13] The end result of interpreting the specifications in the DTL scripts is a unified data collection, the TDS data. Management of the DTL scripts and other TDS metadata is facilitated by the TDS Workbench, which provides various checks on the consistency of the metadata network; for example, it detects invalid links between the various specifications, and invalid links and structures within the global ontology.

[11] Although the data integration process is described in this section as sequential, implementation-wise the steps are interleaved, e.g., local database enrichments happen together with the data transformation.
[12] Important TDS terms and acronyms are collected in a glossary at the end of this paper.

The second core process of the TDS is support for end-user queries over this data collection. The query process is steered by a web interface, which helps the user build and execute queries in two stages. At the first stage, the pre-query, the user interacts mainly with the metadata. By navigating through the network of metadata elements and/or doing full-text searches on their content, the user can identify fields of interest and collect them into the query basket. In this pre-query step, the system conducts “smart searches” that exploit the sense relationships encoded in the metadata (especially the ontology and DTL schema). Search terms, for example, use the vocabulary of a user’s theoretical framework. By locating these terms in the semantic network, the system can suggest fields to the user which are linked to semantically close terms, even if they are drawn from an alternative theoretical framework. For example, the search term “predicate medial” partially matches the ontology Concept[14] PredicateMedialWordOrder, which is part of the Hengeveld et al. (2004) classification of constituent orders (see section 3.1). The ontology provides links to the related Concepts OVS and SVO, even though these categories only occur in the alternative, three-part classification of constituent order systems.
The TDS can now also suggest fields related to the concepts OVS and SVO, although they will have a lower ranking in the search results than fields directly related to the Concept PredicateMedialWordOrder. Once the user is satisfied with the collection of fields, the query proper is defined by specifying selection and projection criteria, i.e., which records to retrieve (by matching the selection criteria) and which fields of these records to display (“project”). The query can now be executed by the system, and the retrieved subset of the TDS data is presented to the user. Once more the metadata assist the user in interpreting the results. The user can then initiate a fresh query, or modify the current query and resubmit it.[15]

[13] At time of writing, the TDS supports local database enrichment (new computed fields for a database, with possible reference to global TDS data), but no cross-database enrichment (i.e., enrichment that utilizes merged data from multiple component databases).
[14] We use the term Concept, with a capital C, to refer to an entry in the global linguistic ontology, representing a linguistic concept. (See section 5.)

5. Knowledge architecture

A key feature of the TDS is its focus on safeguarding the semantic integrity of the integrated data. The TDS Project considers it crucial that the knowledge residing in component databases should be faithfully preserved during the integration process. To support diversity in the theoretical perspectives of component databases, the TDS utilizes a “hybrid,” or two-level, model of the semantic domain (Stuckenschmidt and Van Harmelen 2005). In the hybrid TDS model, there are two main levels where knowledge is stored: the global ontology of broad linguistic concepts (henceforth TDS-GO) and the local ontologies of the individual component databases, encapsulated in the DTL schema. The global level provides ontology entries, or Concepts, which describe general unifying linguistic concepts. The local level includes pointers to these general Concepts, and contains database-specific definitions. The global ontology thus introduces common ground, which enables the possibility of significant cross-database queries, whereas the local ontology is the store of the database-specific knowledge. This model, depicted in Figure 2, forms the core of the TDS knowledge integration.

Each component database is associated with its own local ontology. The local ontology is specified using the DTL, and defines local Notions, or entries in the local ontology, by reference to the tables and attributes in the schema of the corresponding component database (typically a relational database).
In general, each database attribute (field) is imported and expressed as a Notion; but the DTL also supports powerful means of restructuring, combining, or even splitting up attributes when necessary, and hence the mapping of attributes to Notions is not always one to one. In the DTL specification, a Notion is described through metadata in the form of short labels, more detailed descriptions, and links to the global ontology. Topic taxonomies provide domain-specific hierarchies as quick entry points into the global linguistic ontology. The system supports multiple alternative taxonomies, so that a linguistic domain can have its own, dedicated search template.[16]

[15] The query interface is described in more detail in section 7.

Figure 2. TDS knowledge architecture with knowledge representation roles.
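The two-level model can be illustrated with a small sketch. The Python code below is not part of the TDS, which is specified in DTL and OWL; it only mimics the idea that local Notions point to global Concepts, and that a search term matching one Concept can also surface fields linked to related Concepts at a lower rank. The Concept names come from the example in section 4.1; the scopes dbA and dbB, the Notion names and the two-tier ranking are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Concept:
    """An entry in the global ontology, with links to related Concepts."""
    name: str
    related: List[str] = field(default_factory=list)


@dataclass
class Notion:
    """An entry in a local (database-specific) ontology."""
    name: str            # scoped name; the dbA/dbB scopes are invented
    concepts: List[str]  # links to global Concepts


# Global level: Concept names taken from the example in section 4.1.
ONTOLOGY: Dict[str, Concept] = {
    "PredicateMedialWordOrder": Concept("PredicateMedialWordOrder", ["OVS", "SVO"]),
    "OVS": Concept("OVS"),
    "SVO": Concept("SVO"),
}

# Local level: hypothetical Notions from two hypothetical databases.
NOTIONS = [
    Notion("dbA:predicateMedial", ["PredicateMedialWordOrder"]),
    Notion("dbB:orderSVO", ["SVO"]),
    Notion("dbB:orderOVS", ["OVS"]),
    Notion("dbB:caseMarking", ["Case"]),
]


def normalize(text: str) -> str:
    """Lower-case and drop spaces/punctuation so 'predicate medial' can match."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def suggest_fields(term: str) -> List[str]:
    """Return Notion names: direct Concept matches first, related ones after."""
    needle = normalize(term)
    direct = {name for name in ONTOLOGY if needle in normalize(name)}
    related = {r for name in direct for r in ONTOLOGY[name].related} - direct
    ranked = []
    for notion in NOTIONS:
        linked = set(notion.concepts)
        if linked & direct:
            ranked.append((0, notion.name))   # linked to a matching Concept
        elif linked & related:
            ranked.append((1, notion.name))   # linked only to a related Concept
    return [name for _, name in sorted(ranked)]


print(suggest_fields("predicate medial"))
# ['dbA:predicateMedial', 'dbB:orderOVS', 'dbB:orderSVO']
```

The field linked directly to PredicateMedialWordOrder is listed first, and the fields reached through the related Concepts OVS and SVO follow; a real ranking would presumably be more graded than this two-tier ordering.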

5.1. Database schemata and metadata

The eleven databases currently comprising the TDS exemplify a range of approaches to data modeling. Several of them have properly normalized, or even over-normalized database schemas, while others are completely unnormalized:[17] e.g., they consist of a single large table or store multiple pieces of information in one attribute value. Some databases are sparsely filled with data, and could be considered semi-structured, while others are highly structured and densely filled. Implementation details of the DBMS used for the component database (either currently or in the past), or the needs of the user interface, are also sometimes apparent in the data. For example, proprietary font encodings are used for IPA characters, and the data structure and values of one database transparently reflect its original status as a dataset for the statistical package SPSS. The DTL transformation rules can address such quirks. To make the imported data useful, however, the intended semantics associated with the encoding needs to be made explicit.

[16] Alternative taxonomies are not extensively used at this time. Only two taxonomies, a general linguistics-oriented one and a taxonomy exposing the organization of the TDS global ontology, are included in the interface. A third, taxonomy-like view (“view by datatype”) exposes the native hierarchical organization of the TDS data.
[17] A database is said to be “normalized” if its table structure meets certain technical criteria (see Date 2004).

As mentioned, most of these databases were created for personal or small-group use and so very limited interpretive metadata is generally available. Even when this exists, it is usually not precise enough for the requirements of the TDS, where a data field may be presented out of its original context of forms or related fields. In the metadata development process, the interpretive/analytical intentions of the database creator may be discerned by examining the original user interfaces (where these exist), but typically, repeated and sometimes extensive interaction with the database developer(s) is required in order to accurately represent their intention in the local ontology. These analyses form the core of the knowledge component and represent the semantic characterization of the component databases, and by extension the linguistic knowledge in the TDS.

Once a component database has been adequately described in the DTL, the specification is used to drive the importation of data from the component database into the aggregated TDS data. This is a fully automated process that can be repeated whenever the contents of a component database change. Manual intervention is only required if new fields are added to the database (which will need to be described in the DTL schema), or if changes are made to the naming or definition of database fields or potential values. Simple addition of more data by the creators of the database, or correction of errors (or other modifications) in existing data, do not require attention by the TDS developers if the database design remains the same.
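The import described here, together with the key harmonization and merging steps of section 4.1, can be sketched in a few lines of Python. This is only an illustration of the idea, not the TDS integration engine: the databases dbA and dbB, their key tables, and the field names and values are placeholders invented for the example. Only the ISO 639-3 codes nld and deu are real.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical code tables: each database's own language key -> ISO 639-3 code.
KEY_MAPS: Dict[str, Dict[str, str]] = {
    "dbA": {"Dutch": "nld", "German": "deu"},
    "dbB": {"DUT": "nld", "GER": "deu"},
}

Source = Tuple[str, List[Dict[str, str]]]  # (name of key field, records)


def harmonize(db: str, key_field: str, record: Dict[str, str]):
    """Map the local key to the shared code and scope the other fields."""
    shared_key = KEY_MAPS[db][record[key_field]]
    fields = {f"{db}:{name}": value
              for name, value in record.items() if name != key_field}
    return shared_key, fields


def integrate(sources: Dict[str, Source]) -> Dict[str, Dict[str, str]]:
    """Merge all fields describing the same language into one record."""
    merged: Dict[str, Dict[str, str]] = defaultdict(dict)
    for db, (key_field, records) in sources.items():
        for record in records:
            shared_key, fields = harmonize(db, key_field, record)
            merged[shared_key].update(fields)
    return dict(merged)


# Toy records; field names and values are placeholders, not real data.
SOURCES: Dict[str, Source] = {
    "dbA": ("language", [{"language": "Dutch", "field1": "a"},
                         {"language": "German", "field1": "b"}]),
    "dbB": ("code", [{"code": "DUT", "field2": "c"}]),
}

print(integrate(SOURCES))
# {'nld': {'dbA:field1': 'a', 'dbB:field2': 'c'}, 'deu': {'dbA:field1': 'b'}}
```

Because the result is computed entirely from the sources and the key tables, the step can simply be rerun when a database adds or corrects records. A new field passes through this sketch untouched, whereas in the TDS it would first have to be described in the DTL schema.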

5.2. Local database ontologies and the Data Transformation Language

The local ontologies should capture theory-specific knowledge. As its name reveals, the Data Transformation Language (DTL) started out as a declarative specification of transformation rules to overcome notational and structural differences. But since the description of the resulting data schema is part of this specification, the language was extended with constructions that enrich the schema with metadata. Notions are the basic building blocks of a DTL specification. They can express a database-specific theoretical construct, an individual field from a database (or a constructed field), or a group of other Notions. The following kinds of metadata can be attached to each Notion:

1. a short label, preferably about five words long
2. a short description
3. one or more links to Concepts in the global ontology; these links can be labeled with a type, to indicate different kinds of relationships
4. one or more links to other DTL Notions
5. a semantic data type[18]

Notions can be nested and thus form a hierarchy, or rather several hierarchies with distinct roots. Each hierarchy is subdivided into smaller semantically coherent sub-hierarchies. Such a sub-hierarchy forms a local semantic context; Notions should always be shown within their context, or at least the context should be available to the user to allow the proper interpretation (and disambiguation) of displayed data from other Notions in locally nested hierarchies, potentially with the same name. For example, it is clear that the Notion name in the context of the Notion author represents an author’s name (rather than a language name). A powerful aspect of this is that it allows a descendant Notion to inherit some of the sense properties, e.g., the links to Concepts, from its parent and ancestor Notions. This, in turn, prevents local overspecification.

    TOP NOTION tdn:locationalPredicates
      LABEL "Locational predicates"
      DESCRIPTION "Information concerning locational predicates, including form
                   of, and conditions on, construction, and form of the negation."
      LINK TO CONCEPT locationalPredicate
      GROUPS {
        NOTION tdn:ZeroEncoding
          LABEL "Locational predicate is zero"
          LINK TO CONCEPT conditionsOnEncoding
          GROUPS {
            NOTION tdn:v168_Zero_plus_locative_prepositional_phrase
              LABEL "Locational predicate is zero + locative prepositional phrase"
              DESCRIPTION "The locational predicate is expressed without the use
                           of an overt verb, but has a locative prepositional phrase."
              VALUE IS FIELD v168
              GROUPS WHEN "yes" {
                NOTION tdn:v169_Zero_for_present_only
                  VALUE IS FIELD v169;
                NOTION tdn:v170_Zero_in_positive_sentences_only
                  VALUE IS FIELD v170;
              }
          }
      }

Example 1. An excerpt of the DTL Notion hierarchy for the TDN database.

[18] The DTL supports two basic types: enumerations and free text. Subtypes of these types, associated with particular semantics, can be declared and attached to Notions. Their main use is to influence the rendering of Notion values in the user interface. For example, the TDS specification declares a special data type for phonemes, which allows them to be rendered in their proper place in the Universal Phoneme Positioning Chart (UPPC).
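The inheritance of Concept links along a Notion hierarchy, as in Example 1, can be mimicked with a small Python sketch. The Notion and Concept names are taken from Example 1; the class, its methods and the "nearest link first" ordering are assumptions made for illustration and say nothing about how the DTL engine is actually implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Notion:
    name: str
    concepts: List[str] = field(default_factory=list)  # own LINK TO CONCEPT entries
    parent: Optional["Notion"] = None

    def child(self, name: str, concepts=()) -> "Notion":
        """Create a Notion nested (grouped) under this one."""
        return Notion(name, list(concepts), parent=self)

    def all_concepts(self) -> List[str]:
        """Own links first, then links inherited from ancestors, nearest first."""
        found: List[str] = []
        node: Optional[Notion] = self
        while node is not None:
            for concept in node.concepts:
                if concept not in found:
                    found.append(concept)
            node = node.parent
        return found

    def context(self) -> str:
        """The chain of enclosing Notions, outermost first."""
        names: List[str] = []
        node: Optional[Notion] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return " > ".join(reversed(names))


top = Notion("tdn:locationalPredicates", ["locationalPredicate"])
zero = top.child("tdn:ZeroEncoding", ["conditionsOnEncoding"])
v168 = zero.child("tdn:v168_Zero_plus_locative_prepositional_phrase")

print(v168.all_concepts())  # ['conditionsOnEncoding', 'locationalPredicate']
print(v168.context())
```

The field-level Notion declares no links of its own, yet it can still be found through both Concepts, which is the sense in which inheritance prevents local overspecification.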

Example 1 shows an excerpt from a DTL sub-hierarchy for Locational Predicates. The keywords TOP NOTION indicate the start of a sense context. The Notion tdn:locationalPredicates has a label, a description and a link to one Concept in the global ontology, locationalPredicate. The nested Notion tdn:ZeroEncoding also contains a link to one Concept, conditionsOnEncoding, but will also inherit the link to the Concept locationalPredicate from its parent. The same example also shows the link from the DTL specification to fields in the database schema: the Notion tdn:v168_Zero_plus_locative_prepositional_phrase takes its value from the TDN database field labeled ‘v168’.

Although each database contains specific, and perhaps unique, Notions, some parts of its DTL hierarchy will overlap with the DTL Notions of other databases. To express such relationships, the DTL supports scopes. Each database has its own scope, where it creates its own Notions. Example 1 shows a part of the tdn scope, the scope of the Typological Database Nijmegen. Like Notions, scopes form a hierarchy and inherit Notions declared in parent scopes. While the leaves in this hierarchy are formed by database scopes, the root is formed by a “warehouse”[19] scope, the tds scope. Common Notions and structures such as tds:Language and tds:LangIdentification are declared in this scope, as shown in Example 2.

    WAREHOUSE tds {
      DECLARE NAME "Typological Database System";
      …
      DECLARE ROOT NOTION tds:Language
        DESCRIPTION "All linguistic and non-linguistic information about a
                     particular language."
        GROUPS {
          TOP NOTION tds:LangIdentification
            LABEL "Language identification"
            DESCRIPTION "Information concerning the identification and identity
                         of a language: Name, geographical area where it is spoken,
                         etc. Properties that are not part of the synchronous
                         description of its system."
            LINK TO CONCEPT languageInformation
            GROUPS {
              NOTION tds:Name
                LABEL "Language name"
                LINK TO CONCEPT language
                TYPE IS TEXT;
              … }
          … }
      … }

Example 2. An excerpt from the DTL Notion declarations showing the warehouse scope tds.

[19] The term warehouse refers to data warehousing, a computer science discipline which focuses on the management of integrated views of multiple (heterogeneous) databases.
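The scope hierarchy can be pictured as a chain of lookup tables. The following Python sketch assumes that a Notion is resolved by searching the database scope first and then its parent scopes up to the warehouse scope; the Scope class, the resolve method and the intermediate scope name "phonology" are hypothetical and only illustrate how declarations in parent scopes become visible in database scopes.

```python
from typing import Dict, Optional, Tuple


class Scope:
    """A named scope that sees its own Notions and those of its parent scopes."""

    def __init__(self, name: str, parent: Optional["Scope"] = None):
        self.name = name
        self.parent = parent
        self.declared: Dict[str, str] = {}  # Notion name -> short description

    def declare(self, notion: str, description: str) -> None:
        self.declared[notion] = description

    def resolve(self, notion: str) -> Tuple[str, str]:
        """Return (declaring scope, description), searching upwards."""
        scope: Optional[Scope] = self
        while scope is not None:
            if notion in scope.declared:
                return scope.name, scope.declared[notion]
            scope = scope.parent
        raise KeyError(notion)


tds = Scope("tds")                      # warehouse scope (root)
phonology = Scope("phonology", tds)     # hypothetical domain-specific scope
stress_typ = Scope("stressTyp", phonology)  # database scope (leaf)

tds.declare("Language", "All information about a particular language")
stress_typ.declare("generalStressAssignment",
                   "General stress assignment properties and parameters")

print(stress_typ.resolve("Language"))                 # found in the tds scope
print(stress_typ.resolve("generalStressAssignment"))  # found locally
```

Lookups only travel upwards: a Notion declared in a database scope is not visible from the warehouse scope, so database-specific Notions stay local while common ones are shared.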


Between the warehouse scope and the database scopes there can be domain-specific scopes, which declare domain-specific Notions. For example, there is a scope for the linguistic domain of phonetics. Together, the upper scopes define a skeleton hierarchy, or hierarchies, of Notions. The lowest database scopes then localize these declared Notions by associating them with database fields and database-specific transformation rules, and further extend the structure with database-specific Notions. Example 3 shows the reuse of Notion structures from the tds warehouse scope in the stressTyp database scope, and shows how database-specific Notions are embedded in this structure. The local, database-specific ontologies are thus aligned in their structure, which allows the TDS to integrate the data loaded from the component databases while maintaining (large) data islands of theory-specific information.

    DATABASE stressTyp {
      DECLARE NAME "Stress Typology Database";
      …
      ROOT NOTION tds:Language
        KEY IS LOOKUP(MAP code(FIELD Eth15,FIELD Dialect-name,FIELD Name))
        GROUPS {
          TOP NOTION tds:LangIdentification
            GROUPS {
              NOTION tds:Name
                VALUE IS FIELD Name;
              …}
          TOP NOTION tds:Phenomena
            GROUPS {
              TOP NOTION tds:Phonology
                GROUPS {
                  NOTION tds:MetricalPhonology
                    GROUPS {
                      TOP NOTION stressTyp:generalStressAssignment
                        LABEL "General stress assignment properties and parameters"
                        DESCRIPTION "A collection of general parameters and
                                     properties describing stress placement
                                     patterns."
                        LINK TO CONCEPT stressPlacementProperty
                        GROUPS {
                          TOP NOTION tds:stressRulesDescription
                            VALUE IS FIELD Quotation;
                          … }
                      … }
                  … }
              … }
          … }
      … }

Example 3. DTL Notion hierarchy with Notion localization.

The relational database model, which underlies the design of most of the component databases, uses a separate database table for each entity type, or type of object being described (e.g., language, book, sentence). The TDS equivalent is a so-called Root Notion; each Root Notion represents the root of a separate hierarchy of Notions, and includes all properties of the object being described. There is a very limited number of such hierarchies. At present, the TDS contains five: for languages, potential phonemes, example sentences, glosses and bibliographical references. Instantiations of these Root Notions (for example, a particular language and a particular book) can be related to each other through foreign key relationships, and can be in a many-to-many or one-to-many relation.

The relational database model does not support hierarchical grouping of the database fields (attributes) in a database. While a table’s attributes may be conceptually organized into groups and subgroups with related content, this is not expressed in the design of the database.[20] But the data model of the TDS is hierarchically organized, thereby allowing groupings of data fields to be directly expressed as hierarchies of Notions. This model also supports multiple values for a data field without needing a separate table/hierarchy, as would be necessary in a relational database. This has made it possible to replace various component database tables and special data fields, whose only purpose was to accommodate multiple values, with a single TDS Notion that can simply be instantiated multiple times for a single record (e.g., multiple constituent orders for a single language).

5.3. The global linguistic ontology

As stated in section 2, the primary task of the global ontology of linguistic concepts (TDS-GO for short) is to facilitate integration of diverse typological databases containing information on (the analysis of) linguistic facts. To do this, the TDS-GO specifies the domain vocabulary by defining Concepts with descriptions of various linguistic terms and concepts (including information on the logical structure of complex concepts). These descriptions are intended to be descriptive and explanatory of linguistic concepts, but neither exhaustive nor theory-neutral. They are designed to define the Concepts which express and unify local ontology Notions.
The TDS-GO is not meant to include a comprehensive compendium of linguistic concepts; we adopt the bottom-up principle of ontology development (see section 5.3.1), and the TDS-GO only includes information that is relevant, directly or indirectly, to the component databases comprising the TDS.[21] Some parts of the TDS-GO contain highly detailed and hierarchically deep Concepts, representing a range of linguistic phenomena in phonology, morphology, syntax, semantics and pragmatics, from a synchronic or diachronic perspective, with particular focus on the following areas: grammatical agreement, parts of speech, word and constituent order, stress placement, predication, phonemic and phonological properties, semantic categories, relational categories (such as case, grammatical alignment, valencies, event types and modification), speech styles, syntactic categories, paradigmatic and systemic groupings, as well as geographic location and genealogical classification. Three such examples of TDS-GO class hierarchies (where “→” represents the is-a relation) are:

1. Linguistic property → phonetic or phonological property → syllable structure property → onset feature → obligatory onset
2. Linguistic property → phonetic or phonological property → suprasegmental property → stress placement → main stress placement → variable stress placement systems → non-lexical stress placement → edge placement → right word edge stress placement → antepenultimate if heavy, else penultimate if heavy, else antepenultimate.
3. Linguistic property → linguistic functions property → marker function → agreement marker function → agreement marker for core arguments → subject agreement marker.

[20] Some component databases achieve a limited grouping effect by creating a separate table for each group of related attributes; this is unnecessary from the relational perspective.
[21] The TDS-GO is not aligned with GOLD, the General Ontology for Linguistic Description (www.linguistics-ontology.org). GOLD is targeted to concepts related to morphosyntactic annotation, and therefore does not cover many of the topics needed for the TDS global ontology. In addition, GOLD was still evolving at the time the TDS ontology was developed; consequently, the TDS-GO was developed independently of GOLD, rather than as an extension of it. We plan to align the TDS-GO with GOLD to the extent that this is feasible, by defining correspondences and removing any gratuitous incompatibilities.

The global ontology organizes the relevant linguistic Concepts into a coherent formal network of classes and relations. Where information in more than one database relates to the same topic, it is the task of the ontology to establish a valid conceptual structure that will enable the integration of diverse conceptualizations. Where multiple senses occur, the TDS-GO also needs to maintain and present coherently the sense differences of component databases. Thus the TDS-GO takes a neutral standpoint towards the analyses represented by the component databases of the TDS. That is, we do not consider it the job of the TDS Project to choose between alternative analyses, but rather to provide access to information through cross-database querying. Therefore the goal of the TDS-GO is not to provide one system of linguistic concepts that is adopted as canonical, but rather to express the systems of all the perspectives represented in the TDS. We therefore describe the TDS-GO as an inclusive ontology of linguistic concepts.

The TDS-GO is built using one of the industry-standard languages, the Web Ontology Language (OWL). The choice of OWL was motivated by (among other things) the requirement for extensibility, ease of integration with other components of our XML-based system, web-based user interface querying and the availability of development tools.

5.3.1. Conceptual principles underlying ontology development

The TDS-GO is based on a set of underlying principles which govern the process of establishing Concepts and their relationships. As discussed in Saulwick et al. (2005), our methodology follows current recommendations for ontology building (Gruber 1993; Gómez-Pérez et al. 2004), namely: clarity, coherence, extendibility [sic], minimal encoding bias, minimal ontological commitment, representation of disjoint and exhaustive knowledge, minimization of syntactic difference in encoding and standardization in naming conventions. Important features of ontology-driven integration are the use of shared vocabulary in a coherent and consistent manner (Gruber and Olsen 1994) and where possible the standardization of naming conventions. In the following paragraphs we will discuss the conceptual principles guiding TDS ontology development.

A bottom-up approach

A fundamental design principle of the TDS-GO concerns the basis for the postulation and establishment of Concepts (i.e., classes, properties or individuals).
It is a design and methodological principle of the TDS that ontological Concepts are only established on the basis of information existing in component databases, thereby constraining the global ontology to a range of relevant areas. This is motivated by the desire to ground the ontology in empirical data-based theory, and thus it acts as a limiting device on otherwise unconstrained ontology growth. However, a Concept may be established for which there is no database mapping, if it is syntagmatically or paradigmatically relevant. For instance, at one point the TDS-GO Concept Transitive Object (the second argument of a transitive verb) was not linked to any database fields, but it was included in the ontology alongside the Concepts Intransitive Argument (the sole argument of an intransitive verb) and Monotransitive Argument (the first argument of a transitive verb), which did relate to data in component databases. In other words, a Concept may be established if it fills a paradigmatic or syntagmatic gap in the network thematic domain. Later, this Concept was linked to new database fields; in this way, its prior addition to the ontology as part of a paradigm facilitated the integration and linking of new data in a globally consistent way.

Prototypes

As is well known from prototype theory (Rosch and Lloyd 1978; Taylor 1989; Varela et al. 1991), an entity included in a category (also a class of entities subsumed by a superordinate category) may have more or fewer of the features/attributes associated with that category, depending on whether it represents a more or less prototypical exponent of the category. The TDS-GO adopts a prototype approach to the classification of linguistic categories. For instance, the class Free Pronoun subsumes the classes Cardinal Pronoun, Demonstrative Pronoun, Emphatic Pronoun, Personal Pronoun, Possessive Pronoun, Reflexive Pronoun, and Weak Form Of Person Marker. Subsumption represents the standardly used “is-a (kind of)” relation, where the subordinate entities represent specializations of the category. It is clear that the entities Cardinal, Demonstrative, Emphatic, Personal and Possessive Pronouns are each a special type of the superordinate class Free Pronoun. That is, each of these classes has at least one additional feature that is the basis for its specialization. We could label each of these features, respectively, as +value cardinal, +value demonstrative, +value emphatic and so on.
In terms of classification, one could argue that the class Weak Form Of Person Marker is an invalid specialization of the class Free Pronoun because it is not necessarily free or unbound. Its exponents may be free, cliticized or bound depending on the language. Thus in the strictest sense the class Weak Form Of Person Marker is not a specialization of the Free Pronoun. However, adopting a prototype analysis allows for a subordinate class (in this case Weak Form Of Person Marker) to have features in apparent conflict with the superordinate class if certain core features of the specialized class are consonant with the superordinate category. In this case we could describe some of these as: “deictic marker referencing person referents.”[22] By permitting the kind of prototype classification presented here, a richer and thus more fine-grained network of associations between categories is provided. This results in the possibility of more extensive cross-data mappings and thus facilitates more effective resource discovery.

Theory-neutral perspective

Each of the component databases reflects the theoretical stance of its creator in a myriad of choices, both in the way linguistic phenomena are conceptualized and in the terminology used to describe them. When diverse databases provide information about the same topic or use the same term, there is the potential for mismatch. As already mentioned, the TDS-GO is an inclusive ontology of linguistic concepts: it provides a common vocabulary that serves as a non-prescriptive basis for the integration of database-specific categories. The TDS-GO is by design maximally compatible with different conceptualizations of linguistic phenomena. It includes crucial concepts but attempts to refrain from incorporating details peculiar to a particular theoretical orientation. This does not mean that the ontology itself consists of Concepts that are “a-theoretic.” Indeed we hold that such a pursuit is unattainable for the simple reason that all terms bear the hallmarks of their particular theoretical orientation. Rather, the global ontology is inclusive in the sense that it can accommodate the variety and richness of individual theoretical orientations with all their idiosyncrasies. Where appropriate, variant and potentially conflicting orientations or conceptualizations are included in the global ontology and are unified under broader categories. The decision whether to include a concept in the global ontology is dependent on how widely accepted a linguistic category is, within or across (conflicting) linguistic theories.
A linguistic category that is not included in the global ontology is treated as a concept in a local ontology: it is represented as a DTL Notion (Saulwick et al. 2005). In this way, the ontology strives to achieve a variation on the principle of Gruber’s (1993) minimal ontological commitment, namely minimal orientation commitment. This is the inclusion of diverse theoretical orientations, without ascribing to any one a favoured status. An example of “ontological unification” (Saulwick et al. 2005) is the case of the variant Concepts Basic Word Order and Predicate-Based Word Order. In the TDS-GO these are unified under the supercategory Core Constituent Word Order. We call this semantic unification; not an ironing out or watering down of theoretical orientation, but the establishment of an inclusive superconcept for the purposes of information integration. A query over any one of these Concepts allows the end-user access to the others. In adherence to the principle of clarity (Gómez-Pérez et al. 2004), the TDS will ensure that the intention behind each database contributor’s use of terminology is faithfully represented in the local ontologies of the DTL.

22 The degree to which a weak form is able to encode referential specificity is not at issue here; see Siewierska (2004: 9, 124 ff.).

How to integrate databases without starting a typology war

5.3.2. Linguistic Concepts

The TDS-GO models a variety of linguistic objects, relationships and other linguistics-related ideas. Ontology Concepts are labeled and described with a short explanation, and possibly references to a bibliography. The TDS-GO thus also serves as a guide to interpreting the terminology of component databases. In this section we only give a high-level overview of the ontology design. The reader is referred to Saulwick et al. (2005) and Dimitriadis et al. (2005) for more details on the structure and implementation of the TDS-GO.

5.3.2.1. Types of linguistic Concepts

We distinguish between the following major types of linguistic Concepts:

1. Linguistic objects. These can be thought of as existing in themselves. They include Concepts such as Sentence, Morpheme and Phonological Segment, as well as classes representing Language and various groups of languages.

2. Linguistic properties. These are (linguistically salient) properties predicated of a linguistic object. For example, Basic Word Order is a property of Languages, while Referential is a property of certain words or syntactic constituents. In this terminology, properties do not relate one linguistic object to another but can be thought of as one-place predicates. They are generally associated with a set of possible values; for some the values are ‘True’/‘False’ or ‘Present’/‘Absent’, while for others it may be one of several possibilities with linguistic meaning such as a paradigm, as with the property Case which can have the values Accusative/Ergative/Dative, etc.

3. Linguistic relations. These model a phenomenon involving two or more linguistic objects or properties. For example, following Corbett (1998: 191), Agreement is modelled as a relationship involving a controller (‘the element which determines agreement’), a target (‘[t]he element whose form is determined by agreement’), a domain (‘[t]he syntactic environment in which agreement occurs’), and agreement features (‘in what respect there is agreement’). The participants in a relation play distinguished roles, whose names may be particular to each relation: for Agreement, the roles are controller, target, etc. Complex phenomena that are not explicitly relational are also treated in terms of roles: for example, Stress Assignment can be described as involving a Method (algorithm) that makes reference to types of feet, edge-sensitivity, extrametrical material, etc.

5.3.2.2. Relationships between linguistic Concepts

Entities in the ontology are organized according to the following major relationship types:

1. Subsumption. Some linguistic Concepts are specializations of others. For instance, ‘grammatical case’ is subsumed by the more general Concept ‘case’.

2. Loose synonymy. This designates variant linguistic terminology used to refer to the same phenomenon. When two phrases denote the same conceptualization of a phenomenon, it is useful to link them in order to provide a means of searching using different vocabulary than that used for naming the ontology Concepts. (Loose) synonymy between two phrases is currently implemented as an annotation on the class (essentially, a data property that gets a value for the entire class). The Concept Agreement Marker is for example annotated with the alias Person Inflection.

3. Related phenomena. This identifies variant linguistic terminology used to refer to similar or related phenomena. For instance, the Concept Basic Word Order has this relationship to Predicate-Based Word Order. Although the two phrases denote somewhat different conceptualizations of phenomena, it is useful to link them in order to provide a means of unified searching across both component databases in which the terms occur. As neither of the current standard annotations owl:sameAs and owl:equivalentClass (Bechhofer et al. 2004) captures our required semantic correspondence, we use our own annotation, tds:equatesWith, to equate two related Concepts.

4. Meronymy. This stipulates part–whole relations.23 Some linguistic concepts are modelled in a strict hierarchical structure. For example, mora > syllable > foot > … form a meronymic hierarchy. Note that their relationship to each other cannot be expressed through subsumption; a syllable, for example, is not a kind of foot, but part of one. Certain linguistic hierarchies are organized so that units of one type are a direct part of the next higher unit, e.g., in the “prosodic hierarchy” (Nespor and Vogel 1986), which is a hierarchy of utterance constituents from a prosodic perspective in which “[e]very prosodic category in the hierarchy has as its head an element of the next-lower level category” (Kager 1999: 146). The direct-part-of relation is a specialization of the general part–whole relation.24 We encode part–whole relationships via a meronymic predicate, isDirectPartOf (and its transitive closure, isPartOf). This relation is asserted between pairs of classes, and serves to organize them into meronymic hierarchies.

5. Determination. We use this appellation when a linguistic property is defined in terms of one or more other linguistic properties. For example, if a heavy syllable is defined as a syllable with a long vowel or a coda, then both of these are determinants of the property Syllable Weight (even though the presence of only one is enough to make a syllable heavy).25 The determinant relation is a logical (not linguistic) relationship between linguistic properties or relations.

6. Form-function relationship. Here “function” is used in a specific sense. It refers to the linguistic function served by some linguistic entity. This relationship associates entities of the type linguistic object with linguistic properties expressing their possible linguistic function. For example, the Concept Agreement Marker is a possible function of Affix. A form-function relationship is a (type of) linguistic relation, and it is implemented accordingly (i.e., as an OWL Class with roles expressed as OWL Properties).

23 A variety of meronymic relationships may be required. For instance, Story (1993), based on Landis et al. (1987), Winston et al. (1987) and Chaffin et al. (1988), lists seven types of meronymic relations: component–object, member–collection, portion–mass, stuff–object, phase–activity, place–area and feature–event.

24 Our implementation follows the current W3C recommendation, which calls for expressing meronymic relationships in terms of a direct part relation when “what is needed is not a list of all parts but rather a list of the next level breakdown of parts, the ‘direct parts’ of a given entity” (Rector and Welty 2005).

25 Determination only holds when the definition of a concept involves aspects of another. It should not be confused with empirically based implicational relationships. In the latter case we have two concepts which are independently defined, and the implicational relationship is an empirical (contingent) fact rather than part of their meaning.
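The relation between isDirectPartOf and its transitive closure isPartOf, described under Meronymy, can be illustrated with a minimal Python sketch. It is not the TDS implementation (which asserts these predicates between OWL classes); the dictionary and function names are hypothetical, and only the three units named in the text are included.

```python
# Hypothetical direct-part assertions between classes (illustrative only;
# the TDS-GO states these in OWL, not in Python).
IS_DIRECT_PART_OF = {
    "Mora": "Syllable",
    "Syllable": "Foot",
}


def is_part_of(part: str, whole: str) -> bool:
    """Return True if `whole` is reachable from `part` via direct-part links.

    This is the transitive closure of the direct-part relation: a mora is a
    direct part of a syllable, and therefore also a part of a foot.
    """
    seen = set()
    current = IS_DIRECT_PART_OF.get(part)
    while current is not None and current not in seen:
        if current == whole:
            return True
        seen.add(current)  # guards against accidental cycles in the data
        current = IS_DIRECT_PART_OF.get(current)
    return False
```

Under these assumptions, `is_part_of("Mora", "Foot")` holds although no direct-part link between the two is asserted, whereas `is_part_of("Foot", "Mora")` does not.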

5.4. Topic taxonomies

The global ontology as described has a formal structure, which is needed to enable “smart search” facilities. However, this does not make it a natural entry point to the data collection for end-users. The placement of Concepts in the global ontology, while formally correct, will not always be intuitive. To minimize possible confusion, the global ontology is currently largely invisible in the TDS user interface.26 Its primary function is in the back end of the system, where it facilitates the coherent unification and integration of diverse linguistic concepts. The process of local and global ontology creation results in fixed Notion and Concept hierarchies, each of which reflects one of many possible organizations of the (meta)data. By introducing topic taxonomies we allow multiple organizations of metadata for different types of uses. A topic taxonomy is a relatively small list of domain-specific terms organized into a loose hierarchy. Such taxonomies are designed to capture the perspective of different sub-domains of the linguistic space. The data of component databases is not directly associated with topics in the taxonomies, because it would be impractical to maintain links to multiple topic taxonomies as the collection of component databases grows over time. Instead, the global ontology serves as a common frame of reference at the nexus between metadata and topic taxonomy. Topics in the taxonomies and Notions in the component databases are both linked to Concepts in the global ontology, allowing taxonomy topics to be indirectly related to database fields. (The indirectness of the link is not apparent to the users, who only see a list of fields and grouping Notions associated with each taxonomy topic.)

26 Concept definitions are shown to the end user when they (partially) match a full text search, or in relationship to a Notion, allowing the user to fine-tune the search or disambiguate the meaning of the Notion.

The default taxonomy provides a complete overview of topics currently covered by the component databases. Its initial structure was based on the table of contents of Thomas Edward Payne’s book Describing Morphosyntax: a guide for field linguists, but has been extended with more topics to cover the entire domain of topics in the TDS. In this taxonomy the topic “Complement clauses,” a daughter of the topic “Clause combinations, coordination,” is linked to the Concept “Syntactic complement” in the global ontology. This Concept has direct relationships with nine Notions in the DTL specifications, e.g., the Notions tdn:Form_of_the_complements and tdn:ComplementOfCopula. When we take the semantic context of these Notions into account, the topic “Complement clauses” is directly or indirectly related to 63 Notions. The Notions that are directly related will be ranked highest when presented to the user. No domain-specific taxonomies have been created to date, but it is envisaged that some may be created in the future, e.g., one limited to the domain of phonology.
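The indirect link from taxonomy topic to Notion, with directly related Notions ranked first, can be sketched in Python as follows. This is only an illustration under stated assumptions: the two tdn: Notions and the topic and Concept names come from the text, while the data structures, the function name and the Notion ex:SomeIndirectlyRelatedNotion are hypothetical.

```python
# Topic -> Concepts in the global ontology (names taken from the text).
TOPIC_TO_CONCEPTS = {
    "Complement clauses": ["Syntactic complement"],
}

# Concept -> (Notion, directly_related?) pairs. The last Notion is invented
# to stand in for the indirectly related Notions the text mentions.
CONCEPT_TO_NOTIONS = {
    "Syntactic complement": [
        ("ex:SomeIndirectlyRelatedNotion", False),  # hypothetical
        ("tdn:Form_of_the_complements", True),
        ("tdn:ComplementOfCopula", True),
    ],
}


def notions_for_topic(topic):
    """Collect the Notions reachable from a topic, direct links first."""
    hits = []
    for concept in TOPIC_TO_CONCEPTS.get(topic, []):
        hits.extend(CONCEPT_TO_NOTIONS.get(concept, []))
    # sorted() is stable, so Notions keep their relative order within
    # the "direct" and "indirect" groups.
    ranked = sorted(hits, key=lambda hit: not hit[1])
    return [notion for notion, _direct in ranked]
```

Because topics and Notions both point only at Concepts, a new component database needs links to the ontology alone; no taxonomy has to be edited.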

6. TDS implementation

The TDS architecture shown in Figure 1 is close to a semantic web application, although it is an application-specific web.27 This made it possible for some core technologies used in the metadata network to follow W3C recommendations or working drafts. Topic taxonomies make use of the SKOS vocabulary, and the global ontology is defined in OWL. A benefit of using these standards is that externally developed tools are available to aid in managing the corresponding components, e.g., the TDS-GO is edited with the Protégé ontology editor with a plug-in for OWL.

27 The TDS web interface is of course part of the World Wide Web, and some descriptions of Notions, Concepts and topics refer to web pages on other sites. However, the semantic knowledge base isn’t open and thus is not part of the semantic web as envisioned by W3C.

The DTL was developed in-house, as no suitable specification language or toolkit was found. Early versions of the TDS used XSLT as the transformation language, with the result that creating specifications required extensive knowledge of the low level implementation details of the TDS. The DTL was designed with the goal of allowing the knowledge engineer, a linguist, to create such specifications. In addition, it supports the addition of enriched metadata about the database semantics. The DTL is a declarative domain-specific language for transformation rules, which can be annotated with metadata that express the semantics of the resulting data (schema). The DTL Engine, the interpreter that carries out the database integration according to the DTL specifications, is implemented in Java.

One of the first products of the TDS implementation phase was a tool to import data from various database formats, called TDS Localize. This tool is implemented in C, and uses a plug-in architecture to allow the dynamic addition of loaders for new DBMSs. The current set of plug-ins provides access to databases using the following interfaces: ODBC (e.g., MySQL, Microsoft SQL Server), ODBTP (e.g., Microsoft Access) and CSV exports (e.g., SPSS, Microsoft Excel). The DTL engine relies on this tool to import most of the databases. By now some C plug-ins (e.g., for CSV and XML files) have been replaced by Java equivalents, which are loaded directly by the DTL engine; but TDS Localize is still needed because C libraries are still the only way to access some major DBMSs (e.g., Microsoft Access is only accessible via ODBTP).28

The result of the import, transform, merge and enrich process, as implemented in the DTL engine and specified by a set of database specific DTL specifications, is a large XML document containing the instantiated, and interlinked, Notion hierarchies. XML is well suited for this kind of semi-structured data (semi-structured because the resulting data structures may overlap only partially, allowing very sparsely filled data structures).
The use of XML also enables us to draw on the wealth of technology standards and tools which have been produced since its birth out of SGML in 1998, e.g., XSLT and XQuery. The TDS web interface uses two additional modules. The front end is powered by Backbase (http://www.backbase.com/), a Rich Internet Application (RIA) library, which uses JavaScript to extend the browser with a set of sophisticated GUI widgets allowing the creation of web applications which resemble desktop applications in functionality. The server backend uses the 1060 NetKernel application server (http://www.1060.org/), which incorporates a very rich set of tools to handle XML. The most prominent tool is Saxon (http://www.saxonica.com), the leading open source XSLT and XQuery engine.29 This engine is used to execute the queries and to style the query interface and results.

28 A complete switch to Java would allow the complete TDS framework to be deployed on any platform that runs Java. TDS Localize is currently the only tool that binds the TDS to the UNIX family of operating systems.

7. The user interface

The TDS webserver must provide the functionality of a specialized interactive application while running on an ordinary web browser. This is accomplished with the help of Backbase, a sophisticated library of JavaScript routines that extend the web browser’s built-in capabilities. JavaScript is executed by the user’s web browser, which must therefore be of sufficiently recent vintage, and must have JavaScript enabled. The Backbase library provides user-interface enhancements such as pop-up messages, “tabbed” sub-pages, and a rich system of windows and menus, all managed within the browser window. The library makes it possible for many interactive operations to be performed locally on the user’s computer, avoiding a time-consuming request to the TDS webserver at each step.30

29 Saxon is not the ideal tool for the purposes of the TDS. At the start of the TDS implementation phase, Saxon was the only reasonably up-to-date XQuery processor. However, Saxon is document-oriented, which means that in principle it has to reread the source XML document, the TDS data, for each query. The 1060 NetKernel helps by caching the parsed XML document, but the resource consumption is still quite high as the document has to be fully loaded into memory (which takes on average 5 times the size of the document on disc!). Since XQuery gained W3C recommendation status, more and more standards-compliant XML database engines are becoming available, and could replace Saxon. An XML database engine would, among other optimizations, take care of loading only XML document “hot spots” into memory; hence it is expected that resource consumption will be more modest, and response times will drop to a fraction of current levels. At the time of writing, however, the TDS implementation still uses Saxon as the XQuery engine, and has not been ported to an XML database engine.

30 The extra functionality comes at some cost. “Rich internet application” libraries are still relatively unstable technology, due especially to browser inter-compatibility limitations, and place heavy processing demands on the user’s browser. The TDS server is only compatible with relatively recent, full-featured versions of Internet Explorer and Firefox; the TDS interface takes 10–20 seconds to start up, or even more on slow computers, because of the complexity of the JavaScript library; and the TDS interface is poorly integrated with the browser’s Back button (it is therefore recommended that its use should be avoided).

The opening page of the TDS server provides the starting points for using the system, as well as links to the expected information and support pages. This includes documentation about the Project, a tutorial walkthrough that introduces new users to the basics of using the system, descriptions of the databases included, and technical features and limitations of the TDS server.

As mentioned in section 4.1, querying the TDS is a two-step process: The pre-query involves selecting database fields of interest and adding them to the query basket. The user then opens the query basket and specifies query options and search terms for the actual query. The query can then be submitted to the server, and the results are returned and displayed.

The goal of the pre-query stage is to find database fields whose contents are relevant to the user’s goals. The TDS provides ways to search or browse through the descriptions of fields. In the Search tab, the user types search terms in a form and is shown a list of database fields with matching metadata. Suppose, for example, that a user wants to look up whether Swahili is listed as a null-subject language. They can look for relevant database fields by typing “null subject” in the search field of the Search tab, which will reveal (alongside a large number of other partial matches) a field named pro-drop. The documentation displayed with this field indicates that it comes from the TDN database, and that it is indeed relevant to the task at hand. It should be noted that the search terms are not only matched against the text of the field names and descriptions, but also against the possible values of each field and the global ontology Concepts to which the field and values are linked. (Matching Concepts are also shown in the list of results.)
In this way the global ontology of linguistic Concepts, and the links between them, serve as a system of structured keywords that guide searches. For example, the database field pro-drop is linked to the Concept NullSubject. Searching for “null subject” will match the field pro-drop via this indirect link, even though the term does not appear in the field’s description. Once a field of interest is found, a user may add it to the “query basket,” a sort of shopping cart for database fields. As an alternative to typing in search terms, a user can browse the hierarchy of topics and data fields in the TDS (i.e., the taxonomy topics and local ontology Notions). The TDS interface provides several alternative hierarchies for navigation. The “view by datatype” tab matches the hierarchy of entities and thematic groupings in the TDS data. At the top of the hierarchy are the objects described: a language treated as a whole, a unit of text (usually a glossed sentence with additional annotations), or an external entity


that exists independently of any particular language, such as bibliographic sources or the universal phoneme inventory.31 Most component databases of the TDS consist of “analytical” parameters, which describe a language as a whole. Hence, most information is found under the heading Language. This contains language identification information (name, ISO code, location and genetic affiliation), and a large group of properties labeled Linguistic Phenomena, which is where information about linguistic systems and properties is concentrated. Users of the TDS will most often search under this node. The second view through which the user can browse for data fields is organized by topic. Here the linguistic topics covered by the component databases of the TDS have been organized into a shallow hierarchy, or taxonomy. It is the “table of contents” of the TDS, much as the table of contents of a linguistics textbook provides an overview of the topics it addresses. 32 Because the description of a database field often involves reference to several linguistic Concepts, a database field may appear under several topics in the hierarchy. The TDS currently provides two different topic taxonomies; it is planned that additional taxonomies may be added, customized to particular kinds of uses or subfields of linguistics (e.g., for phonology-related topics). Once the user has added some fields to the query basket, they can proceed to the second stage of the search procedure, the formulation of the query itself. The query basket is viewed in a separate window, which lists each collected field or group along with its description and controls for defining the search query. 
To this end, the user must decide on (i) what selection criteria to specify (e.g., “the field pro-drop must have the value True”), and (ii) which fields to display in the results.33 If one wants to know the names and language families of languages that have pro-drop, they would specify “pro-drop = True” as the selection criterion and include Language name and Genetic affiliation for the fields to display.

31 In a relational database, each of these would correspond to a separate table. The TDS equivalent is a so-called Root Notion.

32 The topic hierarchy is in fact modelled on the table of contents of Payne (1997), with the necessary adaptations.

33 It is possible to specify selection criteria for a field but suppress its display; for example, if one only selects languages that have pro-drop, there is no need to see the value “pro-drop = True” for each language shown. If multiple values can match the selection criteria (e.g., “Basic Word Order = SVO or VSO”), it is of course necessary to display the relevant field if its value is of interest.
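The distinction between selection criteria and fields to display can be made concrete with a small Python sketch. It only mimics the logic of the query stage; the TDS itself executes queries with an XQuery engine over XML. The function name, the record layout and the placeholder languages and families are all invented for illustration.

```python
# Placeholder records; language names, families and values are invented.
RECORDS = [
    {"Language name": "Language A", "Genetic affiliation": "Family X", "pro-drop": "True"},
    {"Language name": "Language B", "Genetic affiliation": "Family Y", "pro-drop": "False"},
    {"Language name": "Language C", "Genetic affiliation": "Family X", "pro-drop": "True"},
]


def run_query(records, criteria, display):
    """Keep records meeting every criterion; return only the display fields.

    `criteria` maps a field name to the set of acceptable values, so a
    condition such as "Basic Word Order = SVO or VSO" is a two-element set.
    """
    rows = []
    for record in records:
        if all(record.get(field) in allowed for field, allowed in criteria.items()):
            rows.append({field: record.get(field) for field in display})
    return rows


rows = run_query(
    RECORDS,
    criteria={"pro-drop": {"True"}},
    display=["Language name", "Genetic affiliation"],
)
```

Here pro-drop is used for selection but suppressed in the output, which corresponds to the case described in footnote 33.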

Only fields in the query basket can be used for this process. If the user wants additional fields, they must resume the pre-query process and add the required fields to the basket. The method of specifying selection criteria depends on the type of the data field. For fields with “enumerated” values, which come from a fixed set of choices, the query window displays a control that allows for the selection of one or more desired values. For text fields, users can enter text to search for, with a choice between exact and partial (substring) match. For example, one can search for all phonetic segments whose description includes the words unrounded vowel. Text fields, and searches, may use any Unicode character.34

Once the query has been defined in this fashion, it can be submitted to the system. The result appears in a new window. (Multiple query views can be open simultaneously, and each is automatically assigned a number. They remain open when the query is submitted and can be resumed, modified and resubmitted.35) The results are displayed in table format by default, but the TDS supports several alternatives. At the top of the results window is a link (labeled result settings) that allows switching between display modes. The “report” format displays each field and value on a separate row, and is useful when very many data fields have been selected for display, or with long text fields. The “summary” style presents counts for the different values (e.g., number of languages in the TDS listed as having each word order), and constitutes the only kind of data statistics currently provided by the TDS. There are also two export formats, XML and comma-separated values (CSV); these
are useful for exporting the results to another database application or spreadsheet.

34 A small icon provides a shortcut to the TDS IPA Console, an external application that can be used to paste special characters (especially IPA characters) into form fields. The TDS IPA Console is a Java application developed as a stand-alone tool by the TDS Project. It was developed to simplify the entry of IPA characters in phonological query fields, but it can be used with any other program that can import (paste) text from the system clipboard: text editors, spreadsheets, web browsers, etc. The Console comes with a large number of predefined buttons for entering IPA characters, arranged on several tabs in the familiar IPA table layout; users can define additional keys bound to any Unicode character (which can be selected from a list or specified by its “Unicode number”).

35 At the time of writing, support for modifying a query window is limited. Selection and projection criteria can be changed, but new fields cannot be added to an open query view window. Instead, new fields must be added to the query basket and a new query window must then be opened.

The TDS system, as can be seen from the presentation of its features, is oriented towards discovering and viewing data in the component databases. Because of the mixed nature of its content, the set of languages found in the TDS does not constitute an areally or genetically balanced sample. While balanced sub-samples can be defined among the great number of languages present in the aggregated TDS data, such collections will not have values for all possible typological parameters (database fields); thus, a balanced sample can only be defined on a per-query basis, or for one component database at a time. This feature is not currently supported by the TDS; because of the additional uncertainties involved in aggregating diverse databases, we advise against attributing statistical significance to the results retrieved from the system.
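The “summary” display style mentioned earlier in this section amounts to a tally of the values of one field. A minimal Python sketch, with an invented function name and placeholder rows, might look as follows; as the caveat above implies, such counts describe the aggregated data and should not be read as a balanced typological sample.

```python
from collections import Counter

# Placeholder result rows; the languages and values are invented.
ROWS = [
    {"Language name": "Language A", "Basic Word Order": "SVO"},
    {"Language name": "Language B", "Basic Word Order": "SOV"},
    {"Language name": "Language C", "Basic Word Order": "SVO"},
    {"Language name": "Language D"},  # no value recorded for this field
]


def summarize(rows, field):
    """Count how often each value of `field` occurs, skipping missing values."""
    return Counter(row[field] for row in rows if field in row)


counts = summarize(ROWS, "Basic Word Order")
```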

8. Working with metadata

The TDS framework for data integration allows a lot of flexibility in carrying out the integration process. Just how this should be done depends on the particular database being integrated, on the general principles that the TDS team has adopted, and not infrequently on the subtleties of relevant linguistic theory. For each database included in the TDS system, a DTL schema is written that defines the transformation of the source data into the unified data schema of the TDS. After a careful examination of the original database, and usually following some discussion with the database creators, members of the TDS team create an initial schema file that imports the data from the database into “field Notions,” i.e., nodes in the DTL hierarchy representing a database field, essentially with no changes. (As already described, the DTL schema includes a reference to a plug-in that establishes access to the database.) The TDS knowledge engineer then embarks on reorganizing the data fields of the component database so that they match the hierarchy and encoding conventions of the TDS system, and on entering documentation for all Notions in the resulting schema.

The simplest kind of alteration (conceptually at least) is the recoding of values for common types. For example, Boolean (true/false) fields are represented with the standard values “True” and “False” in the TDS. If a field uses the values “+” and “–”, “yes” and “no”, or “1” and “0”, these will be mapped to the TDS standard.
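The Boolean recoding just described could be sketched in Python roughly as follows. In the TDS such rules are written declaratively in the DTL, so this is only an illustration of the mapping; the table and function names are hypothetical, and accepting the ASCII hyphen alongside “–” is an added assumption.

```python
# Source codings named in the text, mapped to the TDS standard values.
BOOLEAN_CODES = {
    "+": "True", "yes": "True", "1": "True", "true": "True",
    "–": "False", "no": "False", "0": "False", "false": "False",
    "-": "False",  # assumption: ASCII hyphen used in place of "–"
}


def recode_boolean(raw: str) -> str:
    """Map a database-specific Boolean coding to "True" or "False"."""
    key = raw.strip().lower()
    if key not in BOOLEAN_CODES:
        # Unknown codings are surfaced rather than guessed at.
        raise ValueError(f"unrecognized Boolean coding: {raw!r}")
    return BOOLEAN_CODES[key]
```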

Another kind of recoding involves various kinds of cleaning up the values found in a database. Component databases often use text fields for parameters that logically allow only a restricted set of values (true/false parameters, word orders such as “SVO”, “SOV”, etc.). This opens the door to inadvertent misspellings, typos or other irregularities in the content of such fields, which must be resolved if they are to be correctly matched to query selection conditions. DTL specifications often include ad hoc rules to address such problems. Sometimes the database creators append comments to the logical value of the field, most frequently question marks or other annotations indicating uncertainty about the correctness of the value. Such annotations constitute important information for the end user; they are separated from the value itself and recorded as an “uncertainty marker,” which is automatically displayed whenever the associated value is displayed in the results of some query. (Cf. section 9 for further discussion of uncertainty annotations.)

Other fields are combined or split up to yield more consistently organized Notions. A database will sometimes use a different data field for each possible value of a linguistic property; e.g., one field for the basic word order SVO, another field for SOV, etc. Such decomposition into “characters” has its uses, but it does not meet the design guidelines of the TDS; logical fields are reconstituted, so to speak, by mapping groups of such database fields into a single Notion. In some cases, a new Notion is computed in more elaborate ways from one or more database fields. The field Notion “pro-drop” (whether a language allows main clauses lacking an overt subject) is computed from its logical complement, a field in the TDN database that records whether a language always requires overt subjects in main clauses.
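Three of the transformations described above (separating an uncertainty marker, reconstituting a logical field from one-field-per-value columns, and computing a Notion from its logical complement) can be sketched in Python. The function names and the source field names wo_svo and wo_sov are hypothetical, and only trailing question marks are treated as uncertainty annotations here; actual DTL specifications may handle other annotations.

```python
def split_uncertainty(raw: str):
    """Separate trailing question marks from a value.

    Returns (value, marker); marker is None when no question mark is present.
    """
    text = raw.strip()
    core = text.rstrip("?")
    marker = text[len(core):]
    return core.strip(), (marker or None)


def merge_value_fields(record: dict, field_to_value: dict) -> list:
    """Collapse one-field-per-value columns into the list of values that hold."""
    return [value for field, value in field_to_value.items()
            if record.get(field) == "True"]


def complement(value: str) -> str:
    """Logical complement of a TDS Boolean, e.g. to derive pro-drop from a
    field recording that overt subjects are always required."""
    return {"True": "False", "False": "True"}[value]
```

For instance, `split_uncertainty("SVO ?")` yields the value "SVO" with the marker "?", and a record with wo_svo set to "True" and wo_sov set to "False" merges to the single value "SVO".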
A more complicated example, involving the computation of the property Trochaic language from a combination of other properties, is described in the next section. A major part of harmonization is the assignment of Notions to the proper place in the TDS hierarchy. Database fields whose semantics are sufficiently similar to those of other databases (e.g., language name) are mapped to Notions with global scope. Such Notions are typically already defined elsewhere, in the master DTL specification; the schema for the component database instantiates them with data from the database. (If a Notion belongs in the global scope but is not already defined, it can be added to the master DTL schema at this point.) Most Notions, however, are defined in a way that is specific to their database of origin and will receive local scope. They are grouped and organized hierarchically to represent the conceptual relations between the various fields in the database. Hierarchies

How to integrate databases without starting a typology war


of local Notions are embedded in the shared hierarchy; for example, the master DTL file defines the Notion tds:Phenomena (containing all linguistic phenomena) and a daughter called tds:BasicWordOrder, representing all information on word order from all databases. Databases containing word order information will define Notions with local scope within this subhierarchy, as appropriate. If some Notions defined with local scope are later found to be shared by several databases, they can be transferred to global scope so that they can be shared as needed.

In addition to situating and instantiating field Notions properly, the TDS knowledge specialists must look after their descriptive metadata. These consist of descriptive documentation entered directly into the DTL and associated with Notions at any grouping level, and of links to Concepts in the global ontology. Appropriate links facilitate discovery of these Notions through the search interface, and bridge the gap between database-specific (local) semantics and global semantics.

Finally, an important part of the data import process is the harmonization of key values. Almost all typological data describe individual languages; merging data from different databases relies on knowing which language is being described. The TDS accomplishes this by basing the primary key for each language on its ISO code (SIL code). Ideally a component database will provide ISO codes for every included language; in practice there are plenty of omissions, differences between versions of the Ethnologue, and other complications that are resolved by the knowledge engineers on a case-by-case basis. Similar problems arose in the harmonization of phoneme inventories; this case is described below (section 8.2). First, we turn to a simple example of constructing a TDS Notion that does not correspond to a single database field.

8.1. Finding “trochaic languages”

The StressTyp component database contains information on the foot type that is used in the derivation of stress systems in various places. Feet are used to derive the locations of primary and secondary stress (the separation being motivated by arguments not discussed here, but see Goedemans and Van der Hulst, this volume). As is well known, feet come in two flavours, iambic (right-headed) and trochaic (left-headed). In the view of StressTyp, a language is said to be a trochaic language if any of the following is true:

1. The default foot used to place stress in the primary stress domain (in every case in quantity-insensitive languages, and in case both syllables are light in quantity-sensitive languages) is trochaic.
2. The foot used to derive the location of rhythm beats (secondary stress) is trochaic.
3. Both iambic and trochaic feet are used in the derivation of rhythm.

StressTyp does not directly record whether a language is trochaic, but the property is easily calculated. To find a list of languages with these parameters from within the StressTyp database, we need to execute the SQL query shown in Example 4:

    SELECT Name
    FROM Stress
    WHERE Rhythm_Type = 'trochaic'
       OR Rhythm_Type = 'both'
       OR Stress_l_l = 'trochaic';

Example 4. Trochaic feet SQL query
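The query in Example 4 can be tried out on a toy table. The following is a minimal sketch in Python using the standard sqlite3 module, not the StressTyp system itself: the table layout is assumed from the field names in the query, the rows are invented, and an ORDER BY clause is added only to make the output predictable. String literals are written with single quotes, as standard SQL expects.

```python
import sqlite3

# Hypothetical miniature of the StressTyp "Stress" table; the rows are invented
# for illustration and do not describe real languages.
ROWS = [
    ("Lang A", "trochaic", "iambic"),
    ("Lang B", "iambic", "iambic"),
    ("Lang C", "both", "iambic"),
    ("Lang D", "iambic", "trochaic"),
    ("Lang E", None, None),  # nothing recorded yet
]

# The Example 4 query, plus an ORDER BY that is not part of the original.
TROCHAIC_QUERY = """
    SELECT Name
    FROM Stress
    WHERE Rhythm_Type = 'trochaic'
       OR Rhythm_Type = 'both'
       OR Stress_l_l = 'trochaic'
    ORDER BY Name
"""


def trochaic_languages(rows):
    """Load the rows into an in-memory table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Stress (Name TEXT, Rhythm_Type TEXT, Stress_l_l TEXT)"
    )
    conn.executemany("INSERT INTO Stress VALUES (?, ?, ?)", rows)
    names = [name for (name,) in conn.execute(TROCHAIC_QUERY)]
    conn.close()
    return names


if __name__ == "__main__":
    print(trochaic_languages(ROWS))
```

Note that a record with empty (NULL) fields, such as the last row, never matches: a comparison with NULL is neither true nor false, so the language is silently left out of the result rather than reported as non-trochaic.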

For the TDS, it was decided that this property should be directly searchable in the system. Accordingly, a local Notion was defined, and its value is computed from the relevant StressTyp fields as appropriate (Example 5). The result is an ordinary field Notion that can be used in queries like any other field. A query for trochaic languages returns a list of 160 matching languages.

    NOTION stressTyp:trochaicFeet
    LABEL "Trochaic language"
    DESCRIPTION "Trochaic feet are used in the analysis of the stress pattern
                 observed in this language"
    LINK TO CONCEPT trochaicProperty
    IS (
        FIELD Rhythm_Type="both"
        OR FIELD Rhythm_Type="trochaic"
        OR FIELD Stress_l_l="trochaic"
    );

Example 5. Trochaic feet DTL Notion
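The kinds of computed Notions discussed in this section can be pictured as small functions over one database record. The sketch below is written in Python rather than DTL and is only an illustration of the logic: apart from Rhythm_Type and Stress_l_l, which appear in Example 5, every field name is a hypothetical stand-in.

```python
def trochaic_language(record):
    """Mirror of Example 5: true if any of the three conditions holds."""
    return (
        record.get("Rhythm_Type") in ("both", "trochaic")
        or record.get("Stress_l_l") == "trochaic"
    )


def pro_drop(record):
    """Logical complement of a field recording obligatory overt subjects."""
    required = record.get("overt_subject_required")  # hypothetical field name
    # An unknown source value stays unknown instead of becoming True.
    return None if required is None else not required


# Hypothetical "one field per value" layout for basic word order.
WORD_ORDER_FIELDS = ("SVO", "SOV", "VSO")


def basic_word_order(record):
    """Reconstitute one logical value from a group of per-value fields."""
    return [order for order in WORD_ORDER_FIELDS if record.get(order)]


if __name__ == "__main__":
    print(trochaic_language({"Rhythm_Type": "both"}))
    print(pro_drop({"overt_subject_required": True}))
    print(basic_word_order({"SVO": True, "SOV": False}))
```

Returning a list from basic_word_order is a design choice of the sketch: it keeps records in which more than one per-value field is set, which a single-valued field could not represent.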

8.2. Harmonizing phoneme inventories

The core source of information on phoneme inventories in the TDS is Ian Maddieson’s UCLA Phonological Segment Inventory Database, or UPSID (Maddieson 1984; Ladefoged and Maddieson 1996). This database contains a set of 920 segments, from which individual languages select a subset to


form their specific phoneme inventory. The segments in the original UPSID database are represented in a highly idiosyncratic, SAMPA-like encoding. This means that the identity of a certain segment is not transparent to users unless they are thoroughly versed in the UPSID encoding system. The TDS Project has adopted the international standard IPA representations for phoneme representation. This was not possible when UPSID was originally created, but is relatively simple nowadays, since there are Unicode fonts in common use that contain the (majority of) necessary IPA characters. Therefore, the first stage of the harmonization of UPSID was to transliterate the 920 SAMPA-encoded segments into IPA. This achieved the dual purpose of upgrading to the more user-friendly and standard IPA, and enabling integration of UPSID with other databases containing information about phonemes. Thus a searchable list was created, encoded in Unicode IPA, together with a phonetic description of their articulatory and/or acoustic properties. This list is a reasonably comprehensive (but certainly incomplete) collection of the possible phonemes in human language, and might be of some interest in itself. But the main motivation in creating it was to support IPA notation in querying and displaying the phoneme inventories of individual languages, or groups of languages. Representing the results of such queries on screen in a user-friendly fashion also proved to be a challenge. A long list of consonants and vowels is much less informative than a neat table, comparable to those found in grammars and IPA charts, where consonants and vowels are ordered according to their place and manner of articulation. This information is available in UPSID, along with all other segmental features. In order to be able to represent phoneme inventories on screen, we created the Universal Phoneme Position Chart (UPPC). 
For consonants, the UPPC is in essence a single table of 30 columns and 171 rows, in which all segments are represented according to their manner and place attributes. The treatment of vowels is similar. The large number of rows results from the extensive number of secondary articulations that are potentially distinctive in individual languages. An excerpt from the upper-left corner of the UPPC is presented in Table 1. With the aid of the UPPC, TDS query results involving phonemic inventory data can be presented on screen in table format. To generate an inventory of a single language, empty columns and rows are automatically removed, resulting in a table of manageable proportions. As an extra feature, we have added the ability to render tables from queries over multiple languages. Such queries result in aggregated “phoneme inventories” presenting the phonemes of all languages in the query. This is supplemented

by numbers indicating their frequency, with coloured backgrounds to reveal “hot spots” of common or very common phonemes.

Table 1. The initial five rows and six columns of the Universal Phoneme Position Chart

                         Bilabial              Labiodental           Dental
                         voiceless  voiced     voiceless  voiced     voiceless  voiced
Plosive                  p          b                     b̪          t̪          d̪
Plosive, aspirated       pʰ                                          tʰ̪
Plosive, preaspirated    ʰp
Plosive, palatalized     pʲ         bʲ                               tʲ̪         d̪ʲ
Plosive, labialized      pʷ         bʷ
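The display logic described above (pruning empty rows and columns, and counting how often each phoneme occurs across the queried languages) can be sketched as follows. This is an illustrative Python fragment under assumed data structures, not the TDS implementation; the chart excerpt, the row and column orders and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical fragment of a position chart: each segment is keyed to its row
# (manner plus secondary articulation) and its column (place, voicing).
CHART = {
    "p": ("Plosive", ("Bilabial", "voiceless")),
    "b": ("Plosive", ("Bilabial", "voiced")),
    "p\u02b0": ("Plosive, aspirated", ("Bilabial", "voiceless")),   # pʰ
    "p\u02b7": ("Plosive, labialized", ("Bilabial", "voiceless")),  # pʷ
    "t\u032a": ("Plosive", ("Dental", "voiceless")),                # t + bridge below
    "d\u032a": ("Plosive", ("Dental", "voiced")),                   # d + bridge below
}
ROW_ORDER = ["Plosive", "Plosive, aspirated", "Plosive, preaspirated",
             "Plosive, labialized"]
COL_ORDER = [(place, voicing)
             for place in ("Bilabial", "Labiodental", "Dental")
             for voicing in ("voiceless", "voiced")]


def render(inventories):
    """Aggregate inventories into chart cells, dropping empty rows and columns.

    `inventories` is a list of sets of segments, one set per language. Each
    cell maps (row, column) to a list of (segment, frequency) pairs.
    """
    freq = Counter(seg for inv in inventories for seg in inv)
    cells = {}
    for seg, count in freq.items():
        row, col = CHART[seg]  # a segment missing from the chart raises KeyError
        cells.setdefault((row, col), []).append((seg, count))
    rows = [r for r in ROW_ORDER if any(key[0] == r for key in cells)]
    cols = [c for c in COL_ORDER if any(key[1] == c for key in cells)]
    return rows, cols, cells


if __name__ == "__main__":
    rows, cols, cells = render([{"p", "b", "t\u032a"}, {"p", "p\u02b0"}])
    print(rows)
    print(cols)
```

With a single inventory the frequencies are all 1 and the result is an ordinary phoneme table; with several, the counts are the raw material for the frequency figures and colour coding.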

8.3. Expanding the coverage

Starting with UPSID’s extensive segment pool, we hoped to be able to incorporate other databases containing information on segment inventories relatively easily. The IPA representations, which are created from templates with fixed orders for multiple co-articulation modifiers to the central symbol, act as the database keys to which we can link phonemes from other databases. Our expectation was that the UPPC would require little supplementation, beyond the 920 values from UPSID, to support other databases with phonemic information. But our experience with the incorporation of the SPIN database proved otherwise.

SPIN contains phoneme inventories for 110 languages. It is unique among the TDS databases in that it did not enter the Project as an electronic database; the data was available to us only on handwritten index cards, and was digitized by TDS participants.36 Unicode IPA characters were used throughout, to ensure compatibility with the UPPC and the rest of the TDS.

36. The data were entered in Excel worksheets using the TDS IPA Console, and imported into the TDS using a custom conversion script.

Once the data was imported, we matched the complete set of phonemes from SPIN to the 920 potential phonemes in the UPPC, expecting to have to iron out only a moderate number of human errors. Unfortunately, the “ironing” required proved to be quite substantial. There were 418 discrepancies between SPIN and the UPPC, attributable to six types of mismatches:


1. Homographic errors: incorrect symbols
A unique phoneme should be represented by a unique Unicode character, but in practice this is not always the case. Especially when differences between characters are not clearly visible in certain fonts, errors are easily made when entering data. When using a font in which /g/ looks exactly like /ɡ/, for instance, the former may be selected to represent the voiced velar plosive although the latter, a character with a different Unicode number, is the official IPA symbol. Errors of this type are easily corrected through visual inspection of the mismatches.

2. Variant grapheme mismatches
In some cases two official notations for the same phoneme exist, as with the case of velarization, which can be represented by either d̴ or dˠ for the voiced alveolar plosive. If two databases happen to choose different options, we get a mismatch that is not apparent but will lead to incorrect behaviour in queries, since the same phoneme is intended. We resolve this type of mismatch by permitting either grapheme as a valid representation of the phoneme and introducing a relationship which equates the two possible representations in the data representation language. Note that when more than one linguist fills a database, one may even get database-internal inconsistencies of this type, in which identical co-articulations for two different phonemes are represented differently.

3. Notational mismatches
These arise from notation differences in older sources, or in sources with different notational conventions. This type is similar to the previous one. Labialized segments, for instance, are normally represented with superscript /ʷ/. In some sources one may find a convention in which labialization is represented by a normal /w/. We can only correct this by looking at the source from which the inventory was taken to determine exactly what articulation was originally intended. If that information is available, we can represent the phoneme using the current IPA notation. Sometimes, however, an uncommon phoneme is intentionally represented using nonstandard orthography; care must be taken here not to equate such phonemes to a common phoneme that is already present in the UPPC.

4. Grapheme order mismatches
Multiple co-articulations may be represented in different order. A labialized, aspirated /k/ can be represented as /kʷʰ/ or /kʰʷ/. There are conventions, however, for the concatenation of diacritics, and we adhere to these, e.g., using /kʷʰ/ and revising alternate representations.

5. Diacritic order mismatches
Similar, though not visibly so, are those phonemes written with multiple diacritics whose sequence is unordered. One diacritic may be placed above the segment, for instance, and one below it. These may be entered in either order. The result will be the same graphically, but the sequence of Unicode characters is different. This problem is easily solved by normalizing the Unicode representations,37 or by visually examining the entries and manually imposing a fixed ordering scheme.

6. Input errors
Simple input errors are easily introduced. For instance, an extra diacritic may have unintentionally been typed after some segment. This may be poorly visible or invisible on the screen, but will lead to a different Unicode representation than the one intended. Moreover, diacritics may have been chosen that are simply incorrect, or errors may have been introduced when working from not very legible (especially handwritten) originals. These errors are usually easy to fix after reference to the original database.

After we corrected the mismatches of these types, we were left with 190 segments that are new and quite a few that needed checking in the original grammars. For a few languages, we consulted the original reference grammar and established that an unmatched segment should in fact be replaced by one already in the UPPC. The number of remaining new segments seems rather high to us, but since the purpose of the TDS is to integrate existing databases rather than to create new resources, we have not put further effort into resolving the differences between the datasets. We simply expand the pool of possible segments with those phonemes from SPIN that were not eliminated during the procedure described above.

37. The Unicode standard defines several normalization forms, which include a canonical order for diacritics. Unicode-enabled software libraries provide functions that normalize Unicode strings.
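Three of the mismatch types above lend themselves to mechanical repair, which can be sketched with Python's standard unicodedata module. The two replacement tables are assumptions made for illustration (a real project would derive them from its own list of mismatches), and the function name is hypothetical. Normalization alone repairs only the diacritic order; the homograph and the variant grapheme need explicit mappings, since Unicode treats them as distinct characters.

```python
import unicodedata

# Assumed, illustrative repair tables.
HOMOGRAPHS = {
    "\u0067": "\u0261",  # ASCII "g" -> IPA script g
}
VARIANTS = {
    "\u0334": "\u02e0",  # combining tilde overlay -> modifier small gamma (velarization)
}


def canonical_segment(segment):
    """Return a canonical key for an IPA segment string."""
    for old, new in {**HOMOGRAPHS, **VARIANTS}.items():
        segment = segment.replace(old, new)
    # NFD applies the canonical ordering of combining marks, so a mark below
    # the base letter always precedes a mark above it.
    return unicodedata.normalize("NFD", segment)


if __name__ == "__main__":
    above_then_below = "e\u0303\u0330"  # e + tilde above + tilde below
    below_then_above = "e\u0330\u0303"  # same marks, entered in the other order
    print(above_then_below == below_then_above)
    print(canonical_segment(above_then_below) == canonical_segment(below_then_above))
```

The plain replacement of the velarization overlay is deliberately naive: if further combining marks follow it, they would end up attached to the modifier letter, so such cases would still need inspection by hand.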


9. Guidelines for typological databases

During the process of importing the eleven databases currently in the TDS, several recurring problems were recognized in the metadata and data cleaning process. This section contains some guidelines for developers of typological databases which would improve their reusability, both as stand-alone resources and through incorporation into the TDS or a similar system:

1. Good database design
Choosing the appropriate relational design for a database (that is, the structure of tables and relationships that make up the database) can make a big difference to its usability, as well as to the ease of reusing its contents in the TDS or any other system. A properly designed database is easy to enter data into and extract information from; it is also easier to modify as one’s research design evolves. We recommend making the effort to choose a suitable design, by finding and consulting knowledgeable colleagues if necessary, at the initial stages of constructing a new database.38 Starting with the right design has such immediate and noticeable benefits for the database creators that its usefulness for eventual data integration is almost incidental.

38. The simplest typological databases consist of a single large table, with one record (row) per language and one column for each property being described. If a phenomenon could be described multiple times per language (e.g., information about each focus marker or syllable structure that occurs), these ought to occupy a separate table in a many-to-one relationship to the language table. Examples, bibliographic citations, etc., should also generally be in separate tables. The needs of each database differ, so the right design depends on the information being collected and the needs of the project. For an introduction to database design for linguistics, see Dimitriadis and Musgrave (this volume).

2. Documenting assumptions and procedures
Typological databases have a long lifespan. Most of the databases currently incorporated in the TDS have existed for over a decade, or even several decades. During such long periods of time it is easy for the assumptions and procedures employed when defining fields and adding languages to the database to be lost. Sometimes this may lead to inconsistencies. By documenting the database it becomes possible to refresh the project’s collective memory, to keep data clean and consistent, and also to provide rich metadata if the database is eventually made public or is reused, for example by incorporation into the TDS. Some database systems provide a way to store documentation of the database’s fields, but such support is typically very limited. It is often more convenient to maintain documentation in a separate document, or in a special table within the database (special effort must be made to keep these up to date, however).

3. Citations to the sources of information
Some typological databases, especially older ones, do not include bibliographic or other source information. Such information is essential for error-checking or further research into the information given in the database; even in cases where there is only one well-known reference grammar for a given language, in which case one might consider a citation of it to be redundant, this fact is only known to people who are already involved in the study of this language. At a minimum, a typological database should list the source, or sources, of information for each language. Such information only needs to be entered once per source, and can be re-used for future research involving the language. Sources should be given in a separate table, allowing the same source to be easily referenced for multiple languages, or other records, where appropriate (and perhaps to be copied to a new database). Ideally, each group of information in the database would include a separate citation, consisting of a source plus the relevant page numbers; but this may be impractical if it will result in a very large number of citation fields. Simply listing sources for each language in the database is an extremely useful and workable compromise.

4. Key values
Record keys identify each record to the database. We recommend that you look for standards to take your key values from. Even if your database internally uses numeric keys or other ad hoc values, providing standard identifiers makes it easy to detect inadvertent duplications and is essential for the reusability (i.e., continued life) of the data.
In the case of languages, which are always the core entity in a typological database, the standard to use is the most current ISO 639-3 specification (http://www.sil.org/iso639-3/), which contains unique language codes for living and extinct languages. The ISO 639-3 codes are the successor to the “SIL codes” used in the past by the Ethnologue language directory. It can be difficult to fill in this information after the fact; often, the ISO code corresponding to a language cannot be determined without referring to the original reference grammar. The TDS is replete with data about languages whose identity (i.e., ISO code) could not be determined with certainty by the TDS analysts. ISO codes should always be determined (and recorded) when data entry for a language commences. The introductory section of most reference grammars provides sufficient information to identify the language in the Ethnologue.

What if the language variety being described is a dialect that differs in significant respects from the standard, or canonical variety, of the language listed in the ISO 639-3 directory? In this case the ISO code for the language should be provided, but the record must be distinguished from one that would describe the standard variety. This must include the assignment of a special key for the dialect; again, we recommend adhering to a systematic means of identification if possible. The Ethnologue contains additional information on language families and dialects which can be used to standardize keys. Only fall back on an internal key when a standard one cannot be found, since lack of a database-independent key will greatly complicate the integration of data for that specific language with data from other databases.39 Non-standardized keys should be clearly distinguished from ISO codes or other standard keys (e.g., by beginning with a prefix such as dialect-).

5. Comments
Many databases contain a field with arbitrary notes or comments for any aspect of the record. Such general-purpose fields are difficult for the TDS to handle, since data is extracted and presented one field at a time. It is far better to have separate comment fields for each value or group of values that requires them (this is also more effective for the original database). At the least, a comment should name the data fields that it is relevant to. Another common practice is to use a “comments” field for any information that the database creators decided to collect after the database was already designed, or that they cannot decide how to encode. Such practices are sometimes unavoidable, especially if information must be collected before a decision on its encoding can be made; but they should be replaced as soon as possible by dedicated fields. A comments field is only usable by humans inspecting the record, and cannot be properly utilized in database queries.

39. If the ISO codes are used as the primary key for a record, internal codes are necessary to avoid null keys; but if the ISO codes are given in a non-key field, only valid ISO codes should be provided; a three-letter code that is not part of some standard is worse than useless, since it might be misinterpreted during a future data integration.

6. NULL values
NULL values are controversial in the database community because their semantics are not well-defined (cf. Date 2004: ch. 19). What does a NULL mean? Is the field irrelevant in this record? Should the NULL be interpreted as some default value for the field? Has a value simply not been entered yet? Or did the analyst try to determine the correct value but could not, because the source grammar does not cover the subject, or perhaps because complexities in the language under study make an answer impossible? We recommend that you choose specific values for each of these circumstances (as applicable), make them as explicit as possible, and explain their meaning in more detail in the database documentation. For example, NULL can mean “not yet looked at”, “irrelevant” can mean that the question does not apply, and another value can mean “looked at but could not find the proper value.” This will facilitate further data entry in the database, and will allow the TDS to provide this information to the end user.

7. Uncertainty
Values cannot always be assigned with complete certainty. Uncertain values are better than nothing, but it is good practice to make the uncertainty explicit. A common stratagem is to append a question mark to the value, or to put it between brackets. Sometimes comments are also added to the value. But this is a poor strategy from the perspective of database design, since the variant codes look like completely different values to the DBMS (e.g., “X” and “X?”), and since it requires that the values be declared as text fields (rather than boolean or other enumerated types). A better strategy would be to have a separate field encoding the degree of certainty, but this might be impractical if it would need to be replicated for very many data fields.

For the purposes of importing a database into the TDS, a simple and consistent means of marking uncertainty works best (e.g., by appending a question mark). The TDS can split such constructions into the value proper and the uncertainty value. Elaborate embedded notation for values is difficult and error-prone for database contributors to enter, and for the TDS to parse. In any event, it is important that any notational conventions be clearly documented.
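Guidelines 6 and 7 can be illustrated with a small recoding routine that separates a trailing question mark from the value proper and reports an explicit reason when no value is present. This is a sketch in Python under assumed conventions, not the TDS import code: the missing-value codes, their wording and the names Recoded and recode are hypothetical.

```python
from typing import NamedTuple, Optional

# Hypothetical, documented codes for the different reasons a value is absent.
MISSING_CODES = {
    "": "not yet looked at",
    "irrelevant": "question does not apply",
    "not found": "looked at, but no value could be determined",
}


class Recoded(NamedTuple):
    value: Optional[str]    # the logical value, or None if absent
    uncertain: bool         # True if the source marked the value with "?"
    missing: Optional[str]  # explanation from MISSING_CODES, if any


def recode(raw: Optional[str]) -> Recoded:
    """Split a raw text field into value, uncertainty marker and missing reason."""
    text = (raw or "").strip()
    if text.lower() in MISSING_CODES:
        return Recoded(None, False, MISSING_CODES[text.lower()])
    uncertain = text.endswith("?")
    value = text.rstrip("?").strip()
    return Recoded(value or None, uncertain, None)


if __name__ == "__main__":
    for raw in ("SVO?", " SOV ", None, "Irrelevant"):
        print(repr(raw), recode(raw))
```

Keeping the uncertainty flag apart from the value means that “SVO” and “SVO?” match the same query condition, while the flag can still be shown next to the value in the results.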


10. Conclusions

The TDS approach to data integration is based on the principle that semantic integration of complex typological data from different sources is both impossible and undesirable. Impossible because the different typological data collections differ in subtle and complex ways; and undesirable because many such differences are inextricably linked to particular theoretical conceptualizations, which are the goal and end-result of typological research itself. Accordingly, the TDS data management schema is designed to accommodate, side by side, alternative ways of organizing and describing a semantic domain. This approach does not preclude full semantic unification where it is possible (for example, where the differences are primarily notational), but it allows data integration to be carried out without it. Instead of focusing on unification, primacy is given to preserving the explicit and implicit knowledge in the component databases, notably through the addition of detailed interpretive metadata in consultation with the creators of the component databases.

The databases currently included in the TDS represent a wealth of painstakingly collected typological information. For the first time, the TDS allows their diverse contents to be examined side by side, and to be interpreted with the aid of descriptive documentation solicited and organized by the Project members. Its export-oriented output formats allow data to be transferred to external applications. Internally, the “hybrid” approach to knowledge representation provides a global ontology of unifying linguistic Concepts (which still accommodate multiple theoretical perspectives), plus multiple local ontologies relating database fields to global Concepts, and to each other. Since the contents of each database are only related to the global ontology, this structure ensures that the Project can grow in scale without management becoming impractical (as it would if each new database needed to be related to all of the existing ones), and that sufficient breadth and depth of coverage is available to support querying.

The TDS does not assume responsibility for the correctness of the data included in its component databases, but only for reporting it faithfully. The degree of confidence that users will place in the system will depend on the ability of the TDS to accurately reflect the analyses in the component databases. The provenance of all information is made visible at every stage of user interaction (searching, query construction and results display), allowing users to assess for themselves the reliability of the information and the theoretical perspective and assumptions it rests on. The result, we hope, is a system suitable for a research community increasingly interested in tools that will support empirically grounded linguistic typology, whose findings can be verified and reproduced.

Appendix: TDS glossary

Concept       When capitalized: An entry in the TDS Global Ontology.
DTL           Data Transformation Language. A domain-specific language created by the TDS Project. Provides a high-level declarative notation for transformation rules and metadata annotations that express the semantics of the resulting data.
ISO code      A unique three-letter code identifying a language according to the ISO 639-3 standard. A successor to the three-letter “SIL codes” used in the Ethnologue directory.
Metadata      Information about the format or meaning of a data field or collection of fields, intended to aid people or computers in its interpretation.
Notion        When capitalized: A node in the integrated data hierarchy of the TDS, representing a database field, a localized linguistic term or value, or a group of Notions.
Ontology      A formal representation of a set of concepts within a domain, and of the relationships between those concepts.
OWL           Web Ontology Language. A knowledge-representation language for ontologies.
Root Notion   The root (top-level node) of a tree in the TDS data hierarchy.
TDS-GO        The TDS Global Ontology of linguistic concepts.
Top Notion    A Notion that forms the root (top-level node) of a coherent semantic context.
UPPC          The Universal Phoneme Position Chart, a table of possible phonological segments organized for viewer-friendly presentation.

References

Bechhofer, Sean, Frank van Harmelen, Jim Hendler, Ian Horrocks, Deborah L. McGuinness, Peter F. Patel-Schneider and Lynn Andrea Stein
  2004 OWL Web Ontology Language Reference. Vol. 2005. World Wide Web Consortium. [http://www.w3.org/TR/owl-ref/]


Chaffin, Roger, Douglas J. Herrmann and Morton Winston
  1988 An empirical taxonomy of part-whole relations: Effects of part-whole type on relation identification. Language and Cognitive Processes 3 (1): 17–48.
Comrie, Bernhard
  1978 Ergativity. In Syntactic Typology: Studies in the phenomenology of language, edited by Winfred P. Lehmann. Austin: University of Texas Press.
  1989 Language Universals and Linguistic Typology. 2nd ed. Oxford: Blackwell.
Corbett, Greville G.
  1998 Morphology and Agreement. In The Handbook of Morphology, Andrew Spencer and Arnold M. Zwicky (eds.), 191–205. Oxford, UK / Malden, MA: Blackwell.
Date, Chris J.
  2004 An Introduction to Database Systems. 8th ed. Addison-Wesley.
Dimitriadis, Alexis, Adam Saulwick and Menzo Windhouwer
  2005 Semantic relations in ontology mediated linguistic data integration. In E-MELD Workshop on Morphosyntactic Annotation and Terminology: Linguistic Ontologies and Data Categories for Linguistic Resources (E-MELD 2005), Cambridge, MA.
Dixon, Robert M. W.
  1972 The Dyirbal language of North Queensland. Cambridge Studies in Linguistics 9. London: Cambridge University Press.
Gómez-Pérez, Asunción, Mariano Fernández-López and Oscar Corcho
  2004 Ontological Engineering. Advanced Information and Knowledge Processing. London: Springer.
Gruber, Thomas R.
  1993 Toward principles for the design of ontologies used for knowledge sharing. In International Workshop on Formal Ontology in Conceptual Analysis and Knowledge Representation, N. Guarino and R. Poli (eds.). Deventer: Kluwer Academic Publishers.
Gruber, Thomas R. and Gregory R. Olsen
  1994 An ontology for Engineering Mathematics. Paper presented at the 4th International Conference on Principles of Knowledge Representation and Reasoning, Bonn, Germany.
Haspelmath, Martin
  2005 Argument marking in ditransitive alignment types. Linguistic Discovery 3 (1): 1–21.
Hengeveld, Kees, Jan Rijkhoff and Anna Siewierska
  2004 Parts-of-speech systems and word order. Journal of Linguistics 40 (3): 527–570.

Kager, René
  1999 Optimality theory. Cambridge textbooks in linguistics. Cambridge / New York: Cambridge University Press.
Ladefoged, Peter and Ian Maddieson
  1996 The Sounds of the World’s Languages. Oxford, UK / Cambridge, MA: Blackwell.
Landis, T. Y., Douglas J. Herrmann and Roger Chaffin
  1987 Development differences in the comprehension of semantic relations. Zeitschrift für Psychologie 195 (2): 129–139.
Lewis, William D.
  2006 ODIN: A Model for Adapting and Enriching Legacy Infrastructure. In Second IEEE International Conference on e-Science and Grid Computing (e-Science’06).
Maddieson, Ian
  1984 Patterns of sounds. Cambridge Studies in Speech Science and Communication. Cambridge: Cambridge University Press.
Monachesi, Paola, Alexis Dimitriadis, Rob Goedemans, Anne-Marie Mineur and Manuela Pinto
  2002 A Unified System for Accessing Typological Databases. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 3), Las Palmas, Canary Islands, Spain. Paris: ELRA.
Nespor, Marina and Irene Vogel
  1986 Prosodic Phonology. Studies in Generative Grammar. Dordrecht: Foris.
Payne, Thomas Edward
  1997 Describing morphosyntax: a guide for field linguists. Cambridge: Cambridge University Press.
Rector, Alan and Chris Welty (eds.)
  2005 Simple part-whole relations in OWL Ontologies. W3C Editor’s Draft Vol. 2005. World Wide Web Consortium. [http://www.w3.org/2001/sw/BestPractices/OEP/SimplePartWhole/]
Rosch, Eleanor and Barbara B. Lloyd
  1978 Cognition and categorization. Hillsdale, NJ: Lawrence Erlbaum.
Saulwick, Adam, Rob Goedemans, Alexis Dimitriadis and Menzo Windhouwer
  2006 Architecture and procedures for the integration of linguistic databases in the TDS. In 28. DGfS, AG 6 – Language Archives: Standards, Creation and Access.
Saulwick, Adam, Menzo Windhouwer, Alexis Dimitriadis and Rob Goedemans
  2005 Distributed tasking in ontology mediated integration of typological databases for linguistic research. In International Workshop on Data Integration and the Semantic Web (DISWeb’05), J. Castro and E. Teniente (eds.), Vol. 13. Proceedings of the CAiSE’05 Workshops. Porto: Springer.

How to integrate databases without starting a typology war

207

Siewierska, Anna 2004 Person. Cambridge textbooks in linguistics. New York: Cambridge University Press. Storey, Veda C. 1993 Understanding Semantic Relationships. VLDB Journal 2: 455– 488. Stuckenschmidt, Heiner and Frank van Harmelen 2005 Information Sharing on the Semantic Web. Advanced Information and Knowledge Processing. Berlin: Springer. Taylor, John R. 1989 Linguistic categorization: prototypes in linguistic theory. Oxford: Clarendon Press. Varela, Francisco J., Evan Thompson and Eleanor Rosch 1991 The embodied mind : cognitive science and human experience. Cambridge, MA: MIT Press. Winston, Morton E., Roger Chaffin and Douglas J. Herrmann 1987 A taxonomy of part-whole relations. Cognitive Science 11(4): 417– 444.

A contribution to ‘two-dimensional’ language description: the Typological Database of Intensifiers and Reflexives

Volker Gast

1. Introduction

The relationship between language description and linguistic typology is often regarded as an asymmetrical one. Even though such criticism is hardly ever articulated explicitly – perhaps because many language specialists are also typologists – there is a tendency to regard typology as being ‘parasitic’ on descriptive linguistics: typologists are seen as drawing their material from grammars, extracting bits and pieces from a quarry as it were, without seeing either the whole picture or the intricate details. To an extent, this type of criticism may, in some cases, even be justified, although the ‘quarry metaphor’ is of course inappropriate, since no material is actually removed from the grammars. Still, there is often a subliminal feeling of ‘misuse’ of the data, a favourite topic for the discussion period at typological conferences when specialists point out inaccuracies in examples of ‘their’ language, or call the appropriateness of a given term or categorization into question.

However, there can be no denying that descriptive linguistics can also profit from typology. A ‘symbiotic coexistence’ of language description and typology has been particularly fruitful in what we may broadly call the ‘Australian school of language description and typology’. Evans and Dench (2006: 2) also include “formal linguistics” in this relationship of reciprocal benefit, stating that “there is a triadic and mutually complementary relationship between descriptive linguistics (of which the writing of grammars is but one part), linguistic typology, and formal linguistics.” This type of symbiosis can be illustrated by the work of another Australian linguist. Dixon’s (1994) typological survey of ergativity – obviously inspired by his descriptive work on Australian languages – provides a useful frame of reference for descriptive linguists who encounter some type of ergativity phenomenon in ‘their’ language.

This is not to say, of course, that language descriptions should be confined to corroborating the generalizations made by Dixon or other typologists. On the contrary, typological generalizations may help descriptive linguists to identify remarkable or non-canonical patterns in their data, for instance if the relevant languages do not ‘comply with’ the generalizations formulated in typological work. In many cases, such deviance from widely attested patterns may also be an incentive to have a second look at the data, and to consider an analysis in a new light.

Radical structuralists (or semantic relativists) may object at this point that the use of any kind of a priori framework will only obscure the view of the real ‘genius’ of a language; but this type of particularism is not only unjustified in many cases – in general, there does seem to be a good deal of inter-lingual commensurability – it is also unwise from a strategic point of view. As Evans and Dench (2006: 5) put it, “[d]escribing each language entirely on its own terms is a noble and galvanizing task, but unless grammarians orient their findings to what typologists know about the world’s other languages, their grammars can all too easily become obscure, crabbed and solipsistic […] or at best half-veiled in the idiosyncrasies of specific areal or language-family-specific traditions”. We should also not forget that typologists constitute a major target group for grammars published in book format, at least as far as ‘exotic’ languages are concerned.

In this article, I would like to show that the ‘symbiosis’ of language description and linguistic typology can also be extended in another direction: cross-linguistic research can itself be carried out in a descriptive and perhaps even documentary spirit when it uses electronic devices such as relational databases. This will be illustrated with a typological database which contains information on a linguistic domain at the interface of lexicon and grammar (intensifiers and reflexives).

The paper starts in Section 2 by pointing out a basic distinction in grammaticography, namely the one between semasiological and onomasiological approaches to language description. The idea of a ‘bidirectional’ descriptive approach (form-to-function and function-to-form) is extended to a ‘two-dimensional’ one in Section 3 (where ‘two-dimensional’ refers to the orthogonal relationship between languages and grammatical domains). After providing a typological overview of intensifiers and reflexives in Section 4, the Typological Database of Intensifiers and Reflexives is described in Section 5. Section 6 contains some concluding remarks.


2. Bidirectionality in language description: from form to function and vice versa

As has often been pointed out in the literature on grammaticography (e.g. Lehmann 2004; Lehmann and Maslova 2004; Mosel 2002, 2006), there are two ways of organizing a grammatical description: (i) from form to function, and (ii) from function to form. The first approach is often called semasiological, indicating the ‘target of description’ (the σημα, i.e. the ‘sign, signal, feature’/signifié), and the second, accordingly, onomasiological (from the σημα to the όνομα, i.e. the ‘name’ or signifiant). Alternative dichotomies aimed at capturing this distinction are ‘analytic’ vs. ‘synthetic’, ‘decoding’ vs. ‘encoding’ and simply ‘form-based’ vs. ‘meaning-based’ (cf. Evans and Dench 2006; Mosel 2006).

Lehmann (2004) illustrates the two methods with the English preposition with, and the way its meanings can be encoded using other expressive devices in English. We may either carry out a semasiological analysis of with, for instance, by assigning four different meanings to it: (i) ‘instrument’, (ii) ‘comitative’, (iii) ‘reciprocity’ and (iv) ‘material’ (cf. Table 1). Or else, we may consider a given conceptual relation – say, ‘instrument’ – and determine the formal means associated with that domain in English. This will allow us to identify a set of ‘strategies’ such as those on the right hand side of Table 1. As Table 1 illustrates, there is usually a one-to-many relationship both from form to function and vice versa. Therefore, the two methods of language description may lead to rather different results when they are applied to the same language.

Table 1. Onomasiology and semasiology: the instrumental preposition with (cf. Lehmann 2004)

conceptual relations               mapping            structural devices
reciprocal (X and Y reciprocate)                      Y uses X to Pred
instrument (Y uses X)                                 Y with X
comitative (X accompanies Y)                          Y Pred using X
material (X is material of Y)                         Y Pred by X
                                   onomasiology →
                                   ← semasiology
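The two directions of description can be sketched as a pair of lookup tables that are inverses of each other. The following Python fragment is only a minimal illustration: it records the four meanings of with and the four ‘instrument’ strategies mentioned above, and the names FORM_TO_FUNCTIONS, FUNCTION_TO_FORMS and invert are introduced here for the purpose of the example and do not stem from Lehmann (2004).

```python
from collections import defaultdict

# Semasiological direction: each structural device (form) lists the
# conceptual relations it can express. Only links named in the text are
# recorded; a fuller description would add further meanings per device.
FORM_TO_FUNCTIONS = {
    "Y with X": ["instrument", "comitative", "reciprocal", "material"],
    "Y uses X to Pred": ["instrument"],
    "Y Pred using X": ["instrument"],
    "Y Pred by X": ["instrument"],
}


def invert(mapping):
    """Turn a form-to-function table into a function-to-form table (or back)."""
    inverted = defaultdict(list)
    for key, values in mapping.items():
        for value in values:
            inverted[value].append(key)
    return dict(inverted)


# Onomasiological direction: each conceptual relation lists its devices.
FUNCTION_TO_FORMS = invert(FORM_TO_FUNCTIONS)

if __name__ == "__main__":
    print(FORM_TO_FUNCTIONS["Y with X"])    # semasiological entry for 'with'
    print(FUNCTION_TO_FORMS["instrument"])  # onomasiological entry for 'instrument'
```

Both tables contain the same links; they differ only in which side serves as the point of entry, which is why a description organized in one direction can look so different from one organized in the other.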

Reference grammars are usually organized in a semasiological way, sometimes complemented by a few onomasiological sections (e.g. ‘possession’). Attempts have also been made to organize entire grammars in an onomasiological way (cf. Mosel 2006 for some examples), but mainstream grammaticography is still heavily biased towards a semasiological approach. This may be due to the fact that semasiological descriptions are more useful for decoding language, while onomasiological descriptions are geared towards encoding purposes (this is also reflected in the alternative terms ‘analytic’ [for ‘semasiological’] and ‘synthetic’ [for ‘onomasiological’]; cf. Mosel 2002, 2006). Since decoding is usually the first step in academic language learning, it is only natural that semasiological descriptions should be preferred by most linguists. Moreover, a grammarian aiming to provide an onomasiological description is faced with the problem of which semantic or ontological categories to choose as a point of departure. While the inventory of formal markers constituting the grammatical system of a language is finite, the set of conceptual domains encodable in natural language is theoretically infinite, so the choice of any given conceptual domain as a describendum is arbitrary.

Semasiological and onomasiological approaches can also be fruitfully combined by ‘flipping back and forth’ (or ‘oscillating’) between the two methods (cf. Hole 2000). Though often implicit, such a ‘bidirectional’ method is common practice in linguistic typology. For instance, typologists may start off with a given meaning – say, ‘possession’ – and set out to determine the ways this meaning is expressed in the languages of the world. In doing so, they will come up with a list of formal types of encoding (e.g. BE-possession [with an existential predicate], HAVE-possession [with a transitive predicate], etc.). In a second, semasiological step, they can now determine the range of meanings associated with each ‘strategy’. As a result of such an ‘oscillation’ between onomasiology and semasiology, one may determine systematic patterns of polysemy, first in individual languages and then cross-linguistically.

This approach also underlies the so-called ‘semantic map’ model, where similarity of meaning or function is represented by proximity in a two-dimensional space (cf. Haspelmath 1997 [on indefinite pronouns], van der Auwera and Plungian 1998 [on modality]; see also van der Auwera and Gast forthcoming). Figure 1 provides the semantic map for indefinite pronouns devised by Haspelmath (1997) on the basis of a sample of 40 languages.

(1) specific known; (2) specific unknown; (3) irrealis non-specific; (4) question; (5) conditional; (6) indirect negation; (7) direct negation; (8) comparative; (9) free choice

Figure 1. Haspelmath’s (1997: 64) semantic map of indefinite pronouns

Semantic maps such as the one given in Figure 1 allow for a simple visualization of the form-function (semasiological) and function-form (onomasiological) mapping for any formal marker associated with at least one of the functions represented on the map. For instance, the semantics of the several indefinite pronouns of Latin can be described by indicating the space that they cover on the ‘semantic map’. This is shown in Figure 2.

LATIN (marker labels plotted over the functions of Figure 1): dam; ali-; -quam; n-; -vis/-libet

Figure 2. The range of functions associated with indefinite pronouns of Latin (Haspelmath 1997: 69)
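The idea that a marker ‘covers a space’ on a semantic map can be made concrete by treating the map as a graph and checking whether the functions of a marker form a connected region. The sketch below is an illustration under two explicit assumptions: the list of connecting lines (EDGES) is supplied from outside this chapter and should be checked against Haspelmath’s published map before use, and the marker tested at the end is invented rather than taken from the Latin data.

```python
from collections import deque

# The nine functions of the indefinite-pronoun map, numbered as in Figure 1.
FUNCTIONS = {
    1: "specific known", 2: "specific unknown", 3: "irrealis non-specific",
    4: "question", 5: "conditional", 6: "indirect negation",
    7: "direct negation", 8: "comparative", 9: "free choice",
}

# ASSUMPTION: the connecting lines are not given in the text of this chapter;
# this edge list is an illustrative reconstruction to be verified against the map.
EDGES = [(1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (4, 6),
         (5, 8), (6, 7), (6, 8), (8, 9)]

NEIGHBOURS = {node: set() for node in FUNCTIONS}
for a, b in EDGES:
    NEIGHBOURS[a].add(b)
    NEIGHBOURS[b].add(a)


def is_contiguous(covered):
    """Return True if the covered functions form one connected region."""
    covered = set(covered)
    if not covered:
        return False
    start = min(covered)
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search restricted to the covered nodes
        node = queue.popleft()
        for nxt in NEIGHBOURS[node]:
            if nxt in covered and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == covered


if __name__ == "__main__":
    hypothetical_marker = {2, 3, 4}  # invented coverage, for illustration only
    print(is_contiguous(hypothetical_marker))
    print(is_contiguous({1, 3}))     # skips function 2
```

On this representation, the semasiological question (which functions does a marker cover?) is a set of nodes, and the onomasiological question (which markers express a function?) is the set of markers whose region contains that node.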

3. From bidirectional to two-dimensional language description

The fact that most grammars are organized in a semasiological way leads to a certain incommensurability in cross-linguistic comparison. The reason for this is obvious: as pointed out above, the first step in a cross-linguistic study is usually an onomasiological one. This means that typologists have to search in grammars for the specific conceptual domain they are interested in. While this is not necessarily an obstacle to typological work, problems emerge when the relevant function is not mentioned at all in the grammar. This is often the case when languages lack expressions specialized to the semantic domain in question. If, for instance, a language does not have a grammatical category of ‘mood’, a grammar of that language will in all likelihood not contain a section on mood (whereas most grammars of English do contain one). This is not to say that semantic mood distinctions such as the one between ‘realis’ and ‘irrealis’ cannot be expressed at all in such a language – the expression of such meanings may simply be distributed over several ‘loci of encoding’ (say, a combination of syntactic and lexical means), or it may be expressed in a ‘parasitic’ way (e.g. when tense is used to encode mood distinctions, as in English). In such cases, the function-to-form approach may be more useful for typologists, even though it does not, of course, guarantee that a grammar will contain the information that is of interest.

In order to facilitate cross-linguistic comparison, North Holland started the publication of the Lingua Descriptive Studies series in the late 1970s (LDS, cf. Matras et al. this volume), later taken over by Croom Helm and then continued in a modified form as the Routledge Descriptive Grammars series. The grammars of that series are all based on the same questionnaire, which was designed by B. Comrie and N. Smith (cf. Comrie and Smith 1977). These grammars are thus geared towards comparative purposes. For instance, the questionnaire contains a considerable number of questions relating to reflexivity and binding. If one wants to compare the binding properties of a language A with those of another language B, one simply has to look up the descriptions given in the relevant sections. Let us refer to this approach as ‘two-dimensional’. It can be represented as a table in which each row corresponds to a ‘grammatical topic’ and each column to a language. This is illustrated in Table 2.

Table 2. ‘Two-dimensional’ language description: the principle of the Lingua Descriptive Studies series

                 Language 1    Language 2    …    Language n
Section 1
Section 1.1
Section 1.1.2
…
Section 2
Section 2.1
Section 2.1.2
…

Quite obviously, grammars based on a general template such as Comrie and Smith’s questionnaire will always be both redundant and non-exhaustive: first, not all questions will be relevant to all languages, i.e. there will be empty cells in a grid of the type illustrated in Table 2; and second, the information retrieved from a questionnaire cannot even come close to a full coverage of the elements and rules constituting the grammatical system of a language. It is certainly for these reasons, among others, that the LDS project was seriously criticized before long. The functional yield of the questionnaire-based descriptions was simply too low, i.e. the grammars published in that series provide relatively little information on a relatively large number of pages, in comparison to traditional (basically semasiological) grammars.

Even though the LDS project may be regarded as a failure by some typologists, the idea of compiling a set of cross-linguistically comparable ‘parallel grammars’ is certainly still appealing. One may wonder if that enterprise would have taken a different course, had it been based on a different (perhaps more comprehensive) questionnaire, and had it made use of electronic devices such as databases allowing for quick and easy cross-referencing across grammatical domains. This is not the place to argue for a revival of the LDS project. However, I would like to show that a similar, less ambitious, undertaking may be worth pursuing, namely the compilation of what we may call ‘domain-specific cross-linguistic databases’.

Roughly speaking, the idea of a ‘domain-specific cross-linguistic database’ can be described as follows: such a database provides (parallel) information on one grammatical domain for a large number of languages. In a way, the mapping from languages to grammatical domains is thus reversed, in relation to a traditional reference grammar. While a reference grammar provides information on many grammatical domains for a single language, a ‘domain-specific cross-linguistic database’ contains information on a single grammatical domain for a large number of languages. In other words, it corresponds to one row of Table 2. Obviously, the restriction to one specific domain of grammar allows for a more comprehensive, and also more focused, description. In what follows, a database will be presented which has been set up in this spirit, namely the Typological Database of Intensifiers and Reflexives.
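The contrast between a reference grammar and a domain-specific cross-linguistic database can be sketched as two ways of slicing the same grid. The fragment below is a toy illustration with placeholder section and language names; the functions reference_grammar and domain_database are hypothetical names chosen for this example.

```python
# Toy version of the grid in Table 2. All names and cell contents are
# placeholders; None marks an empty cell (a question that does not apply).
LANGUAGES = ["Language 1", "Language 2", "Language 3"]
SECTIONS = ["Section 1", "Section 1.1", "Section 2"]

CELLS = {(section, language): f"{section} of {language}"
         for section in SECTIONS for language in LANGUAGES}
CELLS[("Section 1.1", "Language 2")] = None


def reference_grammar(language):
    """One column of the grid: many domains, a single language."""
    return {section: CELLS[(section, language)] for section in SECTIONS}


def domain_database(section):
    """One row of the grid: a single domain, many languages."""
    return {language: CELLS[(section, language)] for language in LANGUAGES}


if __name__ == "__main__":
    print(reference_grammar("Language 2"))
    print(domain_database("Section 1.1"))
```

A database restricted to one row can afford a far richer cell structure than a questionnaire covering the whole grid, which is the design choice argued for above.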

4. A typology of intensifiers and reflexives

The Typological Database of Intensifiers and Reflexives grew out of a typological project on intensifiers and reflexives directed by E. König and carried out at the Free University of Berlin between 1996 and 2002.¹ The database was first set up by P. Siemund (in a not openly accessible format) for project-internal purposes. When the project had come to an end, three other project members decided that the data collected during the project work should be made available to the public, thus contributing to the development of an openly accessible pool of typological resources on the internet. The database was considerably revised and extended, and was transferred into an internet resource by V. Gast, D. Hole and S. Töpper (using a MySQL database system which is accessed by PHP-pages; cf. also Section 5). The first version (version 1.0) of this database was published in 2003. More recently, the database has been subject to further modifications, providing additional search options and a new user interface (version 2.0, published in April 2007).²

It should be noted that a database such as the one described in this paper is only feasible as a by-product of a typological investigation into the domain at issue. It is only in this way that interesting questions and parameters of variation can be identified, and that general patterns and limits of variation can be distinguished from language particular idiosyncrasies (which should not make it into the general structure of a database; such details can be mentioned in all-purpose fields containing comments of various types). Moreover, a balance needs to be found between a striving for explicitness, on the one hand, and the necessity to restrict one’s attention to certain phenomena, on the other. Before providing a description of the database itself in Section 5, we will therefore briefly introduce the typology of intensifiers underlying the database.

¹ The project was funded by the German Science Foundation (‘Typological investigations on emphatic reflexives, reflexives, and focus particles’, Ko 497/5-1/4). Additional funding for the database was provided by the ‘Kommission für Forschung der Freien Universität Berlin’. The financial support from all sources is gratefully acknowledged.
² URL: http://www.tdir.org/

4.1. A basic typology of intensifiers

The term ‘intensifier’ is used here for expressions such as Latin ipse, Russian sam and English self-forms when they are used in an adjunct function (the president himself). A comprehensive typology of intensifiers has been proposed by König and Gast (2006). I will here only summarize the cornerstones of that typology. Intensifiers typically form a constituent with an NP (or DP). We call such intensifiers ‘adnominal’. Relevant examples are given in (1) and (2) from Abkhaz and Albanian, respectively:

(1) Abkhaz
    [NP [NP à-jγab]  l-xatà]
    ART-girl         POSS.3SG-INT
    ‘the girl herself’
    Hewitt (1989: 58)

(2) Albanian
    [NP ajo  vetë]  më     tha
    she      INT    to.me  said
    ‘She herself said it to me.’
    Buchholz and Fiedler (1987: 283)

In many languages, intensifiers may also occur at a distance from their associated NPs. Such ‘head-distant’ intensifiers as illustrated in (3) and (4) usually have a slightly different meaning from the relevant ‘head-adjacent’ ones illustrated in (1) and (2) above. Often, they can be paraphrased with adverbials such as ‘without help’, ‘alone’ or ‘by oneself’.

(3) Albanian
    ai  e   pa   vetë
    he  it  saw  INT
    ‘He saw it himself.’
    Buchholz and Fiedler (1987: 283)

(4) Bengali
    nɔyon-Ø    chobi-ţa     nije-Ø   nije-Ø   ẽkeche
    Nayan-NOM  picture-ACC  INT-NOM  INT-NOM  has.drawn
    ‘Nayan has drawn the picture by himself.’
    Sengupta (2000: 284)

In European languages, two major types of interpretation can be distinguished for ‘head-distant’ intensifiers: first, they often signal that an action has been carried out ‘without an external cause(r)’; we call these readings ‘exclusive’ (cf. the examples in (3) and (4)). Second, they sometimes have a function similar to that of additive focus particles such as also and too. This use type – which we call ‘inclusive’ – is illustrated in (5) and (6). (7) illustrates the ‘exclusive’ use of Russian sam for comparison.

(5) English
    Don’t tell me what it’s like to have children, I have children myself.

(6) Russian
    sam  ty   krysa
    INT  you  rat
    ‘You’re a rat yourself.’ (context: ‘Don’t forget the rat poison!’)
    Katerina Lvovna in ‘Lady Macbeth of Mtsensk’ (opera by D. Shostakovich)

(7) Ol’ga  učit     svoix  detej     sama
    Olga   teaches  her    children  INT.SG.FEM
    ‘Olga teaches her children herself.’
    Olga Pavlovskaya (p.c.)

‘Inclusive’ uses of intensifiers as illustrated in (5) and (6) are very rare in the world’s languages and represent a particularity of European languages. In Gast (2006), I have argued that the additive implicature associated with the relevant expressions is not primarily a lexical property of the intensifiers themselves, but follows from the interaction of a specific syntactic configuration (intensifier is outside T[ense]P[hrase]) with certain conditions on information structure (TP is deaccented/given). Given their rarity in languages outside of Europe, inclusive intensifiers have not played a major role in our typological investigation.

A third major type of intensifier behaves like the ‘adnominal’ one insofar as it forms a constituent with an NP, but it occupies a different structural position, typically that of a possessor or specifier. For instance, the Georgian intensifier tav-/tviton may either adjoin to the whole NP (projecting another NP, cf. (8)), or take up the specifier position of an NP (cf. (9)). In the latter case, it is basically equivalent to the English adjective own. In fact, it has been argued by König and Gast (2006), among others, that English own is a specialized ‘attributive intensifier’, i.e. it has basically the same semantic properties as intensifying self-forms and differs from the latter only in terms of its syntactic position and function. Accordingly, the sheriff’s own horse is interpreted as ‘the horse of the sheriff himself’. This analysis is supported by the fact that many languages (such as Georgian) use the same expression for adnominal and attributive intensification:

(8) Georgian
    me     beč’edi   miveci          [NP t(v)iton  [NP dedopals]]
    I.ERG  ring.NOM  gave.it.to.her  INT           queen.DAT
    ‘I gave the ring to the queen herself.’
    Hewitt (1995: 85)

(9) xelmc’ipem   samives        cxenebis      [NP tav-tav-is-i  ĵog-i]    misca
    emperor.ERG  all.three.DAT  horse.PL.GEN  INT-INT-GEN-AGR   herd-NOM  gave.it.to.them
    ‘The emperor gave each his own herd of horses to all three.’
    Hewitt (1995: 564)

The ‘basic typology’ of intensifiers outlined above is summarized in Figure 3:

intensifiers
    head-adjacent
        adnominal
        attributive
    head-distant
        inclusive
        exclusive

Figure 3. Basic typology of intensifiers

4.2. Parameters of variation

Intensifiers may differ in terms of their formal and distributional properties along a variety of dimensions. A comprehensive overview of the patterns and limits of variation found in this domain has been provided by König and Gast (2006). In the following, only those parameters will be mentioned that are represented in the Typological Database of Intensifiers and Reflexives: (i) selectional restrictions imposed on intensifiers, (ii) their lexical sources, and (iii) the question of whether intensifiers and reflexives are formally identical or not.

4.2.1. Selectional restrictions

As has been shown by König and Gast (2006), selectional restrictions associated with intensifiers can be accounted for on the basis of the well-known animacy hierarchy, both within and across languages. A relatively simple version of the animacy hierarchy is given in (10):

(10) pronouns        >  lexical NPs/concrete referents  >  lexical NPs/abstract referents
     SAP > non-SAP      humans > animate > inanimate

Table 3 (from König and Gast 2006) illustrates that languages manifest different cut-off points on the animacy hierarchy with regard to the distribution of their intensifiers. We will only consider two examples from a language which uses different intensifiers for animate as opposed to inanimate referents, namely Japanese. Japanese uses jishin in the first case and jitai in the second:

(11) Japanese (animate referents)
     Taro-jishin  kyouju-wo      sonkeishiteiru
     Taro-INT     professor-ACC  honour
     ‘Taro himself will honour the professor.’
     Akio Ogawa (p.c.)

(12) kekkonshiki-jitai-wa  bujini            shuuryoshita
     wedding-INT-TOP       without.problems  happen
     ‘The wedding itself went off without a hitch.’
     Akio Ogawa (p.c.)

Table 3. Intensifiers and animacy restrictions (König and Gast 2006: 244)

Columns: PRONOUNS (1/2; 3) and COMMON NOUN (human; animate; inanimate concrete; inanimate abstract). The forms are listed per language in the order in which they appear across the columns; two cells are marked ‘—’.

Basque:      -eu-; bera-
Malagasy:    -tena; mihitsy
T. Nahuatl:  sie PRO
Japanese:    jishin; jitai
German:      selbst
Spanish:     mism-
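The notion of a cut-off point on the animacy hierarchy can be sketched by assigning each intensifier a contiguous stretch of the hierarchy. The fragment below is a minimal illustration that encodes only the Japanese contrast discussed above (jishin for human and other animate referents, jitai for inanimate ones); the three-step scale and the names HIERARCHY, INTENSIFIERS, covers and intensifiers_for are simplifying assumptions made for the example.

```python
# Simplified animacy scale, ordered from highest to lowest (cf. (10)).
HIERARCHY = ["human", "animate", "inanimate"]

# Each intensifier is assigned the highest and lowest category it combines
# with. Only the Japanese contrast described in the text is encoded here.
INTENSIFIERS = {
    "Japanese": {
        "jishin": ("human", "animate"),
        "jitai": ("inanimate", "inanimate"),
    },
}


def covers(span, category):
    """True if the category lies within the (highest, lowest) span."""
    top, bottom = (HIERARCHY.index(step) for step in span)
    return top <= HIERARCHY.index(category) <= bottom


def intensifiers_for(language, category):
    """List the intensifiers of a language that admit the given category."""
    return [form for form, span in INTENSIFIERS[language].items()
            if covers(span, category)]


if __name__ == "__main__":
    for category in HIERARCHY:
        print(category, intensifiers_for("Japanese", category))
```

Storing a span rather than a list of categories builds in the generalization that restrictions follow the hierarchy: an intensifier that skipped a step could not be represented at all.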

4.2.2. Historical sources

The selectional restrictions of a given intensifier are often associated with its historical origin (i.e. intensifiers derived from specific sources are associated with specific selectional restrictions). Many intensifiers derive from body parts (e.g. ‘head’) or the notion ‘body’ itself. Such intensifiers are typically (though not in all cases) associated with a distributional restriction to human or animate referents:

(13) ‘body’: Korean casin, Malagasy -tena, Maricopa maatm, Yoruba fúnra-, etc.
     ‘head’: Abkhaz -xatà, Amharic ras-, Podoko ba mudarə-, Soninke yinmé, etc.

A second – relatively frequent – diachronic source of intensifiers is the numeral ‘one’ or an element meaning ‘alone’ (some languages use the same expression for both of these meanings):

(14) ‘alone’: Indonesian sendiri, Tzotzil -tuk, Yiddish aleyn, etc.
     ‘one’: (Tetelcingo) Nahuatl sie, Lingala mɔ́kɔ́, etc.

A number of languages have intensifiers based on an element indicating ‘precision of reference’, e.g. ‘precisely (NP)’ or ‘(the) very (N)’. Such intensifiers typically do not exhibit selectional restrictions:

(15) Mixtec máá- ‘exact(ly)’, Breton end-eeun ‘exactly, precisely’, Zapotec lagahk < PRO + gahk ‘exactly, precisely’, etc.

Of course, there are innumerable other types of historical sources giving rise to the development of intensifiers, but most of them seem to be found only in single or very few languages (e.g. Hebrew atsmo < ‘bone’, Hungarian maga < ‘seed’). In many cases, no etymology can be determined with absolute certainty, even in well documented languages such as Indo-European ones. For instance, the intensifiers of most Romance languages are based on (the Latin intensifier) ipse, with different types of additions or truncations. Given that ipse can itself be regarded as incorporating an older element with an intensifying meaning (it is generally analyzed as a combination of the pronominal/demonstrative stem is with an intensifying suffix -pse), we cannot trace the historical origin of It. stesso, Sp. mismo and Fr. même back to any element which does not itself have an intensifying meaning. Similarly, the historical origin of intensifiers in Germanic languages is unclear, though there are a number of plausible hypotheses (cf. König and Gast 2006: 266–267 for a survey).

4.2.3. Formal identity vs. differentiation of intensifiers and reflexives

In approximately half of the world’s languages, intensifiers are formally indistinguishable from reflexive anaphors. This correspondence is illustrated for English in (16), for Lezgian in (17) and for Malagasy in (18):

(16) a. The president himself opened the meeting.
     b. John looked at himself.

(17) Lezgian
     a. či      q’iliw  Lenin  wič      ata-nwa!
        we.GEN  to      Lenin  INT.ABS  come-PFCT
        ‘Lenin himself has come to us!’
        Haspelmath (1993: 186)
     b. Ali-di-zi    wič       güzgü.d-a     akwa-zwa
        Ali-OBL-DAT  REFL.ABS  mirror-INESS  see-IMPF
        ‘Ali sees himself in the mirror.’
        Haspelmath (1993: 185)

(18) Malagasy
     a. tonga    izy  tena-ny
        arrived  he   INT-POSS.3SG
        ‘He himself arrived.’
        Randriamasimanana (1986: 232)
     b. mahita-tena  i   Koto
        sees-REFL    NA  Koto
        ‘Koto can see himself.’
        Zribi-Hertz and Rajaonarisoa (1999: 21)

Whether or not intensifiers and reflexives are formally identical has a number of repercussions on the distribution of the relevant items. For instance, languages like English (where the two types of expressions are indistinguishable) often (though not necessarily) avoid the co-occurrence of an intensifier with a reflexive pronoun – sometimes for synchronic reasons (horror aequi), sometimes for diachronic ones. Given that English self-forms function as both intensifiers and reflexive pronouns, corresponding to, say, both Germ. selbst and sich, we would expect them to co-occur in contexts such as (20) (the counterpart of the German example (19)), but such combinations are impossible in English (cf. also Gast and Siemund 2006 on the co-occurrence of intensifiers and reflexives):

(19) [Hans wasn’t injured by anyone else; …]
     er  hat  sich  selbst  verletzt.
     he  has  REFL  INT     injured

(20) *He injured himself_REFL himself_INT.

The only way of conveying the distinction between a ‘bare’ reflexive and the combination of a reflexive and an intensifier is to use specific stress patterns (an English sentence corresponding to (19) would have stress on -self rather than injured). Other languages in which intensifiers and reflexives are indistinguishable do allow the ‘double use’ of such elements. It is possible, for instance, in Kashmiri (panun paan) and Tsakhur (wuž-e: wuž):

(21) Kashmiri
     Koorev     sajoov     panun  paan.
     girls.ERG  decorated  INT    REFL
     ‘The girls decorated themselves.’
     Wali et al. (2000: 474)

(22) Tsakhur
     Rasul-e:   wuž-e:   wuž       getu.
     Rasul-ERG  INT-ERG  REFL.NOM  beat
     ‘Rasul beat himself.’
     Lyutikova (2000: 229)

Moreover, in English intensifiers cannot be freely combined with object pronouns. This clearly has diachronic reasons (cf. Gast 2006: ch. 8). A sentence such as (23) is therefore generally deviant, and an ‘avoidance strategy’ as in (24) is the most natural way of conveying the meaning intended in (23):

(23) ??I saw him himself.
(24) I saw the man himself.

5. The make-up of the database

Having provided an overview of the type of expression documented in the database, and the type of variation found in the relevant domain, we will now turn to a description of the database itself, and the way information can be retrieved from it. The data structure will be outlined in Section 5.1 and the three search options available on the current user interface will be described in Sections 5.2–5.4. There are three types of queries: (i) searches for languages (Section 5.2), (ii) searches for examples of a specific type (Section 5.3), and (iii) searches for intensifiers of a specific type (Section 5.4).

5.1. The data structure

TDIR is a relational database with a very simple structure. It is built around two types of entities, i.e. Languages and Examples (which are capitalized in order to distinguish them from ‘real’ languages and example sentences). Technically, Languages are sets of attribute-value pairs which, in addition to information concerning the language itself (language name, Ethnologue code, area/family/country), specify the features to be retrieved by users, e.g. the primary adnominal intensifier of the language in question as well as its lexical source, exclusive intensifiers, relevant reflexive markers, etc. While most of the attributes require natural language markers as a value, there are also three fields providing space for more heterogeneous information: one for paradigms, one for ‘matters of representation’ and one for general comments. Given that each piece of information is regarded as an attribute of exactly one Language, all the information can be stored in a single table.

This (very simple) design, which is a residue of the time when the database served project-internal purposes, implies obvious limitations. In particular, given that the database is organized around Languages – rather than, say, Strategies or Markers – each attribute can be specified only once per Language. For instance, there is only one column for ‘historical source’ in the Language table, which relates to the ‘primary adnominal intensifier’. Detailed information on the ‘minor strategies’ of a given language is thus not provided for (though there is a single field for ‘other intensifiers’ in each data set). This feature is a clear disadvantage of the database that should be avoided in comparable projects in the future.

The second major table, linked to the Language table via the Ethnologue code, contains examples. In addition to the ‘linguistic tiers’ (original, glosses, translation, source) there is a field for the type of expression exemplified (adnominal intensifier, exclusive intensifier, inclusive intensifier, reflexive, scalar focus particle), which allows users to retrieve examples of a specific type by specifying a value in one of the search forms (cf. Section 5.3). Moreover, the Examples table contains an all-purpose ‘comments’-field and an identification code pointing to a third table with references and sources. The fourth table of the database, finally, also linked to the Examples table, contains a list of glosses with short explanations.

5.2. Search for languages

For each language documented in the database, a ‘grammar fragment’ can be accessed by selecting the language from a dropdown field via the ‘Search for intensifiers and examples’ link on the main page of the interface.
A grammar fragment is a short survey of the most important information: the relevant elements are listed and their distributional and morphological properties are outlined. Let us consider an example. The grammar fragment of the Mayan language Tzotzil as rendered in Figure 4 provides an overview of the items used in that domain at the top of the page:


Figure 4. Grammar fragment of Tzotzil

The first row indicates the ‘primary adnominal intensifier’ of Tzotzil (-tuk). It also provides an indication of its morphological properties (-tuk takes a possessive prefix) and of the way the intensifier combines with its head NP (-tuk precedes either a lexical NP or a pronoun). Moreover, sortal restrictions applying to the primary intensifier -tuk are indicated (it only combines with animate NPs), and information on its lexical source is provided (< ‘alone’). Given that many languages have more than one intensifier, there is also a field for ‘other intensifiers’. In the case of Tzotzil, the element mismo (borrowed from Spanish) is also occasionally used as an alternative to POSS-tuk. The next line indicates the exclusive intensifier of Tzotzil, which is formally identical to the adnominal one, differing from it only in terms of its position. This piece of information is not given in the table at the top of the page, but is also retrievable from the grammar fragment, as it is mentioned in the ‘comments’ field further down (cf. below). Moreover, the ‘grammar fragment’ indicates the form of the primary reflexive marker (POSS-ba) as well as of any secondary reflexives (no such element could be identified in the case of Tzotzil), of the attributive intensifier (POSS-tuk in a different construction) and of the most commonly used scalar additive focus particles (i.e. elements meaning ‘even’). Scalar additive focus particles have been included because in some languages, intensifiers are also used in that function, e.g. in French (même) and German (selbst). Each of the elements for which examples are available is linked directly to glossed examples. The format of examples will be described in Section 5.3.


The table at the top of each grammar fragment (cf. Figure 4) only provides a rough overview of the elements in question, as well as links to relevant examples. More information is given further down on the page. First, there is information about the language itself, in particular its genetic affiliation and geographical position. Second, there is a field for inflectional paradigms. In the grammar fragment of Tzotzil, this field indicates the paradigms for both (the intensifier) POSS-tuk and (the reflexive) POSS-ba (cf. Figure 5). Moreover, there is a ‘comments’ field, which in the case of Tzotzil specifies the structural position of the exclusive intensifier (preverbal). For languages using any specific non-standard characters or a standardized orthography, there is also a field for ‘matters of representation’ (no relevant information is given in the case of Tzotzil).

Figure 5. Additional information on Tzotzil
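The data structure outlined in Section 5.1 and the language lookup described in this section can be sketched with Python and SQLite. This is an illustration only, not the actual TDIR implementation: all table and column names are assumptions, the link between glosses and examples is simplified to a stand-alone lookup table, and the Ethnologue code `xxx` is a placeholder. The Tzotzil values are those reported in the text above.

```python
import sqlite3

# Hypothetical schema: every table and column name is an illustrative assumption.
SCHEMA = """
CREATE TABLE language (
    ethnologue_code    TEXT PRIMARY KEY,
    name               TEXT NOT NULL,
    family             TEXT,
    area               TEXT,
    primary_adnominal  TEXT,  -- primary adnominal intensifier
    lexical_source     TEXT,  -- one column only: source of the primary intensifier
    sortal_restriction TEXT,
    other_intensifiers TEXT,  -- single free-text field for minor strategies
    primary_reflexive  TEXT,
    paradigms          TEXT,
    representation     TEXT,
    comments           TEXT
);
CREATE TABLE bibliography (
    ref_id   INTEGER PRIMARY KEY,
    citation TEXT NOT NULL
);
CREATE TABLE example (
    example_id      INTEGER PRIMARY KEY,
    ethnologue_code TEXT NOT NULL REFERENCES language(ethnologue_code),
    expression_type TEXT NOT NULL,
    original TEXT, gloss TEXT, translation TEXT, source TEXT,
    ref_id   INTEGER REFERENCES bibliography(ref_id),
    comments TEXT
);
CREATE TABLE gloss (
    abbreviation TEXT PRIMARY KEY,
    explanation  TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database holding one Language and two Examples."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO language VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        ("xxx",  # placeholder, not the real Ethnologue code
         "Tzotzil", "Mayan", "Mesoamerica", "POSS-tuk", "alone", "animate",
         "mismo", "POSS-ba", None, None,
         "exclusive intensifier: preverbal position"))
    # Placeholder example rows: only the expression type is filled in.
    conn.executemany(
        "INSERT INTO example (ethnologue_code, expression_type) VALUES (?, ?)",
        [("xxx", "adnominal intensifier"), ("xxx", "reflexive")])
    return conn


def grammar_fragment(conn, code):
    """Return the Language row plus the expression types that have examples."""
    lang = conn.execute(
        "SELECT * FROM language WHERE ethnologue_code = ?", (code,)).fetchone()
    if lang is None:
        return None
    types = [row[0] for row in conn.execute(
        "SELECT DISTINCT expression_type FROM example "
        "WHERE ethnologue_code = ? ORDER BY expression_type", (code,))]
    return {"language": dict(lang), "example_types": types}
```

Because every feature is a column of the single `language` table, a second intensifier with its own lexical source has nowhere to go except the free-text `other_intensifiers` field, which is the limitation noted in Section 5.1.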

5.3. Search for examples of a specific type

As mentioned above, the grammar fragment provides direct links to (glossed) examples, which can be accessed by simply clicking on the relevant markers. Another way of finding examples is via the ‘Search for examples of a specific type’ option. The following search parameters can be specified: (i) type of element exemplified (adnominal intensifier, exclusive intensifier, reflexive, etc.); (ii) language; (iii) search string in original; (iv) search string in gloss; (v) search string in translation; and (vi) search string in source(s). While the type of example and the language can be chosen from a dropdown field, the other parameters are typed into a text field. The search parameters are linked by a Boolean AND-operator. Given that the default value is in all cases ‘unrestricted’, leaving all parameters unspecified delivers a list of all examples contained in the database (on April 30, 2007, the database contained 689 examples). All parameters can be freely combined. The output given by the ‘Search for examples of a specific type’ option is a list of examples meeting the conditions specified in the search form. The examples are presented in a simple ‘original-gloss-translation-source’ format. A search for exclusive intensifiers in Arabic, for instance, delivers the output shown in Figure 6.


Figure 6. Search for examples of exclusive intensifiers in Arabic

As can be seen from Figure 6, there are also links to (a) the glosses used in the examples, and (b) the sources from which the examples have been taken (at the top of the page), even though a complete indication of the source is also given in the example itself. The Arabic examples given in Figure 6 could also have been found via the language search (i.e., in a grammar fragment). However, the example search option offers a number of additional features and may even be helpful for users who are not primarily interested in intensifiers. For instance, one may search for specific grammatical categories in the gloss tier (past tense, indicative mood, etc.), or even for references to a specific book or author.
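The search logic described in this section (optional parameters, all linked by AND, with ‘unrestricted’ as the default) can be sketched as follows. The column names and the helper `make_demo_db` are assumptions made for this illustration, and the actual search form may be implemented quite differently. The two demo rows are examples (22) and (24) from this chapter; their classification by expression type is likewise only assumed here.

```python
import sqlite3

# Hypothetical column names; the real search form may work differently.
EXACT_FIELDS = ("expression_type", "language")                     # dropdowns
SUBSTRING_FIELDS = ("original", "gloss", "translation", "source")  # text fields


def search_examples(conn, **params):
    """Return the ids of examples meeting all specified conditions (AND).

    Parameters left out, or given as None or '', count as 'unrestricted'.
    """
    clauses, values = [], []
    for field, value in params.items():
        if field not in EXACT_FIELDS + SUBSTRING_FIELDS:
            raise ValueError(f"unknown search parameter: {field}")
        if not value:
            continue
        if field in EXACT_FIELDS:
            clauses.append(f"{field} = ?")
        else:
            clauses.append(f"instr({field}, ?) > 0")  # substring match
        values.append(value)
    where = " AND ".join(clauses) if clauses else "1"
    sql = f"SELECT example_id FROM example WHERE {where} ORDER BY example_id"
    return [row[0] for row in conn.execute(sql, values)]


def make_demo_db():
    """Build a two-row demo table from examples (22) and (24) of this chapter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE example (example_id INTEGER PRIMARY KEY, language TEXT, "
        "expression_type TEXT, original TEXT, gloss TEXT, translation TEXT, "
        "source TEXT)")
    conn.executemany(
        "INSERT INTO example VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(1, "Tsakhur", "reflexive",  # expression type assumed for the demo
          "Rasul-e: wuž-e: wuž getu.", "Rasul-ERG INT-ERG REFL.NOM beat",
          "Rasul beat himself.", "Lyutikova (2000: 229)"),
         (2, "English", "adnominal intensifier",
          "I saw the man himself.", "", "I saw the man himself.", "")])
    return conn
```

With this sketch, a call without parameters returns every example, while `search_examples(conn, language="Tsakhur", gloss="ERG")` narrows the list to examples whose gloss tier contains the string `ERG`, mirroring the search for grammatical categories in the gloss tier mentioned above.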

5.4. Search for intensifiers of a specific type

Finally, one can also search for intensifiers with specific properties. The search parameters are: (i) sortal restrictions, (ii) lexical source, (iii) area and (iv) family. The search form is shown in Figure 7:

Figure 7. Search form for ‘Search for intensifiers of a specific type’

A search for intensifiers with a selectional restriction to animate referents in Mesoamerica (as illustrated in Figure 7) delivers Mixtec máá-, Tzotzil -tuk, Totonac ma:niʔ and Zapotec lagahk. This information is, again, provided in the form of a grammar fragment, though only the first part of the fragment is displayed (i.e. the table with the most basic information and links to examples). Another one of these grammar fragments is shown in Figure 8 (the one of Totonac).


Figure 8. Grammar fragment of Totonac, accessed via the ‘Search for intensifiers of a specific type’ option.

6. Summary

The Typological Database of Intensifiers and Reflexives provides basic information on intensifiers in a sample of more than 100 languages. Given the format of the database, the information is of course limited, but the availability of glossed examples and references to pertinent literature provides the basis for further investigations. In other words, the database is intended as a first point of reference for researchers interested in the encoding of intensifiers and reflexives. It has been mentioned that the Typological Database of Intensifiers and Reflexives is basically a by-product of a typological research project, though some extra funding was also provided specifically for the database (cf. Note 1). I have aimed to show that it has a basically descriptive purpose, i.e. it is a reference work rather than an analytic research tool. Given that no balanced samples have been defined, it is not fit for use as the only input to any statistical procedures. Moreover, it is likely that the Typological Database of Intensifiers and Reflexives still exhibits a number of shortcomings and, perhaps, also isolated inaccuracies in the data. Restrictions of time and money have prevented us from developing a more sophisticated and more comprehensive tool. Still, we hope to have made a contribution to the development of an openly accessible typological data pool, perhaps instigating the development of more (and better) typological databases than ours.

Acknowledgements

This paper was written during a visit at the University of Melbourne (School of Languages and Linguistics) in April 2007. I am greatly indebted to Ekkehard König and the Alexander-von-Humboldt Foundation for financial support, and to Nick Evans and other members of the School of Languages and Linguistics for their hospitality and input. Moreover, I would like to thank Martin Haspelmath, Daniel Hole and two anonymous reviewers for helpful comments and criticism. Any remaining inaccuracies are my own.

References

Auwera, Johan van der, and Volker Gast forthc. Prototypes and categories. In The Oxford Handbook of Linguistic Typology, Jae Jung Song (ed.). Oxford: Oxford University Press.
Auwera, Johan van der, and Vladimir A. Plungian 1998 Modality’s semantic map. Linguistic Typology 2: 79–124.
Buchholz, Oda and Wilfried Fiedler 1987 Albanische Grammatik. Leipzig: VEB Enzyklopädie.
Comrie, Bernard and Norval Smith 1977 Lingua descriptive series: Questionnaire. Lingua 42 (1): 1–72.
Dixon, Robert Malcolm Ward 1994 Ergativity. Cambridge: Cambridge University Press.
Evans, Nicholas and Alan Dench 2006 Introduction: Catching language. In Catching Language: The Art and Craft of Grammar Writing, Nicholas Evans, Alan Dench, and Felix K. Ameka (eds.), 1–39. Berlin / New York: Mouton de Gruyter.
Gast, Volker 2006 The Grammar of Identity: Intensifiers and Reflexives in Germanic Languages. London: Routledge.
Haspelmath, Martin 1993 A Grammar of Lezgian. Berlin / New York: Mouton de Gruyter.
Haspelmath, Martin 1997 Indefinite Pronouns. Oxford: Oxford University Press.
Hewitt, B. George 1995 Georgian – A Structural Reference Grammar. Amsterdam: John Benjamins.


Hewitt, B. George and Zaira K. Khiba 1989 Abkhaz. London: Routledge.
Hole, Daniel 2000 Heuristics and typology. Sprachtypologie und Universalienforschung 53 (1): 13–20.
König, Ekkehard and Volker Gast 2006 Focused expressions of identity: A typology of intensifiers. Linguistic Typology 10 (2): 223–276.
Lehmann, Christian 2004 Funktionale Grammatikographie. In Dimensionen und Kontinua: Beiträge zu Hansjakob Seilers Universalienforschung, Waldfried Premper (ed.), 147–165. Bochum: Brockmeyer.
Lehmann, Christian and Elena Maslova 2004 Grammaticography. In Morphologie: Ein internationales Handbuch zur Flexion und Wortbildung, Vol. 2, Geert E. Booij, Christian Lehmann, Joachim Mugdan and Stavros Skopeteas (eds.), 1857–1882. Berlin / New York: Mouton de Gruyter.
Lyutikova, Ekaterina A. 2000 Reflexives and emphasis in Tsaxur (Nakh-Dagestanian). In Reflexives: Forms and Functions, Zygmunt Frajzyngier and Traci S. Curl (eds.), 227–255. Amsterdam: John Benjamins.
Mosel, Ulrike 2002 Analytic and synthetic language description. In Linguistik jenseits des Strukturalismus: Akten des II. Ost-West-Kolloquiums, Berlin 1998, Kennosuke Ezawa, Wilfried Kürschner, Karl H. Rensch, and Manfred Ringmacher (eds.), 199–208. Tübingen: Narr.
Mosel, Ulrike 2006 Grammaticography – the art and craft of writing grammars. In Catching Language: The Art and Craft of Grammar Writing, Nicholas Evans, Alan Dench, and Felix K. Ameka (eds.), 41–68. Berlin / New York: Mouton de Gruyter.
Randriamasimanana, Charles 1986 The Causatives of Malagasy. Honolulu: The University of Hawaii.
Sengupta, Gautam 2000 Lexical anaphors and pronouns in Bangla. In Lexical Anaphors and Pronouns in Selected South Asian Languages, Barbara C. Lust, Kashi Wali, James W. Gair, and Karumuri Venkata Subbarao (eds.), 277–332. Berlin / New York: Mouton de Gruyter.
Wali, Kashi, Omkar N. Koul, Peter Edwin Hook and Ashok K. Koul 2000 Lexical anaphors and pronouns in Kashmiri. In Lexical Anaphors and Pronouns in Selected South Asian Languages, Barbara C. Lust, Kashi Wali, James W. Gair, and Karumuri Venkata Subbarao (eds.), 471–512. Berlin / New York: Mouton de Gruyter.

Zribi-Hertz, Anne and Fara Rajaonarisoa 1999 Reflexive marking as noun incorporation: Evidence from French and Malagasy. Ms., 39 pp. Paris.

StressTyp: A database for word accentual patterns in the world’s languages

Rob Goedemans and Harry van der Hulst

Introduction

It is possible, although inadvisable, to discuss the structure of a linguistic database without saying a few things about the nature and linguistic analyses of the data that the database aims to store and query. In cognitive science terms, this would be like jumping ahead to the implementational level without taking note of the computational and algorithmic levels (however one delimits these levels in detail). Here, we take the computational level to involve specifying the nature of the data, and the algorithmic level to refer to the way in which linguists have generally captured regularities in the data. Even though the goal of the present volume is to focus on database structure and use (the implementational level), we supply a discussion of the nature and linguistic analyses of stress in section 1, hoping not to make it a barrier to the discussion of the database.

In 1991, we started working on a database for word stress systems, and it is hard to believe that we have been working on this project, off and on, for 15 years now. We called the database StressTyp, but if we had had the perspective we have now, we would probably have called it AccentTyp, because stress is just one manifestation of the broader phenomenon of accent (see section 1). Since the database is mostly designed to store information about the location of the accent that lies behind stress, it would appear that the focus of StressTyp is, in fact, on accent location. Over the years, StressTyp has developed into a full-fledged typological database that currently contains information on the word accentual systems of 510 languages.
In this chapter we will describe the theoretical underpinnings of StressTyp (section 1), the history and current status of StressTyp (section 2.1), the goals of StressTyp (section 2.2), the limitations (section 2.3), dissemination (section 2.4), future developments (section 2.5), the architecture of StressTyp (section 3) and what one can do with StressTyp (section 4). The prefinal section (section 5) is devoted to a comparison of StressTyp with other data collections or databases that store information on word accentual systems. We also provide an appendix with StressTyp fields and codes that may be useful for reference while reading this chapter (other reference material, like a list of languages contained in StressTyp, is available on the web at http://stresstyp.leidenuniv.nl). Section 6 concludes this chapter.

1. The nature and linguistic analyses of our data

A considerable number of languages (including English) display a phenomenon known as word stress. Word stress is one manifestation of a general characteristic of human languages, which is that linguistic expressions appear to have a ‘prominence structure’. Linguistic prominence can be studied with reference to various domains such as ‘words’, ‘phrases’ or complete ‘sentences’, and even though there is no undisputed clear-cut definition of these domains, it does seem clear that the prominence patterns are neither universal nor randomly diverse. In other words, even though there are differences among the languages of the world, there are, at the same time, recurrent patterns. These patterns are typically (but apparently not always) grounded in general principles of rhythm, according to which ‘beats’ are spaced apart by a recurrent small number of non-beats, as well as being grammaticalized and lexicalized, by being put to use, among others, as markers of morphological and syntactic structure, in particular, but not exclusively, structural edges. Most prominence patterns, then, have at least two general characteristics. Firstly, there tends to be the above-mentioned regular alternation of beats and non-beats; this is their rhythmical aspect. Secondly, as markers of domains, there must be one unit in these domains that stands out over all others. This unit marks the domain in opposition to other, adjacent domains, and ideally one of its edges (either the left edge or the right edge) by being either at the edge or close to it. Domain and edge marking and rhythmical alternation are two entirely different sides of the prominence pattern, the first elevating one unit within the domain to a unique status, the other causing a regular strong – weak alternation among the units within the domain. Rhythm and domain marking are independent, and we will see that neither is crucially dependent on the other.
However, rhythm and domain/edge marking, when co-occurring, are interrelated in that it so happens that the unique unit, which we will call the head of the domain, at least typically, is a rhythmically strong unit. Classical metrical theory (Liberman & Prince 1977) was founded on making this connection by seeing rhythm as necessarily feeding head-location. A further unification of rhythm and domain


marking was then achieved by construing rhythm itself as the marking of domains called feet (prototypically consisting of two syllables or morae), so that the strong beat of the foot effectively became the head of the foot, while the head of the word would be the head of one of the feet within the word. Thus, head-marking became the unifying device for constructing (or analyzing) linguistic prominence patterns. As such, metrical theory embodied a theory of phonological dependency which, in fact, had already independently been established in Dependency Phonology (Anderson and Jones 1974, 1977; Anderson and Ewen 1987). In the linguistic literature on word stress, a distinction is commonly made between primary stress and non-primary stress. Primary stress corresponds to head marking at the word level, whereas non-primary stresses refer to the rhythmic alternation. The latter is sometimes further divided into secondary, tertiary etc. stress and, at this point in our story, it is not clear how such finer distinctions can result from ‘pure’ rhythm, which has so far been depicted as a ‘flat’ regular alternation of ‘strong’ and ‘weak’. In addition to the term ‘stress’, we also find the term ‘accent’. Why this difference? When we say that one unit (let us say, a syllable) within the word stands out as the head, we say nothing about the manner in which it stands out. Using the pre-theoretical term ‘prominence’ also implies little, if anything in this respect. It turns out that heads can be manifested in a variety of ways. Hyman (1977) made a distinction between stress-accent languages and pitch-accent languages. The generalizing notion for him was accent, which we will take to be an alternative for the term head in the formal characterization of prominence patterns. (We also speak of heads of syntactic phrases or morphological complex words, where the term accent is not used. 
The notion head, in our view, is relevant in all components of the grammar, which sometimes goes unnoticed precisely because people use different terms for it.) In Hyman’s view accents do not have an inherent manifestation. In a pitch-accent language, the accent is cued by a pitch property (an elevated pitch or a pitch rise, typically). In a stress-accent language, the manifestation is ‘stress’ which he took to be the kind of properties that are typically associated with ‘stress’ in languages such as English (extra duration, extra loudness, hyper-articulation etc.). However, there is no reason (and Hyman would, we are sure, agree) to limit the manifestation possibilities of accent to these two cases. Accent could be manifested by duration alone (a duration-accent language), or by full vowel quality (stressless vowels being reduced), etc. In addition, the head may distinguish itself from non-heads by a greater array of phonotactic possibilities, or by being the locus of tonal distinctions, or by being the anchor point for intonational

tones.1 It would now appear that the study of ‘prominence’ (with a Eurocentric focus on stress-accent languages) must merely be seen as one way of getting to the deeper notion of accent (i.e. headedness). In analyzing accents, the two dimensions of relevance are the specification of the domain of the accent and the determination of the location of the accent within the domain. Subsequent to, or parallel with, that inquiry we can ask how the accent is manifested within the domain, i.e. what cues are available for determining the location of the accent such that the accent can function as the marker of the domain (edges). Manifestations can broadly be grouped into phonetic cues (pitch, duration, loudness, fortition, hyperarticulation, and/or their reverse in unaccented syllables), phonological cues (phonotactic complexity, including both segmental and tonal distributional patterns), or any role in other kinds of regularities, be they phonological, morphological or pertaining to intonation. A comprehensive typology of accent manifestation remains to be developed, but given the broad array of cues and functions it is likely that many more languages may have word accent than just those in which accent is manifested as ‘pitch’ or ‘stress’. As a working hypothesis, we might assume that all languages have accent. Pulgram (1970) has argued that marking of the word domain is perhaps not a universal fact, referring among others to the notorious case of French, in which, he argued, only a phrase-final accent can be observed in the form of cues such as extra loudness and being an anchor for intonational tones. However, if we acknowledge a broader array of accentual manifestations, such claims might well turn out to be misdirected. An argument for the universality of word accent could follow from recognizing word headedness not only as serving the parsing of linguistic expressions into domains, but also as serving the mental storage of words.
If words are not stored as linear strings of syllables, but rather as hierarchically structured objects (with perhaps no linear order as such), it might be that the nature of this hierarchy already involves the notion of head. If this is so, the ultimate motivation of heads would not lie in parsing, but in necessary properties of mental representations. This being the case, we could still argue that the edge-biased location of word accent is grounded in its role as a parsing cue. Thus, we are making a distinction between motivating the very existence of the word head and motivating the location of the head. The fact that heads exist in other grammatical (and most likely language-external) domains where there is no linkage to parsing cues suggests a deeper motivation of heads. Since heads of phrases are not necessarily peripheral to their domain, edge-bias may not be intrinsic to the notion head, but follow, in the area of phonology, from their role in determining parsing cues.

1 The literature on intonational units refers to tones or tone combinations that anchor to word accents as pitch accents. This notion of pitch accent is different from the notion of pitch accent as one type of word accentual system, but the two are clearly related. In both cases pitch units are linked to heads of domains. Intonational pitch accents are tone units that link to phrasal heads (which, lower down, are also word heads) while word level pitch accents link to word heads.

We have revealed thus far that the location of accent within the ‘word’ domain may be dependent on rhythmic alternation (specifically seeking out strong syllables) and the domain itself (specifically seeking out its edges); cf. (1a). The preference for rhythmically strong syllables can be seen as a specific instance of the tendency for head location to be parasitic on a differentiation between syllables that is independently present. When a syllable is rhythmically strong it stands out (along with the other strong syllables) by virtue of the externally imposed rhythm. However, syllables are also differentiated from each other in terms of their internal properties. Thus a syllable containing a long vowel, or high tone or a fully articulated, nonreduced vowel may stand out in comparison to syllables that lack such properties. Syllable-intrinsic ‘weight’ may thus determine location of word accent (1b), but it can also determine the distribution of rhythmically strong syllables (‘foot accents’) which in turn determine the location of word accents (1c). If neither rhythm nor syllable weight feeds the location of accent, only domain edges provide a guide to its location (1d):2

(1)

a. Rhythm-based system: Rhythm ⇒ Edge ⇒ Word Accent
b. Weight-based system: Weight ⇒ Edge ⇒ Word Accent
c. Weight and rhythm-based system: Weight ⇒ Rhythm ⇒ Edge ⇒ Word Accent
d. Minimal system: Edge ⇒ Word Accent

2 The logical possibility Rhythm ⇒ Weight ⇒ Edge ⇒ Word Accent exists, but in this case weight is a phonetic exponent of rhythm, for example when rhythmically strong vowels have a longer duration than weak vowels.

We are assuming, then, that the notion of word accent is independent, in principle, from both weight and rhythm, its location being primarily motivated by being a parsing cue (although its existence may lie in the nature of mental representations), for which reason its location is always dependent on edges. However, weight and/or rhythm may play a role in determining the location of word accent. Note that systems of type b and d show that rhythm is not a crucial aspect of all accentual systems, while domain marking (headedness) is. The external and internal differentiation between syllables can exist independently from head location such that head location can be parasitic on these properties. However, as we showed earlier, when speaking about the manifestation of accent, such properties can also be the result of accent location, a reversal of dependency, so to speak (see footnote 2). Focusing on systems in which rhythm seems relevant to the location of word accent, Metrical Theory made the entirely reasonable move to design layered algorithms in which, for type (1a) languages, firstly words are ‘parsed’ into left- or right-headed feet (a choice that was thought to differentiate languages) after which a second rule picks out the head of the rightmost or leftmost foot to be the head of the word; see (2a). (This second step was initially formalized as a tree-building procedure, constructing a left-branching or right-branching tree taking the feet as terminal elements. This idea was later abandoned in favor of a device that simply elects a peripheral foot (head) as the head of the string.) Type (1b) languages would not have rhythmic feet. Heavy syllables would stand out and the rightmost or leftmost of these syllables would be promoted to primary word accent; see (2b). Initially, heavy syllables were thought to be heads of ‘unbounded’ feet (unbounded meaning the foot domain is often larger than the prototypical two syllables, maximally comprising the whole word), an idea that some gave up and others maintained. In type (1c) languages, the parsing into foot domains was made dependent on a procedure that would designate heavy syllables as necessary foot heads.
The location of these heavy syllable heads would then take priority over the default (left- or right-oriented) procedure of locating the heads of feet by imposing a constraint that bars heavy syllables from a weak position in the foot (hence they will always be heads); see (2c). Finally, type (1d) languages in which accent could be located purely with reference to an edge were in practice also seen as having a foot layer (see 2d-ii). The fact that no rhythm could be ‘perceived’ would be consistent with the idea that heads of feet can, but need not have an audible phonetic cue: (2) a. Rhythm-based system: Rhythm ⇒ Edge ⇒ Word Accent (i)

* * * * *
((σσ)(σσ)(σσ)(σσ))

– Feet can be left- or right-headed, and the primary accent can be left- or right-oriented.
– Primary accent can even be located on the third syllable from the edge if a peripheral syllable is marked as extrametrical:3

(ii)
* * * * *
((σσ)(σσ)(σσ)(σσ) )

– Extrametricality creates ambiguity in that, for example, a system with penultimate stress can be derived with left-headed feet as in (i) or right-headed feet plus extrametricality.
– To differentiate between the various non-primary accents, additional structure would have to be postulated, as was the case in the original versions of Metrical Phonology.

b. Weight-based system: Weight ⇒ Edge ⇒ Word Accent

* * *
(σσσσ σσσ)
(a bold sigma indicates a heavy syllable)

– Systems of this sort need a default rule for words that lack heavy syllables. This default appears to be independent in its edge orientation from the rule that promotes a peripheral heavy syllable. For example, a system that promotes the rightmost heavy syllable can have the initial or final syllable as its default location.

c. Weight and rhythm-based system: Weight ⇒ Rhythm ⇒ Edge ⇒ Word Accent

* * * * *
((σσ)σ(σ )(σσ)σ(σ))

– The same remarks as for (2a) apply here. Note that heavy syllables must be heads of feet, a factor that disturbs the rhythm, which may trigger de-accenting rules to avoid accent clashes.

d. Minimal system: Edge ⇒ Word Accent

(i)
*
(σσσσσσ)

– Even penultimate accent can be derived if it is allowed to make a peripheral syllable ‘extrametrical’.
– As mentioned, this kind of system can be derived via an inaudible foot layer:

(ii)
* * * * *
((σσ)(σσ)(σσ)(σσ))

3 The device of extrametricality stipulates that a peripheral syllable can be ignored.
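The layered procedures in (2a), (2b) and (2d) can be made concrete with a small sketch. It is an illustration under simplifying assumptions rather than an implementation of any particular version of metrical theory: syllables are numbered from 0, weight is given as a list of 'H' (heavy) and 'L' (light), feet are strictly binary, a leftover syllable stays unparsed, words are assumed to have at least two syllables, and all function and parameter names are invented here. The combined system in (2c) is not covered.

```python
def binary_feet(first, last, direction):
    """Pair syllable indices first..last into binary feet from one edge.

    A leftover syllable at the far end stays unparsed (a simplification).
    """
    if direction == "left":
        return [(i, i + 1) for i in range(first, last, 2)]
    return sorted((i - 1, i) for i in range(last, first, -2))


def rhythm_based(n, foot_head="left", word_edge="right",
                 direction="left", extrametrical=None):
    """Type (2a): build feet first, then promote the head of a peripheral foot.

    Returns (index of primary accent, indices of all foot heads).
    """
    first, last = 0, n - 1
    if extrametrical == "left":      # ignore the initial syllable
        first += 1
    elif extrametrical == "right":   # ignore the final syllable
        last -= 1
    feet = binary_feet(first, last, direction)
    heads = [foot[0] if foot_head == "left" else foot[1] for foot in feet]
    if not heads:                    # too short to build a foot
        return first, []
    primary = heads[-1] if word_edge == "right" else heads[0]
    return primary, heads


def weight_based(weights, edge="right", default="left"):
    """Type (2b): promote the leftmost or rightmost heavy syllable.

    Words without heavy syllables fall back on an independent default edge.
    """
    heavy = [i for i, w in enumerate(weights) if w == "H"]
    if heavy:
        return heavy[-1] if edge == "right" else heavy[0]
    return 0 if default == "left" else len(weights) - 1


def minimal(n, edge="right", extrametrical=False):
    """Type (2d): accent the edge syllable, or its neighbour if the edge
    syllable is extrametrical."""
    if edge == "left":
        return 1 if extrametrical else 0
    return n - 2 if extrametrical else n - 1
```

For an eight-syllable word, `rhythm_based(8)` puts the primary accent on the penultimate syllable (index 6) by way of left-headed feet. The call `rhythm_based(8, foot_head="right", direction="right", extrametrical="right")` reaches the same syllable with right-headed feet plus extrametricality, which is the ambiguity noted under (2a). Keeping left-headed feet in that second call instead yields the third syllable from the edge (index 5).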

Variants and notational issues aside (see van der Hulst 1999), all these systems have a predictable location of the word accent. There are, as it turns out, also languages in which the accent location must be lexically specified. In practice, these systems can all be analyzed as involving ‘diacritic weight’: we simply mark the syllable that unpredictably has primary accent as heavy and then apply the above weight-sensitive algorithms (cf. 2b, c) to derive different types of lexically marked systems. Systems of type (1b, 2b) have been called unbounded because the location of primary accent can be anywhere in the word. This is in sharp contrast with all other systems, called bounded, in which the primary accent is (a) strictly peripheral (final, initial), (b) near-peripheral (post-initial, prefinal) or (c) ‘third-in’ (critically due to extrametricality). The metrical approach is committed to the idea that primary accent is always dependent on prior foot assignment (Liberman and Prince 1977; Vergnaud and Halle 1978; Idsardi 1992; Halle and Idsardi 1994; Kager 1993; for surveys see van der Hulst 1999, 2000a, 2000b, 2002, 2006). Given that one needs to account for rhythmic structure, the location of primary accent almost comes for free. Van der Hulst (1984) first noted that the location of primary accent (putting unbounded systems aside) does not always follow from the principles that determine the rhythmic structure of the word (see also van der Hulst 1990, 1992, 2002, 2006, to appear; van der Hulst and Kooij 1994; van der Hulst and Lahiri 1988; for similar views see Harms 1981; Roca 1986; Hurch 1995; McGarrity 2003). The location of primary accent and the distribution of rhythmic beats can display subtle differences. For example, one can be weight-sensitive while the other is not. Also, while primary accent can be lexically determined, rhythmic beats never are.
These and other reasons suggest that, even though primary accent has rhythm-like distributions, its rhythmic grounding has been grammaticalized: the placement of primary accent has become a separate algorithm that can be divorced from the rhythmic principles that continue to account for the overall rhythmic patterns of words. Being grammaticalized, the algorithm can become sensitive to purely lexical factors such as diacritic weight, dependence on word class and stratal layers in the lexicon, which suggests that the algorithm for primary word accent is a lexical procedure. Rhythmic structure, on the other hand, has all the properties of post-lexical or rather implementational processes. In the emerging view the dependency between rhythm and primary accent is reversed in comparison to metrical theory. Rhythmic structure is now dependent on the prior location of primary accent in the sense that this primary accent location must be properly integrated into the rhythmic structure, which must be built around it. Why would primary accent location and not rhythm tend to grammaticalize and lexicalize? We suspect that this may be related to the above-mentioned idea that mental representations of words need a headed structure of some sort.

The idea of separating primary and non-primary accent structure leads to the following approach. Some variant of the part of metrical theory that distributes rhythmic beats can be maintained as a procedure that applies to actual utterances. For primary accent location, van der Hulst (1996, 1999, to appear) proposes the following. For all systems in which accent is bounded (a, c and d in 1 and 2) we say that a bisyllabic domain is selected at the left or right edge of the word. However, to accommodate unbounded systems (b in 1 and 2), we also allow the option that the domain for accentuation is the whole word. With respect to both options, extrametricality can apply. If the system is weight-insensitive (cf. a in 1 and 2), this produces only one case, (3a). We only need to say whether the left or right edge is selected for primary accent. If the system is weight-sensitive, the (phonologically or diacritically) heavy syllables stand out. If the domain contains only one heavy syllable, this will be the only syllable that is available for primary accent selection. If there is more than one heavy syllable, we need to say which one wins.
If there is none, we need a default rule (cf. Prince 1983). These procedures apply in (3b) where the domain is the whole word and in (3c) where the domain is a two-syllable window. Finally, we can select the whole word and not have weight as a relevant factor (see 3d): (3)

a. Rhythm-based system: Rhythm ⇒ Edge ⇒ Word Accent

* (σσσ(σσ))

b. Weight-based system: Weight ⇒ Edge ⇒ Word Accent

(i)

* (σσσσσσσ)

(ii)

* * * (σσσσ σσσ)

(iii)

* (σσσσσσσ)

c. Weight and rhythm-based system: Weight ⇒ Rhythm ⇒ Edge ⇒ Word Accent

(i)

* * (σσσ(σ σ))

(ii)

(iii)

* ** (σσσ(σσ ))

(iv)

* * (σσσ(σσ)) * (σσσ(σσ))

– This is a system in which primary accent lies on the final syllable if it is heavy; otherwise the pre-final syllable is accented.
– The separate edge orientation for the heavy–heavy and light–light cases predicts four types of bounded weight-sensitive systems, which are attested on both the right side and the left side of the word.

d. Minimal system: Edge ⇒ Word Accent

(i)

* (σσσσσσ)

(ii)

* (σσσσ(σσ))

We need to add extrametricality to the mix to derive third-in systems, as well as all bounded weight-sensitive systems in which a peripheral heavy syllable is ignored (such as Classical Latin). This creates several instances of structural ambiguity that, apparently, cannot be avoided. In particular, systems of type a and d can be difficult to differentiate. For example, a weight-insensitive penultimate system can be of type a (locating the head on the left in a right-edge bisyllabic window) or of type d (locating the domain on the right in an unbounded window subject to extrametricality). For pen-peripheral systems (second syllable accent, or penultimate accent), this ambiguity is caused by having the option of extrametricality. However, in peripheral systems (initial or final accent), the ambiguity exists regardless (cf. 3d-i and 3d-ii). Often, the exceptional locations of accents will reveal the nature of the system. Turkish, which has regular final accent, allows exceptional locations of accents far inside the word. This system is thus unbounded. Polish, on the other hand, having regular penultimate accent, only allows exceptions on the final or antepenultimate syllable. This system is therefore bounded.

In this approach, primary accent location can be analyzed in terms of seven parameters, two of which (4a-i and 4b-i) are dependent on the setting of another parameter: (4)

a. Domain size: bounded/unbounded
   (i) Edge of bounded domain: left/right
b. Extrametricality: yes/no
   (i) Edge of extrametricality: left/right
c. Project weight: yes/no⁴
d. If two (or more) heavies: leftmost/rightmost
e. If no heavies: leftmost/rightmost
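The way the parameters in (4) interact can be illustrated with a minimal Python sketch. It is not the StressTyp implementation or the authors' own formalization: the names `AccentParams` and `primary_accent` are hypothetical, syllables are reduced to a heavy/light flag (so diacritic weight is simply a lexically supplied `True`), and two choices are assumptions of the sketch rather than claims from the text, namely that extrametricality is suspended in monosyllables and that parameter (4e) also picks the edge when weight is not projected.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AccentParams:
    """Hypothetical container for the seven parameters in (4)."""
    bounded: bool = True                 # (4a) domain size
    domain_edge: str = "right"           # (4a-i) edge of the bounded domain
    extrametrical: bool = False          # (4b) extrametricality
    extrametrical_edge: str = "right"    # (4b-i) edge of extrametricality
    project_weight: bool = False         # (4c) project weight
    several_heavies: str = "rightmost"   # (4d) if two (or more) heavies
    no_heavies: str = "leftmost"         # (4e) if no heavies


def primary_accent(heavy: List[bool], p: AccentParams) -> int:
    """Return the 0-based index of the syllable that receives primary accent.

    heavy[i] is True if syllable i counts as heavy, whether phonologically
    or through a lexical (diacritic) mark.
    """
    n = len(heavy)
    if n == 0:
        raise ValueError("a word needs at least one syllable")
    lo, hi = 0, n  # candidate syllables are lo .. hi-1

    # Extrametricality hides one peripheral syllable.
    # Assumption: it is suspended in monosyllables.
    if p.extrametrical and n > 1:
        if p.extrametrical_edge == "left":
            lo += 1
        else:
            hi -= 1

    # A bounded domain is a two-syllable window at the chosen edge;
    # an unbounded domain is whatever remains of the whole word.
    if p.bounded:
        if p.domain_edge == "left":
            hi = min(hi, lo + 2)
        else:
            lo = max(lo, hi - 2)

    # Only project heavy syllables if the system is weight-sensitive.
    heavies = [i for i in range(lo, hi) if heavy[i]] if p.project_weight else []
    if heavies:
        return heavies[0] if p.several_heavies == "leftmost" else heavies[-1]
    # Default rule (assumption: also used when weight is not projected).
    return lo if p.no_heavies == "leftmost" else hi - 1


# Final syllable if heavy, else penultimate (a bounded weight-sensitive system).
final_or_penult = AccentParams(project_weight=True)
print(primary_accent([False, False, False, True], final_or_penult))   # 3
print(primary_accent([False, False, False, False], final_or_penult))  # 2

# An unbounded system: first heavy syllable, else the last syllable.
first_heavy_else_last = AccentParams(bounded=False, project_weight=True,
                                     several_heavies="leftmost",
                                     no_heavies="rightmost")
print(primary_accent([False, True, False, True, False], first_heavy_else_last))  # 1
```

Adding `extrametrical=True` to the first setting shifts the window one syllable inward, which gives the third-in pattern (penultimate if heavy, else antepenultimate) mentioned for Classical Latin.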

One type of system remains to be accounted for. This is a system (termed a ‘count system’) in which the location of primary accent is apparently necessarily dependent on the prior exhaustive rhythmification of the entire word. Consider the following primary accent rule: (5)

a. In a word with an even number of syllables, primary accent is prefinal.
b. In a word with an odd number of syllables, primary accent is preprefinal.

It would seem that we have to establish a left-to-right, left-headed rhythmic pattern and then select the rightmost beat as the primary accent, (6a). At first sight, however, it might also be possible to assume a right-edge bounded domain, extrametricality and the projection of ‘rhythmic’ weight, (6b): (6)

a. (i)

* * * * * ((σσ)(σσ)(σσ)(σσ))

(ii)

* * * * * ((σσ)(σσ)(σσ)(σσ)σ)

b. (i)

* * * * (σσσσσ(σσ))

(ii)

* * * * (σσσσσσ(σσ))

4 Note that the weight parameter could be replaced by two constraints that can enter into a dependency relation (‘ranking’): if weight is on: weight > rhythm; if weight is off: rhythm > weight. This is possible, as any parameter can be replaced by two constraints. It is not clear that anything is gained by this alternative.
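The ‘rhythm first’ analysis in (6a) can be made concrete with a short Python sketch. It is only an illustration of the rule in (5), not code from StressTyp; the function name `count_system_accent` is hypothetical, positions are counted from 1, and the treatment of monosyllables (accent on the only syllable) is an assumption of the sketch.

```python
def count_system_accent(n_syllables: int) -> int:
    """Return the 1-based position of primary accent in a count system.

    Left-headed bisyllabic feet are built from left to right; a stray
    final syllable in an odd-length word stays unfooted. Primary accent
    falls on the head of the rightmost foot.
    """
    if n_syllables < 1:
        raise ValueError("a word needs at least one syllable")
    # Heads of the feet (1 2)(3 4)...: every odd position that still has
    # a following syllable to complete the foot.
    heads = list(range(1, n_syllables, 2))
    if not heads:
        # Assumption: a monosyllable carries the accent itself.
        return 1
    return heads[-1]


for n in (4, 5, 8, 9):
    position = count_system_accent(n)
    print(n, position, n - position)  # word length, accent, syllables after it
```

For even lengths one syllable follows the accent (prefinal), for odd lengths two syllables follow it (preprefinal), which matches (5a) and (5b).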

The oddity of the alternative in (6b) (in which we have suppressed the foot boundaries for clarity) is that rhythm, which we so far attributed to the utterance level, must feed the lexical procedure for primary accent assignment. What is the solution to this paradox? One solution is to assume that the primary accent algorithm can be post-lexical as well as lexical. Being post-lexical, the procedure can be sensitive to rhythmic weight, or not. If it is, we get the ‘count system’ type. Allowing the primary accent procedure to be post-lexical creates additional ambiguity, however. It essentially allows a standard metrical (‘rhythm first’) treatment of all systems (not just count systems) in which primary accent can harmlessly be said to fully depend on independently needed principles for rhythmic structure. This would leave the cases in which primary accent cannot be derived from rhythmic feet as oddities that require some special procedure, a route taken in Hayes (1995). It could be that this is just how matters are. Some systems simply would be open to both a fully post-lexical analysis (in which case we expect no lexical exceptions at all) and a lexical analysis (in which lexical exceptions are possible).

In an attempt to reduce ambiguity, van der Hulst (1997) explores the position that primary accent location can only be lexical (which makes sense if it is a lexical requirement, given that the mental representation of words needs a head). To maintain this position he claims that all count systems are systems in which the primary accent lacks any overt manifestation and/or is in diachronic transition. An additional speculation was that polysynthetic languages, which rank high among the count systems, simply lack a lexical notion of word altogether. It was furthermore suggested that the apparent rhythm-based primary accent was an enhancing effect that results from phrasal accentuation and/or intonation. These ideas call for considerable empirical confirmation and must remain, at this point, theoretically driven speculations.

A different approach is to forget about a lexical–post-lexical divide and derive the different types of systems in terms of different dependency relations between the constraints that govern rhythm and primary accent: (7)

Rhythm ⇒ Primary accent
Primary accent ⇒ Rhythm

This is the approach taken in Optimality Theory (Prince and Smolensky 1993), where ‘dependency’ is called ‘ranking’. This approach, however, fails to explain why rhythm is never lexically determined, while primary accent is, either mostly, or in the form of exceptions (that are almost always present). We must leave the definite solution for count systems for future research.

This concludes our theoretical preamble concerning the nature of word accentual systems and available theories in this area. In the development of our database on accentual systems, we were inevitably inspired by the theoretical considerations presented in this section, which are based on considerable study of both accentual systems and available theories. However, we did make an effort to design the record structure in as theoretically neutral a manner as possible. The resulting record structure is described in section 3. Before we look at that structure, however, let us first present the development of StressTyp from its inception to the present day, in which it has become the leading typological database on stress systems.

2. StressTyp – an overview

2.1. History and current status of StressTyp

Work on StressTyp was initiated by van der Hulst in 1991 as a pilot project of EUROTYP (1990–1994), a project on the typology of European languages, financed by the European Science Foundation (ESF). EUROTYP consisted of 9 Theme Groups, each studying an aspect of European languages from a comparative and typological point of view. The topic of Theme Group 9 (coordinated by van der Hulst) was Word Prosodic Systems.⁵ In the course of the EUROTYP project the question of how to store language data (both original and from written sources) received special attention, and in 1991 it was decided to start two pilot projects, one of which was StressTyp. The idea was to develop an intelligent filing system for data (i.e. rules, generalizations, patterns) on word prosodic systems. The structure of the records was developed by Harry van der Hulst (then at HIL, Leiden), in collaboration with Aditi Lahiri (then at the Max Planck Institute, Nijmegen). Some relevant equipment was made available by a grant from the EUROTYP project and further support from the Faculty of Arts of Leiden University. Kees van der Veer (Max Planck Institute, Nijmegen) implemented the record structure in 4th Dimension for Macintosh. Since then, Rob Goedemans has controlled all aspects of the implementation side of the database.

5 The results of this EUROTYP project have been published in van der Hulst (ed.) (1999).

The first data for StressTyp were extracted from typological studies, or theoretical works that refer to a lot of languages, such as Hyman (1977), Greenberg and Kashube (1976), Hayes (1980/95), Lockwood (1983), Halle and Vergnaud (1987) and so on. Additional data came from the Masters theses of Aglaia Cornelisse (Australian languages) and Bernadette Hendriks (Papuan languages), both supervised by van der Hulst. These data were first combined in so-called Data Entry Sheets (basically a paper-and-pencil version of the record structure) and subsequently Rob Goedemans and Ellis Visch transferred the data into the 4th Dimension database. In this process they checked the information for consistency and correctness by going back to the original sources, and often also to additional theoretical or descriptive studies. At the end of this phase, StressTyp contained 154 languages. After the EUROTYP phase, work on StressTyp was continued by Ellis Visch, Ruben van de Vijver and Rob Goedemans. Other people who contributed their time in this early phase were Simone Langeweg, Bernadette Hendriks and Paulus-Jan Kieviet.⁶ The combined efforts of these people resulted in more complete coverage of the accentual systems of the individual languages, thoroughly checked records, and the addition of accentual information for 116 new languages, bringing the total to 270. From 1997–2001, StressTyp was included in the Prosody of Indonesian Languages (PIL) project coordinated by Vincent van Heuven (Leiden University), during which time the database implementation was improved and the number of languages went up from 270 to 510.
Goedemans checked the content of the old records for errors, checked the primary sources of languages for which the entry was only based on remarks in secondary sources, added examples where these were missing, and updated the language names and affiliations according to the SIL Ethnologue 13th edition standard (Grimes 1996). At this point, only a handful of records for languages in StressTyp are based on secondary sources only.

During the PIL project we were approached by the editors of the World Atlas of Language Structures (WALS), a cooperative effort of the Max Planck Institute for Evolutionary Anthropology in Leipzig and linguists with typological databases from all over the world. Specifically, we were invited to produce a number of maps that would show the distribution of various kinds or aspects of word accentual systems. We produced four such maps (see Goedemans and van der Hulst 2005a–d). StressTyp has benefited greatly from the cooperation with the WALS editors. The WALS project compiled a representative list of the world’s languages and we ensured that all languages in this list were present in our database. To this end we used the list of descriptive sources that the WALS editors provided and added a significant number of the 240 languages with which StressTyp grew during the PIL phase. Finally, StressTyp was expanded with two fields for geographical location, and a procedure was developed to draw distribution maps of StressTyp data with the help of the mapping programme AGIS.

StressTyp is now also included in the Typological Database System (TDS), a joint venture of the Universities of Amsterdam, Leiden, Nijmegen, and Utrecht, which aims at the development of a common query interface for several typological databases. A prototype of the system is up and running (http://languagelink.let.uu.nl/tds; see Dimitriadis et al., this volume).⁷ In the first phase of the TDS project, Rob Goedemans ported StressTyp to an MS Access implementation that follows the original database design, so that we can now serve a much bigger user group. To facilitate a smooth integration in the TDS, examples in IPA were converted to Unicode and the Ethnologue codes were updated to the 15th Edition (Gordon 2005). In section 2.4 we will report additional information on past and current activities involving StressTyp. First we will say a few words about the goals and limitations of our database project.

6 Some of these people worked on StressTyp in the context of other projects that were funded by the Netherlands Foundation of Scientific Research, NWO, the Holland Institute for Generative Linguistics (HIL), the Department of General Linguistics of Leiden University and the Department of General Linguistics of the Free University of Amsterdam.

7 The TDS also contains SyllTyp, another database designed by Harry van der Hulst and Rob Goedemans.

2.2. Goals

One of the main goals of StressTyp is to offer a quick entry to the primary and secondary literature on stress systems of the languages of the world. By primary literature we mean grammars and articles that provide first-hand descriptions of language data, including examples, generalizations and the like. By secondary sources we refer to theoretical works on stress which themselves draw on such primary sources. Critically, by using the term ‘primary sources’, we do not imply that the data stored in StressTyp were collected first hand from, or checked with, native speakers by us. Nothing in the idea behind StressTyp would preclude collecting and storing first-hand data, but we have simply not had the means to do this.

There was no intent to include only a representative sample of the languages of the world (but see below). We recorded information for whatever language we could find accentual information for. We have included all languages for which clear statements on accent location were present in the sources. The record structure allows for much more (see section 3), but additional information was only added if it was readily available in the source, in the hope that the record could be made more complete later on (as was often the case, although all records are still incomplete). As a matter of course, one of the goals of StressTyp is typological in nature. A sufficiently rich database allows for quantitative research and checking of implicational relationships. We can use StressTyp to expose common and uncommon traits of stress patterns, to check the validity of certain claims made in theoretical works and to discover new dependencies between various stress (and perhaps even other) parameters.
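Checking an implicational relationship against a set of records amounts to counting the records that satisfy the antecedent and listing those among them that violate the consequent. The Python sketch below shows the idea under stated assumptions: the function `check_implication` is hypothetical, the records are invented toy data (not StressTyp content), and the field names and value codes merely imitate the Stress Domain and Extrametricality fields. Because records are often incomplete, an unspecified value is represented as `None` and simply fails the antecedent test.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def check_implication(
    records: List[Record],
    antecedent: Callable[[Record], bool],
    consequent: Callable[[Record], bool],
) -> Tuple[int, List[Record]]:
    """Test 'if antecedent then consequent' over a list of records.

    Returns the number of records that satisfy the antecedent and the
    list of counterexamples among them.
    """
    relevant = [r for r in records if antecedent(r)]
    counterexamples = [r for r in relevant if not consequent(r)]
    return len(relevant), counterexamples


# Invented toy records; the languages and values are placeholders only.
toy_records: List[Record] = [
    {"Language": "Toy language A", "Stress Domain": "Right", "Extrametricality": "Right"},
    {"Language": "Toy language B", "Stress Domain": "Unbounded", "Extrametricality": "No"},
    {"Language": "Toy language C", "Stress Domain": "Left", "Extrametricality": None},
    {"Language": "Toy language D", "Stress Domain": "Unbounded", "Extrametricality": "Right"},
]

# Purely illustrative hypothesis: right-edge extrametricality implies a bounded domain.
n_relevant, violations = check_implication(
    toy_records,
    antecedent=lambda r: r.get("Extrametricality") == "Right",
    consequent=lambda r: r.get("Stress Domain") in {"Left", "Right"},
)
print(n_relevant, [r["Language"] for r in violations])
```

Returning the counterexamples rather than a bare yes/no keeps the source of every violation visible, so each one can be checked against the descriptive sources cited in its record.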

2.3. Limitations

The data that StressTyp contains are as trustworthy as the information we found in the sources. If that information is wrong, StressTyp has copied that wrong information. (Of course, whenever we had any reason to believe that the information was wrong we did not copy it.) We have tried to trace the information back to the original descriptive source wherever this was possible. Every record, of course, specifies the sources on which we have based the coding.

Specifying values in database fields necessitates interpreting sometimes very limited information. Although we do not wish to criticise the hard and important work that has been done to obtain first-hand descriptions of languages, and without which an enterprise like StressTyp would not be possible, we do note that the data and generalizations that are provided are often insufficiently precise to conclusively determine the exact nature of the accentual system. This is not surprising given the amazing variety of accentual systems that often differ in very subtle details pertaining to factors that determine syllable weight, rhythm, word length and so on (even ignoring the role of morphology, where this appears to be relevant). This means that the information in StressTyp is very often rather incomplete. The information stored for each language ranges from very elementary statements (like “initial stress”, all further fields unspecified) to fairly detailed specifications for a number of fields. The record allows for information on syllable structure and morphological structure as well. The former is often present in an elementary form, the latter is mostly absent.

Misinterpretations on our part are also possible. The coding system requires interpretation of the sources. In addition, our records cannot always be faithful to any particular source. Where we have consulted more than one source for one language, an attempt has been made to reconcile the sources. In doing so we may have come up with a coding that does not correspond to an actually existing dialect or language variety. Another factor that may have contributed to inconsistencies is that various people have been involved in the coding.

Despite its limitations, our own experience is that StressTyp can be helpful in developing and testing hypotheses by offering data and properties of different languages in an identical format (see section 3). Besides, collecting information on as many languages as possible is simply the only manner to proceed if one wishes to develop general theories. In the ‘old days’, every student of stress would keep records like this in notebooks, or on filing cards. Clearly, with the availability of computers, those efforts are more likely to result in digital storage.

We emphasize that StressTyp cannot be held responsible for providing incorrect or incomplete information. We always encourage those who use StressTyp in publications not only to acknowledge the use of our database, but also to check crucial information in the original descriptive sources or with native speakers. We welcome all corrections and additions, both regarding specific languages and the overall organization.

2.4. Dissemination

To promote the use of StressTyp we published a collection of articles in 1996, some of which are based on StressTyp information, while others describe the database structure and ways to get from descriptions in sources to the StressTyp coding.
In addition, that volume describes some direct numerical results and examples of queries.⁸ A second volume based on StressTyp data is underway. In that book we present part of the data in several geographically oriented appendices, while chapters written by experts on those respective areas comment on generalizations and patterning, provide new insights into the phenomenon of accent, and supplement the StressTyp data with additional languages.⁹ Also in 1996, in a short article in Glot International, we offered basic information about StressTyp.¹⁰

In addition, we have promoted StressTyp on the web. On the StressTyp website we describe in which ways others can use the database, either directly on the web, or by obtaining a copy of the application. Several versions of the database are available:

– Legacy versions: a full 4th Dimension version, for PC and Macintosh, which allows you to use all the standard facilities of the 4th Dimension database package. Also available, with reduced functionality, for users who do not own 4th Dimension.
– Access version: all the original data and a new user interface. Examples in Unicode.
– Online version (at http://stresstyp.leidenuniv.nl/).
– TDS version (at http://languagelink.let.uu.nl/tds/, incorporated in a larger system).

The legacy versions are no longer updated and we thus strongly advise users to obtain the free Access version. The core fields from StressTyp are represented in the online version. One can use this version for relatively simple queries. However, for more advanced work it will be necessary to obtain the stand-alone (Access) application. The same core fields are represented in the TDS. You can query StressTyp fields in the TDS in combination with fields from other databases. Also, the TDS system will guide users who are not too familiar with accentual phonology with ample explanatory notes, links to related fields and the like. By making the database available to other researchers in the ways described above we hope to benefit from their knowledge (or personal databases in whatever form) and cooperation in adding more languages to the system, and improving the quality of information presently contained in StressTyp.

8 Goedemans, van der Hulst and Visch (1996a).
9 Goedemans, van der Hulst and van Zanten (to appear).
10 Goedemans, van der Hulst and Visch (1996b).


2.5. Future developments

StressTyp started with very little funding, and has, since its inception, piggybacked on other projects or on people’s ‘free’ time. At the moment, the two authors of this chapter ensure the continuation of the StressTyp project. We continue to plan extending the content of the database by systematically trying to add information on language families or linguistic areas that are now underrepresented. To do this systematically and rigorously we need a grant that is exclusively dedicated to the development of StressTyp. Efforts to acquire such funding are currently under way.

Long ago, we also planned to extend the scope of the database by means of a questionnaire. This questionnaire was designed and is intended to be filled in by linguists who are familiar with a particular language. If appropriate means come our way, we will develop this questionnaire further and start distributing it. In addition, we would try to systematically solicit (references to) books or articles that contain useful information on stress systems, especially of languages that are not yet contained in the database, and thus build an archive of primary and secondary sources (preferably in machine-readable form). Our most recent attempt (mentioned above) has been to invite a number of phonologists to write a survey of stress systems in various parts of the world to be published in Goedemans, van der Hulst and van Zanten (to appear). These surveys will be used to add new information to StressTyp.

It is inevitable that others have developed databases on word accentual systems (perhaps with less detailed record structures) or that information on stress systems is part of less specialized databases. Such databases could be ‘old fashioned’ paper-and-pencil collections or actual digital collections. We would like to be made aware of such systems and, more importantly, of the availability of the information contained in them. We are interested in collaborating with others if such systems are still accessible. In section 5.3 we discuss two other databases that were constructed while StressTyp was already in existence. Such duplication is unfortunate and we plan to establish collaborations with these projects if they are still active.

It is possible to be more ambitious. Originally we aimed at embedding StressTyp in a network of related databases that would provide information on various aspects of stress research, such as an annotated bibliography of stress (StressBib, currently in progress), a terminological database (StressTer, currently in progress), addresses of linguists who do research on stress (StressRes) and so on. The global indicator for this imaginary network was StressEx (Stress Expert System). Meanwhile we have also developed a database for phonotactic information (SylTyp), so a more general ‘umbrella’ would be Word Prosody Database. This could include a database initiative started by Larry Hyman (XTone: http://xtone.linguistics.berkeley.edu/) which collects information on word tonal patterns. Ultimately, we could then establish a database for Word Phonology if the work by Ian Maddieson on segmental inventories were included.¹¹ Finally, one could imagine combining all these pattern-oriented databases (of which, we are sure, there are more around) with process-oriented databases, i.e. databases that collect information on phonological processes, such as NasDat (http://acvu.nl/staf/wlm.wetzels/pwp_en.htm), ATR/Vowel Harmony (an old, now dormant project of Maarten Mous, Norval Smith and Harry van der Hulst) and others. All this work is, of course, strongly reminiscent of the considerable efforts that were made by Joseph Greenberg and his collaborators in the universals project which, for the most part, led to pencil-and-paper databases, or to digital collections that are now perhaps no longer accessible or usable. (If more general initiatives on a word phonology database could be developed, it would be advisable to try and incorporate this earlier work. Time permitting, the authors of this chapter will make an effort to develop such an initiative.)

11 His work, as reported in WALS, in fact also includes syllable structure.

3. The record structure

It is well known that the designer of a database faces a paradoxical problem (which we call ‘The Database Paradox’). To make the perfect database one needs to know exactly what the extent, structure and nature of the data are, what one is going to be interested in and how the data can best be represented to serve that goal. However, there is a need for a database precisely because all these things are not known in detail. The design of StressTyp, as described in this section, is therefore nothing more than an attempt to capture all the properties of stress (or rather: word accentual) systems in a set of parameters that we now consider to be complete, but which is “open” in the sense that future discoveries may force us to change parameters or add new ones. So far, extraordinary systems that we have encoded have occasionally forced us to expand the set of possible values for some of the fields, but to date we have never encountered a language whose stress system necessitated expansion of the set of fields. This is largely due to the separation of the encodings for primary and secondary accent, which allows us to encode even the most unyielding of the exotic stress patterns.

We must emphasize that, although the parameters used in this database embody a particular view on word accentual structure (as developed in van der Hulst 1984, 1996, to appear), they are presented as purely ‘descriptive’. It is entirely possible to interpret the information without making a commitment to any theory that assumes separation of information on primary accent and rhythmical structure (such as the one described in section 1).

It is unavoidable that the translation of properties of certain systems (as described in the sources) into the format of the database is not always entirely straightforward, because these systems have special or sometimes even conflicting features. Most typically, as we stated in section 2.3, we find that descriptions in primary sources are insufficiently explicit, while the examples given leave open different interpretations. As a result, there are often different possible ways of storing properties of systems. We have consistently stored similar ambiguous systems in a similar fashion. When more detailed information becomes available for such languages, reinterpretation may be necessary, as has proven to be the case for some languages.

As far as the “we cannot know what one is going to be interested in” part of the paradox is concerned, we encoded every aspect of stress systems that we could think of separately and sometimes redundantly. The result is that single parameters of stress systems can be queried straightforwardly, but that combinations of parameters, which typically arise when one wants to find certain prototypical stress systems, sometimes result in complicated multiple queries. In our experience, we have always been able to query StressTyp on every aspect of stress that held our interest. Sometimes it was cumbersome, but never impossible. With the incorporation of StressTyp in the TDS, the complicated queries for prototypical systems have been incorporated as part of the knowledge base. Many sets that needed multiple queries in StressTyp can be generated with a few clicks when one uses the TDS. Future wishes cannot, of course, be foreseen, but we are confident that StressTyp can accommodate them as is, or with only a few minor modifications and/or additions (apart from expansion of StressTyp coverage to other fields, of course).

We will now present the record structure of StressTyp (of which the database structure is a straightforward ‘flat’ table with one record per language). Below, the fields are presented with their respective definitions. Where necessary, a more detailed explanation is given. The presentation exactly follows the order of the fields in the user interface of the StressTyp application. Here we present only the core fields of StressTyp, since only these are relevant for the discussion that follows. The rest of the fields can be found in Appendix B (without description). A much more detailed explanation of

all the fields can be found in the manual (http://stresstyp.leidenuniv.nl) and Goedemans, van der Hulst and Visch (1996c).

Language: The name of the language or dialect in the naming-conventions of SIL Ethnologue 15th ed. If the language is referred to by other names in the literature, these are added to prevent double records in the database.
Dialect of: The name of the mother language of which the language specified in the record field language is a dialect.
Genetic affiliation: The family tree, root first (as in SIL Ethnologue).
Region: All geographical areas in which the language, or dialect of the language in question, is spoken.
Latitude and Longitude: Exact geographical location of (the centre of) the area in which the language is spoken.
Stress Type: Indicates the main stress type of the language by means of a code, identifying the position(s) of main stress. It can either be a simple abbreviation from a list of items, or a combination of abbreviations and one (or more) connective(s). The most common items include I(nitial), S(econd), T(hird), A(ntepenultimate), P(enultimate) and U(ltimate) for stress systems that place primary accent on a fixed syllable in every word of the language. Combinations may take the following shape: U/P for a language that places stress on the final (ultimate) syllable if it is heavy, else on the penult; I;S for a language that generally has fixed initial stress but has a significant group of exceptions that all have stress on the second syllable; F/L for an unbounded stress language that places stress on the first heavy syllable in the word, and on the last syllable if there are no heavies. More codes and examples can be found in appendix A and a more detailed explanation is given in the manual (http://stresstyp.leidenuniv.nl).
Quote: Description of the stress pattern in words, usually taken from the primary source.
Descriptive source: The primary source (in most cases) in which the description of the stress pattern was found.
Theoretical Source: Reference to authors that have analyzed the pattern in a certain theoretical framework.
Examples: Illustrate all aspects of the stress pattern (if available).
Stress Domain: Indicates whether stress is assigned at the Left or Right edge of the word in bounded systems, or whether it can be assigned anywhere in the word in Unbounded systems.
Extrametricality: Specifies whether any phonological unit is ignored in the selection of the domain at the Left or Right edge of the word, or whether Extrametricality does not play a role.


Extrametrical Unit: Specifies what unit is ignored, if any. Possible values: consonant, vowel, mora, syllable, heavy syllable and foot.
Weight: Does syllable weight play a role in the assignment of primary accent, Yes or No?
Stress if Both Heavy: If weight plays a role, heavy syllables (prototypically those with long vowels or codas) in the domain attract stress. If the bisyllabic domain contains only one heavy syllable, it is clear where the stress must go. If the domain contains two heavy syllables we need to specify what happens in this case. This field does exactly that. The options are, naturally, Right or Left.
Stress if Both Light: In languages that use weight this field describes what happens when both syllables in the domain are light. In languages that do not use weight this field describes what happens in all cases. The options are Trochaic (i.e. left headed) and Iambic (i.e. right headed). (See section 4 for motivation of the choice for trochaic and iambic instead of left and right.)
Stress Repair: Yes/No field that indicates whether there is a shift outside the two-syllable stress window, if both syllables inside the window are light.
Degenerate feet: Incomplete feet (monosyllabic in languages that do not use weight, monomoraic in languages that do) can (Yes) or cannot (No) be used in the analysis. If they can be used, single syllables that are left over when the rest of the word is parsed into binary feet (bisyllabic or bimoraic) do get a secondary stress by virtue of this degenerate foot that may be used to parse the syllable. In languages that do not use such feet, these syllables remain unparsed, hence stressless.
Subminimal words: Words that are smaller than a foot exist (Yes) or are prohibited (No) in the language. In quantity-insensitive (QI) languages subminimal words are all monosyllabic words. In quantity-sensitive (QS) languages these are only monosyllabic words that consist of a light syllable.
Rhythm: The language employs a pattern of secondary stresses, Yes or No.
Starting edge: Specifies at which end of the word rhythmic patterning starts. Left, Right, Edge-in (i.e. patterning starts at both edges) or Centrifugal (i.e. rhythm echoes away from the primary stress that is assigned somewhere in the middle of the word).
Extrametricality, Extrametrical Unit and Weight: see above.
Type: Specifies whether Trochaic, Iambic or both types of feet are used in the rhythmic pattern.
Repair: Yes/No field that indicates whether the rhythmic surface patterning may deviate from that specified by the fields.
Iterativity: Field that specifies whether secondary stress is assigned only once, at the starting edge (No), or as many times as possible given the number of syllables in the word that can be parsed into rhythmic feet (Yes).
Rhythm ternary: Specifies whether ternary (trisyllabic) feet with Trochaic or Iambic heads are used in the analysis of secondary stress. Default setting is No ternary rhythm.
Template: The full set of possible syllable types in CV notation.
Obligatory Onsets: All syllables in the language must have onsets (Boolean).
Branching Onsets: Onsets with more than one segment are allowed (Boolean).
Long Vowels: Long vowels occur in the language (Boolean).
Closed Syllables: Closed syllables occur in the language (Boolean).
Geminates: The language uses geminates (Boolean).
Heavy for stress: Specifies exactly which syllables count as heavy in the assignment of primary stress.
Heavy for rhythm: Specifies exactly which syllables count as heavy in the assignment of secondary stress.
Repair: In full text what happens and when, in languages for which either one of the repair fields above has the value Yes.
Remarks: Any remaining remarks about the stress pattern or its encoding.
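The flat, one-record-per-language layout described above can be pictured as a simple record type. What follows is a minimal sketch in Python, not the actual StressTyp schema: the class name, the attribute names, the value spellings and the three placeholder records are all assumptions made for illustration, and only a handful of the core fields are included.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StressRecord:
    """Hypothetical subset of one StressTyp record (one record per language)."""
    language: str
    stress_type: str                            # code such as "I", "U/P" or "F/L"
    stress_domain: Optional[str] = None         # "Left", "Right" or "Unbounded"
    weight: Optional[bool] = None               # does weight affect primary accent?
    stress_if_both_light: Optional[str] = None  # "Trochaic" or "Iambic"
    rhythm: Optional[bool] = None               # are there secondary stresses?
    rhythm_type: Optional[str] = None           # "Trochaic", "Iambic" or "Both"


def select(records: List[StressRecord], **criteria) -> List[StressRecord]:
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(getattr(r, name) == value for name, value in criteria.items())]


# Invented placeholder records, not data taken from StressTyp.
records = [
    StressRecord("Language A", "I", "Left", False, "Trochaic", True, "Trochaic"),
    StressRecord("Language B", "U/P", "Right", True, "Trochaic", False, None),
    StressRecord("Language C", "U", "Right", False, "Iambic", True, "Trochaic"),
]

# A single-parameter query and a combined query over two parameters.
trochaic_default = select(records, stress_if_both_light="Trochaic")
combined = select(records, stress_if_both_light="Iambic", rhythm_type="Trochaic")
print([r.language for r in trochaic_default])  # ['Language A', 'Language B']
print([r.language for r in combined])          # ['Language C']
```

Storing each parameter in its own field is what makes single-parameter queries of this kind trivial; combinations of parameters simply add further criteria.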

4. What can one do with StressTyp?

The goal of the coding system has been to make it possible to search through the database for the occurrence of quite specific properties. With the search facilities of Access or the web interface, StressTyp can be instrumental in testing and developing hypotheses (given that the limitations of StressTyp have been taken into account; cf. section 2.3). Goedemans and van der Hulst (2005a–d), van Zanten and Goedemans (2007) and Goedemans (to appear) all use StressTyp to present data on stress patterns in various ways: primary data on the occurrence of the most common types of stress systems, exponents of syllable weight, secondary stress types in the languages of the world, the geographical distribution of languages that have certain specified characteristics. Moreover, these articles, especially Goedemans (to appear), contain examples of phonological claims that can be put to the test, using StressTyp to quantify exactly for what percentage of the world’s languages supposedly universal claims hold true. We will not repeat those exercises here, but rather present four new examples in the same fashion as the ones presented in the aforementioned articles.[12]

The most direct way in which we can present data from StressTyp is in simple graphs that show how many languages in the sample of 510 exhibit certain stress patterns. In the past, we have generally had a rather global outlook when we presented such data. An article in which one presents complete overviews of the types of stress patterns that occur in the world’s languages is necessarily coarse, simply because there are too many possibilities. The sheer number of possible types has prevented us from revealing some of the finer grained distinctions in earlier publications. One subset of languages that has suffered from this is that of the so-called unbounded languages. As explained above, unbounded languages are those in which primary accent can, in principle, be assigned anywhere in the word, no matter how long it is. Languages which are traditionally called unbounded are always quantity-sensitive (see section 1 for discussion of applying this notion to quantity-insensitive languages in so-called minimal systems, 3d), and the decision where to place main stress in the word is usually based on syllable weight (and in some cases diacritic weight or prominence due to pitch). This type comes in four basic flavours, depending on which of the heavy syllables in the word receives primary accent and what happens when there are no heavy syllables: (8)

F/F  stress the first heavy syllable, and in case there are none, stress the first syllable
F/L  stress the first heavy syllable, and in case there are none, stress the last syllable
L/F  stress the last heavy syllable, and in case there are none, stress the first syllable
L/L  stress the last heavy syllable, and in case there are none, stress the last syllable
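The four rules in (8) amount to a two-step decision, which can be sketched as a small Python function. This is an illustration only: the function name and the representation of a word as a list of booleans (True for a heavy syllable) are assumptions, and lexical marking and the ‘Irr’ patterns are not modelled.

```python
def unbounded_stress(weights, code):
    """Return the 0-based index of the syllable that receives primary accent.

    weights: list of booleans, one per syllable, True meaning heavy
    code:    one of "F/F", "F/L", "L/F", "L/L" as defined in (8)
    """
    if code not in ("F/F", "F/L", "L/F", "L/L"):
        raise ValueError("unknown code: " + code)
    if not weights:
        raise ValueError("a word needs at least one syllable")
    heavy_choice, default_choice = code.split("/")
    heavies = [i for i, is_heavy in enumerate(weights) if is_heavy]
    if heavies:
        # First or last heavy syllable, depending on the part before the slash.
        return heavies[0] if heavy_choice == "F" else heavies[-1]
    # No heavy syllable: first or last syllable, depending on the second part.
    return 0 if default_choice == "F" else len(weights) - 1


word = [False, True, False, True]        # light heavy light heavy
print(unbounded_stress(word, "F/L"))     # 1
print(unbounded_stress(word, "L/F"))     # 3
print(unbounded_stress([False] * 3, "F/L"))  # 2
print(unbounded_stress([False] * 3, "L/F"))  # 0
```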

To this we add languages in which stress is lexically marked, can occur anywhere in the word, but does not adhere to one of the four rules above (usually that means there is only one syllable in the word that is lexically marked). The category, containing languages like Kewa, is labelled ‘Lex’. Languages like Russian, in which the stress rule is also sensitive to lexical marking, but which display patterns like ‘stress the first lexically marked syllable or else the first’ are incorporated in the appropriate category (lexical marking being, in our view, just another instance of weight). A final category of unbounded languages groups unique, fairly exotic stress patterns, usually variations on one of the four types listed above, under the label ‘Irr’ (for Irregular).[13] In previous publications we have lumped the unbounded systems together in one group to be able to show all possible quantity-sensitive stress systems in one simple graph. We will now present the subsets in detail. When we consult StressTyp to get the actual numbers we arrive at the following result.

[12] Readers who are interested in more examples quantifying basic patterns and claims are referred to Goedemans and van der Hulst (2005a–d), van Zanten and Goedemans (to appear) and Goedemans (to appear).

Figure 1. Number of languages for each of the six categories of unbounded systems.

Not counting the ‘Lex’ systems, for which it is anyone’s guess what their preferred location for stress is, we observe that the general preference for bounded stress systems, namely to use left headed constituents (see Goedemans, to appear), is also reflected within the unbounded category. Of the 41 unbounded languages in the groups F/F, F/L, L/F and L/L, 27 place stress on the leftmost heavy syllable, and 29 of the same 41 languages place stress on the first syllable when no heavy syllable is present in the word.

[13] In earlier publications we have commented on the fact that we also consider the count systems to be of the unbounded variety. We leave them aside here, and present only those systems that are uncontroversially unbounded, whatever theory one adheres to.

A second usage of StressTyp is to look directly at the values of the parameters we have incorporated and draw a map showing the way these values are spread among the languages around the globe. Let us continue in the same vein in our second exercise. We will look at bounded systems and determine the default value for stress placement, i.e. the value that indicates the edge choice when the domain does not contain a heavy syllable, or when the language is not quantity-sensitive to begin with. In other words, we will determine at what edge of the domain primary accent ends up when no weight is in play and see if any interesting geographical clustering appears. As was noted above, the flavours we have are trochaic (head on the left) and iambic (head on the right). We can query StressTyp for systems that are iambic or trochaic in this respect and feed the results, together with the geographical coordinates of the languages in the result set, to a mapping program. When we do that we arrive at the following map.

Figure 2. Geographical spread of languages that show Trochaic (black dots) or Iambic (white squares) primary accent patterning in the default case.

We have produced maps like the one in Figure 2 before (WALS maps 14–17, see Goedemans and van der Hulst 2005a–d) and on one of these (map 17) we presented the languages in StressTyp that show trochaic or iambic patterning for secondary accent. With respect to that map we noted the following tendencies:

(9) (i) Iambic rhythm occurs mostly in North and South America.
    (ii) South America and Australia always seem to have clear rhythmic patterns.
    (iii) Africa, on the other hand, shows little evidence for rhythmic patterns.

The map presented here shows that with respect to default edges for primary accent:

(10) (i) Left is the default edge for most languages, on a par with the preference for trochaic rhythm that most languages show.
     (ii) In Australian, Austronesian, Indian and European languages the trochaic option is chosen in an overwhelming majority of languages.

When we compare the observations, we note that:

(11) (i) For lack of African languages in StressTyp that use rhythm, no preference for either iambic or trochaic feet could be detected on the rhythm map. The current map clearly shows that African languages, in keeping with the general trend, have a clear preference for the trochaic default for main stress.
     (ii) Where iambic rhythm was largely confined to the Americas, the usage of the right edge of the domain as the default location for primary accent is more widespread.

We take the languages’ preference for trochaic or iambic rhythm, as presented on the WALS map, to reflect default preference for either left or right headed constituents, an assumption we think is hard to contest. In StressTyp, we have made clear the affinity between rhythmic feet and the default location of primary stress through the values for the Stress if Both Light parameter, which are not Left and Right as one might expect but rather Trochaic and Iambic (see section 3). In most other theories the two parameters are even inseparable. In this light we can state some extended generalizations based on a combination of the data on these two maps.

(12) (i) Left-headed constituents are the ones most commonly used in word accentual systems (on average, about 81% of the languages use them either as rhythmic feet or as the default for primary accent).
     (ii) Australian, Austronesian, Indian, African and European languages clearly prefer left-headed constituents.
     (iii) North-American languages show no preference. In South-America the split is roughly between Andean, left-headed, non-Andean (perhaps Amazonian), right-headed preference.


A question that comes to mind immediately when we consider these maps is whether there are any mismatches. Are there languages that use trochaic feet in the assignment of secondary accent, and yet assign primary accent with an iamb in the default case? Do languages exist that use iambic feet for rhythm but which use a trochee to place default primary accent on the left-hand side in the domain? If these mismatches exist, then what percentage of the total sample do they constitute? An example of the third type of query that we can execute in StressTyp yields the answer. As in most databases, we can use combined queries, designed to list for which records (languages) in the database combinations of two or more parameter values hold true, and thus, to discover whether dependencies between these parameters exist. In our case, the dependency is quite clear. Prototypical trochaic languages use trochaic feet for rhythm and assign primary accent at the left [within the accentual domain, not necessarily the left side of the word], whereas iambic languages do the opposite. This fact is exploited in standard metrical theory, which regards the foot on which main stress falls as a rhythmic foot, either on the left or right side of the word, instead of constructing a separate domain for primary stress with its own rules. In StressTyp parameters, the dependency is found by comparing the fields Stress if Both Light and Rhythm Type. Standard metrical theory predicts that these two fields should always have the same value, either trochaic or iambic. What we are looking for now is whether languages exist that defy this common pattern. The table below shows the results.

Table 1. Default location of primary accent, broken down by rhythmic foot type.

                 Stress if Both Light
Rhythm Type      Trochaic    Iambic
Trochaic         133         2
Iambic           9           15
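The share of mismatches in Table 1 can be verified with a few lines of Python. The sketch below only redoes the arithmetic on the four published counts; the variable names are illustrative, and the assignment of rows to Rhythm Type and columns to Stress if Both Light is a reading of the table layout that does not affect the total, since both off-diagonal cells count as mismatches either way.

```python
# Counts from Table 1, keyed by (Rhythm Type, Stress if Both Light).
table1 = {
    ("Trochaic", "Trochaic"): 133,
    ("Trochaic", "Iambic"): 2,
    ("Iambic", "Trochaic"): 9,
    ("Iambic", "Iambic"): 15,
}

total = sum(table1.values())
# A mismatch is any language whose two fields carry different values.
mismatches = sum(count for (rhythm, default), count in table1.items()
                 if rhythm != default)
share = 100 * mismatches / total

print(f"{mismatches} of {total} languages mismatch ({share:.1f}%)")
# 11 of 159 languages mismatch (6.9%)
```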

With respect to the total number of languages for which we have clear information on both the foot type for rhythm and the default edge for primary accent, the number of mismatches is not impressive (only 7%), but cannot be dismissed as insignificant either. As noted in section 1, such mismatches are problematic in standard metrical theory (and its offshoots). In a theory that separates the treatment of primary and secondary stress, these are logical possibilities. As such the mismatches provide support for the validity of the separation between primary and secondary accent assignment. One

might object that a theory that separates primary and secondary accent predicts a random correlation between the values for both fields queried here and, strictly speaking, that is true. We believe that there are two reasons why most primary accent patterns mirror rhythm (or vice versa). Firstly, it is reasonable to assume that primary accent locations are grounded in rhythmic patterns historically; thus they would, in principle, start out mirroring rhythm. Secondly, the correlation will remain stable, even after the two aspects of word accentuation patterns have been separated into a lexical (primary accent) and a post-lexical part (rhythm) because we expect rhythmic patterns to be constructed avoiding clashes with the primary accent; thus we expect that, for example, a penultimate primary accent will cause a trochaic rhythm to the extent that rhythmic patterns tend to ‘echo’ away from the primary accent site. In this sense, rhythm will tend to mirror primary accent.

As a final exercise, let us try a cross database query. For that, we need to turn to the TDS, which allows us to define integrated queries using fields from multiple databases (a second example of how to use the TDS for such a query is given in Dimitriadis et al., this volume). Let us stay in the realm of heads and edges and compare the Stress if Both Light parameter to a field from another database that has nothing to do with stress, but which does relate to headedness. The Typological Database Nijmegen (http://www.hum.uva.nl/tds) contains fields in which information on basic word order is stored. It has been claimed that headedness in syntax may be correlated with headedness in phonology. If we query the TDS for the full matrix of possible combinations of default location for primary stress and word order, we obtain the following result.

Table 2. Default location of primary accent, broken down by basic word order types.

                 Stress if Both Light
Word order       Trochaic    Iambic
SVO              19          8
OVS              0           0
VSO              11          2
VOS              0           0
SOV              24          5
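As a quick arithmetic check on Table 2, the counts can be collapsed by verb position. The following Python sketch is illustrative only: the names are invented for the example, and treating SOV as the sole verb-final order relies on the OVS row being empty.

```python
# Counts from Table 2: word order -> Stress if Both Light value -> languages.
table2 = {
    "SVO": {"Trochaic": 19, "Iambic": 8},
    "OVS": {"Trochaic": 0, "Iambic": 0},
    "VSO": {"Trochaic": 11, "Iambic": 2},
    "VOS": {"Trochaic": 0, "Iambic": 0},
    "SOV": {"Trochaic": 24, "Iambic": 5},
}
VERB_FINAL = {"SOV"}  # assumption: the only attested order with a final verb


def verb_final_share(default):
    """Return (verb-final count, total count, percentage) for one default."""
    total = sum(row[default] for row in table2.values())
    final = sum(row[default] for order, row in table2.items()
                if order in VERB_FINAL)
    return final, total, 100 * final / total


for default in ("Trochaic", "Iambic"):
    final, total, pct = verb_final_share(default)
    print(f"{default}: {final} of {total} verb-final ({pct:.1f}%)")
# Trochaic: 24 of 54 verb-final (44.4%)
# Iambic: 5 of 15 verb-final (33.3%)
```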

Concentrating on the order of Object and Verb we conflate the values of VSO and SVO (ignoring the empty OVS and VOS categories). Unfortunately, there is no clear correlation to be discovered, though the percentage of trochaic languages that have a verb-final predicate (24 of 54, about 44%) is higher than that same percentage for the iambic languages (5 of 15, about 33%). We hasten to add, however, that we must not draw iron-clad conclusions from such a small sample. We merely present this little exercise as an example of the things one can do with cross database queries.

This concludes our tour of the typological possibilities that StressTyp offers. Other types of queries will be possible, but these will most likely be similar to one of the four types presented above. One such type involves cross-database queries of multiple databases that all store data on stress. These queries are potentially very interesting, since they hold the key to confirmation of conclusions drawn from StressTyp data through independent sources, but also to expansion of the sample on which such typological conclusions can be drawn. Therefore we devote the next section to a brief comparison of three databases on stress.

5. Comparison to other databases

Although (as far as we know) StressTyp was the first initiative to systematically store information on stress patterns electronically, and is at present by far the most complete one, both in coverage of aspects of the phenomenon and in number of languages, it is not the only one. There are two other computerized databases on stress that we are aware of. We discuss them below.

5.1. Bailey’s Stress System Database

The first stress database to appear beside StressTyp was developed by Todd Mark Bailey, who derived his database from the data he collected for his dissertation: ‘Nonmetrical constraints on stress’ (Bailey 1995). Information about his “Stress System Database” can be found at http://www.cf.ac.uk/psych/subsites/ssdb/. It is interesting to note that Bailey’s theoretical perspective on stress systems incorporates the idea of viewing primary and secondary accent as separate phenomena, deserving different theoretical mechanisms. In particular, he shares with van der Hulst (1984 and later publications) the idea that primary accent assignment is not based on rhythm, with the possible exception of count systems. In his database, each record consists of five fields:

– Long Word SPC (Syllable Priority Code): A code that captures the basic pattern of the language
– Short Word SPC: A code that captures the pattern of words that are shorter than the maximal size of the ‘stress window’
– Language: One or more language names
– References: The source(s) of the code
– Comments: Miscellaneous comments about the nature of syllable weight, or the foot type of ‘bottom-up stress’ systems (when rhythm crucially feeds primary accent; cf. section 1), lexical exceptions, or the presence of special features such as ternary rhythm.

Bailey’s coding system is based on the idea that in locating primary stress certain syllables get priority over others. For example, in Hopi primary stress falls on the first syllable if heavy, otherwise on the second syllable. Bailey encodes this as follows:

(13) Hopi: 12/2

Here there are two priority codes separated by ‘/’. The first code specifies the relative priority among heavy syllables saying that syllable 1 (the first syllable from the left; cf. below) takes priority over syllable 2, i.e. a heavy second syllable is stressed only if the first is not heavy. The second term specifies the case in which neither the first nor the second syllable is heavy. In dealing with a system in which the same calculation occurs on the right side of the word (stress the last if heavy, otherwise the penult), Bailey uses the same coding, differentiating the two by adding ‘R’ (right side of the word) or ‘L’ (left side):

(14)             Bailey Code    StressTyp Code
     Hopi:       12/2L          I/S
     Hawaiian:   12/2R          U/P
     (Read: “1 if heavy else 2 if heavy else 2”, StressTyp codes given for reference)


We wonder why Bailey did not use ‘1/2’ instead (“1 if heavy else 2”). This would be almost equivalent to the StressTyp code which builds the L versus R option into the syllable count: 1/2L = I/S. If stress is weight-insensitive the codes are:

(15) Latvian:    1L    ST: I
     Cambodian:  1R    ST: U
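To make the reading of such priority codes concrete, here is a minimal Python sketch of an interpreter for the simple codes shown in (14) and (15). It is not Bailey's own implementation: the function name, the integer weight scale (0 for light, higher numbers for heavier classes), the rule that each additional slash-separated term addresses the next lower weight class, and the fallback for words that are too short are all assumptions. Codes that use symbols other than digits, a slash and a final L or R are outside its scope.

```python
def spc_stress(code, weights):
    """Return the 0-based index (from the left) of the primary-stressed syllable.

    code:    a simple priority code such as "12/2L", "12/2R" or "1L"
    weights: list of ints, left to right; 0 = light, larger = heavier
    """
    code = code.replace(" ", "")
    if len(code) < 2 or code[-1] not in ("L", "R"):
        raise ValueError("code must end in L or R: " + repr(code))
    if not weights:
        raise ValueError("a word needs at least one syllable")
    edge, terms = code[-1], code[:-1].split("/")
    n = len(weights)

    def index_of(position):
        # Positions count from 1, starting at the edge named in the code.
        i = position - 1 if edge == "L" else n - position
        return i if 0 <= i < n else None

    top = len(terms) - 1  # weight class addressed by the first term
    for k, term in enumerate(terms):
        is_default = k == len(terms) - 1  # last term applies regardless of weight
        for digit in term:
            i = index_of(int(digit))
            if i is None:
                continue  # the word is too short to have this position
            if is_default or weights[i] == top - k:
                return i
    # Assumption: if no listed position exists, stress the edge syllable.
    return 0 if edge == "L" else n - 1


print(spc_stress("12/2L", [0, 1, 0]))  # 1  (second syllable heavy, first light)
print(spc_stress("12/2L", [0, 0, 0]))  # 1  (no heavy syllable: default is 2)
print(spc_stress("12/2R", [0, 0, 1]))  # 2  (final syllable heavy)
print(spc_stress("12/2R", [0, 0, 0]))  # 1  (default is the penult)
print(spc_stress("1R", [0, 0, 0]))     # 2  (fixed final stress)
```

Bailey's separate Short Word SPC field is not modelled; the edge fallback merely keeps the sketch total for very short words.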

Stress systems with multi-valued (non-binary) weight distinctions challenge any coding system. In Pirahã, for example, there are five weight levels. The heaviest syllable in a right-edge three-syllable window receives primary stress. In case of a tie, the rightmost heaviest syllable wins. If there are no heavies, the rightmost (final) syllable is stressed. For cases like Pirahã Bailey’s coding assumes the following form:

(16) Pirahã: 123/123/123/123/1R    ST: Pirahã A-U/U
     (Read: 1, else 2, else 3 in the heaviest weight class / 1, else 2, else 3 in the next heaviest weight class / 1, else 2, else 3 in the next heaviest weight class / 1, else 2, else 3 in the next heaviest weight class / 1 if no heavies. StressTyp code read: Antepenult if heavier than the two syllables in the window, else Ultimate if it is equal in weight to the penult, else select the heaviest of the two syllables in the domain, Ultimate if all syllables are light.)

Bailey’s code seems a bit redundant, since the principle is the same whatever the weight class is. However, if the code is meant to be an algorithm we would need Bailey’s level of explicitness. StressTyp does not aim at this level of explicitness in the ‘stress type’ field because a fully explicit coding is provided in another set of fields. For unbounded systems, Bailey uses the numbers 1 through 4 to refer to the edge at which the accent is assigned and 9 through 6 for the syllables on the other side of the word, irrespective of the number of syllables:

(17) σ σ σ σ ...... σ σ σ σ
     9 8 7 6 ...... 4 3 2 1

Thus he assumes (for practical purposes, not as a theoretical claim) that even in unbounded systems, primary accent is assigned within a four syllable window at one of the edges.

(18) First/First system: 12..89/1L    ST: F/F
     (read: 1 if heavy, else 2 if heavy … else penult if heavy, else final if heavy / else first)
     First/Last system:  12..89/9L    ST: F/L
     (read: 1 if heavy, else 2 if heavy … else penult if heavy, else final if heavy / else last)

The stress system of Hindi poses a challenge: according to Bailey, primary stress is on the rightmost superheavy (excluding the final syllable, except when this is the only superheavy in the word), else on the rightmost heavy (excluding the final syllable), else on the initial syllable:

(19) Hindi: 23..891/23..899/9R    ST: Hindi U%A
     (Read: penult, else antepenult, … else third, else second, else last in heaviest weight class / penult, else antepenult, … else third, else second in next heaviest weight class / else first)

Here we witness a difference in interpretation of the sources or use of different sources. Whereas Bailey’s characterization of the system is that it is unbounded, StressTyp analyzes the Hindi system as a bounded system. This is not the place to resolve the true nature of the system, which, as is widely admitted (cf. Hayes 1995), is both complex and possibly dialectally heterogeneous. What we learn here is that a comparison of different databases reveals areas that call for a closer look.

Finally, we discuss Wargamay as an example of a count system in which the first syllable is stressed in words with an even number of syllables, and the second in words with an odd number of syllables. Bailey assumes that the word is parsed in trochaic feet from right to left and that the final foot head receives primary accent (single leftover syllables are not parsed). The priority code must then take rhythmic strength to be a type of weight. ‘@s’ means ‘weight is secondary stress’ (this approach is rather similar to the one we promoted in Goedemans, van der Hulst and Visch 1996).

(20) Wargamay: 12@sL    ST: F(CNT)
     (Read: “the first if a foot head, the second if a foot head”)

In conclusion, we see here two approaches that differ primarily in the depth of coding. Whereas Bailey tries to capture the nature of stress systems in a single code, StressTyp, in addition to using a comparable code, provides a more detailed and fractured analysis of the system. In addition, StressTyp offers much greater overall detail in many other areas. This is not a criticism of Bailey’s work which, as we assume, served his goals quite well. StressTyp is more ambitious in that it tries to be a tool for all researchers working on stress phenomena. It thus must be richer, more redundant and more explicit.

5.2. Gordon’s database of QI stress systems

The second database that we discuss here was created by Matthew Gordon in the context of an inquiry into weight-insensitive systems (Gordon 2002). Hence, his database contains information only about systems of this type. His codes are:

(21) Initial (+ antepenult, + penult)
     Peninitial
     Antepenult (+ initial)
     Penult (+ initial)
     Final (+ initial)

The information between parentheses refers to the location of secondary stress, which is encoded if mentioned in his sources. Presumably, any combination of a primary accent and secondary accent code is, in principle, possible. In addition, he uses the following code for systems that have no primary accent and only rhythm.

(22) (e,l) even syllables from left to right (first beat is second syllable = S)
     (e,r) even syllables from right to left (first beat is penultimate syllable = P)
     (o,l) odd syllables from left to right (first beat is initial syllable = I)
     (o,r) odd syllables from right to left (first beat is ultimate syllable = U)

In parentheses we added the location of the first beat that is assigned, and ‘translated’ this into the primary accent position to facilitate comparison to the coding in StressTyp. Gordon’s database was constructed for an even more specific purpose than Bailey’s database. It is only to be expected, then, that his dataset and his coding system are much narrower than those of StressTyp and Bailey’s database. Again, this is not a criticism; it is merely a consequence of the different goals for which databases are built.
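The four rhythm-only codes in (22) can be read as a recipe for placing beats. The Python sketch below spells that reading out; the function name and the convention of returning 1-based positions counted from the left are assumptions made for illustration, and nothing here is taken from Gordon's own materials.

```python
def rhythm_beats(parity, edge, n_syllables):
    """Return the stressed positions (1-based, counted from the left).

    parity: "e" for even syllables, "o" for odd syllables
    edge:   "l" to count from the left, "r" to count from the right
    """
    if parity not in ("e", "o") or edge not in ("l", "r"):
        raise ValueError("expected a code such as ('e', 'l')")
    start = 2 if parity == "e" else 1
    # Positions as counted from the chosen edge: 2, 4, 6, ... or 1, 3, 5, ...
    counted = range(start, n_syllables + 1, 2)
    if edge == "l":
        return list(counted)
    # Convert right-edge counts into left-to-right positions.
    return sorted(n_syllables + 1 - p for p in counted)


print(rhythm_beats("e", "l", 5))  # [2, 4]     first beat on the second syllable
print(rhythm_beats("e", "r", 5))  # [2, 4]     first beat on the penult (4)
print(rhythm_beats("o", "l", 5))  # [1, 3, 5]  first beat on the initial syllable
print(rhythm_beats("o", "r", 4))  # [2, 4]     first beat on the final syllable (4)
```

For right-edge codes the first beat assigned is the largest position in the returned list, which is why (e,r) corresponds to P and (o,r) to U in the StressTyp translation.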

5.3. Comparison

In this section, we will make some explicit comparisons between the three databases. Let us first present a few numbers, determining the overlap between the three databases. Investigation on the basis of automated name comparison, with reference to SIL code and sources in cases of doubt, allowed us to create a relational combined database in which language names together with elementary stress type encoding for the three databases are stored. Languages that occur in more than one database are linked.[14] Simple queries now tell us that 160 of the 197 languages in the Stress System Database (SSD) also occur in StressTyp, while the SSD contains 37 languages that are not present in StressTyp. Gordon’s database contains 273 languages. The overlap with StressTyp is 123 languages, so that there are no fewer than 150 languages that do not occur in StressTyp.[15] The overlap between Gordon and Bailey is 62 languages, while the overlap between these two and StressTyp is 51 languages. All of these are weight-insensitive (because no other type is represented in Gordon’s database). We now look at the ways in which these 51 languages are encoded in the three databases. In the Gordon column we added the equivalent StressTyp code for the ‘(x,y)’ codes:

[14] We thank Menzo Windhouwer who wrote the scripts that generated our comparison lists.

[15] We have to be careful not to count languages twice in these comparisons. In four cases for instance (Sierra Miwok, Turkish, Mam, Dakota) a single SSD record corresponds to two StressTyp records (if two varieties [e.g. dialects] of a language appear in StressTyp and we cannot decide to which of these the SSD record relates). In this case a comparison list will show 164 rows, but in these rows the SSD languages above appear twice. When we compare StressTyp and SSD these should, of course, be counted only once. In the comparison of StressTyp to Gordon’s database a similar reduction has been applied with respect to the records for Tsaxur, Aramaic, Hebrew and Basque (the latter even corresponds to nine StressTyp records).


TOTAL CODE INFO

StressTyp_Name                            StressTyp         Gordon     Bailey long word   Bailey short word
Armenian                                  L/F (L but one)   U          1R
Cavineña                                  P                 (e,r) P    2R
Cayubaba; Cayuvava                        A (NMS)           A          3R (3+)            1L (2–)
Chamorro                                  P/A               P          2R
Apoze                                     A                 A          3R (3+)            1L (2–)
Baadi; Bardi; Badimaya                    I                 (o,l) I    1L
Banggarla; Parnkalla                      A                 A          3R (3+)            1L (2–)
Cahuilla, (Desert and Mountain Dialects)  I                 I          1L
Czech                                     I                 (o,l) I    1L
Dakota; Sioux                             S                 S          2L
Dehu; Lifu                                I                 (o,l) I    1L
Dieri; Diyari                             I                 (o,l) I    1L
Djingili; Tjingili                        P                 (e,r) P    2R
Emae; Mae                                 A                 A          3R (3+)            1L (2–)
French                                    U/P               U          1R
French                                    U/P               U          1R
Garawa                                    I                 I          1L
Georgian                                  A;I (NMS)         A          1L
Mansi; Vogul                              I                 (o,l) I    1L
Paiute, Southern                          S                 P          2L
Tübatulabal                               U (NMS)           U          1R
Hebrew, Tiberian                          U/P               U          12/21/1R
Hungarian                                 I                 (o,l) I    1L
Icelandic                                 I                 (o,l) I    1L
Karelian                                  I                 (o,l) I    1L
Mullukmulluk; MalakMalak                  F (CNT)           (e,r) F    12@s L (3+)        1L (3–)
Kuku-Yalanji                              I                 I          1L
Latvian                                   I                 I          1L
Lezgi; Lezgian; Kiurintsy                 I/I (IRR)         S          1R
Liv; Livonian                             I                 (o,l) I    1L
Macedonian                                A                 A          3R (3+)            1L (2–)
Mapuche; Araucanian; Aucan                S                 (e,l) S    2L (3+)            1L (2)
Maranunggu                                I                 (o,l) I    1L
Meso Grande Diegueño (*)                  U/P               U          12/2R
Nengone                                   P                 (e,r) P    2R
Ngalkbun; Dalabon; Boun                   P;I               (o,l) I    1L
Ono                                       I                 (o,l) I    1L
Pintupi-Luritja                           I                 (o,l) I    1L
Piro; Yine                                P                 P          2R
Pitta pitta; Bidhbidha                    I                 (o,l) I    1L
Polish                                    P                 P          2R
Ruija                                     I                 (o,l) I    1L
Selepet                                   I                 (o,l) I    1L
Sorbian                                   I                 I          1L
Swahili                                   P                 P          2R
Tajik                                     U                 U          1R
Uzbek, Northern                           U                 U          1R
Vod; Votic                                I                 (o,l) I    1L
Warao; Guarao                             P                 (e,r) P    2R
Weri; Were                                U                 (o,r) U    1R
Wongkumara; Wankumara                     I                 (o,l) I    1L

We observe that the codes presented here match to a high degree, which means either that all three databases made the same kinds of mistakes or, more likely, that there are very few, if any, mistakes in this sample. Alternatively, we can look at the relative number of QI systems per type in the three databases and see whether there are significant differences in the percentages.

Weight-Insensitive Systems   Bailey                      Gordon          StressTyp
I                            40 (= 43.5 %)               103 (= 38 %)    92 (= 33 %)
S                            3 (= 3 %)                   15 (= 6 %)      16 (= 5.5 %)
T                            0                           0               1 (= 0.5 %)
A                            8 (= 9 %)                   8 (= 3 %)       12 (= 4 %)
P                            16 (= 17.5 %)               77 (= 28 %)     110 (= 39 %)
U                            25 (= 27 %)                 69 (= 25 %)     50 (= 18 %)
Total                        92 (= 47 % of total 197)    272             281 (= 55 % of total 510)
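The percentages in this table follow directly from the counts, and recomputing them is a quick consistency check. The sketch below is illustrative only: the dictionary layout and the helper name `shares` are our own choices, while the counts are copied from the table. The one-decimal values it prints may differ slightly from the rounded figures given above.

```python
# Counts of weight-insensitive (QI) systems per type, copied from the table.
counts = {
    "Bailey": {"I": 40, "S": 3, "T": 0, "A": 8, "P": 16, "U": 25},
    "Gordon": {"I": 103, "S": 15, "T": 0, "A": 8, "P": 77, "U": 69},
    "StressTyp": {"I": 92, "S": 16, "T": 1, "A": 12, "P": 110, "U": 50},
}


def shares(type_counts: dict[str, int]) -> dict[str, float]:
    """Percentage of each stress type within one database's QI sample."""
    total = sum(type_counts.values())
    return {code: 100.0 * n / total for code, n in type_counts.items()}


for name, type_counts in counts.items():
    total = sum(type_counts.values())
    row = "  ".join(f"{code}={pct:.1f}%" for code, pct in shares(type_counts).items())
    print(f"{name:<10} n={total:<4} {row}")
```

Keeping the raw counts next to the derived shares makes it easy to rerun the comparison on a rebalanced sample, for instance after removing a group of languages from one database.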

The global patterns are comparable. However, a few striking differences need an explanation. Since the samples in Gordon’s database and StressTyp are almost identical in size and quite sizeable, let us concentrate on these two first; samples of this size should help to reveal real tendencies. The one major difference between Gordon’s database and StressTyp is the size of the P category. In StressTyp it is much too large, a feature we have commented on before. It is more than likely due to the fact that we added many Austronesian languages during the Prosody of Indonesian Languages project. Since these predominantly have quantity-insensitive stress systems with primary stress on the penultimate syllable (see van Zanten, Stoel and Remijssen, to appear), this category is overrepresented in StressTyp. Should we reduce the number of Austronesian languages in the sample, we are sure that the overall percentages would come to resemble quite closely those we found for the Gordon database, since the percentage for P would go down and those of the other categories would go up.

The smallest database in this overview is Bailey’s, and it is therefore more susceptible to the influence of imbalances in the sample (like the one we noted above for StressTyp). Compared with the other two, I, A and U are overrepresented in it, and P is underrepresented. Careful analysis of the Bailey sample may reveal what causes this. If we assume that this database is a ‘little off’ because of the relatively small sample size, and that the other two reflect more accurately what is going on in the languages of the world, we may, in any case, conclude that languages prefer initial stress, with prefinal stress a good second and final stress not far behind. Antepenultimate stress and stress on the second syllable are relatively uncommon (only 1 out of 20 languages for both categories), while stress on the third syllable is virtually nonexistent.

Finally, we can compare the numbers of weight-sensitive systems in the SSD and StressTyp:

Weight-Sensitive Systems      Bailey                   StressTyp
I or S                        14 (= 14 %)              37 (= 20 %)
I, S or T – S or T            0                        2 (= 1 %)
P or U                        22 (= 21 %)              65 (= 35 %)
U or P – P or A (or pre-A)    23 (= 22 %)              27 (= 15 %)
Unbounded                     44 (= 43 %)              54 (= 29 %)
Total                         103 (= 52 % of total)    185 (= 36 % of total)

Striking differences here are the relatively high number of unbounded languages in the SSD, and the fact that in the SSD there are as many systems that have stress on one of the last three syllables as systems that have stress on one of the final two syllables only. In the latter case, StressTyp languages clearly prefer stress to occur on one of the final two syllables. The cause of these differences eludes us, but we suggest that it may again be due to imbalances in the samples. We have no way of telling which column of percentages more closely reflects the objective truth, but we tend to place more trust in the one with the larger sample size. In conclusion we note that this comparison supports the StressTyp data. We have seen that the codes for QI languages closely match the codes for the same languages in two other databases. We have also seen that one of these two other databases contains an almost equally large sample of QI languages and that the percentages of QI languages in each of the possible categories in this database are similar to those we find in StressTyp, especially if we reduce the number of Austronesian languages in StressTyp (since these are overrepresented). Finally, we note that it seems imperative for quantitative research on stress systems to work with rather large sample sizes. Without further research, we cannot be sure which of the three databases we compared here comes closest to accurately describing the tendencies in the languages of the world, but we do think it is a tell-tale sign that the smaller one of the three seems to be the odd man out in the large comparison of QI systems in all databases, and shows some unexpected patterns in the comparison of the QS systems. 
We suggest that the StressTyp sample has enough critical mass to do quantitative research, but that it could benefit greatly from an increase to about 1500 records, if the additional languages are carefully selected to make the whole sample genetically and areally more balanced.


6. Concluding remarks

In this chapter we have provided a detailed description of StressTyp, a database for word accentual systems in the languages of the world, discussing its history, current state and intended future developments. We have indicated the record structure of the database and shown how the stored information can be used for queries of various kinds. It is our intention to continue the development of StressTyp with regard to both its structure and its content, and we welcome any kind of comment based on reading this chapter or using the database. Finally, let us repeat that we also welcome any kind of collaboration, either in the area of word accentual systems or, more broadly, in other areas of word phonology, toward establishing larger and more ambitious projects.

Appendix: Additional StressTyp fields and Codes for the Type field

A. Fields not mentioned in section 3

Exceptional Patterns: Exceptional Patterns; Examples; Source
Unaccented Words: Category; Monosyllables Only Y/N
Prefixes: Stress Neutral Y/N; Stress Sensitive Y/N; Stressed Inherently Y/N; Cyclic Effects Y/N; Comments
Compounds: Category 1: sw/ws; Category 2: sw/ws; Category 3: sw/ws; Category 4: sw/ws
Suffixes: Stress Neutral Y/N; Stress Sensitive Y/N; Stressed Inherently Y/N; Cyclic Effects Y/N; Pre-stress Y/N

Clitics: Comments; Examples
Phonetic Realization: Lexical Pitch Y/N; Tone Classes Y/N
Processes: Processes; Examples

B. StressTyp codes

Fixed stress patterns
I    Primary stress always occurs on the initial syllable.
S    Primary stress always occurs on the second syllable.
T    Primary stress always occurs on the third syllable.
A    Primary stress always occurs on the antepenultimate syllable.
P    Primary stress always occurs on the penultimate syllable.
U    Primary stress always occurs on the final syllable.

Variable stress patterns
I/I  Place stress on the initial syllable if it is heavy (even if the second syllable is also heavy); otherwise place stress on the second syllable if it is heavy; if neither the first nor the second syllable is heavy, place stress on the first syllable.
I/S  Place stress on the initial syllable if it is heavy (even if the second syllable is also heavy); otherwise place stress on the second syllable if it is heavy; if neither the first nor the second syllable is heavy, place stress on the second syllable.
S/I  Place stress on the second syllable if it is heavy (even if the first syllable is also heavy); otherwise place stress on the first syllable if it is heavy; if neither the first nor the second syllable is heavy, place stress on the first syllable.
S/T  Place stress on the second syllable if it is heavy (even if the third syllable is also heavy); otherwise place stress on the third syllable if it is heavy; if neither the second nor the third syllable is heavy, place stress on the third syllable.


U/U  Place stress on the ultimate syllable if it is heavy (even if the penultimate syllable is also heavy); otherwise place stress on the penultimate syllable if it is heavy; if neither is heavy, place stress on the ultimate syllable.
U/P  Place stress on the ultimate syllable if it is heavy (even if the penultimate syllable is also heavy); otherwise place stress on the penultimate syllable if it is heavy; if neither is heavy, place stress on the penultimate syllable.
P/U  Place stress on the penultimate syllable if it is heavy (even if the ultimate syllable is also heavy); otherwise place stress on the ultimate syllable if it is heavy; if neither is heavy, place stress on the ultimate syllable.
P/P  Place stress on the penultimate syllable if it is heavy (even if the ultimate syllable is also heavy); otherwise place stress on the ultimate syllable if it is heavy; if neither is heavy, place stress on the penultimate syllable. Or: Place stress on the penultimate syllable if it is heavy (even if the antepenultimate syllable is also heavy); otherwise place stress on the antepenultimate syllable if it is heavy; if neither is heavy, place stress on the penultimate syllable. The code for this type is also P/P, with the note that EM=right.
P/A  Place stress on the penultimate syllable if it is heavy (even if the antepenultimate syllable is also heavy); otherwise place stress on the antepenultimate syllable if it is heavy; if neither is heavy, place stress on the antepenultimate syllable.
A/A  Place stress on the antepenultimate syllable if it is heavy (even if the penultimate syllable is also heavy); otherwise place stress on the penultimate syllable if it is heavy; if neither is heavy, place stress on the antepenultimate syllable.
F/F  Place stress on the first heavy syllable in the word. If there is no heavy syllable present, place stress on the first syllable.
F/L  Place stress on the first heavy syllable in the word. If there is no heavy syllable present, place stress on the last syllable.
L/F  Place stress on the last heavy syllable in the word. If there is no heavy syllable present, place stress on the first syllable.
L/L  Place stress on the last heavy syllable in the word. If there is no heavy syllable present, place stress on the last syllable.

Other codes and connectives
Lex  The locations of either main or secondary stresses are specified in the lexicon for the majority of the words in the language. This means that stress can be phonemic, because two non-monosyllabic words that are identical in segmental make-up may differ in stress location and meaning.
NMS  Stands for No Main Stress. All stresses are reported to be equally prominent.
L(CNT)  This is a so-called “count system”. Primary stress is assigned to the head of the last foot in the word. Stress is assigned from left to right. This leads to different stress locations for words with an odd and an even number of syllables.
F(CNT)  This is a so-called “count system”. Primary stress is assigned to the head of the first foot in the word. Stress is assigned from right to left. This leads to different stress locations for words with an odd and an even number of syllables, usually Initial stress in the even case and Second stress in the odd case.
IRR  Used to indicate that stress varies unpredictably within the domain.
Pitch and Tone are added between parentheses to indicate interaction between pitch or tone assignment and metrical structure.
;  This connective indicates that there is some degree of variation between two (or more) patterns for main stress. The dominant pattern comes before the semicolon.
-  This connective indicates that “superheavy” syllables are involved in the computation of stress. If such a syllable occurs in the position indicated before the hyphen, it bears stress. Otherwise a standard rule (placed after the hyphen) comes into operation.
%  This connective indicates a stress shift outside the bounded stress domain under special circumstances. Stress shifts to the location after the % sign under these circumstances, and stays in the bounded domain otherwise.
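The variable-pattern codes defined above are in effect small decision procedures, so they can be stated as code. The following Python is a minimal sketch under our own assumptions, not part of StressTyp: words are encoded as strings of `L` (light) and `H` (heavy), the names `BOUNDED`, `UNBOUNDED` and `assign_stress` are hypothetical, and the treatment of words too short for the default position is a guess. It covers the two-syllable-window codes and the unbounded codes only; the second reading of P/P (EM=right), the count systems, Lex and the connectives are left out.

```python
# Bounded codes as (preferred, alternative, default) positions. Non-negative
# values count from the left edge (0 = initial syllable); negative values
# count from the right edge (-1 = ultimate syllable).
BOUNDED = {
    "I/I": (0, 1, 0), "I/S": (0, 1, 1), "S/I": (1, 0, 0), "S/T": (1, 2, 2),
    "U/U": (-1, -2, -1), "U/P": (-1, -2, -2),
    "P/U": (-2, -1, -1), "P/P": (-2, -1, -2),
    "P/A": (-2, -3, -3), "A/A": (-3, -2, -3),
}
UNBOUNDED = {"F/F", "F/L", "L/F", "L/L"}


def _index(position, length):
    """Map a signed position to a 0-based index, or None if the word is too short."""
    index = position if position >= 0 else length + position
    return index if 0 <= index < length else None


def assign_stress(code, weights):
    """Return the 0-based index of the stressed syllable.

    weights is a non-empty string such as "LHL" (L = light, H = heavy).
    """
    n = len(weights)
    if code in BOUNDED:
        preferred, alternative, default = BOUNDED[code]
        for position in (preferred, alternative):
            index = _index(position, n)
            if index is not None and weights[index] == "H":
                return index
        index = _index(default, n)
        if index is not None:
            return index
        # Assumption: if the word is too short for the default position,
        # use the existing syllable closest to it.
        return n - 1 if default >= 0 else 0
    if code in UNBOUNDED:
        heavy_edge, default_edge = code.split("/")
        heavy = [i for i, weight in enumerate(weights) if weight == "H"]
        if heavy:
            return heavy[0] if heavy_edge == "F" else heavy[-1]
        return 0 if default_edge == "F" else n - 1
    raise ValueError(f"code not covered by this sketch: {code}")


for code, word in [("I/S", "LLL"), ("U/P", "LHH"), ("L/F", "LHLHL")]:
    print(code, word, assign_stress(code, word))
```

Storing each bounded code as a triple of positions keeps the ten definitions in one table, so a typo in one rule is easy to spot against its neighbours.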


References

Anderson, John M. and Charles Jones
  1974  Three theses concerning phonological representations. Journal of Linguistics 10: 1–26.
  1977  Phonological Structure and the History of English. Amsterdam: North-Holland.
Anderson, John M. and Colin J. Ewen
  1987  Principles of Dependency Phonology. Cambridge: Cambridge University Press.
Bailey, Todd Mark
  1995  Nonmetrical constraints on stress. Ph.D. diss., University of Minnesota.
Garde, Paul
  1968  L’Accent. Paris: Presses universitaires de France.
Goedemans, Rob
  to appear  Stress Typology. In Stress Patterns of the World: The Data, R. Goedemans, H. van der Hulst and E. A. van Zanten (eds.). Berlin/New York: Mouton de Gruyter.
Goedemans, Rob and Harry van der Hulst
  2005a  Fixed stress locations. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 62–65. Oxford: Oxford University Press.
  2005b  Weight-sensitive stress. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 66–69. Oxford: Oxford University Press.
  2005c  Weight factors in weight-sensitive stress systems. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 70–73. Oxford: Oxford University Press.
  2005d  Rhythm types. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 74–77. Oxford: Oxford University Press.
Goedemans, Rob, Harry van der Hulst, and Ellis Visch
  1996a  The organization of StressTyp. In Stress Patterns of the World, Rob Goedemans, Harry van der Hulst, and Ellis Visch (eds.), 27–68. (Holland Institute of Generative Linguistics Publications 2.) The Hague: Holland Academic Graphics.
  1996b  StressTyp Manual. Leiden: Holland Institute of Generative Linguistics.
  1996c  StressTyp: A database for prosodic systems in the world’s languages. Glot International 2 (1/2): 21–23.
Goedemans, Rob, Harry van der Hulst, and Ellen van Zanten
  to appear  Stress Patterns of the World: The Data. Amsterdam: John Benjamins.
Gordon, Matthew
  2002  A factorial typology of quantity insensitive stress. Natural Language and Linguistic Theory 20: 491–552.
Greenberg, Joseph H. and Dorothy Kashube
  1976  Word prosodic systems: A preliminary report. Working Papers on Language Universals 20: 1–18.
Halle, Morris and William J. Idsardi
  1994  General properties of stress and metrical structure. In A Handbook of Phonological Theory, John A. Goldsmith (ed.), 403–443. Oxford: Basil Blackwell.
Halle, Morris and Jean-Roger Vergnaud
  1987  An Essay on Stress. Cambridge, MA: MIT Press.
Harms, Robert T.
  1981  A Backwards Metrical Approach to Cairo Arabic Stress. Linguistic Analysis 7: 429–451.
Hayes, Bruce
  1980  A Metrical Theory of Stress. Ph.D. diss., Massachusetts Institute of Technology. [Distributed in 1981 by the Indiana University Linguistics Club, Bloomington, Indiana.]
  1995  A Metrical Theory of Stress: Principles and Case Studies. Chicago, Illinois: University of Chicago Press.
Hulst, Harry van der
  1984  Syllable Structure and Stress in Dutch. Dordrecht: Foris Publications.
  1990  The book of stress. Unpublished manuscript, Department of General Linguistics, Leiden University.
  1992  The independence of main stress and rhythm. Paper presented at the Krems Phonology Workshop.
  1996  Separating primary accent and secondary accent. In Stress Patterns of the World, Rob Goedemans, Harry van der Hulst, and Ellis Visch (eds.), 1–26. (Holland Institute of Generative Linguistics Publications 2.) The Hague: Holland Academic Graphics.
  1997  Primary accent is non-metrical. Italian Journal of Linguistics/Rivista di Linguistica 9 (1): 99–127.
  1999  Word accent. In Word Prosodic Systems in the Languages of Europe, H. van der Hulst (ed.), 3–116. Berlin/New York: Mouton de Gruyter.
  2000a  Issues in foot typology. In Issues in Phonological Structure, Mike Davenport and Stephen J. Hannahs (eds.), 95–127. Amsterdam: John Benjamins. [Also appeared in Toronto Working Papers in Linguistics 16 (2007): 77–102.]
  2000b  Metrical phonology. In The First Glot International State-of-the-Article Book: The Latest in Linguistics, Lisa Chen and Rint Sybesma (eds.), 307–326. (Studies in Generative Grammar 48.) Berlin/New York: Mouton de Gruyter. [Originally published in Glot International (1995) 1(1): 3–6.]
  2002  Stress and accent. In Encyclopedia of Cognitive Science, Vol. 4, Lynn Nadel (ed.), 246–254. London: Nature Publishing Group.
  2006  Word stress. In The Encyclopedia of Language and Linguistics. 2nd Edition, Vol. 13, Keith Brown (ed.), 655–665. Oxford: Elsevier.
  to appear  Brackets and grid marks or theories of primary accent and rhythm. In Representations and Architecture in Phonological Theory, Charles Cairns and Eric Raimy (eds.). Cambridge, MA: MIT Press.
Hulst, Harry van der (ed.)
  1999  Word Prosodic Systems in the Languages of Europe. Berlin/New York: Mouton de Gruyter.
Hulst, Harry van der and Jan Kooij
  1994  Two modes of stress assignment. In Phonologica 1992, Wolfgang Dressler and John Rennison (eds.), 107–114. Torino: Rosenberg and Sellier.
Hulst, Harry van der and Aditi Lahiri
  1988  On foot typology. NELS 18: 286–209.
Hurch, Bernhard
  1995  Accentuations. In Natural Phonology: The state of the Art on Natural Phonology, Bernhard Hurch and Richard A. Rhodes (eds.), 73–96. Berlin/New York: Mouton de Gruyter.
Hyman, Larry M.
  1977  On the nature of linguistic stress. In Studies in Stress and Accent, Larry M. Hyman (ed.), 37–82. (Southern California Occasional Papers in Linguistics 4.) Los Angeles: Department of Linguistics, University of Southern California.
Idsardi, William J.
  1992  The computation of prosody. Ph.D. diss., Massachusetts Institute of Technology.
Kager, René
  1993  Alternatives to the iambic-trochaic law. Natural Language and Linguistic Theory 11: 381–432.
Liberman, Mark and Alan Prince
  1977  On stress and linguistic rhythm. Linguistic Inquiry 8: 249–336.
Lockwood, David G.
  1983  Parameters for a typology of stress. In The Ninth LACUS Forum 1982, John Morreall (ed.), 231–241. Columbia, South Carolina: Hornbeam Press.
McGarrity, Laura W.
  2003  Constraints on patterns of primary and secondary stress. Ph.D. diss., Department of Linguistics, Indiana University.
Prince, Alan
  1983  Relating to the grid. Linguistic Inquiry 14: 19–100.
Prince, Alan and Paul Smolensky
  1993  Optimality Theory: Constraint Interaction in Generative Grammar. (Technical Report #2 of the Rutgers University Center for Cognitive Science and Computer Science Department, University of Colorado at Boulder.) Piscataway, New Jersey: Rutgers University.
Pulgram, Ernst
  1970  Syllable, word, nexus, cursus. The Hague: Mouton.
Roca, Iggy
  1986  Secondary stress and metrical rhythm. Phonology Yearbook 3: 330–341.
Vergnaud, Jean-Roger and Morris Halle
  1978  Metrical structures in phonology. Unpublished Ms., Massachusetts Institute of Technology.
Zanten, Ellen van and Rob Goedemans
  2007  A functional typology of Austronesian and Papuan stress systems. In Prosody in Indonesian Languages. LOT: Occasional Series 9, Vincent van Heuven and Ellen van Zanten (eds.), 63–88. Utrecht: Igitur, Utrecht Publishing and Archiving Services.

The typological database of the World Atlas of Language Structures

Martin Haspelmath

1. Introduction

The World Atlas of Language Structures (often abbreviated as WALS) is primarily a book with 142 world maps showing the global distribution of structural features of language. It was put together by Martin Haspelmath, Matthew S. Dryer, David Gil and Bernard Comrie at the Max Planck Institute for Evolutionary Anthropology between 1999 and 2004, and published by Oxford University Press in July 2005 (Haspelmath et al. 2005). Over forty authors contributed to it, each structural feature (and thus each map) being the responsibility of a single author or team of authors. A sample map is shown in Figure 1.

Figure 1. The WALS map “Passive Constructions”, by Anna Siewierska (2005)

On the maps, each language is shown by a dot (most often a circle), and different colours stand for different structural types (or feature values). Thus, in the map in Figure 1 the white dots are languages lacking a passive construction, and the red dots are languages with a passive construction.

The World Atlas of Language Structures thus resembles a traditional dialect atlas, but the coding points are not places that the authors actually visited. Instead, they stand for languages on which the authors obtained information through published descriptions (reference grammars, dictionaries, scholarly articles, but occasionally also personal communications from experts and/or speakers of the language). At most 10 % of the world’s languages can be said to have been described reasonably well, so the maps show only about 400 languages on average (out of the 6000–7000 languages that were still spoken at the end of the 20th century). The features have at least two different values, and at most nine, because more than nine different colours (or colour-shape combinations) are difficult to distinguish visually on a map. The editors’ task thus consisted in assembling from the authors’ contributions a database that primarily consisted of one two-column table for each feature, giving pairs of language names and feature values. In addition we asked for bibliographical references and page numbers. A very partial sample table (showing just five languages) is shown in Table 1.

Table 1. A very partial table exemplifying the data for Figure 1

language name   feature value (1: present, 2: absent)   author-year               pages
Apurinã         1:present                                Facundes 2000             522
Arapesh         2:absent                                 Conrad and Wogiga 1991    14
Evenki          1:present                                Nedjalkov 1997            217–225
Koasati         1:present                                Kimball 1991              138
Tunica          2:absent                                 Swanton 1921              5–21

In addition, the authors were asked to provide a 2000-word text giving a description of the feature and providing full definitions of the various values. These texts are printed in the atlas on the two pages preceding the two map pages. Since linguists may want to use the data underlying the World Atlas of Language Structures in a variety of ways, an electronic version of the atlas was published together with the book on a CD-ROM: The Interactive Reference Tool, programmed by Hans-Jörg Bibiko at the Max Planck Institute for Evolutionary Anthropology. This programme allows users to view the maps of the printed atlas and to display the data in a variety of ways, to conduct automatic searches, to export data and maps, and to create compound features based on the standard 141 features of the printed version.


Thus, the WALS data can be seen as a single complex database, consisting of 141 databases on structural features that are linked by a common metadata scheme (data on languages, references, and so on).

2. Research questions

There were two main research questions that the editors and the authors wanted to address:

(i) What correlations exist between structural features in different areas of grammar? For example, is it true that languages with little verb inflection tend not to make a past-nonpast distinction (not even a noninflectional one)? Is it true that languages with large vowel inventories tend to have small consonant inventories, and vice versa?

(ii) What geographical patterns are exhibited by the structural features? For example, is it true that tone distinctions are found primarily in sub-Saharan Africa and Southeast Asia? (The geographical perspective on the distribution of structural features is generally called areal typology.)

That interesting correlations between different structural features can be found has been well known since Greenberg (1963), and since the 1980s the search for correlations has also been prominent in generative syntax. Since the 1970s a substantial amount of systematic large-scale cross-linguistic research (i.e. research involving 50 or more languages from all areas of the world) has been carried out, and it was obviously desirable to put the typological data together in such a way that potential correlations can be tested easily and automatically. Thus, the editors approached linguists whom they knew to have gathered data from a large number of languages and asked them to contribute one or several chapters to WALS. We also enlisted several doctoral students who were in the process of gathering data for their dissertations, and a number of typologists only started gathering data on a large-scale basis when they heard about the project in 1999 and 2000. (We did not try to incorporate any of the early published work from the 1970s and 1980s whose authors were no longer actively involved in typological research.) WALS includes data from three projects that are described in more detail in this book: the StressTyp database (Rob Goedemans and Harry van der Hulst), the Surrey Morphology Group’s syncretism database (Matthew Baerman and Dunstan Brown), and the database on intensifiers and reflexives at the Freie Universität Berlin (Ekkehard König, Volker Gast and associates).

That structural features tend to cluster geographically has also been known for quite a while. Sprachbund phenomena have been discussed for a number of areas in various parts of the world, and Jakobson ([1931] 1971) already ventured the hypothesis of a Eurasian Sprachbund based on a few phonological features. But the issue of large areal patterns became prominent in language typology only with the publication of Dryer (1989) and Nichols (1992). The latter in particular included detailed discussions of areal patterns, but contained virtually no maps. Areal typology within Europe received a boost from van der Auwera (1998) and related work in the EUROTYP project, which showed that even outside the well-known Balkan area, many geographical patterns can be found (see also Haspelmath 2001). So at the end of the 1990s the time seemed ripe for a larger enterprise that put much of the available (and also a lot of new) cross-linguistic data on maps on a global scale.

Thus, WALS tried to address two goals simultaneously that are not logically linked to each other. The search for correlations can proceed without any geographical information, and the search for areal patterns need not be concerned with correlations. However, combining the two goals had a number of clear benefits.

First, the stated goal of publishing an atlas helped motivate those contributors who were primarily interested in correlations, because their contribution was published in the form of a “chapter” consisting of two text pages (written by them) and two map pages (for which they provided the underlying data). Four pages is not much on a CV, but the atlas chapters are conventional publications that can be cited easily, and the authors can get credit for their work in this way. If we had just published an electronic database without a book, the authors would probably have been much more reluctant to share their data. (And if we had proposed publishing the data as printed tables in a book, we would not have found a publisher, or the book would not have been read.)

Second, if we had just focused on the areal patterns without taking correlations into account, we might have limited ourselves to a printed atlas. But the goal of finding correlations forced us to give the user several ways of searching for correlations. As a result, even those chapters that were primarily included for their geographical-pattern interest can now be used for finding correlations.

Third, at least since Dryer (1989) it has been widely known that finding valid correlations presupposes some awareness of geographical patterns, just as it presupposes awareness of genealogical patterns. For example, if we limit ourselves to the languages of Africa and Europe, it seems as if the presence of a tone distinction precludes the presence of a rounding distinction in front vowels (i.e. i vs. ü, e vs. ö). African languages tend to show tone, European languages front vowel rounding. But the world-wide picture is rather different: Tone contrasts are found especially in Africa and Southeast Asia, while front rounded vowels are found especially in northern Eurasia (cf. Figure 2 below). In China, there are no fewer languages with front rounded vowels than in Europe, and all of them have a tone distinction.

Figure 2. The WALS map “Front Rounded Vowels”, by Ian Maddieson (2005a)

Thus, the correlation goal and the areality goal fit together very well, and in actual fact nowadays most comparative linguists have both research questions in mind when they study a particular phenomenon in a large number of languages.
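The tone and front-rounded-vowel example can be restated as a cross-tabulation that is computed once on a geographically restricted sample and once on the full sample. The sketch below uses invented toy records with placeholder language identifiers; none of the values are WALS data, and the function name `crosstab` is our own. It is meant only to show how an apparent incompatibility between two feature values can disappear when a further area is added.

```python
from collections import Counter

# Invented toy records, NOT WALS data:
# (language id, area, has tone, has front rounded vowels)
records = [
    ("af1", "Africa", True, False),
    ("af2", "Africa", True, False),
    ("eu1", "Europe", False, True),
    ("eu2", "Europe", False, True),
    ("eu3", "Europe", False, False),
    ("as1", "East Asia", True, True),
    ("as2", "East Asia", True, True),
    ("as3", "East Asia", True, False),
]


def crosstab(rows):
    """Count languages for each (tone, front rounded vowels) combination."""
    return Counter((tone, rounded) for _, _, tone, rounded in rows)


restricted = crosstab(r for r in records if r[1] in {"Africa", "Europe"})
worldwide = crosstab(records)

# Languages that have both tone and front rounded vowels:
print("Africa + Europe:", restricted[(True, True)])
print("All areas:      ", worldwide[(True, True)])
```

In the restricted toy sample the two properties never co-occur, while the full sample contains languages with both, which is the pattern the paragraph on Africa, Europe and China describes.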

3. Database design

The WALS database consists of three main tables: the Data table, the Features table, and the Languages table. (In addition, there are other metadata tables, such as the references table, that I will not discuss here.) As our primary goal was to give a representative picture of the world’s linguistic diversity, we asked for only minimal information on each language-feature pair: a feature value for the dot colour/shape on the map, and references including page numbers (cf. Table 1 above). In addition, we allowed the authors to provide an example (since this is very time-consuming, only a fairly small number of features include examples). We also allowed up to five references, so we had to include five author-year and page number fields in the data table. Thus, the Data table ended up being more complicated than the simple Table 1 above. A list of the fields is given in Table 2, with three sample records. In this table, the Language Number field and the Feature Number field are necessary to relate the Data table to the Languages and Features tables, respectively. The data are from Siewierska (2005), Gil (2005), and Corbett (2005).

Table 2. The Data table of the WALS database: Three sample records

Language Number:   1233 (=Apurinã)   645 (=Nauruan)     2011 (=Lak)
Feature Number:    107 (=Passive)    55 (=Classifiers)  30 (=Number of Genders)
Value:             1 (=present)      3 (=obligatory)    4 (=four)
Example:           –                 –                  –
Author-year 1:     Facundes 2000     Kayser 1993        Corbett 1991
Page Numbers 1:    522               68–76              24–26
Author-year 2:     –                 Lynch 1998         Xajdakov 1980
Page Numbers 2:    –                 120, 204–213
Author-year 3:     –                 –                  Murkelinskij 1967
Page Numbers 3:    –                 –                  166–167
…

The Features table primarily contains the feature name, the feature number, and the names of the feature values:

Table 3. The Features table of the WALS database: Three sample records (Corbett 2005; Dryer 2005; Haspelmath 2005)

Feature Number:   30                  33                            105
Feature Name:     Number of Genders   Coding of Nominal Plurality   Ditransitive Constructions: The Verb ‘Give’
Value Name 1:     None                Plural prefix                 Indirect-object construction
Value Name 2:     Two                 Plural suffix                 Double-object construction
Value Name 3:     Three               Plural stem change            Secondary-object construction
Value Name 4:     Four                Plural tone                   Mixed
Value Name 5:     Five or more        Plural reduplication          –
Value Name 6:     –                   Mixed plural                  –
Value Name 7:     –                   Plural word                   –
Value Name 8:     –                   Plural clitic                 –
Value Name 9:     –                   No plural                     –

The actual database is a little more complex than shown in Table 3: for all feature values, we actually have long and short value names. The long names appear in the text, and the short names appear on the map legend and in the electronic version. Moreover, we added information about dot colours/shapes in nine additional Value Colour fields.

The Languages table crucially includes the language number, the language name, the location, and the WALS code (i.e. the three-letter abbreviation that is shown on each dot on the printed maps). Identifying the languages that the authors provided information on proved to be a very time-consuming task, since language names are often not sufficient to identify a language. Only after WALS was completed was an ISO standard for unique identification of languages established (ISO 639-3), and the discipline of linguistics is still far from accepting such a standard. It will take a while before linguistics publications include a unique language identifier as a matter of course. Thus, the WALS authors were required only to give a language name and their sources, and where there was a problem (e.g. in identifying the right dialect of Quechua or Berber), the editors consulted the sources in order to establish the identity of the language. Where alternative names exist in the literature, the editors tried to choose the name form that is currently the most common among linguists and is the most acceptable to the speakers.

In addition, in order to facilitate the identification of the languages, the editors added Ethnologue 14 and Ethnologue 15 codes (the latter being largely identical with ISO 639-3), the names from Ethnologue, and the names chosen in the two other major published language catalogues, Ruhlen (1987) and Moseley & Asher (eds.) (1994). Moreover, for each language its family and genus were determined, and for some larger families several subfamilies were distinguished (e.g. Chadic within Afro-Asiatic, or Munda within Austro-Asiatic). Finally, for each language we have information on the country (or countries) where it is (primarily) spoken. Thus, the Languages table contains the 13 fields shown and exemplified in Table 4.

Table 4. The Languages table of the WALS database: Three sample records

Language Number:      645                         1038                2205
Language Name:        Nauruan                     Nuuchahnulth        Dâw
Location:             166°55E 0°30S               126°40W 49°40N      67°05W 0°15S
WALS Code:            nau                         nuu                 daw
Ethnologue 14 code:   NRU                         NOO                 KWA
Ethnologue 15 code:   nau                         noo                 kwa
Ethnologue name:      Nauruan                     Nootka              Kamã
Ruhlen name:          Nauruan                     Nootka              –
Asher & Moseley:      Nauruan                     Nootka              Kamán
Family:               Austronesian                Wakashan            Vaupés-Japurá
Subfamily:            Eastern Malayo-Polynesian   –                   –
Genus:                Oceanic                     Southern Wakashan   Vaupés-Japurá
Country:              Nauru                       Canada              Brazil
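The way the Language Number and Feature Number fields relate the three tables can be illustrated with a small relational sketch. The snippet below uses Python's standard sqlite3 module purely for illustration (it is not the software used for WALS). All table and column names are hypothetical simplifications of the fields in Tables 2–4, only one reference field is included, and the value names are stored in a separate table rather than in nine Value Name columns, which is an assumption made here for brevity.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative only and do not
# reproduce the field names of the actual WALS database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (
    language_number INTEGER PRIMARY KEY,
    language_name   TEXT NOT NULL
);
CREATE TABLE features (
    feature_number INTEGER PRIMARY KEY,
    feature_name   TEXT NOT NULL
);
-- Assumption: one row per value name instead of Value Name 1..9 columns.
CREATE TABLE value_names (
    feature_number INTEGER REFERENCES features(feature_number),
    value          INTEGER,
    value_name     TEXT,
    PRIMARY KEY (feature_number, value)
);
CREATE TABLE data (
    language_number INTEGER REFERENCES languages(language_number),
    feature_number  INTEGER REFERENCES features(feature_number),
    value           INTEGER,
    author_year_1   TEXT,
    page_numbers_1  TEXT,
    PRIMARY KEY (language_number, feature_number)
);
""")

# Two of the sample records from Table 2.
conn.executemany("INSERT INTO languages VALUES (?, ?)",
                 [(645, "Nauruan"), (2011, "Lak")])
conn.executemany("INSERT INTO features VALUES (?, ?)",
                 [(55, "Classifiers"), (30, "Number of Genders")])
conn.executemany("INSERT INTO value_names VALUES (?, ?, ?)",
                 [(55, 3, "obligatory"), (30, 4, "Four")])
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)",
                 [(645, 55, 3, "Kayser 1993", "68–76"),
                  (2011, 30, 4, "Corbett 1991", "24–26")])

# Resolve the numeric keys of each data point into readable names.
rows = conn.execute("""
SELECT l.language_name, f.feature_name, v.value_name
FROM data AS d
JOIN languages   AS l ON l.language_number = d.language_number
JOIN features    AS f ON f.feature_number  = d.feature_number
JOIN value_names AS v ON v.feature_number  = d.feature_number
                     AND v.value           = d.value
ORDER BY l.language_name
""").fetchall()

for row in rows:
    print(row)
```

Each data point stores only numbers; the language, feature, and value names are looked up through the two key fields, which is why those fields are indispensable in the Data table.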

Of course, it would have been desirable to include other sociolinguistic information of various sorts, such as the number of speakers, the use of the language in the media and in schools, the amount of bilingualism, and so on. And given that the database is part of an “atlas”, it is natural to ask for a more precise indication of the place(s) where the language is spoken, e.g. in the form of polygons (rather than just a single dot at the centre of the area where the language is spoken). Unfortunately, information of this sort is available only for a small percentage of the languages, so we did not try to include it. Moreover, on the atlas maps we did not want to privilege languages with a large number of speakers, or languages that are spoken over a wide area, so even if polygon information had been available, we would not have chosen it as the primary means of presenting the data on maps.

4. Implementation

The database described in the preceding section was initially implemented in FileMaker Pro at the editorial working level, and since the use of database software was (at least until recently) not universal even among typologists, often text files had to be converted into the right database format. The database was published on a CD-ROM accompanying the printed atlas together with a programme (the Interactive Reference Tool) that allows users to display the data in a variety of ways, to conduct automatic searches, to export data and maps, and to create compound features based on the standard 141 features of the printed version.

Users of the electronic database can customize the map in various ways: show major cities and country names, remove country boundaries and rivers, and replace the light green/light blue base map by a topographic map showing altitude levels. The language dots can be shown in five different sizes, and the language name can be shown either as three-letter WALS code inside the symbol, or in full to the right of the symbol. The colours and shapes of the symbols can be changed. When the mouse pointer moves over the dot, the full name is shown, and when clicking on a dot, a window with further information on the language opens (including the data source). Users can also zoom in on areas with high dot density, closely enough to see all dots separately, and drag on a map to see adjacent areas. Maps can be exported and printed, and various user-defined selections can be saved for future use.

The Interactive Reference Tool allows users to manipulate the standard features in two ways: values can be removed (if they are not of interest in a certain context), and several values can be merged into a single value. For instance, the five values of chapter 1 on consonant inventories (small, moderately small, average, moderately large, large) can be reduced to three (below average, average, above average) with just two mouse clicks.
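The two manipulations just described (removing values and merging values) amount to a simple recoding of the value column. The following is a minimal sketch of that idea, not the Interactive Reference Tool's own implementation; the function name, the language labels, and the codings are hypothetical placeholders rather than WALS data.

```python
# Hypothetical codings for the consonant-inventory feature, assuming the
# values are numbered 1-5 in the order given in the text
# (small, moderately small, average, moderately large, large).
codings = {"lang_a": 1, "lang_b": 2, "lang_c": 3, "lang_d": 4, "lang_e": 5}

# Merge map: old value -> new value.
merge = {1: 1, 2: 1, 3: 2, 4: 3, 5: 3}
new_names = {1: "below average", 2: "average", 3: "above average"}


def recode(codings, merge, removed=()):
    """Merge values via `merge`; drop languages whose value is in `removed`."""
    return {lang: merge[value]
            for lang, value in codings.items()
            if value not in removed}


merged = recode(codings, merge)
for lang, value in merged.items():
    print(lang, new_names[value])

# Removing a value: languages coded "average" (3) are simply left out.
without_average = recode(codings, merge, removed={3})
```

Keeping the merge map separate from the original codings means the five-valued data are never overwritten, so other groupings can be tried later.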
Users can search for language names, genus names, family names, country names, and even for text in bibliographic entries. It is possible, for instance, to find and display all languages beginning with X, all languages belonging to the Austronesian family, all languages spoken in Colombia, or all languages described by Jeffrey Heath. On the maps that only show languages (without giving information about the features), different dot colours may stand for different families or different genera.

The Interactive Reference Tool contains no particular resources for assessing areal patterns beyond its map-making capability. Whether an apparent geographical clustering of a particular type is significant or just looks significant will have to be decided by other means. (So far the methodology of assessing areality is a very underdeveloped area in comparative linguistics; see Comrie 2007 for more on WALS as a research resource for areal typology.) However, the Tool was designed to help the comparative linguist find correlations between different features, genealogical information, and geographical information. The most frequent question of theoretical linguists is perhaps whether two features correlate. To test this, users can create their own compound features. For example, they may want to know whether the existence of tone in a language is correlated with the type of syllable structure. Both features have three values (tone: none, simple, complex; syllable structure: simple, moderately complex, complex), so by combining them, one gets nine possible values, as shown in Table 5.

Table 5. Result of combining two three-valued WALS features

Combined value                                                  languages
No tones AND Simple syllable structure                                 28
No tones AND Moderately complex syllable structure                    135
No tones AND Complex syllable structure                               112
Simple tone system AND Simple syllable structure                       21
Simple tone system AND Moderately complex syllable structure           75
Simple tone system AND Complex syllable structure                      23
Complex tone system AND Simple syllable structure                      11
Complex tone system AND Moderately complex syllable structure          58
Complex tone system AND Complex syllable structure                      8

The programme automatically creates a compound feature with these nine values, shows the number of languages in each value, suggests a symbol for each value, and displays a map of the compound feature. More complex ways of creating compound features are also possible and are described in detail in the Manual of the programme.
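The basic combination step can be sketched in a few lines. This is only an illustration of the idea of a compound feature, not the programme's actual code: the function name, the language labels, and the codings are hypothetical, and the numbering of the values (1–3 in the order listed in the text) is assumed.

```python
from collections import Counter

# Assumed numbering of the values, following the order given in the text.
TONE = {1: "No tones", 2: "Simple tone system", 3: "Complex tone system"}
SYLLABLE = {1: "Simple", 2: "Moderately complex", 3: "Complex"}


def compound(feature_a, feature_b):
    """Pair the values of two features for every language coded for both."""
    shared = feature_a.keys() & feature_b.keys()
    return {lang: (feature_a[lang], feature_b[lang]) for lang in shared}


# Placeholder codings (language -> value); not WALS data.
tone = {"lang_a": 1, "lang_b": 3, "lang_c": 1, "lang_d": 2}
syllable = {"lang_a": 2, "lang_b": 1, "lang_c": 2, "lang_e": 3}

combined = compound(tone, syllable)
counts = Counter(combined.values())

for (t, s), n in sorted(counts.items()):
    print(f"{TONE[t]} AND {SYLLABLE[s]} syllable structure: {n}")
```

Only languages coded for both features enter the compound feature, which is why a combined table covers fewer languages than either map alone; the nine counts in Table 5, for instance, sum to 471 languages.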

5. Limitations

The World Atlas of Language Structures was conceived of as a five-year project and primarily as an atlas, and as a result the database has a number of limitations. Especially if one tries to approach the WALS database from a quantitative/statistical point of view, one quickly realizes that WALS is not perfect.

Since the feature values are the core piece of information provided by WALS, the entire database can be seen as primarily consisting of a 141-by-2560 matrix (141 features, 2560 languages) with 360,960 cells that can be filled with an integer between 1 and 9 (the feature value). However, on average each map shows only around 400 languages, so that there are only about 58,000 cells filled with data points – about 84% of the cells are empty. Thus, although WALS makes a huge amount of information readily available for the first time and is therefore widely regarded as a milestone in the history of comparative linguistics, from a statistician’s point of view a problem is “the large amount of missing data” (Cysouw et al. 2007). This limitation can be overcome only by gathering further data, the most expensive part of the entire enterprise. For most of the gaps, original fieldwork with the speakers of the languages would be required. The alternative option of limiting the admitted languages strictly to those that have a coding for all the features, or of limiting the admitted features to those that have a coding for all the languages, would have entailed discarding a large amount of information that might be invaluable from other perspectives, so it was not seriously considered.

A limitation that was imposed by the goal of making an atlas is the restriction of the feature values to nine. While the most salient structural parameters of grammatical structure do not normally have more than a handful of values (e.g. with two values: configurational vs. nonconfigurational; with three values: head-final vs. head-initial vs. free order; with four values: head-marking vs. dependent-marking vs. double marking vs. zero marking), it is easy to define a parameter in such a way that there are more than nine values. For example, chapter 33 on the coding of nominal plurality (Dryer 2005) distinguishes plural prefixes from plural suffixes (cf. Table 3 above), but it has “plural clitic” and “plural word” as unitary types without differentiating between proclitics and enclitics, or between preposed and postposed plural words. Another striking case concerns features where elements are counted, as in chapter 30 (on the number of genders, Corbett 2005; cf. Table 3 above). The fifth value “five or more” lumps together a potentially

large number of different situations, and these could have been distinguished. What we could have done (and what a future more sophisticated project of this sort will probably do) in such situations is to include two levels of feature values: on the one hand, a level of feature-value detail where a large number of distinct types are distinguished in the database to capture a maximum of information; on the other hand, a level of map representation. Since the detailed information is difficult to display on a map (at least if the map is meant to be interpreted directly by human observers), similar types could be lumped together exclusively for the purposes of map representation. In WALS, the decision was taken to maintain strict identity between the database and the maps to simplify the procedure, but this is of course not necessary.

A consequence of the upper limit on values is that many maps work with a relatively uninformative value “other” or “mixed”. For example, in map 105 on ditransitive constructions (Haspelmath 2005; cf. Table 3 above) the “mixed” type can be a mixture of type 1 and 2, of type 2 and 3, of type 1 and 3, or of all three types. These different mixtures would not have been easy to show transparently on the map, so it was decided to lump them together in a single mixed type. Such “mixed” or “other” values are problematic for statistical analyses, especially similarity analyses, because from the fact that two languages are coded as “other” one cannot conclude that they are more similar to each other than to any of the other types. So again a more sophisticated future project would distinguish the various mixed types at the database level, and would lump them together at the level of map representation.

Another decision that was driven by the goal of making maps was to display several unrelated features on a single map, and to treat them as a single feature in the database. For example, in ch. 9 (“Presence of uncommon consonants”, Maddieson 2005b), seven values are displayed on the map:

Table 6. The seven values distinguished on map 9 (Maddieson 2005b)

value                                                     number of languages
1. None of the four uncommon consonants                                   448
2. Presence of clicks                                                       9
3. Presence of labial-velars                                               45
4. Presence of pharyngeals                                                 21
5. Presence of “th” sounds                                                 40
6. Presence of clicks, pharyngeals and “th” sounds                          1
7. Presence of pharyngeals and “th” sounds                                  2

But this seven-valued feature is of course just a conflation of four binary features, concerning the presence or absence of clicks, labial-velars, pharyngeals and “th” sounds. These are all fairly rare sounds, so maps showing their distribution would be relatively uninteresting, because almost all of the languages would have the value “absence”. Even with four rare consonants, 79% of the languages have the value “None” and appear as white dots on map 9. A future version of the database should distinguish the underlying single features from the composite features that human users like to see on maps. This will lead to a proliferation of features, because a large number of the WALS features can be decomposed further. Consider map 33 on the “Coding of nominal plurality” (Dryer 2005), whose WALS values are shown as an example in Table 3 above. The nine values are repeated in Table 7.

Table 7. The nine values distinguished on map 33 (Dryer 2005)

value                        number of languages
1. Plural prefix                             118
2. Plural suffix                             495
3. Plural stem change                          5
4. Plural tone                                 2
5. Plural reduplication                        8
6. Mixed plural                               34
7. Plural word                               150
8. Plural clitic                              59
9. No plural                                  86

These nine values can be recoded in the following way as primitive features:

a. presence/absence of nominal plurality (1–8 / 9)
b. fusion of plural coding: word / clitic / affix / stem change (7 / 8 / 1–2 / 3–4)
c. position of plural element: preceding / following (1 / 2)
d. type of stem change: segmental / tonal (3 / 4)
e. source of plural element: reduplicated / specified (5 / 1–2, 7–8)
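The recoding in (a)–(e) can be written out as a small lookup. The sketch below is one possible reading of the list, with hypothetical function and key names; the use of None for a primitive feature that the list leaves undefined for a given composite value is an assumption of the sketch, not part of WALS.

```python
# Value groupings taken from the list (a)-(e); names are illustrative.
FUSION = {7: "word", 8: "clitic", 1: "affix", 2: "affix",
          3: "stem change", 4: "stem change"}
POSITION = {1: "preceding", 2: "following"}
STEM_CHANGE = {3: "segmental", 4: "tonal"}
SOURCE = {5: "reduplicated", 1: "specified", 2: "specified",
          7: "specified", 8: "specified"}


def decompose(value):
    """Recode a composite map-33 value (1-9) as primitive features (a)-(e).

    None marks a primitive feature for which the list gives no value.
    """
    if value not in range(1, 10):
        raise ValueError(f"map 33 has values 1-9, got {value!r}")
    return {
        "plural": value != 9,                   # (a) values 1-8 vs. 9
        "fusion": FUSION.get(value),            # (b)
        "position": POSITION.get(value),        # (c)
        "stem_change": STEM_CHANGE.get(value),  # (d)
        "source": SOURCE.get(value),            # (e)
    }


print(decompose(2))  # plural suffix
print(decompose(9))  # no plural
```

The many None entries make visible how much of each primitive feature is left open by the composite coding, for instance the position of plural words and clitics (values 7 and 8).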

Such features are less interesting for the human interpreter and data provider, but from a database point of view they are more straightforward than WALS’s composite features.

Another limitation that is highlighted by Cysouw et al. (2005) is the fact that the concepts used in the feature and value descriptions are not standardized. Terms like “case” or “clitic” may not have exactly the same meaning across different chapters. In other words, the WALS chapters are not based on a standard ontology, but are in conformity with the normal practice of linguistics: technical concepts have to be defined anew by each author, because there is not enough common ground among linguists even for fairly basic concepts. This is a limitation that will be much harder to overcome than the others mentioned in this section (cf. Haspelmath 2009+ for general discussion of the comparative concepts used by typologists).
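The figures on missing data given at the beginning of this section can be checked with a few lines of arithmetic. The numbers are those quoted in the text; the variable names are only illustrative, and the count of filled cells is the approximate figure of 58,000 rather than an exact count.

```python
n_features = 141
n_languages = 2560
filled_cells = 58_000  # approximate figure quoted in the text

total_cells = n_features * n_languages
empty_share = 1 - filled_cells / total_cells
avg_languages_per_feature = filled_cells / n_features

print(total_cells)                       # 360960
print(f"{empty_share:.0%}")              # 84%
print(round(avg_languages_per_feature))  # 411, i.e. "around 400" per map
```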

6. Prospects

Clearly, the next steps to be taken by the editors and other interested comparative linguists must aim to expand the WALS database and to overcome some of the limitations mentioned in the last section. Of course, this will be a long-term process, but now that WALS exists many of the challenges that are ahead of us have come into sharper focus.

A relatively straightforward improvement in access to the database is its free availability on the web. In fact, it had been suggested from the beginning by some of the contributors that a WALS on the web was sufficient, and that a book was not needed. However, since the discipline does not yet have a standard way of giving credit to scholars who supply their data (plus data description) to a larger electronic database, publishing a book was probably the right decision in 1999. It is clear, though, that it is an important task for linguists (and scientists more generally) to develop conventions for giving recognition to scholars who contribute their datasets to a publicly accessible database.

In April 2008, the database of the World Atlas of Language Structures was made available online (http://wals.info). All the maps and texts, and many of the search functions of the Interactive Reference Tool, are now freely accessible to all internet users. The material was published under a Creative Commons licence, in agreement with the book publisher, Oxford University Press. The online version is conceived of as a separate publication (a kind of second edition), with a separate publication date and place of publication.1 Yearly updates are envisaged, which would count as subsequent editions of the same publication. At the beginning, these updates will consist mostly of error corrections, but in the future it should be possible for an author to add more languages to an already published feature, to differentiate the feature values further, and it will probably even be possible to submit completely new features. There is also a blog function that allows users to comment on individual chapters, and at some point in the future it should become possible to make comments on particular language-feature pairs (e.g. “I disagree that Basque has value 1 in feature 37, for the following reasons…”). Eventually, databases in comparative linguistics should become fully interactive, and the first steps in this direction have been taken.

1 WALS Online users are instructed to cite WALS Online chapters in the following form: Cysouw, Michael. 2008. Inclusive/Exclusive Distinction in Independent Pronouns. In: Haspelmath, Martin & Dryer, Matthew S. & Gil, David & Comrie, Bernard (eds.) The World Atlas of Language Structures Online. Munich: Max Planck Digital Library, chapter 39. Available online at http://wals.info/feature/39. Accessed on

References

Auwera, Johan van der
  1998   Conclusion. In Adverbial Constructions in the Languages of Europe, Johan van der Auwera and Donall E. Baoill (eds.), 813–836. Berlin / New York: Mouton de Gruyter.
Comrie, Bernard
  2007   Areal typology and the World Atlas of Language Structures. In Studies in Greek linguistics: Proceedings of the Annual Meeting of the Department of Linguistics, May 6–7, 2006, 23–40. Thessaloniki: Aristotle University of Thessaloniki.
Corbett, Greville G.
  2005   Number of genders. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 126–129. Oxford: Oxford University Press.
Cysouw, Michael, Mihai Albu, and Andreas Dress
  2007   Analyzing feature consistency using dissimilarity matrices. Sprachtypologie & Universalienforschung 61 (3): 263–279.
Cysouw, Michael, Jeff Good, Mihai Albu, and Hans-Jörg Bibiko
  2005   Can GOLD “cope” with WALS? Retrofitting an ontology onto the World Atlas of Language Structures. In Proceedings of the 2005 EMELD workshop, http://emeld.org/workshop/2005/proceeding.html/.
Dryer, Matthew
  1989   Large linguistic areas and language sampling. Studies in Language 13: 257–292.
  2005   Coding of nominal plurality. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 138–141. Oxford: Oxford University Press.
Gil, David
  2005   Numeral classifiers. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 226–229. Oxford: Oxford University Press.
Greenberg, Joseph H.
  1963   Some universals of grammar with particular reference to the order of meaningful elements. In Universals of Grammar, Joseph H. Greenberg (ed.), 73–113. Cambridge, MA: MIT Press.
Haspelmath, Martin
  2001   The European linguistic area: Standard Average European. In Language Typology and Language Universals, Martin Haspelmath, Ekkehard König, Wulf Oesterreicher, and Wolfgang Raible (eds.), 1492–1510. (Handbücher zur Sprach- und Kommunikationswissenschaft 20.) Berlin / New York: Mouton de Gruyter.
  2005   Ditransitive constructions: The verb ‘give’. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 426–429. Oxford: Oxford University Press.
  2009+  Comparative concepts and descriptive categories in cross-linguistic studies. Ms., MPI for Evolutionary Anthropology, Leipzig.
Haspelmath, Martin, Matthew Dryer, David Gil, and Bernard Comrie (eds.)
  2005   The World Atlas of Language Structures. Oxford: Oxford University Press.
Jakobson, Roman
  1971   Reprint. Über die phonologischen Sprachbünde. Roman Jakobson: Selected Writings, vol. 1, 137–143. The Hague: Mouton. Original edition, Travaux du Cercle Linguistique de Prague 4: 234–240. (Réunion phonologique internationale tenue à Prague, 18–21/XII, 1930.) 1931.
Maddieson, Ian
  2005a  Front rounded vowels. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 50–53. Oxford: Oxford University Press.
  2005b  Presence of uncommon consonants. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 82–85. Oxford: Oxford University Press.
Moseley, Christopher and Ron E. Asher (eds.)
  1994   Atlas of the World’s Languages. London: Routledge.
Nichols, Johanna
  1992   Linguistic Diversity in Space and Time. Chicago: The University of Chicago Press.
Ruhlen, Merritt
  1987   A Guide to the World’s Languages 1: Classification. Stanford: Stanford University Press.
Siewierska, Anna
  2005   Passive constructions. In The World Atlas of Language Structures, Martin Haspelmath, Matthew Dryer, David Gil, and Bernard Comrie (eds.), 434–437. Oxford: Oxford University Press.

Typology of reduplication: The Graz database

Bernhard Hurch and Veronika Mattes

1. Preamble 1

What gives reduplication a special status in linguistics is its unusual position within a theory of the linguistic sign. No doubt, it falls within the scope of the Peircean concept, but the object is not a single physically definable lexical or grammatical item; rather it is a process, or more generally, a grammatical procedure itself. As reduplication proves, nothing prevents a procedure from being interpretable as the material representation of the sign, although we are used to associating a specific material grammatical form, e.g., English -s or Basque -k or Italian -i, with a specific grammatical function, namely plurality in this case. The essence of reduplication is thus that we cannot isolate a specific, segmentally defined (morpho-)phonological form to be associated with a specific function; the reduplicant is a (morpho-)prosodic unit the exact shape of which depends on the phonological form of the item which undergoes the grammatical (morphological) change.2 Whatever is an object of apperception can function as a sign. This status, very simple on the one hand, namely in the systematics of construction, but very complex on the other hand, namely in its relation between sound and form, makes reduplication a challenging topic in linguistics and especially in grammatical theory.

We do not want to postulate here that the lack of a one-to-one correspondence between a specific form and its grammatical function really is detrimental to the cognitive processing of reduplication. There are too many languages in the world which make use of it and too many different types of reduplication, even several types co-present in one language, to maintain such a claim. We simply have to start our analysis from the premise that within a linguistic (cognitive) system the sign relation based on a coherently used rule-like procedure is as good as the sign relation based on an affix.

1 URL: http://reduplication.uni-graz.at – Email: [email protected]
2 In contrast to some recent generative theories we do not assume that the object of apperception is a prosodic unit itself, which is filled by a subsequent rule, but the rule itself, in the same way as the English plural is not just a (V)C skeleton, later filled by . We consider the process itself to be the sign.

This is obviously true, considering both the distribution of reduplication in the languages of the world (Rubino 2005a and b) and its frequency in languages with reduplicative systems competing with other morphological procedures, but also its specific role in diachronic changes. In most parts of the world as well as in most language families we find a fair number of reduplicative languages, and many of them do in fact make grammatically regular use of more than one reduplicative procedure. As far as its distribution across the languages of the world is concerned, we do not find a specific concentration as to geographical regions or other linguistically defined areas. Reduplication is rare only from a narrowly European point of view: the languages of Europe show reduplication only in a very restricted way. If it is present at all, it is limited to expressive and onomatopoetic word formations, and thus has a strong tendency to lexicalization.3 It has been evident for a long time that linguistic theory has a heavy European bias, notwithstanding the universal orientation of the field. Thus the study of reduplication, too, had difficulty for quite some time in transcending the examples from Latin perfect or Classical Greek aorist formation. The situation is dramatically different in other parts of the world. There are languages and whole language families that have more than one or even several productive types of reduplication, coherently denoting distinct grammatical and/or lexical operations. These types differ formally following the structure of the reduplicant, i.e., possible variants are full (complete) vs. partial reduplication, where the first denotes the repetition of the complete word form of the simplex, the second only a part of it. Moreover, in partial reduplication forms may vary with respect to the (morphoprosodic) shape of the reduplicant, e.g., full syllable, CV-, CV:-, etc.
Interactions with other phonological, morphological or lexical strategies like fixed segmentism may complicate the patterns. Reduplication differs from both additive and modificatory morphological procedures in important aspects, but also comes close to them in others. So even if there might be several overlapping features, or if one could assume a continuum between affixation, modification and reduplication, it is probably best to establish reduplication as a proper morphological operation in the strict sense and not to group it with other processes, in order to avoid vagueness of description and of theoretical evaluation. For different reasons, both descriptive and theoretical, reduplication has become a major issue in the linguistic discussion of the past three decades: in the re-discovery of morphology as a proper field of grammar within generative theory; in the rise of non-linear models in phonology; in the study of the interaction of phonology (and later of prosody) and morphology in the formulation of prosodic constraints on morphological form within a research area termed prosodic morphology; but also in non-formal theories ranging from the vast area of cognitive linguistics to typological research.

3 Older stages of Indo-European are left aside for the moment.

Still within but beginning to transcend the classical SPE framework, Wilbur (1973) mentions one important formal aspect of reduplication which also corresponds to the above-mentioned sign status of the operation: the rule of reduplication very much resembles a phonological operation insofar as the reduplicant must image the base, in the best case by copying (part of) it segmentally and prosodically. This preferred identity, which only two decades later would be cast in the form of a constraint in Optimality Theory, is misleading insofar as it allows a statement of the phenomenological part of the rule in purely phonological terms, although its character, its conditions and its function are morphological in nature. Wilbur argues in favor of the genuinely morphological status, which was not self-evident at the beginning of the 1970s, and most theories by now accept this view. This specific character is phonological merely in appearance, and thus reduplication itself is also not situated at an interface between phonology and morphology, but is a morphological operation proper. Situating reduplication at the interface between these components is a result of the already mentioned descriptive peculiarity, but it does not reflect its intrinsic position, e.g. in rule derivations or in diachrony. For non-linear models, first in phonological and later also in morphological theories, the treatment of reduplication presented a major issue.
This is because reduplication is an operation whose description prototypically transcends both segmental representation on the (morpho-)phonemic side and affixation on the side of morphology, and it thus becomes a touchstone for the validity of descriptive tools and of their empirical power within a given theory. In other words, reduplication very much figured as an argument for the necessity of non-linear approaches, and in the same spirit it was part of the first results and also frequently illustrated the functioning of such theories. This development has contributed very much to an intensification of the study of reduplication in many different languages through the past two decades, and linguistic theory has profited from it considerably. Sometimes theoretically induced descriptive devices played a more than prominent role, and the empirical basis was neglected more than was admissible. As sometimes happens, theoretical developments are faster than their empirical consolidation. Repetition in data, and even the danger of distortion of examples, was an unpleasant consequence. Doubtless, this was one of the original motivations for the creation of the Graz database on reduplication.

Reduplication has one more specific property in its relation between sound structure and meaning: with higher than chance frequency the procedure of reduplication is used to express quantitative or qualitative augmentation, in a very general sense. It has frequently been held in the literature that with reduplication we find the closest example to what is called iconic modeling in linguistics, i.e., a situation in which the linguistic form directly maps its meaning, namely an increase of form directly shapes an increase of semantic content. There are some striking counterexamples to this postulated iconicity, but the database, among other things, is also oriented towards clarifying questions of such form–meaning relations.

Linguistic typology as a discipline contributing to a non-formal understanding of invariance in grammatical structures of the languages of the world has also had important impulses in the past two decades. One foundation for typological studies is the empirical basis on which statements or predictions are made. Within what figures in the Humboldtian idea of an encyclopedia of grammatical categories, reduplication is a serious candidate for study. The concept of typology4 our project pursues is one that deals with the study of hierarchical and implicational principles of invariance in the structure of grammar. What a database may contribute to the pursuit of this aim is on the one hand to provide a solid empirical basis for study, and on the other to offer the most unrestrained possibilities for inquiries on the topic. Therefore the database needs to reflect in its structure not only the most fundamental issues, but virtually all aspects which at some point of description and discussion of reduplication are mentioned in the literature.
The solidity of the database and its typological functionality must be expressed both by an in-depth analysis of each single language following a thorough and coherent plan of description and analysis, and by a typologically valid selection of a sample of languages, which allows further-reaching generalizations. In addition, we have designed the structuring of the database, its use and possible queries in a way to make it as compatible as possible with other typologically oriented databases. Thus we intend the Graz Database of Reduplication (GDR) to provide substantive information on a wide range of questions independent of the theoretical background in which they are formulated. The website, moreover, contains an extensive bibliography on the topic, as well as some reprint publications of classics in the field, and it serves as a place for online publications on reduplication and related topics. Free access will be granted to the scientific community in general.

⁴ We do not have space here for clarifying what typology really is about. It just needs to be underscored that we have to deal with a wide array of ratings of phenomena, of approaches, of statements on the field of typology, which may one day need a more detailed discussion. Differences depend on theories of grammar, especially formal vs. functional theories, but they also concern the tasks of typology and, therefore, the role of typology in linguistics. This ranges from purely classificatory statements to the study of invariance of principles of grammatical structures.

2. Question of definition

We are well aware of the fact that the definition of reduplication proper has been a matter of debate at least since August Friedrich Pott’s seminal study on reduplication, published in 1862 as Doppelung (Reduplikation, Gemination) als eines der wichtigsten Bildungsmittel der Sprache beleuchtet aus Sprachen aller Welttheile, and that a thorough definition is needed of what is understood by the term reduplication in order to accomplish the above-mentioned targets and objectives. It is particularly important at this point to make a clear distinction between reduplication on the one hand, and a series of phenomena, similar to reduplication in one or more formal respects, but still different in essence, on the other hand. Such distinctions have already been drawn by Pott (1862). His study provides a very systematic discussion of the parallels and the overlaps between reduplication and other phenomena, but he also sharply sketches the differences between reduplication and gemination in (morpho-)phonology on the one hand, and between reduplication and repetitions in syntax on the other hand. Pott was probably one of the Indo-Europeanists most open to general linguistics in the 19th century and, although his modeling strongly relies on the language family he is most familiar with, he does have a short fourth chapter with reduplication data from other parts of the world. It is by no means an exaggeration to consider Pott’s volume to be the first empirically based typological study on reduplication.

Thoroughness in definition inevitably results in a rather tight and narrow definition of reduplication. For all programming of the database and for the necessary comparability of the classification of the data, such a procedure was essential. As a working hypothesis we are using the following definition:

A reduplicative construction is a set of at least two linguistic forms F and F’ in a paradigmatic, i.e. non-suppletive morphological relation in which F’ contains a segment or a sequence of segments which is derived from a non-recursive repetition of (a part of) F. Reduplication exists, if a specific grammar makes systematic use of reduplicative constructions.

Reduplication must be situated among the types of purely morphological operations; in addition, we also want to establish reduplication as a genuinely morphological process, independent of affixation and other additive procedures. The above definition may be modified at a later stage of our research, depending on empirical results, on theoretical approaches, on specific research designs, or simply on predilections, but we emphasize that, in order to eliminate ambiguity, an operable choice had to be made, and we are convinced that the definition above does reflect purely linguistic reasoning.

Other repetitive phenomena, which we exclude from our definition, could be related to reduplication and can in fact be related historically. Such evidence points to a possible shortcoming of a database: categorizations are necessarily discrete choices, while only the quantitative view of the phenomenon will hopefully allow us to redraw a line of diachrony and to make statements on (possible) developments.⁵ The emergence of reduplication may be thought of only as non-gradual, as abrupt and discontinuous, but the further development frequently is gradual and continuous. This is perhaps the greatest drawback of typological databases in general: they cannot give an account of the flexible and overlapping categories which a natural language usually has.

The definition as above needs some clarifications northwards and southwards, to use this Kiparskyan metaphor, as it excludes repetitive constructions which do not fit the wording. The database does not include purely syntactically induced repetitions, although they sometimes are considered to stand at the origin of morphologically reduplicative patterns (cf. Gil 2005). On the other side there are some rare cases in which the only function of reduplication which can be made out is prosodic: in those cases the resulting reduplicated form in one way or another better fits the prosodic requirements of a given language (Hurch 2002).⁶

⁵ The Second Graz Conference on Reduplication in 2007 was dedicated to the topic “Diachrony and productivity”. A selection of papers will be edited by the authors in a monothematic issue of the journal Morphology.

⁶ This origin of reduplication also clearly shows up in some stages of language acquisition.

3. The Database⁷

For the user, the database is accessible via two different modes: a rather simple tree structure, which displays “browseable” information on the reduplication types of the single languages; and a rather sophisticated query builder, which allows a free selection both of all available variables to be displayed, and of the criteria along which the search of the database is performed. Most important for the described purposes, within the free query builder it is possible to restrict the search to a prespecified sample of 100 typologically balanced languages⁸ out of all languages that the database contains. The selection of the overall number of languages is biased, due to the presence of interesting reduplication types in languages outside the prespecified list, due to the specific research interests of the project members, or due to the availability of data.⁹ These two modes provide the essential portals necessary for exploiting all possible information contained in the database. As we consider the formal characteristics and the functional properties of reduplication as equally important for linguistic research, the database is organized in a way that allows consultation from all different grammatical aspects.

The database is example-based: each record represents one type. ‘Type’ is defined in terms of its specific form – function relation, so, for example, CV-plural and CV-diminutive are two distinct types. If possible, each type is illustrated with ‘further examples’ (which do not get individual records and specifications). Each entry is analyzed in phonological and morphological detail and is linked with information about semantics and, where available, about productivity and diachrony. The sources of the data are listed separately. The information contained in the database is based on reference grammars, specific studies on reduplication, and questionnaires supplied by experts on single languages or dialects.

⁷ The architecture of the database was developed in collaboration with and programmed by Olga Konovalova and has been modified and maintained by Angela Fessl; project members in addition have been Ingeborg Fink, Motomi Kajitani, Thomas Schwaiger and Ursula Stangel.

⁸ We decided to follow the 100-language sample recommendation of Haspelmath, Dryer, Gil and Comrie’s WALS (2005), in order to enable comparisons and correlations with other quantitative typological studies which are geared to this list.

⁹ Quantitative searches, such as how many languages have infixal reduplication, or the percentage of languages which have certain reduplication types, or which completely lack reduplication, are performed only on the balanced sample.
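The rule that counts and percentages are computed on the balanced sample only can be illustrated with a small sketch. It is written in Python purely for illustration and is not the project's own code; the record layout, the function name `share_with` and the placeholder language names are assumptions, and the data are invented toy values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypeRecord:
    """One record = one reduplication type of one language (toy layout)."""
    language: str
    position: str   # "initial", "final", "internal" or "undefined"
    function: str   # e.g. "plural", "diminutive"


def share_with(records, sample, predicate):
    """Fraction of sample languages with at least one matching record.

    Languages outside `sample` are ignored, so the result is a statement
    about the balanced sample, not about the whole (biased) collection.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    hits = {r.language for r in records
            if r.language in sample and predicate(r)}
    return len(hits) / len(sample)


# Hypothetical toy data; the language names are placeholders.
SAMPLE = {"LangA", "LangB", "LangC", "LangD"}
RECORDS = [
    TypeRecord("LangA", "internal", "plural"),
    TypeRecord("LangA", "initial", "diminutive"),
    TypeRecord("LangB", "initial", "plural"),
    TypeRecord("LangX", "internal", "plural"),  # outside the sample: ignored
]

internal_share = share_with(RECORDS, SAMPLE, lambda r: r.position == "internal")
plural_share = share_with(RECORDS, SAMPLE, lambda r: r.function == "plural")
# Sample languages without any record count as lacking reduplication here.
lacking = SAMPLE - {r.language for r in RECORDS}

print(f"internal reduplication: {internal_share:.0%} of the sample")
print(f"plural reduplication: {plural_share:.0%} of the sample")
print("no reduplication recorded:", sorted(lacking))
```

Dividing by the size of the sample rather than by the number of languages with records is what makes "languages which completely lack reduplication" visible in such a count.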

For every language in the database we give a short overview of the reduplication system and other data that might be relevant for a better understanding of the reduplication examples. This information on the language can be searched easily and exhaustively via the tree structure. This search can start from a language, a language family, or an area, and allows a selection of examples with respect to forms, functions, patterns, semantics, and word-classes. In a next step the basic information on the selected examples is provided, adapted to the formal or functional focus of the search.

Figure 1. Start page of the tree structure

Figure 2. Chosen language: Papiamentu

Figure 3. Selected examples: plural reduplication (in Papiamentu)

The individual search via the dynamic query builder, by contrast, can be organized from any user-defined starting point. The following screenshots show an example of searching for information on the productivity of initial reduplication for plural. In the first step, the user selects all fields (information) which s/he wants to have displayed in the result, i.e. all information which s/he is interested in:

Figure 4. Start page of the query builder with an individual selection of fields (to be displayed in the result)

In the second step, the conditions for the selection are chosen and combined:

Figure 5. Second selection page of the query builder with the individual selection of two conditions (initial reduplication for plural marking)

The result is presented as follows:

Figure 6. One of the result pages (displaying information on the productivity of initial reduplication for plural formation)
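The three steps shown in the screenshots (fields to display, conditions, result) can be sketched as follows. This is a minimal illustration in Python with SQLite, not the project's PHP/MySQL implementation; the single flat table `example_view`, its column names and the toy rows are assumptions made for the sake of a self-contained example.

```python
import sqlite3

# Hypothetical flat layout; the real database distributes these fields
# over several MySQL tables connected by lookup tables.
ALLOWED_FIELDS = {"language", "reduplword", "simplex_form",
                  "red_position", "red_function", "productivity"}


def build_query(display_fields, conditions):
    """Return (sql, params) for the chosen display fields and conditions.

    Field names are checked against a whitelist because identifiers
    cannot be bound as parameters; values are always bound with '?'.
    """
    unknown = (set(display_fields) | set(conditions)) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown field(s): {sorted(unknown)}")
    if not display_fields:
        raise ValueError("select at least one field to display")
    sql = f"SELECT {', '.join(display_fields)} FROM example_view"
    if conditions:
        where = " AND ".join(f"{field} = ?" for field in conditions)
        sql += f" WHERE {where}"
    return sql, list(conditions.values())


conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE example_view (
    language TEXT, reduplword TEXT, simplex_form TEXT,
    red_position TEXT, red_function TEXT, productivity TEXT)""")
# Invented toy rows with placeholder language names.
conn.executemany(
    "INSERT INTO example_view VALUES (?, ?, ?, ?, ?, ?)",
    [("LangA", "ta~taki", "taki", "initial", "plural", "productive"),
     ("LangA", "taki~ki", "taki", "final", "plural", "unproductive"),
     ("LangB", "pa~pali", "pali", "initial", "diminutive", "productive")])

# Step 1: fields to display; step 2: conditions; step 3: result.
sql, params = build_query(
    ["language", "reduplword", "productivity"],
    {"red_position": "initial", "red_function": "plural"})
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Keeping the displayed fields separate from the conditions mirrors the two selection pages: a field can be used as a search criterion without being shown in the result, and vice versa.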

In the following section all information fields and searching parameters that appear in the database and thus can be searched and/or combined by the user will be described. We will first clarify the basic terminological notions. A second step will describe the structure of the general and typological information for the single languages. The third step lists and explains the search parameters employed for the single examples.

1. ‘Reduplication’ is used to refer to the morphological operation (see our working definition above, section 2). ‘Simplex form’ is the unreduplicated basis for reduplication (e.g. taki in taki → ta~taki). The ‘reduplicated word form’ is the result of reduplication (e.g. ta~taki). With ‘base’ we refer to the portion of the simplex form which is copied (e.g. ta in taki), while this copied (and possibly changed) material in the reduplicated word form is referred to as ‘reduplicant’ (e.g. ta~ in ta~taki).

2. The Languages¹⁰ are listed together with the SIL Code¹¹ and alternative language names, in order to avoid ambiguity. They are assigned to their genetic affiliation in the field family/group as well as to the geographic area¹² where the language is spoken. Some general data are listed together with the language, i.e. important general phenomena of the language which are not directly related to reduplication, but which might interact with it or which might be important to understand the status of reduplication in this language. A specific field provides basic typological information for every language. An overview of the reduplication types is provided in terms of the forms and functions of reduplication.¹³ The field relationship form – function indicates if the language has only one reduplicative form, which corresponds to one and only one function (one form – one function), if one form can have different functions (one form – various functions), if one function has different forms (various forms – one function) or if different forms correspond to different functions (various forms – various functions). Worded summarizing remarks on the reduplication system are available in the field reduplication system. Comments on diachrony and comments on productivity summarize information on these topics as far as it is available. The occurrence of repetitive constructions other than reduplication, and of recursive operations in addition to reduplication, is indicated, as well as information on stylistic limitations on the use of reduplication in the field stylistic reduplication.

¹⁰ Terms in bold letters refer to the labels of the lookup-fields in the database. The use of italics indicates the possible values.

¹¹ Ethnologue, 15th edition.

¹² According to the use in the WALS-project, see Haspelmath et al. (2005).

¹³ A language can have for example two functional types of reduplication, e.g. for plural and for perfective aspect, but both have various forms. Another language can have two formal types, e.g. full and CV-, one of which, or both, have various functions. In the former case it is sensible to assume two functional types, whereas we find two formal types in the latter. The relevant field at this point only gives a rough overview.

The parameters described below can be searched and combined individually due to the dynamic architecture of the database. For each reduplication type at least one representative example is cited (i.e. the reduplicated word form with its corresponding simplex form, the translation and glossing). Other examples illustrating exactly the same type of reduplication are listed in the field further examples. The reduplication pattern can be full (e.g. Papiamentu ketu~ketu ‘very quiet’, Kouwenberg 2003: 162), partial (e.g. Kwaza dury~ry ‘is rolling’, van der Voort 2003: 75), full with affixation (e.g. Tagalog bahay~bahay-an ‘doll house’, Schachter and Otanes 1972: 100), partial with affixation (e.g. Bikol du~duwa-he ‘exactly two’, Mattes 2007: 105), or an echo-word-formation (e.g. Bikol harap~hasap ‘rough’, Mattes 2007: 56). The example is classified as one of the reduplication types which can be distinguished in the language (see above). As far as possible we provide sentences or phrases (including glosses and translation), in order to demonstrate how the reduplication type is used.

Under the heading “reduplication description” each reduplicant is described in detail as to the position, i.e. the position with regard to the stem (initial, final, internal; it can be undefined in the case of full reduplication) and in terms of direction, i.e. the direction in which the reduplicant is located with respect to the base (left, right, undefined). It is also described with respect to adjacency (the reduplicant is adjacent if attached directly to the base, whereas in other cases it may be separated by the stem/by a part of the stem/by other material, e.g. Woleaian liugiuw ‘expect it’ → liugiu~liug ‘expect, INTR’, Sohn 1975: 130).
Reduplication is furthermore classified for contiguity (as contiguous if a contiguous succession of segments from the base is matched directly with the reduplicant).¹⁴ The number of iterations indicates how often the base is copied. In most cases, the value for this is once, i.e. the same material appears twice in the reduplicated word form (duplication). Only some languages allow or, in some cases, require triplication (e.g. Mokilese roar ‘shudder’ → roar~roar ‘be shuddering’ → roar~roar~roar ‘continue to shudder’, Harrison 1973: 426). Whether there exist languages with quadruplication or even more copies in regular reduplication is still unclear. At this point we do not have such an example.

¹⁴ There are two cases which do not fit in this classification: if a number of non-contiguous segments from the base is selected for the formation of the reduplicant, or if the reduplicant is interrupted by additional material, e.g. by an infix as in Bikol bakal ‘buy’ → b-in-a~bakal ‘is buying-UG’ (Mattes 2007: 94).

The precise form of the reduplicant itself (e.g. full, CV-, -Vr-) is specified as well as the formalization of the reduplicated word form (e.g. C1V1~C1V1C2V2). Reduplication can be exact or non-exact, i.e. it is non-exact if prespecified segments or features in addition to the copied material are included in the reduplicant (e.g. Bikol huru~harong ‘small house’, Mattes 2007: 57) or if certain segments or features of the base are not included in the reduplicant (Bikol nag-ta~trabaho ‘is working’, Mattes 2007: 81). Further important information is on affixation which might trigger reduplication, or which might obligatorily accompany reduplication, and on fixed segmentism, i.e. prespecified segments in the reduplicant which do not occur in the base. The field (supra-)segmental changes indicates if prosodic or segmental changes occur in the reduplicant with respect to the base, as for example the lengthening of the reduplicated vowel.¹⁵ (Morpho-)syntactic behavior refers to the type of construction in which the reduplicated word can occur, must occur, or cannot occur at all. Reduplication on reduplication comments on the situation where two subsequent reduplicative formations would be expected (e.g. diminution plus plural formation in some Nahua dialects, cf. Peralta Ramirez 1991). Often, the reduplicated word form has a specific stress pattern, or the reduplicant triggers a regular stress shift.

The fields under the heading “reduplicating string description” contain information on the (non-)identity of base and reduplicant regarding the relevant unit or domain: the type of base indicates a segmental, prosodic, or morphological categorization of the base. The type of reduplicant indicates a segmental, prosodic or morphological categorization of the reduplicated unit.
In both cases, these units are described more precisely, for example as CVC-, syllable, foot, stem, or affix, in specification of base/reduplicant. In most cases the types of base and reduplicant will be identical, but in several others they are not, and these are especially interesting. For example, in various languages a base syllable with a complex syllable structure is reduced to the core syllable in the reduplicant, e.g. in Bikol nag-ta~trabaho ‘is working’ (see above).

¹⁵ General phonological changes that occur independently of the reduplication but are the result of regular phonological processes are not considered.

The fields of “morphological description” categorize the reduplicated word form and its corresponding simplex form with respect to the word class and/or the lexical category with which they are associated. Furthermore they are specified for all categories which are relevant for the reduplication type of the example: for number, verb number reference (if a verb is marked for number, it will be mentioned whether the number refers to the subject, the object or the action itself); for tense, person, gender, case, voice, transitivity/valency, mood, aspect/aktionsart, and gradation (e.g. comparative, superlative, equational). We try to classify the process type of reduplication as inflectional or derivational. We are aware that this distinction is sometimes rather arbitrary, but still we have included this information as we think it might prove interesting at some point in research. In order to avoid more serious problems, we use intermediate steps like clear inflection and derivation, border case inflection and derivation, and undecided.

The function of reduplication is one of the most basic and important search parameters of the database. It aims at giving a short description of the grammatical purposes a reduplicative type is used for. For reasons of better cross-linguistic comparability we try to reduce the functions to very few categories, such as word class derivation (e.g. Fiji rere ‘fear’ → re~rere ‘fearful’, Schütz 1985), diminution (e.g. Afrikaans vat ‘touch’ → vat~vat ‘touch tentatively’, Botha 1984: 10), pluralization (e.g. Ilokano kailián ‘townmate’ → ka~kailián ‘townmates’, Rubino 2005a: 12), and intensification (e.g. Bikol mahal ‘expensive’ → mahal~mahal ‘very expensive’, Mattes 2007: 57). Finally, the field order of morphological operations gives information about the order in which the word formation process(es) must occur if additional morphological processes take place, e.g. an infixation before reduplication or vice versa.

Reduplication Semantics gives the basic meaning expressed by the reduplication type in case of productive reduplication (diminuity, intensity, plurality, etc.). In case of lexical reduplication¹⁶, the semantic field is indicated (e.g. plant name, animal name, movement, body part, color).
The field relation simplex form – reduplicated word form refers to the degree of semantic deviation from one another, i.e. narrow (e.g. Bikol mahal ‘expensive’ → mahal~mahal ‘very expensive’, see above), loose (e.g. Kwaza ka’ra ‘dry’ → kara~’ra ‘meagre’, van der Voort 2003: 79), or arbitrary (in case of fully lexicalized reduplication). Kind of change mentions elaborations, restrictions or determinations of semantic features of the simplex by reduplication.

¹⁶ The term lexical reduplication refers to lexemes with reduplicative structure which do not (in a synchronically transparent way) correspond to a simplex form, e.g. Arabic thatha ‘young horse’, or dardar ‘elm tree’ (Procházka 1995: 58). Such lexical reduplications are cross-linguistically very widespread and seem to be semantically regular to a great extent.

In most cases it is rather difficult to differentiate between the morphological and semantic description, as the semantics of the category can be understood to be a somewhat more elaborated specification of the morphological categorization itself.

It would be very interesting to have a complete picture of the productivity of reduplication types in all languages, but unfortunately there is very little information. Type of productivity will therefore be filled in only if we have reliable information on the degree of productivity. Evaluation refers to the grammatical status of the reduplication type, i.e. whether it is morphological, lexical, or extragrammatical (e.g. onomatopoetics). Alternative procedures mentions operations that can be used (optionally or conditionally) instead of reduplication to express the same grammatical function (e.g. the intensifying suffix -on in Bikol instead of full reduplication, cf. Mattes 2007: 30). If we know about the integration of loan words, i.e. if loan words undergo the reduplication type like native words or if they are treated differently, we provide this information. The field notification/orthography is filled in only if reduplication is specifically marked in orthography, e.g. in Afrikaans reduplication can be distinguished from repetition by a hyphen. In Indonesian spelling, reduplication can be noted by an empty space, by a hyphen, or by a superscript (for the word meaning ‘children’, cf. Gil 2005: 51). Domain/blocking lists categories of words which can undergo / enhance this type of reduplication (domain), or morphological/phonological environments and/or conditions which block the application of the reduplication type (blocking).

Information about the historical development of reduplication types is also very fragmentary. Nevertheless, it would be very important to test some hypotheses on the origin and the development of reduplication types, patterns and systems and their diachronic changes (cf. Niepokuj 1997, Bybee et al. 1994, and our own critical approach in Hurch and Mattes 2005). Therefore, if possible, we will provide information on the diachrony, namely both the origin and the formal deconstruction (diachronic reduction or decline, e.g. from CV to V, or CVC to CV); the recession or the expansion, i.e. the reduction or enlargement of domains of application of the reduplication type in diachronic view.

Finally, each example representing one reduplication type of the language is linked with its reference. Additional literature can be found in the extensive bibliography on the topic, which is not part of the database proper but separately accessible on the website.

3. The software which serves as the basis for the technical implementation of the reduplication database is an Apache2 Server and a MySQL 4.0.24 Debian database. For presenting the data on the web in an appropriate way, HTML, JavaScript, and PHP are used. HTML, DHTML and JavaScript are responsible for the design and the layout of the data, whereas the PHP, embedded in the HTML, provides the access to the MySQL database. The data itself is stored in the MySQL database and consists of several different tables representing the fields described above and so-called lookup tables, which show the connections and relations between them.¹⁷ This kind of structuring enables the flexibility of the reduplication database, especially that implemented in the query builder.

¹⁷ The complete list can be found in the Appendix below.

4. Outlook

The aim of the database, which is still under construction, is to contribute to a better understanding of the phenomenon of reduplication, its functions, meanings, formal properties, and its distribution. It aims at offering a tool to test hypotheses derived from generalizations in linguistic research both within the reduplication discussion and from outside. Very much of the general literature has been dedicated to typological aspects and proposes implicational statements about, for example, the presence of full reduplication in systems with partial reduplication, or the presence of augmentative reduplication in languages with diminutive reduplication. The theoretically oriented literature on reduplication specifically focuses on the functioning of formal tools and descriptive devices. A consequence of this circumstance is that this type of literature, which has on the one hand contributed a lot to the understanding of the phenomenon in the past decades, has, on the other hand, only rarely given an overview of different types of reduplication within one single language.

The database is structured in such a way as to allow qualitative and quantitative search options in the dynamic query builder, but it also allows a language-specific in-depth data search via the search tree. The Graz Database on Reduplication is designed so as to allow future participation in a network of typologically oriented databases in order to improve the possibilities of searching for external generalizations. It is desirable to be able to evaluate whether properties such as morphological invariance, the presence or absence of reduplication, or of specific reduplication types with specific functions, etc. do show a systematic or statistically significant relation to the presence or absence of other structural types, of other categories etc., and whether such connections can also be established across the different components of grammar. Finally, by the end of the project the database will be the most comprehensive collection of data on reduplication and of the relevant literature.

Appendix¹⁸

The most important table is the language table. When inserting a new language into the database, this data must be inserted first. The table consists of the following columns, which contain the general information about the new language:

idlang              unique key of the language
lang                name of the language
alternative names   other names of the language
langfamily_group    genetic classification
area                geographic area
general_data        general data on the language
typological_inform  general typological information
denominredtypes     generalization of the formal and/or functional properties of the reduplication system
redtype_function    relationship between the forms and the functions of reduplication
specific_red        summarizing remarks on the reduplication system
comdiachrony        summarizing remarks on the diachrony of the reduplication system
comproductivity     summarizing remarks on the productivity of the reduplication system
repetitive_oper     repetitive (syntactic) operations in the language
recursive_oper      recursive (morphological) operations in the language
stylistic_oper      stylistic limitations or variations in the reduplication system
comments            any additional information
show_lang           switch indicating if a language is online or offline

¹⁸ Many thanks to Angela Fessl for the compilation of the table.
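The four possible values of the form – function relationship (stored, according to the table above, in redtype_function) follow from counting the distinct forms and functions among the reduplication types of a language. The Python sketch below spells out that counting rule. It is an assumption about how the value could be derived mechanically; in the database the value is assigned by the analysts, and the function name `form_function_relationship` is hypothetical.

```python
def form_function_relationship(types):
    """Classify a language from its reduplication types.

    `types` is a list of (form, function) pairs, one pair per type.
    The result is one of the four values named in the description of
    the field; borderline decisions made by analysts are not modelled.
    """
    if not types:
        raise ValueError("the language has no reduplication types")
    forms = {form for form, _ in types}
    functions = {function for _, function in types}
    form_part = "one form" if len(forms) == 1 else "various forms"
    function_part = ("one function" if len(functions) == 1
                     else "various functions")
    return f"{form_part} – {function_part}"


# CV-plural and CV-diminutive are two distinct types sharing one form.
print(form_function_relationship([("CV-", "plural"), ("CV-", "diminutive")]))
# Two forms serving the same function.
print(form_function_relationship([("full", "plural"), ("CV-", "plural")]))
```

As footnote 13 of the chapter notes, the field only gives a rough overview, so a mechanical rule of this kind would at best produce a first approximation.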

The second important table is called example. It contains all tuples of examples of all languages. Each example consists of the following columns:

idexample                 unique key
reduplword                word form which results from reduplication
redupl_word_translation   translation of the reduplicated word form into English (or other language(s) according to its source)
simplex_form              word form which serves as the basis of reduplication
simplex_form_translation  translation of the simplex form into English (or other language(s) according to its source)
redpattern                pattern of reduplication
redtype                   grammatical purpose of reduplication
gloss_red_word            morpheme-by-morpheme correspondence of the reduplicated word form
gloss_simplex_form        morpheme-by-morpheme correspondence of the simplex form
comments                  any additional information
informant                 bibliographic references of example and/or linguist and/or native speakers who provided the example
compiler                  person who compiled the data

The lookup table lookup_lang_example connects the language table with the example table. It consists of only three columns:

id

unique key

idexample

unique key of the example

idlang

unique key of the language
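How the three tables fit together can be illustrated with a minimal sketch in Python, using the standard sqlite3 module. This is an illustration under stated assumptions, not the project's actual implementation: the database engine, the column types, the encoding of show_lang and the sample rows (language name and word forms) are hypothetical placeholders. Only the table and column names are taken from the description above, and most columns are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE language (
    idlang    INTEGER PRIMARY KEY,  -- unique key of the language
    lang      TEXT NOT NULL,        -- name of the language
    show_lang INTEGER DEFAULT 0     -- 1 = online, 0 = offline (encoding assumed)
);
CREATE TABLE example (
    idexample    INTEGER PRIMARY KEY,  -- unique key
    reduplword   TEXT,                 -- reduplicated word form
    simplex_form TEXT                  -- basis of reduplication
);
CREATE TABLE lookup_lang_example (
    id        INTEGER PRIMARY KEY,
    idexample INTEGER REFERENCES example(idexample),
    idlang    INTEGER REFERENCES language(idlang)
);
""")

# The language row is inserted first, so that examples can be linked to it.
conn.execute("INSERT INTO language (idlang, lang, show_lang) VALUES (1, 'LangA', 1)")
# Invented placeholder forms, not real linguistic data.
conn.executemany(
    "INSERT INTO example (idexample, reduplword, simplex_form) VALUES (?, ?, ?)",
    [(1, "tata", "ta"), (2, "kiki", "ki")],
)
conn.executemany(
    "INSERT INTO lookup_lang_example (idexample, idlang) VALUES (?, ?)",
    [(1, 1), (2, 1)],
)


def examples_for_language(connection, language_name):
    """Return (reduplword, simplex_form) pairs linked to one language."""
    return connection.execute(
        """
        SELECT e.reduplword, e.simplex_form
        FROM language AS l
        JOIN lookup_lang_example AS k ON k.idlang = l.idlang
        JOIN example AS e ON e.idexample = k.idexample
        WHERE l.lang = ?
        ORDER BY e.idexample
        """,
        (language_name,),
    ).fetchall()
```

Calling examples_for_language(conn, "LangA") returns the two placeholder pairs. A language with no linked examples yields an empty list.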


The following tables give more detailed and additional information about the examples in the example table; they are linked to the examples through their corresponding lookup tables. The query builder and the tree of the database are built up from these tables, so their content determines the possibilities and restrictions of the search in both representations. This is easiest to see when opening the query builder. The first page shows all possible fields, which are stored as columns in the corresponding tables. The condition page shows the available values stored in the database, from which the SQL statement is then created automatically. The third page displays all examples that fulfill the conditions set on the second page, but presents only the values of the fields selected on the first page.

The bibliography table consists of the following columns, whose names are self-explanatory: idbibl, type, author1_lastname, author1_firstname, author2_lastname, author2_firstname, author3_lastname, author3_firstname, other_authors, title, editors, edition, booktitle, series, year, journal, volume, number, pages, relevantpages, howpublished, publisher, institution, address, conferencename, conferencelocation, keywords, abstract, link.

The diachronic_description table columns are:

iddiachrony

unique key

origin

original or previous form and/or function of reduplication

deconstruction

description of formal reduction or declination of a reduplicated word form

recession

description of reduction of domains (recession) and/or enlargement of domains (expansion) to which reduplication can be applied

development

diachronic development of reduplication

comments

any additional information
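The three-page query builder described earlier in this appendix (field selection, conditions, results) can be sketched as a small function that assembles a parameterised SQL statement. The function name build_query, the restriction to the example table, equality-only conditions and the sample value 'plural' are assumptions made for illustration; the source does not document how the actual statement is generated.

```python
# Columns of the example table that may be selected or filtered on
# (a subset of the columns listed in this appendix).
EXAMPLE_COLUMNS = {
    "idexample", "reduplword", "simplex_form",
    "redpattern", "redtype", "compiler",
}


def build_query(selected_fields, conditions):
    """Return (sql, params) for a SELECT over the example table.

    selected_fields: column names chosen on the first page.
    conditions: mapping of column name to required value (second page).
    """
    if not selected_fields:
        raise ValueError("select at least one field")
    # Column names cannot be bound as parameters, so they are checked
    # against a fixed whitelist before being placed in the statement.
    for name in list(selected_fields) + list(conditions):
        if name not in EXAMPLE_COLUMNS:
            raise ValueError(f"unknown column: {name}")

    sql = "SELECT " + ", ".join(selected_fields) + " FROM example"
    params = []
    if conditions:
        clauses = []
        for column, value in conditions.items():
            clauses.append(f"{column} = ?")
            params.append(value)  # values are bound, never interpolated
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

For instance, build_query(["reduplword", "simplex_form"], {"redtype": "plural"}) yields the statement "SELECT reduplword, simplex_form FROM example WHERE redtype = ?" together with the parameter list ["plural"]. Only the selected fields appear in the result, as on the third page of the query builder.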

The morpholreduplicant table has the following columns:

idmorphol

unique key

wordclass

word class of the reduplicated word form

subspecifications_wordclass

subcategorization of the word class of the reduplicated word form

lexcategory

lexical category of the reduplicated word form if it cannot be described in terms of word classes

number

number expressed by the reduplicated word form

verb_number_reference

reference of verb number of the reduplicated word form

tense

tense expressed by the reduplicated word form

person

person expressed by the reduplicated word form

gender

gender expressed by the reduplicated word form

wordcase

case expressed by the reduplicated word form

voice

voice expressed by the reduplicated word form

transitivity_valency

transitivity and/or valence expressed by the reduplicated word form

mood

mood expressed by the reduplicated word form

aspect_aktionart

aspect and/or aktionsart expressed by the reduplicated word form

gradation

degree of property or quality of an entity expressed by the reduplicated word form

processtype

categorization of reduplication in terms of morphological operation

process_clarification

grammatical purpose of reduplication

order_morphol

order of reduplication and other word formation processes

comments

any additional information

The morpholsimplexform table has the following columns:

idmorphol

unique key

wordclass

word class of the simplex form

subspecifications_wordclass

subcategorization of the word class of the simplex form


lexcategory

lexical category of a simplex form if it cannot be described in terms of word classes

number

number expressed by the simplex form

verb_number_reference

reference of verb number of the simplex form

tense

tense expressed by the simplex form

person

person expressed by the simplex form

gender

gender expressed by the simplex form

wordcase

case expressed by the simplex form

voice

voice expressed by the simplex form

transitivity_valency

transitivity and/or valence expressed by the simplex form

mood

mood expressed by the simplex form

aspect_aktionart

aspect and/or aktionsart expressed by the simplex form

gradation

degree of property or quality of an entity expressed by the simplex form

comments

any additional information

The table productivity holds the following fields:

idproduct

unique key

productivity_type

type of productivity of reduplication

evaluation

grammatical status of reduplication

alternprocedures

procedures which can fulfill the same function as reduplication

loanwords

application of reduplication to loan words

notation

special notification for reduplication

blockingred

word category which can undergo reduplication (domain) and phonological and/or morphological conditions which block the application of reduplication (blocking)

comments

any additional information

The reduplication_description table is made up of the following fields:

idredupl

unique key

position

position of the reduplicant with respect to its base

direction

direction in which the reduplicant is copied with respect to its base

adjacency

relationship between base and reduplicant with respect to their adjacency

contiguity

relationship between base segments and reduplicant segments with respect to their contiguity

number_of_iterations

number of times a base appears

form_of_redupl

form of the reduplicant

formula

formal description of the reduplicated word form

exact_vs_nonexact

formal relationship between the simplex form and the reduplicated word form

affixation

affixation which triggers or obligatorily occurs with reduplication

fixed_segmentism

prespecified segments in the reduplicated word form

suprasegmental

any (supra-)segmental changes which occur

syntactic_behaviour

(morpho-) syntactic constructions in which the reduplicated word form can/must/cannot occur

red_on_red

information whether the reduplicated word form can be the input for other reduplicative operations

stresspattern

specific stress pattern of the reduplicated word form

comments

any additional information

The reduplstring table consists of the following columns:

idstring

unique key

stringtype_base

segmental, prosodic or morphological categorization of the base

stringtype_redupl

segmental, prosodic or morphological categorization of the reduplicant

specification_base

detailed description of the base

specification_redupl

detailed description of the reduplicant

comments

any additional information


The last table is the semantics table:

idsemantics

unique key

reduplsemantics

basic meaning expressed by reduplication (in cases of morphological reduplication)

subsemantics

semantic field of reduplication (in cases of lexical reduplication)

categorsimplex

semantic classification of the simplex form

simplex_redupl

relationship between the simplex form and the reduplicated word form with respect to the degree of semantic deviation

changekind

any changes (strengthening or weakening) of inherent features of the simplex form by means of reduplication

prototypicality

relationship between the simplex form and the reduplicated word form with respect to their prototypicality

scope

relationship between the simplex form and the reduplicated word form with respect to their semantic scope

referentiality

(not used)

comments

any additional information

All lookup tables such as lookup_example_bibl, lookup_example_redupl, lookup_example_diachrony, lookup_example_product, lookup_example_semantics and lookup_string_example represent the connection between the example table and a second table, which is indicated by the remaining part of the table name. Likewise, the tables lookup_reduplicant_morphology and lookup_simplexform_morphology link the example table to the appropriate morphology table. All these tables consist of only three columns:

id

unique key for the current lookup table

id1

unique key for the first table

id2

unique key for the second table
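The generic three-column lookup tables can be illustrated with lookup_example_semantics. The following is a minimal sketch using Python's sqlite3 module. It assumes that id1 holds the key of the example table and id2 the key of the semantics table; the word forms and the values 'plurality' and 'intensity' are invented placeholders, and all other columns are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE example (idexample INTEGER PRIMARY KEY, reduplword TEXT);
CREATE TABLE semantics (idsemantics INTEGER PRIMARY KEY, reduplsemantics TEXT);
CREATE TABLE lookup_example_semantics (
    id  INTEGER PRIMARY KEY,  -- unique key for the current lookup table
    id1 INTEGER,              -- key of the first table (assumed: example)
    id2 INTEGER               -- key of the second table (assumed: semantics)
);
INSERT INTO example VALUES (1, 'tata'), (2, 'kiki');
INSERT INTO semantics VALUES (10, 'plurality'), (20, 'intensity');
-- One example may be linked to several semantics rows and vice versa.
INSERT INTO lookup_example_semantics (id1, id2) VALUES (1, 10), (1, 20), (2, 20);
""")


def words_with_meaning(connection, meaning):
    """Return reduplicated word forms linked to a given basic meaning."""
    rows = connection.execute(
        """
        SELECT e.reduplword
        FROM example AS e
        JOIN lookup_example_semantics AS k ON k.id1 = e.idexample
        JOIN semantics AS s ON s.idsemantics = k.id2
        WHERE s.reduplsemantics = ?
        ORDER BY e.idexample
        """,
        (meaning,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the link lives in its own table, the same join pattern would apply to any of the other lookup tables; only the table and key names change.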


The Romani Morpho-Syntax (RMS) database

Yaron Matras, Christopher White and Viktor Elšík

1. Background and aims

Despite having now become the largest minority language in the European Union – with upwards of 3.5 million speakers dispersed mainly in central and southeastern Europe – Romani is still considered one of the continent’s lesser-known languages. Yet interest in the language is prompted by its very special position in a number of areas: its history – Romani is the only modern Indo-Aryan language that has been spoken exclusively in Europe since the Middle Ages; its geography – Romani is exceptional in not covering a coherent territory, but rather being dispersed in ‘diaspora’ communities, often characterised by repeated migrations; its structural-typological characteristics – Romani dialects have absorbed structural influences from a variety of different languages, and in the absence of a unifying standard, have developed in diverse directions; and its socio-political status – with growing European integration, efforts are underway to take into consideration the special needs of the Romani people at various levels, and this includes expanding the usage domains of the Romani language. In all these areas, a comparative approach to the diverse dialects of Romani is essential: in the absence of written documentation on earlier stages of the language, reconstruction relies on a comparative study of the dialects. The comparative sample of Romani dialects provides an opportunity to observe regularities of structural change, including contact-induced change (see Elšík and Matras 2006). Applied questions of language codification, standardisation, and the mutual comprehensibility of Romani dialects are also best addressed by comparing lexical and grammatical structures. These considerations were behind the creation of a central corpus of Romani dialects that would facilitate structural comparison among them.
Work on the RMS (Romani Morpho-Syntax) database began in 1998, with the intention of creating an electronic resource that would store both linguistic data and ‘metadata’ in the form of answers to analytical questions, and so would allow queries on entire sets of data. Organised in a format resembling a grammatical description, and aiming to cover all aspects of structural variation among the dialects, RMS is quite possibly the only existing comprehensive comparative grammar in electronic form. It is also one of the larger projects of its kind. Its development has been supported by grants from the Arts and Humanities Research Board (AHRB), the Economic and Social Research Council (ESRC), and the Open Society Institute (OSI), with a total accumulated budget of around £565,000 (€840,000). In various phases, the project has so far employed three co-workers – a Research Associate, a Programmer, and an Archive Manager – on a full-time basis, around a dozen part-time research and technical assistants, and around 50 part-time fieldwork assistants working in altogether 20 different countries. The project’s data archive now contains some 300 original recordings, as well as data extracted from numerous published sources (grammatical descriptions and texts). An earlier form of the database has been accessible online to a small circle of researchers specialising in Romani via a special server since 2001. It has served as a data basis for several monograph-length comparative investigations of Romani, including Matras (2002), Boretzky and Igla (2004), and Elšík and Matras (2006), and is currently providing a data management frame for several ongoing PhD dissertations in Romani linguistics, at several different institutions. At the time of writing, the database is undergoing a technical transformation to a new application with a web interface, which will gradually become publicly accessible via the project’s website: http://romani.humanities.manchester.ac.uk/.

In the present contribution, we outline the aims, scope and content structure of the database, data collection strategies, the different phases in the technical development of the resource, the query structure, and future prospects. Other brief descriptions of RMS can be found in some of our earlier work – Matras (2004a: 281–285) and Elšík and Matras (2006: 55–64) – as well as on the project website.

2. The linguistic investigation of Romani

Proto-Romani – the term given to development phases of the language in its pre-European period – appears to have originated in the Central areas of India, during the early transition period from Old to Middle Indo-Aryan (300 BC–500 AD). As pointed out already by Turner (1926), Romani shares ancient innovations from this period with other Central languages of India, such as Hindi/Urdu and Punjabi, whereas developments from a later date – the transition period to Early New Indo-Aryan (ca. 500 AD–800 AD) – are shared with the languages of the Northwest, such as Kashmiri (see Matras 2002, ch. 3). These include on the one hand archaisms, which were retained in the Northwest, but not in the Central languages (such as the presence of certain consonant clusters, e.g. tr- in trin ‘three’); general innovations that encompassed the entire Indo-Aryan speaking region (such as the reduction of nominal case and inflected past tense of the verb); as well as innovations that are limited to the Northwest (such as the development of a new person concord system in the past tense). This evidence points to an early migration history within India, even before the language left the subcontinent. Later phases in Proto-Romani are characterised by unique innovations, while in some domains the language maintains Middle Indo-Aryan archaisms: e.g. the persistence of a consonantal present-tense conjugation and consonantal forms of nominal case-endings. Already Pott (1844–1845) drew attention to the layers of Iranian, Armenian and Greek loanwords, which characterise later phases of Proto-Romani (outside of India) and which arguably constitute evidence of prolonged contacts with the respective western Asian populations. The immense lexical and grammatical impact of medieval Greek, first highlighted by Miklosich (1872–1880), is now accepted as the beginning of a new stage in the language – called Early Romani – which was characterised by the structural-typological Europeanisation, or specifically Balkanisation (Matras 1994) of Romani. Early Romani is regarded as the precursor of the modern dialects of Romani, which emerged gradually following the dispersion of Romani-speaking populations across Europe in the period paralleling the decline of the Byzantine Empire, from around 1350 onwards.
The earliest written attestations of Romani from around 1542 (Britain), 1570–1597 (Germany and France), and 1668 (Thrace), and numerous sources from the early 1700s, already represent the kind of dialectal variation found in Romani today, while the geographical distribution patterns of structural variants seem to point largely to developments in situ, rather than ‘genetic’ inheritance (although this point remains controversial in Romani linguistics). From this one might conclude that the bulk of developments separating the dialects occurred during the period of settlement (which followed the period of migration), in the 16th century (see Matras 2005). Miklosich’s dialectological work on Romani divided the dialects based on a similar assumption, according to the peoples amongst whom the Roma had settled. This tradition was broken by Gilliat-Smith (1915), who described the geographical overlap of distinct dialect groups in northern Bulgaria, highlighting the need to take successive migrations and continuous networking among historically related groups into account. During most of the 20th century, classification work in Romani dialectology relied on loose impressions of structural similarities, recognising geographically proximate groups on the one hand, as well as isolated, out-migrant offshoots of those groups on the other.

Recently, with the availability of a larger dataset and some intense advances in the geographical plotting of linguistic features, a debate between two interpretations has occupied the centre stage in Romani linguistics: the first attributes regional differences to the diffusion of innovations in geographical space following settlement (Matras 2002, 2005), the second attributes them to older – so-called ‘genetic’ – differences that existed prior to settlement, and that were brought to their current locations by groups or tribes speaking distinct dialects (Bakker 1999, Boretzky 1999a and 1999b, Boretzky and Igla 2004). In evaluating the evidence, the role of identifying archaisms vs. innovations is of course crucial. In the absence of historical documentation, the procedure for establishing the position of a feature relies largely on a comparative interpretation of the datasets. Other issues arising in recent years from the analysis of Romani include the potential for typological drift and change in a language that is strictly oral and enjoys little institutional support and so no regulation either; and the extent and quality of contact influence in a language whose adult speakers are all bilingual, and which is a marginalised and often oppressed language, limited to basilectal functions. Its dialects being in contact, under comparable sociolinguistic conditions, with dozens of languages as far apart as Basque, Welsh, Finnish, Croatian, Hungarian, and Turkish, Romani provides an excellent sample with which to study the lexical and structural effects of language contact.

3. The RMS agenda and implementation strategy

Aiming to provide a tool to facilitate research into such areas, the RMS database was created with the following domains of analysis in mind:

1) Historical, aiming to compare dialect-specific innovations, and so to cover a dimension that is specific to Romani, focusing on developments of form to form, and from form to function;

2) Typological, aiming to examine the structural representation of functions across a sample of dialects, thus covering relations between function and form, and among clusters of functions;

3) Contact-theoretical, aiming to examine contact influences, and so, for this purpose, tagging structures by etymology and etymological layers (representing ‘depth’ of borrowing);

4) Dialectological, aiming to examine the link between innovations and their geographical distribution in what is considered to be a non-territorial (insular) language, thereby critically addressing the notion of a ‘genetic’ classification of dialects.

The initial phase of the database construction aimed therefore at covering in maximum depth questions of variation among the dialects that could inform the above domains of investigation. This involved, in the first phase, in-depth comparative research into the dialects, drawing on all available published descriptions (using, in practice, around 40 monograph-length and some 20–30 article-length publications on individual dialects). For each grammatical domain, lists of variants were plotted, giving a general inventory of possible forms. These would constitute the backbone of the form slots, eliciting the formal representation of morphemes. The tool used in this initial phase was FileMakerPro, a user-friendly database application; this tool was subsequently abandoned, however. We return to the technical side of the database construction below.

The compilation of variants from the literature led at the same time to a comparative analysis, and a historical analysis, of the emergence of certain categories, leading in turn to the plotting of form-to-form fields – those representing the shape, for a particular dialect, of an inherited form – as well as the form-to-function fields – those representing the dialect-specific function of an inherited form. Thus, a slot was devoted to the hypothesised Early Romani indefinite form *khajek, asking a) whether it is continued in the dialect (i.e. presence of the form), b) its shape in the dialect, e.g. kajek or possibly kek (i.e. form to form), and c) its function in the dialect, e.g. general determiner ‘some-’ or ‘no-’, or person indefinite ‘somebody’ or ‘nobody’ (i.e. form to function). Consider in more detail an example of the form-to-function perspective.
Romani dialects inherit two forms of the present stem: a short form, in which the final morpheme indicates person concord (1SG -av etc.), and a long form, where the suffix -a attaches to the person concord morpheme (1SG -av-a etc.). It appears that the long form served as a present-future in Early Romani, while the short form was the subjunctive. The dialects continue both forms, but alter their functions, often in connection with the introduction of an analytical future category. Figure 1 shows the distribution in some dialects. Noteworthy is the geographical distribution of the developments: In the Balkans (Sepečides, Rumelian Romani, Kosovo Bugurdži, Florina Arli), the long forms are confined to the present indicative, and the future is expressed by a future particle (followed by the subjunctive). In central Europe (Lovari, Rumungro, and Roman), the short forms take over also a present indicative meaning, while the long form specialises for future. Serbian Kalderaš shows contamination of the central European pattern with the Balkan pattern. The original state of affairs is preserved in the western, German-French and Scandinavian dialects. Elsewhere, combinations are found: an ongoing shift in the expression of the present indicative from long to short forms, combined with a loss of the future meaning of the long forms only through the introduction of an analytic future in Russian Romani.

DIALECT            SHORT FORM            LONG FORM        FUTURE PARTICLE
Sinti, Manuš       subjunctive           present-future   –
Finnish R          subjunctive           present-future   –
Latvian R          present-subjunctive   present-future   –
Welsh R            present-subjunctive   present-future   –
Rumungro           present-subjunctive   future           –
Roman              present-subjunctive   future           –
Lovari             present-subjunctive   future           –
Serbian Kalderaš   present-subjunctive   future           ka
Sepečides          subjunctive           present          ka
Rumelian R         subjunctive           present          ka(m)
Kosovo Bugurdži    subjunctive           present          ka(m)
Florina Arli       subjunctive           present          ka
Russian R          present-subjunctive   present          l-

Figure 1. Inherited present-stem forms and their TAM function in some dialects
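The distribution in Figure 1 lends itself to a simple record structure that can be filtered by any combination of fields. The sketch below, in Python, is only an illustration of that idea: the class name PresentStemRecord, the helper dialects_where and the use of None for a dash are assumptions, and the code does not reflect the RMS implementation. The values themselves are those of Figure 1.

```python
from typing import NamedTuple, Optional


class PresentStemRecord(NamedTuple):
    dialect: str
    short_form: str
    long_form: str
    future_particle: Optional[str]  # None where Figure 1 shows a dash


FIGURE_1 = [
    PresentStemRecord("Sinti, Manuš", "subjunctive", "present-future", None),
    PresentStemRecord("Finnish R", "subjunctive", "present-future", None),
    PresentStemRecord("Latvian R", "present-subjunctive", "present-future", None),
    PresentStemRecord("Welsh R", "present-subjunctive", "present-future", None),
    PresentStemRecord("Rumungro", "present-subjunctive", "future", None),
    PresentStemRecord("Roman", "present-subjunctive", "future", None),
    PresentStemRecord("Lovari", "present-subjunctive", "future", None),
    PresentStemRecord("Serbian Kalderaš", "present-subjunctive", "future", "ka"),
    PresentStemRecord("Sepečides", "subjunctive", "present", "ka"),
    PresentStemRecord("Rumelian R", "subjunctive", "present", "ka(m)"),
    PresentStemRecord("Kosovo Bugurdži", "subjunctive", "present", "ka(m)"),
    PresentStemRecord("Florina Arli", "subjunctive", "present", "ka"),
    PresentStemRecord("Russian R", "present-subjunctive", "present", "l-"),
]


def dialects_where(records, **criteria):
    """Return the dialects whose record matches every field=value pair."""
    return [
        record.dialect
        for record in records
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]
```

With these records, dialects_where(FIGURE_1, long_form="future", future_particle=None) picks out the central European pattern (Rumungro, Roman, Lovari), while adding Serbian Kalderaš requires dropping the condition on the particle.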

The database organisation in the original FileMakerPro format captures such data by allocating fields for ‘Tense and mode marking’ within a layout devoted to ‘Verb inflection’, asking for the function of each of the anticipated Present-tense forms (short form, long form), and continuing to elicit the strategies used to mark the Future tense. Each field carries a value list, comprising all variants that have been collected during the pilot study, and so all anticipated variants. The list is open, and new forms can be added to it, if encountered in the data. Thus a query can select any of the attested forms and search for particular data, or else simply look up the data that has been entered into the relevant field for a respective dialect record or set of records (see Figure 2). The organisation of questions and content of the data fields displayed in Figure 2 are typical of the form-to-function approach.

The second approach is the function-to-form procedure. Here, state-of-the-art typological descriptions and questionnaires (e.g. those emerging from the EUROTYP project, and other recent typological investigations) were taken into account in order to plot representation grids for the respective functions. One example is the continuum of semantic integration of complement clauses (cf. Matras 2004a). This is captured, following typological work on complementation such as that by Wierzbicka (1988), Givón (1990), Frajzyngier (1991), Frajzyngier and Jasperson (1991), and Dixon (1995), by a range of main clause predicates representing tighter and less tight event integration (such as can, want, begin, try, fear etc.), as well as the contrast between modality (can, begin, etc.) and epistemic complementation (see, know, hear etc.), and between identical subject and different-subject constructions (so-called manipulative predicates such as demand, ask, etc.).

Figure 2. Database excerpt ‘Tense marking’ in FileMakerPro 6 format

For each predicate, three value lists appear. The first contains a statement about the presence or absence of a complementiser conjoining the two clauses. The value options are ‘none’, or a choice of a complementiser type. This latter value is a Romani-specific form. Modal complements tend to take a non-factual complementiser of the type TE (realised in the individual dialects as te, tə or ti). Epistemic complements tend to take a complementiser of the type KAJ (usually realised as kaj), though this latter is often substituted by a borrowed particle. The next field identifies the origin of the complementiser, the value options being ‘non-applicable’ (in case a complementiser is absent), ‘inherited’, or a choice between several layers of borrowing: those from an Old contact language (no longer spoken in the community), a Recent contact language (still spoken by the older generation), or a Current contact language (spoken regularly by all members of the community) [1]. The final field characterises the inflection of the complement verb. The value options are ‘finite’ and ‘non-finite’. Clause combining in Romani is overwhelmingly finite. However, in modal complements with identical subject constructions (‘infinitive clauses’), some (mainly central European) dialects tend to generalise one of the person-inflected forms, thereby abandoning subject agreement, and introducing instead a kind of ‘infinitive’, based historically on one of the finite forms. The final field is a data field, into which an example is inserted. Figures 3–4 show an example of entries for the Yerli dialect as spoken in Velingrad, Bulgaria (acquired for the database through direct elicitation). With the modal verb want, the complementiser is tə, historically *te, and so TE is the type selected from the value list. The etymology field indicates that it is inherited (and so part of the pre-European component).
The complement verb is finite, showing person agreement with the subject of the matrix clause, and the absence of the present/future suffix -a marks it out for the subjunctive: dža-v ‘go-1SG’; cf. the matrix verb mang-av-a ‘want-1SG-PRES’. For the verb see we find a different state of affairs. The complementiser či is borrowed, and so the concrete form is entered. The etymology field indicates a borrowing from the current contact language, which for this dialect is Bulgarian. The question of the finiteness of the verb is redundant in epistemic constructions, where no Romani dialect uses non-finite forms, and therefore it does not appear in the entry.

[1] The distinction, introduced in Matras (1998), is intended to capture the layered history of contact influences, which is relevant both to Romani communities with a history of migrations, and to those whose external prestige language changed as a result of historical circumstances (e.g. the shift from Ottoman Turkish to Bulgarian and Greek in the Balkans, or from Hungarian to German in some territories of the former Austro-Hungarian Empire).

To summarise, then, the initial database sketch consisted of an outline of likely variation and inventories of possible variants in the shape of forms, the semantic functions and the distribution of inherited forms, and the structural representation of semantic functions, including both the composition and etymology of the participating forms. These are displayed through two types of fields: those presenting actual linguistic data for exemplification, and those presenting questions about the data (e.g. “is a definite article retained?”, “what is the function of short forms of the present tense?”). The primary purpose of the database is to allow the user to query the data by looking up the contents of any individual field or combination of fields, for any dialect or combination of dialects.
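The complementation fields described above (complementiser, origin, finiteness) and the two Yerli entries can be modelled as a small validated record. This is a hedged sketch in Python, not the FileMakerPro layout itself: the class name ComplementEntry, the short origin labels, the use of None for an unrecorded finiteness value and the consistency check between ‘none’ and ‘non-applicable’ are assumptions derived from the prose. The example data field is left out because the full elicited sentences are not quoted in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Origin values: absent complementiser, inherited, or one of three
# borrowing layers (Old, Recent, Current contact language).
ORIGINS = {"non-applicable", "inherited", "old", "recent", "current"}


@dataclass(frozen=True)
class ComplementEntry:
    dialect: str
    predicate: str
    complementiser: str  # 'none', a type such as 'TE' or 'KAJ', or a borrowed form
    origin: str
    finiteness: Optional[str] = None  # None where the field is not recorded

    def __post_init__(self):
        if self.origin not in ORIGINS:
            raise ValueError(f"unknown origin: {self.origin}")
        # 'non-applicable' is reserved for entries without a complementiser.
        if (self.complementiser == "none") != (self.origin == "non-applicable"):
            raise ValueError("origin and complementiser are inconsistent")
        if self.finiteness not in (None, "finite", "non-finite"):
            raise ValueError(f"unknown finiteness: {self.finiteness}")


# The two entries described in the text for Yerli (Velingrad, Bulgaria).
ENTRIES = [
    ComplementEntry("Yerli (Velingrad)", "want", "TE", "inherited", "finite"),
    ComplementEntry("Yerli (Velingrad)", "see", "či", "current"),
]


def lookup(entries, **criteria):
    """Return entries matching every field=value pair (any field combination)."""
    return [
        entry
        for entry in entries
        if all(getattr(entry, field) == value for field, value in criteria.items())
    ]
```

A call such as lookup(ENTRIES, origin="current") returns the entry for see, mirroring a query on a single field; passing several keyword arguments corresponds to querying a combination of fields.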

Figure 3. Extract in FileMakerPro 6 format on ‘Complementation’ /‘Modality’


Figure 4. Extract in FileMakerPro 6 format on ‘Complementation’ /‘Epistemic’

From the user’s viewpoint, RMS is structured in the form of a standard grammatical description, with distinct chapters devoted to functional domains of structure (see Figure 5). Each record in the database represents what is referred to as a Sample, which is equivalent to a unique source on the language. The initial batch of sources that were taken into account when first plotting the database fields were published descriptions of Romani dialects. The ‘source’ in these cases is the author, drawing on a corpus of material from a particular community, which quite often contains data elicited from a variety of speakers (though the type of grammatical sketch that is based exclusively or almost exclusively on one speaker is not rare in Romani dialectology). In the second phase, a questionnaire was constructed, covering all main areas of variation, and most data now contained in the database and RMS archive are the product of systematic fieldwork carried out throughout eastern, central, and southeastern Europe. Here, a Sample corresponds to a speaker as a source of data. Several speakers (as well as, where relevant, printed sources) may be grouped together to represent one Dialect. The degree of uniformity among unique samples representing one single Dialect thus becomes in itself subject to investigation, and indeed part of the future agenda and prospect of further development of the database tools (see below). The questionnaire was composed through a careful consideration of all data fields, and inspired by the need to elicit data to be able to fill them. It is thus tailored to the database structure, which itself is the product of a prolonged investigation into variation and structural composition in Romani.


Like the database, the questionnaire addresses form-to-form, form-to-function, and function-to-form questions. All issues are built into either a set of some 850 short sentences, which constitute the bulk of the questionnaire, or a wordlist or a list of verbs to be inflected. The elicitation technique exploits the fact that all Roma are bilingual, and uses the majority language to elicit translations from the speakers of words, verb conjugations, and phrases. For this purpose, the questionnaire in its first version from 2001 has so far been translated into some 14 different languages. Much of the fieldwork has been carried out by graduate students specialising in Romani linguistics, and for this purpose networking workshops were set up, bringing together students from different institutions and different countries to discuss fieldwork methodology, transcription conventions, and so on. Additional fieldwork assistants were recruited among students of Romani background, who were equally invited to participate in training and instruction workshops.

– General profile of the source
– Interrogatives and Relatives
– Modals
– Noun inflection
– Indefinites
– Noun derivation
– Article inflection
– Adjective inflection
– Lexicon
– Adjective derivation
– Lexicophonetic features
– Adjectives
– Phonology
– Numerals
– Verb inflection
– Embeddings and relative clauses
– Personal/reflexive pronouns
– Verb derivation
– Adverbial clauses
– Verb adaptation
– Word order
– Demonstratives
– Copula inflection
– Utterance modifiers
– Prepositions
– Case Representation
– Local relations
– Temporal relations
– Complementation

Figure 5. Chapters in the RMS database

The advantages of the questionnaire are obvious: It allows systematic coverage of structures in a way that cannot otherwise be guaranteed, and it makes data available for direct comparison between the dialects. For this, fieldworkers follow a uniform procedure. All interviews with speakers – of average duration of some four hours – are recorded, and the recordings digitised and archived, normally both as complete files, as well as cut into individual phrases. The informant’s answers are transcribed (using Unicode fonts) onto a spreadsheet. Each phrase in the original questionnaire is tagged for the grammatical categories that it is intended to elicit. The tags range from exemplification of individual phonemes or particular inflection endings in words, through to word classes and entire semantic constructions such as types of clause combinations or case relations. Naturally, not each and every sentence is translated by the speakers as intended, and so there is an error margin in the actual ability of each recorded questionnaire to retrieve the intended constructions through the pre-built tags. But since each semantic function, construction and structure appears in several different positions through the questionnaire, retrieval is generally guaranteed, even if not via each and every intended phrase.

The tags are designed to answer the analytical questions that are dealt with in the database, and so they match chapters, sections, and indeed individual cells in the RMS database. The spreadsheets can thus feed directly into the database: In the earlier working phase, the link between spreadsheet data and the database in FileMakerPro 6 was based on manual retrieval of the data by sorting the spreadsheet rows according to the tags, and entering the relevant data into the database fields. During the recent development stage, a new database has been created, allowing the transcriptions to be fed directly into the database, which then retrieves them automatically as exemplification for individual data fields. Each word and phrase is also linked to the digital sound files, thus making all raw data – in transliteration and in original sound – directly accessible to the user.

Figures 6–10 show extracts from the recent development of a Sample Database, modelling some of the functions of the new, upgraded RMS. This Sample Database has been freely accessible via the project website since January 2006. Note that users may choose one of several functions: Phrase search, Wordlist search, Verb inflection search, and Grammatical category search. The user can select a dialect (representing a particular Source Sample, i.e. the transcription and recording of an interview with an individual speaker). It is also possible to select languages other than English as input languages for phrases or pre-defined word searches. The Phrase search function retrieves any corresponding string within the transcription, thus covering words, affixes, and short phrases. Figure 6 illustrates the query input for the word ‘boy’ in English, in the Šutka Arli Romani dialect of Macedonia, while Figure 7 shows the query output.


Figure 6. Web-based Sample Database query function ‘Phrase search’  http://romani.humanities.manchester.ac.uk

Figure 7. Output menu of query function ‘Phrase search’


Note that in the output, the corresponding audio file for each transcribed word or phrase can be heard as well, by clicking on the audio icon. The wordlist query operates on the basis of a direct retrieval of a word or paradigm. A list of some 250 everyday words is included in the questionnaire for the sake of lexical and phonological comparison among the dialects. A list of a chosen set of over 50 verbs with complete conjugations documents all relevant verb inflection classes in the language. Figure 8 shows the query output for the selection of the verb ‘arrive’: the user can view the complete present-tense and past-tense conjugations, namely those that cannot be easily predicted, since they involve inflection and not an analytical marker. Verb inflections are thus typical ‘form-to-form’ queries.

Figure 8. Query ‘verb inflection’ output menu


On the other hand, ‘function-to-form’ queries exploit the tagging system that is applied to phrases. The user is able to open a window and within it select a particular tag, representing a grammatical-semantic or category function (Figure 9). In the output, all phrases are shown which contain the relevant tag (Figure 10).
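The logic of such a ‘function-to-form’ query can be pictured with a minimal sketch. It assumes, purely for illustration, that each questionnaire phrase carries a reference number, a transcription and a set of tags; the class, the tag labels and the placeholder transcriptions are hypothetical and are not taken from the RMS data or its implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Phrase:
    ref: int                 # hypothetical questionnaire reference number
    transcription: str       # placeholder text, not real Romani data
    tags: frozenset = field(default_factory=frozenset)


def phrases_with_tag(phrases, tag):
    """Return every phrase that was tagged for the given category."""
    return [p for p in phrases if tag in p.tags]


# Hypothetical tagged phrases standing in for one transcribed questionnaire.
PHRASES = [
    Phrase(1, "<transcription 1>", frozenset({"modal complement", "subjunctive"})),
    Phrase(2, "<transcription 2>", frozenset({"epistemic complement"})),
    Phrase(3, "<transcription 3>", frozenset({"modal complement"})),
]

if __name__ == "__main__":
    for phrase in phrases_with_tag(PHRASES, "modal complement"):
        print(phrase.ref, phrase.transcription)
```

Because a phrase may carry several tags, the same sentence can surface under more than one grammatical category, which is what makes retrieval robust when a speaker does not translate a particular sentence as intended.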

Figure 9. Web-based Sample Database query function ‘grammatical category search’


Figure 10. Output ‘grammatical category search’

The queries illustrated above (Figures 6–10) constitute an innovation compared to the functions covered by the older database sketch in FileMakerPro 6. From the technical side, they represent, in fact, an entirely new database, a custom-designed application with a web-interface, which replaces the old sketch in FileMakerPro 6 (see below). While the Sample Database – the first part of the new RMS to be available online – is still limited to direct linkage to phrases via the tagging structure, the new RMS ‘proper’ combines the strength of the analytical RMS database with the functionality of the new application, in that it integrates complete datasets (supplied to it through spreadsheets of transcriptions and corresponding sets of sound files) into the tables that hold data and metadata on grammatical structure. This new application is, at the time of writing, under development, and is planned to be freely accessible via the project’s website from 2008. Figure 11 illustrates the presentation of a typical function-to-form section, here the table of indefinite pronouns. Note that by clicking on the respective field within the table, the user is able to retrieve relevant phrases via the tagging system from the questionnaires, in both transcription and sound. The same procedure is followed in order to input data into the tables, for each record. The database is thus enriched by an interactive dimension which allows, for each and every item of data, sentential exemplification, in transcription and sound – something that is impossible to deliver in a conventional written grammatical description.

Figure 11. Web-based Database layout ‘Indefinite forms’, with exemplification for Person-Negative form (nikon ‘nobody’)

4. Management and organisation

From the above it has become clear that RMS is not just a database, but also a strategy for data collection, processing, and evaluation: It inspires, and is dependent on, a certain method of data collection and archiving, and in its design it subscribes to certain notions prevalent in functionally-oriented typology in respect of categorisation and structural representation of semantic functions, and to certain assumptions about the diachronic development of Romani. Following the database outline it is possible to compose basic grammatical descriptions of the language that are informed by both the functional-typological and the particular diachronic assumptions about Romani (see e.g. Matras 2004b; Tenser 2005; Chileva 2005; Chashchikhina 2006). More than just a tool to store data, RMS is thus an integrated approach to language documentation and evaluation. Despite its anchoring in certain assumptions about language function and the development of Romani, however, it leaves ample scope for analysts to retrieve data and evaluate them in entirely different directions. It is thus an open resource, one that is theoretically informed but not theoretically prejudiced. Although RMS was constructed as a tool specifically for the investigation of Romani, the procedure behind the management of the resource and the project that supports it is, in principle, applicable to other languages as well. As such, RMS may be regarded as a model for comprehensive documentation especially of lesser-known languages. In this section we review some of the general aspects of the project.

1 Research into dialectal variation
2 Postulation of historical developments/background analysis
3 Drafting of form, form-to-form, and form-to-function data presentation layout
4 Integration of typological description grids, drafting of function-to-form layouts
5 Creation of a questionnaire to elicit all aspects of structural variation
6 Tagging of questionnaire data with reference to database fields
7 Fieldwork using the questionnaire: training of fieldworkers, audio recording and transcription of interviews, archiving
8 Database upgrade: Custom-made application with web interface, re-design of data tables, integration of questionnaire data with transcription and sound
9 Release of Sample Database and gradual upgrade (adding of samples)
10 Release of complete database model
11 Development of elaborate query structure
12 Replication of model for other languages

Figure 12. Summary of RMS implementation and management strategies, by stage

The steps outlined in Figure 12 represent successive (though sometimes also parallel and intertwined) stages in the project’s development. As discussed in the previous section, the preliminary sketch for RMS consisted of an inventory, by grammatical category, of variants, derived from existing literature. To this, information deemed essential for a language description was added – inspired and informed by typological works. Note that most grammatical descriptions of Romani had not, by that stage, been typologically oriented, and few contained any information at all about syntactic typology. Comments and data on syntax in the relevant literature were quite often limited to loose exemplification, rather than systematic remarks. Thus, complementation might be illustrated with one or two examples, but those would not yield many insights into the continuum of modality vs. epistemic complementation, for instance. Based on the survey of morphological variants, combined with a typological survey, a preliminary design was produced, outlining the information that was of interest.

The approach at this stage was not a technical one, and was completely uninformed by any technical approach to database design. Rather, it was based on a purely linguistic appreciation of relations between values as representing linguistic functions and paradigm values, with no distinction between primary data, derived data, and metadata commentary. The availability of FileMakerPro as an application that allowed amateur plotting and easy retrieval of data created a temptation to focus the project’s resources on recruiting linguistic, rather than technical skills. In effect, the resulting file in FileMakerPro was nothing but a single, huge table, with over 5000 columns representing content-defined data fields, interacting with a mere 100 or so rows representing individual dialect records. In hindsight, a more informed approach would have quite possibly enabled a quicker production of a proper application with a relational structure, able to store complex data and allowing the necessary flexibility in designing a query structure. An impeding factor, however, is the structure of the grant scheme and the need to complete phases within a relatively short funding period.
Prior to the successive development of questionnaires and the procedures to tag phrases, the full requirements and opportunities of the database would not have been envisaged; and these in turn could only emerge once a database sketch was in place, storing a preliminary set of data and allowing cross-dialectal comparison. Previous fieldwork on Romani, much like fieldwork on other languages, or on cross-linguistic samples for typological purposes, relied on just a limited set of questions aiming to elicit a modest set of variables. The inconveniences of a comprehensive questionnaire aiming to document a variety of structures to allow a complete descriptive sketch of a dialect are obvious: time constraints limit access to informants, and the amount of material takes time to process, check, archive, and evaluate. The RMS questionnaire is one of few enterprises known to us that aimed at a comprehensive description of dialectal varieties of a language. The work of administering the questionnaire, archiving and processing the data was only possible through the creation of an entire network (in the case of Romani, an international network was required), within which several dozen individuals carried out a series of specialised tasks, from interviewing in particular languages, to transcribing particular dialects, to archiving the material (checking transcriptions, digitising and editing sound files) and inputting the data into the actual database. The network allowed a kind of production-line management of some of the tasks. Thus interviewers are able to pass on their recordings to an archive manager, who delegates various tasks to transcribers, sound technicians, and later to analysts for input; not only are these different individuals, but quite often they work at different institutions and reside in different countries.

The need to upgrade the database to a custom-made application arose when it became clear that the available dataset could only support a very limited query structure, which could not be integrated with other applications or extended to cover new functions. The major problem encountered in this phase was the need to re-define categorisations and create relations among data sets in different tables. This lengthy process, still ongoing at the time of writing, involves a productive re-assessment of the possible relations among instances of data which are not self-evident from the purely linguistic-paradigmatic perspective. The very first significant upgrade was the creation of a database that could accommodate the actual transcriptions and sounds, with their tags, thus allowing the direct query structure described above. The fact that relations between phrases, tags, and sound samples had already been set in advance allowed a rather quick design and early sharing of the so-called Sample Database with a wide audience on the web.
Following from this is the gradual conversion of the original sketch into a relational database (see below), the import of data already stored in the FileMaker format, and the development of query structures. A mid-term aim is then to view RMS as a model for potential applications documenting other languages, and to pilot its adaptation to another group of closely related languages.

5. The database structure: technical aspects

The original RMS database, built using FileMaker version 5, cannot be considered a relational database. FileMaker version 5 encourages the development of single table databases; in a sense, a simple spreadsheet with each row being a ‘record’ and each column holding a certain piece of information for each record.

Dialect Name | Dialect group | Origin | Location | Etc…
---          | ---           | ---    | ---      | ---
---          | ---           | ---    | ---      | ---
---          | ---           | ---    | ---      | ---

Figure 13. FileMaker database structure

This design resulted in the original RMS database containing in excess of 5,000 columns, each holding a discrete piece of information about the dialect. However, while this method is capable of holding any arbitrary data, it is not capable of holding any information about the data. In essence the data, in itself, is meaningless. Any meaning is only derivable from outside the system; meanings imposed upon it by the user. In fact, the data does have inherent meaning; it is simply that the FileMaker structure cannot represent this. For example the FileMaker database has several columns containing data concerning ‘Layer 2 nominal inflection’ markers. The data held in the column is the actual marker for the specific dialect, but which inflection it is (for which case/number combination) is not identifiable from the data. For example, the FileMaker database has two columns for ‘Ablative’ nominal inflection markers; one for ‘Plural’ and one for ‘Singular’. There is nothing within the data that indicates that either of these columns has anything to do with ‘Ablative’ or to do with ‘Singular’ or ‘Plural’ (or in fact that it has anything to do with nominal inflection). This can only be discerned by the user reading the arbitrary label given as the column name. Further, there is nothing inherent within the FileMaker database that indicates that the ‘Ablative singular’ marker and the ‘Ablative plural’ marker have anything in common (i.e. that they are both ‘Ablative’ markers). The new database development faces the challenge of correcting this.

A relational database attempts to represent data by its relationships to other data, thus building a network of ‘links’ between subject domain concepts, having the data as a quantification of these relationships for a specific ‘record’.
In fact the relational model does away with the ‘traditional’ concept of ‘Records’ (a single, linear, collection of data referring to a single subject concept) and, instead, breaks the subject of the database into its component concepts. Each concept, or possible variation of that concept, is represented by a set of quantifying data. Meaning is derived from the relationships between concepts that are implicit within the data sets. Returning to the previous example: within the FileMaker RMS we have a structure that says…
– There is an ‘Ablative Singular layer 2 nominal inflection marker’
– There is an ‘Ablative Plural layer 2 nominal inflection marker’
– Etc.

In the relational model we say…
– There are ‘Samples’2
– There are ‘Nominal Inflection Markers’
– There are ‘Grammatical Cases’
– There are ‘Grammatical Numbers’
– ‘Samples’ have ‘Nominal Inflection Markers’
– ‘Nominal Inflection markers’ have a ‘Grammatical Case’
– ‘Nominal Inflection markers’ have a ‘Grammatical Number’
– ‘Nominal Inflection markers’ have a shape
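The statements listed above can be pictured as a small relational schema. The following is a minimal sketch using SQLite through Python's standard library; the table and column names, the sample code and the placeholder marker shapes are assumptions made for illustration only and do not reproduce the actual RMS schema or data.

```python
import sqlite3

# Hypothetical schema: one table per component concept.
SCHEMA = """
CREATE TABLE sample (id INTEGER PRIMARY KEY, code TEXT UNIQUE NOT NULL);
CREATE TABLE grammatical_case (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE grammatical_number (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE nominal_inflection_marker (
    id        INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    case_id   INTEGER NOT NULL REFERENCES grammatical_case(id),
    number_id INTEGER NOT NULL REFERENCES grammatical_number(id),
    shape     TEXT NOT NULL
);
"""


def build_demo():
    """Create an in-memory database with placeholder (not real) marker shapes."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO sample (id, code) VALUES (1, 'SAMPLE-A')")
    db.executemany("INSERT INTO grammatical_case (id, name) VALUES (?, ?)",
                   [(1, "Ablative"), (2, "<another case>")])
    db.executemany("INSERT INTO grammatical_number (id, name) VALUES (?, ?)",
                   [(1, "Singular"), (2, "Plural")])
    db.executemany(
        "INSERT INTO nominal_inflection_marker "
        "(sample_id, case_id, number_id, shape) VALUES (?, ?, ?, ?)",
        [(1, 1, 1, "<abl.sg shape>"), (1, 1, 2, "<abl.pl shape>"),
         (1, 2, 1, "<other.sg shape>")])
    return db


def markers_for_case(db, case_name):
    """All markers sharing one case, with their number, found via the relations."""
    return db.execute(
        "SELECT n.name, m.shape FROM nominal_inflection_marker AS m "
        "JOIN grammatical_case AS c ON c.id = m.case_id "
        "JOIN grammatical_number AS n ON n.id = m.number_id "
        "WHERE c.name = ? ORDER BY n.id", (case_name,)).fetchall()


if __name__ == "__main__":
    print(markers_for_case(build_demo(), "Ablative"))
```

In this layout the fact that two markers are both ‘Ablative’ is stored in the data itself, through the shared reference to one row of the case table, rather than in a column label that only a human reader can interpret.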

‘Samples’, ‘Nominal Inflection Marker’, ‘Grammatical Case’ and ‘Grammatical Number’ are component concepts that allow us to build a representation of the subject domain, in this case dialects of Romani. Each of these concepts could have a number of attributes that allow the quantification of each instance of the concept. For example ‘Ablative’ is an instance of the concept ‘Grammatical Case’, or put another way one possible instance of ‘Grammatical Case’ has a ‘name’ of ‘Ablative’. In the same way, the concept of ‘Grammatical Number’ has two instances, one with the ‘name’ of ‘Singular’ the other with the ‘name’ of ‘Plural’. So, conceptually, as a ‘Car’ has an ‘Engine’ and a ‘Gear Box’, a ‘Nominal Inflection’ has a ‘Grammatical Case’ and a ‘Grammatical Number’.

2 In earlier phases, the unit explored was considered a ‘Dialect’ of the language. In later discussions, especially in connection with the technical compilation of the data, it was decided that there was no obvious procedure through which to distinguish ‘dialects’ from individual ‘samples’, each of which represents a speaker. Several speakers may be grouped together on the basis of their origin, or residence in, the same location, or on the basis of (any sets of) similarities among them. The entity ‘dialect’ is thus a secondary classification of samples. It was therefore decided that the database should operate on the basis of assigning data to individual ‘samples’, each representing a speaker (or a published source, in the case of secondary source compilation).

It is this shift in thinking about the subject domain that is the greatest challenge to the development of the new RMS database. The implications of this shift create a significant difficulty in the re-use of the FileMaker RMS data, which has been chosen based on the assumptions or interests of the researchers. The new database system must hold not only the discrete elements of data that the research project is concerned about, but also the metadata that gives meaning to those elements. In essence, the new system is not just a data store, somewhere to hold the results of analysis, but it is a model of the real world.

It is often tempting to think of a relational database in terms of ‘records’ and ‘fields’. For many data sets this can be a useful conceptualisation of the data. In designing a database, however, it lends itself to oversimplification and to the combining of concepts and, if care is not taken, can result in a lack of adequate normalisation. It also implies formal structure and order where there is none. For example, in the RMS database a ‘Record’ could be considered to comprise all the information held about a dialect. However, this data will span many component concepts and multiple instances of a concept may relate to the same dialect. If one considers these concepts as represented as tables and the attributes as columns within these tables, then the ‘Record’ for a dialect will comprise many rows from many tables (and quite possibly multiple rows from the same table); thus the connotations of the concept ‘Record’ lose validity. In the strictest sense a relational database is constructed of ‘Relations’, ‘Tuples’ and ‘Attributes’, where a ‘Tuple’ is a collection of ‘Attributes’ and a ‘Relation’ is a set of ‘Tuples’ that all have the same ‘Attributes’; no structure or order is implied.
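The point that a ‘Relation’ is a set of ‘Tuples’ with no implied order can be made concrete with a short sketch in plain Python. The relation name, its attributes and the placeholder values are hypothetical; the sketch models the concept only and is not how a database management system stores data.

```python
from collections import namedtuple

# A hypothetical 'Marker' relation: every tuple has the same four attributes.
Marker = namedtuple("Marker", ["sample", "case", "number", "shape"])

relation_a = {
    Marker("S1", "Ablative", "Singular", "<shape 1>"),
    Marker("S1", "Ablative", "Plural", "<shape 2>"),
}

# The same tuples entered in a different order, one of them twice.
relation_b = {
    Marker("S1", "Ablative", "Plural", "<shape 2>"),
    Marker("S1", "Ablative", "Singular", "<shape 1>"),
    Marker("S1", "Ablative", "Singular", "<shape 1>"),  # duplicate collapses
}


def select(relation, **conditions):
    """Return the tuples whose named attributes equal the given values."""
    return {t for t in relation
            if all(getattr(t, name) == value for name, value in conditions.items())}


if __name__ == "__main__":
    print(relation_a == relation_b)           # order and repetition play no role
    print(select(relation_a, number="Plural"))
```

Since the two sets compare as equal, nothing about the sequence in which data were entered survives in the relation; any ordering a user sees is imposed at query time.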
To ease the interaction with the dataset most relational database client interfaces utilise the more familiar representation of the data as ‘Tables’ with named ‘Columns’, each ‘Tuple’ being presented as a row in the table. These concepts are derived from SQL (Structured Query Language), almost exclusively used as the interface to relational database management systems. One of the more significant changes to the structure of the information for the new RMS database is the way in which a Dialect is defined. Within the new system each interview with an individual constitutes a ‘Sample’ of the dialect. It is from this sample that data is extracted and entered into the database system. In this way each Dialect can have more than one Sample and thus, more than one set of data defining it. This allows for the interesting possibility of analysing the differences within Dialects as well as between Dialects, measuring similarities and differences between Samples, and ultimately being able to represent the transition between dialects in a more realistic, gradual morphing rather than a set of discrete boundaries.
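One simple way to picture the measurement of similarity between Samples is sketched below. The source does not specify a metric, so the measure used here (the share of jointly recorded data points that carry the same value) is an assumption chosen for illustration, as are the data point names and placeholder values.

```python
def agreement(sample_a, sample_b):
    """Share of data points recorded in both samples that have the same value.

    Each sample is a dict mapping a data point name to its value.
    Returns None when the two samples share no data points.
    """
    shared = sample_a.keys() & sample_b.keys()
    if not shared:
        return None
    same = sum(1 for point in shared if sample_a[point] == sample_b[point])
    return same / len(shared)


# Hypothetical samples assigned to the same Dialect (placeholder values).
SPEAKER_1 = {"point_1": "x", "point_2": "y", "point_3": "z"}
SPEAKER_2 = {"point_1": "x", "point_2": "q"}

if __name__ == "__main__":
    print(agreement(SPEAKER_1, SPEAKER_2))
```

Computed for every pair of Samples, such a figure could be compared within and across Dialect groupings, which is the kind of gradual picture the Sample-based design is meant to allow.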

In order to achieve platform independence for the new database system it was decided to use web-based technologies for the user interface. The user is able to access the data from any computer platform that has a standards-compliant web browser and, obviously, is connected to the internet. With such a detailed system, with complex functional requirements, this presents a further challenge to the development process. This challenge has been met by the implementation of a multi-tiered MVC (model-view-control) design for the application. Each tier is functionally independent of the others, with inter-tier communication achieved through a specified and consistent interface.

Tier 1 is the User Interface. This is the actual web page that the user sees and interacts with. This is built with standard web technologies; HTML for layout definition and Javascript to control the functionality of the individual layouts. Tier 1 follows a loose Model-View-Control architecture. ‘Model’ being the data that is presented on the layout. ‘View’ being the HTML code. ‘Control’ being the Javascript code that manipulates the ‘Model’, presenting the data on the layout, registers and responds to user activity and communicates with Tier 2.

Tier 2 is the application code that runs on the server. This code performs the tasks requested of it by Tier 1 (e.g. acquire data, update data, user login etc.). Tier 2 exposes its functionality to Tier 1 through a simple API (Application Programming Interface). This allows the code running within each user’s browser to perform the tasks requested of it by the user. Tier 2 follows a standard Model-View-Control architecture. ‘Model’ being a set of object classes that represent the key concepts within the database model (each class roughly equating to a ‘Table’ within the database) and give functionality to the instances of those concepts.
The object classes that make up the ‘Model’ act as a wrapper around the underlying database which comprises Tier 3. ‘View’ being the application code that builds the visual layouts that are presented within the user’s browser. ‘Control’ being the application code that performs the ‘business’ logic; manipulates the ‘Model’ in order to select, insert or modify data, decides which ‘View’ to trigger the building of and performs other tasks such as authentication and authorisation. This separation of Model, View and Control allows for easy modification of the data structures, the visual appearance or the underlying business logic without any one of them having an effect on the others.

Tier 3 is the actual database itself. This is a full Relational Database Management System (RDBMS). This tier holds and manages all the data and the structures that define the data. Tier 2 has access to the data, and the functionality of the RDBMS, through the SQL (= Structured Query Language) based interface that the RDBMS exposes.
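The division of labour within Tier 2, and its relation to Tier 3, can be illustrated with a deliberately small sketch. It assumes SQLite as a stand-in for the RDBMS and reduces the ‘View’ to JSON serialisation rather than the building of visual layouts; the class, function, action and table names are all hypothetical and are not drawn from the RMS application code.

```python
import json
import sqlite3


class SampleModel:
    """Tier 2 'Model': an object class wrapping one Tier 3 table (hypothetical)."""

    def __init__(self, db):
        self.db = db

    def all_codes(self):
        rows = self.db.execute("SELECT code FROM sample ORDER BY code")
        return [code for (code,) in rows]


def render_json(payload):
    """Tier 2 'View', simplified here to serialising data for the browser."""
    return json.dumps(payload, sort_keys=True)


def handle_request(db, action):
    """Tier 2 'Control': decides which Model call and which View to use."""
    if action == "list_samples":
        return render_json({"samples": SampleModel(db).all_codes()})
    return render_json({"error": "unknown action"})


def make_tier3():
    """Tier 3 stand-in: an in-memory SQL database with placeholder sample codes."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sample (code TEXT PRIMARY KEY)")
    db.executemany("INSERT INTO sample VALUES (?)", [("S2",), ("S1",)])
    return db


if __name__ == "__main__":
    print(handle_request(make_tier3(), "list_samples"))
```

In this arrangement only `SampleModel` contains SQL, so a change to the table layout would be confined to that class, while a change to the output format would be confined to `render_json`.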


Figure 14. The three-tier structure of the relational RMS database

For data to be presented on the user’s screen, the user’s browser first sends a request across the internet to the server asking for the specific layout. The server receives this request, processes it, building and returning the layout to the browser. The layout comes in two parts: the formatting code that defines how the page looks on the screen and some control code that defines the functionality that the layout has. At this point the layout has no ‘data’ on it. Once the layout has loaded, it contacts the server, requesting all the data to be displayed. The server gathers the data via a request to the database and returns it to the layout, which then places that data into the relevant locations on the screen. In this way the web page is only refreshed when the layout needs to change, thus improving efficiency and speed of the application as well as giving the user a more ‘local’ feel to the application.

The layouts are designed to only request data from the server when they need it. This is demonstrable with the exemplification of data on the layout. With a click on a piece of data a window appears showing example sentences that demonstrate the use of the specific data. These examples are ‘loaded’ from the server at the point at which the user clicks on the data. In a similar way the data entry layouts request from the server a list of suggested values to be presented as a ‘value list’ to the user when they try to enter a piece of data for a dialect. This ‘value list’ is generated on the server from a unique list of all the values that are entered for that ‘data point’ for any sample/dialect within the database. This makes the value lists dynamic and always up to date (as new data is entered into the database these values will appear in their respective value lists).

The new RMS database application has, conceptually, three levels of data. There are the ‘Sample phrases’ derived from the transcriptions of the interviews, the sound files generated from the recordings of the interviews and the ‘dialect definition data’ which is extracted from the transcribed sentences (synonymous with the FileMaker RMS data). All this data needs to be ‘linked’ together within the system.
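The dynamic ‘value list’ described above amounts to asking the database for the distinct values already entered for one data point. A minimal sketch follows, again using SQLite as a stand-in; the single `entry` table, its columns and the data point name are assumptions for illustration, while TE, KAJ and ‘none’ are the complementiser values mentioned earlier in the chapter.

```python
import sqlite3


def value_list(db, data_point):
    """Distinct values entered so far for one data point, across all samples."""
    rows = db.execute(
        "SELECT DISTINCT value FROM entry WHERE data_point = ? ORDER BY value",
        (data_point,))
    return [value for (value,) in rows]


def make_demo_db():
    """Hypothetical flat store of entered values (not the RMS table design)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE entry (sample TEXT, data_point TEXT, value TEXT)")
    db.executemany("INSERT INTO entry VALUES (?, ?, ?)", [
        ("S1", "complementiser_type", "TE"),
        ("S2", "complementiser_type", "TE"),
        ("S3", "complementiser_type", "KAJ"),
    ])
    return db


if __name__ == "__main__":
    db = make_demo_db()
    print(value_list(db, "complementiser_type"))
    # A newly entered value appears in the list on the next request.
    db.execute("INSERT INTO entry VALUES ('S4', 'complementiser_type', 'none')")
    print(value_list(db, "complementiser_type"))
```

Because the list is computed at request time rather than stored, no separate maintenance of value lists is needed when new samples are added.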

Figure 15. Overview structure of the ‘Sample phrases’ data


The Sound files are stored on the server file system and are referenced within the system by the Dialect/Sample code and the phrase’s reference number, which together comprise the sound file’s filename. This allows sound files and Sample Phrases to be combined on the layouts. The ‘linking’ of the Sample Phrases and the Dialect definition data is a little tricky. Each piece of data refers to some phenomenon identified within the sample’s recorded phrases. These phrases are used as the source for data entry as well as exemplification of data once entered. Thus the system must identify which phrases are likely to present each required phenomenon. To achieve this the system holds a set of ‘Tags’, each referring to a particular phenomenon. Each phrase is ‘Tagged’ to identify the likelihood of it containing examples of each of the relevant phenomena. Given a specific phenomenon (tag) the system can then present all of the phrases, or all of any one sample’s phrases, that are likely to present examples of that phenomenon. However, the system also needs to know which phenomenon the user is looking for. This can be achieved in two ways; by the user selecting the phenomenon from a list (i.e. selecting the ‘tag’) or by identifying the phenomenon via the data point that the user is looking at. The first option is seen within the ‘Sample Database’ system that is currently available from the project website and constitutes phase 1 of the development project. Phase 2 development, found within the RMS database application, presents the second principle. The database is built up of tables designed to hold information that represents the subject domain’s component concepts, or ‘Entities’ as they are more commonly referred to in database design paradigms. Any one specific phenomenon is thus represented as a row within the relevant table (or in other terms, as an instance of an ‘Entity’). These rows, however, hold more information than just the visual representation of the phenomenon. 
For example, an instance of the ‘Layer 2 nominal inflection marker’ entity is represented by a specific string of characters, its form. However, more information is needed in order to get meaning from this representation: the ‘Sample’ it is from, the ‘Grammatical Case’ that the marker represents, and the ‘Grammatical Number’ that it represents. So the row within the table must hold information on ‘Sample’, ‘Grammatical Case’ and ‘Grammatical Number’ as well as the form (or shape) of the inflection marker. This ‘extra’ information of Case and Number could be considered metadata (it describes the data, giving meaning to it, rather than being the data itself). It is also known data: there is a small, fixed list of alternatives that can be used, and it is this information that defines what the user is looking for in the data itself.

For example, a user may look for an ‘Ablative’, ‘Plural’ marker. Using all combinations of this known metadata, ‘proto entities’ can be constructed. In this case there would be a ‘Layer 2 nominal inflection marker’ proto entity for each possible combination of case and number, stored in a table in the database that mirrors the table that holds the actual data. This table differs from the ‘data table’ in two ways: there is no reference to ‘Sample’ nor any form data (these would be meaningless in the context of the ‘proto entity’), and there is a reference to the ‘Tag’ used to indicate the specific phenomenon within the sample phrases. In this way, given specific instance data of a ‘Layer 2 nominal inflection marker’, or given that the user is requested to enter the form of a specific ‘Layer 2 nominal inflection marker’, the system will know both Case and Number. This metadata is then used to look up the Tag using the table that holds the ‘proto entities’, and the system can then present all the sample’s phrases that are ‘linked’ to the returned ‘Tags’.

Querying the new RMS database system will be handled in three different ways within the application. The most basic way of querying will be similar to the way queries can be made on the current RMS database system. The user, presented with an empty set of layouts, can enter certain criteria into the individual ‘cells’ and then request that the application find all the samples that match those entered criteria. This interface will look, fundamentally, like the normal data input interface. Once the matching samples have been returned the interface will allow the user to view any layout for each of these samples. This is a basic ‘filter’ query. The user, given the correct permissions, will be able to download the results in various formats.
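The proto-entity lookup described above (Case and Number lead to a Tag, and the Tag leads to the sample's phrases) can be sketched as follows. This is only an illustration under stated assumptions: the table names `marker`, `proto_marker` and `phrase_tag`, the column names, the tag labels and the phrase numbers are all hypothetical, and SQLite stands in for the project's database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Data table: one row per attested marker (left empty in this sketch).
CREATE TABLE marker (sample TEXT, gram_case TEXT, gram_number TEXT, form TEXT);
-- Proto-entity table: mirrors the data table, but has no sample and no form,
-- and carries a reference to a tag instead.
CREATE TABLE proto_marker (gram_case TEXT, gram_number TEXT, tag TEXT);
-- Tagging of sample phrases (hypothetical layout).
CREATE TABLE phrase_tag (sample TEXT, phrase_no INTEGER, tag TEXT);
""")
conn.executemany(
    "INSERT INTO proto_marker VALUES (?, ?, ?)",
    [("Ablative", "Plural", "T_ABL_PL"), ("Dative", "Singular", "T_DAT_SG")],
)
conn.executemany(
    "INSERT INTO phrase_tag VALUES (?, ?, ?)",
    [
        ("S1", 12, "T_ABL_PL"),
        ("S1", 40, "T_DAT_SG"),
        ("S1", 57, "T_ABL_PL"),
        ("S2", 12, "T_ABL_PL"),
    ],
)


def phrases_for(conn, sample, gram_case, gram_number):
    """Phrase numbers of one sample that are tagged for the given case/number."""
    rows = conn.execute(
        """SELECT pt.phrase_no
           FROM proto_marker AS pm
           JOIN phrase_tag AS pt ON pt.tag = pm.tag
           WHERE pm.gram_case = ? AND pm.gram_number = ? AND pt.sample = ?
           ORDER BY pt.phrase_no""",
        (gram_case, gram_number, sample),
    )
    return [phrase_no for (phrase_no,) in rows]


print(phrases_for(conn, "S1", "Ablative", "Plural"))
```

The point of the mirror table is that the data point on screen already supplies Case and Number, so the user never has to choose a tag by hand.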
In addition to the simple ‘filter’ query, the user will be able to select to have certain values plotted on a map of Europe, giving a graphical representation of the dispersal of the particular phenomena. The intention is for this to be ultimately flexible. A more sophisticated mechanism for querying the data, based on the abilities of the SQL database that underpins the new RMS development, will also be implemented. This will involve a complex interface for building unique queries that can analyse the data held within the RMS database rather than just present data per sample. To accompany this interface will be several ‘standard’ queries that can be used to compare samples with each other within or between dialects. What has been described so far is a rather simplistic overview of the main technical features of the database application under development, but it does allude to a very complex system with an intricate lattice of data with differing requirements. When one considers the development of such a system, there is a great need to consider many other implications that may not be immediately obvious.


The platform on which the application is developed is rather critical. All too often such projects take what is considered an ‘easy option’. However, easy options are rarely the best option in hindsight. There are issues concerning proprietary lock-in that are often not considered. The original RMS database suffered from this, being built upon FileMaker. FileMaker is a proprietary database system that requires licensing of the product for both client and server use. In this way, anyone who wanted access to the data would need to purchase a license. Although licenses are often perpetual in nature, they only relate to the version of the software that the license was purchased for. As time progresses the software vendor releases new versions and stops support for older versions. Often, for many reasons, new versions are not compatible with the older version. This is a situation that the FileMaker RMS database finds itself in.

There are also vendor lock-in risks with bespoke developments, should they be developed on top of proprietary development environments. Like many establishments, the development support team within our institution utilises Microsoft .NET development tools and servers for developing applications. Again this leads to vendor lock-in, as the development tools and the servers that serve the application must be licensed, and the developments made using those tools can only be ‘run’ on the same vendor’s server software, thus tying the application to that particular vendor, in this case Microsoft. This project aimed to eradicate this issue, giving freedom to the application and removing all licensing costs, by using only open source or free and standards-based software. Due to constraints laid down by the institution, the platform technologies that were used for the development were PHP for the server-side programming, MySQL for the database engine and standards-based web technologies for the client-side development.
These are not the best solutions for this application and require a degree of extra development work to overcome their shortcomings. PHP has limited Unicode support, so care is needed when working with text within the application so as not to mangle any Unicode characters or produce specious results. PHP does provide a set of multi-byte functions that duplicate most of the core string functions and provide enough functional coverage for general multi-byte string manipulation. However, there are no specific multi-byte alternatives for non-string functions, so care must be taken to ensure that multi-byte characters will not adversely affect the result of such functions. For example, array sorting functions are not multi-byte safe, resulting in the sort order not being semantically correct when the array contains multi-byte characters. The array would be sorted by byte value rather than alphanumeric order, resulting in single-byte characters being first (and in alphanumeric order) followed by all two-byte characters, then three-byte characters and then four-byte characters.

MySQL now supports the majority of the SQL standard; however, certain features are omitted or have limitations (such as requiring ‘root’ privileges to implement), which can hinder advanced developments. The main concern for the RMS development is Unicode compatibility. As much of the data is character based, the logical choice is to use Regular Expressions for pattern matching in queries. However, MySQL’s Regular Expression engine is not multi-byte safe. In most cases this has little impact, as both the Regular Expression and the String To Match will be Unicode, and any multi-byte characters would be considered as multiple characters in both; thus, relatively speaking, the pattern is maintained. However, this does cause an issue when using multi-byte characters within a Regular Expression character class, for example [āēīōū]. In theory, this character class should match any one of the 5 long vowel characters presented. However, since each of these characters is multi-byte (in fact 2 bytes long each), the Regular Expression engine in MySQL seems to interpret this character class as containing 10 single-byte characters and will try to match any one of those 10 ‘characters’. Consequently, in MySQL the Regular Expression /d[āēī]d/ will not match the strings ‘dād’, ‘dēd’ or ‘dīd’ as it would be expected to. MySQL interprets the regular expression as trying to match a string that has 3 characters: the letter ‘d’ followed by any one of the 6 single-byte ‘characters’ in the character class followed by a ‘d’ character. The string ‘dād’ is interpreted by MySQL as a 4-character string: ‘d’ followed by two single-byte characters followed by another ‘d’. Since the string has two characters between the ‘d’s and the Regular Expression requires only one, MySQL’s Regular Expression engine will not register a match. The necessary workaround is to use grouping and alternation, so [āēī] becomes (ā|ē|ī) and MySQL, instead of trying to match with any one of the 6 single-byte ‘characters’, is now trying to match with any one of the 3 ‘character pairs’.

There is also a considerable data storage requirement for this application. With each sample that is entered into the system there are potentially a few hundred megabytes of data to be stored. The bulk of this is the approximately 800 or so sound files per sample, ranging between 4 and 200 kilobytes each. When one considers that the current FileMaker RMS consists of in excess of 100 dialects and the project is continuing its data collection, the application can easily require many gigabytes of storage space. This also needs to be backed up in case of system failure, and adequate backup facilities therefore need to be considered.
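Both multi-byte pitfalls described above, byte-order sorting and byte-wise character classes, can be reproduced outside PHP and MySQL. The sketch below uses Python, whose `re` module works byte by byte when given `bytes`, to simulate a byte-oriented engine; it does not run MySQL or PHP. The word list is illustrative only, and the `fold` sort key is a simple stand-in for a proper collation, not a method taken from the project.

```python
import re
import unicodedata


def nfc(text):
    """Use precomposed characters so that each long vowel is one code point."""
    return unicodedata.normalize("NFC", text)


# 1. Sorting: byte order versus an order that ignores diacritics.
words = [nfc(w) for w in ["čhavo", "zor", "baro", "šukar", "dad"]]

# Byte-value order: every ASCII-initial word precedes every multi-byte one.
by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))


def fold(word):
    """Sort key: the word without combining marks, then the word itself."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, word)


by_letters = sorted(words, key=fold)

# 2. Regular expressions on bytes: 'ā' is two bytes in UTF-8.
text = nfc("dād").encode("utf-8")              # 4 bytes: d, 0xC4, 0x81, d
char_class = nfc("d[āēī]d").encode("utf-8")    # class of 6 single bytes
alternation = nfc("d(ā|ē|ī)d").encode("utf-8")  # 3 two-byte alternatives

class_match = re.search(char_class, text)      # fails: one byte, then no 'd'
alt_match = re.search(alternation, text)       # succeeds: whole byte pair

print(by_bytes)
print(by_letters)
print(class_match is not None, alt_match is not None)
```

The alternation form works because each branch is matched as a complete byte sequence, which is the same reason the workaround succeeds in a byte-oriented engine.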


6. Conclusion: New prospects in descriptive linguistics

New technologies are by definition revolutionary: They allow us to do things that we were unable to do before, in relation to transfer and processing of information, but also in re-evaluating the meaning of information. RMS has had an institutional impact on Romani linguistics, by creating an international collaboration network needed to collect and process data on Romani varieties. It has also already inspired new analyses of Romani, and beyond – using the Romani sample as a basis for theoretical discussion (see Elšík and Matras 2006 on ‘Markedness’).

One indisputable accomplishment of RMS is its function as a resource of raw, yet catalogued data. Supplying the armchair user with both transcriptions and sound of hundreds of phrases from dozens of speakers, it brings fieldwork to the home. Moreover, it enables the user to control and verify every instance of analytical judgement and assessment taken by the input team, by retrieving the original exemplification. This sets a new standard in descriptive work in linguistics, which, once noted, is likely to prove difficult for linguists to fall behind. The availability of data in this way on the web also engages wider audiences of users, increasing the relevance of descriptive linguistics.

The planned query structure involving dynamic generation of maps from within the database might be regarded as a new step in the understanding of dialectology and dialectological surveys, one which de-constructs, to a certain extent, the notion of dialect boundaries and ‘genetic’ groupings, and allows the user instead to consider a plethora of classification options with minimal effort. A key function here is carried by the planned query structure to measure distance among samples and sets of samples, as described above. In the above we have not elaborated on our treatment of speaker metadata.
However, a second phase of data collection began in January 2006, using a supplementary questionnaire on details of personal biography as well as community customs. So far, data have been gathered in this way, along with data from the primary questionnaire, in communities in Ukraine, Moldova, Serbia, Croatia, Montenegro, Greece, Romania, Hungary, Italy and Poland. One of the tasks on the project’s future agenda is to design opportunities to link grammatical data with ethnographic data and with biographical data of speakers, to explore the extent of variation and the existence of boundaries within communities as well as among them. The latter functions are of great potential importance to educational and language policy in Romani. In the absence of either a standard language, or a central government with responsibility to safeguard language and promote language teaching throughout the Romani-speaking community, it is vital to gain a more thorough understanding about the prospects of cross-dialect communication and mutual intelligibility of the dialects, as well as to develop tools that would facilitate the transfer of text materials from one dialect to another. Combining RMS – the inventory of grammatical variants – with a lexical resource such as Romlex 3 – a lexical database of Romani dialects – could allow for the development of such a tool, which in turn would facilitate the pooling and sharing of linguistic resources for teaching and other purposes.

3 http://romani.uni-graz.at/romlex/

References

Bakker, Peter 1999 The Northern branch of Romani: mixed and non-mixed varieties. In Die Sprache der Roma: Perspektiven der Romani-Forschung in Österreich im interdisziplinären und internationalen Kontext, Dieter W. Halwachs and Florian Menz (eds.), 172–209. Klagenfurt: Drava.
Boretzky, Norbert 1999a Die Gliederung der Zentralen Dialekte und die Beziehungen zwischen Südlichen Zentralen Dialekten (Romungro) und Südbalkanischen Romani-Dialekten. In Die Sprache der Roma: Perspektiven der Romani-Forschung in Österreich im interdisziplinären und internationalen Kontext, Dieter W. Halwachs and Florian Menz (eds.), 210–276. Klagenfurt: Drava.
Boretzky, Norbert 1999b Die Verwandtschaftsbeziehungen zwischen den Südbalkanischen Romani-Dialekten: Mit einem Kartenanhang. Frankfurt am Main: Peter Lang.
Boretzky, Norbert and Birgit Igla 2004 Dialektatlas des Romani. Wiesbaden: Harrassowitz.
Chashchikhina, Olga 2006 A grammatical sketch of Ukrainian (Servi) Romani. MA diss., University of Manchester.
Chileva, Veliyana 2005 The morphosyntax of Velingrad Yerli Romani. MA diss., University of Manchester.
Dixon, Robert M. W. 1995 Complement clauses and complementation strategies. In Grammar and Meaning: Essays in Honour of Sir John Lyons, Frank R. Palmer (ed.), 175–220. Cambridge: Cambridge University Press.
Elšík, Viktor and Yaron Matras 2006 Markedness and Language Change: The Romani Sample. Berlin/New York: Mouton de Gruyter.
Frajzyngier, Zygmunt 1991 The de dicto domain in language. In Approaches to Grammaticalisation, Vol. 1, Elizabeth Closs Traugott and Bernd Heine (eds.), 219–251. Amsterdam: John Benjamins.
Frajzyngier, Zygmunt, and Robert Jasperson 1991 That-clauses and other complements. Lingua 83: 133–153.
Gilliat-Smith, Bernard J. 1915 A report on the Gypsy tribes of North East Bulgaria. Journal of the Gypsy Lore Society, new series, 9: 1–54, 65–109.
Givón, Talmy 1990 Syntax: A Functional-Typological Introduction. Vol. 2. Amsterdam: John Benjamins.
Matras, Yaron 1994 Untersuchungen zu Grammatik und Diskurs des Romanes: Dialekt der Kelderaša/Lovara. Wiesbaden: Harrassowitz.
Matras, Yaron 1998 Utterance modifiers and universals of grammatical borrowing. Linguistics 36 (2): 281–331.
Matras, Yaron 2002 Romani: A linguistic introduction. Cambridge: Cambridge University Press.
Matras, Yaron 2004a Typology, dialectology and the structure of complementation in Romani. In Dialectology meets typology, Bernd Kortmann (ed.), 227–304. Berlin/New York: Mouton de Gruyter.
Matras, Yaron 2004b Romacilikanes: The Romani dialect of Parakalamos. Romani Studies 14 (1): 59–109.
Matras, Yaron 2005 The classification of Romani dialects: A geographic-historical perspective. In General and Applied Romani Linguistics, Dieter W. Halwachs, Barbara Schrammel and Gerd Ambrosch (eds.), 7–26. Munich: Lincom Europa.
Miklosich, Franz 1872–80 Über die Mundarten und Wanderungen der Zigeuner Europas X–XII. Vienna: Karl Gerold’s Sohn.
Pott, August 1844–45 Die Zigeuner in Europa und Asien: Ethnographisch-linguistische Untersuchung vornehmlich ihrer Herkunft und Sprache. Halle: Heynemann.
Tenser, Anton 2005 Lithuanian Romani. Munich: Lincom Europa.
Turner, Ralph L. 1926 The position of Romani in Indo-Aryan. Journal of the Gypsy Lore Society, third series, 5: 145–189.
Wierzbicka, Anna 1988 The Semantics of Grammar. (Studies in Language Companion Series 18.) Amsterdam/Philadelphia: John Benjamins.

A database on personal pronouns in African languages

Guillaume Segerer

1. Introduction

In 2001, the University of Bayreuth (Germany) and the LLACAN (Langage, Langues et Cultures d’Afrique Noire, CNRS-INALCO – Paris, France) began a joint research program devoted to the study of pronominal systems in African languages. This program was intended to last three years and ended in 2004 with the publication of a book (Ibriszimow and Segerer 2004) summarizing our work. The database presented here originated as a part of this project but is now autonomous.

Pronominal systems1 are in some respects ideal candidates for typological studies: they form closed sets with strong structural organization, their presence is apparently universal, and they show a broad variety of internal organizations, that is, nearly all conceivable possibilities probably exist in one or more language(s). Pronominal systems have also been widely used to establish genetic relationships. These factors make it very desirable to have access to a large amount of data. These data exist but are scattered among publications, many of which are not easily available to researchers. Moreover, the data are presented in so many different ways that comparing data from even closely related languages can be problematic. Consequently, the aims of the database are:

– To assemble all the available data on the systems of personal pronouns in African languages (thus far, it contains information from more than 500 languages).
– To present these systems in a unified manner, so that different systems can be easily compared with one another. There are currently three different layouts, respectively displaying the whole system, the different subsystems, and an inventory of the forms in the system.
– To allow for multiple kinds of searches and to display the results on maps.
– To be freely available on the web. The URL is http://sumale.vjf.cnrs.fr/pronoms/ (see footnote 2).

Let us first present the internal structure of the database, then give an overview of the website.

1 The expression pronominal systems is used here as shorthand for “personal pronominal systems”.

2. Structure of the database

2.1. General description

The database is made up of three tables: one for the languages, one for the references, and one for the forms. The internal structure of each of these tables and the links between them are shown in figure 1 and discussed below.

LANGUAGES: id, name, SILname, code, phylum, group, subgroup, comment, geo, date of update
REFERENCES: id, lang_id, source, number of forms, status, comment, date of update
FORMS: id, person, number, specification, reference_id, form, simpl. form, formula, utf8, syllable, tone, function, comment, date of update

Figure 1. General structure of the database
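The three-table layout of figure 1 can be sketched as a small relational schema. The example below is a hedged illustration in Python with SQLite, not the database's actual definition: only a subset of the fields is included, the column types are guesses, `grp` replaces `group` (a reserved word in SQL), the references table is called `refs`, and the forms `FORM-A` and `FORM-B` are placeholders rather than real pronouns. The language name and the two source labels are borrowed from the Bijogo example discussed in section 2.1.1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (
    id INTEGER PRIMARY KEY, name TEXT, silname TEXT, phylum TEXT, grp TEXT
);
CREATE TABLE refs (
    id INTEGER PRIMARY KEY,
    lang_id INTEGER REFERENCES languages(id),
    source TEXT
);
CREATE TABLE forms (
    id INTEGER PRIMARY KEY,
    reference_id INTEGER REFERENCES refs(id),
    person INTEGER, number TEXT, form TEXT
);
""")
conn.execute(
    "INSERT INTO languages VALUES (1, 'Bijogo', 'BIDYOGO', 'Niger-Congo', 'Atlantic')"
)
# Two sources for the same language; each form points to a source, not a language.
conn.executemany(
    "INSERT INTO refs VALUES (?, ?, ?)",
    [(1, 1, "Segerer 2002"), (2, 1, "Wilson 2000-2001")],
)
conn.executemany(
    "INSERT INTO forms VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 1, "s", "FORM-A"), (2, 2, 1, "s", "FORM-B")],  # placeholder forms
)


def forms_for_language(conn, name):
    """All forms of a language, each labelled with the source it comes from."""
    rows = conn.execute(
        """SELECT l.name, r.source, f.person, f.number, f.form
           FROM forms AS f
           JOIN refs AS r ON r.id = f.reference_id
           JOIN languages AS l ON l.id = r.lang_id
           WHERE l.name = ?
           ORDER BY r.source, f.person""",
        (name,),
    )
    return rows.fetchall()


for row in forms_for_language(conn, "Bijogo"):
    print(row)
```

Routing every form through a reference row is what keeps data from competing sources apart while still allowing them to be retrieved together for one language.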

2 The name of the website is: Les marques personnelles dans les langues africaines. It is written in French, but it should be easy to use for anybody with the most basic knowledge of the language.


2.1.1. The language table

This table is intended for use with all the existing LLACAN databases. Ultimately, it should include all the languages of Africa. As of spring 2007, it contained 2297 records. The only comparable list of African languages in electronic form is held by the SIL. The SIL list is included in ours, but we give more attention to dialectal variation, especially in the regions for which we have specialists, i.e., central Africa, northern Nigeria, and the Atlantic coast from Senegal to Guinea. Technically, the SIL list was used to build our own. One of the conditions for using the SIL list3 was that the information contained in it should not be altered in any way. That is why the language table has two fields (SILname and code) that provide the exact Ethnologue references for every language. On the database website, the information about a language always includes a link to the Ethnologue page for that language.

However, some differences between the SIL list and ours remain. For example, we provide an additional field for the name of the language. This is not (as in the SIL database) for alternate names, but rather for the name under which the language is best known. It is often the case that the SIL name is never used in the literature. For example, the language of the Bijagós Islands in Guinea Bissau is called BIDYOGO by the SIL, but the only available sources for this language call it Bijogo (Segerer 2002) or Bijago (Wilson 2000–2001; Arthur 1975; Sapir 1971). Being the author of the only published grammar of the language, I feel free to use my version of the language name. There are plenty of similar cases where the data used in the database come directly from field work by known scholars.

This table also provides information about genetic affiliation: I have decided to use the general classification shown on the Ethnologue website, but for compactness I have kept only three levels of classification, namely phylum, group, and subgroup.
Whereas the phylum and group affiliations, even when controversial, are easy to establish, the subgroup value has to be chosen in a sometimes arbitrary manner. The decision here is mine, and reflects what I regard as most useful. It is not intended to be definitive, nor is it meant to replace the Ethnologue classification. Rather, it is intended only to help find a language in a list of more than 500.

In order to display the distribution of a given feature (or a given form) on a map, some information about language location was needed. This information was not available in computerized form, so it had to be created. Given that i) all language maps are inaccurate, and ii) we are dealing with a continent, not a country, I chose the “dot-strategy”, that is, a geographical representation whereby every language is a dot on the map. The exact location of the dots was obtained from various sources, with abundant use of the language maps provided by Ethnologue. In many cases, however, I relied on other sources, mostly maps provided in the language descriptions consulted for the personal pronouns. For such languages as creoles or certain widely used trade languages, the dot was located in the largest city where the language is the main one. Some of the decisions were necessarily arbitrary, as for example the choice of Cairo as the location for Old Egyptian. Only three kinds of languages were not given a location: proto-languages, some extinct languages, and a few languages merely mentioned in the Ethnologue list for which we could find no information at all.

3 See www.ethnologue.com.

Language sampling

As the ultimate, though admittedly utopian, goal of this endeavour is to collect data from all the languages of Africa, no particular care has been taken about the representativity of the language sample. For the time being (summer 2008), the distribution of the languages documented in the database is as follows:

Afro-Asiatic: 137 languages, including
    Chadic             103
    Cushitic            17
    Omotic              11
    Semitic              4
    Berber               1
    Egyptian             1

Niger-Congo: 347 languages, including
    Benue-Congo         98
    Gur                 85
    Ubangi              57
    Atlantic            46
    Kwa                 27
    Mande               15
    Adamawa              8
    Kordofanian          5
    Dogon                2
    Kru                  2
    Ijoid                1

Nilo-Saharan: 37 languages, including
    Central Sudanic     30
    Saharan              2
    Fur                  1
    Komuz                1
    Maban                1
    Songhai              1
    Western Sudanic      1


Khoi-San languages, creoles, mixed languages and unclassified languages (i.e. languages not affiliated to a phylum by the Ethnologue) are currently not represented in the database.

2.1.2. The reference table

The linkage between the table of forms and the language table cannot be straightforward. There are sometimes different sources for a single language, showing different data on personal pronouns. If a form in the table of forms were linked directly to the language table via a language id, the sources could not be differentiated. That is why a reference table has been provided. Initially, this table contained no proper bibliographical references, but gave only the minimal information (author, year) needed to retrieve the references from a list of sources on a separate page. Recently, these references were linked to the newly released WebBall4. Of the 349 sources, 33 could not be linked to an existing bibliographical reference. These include personal fieldnotes, fieldnotes given to me by colleagues, and unreferenced manuscripts.

2.1.3. The table of forms

This is where the actual pronominal forms appear. As can be seen from figure 1, this table contains several fields. All the information about each form is shown here, and is supposed to allow all the forms in a language to be presented as a system. “Technical” fields (id, reference_id, simplified form, formula, utf8, date of update) aside, each field provides a different sort of information. The aim here was to classify each pronominal form by its features. The first division made was between “person” and “number”.

2.1.3.1. The Person and Number fields

The “person” field contains a number from 1 to 3. This simple fact raises a question: is the 3rd person really a person? In a classic paper, E. Benveniste (1966) called it the non-personne, and on these grounds, G. Miehe, in her overview of pronominal systems in the Gur group of Niger-Congo (Miehe 2004), only gives forms for the 1st and 2nd persons. In the database, I decided to include non-deictic (or referential, or delocutive) forms as 3rd persons. However, for those languages with noun classes, where there are several 3rd person markers, I give only the forms corresponding to the so-called “human classes” (which are in fact present in all the reported class languages). The inclusion of all the forms would lead to unreadable tables, and would hinder comparison. That said, I realize that a database devoted to noun class systems would be of great interest and am now working on a project of this kind.

The “number” field can have one of three values: “s” for ‘singular’, “p” for ‘plural’ and “d” for ‘dual’. As far as I know, there is no other number type in the languages of Africa (no ‘paucal’ or ‘trial’ as in some Austronesian languages). However, there is one frequent feature of African languages that might have fit in this field (or alternatively in the “person” field): the inclusive/exclusive contrast. The reason why it is not in either of these fields is that it is always associated with 1st person plural. One never finds a ‘2nd person inclusive’, as this is meaningless.

The “person” and “number” fields are designed so that any value of the former can combine with any value of the latter (even though some combinations are very rare, e.g., 3rd person dual – but see Lega-Beya). Moreover, the different values within each field are mutually exclusive. If one form happens to stand for, say, both 2nd and 3rd person plural (as for example Wolof seen ‘possessive 2p and 3p’), then there must be two distinct records. The same requirement (of mutual exclusion) holds for the “specification” field, which we will now examine.

4 This Web Bibliography of African Languages and Linguistics (http://sumale.vjf.cnrs.fr/Biblio/), authored by myself from a source file by J. F. Maho, was put online in the spring of 2008.

2.1.3.2.
The Specification field

This field is for non-syntactic information that is somehow related to the notions of “person” or “number” (syntactic features are put in the “function” field, see below). These specifications include: animacy, gender, inclusivity/exclusivity, and definiteness. In addition, this field contains a feature which could be regarded as intermediate between referential and syntactic domains: logophoricity5. All these features are put in the same field because they are mutually exclusive, despite belonging to a single conceptual domain. It is a fact that none of the African languages examined thus far has a form with any two such features, say, ‘feminine’ and ‘exclusive’. Even logophoric marking seems to be incompatible with gender marking, inclusivity, or animacy. There can, of course, be different logophoric forms for different functions such as subject, object, or possessive. But in the case of a form with the meanings ‘logophoric’ and ‘object’, for example, the two meanings coalesce (the form is both logophoric and object), unlike a form with the meanings ‘subject’ and ‘object’, which means that the same form is used in different syntactic environments (either subject or object). That is why the “specification” field takes only one feature, while the “function” field can take as many as five.

5 A logophoric pronoun is a specific form basically used to refer to the author of reported speech, as in he_x says that he_x will come as opposed to he_x says that he_y will come.

2.1.3.3. The Form field

The transcription of forms respects, as far as possible, the one used by the original author. This can of course lead to different transcriptions of identical phonemes, such as, for example, the palatal nasal: ɲ in most cases, but often ny in Chadic languages. These variant spellings are allowed indifferently in the forme (“form”) field. In some cases, an author uses different transcriptions in different publications. Let me illustrate this with a detailed example: E. E. Kari, in his grammar of Degema (Kari 2004) and in an earlier article (Kari 2003), uses a dot below vowels to mark their [-ATR] counterparts (“non-expanded vowels” in the author’s words) and below b and d for their implosive counterparts. Furthermore, he sets up a spelling rule stating that “only the first vowel of a simple word that contains non-expanded vowels is dotted”. But in two other articles (Kari 2002a/b), he prefers the ordinary transcription, using ə, ɛ, ɪ, ɔ, ʊ, and ɓ, ɗ respectively.6 In addition, the palatal glide is written as y in 2003 and 2004, but as j in 2002.
As for tone marking, the rules are not entirely clear, but there are also differences between sources. Thus, the 3rd person singular object form is spelled yi in (2003: 280), ọyí in (2004: 258) and ɔj ɪ ́ in (2002b: 184). 6

The vowels presented here are phonologically of two kinds: ə is the dotted a and belongs to the “expanded” vowels as do e, i, o, u, while the others, together with a, form the set of “non-expanded” vowels. So dotting has a serious drawback: the dotted a is grouped with undotted letters and the simple a is grouped with dotted letters.

In this kind of situation, which is actually quite frequent, my choice of transcription is not wholly arbitrary. Though efforts are made to respect the author’s habits, a simple letter is preferred to a letter combined with a diacritic (here ɔ is preferred to ọ, for example, particularly insofar as it is used by the author himself).

Furthermore, there is a field called forme simplifiée (“simplified form”) where the following transformations take place:

1. Tones, tildes and other diacritics are removed.
2. Every “unusual” (i.e. non-ASCII) letter is replaced by the (phonetically and typographically) closest capital letter, following strict rules. For example, ɓ is replaced by B, while š, ʃ and sh are replaced by S, etc.

A complete table of replacements is not necessary here, as the original form can always be viewed on the web pages. The simplified form is very useful for sorting and searching. First, the user does not have to type a complicated sequence of keys. Second, a search on a particular form will pick up all the forms that look more or less the same, regardless of tone differences or transcription traditions.7

Clitics and affixes are not distinguished, for the necessary information is generally unavailable. However, a dash indicates whether a form occurs before or after the head (i.e., verb for a subject or an object, noun for a possessive). Thus, -X for an object form means that X is suffixed to, or attached to the right edge of, a verb form. It must be emphasized that, except for rare cases, no additional morphological and/or syntactic information is provided. In particular, the order of affixes cannot be deduced from the forms alone.
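The two transformations behind the simplified form can be sketched in a few lines of Python. This is only an illustration, not the CC script actually used for the database: the function name `simplify` is hypothetical, and the replacement table is deliberately partial, containing only the pairs attested in this chapter (ɓ, š, ʃ and sh from the rule above; ɯ, ɷ, ʊ from the footnote to the search tools; ɔ, ɛ and ɲ from the Bijogo inventory in Table 3).

```python
import unicodedata

# Partial replacement table: only pairs attested in the chapter are listed.
# The database's complete table is not reproduced there.
REPLACEMENTS = {
    "\u0253": "B",                                # ɓ
    "\u0161": "S", "\u0283": "S",                 # š, ʃ
    "\u026f": "U", "\u0277": "U", "\u028a": "U",  # ɯ, ɷ, ʊ
    "\u0254": "O", "\u025b": "E",                 # ɔ, ɛ
    "\u0272": "ny",                               # ɲ (as in Table 3)
}


def simplify(form: str) -> str:
    """Return a simplified form: no diacritics, special letters as capitals."""
    # Compose first, so that precomposed letters such as "š" are looked up
    # in the table before their diacritic is stripped.
    text = unicodedata.normalize("NFC", form).replace("sh", "S")
    text = "".join(REPLACEMENTS.get(ch, ch) for ch in text)
    # Decompose, then drop every combining mark (tones, tildes, dots below).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    for form in ("\u025b\u0272\u0254", "\u0254gan", "t\u00ed"):
        print(form, "->", simplify(form))
```

Applying the table before stripping diacritics matters for letters like š, which would otherwise decay to a plain s. Letters missing from the partial table simply pass through unchanged.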

7 See “The search tools”, section 3.3 (p. 380).

2.1.3.4. The Tone and Syllable fields

There is a “tone” field, using five symbols: B (Bas): low; H (Haut): high; M (Moyen): middle; d (descendant): falling; m (montant): rising. This set of symbols is obviously not enough for all African languages, some of which can display up to five distinctive tone levels, not to mention contour tones. But this lack of accuracy is not really a problem, as will be seen below. This field is only displayed on the “inventory” page. It can be used for sorting, as can any other column in this display. It is nevertheless not considered essential to the database, for various reasons, including these two:

– Accuracy: In many sources, tone is either left unmarked or is marked only on some forms or is not consistent throughout the data.
– Variability: In many languages, tone is used to mark important grammatical contrasts (e.g., aspect) but the relevant detailed information is not always provided. When it is, other problems can arise, such as a multiplicity of similar forms in a single cell of a table, which would greatly complicate the overall picture of the system.

The attempt has been made, however, to give as much detail as possible, sometimes by use of the “comment” field. Here is a “simple” example of the potential complexity of tone data: Gwari, a Nupoid language of Nigeria, as described by Hyman and Magaji (1970), has three “basic” tones, namely low, middle, and high. In addition, there are three other tone units: lower-mid, rising, and falling. These extra units are not phonological; they originate from sequences of basic tones. Thus, lower-mid “is in most cases a mid tone that has been slightly lowered as the result of a preceding low tone” (Hyman and Magaji 1970: 16). The personal pronouns generally bear this lower-mid tone, as an effect of a historical low-tone initial vowel, still present on the disjunctive forms. So the simple forms of personal pronouns in Gwari are underlyingly mid but surface as lower-mid. At the segmental level, there are only two sets of personal pronouns in Gwari: the simple forms, which are used as subject, object, and possessive, and the extended forms (with the initial low-tone vowel), which fall into the T category (see 2.1.3.5 below). The simple forms can bear different tones, according to their function:

“Subject pronouns bear high tone in the present perfect (…) and yesterday and beyond yesterday past tenses. Lower-mid subjects are found in the present and future tenses. (…) While possessive pronouns are always lower-mid, direct object pronouns are found to be mid after high tone verbs (…) and high after lá and kú in the completed tenses (…). Indirect objects are also high tone after high tone verbs”.

Any form can thus be lower-mid, mid, or high, depending on its function. Giving an exact account of this variability in the database would mean having three different sets of forms, one for each tone, and providing each form with a comment explaining the conditions of its appearance.

As to the “syllable” field, its content is calculated from the “simplified form” field. For the time being, it has been put to little use and is included “just in case”. Its accuracy will be limited, as information about the exact status of certain consonant sequences or long vowels is not always readily available.
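How a syllable pattern might be derived from a simplified form can be sketched as follows. The chapter does not give the actual rules, so everything here is an assumption chosen to match the Bijogo patterns in Table 3: the vowel set, the treatment of the digraph ny as a single consonant, and the decision to ignore anything that is not a letter (including the affix dash). The name `cv_pattern` is hypothetical.

```python
# Letters counted as vowels. Capitals stand for "special" vowels in the
# simplified form; "V" is the unspecified vowel found in forms like annV.
# This set is an assumption, not the database's own definition.
VOWELS = set("aeiouAEIOUV")

# Digraphs counted as one consonant (assumed from EnyO -> VCV in Table 3).
DIGRAPHS = ("ny",)


def cv_pattern(simplified: str) -> str:
    """Map a simplified form to a string of C and V symbols."""
    pattern = []
    i = 0
    while i < len(simplified):
        if simplified.startswith(DIGRAPHS, i):
            pattern.append("C")
            i += 2
            continue
        ch = simplified[i]
        if ch in VOWELS:
            pattern.append("V")
        elif ch.isalpha():
            pattern.append("C")
        # Anything else (e.g. the affix dash) is skipped in this sketch.
        i += 1
    return "".join(pattern)


if __name__ == "__main__":
    for form in ("EnyO", "annV", "yagan", "yanna"):
        print(form, cv_pattern(form))
```

As the text notes, such a calculation cannot know whether a consonant sequence is a cluster or a single segment, or whether a doubled vowel is long, so the result is approximate by design.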

2.1.3.5. The Function field

A pronoun may be assigned different grammatical roles. In language descriptions, the same role may be given different names, often resulting in unnecessary confusion. To give just one example, “independent”, “free”, “tonic”, “strong”, and “emphatic” are names given to a set of personal pronouns that can be used in the same syntactic environments as nouns (thus fully meriting the label “pronoun”). In the database, the choice has been made to try to unify the grammatical roles, and my experience has shown that five functions are enough to account for the vast majority of the distinctions found in African languages. These functions are the following (database labels are in French; English meanings/alternate labels8 are in italics):

T: Tonique: tonic, independent, emphatic, free, strong…
S: Sujet: subject, verbal, verbal concord, subjective…
O: Objet: object, direct object, indirect object, oblique, accusative, dative…9
P: Possessif: possessive, attributive, dependent, genitive…
R: Réfléchi: reflexive

8 I only list here some of the labels found in grammars written in English. French-speaking scholars are no less creative in giving names to grammatical roles.

9 Case languages are not common in Africa. The labels “accusative” and “dative” may be found in older works under the influence of traditional, European, Latin-based grammar. Furthermore, in some of the (rare) case languages, as in Bedja, there is no contrast between accusative and dative (Vanhove pers. com.).

The above associations may look a bit simplistic, but for the vast majority of the languages thus far examined, they cause no problem. There are of course some languages for which they may be unsatisfactory, namely those few for which a category of case has been described. Kambaata, for example (Y. Treis, pers. com.), has no fewer than eight cases, of which four do not readily fall into one of the above categories, viz., ablative, locative, vocative, and instrumental / comitative / perlative (the others being nominative, accusative, genitive, and dative). Furthermore, the cases which at first glance are likely candidates for one of the existing categories (e.g., accusative = Object or genitive = Possessive) are better labelled Tonic. Not only are they described as “independent” by the author, but there are two additional sets, one of object suffixes (on verbs) and another of possessive suffixes (on nouns). Fortunately, there is room for comments on every form, so that all the “independent” forms can be put in the T category, with the case information in the “comment” field. On the web pages, these comments appear as footnotes. In the end, 60 different Kambaata forms fall into the T category, while 12 go into the O and P categories. This may be oversimplification, but it is the price to pay for having a single kind of table for all languages. By the way, in the actual manipulation of the database, it turns out that the usefulness of this all-purpose table is limited. For my personal purposes at least, the “functions” page and the “inventory” page are much more helpful.
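The unification of role labels described in this section amounts to a many-to-one lookup. The sketch below shows one possible way to express it; the dictionary contains only the English labels listed above, and the names `FUNCTION_LABELS` and `normalize_function` are hypothetical, not part of the actual PHP/MySQL implementation.

```python
from typing import Optional

# The five database functions and the English labels the chapter lists
# for each of them. The lists in the chapter are open-ended ("...").
FUNCTION_LABELS = {
    "T": ("tonic", "independent", "emphatic", "free", "strong"),
    "S": ("subject", "verbal", "verbal concord", "subjective"),
    "O": ("object", "direct object", "indirect object", "oblique",
          "accusative", "dative"),
    "P": ("possessive", "attributive", "dependent", "genitive"),
    "R": ("reflexive",),
}

# Inverted index: source label -> database code.
LABEL_TO_CODE = {
    label: code
    for code, labels in FUNCTION_LABELS.items()
    for label in labels
}


def normalize_function(label: str) -> Optional[str]:
    """Return the database code for a source label, or None if unlisted."""
    return LABEL_TO_CODE.get(label.strip().lower())


if __name__ == "__main__":
    for label in ("Emphatic", "dative", "ablative"):
        print(label, "->", normalize_function(label))
```

A label such as “ablative” returns None here, which mirrors the Kambaata situation: such cases have no slot of their own and must be handled by hand, for instance by filing the form under T and recording the case in the comment field.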

2.1.4. Problems in assigning properties to forms: an example

When a single form has several functions, the rule is to list these functions together in the function field. This case is very frequent; not every language distinguishes four different functions. But what to do when one form represents more than one person? The Tomo-Kan variety of Dogon has a set of reflexive pronouns and a set of logophorics (Léger 1971). But these two sets contrast only singular and plural: the form la is the logophoric subject pronoun for the three persons of the singular, while the form le is the logophoric subject pronoun for the three persons of the plural. Ha and he are the corresponding forms for the category of reflexive. In this case, the alternative is:

– Either la (for example) is assigned the features [1s], [2s], [3s], and there are three records with the same form. The language then has three forms for the three singular persons, but these forms happen to be the same.
– Or la is attributed only to one person, say [3s], which is by far the most common situation for logophoric pronouns, and a note in the comment field says that this form represents all three singular persons. In this case, there is only one record, and the language is considered to have one form, but this form is presented as a 3rd person sg., which is inaccurate.

The former solution was adopted. The tables may be a bit denser, but the reader clearly sees that the true contrast for Tomo-Kan logophoric and reflexive pronouns is between singular and plural.
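The adopted solution, one record per person even when the form is identical, can be illustrated with a small sketch. The record layout (`Record` and its field names) is an assumption made for the example; the chapter does not show the actual table schema. The Tomo-Kan forms la and le are those cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One database row (hypothetical layout)."""
    form: str
    person: int          # 1, 2 or 3
    number: str          # "s" (singular) or "p" (plural)
    specification: str   # at most one feature, e.g. "logophoric"
    functions: str       # one letter per function, e.g. "S" or "SO"


def expand(form, persons, number, specification, functions):
    """Create one record per person for a form shared across persons."""
    return [Record(form, p, number, specification, functions)
            for p in persons]


# Tomo-Kan logophoric subject pronouns: one form per number,
# valid for all three persons.
records = (expand("la", (1, 2, 3), "s", "logophoric", "S")
           + expand("le", (1, 2, 3), "p", "logophoric", "S"))

if __name__ == "__main__":
    for number in ("s", "p"):
        forms = {r.form for r in records if r.number == number}
        print(number, sorted(forms))
```

Six records are stored, but grouping them by number shows a single form per number, which is the singular/plural contrast the text says the reader should be able to see.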

2.1.5. Technical data: PHP, MySQL

The data was collected and prepared using MS Excel® for typing and UltraEdit® as an intermediate text editor. Some operations were performed with CC, a program that makes consistent changes in a text file.10 Using these three programs saved a lot of time. Typing is easier in MS Excel®, but CC is needed to calculate tone and syllable patterns, simplified forms and UTF-8 forms. UltraEdit® is used for its extended search-and-replace abilities but is not strictly necessary.

The database itself was implemented using the PHP/MySQL combination. The only problems encountered were related to my poor initial knowledge of the language (I actually learned PHP while implementing the site) and are of minor interest here. An earlier version of the website made use of a special font for writing forms, but Unicode encoding is now preferred. Arial Unicode MS®, a nearly complete Unicode font, was provided with the MS Windows® 2000 operating system. This font still gives the best results. It is not provided with more recent versions of MS Windows®, but there is a Unicode font developed by the SIL, freely available on their website. There is an appropriate link on the left menu of the database site. The choice of PHP and MySQL was simple, as this combination presents all the features I judged necessary: it is easy to learn, designed for web publishing, and powerful enough to process large amounts of data.

3. The website

The database website is freely available to all, the only limitation being the display of the Unicode font, which I have tested only on MS Windows® 2000 and XP. Two main operations can be performed: the display of a particular pronominal system and the search for particular forms and/or functions. A third feature, the comparison tool, is offered on an experimental basis.

The left panel of the welcome screen contains four main sections, each with three links, plus two bottom links leading to the references page and a Unicode font download page, respectively. The title of each section is in itself a link leading to the same page as the first item of the corresponding menu. The first section, entitled accueil (“home”), is an exception. Its title link leads to the default page, but the three items of the menu provide different kinds of information: a very quick tutorial (mode d’emploi), a short piece of information about the content of the database (information), and a list of abbreviations.

10 Consistent Changes: www.sil.org/computing/catalog/show_software.asp?id=4.

3.1.1. The languages menu

This menu (langues) offers two main ways of finding a language: either on a genetic basis, or merely alphabetically. The classification used is based on the one presented on the Ethnologue website and has been discussed in section 2.1.1 above. Clicking on the name of the language group in the genetic list will show an internal language list together with the references used. A click on a language name leads to a page with basic information on the language and three buttons giving access to the various displays of the pronominal system (cf. 3.2 below).

3.1.2. The search menu

This menu (recherche) leads to three different tools for searching for either forms or languages with particular features. These tools are presented in section 3.3 below.

3.1.3. The comparison menu

This menu (comparaison) offers three ways of making special comparison-oriented kinds of searches. It is presented in section 3.4 below.

3.1.4. The bottom links

Finally, the bottom of the left menu consists of two links: the first one (références) leads to a list of the sources consulted for building the database (see sections 2.1.2 above and 3.6 below); the second one allows the user to download a Unicode font from the SIL website.

3.2. The display tools

Once a language is selected using the “languages” menu presented above, a short description is displayed, which includes:

– the basic affiliation of the language (see section 2.1.1 above);
– the complete reference of the source(s) from which the data was taken;
– a link to the Ethnologue information on the language, via its 3-letter code.

Below are three buttons, which display the pronominal system of the language in three possible ways:

1. The whole system presented in a single table
2. “Function” tables showing the various paradigms (i.e. subsystems)
3. An inventory of the existing forms

3.2.1. The general table of the system

All the forms of the system are shown in this table. An attempt has been made to allow direct comparison between any pair of systems. Consequently, the table is maximally complex: the rows list all possible combinations of person, number, and specification (see above, p. 368), 27 in all; the columns represent the five possible functions, namely Subject, Object, Possessive, Tonic, and Reflexive. An example of this is presented in Table 1 below, taken from my own data on Bijogo (Segerer 2002), a language of the Atlantic branch of the Niger-Congo phylum. As can be seen from this example, a form may be followed by a number referring to a footnote when more detailed information is needed.

3.2.2. The function tables

The above system may be shown in a more compact display without all the irrelevant rows and columns. This feature is accessed by the fonction button, which brings up a series of very compact separate tables. Each table presents the paradigm, or sub-system, for one function. Table 2 below shows the same system as Table 1 above. As the Possessive and Reflexive columns are empty in Table 1, there are no separate tables for Possessives and Reflexives. This kind of display allows the user to have a very quick overview of the internal structure of a system. There are no comments or footnotes, and the “specifications” (as defined above) are mentioned in brackets.
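The step from a flat list of records to compact per-function tables can be sketched as a simple grouping operation. The tuple layout and the name `function_tables` are assumptions for illustration only; the Bijogo subject and tonic forms are taken from the tables in this chapter, with the specification given in brackets as on the website.

```python
from collections import defaultdict

# (form, person, number, specification, functions); layout is hypothetical.
# Subject and 1st/2nd person tonic forms of Bijogo, as given in this chapter.
RECORDS = [
    ("ɲ-", 1, "s", "", "S"), ("m-", 2, "s", "", "S"),
    ("o-", 3, "s", "", "S"), ("wa-", 3, "s", "L", "S"),
    ("t-", 1, "p", "", "S"), ("n-", 2, "p", "", "S"),
    ("ya-", 3, "p", "", "S"), ("ba-", 3, "p", "L", "S"),
    ("ɛɲɔ", 1, "s", "", "T"), ("amɔ", 2, "s", "", "T"),
    ("atɛ", 1, "p", "", "T"), ("anɛ", 2, "p", "", "T"),
]


def function_tables(records):
    """Group forms by function, then by (person, number) cell."""
    tables = defaultdict(lambda: defaultdict(list))
    for form, person, number, spec, functions in records:
        label = f"{form} ({spec})" if spec else form
        # A form may serve several functions, one letter each.
        for function in functions:
            tables[function][(person, number)].append(label)
    return {function: dict(cells) for function, cells in tables.items()}


if __name__ == "__main__":
    for function, cells in function_tables(RECORDS).items():
        print(function)
        for person in (1, 2, 3):
            row = [", ".join(cells.get((person, n), [])) for n in ("s", "p")]
            print(" ", person, "|", " | ".join(row))
```

Because tables are created only when a record mentions a function, functions with no forms produce no table at all, which is the compactness the text describes.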

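Table 1. A complete pronominal system: Bijogo (Segerer 2002)

Pers | Sujet | Objet | Possessif | Tonique | Réfléchi
1s | ɲ- | na- | | ɛɲɔ |
2s | m- | am- | | amɔ |
2s masculin | | | | |
2s féminin | | | | |
3s | o- (1) | mɔ- (6) / ɔg (5) | | ɔgan (2) / ɔnna (3) | nɛ
3s masculin | | | | |
3s féminin | | | | |
3s animés | | | | |
3s inanimés | | | | |
3s indéfini | | | | |
3s logophorique | wa- | we / nɛ- | | | nɛ- (4)
1duel | | | | |
2duel | | | | |
3duel | | | | |
1p | t- | antV- | | atɛ |
1p inclusif | | | | |
1p exclusif | | | | |
2p | n- | annV- | | anɛ |
2p masculin | | | | |
2p féminin | | | | |
3p | ya- (7) | ma- (10) / yag (5) | | yagan (8) / yaana (9) |
3p masculin | | | | |
3p féminin | | | | |
3p animés | | | | |
3p inanimés | | | | |
3p indéfini | | | | |
3p logophorique | ba- | | | |

Commentaires:
1: classe 1 (humain singulier) [class 1, human singular]
2: Démonstratif éloigné classe 1 [distal demonstrative, class 1]
3: Démonstratif proche classe 1 [proximal demonstrative, class 1]
4: extérieur à la forme verbale [external to the verb form]
5: extérieur à la forme verbale (pronom de classe) [external to the verb form (class pronoun)]
6: seulement dans les formes relatives du verbe [only in relative verb forms]
7: classe 2 (humains pluriel) [class 2, human plural]
8: Démonstratif éloigné classe 2 [distal demonstrative, class 2]
9: Démonstratif proche classe 2 [proximal demonstrative, class 2]
10: seulement dans les formes relatives du verbe (île de Canhabaque) [only in relative verb forms (Canhabaque island)]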

While the general table showed that, for example, in Bijogo the 1st person plural is characterised by -t-, the separate tables reveal that the Tonic forms for the 1st and 2nd persons all have a VCV pattern, while the singular and the plural are characterised by a final -ɔ and a final -ɛ respectively. This kind of display is very useful for picking up an analogical change or highlighting a syncretism. For the time being, this is the simplest way to compare the internal structure of different subsets of a whole system. Further developments of the website include a specific tool for comparing two systems, or two subsets from different languages.

Table 2. The Function tables

Sujet | singulier | pluriel
1 | ɲ- | t-
2 | m- | n-
3 | o-, wa- (L) | ya-, ba- (L)

Objet | singulier | pluriel
1 | na- | antV-
2 | am- | annV-
3 | mɔ-, we (L), nɛ- (L), ɔg | ma-, yag

Tonique | singulier | pluriel
1 | ɛɲɔ | atɛ
2 | amɔ | anɛ
3 | ɔgan, ɔnna | yagan, yanna

3.2.3. The inventory table

The last type of display, illustrated below in Table 3, shows the forms in a list, sorted by the “simplified form” (this concept has been discussed above in section 2.1.3.3, p. 369). A click on a column header modifies the sort order accordingly.


The possibility of changing the sort order provides different perspectives on the same inventory, which can lead to different perceptions of the way the system is internally structured. But there is another invaluable feature offered with this display: a click on a simplified form leads to the search page, which we will now examine in detail.

Table 3. The Inventory table

Forme simplifiée | Forme complète | tons | syll | pers | mod.
am | am- | – | VC- | 2s | O
amO | amɔ | – | VCV | 2s | T
anE | anɛ | – | VCV | 2p | T
annV | annV- | – | VCCV- | 2p | O
antV | antV- | – | VCCV- | 1p | O
atE | atɛ | – | VCV | 1p | T
ba | ba- | – | CV- | 3pL | S
EnyO | ɛɲɔ | – | VCV | 1s | T
m | m- | – | C- | 2s | S
ma | ma- | – | CV- | 3p | O
mO | mɔ- | – | CV- | 3s | O
n | n- | – | C- | 2p | S
na | na- | – | CV- | 1s | O
nE | nɛ- | – | CV- | 3s | R
nE | nɛ- | – | CV- | 3sL | OR
ny | ɲ- | – | C- | 1s | S
o | o- | – | V- | 3s | S
Og | ɔg | – | VC | 3s | O
Ogan | ɔgan | – | VCVC | 3s | T
Onna | ɔnna | – | VCCV | 3s | T
t | t- | – | C- | 1p | S
wa | wa- | – | CV- | 3sL | S
we | we | – | CV | 3sL | O
ya | ya- | – | CV- | 3p | S
yag | yag | – | CVC | 3p | O
yagan | yagan | – | CVCVC | 3p | T
yanna | yanna | – | CVCCV | 3p | T

3.3. The search tools

There are three different tools for searching the database:

3.3.1. Searching by form

The simplest search is for a single simplified form (cf. section 2.1.3.3, p. 369), regardless of its meaning and function. Recall that a simplified form is a version of the form using only Latin script with no diacritics, special characters being rendered with capital letters.11 As pointed out by a reviewer, this can be viewed as a “low-tech implementation” of a kind of “fuzzy searching”, but when it was first conceived, this offered many advantages: it is encoding-independent, easy to implement, fast, and easy to use (no complicated typing). It does have drawbacks, however: a search on “e” will retrieve the various kinds of “e” while a search on “E” will find all the look-alikes, but not the “e-s”, so that it is not possible to find forms containing either e or E. To address this, the default search is case insensitive. A checkbox has been added to make it case sensitive if need be. Still, it is not possible to search on ʊ alone for example, as U will also retrieve ɯ, ɷ and U.

This search returns a table which looks very much like the inventory table presented in section 3.2.3. The differences are the following:

– the “simplified form” column is now absent, since the simplified form is the same for all the forms returned by the search.
– there are two additional columns, namely Language and Group. The names in the language column may be clicked on to return to the corresponding language page.

This table lists all the records in the database which share a given simplified form, whether this form was clicked on from the inventory table or typed in from the search page. A typed form may contain an asterisk (‘*’) as a wildcard standing for any character or sequence of characters. The result page gives the total number of records retrieved, as well as the total number of languages in which the form is to be found.
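The form search just described (asterisk wildcard, case-insensitive by default, with record and language totals) can be sketched as follows. This is not the site's PHP code: the function names are hypothetical, the asterisk is assumed to match zero or more characters, and the toy data mixes two Bijogo forms from this chapter with an invented “Language X”.

```python
import re

# (simplified form, language) pairs. The Bijogo forms come from the chapter;
# "Language X" and its forms are invented for the example.
FORMS = [
    ("ba", "Bijogo"),
    ("nE", "Bijogo"),
    ("ne", "Language X"),
    ("bE", "Language X"),
]


def compile_query(query, case_sensitive=False):
    """Turn a query with '*' wildcards into a compiled regular expression."""
    # Escape everything except the asterisk, which becomes ".*"
    # (assumption: the wildcard may also match nothing).
    parts = (re.escape(part) for part in query.split("*"))
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(".*".join(parts), flags)


def search(forms, query, case_sensitive=False):
    """Return matching records and the number of distinct languages."""
    pattern = compile_query(query, case_sensitive)
    hits = [(form, lang) for form, lang in forms if pattern.fullmatch(form)]
    return hits, len({lang for _, lang in hits})


if __name__ == "__main__":
    print(search(FORMS, "b*"))
    print(search(FORMS, "nE"))
    print(search(FORMS, "nE", case_sensitive=True))
```

With the default setting, a query for nE also returns ne, which is the behaviour the case-insensitive default was introduced to provide; ticking the hypothetical `case_sensitive` option restricts the result to the capital-letter look-alikes.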
For example, a search for “b*” retrieves a total of 543 forms in 150 different languages. Three additional criteria may be used to refine the search:

– A “person” can be chosen from “first”, “second”, or “third”. For example, a search for “b*” limited to the 2nd person now retrieves 125 forms in 72 different languages.
– “Number” can be selected, the choice being between “singular”, “dual”, and “plural”. For example, if the previous search is further refined selecting “plural”, there are now 76 forms left in 47 languages.
– Finally, a major phylum can be selected. Here the choice is limited to “Nilo-Saharan”, “Afro-Asiatic”, and “Niger-Congo”, as no data from any Khoi-San language has yet been entered into the database. Adding the “Nilo-Saharan” choice to our previous search criteria retrieves only 4 forms in 2 languages.

After the list of forms, on the same page, is a map showing the distribution of the forms. As stated above, languages are represented by dots as the best compromise between easy visibility and geographical accuracy. The dots are colored according to the phylum to which the language belongs. More details on maps are given below (§ 3.5, p. 386).

11 As an example, U stands for the following symbols: ɯ, ɷ, ʊ and U with or without diacritics.

3.3.2. Searching by function

This link leads to a page with a list of all the different functions (see above, 2.1.3.5) used in the “function” field of the table of forms in the database. For each function, the total number of corresponding forms is given. A click on the function name gives the complete list of the forms (warning: for the “subject” function, there are no fewer than 4730 forms) in a table similar to the one obtained with the “search by form” tool, but this time with a column for the simplified forms, these being in turn clickable to perform a new “search by form”. Below the forms, a map shows the distribution of the relevant languages. Figure 2, for example, is the map showing the distribution of logophoricity in the languages of the sample, where one can see that this phenomenon appears to be restricted to what is traditionally called the “fragmentation belt”, that is, the area of maximal linguistic differentiation.

3.3.3. Typological search

From this page it is possible to perform a search based on typological criteria, namely the presence of certain types of binary contrasts, or the presence of a certain particular value. The types of binary contrast that can be searched for are the following:

– inclusiveness (inclusive vs. exclusive 1st person plural)
– animacy (animate vs. inanimate)
– gender (masculine vs. feminine)

In addition, a search may be performed on three unitary values:

– indefinite
– logophoric
– dual

This kind of query returns either a list of languages or a list of groups of languages where the desired attributes are to be found. Two different “depths” of genetic groupings may be selected, namely “group” and “subgroup” (see section 2.1.1 above). The selected linguistic entity (language, group, or subgroup) may then be clicked on to display either the corresponding language page or a list of the languages of the corresponding group.

[Map. Legend: Logophorique]

Figure 2. Distribution of logophorics
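A typological query of this kind reduces to a subset test over the six searchable attributes. The sketch below assumes each pronominal system has already been summarised as a set of attribute names; the three languages and their feature sets are invented, and `typological_search` is a hypothetical name rather than anything from the site's code.

```python
# The six searchable attributes named in the chapter.
ATTRIBUTES = ("inclusiveness", "animacy", "gender",
              "indefinite", "logophoric", "dual")

# Invented systems, for illustration only.
SYSTEMS = {
    "Language A": {"logophoric", "inclusiveness"},
    "Language B": {"gender"},
    "Language C": {"inclusiveness", "dual"},
}


def typological_search(systems, wanted):
    """Return the languages whose system shows every wanted attribute."""
    wanted = set(wanted)
    unknown = wanted - set(ATTRIBUTES)
    if unknown:
        raise ValueError(f"unknown attribute(s): {sorted(unknown)}")
    return sorted(name for name, features in systems.items()
                  if wanted <= features)


if __name__ == "__main__":
    print(typological_search(SYSTEMS, ["inclusiveness"]))
    print(typological_search(SYSTEMS, ["inclusiveness", "logophoric"]))
    print(typological_search(SYSTEMS, ATTRIBUTES))
```

Selecting more attributes can only shrink the result, so a query on all six attributes returns an empty list for this toy sample. Grouping the returned languages by “group” or “subgroup” would be a further step not shown here.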


All the possible combinations of features may of course be tried. For instance, a search on the attribute logophoric alone returns 45 languages, and a search on inclusiveness returns 131 languages.12 But a mixed search on both logophoric and inclusiveness returns only 20 languages. Hence, the proportion of logophoric languages which also have the inc / exc feature (20/45 = 44%) is greater than the proportion of inc / exc languages which also have the logophoric feature (20/131 = 15%). As the 45 “logophoric” languages represent less than 9% of the sample, these figures can be regarded as significant. Given that both features have to do with the marking of an aspect of the actual speech situation, it might be worth noting that a language is more likely to have the inc/exc contrast if it is logophoric than the contrary.

The six searchable attributes may seem too few, but it turns out that no language (in the database sample) is returned when all six are selected, nor is any language returned when any five of them are selected. Furthermore, every combination of two criteria gives some results, which means that taken two by two, these criteria are never incompatible with each other.

Another kind of typological information may be found with the second search option on this page. Here one can select a language cluster from a pop-up menu where all the phyla and groups are listed. The result is a table showing a “yes/no” value for each of the six attributes listed above for every language of the selected cluster. In addition, the total number of “persons” is given for every language. For example, a language with no logophoric, no inclusive/exclusive contrast, no gender contrast, no animacy contrast, and no indefinite pronoun has 6 persons: 1st, 2nd and 3rd for singular and plural. At the end of the table, the total figures for each attribute are given, allowing a very broad evaluation of the typological features of the language cluster.

Let us illustrate this with the Chadic family of the Afroasiatic phylum: in the database, 103 languages of this family are documented (as of spring 2007). The number of distinctive “persons” in these languages varies from 6 (Zaar languages, Boto, Guruntum…) to 12 (Ngizim). The frequencies of the six typological features (in number of languages) are as follows:

– inclusive/exclusive contrast: 46
– gender contrast: 58
– animacy contrast: 0
– logophoricity: 1
– indefinite: 3
– dual: 13

12 The number of languages returned is actually the number of systems found in the database. For some languages, several sources have been used which may show some discrepancies. These languages may thus appear more than once in the database. The case is not uncommon.

While the massive presence of gender is not surprising, due to the Afroasiatic origin of the Chadic family, the inclusive/exclusive contrast, which concerns nearly half the languages, is more unexpected here, and might be regarded as an influence from the neighbouring Niger-Congo and Nilo-Saharan languages where this feature is much more widespread. A quick search with the “search by function” tool shows that the only other Afroasiatic languages with the inc/exc contrast are to be found in the Omotic branch among the southernmost languages of the phylum, that is, very near to Niger-Congo and Nilo-Saharan languages.
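The proportions quoted in this section can be checked with a few lines of arithmetic. The counts below are those reported in the text (45 logophoric systems, 131 with the inclusive/exclusive contrast, 20 with both, and the Chadic figures out of 103 languages); the helper name `share` is hypothetical.

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


# Counts reported in the chapter (systems in the whole sample).
logophoric, inclusive, both = 45, 131, 20

# Share of logophoric systems that also have the inc/exc contrast,
# and share of inc/exc systems that are also logophoric.
print(share(both, logophoric))   # 44.4
print(share(both, inclusive))    # 15.3

# Chadic family: 103 documented languages.
chadic_total = 103
print(share(46, chadic_total))   # inclusive/exclusive: 44.7
print(share(58, chadic_total))   # gender: 56.3
```

The first two values correspond to the 44% and 15% given in the text, and 44.7% is what “nearly half the languages” refers to for the Chadic inclusive/exclusive contrast.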

3.4. The comparison tools

These tools let the user perform multi-criteria comparative searches and display the results on a map, where the languages containing the two types of elements searched for appear as orange and blue dots respectively. Three kinds of searches are available:

i) A search for two different forms sharing the same “meaning” (here “meaning” must be understood as a combination of some or all of the features “person”, “number”, “specification”, and “function” as defined above). The “meaning” may be left blank (by selecting tous, ‘all’, in the respective columns), so that only the distribution of the forms is shown on the map. Thus, one may look at the respective distribution of, say, mi and ni as elements of pronominal systems. The result is shown in Figure 3.

ii) A search for two different meanings rendered by the same form. Figure 4 below illustrates this: while 30% of the languages in the database show at least one form beginning with t, this map shows that the distribution of these forms according to their meanings is not fortuitous. A ‘feminine’ meaning appears consistently in languages from the Afroasiatic phylum while ‘1st person plural’ is almost restricted to Niger-Congo languages. Of course, a language may have both features, i.e. a feminine form and a 1st plural beginning with t. This can be checked with the “simple search” tool, and it turns out that the Chadic language Dangaleat, for instance, has the forms te and tè for 1st person plural inclusive (as indirect object) as well as a form tí as a 3rd person singular feminine (in the subject and object functions).


iii) Finally, a comparative search may be performed on two different forms with two different meanings. As has been pointed out, it is always possible to leave the “meaning” fields blank. In this case, all the elements sharing a particular form will be looked for. The interest of such a search is that the overall distribution of a given form may be compared with the distribution of the same form with a particular meaning. The mention of the number of relevant languages on the map is also useful. Thus, a comparative search for forms with the shape “m*” (i.e. beginning with m) on the one hand, and for the same forms with the ‘1st person singular’ meaning on the other hand reveals that out of 411 languages showing at least one form in m-, 290 (i.e. 70%) have an m- form in the 1st person singular.

Figure 3. Distribution of mi and ni as elements of pronominal systems

Figure 4. Distribution of the meanings “1pl” and “feminine” for t*
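The logic of a comparative search, and the way its result might be turned into map symbols, can be sketched as follows. The orange and blue colours, the intermediate brown for languages showing both elements, the two shapes and the partial transparency are all mentioned in this chapter; which shape goes with which set, the opacity value, the dot size and the function names are assumptions. The `circle` and `rect` elements and the `fill-opacity` attribute are standard SVG.

```python
def classify(languages_with_first, languages_with_second):
    """Label each language as having the first element, the second, or both."""
    first, second = set(languages_with_first), set(languages_with_second)
    labels = {}
    for language in first | second:
        if language in first and language in second:
            labels[language] = "both"
        elif language in first:
            labels[language] = "first"
        else:
            labels[language] = "second"
    return labels


# Colours as described in the chapter; brown marks languages with both.
COLORS = {"first": "orange", "second": "blue", "both": "brown"}


def svg_dot(x, y, category, size=4, opacity=0.6):
    """Return one SVG element for a language at map position (x, y)."""
    color = COLORS[category]
    if category == "second":
        # Square for the second set (assumed assignment of shapes).
        return (f'<rect x="{x - size}" y="{y - size}" width="{2 * size}" '
                f'height="{2 * size}" fill="{color}" '
                f'fill-opacity="{opacity}"/>')
    return (f'<circle cx="{x}" cy="{y}" r="{size}" fill="{color}" '
            f'fill-opacity="{opacity}"/>')


if __name__ == "__main__":
    labels = classify({"X", "Y"}, {"Y", "Z"})
    print(labels)
    print(svg_dot(10, 20, labels["X"]))
```

Partial transparency means that two dots drawn almost on top of each other remain distinguishable, which is why an opacity below 1 is used in the sketch.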

3.5. The design of maps

The “search by form” and “comparison” tools are associated with a map to show the location of the languages where the forms are found. This map is dynamically generated using the SVG (Scalable Vector Graphics) language, which provides the easiest way to connect a drawing tool and a database. Unfortunately, the SVG viewer initially created by Adobe® will no longer be maintained after 2008. Furthermore, the new versions of the two most popular browsers, Microsoft Internet Explorer® 7.0 and Mozilla Firefox® 3.0, process SVG files differently: while the former needs a plug-in to display SVG files, the latter comes with a native renderer. It is therefore difficult, if not impossible, to generate SVG files to be viewed on both browsers. When the “African pronouns” website was first developed, all the maps were conceived using Microsoft IE 6, which worked perfectly. Today, though a few modifications have been made, there may still be some difficulties with some versions of IE 7.

As has already been pointed out, languages are represented by dots whose color either stands for the phylum they belong to or distinguishes two different forms in “comparative searches”. In the latter case, the dots are of two different shapes, i.e. circles and squares. Moreover, they are partly transparent, so that whenever two dots are very close to each other it is still possible to see that there are two of them. With the “comparative search” tools, single languages often display both of the forms searched for, in which case the dot will take a color intermediate between orange and blue, that is, a shade of brown which is sufficiently distinct from both orange and blue. In addition, placing the pointer over any dot causes the language name to be displayed in the upper right part of the map; when clicked on, the name leads to the corresponding language page. Those languages that are documented in the database but are not relevant for the search illustrated by the map are represented with small transparent dots.

3.6. Sources

In building up this database, several hundred sources have been consulted. These are listed on the “references” page. It must be emphasized again that the way the systems are presented on the database website may be quite different from the way they are presented in the corresponding source. To give only one example, the pronominal system of Ekpeye, an Igboid language of the Benue-Congo branch of Niger-Congo, is presented by David J. Clark (1972) as a four-term system comprised of speaker, speaker’s group, hearer, and referent with nothing like a number category. In addition, the terms -nI and -BE may be used to encode what is usually rendered as we (inclusive), you (pl), and they.
The author is nevertheless explicit: “The category of number is relevant nowhere else in Ekpeye grammar, and its introduction here would complicate the description to no useful purpose” (Clark 1972: 98). It is not my intention to pass judgment on every source I have consulted, but in order to insert the Ekpeye system into the database, I had to treat the forms with -nI and -BE as true plurals; a brief comment (including the above quotation) is, however, included on the Ekpeye language page.

Usually, the transposition of a pronominal system into a database-compliant shape was quite simple. In doing this for hundreds of languages, however, I noticed that only a very small proportion of linguistic descriptions provide a clear and systematic presentation of the pronominal system. In general, the various subsystems are presented in whichever of the major parts of the grammar they fit best. Thus, subject pronouns are often given in the chapters devoted to verbs, and possessive pronouns are discussed in the “noun phrase” sections. My own description of Bijogo (Segerer 2002), completed at roughly the time the database was begun, contains no comprehensive table of person marking in the language, although such a table is given for noun class markers, for example. In other words, even though the strong internal structure of pronominal systems has long been recognized, few scholars feel the need to present and discuss this structure in and for itself.
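
To make the map design described in section 3.5 more concrete, the following Python sketch generates a small SVG map along the same lines. It is not the code used on the website. The color values, pixel coordinates, language names, and page URLs are placeholders invented for illustration; drawing a circle for a language that shows both forms is an assumption, since the shape used in that case is not specified; and a tooltip (the SVG `<title>` element) stands in for the display of the language name in the upper right part of the map.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed colors; the values used on the actual website are not known.
ORANGE = (255, 140, 0)   # first form of a comparative search
BLUE = (0, 90, 200)      # second form of a comparative search
GREY = (128, 128, 128)   # languages not relevant to the search


def mix(a, b):
    """Return the channel-wise midpoint of two RGB colors."""
    return tuple((x + y) // 2 for x, y in zip(a, b))


def to_hex(rgb):
    """Format an (r, g, b) tuple as an SVG color string."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def marker(lang, forms):
    """Return the SVG markup for one language.

    `forms` is the set of searched forms ("A", "B") found in the language.
    """
    x, y = lang["x"], lang["y"]
    if forms == {"A", "B"}:
        # Both forms: intermediate color (circle shape is an assumption).
        shape = '<circle cx="{}" cy="{}" r="5" fill="{}" fill-opacity="0.6"/>'.format(
            x, y, to_hex(mix(ORANGE, BLUE)))
    elif forms == {"A"}:
        shape = '<circle cx="{}" cy="{}" r="5" fill="{}" fill-opacity="0.6"/>'.format(
            x, y, to_hex(ORANGE))
    elif forms == {"B"}:
        shape = '<rect x="{}" y="{}" width="10" height="10" fill="{}" fill-opacity="0.6"/>'.format(
            x - 5, y - 5, to_hex(BLUE))
    else:
        # Documented but not relevant: small, more transparent dot.
        shape = '<circle cx="{}" cy="{}" r="2" fill="{}" fill-opacity="0.3"/>'.format(
            x, y, to_hex(GREY))
    # The link leads to the language page; <title> gives a hover tooltip.
    return "<a href={}><title>{}</title>{}</a>".format(
        quoteattr(lang["url"]), escape(lang["name"]), shape)


def render_map(languages, attested, width=200, height=150):
    """Build a complete SVG document from language records and search hits."""
    body = "\n".join(
        "  " + marker(lang, attested.get(lang["name"], set()))
        for lang in languages)
    return '<svg xmlns="http://www.w3.org/2000/svg" width="{}" height="{}">\n{}\n</svg>'.format(
        width, height, body)


# Placeholder data, invented for illustration only.
languages = [
    {"name": "Language 1", "x": 40, "y": 60, "url": "lang1.html"},
    {"name": "Language 2", "x": 46, "y": 63, "url": "lang2.html"},
    {"name": "Language 3", "x": 120, "y": 90, "url": "lang3.html"},
    {"name": "Language 4", "x": 150, "y": 30, "url": "lang4.html"},
]
attested = {"Language 1": {"A"}, "Language 2": {"B"}, "Language 3": {"A", "B"}}
svg = render_map(languages, attested)
```

Because the markers are drawn with `fill-opacity` below 1, the overlapping dots of “Language 1” and “Language 2” both remain visible, which is the effect the partial transparency is meant to achieve.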

4. Conclusion

This database on person marking in the languages of Africa is an ambitious undertaking that is still far from complete. In fact, with slightly over 500 languages represented, at least 1,700 or so must still be missing! Moreover, the database suffers from inherent imperfections: pronominal systems are far more complicated than anything even the tidiest of tables can ever show. Various kinds of display have been designed to capture diverse aspects of the systems, but many other kinds of information cannot be taken into account, among them the whole domains of syntax and pragmatics. Some examples of this simplification have been given above (cf. the tonal complexity in Gwari). It would be easy for any author of one of the descriptions used here to deride the inaccuracy of the data as they appear on the web pages. I would therefore like to stress again that the database was created as a means of facilitating comparative and typological studies. As far as I know, only one article (Pozdniakov and Segerer 2004) has thus far exploited this tool, but it is my hope that others will follow, and this is one of the reasons for making it freely available on the internet.

References

Arthur, Isa
1975 The Bijago language (Orango dialect). Unpublished Ms., Paris.

Benveniste, Émile
1966 Problèmes de linguistique générale, Vol. 1. Paris: Gallimard.

Clark, David J.
1972 A four-term person system and its ramifications. Studies in African Linguistics 3 (1): 97–106.

Hyman, Larry M. and Daniel J. Magaji
1970 Essentials of Gwari Grammar. Ibadan: University of Ibadan.

Ibriszimow, Dymitr and Guillaume Segerer (eds.)
2004 Systèmes de marques personnelles en Afrique. (Afrique et Langage 8.) Paris/Louvain: Peeters.

Kari, Ethelbert E.
2002a Distinguishing between clitics and affixes in Degema, Nigeria. African Study Monographs 23 (3): 91–115.
2002b Distinguishing between clitics and words in Degema, Nigeria. African Study Monographs 23 (4): 177–192.
2003 Serial verb constructions in Degema, Nigeria. African Study Monographs 24 (4): 271–289.
2004 Degema Reference Grammar. Cologne: Rüdiger Köppe.

Léger, Jean
1971 Grammaire dogon: Tomo-kan.

Miehe, Gudrun
2004 Les pronoms personnels dans les langues Gur. In Systèmes de marques personnelles en Afrique, Dymitr Ibriszimow and Guillaume Segerer (eds.), 97–128. (Afrique et Langage 8.) Paris/Louvain: Peeters.

Pozdniakov, Konstantin and Guillaume Segerer
2004 Reconstruction des pronoms atlantiques et typologie des systèmes pronominaux. In Systèmes de marques personnelles en Afrique, Dymitr Ibriszimow and Guillaume Segerer (eds.), 151–162. (Afrique et Langage 8.) Paris/Louvain: Peeters.

Sapir, J. David
1971 West Atlantic: an inventory of the languages, their noun class systems and consonant alternation. In Current Trends in Linguistics 7: Linguistics in sub-Saharan Africa, Thomas A. Sebeok (ed.), 45–112. The Hague/Paris: Mouton.

Segerer, Guillaume
2002–07 Les marques personnelles dans les langues africaines. Access at: http://sumale.vjf.cnrs.fr/pronoms/ [online database].
2002 La langue bijogo de Bubaque (Guinée Bissau). (Afrique et Langage 3.) Paris/Louvain: Peeters.

Wilson, William A.A.
2000–01 Vowel harmony in Bijagó. Journal of West African Languages 28 (1): 19–32.

Contributors

Tamás Bíró
Faculty of Arts – Humanities Computing
University of Groningen
Oude Kijk in ’t Jatstraat 26
9712 EK Groningen
The Netherlands

Heather Bliss
Department of Linguistics
University of British Columbia
Totem Field Studios
2613 West Mall
Vancouver, BC
Canada V6T1Z4

Dunstan Brown
Surrey Morphology Group
Faculty of Arts and Human Sciences [J1]
University of Surrey
Guildford, Surrey, GU2 7XH
United Kingdom

Marina Chumakina
Surrey Morphology Group
Faculty of Arts and Human Sciences [J1]
University of Surrey
Guildford, Surrey, GU2 7XH
United Kingdom

Greville Corbett
Surrey Morphology Group
Faculty of Arts and Human Sciences [J1]
University of Surrey
Guildford, Surrey, GU2 7XH
United Kingdom

Alexis Dimitriadis
Utrecht Institute of Linguistics OTS
Utrecht University
Janskerkhof 13
3512 BL Utrecht
The Netherlands

Viktor Elšík
Institute of Linguistics and Finno-Ugric Studies
Charles University Prague
Nám. J. Palacha 2
110 00 Praha 1
Czech Republic

Martin Everaert
Utrecht Institute of Linguistics OTS
Utrecht University
Janskerkhof 13
3512 BL Utrecht
The Netherlands

Volker Gast
English Department
Freie Universität Berlin
Habelschwerdter Allee 45
14195 Berlin
Germany

Rob Goedemans
LUCL – Phonetics Lab
Universiteit Leiden
Cleveringaplaats 1
PO box 9515
2300 RA Leiden
The Netherlands

Martin Haspelmath
Max-Planck-Institut für evolutionäre Anthropologie
Deutscher Platz 6
04103 Leipzig
Germany

Harry van der Hulst
Department of Linguistics
University of Connecticut
337 Mansfield Road
Storrs, CT 06269-1145
USA

Bernhard Hurch
Institute of Linguistics
University of Graz
Merangasse 70
8010 Graz
Austria

Alexander Krasovitsky
Surrey Morphology Group
Faculty of Arts and Human Sciences [J1]
University of Surrey
Guildford, Surrey, GU2 7XH
United Kingdom

Yaron Matras
School of Languages, Linguistics and Cultures
The University of Manchester
Samuel-Alexander-Building NG-11
Oxford Road
Manchester M13 9PL
United Kingdom

Veronika Mattes
Institute of Linguistics
University of Graz
Merangasse 70
8010 Graz
Austria

Simon Musgrave
Linguistics Program
Monash University
Victoria 3800
Australia

Elizabeth Ritter
Department of Linguistics
University of Calgary
2500 University Drive NW
Calgary, Alberta
Canada T2N 1N4

Adam Saulwick
Human Interaction Capabilities, Command, Control, Communications and Intelligence Division
Defence Science and Technology Organisation
205 Labs
PO Box 1500, Edinburgh SA, 5111
Australia

Guillaume Segerer
Laboratoire Langage, Langues et Cultures d’Afrique Noire
CNRS
7 rue Guy Môquet
94800 Villejuif
France

Carole Tiberius
Computational Linguistics
Institute for Dutch Lexicology (INL)
Matthias de Vrieshof 3
2311 BZ Leiden
The Netherlands

Christopher White
eLearning Applications Team
Application Support and Development
The University of Manchester
B38 Sackville Building
Sackville Street
Manchester M60 1QD
United Kingdom

Menzo Windhouwer
Theoretische Taalwetenschap (UvA)
Spuistraat 210 (Bungehuis), Room 306
1012 VT Amsterdam
The Netherlands

Index of subjects

ablative (see case, ablative) academic recognition. 6–7 accent (word accent), 40, 72, 235, 237– 246, 250–251, 254–269 pitch ~, 238 Access (Microsoft application), 19–20, 22–24, 28–29, 36, 42, 49, 75, 134, 137, 160, 186, 249, 252, 258 adjective, 10, 124, 129, 133, 144, 146, 147, 219, 339 affiliation genetic ~, 81, 87, 92, 122, 167, 189, 227, 256, 313, 365 affix, 144, 147–148, 184, 295, 301, 315, 340, 370 Agent, 162–163 agreement, 8, 118, 134–149, 162–165, 177, 182, 184, 296, 336 ~ categories, 135, 138 canonical ~, 135, 138, 140, 148–149 controller, 135–144, 148, 163, 182 subject–verb ~, 159, 162–163 anaphora, 165 zero ~, 141–142 animacy, 4, 97, 106, 112, 134, 143, 220– 221, 368–369, 382–383 ~ hierarchy, 134, 220 architecture, 9, 157, 160, 164, 167–171, 185–186, 235, 307, 314, 352 Model–View–Controller, 352 system ~, 167 archive, 6, 156, 158, 253, 330, 338, 347– 348 areal, 92, 94, 123, 210, 285–286, 292 argument, 7, 74, 139, 162–163, 177, 179, 193, 238, 303 attribute, 2, 20, 28–30, 37, 47, 49, 51–73, 134, 167, 170–171, 176, 179, 195, 224– 225, 332, 350 –351, 382–383

attribute-value pair, 2, 61, 68, 72, 171, 224 basic word order (see word order) binary pair, 125–126, 128–129 Boolean, 35, 72, 162, 191, 202, 228, 258 Canvas, 103 case, 4–5, 82–83, 85, 88, 94, 97–98, 103, 106, 113, 119–121, 125–127, 131, 134, 140, 143, 147, 162, 166, 177, 181–182, 295, 316, 322, 331, 339–340, 349–350, 355–356, 372–373 ~ hierarchy, 98 ablative ~, 97, 349, 350, 356, 372 dative ~, 119, 124–125, 143–144, 146, 162, 181, 372 distributional ~, 120 formal ~, 120 citation, 6, 28, 60–61, 165, 199–200 clause, 3–34, 63, 97, 139–141, 164, 166, 185, 192, 335–336, 339–340 clitic, 144, 148, 276, 289, 293, 295, 370 comma-separated values [CSV], 158, 186, 190 conditional, 32–33, 134, 213, 321 conjugation, 145–146, 331, 339, 342 controller, 135–144, 148, 163, 182 corpus, 6, 43, 133, 136, 139–140, 329, 338 creole language, 109–110, 366–367 CSV (see comma-separated values) data ~ modeling, 171 ~ structure, 171, 186, 224, 352 ~ type, 34 –37, 65–66, 173 ~ warehousing, 174–175 Data Transformation Language, 168–175, 180 –181, 185–186, 191–194, 204

Index of subjects 395 database desktop ~, 13, 19, 22–27, 39, 63, 71, 86, 93 ~ design (see design, database) ~ documentation, 13, 65, 73, 157–158, 161, 163, 166, 188, 191, 193, 200, 202–203, 329, 332, 346 ~ field, 39, 106, 158, 163, 167, 172, 174–176, 179, 185, 188–193, 203, 204, 250, 338, 340, 346 ~ paradox, 254 relational ~, 20–21, 28–29, 32, 34, 44, 46, 51, 54, 56, 65, 68, 71, 74, 80 – 81, 130, 133, 135, 170, 175–176, 189, 210, 224, 348, 349, 351, 352 ~ schema, 71, 171, 174 ~ scope, 174, 175 cross-linguistic ~, 9, 39, 54, 61, 65, 73 domain–specific ~, 215–216 web ~, 22–27, 44, 51, 71 Database Management System [DBMS], 16 –20, 25, 29 –37, 42, 44, 47, 49, 62, 65–66, 74, 168, 171, 202 decoding (grammaticography), 211–212 definiteness, 4, 121, 127, 135, 144, 368 derivation, 166, 193–194, 303, 316, 339 design ~ principle, 69, 178 conceptual ~, 49, 51–52, 56 database ~, 3, 15, 28, 36–37, 45–56, 65 – 67, 74, 78 –79, 102, 105, 107, 130 –131, 172, 199, 202, 249, 287, 347, 355 logical ~, 49, 53, 56–58 diachronic change, 8, 118, 133–134, 149, 302, 317 dialectal variation, 331, 346, 365 dialectology, 331, 338, 359 directionality, 126 documentation language ~, 346 domain expert, 155, 161 dynamic query builder, 310, 318

edge, 177, 263–245, 256–257, 261–264, 267, 370 encoding (grammaticography), 4–5, 36, 38–41, 65, 132, 211–212, 214, 231, 254, 258, 270, 374 enrichment, 168–169 enumerated value, 63–66, 74 error, 2, 16, 27, 37, 62, 66, 80, 102, 172, 196–198, 248, 296, 340 input ~, 37, 66, 198 Ethnologue, 15, 18, 30, 33–34, 57–58, 61, 81, 108, 167, 193, 200, 201, 204, 224– 225, 248–249, 256, 290, 313, 365–367, 375–376 Ethnologue code (see ISO language code) Eurotyp, 247–248, 286, 335 experiencer, 162 Extensible Markup Language [XML], 21, 158, 186–187, 190 extrametricality, 165, 182, 241–245, 256– 257 factorial analysis, 134 feature, 5, 8, 20, 24, 32, 44, 49, 65, 67, 72, 77–107, 118–128, 131, 135, 138–148, 166, 170, 177–179, 182, 188, 191, 195, 211, 224–225, 229, 255, 266, 273, 283–297, 302, 315–316, 325, 332, 339, 356, 358, 365–369, 373–376, 379, 383–384 ~ geometry, 106 ~ space, 122 feet, 46, 52, 165, 182–183, 193–194, 237, 239–242, 246, 257–258, 262–263, 266, 268, 278, 315 iambic ~, 46, 193–194, 257–258, 261– 265 trochaic ~, 46, 192–194, 257–258, 261– 265, 268 fieldwork, 293, 330, 338–339, 346–347, 359

396 Index of subjects FileMaker Pro, 19–20, 22, 24, 31, 68, 77, 80, 86, 89–91, 102–106, 291, 333–340, 344, 347–351, 354, 357–358 font, 26, 38–41, 103, 160, 171, 195, 197, 339, 374–375 form-based (grammaticography), 102, 211 form-function relationship, 184 gemination, 305 gender, 4–5, 78, 82–83, 85, 88, 94, 97– 98, 103, 106, 112, 121, 126–127, 136, 143, 146, 316, 322, 368–369, 382–384 genetic, 81, 87, 91–92, 94, 122–123, 167, 189, 227, 256, 313, 319, 331–333, 359, 363, 365, 375, 382 grammar fragment, 166, 225–231 grammaticography (analytic), 210–212, 231, 334 harmonize, 161, 168, 192–195 head, 145, 147, 183, 221, 226, 236 –241, 244, 246, 258, 261, 264, 268, 278, 370 head-marking, 237, 293 hierarchy, 20, 21, 74, 126–127, 134, 170– 177, 183–184, 188–193, 204, 220, 238, 304 animacy ~, 134, 220 case ~, 98 notion ~, 173, 175, 186 historical change (see diachronic change) homophony, 123 HTML, 6, 25–27, 39, 104, 318, 352 iconicity, 304 idiosyncrasy, 131 inclusive/exclusive distinction, 4–5, 95– 96, 98, 100, 118, 135, 178, 180 –181, 218–219, 225, 238, 296, 368, 369, 382– 384, 387–388 inflection, 140, 145–146, 182, 285, 316, 334, 336, 339–340, 342, 349–350, 355– 356

integration, 13, 155 –170, 176 –181, 184, 186, 191, 195, 199, 201, 203, 249, 317, 329, 335, 346 semantic ~, 203, 335 intensifier, 9, 10, 166, 209–210, 216–231, 285 interface user ~, 2, 8, 17–20, 23–24, 26, 29, 36– 37, 43–45, 49, 56, 62– 63, 65–67, 69– 73, 77, 80 – 81, 86, 93, 102, 104 –105, 155–158, 160, 163, 165, 170–173, 178, 184, 186–188, 193, 210, 216, 224–225, 249, 252, 255, 303, 351–352, 356 web ~, 10, 24, 63, 71, 168–169, 185 – 186, 258, 330, 346 International Phonetic Alphabet [IPA], 38–43, 84, 103, 171, 190, 195–197, 249 IPA keyboard, 42, 190, 196 intransitive, 162–163, 166, 179 inventory phonological ~, 13–14, 36, 77, 107, 119, 166, 194 –197, 212, 254, 285, 291, 333, 337, 346, 360, 363, 370, 373, 376, 378–380 IPA (see International Phonetic Alphabet) irregularity, 130, 145–146, 192 ISO language code, 167, 193, 200, 204, 224–225, 249, 270, 313 Javascript, 26, 352 key, 15, 29–30, 42, 49, 56–62, 65, 69, 83, 117, 168, 170, 175, 190, 193, 196, 200– 201, 265, 319–325, 352, 359, 370 foreign ~, 29–30, 32, 37, 44, 48, 56, 57– 62, 69, 176 primary ~, 29, 30, 48, 57– 62, 65 – 66, 126, 128–129, 193, 201 knowledge engineer, 186, 191, 193 language ~ contact, 332 ~ documentation, 346

Index of subjects 397 ~ family, 67, 108–110, 129, 138, 189, 201, 253, 302, 305, 308 ~ policy, 359 ~ universals, 1, 39, 78, 85, 94, 96, 167, 173, 189, 195–196, 204, 236, 238, 254, 258, 291, 302, 363 layout, 41, 42, 77, 80, 86–93, 103–105, 190, 318, 334, 345–346, 352–356, 363 lexeme, 132, 316 linguistic ~ object, 181–182, 184 ~ property, 72, 177, 183, 192 ~ relation, 182 ~ terms, 160, 176, 182, 204 ~ typology (see typology) logical design (see design, logical) logophor, 79, 368–369, 373, 382–383 masculine, 85, 97, 124, 143, 146, 382 meronymy, 183 metadata, 158, 160–161, 168–172, 184– 186, 188, 191, 193, 199, 203–285, 287, 329, 344, 347, 351, 355–356, 359 mixed language, 367 Model-View-Controller architecture, 352 mood, 121, 127, 214, 229, 316, 322 morphology, 8–10, 55, 69–70, 77, 79, 100, 107, 117–122, 125, 127, 130, 132–133, 138, 144–145, 149, 164, 166, 177, 225– 226, 236–238, 250–251, 285, 301–303, 306–307, 313, 315–319, 322–325, 347, 370 morphosyntax, 10, 71, 77–79, 83–84, 92, 94, 98, 101–106, 118 –134, 147, 149, 162, 177, 185 MySQL, 17, 19 –20, 25, 29, 36, 66, 75, 160, 186, 216, 318, 357–358 negation, 121, 127, 133, 173, 213 neuter, 97, 124, 143, 146 non-canonical, 142, 145, 164, 210 normalization, 40, 46, 198 Notion, 52, 159, 162, 165, 170–176, 180, 184–188, 191–194, 204

Root ~, 174–176, 189, 204 Top ~, 173–175, 204 obviation, 101 one-to-many relationship (see relationship, one-to-many) onomasiology, 210–213 ontology, 5, 155–156, 158, 168–185, 188, 193, 203–204, 296 orthography, 84, 197, 227, 317 overt, 140–142, 144, 173, 192, 246 OWL (see Web Ontology Language) paradigm, 13, 77–84, 87–89, 92, 94–100, 103–104, 106, 121, 124, 130, 132, 146– 147, 179, 181, 224, 227, 342, 347, 355, 376 pattern-matching, 158, 358 person, 4–6, 23, 24, 45, 54, 56, 78–83, 85, 88–89, 94–101, 103, 106–107, 112– 113, 121–122, 127, 135, 143, 145–146, 162, 165, 179, 182, 316, 320, 322, 331, 333, 336, 367–369, 373, 376, 378, 381– 385, 388 phoneme, 14, 50, 165–168, 173, 176, 189, 193–198, 204, 339, 369 ~ inventory, 14, 165, 166, 189, 193– 196 phonology, 9, 84, 89, 102, 106 –107, 119, 120, 130, 146–147, 155, 166, 175, 177, 181, 185, 190, 194, 204, 237–239, 241, 243, 252, 254, 256, 258, 264, 275, 286, 301–303, 305, 307, 315, 317, 323, 339, 342, 369, 371 PHP, 19, 25–27, 75, 318, 357, 374 phylum, 365, 367, 376, 381, 383–384, 387 pitch accent, 238 platform, 21, 40, 80, 155, 186, 352, 357 plug-in, 44, 77, 89–91, 104–105, 160, 168, 185–186, 191, 386 plural, 4, 5, 85, 88, 96, 100, 104, 107, 112, 124, 126, 133, 138, 143, 145–147, 160, 289, 293, 295, 301, 309–313, 315,

398 Index of subjects 349–350, 356, 368, 373, 378, 381– 384, 387 predicate, 133–134, 141, 144, 162, 164, 169, 181, 183, 212, 265, 335–336 locational ~, 173–174 predication, 166, 177 predicative noun, 133 pre-query (see query, pre- ~) primary key (see key, primary) pronoun, 4, 10, 77–107, 112, 140 –142, 144, 165, 179, 212–213, 220–224, 226, 296, 339, 344, 363, 366 –368, 371– 373, 383, 386, 388 prosodic unit, 301 prosody, 248, 254, 273, 303 prototype, 179, 180, 249 proximity, 79, 82, 101, 106, 212 quantification, 98, 133, 139–140, 142, 149, 258, 349–350 quantity-insensitive stress, 273 quantity-sensitive stress, 260 query (see also search), 2, 15–17, 31–37, 47, 50, 56, 66–69, 71–74, 91–93, 119, 126–129, 133, 139, 156–159, 162–163, 169–171, 181, 184, 187–197, 201, 203, 214, 216, 224–225, 228–231, 235, 249, 251–252, 255, 261, 263–265, 270, 275, 284–286, 291, 296, 305–312, 316, 318, 321, 329–330, 334–337, 340–348, 351– 352, 356–359, 364, 370, 374–375, 387 ~ basket, 169, 188 –190 pre- ~, 156, 163, 169, 188, 190 reciprocal, 64, 209, 211 recoding, 156, 295 redundancy, 105, 138–140, 200, 215, 267, 269, 337 reduplication, 9 –10, 166, 289, 295, 301– 325 reflexive, 4, 9–10, 45, 54–56, 63–64, 67, 71, 79, 165 –166, 179, 209 –210, 216, 219–220, 222–228, 231, 285, 339, 372–373, 376

region, 50, 256, 302, 331, 365 relation, 20, 28, 30, 73, 97, 99, 118, 126, 176–177, 179, 182–184, 192, 211, 216, 245–246, 301, 304, 306 –307, 316, 318–319, 332, 339 –340, 347–348, 351, 359 relational structure, 44 – 45, 53, 56, 78, 80 – 81, 347 relationship, 18, 23, 30, 44 – 45, 49, 51– 61, 64 – 67, 69, 74, 79, 89, 100 –101, 106, 124–125, 128–132, 137–138, 141, 159, 161–162, 166, 169, 173–174, 176, 178, 181–185, 197, 199, 204, 209–211, 250, 313, 319, 324–325, 349–350, 363 one-to-many ~, 53, 56, 57, 80, 128– 132, 138, 176, 211 resource discovery, 157–158, 180 resource utilization, 157 rhythm, 194, 236 – 246, 250, 255, 257 – 258, 261–269 role grammatical ~, 163, 372 sample, 2, 8 –10, 13, 50, 78, 81– 82, 87, 89, 94, 97–98, 101, 118, 122–124, 134, 149, 166, 191, 212, 231, 250, 259, 263, 265, 272–274, 283–284, 288–290, 304, 307, 329, 332, 338, 340 –351, 354 –359, 366, 381, 383 sampling language ~, 366 schema, 58, 71, 169 –174, 186, 191–192, 203 search (see query) searchability, 77, 92, 165, 194 –195, 383 segment, 166, 181, 190, 194–198, 204, 258, 306, 314 –315, 324 selectional restriction, 220 –221, 230 semantic ~ unification, 125, 168, 180–181, 184, 203, 236 ~ integrity, 170 ~ map, 212–213 ~ network, 169

~ relativism, 210 ~ web, 185 semantics, 119, 129, 132, 134, 139, 156, 160, 168, 172–173, 177, 186, 192–193, 202, 204, 213, 307–308, 316–317, 325 semasiology, 210–215 Shoebox, 20, 21 sign relation, 301 SIL, 21, 28, 30, 39, 41, 57, 58, 61, 103, 167, 193, 200, 204, 248, 256, 270, 313, 365, 374–375 singular, 4, 5, 85, 88, 96, 98, 100, 104, 107, 112, 119, 121, 124, 125, 133, 135, 138, 143, 145–147, 231, 349–350, 368, 369, 373, 378, 381–385 SQL (see Structured Query Language) statistics, 2, 16, 44, 73, 133, 134, 172, 190, 191, 231, 293, 294 stem, 131–133, 145, 222, 289, 295, 314, 315, 333 stress, 9, 46, 165, 175, 177, 182, 193–194, 223, 235–238, 241, 247–270, 273–278, 315, 324, 388 Structured Query Language [SQL], 17, 25, 27, 31–32, 34, 66, 73, 160, 186, 194, 351–352, 356, 358 subject, 4, 13, 46, 134, 136, 141, 159, 161–164, 177, 188, 192, 202, 216, 244, 316, 335–336, 338, 349–351, 355, 369–373, 376, 381, 384, 388 dative ~, 162 intransitive ~, 62, 108–110, 126, 155, 159, 162–163, 216, 256–257, 266–267, 269, 271, 273–274, 276, 283, 296, 370, 372, 379 subsumption, 179, 182–183 suffix, 222, 275, 289, 293, 295, 317, 333, 336, 373 suppletion, 8, 117, 130–133, 137, 145–146, 149 syllable, 73–74, 165, 177, 183, 194, 199, 237–245, 250, 254–261, 266–278, 292, 302, 315, 370–371, 374

syncretism, 8, 78, 117–133, 137, 148–149, 285, 378 synonymy, 182 syntax, 26, 31, 33, 74, 119, 129, 132–135, 143, 165, 177–178, 181–182, 185, 214, 218–219, 236–237, 264, 285, 305, 315, 319, 324, 347, 368–370, 372, 388 target, 135–140, 143–148, 182, 210–211, 305 taxonomy, 171, 184–185, 188–189 tense, 121, 127, 145, 166, 214, 229, 316, 322, 331, 334 –335, 337, 371 theoretical commitment, 4, 6–7, 159, 163 tone, 165, 238, 239, 276, 278, 285, 287, 289, 292, 295, 369–371, 374 transcription, 38, 43, 83–84, 125, 339 – 340, 344–346, 348, 354, 359, 369–370 transitive verb, 20, 133 –134, 162, 179 tree structure, 20, 307–308 Typological Database System [TDS], 10, 14, 21, 42, 44, 155–204, 249, 252, 255, 264 typology, 1, 2, 8–9, 46, 149, 155, 165, 175, 204, 209–210, 212, 216–217, 219, 238, 247, 285–286, 292, 301, 304, 345, 347 canonical ~, 118 uncertainty, 192, 202 unclassified languages, 367 underspecification, 123–125 Unicode, 39, 40 – 43, 65, 103, 190, 195 – 198, 249, 252, 339, 357–358, 374–375 Universal Phoneme Positioning Chart [UPPC], 173, 195–198, 204 user-defined, 291, 310 value attribute- ~ pair, 224 NULL ~, 202 variation dialectal ~, 331, 346, 365 verb, 133–134, 141, 144, 162, 164, 169, 181, 183, 212, 265, 335–336

voice, 121, 127, 166, 316, 322 vowel, 72, 147, 183, 190, 195, 237, 239, 254, 257–258, 285, 287, 315, 358, 369, 371–372 warehouse (see data, warehousing) Web Ontology Language [OWL], 178, 183–185, 204 word class, 9, 20, 121, 129, 132, 147, 243, 315–316, 322, 340 word order, 46–48, 68, 72, 118, 162, 164–165, 180, 190, 192–193, 264, 339

basic ~, 20, 37, 46–48, 68, 72–73, 162, 164–166, 180–182, 189, 192, 264 predicate–based ~, 180, 182 XML (see Extensible Markup Language) XSLT, 185–186 zero anaphora (see anaphora, zero)

Index of languages This index lists all languages and language groupings (families) mentioned in the text, without attempting to indicate any relationships between the groups named.

Abkhaz, 217 Acehnese (Aceh), 108 Acholi, 108, 113 Adamawa (family), 366 Afrikaans, 316, 317 Afro-Asiatic (family), 108–110, 290, 366, 381 Ainu, 108, 112 Albanian, 108, 217 Algic (family), 101, 110 Altaic (family), 95, 109–110, 122 Amharic, 221 Apoze, 271 Apurinã, 288 Arabic, 38, 40, 108, 228–229, 316 Egyptian ~, 366 Gulf ~, 108, 113 Aramaic, 270 Arapesh, 108, 284 Arawakan (family), 82, 97, 108 Armenian, 271, 331 Classical ~, 119–121 Atlantic (family), 365–366, 376 Aucan (Mapuche, Araucanian), 271 Australian (family), 100, 108, 110, 209, 248, 262 Austro-Asiatic (family), 98, 106, 109– 110, 290 Austronesian (family), 2, 97, 108–110, 122, 262, 273–274, 290–291, 368 Awtuw, 108 Balear, 108 Balochi (Baluchi, Baluci), 81–82, 108 Bandjalang, 108 Banggarla (Parnkalla), 271

Bardi (Baadi, Badimaya), 271 Basque, 95, 100–101, 108, 112, 134, 221, 270, 297, 301, 332 Bedja, 372 Bengali, 40, 218 Benue-Congo (family), 366, 387 Berber, 289, 366 Berik, 108, 113 Bijogo (Bidyogo, Bijago), 365, 376–378, 388 Bikol, 314–317 Boto, 383 Brahui, 108 Breton, 222 Caddoan (family), 110 Cahuilla, 108, 271 Cambodian, 267 Campa, 82, 97, 108, 112 Carib (family), 100, 109, 122 Catalan, 108 Cavineña, 271 Cayubaba (Cayuvava), 271 Chadic, 290, 366, 369, 383–384 Chamorro, 271 Chapacura-Wanham (family), 110 Chechen, 108, 113 Chicheŵa, 134, 136 Chinese, Mandarin, 40, 108 Chinook, 88, 108 Coahuiltecan (family), 110 Comanche, 108 Croatian, 332 Cubeo, 95, 108, 112 Cushitic (family), 366 Czech, 271

402 Index of languages Daga, 108, 113 Daic (family), 98, 106, 108, 110 Dakota (Sioux), 100, 108, 270–271 Dalabon (Ngalkbun, Boun), 271 Dâw, 290 Degema, 369 Dehu (Lifu), 271 Dieri (Diyari), 108, 113, 271 Djingili (Jingulu, Tjingili), 100, 108, 271 Dogon (family), 366, 373 Dong, 108 Dravidian (family), 95, 100, 108–110, 122 Dutch, 108, 164 Ekpeye, 387 Emae (Mae), 271 English, 28–30, 33–34, 38, 47–48, 58, 61– 62, 85, 108, 130, 138, 211, 214, 217– 219, 222–224, 236 –237, 301, 320, 340, 372 English-based Creole, 110 Eskimo-Aleut (family), 95, 110, 122 Evenki, 284 Fiji (Fijian), 97, 99, 108, 112, 316 Finnish, 109, 332, 334 Finnish Romani (Romani dialect), 334 Florina Arli (Romani dialect), 333–334 French, 39, 107, 109, 112, 222, 226, 238, 271, 364, 372 French-based Creole (family), 109 Fur (family), 366 Garawa, 271 Georgian, 109, 112, 135, 218–219, 271 German, 47–48, 51, 107, 109, 216, 221, 223, 226, 336 Gilyak, 109, 113 Godié, 109, 112 Greek, 38–39, 109, 302, 331, 336 Guarao (Warao), 271 Gwari, 371, 388 Gur (family), 366, 368

Guruntum, 383 Haitian Creole French, 109 Halkomelem, 109 Hausa, 109, 113 Hawaiian, 266 Hebrew, 38, 40, 109, 112, 222, 270–271 Hindi, 268, 330 Hmong Mien (family), 109 Hmong Njua, 109 Ho, 109, 113 Hokan (family), 89, 110 Hopi, 266 Hungarian, 135, 222, 271, 332, 336 Icelandic, 271 Igboid (family), 387 Ijoid (family), 366 Indo-Aryan Early New ~, 330 Indo-Aryan (family), 329–330, 331 Indo-European (family), 92, 98, 108– 110, 121–124, 222, 302 Indo-Iranian (family), 108 Indonesian, 221, 248, 273, 317 Iraqw, 109 Iroquoian (family), 109 Italian, 28, 33–34, 64, 222, 301 Japanese, 40, 51, 109, 220–221 Kabardian, 109, 112 Kaingáng, 109 Kalihna, 100, 109, 113 Kambaata, 372 Kannada, 109, 112 Karelian, 271 Kashmiri, 223, 330 Kayardild, 135 Khoisan (family), 110 Khoi-San (family), 367, 381 Kiowa, 97, 100 –101, 109, 112 Kiowa-Tanoan (family), 97, 100, 109, 122

Index of languages 403 Koasati, 109, 284 Komuz (family), 366 Kongo, San Salvador, 109 Kordofanian (family), 366 Korean, 40, 221 Kosovo Bugurdži (Romani dialect), 333–334 Kru (family), 366 Kuku-Yalanji, 271 Kutenai, 101, 109 Kwa (family), 366 Kwakuitl, 109 Kwaza, 314, 316 Ladahki, 109, 113 Lak, 288 Language Isolate, 108–110 Latin, 38, 213, 217, 222, 244, 302 Latvian, 109, 267, 271, 334 Latvian Romani (Romani dialect), 334 Lega-Beya, 368 Lezgian (Lezgi, Kiurintsy), 222, 271 Lillooet, 109 Lingala, 221 Lithuanian, 109, 112 Liv (Livonian), 271 Lovari (Romani dialect), 334 Lugbara, 109, 112 Luiseño, 109 Maban (family), 366 Macedonian, 271 Macro-Ge (family), 109–110 Makian, West, 109 Malagasy, 221–223 Mam, 270 Mande (family), 366 Mansi (Vogul), 271 Manuš (Romani dialect), 334 Maranunggu, 271 Marghi, Central, 109 Maricopa, 221 Marshallese, 109

Maxakalí, 109 Mayali, 135 Mayan (family), 110, 225 Meso Grande Diegueño, 271 Mískito, 109 Misumalpan (family), 109 Miwok, Central Sierra, 109 Mixtec, 222, 230 Mixteco-Chalcatongo, 98, 109 Mohawk, 109 Mokilese, 314 Mongolian, 28, 33, 95, 109, 112 Mullukmulluk (MalakMalak), 271 Munda, 290 Muskogean (family), 109 Na-Dene (family), 101, 110 Nahuatl (Teletcingo Nahuatl), 109, 221 Nahuatl, Classical, 109 Nama, 99, 110 Nauruan, 288, 290 Navaho, 101, 110, 113 Nengone, 271 Ngandi, 100, 110, 112 Ngizim, 383 Niger-Congo (family), 100, 109–110, 366, 368, 376, 381, 384, 387 Nilo-Saharan (family), 108–109, 122, 366, 381, 384 North Caucasian (family), 108 –109 Nupoid (family), 371 Nuuchahnulth, 290 Ojibwa, 101, 110, 112, 135 Ojibwa, Western, 110 Omotic (family), 366, 384 Ono, 271 Oto-Manguean (family), 98, 109–110 Paipai, 89, 110 Paiute, Southern, 271 Pakaásnovos, 110 Palauan, 110, 113, 135

404 Index of languages Papiamentu, 309, 314 Penutian (family), 88, 108–109 Pidgin, Nigerian, 110 Pintupi-Luritja, 271 Pirahã, 267 Piro (Yine), 271 Pitta pitta (Bidhbidha), 271 Podoko, 221 Polish, 110, 244, 271 Pomo, Eastern, 110 Potawatomi, 110 Proto-Romani, 330 –331 Punjabi, 330 Qafar, 135 Quecha, Huanuco Huallaga, 110 Quechua, 113, 289 Quechuan (family), 110 Rikbaktsa, 110 Roman (Romani dialect), 38, 43, 103, 334 Romance (family), 222 Romani, 10–11, 329–360 Early ~, 331, 333 Proto- ~, 330–331 Finnish Romani, 334 Florina Arli, 333–334 Kosovo Bugurdži, 333–334 Latvian Romani, 334 Lovari, 334 Manuš, 334 Roman, 38, 43, 103, 334 Rumelian Romani, 333–334 Rumungro, 334 Russian Romani, 334 Sepečides, 333–334 Serbian Kalderaš, 334 Sinti, 334 Šutka Arli, 340 Welsh Romani, 334 Yerli, 336 Romanian, 110 Ruija, 271

Rumungro (Romani dialect), 334 Russian, 39, 118, 124–125, 129, 133–135, 138–149, 217–218, 259, 334 Saharan (family), 366 Salish, Southern Puget Sound, 110 Salishan (family), 109, 110 Selepet, 271 Semitic (family), 366 Sepečides (Romani dialect), 333–334 Sepik-Ramu (family), 97, 100, 108, 110, 122 Serbian Kalderaš (Romani dialect), 334 Sierra Miwok, 270 Sino-Tibetan (family), 108–109 Sinti (Romani dialect), 334 Siouan (family), 100, 108 Sirionó, 110 Somali, 110 Songhai (family), 366 Soninke, 221 Sorbian, 271 Sotho, Southern, 110 South Caucasian (family), 109 Spanish, 110, 162, 221–222, 226 Sudanic Central ~ (family), 366 Western ~ (family), 366 Šutka Arli (Romani dialect), 340 Swahili, 28, 30, 33, 58–59, 61, 71, 110, 112, 188, 271 Swedish, 110 Tagalog, 314 Tajik, 271 Tamazight, Central Atlas, 110 Tamil, 40, 135 Tauya, 110 Telugu, 40, 95, 99–100, 110, 112 Thai, 38, 40, 98, 106, 110, 113 Tok Pisin, 97, 110, 112 Tonkawa, 110 Torricelli (family), 108

Index of languages 405 Totonac, 230, 231 Trans-New Guinea (family), 108, 110, 122 Tsakhur (Tsaxur), 135, 223–224, 270 Tübatulabal, 271 Tucanoan (family), 95, 108 Tunica, 110, 284 Tupi (family), 110 Turkana, 135 Turkish, 110, 112–113, 244, 270, 332, 336 Tzotzil, 221, 225–227, 230 Tzutujil, 110 Ubangi (family), 366 Uralic (family), 109, 122 Urdu, 330 Uto-Aztecan (family), 108–109 Uzbek, Northern, 271 Valencian, 108 Vietnamese, 98–99, 106, 110, 113 Vod (Votic), 271 Wakashan (family), 109, 290 Wappo, 110 Wargamay, 268 Welsh, 110, 113, 332, 334

Weri (Were), 271 West Papuan (family), 109 Wichita, 110 Wolaytta, 110, 113 Woleaian, 314 Wolof, 368 Wongkumara (Wankumara), 271 Xokleng, 110 Yaouré, 100, 110 Yerli (Romani dialect), 336 Yiddish, 221 Yimas, 97, 100, 101, 110, 112, 135 Yoruba, 51, 221 Yuki (family), 110 Yup’ik, 110, 112 Central ~, 95, 110 Zaar (family), 383 Zapotec (Zapoteco), 98, 110, 113, 222, 230 Zapoteco Yatzachi, 110 Zuni, 110, 112

Index of persons

Afcan, Paschal, 95 Albu, Mihai, 293, 295 Anderson, John M., 237 Aronoff, Mark, 126 Arthur, Isa, 365 Asher, Ron E., 290 Auwera, Johan van der, 212, 286 Baerman, Matthew, 118, 122, 124, 126, 149, 285 Bailey, Todd Mark, 265–271, 273–274 Bakker, Peter, 165, 332 Baron, Cynthia L., 103 Bechhofer, Sean, 183 Begg, Carolyn, 49 Benveniste, Émile, 95, 367 Bergman, Brita, 2 Bibiko, Hans-Jörg, 284 Bird, Steven, 40, 117 Blake, Barry J., 94, 98 Bliss, Heather, 4, 8, 10, 24, 77, 85–86, 97, 99, 107 Blust, Robert, 2 Booij, Geert, 145 Boretzky, Norbert, 330, 332 Botha, Rudolf P., 316 Bowers, David, 149 Braem, Penny Boyes, 2 Brown, Dunstan, 117–118, 122, 124, 126, 129, 285 Brown, Penelope, 98 Buchholz, Oda, 217 Bybee, Joan, 317 Campbell, Russell N., 98 Carstairs-McCarthy, Andrew, 126 Chaffin, Roger, 183 Chashchikhina, Olga, 346

Chileva, Veliyana, 346 Chomsky, Noam, 72 Chumakina, Marina, 8, 117 Clark, David J., 387 Clements, George N., 106 Codd, Edgar F., 28 Coleman, Ross, 6 Comrie, Bernard, 120, 162, 214–215, 283, 292, 296, 307 Connolly, Thomas, 49 Copestake, Ann, 125 Corbett, Greville G., 8, 94, 96–97, 117–118, 121, 124, 126, 135–136, 138–140, 144–145, 148–149, 182, 288–289, 293 Corcho, Oscar, 178, 181 Cornelisse, Aglaia, 248 Cysouw, Michael, 293, 295–296 Date, Chris J., 171, 202 Déchaine, Rose-Marie, 95 Dench, Alan, 209, 210, 211 Dimitriadis, Alexis, 1, 3, 6, 8, 10, 13–14, 21, 44, 155, 165, 181, 199, 249, 264 Dixon, Robert M. W., 162, 209–210, 335 Dress, Andreas, 293 Dryer, Matthew S., 167, 283, 286, 289, 293, 295–296, 307 Dunn, Michael, 72 Elšík, Viktor, 329–330, 359 Evans, Nicholas, 121, 209–211, 232 Evans, Roger, 125 Ewen, Colin J., 237 Fabri, Ray, 126 Facundes, Sidney da Silva, 284, 288 Farrar, Scott, 5 Fernández-López, Mariano, 178, 181

Ferrara, Marisa, 21 Fiedler, Wilfried, 217 Foley, Robert A., 101 Foley, William A., 101 Frajzyngier, Zygmunt, 335 Gadolina, Margarita Anatol’jevna, 145 Gast, Volker, 9–10, 166, 209, 212, 216–224, 285 Gazdar, Gerald, 125 Gentry, Roger, 149 Gil, David, 283, 288, 296, 306–307, 317 Gillam, Richard, 40 Gilliat-Smith, Bernard J., 331 Givón, Talmy, 335 Goedemans, Rob, 8–10, 46, 72, 155, 165, 193, 235, 247–249, 251–253, 256, 258, 260–261, 268 Gómez-Pérez, Asunción, 178, 181 Good, Jeff, 199 Gordon, Matthew, 30, 33–34, 58, 61, 249, 269–271, 273 Gray, Russell, 2 Greenberg, Joseph H., 78, 94–96, 126, 248, 254, 285 Greenhill, Simon, 2 Gruber, Thomas R., 178, 180 Haberl, Heather, 107 Haimerl, Edgar, 10 Halle, Morris, 72, 242, 248 Halteren, Hans van, 74 Hanke, Thomas, 2 Hanson, Rebecca, 97, 106–107 Harley, Heidi, 95, 97, 106–107, 149 Harmelen, Frank van, 170 Harms, Robert T., 242 Harrison, Sheldon P., 314 Haspelmath, Martin, 7, 9–10, 15, 44, 159, 162, 167, 212–213, 222, 232, 283, 286, 289, 294, 296, 307, 313 Hayes, Bruce, 246, 248, 268 Heath, Jeffrey, 291

Hendler, Jim, 183 Hendriks, Bernadette, 248 Hengeveld, Kees, 164, 166, 169 Herrmann, Douglas J., 207–209 Heuven, Vincent van, 248 Hewitt, B. George, 217, 219 Hippisley, Andrew, 117 Hjelmslev, Louis, 126 Hole, Daniel, 166, 212, 216, 232 Hook, Peter Edwin, 223 Horrocks, Ian, 183 Hulst, Harry van der, 9–10, 46, 72, 165, 193, 235, 242–243, 246–249, 251–256, 258, 261, 265, 268, 285 Hurch, Bernhard, 9–10, 166, 242, 301, 306, 317 Hyman, Larry M., 237, 248, 254, 371 Ibriszimow, Dymitr, 389 Idsardi, William J., 242 Igla, Birgit, 330, 332 Jacobson, Steven A., 95 Jakobson, Roman, 286 Jasperson, Robert, 335 Jesney, Karen, 107 Jones, Charles, 237 Kager, René, 183, 242 Kamholz, David, 72 Kari, Ethelbert E., 369 Kashube, Dorothy, 248 Kayser, M. S. C., 288 Khiba, Zaira K., 217 Kieviet, Paulus-Jan, 248 Kiparsky, Valentin, 144–145 König, Ekkehard, 166, 216–217, 219–222, 232, 285 Kooij, Jan, 242 Koul, Ashok K., 233 Koul, Omkar N., 233 Kouwenberg, Silvia, 314 Krauss, Michael, 95

Ladefoged, Peter, 194 Lahiri, Aditi, 242, 247 Landis, T. Y., 183 Langendoen, D. Terence, 5 Langeweg, Simone, 248 Lascarides, Alex, 125 Léger, Jean, 373 Lehmann, Christian, 144, 211 Levinson, Stephen, 98 Lewis, William D., 158 Liberman, Mark, 1, 236, 242 Lloyd, Barbara B., 179 Lockwood, David G., 248 Lönngren, Lennart, 140 Lynch, John, 288 Lyutikova, Ekaterina A., 224 MacWhinney, Brian, 12 Maddieson, Ian, 166, 194, 254, 287, 294 Magaji, Daniel J., 371 Maier, Ingrid, 140 Marriott, Paul, 145 Maslova, Elena, 211 Matras, Yaron, 10, 11, 214, 329, 330–332, 335–336, 346, 359 Mattes, Veronika, 9–10, 166, 301, 314–317 Matthews, Peter H., 118 McGarrity, Laura W., 242 McGinnis, Martha, 95 McGuinness, Deborah L., 183 McMahon, April, 3 McMahon, Rob, 3 Miehe, Gudrun, 368 Miklosich, Franz, 331 Mills, Timothy Ian, 107 Mineur, Anne-Marie, 164 Miyako, Osahito, 95 Monachesi, Paola, 164 Moran, Steven, 21 Mosel, Ulrike, 211–212 Moseley, Christopher, 290 Murkelinskij, G. B., 288

Musgrave, Simon, 1, 3, 8, 13, 42, 149, 199 Nerbonne, John, 1, 2 Nespor, Marina, 183 Nichols, Johanna, 286 Niepokuj, Mary, 317 Noyer, Rolf, 126 Olsen, Gregory R., 178 Orešnik, Janez, 125 Otanes, Fe T., 314 Pagliuca, William, 317 Patel-Schneider, Peter F., 183 Payne, Thomas Edward, 185, 189 Peck, Daniel, 103 Peralta Ramirez, Valentín, 315 Perkins, Revere D., 317 Perlmutter, David M., 125 Pinto, Manuela, 164 Pizzuto, Elena, 2 Plungian, Vladimir A., 212 Pott, August, 305, 331 Pozdniakov, Konstantin, 388 Prince, Alan, 236, 242–243, 246 Procházka, Stephan, 316 Pulgram, Ernst, 238 Quilliam, Harley, 149 Rajaonarisoa, Fara, 223 Randriamasimanana, Charles, 223 Rector, Alan, 183 Reed, Irene, 95 Reesink, Ger, 72 Riggs, Stephen R., 100 Rijkhoff, Jan, 164, 169 Ritter, Elizabeth, 4, 8, 10, 24, 77, 85–86, 95, 97, 99, 106–107 Roca, Iggy, 242 Rosch, Eleanor, 179 Rubino, Carl R., 302, 316 Ruhlen, Merritt, 290

Sagey, Elizabeth, 106 Sapir, J. David, 365 Saulwick, Adam, 8, 155, 178, 180–181 Schachter, Paul, 314 Schütz, Alfred, 316 Schwartz, Steven A., 103 Segerer, Guillaume, 4–5, 8, 11, 167, 363, 365, 376–377, 388 Sengupta, Gautam, 218 Siemund, Peter, 166, 216, 223 Siewierska, Anna, 94, 135, 164–165, 169, 180, 283, 288 Simons, Gary, 40, 117 Smith, Norval, 14, 165, 214–215, 254 Smolensky, Paul, 246 Sohn, Ho-min, 314 Stein, Lynn Andrea, 183 Storey, Veda C., 201 Strachan, Anne, 49 Stuckenschmidt, Heiner, 170 Stump, Gregory T., 125

Taylor, John R., 179 Tenser, Anton, 346 Terrill, Angela, 72 Thompson, Evan, 179 Tiberius, Carole, 8, 117, 136 Timberlake, Alan, 141–142 Turner, Ralph L., 330

Varela, Francisco J., 179 Veer, Kees van der, 247 Vergnaud, Jean-Roger, 242, 248 Vijver, Ruben van de, 248 Vincent, Nigel, 6, 248 Visch, Ellis, 248, 251–252, 256, 268 Vogel, Irene, 183 Voort, Hein van der, 314, 316 Wali, Kashi, 223 Welty, Chris, 183 Wichmann, Søren, 72 Wierzbicka, Anna, 141, 335 Wilbur, Ronnie, 303 Wilson, William A. A., 365 Windhouwer, Menzo, 8, 155, 270 Winston, Morton E., 183 Wunderlich, Dieter, 126 Xajdakov, Said M., 288

Zanten, Ellen van, 252–253, 258, 273 Zribi-Hertz, Anne, 223 Zwicky, Arnold, 95, 125