Challenging the Myth of Monolingual Corpora [1 ed.] 9789004276697, 9789004276680

Challenging the Myth of Monolingual Corpora brings new insights into the monolingual ideal that has permeated most branc

English Pages 248 Year 2017

Author / Uploaded
Arja Nurmi
Tanja Rütten
Päivi Pahta

Challenging the Myth of Monolingual Corpora

Language and Computers Studies in Digital Linguistics

Edited by
Christian Mair (University of Freiburg, Germany)
Charles Meyer (University of Massachusetts at Boston)

Editorial Board
Mark Davies (Brigham Young University)
Anke Lüdeling (Humboldt University)
Anthony McEnery (Lancaster University)
Lauren Squires (Ohio State University)

Volume 80

The titles published in this series are listed at brill.com/lc

Challenging the Myth of Monolingual Corpora

Edited by

Arja Nurmi
Tanja Rütten
Päivi Pahta

LEIDEN | BOSTON

The Library of Congress Cataloging-in-Publication Data is available online at http://catalog.loc.gov LC record available at http://lccn.loc.gov/2017027646

Typeface for the Latin, Greek, and Cyrillic scripts: “Brill”. See and download: brill.com/brill-typeface. issn 0921-5034 isbn 978-90-04-27668-0 (hardback) isbn 978-90-04-27669-7 (e-book) Copyright 2017 by Koninklijke Brill NV, Leiden, The Netherlands. Koninklijke Brill NV incorporates the imprints Brill, Brill Hes & De Graaf, Brill Nijhoff, Brill Rodopi and Hotei Publishing. All rights reserved. No part of this publication may be reproduced, translated, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission from the publisher. Authorization to photocopy items for internal or personal use is granted by Koninklijke Brill NV provided that the appropriate fees are paid directly to The Copyright Clearance Center, 222 Rosewood Drive, Suite 910, Danvers, MA 01923, USA. Fees are subject to change. This book is printed on acid-free paper and produced in a sustainable manner.

Contents

Preface
List of Illustrations

1 How Many Languages are there in a Monolingual Corpus?
  Arja Nurmi and Tanja Rütten
2 Indian English or Indian Englishes? Accounting for Speakers’ Multilingual Repertoires in Corpora of Postcolonial Englishes
  Claudia Lange
3 Mono- and Multilingualism in a Specialized Corpus of New Zealand Stories
  Alexander Onysko and Marta Degani
4 What Happens to Ongoing Change in Multilingual Settings? A Corpus Compiler’s Perspective on New Data and New Research Prospects
  Mikko Laitinen
5 Multilingual Speakers, Multilingual Texts: Multilingual Practices in Learner Corpora
  Marcus Callies and Leonie Wiemeyer
6 Multilingualism in English as a Lingua Franca: Flagging as an Indicator of Perceived Acceptability and Intelligibility
  Niina Hynninen, Kaisa S. Pietikäinen and Svetlana Vetchinnikova
7 English Commonplace Books as Multilingual Receiver Corpora
  Thomas Kohnen
8 Multilingual Practices in the Corpus of English Religious Prose: Annotation and Access
  Tanja Rütten
9 Semi-automatic Discovery of Multilingual Elements in English Historical Corpora: Methods and Challenges
  Jukka Tyrkkö, Arja Nurmi and Jukka Tuominen
10 ‘Multilinguality’ in Learner Corpora: The Case of the MILE
  Rolf Kreyer
11 Multilingualism and Quotations from a Corpus-Linguistic Perspective: A Case Study of Samuel Taylor Coleridge’s Biographia Literaria
  Mark Kaunisto

Preface

This volume came about as a continuation of the Are there monolingual corpora? (Computer says no) workshop at the 36th ICAME conference held at the University of Trier in May, 2015, organised by Arja Nurmi and Päivi Pahta. Tanja Rütten was one of the presenters at the workshop and in this volume we bring together our joint interests in multilingualism and the issue of corpus design and annotation. During the workshop it became evident that there was a need for further discussion on multilingual practices in supposedly monolingual corpora. Many of the papers within these covers are versions of presentations given at the workshop, but several were solicited from authors we knew to be working in fields of English corpus linguistics where the question of multilingualism is (also) present. The volume as a whole benefitted immensely also from the contributions and discussions of presenters who, for one reason or another, could not publish their presentation here.

One of the starting points for the workshop and this volume was the Multilingual Practices in the History of Written English project, funded by the Academy of Finland 2012–2016 (project number 258434). The project was led by Päivi Pahta, and Arja Nurmi was a senior scholar involved in the project. Among the authors, Jukka Tyrkkö and Jukka Tuominen were also associated with the project.

We would like to thank the Academy of Finland for funding the project. We also thank the degree programme for English language, literature and translation studies at the Faculty of Communication Sciences at the University of Tampere for financing the hiring of Ms Veera Saarimäki, MA, as our editorial assistant in 2016–2017. Without her painstaking eye for detail, the volume would not be what it is today. Finally, we are grateful to Brill and the editors of the Language and Computers, Studies in Digital Linguistics series, Professors Christian Mair and Charles Meyer, for publishing the volume.

The Editors
Tampere and Cologne
April 2017

List of Illustrations

Figures

1.1 Multilingual and multivoiced practices
2.1 South Asian language families
4.1 Covering the informational-interactional continuum of genres
4.2 Ratios of core per emergent modals in the six corpora
4.3 Patterns of situational variation in written and spoken ELF corpora
4.4 Order of frequency of core modals in BrE/AmE and in ELF
4.5 Order of frequency of emergent modals in BrE/AmE and in ELF
9.1 Flowchart of the Multilingualiser tagging process
9.2 Dictionary cross-checking
10.1 A facts sheet on Hong Kong
10.2 Materials and task description, student’s original text and text without expressions that can be found in the materials
10.3 The lexical frequency profile aggregated over all 16 texts before the deletion of common lexis and after (left); change of proportion after deletion (right)
10.4 The change rate for individual frequency bands after deletion of common lexis
10.5 Original collocations (in boxes) and collocations from materials and task descriptions (underlined) in two sample texts

Tables

2.1 Overview of ICE-corpora and their sampling of speakers’ metadata
2.2 Typological distribution of the 122 scheduled and non-scheduled languages of India in the 2001 census
2.3 Representation of language families in ICE-India
2.4 Words marked as ‘indigenous’ in ICE-India and the OED edition in which they were first mentioned
4.1 Key characteristics in compilation
4.2 Frequencies and ratios of modals in two native varieties and in non-native use
4.3 The most frequent modal types in written and spoken non-native corpora
8.1 The genre network structure of COERP
8.2 Coding schema for multilingual elements in COERP
10.1 Paraphrases in three grades of intermediate learners of English
10.2 The estimated composition of the MILE
10.3 The composition of the sample for the present study
10.4 Proportion of content words from materials in exam texts
10.5 Number of collocations based on materials and task descriptions per words for each text (left) and per 1000 words (right)
11.1 The numbers of words in languages other than English in Biographia Literaria
11.2 The ten most quoted authors (in English) in Biographia Literaria
11.3 The numbers of words in passages translated from foreign languages in Biographia Literaria
11.4 The frequencies of most commonly occurring personal and possessive pronouns in the quotations and Coleridge’s own text in Biographia Literaria

CHAPTER 1

How Many Languages are there in a Monolingual Corpus?

Arja Nurmi and Tanja Rütten

© Koninklijke Brill NV, Leiden, 2017 | doi 10.1163/9789004276697_002

1 Introduction

The monolingual corpus as a monolithic, single-language database, representative of the language of likewise monolingual speakers or writers, is a tacit and probably only half-conscious, but convenient, invention by the corpus linguist. This is in line with the common societal assumptions of western societies about “one nation, one language” that arose in France during the revolution, spread across Europe over the nineteenth century and have dominated European thinking ever since. In linguistics this has inevitably resulted in an emphasis on the analysis of single languages, largely in isolation from each other. The notable exception from early on is research on language contact, examining the impact of one independent language system on the lexico-grammatical structure of another.

However, not a single one of the world’s just over two hundred countries is monolingual (Deumert 2011, 262), and depending on our definition of bi- or multilingualism, it could be argued that the vast majority of the global population is in fact multilingual (see e.g. Edwards 2006, 7 or Li Wei 2007, 3–11). If we zoom in on Europe alone, a recent survey on Europeans and their languages carried out by the European Commission indicates that 54% of the population of EU member states meet the criterion for functional multilingualism, i.e. they are able to hold a conversation also in a language other than their mother tongue. To take an example from another corner of the world, the Australian census of 2006 lists 388 languages spoken in the homes of 16.8% of the population (Deumert 2011, 273). Surely, linguistic realities like these must have an impact on the authentic language data that corpus compilers store in their corpora. The question, then, arises: Is multilingualism reflected in our corpora? If it is, how? And how do we as corpus linguists deal with it?

The question of how we define multilingualism is also relevant here (for the history of the concept, see e.g. Li Wei 2007). In this volume, multilingualism is seen, not as the traditional ideal of a balanced bilingual with a command of two languages that he or she has grown up with, but rather in terms of the speakers’ linguistic resources and repertoires that originate in multiple
languages, and their ability to apply those resources in their speech or writing. We see the potential for multilingualism both in individuals and in societies. Even if we do not necessarily agree with the position of Edwards (2006), who argues that modern speakers of English who are familiar with such individual foreign-language words such as tovarich or expressions such as Guten Tag are multilingual individuals, it is obvious that the definition of multilingualism should be inclusive of a range of abilities. Perhaps the most inclusive definition is given by Blommaert (2010, 102), according to whom [m]ultilingualism … should not be seen as a collection of ‘languages’ that the speaker controls, but rather as a complex of specific semiotic resources, some of which belong to a conventionally defined ‘language’, while others belong to another ‘language’. The resources are concrete accents, language varieties, registers, genres, modalities such as writing—ways of using language in particular communicative settings and spheres of life, including the ideas people have about such ways of using, their language ideologies. Even if we adopted a somewhat more restrictive outlook, remaining in the sphere of different conventionally defined ‘languages’, we can safely say that monolingualism as a quality of either individuals or societies has always been a minority phenomenon. People throughout history have gained command of more than one language through education, professional contacts, personal interests, or migration—simply by virtue of living in a multilingual society and having to find ways to communicate with speakers of other languages. Even a very basic command of a language allows a speaker or writer to incorporate elements of it into their communication, i.e. to make use of their multilingual resources. 
By way of experiment, if we turn our attention to a standard corpus of English, such as the British National Corpus, we can easily find many instances of multilingual practices that fit an even stricter definition of multilingualism than that given by Edwards. The following examples were retrieved using random French, German and Latin phrases, and represent both informative (1, 3) and imaginative writing (2). Some searches reveal lengthier passages in another language, like example (1), which implies a considerable conversational fluency in the use of multilingual resources. Some hits occur in contexts that seem to prompt the use of the relevant language in the communicative situation, including reported conversations with speakers of other languages, as in example (2); in such contexts it is common to find several successive expressions in the same language. Again, some degree of competence in more than
one language can be assumed. Interestingly, the search also reveals instances like example (3) where multilingualism reflected in the text does not rest on the speaker’s comprehension of multiple languages, which is a common criterion, used, for example, by Edwards (2006). (1) After a while he returned, came over to me and, though I half expected a smack, said, ‘Maintenant, il y a un nouvel relation entre nous. Maintenant nous serrons camarades.’ We’d done it—(BNC: FS0 1727) (2) The fräulein smiled and said, ‘Auf wiedersehen.’ Karelius alone used the old Austrian farewell: ‘Ich küsse die Hand.’ (BNC: B20 1488) (3) What puzzles him, and us, is United’s newly disencrusted coat of arms and its motto ‘ex nihilo, nihil fit.’ I haven’t the faintest what it means (BNC: K4T 9034) As is apparent from examples (1)–(3), multilingual practices can also be seen as multivoiced practices, where quotations can represent the voice of someone other than the author (1, 3). Such quotations can also perform many of the same functions regardless of the language used, so that many English elements bear a resemblance to the French, German and Latin passages illustrated. Such quotative practices range from literary discussions and academic discourse conventions to language learning environments, where linguistic items from textbooks and teaching material are adopted and adapted to the linguistic repertoire of the learner. In both cases, speakers and writers make use of linguistic material that, in one sense, can be described as ‘other than their own’ and so produce a multivoiced text. While these multivoiced practices are not always multilingual (just as multilingual practices are not necessarily multivoiced), they bear a great deal of resemblance to multilingual practices, identified both in spoken language code-switching and written language data, and discussing them in this context will provide new insight into both phenomena. 
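The search experiment behind examples (1)–(3) can be approximated in a few lines of code. What follows is a minimal sketch under stated assumptions, not the procedure actually used to query the BNC: the function name `find_phrase_hits`, the seed list and the tiny in-memory sample are all hypothetical, and the third sample sentence (with the made-up identifier X00 0001) is invented for contrast.

```python
import re


def find_phrase_hits(sentences, phrases):
    """Return (sentence_id, phrase, text) for every seed phrase found in a sentence.

    Matching is case-insensitive and anchored so that a seed phrase
    does not match inside a longer word.
    """
    hits = []
    for sent_id, text in sentences:
        for phrase in phrases:
            pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
            if re.search(pattern, text, flags=re.IGNORECASE):
                hits.append((sent_id, phrase, text))
    return hits


# Two sentences taken from examples (2) and (3) above; the third is invented.
SAMPLE = [
    ("B20 1488", "The fräulein smiled and said, 'Auf wiedersehen.'"),
    ("K4T 9034", "its motto 'ex nihilo, nihil fit.' I haven't the faintest what it means"),
    ("X00 0001", "Nothing unusual happens in this sentence."),
]

# Hypothetical seed phrases in German, Latin and French.
SEEDS = ["auf wiedersehen", "ex nihilo", "il y a"]

hits = find_phrase_hits(SAMPLE, SEEDS)
for sent_id, phrase, text in hits:
    print(f"{sent_id}: matched '{phrase}' in: {text}")
```

A seed-phrase search of this kind only finds what the analyst already thought to look for, which is one reason why the question of annotation, taken up below, matters for corpus users.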
Figure 1.1 Multilingual and multivoiced practices (diagram labels: Multilingual, Multivoiced).

Figure 1.1 illustrates the relationship of the concepts of multilingual and multivoiced practices as we conceive them in this volume. The combination of elements from more than one language, or voice, in a single communicative episode (whether a conversation or a text) thus appears much more common than is generally assumed, and may even be the rule rather than the exception. This point is supported in virtually all contributions collected in the volume at hand, from a historical as well as a present-day and cross-cultural perspective. It is also supported by e.g. Mair (2011), discussing the frequent use of Jamaican Creole in the spoken language of even
educated Jamaican speakers in the ICE-Jamaica corpus. Mair further makes the point that in corpus-based studies of World Englishes multilingual contexts have long been ignored, and advocates for a more systematic study of multilingualism, both in interactive computer-mediated contexts and in spoken urban surroundings (2009, 436). On the other hand, recent research on some corpora compiled for analysing the history of English shows that multilingual practices are found in written texts from all historical periods (see e.g. Pahta and Nurmi 2011; see also Pahta et al. in press). So it is time that we addressed the questions of, firstly, just how many languages there are in what we often assume are monolingual corpora of, say, English, and secondly, how we can compile corpora that better represent actual language use in contexts where standard English is just one of the varieties and languages in use.

This volume, then, brings together papers that investigate the presence of multilingual practices in supposedly monolingual corpora. The corpora discussed represent a broad range of Englishes and include present-day synchronic varieties of English as well as historical and diachronic perspectives. Contributions address the corpus compilers’ views as well as the annotators’ and users’ perspectives. Viewpoints range from explicitly multilingual practices that are consciously taken into consideration in the compilation and annotation process to implicitly multivoiced perspectives, where philological insight is used in unearthing multilingual practices in what superficially looks like a monolingual English corpus.

In the next section, we will briefly look at the sociological and language ideological underpinnings of the supposition of monolingualism in corpora (globalisation, superdiversity etc.). Section three presents the guiding questions for the volume and briefly reviews how individual contributions have answered them. Assessments range from the perspectives of research on multilingualism in the traditional sense of the concept to more innovative approaches, where the notion of multilingualism is extended to voices other than the author’s and is thus halfway independent of the actual language that is used by the producer of the speech event. Section four rounds off this introduction by discussing ways to find, distinguish and describe non-English elements in ‘monolingual corpora’.

2 Monolingualism: Fact or Fiction?

As mentioned above, monolinguals are a minority among the global population. Our focus in this volume is on English, hence we discuss the topic from that perspective, but many of the trends identified in English-speaking countries can also be seen elsewhere. In many countries different languages live side by side, are used in different registers and on different occasions. So in Tanzania, for example, speakers may have one native language they speak at home, while they are educated in Kiswahili, which is one of the lingua francas used also for e.g. business encounters. English plays a role in higher education and administration, and any number of other local languages may form a part of an individual speaker’s linguistic repertoire (Melchers and Shaw 2011, 136). In terms of English world-wide, Mesthrie (2006, 482) goes as far as to claim that in these contexts monolingualism is “the marked case”, while in the current globalising (or globalised) society, the “ideal speaker” encounters the need to draw on their linguistic resources in order to interact with people from all kinds of different backgrounds, whether in terms of solidarity or adversity, meeting as equals or negotiating power hierarchies. The “polyphony of codes/languages” can be seen as the native language of people in the context of New Englishes, but, in our view, more and more as the native language of people all over the world; the growing body of research on urban multilingualism and superdiversity provides ample evidence for this trend (see e.g. Blommaert and Rampton 2011, Creese and Blackledge forthcoming, Meyerhoff and Stanford 2015).

In addition to spoken interaction, multilingual practices are frequently in evidence in computer-mediated communication. It seems that there are still many hindrances to writing in non-prestige varieties, such as Jamaican Creole, in traditional media, unless it is for the purposes of folklore or quoting individual speakers. This has changed in e.g. diasporic online forums, where speakers make use of multiple languages and varieties to construct their meanings. Mair and Pfänder (2013, 541) note that multilingual practices in their data are
not a reflection of poor linguistic skills, but on the contrary they “are almost exclusively found with forum users who have full command of the normative varieties of the locally dominant languages and who thus use multilingual writing as an additional resource”. Is there any such thing as a monolingual speaker of English? If we consider the speakers of English in the world and their linguistic resources, it is evident that the only potentially monolingual group are the speakers of what Kachru (1985) calls “Inner Circle” Englishes: both the “Outer Circle”, i.e. countries where English is spoken as a second language used in e.g. administrative and educational contexts, and the “Expanding Circle”, i.e. the rest of the world where English is taught as a foreign language, are by definition contexts where speakers of English are largely multilingual. How monolingual then are the speakers of English in the Inner Circle? Considering the situation of English-speaking countries, there are obviously autochthonous linguistic minorities in each and every one of them. (For details, see e.g. Melchers and Shaw 2011.) In the UK we find speakers of Welsh, Gaelic and Irish, in Ireland Irish is the national language beside English, in Canada apart from English and French there are speakers of First Nations and Inuit languages, and in the USA there are still many Native American and Alaska native languages. Similarly in Australia, there are speakers of Aboriginal languages and in New Zealand speakers of Māori. Many of these languages are endangered to varying degrees, although there are efforts to preserve them. In addition to the indigenous languages, there are many immigrant languages in each country, the smorgasbord of languages present in any community depending on the circumstances of migration. Immigrant languages may well have a long history as well, considering e.g. the centuries of Spanish spoken in California. 
The communities of immigrant language speakers may be vitalised by new waves of migration, keeping the linguistic minorities from being assimilated. On the other hand, even long-standing linguistic minorities may well preserve some elements of their heritage language, even if they do not speak the language fluently any more. The numbers of European heritage-language speakers, especially Italian, German, Hungarian and French, show a downward trend in US census data, but there are still approximately a million people resident in the United States who say they speak German at home (Ryan 2013). During the history of English, the waves of migrants, particularly Vikings and Normans, were slowly assimilated to the English-speaking population, but not without leaving their trace in the shape of English.

If we take one of the Inner Circle countries as an example, we can examine this situation in all its complexity. In the Irish census of 2011, 41.9% of respondents answered ‘yes’ to the question whether they could speak Irish (Central
Statistics Office 2012). Given that all children learn both Irish and English at school, it could be argued that, on a less strict interpretation of multilingualism, most people who have received their schooling in Ireland are multilinguals. In addition to the two national languages, schools also provide foreign language teaching in French, German, Spanish and Italian, which is in accordance with the EU language policy of everyone mastering two other languages in addition to their mother tongue (COM 2008). The 2011 census also included, for the first time, questions on other languages spoken at home, and 11% of residents reported they spoke a language other than English or Irish at home, the most common languages being Polish, French and Lithuanian. Of those speaking a foreign language at home, 6% answered they were not able to speak English at all. Given all this data, it could be argued that the vast majority of Irish residents are multilingual to some extent.

As can be seen from the above example, not only do multilingual individuals gain their linguistic repertoires in a variety of ways but they also belong to a variety of different linguistic communities. In Ireland, for example, there are speakers of Irish living in the Gaeltacht area, where they encounter other native speakers of Irish and carry out many tasks related to their daily lives in Irish. At the same time, English is a part of their lives, as it is the overwhelmingly strong language of many areas of life. On the other hand, people who learn a foreign language at school (whether English in the Expanding Circle countries today or French or Latin in eighteenth-century England) are typically members of a far more loosely knit network of speakers.

Multilingual resources can be used for identity-work, marking membership in a linguistic group, as the Latino population in the United States does when they mix English and Spanish resources in their speech and, increasingly, in their writing. Another type of identity-work is seen in the Latin phrases found in the writings of well-educated people throughout the history of English. There the writers can indicate their own membership in the community of educated people but they can also build bridges towards their readership, marking them as members of the same educated elite. The less educated would have had fewer linguistic resources in the range of multilingualism, but even they had access to e.g. Latin as the language of religion, engaging in both multilingual and multivoiced practices when referring to the teachings of the church.

3 Challenging the Myth of Monolingual Corpora

In corpus linguistics, increasing the size, but not necessarily the quality, of the database has been one of the major goals for a long time. Ever bigger databases,
resulting in automatic, web-crawling ‘corpora’ (e.g. in the case of GloWbE) seemed to be on the top of corpus linguists’ wish lists, and for good reasons. At the same time, it should be noted that the “small and tidy” and “big and messy” approaches to corpus compilation and annotation both have their merits (see e.g. Mair 2006 for a discussion of this). While it is true that corpus enhancement along the lines of automatic tagging and parsing has always been a major branch of corpus linguistic activity, too, the question of how to deal with non-English elements in English language corpora has seen considerably less scholarly activity. Size does matter, for an assessment of multilingual practices as well as for nearly everything else, but in order to identify multilingual practices in the first place, improved annotation is essential, too. And in order to improve annotation schemata, a sound idea of what constitutes a multilingual element is, of course, a necessary prerequisite.

When discussing the annotation of multilingual elements, the question of language boundaries comes up. At times, language users clearly flag their other language elements and their switches from one into another (Poplack 1987). In speech this can take place for example through repetition or metalinguistic commentary, but also pauses, hesitation and the mention of the language switched into. In writing, similar tendencies can be seen, and in English historical writings, for example, flagging can take the form of explicit labelling (that is in Latin), or in the case of foreign-language elements the reader might not be able to understand easily, the introduction of intratextual translation or support in English, often highlighted through either verbal (‘or’, ‘i.e.’, ‘that is to say’) or visual cues (parentheses, italics, underlining) (Nurmi and Skaffari 2016). Elements accompanied by flagging devices like these are easily recognised as evidence of multilingual practices.
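Because flagging relies on surface cues, a first pass over plain text can be automated. The following is a minimal sketch, assuming plain-text input in which italics and underlining are no longer recoverable; the function name `flag_cues`, the cue labels and the short list of language names are illustrative assumptions, not a coding scheme taken from Nurmi and Skaffari (2016), and the test sentences in the usage lines are invented.

```python
import re

# Explicit labelling of the language switched into (hypothetical, deliberately short list).
LANGUAGE_LABEL = re.compile(r"\bin (?:Latin|French|German|Greek)\b", re.IGNORECASE)

# Verbal cues introducing a translation or gloss. Bare "or" is left out here
# because it is far too frequent in ordinary English to be useful on its own.
GLOSS_MARKER = re.compile(r"\b(?:that is to say|that is|i\.e\.)", re.IGNORECASE)

# Visual cue that survives in plain text: a parenthesised stretch.
PARENTHESIS = re.compile(r"\([^()]+\)")


def flag_cues(text):
    """Return the names of the flagging cues found in a stretch of text."""
    cues = []
    if LANGUAGE_LABEL.search(text):
        cues.append("language_label")
    if GLOSS_MARKER.search(text):
        cues.append("gloss_marker")
    if PARENTHESIS.search(text):
        cues.append("parenthesis")
    return cues


# Invented test sentences, for illustration only.
print(flag_cues("Ex nihilo nihil fit, that is to say, nothing comes of nothing."))
print(flag_cues("As it is written in Latin (pax vobiscum)."))
print(flag_cues("A plain English sentence."))
```

A filter like this can only propose candidate passages for manual checking: the cues also occur in purely English text, and unflagged switches pass through unnoticed.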
Once they are identified in the text, they are also relatively straightforward to annotate. There are, however, also times when speakers and writers deal with their linguistic output in a way that has been described as translanguaging (see e.g. Otheguy, García and Reid 2015). On these occasions, writers do not pay attention to the boundaries between languages, but rather treat all their linguistic resources as one pool of features to draw from in order to communicate their meaning. These instances may also occupy the grey area between borrowing and multilingual practices, as they may fluidly use both domesticated and original spelling, for example. In present-day spoken Finnish the English adverbial about (in the sense ‘approximately’) is frequently used. When it is written, the written form can follow standard English spelling (6), but can also reflect the domesticated spoken form (e.g. öbaut or abaut in 4 and 5), even in quality newspapers such as the Helsingin Sanomat.
(4) “Viime vuoden kesäkuusta tämän vuoden kesäkuuhun työllisten määrä on kasvanut 33 000:lla. Jos pystyttäisiin pitämään tällainen trendi vuoteen 2019 asti, oltaisiin 72 prosentin työllisyysasteessa, öbaut”, Sipilä sanoo. (Helsingin Sanomat 12 August, 2016)
‘“From June last year to June this year the number of the employed has risen by 33,000. If we could maintain a trend like this until 2019, we would be at an employment rate of about 72%”, says [Prime Minister] Sipilä.’

(5) Asun tossa abaut sadan metrin päässä Evästiellä. (Helsingin Sanomat 4 November, 1999)
‘I live there about a hundred meters away, in Evästie.’

(6) Se oli about vartti kun äijiltä lähti lapasesta. (@JethroRostedt on Twitter 4 March, 2015)
‘It was about a quarter of an hour before the guys lost it.’

Considering that all spelling and pronunciation variants from Standard English to variously domesticated Finnish perform the same function in the texts and maintain the English meaning, trying to pigeon-hole these expressions into separate categories of code-switching/code-mixing and borrowing would be not only futile but counterproductive in terms of speakers’ linguistic production. This also presents a dilemma for corpus coding. How should we deal with such hybrid elements in between languages? This issue is of particular interest for corpora of more informal language, whether spoken or written, but since these elements tend to find their way even into the quality newspapers, initially through interviews and columns, trying to decide on a particular moment as a cut-off point is difficult without a good understanding of the current status of any individual linguistic element.

With these issues in mind, the contributions in this volume address the following questions:

1. From a corpus compiler’s view: What to do with multilingual texts and elements, when compiling a monolingual corpus? What are the criteria for inclusion and exclusion in sampling? How does representativeness play into these choices?
2. From a corpus annotator’s view: How to annotate foreign-language passages in a corpus? Should they be given a text-level coding, and if so, how detailed? In case of linguistic annotation, how should foreign-language elements be dealt with?


Nurmi and Rütten

3. From a corpus user’s view: How can we study multilingual practices in monolingual corpora? How do we approach a corpus, if the foreign-language elements have not been annotated? How do we deal with questions of representativeness, if the corpus compilers have not in any way indicated their choices with regard to multilingual elements? What kinds of results on multilingual practices can be gained when studying multilingual practices in supposedly monolingual corpora?

For obvious reasons, these three views are often intertwined. For example, the question of how we can study multilingual practices in a (seemingly monolingual) corpus depends, of course, on the amount of annotation with which the respective corpus is equipped. In a similar way, the question of how detailed an annotation schema should be depends, amongst other things, on the multilingual practices of the population from which this sample stems. Consequently, all contributions in this volume consider most, if not all, of the above questions, but place emphasis on different aspects. Research perspectives range from Postcolonial and World Englishes through a range of non-native and learner Englishes to historical stages of the language. The individual contributions discuss explicitly multilingual practices in the traditional sense of the concept as well as more opaque multi-lingual and multi-voiced discourse practices.

Of the papers that discuss explicit multilingual practices in seemingly monolingual corpora, the opening paper of this volume by Lange reviews how multilingual practices are documented in the various postcolonial components of the International Corpus of English (ICE). In particular, Lange evaluates ICE-India from both a corpus user’s and a corpus compiler’s perspective, and discusses building a more balanced corpus of Indian English with a view to the multiple native languages influencing the Englishes spoken on the subcontinent.
In a similarly explicit multilingual context, Onysko and Degani discuss the selection of texts and informants for a corpus of mono- and bilingual native speakers of New Zealand English, with the concomitant problems of coding both background information and text level variation. They also place emphasis on the question how cultural meaning can be explored by corpus-linguistic means, provided the respective annotation schema systematically accounts for the diversity of multilingual elements in the corpus. Besides these obvious multilingual contexts provided by postcolonial varieties of English, the myth of monolingual practices also extends to corpora


compiled to study non-native and learner Englishes, and English as a lingua franca. These lines are pursued in the three subsequent contributions. First, Laitinen brings to the table a discussion of annotating the multilingual elements in advanced non-native corpora of English, when the languages used range from majority languages to traditional minority languages and immigrant languages. An explicit learner perspective is pursued in the contribution by Callies and Wiemeyer, who introduce the Corpus of Academic Learner English (CALE). Callies and Wiemeyer discuss various approaches to annotating multilingualism and transfer in learner corpora and describe developing an annotation practice for multilingual elements. Their contribution is complemented by Kreyer’s paper, towards the end of the volume, which discusses multivoiced practices in learner Englishes; these turn out to be much more implicit than the phenomena introduced in Callies and Wiemeyer. Hynninen, Pietikäinen and Vetchinnikova approach English as both a spoken and written lingua franca in academic and private contexts (ELFA and WrELFA corpora of academic spoken and written ELF). Their focus is on a discussion of the appearance and functions of multilingual practices in English as a Lingua Franca. In all three cases, multilingual practices occur quite explicitly in the data but are dealt with in various ways, both in the compilation process and in the way in which the data were approached to conduct research.

From a diachronic perspective, explicit multilingual practices are discussed in the contributions by Kohnen, Rütten, and Tyrkkö, Nurmi and Tuominen. Kohnen presents ideas for building a corpus of commonplace books, which are strikingly similar to Laitinen’s present-day corpora of non-native Englishes in their presentation of often complete texts in one language in a multilingual compilation or environment.
From a research-oriented perspective, Kohnen also explores basic questions of language choice in the genre of commonplace books. Rütten introduces the annotation schema developed for the Corpus of English Religious Prose against the background of the long-standing history of multilingual practices in the religious domain. In addition, she describes multivoiced practices in the domain, which may or may not be multilingual, and illustrates how these can be dealt with in the corpus architecture and basic annotation. By contrast, Tyrkkö et al. take a turn on (semi-)automated processes of identifying multilingual elements in an unannotated corpus. In addition to describing software designed to reliably identify, annotate and analyse foreign language elements in a historical English corpus, the Corpus of Late Modern English 3.0 (CLMET 3), Tyrkkö et al. emphasise that multilingual practices


cannot be reduced to binary distinctions, e.g. foreign/English, native/non-native English, as is often conveniently done. Instead, they show how textual and cultural context feed into an assessment of multilingual practices.

Against the background of these explicit multilingual practices in synchronic and diachronic corpus linguistics, Kreyer and Kaunisto introduce more opaque, multivoiced practices. These appear much more implicitly in corpora, but are strikingly similar to multilingual practices (see also section 2). Both Kreyer and Kaunisto, and also Rütten in her discussion of the “invisible hand”, offer different approaches to multivoiced texts, discussing intertextual elements that represent another speaker’s or writer’s voice in a text, whether multi- or monolingual. Of these papers, Kreyer seemingly takes the notion of multilingualism in corpora to its very limits. Turning to learner corpora, Kreyer discovers the extent to which learner texts are mere copies of source material in the Marburg Corpus of Intermediate Learner English (MILE). In fact, being multivoiced in this sense, such learner productions resemble multilingual practices to a considerable extent. Consequently, Kreyer discusses the types of mark-up needed to detect such multivoiced practices and provides an illustrative analysis of intermediate learner English in MILE. Kaunisto takes a corpus user’s perspective and conducts a philological study of Samuel Taylor Coleridge’s Biographia Literaria, which is one of the files contained in the Corpus of Late Modern English (CLMET 3), but lacks any form of multilingual annotation. He shows how severe the influence of multivoiced interference can be even on high frequency items such as personal pronouns.

All contributions agree that various languages, in varying proportions, appear alongside English in the “English” corpora which are investigated in this volume.
Depending on their respective research paradigms, contributors offer various courses of action for this situation. This highlights the fact that we may be well advised to rethink our understanding of corpora as monolingual language data repositories. Also, we need to address the question how to find and interpret non-English elements.

4 Tracing Multilingual Practices in Supposedly Monolingual Corpora

How does one find, distinguish and describe foreign language elements, both in corpora that flag non-English elements as such and in corpora that do not? In theory, there are two general routes one may wish to take here: automatic and manual identification. In the real world, the task is usually a combination of both.


In the present volume, Tyrkkö et al. present a semi-automatic approach, introducing software that identifies non-English elements with considerable precision. Rütten presents a corpus design that integrates multilingual, and to a lesser extent also multivoiced, practices into the architecture of the corpus from the start. At the other extreme, the contributions by Kaunisto and Kohnen proceed from purely philological points of departure, identifying multilingual elements with the help of scholarly editions and informed philological knowledge about context (text production, text reception, circulation etc.). While both approaches will successfully identify non-English elements, only the latter is able to spot multivoiced elements. The identification of multivoiced elements is something that might be of interest in corpus research, and could be at least partly automated in the future, since familiar quotations could be identified using electronic text repositories, and other flags for multivoiced elements could be identified (at least the use of quotation marks and quotative phrases like he/she says and according to). However, this is a vital challenge in research on multilingual practices, as is pointed out in several contributions. Hynninen et al. show that even though corpus compilers may flag a linguistic structure as non-English, this need not necessarily be the case for the speakers in the actual speech events. Hynninen et al. look at how code-switches are flagged in discourse and they see a noteworthy discrepancy between explicitly flagged code-switches by the speakers and annotation schemata by compilers that only distinguish English from foreign elements. While the foreign-tag marks non-English elements, these tags may say very little as to how code-switches were perceived by the actual speakers. 
This, of course, has implications for the assessment of the level of competence of non-native English speakers and brings in another facet of multilingualism that may need attention in the annotation schema. Along the same lines, Kreyer contrasts materials and task descriptions from the English language learning environments with students’ textual productions. His findings indicate that, even for advanced learners, one third of collocations originate from the materials/task descriptions. Again, this not only has implications for the assessment of language competence and idiomaticity, but points to yet another issue to be taken into consideration in annotating supposedly monolingual material. Far from being able to resolve these matters within the two covers of this book, we hope that bringing these issues into focus will help us to rethink the widely accepted notion of ‘the monolingual corpus’ and to attune ourselves better to text samples, knowing that much can be expected that is not the voice, or language, of the author.
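The surface cues mentioned in this section as possible starting points for partly automating the identification of multivoiced elements (quotation marks and quotative phrases such as he/she says and according to) can be illustrated with a minimal sketch. It is not the software described in any of the contributions; the names `Flag` and `flag_multivoiced`, the two regular expressions and the sample sentence are hypothetical, and the cue list is deliberately limited to the cues named in the text.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical cue inventory, restricted to the cues named in the text:
# double quotation marks (straight or curly) and two quotative phrases.
QUOTED_SPAN = re.compile(r'"[^"]+"|“[^”]+”')
QUOTATIVE = re.compile(r"\b(?:he|she)\s+says\b|\baccording\s+to\b", re.IGNORECASE)


@dataclass
class Flag:
    kind: str   # "quotation" or "quotative"
    start: int  # character offset where the match begins
    end: int    # character offset just past the match
    text: str   # the matched string


def flag_multivoiced(text: str) -> List[Flag]:
    """Return candidate cues for multivoiced passages, in document order."""
    flags = [Flag("quotation", m.start(), m.end(), m.group())
             for m in QUOTED_SPAN.finditer(text)]
    flags += [Flag("quotative", m.start(), m.end(), m.group())
              for m in QUOTATIVE.finditer(text)]
    return sorted(flags, key=lambda f: f.start)


# Invented sample sentence, used only to exercise the function.
sample = 'According to the editor, the line is borrowed. "Know thyself", she says.'
for flag in flag_multivoiced(sample):
    print(flag.kind, repr(flag.text))
```

Such flags only mark candidates for manual inspection: unmarked borrowings from source material, of the kind Kreyer and Kaunisto describe, carry no surface cue and would still require philological knowledge or comparison against text repositories.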


References

Blommaert, Jan. 2010. The Sociolinguistics of Globalization. Cambridge: Cambridge University Press.
Blommaert, Jan and Ben Rampton. 2011. “Language and Superdiversity.” Diversities 13/2: 1–22.
Central Statistics Office. 2012. This is Ireland. Highlights from Census 2011, Part 1. Dublin: Stationery Office.
COM. 2008. “Multilingualism: An Asset for Europe and a Shared Commitment.” Communication from the Commission to the European Parliament, the Council, the European Economic and Social Committee and the Committee of the Regions. Brussels: Commission of the European Communities.
Creese, Angela and Adrian Blackledge, eds. forthcoming. The Routledge Handbook of Language and Superdiversity. London and New York: Routledge.
Deumert, Ana. 2011. “Multilingualism.” In The Cambridge Handbook of Sociolinguistics, edited by Rajend Mesthrie, 261–282. Cambridge: Cambridge University Press.
Edwards, John. 2006. “Foundations of Bilingualism.” In The Handbook of Bilingualism, edited by Tej K. Bhatia and William C. Ritchie, 7–31. Oxford: Blackwell.
European Commission. 2012. Europeans and their Languages. Special Eurobarometer 386. http://ec.europa.eu/public_opinion/archives/ebs/ebs_386_en.pdf.
Kachru, Braj B. 1985. “Standards, Codification, and Sociolinguistic Realism: The English Language in the Outer Circle.” In English in the World: Teaching and Learning the Language and the Literature, edited by Randolph Quirk and H. G. Widdowson, 11–30. Cambridge: Cambridge University Press.
Li Wei. 2007. “Dimensions of Bilingualism.” In The Bilingualism Reader, edited by Li Wei, 3–22. 2nd edition. London: Routledge.
Mair, Christian. 2006. “Tracking Ongoing Grammatical Change and Recent Diversification in Present-day Standard English: The Complementary Role of Small and Large Corpora.” In The Changing Face of Corpus Linguistics, edited by Antoinette Renouf and Andrew Kehoe, 355–376. Amsterdam: Rodopi.
Mair, Christian. 2009. “Corpus Linguistics Meets Sociolinguistics: Studying Educated Spoken Usage in Jamaica on the Basis of the International Corpus of English (ICE).” In World Englishes: Problems, Properties, Prospects, edited by Lucia Siebers and Thomas Hoffmann, 39–60. Amsterdam: Benjamins.
Mair, Christian. 2011. “Corpora and the New Englishes: Using the ‘Corpus of CyberJamaican’ (CCJ) to Explore Research Perspectives for the Future.” In A Taste for Corpora: In honour of Sylviane Granger, edited by Fanny Meunier, Sylvie De Cock, Gaëtanelle Gilquin and Magalie Paquot, 209–236. Amsterdam: Benjamins.
Mair, Christian and Stefan Pfänder. 2013. “Vernacular and Multilingual Writing in Mediated Spaces: Web Forums for Post-colonial Communities of Practice.” In Space in Language and Linguistics: Geographical, Interactional, and Cognitive Perspectives, edited by Peter Auer, Martin Hilpert, Anja Stukenbrock and Benedikt Szmrecsanyi, 529–556. Berlin and New York: de Gruyter.
Melchers, Gunnel and Philip Shaw. 2011. World Englishes. 2nd edition. London: Hodder Education.
Mesthrie, Rajend. 2006. “Society and Language: Overview.” In Encyclopedia of Language and Linguistics, Vol. 11, edited by Keith Brown, 472–484. Amsterdam: Elsevier.
Meyerhoff, Miriam and James N. Stanford. 2015. “‘Tings Change, All Tings Change’: The Changing Face of Sociolinguistics with a Global Perspective.” In Globalising Sociolinguistics: Challenging and Expanding Theory, edited by Dick Smakman and Patrick Heinrich, 1–15. London and New York: Routledge.
Nurmi, Arja and Janne Skaffari. 2016. “Whiche is in Englisshe tong—Managing Latin in English.” Paper presented at the International Conference on English Historical Linguistics (ICEHL-19), Essen, August 22–26.
Otheguy, Ricardo, Ofelia García and Wallis Reid. 2015. “Clarifying Translanguaging and Deconstructing Named Languages: A Perspective from Linguistics.” Applied Linguistics Review 6/3: 281–307.
Pahta, Päivi and Arja Nurmi. 2011. “Multilingual Discourse in the Domain of Religion in Medieval and Early Modern England: A Corpus Approach to Research on Historical Code-switching.” In Code-switching in Early English, edited by Herbert Schendl and Laura Wright, 219–251. Berlin: Mouton de Gruyter.
Pahta, Päivi, Janne Skaffari and Laura Wright, eds. in press. Multilingual Practices in Language History: English and Beyond. Berlin and New York: de Gruyter.
Poplack, Shana. 1987. “Contrasting Patterns of Code-switching in Two Communities.” In Aspects of Multilingualism, edited by Erling Wande, Jan Anward, Bengt Nordberg, Lars Steensland and Mats Thelander, 51–77. Uppsala: Borgströms.
Ryan, Camille. 2013. Language Use in the United States: 2011. American Community Survey Reports. Washington, DC: U.S. Department of Commerce.

CHAPTER 2

Indian English or Indian Englishes? Accounting for Speakers’ Multilingual Repertoires in Corpora of Postcolonial Englishes
Claudia Lange

1 Introduction

When the late Sidney Greenbaum (1988, 315) published his “proposal for an international computerized corpus of English”, he set off a flurry of activity whose momentum has continued unabated well into the twenty-first century. The original proposal called for an extension of the scope for computerized comparative studies in three ways:

(1) to sample standard varieties from other countries where English is the first language, for example Canada and Australia;
(2) to sample national varieties from countries where English is an official additional language, for example India and Nigeria; and
(3) to include spoken and manuscript English as well as printed English.
Greenbaum 1988, 315

Thus began the success story of the ICE-project, which has provided the international scholarly community with an invaluable tool for investigating variation and change across varieties of English and which has placed many Postcolonial Englishes (PCEs) on the research agenda for the first time. In a later progress report, Greenbaum (1990, 81) elaborated on the first two points listed in his original proposal, noting that “[t]he change in expression from standard varieties to national varieties is not an attempt at elegant variation”. He, unlike Quirk (1990), envisaged a continuum between standard (i.e. native) varieties and national varieties in countries where English is an L2 (i.e. non-native).1 What we find juxtaposed here are standard and standardizing varieties, with Indian English (IndE) as a prime candidate for the latter:

1 Cf. Lange (2012, 24–32) for a discussion of the debate between Quirk and Kachru on the topic.

© koninklijke brill nv, leiden, 2017 | doi 10.1163/9789004276697_003


[…] it is from India that we have the clearest evidence of the internal status of the English spoken by indigenous educated people […] Among the countries where English is not a native language, India comes closest to a situation in which a new distinctive standard language will emerge.
Greenbaum 1990, 81

Greenbaum thus explicitly acknowledges that L2 Englishes can develop an endonormative standard, a position which contrasts sharply with Quirk’s insistence at the time that only native varieties can have, and set, standards. In retrospect, this particular aspect of the ICE-project rationale as articulated by Greenbaum has received overwhelming support, both theoretically by Edgar Schneider’s Dynamic Model of the evolution of Postcolonial Englishes (Schneider 2007), and empirically by the development of varieties such as Singapore English, which is generally considered a fully matured variety in its own right (cf. Schröter 2012, 563). In 1990, Greenbaum clearly did not expect to encounter much difference between L1 and L2 countries with respect to sampling and corpus compilation:

The standard language, as elsewhere, would tend to be non-regional and represent the consensus of educated speakers. It has been argued that English in India constitutes a continuum of competence in language, but of course a similar continuum of competence is observable in native-speaker countries.
Greenbaum 1990, 82

This expectation changed slightly once the first actual corpus projects took off and the first reports from corpus compilers collecting data in multilingual speech communities came in (in Greenbaum 1996). After all, PCEs are only one ingredient of the linguistic repertoires in those highly multilingual speech communities where they are typically spoken, which creates a host of practical as well as theoretical challenges for the creation of a (supposedly) monolingual corpus of English, especially when informal spoken registers are concerned. Multilingual speakers may simply not choose English as the language of informal conversation or resort to stilted exchanges when asked to “speak English” for the sake of the corpus compilers. Another important methodological point concerns the sampling and documentation of a speech community’s linguistic repertoire in which English is embedded: to what extent does the choice of speakers contributing to the corpus reflect the typological diversity in the area under discussion? This question is highly salient from two different perspectives, as will be demonstrated with respect to ICE-India and Indian English: first of all,


Indian speakers of English themselves typically do not believe that there is an entity which deserves the label ‘Indian English’ rather than, say, ‘Marathi English’ or ‘Tamil English’, thus highlighting the potential substrate influence on Indian English(es). Secondly, in order to be able to tackle the question of substrate influence, we have to find a way of representing the enormous linguistic diversity that characterises India. This paper, then, will review to what extent these questions have been addressed in the compilation of ICE-India and will sketch how they might be taken into account in compiling a corpus of spoken Indian English in the twenty-first century. To do so, I will first give an overview of how individual ICE-teams have risen to the challenge(s) of multilingualism, with special emphasis on the Indian context. I will then continue to focus on multilingual India and IndE and outline the repercussions that the Indian communicative space has had for contact linguistics, specifically the framework established by Thomason (2001). Finally, I will outline what a new corpus of Indian English might look like, one that captures English as an integral ingredient of speakers’ multilingual repertoires.

2 ICE and Postcolonial Englishes

The volume Comparing English Worldwide: The International Corpus of English (Greenbaum 1996) delineated the scope of the ICE-project, providing both an overview of the general aims and principles as well as the first ‘field reports’, so to speak. Of the four ICE-projects dedicated to multilingual communities that were represented in the book, two have been finalised by the original teams, namely ICE-East Africa (Schmied 1996) and ICE-Hong Kong (Bolt and Bolton 1996). The other two projects, covering Nigeria and Fiji, have moved on to the ‘ICE-Age 2’: ICE-Nigeria was taken up by a new team in 2007 and released in 2014, while ICE-Fiji is still in the making (cf. ICAME Journal 34), again by a different team. Some common concerns emerged in the creation of the ‘ICE-Age 1’ corpora dealing with English as a ‘national’ language in Greenbaum’s sense, that is, with English as a former colonial language with quite different degrees of entrenchment in the individual countries. The challenges to be faced when compiling a corpus of English in a multilingual setting are best documented for ICE-East Africa; unfortunately, little is known about the compilation of e.g. ICE-Singapore or ICE-Philippines, as neither the corpus manuals nor other scholarly articles provide further information. I will focus on three aspects: the


choice of speakers/contributors, the position of English in multilingual communicative spaces, and the representation of linguistic diversity.

2.1 Choice of Speakers

As already mentioned above, the ICE-project was to sample standard(izing) varieties of English, where ‘standard English’ was to emerge from the usage of speakers/contributors:

The authors and speakers of the texts are aged 18 or above, were educated through the medium of English, and were either born in the country in whose corpus they are included, or moved there at an early age and received their education through the medium of English in the country concerned (http://ice-corpora.net/ice/design.htm).

Such a criterion, while general and innocent enough for countries such as the UK, raises a host of problems for postcolonial countries. Janet Holmes’ (1996, 164) question “Who counts as a New Zealander?”, or more specifically “At what point does an immigrant become a New Zealander?” highlighted the fact that many postcolonial societies especially in the Asia-Pacific region are characterized by relatively recent large-scale immigration, or rather population movement in general. Contributors to the ICAME Journal special issue on “ICE-Age 2” note that educated speakers from e.g. Sri Lanka or Fiji typically receive their tertiary education abroad, spending long stretches of time outside their home countries and being exposed to other international varieties of English. Such speakers were excluded from ICE-NZ, a decision that turned out to be impractical for e.g. ICE-Fiji:

The restriction should rather be that at least not the entire higher education should have been acquired abroad, and that the authors should not have gained several degrees while staying overseas. We also allow for authors spending some time of the year abroad and writing from abroad, as long as they did not leave the country in the formative years before the age of 18. This more relaxed perspective is closer to the reality of the educated speaker of acrolectal Fiji English.
Biewer, Hundt and Zipp 2010, 7

Another point concerns the requirement of English-medium education: given this criterion, the current Prime Minister of India, Narendra Modi, would be excluded since he attended a rural Gujarati-medium school. English-medium


education in India effectively acts as a caste marker, a given for the affluent upper middle class but largely out of reach for the majority of the population, especially in rural areas. Modi’s home state Gujarat was in fact found to be the one where pupils in rural areas stand the least chance of acquiring English: Graddol (2010, 117) quotes from the ASER report from 2010 (Annual Status of Education Report), which found that less than 25% of rural Gujarati children in class 8 are able to read and understand simple sentences in English (as opposed to e.g. over 75% in Kerala and the Northeastern states). Overall, access to education in India has been steadily improved, but English-medium education must have been even more exclusive at the time when data collection for ICE-India began. Therefore, “[t]he category of ‘conversations’ are drawn largely from the trained ELT teachers, though they have not been educated in the medium of English at all levels” (Shastri 2002, 2), a decision which acknowledges the range and depth of English in India at the time (cf. Lange 2007). Recent phenomena such as transnationalism and superdiversity (Vertovec 2007) are likely to complicate the question ‘who counts as a speaker of X’ even further: multilingualism and multilingual speakers are moving from the periphery to the centre, entering and transforming the Western—largely monolingual—mainstream while frequently upholding their ties to South Asia. Indian English has become a global language, spoken in the diaspora by NRIs (non-resident Indians) from quite different backgrounds: unskilled labourers working on construction sites in Dubai, highly qualified students pursuing a postdoc in the US, or third-generation British Asians in Greater London. 
Sharma’s (2011) research on the latter group has shown how skilfully young British Asians exploit their linguistic repertoire (comprising British Standard English, London English and Indian English features) in social interaction, negotiating their distinct multicultural identities along the way. The volume edited by Hundt and Sharma (2014) pays tribute to the global spread of Indian English, detailing how and where IndE entered another communicative space as a minority language and documenting some of the linguistic consequences of these contact scenarios. These emergent diasporic dialects of Indian English are quite likely to have an impact on IndE spoken in India. However, more fieldwork would be needed to capture the transnational ties of IndE diasporic communities as a prerequisite for an analysis in terms of social networks.

2.2 The Scope of English in Multilingual Societies

The discussion above referred to multilingualism on an individual level by focussing on speakers’ competence in English. Speakers’ exposure to and familiarity with English is also a function of societal multilingualism:


researchers need to be aware of the parameters which structure the communicative space of a given speech community. Which roles are assigned to individual languages, which domains are generally reserved for a specific language? Schmied (1996) gives a vivid description of the linguistic division of labour that typically occurs in multilingual speech communities:

In Tanzania, for instance, it would sound very strange if grandparents were addressed in English (even if they understood the language, which is unlikely). As the transmission of cultural values, a major function of grandparent-grandchildren interaction, is firmly linked to first languages, English is highly inappropriate in such contexts. The vast majority of the direct conversations in ICE-GB would simply not be conducted in English: all the family conversations (e.g. S1A-007) and mealtime conversations (e.g. S1A-056) would be too exceptional to be included in an African or Asian corpus of English. In most ESL cultures the use of English would be considered rude in such contexts, as the older members of the family might be excluded because of their lack of language skills.
Schmied 1996, 186–87

English in postcolonial societies typically correlates with the formal end of the communicative spectrum, sometimes even with the H(igh)-language in a diglossic situation:2 the vernacular languages (L(ow)-language(s)) have less prestige, are acquired informally at home and are used for informal and mostly spoken interaction, while the H-language is acquired formally in educational settings, has more prestige and is typically the language used for official/administrative purposes. Collecting written texts in English for an ICE-project thus becomes much less of a demanding task than recording a natural conversation in English, as the quote above illustrates. Whether English is used in informal interaction will also depend on societal multilingualism. For example, Bolt and Bolton (1996, 200) noted that “in this overwhelmingly Cantonese-speaking city” of Hong Kong, speakers would only engage in an English conversation if non-Cantonese speakers were present. In multilingual India the situation would be quite different: the second official language and neutral link language English would be the natural choice as soon as speakers of more than one indigenous language come together.

2 ‘Diglossia’ as originally conceptualized by Ferguson (1959) referred to the functional separation of two varieties of the same language (e.g. Classical Arabic vs. present-day national varieties of Arabic). Since then, the concept has been extended to multilingual contexts.


2.3 Representing Linguistic Diversity

The discussion of how to capture a representative sample of the English-speaking population has largely been dominated by the social question: habitual speakers of English in most postcolonial societies typically cluster in the higher social classes. Less attention has been paid to sampling the typological diversity of speakers’ other languages. ICE-East Africa was the first of the first generation PCE corpora to provide metadata which include information about speakers’ mother tongues. Table 2.1 provides an overview of the range of speaker-related information that is distributed with the respective corpora. The ICE-Age 2 corpora all offer metadata, sometimes including the label ‘ethnicity’ rather than ‘mother tongue’.

The patchy sociolinguistic information about speakers is unfortunate, as it precludes a variationist perspective on PCEs. It also prevents us from probing deeper into the motivation for specific innovations in PCEs. Typical lines of reasoning when confronted with divergence in form, function or frequency of a specific form have recourse to the following:

(a) Historical retention: a PCE feature which diverges from current usage in the historical input variety (mostly BrE) may reflect an older stage of the language that has since been lost. Mesthrie (2006) has further drawn attention to the fact that the ‘historical input variety’ was much more heterogeneous than tacitly assumed: many of the sailors, settlers, traders, missionaries, soldiers and teachers who interacted with the indigenous population in the colonies spoke an array of nonstandard and/or regional dialects (Mesthrie 2006, 278–86). This point clearly plays no role as far as current corpus compilation is concerned.3
(b) Universals of second language acquisition: the late Braj Kachru was instrumental in establishing PCEs as varieties in their own right rather than learner Englishes riddled with ‘deviations’. Still, PCEs in many contexts were and are “taught (as L2), rather than ‘caught’ (as L1)” (Mesthrie 2010, 596), which brings universal mechanisms of second language acquisition (SLA) into the range of explanatory parameters. Processes such as simplification/regularization or redundant/explicit marking have been identified as playing a role both in SLA and in PCE innovations; lists of relevant processes have been drawn up by e.g. Williams (1987) or

3 A diachronic corpus project interested in reconstructing variation in the historical input to IndE will find a host of unpublished private papers, letters and other documents related to members of the East India Company in the British Library (http://www.bl.uk/catalogues/indiaofficeselect/welcome.asp).


Table 2.1 Overview of ICE-corpora and their sampling of speakers’ metadata

ICE corpus | Data collection | Corpus release | Speaker data: age | Speaker data: gender | Speaker data: education | Speaker data: languages | Speaker data: other

ICE-GB

1990s

1998

yes

yes

ICE-India 1992–1996 2002

yes

yes

ICE-HK ICE-SIN ICE-EA

1990s 2006 1990s 2002 1991–1996 1999

no no yes

no no yes

ICE-PHIL 1991–2002 2004

yes

yes

ICE-NIG

yes

yes

yes yes (undergrad.; BA/LL.B./BS; Master’s/ MA/MS; PhD/MD/DD) no no

yes

yes

yes

from 2007 2014

ICE-CAN 1990s

2009

yes (secondary, university) yes (secondary; graduate; MA/MPhil; PhD) no no partially

n.a.

occupation: yes

yes

occupation: yes

no no partially

no no no

yes

occupation: yes

occupation: partially (if known) ethnicity: yes occupation: yes ethnicity: yes

more recently Schneider (2012). The recent emphasis on “bridging the paradigm gap” (Mukherjee and Hundt 2011, already proposed in Sridhar and Sridhar 1986) is bound to yield important insights for the study of PCEs.

(c) Language contact: Substrate influence or language contact as a motivating factor always looms large, but is highly difficult to substantiate once we move beyond the realm of loanwords. Thomason (2010) provides a deceptively straightforward plan of action to identify contact-induced language change:

The first requisite is to consider the proposed receiving language (let’s call it B) as a whole, not a single piece at a time […] Second, identify a source language (call it A). […] Third, find some shared features in A and B. […] Fourth, prove that the features are old in A – that is, prove that the features are not innovations in A. And fifth, prove that the features are innovations in B, that is, that they did not exist in B before B came into close contact with A.
(Thomason 2010, 34)

Contact-induced language change in PCEs might not only be difficult to prove, it has even been ruled out entirely as the major motivating factor: since there are a number of recurring innovations in PCEs across the Anglophone world (e.g. article omission, mass nouns as count nouns), a contact explanation appears highly unlikely in the face of many hundred typologically very distinct background languages. However, I have argued elsewhere (Lange 2012, 237–39) that at least in India, Thomason’s second requirement for proving language contact, namely pinning down a specific source language, may be conceptualized differently. I will elaborate on this point in the next chapter; suffice it to say in this context that metadata should always make reference to speakers’ full linguistic repertoire.

3 Multilingual India

The Indian subcontinent is home to a large number of languages from typologically quite different language families. Figure 2.1 displays the geographical range of the language families represented across South Asia as a whole; Table 2.2 presents the most recent available Indian census data for the individual language families and their number of speakers.4 The combined information from map and table points to an inverse relationship between number of speakers and number of languages: while the Indo-Aryan language family is the one with the highest number of speakers,

4 The Munda languages displayed in the map belong to the Austro-Asiatic language family and are not treated separately in the Census data. Burushaski is a language isolate. The figure of 0.02% for speakers of English may come as a surprise, but is due to the fact that the census tables on language are concerned with speakers’ mother tongues, not their other languages.


[Map of South Asia showing the geographical distribution of language families. Legend: Burushaski, Dravidian, Indo-Aryan, Iranian, Munda, Other Austro-Asiatic, Tibeto-Burman.]

Figure 2.1 South Asian language families. Based on UN Cartographic Section map of South Asia from 2011.

Table 2.2 Typological distribution of the 122 scheduled and non-scheduled languages of India in the 2001 census (http://www.censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/statement9.htm)

Language families | Number of languages | Persons who returned the languages as their mother tongue | Percentage to total population
1. Indo-European | | |
(a) Indo-Aryan | 21 | 790,627,060 | 76.87
(b) Iranian | 2 | 22,774 | 0.00
(c) Germanic [i.e. English] | 1 | 226,449 | 0.02
2. Dravidian | 17 | 214,172,874 | 20.82
3. Austro-Asiatic | 14 | 11,442,029 | 1.11
4. Tibeto-Burmese | 66 | 10,305,026 | 1.00
5. Semito-Hamitic [i.e. Arabic] | 1 | 51,728 | 0.01
Total | 122 | 1,026,847,940 | 99.83

Austro-Asiatic and Tibeto-Burmese comprise the largest number of individual languages, albeit with the lowest percentage of speakers. The number of languages in these families would be even higher were it not for the Census policy

of ignoring languages with fewer than 10,000 speakers and of subsuming ‘mother tongues’ under an abstract concept of ‘language’. What is not immediately apparent is that even though the Indo-Aryan (IA) languages are dominant, not even the most widely spoken IA language, Hindi, is spoken by an absolute majority of the population. Further, no Indian federal state is monolingual: the percentage of minority language speakers in Indian states who do not claim the official state language as their mother tongue ranges from 4.01% in Kerala via 18.36% in Delhi to as much as 86.07% in Nagaland (Mallikarjun 2004).

From the perspective of the average European monolingual person, the sheer diversity of the Indian linguistic landscape appears staggering. However, the national ethos of ‘Unity in Diversity’ is also reflected in the degree of convergence and mutual exchange between languages. The contact situation between speakers of Dravidian and Indo-Aryan languages has persisted for millennia and serves as a prime example of a Sprachbund (cf. Emeneau 1956; Masica 1976).5

5 Current research increasingly focuses on ‘microlinguistic areas’, cf. the special issue of the 2015 Journal of South Asian Languages and Linguistics 2/1.

India further represents a puzzling case for researchers who are accustomed to Western notions of linguistic standardization: the impetus for standardizing a vernacular European language was and is invariably correlated with literacy; standard languages are by definition written languages (cf. Haugen 1966). In India, however, a vast body of knowledge was passed on orally from generation to generation in a highly codified form, namely in Sanskrit. While the earliest texts of the Rigveda date back to around 1500 BC, Sanskrit was first committed to writing around 150 AD (cf. Masica 1993, 50–55). Sanskrit grammars, again composed and transmitted orally, did not acknowledge typological diversity and contact, or historical development. It was left to the European colonizers to spell out the historical trajectories of Sanskrit and successive Indo-Aryan languages, following the foundational moment of historical-comparative linguistics, i.e. Sir William Jones’ famous pronouncement in 1786 on the similarities between Sanskrit and European languages. In 1816, Francis Whyte Ellis delivered the “Dravidian Proof” (Trautmann 2006), arguing for the first time that Indo-Aryan and Dravidian were separate language families, and that their similarities were due to longstanding contact.

Contact between English and Indian languages goes back to the seventeenth century when the first traders of the British East India Company (EIC) arrived, but became more intense from the middle of the nineteenth century onwards, when more and more educational institutions including universities were established by the British. The ‘Great Indian Education Debate’ (cf. Zastoupil and Moir 1999) at the time centered around the medium of instruction: Orientalists favoured the study and teaching of Indian languages, Anglicists were in favour of English for imparting Western education. Thomas Babington Macaulay as the most prominent Anglicist delivered his famous and much-quoted “Minute on Education” in 1835,6 paving the way for English as the language which granted access to Western education and thus also to upward social mobility within the colonial apparatus.
6 Cf. Zastoupil and Moir (1999, 161–73) for the full text of the Minute.

In present-day India, there are still occasional reflexes of a colonial cringe with respect to English, but the importance of the language for the Indian communicative space remains unchallenged. The Indian constitution names Hindi as the first and English as the second official language of the Union; English is the official language in some of the highly multilingual northeastern states, and it is the national link language which bridges the gap between the northern Hindi belt and the southern Dravidian area. To repeat: the figure of 0.02% speakers of English in Table 2.2 above is misleading when it comes to an estimate of the number of proficient speakers of English in India, since it only represents those speakers who report English as their mother tongue, not as a second language. The 2001 census also documented the extent of multilingualism and found that 10.4% or around 126 million people reported to speak English as a second or third language (Graddol 2010, 66). David Crystal (2008) raises the figures even higher; he reports on his impromptu survey among Indian colleagues:

Although answers varied greatly, depending on the levels of English assumed, most people thought that around a third of the population were these days capable of carrying on a domestic conversation in English. […] Given that India’s population is now well over a billion, this meant a total of around 350 million people – more than the combined English-speaking populations of the leading first-language countries.
(Crystal 2008, 5)
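To see how far the mother-tongue figure understates the spread of English, the three numbers quoted above can be set side by side. The following is only a rough sketch: the variable and function names are illustrative, and the estimates come from different sources and years, so the ratios indicate orders of magnitude rather than precise proportions.

```python
# Figures as quoted in the text; the names are illustrative assumptions.
MOTHER_TONGUE_ENGLISH = 226_449    # Table 2.2, 2001 census (mother tongue only)
L2_L3_ENGLISH = 126_000_000        # "around 126 million" (Graddol 2010, 66)
CRYSTAL_ESTIMATE = 350_000_000     # "around 350 million" (Crystal 2008, 5)


def ratio_to_mother_tongue(estimate: int) -> float:
    """How many times larger an estimate is than the mother-tongue count."""
    return estimate / MOTHER_TONGUE_ENGLISH


for label, value in [
    ("Second/third language (census, via Graddol)", L2_L3_ENGLISH),
    ("Conversational ability (Crystal)", CRYSTAL_ESTIMATE),
]:
    print(f"{label}: {value:,} speakers, "
          f"about {ratio_to_mother_tongue(value):,.0f} times the mother-tongue figure")
```

On these numbers, the census figure for English as a second or third language is several hundred times the mother-tongue count, and Crystal's estimate is larger still.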

On the one hand, then, India with its extensive linguistic diversity and its grassroots multilingualism should be a paradise for contact linguists. This diversity, on the other hand, poses significant challenges for theories of language contact. The special Indian case is both explicitly and implicitly mentioned in the model originally put forward by Thomason and Kaufman (1988) and later endorsed by Thomason in numerous publications (e.g. Thomason 2001; 2003; 2010). Thomason’s model gives pride of place to social factors: it is the intensity coupled with the duration of contact which propels contact-induced language change, eventually even overriding typological constraints on borrowing. Two main types of contact scenarios are recognized in her model, depending on the presence versus the absence of full, or at least extensive, fluency in the recipient language. That is, the crucial factor is whether the people who introduce the interference features speak the language into which the features are introduced—or, in other words, whether imperfect learning plays a role in the interference process. Thomason 2003, 691

The scenario where imperfect language learning plays no role gives rise to borrowing, as spelled out in Thomason’s well-known Borrowing Scale (e.g. Thomason 2001, 70–71). Borrowing thus occurs when balanced bilinguals introduce elements from a surrounding language into their own (and vice versa). The Borrowing Scale predicts a hierarchy of borrowable features: the first items to be borrowed will be non-basic lexical items. Greater intensity of contact might lead to the borrowing of structural features, to the extent that “anything goes”, given the appropriate social conditions. By contrast, if the contact scenario is characterized by imperfect language learning by a group of speakers, these speakers will incorporate features from their original language into the newly acquired target language (TL). Thomason labels this process shift-induced interference and predicts that these incorporated features will be primarily phonological and syntactic rather than lexical (cf. Thomason 2003, 691).

Prominent examples of varieties of English which were shaped by shift-induced interference are Irish English and South African Indian English (cf. Mesthrie 1992). Even though English and Irish Gaelic were in contact in Ireland for centuries, it was only in the nineteenth century that the Gaelic-speaking population shifted to English, incorporating syntactic features of Gaelic into the emerging Irish variety of English. Irish English is thus truly a language shift variety in the sense that it involved the large-scale loss of the speech community’s original language. However, Thomason points out that imperfect language learning does not necessarily trigger language shift, just as language contact is only a necessary, but never a sufficient condition for language change. In her earlier monograph, Thomason (2001) explicitly referred to India in elaborating on the notion of ‘shift-induced interference’:

It is important to keep in mind that imperfect learning in this context does not mean inability to learn, or even lack of sufficient access to the TL to permit full learning: learners must surely decide sometimes, consciously or unconsciously, to use features that are not used by native speakers of the TL. Another point that must be made emphatically is that this type of interference can occur without language shift. In India, for instance, there is a variety of English known as ‘Indian English’ that has numerous interference features of this type from indigenous languages of India; Indian English is spoken by many educated Indians who speak other languages natively, so although it is a variety that is characteristic of one country, it is not, strictly speaking, a variety formed under shift conditions.
(Thomason 2001, 74)

IndE and PCEs in general thus exemplify ‘shift-induced interference without language shift’, so to say: the effect of imperfect learning can be traced in speakers’ L2, even in the absence of language shift. The phonology of Indian English provides a case in point, both with respect to prosody and the realization of individual phonemes: “the rules of accentuation of IndE are closer to those of Indian languages than to those of RP” (Gargesh 2004, 1000), and the “dental fricatives /θ/ and /ð/ are non-existent in IndE” (Gargesh 2004, 998), generally realized as aspirated dental plosives since the “dental sound is present in Indian languages and therefore it is easier in terms of articulation for speakers to replace the fricative” (Sailaja 2009, 21).


Thomason suggests continuing to use the term ‘shift-induced interference’ for want of an alternative, urging readers to keep her caveats in mind and to gracefully overlook the term’s “literal inaccuracy” (Thomason 2003, 693). Her reminder that ‘imperfect learning’ should not be taken to indicate failure in mastering the target language is also highly relevant for postcolonial countries. ‘Interference features’ may become sociolinguistic indicators for an emerging national variety, being appropriated as markers of identity vis-à-vis both neighbouring varieties and the global mainstream.

Yet another of Thomason’s caveats bears directly on the Indian scenario. Her model abstracts away from multilingual contact situations, but in convergence areas such as India, “more than two languages may be involved, with varying mixes of borrowing and shift-induced interference going on at more or less the same time” (Thomason 2003, 693). To trace the transfer of a specific feature from a clearly demarcated source language to a clearly demarcated target language appears to be next to impossible in multilingual India; any account of a specific feature in terms of contact would be rendered pointless, or at least highly speculative. However, the very fact that the Indian communicative space has been characterized by convergence for millennia entails that some linguistic features have spread beyond their source languages. It can then be left to scholars of (Proto-) Dravidian and (Proto-) Indo-Aryan to reconstruct the precise origin of a specific feature before it became integrated into other Indian languages; the linguist out to prove that language contact triggered a specific innovation has to show that the feature in question occurs in languages across India. This approach would work well if the goal is to account for innovations in IndE as a national variety in the ICE-sense, i.e. a variety which is “non-regional and represent[s] the consensus of educated speakers”, to come back to Greenbaum’s definition as quoted above. An IndE contact feature that can only be traced back to either Dravidian or Indo-Aryan languages would stand less of a chance to eventually become part of an emerging IndE standard. Still, a new corpus of IndE should strive to include speakers whose other languages encompass the whole range of linguistic diversity to be found in India, as I will argue in the next section.

4 Indian English in the Twenty-First Century

The start of the Indian ICE-project in the early 1990s coincided with the introduction of momentous and far-reaching changes to Indian society as a whole. From 1991 onwards, the then Indian government embarked upon a course of liberalisation and deregulation of the economy, dismantling the “permits, licences and subsidy raj” (Bose and Jalal 2004, 189) and exposing “the centralized monolith” (Bose and Jalal 2004, 190) to the impact of globalisation. The winners of this economic development are the growing numbers of the Indian middle class. A report on The Great Indian Middle Class published by the National Council of Applied Economic Research (NCAER) in 2004 set the tone for its introductory chapter on the “Middle Class Rising” as follows:

Currently estimated at just a little over 57 million people, the Indian Middle Class has grown almost two-and-a half times since 1995–96 when it was around 25 million […] With the fastest growth in income levels between 1995–96 and 2001–02 taking place in urban areas, 64 per cent of the Indian Middle Class is to be found in urban areas, up from under 58 in 1995–96. Though the largest concentration of the middle class is to be found in northern and western India […], the fastest growth has taken place in southern India. By 2005–06, the middle class is projected to cross 92 million, and with growth expected to accelerate, by 2009–10, the middle class is likely to be around 153 million […].
(NCAER 2004, 1)

Middle class Indians born in the 1990s have an altogether different exposure to English than their parent generation, both quantitatively and qualitatively: if they have not attended an English-medium school themselves, they are highly likely to contribute to the growing demand for private education and to send their children to a private English-medium institution rather than a government school. “English-knowing bilingualism”, the language policy adopted for Singapore, effectively turns into an “English-knowing multilingualism” for a considerable portion of the younger urban population across India. This accelerating spread of English is not limited to India:

Times have changed, and the status and spread of English in Asia have changed substantially: ‘English is exploding in Asia’ […]. No doubt this is the world region where the number of speakers of English is increasing most rapidly, and dynamic developments are more pronounced than anywhere else on the globe.
(Schneider 2014, 249)

Younger Indian speakers who grew up with IndE in their linguistic repertoire are thus prime candidates to illustrate Thomason’s notion of ‘shift-induced interference without language shift’ in their everyday multilingual communicative behavior. These speakers represent the first generation of Indians to appropriate English as one of their mother tongues; their usage is thus also likely to advance an emerging standard Indian English.

4.1 India’s ICE-Age 2?

I would like to propose a new corpus of spoken IndE following the principles of the overall ICE-project, but specifically geared towards capturing IndE as part of a multilingual communicative space. Such a corpus would not be representative of the whole range of IndE across the country, but highly acrolectal and also quite elitist in its focus on the young urban middle class. One reason for limiting the choice of contributors in this way was already mentioned: if there is to be a standard Indian English, it will emerge from their usage as the “consensus of educated speakers”, to reiterate Greenbaum once more. The second reason is derived from a contact-linguistic perspective: what happens to a PCE in terms of nativization when it becomes an L1 among other languages? How do speakers manage their multilingual repertoires? Mesthrie posed the following questions with respect to his research on South African Indian English as a language-shift variety:

Is it the case that language shift throws up more variation than does balanced bilingualism? […] Is it the case that adults involved in the early stages of language shift are the ones who are responsible for the greatest number of innovations, and that children involved in the late stages of language shift (and/or the first post-shift generation) are the ones who act as selectors and stabilizers from this pool of variants? We must leave this topic for future research.
(Mesthrie 2006, 276–77)

Table 2.3 Representation of language families in ICE-India (Lange 2012, 83)

Genetic affiliation of mother tongues | In the corpus (%) | All India (%)
Indo-European | 48.98 | 76.89
Dravidian | 48.54 | 20.83
Tibeto-Burman | 2.07 | 1.00
Austro-Asiatic | 0.41 | 1.11

The Indian case does not necessarily involve language shift, as stated before, but balanced bilingualism may also involve extensive borrowing (cf. Thomason’s borrowing scale (2001, 70–71)). The theoretical emphasis on documenting IndE as a contact language would have further repercussions for the choice of speakers: the corpus should include more speakers of the northeastern Indian languages belonging to the Austro-Asiatic and Tibeto-Burman families. Table 2.3 compares the genetic affiliation of corpus speakers’ mother tongues with the overall affiliation to language families according to the 2001 census (cf. also Table 2.2 above). Speakers of Dravidian and Indo-Aryan languages were almost equally represented and speakers of Tibeto-Burmese even overrepresented, but the higher percentage obscures the fact that the latter were only represented by 5 speakers (of Naga, Manipuri and Angami) overall,

and Austro-Asiatic languages by a single speaker of Khasi (Lange 2012, 82). A higher percentage of such speakers would act as a counter-balance to Sprachbund effects in contact-induced innovations. The Northeast is neither culturally nor linguistically a part of the longstanding Indian convergence area; that is, an IndE innovation that is found with speakers of Indo-Aryan and Dravidian but not others is then (a) most likely contact-induced, and (b) most likely derived from the feature pool of the Indian Sprachbund.

The kind as well as the range of speakers envisaged here as contributors to a new corpus are not as difficult to enlist as one might imagine. First of all, linguists and corpus compilers visiting India are moving within precisely that context that fosters and displays young urban multilingualism, namely the universities. India’s central universities such as the English and Foreign Languages University (EFL-U) in Hyderabad or the highly reputed IITs (Indian Institutes of Technology) across the country draw their students from all over India. Each university campus thus becomes a hotspot of linguistic diversity, and since the overwhelming majority of students live in hostel accommodation on campus, the campuses also encourage the development of dense social networks. It would thus suffice to gather data on some selected campuses, rather than undertaking extensive travel all over India. Speakers should not be discouraged from code-switching, but, as said before, extensive code-switching is unlikely if speakers come together who do not have an Indian language in common.

In annotating the transcribed material, more care should be taken to mark indigenous forms in the data. Nelson (1996, 38) explains how borrowings should be annotated in ICE: expressions “which have become naturalized over time, and are now considered part of the English lexicon” are not to be marked as <foreign>. The label <indig> (indigenous) occurs exclusively in multilingual contexts:


In some countries, such as India and Cameroon, English is used as a second official language, and may coexist with several local ones. In these countries, words from local languages are marked as <indig> (indigenous) rather than <foreign>, though they will be marked as foreign words in every other ICE corpus in which they appear. If words from more than one indigenous language appear in a corpus, the specific language from which they come can be incorporated into the markup symbol, e.g. .
(Nelson 1996, 38)
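The annotation practice described in this quotation can be illustrated with a minimal sketch of how tagged items might be pulled out of a transcript. It rests on two assumptions that are not confirmed by the text as reproduced here: that indigenous items are wrapped as `<indig>...</indig>`, and that a language-specific variant would take the form `<indig=Language>`. The latter in particular is a hypothetical rendering, and the sample utterance is invented.

```python
import re
from typing import List, Optional, Tuple

# Assumption: indigenous items are wrapped as <indig>...</indig>. The optional
# "=Language" suffix in the opening tag is a hypothetical rendering of the
# language-specific variant mentioned in the quotation.
INDIG_TAG = re.compile(
    r"<indig(?:=(?P<lang>[^>]+))?>(?P<text>.*?)</indig>", re.DOTALL
)


def indigenous_items(transcript: str) -> List[Tuple[str, Optional[str]]]:
    """Return (expression, source language or None) for each tagged span."""
    return [
        (match.group("text").strip(), match.group("lang"))
        for match in INDIG_TAG.finditer(transcript)
    ]


# Invented sample utterance, not taken from ICE-India.
sample = (
    "he paid two <indig>lakh</indig> rupees <indig=Hindi>theek hai</indig> "
    "and then he left <indig>yaar</indig>"
)

for expression, language in indigenous_items(sample):
    print(expression, "-", language or "language not specified")
```

A list of this kind could then be compared against a dictionary to decide which items still warrant the indigenous label and which have become naturalized.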

Table 2.4 Words marked as ‘indigenous’ in ICE-India and the OED edition in which they were first mentioned

Lexical item | Meaning | First OED publication
saree | women’s garment | 1982
lakh | numeral 100,000 | 1901
crore | numeral 10 million (100 lakhs) | 1893
Rupee | Indian currency | 1910 (updated 2011)
paneer | curd, soft cheese | new entry 2005
tandoori | food prepared in a tandoor (clay oven) | 1986
chapatti | (OED: ‘chupatti’) Indian bread | 1972
ghee | clarified butter | 1899
samosa | triangular pastry | 1982
ahimsa | doctrine of non-violence | 1972, updated 2012
Shri | title of respect | 1986
Raga | mode in Indian classical music | 1982, updated 2008
yaar | colloq., friend, mate (also a discourse marker) | new entry 2015
haan | yes | –
nahi | no | –
accha | yes, okay (discourse marker) | –
na | invariant tag (discourse marker) | –
theek hai | yes, okay (discourse marker) | –

In practice, the distinction between ‘naturalized’, ‘indigenous’ or ‘foreign’ might be quite straightforward in some cases, but a matter of debate (even within the targeted speech community) in many others. A clue for considering a form ‘naturalized’ is surely its occurrence in the OED; Table 2.4 lists some common


expressions that are marked as <indig> (indigenous) in ICE-India, together with the OED edition in which they were first listed. The table reveals that many expressions have been ‘naturalized’ quite early on and thus would not have required the <indig>-marking. Others are a recent addition to the OED and testify to the growing impact of Indian English on the English language in general. The last items in the table which all remain absent from the OED are very common discourse markers in everyday interaction. These are all derived from Hindi, but not restricted to speakers of Hindi as a mother tongue, and should thus be marked to acknowledge speakers’ hybrid repertoires in a lively multilingual setting.

It is now more than 25 years since data collection for ICE-India began – a new corpus of IndE would also introduce a diachronic dimension to the study of this particular variety paralleling the distance between BROWN and LOB from the sixties and their nineties successors FROWN and FLOB. Specific IndE forms may have stabilised as features of speakers’ L1 IndE, thus settling the vexing question of ‘error’ vs. ‘innovation’ once and for all. An ICE-Age 2-corpus of Indian English would give us the rare opportunity to study an emerging standard variety in the making, a variety firmly embedded in a multilingual communicative space.

References

Biewer, Carolin, Marianne Hundt and Lena Zipp. 2010. “ ‘How’ a Fiji Corpus? Challenges in the Compilation of an ESL ICE Component.” ICAME Journal 34: 5–23.

Bolt, Philip and Kingsley Bolton. 1996. “The International Corpus of English in Hong Kong.” In Comparing English Worldwide: The International Corpus of English, edited by Sidney Greenbaum, 197–214. Oxford: Clarendon Press.

Bose, Sugata and Ayesha Jalal. 2004. Modern South Asia: History, Culture, Political Economy. 2nd edition. London: Routledge.

BROWN = A Standard Corpus of Present-Day Edited American English, for use with Digital Computers (Brown). 1964, 1971, 1979, compiled by W. N. Francis and H. Kučera. Brown University. Providence, Rhode Island.

Crystal, David. 2008. “Two Thousand Million? Update on the Statistics of English.” English Today 24/1: 3–6.

Emeneau, Murray B. 1956. “India as a Linguistic Area.” Language 32: 3–16.

Ferguson, Charles. 1959. “Diglossia.” Word 15: 325–40.

FLOB = The Freiburg-LOB Corpus of British English (F-LOB), compiled by Christian Mair et al., Albert-Ludwigs-Universität Freiburg.


FROWN = The Freiburg-Brown Corpus of American English (Frown), compiled by Christian Mair et al., Albert-Ludwigs-Universität Freiburg.

Gargesh, Ravinder. 2004. “Indian English: Phonology.” In A Handbook of Varieties of English. Volume 1: Phonology, edited by Edgar W. Schneider, Kate Burridge, Rajend Mesthrie and Clive Upton, 992–1002. Berlin and New York: Mouton de Gruyter.

Graddol, David. 2010. English Next India: The Future of English in India. The British Council. http://englishagenda.britishcouncil.org/sites/default/files/attachments/books-english-next.pdf.

Greenbaum, Sidney. 1988. “A Proposal for an International Computerized Corpus of English.” World Englishes 7: 315–15.

Greenbaum, Sidney. 1990. “Standard English and the International Corpus of English.” World Englishes 9: 79–83.

Greenbaum, Sidney, ed. 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Clarendon Press.

Haugen, Einar. 1966. “Dialect, Language, Nation.” American Anthropologist 68/4: 922–35.

Holmes, Janet. 1996. “The New Zealand Spoken Component of ICE: Some Methodological Challenges.” In Comparing English Worldwide: The International Corpus of English, edited by Sidney Greenbaum, 163–81. Oxford: Clarendon Press.

Hundt, Marianne and Devyani Sharma, eds. 2014. English in the Indian Diaspora. Amsterdam: John Benjamins.

ICAME Journal 34. 2010. Special issue ICE-Age 2: ICE Corpora of New Englishes in the Making.

ICE = International Corpus of English. 1988–. Initiated by Sidney Greenbaum.

Journal of South Asian Languages and Linguistics 2/1. 2015. Special issue Micro-linguistic Areas in South Asia.

Lange, Claudia. 2007. “Let’s Face the Music: The Multilingual Challenge.” In Annual Review of South Asian Languages and Linguistics, edited by Rajendra Singh, 87–95. Berlin and New York: Mouton de Gruyter.

Lange, Claudia. 2012. The Syntax of Spoken Indian English. Amsterdam: John Benjamins.

LOB = The Lancaster-Oslo/Bergen Corpus of British English, For Use With Digital Computers. 1970–1978, compiled by Geoffrey Leech, Lancaster University, Stig Johansson, University of Oslo (project leaders), and Knut Hofland, University of Bergen (head of computing).

Mallikarjun, B. 2004. “Indian Multilingualism, Language Policy and the Digital Divide.” Language in India 4. http://www.languageinindia.com/april2004/kathmandupaper1.html.

Masica, Colin P. 1976. Defining a Linguistic Area: South Asia. Chicago: University of Chicago Press.

Masica, Colin P. 1993. The Indo-Aryan Languages. Cambridge: Cambridge University Press.
Mesthrie, Rajend. 1992. English in Language Shift: The History, Structure and Sociolinguistics of South African Indian English. Cambridge: Cambridge University Press.
Mesthrie, Rajend. 2006. “Contact Linguistics and World Englishes.” In The Handbook of World Englishes, edited by Braj B. Kachru, Yamuna Kachru and Cecil L. Nelson, 273–88. Oxford: Blackwell.
Mesthrie, Rajend. 2010. “New Englishes and the Native Speaker Debate.” Language Sciences 32: 594–601. Special Issue: The Native Speaker and the Mother Tongue, edited by Umberto Ansaldo.
Mukherjee, Joybrato and Marianne Hundt, eds. 2011. Exploring Second-Language Varieties of English and Learner Englishes: Bridging a Paradigm Gap. Amsterdam: John Benjamins.
NCAER = National Council of Applied Economic Research. 2004. The Great Indian Middle Class. Results from the NCAER Market Information Survey of Households.
Nelson, Gerald. 1996. “Markup Systems.” In Comparing English Worldwide: The International Corpus of English, edited by Sidney Greenbaum, 36–53. Oxford: Clarendon Press.
OED = The Oxford English Dictionary online. www.oed.com.
Quirk, Randolph. 1990. “Language Varieties and Standard Language.” English Today 21: 3–10.
Sailaja, Pingali. 2009. Indian English. Edinburgh: Edinburgh University Press.
Schmied, Josef. 1996. “Second-language Corpora.” In Comparing English Worldwide: The International Corpus of English, edited by Sidney Greenbaum, 182–96. Oxford: Clarendon Press.
Schneider, Edgar W. 2007. Postcolonial English. Varieties around the World. Cambridge: Cambridge University Press.
Schneider, Edgar W. 2012. “Exploring the Interface between World Englishes and Second Language Acquisition—and Implications for English as a Lingua Franca.” Journal of English as a Lingua Franca 1: 57–91.
Schneider, Edgar W. 2014. “Asian Englishes—into the Future: A Bird’s Eye View.” Asian Englishes 16: 249–56.
Schröter, Verena. 2012. “Colloquial Singaporean English.” In The Mouton World Atlas of Variation in English, edited by Bernd Kortmann and Kerstin Lunkenheimer, 562–72. Berlin: Mouton de Gruyter.
Sharma, Devyani. 2011. “Style Repertoire and Social Change in British Asian English.” Journal of Sociolinguistics 15: 464–92.
Shastri, S. V. 2002. Overview of the Indian Component of the International Corpus of English (ICE-India). Distributed with the ICE-India corpus.

Sridhar, Kamal K. and S. N. Sridhar. 1986. “Bridging the Paradigm Gap: Second Language Acquisition Research and Indigenized Varieties of English.” World Englishes 5: 3–14.
Thomason, Sarah. 2001. Language Contact: An Introduction. Washington: Georgetown University Press.
Thomason, Sarah. 2003. “Contact as a Source of Language Change.” In The Handbook of Historical Linguistics, edited by Brian D. Joseph, 687–712. Oxford: Blackwell.
Thomason, Sarah. 2010. “Contact Explanations in Linguistics.” In The Handbook of Language Contact, edited by Raymond Hickey, 29–47. Oxford: Wiley-Blackwell.
Thomason, Sarah G. and Terrence Kaufman. 1988. Language Contact, Creolization, and Genetic Linguistics. Berkeley: University of California Press.
Trautmann, Thomas R. 2006. Languages and Nations. The Dravidian Proof in Colonial Madras. Berkeley: University of California Press.
Vertovec, Steven. 2007. “Super-diversity and its Implications.” Ethnic and Racial Studies 30: 1024–54.
Williams, Jessica. 1987. “Non-native Varieties of English: A Special Case of Language Acquisition.” English World-Wide 8: 161–99.
Zastoupil, Lynn and Martin Moir, eds. 1999. The Great Indian Education Debate. Documents Relating to the Orientalist-Anglicist Controversy, 1781–1843. Richmond, Surrey: Curzon.

CHAPTER 3

Mono- and Multilingualism in a Specialized Corpus of New Zealand Stories

Alexander Onysko and Marta Degani

1 Introduction

If we consider research on world Englishes, it is evident that corpora have become a major tool for investigating English varieties. This is due to developments in corpus linguistics and an increasing turn towards empirical, usage-based analyses of Englishes. Clearly, corpora can help with gaining insight into variety-specific usage patterns, and compilers try to push the limits inherent to every data collection. The major struggle in corpus compilation has been the aim of increasing the size of data while maintaining control over its representativeness. So far, however, the study of world Englishes has not paid much attention to another trend in corpus linguistics, namely creating and utilizing small, specialized corpora which allow for in-depth analyses of particular research questions.

The project we describe in this paper takes a step in this direction by outlining the ongoing compilation of a corpus that taps into the language use of ethnically and linguistically diverse New Zealanders when speaking in their varieties of New Zealand English. At the core of the project is a story-telling task that was carried out with a number of participants of Māori and Pākehā (i.e. non-Māori, particularly New Zealand European) ethnicities. While the main language of the task is English, we are interested in finding out whether differences among the participants emerge and whether such differences can be related to the speakers’ linguistic repertoires ranging from mostly monolingual to bi/multilingual¹ skills. Our hypothesis is that experience in using New Zealand English and another language, in particular te reo Māori, the indigenous language of New Zealand, would inspire linguistic features that can enrich the description of internal variation in New Zealand English (see Lange, this volume, for a similar observation in the context of Indian Englishes).

¹ In line with a consistent body of research in the area of bi- and multilingualism, we use the term multilingualism as inclusive of bilingualism (cf., e.g., Cenoz 2013).

© koninklijke brill nv, leiden, 2017 | doi 10.1163/9789004276697_004

Besides various forms of language contact that can be triggered by multilingual competence, the New Zealand context is also striking for its cultural constellation. On the one hand, after about 150 years of colonial oppression, the Polynesian Māori people, who first inhabited the ‘land of the long white cloud’, Aotearoa, have experienced a revival since the 1980s, regaining to some extent their rights and their language while maintaining and reviving their cultural practices. On the other hand, the socio-historical development of New Zealand since the signing of the Treaty of Waitangi in 1840 and the arrival of large numbers of British settlers soon after (cf. King 2003) has firmly established an Anglo-European culture in the country that has developed mostly in sync with ‘the Western world’. In this situation, many of the Māori people who care for their language and culture have become not only bilingual but also bicultural New Zealanders, and their use of the mainstream language of English can be a token for their linguistic and cultural repertoires.

In the context of this situation, our specific story-telling task is but a small attempt to make available a targeted collection of spoken language that can give rise to different linguistic analyses and, ideally, to some insights into the nexus of language and culture. To do so, we are in the process of turning the collected spoken data into a small, specialized corpus that gathers the different stories prompted by a set of stimuli and recorded with a selected body of participants grouped according to the factors of monolingualism vs. multilingualism and their ethnic-cultural identification. In line with previous research (cf. Aston 1997; Flowerdew 2004), the projected size of the New Zealand Stories Corpus (NZSC) will be in the upper range of small corpora, reaching around 250,000 words of transcribed speech. The specific conversational situation of a prompted story-telling task, which leads to a co-constructed narration, or small story (cf. Georgakopoulou 2007), characterizes the corpus as a specialized collection of spoken language in terms of topics and genre. At the same time, background information on the speakers accompanies the story data, which facilitates qualitative analyses. As pointed out by Vaughan and Clancy (2013), the major benefit of small corpora is that their data is enriched by contextual information, allowing for linguistic investigations that rely on a contextual embedding of language use. It is in this spirit that we have carried out our data collection and are currently building the corpus.

Another aspect that makes the NZSC a ‘specialized’ corpus is its concern with multilingual and monolingual speakers of New Zealand English for the purpose of comparing these groups of speakers. As many of the contributions collected in this volume show, awareness of multilingualism in corpus compilation, including monolingual corpora, has not been regular practice so far. In addition, the recognition of multilingual elements is closely connected to transcription and mark-up conventions. In our paper, we would thus like to provide insight into the specifics of the NZSC, focussing on the methodology of data collection and corpus compilation as well as on the presence of multilingual elements in the speech data. Some examples of multilingual units will be analyzed with a view to the type of linguistic analyses that can be applied to the data, and some possibilities for future research will be outlined at the end. Since the NZSC is intended to make a contribution to existing corpus resources in the field of World Englishes, we would first like to take a look at existing corpora in this area and briefly discuss how they deal with instances of multilingualism, if at all.

2 Multilingualism in Major Corpora of World Englishes

From its beginnings in the 1960s, the field of corpus linguistics has seen many efforts dedicated to compiling corpora of the English language, often with the intent of studying grammatical phenomena of English. The BROWN corpus of written American English (1964) and its counterpart, the Lancaster-Oslo/Bergen Corpus (LOB) for British English compiled in the 1970s, were followed by projects that increased the size of corpora while maintaining a focus on the major varieties of British and American English. The British National Corpus (BNC), the Bank of English and the Corpus of Contemporary American English (COCA) are nowadays still leading corpora in the field. Most recently, the use of the World Wide Web has immensely increased the size of automatically generated corpora such as the Wikipedia Corpus comprising 1.9 billion words, the NOW corpus, which currently holds about 2.8 billion words and grows by 20 million words a day, and the English TenTen Corpus (Sketchengine), which boasted about 19 billion words in 2013 (Sketchengine 2016). Apart from that, a range of smaller corpora provides collections of English historical texts such as the Helsinki Corpus of English Texts and ARCHER (A Representative Corpus of Historical English Registers).

Looking beyond corpora that target British and American English, some coverage can be found most notably for Australia with the Australian National Corpus and New Zealand with the Wellington Spoken Corpus (WSC) and the Wellington Written Corpus (WWC). Researchers interested in investigating other world Englishes currently have two main resources at their disposal. First of all, the International Corpus of English (ICE) provides a balanced set of 1-million-word corpora and currently comprises 26 teams working on different national varieties, 14 of which have already been made available at the time of writing (ICE website, 09/2016). While the ICE corpora have become the staple diet of corpus-based research on varieties of English, the recently issued GloWbE (Global Web-based English) corpus offers another opportunity to delve into the world of Englishes. As described in Davies and Fuchs (2015a), the GloWbE corpus consists of 1.9 billion words that were harvested from the World Wide Web, covering 20 English-speaking countries. Google® searches were used for identifying country-specific websites, mostly by their domain labels. In terms of text types, the compilers aimed for an approximate relation of 60% of blog data and 40% of a mix of other writings. This way, they hope to mirror the relation of 60% of spoken vs. 40% of written data in the ICE corpora (Davies and Fuchs 2015a, 3–4).

The sheer size of GloWbE provides a lot of potential for corpus-based research on world Englishes as it facilitates the emergence of more variety-specific usage patterns. Furthermore, it allows for the comparison of findings among 20 different varieties. However, as discussed in a series of responses to Davies and Fuchs’ (2015a) keynote article on GloWbE, the level of ‘noise’ remains particularly high in the corpus, which means that the corpus needs to be used with care and that results need to be checked manually. Nelson (2015, 39) mentions that automatically recognizing where a particular website is geographically located can lead to erroneous results, and, what is more, it cannot be taken for granted that an author of a web text is actually a user of the national variety indicated by a particular country-specific web domain. In addition, Davies and Fuchs (2015b, 46–47) concede in their response to the critical comments that a certain number of duplicated websites have probably remained in the corpus even if measures have been applied to reduce that problem such as filtering out texts which exhibit matching long n-grams. Clearly, these issues draw attention to the fact that researchers working with GloWbE need to be willing to retrace the original contexts in order to weed out the findings manually.

Apart from that, Mair (2015, 30) adds that no provisions have been made for the recognition of multilingual elements in the automated tagging of the data in GloWbE. Thus, the potential of highlighting the multilingual realities that underlie many varieties of English remains unexplored in the corpus. This criticism can be extended to the ICE corpora in which other language material is sometimes treated as extra-corpus material and may be eliminated from the main texts (Mair 2015, 30). However, when looking at the conventions for annotation applied in the transcriptions of the ICE corpora, they actually include the recognition and marking of the categories of “foreign language”, “other language”, and “indigenous language” (Nelson 2002a; 2002b; see also Lange, this volume). This shows that language contact features such as borrowings and codeswitches are to some extent accessible for searches in the corpora.
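The duplicate filtering by matching long n-grams that is mentioned above for GloWbE can be illustrated with a minimal sketch. It is not the procedure Davies and Fuchs actually used, which is not specified here; the function names, the whitespace tokenization and the default n-gram length of 10 words are all assumptions made for illustration.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams (as tuples) found in a text."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_probable_duplicates(texts, n=10):
    """Pair each text with the earliest text that shares an n-gram with it.

    Returns a sorted list of index pairs (i, j) with i < j. The default
    n=10 is a hypothetical choice, not a value reported for GloWbE.
    """
    first_seen = {}  # n-gram -> index of the first text containing it
    pairs = set()
    for idx, text in enumerate(texts):
        for gram in word_ngrams(text, n):
            if gram in first_seen:
                pairs.add((first_seen[gram], idx))
            else:
                first_seen[gram] = idx
    return sorted(pairs)
```

A flagged pair is only a candidate: quotations and boilerplate also produce shared n-grams, so manual checking of the kind the respondents call for would still be needed.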

English learner corpus research, which is related to the field of World Englishes from the perspective of language contact (cf. Onysko 2016b), would potentially also benefit from the annotation of contact or transfer features since the language backgrounds of learners can shine through in their use of English depending on their proficiency and other factors. So far, however, it appears that learner corpora have mostly not been annotated for language contact features and multilingual elements (for a discussion cf. Callies and Wiemeyer, this volume).

The use of English as an international language has been captured to some extent in VOICE (Vienna-Oxford International Corpus of English). Even though VOICE consists of conversations held by speakers of English as a learner language or second language, the corpus has not been conceived of as a learner corpus. Instead, the aim of VOICE is to provide a database of English used as a vehicle language or lingua franca in contexts of international communication. Most notably for the concerns of the current volume, utterances in languages other than English are consistently annotated, alluding to the multilingual potential contained in the corpus (VOICE 2007).

Finally, and most relevant for the New Zealand Stories Corpus, the Wellington Spoken Corpus (WSC) of one million words that comprises 75% of informal dialogue (Holmes, Vine and Johnson 1998, 7) has made provisions for the coding of non-English speech. In general, there is only little use in the corpus of other languages that are not endemic to New Zealand. These are marked with the tag “foreign” and the name of the language. Most of the multilingual elements in the WSC are related to the indigenous Māori language, and these are variably coded in the corpus as “ … ” (Holmes, Vine and Johnson 1998, 39). Unlike in our small corpus of New Zealand stories, however, not all instances of Māori elements are marked in the WSC. For example, Māori proper names including tribal names are not annotated as Māori. Indigenous names for flora and fauna are also generally not marked even if existing English alternatives are added in glosses. Only if Māori names of flora and fauna are not widely used among the non-Māori speakers of New Zealand English are these terms annotated as Māori speech in the WSC. Furthermore, longer stretches of uninterrupted Māori discourse are omitted from the transcripts and a summary of their contents is given instead (Holmes, Vine and Johnson 1998, 39).

This shows that while some of the multilingual, particularly Māori, elements are acknowledged in the WSC, the main focus of the corpus is on the use of English in New Zealand, and the interactional multilingualism that occurred during data collection is backgrounded to some extent. Moreover, the decision of what to code as Māori and what not seems to be influenced by a distinction between accepted Māori borrowings in New Zealand English (not necessarily coded with proper names) and instances of codeswitching (coded besides proper names).

Since the New Zealand Stories Corpus explicitly targets potential differences in the use of English between multilingual and monolingual New Zealanders, our annotation strives to be fully inclusive of all language elements and terms of Māori origin used by the participants. Further mention of our annotation conventions for multilingual elements will be part of the next section on corpus design. By way of examples, section 4 will show that an inclusive approach to coding all instances of Māori elements in the mainly English stories is important as the context-specific use of proper names and other lexical items from Māori can build up a connected imagery that can reverberate a speaker’s identity and render rich cultural content in the English narration.

3 Corpus Design

The compilation of the New Zealand Stories Corpus (NZSC) is an essential part of a larger project which aims at investigating bilingualism in Māori and English and its implications for the varieties of English in New Zealand. The project was initiated during a one-year research stay at the School of Māori and Pacific Development (SMPD) at the University of Waikato in New Zealand. This undertaking would not have been possible without the support of many scholars at the SMPD and the linguistics programme at the University of Waikato and beyond.²

² We would particularly like to thank (in alphabetical order): Julie Barbour, Donna Campbell, Hineitimoana Greensill, Ray Harlow, Jeanette King, Daryl Macdonald, Margaret Maclagan, James McLellan, Sophie Nock, Haupai Puke, Raukura Roa, Tom Roa, Keely Smith, Linda Smith, and all the participants who volunteered to take part in the project.

The corpus will comprise 142 narrations that were elicited through a story-telling task. In line with a long tradition of story-telling and photo elicitation in the social sciences (cf., e.g., Harper 2002; Harrison 2002; Ketelle 2010; Robinson 2002), a set of three photographs was used as a starting point for the narrations. Since the pictures had to be evocative, three typical, scenic New Zealand landscapes that portray neither cultural artefacts nor people were chosen as the ideal shots. The selected images represent a) a lake surrounded by green vegetation and with a jetty protruding into the water, b) a beach scenery with the inlet of a brook flowing into the sea, and c) a gravel road leading towards a mountain range. The same visual stimulus (the three photographs) and the same verbal prompt (“please select a picture and take it as a starting point to tell a story”) were used with each of the participants in the study. No time constraints were imposed for the task so that people could feel at ease and freely associate the selected image with their thoughts in order to recount a story.

The notion of story that was adopted for the study is flexible in that it comprises anecdotes, memories and recollections of personal experiences. In other words, the investigation considers a story as a space for individual verbal expression (here, no consideration is given to either specific gestures accompanying the oral narration or personal drawings by the participants as aids to their verbal expression). This notion of story does not strictly coincide with a monologue in the Labovian sense. According to Labov’s model (1972), a story is a monologue of personal and past experiences that is usually structured in six different parts (abstract, orientation, complicating action, resolution, evaluation and coda) and is the exclusive product of its teller. The role of any other person sharing the experience or story-telling is reduced to that of a listener. In our study, the vast majority of the collected stories combine monologues with interactive sequences in a way that is more in line with research in modern narratology and its emphasis on interaction as a fundamental ingredient for the construction of an ‘authentic’ narrative space in the experimental setting (Baynham 2011; De Fina and Perrino 2011; Koven 2011; Schiffrin 2009). In particular, the study adopts the idea of “small story” (Georgakopoulou 2006; 2007; Bamberg and Georgakopoulou 2008), which describes a story as emerging in interaction and characterized by frequent turns among the participants who contribute to its online co-construction.

All of our stories start with a monologue, generally lasting a few minutes, followed by the active participation of the facilitators. This means that after the participant tells the first part of the story, the ‘interviewers’ also engage in the construction of the narrative by telling some of their own related experiences and asking a few questions. In this way, most of the stories continued for an average of about fifteen to twenty minutes. Each story was audio recorded using a professional, non-intrusive technical device: a pocket-size digital recorder (Voice Tracer®), which was placed on a table close to some finger food and drinks provided for the participants. Digital files were recorded in high quality MP3 format. The task was performed for all people in the same setting, a meeting room at the University of Waikato, and data were collected over a period of 8 months (from January to August 2011). At the end of each oral session, participants were thanked and provided with information about the major aims of the research.

All the participants were students regularly enrolled at the University of Waikato when carrying out the task. While this type of selection allowed for the creation of a fairly homogenous pool of speakers, it needs to be pointed out that the data cannot be taken as representative of New Zealand society at large. Participants were recruited locally during visits of the principal researchers to classes in different schools and departments in the humanities and social sciences. In order to limit possible biases related to the nature of the task, it was decided to include only students of languages, linguistics and communication. During the visits to classes, which were aimed at raising interest in the research project without revealing details of the study, a preliminary questionnaire was distributed among students that asked for their ethnic identity, personal information and knowledge of languages. More precisely, the potential participants had to declare their ethnicities, age, country of birth, years of residence in New Zealand, subjects of study, and knowledge of languages other than English. At the end of the questionnaire, students had the option to indicate whether they wished to participate in the study. In case of a positive reply, people were contacted for arranging individual meetings to carry out the task. The kind of information asked on the questionnaire was crucial for arranging participants in comparable groups according to the main variables of ethnicity and knowledge of either one or more than one language. We aimed to recruit about 30 participants each for four target groups: a) ethnically Māori bilingual speakers of English and Māori; b) ethnically Māori monolingual speakers of English; c) ethnically Pākehā (New Zealanders of European descent) monolingual speakers of English; d) ethnically non-Māori, mostly Pākehā, New Zealanders with knowledge of another language (usually at an advanced learner level in a language other than Māori). 
The number of people who took part in the study is 142, but not all of them fit into one of the four target groups. In order to account for the degree of bi/multilingualism of the participants, different criteria were followed, which were inspired by previous research on bilingual speakers of Māori (see Harlow et al. 2009). First of all, students were asked to provide a self-rating concerning their knowledge of all languages other than English on a scale from 1 to 5 points: 1 (very good), 2 (good), 3 (fairly good), 4 (basic), 5 (a few words and expressions). Particularly for the Māori language, this was accompanied by additional questions concerning: age of acquisition, context of acquisition, frequency of use, situations of use, and exposure (e.g. cultural activities and media). Answers to each of these factors were given points that were weighted according to the following hierarchy: self-rating > age and environment of acquisition > frequency of use > contexts of use > exposure.
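The weighted points feed into a single index, for which the chapter goes on to report cut-offs: 0–33 points for monolingual speakers, 55 points and more for multilingual speakers, and the range in between for weak bilinguals. A minimal sketch of that final grouping step follows. It assumes the weighted score has already been computed; the function name is hypothetical, and treating a score of exactly 33 as monolingual is an assumption, since the reported ranges meet at that value.

```python
def classify_speaker(score):
    """Map a bilingualism-index score to one of the reported groups.

    Assumption: a score of exactly 33 counts as monolingual, following
    the reported range of 0-33 points for monolingual speakers.
    """
    if score < 0:
        raise ValueError("score must be non-negative")
    if score <= 33:
        return "monolingual"
    if score < 55:
        return "weak bilingual"
    return "multilingual"
```

The gap between the two outer bands is what creates the numerical distance between the monolingual and multilingual groups.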

The answers to the questionnaire showed that a division into monolingual vs. multilingual participants was not possible as a simple, categorical contrast. Instead, all of our participants indicated that they knew at least a few words and expressions in other languages. This gave rise to a continuum of states from being virtually monolingual to highly multilingual. Within that continuum, huge differences existed between the participants, and we aimed to mirror that in our categorization of monolingual and multilingual groups by creating a sufficient numerical distance between them. In detail, monolingual speakers scored from 0–33 points while participants were considered as multilingual if they reached 55 and more points according to the weighted hierarchy of points (see Onysko and Degani 2014, 187–90 for more details). Speakers that received between 33 and 55 points were considered as weak bilinguals. In this way, the bilingualism index allowed us to divide our participant body into the following groups: Māori multilinguals, Māori weak multilinguals, Māori monolinguals (in English), Pākehā multilinguals, and Pākehā monolinguals (in English). Some research on these participant groups has already been carried out involving a compound meaning interpretation task that was administered after the story-telling (see, e.g., Onysko and Degani 2014; Onysko 2016a). For the building of the NZSC, on the other hand, it is important to include the sociolinguistic information of each of the speakers so that it will be possible to run internal comparisons with the data and to facilitate qualitative investigations.

As with any corpus, another crucial aspect for the building of the NZSC is data transcription, which can be steered by the aims and interests of the corpus builders, that is, the researchers. The transcription of our stories has required us to make a few methodological choices that are particularly relevant for the marking of multilingual elements. Most of these choices concern the annotation of Māori terms. First of all, it was decided to mark all instances of Māori lexical units. To do so, the symbols  …  confine Māori proper nouns and  …  is used for all other Māori lexical material.³ Example (1) shows how the annotation is implemented in the transcription.

³ Elements of other languages are annotated as  … .

(1) … I’m part of a Waikato [M] group, and so ahm although I’m from Hastings na Ngati Pourou, Ngati Kahungunu [M], I’m not from Waikato [M], ahm if you know of turangawaewae [M] in Ngaruawahia [M] , ahm Mahinarangi [M] and Turongo [M] , oh they are the name of the wharenui [M] , well Mahi- Mahinarangi [M] is actually my tipuna [M] from Kahungunu [M] and she married into the Waikato [M] and so that’s my link [Met] and that’s that’s why I feel content and and like I don’t feel out of place with this group … (P 25)
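The bracketed codes that follow annotated items in example (1) are regular enough to be tallied mechanically. The sketch below assumes a transcript is available as a plain string in which a code such as [M], [E], [T?] or [Met] follows the item it annotates; the function names are hypothetical, and the delimiting symbols around Māori items, which are not reproduced in example (1) as printed here, are not modelled. For multi-word items only the last word before the code is recovered.

```python
import re
from collections import Counter

# Bracketed codes named in the chapter: [M], [E], [T?], [Met].
TAG_PATTERN = re.compile(r"\[(M|E|T\?|Met)\]")
# The word immediately preceding a pronunciation code [M] or [E].
ITEM_PATTERN = re.compile(r"([^\s\[\]]+)\s+\[(M|E)\]")


def count_tags(transcript):
    """Count how often each bracketed code occurs in a transcript string."""
    return Counter(TAG_PATTERN.findall(transcript))


def tagged_items(transcript):
    """Return (word, code) pairs for words directly followed by [M] or [E]."""
    return ITEM_PATTERN.findall(transcript)
```

Applied to a fragment such as `"my tipuna [M] from Kahungunu [M] and so that's my link [Met]"`, `count_tags` yields two [M] codes and one [Met] code, and `tagged_items` returns the pairs for `tipuna` and `Kahungunu`.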

In contrast to the WSC, the fact that a Māori word is considered as integrated in the vocabulary of NZE did not prevent its annotation. This was done in order to guarantee full visibility to multilingualism in the corpus and to allow for investigations into language contact phenomena such as lexical borrowing and codeswitching. In the NZSC, every occurrence of Māori language use is marked, irrespective of whether it concerns a single word or longer syntactic structures. Considering the fact that many Māori words are part of NZE, it is also important to pay attention to how speakers actually pronounce them. Thus, when an instance of Māori language occurs, an annotation is provided that specifies whether its pronunciation is Māori or not. The symbol [M] is used after a Māori term to indicate a Māori type of pronunciation, while [E] signals an English (Pākehā) pronunciation. In relation to suprasegmental features, the occurrence of high rising intonation is coded with the symbol . This is inspired by findings in previous research that has identified high rising terminals as a characteristic intonation pattern of speakers of Māori English (see, e.g., Allan 1990; Britain 1992; Szakay 2008). The annotation also accounts for potential instances of transfer from Māori into English, marked by [T?]. Another aspect that is given recognition in the transcriptions is the use of figurative language, especially metaphors. The occurrence of a metaphor is marked with [Met]. Beyond this and in line with common practice in the field (cf. mark-up and transcription guidelines in VOICE), conventions have also been adopted to mark a range of other general features such as emphasis, pauses, overlaps and laughter, among others.

4 Multilingual Elements in the Corpus

An interesting aspect that researchers can investigate in the NZSC is the occurrence of multilingual elements. While the story-telling task was performed in English, stories also contain lexical and syntactic units from languages other than English. This is mostly due to the fact that one major group of participants is made up of people who are bilingual speakers of English and Māori. In addition to Māori, a few examples from other Polynesian languages (e.g. Samoan) are also present although this remains a more marginal phenomenon.

Onysko and Degani

The discussion of multilingualism in the corpus, therefore, focuses on the presence of Māori elements in what on the surface appear to be English stories. A first analysis of the data reveals that the Māori language shows up in the corpus to different degrees. The most obvious instances are place names. Many stories, especially the ones told by Māori-English bilinguals, are rich in Māori toponymy. There are Māori place names denoting lakes (e.g. Rotoiti, Rotorua, Okareka), mountains (e.g. Maungatautari, Ruapehu, Tongariro), beaches (e.g. Turihaua, Waihi, Ohope), islands (e.g. Mokoia, Whakaari, Motu Taiko), national parks (e.g. Turangi), hills (e.g. Takaka) as well as towns and villages (e.g. Maketu, Tauranga, Kawhia, Kaikoura, Whakatane) and regions (e.g. Waikato, Taranaki). This finding is related to the nature of the task, which prompted associations with certain typical landscapes portrayed in the stimulus pictures. Thus, unsurprisingly, many speakers started telling their stories by providing the name of a location as the general setting for the story to come. In addition, a large number of place names in New Zealand have a Māori origin. While the presence of Māori place names was expected, it is also important to emphasize that, in stories told by Māori-English bilinguals, Māori place names do not only occur at the beginning of the narration to indicate the geographic setting of the story. Instead, place names tend to recur throughout the narrations of Māori-English bilinguals, and they often become a vehicle for expressing one’s own tribal identity and sense of belonging to a particular area.

In a similar fashion, proper names denoting people or events in the stories of Māori-English bilinguals can also convey cultural meaning. Thus, stories contain clear references to: a) the names of Māori tribal groups with whom speakers identify (e.g. Te Arawa, Tainui, Kai Tahu), b) the names of Māori ancestors that define genealogical lines of descent (e.g. Mahuripounamu, Mahinarangi), c) the names of Māori people of great historical influence (e.g. the Māori king Tawhiao), and d) the names of protagonists of Māori legends that show the connection of speakers to the land they talk about (e.g. Tutanekai, Hinemoa). Among the proper names referring to cultural events and institutions, two relevant examples are Matatini (‘a competition involving different forms of Māori song and dance’) and Kingitanga (‘the Māori king movement’).

In addition to Māori proper names of places, people, things and events, other multilingual elements that frequently occur in the stories of Māori-English bilinguals are Māori common nouns referring to fauna (e.g. kereru ‘New Zealand pigeon’, kina ‘sea urchin’, paua ‘abalone’), objects/artefacts (e.g. tiki ‘carved figure, a neck ornament’, kete ‘basket’, mere ‘short flat weapon/club made of [green]stone’), education/learning (e.g. wānanga ‘seminar, tribal knowledge, tertiary institution’, kura kaupapa ‘Māori immersion school’, kōhanga reo ‘Māori immersion nursery school’) and other important cultural concepts (e.g. wairua ‘spirit, soul’, marae ‘traditional meeting ground and surrounding buildings’, tangihanga ‘Māori funeral ceremony’). Māori-English bilinguals also very often refer to the Māori language as reo (literally ‘language’) or te reo (‘the language’).

These instances of multilingual language usage pose methodological questions concerning their classification as either loans or codeswitches. A number of Māori terms have been adopted as lexical borrowings in New Zealand English (cf. Bauer 1980, 1994; Deverson 1984, 1991; Geering 1993; Macalister 2005). These borrowings have entered New Zealand English in different historical periods starting from the early phases of colonization (Macalister 2006), and they have come to serve different linguistic, social and political needs (cf., e.g., Gordon and Deverson 1998; Degani 2010; Onysko and Calude 2013). The first authoritative recognition of Māori words as an integral part of the New Zealand English lexicon came with their integration in the Dictionary of New Zealand English (Orsman 1997). Macalister’s Dictionary of Maori Words in New Zealand English (2005) also provides useful insights into the range and number of Māori words that can be considered as belonging to the vocabulary of this English variety.

In light of these observations, all of our participants, who identify as New Zealanders, can be expected to use a few Māori words in their spoken interactions in English. While first data analyses appear to confirm this general expectation, the multilingual competence of our Māori-English bilingual participants affects their linguistic behaviour in a more significant way. First findings indicate that Māori-English bilingualism results in a higher incidence and a broader range of Māori words in the oral narratives, many of which can be considered as not being part of general New Zealand English (cf. Macalister 2008). Such instances of Māori vocabulary can be interpreted as examples of codeswitching.
This idea is also supported by the fact that our bilinguals sometimes felt a need to provide an English translation equivalent (see Hynninen et al., this volume) or an English explanation for the Māori expressions they used:4

(2) I did ahm manu korero, which is a Māori speech competition, when I was at high school. (P 20)

In (2) the speaker is accommodating to her interlocutors who are not expected to know her usage of Māori. In general, a noticeable feature in the stories of bilinguals is their display of a good range of Māori lexical content, as exemplified in the passage reported in (3).

(3) We’ve got like a caravan, like a little area out at lake Rotoiti, and, ahm my dad bought a caravan next to the marae there, ahm, it’s called Ruato but the mar- the wharenui is called Ngā Pūmanawa e waru t- so the eight beating hearts of Te Arawa, the the iwi—and he’s got like a section just next door to that marae and he’s got we’ve got like a caravan there and things and we go there for New Year’s or, yeah many New Year’s and then we take the kids out to, ‘cause all all my cousins and things would come along … (P 3)

In this passage there are Māori names referring to places, things and people like Rotoiti, Ruato, Ngā Pūmanawa and Te Arawa. In addition, the excerpt also contains words referring to cultural concepts like marae, wharenui and iwi, and it shows an example of a suppressed codeswitch into Māori, e waru te (‘the eight’), which is then continued in English. Other examples of codeswitching in the data involve quoting Māori phrases or stretches of Māori discourse, as in (4).

(4) I asked what we gonna be covering and she goes ‘Kia ora, kei te pehea koe’? (P 19)

4 To facilitate the reading of the cited examples from the stories, markups have been removed. Speaker identification codes are given in brackets below the examples.

As indicated by the examples above, the New Zealand Stories Corpus appears to be promising for conducting research on language contact phenomena, mostly involving the relation between English and Māori. In addition, the data allows for an exploration of cultural content that is mediated via the Māori language in the English narrations. Example (5) briefly illustrates how the nexus of language and culture can be investigated in the corpus.
(5) it’s probably my main marae there and that’s where all the tangihanga, oh you know like when we go back home if it’s immediate family, so I’ve lost quite a few people—you know immediate, first cousins and aunties and uncles and stuff, quite a few and so that’s our main our main marae that we go to, and yeah, that really just brings it all back and, ahm, reminds me of tikanga. basically it just, yeah, reminds me of being being on the marae, ah the protocols that happen on the marae, ahm, right from the sound of the karanga of the kuia, ahm, from when, ah, she is welcoming a group onto the marae, ahm, all those protocols, yeah, so it starts from there. (P 25)

In this story, the speaker talks a lot about the place where she comes from, alluding to her tribal identity and her affiliations to different tribal groups. The Māori term that recurs most frequently in her narration and in the representative excerpt reported above is marae. As explained by anthropologists, for the Māori people the marae represents “the centre of Maori identity”, a “cultural institution” (Mead 2003, ch.6) and is “a symbol of tribal identity and solidarity” (Barlow 1991, 73). It is a place, both physical and spiritual, where the Māori can connect to their land, their ancestors as well as to their rituals, customs and cultural values. Different types of ceremonies and cultural activities can take place on the marae. Here, the speaker specifically refers to her participation in Māori funerals (tangihanga), which can last a few days and are celebrated in accordance with norms of culturally appropriate behaviour (tikanga). In the passage, she recounts the initial phases of the funeral when an elderly lady (kuia) starts singing a special call (karanga), which welcomes people to join the ceremony.

From a larger perspective, research concerning the connections between language selection, cognition and the transmission of culturally specific content can be fruitfully pursued further with the corpus. For instance, one study that is based on selected data from the NZSC shows how the Māori key cultural concept marae carries specific conceptualizations (Degani 2017).

5 Conclusion

At present, the NZSC outlined in this paper is still under construction. Only a portion of the stories told by Māori-English bilinguals has been transcribed and annotated so far. Once the complete data are available in electronic form, thorough investigations can be carried out on the potential diversity of language use among the different participant groups of monolingual and multilingual New Zealanders. Apart from our primary interest in language contact phenomena, conceptual metaphors and the representation of cultural conceptualizations, other research questions can also be explored. For example, the pragmatic features of High Rising Terminals (HRT) and general extenders could add to the description of New Zealand English and illuminate potential further differentiations due to the mono- and multilingual repertoires of its speakers.

As discussed in the paper, the NZSC will be an example of a small, specialized corpus, and conclusions drawn from the data will need to be interpreted carefully in terms of the socio-demographic backgrounds of the participants and the particular task situation. On the other hand, the specific task and methodology that underlie the NZSC will make the corpus a focused collection of language data, which facilitates in-depth, qualitative analyses. For the study of world Englishes, the corpus is intended to provide a small step towards targeted corpus-based research of Englishes in their multilingual contexts.

References

Allan, Scott. 1990. “The Rise of New Zealand Intonation.” In New Zealand Ways of Speaking English, edited by Allan Bell and Janet Holmes, 115–28. Wellington: Victoria University Press.
ARCHER = A Representative Corpus of Historical English Registers. 1990–1993/2002/2007/2010/2013. Originally compiled under the supervision of Douglas Biber and Edward Finegan at Northern Arizona University and University of Southern California; modified and expanded by subsequent members of a consortium of universities.
Aston, Guy. 1997. “Large and Small Corpora in Language Learning.” In PALC97: Practical Applications in Language Corpora, edited by Barbara Lewandowska-Tomaszczyk and Patrick J. Melia, 51–62. Łodz: Łodz University Press.
AusNC = Australian National Corpus. https://ausnc.org.au/.
Bamberg, Michael and Alexandra Georgakopoulou. 2008. “Small Stories as a New Perspective in Narrative and Identity Analysis.” Text & Talk 28/3: 377–96.
Bank of English Corpus. Compiled under the leadership of John Sinclair. http://www.collins.co.uk/page/The+Collins+Corpus.
Barlow, Cleve. 1991. Tikanga Whakaaro: Key Concepts in Māori Culture. Auckland: Oxford University Press.
Bauer, Laurie. 1980. “Something Old, Something New, Something Borrowed: An Essay on Loanwords.” In Views of English 2: Victoria University Essays for English Teachers and Students, edited by David Norton and Roger Robinson, 19–27. Wellington: Victoria University of Wellington.
Bauer, Laurie. 1994. “English in New Zealand.” In The Cambridge History of the English Language, Vol. V, English in Britain and Overseas: Origins and Development, edited by Robert Burchfield, 382–429. Cambridge: Cambridge University Press.
Baynham, Mike. 2011. “Stance, Positioning, and Alignment in Narratives of Professional Experience.” Language in Society 40: 63–74.


BNC = British National Corpus. Compiled by the BNC Consortium. http://www.natcorp.ox.ac.uk/.
Britain, David. 1992. “Linguistic Change in Intonation: The Use of High Rising Terminals in New Zealand English.” Language Variation and Change 4/1: 77–104.
BROWN = A Standard Corpus of Present-Day Edited American English, for use with Digital Computers 1964, 1971, 1979. Compiled by W. N. Francis and H. Kučera. Brown University. Providence, Rhode Island.
Cenoz, Jasone. 2013. “Defining Multilingualism.” Annual Review of Applied Linguistics 33: 3–18.
COCA = Corpus of Contemporary American English. Compiled by Mark Davies (Brigham Young University). http://corpus.byu.edu/coca/.
Davies, Mark and Robert Fuchs. 2015a. “Expanding Horizons in the Study of World Englishes with the 1.9 Billion Word Global Web-based English Corpus (GloWbE).” English World-Wide 36/1: 1–28.
Davies, Mark and Robert Fuchs. 2015b. “A Reply.” English World-Wide 36/1: 45–47.
De Fina, Anna and Sabina Perrino. 2011. “Introduction: Interviews vs. ‘Natural’ Contexts: A False Dilemma.” Language in Society 40: 1–11.
Degani, Marta. 2010. “The Pakeha Myth of One New Zealand/Aotearoa. An Exploration in the Use of Maori Loanwords in New Zealand English.” In From International to Local English—and Back Again, edited by Roberta Facchinetti, David Crystal and Barbara Seidlhofer, 165–96. Frankfurt am Main: Peter Lang.
Degani, Marta. 2017. “Cultural Conceptualizations in Stories of Māori-English Bilinguals: The Cultural Schema of marae.” In Advances in Cultural Linguistics, edited by Farzad Sharifian, 661–82. Singapore: Springer.
Deverson, Tony. 1984. “‘Home Loans’: Maori Input into Current New Zealand English.” English in New Zealand 33: 4–10.
Deverson, Tony. 1991. “New Zealand English Lexis: The Maori Dimension.” English Today 26: 18–25.
enTenTen = The English TenTen Web Corpus. Compiled by Sketchengine. https://www.sketchengine.co.uk/ententen-corpus/.
Flowerdew, Lynne. 2004. “The Argument for Using English Specialised Corpora to Understand Academic and Professional Settings.” In Discourse in the Professions: Perspectives from Corpus Linguistics, edited by Ulla Connor and Thomas Upton, 11–33. Amsterdam: John Benjamins.
Geering, Elaine. 1993. “The Use of Maori in the Late Nineteenth-century Auckland Press.” In Of Pavlova, Poetry and Paradigms: Essays in Honour of Harry Orsman, edited by Laurie Bauer and Christine Franzen, 250–60. Wellington: Victoria University Press.
Georgakopoulou, Alexandra. 2006. “The Other Side of the Story: Towards a Narrative Analysis of Narratives-in-interaction.” Discourse Studies 8/2: 235–57.


Georgakopoulou, Alexandra. 2007. Small Stories, Interaction, and Identities. Amsterdam, Philadelphia: John Benjamins.
GloWbE = Corpus of Global Web-based English. Compiled by Mark Davies (Brigham Young University). http://corpus.byu.edu/glowbe/.
Gordon, Elizabeth and Tony Deverson. 1998. New Zealand English and English in New Zealand. Auckland: New House Publishers.
Harlow, Ray, Peter Keegan, Jeanette King, Margaret Maclagan and Catherine Watson. 2009. “The Changing Sound of the Māori Language.” In Variation in Indigenous Minority Languages, edited by James N. Stanford and Dennis R. Preston, 129–52. Amsterdam, Philadelphia: John Benjamins.
Harper, Douglas. 2002. “Talking about Pictures: A Case for Photo Elicitation.” Visual Studies 17/1: 13–26.
Harrison, Barbara. 2002. “Photographic Visions and Narrative Inquiry.” Narrative Inquiry 12/1: 87–111.
The Helsinki Corpus of English Texts (1991). Department of Modern Languages, University of Helsinki. Compiled by Matti Rissanen (Project leader), Merja Kytö (Project secretary); Leena Kahlas-Tarkka, Matti Kilpiö (Old English); Saara Nevanlinna, Irma Taavitsainen (Middle English); Terttu Nevalainen, Helena Raumolin-Brunberg (Early Modern English).
Holmes, Janet, Bernadette Vine and Gary Johnson. 1998. Guide to the Wellington Corpus of Spoken New Zealand English. Wellington: University of Wellington.
ICE = International Corpus of English. 1988–. Initiated by Sidney Greenbaum.
Ketelle, Diane. 2010. “The Ground They Walk on: Photography and Narrative Inquiry.” The Qualitative Report 15/3: 547–68.
King, Michael. 2003. The Penguin History of New Zealand. Rosedale, NZ: Penguin Books.
Koven, Michelle. 2011. “Comparing Stories Told in Sociolinguistic Interviews and Spontaneous Conversation.” Language in Society 40: 75–89.
Labov, William. 1972. Language in the Inner City. Philadelphia, PA: University of Pennsylvania Press.
LOB = The Lancaster-Oslo/Bergen Corpus of British English, For Use With Digital Computers. 1970–1978. Compiled by Geoffrey Leech, Lancaster University, Stig Johansson, University of Oslo (project leaders), and Knut Hofland, University of Bergen (head of computing).
Macalister, John. 2005. A Dictionary of Maori Words in New Zealand English. Oxford and New York: Oxford University Press.
Macalister, John. 2006. “The Maori Presence in the New Zealand English Lexicon, 1850–2000: Evidence from a Corpus-based Study.” English World-Wide 27: 1–24.
Macalister, John. 2008. “Tracking Changes in Familiarity with Borrowings from te reo Māori.” Te Reo 51: 75–97.


Mair, Christian. 2015. “Responses to Davies and Fuchs.” English World-Wide 36/1: 29–33.
Mead, Hirini Moko. 2003. Tikanga Māori: Living by Māori Values. Wellington: Huia Publishers.
Nelson, Gerald. 2002a. “ICE Markup Manual for Written Texts.” Available at http://ice-corpora.net/ice/manuals.htm (accessed 30 September 2016).
Nelson, Gerald. 2002b. “ICE Markup Manual for Spoken Texts.” Available at http://ice-corpora.net/ice/manuals.htm (accessed 30 September 2016).
Nelson, Gerald. 2015. “Responses to Davies and Fuchs.” English World-Wide 36/1: 38–40.
NOW = News on the Web Corpus. Compiled by Mark Davies (Brigham Young University). http://corpus.byu.edu/now/.
Onysko, Alexander. 2016a. “Enhanced Creativity in Bilinguals? Evidence from Meaning Interpretations of Novel Compounds.” International Journal of Bilingualism 20/3: 315–34.
Onysko, Alexander. 2016b. “Modeling World Englishes from the Perspective of Language Contact.” World Englishes 35/2: 196–220.
Onysko, Alexander and Andreea S. Calude. 2013. “Comparing the Usage of Māori Loans in Spoken and Written New Zealand English: A Case Study of Maori, Pakeha, and Kiwi.” In New Perspectives on Lexical Borrowing: Onomasiological, Methodological, and Phraseological Innovations, edited by Eline Zenner and Gitte Kristiansen, 143–70. Berlin and New York: Mouton de Gruyter.
Onysko, Alexander and Marta Degani. 2014. “Listening to a voice canoe: Differences in Meaning Association between Māori Bilingual and Pākehā Monolingual Speakers.” In He Hiringa, He Pūmanawa—Studies on the Māori Language: In Honour of Ray Harlow, edited by Alexander Onysko, Marta Degani and Jeanette King, 179–210. Wellington: Huia Publishers.
Orsman, Harry. 1997. The Dictionary of New Zealand English. Auckland: Oxford University Press.
Robinson, Dave. 2002. “Using Photographs to Elicit Narrative Accounts.” In Narrative, Memory and Life Transitions, edited by Kates Milnes, Brian Roberts and Christine Horrocks, 179–87. Huddersfield: University of Huddersfield Press.
Schiffrin, Deborah. 2009. “Crossing Boundaries: The Nexus of Time, Space, Person, and Place in Narrative.” Language in Society 38: 421–45.
Szakay, Anita. 2008. Ethnic Dialect Identification in New Zealand: The Role of Prosodic Cues. Saarbrücken: VDM Verlag Dr. Müller.
Vaughan, Elaine and Brian Clancy. 2013. “Small Corpora and Pragmatics.” In Yearbook of Corpus Linguistics and Pragmatics 2013: New Domains and Methodologies, edited by Jesús Romero-Trillo, 53–73. Dordrecht: Springer.
VOICE (2007), VOICE Transcription Conventions [2.1]. https://www.univie.ac.at/voice/documents/VOICE_mark-up_conventions_v2-1.pdf (accessed 4 September 2016).


VOICE = Vienna-Oxford International Corpus of English (v.2.0, 2013). Compiled by Barbara Seidlhofer et al. (University of Vienna). http://www.univie.ac.at/voice/page/index.php.
Wikipedia Corpus. Compiled by Mark Davies (Brigham Young University). http://corpus.byu.edu/wiki/.
WSC = Wellington Corpus of Spoken New Zealand English. Compiled by Janet Holmes, Bernadette Vine and Gary Johnson. http://www.victoria.ac.nz/lals/resources/corpora-default#wsc.
WWC = Wellington Corpus of Written New Zealand English. Compiled by Laurie Bauer et al. http://www.victoria.ac.nz/lals/resources/corpora-default#wwc.

CHAPTER 4

What Happens to Ongoing Change in Multilingual Settings? A Corpus Compiler’s Perspective on New Data and New Research Prospects

Mikko Laitinen

1 Introduction

This article investigates ongoing changes in core and emergent modal auxiliaries. Their recent history is well documented in the dominant English varieties, viz. British English (BrE) and American English (AmE) (Leech 2013). My focus is, however, on exploring how these changes are adopted in multilingual ELF (English as a Lingua Franca) settings, and this perspective is novel in various ways. With regard to ongoing change in general, previous approaches have primarily investigated the main standard varieties (Leech et al. 2009). Alternatively, studies of outer circle Englishes have emphasized other sociocultural elements more than the competencies of multilingual speakers (cf. the focus on colonial ties in Collins 2015). In addition, English in the expanding circle is typically only approached synchronically, without considering the historical embedding of the grammatical structures under investigation. Only a few studies so far have looked into ELF from a historical linguistic perspective, as an evolutionary product in the long diachronic chain of Englishes (Laitinen and Levin 2016; Laitinen 2016). As will be shown below, this state of affairs is brought about by the limited coverage of genres in the expanding circle English corpora. It means that despite the fact that a substantial bulk of English today consists of its use as a lingua franca resource by speakers/writers in multilingual settings, little is known of what happens to variability in such environments.

My approach extends multilingualism to include not only multilingual elements in corpora but also multilingual competencies and identities, and it calls for more work on developing new sources of corpus evidence of ELF. The new corpus resources could be used to challenge the dominating monolingual ideal, in which non-native speakers/writers are seen to be norm dependent, relying on native norms and conventions. Considerable attempts have already taken place to diversify this ideal, and they have led to debates on bridging the paradigm gap between second and learner language use (e.g. Mukherjee and Hundt 2011; Gilquin 2015; Edwards and Laporte 2015) on the one hand, and to charting ELF uses on the other (Jenkins, Cogo and Dewey 2011; Seidlhofer 2011; Mauranen 2012). The ELF paradigm in particular has shifted the focus from deficiencies and deviations to investigating English use as part of multilingual repertoires. Much of this research has investigated spoken interaction, but one strand has looked into how synchronic variability in native English is reflected in both spoken and written ELF (Mauranen, Carey and Ranta 2015).

The research here incorporates a broader perspective on ELF, investigating how diachronic processes of change shape it and how multilingual speakers/writers adapt to ongoing change. It draws results from synchronic ELF corpora. However, similarly to Collins’ (2009) and Leech’s (2013) studies of inner and outer circle Englishes, it will be assumed that the patterns of synchronic variability in the channels (spoken–written) and genres can be used as a surrogate for time, which makes it possible to draw conclusions on how diachronic variability is adapted in ELF.

Section 2 discusses the broader relevance and the theoretical background. Section 3 introduces the compilation process of two new multi-genre ELF corpora. Section 4 presents a case study of core and emergent modals in ELF, comparing the results with those reported in the dominant English varieties. The results tackle the question of what happens to ongoing changes in multilingual ELF settings, and they show that ELF speakers/writers accentuate ongoing change.

© Koninklijke Brill NV, Leiden, 2017 | doi 10.1163/9789004276697_005

2 ELF and Language Change

The article does not ascertain the variety status of ELF. Instead, it observes gradual developments and frequency drifts, which have been shown to undergo substantial changes in highly standardized main varieties (Leech et al. 2009, 268–70). According to the principles of quantitative corpus linguistics operating on frequency variables (Grieve 2015), there is no reason to assume that such drifts would not take place in other forms of English, such as ELF.

Frequency data from ELF have broader relevance in English linguistics. They not only contribute to the debate on the role of non-native Englishes but also add to knowledge of variability, contact and change. New types of data should provide insights into (a) unification, diversification and (dialect) levelling processes, which have been established in colonial settings (Hickey 2004), and (b) whether ongoing change is accelerated or slowed down by bi-/multilingual speakers. The questions are intertwined. On the one hand, the traditional assumption of norm-dependency stems from the research in (post)-colonial contexts. The spread of English is typically characterized by extraterritorial conservatism, in which contact means adult language learning that leads to simplification. Recently, Hundt (2009) has suggested a more complex typology of the outcomes in colonial settings, involving not only a true colonial lag, but also various scenarios, such as extraterritorial innovation, or truly divergent patterns. On the other hand, evidence from the earlier stages of the expansion of English suggests that the role of non-native speakers/writers is far from straightforward. As an illustration, studies on post-Norman conquest England point to acceleration of change as a result of multilingual influence. Blake (1992, 10) suggests that “the attempt by French people to speak English and at a later stage bilingualism would inevitably promote changes” (cf. Cheshire et al. 2011 on developments in present-day urban multiethnolects).

A real-time study, making use of multi-genre corpora, should shed light on the diffusion of change in multilingual settings. In its simplest form, this could be tested by investigating frequency variables, such as modal auxiliaries: if ELF speakers/writers consistently select the outgoing recessive forms, this will indicate that change is slowing down, and vice versa. Various forces could play a role, the first of which is adult language acquisition, mentioned above. As opposed to this, an approach drawing from social network theory (Granovetter 1973; Milroy and Milroy 1985) predicts that weak and often insignificant interpersonal ties, which are common at times of mobility, promote diffusion of innovations. It is assumed that individuals who adopt English as an additional resource alongside their L1s and engage in ELF communication have more mobility and subsequently also looser ties on average than the rest of the population and therefore act as agents of change (Laitinen et al., in press). Thus, influence by multilingual speakers could accelerate diffusion of change (cf. the Civil War effect in a primarily monolingual setting in Raumolin-Brunberg 1998, 367–68). That is, ELF speakers/writers with their multilingual identities could be leaders of linguistic change in all contexts.

3 Methodology: A Corpus Compiler’s Perspective

This section overviews some of the existing non-native English corpora vis-à-vis the methods used here. After establishing the need for new sources of evidence, it details the compilation process of two new multi-genre ELF corpora. It focuses on the sampling frame but also illustrates how the compilers deal with multilingual elements.


3.1 Observing Ongoing Change in Multi-genre Corpora

The case study below tests the usability of the methods from short-term diachronic corpus linguistics for the study of ELF. As discussed in Leech et al. (2009, 24–31), this approach in the Standard English setting makes use of corpora that have equidistant observation points and cover a comparable set of genres. The evidence has sometimes been drawn from the various Brown corpora that consist of 15 text types divided into four broad genres, i.e. academic, news, non-fiction and fiction. This division was not originally intended for diachronic investigations but has been made use of successfully in studying recent changes in the inner core varieties. Some studies have made use of the various International Corpus of English (ICE) siblings (Collins 2009), and increasingly also the Corpus of Historical American English (COHA) and other mega-corpora (Leech 2013). As is well known, all these are multi-genre sources, which make it possible to take the discourse situation into account in real-time studies.

As for non-native use, genre has played a less central role. For instance, the written learner corpus (ICLE) consists of one genre of argumentative and literature essays by advanced learners of English (Granger 2008). Since it is designed for interlanguage research, this one-genre approach has turned out to be sufficient (Callies 2015) but has obvious limitations in other approaches. The two spoken ELF corpora, i.e. VOICE (Vienna-Oxford International Corpus of English) and ELFA (English as a Lingua Franca in Academic Settings), contain material only from formal spoken interactions in institutional and professional contexts. VOICE for instance contains non-scripted face-to-face interactions in which English is used “as a common means of communication among speakers from different first-language backgrounds” (Seidlhofer 2011, 23).
The interactions represent a range of speech events, such as interviews, press conferences, service encounters and workshop discussions (Seidlhofer 2011, 23–24; for ELFA, see Mauranen 2012, 73–74). The only written ELF corpus (WrELFA by Anna Mauranen) consists of three text types in the academic genre, and offers a point of comparison for the spoken academic ELF data, but is not ideal for real-time comparisons that require access to a set of genres. Suffice it to say that since the existing non-native corpora have by and large been designed for synchronic studies, new multi-genre resources are needed. New corpora with a wide range of textual coverage on the written–spoken continuum should facilitate real-time studies of how grammatical structures undergoing changes are adapted. In general, evidence of textual variability enables determining direction of change in diachronic linguistics (Rissanen 2008).
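The frequency-based comparison across genres that this line of work relies on can be sketched in a few lines of Python. The following is a minimal illustration only, not the procedure used in the case study: the word lists for core and emergent modals, the function names `modal_profile` and `compare_genres`, and the per-million-word normalization are all assumptions introduced here, and the counts ignore part-of-speech ambiguity (e.g. nominal can or will, or going to followed by a place name).

```python
import re

# Illustrative (assumed) item lists; the chapter does not enumerate them here.
CORE_MODALS = {"will", "would", "can", "could", "may", "might",
               "shall", "should", "must"}
EMERGENT_MODALS = {
    "have to": r"\b(?:have|has|had|having) to\b",
    "need to": r"\b(?:need|needs|needed|needing) to\b",
    "want to": r"\b(?:want|wants|wanted|wanting) to\b",
    "be going to": r"\b(?:am|is|are|was|were|be|been|being) going to\b",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def per_million(count, total):
    """Normalize a raw count to a frequency per million words."""
    return count / total * 1_000_000 if total else 0.0


def modal_profile(text):
    """Return token count and normalized frequencies for one text sample."""
    tokens = tokenize(text)
    total = len(tokens)
    core = sum(1 for token in tokens if token in CORE_MODALS)
    joined = " ".join(tokens)  # search multi-word patterns on clean tokens
    emergent = sum(len(re.findall(pattern, joined))
                   for pattern in EMERGENT_MODALS.values())
    return {
        "tokens": total,
        "core_pmw": per_million(core, total),
        "emergent_pmw": per_million(emergent, total),
    }


def compare_genres(samples):
    """Map each genre label to its modal profile (samples: genre -> text)."""
    return {genre: modal_profile(text) for genre, text in samples.items()}


if __name__ == "__main__":
    demo = {
        "news": "The minister must respond and she will have to explain.",
        "blog": "I want to go but I need to work, so we are going to wait.",
    }
    for genre, profile in compare_genres(demo).items():
        print(genre, profile)
```

Normalizing to a common base keeps sub-corpora of different sizes comparable; with real data, a part-of-speech-tagged corpus would be needed to separate modal from non-modal uses of these forms.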


Ensuring empirical validity requires new multi-genre corpora, and these new corpus sources should ideally complement the existing sources of ELF evidence. In addition, they should include a broad range of written genres other than the informatively-oriented academic texts in WrELFA. There is no need to reinvent the wheel in corpus design since the current native- and second-language corpora offer a model that could be modified for non-native contexts. However, some modifications are needed because the arrival of electronically-mediated communication has led to the emergence of genres which were not even thought of in the 1960s–80s when many current corpora were designed. Written ELF today appears not only in print but also in various online genres, such as personal and professional blogs, and micro-blog (Twitter) messages. One alternative would have been to repeat the ICE design, as has been done by Edwards (2014) in the Dutch context.1 A benefit would have been to ensure comparability with the evidence drawn from the other ICE corpora, but a more important criterion was to take into account the diversity of non-native English uses. The corpus model presented below starts with the assumption that the new multi-genre corpora are based on a bottom-up style corpus design that sets out by looking at what types of genre exist rather than repeating a sampling frame originally intended for post-colonial settings.

3.2 Towards New Multi-genre ELF Corpora

To increase empirical validity, the author and his colleagues are currently compiling two multi-genre corpora of written non-native English texts (Laitinen and Levin 2016; Laitinen 2016). We focus on two Nordic countries, Sweden and Finland, where the role of English has undergone considerable changes in recent decades. Most importantly, the changes have been extensively documented in previous research (Taavitsainen and Pahta 2003; 2008 and Leppänen et al. 2011; Bolton and Meierkord 2013).
The two countries are not undergoing language shift, but the sociolinguistic situation could be characterized as urban multilingualism, in which English is used as an additional resource alongside the main languages primarily, but not exclusively, by younger generations who live in urban areas and work in white- and pink-collar professions. Even though we focus on a restricted geographic setting, the corpus work builds on the idea of replicability. The collection criteria should be repeatable elsewhere, and the collection parameters, textual division, and informant selection are suitable for a range of countries in which the role of English is undergoing change. One long-term objective is that the corpora could be repeated at 10–15-year intervals, thus giving additional diachronic depth to the study of non-native uses of English. Predicting and guessing what genres will live on is impossible, but if the corpus design is systematic, balanced, and representative at this stage, it increases the likelihood of replicability in the future.

1 The author collaborates with Edwards with the aim of testing what kinds of differences and similarities emerge in the quantitative and qualitative findings from two corpus design models.

Our working titles for the two corpora are SWE-CE, the Corpus of English texts in Sweden, and FIN-CE, the Corpus of English texts in Finland. They are systematically collected and large enough sources of baseline data to fulfil the requirement for empirical validity. Table 4.1 illustrates the key characteristics. We cover written texts, and the objective is that these corpora together with the already-existing spoken ELF corpora cover the continuum of spoken and written situations. The collection targets English in use as a resource alongside writers’ L1s. It is conditioned by individual needs to communicate globally and locally using English. The great majority of the texts are such that English is not the target of learning, but a resource. The only exception is fiction, which we are currently (autumn 2016) collecting in collaboration with teachers organizing creative writing courses in a few universities in the two countries. The rationale for including this genre is that fan fiction is an important arena of non-native writing (Leppänen 2012), but unfortunately such texts do not fulfil our need to identify the authors, and therefore we collect material from educational settings. The corpora will be small and tidy (Hundt and Leech 2012), and the target is c. 1.5 million words per corpus. The materials, except the fiction texts, have to be available or accessible primarily through open sources. During the compilation, all the informants are identified, and the compilers have knowledge of the extent to which the materials have been subjected to normative language checking by professional editors/translators and native speakers.

Table 4.1 Key characteristics in compilation

Characteristic            Description
Channel                   Primarily written texts
Variety                   English as an additional resource
Access and availability   Texts primarily from publicly-available sources
Identifiability           Single authored texts and authors identifiable
Comparability             Enables comparisons with the existing corpora used in studying the recent history of English
Time frame                Synchronic with most data from the 2010s onwards
Genre composition         Multi-genre design that facilitates diatopic approaches

Preference is given to texts that are not edited, but it is assumed that the more informationally oriented a text is, the more likely it is to have undergone some degree of language checking and collaborative effort. Published materials edited by native speakers are excluded.

The corpora take into account the diversity of genres, and our definition makes use of Biber’s (1988, 104–8) multidimensional analysis of textual variation, and we use dimension 1 as the basis for placing various texts in the matrix. This dimension, i.e. information density and exact content vs. interactional and generalized content, is used as a heuristic tool and is not yet empirically validated. Covering a wide array of genres enables real-time studies and makes it possible to compare the quantitative findings with the existing native English corpora from which much of the evidence of the recent history of English is drawn.

Figure 4.1 sets the sampling frame side-by-side with the spoken VOICE to illustrate the genre coverage. The academic component consists of master's-level theses written in disciplines other than language studies, and the target is c. 20% of the total number of tokens. The news component consists of news texts (reportage, editorials and reviews) from on-line news sources written in English by Finnish and Swedish journalists. We have confirmed with the publishers that the texts have not undergone normative intervention by native speakers, and we have spot-checked that the news stories are not international wire news by news agencies. The target size for the press component is c. 15% of the total. The blog components cover personal and professional blogs in which the writers blog about, for example, fashion, cooking and contemporary politics. The future objective is to divide the component so that professional (c. 15%) and personal blogs (c. 15%) can be searched separately. The fiction (c. 15% target) and the Tweet (c.
15%) components currently only contain material from individuals in Sweden.

3.3 Handling Multilingual Elements in the Corpora

The texts are written primarily in English, but they may contain occasional switches to other languages, as seen in (1) and (2).

(1) Only a few formalities remain until the foundation “Bergmangårdarna på Fårö” (The Ingmar Bergman Estate on Fårø) can use the buildings (SWE-CE, news, 2010)

(2) På Svenska I’m not entirely certain how I will shape my future in the digital world. This blog in English is one of the alternatives (SWE-CE, new media, 2007)

In such cases, the text-level coding indicates foreign elements, such as På Svenska (i.e. ‘in Swedish’). In the case of blog texts, the great majority of the authors produce primarily monolingual texts. Occasionally, the blog writers have bilingual entries, meaning for instance that an entry is written first in Swedish and then repeated in English, as is the case in (3). In such cases, only the English version is recorded.

(3) Åh, vilken mysig morgon. Ligger fortfarande i pyjamas och dricker kaffe direkt från nattduksbordet. Fönstret står öppet, vilket gör det ännu varmare och skönare under det fluffiga täcket. Nu börjar dock min mage kurrar efter en stor frukost, dessutom måste jag börja packa och bocka av ytterligare några grejer på att-göra-listan innan jag sätter mig på bussen hemåt till Blekinge. / Oh, what a cozy morning. I’m wearing pajamas and drinking coffee in bed. The window is open, which makes it even warmer and more comfortable in my fluffy blanket. But now my stomach growls after a big breakfast, and I also have to start packing before it’s time for me to take the bus back home to my dear family again. (SWE-CE, new media, 2014)

[Figure 4.1 Covering the informational-interactional continuum of genres. Labels recoverable from the figure: interactional focus; informational focus; spoken texts (e.g. VOICE); educational domain; professional domain; leisure domain; ELF; written texts (SWE/FIN-CE): micro-blogs (tweets), fiction, personal blogs, professional blogs, press, academic prose.]

The micro-blog (Tweet) component which has been tested is extremely heterogeneous; the main language of the Tweets is English, and we have discarded the occasional messages in the national/domestic languages. The illustration in (4) shows a sample as it appears in the source.

(4) I’m half Finnish so please tweet a little slower. Another languange lesson: “För att maximera din gastronomiska upplevelse” means “it tastes good”. Time for a family dinner so I’ll talk to you in a while. Watch this vine if you get borde while waiting. Another language lesson! Familjeläktare is Swedish for Batman. (SWE-CE, tweets, 2015)

The case study makes use of 1,023,196 words of spoken English in the VOICE corpus and a pilot version of 1,052,386 words drawn from both SWE/FIN-CE. The results are divided into spoken and written modes, with the written divided into three genres: (a) 456,910 running words (rw) of academic prose; (b) 165,608 rw of news texts; and (c) 429,868 rw of professional and personal blog texts by native Finnish and Swedish speakers.

4 English as Part of Multilingual Repertoire: New Research Prospects
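The arithmetic behind the pilot data and the normalized figures used in this section can be illustrated with a short sketch. It is only an illustration in Python, not the procedure used in the study (the chapter reports that the statistics were calculated using R). The helper name `per_million` and the raw count `hypothetical_hits` are assumptions introduced here; the corpus sizes and the two normalized ELF frequencies are the ones reported in the text.

```python
# Running-word counts of the written pilot components, as reported in the text.
WRITTEN_PILOT = {"academic prose": 456_910, "news": 165_608, "blogs": 429_868}


def per_million(raw_count: float, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per 1 million words."""
    return raw_count / corpus_size * 1_000_000


# The three written components add up to the reported pilot size.
written_total = sum(WRITTEN_PILOT.values())
print(written_total)  # 1052386

# Hypothetical raw count (NOT a figure from the study), normalized
# against the size of the news component.
hypothetical_hits = 500
print(round(per_million(hypothetical_hits, WRITTEN_PILOT["news"]), 1))

# Core/emergent ratio from the normalized ELF frequencies reported in Section 4.1.
core_pmw, emergent_pmw = 30_247, 9_880
ratio = round(core_pmw / emergent_pmw, 1)
print(ratio)  # 3.1
```

Normalizing to a common base of 1 million words is what allows components of different sizes, and the native-speaker reference corpora, to be set side by side.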

This study zooms in on what types of diversification take place in today’s multilingual environments. It tests what happens to ongoing change in multilingual settings, thus incorporating a broad perspective on multilingualism that takes into account multilingual competences and identities. It looks into broad quantitative patterns in core and emergent modals, which have been shown to be undergoing substantial recent changes (Leech 2003; Leech et al. 2009; Collins 2009; Leech 2013). They are related to the decline of core modal auxiliaries followed by a bare infinitive. According to Leech (2013, 100), the decline is more pronounced in AmE, which is “roughly one generation in advance of the British decline”. His data show a consistent decline in every single modal whereas in BrE two items, can and could, increase slightly in frequency (Leech 2013, 100). In addition, emergent modal elements, viz. grammaticalized modal idioms, which are semantically related to core modals, have increased, but not as fast as the core modals have decreased. Leech (2013, 107) suggests a “modality deficit” in present-day English. These trends are also supported by longer diachrony and the results from COHA (Leech 2013), and the evidence from spoken language suggests that these developments are changes from below as both the decline and the increase are steeper in spoken than in written texts (also Bowie, Wallis and Aarts 2013). Section 4.1 focuses on broad patterns and genre differences in written and spoken ELF, and 4.2 on the frequency-based order of individual modal types in the data. The results, normalized per 1 million words, are set side-by-side with Leech et al.’s (2009, 283, 286) findings from standard edited written English in the FLOB (BrE) and Frown (AmE) corpora and with Leech’s (2013, 112) results from the spoken demographic subcorpus of the British National Corpus (BNC) and the Longman Corpus of Spoken American English (LCSAE). The statistics have been calculated using R.

4.1 Frequencies of Core and Emergent Modals

Table 4.2 shows the normalized frequencies and the ratios of core and emergent modals, setting ELF side-by-side with BrE and AmE. In ELF, the core modal frequency is 30,247, and that of the emergent modals is 9,880, which results in the ratio (core/emergent) of 3.1. The ratios in the two native varieties, drawn from Leech (2013), are 2.5 in AmE and 3.6 in BrE. A contingency table test shows that the differences between non-native use and both of the native varieties are statistically highly significant (p