A Corpus-Driven Approach to Language Contact: Endangered Languages in a Comparative Perspective 9781614516576, 9781614517610

This book proposes a corpus-driven approach to language contact based on the study of endangered languages. Drawing on v


English Pages 260 Year 2016

Table of contents:
Preface
Table of contents
List of figures
List of tables
Abbreviations
1 Introduction
1.1 Contact linguistics
1.2 Language contact in endangered languages
1.3 Corpus-driven analysis of language contact
1.4 Overview of this book
2 Data collection and annotation
2.1 Data collection
2.2 The sample
2.3 Transcription and annotation
2.4 Corpus size
2.5 Corpus accessibility
3 Overall composition of a multilingual corpus
3.1 Background
3.1.1 Corpora with 0‒5% contact words
3.1.1.1 The Ixcatec-Spanish corpora
3.1.1.2 The Balkan Slavic-Greek corpora
3.1.1.3 The Colloquial Upper Sorbian- and the Burgenland Croatian-German corpora
3.1.2 Corpora with 20‒35% contact words
3.1.2.1 The Thrace Romani-Turkish-Greek and the Finnish Romani-Finnish corpora
3.1.2.2 The Molise Slavic-Italian corpora
3.2 Discussion
4 Borrowing or codeswitching?
4.1 Background
4.2 Degree of composition and flagging
4.2.1 The Balkan Slavic Nashta-Greek corpus
4.2.2 The Ixcatec-Spanish corpus
4.2.3 The Thrace Romani-Turkish-Greek corpus
4.2.4 The Finnish Romani-Finnish corpus
4.3 Word classes
4.3.1 The Ixcatec-Spanish corpus
4.3.2 The Balkan Slavic Nashta-Greek corpus
4.3.3 The Romani corpora
4.4 Lexical semantic fields
4.4.1 The Ixcatec-Spanish corpora
4.4.2 The Balkan Slavic-Greek corpora
4.4.3 The Thrace Romani-Turkish-Greek corpus
4.5 Regularity
4.5.1 The Ixcatec-Spanish corpora
4.5.2 The Balkan Slavic-Greek corpus
4.5.3 The Thrace Romani-Turkish-Greek corpus
4.5.4 The Finnish Romani-Finnish corpus
4.6 Discussion
5 Integration strategies
5.1 Background
5.2 Phonetics and phonology
5.3 Noun integration
5.3.1 The Ixcatec-Spanish corpus
5.3.2 The Romani-Turkish-Greek corpus
5.4 Verb integration
5.4.1 Light verb strategy
5.4.2 Indirect insertion
5.4.3 Paradigm transfer
5.5 Discussion
6 Inter-speaker variation
6.1 Background
6.2 Inter-speaker variation for contact words
6.2.1 The Ixcatec-Spanish corpus
6.2.2 The Balkan Slavic Nashta-Greek corpus
6.2.3 The Thrace Romani-Turkish-Greek corpus
6.2.4 The Finnish Romani-Finnish corpus
6.3 Inter-speaker variation for borrowing and codeswitching
6.3.1 The Slavic corpora
6.3.2 The Finnish Romani-Finnish corpus
6.4 Inter-speaker variation for borrowed nouns and verbs
6.4.1 The Ixcatec-Spanish corpus
6.4.2 The Slavic corpora
6.4.3 The Romani corpora
6.5 Discussion
7 Pattern replication
7.1 Background
7.2 The Balkan Slavic Nashta-Greek corpus
7.2.1 TMA markers
7.2.2 Phonetics
7.2.3 Articles
7.3 The Ixcatec-Spanish corpus
7.3.1 Articles
7.3.2 Clause-linking
7.3.3 Frames of reference
7.3.4 Word order in verbal clauses
7.4 The Thrace Romani-Turkish-Greek corpus
7.4.1 Prosody in wh- and polar questions
7.4.2 Articles
7.4.3 Verb morphology
7.4.4 Word order in noun phrases
7.5 Discussion
8 Information structure
8.1 Background
8.2 The Ixcatec-Spanish corpus
8.2.1 Prosody
8.2.2 Word order
8.2.3 Morphology
8.3 The Thrace Romani-Turkish-Greek corpus
8.3.1 Prosody
8.3.2 Word order
8.3.3 Morphology
8.4 Discussion
9 Contact settings
9.1 Background
9.2 The Balkan Slavic-Greek communities
9.2.1 Hrisa
9.2.2 Liti
9.3 The Ixcatec-Spanish community
9.4 The Thrace Romani-Turkish-Greek community
9.5 Discussion
9.5.1 An active bilingual community
9.5.2 Prescriptive attitudes and institutional support
9.5.3 Past contact settings
10 Concluding remarks
10.1 A scale of language mixing
10.2 Extra layers for a refined scale of language mixing
10.3 Types of contact phenomena and types of social settings
10.4 For a corpus-driven approach to language contact
References
Index of authors
Index of subjects and languages

Evangelia Adamou A Corpus-Driven Approach to Language Contact

Language Contact and Bilingualism

Editor Yaron Matras

Volume 12

Evangelia Adamou

A Corpus-Driven Approach to Language Contact: Endangered Languages in a Comparative Perspective

ISBN 978-1-61451-761-0
e-ISBN (PDF) 978-1-61451-657-6
e-ISBN (EPUB) 978-1-5015-0065-7
ISSN 2190-698X

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2016 Walter de Gruyter Inc., Boston/Berlin
Cover image: Anette Linnea Rasmus/Fotolia
Typesetting: RoyalStandard, Hong Kong
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany
www.degruyter.com

Preface

The present book offers an analysis of first-hand data from three unrelated languages: Balkan Slavic, Thrace Romani, and Ixcatec. At the basis of this work lies the collaboration with all of the speakers of these languages who generously shared their stories and accepted that their conversations be recorded. Many thanks to the Ixcatec speakers Cipriano Ramirez Guzmán, Rufina Robles, Juliana Salazar Bautista, and Pedro Salazar Gutierrez; and to Santa María Ixcatlán’s General Assembly for granting us permission to work on Ixcatec. Many thanks to all the Romani and Slavic speakers from Greece who have chosen to remain anonymous due to the complexity of the political context in the country and a special thank-you to Sabiha Suleiman and to Chrysoula Adamou for their precious assistance during fieldwork in Drosero and Liti respectively. The collection and analysis of these data were carried out as part of my research activities at the French National Centre for Scientific Research (CNRS). My research has also benefitted greatly from several externally-funded research programmes. The Balkan Slavic corpus of Nashta was created within a French-German research programme which I jointly led with Walter Breu, Electronic database of endangered Slavic varieties in non-Slavic speaking European countries (2010‒2012), with funding from the French National Research Agency and the Deutsche Forschungsgemeinschaft (ANR-09-FASHS-025 and DFG BR 1228-4-1). Research on Thrace Romani received support from the programme Towards a multi-level, typological and computer-assisted analysis of contact-induced language change (2010‒2014), funded by the French National Research Agency (ANR-09-JCJC-0121-01, P.I. Isabelle Léglise). Research on Ixcatec was conducted within the Ixcatec documentation programme (2010‒2013), funded by the Endangered Languages Documentation Programmes of the Hans Rausing Foundation (MDP 0214, P.I. Denis Costaouec).
The analysis of the data was then continued as part of the programme Designing spoken corpora for cross-linguistic research (2013‒2016), funded by the French National Research Agency (ANR-12-BSH2-0011, P.I. Amina Mettouchi). I also wish to acknowledge support from the programme Investments for the Future funded by the French National Research Agency (ANR-10-LABX-0083). Some of the studies reported on here have been conducted in collaboration with other scholars and have resulted in joint conference papers and publications which are cited throughout the book. Specifically, the comparison of the Balkan Slavic data with other Slavic minority languages was possible through collaboration with Walter Breu, Georges Drettas, and Lenka Scholze. The comparison of the Thrace Romani data with the Finnish Romani data is part of collaborative research with Kimmo Granqvist. Research on Romani phonetics and prosody is part of a collaboration with Amalia Arvaniti. Also, part of the
Ixcatec data which was taken into consideration in this book was kindly shared with me by Denis Costaouec. Over the years, I benefited from discussions with Walter Breu, Claudine Chamoreau, Denis Costaouec, Zygmunt Frajzyngier, Victor Friedman, Kimmo Granqvist, Isabelle Léglise, Yaron Matras, Felicity Meakins, Amina Mettouchi, Bettina Migge, Carol Myers-Scotton, Eva Schultze-Berndt, Stavros Skopeteas, Lameen Souag, and Anton Tenser. Special thanks are due to Maïa Ponsonnet, Claudia Wegener, and Stergios Chatzikyriakidis for their careful reading of some of the chapters of this book and to Margaret Dunham for editing my English. For technical support I would like to thank Mourad Aouini, Christian Chanard, Séverine Guillaume, and Pascal Vaillant. For the statistical analyses thanks are due to Rachel Chen and François Sermier and for the maps to Jérôme Picard. Many thanks to Elif Diviçioglu for advice on the analysis of the Turkish data, to Olivier Le Guen for insights on the analysis of the Ixcatec gesture and frames of reference, to Martine Toda and Yordanka Kozareva for their assistance with the Balkan Slavic corpus, to Claire Wolfarth for help with the Ixcatec corpus, and to Frida Cruz for assisting me with the non-verbal tasks in Ixcatlán. I am particularly grateful to the editor of this series, Yaron Matras, for his precious advice and encouragement during the editing process. Last, I wish to dedicate this book to my daughter Niki, who not only accompanied me during my fieldwork trips over the last ten years, but also actively participated in community life and assisted me with my research when possible. The reasons why she does not wish to study linguistics are of course not related in any way to these trips!


List of figures

Figure 1 Screen capture of a search on a file produced with Jaxe for Thrace Romani
Figure 2 Screen capture of an Elan-CorpA file for Ixcatec
Figure 3 The Slavic minority languages of the EuroSlav corpora
Figure 4 Screen capture of a file produced with ITE for Balkan Slavic Nashta
Figure 5 Interactions in the two Ixcatec-Spanish corpora
Figure 6 Two Ixcatec-Spanish corpora (8,807 words in total): Distribution of word-tokens with respect to language
Figure 7 Interactions in the Balkan Slavic Nashta corpus
Figure 8 Two Balkan Slavic-Greek corpora (9,235 words in total): Distribution of word-tokens with respect to language
Figure 9 Interactions in the Colloquial Upper Sorbian-German corpus
Figure 10 Interactions in the Burgenland Croatian corpus
Figure 11 Two Slavic-German corpora (8,012 words in total): Distribution of word-tokens with respect to language
Figure 12 Interactions in the Thrace Romani-Turkish-Greek corpus
Figure 13 The Thrace Romani-Turkish-Greek corpus (5,816 words in total): Distribution of word-tokens per language
Figure 14 Interactions in the Finnish Romani-Finnish corpus
Figure 15 The Finnish Romani-Finnish corpus (13,031 words in total): Distribution of word-tokens per language (adapted from Adamou and Granqvist 2014)
Figure 16 Interactions in the Molise Slavic-Italian corpus
Figure 17 The Molise Slavic-Italian corpus (17,279 words in total): Distribution of word-tokens per language
Figure 18 Distribution of word-tokens with respect to language for seven corpora
Figure 19 The Thrace Romani-Turkish-Greek corpus: Length of Turkish and Greek word-tokens
Figure 20 The Finnish Romani-Finnish corpus: Distribution of word-tokens per language in the Finnish-dominant and Romani-dominant clauses (adapted from Adamou and Granqvist 2014)
Figure 21 The Finnish Romani-Finnish corpus: Length of Finnish word-tokens in Romani-dominant clauses and of Romani tokens in Finnish-dominant clauses (adapted from Adamou and Granqvist 2014)
Figure 22 The Ixcatec-Spanish contemporary corpus: Distribution of nouns and verbs per language
Figure 23 The Balkan Slavic Nashta-Greek corpus: Distribution of nouns per language
Figure 24 Thrace Romani-Turkish-Greek corpus: Distribution of tokens per language and word class
Figure 25 Finnish Romani-Finnish corpus: Distribution of tokens per language and word class (adapted from Adamou and Granqvist 2014)
Figure 26 Token frequency and diffusion of borrowed nouns across speakers for the Thrace Romani, Ixcatec, and Balkan Slavic corpus
Figure 27 The Ixcatec corpora: Frequency of Spanish nouns occurring more than twice or in both corpora
Figure 28 The Ixcatec corpus: Frequency of Spanish verbs in the corpus of the 1950s
Figure 29 The Thrace Romani corpus: Frequency of Turkish nouns (part 1)
Figure 30 The Thrace Romani corpus: Frequency of Turkish nouns (part 2)
Figure 31 The Thrace Romani corpus: Frequency of Turkish verbs
Figure 32 The Thrace Romani-Turkish-Greek corpus: origin of words preceding Turkish verbs (adapted from Adamou and Granqvist 2014)
Figure 33 The Finnish Romani-Finnish corpus: origin of words preceding and following Finnish verbs (adapted from Adamou and Granqvist 2014)
Figure 34 The Ixcatec-Spanish contemporary corpus: interspeaker variation of current-contact language word-tokens
Figure 35 The Balkan Slavic Nashta-Greek corpus: Inter-speaker variation of current-contact language word-tokens
Figure 36 Map of the area of Thrace, Greece
Figure 37 The Thrace Romani-Turkish-Greek corpus: Distribution of word-tokens per language for 10 speakers, with respect to location
Figure 38 The Thrace Romani-Turkish-Greek corpus: Distribution of word-tokens per language for 10 speakers, with respect to language shift
Figure 39 The Thrace Romani-Turkish-Greek corpus: Distribution of word-tokens per language for 10 speakers, with respect to age
Figure 40 The Thrace Romani-Turkish-Greek corpus: Distribution of word-tokens per language for 10 speakers, with respect to family and peers
Figure 41 The Finnish Romani-Finnish corpus: Distribution of language mixing among speakers (adapted from Adamou and Granqvist 2014)
Figure 42 The Slavic corpora: Rates of borrowings with respect to language (Adamou et al. 2015)
Figure 43 The Slavic corpora: Rates of borrowings with respect to location (Adamou et al. in press 2016)
Figure 44 The Balkan Slavic-Greek corpus: Inter-speaker variation for borrowings
Figure 45 The Burgenland Croatian-German corpus: Inter-speaker variation for borrowings
Figure 46 The Colloquial Upper Sorbian-German corpus: Inter-speaker variation for borrowings
Figure 47 The Molise Slavic-Italian corpus: Inter-speaker variation for borrowings
Figure 48 The Slavic corpora: Conditional Inference Recursive Partitioning Tree for borrowings
Figure 49 The Slavic corpora: Rates of borrowings with respect to language and sex
Figure 50 The Slavic corpora: Rates of borrowings with respect to language and age group
Figure 51 The Slavic corpora: Correlation between borrowing and codeswitching
Figure 52 The Finnish Romani-Finnish corpus: Distribution of Finnish borrowings among speakers (based on unpublished data from Adamou and Granqvist 2014)
Figure 53 The Finnish Romani-Finnish corpus: Distribution of Finnish codeswitching insertions among speakers (based on unpublished data from Adamou and Granqvist 2014)
Figure 54 The Ixcatec-Spanish contemporary corpus: Distribution of borrowed nouns per language for four speakers
Figure 55 The Ixcatec-Spanish corpora: Distribution of borrowed nouns for the contemporary Ixcatec-Spanish corpus and the Ixcatec-Spanish corpus of the 1950s
Figure 56 The Ixcatec-Spanish corpus of the 1950s: Distribution of nouns with respect to language
Figure 57 The Ixcatec-Spanish corpus of the 1950s: Distribution of verbs with respect to language
Figure 58 The Slavic corpora: Rates of borrowed nouns with respect to language (Adamou et al. 2015)
Figure 59 The Slavic corpora: Rates of borrowed nouns with respect to location
Figure 60 The Balkan Slavic-Greek corpus: Inter-speaker variation for borrowed nouns
Figure 61 The Colloquial Upper Sorbian-German corpus: Inter-speaker variation for borrowed nouns
Figure 62 The Burgenland Croatian-German corpus: Inter-speaker variation for borrowed nouns
Figure 63 The Molise Slavic-Italian corpus: Inter-speaker variation for borrowed nouns
Figure 64 The Slavic corpora: Conditional Inference Recursive Partitioning Tree for borrowed nouns
Figure 65 The Slavic corpora: Rates of borrowed nouns with respect to language and sex
Figure 66 The Slavic corpora: Rates of borrowed nouns with respect to language and age group
Figure 67 The Thrace Romani-Turkish-Greek corpus: Distribution of nouns per language in Thrace Romani for four speakers (adapted from Adamou and Granqvist 2014)
Figure 68 The Thrace Romani-Turkish-Greek corpus: Distribution of verbs per language in Thrace Romani for four speakers (adapted from Adamou and Granqvist 2014)
Figure 69 The Finnish Romani-Finnish corpus: Distribution of nouns per language for three speakers (adapted from Adamou and Granqvist 2014)
Figure 70 The Finnish Romani-Finnish corpus: Distribution of verbs per language for three speakers (adapted Adamou and Granqvist 2014)
Figure 71 Mean F1 and F2 for the Balkan Slavic Nashta stressed (in grey) and unstressed (in black) vowels based on 455 tokens produced in the spontaneous speech of a female speaker
Figure 72 Variability of F1 and F2 values of stressed vowels in Nashta produced in the spontaneous speech of a female speaker
Figure 73 Variability of F1 and F2 values of unstressed vowels in Nashta produced in the spontaneous speech of a female speaker
Figure 74 Co-speech gesture in Ixcatec for ‘left’
Figure 75 Co-speech gesture in Ixcatec for the left-right axis
Figure 76 Pitch track of a wh-question in Standard Modern Greek (Tsiplakou et al. 2011)
Figure 77 Pitch track of a wh-question in Thrace Romani
Figure 78 Pitch track of the polar question ‘Do you need help?’ in Standard Modern Greek (Tsiplakou et al. 2011)
Figure 79 Pitch track of a polar question in Thrace Romani
Figure 80 Pitch track of a Greek polar question enunciated by a Thrace Romani speaker
Figure 81 Ixcatec: Pitch track of the word ‘three’ in isolation
Figure 82 Ixcatec: Pitch track of the word ‘three’ under corrective focus
Figure 83 Ixcatec: Pitch track of the word ‘three’ combined with the focus particle
Figure 84 Thrace Romani: Pitch track of a spontaneous example with focus on naj ‘is not’ (adapted from Arvaniti and Adamou 2011: 243)
Figure 85 Pitch track of the word erzaˈnava ‘pharmacy’ in Thrace Romani
Figure 86 Pitch track of the word erˈzanava ‘pharmacy’ under focus with stress-shift in Thrace Romani
Figure 87 Focus in Thrace Romani: SV order and prosodic marking (adapted from Arvaniti and Adamou 2011: 244)
Figure 88 Focus in Thrace Romani: PV order and prosodic marking (adapted from Arvaniti and Adamou 2011: 243)
Figure 89 Pitch track illustrating the Turkish focus-sensitive particle da and the Turkish numeral classifier tane in Thrace Romani
Figure 90 Hrisa, Greece: Language domains in 1976 (based on Drettas 1981)
Figure 91 Hrisa, Greece: Family of a Slavic speaking household (adapted from Drettas 1981: 153)
Figure 92 Hrisa, Greece: Families of two Slavic speaking households (adapted from Drettas 1981: 153)
Figure 93 Map of the area of Thessaloniki indicating the village of Liti (Ajvati), 1903
Figure 94 A female Nashta speaker’s social network during the late Ottoman period and early times of the Greek state
Figure 95 A female Greek-Nashta speaker’s social network in the second half of the twentieth century
Figure 96 Balkan Slavic Nashta: Language domains in the end of the Ottoman era
Figure 97 Map of Santa María Ixcatlán, Mexico
Figure 98 A Romani speaker’s interactions in various language domains
Figure 99 A Romani female speaker’s social network

List of tables

Table 1 The Ixcatec-Spanish corpora: Lexical semantic fields of Spanish nouns and borrowed score in WOLD
Table 2 The Balkan Slavic Nashta corpus: Lexical semantic fields of Greek nouns and borrowed score in WOLD
Table 3 The Thrace Romani corpus: Lexical semantic fields of Turkish nouns and borrowed score in WOLD
Table 4 The Thrace Romani corpus: Lexical semantic fields of Turkish nouns and borrowed score in WOLD
Table 5 Finnish and Romani conjunctions (Granqvist 2000)
Table 6 Case assignment of Greek words in Romani
Table 7 Gender assignment in Thrace Romani
Table 8 Thrace Romani: distribution of NPs with a non-borrowed noun
Table 9 Thrace Romani: distribution of NPs with a Romani determiner
Table 10 Romani and Greek articles
Table 11 Nashta. Perfective and imperfective past of loan verbs from Greek in second plural form
Table 12 Romani and Turkish verbs in Thrace Romani
Table 13 The Thrace Romani corpus: Frequency of Turkish and Romani verbs with respect to person
Table 14 The Thrace Romani corpus: Frequency of Turkish TMA markers with Turkish verbs
Table 15 The Thrace Romani corpus: Distribution of word-tokens per speaker and language
Table 16 The sample of the EuroSlav corpus
Table 17 The Slavic data
Table 18 Verb morphology in Balkan Slavic Nashta and Greek (Adamou 2012a: 155)
Table 19 Distribution of completive clauses with la in Ixcatec (Adamou and Costaouec 2013)
Table 20 The relational terms in Ixcatec
Table 21 Word order in Romani, Turkish, and Greek noun phrases
Table 22 Absolute number of occurrences of S with respect to V (realized within the same prosodic unit)
Table 23 Absolute number of occurrences of P with respect to V (realized within the same prosodic unit)
Table 24 Focus marking strategies between Thrace Romani and the contact languages, Turkish and Greek
Table 25 Focus marking strategies between Ixcatec and Mexican Spanish
Table 26 Type of bilingual speech with respect to the overall rates of tokens from the languages in contact
Table 27 Type of bilingual speech for corpora with 20‒35% contact words
Table 28 Type of bilingual speech for corpora with 0‒5% contact words
Table 29 Extralinguistic factors with respect to rates of borrowings in the corpora

Abbreviations

Abbreviations follow the updated version of the Leipzig glossing rules elaborated within the French National Agency for Research (ANR) programme Designing Spoken Corpora for cross-linguistic research in collaboration with the Max Planck Institute at Leipzig (see http://cortypo.huma-num.fr/resources.html). The abbreviations found in the examples of other authors were kept unchanged.

ABL Ablative case
ACC Accusative case
ADJ Adjective
ANT Anterior, past before past
ANTIP Antipassive voice
AOR Aorist
APPL Applicative
ART Article
AUX Auxiliary
CAUS Causative voice
CLS Class
CLF / CL Classifier
CO Cross-reference morpheme
COMP Complementizer
DAT Dative case
DEF Definite
DEM Demonstrative
DIR Directional
DIST Distal
EVD Evidential
EXCL Exclusive
EXCM Exclamation
EXS Existential
F Feminine gender
FOC Focus
FUT Future
GEN Genitive
IMP Imperative
IMPF Imperfect
INCL Inclusive
INDF Indefinite
INS Instrumental
INTJ Interjection
IPFV Imperfective
IPRF Imperfect
IRR Irrealis
ITER Iterative
LOC Locative case
M Masculine gender
MID Middle
N Neuter gender
NEG Negative
NOM Nominative
NP Proper Noun (or Noun phrase in the text)
NUM Numeral
OBL Oblique case
PART Partitive case
PFV Perfective
PL Plural
POSS Possessive
PRET Preterite
PRF Perfect
PROG Progressive
PRS Present
PST Past
PTCP Participle
PTL Particle
Q General tag for interrogative marker
REFL Reflexive
REL Relative
SG Singular
SUB Subordinator
TMA Tense, mood, aspect
V Verb

Chapter 1

Introduction

1.1 Contact linguistics

Beginning in the second half of the twentieth century, language contact phenomena have increasingly attracted the attention of linguists. Enormous progress has been made in our understanding of the ways in which speakers may use different languages, either by reserving them for specific settings or interlocutors, or by alternating between them in one setting and with one interlocutor. Research in contact linguistics has also shed light on the effects that the combination of two or more languages may have on the linguistic structures of each language. Under some circumstances, language contact may completely modify the original typology of the contact languages or even lead to the genesis of a new, mixed language. In the tradition of Weinreich (1953) and Haugen (1953), the multilingual speaker is regarded as the locus of language contact. In this perspective, scholars suggested several types of relations between language contact outcomes and the type of society or social activities they are produced in. For example, an attempt to correlate the intensity of contact and the language contact outcomes led to the proposal of a borrowing scale ranging from “casual contact” and lexical borrowing, to “very strong cultural pressure” and heavy structural borrowing (Thomason and Kaufman 1988: 74‒75). For the authors:

“It is the sociolinguistic history of the speakers [...] that is the primary determinant of the linguistic outcome of language contact. Purely linguistic considerations are relevant but strictly secondary overall” (Thomason and Kaufman 1988: 35).

Similarly, Loveday (1996), Aikhenvald and Dixon (2007), and Trudgill (2008), take into consideration the external factors which facilitate the diﬀusion of linguistic features in various language contact settings. Recently, Stell and Yakpo (2015) suggest that a systematic look for language-external regularities is the best way to proceed in the study of language contact and more speciﬁcally codeswitching. While not denying the importance of external factors, several authors have also stressed the central role of cognitive parameters with respect to the results of language contact. In this perspective, Myers-Scotton (1993a) has argued that the similarities in contact phenomena are due to the similar underlying cognitive processes which operate in the heads of the multilingual speakers. According to

Myers-Scotton (1993a), speakers collaborate to form a matrix language system in performance, which, in turn, determines the nature of the contact phenomena. Matras (2009) also re-centres language contact in speech production, both in language processing and communication goals. In this approach, the multilingual speaker is the locus of repertoires: entire repertoires or speciﬁc parts of them are associated with particular social activities and are regulated by the prescriptive attitudes of the speech community. According to the author, two main mechanisms are at play in language contact. The ﬁrst is linked to the borrowability of several items which are more adapted to a given social activity, or occur in intense communicative negotiation and thus block the speaker’s repertoire selection mechanism, i.e., discourse operators. The second mechanism concerns linguistic structures which show a systematic tendency towards convergence. For Matras (2007) sociolinguistic factors are relevant, yet, secondary; they do not trigger but rather license the speakers to “dismantle the mental demarcation boundaries that separate their individual languages” (Matras 2007: 68). In other words, both the loyalty to the repertoires and the desire to exploit the wealth of the repertoires are determined by the reduction of the hurdles in the way of the most eﬃcient communication (Matras 2009: 4‒5). More speciﬁcally, the sources of borrowing are to be found in “the need to reduce the cognitive load” (Matras 2007: 67; following Matras 1998). In this perspective, all languages (or repertoires) are activated for a multilingual speaker, as evidenced by the numerous speech-errors encountered in the repertoire selection even among the most skilled multilingual speakers and despite them being contrary to their local communicative goals. Myers-Scotton and Matras stress the relevance of linguistic and psycholinguistic factors based on natural, ecologically-valid data. 
In contrast, psycholinguists attempt to demonstrate the nature of cognitive control in bilingual speech through experimental data (see Ingvalson, Ettlinger, and Wong 2014 for a review of the recent literature on this topic). Studies in contact linguistics have also focused on typological factors which may shape the language contact outcomes. Poplack and D. Sankoﬀ (1988) suggest that languages with similar typological features will more likely produce alternational codeswitching, while typologically distinct languages will produce insertional codeswitching. Muysken (2000) formulates the hypothesis that if the matrix language is agglutinating, then codeswitching will be of the insertional type; if both languages are ﬂectional, then codeswitching will be alternational. Also, Field (2002) expresses the idea that an isolating language will not borrow the morphology of a fusional or agglutinative language, that an agglutinative

language will only borrow agglutinative morphology, and that a fusional language will borrow both agglutinative and isolating morphology. It is also widely accepted that structural, formal, and functional similarities may play a facilitating role for some language contact phenomena to occur. For example, Weinreich (1953) notes that for a transfer of inﬂectional morphemes to take place, a systemic equivalence between the two languages is needed. Similarly, Winford (2003) underlines that typological distance of languages in contact determines the linguistic results to a great degree. Last, there is increasing awareness of the usefulness of the study of language contact for linguistic theory. This approach to language contact is considered to be “top-down” as it strives to predict which language elements will be most susceptible to contact-induced change (Benmamoun, Montrul, and Polinsky 2013). Sorace (2011) for example develops the Interface Hypothesis, suggesting that language contact phenomena are more likely to occur for linguistic structures made up of more than one component. This book takes the stance that all of the above mentioned approaches oﬀer invaluable insights for understanding language contact. Such advances have informed the “bottom-up” and, more speciﬁcally, “corpus-driven” approach taken in this book by looking at both the linguistic and the social factors at play. Also, although the main focus is on spontaneous data, experimental data produced in controlled environments are regarded as complementary as they enable in-depth analysis of linguistic phenomena which may occur only rarely in spontaneous discourse.

1.2 Language contact in endangered languages

Language contact may occur for individual multilingual speakers who are members of heterogeneous networks, as is the case of immigrants or expatriates. This is a relatively well-described language contact situation, often related to studies of language acquisition and language attrition, i.e., language loss at the level of the individual; see among others Montrul (2008). Language contact may also occur for individual multilingual speakers who are members of heterogeneous, complex networks, composed of speakers who to some extent share similar multilingual profiles. This is the case of immigrants who are part of a larger immigrant community; see for example the case of Spanish-speaking communities in the United States of America (Poplack 1980; Silva-Corvalán 1986) and of Turkish-speaking communities in the Netherlands (Backus 1992). This is also the case for members of communities with long-term bilingualism; see for example Treffers-Daller (1994) for French-Dutch bilinguals

in Belgium and Poplack and Dion (2012) for English-French bilinguals in Canada. The focus in this book is on a diﬀerent type of bilingualism or multilingualism, which takes place in largely – although not entirely – homogenous linguistic networks. More speciﬁcally, this book deals with a number of bilingual or multilingual settings involving an oral tradition language that is no longer transmitted to younger generations because of a preference for a dominant everyday language which is also an oﬃcial language, with literary written traditions, schooling, and active media. This setting is often encountered in modern states where most speakers of a minority language have everyday contact with a majority language. Such settings are characterized by deep imbalance between minority languages and the language toward which the shift is taking place. The dominant languages are used in most communication settings, including in the public sphere and power-related domains. The minority languages in contrast, because of their use in only a limited number of interactions, tend to be more strongly aﬀected by language contact and range among the most endangered languages. Hitherto, minority languages mostly came into contact with other languages in rural areas, but this is changing with increasing urbanization.1 Indeed, relatively large communities of speakers of minority languages are frequently to be found in urban settings where they may form solid networks. This is the case for example for many Romani dialects in Europe or in the Americas. Moreover, some native Mexican languages now have communities of speakers in the major cities of the United States of America. The goal of this book is to promote the study of endangered languages in the ﬁeld of contact linguistics. Recent estimations of language diversity indicate that 300 out of 6,000 languages are spoken by just half the world’s population. 
It is also estimated that in the next hundred years half of the world’s languages will disappear.2 In other words, within the next hundred years the people who had traditionally spoken these languages will have all shifted to another language. Endangered languages are thus particularly relevant to the study of language contact not only because they are so frequently encountered, but also because they offer types of language contact which have been little investigated. Languages can be “endangered” in varying degrees and several measures of language endangerment have been proposed in the literature. The classification adopted in this book follows the Krauss (2006) endangerment scale. A language ranked A+ has more than a million speakers or is an official state language. A language is A, “stable”, when it is spoken by children and adults. It is A– when spoken in some localities including by children, also termed “unstable”. A language is ranked B in the endangerment scale when spoken only by the parental generation and up, and C when spoken by the grandparental generation and up. A language is “critically endangered”, or D, when it is spoken by the great-grandparental generation or by a very small number of people. Last, a language is “extinct” (E category) when there are no known speakers. The languages discussed in this book belong to the endangerment categories A– through D.

1 See http://www.un.org/en/development/desa/population/publications/urbanization/urban-rural.shtml
2 See http://www.unesco.org/new/en/culture/themes/endangered-languages/
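The endangerment scale summarised above can be written down as an ordered lookup, which makes the scope of the book (categories A– through D) easy to check. The following is only a minimal sketch: the names `KRAUSS_SCALE`, `RANKS`, and `in_scope` are hypothetical, the descriptions paraphrase the text rather than Krauss (2006) itself, and the ASCII label "A-" stands in for "A–".

```python
# Ordered from least to most endangered, following the summary in the text.
KRAUSS_SCALE = [
    ("A+", "more than a million speakers, or an official state language"),
    ("A", "stable: spoken by children and adults"),
    ("A-", "unstable: spoken in some localities, including by children"),
    ("B", "spoken only by the parental generation and up"),
    ("C", "spoken by the grandparental generation and up"),
    ("D", "critically endangered: great-grandparental generation or very few speakers"),
    ("E", "extinct: no known speakers"),
]

RANKS = [rank for rank, _ in KRAUSS_SCALE]


def in_scope(rank, least="A-", most="D"):
    """Return True if `rank` lies between `least` and `most` on the scale (inclusive)."""
    position = RANKS.index(rank)  # raises ValueError for an unknown rank
    return RANKS.index(least) <= position <= RANKS.index(most)


if __name__ == "__main__":
    for rank, description in KRAUSS_SCALE:
        print(rank, in_scope(rank), description)
```

Keeping the ranks in a list rather than a dictionary preserves their order, so a range such as "A– through D" reduces to a comparison of positions.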

1.3 Corpus-driven analysis of language contact

As discussed in the previous sections, language contact has been addressed through a variety of perspectives. One can distinguish two major trends: frameworks which focus on the study of language contact in synchrony, and frameworks which approach language contact from a historical perspective.

The study of language contact from a synchronic viewpoint aims at the examination of the constraints and social significance of bilingual speech. Such studies often rely on the analysis of rich spontaneous oral corpora, traditionally based on interviews and in-group conversations or elicitation of semi-spontaneous speech through visual stimuli (Poplack 1980; Myers-Scotton 1993b; Silva-Corvalán 1994; Gardner-Chloros 2009). Over the past several years, data collection has increasingly relied on crowdsourcing through a variety of smartphone applications, and informal productions such as text messages (SMS) and chat conversations are more and more frequently taken into consideration. We note, however, that most researchers exploiting such multilingual corpora have worked on widely spoken languages, such as English, French, Spanish, Arabic, and Chinese, or on national and official languages such as Swahili, Dutch, Turkish, Greek, or Guaraní.

In contrast, lesser-known languages have more frequently been examined from a historical perspective based on descriptive studies. Indeed, for linguists working on lesser-known and endangered languages, the production of a grammatical description and a dictionary is a high priority. A collection of texts is generally associated with this work, sometimes combined with access to the audio or video recordings. However, these texts are typically analysed in a corpus-illustrated approach with very little quantitative analysis.
Some typological generalisations for language contact have been drawn from collating information in language descriptions (Thomason and Kaufman 1988; Thomason 2001; Winford 2003; Heine and Kuteva 2005; Wohlgemuth 2009),

or using speciﬁcally-designed questionnaires (Matras and Sakel 2007; Matras 2009; Haspelmath and Tadmor 2009a; Haspelmath and Tadmor 2009b; for Pidgin and Creole languages Michaelis et al. 2013). But rarely have there been quantitative, corpus-based studies of under-described languages which explore language contact in synchrony. In an attempt to bridge this gap, this book proposes an integrated method that could serve for the quantitative exploitation of free-speech multilingual corpora from lesser-known languages. It is inspired by the corpus-anchored tradition of language contact studies, but has been adapted to the speciﬁcities of endangered languages. The advantages of this approach are twofold. First, the study of lesser-described languages contributes to our understanding of language contact by broadening the sample and providing greater diversity in the study of contact settings. Second, the analysis of spontaneous, oral corpora establishes a solid empirical basis for the study of language contact phenomena from a cross-linguistic perspective.

1.4 Overview of this book

This book offers an analysis of multilingual corpora from lesser-known and endangered languages. Given that most corpus studies are based on majority languages, with well-established written traditions, a corpus-based analysis of oral tradition languages is a very exciting undertaking which must nevertheless be grounded in slightly adjusted theoretical and methodological frameworks. Chapter 2 thus discusses the challenges related to data collection and annotation for endangered languages.

Chapter 3 proposes an analysis based on the overall rates of other-language word-tokens in spontaneous speech. This analysis shows that some languages have very low proportions of words from the current-contact language, while others have significantly higher proportions of contact words.

In Chapter 4, I discuss the borrowing/codeswitching distinction, a question that has been addressed among others by Haugen (1950), Myers-Scotton (1993), Poplack (2004), Matras (2009), and Poplack and Dion (2012). Several criteria are applied to the study of the contact words found in the various corpora: degree of composition, word classes, lexical semantic fields, and regularity of an item. The degree of composition makes it possible to establish a distinction between corpora containing a majority of (lengthy) alternational codeswitching or single-word insertions. “Flagging” of contact material, through hesitations or other meta-linguistic commentary, may also characterize codeswitching insertions as

opposed to borrowings. The corpora are also examined from the viewpoint of word classes (for a discussion on the comparability of word classes across the languages see Evans 2000; Haspelmath 2012). This allows for a correlation between the overall rates of current-contact language tokens, analysed in Chapter 3, and the variety of word classes to be found in a given corpus. A classification of the contact words with respect to lexical semantic fields follows the database Loanwords in the World’s Languages (WOLD) (Haspelmath and Tadmor 2009b). Comparison of contact words from the free-speech corpora under study with the age and borrowed scores in the WOLD questionnaires illustrates the compatibility of the two methods. Finally, the criterion of regularity of contact words is applied, confirming the limits of this method for lexical items but revealing its interest for function words and discourse particles.

Chapter 5 examines the borrowing integration strategies used in each language community at the level of phonetics and phonology, syntax, and morphology for nouns and verbs. The study shows clearly that bilingual speakers follow the patterns of integration or non-integration established in their language communities.

Chapter 6 is dedicated to the study of inter-speaker variation based on the variationist framework (Labov 1971; D. Sankoff 1988; Poplack 1993). This approach allows us to check for individual patterns and relate them to social constraints, such as age, sex/gender, and location.

Chapter 7 examines pattern replication, namely language contact effects at the level of structures and functions. Several analyses are presented for prosody and phonetics, word order, articles, tense, mood, and aspect markers, clause linking, and frames of reference.
Comparison of the extent of pattern replication with the contact-word rates established in Chapter 3 shows that there is no direct correlation between the two; a bilingual corpus may show very few contact words but signiﬁcant pattern replication. Chapter 8 is dedicated to information structure, a domain which combines the study of prosody, syntax, and morphology. Information structure and the expression of focus in particular, are closely related to strong communicative tension and are expected to show signiﬁcant contact eﬀects (Matras 2009). Despite being of interest, the domain remains under-explored in lesser-known languages, due to the complexity of the parameters and methodological diﬃculties. Several preliminary studies are presented for two of the languages examined in this book, Ixcatec and Thrace Romani. Chapter 9 looks at relations between the language contact results analysed in the preceding chapters and the types of social networks that prevail in the language communities under study (Milroy and Margrain 1980; Milroy 2002).

The diﬀerences and similarities found in the overall proportions of contact words extracted from the corpora of this study appear to be linked to the duration and the contact type of the bilingual communities prior to the shift. Lastly, Chapter 10 discusses a scale of bilingual speech corpora which can serve as a basis for cross-linguistic comparisons and summarizes the ﬁndings of the quantitative approach with respect to the literature on language contact.

Chapter 2

Data collection and annotation

2.1 Data collection

Descriptive linguists have traditionally relied on the study of phenomena encountered in spontaneous texts in what could be called “corpus-illustrated” research. In the tradition of comparative linguistics, another popular method employed by linguists working on lesser-known languages is the use of typological and lexical questionnaires, inspired by the “Swadesh list”. More recently, this material has been complemented by elicitation with various types of visual stimuli and experimental tasks.

Under the impetus of so-called “documentary linguistics” (Himmelmann 1998), descriptive linguists are relying more and more on extended audio and video recordings for their studies. Indeed, in the past decades, several large foundations have made financing available for the production of language grammars and sound and video archives, e.g., the Endangered Languages Documentation Programme (funded by Arcadia and managed by the School of Oriental and African Studies, London, U.K.), the Volkswagen Foundation with the programme DoBeS (Dokumentation Bedrohter Sprachen), the Documenting Endangered Languages programme at the National Science Foundation (USA), and the Endangered Languages Foundation (USA). As a result, in the years to come, the quantitative exploitation of several-hour-long corpora of endangered languages is likely to become more frequent. Although the quantitative analysis of these texts is not yet widespread, there is a clear tendency in this direction (see among others Haig et al. 2011; Seifart et al. 2012). However, the variationist approach is poorly represented in the field of endangered languages; see for some exceptions the studies in Stanford and Preston (2009) and the project The Wellsprings of Linguistic Diversity, funded by the Australian Research Council (2014‒2019, P.I. Nicholas Evans).
The most adequate methodology for building multilingual corpora involving endangered languages seems to be the collection of spontaneous, oral data. The most common obstacle to documenting spontaneous bilingual speech involving an endangered language, however, is the desire of the last speakers to oﬀer the purest version of their language, consciously excluding fragments of the contact language. It is thus important to try to create the conditions in which the participants may feel comfortable enough to express themselves in what is, or at least used to be, their ordinary way of speaking. For example, during the language

documentation of Ixcatec, a critically endangered language of Mexico, once the goal of the sessions was well understood, i.e., simply that they use Ixcatec, the speakers became much more comfortable in the presence of the researchers and video camera ﬁlming them. One can hardly say, however, that these interactions were entirely spontaneous: were it not for the language documentation programme, everyone would have spoken Spanish. Another diﬃculty in building oral corpora arises from the fact that in spontaneous conversations, not all participants are equally active, depending on the topic of the discussion or on the speaker’s personality, gender, social class, education, and social networks. For example, for the Ixcatec language documentation programme, several meetings were organised between a ﬂuent female speaker and another female speaker whose language competence in Ixcatec could not be established prior to the meeting. During the sessions, the ﬂuent female speaker produced mostly monologues, punctuated by acquiescence from the other speaker. It is unclear whether the absence of more active participation on behalf of the second female speaker was a sign of poor language competence or was due to politeness norms. This means that in order to obtain data for some speakers who are less talkative in two-participant or multiparty interactions, it may also be useful to collect monologues and narratives in an interviewlike setting. In theory, linguists involved in language documentation target the collection of a variety of registers, including informal conversations, traditional tales, recollection of traditions, and life-stories. In practice, however, ﬁeld linguists working on an endangered language need to adapt to the speakers’ wishes, habits or possibilities of text production. During the language documentation programme of Ixcatec, for example, it appeared that the types of collected texts depended on the speakers’ gender. 
On the one hand, the male Ixcatec consultants produced conversations on a topic they chose in advance, generally relating to the history of the community. Conversations were “staged”, following a distribution of roles between one speaker asking questions and one speaker responding. Turn-taking in the dyadic male-to-male conversations was slow (average 1,000 ms), with few overlaps and back-channels. On the other hand, the female Ixcatec consultants offered a vivid conversation on a variety of topics, mostly related to the current life of the village. The two ladies, who were good friends, were particularly happy to meet since their houses are located in opposite parts of the village and their health problems and household duties do not allow them to meet as frequently as they would like. The resulting conversations were characterized by frequent overlaps, back-channelling, and quick turn-taking (average 220 ms). The combination of the two speech-styles provided a good sample of how conversations may have been conducted in Ixcatec female-to-female and male-to-male interactions.

To conclude, the experience of working with a variety of language communities shows that there is no recipe for collecting spontaneous bilingual data. The best way forward is to adapt to the cultural habits and the personalities of the consultants by trying to maximize the spontaneity of production through long-term presence in the community or through collection of the data by a trained community member.
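The turn-taking averages quoted above for the Ixcatec conversations (about 1,000 ms and 220 ms) presuppose some measure of the interval between one speaker's turn and the next. The text does not say how the interval was measured, so the sketch below rests on an assumption: the offset is taken as the next speaker's start time minus the previous speaker's end time, counted only where the speaker changes. The function name and the timings are invented for illustration and are not data from the corpus.

```python
from statistics import mean


def mean_turn_offset_ms(turns):
    """Mean offset in ms between consecutive turns by different speakers.

    `turns` is a list of (speaker, start_ms, end_ms) tuples sorted by start time.
    A negative offset means the next speaker started before the previous one ended.
    Returns None when there is no change of speaker.
    """
    offsets = [
        following[1] - previous[2]
        for previous, following in zip(turns, turns[1:])
        if previous[0] != following[0]  # only changes of speaker count
    ]
    return mean(offsets) if offsets else None


if __name__ == "__main__":
    # Invented timings, for illustration only.
    example = [("A", 0, 1000), ("B", 1200, 2000), ("A", 2300, 3000)]
    print(mean_turn_offset_ms(example))
```

With time-aligned transcriptions, a measure of this kind can be computed directly from the annotation files, which makes the contrast between slower and quicker conversational styles reproducible.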

2.2 The sample

The first task, when working on an endangered language, is to identify all skilled speakers and obtain permission to work on the language. In practice this is more difficult than it may appear since social relations are typically governed by gender, age, and social-class or other hierarchical antagonisms.

While it is ideal to work with both male and female speakers, as numerous studies have revealed the existence of linguistic variation depending on sex and gender, we note the absence of women’s speech in several linguistic studies (see among others Innes 2006). Indeed, dialectological studies were conducted with predominantly elder male speakers, known as NORMs, “non-mobile old rural males” (Chambers and Trudgill 1998: 29). The researcher’s gender may also influence work with language consultants as co-ed sessions are not always accepted by local communities: in some cultures, male researchers may encounter difficulties in working with women, whereas female researchers may have better access to them, and vice versa. Although clear-cut gender divisions were not relevant for the study of Ixcatec, it appeared that prior to the ELDP language documentation programme, the attempts to document the language involved exclusively the male speakers. As a female researcher, I grasped the opportunity of working intensively with two fluent female Ixcatec speakers who not only produced a different speech-style than men, but also made use of some interesting linguistic features, such as the evidential marker and the focus particle which were not documented in the men’s productions.

Age-related tensions are also particularly frequent in settings of language shift. Younger speakers, who are often L2 speakers of the endangered language, may consider themselves or may be considered by the other members of the community as less proficient.
This was the case for the study of Ixcatec, where a 40-year-old female speaker was reluctant to participate in the language documentation programme. Nevertheless, the productions of semi-speakers are interesting with regard to language contact. Moreover, semi-speakers may contribute

to other tasks such as translation, transcription, and data collection. For example, the study of Balkan Slavic Nashta benefitted from the collaboration of a 55-year-old Greek-speaking woman who had grown up in the village and had an excellent understanding of Nashta, but had never had the opportunity to speak it. Consideration of the socio-demographic characteristics of speakers is also crucial to understanding the type of language contact. While social classes as defined in Western societies do not exist as such in all communities throughout the world, anthropological studies revealed the existence of other types of social hierarchies which may also be relevant for the study of language shift and language contact in general. Presence in the speech-community for relatively long periods of several months and repeated visits allow the researcher to slowly grasp the speakers’ environment and social life. To conclude, while aiming at a balanced sample in the study of endangered languages, researchers can only obtain the data that it is possible to obtain in a given community at a given time. This book shows that it is possible to carry out useful and productive quantitative analyses even when language samples are small or imbalanced.

2.3 Transcription and annotation

A variety of tools were used to transcribe and annotate the corpora presented in this book. For the Slavic data we used the authoring tool ITE (Interlinear Text Editor).1 ITE was used for transcribing, tokenizing, glossing, and translating the texts. Codeswitching and borrowings, whether from the current-contact language or a past-contact language, were tagged directly in the xml file. The ITE files were synchronized with SoundIndex, a tool for time-aligning xml-formatted text annotation.2 Searches were conducted in a specifically developed interface. Similarly, the Thrace Romani corpus was first transcribed and annotated with ITE. The xml files were then transformed for further annotation using Jaxe,3 a tool which, in the meantime, had been specifically developed for the study of multilingual corpora with a compatible concordance tool.4 The advantage of Jaxe is its association to a database which enables annotation of various sociolinguistic parameters. Figure 1 shows the files produced by Jaxe and the concordance tool with a screen capture from the Thrace Romani corpus.

Figure 1: Screen capture of a search on a file produced with Jaxe for Thrace Romani

The Ixcatec corpus was first transcribed and annotated using ITE and Toolbox. The resulting files were then processed in the ELAN format,5 and more specifically in the ELAN-CorpA version.6 The ELAN-CorpA version offers the possibility to work with the sound and audio file while proceeding at a morpheme break thanks to the “Interlinearize” function. For the annotation of the Ixcatec corpus I used six tiers, which depend on the reference tier (rf).7 In this format, tx is the tier for the broad phonetic transcription. The tier mot (word transcription) is an intermediate tier which is automatically tokenized into morphemes. Morphemes are glossed in the ge tier. The ft tier is used for free translation. Last, the rx tier is used for word classes (nouns, verbs, conjunctions, particles, etc.), borrowings and codeswitching, and other relevant information. Searches can then be conducted with ELAN’s concordance tool. Figure 2 illustrates the ELAN-CorpA format with a screen capture from Ixcatec.

Figure 2: Screen capture of an ELAN-CorpA file for Ixcatec

1 ITE was developed by Michel Jacobson at the CNRS (French National Centre for Scientific Research) laboratory, LACITO (Oral Tradition Languages and Civilizations).
2 SoundIndex was developed by Michel Jacobson at the CNRS (French National Centre for Scientific Research) laboratory, LACITO (Oral Tradition Languages and Civilizations).
3 Jaxe was developed within the project Towards a multi-level, typological and computer-assisted analysis of contact-induced language change, with funding from the French Research Agency (ANR) (P.I. Isabelle Léglise).
4 Developed with funding from the programme Empirical Foundations of Linguistics of the French Research Agency.
5 ELAN was created by the Max Planck Institute for Psycholinguistics at Nijmegen, Netherlands, see Sloetjes and Wittenburg 2008. ELAN is freely downloadable at https://tla.mpi.nl/tools/tla-tools/elan/
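The tier layout described above for the Ixcatec corpus can be pictured as a small data structure, and the contact tags on the rx tier then lend themselves to a simple count. The sketch below is illustrative only and is not the ELAN-CorpA file format: it models just the tiers named in the text (rf, tx, mot, ge, ft, rx), the tag labels "BORR" and "CS" are hypothetical, and the sample content consists of placeholders rather than real Ixcatec forms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    mot: str  # word-level transcription (mot tier)
    ge: str   # morpheme glosses for the word (ge tier)
    rx: str   # word class plus optional contact tag (rx tier), e.g. "N" or "N.BORR"


@dataclass
class Utterance:
    rf: str            # reference tier: utterance identifier
    tx: str            # broad phonetic transcription
    ft: str            # free translation
    words: List[Word]


# Hypothetical labels for borrowing and codeswitching on the rx tier.
CONTACT_TAGS = {"BORR", "CS"}


def contact_word_rate(utterances):
    """Share of word tokens whose rx annotation carries a contact tag."""
    words = [word for utterance in utterances for word in utterance.words]
    if not words:
        return 0.0
    tagged = sum(1 for word in words if CONTACT_TAGS & set(word.rx.split(".")))
    return tagged / len(words)


if __name__ == "__main__":
    # Placeholder content only; these are not real Ixcatec forms.
    sample = [
        Utterance("utt_001", "w1 w2 w3", "placeholder translation",
                  [Word("w1", "g1", "V"), Word("w2", "g2", "N.BORR"), Word("w3", "g3", "PTCL")]),
        Utterance("utt_002", "w4", "placeholder translation",
                  [Word("w4", "g4", "N")]),
    ]
    print(contact_word_rate(sample))
```

Storing word class and contact status in the same annotation, as the rx tier does, means that a single pass over the words yields both the overall rate of contact words and their distribution across word classes.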

6 ELAN-CorpA was developed within the research programme A spoken Corpus for Afro-Asiatic languages, funded by the French Research Agency (ANR) (P.I. Amina Mettouchi). The resulting ELAN-CorpA version is freely downloadable at http://corpafroas.huma-num.fr/outils.html
7 Also see the ELAN-CorpA Manual http://corpafroas.tge-adonis.fr/fichiers/manual.pdf

2.4 Corpus size

One of the main problems for a corpus-based study of lesser-known and endangered languages is the small size of the corpora. Corpus-based studies draw their strength from the analysis of sizeable corpora and relatively large numbers of speakers. For example, Poplack (1993) relies on a 3.5 million-word corpus for

the study of language contact among French-English bilinguals of Canada. For the study of Puerto Rican codeswitching, 66 hours of recordings were analysed, with a minimum of 2 hours of speech per speaker (Poplack 1980). Myers-Scotton (1993a) worked with 20 hours of speech for each of her corpora from more than 100 speakers. Similarly, Travis and Torres Cacoullos (2013) built a 30-hour-long corpus from some 40 bilingual New Mexico Spanish-English speakers with approximately 200,000 words. One may draw a rapid conclusion that a small corpus is not appropriate for a quantitative study, but several precious studies were conducted with small corpora. For example, Deuchar (2006) based the study of Welsh-English bilinguals on a ﬁve-hour corpus of which only forty minutes were quantitatively analysed. Van Hout and Muysken (1994) for Bolivian Quechua-Spanish, Grenoble (2009) for Evenki-Russian bilinguals, and Meyerhoﬀ (2014) for Nkep-Bislama bilinguals, produced valuable studies despite the limited size of the corpora, constituted of approximately 6,000 words. Building a free-speech corpus, and even more so a conversational corpus, is a very time-consuming task. According to the calculations by Travis and Torres Cacoullos (2013), approximately 50 hours are needed for the transcription of one hour of speech for conversations involving two participants, and even more for sessions with more than two participants. Notice that these researchers are discussing corpora based on an orthographic transcription of two global communication languages, Spanish and English. Also, these transcriptions are conducted in active collaboration with community members who provide the ﬁrst draft. However, even when the annotation of language corpora is trusted to trained assistants, the task still requires several years as Poplack (1993) observes for the Canadian corpus involving two widely spoken languages. 
The difficulties reported by colleagues working on major-communication languages appear to be insurmountable for the endangered languages explored in this book. Indeed, for lesser-known languages it is not always possible to collaborate with language assistants. Moreover, even where language assistants are involved in the project, the size of the corpora is by no means comparable to the millions of word-tokens obtained for more widely spoken languages. Little-known languages generally require the words to be fully glossed in order for the data to be preserved for future generations. This specificity of lesser-known languages often implies a double annotation of a corpus, adding glosses to the already difficult task of transcribing the recordings. To overcome this hurdle, automatic alignment of corpora from lesser-studied languages is currently being developed in several projects in collaboration with engineers (see, for example, Michaud et al. 2012). Such projects may in the future provide researchers with the tools to significantly augment their corpora.


At present, the best way to increase the size of a corpus is to make some radical choices as far as the extent of the annotation is concerned. Bilingual spoken corpora with fine-grained annotations, such as treebanks, may sometimes reach only 20,000 words even for the most widely spoken languages like Chinese and English (Wang and Liu 2013). Gomez Rendon (2008), in a corpus-based study of language contact in three Amerindian languages, relied on three corpora of approximately 100,000 words each, which is a great accomplishment in the field. To obtain this size, besides obvious hard work and the exploitation of corpora shared by fellow linguists, the author limited the annotation to the levels relevant to the goals of his study, i.e., he annotated the word classes for the borrowed parts of the corpus but not for the entire corpus.

In all cases, with more or less extensive corpora, corpus-based studies are necessarily limited. As observed for the large corpora, the occurrence of the targeted language contact phenomena can be scarce. Poplack notes that in the French-English corpus, “code-switches occur anywhere from not at all to 132 times in an interview, loanwords represent between 0.1% and 2.5% of the total lexicon employed by an individual, unambiguous cases of convergence are exceedingly rare, etc.” (Poplack 1993: 261). This observation, instead of being discouraging for corpus-based research on smaller corpora, can be understood as an encouragement: it is not at all certain that bigger collections of data would modify the rates observed in the small corpora, although they would necessarily increase the absolute numbers of occurrences. To conclude, a quantitative look at spontaneous oral corpora can be useful for the study of language contact phenomena even for relatively small corpora such as those produced for the documentation of endangered languages.
Indeed, several researchers are in favour of broadening the field of corpus linguistics (see among others Newman, Baayen, and Rice 2011; Gries and Berez in press) and statisticians are collaborating with linguists in order to develop new statistical models adapted to the specificities of small and heterogeneous linguistic corpora (among others see Nock et al. 2009; Tagliamonte and Baayen 2012; Gries 2015).
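The relation between corpus size, rates, and absolute numbers discussed above can be illustrated with a small worked sketch. It is not part of the original study: the function name, the choice of the Wilson score interval, and the token counts are all assumptions made for illustration. It computes a 95% interval around the same hypothetical 4% rate of contact words for corpora of increasing size.

```python
import math


def wilson_interval(hits: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the proportion hits/total (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = hits / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# Hypothetical corpora sharing the same 4% rate of contact words.
for total in (5_000, 50_000, 500_000):
    hits = round(0.04 * total)  # absolute number of contact tokens
    low, high = wilson_interval(hits, total)
    print(f"{total:>7} tokens, {hits:>5} contact tokens: {low:.2%} to {high:.2%}")
```

Under these assumptions the estimated rate stays the same as the corpus grows, while the absolute number of occurrences increases and the interval around the rate narrows.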

2.5 Corpus accessibility

An important issue for the development of corpus-based research is data accessibility. In recent years, several language databases have been created, such as the Archive of the Indigenous Languages of Latin America (AILLA) at the University of Texas in Austin, the Endangered Languages ARchive (ELAR) at the School of Oriental and African Studies (SOAS) in London, the Language Archive at the Max Planck Institute for Psycholinguistics in Nijmegen, the CNRS Pangloss
Collection in Paris, and the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) through the collaboration of three institutions, the Australian National University in Canberra, the University of Melbourne, and the University of Sydney. In order for the materials to be made part of a heritage archive, a vital step is the collection of the metadata associated with archived documents, such as sociolinguistic information on the speakers, the researchers, the content of the document, and the date and place of recording. Metadata may follow the Dublin Core specifications8 as defined by the Open Language Archives Community (OLAC) for all types of linguistic documents. However, archives such as DoBes and ELAR (SOAS) use fuller standards, better adapted to the needs of linguists, such as the ISLE Metadata Initiative (IMDI) developed by the Max Planck Institute. The program Arbil, which uses the IMDI standards, makes it possible to generate the metadata for the various types of files produced by field linguists and the associated workflow. Another popular format, initiated by CLARIN, is the Component MetaData Infrastructure (CMDI), which allows for fine-grained and researcher-specific metadata.9

Most archives offer different types of access to the data. For example, ELAR offers four access groups: Ordinary users, Researchers, Community members, and Subscribers. Indeed, questions related to the legal authorizations and the protection of the speakers’ identity are complex and a growing body of literature on the topic is available (see Grenoble and Furbee 2010). Obtaining the necessary authorizations depends on the types of communities. In some cases research permits may be required by the local authorities, in others permission from the local populations, or permission at the level of individual speakers.
For example, within the Ixcatec documentation programme, the Ixcatec speakers agreed individually to participate in the study by signing formal authorizations after each working session. But more significantly, the Ixcatec documentation research programme was approved by the community’s general assembly. These structures may not be available in other settings and an individual agreement may be the best solution, as in the case of the corpora collected in Greece.

Since language is considered a cultural heritage, it is the researcher’s duty to make the work available to the language communities. But ensuring access to raw or annotated data for the scientific community is also a way to improve and augment corpus-based studies. We cannot predict the form that linguistic research will take in the years to come, but open access to oral corpora from lesser-known languages will most likely allow for new, collaborative research to be done. Also, depositing language corpora in databases ensures their permanent preservation and accessibility, as well as the guarantee that the corpora will be updated as needed following the progress of digital technology.

8 http://dublincore.org
9 http://www.clarin.eu/content/component-metadata

In this perspective, most corpora cited in this book are available online. The entirety of the Slavic corpora is available at the Pangloss Collection in open access as the result of the programme Electronic database of endangered Slavic varieties in non-Slavic speaking European countries (EuroSlav 2010);10 see Map in Figure 3. The Pangloss Collection is a language archive developed at the research centre Oral Tradition Languages and Civilizations (LACITO) of the French National
Centre for Scientific Research (CNRS).11 The Collection is an Open Archive containing recordings, text annotations, and metadata. It currently contains 1,400 recordings in more than 70 languages, including 400 transcribed and annotated documents (Michailovski et al. 2014). The data are stored on servers of HumaNum,12 a French structure which works under the supervision of the French National Higher Education Computing Centre (CINES). The audio recordings of the Pangloss Collection may be listened to in their entirety or sentence by sentence. The researcher may set parameters so that the transcription appears in parallel with the sound recordings, either sentence by sentence, or with the transliteration and the morphosyntactic glosses; see Figure 4.

Figure 3: The Slavic minority languages of the EuroSlav corpora

10 ANR-09-FASHS-025 and DFG BR 1228-4-1

Figure 4: Screen capture of a file produced with ITE for Balkan Slavic Nashta

11 http://lacito.vjf.cnrs.fr/pangloss
12 Humanités Numériques (HUMA-NUM; formerly ADONIS), CINES and CC-IN2P3 http://www.huma-num.fr

The Pangloss Collection also hosts the annotated tales from the Thrace Romani-Turkish-Greek corpus (Adamou 2008). The spontaneous Romani conversations are not available online due to the private content of the recordings. The annotated corpora are nevertheless deposited on a server with restricted access,13 as part of the French National Research Agency project Towards a multi-level, typological and computer-assisted analysis of contact-induced language change. They are referred to as “unpublished data” as only excerpts of the data will become available in the years to come.

Finally, 50 hours of Ixcatec recordings are available online at the Endangered Languages Archive (ELAR).14 Access is restricted to the Ixcatec language community and to the scientific community. At present, seven hours of conversations are fully annotated in ELAN format. The annotated files (including transcriptions in IPA, glosses and two more tiers indicating the origin of the tokens and the word classes) are synchronized with the video files. Thirty minutes of the associated annotated ELAN files are deposited on a server with restricted access as part of the French National Research Agency programme Designing Spoken Corpora for Cross-linguistic Research.15 They will become publicly available in 2016. Another thirty minutes of ELAN files annotated for information structure are also currently being prepared with funding from the French National Research Agency programme Investments for the future.

13 http://clapoty.vjf.cnrs.fr/contacts/index.php
14 See http://www.elar-archive.org/index.php
15 See http://cortypo.huma-num.fr/project.html

Chapter 3

Overall composition of a multilingual corpus

3.1 Background

Once a corpus has been collected and annotated for the study of language contact, as discussed in Chapter 2, a number of operations are open to the researchers. The possibilities depend on the corpus specificities and the researcher’s interests, but a very first step may be to conduct a word-count to determine the composition of the corpus with regard to the languages in contact.

A word-count based on the language of origin of the words is not a new method. For example, the so-called “criterion of frequency” was suggested by Myers-Scotton (1993a: 68) within the Matrix Language Framework in order to identify the Matrix Language (ML) of a bilingual corpus. Nevertheless, in subsequent work the frequency criterion was rejected in favour of the more solid System Morpheme Principle (Myers-Scotton 2002: 61). This choice was due to the difficulties that researchers encountered in practice and to counter-examples from languages in which a numerically-dominant language was not the Matrix Language. Although a quantitative approach based on a word-count alone is not sufficient, it will be shown that when word-count is combined with other parameters it becomes a useful tool which may allow for cross-linguistic comparability on a solid, empirical basis.

Indeed, the word-count method helps us identify the “numerically-dominant language” in a bilingual corpus. For example, if a corpus contains 96% word-tokens from language A and 4% word-tokens from language B, language A is clearly the numerically-dominant language. If another corpus contains 80% word-tokens from language A and 20% from language B, this corpus can be understood as a relatively more mixed corpus. Such an overview of a multilingual corpus gives us an idea of the type of language mixing produced in a given bilingual or multilingual community.
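The two cases just described can be sketched in a few lines of code. This is only a minimal illustration under stated assumptions, not the procedure used for the corpora in this book: it assumes that each word-token already carries a language tag, and the function names and toy tag lists are hypothetical.

```python
from collections import Counter


def composition(tags):
    """Return the share of word-tokens per language tag."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the corpus contains no tagged tokens")
    return {tag: n / total for tag, n in counts.items()}


def numerically_dominant(tags):
    """Return the tag with the largest share of word-tokens."""
    shares = composition(tags)
    return max(shares, key=shares.get)


# Hypothetical toy corpora mirroring the two cases in the text.
corpus_1 = ["A"] * 96 + ["B"] * 4    # 96% language A, 4% language B
corpus_2 = ["A"] * 80 + ["B"] * 20   # 80% language A, 20% language B

for name, tags in (("corpus 1", corpus_1), ("corpus 2", corpus_2)):
    shares = composition(tags)
    print(name, {tag: f"{share:.0%}" for tag, share in shares.items()},
          "dominant:", numerically_dominant(tags))
```

Language A is numerically dominant in both toy corpora; what differs is the share of language B, which is the figure used here to compare corpora.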
As illustrated in this chapter, not all bilingual communities use the same quantity of contact words in in-group conversations and this is already an interesting finding. In practice, in order to determine the rates of words depending on their origin, one has to count all the words which have previously been tagged as “words of the current-contact language(s)”, and calculate what percentage of the total number of words in the corpus they represent. Researchers can also tag the words from past-contact languages or from languages with no direct contact. If some words are shared between two or more languages, then it is
possible to tag them separately as “multiple” or choose the language of origin based on historical and linguistic evidence. A search can then be conducted either with an ELAN concordance tool, or with searches in Toolbox, Word, Excel or Notepad++, depending on the format in which the corpus was transcribed.

A major difficulty that researchers have to deal with in order to attain cross-linguistic comparability with the word-count method stems from the typological diversity of the languages that need to be compared. As mentioned by Muysken (2000: 66), the morpheme-frequency criterion should be taken into consideration in combination with the typology of the languages in contact. Indeed, languages may be of the “isolating” type, showing a one-to-one relation between morpheme and word with one semantic unit per morpheme; they may be of the “agglutinating” type, showing a many-to-one relation between morpheme and word with one semantic unit per morpheme; or they may be of the “fusional” type, showing a many-to-one relation between morpheme and word with several semantic units per morpheme. The problems one faces when comparing the number of word-tokens for languages belonging to different morphological types are twofold. First, the overall size of the corpus depends on the typological profile of the language under study, with isolating languages showing many more tokens than agglutinative and fusional languages. For example, a Chinese corpus would provide us with more word-tokens than a Turkish corpus. Second, and more importantly, a problem arises for the evaluation of the rates of word-tokens when the languages in contact belong to different morphological types, for example if one language is fusional and the other analytical.
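To make the second problem concrete, the following sketch contrasts rates computed over word-tokens with rates computed over morphemes for the same data. It is an illustration under stated assumptions rather than a description of the tools used in this book: the tokens are invented placeholders, language A stands for an agglutinating language and language B for an isolating one, and morpheme boundaries are assumed to be marked with a hyphen (affixes) or an equals sign (clitics) in the segmented forms.

```python
import re
from collections import Counter


def morpheme_count(segmented_form: str) -> int:
    """Count morphemes in a form segmented with '-' (affix) or '=' (clitic)."""
    return len([m for m in re.split(r"[-=]", segmented_form) if m])


def rates(tokens, unit="word"):
    """Share of each language, counting either word-tokens or morphemes."""
    if unit not in ("word", "morpheme"):
        raise ValueError("unit must be 'word' or 'morpheme'")
    totals = Counter()
    for form, language in tokens:
        totals[language] += 1 if unit == "word" else morpheme_count(form)
    grand_total = sum(totals.values())
    return {language: n / grand_total for language, n in totals.items()}


# Hypothetical data: placeholder forms, not taken from any corpus.
tokens = [
    ("stem-sfx-sfx", "A"),
    ("stem-sfx", "A"),
    ("word", "B"),
    ("word", "B"),
]

word_rates = rates(tokens, unit="word")
morpheme_rates = rates(tokens, unit="morpheme")
print("by word-token:", word_rates)
print("by morpheme:  ", morpheme_rates)
```

In this toy example the two languages each contribute half of the word-tokens, but language A contributes five of the seven morphemes, so the choice of counting unit changes the apparent degree of mixing.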
In a well-segmented corpus, the difference between an analytical and an agglutinative language can be overcome by counting the number of morphemes in the glosses (ge tier in ELAN-CorpA) for each language rather than the words (mot tier in ELAN-CorpA). Indeed, counting the morphemes is crucial to describe mixed languages (see Meakins 2013), or languages where borrowing of morphology occurs with little lexical borrowing, such as Resigaro (Seifart 2012). One way to address these difficulties is to combine the numerical criterion with a more fine-grained analysis, taking into account several other parameters, such as word classes, length of switches, and inter-speaker variation, as shown in the following chapters. It is the combination of all these parameters that can provide a powerful framework for contact linguistics.

3.1.1 Corpora with 0‒5% contact words

The word-count method shows that some corpora have a numerically-dominant language A and contain very few word-tokens from the current-contact language B, ranging from 0% to 5%.

3.1.1.1 The Ixcatec-Spanish corpora

Ixcatec, or ʃ hwanì ‘our language’, is an Otomanguean language of the Popolocan branch, spoken in the village of Santa María Ixcatlán, in the State of Oaxaca, Mexico. Ixcatec is a critically endangered language, rated D on Krauss’s scale for assessing language endangerment (Krauss 2006). It is nowadays spoken by fewer than ten speakers, most of them in their 80s, of whom only four are fluent in Ixcatec. Fifty hours of conversations and narratives were recorded within the Ixcatec documentation programme and can be accessed online.1 The video recordings were made in order to document both speech and gesture following the recommendations for the multimodal documentation of languages (Seyfeddinipur 2012). Of these, approximately seven hours (35,000 words) have been analysed by Evangelia Adamou and Denis Costaouec. At the moment, however, the analysis of the contemporary corpus for language contact relies on a total of 7,207 word-tokens, available in ELAN format.

The contemporary Ixcatec-Spanish corpus is built with data from the last four fluent speakers, two male and two female, all living in the village of Santa María Ixcatlán. All Ixcatec speakers were in their early 80s at the time of the recordings and all of them had little formal education, with just a few years spent in primary school. Figure 5 shows the interactions between the speakers for the recordings and gender separation: the female speakers only talked to each other, as did the male speakers, in the presence of the researchers, who are not fluent speakers of Ixcatec and who are community outsiders. In one case the male speakers were recorded in the presence of community members who nevertheless were not speakers of Ixcatec.

The contemporary corpus can be compared with the Ixcatec texts from the 1950s, published by Fernández de Miranda (1961). The Ixcatec-Spanish corpus of the 1950s consists of only 1,600 words, from a single, male speaker, Doroteo Jiménez.
The corpus of the 1950s contains several Ixcatec texts translated from Spanish by Doroteo Jiménez and recorded in Mexico City by Fernández de Miranda. The corpus was glossed by Denis Costaouec in collaboration with the two male Ixcatec speakers and was electronically annotated in Toolbox as part of the Ixcatec documentation programme.

The example in (1) illustrates the contemporary Ixcatec-Spanish corpus, with an excerpt from a discussion between the two female speakers. It can be seen that Spanish words are limited to the name of the social programme Oportunidades, meaning ‘opportunities’ and referring to State social programmes.

1 See http://elar.soas.ac.uk/deposit/0193

Figure 5: Interactions in the two Ixcatec-Spanish corpora (circles for female speakers, squares for male speakers, numbers for word-tokens in the interactions)

Contemporary Ixcatec-Spanish corpus: Ixcatec (in plain), Spanish (in bold)

(1)  tʃika  Ɂinári    ndíʃera  Ɂísána  β-eɁe-ma       oportunidá
     like   1PL.EXCL  because  better  IPFV-give-3PL  NP

     ‘Like us, it’s better, because they give us the welfare programme (Oportunidades).’
     (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php/. Tones are transcribed as follows: high is transcribed as ˊ on the vowel, low as ˋ on the vowel, and mid is not noted but applies to all vowels which are not high or low.)

The example in (2) from the corpus of the 1950s shows a plethora of Spanish words, both verbs and nouns.

Ixcatec-Spanish corpus of the 1950s: Ixcatec (in plain), Spanish (in bold)

(2)  tsí-tse  Ɂordená  sóldádú  la   ts-átsu-kú  kárgú   tse  ɸusilá  ɾéj
     EVD-do   order    soldier  REL  EVD-be-ANT  charge  do   shoot   king

     ‘He ordered the soldiers under his command to shoot the king.’
     (Fernández de Miranda 1961: 194, line 71, my glosses, my translation from Spanish. Tones are transcribed as follows: high is transcribed as ˊ on the vowel, low as ˋ on the vowel, and mid is not noted but applies to all vowels which are not high or low.)

One would expect that the last speakers of Ixcatec who produced the contemporary corpus would be using more Spanish in their speech since Spanish is their everyday language and Ixcatec is no longer spoken in everyday life. Moreover, the Ixcatec speakers are discussing a variety of topics, including everyday life events, and one would expect Spanish words from the dominant pragmatic setting to be mobilized. But the word-count reveals a different picture with few contact words in the contemporary corpus and significantly more contact words in the corpus of the 1950s. Figure 6 shows the results of the word-count for the contemporary Ixcatec-Spanish corpus with 4% Spanish tokens. The quantitative
analysis of the corpus of the 1950s shows that Spanish words represent 10% of the total; see Figure 6. The unexpected differences in the rates of contact words in the two corpora call for more fine-grained analyses, which are illustrated in the following chapters. A discussion with respect to the extralinguistic factors is provided in Chapter 9.

Figure 6: Two Ixcatec-Spanish corpora (8,807 words in total): Distribution of word-tokens with respect to language

3.1.1.2 The Balkan Slavic-Greek corpora

Another corpus with 0‒5% contact words comes from the Balkan Slavic variety called Nashta, literally ‘our (language)’. The annotated corpus, of 5,301 word-tokens, was recorded in the 2000s in the small town of Liti, located 10 km from the city of Thessaloniki in Greece, among the last fluent speakers, aged from 75 to 85 years (Adamou 2013a). On the scale of endangerment, Nashta is critically endangered (D) (Krauss 2006), although the Balkan Slavic varieties spoken by the Christian populations in other areas of Greece can be qualified as severely endangered (C) (Krauss 2006). The Nashta Slavic variety is part of the group of Balkan Slavic languages represented in the area by two official languages, Literary Bulgarian and Literary Macedonian. Unlike speakers of these languages, however, the speakers of the Nashta variety have been in contact with Greek for at least the past century, during which the shift to Greek took place.

The Balkan Slavic Nashta-Greek corpus is based on the speech of three fluent speakers, two female and one male, in their 70s at the time of the recordings, and one younger female semi-speaker. The recordings took place at home, bringing together acquaintances who were asked to speak Nashta although Greek would have been the language used in these circumstances. Interviews with the researcher completed the corpus. Figure 7 shows the interactions that constitute the Balkan Slavic Nashta corpus which was analysed for this study.

In (3), an excerpt from the Balkan Slavic Nashta corpus is provided. This example raises some interesting methodological questions. Whereas the verb ciniˈsa ‘get going’ is easily identified as a Greek-origin verb, the word for ‘shepherd’, tʃuˈban, illustrates the difficulties of tagging a word which is shared by various languages.

Figure 7: Interactions in the Balkan Slavic Nashta corpus (circles for female speakers, squares for male speakers, numbers for word-tokens in the interactions)

Indeed, the word tʃuˈban ‘shepherd’ is used by Turkish monolinguals, and a similar word, tsoˈmbanos ‘shepherd’, is used by Greek monolinguals, in variation with the word vosˈkos ‘shepherd’. Moreover, when taking into account the etymology of the word, it appears that the Turkish word for ‘shepherd’ is a borrowing from Persian. Should the annotator consider the word tʃuˈban as a “Persian-origin” word, a “Turkish-origin” word or as a “current-contact” language word from Greek? To decide, one has to mobilize both historical and linguistic parameters. First, the fact that the last speakers of Nashta do not speak Persian or Turkish indicates that the word ‘shepherd’ is not a recent addition to their Nashta vocabulary. Knowing that Ottoman Turkish was greatly influenced by Persian and that Ottoman Turkish was a major contact language in the Balkans during Ottoman times, one may trace the word tʃuˈban and decide that it was most likely introduced in a past-contact setting through Turkish. This analysis is backed up by the similarity of the Nashta and the Turkish forms and the fact that the same word is used in other Balkan Slavic languages with no contemporary contact with Greek. However, although tagging the word as “Turkish” is historically accurate, it masks the information that a similar word is also available in the current-contact language, Greek. An alternative and more accurate option with respect to the study of language contact is to tag the word tʃuˈban as “multiple” for being shared between two of Nashta’s contact languages, the current-contact language, Greek, and the past-contact language, Turkish. Although the word-count of “current-contact” tokens will
not include the words tagged as “multiple”, such detailed tagging allows for a fine-grained analysis of the corpus.

Balkan Slavic Nashta corpus: Slavic (in plain), Greek (in bold), Multiple (underscored)

(3)  tʃuˈban-at         ˈpasʲ-ʃe        ˈvoftsʲ-e-te  diˈʎeko
     shepherd-ART.SG.M  graze-IPRF.3SG  sheep-ART.PL  far

     i    ciniˈsa                iˈdno     den
     and  get_going.AOR.LVM.3SG  one.SG.N  day.SG.N

     ‘A shepherd was grazing his sheep far away, and one day, he got going...’
     (Adamou 2013a. Excerpt from Le berger et son ombre, sentences 2 and 3. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

It is possible to compare the Balkan Slavic Nashta corpus (Adamou 2013a) to the corpus of another Balkan Slavic variety of Northern Greece recorded in 1976 in the village of Hrisa (Drettas 2013). The annotated corpus consists of a total of 3,302 tokens from a 40-year-old and a 60-year-old female speaker. The corpus was recorded by Georges Drettas at the speakers’ homes in the presence of family members. Based on our historical and linguistic knowledge of the area, as discussed in Chapter 9, we expect the Balkan Slavic corpus of the 1970s to show fewer Greek words than the corpus of the 2000s since contact with Greek intensified during the twentieth century. In accordance with this hypothesis, the analysis of the Hrisa Balkan Slavic corpus shows practically no Greek tokens; see Figure 8. The analysis of the Nashta corpus shows 4% Greek tokens, when restricted to
the speech of the fluent speakers. This percentage gets higher, up to 5%, when the data from both the fluent speakers and a semi-speaker are analysed.

Figure 8: Two Balkan Slavic-Greek corpora (9,235 words in total): Distribution of word-tokens with respect to language

3.1.1.3 The Colloquial Upper Sorbian- and the Burgenland Croatian-German corpora

Two other comparable Slavic corpora, with less than 5% words from the current-contact language, come from the Colloquial Upper Sorbian- and the Burgenland Croatian-speaking communities. Colloquial Upper Sorbian is a West Slavic language spoken in Germany. It is the everyday language of the Catholic Sorbian population living in Upper Lusatia and it may be considered an unstable endangered language, A– on Krauss’s scale (Krauss 2006). A cross-regional Upper Sorbian standard is used in schools, church, and media with relatively little impact on Colloquial Upper Sorbian. The corpus consists of a total of 4,348 tokens, from eight different speakers, three female and five male, aged from 26 to 83 (Scholze and Breu 2013); see Figure 9.

Figure 9: Interactions in the Colloquial Upper Sorbian-German corpus (circles for female and squares for male speakers, speakers’ initials, numbers for word-tokens in the interactions)


Example (4) illustrates the use of a German discourse particle alzo ‘well’.

Colloquial Upper Sorbian corpus: Slavic (in plain), German (in bold)

(4)  alzɔ  pɔ  nas     da-w-ɛ
     well  at  we.GEN  give-IPFV-PRS.3SG

     wɛ  swʊ͡əjb-ɛ         jɛn                nawɔʃk
     in  family-LOC.SG.F  ART.INDF.NOM.SG.M  custom.NOM.SG.M

     ‘So, in our family, there is a custom.’
     (Scholze and Breu 2013. Excerpt from Pâques chez les Sorabes, sentence 1. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

Burgenland Croatian (Central South Slavic) has been spoken in Eastern Austria since the fifteenth and sixteenth centuries. A literary variety has been elaborated since the sixteenth century and the standard language is mainly restricted to school, church, and media. Burgenland Croatian can be rated as A–, unstable, with children still speaking the language in some localities (Krauss 2006). The corpus has a total of 3,664 tokens from ten different speakers, six female and four male, aged from 11 to 82 (Scholze, Breu, and Utschitel 2013); see Figure 10.

Figure 10: Interactions in the Burgenland Croatian corpus (circles for female and squares for male speakers, speakers’ initials, numbers for word-tokens in the interactions)


Figure 11: Two Slavic-German corpora (8,012 words in total): Distribution of word-tokens with respect to language

The example in (5) illustrates the use of a German verb in the speech of a Burgenland Croatian speaker.

Burgenland Croatian corpus: Slavic (in plain), German (in bold)

(5)  tako    ˈpraːv-o         ko=se     ur       ˈpietak
     so.MID  custom-NOM.SG.N  PTL-REFL  already  Friday.ACC.SG.M

     ˈnu͡otʲ-iː       mblaˈdiːn-a     ˈstref-i
     night-LOC.SG.F  youth-NOM.SG.F  meet.PFV-PRS.3SG

     ‘That’s the custom, so on Friday, at night the young people already meet.’
     (Scholze, Breu, and Utschitel 2013. Excerpt from Le marriage en Burgenland, sentence 2. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

The analysis of the Sorbian corpus shows roughly 4% tokens from German, as illustrated in Figure 11. Similarly, the Burgenland Croatian-German corpus shows 5% German word-tokens; see Figure 11.

3.1.2 Corpora with 20‒35% contact words

A second type of bilingual corpora can be posited for the cases where, on average, 20‒35% of the word-tokens come from the current-contact language.


3.1.2.1 The Thrace Romani-Turkish-Greek and the Finnish Romani-Finnish corpora

A first example comes from the trilingual Romani community currently living in Greek Thrace and using Romani (Indo-Aryan), Turkish (Altaic), and Greek (Greek). Thrace Romani can be qualified as unstable, eroded in Krauss’s terminology, rated A– (Krauss 2006). This means that the language is spoken in some localities by children, but a process of shift at the level of the parental generation is taking place in other localities or in some families within one locality.

The Thrace Romani-Turkish-Greek corpus includes data from story-telling, interviews with the researcher and in-group conversations between 21 Roma speakers in the presence of the researcher. In this corpus, speakers discussed personal topics such as work, health, and family. The discussions were animated and the speakers paid little attention to the recorder. For two of the three sessions of storytelling, speakers performed in front of an audience of children and adults and thus had to draw attention through a lively narration. The data were collected during four fieldwork visits carried out between 2007 and 2010 by the author of this book. The recordings took place in the house, yard or workplace of one of the participants. Alongside the main group of speakers, several friends and family members regularly stopped by; see Figure 12.

For the analysis of the corpus, I counted the distribution of words depending on the language of origin, namely the words that are native or borrowed from past-contact languages, which were tagged “Romani”, and the words that come from the current-contact languages, namely Turkish and Modern Greek. I also tagged as “multiple” the words which may be found in the two current-contact languages. An excerpt from the Thrace Romani-Turkish-Greek corpus is given in (6), illustrating the use of a verb from Turkish and of a conjunction from Greek.
(6) Thrace Romani corpus < Romani (in plain), Turkish (in bold), Greek (underscored)

latʃo  afu    gadal     dyʃym-ijor-sənəs  te    dʒavtar     mange
good   since  this_way  think-PROG-2PL    COMP  go.1SG.DIR  1SG.DAT

‘Fine, since this is what you think, I’ll leave.’ (Adamou 2008. Excerpt from The coward and the giants, sentence 95. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)
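The word-count method described above, in which every word-token receives a language tag and the tags are then counted, can be illustrated with a minimal sketch. The tokens are those of example (6); the tag given to each token is an assumption inferred from the prose (one Turkish verb, one Greek conjunction, the remaining words Romani), and the function name `language_distribution` is hypothetical rather than part of any tool used for the corpus.

```python
from collections import Counter

# Tokens from example (6). The language tag of each token is an assumption
# inferred from the prose: one Turkish verb, one Greek conjunction, and
# Romani for the remaining words.
TAGGED_TOKENS = [
    ("latʃo", "Romani"),
    ("afu", "Greek"),
    ("gadal", "Romani"),
    ("dyʃym-ijor-sənəs", "Turkish"),
    ("te", "Romani"),
    ("dʒavtar", "Romani"),
    ("mange", "Romani"),
]


def language_distribution(tagged_tokens):
    """Return {tag: percentage of word-tokens} for (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: 100 * n / total for tag, n in counts.items()}


if __name__ == "__main__":
    for tag, share in sorted(language_distribution(TAGGED_TOKENS).items()):
        print(f"{tag}: {share:.1f}%")
```

Applied to a whole corpus rather than a single clause, the same count yields the kind of per-language distribution reported in the figures of this chapter.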

Figure 12: Interactions in the Thrace Romani-Turkish-Greek corpus (circles for female participants and squares for male participants, R for the researcher, S for the speakers)

The Thrace Romani-Turkish-Greek corpus, containing 5,816 word-tokens, shows 15% Turkish word-tokens, 4% Greek word-tokens, and 1% words tagged “multiple”; see Figure 13. Interestingly, the rate of Greek tokens in the Thrace Romani-Turkish-Greek corpus is similar to the rate observed in the corpora with 0‒5% contact words discussed in the previous section. This rate contrasts with the much higher rate of Turkish tokens (15%). Also, although language endangerment is similar to that of Colloquial Upper Sorbian, we observe that the results

Figure 13: The Thrace Romani-Turkish-Greek corpus (5,816 words in total): Distribution of word-tokens per language

in terms of language mixing are radically different. An explanation in terms of sociolinguistic parameters is explored in Chapter 9.

The Finnish Romani-Finnish corpus illustrates another corpus with 20‒35% contact words. Finnish Romani belongs to the C category of the language endangerment scale (Krauss 2006), being severely endangered, with only the grandparental generation being fluent (Adamou and Granqvist 2014). The shift is taking place toward Finnish, and younger Romani speakers may speak a mixed variety of Romani as a second language, acquired during their socialization process within the Romani community of Finland as adolescents (Vuorela and Borin 1998).

The Finnish corpus, of 13,019 word-tokens, is based on the speech of three female speakers in their 60s and 70s, recorded in the 1990s, at a moment when only Roma over about 65 years of age were able to communicate fluently in Romani (Adamou and Granqvist 2014). The three elder speakers were fluent in Finnish Romani, although Finnish was the dominant language in their everyday life. The recordings took place in a semi-formal setting, with interviews between one interviewer and one speaker. The interviewers, two Roma women in their 30s, were proficient speakers of Finnish Romani and were involved in the preservation and revitalization actions concerning Finnish Romani. Figure 14 graphs the interactions in the Finnish Romani-Finnish corpus (Adamou and Granqvist 2014). The Finnish Romani corpus was transcribed in Word format by Hellevi Hedman-Valentin, a Finnish Roma. The transcribed corpus was later tagged in

Figure 14: Interactions in the Finnish Romani-Finnish corpus (circles indicate female participants, numbers for word-tokens in the interactions)

Excel by Kimmo Granqvist with respect to the matrix language, following Myers-Scotton (1993a). An example from the Finnish Romani-Finnish corpus is provided in (7), showing a variety of words from Finnish.

(7) Finnish Romani corpus < Romani (in plain), Finnish (in bold)

ja and

doːri there

maŋ-jom prayed.1SG liːjas took.3SG

rukoil-i-n prayed-1SG deːvel-es god-OBL . SG

kokonaːn entirely

deːvel-es-ta god-OBL . SG -ABL ta and

maːn me

deːvel god oma-ksi own-TRN

l-iːjas took.3SG ja and

maːn me

deːvel god

oma-ks own-TRN

täytt-i ﬁlled.3SG

‘And there I prayed to God. I prayed to God and God took me to his own. God took me wholly unto himself and ﬁlled me.’ (Adamou and Granqvist 2014)

Figure 15: The Finnish Romani-Finnish corpus (13,031 words in total): Distribution of word-tokens per language (adapted from Adamou and Granqvist 2014)

The analysis of the Finnish Romani-Finnish corpus shows 65% Romani words and 35% Finnish words; see Figure 15. These rates contrast with the Ixcatec and Nashta corpora, which showed very low proportions of contact-language tokens, despite all three corpora being produced by the last speakers of endangered languages. Therefore, it seems that the sole fact of language endangerment fails to offer a satisfactory explanation for the type of language mixing.

3.1.2.2 The Molise Slavic-Italian corpora

Another similar corpus in terms of the rates of word-tokens from the two languages in contact comes from Molise Slavic (South Slavic, Štokavian-Ikavian), an endangered language of Italy (Breu 2011). Molise Slavic has been spoken in Italy for 500 years, following a migration from Dalmatia (Breu 2011). It has primarily been under the influence of the Molisian dialect of Italian and several regional varieties of Italian, then of Standard Italian since the middle of the nineteenth century. Nowadays, with few exceptions, Molise Slavic is no longer transmitted to children. Although bilingual classes in Molise Slavic are available, their impact is very limited (Adamou et al. in press 2016). In Krauss’s scale (2006) the varieties fall under category B, since they are spoken by the parental generation in most localities, with fewer than a thousand speakers in total (Adamou et al. in press 2016).

Figure 16: Interactions in the Molise Slavic-Italian corpus (circles for female and squares for male speakers, speakers’ initials, numbers for word-tokens in the interactions)

The Molise Slavic-Italian corpus was collected and annotated in the 2000s and consists of 17,279 words (Breu 2013). Thirteen speakers were interviewed in one-to-one conversations with the researcher or occasionally a community member; see graph in Figure 16. Six male and seven female speakers, with ages ranging from 30 to 80, participated in this study. They live in the three localities where Molise Slavic is traditionally spoken, in the Province of Campobasso in the Region of Molise in Italy, namely Acquaviva, San Felice, and Montemitro. An example from the Molise Slavic corpora is shown in (8), illustrating the use of a noun and an adversative marker from Italian, as well as a complementizer from Italian attached to the Slavic ˈaje ‘because’.

Figure 17: The Molise Slavic-Italian corpus (17,279 words in total): Distribution of word-tokens per language

(8) Molise Slavic corpus < Slavic (in plain), Italian (in bold)

ˈaje-ka       ˈbi-x-u      fanˈdaːzm-a
because-COMP  be-IPRF-3PL  ghost-NOM.PL.M

ma   káːkḁ  ˈbi-x-u      fanˈdaːzm-a     je=ˈrek-l-a                    mo           ˈmat
but  how    be-IPRF-3PL  ghost-NOM.PL.M  be.PRS.3SG=say.PFV-PTCP-SG.F   my.NOM.SG.F  mother.NOM.SG.F
                                         AUX.PRF(1) PRF(1)

‘Because there were ghosts! What? There were ghosts? said my mother.’ (Breu 2013. Acquaviva corpus. Excerpt from L’âne et les fantômes, sentences 12 and 13. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

The analysis of the Molise Slavic-Italian corpus shows an average of 22% tokens from Italian; see Figure 17.
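As a worked check, the contact rates reported in this section can be set against the two bands used in this chapter (0‒5% and 20‒35% contact words). The sketch below is illustrative only: the function `band` and the dictionary `CONTACT_RATES` are hypothetical names, and summing the Turkish, Greek, and “multiple” shares of the Thrace Romani corpus into a single contact rate is an assumption about how the overall figure is obtained.

```python
# Contact-token rates (percent of word-tokens) as reported in this section.
# Assumption: for Thrace Romani, the overall contact rate is the sum of the
# Turkish (15), Greek (4), and "multiple" (1) shares.
CONTACT_RATES = {
    "Thrace Romani-Turkish-Greek": 15 + 4 + 1,
    "Finnish Romani-Finnish": 35,
    "Molise Slavic-Italian": 22,
}


def band(rate_percent):
    """Assign a contact rate to one of the two bands used in this chapter."""
    if 0 <= rate_percent <= 5:
        return "0-5% contact words"
    if 20 <= rate_percent <= 35:
        return "20-35% contact words"
    return "outside both bands"


if __name__ == "__main__":
    for corpus, rate in CONTACT_RATES.items():
        print(f"{corpus}: {rate}% -> {band(rate)}")
```

Under these assumptions, all three corpora presented in this section fall into the higher band, the Thrace Romani corpus sitting at its lower edge and the Finnish Romani corpus at its upper edge.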

3.2 Discussion

An overall quantitative corpus analysis based on the word-count enables us to distinguish two categories of bilingual and trilingual corpora, illustrated in Figure 18.

Figure 18: Distribution of word-tokens with respect to language for seven corpora

Some bilingual corpora are grouped together because they show less than 5% current-contact-language word-tokens, despite the fact that the speakers are balanced bilinguals. A finer-grained distinction can be made between the corpora which show close to zero tokens from the current-contact language and those which show close to 5% tokens. Speakers of these bilingual communities are producing a type of bilingual speech that can be thought of as targeting monolingual speech. This type of speech was illustrated with data from four corpora: the Ixcatec-Spanish corpus, the Balkan Slavic-Greek corpus, the Colloquial Upper Sorbian-German corpus, and the Burgenland Croatian-German corpus.

Another kind of language mixing can be found in corpora showing more than 20% and less than 35% current-contact-language tokens. This category was illustrated by the trilingual Romani-Turkish-Greek corpus, the Finnish Romani-Finnish corpus, and the Molise Slavic-Italian corpus.

The differences and the similarities in the overall rates of contact tokens in the various corpora allow for several preliminary observations. First, all these bilingual or trilingual corpora, with more or fewer contact words from the current-contact languages, are in accordance with the Asymmetry Principle, which predicts that “bilingual speech is characterized by asymmetry in terms of the participation of the languages concerned” (Myers-Scotton 2002: 9).

Second, research on codeswitching has shown that the higher the speaker’s competence, the more frequent and complex the language mixing will be (Treffers-Daller 1994; Poplack 2004). In the case of the seven corpora under study, the rate of contact words does not reflect the speakers’ competence in the contact language, but should probably be seen as an indicator of the community’s type of bilingual speech prior to the shifting process (see Chapter 9). Indeed, the last fluent speakers who produce high rates of contact words belong to communities that had intensive contact with the contact language, while the last fluent speakers who produce low rates of contact words belong to communities that had only occasional contact with the contact language.

It could be argued that the building of the corpora for severely endangered languages such as Balkan Slavic, spoken in Greece, and Ixcatec, spoken in Mexico, is based on unnatural situations, since the consultants have been asked to speak a language they would not have used otherwise. In this context, one could argue that the speakers will naturally avoid language mixing with the everyday languages. Nevertheless, data presented from two other severely endangered languages, Finnish Romani and Molise Slavic, offer counter-evidence to this analysis, in that the corpora show a great number of contact words or codeswitching. Instead of discarding this sort of corpus altogether for not being “natural enough”, and since there is no other way of working on these languages which are no longer spoken, we should ask ourselves what these low rates of tokens from the current-contact languages mean for the study of language contact. As discussed in Chapter 9, the type of contact, language attitudes, and the status of the two languages in contact also play a crucial role in the resulting language mixing.

Chapter 4

Borrowing or codeswitching?

4.1 Background

One of the main empirical problems with which language-contact specialists are confronted is the distinction between codeswitching insertions and borrowings (see among others Haugen 1950; Poplack and D. Sankoff 1984; Poplack, D. Sankoff, and Miller 1988). For most scholars, this distinction needs to be made even though, in practice, we lack solid definitional criteria. Codeswitching may be broadly defined as the alternation of languages within a conversation (Matras 2009: 101). This alternation is generally understood as meaningful at the discourse level:

In CS (code-switching), the contrast between one code and the other (for instance, one language and another) is meaningful, and can be interpreted by participants, as indexing (contextualizing) either some aspects of the situation (discourse-related switching), or some feature of the code-switching speaker (participant-related switching) (Auer 1998: 2).

But, alternating between one language and another may also be the unmarked way of speaking in a given community: “especially when switching is at length or frequent, neither language is the unmarked choice” (Myers-Scotton and Jake 2014 following Myers-Scotton 1993b). Auer (1998) distinguishes between codeswitching and this type of frequent, unmarked mixing, dubbed “language mixing”. Alternation between two or more languages may occur at several levels: when the alternation occurs at the boundary of a clause, codeswitching is known as “alternational”, covering what is sometimes also called “extrasentential” and “intersentential” codeswitching. Codeswitching can also be “insertional” (Muysken 2000), a type known in a number of studies as “intrasentential” (Myers-Scotton 1993a). Myers-Scotton (1993a) deﬁnes intrasentential codeswitching as containing at least one word from the Embedded Language (EL), in other terms the non-dominant language of the clause, and any number of words from the Matrix Language (ML), or the dominant language. A very similar phenomenon to insertional codeswitching is that of borrowing, broadly deﬁned as the transfer of sound and form-meaning units (Heine and Kuteva 2005). The diachronic aspect is generally a key feature for the deﬁnition of borrowing, summarized by Haspelmath as follows: [Borrowing] refers to a completed language change, a diachronic process that once started as an individual innovation but has been propagated throughout the speech community (Haspelmath 2009: 38).

Historical and comparative linguistic methods can help us identify borrowings which were introduced in past contact settings. However, when these words are shared with the current-contact language, it becomes difficult to decide whether these words should be treated as borrowings or as codeswitches. Matras (2009: 111) suggests that there is a borrowing‒codeswitching continuum rather than a clear-cut distinction. A variety of criteria would locate an item on this continuum, i.e., degree of speaker bilingualism (monolingual vs. bilingual), degree of item composition (utterance vs. single lexeme) and of functionality (stylistic vs. default use), unique character of the referent (lexical vs. para-lexical), operationality (core vocabulary vs. grammatical operations), regularity of the process (single vs. regular occurrence), and structural integration (non-integrated vs. integrated). While the continuum approach is closer to the complex empirical evidence, other scholars attempt to isolate the criteria that work best and offer a more clear-cut definition of borrowings and codeswitches. For example, Poplack and Dion (2012) suggest that all single-word tokens of a current-contact language should be treated as borrowings. This analysis is based on a rich bilingual corpus from Canada, in which single words always get structurally integrated, independent of their frequency and regularity. In contrast, the authors observe that multi-word tokens are never structurally integrated. Poplack and Dion’s definition of borrowing based on the degree of composition runs counter to Myers-Scotton’s approach (1993a), which considers single words as codeswitching insertions, with the potential of becoming borrowings as they get conventionalized.
Against this widespread understanding of borrowing as originating in codeswitching, Haspelmath observes that there is little evidence for the claim that bilinguals are needed in order for contact words to become borrowings and concludes that “lexical borrowing is not in any way dependent on code-switching” (Haspelmath 2009: 42).

In the sections that follow I will apply some of the criteria suggested in the literature on the borrowing/codeswitching distinction. The most solid criterion, namely the use of a word by monolingual speakers, does not apply to the corpora I am concerned with, in which the languages under study are no longer spoken by monolingual speakers. In this case, one must tackle other criteria, such as the degree of composition (Matras 2009) and the meta-linguistic comments that accompany contact words (Poplack 2004), examined in 4.2, the operationality (Matras 2009) by referring to word classes in 4.3, lexical semantic fields in 4.4, and the item’s regularity (Matras 2009) in 4.5. Also, Chapter 5 is dedicated to the study of integration strategies.

4.2 Degree of composition and flagging

In order to distinguish between codeswitch insertions and borrowings, one of the diagnostic features that may be examined is the degree of composition (Matras 2009: 111). Longer insertions are more likely codeswitches as opposed to short insertions, which are good candidates for the borrowing category. However, research on codeswitching has clearly shown that one-word tokens from the contact language are the most commonly found in bilingual corpora (see Backus 1992; Treffers-Daller 1994). In consequence, this criterion appears to be problematic.

In practice, to calculate the degree of composition, one needs to look at the length of the insertions from language A in a clause in which language B is numerically dominant. One can thus qualify the main type of contact-language insertions in a corpus, i.e., short insertions, for one-word, two-word or three-word insertions, the latter known as “chunks” (Backus 1999) or “internal EL islands” (Myers-Scotton 2002: 149); and longer insertions, above four words.

For linguists willing to analyse corpora with respect to the codeswitching types “alternational” and “insertional”, the study of the degree of composition should be combined with the study of the type of morphemes and their syntactic relation to the clause. On the one hand, “insertional” codeswitching generally applies to single words provided they are syntactically dependent on the clause they are inserted in, the so-called “nested insertions A B A” (Muysken 2000: 230). On the other hand, “alternational” codeswitching covers both utterance-long, syntactically independent insertions and insertions that occur at the periphery of a clause (Muysken 2000). This terminology, however, is problematic, as studies on borrowability have shown that discourse markers, typically occurring at the periphery of a clause, are often borrowed (Matras 2009).
Therefore using the term “alternational” codeswitching for both long utterances and short insertions occurring at the periphery of the clause, such as discourse markers, appears highly confusing. For that reason, the choice is made in this book to treat separately the degree of composition and the word class of the short insertions. Another criterion that has been discussed in the literature on codeswitching is the so-called ﬂagging of contact language words. Bilingual speakers may ﬂag a contact word: by pauses, hesitation phenomena, repetition, metalinguistic commentary and other means of drawing attention to the switch, with the result of interrupting the smooth production of the sentence at the switch point (Poplack and Sankoﬀ 1988: 1176).

Switch flagging can be very common in some bilingual communities and not in others. Poplack notes, for example, that flagging of codeswitching material is common in the Ottawa corpus of French-English (Poplack 1985) and in the Finnish-English corpus (Poplack, Wheeler, and Westwood 1987), whereas codeswitching is mainly smooth in the Puerto Rican corpus. Indeed, several studies stress the fluidity with which switches are carried out, and experimental studies point out the short response times in language switching (Myers-Scotton and Jake 2013b: 2). Manfredi, Simeone-Senelle, and Tosco (in press), based on the analysis of a corpus of 10 Afro-Asiatic languages,¹ show that intersentential switches are part of distinct prosodic units. Nevertheless, as the authors admit when discussing an example of Arabic insertion into a Beja narrative, there is no reason to believe that the switch is treated differently than a monolingual clause would have been treated. The analysis of the Afro-Asiatic corpora also shows that intrasentential switches are not prosodically distinct from monolingual items:

Actually, intrasentential CSW [codeswitching] can also occur within monolingual Intonation Units. This is typically the case of embedded discourse markers that occur as prosodically independent words. (Manfredi, Simeone-Senelle, and Tosco in press).

For single words, the authors conclude that no prosodic boundary is found, although a correlation with an emphatic pitch raise may occur, provided no other word is focused within the same clause. In what follows, I examine some of the corpora under study to illustrate the two types of language mixing which were established with the word-count method, presented in Chapter 3, namely the corpora with less than 5% contact words and the corpora with 20‒35% contact words.
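The degree of composition discussed earlier in this section amounts to measuring how many consecutive word-tokens of an insertion come from a language other than the numerically dominant one. A minimal sketch follows, assuming that each clause is available as a list of per-token language tags; the function names are hypothetical. Two simplifications should be noted: adjacent tokens from two different non-dominant languages are merged into a single insertion, and insertions of four words are binned together with longer ones, since the text only names one-, two-, and three-word insertions as short.

```python
from collections import Counter
from itertools import groupby


def insertion_lengths(clause_tags, dominant):
    """Lengths of maximal runs of non-dominant tokens within one clause."""
    return [
        len(list(run))
        for is_insertion, run in groupby(clause_tags, key=lambda tag: tag != dominant)
        if is_insertion
    ]


def length_profile(clauses, dominant):
    """Count insertions by length: '1', '2', '3' (short) and '4+' (longer)."""
    profile = Counter()
    for clause in clauses:
        for n in insertion_lengths(clause, dominant):
            profile[str(n) if n <= 3 else "4+"] += 1
    return profile


if __name__ == "__main__":
    # Toy clauses with invented tags (R = dominant language, T = contact language).
    toy = [["R", "T", "R", "T", "T", "R"], ["T", "T", "T", "T", "R"]]
    print(length_profile(toy, dominant="R"))
```

Dividing each count in the profile by the total number of insertions would give percentages of single-word, two-word, and three-word insertions comparable to those reported for the corpora below.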

4.2.1 The Balkan Slavic Nashta-Greek corpus

The analysis of the Balkan Slavic Nashta corpus based on the word-count showed that very few words are shared with the current-contact language, Greek. The analysis of the degree of composition shows that Greek tokens can be part of both short insertions and long stretches. A closer look at the longer switches, however, reveals that they are directed towards the researcher. These lengthy Greek insertions are often “flagged”, in that they are preceded by metalinguistic commentary. Below is an example of this type of longer switches, with the Greek segments in angle brackets:

¹ Available online at http://corpafroas.huma-num.fr/Archives

(9) Balkan Slavic Nashta corpus < Slavic (in plain), Greek (in bold), codeswitching to Greek (in angle brackets < >) Female ﬂuent speaker: aˈla uˈno saˈxat ta a ˈtʃiniʃe ˈona a ˈtʃeʃʃe ‘But at that moment, she was carding the wool .’ Female semi-speaker: ‘’ Female ﬂuent speaker: ‘’ Researcher (in Greek): ne kala ‘Yes, that’s OK.’ (Adamou 2013a, Le petit Dimitro, sentence 11. Accessed online at http://lacito.vjf.cnrs.fr/pangloss) In some cases Balkan Slavic Nashta speakers may repeat a token which has initially been enunciated in either of the two languages. In the ﬁrst example, the occurrence in Greek indicates that the word in Slavic had been partly forgotten but that its use is reactivated; see (10a). In the second example, the Greek insertion is the translation of the immediately preceding Slavic sentence. The use of Greek in this case could be due to the fact that it is a quotation; see (10b). Balkan Slavic Nashta corpus < Slavic (in plain), codeswitching to Greek (in angle brackets < >) (10)

a.

ˈleʃtʲæ ‘, lentils.’ (Adamou 2013a, Le petit Dimitro, sentence 6. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

b.

ʃo ˈtolko zaboˈva [. . .] ‘What took you so long? ’ (Adamou 2013a. Le couple qui se disputait, sentence 22. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

Other than these Greek codeswitching insertions, most words which are shared with Greek monolinguals and prevail in in-group conversations are short, single words or two-word chunks. In contrast, the younger semi-speaker is producing several codeswitches which are ﬂagged; see an example in (11).

Balkan Slavic Nashta corpus < Slavic (in plain), codeswitching to Greek (in angle brackets < >) (11)

da ti da ne sa bos ‘For you to so you won’t be scared.’ (Adamou 2013a. La ﬁlle et le fantôme, sentence 4. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

An analysis of the pauses shows that 32 out of 71 Greek words are preceded by two-second-long pauses, but this is also the case for the Turkish-origin words (N = 20 out of 47), even though Turkish is no longer a contact language. It is thus not possible to consider the pauses preceding the Greek words as indicators of flagging.

4.2.2 The Ixcatec-Spanish corpus

The word-count in Chapter 3 showed that the Ixcatec-Spanish corpus contains few words from the current-contact language, Spanish. The analysis of the degree of composition shows that the last Ixcatec speakers produce mainly short, single-word insertions. Two-word regular expressions from Spanish are also common, better understood as “tags”, in that they retain their syntactic independence, e.g., ni modo ‘no way’, shown in (12), aj karamba ‘oh wow!’, ave maria literally ‘Ave Maria’ but best translated in English as ‘oh my God!’. The Ixcatec speakers produce practically no long switches to Spanish during the recording sessions.

Contemporary Ixcatec corpus < Ixcatec (in plain), codeswitching to Spanish (in angle brackets < >) (12)

ʔméé same no way ‘Right, !’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php)
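The reasoning behind the pause analysis is a comparison of proportions: pauses count as flagging only if they precede contact words clearly more often than they precede a baseline set of words. As a worked illustration, the counts reported for the Balkan Slavic Nashta corpus at the end of section 4.2.1 (32 of 71 Greek words, 20 of 47 Turkish-origin words) can be converted into rates. The names `PAUSE_COUNTS` and `pause_rate` are hypothetical, and no significance test is implied.

```python
from fractions import Fraction

# Counts reported for the Balkan Slavic Nashta corpus:
# (words preceded by a two-second-long pause, total words of that origin).
PAUSE_COUNTS = {
    "Greek words": (32, 71),
    "Turkish-origin words": (20, 47),
}


def pause_rate(preceded, total):
    """Exact proportion of words that follow a pause."""
    return Fraction(preceded, total)


if __name__ == "__main__":
    for label, (preceded, total) in PAUSE_COUNTS.items():
        rate = float(pause_rate(preceded, total))
        print(f"{label}: {preceded}/{total} = {rate:.1%}")
```

The two rates come out at roughly 45% and 43%, a gap of under three percentage points, which is consistent with the conclusion that pauses before Greek words cannot be read as flagging.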

In the Ixcatec-Spanish corpus, pauses, hesitations, repetitions and metalinguistic comments are not exclusively related to the introduction of contact language material. For example, the Ixcatec expression ndrí ʔmike ‘what is it called?’ is frequently used as a ﬁller, shown in (13), but its use is not restricted to codeswitches as it precedes 26 words in total of which only 5 are Spanish.

(13) Contemporary Ixcatec corpus

ʔmée  méendi     kú-kaβa-ri       hãã³  la    ndrí  ʔmi-ke
same  like_this  PFV-receive-HON  yes   COMP  what  call-ITER

‘You received, yes, what’s the name. . .’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php/)

A careful look at the pauses in the Ixcatec-Spanish corpus also shows that 64% of Spanish nouns (N = 89 out of 140) follow a two-second-long pause, but that 56% of non-borrowed nouns (N = 245 out of 453) also follow a two-second-long pause.

4.2.3 The Thrace Romani-Turkish-Greek corpus

As discussed in Chapter 3, the Thrace Romani-Turkish-Greek corpus shows high rates of words shared with Turkish, and to a smaller extent with Greek. The analysis of the degree of composition shows that 74% of the Turkish words are single-word tokens, as are 88% of the Greek words; see Figure 19. In the Thrace Romani-Turkish-Greek corpus, alternational switching to Turkish mainly occurs in interactions with in-group members who have shifted to Turkish. For example, during a conversation that is taking place in the local Turkish-Romani variety, the child calls for the mother in Turkish and the mother responds in Turkish to the child since Romani is no longer transmitted to him,

Figure 19: The Thrace Romani-Turkish-Greek corpus: Length of Turkish and Greek word-tokens

see (14a) and (14b). The use of Turkish here is clearly participant-related as it is adapted to the interlocutor’s competence and stems from the family’s preference for interrupting the transmission of Romani to the children. After a brief interaction in Turkish, the conversation resumes in the Thrace Romani variety by one of the participants who addresses the mother of the child, see (14c) (note, however, the use of the Turkish verb with the Turkish TMA markers). The mother replies by using two Turkish verbs, with Turkish verb morphology. If the overall community pattern of mixing with Turkish is not taken into consideration, the mother’s clause can be analysed as a Turkish clause. But the corpus analysis, as illustrated by the preceding clause made with Romani material and one Turkish verb, indicates that the mother’s reply in (14d) is just part of an unmarked way to speak Romani characterized by the use of Turkish words from speciﬁc word classes (see section 4.3.3) following speciﬁc integration patterns (see Chapter 5). Thrace Romani corpus < Romani (in plain), Turkish (in bold) (14)

a. Child:
   ane
   ‘Mom!’

b. Mother:
   tʃok gyzel
   ‘Very nice!’

c. Female speaker:
   tʃe   but   ʃukar  jazijor
   INTJ  very  nice   write.PROG.3SG
   ‘Hey, he writes nicely.’

d. Mother:
   okuijor        jazijor
   read.PROG.3SG  write.PROG.3SG
   ‘He reads, he writes. . .’ (Adamou, unpublished corpus)

Participant-related codeswitching to Greek occurs with in-group members who have shifted to Greek or in the presence of outsiders, such as the researcher. In example (15) it can be seen that the Romani clause is repeated in Greek, followed by another short Greek switch insertion.

Thrace Romani corpus < Romani (in plain), codeswitching to Greek (in angle brackets < >) (15)

ame tʃore sam ‘We are poor. . tʃe Well, ’ (Adamou, unpublished corpus)

4.2.4 The Finnish Romani-Finnish corpus

Another corpus with high rates of contact words is the Finnish Romani-Finnish corpus. In this corpus, two types of clauses can be distinguished (Granqvist 2000): clauses in which Romani is the numerically dominant language, which account for up to 80% of the clauses, and clauses in which Finnish is the numerically dominant language, which represent 20% of the clauses (Adamou and Granqvist 2014); see Figure 20. The use of a high number of Finnish-dominant clauses, which are not related to the interlocutor’s competence, differs from the mixing pattern of the Thrace Romani-Turkish-Greek corpus, which is participant-related.

A closer look at the clauses with Finnish Romani as the dominant language shows a majority of Finnish single-word insertions (75%), followed by 16% two-word tokens and 6% three-word tokens; see Figure 21. This is also the case in clauses with Finnish as the dominant language, with 65% single-word insertions; see Figure 21.

To summarize, Finnish Romani speakers show two types of behaviour: alternation between Finnish Romani-dominant and Finnish-dominant clauses, and short insertions from Finnish. The latter need to be further analysed in order to decide whether they are more borrowing- or codeswitching-like.
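The notion of a numerically dominant language used above can be made concrete with a short sketch: each clause is assigned to the language that contributes most of its word-tokens, and the clauses are then counted per language. The function names and the toy clauses are hypothetical, and treating ties as undecided is an assumption, since the text does not say how clauses with equal numbers of tokens from each language were handled.

```python
from collections import Counter


def dominant_language(clause_tags):
    """Return the most frequent language tag, or None for a tie or empty clause."""
    top = Counter(clause_tags).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # assumption: ties are left undecided
    return top[0][0]


def dominance_shares(clauses):
    """Percentage of decided clauses that are dominated by each language."""
    labels = [dominant_language(clause) for clause in clauses]
    decided = [label for label in labels if label is not None]
    if not decided:
        return {}
    return {lang: 100 * n / len(decided) for lang, n in Counter(decided).items()}


if __name__ == "__main__":
    # Toy clauses with invented tags, not taken from the corpus.
    toy = [
        ["Romani", "Romani", "Finnish"],
        ["Romani", "Romani"],
        ["Romani", "Finnish", "Romani", "Romani"],
        ["Romani"],
        ["Finnish", "Finnish", "Romani"],
    ]
    print(dominance_shares(toy))
```

Once the dominant language of each clause is known, the tokens from the other language can be grouped into insertions and measured by length, which is how single-word, two-word, and three-word insertions are distinguished.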

Figure 20: The Finnish Romani-Finnish corpus: Distribution of word-tokens per language in the Finnish-dominant and Romani-dominant clauses (adapted from Adamou and Granqvist 2014)

Figure 21: The Finnish Romani-Finnish corpus: Length of Finnish word-tokens in Romani-dominant clauses and of Romani tokens in Finnish-dominant clauses (adapted from Adamou and Granqvist 2014)

4.3 Word classes

Several scholars have observed that some types of morphemes or word classes are more frequently encountered in codeswitching in the speech of bilinguals or as established borrowings as a result of past-contact settings. An attempt to account for the difference between morpheme types in the speech of bilinguals comes from the 4-M model elaborated in Myers-Scotton (2002) and subsequent publications, i.e., Myers-Scotton and Jake (2009). The 4-M model distinguishes two types of morphemes: the conceptually-activated morphemes, or “early system morphemes”, and the structurally-assigned morphemes, or “late system morphemes”.

The “early system morphemes” convey semantic and pragmatic information. They are salient in the mental lexicon along with their content morpheme heads. Examples of early system morphemes include determiners and derivational prepositions and particles in phrasal verbs, certain affixes, i.e., derivational and plural markers in noun phrases and some tense and aspect markers in verbal clauses, as well as subordinating and coordinating conjunctions.

The “late system morphemes” are not salient in production until the level of the “Formulator”. The “Formulator” is the production mechanism that puts together the larger constituents that indicate the structure of the clause. There

are two types of late system morphemes: “bridges” and “outsiders”. “Bridges” are the elements that join together two NPs and complementizers that join together two clauses. “Outsiders” include the agreement markers and several case markers. According to the 4-M model, late system morphemes are rarely borrowed because they carry little content and are not accessed in production until the level of the Formulator. In a typological and diachronic perspective, various borrowability hierarchies were elaborated in order to account for the frequency with which some word classes or elements within a word class would be borrowed. Borrowability is more speciﬁcally deﬁned as follows: Borrowability is taken to mean the likelihood of a structural category to be aﬀected by contact-induced change [. . .] (whether matter- or pattern-replication). (Matras 2007: 31).

According to Matras: The degree of borrowing is related to the intensity of exposure to the contact language. The outcome of language contact is a product of structural similarities and diﬀerences (congruence) among the languages concerned. Borrowability is a product of inherent semantic-pragmatic or structural properties of the aﬀected categories. (Matras 2007: 34).

The most recent borrowability hierarchies were elaborated through two samples, created within two international projects hosted by the University of Manchester: a large Romani sample (Elšík and Matras 2006), which offers the possibility of observing contact between various languages of the same group and a variety of other languages, mainly Indo-European, and a diversified cross-linguistic sample of 27 languages (Matras and Sakel 2007b). Both studies show that nouns are the items most frequently borrowed, followed by verbs, discourse markers, adjectives, interjections, adverbs, various particles, numerals, pronouns, derivational affixes, and inflectional affixes. The items in this hierarchy are highly heterogeneous from a structural point of view: one finds semantic, syntactic, morphological, and phonological features. This grouping is consistent with the general framework, which favours semantic-pragmatic criteria over all others, and relates them to mental processing.

These studies clearly show that derivational affixes and inflectional affixes are low in the hierarchy. Indeed, most studies in contact linguistics agree that bound morphemes are less prone to borrowing than free morphemes (see Weinreich 1953; Moravcsik 1978; Wilkins 1996). Likewise, for Field (2002) function words are more easily borrowed than both agglutinative affixes and fusional affixes. It is also widely accepted that derivational morphology, i.e., bound morphemes that carry lexical information or change the lexeme’s class, is more likely to be borrowed than inflectional morphology, i.e., bound morphemes with grammatical information (see Moravcsik 1978; Thomason and Kaufman 1988; Matras 2007). To account for morphological borrowing, Seifart (2012: 475) proposes the Principle of Morphosyntactic Subsystem Integrity, according to which borrowed morphemes are morphosyntactically interrelated. Muysken (2012: 483) suggests that borrowing of suffixes relies on the interplay of four principles:

a. Optimize L2 principles
b. Optimize universal combinatory principles
c. Optimize L1 principles
d. Optimize L1-L2 correspondences

In order to examine the types of morphemes that come from the contact language in the corpora under study, the analysis is based on word classes. This term applies to major classes such as nouns and verbs, and minor classes such as adverbs, adjectives, particles, pronouns, determiners, and conjunctions. Although word classes are language-specific categories, they allow for some cross-linguistic comparability (see Evans 2000; Haspelmath 2012). While nouns, verbs, adjectives and adverbs are relatively straightforward, particles, pronouns, determiners and conjunctions are loose categories which include a variety of elements. Of course, typological differences have to be taken into consideration in the interpretation of the results: for example, under “determiners” some languages may show articles, definite and indefinite, while others code definiteness with bare nouns. Such typological differences necessarily create differences in the proportions of the category of determiners that come from the contact language and should be discussed in detail.

4.3.1 The Ixcatec-Spanish corpus

As shown in Chapter 3, the contemporary Ixcatec-Spanish corpus has an overall rate of contact word-tokens of less than 5%. The analysis of the types of morphemes reveals a very restricted number of word classes from Spanish, namely 23% nouns and a few verbs; see Figure 22. Even if individual tokens from other word classes are occasionally used, their presence is not in any way significant with respect to the overall number of tokens within their word class: we noted the use of only a few temporal adverbs (e.g., ante ‘before’) and locative nouns (e.g., ladu ‘side’), and only one function word, the comitative kon ‘with’, which is in variable use with its more frequent Ixcatec counterpart ku (N = 50). It can also be noted that Ixcatec speakers use no answer particles from Spanish, but use instead the native hãã³ ‘yes’ (N = 31) and ʔíana ‘no’ (N = 53).

Figure 22: The Ixcatec-Spanish contemporary corpus: Distribution of nouns and verbs per language2

2 This analysis does not include the various predicates, i.e. existential/attributive, locative, and possessive.

4.3.2 The Balkan Slavic Nashta-Greek corpus

The Balkan Slavic Nashta corpus, another corpus with 0‒5% contact words, also shows few word classes, namely nouns and a small number of verbs. Tokens from other word classes may occur but in a peripheral way. For example, the conjunction i ‘or’ of Greek origin occurs only once, and the adversative ala ‘but’ occurs 20 times. The analysis also brings out a rate of 33% nouns from Greek; see Figure 23. This finding helps shed light on the speakers’ perception of their language as a “mixed language”, since content words are the most readily perceived.

Figure 23: The Balkan Slavic Nashta-Greek corpus: Distribution of nouns per language

4.3.3 The Romani corpora

Contrary to the corpora with 0‒5% contact words and limited word classes, the corpora with 20‒35% contact words include numerous tokens from several word classes. The Thrace Romani-Turkish-Greek corpus is a good illustration of the corpora with a total of 20‒35% contact words. In the Thrace Romani corpus we count 27% Turkish nouns, 11% Modern Greek nouns, and 2% tokens which could come from either contact language. This means that the corpus shows 40% current-contact language nouns while native nouns and nouns of past-contact languages make up 60% of all the nouns in the corpus. The Thrace Romani corpus also shows a relatively high number of verb tokens from the current-contact languages, namely 12% Turkish verbs and only 2% Greek verbs. Adverbs are frequently Turkish (29%) or Greek (6%). We also note 19% Turkish and 10% Greek adjectives. Last, the Thrace Romani-Turkish-Greek corpus shows a mixture of conjunctions with 67% native Romani conjunctions, 21% Turkish, and 12% Greek. The word class which is practically not affected by contact in the Thrace Romani corpus is that of pronouns; i.e., 99% of the pronouns in the corpus are “Romani”. Typologically, the fact that Romani, Turkish, and Greek are pro-drop languages may have an effect on this result. In summary, what characterizes a Thrace Romani utterance is the use of Turkish nouns, verbs, adjectives, and conjunctions and the lack of some Turkish word classes such as pronouns and determiners; see Figure 24.

The analysis of the Finnish Romani-Finnish corpus shows that for Finnish Romani-dominant clauses, Finnish is the original language for 72% of conjunctions, 27% of particles, 22% of adverbs, and 13% of nouns. As opposed to the Thrace Romani-Turkish-Greek corpus, which shows practically no Turkish pronouns, 5% of pronouns in Finnish Romani clauses stem from Finnish (i.e., person, interrogative, demonstrative, reflexive, possessive, and indefinite pronouns). Moreover, considering all verb tokens, 10% are Finnish, similar to the rate of Turkish verbs in the Thrace Romani-Turkish-Greek corpus. See Figure 25 for the distribution of the word classes in the Finnish Romani-Finnish corpus.
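The noun figures reported above for the Thrace Romani corpus can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the dictionary keys and variable names are chosen here for readability and are not part of the original analysis.

```python
# Percentages of noun tokens in the Thrace Romani corpus, as reported in the text.
# The labels are illustrative names, not categories taken from the corpus annotation.
contact_noun_shares = {
    "Turkish": 27,
    "Modern Greek": 11,
    "Turkish or Greek (either contact language)": 2,
}

# Nouns from the current-contact languages taken together.
current_contact_total = sum(contact_noun_shares.values())

# Everything else: native nouns and nouns from past-contact languages.
native_and_past_contact = 100 - current_contact_total

print(f"Current-contact nouns: {current_contact_total}%")
print(f"Native and past-contact nouns: {native_and_past_contact}%")
```

The two totals correspond to the 40% and 60% figures given for the nouns of this corpus.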

Figure 24: Thrace Romani-Turkish-Greek corpus: Distribution of tokens per language and word class

Figure 25: Finnish Romani-Finnish corpus: Distribution of tokens per language and word class (adapted from Adamou and Granqvist 2014)

4.4 Lexical semantic fields

As illustrated in 4.3, all the corpora under study, independent of the rates of contact material, show high rates of contact nouns inserted in the speech of another numerically-dominant language. The ease with which nouns from a current-contact language are produced is a well-known phenomenon established through several studies on naturally occurring codeswitching (Poplack 1980; Backus 1992; Myers-Scotton 1993a; Treffers-Daller 1994). Studies on borrowing have also shown that nouns are among the most highly borrowable word classes (Matras 2009).

Several explanations have been put forward to explain this phenomenon. Matras (2009) isolates various criteria, such as the “utilitarian” motivation, the need to reduce the processing load, and the speaker’s weak or strong control. It is noted that the “proximity”, “familiarity”, and “frequency” principles discourage lexical borrowing. Van Hout and Muysken (1994) also suggest that frequency in the recipient language may operate as an inhibiting factor for borrowing. The resulting idea is that some nouns are more likely to be borrowed either because they are more adapted to a particular social activity, or because they occur in intense communicative negotiation and thus block the speaker’s repertoire selection mechanism (Matras 2009). Also, as Myers-Scotton and Jake (2013b) observe, nouns can be used without much cognitive cost since the selection needs only to be done at the semantic-pragmatic level.

The idea of gap filling in the process of language shift has also been very influential in accounting for the ease with which nouns are integrated into bilingual speech (Grosjean 1982; Myers-Scotton 1993a). Borrowing is of course not the only option for the speakers of a language upon the arrival of a new concept or object. Processes of lexical creation are available in all languages and can be put into action. Interestingly though, borrowing is a very popular strategy. Brown (1999) compares how 77 objects and concepts which Native Americans were exposed to after the arrival of the Europeans were expressed in 292 Native American languages. For example, for the word ‘coffee’, in Brown’s sample, 81% of the languages borrowed the word whereas the remaining 19% created native words through compounds, derivation, or by using simple words whose meaning was extended or shifted.

In order to study the lexical semantic fields which are most affected by language contact, I compare the results of the free-speech corpora to the results of the typological project Loanwords in the world’s languages (WOLD) (Haspelmath and Tadmor 2009b). WOLD is based on a questionnaire of 1,000‒2,000 entries conducted in 41 languages by various language specialists.3 The results led to the establishment of a typology of languages based on their borrowing rate: languages with a borrowing rate of over 50% were identified as “very high borrowers”, followed by “high borrowers” for a rate of 25‒50%, then “average borrowers” for 10‒25%, and “low borrowers” for a rate of less than 10% (Tadmor 2009: 57).

In this chapter, the list of the contact-nouns in three of the corpora under study is structured with respect to the WOLD categories. A comparison with the borrowed score and the age score in WOLD is established. Note that the borrowed score in WOLD ranges from 0 to 0.98, but that the lowest age score is 0.67. One must note, however, that contrary to the analysis in the present book, which is restricted to words from current-contact languages, WOLD considers loanwords both from current- and past-contact languages.
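The borrower typology reported from Tadmor (2009: 57) amounts to a simple threshold rule, which can be sketched as follows. The function name is hypothetical, and the handling of the exact boundary values (10%, 25%, 50%) is an assumption made here, since the ranges as cited share their end points.

```python
def borrower_type(borrowing_rate: float) -> str:
    """Return the borrower label for a borrowing rate given in percent.

    Thresholds follow the ranges cited from Tadmor (2009: 57).
    Assumption: a boundary value (10, 25 or 50) is assigned to the
    range that starts at that value, except 50, which counts as "high"
    because "very high" is described as "over 50%".
    """
    if not 0 <= borrowing_rate <= 100:
        raise ValueError("borrowing rate must be between 0 and 100 percent")
    if borrowing_rate > 50:
        return "very high borrower"
    if borrowing_rate >= 25:
        return "high borrower"
    if borrowing_rate >= 10:
        return "average borrower"
    return "low borrower"


# Illustrative rates only; these are not values taken from WOLD.
for rate in (62.0, 30.0, 12.0, 4.0):
    print(rate, borrower_type(rate))
```

The rates in the loop are invented for illustration; the function only restates the four ranges given in the text.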

3 WOLD is available online at http://wold.clld.org/

4.4.1 The Ixcatec-Spanish corpora

As Table 1 shows, in the Ixcatec corpus, Spanish nouns belong to a variety of semantic fields, i.e., time relations, food and drink, the religious domain, social and political relations such as health institutions or other administration, location, modern world culture, basic actions and technology, and clothing.

Table 1: The Ixcatec-Spanish corpora: Lexical semantic fields of Spanish nouns and borrowed score in WOLD

Semantic field | Word | Borrowed score in WOLD | Age score in WOLD
Time relations | úrá ‘hour’ | 0.76 | 0.81
 | sádu ‘Saturday’ | 0.66 | 0.83
 | semána ‘week’ | 0.63 | 0.79
 | miércole ‘Wednesday’ | 0.54 | 0.82
 | lúne ‘Monday’ | 0.54 | 0.81
 | dumíngu ‘Sunday’ | 0.49 | 0.83
 | féchá ‘date’, moméntú ‘moment’, enérú ‘January’, febrérú ‘February’, mayú ‘May’, agóstó ‘August’, otúbré ‘October’, disyémbré ‘December’ | |
Food and drink | kaɸé ‘coffee’ | 0.86 | 0.77
 | súká ‘sugar’ | 0.79 | 0.82
 | arós ‘rice’ | 0.62 | 0.82
 | trígu ‘wheat’ | 0.52 | 0.86
 | mantéká ‘fat, butter’ | 0.49 | 0.83
 | kanéla ‘cinnamon’, kumínu ‘cumin’, klávu ‘clavo’, hóli ‘garlic sauce’, áhu ‘garlic’, piña ‘pineapple’, mantsána ‘apple’, rekávdu ‘broth’, móle ‘mole sauce’, kumída ‘food’, alkol ‘alcohol’, tekíla ‘tequila’, refréscu ‘soft-drink’, meskál ‘mescal’, borráchú ‘drunkard’ | |
Sense and perception | savór ‘the taste’, gústú ‘the taste’ | |
Modern world culture | káru ‘car’ | 0.79 | 0.73
 | pelíkula ‘film’ | 0.74 | 0.71
 | bisikléta ‘bicycle’ | 0.62 | 0.74
 | mikróɸono ‘microphone’, mitʃúdu ‘mop’, ehersísio ‘exercise’, túβo ‘tube’, kartuá ‘cardboard’, fabór ‘favour’ | |
Location, buildings | oditorio ‘auditorium’, sítiu ‘site’ | |
Social and political relations | rréy ‘king’ | 0.53 | 0.81
 | kwa-enferméra ‘nurse’ | 0.45 | 0.78
 | oportunidá ‘opportunity, social aid’, distrítú ‘district’, estádú ‘state’, empleádú ‘employee’, rrewñón ‘meeting’, awtoridá ‘authority’, palásyú ‘palace’, trónó ‘throne’, podér ‘power’, (kwa-)doktóra ‘female doctor’, estampádú ‘stamped’, mósó ‘day-worker’, ɸaéna ‘task’ | |
Kinship | família ‘family’ | 0.42 | 0.84
 | byúdú ‘widower’ | 0.21 | 0.83
 | subrínu ‘nephew’ | 0.19 | 0.84
 | komádre, kumáre, kumaréɲa padrástro, padrino ‘godfather’ | |
Animals | u-búrrú ‘donkey’ | 0.63 | 0.83
Warfare | soldádú ‘soldier’ | 0.58 | 0.79
 | enemígú ‘enemy’ | 0.45 | 0.83
 | gwardyénté ‘guard’ | 0.43 | 0.76
 | koronél ‘colonel’, jenerál ‘army general’, rrebolusyonáryó ‘revolutionary’, komisyón ‘committee’ | |
Emotions | karíñú ‘tenderness’, marabíyá ‘marvel’ | |
Agriculture and vegetation | kosétʃa ‘harvest’ | 0.27 | 0.80
 | kriójá ‘Creole’, tsju-rrósá ‘rose’, salvariál ‘salvarial’, mulínu ‘mill’, estáblú ‘barn’, palénke ‘palenque’ | |
Possession | tjénda ‘(the) shop’ | 0.70 | 0.80
 | sentáβu ‘cent’, makílá ‘weight measure’, pésú ‘weight’, gástú ‘expense’ | |
Basic actions and technology | morál ‘basket’ | 0.42 | 0.81
 | trabáju ‘(the) work’ | 0.36 | 0.85
 | toné ‘barrel’, asyéntú ‘seat’, kárgú ‘charge’, taréya ‘task’, lata ‘tin’ | |
Emotions and values | Ɂanímà ‘soul’ | 0.40 | 0.89
Religion and beliefs | krúsi ‘cross’ | 0.50 | 0.78
 | grásia ‘graces’, jesu-krístú ‘Jesus Christ’, mísèè ‘mass’, rrusáryo ‘rosary’, panteon ‘cemetery’, rrelíkyá ‘relics’ | |
Speech and language | nómbré ‘name’ | 0.09 | 0.88
 | kwéntú ‘tale’ | |
Cognition | nesesidá ‘necessity’ | |
Quantity | duséná ‘dozen’ | |
Spatial relations | lugá ‘place’ | 0.25 | 0.85
 | ládu ‘side’ | |
Clothing | sumbréru ‘hat’ | 0.49 | 0.80
 | panítu ‘scarf’ | |
Miscellaneous | uñón ‘union’, kornetéró ‘bugler’, komformidá ‘conformity’, rretrátú ‘portrait’, káydá ‘fall’, sotána ‘cassock’, hakontesimyéntó ‘event’, fwérsá ‘strength’, bárá ‘stick’, tifu ‘typhus’ | |

32 Spanish nouns appear in the WOLD database and, as Table 1 shows, Ixcatec speakers use Spanish nouns which have an average borrowing score (mean score 0.51) and a relatively high age score (mean score 0.81). These results indicate that the Spanish nouns in Ixcatec are more likely borrowings than codeswitching insertions.

4.4.2 The Balkan Slavic-Greek corpora

The Balkan Slavic Nashta corpus shows a variety of nouns in the lexical domains which are known to be affected by contact; see Table 2. However, only 18 meanings are also found in the WOLD database. The comparison with the WOLD scores shows that borrowed scores are average (mean score 0.51), while age scores are relatively high (mean score 0.81), indicating that these words are more likely borrowings than codeswitching insertions.

Table 2: The Balkan Slavic Nashta corpus: Lexical semantic fields of Greek nouns and borrowed score in WOLD

Semantic fields | Word | Borrowed score in WOLD | Age score in WOLD
Food and drink | pçato ‘plate’ | 0.54 | 0.76
 | tindʒir ‘pot’ | 0.33 | 0.84
 | skurdar ‘garlic mixture’, liparide ‘kind of fish’, kulur ‘round loaf’, laŋɟide ‘fritters’ | |
Animals | kotopulo ‘chicken’ | 0.37 | 0.89
 | kukuʎ ‘cocoon’ | |
Physical world | mura ‘mulberry tree’, kalamɲe ‘rose branches’ | |
Basic actions and technology | putir ‘glass’ | 0.62 | 0.80
 | kanistro, paner ‘basket’ | 0.42 | 0.81
House | lamba ‘lamp’ | 0.65 | 0.78
 | kandil ‘candle’ | 0.63 | 0.83
 | porta ‘door’ | 0.23 | 0.87
 | skala ‘ladder’ | 0.23 | 0.81
 | kalmotie ‘rose tree mat’, disko ‘tray’ | |
Time | april ‘April’, avɣusto ‘August’ | |
Clothing | fustan ‘woman’s dress’ | 0.47 | 0.82
 | mandil ‘headband’ | 0.37 | 0.78
 | peto ‘lapel’, sakaki ‘jacket’, kordjele ‘ribbons’ | |
Religion and beliefs | prika ‘trousseau’, stefano ‘wedding crown’ | |
Cognition | sxoʎo ‘school’ | 0.69 | 0.80
 | daskal ‘teacher’ | 0.51 | 0.80
 | matima ‘class’ | |
Social and political relations | kratos ‘state’, ðimoprasia ‘bidding’, sklovac ‘little slave’ | |
Law | ðikastire ‘court’ | 0.55 | 0.78
Possession | foros ‘tax’ | 0.59 | 0.81
 | ispraktor ‘collector’ | |
Modern world culture | aftocinta ‘cars’ | 0.79 | 0.73
 | tinikie ‘tin jerry cans’ | 0.69 | 0.79
 | fortoti ‘shipper’, raceta ‘racket’ | |
Quantity | çilde ‘a thousand’ | 0.58 | 0.83
Location, buildings | cendra ‘clubs’ | 0.57 | 0.80
Spatial relations | bala ‘ball’ | |
Kinship | androjino ‘couple’ | |
Body | ijia ‘health’ | |
Sense and perception | kroto ‘crack’ | |

4.4.3 The Thrace Romani-Turkish-Greek corpus

The Thrace Romani corpus includes Turkish nouns from a variety of lexical semantic fields; see Table 3. We note that Turkish nouns are often dialectal and of Arabic and Persian origin, as is the case for a great proportion of the Turkish vocabulary in general. 42 meanings can be found in WOLD, but this number would have been higher had we considered the borrowings from past-contact languages. Interestingly, comparison of the Thrace Romani list of nouns with the WOLD scores reveals that several Turkish nouns used by Romani speakers have relatively low borrowed scores (mean score 0.40), but relatively high age scores (mean score 0.82). The age score thus indicates that these words are more likely borrowings than codeswitching insertions, a result in accordance with our knowledge of the sociolinguistic setting.

Table 3: The Thrace Romani corpus: Lexical semantic fields of Turkish nouns and borrowed score in WOLD

Semantic field | Word | Borrowed score in WOLD | Age score in WOLD
Food and drink | xurbuze ‘watermelons’ | |
Warfare and hunting | asker ‘soldier’ | 0.58 | 0.79
 | kələtʃi ‘sword’ | 0.45 | 0.83
 | dyʃmaja ‘enemies’ | 0.45 | 0.83
 | marebava ‘war’ | 0.34 | 0.79
 | bairako ‘flag’ | |
Animals | majmuna ‘monkey’ | 0.33 | 0.79
 | sivrisinek ‘mosquito’ | 0.22 | 0.82
 | tilkia ‘fox’ | 0.17 | 0.84
 | kirpia ‘porcupine’ | |
Physical world | dunjava, dunja ‘world’ | 0.40 | 0.84
 | kavako ‘tree’ | 0.25 | 0.79
 | balkano ‘mountain’ | 0.17 | 0.89
House | badʒava ‘chimney’ | 0.34 | 0.85
 | jastəkora ‘pillows’ | 0.26 | 0.83
 | kapia ‘door’ | 0.23 | 0.87
 | ev ‘house’ | 0.11 | 0.87
 | kofa ‘bucket’ | |
Time | saat ‘hour’ | 0.76 | 0.81
 | zaman, zamano, vakut ‘time’ | 0.54 | 0.83
 | gyn ‘day’ | 0.19 | 0.90
 | sene ‘years’ | 0.19 | 0.89
 | saba ‘morning’ | 0.14 | 0.81
 | akʃam ‘evening’ | 0.13 | 0.83
 | kəʃi ‘winter’ | 0.09 | 0.83
Clothing | menia ‘scarfs’ | 0.45 | 0.79
Emotions and values | jurek ‘heart’ | 0.40 | 0.89
Speech | kosuzi ‘word’ | 0.25 | 0.84
 | masale, meselava ‘tale’ | |
Social and political relations | patiʃaj ‘king’ | 0.53 | 0.81
 | dyvel ‘country’ | 0.48 | 0.83
 | mileti ‘people’ | 0.38 | 0.82
 | malava ‘neighbour’ | |
Possession | dukjano ‘shop’ | 0.70 | 0.80
 | pare ‘money’ | 0.54 | 0.87
Modern world culture | makine ‘machine’ | 0.89 | 0.78
 | tomafili, araba ‘car’ | 0.79 | 0.73
 | gazeta ‘newspaper’ | 0.68 | 0.77
 | apo ‘pill’ | 0.61 | 0.77
 | tiara ‘airplane’ | 0.59 | 0.74
 | astenava ‘hospital’ | 0.56 | 0.75
 | erzanava ‘pharmacy’, doktor ‘doctor’ | |
Body | iljatsa ‘medication’ | 0.42 | 0.84
 | nefesi ‘breath’ | |
Kinship | insan ‘man, person’ | 0.32 | 0.93
 | kari kotsa ‘couple’ | |
Basic actions and technology | bardako ‘glass’ | 0.62 | 0.80
 | tokmako ‘hammer’ | 0.45 | 0.77
 | sepetora ‘baskets’ | 0.42 | 0.81
 | dəmirdʒio ‘blacksmith’ | 0.29 | 0.82
Location, buildings | saraj ‘palace’, gjumiriko ‘customs house’ | |
Agriculture and vegetation | tarlava ‘field’ | 0.19 | 0.84
Miscellaneous | kəsgənəko ‘anger’, kokjie ‘perfume’, aberi ‘news’, giyndeliko ‘daily wage’, jabandʒio ‘stranger’, tʃiʃma ‘fountain’, sənər ‘nerve’, devi ‘giant’, ʃei ‘thing’, tʃaləʃma ‘work’ (nominalization) | |

Table 4 illustrates the Greek nouns in the Thrace Romani corpus. Notice that only 8 meanings are found in the WOLD database. The borrowed score of these words is low (mean score 0.43) and the mean age score is relatively high (0.79).

Table 4: The Thrace Romani corpus: Lexical semantic fields of Greek nouns and borrowed score in WOLD

Semantic field | Word | Borrowed score in WOLD | Age score in WOLD
Speech and language | korna ‘horn’ | 0.46 | 0.79
Clothing and grooming | mendili ‘handkerchief’ | 0.48 | 0.80
 | pantuloni ‘trousers’ | 0.56 | 0.81
House | avlia ‘yard’ | 0.23 | 0.80
Agriculture and vegetation | paγuri ‘gourd’ | 0.26 | 0.80
Warfare and hunting | tokso ‘arrow’ | 0.19 | 0.82
Modern world | tilefono ‘telephone’ | 0.73 | 0.71
Time | kero ‘time’ | 0.54 | 0.83
Miscellaneous | psixoloγos ‘psychologist’, eksetasis ‘exams’, ikonomies ‘savings’, ipalilos ‘employee’, plindirio ‘washing machine’, moromandila ‘swipes’, istoria ‘story’, balkoni ‘balcony’, palamakja ‘clapping’, anxos ‘stress’, kalamaki ‘straw’, xamos ‘disaster’, taftota ‘identity card’, leski ‘club’, mastika ‘chewing gum’, karnavalja ‘carnival’, kafeteria ‘coffee shop’, kursakja ‘carting’, neolea ‘youth’, tono ‘accent’, pazari ‘bazar, market’, δisekatomiria ‘millions’, kendro ‘center’, jimnasio ‘middle-school’, likio ‘high-school’, sinajermo ‘alarm’ | |

The fact that few of the Greek words in the Thrace Romani corpus are also found in WOLD indicates that they are more likely codeswitching insertions than borrowings. Indeed, several Greek words are related to Greek institutional realities and therefore belong to the culture-related rather than the core vocabulary; see for example in (16) the nouns psixoloγos ‘psychologist’ and eksetasis ‘exams’.
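The mean scores reported for the Greek nouns of Table 4 can be re-derived from the eight score pairs listed there. The sketch below is only a cross-check; the variable names are chosen here and the lists simply repeat the table values in order.

```python
from statistics import mean

# Borrowed and age scores of the eight Greek nouns in Table 4
# whose meanings are found in WOLD, in the order of the table.
borrowed_scores = [0.46, 0.48, 0.56, 0.23, 0.26, 0.19, 0.73, 0.54]
age_scores = [0.79, 0.80, 0.81, 0.80, 0.80, 0.82, 0.71, 0.83]

mean_borrowed = mean(borrowed_scores)  # about 0.43, as reported in the text
mean_age = mean(age_scores)            # about 0.795; the text reports 0.79

print(f"Mean borrowed score: {mean_borrowed:.3f}")
print(f"Mean age score: {mean_age:.3f}")
```

The same check can be repeated for Tables 1 to 3 by listing only the nouns that carry a WOLD score.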

Thrace Romani corpus < Romani (in plain), Turkish (in bold), Greek (underscored)

(16)
psixoloγos psychologist
psixoloγos psychologist
jazmijor write.NEG.PROG.3SG kantʃik nothing
in NEG
vo he
mono just
[va] yes
ute nor
tuke you.DAT
del give.PRS.3SG
beʃel sit.PRS.3SG eksetasis exams
apora pills tut you.ACC
konuʃur speak.PRS.3SG
kerel make.PRS.3SG
tuke you.INS
tut you.ACC

‘A psychologist, a psychologist, doesn’t prescribe pills to you. He gives you nothing. He just sits, talks to you [yes]. He doesn’t make (clinical) exams either.’ (Adamou 2010: 153)

This analysis is in keeping with the sociolinguistic analysis, which also indicates that Thrace Romani speakers have only been bilingual in Greek for the past three to four generations.

4.5 Regularity

The criterion of regularity has been considered a promising tool for distinguishing between borrowing and codeswitching. In theory, the more regular a contact word is, the more likely it is to be considered a borrowing. In order to determine whether single words from the contact language were more or less established in the bilingual community, Poplack, Sankoff, and Miller (1988) suggested a quantitative analysis based on regularity. This analysis gave rise to four types of so-called “borrowings”:

1. “Nonce borrowings” for words used only once.
2. “Idiosyncratic borrowings” for words used more than once but by a single speaker.
3. “Recurrent borrowings” for words used more than 10 times.
4. “Widespread borrowings” for words used by more than 10 speakers.

Nevertheless, confronted with more empirical data, the regularity criterion does not meet expectations. In more recent publications, as in Poplack and Dion (2012), the relevance of regularity is questioned, as it fails to distinguish between borrowings and nonce borrowings (or codeswitches).
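The four frequency and diffusion classes listed above can be sketched as a small classification routine. This is an illustration under stated assumptions rather than the procedure of Poplack, Sankoff, and Miller (1988): the function name, the input format (one word and speaker pair per token) and the toy data are all hypothetical. As worded, the four definitions can overlap, so the sketch returns a set of labels per word.

```python
from collections import Counter


def borrowing_types(occurrences):
    """Label each contact word with the classes it satisfies.

    `occurrences` is a list of (word, speaker) pairs, one pair per token.
    A word may receive several labels, or none.
    """
    token_counts = Counter(word for word, _ in occurrences)
    speakers = {}
    for word, speaker in occurrences:
        speakers.setdefault(word, set()).add(speaker)

    labels = {}
    for word, n_tokens in token_counts.items():
        n_speakers = len(speakers[word])
        word_labels = set()
        if n_tokens == 1:
            word_labels.add("nonce")          # used only once
        elif n_speakers == 1:
            word_labels.add("idiosyncratic")  # more than once, single speaker
        if n_tokens > 10:
            word_labels.add("recurrent")      # used more than 10 times
        if n_speakers > 10:
            word_labels.add("widespread")     # used by more than 10 speakers
        labels[word] = word_labels
    return labels


# Invented toy data: the word forms come from the corpora discussed here,
# but the counts and speaker identifiers are made up for illustration.
toy_tokens = (
    [("saat", "A")]
    + [("kapia", "A")] * 3
    + [("pare", f"S{i}") for i in range(12)]
)
labels = borrowing_types(toy_tokens)
for word, word_labels in labels.items():
    print(word, sorted(word_labels))
```

A word used, say, six times by four speakers receives no label at all under these definitions, which is one way of seeing why the criterion is hard to apply to small corpora.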

Similarly, among a variety of criteria, Myers-Scotton (1993a) suggested measuring regularity through the “three-occurrence rule”. According to this rule, if a contact word occurs three times in a given corpus, generally thought of as a 10,000-word corpus, this word can be considered as regular. The arbitrariness of the principle, which Myers-Scotton recognized from the start (why would three occurrences be considered regular enough?), as well as the difficulty scholars had in applying it, led her to abandon the regularity factor altogether in subsequent publications.

The problem of the regularity criterion is obvious when we are dealing with lexical items, and not with grammatical morphemes or word classes. Indeed, lexical items are related to the type of text and the topic of conversation, and they are therefore likely to show few recurrent uses. Our study on regularity, based on smaller corpora, confirms the above-mentioned difficulties; see Figure 26. In the Balkan Slavic corpora, 66% of the Greek nouns and verbs (N = 47 out of 71) appear just once while the rest occur fewer than 10 times. In the Ixcatec corpora, 36% of the Spanish nouns and verbs are single occurrences. We also note that 27% of the words are used in both Ixcatec corpora, the contemporary corpus and the corpus of the 1950s. Finally, in the Thrace Romani-Turkish-Greek corpus, Turkish words occur just once in 42% of the cases. Also, 75% of the borrowed nouns are used by a single speaker.
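The single-occurrence rate given above for the Balkan Slavic corpora, and the three-occurrence rule itself, can be sketched as follows. Both function names are hypothetical, and reading “occurs three times” as “at least three times” is an assumption made here, not a detail stated in the text.

```python
def single_occurrence_share(n_single: int, n_total: int) -> float:
    """Percentage of contact words that occur exactly once."""
    if n_total <= 0 or not 0 <= n_single <= n_total:
        raise ValueError("expected 0 <= n_single <= n_total and n_total > 0")
    return 100 * n_single / n_total


def meets_three_occurrence_rule(n_occurrences: int) -> bool:
    """Three-occurrence rule; assumption: three or more occurrences count."""
    return n_occurrences >= 3


# Balkan Slavic corpora: 47 of the 71 Greek nouns and verbs appear just once.
balkan_slavic_share = single_occurrence_share(47, 71)
print(f"Single occurrences: {balkan_slavic_share:.1f}%")

# With so many single occurrences, most contact words cannot satisfy the rule.
print(meets_three_occurrence_rule(1), meets_three_occurrence_rule(3))
```

The sketch ignores corpus size; the rule was formulated with a 10,000-word corpus in mind, and how it should scale to smaller corpora is left open in the text.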

Figure 26: Token frequency and diffusion of borrowed nouns across speakers for the Thrace Romani, Ixcatec, and Balkan Slavic corpora

An alternative criterion to regularity is the “listedness” criterion, which aims at qualifying a contact word with respect to “the degree to which a particular element or structure is part of a memorized list” (Muysken 2000: 71). In a study of Welsh-English bilingualism, Deuchar (2006) applies this criterion by checking the Welsh dictionary in order to see whether the English words which occur in the free-speech corpus are listed there. This method is clearly not suited to under-described or endangered languages, which generally do not have any dictionaries. The possibility of checking the acceptance of a word with community members has also proven to be inapplicable, since the results depend mostly on metalinguistic considerations rather than on real usage.

In the following sections, I present a more detailed analysis of the corpora with respect to regularity. The Ixcatec-Spanish corpora are presented in 4.5.1, the Balkan Slavic Nashta-Greek corpus in 4.5.2, the Thrace Romani-Turkish-Greek corpus in 4.5.3, and an overview of the Finnish Romani-Finnish corpus in 4.5.4.

4.5.1 The Ixcatec-Spanish corpora

The analysis of the two Ixcatec corpora, the contemporary corpus and the corpus of the 1950s, shows that few words from Spanish occur in both corpora, i.e., ‘godfather’, ‘week’, ‘hour’, ‘shop’, ‘place’, ‘task’, and ‘hat’; see Figure 27. Verbs from Spanish are mainly found in the corpus of the 1950s, in which just a few verbs occur more than once, i.e., ‘provide’, ‘solve’, ‘save’, ‘shoot’, ‘get’, ‘occupy’, ‘maintain’, and ‘order’; see Figure 28.

4.5.2 The Balkan Slavic-Greek corpus

Regularity appears to be a more appropriate tool for testing specific types of lexical words, such as answer particles. For example, in the Balkan Slavic Nashta corpus only the Greek answer particle ne ‘yes’ is found; see example (17). The Slavic form da ‘yes’ is recognized as a Slavic answer particle but is not considered a local, Nashta word, and is never used. The Greek answer particle can therefore qualify as a borrowing since there is no other form used by the last speakers. Another positive answer particle, [χ] ‘alright’, most likely from the Persian-origin kho, is considered to be the native answer particle.4 In practice, however, it never occurs in the spontaneous corpus.

4 The answer particle is also emblematic among the Muslim Slavic speakers of Greece, known as Pomaks, who are trilingual in Pomak (Slavic), Turkish, and Greek. Personal research shows that Pomak speakers, in the villages where Pomak is still transmitted, reject the Slavic answer particle da and promote the Persian-origin kho as a native Pomak form, realized [huo].

Figure 27: The Ixcatec corpora: Frequency of Spanish nouns occurring more than twice or in both corpora

Figure 28: The Ixcatec corpus: Frequency of Spanish verbs in the corpus of the 1950s

Balkan Slavic Nashta corpus < Slavic (in plain), Greek (bold)

(17)
ˈmajka mother
me ACC.1SG
ˈtʃini-ʃe do-IPRF.3SG
kuˈkuʎe cocoons
ne yes

‘My mother used to prepare cocoons, yes.’ (Adamou 2013. Excerpt from La soie, sentence 3. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

The regularity criterion can also successfully apply to discourse particles, which are more recurrent in speech. The analysis of regularity may then reveal some interesting tendencies. For example, as suggested by Matras (1998, 2009), the adversative marker is among the most contact-sensitive elements crosslinguistically. Looking at the corpus of Balkan Slavic Nashta, the Greek adversative ala ‘but’ is the only adversative marker found, with the exception of one occurrence of the more literary Greek contrastive marker omos. The criterion of regularity thus indicates that ala is a Greek borrowing in the Balkan Slavic Nashta corpus.

Balkan Slavic Nashta corpus < Slavic (in plain), Greek (bold)

(18)

aˈla but

uˈno 3SG.N

ˈzemaʃe take.IPRF.3SG

sa REFL

ˈnemaʃe not_have.IPRF.3SG

u at

ˈniva-ta field-ART.SG.F

skuˈrdar skurdar.SG

‘But that one didn’t have any at the field, he would bring the garlic mixture, skurdar.’ (Adamou 2013a. Excerpt from La moisson, sentence 20. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

Interestingly, the adversative ama ‘but’ from Turkish is totally absent from the Balkan Slavic Nashta corpus. In contrast, the Turkish adversative ama is dominant in the Balkan Slavic corpus of the 1970s recorded in Hrisa (N = 7). Knowing that the Turkish adversative ama was widespread in most Balkan Slavic varieties, as also attested in the Hrisa corpus, it is safe to say that the Turkish-origin ama ‘but’ was replaced in Nashta by the Greek ala quite rapidly, possibly during the twentieth century.

4.5.3 The Thrace Romani-Turkish-Greek corpus

In the Thrace Romani corpus some nouns were used both in Turkish and Romani, e.g., gyn ‘day’ from Turkish occurs twice in the corpus for two speakers, while the native Romani give/gie ‘day’ shows 7 occurrences for four speakers; see Figure 30. Other words occur in just one language (see Figures 29 and 30), e.g., kapia ‘door’ from Turkish (N = 3) and kavako ‘tree’ from Turkish (N = 6), whereas the Romanian-origin kopatʃi ‘tree’ is not attested in the corpus but is known by the speakers.

Figure 29: The Thrace Romani corpus: Frequency of Turkish nouns (part 1)

Figure 30: The Thrace Romani corpus: Frequency of Turkish nouns (part 2)

Also, in Thrace Romani, the word pare ‘money’ from Turkish para ‘money’ occurs in a regular way (N = 6 for four different speakers, shown in Figure 30) and the native love ‘money’ is not attested in the corpus. The word pare ‘money’, from Turkish, is a well-known borrowing from the Ottoman era and can be found in all the Balkan languages in variation with native words. This is also the case for Modern Greek, where the Turkish word is integrated with a plural suffix, as paraðes, and co-exists with the words lefta and xrimata, all meaning ‘money’. One difference, though, between pare in Thrace Romani and paraðes in Modern Greek is that in Modern Greek the word is used by speakers who do not speak Turkish, even though they may have some metalinguistic knowledge of the word’s origin. In the Thrace Romani community, speakers are fluent in Turkish and similarly may, or may not, be conscious of the word’s Turkish origin. Looking at the regularity in the Thrace Romani-Turkish-Greek corpus, at the Balkan languages in general, and at the other varieties of Romani with speakers who do not speak Turkish, the conclusion to be drawn is that pare ‘money’ is a borrowing in Thrace Romani rather than a switch.

Thrace Romani speakers also used 76 different Turkish verbs in their speech. 48 of them were used only once, and only 9 were used more than 4 times; see Figure 31. The Romani equivalent of several Turkish verbs has been recorded in the corpus. Namely, the most frequent Turkish verb ol- ‘become’ (N = 12) is used in variation with a native Romani form (N = 12); see in (19) an example from the same speaker during the narration of a tale:

(19) Thrace Romani corpus < Romani (in plain), Turkish (in bold)
oldu sap ‘He became a snake.’ (Sentences 5, 6, 27, 38)
kerdindol sap ‘He became a snake.’ (Sentences 13, 17, 24)
(Adamou 2008. Excerpt from The man-snake. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

Other Turkish verbs, which occur between four and eight times in the corpus, also have a Romani equivalent which is used as frequently or more frequently: this is the case for the Turkish verbs ‘wait’ (Romani N = 1), ‘understand’ (Romani N = 3), ‘work’ (Romani N = 12), ‘drink’ (Romani N = 14), and ‘put’ (Romani N = 18). In these cases it is difficult to evaluate the verb as either a codeswitch or a borrowing based on the criterion of regularity and the existence of a native equivalent. For other frequent Turkish verbs, such as the second and third most frequent verbs, ‘marry’ and ‘write’ respectively, there is no Romani equivalent in the corpus. These Turkish verbs could therefore be considered as borrowings. This could also be the case for a number of Turkish verbs which occur from one to four times and have no Romani equivalent in the corpus, such as ‘read’, ‘finish’, ‘be tired’, ‘think’, ‘return’, ‘explode’, ‘rest’, ‘like’, etc.

Figure 31: The Thrace Romani corpus: Frequency of Turkish verbs

Among the following Turkish verbs with one occurrence we observe an equally frequent or a more frequent Romani equivalent: ‘boil’ (Romani N = 1), ‘enter’ (Romani N = 1), ‘roast’ (Romani N = 2), ‘call’ (Romani N = 2), ‘get on’ (Romani N = 2), ‘tell’ (Romani N = 3), ‘send’ (Romani N = 8), ‘know’ (Romani N = 13), ‘leave’ (Romani N = 23), ‘get’ (Romani N = 20), ‘throw’ (Romani N = 26), ‘come’ (Romani N = 53), ‘do’ (Romani N = 60), and ‘go’ (Romani N = 87). This means that we can consider these Turkish verbs as codeswitch insertions. See the examples in (20) for the verb ‘to talk’, used 7 times in Turkish and once in Romani.

(20) Thrace Romani corpus < Romani (in plain), Turkish (in bold), Multiple (underscored)

a. dʒan-es   kasa     konuʃ-ijor-sun  akana
   know-2SG  who.INS  talk-PROG-2SG   now
   ‘Do you know who you’re talking to now?’ (Adamou and Granqvist 2014)

b. amen     muruʃ  tʃavo  naj
   1PL.OBL  male   boy    be.NEG.3SG
   hajde  konuʃ
   INTJ   talk.IMP.2SG
   ‘We don’t have a boy (in the family). . .come on, talk!’ (Adamou and Granqvist 2014)

c. airsǝs   ep            me       ka   orbisarav
   useless  all_the_time  1SG.NOM  FUT  talk.1SG
   ‘Useless (girl)! Am I to do the talking all the time?’ (Adamou and Granqvist 2014)

To conclude, 19 Turkish verbs have an equally frequent or more frequent Romani equivalent in the corpus, and could therefore qualify as codeswitching insertions, while 57 verbs have been encountered only with their Turkish form, in a manner that would allow us to consider them as borrowings. However, as I discuss in Chapter 5 on verb integration, none of these verbs, regardless of their regularity and the existence of a Romani equivalent, is integrated into Romani morphology, a feature which is typical of codeswitching insertions but incompatible with most definitions of borrowing.
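The regularity criterion applied in this section can be summarized as a small decision rule. The following Python sketch is only an illustration under stated assumptions: the function name, the label strings, and the third outcome ("undetermined") are hypothetical and are not part of the original analysis; the token counts are those reported above for the Thrace Romani corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    contact: int  # tokens of the contact-language (e.g., Turkish) form
    native: int   # tokens of the native Romani equivalent (0 = unattested)


def classify_by_regularity(c: Counts) -> str:
    """Hypothetical helper: label an item using regularity alone."""
    if c.native == 0:
        # Only the contact-language form is attested in the corpus.
        return "borrowing candidate"
    if c.native >= c.contact:
        # The native equivalent is equally or more frequent.
        return "codeswitch insertion candidate"
    # Assumption: the text gives no verdict for the remaining cases.
    return "undetermined"


# Counts reported in this section (contact form, native form).
ITEMS = {
    "pare 'money'": Counts(contact=6, native=0),
    "ol- 'become'": Counts(contact=12, native=12),
    "'go'": Counts(contact=1, native=87),
    "'talk'": Counts(contact=7, native=1),
}

labels = {item: classify_by_regularity(c) for item, c in ITEMS.items()}
for item, label in labels.items():
    print(f"{item}: {label}")
```

As the conclusion above stresses, such a rule captures regularity only; it says nothing about morphological integration, which is examined in Chapter 5.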


Interestingly, the same variability between the Turkish adversative ama and the Greek ala is found in the Thrace Romani-Turkish-Greek corpus, i.e., the Turkish adversative ama occurs 32 times, the Greek ala only twice, and another form which could be Greek or Turkish, ma, also occurs twice. The example of the Balkan Slavic corpus shows that if the appropriate conditions are met, such as decreasing contact with Turkish and increasing contact with Greek, ala could become dominant in Thrace Romani and either replace the Turkish ama or co-exist with specific pragmatic and sociolinguistic readings. Indeed, in Literary Bulgarian, the standardization process which started in the nineteenth century introduced a Slavic adversative marker no. Rather than the marker no replacing the Turkish adversative ama, the two are now pragmatically contrasted: no is the neutral, unmarked adversative, whereas ama is pragmatically marked (Fielder 2015).

There is also variation in coordinating conjunctions in the Thrace Romani-Turkish-Greek corpus: the Turkish conjunction da ‘and’ occurs 10 times, exemplified in (21), while the Romani ta for the same function is extremely rare. A tendency to replace the Romani ta can therefore be documented through the analysis of the corpus but there is no absolute replacement.

(21) Thrace Romani corpus < Romani (in plain), Turkish (in bold), Multiple (underscored)

lav take.1SG telefono telephone

lake her da and

me 1SG . NOM phenav say.1SG

demek say

telefono telephone

lake her

‘Suppose I call her on the phone and tell her. . .’ (Adamou, unpublished corpus)

4.5.4 The Finnish Romani-Finnish corpus

In the Finnish Romani-Finnish corpus, the Finnish coordinator ja ‘and’ is used in 63% of the cases and the inherited Romani ta ‘and’ in 37% of the cases; see Table 5 (Granqvist 2000). The temporal and adversative markers also vary with the native forms, which account for 11% of the cases for the Romani temporal marker and 13% for the adversative. In contrast, the complementizer and the conditional are Finnish in practically 100% of the cases.


Table 5: Finnish and Romani conjunctions (Granqvist 2000)

Finnish                   Romani
Conj.   Freq.   %         Conj.   Freq.   %        Total   Meaning
ja      232     63.04     ta      136     36.96    368     ‘and’
että     78     97.50     at        2      2.50     80     ‘that’
jos      21    100.00     om        0      0.00     21     ‘if’
kun     137     88.96     ka       17     11.04    154     ‘when’
mutta    73     86.90     bi       11     13.10     84     ‘but’
Total   468     66.20             166     23.48    707
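The row percentages in Table 5 follow directly from the frequencies. The short Python sketch below recomputes them for each conjunction pair; the function name `shares` and the data layout are assumptions made for illustration, while the frequencies are those given in Table 5 (Granqvist 2000).

```python
# (Finnish conjunction, frequency, Romani conjunction, frequency), as in Table 5.
TABLE_5 = {
    "and": ("ja", 232, "ta", 136),
    "that": ("että", 78, "at", 2),
    "if": ("jos", 21, "om", 0),
    "when": ("kun", 137, "ka", 17),
    "but": ("mutta", 73, "bi", 11),
}


def shares(finnish_freq: int, romani_freq: int) -> tuple:
    """Hypothetical helper: percentage of Finnish vs Romani tokens in one row."""
    total = finnish_freq + romani_freq
    return (
        round(100 * finnish_freq / total, 2),
        round(100 * romani_freq / total, 2),
    )


for meaning, (fi, fi_n, rom, rom_n) in TABLE_5.items():
    fi_pct, rom_pct = shares(fi_n, rom_n)
    print(f"{meaning}: {fi} {fi_pct:.2f}% vs {rom} {rom_pct:.2f}%")
```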

Variability also exists in Finnish Romani for negation. The monolingual Romani types naa, naa na(a), and naa nas ‘no’ cover 77.5% of all the negations. The mixed types ei naa and ei nas ‘no’ amount to 21.6%. Bare Finnish ei is extremely rare (0.9%) (Granqvist 2000).

4.6 Discussion

In Chapter 4, I examined some of the factors that could allow us to distinguish between borrowings and switching insertions in a bilingual corpus. The analysis of the degree of composition of the current-contact language words shows that there is little difference between the two types of corpora established in Chapter 3, i.e., those with 0‒5% or 20‒35% contact words. Most of the corpora under study show mainly one- or two-word insertions in conversations with in-group community members, bringing us back to the difficulty in characterizing a contact word as a borrowing or a single-word switch. Alternational codeswitching is triggered by an outsider’s presence or by speakers who have shifted to another language.

Attempting to apply the notion of “flagging” to contact material is unfortunately of very limited use for identifying codeswitching in the corpora under study. Flags such as pauses, hesitations, and other metalinguistic commentary frequently occur in the corpora under study independent of the origin of the word. Indeed, for any language, but probably even more so for an endangered language, speakers may hesitate before enunciating some words, looking for the best way to organise their speech and express an idea with clarity. Thus, although the study of flagging is an interesting criterion to better describe the codeswitching patterns, it is a much more limited criterion for distinguishing neatly between borrowing and codeswitching.


In this chapter, it was shown that for some multilingual corpora it is possible to correlate the variety of word classes showing high rates of contact word-tokens with the overall number of current-contact language tokens. For example, the corpora with more than 20% tokens from the current-contact language are expected to show a greater variety of word classes. This is the case for Thrace Romani and Finnish Romani, which show word-tokens from almost all the word classes from Turkish and Finnish respectively. In contrast, the corpora with less than 5% words from the current-contact language, when viewed from the perspective of the rates of word classes, show only content words such as nouns and possibly verbs too. This is the case of Ixcatec, in contact with Spanish, and Balkan Slavic Nashta, in contact with Greek. This finding is in accordance with all the studies on language contact (Myers-Scotton 1993a; Matras 2007). This does not mean, however, that individual words belonging to other word classes do not occur in the corpora with 0‒5% contact words. It only means that their presence is not visible enough when compared to the entire word class. This statement is of course highly dependent on the typology of the languages in contact, as one language may have obligatory free pronouns while another may be a pro-drop language. These typological differences must be taken into consideration in the interpretation of such results.

The present study also illustrated how cross-linguistic comparability for the study of lexical semantic fields may be obtained through the study of natural speech. Comparison with the WOLD age score shows that the Ixcatec and the Balkan Slavic corpora have relatively high scores for nouns from their current-contact languages, similar to Thrace Romani.
This result is intriguing for Thrace Romani, which shows high overall rates of contact words from Turkish, but strengthens the analysis of Thrace Romani as an “unevenly-mixed language” developed in the nineteenth century as argued in Adamou and Granqvist (2014).

Our study partly confirms the relevance of Myers-Scotton’s (1993a) suggestion that core vocabulary should count as “borrowing” as opposed to peripheral, cultural vocabulary which is better classified as “codeswitching”. Indeed, for some recent contact settings only culturally-specific content words were used. This is the case for Modern Greek words in Thrace Romani, nicely contrasting with Turkish words which are generally part of core vocabulary and have a high age score in WOLD. The Ixcatec corpus, however, shows a great amount of cultural vocabulary from Spanish which appears to have been introduced in the past and would hardly qualify as “codeswitching”.

Last, this study shows that the regularity criterion can prove to be useful for function words and discourse particles but less so for lexical items with a unique referent. Regular use of specific lexical items depends on the topic of conversation, a characteristic of either bilingual or monolingual speech. A possible way of controlling for regularity is through the use of elicitation tasks which aim at producing semi-spontaneous speech, as in Nagy (2011). Another solution to the shortcomings of regularity for lexical items may be the use of probabilistic statistical methods not per token but per word class, as shown in Chapter 6.

Chapter 5

Integration strategies

5.1 Background

Probably the most decisive criterion used by linguists in order to establish whether a word is a borrowing or a switch is the degree of integration:

Despite etymological identity with the donor language, established loanwords assume the morphological, syntactic, and often, phonological, identity of the recipient language (Poplack 2001: 2063).

As clear-cut as this criterion appears to be, however, the empirical evidence shows that its application is far more complex. One difficulty comes from the fact that integration processes take place at various levels, i.e., phonological, morphological, and syntactic. If the item is integrated at all three levels, by means of phonology, morphology, and syntax, then it can easily be considered a borrowing. Nevertheless, in practice, not all levels show the same degree of integration and some hierarchy needs to be established. For example, Poplack considers an item as a switch insertion if it is integrated only at the level of phonology or only at the level of syntax, provided it is not simultaneously integrated at the two other levels (Poplack 1980: 584).

Another problem stems from variability in the degree of integration for one and the same speaker or for various speakers in a given bilingual community. Based on elicited data, Poplack and Sankoff (1984: 129) suggest that the more frequent a contact word is, the more it becomes stabilized in its degree of integration, at least at the level of phonology and morphology. Nevertheless, studies based on spontaneous data show that morphological and syntactic integration is not gradual but abrupt (Poplack 2004; Poplack and Dion 2013).

In the sections that follow, I illustrate the advantages and disadvantages of applying the criterion of integration in the study of contact language insertions with examples from some of the corpora under study. In section 5.2, I examine phonological and phonetic integration, in 5.3, noun integration, and in 5.4, verb integration.
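The integration criterion as summarized in this section can be written out as a three-way check. The Python sketch below is a simplified illustration, not Poplack’s own formalization: the function name and the "undetermined" label for mixed cases are assumptions, reflecting the point made above that some hierarchy between the levels still needs to be established.

```python
def classify_by_integration(
    phonological: bool, morphological: bool, syntactic: bool
) -> str:
    """Hypothetical helper: label an item by its levels of integration."""
    levels = (phonological, morphological, syntactic)
    if all(levels):
        # Integrated at all three levels: easily considered a borrowing.
        return "borrowing"
    if sum(levels) == 1 and (phonological or syntactic):
        # Integrated only in phonology or only in syntax: switch insertion.
        return "switch insertion"
    # Assumption: every other combination is left open in this sketch.
    return "undetermined"


print(classify_by_integration(True, True, True))    # all three levels
print(classify_by_integration(True, False, False))  # phonology only
print(classify_by_integration(True, True, False))   # mixed case
```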

5.2 Phonetics and phonology

It is well known that fully bilingual speakers are capable of producing a variety of contact language forms, ranging from “non-integrated” to “integrated” in terms of phonology. Poplack and Dion (2013) stress that bilinguals will generally have an “accent” from their first language when using words from the contact language and suggest not considering phonological integration a significant cue. Moreover, Matras (2009: 229) notes that when borrowed words have “foreign” phonemes or ones that do not show the same distribution as in the recipient language, this can destabilise the entire phonological system and trigger “random variation” and “uncertainty” among the speakers.

Below is an example from the Balkan Slavic Nashta-Greek corpus. The Greek insertion in (22a) is followed by its equivalent with native material. In (22b), the first occurrence of the cluster ‘Saint-Athanasius’ shows partial phonological adaptation with [ɟ], immediately followed in the repetition by the Greek [j], whereas the Greek [θ] is kept in both cases as opposed to the integrated form seen in (22a), as [t]. In (22a), the Greek insertion of the cluster ‘Saint-Athanasius’s day’ is not integrated into Slavic as it keeps the Greek genitive case, and in (22b) it is in the Greek nominative case. Word order is identical in the two languages, ADJ-N, so syntax is not relevant here.

(22) Balkan Slavic Nashta corpus < Slavic (in plain), Greek (in bold)

a. aˈʝiu      aθanaˈsiu  [. . .]  sfiˈtij  taˈnas
   saint.GEN  NP.GEN              saint    NP
   ‘Saint Athanasius (day) [. . .] Saint Athanasius.’

b. ˈaɟos      aθaˈnasʲios  [. . .]  ˈajos      aθaˈnasʲios
   saint.NOM  NP.NOM                saint.NOM  NP.NOM
   ‘Saint Athanasius’s (day) [. . .] Saint Athanasius.’
   (Adamou 2013. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

Partial phonological integration is also found in the Thrace Romani-Turkish-Greek corpus. At times Romani speakers adapt the Turkish and Greek words to the Romani phonological system and at other times they do not. For example, the Turkish vowels [y], [ɯ] and [œ], which do not belong to the native Romani inventory, are used with Turkish words, e.g., [saˈɾɯ] ‘yellow’ (from Turkish sarı). But other phonemes, such as the Turkish /h/, are not used in Thrace Romani, e.g., Turkish mahal ‘neighbourhood’ /maaˈla/; hep ‘always’ /ep/, etc. (Adamou and Arvaniti 2014: 228). The absence of phonological integration of these words may indicate that they are borrowings from past-contact settings, an analysis that is coherent with the results from the study of lexical semantic fields presented in Chapter 4.


Complex phonological phenomena, such as vowel harmony, are not adopted by Thrace Romani speakers, e.g., the Turkish word yıldız ‘star’ is pronounced [jɯɫˈdɯzi] ‘star.F’ with the Romani feminine suffix –i, without Turkish vowel harmony (Adamou and Arvaniti 2014: 228). This does not mean, however, that vowel harmony is a feature that cannot be adopted by Romani in other contact settings. For example, in the Finnish Romani-Finnish corpus, Romani speakers not only respect Finnish vowel harmony in Finnish words but have also integrated this phonological feature into inherited parts of the vocabulary under Finnish influence (Granqvist 2000).

Words of Greek origin are used by Thrace Romani speakers with some of the Greek consonants, such as [ɣ], [θ], [ð]: e.g., [dziˈɣaɾe] ‘cigarettes’ (from Greek tsiɣara), [pinaˈciða] ‘sign’ (Adamou and Arvaniti 2014: 228). For other consonants, such as the voiceless velar fricative [x], the closest Romani equivalent [χ] may be used, e.g., Greek [aˈɾaxni] ‘spider’ becomes [aˈɾaχni], but the consonant /k/ is more frequently preferred, e.g., [maˈstixa] ‘chewing gum’ is pronounced [maˈstika], and [ˈlastixo] ‘hose’ is pronounced [laˈstika] (Adamou and Arvaniti 2014: 228). It is thus difficult to draw any conclusion about the status of these words based on phonological integration.

Similarly, in Molise Slavic, the possibility of fluctuating between a phonologically integrated and a phonologically non-integrated form can be seen for one and the same speaker. For example, for ‘Little Red Riding Hood’, the integrated form [kapuˈtʃeto ˈros] (in sentences 1 and 41) coexists with the non-integrated [kapputˈtʃetto ˈrosso] with Italian gemination and morphology (in sentence 3; http://lacito.vjf.cnrs.fr/pangloss/, Molise Slavic, Acquaviva, text Petit chaperon rouge).

To conclude, the study of phonological integration provides evidence for the existence of a continuum of integration of contact words. Phonological integration is therefore a criterion which does not allow for a clear-cut distinction between borrowings and codeswitching insertions, unless one considers that codeswitches are characterized by instability in their phonological integration and that borrowings are characterized by a more stabilized form. In practice, this criterion is confronted with the variability that also exists among monolinguals for borrowings: for instance, in Modern Greek, the word for pyjamas is variably realized as [piˈdzama], [piˈzama], [mbiˈdzama], [mbiˈdʒama], reflecting a more general trend towards variation in Greek voiced stops and uncertainty as to the phonological status of [dz]. The phonological integration criterion would have been misleading here, since the word for pyjamas is a long-term borrowing, used by monolinguals and having no synonym.


5.3 Noun integration

If a noun from a contact language A is not integrated into the morphology and syntax of language B, it is most likely a switch insertion. Morphological integration into language B, which could indicate a borrowing, can be more or less advanced: integration may be complete, partial (for part of the paradigm), or mixed (resulting from the combination of the morphology of language A and B) (Gardani 2008, in press). Moreover, literature on language contact has already drawn attention to differences in the integration of various types of morphemes. For example, plural morphemes from contact languages are more easily transferred than case morphemes (Matras 2007: 43). Indeed, Gardani (in press) observes that plurals on noun phrases (NP) have a higher borrowing rating, arguing that plural marking in NPs is closer to derivation than to contextual inflection. In this section, I will examine what possibilities are illustrated by the corpora under study as far as noun integration is concerned.

5.3.1 The Ixcatec-Spanish corpus

Ixcatec nouns have very little morphology, namely possessive suffixes, demonstratives and a relatively loose system of classifiers, di²- ‘man’, kwa²- ‘woman’, ʔu²- ‘animal’ (Costaouec and Swanton, in press 2015). The Ixcatec classifiers are generally, but not always, prefixed to the Spanish nouns when relevant, as shown in (23), although variation can be observed, e.g., kwá-doktora and doktora ‘female doctor’.

(23) Contemporary Ixcatec corpus < Ixcatec (in plain), Spanish (in bold)

sá   kwá-enɸerméra  kú-tʃe-kú-nà      la    nda  ʃtá   sí
DEF  CLF-nurse      PFV-say-ANT-1SG   COMP  how  ugly  EXS
‘I told the nurse how ugly it is!’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php)

Moreover, in the Ixcatec-Spanish corpus, Spanish nouns always receive the Ixcatec noun determiners, namely the article and the numerals.


5.3.2 The Romani-Turkish-Greek corpus

As far as number is concerned, in the Thrace Romani-Turkish-Greek corpus, borrowed masculine nouns generally use the plural from a past-contact language, the Romanian -uri. Such is the case for the Turkish borrowings ap-ora ‘pills’, dev-ora ‘giants’, eteklik-ora ‘long skirts’, etc. This phenomenon is found in many Romani dialects and is not specific to Thrace Romani. The nouns bearing foreign morphology (often of Greek origin) are called “xenoclitic” and are distinguished from the “oikoclitic” nouns taking native morphology (Elšík and Matras 2006: 72). Oikoclitic and xenoclitic nouns show some similarities, such as the oblique plural -en, but many differences due to Greek influence, e.g., plural nominatives in -i. In varieties with current contact with Greek, a newer layer of Greek loan morphology is added, e.g., sepeči ‘basket weaver’ has the Greek plural suffix in sepečides (Matras 2002: 204).

Unlike plural noun markers, case morphemes are known not to be inserted into the morphosyntactic frame of another language (Myers-Scotton 2002). This is true in Thrace Romani, where Turkish nouns generally bear Indic case marking; see (24).

(24) Thrace Romani corpus < Romani (in plain), Turkish (in bold)

o    gadžo      pakav        kaj   tumafil-eske  pare   ni   del
the  non_Gypsy  believe-1SG  that  car-DAT       money  NEG  give.PRS.3SG
‘The non-Gypsy, I believe that he doesn’t give the money for the car.’ (Adamou, unpublished corpus)

A difference between a clause in Thrace Romani with Turkish nouns and an alternational switch in Turkish may therefore be established depending on the morphological integration of the Turkish nouns. This is more clearly illustrated below in an excerpt from the speech of a Romani female speaker who addresses her Romani friend in Turkish. The choice of Turkish can be due to the fact that both speakers live in a neighbourhood where they have adopted Turkish in their everyday life. The female speaker immediately repeats the question in Romani when addressing a young Romani girl who lives in the neighbourhood where Romani is still (at least partly) transmitted; see (25).


Thrace Romani corpus < Turkish (in angle brackets < >), Romani (in plain), Multiple (underscored) (25)

this

NEG . Q

tʃei daughter

mar INTJ

naj be.NEG

san be.2SG

tʃe INTJ

‘[To her friend:] Hey, ? [To the girl:] Hey, aren’t you Yzgjan’s daughter?’ (Adamou, unpublished corpus)

Note that, in the first sentence in (25), all the elements come from Turkish: the genitive, the possessive, the negative interrogative marker, and the demonstrative. The interjection is not Turkish but is a common interjection for Balkan languages in general, from Greek mori. Typically, a Thrace Romani sentence would not have included some of these elements, such as the demonstrative or the negative question particle, and Turkish nouns would have been integrated into the Romani morphology. This can be seen in the following sentence, where the genitive for the proper noun has the Indic form and where the Romani noun ‘girl’ is used together with the Romani definite article. Moreover, in the second sentence, the negative particle, the verb, and the interjection are all Romani. Morphological integration is thus relevant for the distinction between a Turkish borrowing and alternational codeswitching.

In contrast, Greek nouns in Thrace Romani either retain Greek case marking or not. Table 6 shows how the singular nominative case marking of Greek –(i)s is not kept in Romani, where it is realized as -i. But observe how Greek plural accusative marking is kept in sinodus ‘companions’, respecting the case that the construction requires in Greek and in Romani, but using the Greek form –us. Despite morphological non-integration, sinodus takes the Romani phonology, with the Greek [ð] realized as [d]. It appears that the study of morphological integration of Greek words in Thrace Romani offers another argument for considering these words as codeswitching insertions rather than borrowings.

Table 6: Case assignment of Greek words in Romani

Case  Romani                          Greek
NOM   o    lefteri  phenel            o    lefteris  lei
      DEF  NP       says              DEF  NP.NOM    says
      ‘Lefteris says.’
ACC   si  man      sinodus            ixa  sinoðus
      is  1SG.ACC  companion.ACC.PL   had  companion.ACC.PL
      ‘I was accompanied.’


Interestingly, in the Finnish Romani-Finnish corpus, Finnish nouns are always inserted into Romani dominant speech with the Finnish case marking, as shown in (26) for the partitive case (Adamou and Granqvist 2014).

(26) Finnish Romani corpus < Romani (in plain), Finnish (in bold)

line     deevelesko  armoa
got.3PL  God.GEN     mercy.PART

‘They received God’s mercy.’ (Adamou and Granqvist 2014)

The Finnish Romani data thus offer an interesting counter-example to the literature on borrowing and codeswitching. Indeed, Finnish case is restricted to Finnish nouns and has not replaced Romani case, which is still in use with Romani nouns. This type of insertion of morphologically non-integrated Finnish nouns is facilitated by the convergence that largely affected Finnish Romani, replicating Finnish word order and case (Adamou and Granqvist 2014). However, one should keep in mind that quite frequently Finnish nouns are used in the nominative singular, where no case or number marking would be required in Finnish.

Let us now observe what happens with gender assignment in the Thrace Romani-Turkish-Greek corpus. Romani has gender inflection for masculine and feminine, Turkish does not, and Greek has masculine, feminine, and neuter. In the corpus, Greek words in the neuter take Romani gender based on their ending. Feminine Greek nouns ending in –i are integrated into Thrace Romani with an additional –a; Greek or Turkish nouns with –a in the final syllable, independent of the gender in the contact language, are assigned feminine gender, with suppression of the final consonant if necessary. Greek nouns with neuter morphology ending in –o are used in the masculine. Finally, Greek neuter nouns in –os are integrated as masculine. See Table 7.

Table 7: Gender assignment in Thrace Romani

Noun ending  Romani                    Contact language           Meaning
-a           i pulma                   < Greek to pulman          ‘The (intercity) bus.’
             DEF.F bus(F)                      DEF.N bus(N)
             bari maala                < Turkish mahala           ‘Big neighbourhood.’
             big.F neighbourhood(F)
             koja mastika              < Greek mastixa            ‘The chewing-gum.’
             DEM.F chewing-gum(F)              chewing-gum(F)
-o           beibilinora               < Greek beibilino          ‘Dippers.’
             dippers-PL(M)                     dippers(N)
-os          o aŋxos                   < Greek to aŋxos           ‘The anxiety.’
             DEF.M anxiety(M)                  DEF.N anxiety(N)
-i           i avlia                   < Greek i avli             ‘The yard.’
             DEF.F yard(F)                     DEF.F yard(F)

In bilingual speech, noun phrases from the contact language are known to generally receive the determiners of the dominant language of the clause, but counter-examples have been reported for Moroccan Arabic-French bilinguals (Naït M’Barek and Sankoff 1988). Subsequent work has shown that this is not the case in all contact settings between French and Arabic, e.g., for French noun insertions in Lebanese Arabic (Poplack 2004).

In the Thrace Romani-Turkish-Greek corpus, Romani nouns receive a Romani determiner in 96% of the cases and a Turkish quantifier, generally a numeral, in 4% of the cases; see Table 8. Romani determiners are used with 75% Romani nouns, 21% Turkish nouns and only 4% Greek nouns; see Table 9. In this respect, the Thrace Romani corpus confirms the general cross-linguistic tendencies.

Table 8: Thrace Romani: distribution of NPs with a non-borrowed noun

          DET{rmn}   QUANT{tur}
N{rmn}    96%        4%
          265        10

Table 9: Thrace Romani: distribution of NPs with a Romani determiner

            N{rmn}   N{tur}   N{ell}
DET{rmn}    75%      21%      4%
            265      76       15

As far as definiteness is concerned, Thrace Romani shares the masculine and feminine singular forms of the definite article with Greek, as can be seen in Table 10.

Table 10: Romani and Greek articles

                     Romani         Greek
                     NOM    OBL     NOM    GEN    ACC
SG   M               o      e       o      tu     to(n)
     F               i              i      tis    ti(n)
     N                              to     tu     to
PL   Unspecified     e
     M                              i      ton    tus
     F                              i      ton    tis
     N                              ta     ton    ta

Although Romani articles emerged due to contact with Greek during the Byzantine period, they were most likely grammaticalized through the Romani demonstratives (Matras 2002: 280). Definite articles are nowadays found in several Romani varieties independent of present-day contact with Greek. Thus, despite the striking similarity between Romani and Greek articles, they were not counted in the corpus analysis as Greek. One should further note that Turkish has no articles, but this does not appear to have had any effect on Thrace Romani articles. Contact-induced loss of the articles, however, has taken place in other contact settings involving Romani, namely in the North-eastern Romani dialects.

In the Thrace Romani corpus, Greek nouns can be inserted with the Greek or the Romani definite article, as shown in example (27), exhibiting the same variability in the morphological and phonological integration that has been mentioned in the previous sections.

(27) Thrace Romani corpus < Romani (in plain), Turkish (in bold), Greek (underscored), Multiple (in italics)

me       mangav    e       arne  to        numero
1SG.NOM  want.1SG  DEF.PL  eggs  DEF.SG.N  number
katar  o       numero  sekis  ja  dokus
from   DEF.SG  number  eight  or  nine
‘I want the number of eggs, number eight or nine.’ (Adamou, unpublished corpus)
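Returning to gender assignment, the ending-based tendencies summarized in Table 7 can be sketched as an ordered list of rules. This Python fragment is a rough illustration only: the function name, the "unassigned" fallback, and the ordering of the checks are assumptions, and phonological adaptations seen in the corpus forms (such as [x] realized as [k], or the loss of /h/) are deliberately left out.

```python
from typing import Optional, Tuple

VOWELS = set("aeiou")


def assign_gender(noun: str, source_gender: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper: return (adapted form, Romani gender) for a contact noun."""
    # Nouns ending in -o or -os are used in the masculine, form unchanged.
    if noun.endswith(("o", "os")):
        return noun, "M"
    # Feminine Greek nouns ending in -i take an additional -a.
    if noun.endswith("i") and source_gender == "F":
        return noun + "a", "F"
    # Nouns with -a in the final syllable are feminine; any final
    # consonant is suppressed, whatever the gender in the source language.
    stem = noun
    while stem and stem[-1] not in VOWELS:
        stem = stem[:-1]
    if stem.endswith("a"):
        return stem, "F"
    # Assumption: endings not covered by the description stay unassigned.
    return noun, "unassigned"


# Contact-language forms taken from Table 7.
for form, gender in [("pulman", "N"), ("avli", "F"), ("aŋxos", "N"), ("beibilino", "N")]:
    print(form, "->", assign_gender(form, gender))
```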

5.4 Verb integration

Verbs from a contact language A are integrated into language B following a number of strategies, which can be presented as follows (Wichmann and Wohlgemuth 2008):

a. “light verb strategy” for cases where a verb (usually ‘to do’) is required to accommodate the loan verb;
b. “indirect insertion” for cases where an affix is used to accommodate the loan verb;
c. “direct insertion” for “a process whereby the loan verb is plugged directly into the grammar of the target language with no morphological or syntactic accommodation” (Wichmann and Wohlgemuth 2008: 99);
d. “paradigm transfer” for cases where the loan verb is accompanied by the verb morphology and its meanings.

These types of verb integration are illustrated in the following sections with the data from the corpora under study.
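For annotation purposes, the four strategies can be represented as a closed set of labels with a simple decision function. The Python sketch below is an assumption-laden illustration: the three boolean features and the order in which they are tested are hypothetical and are not taken from Wichmann and Wohlgemuth (2008), who define the strategies in prose.

```python
from enum import Enum


class Strategy(Enum):
    LIGHT_VERB = "light verb strategy"
    INDIRECT_INSERTION = "indirect insertion"
    DIRECT_INSERTION = "direct insertion"
    PARADIGM_TRANSFER = "paradigm transfer"


def classify_loan_verb(
    light_verb: bool, accommodating_affix: bool, donor_verb_morphology: bool
) -> Strategy:
    """Hypothetical helper: map observed features to one of the four strategies."""
    if light_verb:
        # A verb such as 'do' carries the inflection for the loan verb.
        return Strategy.LIGHT_VERB
    if accommodating_affix:
        # An affix (loan verb marker) accommodates the loan verb.
        return Strategy.INDIRECT_INSERTION
    if donor_verb_morphology:
        # The loan verb keeps the verb morphology of the contact language.
        return Strategy.PARADIGM_TRANSFER
    # No accommodation at all: the verb takes recipient morphology directly.
    return Strategy.DIRECT_INSERTION


# Spanish verbs with Ixcatec tse 'do' (section 5.4.1).
print(classify_loan_verb(True, False, False).value)
# Turkish verbs with Turkish morphology in Thrace Romani (section 5.4.3).
print(classify_loan_verb(False, False, True).value)
```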

5.4.1 Light verb strategy

Spanish verbs in the Ixcatec-Spanish corpus are integrated with the light Ixcatec verbs tse ‘do’ or tsu ‘want’, which receive the person and TMA markers. The light verbs are followed by the infinitive of the Spanish verb without its final consonant, e.g., tse pregúnta ‘ask’, tse Ɂadbertí ‘warn’, tse salbá ‘save’, and tse koresponde ‘love in return’ shown in (28). Phonological adaptation of the Spanish verbs to the Ixcatec tones occurs in that the stressed syllable receives the high tone and the unstressed syllables the mid tone.

(28) Ixcatec corpus of the 1950s < Ixcatec (in plain), Spanish (in bold)

tsee-mi   koɾespondé      ʃkãhũ     thĩhngì
do-ANTIP  love_in_return  tomorrow  past
‘Therefore our son will love (us) in return in the future.’ (Fernández de Miranda 1961: T7-78, my glosses, my translation from Spanish. Tones are transcribed as follows: high is transcribed as ˊ on the vowel, low as ˋ on the vowel, and mid is not noted but applies to all vowels which are not high or low.)

5.4.2 Indirect insertion

Wichmann and Wohlgemuth (2008: 105) illustrate the indirect insertion of loan verbs with examples from Nordic languages which use specific affixes for Latin and Old French loan verbs. Similarly, in Yaqui (Uto-Aztecan), a Nahuatl verb class marker is used to accommodate Spanish verbs.


Indirect insertion of loan verbs through so-called “loan verb markers” is also widespread in Romani (Bakker 1997b; Matras 2002; Elšík and Matras 2006: 324–333). The use of loan verb markers in Romani originates from inflectional or derivational affixes that do not keep their grammatical value: i.e., the forms derived from the Greek aorist -is-/-as-/-os-, as well as those derived from the Greek present tense -iz-/-az-/-oz- and -in-/-an-/-on-, which are particularly common in the Vlax Romani branch. These loan verb adaptation markers were often borrowed from Greek in Early Romani in order to accommodate the Greek loan verbs (for a discussion see Matras 2002: 128‒134).

Indirect insertion also occurs in the Balkan Slavic languages, as noted for Pomak in Adamou (2012a). Similarly, when Greek verbs are used in Balkan Slavic Nashta they consistently take the Greek aorist marker -s-. The -s- is maintained in the imperfective although in Greek it would no longer have an -s- but a -z- (see Table 11). It appears that the Greek aorist marker -s- does not keep the aoristic value, and thus functions as a loan verb marker.

Table 11: Nashta. Perfective and imperfective past of loan verbs from Greek in second plural form

2PL past perfective          2PL past imperfective
miriˈsaxme and miriˈsnaxme   miˈrisaxme              ‘to smell’
isixaˈsaxme                  isiˈxasaxme             ‘to calm’
areˈsaxme                    aˈresaxme               ‘to like’
idopiˈsaxme                  idoˈpisaxme             ‘to inform’
kiniˈsaxme                   kiˈnisaxme              ‘to move’
simorfoˈsaxme                simoˈrfosaxme           ‘to make obey’

Loan verb markers may either be used only with loan verbs from a single contact language (Wohlgemuth 2009: 98), or can be used with any loan verb, in a process of “borrowing of accommodation patterns” (Wohlgemuth 2009: 224). This is the case for the -s- Greek loan verb marker which, in most Balkan languages, serves to accommodate not only Greek loan verbs but also Turkish verbs (Adamou 2012a). Similarly, the Greek loan verb markers remained productive in many Romani dialects even when Romani speakers had lost their active knowledge of Greek.

5.4.3 Paradigm transfer

The use of tense, mood, and aspect (TMA) markers together with the loan verb is referred to as “paradigm transfer” in Bakker (1997) or “parallel system borrowing” in Kossmann (2010). In a cross-linguistic perspective Wohlgemuth (2009) notes that this strategy represents only 1% of his sample and is localized in the Eastern Mediterranean area, namely in Romani and Kormakiti (an Arabic language heavily influenced by Greek). Romani is indeed one of a handful of languages that use a contact-language verb together with the TMA and person markers of the contact language. This type of non-integration of the verbs from contact language A in the morphology of language B has been documented for North Russian Romani (Rusakov 2001), Finnish Romani (Granqvist 2000), Crimean Romani and Lithuanian Romani (Elšík and Matras 2006: 135), and several Vlax and Balkan Romani dialects (Adamou 2010; Friedman 2013).

In Thrace Romani, Turkish verbs consistently take Turkish verb morphology, i.e., various TMA markers, such as preterite, progressive, future, and optative; as well as person markers, negation, causative, and the reflexive and passive verb forms. We note that in Thrace Romani the Turkish evidential morpheme was reanalysed as an adverb with the Greek meaning of reporting the truth value of a statement (Adamou 2012a); see section 7.4 for more details. The examples in (29) offer an illustration of Turkish verbs with Turkish morphology in Thrace Romani. The list of the Turkish verbs and their frequency in the corpus is presented in Chapter 4.

Thrace Romani corpus < Romani (in plain), Turkish (in bold)

(29)

a.  me       evlen-me-dim
    1SG.NOM  marry-NEG-PRET.1SG
    ‘Me, I did not get married!’ (Adamou, unpublished corpus)

b.  kaj    jaz-dər-adʒ-an
    how.Q  write-CAUS-FUT-2SG
    ‘How will you have it written?’ (Adamou, unpublished corpus)

c.  e       patišaja  ep            emred-ijo-lar
    DEF.PL  kings     all_the_time  give_orders-PROG-3PL
    ‘The kings, they are giving orders all the time.’ (Adamou 2008. Excerpt from The Louse and the Rom, sentence 3. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

Table 12 shows the Romani and the Turkish conjugation for present and progressive. It can be seen that the Turkish verbs in Romani follow the Turkish conjugation with minimal adaptation in terms of phonology.


Table 12: Romani and Turkish verbs in Thrace Romani

      Romani verb   Turkish verb       Turkish verb in Romani
1SG   orbisar-av    konuşuyor-um       konuʃijor-um
2SG   orbisar-es    konuşuyor-sun      konuʃijor-sun
3SG   orbisar-el    konuşuyor          konuʃijor
1PL   orbisar-as    konuşuyor-uz       konuʃijor-us
2PL   orbisar-en    konuşuyor-sunuz    konuʃijor-sunus
3PL   orbisar-en    konuşuyor-lar      konuʃijor-lar

Elšík and Matras (2006: 134–36) mention for several Romani dialects the existence of split inflections depending on the person. In these cases, third person markers are more likely to be taken from the contact language, followed by second person and then first person markers. In Thrace Romani, however, no split is observed in the inflection of Turkish verbs. Table 13 presents the frequency of Turkish and Romani verbs in Thrace Romani with respect to the person in order to examine a possible preference for some persons. It can be seen that third person singular is the most frequent person for both Romani and Turkish verbs, followed by first person singular in both languages. Second person singular ranks third for both languages, followed by third person plural. Second and first person plural forms are the rarest for Turkish verbs as they are for Romani verbs. To conclude, the frequency of person for the Turkish verbs follows the frequency of person for the Romani verbs.

Table 13: The Thrace Romani corpus: Frequency of Turkish and Romani verbs with respect to person

Person   Turkish verbs (tokens)   Romani verbs (tokens)
3SG      78                       601
1SG      34                       203
2SG      19                       160
3PL      15                       126
2PL      8                        38
1PL      1                        29
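The comparison drawn from Table 13 can be checked with a few lines of code. The following is a minimal sketch in Python, not the procedure used for the study: it converts the token counts of Table 13 into percentage shares and tests whether the two sets of verbs rank the persons in the same order. The function names `shares` and `ranking` are illustrative choices.

```python
# Token counts transcribed from Table 13 (Thrace Romani corpus).
PERSONS = ["3SG", "1SG", "2SG", "3PL", "2PL", "1PL"]
TURKISH = dict(zip(PERSONS, [78, 34, 19, 15, 8, 1]))
ROMANI = dict(zip(PERSONS, [601, 203, 160, 126, 38, 29]))


def shares(counts):
    """Return each person's share of the verb tokens, in percent."""
    total = sum(counts.values())
    return {person: 100 * n / total for person, n in counts.items()}


def ranking(counts):
    """Return the persons ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)


turkish_shares = shares(TURKISH)
romani_shares = shares(ROMANI)
same_order = ranking(TURKISH) == ranking(ROMANI)

for person in ranking(ROMANI):
    print(f"{person}: Turkish {turkish_shares[person]:5.1f}%  "
          f"Romani {romani_shares[person]:5.1f}%")
print("Same frequency ranking:", same_order)
```

With the counts as listed, both rankings come out identical, which is the pattern described in the text. An identical ranking says nothing about whether the shares themselves differ significantly; that would require a separate test.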

Table 14 shows the frequency of the various TMA markers used in the Thrace Romani corpus with a Turkish verb. It appears that Turkish verbs are more frequently used in the progressive, followed by the preterite, the negation, and the future. The relatively low use of the optative marker is probably due to the fact that the Turkish optative affix -(y)A-, as in uzanayim ‘to lie down’, is in variation with the Romani complementizer te, e.g., with the present tense in te bekler ‘to wait’; te konušur ‘to talk’; with the progressive in te japištijorlar ‘to stick’.

Table 14: The Thrace Romani corpus: Frequency of Turkish TMA markers with Turkish verbs

TMA markers   Tokens
Progressive   51
Preterite     37
Negation      17
Future        15
Causative     13
Optative      12
Present       1

Last, it is important to examine the origin of the elements surrounding the Turkish verbs. If Turkish items follow and precede the Turkish verbs, we could conclude that the use of the Turkish verbs is triggered by the use of other Turkish elements. The analysis of the corpus shows clearly that Turkish verbs in Thrace Romani are not triggered by an immediately preceding or following element from Turkish (Adamou and Granqvist 2014). The results indicate that a majority of Romani words precede the Turkish verbs. Moreover, a clear majority of Romani words follows the Turkish verbs; see Figure 32.

Figure 32: The Thrace Romani-Turkish-Greek corpus: origin of words preceding Turkish verbs (adapted from Adamou and Granqvist 2014)

Similar to Thrace Romani, in the Finnish Romani-Finnish corpus, Finnish verbs are integrated with Finnish verb morphology, including person, mood and tense marking. Finnish negation is also used as well as participles and infinitives, e.g., syntynyt ‘born’, puhumassa ‘speaking’. In contrast, Finnish verbs with Romani morphology are very rare in the corpus (N = 14 out of 225) (see Adamou and Granqvist 2014). They mainly occur for the verb tykätä ‘to like’, adapted in Romani as tykkuv-, and more rarely the verb kantaa ‘to carry’, shown in (30).

Finnish Romani corpus < Finnish (in bold)

(30) me    kant-otommas   paani
     1SG   carry-PST.1SG  water.NOM
     ‘I carried water.’ (Adamou and Granqvist 2014)

The Finnish imperatives kato ‘look!’ and kuule ‘hear!’ are frequent in the Finnish Romani corpus and are used as tags in both Finnish Romani-dominant clauses and Finnish-dominant clauses. The Finnish auxiliary olla ‘to be’ may be used to form the past tense in Romani (Adamou and Granqvist 2014). In the Finnish Romani clauses, Finnish verbs are preceded by a Romani word in 78% of the cases, showing that the Finnish verbs are not triggered by the use of a Finnish word. Finnish verbs are also followed by a Romani word in 65% of the cases (Adamou and Granqvist 2014). See Figure 33.

Figure 33: The Finnish Romani-Finnish corpus: origin of words preceding and following Finnish verbs (adapted from Adamou and Granqvist 2014)
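The adjacency counts reported for the Turkish and Finnish verbs can be illustrated with a short sketch. The Python code below is only an illustration under stated assumptions: the token format (word form, language, part of speech), the placeholder clause, and the names `neighbour_origins` and `percentages` are hypothetical and do not reproduce the annotation scheme or the data of the corpora.

```python
from collections import Counter

# Hypothetical annotated clause: (word form, language, part of speech).
# The forms and tags are placeholders, not corpus data.
CLAUSE = [
    ("w1", "romani", "PRON"),
    ("w2", "turkish", "VERB"),
    ("w3", "romani", "NOUN"),
    ("w4", "romani", "ADV"),
    ("w5", "turkish", "VERB"),
    ("w6", "turkish", "NOUN"),
]


def neighbour_origins(tokens, verb_language):
    """Count the language of the word immediately before and after
    each verb from `verb_language`. Clause edges are skipped."""
    before, after = Counter(), Counter()
    for i, (_, language, pos) in enumerate(tokens):
        if language != verb_language or pos != "VERB":
            continue
        if i > 0:
            before[tokens[i - 1][1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1][1]] += 1
    return before, after


def percentages(counter):
    """Convert raw counts into rounded percentages."""
    total = sum(counter.values())
    if total == 0:
        return {}
    return {lang: round(100 * n / total) for lang, n in counter.items()}


before, after = neighbour_origins(CLAUSE, "turkish")
print("preceding:", percentages(before))
print("following:", percentages(after))
```

In a real corpus the function would be applied clause by clause and the counters summed. A high share of matrix-language neighbours on both sides is what the text takes as evidence against triggering.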


5.5 Discussion

The degree of integration is an important criterion in order to evaluate whether a contact word is more of the “borrowing” or of the “codeswitching” type. As several authors have already noted, the analysis of our corpora confirms that phonological integration of contact words is often variable for a single speaker. It also shows that, inversely, morphological integration of NPs and verbs from a contact language follows a more consistent pattern within each community.

Wichmann and Wohlgemuth (2008: 12) propose a correlation between loan verb integration strategies and the “degrees to which speakers of the target language are exposed to the source language(s)”. This is only partially confirmed by the case studies examined in this book, if one compares verb integration strategies with the overall ratio of contact-language words, using the latter as “predictors” of contact intensity. For example, Ixcatec, with low overall numbers of contact-language word-tokens, uses the “light verb” strategy to accommodate Spanish verbs. But Balkan Slavic Nashta, with equally low numbers of contact-language tokens, favours the indirect insertion strategy for Greek verbs. Moreover, Thrace Romani and Finnish Romani, with high overall numbers of contact-language tokens, employ a “paradigm transfer” for Turkish and Finnish verbs respectively. However, Molise Slavic, with high rates of contact-language words, prefers “direct insertion” for the accommodation of Italian verbs. To conclude, the data show no correlation between loan verb strategies and rates of contact words.

The fact that, in Romani-dominant speech, Turkish and Finnish verbs are not integrated informs ongoing discussion on the cognitive preferences of bilinguals. According to Myers-Scotton and Jake (2014: 4), it is less costly for the bilingual speaker to integrate the verb into the morphology of language B, since this only requires the speaker to control the semantic-pragmatic features but not to check the congruence between the two languages. This generalization is made for the so-called “classic codeswitching” and not for the “composite codeswitching”, a type of bilingual speech which according to Myers-Scotton is characterized by strong convergence between the two languages in contact. In Thrace Romani, however, we observe consistent use of contact-language verbs with the complex verb morphology of the contact language, realized smoothly and in rapid speech, although there is no convergence between the languages in contact as discussed in detail in Chapter 7. In order to understand the cognitive processes of these Romani bilinguals, psycholinguistic experiments seem necessary.

Chapter 6

Inter-speaker variation

6.1 Background

The rate of contact words in a bilingual corpus is an average taking into account all-speaker word-tokens; see Chapter 3. This means that variation across individuals could be very high. This chapter presents an analysis of the corpora under study by looking at “inter-speaker variation”. It is based on the variationist framework, which combines methods from linguistics, sociology, and statistics (Labov 1971, 1984; D. Sankoff 1982, 1988; G. Sankoff 1974; G. Sankoff and Labov 1985; Tagliamonte and Baayen 2012). The use of the variationist analysis allows determining the factors, both linguistic and extralinguistic, that may govern the use of a word from the languages in contact. It is thus possible to check for individual patterns and relate them to social factors, such as age, sex, location, professional occupation, and education. Moreover, by comparing the rates of the total number of contact words for all the speakers with individual productions, one obtains an idea of the pattern of language mixing that prevails in a given bilingual community. Unfortunately, individual variation analysis is not always possible: some samples are based on higher numbers of speakers, others on higher numbers of texts for a single speaker, some samples are more sociolinguistically homogeneous than others, etc.

This chapter is structured as follows: Section 6.2 presents inter-speaker variation with regard to all contact words. Section 6.3 examines the differences among individual speakers for borrowings and for codeswitching. Section 6.4 deals with inter-speaker variation with respect to content words, such as nouns and verbs. Last, a summary and discussion are presented in section 6.5.

6.2 Inter-speaker variation for contact words

6.2.1 The Ixcatec-Spanish corpus

The Ixcatec-Spanish corpus is based on the speech of the four most fluent speakers of Ixcatec, two male and two female. Unlike the majority of the Ixcatecs born in the first part of the twentieth century, the last Ixcatec speakers learned the language from having attended school less than their peers had, and from having spent time with their monolingual Ixcatec grandparents. Three of the Ixcatec speakers grew up bilingual. According to them, they learned Ixcatec out of personal interest for the language, by paying close attention to the Ixcatec interactions around them. Only one of the speakers was raised in a monolingual Ixcatec environment and recalls going to school at age six with no knowledge of Spanish. It was through the socialisation process at school that she acquired Spanish, which she now speaks fluently.

As can be seen in Figure 34, the analysis of the Ixcatec-Spanish corpus shows little inter-speaker variation with respect to the use of Spanish word-tokens. No difference is found in the rates of Spanish words between the speaker who was brought up monolingual, RRM, and the other bilingual speakers.

Figure 34: The Ixcatec-Spanish contemporary corpus: inter-speaker variation of current-contact language word-tokens

6.2.2 The Balkan Slavic Nashta-Greek corpus

The analysis of the Balkan Slavic Nashta corpus reveals little inter-speaker variation among the three fluent speakers; see Figure 35. If the semi-speaker’s production is integrated into this account, then the fluent speakers show a standard deviation of –1 to –2 points with respect to the overall average of 5% contact words. Age is the main factor for the variation observed between the last fluent speakers (in their 80s at the moment of the recordings) and the semi-speaker (speaker XF, 56 years old at the moment of the recording), who uses 16% Greek tokens. As noted in Chapter 4, the majority of the Greek words in the speech of the semi-speaker are lengthy, flagged codeswitching insertions, unlike the speech of the elders, characterized by single contact-word insertion.


Figure 35: The Balkan Slavic Nashta-Greek corpus: Inter-speaker variation of current-contact language word-tokens

6.2.3 The Thrace Romani-Turkish-Greek corpus

Table 15 shows the distribution of the words in the Thrace Romani corpus with respect to the contact languages, Turkish and Greek. The results are then discussed in relation to four extralinguistic factors: location, age, language shift, and grouping according to family and peers (Adamou 2015).

Table 15: The Thrace Romani corpus: Distribution of word-tokens per speaker and language

Sp.   Age   Sex   Location   Romani words   Turkish words   Greek words   Total words
9     28    M     Komotini   1014 (83%)     202 (16%)       10 (1%)       1226 (100%)
8     34    F     Komotini   887 (79%)      197 (18%)       36 (3%)       1120 (100%)
10    34    M     Komotini   856 (84%)      133 (13%)       25 (3%)       1014 (100%)
4     50    F     Drosero    519 (84%)      76 (12%)        23 (4%)       618 (100%)
1     37    F     Drosero    153 (64%)      69 (29%)        18 (7%)       221 (100%)
2     34    F     Kirnos     144 (68%)      29 (14%)        39 (18%)      212 (100%)
3     34    F     Kirnos     141 (70%)      18 (9%)         42 (21%)      201 (100%)
6     23    F     Drosero    170 (94%)      8 (5%)          2 (1%)        180 (100%)
5     26    F     Drosero    93 (71%)       35 (27%)        3 (2%)        131 (100%)
7     12    F     Drosero    55 (90%)       5 (8%)          1 (2%)        61 (100%)
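The per-speaker shares in Table 15 follow directly from the token counts, and the arithmetic is easy to re-run. The Python sketch below is an illustration rather than the author's procedure: it recomputes each speaker's language shares from the three counts and flags any row whose counts do not add up to the listed total. The helper name `language_shares` and the tuple layout are assumptions made for this example.

```python
# Rows transcribed from Table 15:
# (speaker, Romani, Turkish, Greek, listed total).
ROWS = [
    (9, 1014, 202, 10, 1226),
    (8, 887, 197, 36, 1120),
    (10, 856, 133, 25, 1014),
    (4, 519, 76, 23, 618),
    (1, 153, 69, 18, 221),
    (2, 144, 29, 39, 212),
    (3, 141, 18, 42, 201),
    (6, 170, 8, 2, 180),
    (5, 93, 35, 3, 131),
    (7, 55, 5, 1, 61),
]


def language_shares(romani, turkish, greek):
    """Return the Romani, Turkish, and Greek shares in percent,
    computed over the sum of the three counts."""
    total = romani + turkish + greek
    return tuple(round(100 * n / total, 1) for n in (romani, turkish, greek))


mismatches = []
for speaker, romani, turkish, greek, listed_total in ROWS:
    ro, tr, el = language_shares(romani, turkish, greek)
    print(f"speaker {speaker:>2}: Romani {ro}%  Turkish {tr}%  Greek {el}%")
    if romani + turkish + greek != listed_total:
        mismatches.append(speaker)

print("Rows whose counts do not sum to the listed total:", mismatches)
```

With the figures as transcribed here, the check flags speaker 1, whose three counts sum to 240 rather than the listed 221. Which figure is at fault cannot be determined from the table alone, so the shares for that row are best treated with caution.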

The Thrace Romani corpus contains data from ten speakers living in three diﬀerent locations. Three speakers live in the suburbs of the city of Komotini; see Map in Figure 36. Five speakers are settled in Drosero, in the suburbs of the city of Xanthi. These two groups have everyday contact and intermarry. In addition, there are two other speakers who grew up in Drosero but currently live in Kirnos, in the suburbs of Xanthi. Unlike Drosero, which is constituted of a majority of Romani trilingual speakers, Kirnos is composed of a majority of L1 or L2 Turkish speakers.


Figure 36: Map of the area of Thrace, Greece

Figure 37 presents the ten trilingual Romani speakers following the distribution of contact words. One sees that the Romani speakers from Komotini and the 50-year-old speaker from Drosero show a similar pattern of language mixing. One further notes that the speakers living in Drosero are less grouped together. In contrast, the two speakers who are settled in the Turkish-speaking community of Kirnos show a very similar pattern of language mixing.

Figure 37: The Thrace Romani-Turkish-Greek corpus: Distribution of word-tokens per language for 10 speakers, with respect to location (in light grey speakers from Kirnos, in dark grey speakers from Komotini, in black speakers from Drosero; circles indicate female participants, diamonds male participants)


Figure 38 presents the same Romani speakers but compares the rates of contact words with the degree of language shift. It can be seen that the two speakers who live in the Turkish-speaking community of Kirnos again have similar patterns of language mixing.

Figure 38: The Thrace Romani-Turkish-Greek corpus: Distribution of word-tokens per language for 10 speakers, with respect to language shift (in light grey speakers who have shifted to Turkish in their everyday life, in black speakers who use Romani in their everyday life; circles indicate female participants, diamonds male participants)

Figure 39 divides the Romani speakers into age groups. One sees that the youngest speakers produce very few Turkish words. However, the sample and the productions in this corpus are too small as yet to conﬁrm a general tendency to abandon the Romani-Turkish mixing. Last, Figure 40 categorizes speakers depending on their families and peers. It can be seen that speakers who belong to the same families or who are amongst peers use similar patterns for mixing the three languages: Romani, Turkish, and Greek. One observes that the three female friends, all in their 30s, are similar in their distribution of contact words (in black in the ﬁgure). This similarity is also due to their interactions with the other participants: the three women were conversing with a male Romani speaker in his 30s who had shifted to Greek and exchanged the most with the researcher. One further observes that the speakers living in Komotini (in dark grey in the ﬁgure), who are also members of the same family, show similar language usage. Their language mixing is similar to that of the elder speaker in Drosero. Lastly, one sees that the two youngest girls use a similar pattern, but one which diﬀers from that used by the other family members (in light grey in the ﬁgure).


Figure 39: The Thrace Romani-Turkish-Greek corpus: Distribution of word-tokens per language for 10 speakers, with respect to age (in black are speakers under 29, in dark grey, speakers ages 30-49, in light grey, the 50-year-old speaker; circles indicate female participants, diamonds male participants)

Figure 40: The Thrace Romani-Turkish-Greek corpus: Distribution of word-tokens per language for 10 speakers, with respect to family and peers (speakers belonging to a single family or group of peers are indicated with the same colour: black, light grey, and dark grey; circles indicate female participants, diamonds male participants)


6.2.4 The Finnish Romani-Finnish corpus

The Finnish Romani-Finnish corpus was gathered from three female speakers, of similar age and background. An analysis of the Finnish Romani corpus reveals some variability in the use of Finnish Romani and Finnish insertions. As shown in Figure 41, speaker 1 uses 45% Finnish words, speaker 2 uses 24% Finnish words, and speaker 3 uses 35% Finnish words (Adamou and Granqvist 2014).

Figure 41: The Finnish Romani-Finnish corpus: Distribution of language mixing among speakers (adapted from Adamou and Granqvist 2014)

6.3 Inter-speaker variation for borrowing and codeswitching

In section 6.2, I examined inter-speaker variation with respect to the rates of contact-language words, both borrowings and codeswitching insertions included. I now turn to analyse inter-speaker variation with respect to borrowings and codeswitching.

6.3.1 The Slavic corpora

Following the criteria discussed in Chapter 5 with respect to the borrowing/codeswitching distinction, the phonologically and morphologically non-integrated or multi-word insertions from the current-contact languages in the Slavic corpora of the EuroSlav programme were tagged as codeswitching insertions; the integrated, single-word or two-word insertions were tagged as borrowings (Adamou et al. 2015). We then counted all the words shared with the languages in contact and calculated the percentage they represent in the total number of words in the corpus. Table 16 breaks down the composition of the sample for the four Slavic corpora: Molise Slavic, Balkan Slavic, Burgenland Croatian, and Colloquial Upper Sorbian.


Table 16: The sample of the EuroSlav corpus

               Molise Slavic   Balkan Slavic   Burgenland Croatian   Colloquial Upper Sorbian   Total
Male           6               1               4                     5                          16
Female         7               5               6                     3                          21
Young ≤ 49     2               1               2                     6                          11
Middle 50–70   4               2               3                     1                          10
Old ≥ 70       7               3               5                     1                          16
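The tagging criteria described in this section can be expressed as a small decision rule. The Python sketch below is one possible reading of those criteria, not the annotation procedure of the EuroSlav programme: it assumes that an insertion counts as integrated only when it is integrated both phonologically and morphologically, and that “multi-word” means more than two words. The `Insertion` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Insertion:
    """One stretch of contact-language material in a transcript.

    The field names are hypothetical stand-ins for whatever
    annotation layers a corpus actually provides.
    """
    n_words: int
    phonologically_integrated: bool
    morphologically_integrated: bool


def tag(insertion):
    """Borrowing: integrated and at most two words long.
    Everything else is tagged as a codeswitching insertion."""
    integrated = (insertion.phonologically_integrated
                  and insertion.morphologically_integrated)
    if integrated and insertion.n_words <= 2:
        return "borrowing"
    return "codeswitching"


def rates(insertions, corpus_size):
    """Percentage of corpus word-tokens falling under each tag."""
    words = {"borrowing": 0, "codeswitching": 0}
    for insertion in insertions:
        words[tag(insertion)] += insertion.n_words
    return {label: 100 * n / corpus_size for label, n in words.items()}


# Illustrative input only: four insertions in a 100-word text.
sample = [
    Insertion(1, True, True),
    Insertion(2, True, True),
    Insertion(4, False, False),
    Insertion(1, False, False),
]
print(rates(sample, corpus_size=100))
```

Partially integrated insertions, for example morphologically but not phonologically integrated ones, fall under codeswitching in this sketch. That choice is an assumption and would need to be checked against the annotation guidelines.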

Figure 42 summarizes the rates of borrowings for the Slavic languages under study presented separately in Chapter 3. It appears that the Molise Slavic-Italian corpus, with 22.6% Italian borrowings, has more borrowings than the three other corpora, i.e., Balkan Slavic, Colloquial Upper Sorbian, and Burgenland Croatian, which have less than 5% borrowings from their respective current-contact languages.

Figure 42: The Slavic corpora: Rates of borrowings with respect to language (Adamou et al. 2015)


Figure 43 presents the rates of borrowings for the Slavic corpora with respect to location, i.e., thirteen locations in total. It can be seen that Molise Slavic speakers, independent of their location (Acquaviva, San Felice, Montemitro), use similar proportions of borrowings, and this despite diﬀerences in the degree of language endangerment across the three localities. Similarities between the various localities are also found for the Burgenland Croatian and the Colloquial Upper Sorbian corpora. However, the analysis reveals that location does make a diﬀerence in the Balkan Slavic corpus, as the speakers of Liti use more Greek borrowings than the speakers of Hrisa. It is important to remember that the two sub-corpora of Balkan Slavic were recorded at diﬀerent moments: the Hrisa corpus was recorded in the 1970s and the Liti corpus in the early 2000s. The observed diﬀerence in the rates of Greek borrowings for the two locations may therefore also be due to changes over time.

Figure 43: The Slavic corpora: Rates of borrowings with respect to location (Adamou et al. in press 2016)

The graphs in the following ﬁgures allow a closer look at the rates of borrowings by comparison between individual uses within each language community. Speakers are grouped by language and their initials are presented in alphabetical order. Note that the graphs present the results with a loop eﬀect depending on the community’s overall rates.


Figure 44 conﬁrms that, in terms of rates, Balkan Slavic speakers from Hrisa, AN and MK, use fewer Greek borrowings than speakers from Liti do. It also shows that the semi-speaker from Liti, XF, uses very few borrowings. However, as we saw in the previous chapters, this speaker uses a high number of codeswitching insertions which results in overall high rates of contact-language words diﬀerentiating her from the elder, ﬂuent speakers.

Figure 44: The Balkan Slavic-Greek corpus: Inter-speaker variation for borrowings

The data from Burgenland Croatian, in Figure 45, show that most speakers are grouped in the same dispersion zone with less than 5% borrowings. This figure also makes it possible to identify an “outlier”, speaker HH, who uses 12% borrowings.


Figure 45: The Burgenland Croatian-German corpus: Inter-speaker variation for borrowings

Visual inspection of inter-speaker variation for Colloquial Upper Sorbian, graphed in Figure 46, shows some variability within the range of 0‒6% borrowings. The analysis of the Molise Slavic corpus, shown in Figure 47, reveals little inter-speaker variation. Indeed, most speakers produce 20‒24% Italian borrowings and there are only three outliers (EM, MR, and NG).


Figure 46: The Colloquial Upper Sorbian-German corpus: Inter-speaker variation for borrowings

Figure 47: The Molise Slavic-Italian corpus: Inter-speaker variation for borrowings

Due to the complexity of inter-speaker variation, a statistical analysis was conducted in collaboration with Rachel Shen with the Conditional Inference Recursive Partitioning Tree. For this analysis, the rate of “borrowings” was the dependent variable; “individual speaker”, “age”, “sex”, “language” (four languages), “location” (13 locations), “recording session”, and “text type” (elicited or spontaneous) were the random factors. The results, in Figure 48, show that the Molise Slavic language is significantly different from the other three languages in terms of the rates of borrowings. An analysis with Data Exploratory Using Random Forests (Breiman 2001) following Tagliamonte and Baayen (2012) confirms these results and the relevance of the predictor “language” (Adamou et al. in press 2016).

Figure 48: The Slavic corpora: Conditional Inference Recursive Partitioning Tree for borrowings

This finding indicates that the speakers of the Slavic minority languages under study use similar rates of “borrowings” from their current-contact languages, conforming to patterns of borrowing which prevail in their bilingual community. Several statistical models tested the relation of the rates of “borrowings” with three factors, “language”, “age”, and “sex”. The models show that “language” increases the model’s prediction (χ²(3) = 31.1, p < .001) (Adamou et al. in press 2016).
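The reported threshold, p < .001 for χ²(3) = 31.1, can be verified from the statistic alone. The Python sketch below uses the closed-form upper-tail probability of a chi-square distribution with three degrees of freedom. It does not refit the models, and the function name `chi2_sf_3df` is an illustrative choice.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square variable with 3 degrees
    of freedom, using the closed form
    erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)."""
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))


p_value = chi2_sf_3df(31.1)
print(f"p = {p_value:.2e}")
print("below .001:", p_value < 0.001)
```

Three degrees of freedom are what one would expect for a four-level factor such as “language”, although the source does not spell out the model comparison behind the statistic.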


Figure 49 allows the comparison of the rates of borrowings for each language with respect to sex. It can be seen that male and female speakers within each language community have similar rates of borrowings. Figure 50 also shows that no age diﬀerences can be found within each language community for the rates of borrowings.

Figure 49: The Slavic corpora: Rates of borrowings with respect to language and sex


Figure 50: The Slavic corpora: Rates of borrowings with respect to language and age group

Lastly, “codeswitching” was tested with respect to the rates of borrowings. An analysis of the correlation between the amount of codeswitching insertions and borrowings using Spearman’s test shows that there is no correlation between the two factors (ρ = 0.2); see Figure 51.


Figure 51: The Slavic corpora: Correlation between borrowing and codeswitching
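Spearman's test, used above to relate codeswitching and borrowing, amounts to a Pearson correlation computed on ranks. The Python sketch below shows the calculation with tied values receiving the mean of their ranks. The per-speaker rates are hypothetical values chosen for illustration and are not taken from the EuroSlav data.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-speaker rates (percent), for illustration only.
borrowing_rates = [2.0, 4.5, 3.0, 22.0, 21.0, 5.0]
codeswitching_rates = [1.0, 0.0, 6.0, 2.0, 0.5, 3.0]
rho = spearman_rho(borrowing_rates, codeswitching_rates)
print(f"rho = {rho:.2f}")
```

A value of rho near zero, as with these made-up figures, indicates that speakers who borrow more do not systematically codeswitch more or less.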

6.3.2 The Finnish Romani-Finnish corpus

The study of inter-speaker variation with respect to borrowings and codeswitching sheds light on the differences in the overall ratios of contact words discussed in section 6.2.4 for the Finnish Romani speakers. Figure 52 shows that the three Finnish Romani speakers produce similar rates of Finnish borrowings in the Romani-dominant parts of their speech, ranging from 12% to 22%. Speaker 1 and speaker 2 show similar rates of Finnish borrowings, respectively 22% (767) and 21% (854). In contrast, speaker 3 shows lower rates of Finnish borrowings, namely 12% (313).


Figure 52: The Finnish Romani-Finnish corpus: Distribution of Finnish borrowings among speakers (based on unpublished data from Adamou and Granqvist 2014)

The analysis of inter-speaker variation with respect to codeswitching reveals more variability; see Figure 53. It can be seen that speaker 1 and speaker 3 codeswitch more than speaker 2.

Figure 53: The Finnish Romani-Finnish corpus: Distribution of Finnish codeswitching insertions among speakers (based on unpublished data from Adamou and Granqvist 2014)

When we compare the results from the study of the rates of Finnish borrowings with the rates of Finnish codeswitching, we can conclude that speaker 1 is using both Finnish borrowings and codeswitches to Finnish. Speaker 2 practically never codeswitches to Finnish but uses relatively high rates of Finnish borrowings. Speaker 3 codeswitches to Finnish but uses relatively fewer borrowings than the two other speakers. These differences are not due to extralinguistic factors such as “sex”, “age” or “education” as the Finnish Romani data come from three female speakers, of similar age and background.

6.4 Inter-speaker variation for borrowed nouns and verbs

Another interesting question as far as borrowing is concerned has to do with the way an individual speaker would vary in the use of borrowed content words. A study of inter-speaker variation in relation to borrowed nouns and verbs from the current-contact language is presented in this section.

6.4.1 The Ixcatec-Spanish corpus

The analysis of inter-speaker variation with respect to the rates of borrowed nouns for the four Ixcatec speakers shows that three of the speakers produce similar proportions of Spanish nouns (approx. 20%), but speaker CRG uses the lowest percentage of Spanish nouns (12%); see Figure 54.

Figure 54: The Ixcatec-Spanish contemporary corpus: Distribution of borrowed nouns per language for four speakers

Figure 55 compares the rates of borrowed nouns in the contemporary Ixcatec corpus with the Ixcatec texts of the 1950s (Fernández de Miranda 1960). It can be seen that the Ixcatec corpus of the 1950s shows more than 33% Spanish nouns, or 10% more borrowed nouns than those encountered in the contemporary Ixcatec corpus.


Figure 55: The Ixcatec-Spanish corpora: Distribution of borrowed nouns for the contemporary Ixcatec-Spanish corpus and the Ixcatec-Spanish corpus of the 1950s

This result is intriguing. It is not possible to understand the higher rates of borrowed nouns in the corpus of the 1950s in terms of language competence. The speaker of the 1950s, Doroteo Jiménez, was not only a fluent speaker, but he also had the chance of speaking and listening to Ixcatec much more than the speakers recorded in the beginning of the twenty-first century. It is also not possible to explain this difference through the topics discussed in the two corpora. The texts of the 1950s are mainly traditional stories whereas the contemporary texts of the women are everyday conversations about present-life events. We would therefore expect that traditional stories would include fewer nouns from Spanish and that modern-life conversations would include more Spanish nouns, but the opposite occurs.

In order to examine whether the rates of Spanish nouns may be related to the topic of the Ixcatec texts of the 1950s, a separate analysis was carried out for each text. The analysis of the individual texts of the 1950s shows great variability as far as the rates of Spanish nouns are concerned, ranging from 17% to 56%; see Figure 56. The text with most Spanish nouns, Rey y cura ‘The king and the priest’, is a translation of a Spanish tale in Ixcatec and as such it might require the use of Spanish nouns which are related to the Spanish pragmatic context. Two other texts, the Señor de las tres caídas and Cuarto Viernes, discuss political-military and religious events, thus once again events that are pragmatically related to the dominant, Spanish language.


Figure 56: The Ixcatec-Spanish corpus of the 1950s: Distribution of nouns with respect to language

Similar variation is found in the rates of Spanish verbs used by Doroteo Jiménez, which range from 0% to 23%; see Figure 57. The texts with the highest rates of Spanish verbs are also the ones with the highest rates of Spanish nouns. We can therefore consider that the rates of Spanish content words in the texts of the 1950s are text-dependent.

Figure 57: The Ixcatec-Spanish corpus of the 1950s: Distribution of verbs with respect to language


6.4.2 The Slavic corpora

The study of inter-speaker variation in relation to borrowed nouns from the current-contact languages is conducted for the Slavic data from the EuroSlav corpus (Adamou et al. 2015). First, Figure 58 graphs the rates of borrowed nouns in all the Slavic corpora. Balkan Slavic speakers produced 10% borrowed nouns, Colloquial Upper Sorbian speakers 13%, Burgenland Croatian speakers 22.4%, and Molise Slavic speakers 46.1%.

Figure 58: The Slavic corpora: Rates of borrowed nouns with respect to language (Adamou et al. 2015)

An analysis of the Slavic dataset by “location” is presented in Figure 59. The analysis shows the use of similar rates of Italian nouns in all the localities of Molise Slavic (Acquaviva, San Felice, and Montemitro). It also shows that Balkan Slavic speakers from Hrisa used practically no Greek nouns, unlike speakers of Liti who used relatively high rates of Greek nouns.

Figure 59: The Slavic corpora: Rates of borrowed nouns with respect to location

Visual inspection of the graphs showing variation for borrowed nouns at the level of individual speakers indicates that the Hrisa speakers (AN and MK) and the semi-speaker of Liti (XF) produce no Greek nouns, while the three other speakers from Liti (AF, XM, and VF) produce roughly 30% Greek nouns; see Figure 60.


Figure 60: The Balkan Slavic-Greek corpus: Inter-speaker variation for borrowed nouns

Inter-speaker variation for borrowed nouns in the Colloquial Upper Sorbian corpus reveals that most speakers produce less than 10% borrowed nouns and speakers BS, SS, and US produce between 20‒30% borrowed nouns; see Figure 61.


Figure 61: The Colloquial Upper Sorbian-German corpus: Inter-speaker variation for borrowed nouns

The analysis of inter-speaker variation in the Burgenland Croatian corpus shows greater variability for borrowed nouns thus calling for further investigation; see Figure 62.


Figure 62: The Burgenland Croatian-German corpus: Inter-speaker variation for borrowed nouns

Finally, the analysis of inter-speaker variation in the Molise Slavic corpus shows that, for most speakers, Italian nouns make up between 40% and 50% of all the nouns; see Figure 63.


Figure 63: The Molise Slavic-Italian corpus: Inter-speaker variation for borrowed nouns

Figure 64 shows the results of the Conditional Inference Recursive Partitioning Tree Model for “borrowed nouns”. The rate of “borrowed nouns” was analysed as the dependent variable, and the random predictors were “language” (four languages), “location” (13 locations), “individual speakers”, “sex”, “age”, “recording session”, and “text type” (elicited or spontaneous). It can be seen that the Molise Slavic language is significantly different from the other three Slavic languages with respect to the amount of borrowed nouns. As with borrowings overall, the rate of borrowed nouns is specific to each language and does not depend on the text type, sex or age of the speaker. This result is confirmed by an analysis with Random Forests (Breiman 2001) showing that “language” is the best predictor (Adamou et al. 2015).
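As a rough illustration of why “language” can emerge as the strongest predictor, the sketch below compares how much of the variation in per-speaker borrowed-noun rates is accounted for when speakers are grouped by language and when they are grouped by sex. It is not the conditional inference tree or the Random Forests analysis reported above: it substitutes a much simpler measure (eta squared, the between-group share of the total sum of squares) and uses only eight speakers, whose counts are copied from Table 17. The names `SPEAKERS` and `eta_squared` are illustrative assumptions.

```python
from collections import defaultdict

# (speaker, language, sex, nouns_total, borrowed_nouns), copied from Table 17.
# Only eight of the 37 speakers are included, so this is an illustration only.
SPEAKERS = [
    ("AG", "MS", "M", 30, 24),
    ("LB", "MS", "M", 183, 79),
    ("MR", "MS", "F", 378, 185),
    ("GP", "MS", "F", 89, 37),
    ("JM", "BC", "M", 148, 35),
    ("AS", "BC", "F", 37, 11),
    ("US", "CUS", "M", 116, 34),
    ("RE", "CUS", "F", 44, 3),
]


def eta_squared(values, groups):
    """Share of the total sum of squares that lies between the groups (0 to 1)."""
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    ss_between = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_group.values()
    )
    return ss_between / ss_total if ss_total else 0.0


# Per-speaker borrowed-noun rate and the two grouping variables.
rates = [borrowed / nouns for _, _, _, nouns, borrowed in SPEAKERS]
languages = [language for _, language, _, _, _ in SPEAKERS]
sexes = [sex for _, _, sex, _, _ in SPEAKERS]

print(f"language: {eta_squared(rates, languages):.2f}")
print(f"sex:      {eta_squared(rates, sexes):.2f}")
```

On this small subset, grouping by language accounts for a larger share of the variation than grouping by sex, which goes in the same direction as the reported result; the published analysis used all speakers and all predictors.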

Figure 64: The Slavic corpora: Conditional Inference Recursive Partitioning Tree for borrowed nouns

Figure 65 shows the rates of borrowed nouns with respect to language and sex. It can be seen that male speakers of Colloquial Upper Sorbian use slightly more borrowed nouns than female speakers, a result also statistically conﬁrmed by general linear mixed models (Adamou et al. in press 2016). For Balkan Slavic, the male speaker produces more borrowed nouns than the female speakers, but this result may be due to the fact that in the corpus of the 1970s, which showed practically no Greek borrowings, there were no male speakers. Also, the female semi-speaker from Liti produced few borrowings and relied mostly on codeswitching.


Figure 65: The Slavic corpora: Rates of borrowed nouns with respect to language and sex

Figure 66 shows the rates of borrowed nouns with respect to language and age group, even though statistical analyses with general linear mixed effects models do not reveal any significant effect (Adamou et al. in press 2016). For Colloquial Upper Sorbian, young speakers produce fewer borrowed nouns than speakers of the middle age group. Age group is not an important factor for Molise Slavic. For the Balkan Slavic data, elder speakers use more borrowed nouns but, as previously noted, most elder speakers were recorded in the 2000s and the younger speakers were recorded in the 1970s when the language was more vital. For Burgenland Croatian, it can be seen that elder speakers use fewer borrowed nouns than middle age group speakers.

Figure 66: The Slavic corpora: Rates of borrowed nouns with respect to language and age group

Table 17 presents the Slavic dataset for each speaker, including the speaker’s age, sex, language, and location. It also includes the overall number of words, of nouns, and the raw number of borrowings and borrowed nouns per speaker.


Table 17: The Slavic data

Speaker  Age  Sex  Language  Location    Words total  Borrowings (tokens)  Nouns total  Borrowed nouns (tokens)
AG       30   M    MS        Montemitro  210          50                   30           24
LB       30   M    MS        Montemitro  1132         282                  183          79
MR       70   F    MS        Montemitro  2511         433                  378          185
NB       70   M    MS        Montemitro  1002         224                  157          84
GP       60   F    MS        Acquaviva   610          122                  89           37
LP       70   M    MS        Acquaviva   1661         394                  265          137
NG       50   M    MS        Acquaviva   1571         474                  337          147
NS       60   F    MS        Acquaviva   1479         325                  125          45
RG       40   M    MS        Acquaviva   1441         305                  231          87
AM       70   F    MS        San Felice  1643         363                  257          120
EM       70   F    MS        San Felice  106          18                   28           11
LM       80   F    MS        San Felice  802          160                  114          54
SG       70   F    MS        San Felice  2767         644                  433          184
JM       62   M    BC        Nikit       704          43                   148          35
AS       47   F    BC        Oslip       335          12                   37           11
JD       11   M    BC        Oslip       280          20                   74           15
KS       79   F    BC        Oslip       259          11                   48           8
MSZI     73   M    BC        Oslip       639          8                    77           1
MSC      63   F    BC        Oslip       371          10                   72           8
MP       73   F    BC        Trausdorf   211          7                    35           5
MSW      70   F    BC        Trausdorf   247          3                    34           3
HH       52   M    BC        Wulk        170          20                   31           18
ID       82   F    BC        Wulk        448          6                    65           1
US       61   M    CUS       Crostwitz   906          57                   116          34
RE       37   F    CUS       Radibor     233          6                    44           3
BS       30   M    CUS       Rosenthal   468          28                   32           9
MS       26   M    CUS       Rosenthal   733          35                   99           7
SS       34   M    CUS       Rosenthal   670          25                   124          25
JE       26   M    CUS       Rosenthal   421          3                    67           3
FH       36   F    CUS       Rosenthal   382          10                   89           5
ML       83   F    CUS       Zerna       535          22                   51           5
AF       80   F    BS        Liti        427          33                   88           24
XM       80   M    BS        Liti        3347         131                  866          57
VF       79   F    BS        Liti        1162         48                   308          31
XF       56   F    BS        Liti        357          116                  NA           0
AN       40   F    BS        Hrisa       2080         22                   NA           0
MK       60   F    BS        Hrisa       1222         8                    NA           0
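The raw counts in Table 17 can be turned into per-speaker rates with a few lines of code. The sketch below assumes that a rate is simply the token count divided by the relevant total (borrowings over all words, borrowed nouns over all nouns) and that “NA” means the noun total is unavailable; both are assumptions about how the table is meant to be read. The helper names `rate` and `speaker_rates` are hypothetical, and only six rows of the table are included.

```python
from typing import Optional

# A few rows copied from Table 17:
# (speaker, language, words_total, borrowings, nouns_total, borrowed_nouns).
# None stands for the table's "NA".
ROWS = [
    ("AG", "MS", 210, 50, 30, 24),
    ("LB", "MS", 1132, 282, 183, 79),
    ("JM", "BC", 704, 43, 148, 35),
    ("US", "CUS", 906, 57, 116, 34),
    ("AF", "BS", 427, 33, 88, 24),
    ("XF", "BS", 357, 116, None, 0),
]


def rate(part: Optional[int], whole: Optional[int]) -> Optional[float]:
    """Return part / whole, or None when a count is missing or the total is zero."""
    if part is None or whole is None or whole == 0:
        return None
    return part / whole


def speaker_rates(rows):
    """Map each speaker to (borrowing rate over words, borrowed-noun rate over nouns)."""
    return {
        speaker: (rate(borrowings, words), rate(borrowed_nouns, nouns))
        for speaker, _language, words, borrowings, nouns, borrowed_nouns in rows
    }


if __name__ == "__main__":
    for speaker, (borrowing_rate, noun_rate) in speaker_rates(ROWS).items():
        noun_text = "NA" if noun_rate is None else f"{noun_rate:.1%}"
        print(f"{speaker}: borrowings {borrowing_rate:.1%}, borrowed nouns {noun_text}")
```

Rates pooled over all speakers of a language in this way need not coincide with the percentages cited in the running text, which may have been computed differently (for example, as means of per-speaker rates).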


6.4.3 The Romani corpora

The Romani corpora from Thrace and Finland show high rates of nouns and verbs from their respective contact languages. More importantly, the Turkish verbs and the Finnish verbs and nouns cannot be classified as either “borrowings” or “codeswitching” due to the lack of morphological integration into the recipient language and to the fact that they are generally surrounded by Romani elements. In order to understand this phenomenon more clearly, I examined the use of current-contact language verbs and nouns with respect to inter-speaker variation.

In the Thrace Romani-Turkish-Greek corpus, I examined four speakers who show 79‒84% Romani tokens in their speech and who are the best represented in the sample as far as the total number of words is concerned. As illustrated in Figure 67, the female 50-year-old speaker uses roughly 11% Turkish nouns, whereas the three speakers who are in their 30s use 28‒41% Turkish nouns. The chi-squared test showed that the choice of the language for the nouns is extremely significant (p = 8.4E-10).

Figure 67: The Thrace Romani-Turkish-Greek corpus: Distribution of nouns per language in Thrace Romani for four speakers (adapted from Adamou and Granqvist 2014)

By contrast, the distribution of Romani, Turkish, and Greek verbs is similar for all four speakers: 81‒89% Romani verbs, 6‒11% Turkish verbs, and only 1‒6% Greek verbs; see Figure 68. The chi-squared test showed that the choice of the language for the verbs is very significant (p = 5.7E-3).
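The chi-squared tests reported here compare, for each speaker, how many nouns (or verbs) come from each language. A minimal sketch of such a test of independence on a speaker-by-language table is given below. The counts are hypothetical, since the raw per-speaker counts are not given in this section, and the function names are illustrative. The p-value helper uses a closed form that holds only for an even number of degrees of freedom, which is the case for a table of four speakers by three languages.

```python
import math


def chi_squared_independence(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of speaker and language.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def chi2_sf_even_dof(statistic, dof):
    """Upper-tail p-value of the chi-squared distribution, for even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("the closed form used here needs a positive even dof")
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# HYPOTHETICAL noun counts per speaker: [Romani, Turkish, Greek].
noun_counts = [
    [160, 20, 8],   # speaker in her 50s
    [110, 60, 10],  # three speakers in their 30s
    [95, 70, 12],
    [120, 55, 9],
]

statistic, dof = chi_squared_independence(noun_counts)
p_value = chi2_sf_even_dof(statistic, dof)  # dof = (4 - 1) * (3 - 1) = 6
print(f"chi2 = {statistic:.2f}, dof = {dof}, p = {p_value:.2e}")
```

A small p-value indicates that the distribution of nouns across languages differs between speakers; it does not by itself show which speaker or which language drives the difference.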


Figure 68: The Thrace Romani-Turkish-Greek corpus: Distribution of verbs per language in Thrace Romani for four speakers (adapted from Adamou and Granqvist 2014)

These preliminary results indicate that the use of Turkish nouns in Thrace Romani is an ongoing process, with the 50-year-old speaker using the most Romani nouns, as opposed to the three speakers in their 30s who show roughly 50% Romani nouns. Therefore, it seems that there is an increase in the number of Turkish nouns through the generations. The analysis of the verbs shows that the rates of Turkish verbs are similar across the speakers, independent of their age, indicating some stability. This result could be interpreted either as a sign that Turkish verbs are behaving like borrowings, or that their use is an “unmarked” choice in Thrace Romani.

The analysis of the Finnish Romani-Finnish corpus reveals some variability in the production of nouns and verbs by the three speakers; see Figures 69 and 70. Speaker 1 uses 21% Finnish nouns and 14% Finnish verbs, speaker 2 uses 6% Finnish nouns and 4% Finnish verbs, and speaker 3 uses 11% Finnish nouns and 11% Finnish verbs. The chi-squared test showed that significance was extremely high for the choice of language for nouns (p = 4.2E-11) and for the choice of language for verbs (p = 1.6E-15).

When comparing the two Romani corpora, it appears that Finnish Romani speakers vary more in the use of Finnish nouns, whereas Thrace Romani speakers show an age-related distribution of nouns from the current-contact languages. The same variability is found for verbs in the Finnish Romani-Finnish corpus, whereas the Thrace Romani speakers show a similar distribution for Turkish and Romani verbs independent of age.


Figure 69: The Finnish Romani-Finnish corpus: Distribution of nouns per language for three speakers (adapted from Adamou and Granqvist 2014)

Figure 70: The Finnish Romani-Finnish corpus: Distribution of verbs per language for three speakers (adapted from Adamou and Granqvist 2014)

6.5 Discussion

The analysis of inter-speaker variation for the corpora under study shows moderate inter-speaker variation within a given language community. Indeed, a statistical analysis for the Slavic corpora shows that “language” is the best predictor for the rates of borrowings and that Molise Slavic speakers from Italy produce significantly more borrowings than the speakers from all the other communities independent of other factors such as individual speaker, sex, age, location, recording session, and type of text. This indicates that small, tightly-knit communities, such as the ones examined here, each have their own way of drawing material from their contact languages, be it through borrowing or codeswitching. These results contrast with studies on codeswitching which report significant inter-speaker variation and high dependence on the topic of conversation (Gardner-Chloros 2009).

The analysis of inter-speaker variation also highlights “outliers”, i.e., speakers with a different language usage than the dominant community pattern. For instance, in a corpus where all the fluent speakers produce 4% current-contact language tokens, a speaker who produces 16% current-contact language words, as is the case in the Balkan Slavic Nashta-Greek corpus, can be understood as a speaker who attempts to produce the community’s minimally-mixed speech but is not successful in doing so. In contrast, for the Thrace Romani-Turkish-Greek corpus, younger Romani speakers also seem to be producing a minimally-mixed speech, with rates under 6%, although the other speakers produce up to 15% words from the current-contact languages. Usage among younger speakers could thus indicate a change in the community’s speech patterns, evolving from the “mixed-speech” type to the “monolingual-speech” type. This could be the result of pressure from the dominant, monolingual ideologies promoted, among others, at school.

The study of the relation between the rates of borrowings and codeswitching shows that for the Slavic corpora there is no correlation. Similarly, this correlation is not obvious for the Finnish Romani speakers, who may codeswitch a lot but use few borrowings and vice versa.

The study of inter-speaker variation in the Slavic corpora for borrowed nouns confirms that individual speakers follow patterns which prevail in their bilingual community. It is also noteworthy that the rates of borrowed nouns may be particularly high for some languages even though the overall numbers of contact words, all word classes combined, are very low. For example, the Ixcatec speaker of the 1950s uses more contact nouns than a Finnish Romani or an elder Thrace Romani speaker even though the Ixcatec corpus is among those corpora with low rates of contact words overall (fewer than 5%), whereas the Romani corpora are among those corpora with high rates of contact words overall (20‒35%).

Last, the analysis of borrowed nouns in the corpora under study only partly confirms the results of dictionary-based studies, which indicate lexicon loss in critically endangered languages (Dorian 1989; Sasse 1992). Rather, the analysis of free speech shows that the lexicon may be well preserved even among the last speakers of a language. This finding is similar to the results reported from a study on lexicon preservation in Faetar, spoken in Italy and Canada. The study, which elicited semi-spontaneous speech with a sample of 80 speakers, showed lexical stability despite widespread assumptions within the language community about lexicon loss (Nagy 2011).

Chapter 7

Pattern replication

7.1 Background

Harder to identify than borrowing or codeswitching, “linguistic convergence” or “replication” is a fascinating aspect of language contact outcomes. In replication, either grammatical or lexical, forms are not borrowed, yet the functions and structures of two languages in contact coincide (Heine and Kuteva 2005). For Heine and Kuteva (2005), grammatical replication takes place through various processes, such as contact-induced grammaticalization and restructuring, the latter through loss and rearrangement. The term “replication” is also used to refer to “a linguistic structure, of any kind, in a new, extended set of contexts, understood to be negotiated in a different language” (Matras 2009: 146). More specifically, the term “matter replication” applies to the replication of word-forms, i.e., of the morphological and phonological material, and the term “pattern replication” to:

the patterns of distribution, of grammatical and semantic meaning, and of formal-syntactic arrangement at various levels (discourse, clause, phrase or word) that are modelled on an external source (Matras and Sakel 2007b: 829‒830).

Next to the grammaticalization process discussed by Heine and Kuteva (2005), Matras (2009) draws attention to the creative, abrupt processes of pattern replication induced by pivot-matching. In a pattern replication process, if two languages in contact both have a unit for the same function x, then function z of the unit, which exists only in the “model language” (ML), may be replicated in the “replica language” (RL). The shared function x is what Matras and Sakel (2007b) call the “pivot” feature (also in Matras 2009: 240‒243). Structural, formal, and functional similarities may play a facilitating role for an existing category to be developed in a language contact situation.

For the study of pattern replication involving specific morphemes, preliminary searches in a corpus can be relatively easy. However, since one must take patterns into account, more precise searches must be carried out, often requiring additional coding. Of course, examining all possible pattern replication phenomena is an extremely time-consuming task, in competition with the ambition of working on a larger corpus, and researchers generally have to make a choice from among the wealth of possible research questions. In some areas moreover, such as the study of prosody, research is as yet in the early stages. Indeed, the study of prosody based on spontaneous speech is complex and often also requires specific elicited materials.

Additional work is also necessary to show that a given phenomenon is contact-induced, as this entails either examining speaker production in all of the contact languages, at least doubling the amount of work to be done, or looking at historical data, which are notoriously defective for oral-tradition languages, or comparing the data with the most closely-related languages. Even then, the pattern replication hypothesis may remain ungrounded.

Moreover, the study of pattern replication in bilingual settings may be related to a more complex type of linguistic convergence which has been produced during centuries-long contact between several languages. Trubetzkoy (1928) was the first to express the idea that a number of languages, while not closely related, may be viewed as a group, called Sprachbund, union linguistique, jazykovoj sojuz, for sharing some common linguistic features due to contact. Terms such as “convergence area” (Weinreich 1958) and “linguistic area” are generally used as equivalents despite the fact that they express slightly different theoretical approaches (Campbell 2006; Heine and Kuteva 2006). Joseph gives the following definition of Sprachbund:

A Sprachbund can be defined as any group of languages that due to intense and sustained bilingual contact share linguistic features, largely structural in nature but possibly lexical as well, that are not a result of shared inheritance from a common ancestor nor a matter of independent innovation in each of the languages involved. (Joseph 2010: 620)

However, several criticisms related to the relevance of “linguistic areas” have been raised:

Linguistic areas are therefore not real-life entities. Rather, they are constructions by linguists, who choose to grant their attention to situations in which, as a result of sociohistorical coincidences, a series of conditions are met, and to label this kind of situation in a particular way. (Matras 2009: 274)

Despite these questions, the topic remains very popular in contact linguistics. Indeed, linguistic or convergence areas result from a number of individual replication patterns which have taken place in a systematic manner within a given contact area. This discussion is relevant to the study of pattern replication for both the corpora from the Balkans and Mexico analysed in this book.

The earliest works on the similarities between Balkan languages appeared in the late-nineteenth and early-twentieth centuries (see Miklosich 1861; Sandfeld 1930 [1926]). In the modern literature, the Balkan Sprachbund is said to comprise South Slavic (Macedonian, Bulgarian, Balkan Slavic, and some dialects of Serbian) as well as Balkan Romance, Albanian and, to some extent, Greek, Balkan Turkish, Romani, and Judezmo (Sobolev 2004). Pattern replication in Balkan Slavic and in Thrace Romani should therefore be examined in the light of the Balkan Sprachbund.

Ixcatec belongs to the Popolocan branch of the Otomanguean stock which includes three more languages, namely Chocho or Chocholtec, Popoloc, and Mazatec. Ixcatec is also part of the Mesoamerican linguistic area, which involves Otomanguean, Mayan, and Mixe-Zoquean languages and is established on the basis of phonological, morphological, syntactic, and semantic calques (Campbell, Kaufman, and Smith-Stark 1986). In this context, possible pattern replication from Spanish is discussed in relation to general Mesoamerican typological features.

In the sections that follow it is shown how a spontaneous corpus may be examined to investigate pattern replication for articles, verb morphology and TMA markers, word order, and clause linking. Phonology as well as prosodic and phonetic replication, involving both matter and pattern replication (Matras 2009: 221), are also examined in this chapter. Section 7.2 discusses several pattern replication phenomena in Balkan Slavic Nashta. Section 7.3 is dedicated to the Ixcatec corpus and section 7.4 to the Thrace Romani corpus. Finally, section 7.5 discusses pattern replication based on the various corpora presented in this book.

7.2 The Balkan Slavic Nashta-Greek corpus

Balkan Slavic is characterized by certain features that distinguish it from all other Slavic languages, such as the grammaticalization of definite articles and the loss of case marking. These features do not apply to all Balkan Slavic varieties, e.g., several Rhodope and Macedonian varieties have retained various case markers. Definite articles have arisen in several Balkan Slavic varieties: some Rhodopean and Macedonian varieties have three definite articles determined by deixis, whereas most other Balkan Slavic varieties have only one article. Balkan Slavic verb systems are distributed in a quite complex manner. A general distinction can be made between the Western zone, which has developed a grammaticalized evidential, and the Eastern zone, where use of the evidential remains determined by pragmatics. The Western zone has also developed a fully grammaticalized ‘have’-perfect, contrary to the Eastern varieties. These features are said to result from complex contact processes between several languages of the Balkans. In the following sections I discuss some of these features for the Balkan Slavic Nashta data.


7.2.1 TMA markers

As was shown in Chapter 3, the Balkan Slavic corpora contain very few contact words from Greek. However, the study of the Nashta verbal system indicates significant convergence toward the Greek verbal system in such a way that the two systems have become almost identical for the last Nashta speakers (Adamou 2006). This development is partly due to centuries of contact within the Balkan Sprachbund, and partly to the modern language-contact setting characterized by a generalized, rapid shift to Greek.

Table 18: Verb morphology in Balkan Slavic Nashta and Greek (Adamou 2012a: 155)

                  Nashta             Greek
volitive          ki                 θa
optative          da                 na
exhortative       neka               as
imperfective      -uva- or stress    stem morphology
‘have’ perfect    imam Vinv(-no)     exo Vinv.

The rise of a future based on the volitive ‘want’ is among the most well-known Balkan features: a ‘want’-future is found in Greek, Tosk Albanian, Rumanian, Macedonian, Bulgarian, Serbian and Croatian, and Romani (Joseph 1992: 154). In all these languages the rise of the ‘want’-future followed a similar grammaticalization path (Assenova 2002): modal verb (inflected for person and number) > auxiliary (free word order) > clitic (fixed word order, phonologically reduced form) > modal future (between the fourteenth and seventeenth centuries).

More recently, convergence of the Nashta verbal system with Greek took place for the potential mood. As noted in Adamou (2006: 59), the last fluent Nashta speakers use a very uncommon construction for a Slavic language, namely [ke + aorist], which is clearly a case of pattern replication of the Greek equivalent [θa + aorist]. This parallel is shown in (31) with elicited examples.

(31) Balkan Slavic Nashta
     a. ˈpetro-to   ki    kuˈpi          ˈkola
        NP-DEF.N    FUT   buy.AOR.3SG    car
        ‘Petros must have bought a car.’ (Adamou 2012a: 156)


     Greek
     b. o           ˈpetros   θa    aˈγorase       aftoˈkinito
        DEF.NOM.M   NP.NOM    FUT   buy.AOR.3SG    car.ACC

        ‘Petros must have bought a car.’ (Adamou 2012a: 156)

The influence of Greek on the potential mood in Nashta is better understood when compared with the other closely-related Slavic languages. In a first stage, the Balkan conditional with [ke + imperfect] replaced the traditional Slavic conditional [bi ‘be’ + V-l] in Bulgarian and Macedonian. However, Literary Macedonian uses the particle bi for the potential mood due to a more recent contact-induced change that took place with Serbo-Croatian during the twentieth century (Hacking 1998: 115).

Another general Balkan convergence phenomenon is the use of an optative particle following the loss of the infinitive (Joseph 1983). The Greek construction [na + finite verb] is also observed in Nashta [da + finite verb], but the feature is clearly due to the Balkan Sprachbund rather than to recent contact with Greek.

The grammaticalization of a ‘have’-perfect in the Balkan Sprachbund is a more controversial convergence feature. Several scholars (Gołąb 1984; Lindstedt 2000; Tomić 2004) consider the ‘have’-perfect as a Balkan feature. Havranek (1936) and Vasilev (1968) argue more specifically for the development of the Balkan Slavic ‘have’-perfect under Romance influence. Nashta, unlike Bulgarian but like Literary Macedonian, has developed a fully grammaticalized ‘have’-perfect. The Nashta ‘have’-perfect is characterized by an invariable verbal form based on the neuter past participle and ending in -no/-to. It is used with intransitive verbs such as ‘to die’, e.g., jima umrjiano ‘he has died’. Example (32) shows the use of the ‘have’-past perfect with a transitive verb.

(32) Balkan Slavic Nashta corpus

     aˈla   uˈno             ˈdʲæte   ˈimaʃe                ˈzeto
     but    DEM.DIST.SG.N    child    have(AUX)-IPRF.3SG    take.PTCP.SG.N
     sa
     REFL

     ‘But that child had taken it.’ (Adamou 2013. Excerpt from Pear Story (past), sentence 14. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

Modern Greek uses a ‘have’-perfect based on a former infinitive, [‘have’ + nonfinite verb], e.g., ˈexo ˈγrapsi ‘I have written’. Moreover, in some Greek varieties another ‘have’-perfect was grammaticalized as early as the thirteenth century, [‘have’ + verbal adjective], e.g., ˈexo γraˈmeno ‘I have written’ (Moser 1988). Despite the shared ‘have’ auxiliary perfect in Nashta and in Greek, the determining factor for evaluating the role of Greek influence on Nashta is the chronology of its grammaticalization in the latter (see Adamou 2012b). According to Koneski (1965) the grammaticalization of a ‘have’-perfect in the closely related Macedonian varieties dates back to the eighteenth century. If Koneski’s proposal is correct, Greek could not be the source of the ‘have’-perfect in Nashta as bilingualism with Greek was neither intensive nor extensive before the late-nineteenth, early-twentieth centuries (prior to this date, speakers of Nashta probably had contact mainly with Church Greek; see Chapter 9 for more details). Romance influence, through Aromanian, cannot be considered a plausible source either as Aromanian presence in the area of Liti has not been documented for several centuries. However, Koneski’s date can be questioned as the eighteenth-century Konikovo Gospel manuscript (Lindstedt et al. 2008) shows no instances of a ‘have’-perfect. If the grammaticalization is to be dated to the nineteenth century, then Greek could be a convincing candidate.

What is certain is the fact that the grammaticalization of a ‘have’-perfect with the past passive participle in Nashta had an impressive impact on the verbal system as a whole. Namely, it probably led to the loss of the former Slavic ‘be’-perfect as well as to the even more remarkable loss of all the -l verb forms based on the active participle. Traces of the old perfect form with the ‘be’ auxiliary and l-verbs are only found in a folk song that the last fluent speakers of Nashta still recall, shown in (33):

(33) Balkan Slavic Nashta

     vlase    me         sa     doʃ-l-e        male
     Vlachs   1SG.ACC    REFL   come-EVD-PL    INTJ

     ‘Vlachs have come (‘be’ AUX.PRF + V-l), oh my!’ (Adamou 2012a: 157)

Greek was probably a catalyst for the loss of the -l verbal forms, which is a unique feature within the Slavic branch. The gradual loss of the -l forms has also been reported for other Balkan Slavic varieties in contact with Greek, i.e., in Sohos by Vaillant and Mazon (1938), and in Kastoria by Friedman (1977) and Topolinjska (1995). The loss of the -l verbal forms is most probably a side-effect of the grammaticalization of the ‘have’-perfect under the influence of Greek. Indeed, in most Macedonian varieties which have no contact with Greek, the rise of the ‘have’-perfect led to the use of the -l forms to express evidentiality (Friedman 1988). The evidential uses of the -l forms are lacking in Nashta, probably because the Greek verb system does not have a grammaticalized evidential. Under Greek influence, Nashta consistently uses the aorist or the narrative present for tales. Compare the Nashta version in (34a) and the Greek version in (34b) of a tale narrated by the same speaker:

(34)

     Balkan Slavic Nashta
     a. i     noʃ     ˈzexa           i     iˈdin   barˈdak   ˈvoda
        and   knife   take.AOR.3PL    and   one     jug       water
        ‘And a knife they took, and a jug of water.’ (Adamou 2006: 86)

     Modern Greek
     b. ˈpiran          ˈena      kuˈva        neˈro
        take.AOR.3PL    one.ACC   bucket.ACC   water.ACC
        ˈpiran          ki        ˈena         maˈxeri
        take.AOR.3PL    and       one.ACC      knife.ACC

        ‘They took a bucket of water, they also took a knife.’ (Adamou 2012: 158)

Interestingly, although the Nashta corpus has few contact words (less than 5%), it shows an impressive convergence with Greek at the level of the verbal system. This convergence results both from century-long processes within the Balkan Sprachbund and very quick changes within the shifting process to Greek.

7.2.2 Phonetics

Phonology varies in the Balkan Slavic varieties, e.g., some varieties are characterized by vowel reduction while others by lengthening of stressed syllables. In terms of dialectology, the Western and the Eastern dialects of the Bulgarian‒Macedonian continuum are separated by the so-called jatova granica, the jat isogloss based on the development of the old ě as e in the Western zone and as an alternation of e and ja depending on the context in the Eastern zone.

The Balkan Slavic Nashta corpus of connected, spontaneous speech was analysed with digital speech analysis software. Semi-automatic alignment was run with EasyAlign by phonetician Martine Toda.¹ It was then possible to run a semi-automatic script in Praat with the “log file 4” and “triangle vocalique” scripts.² Figure 71 graphs the mean F1 and F2 values for the Nashta vowels based on 455 tokens produced by a female speaker. These preliminary findings show that in the Balkan Slavic variety of Nashta the mid vowels /e/ and /o/ are not raised to [i] and [u] respectively as is the case in the contact-varieties of Northern Greek. They also show that unstressed /a/ and /e/ are realized in a similar area.

Figure 71: Mean F1 and F2 for the Balkan Slavic Nashta stressed (in grey) and unstressed (in black) vowels based on 455 tokens produced in the spontaneous speech of a female speaker
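The kind of summary plotted in Figure 71 can be sketched as follows: formant measurements are grouped by vowel and stress, and the mean F1 and F2 are computed for each group. The measurements below are hypothetical values in Hz, not data from the Nashta corpus, and the function name `mean_formants` is an illustrative assumption; in the study itself the values were extracted with Praat scripts.

```python
from collections import defaultdict
from statistics import mean

# HYPOTHETICAL measurements: (vowel, stressed?, F1 in Hz, F2 in Hz).
TOKENS = [
    ("a", True, 780.0, 1450.0),
    ("a", True, 820.0, 1500.0),
    ("a", False, 600.0, 1600.0),
    ("e", False, 580.0, 1700.0),
    ("i", True, 320.0, 2400.0),
]


def mean_formants(tokens):
    """Map (vowel, stressed) to the pair (mean F1, mean F2) over its tokens."""
    grouped = defaultdict(list)
    for vowel, stressed, f1, f2 in tokens:
        grouped[(vowel, stressed)].append((f1, f2))
    return {
        key: (mean(f1 for f1, _ in values), mean(f2 for _, f2 in values))
        for key, values in grouped.items()
    }


if __name__ == "__main__":
    for (vowel, stressed), (f1, f2) in sorted(mean_formants(TOKENS).items()):
        label = "stressed" if stressed else "unstressed"
        print(f"/{vowel}/ {label}: mean F1 = {f1:.0f} Hz, mean F2 = {f2:.0f} Hz")
```

Keeping stress as part of the grouping key makes it straightforward to compare the stressed and unstressed vowel spaces, as the figure does.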

Figure 72 shows the dispersion of the stressed vowels in Nashta, with significant overlap for /a/ and [æ], and for /e/ and /ə/. Figure 73 shows the overlap between unstressed /e/ and /a/. It also shows that there is practically no overlap between /o/ and /u/ or between /i/ and /e/. Further research is needed in this direction, but this preliminary analysis illustrates how spontaneous corpora can be used for the study of phonetics.

¹ EasyAlign is developed by Jean-Philippe Goldman, University of Geneva.
² Developed by Cédric Gendrot, see http://gendrot.ilpga.fr/scripts.htm


Figure 72: Variability of F1 and F2 values of stressed vowels in Nashta produced in the spontaneous speech of a female speaker

Figure 73: Variability of F1 and F2 values of unstressed vowels in Nashta produced in the spontaneous speech of a female speaker


7.2.3 Articles

Only a few Slavic languages have grammaticalized definite articles: this is the case for South Slavic, while their grammaticalization in North Russian is under discussion (see Breu 1994; Kasatkina 2008). The grammaticalization of postposed articles in South Slavic results from both internal and contact-induced factors, as it coincides with a similar development in the Romance languages of the area (Romanian, Aromanian, and Meglenoromanian) and Albanian (Assenova 2002). Definite articles in South Slavic were grammaticalized from demonstratives, a cross-linguistically common development (Diessel 1999; Lyons 1999). Written sources show that the postposed demonstratives attested in Old Church Slavonic documents (between the ninth and eleventh centuries) were grammaticalized into clitic demonstratives and then into clitic articles (Mladenova 2007). Some South Slavic languages, such as Literary Bulgarian, have a single definite article. Others, such as Literary Macedonian and Pomak (Adamou 2011), have three definite articles.

Nashta uses a single definite article, based on the -t form. The analysis shows that 83% of all the nouns in the corpus are determined by a definite article, a high percentage indicating that definite articles are grammaticalized in Nashta. Although it cannot be argued that the grammaticalization of the Nashta definite article is induced by contact with Greek in the contemporary setting, Greek influence is evident in the use of the Nashta definite articles with proper nouns, which is impossible in most Balkan Slavic languages; e.g., to dimitro-to ‘DEM NP-ART’ (note the double determination of the proper noun by a demonstrative and an article; double determination is also possible in other Balkan Slavic languages and in Greek).

The grammaticalization of an indefinite article in Macedonian and Bulgarian is controversial. In Nashta, the analysis of the corpus shows that 15% of all the nouns are determined by the numeral iˈdin ‘one’, inflecting for gender and number. Only a small proportion of bare plural or singular nouns with an indefinite value is encountered in the corpus (2%). Qualitative analysis of the examples indicates the use of ‘one’ either as a numeral or with referential nouns, e.g., iˈdin aˈndroʝino ‘a couple’, iˈdin ˈvujko ‘an uncle’, but not with non-referential and generic nouns. Thus the available corpus gives no indication of a fully grammaticalized indefinite article in Nashta.
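The percentages reported above are shares of annotated nouns per determination type. A minimal Python sketch of such a tally follows; the function name `determination_shares`, the label set, and the 100-noun sample are hypothetical, and the sample is constructed only so that it mirrors the proportions reported for Nashta (the corpus counts themselves are not given in the text).

```python
from collections import Counter


def determination_shares(labels):
    """Return the rounded percentage of nouns per determination label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


# Hypothetical annotation of 100 nouns, one label per noun (NOT corpus data).
labels = ["definite"] * 83 + ["one"] * 15 + ["bare"] * 2
shares = determination_shares(labels)
```

Keeping one label per noun token makes the shares sum to 100% by construction, which is a convenient check on the annotation.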


7.3 The Ixcatec-Spanish corpus

7.3.1 Articles

Among the issues that can be examined through the qualitative and quantitative analysis of a spontaneous corpus is the path toward the grammaticalization of a definite and an indefinite article. Ixcatec is a Popolocan language and the grammaticalization of articles in the other Popolocan languages is among the topics that are not treated in detail in the literature. Preliminary analysis of half an hour of the contemporary Ixcatec corpus (2,366 words) shows that the numeral hŋgu ‘one’ has a relatively low frequency; i.e., N = 19 with a noun, or 5% of the total occurrences of nouns. The corpus of the 1950s is different from the contemporary corpus on this point. The analysis shows that 22% of Ixcatec nouns are determined by the numeral hŋgu ‘one’ (N = 87). A qualitative analysis of the corpus shows that hŋgu is used as a numeral and that it can also introduce a discourse-new participant either in a presentative construction, or with a specific referent independent of whether it will become the topic of discussion; see (35).

Ixcatec
(35) ʃuwo-kú   hŋgu  kwaʃúŋgu
     come-ANT  one   married_woman
     ‘A young woman came.’ (Adamou, unpublished corpus, Pear Stories)

Ixcatec hŋgu is not used with non-specific referents and it has restrictions in interaction with the existential or locative predicates even in contexts in which the referent is given and its use should be expected, as shown in (36a) and (36b).

Ixcatec
(36) a. sí-kú    tʃitsé
        EXS-ANT  party
        ‘There has been a party.’ (Adamou, unpublished corpus)

     b. haʔi   ʰŋgu  ndusà  kíi-kú   tʃitsé
        since  one   month  LOC-ANT  party
        ‘One month ago there was a party.’ (Adamou, unpublished corpus)


The existence of a grammaticalized definite article in Ixcatec is also of interest. Sá has been described as an optional demonstrative by Fernández de Miranda (1961: 93) but it could probably be described as a definite article on the path of grammaticalization. A qualitative analysis of the Ixcatec data shows that the Ixcatec definite article is used as a definite anaphoric, with a referent which was previously mentioned or a referent which is pragmatically identifiable, as shown in (37).

Ixcatec [Following: ‘They are dancing!’]
(37) a     ka=sá=mi-tʃʔa
     EXCM  all-DEF-CLS-woman
     ‘Oh, all the women!’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php)

The Ixcatec definite article is not used with a referent with generic meaning when it is new in discourse, as in (38a), but it can be used with a generic referent when it is given in discourse, as shown in (38b).

Ixcatec
(38) a. [+new], [+generic]
        ndra      tse  ʔu-ndjaʰɲù
        for_this  do   CLF-turkey
        ‘The turkeys do like this.’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php)

     b. [+given], [+generic]
        ndra      ndéde  sá=ʔu-ndʒaʰɲù   ʔísá  ʃeʔe   tú=hu
        for_this  what   DEF-CLF-turkey  more  funny  PROG.PL=be
        ‘That’s why the turkeys are funnier.’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php)


The quantitative analysis shows that sá determines a noun in roughly 20% of cases (N = 92 out of 446) in the contemporary Ixcatec corpus and in 10% of cases (N = 41) in the corpus of the 1950s. The low rate of nouns determined by sá confirms the results of the qualitative analysis indicating that Ixcatec sá is not a fully grammaticalized definite article. Interestingly, despite the low degree of grammaticalization of the Ixcatec definite article, it is regularly used with human proper nouns; such uses are also common among the monolingual Spanish-speaking Ixcatecs. To conclude, it appears that the grammaticalization of a definite and an indefinite article in Ixcatec is in progress, but it is not possible to evaluate the extent of the Spanish influence.

7.3.2 Clause-linking

Ixcatec has a complementizer la similar to Spanish que (for Spanish see Demonte and Fernández-Soriano 2009). Like Spanish que, the Ixcatec complementizer introduces relative clauses (RC), see (39a), completive clauses (CoCl), see (39b), and co-occurs with an interrogative pronoun, see (39c). But unlike Spanish, Ixcatec la also introduces adverbial clauses, see (39d). Similar complementizers can be found in the most closely related languages of the Popolocan branch.

Ixcatec
(39) a. ruéða  [la   kí-βika         sá-mindawa]RC
        tyre   REL   PROG.3SG-seize  DEF-man
        ‘A tyre that the man is holding.’ (Adamou and Costaouec 2013: 193, translated from Spanish)

     b. sá=kwá-enɸerméra  kú-tʃe-kú-nà
        DEF-CLF-nurse     PFV-do-ANT-1SG
        [la    nda  ʃtá   sí]CoCl:O
        COMP   how  ugly  EXS
        ‘I told the nurse, how ugly it is!’ (Adamou and Costaouec 2013: 193, translated from Spanish)

     c. ʃ hũ¹  kwa-tʃu  ndi  la    tj ʔwí
        nice   PFV-say  how  COMP  clean
        ‘How nice he says how they cleaned!’ (Adamou and Costaouec 2013: 203, translated from Spanish)

     d. kwa-tu-βihi-ʔana-na     la   mã¹hũ¹
        PFV-3PL-arrive-NEG-FOC  SUB  sweep
        ‘Won’t they come to sweep?’ (Adamou and Costaouec 2013: 193, translated from Spanish)

Influence from Spanish can be argued to have taken place in terms of frequency for specific constructions. For example, in the contemporary Ixcatec corpus the majority of the completives have a matrix adverb (54%), e.g., mééndi ‘this way’, he ‘now’, but this type of completive is practically absent from the Ixcatec texts of the 1950s (see Table 19). On the one hand, these constructions are characteristic of unplanned speech, which characterizes the contemporary corpus. On the other hand, the use of completive clauses with a matrix adverb could have increased under influence from the Spanish constructions.

Table 19: Distribution of completive clauses with la in Ixcatec (Adamou and Costaouec 2013)

                      Matrix verb   Adverb     Interrogative pronoun   Total
Contemporary corpus   28% (20)      54% (35)   18% (16)                100% (71)
Corpus of the 1950s   63% (20)      6% (2)     31% (10)                100% (32)

7.3.3 Frames of reference

A “frame of reference” relies on the use of coordinate systems for construing spatial relations (Levinson 2003: 24‒61): the speaker locates a referent A, dubbed “figure”, with respect to a referent B, dubbed “ground” (Levinson 2003: 41). In several Mesoamerican languages, terms for cardinal points are commonly used as spatial identifiers in both large-scale and small-scale descriptions (Brown and Levinson 1993; Bohnemeyer et al. 2011). This frame of reference is known as “geocentric” (or “absolute” frame of reference in Levinson 2003: 53, 66). In an “egocentric” frame of reference the “ground” is the observer’s viewpoint (“relative” frame of reference in Levinson 2003: 53). A third frame of reference is the “intrinsic” frame of reference, i.e., locations are represented in relation to a referent’s intrinsic properties (front, back, sides). Mesoamerican languages are reported to use egocentric systems the least, although their use increases with the use of Spanish (Bohnemeyer et al. 2011). It is thus interesting to examine the Ixcatec data in this perspective.

The study of both spontaneous and semi-spontaneous data shows that Ixcatec has an intrinsic frame of reference in terms of linguistic expressions. The front-back axis is rendered in Ixcatec through the word ndatʃuè ‘behind’, see (40), for referents situated behind a “ground”, and rendered in Spanish with the word espalda ‘back’, i.e., x tiene a su espalda y ‘x has y at its back’. A referent can also be in front of a “ground”, with respect to its door, nduha ‘door’, see (41). If there is no door, Ixcatecs use the Spanish word frente ‘front’, see (42).

Ixcatec
(40) aj    Ɂu-jè      ki=ndatʃuè-na
     INTJ  CLF-snake  PROG.3SG=behind-FOC
     ‘Oh, the snake is behind (him).’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php)

(41) nduh-é         mulínu  ku   nduh-é         tʃí hŋgu
     door-POSS.3SG  mill    and  door-POSS.3SG  other_one
     sí-kú           líi       kíi       hŋgu  ndatsĩ
     PRED.EXIST-ANT  LOC.PROX  PRED.LOC  one   outside
     ‘In front of the mill, and in front of the other one, it’s here, there is a patio.’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php)

(42) frénte  káŋtʃa  kíi       núŋgu
     front   court   PRED.LOC  church
     ‘In front of the basketball court is the church.’ (Adamou, unpublished corpus. Recordings available at http://www.elar-archive.org/index.php)

A geocentric frame of reference is also used in Ixcatec through the words kʔája ‘up’ and híŋgi ‘down’. The terms ‘up’ and ‘down’, however, do not signal that a referent is higher than the “ground” in the village’s topography. Rather, an entity is ‘up’ when it is situated further north than another entity. This distinction, ‘up’/‘down’, also applies to the three main streets of the village: the street which is at the northernmost part of the village is called kʔája tʃahu ‘upper street’, the southernmost street is called híŋgi tʃahu ‘lower street’, and the street in between is kusiine tʃahu ‘middle street’, from the term used to refer to the middle finger. These are also the colloquial names used in Spanish among the Spanish monolinguals of the village, i.e., calle de arriba ‘upper street’, calle de abajo ‘lower street’, and calle de medio ‘middle street’. See Table 20 for a summary of the Ixcatec relational terms.


Table 20: The relational terms in Ixcatec

Axis                               Ixcatec
Front/back                         ndatʃuè ‘behind’ < tʃuè ‘his back’
                                   nduha ‘door’
                                   frente ‘front’
                                   ladu ‘side’
                                   teŋgi ‘follow’
Left/right                         kútʃ ʔé ‘left (hand)’
                                   ndúá ‘right (hand)’
                                   ladu ‘side’
                                   teŋgi ‘follow’
Up (northerly)/ down (southerly)   k ʔája ‘up’
                                   híŋgi ‘down’

Contact with Spanish has influenced the development of spatial organisation for the left/right axis, which is not specified in the Ixcatec language. Due to the lack of evidence in the spontaneous corpus, I conducted a task to explore the localization of several entities at the level of the community. The four Ixcatec speakers were asked to describe the locations of nine buildings, answering the question “Where is X with respect to Z?”. The sessions were filmed in order to allow for the study of both the language and the gestures, following Le Guen (2011). In example (43), where the speaker describes the location of her house with respect to the presidencia ‘municipality building’, the left/right axis is rendered by default with the Spanish borrowing ladu ‘side’.

Ixcatec
(43) ládu  tíndáhɲa
     side  municipality
     ‘Next to the municipality (lit. at the municipality’s side).’

When asked to give more details, the speaker first shifted to Spanish, looking for the word izquierda ‘left’, see (44). Then she calqued in Ixcatec the Spanish term izquierda ‘left’, by using the term kútʃ ʔé, which generally applies to the ‘left hand’. At the same time she gestured with the left hand at the left periphery of the gestural space (McNeill 1992), as can be seen in Figure 74b.


a. ‘Where there is the municipality. . .’

b. ‘on the left, there is my comadre’s house.’

Figure 74: Co-speech gesture in Ixcatec for ‘left’

Ixcatec < Ixcatec (in plain), codeswitching to Spanish (in angle brackets < >)
(44) ndira  kíi       tíndáhɲa      la
     where  PRED.LOC  municipality  LOC.DIST
     ‘Where is the municipality (Fig. 74a)’,

     ‘’ PRED.POSS
     kútʃ ʔé
     left
     ‘On the left (Fig. 74b),’

     ja  la        kíi       ndi-é           kumaré-ɲána
         LOC.DIST  PRED.LOC  house-POSS.3SG  comadre-POSS.1SG
     ‘there is my comadre’s house.’

When asked to repeat the description, the speaker gestured ﬁrst at the left extreme periphery to locate the municipality (see Figure 75a), then to the centre to refer to the street (see Figure 75b), and last to the extreme upper right periphery to locate the house (see Figure 75c); the example in Ixcatec is given in (45).

Ixcatec
(45) líi       kíi       tíndáhɲa
     LOC.PROX  PRED.LOC  municipality
     ‘Here is the municipality, (Fig. 75a)’

     líi       kíi       tʃahu
     LOC.PROX  PRED.LOC  street
     ‘here is the street, (Fig. 75b)’

     na  líi       kíi       kumaré-ɲána
     so  LOC.PROX  PRED.LOC  comadre-POSS.1SG
     ‘so here is my comadre (house) (Fig. 75c).’

a. ‘Here is the municipality. . .’

b. ‘here is the street. . .’

c. ‘here is my comadre (house).’

Figure 75: Co-speech gesture in Ixcatec for the left-right axis

The comparison of the gestures in the two descriptions of the house, shown in Figure 74b and in Figure 75c, shows that the localization of the house is the exact opposite. In the first description, shown in Figure 74b, the Spanish word ‘left’ triggers the egocentric frame of reference and the house is located at the left of the municipality with respect to an observer facing the building. In the second description, shown in Figure 75c, the speaker adopts a different strategy, where the “ground” is again the municipality but the house is located through gesture alone at the right extreme periphery of the gestural space, or northerly, partly coinciding with the house’s location to the north-west of the municipality (note that the session took place away from these two locations and that during this task the speaker faced west).

In order to test cognitive preference in the spatial domain among the Ixcatec speakers, a nonverbal memory task was conducted. The task was inspired by the Max Planck Institute’s task “animals in a row” (Brown and Levinson 1993; Levinson 2003) but was adapted for the Ixcatecs to allow for a clear distinction between the three frames of reference: the intrinsic, the geocentric, and the egocentric. The task was executed exclusively outdoors to avoid the indoors/outdoors confounding factor. The participants were the four Ixcatec-Spanish bilinguals who participated in the language documentation programme, and a semi-speaker of Ixcatec.

Three items (a soap bar, a matchbox, and a candle) were first placed in a row on top of the seat of a chair. The back of the chair faced north, towards the main mountains of the village, and so did the participants. The participants were asked to memorize the placement of the objects and to reposition them. They were then asked to place the objects on the chair as a test-run. The participants were then rotated 180 degrees with respect to the first setting. This time the back of the chair faced west, whereas the participants faced south. In this position, they were asked to put the objects down as they remembered them from the previous setting.
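The logic of this rotation design can be made explicit with a small Python sketch. It is an illustration only, not the coding scheme used in the study: the names `rotate`, `predict`, and `classify`, the (east, north) grid coordinates, and the three sample responses are all hypothetical. The sketch assumes that a geocentric response keeps the row fixed with respect to the compass, that an egocentric response turns the row together with the participant (180 degrees), and that an intrinsic response turns it together with the chair (a quarter turn, from a chair back facing north to one facing west).

```python
from typing import Dict, Tuple

# Hypothetical encoding: each object is a point (east, north) relative to the
# centre of the row, in arbitrary integer units.
Vec = Tuple[int, int]


def rotate(v: Vec, quarter_turns: int) -> Vec:
    """Rotate v counterclockwise (seen from above) by quarter_turns * 90 degrees."""
    x, y = v
    for _ in range(quarter_turns % 4):
        x, y = -y, x
    return (x, y)


def predict(initial: Dict[str, Vec], quarter_turns: int) -> Dict[str, Vec]:
    """Arrangement expected if the row turns together with a given anchor."""
    return {name: rotate(pos, quarter_turns) for name, pos in initial.items()}


def classify(initial: Dict[str, Vec], response: Dict[str, Vec],
             participant_turns: int = 2, chair_turns: int = 1) -> str:
    """Label a response by the single frame of reference that predicts it."""
    anchors = {
        "geocentric": 0,                  # row stays fixed to the compass
        "egocentric": participant_turns,  # row turns with the participant
        "intrinsic": chair_turns,         # row turns with the chair
    }
    matches = [frame for frame, turns in anchors.items()
               if predict(initial, turns) == response]
    return matches[0] if len(matches) == 1 else "unclassified"


# First setting: the participant faces north and the row runs west to east.
initial = {"soap": (-1, 0), "matchbox": (0, 0), "candle": (1, 0)}

# Three hypothetical responses after the rotation.
geocentric_response = {"soap": (-1, 0), "matchbox": (0, 0), "candle": (1, 0)}
egocentric_response = {"soap": (1, 0), "matchbox": (0, 0), "candle": (-1, 0)}
intrinsic_response = {"soap": (0, -1), "matchbox": (0, 0), "candle": (0, 1)}
```

Because the chair and the participant are rotated by different angles, the three predictions differ from one another, which is what allows a single response to be attributed to one frame of reference.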
This procedure was repeated twice, with a random change in the order of the objects. For the second round, the objects were placed on the ground instead of the chair. Participants were first asked to place the objects on the ground facing north, towards the main mountains. They were then rotated 180 degrees with respect to the first setting, thus facing south. This procedure was repeated three times, with a random change in the order of the objects. Interestingly, two of the bilingual participants explicitly stated that two arrangements were possible, and then produced both an egocentric and a geocentric arrangement for the first trial. When asked to choose the one that seemed “best”, they opted for the egocentric frame, which was then kept consistent in all the answers.

The analysis of the results shows that in the task with the chair, all three frames of reference were used: 25% of the responses were geocentric, 25% egocentric, and 50% intrinsic. For task 2, where the objects were placed on the ground, the egocentric strategy was dominant, with 65% of the responses, while 35% of the responses were geocentric.

To summarize the results, the study of nonverbal small-scale arrangements and of co-speech gesture involving larger-scale descriptions shows that the last Ixcatec-Spanish bilinguals rely on all three frames of reference, i.e., egocentric, intrinsic, and geocentric. More specifically, the study of gesture indicates that the egocentric frame of reference is triggered by codeswitching to Spanish. Also, the fact that the two Ixcatec speakers considered the egocentric frame as the most adequate response in the nonverbal task might equally be due to Spanish influence.

7.3.4 Word order in verbal clauses

Veerman-Leichsenring (2001: 311) argued for a change of the Ixcatec word order in transitive clauses, from VSO to SVO, through contact with Spanish, a nonrigid V-medial language with an SVO unmarked order. From a genetic perspective, however, SVO word order is not a surprising feature for an Otomanguean language (Campbell, Kaufman, and Smith-Stark 1986: 547) and is the unmarked order of Nahuatl (Uto-Aztecan), an important language for the area. In this section, I will discuss the possible word-order change in the transitive Ixcatec clauses and the extent to which this change may be due to contact with Spanish.

The study of word order in Ixcatec confirms Veerman-Leichsenring’s observation about a possible change in the word order of the transitives from a previous verb-initial order. Although this change cannot be dated, it seems that it is not a recent development introduced by the last speakers of Ixcatec, since it is already observed in the texts collected by Fernández de Miranda in the 1950s. Within the 1,600-word corpus provided by Fernández de Miranda (1961), the agent very rarely follows the verb in clauses with overt arguments; see example (46) for such a use:

Ixcatec
(46)

     hngu  sumbréru  bena-ʃi-mi      Ɂina
     one   hat       buy-APPL-ANTIP  rich
     níka  ʃkã¹      hngu  sentábó
     just  twenty    one   cent
     ‘The rich buy one hat for hardly twenty-one cents.’ (Fernández de Miranda 1961: 184, my glosses, my translation from Spanish. Tones are transcribed as follows: high is transcribed as ˊ on the vowel and with ¹ on the nasalized vowel, low as ˋ on the vowel, and mid is not noted but applies to all vowels which are not high or low.)


Moreover, Ixcatec shows features typical of verb initial languages (Greenberg 1963; Dryer 2011a, 2011b, 2011c, 2011d). It has relative clauses that follow the nouns (NRel) (Adamou and Costaouec 2013), a feature that correlates with VO order in the following way: “if a language is VO, then it is usually NRel” (Dryer 2011c). Ixcatec has prepositions, a feature that correlates with VO order as a bidirectional implicational hierarchy: “if a language is OV then it is usually postpositional and if a language is postpositional then it is usually OV” (Dryer 2011b). Ixcatec has an N-ADJ word order, a feature which is not considered signiﬁcant despite raw numbers across the languages showing its preponderance in VO languages (Dryer 2011d). Finally, Ixcatec has a VS unmarked order (also see Chapter 8), another feature that indicates a change in the word order of the transitive clauses. Indeed, typological studies show that the position of subjects tends to be the same in intransitive and in transitive clauses (Dryer 2011a), and that VS languages largely correspond to VSO, VOS, and OVS orders (Dryer 2011a). Comparison with the two most closely related languages of the western Popolocan branch, Popoloc and Chocho, is also consistent with the change in word order observed in Ixcatec transitive clauses. Indeed, Popoloc and Chocho (Veerman-Leichsenring 2001) have an unmarked VSO order. A cross-reference morpheme follows the verb when the agent is fronted. Compare the unmarked VSO order in (47a) with the marked SVO order in (47b) for Metzontla Popoloc: Metzontla Popoloc (47)

     a. če-ʔè=ni       thà-xuáná  nìù
        give-3>3=INCL  CL-Juana   tortilla
        ‘Mrs. Juana gives us tortillas.’ (Veerman-Leichsenring 2006: 94)

     b. thà-xuáná  nà   če-ʔè=ni=thà      nìù
        CL-Juana   FOC  give-3>3=INCL=CO  tortilla
        ‘It is Mrs. Juana who gives us tortillas.’ (Veerman-Leichsenring 2006: 94)

(Tones are transcribed as follows: high is transcribed as ˊ on the vowel, low as ˋ on the vowel, and mid is not noted but applies to all vowels which are not high or low.)

Although it seems that a change in the word order of transitives took place in Ixcatec, it is not clear whether this change is contact-induced, as Veerman-Leichsenring suggests. Indeed, the change in the word order of the transitives in Ixcatec could have had language-internal functional motivations, since an SVO order allows an unambiguous marking of grammatical relations. This analysis has also been discussed for the change that took place in most Indo-European languages from an OV to a VO order. Similarly, in the Mesoamerican area, a change in word order is reported for Yucatec Maya (Mayan). Skopeteas and Verhoeven (2009) convincingly argue that the high frequency of an SVO order, as opposed to the unmarked verb-initial order, in Yucatec Maya could be motivated by language-internal factors:

adjacent syntactic units of the same category that have to be interpreted as functionally distinct (i.e., they bear different thematic roles), but are not marked for their function (i.e., do not bear case marking) are difficult to parse (Skopeteas and Verhoeven 2009: 255).

To conclude, Ixcatec, being a language with formerly adjacent agent and object arguments and with no case marking, may have shifted to an SVO order, which allows the agent to be clearly distinct from the object by assuming a nonadjacent position. The typological congruence of the new SVO order with the Spanish unmarked SVO order may of course have functioned as a catalyst for this change.

7.4 The Thrace Romani-Turkish-Greek corpus

As shown in Chapter 3, Thrace Romani speakers produce high rates of words from Turkish in in-group conversations. More importantly, as discussed in Chapter 5, Thrace Romani speakers use a typologically rare non-integration strategy for Turkish verbs by inserting them into Romani-dominant speech with the Turkish morphology. Myers-Scotton (2002: 269) considers that such atypical contact phenomena may become possible once convergence between the languages in contact has occurred and refers to such speech as “composite codeswitching”. According to Myers-Scotton’s proposal, the use of non-integrated Turkish verbs in Thrace Romani would have been possible after important convergence with Turkish. In the following sections I examine the degree to which convergence has taken place between Thrace Romani and Turkish at the level of prosody, articles, verb morphology, and word order in noun phrases.

7.4.1 Prosody in wh- and polar questions

Recent work on prosody indicates that there may be several convergent features in the Balkan languages, notably at the level of questions. Romani prosody, however, is a little-studied field with respect to contact and the results presented here are preliminary. In Standard Modern Greek, polar and wh-questions use characteristic melodies which may be considered as typologically rare. For example, in wh-questions, the wh-word is placed in initial position and carries the only stress accent in the sentence (Arvaniti and Ladd 2009). Following the accent’s peak, pitch falls and either remains low or shows a small rise at the end of the clause, as shown in Figure 76. In Turkish, wh-words also bear a focus accent which is realized in situ (Ladd 2008).

Figure 76: Pitch track of a wh-question in Standard Modern Greek (Tsiplakou et al. 2011)

An intonation pattern similar to that of Greek is also found in Thrace Romani wh-questions (Arvaniti and Adamou 2011). Observe Figure 77 where the accent falls on the wh-word so ‘what’ in clause-initial position, followed by a fall and small rise in the end of the clause. In Standard Modern Greek, focus in polar questions is realized as low pitch on the verb when focus is broad or on some other constituent when focus is narrow (Arvaniti and Baltazani 2005). Pitch remains low until the end of the question, where one sees a rise-fall, the peak of which co-occurs with the last stressed syllable (Arvaniti and Baltazani 2005); see Figure 78. In contrast, Turkish indicates focus by question particles and uses prosody to a lesser extent in polar questions (Ladd 2008).


Figure 77: Pitch track of a wh-question in Thrace Romani

Figure 79 illustrates a polar question in Thrace Romani with a L*+H pitch accent and a boundary tone H-L%. Compare the similar pattern, in both prosodic and syntactic terms, between the Romani polar question in Figure 79 and the Greek polar question expressed in spontaneous speech by a Romani speaker in Figure 80. Also notice that the boundary tone in Greek enunciated by a Romani speaker, shown in Figure 80, is realized lower than in the example from Standard Modern Greek shown in Figure 78 (Tsiplakou et al. 2011). Indeed, the intonation in the Greek polar question produced by a Romani speaker, shown in Figure 80, would be infelicitous in Greek because of the early focus on the verb (Tsiplakou et al. 2011). Comparison between Thrace Romani, Greek, and Turkish shows that Thrace Romani is closer to Greek in terms of the syntactic and prosodic realization of wh- and polar questions.


Figure 78: Pitch track of the polar question ‘Do you need help?’ in Standard Modern Greek (Tsiplakou et al. 2011)

Figure 79: Pitch track of a polar question in Thrace Romani


Figure 80: Pitch track of a Greek polar question enunciated by a Thrace Romani speaker

7.4.2 Articles

Romani has a definite article which was grammaticalized during the Byzantine era under the influence of Greek (Matras 2002). Contact with languages without a definite article led to the loss of the Romani article, as in Finnish Romani, in contact with Finnish, and in Northern Romani dialects in contact with various Slavic languages. Moreover, Romani varieties spoken in the Balkans appear to have grammaticalized the numeral jek ‘one’ as an indefinite article under the influence of the various contact languages.

Thrace Romani is today in contact with Turkish, which does not have a formal article system. Turkish relies on word order and case marking as well as on the use of the determiner bir ‘one’ to express definiteness and indefiniteness. The second contact language of Thrace Romani, Greek, has a fully grammaticalized indefinite article and a frequently used definite article, which is mandatory with proper nouns. Note that bare nouns in Greek are indefinite.

A quantitative analysis of the Thrace Romani corpus shows that 18% (N = 152) of the total number of nouns are determined by a definite article, including with proper nouns. Only 5% (N = 37) of the nouns are determined by the numeral jek ‘one’. Despite the low frequency of jek ‘one’, a qualitative analysis of the corpus indicates uses with referential nouns, see (48a), and non-referential nouns, see (48b). No uses of the indefinite article with generic nouns have been noted in the corpus.

Thrace Romani corpus
(48)

     a. ek   xoraxni        sas      kxamni
        one  Turkish_woman  was.3SG  pregnant
        ‘A Turkish woman was pregnant.’ (Adamou 2008. Excerpt from The man-snake, sentence 1. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

     b. mangav    phenel   ek   romni
        want.1SG  say.3SG  one  woman
        ‘He says, I want a woman.’ (Adamou 2008. Excerpt from The man-snake, sentence 10. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

Corpus analysis also shows 39% bare nouns (N = 340 out of 865), of which 20% are plural (N = 73). The relatively high rates of singular bare nouns despite the existence of a grammaticalized definite article call for further investigation. Turkish influence does not appear at the level of case marking or word order, but a parallel with Greek may be drawn. Indeed, it appears that, apart from the plural bare nouns, which are also indefinite in Greek, singular bare nouns are found in constructions which would also be grammatical in informal Greek; e.g., ‘go N’, ‘play N’, ‘get N’, and ‘do/make N’ such as in kerel marebava ‘make war’. Singular bare nouns are also found in attributive predications, either with a copula ‘be/become N’ or without a copula. Proper nouns are not determined when they are used in the vocative. Mass nouns, such as paj ‘water’ and lon ‘salt’, are not determined by an article. Finally, the high rate of bare nouns in Thrace Romani is partly due to the fact that the definite article has fused with several prepositions, e.g., andi jak ‘in the fire’, a phenomenon that is also found in Greek, e.g., sti fotia ‘in the fire’.

To conclude on articles, Thrace Romani shows more similarities with Greek than with Turkish. Unlike the Northern Romani dialects, which lost the definite article due to contact with languages having no articles, Thrace Romani has not converged on this point with its main contact language, Turkish.

7.4.3 Verb morphology As discussed in Chapter 5, Thrace Romani systematically integrates Turkish verbs with their Turkish TMA and person markers. In the ﬁeld of contact linguistics, it has been argued that typological similarities and diﬀerences shape the contact phenomena (Weinreich 1953; Matras 2007). In this section, we explore the similarities and the diﬀerences between the Turkish and Romani verbal systems which may have inﬂuenced the transfer of Turkish verbs without any morphological integration into Romani. Typologically, both Romani and Turkish have synthetic verb morphology: Romani, an Indo-Aryan language, shows mainly fusional and sometimes agglutinative patterns, whereas Turkish, an Altaic language, is a prototypical agglutinative language. In both languages most verb morphemes follow the verbal stem and agreement markers tend to come last. When two languages in contact share similar morphemes, convergence is likely to occur. For example, the Romani and Turkish causative markers show some striking similarities: the inherited Romani causative is based on the transitivizing aﬃx, -ar-, combining with the preterite stem frequently ending in -d (Matras 2002: 121) while the Turkish causative is -dır-/-tır-. Interestingly, it has been noted that the Romani causative is more productive when the contact language also has a causative and more speciﬁcally that Turkish played the role of a catalyst for the productivity of the Romani causative when the two languages were in contact (Matras 2002: 120). In the Thrace Romani corpus, the Romani causative morpheme occurs 14 times with Romani verbs and the Turkish causative -dır- 13 times with Turkish verbs. See an example of the Romani and Turkish causatives in (49): Thrace Romani corpus < Romani (in plain), Turkish (in bold) (49)

barad-ar-dom         len      me
grow-CAUS-PRET.1SG   3PL.ACC  1SG.NOM
evlen-tər-dəm        len      me
marry-CAUS-PRET.1SG  3PL.ACC  1SG.NOM
‘I raised them, I married them.’ (Adamou, unpublished corpus)

I now turn to examine what happens when the languages in contact are diﬀerent. An interesting case is that of Turkish evidentiality and its impact on Thrace Romani. On the one hand, in Turkish, the verbs in past tense obligatorily indicate whether the event was experienced directly (-di) or indirectly (-miş); e.g., Ahmet gel-miş ‘Ahmet came/must have come.’ (Aksu-Koç and Slobin 1986: 159). The morpheme -miş has the basic meanings of inference and hearsay and is used in everyday speech as well as in myths, folktales, and jokes. On the other hand, in most Romani varieties evidentiality is not grammaticalized. In the Thrace Romani corpus, the Turkish aﬃx, -mış, is used as a free morpheme, muʃ, and occurs only three times. See the examples (50) and (51) for the use of muʃ with Romani verbs. Thrace Romani corpus < Romani (in plain), Turkish (in bold), Greek (underscored) (50)

phendas   muʃ        i      fatma
said.3SG  allegedly  DEF.F  NP
oti   e        xurd-es    voj    lja       sas   pe-sa
that  DEF.OBL  child-ACC  3SG.F  took.3SG  with  REFL-INS

‘Fatma said, allegedly, that she took the child with her.’ (Adamou, unpublished corpus) Thrace Romani < Romani (in plain), Turkish (in bold) (51)

me       dea         ka    marel     muʃ
1SG.OBL  mother.ACC  will  beat.3SG  allegedly

‘He will, supposedly, beat my mother.’ (Adamou, unpublished corpus) Unlike Turkish -mış, which is used for inference and hearsay, Romani speakers use muʃ to report on the truth of the statement. The meaning of Romani muʃ is similar to the meaning of the Greek expressions taha ʻit seemsʼ, lei ʻsaysʼ, and demek, from Turkish ʻsayʼ, used in the sense of ʻallegedly’. Indeed, demek is also used in Thrace Romani with the same meaning (three times in the corpus), as illustrated in (52). Thrace Romani corpus < Greek (in bold) < Turkish (52)

phendas   demek      i      fatma
said.3SG  allegedly  DEF.F  NP

‘Fatma said, allegedly . . .’ (Adamou, unpublished corpus)

To conclude, the exact inﬂuence of Turkish on the Romani verbal system is diﬃcult to determine, while it appears that the second contact language, Greek, plays an important role in pattern replication.

7.4.4 Word order in noun phrases

Romani word order in noun phrases is similar to that of its contact languages, Turkish and Greek; see Table 21. A search in the corpus shows that the most frequent word order for adjectives and nouns is ADJ-N (N = 38), e.g., ek bari jak ‘one big fire’, ek baro balo ‘one big pig’, i bari avlia ‘on big courtyard’, ek bari kofa ‘one big bucket’. Four occurrences of an N-ADJ order are registered, an order which seems to be “marked”, e.g., o nak baro, e danda bare ‘the nose (is) big, the teeth (are) big’. This order is possible in Greek, namely with specific intonation as well as in polydefinite constructions, but not in Turkish.

Table 21: Word order in Romani, Turkish, and Greek noun phrases

       Romani          Turkish   Greek
ART    ART-N           –         ART-N
DEM    DEM-N           DEM-N     DEM-N
NUM    NUM-N           NUM-N     NUM-N
ADJ    ADJ-N (N-ADJ)   ADJ-N     ADJ-N (N-ADJ)

We note that constructions of the Greek type POSS POSS N which are described in the literature as contact-induced (Matras 2002) are scarce in the Thrace Romani corpus; e.g., mirne me vasta ‘lit. my hands of mine’, mirni mi dej ‘lit. my mother of mine’. Likewise, pronominal object doubling, which is an areal feature and is reported for Balkan and Vlax Romani, is practically absent from the Thrace Romani corpus. Finally, the obligatoriness of resumptive pronouns for objects in relative clauses with kaj does not show in the corpus.

7.5 Discussion

In Chapter 7, it was shown that languages with 0‒5% contact words may have extensive pattern replication from their contact language. This is the case in Balkan Slavic Nashta, which converged with Greek through a combination of century-long processes in the Balkan Sprachbund and extremely rapid changes due to the shift to Greek in the twentieth century.

More impressively, the corpus of Colloquial Upper Sorbian, which was not examined in detail in this chapter, contains very few contact words and yet shows extensive convergence with German. For example, in Colloquial Upper Sorbian articles have been grammaticalized under the inﬂuence of German and deﬁnite and indeﬁnite articles have replicated the German article functions; personal pronouns became obligatory under German inﬂuence; the use of the dual case has been reduced; passive constructions were grammaticalized with the auxiliary hodwać under the inﬂuence of German werden ‘become’; expletive to replicates German es ‘it’; and Colloquial Upper Sorbian developed phonological vowel length (see Breu 2004, 2008; Scholze 2008). Finally, the inﬂuence of Spanish on Ixcatec could be limited to the role of catalyst for several language-internal changes, i.e., word order of the transitive clauses, grammaticalization of the deﬁnite and indeﬁnite articles, and uses of the complementizer. The inﬂuence of Spanish is, however, more clear at the level of frames of reference, i.e., introduction of a left-right axis, progression of the egocentric frame of reference in cognitive representations of small-scale arrangements. In Thrace Romani, Turkish inﬂuence at the level of pattern replication is not apparent, despite the high number of contact words (20‒35%). For example, no convergence with Turkish is observed for the deﬁnite articles and the intonation patterns in wh- and polar questions which seem closer to the Greek patterns than the Turkish ones. The strong inﬂuence of Greek on Thrace Romani is evidenced by the complex process concerning the Turkish evidential marker, i.e., borrowing of the Turkish form and replication of the Greek function. Contrary to Thrace Romani, however, other corpora with high rates of contact words also show extensive replication. 
For example, Finnish Romani has entirely converged with Finnish at the level of case, loss of the definite article, word order, and prosody (Granqvist 2000, 2003). Similarly, Molise Slavic has converged with Italian for the future, irrealis, resultative, verbal aspect, expletive constructions, passive constructions, and gender, and has grammaticalized an indefinite article (Breu 2005, 2008).

In conclusion, this chapter shows that the rates of contact words in a bilingual or multilingual corpus are not directly related to the extent of pattern replication. Languages with few contact words may show extensive pattern replication, while languages with many contact words may show minimal pattern replication. Thus, although lexical borrowing is said to precede pattern replication in the various borrowing scales, it appears that the extent of lexical borrowing and the extent of pattern replication evolve independently.

Chapter 8

Information structure

8.1 Background

Information structure is a particularly good candidate for contact-induced change. The vulnerability of the devices related to information structure is expressed by the Interface Hypothesis (Sorace 2011) and is demonstrated in a number of studies on bilingualism. In this approach, information structure devices are expected to be highly affected by language contact as they are located at the syntax-pragmatics interface. More generally, as Myers-Scotton (1993a: 236‒237) observes, codeswitching is part of the means to highlight new information and is therefore closely related to information structure.

Information structure is expressed by prosodic, syntactic, and morphological devices, all of which are known to be prone to contact. For example, the borrowability of prosody is high, as expressed in the following hierarchy:

prosody > stress > vowel length > vowel quality > semi-vowels and liquids > complex consonants > other consonants (Matras 2009: 232)
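
The hierarchy quoted from Matras (2009: 232) can be read as an ordered scale. The following is a minimal sketch in Python, given for illustration only: the ordering is taken from the quotation, while the names HIERARCHY and more_borrowable are hypothetical and do not come from the source.

```python
# Borrowability hierarchy as quoted from Matras (2009: 232),
# listed from most to least borrowable.
HIERARCHY = [
    "prosody",
    "stress",
    "vowel length",
    "vowel quality",
    "semi-vowels and liquids",
    "complex consonants",
    "other consonants",
]


def more_borrowable(a: str, b: str) -> bool:
    """Return True if feature `a` ranks higher than `b` on the hierarchy.

    Hypothetical helper: it only compares positions in the quoted scale
    and makes no claim about any particular contact situation.
    """
    return HIERARCHY.index(a) < HIERARCHY.index(b)
```

On this reading, more_borrowable("prosody", "vowel length") holds, whereas more_borrowable("other consonants", "stress") does not.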

Word order is also a well-known candidate for pattern replication: One way of replicating a word order arrangement found in another language is by narrowing down the range of discourse options available by choosing among the use patterns that are available in the replica language the one that most readily corresponds to the one in the model language and making it the regular one – using it more frequently and in a wider range of contexts. (Heine 2008: 38)

Finally, morphological devices such as focus particles and focus-sensitive particles are frequently borrowed. Matras explains their high likelihood of being borrowed as follows: As the speaker moves to prompt the hearer into activating presupposed knowledge, and further as the speaker puts forward propositions that challenge presupposed knowledge, resistance on the part of the hearer may be anticipated. It is in such instances that the speaker is susceptible to malfunctions of the selection and inhibition mechanism. Fusion of repertoire components with respect to the relevant grammatical operations is a way to pre-empt such malfunctions. (Matras 2009: 197)

Despite its interest for linguistics, the study of information structure in lesser-known languages faces a number of methodological obstacles. The study of
information structure is traditionally based on researcher intuition. It may be complemented by the analysis of rich, spontaneous corpora, and experiments that are generally conducted in a laboratory. These methodologies are more diﬃcult to apply for the study of lesser-known and endangered languages. For example, linguists generally do not have any native speaker’s intuitions when working on an endangered language. Moreover, the use of experimental techniques in the ﬁeld is complex and requires some adjustment to the speakers’ cultural and educational background (for a number of tasks which can be used in the ﬁeld see Skopeteas et al. 2006). Another option is to rely on natural, unscripted data. However, in free-speech, the contexts and speakers’ intentions are sometimes diﬃcult to analyse and explicit question-answer pairs are rare (Schultze-Berndt and Simard 2012). Also, the study of the prosodic marking of information structure based on spontaneous data is a very complex task as the context and the various phonetic parameters cannot be controlled. In the following sections, I present some preliminary results from ongoing collaborative research on information structure in Ixcatec and in Thrace Romani. These studies rely on the analysis of free-speech corpora and are complemented by experimental tasks. The classiﬁcation of constituents with respect to their information status is based on the distinction between “new”, “given”, and “focused” constituents (see among others Chafe 1976; Büring 2009; Katz and Selkirk 2011). The term “discourse-new” is used to refer to a constituent which has no antecedent in the preceding discourse or has not been mentioned in the preceding twenty clauses, an arbitrary number based on the referential distance measures proposed in Givón (1983). The term “discourse-given” is used for any item that has been mentioned in the preceding discourse, explicitly or through a semantically-related expression. 
Last, “focus” is used for constituents introducing alternatives into the discourse (Rooth 1992). The data discussed in the following sections illustrate two possible ways in which language contact aﬀects focus marking: through weakening of existing strategies for Ixcatec, in contact with Spanish, and through addition of new means for Thrace Romani, in contact with Turkish and Greek.
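
The operational distinction between “discourse-new” and “discourse-given” constituents can be sketched as follows. This is an illustrative sketch under stated assumptions, not the annotation procedure actually used for the corpora: it assumes that each clause has already been annotated with a set of referent labels (semantically related expressions being mapped to the same label by the annotator), and the name information_status is hypothetical. Focus is not modelled, since identifying alternatives in the discourse requires the analyst's judgement.

```python
from typing import List, Set

WINDOW = 20  # number of preceding clauses considered, as stated in the text


def information_status(referent: str,
                       preceding: List[Set[str]],
                       window: int = WINDOW) -> str:
    """Classify a referent as 'given' or 'new'.

    `preceding` holds, in discourse order, the set of referent labels
    mentioned in each earlier clause (an assumed annotation format).
    A referent counts as 'given' if it occurs in one of the last
    `window` clauses, and as 'new' otherwise.
    """
    start = max(0, len(preceding) - window)
    for mentioned in preceding[start:]:
        if referent in mentioned:
            return "given"
    return "new"
```

For instance, a referent mentioned exactly twenty clauses back is still classified as given, while one further clause of intervening discourse makes it new again under this definition.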

8.2 The Ixcatec-Spanish corpus

8.2.1 Prosody

Ixcatec is a tone language with three lexically contrastive tones: high (H), mid (M), and low (L). The role of prosody in the expression of focus in tone languages is a field requiring more study, but a growing number of analyses show
that prosody is also relevant for the expression of focus in tone languages. The role of prosody in the expression of focus was established for major-communication languages such as Chinese (Xu 1999; Chen and Gussenhoven 2008), but its role has not yet been examined in the tone languages of Mesoamerica. The study of the prosodic correlates of information structure in Ixcatec was a great challenge. First, semi-spontaneous data were elicited with tasks from the Questionnaire on information structure (Skopeteas et al. 2006). The analysis of these data indicated the use of lengthening and pitch expansion in focus conditions. Figure 81 shows the realization of the word nĩ¹hẽ ʽthreeʼ in isolation. Then, in Figure 82, the numeral is shown under corrective focus with signiﬁcant pitch expansion, up to 415 Hz, and lengthening of the ﬁrst syllable.

Figure 81: Ixcatec: Pitch track of the word ‘three’ in isolation

To conﬁrm the observations of the semi-spontaneous data, an experimental study for the realization of focus in Ixcatec was conducted in collaboration with phonetician Matthew Gordon. The study conﬁrms that pitch expansion is a correlate of contrastive focus, whereas duration and intensity are robust correlates of corrective focus (Adamou, Gordon, and Gries submitted).

Figure 82: Ixcatec: Pitch track of the word ‘three’ under corrective focus

Due to the absence of historical data, it is difficult to determine the role played by language contact in the prosodic marking of focus in Ixcatec. On the one hand, prosodic marking of focus may have been present in Ixcatec independent of contact with Spanish. For example, duration is a correlate of focus in a tone language such as Standard Chinese, where it combines with pitch expansion and compression of the post-focal elements (Chen and Gussenhoven 2008). On the other hand, in Mexican Spanish, duration is the most robust correlate of focus, together with an L+H* accent, an earlier peak on the stressed syllable, and higher intensity (Kim and Avelino 2003; de la Mota, Butragueño, and Prieto 2010). As discussed in section 8.2.3, the typological characteristics of Ixcatec indicate that the systematic use of prosodic devices to express focus has probably gained weight under the influence of Mexican Spanish.

8.2.2 Word order

Ixcatec noun phrases have a rigid word order, NUM N ADJ. This is confirmed by the analysis of the spontaneous and semi-spontaneous corpora and the data obtained through the “Animal Game” task (Skopeteas et al. 2006). See an example in (53).

Ixcatec (53)
nĩ¹hẽ  Ɂu-jahà    [juwà]F
three  CLF-eagle  green

‘Three GREEN eagles.’ (QUIS task) In contrast, the role of word order in verbal clauses is more complex as it is also related to the coding of grammatical relations. As discussed in Chapter 7, the change in the word order of the transitive clauses in Ixcatec has resulted in the coding of agents in the preverbal position. We also note that Ixcatec does not code grammatical relations by case or adposition. Moreover, the unique argument of intransitive verbs (S) and the agent-like argument of transitive verbs (A) are indexed on the verb through the same suﬃxes,1 whereas the patient-like (P) and the recipient-like (R) arguments are not indexed (terminology follows Malchukov, Haspelmath, and Comrie 2010). Data from experimental tasks and semi-spontaneous speech show that new S arguments in Ixcatec follow the verb, as shown in (54a). Given S arguments precede the verb, as shown in (54b). (54)

Ixcatec

a. VSNEW
ʃuwo-kú   hŋgu  kwa
come-ANT  one   woman
ʻA woman came.ʼ (Pear stories)

b. SGIVENV
sá=kwa     kí=tsu         kwiká-kwa      jaʃilà
DEF-woman  PROG.3SG-want  pull-CO.3SG.F  wood_chair
ʻThe woman is trying to pull a wooden chair.ʼ (QUIS task)

1 This basic marking varies depending on the verbs, as some verbs, e.g. ‘hurt’, may receive either the A-S suffixes or the possessive suffixes with a meaning ‘be sick’.

Observe that when the S argument appears pre-verbally, as in (54b), it triggers the use of a cross-reference morpheme. The use of cross-reference morphemes allows us to consider that the VS word order is the “canonical”, “unmarked” order, and that the preverbal position for the S arguments is the “marked” one. Cross-reference morphemes, however, only occur with nouns that take the classifiers di- ‘man’, kwa- ‘woman’, ʔu- ‘animal’, with the third singular pronouns suwáda and suwákwa, as well as with some nouns like ‘mother’, etc. Also notice that the new constituent in (54a) is determined by the numeral ‘one’, which is in the process of grammaticalization as an indefinite article, while the given constituent in (54b) is determined by the definite article (see Chapter 7).

For the study of word order, the spontaneous corpus was first segmented into prosodic units in ELAN, and the core arguments were tagged for semantic role (S, A, P, R), information status, and animacy. The quantitative analysis of a total of 648 verbs shows that, for the intransitive clauses, new, given, and focused S arguments appear both postverbally and preverbally; see Table 22.

Table 22: Absolute number of occurrences of S with respect to V (realized within the same prosodic unit)

              New   Given   Focus   Total
VS
  Human         5      15       3      23
  Animate       0       3       0       3
  Inanimate     0       1       0       1
  Total         5      19       3      27
SV
  Human         6      18       1      25
  Animate       3       4       0       7
  Inanimate     5       9       2      16
  Total        14      31       3      48
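
The counts in Table 22 lend themselves to a simple cross-check of the marginal totals and to computing the share of preverbal S arguments for each information status. The following Python sketch is illustrative only: the data structure and the function names totals_by_status and preverbal_share are assumptions introduced here, and only the counts come from the table.

```python
# Counts from Table 22: word order -> animacy -> (new, given, focus).
TABLE_22 = {
    "VS": {"human": (5, 15, 3), "animate": (0, 3, 0), "inanimate": (0, 1, 0)},
    "SV": {"human": (6, 18, 1), "animate": (3, 4, 0), "inanimate": (5, 9, 2)},
}
STATUSES = ("new", "given", "focus")


def totals_by_status(order: str) -> dict:
    """Sum the counts over animacy classes for one word order."""
    rows = list(TABLE_22[order].values())
    return {s: sum(row[i] for row in rows) for i, s in enumerate(STATUSES)}


def preverbal_share(status: str) -> float:
    """Proportion of S arguments with the given status that precede the verb."""
    sv = totals_by_status("SV")[status]
    vs = totals_by_status("VS")[status]
    return sv / (sv + vs)
```

With these counts, 14 of the 19 new S arguments, 31 of the 50 given S arguments, and 3 of the 6 focused S arguments are preverbal.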

Corpus analysis shows few dislocated arguments, i.e., arguments realized in a distinct prosodic unit: S arguments were left-dislocated in only four cases and right-dislocated in three cases. Right-dislocated arguments generally follow a pause
and show F0 reset, while left-dislocated arguments show lengthening of the ﬁnal syllable and are followed by a pause and an F0 reset (Adamou, Gordon, and Gries, submitted). The experimental data show that A-like arguments always precede the verb, whether new or given, as illustrated in (55a) and (55b). (55)

Ixcatec

a. ANEWVPGIVEN
sá=kwa-ʔĩ¹      kí=Ɂuteká-kwa           sá=li-ʔĩ¹
DEF-CLF-little  PROG.3SG-push-CO.3SG.F  DEF-CLS-little
‘The girl is pushing the boy.’ (QUIS task)

b. AGIVENVPNEW
sá=kwa     kí=Ɂuteká-kwa           sá=mi-ndawa
DEF-woman  PROG.3SG-push-CO.3SG.F  DEF-CLS-male

‘The woman is pushing the man.’ (QUIS task)

The analysis of the spontaneous data shows that overt expression of A-arguments through nominals is very rare. The analysis also shows that new and given P-like arguments appear both postverbally and preverbally and that they are focused in situ; see Table 23. Last, A-like arguments were left-dislocated in just four cases, and P-like arguments were right-dislocated in 18 cases and left-dislocated in just one case.

Table 23: Absolute number of occurrences of P with respect to V (realized within the same prosodic unit)

              New   Given   Focus   Total
VP
  Human         3       1       0       4
  Animate       0       2       0       2
  Inanimate    22      17       0      39
  Total        25      20       0      45
PV
  Human         0       0       0       0
  Animate       0       1       0       1
  Inanimate     3       8       2      13
  Total         3       9       2      14

8.2.3 Morphology In the contemporary Ixcatec-Spanish corpus, a focus particle -na is optionally used. The Ixcatec particle -na is clearly an inherited focus-marking strategy as it is encountered in other Popolocan languages such as Metzontla Popoloc (Veerman-Leichsenring 2006: 94). Also, in the Ixcatec texts of the 1950s, the particle occurs 31 times, for contrastive focus, as in (56), and for what I tentatively call “contrastive topics” (CT) following Büring (in press), illustrated in (57): Ixcatec (56)

tila   tyhĩ  [nĩ¹hẽ]F-ná
until  day   three-FOC
‘In THREE days. . .’ (Fernández de Miranda 1961: 183, my glosses, my translation from Spanish)

Ixcatec [Preceding discourse: ‘The owner of the house is happy. He brings together all of his family. Then, he has had a goat prepared (for the barbecue) so as to eat with those who helped at the house construction.’] Answer to an implicit question: ‘What do the family members do?’ (57)

a.
[mi-tʃɁa]CT-ná  [Ɂú]F  niɲu       [. . .]
CLS-woman-FOC   mill   tortillas
‘The WOMEN mill tortillas [that they offer to those who have worked].’

b.
sála  [ndawa]CT-ná  [batu-beɁe-ʃi]F
DEM   man-FOC       3PL(IPFV)-give(IPFV)-APPL
hngu  kala  júhu  Ɂɲù   [. . .]
one   or    two   rope

‘The MEN are contributing with one or two ropes [for the owner of the house when they have no money. Those who have money oﬀer one or two pesos].’ (Fernández de Miranda 1961: 181, 182 sentences 25, 26, my glosses, my translation from Spanish. Tones are transcribed as follows: high is transcribed as ˊ on the vowel and with ¹ on the nasalized vowel, low as ˋ on the vowel, and mid is not noted but applies to all vowels which are not high or low.)

Figure 83 illustrates the focus marker -na, suffixed to the focused word nĩ¹hẽ ‘three’. When the prosodic realisation of the word ‘three’ is compared to the pragmatically neutral realisation shown in Figure 81, it appears that lengthening and pitch expansion of the focused word combine with the focus particle. An experimental study on focus expression in Ixcatec confirms that the Ixcatec focus particle currently combines with prosodic marking and that its presence enhances the prosodic marking of the focused element (Adamou, Gordon, and Gries, submitted).

Figure 83: Ixcatec: Pitch track of the word ‘three’ combined with the focus particle

In a cross-linguistic perspective, the combination of prosody and focus particles appears to be rare (Büring 2009; Féry 2013). Such combination may therefore have arisen in Ixcatec through contact between the two typologically-distinct languages: Ixcatec, mainly relying on a specialised focus marker, and Spanish, mainly relying on prosody for focus (when there is no syntactic movement involved).

A possible path for the combination of prosodic marking with the focus marker in Ixcatec may go through the combination of Spanish lexicon with the Ixcatec focus marker. Although we lack evidence for this claim, we observe that in one of the examples from the contemporary corpus, the speaker combines the Spanish word sekundarja ‘middle-school’ with the Ixcatec focus marker. This example comes as a corrective reply to the researcher’s question about whether the speaker teaches classes of Ixcatec at the kindergarten. The Spanish word is enunciated with lengthening of the accented syllable, following the Mexican Spanish way of marking focus through duration, but the Ixcatec focus marker is maintained. It is therefore possible that such combination started with the use of Spanish words and then became generalized to the rest of the Ixcatec lexicon as shown in 8.2.1.

8.3 The Thrace Romani-Turkish-Greek corpus

8.3.1 Prosody

For the study of prosodic marking of focus, the spontaneous Romani data from Thrace were analysed in Praat in collaboration with phonetician Amalia Arvaniti following the principles of the autosegmental-metrical framework of intonational phonology (see Arvaniti and Adamou 2011). In this framework, a distinction is made between stress and intonation, with stress represented in metrical structure and intonation by means of a series of H (high) and L (low) tones. Thrace Romani has final stress in the native parts of its vocabulary (Adamou and Arvaniti 2014). The analysis of the spontaneous corpus shows the use of an L+H* accent for focus and deaccenting of the rest of the utterance, as shown in Figure 84 (Arvaniti and Adamou 2011: 243). The L+H* accent differs from lexical stress in showing an earlier peak and a consistent rise. Interestingly, both contact languages, Turkish (Özge and Bozsahin 2010) and Greek (Arvaniti and Baltazani 2005), also use an L+H* accent for contrastive focus. Although this type of focus marking is common cross-linguistically, we note that it does not occur in other Romani varieties which were in long-term contact with languages that employ different focus strategies, such as Finnish Romani in contact with Finnish (Granqvist 2003).

The analysis of the free-speech data also shows that speakers of Thrace Romani use a stress-shift to a syllable earlier than the one canonically stressed (Adamou and Arvaniti 2014). For example, in the clause in (58), the adjective /saˈno/ ‘thin’ and the noun /bal/ ‘hair’ are phrased separately and the canonical final stress is observed for the word /saˈno/ ‘thin’. In the following clause, however, the adjective and the noun are phrased together and the stress of the word ‘thin’, /saˈno/, shifts to the first syllable [ˈsano] (Adamou and Arvaniti 2014: 229).

Figure 84: Thrace Romani: Pitch track of a spontaneous example with focus on naj ‘is not’ (adapted from Arvaniti and Adamou 2011: 243)

Thrace Romani (58)
ikaˈlda   kaˈtar  an  jek  saˈno  bal
took.3SG  from    in  one  thin   hair
‘She took out from there one light hair.’ (Adamou, 2008. The Man-Snake, sentence 2. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

ni   pukaˈvel    laˈke    kaj   si      ˈsano  bal
NEG  reveal.3SG  3SG.DAT  that  be.3SG  thin   hair
‘She doesn’t say that it is a light hair.’ (Adamou, 2008. The Man-Snake, sentence 3. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

A similar stress-shift is encountered with words under focus (Arvaniti and Adamou 2011). Compare the realization of the Turkish word [erzaˈnava] ‘pharmacy’ in Figure 85, and its realization with stress-shift as [erˈzanava] in Figure 86. The stress-shift in Thrace Romani is reminiscent of the stress-shift in Turkish which is used for thematic contrast (Özge and Bozsahin 2010: 141).

Figure 85: Pitch track of the word erzaˈnava ‘pharmacy’ in Thrace Romani

Figure 86: Pitch track of the word erˈzanava ‘pharmacy’ under focus with stress-shift in Thrace Romani

8.3.2 Word order

In Thrace Romani, the unique argument of intransitive verbs (S) and the agent-like argument of transitive verbs (A) are indexed on the verb through suffixes. Arguments are coded through case, i.e., nominative, genitive, accusative, dative, and instrumental, and through adpositions. Romani is a pro-drop language and core arguments are not necessarily expressed overtly. Similar to other Romani varieties (Matras 1995, 2002), Thrace Romani intransitive clauses with a discourse-new participant show a verb-initial order, as can be seen in (59).

Thrace Romani < Romani (in plain), Turkish (in bold)
(59) VSNEW
kida-pes         bytyn  o            gavutno
gather.3SG-REFL  all    ART.DEF.NOM  village_people

‘All the village people gather.’ (Adamou 2008, The louse and the Rom, Sentence 18. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

When the participant of an intransitive clause is discourse-given, null argument structures are frequent. When there is an overt nominal, however, it generally precedes the verb, as shown in (60).

Thrace Romani < Romani (in plain), Turkish (in bold)
(60) SGIVENV
kava  sevindi
this  be_happy.PST.3SG

‘This one was happy.’ (Adamou 2008, The louse and the Rom, Sentence 45. Accessed online at http://lacito.vjf.cnrs.fr/pangloss) The SV order may also be used for constituents under contrastive focus and for contrastive topics. Figure 87 illustrates the most common way for marking contrastive focus in Thrace Romani by combining prosody, through a L+H* accent and post-focal deaccenting, with syntactic marking (Arvaniti and Adamou 2011). In mono-transitive clauses, A-like arguments are frequently not overtly expressed. New or given P arguments, human or inanimate, nouns or pronouns, canonically follow the verb, as shown in (61a) for the noun dʒuv ‘louse’, which has been previously introduced, and in (61b), for a discourse-new argument. (61)

Thrace Romani < Romani (in plain), Greek (in bold)

a. VPGIVEN
lel       kaja  dʒuv
take.3SG  this  louse
‘He takes this louse.’ (Adamou 2008, The louse and the Rom, sentence 7. Accessed online at http://lacito.vjf.cnrs.fr/pangloss)

b. VPNEW
lav       e            lastika
take.1SG  ART.DEF.OBL  hose
‘I take the hose.’ (Adamou, unpublished corpus)

Figure 87: Focus in Thrace Romani: SV order and prosodic marking (adapted from Arvaniti and Adamou 2011: 244)

PV order is reserved for topicalization and focus. Figure 88 shows an example of focus for apora ‘pills’, with PV order and a L+H* accent followed by deaccenting of the verb.

Figure 88: Focus in Thrace Romani: PV order and prosodic marking (adapted from Arvaniti and Adamou 2011: 243)

The unmarked word order in the ditransitive clauses is VRT, where R is generally a pronoun, as shown in (62a). A focused T argument may be preverbal as shown in (62b). (62)

Thrace Romani < Romani (in plain), Turkish (in bold), Greek (in italics), Multiple (underscored)

a. VRGIVENTNEW
del       leske    pare
give.3SG  3SG.DAT  money
‘He pays him.’ (lit. he gives him money) (Adamou 2008, The Louse and the Rom)

b. TFOCVRGIVEN
em   birindʒi     moromandila  daas           tuke
FOC  first_class  swipes       give.IMPF.1SG  2SG.DAT
‘I was giving you first-class swipes!’ (Adamou, unpublished corpus)

The analysis of the Thrace Romani data clearly shows that influence from Turkish on word order is not encountered, since Turkish has an unmarked V-final order (SOV) and focus is obtained in situ preverbally (Göksel and Özsoy 2003) or in the immediately preverbal position (Erguvanli 1984). In contrast, Thrace Romani word order is closer to Greek, an SVO language which uses an SV order for narrow focus and can focus an object either in situ or preverbally (Georgakopoulos and Skopeteas 2010).

8.3.3 Morphology Thrace Romani has borrowed from Turkish the focus particle da (Göksel and Özsoy 2003). In Turkish, the enclitic particle dA functions as an additive particle ‘also’ with focused host constituents, which have to be obligatorily in the preverbal position, and unfocused host constituents, which may be in sentence initial or postverbal position. The Turkish particle dA may have scope over the focused constituent or, as shown in (63b), over the entire clause: (63)

Turkish
a. I’m going out.
b. TEYZE-M-E          de  uğra-yacağ-ım
   aunt-1SG.POSS-DAT  dA  visit-FUT-1SG

‘I will also visit my AUNT.’ (Göksel and Özsoy 2003: 1163) In the Thrace Romani corpus, the Turkish focus-sensitive particle da ‘also’ (N = 58) is used without the Turkish syntactic constraints but following the general preferences in Romani word order, i.e., focus in the preverbal position is preferred for subjects and focus in situ, postverbally, is preferred for objects. Similarly to Turkish, scope may be over the host constituent, as shown in Figure 89, or over the whole clause.

Figure 89: Pitch track illustrating the Turkish focus-sensitive particle da and the Turkish numeral classiﬁer tane in Thrace Romani

Interestingly, the Turkish particle da is also found in the corpus of Thrace Romani from the seventeenth century, collected and transcribed by Evliya Çelebi in the city of Komotini (Gümülcine in Turkish). Although the contemporary Thrace Romani varieties under study are not the direct descendants of the seventeenth-century variety, the fact that da is documented in both varieties conﬁrms the high borrowability of the Turkish particle. As shown in (64), da precedes the ﬁrst singular pronoun in the Romani text, where it probably functions as a coordinator, but it follows the pronoun in the Turkish text. (64)

Romani text: da’maytah paya’ puwyah
da   me[j]  te        phe[n]ja    bu[l]je
and  I      your-OBL  sister-ACC  ass-LOC
Turkish text: sikeyim ben de senüñ qız qarındaşıñı
‘. . . and let me fuck your sister.’ (Friedman and Dankoff 1991: 163)

In the Thrace Romani corpus, the Turkish (< Persian) numeral classifier, tane, is used similarly to Turkish, for the contrastive focus of a numeral (Schroeder 1999). In the spontaneous corpus, tane combines with 15% of all the numerals. But, although in Turkish tane is rarely used with numerals above 20, this restriction does not apply to Romani, as can be seen in Figure 89, where it is used with the numeral ‘40’. Lastly, Thrace Romani speakers use the additive expression em x em y ‘and x . . . and y’ in which a given and a new argument are contrasted. This expression is found in both contact languages, Turkish and Greek, as well as in several other Balkan languages. In Thrace Romani, em may also occur as a focus-sensitive particle with only one part of the construction, as in the example (62b) repeated in (65).

Thrace Romani corpus < Romani (in plain), Turkish (in bold), Greek (in italics), Multiple (underscored) (65)

em   birindʒi     moromandila  daas           tuke
FOC  first_class  swipes       give.IMPF.1SG  2SG.DAT
‘I was giving you first-class swipes!’ (Adamou, unpublished corpus)

8.4 Discussion

To conclude, the study of information structure shows that Thrace Romani, in contact with Turkish and Greek, proceeds by adding focus strategies from Turkish, i.e., stress shift, the focus-sensitive particle da, the numeral classifier tane, and the additive particle em (. . .em); see Table 24. In terms of prosody, both Turkish and Greek use an L+H* accent similar to the one found in Thrace Romani, but in terms of word order, Thrace Romani does not appear to be influenced by Turkish and is more similar to Greek.


Table 24: Focus marking strategies between Thrace Romani and the contact languages, Turkish and Greek

Focus marking             Romani                    Turkish                     Greek
Accent type               L+H*                      L+H*                        L+H*
Stress shift              yes                       yes                         no
Word order                SV or AV, VP or PV        preverbal                   SV, VP or PV
Focus-sensitive particle  =da ‘also’ and focus      =da ‘also’ and focus        (ke ‘and’)
Numeral classifier        tane ‘piece’              tane ‘piece’                no
Additive                  em. . .em ‘and. . .and’   hem. . .hem ‘and. . .and’   em. . .em ‘and. . .and’

In contrast, Ixcatec, in contact with Mexican Spanish, decreases the use of typologically non-congruent strategies, i.e., the specialised focus particle -na, while increasing the use of typologically congruent devices, i.e., prosodic marking through duration and intensity; see Table 25.

Table 25: Focus marking strategies between Ixcatec and Mexican Spanish

Focus marking   Ixcatec                                Mexican Spanish
Prosody         duration, pitch expansion, intensity   duration, L+H*, intensity
Focus particle  yes                                    no
Word order      SV or VS, VP or PV, AV                 VS, PV

As other case studies also seem to indicate (e.g., Bullock 2009; Aikhenvald 2010; Meakins 2011; van Rijswijk and Muntendam 2012), information structure may be a domain which favours contact-induced adjustments rather than replacement of the existing marking means. Bullock (2009), writing on the French of Frenchville spoken in the United States of America, reports the addition of means, i.e., pitch accent and tonal contours, rather than replacement. This contact-induced process results from language shift following century-long bilingualism. In Tariana (Arawak), Aikhenvald (2010) describes a decrease in the use of the subject focus marker, while word order patterns shared with one of the contact languages, Tucano, are increasing. Tariana speakers have been shifting to Tucano since the 1920s and are also under strong influence from Portuguese.


Based on these studies, I tentatively suggest a correlation between the type of contact and the effects at the level of information structure. On the one hand, addition may occur in high-contact settings, such as between Romani, Turkish, and Greek, or between French and American English in Frenchville. On the other hand, changes in frequency may result from shift processes, illustrated by the Ixcatec study, but also by the data from Tariana in contact with Tucano.

Chapter 9

Contact settings

9.1 Background

Much of the literature on language contact has focused on the relations between language-contact outcomes and the type and intensity of language contact at the level of the society, e.g., Wichmann and Wohlgemuth (2008); Wohlgemuth (2009); Tadmor (2009). These studies build on the proposal of a “borrowing scale” in Thomason and Kaufman (1988: 74‒75) that postulates the existence of a hierarchy, ranging from “casual contact” and lexical borrowing, to “very strong cultural pressure” and heavy structural borrowing. The types of contact in this scale, however, are defined in such general terms that it is difficult for a researcher to situate a specific contact setting with respect to this gradient. How do we distinguish a setting with “casual contact” from a setting with “slightly more intense contact” and “more intense contact”, or a setting with “strong cultural pressure” from one with “very strong cultural pressure”? Recently, Trudgill (2008) has suggested some more detailed social and sociolinguistic parameters, such as community size (large vs. small), social network structures (tight vs. loose), and types of contact (low vs. high, long-term contact with child bilingualism vs. short-term contact with adult bilingualism).

To assess contact intensity, this chapter offers a detailed description of the social setting in which language contact has taken place in three of the communities under study: the Balkan Slavic, the Ixcatec, and the Thrace Romani communities. This implies taking a close look at the social, ethnographic, cultural, and economic environment in which the languages are spoken, including a wide range of factors such as population size and changes in living conditions. Following studies that take social networks to be a crucial component of language phenomena (Milroy and Margrain 1980; Milroy 2002), this chapter also presents the types of networks that can be found in the various communities.

9.2 The Balkan Slavic-Greek communities

Data analysis in Chapter 3 and Chapter 7 revealed low rates of borrowings from Greek in the Balkan Slavic corpora combined with extensive pattern replication. In order to understand these results, a brief account of the history and the specific relation of the Slavs to the Greek state is needed (for a detailed account of the history of the Balkans see Mazower 2001).


The Balkans are well known for a long multilingual tradition during the Byzantine and Ottoman Empires, which drew the attention of scholars to the similarities among several of the languages spoken in the area. Since then, the modern Balkan states have developed national education systems based on a monolingual, standardised model. In these settings, many traditional varieties have disappeared or are currently disappearing in favour of the standardised languages, whether these belong to the same family or not.

In the early nineteenth century, Greece was the first kingdom in the Balkans to gain its independence from Ottoman rule, opening the road to significant remodelling of the political map of the area. Recognized as independent by the United Kingdom in 1823, then by the Ottoman Empire, the newly founded Kingdom of Greece was barely half the size of Greece as we know it today. Greece’s armed struggle for independence was supported by Christian Orthodox populations that had been settled in the area for centuries and who shared a sense of common belonging. Alongside speakers of Greek, the most significant population in Attica and the Peloponnese were the Arvanites, an Albanian-speaking population. Soon after the creation of the Kingdom of Greece, Thessaly was integrated in 1881. With these newly integrated territories came the people known as the Aromanians, who speak a Romance language.

Greece had been an independent state for a century when the political development of the Ottoman lands of the geographical area of Macedonia came under focus. At the end of the nineteenth century, the numerous Slavic-speaking communities of Ottoman Macedonia had the choice of supporting two independence movements: the Bulgarian nationalist movement, drawing unity from a shared Slavic language, and the Greek nationalist movement, drawing unity from loyalty to the Greek Church Patriarchate of Constantinople. Nevertheless, as Mazower notes, “a sense that speaking Bulgarian implied belonging to a Bulgarian nation was slow to emerge” (Mazower 2001: 99). A third option eventually developed, namely the Macedonian independence movement (Internal Macedonian Revolutionary organisation, IMRO, founded in 1897), building upon geographical belonging while maintaining a strong link to the Bulgarian movement.

When Greece entered the military-political arena concerning Ottoman Macedonia at the beginning of the twentieth century, parts of the population seemed ready to embrace the Greek nationalist movement, exhausted as they were by the price they had paid, in money and in lives, to the Bulgarian national movement, and confused by its internal divisions. Indeed, letters from the Greek Council in Salonica in this period reveal the reluctance of the local Slavic population to support the Bulgarian armed struggle:


The Central Committee of Sofia has scolded the Bulgarian citizens of the Salonica sandzak for their indifference to the revolutionary movement. (My translation from Greek. Unpublished document, Salonica, June 20, 1905. General Council of Greece in Salonica to the Greek minister of Foreign affairs AAK-B 1905; 391, microfilm 1251-53.)

Education was one of the most significant tools used to ensure the support of the Slavic populations of Ottoman Macedonia: Bulgaria-funded schools competed against Greek schools, funded by the Greek state and the Greek Patriarchate, and grants for higher education in Sofia or in Athens guaranteed the local Slavic elites’ support.¹ Still, one has to keep in mind that education at the time had much less to do with peasants, or even the wealthier cattle-owners, than it did with merchants. National movements in the Balkans were mostly led by those wealthy, Western-inspired elites who also had the political power to steer and guarantee the population’s support.

In the late nineteenth century, religious affiliation became another battlefield in Ottoman Macedonia, opposing the Greek Patriarchate and the Bulgarian Exarchate church, which had become independent (1870‒1872). Years after the recognition of the autonomous Bulgarian Church, the Constantinople Patriarchate was still organising protests in Macedonia, as can be seen in the diplomatic documents of the time:

The people have no great sympathy with the action of the Patriarch as they understand that his efforts are in fact directed against the establishment of the Bulgarian National Church in Macedonia, which they generally accept as a ‘fait accompli’. (Unpublished document, Salonica, November 3, 1890. B.D. F 195-1692 No 86).

During the same period, the people of Macedonia were also facing severe security problems within the Ottoman Empire and were therefore open to political change. Both oral tradition accounts from the village of Liti and written documents show the incapacity of the Ottomans to provide security for their subjects. Alfred Biliotis, the General Council of England in Salonica, reports on this issue:

Gheghs – oppression of the native population of Macedonia
Depuis l’enlèvement de Chevalier en 1899 on a stationné des soldats. . . Mais dans les autres localités non occupées par eux, il y a des gendarmes albanais et ghèghes, environ 130, qui oppriment et dépouillent les habitants sans les protéger efficacement contre les brigands.

1 According to the documents of the General Council of Greece in Salonica, 1893 AAK-A, there was funding for a kindergarten in Liti (1891–1892). In the city of Salonica, a Bulgarian school had been in operation since 1868 (AEK 8528).


[Since the kidnapping of Chevalier in 1899, soldiers have been stationed. . . But, in the localities which are not occupied by them, there are Albanian and Gheg police officers, approximately 130, who oppress and steal from the local peoples without protecting them efficiently against the thieves.] (My translation. Unpublished document, 1901, 1902 B.D F 195-2133, 336-339.)

In 1913, Ottoman Macedonia was finally divided into three: the Southern part joined the Greek state, and the two other parts were divided between Bulgaria and Serbia. The Slavic-speaking populations became integrated into a dominant Greek national project. In contrast to the first wave of Greek national independence, in the early nineteenth century, concurrent national movements based on Slavic languages not only existed but had also successfully provided the basis for building neighbouring national states. Following the division of Ottoman Macedonia, entire villages and families were split along the lines of the national project each supported. Those who actively backed the Bulgarian national project often moved, or were forced to move, to Bulgaria since no pro-Slavic movement was tolerated in the recently integrated Greek territories.

In this sensitive context, the shift to Greek was far more urgent for the Orthodox Greek Slavs than it had been for the Arvanites and the Aromanians in the nineteenth century. This language shift concerned 180,000 “Slavic” speakers and 20,000 “Bulgarian” speakers according to the 1928 census. For the Greek state, it was crucial that the people abandon the local Slavic varieties since competing identities in the Balkans were precisely built upon Slavic languages. The shift to Greek became the goal of State policies, with measures ranging from the development of Greek-speaking kindergartens to the prohibition of public or private use of the Slavic vernaculars (1920‒1930). In the villages, the local pro-Greek elites actively promoted the shift to Greek and, given that all pro-Slavic elites had moved to the neighbouring countries, no pro-Slavic political movement as such came to balance out these policies.

A period followed during which people were increasingly concerned with workers’ rights; for example, there were major strikes in Salonica in 1936, violently repressed by national police forces. The dictatorship of General Metaxas (1936‒1941), which was notoriously harsh for Greek citizens of Slavic descent, was also harsh for workers in general. After World War II, the creation of nation-states was no longer at issue, and the divisions in the Balkans had taken on a strong political twist. The populations were then divided between left-wing pro-communists and right-wing pro-capitalists. In the neighbouring countries, Communists were in power, forming the Republics of Bulgaria and Yugoslavia. In Greece, on the contrary, the Civil War ended with the right wing in power (see among others Baerentzen et al. 1987). Though the Greek Communists were not exclusively of Slavic descent, many Greeks of Slavic descent were sympathizers of the Communist party. During the Civil War, inhabitants of entire Slavic-speaking villages located close to the border with Yugoslavia were advised to move temporarily so as to escape bombing of the area, but as it turned out they were never able to return to Greece (Monova 2002).

The end of World War II was followed by considerable waves of migration to the Greek cities (see Mazower 2000), breaking century-long village traditions, including language transmission. Indeed, although much literature is available on the role that state policies and schools played in the massive shift to Greek, it is important to stress that the loss of the vernacular languages largely coincided with urbanisation. Despite a number of anti-Slavic state policies during the first part of the twentieth century, the radical shift to Greek came after the end of World War II and the Civil War that immediately followed. Although the official census of 1951 mentions 41,017 Slavic speakers, there is strong evidence showing that there were native Slavic speakers born up until the 1940s, and in some areas until the early seventies, indicating that Slavic was transmitted to children by mothers and more decisively by paternal grandmothers living in the same household.

9.2.1 Hrisa

The Balkan Slavic corpus of the 1970s was recorded at the village of Hrisa, in Northern Greece. In 1976, when the recordings took place, Hrisa had 537 inhabitants, of which 373 were ndopçi “locals”, i.e., of Slavic descent, and 164 were Greek-speaking refugees who arrived at the time of the Exchanges between Greece and Turkey in 1923 (Drettas 1981: 146). The description of language usage provided by Drettas shows that despite the official language policy against Slavic that was promoted by the dictatorship (1967‒1974), the Balkan Slavic variety was used in public life. Figure 90, based on Drettas (1981: 148‒150), shows that Balkan Slavic was used in Church, alongside Church Greek, in the village’s traditional coffee-shop, alongside Greek, as well as in the traditional market of the nearby city. Administrative matters at the time were conducted in Katharevousa, a Greek literary form which is very distant from Modern Greek varieties. Katharevousa, however, was mastered by just a few individuals who assisted the community in their administrative tasks. Interestingly, according to Drettas, in in-group conversations the Slavic variety was the dominant language despite the official anti-Slavic policy.


Figure 90: Hrisa, Greece: Language domains in 1976 (based on Drettas 1981)

To illustrate the transmission of the Balkan Slavic variety at Hrisa, Drettas (1981) maps out three households. In Figures 91 and 92, circles indicate female, and triangles male members of a family. The figures also indicate the knowledge of Greek and preference for the use of Balkan Slavic for storytelling, singing, and tradition-related activities. It can be seen that the 75- and 76-year-old females are monolingual in Slavic. Women aged 35‒60 are dominant Slavic speakers with a strong accent in Greek. A male, 60-year-old Slavic speaker is noted with only a slight accent in Greek. Speakers below 30 appear to have a good knowledge of Greek, with minimal Slavic accent. It can also be seen that language transmission is ensured by the grandparents’ generation and that children often demonstrate a preference for the Balkan Slavic variety for tradition-related topics.

Figure 91: Hrisa, Greece: Family of a Slavic speaking household (adapted from Drettas 1981: 153)


Figure 92: Hrisa, Greece: Families of two Slavic speaking households (adapted from Drettas 1981: 153)

9.2.2 Liti

Liti, where the Balkan Slavic Nashta variety has been recorded, was known as Aivati or Aivatovo during the Ottoman period. Liti is located 10 km from the city of Thessaloniki, a distance which according to sources from the nineteenth century required a three-hour trip; see the map in Figure 93. The Tabula Imperii Romani K 3 (1976: 78) attests to the existence of a ‘demos of Liti’ in 117 BC, whose inhabitants appear to be of Thracian origin, partly from the Edones. The presence of Slavic agricultural populations in Liti is noted in a seventh-century Byzantine source, ‘The miracles of Saint Demetrius’. The excerpt below describes the rush of the people of Salonica to the countryside following the end of one of the city’s sieges:


On put alors voir nos concitoyens, [semblables par l’effet de la famine à] des morts et des fugitifs, se rendre avec femmes et enfants aux habitations [sklavènes] des environs de Lite et autres lieux voisins, en ramener du blé et des légumes secs. . .

[We could then see our fellow citizens, [who because of the famine looked] dead and like fugitives, with their women and children, going to the [Slavic] inhabitants around Lite and other neighbouring localities, to get wheat and fruits. . .] (Lemerle 1979: 207. My translation from French).

There is nothing to indicate that Liti’s Slavic population has left the village since, although new populations have probably been added during the centuries that followed.

Figure 93: Map of the area of Thessaloniki indicating the village of Liti (Ajvati), 1903²

2 Published by Artaria and Co. Landkartenhandlung. Reproduced with the kind authorisation of the Hellenic Literary and Historical Archive (code HLHA Archive K No 06-04. E 1777).

At the beginning of the twentieth century, the Hilmi Pasha census (1904) reports 1,395 “Greeks” in the village, a term which can be understood as “Christian Orthodox” in the Ottoman context. Indeed, as Orthodox Christians, the people of Liti were part of the Greek millet, ‘a community based on religious belief’, during the five centuries of the Ottoman era (Mazower 2001: 64). Brankof reports in 1904 the existence of one Greek high-school and college with 3 teachers and 207 students, evidence of the influence of Greece at the time. Together with part of the geographical region of Macedonia, Liti was integrated into the Greek state in 1912‒1913. The integration into the Greek state, which drew on its Christian identity and Greek language, triggered the shift of the Balkan Slavic speakers to Greek, which is now complete. The language shift also coincided with significant socio-economic changes in the community’s structure (for a detailed account of this change in a neighbouring Slavic community see Karakasidou 1997). From a predominantly peasant society, the people of Liti slowly came to access white-collar professions. From a society that had had limited access to formal education, access to schooling in Greek became the norm.

Language shift from Nashta to Greek started with educated male elected representatives. Their choice of language had an impact on relations to at least three degrees, as laid out by the “Three Degrees of Influence Rule” (Christakis and Fowler 2010 and references therein). For instance, according to this rule, the language of the exchanges initiated by such a speaker had an impact on family members (one degree), family members’ friends (two degrees), and the families of the family members’ friends (three degrees). This means that in time, highly connected individuals, such as male merchants from rural communities like Liti, may influence an entire community through centrality in the network, i.e., having many ties to members of a network who also have many ties. They can thus rapidly propagate language shift by a sort of “contagion”. In this language-shift chain, individuals who are peripheral in the network are the least likely to participate in the shift: these may be elderly female speakers, individuals living on the outskirts of the village, individuals with small families and few friends, etc.

In what follows, I compare the social networks of two women who lived in Liti, reconstructed on the basis of oral traditions from a family of merchants and small land and cattle owners: one of the women was born at the end of the nineteenth century, when Liti was part of the Ottoman Empire; the other was born at the beginning of the twentieth century, when Liti was part of the Greek state. In general, note that women were the least exposed to formal education and were the least likely to travel beyond their village for employment reasons. Figure 94 shows on a small scale the social network of the woman born in the late nineteenth century. The figure shows that all of her social contact was with her family and other villagers and that the language used was Nashta, with the exception of limited contact with Church Greek.
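The reasoning about degrees of influence and network position can be made concrete with a small sketch. What follows is only an illustration under stated assumptions, not the procedure used in this study: the toy network, the names of its members, and the helper functions are all hypothetical, and the simple count of ties used here is a rough stand-in for the notion of centrality described in the text (which also takes into account how well connected one's contacts are).

```python
from collections import deque

# Hypothetical toy network (invented for illustration; not data from Liti).
# Each person is listed with the people he or she has a direct tie to.
TIES = {
    "merchant": ["wife", "son", "brother", "client"],
    "wife": ["merchant", "neighbour"],
    "son": ["merchant", "schoolmate"],
    "brother": ["merchant"],
    "client": ["merchant"],
    "neighbour": ["wife", "cousin"],
    "schoolmate": ["son"],
    "cousin": ["neighbour", "elder"],
    "elder": ["cousin"],
}


def degrees_from(network, source):
    """Breadth-first search: number of ties separating each person from source."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for contact in network[person]:
            if contact not in distance:
                distance[contact] = distance[person] + 1
                queue.append(contact)
    return distance


def within_three_degrees(network, source):
    """People reachable from source in one, two, or three ties."""
    return {p for p, d in degrees_from(network, source).items() if 1 <= d <= 3}


def tie_counts(network):
    """Number of direct ties per person (a simplified proxy for centrality)."""
    return {person: len(contacts) for person, contacts in network.items()}


if __name__ == "__main__":
    print(sorted(within_three_degrees(TIES, "merchant")))
    print(tie_counts(TIES))
```

In this invented network, the well-connected "merchant" reaches everyone except the peripheral "elder", who sits four ties away, which mirrors the observation that peripheral individuals are the least likely to take part in the shift.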


Figure 94: A female Nashta speaker’s social network during the late Ottoman period and early times of the Greek state

During the first half of the twentieth century, the formerly monolingual Balkan Slavic community of Liti went through a brief stage of bilingualism with Greek before becoming today’s predominantly Greek monolingual community. The last fluent speakers who were recorded for our study were part of the only intermediate bilingual generation. Figure 95 reconstructs the network of one of these bilingual female speakers in the second half of the twentieth century. Comparison between Figure 94 and Figure 95 shows clearly how the networks were modified during the twentieth century. The change took place mostly within the structure of the surrounding society: increasing intermarriage with non-Slavic-speaking spouses, which in turn increased daily contact with monolingual Greek speakers in families and neighbourhoods (for the neighbouring village of Assiros, Karakasidou 1997 cites an average of 30% of marriages with outsiders for inhabitants born during 1915‒1940). Even though education was mostly limited to primary school, it allowed speakers born in the early twentieth century to master the Greek language. Radio and television came to permeate everyday life in the second half of the twentieth century, and contact with the nearby city of Thessaloniki became more frequent due to the development of public transport and roads.

Figure 95: A female Greek-Nashta speaker’s social network in the second half of the twentieth century (co-workers in rectangles, family in circles, close friends in triangles)


Figure 96: Balkan Slavic Nashta: Language domains at the end of the Ottoman era

Figure 96 shows the domains of language use for a Balkan Slavic Nashta speaker during Ottoman times. It can be seen that adult female speakers had little use for Greek or Turkish, since they had little or no contact with religious authorities, school teachers, administration, and trade. During the twentieth century, following integration of the area into Greece, the domains which had been dedicated to the use of Balkan Slavic were entirely taken over by Greek.


At the same time, it is important to stress that women increasingly acquired access to the language domains previously dominated by Greek and Turkish. As a result of this process, Nashta gave way to Greek. The last speakers of Balkan Slavic Nashta who were recorded for the present study use the language very rarely, their everyday language being Greek. Despite the widespread loss of Nashta, women have maintained some expressions in the affective domain: endearments, insults, and irony are often expressed in Nashta, even among the last hearers. Speakers born between 1940 and 1950 are weak semi-speakers, having a lot of difficulty in using Nashta but being able to understand it. The generations born after 1950 are either the last hearers or have had no contact with the language at all.

Owing to the lack of official state policy with regard to the existence of Slavic-speaking communities in Greece, the complete absence of the topic in the school curriculum, and the highly sensitive political issue that the topic represents even at present, the last speakers have difficulty explaining the presence of the ancestral language in their village and families. A widespread scenario which attempts to account for the use of the language is related to the existence of traditional hostels in the village which required the knowledge of several “foreign” languages. Indeed, the last speakers consider Nashta to be a “bastard language”, made up of words from Bulgarian, Greek, and Turkish.

9.3 The Ixcatec-Spanish community

Ixcatec is today one of the most endangered languages of Mexico, spoken by fewer than ten speakers. This situation is the result of a centuries-long shift to Spanish, beginning with Spanish colonization in 1519. As early as 1535, when the newly acquired territories became part of New Spain, Spanish was established as the language of relations with the new administration. Alongside Spanish, Nahuatl was successful as a language of communication with outsiders and replaced some of the native languages in parts of Mexico. Nevertheless, the most decisive moment in the shift to Spanish took place during the Porfiriato (1886‒1911) and during the post-revolutionary period, when educational policies promoting Spanish monolingualism prevailed. Today, Spanish is the only official language in Mexico and is the language of 98.5% of the country’s population. The 365 indigenous languages of Mexico have various degrees of vitality but most speakers are bilingual with Spanish. Amerindian monolinguals are those who received little formal schooling, mainly elders and women. Bilingual education aiming at language maintenance has been promoted in the past years but in practice it has actually served to facilitate the acquisition of Spanish.


The use of Ixcatec is restricted to the municipality of Santa María Ixcatlán in the State of Oaxaca; see the map in Figure 97. There is little information on the village of Santa María Ixcatlán in remote historical periods and archaeological data is missing (see Hironymous 2007: 33‒57). There seems to be very little evidence of pre-Columbian occupation in the immediate vicinity of the village. The territory stretching between Santa María Ixcatlán and Santa María Tecomacava does, however, show many remnants of pre-Columbian occupation. A town called La Muralla, away from the current location of the village, seems to have been an important pre-Hispanic centre, perhaps the ancient capital of the Señorío of Ixcatlán.

Figure 97: Map of Santa María Ixcatlán, Mexico

The Descripción de Ixcatlán by Velásquez de Lara, dating back to 1579 and part of the Relaciones geográficas series requested by the Crown of Spain, is a rich source of information on the pre-colonial period. A large part of the Mixteca, including the city of Coixtlahuaca, came under Aztec domination from the middle of the fifteenth century. It is assumed that Ixcatlán belonged to the province of Coixtlahuaca, thus paying tribute to the Aztec empire. While Ixcatlán had long been an independent Señorío, it is not known precisely if it had already lost that status before the Aztec Emperor Moctezuma II (reign 1502‒1520) came to dominate the region. The representatives of the communities of Coixtlahuaca surrendered to Cortez and his army in 1520. As early as 1522, Ixcatlán was given in encomienda as a reward to Rodrigo de Segura and Garcia Velez, two European soldiers from Cortez’s campaigns (Hironymous 2007: 58).

Today the village of Ixcatlán has some 400 inhabitants but at the time of the Spaniards’ arrival in 1522 it was an important centre within the Mixteca zone with an estimated population of 10,000 (Hironymous 2007: 8). The region’s wealth was probably due to trade, given that the territory is strategically located between the mountainous area of the Mixteca Alta and the fertile valley of the Cuicatlán Cañada, along the very active trade route linking Mexico City to the rich territories of Chiapas and present-day Guatemala. A sharp drop in numbers following the Spanish conquest was caused by forced labour in the mines and new diseases; in 1579, there was an estimated population of 1,200 Ixcatecs (Hironymous 2007: 8). At the same time, the flourishing economy of Ixcatlán narrowed to the sole exploitation of the scarce agricultural and forest land resources and resulted in a constantly declining population.

The municipality of Santa María Ixcatlán spreads over a territory of approximately 200 km² in the Mixteca Alta. The village is located at an altitude of 2,400 m above sea level, between several mountains. The houses are spread out on a grid that consists of three main roads and smaller intersecting roads. The centre of the village is organised around the church, the market, and the town hall. The village is bordered on the western side by a river. The mountains that surround the village are highly regarded by the local population since they provide important natural resources, such as palm, honey, wood, and a habitat for cattle. Today the inhabitants of Santa María Ixcatlán practice small-scale farming and cattle breeding. Women pursue the pre-Columbian tradition of palm weaving.
Ixcatec trade is limited to proximity exchanges: sale or exchange of palm, palm hats, and wood, as well as purchase or exchange of fruits and vegetables and manufactured goods. There are notable diﬀerences between the wealthier, trade-related families and families who strictly rely on resources from smallscale farming. Unsurprisingly, the last Ixcatec speakers come from the poorest families, brought up in households located at the periphery of the village. Santa María Ixcatlán is governed by the indigenous customary law and legal practices, the so-called Usos y costumbres ‘traditions and customs’, formalized by an amendment to the Constitution of the State of Oaxaca in 1990 and its Code of election procedures (1993). This implies that the City Council (ayuntamiento) is not elected by popular vote on political party lists, but is made up of members appointed by the community, namely by elders and the asamblea ‘assembly’. The village is organised along social, political, and religious lines commonly encountered in the indigenous communities of Mesoamerica, known as comunalidad (Maldonado 2002) and referring to a system of governance

based on an indigenous form of collectivism. Such social organisation has probably been carried forward for centuries at the least, in the form still seen today. In Santa María Ixcatlán this organisation is characterized by collective ownership of land; decision-making in the village’s asamblea ‘assembly’; community work in the form of tequios; obligatory civil work in the form of cargos; and active participation in the organisation of a number of civil and religious events. Women generally take charge of civil work involving school, health, and care for the elderly but they are not given any central role in high-ranked governance structures. For example, the asamblea ‘assembly’ is made up only of male heads of families and the presidencia ‘municipality’ is always held by a man. Despite Santa María Ixcatlán’s central position in trade before the Spanish conquest, in the centuries that followed and until today, contact with outsiders has happened only on occasion. Santa María Ixcatlán is a well-known religious centre attracting pilgrims from the neighbouring villages for the most important religious holidays. At such times the residents of Santa María Ixcatlán trade with the pilgrims. Commercial transactions also take place with itinerant merchants who sometimes visit the village as well as during visits to the nearby cities. The village’s inhabitants only have regular, everyday contacts with outside members of the community such as teachers, doctors, and priests, and passive contact with Spanish through television. Contacts with the authorities are more frequent for men who serve in the City Council, a charge which today entails occasional visits to the city of Oaxaca. To conclude, for centuries contact intensity with Spanish can be said to have been low, as it did not happen on a daily basis and only concerned a small proportion of community members. 
Despite low contact intensity, a rapid process of shift to Spanish took place during the twentieth century and is now complete. Such rapid shift did not take place in all the indigenous communities of Mexico although they faced similar pressure from the national education system and socioeconomic parameters. But, in small communities like Santa María Ixcatlán, with strong social cohesion, once the language shift process has been launched it may take effect within two to three generations.

9.4 The Thrace Romani-Turkish-Greek community

In Chapter 3, the Romani corpora from Greek Thrace and Finland were shown to contain high rates of contact words from their respective contact languages, Turkish and Finnish. Also, the insertion of morphologically-intact elements from the contact languages which characterizes these corpora, as presented in

Chapter 5, is rare in codeswitching and incompatible with most definitions of borrowing (Myers-Scotton 2002; Poplack and Dion 2012). Nonetheless, as noted in Adamou and Granqvist (2014), it is a process typical of so-called “mixed languages”:

Mixed languages are the result of the fusion of two identifiable source languages, normally in situations of community bilingualism (Meakins 2013: 159).

In past decades tremendous progress has been made in the understanding of mixed languages: some mixed languages draw their grammar from one language and lexicon from another, like Ma’á and Angloromani (Bakker and Mous 1994; Matras 2010), and are known as “Grammar-Lexicon” or “G-L mixed languages” (Meakins 2013); other mixed languages show compartmentalisation of elements of the two languages, known as “Verb-Noun” or “V-N mixed languages” (Meakins 2013), like Michif (Bakker 1997a), Light Warlpiri (O’Shannessy 2013), and Gurindji Kriol (Meakins 2012). Most of the above-mentioned languages have not been analysed in quantitative terms in order to establish the contribution of each language in the “mixture”. However, a quantitative analysis of Gurindji Kriol, a V-N language, shows that a single dominant language cannot be identified (McConvell and Meakins 2005). Unlike Gurindji Kriol, the Romani datasets under study have a clearly dominant Romani component and resemble codeswitching in the early formation stages of mixed languages (O’Shannessy 2012, 2013; McConvell and Meakins 2005; Meakins 2012). In an earlier work (Adamou 2010), Thrace Romani was classified as an example of a “fused lect”, defined as a form of “stabilized codeswitching”:

While LM (Language Mixing) by definition allows variation (languages may be juxtaposed, but they need not be), the use of one “language” or the other for certain constituents is obligatory in FLs [fused lects]; it is part of their grammar, and speakers have no choice. (Auer 1998: 15).

What all studies on mixed languages point to is the massive use of general language-contact strategies in a creative way that only seems to arise in specific sociolinguistic settings. A correlation between mixed languages and sociohistorical criteria has been suggested by Bakker (1994: 24), who distinguishes, on the one hand, languages created in mixed households, like Michif, and on the other hand, languages created by former nomadic groups in a process of language shift, like Angloromani. Thomason (1995) also relates socio-historical characteristics of speech communities to linguistic processes of language mixing, with languages created in existing groups based on grammar replacement

(the G-L mixed languages), and languages created in newly-formed groups, showing compartmentalisation (the V-N mixed languages). Romani mixing, as encountered in Greek Thrace, also developed in settings similar to those described in the literature on mixed languages, i.e., in highly bilingual communities, under conflicting processes of language shift and language maintenance (see Matras 2009; O’Shannessy 2012; Meakins 2013). Thrace is a region with a century-long presence of several multilingual communities coexisting with monolingual Greek-speaking communities. This is the case of the Turkish-Greek speaking community (approx. 55,000 speakers), the Pomak-Turkish-Greek speaking community (approx. 36,000), and the small Armenian-Greek (and partly Turkish) speaking community (numbers from several sources cited in Kostopoulos 2009: 290–291). Even though Roma from the Muslim communities of Greek Thrace have mixed origins, most speakers use a Romani Vlax dialect which is geographically centred in today’s Romania (Matras 2005). They are among the Romani groups in the Balkans who term themselves xoraxane roma ‘Muslim, Turkish Rom’, as opposed to the dasikane roma which is the name for the ‘Christian Rom, Greek Rom’ in the area. Muslim Roma have Turkish first and last names, and women usually wear sarouels typical of the Muslim population in the area and reminiscent of Ottoman-era clothing. Adults and children are typically trilingual in Romani, Turkish, and Greek with differing degrees of competence in the three languages. Turkish and Greek are the languages of trade and other professional activities, but they are also used at home and in the community alongside Romani. Most of the older Roma have received no formal education but the younger generations receive primary-school education in Greek and Turkish: Turkish being the language of education for the local Muslim population under the Lausanne Treaty (July 1923), alongside Greek, which is the state language. 
Figure 98 illustrates the domains of interaction of a female Romani speaker of Greek Thrace. The interaction languages are marked on the lines, illustrating the complex trilingual network involving Romani, Greek, and Turkish. Despite a lack of official data one may say that the Roma of the communities under study generally work in trade; they also frequently work as seasonal agricultural labourers and occasionally as cleaning staff for domestic or city services. They form part of a tightly-knit community which has frequent contact with outsiders. The network of a female Romani speaker, aged 34, and living in the city

Figure 98: A Romani speaker’s interactions in various language domains

of Komotini, is illustrated in the diagram in Figure 99. As shown, the Romani-speaking community has high transitivity. Also, all members of the Romani-speaking network have everyday interactions for work purposes with individuals having different interaction languages. Even though these contacts are casual and ephemeral, they affect the Romani-speaking network as a whole.
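
To make the notion of transitivity concrete, the following minimal Python sketch computes, for one speaker, the share of pairs of her contacts who are themselves in contact with one another. It is an illustration only: the network, the labels, and the function name `local_transitivity` are hypothetical and are not taken from the data behind Figure 99.

```python
from itertools import combinations


def local_transitivity(network, person):
    """Share of pairs of `person`'s contacts who are also in contact with each other.

    `network` maps each individual to the set of their contacts (undirected ties).
    """
    contacts = sorted(network[person])
    pairs = list(combinations(contacts, 2))
    if not pairs:
        return 0.0  # fewer than two contacts: no pair to evaluate
    linked = sum(1 for a, b in pairs if b in network[a])
    return linked / len(pairs)


# Hypothetical toy network (NOT the network shown in Figure 99).
NETWORK = {
    "ego": {"sister", "mother", "friend", "coworker"},
    "sister": {"ego", "mother", "friend"},
    "mother": {"ego", "sister", "friend"},
    "friend": {"ego", "sister", "mother"},
    "coworker": {"ego"},
}

if __name__ == "__main__":
    for individual in NETWORK:
        print(individual, local_transitivity(NETWORK, individual))
```

In this toy network the in-group ties (family and close friend) all connect to one another, whereas the work contact is tied to the speaker alone, which lowers her score without affecting that of her relatives.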

Figure 99: A Romani female speaker’s social network (co-workers in rectangles, family in circles, close friends in triangles)

Alongside the fact that this is a tightly-knit community with frequent contact with speakers of Turkish, language shift is among the factors that should be examined in order to better understand how the Romani language mixing came about. Myers-Scotton (1998) suggests that a Matrix Language Turnover is responsible for the mixing of Thrace Romani with Turkish:

That is, the explanation would be that Romani speakers were in the process of shifting to Turkish, but that this was a Matrix Language Turnover that was arrested. For sociopsychological reasons, the shift stopped. (Myers-Scotton 2013: 40).

The arrested Matrix Language Turnover analysis suggested by Myers-Scotton is plausible for the Turkish-speaking communities of Greek Thrace, but probably not due to the contemporary shift process. I suggest that Thrace Romani-Turkish language mixing started during the Ottoman Empire and was transmitted as such to the younger generations. This hypothesis is supported by the fact that Turkish verbs with Turkish morphology are still used in other Romani communities even when contact with Turkish has been lost for more than three generations, as in the neighbourhood of Ajia Varvara in Athens (Igla 1996). The contemporary use of Turkish verbs with Turkish morphology in these communities indicates that the language-mixing process is not due to the contemporary contact setting. In order to account for the contemporary Romani-Turkish mixing I propose the following scenario: Vlax Romani speakers who are currently settled in Greek Thrace were most likely itinerant craftsmen, at least in the late Ottoman times (Adamou 2010; Adamou and Granqvist 2014). Turkish was the major trade language at the time, and children, women, and men were all involved in trade. Such generalized bilingualism could have given rise to intensive codeswitching within the community and eventually a shift to Turkish could have started. The shift to Turkish could then have been halted with the change in the sociolinguistic setting related to the end of the Ottoman Empire when, in 1923, part of Thrace was integrated into the Greek state. The change in the political boundaries that resulted from the formation of the Modern Greek state certainly had an impact on the mobility of the Ottoman Roma, who became semi-sedentary and adjusted their working activities to the new borders. In the new setting, Modern Greek was added to their linguistic abilities while Turkish became a minority language although it remained the trade language in Thrace. At present, even though Turkish is the dominant language in Thrace Romani communities, one observes that Greek is becoming an increasingly influential contact language. The following excerpt illustrates the positive attitudes toward Greek within the Thrace Romani community. 
The conversation involves a 30-year-old man who frequently uses Greek in in-group conversations. When asked to explain the reasons for this usage, he replies as follows: (66)

Thrace Romani < Romani (in plain), codeswitching to Greek (in angle brackets < >), pauses /

Male speaker:

ame sam roma / ‘We are Roma. ’

Female speaker 1:

soske orbisares dasikane ‘Why do you speak Greek?’

Male speaker:

e dasikane ‘Greek? ’

Female speaker 1:

‘’

Female speaker 2:

‘’ [laughs]

Female speaker 1:

‘’

Male speaker:

/ aj naʃlom ‘ OK, I’m gone.’

Female speaker 1:

naʃlan / ‘Go. ’

9.5 Discussion

The corpus analysis in the previous chapters has allowed us to classify the contact outcomes in a detailed manner, with respect to borrowing, codeswitching, and pattern replication phenomena. Based on all the above-mentioned studies that relate the contact phenomena with the type of contact, the prediction is that corpora with few borrowings and limited pattern replication are more probably produced in settings with little contact, and corpora with extensive borrowing and pattern replication are produced in settings with extensive, everyday language contact. First, in all the cases that were examined, the high rate of contact words in the corpora does not reflect the speakers’ skill in the contact language, e.g., Greek, Spanish, or Turkish. On the contrary, the last fluent speakers of an endangered language are often more fluent in the language towards which they have shifted in their everyday life than in their ancestral language, e.g., Balkan Slavic, Ixcatec, or Romani. Also, as shown in this chapter, all the communities under study form in-group social networks with high transitivity, which means that most of a person’s contacts are in contact with one another. This is often the case for minority-language speaking communities and more true of rural communities showing such “dense” and “multiplex” networks (Milroy and Margrain 1980). In the following sections I examine several non-linguistic factors that may explain the differences in the datasets with respect to the more or less extensive use of contact words and pattern replication: the vitality of a bilingual community, prescriptive attitudes, institutional support for a standard variety, and the patterns inherited from past contact settings. These factors are explored for

the Balkan Slavic, Ixcatec, and Thrace Romani communities presented in detail in Chapter 9, as well as for the Molise Slavic, Finnish Romani, Burgenland Croatian, and Colloquial Upper Sorbian communities which have been briefly touched upon in this book.

9.5.1. An active bilingual community

The word-count method applied in Chapter 3 revealed similar rates of contact words in the critically-endangered Balkan Slavic Nashta and Ixcatec corpora. In both cases we know that this result is not due to the linguistic competence of the speakers in the current-contact language since both Nashta and Ixcatec speakers are using Greek and Spanish respectively in their everyday life. However, it could be said that the low rates of contact words in the Ixcatec and Balkan Slavic corpora stem from the fact that they were both recorded with speakers who do not use their languages in their everyday life and therefore have no opportunity to mix the two languages. As shown in Chapter 3, Molise Slavic speakers produced high rates of borrowings from Italian. Molise Slavic is nowadays spoken by approximately 1,000 inhabitants with very limited child bilingualism (Breu 2011). It is spoken in three contiguous villages, with differing degrees of language use: Acquaviva Collecroce, which is the traditional centre of the area; Montemitro, which is a smaller and more conservative community; and San Felice del Molise, which only has a few speakers of Molise Slavic. Interestingly, despite the differences in everyday use of Molise Slavic over the three communities, the word-count shows that speakers from all three villages use similar proportions of Italian borrowings (see Chapter 3). This finding suggests that the rate of borrowings is not a direct result of the vitality of the bilingual community at the moment of the recordings. The vitality of the bilingual community also fails to account for the small proportion of borrowings in some active bilingual communities, such as the Colloquial Upper Sorbian- and Burgenland Croatian-speaking communities, which are in daily contact with German. 
Indeed, Colloquial Upper Sorbian, which is used in everyday life by speakers of all ages, shows very few borrowings from German although the influence of German is significant through pattern replication. To conclude, the degree of vitality of a bilingual community appears to be an insufficient explanatory parameter of the rates of contact words encountered in a bilingual corpus.

9.5.2. Prescriptive attitudes and institutional support

Another way to explain the low rates of contact words in the bilingual corpora of endangered languages would be to claim that the last fluent speakers of an endangered language are consciously avoiding the use of contact words in order to provide a purer form of their ancestral language. For example, the Ixcatec speakers were recruited for the Ixcatec language documentation programme and it sounds reasonable to expect them to use a minimum of Spanish words. However, comparison with corpora from other endangered languages, such as Finnish Romani, shows that avoidance of contact words or codeswitching is not necessarily achieved or maybe not even aimed at. Strong prescriptive attitudes in a given community, linked to the presence of an influential standard language, may influence the outcomes of language contact. Indeed, in order to understand the low rates of German words in Colloquial Upper Sorbian despite the vitality of the bilingual community, one has to take into consideration the great influence of Standard Upper Sorbian, which is used in formal settings, i.e., school, church, and media (Scholze 2008). The Burgenland Croatian-German corpus is also interesting in that it shows low proportions of borrowings while being recorded in an active bilingual community. But, unlike Standard Upper Sorbian, the Burgenland Croatian Standard is considered to have a limited impact on the community even though it is used at school, church, and in the media (Szucsich 2000). In contrast, Romani communities living in Greek Thrace and in Finland have little or no pressure from a standard Romani variety, despite the fact that Finland offers institutional support for the Finnish Romani language. The lack of a standard Romani variety could therefore be related to the use of atypical strategies of language mixing that characterize the Romani datasets examined in this book. 
Similarly, the Molise Slavic corpora can be understood as the result of minimal normative pressure from the recently developed standard variety (Adamou et al. 2015).

9.5.3. Past contact settings

The vitality of a bilingual community in the past and past prescriptive attitudes are probably the best indicators of the language contact outcomes observed in the endangered languages that we examined in this book. Indeed, the two Romani communities living in Finland and in Greece appear to be transmitting a mixed language developed in the late nineteenth century. At that time, the communities were itinerant traders and most likely valued their

bilingual identity as they do nowadays (Adamou and Granqvist 2014). Thus, the language-mixing patterns set in the past help explain the similarities in the contact outcomes in two communities with different degrees of language endangerment: Finnish Romani, which is severely endangered (C), and Thrace Romani, which is an unstable language (A-) still transmitted to children in some families and still active in everyday life. Despite these differences, the language mixing patterns are similar, notwithstanding the fact that Finnish Romani speakers codeswitch more than Thrace Romani speakers do (Adamou and Granqvist 2014). Molise Slavic speakers also had extensive contact with Italian speakers with whom they worked and intermarried. Moreover, there were no prescriptive attitudes or literary tradition to influence the Molise Slavic communities prior to the twentieth century, when the introduction of a Slavic standard variety and institutional support for bilingual education certainly altered the status balance (Breu p.c.). The low rate of Greek words in Balkan Slavic Nashta also seems to reflect the borrowing pattern that was established when the language was still actively spoken, as the 1970s corpus indicates (Adamou et al. in press 2016). The small number of Greek words in the Balkan Slavic corpus of the 1970s indicates that contact with Greek in the Ottoman times was likely to have been restricted to specific geographic areas and people related to the Greek Orthodox Church and perhaps trade. Past-contact patterns may of course be modified through a change in the general sociolinguistic conditions. For example, the fluent speakers in the Nashta corpus and two of the speakers in the Hrisa corpus are part of the same generation, i.e., born between 1916 and 1936. It is possible that if the speakers who were recorded in the 1970s were recorded nowadays, they would show an increase in the number of Greek tokens in their speech similar to that observed in the Nashta corpus. 
It is equally possible that if the speakers who were recorded in the first decade of the twenty-first century were recorded in the 1970s, they would have used fewer words from Greek, much like the Hrisa speakers. This analysis is supported by the sociolinguistic context, knowing that Northern Greece became part of the Greek state in 1912‒1913 and that the increase in contact with native Greek speakers, through inter-marriages in the community or everyday contact facilitated by the modernization of transport, as well as exposure to Greek radio and TV, became more and more important through the years (Adamou et al. in press 2016). The minimal presence of German borrowings and codeswitching in Colloquial Upper Sorbian is probably the result of a tradition led by prescriptive Sorbian intellectuals which started in the sixteenth century (Faska 1998) and

which is still powerful today (Adamou et al. 2015). German influence is however to be seen in the extensive pattern replication that characterizes Colloquial Upper Sorbian. Finally, it must be stressed that Burgenland Croatian has a rich literary tradition, dating back to the sixteenth century, even though the Standard Burgenland Croatian language has little influence in the contemporary setting. The influence of this centuries-long literary tradition could probably account for the low rates of words from German in the contemporary corpus (Adamou et al. 2015).

Chapter 10

Concluding remarks

In this book I have presented some of the explorations of language contact phenomena made possible by the quantitative analysis of spoken corpora in endangered languages. It is uncontroversial that lesser-known languages bring new evidence to light, thereby challenging current assumptions, e.g., the Romani mixing of contact verbs with no integration in the native morphology (Adamou and Granqvist 2014), mixed languages created from codeswitching (Meakins 2013), or morphological borrowing with little lexical borrowing (Seifart 2012). This book argues that the evidence from these languages becomes more solid when it is based on quantitative data as is the case for the above-mentioned studies. For quantitative data to be collected from speakers of endangered languages, spontaneous, unscripted speech seems to be the optimal method. Combination with experimental methods, provided they are culturally adapted, is not in any way to be excluded as it enables the researchers to explore phenomena which may be under-represented in spontaneous speech. In sum, what does the quantitative analysis of spontaneous, spoken corpora bring to the study of language contact?
1) First, it serves to identify the numerically-dominant language of a bilingual or multilingual corpus. It thus contributes to the debate of whether in bilingual speech a “dominant”, “base”, or “matrix” language can be identified or not (see Myers-Scotton 1993a; Gardner-Chloros 2009).
2) The word-count method of a bilingual or multilingual corpus allows for cross-linguistic comparison based on language usage. Similar to WOLD – which is however based on word-lists – the word-count method reveals “types” of languages with respect to the rates of borrowings. 
As discussed in Chapter 3, the word-count reveals two main types of bilingual communities: those with an almost full separation of the languages in contact, targeting a monolingual type of speech (less than 5% contact words), and those which tend to combine the languages in contact in an intensive and systematic manner (20‒35% contact words).
3) The quantitative analysis of free-speech corpora makes it possible to assess preference with respect to codeswitching or borrowing in a bilingual corpus through systematic study of a variety of criteria such as degree of composition, degree of integration, regularity, and frequency of the contact words. This reveals some “fuzziness” (Gardner-Chloros 2009: 167), but also the existence of consistent patterns of language mixing in the bilingual communities.

4) The quantitative method enables the comparison of the productions of several individual speakers, revealing patterns of borrowing or codeswitching at the level of the community while also helping to identify the outliers. Both social factors, such as age, sex, or location, and linguistic constraints, such as priming, genre, or construction-related variation, may explain these tendencies (Travis and Torres Cacoullos, in press).
5) The quantitative analysis of free-speech corpora also serves to calculate the proportion of contact words with respect to word classes, confirming the privileged access to language contact for borrowed nouns.
6) The quantitative analysis equally allows for a number of studies at the level of pattern replication, such as for word order, prosodic and phonetic phenomena, articles, complementizers, and other frequently used morphemes. Given the limited size of the datasets, linguistic phenomena that occur more rarely in spontaneous speech require additional tasks.
7) Finally, the corpus-driven approach makes it possible to relate contact phenomena to non-linguistic factors in a precise and quantifiable manner. This reveals the importance not only of present-day contact conditions but also of the continuation of patterns established in past contact settings.
In the following sections, I summarize some of the findings of the corpora analysed in this book and discuss them in relation to other available corpora.

10.1 A scale of language mixing

By comparing rates of contact words in the multilingual corpora under study, I tentatively propose a scale of language “mixing”, a term which is used here in a broad sense to refer to the combination of elements drawn from two or more languages which are actively spoken in a given community as observed in the spontaneous speech of individual speakers. This scale distinguishes on the one hand between bilingual corpora with less than 5% contact words from the current-contact language(s), and on the other hand corpora with more than 20% but less than 35% contact words. Of the former type I examined corpora from Ixcatec in contact with Spanish, Balkan Slavic in contact with Greek, Burgenland Croatian in contact with German, and Colloquial Upper Sorbian also in contact with German. Of the latter type are the corpora from Thrace Romani in contact with Turkish and Greek, Finnish Romani in contact with Finnish, and Molise Slavic in contact with Italian. This scale of language mixing can serve as a basis for cross-linguistic comparison and is meant to evolve as new corpora are added to the discussion. Researchers interested in language contact could easily classify a bilingual corpus

based on the rates of contact words and discuss both the corpus specificities and the relevance of the scale of language mixing. To illustrate this possibility I will now discuss briefly several Afro-Asiatic corpora which are fully annotated for borrowing and codeswitching and which are available online.¹ A word-count search for the Afro-Asiatic corpora confirms the relevance of the type of corpora with 0‒5% contact words. This type includes the following three corpora from minority languages in contact with majority languages. The Beja corpus, from an Afro-Asiatic language of Sudan in contact with Sudanese spoken Arabic, shows roughly 26 borrowings and 8 codeswitching insertions out of a total of 5,890 word-tokens, representing less than 1% of the total words (Vanhove 2012). The Gawwada corpus, a language of Ethiopia represented by 2,394 words in total, shows only 12 codeswitching insertions from Amharic, or less than 1% contact words (Tosco 2012). The Zaar corpus of Nigeria, a language spoken at home and in the market, shows relatively more borrowings and codeswitching insertions: from 10,631 words in total, 292 borrowings from English and Hausa were annotated, or 3%, and 292 codeswitching insertions from English and Hausa, also 3% of the total (Caron 2012b). Corpora from official languages (A+) in contact with other major communication languages may also be compared to minority language corpora in order to test the validity of the two types of language mixing discussed in this book. Afro-Asiatic corpora from majority languages in contact with other widely spoken languages are represented by Hausa, Hebrew, and spoken Moroccan Arabic. The Hausa corpus, from Nigeria, with 11,981 words in total shows only 78 codeswitching insertions from Arabic and English, or 1% of the total (Caron 2012a). The Hebrew corpus, containing 7,537 words, also contains very few borrowings, 20 in total or less than 1% (Yatziv 2012). 
In contrast, from a total of 12,430 words, the spoken Moroccan Arabic corpus shows 104 borrowings and, more importantly, 924 codeswitching insertions from French and English, i.e., 7% contact words in total (Barontini 2012; Caubet 2012; Vicente 2012). The Moroccan Arabic corpus is thus different from the two types of corpora discussed in this book in two ways: in terms of the rate, which lies in between the two main types on the scale used here, and in that the Moroccan Arabic corpus is mainly constituted of codeswitching insertions. It is interesting to note that despite the relatively high rates of contact elements, the proportions do not reach those found in the corpora with 20‒35% contact words.

¹ See http://corpafroas.huma-num.fr/Archives/ListeFichiersELAN.php
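
The word-count classification applied to these corpora can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not the procedure used to produce the published figures: the class and function names are hypothetical, the token counts are those quoted above for four of the corpora, borrowings are taken as zero where only codeswitching insertions are reported, and the thresholds follow the two types proposed in this book (under 5% and 20‒35%). Because the rates are recomputed from the counts, they need not coincide with the rounded percentages given in the text.

```python
from dataclasses import dataclass


@dataclass
class Corpus:
    """Token counts for one annotated corpus (hypothetical container)."""
    name: str
    total_words: int
    borrowings: int
    codeswitches: int

    @property
    def contact_rate(self) -> float:
        """Contact words (borrowings + codeswitching insertions) as a % of all tokens."""
        return 100 * (self.borrowings + self.codeswitches) / self.total_words


def mixing_type(rate: float) -> str:
    """Place a contact-word rate on the two-type scale discussed in the text."""
    if rate < 5:
        return "low"           # under 5% contact words
    if 20 <= rate <= 35:
        return "high"          # 20-35% contact words
    return "intermediate"      # falls outside the two main types


# Counts as quoted in the text; a borrowing count of 0 is an assumption made
# where the text mentions only codeswitching insertions.
CORPORA = [
    Corpus("Beja", 5890, 26, 8),
    Corpus("Gawwada", 2394, 0, 12),
    Corpus("Hausa", 11981, 0, 78),
    Corpus("Moroccan Arabic", 12430, 104, 924),
]

if __name__ == "__main__":
    for corpus in CORPORA:
        rate = corpus.contact_rate
        print(f"{corpus.name}: {rate:.2f}% -> {mixing_type(rate)}")
```

A new corpus can be placed on the scale by adding one `Corpus` entry; keeping borrowings and codeswitching insertions as separate counts also makes it possible to see which of the two drives the overall rate.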

Other examples of bilingual corpora come from three languages in contact with Spanish in Latin America (Gomez Rendon 2008). One of these languages is not endangered, namely Guaraní of Paraguay (A+), while the two others may be considered unstable, Quichua from Ecuador (A–), and Otomí from Mexico (A–). The three corpora show an overall rate of 14‒19% Spanish tokens that the author classifies as “borrowings”; the codeswitching insertions, which are treated separately, should be added to this figure (Gomez Rendon 2008). Table 26 locates various corpora on a scale of language mixing, showing that there is a continuum in the number of contact words across the languages of the world.

Table 26: Type of bilingual speech with respect to the overall rates of tokens from the languages in contact

contact words 2,000 restricted loose long-term

Burgenland Croatian

low C, D – +/– – –