Language, Corpora and Cognition 9783631663363, 9783653056488, 9783631707098, 9783631707104

English Pages [298] Year 2017

Table of contents :
Cover
Table of Contents
Introduction (Piotr Pęzik / Jacek Waliński)
Gradience in cognitive scanning: participle modifiers in Polish and English (Barbara Lewandowska-Tomaszczyk)
Experimental applications of dependency-based phraseology extraction (Piotr Pęzik)
Computational distributional semantics and free associations: a comparison of two word-similarity models in a study of synonyms and lexical variants (Marcin Tatjewski / Mirosław Bańko / Adrianna Kucińska / Joanna Rączaszek-Leonardi)
Grammars or corpora? Who should we trust? Empirical analysis of morphological doubletism in Croatian (Dario Lečić)
Figurative dimensions of health: a corpus-illustrated study (Adamina Korwin-Szymanowska / Jacek Tadeusz Waliński)
“Justice with an attitude?” – towards a corpus-based description of evaluative phraseology in judicial discourse (Stanisław Goźdź-Roszkowski)
Using time to express remoteness in space: A corpus-based study of distance representations for motion medium in the National Corpus of Polish (Jacek Tadeusz Waliński)
Avenues for Research on Informal Spoken Czech Based on Available Corpora (Petra Klimešová / Zuzana Komrsková / Marie Kopřivová / David Lukeš)
Introducing a corpus of non-native Czech with automatic annotation (Alexandr Rosen)
Corpus-based Analysis of Czech Units Expressing Mental States and Their Polish Equivalents: Identification of Meaning and Establishing Polish Equivalents Referring to Different Theories (Elżbieta Kaczmarska)
Problem solving in English and Polish: A cognitive corpus-based study of selected metaphorical conceptualizations (Marcin Trojszczak)
Corpus Linguistics for Critical Discourse Analysis. What can we do better? (Victoria Kamasa)
Towards quantitative and qualitative characterisation of various types of dialogue: interviews vs. Panel Discussions (Dorota Pierścińska)
Standardisation in safety data sheets? A corpus-assisted study into the problems of translating safety documents (Aleksandra Beata Makowska)
Lexical bundles in English medical texts (Monika Betyna)

Author / Uploaded
Piotr Pęzik
Jacek Tadeusz Waliński

ŁÓDŹ

STUDIES IN LANGUAGE Edited by Barbara Lewandowska-Tomaszczyk and Łukasz Bogucki

51 Piotr Pęzik / Jacek Tadeusz Waliński (eds.)

Language, Corpora and Cognition

This book focuses on matching theoretical predictions about language and cognition against empirical language data. The contributions use corpus linguistics methodology for their analysis. The contributors address a variety of themes, combining syntax, semantics, discourse, terminology and cognitive linguistics with the techniques and quantitative methods of linguistic data processing.

Piotr Pęzik is an Assistant Professor of Linguistics at the Institute of English Studies, University of Lodz. He has authored publications in the areas of corpus and computational linguistics, information extraction and information retrieval. His areas of interest also include corpus-based phraseology and the wider perspective of prefabrication and compositionality in language. Jacek Tadeusz Waliński is an Assistant Professor of Linguistics at the Institute of English Studies, University of Lodz. His research concentrates on interactions between language and cognition in the mental processing of the sociocultural reality. His research interests also include applications of cognitive corpus linguistics for foreign language teaching and translation.

Editorial Board Piotr Cap (University of Łódź, Poland) Jorge Díaz-Cintas (University College, London, England) Katarzyna Dziubalska-Kołaczyk (Adam Mickiewicz University, Poznań, Poland) Wolfgang Lörscher (Universität Leipzig, Germany) Anthony McEnery (Lancaster University, England) John Newman (University of Alberta, Canada) Hans Sauer (Ludwig-Maximilians-Universität München, Germany) Piotr Stalmaszczyk (University of Łódź, Poland) Elżbieta Tabakowska (Jagiellonian University, Kraków, Poland) Marcel Thelen (Zuyd University of Applied Sciences, Maastricht, The Netherlands) Gideon Toury † (Tel Aviv University, Israel)

Vol. 51

Bibliographic Information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the internet at http://dnb.d-nb.de.

This publication was financially supported by the Institute of English Studies of the University of Łódź.

ISSN 1437-5281 ISBN 978-3-631-66336-3 (Print) E-ISBN 978-3-653-05648-8 (E-PDF) E-ISBN 978-3-631-70709-8 (EPUB) E-ISBN 978-3-631-70710-4 (MOBI) DOI 10.3726/b10717 © Peter Lang GmbH Internationaler Verlag der Wissenschaften Frankfurt am Main 2017 All rights reserved. Peter Lang Edition is an Imprint of Peter Lang GmbH. Peter Lang – Frankfurt am Main · Bern · Bruxelles · New York · Oxford · Warszawa · Wien All parts of this publication are protected by copyright. Any utilisation outside the strict limits of the copyright law, without the permission of the publisher, is forbidden and liable to prosecution. This applies in particular to reproductions, translations, microfilming, and storage and processing in electronic retrieval systems. This publication has been peer reviewed. www.peterlang.com

Piotr Pęzik and Jacek Waliński

Introduction

The idea of this volume was conceived during the 9th international conference on Practical Applications of Language Corpora (PALC 2014), held at the University of Łódź, Łódź, Poland, 20–22 November 2014, which was accompanied by a special workshop devoted to Uncovering Time in Language. This is where most of the papers originate, although some contributions in the volume, most notably the three final papers from young researchers, were invited at a later stage.

At present, linguists support different assumptions about the faculty of language, and they adopt a broad spectrum of methodologies for research on linguistic phenomena. This volume presents a selection of studies that reflect a growing interest in matching theoretical predictions about language and cognition against empirical language data with the use of corpus linguistics methodology.

Throughout the editing process we have incurred a number of debts that we wish to acknowledge. First of all, we appreciate the invaluable assistance from Barbara Lewandowska-Tomaszczyk, the founding mother of PALC conferences and, more recently, the principal investigator of the project on Perception of Time as a Linguistic Category, which provided financial support for the publication of this book. We hereby acknowledge that this publication was carried out within, and supported by, the Polish National Science Centre (Narodowe Centrum Nauki) project Perception of Time as a Linguistic Category, No. 2011/01/M/HS2/03042. We gratefully acknowledge the assistance of Anna Bączkowska from the Kazimierz Wielki University in Bydgoszcz, who reviewed the papers submitted for publication. Her critical insights and helpful comments greatly improved the value of the volume as a whole. Our sincere thanks also go to all individual contributing authors, who responded with utmost professionalism to all requests that were made of them, and patiently waited for the publishing process to finalize.

The present volume contains 15 papers.
In the opening chapter, Barbara Lewandowska-Tomaszczyk seeks an explanation for the regular patterns of pre- and post-modifying participial constructions in Polish and English. The differences observed in the reference corpora are linked to the aspectual systems of the two languages, and more specifically, to the partially gradient nature of the cognitive scanning processes in Polish, which is not attested in the English data. Piotr Pęzik proposes a dependency-based method of extracting phraseological units from corpus search results. The approach makes use of automatic dependency annotations. Its potential novelty lies in the improved ability to identify “collocational catenae”, which are recurrent dependency subtrees composed of more than two lexical items which seem to have acquired some phraseological status. The results of automatic phraseology extraction are claimed to have theoretical implications, as they bring insights into the interplay of compositionality and prefabrication in language production and reception. In the third chapter of the volume, Marcin Tatjewski, Mirosław Bańko, Adrianna Kucińska and Joanna Rączaszek-Leonardi combine two different methodologies of measuring word similarity. A corpus-based, distributional model is compared with the results of a survey in which informants provide free associations of selected words. The correlation observed between the results obtained with these two methods gives some weight to the assumption that corpus and experimental data can be exploited in tandem to explore the cognitive aspects of language use. Dario Lečić uses a similar combination of corpus- and survey-based data to explore the phenomenon of morphological doubletism in Croatian. The specific question addressed in his chapter is whether corpora provide a more accurate description of native speakers’ morphological preferences than formal grammars of Croatian. He finds that although the relative proportions in the frequencies of the doubled forms are generally reflected in the survey, phonologically similar forms are much less consistent with their corpus distributions. Taking into account data found in the British National Corpus and the Corpus of Contemporary American English, Adamina Korwin-Szymanowska and Jacek Waliński discuss a conceptual mapping of health as the general condition of human functioning onto two basic dimensions of embodied experience involving up–down and strong–weak scales. From this perspective the concept of health forms an array of primary conceptual metaphors arising from the cognitive embodiment.
Stanisław Goźdź-Roszkowski’s paper investigates the role of selected grammatical patterns in expressions of evaluative meanings found in a corpus of US Supreme Court opinions. Using the concept of Local Grammar and evidence from this corpus, the author identifies and systematizes the conventional ways in which the judges signal their attitudes and how they relate them to those of relevant interactants in their written opinions. The patterns selected for the analysis turn out to serve their prototypical evaluative functions. In the subsequent chapter, Jacek Waliński demonstrates a proportion between spatial and temporal expressions of distance for the semantic attribute of motion medium based on objectively verifiable frequencies of language patterns found in the National Corpus of Polish. His research shows that in this particular semantic context Polish speakers tend to express distance both in spatial and temporal terms, with spatial representations being used more frequently, but not by a large margin.

Petra Klimešová, Zuzana Komrsková, Marie Kopřivová and David Lukeš describe a number of distinctive features of casual spoken discourse which set it apart from written and formal spoken language. Corpora of informal conversational discourse are presented as an invaluable source of data without which it would be impossible to systematically study discursive, phonetic and grammatical phenomena in spoken language. In the next chapter in this volume, Alexandr Rosen introduces a corpus of essays composed by non-native Czech authors. The specific challenge addressed in this paper concerns the automatic annotation of non-native language data. The author shows that part-of-speech taggers and lemmatizers originally developed for native Czech can be adopted to generate useful linguistic annotations of the corpus of non-native writing. Elżbieta Kaczmarska’s paper focuses on a selection of polysemous mental state verbs in Czech. Using a parallel corpus, she extracts clusters of their Polish equivalents and investigates the extent to which they can be predicted by different linguistic frameworks. Marcin Trojszczak presents selected aspects of metaphorical conceptualization of problem solving that are shared between English and Polish on the basis of linguistic data from the British National Corpus and the National Corpus of Polish. He approaches metaphorical conceptualizations of problem solving from the perspective of cognitive corpus-based linguistics. Victoria Kamasa critically reviews a number of papers whose authors use Corpus Linguistics techniques in Critical Discourse Analysis and identifies some of the most vulnerable points of the current research practice. She also suggests improvements for corpus-supported discourse analysis which can help researchers avoid many methodological pitfalls. The last three papers in the volume are corpus-based studies contributed by young researchers working on their doctoral dissertations.
Dorota Pierścińska discusses quantitative parameters of dialogue in interviews and panel discussions and relates these two genres to the qualitative features in order to develop and propose their general characterisation. Aleksandra Makowska looks at a corpus of industrial safety sheets and points out possible improvements in their translation and terminological standardisation. Monika Betyna investigates the functions of lexical bundles in a topic-oriented corpus of medical texts.

Barbara Lewandowska-Tomaszczyk

State University of Social Sciences in Konin

Gradience in cognitive scanning: participle modifiers in Polish and English

Abstract: The main issue to be discussed in the present paper is gradience in the cognitive scanning processes of events described in terms of de-verbalized categories in Polish and English, used in sentences as weakly-grounded or non-grounded forms with decreasing assertive force. In such constructions the event represented by the modifier is either more fully attributivized and used pre-nominally or left in its more verbal shape, frequently following the relevant nominal. The profiles involve a number of relevant parameters, and the position of the modifier with respect to the modified noun and the presence of the complementary sentence elements are also demonstrated to be relevant indicators of the particular construal parameters. The language data are drawn from the British National Corpus and the National Corpus of Polish. Distinct language profiles in this respect are demonstrated to pertain first of all to different aspectual systems in the two languages and embrace in particular pre-/post-modification behaviour of the relevant constructions, which are argued to be rooted in the partially gradient nature of the scanning processes in Polish, as compared to English.

Keywords: aspect, attribute, British National Corpus, construal, (co-)temporality, distribution, event, frequency, gerund, gradience, modification, National Corpus of Polish, participle, scanning, sentence position, tense, verb

Background assumptions¹

The basic assumption of Cognitive Linguistics is that the subject of the analysis is not a relationship between objects and events as they are in the real world but rather the conceptualization of the relationships between objects, events and their participants. Furthermore, language-specific syntactic structures function as vehicles which convey the speaker’s conceptualization of an outside scene or event. The main focus of the present paper is the phenomenon of graded cognitive scanning in the conceptualization of events described in terms of de-verbalized categories in Polish and English, used in sentences as weakly-grounded or non-grounded forms with decreasing assertive force. Their distinct language profiles in this respect are demonstrated to pertain first of all to the distinct aspectual systems in the two languages and embrace in particular pre-/post-modification behaviour of the relevant constructions, which are argued to be rooted in the gradience of the scanning processes in Polish, as compared to English. In such constructions an event represented by the modifier is either more fully attributivized and used pre-nominally or left in its more verbal shape, frequently following the relevant nominal. The profiles involve a number of parameters, and the position of the modifier with respect to the modified noun and the presence of the complementary sentence elements are demonstrated to be relevant indicators of the particular construal parameters.

The concept of cross-language equivalence adopted here is discussed in terms of an overall framework embracing an Event structure as well as reference grammatical categories, primarily Tense and Aspect, interpreted in the form of basic image-schemata and their extensions, as well as the construal relations distinct in Polish and English. Time, underlying Tense and Aspect, on the other hand, in its ‘real’ sense, as argued in Lewandowska-Tomaszczyk (2016: xix), is probably a subjective experience type, as Grady (1997) calls it, possibly modality-based, more individual than space, and is considered to be, as shown by Edelman et al. (2012), Evans (2013) and other researchers, phenomenologically real, although its conceptualizations may differ across cultures and individuals.

¹ Research carried out within COST Action TD0904 TIMELY, supported by National Science Centre (NCN) grant No 2011/01/M/HS2/03042, Perception of Time as a Linguistic Category.

1. Materials

The language materials used in the study are drawn from the British National Corpus and the National Corpus of Polish. Both quantitative data (frequencies of particular forms and their distribution) and qualitative properties, which refer to language-specific rules of event type and event phase expression, are considered.

2. Cognitive Linguistic concepts relevant to the analysis

2.1 Construal

A typical event scene is built around participants and relations holding among them, all being situated in a spatial and temporal framework. Construal signifies the way a scene and events are portrayed and structured (Langacker 1987, 1991). Langacker demonstrates the dynamicity of construal types (b–e) on the basis of schema variants represented in Figure 1. The sentences in (1) below exemplify these constructions, with the basic force-dynamic schema represented in (a):

Figure 1. Construal types (after Langacker 1987, 1991).

Each of the sentences in (1) presents a different foreground/background structure, with consecutive participants made salient and represented by a distinct syntactic structure.

(1)
(i) Participants, Relations & Force-dynamic Schema: Mark (Agent) exerts force > (with) van (Instrument) > (onto) toy (Patient)
(ii) Mark crashed the toy with his van.
(iii) The van crashed the toy.
(iv) The toy easily crashed.
(v) Mark crashed the toy (under his van).

2.2 Grounding

The concept of grounding, essential in the discussion of prenominal participial modifiers, the focus of the present study, addresses the linking of semantic content to contextual factors that constitute the subjective ground (or situation of speech). Relevant to such constructions, as proposed by Langacker, is establishing deixis and reference to person, time (temporality) and aspects. Of certain exploratory value are similarities and contrasts between clausal (verbal), nominal and adjectival (attributive) grounding. It is typically assumed in cognitive analyses that gerunds, participles and deverbal forms lose or weaken their grounding properties and simultaneously lose their assertive force.

2.3 Profiling

Langacker (1981) proposes that the field of human vision is restricted. People are able to subtend only a limited portion of their surroundings, so the speaker’s attention is focused on a particular region only. This region, in the focus of the viewer/speaker’s attention, is its profile, and the larger part of the surroundings is linguistically expressed against the profile’s base. It can then be proposed that in the case of a deverbal (participial) modifier such as a broken arm, it is the main action and the change of state (breaking) that form the profile base, with the final state (outcome) of the action, broken, as its profile.

2.4 Bounding

Equally essential for the discussion in the present paper is the concept of bounding, i.e., the “existence of a limit (internal to the scope of predication) to the set of interconnected entities that constitute a region, or the set of component states that constitute a process” (Langacker 1987: 545). The distinctions between mass and count nouns and between imperfective and perfective verbs are largely based on differences in bounding construals, with mass nouns and imperfective verbs forming unbounded regions and concrete nouns and perfective verbs constituting bounded regions.

2.5 Bounding and scanning

Langacker follows traditional grammar and divides English verbs, which typically designate processes, into two aspectual classes (2000: 223): perfective verbs (e.g., learn, eat), which are internally bounded within the immediate temporal scope, i.e., they have an onset and an offset, and imperfective verbs (e.g., stative verbs such as know), which do not express bounding. Langacker further postulates that relational units such as adjectives, adverbs, prepositions, infinitives, and participles designate atemporal relations, which comprise a holistic viewing of a scene (summary scanning), in contradistinction to processes, which are temporal and involve a sequential portrayal of the scene, linked with sequential scanning. In Langacker’s model, adjectives thus profile an atemporal relation between a thing and a property, while participles retain the sequential scanning characteristic of the verb from which they derive but are typically profiled in terms of atemporal, summary scanning. The present paper seeks an explanation for the use of those participial constructions in Polish and English which occur in adjectival positions (prenominally) but retain both their original verbal character and temporal, sequential scanning, and analyses those participle modifier uses which involve the postnominal position.

The account proposed is related to the claim that instead of maintaining a rigid distinction between sequential and summary scanning, scanning is considered rather a matter of gradience in a cross-linguistic perspective, and some language systems (like Polish) enable prenominal adjectival participles to retain their verbal (sequential) character to a higher degree, by making their verbal, sequential properties more salient than in the case of similar processes in English.

3. Participles contrasted

There are basically two types of participial modifiers in English: present or active (-ing) participles and past (-ed) participles (including the so-called passive participle, i.e., the participle of transitive verbs). Although some of the modifiers can function both attributively, as prenominal and postnominal modifiers, and predicatively, while others cannot be used in this way, the attributive uses of participial adjectives in English outnumber predicative uses for all types of participle (cf. Biber et al. 1999: 530). However, it is argued in the present analysis that different positions of participial modifiers express semantic differences, particularly with respect to the properties of (co-)temporality and the aspectual character of the events inherent in the participles. Although, as presented in Lewandowska-Tomaszczyk (2008), the two languages show some similarities in this respect, Polish permits some complex participial premodification in the attributive use which cannot occur in English and, moreover, the prenominal participial modifiers in Polish retain their sequential character to a higher degree than similar forms in English. A similar situation arises in English with regard to borderline cases of participial class membership used as premodifiers. As exemplified by Biber et al. (1999: 66–67), there is an ambiguity between permanent (non-co-temporal) and occasional (co-temporal) uses in adjectival as opposed to nominal (gerundive) instances of similar forms, as in the travelling public, understood either as the more frequent non-co-temporal ‘public which travels’ or as the co-temporal ‘public which is travelling’.

3.1 Polish

3.1.1 Present participle

The present participial forms in the Polish examples below are used mostly postnominally, and in such structures they refer to activities co-temporal with the main clause event.

(2) Samotna staruszka nie nadążająca za pielgrzymką, samotny mężczyzna palący papierosa albo ten drugi samotnie pijący piwo; lit. ‘an old, lonely woman, not keeping up with the whole group of pilgrims […] a lonely man smoking a cigarette and the other one drinking beer alone’

The preferred word order pattern follows the sequence Noun (Agent), Participle, Instrument, Location, often preceded by another modifier:

(3) Wolne dzieci bawiące się zerwanymi łańcuchami; lit. ‘Free children, playing with broken chains’
(4) Roześmiane dzieci bawiące się na podwórku ‘Laughing children playing in the courtyard’

Prenominal Present Participial modifiers, co-temporal with the Main Event, are either premodified themselves (5, 6) or embedded in a verbal phrase (7):

(5) Nie było sensu, by dobrze czytające dzieci od nowa poznawały litery; lit. ‘well reading children’
(6) Patrzyła na bawiące się dzieci ‘she was looking at the playing children’
(7) Fotografując bawiące się dzieci z góry; lit. ‘taking photographs of the playing children from above’

In the case of some Verbs such as ważyć ‘weigh’, the position of the participial modifier signals a semantic difference between Transitive (ważyć coś ‘to weigh something’, example (8) below) and Middle Verb readings (e.g., ważyć 10 kg ‘weigh 10 kgs’) as in examples (9, 10, 11):

(8) na płótnie “Ważąca złoto” ‘on the canvas “[The woman] weighing gold”’

The distinction observed in (9–11) below points also to the (in)definiteness of the modified noun. The noun with the premodification (10) refers to a (more) definite noun than that in (9), which involves new information. This aspect is further clarified in example (11), in which the participial clause following the noun kobieta ‘woman’ introduces a definitional criterion on the preceding noun.

(9) Ważąca pięćdziesiąt kilogramów kobieta wydaje na świat dziecko; lit. ‘Weighing 50 kg woman gives birth to a baby’
(10) Niewiasta, ważąca około 200 kg. ‘a female, weighing about 200 kg’
(11) W Cuernee kobieta ważąca ponad 100 kg nie ma prawa jeździć konno w podkoszulku; lit. ‘In Cuernee a woman weighing over 100 kg is not supposed to ride a horse in a T-shirt’

Gradience in cognitive scanning


The sentence in (12) below, though, does not seem to constitute a counterexample. The prenominal adjectival participial clause rather uses the information from the DELIVERY/BIRTH FRAME activated in the preceding context: (12) narodziny dziecka ludzkiego są – z powodu stosunku wielkości noworodka do rozmiarów matki – wyjątkowo w świecie małp bolesne i niebezpieczne. Ważąca pięćdziesiąt kilogramów kobieta wydaje na świat dziecko o wadze zwykle około trzech kilogramów ‘the birth of a human infant is – because of the proportion between the size of the newborn and the mother – particularly painful and dangerous in the world of apes. (lit.) The weighing fifty kilograms woman delivers a baby of about three kilograms of weight’. The semantic-syntactic interpretation agrees with the general principle of Noun-Adjective word order in Polish, in which prenominal Adjectival modifiers introduce a descriptive interpretation as in e.g. brunatny niedźwiedź ‘brown bear’, while the postnominal modifier introduces rather a generic (genus) interpretation niedźwiedź brunatny lit. ‘bear brown’ (Fisiak et al. 1978). The prenominal participle too is related to occasional/co-temporal uses, while the postnominal one – to steady and generic ones. On the whole, both in English and Polish the prenominal – stative (steady/generic) use is conjectured to be related to a summary scanning profile of the action, while the postnominal – occasional one – to a sequential scanning process. The sequential scanning profiles the action in (13, extended in 15), while the summary scanning serves mainly a broadly understood (co-)referential function in the utterance (14): (13) osoba cierpiąca ‘a person suffering’ (14) tancerka cierpiąca na Alzheimera ‘a dancer suffering from Alzheimer’s’ versus (15) Odczucie porzucenia odżywa boleśnie w dorosłym życiu, a cierpiąca osoba nie zdaje sobie zwykle sprawy, że jego źródła tkwią w przeszłości. 
‘The sense of rejection is acutely felt in the adult life, and the suffering person does not realize that its sources are anchored in the past.’ The broad co-referential function of the prenominal modifier is clearly visible in (16), in which the form is preceded by the demonstrative pronoun tą ‘this’ and refers to the contextual water/boat-related framing of the event: (16) Wiatr był bardzo mocny, a my mieliśmy podniesione oba żagle, na motyla, jeden po jednej, a drugi po drugiej stronie łódki, i szybowaliśmy prawie nad tą gotującą się wodą, zupełnie sami, bo wszyscy uciekli przed burzą, a ona w końcu przeszła bokiem. ‘The wind was very strong, and we had both our sails up, for a butterfly, one at one side and the other at the other and were almost flying over this boiling water, all alone, as all the others ran away from the storm, but eventually it passed by.’ A complex (Location) prenominal modification engaging a participle is acceptable in Polish, but not in English: (17) Usłyszałem jak na małym ogniu bulgocze gotująca się w dużym garnku zupa na obiad; lit. ‘I heard as on small fire bubbles boiling itself in a big pot soup for dinner’ The Present Participle of the Transitive Verb gotować (coś) ‘boil/cook (something)’ can be used in both pre- and post-nominal position according to the (non-)definiteness principle discussed above: (18i) babcia gotująca ‘grandma cooking’ versus (18ii) gotująca babcia ‘cooking grandma’ The word-order rule here opposes that used in English in the case of equivalent structures. While the former corresponds to the genus-identifying equivalent (grandma who cooks regularly) or to the contrastive use (e.g., na tym zdjęciu babcia gotująca obiad, a na tamtym śpiąca w hamaku ‘in this picture grandma cooking dinner and in the other – sleeping in a hammock’), the latter indicates the immediate (temporary/transient) activity-identifying order (grandma cooking at the moment). In English the prenominal modification is connected with a fixed, steady (adjectival) property, while post-nominalization typically indicates a temporary condition. The shift of the modifier position in the case of regular transitive verbs in which the NP following the verb is a direct object is more constrained and significantly less frequent. 
To refer back to the verb of weighing, in (19), ‘to weigh’ (ważyć) is used in the function of Participle of the Transitive Verb, while in (20), the participle ważąca ‘weighing’, is metaphorically used (in the sense of ‘significant’), as a post-nominal modifier to the noun kwota ‘sum (of money)’: (19) Waga ważąca dobro i zło ‘The scales (weight) weighing the good and the evil’ (20) Kwota ważąca dla budżetu państwa ‘the sum (of money) significant (lit. weighing) for the state budget’


3.1.2 Past Participle

The aspectual differences in the Polish Verb system significantly influence the actual use of the past participial forms. The Polish prefixal verbs are typically combined with the perfective meanings of the verb. The perfective – in both languages – is a function that requires, as proposed by Moens and Steedman (2005: 99) “its input category to be a culmination […and] its result is the corresponding consequent state.” The prefixed verbs enable the acceptable derivation of the pre- and post-nominal participial modifying forms as in: (21) wiszą dwaj zastrzeleni mężczyźni lit. ‘hang two shot dead men’ (22) mężczyźni zdeformowani przez morza i cierpienie ‘men deformed by the sea and suffering’ Here too, similarly to the Present Participle modification, the prenominal modifiers are of a descriptive character (21), while the post-modification in (22) is of the (genus) identifying nature. The so-called Imperfective (or unmarked) Aspect on the other hand tends to be used interchangeably with the Perfective sense in casual speech as in Kto pisał ten wiersz? lit. ‘who was writing (imperf.) this poem?’ in the sense of Who wrote this poem? For example, of the 112 instances of gotowana, the imperfective Past Participial form of ‘boil/cook’, identified among the 240,192,461 units of the National Corpus of Polish, the vast majority (90 instances) carry perfective senses such as gotowana woda/śniadanie/szynka ‘cooked, boiled water/breakfast/ham’. A similar usage of imperfective participles is noted in the Passive Voice, although here some part of the uses refers to the durative sense of the Imperfective Participle, which is possible in English mostly by means of the syntactically composed Perfect Progressive Aspect: (23) Jak czytelnik zauważył, książka ta pisana była do tej pory lit. ‘As the reader (may have) noted, this book has been being written till now’. 
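The corpus figures quoted above can be put on a comparable scale with a few lines of arithmetic. The following is only a minimal sketch: the three counts come from the text, while the function names and the per-million normalisation are illustrative assumptions, not part of the original analysis.

```python
# Counts reported in the text for the National Corpus of Polish (NKJP).
NKJP_UNITS = 240_192_461      # corpus size in units
GOTOWANA_HITS = 112           # instances of the participle "gotowana"
PERFECTIVE_HITS = 90          # of these, instances read as perfective


def share(part: int, whole: int) -> float:
    """Proportion of `part` within `whole` (0.0 to 1.0)."""
    return part / whole


def per_million(hits: int, corpus_size: int) -> float:
    """Relative frequency normalised to one million corpus units."""
    return hits / corpus_size * 1_000_000


# Hypothetical helper names; the normalisation itself is standard practice.
print(f"perfective share: {share(PERFECTIVE_HITS, GOTOWANA_HITS):.1%}")
print(f"gotowana per million units: {per_million(GOTOWANA_HITS, NKJP_UNITS):.3f}")
```

Expressing the 90 perfective readings as a share of the 112 hits makes the "vast majority" claim explicit (roughly four fifths), and the per-million figure shows how rare the form is in the corpus as a whole.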
A fairly frequent use of such forms in Polish is grammatically polysemous and can be considered ambiguous between the durative aspect mentioned above and the aspect expressing a frequentative (iterative) sense, or else, as observed in such cases as (24) and (25), in the perfective sense: (24) Cała literatura o niewolnictwie jest albo fałszywa, albo od rzeczy. Pisana albo przez misjonarzy, albo przez wyzwoleńców, albo w najlepszym wypadku przez niewolników. ‘All books/The entire literature concerning slavery is either false or irrelevant. Written [imperfective or (perfective) frequentative] either by missionaries, or by freedmen, or in the best case, by slaves’


(25) Książka pisana w pierwszej osobie ‘A book written in the first person’. Due to the ambiguity potential of the past participial forms, the processes of mental scanning are not homogeneous in such cases. Another example below (26) with the past participial form of the ambiguous (either imperfective or (perfective) iterative) form zamykany/a ‘closed’ (in the finite form in this example) presents a high frequency of occurrence in the corpus materials: (26) Promenada jest zamykana na noc ‘The promenade is (being) closed for the night’ Clearly stative, adjectival senses of past participles can be used either pre- or post-nominally: (27) przenośna, zamykana popielniczka/(popielniczka zamykana) w formie kubka ‘a portable (lit. (being) closed) closing/closable ashtray in the shape of a mug’ An ambiguity between a durative and frequentative sense is connected with the (co)referentiality of (one or many) modified nouns: (28) bielizna noszona od wielu sezonów ‘(the same item or many items like it) (the) underwear having been/being worn for many seasons’ (29) to taka sama woda, jak ta noszona przez ogrodnika w dużej konwi ‘it’s the same water as that carried by the gardener in the large can’ The frequentative nosić ‘carry’ has a corresponding unbounded – durative (imperfective) form nieść, in (30) indicating a single, unbounded act, pre-modifying the noun: (30) wyłania się niesiona przez tłum taksówka lit. ‘emerges carried-by-the-crowd taxi’ The post-modifying phrase in (31) identifies a similar (sequentially scanned) event (being carried), and there is a clear distinction with the prenominal modifier (soaked), though mainly in the area of boundedness, i.e., the perfectivity status of the prenominal modification. (31) Nasiąknięta piłka niesiona wiatrem często zaskakująco zmieniała swój lot. ‘A damp/soaked ball (being) carried by the wind, changed its flight surprisingly often’


3.2 English

3.2.1 Present Participle

The Present Participle is related to progressive aspect in English. As proposed by Moens and Steedman (2005: 98) “progressive auxiliaries are functions that require their input to denote a process. Their result is a type of state that we shall call a progressive state, which describes the process as ongoing at the reference time”. The modifying Present Participle can occupy either a prenominal or postnominal position in English. In the prenominal position its form resembles a gerund formation as in a reading lamp; however, the form reading in this case is not in the direct object relationship with the noun lamp. Instead it functions as a property linked to the noun, potentially in a number of ways. In this case, lamp is used as an auxiliary instrument in the activity expressed in (the gerund form) reading. Another gerundial example such as gold digging area is also clearly distinct both syntactically and conceptually from the participial (postnominal) attribute, for instance look at the girl crying, also opposed to the causally linked crying shame, i.e., ‘shame that makes you cry/causes crying’, or writing career ‘career in writing’. The forms referred to here as prenominal present participles are instead those such as a living organism, in which the noun (organism in this case) is modified by an attributive participle. A participle modifier search of the BNC generates a set of data which can be systematized into two major categories in terms of their distributional/semantic properties. The first one (I), represented by (32), demonstrates an almost fully adjectival (state) profile, less temporal than the one (II) in example (33), which portrays a process co-temporal and co-extensive with the main clause grammatical taxa, i.e., expressions of functions such as tense, aspect and mode in this case. (I) (32) It has only been burning coal in it! (II) (33) And there’s that bit of wood burning between his boots! 
Category (I) manifests the semantics of steady attributes used as labelling descriptors and frequent lexicalized forms such as weighing or wrapping machines, running horse, racing car or disclosing tablets. Category (II), postnominal participial modification, typically demonstrates occasional properties, attributes, and in the case of participial modifiers, transient activities.


(34) No Metro trains running [i.e., not running at the moment, possibly for some time] And yet, there are corpus uses which do not conform to the principle of postmodification in the case of transient actions/activities. One of them involves cases of process-salient events such as in (35) to halt a closing door (36) a noisily closing door In both (35) and (36) the process (of closing in this case) is the most salient, focalized part in the described event, with either its duration or manner emphasized. On the other hand, in the case of (37) hanging cloth the unbounded (progressive), impersonal character of the present verb semantics shifts the verbal attribute into state-like characteristics, of the verbal periphery sense. A somewhat different syntactic principle is manifested in the case of (38) below, in which the noun is modified by a phrase designating an inherent property of the modified noun and not one expressing a transient process: (38) bombs weighing up to a tonne The principle operating here refers to the absence of left-branched modification in the English modifier system in the case of complex modifiers, i.e., those exceeding one unit (e.g., ?weighing up to a tonne bombs), present in Polish (ważące do jednej tony bomby).

3.2.2 Past Participle

The range of modifying past participle forms in English can be exemplified by the set of participles in (39–42) below. Typically perfective in meaning, the past participle modifiers denote a culmination of events, although the sequential scanning of the processes is evident, particularly in the case of activity verbs with semantically telic and perfective senses (issue, mention). The verb eat can be considered polysemous between a progressive and iterative sense and the scanning is either unbounded and more sequence-like or else cumulative and repetitive. The event expressed in read is ambiguous, though more frequently progressive and unbounded:


(39) A book issued is not necessarily a book read (40) A reliable device of the type mentioned in the subsection (41) the eggs and bacon eaten with bread and butter Pre-nominal past participle uses do typically involve a summary cognitive scanning, although the processual verbal (i.e., sequential) character of the modifier is observed in possible uses with durative adverbials such as e.g.: (42) Caller Display puts control of the telephone back into customers’ hands, restoring the balance of power between the caller and the [frequently] called person Post-nominal past participle uses involve a sequential cognitive scanning, used in process-linked or typicality expressing (iterative) modifiers as in: eaten [usually] with bread and butter. There are instances of the modifiers which are used interchangeably as in (43) which will best produce the results desired (44) the knowledge desired is through these relations (45) by asking that the desired goal is The most frequent nominals used with prenominal desired are desired results and desired time. One of the characteristic classes of pre-nominal past participle modifiers in English is the category of negative pre-nominal modification as in: (46) piles of unread scripts (47) a practically unheard of thing (48) he played his unheard tune (49) unexpected guest (50) unkept appointments

The account of such cases involves an interpretation of negation as an instance of a perfectivising phenomenon. Examples in (46–50) are fairly clear in this respect – an unheard of thing has just been heard (by the speaker), an unexpected guest has just arrived, an unheard tune has just been played, while unread scripts have probably been read by the speaker. Such a context makes the negative participles, but not their positive (unbounded) counterparts (played, read, heard), cumulative, completed and bounded. And it is for the bounded elements in English (i.e., change-of-state verbs in particular) that the prenominal past participle modifying position is reserved, as in the case of a broken arm or a written but not a ?read document. A similar perfectivising role is performed by English complex pre-participial modifiers such as in:

(51) (i) a widely read book (ii) widely eaten muesli/etc., (iii) one of the best kept secrets in jazz (iv) a depressed and unkept appearance (v) two very cautiously written paragraphs (vi) written language/warning/statement/test

Complex participial modifiers, which would otherwise not be used in the prenominal position, can be turned into almost steady, state-like properties, in written language in particular, by employing punctuation marks, particularly a dash: (52) creative graphically marked many-times-read book All the examples discussed above are instances of participle modifiers of transitive verbs. Intransitive verbs on the other hand are not semantically predestined to modify the head noun in either language. Those infrequent cases in English such as a fallen tree, those departed guests and in Polish e.g., opadłe ‘fallen’ (most frequently) liście ‘leaves’, owoce ‘fruit’ or, more occasionally, ręce ‘arms’ (87 instances in 240,192,461 units in total) tend to function rather as fixed expressions with weak productive power. And yet, it does not mean that no trace of sequential scanning is retained in such cases. It is the whole process of, say, falling that is in fact conceptualized, for both fallen and opadłe, although in terms of a final (completed, cumulating, i.e., perfective) act of the whole event, which possesses the property of the highest salience.

4. Conclusions

The asymmetry between the events represented in the basic part of the sentence built around the main verb on the one hand and a participial construction on the other, which changes its character from a verbal to successively more and more adjectival (in some cases via nominalization: gerund) can also be traced in the use of participle modification in Polish and English. Their distinct language profiles in this respect are demonstrated to pertain first of all to the distinct aspectual systems in the two languages and embrace in particular pre-/post-modification behaviour of the relevant constructions, which are argued to be rooted in the partial gradience nature of the scanning processes in Polish, as compared to English. In the case of both Polish and English present and past participle modifiers, the cognitive scanning processes possess a gradient character. The most saliently marked sequential scanning is present in the case of postnominal present participle modifiers, with a decreasing force of cognitive sequentiality marking in the cases of prenominal present participial uses and with a lowered scanning salience degree in the case of past participle modifiers in the prenominal position. In Polish a particularly salient type involves partly grounded past participial, progressive types, which makes these processes distinct from the English ones, in which such grammatical forms are not present. The polysemy of such forms between an unbounded, progressive meaning and a frequentative (iterative) sense contributes further to the complexity of these phenomena both on the syntactic as well as on the cognitive semantic planes2. The analysis presented in this study confirms the basic cognitive linguistic assumptions verbalised by Langacker that a (progressive) event (like entering) is scanned sequentially. In such a situation the conceptualizer (i.e. the language user) views the different facets of the complex scene successively (as in a motion picture). Summary scanning obtains when the different facets of the complex scene are made available as a single Gestalt (as in a picture). However, it needs to be emphatically stressed that the degree of sequentiality can vary in one language and cross-linguistically, which is evident in a number of usage-based parameters and their interrelationships. In other words, even in Gestalt pictures, successive steps can be mentally accessed. This picture then is closer to what Verkuyl (1993) proposes in terms of aspectuality shifts as e.g., by shifting the meanings from progressive, iterative to perfective. This is also close to von Wright’s (1963) account of the dynamicity of event structure in which e.g., the Path-structure of John walked home can be represented by a function from natural numbers into a domain of locations, in particular a domain of spatio-temporal locations. 
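The idea of a Path-structure as a function from natural numbers into a domain of locations can be made concrete with a toy example. The sketch below is an illustration under stated assumptions only: the location names, the function names and the mapping of sequential scanning to step-by-step iteration and of summary scanning to a single tuple are hypothetical simplifications, not a formalisation proposed in the works cited.

```python
from typing import Iterator, Tuple

# Hypothetical locations along the path of "John walked home".
LOCATIONS = ["office", "street", "park", "front door", "home"]


def path(n: int) -> str:
    """Map a natural number (step index) to a location.

    Once the endpoint is reached, every later index maps to it.
    """
    return LOCATIONS[min(n, len(LOCATIONS) - 1)]


def sequential_scan(steps: int) -> Iterator[str]:
    """Access the locations one after another, as in a motion picture."""
    for n in range(steps):
        yield path(n)


def summary_scan(steps: int) -> Tuple[str, ...]:
    """Make all locations available at once, as a single whole."""
    return tuple(path(n) for n in range(steps))


for place in sequential_scan(len(LOCATIONS)):
    print(place)
print(summary_scan(len(LOCATIONS)))
```

Note that the tuple returned by `summary_scan` can still be indexed position by position, which loosely mirrors the point that successive steps remain accessible even in a Gestalt-like construal.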
The successive locations then can be activated even when the progressive marker is not used, although it is evident that the modifying participles are certainly less grounded and possess less assertive force than full verbs. They are used in de-sentensized constructions whose profiles are reduced (see Cristofaro 2003; Lewandowska-Tomaszczyk 2008). The analysis also has consequences for the interpretation of the notion of event (Lewandowska-Tomaszczyk 2011). Making reference to Vendler’s (1957) classification of events at this point, we propose that an action event is a prototypical type of event with its bounded, fully grounded, sequential scanning and a clear force-dynamic action chain. Processes are a less prototypical type – they have no inherent bounding, they are sequential with the whole path profiled. Achievements profile the final phase of action events, which are sequential themselves, although with the final phase most saliently profiled, while states profile the final phase of those action events which achieved the status of a steady property themselves. They are weakly grounded, most fully bounded, close to summary scanning, in which the state is profiled. And yet, even states can activate, more strongly in some languages than others, elements of change, experienced as time in most of the contemporary approaches to time and temporality, so they manifest this temporal – not fully stative, sequential – property in their conceptualizations.

2 Discussing temporal ontology and temporal reference, Moens and Steedman (2005: 97) point to the effect on meaning “of the combination of the progressive with an expression denoting an atomic punctual event as in Sandra was hiccupping occurs in two stages: first the point proposition is coerced into a process of iteration of that point. Only then can this process be defined as ongoing, and hence as a progressive state.”

References

Biber, D., Johansson, S., Leech, G., Conrad, S. & E. Finegan. 1999. Longman Grammar of Spoken and Written English. Harlow: Longman/Pearson Education Ltd.
Cristofaro, S. 2003. Subordination. Oxford: Oxford University Press.
Edelman, Sh., Fekete, T., & N. Zach. 2012. Being in Time. Amsterdam: John Benjamins. doi: 10.1075/aicr.88.
Evans, V. 2013. Language and Time: A Cognitive Linguistics Perspective. Cambridge: Cambridge University Press. doi: 10.1017/CBO9781107340626.
Fisiak, J., Lipińska-Grzegorek, M. & T. Zabrocki. 1978. An Introductory Polish-English Contrastive Grammar. Warszawa: PWN.
Grady, J. E. 1997. Foundations of Meaning: Primary Metaphors and Primary Scenes. Unpublished doctoral thesis, Linguistics dept. UC Berkeley.
Langacker, R. W. 1987, 1991. Foundations of Cognitive Grammar vols. 1 and 2. Stanford, Calif.: Stanford University Press.
Lewandowska-Tomaszczyk, B. (ed.). 2008. Asymmetric Events (Converging Evidence in Language and Communication Research). Amsterdam: John Benjamins. doi: 10.1075/celcr.11.
Lewandowska-Tomaszczyk, B. 2008. “Asymmetries in Polish and English Participial Modification”. In Lewandowska-Tomaszczyk, B. (ed.), Asymmetric Events, 261–281. Amsterdam: Benjamins.
Lewandowska-Tomaszczyk, B. 2011. Events as they are. In P. Stalmaszczyk (ed.), Turning Points in the Philosophy of Language and Linguistics, 35–63. Frankfurt a. Main: Peter Lang.
Lewandowska-Tomaszczyk, B. (ed.). 2016. Conceptualizations of Time. Amsterdam: Benjamins.
Moens, M. & M. Steedman. 2005. “Temporal Ontology and Temporal Reference”. In I. Mani, J. Pustejovsky and R. Gaizauskas (eds.), The Language of Time: A Reader, 93–114. Oxford Linguistics. Oxford: Oxford University Press.
Verkuyl, H. J. 1993. A Theory of Aspectuality. The Interaction between Temporal and Atemporal Structure. Cambridge: CUP.
von Wright, Georg H. 1963. Norm and Action: A Logical Inquiry. London: Routledge & Kegan Paul.

Corpora

British National Corpus (BNC).
National Corpus of Polish (NKJP).

Piotr Pęzik

University of Łódź

Experimental applications of dependency-based phraseology extraction

Abstract: The central role of phraseology in language reception and production has been steadily appreciated in recent years. It is now more commonly accepted that, in order to achieve native-like fluency and clarity of expression, we heavily re-use conventionalized, multiword units of meaning in addition to compositionally generating new phrases, clauses and sentences. In order to systematically investigate linguistic prefabrication, we need new tools and methodologies for detecting phraseology in large reference corpora. This paper presents a dependency-based method of extracting phraseology and discusses its theoretical implications and practical applications in foreign language pedagogy.

Keywords: Phraseology extraction, dependency linguistics, collocations, phraseodidactics

1. Introduction

A popular form of the Principle of Compositionality states that language production is essentially the process of combining atomic words into syntactically valid phrases and sentences (see Werning, Hinzen and Machery 2012). According to this view, sentence meanings are fully determined by the meanings of their constituents and the rules of their arrangement.1 Language is often described not only as compositional, but also as infinitely productive. This property of language, which is more formally known as the Infinitude Claim (Pullum and Scholz 2010), predicts that the number of lexically (and syntactically) distinct sentences is not limited, at least theoretically, even if there are cognitive limits on how much syntactic complexity language speakers can handle (Baggio, van Lambalgen and Hagoort 2012). The recognition of the unbounded novelty and compositionality of language is further reflected in the claim that syntactic sentences have “zero probability” of occurrence.2 This could be paraphrased to read that one should not expect identical sentences to reappear with any significant frequency in non-controlled language registers. Some of the most influential linguistic theories of the recent decades place great emphasis on compositionality, infinitude and the limitless creativity of language. On the other hand, there are disciplines and traditions of linguistics which give serious consideration to the role of prefabrication. For many phraseologists and corpus linguists, for instance, the reuse of ready-made language structures is a central mechanism of language production and reception. This alternative perspective on how often language is reproduced may start from the recognition of the most obvious exceptions to the claim of zero-probability of syntactic sentences. We regularly use thousands of fully reproduced clausal and sentential idioms (e.g. Don’t judge the book by its cover), even more structurally complex quotes and citations (e.g. Never in the field of human conflict was so much owed by so many to so few), proverbs, conversational formulas and discourse markers (Are you with me, What’s up? etc.) and many other types of prefabricated sentential formulas (cf. Pawley and Syder 1983), which often have highly stereotyped meanings and functions (Gläser 1998). Nevertheless, fully reproduced sentences are a small minority of the set of all sentences found in written discourse. In fact, reproduction of non-trivial sentences in written discourse may easily be regarded as a case of plagiarism. The theoretically infinite novelty of language at the level of syntactic sentences and structurally larger units of language is therefore intuitively accepted. What is much less certain is that the general uniqueness of syntactic sentences should lead to the conclusion that prefabrication and “rote recall is a factor of minute importance in ordinary use of language” (Chomsky 1964).

1 Only this popular formulation of the Principle of Compositionality is considered here and contrasted with the Idiom Principle. For a comprehensive overview of the current state of research on compositionality, see Werning, Hinzen and Machery (2012).
2 “The vastness of the set of sentences from which normal discourse draws will yield precisely the same conclusions; the probability of ‘normal sentences’ will not be significantly different from zero” (Chomsky 1978).
It seems that running parallel to the Principle of Compositionality is the Idiom Principle (Sinclair 1991), which emphasizes the ubiquity of prefabrication, choice restrictions and non-compositionality in naturally-occurring language. Corpus-based studies have revealed high levels of phraseological reproduction in naturally occurring discourse, thus confirming the relevance of this principle to achieving native-like selection and fluency (Pawley and Syder 1983). It is now increasingly acknowledged that prefabrication and recall play a crucial role in language production and reception and that as language speakers, we seem to “do at least as much remembering as (we do) putting together” (Bolinger 1979: 97). Apart from its significant theoretical implications, this observation should be considered in different areas of applied linguistics. With respect to foreign language acquisition, Cowie, Mackin and McCaig (1993, X) further note that “the accurate and appropriate use of English expressions which are in the broadest sense idiomatic is one distinguishing mark of a native command of that language and a reliable measure of the proficiency of foreign learners”.


What is the incidence of subsentential prefabrication in language? There are at least tens of thousands of word combinations which have been deemed sufficiently idiomatic to be included in specialized phraseological dictionaries (Cowie and Mackin 1975). The number of restricted collocations3 alone, which contain synsemantic constituents (Hausmann 2004; see below), has been vaguely speculated to be an order of magnitude larger than the number of dictionary-recorded single word token lexical units (Mel’čuk 2001). If this is true, then the total number of different phraseological units (henceforth referred to as PU’s), especially if we include the vast and largely indeterminate set of “open collocations”, may well be in the millions of distinct lexical types. This realization should make all empirically-minded linguists wonder how this huge body of PU’s with widely differing syntactic forms (from seemingly anomalous and petrified to fully regular combinations), semantic configurations (from fully restricted idioms to open and semantically transparent collocations) and distributional properties (from extremely rare to common) can be consistently discovered, explored, categorized and researched. The issue of detecting phraseological units is just as complex and elusive as the field of phraseology itself. In the present paper I attempt to make a modest contribution to this topic. This comes in the form of a method of detecting and exploring phraseological profiles in search results obtained from reference corpora and more generally in any samples of naturally-occurring text. 
The method can be described as “dependency-based phraseology extraction” and it takes its name from the assumption that a vast proportion of phraseological units, not only idioms or collocations, but also some of the less “canonical” PU’s such as recurrent n-grams or so-called “lexical bundles” (Biber and Barbieri 2007), conversational formulae and multiword discourse devices, seem to be realized as subtrees of sentence dependency trees, even if they are not complete or connected phrase structure constituents. Such dependency subtrees have also been referred to as “chains” (O’Grady 1998) or “catenae” (Osborne, Putnam and Groß 2012). Starting with these general assumptions, in this paper I describe a method of exploring corpus search results which relies on extracting and aggregating a subset of recurrent (collocational) catenae. I also briefly present examples of using this method to analyse search results obtained from a number of large reference corpora of English, including the COCA, COHA, GloWbE4, BNC (BNC 2001),

3 Definitions of restricted collocations, open collocations, pure and figurative idioms as I use them in this paper can be found, for example, in Cowie and Mackin (1975).
4 For more information see: http://corpus.byu.edu/coca/, http://corpus.byu.edu/coha/, http://corpus.byu.edu/glowbe/, respectively.


Piotr Pęzik

UkWaC (Ferraresi, Zanchetta, Baroni and Bernardini 2008) and Monco.5 Next, I discuss the possibility of using dependency-based phraseology extraction to build Automatic Combinatorial Dictionaries from reference corpora. Finally, I demonstrate some tentative applications of such resources in the area of phraseodidactics and phraseostylistics.

2. Phraseological units as catenae

Phraseology has been described as a field “bedeviled by the proliferation of terms and by conflicting uses of the same term” (Cowie 1998). Regrettably, in this section of my paper I have to add to this confusion. The general assumption I make for the purposes of phraseology extraction in this study is that phraseological units are “collocational catenae”. The catena is a syntactic term introduced by Osborne et al. (2012), who argue that it should replace the phrase structure constituent as the basic unit of syntactic analysis. For those familiar with the basics of dependency syntax, the catena should be a rather obvious notion. Figure 1 shows a possible dependency representation of the sentence “Don’t judge the book by its cover.” The structure in the figure is in fact a graph, or to be more precise, a tree (i.e. an acyclic, connected graph). Any subtree of this sentence dependency tree (and possibly the sentence itself), such as judge + book + the or judge + book + by, is a catena.6 This means that single words, which are not considered in this study, are also catenae. The concept of the catena has been used in phraseological studies. O’Grady (1998) predicts that the obligatory lexical components of all (English) idioms form (uninterrupted) dependency chains, which are also subtrees of the sentence dependency tree, i.e. catenae. Due to the variability of many seemingly fixed idiomatic structures, this so-called Continuity Constraint is difficult to defend. For example, the sentential idiom “the devil is in the details” may have a number of syntactic realizations violating the continuity of its “obligatory” components (cf. Simov and Osenova 2015). I provide more examples of exceptions to the Continuity Constraint below.
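The notion of a catena as a connected subtree can be made concrete with a short sketch. The code below is only an illustration and not part of the method's implementation: the edge list is one assumed analysis of “Don’t judge a book by its cover”, and the function names are hypothetical. It enumerates word sets by brute force, which is feasible only for a single short sentence.

```python
from itertools import combinations

# Dependency edges as (governor, dependent) pairs. This is one possible
# analysis of "Don't judge a book by its cover", assumed for illustration.
EDGES = [
    ("judge", "Do"), ("judge", "n't"), ("judge", "book"),
    ("book", "a"), ("book", "by"), ("by", "cover"), ("cover", "its"),
]

def is_connected(nodes, edges):
    """Return True if `nodes` form a connected subgraph of the tree."""
    nodes = set(nodes)
    if not nodes:
        return False
    neighbours = {n: set() for n in nodes}
    for gov, dep in edges:
        if gov in nodes and dep in nodes:
            neighbours[gov].add(dep)
            neighbours[dep].add(gov)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(neighbours[node] - seen)
    return seen == nodes

def catenae(edges, min_size=2):
    """Enumerate all connected word sets with at least `min_size` words."""
    words = sorted({w for edge in edges for w in edge})
    for size in range(min_size, len(words) + 1):
        for subset in combinations(words, size):
            if is_connected(subset, edges):
                yield frozenset(subset)

found = set(catenae(EDGES))
print(frozenset({"judge", "book", "by"}) in found)  # linked through "book"
print(frozenset({"judge", "cover"}) in found)       # no direct link between the two
```

The second set is rejected because judge and cover are connected only through book and by, which the set leaves out. Single words are excluded here through `min_size=2`, in line with the decision not to consider them in this study.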

5 For more information see: monitorcorpus.com.
6 In the remaining sections of this chapter I use a textual dependency notation inspired by one of the output formats of the Stanford Dependency Parser, whereby each dependency is placed in parentheses and prefixed by its type, e.g.: [dobj(judge, book), prep(book, by), det(book, a), pobj(by, cover), det(cover, the)].

Experimental applications of dependency-based phraseology extraction


Figure 1. A dependency structure tree for different variants of the sentential idiom “Don’t judge a book by its cover.” found in reference corpora of English.7

What we can claim more confidently is that the majority of idioms and other semantic and functional types of phraseological units are found in lexically stereotyped catenae (connected dependency subtrees), even when they do not belong to any single phrasal constituent. O’Grady (1998) gives examples of idioms spanning separate constituents such as “goose + be + cooked”, which nevertheless form simple dependency chains. Idiomatic expressions also tend to have core elements which are valid catenae even though they are not syntactically “complete” phrasal constituents, e.g. have a|no heart for, have feelings for. Many such sequences are at least recurrent and conventionalized, if not clearly idiomatic. This means that we have some reason to believe that they are stored in memory as prefabricated units, ready to be recalled and re-used in their typical contexts of occurrence. Their internal syntactic structure can usually be represented as a dependency subtree. They may also have open or partly restricted arguments which are part of their “external valency” (cf. Burger 2003). I will refer to such recurrent catenae as “idiomatic” or “collocational catenae”.

7 I used the yEd editor to create the graph visualizations in this paper. See http://www.yworks.com.


Table 1. Examples of instances of phraseological units which form connected dependency graphs (catenae).

| Type | Example | Dependency structure |
|---|---|---|
| Pure and figurative idioms | So here he is blowing the gaff on the Czech connection. [UKWaC, casi.org.uk] | [dobj(blowing, gaff), det(gaff, the), prep(gaff, on)] |
| | He is considering killing two birds with one stone by looking at the Redhat RHEL 64 bit platform. [UKWaC, dice.inf.ed.ac.uk] | [dobj(killing, birds), num(birds, two), num(stone, one), pobj(with, stone), prep(killing, with)] |
| | Don’t judge a book by its cover. | See Figure 1 |
| Restricted and open lexical collocations | At Mushara, an artesian well provides the best water. [COCA, Smithsonian] | amod(well, artesian) |
| | But police argue, no one has the right to break the law. [COCA, CBS] | [dobj(break, law), det(law, the)] |
| Phrasal and prepositional verbs | Zachary takes after his father. [COCA, We Mulvaneys] | prep(takes, after) |
| Grammatical collocations | good at | prep(good, at) |
| | look into | prep(look, into) |
| Collocational chains | Lutoslawski was the first postwar Polish composer to gain international recognition. [COCA, Christian Science Monitor (CSM)] | [dobj(gain, recognition), amod(recognition, international)] |
| Lexical bundles, discourse markers | I’m not sure if it applies to me either. [COCA, CSM] | [nsubj(sure, I), cop(sure, ’m), neg(sure, not), advcl(sure, if)] |
| | I mean the thing is – I mean Cal is right. [COCA, Fox Saturday] | [nsubj(is, thing), det(thing, the)] |
| Open-ended idiomatic expressions | “This is the poor man’s route to human space flight,” says Garvin. [COCA, Science News]; The Italians had a saying that ‘the conjugal bed is the poor man’s opera’. [BNC, A4M] | [amod(man, poor), poss(man, ’s)] |
| | Not everyone has a natural knack for choosing colours and patterns. [BNC, AM5] | [dobj(have, knack), det(knack, a), prep(knack, for)] OR [pobj(with, knack), det(knack, a), prep(knack, for)] |


It needs to be re-emphasized that the only assumption I make for the purpose of phraseology extraction in this paper is that, syntactically, the great preponderance of phraseological units are found in highly stereotyped, connected dependency structures. By “highly stereotyped dependency structures” I mean subtrees of sentence dependency trees whose lexical nodes are either obligatory or lexically and/or grammatically restricted (see the following example). As already explained, I do not claim that phraseological units have some absolutely “obligatory” elements which always form catenae or that idioms are “stored” as catenae. The latter of these two claims is made in passing by Osborne et al. (2012). On closer inspection, the claim that idioms are composed of sets of lexically invariable elements is generally problematic. There are idiomatic expressions, such as to walk a fine line between X and Y, most of whose lexical constituents can be “replaced” with a “variant”, for example: tread a thin path amid X and Y. Many idiomatic expressions are lexico-grammatical patterns. The different variants of the set of expressions which follow the general pattern not to have the slightest/foggiest/remotest idea/notion seem to share only semantic and syntactic properties, such as the requirement that 1) the nominal object should be a close synonym of idea, 2) the adjectival modifier should be used in the superlative form and 3) the entire phrase should be negated (Tucker 1996). To further illustrate such lexico-grammatical variability of idiomatic expressions, let us consider the expression everything but the kitchen sink (when used to mean almost everything), which with little controversy could be classified as a highly prefabricated idiom. As shown below, however, several variants of this expression are found in the UKWaC corpus. Some of the uses of this phrase follow its dictionary definitions. Others are puns and humorous deviations. Its most variable part is only partly open, as it is usually filled by words of different grammatical classes functioning as logical quantifiers (but, including, and or plus). Nevertheless, strictly speaking, there is no single, obligatory lexical item in this position:

• We can provide all aspects of event management including set and stage design, theming, sponsorship negotiations (…); literally everything and the kitchen sink. [busygirl.co.uk]
• At Argos you can buy just about everything, including the kitchen sink! [a2zcheats.co.uk]
• ‘We expect them to throw everything but the kitchen sink at us,’ said McGeechan ahead of the second Test. [observer.guardian.co.uk]
• Everything plus the kitchen sink. [eternit.co.uk]
• Whilst there is quite a lot of outside deck space – a far amount is given over to freight on outbound sailings as she carries everything including the proverbial kitchen sink!


Also, many idiomatic expressions have syntactic variants which are non-isomorphic dependency graphs (see the discussion of high hopes below), which means that they are not “stored” as any single catena. Rather, if they can be regarded as different instantiations of the same linguistic entity at all, they seem to have a number of prototypical dependency structures which are used more frequently than most of their variants. Alternatively, they can be treated as independent idiomatic expressions, which only diachronically stem from the same root. The following sections of this paper attempt to show how the assumption that phraseological units are by and large realized as catenae (even if they may not be “stored” as such) can be used as the basis of ad hoc phraseology extraction. In essence, the hypothesis presented here is that by detecting lexically recurrent dependency subtrees in samples of corpus data, we can discover and categorize a wide variety of phraseological units of the types illustrated above.

3. An experimental use case

In order to illustrate the method proposed in this paper, let us focus on corpus concordance aggregation as a specific use case of phraseology extraction techniques. Keywords-in-context concordances probably remain the most widely applied technique of analyzing corpus data. One of the advantages of using concordances is that, while revealing quantitative evidence of language use (in the form of collections of occurrences of words, phrases or grammatical structures), they offer some direct qualitative insights into the linguistic phenomenon in question. Concordances can also instantly reveal phraseological patterns in which a particular word or phrase occurs in a given corpus. Having said that, there are some cognitive limits on the amounts of corpus data which linguists can effectively peruse. When thousands of concordances are returned by a corpus query, as is often the case with modern-day reference corpora, aggregation and summarization techniques are needed to deal with the overwhelming abundance of data. One commonly applied method of identifying combinatorial patterns in corpus search results is to sort concordances by either the matching spans or their immediate contexts. It is also possible to perform more sophisticated query-based extraction of binary collocations (see Pęzik 2012a, 2012b) to identify the most recurrent collocates of the terms appearing in the original concordance query. To be fair, the idea of applying phraseology extraction methods to summarize concordances is by no means new. In fact, in their seminal paper on collocation extraction, Church and Hanks (1990: 23) suggest that the point-wise Mutual Information measure could be used “to speed up the labor-intensive task of categorizing the concordance lines”. In this section, I propose to extract and aggregate recurrent


catenae found in concordances fetched by a corpus query as an experimental application of dependency-based phraseology extraction. I will argue that by filtering and sorting recurrent catenae by their frequencies we discover a variety of phraseological patterns, which could otherwise be extremely time-consuming to identify. To illustrate the need for corpus data exploration and aggregation, let us assume that a user runs the following query against a large corpus of English: .8 Table 2 shows the number of sentences returned from six different reference corpora.

Table 2. Numbers of concordances for the noun test in selected corpora of English.

| Corpus | Matched sentences |
|---|---|
| BNC | 14 688 |
| COCA | 71 647 |
| UKWaC | 330 813 |
| GloWbE | 252 369 |
| Monco | 41 378 |
| COHA | 26 547 |

The sheer numbers of the concordances obtained for these queries are overwhelming. It is rather clear that identifying the recurrent word combinations formed by the noun test in these contexts may require some text mining techniques. Simply sorting these occurrences by the immediate contexts of the query word may reveal some of the positional collocates of test. Extracting recurrent n-grams from such lists of concordances, or from any other collection of sentences for that matter, could shed more light on the phraseological connections of test, but it should be remembered that n-grams (defined as sequences of n adjacent words) have no explicit syntactic status. They are a purely textual concept, which may work quite well for recurrent, linear word combinations, and especially for fixed word order discourse formulas. As an approximation of formulaicity and lexico-grammatical patterning in language, they also have very practical applications in different areas of natural language processing, such as automatic speech recognition. They are not always sufficient, however, in revealing syntactically related collocates which may occur in varying order and which may be separated by several words. For example, the simple collocation pass a test illustrated in the two concordances below cannot be captured by n-gram aggregation techniques because of the words separating the two core items of this combination:

8 This query simply matches all inflected forms of the word test.


a) A book entitled ‘How to Pass the Police Initial Recruitment Test’ is available under our lending scheme. [UKWaC, bedfordshire.police.uk]
b) Once this test has been passed, a work permit can be issued for a maximum of five years with no minimum period (…) [UKWaC, theprworks.co.uk]

On the other hand, in the dependency representation of these sentences, the two lexical nodes are directly linked by single-edge paths (direct object and nominal subject), which makes such patterns relatively easy to define and recognize. As it happens, verb + direct object combinations are one of the most obvious types of syntactic patterns in which verbal collocations occur. Aggregating lexical realizations of syntactic patterns is sometimes described as a relational approach to collocational extraction (Evert 2005), and it has been implemented in corpus search engines. For example, the so-called word sketches can be regarded as tabular listings of binary syntactic relations (Kilgarriff and Rychlý 2010). In contrast to simple binary collocation extraction, the approach presented in this paper makes it possible to aggregate lexical realizations of dependency subtrees which may be larger than combinations of two words. Seretan (2011) discusses some previously used methods of syntax-based extraction of “complex collocations”, which she defines as recurrent combinations of more than two words. She also distinguishes between intuition-based and data-driven induction of syntactic patterns used to identify complex collocations. The method described in the present paper can be used both to predefine syntactic patterns intuitively and to derive them empirically by aggregating dependency trees.
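The contrast between adjacency and direct dependency links can be sketched in a few lines. The lemmatised sentences and dependency triples below are simplified, hand-written assumptions rather than parser output, and the two helper functions are hypothetical names introduced for this illustration only.

```python
# Toy lemmatised versions of examples a) and b), with hand-written
# dependency triples (relation, governor, dependent). The lemmas and the
# parses are simplified assumptions, not actual parser output.
SENTENCES = [
    {
        "lemmas": ["how", "to", "pass", "the", "police", "initial",
                   "recruitment", "test"],
        "deps": [("dobj", "pass", "test"), ("det", "test", "the")],
    },
    {
        "lemmas": ["once", "this", "test", "have", "be", "pass"],
        "deps": [("nsubj", "pass", "test"), ("det", "test", "this")],
    },
]

def has_bigram(lemmas, first, second):
    """True if `first` is immediately followed by `second`."""
    return any(a == first and b == second
               for a, b in zip(lemmas, lemmas[1:]))

def has_direct_link(deps, word_a, word_b):
    """True if one dependency edge links the two lemmas, in either direction."""
    return any({gov, dep} == {word_a, word_b} for _, gov, dep in deps)

for sent in SENTENCES:
    print(has_bigram(sent["lemmas"], "pass", "test"),
          has_direct_link(sent["deps"], "pass", "test"))
```

Neither toy sentence contains the adjacent sequence pass + test, while both contain a single edge linking the two lemmas, which is the property the relational approach exploits.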

4. An implementation

Figure 2 shows a screenshot of SlopeQ Desktop, an application which can be used as a simple concordancer for one of the six reference corpora mentioned above.9 On submitting a SlopeQ syntax query (Pęzik 2015a), the user obtains a keywords-in-context concordance from the back-end server. Clicking on the “Chains” button in the top-right corner of the results screen brings up a dialog with several parameters which can be used to find and aggregate recurrent catenae in the current list of concordance results.

9 For more information, see: pezik.pl → SlopeQ Desktop.


Figure 2. An implementation of query-based concordance aggregation in SlopeQ Desktop.

Concordances are sent to a backend extraction module, where they are part-of-speech tagged with the Apache OpenNLP PoS tagger,10 dependency parsed by Maltparser (Nivre, Hall and Nilsson 2006) and transformed into graph database objects which can be traversed and filtered to extract relevant subtree patterns. Since finding all subtrees of large sets of sentence trees found in a set of concordances can be computationally expensive, it is obligatory to specify at least one lexical node which the resulting catenae should contain. In fact, it is possible to specify more than one lexical node, each with its part-of-speech tag, in this text field. The morphological tags may be defined with regular expressions. In the example above, only catenae which contain the lemma test annotated as a noun will be considered in the results. Additional morphological tags of unspecified lemmas can be defined as a separate option. For example, in order to limit the set of catenae to potential phraseological units which contain a verb node (such as verbal idioms and noun + verb collocations), one can enter the string V.* in the second option field available. It is also possible to require that the extracted catenae should contain at least one of a set of specified dependencies. This option is left blank in the example above, but it could, for instance, be set to amod, if we were only interested in recurrent combinations of the noun test with verbs and an

10 For more information, see: https://opennlp.apache.org/.


additional adjectival modifier. I will illustrate the use of this option when discussing the extraction of collocational chains below. Finally, it is possible to specify the minimum frequency threshold and the length of the catenae to be extracted. Table 3 shows 16 recurrent catenae which occurred in a sample of 1 000 BNC concordances extracted for the word test with the catena extraction filters discussed above. It has to be clarified that SlopeQ Desktop currently only extracts non-branching subtrees containing at least two word nodes, i.e. paths rather than any subtrees of the sentence dependency tree. (This may change in the near future with some planned optimizations of the subtree generation algorithm.) Despite this limitation and the relatively small size of the concordance sample, we see a variety of recurrent combinations retrieved for this query in Table 3. Some of them seem to be spurious compositions, e.g. cop(test, be). However, the majority of these catenae have a clear phraseological status as light-verb restricted collocations (have a test (1), take a test (8)), partly restricted and open lexical collocations (pass a test (10), stand the test (15), fail + test (13), test + show (9)). Part of the novelty of the proposed approach, which sets it apart from many binary collocation extraction methods, lies in its ability to identify longer idiomatic sequences such as put sth to the test (4), a self-contained, fixed combination of four words.

Table 3. Collocational catenae extracted from concordances of the noun test.

| # | Catena | Type | Length | Count | Parents |
|---|---|---|---|---|---|
| 1 | dobj(have, test) | dobj | 2 | 13 | 1 |
| 2 | cop(test, be) | cop | 2 | 12 | 0 |
| 3 | pobj(to, test), prep(put, to) | pobj_prep | 3 | 12 | 1 |
| 4 | det(test, the), pobj(to, test), prep(put, to) | det_pobj_prep | 4 | 11 | 0 |
| 5 | nsubj(be, test) | nsubj | 2 | 9 | 1 |
| 6 | det(test, the), nsubj(be, test) | det_nsubj | 3 | 8 | 0 |
| 7 | det(test, a), dobj(have, test) | det_dobj | 3 | 7 | 0 |
| 8 | dobj(take, test) | dobj | 2 | 7 | 1 |
| 9 | nsubj(show, test) | nsubj | 2 | 7 | 0 |
| 10 | dobj(pass, test) | dobj | 2 | 6 | 0 |
| 11 | det(test, the), dobj(stand, test) | det_dobj | 3 | 5 | 0 |
| 12 | det(test, the), dobj(take, test) | det_dobj | 3 | 5 | 0 |
| 13 | dobj(fail, test) | dobj | 2 | 5 | 0 |
| 14 | dobj(play, test) | dobj | 2 | 5 | 0 |
| 15 | dobj(stand, test) | dobj | 2 | 5 | 1 |
| 16 | pobj(in, test), prep(play, in) | pobj_prep | 3 | 5 | 0 |
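The kind of aggregation summarized in this table can be approximated with the following sketch. It is not the SlopeQ Desktop implementation: the parses are hand-written toy data, the function names are hypothetical, non-branching subtrees are interpreted as governor-to-descendant chains, and each catena is counted at most once per sentence, which is an assumption about how the Count column is computed.

```python
from collections import Counter

# Toy parses: lists of (relation, governor, dependent) triples over lemmas.
# Hand-written for illustration; real input would come from a parser.
PARSES = [
    [("prep", "put", "to"), ("pobj", "to", "test"), ("det", "test", "the")],
    [("prep", "put", "to"), ("pobj", "to", "test"), ("det", "test", "the")],
    [("dobj", "pass", "test"), ("det", "test", "a")],
    [("dobj", "pass", "test"), ("det", "test", "the")],
]

def downward_chains(deps, max_len):
    """Yield each non-branching governor-to-descendant chain of 2..max_len words."""
    children = {}
    for rel, gov, dep in deps:
        children.setdefault(gov, []).append((rel, gov, dep))

    def extend(chain, last_word):
        if chain:
            yield frozenset(chain)
        if len(chain) + 2 <= max_len:  # one more edge adds one more word
            for edge in children.get(last_word, []):
                yield from extend(chain + [edge], edge[2])

    for start in children:
        yield from extend([], start)

def aggregate(parses, seed, min_freq=2, max_len=4):
    """Count, per sentence, the chains containing `seed`; keep the recurrent ones."""
    counts = Counter()
    for deps in parses:
        chains = {c for c in downward_chains(deps, max_len)
                  if any(seed in (gov, dep) for _, gov, dep in c)}
        counts.update(chains)
    return {c: n for c, n in counts.items() if n >= min_freq}

def parents(catena, recurrent):
    """Number of longer recurrent catenae that subsume `catena`."""
    return sum(1 for other in recurrent if catena < other)

recurrent = aggregate(PARSES, seed="test")
for catena, count in sorted(recurrent.items(), key=lambda kv: -kv[1]):
    print(sorted(catena), count, parents(catena, recurrent))
```

Representing each catena as a set of dependency triples makes the subsumption check a plain proper-subset test, so the Parents value of put + to + test is obtained by finding the one longer recurrent chain that contains both of its edges.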


Including catenae of different lengths leads to the problem of structural subsumption, whereby shorter chains are usually, but not necessarily always, subsumed by longer ones. To indicate this type of relation between the extracted items, I include the number of longer, recurrent catenae which subsume the current item in the Parents column. For example, the three-word catena put + to + test has one parent in the analyzed set of concordances (put to the test). A detailed list of parents can be accessed by clicking on a particular item. The precision of dependency-based phraseology extraction may be increased by adding restrictions and filters on the type of chains to be aggregated. Table 4 lists a number of collocational chains which were extracted from BNC concordances obtained for the word recognition. A collocational chain is sometimes defined as a combination of two restricted collocations which share the same collocator (the synsemantic element of a collocation), whereas a collocational cluster is a combination of two collocations sharing the same base (the autosemantic element of a collocation) (Hausmann 1997, 2004; Heid and Gouws 2006; Svensén 2009). In the present paper I use this term to refer to combinations of collocations directly linked in their dependency representation. Such combinations do not necessarily contain semantically restricted elements, since they may consist of open collocations with no “synsemantic” constituents. For the purposes of the present paper we may assume that all collocational chains are collocational catenae containing at least three word nodes. Non-recurrent collocational chains (spurious combinations of collocations which are clearly not “institutionalized” or prefabricated) cannot be considered to be collocational catenae.
One of the interesting aspects of collocational chains is that, although they can be one-off combinations of otherwise established collocations, they are often recurrent as wholes, even when they are composed of largely open collocations. Collocational chains are some of the less studied types of phraseological processes. This is partly because their status as partly prefabricated and reused rather than recomposed units of meaning is not obvious. Due to their low frequency, most collocational chains can only be discovered in large corpora of several hundred million words. I will argue that catena-based phraseology extraction facilitates the discovery of collocational chains. For example, let us consider the collocational chains found in the set of more than 5 000 BNC concordances obtained for the word recognition. When retrieving these items, I searched for catenae which occur in the initial set of concordances at least 3 times and contain the noun recognition, an adjective and a direct object dependency. I also limited the results to catenae containing three or more word nodes.
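The filters just described can be expressed as a simple selection over a frequency table of catenae. The sketch below is an illustration under stated assumptions: the counts are invented toy numbers rather than corpus figures, `select_chains` is a hypothetical helper, and the presence of an adjective is approximated by requiring an amod dependency instead of checking part-of-speech tags.

```python
# Toy frequency table: each catena is a frozenset of
# (relation, governor, dependent) triples. The counts are invented
# for this example and are not corpus figures.
COUNTS = {
    frozenset({("dobj", "gain", "recognition")}): 12,
    frozenset({("amod", "recognition", "international"),
               ("dobj", "gain", "recognition")}): 5,
    frozenset({("amod", "recognition", "diplomatic"),
               ("dobj", "extend", "recognition")}): 3,
    frozenset({("amod", "recognition", "belated"),
               ("dobj", "win", "recognition")}): 1,
    frozenset({("amod", "recognition", "international"),
               ("pobj", "of", "recognition")}): 4,
}

def word_nodes(catena):
    """Set of lemmas that appear as governor or dependent in the catena."""
    return {word for _, gov, dep in catena for word in (gov, dep)}

def select_chains(counts, lemma, required_rels, min_freq=3, min_nodes=3):
    """Keep recurrent catenae with the lemma, all required relations and enough nodes."""
    selected = [
        (catena, n) for catena, n in counts.items()
        if n >= min_freq
        and lemma in word_nodes(catena)
        and required_rels <= {rel for rel, _, _ in catena}
        and len(word_nodes(catena)) >= min_nodes
    ]
    return sorted(selected, key=lambda item: -item[1])

chains = select_chains(COUNTS, "recognition", {"amod", "dobj"})
for catena, n in chains:
    print(sorted(catena), n)
```

With these toy counts only the two chains combining amod and dobj with a frequency of at least 3 survive; the bare dobj pair, the one-off chain and the amod + pobj chain are filtered out.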


Table 4. Recurrent collocational chains extracted from ca. 5 000 BNC concordances of the noun recognition.

| # | Catena | Type | Count |
|---|---|---|---|
| 1 | amod(recognition, international), dobj(gain, recognition) | amod_dobj | 10 |
| 2 | amod(recognition, formal), dobj(give, recognition) | amod_dobj | 8 |
| 3 | amod(recognition, full), dobj(gain, recognition) | amod_dobj | 6 |
| 4 | amod(recognition, international), dobj(achieve, recognition) | amod_dobj | 5 |
| 5 | amod(recognition, international), dobj(receive, recognition) | amod_dobj | 5 |
| 6 | amod(recognition, official), dobj(give, recognition) | amod_dobj | 5 |
| 7 | amod(recognition, full), dobj(give, recognition) | amod_dobj | 4 |
| 8 | amod(recognition, great), dobj(give, recognition) | amod_dobj | 4 |
| 9 | amod(recognition, statutory), dobj(give, recognition) | amod_dobj | 4 |
| 10 | amod(recognition, diplomatic), dobj(extend, recognition) | amod_dobj | 3 |

This result set is more focused than the previous one. We see more than 10 recurrent collocational chains, in which the noun recognition is simultaneously the governor of the adjectival modifier dependency and the direct object (i.e. a dependent) of some verb. Most of them are also found in the UKWaC and COCA corpora. A closer look at the concordances of these chains, which can be accessed by clicking any particular item in the list of results, shows that they may play a role in achieving native-like selection and fluency (Pawley and Syder 1983). The use of recurrent chains such as achieve|gain international recognition, give statutory recognition or extend diplomatic recognition is characteristic of fluent, idiomatic English, even though many such combinations are hardly ever attested in dictionaries due to their superficial compositionality.

5. Dependency queries vs. positional queries

One could argue that similar lists of recurrent catenae could be extracted by aggregating the results of positional phrase queries. By “positional phrase queries” I mean queries which require at least two word terms co-occurring within a predefined context. Some of these terms can be lexically underspecified part-of-speech tags. For example, the following SlopeQ query: (gain** advantage**)~10 can be formulated to find concordances containing the words gain and advantage occurring within a distance of up to ten words of each other. If we apply catena filters similar to the ones used in the example discussed above, we obtain a list of collocational chains featuring adjectival and verbal collocations similar to the one shown in Table 5.


Table 5. Collocational chains of the noun advantage.

| # | Catena | Type | Count |
|---|---|---|---|
| 1 | amod(advantage, competitive), dobj(gain, advantage) | amod_dobj | 33 |
| 2 | amod(advantage, political), dobj(gain, advantage) | amod_dobj | 14 |
| 3 | amod(advantage, unfair), dobj(gain, advantage) | amod_dobj | 14 |
| 4 | amod(advantage, significant), dobj(gain, advantage) | amod_dobj | 9 |
| 5 | amod(advantage, tactical), dobj(gain, advantage) | amod_dobj | 8 |
| 6 | amod(advantage, economic), dobj(gain, advantage) | amod_dobj | 7 |
| 7 | amod(advantage, strategic), dobj(gain, advantage) | amod_dobj | 7 |
| 8 | amod(advantage, temporary), dobj(gain, advantage) | amod_dobj | 7 |
| 9 | amod(advantage, decisive), dobj(gain, advantage) | amod_dobj | 3 |
| 10 | amod(advantage, long-term), dobj(gain, advantage) | amod_dobj | 3 |

Although it is in principle possible to aggregate concordances on matching query terms using only positional word co-occurrence criteria, the advantage of using dependency annotations for this purpose is that they make it possible to explicitly define the types of syntactic links between the individual words. It is possible to specify a positional query such as (gain** advantage**)~10 to get a set of contexts that would largely overlap with the results shown in Table 5. However, the precision11 of such a result set may be noticeably lower than that of filtering queries using explicit dependency notation. The results of such a query would contain a number of underspecified adjectives which do not form collocational chains with the direct object dependency gain + advantage, especially if we allow the query terms to co-occur in various word order permutations. Some of such false positives of a positional query which is intended to retrieve instances of a collocational chain are illustrated in the following examples from the COCA:

a) advantage of being able to gain
b) advantage he might gain over other
c) deliberate attempt to gain an advantage
d) advantages gained by changing to hot
e) advantages such as gaining
f) advantage, as well as the large gain

11 Precision is usually defined as the number of true positives divided by the sum of false positives and true positives in the result set.


All of these contexts contain the words gain, advantage and an underspecified adjective, even though none of them contains a collocational chain due to the lack of a direct syntactic link between these items. Apart from the problem of false positives, the boundaries of concordance matches may be difficult to mark when lexically underspecified terms are defined in a positional query. The three matches below contain valid instances of the collocational chain in question, but their left boundary is defined by adjectives preceding the actual collocational chain combination:

a) racial profiling gains unwarranted advantage
b) temp firm believing he could gain a competitive advantage
c) vile nativist passions and gain cheap advantage

Given that automatic dependency annotations contain errors, aggregating the results of positional queries could be expected to yield higher recall12 rates, as they may match syntactic realizations of potential collocational chains which are difficult to predict. For example, the words high and hope(s) may form a restricted collocation:

a) Alan Shearer wore the number ten shirt with high hopes of scoring the goals in the first international since the retirement of Gary Lineker. [BNC, CH3]
b) For a while hopes were high, hotel and other construction soared, and expectations rose for tourism and other opportunities. [COCA, Christian Science Monitor]
c) The Uist men, however, led by a few strokes, and hopes of winning ran high amongst them when Colla MacLeod. [UkWAC, gdl.cdlr.strath.ac.uk]

Whenever such syntactic variability of a phraseological unit is suspected, it is recommended to leave the dependencies between the query words unspecified. An example result set obtained for a filtering query which only requires that the matching catenae contain combinations of the words high and hope is shown in Table 6. Instances of this collocation extracted from a sample of 1223 COCA concordances include catenae which consist of adjectival modifier and nominal subject dependencies similar to the ones illustrated above. There are also longer recurrent chains containing the word hope as a verbal or prepositional object (have high hopes and with high hopes).

12 Recall is defined as the number of the true positives divided by the sum of false negatives and true positives in the result set.
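The two evaluation measures defined in footnotes 11 and 12 can be written compactly as follows, where TP, FP and FN stand for the numbers of true positives, false positives and false negatives. The symbols P and R are introduced here only for convenience.

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}
```

As a purely hypothetical illustration, a positional query returning 9 contexts of which 3 are genuine instances of the intended collocational chain would have a precision of 3/9, i.e. about 0.33, regardless of how many genuine instances it failed to retrieve.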


Table 6. Different syntactic configurations of the collocation high + hopes.

| # | Catena | Type | Count |
|---|---|---|---|
| 1 | amod(hope, high) | amod | 486 |
| 2 | amod(hope, high), dobj(have, hope) | amod_dobj | 231 |
| 3 | nsubj(high, hope) | nsubj | 72 |
| 4 | amod(hope, high), pobj(with, hope) | amod_pobj | 62 |

Such unsupervised exploration of concordance results may occasionally reveal more general lexico-grammatical patterns. For example, many idiomatic expressions which we might expect to require the lexical verb have with a dependent direct object (e.g. have green fingers/have a green thumb, have a knack for + pobj, have an eye for + pobj) are frequently realized as non-finite prepositional clauses introduced by with, e.g.:

a) Surowiecki is a vivid writer with a knack for culling entertaining examples. [COCA, Christian Science Monitor]
b) Sisal Rope should be the choice for anyone with a green thumb. [COCA, USA Today]
c) To anyone with an eye for buildings, a town in France does not look the same as a town in Germany. [COCA, Geographical Review]

This again illustrates the point made above that many phraseological units may have multiple recurrent syntactic variants. At the same time, their seemingly open positions tend to be partly restricted, as they only allow words of certain semantic or syntactic classes. In the example above, the preposition with denotes possession, which makes it semantically related to the lexical verb have. What is important in the context of syntax-based phraseology extraction is that a large number of such patterns can be revealed by aggregating recurrent dependency structures on their lexical arguments. The problem of relating different variants of idiomatic expressions to one another, which may be important from a lexicographic point of view, is not considered in this paper.

6. Automatic Combinatorial Dictionaries

So far, I have focused on ad hoc concordance analysis as a possible use case of dependency-based phraseology extraction. The same methodology of identifying collocational catenae can be used on a much larger scale to generate pre-computed Automatic Combinatorial Dictionaries (ACD’s) for selected dependency patterns. ACD’s can be defined as databases of recurrent word combinations (Pęzik 2012b, 2013). Their function is similar to that of collocation dictionaries. They serve primarily as “production dictionaries” (Svensén 2009) which list recurrent collocations whose meaning is largely compositional, but which are nevertheless difficult to predict as native-like, especially by non-native writers and speakers of a given language, or by translators, who are also potentially affected by negative phraseological transfer, even when translating into their first language (Pęzik 2011).

The macrostructure of ACD’s is usually derived from a list of part-of-speech annotated lemmas. Each PoS-disambiguated lemma becomes a headword in the dictionary, which means that the word break, for example, has two separate entries, one as a noun and one as a verb. Next, for each type of lemma a set of grammatical patterns is defined to extract recurrent combinations from a reference corpus. In dependency-based phraseology extraction, these rules can be specified simply as chains of dependencies in which the seed lemma for which the entry is being generated is found. Table 7 shows such a simple set of rules used to extract collocations and structurally larger types of PU’s from the dependency contexts of the noun fraction found in the COCA.

Table 7. Types of noun-based dependency chains considered in ACD extraction.

| Type | Distinct catenae | Example |
|---|---|---|
| prep_pobj | 190 | fraction + of + second |
| dobj | 98 | represent + fraction |
| dobj_prep | 68 | represent + fraction + of |
| amod | 64 | small + fraction |
| nsubj | 37 | fraction + small |
| pobj | 25 | for + fraction |
| prep | 18 | fraction + of |
| dobj_amod | 16 | represent + small + fraction |
| dobj_amod_prep | 15 | represent + small + fraction + of |
| adv_amod | 5 | very + small + fraction |
| dobj_prt | 2 | make + up + fraction |

In this set of eleven types of catenae, combinations of nouns with their prepositional objects turn out to be the most productive. This can be predicted from the general meaning of fraction as a proportion of some larger entity. For every extracted catena in every entry I compute a number of distributional properties in addition to simple counts of occurrences. These properties can be placed in four general categories: raw and normalized frequencies, dispersion, strength of association and independence. Some of these statistics (see Table 8) are widely used in phraseology extraction and corpus studies. For example, the catena fraction of (a) second was detected 183 times, in 175 texts and 105 different COCA sources. The association between the word nodes found in this combination is relatively strong. A number of statistical collocation extraction measures can be used to express the strength of the weakest link found in each chain.

Table 8. Some distributional properties of two collocational catenae containing the noun fraction.

| Property | fraction of | fraction of a second |
|---|---|---|
| Frequency | 2956 | 183 |
| Texts | 2621 | 175 |
| Sources | 461 | 105 |
| Strength | 9 | 9 |
| Containing recurrent chains | 270 | 0 |
| Contained chains | 0 | 1 |
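The frequency, dispersion and weakest-link properties described above might be computed along the following lines. This is only a sketch under stated assumptions: the record layout (Occurrence) and the toy data are invented for illustration, and pointwise mutual information is used as a stand-in association measure, since the text does not specify which measure underlies the Strength row of Table 8.

```python
import math
from collections import namedtuple

# One record per corpus hit; the field names are illustrative assumptions.
Occurrence = namedtuple("Occurrence", ["catena", "text_id", "source"])


def dispersion(occurrences, catena):
    """Raw frequency plus the number of distinct texts and sources."""
    hits = [o for o in occurrences if o.catena == catena]
    return {
        "frequency": len(hits),
        "texts": len({o.text_id for o in hits}),
        "sources": len({o.source for o in hits}),
    }


def pmi(pair_count, head_count, dep_count, total_pairs):
    """Pointwise mutual information (log base 2) for one dependency link."""
    return math.log2(pair_count * total_pairs / (head_count * dep_count))


def weakest_link(link_scores):
    """Score a chain by its least strongly associated link."""
    return min(link_scores)


# Toy data, invented for illustration only.
sample = [
    Occurrence("fraction of a second", "t1", "magazine_A"),
    Occurrence("fraction of a second", "t1", "magazine_A"),
    Occurrence("fraction of a second", "t2", "newspaper_B"),
    Occurrence("fraction of", "t3", "newspaper_B"),
]
stats = dispersion(sample, "fraction of a second")
chain_strength = weakest_link([pmi(8, 16, 32, 1024), pmi(2, 16, 16, 1024)])
print(stats, chain_strength)
```

Taking the minimum over link scores means that a chain is rated no higher than its least predictable link, so a long chain is not promoted merely because one of its links is a strong collocation.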

Measuring the structural independence of catenae, however, may be less obvious. As mentioned above, although PU’s may have their core parts, they also tend to have some external valency, that is to say, they tend to take syntactic arguments which are lexically predictable. In many cases, shorter word combinations are frequently subsumed by longer structures, although they may also function independently. The full complexity of this phenomenon for just one ACD entry is illustrated in Figure 3, which shows a subsumption graph for 271 catenae containing the word fraction and occurring at least 5 times in the BNC. The directed edges in this graph represent the containment relationship, which can be recursive. The disconnected vertices in the top-right corner of the graph are never contained in the sample of concordances used to build this graph.


Figure 3. A subsumption graph for a subset of recurrent catenae containing the noun fraction.
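A containment graph of the kind shown in Figure 3 can be approximated with a short sketch. It assumes that each catena is stored as a set of dependency triples and that one catena contains another when the edges of the latter form a proper subset of the edges of the former; the four toy catenae and the function names are hypothetical.

```python
# Hypothetical catenae, each stored as a frozenset of dependency triples
# written as (relation, head, dependent).
catenae = {
    "fraction of": frozenset({("prep", "fraction", "of")}),
    "fraction of a second": frozenset(
        {("prep", "fraction", "of"), ("pobj", "of", "second")}
    ),
    "small fraction of": frozenset(
        {("amod", "fraction", "small"), ("prep", "fraction", "of")}
    ),
    "make up fraction": frozenset(
        {("dobj", "make", "fraction"), ("prt", "make", "up")}
    ),
}


def subsumption_edges(cats):
    """Return (contained, containing) pairs based on proper-subset tests."""
    return [
        (small, large)
        for small, small_edges in cats.items()
        for large, large_edges in cats.items()
        if small_edges < large_edges  # "<" on frozensets means proper subset
    ]


def parent_counts(cats):
    """Count, for every catena, how many longer catenae contain it."""
    counts = {name: 0 for name in cats}
    for contained, _containing in subsumption_edges(cats):
        counts[contained] += 1
    return counts


edges = subsumption_edges(catenae)
parents = parent_counts(catenae)
print(edges)
print(parents)
```

In this toy set, fraction of has two “parents”, while make up fraction neither contains nor is contained by any other chain, which corresponds to a disconnected vertex in the graph.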

Pęzik (2015b) suggests a measure of n-gram independence which makes it possible to rank multiword units by their tendency to occur outside of longer recurrent units. In the experimental ACD described in this section, I use a simpler indication of catena independence. As shown in Table 8, the combination fraction of is contained by 270 other catenae which occur at least 3 times in the corpus and which are labeled as “parents” here. The catena fraction of a second is never contained by longer recurrent catenae.

The dependency-based ACD extracted for all nouns, verbs and adjectives which occur at least 100 times in the COCA contains 36 420 entries and 3 720 343 catenae, each of which occurs at least 3 times in the corpus. Such a dictionary can be used not only to discover and validate specific phraseological units, but also to show the approximate levels of incidence of phraseological prefabrication in language. As predicted at the beginning of this paper, ACD’s confirm that the Principle of Compositionality seems to be complemented by the Idiom Principle. Out of the billions of possible combinatorial choices, native speakers of English seem to consistently select millions of lexically recurrent catenae, many of which seem to be remembered, if not holistically then at least associatively. The role of memory and prefabrication in achieving native-like selection and native-like delivery (Oppenheim 2000) is therefore fundamental, and corpus-derived ACD’s are useful in revealing and exploring prefabrication patterns.

7. Applications in data-driven language learning and teaching

In the last sections of this paper I briefly introduce the functionality of a prototype application called Phrime13, which has been designed as a phraseodidactic and phraseostylistic platform for foreign language learning and teaching. Phrime is essentially a web-based application which makes use of dependency-based phraseology extraction to support the exploration, learning and teaching of English phraseology. The basic problem it addresses has already been identified above: even when phraseological units such as open collocations are syntactically regular and semantically transparent, they are difficult for foreign speakers of a given language to predict reliably. Restricted collocations and idioms are additionally problematic from the perspective of foreign language reception.

There are currently four main modules of Phrime called Explore, Read, Write and Learn. The first of these modules is a search and exploration interface for the underlying ACD. Users can enter keyword queries which are matched against the entry headwords and specific catenae stored in the dictionary. An entry page lists the set of catenae found for a seed word. Each catena is presented on a separate page with example concordances, a dependency structure diagram and statistics of usage in the reference corpus. Additionally, a summary of external and internal valency of all recurrent catenae is provided in the form of bar plots showing their constituents and “parent” catenae. Users can edit and bookmark such chains. Concordances available for individual catenae can also be selected and used to semi-automatically create data-driven vocabulary exercises.

Although this functionality can be used independently, it becomes particularly useful in the language teaching module of Phrime, which makes it possible to create text-oriented phraseodidactic courses for learners of English. Users who take on the role of teachers select a set of texts and process them in the phraseology detection module. Catenae found in a given text are confirmed as phraseological units and the teacher can then select their concordances from the reference corpus to build an exercise in which some of the constituents of phraseological units are gapped. For example, in the following paragraph from the Travel section of the New York Times website, the catenae recognized by Phrime as recurrent in the reference corpus are marked in bold.

“Still, to get a duty-free bargain on anything other than liquor and cigarettes, you have to do some research before you travel. An online sale or in-store promotion can make an item cheaper at home than in a duty-free shop. And if you’re buying makeup, you might find that the discount is not better than the free samples or gift set offered at your hometown department store. If you’re in the market for a Louis Vuitton bag or Chanel perfume, know what it costs at home and on the web so you’ll recognize a deal if and when you see it during your vacation”. (Stephanie Rosenbloom, New York Times)14

13 Phrime (from phraseological priming) is currently being developed in a research project at Transition Technologies S. A. For current versions of Phrime, see: http://phrime.tt.com.pl.

As shown in Figure 4, these recognized chains are presented in a table below the annotated text. The teacher can click on a chain, go to its ACD page and view a list of relevant concordances in the corpus.

Figure 4. A sample of collocational catenae recognized in a paragraph of English text.

14 See http://goo.gl/rMh99s, accessed on October 10th 2015.

For example, the following concordances can be found for some of the catenae matched above:

• In other words, Constantine recognized a good deal when he saw it and therefore called the council to ensure male power (…). [COCA, The Da Vinci deception]
• I mean, the Palestinians have simply refused to recognize the good deal that has been put in front of them. [COCA, CNN_Politics]
• And, most of all, we need to create a strong market for multicultural literature by buying the books that do get published. [COCA, Teacher Librarian]
• “The market for composting systems appears good right now”, he adds. [COCA, BioCycle]

The first two concordances illustrate the use of the semi-restricted verb + direct object lexical collocation to recognize a deal. The other two examples illustrate the grammatical collocation market for. Both of these collocations are realized as simple catenae: dobj(recognize, deal) and prep(market, for). These combinations are partly restricted and they may be difficult for foreign learners of English to “predict” compositionally. Consequently, it may make sense to include such items in follow-up vocabulary exercises, which help learners memorize phraseological items by trying to predict their use in authentic contexts other than the original text in which they were first recognized by Phrime.

Sets of concordances selected by the teacher are automatically converted into a gap-filling exercise in which one or more constituents of a catena are blanked out. Such exercises can be assigned to a particular text or printed for offline use. It is also possible to create Phrime “challenges”, where concordances of different catenae are used to illustrate the use of a single lexical item found in all such chains. Table 9 shows such an exercise generated from a set of collocational catenae containing the target word hope.

Table 9. A vocabulary exercise utilizing the effect of phraseological priming.

| Concordance | Source |
|---|---|
| He has high ______ for his five grandsons, who occasionally come to watch the work. | New York Times |
| She is the person he pins all his _____ and dreams on. | USA Today |
| Tuesday’s election results dashed his party’s _____ of returning to the majority and his own hopes of ever becoming speaker. | Associated Press |
| Bush held out ____ for the Fallujah talks, saying the United States was “open to suggestions” on reducing the violence. | Associated Press |
| Expect the worst, _____ for the best. | CBS 48Hours |
| The U.S. State Department, still reduced to observer status, seemed to have abandoned any _____ of Senate ratification until 1997. | Environment |
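The conversion of a selected concordance into a gap-filling item can be sketched as follows. This is an illustrative assumption about how such a step could work, not a description of Phrime’s implementation; the function name make_gap and its parameters are hypothetical.

```python
import re


def make_gap(sentence, target_forms, blank="_____"):
    """Blank out the first occurrence of any of the given word forms."""
    alternatives = "|".join(re.escape(form) for form in target_forms)
    pattern = r"\b(" + alternatives + r")\b"
    return re.sub(pattern, blank, sentence, count=1, flags=re.IGNORECASE)


forms = ["hope", "hopes"]
item_1 = make_gap("Expect the worst, hope for the best.", forms)
item_2 = make_gap(
    "Tuesday's results dashed his party's hopes of returning to the majority "
    "and his own hopes of ever becoming speaker.",
    forms,
)
print(item_1)
print(item_2)
```

Only the first matching form is replaced, so a second occurrence of the target in the same concordance stays visible, as in the third row of Table 9.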


These concordances serve as “primes” which trigger the recognition of hope as their target. It is usually sufficient to select concordances for just three partly restricted catenae to obtain such a unique “collocational fingerprint” of the target word. This effect can be described as phraseological priming or “triangulation”. In the extreme cases of so-called cranberry collocations (Moon 1998), such as kith and kin or to and fro, a single bound lexeme is enough to prime the target. Such cases are comparatively rare, however. Phraseological priming exercises can be generated by any Phrime user and shared on social media platforms, thus creating a possible “gamification” effect, whereby users learn phraseological units by solving word puzzles.
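The “triangulation” effect can be illustrated by intersecting the sets of words attested in the gaps of several catenae. The mini-lexicon below is invented for the sake of the example and is not drawn from the ACD; the function names are likewise hypothetical.

```python
# Hypothetical mini-lexicon: gapped catena frame -> lemmas attested in the gap.
frames = {
    "high ___": {"hope", "price", "level", "risk", "expectation"},
    "dash ___": {"hope", "expectation"},
    "pin ___ on": {"hope", "blame", "medal"},
    "hold out ___": {"hope", "hand", "promise"},
}


def candidates(frame_names, lexicon):
    """Lemmas that can fill the gap in every one of the given frames."""
    return set.intersection(*(lexicon[name] for name in frame_names))


def primes_needed(frame_names, lexicon):
    """How many frames, taken in order, leave exactly one candidate."""
    for k in range(1, len(frame_names) + 1):
        if len(candidates(frame_names[:k], lexicon)) == 1:
            return k
    return None  # the frames never narrow the choice down to one lemma


order = ["high ___", "dash ___", "pin ___ on"]
print(candidates(order[:2], frames))
print(candidates(order, frames))
print(primes_needed(order, frames))
```

With these toy sets, the first two frames still leave two candidates and the third frame narrows the choice to a single lemma, which mirrors the observation that a few partly restricted catenae are usually enough.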

8. Summary and future work

In this paper, I presented a method of dependency-based phraseology extraction. By identifying and aggregating recurrent subtrees or “catenae” of sentence dependency trees, we can extract textually recurrent components of collocations, collocational chains, idioms and other types of PU’s and rank them by distributional criteria. The first implementation of the method discussed in this paper is used to support the analysis of large sets of concordances extracted from reference corpora. Next, I showed how the same procedure can be used to generate reference ACD’s and how such resources can be used to reveal and explore the immense richness of phraseological prefabrication. Finally, I briefly illustrated the applications of dependency-based phraseology extraction in the area of phraseodidactics and phraseostylistics.

Future versions of the method proposed in this paper should include a greater variety of data-driven dependency subtrees, as the experiment described in this paper was limited to non-branching paths. There is also room for improvement in the area of normalizing and grouping syntactically different but semantically similar PU’s.

References

Baggio, G., van Lambalgen, M. & P. Hagoort. 2012. The processing consequences of compositionality. In M. Werning, W. Hinzen & E. Machery (eds.), The Oxford handbook of compositionality, 657–674. Oxford: Oxford University Press.

Biber, D. & F. Barbieri. 2007. Lexical bundles in university spoken and written registers. English for Specific Purposes, 26(3), 263–286.

BNC, C. 2001. The British National Corpus, version 2 (BNC World). Distributed by Oxford University Computing Services.


Bolinger, D. 1979. Meaning and memory. In G. Haydu (ed.), Experience forms: their cultural and individual place and function, 95–111. Berlin: De Gruyter Mouton.

Burger, H. 2003. Phraseologie: eine Einführung am Beispiel des Deutschen. Berlin: E. Schmidt.

Chomsky, N. 1964. Current issues in linguistic theory. Berlin: Walter de Gruyter.

Chomsky, N. 1978. Topics in the theory of generative grammar. Berlin: Walter de Gruyter.

Church, K. W. & P. Hanks. 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1), 22–29.

Cowie, A. P. 1998. Phraseology: theory, analysis, and applications. Oxford: Oxford University Press.

Cowie, A. P. & R. Mackin (eds.). 1975. Oxford dictionary of current idiomatic English. Oxford: Oxford University Press.

Cowie, A. P., Mackin, R. & I. R. McCaig. 1993. Oxford Dictionary of English Idioms. Oxford; New York: Oxford University Press.

Evert, S. 2005. The statistics of word cooccurrences. Word Pairs and Collocations. (PhD Thesis). Institut für Maschinelle Sprachverarbeitung, Stuttgart.

Ferraresi, A., Zanchetta, E., Baroni, M. & S. Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4) Can we beat Google, 47–54.

Gläser, R. 1998. The Stylistic Potential of Phraseological Units in the Light of Genre Analysis. In A. P. Cowie (ed.), Phraseology: theory, analysis, and applications, 124–143. Oxford: Oxford University Press.

Kilgarriff, A. & P. Rychlý. 2010. Semi-Automatic Dictionary Drafting. In G.-M. De Schryver (ed.), A way with words: recent advances in lexical theory and analysis: a festschrift for Patrick Hanks, 299–312. Kampala: Menha Publishers.

Mel’čuk, I. 2001. Collocations and lexical functions. In A. P. Cowie (ed.), Phraseology: Theory, Analysis, and Applications, 23–54. Oxford: Oxford University Press.

Moon, R. 1998. Fixed expressions and idioms in English: A corpus-based approach. Oxford: Clarendon Press.

Nivre, J., Hall, J. & J. Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In Proceedings of LREC, Vol. 6, 2216–2219.

O’Grady, W. 1998. The syntax of idioms. Natural Language & Linguistic Theory, 16(2), 279–312.

Oppenheim, N. 2000. The importance of recurrent sequences for nonnative speaker fluency and cognition. In H. Riggenbach (ed.), Perspectives on Fluency, 220–240. Ann Arbor: University of Michigan Press.


Osborne, T., Putnam, M. & T. Groß. 2012. Catenae: Introducing a Novel Unit of Syntactic Analysis. Syntax, 15(4), 354–396.

Pawley, A. & F. H. Syder. 1983. Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In J. C. Richards & R. W. Schmidt (eds.), Language and Communication, 191–227. London: Longman.

Pęzik, P. 2011. Providing corpus feedback for translators with the PELCRA Search Engine for NKJP. In S. Goźdź-Roszkowski (ed.), Explorations across languages and corpora: PALC 2009, 135–144. Frankfurt am Main: Peter Lang.

Pęzik, P. 2012a. Wyszukiwarka PELCRA dla danych NKJP. In A. Przepiórkowski, M. Bańko, R. Górski & B. Lewandowska-Tomaszczyk (eds.), Narodowy Korpus Języka Polskiego, 253–279. Warszawa: Wydawnictwo Naukowe PWN. Retrieved from http://nkjp.pl/settings/papers/NKJP_ksiazka.pdf.

Pęzik, P. 2012b. Graph-based visualisation of collocational profiles. Conference paper presented at the Europhras 2012 Conference, Maribor. Retrieved from http://www.europhrasmaribor.si/eng/.

Pęzik, P. 2013. Paradygmat dystrybucyjny w badaniach frazeologicznych. Powtarzalność, reprodukcja i idiomatyzacja. In P. Stalmaszczyk (ed.), Metodologie językoznawstwa. Ewolucja języka, ewolucja teorii językoznawczych, 141–160. Łódź: Wydawnictwo Uniwersytetu Łódzkiego.

Pęzik, P. 2015a. Spokes – a search and exploration service for conversational corpus data. In Selected Papers from CLARIN 2014, 99–109. Linköping University Electronic Press, Linköpings Universitet. Retrieved from http://www.ep.liu.se/ecp_article/index.en.aspx?issue=116;article=009.

Pęzik, P. 2015b. Using n-gram independence to identify discourse-functional lexical units in spoken learner corpus data. International Journal of Learner Corpus Research, 1(2), 242–255. doi:10.1075/ijlcr.1.2.03pez.

Pullum, G. K. & B. C. Scholz. 2010. Recursion and the infinitude claim. Recursion in Human Language, (104), 113–138.

Seretan, V. 2011. Syntax-based Collocation Extraction. Berlin: Springer.

Simov, K. & P. Osenova. 2015. Catena Operations for Unified Dependency Analysis. In Proceedings of the Third International Conference on Dependency Linguistics, 320–329.

Sinclair, J. 1991. Corpus, concordance, collocation. Oxford: Oxford University Press.

Svensén, B. 2009. A handbook of lexicography: the theory and practice of dictionary-making. New York: Cambridge University Press.


Tucker, Gordon D. 1996. “So Grammarians Haven’t the Faintest Idea: Reconciling Lexis-Oriented and Grammar-Oriented Approaches to Language.” In R. Hasan, C. Cloran, and D. Butt (eds.), Functional Descriptions: Theory in Practice, 145–78. Amsterdam Studies in the Theory and History of Linguistic Science Ser. 4, Current Issues in Linguistic Theory 121. Amsterdam: Benjamins.

Werning, M., Hinzen, W. & E. Machery. (eds.). 2012. The Oxford handbook of compositionality. Oxford: Oxford University Press.

Marcin Tatjewski

Polish Academy of Sciences

Mirosław Bańko, Adrianna Kucińska and Joanna Rączaszek-Leonardi University of Warsaw

Computational distributional semantics and free associations: a comparison of two word-similarity models in a study of synonyms and lexical variants

Abstract: This paper compares two methods for qualitatively evaluating the semantic proximity between pairs of synonyms or lexical variants: one method involving statistical distributional representations, the other free associations elicited from informants. The former is a relatively new method based on measuring word distances in multidimensional semantic spaces created from large language corpora. The latter method, long established in psychology, is a kind of survey in which subjects give free associations to given words and these words’ proximity is then computed on the basis of the associations provided. Although these methods are quite different in many respects, the present paper, based on lexical material from Polish, finds their results to be well correlated. This raises the question of the circumstances in which semantic spaces may offer a less costly alternative to free association studies.

Keywords: semantic spaces, text corpora, correlations

1. Introduction

Measuring the semantic proximity between words has long been a subject of interest for both psychologists and linguists. Classical methods aimed to elicit data from subjects in surveys or experiments, e.g. by using Osgood’s semantic differential or free association studies (Osgood, Suci & Tannenbaum 1957; Snider & Osgood 1969). Newer techniques are based on semantic spaces constructed from large language corpora, the best-known examples being Latent Semantic Analysis, HAL, and COALS (Landauer et al. 1998; Lund & Burgess 1996; Rohde et al. 2006).

The present paper focuses on the semantic proximity of words in specific pairs. In these pairs, one word is a lexical loan of a clearly foreign origin, the other being its native synonym, or one is an unassimilated (viz. less assimilated) spelling variant of a loan, the other being a better assimilated one. The focus is on such pairs, because the present study originated from a research project concerned with the adaptation, perception and reception of verbal loans (hence APPROVAL, see http://www.approval.uw.edu.pl).1

One of the hypotheses adopted in the APPROVAL project is that no two words, not even spelling variants, are totally equivalent functionally: if the designative meanings of any two words happen to coincide, the words differ in other respects, e.g. they have different connotations, different stylistic distribution, or different frequency. This hypothesis is by no means a revolutionary one. Philosophers, linguists, literary historians and lexicographers alike have long been claiming that absolute synonymy is non-existent, and similar observations have occasionally been made even about spelling variants of words (Bańko & Svobodová 2014). However, it is not easy to find systematic, qualitative and quantitative studies intended to verify these claims. Distinctive synonym dictionaries (e.g. Nagórko et al. 2004) provide abundant material, but lack the necessary rigor to serve as proof.

Another assumption made in the APPROVAL project is that differences in meaning (broadly understood) may result from differences in form: the unfamiliar or less familiar form of borrowed words leads them to be perceived differently from their native or better assimilated equivalents. During the project, this assumption has been repeatedly confirmed in a series of corpus-based word pair analyses, as well as in questionnaire studies. Details can be found on the project website, so let us restrict ourselves here to just one example.
The words helikopter and śmigłowiec both mean ‘helicopter’ in Polish, but the latter is preferred when talking about small, light and fast machines, probably because the name śmigłowiec is related to the words śmigło ‘propeller’, śmigły ‘swift’ and śmigać ‘move quickly, zip (around)’. This observation was first made in corpus analysis and reported in Bańko and Hebal-Jezierska (2014), then confirmed by a survey in a group of 180 students (p < 0.002), see Bańko et al. (2015).

The meaning of a word is a complex structure and its complexity encourages qualitative approaches. For instance, in the APPROVAL project 100 word pairs (sometimes triples, quadruples, etc.) were analyzed in detail with the aim of identifying meaning distinctions between their members. The analyses took the form of case studies, each of them performed according to the same pattern, in which some description components were predefined in order to make the studies comparable. In addition, the analyses were made on parallel word pairs of two languages, Polish and Czech, to reduce the danger of overgeneralizations inherent in studies made on one language only.

1 The project is supported by the Polish National Science Centre, registration number DEC-2011/03/B/HS2/02279.

Without diminishing the role of qualitative approaches, it should be noted that it is sometimes more useful or even necessary to evaluate the semantic proximity of words by means of quantitative indices. This not only brings more information about the subject of study but also cross-validates the methods. In this paper, two proximity measures are compared: one based on free associations produced by people in response to certain words, the other one on a computational analysis of a multidimensional representation of the very same words, extracted from a corpus. Overall, 35 Polish word pairs were analyzed using both methods, 28 of them consisting of a borrowed and a corresponding native word and 7 consisting of an unassimilated and an assimilated variant of the same word. The list of the 35 word pairs can be found in Figure 2 and again in Table 2.

Free association data were collected using association questionnaires. Data were coded for item analysis: for each word, a list of 66 associations was compiled and frequency distributions were calculated. Next, an index of semantic proximity was computed for each word pair as the proportion of the associations shared by both members of this pair to the overall number of associations given for them (Rączaszek-Leonardi et al. 2014).

To derive multidimensional representations of the same words, a matrix-based semantic model of the Polish language was created on the National Corpus of Polish (NKJP), using the method of Correlated Occurrence Analogue to Lexical Semantics (COALS) implemented in the S-Space library (Jurgens & Stevens 2010; Rohde et al. 2006). COALS is one of the most effective methods of distributional semantics, belonging to the same family of tools as Latent Semantic Analysis or Hyperspace Analogue to Language. The input corpus to COALS was a balanced sub-corpus of NKJP, comprising approx. 300 million segments (or 250 million words), which is big enough for distributional semantics requirements. Word relationships in the resulting computational matrix-based semantic model were explored using cosine similarity of word-vectors. Two kinds of metrics were analyzed for the investigated synonym and variant pairs: similarity scores for each pair and sets of nearest semantic neighbors for each word.

In the rest of this paper, the two methods used (one based on free associations, the other on semantic space) will be described in more detail and then related to each other by means of a correlation analysis.
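The association-based index described above can be illustrated with a minimal sketch. The exact formula is given in Rączaszek-Leonardi et al. (2014) and is not reproduced here, so the token-based reading below is an assumption: responses whose type was given for both words count as shared, and their number is divided by the total number of responses. The response lists are invented and much shorter than the 66 associations collected per word.

```python
from collections import Counter


def proximity_index(assoc_a, assoc_b):
    """Share of all responses whose type was given for both words."""
    counts_a, counts_b = Counter(assoc_a), Counter(assoc_b)
    shared_types = counts_a.keys() & counts_b.keys()
    shared_tokens = sum(counts_a[t] + counts_b[t] for t in shared_types)
    total = len(assoc_a) + len(assoc_b)
    return shared_tokens / total if total else 0.0


# Invented responses for a hypothetical loan/native pair (English glosses).
responses_loan = ["fly", "sky", "rotor", "army", "fly"]
responses_native = ["fly", "fast", "sky", "small", "propeller"]
index = proximity_index(responses_loan, responses_native)
print(index)
```

Under this reading the index ranges from 0 (no shared associations) to 1 (every response type was given for both words).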


2. Distributional semantics

2.1 Related work

Distributional semantics is a broad research area initiated in the 1960s by linguists who discovered that the similarity in meaning between words can be derived from their occurrence in similar contexts (Turney & Pantel 2010). The computational branch of this domain, often called semantic spaces, spans several methods for multidimensional modelling of language semantics. The first computational approaches to distributional semantics were designed in the late 1990s, pioneered by Hyperspace Analogue to Language, HAL (Lund & Burgess 1996). The HAL method laid the foundation for later work: in simplified form, it corresponds to the base layer of every more advanced semantic space model. Therefore, a good understanding of HAL makes all the other methods more meaningful. Building a HAL model consists of the following steps:

1. Selecting and preprocessing a large-scale text corpus on which the model will be computed.
2. Creating a matrix M with dimensionality N x N, where N is the number of distinct words in the corpus. Both rows and columns are therefore indexed by words from the text corpus.
3. For each cell M[i][j], calculating the number of co-occurrences of words i and j in the corpus. Co-occurrences are usually counted in small neighborhoods of, for instance, 10 consecutive words.

From such a model it is possible to extract vectors to use as semantic representations of words. Using these vectors one can, for example, quantify the meaning similarity of two words by calculating a similarity score (e.g. the Euclidean distance) of their respective word-vectors. It is vital to understand that a comparison of two word-vectors from such a model does not tell us how frequently these two words co-occur, but rather describes the level of similarity between the contexts in which these words occur independently. Semantic space models have already been applied to address a number of problems (Conley & Burgess 2000; Maas et al. 2011; Tatjewski & Jaworski 2013).

The next important computational method of distributional semantics is Latent Semantic Analysis, LSA (Landauer et al. 1998), which, in contrast to HAL’s word-word matrix, proposed a word-document model. Additionally, LSA introduced into semantic spaces the concept of dimensionality reduction performed with the use of singular value decomposition. Another important model is Correlated Occurrence Analogue to Lexical Semantics, COALS (Rohde et al. 2006). The designers of COALS combined dimensionality reduction with several mathematical transformations to construct one of the most effective word-word semantic spaces. Moreover, while in the primary research on the HAL model the Euclidean distance was used as a word-vector similarity score, later models used measures more suitable for this task, e.g. cosine similarity and Pearson’s correlation. Evaluation of semantic spaces was often performed on synonymy exercises from the TOEFL test. While LSA achieved 64% accuracy on this test, COALS managed to significantly outperform LSA and other methods, achieving up to 88% accuracy (Landauer et al. 1998; Jurgens & Stevens 2010; Rohde et al. 2006).

Some attempts have been made to apply semantic spaces to the Polish language (Kruszyński & Rączaszek-Leonardi 2006), the most significant being the work of Piasecki, Szpakowicz, and Broda (2009) on developing semantic-space tools, which they utilized for the development of the Polish WordNet. They used lemmatization and disambiguation and, more importantly, they introduced the concept of morphosyntactic constraints to retain only relevant word co-occurrences. Despite all this previous work, however, there was still no available semantic space suitable for the task at hand in the APPROVAL project. This was because:

1. None of the previous Polish semantic spaces was built on a balanced text corpus.
2. Semantic spaces constructed for the development of the Polish WordNet were built separately for particular parts of speech, while in the APPROVAL project, semantic proximity measures need to be calculated in a single space.
3. None of the published semantic space resources granted full access to specific word-vectors, which was necessary to compare the meaning of any given words.
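The HAL construction steps listed earlier in this section, followed by a cosine comparison of word-vectors, can be sketched in a few lines. This is a deliberately simplified illustration and not the original HAL or COALS procedure: co-occurrences are counted symmetrically and without distance weighting, the window size is arbitrary, and the toy English corpus is invented.

```python
import math
from collections import defaultdict


def cooccurrence_matrix(tokens, window=2):
    """Count co-occurrences within `window` tokens on either side."""
    matrix = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                matrix[word][tokens[j]] += 1
    return matrix


def vector(matrix, word, vocab):
    """Row of the matrix for `word`, ordered by the shared vocabulary."""
    return [matrix[word][v] for v in vocab]


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


tokens = (
    "the helicopter flew fast the chopper flew fast the boat sailed slowly"
).split()
matrix = cooccurrence_matrix(tokens)
vocab = sorted(set(tokens))
heli, chop, boat = (vector(matrix, w, vocab) for w in ("helicopter", "chopper", "boat"))
print(cosine(heli, chop), cosine(heli, boat))
```

In this toy corpus helicopter and chopper never occur within the same window, yet their vectors are more similar to each other than either is to boat, because they share contexts. This is the distinction drawn above between co-occurrence frequency and similarity of contexts.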

2.2 NKJP semantic space generation

The first important question to consider when building a semantic space is which corpus to use for computation, a choice closely tied to the research questions posed. For our purposes (comparing loan and native forms), we needed a corpus of Polish that was as representative as possible, reflecting its use in a variety of social contexts by a variety of users. We chose the National Corpus of Polish (NKJP), as it is one of the biggest corpora for the Polish language, containing 1.8 bn segments (Przepiórkowski et al. 2012). More specifically, we decided to use the NKJP balanced version (300 m segments), which is the only balanced corpus for the Polish language big enough to satisfy the requirements of semantic spaces. The second fundamental question in the process is which semantic space method to apply. This is also related to the availability of method implementations.


A good example of a robust software resource for computational distributional semantics is the S-Space Package, developed by Jurgens and Stevens (2010), which contains implementations of most of the popular methods, including HAL, LSA, COALS and others. Another available package is SuperMatrix – a program for building semantic spaces designed specifically for the Polish language. It was created by the team responsible for developing the Polish WordNet (Broda & Piasecki 2013). After considering the performance, level of accessibility and the range of available additional tools, we decided to use the COALS implementation from the S-Space Package. The work was divided into several steps:
1. Pre-processing of the NKJP to obtain word lemmas in a format suitable for the S-Space Package.
   a) Extracting disambiguated words from XML files containing full tagging.
   b) Removing all characters other than the standard letters of the Polish alphabet.
   c) Joining the results into a single file (around 1.6 GB in size), where each line stands for a single document from the NKJP.
2. Initial tests of the S-Space Package to check whether the tool is capable of producing a semantic space over a corpus of such size and what CPU and memory resources are necessary for this operation.
3. Defining the size of the desired semantic space in terms of the number of word vectors and the number of dimensions.
   a) Deciding how many words to include in the space by calculating how many of them occur in the corpus more than 150–200 times (an often used occurrence threshold in semantic spaces). In order to include most of the APPROVAL project’s focus words without compromising quality, we decided to include the 42 200 most frequently occurring words in the created semantic space.
   b) Choosing a target number of dimensions of the semantic space after dimensionality reduction. Reduction by means of SVD is performed in order to maximize the semantic accuracy of the space.
4. Computing the final space using the defined parameters.

It is important for anyone planning to perform such tasks to know that they are highly resource-consuming. The balanced version of the NKJP consists of around 600 GB of XML files. To pre-process that volume of data in a reasonable time, it was necessary to parallelize the pre-processing work over 40 powerful CPUs. Moreover, to compute a semantic space of such size, the S-Space Package requires around 30 GB of memory to finish successfully.
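Steps 1b, 1c and 3a can be illustrated with a minimal Python sketch. It is not the pipeline used in the project: the function names, the character class (which for simplicity also admits q, v and x) and the in-memory processing are assumptions made for illustration, and the XML extraction of step 1a is left out.

```python
import re
from collections import Counter

# Assumed character class: Latin letters plus Polish diacritics.
# Everything else (digits, hyphens, punctuation) is removed.
NON_LETTERS = re.compile(r"[^a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]")


def clean_lemma(lemma):
    """Step 1b: drop every character that is not a letter."""
    return NON_LETTERS.sub("", lemma)


def document_to_line(lemmas):
    """Step 1c: one document becomes one space-separated line."""
    cleaned = (clean_lemma(lemma) for lemma in lemmas)
    return " ".join(token for token in cleaned if token)


def frequent_words(lines, min_count):
    """Step 3a: words occurring at least min_count times, most frequent first."""
    counts = Counter(token for line in lines for token in line.split())
    return [word for word, n in counts.most_common() if n >= min_count]


# Toy input: two "documents" given as lists of lemmas.
documents = [["e-mail", "być", "kot"], ["kot", "2015", "mejl"]]
lines = [document_to_line(doc) for doc in documents]
vocabulary = frequent_words(lines, min_count=2)
```

Note that under this cleaning rule a hyphenated form such as e-mail is reduced to email, and tokens consisting only of digits disappear altogether.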


3. Free associations

For the free association analysis, 28 synonym pairs and 14 variant pairs were chosen, most of them previously compiled for the purpose of the linguistic analysis in the APPROVAL project. The criteria for selection, which ensured that the words would be suitable stimuli in a psychological study, were as follows: 1) both members of each pair were one-word expressions, 2) none of them was an obsolete or very rare word in Polish, 3) none of them was polysemous in an obvious way. The set of 42 pairs was divided into two equal subsets (with 14 synonym and 7 variant pairs in each), roughly balanced for word classes. This was done to ensure the feasibility of the task for participants (providing associations for 42 words would be too taxing). Next, each of the subsets was further subdivided so as to contain only one member of each pair, and care was taken to balance the resulting subsets in terms of loan/native words and unassimilated/assimilated variants. Thus 4 lists of words were created, each 21 words long, each containing 14 members of synonym pairs and 7 members of variant pairs, balanced for loan/native and unassimilated/assimilated distinctions.

The lists served as the basis for association questionnaires. These were filled in by 88 participants, most of them university students from various departments who volunteered to participate in the study. Each of the four questionnaires was filled in by 22 participants. Their task was to produce three associations for every word in the questionnaire, which yielded 66 associations for each word. Data were coded for item analysis, i.e. for each word, a list of associations was compiled and frequency distributions were calculated. Illegible associations, spaces left blank by participants and repetitions (identical associations given by the same subject) were ignored. Lastly, the data were aggregated into clusters. Each such cluster included derivatives of the same parent word (i.e. words having a common stem), word forms differing due to inflection and/or phrases including any of the above-mentioned words or word forms, whether idiomatic or not. The details of the study are given in Rączaszek-Leonardi et al. (2014).
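The coding procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the `cluster_of` mapping are hypothetical (in the study the clustering was a manual coding step), and illegible answers are assumed to have been removed or left empty beforehand.

```python
from collections import Counter


def code_associations(responses, cluster_of=None):
    """Frequency distribution of associations for one stimulus word.

    responses  -- one list of raw answers per participant
    cluster_of -- hypothetical mapping from a word form to its cluster label
    """
    cluster_of = cluster_of or {}
    counts = Counter()
    for answers in responses:
        seen = set()  # answers already counted for this participant
        for answer in answers:
            word = answer.strip().lower()
            if not word:
                continue  # space left blank by the participant
            if word in seen:
                continue  # identical association repeated by the same subject
            seen.add(word)
            counts[cluster_of.get(word, word)] += 1
    return counts


# Toy data: two participants, three answer slots each.
responses = [["życie", "być", ""], ["być", "być", "żyć"]]
clusters = {"życie": "żyć (życie)", "żyć": "żyć (życie)"}
distribution = code_associations(responses, clusters)
```

In this toy run the blank slot and the second być of the second participant are ignored, and życie and żyć are counted under one cluster label.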

4. Results

4.1 Semantic space results

The output of the semantic space generation using the COALS method is a matrix containing a row vector for each word included in the space. The columns of this matrix have no direct interpretation, which is an effect of the SVD dimensionality reduction. We were mainly interested in using the resulting data to calculate similarity scores between particular words in the NKJP corpus. For this task, we could choose from


among many metrics designed to compare vectors. One of the metrics especially useful for comparing semantic vectors is cosine similarity, shown in Figure 1.

Figure 1. Cosine similarity score calculated on vectors A and B.

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \cdot \sqrt{\sum_{i=1}^{n} (B_i)^2}}$$

According to its design, the COALS method returns only positive values. Therefore, the cosine similarity calculated over word-vectors obtained with COALS gives results in the range [0, 1], with 1 indicating perfect semantic similarity and 0 indicating perfect dissimilarity. A few of the focus words of the APPROVAL project had to be excluded from the analysis because of their low frequency in the corpus. For all the other words and word pairs, two types of information were obtained:
1. Similarity scores for each synonym pair and each variant pair (Figure 2).
2. Nearest semantic neighbors for each word, where the neighborhood was defined by similarity scores (Table 1).

Figure 2. Cosine similarity scores calculated in the semantic space.
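Both kinds of information can be obtained from the word vectors with a few lines of code. The following Python sketch is not the S-Space implementation: the function names, the toy vectors and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_a * norm_b)


def nearest_neighbors(word, space, k=10):
    """The k words most similar to `word`, with their similarity scores."""
    target = space[word]
    scored = [
        (other, cosine_similarity(target, vector))
        for other, vector in space.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy space with invented two-dimensional vectors.
space = {"w1": [1.0, 0.0], "w2": [2.0, 1.0], "w3": [0.0, 3.0]}
neighbors = nearest_neighbors("w1", space, k=2)
```

The exhaustive scan in `nearest_neighbors` is the simplest possible design; for a space of tens of thousands of words it requires one similarity computation per word for every query.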


Table 1 shows how the analysis of semantic neighborhoods for synonym and variant pairs can help draw conclusions about meaning differences between the members of these pairs. It lists the nearest neighbors of cyklista and kolarz – two Polish words meaning ‘cyclist’. These nearest neighbors indicate that cyklista is used in the general context of transportation and moving around, while kolarz is a more specific term related to professional sports. This result is consistent with the linguistic analysis of the cyklista – kolarz pair performed in the APPROVAL project.

Table 1. Nearest neighbors for synonyms cyklista and kolarz.

| cyklista: neighbor | Similarity score | kolarz: neighbor | Similarity score |
|---|---|---|---|
| motocyklista | 0.767 | biegacz | 0.859 |
| rowerzysta | 0.760 | kolarski | 0.783 |
| pieszy | 0.715 | maratończyk | 0.775 |
| rajdowiec | 0.705 | kajakarz | 0.766 |
| rowerowy | 0.704 | peleton | 0.762 |
| kolarz | 0.688 | lekkoatleta | 0.761 |
| jednoślad | 0.683 | pływak | 0.758 |
| rower | 0.668 | wyścig | 0.755 |
| zmotoryzowany | 0.666 | zawodnik | 0.744 |
| rajd | 0.661 | rajdowiec | 0.736 |

It is important to note the strong impact of dimensionality on the results. When the SVD reduction is not performed, the COALS space has full dimensionality. In this case the nearest neighborhoods include not so much words of similar meaning to the word investigated as words which are likely to co-occur with it. On the other hand, when the SVD reduction is performed, one can see a strong emergence of concepts: as the dimensions are reduced, words that merely co-occur move farther apart in the semantic space, while words sharing similar concepts come closer as neighbors. Even more interesting observations come from comparing spaces reduced to different numbers of dimensions. When the number of dimensions is small, concepts are compressed, so related concepts start to coalesce. When the number of dimensions is large, on the other hand, one can observe more distinct concepts in the data. These outcomes are clearly in line with the conclusions first stated by the creators of LSA (Landauer et al. 1998).


4.2 Free associations results

For each pair, associations shared between the members of the pair and associations idiosyncratic to one member of the pair were listed separately. By definition, the former have a frequency of at least 2 for both members of the pair, while the latter have a frequency of at least 2 for at least one member of the pair. Associations of lower frequencies were not included, with the exception of mutual associations (occurring between the pair members themselves), which were included irrespective of their frequency. The manner of presentation is explained in Figure 3.

Figure 3. Presentation of free associations results.

Lastly, an index of semantic proximity was computed, as the proportion of the associations shared by the members of each pair (increased by the number of mutual associations between the pair members) to the overall number of associations obtained for this pair. To illustrate how the index was computed, let us calculate it for the pair egzystować – istnieć ‘exist’, see Figure 4.


Figure 4. Computing semantic proximity on the basis of free associations: an example.

| Association | egzystować | istnieć |
|---|---|---|
| istnieć / egzystować (mutual) | 8 | 4 |
| człowiek | 1 | 2 |
| funkcjonować | 1 | 2 |
| być (byt) | 3 | 15 |
| filozofia | 3 | 2 |
| żyć (życie) | 12 | 9 |
| trwać (trwanie, przetrwanie) | 5 | 0 |
| wegetować (wegetacja) | 4 | 0 |

The first association common to both members of this pair is być (byt), its frequency being 3 for egzystować and 15 for istnieć. The lower of these two figures is the number shared by both members of the pair, and so we take 6 (i.e. 3 multiplied by 2) as the pair’s number of shared associations with być (byt). This is the first element of the pair’s total sum of shared associations, in the numerator of the fraction to be computed. The second element is 4 (the lower of the two frequencies given for filozofia, multiplied by 2), the third and the last element is 18 (the lower of the frequencies given for żyć (życie), again multiplied by 2). The number of shared associations is then increased by the number of mutual associations, i.e. 8 occurrences of istnieć for egzystować and 4 occurrences of egzystować for istnieć. This is because the semantic similarity of any two words manifests not only in the associations they share with other words, but also in the mutual associations they have. In fact, it is the mutual associations that indicate their direct similarity, while the associations shared with other words indicate their similarity only in an indirect way, according to the common rule that if two objects are similar to a third one, they are very likely to be similar too. The total number of associations shared by the members of the egzystować – istnieć pair, increased by the number of mutual associations between this pair’s members, is thus 6 + 4 + 18 + 8 + 4 = 40. The overall number of associations (shared and not shared) obtained for this pair is 71, which is taken as the denominator of the fraction. Dividing the first number by the second, we arrive at the semantic proximity measure between the elements of the egzystować – istnieć pair: 40/71, which is approximately 0.56. A list of semantic proximity measures for all 42 word pairs used in the free association study is available in Rączaszek-Leonardi, et al. (2014). 
Among them are 35 indices obtained for word pairs analyzed in the NKJP semantic space. They are quoted in Table 2 in the descending order of similarity scores.
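The computation walked through above can be reproduced with a short Python sketch. The function name and the `min_shared` parameter are hypothetical. The counts are taken from Figure 4 as far as they can be read; in particular, the frequencies of 1 for człowiek and funkcjonować under egzystować are an assumption, chosen because they give the stated total of 71.

```python
def proximity_index(freq_a, freq_b, word_a, word_b, min_shared=2):
    """Shared plus mutual associations, divided by all associations of the pair."""
    # Mutual associations: each word given as an association to the other.
    mutual = freq_a.get(word_b, 0) + freq_b.get(word_a, 0)
    # Shared associations: frequency of at least min_shared for both members;
    # each contributes twice the lower of its two frequencies.
    shared = 0
    for assoc in freq_a.keys() & freq_b.keys():
        if freq_a[assoc] >= min_shared and freq_b[assoc] >= min_shared:
            shared += 2 * min(freq_a[assoc], freq_b[assoc])
    total = sum(freq_a.values()) + sum(freq_b.values())
    return (shared + mutual) / total


egzystowac = {
    "istnieć": 8, "człowiek": 1, "funkcjonować": 1, "być (byt)": 3,
    "filozofia": 3, "żyć (życie)": 12, "trwać": 5, "wegetować": 4,
}
istniec = {
    "egzystować": 4, "człowiek": 2, "funkcjonować": 2, "być (byt)": 15,
    "filozofia": 2, "żyć (życie)": 9,
}
index = proximity_index(egzystowac, istniec, "egzystować", "istnieć")
```

With these counts the numerator is 6 + 4 + 18 + 8 + 4 = 40 and the denominator is 37 + 34 = 71, so `index` equals 40/71, in agreement with the value derived in the text.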


Table 2. Similarity scores based on free associations.

| Word pair | Score | Word pair | Score |
|---|---|---|---|
| email – mejl | 0.740 | infekcja – zakażenie | 0.549 |
| camping – kemping | 0.667 | komponent – składnik | 0.494 |
| triumf – tryumf | 0.667 | stymulować – pobudzać | 0.471 |
| menstruacja – miesiączka | 0.656 | doktor – lekarz | 0.470 |
| jeans – dżins | 0.652 | dewastować – niszczyć | 0.449 |
| eksplozja – wybuch | 0.649 | konsultant – doradca | 0.444 |
| symptom – objaw | 0.642 | toksyczny – trujący | 0.421 |
| dealer – diler | 0.614 | partycypować – uczestniczyć | 0.377 |
| komfort – wygoda | 0.598 | stagnacja – zastój | 0.376 |
| helikopter – śmigłowiec | 0.592 | ekspert – znawca | 0.368 |
| korumpować – przekupywać | 0.589 | fenomenalny – nadzwyczajny | 0.361 |
| cyklista – kolarz | 0.583 | agresor – napastnik | 0.282 |
| autonomiczny – niezależny | 0.571 | unikalny – wyjątkowy | 0.253 |
| puls – tętno | 0.566 | kuriozalny – osobliwy | 0.239 |
| egzystować – istnieć | 0.563 | kompleksowy – całościowy | 0.236 |
| absurdalny – niedorzeczny | 0.559 | strofa – zwrotka | 0.202 |
| football – futbol | 0.557 | fan – miłośnik | 0.048 |
| manager – menedżer | 0.551 | | |

4.4 Comparison of semantic space and free associations

The results obtained with semantic space analysis and free association analysis can be compared in a natural way, using the similarity measures which we defined separately for the two models. For the set of word pairs on which we focused in the APPROVAL project, we analyzed the relationship between the similarity scores from the two sources. Figure 5 plots these values against each other for each word pair. The Spearman correlation calculated on these sets is 0.52. The corresponding independence test returns a p-value of 0.0018, thus strongly rejecting the hypothesis that the scores from the two measures are independent. However, it must be noted that two word pairs were excluded as outliers from the final correlation and independence results: fan – miłośnik and e-mail – mejl. This decision was based on their high studentized residuals and the following experimental reasons:
1. the fan – miłośnik pair received no common associations from the respondents, which was not in line with the design of our two metrics;
2. the e-mail – mejl pair could not be comparably analyzed in both sources, as the semantic space excluded all words containing a hyphen (-); thus, in the semantic space, there is a similarity score for email – mejl, which is effectively a different word pair.
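For readers who wish to repeat this kind of comparison, the Spearman coefficient can be computed as the Pearson correlation of the ranks. The Python sketch below uses only the standard library; the function names and the toy scores are assumptions for illustration, the scores are not data from the study, and the p-value of the independence test is not computed here.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for four hypothetical word pairs.
space_scores = [0.10, 0.40, 0.35, 0.80]
association_scores = [0.20, 0.30, 0.50, 0.70]
rho = spearman(space_scores, association_scores)
```

Because the coefficient depends only on ranks, it is insensitive to the fact that the two measures are defined on different scales.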


As can be seen, the two semantic models provide us with significantly correlated scores, despite being based on two very different information sources, i.e. the computational analysis of a text corpus and the free associations elicited from human informants. This result can be interpreted as a mutual validation of the two models. The fact that they agree in the final scores suggests that they both return measures of word similarity which are related to the objective semantics of these words in the language.

Figure 5. Similarity scores obtained from semantic space and free associations plotted against each other in order to show correlation.


4.5 Published resources

As we expect that our semantic space may prove useful for many other research purposes, we have published it, along with an interface for browsing it, on our project’s webpage: http://portal.uw.edu.pl/en_GB/web/approval/przestrzenie_wyniki1

The final published version of the semantic space contains the 42 200 most common words from the National Corpus of Polish, and its dimensionality was reduced to 500. The target dimensionality was chosen on the advice of the authors of COALS (Rohde et al. 2006). The space with full dimensionality is not as effective as the reduced one. Moreover, browsing the semantic space before dimensionality reduction is technically very difficult because of the large amount of space it requires and the computational cost of nearest-neighbor queries.

5. Discussion

Our work on word meaning similarity measures was stimulated by a need to compare the meaning of words in the APPROVAL project and to test the hypothesis assumed therein. We decided to use complementary methods so that each of them could help to verify the results of the others. The methods applied were: corpus-based case studies performed by linguists, free association studies carried out with human informants, and computational semantic-space analyses. In addition, the case studies were performed independently on the material of two languages, Polish and Czech, to reduce the danger of overgeneralizations as compared to research based on one language only. The present study focused on two of our three methods: free associations and semantic space analyses. It turned out that the word proximity measures they produce were correlated at a statistically significant level. Bearing in mind that free-association studies are costly and time-consuming, as they demand manual data coding, the prospect of achieving equivalent results by queries in a semantic space is attractive and worth consideration. However, more work has to be done to assess which of the two methods is better under what circumstances.

References

APPROVAL: Adaptation, perception and reception of verbal loans in Polish and Czech – linguistic, psychological and historic-cultural factors. Available at: http://www.approval.uw.edu.pl/en_GB.
Bańko, M. & M. Hebal-Jezierska. 2014. “What can lexicography gain from studies of loanword perception and adaptation?” In A. Abel, Ch. Vettori & N. Ralli (eds.), Proceedings of the XVI EURALEX International Congress: The User in Focus, 15–19 July 2014, Bolzano/Bozen, 981–991. Bolzano/Bozen: EURAC research.
Bańko, M. & D. Svobodová. 2014. “The role of the form–meaning relationship in the process of loanword adaptation.” Polonica 7, 5–19.
Bańko, M., Rączaszek-Leonardi, J. & A. Kucińska. 2015. “Preference for one member of a pair of synonyms or lexical variants as an indicator of their different semantic potential (the case of words differing on the foreign – native scale).” Retrieved from: https://portal.uw.edu.pl/en_GB/web/approval/percepcja_wyniki (downloadable as Report 3 and Appendix).
Broda, B. & M. Piasecki. 2013. “Parallel, massive processing in SuperMatrix: a general tool for distributional semantic analysis of corpora.” International Journal of Data Mining, Modelling and Management 5, 1–19.
Conley, P. & C. Burgess. 2000. “A computational approach to modeling population differences.” Behavior Research Methods, Instruments & Computers 32(2), 274–279.
Jurgens, D. & K. Stevens. 2010. “The S-Space Package: an open source package for word space models.” Proceedings of the ACL 2010 System Demonstrations, ACLDemos ’10, Stroudsburg, PA, USA, 2010, 30–35. Association for Computational Linguistics.
Kruszyński, B. & J. Rączaszek-Leonardi. 2006. “Między strukturalistyczną a psychologiczną reprezentacją znaczenia: wielowymiarowa przestrzeń semantyczna (HAL).” In P. Stalmaszczyk (ed.), Metodologie językoznawstwa. Podstawy teoretyczne, 282–296. Łódź: Wydawnictwo Uniwersytetu Łódzkiego.
Landauer, T. K., Foltz, P. W. & D. Laham. 1998. “An introduction to latent semantic analysis.” Discourse Processes 25, 259–284.
Lund, K. & C. Burgess. 1996. “Producing high-dimensional semantic spaces from lexical co-occurrence.” Behavior Research Methods, Instruments & Computers 28, 203–208.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y. & C. Potts. 2011. “Learning word vectors for sentiment analysis.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies 1, 142–150.
Nagórko, A., Łaziński, M. & H. Burchardt. 2004. Dystynktywny słownik synonimów. Kraków: Universitas.
Osgood, C. E., Suci, G. & P. Tannenbaum. 1957. The measurement of meaning. Urbana, IL: University of Illinois Press.
Piasecki, M., Broda, B. & S. Szpakowicz. 2009. A wordnet from the ground up. Wrocław: Oficyna Wydawnicza Politechniki Wrocławskiej.
Przepiórkowski, A., Bańko, M., Górski, R. L. & B. Lewandowska-Tomaszczyk (eds.). 2012. Narodowy Korpus Języka Polskiego. Warszawa: Wydawnictwo Naukowe PWN.
Rączaszek-Leonardi, J., Bańko, M. & A. Kucińska. 2014. “A comparative study of free associations to loanword/native synonyms and variant pairs in Polish.” Retrieved from: https://portal.uw.edu.pl/en_GB/web/approval/percepcja_wyniki (downloadable as Report 1 and appendices).
Rohde, D. L. T., Gonnerman, L. M. & D. C. Plaut. 2006. “An improved model of semantic similarity based on lexical co-occurrence.” Communications of the ACM 8, 627–633.
Snider, J. G. & C. E. Osgood. 1969. Semantic Differential Technique: a Sourcebook. Chicago: Aldine.
Tatjewski, M. & P. Jaworski. 2013. “Application of semantic spaces to sentiment analysis for words.” Unpublished paper presented at the CyberEmotions 2013 conference. Retrieved from: http://science24.com/events/7302/boa/boa.pdf and https://www.youtube.com/watch?v=cv1ICNAhnuw.
Turney, P. D. & P. Pantel. 2010. “From frequency to meaning: Vector space models of semantics.” Journal of Artificial Intelligence Research 37, 141–188.

Corpora

NKJP. National Corpus of Polish. Available at: http://nkjp.pl.

Dario Lečić

University of Sheffield

Grammars or corpora? Who should we trust? Empirical analysis of morphological doubletism in Croatian

Abstract: Morphological doubletism refers to a phenomenon whereby two (or more) variants inhabit a single cell of an inflectional paradigm (e.g. the forms dove and dived can both occur in the cell for the past tense of the English verb to dive). In this paper we analyse this phenomenon in Croatian, a South Slavonic language, by juxtaposing several sources of linguistic data. First we look at how doublets are treated in grammar books of Croatian. This is followed by a corpus analysis of some examples of doublets in present-day language. More often than not, these two sources give conflicting information. Finally, we present the results of a study of native speakers’ intuitions. The results show that speakers are highly sensitive to the relative proportions of the two forms as found in the corpus; however, in some instances these frequency effects are cancelled out by analogical effects – phonologically similar forms are given similar ratings even when their corpus distributions are completely inverse.

Keywords: Morphology, doubletism, corpus, acceptability, analogy

1. Introduction

Morphological doubletism¹ refers to a phenomenon whereby two (or more) morphological variants inhabit a single cell in an inflectional paradigm. For instance, in English the plural of nouns of Latin and Greek origin (e.g. formula, cactus) can either be the same as in the original language (formulae, cacti) or it can follow English grammatical patterns (formulas, cactuses). There are also occasional verbs that can be weak and strong at the same time (dive > I dove and I dived). Whereas examples such as the above are quite sporadic in western European languages (see Fehringer 2004 for German and Dutch, Thornton 2011 for Italian etc.), in Slavonic languages they appear to a greater extent and encompass whole families of words, due to the rich morphological systems of these languages. For instance, in Czech, the majority of masculine nouns can have two forms in the locative singular (hrad ‘castle’ > hrad-u/hrad-ě), whereas masculine inanimates will, on top of that, also have two forms in the genitive singular (sýr ‘cheese’ > sýr-a/sýr-u). Some animate nouns will have two forms in the nominative plural (akademik ‘academic’ > akademici/akademikové). Doublets also appear in the genitive plural of feminine and neuter nouns (karta ‘card’ > karet/kart), the dative and locative plural of feminines (mast ‘ointment’ > D mastem/mastím, L mastech/mastích), etc.

Croatian, a South Slavonic language, exhibits this phenomenon on an even larger scale. There are two types of doubletism that appear in Croatian: doubletism of endings and doubletism of stems. Some of the situations where doublet endings appear are:
• instrumental singular of masculine nouns that end in: -ar² (gospodar ‘master’ > gospodar-om/gospodar-em), a palatal (princ ‘prince’ > princ-om/princ-em), and -io (radio > radi-om/radij-em);
• genitive plural of feminine nouns that end in a consonant cluster: crkva ‘church’ > crkv-ā/crkv-ī/crkāv-ā, bitka ‘battle’ > bitk-ā/bitk-ī/bitāk-ā and some masculine nouns (prst ‘finger’ > prst-ā/prst-ī/prst-ijū, zub ‘tooth’ > zub-ā/zub-ī);
• instrumental singular of feminine nouns that end in a consonant: riječ ‘word’ > riječ-i/riječ-ju, mladost ‘youth’ > mladost-i/mladošć-u;
• comparatives and superlatives of some mono- and di-syllabic adjectives: čist ‘clean’ > čist-iji/čišć-i, blizak ‘close’ > bliž-i/blisk-iji;
• gender doublets: večer ‘evening’ (feminine/neuter), finale ‘finals’ (masculine/neuter); moj kolega ‘my colleague’ > plural moji kolege (masc.)/moje kolege (fem.);
• derivational doublets: -iv/-ljiv (promjenj-iv/promjen-ljiv ‘changeable’); -ni/-ski (imenič-ni/imenič-ki ‘nominal’), -ica/-ka (tinejdžerica/tinejdžerka ‘a female teenager’), etc.
Doublet stems appear in the following cases:
• short (no stem extension) and long (extended stem) plurals of mono- and disyllabic masculine nouns: znak ‘sign’ > znac-i/znak-ov-i, prsten ‘ring’ > prsten-i/prsten-ov-i; also, in the long plural the stem can be extended with -ov or -ev: put ‘path’ > put-ov-i/put-ev-i (alongside put-i), pojas ‘belt’ > pojas-ov-i/pojas-ev-i (alongside pojas-i);

¹ There is a variety of terms used to describe this situation in language – Marković (2012: 49) calls it morphological synonymy (since we are dealing with two morphemes with an identical grammatical meaning), Bermel and Knittl (2012) use the term competing forms, Thornton (2011) calls it overabundance.

² These nouns also have doublet endings in the vocative singular (gospodar-u/gospodar-e) and the possessive adjective (gospodar-ov/gospodar-ev).


• dative and locative singular of feminine nouns ending in -ka, -ga and -ha (non-sibilarised/sibilarised stem): točka ‘dot’ > točk-i/točc-i, Aljaska > Aljask-i/Aljasc-i, Požega ‘town name’ > Požeg-i/Požez-i;
• short and long forms of possessive pronouns: moj ‘mine’ > Gen. mog/mojeg, Dat. mom/mojem; njen/njezin ‘hers’;
• verbal doublets: crpiti/crpsti ‘to pump’, obavještavati/obavješćivati ‘to inform’, izaći/izići ‘to go out’, donesti ‘to bring’ > donesoh, donesavši and donijeti > donijeh, donijevši; podići ‘to lift’ > podigao and podignuti > podignuo;
• lengthening/shortening of ě (yat) and ‘covered r’ (/r/ followed by ě): nasljeđe/naslijeđe ‘heritage’, ljesovi/lijesovi ‘coffins’, pripovijedaka/pripovjedaka ‘short story’ (genitive plural); sprečavati/sprječavati ‘to prevent’, strelica/strjelica ‘arrow’.

It is interesting to see how this phenomenon is dealt with in reference works of Croatian as well as in modern linguistic theory. In a Croatian language adviser, the reader can find the following statement:

Where there are variants (doublets), they necessarily need to be put in complementary relation because the standard language cannot bear more than one signifier for the same signified, several synonymous and equivalent linguistic units (…) The standard language needs variants in order to fulfil the needs of all its functional styles, but it does not need all variants in equal measure. Orthographic and grammatical variants jeopardise it, whereas lexical and stylistic variants are desirable (HJS 1999: 48, italics mine).³

Further down in the text we can find a similar claim: “Doublets, inconsistencies and contradictions destabilise the norm; in fact, they call it into question” (HJS 1999: 63). Babić (1990: 41) claims that doublets in language are “an unnecessary burden, deadweight” and that they cause “conflicts and disruption between individuals and between groups.” Zoričić (1998: 47) takes a slightly more moderate view: “we should consider doublets as freedom of choice. However, matters of free choice in language cause unusual behaviour almost by default. Many refuse to give up on their well-established habits, do not accept alternative forms even when they know their way is not in line with present-day norm. On the other hand, they ask for greater precision on behalf of norm-setters as they do not want things to be done both this way and that way.”

This selection of quotes shows that most linguists (Croatian ones at least) consider the presence of doublets a problem in language. But is that truly the case? Linguists around the world have not paid much attention to this phenomenon, for several reasons: one of them is the fact that the majority of research has focused on English, and we have seen that in English it is in fact a scarce phenomenon. The other reason is that this phenomenon did not sit well with the dominant beliefs of generative grammar. Clark (1987) defined the Principle of Contrast, by which it is necessary for two forms to differ in meaning.⁴ In other words, this principle would disallow the existence of absolute synonyms in a language. It was also believed that the existence of variant forms violates the principle of language economy. But as Kapović (2011: 29) explains, the principle of economy is very ambiguous in itself – is it more economical for a language to have no declensions (as is the case in English) or is it more economical to have them?

However, in recent decades linguistic science has progressed and started regarding this “problem” from another, cognitive-linguistic, perspective. Jonke (1964: 198) established that “introducing doublets, even triplets, signifies a step forward both in theory and in practice. It enables for the lexical treasure of our language to be exploited in full (…) It gives every writer freedom to choose, to decide on the option which suits his text better or has a greater effect without breaking the norm to an irreconcilable extent. It frees the writer of all constraints.” Even though it is clear from the text that Jonke is referring to lexical doublets, we believe that this claim can easily be applied to doublets on any level of language, including the morphological one. Ellis (1999: 470) also thinks that language users, “as human beings, value variety for its own sake; they instinctively use language forms as objects which can be experimented and played with.”

³ All the translations of Croatian references are the author’s own renditions.

2. Research goals and methods

We hope that this research will shed new light on the question of whether speakers consider this state of affairs problematic. For this purpose we use three different types of data and compare them with each other: rules formulated by grammarians, distributions found in a corpus and data obtained from native speakers.

4 Similar to this is Aronoff’s Synonymy Blocking Principle.

2.1 Grammar books

The most time-consuming part of the research has been to establish the final number and list of lexemes that appear with doublet forms in any of the abovementioned situations. Most reference works of Croatian do provide the user with such lists; however, more often than not the lists differ from one manual to another. For instance, if a speaker wishes to find out what the instrumental singular of masculine nouns and personal names that end in -io (such as radio, Mario etc.) is, they would find the following solutions: the grammars by Težak-Babić (2004)
and Raguž (1997) sanction5 only the forms radijem, Marijem, whereas the grammar by Barić et al. (2005) and a monolingual dictionary by Anić (2003) prescribe only radiom, Mariom. The most recent orthographical manual of Croatian (Jozić 2013) gives both variants equal status. Anić’s dictionary will sanction only ljenji as the comparative of lijen ‘lazy’, whereas Težak-Babić (2004) will sanction only ljeniji. Moreover, one can even find situations within the same work where two variants are sanctioned for a certain lexeme, but on the very next page one of the forms is labelled as ungrammatical. Moreover, sometimes it is not sufficient merely to determine whether a form appears in one reference work or not, but the user also needs to look at the ordering of the variants since these works often indicate greater or lesser grammaticality of variants by their mutual ordering. We find differences here as well. The most complicated situation appears in the genitive plural of feminine nouns, where there are three possible variants: -ī ending, -ā ending and -ā ending with the reinsertion of the fleeting a. Raguž (1997: 45) recommends forms with the fleeting a as the best since they are the most differentiating ones (the other two endings also appear in other cases), followed by those with -ā and finally those with -ī. However, grammar by Babić et al. (VHG 2007: 394) will say that the forms with the fleeting a are being “evidently pushed out” of the language. In his empirical research, Težak (1980: 15) determined that the -ī ending is “uncontrollably suppressing the other two options.” HJS (1999) gives an advantage to -ā, except in cases where fleeting a + -ā is “more common.” Furthermore, some authors try to find any kind of distinguishing criterion – semantic, syntactic, stylistic etc. – because they are still of the belief that proper doublets are an unwanted feature of language. 
For instance, Težak-Babić (2004) and HJS (1999) will say that the instrumental singular of the word put ‘path’ is putom when it is preceded by a preposition and putem when it is not. Other manuals (VHG 2007, Raguž 1997) will say that putom should be used in its primary meaning ‘road, journey’, and putem in a metaphorical meaning ‘means, way of doing’. However, even the authors of HJS (1999: 88) will admit that in the case of instrumental singular of feminine nouns “attempts to restrict the usage of -i only when following a preposition or adjective and -ju in all other cases do not bear fruit.” One of the reasons why such discrepancies arise is that reference works of Croatian are not based on real-life data (or if they are, they only rely on what is termed ‘the classical works of Croatian literature’) but rather on the authority of the grammarians themselves. In addition to that, as we have seen in the above example, the rules and principles are formulated in an awkward manner, using ambiguous constructions such as “usually”, “many words”, “in some cases” etc., which raise more questions than they answer.

5 The verb sanction has two semantically opposite meanings in English: 1) to authorise, approve, 2) to impose a sanction on, penalise. In this work we are using it in the former meaning.

2.2 Corpus data

However, in order to verify that the rules in Croatian grammars are in fact arbitrarily determined and out of step with actual usage, one needs to check the usage. Hence we come to the second source of information about a language – a corpus. In recent years corpora have become an indispensable tool of linguistic research. There are three large online corpora of Croatian in existence at the moment: (1) the Croatian National Corpus (Hrvatski nacionalni korpus, HNK), the latest version (v3.0) consisting of 220 million tokens; (2) the Croatian Language Repository (Hrvatska jezična riznica, HJR), a 100-million-token corpus; and (3) hrWaC, the Croatian version of the Web as Corpus project, which contains 1.4 billion tokens. Each of the three corpora has its virtues as well as pitfalls (in terms of balance, representativeness and annotation). Different corpora were used at different stages of the research, depending on the availability of each. We believe that Croatian is in urgent need of a grammar based on everyday usage, using either data from a corpus or any other relevant source.6 For instance, a learner of Croatian would not be able to find in any of the Croatian grammars how to decline names such as Matea, but they would often find how to decline the noun divolijeska ‘wild hazel tree’, which does not appear a single time in any of the three corpora. In any research on variation we can have two possible starting points (Ellis 1999): (1) we can either assume that variation is systematic until it can be shown to be non-systematic, or (2) we can assume it is free until it is shown to be systematic. Whereas the majority of grammarians, as we have seen above, take (1) as a given fact, we adopt (2) as the main guideline of our research. A preliminary corpus analysis has shown that oscillations from the grammar are reflected in everyday usage. In some cases recommendations from reference manuals simply do not match the actual state of affairs in the language.
For instance, the word šutovi is unanimously sanctioned as the normatively acceptable plural form of the word šut ‘shot’, whereas šutevi is the non-sanctioned form. However, in hrWaC the non-sanctioned form appears almost 10 times more often (2512) than the sanctioned form (252). Even though the majority of rules of grammar are based on some kind of phonological principle (final phoneme, pitch accent etc.), we also found a considerable number of phonological families whose members show quite opposite distributions of the two variants. In Table 1 we give a couple of examples of the instrumental singular of masculine nouns ending in -ej. All the frequencies come from HJR.

6 Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. 1999. The Longman Grammar of Spoken and Written English. Harlow, England: Pearson Education is a classic example of one such grammar.

Table 1. Instrumental singular of masculine nouns ending in -ej found in HJR.

Form        Frequency
muzejom     168
muzejem     42
Sergejom    31
Sergejem    24
trofejom    45
trofejem    15
esejom      10
esejem      16
sprejom     30
sprejem     66
volejom     22
volejem     83

Even phonologically minimal pairs can have completely opposite preferences. For instance, the minimal pair of adjectives tijesan ‘narrow’ and bijesan ‘rabid’ have the following distributions of the two comparative/superlative forms: the short form (naj)tješnji appears 1073 times in hrWaC and its long counterpart (naj)tjesniji 219; on the other hand (naj)bješnji appears 45 times and (naj)bjesniji 214.

2.3 Native speakers’ data

We have seen that reference works of Croatian are not much help when it comes to morphological doublets. We remind the reader of Zoričić’s (1998: 47) opinion that speakers “do not want things to be done both this way and that way.” So where does this leave the native speaker? How will they use these lexemes? Will they in fact consider them problematic? What kind of relationship will certain morphological variants have in their mental grammars? Will they both have equal status or will one of them have a more favourable status? What will that status depend on?

The third part of the research consists of several questionnaire studies conducted among native speakers of Croatian. The questionnaire material was selected on the basis of corpus frequencies. Usage-based theories of language argue that language users are extremely sensitive to the language they are exposed to (so-called input). Language units that we encounter more frequently will become more entrenched in our mental grammars, which will in turn be reflected in their more frequent usage in later situations. In the case of doublets that would mean the following: if we come across one member of the doublet pair in 90% of the cases and the other one in 10%, we will use those two variants in roughly the same proportions. This is the hypothesis we are testing by means of questionnaires.

The absolute frequencies of the two variants from the corpus were transformed into relative proportions. Based on these proportions, we divided the lexical material into several proportional bands. In the most recent study we had 7 bands, which were labelled in the following way:

• Band 1 (called absolute domination): Variant 1 appears in the corpus in 100% of the cases;
• Band 2 (called extreme domination): Variant 1 appears in 90–99% of the cases; Variant 2 appears in 1–10%;
• Band 3 (called weak domination): Variant 1 55–89% : Variant 2 11–45%;
• Band 4 (called equiprobability): both variants appear with similar frequencies (45–55%);
• Band 5 (called weak subordination): Variant 1 11–45% : Variant 2 55–89%;
• Band 6 (called extreme subordination): Variant 1 1–10% : Variant 2 90–99%;
• Band 7 (called absolute subordination): Variant 1 does not appear in the corpus; Variant 2 appears 100% of the time.

We have conducted two such studies in the course of two years using four different instances of doubletism (instrumental singular of masculine nouns ending in -ar and in a palatal sound, masculine plurals, and adjective comparison). The number of bands differed in each individual study, primarily due to lack of data.
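The conversion of absolute frequencies into relative proportions and bands can be sketched in a few lines of Python. This is an illustrative sketch only, not the procedure actually used in the study: the function names are hypothetical, Variant 1 is assumed to be the form listed first in a pair, and, because the published ranges touch at 45% and 55%, those boundary values are assigned to Band 4 by assumption.

```python
def variant1_share(f1: int, f2: int) -> float:
    """Relative proportion (in %) of Variant 1 among all attestations of the pair."""
    total = f1 + f2
    if total == 0:
        raise ValueError("neither variant is attested in the corpus")
    return 100.0 * f1 / total


def band(f1: int, f2: int) -> int:
    """Map the corpus frequencies of a doublet pair to one of the 7 bands."""
    share = variant1_share(f1, f2)
    if f2 == 0:
        return 1  # absolute domination: only Variant 1 is attested
    if f1 == 0:
        return 7  # absolute subordination: only Variant 2 is attested
    if share >= 90:
        return 2  # extreme domination
    if share > 55:
        return 3  # weak domination
    if share >= 45:
        return 4  # equiprobability (boundaries 45 and 55 assigned here by assumption)
    if share > 10:
        return 5  # weak subordination
    return 6      # extreme subordination


# Frequencies quoted in the text: the -ej nouns come from HJR, the plural of šut from hrWaC.
pairs = {
    "muzejom / muzejem": (168, 42),
    "sprejom / sprejem": (30, 66),
    "šutovi / šutevi": (252, 2512),
}
for label, (f1, f2) in pairs.items():
    print(f"{label}: Band {band(f1, f2)}")
```

Under these assumptions the pair muzejom / muzejem (80% for the first variant) falls into Band 3, sprejom / sprejem into Band 5, and šutovi / šutevi into Band 6.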
We noticed that the situation of equiprobability is extremely rare in the language and in some cases it was not possible to have it as a separate band. Also, in cases of instrumental singular of -ar nouns we were unable to find a single lexeme where the -em variant would be more frequent than -om, so we could not include bands 5–7. The number of respondents in the first study (for a detailed analysis see Lečić 2015) was 105, in the second one 207. We asked the respondents to rate the acceptability of each form of a pair on a 7-point Likert scale, where 1 is ‘absolutely unacceptable’ and 7 is ‘absolutely acceptable’. Both studies produced similar results: the variant that appears more frequently in the corpus is also more acceptable to speakers, whereas the acceptability of the other variant will depend on its position on the dominance scale – Variant 2 from
the weak domination band will be more acceptable than Variant 2 from the extreme domination band. Statistical analyses (ANOVAs) have shown that these differences are statistically significant rather than a result of chance. Another significant difference that arose from the analysis was between forms that do not appear in the corpus and forms that appear with minimum frequency. For instance, the mean acceptability of the form kuharem ‘chef’ (f = 0) was 2.28, whereas the acceptability of vladarem ‘ruler’ (f = 4) was 4.66. Bybee believes that even minimum frequency can have a significant influence on the perception of an item: “There is no way for frequency to matter unless even the first occurrence of an item is noted in memory. Otherwise, how would frequency accumulate?” (Bybee 2010: 18).

These results are in line with the results of previous studies in the same framework. For instance, Arppe and Järvikivi (2007) demonstrated that what is highly frequent in a corpus will also be highly acceptable, but that the converse (highly acceptable > highly frequent) is not necessarily true. In addition, forms which are unacceptable to speakers will at the same time be infrequent, but the opposite (infrequent forms > extremely unacceptable) is also not necessarily true. Similar arguments have been put forward by linguists who use information-theoretic approaches (see Milin et al. 2009). More frequent forms carry less information, whereas low-frequency forms carry a greater information load, which makes them harder to process, which in turn explains their lower acceptability.

However, a slight deviation from the general tendency occurs in the inside bands (3, 4 and 5). In some cases we noticed that the same variants of phonologically similar lexemes have similar ratings, even though their relative proportions differ. Table 2 is a copy of Table 1, with the addition of mean acceptability values for each variant (N = 207). It is important to note here the rule for this case as defined by Croatian grammars. Nouns that end in a palatal take -em, whereas all other nouns take -om. However, in cases when the final vowel of the stem is /e/, dissimilation should take place, which means that the ending should in fact change to -om. For words in Table 2, forms with -em are therefore of questionable status and we label them with a question mark.
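The prescriptive rule just described can be written down as a small function. The sketch below encodes only the rule as it is stated in this section; the inventory of palatal consonants, the use of the citation form as the stem, and the function names are assumptions made for illustration, and actual grammars attach further conditions that are not modelled here.

```python
# Palatal consonants of Croatian (assumed inventory; digraphs are listed first).
PALATALS = ("lj", "nj", "dž", "č", "ć", "đ", "š", "ž", "j")
VOWELS = "aeiou"


def last_vowel(stem: str):
    """Return the last vowel letter of the stem, or None if it has none."""
    for ch in reversed(stem.lower()):
        if ch in VOWELS:
            return ch
    return None


def prescribed_ending(stem: str) -> str:
    """Instrumental singular ending according to the simplified rule in the text."""
    ends_in_palatal = stem.lower().endswith(PALATALS)
    if ends_in_palatal and last_vowel(stem) != "e":
        return "em"
    # Non-palatal stems, and palatal stems whose last vowel is /e/ (dissimilation).
    return "om"


def prescribed_instrumental(stem: str) -> str:
    return stem + prescribed_ending(stem)


# The six -ej nouns of the study, plus two contrasting nouns added for illustration.
for noun in ["muzej", "Sergej", "trofej", "esej", "sprej", "volej", "nož", "grad"]:
    print(noun, "->", prescribed_instrumental(noun))
```

For all six -ej nouns the function returns the -om form, which is why their -em forms count as questionable under the rule. The nouns nož ‘knife’ and grad ‘town’ are not part of the study material and serve only as contrasting examples.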

Table 2.

Form         Frequency (HJR)    Mean rating
muzejom      168                5.31
?muzejem     42                 4.11
Sergejom     31                 3.83
?Sergejem    24                 6.16
trofejom     45                 4.68
?trofejem    15                 5.08
esejom       10                 4.21
?esejem      16                 5.36
sprejom      30                 4.38
?sprejem     66                 5.23
volejom      22                 4.43
?volejem     83                 5.28

In the table we have three lexemes that appear more frequently with the -em variant and three that prefer -om. However, in 5 of the 6 cases native speakers still prefer the -em variant. The only instance where -om gets a better rating is muzejom, where its dominance is quite strong (80%). This kind of result makes us think that there is a stronger principle that speakers follow rather than a grammar rule. That principle could be formulated something like: when in doubt, choose -em consistently (at least for this group of nouns). And the doubt arises either when the proportions of the two forms are fairly equal or when the frequencies of both are so small that neither of them has managed to become entrenched in the speaker’s mental grammar. We find a similar distribution in the minimal pair tijesan – bijesan, which was also mentioned above.

Table 3.

Form           Frequency (HJR)    Mean rating
najtjesniji    40                 5.72
najtješnji     299                3.66
najbjesniji    34                 6.27
najbješnji     26                 3.10

This leads us to believe that speakers’ grammars are organised in a much simpler way than formal grammars. Rather than resorting to complicated sub-rules and exceptions as formal grammars do, speakers are to a large extent guided by principles of analogy. So maybe the reason why reference works of Croatian are unable to define a consistent rule for a certain category is that there are no such rules.
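The mismatch between corpus dominance and speaker preference in Table 2 can be recomputed directly from the published figures. The following sketch is illustrative only: the data structure and function names are hypothetical, and only the frequencies and mean ratings are taken from the table.

```python
# (lexeme, frequency of -om, frequency of -em, mean rating of -om, mean rating of -em)
TABLE_2 = [
    ("muzej", 168, 42, 5.31, 4.11),
    ("Sergej", 31, 24, 3.83, 6.16),
    ("trofej", 45, 15, 4.68, 5.08),
    ("esej", 10, 16, 4.21, 5.36),
    ("sprej", 30, 66, 4.38, 5.23),
    ("volej", 22, 83, 4.43, 5.28),
]


def preferred(om_value: float, em_value: float) -> str:
    """Name the variant with the higher value (frequency or mean rating)."""
    return "-om" if om_value > em_value else "-em"


def compare(rows):
    """For each lexeme: share of -om in %, more frequent variant, better-rated variant."""
    results = []
    for lexeme, f_om, f_em, r_om, r_em in rows:
        share_om = round(100 * f_om / (f_om + f_em))
        results.append((lexeme, share_om, preferred(f_om, f_em), preferred(r_om, r_em)))
    return results


results = compare(TABLE_2)
rated_em = [lexeme for lexeme, _, _, by_rating in results if by_rating == "-em"]
mismatch = [lexeme for lexeme, _, by_freq, by_rating in results if by_freq != by_rating]

print("-em rated higher:", rated_em)
print("rating disagrees with frequency:", mismatch)
```

Run on Table 2, this lists five lexemes whose -em form is rated higher, and two lexemes (Sergej and trofej) for which the better-rated form is not the more frequent one in HJR.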

3. Conclusions

When it comes to the present-day situation with doublets in Croatian, corpus research has shown that examples such as the ones above are truly in free variation and in some cases it is impossible to determine which is the dominant and which is the recessive form. Ambiguous and unclear wording in Croatian grammars definitely does not help to clarify and stabilise the situation. When it comes to the status of these forms in the speakers’ mental grammars, preliminary research leads us to the following conclusions:

(1) As Joan Bybee has emphasized on several occasions, “quantitative distributions matter and are a part of grammar” (Bybee 2010: 122). Relative proportions have proved to be a significant predictor variable of acceptability of certain forms. Speakers are aware of the input they receive and they are especially adept at keeping track of the distribution of doublet forms. This distribution, in turn, affects their perception of those forms and their subsequent usage. It is therefore entirely expected that the more frequent of the two forms is also the more acceptable one.

(2) However, in the situation of weak domination of one form (such as in Tables 2 and 3) or when there is not enough information about their distribution, the most important thing for the speakers is actually to be as consistent as possible, that is, to use the same ending with forms that share certain similarities (phonological, semantic etc.) in their mental grammars.

(3) Speakers’ grammars are much simpler than traditional grammars (they ignore some rules that appear in traditional grammars, such as the aforementioned dissimilation) and are mostly based on analogies, which are supplemented by quantitative distributions (expressed here as relative proportions).

Several exemplar-based models have been put forward in recent years, such as Analogical Modeling (Skousen et al. 2002), the Tilburg Memory-Based Learner (Daelemans & van den Bosch 2005), etc. We believe these models paint the most accurate picture of what takes place in language processing and it is these models that will be the subject of future research using the same material. Another goal of future research will be to determine what frequency is needed in order to determine the true relationship of the two variants (whether it is measured in dozens, hundreds or thousands).

References

Arppe, A. & J. Järvikivi. 2007. “Every method counts – Combining corpus-based and experimental evidence in the study of synonymy.” Corpus Linguistics and Linguistic Theory 3(2), 131–160.
Babić, S. 1990. Hrvatska jezikoslovna čitanka. Zagreb: Globus.
Bermel, N. & L. Knittl. 2012. “Corpus frequency and acceptability judgements: A study of morphosyntactic variants in Czech.” Corpus Linguistics and Linguistic Theory 8(2), 241–275.
Bybee, J. 2010. Language, Usage and Cognition. Cambridge: Cambridge University Press.
Clark, E. V. 1987. “The Principle of Contrast.” In B. MacWhinney (ed.), Mechanisms of Language Acquisition, 1–33. Hillsdale, NJ: Lawrence Erlbaum.
Daelemans, W. & A. van den Bosch. 2005. Memory-based language processing. Cambridge: Cambridge University Press.
Ellis, R. 1999. “Item versus System Learning: Explaining Free Variation.” Applied Linguistics 20(4), 460–480.
Fehringer, C. 2004. “How Stable are Morphological Doublets? A Case Study of /ə/ ~ ø variants in Dutch and German.” Journal of Germanic Linguistics 16(4), 285–329.
Jonke, L. 1964. Književni jezik u teoriji i praksi. Zagreb: Znanje.
Kapović, M. 2011. Čiji je jezik. Zagreb: Algoritam.
Lečić, D. 2015. “Inflectional doublets in Croatian: The case of the instrumental singular.” Russian Linguistics 39(3), 375–393.
Marković, I. 2012. Uvod u jezičnu morfologiju. Zagreb: Disput.
Milin, P., Kuperman, V., Kostić, A. & R. H. Baayen. 2009. “Paradigms bit by bit: an information theoretic approach to the processing of paradigmatic structure in inflection and derivation.” In J. P. Blevins and J. Blevins (eds.), Analogy in grammar: Form and acquisition, 214–252. Oxford: Oxford University Press.
Skousen, R., Lonsdale, D. & D. B. Parkinson. 2002. Analogical Modeling: An Exemplar-Based Approach to Language. Amsterdam: John Benjamins.
Težak, S. 1980. “Genitiv množine imenica e vrste s višesuglasničkim osnovnim završetkom.” Jezik 28(1), 1–15.
Thornton, A. M. 2011. “Overabundance (multiple forms realizing the same cell): a non-canonical phenomenon in Italian verb morphology.” In M. Goldbach et al. (eds.), Morphological Autonomy: Perspectives from Romance Inflectional Morphology, 358–381. Oxford: Oxford University Press.
Zoričić, I. 1998. Hrvatski u praksi. Pula: Zavičajna naklada Žakan Juri.

Grammar books of Croatian

Anić, V. 2003. Veliki rječnik hrvatskog jezika. Zagreb: Novi Liber.
[VHG] Babić, S., Brozović, D., Škarić, I. & S. Težak. 2007. Velika hrvatska gramatika, knjiga prva: Glasovi i oblici hrvatskoga književnoga jezika. Zagreb: Nakladni zavod Globus.
Barić, E., Lončarić, M., Malić, D., Pavešić, S., Peti, M., Zečević, V. & M. Znika. 2005. Hrvatska gramatika. Zagreb: Školska knjiga.
[HJS] Institut za hrvatski jezik i jezikoslovlje. 1999. Hrvatski jezični savjetnik. Zagreb: Pergamena/Školske novine.
Jozić, Ž. (ed.) 2013. Hrvatski pravopis. Zagreb: Institut za hrvatski jezik i jezikoslovlje.
Raguž, D. 1997. Praktična hrvatska gramatika. Zagreb: Medicinska naklada.
Težak, S. & S. Babić. 2004. Gramatika hrvatskoga jezika: Priručnik za osnovno jezično obrazovanje. Zagreb: Školska knjiga.

Corpora

(HNK) Hrvatski nacionalni korpus, http://hnk.ffzg.hr.
(HJR) Hrvatska jezična riznica, http://riznica.ihjj.hr.
(hrWaC) Web as Corpus (Croatian), http://nlp.ffzg.hr/resources/corpora/hrwac/.

Adamina Korwin-Szymanowska

Institute of Assisted Human Development and Education, Academy of Special Education, Warsaw

Jacek Tadeusz Waliński

Institute of English Studies, University of Lodz

Figurative dimensions of health: a corpus-illustrated study

Abstract: The meaning of health has been debated among Western philosophers and medicine practitioners for over two millennia. To date, more than 300 various definitions of health have been proposed in medicine, psychology, pedagogy, sociology, and other disciplines preoccupied with the human condition, yet an unequivocal definition of health remains as elusive as it is fundamental to our existence. This study approaches the concept of health as an array of primary conceptual metaphors that arise from the cognitive embodiment. Taking into account data found in the British National Corpus and the Corpus of Contemporary American English, this paper discusses the conceptual mapping of health as a general condition of human functioning onto two basic dimensions of embodied experience, which include up–down and strong–weak scales. From this perspective, health as the metaphorical concept forms gradable antonymy, where contrasting properties between health and disease are represented in terms of a scale running between two poles. Within this gradable antonymy health can be graded against different norms, which means that there is no absolute criterion by which one can tell what it means to be healthy and there may be a partial overlap between different scales.

Keywords: health, conceptual metaphor, metonymy, objectification, gradable antonymy, cognitive corpus-based linguistics

1. Obscurity of the meaning of health

There is an ongoing discussion on what health means, which takes place across all of science, including medicine, psychology, sociology, pedagogy, philosophy, as well as other disciplines preoccupied with the state of human condition (Loudon 1997). Mateusiak, Gwozdecka-Wolniaszek, and Januszek (2011) emphasize that the meaning of health pertains to four basic dimensions of human existence: physical, psychological, social, and spiritual, which makes it a multidimensional concept. Moreover, the meaning of health alternates as it is adjusted to the human life cycle, which involves coping with different developmental tasks. Additionally,
the concept of health is subject to multiple criteria. Among them, (a) a medical criterion is understood as a lack of deviations from certain norms, which is assessed through medical examination; (b) vitality refers to the strength and energy that an individual has; (c) functional capability can be identified as the potential to achieve different aims; (d) a sense of balance includes capabilities of dealing with adversity; and (e) a sense of well-being depends, at least to a large extent, on an overall lack of malady. Furthermore, the concept of health is relative because it encompasses inalienable categories, such as age, sex, level of education, as well as norms and values held by an individual, which can be viewed from historical, cultural, social, and economic perspectives.

Traditional approaches to the meaning of health relied on a distinction between health and disease as marking two opposite poles of one dimension (see Korwin-Szymanowska 2015 for a review). For instance, a biomedical model of health is based on the assumption that any disease has a specific etiological background caused by particular factors leading to changes in the structure and functioning of the human body. Body functions, which are based on biochemical processes of cells and organs, can be measured precisely with biological analyses and other ways of assessment, which produces a set of normalized medical standards of health. From this perspective, health can be defined as a lack of biological dysfunctions. However, since this model focuses predominantly on the level of cellular processes, it leaves out psychological and social aspects, which are also key facets of one’s health (Allen 1998; Brannon & Feist 2010).

More recent attempts at a coherent delineation of health take into account additional, subjective factors, such as emotions, convictions, and self-awareness, which are separate from the objective biomedical health indicators.
Among the most prominent definitions of health that follow this track is the one proposed in 1948 by the World Health Organization. It views health as “a state of complete physical, mental and social well-being and not merely the absence of disease or infirmity” (WHO 1948: 100). It derives from Sigerist’s (1941) thesis that health involves an internal balance between an individual’s body and mind, as well as her interactions with the physical and social environment. From this perspective, health can be viewed as the process of having control over one’s mental and physical capabilities, as well as having the ability to adapt to changes happening in the surrounding environment. Although the definition proposed by WHO is still considered a breakthrough in health studies, some researchers consider it utopian and too generic for practical applications. They point out that the state of perfect balance between various health factors is hardly attainable for an individual. More importantly, however, the
WHO definition shifted the focus of attention from biomedical factors towards more holistic models of health, which are open to psychological, social, and other interpretations (see Brannon & Feist 2010; Woynarowska 2007: 18–19 for discussions). For instance, a new paradigm of health understanding proposed by Engel (1977) under the label biopsychosocial model of health assumes an ongoing interaction between the mind and the body, as well as the environmental and social influences, which relates to numerous health aspects. As summarized by Suls and Rothman (2004):

This perspective holds to the idea that biological, psychological, and social processes are integrally and interactively involved in physical health and illness. The initially provocative premise that people’s psychological experiences and social behaviors are reciprocally related to biological processes has fueled dramatic advances in health psychology over the past 25 years … As a guiding framework, the biopsychosocial model has proven remarkably successful as it has enabled health psychologists to be at the forefront of efforts to forge a multilevel, multisystem approach to human functioning (Suls & Rothman 2004: 119).

While the above-mentioned biomedical model of health basically treats the human being as a machine that can be repaired when broken down by a disease, the biopsychosocial model sees the individual as a unity, emphasizing the fact that apart from illness there are other factors relevant to health, such as motivation to recover, understanding of the disease, social support, etc. (Allen 1998; Brannon & Feist 2010; Korwin-Szymanowska 2015). There are also other models of health that have gained certain popularity. For instance, ecological models (see Sallis, Owen, & Fisher 2008 for a review) situate health and health behavior in the context of ecosystem. The ecological framework includes the health triad, which sees health as a dynamic equilibrium between the host, the environment, and the agent. Upsetting the balance within this triad usually upsets the health of an individual in one way or another. This particular model was proposed as a solution for dealing with the interconnectedness of many global problems and complexities related to managing and caring for our natural environment. Woynarowska (2007: 18) has identified more than 300 different variants of health understanding, which lay emphasis on specific aspects within different frameworks. This diversity of perspectives results in ambiguity of the term health, which recently tends to be substituted by notions such as well-being, quality of life, or happiness (see Heszen & Sęk 2007; Kahneman, Diener & Schwarz 2003). An overall conclusion emerging from the multitude of health definitions is that, although its meaning is intuitively graspable to all of us, a comprehensive description of health escapes mono-disciplinary attempts. This study approaches the
concept of health from still a different angle. Following the observation that our conceptions of health tend to be discussed figuratively in terms of an up–down scale (Lakoff & Johnson 1980), this paper examines the concept of health as an array of conceptual metaphors originating from certain basic dimensions of embodied physical experience.

2. Health as a metaphorical concept

In their seminal work Metaphors We Live By, Lakoff and Johnson (1980) put forward the hypothesis that metaphor is essentially a matter of thought, i.e. an important part of our conceptual structure is metaphorical in nature. Using numerous examples from disparate domains of human activity, they argue that conceptual metaphors shape our understanding and have the potential to determine how we reason about abstract concepts.

Many aspects of our experience cannot be clearly delineated in terms of the naturally emergent dimensions of our experience. This is typically the case for human emotions, abstract concepts, mental activity, time, work, human institutions, social practices, etc. … Though most of these can be experienced directly, none of them can be fully comprehended on their own terms. Instead, we must understand them in terms of other entities and experiences, typically other kinds of entities and experiences. (Lakoff & Johnson 1980: 177).

Although the theory has undergone various adjustments and updates since its original conception (e.g. Lakoff & Johnson 1999; Lakoff 2008; see Ruiz de Mendoza Ibánez & Pérez Hernández 2011 for a concise up-to-date review), the conceptual metaphor can be defined in a nutshell as a conceptual mapping, i.e. a set of correspondences between two conceptual domains, in which a previously stored conceptual representation of one cognitive model is used to provide a structured understanding of another. The source domain is less abstract, i.e. more accessible to perception, than the target domain. Only a part of the source domain is mapped onto the target, and only a part of the target domain is involved in the mapping because one concept cannot be the same as another. A good example of concept that is experienced directly, yet remains as elusive as it is fundamental to our existence is health. As discussed in the previous section, although health constitutes an inseparable part of our daily functioning, it evades a single clear-cut definition due to its numerous dynamic multi-dimensional biopsycho-social aspects. Unfortunately, linguistic studies on figurative conceptions of health as such have been relatively few and far between. While discussing orientational metaphors Lakoff and Johnson (1980: 15) point out that common examples for this category include health is up and sickness is down metaphors,

Figurative dimensions of health


e.g. “She’s in top shape” vs. “He’s sinking fast”. They add that “good is up gives an up orientation to general well-being, and this orientation is coherent with special cases like happy is up, health is up, alive is up …” (Lakoff & Johnson 1980: 15). This indicates that metaphorical conceptualizations of health are grounded in embodied experience (Gibbs, Lima, & Francozo 2004; see Gibbs 2005; Semin & Smith 2008 for edited collections of studies, cf. Mahon 2015 for a recent discussion).

Grady (1997, 1999) introduces a distinction between complex and primary metaphors. Essentially, primary metaphors are simple patterns that map basic perceptual concepts onto equally fundamental but not directly perceptual ones. They arise directly from basic recurring units of human experience. Grady assumes that source concepts for primary metaphors are typically based on various basic force-dynamic concepts (cf. Talmy 1988), such as up, down, forward, backward, bright, dark, etc. The corresponding target domains in primary metaphors include fundamental building blocks of mental experience that escape direct perception, such as happy, sad, difficult, success, and health. In contrast, complex metaphors do not arise directly from experiential correlations but are made up of primary metaphors. For example, various complex metaphors referring to the concept of journey, such as love / a business / a task is a journey, can be accounted for in terms of the primary metaphor purposes are destinations, which explains why the sentence “We are going nowhere” can be used just as well for discussing a marital crisis, an unsuccessful business venture, or frustration with the task at hand.

Lakoff and Johnson (1980: 139–141) observe that health functions as the source domain in complex metaphors. For instance, the love is health metaphor views a relationship in terms of a patient, e.g. “It’s a healthy relationship”, “Their relationship is reviving”.
As summarized by Kövecses (2010: 19), “Both the general properties of health and illness and particular illnesses frequently constitute metaphorical source domains”, which includes such examples as: state is health and economy is health (see Boers 1999; Urbonaite & Šeškauskiene 2007). However, it must be emphasized that complex metaphors in which health acts as the source domain are beyond the scope of this study. In this study of domains used for mapping the meaning of health onto other concepts we take into account the recent theory of objectification proposed by Szwedek (2007, 2011, 2014), who views conceptualization of abstract concepts in terms of concrete entities as objectification. Szwedek (2007, 2011) argues that for a metaphorical structure to exist, it is necessary to objectify the concept by assigning it some physical status. Since all other domains depend on the physical object that


Adamina Korwin-Szymanowska and Jacek Tadeusz Waliński

is accessible to our senses, the object schema acts as the ultimate source domain, which provides grounds for metaphorical conceptions of health. Taking the assumption that health is a primary metaphorical concept that arises directly from embodied human experience, this paper discusses fundamental scales used for its conceptual mapping from the perspective of data found in linguistic corpora.

3. Methodology

This study approaches the question of figurative dimensions of health from the perspective of a cognitive corpus-based approach to language study, which brings together the descriptive framework of cognitive linguistics (Croft & Cruse 2004; Janda 2015) with the methodological workbench of corpus linguistics (McEnery & Hardie 2012). Essentially, cognitive corpus-based linguistics relies on the explanatory frameworks of cognitive linguistics, but approaches them in such a way that their relevance to a given linguistic phenomenon can be empirically validated in large corpora (see Lewandowska-Tomaszczyk & Dziwirek 2009 for an edited collection of studies). More specifically, this study employs a corpus-illustrated approach, i.e. one in which claims about language structure are illustrated with examples taken from corpora (Tummers, Heylen, & Geeraerts 2005).

Grounding research on the conceptual metaphor in empirical corpus data has been advocated by Deignan (1999, 2005, 2008), who points out that “a computerised corpus can enable the researcher to detect patterns of usage more quickly than either the use of intuition or the analysis of individual texts, as words or expressions are automatically retrieved from the corpus and sorted” (Deignan 1999: 178). She adds that grounding research on the conceptual metaphor in corpus data “can reveal many linguistic details that could be passed over in the examination of single texts, and might not be observed at all when data are elicited rather than gathered from language in use” (Deignan 2008: 293). Since one of the most significant objections against conceptual metaphor research has been its overreliance on decontextualized examples, the application of corpus data for this purpose makes observations more inter-subjective and allows one to accept results with greater confidence (Fabiszak & Konat 2013). This research employs two reference corpora for English.
One is the British National Corpus (henceforth BNC), which is a 100-million-word collection of samples of written and spoken contemporary British English from a wide range of texts, not limited to any particular subject field, genre, or register (see www.natcorp.ox.ac.uk for more information). The other is the Corpus of Contemporary American English (henceforth COCA), which contains more than 450 million words balanced between spoken, fiction, academic journals, popular magazines, and


newspapers from 1990–2008 (see corpus.byu.edu/coca/ for more information). Both these resources are publicly available standard reference corpora (McEnery & Wilson 2001: 32), which have been used by researchers in a variety of contexts, including research on conceptual metaphors (e.g. Fabiszak & Kaszubski 2006). The corpora employed for this research were queried with SlopeQ Desktop (ver. 01.05), which is a part-of-speech-sensitive concordancer with support for lemmatization and proximity queries (see Pęzik 2015 for more information). Due to the part-of-speech annotation of the corpora (see Garside, Leech, & McEnery 1997), it enables, for instance, searching specifically for all adjectives used to qualify the noun health with the query “ health” or searching specifically for all nouns qualified by the adjective healthy with the query “healthy ”.

From an array of different strategies which can be used for extracting linguistic expressions that reflect conceptual metaphorical mappings from corpora (see Stefanowitsch 2006 for a review), this study employs searching for sentences containing lexical items from both the source domain and the target domain. The examination was implemented by looking for expressions in which nouns used to refer to health as the conceptual target domain either precede or follow lexemes used to refer to up/down and strong/weak as the metaphorical source domains investigated in this study. Specifically, lexical items used to refer to “health” were limited to: health, condition, state, form, and shape. The “up” state was specified using the lexemes: up, upward, peak, top; while the “down” state using: down, downward, bottom, downhill. The “strong” state was specified using the lexemes: strong, powerful, robust, stout, sturdy; whereas the “weak” state using: weak, feeble, enfeebled, frail, fragile.
Moreover, expressions referring to transitions in the state of health were found by looking for the noun health followed by verbs marking a positive change: improve, a negative change: deteriorate, or stabilization: stabilise (stabilize). Negative changes expressed in terms of downward movement were found by looking for the noun health followed by the verbs: decline, drop, plunge, and tumble. Different factors that affect health were found by looking for the noun health preceded by verbs used to express either positive or negative influence: help or damage, respectively. Additionally, examples of phrasal verbs expressing the negative change of health as a down-transition were found by looking for the verbs: strike, come, go, followed by the particle down, and then lexemes related to sickness: disease, symptoms, illness, sickness, cold, and flu. Finally, metonymical references to health and disease in terms of strength and weakness were taken into consideration (see Section 4.2). Expressions referring to transitions in the state of health as changes in one’s strength were found by


looking for the noun strength preceded by verbs expressing either restoration of the previous state: restore, recover, recoup, regain, or a loss of strength: lose, drain, sap, rob. Examples of a change in the health state expressed in terms of getting weaker or stronger were found by looking for combinations of the verbs get and feel with the comparative adjectives stronger and weaker.

Searching was implemented using proximity queries (Bernard & Griffin 2009). They allow for searching with a slop value, which specifies how far apart the lexical items included in a query can be from one another and still be returned as a result of the query. The slop can be used in combination with the binary (yes/no) preserve order option, which indicates whether the original order of query terms should be retained in results. In this study, proximity queries were implemented by adjusting the value of slop between 1 and 5, depending on the number of examples retrieved from the corpus. The preserve order option was set to either “yes” or “no”, depending on the particular query. Setting this option to “no” allows, in some cases, for finding both examples of health changing its state, e.g. “Baxter’s health greatly improved”, and factors affecting its state, e.g. “The right lighting can … improve your health”. All queries used to obtain the examples are listed in the Appendix, which provides for immediate replicability of the study. Since this paper does not aspire to make any quantitative claims about health metaphors based on their frequency in the corpora (cf. Fabiszak 2008), this methodology seems to be reasonably adequate for the purpose of the study.
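The interplay of the slop value and the preserve order option can be illustrated with a minimal Python sketch. It is not the implementation used by SlopeQ or Lucene: the function names `tokenize` and `proximity_match` are hypothetical, slop is simplified here to the maximum number of tokens allowed between the two query terms, and lemmatization is approximated by listing word forms explicitly.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def proximity_match(tokens, first_forms, second_forms, slop=1, preserve_order=True):
    """Return position pairs (i, j) where a form from `first_forms` (at i) and
    a form from `second_forms` (at j) have at most `slop` tokens between them.
    With preserve_order=True, the second term must follow the first."""
    hits = []
    for i, left in enumerate(tokens):
        if left not in first_forms:
            continue
        for j, right in enumerate(tokens):
            if j == i or right not in second_forms:
                continue
            if preserve_order and j < i:
                continue
            if abs(j - i) - 1 <= slop:  # number of intervening tokens
                hits.append((i, j))
    return hits


# Assumed stand-ins for lemmatized query terms.
HEALTH = {"health"}
IMPROVE = {"improve", "improves", "improved", "improving"}

changing = tokenize("Baxter's health greatly improved")
affecting = tokenize("The right lighting can improve your health")

print(proximity_match(changing, HEALTH, IMPROVE, slop=1))                         # [(1, 3)]
print(proximity_match(affecting, HEALTH, IMPROVE, slop=1))                        # []
print(proximity_match(affecting, HEALTH, IMPROVE, slop=1, preserve_order=False))  # [(6, 4)]
```

Under these simplifying assumptions, the order-preserving query returns only the example in which health changes its state, whereas switching the option off also retrieves the example in which a factor affects health.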

4. Figurative dimensions of health

A discussion of figurative dimensions of health must start from the observation that the word “health” is a polysemous lexical item, which embraces a cluster of meanings that are closely connected to one another (Lewandowska-Tomaszczyk 2007). In one sense, health functions in language as the semantically charged concept that signifies the positive condition of one’s being healthy, as opposed to the negative condition of one’s being ill. This meaning is defined in OED (2009) as “Soundness of body; that condition in which its functions are duly and efficiently discharged”. As illustrated in Figure 1, this sense refers to that part on the scale of biopsychosocial functioning of an organism which marks the positive condition, which fits into the model of health and illness promoted by Antonovsky (1987).

Figure 1. Health as the positive condition of biopsychosocial functioning.


Examples found in the corpus data indicate that health as the expressly positive condition is something we often wish for ourselves and others, e.g. “We all love, hate, cry, fear, bleed, die, and wish for health and happiness” (BNC), “I sincerely wish you health and happiness” (COCA). It also tends to be associated with happiness: the phrase health and happiness occurs 14 times in the BNC and 75 times in the COCA. In this positive sense, it additionally functions as the root for the adjective healthy, whose meaning refers to “possessing or enjoying good health; hale or sound (in body), so as to be able to discharge all functions efficiently” (OED, 2009). This adjective is used in various contexts with reference to individuals and the community, e.g. healthy adults, healthy baby, healthy children, healthy population, healthy society; body parts, e.g. healthy cells, healthy hair, healthy heart, healthy skin, healthy teeth and gums; and all sorts of living organisms, e.g. healthy animals, healthy plants, healthy crops, healthy trees. It also functions in another closely connected sense, which means “conducive to or promoting health” (OED, 2009), in a wide variety of literal expressions such as healthy air, healthy breakfast, healthy diet, healthy environment, healthy lifestyle, and metaphorically to denote “sound condition” (OED, 2009), e.g. healthy country, healthy democracy, healthy economy, healthy industry, healthy market, etc.

However, the word “health” also has another sense, which, as noted in the OED (2009), evolved from the above prototypical sense by extension. It can be defined as “The general condition of the body with respect to the efficient or inefficient discharge of functions” (OED, 2009). As illustrated in Figure 2, this other sense refers to the whole dimension of biopsychosocial functioning of an organism.

Figure 2. Health as the whole dimension of biopsychosocial functioning.

Since health in this sense is semantically uncharged, it includes both positive and negative parts of the scale. Accordingly, it can be qualified by both positive, e.g. good health, blooming health, excellent health, perfect health, and negative modifiers, e.g. bad health, failing health, ill health, poor health, and weak health, which occur multiple times in both corpora analyzed in this study. Moreover, health in this sense tends to be discussed as a state/process that undergoes either positive or negative transitions (cf. Pustejovsky 1991). It can deteriorate, as in “His health has deteriorated substantially in prison” (COCA), improve, as in “After a week in the country, Baxter’s health greatly improved” (BNC), or stabilize as in “The pope’s health has stabilized” (COCA). Additionally, it can be influenced by a multitude of different factors, which affect health either positively, e.g. “The right lighting


can boost your mood and improve your health” (COCA), “Sporadic exercise can help general health” (BNC), or negatively, e.g. “Smoking can seriously damage your health” (BNC), “Overwork as an undergraduate damaged his health” (COCA).

4.1 up–down scale

Since health as the general dimension of biopsychosocial functioning has a tendency to fluctuate and can be affected by countless factors, its exact condition is difficult to specify in absolute terms. Probably for that reason, as observed by Lakoff and Johnson (1980), it tends to be mapped conceptually onto an up–down scale. While the up part is associated with the positive condition of human health, i.e. health in the above-discussed prototypical meaning, the down part of the scale is associated with the negative condition, i.e. disease. This metaphorical mapping can be illustrated with a variety of examples from the corpora probed in this study. For instance, an alternating positive and negative condition of one’s health can be described in terms of up and down states, e.g. “John’s health had been up and down for years”, “Mrs. Fassbinder’s health has been up and down over the past five years, but she has not had to stay overnight in the hospital” (COCA). Since the positive condition of one’s health is associated with the top part of the scale, somebody who is very healthy is said to be at the peak of health, e.g. “Only forty-seven, Drake had appeared to be at the peak of health”, “[Caine] jumped down from the fence, moving like a man at the peak of health”, or in the peak of health, e.g. “Meijer tries to stay in the peak of health by eating right and exercising”; “[Her doctor] pronounced her in the peak of health and quite ready to go home” (COCA). Similarly, efforts taken to maintain a positive condition of an organism can be expressed in terms of keeping in peak health, e.g. “The aim is simple: … to keep your hair in peak health”, “weekly maintenance … should keep your … livestock in steady conditions and peak of health” (BNC). A person whose health is in a very positive condition can be described as being in top health, e.g. “He missed the game against Detroit and has rarely been in top health” (COCA), top form, e.g.
“She was still taking medication after her breast cancer operation but she seemed to be on top form” (BNC), or top shape, e.g. “Burke was certain his heart had always worked just fine, always been in top shape” (COCA). In contrast, the negative condition of one’s health can be described in terms of being located down the scale, as in “My health is down since the damn tractor laid on my chest” (COCA). A gradual deterioration of one’s health can be expressed in terms of health going downhill, e.g. “[She] says her health started to go downhill soon after she received mercury fillings in her teeth”, “Soon after giving birth to Brody and Conner, Rachel’s health had begun to go downhill” (COCA). When


one’s health deteriorates it declines, e.g. “Shortly after that, Messick suffered a stroke and her health declined”, “In recent years, Sinatra’s health declined and he rarely was seen in public” (COCA), takes a plunge, e.g. “When we arrived with a U-Haul, my grandmother’s health took a plunge downward”, “[He] has recently gone public with the news that his mental health took its own plunge years ago” (COCA), or tumbles, e.g. “The immune system is like a set of scales that sometimes tips sharply enough to send a person’s health tumbling” (COCA). When one gets ill, she/he is struck down with a disease, e.g. “Have you been struck down with flu this year?” (BNC), comes down with a disease, e.g. “I came down with a blood and kidney disease that was diagnosed first as possible leukemia” (COCA), or goes down with a disease, e.g. “He went down with influenza symptoms on Tuesday evening” (BNC).

4.2 strong–weak scale

What additionally emerges from the corpus data is that health as the general dimension of one’s biopsychosocial functioning can be mapped conceptually onto other scales originating from embodied experience. Another scale used for the conceptual mapping of the general condition of health is a strong–weak scale. In this source domain, figurative expressions of health appear to employ a metonymic (Kövecses & Radden 1998; Panther & Thornburg 2007; see also Bierwiaczonek 2013 for a recent account) mapping of strength and weakness as standing for the positive or negative condition of health, respectively. This fits into the category of metonymic schemas of causation, which are based on a cause-and-effect type of relationship (Kövecses & Radden 1998: 56). In this case, an effect for cause metonymic relationship is created, in which strength is mapped metonymically onto the positive part of the scale as the effect brought about by good health. On the other hand, weakness as the effect typically caused by disease is mapped metonymically onto the negative condition of health. In the realm of health conceptualizations, the conceptual relationship between elements in this metonymical frame can be specified more precisely as physiological effect for biopsychosocial state.

What can be observed in this context is that the concept of health undergoes objectification (Szwedek 2007, 2011). Since health and disease are certain states and all states are objects,1 health is objectified as an object.

1 Szwedek (2011, 2014) proposes the following line of reasoning: if states are conceptualized as containers (Lakoff & Johnson 1980: 30) and containers are objects, then states are objects. See also states are objects in Kövecses 2000: 93–97.

Figurative expressions of health condition along the strong–weak scale map the dynamic multidimensional concept, which includes a variety of biopsychosocial aspects, onto a


single dimension of embodied experience. More specifically, the objectification of health–disease as strength–weakness results in the creation of conceptualizations that “map values from the source domain onto states in the target domain” (Szwedek 2014: 371). Due to that, the condition of health can be reduced to a certain state within the physical realm. Specific metaphors that may be proposed in this context include health is a strong human (body) and sickness is a weak human (body). Within the scope of these metaphors, health is given a coherent structure and inherits properties of the prototypical object,2 which can be specified in basic physical terms (see Szwedek 2014 for a broader discussion on the relationships between domains in metaphorization). The metaphorical mapping of health–disease onto strength–weakness can be illustrated with a variety of examples taken from the corpora probed in this study. For instance, a positive condition of health can be described as robust (good) health, e.g. “[She] insisted that her mother, a woman in robust health, was gravely ill”, “Their return to robust good health was swift and uninterrupted” (COCA), or sturdy health, e.g. “If you want to live a long, long time in sturdy health…” (BNC). By analogy, an ailing person does not enjoy robust health, e.g. “Mozart, never in robust health, died Dec. 5 after a streptococcal infection” (COCA). Treatment can be expressed as restoring strength, e.g. “[They had] drawn the angry scarlet blush from his wounds, and restored his strength” (COCA). Recovery from illness can be put across in terms of regaining one’s strength, e.g. “[MacDougal has been] trying to regain his strength after a bout with a stomach virus”, “[The] pope did suffer ‘a bout of difficult health’ a few years ago, then regained his strength”, or recovering one’s strength, e.g. “As soon as she recovered her strength, she returned to her hours in the library” (COCA). 
Likewise, the process of recovery can be described as getting stronger, e.g. “Now on antiviral medication, he’s getting stronger”, “I had been quite ill and had that pleasant feeling of getting stronger each day”, or feeling stronger, e.g. “Tina had just battled back from two weeks of terrible sickness, and was feeling stronger again” (COCA). On the other hand, a deterioration of one’s health condition can be expressed in terms of losing strength, e.g. “She was 81; she knew she was losing her strength; and she felt she might not recover it” (COCA). Disease is associated with something that drains strength, e.g. “The diarrhea had utterly drained her strength”; saps strength, e.g. “frequent operations and hospitalizations sapped what was left of his strength”; or robs strength, e.g. “a paralysis

2 In his theory of objectification, Szwedek (2007, 2011, 2014) follows Kotarbiński’s philosophical doctrine of reism, which assumes that “Persons ought to be regarded as objects, i.e. sentient objects” (Kotarbiński 1990: 4).


that robbed his strength”, “[He] puts up a gallant fight against the brain tumor that daily robs him of his strength” (COCA).

At the opposite pole of the scale, the negative condition of one’s health is associated with weakness. Accordingly, sickness can be expressed in terms of a weak state, e.g. “He’s in a weak state of health”, or a weak condition, e.g. “Her general condition was so weak that he had arranged for the doctors to take special care of her”. Generally poor health can be described as weak health, e.g. “She was a premature baby, and suffered weak health during early childhood” (BNC), “He privately he made fun of his weak health” (COCA); frail health, e.g. “They ignored the Michael Jackson’s frail health or pushed him too hard”, “Pope John Paul II may have to cut back on his travels due to his frail health” (COCA); or fragile health, e.g. “[The King] had been in fragile health since he was hospitalized in the United States four years ago for lung problems” (COCA). A deterioration of one’s health condition can be expressed in terms of getting weaker, e.g. “Gee, I’m getting weaker, I’m deteriorating”, “I kept getting sicker and sicker and weaker and weaker” (COCA). When one’s health falls below a certain level, the person can be described as too weak, e.g. “He was already too weak to undergo a liver transplant”, “She was too weak with AIDS to leave her home” (COCA). Similar adjectives used in this context include feeble, e.g. “My son was so feeble that I thought I would lose him at birth” (BNC), “He felt now like a man who, long enfeebled, is finally cured of a serious illness” (COCA); and frail, e.g. “Moving a frail parent with chronic disease into your home is certainly a gesture of love”, “When you and your parent visit a frail friend or relative in the hospital…”, “She was beginning to feel a little ‘frail’ and was admitted to hospital” (COCA).

5. Conclusions

From the cognitive corpus-based linguistic perspective, the conceptualization of health as the dimension of human functioning appears to hinge on conceptual mappings derived from basic aspects of embodied experience. The spatialization of health in health is up and sickness is down metaphors may be attributed to their physical basis: while a healthy condition is associated with upright posture, illness typically forces us to lie down physically (Lakoff & Johnson 1980: 15). Similarly, the conceptual mapping of health as strength and sickness as weakness can be motivated by embodiment: while the physical strength of the human body is associated with good health, weakness is among the common symptoms of illness. The use of these particular scales suggests that people’s ordinary conceptualizations of health and disease are deeply grounded in cognitive embodiment, and indicates that people have a strong tendency to rely on embodied


experiences to understand the nature of health and illness (see Gibbs & Franks 2002; Semino et al. 2015 for studies conducted in the specific context of cancer). Although the up/down and strong/weak dimensions appear to be the prevailing conceptual domains in health conceptualizations, other domains can also be used for this purpose, albeit perhaps not as consistently, e.g. a bright/dark scale, as in “You look wonderful, said John, glowing with health” vis-à-vis “That’s better than fading away in a hospital bed” (BNC) (see Kövecses 2010: 18–23 for a review of common source domains).

Figurative expressions of health as the dimension along the up–down and strong–weak scales indicate that health and disease form a gradable antonymy, i.e. a pair of words with opposite meanings, where contrasting properties between the two meanings lie on a continuous spectrum running between two poles (Cruse & Togia 1995; Lewandowska-Tomaszczyk 2010; see Jones, Murphy, Paradis, & Willners 2012 for a recent comprehensive discussion on antonyms in English), rather than the +/– axiological property of a word (Krzeszowski 1997). As noted by Croft and Cruse (2004: 169), the principal image-schema for antonymy of this kind is scale, which construes a property in terms of more and less. Since within the gradable antonymy health can be graded against different norms, there is no absolute single criterion by which one can tell what it means to be healthy. Moreover, there may be a partial overlap between different scales used for the conceptual mapping of health. This, coupled with an extensive set of varying states between health and disease, contributes to the vagueness and imprecision (Tuggy 1993; Solt 2015) of figurative health expressions.
Finally, there are other aspects of the meaning of health that have been left out of the scope of this paper, such as ontological domains used for the objectification of health in the prototypical sense of the positive condition of human functioning, as well as inherent force-dynamics in health conceptualizations. They undoubtedly require further analysis, which opens several paths for further cognitive corpus-based linguistic studies on health as a figurative concept.

References

Allen, F. (1998). Health Psychology: Theory and Practice. St. Leonards: Allen & Unwin.
Antonovsky, A. (1987). Unraveling the mystery of health: How people manage stress and stay healthy. San Francisco, CA: Jossey-Bass.
Bernard, E., & J. Griffin. (2009). Understanding Lucene’s query syntax. In Hibernate Search in Action, 202–214. Greenwich, CT: Manning.


Bierwiaczonek, B. (2013). Metonymy in Language, Thought and Brain. Sheffield: Equinox.
Boers, F. (1999). When a bodily source domain becomes prominent: the joy of counting metaphors in the socio-economic domain. In R. W. Gibbs & G. Steen (eds.), Metaphor in Cognitive Linguistics, 47–56. Amsterdam: John Benjamins.
Brannon, L., & J. Feist. (2010). Health Psychology: An Introduction to Behavior and Health. Belmont, CA: Wadsworth.
Croft, W., & D. A. Cruse. (2004). Cognitive Linguistics. Cambridge: Cambridge University Press.
Cruse, D. A., & P. Togia. (1995). Towards a cognitive model of antonymy. Lexicology, 1, 113–141.
Deignan, A. (1999). Corpus-based research into metaphor. In L. Cameron & G. Low (eds.), Researching and Applying Metaphor, 177–199. Cambridge: Cambridge University Press.
Deignan, A. (2005). Metaphor and Corpus Linguistics. Amsterdam: John Benjamins.
Deignan, A. (2008). Corpus Linguistics and Metaphor. In R. W. Gibbs (ed.), The Cambridge Handbook of Metaphor and Thought, 280–294. Cambridge: Cambridge University Press.
Engel, G. L. (1977). The Need for a New Medical Model: A Challenge for Biomedicine. Science, 196, 129–136.
Fabiszak, M. (2008). Corpus frequency as a guide to metaphor labelling. In Z. Wąsik & T. Komendziński (eds.), Metaphor and Cognition, 149–162. Frankfurt am Main: Peter Lang.
Fabiszak, M., & P. Kaszubski. (2006). Studying metaphor with the BNC. Poznań Studies in Contemporary Linguistics, 41, 111–129.
Fabiszak, M., & B. Konat. (2013). Zastosowanie korpusów językowych w językoznawstwie kognitywnym [Application of language corpora in cognitive linguistics]. In P. Stalmaszczyk (ed.), Metodologie językoznawstwa. Ewolucja języka. Ewolucja teorii językoznawczych, 131–142. Łódź: Wydawnictwo Uniwersytetu Łódzkiego.
Garside, R., Leech, G. N., & T. McEnery. (1997). Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman.
Gibbs, R. W. (2005). Embodiment and Cognitive Science. Cambridge: Cambridge University Press.
Gibbs, R. W., & H. Franks. (2002). Embodied Metaphor In Women’s Narratives About Their Experiences With Cancer. Health Communication, 14(2), 139–165. doi:10.1207/S15327027HC1402_1.


Gibbs, R. W., Lima, P. L. C., & E. Francozo. (2004). Metaphor is grounded in embodied experience. Journal of Pragmatics, 36(7), 1189–1210. doi:10.1016/j.pragma.2003.10.009.
Grady, J. E. (1997). Foundations of Meaning: Primary Metaphors and Primary Scenes (Ph.D. Dissertation). University of California, Berkeley.
Grady, J. E. (1999). A typology of motivation for conceptual metaphor: Correlation vs. resemblance. In R. W. Gibbs & G. Steen (eds.), Metaphor in Cognitive Linguistics, 79–100. Amsterdam: John Benjamins.
Heszen, I., & H. Sęk. (2007). Psychologia zdrowia [Health Psychology]. Warszawa: Wydawnictwo Naukowe PWN.
Janda, L. A. (2015). Cognitive Linguistics in the Year 2015. Cognitive Semantics, 1(1), 131–154. doi:10.1163/23526416-00101005.
Jones, S., Murphy, M. L., Paradis, C., & C. Willners. (2012). Antonyms in English: Construals, Constructions and Canonicity. Cambridge: Cambridge University Press.
Kahneman, D., Diener, E., & N. Schwarz. (eds.) (2003). Well-being: the foundations of hedonic psychology. New York: Russell Sage Foundation.
Korwin-Szymanowska, A. (2015). Poziom świadomości zdrowotnej młodzieży akademickiej uczelni warszawskich [The level of health awareness among students in Warsaw universities]. (Ph.D. Dissertation). Academy of Special Education, Warsaw.
Kotarbiński, T. (1990). Philosophical Self-Portrait. In J. Woleński (ed.), Kotarbiński: Logic, Semantics and Ontology, 1–6. Dordrecht: Springer Netherlands.
Kövecses, Z. (2000). Metaphor and Emotion: Language, Culture, and Body in Human Feeling. Cambridge: Cambridge University Press.
Kövecses, Z. (2010). Metaphor: A Practical Introduction, 2nd Ed. New York: Oxford University Press.
Kövecses, Z., & G. Radden. (1998). Metonymy: Developing a cognitive linguistic view. Cognitive Linguistics, 9(1), 37–78. doi:10.1515/cogl.1998.9.1.37.
Krzeszowski, T. P. (1997). Angels and Devils in Hell: Elements of Axiology in Semantics. Warszawa: Energeia.
Lakoff, G. (2008). The Neural Theory of Metaphor. In R. W. Gibbs (ed.), The Cambridge Handbook of Metaphor and Thought, 17–38. Cambridge: Cambridge University Press.
Lakoff, G., & M. Johnson (1980). Metaphors We Live By. Chicago: University of Chicago Press.
Lakoff, G., & M. Johnson (1999). Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Thought. Chicago: University of Chicago Press.


Lewandowska-Tomaszczyk, B. (2007). Polysemy, Prototypes, and Radial Categories. In D. Geeraerts & H. Cuyckens (eds.), The Oxford Handbook of Cognitive Linguistics, 139–169. Oxford: Oxford University Press.
Lewandowska-Tomaszczyk, B. (2010). Meaning. In B. Lewandowska-Tomaszczyk (ed.), New Ways to Language, 105–132. Łódź: Wydawnictwo Uniwersytetu Łódzkiego.
Lewandowska-Tomaszczyk, B., & K. Dziwirek (eds.). (2009). Studies in Cognitive Corpus Linguistics. Frankfurt am Main: Peter Lang.
Loudon, I. (ed.) (1997). Western Medicine. An Illustrated History. Oxford: Oxford University Press.
Mahon, B. Z. (2015). What is embodied about cognition? Language, Cognition and Neuroscience, 30(4), 420–429. doi:10.1080/23273798.2014.987791.
Mateusiak, J., Gwozdecka-Wolniaszek, E., & M. Januszek. (2011). Kręte ścieżki pomiaru zdrowia – prace nad konstrukcją kwestionariusza do oceny zdrowia [Winding roads of health measurement – working out a health evaluation survey]. In M. Górnik-Durose & J. Mateusiak (eds.), Psychologia zdrowia: Konteksty i pogranicza, 125–147. Katowice: Wydawnictwo Uniwersytetu Śląskiego.
McEnery, T., & A. Hardie (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press.
McEnery, T., & A. Wilson (2001). Corpus Linguistics: An Introduction, 2nd Ed. Edinburgh: Edinburgh University Press.
Panther, K.-U., & L. L. Thornburg (2007). Metonymy. In D. Geeraerts & H. Cuyckens (eds.), The Oxford Handbook of Cognitive Linguistics, 236–263. Oxford: Oxford University Press.
Pęzik, P. (2015). Spokes – a Search and Exploration Service for Conversational Corpus Data. In Selected Papers from the CLARIN 2014 Conference, October 24–25, 2014, Soesterberg, 99–109. Linköping: Linköpings University Electronic Press. Retrieved from: http://www.ep.liu.se/ecp_article/index.en.aspx?issue=116;article=009.
Pustejovsky, J. (1991). The syntax of event structure. Cognition, 41(1–3), 47–81. doi:10.1016/0010-0277(91)90032-Y.
Ruiz de Mendoza Ibáñez, F. J., & L. Pérez Hernández (2011). The Contemporary Theory of Metaphor: Myths, Developments and Challenges. Metaphor and Symbol, 26(3), 161–185. doi:10.1080/10926488.2011.583189.
Sallis, J. F., Owen, N., & E. B. Fisher (2008). Ecological models of health behavior. In K. Glanz, B. K. Rimer, & K. Viswanath (eds.), Health Behavior and Health Education: Theory, Research, and Practice, 4th Ed. San Francisco, CA: Jossey-Bass.
Semin, G. R., & E. R. Smith (eds.). (2008). Embodied Grounding: Social, Cognitive, Affective, and Neuroscientific Approaches. Cambridge: Cambridge University Press.


Semino, E., Demjen, Z., Demmen, J., Koller, V., Payne, S., Hardie, A., & P. Rayson. (2015). The online use of Violence and Journey metaphors by patients with cancer, as compared with health professionals: a mixed methods study. BMJ Supportive & Palliative Care, published online ahead of print. doi:10.1136/bmjspcare-2014-000785.
Sigerist, H. E. (1941). Medicine and Human Welfare. Michigan: Yale University Press.
Solt, S. (2015). Vagueness and Imprecision: Empirical Foundations. Annual Review of Linguistics, 1(1), 107–127. doi:10.1146/annurev-linguist-030514-125150.
Stefanowitsch, A. (2006). Corpus-based approaches to metaphor and metonymy. In A. Stefanowitsch & S. T. Gries (eds.), Corpus-Based Approaches to Metaphor and Metonymy, 1–16. Berlin: Mouton de Gruyter.
Suls, J., & Rothman, A. (2004). Evolution of the Biopsychosocial Model: Prospects and Challenges for Health Psychology. Health Psychology, 23(2), 119–125. doi:10.1037/0278-6133.23.2.119.
Szwedek, A. (2007). An alternative theory of metaphorization. In M. Fabiszak (ed.), Language and Meaning: Cognitive and Functional Perspectives, 313–327. Frankfurt am Main: Peter Lang.
Szwedek, A. (2011). The ultimate source domain. Review of Cognitive Linguistics, 9(2), 341–366. doi:10.1075/rcl.9.2.01szw.
Szwedek, A. (2014). The nature of domains and the relationships between them in metaphorization. Review of Cognitive Linguistics, 12(2), 342–374. doi:10.1075/rcl.12.2.04szw.
Talmy, L. (1988). Force Dynamics in Language and Cognition. Cognitive Science, 12(1), 49–100. doi:10.1207/s15516709cog1201_2.
Tuggy, D. (1993). Ambiguity, polysemy, and vagueness. Cognitive Linguistics, 4(3), 273–290. doi:10.1515/cogl.1993.4.3.273.
Tummers, J., Heylen, K., & D. Geeraerts. (2005). Usage-based approaches in Cognitive Linguistics: A technical state of the art. Corpus Linguistics and Linguistic Theory, 1(2), 225–261. doi:10.1515/cllt.2005.1.2.225.
Urbonaite, J. & I. Šeškauskiene (2007). HEALTH Metaphor in Political and Economic Discourse: a Cross-Linguistic Analysis. Studies About Languages, 11, 68–73.
WHO. (1948). Preamble to the Constitution of the World Health Organization as adopted by the International Health Conference. New York: World Health Organization. Retrieved on December 10, 2014 from http://www.who.int/about/definition/en/print.html.
Woynarowska, B. (2007). Edukacja zdrowotna [Health Education]. Warszawa: Wydawnictwo Naukowe PWN.


Corpora and resources

BNC. (2001). The British National Corpus (2001). [World Edition] Available from Oxford University Computing Services at: www.natcorp.ox.ac.uk.
COCA. (2012). The Corpus of Contemporary American English. Available from Brigham Young University at: byu.edu/coca/.
OED. (2009). Oxford English Dictionary, 2nd Edition, Version 4.0. [CD-ROM]. Oxford: Oxford University Press.
SlopeQ. (2015). A part-of-speech-sensitive concordancer with support for lemmatization and proximity queries, ver. 01.05. [Developed by Piotr Pęzik]. Łódź: University of Łódź.

Appendix

The corpora used for this study were searched with the SlopeQ concordancer, which supports part-of-speech tagging, lemmatization, and proximity queries (see Pęzik, 2015 for more information). Lemmatization allows for queries covering all English inflectional forms of a word, with a double asterisk (**) used as a wildcard. For example, the query “go**” stands for “go, goes, went, gone, going”. The slop value and the preserve-order option used for proximity queries are indicated in square brackets after each query. The pipe symbol ( | ) indicates logical OR, which makes it possible to execute multiple queries in a single line. All queries were run on the full contents of the corpus. The following queries were used in this study:

[ANY ADJECTIVE] HEALTH: health
HEALTHY [ANY NOUN]: healthy
HEALTH AS UP/DOWN: health|condition|state|form|shape up|upward|peak|top|bottom|down|downward|downhill [Slop=5, Preserve order=No]
HEALTH AS STRONG/WEAK: strong|powerful|robust|stout|sturdy|frail|weak|feeble|enfeebled|frail|fragile health|condition|state|form|shape [Slop=5, Preserve order=No]
HEALTH AND HAPPINESS: health and happiness
HEALTH TRANSITIONS: health improve**|deteriorate**|stabilize**|stabilise** [Slop=3, Preserve order=No]
FACTORS AFFECTING HEALTH: help**|damage** health [Slop=3, Preserve order=Yes]
HEALTH AS NEGATIVE CHANGE DOWNWARD: health decline**|drop**|plunge**|tumble** [Slop=5, Preserve order=Yes]


VERBS EXPRESSING DOWN-TRANSITION OF HEALTH: strike**|come**|go** down disease|symptoms|illness|sickness|cold|flu [Slop=5, Preserve order=No]
HEALTH AS RESTORATION/LOSS OF STRENGTH: restore**|regain**|recoup**|recover**|lose**|drain**|sap**|rob** strength [Slop=5, Preserve order=Yes]
CHANGE OF HEALTH AS GETTING WEAKER/STRONGER: get**|feel** stronger|weaker [Slop=1, Preserve order=Yes]
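
The query notation described in this appendix can be illustrated with a small sketch. The Python code below is not SlopeQ and does not reproduce its implementation. It is a minimal illustration under stated assumptions: "slop" is read here as the maximum number of tokens allowed between the two matched words (the concordancer's exact definition may differ), the lemma table is a toy stand-in for real lemmatization, and all function names are hypothetical.

```python
# Toy lemma table standing in for a real lemmatizer (illustrative only).
LEMMAS = {
    "improves": "improve", "improved": "improve", "improving": "improve",
    "deteriorated": "deteriorate", "deteriorates": "deteriorate",
}


def term_matches(token, term):
    """Match one query term; a trailing '**' means 'any form of this lemma'."""
    token = token.lower()
    if term.endswith("**"):
        return LEMMAS.get(token, token) == term[:-2]
    return token == term


def slot_matches(token, slot):
    """A slot is a '|'-separated list of alternatives (logical OR)."""
    return any(term_matches(token, term) for term in slot.split("|"))


def proximity_hits(tokens, slot_a, slot_b, slop, preserve_order):
    """Return (i, j) index pairs where slot_a matches tokens[i] and slot_b
    matches tokens[j] with at most `slop` tokens in between (assumed reading).
    With preserve_order=True, slot_a must come before slot_b."""
    hits = []
    for i, tok_a in enumerate(tokens):
        if not slot_matches(tok_a, slot_a):
            continue
        for j, tok_b in enumerate(tokens):
            if i == j or not slot_matches(tok_b, slot_b):
                continue
            if preserve_order and j < i:
                continue
            if abs(j - i) - 1 <= slop:
                hits.append((i, j))
    return hits


# Invented example sentence, queried in the style of HEALTH TRANSITIONS.
sentence = "Her health has steadily improved since the operation".split()
print(proximity_hits(sentence, "health", "improve**|deteriorate**", 3, False))
# prints [(1, 4)]
```

Under this reading, lowering the slop to 1 would drop the hit, since two tokens separate "health" and "improved"; with Preserve order=Yes, a phrase such as "improved health" would not match a query that lists "health" first.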

Stanisław Goźdź-Roszkowski University of Łódź

“Justice with an attitude?” – towards a corpus-based description of evaluative phraseology in judicial discourse

Abstract: This study discusses the role of grammar patterns (v-link + ADJ + that, v-link + ADJ + to-infinitive and N that) in expressing evaluative meanings in a corpus of US Supreme Court opinions. It is argued that insights from corpus linguistics and the concept of Local Grammar can be used to identify the systematic ways in which judges signal their attitudes and assess the arguments of other legal interactants in their written opinions. The article outlines various ways in which the concept of evaluation is understood and discusses its role in legal argumentation. Using a substantial corpus of over 1.3 million words of legal opinions, it is shown that judges tend to rely on both overt and covert linguistic clues to evaluate arguments put forward in court. The first two patterns, the v-link + ADJ + that pattern and the v-link + ADJ + to-inf pattern, turn out to be a good diagnostic for identifying evaluative adjectives. Apart from locating instances of evaluative acts, they help to indicate the positioning of interactants in the evaluation process. The nouns identified in the third pattern (N + that) tend to be associated with mostly negative evaluation of propositions in legal argumentation.

Keywords: Evaluation, phraseology, judicial discourse, grammar patterns, local grammar

“Anyway, that’s my view and it happens to be correct”. (Antonin Scalia)

“The Court’s errors on both points spring forth from the same diseased root: an exalted conception of the role of this institution in America” (United States vs. Windsor)

“In light of the fundamental nature of the substantive rights embodied in the right to marry – and their central importance to an individual’s opportunity to live a happy, meaningful, and satisfying life as a full member of society … ” (United States vs. Windsor)

1. Introduction

As the excerpts from the epigraph for this chapter show, contrary to popular belief, judicial discourse is not devoid of emotive, attitudinal expressions. The first quote, attributed to Antonin Scalia1, one of the judges of the United States Supreme Court, reflects the fact that although judges are expected to draft linear

1 This quote can be found in Murphy 2008, Scalia: A Court of One.


lines where the formulation of the decision merely reflects the application of the relevant legal norms to the facts of the case, the articulation of the judges’ argumentation presupposes a certain degree of subjectivity. Judges are no longer considered mere bouche de la loi (literally, mouth of the law), simple translators of legal norms into practice (cf. Garavelli, 2010: 97). Their ‘presence’ in the texts they produce is becoming more and more evident. As will be shown below, US Supreme Court opinions provide a unique opportunity for judges to express their personal agreement or disagreement with the court’s ruling by means of writing dissenting or concurring parts of the opinion. The other two quotes, which come from the Supreme Court’s recent landmark decision related to the rights of same-sex couples (United States vs. Windsor), illustrate how judges in their written opinions may express their personal views by means of value-laden word choice (e.g. errors, diseased root, exalted, happy, meaningful). In fact, the title of this chapter alludes to the popular American television programme Judge Judy, in which the main protagonist, a small-claims judge, expresses her feelings and attitudes very strongly. While a certain amount of overt expression of attitudinal meaning may be acceptable in lower court oral proceedings, its use in written legal opinions handed down by the highest court of the land may still be surprising, especially for the general public. Judicial opinions tend to be perceived as a ‘disinterested’ genre, i.e. one where the focus is on impartiality, absence of emotive language,2 and communicating only the facts and propositional information. The way judges report the reasons behind their decisions should apparently be influenced by two major considerations: power and neutrality (Solan 1993: 3).
While the former seems indisputable in view of the paramount importance of the Supreme Court, its central position in the American judicial system, and the law-making role of judicial precedents in the common law system, the latter remains an open question and it comes under scrutiny in this paper. In addition, the use of language in legal opinions can be considered institutional discourse because judges draft their opinions within the institutional framework of the court (the US Supreme Court) and the way they mark stance is expected to be constrained in a number of ways. The results of the analysis presented in this chapter focus on the use of evaluative language, which is usually investigated as stance (e.g. Conrad & Biber 2000; Biber 2006), evaluation (Hunston & Thompson 2000; Hunston 2011), and

2 Lack of emotion is usually listed as one of characteristic features of specialized discourse (cf. Gotti, 2003).


appraisal (Martin & White 2005).3 At the most fundamental level, evaluation can be understood as a behavioral phenomenon which can manifest itself in signaling that something is good or bad, desirable or undesirable, likely or unlikely to happen (cf. e.g. Hunston 2004; Partington et al. 2013). Evaluative language is accordingly often marked by negative or positive polarity. Its actual verbal realizations can be extremely complex and often highly elusive since evaluative meanings can be expressed overtly as in the examples above, or they can be communicated implicitly by relying on shared values and knowledge. Thompson and Hunston (2000: 14) argue that positive or negative evaluation can be construed, for example, in terms of goal achievement. Things contributing to the realization of goals are perceived as positive, while those that thwart the achievement of the objectives are evaluated as negative. For the purpose of this study I adopt the term ‘evaluation’ to refer to the expressions of a speaker’s or writer’s emotional attitude, a concept related to Martin’s appraisal or Biber’s attitudinal stance (cf. Thompson & Hunston 2000: 5). I further follow Hunston (1994: 210) in conceptualizing evaluation as expressed through language “which indexes the act of evaluation or the act of stance-taking. It expresses an attitude towards a person, situation or other entity and is both subjective and located within a societal value system”. This approach combines various types of evaluative meanings including those associated with the concept of modality (Halliday 1994) inasmuch as ‘‘the speaker associates with the thesis an indication of its status and validity in his own judgment’’ (Halliday 1970: 335).
However, this perception of evaluation does not normally include meanings realizing “the speaker’s judgment of the probabilities, or the obligations, involved in what he is saying” (Halliday 1985/1994: 75).4 These types of evaluative meaning are not treated here as they are deemed to go beyond the scope of the present analysis. The significance of evaluation for legal discourse and, especially, judicial discourse can hardly be overestimated. Indicating an attitude towards a legal entity, process or interactant is inherent in the acts of persuasion and argumentation, both being an integral part of judicial discourse. A substantial part of judicial opinions involves expressing agreement or disagreement with decisions given by lower courts,

3 Other related and sometimes overlapping concepts include metadiscourse (e.g. Hyland and Tse 2004), modality (e.g. Palmer 1987), sentiment (e.g. Taboada and Grieve 2004), evaluative, attitudinal or affective language (e.g. Ochs 1989), or evidentiality (e.g. Chafe and Nichols 1986).
4 This restriction on a range of evaluative meanings is important in view of the extensive research on deontic and epistemic modality in legal discourse (see Cheng & Cheng 2014 for an up-to-date overview).


opinions expressed by counsel representing the parties, as well as the opinions arrived at by fellow judges from the same bench. Evaluation is the engine of persuasion (Partington et al. 2013: 46) and judges have to persuade their readers that their grounds are right or that the arguments adduced by the defendants or their counsel are wrong. When judges express their opinion, they also reflect their value systems and the ideologies existing in their community and in the legal system at large. Finegan (2010: 65) argues that analyzing the affective, attitudinal use of judicial language is “crucially important in the training of attorneys in the United States” because opinions written by appellate court judges are the principal focus of attention in law school classrooms in the United States (Mertz 2007). While judicial argumentation has often been investigated from legal logic, legal theoretical and philosophical perspectives (e.g. MacCormick 1992; Mathieu-Izorche 2001), it has received relatively scant attention from linguistically oriented research (see, however, Mazzi 2007, 2010). There seem to be even fewer forensic studies adopting a corpus perspective to address explicitly the concept of evaluative meanings in judicial discourse. Some recent corpus-based studies which focus on evaluative meanings in judicial or courtroom discourse include Heffer (2007), Mazzi (2008), Mazzi (2010), Finegan (2010), Szczyrbak (2014), Goźdź-Roszkowski & Pontrandolfo (2014). Szczyrbak examines stance-taking strategies in a corpus of US Supreme Court opinions. The analysis is informed by du Bois’s (2007) interactional concept of stance and the two related notions of epistemicity and evidentiality. Both Mazzi (2008) and Finegan (2010) examine the use of adverbials of stance in judicial discourse. The former study focuses on selected (8) stance adverbs (e.g. apparently, clearly, etc.)
analyzed in a corpus of 98 equity judgments of the Chancery Division of the High Court of Justice of England and Wales. In the latter, Finegan (2010) examines judicial attitude by focusing on adverbial expressions of attitudinal stance and emphasis. Heffer (2007) draws upon the systemic-functional lexical-semantic appraisal framework of judgment (Martin & White 2005) to examine the linguistic construal of evaluating witnesses and defendants by trial lawyers and judges. In doing so, Heffer investigates a large corpus of official court transcripts. Mazzi (2010: 374) views evaluation as a deep structure and a prominent aspect of the way in which judges construct their argumentative positions. In his corpus-based study, Mazzi investigates evaluative lexis in the judicial discourse of US Supreme Court written opinions. By focusing on the single discourse element of ‘this/these/that/those + the labelling noun’, he provides some corpus evidence to demonstrate that abstract nouns such as, for example, attitude, difficulty, process, reason, etc. have both encapsulating and evaluative functions when found in this pattern in the


judicial opinions. Goźdź-Roszkowski and Pontrandolfo (2014) is the only study that begins to explore evaluation from a cross-language perspective by providing a detailed analysis of how the phrase the fact that is used to evaluate status in American and Italian judgments. What all the above-mentioned studies share is the corpus-based approach with the overarching goal to identify and examine language resources deployed to express a range of evaluative meanings in (English) judicial discourse. They differ in terms of scope, i.e. the types of evaluative meanings (affective, epistemic), the range of linguistic resources under investigation (e.g. a specific grammatical category of stance adverbs), and the extent to which subjective and interactive aspects of evaluative acts are taken into account (Szczyrbak 2014). Since evaluative meanings can be communicated explicitly or implicitly, some studies focus on overt markers of evaluation, e.g. stance adverbs, and demonstrate how judges deploy this linguistic resource to mark their personal stance in no uncertain terms. In this approach, evaluation is essentially conceptualized as a set of preselected words and phrases which are known to indicate evaluative meanings. This is a ‘pure’ corpus-based approach and researchers focus on recurrent linguistic items, such as words, phrases or grammatical categories, to ascertain the extent to which these apply in a given language variety (a genre, register) under investigation.5 This corpus linguistics approach to evaluation enables one to attempt the task of quantifying different types of evaluative meanings in a given dataset (Hunston 2011). Although not a full-scale and exhaustive research study, Finegan (2010) compares the rounded frequencies of a range of stance adverbs in his corpus of US Supreme Court opinions (COSCO) and in some selected general language corpora.
The results show that certain adverbs of stance and emphasis such as, for example, correctly, importantly or simply are statistically used much more frequently in written legal opinions than in general language to indicate attitudinal stance and emphasis (Finegan 2010: 74). These findings provide some grounds to hypothesize that judicial opinions might be marked linguistically for the relatively frequent occurrence of stance adverbs. The present study shares the corpus perspective outlined above and it aims to contribute to the growing body of corpus-based research on evaluation in judicial

5 Classic studies of this type tend to rely on the concept of stance defined as ‘a cover term for the expression of personal feelings and assessments’ (e.g. Conrad and Biber 2000: 57). It should be noted that some scholars distinguish between ‘evaluation’ (the ascription of a value to an entity, whether inside or outside the text) and the interactive ‘stance’ (indications in the text that a human being, the writer, is communicating with another human being, the reader) (cf. Hunston 2011: 51).


discourse by demonstrating how phraseology can be used both to identify more systematically overtly evaluative language and to uncover less obvious acts of evaluation. I will invoke the concept of the local grammar of evaluation to argue, after Hunston (2011), that patterning can be indeed used as a diagnostic to identify words in judicial discourse which share evaluative meanings. In doing so, I attempt to show that the local grammar approach can also be applied to the more restricted, domain-specific genre of judicial opinions. Apart from identifying evaluative adjectives, this approach also enables one to examine the interactive nature of evaluative acts by indicating a range of entities, i.e. legal interactants (e.g. lower courts, legal counsel, prosecution) involved in the act of evaluation. This part of the analysis will be illustrated by discussing two patterns: v-link + ADJ + that pattern and v-link + ADJ + to-infinitive pattern. Finally, I intend to take the analysis one step further and propose that the local grammar of evaluation include yet another, nominal, pattern of ‘N that’, where the noun is followed by an appositive that-clause. I argue that this pattern is highly relevant to the description of evaluative meaning in judicial discourse for two reasons. First, it is used by judges to interpret the epistemic status of propositions contained in legal argumentation. Second, this pattern is also used to express the judges’ attitudinal stance.

2. Corpus linguistics, phraseology and evaluation

One limitation of a purely corpus-based approach is that there may be other, not previously identified language resources employed to express evaluative meanings, and especially, there may be evaluative meanings “not readily available to naked-eye perusal” (Partington 2013: 11). Indeed, evaluation often works in subtle ways when expressed through less obviously evaluative language choices.6 A major claim made recently in corpus linguistics approaches to evaluation is that this area of language use can be effectively examined by searching for and identifying pattern, i.e. “consistency in the mapping of meaning on to form, and consistency in the items that co-occur with a node from the phraseological perspective” (Hunston 2011: 167). In this approach, phraseology is understood not only as sequences of words that contribute to the creation of an evaluative act but, first and foremost, as “consistency in how particular kinds of textual item are evaluated within a specialized corpus” (Hunston 2011: 167). At this point, another methodologically

6 The possible absence of overt linguistic clues marking evaluative meanings raises the fundamental question of the sheer feasibility of using corpus linguistics methodology to analyze this area of language use. This problem is one of the central themes discussed in Hunston (2011).


important concept, namely, the ‘local grammar of evaluation’, should be brought up as it closely corresponds to the idea of patterning in discourse and is crucial to the analysis presented in this study. It is briefly described below.

2.1 Local Grammar of Evaluation

The link between specific language patterns and evaluation was established in Hunston & Sinclair (2000), where the authors draw upon the concepts of “local grammar” (Gross 1993) and “sublanguage” (Harris 1991) to find a viable analytical framework for studying evaluation. Rather than describing a language as a whole, corpus grammarians, more specifically parsers, developed the idea that particular areas of language can be examined separately as they seem to show patterning of their own. A local grammar of evaluative meaning depends on the concept of pattern grammar (Hunston & Francis 1999) and in particular on the idea that words which regularly occur with similar co-texts share this meaning. For example, Hunston and Sinclair (2000) propose that the pattern it + link verb + adjective group + clause is a reliable way to identify evaluative adjectives since virtually all adjectives found in this pattern share an evaluative meaning (e.g. it was wonderful talking to you the other day, it seemed important to trust her judgment, etc.). Other patterns comprising the proposed local grammar of evaluation include, for example, there + link verb + something, link verb + adjective group + to-infinitive clause or link verb + adjective group + that clause. The focus is clearly on adjectives as the primary ‘carrier’ of evaluation and the authors claim this approach is effective, even if not 100 per cent reliable (Hunston & Sinclair 2000: 91). The local grammar approach is ‘local’ in the sense of isolating specific, well-defined linguistic resources tried and tested for their capability to act as signals of evaluative meanings. Hunston & Sinclair (2000) use the well-known Bank of English to test the reliability of their approach.
This chapter shares the phraseological corpus perspective outlined above and it aims to contribute to the growing body of corpus-based research on evaluation in judicial discourse by demonstrating how phraseology can be used both to identify more systematically overtly evaluative language and to uncover less obvious acts of evaluation. In doing so, I use the concept of the local grammar of evaluation to argue, after Hunston (2011), that patterning can be indeed used as a useful ‘diagnostic’ to identify words in judicial discourse which share evaluative meanings. Thus, I show that the local grammar approach can also be applied to the more restricted, domain-specific genre of judicial opinions. Apart from identifying evaluative adjectives, this approach also enables one to track the interactive nature of evaluative acts by indicating a range of entities,


i.e. legal interactants (e.g. lower courts, legal counsel, prosecution) which are the targets of evaluation. This part of the analysis will be illustrated by discussing two patterns: v-link + ADJ + that pattern and v-link + ADJ + to-infinitive pattern. Moreover, I intend to take the analysis one step further and propose that the local grammar of evaluation include yet another, this time nominal, pattern – e.g. ‘N that’ – where the noun is followed by an appositive that-clause. I argue that this pattern is highly relevant to the description of evaluative meaning in judicial discourse for two reasons. First, it is used by judges to interpret the epistemic status of propositions contained in legal argumentation. Second, this pattern is also used to express attitudinal stance.
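
To make the idea of pattern-based retrieval concrete, the following Python sketch searches a few hand-tagged sentences for the three patterns discussed above. It is only an illustration, not the procedure used in this study: the sentences are invented, the tagset (VLINK, ADJ, THAT, TO, N, ...) is a simplified, hypothetical one, and the function names are assumptions introduced for the example.

```python
# Hand-tagged toy sentences; the tags are a simplified, hypothetical tagset
# used only for this illustration.
TAGGED = [
    [("It", "PRON"), ("is", "VLINK"), ("clear", "ADJ"), ("that", "THAT"),
     ("the", "DET"), ("statute", "N"), ("applies", "V")],
    [("The", "DET"), ("court", "N"), ("was", "VLINK"), ("unable", "ADJ"),
     ("to", "TO"), ("agree", "V")],
    [("The", "DET"), ("argument", "N"), ("that", "THAT"), ("Congress", "N"),
     ("erred", "V"), ("fails", "V")],
]

# Each pattern is expressed as a sequence of adjacent tags.
PATTERNS = {
    "v-link + ADJ + that": ["VLINK", "ADJ", "THAT"],
    "v-link + ADJ + to-inf": ["VLINK", "ADJ", "TO"],
    "N that": ["N", "THAT"],
}


def find_pattern(sentence, tag_sequence):
    """Yield the words of every window whose tags equal tag_sequence."""
    n = len(tag_sequence)
    for start in range(len(sentence) - n + 1):
        window = sentence[start:start + n]
        if [tag for _, tag in window] == tag_sequence:
            yield [word for word, _ in window]


def collect(tagged_sentences):
    """Map each pattern name to the list of matched word sequences."""
    return {
        name: [hit for sent in tagged_sentences
               for hit in find_pattern(sent, tags)]
        for name, tags in PATTERNS.items()
    }


for name, hits in collect(TAGGED).items():
    print(name, hits)
```

A tag-sequence match of this kind only produces candidates. A sequence such as noun + that can also introduce a relative clause rather than an appositive that-clause, so the hits would still need to be inspected in context before being treated as instances of evaluation.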

3. Materials and Methods

3.1 The Corpus of US Supreme Court Opinions

The data in this study were selected from a collection of opinions given by the Supreme Court of the United States of America. Since the main purpose of this study is to explore dominant evaluative patterns employed in judicial discourse, the judgments were randomly selected from the period of 1999–2012, irrespective of a particular legal domain (civil law, criminal law, etc.). The texts of the opinions were accessed via FindLaw.com, a well-known legal information web portal providing free access to cases heard by the US Supreme Court. In all, the corpus consists of 123 different opinions totaling 1,333,329 running words. It should be noted that the term opinion can be used in two senses. More generally, opinion may refer to the official decision of a court of justice and it is thus interchangeable with the term judgment, which is defined as “the official and authentic decision of a court of justice upon the respective rights and claims of the parties to an action or suit therein litigated and submitted to its determination. The final decision of the court resolving the dispute and determining the rights and obligations of the parties” (Black 1990). The term opinion is also used to denote the reason which the court gives for its decision. It then refers to the statement made by a judge or court of the decision reached in connection with a case heard before them. Such a statement explains the law as applied to the case and provides the reason on the basis of which the judgment is made. In this study, the term opinion is used primarily in the first sense as a convenient label for the entire genre. Opinions are characterized by a fixed generic structuring. Opinions delivered by the Supreme Court of the United States generally have the following structure (cf. Brostoff & Sinsheimer 2003):


Headnote – This section includes the names of the parties, identification of the parties (petitioner, respondent), an identification of the court in which the recorded case was heard, and the date of the opinion.

Procedural History – This is a brief description of how the lower-instance courts have dealt with the case. This section usually includes the basis for review.

Holding – Invariably signalled by the use of the word held, this section provides the decision (ruling) reached by the Supreme Court in a particular case, ending with a disposition of the case (e.g. affirmed, vacated and remanded, etc.).

Opinion – This part of a judgment includes the names of the judges who heard the case and it specifies what type of opinion a given judge expressed. This is the most interesting and relevant part from the perspective of evaluative language because it provides an opportunity for judges to take a stance towards a particular legal point and the arguments of other legal interactants, including their fellow judges. For example, in a 6:3 decision (there are nine justices of the US Supreme Court), there will be a plurality opinion, which occurs when the final outcome is agreed by the majority but for differing reasons: two judges could write one concurring opinion, three judges could write another concurring opinion, one judge could write his or her own opinion, and three judges could dissent. Concurring opinions are those which agree with the majority decision for different reasons, while dissenting opinions are given by judges who do not agree with the majority. The following example illustrates the variety of standpoints taken by different judges and how the different opinions can be traced back to their authors:

Kennedy, J., delivered the opinion of the Court, in which Rehnquist, C. J., and Stevens, O’Connor, and Breyer, JJ., joined, and in which Scalia and Thomas, JJ., joined as to Parts I, II, III, and IV. Stevens, J., filed a concurring opinion. Thomas, J., filed an opinion concurring in part and dissenting in part, in which Scalia, J., joined. Ginsburg, J., filed a dissenting opinion, in which Souter, J., joined.

Judgments thus represent a promising area for the study of evaluation, with the opinion part being a primary ‘site’ where evaluative language is likely to be identified.

3.2 Methods

This study employed corpus linguistic methods to guide the analysis and interpretation of findings. The corpus was first queried for the three patterns (v-link + ADJ + that pattern, v-link + ADJ + to-infinitive and ‘N that’ pattern) in order to retrieve all adjectives occurring in these co-texts. In the case of the ‘N that’ pattern, only nouns occurring at least 5 times per million words were considered. Then, the lists of adjectives and the nouns were scrutinized with reference to the concept of evaluation and its types (Hunston 2011). This part of the analysis involved


Stanisław Goźdź-Roszkowski

concordancing each word and examining manually the surrounding co-texts. The analysis was carried out using the linguistic software package WordSmith Tools (version 5.0).
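The retrieval step described in this section can be approximated in a short Python sketch. It does not reproduce the WordSmith Tools procedure used in the study: the regular expression, the inventory of link verbs, the list of candidate adjectives, the sample sentences and the function names are all assumptions made for illustration. In particular, the study retrieved every adjective occurring in the patterns, whereas the sketch only looks for a small predefined set and does not check that the word after to is a verb.

```python
import re
from collections import Counter

# Assumed, deliberately small inventories (not taken from the study).
LINK_VERBS = r"(?:is|are|was|were|be|been|seems|seemed|appears|appeared)"
CANDIDATE_ADJECTIVES = ["clear", "correct", "wrong", "likely", "unlikely", "necessary"]


def find_pattern(text, complement):
    """Count candidate adjectives in 'v-link + ADJ + that' or 'v-link + ADJ + to'.

    `complement` is either "that" or "to". One optional adverb
    (quite, also, or a word ending in -ly) may precede the adjective.
    """
    adjectives = "|".join(CANDIDATE_ADJECTIVES)
    regex = re.compile(
        rf"\b{LINK_VERBS}\s+(?:(?:quite|also|\w+ly)\s+)?({adjectives})\s+{complement}\b",
        re.IGNORECASE,
    )
    return Counter(match.group(1).lower() for match in regex.finditer(text))


def per_million(count, corpus_size_in_words):
    """Normalise a raw count to a frequency per one million words."""
    return count * 1_000_000 / corpus_size_in_words


# Sample sentences adapted from the examples discussed in this chapter.
SAMPLE = (
    "The Court is correct that many mental diseases are difficult to define. "
    "The Sixth Circuit was also wrong to hold that the plea had to be voided. "
    "It is quite wrong to invite state court judges to discount such guidance."
)

that_hits = find_pattern(SAMPLE, "that")
to_hits = find_pattern(SAMPLE, "to")
```

A frequency threshold such as the one applied to the ‘N that’ nouns could then be checked by passing each raw count and the corpus size (a value not assumed here) to per_million. Every hit would still need to be read in its co-text, as the manual concordancing step above makes clear.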

4. Results and Discussion

4.1 v-link + ADJ + that clause and v-link + ADJ + to-inf clause patterns

The first two patterns consist of adjectives followed by either a that-clause (where the word that is optional) or a to-infinitive clause.⁷ Both turn out to be quite productive in terms of the range of adjectives identified and the corresponding evaluative meanings. Table 1 lists the 15 most frequent adjectives retrieved for each pattern, all of which occur at least 10 times per one million words in the corpus. Forty-two different adjectives were identified in that pattern.

Table 1. Most frequent adjectives identified in the two patterns, arranged in descending order of frequency.

v-link + ADJ + that clause: clear (60), true (49), correct (43), aware (33), undisputed (29), possible (28), unlikely (25), likely (18), certain (16), evident (15), apparent (14), conceivable (12), confident (11), implausible (10), noteworthy (10)

v-link + ADJ + to-inf: likely (70), necessary (52), difficult (45), sufficient (43), unable (41), insufficient (29), hard (27), correct (23), unlikely (18), willing (16), important (16), appropriate (16), wrong (14), reluctant (13), impossible (10)

⁷ See also Goźdź-Roszkowski & Pontrandolfo 2014 for a more detailed analysis of these patterns carried out from a comparative (English – Italian) perspective.

The adjectives found in these patterns are used to express different evaluative meanings. Some of the most frequent ones are epistemic adjectives indicating the writer’s degree of certainty, likelihood or awareness of something (e.g. aware, clear, confident, correct, possible, (un)likely, certain). Even though there are no typically emotive or attitudinal adjectives such as afraid, amazed, happy, etc., a number of adjectives are used to express evaluation. These include appropriate, difficult, implausible, important, noteworthy, wrong, etc. These adjectives can also be grouped according to whether they are predominantly positive or negative in their evaluative orientation. In fact, virtually all forms of appraisal can be viewed in terms of the fundamental good-bad evaluation (cf. Partington et al. 2013: 45). This observation refers not only to attitudinal linguistic items such as correct and wrong, difficult, implausible or noteworthy but also to adjectives signaling modality (e.g. likely, unlikely, possible, certain) and indicating the assessment of how likely or true an event is, the degree of truth or certainty which can be attributed to a given proposition, or a judgment on how necessary an action is.

By way of illustration, I now turn to consider extended examples of two adjectives employed to express overtly positive or negative evaluation of propositions contained in legal argumentation provided by different actors (or ‘interactants’) participating in the legal process: correct (Table 2) and wrong (Table 3).

Table 2. The adjective correct parsed in accordance with the v-link + ADJ + that pattern.

Entities evaluated | Hinge | Evaluative category | Proposition evaluated
Noun group | Link verb | Adjectival group | that clause
the Illinois Supreme Court | is | correct | that General Order 92–4 is not a sufficient limitation on police discretion.
The Court | is | correct | that many mental diseases are difficult to define and the subject of great debate
Respondent | is | correct | that some crimes must be charged with greater specificity than an indictment parroting a federal criminal statute’s language
The majority | is | correct | that rigid adherence to such an approach could conceivably produce absurd results

The sentences provided in Table 2 are parsed using slightly modified terminology proposed in Hunston and Sinclair (2000), which is largely self-explanatory. The designation “entities evaluated” is proposed here to refer to a range of different interactants appearing in the judicial appellate process: the Illinois Supreme Court, as an example of a lower court (with regard to the federal US Supreme Court), the Court and the Majority denoting the majority (en bloc) opinion of the US Supreme Court, the Respondent and Petitioner as the two major parties to a legal dispute. Syntactically, ‘entities evaluated’ represent noun groups occupying the subject position in a sentence. The act of evaluation is thus conducted in a fairly straightforward and unequivocal manner in the sense that all the entities are named explicitly as the target of evaluation. The examples in Table 2 were sampled from the opinion part of the judgments representing majority, dissenting or concurring opinions. The judges, as evaluators, support the arguments put forward by the counsel representing the litigants, the lower courts or their fellow judges.

It should be noted that apart from evaluation expressed by means of the ‘adjectival group’, represented in this example by the word correct, evaluation is expressed overtly in the evaluated propositions, mainly by means of adjectives (sufficient, difficult, rigid or absurd) and the verb parroting. One sentence may include both positive and negative evaluation and a larger context is often needed to determine the overall positive or negative polarity. For example, in the sentence The Court is correct that many mental diseases are difficult to define and the subject of great debate, the evaluating judge in his dissenting opinion welcomes the fact that the majority opinion recognizes what is perceived by the evaluator as a problem. These are in fact instances of a phenomenon referred to as ‘embedded evaluation’, i.e. “items of intrinsically favorable evaluation which are found embedded in expressions of overall unfavorability” (Partington et al. 2013: 61–62). An inherently positive word or phrase (such as correct) may be embedded in a larger unit (e.g. at a sentence or paragraph level) which may be imbued with the opposite polarity:

[1] Though Clark is correct that applying the moral incapacity test (telling right from wrong) does not necessarily require evaluation of a defendant’s cognitive capacity to appreciate the nature and quality of the acts charged against him, his argument fails to recognize that cognitive incapacity is itself enough to demonstrate moral incapacity, so that evidence bearing on whether the defendant knew the nature and quality of his actions is both relevant and admissible.

Example [1] shows how the positive evaluation of Clark, the Petitioner (Clark is correct), found in a subordinate concessive clause, is embedded in the negative evaluation phrased in the main clause: his argument fails to recognize. This example signals the complexity of expressing evaluation in judicial discourse and the need for studying longer contexts beyond concordance lines.

Table 3 illustrates negative evaluation expressed by means of the pattern v-link followed by an adjective group and a to-infinitive clause, and directed at a range of interactants appearing in judicial settings. As can be seen, the v-link + ADJ + to-inf clause pattern can be realized as two variants. The first one, shown in the upper part of Table 3, identifies explicitly a legal interactant as the subject of the main clause and the target of evaluation. The noun group slot can be filled by a range of entities including a lower court (the Sixth Circuit), a majority opinion of the US Supreme Court (referred to as the Court), a dissenting judge or judges, and a respondent. The other variant, shown in the lower part of Table 3, consists of an introductory or anticipatory it, followed by a link verb, an adjective group and a to-infinitive clause.

Table 3. The adjective wrong parsed in the v-link + ADJ + to-inf clause pattern.

Entity evaluated | Hinge | Evaluative category | Proposition evaluated
Noun group | Link verb | Adjectival group | to-infinitive clause
The Sixth Circuit | was | wrong | to hold that
The majority | is | wrong | to say that
The Court | is | wrong | to suggest
The dissent | is | wrong | to read

It | link verb | adjective group | Thing evaluated
It | is | quite wrong | to invite state court judges to discount…
It | is | wrong | to assume that his petition by itself failed to alert the Oregon Supreme Court to the federal nature of

Example [2] provided below comes from the holding of the Bradshaw, Warden vs. Stumpf judgment in which the Supreme Court unanimously rejected the decision by the United States Court of Appeals for the Sixth Circuit to reverse an earlier judgment given by a District Court.

[2] The Sixth Circuit was also wrong to hold that prosecutorial inconsistencies between the Stumpf and Wesley cases required voiding Stumpf’s guilty plea.

In this case, there is a negative evaluation of the judicial action of the lower court, i.e. the Sixth Circuit, which correlates with the Supreme Court’s overall negative decision encapsulated in its disposition: “reversed in part, vacated in part and remanded”. Negative evaluation can also be averred by judges sitting on the same bench. Importantly, a Supreme Court judge can write a dissenting opinion in which he or she is joined by other judges. Predictably, such an opinion is replete with negative evaluation of the argumentation put forward by the majority of judges. A good illustration is provided in [3]:

[3] The majority is also wrong to say that this Court has “narrowed” Parden in its “subsequent opinion[s],” ante, at 12, at least in any way relevant to today’s decision.

In Example [3] the dissenting judge Justice Breyer (Expense Board et al. on writ of certiorari to the United States Court of Appeals for the Third Circuit) finds a perceived flaw in the argumentation expounded in the majority opinion. Although a dissenting opinion has no bearing on the disposition, it provides an extremely useful insight into legal reasoning. Examples [2] and [3] represent an explicit averral of evaluation by means of the variant noun group + v-link + adjective + to-inf clause. In contrast, the it + v-link + adjective + to-inf clause variant is used as a more covert form of expressing evaluation. Example [4], provided below, is a case in point.

[4] It is quite wrong to invite state court judges to discount the importance of such guidance on the ground that it may not have been strictly necessary as an explanation of the Court’s specific holding in the case.

In the above excerpt [4] from Thomas L. Carey, Warden, Petitioner v. Mathew Musladin, Justice Stevens concurs in the judgment albeit with serious reservations. While it is easy to identify the judge as the evaluator, the same does not apply to the entity evaluated. The preceding context suggests that the evaluation is directed at the Ninth Circuit by pointing out the consequences of its decision and the impact this decision may have had on state court judges. At the same time, the evaluation also implicates the Supreme Court by referring to its past holdings.

Both correct and wrong belong to the category of evaluative lexis whose evaluative weight is intrinsic and whose major function is evaluation. The evaluative meaning of such items is very easily identified and the two patterns described above turn out to be quite productive in their retrieval. The next section discusses evaluative lexis whose evaluative function becomes apparent only by studying its interaction with other items of a particular evaluative orientation.

4.2 Uncovering the less obvious meanings: the ‘N that’ pattern

Hunston (1989) argues that one of the major functions of evaluation is to identify the epistemic object being evaluated, i.e. its status. Status is then defined as “the averred degree and type of alignment between a text or proposition and the world” (Hunston 2011: 92). There are a number of ways in which status can be indicated. When a proposition appears in a that-clause, the status may be signaled by the verb, noun or adjective controlling the that-clause, e.g. it is possible that, Smith argues that, the assumption that, etc. This section focuses on nouns (e.g. argument, conclusion, discovery, hypothesis, idea, theory, suggestion) to demonstrate that interpreting epistemic status is a key way in which subjectivity is expressed in judicial discourse and that a number of these nouns co-occur with co-texts which are marked by a particular (esp. negative) evaluative orientation. It is now widely acknowledged (e.g. Halliday & Matthiessen 2004: 637) that nouns followed by an appositive that-clause indicate the epistemic status of the proposition expressed in the that-clause and that projected that-clauses of this kind are important to disciplinary epistemology.


There are 18 nouns in the corpus which appear with a minimum rounded frequency of 5 times per million words and in at least 5 different judgments, the latter condition guarding against individual idiosyncratic styles. These include:⁸ fact (316), argument (171), conclusion (168), view (93), proposition (73), contention (54), assumption (53), suggestion (52), possibility (49), assertion (46), belief (35), notion (32), presumption (30), theory (19), impression (18), allegation (16), observation (13), interpretation (5).

When viewed as single word forms, these nouns do not display any disciplinary specificity. For example, in his study of academic registers, Biber (2006: 112) identifies nouns controlling that-clauses which “label the status of the information presented in the that-clause”. Some of them, such as argument, assumption, notion and possibility, are also found in the list above. However, if examined in wider co-texts, some of the nouns show consistent co-occurrence patterns of negative evaluation. There is space here only to give a few examples.⁹ In Example [5] the proposition expressed as ‘state courts must apply the restrictive Salerno test’ is evaluated as an assumption, that is, something which is not readily verifiable.

[5] Justice Scalia’s assumption that state courts must apply the restrictive Salerno test is incorrect as a matter of law.

The proposition attributed to the Supreme Court judge, Antonin Scalia, is then negatively evaluated through a value-laden word choice – incorrect. On closer examination, the phrase the assumption that has been found to co-occur with negative evaluation in 65% of all its instances in the corpus. The negative evaluation is mainly expressed by means of the adjectives: erroneous, incorrect, mistaken, questionable, unfounded, apparent, naive, etc. The propositions labeled as ‘assumptions’ are often construed as the basis for other ideas used in legal argumentation. This use is reflected in the co-occurrence of assumption that with the prepositions on and under. The phrase on the/an assumption that appears 17 times (32% of all the instances in which the word assumption is used in this pattern), and the variant with the preposition under appears five times. Still, many of these instances co-occur with negative evaluation, e.g. on the erroneous/mistaken/rejected assumption that. There are also other uses of the assumption that, for example to express agreement, associated with a neutral or positive evaluative orientation, such as I have no difficulty in endorsing the assumption that, research in psychology supports the assumption that, etc., but the expression of negative evaluation remains dominant.

In a similar vein, the phrase the notion that tends to co-occur with negative evaluation (in 60% of all the instances). Unlike in the case of the assumption that, the negative evaluation is expressed by means of various parts of speech. These range from adverbs (e.g. flatly, firmly) and verbs (e.g. reject, discount, undermine, support), through adjectives (e.g. dubious, straightforward, commonsense), to nouns or noun phrases (e.g. mere speculation). There are also some longer phrases containing the evaluative component: [the notion that …] has no support in either reason or precedent, error seems likely in [the notion that]. Excerpt [6], taken from a majority opinion, is a case in point:

[6] We have firmly rejected the notion that an official action is protected by qualified immunity unless the very action in question has previously been held unlawful.

Indeed, the notion that is often found in majority opinions to express disagreement: we discounted the notion that […], we rejected the notion that […], etc. Note the co-occurring personal pronoun we, which signals that the opinion is expressed on behalf of the Court. When the phrase is used in clause-initial subject position, negative polarity is expressed explicitly by means of predicative adjectives, as in Examples [7] and [8].

[7] The notion that the application of a ‘coercion’ principle would lead to a more consistent jurisprudence is dubious.

[8] The notion that media corporations have constitutional entitlement to accelerated judicial review of the denial of zoning variances is absurd.

⁸ In the case of some of the nouns, it is necessary to bear in mind that there may be a distinction between technical and non-technical senses of a particular word form. For example, the word presumption is used in general language (LGP) to refer to a belief that something is true because it seems reasonable or likely but in law it denotes the belief that something is true because no one has proved that it is not. A typical example would be a presumption of innocence.

⁹ See also Goźdź-Roszkowski & Pontrandolfo, which documents more detailed and comparative findings of these nouns in the N+that pattern.
Other nouns in this pattern with a predominantly negative evaluative function include suggestion, used 85% of the time in negative contexts, and argument (75%). In the case of the argument that, there is a strong co-occurrence with the verb reject (24%), the noun court (18%) and the personal pronoun we. This use reflects the polyphony of judicial voices expressing different stances in majority, dissenting and concurring opinions, as illustrated by Examples [9], [10] and [11].

[9] The argument that virtual child pornography whets pedophiles’ appetites and encourages them to engage in illegal conduct is unavailing because (…).

[10] The Court also rejected the argument that it failed to consider the significance of advances in computer technology (…).

[11] We rejected the petitioner’s argument that […]


The status noun argument seems to play a pivotal role in judicial discourse, since it is used by the Supreme Court judges to refer to the reasons that lead them to reach a particular decision. The occurrences reveal that, in the context of opinion drafting, an argument can be attributed to two possible ‘interlocutors’: the colleagues sitting on the same bench with whom the judge writing the opinion disagrees (as in Example [9]) or the lower-court judges, whose arguments the judge evaluates to reach the decision of allowing or dismissing the appeal (shown in Example [10]). Example [11] marks both the argumentative stance adopted in the plurality opinion and the identity of the ‘interlocutor’, i.e. the petitioner.

The nouns discussed so far turn out to acquire a predominantly negative evaluative function, but there are also some which do not appear to have any strong inherent evaluative leaning. Instead, they may, in different contexts, be observed to express favourable or unfavourable evaluations. The phrase the view that seems to belong to this category. It tends to be used to provide support for a particular proposition found in judicial opinions. Examples [12] and [13] both illustrate what could be referred to as neutral polarity, or rather that ‘grey area’ between positive and negative polarity, and the difficulty one may have when faced with the task of distinguishing between them. Example [12] conveys the neutral sense of supporting a particular viewpoint and it comes from a footnote to the plurality opinion in the Pat Osborn, Petitioner v. Barry Haley et al. case, in which Justice Breyer chose to be “concurring in part and dissenting in part”. It is cited in support of the argumentation put forward by the Court and it thus leans more towards the positive end of the positive-negative cline:

[12] Justice Breyer takes the view that the Attorney General may issue a Westfall Act certification if he contests the plaintiff’s account of the episode-in-suit.

In contrast, [13] is an excerpt from a dissenting opinion written by Justice Scalia in Beneficial National Bank et al., Petitioners v. Marie Anderson et al. The use of ‘takes the view that’ could be interpreted as neutral if we confine our analysis only to the first sentence. However, the second sentence reveals an unequivocally negative evaluation of the view, concluded by I respectfully dissent:

[13] Today’s opinion takes the view that because §30 of the National Bank Act, 12 U. S. C. §§85, 86, provides the exclusive cause of action for claims of usury against a national bank, all such claims – even if explicitly pleaded under state law – are to be construed as “aris[ing] under” federal law for purposes of our jurisdictional statutes. Ante, at 9. This view finds scant support in our precedents and no support whatever in the National Bank Act or any other Act of Congress. I respectfully dissent.


A final observation is that the view that is usually found in individual (dissenting or concurring) opinions, where it can carry some personal marking of stance, as illustrated in Example [14]:

[14] I find much to commend the view that the Establishment Clause […].

Other examples include: I adhere to my view that, I write separately to state my view that. There is a rather obvious co-occurrence between the first person pronoun I and the view that, used to mark judicial stance even more emphatically.

5. Conclusion

This study adopted a corpus phraseological approach to explore patterns of expressing evaluative meanings which had not been previously studied in judicial discourse. These findings enrich our understanding of the nature and range of strategies employed by judges to mark their stance in written opinions. It appears that judges tend to rely on both overt and covert linguistic clues to evaluate arguments put forward in court.

First, this paper corroborates the applicability of the local grammar approach to the study of evaluation in the highly specialised genre of United States Supreme Court opinions. The first two patterns, v-link + ADJ + that and v-link + ADJ + to-inf, turn out to be a good tool for identifying evaluative adjectives. Apart from locating instances of evaluative acts, they help to indicate two sides of the evaluation process: evaluators, i.e. legal interactants expressing evaluation, and interactants evaluated.

Then, it is proposed that the pattern consisting of nouns followed by an appositive that-clause should also be included in the local resources specific to evaluation. The examination of nouns found in this pattern shows that they may fulfil two functions. First, they are used to mark the epistemic stance of propositions, which in itself is an act of evaluation. As shown in section 4.2 above, Scalia’s statement is marked as an assumption, which suggests it is not based on solid objective evidence. As a result, the epistemic value of this utterance is somewhat reduced. Second, evaluation can also be found at the level of co-occurring phraseologies. It has been shown that the selection of these nouns correlates strongly with negative evaluation, which could suggest that the function of the N+that pattern extends beyond marking epistemic stance.

The analysis of the three patterns provides evidence pointing towards an interplay between different evaluative voices and a varying degree of explicitness present in the expression of evaluation. Evaluative lexis identified in the patterns can be differently categorized depending on its evaluative potential. Evaluative weight is inherent in certain adjectives (e.g. correct, wrong, implausible, etc.) while some nouns can have a predominantly negative evaluation, not necessarily obvious without investigating their co-texts. The examination of negative phraseologies co-occurring with the nouns corroborates Hunston’s (2011) claim that evaluation is to a great extent contextual and cumulative, i.e. evaluative meaning is spread across phraseologies rather than attached to individual words. For example, the phrase the notion that does not collocate with a single negatively-charged item (e.g. reject) but with a range of items belonging to different parts of speech, none of which is very frequent. It is only the cumulative frequencies of all the items that determine the negative polarity of the notion that. Corpus study and its techniques can thus be an invaluable tool in bringing such ‘non-obvious meaning’ to the light of day.
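The cumulative reading of polarity described in the conclusion can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the list of negatively oriented items, the four concordance lines (adapted from examples quoted in section 4.2) and the function name are hypothetical, and in the study itself the co-texts were examined manually rather than matched against a word list.

```python
import re
from collections import Counter

# Hypothetical inventory of negatively oriented items (an assumption,
# loosely based on the co-occurring items quoted in section 4.2).
NEGATIVE_ITEMS = {
    "reject", "rejected", "discount", "discounted", "dubious",
    "absurd", "erroneous", "mistaken", "incorrect",
}


def negative_share(concordance_lines, negative_items=NEGATIVE_ITEMS):
    """Return per-item counts and the share of lines with any negative item."""
    counts = Counter()
    flagged = 0
    for line in concordance_lines:
        tokens = set(re.findall(r"[a-z]+", line.lower()))
        hits = tokens & negative_items
        counts.update(hits)      # each item counted at most once per line
        if hits:
            flagged += 1
    share = flagged / len(concordance_lines) if concordance_lines else 0.0
    return counts, share


# Illustrative concordance lines adapted from the chapter's examples.
LINES = [
    "We have firmly rejected the notion that an official action is protected",
    "The notion that this would lead to a more consistent jurisprudence is dubious",
    "we discounted the notion that",
    "research in psychology supports the assumption that",
]

item_counts, share = negative_share(LINES)
```

In this toy sample no single item occurs more than once, yet three of the four lines contain some negative item. This is the sense in which polarity is cumulative: it emerges from the summed frequencies of many infrequent items.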

References

Biber, D. 2006. University Language. A corpus-based study of spoken and written language. Amsterdam: John Benjamins.
Biber, D., Johansson, S., Leech, G., Conrad, S. & E. Finegan. 1999. The Longman Grammar of Spoken and Written English. London: Longman.
Black, H. L. & H. C. Black. 1990. Black’s Law Dictionary with Pronunciations. New York: West Publishing Company.
Brostoff, T. K. & A. Sinsheimer. 2003. Legal English. An Introduction to the Legal Language and Culture of the United States. New York: Oceana Publications.
Chafe, W. & J. Nichols (eds.) 1986. Evidentiality: The Linguistic Encoding of Epistemology. Norwood, NJ: Ablex.
Cheng, W. & Le Cheng. 2014. “Epistemic modality in court judgments: A corpus-driven comparison of civil cases in Hong Kong and Scotland.” English for Specific Purposes, 33, 15–26.
Conrad, S. & D. Biber. 2000. “Adverbial marking of stance in speech and writing.” In G. Thompson and S. Hunston (eds.), Evaluation in Text. Authorial Stance and the Construction of Discourse, 56–73. Oxford: Oxford University Press.
Du Bois, J. W. 2007. “The stance triangle.” In R. Englebretson (ed.), Stancetaking in discourse: subjectivity, evaluation, interaction, 139–182. Amsterdam: John Benjamins.
Finegan, E. 2010. “Corpus linguistics approaches to ‘legal language’: adverbial expression of attitude and emphasis in Supreme Court opinions.” In M. Coulthard & A. Johnson (eds.), The Routledge Handbook of Forensic Linguistics, 65–77. London: Routledge.
Garavelli, M. 2010. “I giudici e il linguaggio.” In J. Visconti (ed.), Lingua e Diritto. Livelli di analisi. Milano: LED.
Gotti, M. 2003. Specialized Discourse. Bern: Peter Lang.


Goźdź-Roszkowski, S. & G. Pontrandolfo. 2014. “Facing the Facts: Evaluative Patterns in English and Italian Judicial Language.” In V. Bhatia et al. (eds.), Language and Law in Professional Discourse. Issues and Perspectives, 10–28. Cambridge Scholars Publishing.
Goźdź-Roszkowski, S. & G. Pontrandolfo. 2014. “Exploring the Local Grammar of Evaluation: the Case of Adjectival Patterns in American and Italian Judicial Discourse.” Research in Language, 12 (1), 71–92.
Goźdź-Roszkowski, S. & G. Pontrandolfo. 2013. “Evaluative patterns in judicial discourse: a corpus-based phraseological perspective on American and Italian criminal judgments.” International Journal of Law, Language and Discourse, 13(2), 9–69.
Halliday, M. A. K. 1985/1994. An introduction to functional grammar, 2nd Ed. London: Edward Arnold.
Halliday, M. A. K. 1970. “Functional diversity in language as seen from a consideration of modality and mood in English.” Foundations of Language, 6, 322–361.
Halliday, M. A. K. & C. Matthiessen. 2004. An Introduction to Functional Grammar, 3rd Ed. London: Arnold.
Heffer, C. 2007. “Judgement in Court: Evaluating participants in courtroom discourse.” In K. Kredens and S. Goźdź-Roszkowski (eds.), Language and the Law: International Outlooks, 145–179. Frankfurt am Main: Peter Lang.
Hunston, S. 2011. Corpus Approaches to Evaluation. Phraseology and Evaluative Language. New York: Routledge.
Hunston, S. 2004. “Counting the uncountable. Problems of identifying evaluation in a text and in a corpus.” In A. Partington, J. Morley and L. Haarman (eds.), Corpora and Discourse, 157–188. Bern: Peter Lang.
Hunston, S. 1994. “Evaluation and organisation in academic discourse.” In M. Coulthard (ed.), Advances in Written Text Analysis, 191–218. London: Routledge.
Hunston, S. & G. Thompson (eds.) 2000. Evaluation in Text. Authorial Stance and the Construction of Discourse. Oxford: Oxford University Press.
Hyland, K. & P. Tse. 2004. “Metadiscourse in academic writing: A reappraisal.” Applied Linguistics, 25, 156–176.
Martin, J. R. & P. White. 2005. The Language of Evaluation: Appraisal in English. London: Palgrave.
Mazzi, D. 2007. “The Construction of Argumentation in Judicial Texts: Combining a Genre and Corpus Perspective.” Argumentation, 1, 21–38.


Mazzi, D. 2008. “I first have to decide whether there were any notes in the first place. I consider that there probably were: adverbials of stance in equity judges’ argumentation.” Textus, 21, 505–522.
Mazzi, D. 2010. “This Argument Fails for Two Reasons… A Linguistic Analysis of Judicial Evaluation Strategies in US Supreme Court Judgments.” International Journal for the Semiotics of Law, 23(4), 373–385.
MacCormick, N. & A. Aarnio (eds.) 1992. Legal reasoning. Aldershot: Dartmouth.
Mathieu-Izorche, M.-L. 2001. Le raisonnement juridique. Initiation à la logique et à l’argumentation. Paris: Presses Universitaires de France.
Mertz, E. 2007. The Language of Law School: Learning to Think Like a Lawyer. New York: Oxford University Press.
Murphy, B. A. 2014. Scalia. A Court of One. New York: Simon and Schuster.
Ochs, E. (ed.) 1989. “The Pragmatics of Affect.” Text, 9 (Special Issue).
Palmer, F. 1987. The English Verb, 2nd Ed. London: Longman.
Partington, A., Duguid, A. & C. Taylor. 2013. Patterns and Meanings in Discourse. Theory and practice in corpus-assisted discourse studies (CADS). Amsterdam: John Benjamins.
Solan, L. M. 1993. The Language of Judges. Chicago: University of Chicago Press.
Szczyrbak, M. 2014. “Stancetaking strategies in judicial discourse: evidence from US Supreme Court opinions.” Studia Linguistica Universitatis Iagellonicae Cracoviensis, 131, 1–30.
Taboada, M. & J. Grieve. 2004. “Analyzing appraisal automatically.” In Proceedings of AAAI Spring Symposium on Exploring Attitude and Affect in Text, 158–161. Stanford.

Jacek Tadeusz Waliński University of Łódź

Using time to express remoteness in space: A corpus-based study of distance representations for motion medium in the National Corpus of Polish

Abstract: Entanglement of space and time in the human mind is among the most intensely pursued topics in contemporary cognitive science. A linguistic area that seems to be particularly suited to researching this problem is the domain of motion events because in this context expressions of distance can take both spatial and temporal representations. This study demonstrates a proportion between spatial and temporal expressions of distance for the semantic attribute of motion medium based on objectively verifiable frequencies of language patterns found in the National Corpus of Polish. Data obtained in this research show that in this semantic context Polish speakers tend to render distance both in spatial and temporal terms, with spatial representations being used more frequently, but not by a large margin. The results indicate that in the framework of motion events space and time act as complementary to each other, which suggests that they are correlated metonymically, rather than being asymmetrically dependent.

Keywords: Space, time, spacetime, motion, events, medium, Ground, cognition, corpus linguistics

1. Introduction

Subjective distance in space, i.e. what people know or believe about a distance, depends on a number of factors, including the complexity of environmental features, the physical effort that needs to be expended, and the time required to reach a destination, especially in situations of restricted access to other kinds of information (Montello 2009; Tversky 2011). Observations of travel time as a popular metric of spatial distance, especially in the context of urban environments, have been made for years in studies on geographical cognition (MacEachren 1980; Montello 1997). Yet, according to widespread linguistic intuitions, it seems that the relationship between time and space in linguistic expressions of distance is dictated by the asymmetry of space and time reflected in conceptual metaphors (Lakoff & Johnson 1980, 1999). As put deftly by Casasanto, Fotakopoulou & Boroditsky (2010):


In English, it is nearly impossible to talk about domains like time without using words that can also express spatial ideas: Vacations can be long or short, meetings can be moved forward or pushed back, deadlines can lie ahead of us or behind us. Yet it is far less common to use temporal words to talk about space (Lakoff & Johnson 1999). Although we could say that we live “a few minutes from the station,” we could just as easily express this spatial idea in spatial words, saying “a few blocks from the station.” (Casasanto et al. 2010: 388).

However, in linguistics some aspects of language are generally perceived, while others have to be computed to be evaluated properly. Intuitive judgments often lead to heuristics and biases (Kahneman 2011; A. Tversky & Kahneman 1974). As noted by the father of modern corpus linguistics, the late John Sinclair (1991: 4), “human intuition about language is highly specific, and not at all a good guide to what actually happens when the same people actually use the language”. Since access to statistics on the frequency of language patterns is unavailable through linguistic intuition, providing verifiable data on the frequency of language patterns in corpora is a key asset that the corpus methodology brings to linguistic studies. In recent years, a number of scholars have voiced a need for cognitive linguistics to put a stronger emphasis on the application of empirical data derived from corpora (e.g. Geeraerts 2006; Grondelaers, Geeraerts & Speelman 2007; Heylen, Tummers & Geeraerts 2008). Since a commitment to the usage-based model of language is among the central theses of cognitive linguistics (Evans 2012; Janda 2015), making use of linguistic performance samples goes naturally with cognitive linguistic research.

On these grounds, this paper examines the proportion between spatial and temporal representations of distance in the semantic context of motion events (Talmy 2000a, 2000b) from the perspective of data found in the National Corpus of Polish (henceforth, NCP). More specifically, this study is restricted to examining linguistic expressions of distance for the semantic attribute of motion medium, i.e. the environment in which motion occurs. It complements studies conducted earlier for the semantic attributes of instrument and manner (Waliński 2014a, 2014b), and parallels a study conducted for English (Waliński, in press) with the British National Corpus.

2. Conceptions of space–time relations

Entanglement of space and time in cognition is among the most intensely pursued problems in contemporary cognitive science (Núñez & Cooperrider 2013). The mutual relationship between mental representations of space and time can be viewed in at least four different ways. One originates from the empiricist philosophy, which assumes that the nature of all knowledge is affected by sensory experience (Markie 2015). Since space and time serve as our two basic locational frameworks by means of which we situate objects and events, the perception of space is inextricably connected with the perception of time. As stated by Locke (1689/1995: 140), “expansion and duration do mutually embrace and comprehend each other; every part of space being in every part of duration, and every part of duration in every part of expansion”. From this perspective, it is difficult to think about either without thinking about the other, which makes these two domains symmetrically dependent on each other. Engberg-Pedersen (1999) assumes that space and time are so strongly interwoven in cognition that they should not be analyzed as two separate domains. She argues that although it is possible to distinguish between conceptualizations of space and time at some cognitive levels, the distinction between space and time should be attributed to the difference between static objects and dynamic events, rather than space and time as such.

However, an alternative proposal posits that space and time are asymmetrically dependent. This view stems from an assumption that while the domain of space appears to be directly accessible through the senses, the domain of time escapes sensory perception. As put by Lakoff (1993: 218), “… we have detectors for motion and detectors for objects/locations. We do not have detectors for time”. Consequently, it is plausible to presume (e.g. Clark 1973; Lakoff & Johnson 1980, 1999) that space is the concrete domain that provides us with means of structuring time. As an outcome, time is processed indirectly and structured metaphorically in terms of space.

As a third possibility, the cognition of space and time can be considered independent of each other, despite being much alike due to a similarity between these two domains.
For example, Jackendoff (2002; see also Jackendoff & Aaron 1991) suggests that, although our conceptions of space and time may be thematically parallel, which is reflected in spatial metaphors used for expressing temporal concepts, the presumed primacy of space is illusory. Jackendoff points out that it is epistemologically equally plausible to assume that space and time are essentially unrelated domains organized by a common set of parameters that are simply more transparent in the spatial than in the temporal verbalization. From this perspective, it is possible that metaphors referring to space and time arise out of the structural similarity (Murphy 1996) of pre-existing conceptual structures between space and time. Although spatial metaphors have become conventional ways of talking about time, they are actually unrevealing about the space–time relations (see also Pinker 2007, Ch. 5). Moreover, space and time can be viewed from the perspective of the unitary framework of spacetime, which was geometrically modeled by Minkowski (1908/1964) with reference to Einstein’s (1905/1952a) Special Theory of Relativity.


Einstein’s subsequently announced General Theory of Relativity (1916/1952b) assumes that we function in a four-dimensional universe determined by three-dimensional space combined integrally with the fourth dimension of time (Hawking 1988; see also DiSalle 2006, Ch. 4). The theory forces us to accept that time is not completely independent of space, but is combined with it to form an entity called spacetime. However, although the concept of spacetime has been considered in some linguistic studies (e.g. Bączkowska 2011; Jaszczolt 2009), it normally escapes human intuition. As emphasized by Hawking (1988: 10), most people, including scientists, still use Newton’s (1687/1995) model to think and talk about time and space in everyday situations. He adds that although it is sometimes helpful to think of the four spatial-temporal coordinates of an event in terms of space-time pictured mentally as a four-dimensional space, imagining a four-dimensional space is in fact impossible. Langacker (2012: 200–203) emphasizes that the assumption that space and time form a four-dimensional representational space in conception of objects and events is a foregone conclusion. He adds that despite certain parallelisms suggesting that space and time are comparable, there exist important asymmetries indicating that time is not just another space-like cognitive dimension. For instance, although from the perspective of Einsteinian physics it would be as accurate to assume that motion through space occurs in time as that motion through time occurs in space, in everyday language we are inclined to say that a falling apple gets “closer and closer” rather than “later and later” to the ground. Relations between space and time have been discussed from various perspectives in abundant literature (e.g. DiSalle 2006; Evans 2013; Le Poidevin 2003; Moore 2014; Smart 1964; Tenbrink 2007 and references therein), yet the nature of this relationship has not been established precisely.

3. (Dis)similarities between space and time in the human mind

Comparisons between psychological space and time are difficult to conduct because sensory modalities involved in the perception of space have more clearly defined aspects than those involved in the perception of time (Grondin 2010). After almost 130 years of research, psychology has yet to distinguish a definitive sensory system responsible for perception and processing of time (Matthews & Meck 2014). Neuroscience has not found the neural basis for the processing of temporal intervals and the experience of duration, either (Wittmann 2013). Historically, the perception of space was both intuitively and in empiricist philosophy associated with the visual modality. However, systematic studies in blind and sighted individuals have provided ample evidence that visual experience is not a necessary feature in the mental development of spatial representations (Millar 2008). It appears that spatial knowledge depends on a cognitive structure that organizes information obtained through all modalities, but itself is not dependent on any particular modality (Spence & Driver 2004).

What makes investigations of the relationship between space and time in cognition even more difficult to conduct is that they are attributed different dimensionalities. Time is generally regarded as a linear vector extending ahead into the future and back into the past. On the other hand, space is discussed in terms of one-dimensional distances, two-dimensional planes, and three-dimensional spaces. Another basic difference between space and time is that the dimension in which time extends, or “flows” as we tend to say, is not reversible, which has been termed by Galton (2011) temporal transience: what occurs in time, occurs only once at that very moment, with no possibility of return (see Bergson 1922/2002 for a discussion on the evanescent nature of time).

On the other hand, there are certain similarities between space and time in cognition. Classic studies in psychophysics (Stevens 1986) have demonstrated that people associate lines of different lengths with tones of different durations, and vice versa. Both adults and children recognize them as meaningful representations and provide consistent and systematic responses to them in psychophysical tasks based on spontaneous alignment of representations of temporal duration with representations of spatial length (e.g. Casasanto et al. 2010; Srinivasan & Carey 2010). Moreover, a link between spatial and temporal dimensions of psychological distance has been observed in their relation to the level of mental construal.
An extensive series of studies on construal of distance (see Liberman & Trope 2014 for a review) found that events located further away in space and time are more likely to be represented in terms of abstract and general features at a higher level of mental construal. According to Construal-Level Theory of Psychological Distance, temporal and spatial distances are associated and are inferred from one another, which makes them act in the human mind in a complementary and compensatory way.

4. Representations of motion-framed distance for medium in the NCP

This study investigates what is called here motion-framed distance (cf. motion-framed location in Tutton 2012), which refers to a distance that separates one point from another in space in the semantic context of motion events. Talmy (2000b, Part 1) characterizes a basic Motion event as a situation consisting of four internal core components: (1) the presence or absence of motion (Motion); (2) the moving entity (Figure); (3) the object with respect to which the Figure moves (Ground); (4) the course followed by the Figure with respect to the Ground (Path). In the context of this study it is worth pointing out that the component of Motion refers to “the presence per se of motion or locatedness in the event” (Talmy 2000b: 25), despite the fact that in the latter the Figure does not change its position with respect to the Ground. Moreover, Talmy distinguishes an associated co-event, which refers to (5) the manner in which the motion takes place (Manner); and (6) the cause of its occurrence (Cause). Levinson (2003: 96) notes that the description of motion involves an additional set of parameters that denote not only change of location, but also manner, instrument, medium of motion, as well as other attributes.

As already mentioned, this study is restricted to examining linguistic representations of distance for the semantic attribute of motion medium (cf. Ground in Talmy 2000b). However, it must be emphasized that the semantic attributes of motion are not easily disentangled. For instance, the expression by sea not only encodes the medium through which a traversal takes place, but additionally implies a certain manner of travelling, typically sailing, which in turn involves a range of instruments used for that purpose, e.g. a ship, boat, etc. (see Goddard and Wierzbicka 2009 for a study demonstrating how the semantics of physical activity verbs in English, Polish, and Japanese ties the kind of instruments used in the action with the manner in which the instrument is used). Therefore, at least for certain instances of distance expressions, it is impossible to make an absolute distinction between medium, manner, and instrument since they form a kind of semantic cline.
In order to verify empirically how frequently motion-framed distance representations marked semantically for the medium of motion are expressed in Polish with spatial vis-à-vis temporal terms, this study employs Narodowy Korpus Języka Polskiego (the National Corpus of Polish, henceforth NCP). It is a 240-million-word collection of samples taken from both spoken and written contemporary Polish roughly mirroring the British National Corpus in its structure (see www.nkjp.pl for more information). The NCP has the important advantage of being a publicly available standard reference corpus, which enables other researchers to attest or expand the present research. The corpus was examined using queries based on regular expression syntax (see Waliński 2015 for a full listing accompanied by corresponding concordances retrieved from the NCP), which provides for immediate replicability of the study with nothing more than a web browser.

The examination was implemented by looking for frequencies of spatial and temporal adverbials that express absolute distance, i.e. one denoted using spatial or temporal units. Although the use of adverbials represents a fundamental way of expressing remoteness in space, it is far from being exhaustive of the entirety of ways used for representing spatial distance in language (see Carlson 2010 for an overview). However, the aim of this paper is not to examine the full array of linguistic means available for denoting remoteness in space, but to observe a general proportion between spatial and temporal representations of the motion-framed distance for the semantic attribute of medium in Polish.

4.1 Language patterns

Medium of motion has been discussed in literature under a variety of different labels. Langacker (2008) subsumes it under the umbrella term landmark. Talmy (2000a, 2000b) sees it as Ground that acts as a spatial reference point for the motion/location of the Figure. Talmy (2000b) considers conflation of Motion + Ground in verb roots as a minor pattern in linguistic representation of motion events, and notes that in English this semantic attribute is predominantly expressed with prepositional phrases. In Polish, it is typically expressed with prepositional phrases involving accusative nominal forms, e.g. “przez morze” [EN: over the sea_ACC.], “przez las” [EN: through the forest_ACC.] or locative nominal forms “po drodze” [EN: by road_LOC.]. Alternatively, a similar sense of space traversed, although less bounded perhaps, can be conveyed by bare instrumental nominal forms, e.g. “morzem”, “lasem”, “drogą” [EN: sea_INS., forest_INS., road_INS.], etc. The following schematic patterns were used to search for these options in the NCP:

SPATIAL or TEMPORAL UNIT + PREPOSITION + MEDIUM OF MOTION [acc or loc]; [SLOP FACTOR=1, PRESERVE ORDER=YES]
SPATIAL or TEMPORAL UNIT + MEDIUM OF MOTION [ins]; [SLOP FACTOR=0, PRESERVE ORDER=YES]

The selection of motion media examined in this research involves ten environments typically involved in journeying: ląd, woda, powietrze, śnieg, bezdroże, góra, las, morze, droga, kolej [EN: land, water, air, snow, field, mountain, forest, sea, road, railway], also in plural forms where applicable, which parallels items analyzed previously for English (Waliński, in press). The selection of prepositions was limited to po and przez, whose meaning may be approximated (Lewandowska-Tomaszczyk 2012) by a range of English prepositions such as: across, along, by, over, through.
This choice of nominal forms for landmarks and accompanying prepositions is obviously far from being exhaustive of Polish expressions of distance for the medium of motion, but appears to be reasonably adequate for the purpose of this study.
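As a rough illustration of how the two schematic patterns expand into concrete searches, the sketch below enumerates every combination of unit, preposition, and medium as a plain tuple. The media and prepositions are those listed above, and the unit lemmas are the ones this section names for time and space measurement. The function name build_patterns and the tuple layout are hypothetical, and the sketch does not reproduce the actual NCP query syntax, which is documented in Waliński (2015).

```python
from itertools import product

# Lemmas named in the study; the data layout is a hypothetical illustration.
SPATIAL_UNITS = ["metr", "kilometr", "mila"]
TEMPORAL_UNITS = ["minuta", "godzina", "dzień"]
PREPOSITIONS = ["po", "przez"]
MEDIA = ["ląd", "woda", "powietrze", "śnieg", "bezdroże",
         "góra", "las", "morze", "droga", "kolej"]


def build_patterns():
    """Enumerate (domain, unit, preposition, medium, slop) combinations."""
    units = ([("spatial", u) for u in SPATIAL_UNITS]
             + [("temporal", u) for u in TEMPORAL_UNITS])
    patterns = []
    for (domain, unit), medium in product(units, MEDIA):
        # Prepositional pattern: UNIT + PREPOSITION + MEDIUM [acc or loc], slop 1.
        for prep in PREPOSITIONS:
            patterns.append((domain, unit, prep, medium, 1))
        # Bare instrumental pattern: UNIT + MEDIUM [ins], slop 0 (no preposition).
        patterns.append((domain, unit, None, medium, 0))
    return patterns


if __name__ == "__main__":
    print(len(build_patterns()), "pattern instances")
```

Whether the original study issued one query per combination or grouped alternatives inside fewer queries is not stated here, so the enumeration should be read only as a way of making the search space explicit.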


Since this study aims specifically to identify representations of the motion-framed distance denoted in terms of temporal duration, e.g. “Szedł ponad cztery godziny po śniegu” [EN: He walked over four hours through/over the snow] vis-à-vis ones denoted in terms of spatial expansion, e.g. “Przebyli przeszło tysiąc kilometrów morzem” [EN: They travelled over a thousand kilometers by sea], a unit of space/time measurement was incorporated in the patterns. Units of time measurement selected for analysis involve those that are typically used to express duration, i.e. minutes, hours, and days [PL: minuta, godzina, dzień]. Units of space measurement selected for comparison include meter, kilometer and mile [PL: metr, kilometr, mila]. Although Polish speakers do not normally express spatial extents with imperial units, the [nautical] mile (PL: mila [morska]) is used in the context of sea travels.

Because lexemes in the above schematic patterns do not always follow directly one after another in linguistic performance, searching was implemented with proximity queries (Bernard & Griffin 2009). Essentially, proximity queries allow us to take account of additional modifiers between query terms. They allow for specifying a slop factor, which determines how far apart lexical items included in a query can be from one another and still be returned as a result of the query. In this study, corpus queries were implemented with the slop value of 1, which reveals more specific environments, such as “boczne, kamieniste, żużlowe, oblodzone, puste, ruchliwe drogi” [EN: back, stony, cinder, icy, empty, busy roads], etc. For the nominal instrumental pattern the slop value was set to 0 to avoid an excess of coincidental hits. Additionally, the binary (yes/no) preserve order option was set to “yes” to indicate that the original order of query terms should be retained in results (see Waliński 2015 for a full listing of all queries used in this study).
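The effect of the slop factor and the preserve order option can be made concrete with a small sketch. It assumes the text has already been reduced to a list of lemmas and treats slop as the total number of extra tokens allowed between the first and the last query term. This is a simplification chosen for illustration and is not the exact definition used by the search engine behind the NCP; the function name proximity_match and the sample lemma sequence are hypothetical.

```python
def proximity_match(tokens, terms, slop=0):
    """Return True if `terms` occur in `tokens` in the given order with at
    most `slop` extra tokens in total between the first and last term."""
    if not terms:
        return False
    window = len(terms) + slop  # longest span a valid match may cover
    for start in range(len(tokens)):
        if tokens[start] != terms[0]:
            continue
        next_term = 1
        # Scan only the tokens that still fit inside the allowed window.
        for tok in tokens[start + 1:start + window]:
            if next_term < len(terms) and tok == terms[next_term]:
                next_term += 1
        if next_term == len(terms):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical lemmatized sequence with one modifier before the medium.
    lemmas = ["cztery", "godzina", "po", "głęboki", "śnieg"]
    query = ["godzina", "po", "śnieg"]
    print(proximity_match(lemmas, query, slop=1))  # True
    print(proximity_match(lemmas, query, slop=0))  # False
```

With slop 1 the intervening modifier is tolerated, which mirrors how the prepositional pattern picks up phrases such as "oblodzone drogi"; with slop 0 only directly adjacent terms match, as in the instrumental pattern.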

5. Summary of results

Because the use of proximity queries increases the recall of results at the expense of their precision (see Pęzik 2011), the resulting set of concordances retrieved from the NCP had to be carefully reviewed to eliminate examples sharing the defined sequence/proximity of lexical items by coincidence.¹ Out of 239 concordance lines retrieved from the NCP, 128 were recognized as valid representations of the motion-framed distance in spatial or temporal terms, e.g. “Do kościoła jest pięć kilometrów przez las” [EN: The church is five kilometers through the forest] or “Asuan dzielą zaledwie trzy godziny jazdy koleją od prastarego Luksoru” [EN: Aswan is only three hours by railway from ancient Luxor], etc. The results found for the selected language patterns are presented in Table 1.

¹ For instance, “po drodze” is ambiguous between the sense discussed here and another, related meaning close to “on one’s way” or a more idiomatic (metaphoric) “by the way”.

Table 1. Expressions of the motion-framed distance in spatial and temporal terms found in the NCP for the semantic attribute of motion medium.

Medium of motion      Denoted in spatial terms    Denoted in temporal terms
ląd (land)            3                           0
woda (water)          3                           1
powietrze (air)       12                          14
śnieg (snow)          7                           15
bezdroże (field)      3                           1
góra (mountain)       6                           0
las (forest)          24                          14
morze (sea)           1                           2
droga (road)          11                          2
kolej (railway)       1                           8
Total                 71                          57
Proportion            55%                         45%

Although the number of examples retrieved from the NCP for the semantic attribute of medium is not very extensive, it can serve as an indicator of the relationship between space and time in motion-framed distance expressions. Table 1 shows that for the selected landmarks the number of spatial expressions, 71 (55%), exceeds the number of temporal expressions, 57 (45%). Still, the proportion between spatial and temporal expressions of the motion-framed distance for this semantic attribute appears to be balanced rather than totally dissimilar. It is noteworthy that a very similar ratio (56% vs. 44%) has been observed in a parallel study conducted earlier for English (Waliński, in press) with the British National Corpus, which indicates that the overall result does not arise haphazardly. Although it is impossible to discuss language in terms of absolute numbers and ratios, the proportion of spatial vs. temporal expressions of motion-framed distance found in the NCP for the semantic attribute of motion medium confirms observations made earlier (Waliński 2014a, 2014b) that denoting spatial extents in temporal terms is a common way of expressing spatial distance in the semantic context of motion events. The results indicate that for the medium of motion Polish speakers tend to express the motion-framed distance both in spatial and temporal terms, with spatial representations being used more frequently, but not by a large margin. Since this tendency has been found to occur cross-linguistically, it appears to be modulated by the presence of the semantic element of motion, rather than by linguistic patterns alone.
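The totals and percentages reported for Table 1 can be rechecked with a few lines of arithmetic. The sketch below uses the per-medium counts from the table and the figure of 239 retrieved concordance lines mentioned in this section; the names TABLE_1, RETRIEVED_LINES, and summarize are hypothetical and serve only this illustration.

```python
# Counts transcribed from Table 1: medium -> (spatial, temporal).
TABLE_1 = {
    "ląd": (3, 0),
    "woda": (3, 1),
    "powietrze": (12, 14),
    "śnieg": (7, 15),
    "bezdroże": (3, 1),
    "góra": (6, 0),
    "las": (24, 14),
    "morze": (1, 2),
    "droga": (11, 2),
    "kolej": (1, 8),
}
RETRIEVED_LINES = 239  # concordance lines before manual review


def summarize(table):
    """Sum both columns and express each as a rounded share of the total."""
    spatial = sum(s for s, _ in table.values())
    temporal = sum(t for _, t in table.values())
    total = spatial + temporal
    return {
        "spatial": spatial,
        "temporal": temporal,
        "total": total,
        "spatial_pct": round(100 * spatial / total),
        "temporal_pct": round(100 * temporal / total),
    }


if __name__ == "__main__":
    summary = summarize(TABLE_1)
    print(summary)
    # Share of retrieved lines that survived manual review.
    print(f"{summary['total'] / RETRIEVED_LINES:.1%}")
```

The column sums come to 71 and 57, which round to the 55% and 45% given in the table, and the 128 valid examples amount to roughly 54% of the 239 lines retrieved.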

6. Conclusions

It would be unwise to draw hard and fast conclusions about the relationship between space and time in the human mind from the results of research so restricted in its scope. However, it is plausible to presume that the obtained data are indicators of certain properties of the entanglement between space and time in cognition. Since the linguistic representation of space is largely relativistic and approximate, rather than Euclidean and quantitative (Talmy 2000a, Part 1), it comes naturally to language users to express spatial distance in temporal terms. This way of expressing distance is highly versatile. Because clocks and watches are much more widespread than instruments of distance measurement, it is a relatively straightforward way of expressing a distance that is not known precisely in metric terms. Moreover, it allows for expressing a distance from the speaker’s subjective point of view as a particularly short/long way to a destination, e.g. “Wieki miną nim dotrzemy stąd do Warszawy!” [EN: It will take ages to get to Warsaw!]. For very remote or hard to access places, we specify the distance to Mars in months of space traveling or the distance to Mt. Everest peak in days of climbing without even noticing the shift from spatial to temporal domain of representation. Denoting spatial distance in terms of travel time is particularly convenient in urban environments, where reaching various destinations depends not as much on the spatial separation as on the traffic intensity at different times of the day.

More specifically, the results demonstrate that in the semantic context of motion events space and time often act in a complementary fashion, rather than being universally asymmetrically dependent. This complementarity can be observed more directly in certain instances of language use found through a concordance analysis, e.g. “Odległość od Warszawy – 18 kilometrów, 28 minut koleją” [EN: The distance from Warsaw – 18 kilometers, 28 minutes by railway] or “Odwiedziłam Kadyny – kilometr od Zalewu Wiślanego, pół godziny drogi od Elbląga” [EN: I visited Kadyny – located one kilometer away from the Vistula Lagoon and half an hour away from Elbląg]. Such examples demonstrate that spatial and temporal representations can act on an equal footing in expressions of remoteness in space, which indicates that in the context of motion events space and time are closely tied and neither can be regarded as the metaphorical extension of the other.


Moreover, B. Tversky (2011) emphasizes that knowledge of space on the horizontal plane is derived from motion in time. Since each and every motion occurs in space and takes time, space and time are interchangeable and intertwined in numerous senses in spatial cognition. Kövecses (2005: 53; see also Lakoff & Johnson 1999: 152) notes that in English one can say, “I slept for fifty miles while she drove” (Distance For Time-Duration) and “San Francisco is half an hour from Berkeley” (Time-Duration For Distance). He argues that in such expressions time and motion act as correlated domains joined in a single literal conceptual frame of Time-Motion schema, within which elements can stand for each other in the form of metonymies. Engberg-Pedersen (1999) points out that we can use names of places, which are primarily spatial words, to denote punctual moments in time in terms of spatial locations, e.g. “I haven’t had a drink since London”.

In the light of this study, it is plausible to propose that in the semantic context of motion events both time and space can be viewed as elements of a unified conceptual frame of Space-Time-Motion, which dictates the relationship between space and time in a complementary fashion. Within this schema any two elements can stand metonymically for the third one: time elapsed in motion can be used to express spatial distance; space traversed in motion can be used to identify duration, which is commonly used for telling the time by the Sun’s position in the sky; a punctual moment in time can be used to specify a location passed while traveling; and a specific location passed during traveling can be used to refer to a specific moment in time. This cognitive complementarity of space and time in motion representations is likely to be related to the unity of time, space, and motion pointed out by Aristotle (350BC/1995), or perhaps even to the spatial–temporal relativity assumed by Einstein’s (1916/1952b) General Theory of Relativity.

References

Aristotle. 350BC/1995. “Physics” (written c. 350BC). In J. Barnes (ed.), The complete works of Aristotle (Vol. 1). Princeton, NJ: Princeton University Press.
Bączkowska, A. 2011. Space, Time & Language: A Cognitive Analysis of English Prepositions. Bydgoszcz: Wydawnictwo Uniwersytetu Kazimierza Wielkiego.
Bergson, H. 1922/2002. “Concerning the Nature of Time.” In K. Ansell-Pearson & J. Mullarkey (eds.), Henri Bergson: Key Writings, 205–222. New York: Continuum.
Bernard, E. & J. Griffin. 2009. “Understanding Lucene’s query syntax.” In Hibernate Search in Action, 202–214. Greenwich, CT: Manning.
Carlson, L. A. 2010. “Encoding Space in Spatial Language.” In K. S. Mix, L. B. Smith & M. Gasser (eds.), The Spatial Foundations of Language and Cognition, 157–183. Oxford: Oxford University Press.


Casasanto, D., Fotakopoulou, O. & L. Boroditsky. 2010. “Space and Time in the Child’s Mind: Evidence for a Cross-Dimensional Asymmetry.” Cognitive Science, 34(3), 387–405. doi:10.1111/j.1551-6709.2010.01094.x.
Clark, H. H. 1973. “Space, time, semantics, and the child.” In T. E. Moore (ed.), Cognitive Development and the Acquisition of Language, 27–63. New York: Academic Press.
DiSalle, R. 2006. Understanding Space-time: The Philosophical Development of Physics from Newton to Einstein. Cambridge: Cambridge University Press.
Einstein, A. 1905/1952a. “On the Electrodynamics of Moving Bodies.” [First published in 1905 as Zur Elektrodynamik bewegter Körper in Annalen der Physik, 322, 891–921]. In W. Perrett & G. B. Jeffery (Trans.), The Principle of Relativity: A Collection of Original Papers on the Special and General Theory of Relativity, 35–65. New York: Dover.
Einstein, A. 1916/1952b. “The Foundation of the General Theory of Relativity.” [First published in 1916 as Die Grundlage der allgemeinen Relativitätstheorie in Annalen der Physik, 354, 769–822]. In W. Perrett & G. B. Jeffery (Trans.), The Principle of Relativity: A Collection of Original Papers on the Special and General Theory of Relativity, 109–164. New York: Dover.
Engberg-Pedersen, E. 1999. “Space and Time.” In J. Allwood & P. Gärdenfors (eds.), Cognitive Semantics: Meaning and Cognition, 131–152. Amsterdam: John Benjamins.
Evans, V. 2012. “Cognitive linguistics.” Wiley Interdisciplinary Reviews: Cognitive Science, 3(2), 129–141. doi:10.1002/wcs.1163.
Evans, V. 2013. Language and Time: A Cognitive Linguistics Approach. Cambridge: Cambridge University Press.
Galton, A. 2011. “Time flies but space does not: Limits to the spatialisation of time.” Journal of Pragmatics, 43(3), 695–703. doi:10.1016/j.pragma.2010.07.002.
Geeraerts, D. 2006. “Methodology in Cognitive Linguistics.” In G. Kristiansen, M. Achard, R. Dirven & F. J. Ruiz de Mendoza Ibáñez (eds.), Cognitive Linguistics: Current Applications and Future Perspectives, 21–49. Berlin: Mouton de Gruyter.
Goddard, C. & A. Wierzbicka. 2009. “Contrastive semantics of physical activity verbs: ‘Cutting’ and ‘chopping’ in English, Polish, and Japanese.” Language Sciences, 31(1), 60–96. doi:10.1016/j.langsci.2007.10.002.
Grondelaers, S., Geeraerts, D. & D. Speelman. 2007. “A case for a Cognitive corpus linguistics.” In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson & M. J. Spivey (eds.), Methods in Cognitive Linguistics, 149–169. Amsterdam: John Benjamins.
Grondin, S. 2010. “Timing and time perception: A review of recent behavioral and neuroscience findings and theoretical directions.” Attention, Perception & Psychophysics, 72(3), 561–582. doi:10.3758/APP.72.3.561.


Hawking, S. 1988. A Brief History of Time: From the Big Bang to Black Holes. New York: Bantam Books.
Heylen, K., Tummers, J. & D. Geeraerts. 2008. “Methodological issues in corpus-based Cognitive Linguistics.” In G. Kristiansen & R. Dirven (eds.), Cognitive Sociolinguistics: Language Variation, Cultural Models, Social Systems, 91–128. Berlin: Mouton de Gruyter.
Jackendoff, R. 2002. Foundations of Language: Brain, Meaning, Grammar, Evolution. Oxford: Oxford University Press.
Jackendoff, R. & D. Aaron. 1991. “Review of ‘More Than Cool Reason: A Field Guide to Poetic Metaphor’.” Language, 67(2), 320–338. doi:10.2307/415109.
Janda, L. A. 2015. “Cognitive Linguistics in the Year 2015.” Cognitive Semantics, 1(1), 131–154. doi:10.1163/23526416-00101005.
Jaszczolt, K. M. 2009. Representing Time: An Essay on Temporality as Modality. Oxford: Oxford University Press.
Kahneman, D. 2011. Thinking, Fast and Slow. New York: Farrar, Straus and Giroux.
Kövecses, Z. 2005. Metaphor in Culture: Universality and Variation. Cambridge: Cambridge University Press.
Lakoff, G. 1993. “The contemporary theory of metaphor.” In A. Ortony (ed.), Metaphor and Thought, 2nd Ed., 202–251. Cambridge: Cambridge University Press.
Lakoff, G. & M. Johnson. 1980. Metaphors We Live By. Chicago: University of Chicago Press.
Lakoff, G. & M. Johnson. 1999. Philosophy in the Flesh: The Embodied Mind and Its Challenge to Western Thought. Chicago: University of Chicago Press.
Langacker, R. W. 2008. Cognitive Grammar: A Basic Introduction. Oxford: Oxford University Press.
Langacker, R. W. 2012. “Linguistic manifestations of the space-time (dis)analogy.” In L. Filipović & K. M. Jaszczolt (eds.), Space and Time in Languages and Cultures: Language, culture, and cognition, 191–215. Amsterdam: John Benjamins.
Le Poidevin, R. 2003. Travels in Four Dimensions: The Enigmas of Space and Time. Oxford: Oxford University Press.
Levinson, S. C. 2003. Space in Language and Cognition: Explorations in Cognitive Diversity. Cambridge: Cambridge University Press.
Lewandowska-Tomaszczyk, B. 2012. “Approximative Spaces and the Tolerance Threshold in Communication.” International Journal of Cognitive Linguistics, 2(2), 165–183.
Liberman, N. & Y. Trope. 2014. “Traversing psychological distance.” Trends in Cognitive Sciences, 18(7), 364–369. doi:10.1016/j.tics.2014.03.001.

142

Jacek Tadeusz Waliński

Locke, J. 1689/1995. An Essay Concerning Human Understanding. [First published in 1689]. Amherst, NY: Prometheus Books. MacEachren, A. M. 1980. “Travel Time as the Basis of Cognitive Distance.” The Professional Geographer, 32(1), 30–36. doi:10.1111/j.0033-0124.1980.00030.x. Markie, P. 2015. “Rationalism vs. Empiricism.” In E. N. Zalta (ed.), The Stanford Encyclopedia of Philosophy. Stanford, CA: Stanford University. Retrieved from http://plato.stanford.edu/archives/sum2015/entries/rationalism-empiricism/. Matthews, W. J. & W. H. Meck. 2014. “Time perception: the bad news and the good.” Wiley Interdisciplinary Reviews: Cognitive Science, 5(4), 429–446. doi:10.1002/ wcs.1298. Millar, S. 2008. Space and Sense. Hove: Psychology Press. Minkowski, H. 1908/1964. “Space and Time.” [A translation of an address delivered at the 80th Assembly of German Natural Scientists and Physicians, at Cologne, September 21, 1908]. In J. J. C. Smart (ed.), Problems of Space and Time, 297–312. New York: Macmillan. Montello, D. R. 1997. “The perception and cognition of environmental distance: Direct sources of information.” In S. C. Hirtle & A. U. Frank (eds.), Spatial Information Theory A Theoretical Basis for GIS: International Conference COSIT’97 Proceedings, 297–311. Berlin: Springer. Montello, D. R. 2009. “A Conceptual Model of the Cognitive Processing of Environmental Distance Information.” In K. S. Hornsby, C. Claramunt, M. Denis & G. Ligozat (eds.), Spatial Information Theory: International Conference COSIT 2009 Proceedings, 1–17. Berlin: Springer. Moore, K. E. 2014. The Spatial Language of Time: Metaphor, Metonymy, and Frames of Reference. Amsterdam: John Benjamins. Murphy, G. L. 1996. “On metaphoric representation.” Cognition, 60(2), 173–204. doi:10.1016/0010-0277(96)00711-1. Newton, I. 1687/1995. The Principia: Mathematical Principles of Natural Philosophy. [First published in 1687 as Philosophiae Naturalis Principia Mathematica]. (A. Motte, Trans.). 
Amherst, N.Y: Prometheus Books. Núñez, R. E. & K. Cooperrider. 2013. “The tangle of space and time in human cognition.” Trends in Cognitive Sciences, 17(5), 220–229. doi:10.1016/ j.tics.2013.03.008. Pęzik, P. 2011. “Providing corpus feedback for translators with the PELCRA search engine for NKJP.” In S. Goźdź-Roszkowski (ed.), Explorations Across Languages and Corpora: PALC 2009, 135–144. Frankfurt am Main: Peter Lang.

Using time to express remoteness in space

143

Pinker, S. 2007. The Stuff of Thought: Language as a Window into Human Nature. New York: Viking. Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press. Smart, J. J. C. (ed.). 1964. Problems of Space and Time. New York: Macmillan. Spence, C. & J. Driver. (eds.). 2004. Crossmodal Space and Crossmodal Attention. Oxford: Oxford University Press. Srinivasan, M. & S. Carey. 2010. “The long and the short of it: On the nature and origin of functional overlap between representations of space and time.” Cognition, 116(2), 217–241. doi:10.1016/j.cognition.2010.05.005. Stevens, S. S. 1986. Psychophysics: Introduction to Its Perceptual, Neural, and Social Prospects [First published in 1975]. New Brunswick, NJ: Transaction Books. Talmy, L. 2000a. Toward a Cognitive Semantics, Vol. I: Concept Structuring Systems. Cambridge, MA: MIT Press. Talmy, L. 2000b. Toward a Cognitive Semantics, Vol. II: Typology and Process in Concept Structuring. Cambridge, MA: MIT Press. Tenbrink, T. 2007. Space, Time, and the Use of Language: An Investigation of Relationships. Berlin: Mouton de Gruyter. Tutton, M. 2012. “Granularity, space, and motion-framed location.” In M. Vulchanova & E. van der Zee (eds.), Motion Encoding in Language and Space, 149–165. Oxford: Oxford University Press. Tversky, A. & D. Kahneman. 1974. “Judgment under Uncertainty: Heuristics and Biases.” Science, 185(4157), 1124–1131. doi:10.1126/science.185.4157.1124. Tversky, B. 2011. “Visualizing Thought.” Topics in Cognitive Science, 3(3), 499–535. doi:10.1111/j.1756-8765.2010.01113.x. Waliński, J. T. 2014a. Complementarity of Space and Time in Distance Representations: A Cognitive Corpus-based Study, 2nd Ed. Łódź: Łódź University Press. Waliński, J. T. 2014b. “Complementarity of Space and Time in Motion-Framed Distance.” In B. Lewandowska-Tomaszczyk & K. Kosecki (eds.), Time and Temporality in Language and Human Experience, 85–101. Frankfurt am Main: Peter Lang.

144

Jacek Tadeusz Waliński

Waliński, J. T. 2015. Spatial and temporal representations of medium-mediated expressions of motion-framed distance in the National Corpus of Polish (COST Action TD0904 TIMELY Research Report No. 01/2015). Lodz: University of Lodz. Retrieved from http://anglistyka.uni.lodz.pl/userfiles/timely/NCP-MediumMediatedDistance.pdf. Waliński, J. T. (in press). “Space and time in medium-mediated expressions of distance.” In Within and beyond the Lexicogrammar Continuum. Amsterdam: John Benjamins. Wittmann, M. 2013. “The inner sense of time: how the brain creates a representation of duration.” Nature Reviews Neuroscience, 14(3), 217–223. doi:10.1038/ nrn3452.

Petra Klimešová, Zuzana Komrsková, Marie Kopřivová and David Lukeš
Charles University

Avenues for Research on Informal Spoken Czech Based on Available Corpora

Abstract: This paper aims to probe into several phenomena typical of spontaneous spoken Czech in informal conversation. It focuses on lexical fillers, formally reduced pronunciation variants and [v]-prothesis (a former regional feature turned stylistic marker in many respects). These features set informal spoken Czech apart from written language on the one hand, and from formal spoken language, where more attention is given to the form of the message, on the other. In a formal context, all three of these items would be considered as stigmatizing by peers. The material used to demonstrate these features will consist of publicly accessible corpus data on informal communication in Czech spanning the years 1988–2011, and particularly the ORAL2013 corpus, which provides the most up-to-date material of this sort from all over the territory of the Czech Republic.

Keywords: Spoken Czech, informal communication, corpora

1. Introduction

The following paper aims to explore data from publicly accessible corpora of spoken Czech (see Section 2 for a list) which provide access to transcripts of spontaneous dialogs in informal situations. As such, they represent what Čermák (2009) has termed “prototypically spoken texts”. Spontaneous spoken language is the means of communication typically used in the family circle, among friends and close people in general, cases in which the speaker is minimally self-conscious about the formal attributes of her speech. In general, it can be distinguished both from written language and from spoken language as used in formal situations, which places greater demands on speakers in terms of the form and precision of their utterances.

Concerning the written vs. spoken dimension: spoken language is specific with respect to both production and reception/perception. Unlike written language, it does not permit the speaker to go back and correct herself without a trace; it is anchored in time and often allows little prior preparation (i.e. is created on the fly). Its perception and interpretation are therefore much more constrained by the linear succession of individual elements in utterances (see Auer 2009 for a discussion on the cognitive constraints imposed by this unidirectional temporality of spoken language). From a linguistic point of view, depending on the given language, marked morphological, syntactic and lexical differences between the two modes can exist (see e.g. Miller & Weinert 1998: 22–23 or Čmejrková & Hoffmannová 2011: 35).

As for the second dimension, the difference consists mainly in that in informal speech, the speaker exerts less conscious control over her utterances and employs certain variants of key linguistic variables which she generally tries to avoid in formal, controlled communication.

In our study, we will focus on three examples typical of spontaneous informal spoken Czech: lexical fillers, formally reduced pronunciation variants of frequently occurring words (as exemplified by protože {because} and člověče {man, dude}), and [v]-prothesis.¹ Our primary goal is to demonstrate that these phenomena are commonplace in spoken language and depend on different factors: lexical fillers correlate with unpreparedness (they indicate an effort to extend one’s speaking turn beyond the minimal scope necessary to convey its actual content), reduced pronunciation is connected with the token frequency of words, and [v]-prothesis used to be a regional feature but has in many ways become a stylistic marker. Secondly, these short probes should serve to show how available corpora of spoken Czech can be fruitfully used for research. In particular, the aim is to give a taste of the character of the material itself and the speaker- and situation-related metadata that accompany it, in order to hint at the broad range of research topics that it can be used to address.

The aforementioned phenomena represent typical features of spoken language and span several levels of linguistic analysis: semantics and lexicology, phonetics, and stylistics. While [v]-prothesis is straightforwardly analyzed as a sociolinguistic variable, there is much to be gained by looking at both lexical fillers and formal reductions from a diachronic perspective, as results of a general process which might be called conversationalization.

1 This text will adhere to the convention that target Czech words will always be set in italics and, where appropriate, followed by one or several approximate English equivalents {in braces} and/or a phonetic transcription [in square brackets].

1.1 Conversationalization

Conversationalization is a concept formed by analogy to the well-established notion of grammaticalization. Grammaticalization is the process by which a free combination of linguistic elements becomes entrenched in grammar: typically, a new construction emerges to disambiguate a relevant semantic aspect which has no support from the grammatical system of the language; in due time, if proven useful, the construction is then generalized and becomes the compulsory way of expressing said semantic aspect. Meillet (1912) has described it in a somewhat more restrictive yet probably clearer way as “the shift of an independent word to the status of a grammatical element” (quoted in McMahon 1994: 160). In the process, the word thus becomes an obligatory element. This shift is often accompanied by semantic bleaching (loss of independent propositional meaning) and phonetic erosion.

By conversationalization, we understand a similar recruitment (or abduction) of a language element (word, phrase) for the purposes not of grammar, but of conversation management in free-form interactions. Such an element becomes conspicuously frequent (not obligatory), which again goes hand in hand with phonetic erosion (formal attrition; see Zipf 1929), and also sheds its propositional meaning. It serves instead to organize discourse and/or structure the interaction on the fly (turn-taking, confirming status of shared context). Last but not least, it also serves to flag the conversation as informal and relaxed by showing that repetition and redundancy are allowed.

2. Source Corpora

A valuable source of data for studying the aforementioned phenomena can be found in corpora of informal spoken Czech. The recordings and transcripts in this type of corpus must fulfill the following criteria: informal setting, dialogical nature, non-public and unofficial communication situation, and no prior preparation/unscriptedness. For this study, we selected the corpora as listed in Table 1.² Individual utterances in all of these corpora are annotated for the following sociolinguistic factors describing the speakers: sex; age group (under vs. over 35 years old, though more precise age information is recoverable as well); and highest education achieved (non-tertiary vs. tertiary).

Table 1. Available corpora of informal spoken Czech used in this study.

corpus     size (tokens)   size (positions)   time span of recordings   regional coverage            total hours of audio
PMK        674,992         819,267            1988–1996                 Prague                       about 90
BMK        500,460         596,009            1994–1999                 Brno                         about 80
ORAL2006   1,000,798       1,312,282          2002–2006                 Bohemia                      111
ORAL2008   1,000,097       1,349,536          2002–2007                 Bohemia                      115
ORAL2013   2,785,189       3,285,508          2008–2011                 Bohemia, Moravia, Silesia    292

2 All accessible via http://www.korpus.cz.


The methodology of data collection for all of these corpora involved an associate who recorded the interaction on site (i.e. participants were not taken out of their familiar environment). Ideally, this person should be the only one aware beforehand of the conversation being recorded; in practice, this is not always possible. The distinction between the associate and regular participants can be recovered from the corpus metadata, but is generally not considered as a factor in subsequent analyses, because the targeted communication situations are informal enough (and the sessions long enough, generally at least 10 minutes) to be conducive to undistorted linguistic behavior even on the part of the participant(s) who are aware of the recording device. The PMK and BMK corpora are special in that they were designed to partly reflect more formal spontaneous spoken language, elicited via a formalized Q&A session with the participant. These data were excluded from our analyses (see Section 4.2).

3. Lexical Fillers

In the literature, lexical fillers appear under the collective heading of “dysfluencies”, which includes all phenomena disrupting the flow of an utterance. “Native speakers […] use a variety of fillers to fill their hesitation pauses, such as the lengthening of sounds, quasi-lexical fillers (uh, uhm), lexical fillers (well, you know etc.), and repetitions” (Rieger 2001: 81). Lexical fillers also appear in the literature under the term “discourse markers” (e.g. Schiffrin 1987). According to Georgakopoulou and Goutsos (2004: 99), discourse markers “are typically found in utterance initial position or, on the whole, in transitional locations (beginning and end of units); they do not always have a clear propositional (semantic) meaning or their propositional meaning is superseded by their discoursal functions”. However, as fillers progressively shed their propositional meaning, they are increasingly prone to be found at any position within an utterance, which means they are (relatively) independent from a syntactic point of view.

Fillers are often used to gain or play for time in the process of speech production, when the speaker is at a loss for how to continue or formulate her utterance, or cannot recall the appropriate words. They are then frequently combined into longer chains. At the same time, fillers are an important device that speakers use to manage turn-taking (blocking listeners from interrupting the speaker with a new speaking turn).


3.1 prostě {simply, just} and vole {man, (you) idiot}³

For our survey of lexical fillers, we picked the following expressions: prostě (originally an adverb meaning {simply, just}) and vole (originally the vocative case of vůl {ox} but currently predominantly used as a highly informal or even insulting form of address – {man, dude, (you) idiot}). These words clearly differ by their respective origins, but by way of their being frequently employed in spoken language, they lose their lexical meaning and drift away from the characteristics which had been imparted on them at earlier stages of the development of Czech, and which still persevere to a large extent in written language.

Regarding prostě, opinions vary concerning its function and part-of-speech classification – e.g. the standard dictionary Slovník spisovné češtiny pro školu a veřejnost (Institute of the Czech Language 1994) considers it to be only an adverb (in this sense, the closest English equivalent would be {unadornedly}), while according to Čermák et al. (2007) it is only a particle (roughly corresponding to {simply, just}, but more syntactically independent). In our study, we expect the loss of the original adverbial meaning and consider prostě as either a particle or a filler word (comparable by its ubiquitousness in the idiolect of some speakers to English {like}). The difference between these two interpretations is very slight and depends on the context of the utterance and other factors. As criteria for distinguishing between them, we selected:

• presence of a pragmatic and contact-establishing function (e.g. engaging the addressee, stressing the content of an utterance, reinforcing meaning, emphasis, summarization of previous information), often indicated by intonation → particle interpretation
• frequent repetition of the word in one utterance, co-occurrence with other fillers, shift of grammatical categories before or after the target word, gratuitous insertion into a noun or verb phrase → lexical filler interpretation

As to vole, the history of the word is a typical example of the evolution of a lexical item through conversationalization and semantic bleaching. Its original use was as a term of abuse but currently, it is more frequently employed as an interjection (Čermák 2001: 43; Čermák et al. 2007) or particle. As such, it is an expression of relief, disgust or surprise and can be found at the start of an utterance or in isolation, often preceded by ty {you.SG}, as in ty vole. Going one step further on the conversationalization axis, the word can also act as a lexical filler; in these cases, a “bare” vole is usually involved (no preceding ty) and it is often repeated or interspersed multiple times. The full pronunciation is [vole], but the word frequently undergoes formal reduction (elision, centralization of the first vowel): [voe], [vəe].

We can thus see a conversationalization continuum ranging from the original full semantics of a word (adverb in the case of prostě, noun in the case of vole) – its cognitive/basic/propositional/referential meaning – through particle uses exhibiting a distinct pragmatic function (discourse marker, interpersonal stance marker), to a lexical filler usage, which is mostly syntactically independent and serves the more primitive functions of turn-managing and establishing or maintaining contact.

3 The English equivalents are borrowed from the Frequency Dictionary of Czech Core Vocabulary (Čermák & Křen 2011).

3.2 Methodology

Our goal was to establish a reliable estimate of the frequencies/proportions of particle vs. lexical filler uses of prostě and vole in the ORAL2013 corpus, with respect to the sex and age group of speakers. As our criteria for distinguishing between the two uses were too semantically involved to allow automatic extraction from the corpus, we annotated a series of random samples from the concordances of prostě and vole by hand, computed the target proportions in each sample and estimated 95% confidence intervals for the true proportions in the entire corpus with a two-tailed t-distribution.⁴ The corpus being representative and balanced (Válková, Waclawičová & Křen 2012), we assume that proportions discovered in it will reflect real-life usage in a meaningful way.
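The estimation step described in this section can be sketched as follows. This is only an illustration under stated assumptions, not the authors' actual procedure: the sample proportions are invented, the names `T_CRIT_95` and `proportion_ci` are hypothetical, and the two-tailed 95% critical values of the t-distribution are hard-coded for small degrees of freedom rather than taken from a statistics library.

```python
from math import sqrt
from statistics import mean, stdev

# Two-tailed 95% critical values of Student's t for df = 1..10
# (standard table values, hard-coded to keep the sketch dependency-free).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}


def proportion_ci(sample_proportions):
    """Return (estimate, lower, upper): a 95% confidence interval for the
    true proportion, based on proportions observed in several random samples."""
    n = len(sample_proportions)
    if n < 2 or n - 1 not in T_CRIT_95:
        raise ValueError("this sketch supports 2 to 11 samples")
    centre = mean(sample_proportions)
    std_err = stdev(sample_proportions) / sqrt(n)  # standard error of the mean
    margin = T_CRIT_95[n - 1] * std_err
    return centre, centre - margin, centre + margin


# Hypothetical share of filler uses in five hand-annotated random samples.
filler_share = [0.58, 0.62, 0.60, 0.61, 0.59]
centre, low, high = proportion_ci(filler_share)
print(f"estimate {centre:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

The interval narrows as more samples are annotated or as the samples agree more closely, which is why a fully annotated concordance would need no interval at all.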

3.3 Results

3.3.1 prostě {simply, just}

The expression prostě is the 30th most frequent word in the ORAL2013 corpus (in terms of token, not type frequencies, as this corpus is not lemmatized). It totals 17,640 instances, which corresponds to a normalized (relative) frequency of 5369.03 instances per million (i.p.m.).⁵ Of these, 9611 instances were uttered by men and the remaining 8029 by women.

4 Had the concordances for both expressions been shorter, we would have annotated them by hand in their entirety, obtaining the true corpus proportions directly, without the need for taking random samples and estimating confidence intervals.

5 The expression prostě ranks as the 25th most frequent lemma in the ORAL series corpora taken as a whole. It totals 28,781 instances, which corresponds to a normalized frequency of 4839.32 i.p.m.
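The normalized frequencies quoted here can be checked with a short calculation. The sketch below assumes that the denominator is the corpus size in positions as listed in Table 1 (an assumption, though it is the reading that reproduces the reported figures); the function name `ipm` is introduced purely for illustration.

```python
def ipm(count, corpus_size):
    """Normalized frequency: instances per million corpus positions."""
    return count / corpus_size * 1_000_000


# Corpus sizes in positions, taken from Table 1.
ORAL2013 = 3_285_508
ORAL_SERIES = 1_312_282 + 1_349_536 + ORAL2013  # ORAL2006 + ORAL2008 + ORAL2013

print(round(ipm(17_640, ORAL2013), 2))     # prostě in ORAL2013
print(round(ipm(28_781, ORAL_SERIES), 2))  # prostě in the whole ORAL series
```

Under this assumption the two calls give 5369.03 and 4839.32 i.p.m., matching the values reported in the text and in footnote 5.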


Figure 1 shows the proportion of realizations of prostě employed as a lexical filler vs. as a particle by men and women, as estimated for the corpus population based on samples (see Sec. 3.2). In 60% of the cases for men and 53% for women, prostě occurs as a particle, i.e. it retains its pragmatic and/or contact-establishing function.

Figure 1. Proportions of instances of the expression prostě (simply, just) as a filler vs. non-filler with respect to different sociological variables (left); corresponding absolute counts with 95% confidence intervals (right).

Several examples: ; . The particle summarizes the content of the preceding context, explicitly tagging the take-away message (“to put it shortly/simply”). The pause – signaled in transcription by a full stop mark – seems to be a typical collocation candidate.

On the other hand, with both sexes, the word occurs more frequently as a filler with speakers from the younger generation, though the exact ratios to the older generation differ. This indicates that the younger generation is more inclined towards the redundant, conversationalized use of the word. Again, some examples: (a disjointed sequence of words with little actual propositional content); (prostě co-occurring with other filler words); (redundant repetition of the word prostě); (repeated parasitic insertion of the word prostě between closely syntactically related clause elements).


3.3.2 vole {man, (you) idiot}

In the ORAL2013 corpus, the expression vole appears at the 110th place of the token frequency list (3552 instances in total, corresponding to a normalized frequency of 1081.11 i.p.m.).⁶ A total of 242 realizations are by women, and the remaining overwhelming majority of 3310 cases come from male speakers. This disproportion of the occurrences between the two sexes (see also Čermák 2001) may have been influenced by the original semantics of the word as a term of abuse. Another plausible interpretation would be that the word is predominantly used in exclusively male social situations, but this hypothesis is hard to explore given the way that our present corpus data are annotated.

For women, in 72% of the cases, the pragmatic (contact-establishing and/or emotional) function of the expression vole is preserved, much as with the word prostě; it thus does not become a mere filler. Looking at the distribution of vole as a filler based on sex and age in Figure 2, we see that younger women tend to use vole as a filler more frequently than older women, but it is still a minority of instances. For men, on the other hand, the filler function is clearly dominant (only 27% of the instances are not fillers).

Figure 2. Proportions of instances of the expression vole (man, (you) idiot) as a filler vs. non-filler with respect to different sociological variables (left); corresponding absolute counts with 95% confidence intervals (right; no CIs were computed for women, because there were so few examples that all were annotated).

6 The expression vole ranks as the 104th most frequent token in the ORAL series corpora taken as a whole. It totals 5,579 instances, which corresponds to a normalized frequency of 938.07 i.p.m. This vocative case form of the lemma vůl accounts for 97.7% of its occurrences.


It would be worthwhile to ascertain whether the use of the term is also influenced by who is talking to whom (e.g. by whether the dialog is between speakers of the same sex and generation, or speakers who differ with respect to these characteristics), or perhaps by the speaker’s attained level of education. All of the possibilities for future research mentioned above are predicated on the assumption that the word vole, despite its original meaning having been considerably bleached, is still considered a stylistically marked means of expression.

4. Formal Reductions

Informal conversation usually involves a lot of reduction of forms, i.e. phonetic erosion (see Mitterer 2008). The most typical of these elisions are captured in the transcripts of the ORAL series corpora as different token variants. The lexemes thus treated are frequently occurring words, at least disyllabic. Some of them even have a shortened lexicalized variant which can be found in works of fiction, employed in parts of the text emulating spoken language: for instance, čéče {man} [t͡ʃeːt͡ʃe] (in the vocative case used for addressing people) instead of the full careful pronunciation člověče [t͡ʃlovjet͡ʃe].

The words protože {because}, člověče {man, dude} and vole {man, (you) idiot} were selected for detailed analysis. They can all be found in written language as well, and in spoken language, all of them have several variants, the formally reduced ones being neither overly regionally specific nor affected by additional processes, either phonological or morphological.

4.1 protože {because}

In both spoken and written language, the expression protože {because} acts as a causal connective.⁷ In written language, it is the 84th most frequent word (see Čermák & Křen 2004, according to whom 44% of the instances come from works of fiction); in the synchronic written corpus of Czech SYN2010, it has a normalized frequency of 825 i.p.m. In spoken language, it is even more frequent: 2953 i.p.m. based on the ORAL2013 corpus (42nd place, with all variants pooled). In terms of frequency, it occupies 7th place among conjunctions in the spoken corpus and is the highest-ranking trisyllabic conjunction.

The written corpus SYN2010 features only the full variant (100,352 instances). By contrast, the spoken corpus ORAL2013 yields 10 variants (9709 instances); Figure 3 represents their counts and how they relate to each other. In informal conversation, reduced variants dominate. For a summary of speaker social group preferences, see Table 2.

Figure 3. Proportions of different phonetic realizations of the lexeme protože {because} in the ORAL2013 corpus.

7 The expression was originally a multiword conjunction whose component words coalesced in writing over time (see Bauer 1960: 289–291). Even in printed texts from as recently as the 19th century, the spelling was not yet stabilized, as evidenced by the variation in the DIAKORP corpus of historical Czech texts.

Table 2. Three most frequent variants of lexemes protože {because} and člověk {man} in ORAL2013, and the social groups for which they are typical. A form was considered typical for a group if its relative frequency (i.p.m.) for that group was at least 15% higher than for any other group.

variant    counts   sex      age group       education
protože    3974     women    over 35 y.o.    NA
prtože     2910     women    over 35 y.o.    NA
prže       1580     NA       under 35 y.o.   tertiary
člověk     852      NA       under 35 y.o.   tertiary
čověk      483      NA       under 35 y.o.   tertiary
čovek      84       men      NA              non-tertiary
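The typicality criterion stated in the caption of Table 2 can be expressed as a small decision rule. The sketch below is one possible reading of that criterion, with invented frequencies; the function name `typical_group` and its `margin` parameter are hypothetical and do not come from the source.

```python
def typical_group(ipm_by_group, margin=0.15):
    """Return the group whose relative frequency (i.p.m.) exceeds that of
    every other group by at least `margin` (15% by default), else None."""
    for group, freq in ipm_by_group.items():
        others = [f for g, f in ipm_by_group.items() if g != group]
        if others and all(freq >= (1 + margin) * f for f in others):
            return group
    return None


# Hypothetical relative frequencies (i.p.m.) of one variant by speaker sex.
print(typical_group({"women": 1200.0, "men": 1000.0}))
print(typical_group({"women": 1050.0, "men": 1000.0}))
```

A result of `None` would correspond to an NA cell in the table: neither group leads by the required margin for that factor.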

4.2 člověče {man, dude}

The second lexeme selected for analysis is člověk {man}, which in spoken interactions occurs primarily in one of the two following functions: either as a generic animate subject or object, i.e. an indefinite pronoun, or as the fossilized vocative case form člověče {man, dude}, whose role is predominantly pragmatic; in some cases, one may even start considering the option of interpreting it as a lexical filler.


The distribution of the transcription variants of the lexeme člověk in the ORAL2013 corpus is shown in Figure 4. The individual variants represent the whole paradigm. It is clear from the chart that the full pronunciation and the various reduced variants combined are comparable in terms of frequency, and that as in Figure 3, the full realization is most frequent when variants are considered separately. Sociolinguistic usage tendencies are summarized in Table 2.

Figure 4. Proportions of different phonetic realizations of the lexeme člověk {man} in the ORAL2013 corpus.

The expression vole {man, (you) idiot}, analyzed in Sec. 3.3.2, has a similar pragmatic function, also conveyed by the vocative case; it is captured in transcription in reduced forms as well ([voe], [ve]), such reduction being one of the side effects of conversationalization mentioned in Sec. 1.1. As Table 3 shows, unlike člověče, vole is used increasingly frequently in informal conversations, its vulgar connotations dwindling (see also Čechová, Krčmová & Minářová 2008: 66; another possible explanation could invoke subtle changes in the methodology of data collection, i.e. in the recording conditions or guidelines, or in the sociolinguistic distribution of speakers). The rows marked PMK N and BMK N indicate subcorpora obtained by selecting strictly non-formal recordings from the PMK and BMK corpora. The use of formally reduced variants should be considered as one of the criteria signaling the process of semantic bleaching, which leads to the expression becoming a mere lexical filler.

Table 3. Relative frequencies (i.p.m.) of člověče {man} and vole {man, (you) idiot} in various corpora of spoken Czech, ordered chronologically from top to bottom (see Tab. 1 for details). For definition of the PMK N and BMK N subcorpora, see text.

corpus     člověče (i.p.m.)   vole (i.p.m.)
PMK N      190                25
BMK N      149                396
ORAL2006   126                521
ORAL2008   118                1000
ORAL2013   174                1081


5. [v]-prothesis

[v]-prothesis affects certain words beginning in [o] and is originally a dialect feature of the west of the Czech lands (Short 1993: 529). However, since the 19th century National Revival, when variants without prothetic [v] were selected for the newly coined language standard and thus gained prestige, it has become associated with popular or even vulgar speech, especially for speakers of eastern dialects who do not use it themselves. It is therefore currently also a stylistic marker signaling a lower register, and potentially stigmatizing, depending on the context in which it is utilized.

The data examined here consist of a sociolinguistically annotated (see Sec. 2) concordance of words beginning in either [vo-] or [o-] from the ORAL2013 corpus (41,183 and 36,820 instances in total, respectively). For accuracy, note that not all Czech words starting in [o-] are equally likely to undergo [v]-prothesis (Sgall 2011: 211). To complicate matters further, words exist that begin in [vo-] where the [v] did not arise by prothesis (vosa {wasp} < Proto-Indo-European *wobʰseh₂); worse, some of them have unrelated [v]-less counterparts representing different lexemes (osa {axis}). Last but not least, [v]-prothesis can also occur on a prefixed root ( {don’t argue}). Overall, this means that our rudimentary corpus-mining method for this exploratory survey has both lower precision (some known spurious items are included) and lower recall (some known legitimate items are excluded) than would be ideal, but we assume that these adverse effects will balance out over the sociolinguistic categories of speakers we examine. It would nevertheless be interesting in follow-up work to devise a more stringent extraction method and see whether results are affected in any meaningful way.

As shown in Figure 5, despite our coarse data collection strategy, the expected gradual decrease of [v]-prothesis incidence from west to east is well documented in our corpus material. A χ² test for independence reveals this correlation as highly significant (χ² = 22,463.63; df = 8; p < 0.001) and, as expected based on a visual inspection of the chart, its effect is quite strong (Cramér’s V = 0.54). According to Balhar et al. (2005: 371), which is based on dialectological surveys from the 1960s to the 1980s, the [v]-prothesis isogloss runs from north to south through Central Moravia, i.e. the region which in Figure 5 exhibits clear signs of being a transition area between the two tendencies, showing that in this regard, current speech patterns still correspond to earlier models. At the same time, the distribution in western regions contradicts other researchers’ opinions that [v]-prothesis might be receding in its native west (Hoffmannová 2011).
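The relation between the χ² statistic and Cramér’s V reported in this section can be illustrated with a short calculation. The sketch below is not the authors' code: the function names are hypothetical, the 2 × 2 table holds invented counts, and the figure of nine regions is an assumption inferred from df = 8 for a table with two columns ([vo-] vs. [o-]).

```python
from math import sqrt


def chi_square(table):
    """Pearson's chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V effect size for an n_rows x n_cols table with n observations."""
    return sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


# Toy 2 x 2 table (invented counts): rows = regions, columns = [vo-] vs. [o-].
toy = [[10, 20], [20, 10]]
toy_chi2 = chi_square(toy)
print(round(toy_chi2, 3), round(cramers_v(toy_chi2, 60, 2, 2), 3))

# Values reported in the text: 41,183 + 36,820 tokens and chi-square 22,463.63;
# nine regions are assumed from df = 8 with two columns.
n_tokens = 41_183 + 36_820
print(round(cramers_v(22_463.63, n_tokens, 9, 2), 2))
```

With the reported χ² value and the combined token count, the last line reproduces the effect size of 0.54 given in the text, which suggests that V was computed over all 78,003 concordance lines.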

Avenues for Research on Informal Spoken Czech


Figure 5. Proportions of regional usage of [v]-prothesis. Regions arranged roughly from west to east, starting from left.

Interestingly, the remaining readily available sociological factors (sex, age group, education level) play next to no role in the distribution of [v]-prothesis when compared to the effect of region. True, according to a χ² test, their influence is significant (p < 0.001), but given that large sample sizes are biased towards significance, this is not surprising. More importantly, their effect size is negligible (correlation coefficient ɸ = 0.033, 0.053 and 0.082 respectively), which indicates that there is little actual difference in usage among speakers when grouped based on these criteria. A recent investigation of [v]-prothesis is due to Chromý (2015), who offers a detailed, in-depth discussion of the topic, focusing on various intralinguistic and sociolinguistic factors influencing the occurrence of the phenomenon in the speech of Prague-born speakers between 20 and 30 years of age. Instead of relying on available corpus data, Chromý chose to collect his own, which has its advantages, but also some drawbacks. On the one hand, the data were annotated with [v]-prothesis as the target research task in mind, which means the coding is likely to be more reliable in this regard than in the case of a general transcript without a particular focus. Additionally, the group of participants selected is quite homogeneous by design and all are recorded under the same conditions, which establishes good grounds for valid generalizations based on the results. On the other hand, Chromý (2015) is perhaps somewhat overzealous in his generalizations: to claim as he does by the wording of his title that his conclusions hold for the “Prague vernacular” as a whole is somewhat misleading, since all of his subjects come from a relatively narrow age range, and what is more, they are all


P. Klimešová, Z. Komrsková, M. Kopřivová, D. Lukeš

university students or graduates (24). In other words, the conclusions that can be drawn from this sample should be limited to the fairly narrow sociological group it represents. Another concern is that overt instead of covert recording may warp linguistic behavior, even though Chromý’s argument that targeting friends of his recording associate as subjects led to a successful elicitation of casual speech is basically sound (25). In spite of this, Chromý’s (2015) study is very rigorous, especially from the point of view of the statistical apparatus employed. It would therefore be a valuable contribution to replicate his procedure using more extensive and more sociologically varied spoken corpus data, both confronting and hopefully extending the results he obtained. A super-corpus encompassing all ORAL series corpora plus some additional previously unreleased data (over 5M tokens in total) is currently in preparation, and it will include some form of lemmatization (Lukeš et al. 2015), which should allow for more exact extraction of instances of [v]-prothesis by confrontation of word forms against ([v]-less) lemmas; this seems like a promising material basis for an attempt at said replication. In particular, the availability of data from a wider age range would make it possible to reexamine Chromý’s tentative conclusion that [v]-prothesis in Prague is on the decline, based on a comparison with an earlier study by Jančák (1974). He rightly and candidly relativizes this claim by pointing to the various inconsistencies between the two data sets (Jančák recorded teenagers, in settings which may have been somewhat more informal; Chromý 2015: 37). By contrast, the data in the future ORAL corpus, which cover a wide age span, would make it possible to investigate this potential generational shift consistently by leveraging the apparent time construct (Bailey et al. 1991), which states that the comparison of linguistic variation across generations in a synchronic snapshot of speech can be used as a proxy for assessing the language evolution that took place in the time span that separates them.⁸
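The idea of confronting word forms against [v]-less lemmas can be sketched in a few lines. This is only an illustration under stated assumptions: the function `has_prothetic_v` is hypothetical, lemmas are assumed to be given in their standard [v]-less form, the pairs with osm and opravdu are borrowed from examples attested elsewhere in this volume, and voda is our own example of an etymological [v]. Prothesis on a prefixed root is deliberately not covered.

```python
def has_prothetic_v(word: str, lemma: str) -> bool:
    """Flag a form that starts in vo- while its lemma starts in o-.

    Simplification: only word-initial prothesis is detected; a prothetic
    [v] after a prefix would need morphological segmentation.
    """
    w, l = word.lower(), lemma.lower()
    return w.startswith("vo") and l.startswith("o")


# (word form, lemma) pairs; the lemma is assumed to be the standard form.
pairs = [
    ("vosm", "osm"),          # prothetic [v]
    ("vopravdu", "opravdu"),  # prothetic [v]
    ("osm", "osm"),           # no prothesis
    ("voda", "voda"),         # etymological [v], lemma also starts in vo-
]

flags = [has_prothetic_v(w, l) for w, l in pairs]
print(flags)  # [True, True, False, False]
```

Compared with a plain concordance of forms in [vo-] and [o-], such a check excludes words whose [v] is not prothetic, because their lemma starts in [vo-] as well.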

8 Oddly enough, Chromý (2015) also invokes the apparent time construct to justify his comparison with Jančák’s data (36–37), even though there is no need for it in his case, because his two data sets are truly diachronic. The apparent-time hypothesis is only useful insofar as it makes up for unavailable historical data by treating the speech of the older generation in a synchronic snapshot as a proxy for them.

6. Conclusion The examples of lexical fillers analyzed in Section 3 can serve as a good illustration of how important it is to investigate spoken language and try to reassess prior findings (e.g. on part-of-speech classification) based primarily on more narrowly defined written language. At the same time, our survey indicates a need for more detailed situational metadata in corpora, to verify e.g. the hypothesis that is characteristic of male-to-male communication. With respect to Section 4, it is unfortunate that with the exception of the words used for this exploratory survey, formally reduced variants are mostly not annotated in any special way in the available corpora, which means that it is impossible to search for them and quantify their occurrence straightforwardly. This is regrettable, because formal attrition itself can be an indicator of semantic bleaching in progress, which can ultimately lead to a word being used as lexical filler, as e.g. with . This shortcoming should be partly relieved by the upcoming ORTOFON corpus, which is currently being built at the Institute of the Czech National Corpus and which will feature a separate tier containing a complete phonetic transcription. As for Section 5, as already mentioned, it would be worthwhile in future work to refine the process of acquisition of data on [v]-prothesis. Another promising avenue is to explore additional intricacies in the relationships between individual sociological factors influencing the variable’s distribution, uncovering potential interactions, e.g. via logistic regression. The methodology employed by Chromý (2015) is a great inspiration in this regard.

Acknowledgements This research was made possible by the Programme for the Development of Fields of Study at Charles University, No. P11 Czech national corpus, sub-programme Czech national corpus.

References Auer, P. 2009. “On-Line Syntax: Thoughts on the Temporality of Spoken Language.” Language Sciences 31, 1–13. Bailey, G., Wikle, T., Tillery, J. & L. Sand. 1991. “The Apparent Time Construct.” Language Variation and Change 3(3), 241–64. doi:10.1017/S0954394500000569. Bauer, J. 1960. Vývoj Českého Souvětí. Nakladatelství Československé Akademie Věd. Čechová, M., Krčmová, M. & E. Minářová. 2008. Současná stylistika. Praha: Nakladatelství Lidové noviny. Retrieved from: http://www.databazeknih.cz/knihy/ soucasna-stylistika-124986. Čermák, F. 2001. “Já Vůl, Ty Vole/ty Jsi Vůl, to Je Vůl….” Čeština Doma a ve Světě 9, 42–44.


Čermák, F. 2009. “Spoken Corpora Design. Their Constitutive Parameters.” International Journal of Corpus Linguistics 14(1), 113–23. Čermák, F. & M. Křen. (eds.) 2004. Frekvenční Slovník Češtiny [Frequency Dictionary of Czech]. Praha: Nakladatelství Lidové noviny. Čermák, F. & M. Křen. (eds.) 2011. Frequency Dictionary of Czech Core Vocabulary for Learners. London: Routledge. Čermák, František et al. (eds.) 2007. Frekvenční Slovník Mluvené Češtiny [Frequency Dictionary of Spoken Czech]. Praha: Karolinum. Chromý, J. 2015. “Vliv jazykových faktorů na užívání protetického v- v pražské mluvě.” Slovo a slovesnost 76, 21–38. Čmejrková, S. & J. Hoffmannová. 2011. Mluvená Čeština: Hledání Funkčního Rozpětí. Praha: Academia. Georgakopoulou, A. & D. Goutsos. 2004. Discourse Analysis: An Introduction. Edinburgh: Edinburgh University Press. Hoffmannová, J. 2011. Mluvená Čeština v Zrcadle ,,psané Konverzace“ Na Chatu [Spoken Czech in the Light of ‘Written Conversation’ in Web Chats].” In S. Čmejrková & J. Hoffmannová (eds.), Mluvená Čeština: Hledání Funkčního Rozpětí [Spoken Czech: In Search of the Functions It Spans], 393–408. Praha: Academia. Institute of the Czech Language (ed.) 1994. Slovník Spisovné Češtiny pro Školu a Veřejnost: S Dodatkem Ministerstva Školství, Mládeže a Tělovýchovy České Republiky. 2nd Revised and Updated Ed. Praha: Academia. Jančák, P. 1974. “Frekvence Hlavních Hláskoslovných Znaků v Mluvě Pražské Mládeže.” Naše Řeč 57, 191–200. Kučera, K. & M. Stluka. 2011. DIAKORP: Diachronní Korpus, Verze 5 Z 21. 2. 2011. Praha: Ústav Českého Národního Korpusu FF UK. Retrieved from: http://www. korpus.cz. Lukeš, D, Klimešová, P., Komrsková, Z. & M. Kopřivová. 2015. “Experimental Tagging of the ORAL Series Corpora: Insights on Using a Stochastic Tagger.” In Proceedings of Text, Speech, Dialogue 2015. McMahon, A. M. S. 1994. Understanding Language Change. Cambridge: Cambridge University Press. Meillet, A. 
1912.“L’évolution Des Formes Grammaticales.” Rivista Di Scienza 12 (26). Miller, J. & R. Weinert. 1998. Spontaneous Spoken Language: Syntax and Discourse. Oxford: Clarendon Press. Mitterer, H. 2008. “How Are Words Reduced in Spontaneous Speech?” In Proceedings of the ISCA Tutorial and Research Workshop on Experimental Linguistics, 165–168. Athens: University of Athens.


Rieger, C. L. 2001. “Idiosyncratic Fillers in the Speech of Bilinguals.” In Proceedings of DiSS’01, 81–84. Schiffrin, D. 1987. Discourse Markers. Cambridge: Cambridge University Press. Sgall, P. 2011. Jazyk, Mluvení, Psaní [Language, Speech, Writing]. Praha: Karolinum. Short, D. 1993. “Czech.” In B. Comrie and G. G. Corbett (eds.), The Slavonic Languages, 455–533. London: Routledge. Válková, L, Waclawičová, M. & M. Křen. 2012. “Balanced Data Repository of Spontaneous Spoken Czech.” In Proceedings of LREC 2012, 3345–3349. Zipf, G. K. 1929. Relative Frequency as a Determinant of Phonetic Change. Harvard Studies in Classical Philology 40, 1–95. doi:10.2307/310585.

Corpora Benešová, L., Křen, M. & M. Waclawičová. 2013. ORAL2013: Reprezentativní Korpus Neformální Mluvené Češtiny [ORAL2013: A Representative Corpus of Informal Spoken Czech]. Praha: Ústav Českého národního korpusu FF UK, Praha. Available at: http://www.korpus.cz. Čermák, F., Adamovičová, A. & J. Pešička. 2001. PMK (Pražský Mluvený Korpus): Přepisy Nahrávek Pražské Mluvy Z 90. Let 20. Století. Praha: Ústav Českého národního korpusu FF UK. Available at: http://www.korpus.cz. Hladká, Z. 2002. BMK (Brněnský Mluvený Korpus): Přepisy Nahrávek Brněnské Mluvy Z 90. Let 20. Století. Praha: Ústav Českého národního korpusu FF UK. Available at: http://www.korpus.cz. Kopřivová, M & M. Waclawičová. 2006. ORAL2006: Korpus Neformální Mluvené Češtiny. Praha: Ústav Českého národního korpusu FF UK. Available at: http:// www.korpus.cz. Křen, M., Bartoň, T., Cvrček, V., Hnátková, M., Jelínek, T., Kocek, J., Novotná, R. et al. 2010. SYN2010: Žánrově Vyvážený Korpus Psané Češtiny. Praha: Ústav Českého národního korpusu FF UK, Praha. Available at: http://www.korpus.cz. Waclawičová, M, Kopřivová, M., Křen, M. & L. Válková. 2008. ORAL2008: Sociolingvisticky Vyvážený Korpus Neformální Mluvené Češtiny. Praha: Ústav Českého národního korpusu FF UK. Available at: http://www.korpus.cz.

Alexandr Rosen

Charles University in Prague

Introducing a corpus of non-native Czech with automatic annotation¹ Abstract: A learner corpus can be annotated with linguistic categories, target hypotheses and error labels. We show that useful results can be achieved even for non-native Czech by applying methods and tools developed for standard language. The corpus includes more than 8.6 thousand short essays, totalling nearly one million words. First, the texts are processed by a tagger and lemmatizer. Then, a stochastic spelling and grammar checker is used to propose correct forms for non-words and some incorrect ‘real words’. The precision of this step is above 80%. The corrected texts are tagged again. Original and corrected forms are compared and error labels, based on criteria applicable in a formally specifiable way, are assigned. The metadata include, among other things, the author’s sex, age, first language, CEFR level of proficiency in Czech, and the task’s time limit and topic. The corpus is available on-line via a search interface or for download. Keywords: Learner corpus, Czech, non-native language, error annotation, grammar checker

1. Introduction Texts in a learner corpus can be annotated in two independent ways: (i) by standard linguistic categories: morphosyntactic tags, base forms, syntactic structure and functions, and (ii) by error annotation: corrected word forms (target hypotheses), and categories specifying the nature of errors. Reasonably reliable methodologies and tools are available for linguistic annotation (i) of many languages, as long as the text is produced by native speakers. The situation is different for non-standard language of non-native learners and for error annotation (ii), where manual annotation is quite common. However, with the growing volumes of learner corpora, the need for methods and tools simplifying such tasks is increasing. In Section 2 we provide a glimpse of the landscape of learner language annotation. Then, after an overview of existing learner corpora of Czech, including a corpus of texts written by non-native learners of Czech in Section 3, we show that useful results can be achieved by applying tools developed for standard language. The core part of this contribution (Section 4) is concerned with the CzeSL-SGT corpus (Czech as a Second Language with Spelling, Grammar and Tags), which includes transcripts of essays hand-written by (mostly young) non-native learners of Czech in 2009–2013. The corpus includes about 8.6 thousand texts by nearly two thousand native speakers of 54 languages; altogether about 1 million words (for details about the corpus content see Table 4). Most texts are equipped with metadata (Section 4.1). Word forms are tagged by word class, morphological categories and base forms (lemmas). Forms detected as incorrect (including some real-word errors) are corrected by a stochastic spelling and grammar checker and the resulting texts are tagged again. Original and corrected forms are compared and error labels are assigned, based on criteria applicable in a formally specifiable way. All the annotation is assigned automatically (see Section 4.2). The corpus is available either for on-line searching using the search interface of the Czech National Corpus (http://korpus.cz), or for download from the LINDAT data repository (http://www.lindat.cz), see Section 4.3. Automatic corrections have been evaluated using an existing manually annotated subset of the corpus (the manual annotation includes a target hypothesis about the form, see Section 5). Finally, in Section 6 we discuss some challenging aspects of the corpus and its annotation and outline prospects for its development and use.

1 The paper reports on the CzeSL-SGT corpus, which was built from texts collected within the ESF project CZ.1.07/2.2.00/07.0119 and the Charles University PRVOUK P10 funding program. Work on the corpus itself was supported by the Ministry of Education of the Czech Republic as a part of the Czech National Corpus project LM2011023.

2. Automating the annotation of learner texts Some annotation tools designed for native language, such as taggers, lemmatizers, parsers, and spelling and grammar checkers, can be applied to original learner texts or to their corrected versions, although the success rate depends on how much the texts deviate from the standard language. On the other hand, the task of categorizing errors is usually a manual exercise and its methodology is far from established. Some error taxonomies have a more prominent position (Dagneaux et al. 2008; Granger et al. 2002), but there are quite a few other annotation schemes used in practice (for an overview see Štindlová 2013: 71f). Moreover, such taxonomies often assume a target hypothesis and even if motivated mainly by formal, grammar-based criteria, they are designed for a human annotator. For example, the error taxonomy in Rosen et al. (2014) includes errors of two types: (i) non-words, i.e. word forms which are incorrect with respect to literary Czech in any context, and (ii) real-word errors, i.e. word forms identifiable as incorrect only in a specific context. Both types are subdivided into more detailed categories and subcategories; see Table 1 and Table 2.² Only the categories printed in boldface can be detected automatically, but the system still assumes that a target hypothesis or a more general category (complex verb) is established manually.

Table 1. A taxonomy of non-words.

| Category | Subcategories |
| --- | --- |
| incorrect form | inflection; stem; other |
| foreign word, coinage | coined Czech word; foreign word; inflected foreign word |
| word boundary | split prefix, joined preposition; wrongly split/joined compound; other |

Table 2. A taxonomy of real-word errors.

| Category | Subcategories |
| --- | --- |
| agreement | |
| government | |
| pronominal reference | |
| reflexive form | |
| complex verb | analytical; modal verb; copula |
| negation | |
| redundant word | |
| missing word | |
| word order | |
| lexis, idiom | |
| misused grammar category | |
| incurred error | |
| word salad | |

Despite the daunting complexity of assigning such categories with an automatic tool, automatic annotation of learner texts is still a realistic task. In addition to the option of using standard methods and tools designed for native language, applications developed specifically to process learner language are now also available, including intelligent tutoring systems (e.g. Dickinson & Herring 2008; Levy et al. 2014), automated scoring in language testing (http://www.ets.org; Shermis & Burstein 2013), and annotation of learner texts, both linguistic annotation (Dickinson & Ragheb 2009; Nagata et al. 2011; Krivanek & Meurers 2014) and error annotation (Leacock et al. 2010, 2014; Díaz-Negrillo et al. 2013; Gamon et al. 2013a,b; Ng et al. 2013, 2014; Junczys-Dowmunt & Grundkiewicz 2014). However, the efforts listed above are focused on English and error annotation is limited to error correction.³

2 An additional error category concerns inappropriate register and style. In comparison to other taxonomies the sample taxonomy may seem rather coarse-grained. However, it does not need to specify details about the individual forms because it assumes morphosyntactic annotation of the text.
3 Errors in the training data for the 2014 CoNLL Shared Task were classified into 28 types (Ng et al. 2014), but the task was to correct the text, not to assign error labels.


3. Learner Corpora of Czech Czech is one of the three languages of Merlin, a learner corpus of Czech, German, and Italian (http://www.merlin-platform.eu; Boyd et al. 2014). The main goal of the 2012–2014 project was to build a platform matching the standard proficiency levels of CEFR (Common European Framework of Reference) with language phenomena specific to each level. The corpus includes texts consisting of about 80 thousand word tokens at CEFR levels A1–C1. It is tagged, parsed, on-line searchable and includes rich metadata. AKCES, the Acquisition Corpora of Czech (http://akces.ff.cuni.cz; Šebesta 2012), is an umbrella project aimed at building written and spoken language resources on the acquisition of Czech by both non-native and native learners. The project also maps the Roma ethnolect of Czech (Eckert 2015). Table 3 shows the currently available AKCES corpora. Table 3. Available AKCES corpora. Searchable47

Written Native

Spoken

Non-native

Written Roma

Spoken

Downloadable48 # tokens Note school essays, age SKRIPT 2012 AKCES 1 0.7M 11–19 transcripts, class SCHOLA 2010 AKCES 2 1.0M interactions, age 6–19 essays, also Roma ethnolect, age 9–76, CzeSL-plain AKCES 3 2.3M also non-native bachelor theses CzeSL-SGT AKCES 5 1.1M automatic annotation subset of CzeSL-plain, CzeSL-MAN49 0.3M manual annotation subset of CzeSL-plain AKCES 4 0.3M (rom) audio and transcripts, ROMi 1.0 1.5M various environments, age 12–28

In the following, we focus on the non-native texts of AKCES, i.e., on its part called CzeSL (Czech as a Second Language).⁴ CzeSL is a collection of transcribed essays, hand-written by students of Czech on various occasions as a part of the learning process. For a basic overview of the scope of CzeSL see Table 4. Most texts are equipped with metadata about the author and the task.⁵ The first languages (L1) of the learners are varied – most of them belong to the Slavic group (65%, mainly Russian, Ukrainian and Polish), followed by non-Indo-European languages (20%, mainly Vietnamese, Chinese and Arabic). Other Indo-European languages (German, English, French) constitute about 10%. The distribution of texts according to CEFR and the L1 groups is shown in Table 5.

4 For historical and technical reasons, the CzeSL-plain and CzeSL-man corpora also include the Roma ethnolect, and CzeSL-plain an additional part consisting of Bachelor theses authored by non-native students.

Table 4. The CzeSL corpus – sizes and proportions.

| Property | Value |
| --- | --- |
| Number of texts | 8.6K |
| Number of sentences | 111K |
| Number of words | 958K |
| Number of tokens | 1,148K |
| Number of authors | 1,965 |
| Number of native languages | 54 |
| Proficiency levels | A1–C2 |
| Age of the authors | 9–76 |
| Share of women/men (in the number of words) | 5/3 KW |
| Number of words per text | 100–200 |

5 Full metadata are currently available only in the CzeSL-SGT corpus. See http://utkl.ff.cuni.cz/~rosen/public/sgt_counts_by_meta_en.html for the complete statistics.


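Table 5. The CzeSL corpus – number of texts by language groups and CEFR levels.

| CEFR level | Slavic | Indo-European | Non-Indo-European | Unknown | Total |
| --- | --- | --- | --- | --- | --- |
| A1 | 1783 | 199 | 622 | 5 | 2609 |
| A1+ | 283 | 21 | 11 | 0 | 315 |
| A2 | 1348 | 269 | 480 | 1 | 2098 |
| A2+ | 403 | 54 | 113 | 0 | 570 |
| B1 | 929 | 195 | 357 | 0 | 1481 |
| B2 | 523 | 115 | 107 | 0 | 745 |
| C1 | 82 | 17 | 24 | 0 | 123 |
| C2 | 0 | 1 | 0 | 0 | 1 |
| Unknown | 291 | 27 | 33 | 324 | 675 |
| Total | 5642 | 898 | 1747 | 330 | 8617 |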

The texts are anonymized by replacing personal names with appropriate forms of Adam and Eva. Names of smaller places (streets, villages, small towns) and other potentially sensitive data are replaced by QQQ. Unreadable characters or words are transcribed as XXX. (For more details about the CzeSL corpus see http://utkl.ff.cuni.cz/learncorp/; Štindlová et al. 2013; Rosen et al. 2014; Meurers 2015.)

4. An automatically annotated learner corpus – CzeSL-SGT The CzeSL-SGT corpus (Czech as a Second Language with Spelling, Grammar and Tags) is coextensive with the strictly non-native part of CzeSL. Texts from the “foreign” part of CzeSL-plain (ciz), collected in 2009–2011, are extended by texts collected in 2013. The transcription markup, encoding some properties of the original manuscripts and preserved e.g. in CzeSL-MAN, is discarded. Instead, the final edits of the author are respected.

4.1 Metadata Most texts are equipped with metadata about the author and the text, available in Czech and English. The Czech National Corpus site offers the Czech version, while the LINDAT data repository offers the entire corpus with the English version. There are 15 items about the author, such as sex, age, L1, CEFR level of proficiency in Czech, duration and method of study, length of stay in the Czech Republic or knowledge of Czech among family members. A further 15 items concern the task and the text, such as date, time limit, word count, topic, genre, whether a dictionary or textbook was allowed, or whether the text is a part of an exam.⁶ Most authors (79%) have written more than one text. Some or even all items may be missing for some texts: identification of the author is present in 96.7% of texts, the first language in 96.3% of texts.

4.2 Annotation If a word form in the original input text is recognized by a standard morphological analyzer (Hajič 2004), it is tagged by word class, morphological categories and base forms (lemmas). We use Morče, a standard Czech tagger (Votrubec 2005, 2006), trained on native language (the Prague Dependency Treebank, see Hajič 1998). Its success rate varies by text and deteriorates with the amount of deviations from standard Czech (its reported results on native text are 95–96%). For native texts in the Czech National Corpus, the tagger is combined with a rule-based module (Petkevič 2006), but experiments have shown that for non-native texts the rules, assuming correct grammatical structures, increase the error rate. In parallel to the tagging task, the input text is corrected by Korektor, a spelling and grammar checker, combining rule-based morphology with stochastic language and error models (Richter 2010; Richter et al. 2012). For annotating the current version of CzeSL-SGT, the language model was trained on a corpus of native texts collected from the web and the error model on a small custom-built corpus.7 The tool corrects not only unrecognized word forms (non-words) but also some forms which are incorrect within a given context (real-word errors). From the resulting n-best ranked suggestions with a correction type (spelling or grammar) only the first option is used. However, the present implementation of Korektor cannot insert or delete word boundaries (split or join word forms), which is one of the more frequent error types in learner texts. The corrected text is tagged and lemmatized again. Original and corrected forms are compared and error labels, based on applicable formal criteria, are assigned (Jelínek et al. 2012). In the resulting annotation each token is labelled by the following attributes:

6 For a more technical description of the corpus see http://utkl.ff.cuni.cz/~rosen/public/2014-czesl-sgt-en.pdf. For a list of all attributes and values in Czech and English see http://utkl.ff.cuni.cz/~rosen/public/meta_attr_vals.html. The numbers of documents, listed according to specific attribute values, are given here: http://utkl.ff.cuni.cz/~rosen/public/sgt_counts_by_meta_en.html.
7 Ramasamy et al. (2015) report better results with language models trained on the SYN2005 corpus.


• word – original word form
• lemma – lemma of word; same as word if the form is not recognized
• tag – morphological tag of word; if the form is not recognized: X@------------
• word1 – corrected form; same as word if determined as correct
• lemma1 – lemma of word1
• tag1 – morphological tag of word1
• gs – information on whether the error was determined as a spelling (S) or grammar (G) error; word is mostly recognized for grammar errors
• err – error type, determined by comparing word and word1; see http://utkl.ff.cuni.cz/~rosen/public/SeznamAutoChybR0R1_en.html

Example (1) shows 4 spelling errors in a single sentence. The first line is the original with the incorrect forms (Tén, míluje, svécho, kamarada), the second line is the sentence as corrected by Korektor. All the ill-formed words are non-words.

Example (1)
Tén pes míluje svécho kamarada – člověka.
Ten pes miluje svého kamaráda – člověka.
that dog loves refl.poss friend man
‘That dog loves his friend – the man.’

Table 6 shows the attribute values for the annotated sentence in the corpus. The three columns headed by the attributes word, lemma and tag concern the original, uncorrected text. An incorrect form is labelled as unknown (X@) by the morphological analyser bundled with the tagger, while its lemma is identical to word.⁸ The next triple, word1, lemma1 and tag1, shows its automatically corrected version. Korektor marks the incorrect forms in the gs column as spelling errors (S). The analyser and Korektor do not always agree on whether a specific form is a non-word. The analyser currently offers more sophisticated word form recognition, so it is safer to trust the tagger’s verdict.
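The attribute set listed above can be pictured as one record per token. The sketch below is only an illustration: the `Token` class and the in-memory list are our assumptions and do not reflect the format in which the corpus is distributed; the values are taken from four of the tokens of Example (1) as annotated in Table 6.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One token with the eight attributes described in Section 4.2."""
    word: str     # original word form
    lemma: str    # lemma of word (same as word if unrecognized)
    tag: str      # tag of word (X@ if unrecognized)
    word1: str    # corrected form
    lemma1: str   # lemma of word1
    tag1: str     # tag of word1
    gs: str = ""  # "S" (spelling), "G" (grammar) or empty
    err: str = "" # formal error label or empty


sentence = [
    Token("Tén", "Tén", "X@", "Ten", "ten", "PDYS1", "S", "Quant1"),
    Token("pes", "pes", "NNMS1", "pes", "pes", "NNMS1"),
    Token("svécho", "svécho", "X@", "svého", "svůj", "P8MS4", "S", "Voiced"),
    Token("kamarada", "kamarada", "X@", "kamaráda", "kamarád", "NNMS4", "S", "Quant0"),
]

# Corrections that Korektor classified as spelling errors.
spelling = [(t.word, t.word1, t.err) for t in sentence if t.gs == "S"]
# Forms that the morphological analyser did not recognize.
unrecognized = [t.word for t in sentence if t.tag.startswith("X@")]

print(spelling)
print(unrecognized)
```

Keeping gs (Korektor's verdict) and tag (the analyser's verdict) as separate fields makes it easy to find tokens on which the two tools disagree.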

8 Irrelevant suffixes of the positional tags are omitted for space reasons. For a description of the tagset see http://ufal.mff.cuni.cz/pdt/Morphology_and_Tagging/Doc/hmptagqr.html.


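Table 6. Annotation of a sample sentence (1) including spelling errors.

| word | lemma | tag | word1 | lemma1 | tag1 | gs | err |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tén | Tén | X@ | Ten | ten | PDYS1 | S | Quant1 |
| pes | pes | NNMS1 | pes | pes | NNMS1 | | |
| míluje | míluje | X@ | miluje | milovat | VB-S---3P | S | Quant1 |
| svécho | svécho | X@ | svého | svůj | P8MS4 | S | Voiced |
| kamarada | kamarada | X@ | kamaráda | kamarád | NNMS4 | S | Quant0 |
| – | - | Z: | – | - | Z: | | |
| člověka | člověk | NNMS2 | člověka | člověk | NNMS4 | | |
| . | . | Z: | . | . | Z: | | |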

Example (2) includes a non-word nejakij and a real-word error postele. According to the rules used for the manual annotation of CzeSL-MAN, nejakij should be corrected to nějaký, a form of Literary Czech, rather than to nějakej, a Colloquial Czech form preferred by Korektor in Table 7 due to the smaller edit distance between the non-word and the corrected form. The form postele could be correct in a different context, but it is incorrect within the adverbial of location, where the locative form posteli is required.

Example (2)
*Nejakij muž spí v postele.
some man sleeps in bed.gen.sg/nom.pl/acc.pl/voc.pl
Nějaký muž spí v posteli.
some man sleeps in bed.loc.sg
‘Some guy is sleeping in the bed.’

In the corpus, nějakej (correction of the non-word nejakij) is correctly tagged as a colloquial form (“6” at position 7 of tag1, see Table 7). For the real-word error postele the tagger chooses the implausible directional interpretation of the adverbial, where postele is accusative plural (“P4” at positions 4 and 5 of tag) and the preposition v takes an accusative complement (“4” at position 5): ‘Some guy is sleeping into the beds.’ The corrected form and the preposition are tagged correctly (singular, locative case “S6”). Korektor specifies the error as grammatical “G”, while the error label assigner merely says that there is an error in a single character “SingCh”.
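The edit-distance argument for nějakej over nějaký can be checked with a plain Levenshtein distance. This is a sketch for illustration only: Korektor combines stochastic language and error models, so unit-cost edit distance is our simplification, and the function `levenshtein` is not part of Korektor.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) on NFC strings."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


to_colloquial = levenshtein("nejakij", "nějakej")  # e -> ě, i -> e
to_literary = levenshtein("nejakij", "nějaký")     # e -> ě, i -> ý, drop j
print(to_colloquial, to_literary)  # 2 3
```

Under this simplified measure the colloquial form is one edit closer to the non-word than the literary form, which is consistent with the preference described above.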


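Table 7. Annotation of a sample sentence (2) including a real-word error.

| word | lemma | tag | word1 | lemma1 | tag1 | gs | err |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Nejakij | Nejakij | X@ | Nějakej | nějaký | PZYS1--6 | S | Caron0 |
| muž | muž | NNMS1 | muž | muž | NNMS1 | | |
| spí | spát | VB-S---3P | spí | spát | VB-S---3P | | |
| v | v | RR--4 | v | v | RR--6 | | |
| postele | postel | NNFP4 | posteli | postel | NNFS6 | G | SingCh |
| . | . | Z: | . | . | Z: | | |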

Table 8 shows how many spelling and grammar errors are corrected in the corpus (depending on the G or S value of gs, assigned by Korektor) and how many word forms are (un)recognized (depending on tag assigned by the tagger). The number of grammar (real-word) errors is relatively high (17.8% of the total number of corrected errors, even if only grammar errors in forms recognized by the tagger are counted). However, the success rate of correcting grammar errors is lower than for spelling errors.

Table 8. Spelling and grammar errors corrected by Korektor; word forms (un)recognized by the tagger.

| Error type | Frequency | % of total tokens | % of corrected forms |
| --- | --- | --- | --- |
| Spelling errors | 118,488 | 10.33% | 77.97% |
| Grammar errors | 33,474 | 2.92% | 22.03% |
| Errors total (grammar and spelling) | 151,962 | 13.24% | 100.00% |
| Spelling errors in unrecognized forms | 94,878 | 8.37% | 62.44% |
| Grammar errors in recognized forms | 27,055 | 2.36% | 17.80% |
| Unrecognized forms total | 104,523 | 9.11% | |
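The shares of corrected forms in Table 8 follow directly from the frequencies. The short sketch below recomputes that column as a sanity check; the helper `share` is ours and the frequencies are copied from the table.

```python
def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


spelling_errors = 118_488
grammar_errors = 33_474
corrected_total = spelling_errors + grammar_errors  # 151,962 corrected forms

pct_spelling = share(spelling_errors, corrected_total)
pct_grammar = share(grammar_errors, corrected_total)
# Grammar errors in forms recognized by the tagger (27,055 in Table 8).
pct_grammar_recognized = share(27_055, corrected_total)

print(pct_spelling, pct_grammar, pct_grammar_recognized)  # 77.97 22.03 17.8
```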

Table 9 shows a sample of 50 error labels used in the corpus. The labels are assigned by rules comparing the original and the corrected string. Some of them have a strong linguistic basis, other labels are more formal or used as wastebasket categories.


Table 9. Selected formal errors in CzeSL-SGT.

| Error type | Error description | Example |
| --- | --- | --- |
| Cap0 | capitalization: incorrect lower case | evropě → Evropě; štědrý → Štědrý |
| Cap1 | capitalization: incorrect upper case | Staré → staré; Rodině → rodině |
| Voiced0 | voicing assimilation: incorrect voiced | stratíme → ztratíme; nabítku → nabídku |
| Voiced1 | voicing assimilation: incorrect voiceless | zbalit → sbalit; nigdo → nikdo |
| VoicedFin0 | word-final voicing: incorrect voiceless | kdyš → když; vztach → vztah |
| VoicedFin1 | word-final voicing: incorrect voiced | přez → přes; pag → pak |
| Voiced | voicing: other errors | protoše → protože; hodili → chodili |
| Palat0 | missing palatalization (k,g,h,ch) | amerike → Americe; matke → matce |
| Je0 | je/e: incorrect e | ubjehlo → uběhlo; Nejvjetší → Největší |
| Je1 | je/e: incorrect je | vjeděl → věděl; vjeci → věci |
| Mne0 | me/mne: incorrect m | zapoměla → zapomněla |
| Mne1 | me/mne: incorrect mne, mne, mne | mněla → měla; rozumněli → rozuměli |
| ProtJ0 | prothetic j: missing j | sem → jsem; menoval → jmenoval |
| ProtJ1 | prothetic j: extra j | jse → se; jmé → mé |
| ProtV1 | prothetic v: extra v | vosm → osm; vopravdu → opravdu |
| EpentE0 | e epenthesis: missing e | domček → domeček |
| EpentE1 | e epenthesis: extra e | rozeběhl → rozběhl; účety → účty |

Table 10 lists the 12 most frequent error labels in the corpus. Note that errors in diacritics are by far the most common. The notorious spelling problem of Czech native speakers, the uncertainty about the use of i and y, ranks much lower.

Table 10. The 12 most frequent error types detected in CzeSL-SGT.

| Error type | Error description | Example | Freq | % |
|---|---|---|---|---|
| Quant0 | error in diacritics: missing vowel accent | vzpominám → vzpomínám; doufam → doufám | 67181 | 41.61 |
| SingCh | a single wrong character | otevřila → otevřela; vezmíme → vezmeme | 25451 | 15.76 |
| Quant1 | error in diacritics: extra vowel accent | ktérá → která; hledát → hledat | 17710 | 10.97 |
| Caron0 | error in diacritics: missing caron | vecí → věcí; sobe → sobě | 13893 | 8.61 |
| Cap1 | capitalization: incorrect upper case | Staré → staré; Rodině → rodině | 11847 | 7.34 |
| RedunChar | other single extra character | opratrně → opatrně; zrdcátko → zrcátko | 3157 | 1.96 |
| Caron1 | error in diacritics: extra caron | břečel → brečel; bratřem → bratrem | 2661 | 1.65 |
| Unspec | error in the middle of the word | provudkyně → průvodkyně; krerénu → kterému | 2504 | 1.55 |
| Y0 | i instead of correct y | pražskích → pražských; vipije → vypije | 2384 | 1.48 |
| Y1 | y instead of correct i | hlavným → hlavním; líbyl → líbil | 2179 | 1.35 |
| MissChar | missing character | zaímavou → zajímavou; bohaství → bohatství | 1805 | 1.12 |
| Voiced | voicing: other errors | pěžky → pěšky; hodili → chodili | 1783 | 1.10 |


As Table 11 shows, broader error categories are represented in CzeSL-SGT in proportions similar to those in the hand-annotated CzeSL-MAN. This is a reassuring result, given that the assignment of error labels has not yet been evaluated. The differences in some categories (omission) may also be due to the heterogeneity of texts in CzeSL-MAN, namely to the high share of Roma ethnolect texts.

Table 11. Percentages of error types detected automatically in CzeSL-SGT and manually in CzeSL-MAN.

| General error type | CzeSL-SGT | CzeSL-MAN |
|---|---|---|
| Insertion | 3.76 | 3.52 |
| Omission | 1.39 | 9.20 |
| Substitution | 31.30 | 37.67 |
| Transposition | 0.16 | 0.19 |
| Missing diacritic | 50.19 | 40.40 |
| Addition of diacritic | 12.69 | 8.60 |
| Wrong diacritic | 0.51 | 0.43 |

In addition to the attributes listed above, the search interface of the Czech National Corpus offers ‘dynamic’ attributes, derived from some positions of tag and tag1. They can be used in queries to specify values of morphological categories without regular expressions, to stipulate identity of these values in two or more forms to require grammatical concord, or to compare values of a category for word and word1. These attributes are available for the following categories of the original and the corrected form:
• k, k1 – word class (position 1 of the tag)
• s, s1 – detailed word class (position 2 of the tag)
• g, g1 – gender (position 3 of the tag)
• n, n1 – number (position 4 of the tag)
• c, c1 – case (position 5 of the tag)
• p, p1 – person (position 8 of the tag)
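The mapping from tag positions to dynamic attributes listed above can be illustrated with a short sketch. It is not the implementation used by the search interface; the function name and the treatment of truncated tags (returning "-") are assumptions made for the example, while the tags themselves are taken from Table 7.

```python
# 1-based tag position of each dynamic attribute, as listed above.
POSITIONS = {"k": 1, "s": 2, "g": 3, "n": 4, "c": 5, "p": 8}


def dynamic_attributes(tag: str, suffix: str = "") -> dict:
    """Derive the dynamic attributes from a positional tag.

    `suffix` is "" for the original form (tag) and "1" for the corrected
    form (tag1). Positions beyond the end of a truncated tag yield "-";
    this is an assumption of the sketch, not documented behaviour.
    """
    return {
        name + suffix: (tag[pos - 1] if pos <= len(tag) else "-")
        for name, pos in POSITIONS.items()
    }


orig = dynamic_attributes("NNFP4")       # tag of "postele" in Table 7
corr = dynamic_attributes("NNFS6", "1")  # tag1 of "posteli" in Table 7
print(orig)
print(corr)
```

Comparing `orig["c"]` with `corr["c1"]` is then the kind of check that a query over c and c1 expresses.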

4.3 Using the corpus

The corpus can be searched from the unified search interface of the Czech National Corpus (https://kontext.korpus.cz). CzeSL-SGT is listed among the “Synchronic written corpora”, in the category “specialized”. With the “Query Type” set to “Basic” and no other specifications, a string entered in the “Query” field returns sentences where the form or lemma occurs in the original, uncorrected text. For more advanced queries, including references to tags, lemmas, error types, corrected forms and metalanguage attributes, the “Query Type” should be set to “CQL” and/or the settings in “Specify query according to the meta-information” modified.9

In addition to the query types available in other types of corpora, dynamic attributes support some other interesting options. The CQL query in Example (3) returns nouns, adjectives and pronouns that are recognized as such in the original, are detected as grammatically incorrect, and preserve their word class in the corrected form but differ in case.

Example (3) 1:[k="[NAP]" & gs="G"] & 1.k=1.k1 & 1.c!=1.c1

The corpus is also available for download from the LINDAT data repository (http://hdl.handle.net/11234/1-162). The corpus is currently in release 2. Some bugs present in the original release have been fixed and the whole corpus is now a single XML document with each text as a “div” element. See Figure 1 for an extract from a sample text with the annotation, including metadata in the header.10

Figure 1. A sample annotated text in the XML format.
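For readers working with the downloaded data rather than the search interface, the conditions of the query in Example (3) can be approximated offline. The sketch below assumes a hypothetical in-memory representation of tokens as dictionaries with the keys tag, tag1 and gs; it mirrors the four conditions of the query and is not a CQL engine.

```python
import re


def matches_example_3(token: dict) -> bool:
    """Approximate Example (3) for a single token.

    `token` is a dict with the keys 'tag', 'tag1' and 'gs' (a hypothetical
    representation). Both tags must have at least five positions.
    """
    k, k1 = token["tag"][0], token["tag1"][0]  # word class: position 1
    c, c1 = token["tag"][4], token["tag1"][4]  # case: position 5
    return (
        re.fullmatch("[NAP]", k) is not None   # k="[NAP]"
        and token.get("gs") == "G"             # gs="G"
        and k == k1                            # 1.k=1.k1
        and c != c1                            # 1.c!=1.c1
    )


# Two tokens with the annotation shown in Table 7.
tokens = [
    {"word": "postele", "tag": "NNFP4", "tag1": "NNFS6", "gs": "G"},
    {"word": "muž", "tag": "NNMS1", "tag1": "NNMS1", "gs": ""},
]
hits = [t["word"] for t in tokens if matches_example_3(t)]
print(hits)
```

Only the first token satisfies all four conditions: it is a noun, it is marked as a grammar error, it keeps its word class, and its case changes from 4 to 6.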

9 For general help on using CQL see http://www.sketchengine.co.uk/documentation/wiki/SkE/CorpusQuerying.
10 The metadata attributes about the text are prefixed by “t_”, while those about the student are prefixed by “s_”. In the annotation of “word” elements, insignificant tag suffixes are not shown for space reasons.


5. Evaluating the automatic annotation

The error annotation can be evaluated using CzeSL-MAN, the existing manually annotated subset of the corpus. The manual annotation includes one or two target hypotheses about an incorrect form and one or more error labels. So far, only the proposed corrected forms have been evaluated.

In Rosen et al. (2014) we report on results that Korektor achieved in an experiment based on a pilot corpus consisting of 67 CzeSL-MAN texts (9.4K tokens), including 786 unrecognized tokens, where two annotators agreed on the same corrected form. The language and the error models, trained on native texts, were the same as those used for annotating the present version of CzeSL-SGT. The comparison of Korektor’s output with either of the two annotation levels of CzeSL-MAN is not quite fair: only non-words are corrected at level 1, while level 2 includes errors in syntax, word order and style, mostly well beyond the current reach of Korektor. Still, for level 1 precision was 74% and recall 71%. For level 2, precision dropped to 60% and recall to 45%. These results were considered sufficiently high to justify the use of Korektor in the annotation of CzeSL-SGT.

Ramasamy et al. (2015) experiment with different setups of language and error models. The best results were comparable or better; see Table 12 (“Pilot corpus” for the previous results, “CzeSL-MAN” for the new results).11 They were achieved by using models trained on native texts for the entire CzeSL-MAN test set. The authors report in detail on the easier task of error detection: in a sample of the 3K most frequent tokens identified by an annotator as incorrect, more than 89% of non-words (form errors) were detected. On the other hand, the result for real-word (grammar) errors was only 15.5%. Interestingly, for the combination of the two error types (as in *zajímavy → zajímavý → zajímavé ‘interesting’), the best detection result was also 89%.

11 Comparison of the two experiments should be taken with a grain of salt due to different methodology.


Table 12. Evaluation of the automatic error correction.

| | Pilot corpus, Level 1 | Pilot corpus, Level 2 | CzeSL-MAN, Level 1 | CzeSL-MAN, Level 2 |
|---|---|---|---|---|
| Precision | 74% | 60% | 73% | 78% |
| Recall | 71% | 45% | 80% | 62% |
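As a reminder of what the two measures in Table 12 express, the sketch below computes precision and recall for proposed corrections against a gold standard. The exact matching criteria used in the cited experiments are not restated here, so the sketch assumes that a correction counts as correct only if it is proposed for the same token and is identical to the gold form. The data are toy examples, not corpus results.

```python
from typing import Dict, Tuple


def precision_recall(proposed: Dict[int, str], gold: Dict[int, str]) -> Tuple[float, float]:
    """Compare proposed corrections with gold corrections.

    Both arguments map a token position to a corrected form. A proposal is
    counted as correct only if the gold standard has the same form at the
    same position (an assumption of this sketch).
    """
    correct = sum(1 for pos, form in proposed.items() if gold.get(pos) == form)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data: three gold corrections, four proposals, two of them correct.
gold = {0: "Nějakej", 4: "posteli", 7: "když"}
proposed = {0: "Nějakej", 4: "postele", 7: "když", 9: "pak"}

p, r = precision_recall(proposed, gold)
print(p, r)
```

Precision penalizes the wrong and the superfluous proposal, while recall penalizes the gold correction that was missed.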

6. Discussion and perspectives

A reliable correction tool is the key to successful automatic error annotation. There are at least two obvious paths to a better result: (i) better training data, especially for the error model, which should consist only of Czech texts produced by foreigners and thus be more in line with the content of CzeSL-SGT, and (ii) extending the tool to handle errors spanning word boundaries, including splitting/joining and word order errors. Other options include parameterizing the model according to a specific type of learner Czech (by first language or proficiency level), or experimenting with the tool design, perhaps in combination with a machine translation approach.

The absence of automatic methods and tools targeting non-native language is not caused only by the computational complexity of the task and the absence of data resources, e.g. for machine learning applications. There is a more fundamental issue of largely missing concepts and schemes to describe non-standard linguistic phenomena. As a separate research track, we are developing categories for annotating non-standard word forms, which can replace tagging schemes used for standard language.

We are aware that some aspects of manual annotation of non-standard language cannot be substituted by an algorithm or even by a stochastic model. However, the fact that CzeSL-SGT is one of the most popular downloads from the LINDAT/CLARIN repository, together with a growing list of references to CzeSL-MAN or CzeSL-SGT (Aharodnik et al. 2013; Hudousková 2013, 2014; Štindlová 2015; Meurers 2015), may suggest that (semi-)automatic annotation is a useful help.

References

Aharodnik, K., Chang, M., Feldman, A. & J. Hana. 2013. “Automatic identification of learners’ language background based on their writing in Czech.” In Proceedings of the 6th International Joint Conference on Natural Language Processing – IJCNLP 2013, 1428–1436. Nagoya. Retrieved from: https://msuweb.montclair.edu/~feldmana/publications/I13-1200.pdf.


Boyd, A. et al. 2014. “The MERLIN Corpus: learner language and the CEFR.” In N. Calzolari et al. (eds.), Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik: European Language Resources Association (ELRA). Retrieved from: http://www.lrec-conf.org/proceedings/lrec2014/pdf/606_Paper.pdf.
Dagneaux, E. et al. 2008. The Louvain Error Tagging Manual, Version 1.3. Louvain-la-Neuve: Centre for English Corpus Linguistics, Université Catholique de Louvain.
Díaz-Negrillo, A., Ballier, N. & P. Thompson. (eds.) 2013. Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: John Benjamins.
Dickinson, M. & J. Herring. 2008. “Developing online ICALL exercises for Russian.” The 3rd Workshop on Innovative Use of NLP for Building Educational Applications – ACL08-NLP-Education, 1–9. Columbus. Retrieved from: http://cl.indiana.edu/~md7/papers/dickinson-herring08.html.
Dickinson, M. & M. Ragheb. 2009. “Dependency annotation for learner corpora.” In Proceedings of the Eighth Workshop on Treebanks and Linguistic Theories TLT8. Retrieved from: http://cl.indiana.edu/~md7/papers/dickinson-ragheb09.pdf.
Eckert, E. 2015. Romani in the Czech Sociolinguistic Space. Anglo-American University Prague. Retrieved from: http://www.aauni.edu/wp-content/uploads/2015/04/Eckert-finallc.pdf.
Granger, S. et al. 2002. Error Tagging Manual for L2 French. Louvain-la-Neuve: Université catholique de Louvain, Centre for English Corpus Linguistics.
Hajič, J. 1998. “The Prague Dependency Treebank.” In E. Hajičová (ed.), Issues of Valency and Meaning – Studies in Honour of Jarmila Panevová, 106–132. Praha: Karolinum, Charles University Press.
Hajič, J. 2004. Disambiguation of Rich Inflection: Computational Morphology of Czech. Praha: Karolinum, Charles University Press.
Hudousková, A. 2013. “The corpus CzeSL in the service of teaching Czech for foreigners – errors in the use of the pronoun který.” In K. Gajdošová & A. Žáková (eds.), Proceedings of the Seventh International Conference Slovko 2013. Lüdenscheid: RAM-Verlag.
Hudousková, A. 2014. “Jmenné koncovky v češtině pro cizince – distribuce, frekvence a fonetika. První sonda.” In V. Petkevič, A. Adamovičová & V. Cvrček (eds.), Radost z jazyků. Sborník k 75. narozeninám prof. Františka Čermáka, 215–230. Praha: Nakladatelství Lidové noviny.
Jelínek, T., Štindlová, B., Rosen, A. & J. Hana. 2012. “Combining manual and automatic annotation of a learner corpus.” In P. Sojka, A. Horák, I. Kopeček & K. Pala (eds.), Text, Speech and Dialogue – Proceedings of the 15th International Conference TSD 2012, 127–134. Springer. Retrieved from: http://utkl.ff.cuni.cz/~rosen/public/2012-czesl-tsd_prefinal.pdf.
Junczys-Dowmunt, M. & R. Grundkiewicz. 2014. “The AMU System in the CoNLL-2014 Shared Task: grammatical error correction by data-intensive and feature-rich statistical machine translation.” In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, 25–33. Baltimore, Maryland: Association for Computational Linguistics. Retrieved from: http://acl2014.org/acl2014/W14-17/pdf/W14-1703.pdf.
Krivanek, J. & D. Meurers. 2014. “Comparing rule-based and data-driven dependency parsing of learner language.” In E. H. Kim Gerdes & L. Wanner (eds.), Dependency Theory. Amsterdam: IOS Press.
Levy, M., Blin, F., Siskin, C. B. & O. Takeuchi. (eds.) 2014. WorldCALL – International Perspectives on Computer-Assisted Language Learning. London: Routledge.
Meurers, D. 2015. “Learner corpora and natural language processing.” In S. Granger, G. Gilquin & F. Meunier (eds.), The Cambridge Handbook of Learner Corpus Research. Cambridge: Cambridge University Press. Retrieved from: http://purl.org/dm/papers/meurers-15.html.
Nagata, R., Whittaker, E. & V. Sheinman. 2011. “Creating a manually error-tagged and shallow-parsed learner corpus.” In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1, 1210–1219. Portland, Oregon: Association for Computational Linguistics. Retrieved from: http://dl.acm.org/citation.cfm?id=2002472.2002625.
Ng, H. T. et al. 2014. “The CoNLL-2014 Shared Task on Grammatical Error Correction.” In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, 1–14. Baltimore, Maryland: Association for Computational Linguistics. Retrieved from: http://www.aclweb.org/anthology/W/W14/W14-1701.
Ng, H. T., Wu, S. M., Wu, Y., Hadiwinoto, C. & J. Tetreault. 2013. “The CoNLL-2013 Shared Task on Grammatical Error Correction.” In Proceedings of the Seventeenth Conference on Computational Natural Language Learning: Shared Task, 1–12. Sofia, Bulgaria: Association for Computational Linguistics. Retrieved from: http://www.aclweb.org/anthology/W13-3601.
Petkevič, V. 2006. “Reliable morphological disambiguation of Czech: rule-based approach is necessary.” In M. Šimková (ed.), Insight into the Slovak and Czech Corpus Linguistics, 26–44. Bratislava: Veda.
Ramasamy, L., Rosen, A. & P. Straňák. 2015. “Improvements to Korektor: A case study with native and non-native Czech.” ITAT (Information technologies – Applications and Theory).


Richter, M. 2010. An Advanced Spell Checker of Czech. Faculty of Mathematics and Physics, Charles University, Prague. Retrieved from: https://redmine.ms.mff.cuni.cz/attachments/2/richter-diplomathesis.pdf.
Richter, M., Straňák, P. & A. Rosen. 2012. “Korektor – a system for contextual spell-checking and diacritics completion.” In Proceedings of COLING 2012, 1019–1028. Mumbai, India: The COLING 2012 Organizing Committee. Retrieved from: http://www.aclweb.org/anthology/C12-2099.
Rosen, A., Hana, J., Štindlová, B. & A. Feldman. 2014. “Evaluating and automating the annotation of a learner corpus.” Language Resources and Evaluation 48(1 Special Issue: Resources for language learning), 65–92. doi: http://dx.doi.org/10.1007/s10579-013-9226-3.
Šebesta, K. 2012. “Learner corpora and the Czech language.” In I. Semrádová (ed.), Intercultural Inspirations for Language Education. Spaces for understanding, 74–89. Univerzita Hradec Králové.
Shermis, M. D. & J. Burstein. (eds.) 2013. Handbook of Automated Essay Evaluation – Current Applications and New Directions. London: Routledge.
Štindlová, B. 2013. Žákovský korpus češtiny a evaluace jeho chybové anotace. Praha: Univerzita Karlova v Praze, Filozofická fakulta.
Štindlová, B. 2015. “K parcelaci gramatiky češtiny pro nerodilé mluvčí.” In M. Švrčinová & Z. Vlasáková (eds.), Gramatika ve výuce a testování cizích jazyků (včetně češtiny pro cizince), 198–209. Praha: Ústav jazykové a odborné přípravy UK.
Štindlová, B., Škodová, S., Hana, J. & A. Rosen. 2013. “A learner corpus of Czech: current state and future directions.” In S. Granger, G. Gilquin & F. Meunier (eds.), Twenty Years of Learner Corpus Research: Looking back, Moving ahead. Louvain-la-Neuve: Presses Universitaires de Louvain. Retrieved from: http://utkl.ff.cuni.cz/~rosen/public/LCR2011_proceedings_Stindlova-et-al_prefinal.pdf.
Votrubec, J. 2006. “Morphological tagging based on averaged perceptron.” In WDS’06 Proceedings of Contributed Papers, 191–195.

Elżbieta Kaczmarska

University of Warsaw Institute of Western and Southern Slavic Studies

Corpus-based Analysis of Czech Units Expressing Mental States and Their Polish Equivalents
Identification of Meaning and Establishing Polish Equivalents Referring to Different Theories

Abstract: The analysis focuses on Czech polysemous verbs expressing mental states. We test whether different linguistic theories can predict the closest equivalents of these verbs in Polish. The analysis proper is preceded by automatic extraction of pairs of equivalents from the parallel corpus InterCorp. These pairs constitute a kind of bilingual dictionary. The research includes automatic excerption of chosen verbs (with aligned segments). The segments are analysed manually. We check how the key verb was translated and what kind of collocations and arguments it has. Subsequently we apply methods specific to each linguistic theory. The analysis is complemented by collocation profiles obtained with Word Sketches, which proved to be an essential tool. The analysis identifies the most effective theories as well as shortcomings of the proposed algorithm, which should spur its optimization and further development.

Keywords: equivalent; psych verbs; parallel corpora; Czech; Polish

1. Introduction

Czech and Polish belong to the same West Slavic group of the Slavic language family. They share a common ancestry and their vocabularies show many similarities. Many words sound nearly the same and native speakers of both languages are able to understand them.1 However, there are some surprising lexical contrasts between Czech and Polish, which cause difficulties in language contact. A case in point are verbs expressing mental states2 and nouns denoting emotions and feelings, e.g. mít rád (to like, to love), být líto (to regret, to be sorry), toužit (to miss, to want, to desire). In some cases it is impossible to reproduce a text (or a phrase) in the target language; a native speaker of Polish is not able to encode the meaning of such words and faces the impossibility of expressing the same content in their own language, which does not provide the same concept (Kaczmarska & Rosen 2014a). Instead, one can indicate a cluster of equivalents with a slight change of meaning (Lewandowska-Tomaszczyk 1984, 2013), but none of them will cover exactly the same semantic field.

Verbs expressing mental states are, in this sense, particularly problematic because of their ambiguity and subjective character. This is why parts of their meaning are lost in the act of translation and one-to-one equivalent pairs are difficult to find. It must be admitted that some of these verbs are not considered polysemous by Czechs, while they seem to have several meanings for a speaker of Polish.3 With verbs expressing mental states, one also encounters the problem of misunderstanding. In some cases it is not possible to state what a given unit means in terms of the target language. The problems are not solved by traditional dictionaries, which provide only a limited number of equivalents (in most cases with no examples of usage). The best-known Czech-Polish dictionary (Siatkowski & Basaj 2002) presents a list of equivalents of the analyzed units, e.g.:
• mít rád – lubić (to like), kochać (to love)
• být líto – być żal (to be sorry for), być przykro (to feel sorry), szkoda (it is a pity)
• toužit – tęsknić (to miss), pragnąć (to desire), marzyć (to dream).

Without additional clues, it is not possible to translate these into Polish properly. In some cases, even the context is not conclusive; e.g. the avowal mám Tě rád can be translated as lubię cię (I like you) or kocham cię (I love you). For a Polish speaker the unit mít rád has at least two quite different meanings: kochać (to love) and lubić (to like).

1 We also face the prevalent phenomenon of false friends: sukienka (pl) ‘dress’ – sukýnka (cs) ‘short skirt’.
2 The class of mental verbs (also known as psych verbs) includes verbs of perception, cognition and emotion (Pustejovsky 1993). In this paper, psych verbs are only an example of a group of verbs causing particular problems in the process of translation. The paper will neither analyse nor describe the nature of the psych verbs themselves. They are a relatively well-studied phenomenon in linguistics [cf., e.g., the works of Adriana Belletti & Luigi Rizzi, Agnieszka Będkowska-Kopczyk, Stefan Engelberg, Martina Hřebíčková, Barbara Lewandowska-Tomaszczyk (also with Paul A. Wilson), Karel Pala & Zdena Šachová, Bożena Rozwadowska, Alexandros Tantos, Irena Vaňková, Anna Wierzbicka].
3 In more distant languages, this is a common situation. Languages pattern the meaning in their own way.


2. Goals

The objective of the study is to find a suitable equivalent for a given unit (a psych verb) and to present an attempt at applying the methods of different linguistic approaches to build an algorithm for the selection of equivalents. Our research is based on texts from the parallel corpus InterCorp (Čermák & Rosen 2012; Kaczmarska & Rosen 2014b). The experimental algorithm involves several steps corresponding to different grammatical theories. Depending on the verb, the best equivalent may be found at any of these stages.

3. Algorithm assisting in the selection of equivalents

Determining Polish equivalents for Czech verbs is a part of a larger project, within which we are developing an algorithm that assists in the selection of equivalents of verbs, exploiting data from the parallel corpus InterCorp (Kaczmarska 2015b). By applying methods based on various linguistic theories (see Figure 1) we test whether the meaning of a given verb depends on its syntactic characteristics (Levin 1993) and then we establish the closest equivalent based on the syntactic patterns (referring also to the syntactic structure of the potential equivalent).

Figure 1. Steps of the algorithm for identifying equivalents.


As shown, the algorithm consists of several steps. The verb, subjected to such a test, does not have to go through all the stages; the optimal equivalent can be found at each step of the analysis.
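The control flow described above, with steps tried in order and the analysis stopping as soon as one of them yields an equivalent, can be sketched as follows. This is only an illustration of the early-exit idea: the function names, the representation of an occurrence as a dictionary and the two stand-in steps are hypothetical, and the real steps involve manual analysis rather than simple lookups.

```python
from typing import Callable, List, Optional, Tuple

# An occurrence is represented as a dict, e.g. {"verb": "toužit", "frame": "inf"}.
Step = Callable[[dict], Optional[str]]


def extraction_step(occurrence: dict) -> Optional[str]:
    """Stand-in for step one: here it never settles on a single equivalent."""
    return None


def valence_step(occurrence: dict) -> Optional[str]:
    """Stand-in for step two: decisive only for toužit with an infinitive."""
    if occurrence["verb"] == "toužit" and occurrence["frame"] == "inf":
        return "pragnąć"
    return None


def select_equivalent(
    occurrence: dict, steps: List[Tuple[str, Step]]
) -> Tuple[Optional[str], Optional[str]]:
    """Run the steps in order; stop at the first one that returns an equivalent."""
    for name, step in steps:
        equivalent = step(occurrence)
        if equivalent is not None:
            return name, equivalent
    return None, None


STEPS = [("extraction", extraction_step), ("valence", valence_step)]
result = select_equivalent({"verb": "toužit", "frame": "inf"}, STEPS)
print(result)
```

Further steps (Case Grammar, Pattern Grammar, Cognitive Grammar) would be appended to the list in the same way; an occurrence for which no step is decisive falls through with no equivalent.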

3.1 Step one – automatic extraction of pairs of equivalents

In the pilot study (Kaczmarska & Rosen 2013) we extracted pairs of equivalents from the parallel corpus InterCorp, obtaining a bilingual glossary (Och & Ney 2003; Skoumalová 2008; Jirásek 2011). At this stage we compiled clusters of Polish equivalents for each individual Czech verb. Table 1 below presents a cluster of the most frequent equivalents for the verb used as a case study, toužit.

Table 1. Polish equivalents of toužit.

| Equivalent of toužit | Frequency |
|---|---|
| pragnąć | 304 |
| chcieć | 107 |
| tęsknić | 82 |
| marzyć | 70 |
| pożądać | 40 |
| ochota | 24 |
| zapragnąć | 9 |
| pragnienie | 8 |
| tęsknota | 8 |
| zależeć | 8 |
| spragniony | 7 |
| życzyć | 6 |
| Overall | 673 |
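A cluster such as the one in Table 1 is essentially a frequency list of target lemmas per source lemma. The sketch below shows this counting step under the assumption that word-aligned lemma pairs are already available; the alignment itself is not shown. The function name and the handful of pairs are invented for illustration and are not corpus data.

```python
from collections import Counter, defaultdict


def build_clusters(aligned_pairs):
    """Group aligned (Czech lemma, Polish lemma) pairs into frequency clusters."""
    clusters = defaultdict(Counter)
    for czech_lemma, polish_lemma in aligned_pairs:
        clusters[czech_lemma][polish_lemma] += 1
    return clusters


# Toy input standing in for the output of word alignment.
pairs = [
    ("toužit", "pragnąć"),
    ("toužit", "chcieć"),
    ("toužit", "pragnąć"),
    ("toužit", "tęsknić"),
    ("milovat", "kochać"),
]

clusters = build_clusters(pairs)
ranked = clusters["toužit"].most_common()  # equivalents sorted by frequency
print(ranked)
```

Sorting by frequency gives the ranking used in the table, with the most frequent equivalent first.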

Pilot studies have shown that for many verbs suitable equivalents can be found already at this stage; interestingly, this generally concerns verbs expressing negative feelings (Kaczmarska 2014a, 2016; see also Section 4.1).

3.2 Step two – valence analysis

The aim of the analysis at this stage is to determine whether valence requirements4 can help to identify Polish equivalents of the verbs. We assumed that for some meanings the equivalent could be established on the basis of the convergence of valence requirements (Levin 1993). For this purpose, we conducted a manual analysis of the selected verb (with aligned segments) from InterCorp (Kaczmarska & Rosen 2013). We established (in each segment) how the verb was translated, how many arguments the given verb binds, what kinds of arguments they are (whether it is, e.g., a noun and, if so, what kind of entity it denotes: real, abstract, human being, etc.) and how they are bound (by morphological case, preposition, infinitive, relative clause). We also checked which arguments are bound to the Polish equivalents and conducted syntactic and semantic analyses of the arguments bound by the Czech verb and by its equivalents. We expected that in most cases the results of the syntactic and semantic analyses would allow us to establish the semantically and syntactically closest Polish equivalents of the Czech verbal unit. The verb toužit binds arguments using structures such as:
• toužit po Oabstr (abstract object)
• toužit po Ohum (human object)
• toužit po / do OR (real object)
• toužit + inf
• toužit + S (aby… / po tom, aby…)5

Thanks to the manual analysis of the aligned segments we could see how the verb is translated in each particular group. The study revealed that valency can influence the choice of equivalent in Polish. In the case of the analyzed verb, we concluded that the most appropriate equivalent can be identified only for toužit bound with an infinitive.

toužit + inf → pragnąć6 + inf

4 In this paper, valence is understood as a linguistic phenomenon referring to the number of arguments controlled by a verbal predicate. We define syntactic and semantic features of the analysed verbs at later stages of our research (Dębski 1982; Daneš & Hlavsa 1987; Rytel 1989; Greń & Rytel-Kuc 1991; Čermáková 2009; Urbańczyk-Adach 2001).
5 toužit + sentence (to…).
6 The verb chcieć (to want) is classified as a synonym of pragnąć (to desire). The difference between them lies in the intensity of the feeling.


Table 2. Polish equivalents of toužit + infinitive.

| Equivalent of toužit + infinitive | Frequency |
|---|---|
| pragnąć inf | 44 |
| chcieć inf | 20 |
| marzyć o Oabstr | 4 |
| pragnąć Oabstr | 3 |
| być pragnieniem inf | 1 |
| chętnie + S | 1 |
| mieć marzenie inf | 1 |
| mieć ochotę inf | 1 |
| pragnąć + S | 1 |
| tęsknić za (+ S) | 1 |
| zachciewać się Oabstr | 1 |
| Other | 2 |
| Overall | 80 |
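Whether a frame is "decisive" can be made explicit by looking at the share of its most frequent equivalent. The sketch below does this for the counts in Table 2. The 50% threshold is an arbitrary assumption introduced for the illustration; the study itself relies on manual judgement rather than a fixed cut-off.

```python
# Counts copied from Table 2 (toužit + infinitive).
table_2 = {
    "pragnąć inf": 44,
    "chcieć inf": 20,
    "marzyć o Oabstr": 4,
    "pragnąć Oabstr": 3,
    "być pragnieniem inf": 1,
    "chętnie + S": 1,
    "mieć marzenie inf": 1,
    "mieć ochotę inf": 1,
    "pragnąć + S": 1,
    "tęsknić za (+ S)": 1,
    "zachciewać się Oabstr": 1,
    "Other": 2,
}


def dominant_equivalent(counts: dict, threshold: float = 0.5):
    """Return (equivalent, share) if the top equivalent reaches the threshold,
    otherwise (None, share). The threshold is a hypothetical parameter."""
    label, freq = max(counts.items(), key=lambda item: item[1])
    share = freq / sum(counts.values())
    return (label, share) if share >= threshold else (None, share)


total = sum(table_2.values())
label, share = dominant_equivalent(table_2)
print(total, label, share)
```

With these counts the structure pragnąć + infinitive covers 44 of the 80 occurrences, which is consistent with the choice of pragnąć as the equivalent for toužit + infinitive.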

In the remaining groups the results were not decisive.7 All units for which no suitable equivalent was found at step one would be automatically moved to the next stage of the algorithm.

7 Complete figures were published in Kaczmarska and Rosen (2013). Translating structures with Oabstr was particularly problematic and pointed to the need for a deeper analysis of the objects. As a test, we examined two abstract objects: “big love” (velká láska / wielka miłość) and “exotic journey” (exotická cesta / egzotyczna podróż). We discovered that both combine easily with the analyzed Czech verb: toužit po velké lásce / exotické cestě. We also tried to combine them with the three most frequent Polish equivalents:
marzyć o wielkiej miłości / egzotycznej podróży (to dream of big love / exotic journey)
tęsknić za wielką miłością / egzotyczną podróżą (???) (to miss big love / exotic journey)
pragnąć wielkiej miłości / egzotycznej podróży (?) (to desire big love / exotic journey)
It is possible to combine the objects with the verb marzyć (to dream), but it is not correct to use them with the verb tęsknić (to miss). It would be acceptable only if the “big love” represented a person. Also, the verb pragnąć (to desire) hardly allows combination with the object “exotic journey”. The ambiguities of the results from the first step made us pursue a more detailed analysis at the next stage. The test was also used for other research (Kaczmarska 2014a, 2014b; Kaczmarska, Rosen, Hana & Hladká 2015).

3.3 Step three – Case Grammar

At this stage we identify cases, i.e. the roles played by elements bound to the verb (Fillmore 1968; Halliday 1985; Korytkowska 1984, 1992, 1993; Kaczmarska 2001). However, this step was ineffective for this particular group of verbs. The analyzed units (expressing different emotions and feelings) are uniform in terms of collocability. They combine with certain arguments; we identify an Experiencer and a kind of Source (or Stimulus), but on this basis we are not able to distinguish the meaning. Nevertheless, this step will not be removed from the final version of the algorithm, which can be used to study other groups of verbs, where the semantic roles of the arguments bound to the equivalent and the original verbs may be significant for the differentiation of the meaning. In the case of other verbs we can identify roles such as Agent, Beneficiary, Location, Time, Instrument, Substance, and Object (itself), and the guideline for choosing the equivalent may be its collocability with the argument and its specific role.

3.4 Step four – Pattern Grammar

“If a word has several senses, and is used in several patterns, each pattern will occur more frequently with one of the senses than the others, such that the patterning of an individual example will indicate the most likely sense of the word in that example” (Hunston & Francis 2000: 20). The verbs we analyze are mostly polysemous. In tracking their patterns, we hope to be able to link the concrete meaning with a pattern type (understood as a repeatable combination of words). “A pattern can be identified if a combination of words occurs relatively frequently, if it is dependent on a particular word choice, and if there is a clear meaning associated with it” (Hunston & Francis 2000: 37). We established whether there was indeed such repeatability in the corpus occurrences (Ebeling & Ebeling 2013). The manual analysis based on InterCorp revealed, for example, two patterns of the Czech unit být líto (to be sorry, to regret) associated with two meanings. If the unit být líto is combined with two nominal phrases (Dative and Genitive), it corresponds to the Polish equivalent żal (to be sorry for, to regret). If combined only with the Dative nominal phrase, and possibly with the element to, it corresponds to the Polish equivalent (być) przykro (to be sorry).


Table 3. Patterns of být líto (żal, być przykro).

| Polish equivalent | Czech original | Polish translation |
|---|---|---|
| żal | Jak mi ho bylo líto! | Jakże mi go było żal! |
| żal | Je mi ho samozřejmě líto. | Jest mi go oczywiście żal… |
| żal | Přišlo mi jí prostě líto. | Po prostu zrobiło mi się jej żal. |
| (być) przykro | Pak mi je líto. | Wobec tego, przykro mi! |
| (być) przykro | Potom nám to bylo oběma líto. | Potem nam obu było przykro. |
| (być) przykro | …nabídne mi sisinku a já si vezmu, protože by mu bylo líto, kdybych si nevzala… | …zaprasza mnie na cuksa i ja biorę, bo byłoby mu przykro, gdybym nie wzięła… |

Patterns:
být líto + NPDAT + NPGEN = żal
být líto + NPDAT + to / Ø = (być) przykro

In this case, the manual analysis allows us to establish the proper equivalent. For verbs represented by a large number of occurrences, the manual analysis will not be helpful. For these units we will be able to use Word Sketches mentioned in Step six.
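The two patterns of být líto found in this step amount to a simple decision rule. The sketch below encodes it under the assumption that the cases of the nominal phrases bound to the unit have already been identified, manually or by a tagger; the function name and its boolean parameters are hypothetical.

```python
from typing import Optional


def byt_lito_equivalent(has_dative_np: bool, has_genitive_np: bool) -> Optional[str]:
    """Choose the Polish equivalent of být líto from its pattern.

    být líto + NP(DAT) + NP(GEN)  -> żal
    být líto + NP(DAT) + to / Ø   -> (być) przykro
    Any other configuration is left undecided (None).
    """
    if has_dative_np and has_genitive_np:
        return "żal"
    if has_dative_np:
        return "(być) przykro"
    return None


# "Jak mi ho bylo líto!": mi is dative, ho is genitive.
first = byt_lito_equivalent(has_dative_np=True, has_genitive_np=True)
# "Pak mi je líto.": only the dative mi is present.
second = byt_lito_equivalent(has_dative_np=True, has_genitive_np=False)
print(first, second)
```

The optional element to does not affect the outcome, so it is not a parameter of the rule.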

3.5 Step five – Cognitive Grammar

3.5.1 Dictionaries and corpora

At this stage, we try to encode the meaning of a word in terms of conceptualization (Langacker 1987, 1991, 2008; Geeraerts 2010). We analyze the unit mít rád.8 In a reputable dictionary of Czech (Havránek 1989), mít rád is defined as pociťovat k někomu náklonnost, lásku, milovat, mít v oblibě (to feel affection for someone, love, to love, to like). In line with these definitions, the Czech-Polish dictionary (Siatkowski & Basaj 2002) gives the following Polish equivalents: kochać, lubić, przepadać (to love, to like, to be fond of). These Polish verbs, supposedly equivalents of the analyzed Czech unit, refer to completely different feelings (emotions). For a Polish speaker, a combination of the meanings “to love” and “to like” within a single expression is a strange and unfamiliar concept. In the parallel corpus InterCorp we can find more equivalents of the unit mít rád (lubić, kochać, podobać się, uwielbiać, polubić, pokochać, w naszym guście); however, they all belong to two distinct semantic fields denoting “love” and “liking”.9 The strangeness of the concept makes both understanding it and translating it into Polish very difficult, and – as mentioned in the Introduction – the problem sometimes cannot be solved even in a wider context, e.g.:

(cs) Mám tě strašně rád, řekl. (Kundera-Valcik_na_rozl)
(pl) Strasznie cię kocham – rzekł. (Kundera-Valcik_na_rozl)

(cs) Kdybys mě měla ráda, nemohla by ses opičit s tím pitomým jménem. (Grusa-Dotaznik)
(pl) Gdybyś mnie naprawdę lubiła, nie wygłupiała byś się z tym kretyńskim imieniem. (Grusa-Dotaznik)

(cs) Máš-li mne jen trošku rád, shoď mne z třetího patra, dej mně tu poslední outěchu. (Hasek-OsudyDobrehoVvSV)
(pl) Jeśli masz dla mnie choć troszkę przyjaźni, zrzuć mnie z trzeciego piętra, udziel mi tej ostatniej pociechy. (Hasek-OsudyDobrehoVvSV)

In these contexts we could use all equivalents offered by the Czech-Polish dictionary. Furthermore, there is a verb milovat in Czech, which is translated into Polish as kochać (to love).10 However, there is a certain complication regarding the translation of the analyzed unit mít rád. Native speakers of Polish tend to project the patterns of their own language onto a foreign language. In Polish, “lubić” and “kochać” signify completely different feelings. The difference lies mainly in the quality, not the intensity, of the feeling. In this particular situation, a native speaker of Polish can automatically attribute the meaning “kochać” (to love)11 to the unit milovat and the meaning “lubić” (to like)12 to the unit mít rád. In Polish there is no verb expressing feelings on the borderline of love and liking, because such a concept does not exist in this language. Consequently, the Polish language cannot offer an appropriate equivalent. When Czech and Polish are confronted in this case, a misunderstanding can arise.

8 For mít rád (not fully translatable into Polish) see also Kaczmarska and Rosen (2014a).
9 In the parallel corpus InterCorp we found 2799 occurrences of the unit mít rád. 66% of the occurrences were translated into Polish as a unit referring to “liking” (lubić) and 18% as a unit referring to “love” (kochać). The remaining 16% were translated with other units.
10 In InterCorp we also found 586 occurrences of the unit milovat: 84% were translated with a unit referring to “love” (kochać) and 4% with a unit referring to “liking” (lubić). The remaining 12% were translated with other units.
11 An analysis based on InterCorp shows that the Polish verb kochać is translated into Czech predominantly as milovat (over 72% of 3497 occurrences); 20% of the occurrences were translated as a unit referring to “liking” (mít rád) and the remaining 8% were translated with other units.
12 In the parallel corpus InterCorp we also found 3063 occurrences of the verb lubić: 95% of the occurrences were translated into Czech as a unit referring to “liking” (mít rád) and 3% as a unit referring to “love”.

With such a large amount of data from InterCorp, we can make an attempt at building a network of meanings (Kaczmarska 2015a).13

Figure 2. The network of meanings of the Czech units mít rád and milovat and of the Polish units kochać and lubić.
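The translation shares reported in footnotes 9 and 10 can be read as weights on the Czech-to-Polish arrows of such a network. The following Python sketch is only an illustration of that reading: the data structure and the function names are assumptions introduced here, the approximate counts are back-calculated from rounded percentages, and the figure itself is based on dictionary definitions rather than on this calculation.

```python
# Translation shares taken from footnotes 9 and 10 (Czech -> Polish direction).
# "other" collects all remaining equivalents; it is not a semantic field.
SHARES = {
    "mít rád": {"total": 2799, "lubić": 0.66, "kochać": 0.18, "other": 0.16},
    "milovat": {"total": 586, "kochać": 0.84, "lubić": 0.04, "other": 0.12},
}


def weighted_edges(shares):
    """Return (source, target, share, approx_count) tuples, strongest first per source."""
    edges = []
    for source, row in shares.items():
        total = row["total"]
        for target, share in row.items():
            if target == "total":
                continue
            # Counts are approximate because the published shares are rounded.
            edges.append((source, target, share, round(total * share)))
    return sorted(edges, key=lambda e: (e[0], -e[2]))


def dominant_equivalent(shares, unit):
    """Pick the named equivalent with the highest share, ignoring 'other'."""
    row = {k: v for k, v in shares[unit].items() if k not in ("total", "other")}
    return max(row, key=row.get)


if __name__ == "__main__":
    for source, target, share, count in weighted_edges(SHARES):
        print(f"{source} -> {target}: {share:.0%} (about {count} occurrences)")
```

On these figures the dominant equivalent of mít rád is lubić and that of milovat is kochać, while the non-zero weights on the crossing arrows reflect the overlap discussed above.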

3.5.2 A survey and the web

According to the current results of the analysis, we could conclude that there are more reasons for translating the Czech unit into Polish as lubić than as kochać. Since the unit is frequent and still very problematic, we conducted a survey among Czech native speakers. The survey took place in Liberec (Czech Republic) in December 2013 (30 respondents, 19–24 years old). The aim of the survey was to discover the meaning of mít rád on the basis of its opposition to milovat; we also asked directly whether there are any differences between the two verbs. Furthermore, the respondents were requested to write down what objects may be combined with the two verbs. All of the respondents identified semantic differences between the two verbs; milovat is something more than mít rád. However, they had a problem with differentiating the objects that the units combine with; they assigned the same objects to both units. Such results could not be used in our analysis directly. We could state that the choice of an equivalent depends on a wider context. However, on the web we can find a large number of opinions concerning this problem. The Czechs discuss what the unit mít rád means as a declaration.14 Consequently, we cannot say that the problem of translating mít rád can always be solved by a wider context. If the concept does not exist in a target language, reproducing it in this language may entail the loss of a part of its meaning (Kaczmarska 2014b; Kaczmarska & Rosen 2014a). We intentionally presented only rudimentary elements of the Cognitive Grammar methods. The theory assumes that the analyses will be conducted manually, and for this reason it is very hard to adopt in our algorithm. In the final version of the algorithm it will be moved to the last step and only applied in exceptional cases.15

13 The network is based on definitions from monolingual dictionaries available online: http://goo.gl/fY3Kt4; http://goo.gl/oYtv4D; http://sjp.pwn.pl/szukaj/lubi%C4%87.html; http://sjp.pwn.pl/szukaj/kocha%C4%87.html. The network reflects only the way the Czech units are understood and translated into Polish (this is why there are only arrows pointing in one direction). We took into consideration only the four units described; we realize, however, that a comprehensive map of meanings should also include other potential equivalents, such as uwielbiać (to adore). We treat the two units kochać and lubić as representatives of the groups of equivalents that could have the feature ‘love’ or ‘liking’. The equivalence of meaning was based on manual analysis of occurrences (Czech originals translated into Polish), a total of 3385 examples in Czech and as many in Polish.

3.6 Step six – Word Sketches and other tools of the future

As mentioned above (Step four and Step five), the manual analysis will not be efficient in the case of verbs represented by a large number of occurrences (among others, toužit). For these units, we could use Word Sketches (Kilgarriff & Tugwell 2002; Kilgarriff et al. 2014).16 The collocates of an analyzed unit are grouped according to the grammatical relations in which they occur. Word Sketches seem to be a universal tool for analyzing collocations and word combinations in terms of pattern grammar and valency. We made an attempt to apply the tool for further analysis of the Czech verb toužit. The research was based on the Czech-Polish part of InterCorp. At this point, we confront yet another problem. The Czech examples are excerpted from the vast Czech National Corpus and the Polish examples from InterCorp. The data are incomparable in terms of size. Unfortunately, at present, using Word Sketches for the Czech language is impossible in InterCorp. It is also impossible to use the National Corpus of Polish (Przepiórkowski et al. 2012): its size is comparable with the size of the Czech corpus, but the Polish corpus does not offer Word Sketches as a tool. The analysis must be conducted on the basis of one corpus and homogeneous texts. The appropriate tool, which is in preparation, will work with both the Czech and the Polish part of InterCorp. Word Sketches is a promising tool for our study. We expect to be able to analyze not only objects expressed by nominal phrases, but also adverbs combined with key verbs.

14 There is no doubt as to the meaning of milovat in the same position (http://diskuse.doktorka.cz/mit-rad-zamilovat-se-milovat/, http://www.poradte.cz/spolecnost/21684-milovat-nebo-mit-rad.html, http://janajerabkova.blog.idnes.cz/c/194377/Milovat-nebo-mit-rad.html).
15 In problematic cases, we can also refer to the explications and natural semantic metalanguage (Wierzbicka 1980, 2001) or construct an intensity scale of properties expressed by a given verb (Mikołajczuk 1997, 1999; Bratman 1987).
16 “A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour.” Word Sketches are available online at: http://www.sketchengine.co.uk/documentation/wiki/Website/Features#Wordsketches.
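The grouping principle behind a word sketch, collocates arranged by the grammatical relation in which they occur, can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions and not the Sketch Engine implementation: the triples are invented toy data, the relation labels are hypothetical, and collocates are ranked by raw frequency, whereas real Word Sketches rank them by an association score.

```python
from collections import Counter, defaultdict


def sketch(triples, headword):
    """Group collocates of `headword` by grammatical relation and count them."""
    by_relation = defaultdict(Counter)
    for head, relation, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # For each relation, list (collocate, frequency) pairs, most frequent first.
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Invented toy data: (head lemma, hypothetical relation label, collocate lemma).
TOY_TRIPLES = [
    ("toužit", "po + LOC", "láska"),
    ("toužit", "po + LOC", "láska"),
    ("toužit", "po + LOC", "svoboda"),
    ("toužit", "adverb", "velmi"),
    ("milovat", "object ACC", "život"),
]

if __name__ == "__main__":
    for relation, collocates in sketch(TOY_TRIPLES, "toužit").items():
        print(relation, collocates)
```

Keeping the relation as the top-level key mirrors the expectation stated above that both nominal objects and adverbs combined with the key verb can be inspected separately.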

4. A case study – závidět ‘to envy’ and žárlit ‘to be jealous’

4.1 Automatic extraction of pairs of equivalents

We generated the Czech-Polish dictionary using the tool Treq (available at the Czech National Corpus website).17

Table 4. The most common Polish equivalents of the verb závidět based on Treq.18

Polish equivalents of závidět    Occurrences
zazdrościć                       188
zazdrość                         26
pozazdrościć                     16
zazdrosny                        7
zawiść                           4
darzyć                           1
straszliwie                      1
współzawodniczyć                 1
zwyknąć                          1

17 Treq (available online at: http://treq.korpus.cz) generates lists of the most frequent equivalents of selected words. However, one should realize that the product is not an ideal dictionary containing only proper equivalents. Treq uses a fully automatic method, and among the proposed equivalents we also occasionally find accidental words and even punctuation marks. In previous research, we generated the dictionary ourselves (Kaczmarska & Rosen 2013).
18 Treq excerpted occurrences from all Czech texts and their equivalent segments.

Some of the equivalents from the table do not correspond to the meaning of závidět. They may have appeared by coincidence as a result of an alignment error (zwyknąć, straszliwie), or they are part of a synonymous phrase (darzyć – darzyć uczuciem / zazdrością / uczuciem zazdrości). Among the suggested equivalents there is also the verb współzawodniczyć, which can be interpreted as a distant synonym of the unit závidět. The most frequent equivalent is the verb zazdrościć (and its derivative pozazdrościć). The other suggestions are represented by very few occurrences. The noun zazdrość is also part of the phrase czuć zazdrość, which means zazdrościć. The results are conclusive and we are able to establish the Polish equivalent at this level of our analysis. We can also check the results against an excerpt from the Czech-Polish part of the parallel corpus InterCorp.19 There are only 50 occurrences of the verb (from originally Czech texts translated into Polish). Most of them are translated as zazdrościć, which confirms the assumption.20

Table 5. Polish equivalents of the Czech verb závidět based on the Czech-Polish part of the parallel corpus InterCorp.

Polish equivalents of závidět    50
zazdrościć                       45
pozazdrościć                     3
być zazdrosny                    1
other                            1
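The manual weeding of the Treq candidates for závidět (Table 4) can be approximated by a simple frequency cut-off. The sketch below is an assumption-laden illustration, not the procedure used in the study: the function name and the threshold `min_count=2` are hypothetical, and such a crude cut-off also discards współzawodniczyć, which the manual analysis treats as a distant synonym.

```python
# Counts copied from Table 4 (Treq candidates for závidět).
TREQ_ZAVIDET = {
    "zazdrościć": 188, "zazdrość": 26, "pozazdrościć": 16, "zazdrosny": 7,
    "zawiść": 4, "darzyć": 1, "straszliwie": 1, "współzawodniczyć": 1,
    "zwyknąć": 1,
}


def filter_candidates(counts, min_count=2):
    """Drop non-alphabetic tokens (e.g. punctuation) and rare candidates.

    Returns (candidate, count, relative frequency) tuples sorted by count;
    relative frequencies are computed over all candidates, kept or not.
    """
    total = sum(counts.values())
    kept = [
        (word, n, n / total)
        for word, n in counts.items()
        if n >= min_count and word.replace(" ", "").isalpha()
    ]
    return sorted(kept, key=lambda item: -item[1])


if __name__ == "__main__":
    for word, n, share in filter_candidates(TREQ_ZAVIDET):
        print(f"{word}: {n} ({share:.1%})")
```

With these settings five candidates survive, all of them belonging to the zazdrość family or to zawiść, which agrees with the manual assessment that the accidental one-off alignments carry no weight.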

We also generated this type of dictionary for the verb žárlit.

Table 6. The most common equivalents of the verb žárlit based on Treq.

Equivalents of žárlit    Occurrences
zazdrosny                141
zazdrość                 25
zazdrościć               14
być                      2
osiłek                   1
darzyć                   1
owszem                   1
rywalka                  1
zawiść                   1

19 This time we excerpted occurrences only from originally Czech texts and their translations into Polish.
20 The only occurrence with the equivalent zazdrosny presents an example with an ellipsis:
(cz) […] spokojen, že mu nemá co závidět…
(pl) […] zadowolony, że nie potrzebuje być zazdrosny… [Paral-VeletrhSplnenych].

As in the case of the previous verb, there are also some wrongly aligned equivalents (być, osiłek, darzyć, owszem). The words rywalka and zawiść can be treated as elements of structures synonymous with the concept of ‘jealousy’ (zazdrość). The most frequent equivalent is the word zazdrosny – a component of the phrase być zazdrosny (to be jealous). We also checked equivalents of the verb žárlit in the Czech-Polish part of InterCorp.

Table 7. Polish equivalents of the Czech verb žárlit based on the Czech-Polish part of the parallel corpus InterCorp.

Polish equivalents of žárlit    59
być zazdrosny                   46
zazdrościć                      7
zazdrość                        4
zawiść                          1
error                           1
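The share of the leading equivalent in Tables 5 and 7 follows directly from the raw counts. The Python sketch below recomputes it and applies a decision rule; the rule and its threshold of 0.75 are hypothetical values chosen for illustration and are not taken from the algorithm described in this chapter.

```python
# Counts copied from Table 5 (závidět) and Table 7 (žárlit).
INTERCORP = {
    "závidět": {"zazdrościć": 45, "pozazdrościć": 3, "być zazdrosny": 1, "other": 1},
    "žárlit": {"być zazdrosny": 46, "zazdrościć": 7, "zazdrość": 4, "zawiść": 1, "error": 1},
}


def top_equivalent(counts):
    """Return the most frequent equivalent and its share of all occurrences."""
    total = sum(counts.values())
    word = max(counts, key=counts.get)
    return word, counts[word] / total


def settled_at_step_one(counts, threshold=0.75):
    """Hypothetical rule: accept the top equivalent if its share reaches the threshold."""
    return top_equivalent(counts)[1] >= threshold


if __name__ == "__main__":
    for verb, counts in INTERCORP.items():
        word, share = top_equivalent(counts)
        print(f"{verb}: {word} ({share:.0%}), settled: {settled_at_step_one(counts)}")
```

For závidět the leading equivalent zazdrościć covers 45 of 50 occurrences (90%), and for žárlit the phrase być zazdrosny covers 46 of 59 (about 78%).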

About 78% of occurrences include the equivalent być zazdrosny21 and we can consider it as the proper equivalent. The verbs závidět and žárlit were also subjected to another investigation (Kaczmarska 2016). A thorough analysis (both syntactic and semantic) was conducted and it confirmed that their equivalents can be found at the first step of the presented algorithm. The analysis allowed us to build a network of meanings for the analysed units:

21 In the aligned segments, there is also the verb zazdrościć as an equivalent. These occurrences do not include, however, the typical object of jealousy (expressed by a nominal phrase in the Genitive), but rather the reason for being jealous, expressed by a sentential phrase:
(cz) Povídám, jako vždycky, von na mě žárlí, že jsem mladší než von.
(pl) Powiadam, jak zawsze, on mi zazdrości, że jestem młodszy niż on [Hrabal-Prilis_hl_samot]
The verb zazdrościć (as an equivalent of žárlit) also appears in constructions with ellipsis:
(cz) Právě proto, že už nechce žárlit, bere vážně a bez podezření jeho tvrzení!
(pl) Właśnie dlatego, że nie chce już zazdrościć, przyjmuje jego słowa poważnie i bez podejrzeń! [Kundera-Valcik_na_rozl].

Figure 3. The network of meanings of the Czech units závidět and žárlit and of the Polish units zazdrościć and być zazdrosnym.22

As mentioned in Section 3.1 above, the meanings of Czech verbs expressing negative emotions and feelings (e.g. trápit, mrzet) are easier to recognize than those of verbs expressing positive emotions and feelings (Kaczmarska 2014a). As a consequence, translating these verbs into Polish is simpler (Kaczmarska 2016). Explaining why this happens would require a thorough semantic analysis of two (numerous) groups of verbs, those expressing positive and those expressing negative emotions and feelings. However, conducting such an analysis is beyond the scope of this study.

5. Conclusions and perspectives

Using corpora in the process of establishing equivalents seems to be obvious and necessary. A parallel corpus in particular makes it possible to define clusters of equivalents, which are essential for any further steps. Although a parallel corpus can assist in the development of comparative analyses, the research is often confronted with difficulties due to incompatible tools.23 As we found Word Sketches promising for our research, we prepared the tool for the Polish part of InterCorp, but it is not available for external users of InterCorp. Word Sketches for the Czech part of InterCorp is in preparation. We hope that Word Sketches applied to both the Czech and the Polish parts of InterCorp will be the turning point for building our algorithm. At a later stage, we obtained rather disappointing results based on Case Grammar. The method using Case Grammar will be tested on a larger number of verbs to see if it deserves to be developed further or discarded. We also realize that the project needs a deeper cognitive analysis of the most difficult units.24 However, the cognitive method is very difficult to implement in the algorithm and must be carried out manually. The analysis also clearly showed the problem of “nonexistence” of a concept in the other language. Translation of such words always leads to an arbitrary decision by the translator. We hope that our algorithm will be able to work with machine translation tools.25 This is why, in addition to a manual analysis of the valency requirements, we also conduct experimental trials of stochastic modeling of the choice of an equivalent on the basis of the context. We use two methods: the first is based on the context of a few lexemes before and after the keyword, and the second on lexemes directly dependent on the keyword. The methods and results of the research are presented in a separate paper (Kaczmarska, Rosen, Hana & Hladká 2015).

22 The network is based on definitions from monolingual dictionaries available online: http://goo.gl/KauBD5; http://goo.gl/nYxZ8E (for the verbs závidět and žárlit); http://sjp.pwn.pl/sjp/zazdrosc;2544740.html; http://sjp.pwn.pl/sjp/zazdro%C5%9Bci%C4%87;2544739 (for the units zazdrościć and być zazdrosnym).
23 We face similar problems while working with monolingual corpora. Word Sketches are available for SYN (Czech National Corpus). For the Polish language, a comparable corpus is NKJP (National Corpus of Polish), but we cannot use Word Sketches for NKJP. Furthermore, the Czech and Polish corpora have different statistical functions, which makes corpus-based comparative analyses even more complicated.
24 In problematic cases we can refer to explications and natural semantic metalanguage (Wierzbicka 1980, 2001) or construct a scale of the intensity of a feature expressed by a given verb (Mikołajczuk 1997, 1999; Bratman 1987).
25 Work on algorithms improving machine translation and differentiating the meanings of ambiguous units (e.g. WSD – Word Sense Disambiguation) has been carried out for a long time, also for parallel corpora. These algorithms are mostly based on data obtained from very large corpora using various mathematical (mainly statistical) methods, cf. e.g. Tian et al. 2014; Młodzki et al. 2012; Tian et al. 2010; Han et al. 2013; Kędzia et al. 2014. The algorithms developed also use various linguistic approaches; for more on this subject see Han et al. 2013.
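The first of the two context-based methods mentioned in the Conclusions, choosing an equivalent from a few lexemes before and after the keyword, can be illustrated with a very small count-based model. This is a sketch under loud assumptions and not the model of Kaczmarska, Rosen, Hana & Hladká (2015): the function names, the window size, the voting scheme and the toy sentences are all invented here, and the multi-word unit mít rád is treated as a single lemma for simplicity.

```python
from collections import Counter, defaultdict


def context_window(lemmas, index, size=2):
    """Lemmas up to `size` positions before and after the keyword at `index`."""
    left = lemmas[max(0, index - size):index]
    right = lemmas[index + 1:index + 1 + size]
    return left + right


def train(examples, size=2):
    """Count how often each context lemma co-occurs with each equivalent."""
    model = defaultdict(Counter)
    for lemmas, index, equivalent in examples:
        for lemma in context_window(lemmas, index, size):
            model[lemma][equivalent] += 1
    return model


def choose(model, lemmas, index, size=2):
    """Vote for the equivalent best supported by the context; None if unseen."""
    votes = Counter()
    for lemma in context_window(lemmas, index, size):
        votes.update(model.get(lemma, {}))
    return votes.most_common(1)[0][0] if votes else None


# Invented toy training data: (lemmatized sentence, keyword position, Polish equivalent).
TOY_EXAMPLES = [
    (["mít rád", "čokoláda"], 0, "lubić"),
    (["mít rád", "pivo"], 0, "lubić"),
    (["strašně", "mít rád", "ty"], 1, "kochać"),
]

if __name__ == "__main__":
    model = train(TOY_EXAMPLES)
    print(choose(model, ["strašně", "mít rád", "on"], 1))
```

The second method would replace `context_window` with the lemmas that depend syntactically on the keyword, which requires a parsed corpus and is not sketched here.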

References

Bratman, M. E. 1987. Intentions, Plans, and Practical Reason. Massachusetts: Harvard University Press.
Čermák, F. & A. Rosen. 2012. “The case of InterCorp, a multilingual parallel corpus.” International Journal of Corpus Linguistics 13(3), 411–427.
Čermáková, A. 2009. Valence českých substantiv. Praha: Nakladatelství Lidové noviny.
Daneš, F. & Z. Hlavsa. 1987. Větné vzorce v češtině. Praha: Academia.
Dębski, A. 1982. “Semantyczna walencja czasownika w aspekcie konfrontatywnym.” Biuletyn Polskiego Towarzystwa Językoznawczego 39, 79–90.
Ebeling, J. & S. O. Ebeling. 2013. Patterns in contrast. Amsterdam: John Benjamins.
Fillmore, C. J. 1968. “The Case for Case.” In E. Bach & R. T. Harms (eds.), Universals in Linguistic Theory, 1–88. New York: Holt, Rinehart, and Winston.
Geeraerts, D. 2010. Theories of Lexical Semantics. Oxford: Oxford University Press.
Greń, Z. & D. Rytel-Kuc. 1991. “Wykorzystanie przekładów literackich w pracy nad dwujęzycznym słownikiem walencyjnym.” In H. Běličová, G. Nieszczimienko & Z. Rudnik-Karwatowa (eds.), Problemy teoretyczno-metodologiczne badań konfrontatywnych języków słowiańskich, 69–78. Warszawa: Instytut Słowianoznawstwa Polskiej Akademii Nauk.
Halliday, M. A. K. 1985. An Introduction to Functional Grammar. London: Arnold.
Han, A. L.-F., Lu, Y., Wong, D. F., Chao, L. S., He, L. & J. Xing. 2013. “Quality Estimation for Machine Translation Using the Joint Method of Evaluation Criteria and Statistical Modeling.” In Proceedings of the Eighth Workshop on Statistical Machine Translation, 365–372. Association for Computational Linguistics.
Havránek, B. (ed.) 1989. Slovník spisovného jazyka českého. Praha: Academia.
Hunston, S. & G. Francis. 2000. Pattern Grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: John Benjamins.
Jirásek, K. 2011. “Využití paralelního korpusu InterCorp k získávání ekvivalentů pro chorvatsko-český slovník.” In F. Čermák (ed.), Korpusová lingvistika Praha 2011: 1 – InterCorp, 45–55. Praha: Nakladatelství Lidové noviny.
Kaczmarska, E. 2001. “Badanie struktury walencyjnej czeskich i polskich predykatów posiadających pozycję Experiencera.” Studia z Filologii Polskiej i Słowiańskiej 37, 177–187.
Kaczmarska, E. 2002. “Nominalizacje odczasownikowe w języku polskim i czeskim (wybrane problemy).” Studia z Filologii Polskiej i Słowiańskiej 38, 87–99.
Kaczmarska, E. 2010. “Analiza zdolności konotacyjnych polskich i czeskich predykatów odnoszących się do strachu, złości i wstydu.” In J. Goszczyńska & Z. Greń (eds.), Res slavisticae, 135–153. Warszawa: Wydział Polonistyki Uniwersytetu Warszawskiego.
Kaczmarska, E. 2012. “Czeski czasownik ‘zdát se’ w przekładzie na język polski (na podstawie badań z wykorzystaniem czesko-polskiego korpusu równoległego InterCorp).” Studia z Filologii Polskiej i Słowiańskiej 47, 247–261.
Kaczmarska, E. 2014a. “Czeskie czasowniki oznaczające stany psychiczne – sposoby ustalania polskich ekwiwalentów na podstawie korpusu równoległego InterCorp.” In A. Stolarczyk-Gębiak & M. Woźnicka (eds.), Zbliżenia. Językoznawstwo – Literaturoznawstwo – Translatologia, 45–55. Konin: Państwowa Wyższa Szkoła Zawodowa w Koninie.
Kaczmarska, E. 2014b. “Czy na pewno się (nie)rozumiemy? O problemach, uproszczeniach i stratach w przekładzie (na podstawie czesko-polskiej części korpusu równoległego InterCorp).” In M. Benešová, R. Rusin Dybalska & L. Zakopalová (eds.), Proměny polonistiky. Tradice a výzvy polonistických studií, 192–199. Praha: KAROLINUM.
Kaczmarska, E. 2015a. “Mít rád czy milovat? O czeskiej miłości po polsku.” In M. Falkowska & K. Waszakowa (eds.), Pojęcia zapisane w języku, 139–156. Warszawa: Wydział Polonistyki Uniwersytetu Warszawskiego.
Kaczmarska, E. 2015b. “W poszukiwaniu znaczenia czasowników wyrażających stany psychiczne. Analiza czeskich czasowników i ich polskich ekwiwalentów – próba implementacji wybranych teorii lingwistycznych (walencja, gramatyka przypadków głębokich, Pattern Grammar, lingwistyka kognitywna).” Prace Filologiczne 67, 131–150.
Kaczmarska, E. 2016. “O dwóch czeskich jednostkach wyrażających negatywne stany emocjonalne i ich polskich ekwiwalentach. Analiza na materiale z korpusu paralelnego InterCorp.” In E. Gruszczyńska & A. Leńko-Szymańska (eds.), Polskojęzyczne korpusy równoległe. Polish-language Parallel Corpora, 227–248. Warszawa: Instytut Lingwistyki Stosowanej Uniwersytetu Warszawskiego.
Kaczmarska, E. & A. Rosen. 2013. “Między znaczeniem leksykalnym a walencją – próba opracowania metody ekstrakcji ekwiwalentów na podstawie korpusu równoległego.” Studia z Filologii Polskiej i Słowiańskiej 48, 103–121.
Kaczmarska, E. & A. Rosen. 2014a. “Czego nie można wyrazić w języku polskim, czyli o leksykalnych w nim brakach.” Polonica 34, 53–66.
Kaczmarska, E. & A. Rosen. 2014b. “Praktyczny przewodnik po korpusie równoległym InterCorp.” In M. Hebal-Jezierska (ed.), Praktyczny przewodnik po korpusach języków słowiańskich, 207–231. Warszawa: Wydział Polonistyki Uniwersytetu Warszawskiego.
Kaczmarska, E., Rosen, A., Hana, J. & B. Hladká. 2015. “Syntactico-semantic analysis of arguments as a method for establishing equivalents of Czech and Polish verbs expressing mental states.” Prace Filologiczne 67, 151–174.

Kędzia, P., Piasecki, M., Kocoń, J. & A. Indyka-Piasecka. 2014. “Distributionally Extended Network-Based Word Sense Disambiguation in Semantic Clustering of Polish Texts.” In IERI Procedia (International Conference on Future Information Engineering) vol. 10, 38–44. DOI: 10.1016/j.ieri.2014.09.073.
Kilgarriff, A. & D. Tugwell. 2002. “Sketching words.” In M.-H. Corréard (ed.), Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins, 125–137. EURALEX.
Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P. & V. Suchomel. 2014. “The Sketch Engine: ten years on.” Lexicography: Journal of ASIALEX 1(1), 7–36.
Korytkowska, M. 1984. “Kategoria przypadka semantycznego (na materiale języka polskiego, bułgarskiego i serbsko-chorwackiego).” In V. Koseska-Toszewa & M. Korytkowska (eds.), Studia konfrontatywne polsko-południowosłowiańskie, 11–38. Wrocław: Zakład Narodowy im. Ossolińskich.
Korytkowska, M. 1992. Typy pozycji predykatowo–argumentowych. Gramatyka konfrontatywna bułgarsko-polska. Warszawa: Slawistyczny Ośrodek Wydawniczy.
Korytkowska, M. 1993. “O konfrontatywnym opisie predykatorów bułgarskich i polskich (na przykładzie jednostek otwierających miejsce dla argumentu o wartości Experiencer).” In V. Koseska-Toszewa & M. Korytkowska (eds.), Studia gramatyczne bułgarsko-polskie 5–6, 121–150.
Langacker, R. 1987. Foundations of Cognitive Grammar, Vol. 1, Theoretical Prerequisites. Stanford: Stanford University Press.
Langacker, R. 1991. Foundations of Cognitive Grammar, Vol. 2, Descriptive Application. Stanford: Stanford University Press.
Langacker, R. 2008. Cognitive Grammar: A Basic Introduction. New York: Oxford University Press.
Levin, B. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago: University of Chicago Press.
Lewandowska-Tomaszczyk, B. 1984. Conceptual Analysis, Linguistic Meaning, and Verbal Interaction. Łódź: Wydawnictwo Uniwersytetu Łódzkiego.
Lewandowska-Tomaszczyk, B. (ed.) 2005. Podstawy językoznawstwa korpusowego. Łódź: Wydawnictwo Uniwersytetu Łódzkiego.
Lewandowska-Tomaszczyk, B. 2013. “Komunikacja i konstruowanie znaczeń w przekładzie.” Paper presented at the conference Zbliżenia: językoznawstwo – translatoryka – literaturoznawstwo, Konin, Poland, November 13–14, 2013.
Mikołajczuk, A. 1997. “Pole semantyczne ‘gniewu’ w polszczyźnie (Analiza leksemów: gniew, oburzenie, złość, irytacja).” In R. Grzegorczykowa & Z. Zaron (eds.), Semantyczna struktura słownictwa i wypowiedzi, 149–171. Warszawa: Wydawnictwa Uniwersytetu Warszawskiego.

Mikołajczuk, A. 1999. Gniew we współczesnym języku polskim. Analiza semantyczna. Warszawa: Wydawnictwo Energeia.
Młodzki, R., Kopeć, M. & A. Przepiórkowski. 2012. “Word Sense Disambiguation in the National Corpus Of Polish.” Prace Filologiczne LXIII, 155–166.
Och, F. J. & H. Ney. 2003. “A Systematic Comparison of Various Statistical Alignment Models.” Computational Linguistics 29(1), 19–51.
Przepiórkowski, A., Bańko, M., Górski, R. & B. Lewandowska-Tomaszczyk (eds.) 2012. Narodowy Korpus Języka Polskiego. Warszawa: Wydawnictwo PWN.
Pustejovsky, J. 1993. Semantics and the Lexicon. Berlin: Springer.
Rosen, A. & M. Vavřín. 2014. Korpus InterCorp – čeština, verze 7 z 19. 12. 2014. Retrieved from: http://www.korpus.cz.
Rytel, D. 1989. “Wybrane problemy opisu walencyjnego języka.” Studia z Filologii Polskiej i Słowiańskiej 26, 237–247.
Rytel-Kuc, D. (ed.) 1991. Walencja czasownika a problemy leksykografii dwujęzycznej. Wrocław: Zakład Narodowy im. Ossolińskich.
Siatkowski, J. & M. Basaj. 2002. Słownik czesko-polski. Warszawa: Wiedza Powszechna.
Skoumalová, H. 2008. “Extracting dictionaries from parallel corpora.” In Proceedings of The Third Baltic Conference on Human Language Technologies, 297–301. Kaunas: Vytautas Magnus University.
Tian, L., Wong, D. F. & S. Chao. 2010. “An Improvement of Translation Quality with Adding Key-Words in Parallel Corpus.” Machine Learning and Cybernetics 3, 1273–1278. DOI: 10.1109/ICMLC.2010.5580888.
Tian, L., Wong, D. F., Chao, S. & F. Oliveira. 2014. “A Relationship: Word Alignment, Phrase Table, and Translation Quality.” The Scientific World Journal. Hindawi Publishing Corporation. DOI: 10.1155/2014/438106.
Urbańczyk-Adach, N. 2011. Wariantywność walencji czeskiego czasownika. Warszawa: Slawistyczny Ośrodek Wydawniczy.
Wierzbicka, A. 1971. Kocha – lubi – szanuje. Medytacje semantyczne. Warszawa: Wiedza Powszechna.

Corpora

Czech National Corpus – InterCorp. Institute of the Czech National Corpus. Available online at: http://www.korpus.cz.
National Corpus of Polish. Available online at: http://nkjp.pl.

Marcin Trojszczak
University of Lodz

Problem solving in English and Polish: A cognitive corpus-based study of selected metaphorical conceptualizations

Abstract: The study aims to present selected aspects of metaphorical conceptualization of problem solving that are shared between English and Polish on the basis of linguistic data from the British National Corpus and the National Corpus of Polish. The study approaches metaphorical conceptualizations of problem solving from the perspective of cognitive corpus-based linguistics. It combines the theoretical framework of Conceptual Metaphor Theory with the methodological workbench of corpus linguistics. The paper focuses on five selected ways of conceptualizing problem solving that were found in the analyzed corpora: (1) problem solving is breaking down a physical object into smaller parts; (2) problem solving is moving over/around a physical obstacle; (3) problem solving is removing a physical obstacle; (4) problem solving is leveling out a physical surface; (5) problem solving is fighting an enemy. In line with Szwedek’s theory of objectification it is proposed that all identified conceptual metaphors are ultimately motivated by the notion of a physical object. It is also plausible to claim that they are specific realizations of two over-arching conceptual metaphors, namely, abstract objects are physical objects and mental activity is a physical activity. Besides discussing the underlying conceptual motivations for such expressions in English and Polish, the study indicates that metaphorical similarities of this kind can potentially occur cross-linguistically, which opens new paths for future research.

Keywords: Conceptual metaphor theory, objectification, mind, problem solving, cognitive linguistics, corpora

1. Introduction

Metaphor has been studied from several theoretical perspectives including philosophy, linguistics, psychology, and more recently, neuroscience (see Gibbs 2008; Ortony 1993 for edited collections of studies). In linguistics, metaphors have been approached from cognitive (e.g. Lakoff & Johnson 1980; Kövecses 2010), psycholinguistic (e.g. McGlone 1996; Glucksberg 2003), neurolinguistic (e.g. Forgács, Lukács & Pléh 2014; Lakoff 2014) and discourse (e.g. Cameron 2003; Zinken, Hellsten & Nerlich 2008) perspectives. The following study investigates selected parallelisms in metaphorical conceptualizations of problem solving in English and Polish from the perspective of corpus-based cognitive semantics (Stefanowitsch 2006; Heylen, Tummers & Geeraerts 2008; Fabiszak & Konat 2013). It combines the theoretical framework of Conceptual Metaphor Theory and the methodological workbench of corpus linguistics (Deignan 2005, 2008; see also Glynn & Fischer 2010; Glynn & Robinson 2014; Waliński 2013).

From a psychological perspective, problem solving is an essential and pervasive process “that structures everyday life in meaningful ways” (Green & Gilhooly 2005: 347). The fact that problem solving is one of the central cognitive processes means that it warrants attention from both cognitive psychologists and cognitive linguists (Palmer 2003). The present study is a practical implementation of this idea. It follows Dancygier and Sweetser’s (2014: 182) recent postulate that cross-linguistic studies of figurative language focusing on sources of metaphorical conceptualizations are in high demand. Moreover, the study follows the main tenets of Conceptual Metaphor Theory, which assumes that numerous linguistic representations are motivated by universally shared embodied categories and conceptual metaphors (Kövecses 2005; 2015; Kardela 2006; Johansson Falck & Gibbs 2012).

Comparative cognitive corpus studies (Lewandowska-Tomaszczyk 2012) focusing on linguistic representations of problem solving can provide insights into the ways in which this cognitive process is metaphorically conceptualized in different languages. This paper concentrates on the data from English and Polish, which share common Indo-European ancestry reflected in certain semantic features. However, despite their shared origins, the two languages belong to different subgroups within the Indo-European language family: English is a Germanic language and Polish is a Slavic language (Clackson 2007). This results in mutual unintelligibility and numerous phonological, syntactic, and semantic discrepancies (Harbert 2006; Sussex & Cubberley 2006). The rationale behind this study is to investigate potential similarities in the way the speakers of English and Polish metaphorically conceptualize problem solving.

2. Problem solving

Problem solving as a cognitive process has been studied predominantly in cognitive psychology. Researchers in this area try to find out what kinds of processes take place in the human mind when a person is solving various types of problems (Davidson & Sternberg 2003; Nęcka, Orzechowski & Szymura 2008; Bassok & Novick 2012; Mayer 2013). In order to understand what problem solving is, one needs to start from the definition of a problem. Though in practice we encounter various sorts of problems that challenge us in different ways (Green &
Gilhooly 2005), in cognitive psychology a problem is technically defined as “a difference between actual state of things and intended or imposed aim (desired state) that cannot be removed in a routine way” (Nęcka, Orzechowski & Szymura 2008: 484). A problem involves three aspects: a start state, a goal state, and a set of operators, i.e. cognitive actions that are applied in order to move closer to the desired goal state (Green & Gilhooly 2005). For cognitive psychologists, problems are types of relations between situations and people. Thus, problem solving can be defined as “an activity directed towards reducing discrepancy between actual and desired state that is based on the application of planned sequence of cognitive actions” (Nęcka, Orzechowski & Szymura 2008: 484). Problem solving involves taking various steps, the so-called cognitive actions. The whole process is described as the problem-solving cycle and includes seven steps (Sternberg 2009: 430–434). The first step of the cycle is problem identification. Here, the problem-solver simply recognizes the existence of a problematic situation. In the next step, called problem definition and representation, he or she defines the problem in order to come up with a solution. Next, the problem-solver embarks on strategy formulation. Here, possible solutions are enumerated, which can involve both analysis and synthesis. The fourth step involves organizing information: one needs to collect and integrate all the information necessary to deal with the problem. The fifth step is resource allocation. It is a vital part of the cycle because resources such as time, space, money, knowledge, or energy set limits on the capacity to solve a given problem. The sixth element of the cycle is monitoring. This is the process of supervising and checking one’s progress on the path towards achieving the desired state. The last part of the cycle is evaluation, which can be defined as assessing the solution.
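The three-part definition of a problem given above (a start state, a goal state, and a set of operators) can be illustrated with a short sketch. It is only an illustration under stated assumptions: the numeric toy domain, the function name `solve`, and the two operators are hypothetical and do not come from the psychological literature cited here, and breadth-first search is just one simple way of finding a planned sequence of operators.

```python
from collections import deque


def solve(start, goal, operators):
    """Breadth-first search for a sequence of operators leading from start to goal.

    operators maps an operator name to a function that takes a state and
    returns the resulting state, or None when the operator does not apply.
    Returns the list of operator names, or None if the goal is unreachable.
    """
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if state == goal:  # discrepancy between actual and desired state removed
            return path
        for name, apply_operator in operators.items():
            next_state = apply_operator(state)
            if next_state is not None and next_state not in seen:
                seen.add(next_state)
                frontier.append((next_state, path + [name]))
    return None


# Hypothetical toy problem: reach 10 from 1 by adding one or doubling.
operators = {
    "add_one": lambda n: n + 1 if n < 20 else None,
    "double": lambda n: n * 2 if n < 20 else None,
}
plan = solve(1, 10, operators)
print(plan)
```

In this toy setting the returned plan plays the role of the “planned sequence of cognitive actions”; real-life problems of the ill-defined kind do not come with such an explicit goal test or operator set.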
Problems can be categorized in numerous ways (Robertson 2001; Sternberg 2009; Bassok & Novick 2012; Mayer 2013). Basic classifications reviewed by Nęcka, Orzechowski and Szymura (2008: 487–491) involve division along the lines of their complexity. If a problem has one solution, it is described as a convergent one. Problems with more than one solution are called divergent problems. Unlike simple problems, complex ones require formulating complicated mental models and representations. In turn, well-defined problems have clear paths to solutions. On the other hand, ill-defined problems are characterized by no clear solving paths. Solving problems in real life involves the following features: intransparency, polytely (many goals), complexity of the situation, connectivity of variables, dynamic developments, and time-delayed effects (Funke 1991).

Problem solving involves numerous mental processes. The factors that positively affect this activity include, among others, positive transfer, incubation, and insight (Sternberg 2009). Positive transfer is a process of using the knowledge that was deployed in solving one problem in order to solve another one with a similar deep structure. This transfer is very often based on analogy (Nęcka, Orzechowski & Szymura 2008: 533). Incubation is a way of avoiding negative transfer. It is an act of putting the problem in question aside, taking a pause from a given stage in problem solving (Nęcka, Orzechowski & Szymura 2008: 535). Insight is defined as a sudden change in the perception of the problem that leads to understanding its nature (Sternberg 2009: 444).

3. Methodological framework

This study focuses on selected metaphorical conceptualizations of problem solving in linguistic expressions in English and Polish. The theoretical framework employed in this research is Conceptual Metaphor Theory (Lakoff & Johnson 1980, 1999; Lakoff 1993; Kövecses 2010). Over the years, the theory has undergone several changes, developments, and expansions (see Grady 2007; Kövecses 2010; Ruiz de Mendoza Ibáñez & Pérez Hernández 2011; Steen 2011, 2013; Dancygier & Sweetser 2014 for reviews). Its main tenet is the assumption that the human conceptual system is populated by special conceptual structures called conceptual metaphors. Conceptual Metaphor Theory emphasizes the experiential motivation of numerous metaphorical patterns that emerge from conceptual domains closely related in everyday, embodied experience. From this perspective, metaphor is not merely a linguistic figure of speech, but rather a conceptual structure, which to a large extent organizes our ways of thinking in various domains. It can be essentially defined as “understanding one conceptual domain in terms of another conceptual domain” (Kövecses 2010: 4). Accordingly, conceptual metaphor is a type of unconscious projection, or conceptual mapping, between source (more concrete) and target (more abstract) domains. This projection is both systematic and asymmetrically directional (Grady 2007; Kövecses 2010). Systematicity means that the mapping between the two conceptual domains involves not only the objects and general characteristics of the domain but also domain-specific relations, events, and scenarios. Directionality means that the inferences about objects, relations, and events are mapped from the source to the target domain, and not vice versa.
In this study, linguistic expressions related to problem solving are approached from the perspective of cognitive corpus-based semantics (Deignan 2005; Stefanowitsch 2006; Heylen, Tummers & Geeraerts 2008; Deignan & Cameron 2013; Fabiszak & Konat 2013). It combines the theoretical perspective of Conceptual Metaphor Theory and the methodological workbench of corpus linguistics, defined as “the study of language data on a large scale that involves computer-aided analysis of extensive collections of spoken and written texts” (McEnery & Hardie 2012: i). The emphasis on empirical data brings a number of advantages to linguistic inquiry, including objectivity and verifiability of results, access to statistics on the frequency of language patterns, and access to samples of language usage reflecting the linguistic behavior of a given population. The combination of these two perspectives, which is crucial for the present study of Polish and English speakers, results in an approach that enables us to test empirically the intuitive explanatory notions of Conceptual Metaphor Theory. Relying on empirical linguistic data found in two referential corpora for Polish and English enables us to fend off methodological objections that are commonly made against Conceptual Metaphor Theory (e.g. Sandra & Rice 1995; Sandra 1998; Deignan 2005; McGlone 2007; Sinha 2007; Strugielska 2014). One of the most important of these objections is the researchers’ overreliance on decontextualized examples. As summarized by Semino, Heywood, and Short (2004: 1273), “in practice, (…) most claims about the existence of particular conceptual metaphors from Reddy (1979) and Lakoff and Johnson (1980) onwards have been based on lists of decontextualized sentences, all supposedly realizing the same underlying mapping in the minds of the speakers of a language”. The present study addresses this problem by making use of data from linguistic corpora. This particular methodological perspective, known in the literature as a corpus-illustrated approach (Tummers, Heylen & Geeraerts 2005), allows for the reconciliation of introspectively made intuitions (hypotheses) and empirical data (the samples of Polish and English) (Geeraerts 2010).
Although such an approach does not resolve the problem of annotating empirical language data (Gries & Divjak 2010), which is ultimately dependent on the choices made by the researcher, it does allow for the study of usage as exemplified in natural language samples included in corpora (Croft 2000; Tummers, Heylen & Geeraerts 2005; Bybee 2010). On these grounds, the present cognitive corpus-illustrated study aims to demonstrate, at least to some degree, what common underlying metaphorical expressions are employed by speakers of English and Polish when they describe the activity of problem solving.

4. Analysis

The following study is based on an analysis of linguistic expressions comprising the lexemes problem (in English) and problem (in Polish) with accompanying verbs. The analysis of such collocations enables us to reconstruct the elements of the metaphorical model of problem solving in English and Polish. The data analyzed in this study come from two referential corpora: the British National Corpus (henceforth, BNC) and the National Corpus of Polish (henceforth, NCP). Both conform to the principle of standard reference (McEnery & Hardie 2012) by being widely available to all researchers and finite (no more texts being added after the final compilation). The British National Corpus is a 100-million-word collection of both spoken and written British English. The corpus draws on a wide range of sources, including specialist periodicals, popular fiction, unpublished informal communication, academic books, regional and national publications, as well as other texts. The texts included in the corpus are not limited to any particular register, genre, or subject field (Aston & Burnard 1998; see: www.natcorp.ox.ac.uk for more information). The National Corpus of Polish includes a 240-million-word collection of both spoken and written Polish. The written part of the corpus comprises daily newspapers, Internet texts of many types, periodicals, journals, as well as classic literature. The structure of sources used in the NCP loosely mirrors that of the BNC (Przepiórkowski, Bańko, Górski & Lewandowska-Tomaszczyk 2012; see: www.nkjp.pl for more information).

The data collection was carried out using the HASK browser developed by Piotr Pęzik (2013, 2014) at the University of Łódź,1 which is an advanced online tool for analyzing and visualizing collocational data. The browser gives access to an interactive database that covers the most common collocations generated from the BNC and the NCP. It enables us to find collocations (e.g. verb-noun, adjective-noun) of numerous words by typing them in. The browser not only shows the relevant data but also allows for sorting the output according to statistical significance, downloading the data in the form of an Excel spreadsheet, and displaying KWIC concordances, i.e. the fragments of the original texts in which a given collocation has been found, together with the left and right context.

The research part of this study is based on the data provided by the HASK browser. The data collection procedure for both languages includes: (1) typing in the lexemes problem (for English) and problem (for Polish); (2) analyzing their collocations with verbs by inspecting the text fragments provided in order to separate metaphorical from literal collocations, following the general guidelines of the Metaphor Identification Procedure (Pragglejaz 2007); (3) compiling a tentative list of collocations that includes metaphorical linguistic expressions referring to the activity of problem solving.
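The kind of output described above, KWIC lines and verb collocates of the node word, can be sketched in a few lines of Python. This is a simplified illustration and not the HASK browser’s implementation: the function names `kwic` and `verb_collocates`, the prefix match standing in for lemmatization, and the hand-made table of verb forms are all assumptions made for the example. The three sentences are BNC examples quoted in this chapter.

```python
import re
from collections import Counter

# BNC example sentences quoted in this chapter.
SENTENCES = [
    "The council has just about cracked the problem of multiple signs",
    "To surmount this problem, the elephants started tearing down trees",
    "I overcame the problem by making the shelves in the cupboard easily removable",
]

# Hypothetical hand-made mapping from inflected verb forms to lemmas.
VERB_FORMS = {"cracked": "crack", "surmount": "surmount", "overcame": "overcome"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def kwic(sentences, node, width=4):
    """Return (left context, node token, right context) for every hit."""
    rows = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            # Crude stand-in for lemmatization: also matches e.g. "problems".
            if token.startswith(node):
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((left, token, right))
    return rows


def verb_collocates(sentences, node, verb_forms, span=3):
    """Count known verb lemmas within `span` tokens to the left of the node."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        for i, token in enumerate(tokens):
            if token.startswith(node):
                for word in tokens[max(0, i - span):i]:
                    if word in verb_forms:
                        counts[verb_forms[word]] += 1
    return counts


for left, node, right in kwic(SENTENCES, "problem"):
    print(f"{left:>28} [{node}] {right}")
print(verb_collocates(SENTENCES, "problem", VERB_FORMS))
```

A sketch like this only retrieves candidate collocations; deciding which of them are metaphorical remains a manual step, as in stage (2) of the procedure.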

1 See: http://pelcra.pl/hask_pl/Home, http://pelcra.pl/hask_en/Home for more information.

The analysis is based on three sources of information about meanings: (1) dictionary meanings of the analyzed verbs from the Oxford English Dictionary (henceforth, OED) and the PWN Dictionary of Polish (henceforth, PWN); (2) their etymology;2 (3) other collocates of the analyzed verbs (extracted from the BNC and the NCP by means of the HASK browser). The data collection procedure described above is the point of departure for a qualitative interpretation of the culled data in the light of the basic tenets of Conceptual Metaphor Theory (Kövecses 2010; Dancygier & Sweetser 2014). The objective is to classify metaphorical linguistic expressions (selected collocations) in both languages and present the potentially shared conceptual metaphors they exemplify.3 Furthermore, it is important to note that the analysis focuses exclusively on examples (collocations) referring to the activity of problem solving. It does not cover expressions that refer to other types of actions connected with problems (e.g. somebody raises a problem, a problem surfaces, etc.).

5. Findings

The above-mentioned procedure applied to the BNC and the NCP enables us to distinguish five conceptual metaphors of problem solving4 that are present in both the English and the Polish corpus. These shared conceptualizations are specific instantiations of two overarching conceptual metaphors, ABSTRACT OBJECTS ARE PHYSICAL OBJECTS and MENTAL ACTIVITY IS A PHYSICAL ACTIVITY (see Jäkel 1995, 2003; Lakoff & Johnson 1999; Szwedek 2011).

2 For an example of a study applying etymological data in cognitive semantics see Sweetser 1990.
3 In the following analysis both synchronic (dictionary meanings) and diachronic (etymology) data are used (see Trim 2011; Diaz-Vera 2015 for discussions of the diachronic dimension of metaphor study). They are used in order to arrive at the most complete image of the underlying conceptual mappings. Assessing which of them are present in the minds of contemporary speakers and which are nontransparent for them is a matter of experimental research and therefore extends beyond the scope of this paper.
4 This classification does not exhaust all ways of metaphorically conceptualizing problem solving which could be found in the BNC and the NCP. The presentation of a complete metaphorical image of problem solving based on these corpora is beyond the scope of this article.

5.1 Problem Solving is Breaking Down a Physical Object into Smaller Parts

The corpus data indicate that both English and Polish speakers conceptualize problem solving in terms of breaking down a physical object into smaller parts. This general label covers various physical activities, which are similar in that they all refer to reducing something into more manageable, less complex elements. Breaking down a problem is exemplified in English by such expressions as to crack a problem, to solve a problem, and to resolve a problem. For instance, in the sentence “The council has just about cracked the problem of multiple signs” the problem is conceptualized as an object with a hard outer surface (e.g. a nut). Here, the activity of problem solving is conceptualized as hitting it in order to break its surface. The conceptualization of problem solving in such sentences as “You do not, of course, solve problems by merely throwing money at them” and “Sometimes even little children can help resolve problems when they are explained openly and simply” is more complex, as dictionary and etymological data point in different specific directions. Thus, on the one hand, problem solving can be viewed as being conceptualized in terms of dissolving a substance in a liquid. On the other hand, it can be understood as unfastening or loosening something.5

Similarly, in Polish, breaking down a physical object is expressed by means of verbs referring to the acts of untying, cutting, and breaking (see Majid & Bowerman 2007 for a review of the semantics of cutting and breaking). In the expression rozwiązać problem [EN lit.: to untie a problem] the act of problem solving is understood in terms of untying a knot or any other tangled physical object. This is exemplified in the sentence “Niczego się nie bał i wyobrażał sobie, że każdy problem można rozwiązać pięściami” [EN: He wasn’t afraid of anything and thought that every problem could be solved (lit. untied) with fists]. In turn, the expressions rozstrzygnąć problem [EN: to cut out a problem] and przeciąć problem [EN: to cut a problem through/in two] show that problem solving is perceived in terms of dividing something with a sharp tool, e.g. a knife or scissors. This is exemplified, for instance, in the sentence “Prezydent Jackson próbował rozstrzygnąć ten problem, ale bezskutecznie” [EN: President Jackson was trying to solve (lit. to cut out) this problem, but unsuccessfully]. The expression rozgryźć problem [EN: to gnaw at a problem], by contrast, refers to the activity of eating. For instance, in the sentence “Projektanci na razie głowią się jak rozgryźć niektóre problemy” [EN: For now, designers are thinking how to crack (lit. gnaw) some problems] breaking down a problem is likened to gnawing at a piece of food in order to break it into parts.6 Moreover, problem solving can be conceptualized directly in terms of breaking a physical object, i.e. rozbijać problem na mniejsze elementy [EN: to break a problem into smaller parts].

5 In Late Middle English (in the sense ‘loosen, dissolve, untie’) from Latin solvere ‘loosen, unfasten’ (OED).

5.2 Problem Solving is Removing a Physical Obstacle

In the conceptual metaphor PROBLEM SOLVING IS REMOVING A PHYSICAL OBSTACLE, the problem is conceptualized as an obstacle standing in the way, which must be moved from its present location to a different place. In English, the problem, understood as a physical object, can be removed from one place and put somewhere else (to remove a problem). In this expression problem solving is perceived as discarding something unwanted. The same action is exemplified by the expression to clear a problem. Here, the verb to clear refers to the action of getting rid of something (OED). In the expression to eradicate a problem, the problem is metaphorically construed as a plant (preferably a weed) that needs to be pulled out of the ground by the roots.7 It is exemplified, for instance, by the sentence “And the immediate effect of the mixed ability grouping was to eradicate behavioural problems of that kind almost entirely”. In Polish, this conceptual metaphor is present in such expressions as pozbyć się problemu [EN: to get rid of a problem] and usunąć problem [EN: to remove a problem] (e.g. “Ten problem usunięto i teraz podajemy krystalicznie czystą wodę” [EN: This problem was removed and now we deliver crystal clear water]). In both expressions the problem is conceived of as an unwanted object (an obstacle) that needs to be physically removed in order to be resolved.

6 In this context it is worth pointing out that the Polish idiomatic expression trudny orzech do zgryzienia [EN lit.: a hard nut to crack] is used as a synonym of the term problem. It is plausible to claim that this expression is another linguistic instantiation of the underlying conceptual metaphor PROBLEM SOLVING IS BREAKING DOWN A PHYSICAL OBJECT INTO SMALLER PARTS.
7 In Late Middle English (in the sense ‘pull up by the roots’) from Latin eradicat- ‘torn up by the roots’, from the verb eradicare, from e- (variant of ex-) ‘out’ + radix, radic- ‘root’ (OED).

5.3 Problem Solving is Moving Over/Around a Physical Obstacle

Problem solving can also be conceptualized in terms of moving in a certain manner over or around an obstacle encountered on the way. In expressions that exemplify the conceptual metaphor PROBLEM SOLVING IS MOVING OVER/AROUND A PHYSICAL OBSTACLE, movement takes place in a certain environment. The physical elements of this environment are metaphorically represented and serve as a means of conceptualizing the problem-solving activity. English expressions such as to overcome a problem, to surmount a problem,8 and to transcend a problem9 share the same underlying image in which the problem is represented as elevated ground of some sort. They are present in such sentences as “To surmount this problem, the elephants started tearing down trees” or “I overcame the problem by making the shelves in the cupboard easily removable”. The problem is conceptualized as a hill, a mountain, or another element of the environment that is situated higher. In turn, the activity of solving a problem is represented as climbing or going over this problematic physical obstacle.

In Polish, problem solving is conceptualized as both moving over and moving around a physical obstacle. The first action is exemplified in the expression przeskoczyć problem [EN: to jump over a problem]. For instance, in the sentence “Czasami mając nawet pieniądze nie jest człowiek w stanie przeskoczyć niektórych problemów” [EN: Sometimes even when you have money, you can’t surmount (lit. jump over) certain problems], the activity of solving a problem is conceptualized as jumping over some object that blocks our normal movement. In the expression obejść problem [EN: to go around a problem], e.g. “Problem ten należy obejść w jeden z poniższych sposobów” [EN: One should solve (lit. go around) this problem in one of the following ways], problem solving is conceptualized in terms of bypassing a physical obstacle. The same underlying image motivates the expression ominąć problem [EN: to bypass a problem]. This shows that one can solve a problem not only by going over it but also by simply circumventing it.

In sum, in the conceptual metaphor PROBLEM SOLVING IS MOVING OVER/AROUND A PHYSICAL OBSTACLE the problem-solver is conceptualized as a traveler who encounters a physical obstacle (e.g. a hill or a mountain) that metaphorically represents a problem. Solving is conceptualized in terms of bypassing or going over this obstacle.

8 In Late Middle English (also in the sense ‘surpass, be superior to’) from Old French surmonter (OED).
9 In Middle English from Old French transcendre or Latin transcendere, from trans- ‘across’ + scandere ‘climb’ (OED).

5.4 Problem Solving is Leveling Out a Physical Surface

The conceptual metaphor PROBLEM SOLVING IS LEVELING OUT A PHYSICAL SURFACE is present in both analyzed languages. In English, it is exemplified by the expression to iron out a problem. For instance, in the sentence “There are still problems to iron out, and as we saw in the passage by Ross, there are still some inconsistencies” problem solving is conceptualized as the activity of ironing. The problem is the creases and folds on a piece of cloth that must be pressed in order to be removed. Leveling out the surface of the problem is the final result of ironing it out. In Polish, this conceptual metaphor can be found, for instance, in the sentence “Uczniowe pytali jak przyszły wójt zamierza zniwelować problem walających się na terenie gminy śmieci” [EN: Students were asking how the prospective mayor was going to solve (lit. to smooth/to level) the problem of waste in the community]. The verb zniwelować [EN: to level] in its concrete meaning (PWN) refers to the action of aligning the land surface. This means that in the expression zniwelować problem [EN: to smooth/level a problem] the activity of problem solving is metaphorically understood as aligning a piece of uneven land. In the conceptual metaphor PROBLEM SOLVING IS LEVELING OUT A PHYSICAL SURFACE the problem is conceptualized in terms of a rough physical surface. In English and Polish, leveling it out is conceptualized in two different ways: as ironing or as smoothing it out. Although the two conceptualizations profile the situation somewhat differently, they share the same underlying structure in which the problem is a rough surface and solving it is perceived as making it level.

5.5 Problem Solving is Fighting an Enemy

The conceptual metaphor PROBLEM SOLVING IS FIGHTING AN ENEMY is based on a mapping in which problem solving is conceptualized in terms of fighting an enemy. The exact nature of the enemy is unknown, but one can surmise that the prototypical enemy is another human being. In English, this conceptualization is exemplified by the expression to beat a problem. The expression emphasizes the violent nature of problem solving as fighting. The problem is the enemy that needs to be defeated. The same physical activity motivates other English expressions, e.g. to combat a problem, to hit a problem, to fight a problem, as well as to cope with a problem.10 They can be found in such sentences as “No one has seriously challenged the view that attempted suicide should be regarded as an inappropriate way of coping with problems” and “In order to combat this problem a number of other indexing methods have been developed”.

In Polish, one can find similar metaphorical images. They are present in such sentences as “Mam nadzieję, że Parlament Europejski będzie podsuwał nam takie rozwiązania, które pomogą naszemu krajowi pokonać problemy gospodarcze” [EN: I hope that the European Parliament will give us such solutions which will help our country to cope with (lit. to defeat) economic problems] or “Dalsze problemy może powstrzymać jedynie siedmiodniowy tydzień pracy” [EN: Further problems could be solved (lit. withheld) only by a seven-day work week]. Such expressions as pokonać problem [EN: to defeat a problem], przezwyciężyć problem [EN: to overpower a problem], likwidować problem [EN: to liquidate a problem], eliminować problem [EN: to eliminate a problem], and zwalczać problem [EN: to fight a problem] all highlight the struggle between the problem-solver and his/her task. This physical struggle is also visible in the following expressions: załatwić problem11 [EN: to kill a problem], rozprawić się z problemem [EN: to blast a problem], uporać się z problemem [EN: to struggle successfully with a problem], and powstrzymać problem [EN: to hold back/withhold a problem]. In all of these expressions, the successful act of problem solving is equated with defeating, overpowering, or eliminating the problem. In other words, the problem that is solved is seen in terms of a defeated enemy.12

10 In Middle English (in the sense ‘meet in battle, come to blows’) from Old French coper, colper, from cop, colp ‘a blow’, from Greek kolaphos ‘blow with the fist’ (OED).

11 Załatwić is a polysemous term. In colloquial Polish it denotes killing (PWN).
12 This conceptual metaphor is an example of anthropomorphism (personification), which is a common tendency to ascribe human characteristics, intentions, and emotions to non-human objects (Epley, Waytz & Cacioppo 2007). The fact that problem solving is conceptualized in terms of fighting shows that this predilection can be viewed as one of the psychological motivations for the existence of this metaphor.

6. Conclusions and further research

The objective of this study was to present conceptualizations of problem solving in English and Polish.13 The analysis based on the data obtained from referential corpora for these languages (the BNC and the NCP) suggests that the speakers of English and Polish share common metaphorical conceptualizations of this mental activity. The hypothesis proposed in this study is that all conceptual metaphors of problem solving are ultimately motivated by the notions of a physical object and a physical activity that involves this physical object-problem. The first claim provides support for Szwedek’s theory of objectification. According to Szwedek (2011, 2014), abstract phenomena are ultimately conceptualized in terms of physical objects. This conception is an original solution to two fundamental problems in Conceptual Metaphor Theory, namely, how to draw a line between abstract and concrete domains, and what criterion enables us to distinguish between abstract and concrete entities. As for the first question, Szwedek (2011: 350) claims that the “ultimate experiential basis is our experience of physical objects, the only entities directly accessible to our senses”. As for the second, he (2011: 360) proposes “the experience of density (physicality) through touch to be the only, simple and clear criterion of distinction between material and phenomenological worlds”. For Szwedek (2011, 2014), the object schema is subject to no further metaphorization and serves as the basis for other types of metaphorical conceptualizations. The basis of objectification, as exemplified in such expressions as to give thought or a heavy thought, is identifying, conceptualizing, and verbalizing abstract entities in terms of “the only world that had been known to our ancestors, the world of physical objects” (Szwedek 2011: 344). This process leads to the creation of new concepts, e.g. THOUGHT, which inherit the properties of prototypical physical objects. They are based on the category OBJECT and its subcategories, which include MOVING OBJECT, ANIMATE OBJECT, HUMAN BEING, and, culturally, SUPERNATURAL BEING (Szwedek 2014). Objectification is seen as one of the most important processes in the phylogenetic development of abstract reasoning in humans (Szwedek 2011: 361). That physical objects are the ultimate source domains is evident not only from the analysis of metaphorical expressions, including the conceptualization of thought as well as of problem solving (as in this study), but also from supporting arguments from neuroscience, psychology, the Great Chain model, and Kotarbiński’s philosophy of reism (Szwedek 2011).

13 The study does not make any claims about the frequency of the analyzed expressions (and therefore their entrenchment in speakers’ minds or the language systems as such), but a cursory look at descriptive statistics provided by the HASK browser shows that some expressions, e.g. to transcend a problem (3 appearances in the BNC) and ominąć problem (13 appearances in the NCP), are extremely rare and some are significantly more frequent, e.g. to resolve a problem (more than 250 appearances in the BNC) and rozwiązać problem (more than 4500 appearances in the NCP).
The second claim is in line with Jäkel’s (1995, 2003) metaphorical model of mental activity, which assumes that all sorts of mental activity (thinking, understanding, problem solving, memory processes, etc.) are conceptualized in terms of physical activity, in particular physical manipulation. Combining Szwedek’s theory with Jäkel’s insights enables us to comprehensively account for shared metaphorical expressions found in the English and Polish corpora. We suggest that the identified conceptual metaphors of problem solving are ultimately motivated by two overarching conceptual metaphors, namely, ABSTRACT OBJECTS ARE PHYSICAL OBJECTS and MENTAL ACTIVITY IS A PHYSICAL ACTIVITY. The examples analyzed in this study confirm the fundamental assumptions of both theories. Moreover, this cross-linguistic study confirms Dancygier and Sweetser’s (2014: 181) claim that “there are deep commonalities in human perception and cognition which are reflected in language and in figurative models – and [that] there are deep and fascinating cultural differences”. The study shows that, within the context of underlying conceptual similarities for the five identified conceptual metaphors, the speakers of both languages still have room to choose different source domains, e.g. ironing versus smoothing out, dissolving versus biting, etc. (see Kövecses 2005 for a discussion of cultural and experiential dimensions in conceptual metaphors). Besides demonstrating metaphorical similarities between English and Polish, the study indicates that they can be comprehensively accounted for in terms of Szwedek’s theory of objectification and the insights from Jäkel’s model of mental activity.14 By demonstrating the existence of corresponding conceptualizations between English and Polish, the present study opens a set of new areas of investigation concerning conceptualizations of problem solving in other languages and sets the agenda for studying parallel conceptual metaphors across other languages (Trojszczak, in preparation).

References

Aston, G. & L. Burnard. 1998. The BNC Handbook. Exploring the British National Corpus with SARA. Edinburgh: Edinburgh University Press.
Bassok, M. & L. R. Novick. 2012. “Problem solving.” In K. J. Holyoak & R. G. Morrison (eds.) The Oxford Handbook of Thinking and Reasoning, 413–432. Oxford: Oxford University Press.
Bergen, B. 2012. Louder Than Words. The New Science of How The Mind Makes Meaning. New York: Basic Books.
Bybee, J. 2010. Language, Usage and Cognition. Cambridge: Cambridge University Press.
Cameron, L. 2003. Metaphor in Educational Discourse. London: Continuum.
Clackson, J. 2007. Indo-European Linguistics. Cambridge: Cambridge University Press.

14 It is important to note that this study is agnostic on whether the distinguished conceptual metaphors are psychologically real (see Sandra 1998; Steen 1999; Gibbs 2007). It appears that this question cannot be resolved by means of linguistic analysis. A possible way of verifying empirically whether these conceptual metaphors are present in speakers’ minds is to apply the methodology of simulational semantics (Bergen 2012; Kemmerer 2015). Thus, the list of distinguished conceptual metaphors is to be seen as a point of departure for further empirical research, which will inevitably require working out the means of operationalizing the present findings (Trojszczak, in preparation). This is in keeping with Gibbs’s (2007: 3) claim that “intuitions of cognitive linguists […] serve as the source of experimental hypotheses on the workings of the cognitive unconscious”.


Croft, W. 2000. Explaining language change. An evolutionary approach. London: Longman.
Dancygier, B. & E. Sweetser. 2014. Figurative Language. Cambridge: Cambridge University Press.
Davidson, J. E. & R. J. Sternberg. (eds.). 2003. The Psychology of Problem Solving. Oxford: Oxford University Press.
Deignan, A. 2005. Metaphor and corpus linguistics. Amsterdam: John Benjamins.
Deignan, A. 2008. “Corpus Linguistics and Metaphor.” In R. W. Gibbs (ed.) The Cambridge Handbook of Metaphor and Thought, 280–294. Cambridge: Cambridge University Press.
Deignan, A. & L. Cameron. 2013. “A re-examination of understanding is seeing.” Journal of Cognitive Semiotics, 5 (1–2), 220–243.
Diaz-Vera, J. (ed.). 2015. Metaphor and Metonymy Across Time and Cultures. Berlin: De Gruyter Mouton.
Epley, N., Waytz, A. & J. T. Cacioppo. 2007. “On Seeing Human: A Three-Factor Theory of Anthropomorphism.” Psychological Review, 114 (4), 864–886.
Fabiszak, M. & B. Konat. 2013. “Zastosowanie korpusów językowych w językoznawstwie kognitywnym.” In P. Stalmaszczyk (ed.) Metodologie językoznawstwa: Ewolucja języka, Ewolucja teorii językoznawczych, 131–142. Lodz: Lodz University Press.
Funke, J. 1991. “Solving complex problems: Exploration and control of a complex system.” In R. J. Sternberg & P. A. Frensch (eds.) Complex problem solving. Principles and mechanisms. Hillsdale: Lawrence Erlbaum Associates.
Geeraerts, D. 2010. “The doctor and the semantician.” In D. Glynn & K. Fischer (eds.) Quantitative Methods in Cognitive Semantics. Corpus-Driven Approaches, 63–78. Berlin: De Gruyter Mouton.
Gibbs, R. W. 2007. “Why cognitive linguists should care more about empirical methods.” In M. Gonzalez-Marquez, I. Mittelberg, S. Coulson & M. J. Spivey (eds.) Methods in Cognitive Linguistics, 2–18. Amsterdam: John Benjamins.
Gibbs, R. W. (ed.). 2008. The Cambridge Handbook of Metaphor and Thought. Cambridge: Cambridge University Press.
Glucksberg, S. 2003. “The psycholinguistics of metaphor.” Trends in Cognitive Sciences, 7, 92–96.
Glynn, D. & K. Fischer. (eds.). 2010. Quantitative Methods in Cognitive Semantics. Corpus-Driven Approaches. Berlin: De Gruyter Mouton.
Glynn, D. & J. A. Robinson. (eds.). 2014. Corpus Methods for Semantics. Quantitative studies in polysemy and synonymy. Amsterdam: John Benjamins.
Grady, J. E. 2007. “Metaphor.” In D. Geeraerts & H. Cuyckens (eds.) The Oxford Handbook of Cognitive Linguistics, 188–214. Oxford: Oxford University Press.


Green, A. J. K. & K. Gilhooly. 2005. “Problem solving.” In N. Braisby & A. Gellatly (eds.) Cognitive Psychology, 347–381. Oxford: Oxford University Press.
Gries, S. T. & D. Divjak. 2010. “Quantitative approaches in usage-based Cognitive Semantics: Myths, erroneous assumptions, and a proposal.” In D. Glynn & K. Fischer (eds.) Quantitative Methods in Cognitive Semantics. Corpus-Driven Approaches, 333–353. Berlin: De Gruyter Mouton.
Harbert, W. 2006. The Germanic Languages. Cambridge: Cambridge University Press.
Heylen, K., Tummers, J. & D. Geeraerts. 2008. “Methodological issues in corpus-based Cognitive Linguistics.” In G. Kristiansen & R. Dirven (eds.) Cognitive Sociolinguistics: Language Variation, Cultural Models, Social Systems, 91–128. Berlin: Mouton de Gruyter.
Jäkel, O. 1995. “The metaphorical conception of mind: ‘Mental activity is manipulation’.” In J. R. Taylor & R. E. MacLaury (eds.) Language and the Cognitive Construal of the World, 197–229. Berlin: Mouton de Gruyter.
Jäkel, O. 2003. Metafory w abstrakcyjnych domenach dyskursu. Kraków: Universitas.
Johansson Falck, M. & R. W. Gibbs. 2012. “Embodied motivations for metaphorical meanings.” Cognitive Linguistics, 23 (2), 251–272.
Kardela, H. 2006. “Koncepcja umysłu ucieleśnionego w kognitywizmie.” In W. Dziarnowska & A. Klawiter (eds.) Mózg i jego umysły, 217–243. Poznań: Wydawnictwo Zysk i S-ka.
Kemmerer, D. 2015. Cognitive Neuroscience of Language. New York: Psychology Press.
Kövecses, Z. 2005. Metaphor in Culture. Universality and Variation. Cambridge: Cambridge University Press.
Kövecses, Z. 2010. Metaphor. A Practical Introduction (2nd Edition). New York: Oxford University Press.
Kövecses, Z. 2015. Where Metaphors Come From. Reconsidering Context in Metaphors. New York: Oxford University Press.
Lakoff, G. & M. Johnson. 1980/2003. Metaphors We Live By, 2nd Ed. Chicago: University of Chicago Press.
Lakoff, G. & M. Johnson. 1999. Philosophy in the Flesh. The Embodied Mind and Its Challenge to Western Thought. New York: Basic Books.
Lakoff, G. 1993. “The Contemporary Theory of Metaphor.” In A. Ortony (ed.) Metaphor and Thought (2nd edition), 202–251. Cambridge: Cambridge University Press.
Lakoff, G. 2014. “Mapping the brain’s metaphor circuitry: metaphorical thought in everyday reason.” Frontiers in Human Neuroscience, 8, 958.


Lewandowska-Tomaszczyk, B. 2012. “Cognitive corpus studies: A new qualitative and quantitative agenda for contrasting languages.” MFU Connexion, 1 (1), 15–34.
Majid, A. & M. Bowerman. (eds.). 2007. “Cutting and breaking events: A crosslinguistic perspective [Special Issue].” Cognitive Linguistics, 18 (2).
Mayer, R. 2013. “Problem solving.” In D. Reisberg (ed.) The Oxford Handbook of Cognitive Psychology. Oxford: Oxford University Press.
McEnery, T. & A. Hardie. 2012. Corpus Linguistics. Method, Theory and Practice. Cambridge: Cambridge University Press.
McGlone, M. S. 1996. “Conceptual metaphors and figurative language interpretation: Food for thought?” Journal of Memory and Language, 35, 544–565.
McGlone, M. S. 2007. “What is the explanatory value of a conceptual metaphor?” Language & Communication, 27 (2), 109–126.
Nęcka, E., Orzechowski, J. & B. Szymura. 2008. Psychologia poznawcza. Warszawa: PWN.
Ortony, A. (ed.). 1993. Metaphor and Thought, 2nd Ed. Cambridge: Cambridge University Press.
Palmer, G. B. 2003. “Introduction.” Cognitive Linguistics, 14 (2/3), 97–108.
Pęzik, P. 2013. “Paradygmat Dystrybucyjny w Badaniach Frazeologicznych. Powtarzalność, Reprodukcja i Idiomatyzacja.” In P. Stalmaszczyk (ed.) Metodologie Językoznawstwa. Ewolucja Języka, Ewolucja Teorii Językoznawczych, 143–160. Lodz: Lodz University Press.
Pęzik, P. 2014. “Graph-Based Analysis of Collocational Profiles.” In V. Jesenšek & P. Grzybek (eds.) Phraseologie Im Wörterbuch Und Korpus [Phraseology in Dictionaries and Corpora], 227–243. Maribor/Praha: Filozofska Fakulteta.
Pragglejaz Group. 2007. “MIP: A Method For Identifying Metaphorically Used Words in Discourse.” Metaphor and Symbol, 22 (1), 1–39.
Przepiórkowski, A., Bańko, M., Górski, R. L. & B. Lewandowska-Tomaszczyk. (eds.). 2012. Narodowy Korpus Języka Polskiego. Warszawa: PWN.
Robertson, S. I. 2001. Problem solving. Hove, East Sussex: Psychology Press.
Ruiz de Mendoza Ibáñez, F. J. & L. Pérez Hernández. 2011. “The Contemporary Theory of Metaphor: Myths, Developments and Challenges.” Metaphor and Symbol, 26 (3), 161–185.
Sandra, D. & S. Rice. 1995. “Network analyses of prepositional meaning: Mirroring whose mind—the linguist’s or the language user’s?” Cognitive Linguistics, 6 (1), 89–130.
Sandra, D. 1998. “What linguists can and can’t tell you about the human mind: A reply to Croft.” Cognitive Linguistics, 9 (4), 361–378.


Semino, E., Heywood, J. & M. Short. 2004. “Methodological problems in the analysis of metaphors in a corpus of conversations about cancer.” Journal of Pragmatics, 36, 1271–1294.
Sinha, C. 2007. “Cognitive linguistics, psychology and cognitive science.” In D. Geeraerts & H. Cuyckens (eds.) The Oxford Handbook of Cognitive Linguistics, 1266–1294. Oxford: Oxford University Press.
Steen, G. J. 1999. “From linguistic to conceptual metaphor in five steps.” In R. W. Gibbs & G. J. Steen (eds.) Metaphor in Cognitive Linguistics, 57–77. Amsterdam: John Benjamins.
Steen, G. J. 2011. “Metaphor in language and thought. How do we map the field?” In M. Brdar, S. T. Gries & M. Žic Fuchs (eds.) Cognitive Linguistics. Convergence and Expansion, 67–86. Amsterdam: John Benjamins.
Steen, G. J. 2013. “The contemporary theory of metaphor – now new and improved!” In F. Gonzálvez-García, M. S. Peña Cervel & L. Pérez Hernández (eds.) Metonymy revisited beyond the Contemporary Theory of Metaphor: Recent developments and applications, 27–65. Amsterdam: John Benjamins.
Stefanowitsch, A. 2006. “Corpus-based approaches to metaphor and metonymy.” In A. Stefanowitsch & S. T. Gries (eds.) Corpus-based approaches to metaphor and metonymy, 1–16. Berlin: Mouton de Gruyter.
Sternberg, R. J. 2009. Cognitive Psychology. Belmont: Wadsworth Cengage Learning.
Strugielska, A. 2014. “The place of metaphor in devolved cognitive linguistics.” Metaphorik.de, 25. Retrieved from http://www.metaphorik.de/de/journal/25/metaphorikde-252014.html.
Sussex, R. & P. Cubberley. 2006. The Slavic Languages. Cambridge: Cambridge University Press.
Sweetser, E. 1990. From etymology to pragmatics. Metaphorical and cultural aspects of semantic structure. Cambridge: Cambridge University Press.
Szwedek, A. 2011. “The ultimate source domain.” Review of Cognitive Linguistics, 9 (2), 341–366.
Szwedek, A. 2014. “The nature of domains and the relationships between them in metaphorization.” Review of Cognitive Linguistics, 12 (2), 342–374.
Trim, R. 2011. Metaphor and the historical evolution of conceptual mapping. Basingstoke: Palgrave Macmillan.
Trojszczak, M. (in preparation). Ways of solving a problem – a cognitive crosslinguistic study. Lodz.
Tummers, J., Heylen, K. & D. Geeraerts. 2005. “Usage-based approaches in Cognitive Linguistics: A technical state of the art.” Corpus Linguistics and Linguistic Theory, 1 (2), 225–261.


Waliński, J. T. 2013. Complementarity of Space and Time in Distance Representations. A Corpus-based Study. Lodz: Lodz University Press.
Zinken, J., Hellsten, I. & B. Nerlich. 2008. “Discourse metaphors.” In R. Dirven, R. Frank, T. Ziemke & J. Zlatev (eds.) Body, language, and mind. Vol. 2, Sociocultural situatedness, 363–385. Berlin: Mouton de Gruyter.

Victoria Kamasa

University of Poznań

Corpus Linguistics for Critical Discourse Analysis. What can we do better?

Abstract: In this study we present a critical review of the use of corpus linguistics (CL) techniques in Critical Discourse Analysis. Our review is based on over 30 papers whose authors declared that they used some form of CL for some form of Critical Discourse Analysis (CDA/CL). We analyze the methods used by the papers’ authors as well as the results provided by those methods in order to propose some points for improvement, such as extended use of statistics or better corpus design, so as to avoid limiting the results. We also point to some inconsistencies which may arise during the research process, concerning both adherence to the rules declared by the author and attention to numbers such as word frequencies. Another problem we discuss can be called the “mind-reading problem”: while the results concern the properties of texts, the conclusions regard the cognitive states of their users. Finally, we discuss the so-called cherry-picking problem and the role of the researcher’s intuition, and show some stages of CDA/CL research in which intuition continues to play a crucial role. For each of these points we present some examples from research practice and demonstrate its possible influence on the results and conclusions.

Keywords: Critical discourse analysis, corpus linguistics, corpus supported discourse analysis, corpus design, corpus methods

1. Introduction

Critical Discourse Analysis (henceforth CDA) has lately celebrated its 20th birthday with a seminar in Amsterdam. During these 20 years both theoretical reflection and empirical studies have flourished within the field. Different approaches, such as the Discourse Historical Approach (Reisigl & Wodak 2001) or the sociocognitive approach (Dijk 2008), have been proposed, adopted for a vast range of topics and overviewed in publications such as Wodak and Meyer (2009) or Duszak and Fairclough (2008). But CDA has not only been developed and practiced. It has also been criticized, most remarkably by Widdowson (1995, 1998) and Breeze (2011), who offer an extensive overview of controversies surrounding CDA. Some critical remarks in the context of educational research have also been summarized by Rogers et al. (2005), while more particular comments are scattered through various publications (O’Halloran 2009; Orpin 2005; Prentice 2010; Stubbs 1997).

This critique contributed to the application of some techniques of corpus linguistics (henceforth CL) in CDA. The first attempts to employ this approach came in the late 1990s (e.g. Beaugrande 2008; Flowerdew 1997; Hardt-Mautner 1995; Krishnamurthy 1996) and were followed by the seminal work of Baker (2006), in which he describes basic corpus techniques useful for discourse analysis and illustrates them with examples of his own research. The discussion on the benefits of corpus techniques for CDA continues in a wide array of research papers using these techniques (e.g. Baker et al. 2008; Degano 2007; Subtirelu 2013).

The idea of corpus assisted CDA led to the proliferation of studies in which corpus techniques were used to reveal and describe discourses of interest for critical analysts. The popularity of this approach is visible both in the variety of techniques used and the diversity of subjects analyzed. The techniques range from straightforward analyses of concordances (Albakry 2004) to quite sophisticated research on lexical bundles (Herbel-Eisenmann & Wagner 2010) or automatically tagged semantic categories (Prentice 2010). The vast range of subjects includes, among others, national identity issues in Ireland (Prentice 2010), Malaysia (Don et al. 2010) or Quebec (Freake et al. 2010), different social problems such as sexism (Yasin et al. 2012) or eating-disorders (Lukac 2011), and the social construction of businesswomen (Koller 2004) or economic crisis (Lischinsky 2011). Also, the sheer number of publications using corpus techniques demonstrates the growing interest in the field. According to the Scopus database, the total number of publications combining CDA and CL in the 1990s was 3, in the 2000s it amounted to 29, and since 2010 it has already reached 47 publications.1 This tendency is also visible in the leading conferences in the field: the 2014 Critical Approaches to Discourse Analysis Among Disciplines (CADAAD) conference featured talks by almost 40 authors who used some form of corpus analysis in their studies.

Such interest gave rise to some critical considerations concerning the usage of corpus techniques in CDA. For instance, Mautner (2009b) suggests that the contextual nature of CDA and the lack of context in corpus research may produce some tensions. She also points to semiotic impoverishment and oversensitivity to frequency as possible drawbacks of the CL approach. Gabrielatos (2009) and Gabrielatos and Duguid (2014), on the other hand, offer a more detailed discussion concerning the techniques chosen (such as keywords), the statistical measures used to perform them and the way these measures are interpreted in empirical studies. Nevertheless, until now the steadily growing body of corpus-supported CDA studies has not been critically reviewed in order to identify the most vulnerable points of the research practice and suggest some improvements. We will attempt to fill this gap in the present paper.

As a way of organizing our reflections, we will use the criticism voiced about CDA and the ways the corpus approach has been expected to address it. We will therefore strive to answer the question of whether the way corpus techniques are used in research practice fulfills the hopes vested in them and, if not, what the major problems are that we should address. The starting point is a critical one; therefore, we will concentrate on those research practices that, from our point of view, need to be improved in order to take full advantage of the CL approach. We will focus on weaknesses and practices which may raise doubts, refraining from presenting all the excellent practices both in the field and in the studies we base our overview on.

We will focus our considerations on the body of research papers published between 2002 and 2013 in which the authors both declare a commitment to some form of CDA and use at least one corpus technique (such as the analysis of concordances, collocations, keywords or lexical bundles). The presented overview is not claimed to be exhaustive, but is hoped to provide some useful insights into how the use of corpus techniques in CDA might be improved.

The following sections will briefly present the main points of criticism of CDA and the ways the corpus approach was expected to address them. Then we will demonstrate some examples from research practice in which, despite using corpus techniques, the problems pointed out by CDA critics remained unsolved. Finally, we will suggest some good practice that should help to avoid such problems and consider issues that seem to remain unresolved even with the use of advanced CL techniques.

1 The numbers presented result from the search term: “KEY (‘critical discourse analysis’ ) AND TITLE-ABS-KEY (corpus OR corpora OR ‘corpus linguistics’)” (Scopus 2014).

2. CL as an answer to the criticism of CDA

In its 20 years CDA has been discussed from different perspectives and criticized for various shortcomings. The key points of critique relate to the analyzed material (decontextualization of analyzed texts and the fragmentary character of the analyses), scholarly discipline (bias, cherry-picking and the pivotal role of the researcher’s intuition) and the lack of a coherent theory of audience response. The introduction of CL techniques into CDA was believed to address all these problems.

2.1 Decontextualisation of analyzed texts

One of CDA’s weaknesses that CL should help to overcome is the decontextualization of the analyzed texts. This lack of context relates to the production, consumption, distribution, or reproduction of texts (Breeze 2011; Rogers et al. 2005), but can also be seen more broadly as ignoring the interactional frame (Rogers et al. 2005). While the CL approach alone does not allow for more context-embedded analyses, working with large volumes of data (Mautner 2009a) might balance out the omission of context in the analyses. Consequently, researchers using the CL approach still miss the context of a particular text, but they see such a large number of texts that they encounter the studied phenomenon in multiple instances. Therefore, it might suffice to obtain reliable results, even though no context is taken into account. In this case, CL, rather than offering a simple answer to the decontextualization problem, offers the researchers a possibility to ignore the context without detriment to the validity of the results.

2.2 Fragmentary character of analyses

Another point of criticism directed at CDA to which CL seems to offer answers is the fragmentary character of the analyses, observed both on the levels of sampling and analytic procedures. With respect to the former, the critics question the representativeness of the studied texts (Stubbs 1997) and point to their exemplificatory character (Fowler 1996). With respect to the latter, doubts arise concerning the legitimacy of conclusions about ideology based on focusing on particular lexical items or certain grammatical features (Breeze 2011). CL presents an obvious solution for both aspects of this problem: the representativeness of the data is achieved through sophisticated sampling procedures and, above all, through large sample size (Degano 2007). Moreover, CL also enables a more exhaustive analysis by “highlighting lexical and grammatical regularities” (Lischinsky 2011: 155) and a comprehensive, rather than selective, description of syntactic and semantic properties of lexical items (Hardt-Mautner 1995). Therefore, the problem of the fragmentary character of the analysis can be solved by the use of corpora and corpus tools, both on the level of sampling and analytic procedures.

2.3 Bias

CL is also believed to reduce the bias often indicated as another weakness of CDA. This bias has been associated with the political commitment of CDA that in some cases leads researchers to act on personal grounds, rather than a scholarly principle (Breeze 2011). The critics of CDA also pointed out that some analyses are conducted in such a way as to confirm the researcher’s preconceptions (Widdowson 1995) and that “political and social ideologies are read into the data” (Rogers et al. 2005: 372). As mentioned above, CL offers the researchers a heuristic tool (Hardt-Mautner 1995) and a more focused approach to the texts (Lischinsky 2011), both being helpful in reducing bias. Moreover, Subtirelu (2013: 43) mentions that using CL techniques in CDA provides “results that are not idiosyncratic to a specific analyst”, while Baker (2011: 24) sees the bias-reducing potential of CL in the fact that it forces the researcher to “account for any larger-scale or salient patterns”, i.e. also those that do not conform to her or his political or social ideologies. Hence, it is believed that the focus on frequency and regularities provided by CL helps (at least partially) to overcome the problem of bias.

2.4 Cherry-picking

It is also hoped that using the CL approach will address the cherry-picking problem voiced, for example, by Breeze (2011), Verschueren (2001), or Mautner (2009b). This problem is described as a tendency to choose such sets of texts or sets of examples that fit either the interpretative framework (Verschueren 2001) or the presumptions of the researcher (Rogers et al. 2005), which may skew the results and diminish their social credibility. As an answer to this, CL offers a clear and precise criterion for choosing what should be analyzed, namely frequency. As Degano (2007: 363) puts it: “the quantitative approach forces [sic] to a closer observation of data, with a view to the frequency with which a certain characteristic occurs, so that uses which can be identified as recurring are considered as more relevant than isolated examples”. As a result, such an approach helps to shift the researcher’s attention from what seems to be interesting and might confirm (unconscious) presumptions to what is demonstrably frequent, regular and forms some kind of a pattern.
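To make the frequency criterion concrete, the following minimal sketch ranks word forms by raw frequency, so that the items selected for closer reading are chosen by count rather than by the analyst. The function name `frequency_list`, the simple regular-expression tokenizer and the sample sentence are illustrative assumptions, not a procedure taken from any of the studies discussed.

```python
import re
from collections import Counter

def frequency_list(text, top_n=10):
    """Return the top_n most frequent lowercase word tokens with their counts."""
    # Assumption: a naive tokenizer that treats any run of word characters as a token.
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common(top_n)

# Hypothetical miniature "corpus", for illustration only.
sample = "Refugees arrive. Refugees wait. Officials discuss refugees and asylum."
ranking = frequency_list(sample, top_n=3)
print(ranking)
```

In an actual study the ranking would be computed over the whole corpus, usually after decisions about lemmatization and function words have been made explicit.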

2.5 Pivotal role of researcher’s intuition

Additionally, the CL approach may be able to deal with the pivotal role of the researcher’s intuition noted by Breeze (2011), Widdowson (1998), or Prentice (2010). On a general level, such a role of the researcher demonstrates itself in the crucial significance of her or his interpretive and explanatory skills for the obtained results (Breeze 2011) and the subjectivity of the reading which leads to many possible interpretations of the same text (Breeze 2011). On a more detailed level, this intuition also plays a central role in all analyses that involve some kind of categorization, as they are based on a subjective application of a coding system (Prentice 2010). Here again, the focus on frequency and patterns characteristic of CL can be helpful in dealing with this issue. “The concordancer produces ‘results’ in its own right” (Hardt-Mautner 1995: 24), so there is no need for the researcher to use her or his intuition to decide which lexical items or patterns of co-occurrence should be analyzed. The involvement of intuition in the analysis is also lowered by the heuristic function of the corpus software, which draws the researcher’s attention to phenomena that should be investigated more closely (Hardt-Mautner 1995). Finally, some level of intersubjectivity, being a consequence of reducing the role of intuition in the studies, has also been confirmed experimentally (Marchi & Taylor 2009).
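What a concordancer returns can be illustrated with a minimal key-word-in-context (KWIC) sketch: every occurrence of a node word is listed with a fixed window of neighbouring tokens, without the analyst preselecting the examples. The function `kwic`, its `width` parameter and the sample text are hypothetical and only approximate what dedicated corpus software does.

```python
import re

def kwic(text, node, width=3):
    """List every occurrence of `node` with up to `width` tokens of left and right context."""
    # Assumption: naive tokenization on word characters, case-insensitive matching.
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Hypothetical example text, for illustration only.
text = "The goal weight was set. Her goal weight changed again."
concordance = kwic(text, "weight", width=2)
for left, node, right in concordance:
    print(f"{left:>12} | {node} | {right}")
```

The interpretation of the listed lines still rests with the researcher; the sketch only shows that the selection of lines is mechanical.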


2.6 Lack of coherent theory of audience effects and audience response

The problem of the lack of a coherent theory of audience effects and audience response can also be partially tackled by using the CL approach. As Rogers et al. (2005: 386) put it, CDA authors “failed to represent the relationship between the grammatical resources and the social practices”, whereas Breeze (2011) points to the great complexity of reader-text relations (involving readers’ exposure to many different discourses and the subtlety of the transfer of ideological meanings) and advises caution in drawing conclusions about thought from language. Once again, focusing on high frequency items and patterns in texts contributes to the solution of this problem, as it might be suspected that persistent recurrence is related to cognitive visibility (Baker 2011; Gabrielatos & Baker 2008).

As demonstrated above, CL presents itself as a remedy for numerous issues voiced by the critics of CDA. For some problems the solutions offered by CL are obvious and quite comprehensive, as in the case of the fragmentary character of the analyses or cherry-picking. For others, the solution is rather indirect and more of a hint or direction to be followed and developed than a clear answer, as in the case of the decontextualization of analyzed texts or the lack of audience effects theory. The problems and their respective solutions are summarized in Figure 1.

Figure 1. CDA critique and CL answers.
CDA critique: Decontextualisation of analyzed texts. CL answer: Large material as balance for the lack of context.
CDA critique: Fragmentary character of analyses. CL answer: Representativeness (large corpora); exhaustive analysis (software).
CDA critique: Cherry-picking. CL answer: Choice of analysed items on the basis of statistics.
CDA critique: Pivotal role of the researcher’s intuition. CL answer: Statistical significance instead of intuition.
CDA critique: Bias. CL answer: Intersubjective methods; focus on patterns and regularities.
CDA critique: Lack of a coherent theory of audience effects and audience response. CL answer: Frequency as main factor affecting the audience.


3. CL answers and research practice

With such an optimistic view of the possibilities opened for CDA by the use of CL, the question arises whether the research practice lives up to these expectations. There are a number of problems which recur in numerous studies and which might weaken the belief in CL as a solution for most issues that CDA is facing.

3.1 Decontextualisation of analyzed texts

3.1.1 We do not pay enough attention to the numbers

As shown above, the potential of CL for solving the decontextualization problem lies in balancing the lack of context with the great amount of data. In research practice, however, we sometimes tend to focus on items whose frequency does not justify that interest and therefore we limit that potential. This might be the case for Lukac’s (2011) study on the pro-eating disorder discourse. She analyses a corpus of 19 blogs of pro-ana community members consisting of 222,464 tokens. Although she starts with the most frequent expressions used for eating disorders (with frequencies up to 422), at some point she bases one of her conclusions on the co-occurrences of words which appear in the corpus as few as 25 times: “Goal (…) premodifies the noun weight in 25 occurrences. Goal weight is one of the terms specific for the pro-eating disorder vocabulary” (Lukac 2011: 204). Mulderrig (2011) presents an even looser approach to numbers. In her analysis of “change in the discursive construction of social identity in UK education policy” (Mulderrig 2011: 563) based on 17 policy consultation documents, she hardly gives any numbers (including the size of the corpus), justifying her conclusions with descriptions such as “more frequently” or “comparatively infrequently”. These examples show that using CL techniques does not always lead to focusing on what is frequent and pervasive.

Such research practice may lead to two dilemmas. Firstly, one can question the validity of the “specific for the pro-eating disorder vocabulary” label for a term occurring only 25 times in a corpus of more than two hundred thousand words. Secondly, limiting the comparisons between items to vague descriptions rather than quoting specific numbers supplies only weak support for the conclusions based on such comparisons. But besides these doubts (small in comparison to the results of the described studies as a whole), an important question can be posed: how frequent must an item be in order to state that the many contexts in which we see it balance the decontextualization of the analyzed texts it occurs in? It seems to be quite legitimate to claim that an analysis of 16,000 occurrences of the word refugees (Gabrielatos & Baker 2008) allows us to see it in so many different contexts that a detailed discussion of the production or consumption of the texts it appears in can be neglected. But the same logic probably does not apply to 25 or even 10 instances of a word in a corpus. It is doubtful whether any particular number of occurrences or definite frequency per thousand words can be defined. Nevertheless, if we wish for CL to solve the decontextualization problem, we need to pay closer attention to the numbers we base our conclusions on.
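One simple way of paying closer attention to the numbers is to report relative frequencies alongside raw counts. The sketch below normalizes the two raw counts quoted above (25 and 422 occurrences in a corpus of 222 464 tokens) to occurrences per million tokens. The function name `per_million` and the choice of a per-million base are assumptions made for illustration; they are not a threshold or a procedure proposed in the studies discussed.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    return count / corpus_size * 1_000_000

CORPUS_SIZE = 222_464  # token count reported for the blog corpus discussed above

goal_weight = per_million(25, CORPUS_SIZE)     # the collocation "goal weight"
most_frequent = per_million(422, CORPUS_SIZE)  # the highest frequency mentioned

print(f"goal weight:        {goal_weight:.1f} per million tokens")
print(f"most frequent item: {most_frequent:.1f} per million tokens")
```

Relative frequencies do not settle how frequent an item must be, but they make counts from corpora of different sizes comparable.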

3.1.2 We do not use statistical possibilities fully

The insufficient attention paid to the numbers we obtain from our corpora is also visible in the underuse of statistical measures in some of our research. This underuse has two dimensions: the first involves basing our interpretations on a difference in numbers whose statistical significance was not checked, while the second concerns the replacement of statistical measures with raw frequencies.

The first dimension may be noticed in Marling’s (2010) study of feminism’s representation in Estonian print media. Describing the changes of this representation over time, Marling points to the alterations in the number of articles containing the word feminism in different thematic sections of the analyzed newspaper. These alterations (specifically the increase in articles concerning domestic news) are interpreted as a signal of the domestication of feminism in Estonia. However, the statistical significance of the difference is not examined² and the raw difference (2.5 per year as opposed to 3 per year) seems to be rather small.

As for the second dimension – replacing statistical measures with raw frequencies – it may be the case for Edwards (2012), who determines the node words in British National Party discourse by identifying “words which stood out for their clear quantitative variation between texts” (Edwards 2012: 247). Searching for keywords instead of concentrating on raw frequencies would probably be more informative, keywords being a statistically based corpus technique for comparing two sets of texts in search of words with relatively high frequencies (Scott 2013). Such an approach would give greater assurance that the differences in frequencies are significant enough to be taken into account. Also, the study by Freake et al. (2010) on the construction of nationhood and belonging in Quebec would presumably benefit from using statistical measures for collocates as a replacement for raw frequencies of co-occurrence. Such measures ensure to some extent that the co-occurrences are both frequent and unique enough to call them collocates and base interpretations on them.

2 A chi-squared test for the numbers provided in the cited paper shows no statistical significance (chi = 0.26, p = 0.39).
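As a minimal sketch of the kind of significance check mentioned in note 2, a chi-squared test for a 2×2 table can be computed with the Python standard library alone. The function name and the counts used below are hypothetical illustrations, not the figures from Marling (2010), and the p-value formula holds only for one degree of freedom.

```python
import math


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]].

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: articles in the domestic-news section vs. all other
# sections, in two periods of the same newspaper.
chi2, p_value = chi_squared_2x2(10, 30, 12, 28)
is_significant = p_value < 0.05  # False for these invented counts
```

With counts this close, the difference would not justify an interpretation in terms of change over time; the same function applied to a clearly skewed table yields a p-value well below 0.05.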

Corpus Linguistics for Critical Discourse Analysis


Basing conclusions on raw frequencies can be misleading and, no less importantly, it neglects the potential of corpus-supported discourse analysis to deal with the problem of decontextualization by balancing the lack of context with the high number of examples taken into careful consideration.
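The contrast between raw frequencies and keywords can be illustrated with the log-likelihood statistic, one common keyness measure for comparing a study corpus with a reference corpus. This is a hedged sketch: the function names, the toy token lists and the 3.84 cut-off (the chi-squared critical value for p < 0.05 with one degree of freedom) are assumptions of this example, and it is not claimed to reproduce the computation performed by WordSmith Tools.

```python
import math
from collections import Counter


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word, given its frequency in two corpora."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Words relatively more frequent in the study corpus, ranked by keyness."""
    study, ref = Counter(study_tokens), Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    scored = []
    for word, freq in study.items():
        overused = freq / n_study > ref[word] / n_ref
        score = log_likelihood(freq, n_study, ref[word], n_ref)
        if overused and score >= threshold:
            scored.append((word, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy corpora of 1,000 tokens each (invented data).
study_corpus = ["nation"] * 30 + ["the"] * 970
reference_corpus = ["nation"] * 10 + ["the"] * 990
key_list = keywords(study_corpus, reference_corpus)
```

In the toy data "the" is by far the most frequent word in both corpora, yet only "nation" passes the threshold, which is the point made above about raw frequency being a poor guide to what is distinctive.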

3.2 Fragmentary character of analyses

3.2.1 Our corpus design is not always clear

The large size of corpora and the sophisticated sampling methods used in corpus-supported CDA contribute substantially to solving the problem of the fragmentary character of analyses. In order to ensure this contribution, every step of corpus design must be carefully considered and fully justified. Nevertheless, not all corpora we base our analyses on seem to comply with these requirements: some claims regarding the representativeness of the analyzed material, the relevance of the chosen texts to the subject in question or the coverage of the studied topics can be challenged.

The problem with representativeness is exemplified by Almeida’s (2011) research on U.S. newspapers’ construction of the Israeli-Palestinian conflict. Her corpus contains 3 randomly chosen newspaper stories per week for 4 months (April through July) in a span of 5 years (2002–2006). As a result, she grounds her findings in a corpus of 250 press articles which she claims are representative of media coverage. Nevertheless, it is rather doubtful whether the structure of the population (all press articles concerning the Israeli-Palestinian conflict) and that of the sample are similar. Such similarity is one of the most commonly described requirements for the representativeness of a sample (e.g. Babbie 2013); therefore, the lack of compliance with this requirement may suggest that the analysis remains fragmentary rather than exhaustive.

The next problem – relevance of material – may be noticed in Subtirelu’s (2013) research concerning the language ideologies and nationalisms present in the U.S. Congress. Unlike Almeida, he studies the whole population rather than a sample. He decides to work with “transcriptions of legislative events related to S203”³ (Subtirelu 2013: 44). Such an approach should ensure the exhaustive character of the analysis, especially as considerable effort has been taken to encompass all relevant documents in the first outline of the corpus. However, the author then decides to manually categorize the documents according to the role of multilingual voting materials and to further analyze only those sections of the documents which he believes to be relevant to the subject. Such an approach is justified by

3 A portion of the Voting Rights Act mandating multilingual ballots for language minorities.


Victoria Kamasa

the need to “ensure that keywords identified in the later analysis pertained only to S203” (Subtirelu 2013: 44). However, as the specific criteria used to distinguish the relevant sections from those considered irrelevant are not listed, the criticism related to the fragmentary character of the analysis may still hold.
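The requirement that the structure of the sample resemble the structure of the population can be made explicit in the sampling procedure itself. The following is a minimal sketch of proportional stratified sampling, assuming that every text carries a stratum label (here, a publication year); the function name, the seed and the article counts are hypothetical and do not describe any of the cited studies.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, stratum_of, sample_size, seed):
    """Draw a random sample whose strata mirror the population's proportions."""
    rng = random.Random(seed)  # fixed seed so the draw can be replicated
    strata = defaultdict(list)
    for item in items:
        strata[stratum_of(item)].append(item)
    total = len(items)
    # Proportional quotas, rounded with the largest-remainder method so that
    # the quotas add up exactly to sample_size.
    exact = {s: sample_size * len(members) / total
             for s, members in strata.items()}
    quotas = {s: int(q) for s, q in exact.items()}
    leftover = sample_size - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - quotas[s], reverse=True)
    for s in by_remainder[:leftover]:
        quotas[s] += 1
    sample = []
    for s, members in strata.items():
        sample.extend(rng.sample(members, quotas[s]))
    return sample


# Invented population: (year, article id) pairs with uneven yearly coverage.
population = ([(2002, i) for i in range(300)]
              + [(2003, i) for i in range(100)]
              + [(2004, i) for i in range(100)])
sample = stratified_sample(population, lambda art: art[0], 50, seed=1)
per_year = Counter(year for year, _ in sample)
```

Reporting the strata, the quotas and the seed alongside the findings would allow a reader to judge the claim of representativeness rather than take it on trust.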

3.3 Bias

3.3.1 We do not follow our own rules

Bias in the studies should be minimized by the use of intersubjectively replicable methods supplied by CL. Nevertheless, in some cases we tend to go beyond the methods we have described. This can both provide a more detailed and sophisticated view of the discussed problems and reintroduce bias into our studies. This double nature of not exactly following one’s own methods is visible, for example, in Weninger’s (2010) study of representations of social actors at the lexico-grammatical level in the US urban redevelopment discourse. At some point, she complements her planned study of colligations with “a closer look at the lexis” (Weninger 2010: 604), which, on the one hand, supports her claims and provides a more comprehensive picture of the discussed problem. On the other hand, paying closer attention to facts which confirm a point of view may be seen as biased. In a similar manner, Albakry (2004) bases his conclusions about the attitudinal or ideological positions expressed and negotiated in US and Canadian reports investigating the Kandahar ‘friendly fire’ on the comparative analysis of the “use of salient terms of representation, anonymity, agency terms, passivization, and use of modals and expression of stance” (Albakry 2004: 165). But then he strengthens these conclusions with a detailed analysis of selected lexical items such as friendly or fratricide. And again, such practice has an ambiguous character: it offers a broader and deeper perspective on the examined data, but at the same time it is selective and thus may confirm the researcher’s preconceptions.

3.3.2 We do not pay enough attention to the numbers

Another research practice which reintroduces bias into corpus-supported CDA is the above-described lack of sufficient attention paid to word frequencies. Focusing on items that are thought-provoking or supportive of our conclusions regardless of their frequencies may undermine the bias-reducing power of corpus techniques. Abandoning high frequency as the only factor determining an item’s inclusion in the analysis most likely increases the influence of the researcher’s convictions on the presented results. Baker (2012) offers an exhaustive discussion of the problems related to the numbers generated by corpus research and the bias that is brought into research by interpreting them. He suggests that every interpretation of frequencies derived from a corpus carries some level of bias which cannot be avoided due to the critical commitment of CDA. Nevertheless, not incorporating frequency into analysis seems to strongly increase the level of bias and reduce the potential of corpus techniques to ensure the credibility of CDA through “analytical tools and methods that are rigorous and grounded in scientific principles such as representativeness, falsification, data-driven approaches, using statistical approaches to test hypotheses and a desire to provide a full picture of representation” (Baker 2012: 255).

3.4 Cherry-picking

3.4.1 We choose examples

Research practice shows that a focus on frequency only partially solves the cherry-picking problem described in the literature. Although we use big corpora and statistical methods to generate lists of words or phrases we wish to analyze, we sometimes still tend to narrow our further analyses to selected items only. For example, Bachmann (2011) concentrates on keyword analysis in order to reconstruct discursive constructions of same-sex relationships in the debates of the British Parliament concerning the Civil Partnership Bill. He devotes considerable effort to generating a keyword list which will be both manageable for study and accurate with respect to the subject (using different reference corpora and different significance levels). But, after finally acquiring such lists, he uses fewer than half of the keywords for further analysis without taking into consideration either the frequency or the keyness of these words. In effect, he selects some items from those with marked frequency in the studied texts and bases his conclusions on an in-depth analysis of these items. Similarly, Freake et al. (2010), describing the construction of nationhood and belonging in Quebec, include only some of the collocations they obtained from their corpus in their interpretations. Likewise, Salama (2011), in his attempt to characterize discourses about Wahhabi-Saudi Islam post-9/11, discusses in detail 33 concordances out of 435 occurrences of the word in question in the analyzed texts without providing a description of the way those concordances were chosen.

As the above-mentioned studies show, researchers who use corpus techniques in their analysis sometimes choose the items on which they base their conclusions. The problem might be moved to another level, as in the case of keywords and collocates: the choice concerns words (or combinations of words) which are statistically more frequent than randomly or intuitively chosen words. But the problem of (unconsciously) picking examples that confirm the researcher’s presumptions remains at least partially unsolved, even if the picking pool is justified better than in the case of traditional qualitative studies.
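When only a subset of concordance lines can be discussed in detail, one way to document how the subset was chosen is to draw it at random with a reported seed. The sketch below assumes the concordance lines are available as a list of strings; the function name, the seed value and the placeholder lines are hypothetical, and the figures 435 and 33 merely echo the proportions mentioned above.

```python
import random


def sample_concordances(lines, k, seed):
    """Return k (index, line) pairs drawn without replacement, in corpus order."""
    rng = random.Random(seed)  # reporting the seed makes the draw repeatable
    indices = sorted(rng.sample(range(len(lines)), k))
    return [(i, lines[i]) for i in indices]


# Placeholder concordance lines standing in for real corpus output.
concordance = [f"... left context NODE right context ... ({i})" for i in range(435)]
selected = sample_concordances(concordance, k=33, seed=2011)
```

A random draw does not replace close reading, but it separates the question of which lines are read from the question of which lines support the argument.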

3.5 Pivotal role of researcher’s intuition

3.5.1 We categorize on the basis of intuition

Even though corpus software draws researchers’ attention to phenomena that call for a closer investigation (Hardt-Mautner 1995) and, therefore, should lower the role of intuition in the analysis, many examples from research practice show that this role continues to be prominent. This is especially visible in studies that involve some form of categorization of the analyzed items. The first form of such categorization is related to the determination of the semantic prosody of a given word, which is based on categorizing every collocate of this word as positive, negative or neutral. These judgments about the evaluative load of the words seem purely intuitive as no procedure leading to the presented result is described. They are part of many studies such as Mautner’s (2007) reconstruction of the ageing discourses or the author’s research on in vitro fertilization discourses (Kamasa 2012). A similar role of the researcher’s intuition may be observed in studies including the analysis of semantic preference; only the subject of assessment shifts from emotional load to the semantic field a word belongs to (e.g. Lischinsky 2011; Salama 2011; Weninger 2010). In both cases, the researcher’s intuition is to some extent supported by corpus software as it provides frequency-based lists of items which ought to be judged on their emotional load or semantic field. Nevertheless, the judgment itself depends fully on the skills of the researcher.

Other types of intuition-based categorization are also present in research practice. Chen (2012) considers the changes in the usage of different evaluation types⁴ in the articles from China Daily due to the political developments in China and provides information about statistically significant diachronic differences in the occurrences of these evaluation types. The statistics are nevertheless performed on intuitively chosen and categorized examples. Similarly, Edwards (2012) uses corpus tools to generate a list of all uses of the word our in his corpus. However, he then divides these instances manually and without any specific criteria into three types in order to base some of his conclusions on the differences in the number of occurrences of particular types through time. As a result of such practices, the role of the researcher’s intuition remains pivotal to the results.

4 The evaluation types are based on the work of Labov (1972).


3.6 Lack of a coherent theory of audience effects and audience response

3.6.1 We try to read minds

The caution in drawing conclusions about thought from language (as advised by Breeze 2011) is not always fully exercised in our analyses. Even though corpus tools allow us to determine the frequencies of words or their co-occurrences and therefore to include these frequencies as the main factor influencing the audience’s response, we sometimes tend to neglect them and build conclusions about the audience’s reception without them. This might be the case in O’Halloran’s (2009) investigation of the pre-EU-expansion discourses about immigrants in the UK: she investigates a 26-thousand-word corpus of newspaper articles but her suggestion that “regular readers have been positioned into making negative contrast with information on Eastern European immigration” (O’Halloran 2009: 38) is based on only 36 examples of specific usage of the word “but”⁵. The role of frequency (as a main factor justifying the conclusions about audience response) for such an interpretation could be questioned and thereby the potential of corpus techniques to address the lack of a coherent theory of audience effects is partially limited.

A comparable problem concerns inferences about the intentions of the text’s author. On the basis of the texts’ analysis we form claims about the state of mind or will of the sender without grounding them in psycholinguistic knowledge about relations between communicative intention and the form of the utterance. For example, Forchtner and Kolvraa (2012), in their paper about representations of Europe’s past, present and future in speeches given by major political figures, sometimes comment on the mental states of these political figures rather than on the texts delivered by them, as in “Prodi aims to⁶ unify the continent”, “Merkel (2007), for example, apparently sees no contradiction between Europe’s dark history and an emerging ambition to make a distinctly European contribution to the running of the world beyond Europe” or “Prodi can again avoid differentiating between perpetrators and victims” (Forchtner & Kolvraa 2012: 388–393). Therefore, the question arises whether, for such interpretations, the claim that CL offers a more data-based approach to discourse analysis can be fully sustained.

Even though CL helps to address almost every point mentioned by CDA’s critics, some shortcomings in our research practice might reintroduce all these problems into our studies. Figure 2 illustrates the relation between the main points of the criticism of CDA, solutions offered by CL and problems emerging from our research practice.

Figure 2. Critique of CDA, CL answers and problems from research practice.

5 Specifically, she refers to Wodak’s (1999) strategies realized via the usage of the word “but”.
6 Emphasis added by the author of this paper.

4. Conclusions

The foregoing overview of research practice in corpus-supported CDA revealed numerous obstacles which may prevent us from exploiting the full potential of CL techniques. This is especially concerning in the light of hopes vested in CL as an answer to the criticism of CDA and may raise the question whether various aspects of the research practice need to be improved or whether these hopes were simply too high considering the present state of affairs. For the above-listed issues, both seem to be valid to some extent.


A few of these problems can probably be tackled by paying more attention to the research design and the analysis itself. If we increase the number of statistical techniques used in our studies and base every comparison only on differences that are statistically significant, we will contribute to addressing the decontextualization problem. Avoiding any intuitive and non-intersubjective steps in our corpus design will prevent our studies from being fragmentary rather than comprehensive. By using frequency and statistical scores as the only criterion for the selection of items we base our conclusions on, we will lessen the bias in our studies and avoid cherry-picking. Finally, we can mitigate the lack of a coherent audience response theory by limiting conclusions about the mental states of texts’ recipients to those supported by a thorough inspection of frequency and by refraining from conclusions about the mental states of texts’ authors.

Nevertheless, answers to other problems still remain far from clear and call for more thorough consideration. With bias being one of the main points of the criticism of CDA on the one hand, and the need for exhaustive and in-depth analyses on the other, we probably ought to consider to what extent (if any) it is justified to supplement our analyses and conclusions with observations that do not result from the application of our originally planned methods. If we wish corpus methods to solve the problem of the pivotal role of the researcher’s intuition, we still need to address the question of categorization. How (if not on the basis of intuition) should we decide whether an item is positively or negatively loaded? What (if not intuition) is the best basis for deciding to which semantic field a word belongs? Some answers to that problem may be found in computational linguistics, in techniques such as sentiment analysis (Thelwall et al. 2010) or automated semantic tagging (Prentice 2010). However, the suitability of these solutions for an essentially interpretative perspective such as CDA could be challenged.

Self-criticism is one of the important features of CDA, advised for example by Fairclough (2001). We hope therefore that the presented critical overview, although not extensive and admittedly biased by the author’s personal views, will draw our attention to some crucial points in the design and analysis of corpus data for CDA purposes and inspire further debates. We also hope that these remarks will be in some way useful not only for CDA practitioners but also for all who wish to reconstruct some form of social meaning on the basis of text corpora.
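To make the categorization question concrete, the simplest alternative to an undocumented intuitive judgment is a lookup in an explicitly published word list. The sketch below is only that: the tiny lexicon, the function names and the collocate frequencies are invented for illustration, and the code does not implement the sentiment strength method of Thelwall et al. (2010) or any semantic tagger.

```python
from collections import Counter

# Hypothetical mini-lexicon; a real study would cite and justify its word lists.
POSITIVE = {"hope", "success", "healthy", "gift"}
NEGATIVE = {"risk", "threat", "sin", "failure"}


def label_collocate(word, positive=POSITIVE, negative=NEGATIVE):
    """Label a collocate as 'positive', 'negative' or 'neutral' by lookup."""
    w = word.lower()
    if w in positive and w not in negative:
        return "positive"
    if w in negative and w not in positive:
        return "negative"
    return "neutral"  # unknown or ambiguous items are not forced into a class


def prosody_profile(collocate_freqs):
    """Sum collocate frequencies per evaluative label."""
    profile = Counter()
    for word, freq in collocate_freqs.items():
        profile[label_collocate(word)] += freq
    return profile


# Invented collocate frequencies for some node word.
collocates = {"risk": 12, "hope": 5, "procedure": 20, "Sin": 3}
profile = prosody_profile(collocates)
```

Such a procedure is crude, and whether it suits an interpretative approach is exactly the open question raised above, but every decision it makes can be inspected and repeated by another researcher.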


References

Albakry, M. 2004. U.S. “Friendly Fire” Bombing of Canadian Troops: Analysis of the Investigative Reports. Critical Inquiry in Language Studies 1(3), 163–178.
Almeida, E. P. 2011. “Palestinian and Israeli Voices in Five Years of U.S. Newspaper Discourse.” International Journal of Communication 5, 1586–1605.
Babbie, E. R. 2013. The practice of social research. Belmont, CA: Wadsworth Cengage Learning.
Bachmann, I. 2011. “Civil partnership – ‘gay marriage in all but name:’ a corpus-driven analysis of discourses of same-sex relationships in the UK Parliament.” Corpora 6(1), 77–105.
Baker, P. 2006. Using corpora in discourse analysis. London: Continuum.
Baker, P. 2011. Social involvement in corpus studies: Interview with Paul Baker. In V. S. Viana, G. Zyngier & G. Barnbrook (eds.), Perspectives on corpus linguistics, 17–28. Amsterdam: John Benjamins.
Baker, P. 2012. “Acceptable bias? Using corpus linguistics methods with critical discourse analysis.” Critical Discourse Studies 9(3), 247–256.
Baker, P., Gabrielatos, C., Khosravinik, M., Krzyzanowski, M., McEnery, T. & R. Wodak. 2008. “A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press.” Discourse & Society 19(3), 273–306.
Beaugrande, R. de. 2008. Krytyczna analiza dyskursu a znaczenie „demokracji” w wielkim korpusie. In A. Duszak & N. Fairclough (eds.), Krytyczna analiza dyskursu: Interdyscyplinarne podejście do komunikacji społecznej, 103–119. Kraków: TAiWPN Universitas.
Breeze, R. 2011. Critical Discourse Analysis and Its Critics. Pragmatics 21(4), 493–525.
Chen, L. 2012. Reporting news in China: Evaluation as an indicator of change in the China Daily. China Information 26(3), 303–329.
Degano, C. 2007. Dissociation and Presupposition in Discourse: A Corpus Study. Argumentation 21(4), 361–378.
Dijk, T. A. van. 2008. Discourse and context: A socio-cognitive approach. Cambridge: Cambridge University Press.
Don, Z. M., Knowles, G. & Choong, K. F. 2010. Nationhood and Malaysian identity: a corpus-based approach. Text & Talk – An Interdisciplinary Journal of Language, Discourse & Communication Studies 30(3), 267–287.
Duszak, A. & N. Fairclough (eds.) 2008. Krytyczna analiza dyskursu: Interdyscyplinarne podejście do komunikacji społecznej. Kraków: TAiWPN Universitas.


Edwards, G. O. 2012. “A comparative discourse analysis of the construction of ‘ingroups’ in the 2005 and 2010 manifestos of the British National Party.” Discourse & Society 23(3), 245–258.
Flowerdew, J. 1997. “The Discourse of Colonial Withdrawal: A Case Study in the Creation of Mythic Discourse.” Discourse & Society 8(4), 453–477.
Forchtner, B. & C. Kolvraa. 2012. “Narrating a ‘new Europe’: From ‘bitter past’ to self-righteousness?” Discourse & Society 23(4), 377–400.
Fowler, R. 1996. “On critical linguistics.” In C. R. Caldas-Coulthard & M. Coulthard (eds.), Texts and practices: Readings in critical discourse analysis, 3–14. London: Routledge.
Freake, R., Gentil, G. & J. Sheyholislami. 2010. “A bilingual corpus-assisted discourse study of the construction of nationhood and belonging in Quebec.” Discourse & Society 22(1), 21–47.
Gabrielatos, C. 2009. Corpus-based methodology and critical discourse studies. Context, content, computation. (Siena English Language and Linguistics Seminars (SELLS).) Siena.
Gabrielatos, C. & P. Baker. 2008. “Fleeing, Sneaking, Flooding: A Corpus Analysis of Discursive Constructions of Refugees and Asylum Seekers in the UK Press, 1996–2005.” Journal of English Linguistics 36(1), 5–38.
Gabrielatos, C. & A. Duguid. 2014. Corpus Linguistics and CDA. A critical look at synergy. CDA20+ Symposium. Amsterdam.
Hardt-Mautner, G. 1995. “Only Connect.” Critical Discourse Analysis and Corpus Linguistics.
Herbel-Eisenmann, B. & D. Wagner. 2010. “Appraising lexical bundles in mathematics classroom discourse: obligation and choice.” Educational Studies in Mathematics 75(1), 43–63.
Kamasa, V. 2012. “Naming ‘In Vitro Fertilization:’ Critical Discourse Analysis of the Polish Catholic Church’s Official Documents.” Procedia – Social and Behavioral Sciences 95, 154–159.
Koller, V. 2004. “Businesswomen and war metaphors: ‘Possessive, jealous and pugnacious’?” Journal of Sociolinguistics 8(1), 3–22.
Krishnamurthy, R. 1996. “Ethnic, Racial and Tribal: The Language of Racism?” In C. R. Caldas-Coulthard & M. Coulthard (eds.), Texts and practices: Readings in critical discourse analysis, 129–149. London: Routledge.
Lischinsky, A. 2011. “In times of crisis: a corpus approach to the construction of the global financial crisis in annual reports.” Critical Discourse Studies 8(3), 153–168.


Lukac, M. 2011. “Down to the bone: A corpus-based critical discourse analysis of pro-eating disorder blogs.” Jezikoslovlje 12(2), 187–209.
Marchi, A. & C. Taylor. 2009. “If on a Winter’s Night Two Researchers… A Challenge to Assumptions of Soundness of Interpretation.” Critical Approaches to Discourse Analysis across Disciplines 3(1), 1–20.
Marling, R. 2010. “The Intimidating Other: Feminist Critical Discourse Analysis of the Representation of Feminism in Estonian Print Media.” NORA – Nordic Journal of Feminist and Gender Research 18(1), 7–19.
Mautner, G. 2007. “Mining large corpora for social information: The case of elderly.” Language in Society 36(1).
Mautner, G. 2009a. “Checks and balances: how corpus linguistics can contribute to CDA.” In R. Wodak & M. Meyer (eds.), Methods of critical discourse analysis, 122–144. London: SAGE.
Mautner, G. 2009b. “Corpora and Critical Discourse Analysis.” In P. Baker (ed.), Contemporary corpus linguistics, 32–46. London: Continuum.
Mulderrig, J. 2011. “Manufacturing Consent: A corpus-based critical discourse analysis of New Labour’s educational governance.” Educational Philosophy and Theory 43(6), 562–578.
O’Halloran, K. 2009. “Inferencing and cultural reproduction: a corpus-based critical discourse analysis.” Text & Talk – An Interdisciplinary Journal of Language, Discourse & Communication Studies 29(1), 21–51.
Orpin, D. 2005. “Corpus Linguistics and Critical Discourse Analysis: Examining the ideology of sleaze.” International Journal of Corpus Linguistics 10(1), 37–61.
Prentice, S. 2010. “Using automated semantic tagging in Critical Discourse Analysis: A case study on Scottish independence from a Scottish nationalist perspective.” Discourse & Society 21(4), 405–437.
Reisigl, M. & R. Wodak. 2001. Discourse and discrimination: Rhetorics of racism and antisemitism. London: Routledge.
Rogers, R., Malancharuvil-Berkes, E., Mosley, M., Hui, D. & G. O. Joseph. 2005. “Critical Discourse Analysis in Education: A Review of the Literature.” Review of Educational Research 75(3), 365–416.
Salama, A. H. Y. 2011. “Ideological collocation and the recontexualization of Wahhabi-Saudi Islam post-9/11: A synergy of corpus linguistics and critical discourse analysis.” Discourse & Society 22(3), 315–342.
Scott, M. 2013. WordSmith Tools Help. Accessed September 7, 2013.
Stubbs, M. 1997. “Whorf’s Children: Critical comments on Critical Discourse Analysis (CDA).” In A. Ryan & A. Wray (eds.), Evolving models of language: Papers from the Annual Meeting of the British Association for Applied Linguistics held at the University of Wales, Swansea, September 1996, 110–116. Clevedon: British Association for Applied Linguistics.
Subtirelu, N. C. 2013. “‘English… it’s part of our blood’: Ideologies of language and nation in United States Congressional discourse.” Journal of Sociolinguistics 17(1), 37–65.
Thelwall, M., Buckley, K., Paltoglou, G. & D. Cai. 2010. “Sentiment strength detection in short informal text.” Journal of the American Society for Information Science and Technology 61(12), 2544–2558.
Verschueren, J. 2001. “Predicaments of Criticism.” Critique of Anthropology 21(1), 59–81.
Weninger, C. 2010. “The lexico-grammar of partnerships: corpus patterns of facilitated agency.” Text & Talk – An Interdisciplinary Journal of Language, Discourse & Communication Studies 30(5), 591–613.
Widdowson, H. G. 1995. “Discourse analysis: a critical view.” Language and Literature 4(3), 157–172.
Widdowson, H. G. 1998. “The Theory and Practice of Critical Discourse Analysis.” Applied Linguistics 19(1), 136–151.
Wodak, R. & M. Meyer. (eds.) 2009. Methods of critical discourse analysis. London: SAGE.
Yasin, M. S. M., Hamid, B. A., Keong, Y. C., Othman, Z. & A. Jaludin. 2012. “Linguistic Sexism In Qatari Primary Mathematics Textbooks.” GEMA Online™ Journal of Language Studies 12(1), 53–68.

Dorota Pierścińska University of Łódź

Towards quantitative and qualitative characterisation of various types of dialogue: interviews vs. Panel Discussions

Abstract: Both interviews and panel discussions share one common characteristic – they feature discourse participants in a meaningful conversation whose topic and speakers are selected in advance. However, the pattern of exchanges is different for the two genres: one to one vs. one to many. The main aim of this study is to identify quantitative parameters of dialogue in the two genres, and to relate them to the qualitative features in order to develop and propose a general characterisation of the two genres. To this end, patterns of occurrence of the grammatical categories of frequent lexemes, keywords, and 4-word clusters are investigated. Two sample corpora have been compiled and explored: one consisting of interviews, dated 2009–2014, of about 12 000 words; the other including panel discussions, dated 2012, of about the same size. On the basis of quantitative analyses, it is plausible to argue that interviews are closer to real-life spoken discourse, i.e. more verbal and emotive. For instance, they involve a greater variety of lexis, a higher number of verb forms, and numerous interjections/expletives and fillers. Additionally, the data indicate that panel discussions rely to a greater extent on formulaic language (4-word clusters) and description (adjectives, adverbs). Considering the above, a hypothesis can be proposed that panel discussions are more concerned with the organisation of the discourse itself. Moreover, they are much more focused on passing information, explanation, and description.

Keywords: Dialogue, corpora, interviews, panel discussions, frequency, key words, clusters

1. Introduction

The object of this study is dialogue, more specifically two of its types: interviews and panel discussions. Dialogue is considered to be a reciprocal type of discourse where the potential for interaction is fully developed, i.e. participants both adjust to and influence the development of the communicative act. Interviews and panel discussions perform similar functions, mostly informative, sometimes persuasive, and their aims are alike, i.e. to present information about people or events and to disclose the speakers’ views and attitudes while allowing for their reaction to the questions and the interlocutors’ ideas. This paper reports the results of a study based on attested, naturally occurring spoken data. The investigation is corpus-driven with only one assumption, namely that different genres display different linguistic characteristics, at least to some extent. Therefore, the concept of genre is crucial for my investigation. As there are multiple definitions of genre, I have decided to rely on the most general view presented by Giltrow and Stein (2009: 1): “At the very least, genre works as a common intuitive concept – a sense that features of language aggregate in recognizable patterns, and that these aggregations indicate something important in the uses of language in context”. It is that aggregation of salient linguistic features and recurring patterns which is the object of this investigation.

Nowadays, many linguists argue for the significance of formulaicity, the importance of prefabricated elements and patterns in the way we construct utterances and use the language. As early as the 1980s, Bakhtin (1986: 87) argued: “When we select words in the process of constructing an utterance, we by no means always take them from the system of language in their neutral, dictionary form. We usually take them from other utterances, and mainly from utterances that are kindred to ours in genre, that is, in theme, composition, or style”, and “If speech genres did not exist and we had not mastered them, if we had to originate them during the speech process and construct each utterance at will for the first time, speech communication would be almost impossible” (1986: 79, quoted in Tannen, 1997: 139). The importance of genre analysis can be best summarized by citing Stubbs (1996: 8): “The ability to identify and compare different genres contributes to our ability to understand them.”

2. Goals

This study presents a systematic and critical interpretation of two types of dialogue which constitute two separate genres of spoken interaction. Thus, genre and cross-genre analysis is conducted in order to identify the quantitative and qualitative parameters characteristic of dialogic interaction in the two genres. Lexical, lexico-grammatical, and grammatical patterns are investigated since “it is important to show what can be learned by studying patterns of language across texts and corpora” (Stubbs 1996). To this end, the pattern of occurrence of the grammatical categories of frequent words, keywords, and 4-word lexical bundles is investigated. The issue of grammatical and lexico-grammatical structures as well as lexis peculiar to interviews and panel discussions is addressed. The focus is on linguistic or discursive features that might help to uncover both the distinctive and the common features of the two genres.

Quantitative and qualitative characterisation of various types of dialogue


In this study, which employs a bottom-up, data-driven methodology, I concentrate on salient lexis and grammatical patterns to demonstrate how meaning emerges in a given genre. Likewise, I attempt to uncover what factors might motivate the linguistic choices made by the speakers. Next, I attempt to relate all the lexical, grammatical, and structural elements to the genre they represent and show the way they constitute and characterize each genre. Then, textual and discursive peculiarities are addressed with the overall aim of specifying what underlies the perception of interviews and panel discussions as distinct genres.

3. Methodology

The methodology employed in this study involves two stages: first, a quantitative analysis is carried out with WordSmith Tools 5.0; then the interpretation of the data is presented. Sorting the data is an important task, as both frequency lists and keyword lists can be a point of departure for initial assumptions and further studies. The process involves not only automated counts but also manual sorting of the words into grammatical categories. Classifying and counting frequent words and keywords by the grammatical category they represent yields valuable data, since grammar is thought to be meaningful, at least in cognitive terms (Langacker 2000). Subsequently, collating and juxtaposing the frequent words and keywords by grammatical category uncovers their pattern of occurrence and makes it possible to describe their role and significance both in separate dialogues and across the two types. This, in turn, leads to a contrastive quantitative analysis of specific lexical and grammatical items across the genres. Dispersion plots are also used to establish whether those items are spread evenly across the whole corpus or are characteristic only of particular speakers. Next, 4-grams containing key words and the most frequent words are computed and then categorized grammatically and functionally, following the typology proposed by Biber (2006), which was modified to suit conversation analysis. Extraction of frequent clusters is crucial for uncovering language structures peculiar to texts. The structural division of frequent clusters in Biber’s typology (2006: 136, 137) is one of the parameters in the examination, alongside other classifications, e.g. into personal and impersonal clusters.
The inquiry provides insight into the grammatical classes of lexis, characteristic structures, and grammatical patterns, which makes it possible to draw conclusions about the form and the meaning present in both genres.


4. Corpora used in the study

Two sample corpora have been compiled for this investigation. They are of similar length, about 12 000 running words each, and represent two genres: interviews and panel discussions. The first sample corpus contains 5 interviews of about equal length, collected from various websites (2009–2014), concerning the professional and private lives of well-known people. The second is a sample of panel discussions of about the same size. It comprises fragments of live debates examining moral issues behind the news in 2012. Four fragments on various topics (politics, economy, health, and sport) have been selected from a corpus of texts from the BBC Radio 4 programme “The Moral Maze”. The panel consists of journalists, writers, publishers, politicians, and specialists in the areas discussed. Apart from the moderator, 8 to 9 speakers participate in each debate. The reference corpus is the downloadable part of COCA (1.7 million words); for this analysis, however, only the spoken component was used, which comprises 382 102 tokens, of which 376 552 were used for the word list. The texts mostly consist of transcriptions of radio and TV programmes, interviews, and discussions, dated 1990–2012. For some particular data regarding the occurrence and usage of specific lexical items and structures, the British National Corpus was employed.

5. Analysis

Both interviews and panel discussions share one common characteristic: they feature discourse participants in a meaningful conversation whose topic and speakers are selected in advance. However, the pattern of interaction differs because of the number of speakers: one to one vs. one to many. While in an interview only one person is supposed to ask questions, in a discussion the participants often interact with one another rather than just answer the moderator’s questions. Furthermore, it can be assumed that the level of attention to certain issues is different irrespective of the topic of the conversation. While interviews tend to focus more on personal lives, career, and work, panel discussions concentrate more on social and political issues overall.

5.1 Basic statistics of the two samples

Although the interview and panel discussion samples are nearly the same size, slightly over 12 thousand words, with 12,060 and 12,039 tokens respectively used for the word lists, two disparities can be observed. One concerns the type to token ratio, and the other the length of sentences. For the interviews there are 2,193 types (distinct words), with a type to token ratio (TTR) of 18.18. For the panel discussions there are only 1,987 types. Consequently, the TTR is much lower: 16.50. Moreover, the standardized type to token ratio is considerably higher for the interviews (37.60) than for the panel discussions (36.64). Taking this into account, one can argue that the language of the interviews is characterized by a greater variety of vocabulary. Furthermore, the results regarding the mean length of sentences in words are rather surprising, as the speakers in the interviews seem to construct longer sentences than those in the discussions: the mean sentence length is 18.76 and 17.84 words respectively.
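The type to token figures above can be checked against the reported counts. The following is a minimal sketch in Python, not the WordSmith Tools implementation. The function names are illustrative, and `sttr` rests on two assumptions that the chapter does not confirm: that the standardized ratio is the mean TTR of consecutive chunks of 1,000 tokens, and that a trailing partial chunk is ignored.

```python
def ttr_from_counts(types: int, tokens: int) -> float:
    """Type/token ratio as a percentage, computed from raw counts."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return 100.0 * types / tokens


def ttr(tokens: list[str]) -> float:
    """TTR (%) of a token list; types are case-insensitive word forms."""
    return ttr_from_counts(len({t.lower() for t in tokens}), len(tokens))


def sttr(tokens: list[str], chunk_size: int = 1000) -> float:
    """Mean TTR (%) over consecutive full chunks of `chunk_size` tokens.

    Assumption: the chunk size is a free parameter and a trailing
    partial chunk is dropped.
    """
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        raise ValueError("text is shorter than one chunk")
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)


# Counts reported in section 5.1
interviews_ttr = ttr_from_counts(2193, 12060)  # rounds to 18.18
panels_ttr = ttr_from_counts(1987, 12039)      # rounds to 16.50
```

Because the plain TTR falls as a text gets longer, a chunk-based average of this kind is the usual way to compare samples; here the two samples are almost equal in size, so both measures point in the same direction.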

5.2 Frequent words and keywords

This part of the analysis focuses on salient words, with special attention given to parts of speech of key value for spoken genres, such as personal pronouns, verbs, exclamations, and fillers. The distribution and variation of grammatical word classes across genres has been widely discussed and described by Biber et al. (1999), Leech et al. (2001), and Rayson et al. (2002). Frequent words, i.e. the words that comprise at least 0.5% of the text, are sorted into main grammatical categories. The content words and other characteristic words are then displayed in tables together with their frequency data. Likewise, the key words of the two genres are sorted according to the grammatical categories they represent and into other categories, e.g. negative key words, and the results for the two sample corpora are compared and contrasted. Finally, the results yielded by the word lists and the keyword lists are compared.
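The selection of frequent words by a relative-frequency cut-off can be sketched as follows. This is an illustration only: the function name and the toy token list are hypothetical, and the input is assumed to be already tokenized (the study itself relied on the WordSmith Tools word list).

```python
from collections import Counter


def frequent_words(tokens, min_percent=0.5):
    """Return (word, percent) pairs for words covering at least
    `min_percent` of all tokens, most frequent first."""
    counts = Counter(t.upper() for t in tokens)
    total = sum(counts.values())
    rows = [(word, 100.0 * count / total)
            for word, count in counts.items()
            if 100.0 * count / total >= min_percent]
    # Sort by descending share, then alphabetically for stable output.
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Toy input (invented): 400 tokens, of which "i" occurs 3 times
# and "you" twice; every other token occurs once.
toy_tokens = ["i"] * 3 + ["you"] * 2 + [f"w{n}" for n in range(395)]
toy_frequent = frequent_words(toy_tokens)
```

Sorting the resulting words into grammatical categories remains a manual step, as the corpora are not annotated for part of speech.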


5.2.1 Frequent words

Table 1. Frequent words comprising at least 0.5% of the text.1

Category                 Interviews                              Panel discussions
                         word                       % of words   word                       % of words

Personal pronouns        I+                         3.71%        YOU+                       2.08%
                         IT+                        1.74%        I+                         2.07%
                         YOU+                       1.43%        IT+                        1.64%
                         WE+                        0.77%        WE+                        0.98%
                                                                 THEY+                      0.74%
  Total                                             7.65%                                   7.51%

Verbs                    LIKE+ (adverb, preposition) 0.58%       HAVE+                      0.85%
                         DO+                        0.51%        THINK+                     0.77%
                                                                 DO+                        0.73%

Verb be                  WAS-                       1.36%        IS+                        2.21%
                         S+ (also genitive form)    1.32%        ARE-                       1.09%
                         IS+                        0.79%        S+ (also genitive form)    1.08%
                         BE-                        0.55%        BE-                        0.84%
                         M+                         0.52%        WAS-                       0.55%
  Total (without S)                                 3.22%                                   4.69%

Syntactic negation       N’T+                       0.64%        NOT+                       0.78%
                                                                 N’T+                       0.70%
  Total                                             0.64%                                   1.48%

Other frequent words     MY+                        0.97%        THERE+                     1.03%
                         JUST+                      0.54%        PEOPLE+                    0.81%

Other function words     19 words                   25.88%       12 words                   22.49%

1 The corpora are not annotated for part of speech, which results in the lack of a total value for some of the categories. The words that belong to more than one grammatical category are marked with italics.

The pluses and minuses in the table are assigned according to the list published by Leech, Rayson, and Wilson (2001), which gives the frequencies of word forms in the spoken part of the British National Corpus and compares them with the written part. The lists are available online at the companion website for the book: http://ucrel.lancs.ac.uk/bncfreq/. A plus (+) signifies higher frequency in speech; a minus (-) signifies higher frequency in writing. The results indicate that the interviews and panel discussions have an equal number of the most frequent lexical and function words that are predominantly used in written texts. Moreover, the percentage of their occurrence is similar, about 1.90%. The results do not cover the last group, i.e. other function words, which contains mainly articles, prepositions, and conjunctions more characteristic of written than of spoken genres (Appendices 3 and 4).

The total percentage of occurrence of personal pronouns is nearly identical for the two genres; however, there is a considerable difference in their pattern of occurrence. In contrast to the interviews, the panel discussions feature the personal pronoun they and show a lower presence of I, which is offset by a higher frequency of other pronouns, especially you. The members of panels address each other using you or by first name. It is a frequent way of referring to the other person’s views and opinions (you think). In addition, the high frequency of you can be attributed to the grammatical versatility of this pronoun: it can be both a subject and an object pronoun.

As far as verbs are concerned, apart from do and have, which mostly function as auxiliary verbs, the two genres differ in the use of main verbs: the lexeme think is characteristic of the panel discussions, while the word like (also an adverb or preposition) is peculiar to the interviews. Other differences regarding frequent words in the two genres include: the presence of the short form of am exclusively in the interviews, and of the second person form of the verb be exclusively in the panel discussions; a greater percentage of all the forms of the verb be in the panel discussions; a greater presence of written forms of this verb in the panel discussions when compared to the interviews; a lower presence of syntactic negation in the interviews; and a greater variety of function words, accompanied by their higher occurrence, in the interviews. This, in turn, may indicate a greater variety of structures in the interviews, which will be investigated by means of 4-word clusters.
The other frequent words, as shown in Table 1, point to a significant difference in the topics of the two genres. The interviews are mostly about what belongs to the speakers, e.g. my life, my dad, my teacher, my novel, etc., while the panel discussions seem to be concerned with other people.

5.2.2 Key words

Table 2 shows word classes and other characteristic features of the key words for the two genres. Since the corpora are not annotated for part of speech, the keyness value of most categories cannot be reliably calculated and the total cannot be worked out. Whenever the keyness value is given and the word belongs to more than one grammatical category, it is marked with italics. For this reason, as far as keyness value is concerned, there are no clear-cut categories apart from personal pronouns, the verb be, and negative key words. Looking at Table 2, one can easily notice the predominance of common nouns in the panel discussions. The full keyword lists are given in Appendices 1 and 2. We can safely assume that nominalization, which is more typical of written genres, e.g. academic writing (Biber et al. 1999; Leech et al. 2001; Rayson et al. 2002), takes place in the panel discussions. Also, the genre seems to be more descriptive: we can observe a higher occurrence of adjectives and adverbs. Although there seems to be only a moderate difference in the number of these categories in the two genres, one has to remember that common nouns very often act as modifiers, and this seems to be the case in the panel discussions. By contrast, the interviews are characterised by a greater number of verb forms and the high keyness value of I. The personal pronoun I is a keyword in the interviews: as a subject pronoun it marks the actor/doer, and the actions performed occur both in the past and in the present, e.g. reading, thinking, and writing, or rap and work. A characteristic feature of the interviews is the presence of verbs and phrases of desire and volition, in both past and present forms, such as would like and wanted, and of the construction expressing future intention, namely gonna. This contraction of going to signals the use of informal language. In the interviews, unlike in the panel discussions, the salient forms of be correspond to the key personal pronoun. In the panel discussions the form of be with the highest keyness value is is, while the third person personal pronoun he has a very salient negative keyness value. One can presume that he has been replaced by the pronoun there (a frequent word), or most probably by common nouns, in which the panel discussions abound.
In this genre, there is a significant presence of verbs of verbal interaction such as say and saying; we can therefore assume that the speakers often refer to one another’s or their own utterances. This implies the participants’ need for explanation, elaboration, rephrasing, and developing arguments. The modal verb can and its negative form cannot are salient in the panel discussions; thus the notions they express, e.g. possibility, ability, permission, or the lack of them, are important for this genre.

The study reveals that interjections and fillers are of key importance for both genres (Table 2). Interjections, expletives, swearwords, and fillers, which are predominantly present in the interviews, are characteristic of natural speech production. The first group (shit, fucking, indeed) indicates emotional involvement; such words are commonly used to express emotions. The second group (er, erm, actually) gives the speaker time for thinking and/or rephrasing. Expletives and swearwords are considered to be less polite interjections. According to the Collins English Dictionary, the expletive shit is a slang word, and the swearword fucking is a taboo slang word. It is believed that interjections are not a part of speech because they do not serve any grammatical function (Collins English Dictionary). However, they are crucial for spoken discourse as they are, apart from intonation, the most immediate way of venting emotions in a dialogue. Weigand (2004: 99) claims that “[interjections] are not only natural signs resulting from an overflow of feeling but linguistic signs which are culturally specific and associated with linguistic conventions with regard to prosody, grammar, meaning and use”. Therefore, they can characterise the speaker in that they are not only language specific but also indicative of social status, education, background, etc. Both interjections and fillers are much more peculiar to the interviews, and some of the interjections that appear in the interviews can be classified as swearwords. Swearwords can signal deviance from the norm or be a sign of social and cultural identity (Collins English Dictionary; Weigand 2004). In the case of an interview, a swearword such as fucking can carry both the social meaning and a sense of spoken informality and intimacy between the participants.

Logical connectors are very telling in that they show the predominant syntax of the two discourses. Connectors such as and, or, and then allow for clause chains and compound sentences formed by coordination, while which and that make it possible to create complex sentences formed by subordination. Givón (2009: 12) argues that complex clauses “are much less frequent in natural discourse and more difficult to process”. Which and that can also function as pronouns and adjectives, and then, apart from being a conjunctive adverb, can also function as an ordinary adverb.
On the basis of the value of negative keywords, bearing in mind that the reference corpus comprises only spoken language, it becomes obvious that the panel discussions are a less typical conversational genre than the interviews because of the higher negative keyness value of some of their lexis. This becomes even more significant when one is aware of the fact that the words from specific areas of language tend to associate with other words from the same or akin genre, register, or domain.
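The notion of positive and negative keyness used above can be illustrated with a short calculation. The chapter does not state which keyness statistic was selected in WordSmith Tools; the sketch below assumes the log-likelihood statistic, a common choice for keyword analysis, and attaches a negative sign when the word is relatively less frequent in the study corpus than in the reference corpus. The function name and the word frequencies in the example are invented; only the two corpus sizes come from the chapter.

```python
import math


def keyness_ll(freq_study, size_study, freq_ref, size_ref):
    """Signed log-likelihood keyness of one word.

    Positive: relatively more frequent in the study corpus.
    Negative: relatively less frequent (a "negative keyword").
    """
    total = size_study + size_ref
    combined = freq_study + freq_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total

    ll = 0.0
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    ll *= 2

    underused = freq_study / size_study < freq_ref / size_ref
    return -ll if underused else ll


# Corpus sizes from the chapter; the word frequencies are hypothetical.
STUDY_SIZE = 12060   # interviews sample
REF_SIZE = 376552    # spoken component of the COCA sample
overused = keyness_ll(60, STUDY_SIZE, 400, REF_SIZE)
underused = keyness_ll(2, STUDY_SIZE, 2500, REF_SIZE)
```

A word whose relative frequency is the same in both corpora receives a value of zero, so the magnitude of the score reflects how far the observed frequencies depart from that expectation.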


Table 2. Key words.

Category                      Interviews                               Panel discussions
                              Instance(s)/numbers          Keyness     Instance(s)/numbers   Keyness

Personal pronouns             I                            194.08      HE                    -118.82

Proper names                  16                                       4
Common nouns                  26                                       41
Nouns total                   42                                       45

Verbs                         WRITING                      64.64       VIEW                  40.94
                              READING                      42.33       THINK                 38.11
                              D (had, would, d’Ivoire)     38.55       CHEAT                 29.51
                              WANTED                       36.78       CANNOT                27.88
                              GONNA                        34.83       SAYING                26.57
                              RAP                          32.77       SAY                   24.69
                              THINKING                     32.81       CAN                   24.03
                              LIKE                         31.87
                              WORK                         25.24

Verb be                       WAS                          51.89       IS                    79.61
                              M                            32.81       ARE                   50.45
                                                                       BE                    24.74
Verb be total                                              84.70                             154.80

Adjectives/adverbs            13                                       21

Exclamations/expletives/      SHIT                         207.87      INDEED                60.17
interjections/                FUCKING                      125.39
discourse markers

Fillers/particles             ER                           88.46       ACTUALLY              46.41
                              ERM                          41.79
                              WELL                         -25.55

Logical connectors            AND                          36.58       WHICH                 35.53
                              OR                           30.61       THAT                  46.96
                              THEN                         26.57       INDEED                60.17

Negative KW                   HERE                         -27.23      OFF                   -25.05
                              WELL                         -25.55      MR                    -77.86
                                                                       HE                    -118.82
Negative KW total                                          -52.88                            -221.73

Other KW                      LIKE                         29.24       THAT                  47.96
                              A                            24.74       WHICH                 35.53
                                                                       OF                    24.81


5.3 Dispersion plots

The WordSmith Tools 5.0 function called plot calculates the dispersion value of a given word or words and also shows a graphic representation, which makes it possible to see whether the words are evenly distributed throughout the corpus. The higher the dispersion value, the more evenly the word is distributed. The highest possible dispersion value is 1. An investigation of the dispersion of the keywords in the interviews shows that words such as salient conjunctions, personal pronouns, auxiliary verbs, possessive adjectives, and articles are evenly distributed, unlike slang words, swearwords, exclamations, expletives, and even fillers, which are characteristic of particular speakers. The only content words that are evenly distributed throughout the whole corpus of the interviews are like, kind, and work. A similar tendency regarding the even distribution of grammar words can be noticed in the panel discussions, with the keyword that having the highest dispersion value (0.948). However, in contrast to the interviews, interjections and fillers are dispersed evenly throughout the whole corpus of the panel discussions. There are only two of them, indeed and actually, and they are widely used by all the speakers. The lexical words that occur in every text of the panel discussions corpus differ from those in the interviews; they are: moral, people, witness, view, think, certainly, saying, point, very, terms, be, say, can. There are definitely more of them than in the interviews, and one can distinguish at least three semantic fields: one to do with people, another with expressing opinions, and a third concerning verbal communication.
As there are significantly more content and grammar words in the panel discussions that are evenly spread over the whole corpus, the conclusion can be drawn that the discourse of the panel discussions is more homogeneous with regard to theme and structure than that of the interviews.
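A dispersion value with a maximum of 1 can be illustrated with Juilland’s D, a widely used measure that compares a word’s frequency across equal-sized segments of a text. Whether WordSmith Tools computes exactly this value is an assumption made here for illustration, as is the division into 8 segments; the function names are hypothetical.

```python
import math


def segment_counts(tokens, word, n_segments=8):
    """Count `word` in `n_segments` consecutive, near-equal segments."""
    size = len(tokens)
    bounds = [i * size // n_segments for i in range(n_segments + 1)]
    target = word.lower()
    return [sum(1 for t in tokens[lo:hi] if t.lower() == target)
            for lo, hi in zip(bounds, bounds[1:])]


def juilland_d(counts):
    """Juilland's D: 1 = perfectly even spread, 0 = all in one segment."""
    n = len(counts)
    if n < 2:
        raise ValueError("at least two segments are required")
    mean = sum(counts) / n
    if mean == 0:
        raise ValueError("the word does not occur in the text")
    # Population standard deviation of the per-segment frequencies.
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    variation = sd / mean
    return 1 - variation / math.sqrt(n - 1)


# Usage on a toy text (invented): "a" occurs once in every segment.
toy_text = ["a", "b"] * 4
toy_d = juilland_d(segment_counts(toy_text, "a", n_segments=4))
```

On such a measure, a swearword used by a single interviewee would score close to 0, while a grammar word used by all speakers would approach 1, which matches the contrast described above.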

5.4 4-word lexical bundles

Please note that I use the terms n-gram, cluster, and lexical bundle interchangeably to refer to frequently occurring sequences of words. In this study they have been computed with all the key words and the frequent words comprising at least 0.5% of the text. All the clusters have been sorted first by frequency of occurrence and then classified into grammatical and functional categories. When generating the clusters, one can decide on the number of words they consist of and on the minimum frequency of occurrence of the cluster in the text. Since the samples under investigation are relatively small, both parameters were set to 4.
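The extraction procedure described here can be sketched as follows. This is a simplified illustration rather than the WordSmith Tools routine: the function and parameter names are hypothetical, the input is assumed to be a flat token list, and no attempt is made to stop clusters from crossing sentence or speaker boundaries.

```python
from collections import Counter


def lexical_bundles(tokens, n=4, min_freq=4, anchors=None):
    """Return (cluster, frequency) pairs for n-word sequences that occur
    at least `min_freq` times, most frequent first.

    If `anchors` is given, only clusters containing at least one of
    those words (e.g. key words or frequent words) are kept.
    """
    words = [t.upper() for t in tokens]
    grams = Counter(tuple(words[i:i + n])
                    for i in range(len(words) - n + 1))
    anchor_set = None if anchors is None else {a.upper() for a in anchors}
    rows = [(" ".join(gram), freq)
            for gram, freq in grams.items()
            if freq >= min_freq
            and (anchor_set is None or anchor_set & set(gram))]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Toy input (invented): one phrase repeated four times.
toy = "in the middle of the road".split() * 4
all_bundles = lexical_bundles(toy)
road_bundles = lexical_bundles(toy, anchors=["road"])
```

With both parameters set to 4, a sequence must recur at least four times in roughly 12,000 tokens to be listed, which keeps the lists short enough for manual classification.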


Table 3. Clusters with key words and frequent words comprising at least 0.5% of the text.

 N  Interviews                  Frq     N  Panel discussions           Frq
 1  ART OF THE KNOCK             15     1  VERY MUCH INDEED OUR          7
 2  THE ART OF THE               13     2  SPIRIT OF THE GAME            7
 3  I M LIKE I                    5     3  OF THE SPIRIT OF              7
 4  OF SHUT EVERYTHING ELSE       4     4  THE SPIRIT OF THE             6
 5  A BIT OF A                    4     5  THE PROBLEM IS THAT           6
 6  READING AND DEEP THINKING     4     6  IN TERMS OF THE               6
 7  DEEP READING AND DEEP         4     7  I DO N’T THINK                6
 8  AND DEEP THINKING WHAT        4     8  HAVE TO GO TO                 6
 9  I M GOING TO                  4     9  THANK YOU VERY MUCH           5
10  REALLY DO THIS SHIT           4    10  YOU VERY MUCH INDEED          5
11  M NOT GOING TO                4    11  OUR NEXT WITNESS IS           5
12  I DO N’T HAVE                 4    12  OF THE BRITISH STATE          5
13  DO N’T HAVE THAT              4    13  SO ON AND SO                  5
14  BUT I DO N’T                  4    14  ON AND SO FORTH               5
15  BE IN THE STREETS             4    15  I M NOT SAYING                5
16  I COULD N’T GET               4    16  INDEED OUR NEXT WITNESS       4
17  I JUST STARTED DOING          4    17  MUCH INDEED OUR NEXT          4
18  I COULD REALLY DO             4    18  THE POINT ABOUT SPORT         4
19  COULD REALLY DO THIS          4    19  WHAT WE RE DOING              4
20  WHEN I SAW MY                 4    20  OUR FIRST WITNESS IS          4
21  MY LAP WHEN I                 4    21  IS THAT THERE IS              4
22  LAP WHEN I SAW                4    22  FOUL OF THE SPIRIT            4
23  THE MIDDLE OF THE             4    23  BY THE STANDARDS OF           4
24  SOMETIMES WE LL BE            4    24  THE ACTIONS OF THE            4
25  IN THE MIDDLE OF              4    25  THEY CAN GET TO               4
26  OF THE ART OF                 4    26  IN THE CONTEXT OF             4
27  TO BEEF THAT SHIT             4    27  ON THE LINE FROM              4
28  BEEF THAT SHIT UP             4    28  CLOSEST THEY CAN GET          4
29  THAT SLAPS IN THE             4    29  IS THAT THERE IS              4
30  THAT SHIT UP THAT             4    30  TO GO TO THE                  4
31  WANT TO HEAR THAT             4    31  I M TRYING TO                 4
32  TO HEAR THAT SHIT             4    32  A CASE FOR DAMAGES            4
33  UP THAT S NOT                 4    33  AND SO ON AND                 4
34  HEAR THAT SHIT THAT           4    34  YOU VE JUST SAID              4
35  THAT SHIT THAT SLAPS          4    35  VERY MUCH INDEED OUR          4
36  SHIT THAT SLAPS IN            4    36  M NOT SAYING THAT             4
37  SHIT UP THAT S                4    37  IN THE PAST I                 4
38  IT IT S NOT                   4    38  A GHASTLY ABERRATION OR       4
39  IN REALLY IN A                4    39  UNHAPPY BECAUSE THEY HAVE     4
40  IN IN REALLY IN               4    40  THEY RE UNHAPPY BECAUSE       4
41  KNOW IN IN IN                 4    41  RE UNHAPPY BECAUSE THEY       4
42  IN IN IN REALLY               4    42  WAS A LABEL THAT              4
43  CLUB SOMETIMES WE LL          4    43  WAS A GHASTLY ABERRATION      4
44  FOR THE MOST PART             4    44  A LABEL THAT WAS              4
45  THE STUDIO AND THEY           4
46  THE STREETS AND I             4
47  SO I JUST STARTED             4


There are 47 different 4-word lexical bundles in the interviews sample and 44 in the sample of panel discussions, as shown in Table 3. However, in the interviews, 3 clusters refer to the title of a book (The Art of the Knock), which is frequently mentioned by the speakers. For this reason, they cannot be counted as speech clusters. Thus, one is left with an equal number of speech clusters for each sample, i.e. 44. Although both genres include the same number of different spoken clusters, the use of those clusters varies between the genres. The frequency of occurrence of the clusters in the text is much higher for the discussions than for the interviews. In the interviews all of the clusters but one occur only 4 times, and the highest frequency is 5, while in the discussions one finds 3 clusters that occur 7 times, 5 clusters that occur 6 times, and 7 clusters with a frequency of 5. Thus, the frequency of occurrence of the most frequent speech clusters in the panel discussions is significantly higher than in the interviews, and the density of the clusters in the discussions is consequently much higher. This means that the language of the panel discussions is based on repetition to a greater extent than that of the interviews. On the other hand, the clusters in the interviews display another characteristic trait which is absent from the panel discussions. Among the interviews’ clusters there are four frequent lexical bundles indicating hesitation by means of repetition: IN REALLY IN A; IN IN REALLY IN; KNOW IN IN IN; IN IN IN REALLY. This kind of repetition does not serve cohesive purposes but is characteristic of natural speech production.

5.4.1 Personal clusters

Among the 44 speech clusters there are 16 personal lexical bundles in the interviews (about 36%) and only 13 personal clusters in the panel discussions (about 29.5%). The percentages do not seem to differ much; however, in the interviews 75% of the personal clusters contain I, while in the panel discussions only 31% of the personal clusters are I clusters. The largest group of personal n-grams in the panel discussions are they clusters (about 38%).

Table 4. Frequent lexical bundles: personal clusters.

 N  Interviews                Freq.    Panel discussions           Freq.
 1  I M LIKE I                  5      I DO N’T THINK                6
 2  I M GOING TO                4      THANK YOU VERY MUCH           5
 3  I DO N’T HAVE               4      YOU VERY MUCH INDEED          5
 4  BUT I DO N’T                4      I M NOT SAYING                5
 5  I COULD N’T GET             4      WHAT WE RE DOING              4
 6  I JUST STARTED DOING        4      CLOSEST THEY CAN GET          4
 7  I COULD REALLY DO           4      I M TRYING TO                 4
 8  WHEN I SAW MY               4      YOU VE JUST SAID              4
 9  MY LAP WHEN I               4      IN THE PAST I                 4
10  LAP WHEN I SAW              4      UNHAPPY BECAUSE THEY HAVE     4
11  SOMETIMES WE LL BE          4      THEY RE UNHAPPY BECAUSE       4
12  IT IT S NOT                 4      RE UNHAPPY BECAUSE THEY       4
13  CLUB SOMETIMES WE LL        4      THEY CAN GET TO               4
14  THE STUDIO AND THEY         4
15  THE STREETS AND I           4
16  SO I JUST STARTED           4
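The shares of I clusters and they clusters quoted in this section can be recomputed from the clusters listed in Table 4. The sketch below is only a check on the arithmetic; it assumes that a cluster counts as an I (or they) cluster whenever it contains that pronoun as a separate token, and the helper name is hypothetical.

```python
# Personal clusters as listed in Table 4.
INTERVIEWS = [
    "I M LIKE I", "I M GOING TO", "I DO N'T HAVE", "BUT I DO N'T",
    "I COULD N'T GET", "I JUST STARTED DOING", "I COULD REALLY DO",
    "WHEN I SAW MY", "MY LAP WHEN I", "LAP WHEN I SAW",
    "SOMETIMES WE LL BE", "IT IT S NOT", "CLUB SOMETIMES WE LL",
    "THE STUDIO AND THEY", "THE STREETS AND I", "SO I JUST STARTED",
]
PANELS = [
    "I DO N'T THINK", "THANK YOU VERY MUCH", "YOU VERY MUCH INDEED",
    "I M NOT SAYING", "WHAT WE RE DOING", "CLOSEST THEY CAN GET",
    "I M TRYING TO", "YOU VE JUST SAID", "IN THE PAST I",
    "UNHAPPY BECAUSE THEY HAVE", "THEY RE UNHAPPY BECAUSE",
    "RE UNHAPPY BECAUSE THEY", "THEY CAN GET TO",
]


def pronoun_share(clusters, pronoun):
    """Percentage of clusters containing `pronoun` as a whole token."""
    hits = sum(1 for cluster in clusters if pronoun in cluster.split())
    return 100.0 * hits / len(clusters)


i_share_interviews = pronoun_share(INTERVIEWS, "I")   # 12 of 16 clusters
i_share_panels = pronoun_share(PANELS, "I")           # 4 of 13 clusters
they_share_panels = pronoun_share(PANELS, "THEY")     # 5 of 13 clusters
```

Matching on whole tokens matters here: a substring test would wrongly count IT IT S NOT as an I cluster.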

5.4.2 The/of clusters

Table 5. THE and/or OF clusters.

 N  Interviews                    Freq.    Panel discussions           Freq.
 1  ART OF THE KNOCK +             15      SPIRIT OF THE GAME +          7
 2  THE ART OF THE +               13      OF THE SPIRIT OF +            7
 3  OF THE ART OF                   4      THE SPIRIT OF THE +           6
 4  OF SHUT EVERYTHING ELSE         4      THE PROBLEM IS THAT           6
 5  A BIT OF A +                    4      IN TERMS OF THE +             6
 6  THE MIDDLE OF THE +             4      OF THE BRITISH STATE          5
 7  IN THE MIDDLE OF +              4      THE POINT ABOUT SPORT         4
 8  THE STUDIO AND THEY             4      FOUL OF THE SPIRIT +          4
 9  THE STREETS AND I               4      BY THE STANDARDS OF +         4
10                                         THE ACTIONS OF THE +          4
11                                         IN THE CONTEXT OF +           4

*Interview clusters 1, 2, and 3 constitute the title of a book, The Art of the Knock, and are therefore not counted as speech clusters.

On the one hand, clusters comprising the and/or of are characteristic of description, narration, and argumentative texts. Hyland (2008), who investigated 4-word lexical bundles in academic writing, discovered that “the noun phrase with of-phrase fragment is the most common structure overall, comprising about a quarter of all forms [4-word lexical bundles] in the corpus” (Hyland 2008: 10). On the other hand, such constructions are infrequent in spontaneous spoken language. In this respect the interviews, unlike the panel discussions, are very close to pure dialogic form. The sample yielded only three typical clusters of this kind (a bit of a, the middle of the, in the middle of), while in the panel discussions 11 the/of clusters have been distinguished. Another thing worth noticing is that the number of clusters of this kind, i.e. the noun phrase with of-phrase fragment (marked with + in Table 5), is inversely related to the number of frequent function words in the word list (other than auxiliary verbs and negation). The function words considered here are those that comprise over 0.5% of the text (see Appendices 3 and 4). Although most function words are peculiar to written language, their variety, especially that of prepositions and conjunctions, contributes to a greater variety of structures and thus to a lower number of repeated expressions and structures. Therefore, a greater variety of those function words, which is the case for the interviews, contributes to the language being less formulaic.

5.4.3 Fixed/semi-fixed phrases

Table 6. The clusters that are fixed/semi-fixed phrases.

 N  Interviews              N  Panel discussions
 1  A BIT OF A              1  THE SPIRIT OF THE
 2  BUT I DO N’T            2  THE PROBLEM IS THAT
 3  THE MIDDLE OF THE       3  IN TERMS OF THE
 4  IN THE MIDDLE OF        4  I DO N’T THINK
 5  FOR THE MOST PART       5  THANK YOU VERY MUCH
                            6  THE POINT ABOUT SPORT
                            7  BY THE STANDARDS OF
                            8  THE ACTIONS OF THE
                            9  IN THE CONTEXT OF
                           10  A CASE FOR DAMAGES

The quantity of fixed and semi-fixed expressions in the two genres (Table 6) corroborates my earlier claim about the greater formulaicity of the panel discussions. The interviews comprise only 5 frequent lexical bundles that can be classified as fixed or semi-fixed phrases, while the panel discussions include twice as many. At least two of the expressions that come from the panel discussions corpus are classified by Aijmer (1996) as ‘local discourse markers’: the problem is that and the point about sth (is). The data in Table 6 indicate that the discourse of the panel discussions relies twice as much on formulaicity as that of the interviews, at least with regard to what we consciously perceive and regard as formulaic expressions. All of the expressions can be found in the BNC; however, some of the semi-fixed phrases given above can be completed with other words, e.g. for the most (serious cases), a case for (paying), or the point about (them). In some cases the phrases instantiate a predominant collocation; in others the grammatical pattern, i.e. the structure itself, seems to be the most important. To illustrate the former: there are 1049 cases of for the most (…) in the BNC, of which 719 constitute the expression for the most part. The latter can be supported by considering some other grammatical patterns for the structures, such as a case for V-ing, a case for a/the plus singular noun, or a case for followed by an adjective, numeral, or some other determiner which may precede a noun or V-ing form; 274 instances of this structure can be found in the BNC. Likewise, the point about (118 cases in the BNC) can be followed by a pronoun, a noun preceded by an adjective or determiner, or a V-ing form. Apart from the fact that a particular lexical item can be changed, semi-fixed phrases work the same way for the speaker and the hearer in that both know what can follow or be expected next, as the grammatical choice is limited.

5.4.4 Grammatical patterns of 4-word clusters

As shown in Table 7, most of the clusters of the two genres can be put into 10 structural categories plus an unclassified group. The categories with the greatest disparities are marked with italics. About the same number of types of structures were distinguished for each genre; however, there are some quantitative differences as well as peculiarities regarding the occurrence of particular grammatical patterns in one genre but not the other. The predominant structure for both genres is the one based on the verb phrase (fragment). While in the panel discussions the number of occurrences of this construction is only slightly higher than that of other structures, such as pronoun plus verb phrase (fragment) or constructs based on adjective (phrase), in the interviews it prevails, occurring nearly twice as many times as the second in the list, i.e. pronoun plus verb phrase (fragment). The greatest differences can be observed as far as noun phrase clusters and logical connector plus noun/pronoun phrase are concerned. The former occur exclusively in the panel discussions while the latter are found solely in the interviews, so the numbers of their occurrences are inversely related in the two genres. A similar tendency appears regarding prepositional phrase plus of, and noun/noun phrase plus and plus noun phrase (fragment)/personal pronoun. It is worth noticing that in the interviews there are no structures purely based on the noun phrase.


Table 7. Grammatical patterns of 4-word clusters.

Structure
I. Verb phrase (fragment)
II. Pronoun + verb phrase (fragment)
III. Noun phrase + of
IV. Prepositional phrase + of
V. Prepositional phrase (+)
VI. Noun/noun phrase + and + noun phrase (fragment)/personal pronoun
VII. Logical connector + noun/pronoun phrase
VIII. Noun phrase (+)
IX. Adjective (phrase) (+)
X. Clause
XI. Unclassified

Interviews: 11 6 2 1 4 3 4 3 1 11

Panel discussions: 7 6 3 4 3 4 5 1 11
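The extraction of 4-word clusters that underlies this section can be illustrated with a minimal sketch. It is not the procedure used in the study, whose software is not named in this passage: the tokenisation rule, the frequency threshold `min_freq` and the sample sentence are all assumptions made for illustration only.

```python
import re
from collections import Counter


def four_word_clusters(text, min_freq=2):
    """Return (cluster, frequency) pairs for 4-word sequences in text.

    Assumption: tokens are runs of letters and apostrophes, lowercased;
    punctuation is ignored, so clusters may span sentence boundaries.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of four tokens over the text.
    grams = zip(tokens, tokens[1:], tokens[2:], tokens[3:])
    counts = Counter(" ".join(gram) for gram in grams)
    return [(c, n) for c, n in counts.most_common() if n >= min_freq]


# Hypothetical sample, not taken from either corpus.
sample = (
    "thank you very much indeed . the problem is that we disagree . "
    "thank you very much for coming . the problem is that nobody listens"
)
clusters = four_word_clusters(sample)
```

Assigning each retrieved cluster to a structural category such as those in Table 7 remains a manual, interpretive step that this sketch does not attempt.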

6. Findings

In his two-dimensional plot of four genres, Biber (1988: 18) places conversation and panel discussions on the same side of an axis that marks a high presence of pronouns (apart from third person pronouns) and contractions, and on opposite sides of another axis, indicating that panel discussions, in contrast to conversation, are characterized by many nominalizations and passives. The results of my investigation only partly corroborate Biber’s claim. The high presence of personal pronouns and contracted syntactic negation (N’T) among the frequent words of both corpora certainly does. Yet the keyword list supports Biber’s results only in part: I is a keyword for the interviews only, there are no positive keywords among the personal pronouns in the panel discussions, he is a negative keyword for the panel discussions, and keyword contractions appear only in the interviews (D, M). As far as nominalizations are concerned, my findings fully support those presented by Biber; however, his multi-dimensional analysis ties two factors together, i.e. nominalizations and passives. In this study, neither the list of frequent words nor the keywords indicate a salient presence of past participles, which are necessary for passive constructions. The findings regarding evenly dispersed frequent words and keywords are significant, as they reveal not only the leading theme (in the panel discussions, for example, moral issues and people in general) but also the structure of the discourse itself. For instance, they indicate that in the panel discussions the participants express their opinions (think, view, point) and frequently refer both to their interlocutors’ words and to their own (saying, say, point). What is more, the particulars of the text are revealed in


Dorota Pierścińska

that the speakers take into account facts (witness, terms) as well as draw conclusions and express their own point of view (think, view, certainly, can). With regard to the interviews, the subject matter present throughout the sample corpus is work. Although the use of idiosyncratic and emotional language suggests a very personal attitude and the possible disclosure of personal or intimate information, no specific lexical items indicate a common topic in that domain across the whole corpus. The results also identify the words most typical of the interview dialogue, namely like and kind. The former owes its popularity at least partly to its versatility across parts of speech; the latter forms part of the frequent spoken phrase kind of. According to Leech, Rayson, & Wilson (2001), like as a verb, adverb and preposition is characteristic of spoken language. According to its BNC profile, the phrase kind of is used predominantly in spoken genres, namely lectures, broadcast discussions and interviews; its predominant medium is speech, and its domains, which are also spoken, include education and leisure. Biber finds hedges co-occurring with interactive features (e.g. first and second person pronouns and questions) and with other features marking reduced or generalized lexical content (e.g. general emphatics, pronoun it, contractions) (Biber 1988: 240). In my investigation the hedge kind of co-occurs to a certain extent with some of the features mentioned. However, at the scale of the whole corpus it does not reduce the number of distinct lexical items; on the contrary, the interviews show more lexical variety than the panel discussions, where hedges are not salient. Another factor seems to have more influence on lexical variety, i.e. formulaicity: the more formulaic expressions, the less lexical variety. It has been demonstrated that formulaic language is a prominent feature of the panel discussions’ discourse. The notions of conscious choice, decision making, and future plans are of key importance for the speakers in the interviews. Additionally, the interviews are more versatile than the panel discussions as far as time is concerned. In the former, speakers move freely from the present into the past (key verbs such as d, wanted, and was) and look into the future (gonna), while the latter seem to reside in the present. One can certainly imagine a panel discussing the past or the future of something, in which case one of those tenses, or even two, would predominate; however, I suggest that this kind of free switching between all the tenses is not viable in the panel discussions as a genre. Besides, it can be concluded that the interviews are always personal and more concerned with emotions, while the panel discussions tend to be either impersonal or based on the opposition of I vs. you/they, and more informative. In the


interviews emotions are expressed outwardly, thus pointing to the social status, education, or social affiliation of the speakers, while in the panel discussions they are expressed in a socially accepted way. The results demonstrate that interviews and panel discussions differ in that the former are more verbal and spontaneous, while the latter are more grammaticalized, structured and well-organized. This claim is supported by the occurrence of informal language, less complex syntax and coordinated clauses in the interviews, and by more complex syntax, subordinate clauses, twice the number of clusters that constitute fixed or semi-fixed phrases, and greater use of nominalization in the discussions. The claim that the discourse of the panel discussions is well structured at the local level is supported by the presence of local discourse markers such as actually, indeed, the problem is that, and the point about sth (is). Discourse markers are a strategic device for coping with the real-time limits typical of spoken interaction. They often index the speaker’s close attention to his or her own words or to the interlocutor’s utterance. The grammatical structures peculiar to the interviews are verb phrases or their fragments, while the panel discussions show an even distribution of all grammatical constructions. This in turn indicates the more spontaneous and conversational character of the former genre and the more balanced constitution of the latter. Considering the above, I assert that panel discussions are more concerned with the organisation of the discourse itself and are much more focused on conveying information and explaining phenomena than the interviews. The language of the interviews, on the other hand, is more idiosyncratic and varied. Moreover, the interviews are concerned with the doers themselves and the activities they undertake, which is not the case for the panel discussions unless those activities involve verbal communication, e.g. say, saying. Additionally, the data indicate the panel discussions’ greater reliance on formulaic language and description (adjectives, adverbs), which suggests the speakers’ greater distance from both the topic and the participants. Next, the discourse of the panel discussions is characterized by more complex and developed politeness strategies, displayed through a wide range of linguistic features such as salient polite discourse markers (actually, indeed) and frequent phrases, e.g. thank you very much (indeed). The genre also abounds in discourse markers and expressions such as in my/your/public view, the problem is that, in terms of sth, I don’t think, the point about sth, by the standards of, in the context of, which I would classify as discourse markers of argumentative strategies. In this genre, speakers who often appear in public and are obliged to talk under time constraints have developed a set of formulaic expressions to reduce both production and processing effort, especially regarding the


presentation of arguments and counterarguments. All of those features contribute to our perception of the discourse as not only well-organized and smooth, but also well-considered and thought through.
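The positive and negative keywords discussed in this section come from comparing word frequencies across two corpora. A minimal sketch of one common keyness statistic, log-likelihood, is given below. This passage does not state which statistic or software the study used, so the choice of log-likelihood is an assumption, and the function names and the counts in the usage lines are hypothetical.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood for one word observed in two corpora.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in each corpus.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    ll = 0.0
    # A zero frequency contributes nothing to the sum.
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll


def keyness(freq_a, size_a, freq_b, size_b):
    """Signed score: positive if the word is relatively more frequent
    in corpus A (a positive keyword), negative if less frequent."""
    ll = log_likelihood(freq_a, size_a, freq_b, size_b)
    return ll if freq_a / size_a >= freq_b / size_b else -ll


# Hypothetical counts: a word occurring 30 times in a 1,000-token
# interview sample and 10 times in a 1,000-token panel sample.
score_interviews = keyness(30, 1000, 10, 1000)
score_panel = keyness(10, 1000, 30, 1000)
```

A word with a high positive score would be a candidate keyword for the first corpus, while a strongly negative score corresponds to what is called a negative keyword above.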

7. Conclusions

The discourse of panel discussions is about expressing opinions, speculating and reflecting on issues, arguing and developing arguments, hypothesizing and theorizing, while the discourse of interviews is concerned with giving factual information about both the present and the past, from the point of view of the speaker only, as well as with expressing emotions, hopes and plans for the future. Interviews as a genre are more concerned with the individual, the agent, the I as a centre, while panel discussions promote the group as the theme and pro-social linguistic behaviour. The linguistic data show that interviews display characteristics of real-life natural conversation, and at the same time indicate the proximity of panel discussions to informative genres. The linguistic choices and omissions (what the speakers do not choose in the given genre) seem to suggest two things. The first is that the speakers are sensitive to the linguistic practices of their community; what is more, they approve of and apply the rules and the form of language characteristic of it. The second is that the genre convention may allow for or restrict the expression of identity. The interview as a genre promotes both the expression of community linguistic practices and individuality, the latter through the use of idiosyncratic forms. The panel discussions seem to restrict the speakers. This may be because in the here-and-now context the speakers become part of a new community, however small it might be, and their need to be understood and accepted imposes attitudes and values shared by the new group (the panel). These factors naturally affect linguistic choices and language behaviour, e.g. avoiding unacceptable vocabulary, using mitigating words, and trying to build more complex structures. This is not to say that the speakers in the panel discussions change their views, values, or affiliation. The panel discussion is a genre in which they can express controversial and personal views as well as those of their community. Rather, they change their linguistic strategies in order to be heard, recognized, and acknowledged as those who present valid ideas. This appears to be the reason why the feelings, social background and affiliation of the speakers are not as apparent in the panel discussions as they are in the interviews. Therefore, the two discourses are unlike each other: the discourse of interviews is heterogeneous and diversified, while that of panel discussions shows some consistency and uniformity and can thus be classified as homogeneous.


References

Adolphs, S. 2002. Genre and Spoken Discourse: Probabilities and Predictions. Nottingham Linguistic Circular 17. Retrieved from http://www.nottingham.ac.uk/~aezweb/nlc/adolphs.pdf.
Aijmer, K. 1996. Conversational Routines in English: Convention and Creativity. New York: Longman.
Archer, D. 2009. What’s in a Word-list? Investigating Word Frequency and Keyword Extraction. Farnham: Ashgate.
Archer, D., Culpeper, J., & P. Rayson. 2009. Love – ‘a familiar or a devil?’ An Exploration of Key Domains in Shakespeare’s Comedies and Tragedies. In D. Archer (ed.), What’s in a Word-list? 137–157. Farnham: Ashgate.
Betten, A., & M. Dannerer. (eds.). 2005. Dialogue Analysis IX: Dialogue in Literature and the Media. Selected Papers from the 9th IADA conference Salzburg 2003. Part 2: Media. Niemeyer.
Biber, D., Johansson, S., Leech, G., Conrad, S., & E. Finegan. 1999. Longman Grammar of Spoken and Written English. England: Pearson Education Limited.
Biber, D. 2006. University Language. A Corpus-Based Study of Spoken and Written Registers. Philadelphia, PA: John Benjamins.
Facchinetti, R., & F. Palmer. (eds.). 2004. English Modality in Perspective. Genre Analysis and Contrastive Studies. In T. Kohnen & J. Mukharjee (eds.), English corpus linguistics, Vol. 1. Frankfurt am Main: Peter Lang.
Flowerdew, J. & R. W. Forest. 2009. Schematic Structure and Lexico-Grammatical Realization in Corpus-based Genre Analysis: The Case of Research in the PhD Literature Review. In M. Charles, S. Hunston & D. Pecorari (eds.), Academic Writing. At the Interface of Corpus and Discourse, 15–36. London: Continuum.
Giltrow, J., & D. Stein. 2009. Genres in the Internet: Innovation, evolution, and genre theory. In J. Giltrow & D. Stein (eds.), Genres in the Internet: Issues in the theory of genre, 1–26. Amsterdam: John Benjamins.
Giltrow, J., & D. Stein. (eds.). 2009. Genres in the Internet: Issues in the theory of genre. Amsterdam: John Benjamins.
Givón, T. 2009. Genesis of Syntactic Complexity: Diachrony, Ontogeny, Neuro-Cognition, Evolution. Amsterdam: John Benjamins.
González, M. 2004. Pragmatic Markers in Oral Narrative. The case of English and Catalan. Amsterdam: John Benjamins.
Hoey, M. P. 1991. Patterns of Lexis in Text. Oxford: Oxford University Press.
Hyland, K. 2008. As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes 27, 4–21. doi: 10.1016/j.esp.2007.06.001.


Hyland, K. 2015. Genre, discipline and identity. Journal of English for Academic Purposes 19, 32–43. doi: 10.1016/j.jeap.2015.02.005.
Kretzschmar, W. A. Jr. 2009. The Linguistics of Speech. Cambridge: Cambridge University Press.
Langacker, R. W. 2000. Why a mind is necessary. Conceptualization, grammar and linguistic semantics. In L. Albertazzi (ed.), Meaning and Cognition. A multidisciplinary approach, 25–38. Philadelphia, PA: John Benjamins.
Leech, G. N., Rayson, P., & A. Wilson. 2001. Word frequencies in written and spoken English: based on the British National Corpus. London: Longman.
Rayson, P., Wilson, A. & G. Leech. 2002. Grammatical word class variation within the British National Corpus Sampler. In P. Peters, P. Collins & A. Smith (eds.), New Frontiers of Corpus Research: Papers from the Twenty First International Conference on English Language Research on Computerized Corpora, Sydney 2000. 29 –306. Amsterdam: Rodopi.
Scott, M. & C. Tribble. 2006. Textual Patterns. Key Words and Corpus Analysis in Language Education. Amsterdam: John Benjamins.
Sinclair, J. 2004. Trust the Text. London: Routledge.
Stubbs, M. 1996. Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Cambridge, MA: Blackwell Publishers.
Tagg, C. 2012. The Discourse of Text Messaging: Analysis of SMS Communication. London: Continuum.
Tannen, D. 1997. Involvement as Dialogue. In M. S. Macovski (ed.), Dialogue And Critical Discourse: Language, Culture, Critical Theory, 137–157. New York: Oxford University Press.
Weigand, E. 2004. Emotions. The simple and the complex. In E. Weigand (ed.), Emotion in Dialogic Interaction. Advances in the Complex, 3–21. Amsterdam: John Benjamins.
Womack, P. 2011. Dialogue. Abingdon: Routledge.

Reference materials

BYU-BNC: British National Corpus http://corpus.byu.edu/bnc/
COCA (Corpus of Contemporary American English) sample retrieved from http://corpus.byu.edu/full-text/samples.asp
Collins English Dictionary, 3rd edition. 1994. Glasgow: HarperCollins Publishers.
Frequency lists retrieved from http://ucrel.lancs.ac.uk/bncfreq/


Interviews retrieved from:
Author Spotlight, Kyle Minor, A Conversation with Philip Graham, June 4th, 2014 http://htmlgiant.com/author-spotlight/a-conversation-with-philipgraham/#more-119755
Film, Taylor Kitsch By Teddy Wayne, Published 01/28/14 http://www.interviewmagazine.com/film/taylor-kitsch#_
Interview: Jay Worthy on Lndn Drgs, Compton, and doing things differently, posted by Pigeons on June 4, 2014 in interviews http://pigeonsandplanes.com/2014/06/jay-worthy-lndn-drgs-interview/
Rushes Sequences – Nicholas Carr interview – USA, Dan Biddle, Thursday, 26 November 2009 http://www.bbc.co.uk/blogs/digitalrevolution/2009/11/rushes-sequences-nicholas-carr.shtml
Steve Coogan Talks About Going on the Road with Dame Judi Dench in “Philomena”, November 22, 2013, Paula Schwartz http://themovieblog.com/2013/steve-coogantalks-about-going-on-the-road-with-dame-judi-dench-in-philomena/

The companion website for Leech, G. N., Rayson, P., & Wilson, A. 2001. Word frequencies in written and spoken English: based on the British National Corpus. London: Longman. Retrieved from http://ucrel.lancs.ac.uk/bncfreq
moralmaze2012_corpus: the corpus comprising the transcripts of the BBC Radio 4 programme The Moral Maze, made available by ARG-tech Centre for Argument Technology. Retrieved from http://arg.dundee.ac.uk/

Aleksandra Beata Makowska University of Łódź

Standardisation in safety data sheets? A corpus-assisted study into the problems of translating safety documents

Abstract: In industry, production is organised into standardised processes to ensure a satisfactory level of output. Producers and their subsidiaries are legally obliged to provide material safety data sheets. These documents need to meet the requirements of the Classification, Labelling and Packaging (CLP) regulation in the EU, the Globally Harmonized System of Classification and Labelling of Chemicals (GHS) on the global scale, and the Registration, Evaluation and Authorisation of Chemicals (REACH) regulation. Moreover, translations of safety documents need to meet the standards concerning translation and terminology. However, a preliminary scrutiny of safety documents reveals that practice falls short of the ideal. For the purpose of this study, the author gathered and analysed a parallel corpus of 93 material safety data sheets for the respective products, comprising 720,000 words and created over the years 2005–2014. The materials for the analysis are available online in Polish, English and German. The comparative analysis may serve as a basis for solutions that bring the translation of safety data sheets to a high level of standardisation, and for a set of guidelines for successful translations that serve their purpose and are useful for target readers.

Keywords: Specialised translation, standardisation, harmonisation, safety data sheet

1. Introduction

Business expansion on the global scale blurs boundaries: business people conform to shared standards in the way business is done, and cultural differences become fewer and fewer. Global legal regulations unify legislation across countries. Bodies such as the United Nations introduce legislation that is meant to be implemented in countries around the world. Apart from standards for business transactions, there are standardised political and transportation procedures: the documents required in international business and transport are harmonised, if not standardised, and the methods tend to be unified.


2. Data and Methodology

For the purpose of the study, the author gathered a parallel corpus of 93 matching material safety data sheets, comprising 720,000 words in Polish, English and German and created over the years 2005–2014, in order to conduct a comparative study of the three language versions. The documents are available online and were gathered from the producers’ websites. The purpose of the study is to determine the areas in which problems occur most frequently, to propose solutions to the problems identified and analysed, and to produce a set of guidelines for the translation of such texts. To this end, the author conducted a comparative analysis of the terminology, linguistic features and generic structure of the texts, with a view to eliminating problems in the translation of safety data sheets and reaching a high level of standardisation. However, since the data in each set of documents differ and are presented verbally, via graphical representations and via lexicographical symbols, corpus analysis of the parallel texts was not possible. Therefore, the scrutiny took the form of a comparative analysis of parallel texts in three languages: Polish, English and German.

3. Safety Data Sheets

Safety data sheets (SDS) are documents which provide guidelines on handling, storing and using chemicals in industry. They were introduced by the GHS system¹, implemented in 2003 in 60 countries. This global legislation regulates the safe use of chemical substances worldwide. On the European level, there are two regulations concerning chemical substances, i.e. REACH² and CLP³. Under this legislation, producers are obliged to equip every chemical they

1 GHS is a system for classifying chemical substances on the basis of the hazards they pose; it provides methods of communicating hazards through product labels and safety documents (safety data sheets) as well as rules for safe handling, storing and using chemicals. Retrieved from: http://www.unece.org/trans/danger/publi/ghs/ghs_welcome_e.html, 18.06.2015.
2 REACH is a regulation implemented in 2007 in the European Union to protect human health and the environment from dangers posed by chemical substances. Retrieved from: http://ec.europa.eu/enterprise/sectors/chemicals/reach/index_en.htm, 18.06.2015.
3 CLP is a regulation introduced in 2009 to the European legislation that regulates the classification, labelling and packaging of chemical substances as a part of the GHS (standard symbols representing hazards on labels and safety data sheets).


produce and introduce to the internal market with a safety data sheet and a label on the packaging. In order to supervise chemical substances and the documents issued for them, two controlling bodies were established: ECHA⁴ in the EU in 2007 and the Bureau for Chemical Substances⁵ in Poland in 2011. In Poland, there are also two other controlling authorities, the National Labour Inspectorate⁶ (Państwowa Inspekcja Pracy) and the Sanitary-Epidemiological Station⁷ (SANEPID), which inspect production companies and ensure that the documentation required by law is available to workers and that each chemical is labelled accordingly. Moreover, SDSs serve as a source of information about emergency procedures and instructions for handling, storing and using a substance. The documents are written for people who deal with chemicals, i.e. production-line workers who use the product, warehouse staff who store the chemicals, and rescue services that assist people in case of an accident. According to the GHS regulations, every safety data sheet contains 16 sections, which are standardised and listed in Annex 4 of the GHS in order to create one harmonised system of safety documents on the global scale. The SDS is organised as a list or a table so that rescue services can identify potential hazards quickly and apply proper treatment when human health or the environment is at stake, even if the SDS is written in a foreign language.

3 (cont.) Retrieved from: http://ec.europa.eu/enterprise/sectors/chemicals/classification/index_en.htm, 18.06.2015.
4 ECHA “helps companies to comply with the legislation, advances the safe use of chemicals, provides information on chemicals and addresses chemicals of concern”. Retrieved from: http://echa.europa.eu/about-us, 18.06.2015.
5 The Bureau for Chemical Substances is a governmental institution formed in 2011 under the supervision of the Minister of Health. Its main responsibilities are chemicals and the compliance with the EU legislation. Retrieved from: https://www.chemikalia.gov.pl/general_information.php, 18.06.2015.
6 National Labour Inspectorate is a national controlling body responsible for enforcing employers’ compliance with the labour law through inspections in companies. Retrieved from: http://www.pip.gov.pl/en/about-us/18279,our-statutory-responsibilities.html, 18.06.2015.
7 SANEPID is a controlling authority whose main responsibility is the protection of human health from harmful factors. Retrieved from: http://www.gis.gov.pl/?go=news, 18.06.2015.


4. Standardisation

Standardisation allows for economical operation in industry, in line with the principles of lean management⁸ (Womack & Jones 1996: 17). According to EU specialists, standards are created to be “used as a means of demonstrating compliance with regulation” (Hatto 2013: 8). In the context of SDSs, the term can thus be interpreted as unified processes that prevent failure (handling the chemical properly according to the instructions) and establish a “well-defined practice” (following the instructions) (Hatto 2013: 8). Harmonisation, by contrast, is used in situations where standards cannot be applied, i.e. where some differences are acceptable or cannot be eliminated. It is defined as “prevention or elimination of differences in the technical content of standards having the same scope (…). Harmonization looks at differences between process standard, and sets bounds to the degree of their variation” (Richen & Steinhorst 2005: 1). Following Richen & Steinhorst’s definition (2005), harmonisation allows for some discrepancies, whereas standardisation is the creation of uniformity and sameness. Thus, safety data sheets strive toward standardisation but can reach only the level of harmonisation.

5. Manifestations of standardisation in SDSs

Safety data sheets are subject to standardisation on different levels. Safety documents are created as a result of regulations on the global, European and national levels, i.e. GHS, implemented by the United Nations; CLP, introduced in the EU in 2009 for the packaging and labelling of chemicals used in industry; and the REACH regulation, which came into force in the EU in 2007. As a result of the legislation, controlling authorities were established to ensure the standardisation of the documents during the preparation stage: the European Chemicals Agency (ECHA) in the EU, and the Bureau for Chemical Substances, the National Labour Inspectorate (Państwowa Inspekcja Pracy) and the Sanitary-Epidemiological Station (SANEPID) in Poland. The SDS itself is a compilation of different standards. The standards pertain to using, shipping, handling, storing and disposing of the chemical. Moreover, they can be represented by lexicographical symbols or pictograms and are strictly organised in structure, which can be treated as a restatement in Saint-Dizier’s sense (2012: 4).

8 Lean management is a management technique in which companies produce more by using fewer resources and reducing muda (waste) (Womack & Jones 1996: 17).


The first group of standards concerns transportation and includes the UN Number (1), ADR (1), ADN, IMDG⁹ (2), IATA¹⁰ and TA-Luft (3). However, there is no agreement regarding the translation of transportation standards. Some standards, like ADR (1), are transferred into the target language, whereas others, e.g. IMDG (2), IATA (2) and TA-Luft (3), remain in the source language (either English or German).

(1) in English: 3 (F1) Flammable liquids
in Polish: 9 (M6) Różne niebezpieczne substancje i przedmioty ‘(M6) Miscellaneous dangerous substances and articles’
(2) ENVIRONMENTALLY HAZARDOUS SUBSTANCE, LIQUID, N.O.S. (reaction product: bisphenol-A-(epichlorhydrin) epoxy resin (number average molecular weight = 700))
(3) in English: TA-Luft Klasse 5.2.5 ‘TA-air class 5.2.5’

Pictograms (4) are characteristic of safety data sheets and product labels. They are unified in all languages and were designed to represent hazards graphically and verbally¹¹, since they communicate the hazards associated with a chemical in an efficient way (Willey 2012: 862).

9 ‘International Maritime Code for Dangerous Goods’: its ‘primary focus is to foster the safe handling of hazardous materials, as well as to offer the expertise of the ocean carriers in forging regulatory development’. Retrieved from: http://www.ivodga.com/, 17.06.2015.
10 DGR serves as a worldwide reference for transporting hazardous goods by air. Retrieved from: http://www.iata.org/whatwedo/cargo/dgr/Pages/index.aspx, 17.06.2015.
11 http://www.unece.org/trans/danger/publi/ghs/pictograms.html, 17.06.2015.


(4) Figure 1. Examples of GHS pictograms: explosive and toxic.

GHS also produced codes (5a and 5b) with hazard and precautionary statements¹², which consist of a combination of letters and numbers and are accompanied by the level of the harmful effect. The codes and levels given for the ingredients of the preparation remain in English (5a), but the instructions for the preparation are translated into the target language (5b).

(5a) in all languages: Acute Tox. 4, H302; Acute Tox. 4, H312; Acute Tox. 4, H332; Skin Irrit. 2, H315; Eye Irrit. 2, H319
(5b) in Polish: H302 Działa szkodliwie po połknięciu. ‘H302 acts harmfully after swallowing’
in English: H302 Harmful if swallowed.
in German: H302 Gesundheitsschädlich bei Verschlucken. ‘H302 Harmful.to.health by swallowing’

Chemicals are also subject to standardisation since there are a few nomenclatures concerning the names of chemical substances, more popular or more professional

12 The full list of hazard and precautionary statements. Available at: http://www.unece.org/fileadmin/DAM/trans/danger/publi/ghs/ghs_rev02/English/07e_annex3.pdf.


(Newmark 1988: 155). CAS¹³ or IUPAC¹⁴ standards pair a numerical symbol for each chemical with the name of the substance, which allows rescue services to identify the chemical quickly in case of an accident. The names may differ across languages, like the German name Butylglykol ‘butyl.glycol’, but the CAS number is the international standard (6).

(6) in Polish: CAS: 111-76-2 2-butoksyetanol ‘2-butoxyethanol’
in English: CAS: 111-76-2 2-butoxyethanol
in German: CAS: 111-76-2 Butylglykol ‘butyl.glycol’
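Because the CAS number is the language-independent anchor in example (6), a translator or reviewer could check it mechanically across language versions. The sketch below relies on the check-digit rule of CAS Registry Numbers: the last digit equals the weighted sum of the preceding digits, counted from the right, modulo 10. The function name is hypothetical, and the sketch assumes the number is written with plain hyphens.

```python
import re


def is_valid_cas(cas):
    """Check the format and check digit of a CAS Registry Number.

    Expected shape: 2 to 7 digits, hyphen, 2 digits, hyphen, 1 check digit.
    """
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # The rightmost digit has weight 1, the next weight 2, and so on.
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


# The CAS number from example (6): 6*1 + 7*2 + 1*3 + 1*4 + 1*5 = 32.
butoxyethanol_ok = is_valid_cas("111-76-2")
```

A check of this kind only confirms that the number is well formed; it does not confirm that the substance name printed next to it is the right one in each language.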

EWC15 (European Waste Catalogue) in which each waste has an individual code consisting of 6 figures followed by a description (7) provides instructions for the disposing of the chemical substance. The example (7) shows the differences in the translation. In the Polish SDS, the list of codes is the most detailed, whereas in English and German, the authors provide only one code for the substance. (7) in Polish: 08 Odpady z produkcji, przygotowania, obrotu i stosowania powłok ochronnych (farb, lakierów, emalii ceramicznych), kitu, klejów, szczeliw i farb drukarskich ῾08 wastes from the manufacture, formulation, supply and use of coatings (paints, varnishes and vitreous enamels), adhesives, putties and printing inks’ 08 01 Odpady z produkcji, przygotowania, obrotu i stosowania oraz usuwania farb i lakierów ῾08 01 wastes from the manufacture, formulation, supply, use and removal of paint and varnish’ 08 01 20 Zawiesiny wodne farb lub lakierów inne niż wymienione w 08 01 19 13 Chemical Abstracts Service, is the world’s authority for chemical information whose objective is to find, collect and organise all information concerning chemical substances. Retrieved from: https://www.cas.org/, 17.06.2015. 14 International Union of Pure and Applied Chemistry deals with chemicals and promotes chemical sciences on the global scale. Retrieved from: http://www.iupac.org/, 17.06.2015. 15 http://www.eauc.org.uk/page.php?subsite=waste&page=the_european_waste_catalo gue_ewc, 17.06.2015.

272

Aleksandra Beata Makowska

‘aqueous suspensions of paints or varnishes other than those mentioned in 08 01 19’
in English:
08 01 20 aqueous suspensions containing paint or varnish other than those mentioned in 08 01 19
in German:
08 01 20 wässrige Suspensionen, die Farben oder Lacke enthalten, mit Ausnahme derjenigen, die unter 08 01 19 fallen
‘08 01 20 aqueous suspensions that contain paints or varnishes with the exception of those which fall in 08 01 19’

S-phrases and R-phrases¹⁶ are the standardised instructions given in SDSs, which consist of numbers followed by the given instructions (8). The R-phrase and R-number indicate the risks to which people are exposed by the chemical substance, whereas the S-phrase and S-number concern the safe handling of the chemicals. The worker will look at the verbal instruction and the specialist will glance at the symbol.

(8) in Polish: R36/38 Działa drażniąco na oczy i skórę. ‘R36/38 Acts irritatingly to eyes and skin’
in English: R36/38 Irritating to eyes and skin.
in German: R36/38 Reizt die Augen und die Haut. ‘R36/38 Irritates the eyes and the skin’
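The coding conventions illustrated in examples (7) and (8) can be checked mechanically. The following is a minimal sketch, not part of the original study: it assumes that an EWC code is written as three two-digit groups and that a combined R- or S-phrase joins its numbers with a slash. The function names `ewc_levels` and `split_combined_phrase` are hypothetical.

```python
import re


def ewc_levels(code: str) -> list[str]:
    """Return the chapter, sub-chapter and full entry of a six-figure EWC code."""
    digits = re.sub(r"\s+", "", code)
    if not re.fullmatch(r"\d{6}", digits):
        raise ValueError(f"not a six-figure EWC code: {code!r}")
    pairs = [digits[i:i + 2] for i in range(0, 6, 2)]
    # "08 01 20" belongs to sub-chapter "08 01", which belongs to chapter "08".
    return [" ".join(pairs[:n]) for n in (1, 2, 3)]


def split_combined_phrase(code: str) -> list[str]:
    """Split a combined R- or S-phrase code such as 'R36/38' into single codes."""
    match = re.fullmatch(r"([RS])(\d+(?:/\d+)*)", code.strip())
    if match is None:
        raise ValueError(f"not an R- or S-phrase code: {code!r}")
    letter, numbers = match.groups()
    return [letter + number for number in numbers.split("/")]


print(ewc_levels("08 01 20"))
print(split_combined_phrase("R36/38"))
```

Applied to the Polish version of example (7), such a check would show that the three listed entries are levels of one code, so the single code given in English and German carries the same classification.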

The above-mentioned standards illustrate that the generic structure of the document and the language are strictly regulated. The purpose of applying so many standards is to limit the time needed to retrieve necessary information from the SDS. For this reason, safety data sheets can be treated as codified texts whose translation should also be standardised (Źrałka 2007: 77).

16 http://www.msds-europe.com/id-485-r_s_phrases.html, 17.06.2015.

6. Standardisation of translation
Translation is also subject to standardisation. In order to improve specialised communication, translators should rely on standardised terms. However, there is a constant proliferation of new concepts which need to be named. Moreover, parallel research, polysemy and term migration among fields may cause problems for translators of specialised texts (Dury 2005). For this reason, terminologists

Standardisation in safety data sheets?


ought to standardise terms, i.e. “to discuss and to agree upon the general adoption of what is best among alternative possibilities and arrangements” in order to reach a high level of clear and efficient communication in technical domains (Wüster 1955: 1). As a result, standards concerning terminology have emerged, for example ISO TC 37, ISO 12616 and EN-15038:2006, as have organisations such as the European Association for Terminology (EAFT).

SDSs are based on many standards concerning the generic structure of the document, the instructions and the chemicals. As a result, the preparation of the document and the translation process, which is also standardised, appear to be straightforward tasks. However, the analysis reveals that actual practice is far from being standardised, harmonised, or at times even acceptable.

7. Challenges for translators
Translation of the SDS is a challenge for translators for a number of reasons, because the document is strictly organised and its language is highly regulated. There are two main categories of challenges: linguistic challenges due to the transfer from the SL/SC into the TL/TC, and legal requirements imposed by the GHS.

7.1 Linguistic challenges in SDSs
Linguistic challenges can be divided into lexical and syntactic challenges. As for the first group, an SDS contains intertwining legal, technical, medical and chemical terms. This language is characteristic of professionals dealing with safety in industry, who form a discourse community since they are a group of people with common interests and shared terminology (Swales 1987: 2).

7.1.1 Lexical challenges in the SDS
The difficulty in translating SDSs lies in their strictly organised generic structure: the document uses highly regulated language and gives thorough instructions. Moreover, there is a mixture of jargons and various standards for which the translator needs to adopt a suitable strategy. Lexical challenges include terminology, instructions, and standards expressed verbally, via lexicographical symbols and via graphical representations, i.e. GHS hazard and precautionary codes and statements, R-phrases and S-phrases, and acronyms representing different standards (for handling, storing, using and disposing of the chemical).


A. Terminology
Terminology is one of the key factors that differentiates specialised texts. According to Wüsterian studies (1955), ideally there should be one-to-one terminological correspondence between translated languages. However, there are several nomenclatures in each language and more popular or professional names for chemical substances (Newmark 1988: 155) (9).

(9) in Polish: CAS: 25068-38-6 produkt reakcji bisfenolu A z epichlorohydryną, ciężar cząsteczkowy > 700
‘reaction product of bisphenol-A with epichlorhydrin, molecular weight > 700’
in English: CAS: 25068-38-6 reaction product: bisphenol-A-(epichlorhydrin) epoxy resin (number average molecular weight = 700)
in German: CAS: 25068-38-6 Bisphenol-A-Epichlorhydrin-Harze MG < 700
‘bisphenol-A-epichlorhydrin-resin, molecular weight 700)

‘SUBSTANCE ENVIRONMENTALLY HAZARDOUS, LIQUID, N.O.S. (reaction product: bisphenol-A-epichlorhydrin, molecular weight > 700)’
IMDG ENVIRONMENTALLY HAZARDOUS SUBSTANCE, LIQUID, N.O.S. (reaction product: bisphenol-A-(epichlorhydrin) epoxy resin (number average molecular weight = 700)), MARINE POLLUTANT
IATA ENVIRONMENTALLY HAZARDOUS SUBSTANCE, LIQUID, N.O.S. (reaction product: bisphenol-A-(epichlorhydrin) epoxy resin (number average molecular weight = 700))

In English:
TA-Luft: TA-Luft Klasse 5.2.5/I ‘TA-Luft Class 5.2.5/I’
Schwangerschaft Gruppe ‘pregnancy group’: C
MAK 8-Stunden-Mittelwert mg/m³ ‘MAK 8-hours-value mg/m³’
Zink und seine anorganischen Verbindungen (alveolengängige Fraktion); 0,1 mg/m³; gemessen als alveolengängige Fraktion (vgl. Abschn. Vd) S. 191)
‘Zinc and its inorganic compounds (alveolar fractions); 0,1 mg/m³; measured as alveolar fractions (compare section misc.) p. 191)’

In German:
ADR 3082 UMWELTGEFÄHRDENDER STOFF, FLÜSSIG, N.A.G. (Bisphenol-A-Epichlorhydrin-Harze MG < 700)
‘ENVIRONMENTALLY HAZARDOUS SUBSTANCE, LIQUID, N.O.S. (bisphenol-A-epichlorhydrin resin molecular weight < 700)’
IMDG ENVIRONMENTALLY HAZARDOUS SUBSTANCE, LIQUID, N.O.S. (reaction product: bisphenol-A-(epichlorhydrin) epoxy resin (number average molecular weight = 700)), MARINE POLLUTANT
IATA ENVIRONMENTALLY HAZARDOUS SUBSTANCE, LIQUID, N.O.S. (reaction product: bisphenol-A-(epichlorhydrin) epoxy resin (number average molecular weight = 700))

Example (10) shows that only the ADR transportation standard is translated into the TL. The remaining ones, i.e. IMDG, IATA and TA-Luft, remain in their SL. Even though the language of international transport is English, TA-Luft is in German. Omitting the translation of the remaining standards affects the relevance of the TT: its reader has to make more effort to work out the meaning of the instruction than the reader of the ST. Moreover, such a strategy lowers the functionality of the specialised text.

7.1.2 Syntactic and conceptual challenges
Syntactic challenges are the second group of difficulties, whose origin lies in the different language systems. English and German are West Germanic languages, whereas Polish is a Slavic language. The documents are created to provide clear and precise instructions on how to deal with the chemical substance. The safety regulation language should be straightforward. However, Saint-Dizier (2012) finds that the language of instructions is unnecessarily complex, which makes the texts less functional. Źrałka (2007) claims that in codified texts it is uncommon to find grammatical forms that express instructions, whereas nominal phrases are quite popular. Saint-Dizier’s (2012) study contradicts this statement: in safety regulation language ‘it is quite frequent to observe in an instruction several negations, pronouns, complex cross-references and embedded conditions’ (2012: 391).


A. Instructions
SDSs consist mainly of instructions, and imperative forms are frequent in these documents. In Polish and German, the imperative can be expressed by the infinitive, which is a polite and familiar form of impersonal address in the formal register. In English, the imperative is direct in both formal and informal styles.

(11) in Polish: P102 Chronić przed dziećmi. ‘Protect-INF from children.’
in English: P102 Keep out of reach of children.
in German: P102 Darf nicht in die Hände von Kindern gelangen. ‘P102 must not get in the hands of children.’

In example (11), the instruction in Polish is expressed by the infinitive. In English, there is an imperative, whereas in German one finds the modal verb dürfen ‘to be authorised to’ followed by the infinitive, which acts as a formal instruction.

B. Phrase building/compounding
The three analysed languages differ in terms of phrase building and compounding. In English, it is acceptable to put one noun in front of another so that it functions as an adjective. In German, there are long compound nouns that embrace all elements (Donaldson 2007: 47). In Polish, the system of inflections allows the building of long, detailed and descriptive phrases.

(12) in Polish: ośrodek zatruć ‘centre of poison’
in English: poison centre
in German: Giftinformationszentrum ‘poison.information.centre’

Example (12) presents how the three languages differ in terms of compounding. In Polish, there are two nouns, the second of which is in the genitive. In English, there are two nouns, the first of which functions as an adjective. In the German version, there is one compound noun, which is the most descriptive because it contains the noun Information. However, in the Polish version, the phrase ośrodek zatruć should be replaced by the name of the institution dealing with health problems to make the message more precise for target readers.


C. Conceptual challenges
The three languages belong to two language groups: Polish is a West Slavic language, whereas English and German are West Germanic languages. As a result, there are differences in the conceptualisation of reality (Langacker 1986: 1–2). Moreover, instructions in each language will differ slightly due to the process of re-conceptualisation of the message during the translation process (13) (Lewandowska-Tomaszczyk 2010: 108).

(13) in Polish: R20/21/22 Działa szkodliwie przez drogi oddechowe, w kontakcie ze skórą i po połknięciu. ‘Acts harmfully through air passages, in contact with skin and after swallowing.’
in English: R20/21/22 Harmful by inhalation, in contact with skin and if swallowed.
in German: R20/21/22 Gesundheitsschädlich beim Einatmen, Verschlucken und Berührung mit der Haut. ‘Harmful.to.health by inhalation, swallowing and touching with skin’

Even though the three versions of the R-phrase talk about the harmful influence of the chemical on people, in each language the message is slightly different (13). In Polish, the chemical is harmful while passing through the air passages of the human body, whereas the English version communicates that the inhalation of the substance has a negative effect. In German, the message is more precise because the compound adjective gesundheitsschädlich means ‘harmful to health’ and Berührung is not only contact with the substance but, more precisely, ‘touching’.

7.2 Legal requirements
Legal requirements are the other group of requirements that SDSs need to meet. The documents need to be written in a formal register in the impersonal mood. The documents must be organised into a list or a table and contain 16 sections. Annex 4 of the GHS also contains requirements concerning the language of the SDS, which should be simple, clear and precise, avoiding jargon, acronyms and abbreviations (United Nations, Globally Harmonized System of Classification and Labelling of Chemicals, 2013: 410). As a result, the time needed to read the document and extract vital information is limited. This requirement is in line with the translation standards that oblige translators to produce fully functional target texts.

The GHS aims at standardising safety data sheets to achieve a high degree of safety in dealing with chemicals on the professional level. For this reason, every stage of approaching the chemical substance is unified, which is reflected by the use of various standards and which makes the document more and more formal. The standards are expressed via different means, which is a challenge for translators. The problem is which translation strategy to use in the translation process. The examples illustrate that there is no agreement as to the translation of certain standards. Moreover, linguistic constraints, i.e. discrepancies between language systems and different conceptualisations, make the task even more difficult.

8. Results of the study
Research was conducted on a corpus of 93 matching safety data sheets in Polish, English and German in order to identify problem areas, to determine whether the documents comply with GHS/REACH/CLP regulations, and to find areas of potential problems in the translation process. Table 1 shows the results of the comparative analysis of the documents.

Table 1. Results of the comparative analysis of safety data sheets.

No.  Category                                              Result
1.   Terminological problems (problems with equivalence)   28
2.   Relevance problems                                    74
3.   Equivalence problems (general language)               28
4.   Translation errors                                    37
5.   Different version number and issue date               93
6.   Not compensated information                           75
7.   Not corresponding numerical data                      36
8.   Different layout                                      18
9.   Compensation of information                           46
10.  R-/S-phrases                                          93
11.  GHS hazard and precautionary codes and statements     61
12.  EWC number                                            93
13.  CAS number                                            75
14.  Pictograms                                            93
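To make the counts in Table 1 easier to compare, they can be expressed as shares of the corpus. The sketch below is illustrative only and not part of the original analysis: it assumes that every result is a count out of the 93 matching data sheets, and the names `TOTAL`, `results` and `share` are hypothetical.

```python
TOTAL = 93  # corpus size reported in the study

# Counts copied from Table 1.
results = {
    "Terminological problems (problems with equivalence)": 28,
    "Relevance problems": 74,
    "Equivalence problems (general language)": 28,
    "Translation errors": 37,
    "Different version number and issue date": 93,
    "Not compensated information": 75,
    "Not corresponding numerical data": 36,
    "Different layout": 18,
    "Compensation of information": 46,
    "R-/S-phrases": 93,
    "GHS hazard and precautionary codes and statements": 61,
    "EWC number": 93,
    "CAS number": 75,
    "Pictograms": 93,
}


def share(count: int, total: int = TOTAL) -> float:
    """Return a count as a percentage of the corpus, rounded to one decimal."""
    return round(100 * count / total, 1)


# List the categories from the most to the least frequent.
for category, count in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{category:<52} {count:>3} {share(count):>6.1f}%")
```

Under this assumption, the categories with a count of 93 cover the whole corpus, while a different layout occurs in roughly a fifth of the cases.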


9. Interpretation of the results
The results of the study show that SDSs comply with the GHS/REACH/CLP regulations in terms of pictograms, S-/R-phrases, EWC numbers and transportation standards (where applicable). All sets of documents contain pictograms, S-/R-phrases and EWC numbers. GHS hazard and precautionary codes and statements appear in 61 cases. Not all documents include special transportation instructions, because some chemical substances do not require special transportation conditions and the transportation standards can then be omitted.

Problems stem from the different issue dates and version numbers. In all 93 cases, the issue dates and version numbers do not correspond. This is the main reason for problems in the harmonisation of the documents. Neither ECHA nor the Bureau for Chemical Substances has managed to oblige producers to provide one version in all languages, containing all relevant information and available to users on a global scale.

In 36 documents, the numerical data do not correspond within the set. Some information is given in only one language version and is not compensated for in the others. In either Polish or English, there appear instructions or data that are not found in the original German version. On the other hand, some information appears only in German and is not found in the other language versions. This results from the previous problem: non-corresponding issue numbers and dates. The bigger the time span between the issue dates, the bigger the discrepancies. Safety data sheets also contain different measurements, which stem from different national regulations concerning the TLVs. Moreover, some discrepancies concern numerical data or the ingredient list. There are either full lists of ingredients or lists of hazardous ingredients only, because in the past producers were obliged to provide only the list of hazardous ingredients with their percentage concentration (confidentiality policy).

Problems also occur in the translation process. First of all, there are translation errors, as the translators do not understand the text fully. In 28 documents, there were translation problems with terminology and equivalence problems concerning the general language. As for the general language, the instructions are awkwardly worded. As a result, the communicated message changes its meaning. Some information is compensated for or moved to a different section, mainly the TLVs and R-/S-phrases, which extends the time needed to read the document. However, 74 documents contain non-compensated information.


10. Sources of problems
The results of the study point to several problem areas. The major source of problems lies in different versions of the same safety data sheet being available in different languages at the same time. The analysis reveals that the version numbers and issue dates on the documents do not correspond. The regulations for safety documents and labelling are constantly updated. As a result, some documents are older and contain obsolete data, since they complied with older regulations.

Regulations valid for different countries differ, and so different information appears on the documents. This also pertains to numerical data, since TLVs are calculated individually for each country. Some substances are considered hazardous in some countries but not in others. For this reason, full standardisation will never be achieved.

For financial reasons, subsidiaries order translations on the basis of price, not quality. Therefore, documents are often translated carelessly by inexperienced translators. Moreover, the commissioners ordering a translation are not qualified enough to check whether the translation product contains the relevant information and whether the language is clear enough. The controlling bodies, on the other hand (the Sanitary-Epidemiological Station or the National Labour Inspectorate), check whether the SDS is available to workers and whether it contains the 16 sections.

Even though there are special bodies in Poland and the EU, the preparation of documents lacks control and supervision. The only goal is to have the documents written, translated, registered and sent to clients. Since the regulations change so rapidly, it is difficult for companies to keep pace with them. Even though the documents which were subject to the analysis come from big international corporations, the documentation contains inconsistencies.

The preparation of documents also fails to be coordinated on the EU level, which is the result of the previous problem. Practice shows that different versions of SDSs for the same product have been available at the same time. Producers try to update them to comply with existing regulations, but their subsidiaries fail to translate and publish them on time. Neither ECHA nor the Bureau for Chemical Substances verifies the issue number of documents on the European level, and SANEPID and the National Labour Inspectorate have different competencies.
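The mismatch of version numbers and issue dates described above is the kind of inconsistency that could be detected automatically at the registration stage. The following is only a sketch under stated assumptions: the metadata values are invented for illustration and do not come from the analysed corpus, and the function name `version_report` is hypothetical.

```python
from datetime import date

# Hypothetical metadata for one set of language versions (invented values).
sds_set = {
    "pl": {"version": "2.0", "issued": date(2013, 1, 1)},
    "en": {"version": "3.1", "issued": date(2013, 1, 31)},
    "de": {"version": "3.1", "issued": date(2013, 1, 31)},
}


def version_report(versions: dict) -> dict:
    """Compare version numbers and issue dates across language versions."""
    numbers = {meta["version"] for meta in versions.values()}
    dates = [meta["issued"] for meta in versions.values()]
    return {
        "same_version": len(numbers) == 1,
        "same_issue_date": len(set(dates)) == 1,
        # Time span between the oldest and the newest language version.
        "span_days": (max(dates) - min(dates)).days,
    }


print(version_report(sds_set))
```

A set would count as harmonised in this sense only if both flags were true; the span in days gives a rough indication of how far the language versions have drifted apart.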

11. Solutions to problems
A high level of standardisation of safety data sheets can be reached if the identified problems are solved. Differences in regulations valid for different countries should be minimised to harmonise legal systems.


Issuing the same version of the same safety data sheet in different languages at the same time will increase the level of harmonisation. The institutions should cooperate more closely on the registration, verification and harmonisation of the issue of documents on the European and global scale. Moreover, there should be stricter controls over the issue and translation of SDSs by ECHA in the EU and the Bureau for Chemical Substances in Poland at the stage of registration and verification of documents, which would simplify the process of standardisation.

In addition, SDSs should be translated by experienced translators who are acquainted with the specificity of the texts. The documents should be written in simple and clear language, according to the GHS guidelines. Information needs to be provided in an unambiguous way in order to limit the time needed to extract indispensable data. In order to acquaint translators with the problem, the task of translating SDSs could be introduced as early as the translation training course to highlight the complexity of these texts.

Some problems stem from differences among language systems, which create differences in the way the message is expressed in each language. Different conceptualisations prevent full standardisation in safety data sheets. The translator can make the message as close in meaning as possible, but sameness of meaning in verbal messages will never be possible.

12. Conclusion
Translation of safety data sheets remains a challenging task for translators. The documents need to meet various requirements and are limited by constraints. SDSs are functional texts which need to contain relevant information, written in accordance with the current regulations in the proper form and format. The data need to be up to date and necessary for the user, and should be provided in a digestible form, since safety documents serve a specific purpose: to warn the user against potential dangers connected with the chemical.

In SDSs, there is a contradiction regarding the language. On the one hand, the GHS imposes the obligation to make the language clear and precise so that users can extract information quickly. As the target group includes the workforce on the production line and in warehouses, the language of the documents should be adjusted to their level of education. Furthermore, SDSs are also aimed at rescue services, who act in emergency situations under time pressure. As a result, they do not have much time to deliberate upon the author’s intentions. On the other hand, the documents are filled with terminology from different domains, i.e. chemistry, physics, medicine, law, etc., and it is difficult to find a compromise.


Finally, there is a difference between standardisation and harmonisation in terms of accepting variations. In the preparation and translation of SDSs, one needs to minimise existing variations to make the documents standardised. Currently, the documents can only be harmonised, which is reflected in the name of the global legal system, the GHS.

References
Baker, M. (ed.) 1998. Routledge Encyclopedia of Translation Studies. London: Routledge.
Byrne, J. 2006. Technical Translation. Usability Strategies for Translating Technical Documentation. Dordrecht: Springer.
Cabre, T. & J. C. Sager (ed.). 1998. Terminology. Theory, Methods and Applications. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Donaldson, B. 2007. German. An Essential Grammar. New York: Routledge.
Dury, P. 2005. Terminology and Specialised Translation: the Relevance of the Diachronic Approach. In LSP & Professional Communication, Vol. 5 (1), 31–41. Retrieved February, 20, 2014 from: http://ej.lib.cbs.dk/index.php/LSP/article/view/2042.
Dziubalska-Kołaczyk, K. & B. Walczak. 2011. “Polish”. In Delcourt, C. & P. van Sterkenburg (eds.). 2011. The languages of the 27. Bruxelles: Fondation universitaire de Belgique. 817–840.
Eckman, C. E. 2003. “Streamlining translation – ISO 12616: 2002, Translation-oriented Terminography.” In: ISO BULLETIN, 26–27.
Hatto, P. 2013. Standards and Standardisation. A practical guide for researchers. Luxembourg: Publications Office of the European Union.
Langacker, R. 1986. “An Introduction to Cognitive Grammar.” In Cognitive Science. Vol. 1. 1–40.
Thelen, M. 2010. “Translation studies: Terminology in theory and practice.” In: Lewandowska-Tomaszczyk, B. & M. Thelen (eds.). 2010. Meaning in Translation. Frankfurt am Main: Peter Lang.
Newmark, P. 1988. A Textbook of Translation. Hempstead: Prentice Hall.
Richen, A. & A. Steinhorst. 2005. “Standardization or Harmonization? You need Both.” In BPTrends, 1–5. Retrieved from: www.bptrends.com, 14.11.2014.
Saint-Dizier, P. 2012. “Facets of a Discourse Analysis of Safety Requirements.” In: Bouma, G., Ittoo, A., Métais, E., Wortmann, H. (eds.). 2012. Natural Language Processing and Information Systems. Vol. 7337, 391–396.
Swales, J. 1987. “Approaching the Concept of Discourse Community.” Paper presented at the Annual Meeting of the Conference on College Composition and Communication, Atlanta, GA, March. 19–21.
Temmerman, R. & K. Kerremans. 2003. “Termontography: Ontology Building and the Sociocognitive Approach to Terminology Building.” In Proceedings of CIL17, Prague: Matfyzpress. 1–10.
Vinay, J. P. & J. Darbelnet. 1995. Comparative Stylistics of French and English: a Methodology for Translation, translated by Sager, J. C. & Hamel, M. J. Amsterdam/Philadelphia: John Benjamins.
Wilson, D. & D. Sperber. 2012. Meaning and Relevance. Cambridge: CUP.
Womack, J. P. & D. T. Jones. 1996. Lean Thinking. Banish Waste and Create Wealth in Your Corporation. Warsaw: Prod.Press.com.
Źrałka, E. 2007. “Teaching specialised translation through official documents.” In The Journal of Specialised Translation. Issue 7. 74–91.

Appendix 1
List of Internet websites used as a reference

REGULATION (EC) No 1272/2008 of the European Parliament and of the Council of 16 December 2008. In: Official Journal of the European Union 353/1 of 31 December 2008. (2008, December, 31). Retrieved February, 22, 2014 from: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2008:353:0001:1355:en:PDF
REGULATION (EC) No 1907/2006 of the European Parliament and of the Council of 18 December 2006. In: Official Journal of the European Union L 396 of 30 December 2006. (2007, May, 29). Retrieved February, 22, 2014 from: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2007:136:0003:0280:en:PDF
European Commission on chemicals. (n.d.). Retrieved November 15, 2013 from: http://ec.europa.eu/enterprise/sectors/chemicals/documents/classification/#h2-1
Globally Harmonized System of Classification and Labelling of Chemicals (GHS). (n.d.). Retrieved June, 18, 2015 from: http://www.unece.org/trans/danger/publi/ghs/ghs_welcome_e.html
European Commission on REACH. (n.d.). Retrieved June, 18, 2015 from: http://ec.europa.eu/enterprise/sectors/chemicals/reach/index_en.htm
National Labour Inspectorate. (n.d.). Retrieved June, 18, 2015 from: http://www.pip.gov.pl/en/about-us/18279,ourstatutory-responsibilities.html
Annex 3 of the GHS: Codification of Hazard Statements. (2013). Retrieved June, 18, 2015 from: http://www.unece.org/fileadmin/DAM/trans/danger/publi/ghs/ghs_rev02/English/07e_annex3.pdf
Annex 4 of the GHS: Guidance on the Preparation of Safety Data Sheets (SDS). (2013). Retrieved June, 18, 2015 from: http://www.unece.org/fileadmin/DAM/trans/danger/publi/ghs/ghs_rev01/English/08e_annex4.pdf
The European Waste Catalogue (EWC). (2015). Retrieved June, 17, 2015 from: http://www.eauc.org.uk/page.php?subsite=waste&page=the_european_waste_catalogue_ewc
COMMISSION DECISION of 3 May 2000 replacing Decision 94/3/EC establishing a list of wastes pursuant to Article 1(a) of Council Directive 75/442/EEC on waste and Council Decision 94/904/EC establishing a list of hazardous waste pursuant to Article 1(4) of Council Directive 91/689/EEC on hazardous waste. (January, 1, 2002). Retrieved June, 17, 2015 from: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CONSLEG:2000D0532:20020101:EN:PDF
CLP Labelling. (n.d.). Retrieved November, 15, 2013 from: http://clp.gov.pl/oznakowanie
Sanitary-Epidemiological Station in Gdańsk. (n.d.). Retrieved November, 15, 2013 from: http://www.wsse.gda.pl/index.php?id=443
The International Vessel Operators Dangerous Goods Association, Inc. (n.d.). Retrieved June, 17, 2015 from: http://www.ivodga.com/
The International Air Transport Association (IATA). (n.d.). Retrieved June, 17, 2015 from: http://www.iata.org/Pages/default.aspx
CAS Registry Number. (n.d.). Retrieved June, 17, 2015 from: https://www.cas.org/
The International Union of Pure and Applied Chemistry. (n.d.). Retrieved June, 17, 2015 from: http://www.iupac.org/
The list of R-phrases and S-phrases. (n.d.). Retrieved June, 17, 2015 from: http://www.msds-europe.com/id-485-r_s_phrases.html
The European Chemicals Agency (ECHA). (n.d.). Retrieved June, 18, 2015 from: http://echa.europa.eu/about-us
The Bureau for Chemical Substances. (n.d.). Retrieved June, 18, 2015 from: https://www.chemikalia.gov.pl/general_information.php
National Labour Inspectorate. (n.d.). Retrieved June, 18, 2015 from: http://www.pip.gov.pl/en
European quality standard EN-15038:2006. (n.d.). Retrieved June, 20, 2015 from: http://qualitystandard.bs.en-15038.com/
GHS pictograms. (n.d.). Retrieved June, 17, 2015 from: http://www.unece.org/trans/danger/publi/ghs/pictograms.html

Monika Betyna

Kazimierz Wielki University

Lexical bundles in English medical texts

Abstract: Increasing attention is being paid to the corpus analysis of specialist registers of language use, particularly texts representing medical fields. Additionally, it is worth pointing out the scarcity of corpus linguistic studies of lexical bundles (Biber et al. 1999) in texts that were initially written in Polish. This descriptive and explanatory study, with its register perspective (Biber & Conrad 2009), has been designed as the starting point for a more extensive, corpus-focused description of what the most common lexical bundles are used for in medical texts concerning Hyperbaric Oxygen Therapy (HBOT). The research material includes one hundred medical articles, all written originally in English and retrieved from the Internet. This material was compiled into a purpose-designed corpus of circa 200,000 words. The study is essentially based on the methodology proposed by Biber, Conrad and Cortes (2003, 2004), Biber (2006), and Goźdź-Roszkowski (2011), which enables us to conduct an analysis of discourse functions and the use of lexical bundles. The study, based on the results presented below, appears to reveal some links between the frequent occurrence of lexical bundles and the situational and functional characteristics of the text variety.

Keywords: Corpus linguistics, phraseology, register analysis, corpus-driven approach, lexical bundles, Hyperbaric Oxygen Therapy (HBOT)

1. Introduction A rapid growth of interest in corpus linguistics has been observed in recent years. Newer and more efficient computational tools, as well as more sophisticated research procedures, have been developed and implemented in order to improve analytical devices. Thanks to them, it has become less difficult for specialists to recognize and study phenomena commonly present in language use. These phenomena include commonly used multi-word units or strings of word forms which are often found in a number of specialist texts (including medical text type, as it is the case with this study). That is why the corpus linguistics approach proves to be particularly attractive for analyses of lexis and phraseology of specialist texts. They are based on limited stocks of prefabricated chunks, linguistic patterns or formulas. It is, therefore, described by Sinclair (1991) as the phenomenon of the Idiom Principle, which concerns the standard specialist texts with numerous ready-made phrases with just one form and meaning. Similarly, Wray (2002) uses the term ‘formulaicity’ in reference with the frequent use of formulaic sequences. Formulas are characterized as

288

Monika Betyna

sequences or as “various types of wordstrings” which appear to be “prefabricated: that is, stored and retrieved whole from memory at the time of use, rather than being subject to generation or analysis by the language grammar” (Wray 2002: 9). Moreover, Wray (2002: 75) claims that the formulas used in a text depend on the type of the situational context and the purpose for which they are used. As a result, language users use certain formulas with higher frequency than others, depending on social situations. As a matter of fact, formulaic sequences are very frequent in discourse and occur in “so many forms that it is presently difficult to develop a comprehensive definition of the phenomenon” (Schmitt & Carter 2004: 1−2). Phraseology belongs to a wide range of research models that can be potentially used to investigate formulaicity in a given language. It covers an extensive area of word combinations, which tend to be fairly fixed, also known as proverbs, as well as phrases and fixed expressions, but also includes formulaic sentences and whole texts (Burger 2007). This study acknowledges that recurrent and statistically important multi-word patterns of specific words are of great interest to corpus linguists focusing on phraseology (Moon 2007: 1046). The idea of lexical bundles (LBs) that was initially proposed by Biber et al. (1999), has proved especially convenient in phraseological research on fixed expressions. We can describe LBs as sequences of two or more words which can occur frequently in a natural discourse and usually constitute some lexical blocks, which are then used in different situational and communicational contexts, for example I don‘t think, as a result, the nature of the, as well as etc. (Biber et al. 1999: 990−991). Typically, LBs are neither idiomatic expressions nor perceptually salient phrases. In fact, LB’s meaning tends to be self-evident from the respective words found within it (Biber 2006: 134). 
According to Biber (2006: 174), “the functions and meanings expressed by these lexical bundles differ dramatically across registers and academic disciplines, depending on the typical purposes of each”. Stubbs and Barth (2003: 81) claim that recurrent LBs, which they describe as ‘chains’, are “not necessarily linguistic or psycholinguistic units”, which means that some LBs are not complete syntactic units. LBs can contain syntactic units or even incorporate a complete syntactic unit, but some of them are not pre-constructed at all. Additionally, Kopaczyk (2012: 5) emphasises that “the lexical bundles approach is not limited to exploration of phrasal constituents”. Rather, the uninterrupted sequences of words extracted from corpora are often smaller or larger than a phrase. According to Goźdź-Roszkowski (2011: 44), the lexical bundles methodology has extended traditional phraseological research in that it is founded on the study of the frequencies of fixed sequences, yet disregards their form, i.e. their grammatical structure.

Lexical bundles in English medical texts

Lexical bundles have been discussed by many researchers. In the medical field, Grabowski (2013) analysed lexical bundles found in patient information leaflets (PILs), more specifically in pharmaceutical leaflets. Sadat Jalali and Raouf Moini (2015) investigated lexical bundles from four to eight words long in medical research articles. Their results showed that most of the bundles were resultative signals, a type of text-oriented bundle, that is, a bundle that organises the text and its meaning as a message or an argument. Authors use these bundles to signal their opinion on the findings. Thanks to these markers, readers have an idea of what conclusions the author has reached even before they finish reading the article.

2. Hyperbaric Oxygen Therapy – a brief characterisation

In this paper, the emphasis is put on the description of the most frequent lexical bundles in medical articles concerning HBOT (Hyperbaric Oxygen Therapy), written originally in English, as there are no significant works about this topic in Polish. According to the online U.S. National Library of Medicine, Hyperbaric Oxygen Therapy (HBOT) delivers an increased amount of oxygen to tissues and organs with decreased perfusion. Hyperbaric oxygen limits the area of necrosis and the extent of damage. It stimulates cell proliferation, reduces inflammation and activates neoangiogenesis. During HBOT, in a 12-person chamber, patients breathe pure oxygen under a pressure of 1.5–3.0 ATA. Treatment lasts 90 minutes. Patients undergo approximately thirty treatments. Pure oxygen is administered through airtight facial masks or hoods with rubber collars. In accordance with the highest safety standards, this keeps the oxygen concentration in the ambient chamber air below the safety level of 23%. HBOT is highly effective in the treatment of non-healing chronic wounds: it significantly reduces the treatment duration and often helps to avoid limb amputation. The most common side effect is ear barotrauma. Complications related to oxygen toxicity, as well as ocular changes, are mostly mild and transient. To eliminate undesirable effects altogether, patients must first be qualified for hyperbaric oxygen treatment. HBOT is available at various Hyperbaric Oxygen Therapy and Wound Treatment Centres in Poland.

HBOT texts have a specific institutional addressor (medical institutions) and a singular addressee (a patient/consumer of a medicine or medicinal product, pharmacist, nurse, general practitioner, etc.). However, the medical text variety also involves intermediate users such as regulatory authorities. Although articles

devoted to HBOT constitute one of the specialist text types most commonly used by doctors, pharmacists, nurses, and patients in Poland, one may note a scarcity of studies describing the use and functions of the most frequent phraseologies in this pharmaceutical text variety. At the same time, there have been many corpus linguistic studies exploring linguistic variation in HBOT texts originally written in English and other languages (e.g. Turkish), conducted by Ghanizadeh (2012), Al-Waili, Butler, Beale, Abdullah, Hamilton, and Lee (2005), and Rossignol (2008), among others. This observation provides the motivation to undertake a preliminary corpus-driven study. More specifically, the aim of this paper is to identify the most frequent LBs found in a sample of medical articles about HBOT. The research material and methodology used to achieve this goal are described in detail in the following section.

3. Research material and methodology

This study adopts a register perspective (Biber & Conrad 2009: 51−81). Accordingly, the research aim is to establish the most common lexical bundles used in medical articles about HBOT, within the limits of the compiled corpus. The research material comprises a corpus of one hundred online articles about HBOT (full texts). Overall, the size of the corpus is 196,757 word tokens (12,754 types). The corpus was compiled from research articles concerning three aspects of medicine, all of which are connected with HBOT. The first is Hyperbaric Oxygen Therapy in general; the second concerns treating autism with HBOT; and the third concerns plasmaJet treatment in wound healing. Thirty articles were extracted from the Online Journal of Wound Care and the rest were taken from various Internet sources. The length of the articles ranges from 2 to 90 pages, and the authors originate from all over the world.

It is important to remember that the study employs a corpus-driven approach as defined by Tognini-Bonelli, who states:

In a corpus-driven approach the commitment of the linguist is to the integrity of the data as a whole, and descriptions aim to be comprehensive with respect to corpus evidence. (Tognini-Bonelli 2001: 84)

This type of research can be defined as a ‘bottom-up’, explanatory study on the use of assorted kinds of both adjoined and separated repetitive combinations of words. These combinations are usually incomplete grammatical units with compositional meaning, which have been extracted with specialized computer software in either an automatic or a semi-automatic manner. The empirical part focuses on lexical bundles and n-grams in HBOT articles. N-grams may be divided into those phrases which include semantically free

lexical bundles and those which are semantically dependent. The stages of the study are explained below.

The first step involves the use of WordSmith Tools 6.0 (Scott 2015) and AntConc 3.4.4 for Windows to generate lexical bundles commonly found in HBOT articles. Secondly, a frequency list is presented. The list, which includes both function words and content words, was created for preliminary research based on English medical articles concerning Hyperbaric Oxygen Therapy (HBOT). Swan (2005) defines a ‘discourse marker’ as “a word or expression which shows the connection between what is being said and the wider context”. As argued by Swan, a discourse marker connects clauses in a sentence and also helps to establish the order in which the described events happened, or indicates the speaker's attitude. Examples of discourse markers in English include: on the other hand; frankly; as a matter of fact. The other most frequently used words appear to be the keywords of the discipline under discussion, HBOT. For instance, the word ulcers appears near the top of the frequency list presented below. This can be explained by the fact that ulcers appear to be among the most common disorders treated with HBOT.

Table 1. Top of the frequency list for English medical articles concerning Hyperbaric Oxygen Therapy (HBOT) generated with the AntConc 3.4.4 Program (for Windows).

RANK    FREQUENCY    WORD
1.      11784        the
2.      10411        of
3.      8886         and
4.      7564         well
5.      6049         then
6.      4273         therapy
7.      3706         ulcers
8.      3032         patients
9.      2652         medicine
10.     2440         diabetic
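
A frequency list of the kind shown in Table 1 can be approximated with a few lines of Python. The sketch below is only an illustration: it is not how AntConc or WordSmith Tools work internally, and the tokenizer pattern, the function names and the two sample sentences are assumptions introduced here rather than material from the corpus.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a token is a run of letters/digits, optionally joined
    by internal apostrophes or hyphens (e.g. "don't", "non-healing").
    """
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def frequency_list(texts, top=10):
    """Return (rank, word, frequency) rows for the most frequent word forms."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(top), start=1)
    ]

# Hypothetical two-sentence stand-in for the 100-article corpus.
sample_texts = [
    "Hyperbaric oxygen therapy is used for diabetic foot ulcers.",
    "The therapy helps patients with diabetic foot ulcers.",
]

rows = frequency_list(sample_texts, top=3)
for rank, word, freq in rows:
    print(f"{rank}.\t{freq}\t{word}")
```

Run on a real corpus, the same function would take one string per article; the ranking then depends on tokenization choices (case folding, treatment of hyphens and numbers), which is one reason why counts from different tools may differ slightly.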

It is important to establish criteria according to which an appropriate sample of lexical bundles can be selected. Such criteria and their parameters are presented by Biber et al. (1999), Biber, Conrad and Cortes (2003, 2004), and Biber (2006) and are as follows: (1) the length of an LB (to define the range of the analysis; in this preliminary research it ranges from 2-grams to 6-grams); (2) a frequency cut-off point (which helps to reduce the number of LBs analysed), expressed as a normalized frequency of occurrence of an LB per 1 million words; (3) the number of texts in which an LB has to appear (to remove idiosyncratic LBs from the analysis). The number of the analysed texts is 100, but those articles

are divided more specifically into groups of texts concentrating on the treatment of autism with hyperbaric oxygen therapy, on the use of plasmaJet, and on Hyperbaric Oxygen Therapy in general. Regarding the length of the analysed lexical bundles, this analysis is based on 2-word to 6-word lexical bundles. This particular length (up to 6-grams) was chosen because the articles concerning HBOT contain a large amount of formulaic language of varying length, and numerous strings of words can therefore be found in such texts. With respect to the frequency cut-off point, the research focuses on very common lexical bundles, which is why the limits are set at 250 occurrences per 1 million words for 5-word LBs and 150 occurrences per 1 million words for 6-word LBs. The situation is different for 2-word to 4-word LBs, for which the limit is about 500 occurrences per 1 million words. Table 1 shows the frequency list of the most common words in the compiled corpus about HBOT. Analysing the concordances of these keywords reveals the most frequent lexical bundles and phraseological units, as demonstrated in Table 2.

Table 2. A sample of the frequency list of the most frequent lexical bundles and phraseological units in preliminary research of English medical articles concerning Hyperbaric Oxygen Therapy (HBOT) generated with the AntConc 3.4.4 Program (for Windows).

PHRASEOLOGICAL UNITS                            RANK    FREQUENCY    PERCENT
Oxygen therapy                                  5       1173         18%
Hyperbaric oxygen therapy                       6       1149         17%
Diabetic foot ulcer                             7       1138         16%
Patients with                                   11      595          9%
The undersea hyperbaric medical society         13      586          8%
Hyperbaric medicine                             15      572          7.8%
Of hyperbaric oxygen therapy                    16      476          7.5%
Therapy for patients with heart disabilities    17      467          7.2%
Effects of hyperbaric oxygen therapy            18      464          7%
Diabetic foot                                   20      446          6%
Diving and hyperbaric medicine                  25      424          5%
A fresh solution and allowed to                 36      374          4%
The study shows it may                          45      327          3%
Wound care hyperbaric                           52      301          2%
Oxygen therapy for acute coronary syndrome      54      290          1%
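
The three selection criteria described above (bundle length, a normalized frequency cut-off and a minimum number of texts) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure implemented in WordSmith Tools or AntConc: the function names and the tokenizer are hypothetical, and the min_texts value is a placeholder, since the study does not state the text-range threshold it applied. For a corpus of 196,757 tokens, a cut-off of 250 per million words corresponds to about 49 raw occurrences, 150 per million to about 30, and 500 per million to about 98.

```python
import re
from collections import Counter, defaultdict

CORPUS_TOKENS = 196_757  # corpus size reported in the study

def raw_threshold(per_million, corpus_tokens=CORPUS_TOKENS):
    """Convert a cut-off normalized per 1 million words into a raw count."""
    return per_million * corpus_tokens / 1_000_000

def tokenize(text):
    """Assumed tokenizer: lowercase word forms, internal ' and - allowed."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def ngrams(tokens, n):
    """Return all uninterrupted n-word sequences in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def lexical_bundles(texts, n, per_million, min_texts):
    """Return {bundle: (raw frequency, number of texts)} for n-grams that
    meet both the normalized frequency cut-off and the text-range cut-off.

    min_texts is a hypothetical parameter; the study does not report it.
    """
    freq = Counter()
    spread = defaultdict(set)  # bundle -> ids of texts containing it
    total_tokens = 0
    for text_id, text in enumerate(texts):
        tokens = tokenize(text)
        total_tokens += len(tokens)
        for gram in ngrams(tokens, n):
            freq[gram] += 1
            spread[gram].add(text_id)
    bundles = {}
    for gram, count in freq.items():
        normalized = count * 1_000_000 / total_tokens
        if normalized >= per_million and len(spread[gram]) >= min_texts:
            bundles[gram] = (count, len(spread[gram]))
    return bundles

# Hypothetical toy texts standing in for the corpus articles.
toy_texts = [
    "hyperbaric oxygen therapy for wounds",
    "effects of hyperbaric oxygen therapy",
    "oxygen therapy is safe",
]
toy_bundles = lexical_bundles(toy_texts, n=2, per_million=500, min_texts=2)
print(toy_bundles)
print(round(raw_threshold(250)), round(raw_threshold(150)), round(raw_threshold(500)))
```

In the toy example only the two-word sequences that occur in at least two texts survive; the text-range criterion is what filters out idiosyncratic sequences that are frequent in a single article only.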

As shown in the materials, lexical bundles built around the keyword therapy account for 70155 occurrences and those built around the word ulcer for 18274 occurrences, which is a high frequency for a corpus of 100 articles. Additionally, apart from function words and discourse markers, other common items such as noun phrases, verb phrases, adjective phrases and prepositional phrases are phraseological units. Such units are often used by experts in the field of HBOT. Most of the collocations are nominal phrases in which the keyword is modified by another noun, an adjective, a verb or a preposition. Such combinations are sometimes called multiword units or multiword items, as in the works of Moon (2007) or Nattinger and DeCarrico (1992). All the LBs are connected with the topic of the articles, which means that they are not typical LBs, since they constitute a semantic unit. Typical LBs are rather clusters of random elements which frequently recur together and are therefore treated as a unit (neither semantic nor syntactic, and with elements that do not depend on each other). Apart from the LBs (presented in the second, third, tenth position, etc. in Table 4), which are often repeated and cannot be divided, there are also phraseological units, which constitute semantic units, and colligations (similar to collocations but focused more on the grammatical aspect). In English, noun phrases are characterised by the highest repeatability, as indicated in Diagram 1 below, which is based on the data from the corpus.

Diagram 1. A division of the most frequent phraseological units in the frequency list from English medical articles concerning Hyperbaric Oxygen Therapy (HBOT).

The frequent use of noun phrases can be explained by the fact that conventionalisation is quite often achieved through nominalisation, both in specialised and in more general registers. In other words, the most frequent concepts are described with noun phrases, which become more entrenched the more often they are used. In the given corpus, 40% of the phraseological units prove to be noun phrases. Those connections are probably dependent on the language usus. Nevertheless, it is very difficult to decide whether this deduction is objective as some of the

phrases might just be used spontaneously. The tables and the diagram generated from the collocations show that the number of different combinations of adjective phrases or prepositional phrases is much lower than that of noun phrases. This point was also raised by Sadat Jalali and Raouf Moini (2015), who obtained similar results. Those findings may lead to the conclusion that using lexical bundles in specialist texts indicates considerable knowledge of the field. What is more, because of the professional and very sophisticated vocabulary, the text may be difficult for a layperson. The authors of specialist texts rightly assume that most readers of the articles will have a similar level of expertise and comparable knowledge of such bundles. The research conducted shows that the occurrence of lexical bundles in specialist medical texts concerning HBOT is very high.

4. Concluding remarks

This explanatory and preliminary study combines particular theoretical notions from the area of register analysis and phraseology with certain features of the corpus linguistic methodology, specifically the corpus-driven approach. With the aid of corpus linguistic phraseology, one can determine and examine new ways of identifying multi-word units. Likewise, one can discover new facts about the use, purpose and allocation of repetitive co-occurring arrangements of words, with a special focus on their distribution and regularity rather than on rare and irregular uses. Despite the fact that researchers now have access to an increasing number of English text-based corpora (e.g. the British National Corpus, the Corpus of Contemporary American English), as well as numerous instruments and materials that have evolved closely with the corpora (see Przepiórkowski et al. 2012 for a review), corpus linguistic research on phraseology in native English medical texts is still very limited.

The present research shows that medical specialist texts contain many lexical bundles, as well as phraseological units such as noun phrases, verb phrases, adjective phrases and prepositional phrases. There are also numerous function words and discourse markers. The use of such vocabulary indicates two things. First, the authors of the articles use it because they are experts and have the necessary knowledge to use it correctly. Second, it shows that these texts are written with other experts in mind. While such specialist vocabulary may make the text more difficult for a layperson, it will actually make it more accessible for experts, who are used to such vocabulary and can communicate more effectively using professional terms rather than their simplified equivalents.

References

Ädel, A. & B. Erman. 2012. “Recurrent word combinations in academic writing by native and non-native speakers of English: A lexical bundles approach.” English for Specific Purposes, 31, 81–92.
Al-Waili, N. S., Butler, G. J., Beale, J., Abdullah, M. S., Hamilton, R. W. & B. Y. Lee. 2005. “Hyperbaric oxygen in the treatment of patients with cerebral stroke, brain trauma, and neurologic disease.” Adv Therap, 22(6), 659–78. Available at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3380430/ Retrieved on 13 December 2015.
Biber, D. 2006. University Language. A Corpus-Based Study of Spoken and Written Registers. Amsterdam: John Benjamins.
Biber, D. 2009. “A corpus-driven approach to formulaic language in English: multiword patterns in speech and writing.” International Journal of Corpus Linguistics, 14, 275–311.
Biber, D. & S. Conrad. 2009. Register, Genre and Style. Cambridge: Cambridge University Press.
Biber, D., Conrad, S. & V. Cortes. 2003. “Lexical bundles in speech and writing: An initial taxonomy.” In A. Wilson, P. Rayson & T. McEnery (eds.) Corpus Linguistics by the Lune: A Festschrift for Geoffrey Leech, 71–92. Frankfurt am Main: Peter Lang.
Biber, D., Conrad, S. & V. Cortes. 2004. “If you look at…”: Lexical bundles in university teaching and textbooks. Applied Linguistics, 25, 371–405.
Biber, D., Johansson, S., Leech, G., Conrad, S. & E. Finegan. 1999. The Longman Grammar of Spoken and Written English. London: Longman.
Burger, H. 2007. Phraseologie: ein internationales Handbuch zeitgenössischer Forschung, Vol. 2. Berlin: Walter de Gruyter.
Ghanizadeh, A. 2012. “Hyperbaric oxygen therapy for treatment of children with autism: a systematic review of randomized trials.” Medical Gas Res, 11, 2:13. Available at: http://www.ncbi.nlm.nih.gov/pubmed/22577817 Retrieved on 11 December 2015.
Goźdź-Roszkowski, S. 2011. Patterns of Linguistic Variation in American Legal English. A Corpus-Based Study. Frankfurt: Peter Lang.
Grabowski, Ł. 2013. “Register Variation Across English Pharmaceutical Texts: A Corpus-driven Study of Keywords, Lexical Bundles and Phrase Frames in Patient Information Leaflets and Summaries of Product Characteristics.” Procedia – Social and Behavioral Sciences, 95, 391–401. doi:10.1016/j.sbspro.2013.10.661
Kopaczyk, J. 2012. “Long lexical bundles and standardisation in historical legal texts.” Studia Anglica Posnaniensia, 47(2−3), 3–25.

Moon, R. 2007. “Corpus linguistic aspects of phraseology.” In H. Burger (ed.) Phraseology: An international handbook of contemporary research, 28(2), 1045–1059. Berlin: de Gruyter.
Nattinger, J. & J. DeCarrico. 1992. Lexical Phrases and Language Teaching. Oxford: Oxford University Press.
Przepiórkowski, A., Bańko, M., Górski, R. & B. Lewandowska-Tomaszczyk (eds.) 2012. Narodowy Korpus Języka Polskiego. Warszawa: Wydawnictwo PWN.
Rossignol, D. A. 2008. “The use of hyperbaric oxygen therapy in autism.” In J. H. Zhang (ed.) Hyperbaric oxygen for neurological disorders, 209–258. Flagstaff: Best Publishing Company.
Sadat, J. & M. Raouf. 2014. “Structure of Lexical Bundles in Introduction Section of Medical Research Articles.” Procedia – Social and Behavioral Sciences, 98, 719–726.
Schmitt, N. & R. Carter. 2004. “Formulaic sequences in action: An introduction.” In N. Schmitt (ed.) Formulaic Sequences: Acquisition, Processing and Use, 1–22. Amsterdam: John Benjamins.
Scott, M. 2015. WordSmith Tools 6.0. Liverpool: Lexical Analysis Software.
Sinclair, J. 1991. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Stubbs, M. & I. Barth. 2003. “Using recurrent phrases as text-type discriminators: a quantitative method and some findings.” Functions of Language, 10, 65–108.
Swan, M. 2005. Practical English Usage. Oxford: Oxford University Press.
Tognini-Bonelli, E. 2001. Corpus Linguistics at Work. Amsterdam/Philadelphia: John Benjamins.
Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.