English Pages 280 [281] Year 2023
Using Corpora in Discourse Analysis
BLOOMSBURY DISCOURSE
Series Editor: Ken Hyland The Bloomsbury Discourse Series aims to capture the fast-developing interest in discourse to provide students, new and experienced teachers and researchers in applied linguistics, ELT and English language with an essential bookshelf. Each book deals with a core topic in discourse studies to give an in-depth, structured and readable introduction to an aspect of the way language is used in real life. Titles published in the series: The Discourse of Customer Service Tweets, Ursula Lutzky Discourse Analysis, Brian Paltridge Discourse and Identity on Facebook, Mariza Georgalou Spoken Discourse, Rodney Jones The Discourse of Online Consumer Reviews, Camilla Vásquez Sports Discourse, Tony Schirato Corporate Discourse, Ruth Breeze Discourse of Twitter and Social Media, Michele Zappavigna Discourse Studies Reader, Ken Hyland
Using Corpora in Discourse Analysis SECOND EDITION Paul Baker
BLOOMSBURY ACADEMIC Bloomsbury Publishing Plc 50 Bedford Square, London, WC1B 3DP, UK 1385 Broadway, New York, NY 10018, USA 29 Earlsfort Terrace, Dublin 2, Ireland BLOOMSBURY, BLOOMSBURY ACADEMIC and the Diana logo are trademarks of Bloomsbury Publishing Plc First published in Great Britain 2006 This edition published 2023 Copyright © Paul Baker, 2023 Paul Baker has asserted his right under the Copyright, Designs and Patents Act, 1988, to be identified as Author of this work. For legal purposes the Acknowledgements on p. xi constitute an extension of this copyright page. Cover illustration courtesy of Martin O’Neill All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers. Bloomsbury Publishing Plc does not have any control over, or responsibility for, any third-party websites referred to or in this book. All internet addresses given in this book were correct at the time of going to press. The author and publisher regret any inconvenience caused if addresses have changed or sites have ceased to exist, but can accept no responsibility for any such changes. A catalogue record for this book is available from the British Library. A catalog record for this book is available from the Library of Congress.
ISBN: HB: 978-1-3500-8374-5; PB: 978-1-3500-8375-2; ePDF: 978-1-3500-8377-6; eBook: 978-1-3500-8376-9
Series: Bloomsbury Discourse Typeset by RefineCatch Limited, Bungay, Suffolk To find out more about our authors and books visit www.bloomsbury.com and sign up for our newsletters.
Contents
List of Figures
List of Tables
Preface
Acknowledgements
1 Introduction
2 The First Stages
3 Corpus Building
4 Frequency, Dispersion and Distribution
5 Concordances
6 Collocates
7 Keyness
8 Going Beyond the Basics
9 Conclusion
Notes
Glossary
References
Index
Figures
3.1 Header information obtained from Nexis
3.2 Short sample header from a written text
3.3 Grammatically tagged sentence from the British National Corpus
4.1 Wordlist output of WordSmith
4.2 Wordlist of three word clusters
4.3 Concordance plot of work2live in the holiday corpus
5.1 Screenshot of (refugee/refugees) in the BNC via CQPweb
5.2 Sorted concordance
6.1 Simple collocational network of America
6.2 Collocational network focussing on America and allies
6.3 Collocational network focussing on America and war
6.4 Comparisons of representations of America
6.5 Collocational network from tabloid news
7.1 Keywords when the sub-corpora are compared against the same reference corpus
8.1 Excel spreadsheet indicating Coefficient of Variance over four time periods
8.2 Frequencies of selected adjectives in gay personal adverts over time
8.3 Screenshot of Trends analysis in Sketch Engine
Tables
2.1 Features of popular corpus analysis tools
3.1 Popular online corpus building resources
4.1 Holiday leaflets
4.2 Percentage frequencies of the ten most frequently occurring words in the holiday corpus and their equivalencies in the BNC
4.3 The most frequent ten lexical words in the holiday corpus
4.4 The most frequent lexical lemmas in the holiday corpus
4.5 The most frequent lexical verbs in the holiday corpus
4.6 The most frequent informal terms in the holiday corpus
4.7 Combined frequencies per million words of loads, mates, cool, massive, fab, info and tons in the BNC for age, sex and social class
4.8 Combined frequencies per million words of loads, mates, cool, massive, fab, info and tons in the BNC, cross-tabulated for sex, age and social class
4.9 Clusters relating to articles about Romanians
5.1 Refugees as victims
5.2 Refugees as recipients of help
5.3 The refugee situation
5.4 Too many refugees
5.5 Illegal refugees
5.6 refugee from
5.7 Genuine refugees
5.8 Refugees as destructive or manipulative
5.9 Positive representations of refugees
6.1 Concordance of America collocating with bleed or racism
6.2 Top ten logDice collocates of America using #LancsBox and Sketch Engine
6.3 Categorisation of top fifty collocates of America
6.4 Sample concordance lines showing America collocating with her, its and allies
6.5 Sample concordance lines showing America collocating with war and against
6.6 Word Sketch of America
6.7 Sample concordance lines showing representations of America as warlike
6.8 Sample concordance lines showing representations of America as manipulative or criminal
6.9 Sample concordance lines showing representations of America as hated and under attack
6.10 Sample concordance lines showing representations of America as weak and failing
7.1 The ten most frequent words in the fox hunting corpus
7.2 The ten most frequent lexical words in the fox hunting corpus
7.3 The ten most frequent lexical words used by opposing groups in the fox hunting debate
7.4 Keywords when p < 0.0001
7.5 Concordance of make with criminal
7.6 Common clusters containing dogs
7.7 Concordance of the use of dogs
7.8 Concordance of illiberal (pro-hunt debate)
7.9 Sample concordance of fellow citizens, Britain and people (pro-hunt)
7.10 Concordance of activities (pro-hunt)
7.11 Concordance of barbaric (anti-hunt)
7.12 Concordance (sample) of cruelty (anti-hunt)
7.13 Concordance (sample) of cruelty (pro-hunt)
7.14 Concordance of cruelty associated with (anti-hunt)
7.15 Concordance of there is cruelty in (pro-hunt)
7.16 Concordance (sample) of words tagged as S1.2.5+ ‘Toughness; strong/weak’ (anti-hunt)
7.17 Concordance (sample) of words tagged as S1.2.6+ ‘Sensible’ and S1.2.6− ‘Foolish’ (pro-hunt)
7.18 Concordance (sample) of words tagged as G2.2+ ‘Ethical’ (pro-hunt)
8.1 Metaphors tagged as A1.1.1 in Laura Kuenssberg’s tweets
Preface
Using Corpora in Discourse Analysis was first published in 2006. I wrote it because I was excited about the potentialities of using corpus linguistics methods to answer questions about discourse and I wanted to share this new form of analysis with others. There were a small number of people working in the area at the time but nobody had written a book which showcased the range of techniques of analysis, pairing them with different corpora. I tried to make the book accessible so that it could be understood by people who were not familiar with either corpus linguistics or discourse analysis, as well as people who did not have English as a first language. Seventeen years, and many conference presentations, workshops, journal articles and books later, the ideas, arguments and techniques in the book still apply, and the field is now much larger than it was, with a concomitant expansion of software and corpora. I have applied the techniques in the book to a number of projects, both large-scale and small, including representations of Islam, masculinity, trans people and obesity in the press, propaganda strategies of people advocating violent jihad, the ways that patients use language to give feedback on health services and analysis of a forum devoted to anxiety. I have worked with charities and government organisations, helping them to make sense of the vast amounts of language data that they have amassed. These projects have given me the opportunity to think about the best ways of conveying results to different audiences, a point which I address in the first and final chapters of the book. Hopefully others can benefit from various hard-won lessons I have learnt. I have also continued to reflect on and develop new methods of analysis and refine existing ones. Although I did not realise it at the time of writing, Using Corpora in Discourse Analysis was a blueprint for my own future career trajectory as well as inspiring others. 
The intention of this second edition is to provide an updated version which maintains the spirit and aims of the original but goes beyond it in a number of ways, taking into account the developments in the field in the last fifteen years. One way that this has been done is to showcase a wider range of software tools. The first edition only made use of WordSmith, whereas the
second edition also includes AntConc, GraphColl, CQPweb and Sketch Engine, focussing on aspects of each tool that are unique. I have also added a new chapter which discusses steps in carrying out a research project, and two other chapters have been rewritten. The chapter on collocates is based on a different dataset – the violent jihad texts mentioned earlier – as well as having sections on collocational networks and the Word Sketch feature of Sketch Engine, while the final analysis chapter in the book now covers a wider range of applications, considering different types of corpora and more challenging forms of analysis. The other chapters have also been updated, and I have included questions for students and recommendations for further reading at the end of each chapter, along with a glossary. One aspect of using the book with students in my own workshops is that the corpus tools have continued to uncover new findings within the corpora, and a few of these have been incorporated into those chapters. I hope that the book continues to inspire. Paul Baker, Lancaster.
Acknowledgements
Thanks to Tony McEnery, Mike Scott, Laurence Anthony, Paul Rayson, Hans Martin Lehmann, Andrew Hardie, Elena Semino, Veronika Koller, Ruth Wodak, Susan Hunston, Ken Hyland, Jennifer Lovel, Laura Gallon and Morwenna Scott.
1 Introduction
This book is about a set of techniques that can be used to analyse language for a particular purpose. More explicitly, it is about using corpora (large bodies of naturally occurring language data stored on computers) and corpus processes (computational procedures which manipulate this data in various ways) in order to uncover linguistic patterns which can enable us to make sense of the ways that language is used in the construction of discourses (or ways of constructing reality). It therefore involves the pairing of two areas related to linguistics (corpus linguistics and discourse analysis) which have not always had a great deal to do with each other for reasons I will try to explain later in this chapter. This book is mainly written for ‘linguists who use corpora’ (Partington 2003: 257), rather than explicitly for corpus linguists, although hopefully corpus linguists may find something of use in it too. This chapter serves as an overview for the rest of the book. A problem with writing a book that involves bridge-building between two different disciplines is in the assumptions that have to be made regarding a fairly disparate target audience. Some people may know a lot about discourse analysis but not a great deal about corpus linguistics. For others the opposite may be the case. For others still, both areas might be equally opaque. I will try to cover as much ground as possible and hope that readers bear with me or can skim through the parts that they are already familiar with. I will begin by giving a quick description of corpus linguistics, followed by one of discourse.
Corpus linguistics
Corpus linguistics is ‘the study of language based on examples of real life language use’ (McEnery and Wilson 1996: 1). Before the availability of
computers, the Latin term corpus (meaning body) could refer to a body of language but by the 1980s the term corpus linguistics was being used to refer to the study of such bodies. From today’s perspective, where adjacent academic fields increasingly compete for attention, as a ‘brand name’, the term corpus linguistics is unfortunately opaque. Once the term has been fully explained and newcomers discover that it actually involves the use of computer software to analyse texts via statistical procedures, it is important to also explain that the software is an aid to analysis, it does not constitute the analysis itself. As Biber (1998: 4) points out, corpus-based research actually depends on both quantitative and qualitative techniques: ‘Association patterns represent quantitative relations, measuring the extent to which features and variants are associated with contextual factors. However functional (qualitative) interpretation is also an essential step in any corpus-based analysis.’ In addition, researchers have to make numerous decisions about the computational aspects of analysis relating to the corpus, software, techniques and settings they adopt. Corpora are generally large (some contain billions of words) representative samples of a particular type of naturally occurring language, so they can therefore be used as a standard reference with which claims about language can be measured. The fact that they are encoded electronically means that complex calculations can be carried out on large amounts of text, revealing linguistic patterns and frequency information that would otherwise take days or months to uncover by hand, and may run counter to intuition. Electronic corpora are often annotated with additional linguistic information, the most common being part of speech information (for example, whether a word is a noun or a verb), which allows large-scale grammatical analyses to be carried out. 
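As a rough illustration of what frequency information and part-of-speech annotation look like in practice, the following minimal Python sketch may help. It is not drawn from any of the tools discussed in this book: the nine-token sample, the tag labels and the function names are all invented for illustration, and a real corpus would be far larger and would use an established tagset.

```python
from collections import Counter

# Hypothetical miniature "corpus": each token is a (word, part-of-speech) pair.
# The tags (DET, NOUN, VERB, CONJ) are illustrative labels, not a real tagset.
TAGGED_SAMPLE = [
    ("The", "DET"), ("refugees", "NOUN"), ("crossed", "VERB"),
    ("the", "DET"), ("border", "NOUN"), ("and", "CONJ"),
    ("the", "DET"), ("refugees", "NOUN"), ("waited", "VERB"),
]


def word_frequencies(tagged_tokens):
    """Count how often each word form occurs, ignoring case."""
    return Counter(word.lower() for word, _tag in tagged_tokens)


def tag_frequencies(tagged_tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(tag for _word, tag in tagged_tokens)


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words,
    so that corpora of different sizes can be compared."""
    return count / corpus_size * 1_000_000


words = word_frequencies(TAGGED_SAMPLE)
tags = tag_frequencies(TAGGED_SAMPLE)
print(words.most_common(3))
print(tags.most_common())
print(per_million(words["refugees"], len(TAGGED_SAMPLE)))
```

Normalising to a rate per million words is one common way of comparing counts across corpora of different sizes; with a sample this small the figure is meaningless except as a demonstration of the arithmetic.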
Other types of information can be encoded within corpora – for example, in spoken corpora (containing transcripts of dialogue) attributes such as sex, age, socio-economic group and region can be encoded for each participant. This would allow comparisons to be made about different types of speakers. For example, Rayson et al. (1997) have shown that speakers from economically advantaged groups use adverbs like actually and really more than those from less advantaged groups. On the other hand, people from less advantaged groups are more likely to use words like say, said and saying, numbers and taboo words. Corpus-based or equivalent methods have been used from as early as the nineteenth century. The diary studies of infant language acquisition (Taine 1877, Preyer 1889), or Käding’s (1897) frequency distribution of sequences
of letters in an eleven-million word corpus of German focussed on collections of large, naturally occurring language use (in the absence of computers, the data was painstakingly analysed by hand). However, up until the 1970s, only a small number of studies utilised corpus-based approaches. Quirk’s (1960) survey of English usage began in 1961, as did Francis and Kucera’s work on the Brown corpus of American English. It was not until the advent of widely available personal computers in the 1980s that corpus linguistics as a methodology became popular. Johansson (1991) shows that the number of such studies doubled for every five-year period between 1976 and 1991 and since the 1990s the field has continued to grow, supporting numerous journals, book series and conferences. Corpus linguistics has since been employed in a number of areas of linguistic enquiry, including dictionary creation (Clear et al. 1996, Hanks 2012), as an aid to interpretation of literary texts (Louw 1997, Mahlberg 2013), forensic linguistics (Wools and Coulthard 1998, Wright 2017), language description (Sinclair 1999, Leech et al. 2009), language variation studies (Biber 1988, Reppen et al. 2002) and language teaching (O’Keefe et al. 2007, Friginal 2018). The aim of this book, however, is to investigate how corpus linguistics can enable the analysis of discourses. With that said, the term discourse has numerous interpretations, so the following section explains what I mean when I use it.
Discourse
The term discourse is problematic, as it is used in social and linguistic research in a number of inter-related yet different ways. In traditional linguistics it is defined as either ‘language above the sentence or above the clause’ (Stubbs 1983: 1), or ‘language in use’ (Brown and Yule 1983). We can talk about the discourse structure of particular texts. For example, a recipe will usually begin with the name of the meal to be prepared, then give a list of ingredients, then describe the means of preparation. There may be variants to this, but on the whole we are usually able to recognise the discourse structure of a text like a recipe fairly easily. We would expect certain lexical items or grammatical structures to appear at particular places (for example, numbers and measurements would appear near the beginning of the text, in the list of ingredients, ‘4 15ml spoons of olive oil’, whereas imperative sentences would appear in the latter half, ‘Slice each potato lengthwise’). A related use of discourse which refers to text structure is discourse organiser or
discourse marker. These are words and phrases that help us to signal the structure of our speech or writing. They can indicate a host of features including topic change, reformulation, evaluation, agreement, causality, coordination and interpersonal relationships. For example, the word now might be used at the start of a sentence or utterance as a discourse marker to indicate that the writer or speaker wants to change topic. The term discourse is also sometimes applied to different types of language use or topics, for example we can talk about political discourse (Chilton 2004), colonial discourse (Williams and Chrisman 1993), media discourse (Fairclough 1995) and environmental discourse (Hajer 1997). Relatedly, a number of researchers have used corpora to examine discourse styles of people who are learners of English. Ringbom (1998) found a high frequency of lexis that had a high level of generality (words like people and things) in a corpus of writing produced by learners of English when compared to a similar corpus of native speakers. So this is a conceptualisation of discourse which is linked to genre, style or text type. In this book we will be examining different discourses: tourist discourse in Chapter 4, violent jihadist discourse in Chapter 6 and political discourse in Chapter 7. However, discourse can also be defined as ‘practices which systematically form the objects of which they speak’ (Foucault 1972: 49) and it is this meaning of discourse which I intend to focus on in this book (although in practice it is difficult to consider this meaning without taking into account the other meanings as well). In order to expand upon Foucault’s definition, discourse is a ‘system of statements which constructs an object’ (Parker 1992: 5) or ‘language-in-action’ (Blommaert 2005: 2). 
It is further categorised by Burr (1995: 48) as ‘a set of meanings, metaphors, representations, images, stories, statements and so on that in some way together produce a particular version of events . . . Surrounding any one object, event, person etc., there may be a variety of different discourses, each with a different story to tell about the world, a different way of representing it to the world.’ Because of Foucault’s notion of practices, discourse therefore becomes a countable noun: discourses (Cameron 2001: 15). So around any given object or concept there are likely to be multiple ways of constructing it, reflecting the fact that humans are diverse; we tend to perceive aspects of the world in different ways, depending on a range of factors. In addition, discourses allow for people to be internally inconsistent; they help to explain why people contradict themselves, change position or appear to have ambiguous or conflicting views on the same subject (Potter and Wetherell 1987). We can view cases like this in terms of people holding
competing discourses. Therefore, discourses are not valid descriptions of people’s ‘beliefs’ or ‘opinions’ and they cannot be taken as representing an inner, essential aspect of identity such as personality or attitude. Instead they are connected to practices and structures that are lived out in society from day to day. Discourses can therefore be difficult to pin down or describe – they are constantly changing, interacting with each other, breaking off and merging. As Sunderland (2004) points out, there is no ‘dictionary of discourses’. In addition, any act of naming or defining a discourse is going to be an interpretative one. Where I see a discourse, you may see a different discourse, or no discourse. It is difficult, if not impossible, to fully step outside discourse. Therefore our labelling of something as a discourse is going to be based upon the discourses that we already (often unconsciously) live with. As Foucault (1972: 146) notes, ‘it is not possible for us to describe our own archive, since it is from within these rules that we speak’. To give a couple of examples, Holloway’s (1981, 1984) work on heterosexual relations produced what Sunderland (2004: 58) refers to as a ‘male sexual drive’ discourse, one which constructs male sexuality as a biological drive – men are seen as having a basic need for sex which they cannot ignore and which must be satisfied. Such a discourse could be used in law courts to ensure that men who are convicted of rape receive light sentences. Similarly, Sunderland (2004: 55) identifies a discourse of compulsory heterosexuality, based on Rich’s (1980) critical essay ‘Compulsory Heterosexuality and Lesbian Existence’. This discourse would involve practices which overlook the existence of gay and lesbian people by assuming that everyone is heterosexual. 
Traces of this discourse could be found in a wide range of language contexts – for example, relatives asking a teenage boy if he has found a girlfriend yet; in adverts for perfume or lingerie, where it is typically a man who is shown buying gifts for his female partner; or in medical, scientific or advisory texts (which may focus on male–female penetrative (missionary position) intercourse as the only (or preferred) way of conceiving a child or achieving orgasm). Discourses of compulsory heterosexuality could also be shown by the absence of explicit references to heterosexuality in speech and writing, effectively normalising or unproblematising the concept. For example, we would expect the terms man, gay man and heterosexual man to occur in general language usage in the order of frequency that I have just listed them in. Man is generally taken to mean heterosexual man, which is why the latter term would appear so rarely. Gay man, being the marked, exceptional case, would therefore appear more frequently than heterosexual man, but not as often as man.1
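The expected frequency ordering of man, gay man and heterosexual man could be checked against a corpus with a simple count. The Python sketch below shows one way of doing so. The sample text is invented purely so that the code runs and is not real corpus data, and the function name is hypothetical; a genuine test of the claim would need a large general reference corpus.

```python
import re
from collections import Counter

# Invented sample text for illustration only; not real corpus data.
SAMPLE_TEXT = (
    "A man walked in. The man sat down. Another man waved. "
    "A gay man spoke first. Then a gay man replied. "
    "A heterosexual man listened. One more man left."
)

TERMS = ["man", "gay man", "heterosexual man"]


def count_terms(text, terms):
    """Count whole-word, case-insensitive matches of each term.

    Every 'gay man' or 'heterosexual man' is also counted as a 'man',
    because the shorter term is contained in the longer ones.
    """
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts


counts = count_terms(SAMPLE_TEXT, TERMS)
# Terms ordered from most to least frequent.
ranked = [term for term, _count in counts.most_common()]
print(counts)
print(ranked)
```

The word-boundary markers in the pattern keep man from matching inside words such as woman or manual. Whether to subtract the modified cases from the count for man is an analytical decision that the sketch leaves open.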
Due to the array of meanings of discourse, defining Discourse Analysis (DA) can also be tricky. A range of approaches are in existence although what they tend to have in common is that they view language as social interaction and they analyse linguistic features in real-life instances of texts. A related form of DA called Critical Discourse Analysis (CDA, later sometimes referred to as CDS – Critical Discourse Studies) developed out of an approach called Critical Linguistics (Fowler et al. 1979) and aimed to identify how power imbalances and inequalities are realised through social and linguistic practices (see Chapter 4 for more detail). With CDS we sometimes begin our research with a hunch that a particular type of language use is unfairly disempowering. The analysis is intended to set out how this is done, in an objective way, although this could still leave CDS practitioners open to the charge of incorporating their political biases into their research, having reached a conclusion before they began their analysis. An important way that discourses are constructed, maintained, circulated and challenged is through language. Language (both as an abstract system: phonetics, grammar, lexicon, etc. and as a context-based system of communication) is not the same as discourse, but we can carry out analyses of language in texts in order to uncover traces of discourses. So bearing this linguistic dimension of discourse analysis in mind, to what extent have corpora been utilised in studies that have tried to uncover discourses in language?
The emergence of CADS
The first studies which used corpora to carry out discourse analysis took place in the early 1990s. Caldas-Coulthard (1993, 1995) carried out a study of gender representation on a corpus of news as narrative discourse. The study focussed on comparing how often males and females were mentioned, how often they were quoted or referred to in specific ways, e.g. as belonging to institutions or in terms of their relationships to others. Then, in 1995, Hardt-Mautner published a technical paper where she identified the possibility of a fruitful relationship between corpus linguistics and Critical Discourse Analysis. She rightly concludes (1995a: 25) that ‘we are now in a position to argue that the analyst’s intuition, though still an important tool, at last has a powerful and versatile ally on its side’. In 1996, a corpus analysis tool called WordSmith, created by Mike Scott, helped to inspire further discourse-oriented work, which continued through
the late 1990s. Examples include Krishnamurthy (1996) who looked at identity words like tribal, ethnic and racial in corpora; Morrison and Love (1996) who examined letters to magazines; and Flowerdew’s (1997) analysis of speeches by Chris Patten, the last governor of Hong Kong. Some of these studies, despite using corpus techniques, employed quite small amounts of data. For example, Stubbs (1996) published an analysis of the ways that gender is constructed within two of Robert Baden-Powell’s speeches to boys and girls, highlighting the fact that ideological issues can be present even around a fairly innocuous word like happy. He showed that Baden-Powell (the founder of the Boy Scouts Association) instructed girls to make other people happy whereas boys were simply instructed to live happy lives. Similarly, Shalom (1997) analysed a small corpus of men’s and women’s personal adverts, finding, among other things, that gay men were more likely to emphasise physical attributes and masculinity compared to other types of advertisers. In the 1990s and 2000s, there were several individuals and groups who were carrying out this type of work, not always fully aware of what others were doing. A key location was Birmingham University where scholars were influenced by pioneering work carried out by Michael Stubbs. Another was at Lancaster University, where Geoffrey Leech had overseen the creation of the British National Corpus, enabling work on social variation such as McEnery et al.’s (2000) research on swearing and demographic categories. I took up a lectureship at Lancaster in 2002 and had become interested in applying corpus techniques to carry out critical discourse analysis, which had been popularised by colleagues at Lancaster like Norman Fairclough, Ruth Wodak and Paul Chilton. 
Meanwhile, at Victoria University of Wellington in New Zealand, Janet Holmes was using corpora to consider uses of sexist and non-sexist language (Holmes 2001, Sigley and Holmes 2002) while at the University of Bologna, a group led by Alan Partington took an approach called CADS (Corpus Assisted Discourse Studies). Unlike, perhaps, the other approaches outlined above, CADS aimed to de-emphasise the more critical aspects of analysis (Partington et al. 2013: 339), taking a more exploratory approach to allow researchers to study topics where they may not have pre-existing hypotheses or strong beliefs about author bias or the existence of power inequalities. An example of CADS research I have been involved in is Baker et al. (2019) which examined a corpus of patient feedback to the National Health Service in the UK. It was not the intention of this research to identify power inequalities or abuses but instead to obtain a sense of the ways that patients praised or criticised aspects of their
treatment, and how they used language to legitimate their feedback. However, a related piece of research was carried out by Evans (2020) who analysed how members of staff at the NHS responded to this patient feedback. This analysis was more critical in that it identified ways that some staff members undermined certain types of feedback by using strategies like sarcasm or mock politeness. Presenting at conferences and workshops in the 2000s, I was often struck by the differing reactions to the approach, noting interest, disinterest and hostility from individual audience members. There were perhaps good reasons for this range of opinions. A new field (actually the combination of two fields) can take time to develop as it has to compete with existing approaches. A new methodology (or justification for using a research method) requires quite a bit of explanatory time which can detract from the actual analysis and discussion of results or implications. Additionally, this was a developing field with little understanding of best practice, so sometimes an analysis might be under-developed and thus unconvincing. Criticisms I heard were (paraphrased) ‘words are beautiful, like flowers, you shouldn’t put them in a computer program’, ‘it doesn’t go far enough, it’s just descriptive’, ‘yes, but what does it mean?’ and ‘people shouldn’t download a lot of data from a culture they know hardly anything about’. These criticisms point to the importance of interpreting and explaining results within context (e.g. qualitative examination of texts in the corpus and consideration of the social contexts that resulted in the creation of the texts originally), rather than thinking that results constitute simply tables of frequencies. While I find corpus-based discourse analysis to be a worthwhile technique, I do not wish to be thoughtlessly evangelical about it. 
All methods of research have associated problems which need to be addressed and are also limited in terms of what they can and cannot achieve. One criticism of corpus-based approaches is that they are too broad – they do not facilitate close readings of texts. However, this is akin to complaining that a telescope only lets us look at faraway phenomena, rather than allowing us to look at things close up, like a microscope (Partington 1998: 144). Kenny (2001) argues that in fact the corpus-based approach is more like a kaleidoscope, allowing us to see textual patterns come into focus and recede again as others take their place. Acknowledging what a corpus-based approach can do and what it cannot do is necessary, but should not mean that we discard the methodology – we should just be clearer about when it is appropriate to use it, when we should combine it with other approaches or when it is simply not going to be useful.
Other researchers have problematised corpus-centred approaches as constituting linguistics applied rather than applied linguistics (e.g. Widdowson 2000). Widdowson (2000: 7) claims that corpus linguistics only offers ‘a partial account of real language’ because it does not address the lack of correspondence between corpus findings and native speaker intuitions. Widdowson also questions the validity of analysts’ interpretations of corpus data and raises questions about the methodological processes that they choose to use, suggesting that the ones which computers find easier to carry out will be chosen in preference to more complex forms of analysis. Additionally, Borsley and Ingham (2002) criticise corpus-based approaches because it is difficult to make conclusions about language if an example does not appear in a corpus. They also argue that language is endowed with meaning by native speakers and therefore cannot be derived from a corpus (see Stubbs (2001a, 2002) for rejoinders to these articles). A related criticism is by Baldry (2000: 36) who argues that corpus linguistics treats language as a self-contained object, ‘abstracting text from its context’. And Cameron (1998), in an article about dictionary creation using corpus-based methodologies, warns that corpus linguists have had a tendency to over-rely on newspapers and synchronic data, at the expense of charting the historical origins surrounding words and their changing meanings and usages over time. Such criticisms are worth bearing in mind, although they should not prevent researchers from using corpora; rather, they should encourage corpus-based work which takes into account such problems, perhaps supplementing their approach with other methodologies. 
For example, there is no reason why corpus-based research on lexical items should not use diachronic corpora in order to track changes in word meaning and usage over time, and several large-scale corpus building projects have been carried out with the aim of creating historic corpora from different time periods.2
Corpus linguistics also tends to be conceptualised (particularly by non-corpus researchers) as a quantitative method of analysis: something which is therefore at odds with the direction that social inquiry has taken since the 1980s. Before the 1980s, corpus linguistics had struggled to make an impact upon linguistic research because computers were not sufficiently powerful or widely available to put the theoretical principles into practice. Ironically, by the time that computers had become widely available to scholars, there had already occurred a shift in the social sciences in the accepted ways that knowledge was produced via research methodologies. Structuralist approaches that emphasized measurement and quantification
were replaced by post-structuralist or post-modern approaches, which tended to be more qualitative (Burr 1995). As Denzin (1988: 432) writes:
Gone are words like theory, hypothesis, concept, indicator, coding scheme, sampling, validity, and reliability. In their place comes a new language: readerly texts, modes of discourse, cultural poetics, deconstruction, interpretation, domination, feminism, genre, grammatology, hermeneutics, inscription, master narrative, narrative structures, otherness, postmodernism, redemptive ethnography, semiotics, subversion, textuality, tropes.
While Denzin (1988: 432) optimistically suggested that now researchers had a choice, I would agree with Swann, in her assessment of language and gender research, who notes that, ‘On the whole . . . there does seem to have been a shift towards more localised studies’ and ‘far less reliance is placed on quantifiable and/or general patterns’ (Swann 2002: 59). So corpus linguistics largely became viable as a methodology at a point where this epistemological shift had already occurred, and its grounding in quantification has not made it attractive to social scientists. In the 1990s McEnery and Wilson (1996: 98) and Biber et al. (1998) both noted that the amount of corpus-based research in discourse analysis has been relatively small. Post-structuralists have developed close formulations between language, ideology and hegemony, using writers like Gramsci (1985) and Bakhtin (1984) as a springboard. And the move towards deconstructionism in the social sciences over the past twenty years or so has tended towards research into language and identities that could be particularly associated with people who are viewed as holding (or sympathetic towards) problematic, contested or powerless identities (for example, immigrants, women, people who are neuro-diverse, LGBTQ+, from non-white ethnic groups, religious minorities or who are living with a range of physical or mental health conditions). Such people are likely to be more aware of the oppression of such groups and therefore hold with forms of analysis that are associated with questioning the status quo – e.g. queer theory, feminist linguistics and critical discourse analysis – rather than reiterating and reinforcing a list of ways in which people speak, think or behave differently from each other. Burr (1995: 162) refers to this as action research, forms of research which have change and intervention rather than the discovery of ‘facts’ as their explicit aim. 
Corpus research then, with its initial emphasis on comparing differences through counting, and creating rather than deconstructing categories, could therefore be viewed as somewhat retrograde and incompatible with post-structuralist thinking. Indeed, one area that corpus linguistics has excelled in has been in
generating descriptive grammars of languages (e.g. Biber et al. 1999) based on naturally occurring language use, but focussing on language as an abstract system. Another reason why language and identity researchers have shied away from corpora is due to practical, rather than ideological considerations. Researchers have argued that discourse analysis is very labour intensive (e.g. Gill 1993: 91) and therefore ‘discourse analysis, as with many other varieties of qualitative research is usually more difficult than positivist number crunching’ (Parker and Burman 1993: 156). However, I would argue that a corpus linguistics approach can be perceived as equally time consuming. Large numbers of texts must first be collected, while their analysis often requires learning how to use computer programs to manipulate data. Statistical tests may be carried out in order to determine whether or not a finding is significant, necessitating the requisite mathematical know-how. Gaining access to corpora is not always easy – and large corpus building projects can be very time consuming and expensive, sometimes requiring the acquisition of research grants in order to be carried out successfully. No wonder then, that it is often simply less effort to collect a smaller sample of data which can be transcribed and analysed by hand, without the need to use computers or mathematical formulae. Despite these issues, in the last two decades, a field has emerged. In 2005 I published Public Discourses of Gay Men after becoming interested in how gay men had been represented in various texts such as newspaper articles, scripted comedy and political debates. Techniques like keywords and collocation were new to me and there was not a great deal of work to refer to in terms of how I could use them to examine discourses, so I tried to develop and apply techniques as I went along. 
As I worked on the book I realised that many of the methods I carried out could have a wider application so I took up the opportunity to write a more general book, based on the same kinds of analysis, but using a wider variety of topics and corpora. That book was the first edition of the one you are reading, Baker (2006). After it was published, I worked on a project to examine representations of refugees in the press. There were two teams on the project. The corpus linguists consisted of myself, Tony McEnery and Costas Gabrielatos while the critical discourse analysts were Ruth Wodak, Michal Krzyzanowski and Majid Khosravinik. We published a paper in 2008 (Baker et al.), called ‘A useful methodological synergy?’, where we presented a model which involved moving between corpus and CDA approaches, using each stage to form hypotheses. We referred to this as corpus-based critical discourse analysis.
However, the term Corpus Assisted Discourse Studies (CADS), which I mentioned earlier, has perhaps become more widely adopted. In 2010, Alan Partington organised a special issue in the journal Corpora based on CADS research which examined change over time, and with Alison Duguid and Charlotte Taylor he published a book on CADS in 2013. Partington also organised CADS conferences that were held in Camerino (2002), Bologna (2012) and Siena (2016). In 2018 I hosted the conference at Lancaster University. This was followed by an online conference organised by Charlotte Taylor at Sussex University in 2020 and a return to Italy for the conference in 2022. A journal (Journal of Corpora and Discourse Studies) commenced in 2018. In addition, collections of studies taking a (largely) CADS perspective were published in 2015 (Baker and McEnery), 2018 (Taylor and Marchi) and 2020 (Friginal and Hardy) – the latter work being edited by two American authors at universities in the state of Georgia. The research in the CADS journal and the more recent conferences has tended to be a mixture of those which take a critical, social-impact based perspective and those which function more as objective explorations of discourse. At the same time that CADS has become an accepted method of study, it has found itself in an increasingly crowded field, sitting alongside other approaches which have less obscure-sounding labels such as topic modelling (Blei and Lafferty 2007), opinion mining (Pang and Lee 2008), sentiment analysis (Tsolomon et al. 2012) and culturomics (Michel et al. 2010). Some of these kinds of approaches sometimes use computer software to assign tags or codes to words or phrases while others use machine learning techniques to identify structures or relationships between and within texts. They differ from CADS in that the analysis tends to be taken from a computational rather than a linguistic or discourse-centred perspective. 
The software does more of the analytical work and there is less focus on human researchers reading texts and decoding their context, as well as considering the wider social context that texts are created in. It is ironic in that I have sometimes applied the same kinds of criticisms to them that qualitative researchers have previously applied to CADS (e.g. the analysis is mostly descriptive, producing results that are obvious or even inaccurate due to mis-tagging and there is not enough attempt to consider context or consequences). These approaches certainly have worth, although they tend to be more useful for answering different sorts of questions compared to the ones that CADS researchers focus on. As I stated at the beginning of this section, criticisms of a corpus-based approach are useful in that they make us aware of limitations or potential
pitfalls to be avoided. However, having come this far, it seems fair to consider an alternative perspective – what can be gained from using corpora to analyse discourse?
Advantages of the CADS approach
Reducing researcher bias
While older, empirical views of research were concerned with the reduction of researcher bias in favour of empiricism and objectivity, newer, more postmodern forms of research have argued that the unbiased researcher is in itself a ‘discourse of science through which a particular version . . . of human life is constructed’ (Burr 1995: 160). Burr argues that objectivity is impossible as we all encounter the world from some perspective (the ‘objective’ stance is still a stance). Instead, researchers need to acknowledge their own involvement in their research and reflect on the role it plays in the results that are produced. Not all discourse analysts are inclined to take this view of objectivity. Blommaert (2005: 31–2) points out that: ‘The predominance of biased interpretation begs questions about representativeness, selectivity, partiality, prejudice, and voice (can analysts speak for the average consumer of texts?).’ It is difficult, if not impossible, to be truly objective, and acknowledging our own positions and biases should be a prerequisite for carrying out and reporting research. However, this perspective assumes a high degree of researcher self-awareness and agency. The term critical realism (Bhaskar 1989) is useful, in that it outlines an approach to social research which accepts that we perceive the world from a particular viewpoint, but the world acts back on us to constrain the ways that we can perceive it. So we need to be aware that our research is constructed, but we should not deconstruct it out of existence. Also, we may be biased on a subconscious level which can be difficult to acknowledge.
At other times, we may not want to acknowledge our position for various reasons (concerns, for example, that our findings may be played down because they were published by someone who holds a particular identity, or we may desire to protect or conceal some aspect of our own identity such as sexuality, gender or ethnicity for other reasons). A lot of academic discourse is written in an impersonal, formal style, so introducing some sort of personal statement may still seem jarring, particularly in some disciplines.
And ultimately, even if we convincingly reflect on our personal circumstances and their relationship to our research, we may still end up being biased in ways which have nothing to do with who we are but are more concerned with the ways that human beings process information. A famous study by psychologists Kahneman and Tversky (1973) showed that people (105 out of 152 to be exact) tend to think that in a typical sample of text in the English language the number of words that begin with the letter k is likely to be greater than the number of words that have k as the third letter. In reality, there are about twice as many words that have k as their third letter as there are words that begin with k. Yet we tend to over-estimate the cases of the first letter because we can recall such words more easily. We also tend to succumb to other cognitive biases. Mynat et al. (1977) showed that in a variety of settings, decision makers tend to notice more, assign more weight to, or actively seek out evidence which confirms their claims, while they tend to ignore evidence which might discount their claims (confirmation bias). Related to this is the hostile media effect (Vallone et al. 1985) which shows that ideological partisans tend to consistently view media coverage as being biased against their particular side of the issue (a phenomenon that perhaps we should attend to when carrying out action research). People also tend to focus more on information that they encounter at the beginning of an activity (the primacy effect). The presence of such cognitive biases can be particularly problematic when carrying out discourse analysis. For example, we may choose to focus on a newspaper article which ‘confirms’ our suspicions but ignore other articles which present a different perspective. There is nothing essentially ‘wrong’ about that, but it may mean that we need to be careful in terms of any generalisations we make beyond the article itself.
Additionally, we may only focus on aspects of a text which support our initial hypotheses, while disregarding those which present a more complex or contradictory picture. By using a corpus, we at least are able to place a number of restrictions on our cognitive biases. It becomes less easy to be selective about a single newspaper article when we are looking at hundreds or thousands of articles – hopefully, overall patterns and trends should show through. Of course, we cannot remove bias completely. Corpus researchers can theoretically be just as selective as anyone in choosing which aspects of their research to report or bury. And their interpretations of the data they find can also reveal bias. With corpus analysis, there are usually a lot of results, and sometimes, because of limitations placed on researchers (such as deadlines or word-length restrictions of journal articles), selectivity comes into play.
Although the aim of CADS was to avoid carrying out politically motivated analysis (Partington et al. 2013: 339), the wider adoption of the term CADS has resulted in some of the research under this label coming from a more critical perspective than others. I see the decision to take a critical approach or not as linear and shifting rather than an either/or decision, sometimes being contingent on the type of data being analysed as well as the goals of the researcher. But at least with a corpus, we are starting (hopefully) from a position whereby the data itself has not been selected in order to confirm existing conscious (or subconscious) biases. One tendency that I have found with corpus analysis is that there are usually exceptions to any rule or pattern. It is important to report these exceptions alongside the overall patterns or trends, but not to over-report them either.
The incremental effect of discourse
As well as helping to restrict bias, corpus linguistics is a useful way to approach discourse analysis because of the incremental effect of discourse. One of the most important ways that discourses are circulated and strengthened in society is via language use, and the task of discourse analysts is to uncover how language is employed, often in quite subtle ways, to reveal underlying discourses. By becoming more aware of how language is drawn on to construct discourses or various ways of looking at the world, we should be more resistant to attempts by writers of texts to manipulate us by suggesting to us what is ‘common-sense’ or ‘accepted wisdom’. As Fairclough (1989: 54) observes:
The hidden power of media discourse and the capacity of . . . power-holders to exercise this power depend on systematic tendencies in news reporting and other media activities. A single text on its own is quite insignificant: the effects of media power are cumulative, working through the repetition of particular ways of handling causality and agency, particular ways of positioning the reader, and so forth.
Journalists are able to influence their readers by producing their own discourses or helping to reshape existing ones. Such discourses are often shaped by citing the opinions of those in powerful and privileged positions. Becker (1972: xx) calls this the ‘hierarchy of credibility’ whereby powerful people will come to have their opinions accepted because they are understood to have access to more accurate information on particular topics than
everyone else. Hall et al. (1978: 58) say: ‘The result of this structured preference given in the media to the opinions of the powerful is that these “spokesmen” become what we call the primary definers of topics.’ So a single word, phrase or grammatical construction on its own may suggest the existence of a discourse. But other than relying on our intuition (and existing biases), it can sometimes be difficult to tell whether such a discourse is typical or not, particularly as we live in ‘a society saturated with literacy’ (Blommaert 2005: 108). By collecting numerous supporting examples of a discourse construction, we can start to see a cumulative effect. In terms of how this relates to language, Hoey (2005) refers to the concept of lexical priming in the following way: ‘Every word is primed for use in discourse as a result of the cumulative effects of an individual’s encounters with the word.’ As Stubbs (2001b: 215) concludes: ‘Repeated patterns show that evaluative meanings are not merely personal and idiosyncratic, but widely shared in a discourse community. A word, phrase or construction may trigger a cultural stereotype.’ Additionally, Blommaert (2005: 99) notes that a lot of human communication is not a matter of choice but is instead constrained by normativities which are determined by patterns of inequality. And this is where corpora are useful. An association between two words, occurring repetitively in naturally occurring language, is much better evidence for an underlying hegemonic discourse which is made explicit through the word pairing than a single case. 
For example, consider the sentence taken from the British magazine Outdoor Action: ‘Diana, herself a keen sailor despite being confined to a wheelchair for the last 45 years, hopes the boat will encourage more disabled people onto the water.’ We may argue here that although the general thrust of this sentence represents disabled people in a positive way, there are a couple of aspects of language use here which raise questions. These being the use of the phrase confined to a wheelchair, and the way that the co-ordinator despite prompts the reader to infer that disabled people are not normally expected to be keen sailors. There are certainly traces of different types of discourses within this sentence, but are they typical or unusual? Which discourse, if any, represents the more hegemonic variety? Consulting a large corpus of general British English (the British National Corpus 1994), we find that the words confined and wheelchair have fairly strong patterns of co-occurrence with each other. The phrase confined to a wheelchair occurs forty-five times in the corpus, although the more neutral term wheelchair user(s) occurs thirty-seven times. However, wheelchair bound occurs nine times. We also find quite a few cases of wheelchair
appearing in connection with co-ordinators like although and despite (e.g. despite being restricted to a wheelchair he retains his cheerfulness; despite confinement to a wheelchair, Rex Cunningham had evidently prospered; although confined to a wheelchair for most of her life, Violet was active in church life and helped out with a local Brownie pack). While this is not an overwhelmingly frequent pattern, there are enough cases to suggest that one discourse of wheelchair users constructs them as being deficient in a range of ways, and it is therefore of note when they manage to be cheerful, prosperous or active in church life! The original sentence about Diana the keen sailor certainly is not an isolated case but conforms to an existing set of expectations about people in wheelchairs. Thus, every time we read or hear a phrase like wheelchair bound or despite being in a wheelchair, our perceptions of wheelchair users are influenced in a certain way. At some stage, we may even reproduce such discourses ourselves, thereby contributing to the incremental effect without realising it.
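The kind of phrase-frequency check described above can be sketched in a few lines of Python. This is a minimal illustration only: the sentences are invented stand-ins rather than British National Corpus data, and the names toy_corpus, patterns and phrase_counts are hypothetical. A real study would run such searches over the full corpus and then read the matching lines in context.

```python
import re

# Toy stand-in for a corpus: invented sentences, NOT British National Corpus data.
toy_corpus = [
    "She has been confined to a wheelchair since the accident.",
    "Despite being confined to a wheelchair, he sails every week.",
    "The centre is fully accessible to wheelchair users.",
    "A wheelchair user complained about the broken lift.",
    "He was wheelchair bound for most of his life.",
]

# Hypothetical search patterns for the three phrasings discussed in the text.
patterns = {
    "confined to a wheelchair": r"\bconfined to a wheelchair\b",
    "wheelchair user(s)": r"\bwheelchair users?\b",
    "wheelchair bound": r"\bwheelchair[- ]bound\b",
}

def phrase_counts(texts, patterns):
    """Count case-insensitive matches of each pattern across all texts."""
    counts = {}
    for label, pattern in patterns.items():
        regex = re.compile(pattern, re.IGNORECASE)
        counts[label] = sum(len(regex.findall(text)) for text in texts)
    return counts

counts = phrase_counts(toy_corpus, patterns)
for label, n in counts.items():
    print(f"{label}: {n}")
```

Raw counts of this sort only indicate that a pattern exists. Whether it amounts to a discourse still depends on reading the matched lines, as the examples with despite and although show.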
Resistant and changing discourses
As well as being able to establish that repeated patterns of language use demonstrate evidence of particular hegemonic discourses or majority ‘common-sense’ ways of viewing the world, corpus data can also reveal the opposite – the presence of counter-examples which are much less likely to be uncovered via smaller-scale studies. And if a resistant discourse is found when looking at a single text, then we may mistake it for a hegemonic discourse. Discourses around a topic are not static. They continually shift position – a fact that can often be demonstrated via analysis of language change. There is little agreement among linguists about whether language reflects thought or shapes thought or whether the relationship constitutes an unending and unbroken cycle of influence. Whatever the direction of influence, charting changes in language is a useful way of showing how discourse positions in society are also in flux. What was a hegemonic discourse ten years ago may be viewed as a resistant or unacceptable discourse today. At the most basic level, this can be shown by looking at changing frequencies of word use in a diachronic (or historical) corpus, or by comparing more than one corpus containing texts from different time periods. For example, if we compare two equal sized corpora of British English3 containing written texts from the early 1960s and the early 2020s we see that in the 2020s corpus there are various types of words which occur
much more frequently than when they appeared in the earlier corpus, for example lexis which reflect the rise of social justice discourse: inequalities, racism, identity, vulnerable, diversity; and lexis which reflect ‘green’ discourse: climate, environmental, global, environment. In addition, we find that certain terms have become less frequent: girl and titles like Mr and Mrs were more popular in 1960s British English than they were in the 2020s, suggesting that sexist discourses or formal ways of addressing people have become less common. However, we could also compare the actual contexts that words are used in over different time periods as it may be the case that a word is no more or less frequent than it used to be, but its meanings have changed over time. For example, in the early 1960s corpus the word blind almost always appears in a literal sense, referring to people or animals who cannot see. The term blind is not significantly more frequent in the 2020s corpus, although in about half its occurrences we now find it being used in a range of more metaphorical (and negative) ways: turning a blind eye, blind to the levels of risk, blind optimism. We could say that blind has expanded semantically, to refer to cases where someone is ignorant, thoughtless or lacks the ability to think ahead. As Hunston (1999) argues, this non-literal meaning of blind could constitute a discourse prosody which influences attitudes to literal blindness (although it could also be argued that the separate meanings exist independently of each other). What the corpus data has shown, however, is that the negative metaphorical meaning of blind appears to have increased in written British English over time – it is not a conceptualisation which has always been as popular.4
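A comparison of word frequencies across two time periods can be sketched as follows. The text does not say which statistic was used to judge whether a difference is significant, so the log-likelihood measure below is an assumption chosen because it is widely used for corpus comparison. The counts and corpus sizes are invented for illustration, and the function names are hypothetical.

```python
import math

def per_million(freq, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return freq * 1_000_000 / corpus_size

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood score for one word's frequency in two corpora.

    Expected frequencies assume the word is spread across both corpora
    in proportion to their sizes.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score

# Invented counts for a single word in two equal-sized corpora.
size_1960s = size_2020s = 1_000_000
freq_1960s, freq_2020s = 20, 150

ll = log_likelihood(freq_1960s, size_1960s, freq_2020s, size_2020s)
print(per_million(freq_1960s, size_1960s), per_million(freq_2020s, size_2020s))
# 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.
print("significant at p < 0.05:", ll > 3.84)
```

A score like this only addresses frequency. A word such as blind, whose frequency is stable but whose uses have shifted, would need the concordance lines from each period to be read and categorised by hand.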
Triangulation
As described earlier in this chapter, the shift to post-structuralist methods of thought and research has served to de-emphasise the focus on more quantitative, empirical methods. However, another aspect of post-structuralism may actually warrant the inclusion of corpus-informed research. One of the main arguments of social constructionism is to question and ‘deconstruct’ binary arguments that have served as the basis of Western thinking for thousands of years, such as ‘nature or nurture’ (Derrida 1978, 1981). Such oppositions are typical of ideologies in that they create an inherent need to judge one side of the dichotomy as primary and the other as secondary, rather than thinking that neither can exist without the other.
Instead, Derrida recommends that we reject the logic of either/or of binary oppositions, in favour of a logic of both/and. The same could be said for the split between quantitative/qualitative or structuralism/post-structuralism. Indeed, post-structuralism favours a more eclectic approach to research, whereby different methodologies can be combined together, acting as reinforcers of each other. It is not the case that corpus linguists should view corpora as the only possible source of data; ‘Gone is the concept of the corpus as the sole explicandum of language use. Present instead is the concept of a balanced corpus being used to aid the investigation of a language’ (McEnery and Wilson 1996: 169). Tognini-Bonelli (2001) makes a distinction between corpus-based and corpus-driven investigations. The former uses a corpus as a source of examples, to check researcher intuition or to examine the frequency and/or plausibility of the language contained within a smaller data set. A corpus-driven analysis proceeds in a more inductive or naïve way – specific hypotheses are not produced in advance, the corpus is the data and techniques are used to derive (sometimes unforeseen) patterns which show regularities (and exceptions) in language. As with the distinction between critical and non-critical research, I view corpus-based and corpus-driven as existing at theoretical poles on a linear scale, with most research falling somewhere in the middle. As McNeil (1990: 22) points out, triangulation (a term coined by Newby 1977: 123) or using multiple methods of analysis (or forms of data) is now accepted by ‘most researchers’. Layder (1993: 128) argues that there are several advantages of triangulation: it facilitates validity checks of hypotheses, it anchors findings in more robust interpretations and explanations, and it allows researchers to respond flexibly to unforeseen problems and aspects of their research.
Even when discourse analysts do not want to have to go to the trouble of building a corpus from scratch, they could still gainfully use corpora as a reference, to back up or expand on their findings derived from smaller-scale analyses of single texts. For example, Sunderland (2004: 37–8) looked at a newspaper article which publicised a ‘fairytale’ venue for marriage ceremonies. She argued that the article focussed on the bride as the bearer of the (stereotypically) male gaze (due to phrases such as ‘its flying staircase down which the bride can make a breathtaking entrance’). An analysis of the words which bride tends to collocate (co-occur) with most often in a large corpus of naturally occurring language revealed terms to do with appearance like blushing, dress, wore, beautiful and looked. On the other hand, bridegroom and groom tended to collocate with mainly functional words (pronouns,
conjunctions, prepositions, etc.), suggesting that the constructions of brides in the article were ‘loaded’ in a way which did not apply to bridegrooms. So while the main focus of Sunderland’s analysis was a single news article, a general corpus proved to be useful in confirming suspicions that what she was seeing was, in fact, a hegemonic discourse. In such cases it only takes a couple of minutes to consult a reference corpus, showing such a corpus-based method to be an extremely productive means of triangulation.
With Jesse Egbert I have published two edited collections of papers based around triangulation and corpus linguistics. The first (Baker and Egbert 2016) considered a combination of different corpus approaches as constituting a form of triangulation. We gave the same corpus (a set of internet forum posts from a ‘question and answer’ website) and research questions to ten sets of researchers and asked them to use corpus methods to analyse the data in their own way. We then carried out a meta-analysis of the different reports, to identify the extent to which the multiple perspectives provided convergent, divergent or complementary findings. We found a picture that was mostly complementary with each researcher tending to make unique discoveries that others did not find, with a few areas of shared focus and very few findings that contradicted one another. Triangulating different corpus linguistics methods therefore offers the chance for a wider set of findings, and across the chapters in this book I offer different approaches that can be used separately or in combination as researchers see fit. The second edited collection (Egbert and Baker 2019) took a different approach to triangulation in that it involved nine unrelated pieces of research which each combined a piece of corpus analysis with a different linguistic discipline such as psycholinguistics or historical linguistics.
I contributed one of these chapters, combining a CADS-style analysis of a corpus of newspaper articles about obesity (using collocations) with a qualitative analysis of sets of samples of the articles (see also Baker and Levon 2015 for a similar kind of study). I found that the different approaches were slightly better at addressing certain types of research questions – the collocational approach was good at spotting representations of obese people but was not as good at identifying the reasons that journalists gave for causes of obesity. I concluded that an approach which combined corpus methods with close readings was optimally productive although also potentially more time consuming. At the least then, I would advise that CADS researchers spend time engaging with some of the texts in their corpus in a more traditional way. Partington (2010: 88) talks of integrating discourse studies with techniques and tools from corpus linguistics, a point noted by Mautner
(2019: 8) who considers how the two strands can be sensibly combined. She advises that there should be ‘constant oscillation between quantitative and qualitative viewpoints, moving back and forth between computer-based discovery procedures and traditional, human hermeneutics’ as opposed to a situation in which ‘two sets of apparently unrelated results are simply placed side by side, with links between them asserted rather than demonstrated’.
Some concerns
While in the last section I have tried to show how corpus linguistics can act as a useful method (or supplementary method) of carrying out discourse analysis, there are still a few concerns which need to be discussed before moving on.
First, a corpus analysis can produce a lot of results – long tables of words and accompanying concordance tables. One of the criticisms of the approach is that the focus on frequency can simply confirm what people already knew, although this can also be the result of another form of cognitive bias called the hindsight bias. My research on representation of Islam in the British press, for example, uncovered a general picture of negative media bias, although that meant people could respond with ‘so what? I knew that already’. Partington (2017) has talked about the importance of examining non-obvious meaning within CADS – an analysis should try to incorporate what is surprising, in other words. I have found that each CADS-based research project I have been involved in has tended to produce a mixture of obvious and non-obvious findings. It is therefore the task of the researcher to tease these apart and give enough focus to the surprises, without compromising the overall picture. Even with the obvious findings, it can be useful to provide statistical detail – we may expect that Muslims are described more often as extremists than moderates in the press, but to what extent does this happen exactly, which newspapers do it most and how has this changed over time? Even those who claim ‘I knew it already’ are unlikely to be able to produce all of the right answers to these questions. Additionally, in terms of social impact, statistics which reveal the exact extent of an ‘obvious’ finding are much more convincing than making a general claim.
Second, corpus data is usually only language data (written or transcribed spoken), and discourses are not confined to verbal communication.
By holding a door open for a woman, a man could be said to be performing a communicative act which could be discursively interpreted in numerous
ways – a discourse of ‘the gallant man’, of ‘male power imposing itself on women’ or a non-gendered discourse of ‘general politeness in society’ for example. In a similar way, discourses can be embedded within images – for example, in Chapter 4 I examine a corpus of tourist brochures. Although the written text reveals a great deal about discourses of tourism within, at the same time this should be viewed as working in relationship to the visual images, which give a very clear idea about the sorts of people who the brochures are aimed at and the sorts of activities they would be expected to engage in and enjoy while on their holiday. Caldas-Coulthard and van Leeuwen (2002) investigate the relationship between the visual representations of children’s toys (in terms of design, colour and movement) such as The Rock and Barbie and texts written about them, suggesting that in many cases discourses can be produced via interaction between verbal and visual texts. The fact that discourses are communicated through means other than words indicates that a corpus-based study is likely to be restricted – any discourses that are uncovered in a corpus are likely to be limited to the verbal domain. Some work has been carried out on creating and encoding corpora that contain visuals (e.g. Hollink et al. 2016), although at the moment there does not appear to be a standardised way of encoding images in corpora. In Chapter 8 I outline some work on analysing visuals in CADS research and add some thoughts on its potentialities. In addition to that, issues surrounding the social conditions of production and interpretation of texts are important in helping the researcher understand discourses surrounding them (Fairclough 1989: 25). Questions involving production such as who authored a text, under what circumstances, for what motives and for whom, in addition to questions surrounding the interpretation of a text: who bought, read, accessed, used the text, what were their responses, etc. 
cannot be simply answered by traditional corpus-based techniques, and therefore require knowledge and analysis of how a text exists within the context of society. One problem with a corpus is that it contains decontextualised examples of language. We may not know the ideologies of the text producers in a corpus. In a sense, this can be a methodological advantage, as Hunston (2002: 123) explains, ‘the researcher is encouraged to spell out the steps that lie between what is observed and the interpretation placed on those observations’. So we need to bear in mind that because corpus data does not interpret itself, it is up to the researcher to make sense of the patterns of language which are found within a corpus, postulating reasons for their existence or
looking for further evidence to support hypotheses. Our findings are interpretations, which is why we can only talk about reducing bias, not removing it completely. A potential problem with researcher interpretation is that it is open to contestation. Researchers may choose to interpret a corpus-based analysis of language in different ways, depending on their own positions. For example, returning to a study previously mentioned, Rayson et al. (1997) found that people from socially disadvantaged groups tend to use more non-standard language (ain’t, yeah) and taboo terms (fucking, bloody) than those from more advantaged groups. While the results themselves are not open to negotiation, the reasons behind them are, and we could form numerous hypotheses depending on our own biases and identities, e.g. poor standards of education or upbringing (lack of knowledge), little exposure to contexts where formal language is required or used (no need to use ‘correct’ language), rougher life circumstances (language reflecting real life), the terms helping to show identity and group membership (communities of practice), etc. Such hypotheses would require further (and different) forms of research in order to be explored in more detail. This suggests that corpus analysis shares much in common with forms of analysis thought to be qualitative, although at least with corpus analysis the researcher has to provide explanations for results and language patterns that have been discovered in a relatively objective manner. Also, a corpus-based analysis will naturally tend to place focus on patterns, with frequency playing no small part in what is reported and what is not. However, frequent patterns of language do not always necessarily imply underlying hegemonic discourses. Or rather, the ‘power’ of individual texts or speakers in a corpus may not be evenly distributed. 
A corpus which contains a single (unrepresentative) speech by the leader of a country or religious group, newspaper editor or CEO may have a greater influence than hundreds of similar texts which were produced by ‘ordinary people’. Similarly, we should not assume that every text in a corpus will originally have had the same size and type of audience. General corpora are often composed of data from numerous sources (newspaper, novels, letters, etc.), and it is likely to have been the case that public forms of media would have reached more people (and therefore possibly had a greater role to play in forming and furthering discourses) than transcripts of private conversations. We may be able to annotate texts in a corpus to take into account aspects of production and reception, such as author occupation/status or estimated readership, but this will not always be possible. In addition, frequent patterns of language (even when used by powerful text producers) do not always imply mainstream ways of thinking.
Sometimes what is not said or written is more important than what is there, revealing assumptions or understandings that are shared by society. For example, in university prospectus discourse we would expect to find a term like mature student occurring more often than a term like young student. However, we should not assume that there are more mature students than young students: the term student invisibly carries connotations of youth within it that do not need to be expanded upon, hence there is little need for a marked opposite equivalent of mature student (immature student?). Similarly, a hegemonic discourse can be at its most powerful when it does not even have to be invoked, because it is taken for granted. A sign of true power is in not having to refer to a state of affairs explicitly, because everybody is aware of it and the vast majority do not question it. Prior awareness or intuition about what is possible in language should help to alert us to such absences, and often comparisons with a larger normative corpus will reveal what they are. We also need to be aware that people tend to process information rather differently to computers. Therefore, a computer-based analysis will uncover hidden patterns of language. Our theory of language and discourse states that such patterns of language are made all the more powerful because we are not aware of them; therefore we are unconsciously influenced. However, it can be difficult to verify the unconscious. For example, in Chapter 5 I show how refugees are characterised as out-of-control water, with phrases like flood of refugees, overflowing camps, refugees streaming home, etc. being used to describe them. I (and other researchers) have interpreted this water metaphor as being somewhat negative and dehumanising. However, would we all interpret flood of refugees in the same way?
Hoey (2005: 14) points out that we all possess personal corpora with their own lexical primings which are ‘by definition irretrievable, unstudiable and unique’. If we were concerned about the ways that refugees are represented, then we may have already consciously noticed and remarked on this water-metaphor pattern. But what if English was not our first language? Would we be less or more likely to critically notice the metaphor? And if we were someone who did not approve of refugees, we may even interpret the word flood as being too ‘soft’ preferring a less subtle negative description. Also, did the person who wrote flood of refugees actually intend this term to be understood in a negative sense, or were they simply unthinkingly repeating what has now become a ‘naturalised’ (El Refaie 2002: 366) way of writing about refugees (as Baker and McEnery 2005 point out, even texts produced by the Office of the United Nations High Commissioner for Refugees, a body aimed at helping refugees, contain
phrases using the water metaphor)? As Partington (2003: 6) argues, ‘authors themselves are seldom fully aware of the meanings their texts convey’. Perhaps conscious intention is more crucial to the formation of discourses and reliance on subconscious repetition and acceptance is required for their maintenance (see also Hoey 2005: 178–88 for further discussion). And words do not have static meanings; they change over time. They also have different meanings and triggers for different people. Corpus analysis needs to take into account the fact that word meanings change and that they can have different connotations for different people. Indeed, one fruitful area for CADS is diachronic research which aims to track the ways that language and discourse change over time. Therefore, a corpus-based analysis of discourse is only one possible analysis out of many, and open to contestation. It is an analysis which focusses on norms and frequent patterns within language. However, there can be analyses that go against the norms of corpus data and, in particular, research which emphasises the interpretative repertoires (Gilbert and Mulkay 1984) that people hold in relationship to their language use can be useful at teasing out the complex associations they hold in connection to individual words and phrases. Corpus linguistics does not provide a single way of analysing data either. As the following chapters in this book show, there are numerous ways of making sense of linguistic patterns: collocations, keywords, frequency lists, clusters, etc. And within each of these corpus-based techniques the user needs to set boundaries. For example, at what point do we decide that a word in a corpus occurs enough times for it to be ‘significant’ and worth investigating? Or if we want to look for co-occurrences of sets of words (e.g. how often do flood and refugees occur near each other), how far apart are we going to allow these words to be?
Do we discount cases where the words appear six words apart? Or four words? Unfortunately, there are no simple answers to questions like this, and instead the results themselves (or external criteria such as word count restrictions on the length of journal articles) can dictate the cut-off points. For example, we may decide to only investigate the ten most frequently occurring lexical words in a given corpus in relation to how discourses are formed. However, while these words tell us something about the genre of the corpus, they may be less revealing of discourses. So we could expand our cut-off point to investigate the top twenty words. This is more helpful, but then we find that we have too much to say, or we are repeating ourselves by making the same argument, so we make a compromise, only discussing words which illustrate different points.
Again, these concerns should not preclude using corpus data to analyse discourse. But they may mean that other forms of analysis should be used in conjunction with corpus data, or that the researcher needs to take care when forming explanations about their results.
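The question raised above about how far apart two words may be before they stop counting as co-occurring can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the tokeniser, the function names and the two-sentence sample are invented for this purpose, sentence boundaries are ignored, and no particular corpus tool is being reproduced. It counts how often refugees falls within a given number of words either side of flood, so that the effect of widening the span can be seen directly.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())


def count_cooccurrences(tokens, node, collocate, span):
    """Count how often `collocate` appears within `span` tokens to the left or right of `node`.

    Sentence boundaries are ignored here, which is itself a choice the analyst has to make.
    """
    count = 0
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = max(0, i - span)
        # The window excludes the node word itself.
        window = tokens[left:i] + tokens[i + 1:i + 1 + span]
        count += window.count(collocate)
    return count


# Invented sample text, used only to show how the span changes the count.
sample = ("A flood of refugees crossed the border. "
          "The flood damaged the camp where many refugees lived.")
tokens = tokenize(sample)

for span in (2, 4, 6):
    print(span, count_cooccurrences(tokens, "flood", "refugees", span))
```

With a narrow span only the phrase flood of refugees is counted, whereas a wider span also picks up pairings across the sentence boundary that a human reader would probably not treat as related. This is one reason why the choice of span needs to be reported and justified rather than left implicit.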
Structure of the book
This book has two main goals: to introduce researchers to the different sorts of analytical techniques that can be used with corpus-based discourse analysis, and to show how they can be put into practice on different types of data. I have usually found that people understand things better when they are given real life examples, rather than discussing ideas at an abstract level, so I have included a range of different case studies in the following chapters in the book. Chapter 2 is a wholly new chapter which covers a range of issues to be considered before we begin a CADS-based research project. It addresses areas such as choosing a topic or corpus analysis tool, reflecting on analyst position and forming research questions. Chapter 3 looks at issues to do with data collection and corpus building, in order to address questions such as how large a corpus should be and the best ways to collect and annotate data. Chapter 4 uses a small corpus of holiday leaflets produced by the tour operator Club 18–30 in order to examine how some of the more basic corpus-based procedures can be carried out on data and their relevance to discourse analysis. It includes looking at how frequency lists can be used in order to provide researchers with a focus for their analyses and how measures such as the type/token ratio help to give an account of the complexity of a text. It also shows how the creation of concordance plots of lexical items can reveal something about the nature of discourses over the course of a particular text. Chapter 5 investigates the construction of discourses of refugees in the British National Corpus, a large reference corpus of general British English, and is concerned with methods of presenting and interpreting concordance data. It covers different ways of sorting and examining concordances as well as introducing the concepts of semantic preference and discourse prosodies.
Chapter 6 is also a completely new chapter, using a corpus of propaganda materials in order to examine how collocates can reveal representations and strategies which are aimed at persuading readers to carry out violence. This chapter explores various ways of calculating collocation and the pros and
cons associated with each. It shows how collocational networks can reveal strong associations between central concepts in a text. Chapter 7 examines different discourse positions within a series of debates on fox hunting which took place in the British House of Commons. In order to achieve this, we look at the concept of keywords, lexical items which occur statistically more frequently in one text or set of texts when compared with another (often a larger ‘benchmark’ corpus). However, this chapter expands the notion of keywords to consider key phrases (e.g. multiword units) and key semantic or grammatical categories – which necessitates prior annotation of a text or corpus. Chapter 8 updates an older chapter from the first edition which focussed on using a corpus approach to examine features beyond the lexical level, by looking at patterns of nominalization, attribution, modality and metaphor. This chapter considers more advanced techniques that can be used in CADS research, including triangulation, comparisons of multiple corpora, working with different languages and analysis of speech, visuals and social media texts. Finally, Chapter 9 concludes the book and re-addresses some of the concerns that have been first raised in this chapter. Before moving on to look at the different techniques that can be used in order to carry out corpus-based discourse analysis, we first need an idea and a corpus. Chapters 2 and 3 therefore explore issues connected to developing a research project, and building a corpus, respectively.
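The definition of keywords given above, lexical items that are statistically more frequent in one set of texts than in another, can be illustrated with a short sketch. The text does not say at this point which statistic is used, so the choice of the log-likelihood score below is an assumption (it is one measure often used for keyness), and the function names and the two toy word lists are invented for illustration only.

```python
import math
from collections import Counter


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood score for a single word.

    freq_a / freq_b are the word's frequencies in corpus A and B;
    size_a / size_b are the total token counts of each corpus.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    # A zero frequency contributes nothing (0 * ln 0 is treated as 0).
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def keywords(target_tokens, reference_tokens, top_n=5):
    """Rank words that are relatively more frequent in the target than in the reference."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    size_t, size_r = len(target_tokens), len(reference_tokens)
    scored = []
    for word, freq in target.items():
        ref_freq = reference[word]
        # Keep only 'positive' keywords: higher relative frequency in the target.
        if freq / size_t > ref_freq / size_r:
            scored.append((word, log_likelihood(freq, size_t, ref_freq, size_r)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy data; real keyword analysis needs far larger corpora.
target = "the hunt must be banned the hunt is cruel".split()
reference = "the weather is mild and the train is late and the shop is shut".split()

for word, score in keywords(target, reference):
    print(word, round(score, 2))
```

On data this small the scores mean very little; the point is only to show the shape of the comparison, in which a word's observed frequency in each corpus is set against the frequency expected if it were spread evenly across both.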
Further reading
Baker, P. (2012), ‘Acceptable bias?: Using corpus linguistics methods with critical discourse analysis’, Critical Discourse Studies 9(3): 247–56. This paper considers the extent to which a corpus approach reduces analyst bias.
Hyland, K. and Paltridge, B. (eds) (2013), The Bloomsbury Companion to Discourse Analysis, London: Bloomsbury. This edited collection covers a range of approaches to discourse analysis including CDA, multi-modal discourse analysis, narrative analysis and ethnography.
McEnery, T., Xiao, R. and Tono, Y. (2006), Corpus-based Language Studies: An Advanced Resource Book, London: Routledge. This book functions as a primer in corpus linguistics, introducing key concepts and studies.
Partington, A., Duguid, A. and Taylor, C. (2013), Patterns and Meanings in Discourse: Theory and Practice in Corpus-Assisted Discourse Studies (CADS), Amsterdam: John Benjamins. A detailed series of case studies which outline the CADS approach.
Questions for students
1 The term corpus linguistics can be quite obscure to people who are not familiar with it. Could you come up with another term which is more self-explanatory?
2 Similarly, identify a few of the different meanings associated with the term discourse and think of alternative (and clearer) terms for them.
3 Think about the idea of the incremental effect of discourse and consider five ways that people could encounter the same discourse over the course of a day.
4 Try to identify a few discourses that are so mainstream and accepted that they are rarely encountered in language use.
2 The First Stages
Introduction
An important stage in a piece of CADS research involves the groundwork, the process of designing the research that occurs before any data has been collected. It is not necessary to make fixed decisions at this point as doing so can hinder the project if things do not go according to plan. But it is worth spending some time thinking about the steps outlined in this chapter as they will help to give the project a shape and hopefully save time in the long run. Some of the sections, such as choosing a topic or carrying out background reading, are likely to occur at an early stage in the project. Others, such as engaging with others, will occur at other points, even towards the end. However, it is useful to bear them in mind – to have a sense, even if it is vague, of where the project might be headed. Some of the advice in this chapter is generalisable to other forms of research in the social sciences but I have tried to highlight cases that are particularly pertinent to practitioners of CADS research and the kinds of issues they are likely to face.
Choosing a topic
Within CADS, most topics involve analysis of corpora composed of real-life data and are socially oriented in some way. One popular area of CADS research involves studies of representation – often of a particular social group or identity trait. Such a study might involve examining representation within the group by its own members or it could look at how non-members represent the group or its representation more generally in the media. Often CADS research can be motivated by a sense that problematic representations
exist, although this does not always have to be the case. Sometimes it is already known that a group are going to be negatively represented, as others have pointed this out, and with certain types of corpora it is likely that whatever group we look at, we will find negative representations – particularly if we are examining corpora containing news media which often reports on ‘bad news’. It is still worth interrogating these texts though, as CADS approaches can show the extent of negative or positive representation as well as indicating the range of ways that such representations are indexed through language, some of which may be subtle. Another popular choice is to consider groups who are disadvantaged in some way – often because aspects of their identity make it harder for them to challenge negative representations. Such groups may constitute a minority within a population, they may be subject to discriminating laws or they may not be able to mount challenges due to issues around economic inequality, health, stigma or age. Such research can often have a lot in common with critical discourse studies in that it considers abuses of power and critically asks the question ‘who benefits’ from the current situation. This is a view of power which separates social groups into those who are powerful and those who are not. For example, Brindle (2018) examined language use taken from a white supremacist web forum, focussing on posters’ representations of gay people. The concept of intersectionality (Crenshaw 1989) is relevant, as we can acknowledge that people belong to multiple social groups and that these intersections of identities can compound further inequalities, resulting in a complex hierarchy of power. For example, Candelas de la Ossa (2019) examined texts to support women survivors of domestic abuse, finding that they construct multiple marginalised survivors as distal by using third-person pronouns to represent them as exceptional.
A related view of power, taken from a post-structuralist perspective, is that it is more like a web than a ladder, and that even relatively powerless people can experience ways of being powerful (and vice versa), albeit in fleeting or limited contexts (see Baxter 2003). Such a perspective might warrant research on less explored misuses of power – such as sexual harassment, objectification or domestic violence that is carried out against men. CADS research does not have to focus on representations of social groups but can instead involve a wider consideration of particular text types, perhaps those that are produced by a certain group, for example, by looking at an internet forum for people who have a particular health condition. Hunt and Brookes (2020) examined how members of a community of people with
eating disorders used language to express aspects of their condition. While analysis here may partly focus on self-representations, it might also consider how members of the group characterise the health condition, their own relationship to it and how they represent relevant social actors such as health practitioners or family members. Other CADS research might consider how language is used in texts to represent aspects of a socially relevant issue or concept, such as the environment (e.g. Bevitori and Johnson 2017), religion (Dayrell et al. 2020), populism (Roch 2020) or immigration (Salahshour 2016). A project which is focussed on inequality, power abuses or manipulation of large numbers of people might warrant collecting texts that are created by particularly powerful social actors from the domains of business, government, law and religion. Discourses in such texts are also likely to be filtered into the popular consciousness via the media, which is one reason why analysis of news has been such a popular area in CADS. For example, Lukin’s (2019) analysis of the words war and violence in news texts and reference corpora showed how war was justified by being invested with a moral dimension, so it was seen as purposeful and reasoned. Similarly, Taylor (2010) has examined use of the word science while Marchi (2010) looked at a set of words relating to the concept of morality. Such research does not have to begin with a hypothesis or aim to identify a social problem but may instead be more prospective. Many kinds of texts involve language which is persuasive in some way, and another aspect of CADS can relate to how authors try to put their perspectives across in a convincing way. These can include commercial texts such as business reports, adverts or infomercials, review websites, political texts such as party broadcasts or manifestos, parliamentary debates or speeches, health-related information and travel brochures – all of which are ripe for a CADS approach.
There are a number of motivations for choosing a particular topic. Some people are lucky enough to have the freedom to choose a topic that they find personally interesting – always an advantage as it means you are more likely to stick with the research and do it justice. Having a topic given to you (e.g. by a supervisor, external partner or employer) might initially not be as exciting but you can take advantage of the fact that you are not bringing too many of your own biases to the table, and often, as you engage with a topic in detail, personal enthusiasm for it will grow. A topic ought to not just be of interest to the researcher though – it should ideally be relevant to others, both academics and non-academics who are affected to some extent by the
topic. The term timely is often used in relation to social research, indicating that a topic is currently viewed as important and there are groups of people in existence who will be able to use your research in some way. A balance needs to be considered between tackling a trending topic that lots of people are talking about and something that is going to stand the test of time. In the early 2010s I received numerous PhD proposals from students who wanted to analyse news coverage of the Arab Spring. Then in the late 2010s, I received another set of proposals from students who wanted to study Donald Trump’s tweets. If a topic is suddenly very newsworthy, consider whether it will be as newsworthy in five years and whether you will be the only person who decides to carry out research on it. It might be worth choosing a subject that attracts less attention but is still socially relevant and not going to be replaced with another ‘hot topic’ in a couple of years’ time. A less obvious option could be to consider a topic which is of historical interest such as Gupta (2017), who examined how campaigners for women’s suffrage were represented in The Times newspaper during the period 1908 to 1914, or McEnery and Baker’s analyses of ‘the poor’ (2017) and prostitution (2018) in seventeenth-century England. Such research can be useful in helping us to understand how contemporary understandings of such concepts have changed (or not). Similarly, in Baker et al. (2013) which examined representations of Muslims in the British press between 1998 and 2009, we included a chapter which looked at a range of texts that had been published between 1425 and 1979 as a way of tracing which kinds of representations had died out or emerged more recently. You might also consider the type of data that your topic will require you to collect. One of the most commonly created types of CADS corpora consists of news articles.
Large amounts of this kind of text can be collected reasonably quickly from online news databases and it can be argued that such articles play an important role in influencing public perceptions about a range of topics. Methodologically, many of the issues surrounding this text type have already been considered – which is perhaps helpful if you are a beginning researcher and could use a model to follow. However, less well-studied types of data are likely to bring with them different kinds of methodological issues which can enhance the originality of your research. As an example, one of my PhD students (Evans 2020) examined a corpus consisting of responses from medical staff to patients who had left feedback online. An unexpected discovery from this research was that a significant amount of the corpus consisted of ‘stock responses’, where staff had used the same message to reply to multiple patients (sometimes inappropriately).
The amount of repeated text in the corpus had consequences for analysis of frequency lists and resulted in the student having to develop a number of new approaches, considering questions such as: In what contexts do staff use stock responses vs uniquely crafted ones and how does language use differ between the two? Even if the topic you choose is a popular one, an original angle can be argued for it if the data you collect is of a type that others do not readily have access to. News articles and tweets are reasonably easy for most researchers to get hold of but if you can build a corpus containing data that others have found difficult to access then this immediately gives you an advantage. For example, in Chapter 6 I work with a corpus of texts created by violent jihadists that were given to me as a result of collaboration carried out with people working in anti-terrorism units. Of course, feasibility is also an important consideration – if the texts are very difficult to get hold of, would result in insurmountable ethical problems or would take too much time, money and effort to convert to electronic format, then an idea might need to be adapted or shelved. Corpus building is covered in the following chapter and at this point we will move on to consider background reading.
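The problem of repeated ‘stock responses’ described above suggests a simple check that can be run on a corpus before frequency lists are interpreted. The sketch below is not the approach taken in the study mentioned; it is a minimal illustration under stated assumptions, in which the function names, the normalisation rules and the four sample replies are all invented. It flags texts that recur once case, punctuation and spacing are ignored, and reports what share of the corpus they make up.

```python
import re
from collections import Counter


def normalise(text):
    """Lowercase, strip punctuation and collapse whitespace so trivially different copies match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def find_stock_responses(texts, min_copies=2):
    """Return each normalised text that occurs at least `min_copies` times, with its count."""
    counts = Counter(normalise(t) for t in texts)
    return {text: n for text, n in counts.items() if n >= min_copies}


def repeated_share(texts, min_copies=2):
    """Proportion of texts in the corpus that are copies of a repeated response."""
    if not texts:
        return 0.0
    stock = find_stock_responses(texts, min_copies)
    return sum(stock.values()) / len(texts)


# Invented replies, standing in for a corpus of staff responses to feedback.
replies = [
    "Thank you for your feedback. We will pass it on to the team.",
    "thank you for your feedback, we will pass it on to the team",
    "I am sorry to hear about the delay you experienced on Tuesday.",
    "Thank you for your feedback. We will pass it on to the team.",
]

stock = find_stock_responses(replies)
print(stock)
print(repeated_share(replies))
```

Exact matching after normalisation will miss responses that have been lightly edited (for example, with a different patient name inserted), so a check like this gives a lower bound on repetition rather than a full picture. Whether repeated texts should then be removed, kept or analysed separately depends on the research question.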
Background reading
One question I am often asked is the extent to which it is necessary to carry out background reading around a topic before engaging in CADS research. A perceived strength of CADS is that human biases are limited by using our corpus with specialist software. In reality, different kinds of human bias are still going to be at play through much of the analysis. As mentioned in the previous chapter, corpus analyses tend to produce a lot of results so researchers usually need to be selective, finding a way to tell a story about the data which is grounded in rigorous analysis but is not overwhelming, repetitive or boring, and does not simply tell us what we already knew. This is a delicate balancing act, and one concern could be that if we engage in advance reading, we could be introducing new biases into the project, meaning that we will be on the lookout for what others have already found. So an argument could be made for the ‘naïve’ researcher approach, where ignorance about a topic is potentially beneficial. However, all things considered, I think there are more gains to be had by carrying out advance reading while simultaneously reflecting on how this
might impact on the way that you subsequently conduct the research. Conduct the reading critically, try to find multiple perspectives, note which perspectives you find most convincing and ask yourself why this is the case, and pay attention to those which feel outside your comfort zone. As you carry out the analysis, try to be aware of the ways that your reading may impact on the paths you follow or the information you assume to be true or common-sensical. Make a particular effort to deviate from those paths – to explore avenues that nobody else has gone down. Or if you do follow certain paths, keep an open mind and look for counter examples rather than those that confirm what someone else has already said. I think this will lead to a more sophisticated form of analysis as opposed to one where you know little from the outset or where you simply allow the reading to tell you what to look for. In any case, I think it can be difficult for us to truly approach most topics from a completely naïve perspective. One exercise which might be worth carrying out at the outset (before you engage with background reading) would be to make some notes about what you think you already know about the topic, in terms of facts, your attitudes or feelings towards the topic and your personal experience (no matter how trivial or vague) of it. Such a document can be interesting to return to at different points as the project progresses. So getting a sense of related work that others have done on a topic, both using CADS and other methods, is a good idea. It will help to avoid duplication of effort as well as allowing you to identify gaps – which are often helpfully outlined in the literature review and conclusion sections of journal articles or books (look out for phrases like ‘areas for future research’).
Conference websites, proceedings or books of abstracts can help to identify research that is not yet published but perhaps likely to appear in a couple of years’ time. Background reading should also be carried out on approaches within CADS, enabling you to gain an impression of the range of popular and new techniques and methodological approaches. However, it is also a good idea to conduct background reading on studies that have used a similar type of corpus to yours, even if the topic is quite different. So if you intend to build a corpus of film scripts to consider how male and female characters are represented differently, you might want to read around other corpus studies that have used film scripts, and extend the remit to also take into account similar types of data such as television scripts, plays or even computer games which can also involve scripts.
The First Stages
Additionally, carry out reading around the topic itself. CADS is highly interdisciplinary and it is possible that whatever subject you are interested in, it will have been studied from non-corpus perspectives, perhaps involving a qualitative analysis of a small number of texts in detail, a content analysis which will involve identification of themes and might present them quantitatively, or studies which involve interviews or surveys. It is important to take into account these previous findings, inasmuch as they reveal what is already known so you are not re-inventing the wheel, just with more data. A corpus study which confirms what others have already found is of value in terms of indicating that findings are perhaps more generalisable than had otherwise been demonstrated, but ideally such a study should also be telling the reader something new, something that other approaches had missed. However, we can only know what was missed if we have an awareness of what is already there. While the bulk of the reading can be carried out at the start of the project, bear in mind that there are also likely to be periods when it is useful to go back to the literature – for example, to research a particular aspect of the analysis that has come to light. Additionally, for projects that last for more than a year or so, an additional sweep of the literature towards the end is useful in terms of identifying very recently published work that is worth taking into account.
Reflecting on the analyst position
Reflexivity should be a central part of CADS research, at every stage. Since the emergence of post-structuralist approaches to research, analysts have been encouraged to reflect on aspects of their own identity, position or relationship to what or who is being researched, asking how they may have impacted on the research process and also how the research has impacted on them. Analysts coming from within corpus linguistics, computational or more quantitative fields may not have as much experience of this kind of reflexivity, and indeed it might be thought that the corpus linguistics approach does away with the need to do so because the data are all collected with due consideration given to sampling and representativeness and then computer software offsets the analyst’s own biases. This is not exactly the case however, and a CADS-style analysis still involves subjective decision making and analyst interpretation. In addition, the availability of large amounts of electronically available data from a wide
range of sources around the world can mean that analysts are predisposed to being disengaged with the contents of their corpus in a way that is perhaps not helpful. There is a dilemma here though – someone who researches a topic which they already know a great deal about could be accused of not having enough distance from it to carry out a properly objective analysis. They may already know what questions to ask and even have a good idea what the answers will be before they start, so they could end up producing a kind of polemic, legitimated with lots of tables of numbers. There is value in trying to look for data that do not provide the answers you hope to find. This will enable you to produce a more nuanced analysis, which is actually more likely to be convincing. It is also worth considering who is warranted to carry out research on a topic. I remember talking to a researcher at a conference several years ago. I had recently published some corpus research looking at representations of Muslims; she was looking at a similar area. I told her of my concern that as someone who was not a Muslim, my findings would be seen as coming from an outsider perspective and not taken seriously. However, she had a similar concern – that her status as a Muslim would mean she would be seen as too biased and not objective enough. I wondered who would be qualified to carry out such research in that case and realised that there will always be someone who can question your validity to engage with a particular topic. There is value in having researchers from different backgrounds, or with different perspectives, working on similar topics or corpora as a form of triangulation, and this could occur either singly or as part of a team in which people with different identities are represented.
It is worth reflecting on your own position as a researcher in relation to the topic you are studying as well as considering your own research skills and expertise. Within academic research we are often encouraged to background our identities. We write in passive sentences to elide ourselves (‘A study was carried out’), use nominalisations (‘The decision to . . .’) which hide human agency, and are told to avoid personal pronouns so that our writing sounds suitably objective, neutral and scientific. Corpus linguistics discourse research sits somewhere between scientific, social and post-structuralist approaches and those in the field may adopt different writing styles to one another depending on their earlier influences. Some researchers may feel that their research risks becoming overly confessional or self-indulgent if they write about themselves, and there is a risk that unhelpful generalisations about identity groups can be made, e.g. ‘As a woman . . .’ In addition, an
academic style which constantly uses the pronoun I can be distracting. With that said, there are places where opportunities for self-reflection present themselves (particularly in the introduction and conclusion sections) and these may be places where writers might want to explain their motivation for choosing a certain topic, the extent to which their identities gave them an insider perspective, helping them to make sense of certain findings or an outsider perspective, allowing objective distance. They may also wish to discuss how they addressed potential pitfalls in terms of their relationship to their topic and how their own views changed as a result of engaging with their research. A pattern I have noted on several occasions is how students have initially set out to be completely objective with a piece of research but as a result of engaging with their corpus they found this stance difficult to maintain. This is not really a problem, but it is good for readers to get a sense of the researcher’s changing perspective if there is space to do so.
Working as part of a team
Carrying out a CADS type of study can be initially off-putting as it requires a range of different kinds of expertise. Researchers need to have knowledge relating to statistics (so they can choose which kinds of statistical tests to use and how to interpret the output from software) and text processing (how to convert texts to an appropriate format for use with corpus software). A debate currently exists within the wider field of corpus linguistics about the extent to which corpus analysts need skills in computer programming languages. On the one hand is the view that the off-the-shelf tools provide a limited set of techniques and statistical tests which may not be adequate for the research you want to carry out. Being able to program in a language like Python or using a powerful software environment for statistics like R will enable researchers to carry out more sophisticated and original analyses. On the other hand, it could be argued that such desiderata would exclude large numbers of people who find programming to be difficult or it may result in poorly conceived or even inaccurate pieces of research. In addition, the ideal CADS researcher needs knowledge about different pieces of corpus analysis software and methods, knowledge about how to carry out linguistic analysis, knowledge about different frameworks that have been used to carry out discourse analysis (including critical discourse analysis) and the ability to frame their research within various relevant social contexts. They also need to be skilled at writing (both for an academic
audience and for a popular audience), presenting their research at conferences and in the media, publicising their work on social media and engaging with stakeholders and users. For example, a study which looks at journalistic practices around a particular group who have a health condition might try to reach out to groups which represent people with that health condition as well as trying to persuade journalists to reconsider any problematic ways that they write about such people. So the ideal CADS researcher is a coding wizard, a maths nerd, a linguistic expert, a rigorous analyst, a careful reader, a skilled writer, a confident self-publicist, an engaging speaker and a social butterfly. A single person is unlikely to hold an abundance of all of these qualities, so realistically we need to understand our strengths and weaknesses and resolve to either work on those areas where we might be lacking experience or motivation or consider bringing other people onboard. Working as part of a team can be useful in terms of allowing researchers to combine their skills without anybody feeling over-stretched. Aphorisms like ‘many hands make light work’ and ‘two heads are better than one’ come to mind. However, we might also think of sayings like ‘too many captains and not enough crew’ or ‘a camel is a horse designed by committee’, so it is important to pick your team carefully. Teams tend to work as well as their weakest member and sometimes the person who is most enthusiastic at the start may be the person who contributes the least. A good team will not duplicate too many skill-sets; rather, every member ideally should be able to do something well that the others cannot. Some people thrive when working in teams while others prefer to work alone. Again, being able to appraise your own working style will be useful in helping you decide whether being part of a team will be of personal benefit.
A team does not have to involve many people – I have tended to find it most productive to work with one other person who has a similar personality and approach to work but can complement my abilities. It can be useful to trial-run a team with a smaller piece of research first, so that potential issues are made apparent before people have committed to something that will be difficult to back out of.
Forming research questions
It is not essential to begin a piece of CADS research with a set of specific research questions. The prospective nature of CADS allows us to start with
a vague question along the lines of ‘What is distinctive in this corpus?’ As the analysis progresses, the answer to this question is likely to result in the emergence of additional research questions, making the project more focussed. However, background reading and personal knowledge of the topic under investigation might bring certain questions to mind from the outset and this approach is perfectly fine as long as we do not set out with the aim of trying to prove a point or to only concentrate on finding what we are looking for. As many socially-oriented pieces of CADS research might be motivated by issues of power or inequality it is likely that problematic examples of language use will be found, especially in a large corpus. However, such cases may only represent a small part of a larger picture so it is important to get them into perspective by considering what else is present (or not there). Research questions can change or be refined as part of the analytical process then. In terms of wording the question, having one which can be answered by a simple yes or no is generally less preferable than a question which allows for more discussion. ‘Is newspaper x biased against group y?’ comes across as requiring an over-simplified response. Instead, a question like ‘To what extent, if any, is newspaper x biased against group y and in what ways?’ would enable the researcher to engage with the question in more detail. A question involving comparison can often be helpful, enabling an analyst to focus their research. One way to discover the extent of a phenomenon in a corpus is to consider how often it occurs in relation to something else. This can be done within a corpus, particularly if we have two or more subjects or topics that we wish to focus on. 
A typical example could be to compare how related types of identities (men vs women, Muslims vs Christians) are represented both separately and in relation to one another in the same corpus or in equivalent corpora which each contain mention of one of the groups or topics under examination. Additionally, if we consider our corpus to be the target of our analysis (the target corpus), we might want to examine a second corpus which could act as a benchmark or reference (the reference corpus) for what is typical or expected. There is no single standard reference corpus so we would need to establish the ways in which the one we chose or created actually does act as a reference. For example, if we are examining a corpus of newspaper articles about a topic such as mental health we may want to compare it against a general reference corpus containing a wide range of text types, using the keywords technique (see Chapter 7). A comparison between the two would
identify different kinds of language use in the news corpus – language around mental health but also language pertaining to journalism. But if we compared our news corpus against a different reference corpus, containing just newspaper articles on a wide range of topics, a keywords comparison is less likely to yield aspects of language relating to journalism as this kind of language occurs in both, so the frequencies will cancel one another out. Each reference corpus will provide a different kind of contrast with our mental health corpus. Our research questions thus need to take into account what our comparisons are likely to produce. Another form of comparative research question could involve subdividing the target corpus in various ways, for example by time period, text type or author. This can be worth trying, particularly if the analyst has a hunch that comparing the subdivisions will be worthwhile or if researcher familiarity with the data lends itself to a split. A real-world event such as the passage of a law, the election of a new leader or a large-scale disaster might suggest a good point to split a corpus in two based on time period. A corpus of news articles might be split based on writing style (popular vs quality or tabloid vs broadsheet), political perspective (left vs right) or ownership (government vs individuals). Another way of thinking about identities in CADS is not to consider how two or more identity groups are represented but to compare how those identity groups actually use language, for example by dividing a corpus into male and female speech. In Chapter 7 I make a division based on whether politicians argued for or against the introduction of a law within a series of debates. Sometimes, splitting a corpus into sections can result in some of the sections being too small for meaningful results to be obtained, or it can mean that there are too many sections for the analysis to be done in any sort of depth. 
Comparing subsections may not always reveal anything interesting to report so different criteria might need to be employed to facilitate a comparison, or it may simply be the case that the corpus should not be divided at all. Comparative research questions can tend to stress differences between two or more corpora or sections of corpora, e.g. ‘How does corpus A differ from corpus B?’ Difference is understandably an interesting aspect of analysis, and particularly as CADS makes use of frequency information one way of deciding what to focus on would be to prioritise reporting results based on the largest or most significant differences. Corpus techniques like keywords are based around the concept of difference so the software we use to carry out CADS is likely to put us in a ‘difference mindset’. However, the
amount of difference between two corpora might be relatively small and if we devote too much attention to it we risk misleading the reader or overlooking other aspects of the overall picture. Similarity might be more difficult for analysts to identify or explain but it is worth considering the extent to which it contributes to the analysis, so it is worth flagging up both difference and similarity in comparative research questions (see Taylor 2018). Likewise, it is easier to focus on what is present in a corpus, as opposed to what is absent, particularly things that could or should have been there. One way of identifying absence is via comparison with other corpora, reference or otherwise (see Duguid and Partington 2018). Within Critical Discourse Studies there exists a wealth of tools for carrying out qualitative linguistic analyses on individual texts. Sometimes this can involve lists of features which are known to position the author and reader in particular ways, to obscure or highlight certain pieces of information, to emphasise or to cast doubt. As such, it can be appropriate to form research questions which are based around the use of particular linguistic features such as grammatical agency, metaphor, euphemism, personalisation, modality, quotation, rewording, use of questions, interrogatives or nominalisation. The identification of such features can help us answer not only what is present in the data but how a particular stance or discourse is achieved or legitimated through language. So while we might have one type of research question like ‘How is group x represented in corpus y?’ a related question would be ‘How is language used to make the representations of group x appear reasonable, normal or right?’ The prospective, corpus-driven methods for identifying frequent or salient patterns in a corpus have the potential to allow different linguistic features to emerge as the analysis progresses. 
For example, a keywords analysis might reveal that the target corpus contains a higher than expected use of first- and second-person pronouns, which would indicate that authorial style and relationship building between author and reader is likely to be one avenue of investigation worth following. However, a keyword analysis might not be as useful in helping us to identify more complex linguistic features and in such cases it is worth noting additional research questions which specifically carry out enquiries of a corpus based on the possible relevance of such features. The questions considered so far might be employed in order to address a wider question like ‘What are the discourses on topic x in corpus y?’ As discourses are largely realised through communication, and appear via linguistic traces in texts, the focus on language is required in order to identify
the presence of discourses. Techniques from CADS should also be able to make claims about the frequencies of different kinds of discourses in a corpus, although it is difficult to make exact claims as it is unlikely that every instance of a discourse being realised through language can be identified through corpus means. Instead, the techniques outlined in the following chapters should help to reveal the most salient and frequent ones, as well as providing the opportunity to uncover some of the rarer discourses.
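The earlier point about reference corpora, that the same target corpus yields different keywords depending on what it is compared against, can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: it uses the log-likelihood statistic, which is one commonly used keyness measure and not necessarily the one applied in Chapter 7, and the function names and the tiny token lists are invented for the example.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness score for one word.

    a, b: frequency of the word in the target and reference corpus
    c, d: total number of tokens in the target and reference corpus
    """
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_target)
    if b > 0:
        score += b * math.log(b / expected_reference)
    return 2 * score


def keywords(target_tokens, reference_tokens, top_n=10):
    """Rank words that are relatively more frequent in the target corpus."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # keep only positive keywords
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Invented toy data: a 'news about mental health' target corpus compared
# against a general reference corpus and against a news reference corpus.
target = "depression said depression said the".split()
general_reference = "the the cat sat the".split()
news_reference = "said the said election said".split()

print(keywords(target, general_reference))
print(keywords(target, news_reference))
```

In this toy case the reporting verb said counts as key against the general reference corpus but not against the news reference corpus, because it is at least as frequent there. A real study would also need a minimum frequency threshold and a significance cut-off, which are left out here.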
Stages of analysis
There is no single way of carrying out a CADS analysis although some steps are likely to come earlier than others, some may only occur once and some may be repeated at different points as part of a process of analytical refinement. Baker et al. (2008) offered a cyclical model of analysis which involved moving back and forth between corpus-based and more traditional qualitative techniques of analysis. The stages I set out below are based upon my experiences with working on subsequent CADS-based projects – so they perhaps reflect more accurately how I usually carry out this kind of research, although bear in mind that this model is not set in stone and other researchers are likely to incorporate different stages or carry them out in different orders, producing equally valid and interesting analyses.
Stage 1. Groundwork
What are my initial research questions? What should go in my corpus?
Choosing a topic, noting what you know about it and carrying out background reading are good places to start. After that, setting initial research questions and building a corpus are likely to come next. I have put these two aspects as co-existing because they are likely to impact on one another. We can only set research questions if we know what is in our corpus but setting research questions can also help us to decide what to put in our corpus. So we are likely to cycle back and forth between these two stages as we identify what we would like to look at, whether the data exists and can be converted into a corpus format and the feasibility of dividing the corpus into different sections. There are numerous stages to building a corpus and as that forms the basis of the following chapter I will not say too much about them here.
Stage 2. Description
What aspects of language are frequent or distinctive in the corpus?
The next stage involves choosing a corpus tool or set of tools (discussed in the following section) and conducting some preliminary analyses. These may be based on corpus-driven methods like keywords or frequency lists or they could involve more specific, targeted searches based on features we suspect are going to be interesting or sets of words that are central to the analysis. For example, in our analysis of representations of Muslims and Islam in the news (Baker et al. 2013) we did not need to carry out any sort of frequency-based techniques to know that the words Muslim and Islam were going to play a central part in our early analysis. However, an analysis of collocates of these words proved to be useful in helping us to decide what to focus on in more detail at later stages of the analysis. So earlier stages of analysis would involve both identifying frequent or salient aspects in the corpus as a whole and focussing on what is known to be relevant.
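To make the idea of a targeted collocate search concrete, the following Python sketch counts the words that occur within a fixed span either side of a node word. It is a minimal illustration, not the procedure used in the study cited above: the function name, the span of one token and the example sentence are all assumptions made for the example, and raw co-occurrence counts are used where a real analysis would normally apply an association measure.

```python
from collections import Counter


def collocates(tokens, node, span=5):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


# Invented example sentence, already lower-cased and tokenised.
tokens = "the muslim community said the muslim world".split()
for word, freq in collocates(tokens, "muslim", span=1).most_common():
    print(word, freq)
```

Counting the left and right windows separately would be a small extension, which is useful when the position of a collocate relative to the node word matters.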
Stage 3. Comparison
How is language use different or similar in different parts of the corpus?
A third (optional) stage could be to carry out comparisons (and look for similarities) by splitting the corpus into sections as described earlier. For our Islam study we examined differences between newspapers and change over time (the corpus consisted of articles published between 1998 and 2009), considering both how our set of focus words (like Muslim and Islam) were used in different parts of the corpus but also querying what else made the parts of the corpus distinctive from one another. During these exploratory stages of analysis there is likely to be a process of refinement, as unexpected aspects of language present themselves or unexplained differences emerge. There are three types of emerging findings that can help to guide your focus: what is surprising, what is interesting and what is problematic. Obviously, defining what constitutes these three areas is subjective (and also likely to involve a degree of overlap). Talking about your early results with others might prove useful in terms of helping you to decide what to focus on. For the Islam project we identified a number of aspects of representation that we felt were worth looking at closely. These included the ways that Muslims were presented as a collective identity and how this collective was indicated as being distinct from other social groups, the ways that different types of Muslims were represented according to the extent or strength of their belief
and how these were evaluated, and the ways that male and female Muslims were specifically represented. Finally, we considered a particular type of negative representation (Muslims as receiving government benefits) which although not hugely frequent, was salient and associated with some of the most strongly negative aspects of language in the corpus. As a result of looking at these four areas (collectivisation, strength of belief, gender and benefits) we returned to some of our earlier forms of analysis which had identified potentially interesting distinctions between sections of the corpus, change over time and differences between newspapers. Combining different aspects of analysis allowed us to identify more detailed patterns – for example, helping us to show how representations of Muslims receiving government benefits had initially been the province of right-leaning tabloid newspapers but over the time period we examined, these stories gradually started to also appear in right-leaning broadsheet newspapers, albeit to a lesser extent.
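When sections of a corpus differ in size, comparisons of this kind are usually made on normalised rather than raw frequencies. The Python sketch below shows one simple way of doing this, expressing the frequency of a focus word per million tokens in each section. The function name, the section labels and the token lists are hypothetical and are not taken from the study described above.

```python
def per_million(term, sections):
    """Map each section label to the frequency of `term` per million tokens."""
    rates = {}
    for label, tokens in sections.items():
        if not tokens:
            rates[label] = 0.0  # avoid dividing by zero for an empty section
            continue
        rates[label] = tokens.count(term) * 1_000_000 / len(tokens)
    return rates


# Hypothetical sections: a corpus split by newspaper type.
sections = {
    "tabloid": "muslim benefits story muslim".split(),
    "broadsheet": "muslim policy report on the economy today again".split(),
}
for label, rate in per_million("muslim", sections).items():
    print(label, round(rate))
```

The same function works for a split by time period or author, since it only needs a label and a token list for each section. With very small sections the normalised figures become unstable, which is one practical form of the problem noted earlier about sections being too small for meaningful results.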
Stage 4. Interpretation
What is the function of the language patterns identified?
Having identified different patterns in the corpus, a later stage of analysis can involve attempting to interpret what has been found. This stage normally involves careful concordance analyses of selected words, phrases or linguistic patterns (see Chapter 5) and may require some sort of functional classification scheme to be devised by the analyst in order to identify typical (and rare) uses of language. For example, with the Islam project, we had identified strong right-hand collocates of Muslim like community, world and country. These collocates appeared to be used as ways of collectivising Muslims, but to examine how they were used functionally we had to carry out concordance analyses of them. So a random sample of 100 concordance lines of Muslim community identified 78 as referring to British Muslims, 17 referring to local communities in a particular town in the UK while 5 referred to a global Muslim community. We also found two clear discourse prosodies surrounding the term: one which constructed the Muslim community as having a propensity to take offence (signified by terms like antagonise, offensive, upset, uproar, resentment and anger), the other which represented the Muslim community as separate from the rest of Britain (involving terms like non-assimilation, driving a wedge, too little understanding and simmering conflict).
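The sampling step described above can be sketched in Python as follows: build concordance lines for a phrase, draw a reproducible random sample, and then tally the categories that the analyst has assigned by hand. This is an illustrative sketch only. The function names, the context width, the fixed random seed and the category labels are assumptions, and the classification itself remains a manual, interpretative task that code does not replace.

```python
import random
from collections import Counter


def concordance(tokens, phrase, width=5):
    """Return (left context, match, right context) for each hit of `phrase`."""
    n = len(phrase)
    lines = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, " ".join(phrase), right))
    return lines


def sample_lines(lines, k=100, seed=42):
    """Draw up to k lines at random; a fixed seed makes the sample repeatable."""
    if len(lines) <= k:
        return list(lines)
    return random.Random(seed).sample(lines, k)


def tally(labels):
    """Map each manually assigned label to (count, proportion)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


tokens = "the muslim community in britain and the muslim community abroad".split()
for left, match, right in sample_lines(concordance(tokens, ["muslim", "community"], width=2)):
    print(f"{left:>10} | {match} | {right}")
```

Applied to the counts reported above (78, 17 and 5 out of 100 lines), `tally` would give proportions of 0.78, 0.17 and 0.05. Recording the seed alongside the results allows someone else to draw the same sample from the same corpus.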
Stage 5. Explanation
Why do these patterns exist in the corpus?
This stage can sometimes involve putting the corpus tools aside for a time and turning to other sources. For example, in our study on the representation of Islam, we found an interesting pattern regarding the use of the word Moslem which was chiefly used by just two newspapers in the corpus and peaked in the year 2001. One of these newspapers had abandoned its use in 2003, the other stopped using it in 2004. Our corpus analysis did not explain why this had happened, nor did it directly explain why some articles used Moslem instead of Muslim. It was only by carrying out additional research that we were able to find out that in 2002 the Muslim Council of Britain had requested that these two newspapers stop using the Moslem spelling as its pronunciation is similar to the Arabic word for ‘oppressor’. In addition, we tried to explain other findings by consulting a range of sources: readership demographics for different newspapers, consideration of complaints about articles about Islam made to the UK’s press regulatory system, government policy around Islam, attitude polls, crime statistics and findings from political inquiries into integration of different communities in the UK. At this stage of the analysis it might be worth identifying whether the findings can be explained by an over-arching theoretical framework or critique, such as hegemonic masculinity, Marxist theory, critical race theory or gender performativity. I would advise that such theories ought to be applied after the corpus analysis rather than attempting to shoehorn the analysis into an existing theory. A strong commitment to proving a particular theory before engaging with analysis might result in some aspects of the data being overlooked or misinterpreted. Leech’s principle of total accountability (1992: 112), or the idea that all data gathered must be accounted for, is important to bear in mind.
Otherwise, our analysis can look as if we have cherry-picked examples to prove the points we wanted to make. If only some of the data is congruent with a particular theory, we should make this clear, providing the appropriate frequency information and caveats about generalisability.
Stage 6. Reformulation
What additional questions can we ask of the corpus?
A further stage of a CADS analysis can involve forming and investigating hypotheses within the corpus itself, using non-corpus-based means. For
example, with our Islam study we identified that some of the most negative representations in our news corpus were found in letters to newspapers from readers. This led us to hypothesise that such representations did not come out of thin air but that readers were inspired to produce them as a result of reading earlier editions of the newspaper which printed similar but less strongly worded versions of the representation. In order to investigate whether this was the case we examined a series of letters printed in one newspaper which complained about Muslims wanting to ban piggybanks. This led us to investigate that newspaper’s output for the week preceding these letters, resulting in the discovery of a similar story about piggybanks. However, this story indicated that High Street banks, not Muslims, wanted to ban piggybanks and even quoted a Muslim member of Parliament who said that Muslims would not be offended by them. It was interesting then, that the subsequent letters misinterpreted the story, representing Muslims as taking offence.
Stage 7. Critique Is there a problem and how can it be resolved? The above stages of analysis can occur in a linear fashion but they can also run alongside each other or be cyclical, with analysts moving back and forth between different stages, from wide to narrow, from prospective to targeted, from the corpus itself to consideration of wider context, or from the whole corpus to smaller segments within it. However, there is a further, optional stage, which depends on the extent to which we wish to critically engage with our research. This stage goes beyond a traditional CADS analysis to explicitly consider who benefits from the uses of language that have been explored in the corpus, whether or not this is fair or problematic and if not, what could be done, if anything, to rectify matters. For our research on Islam in the press, we identified numerous problematic uses of language, for example a very high proportion of stories involving Muslims and conflict, repeated representations of Muslims as different from and hostile to non-Muslims, as holding extreme beliefs and unfairly receiving preferential treatment. Rather than advising that specific words or phrases should be banned, we instead argued that particular discourses or representations should be avoided, such as the dehumanising representation of Muslim women who wear veils as bats, Daleks or zombies. Subsequently, we shared our findings with a non-governmental organisation, made presentations at a number of cross-parliamentary events, guided the
organisation on how to conduct their own analyses of media texts, and spoke at mosques where we advised audience members on how to identify and challenge instances of negative bias in the press. A critical analysis may not just focus on problematic cases but also note uses of language that are empowering, promote equality and inclusivity or are likely to result in preservation or growth as opposed to damage and misery. Some analysts may feel that this last stage is not needed and that it is more appropriate to let the findings speak for themselves so that readers can be left to draw their own conclusions. Other analysts may want to present their own personal evaluation while reflecting on their response to the findings and discuss how that may have impacted the research process. I believe that both approaches are valid and nobody should feel obliged to state their position or try to impose their own sense of what is right or wrong if they feel that this is simply not appropriate. I pick up this thread again a little later in the chapter when I discuss engaging with others.
Choosing a corpus analysis tool When I wrote the first edition of this book, a handful of different corpus analysis tools existed, most of which were centred around concordancing. Among these, I carried out most of my analysis with Mike Scott’s WordSmith, one of the earliest corpus analysis tools and the one which incorporated the notion of keywords, a technique which has proven to be popular in CADS research. At the time of writing, Scott has released version 8.0 of WordSmith while alongside it there are numerous other corpus analysis tools available, each of which has its own functionalities while over-lapping to various degrees. Software development continues apace so new tools or updated versions of existing ones are likely to have already appeared by the time this book has been published. Therefore, readers are advised to carry out their own searches to identify what else has been made available in recent years, taking into consideration price, platform, features, output, search options and relative ease of use. A good place to start is https://corpus-analysis.com/ which currently lists over 250 corpus tools, of which the majority are free to use. A single user licence for WordSmith 8.0 is currently £50 (UK) although versions 4.0 and 5.0 can be downloaded for free. A similar tool, AntConc, created by Laurence Anthony, is also free and offers a more streamlined interface although it has a slightly more limited set of functions compared to
WordSmith. A third piece of software, #LancsBox, created by Vaclav Brezina, is also free and provides tools for visualising corpora and linguistic patterns, such as collocational networks (discussed in Chapter 6). All three tools allow users to upload their own corpora, although their speed may depend on your computer’s processing capacity and the size of the corpora you are working with. Different tools can offer a range of ways of carrying out techniques like concordances, collocations and keywords, sometimes offering distinct tests (e.g. the log-ratio test or the BIC test) and even those which offer the same tests can have slightly different ways of counting words or performing the calculation so the same test carried out on the same corpus but using different tools can sometimes produce inconsistent results. WordSmith’s default settings do not distinguish numbers as words, instead classifying every number with the code #. AntConc does not count the enclitic marker ’ as part of a word so will split a word like he’s into two ‘words’ he and s. It is usually possible to go into the settings of the tool and change these if you would like words to be counted differently from the default (this can be particularly useful when working with data from Twitter as AntConc uses @ and # as wildcards so they will not be counted as appearing in hashtags or Twitter handles). What is most important though, is to be clear about which tool (and which version of it) you used in your own research and whether you altered the default settings. There is no single best tool (or best technique or statistical test for that matter) and it is worth experimenting with different approaches to obtain an impression of the kinds of results they produce. For researchers who are new to corpus linguistics, I usually recommend trying a simple, free tool like AntConc first although it is good to have a breadth of knowledge of the affordances of a range of tools. 
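The effect of such counting conventions can be sketched in a few lines of Python. The example below is only an illustration of the general point: the function `tokenise` and its two switches are hypothetical, and they do not reproduce the actual algorithms or settings of WordSmith or AntConc. One setting splits words at apostrophes, the other keeps them whole and collapses every number into the code #.

```python
import re
from collections import Counter

TEXT = "He's 21 and she's 30, but he's not going in 2023."

def tokenise(text, split_apostrophe, collapse_numbers):
    """Return lower-cased tokens under two switchable conventions.

    Both switches are illustrative assumptions, not the settings of any real tool.
    """
    text = text.lower().replace("’", "'")
    if split_apostrophe:
        # An apostrophe ends a token, so "he's" becomes "he" and "s".
        pattern = r"[a-z0-9]+"
    else:
        # An apostrophe inside a word is kept, so "he's" stays one token.
        pattern = r"[a-z0-9]+(?:'[a-z0-9]+)*"
    tokens = re.findall(pattern, text)
    if collapse_numbers:
        # Replace every purely numeric token with the single code "#".
        tokens = ["#" if t.isdigit() else t for t in tokens]
    return tokens

setting_a = tokenise(TEXT, split_apostrophe=True, collapse_numbers=False)
setting_b = tokenise(TEXT, split_apostrophe=False, collapse_numbers=True)

# The same sentence yields different token totals and different top items.
print(len(setting_a), Counter(setting_a).most_common(3))
print(len(setting_b), Counter(setting_b).most_common(3))
```

Because the two settings disagree on how many tokens the sentence contains, any statistic built on those counts (relative frequencies, collocation scores, keyness) can differ too, which is why reporting the tool, its version and any changed defaults matters.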
Your preference for using a PC or Mac is likely to have some impact on which tool you use as not all tools are designed to work with both types (I have tended to find that sometimes Mac users struggle a little more to get some pieces of software to work effectively). As well as tools which have to be downloaded as executable programs onto your computer, there is another set which requires no downloads but is available via website interfaces online. Such sites include CQPweb, Sketch Engine, Wmatrix and English-Corpora.org. These sites normally require users to register for an account using their email address, sometimes requiring a fee to be paid. They often come with a set of bespoke (reference) corpora that can be much larger than those that can be handled by the downloadable tools. For example, the English-Corpora.org website contains
iWeb, a fourteen billion word corpus of webpages, while Sketch Engine has the English Web 2020 (enTenTen20) Corpus containing thirty-eight billion words. Some of the online interfaces allow users to upload their own corpora, and for analysts who are working with millions of words of data, this is likely to be an attractive option. Some tools carry out automatic part-of-speech annotation of corpora (assigning codes to words like singular common noun or superlative adjective), including those uploaded by the user. Sketch Engine, Wmatrix and #LancsBox all currently do this, while CQPweb and English-Corpora.org also work with annotated versions of corpora. Wmatrix also offers semantic tagging, assigning codes based on semantic categories (e.g. red will be tagged with the code O.4.3 – Colour and colour patterns). The tagsets used with each tool are likely to be slightly different, so comparisons between systems may also produce inconsistent results. Table 2.1 indicates some of the features of the tools mentioned in this section. To demonstrate a range of the current tools in use, I use WordSmith in Chapter 4, CQPweb in Chapter 5, Sketch Engine and #LancsBox in Chapter 6, AntConc and Wmatrix in Chapter 7 and WordSmith and AntConc in Chapter 8.

Table 2.1 Features of popular corpus analysis tools

Tool | Price | Tagged versions of corpora | Online or standalone | Special features
--- | --- | --- | --- | ---
WordSmith | £50 (versions 4.0 and 5.0 are free) | No | Standalone | Text editing utilities
AntConc | Free | No | Standalone | Concordance sorts by frequency
#LancsBox | Free | Yes | Standalone | Collocational networks
CQPweb | Free | Yes | Online | Lockwords
Wmatrix | Free | Yes | Online | Semantic annotation
Sketch Engine | Free for 30 days trial, free to some EU institutions, 99.98 € per year + VAT for academics | Yes | Online | Word Sketches, Trends Analysis
English-Corpora.org | Free | Yes | Online | Searches via pronunciation
Engaging with others It is useful to consider engagement as consisting of two related areas: dissemination and impact. The former relates to getting people to read or hear about the research while the latter involves the capacity for the research to result in an actual change. Dissemination is generally easier to carry out and is likely to act as a preliminary stage to impact. While engagement often occurs at the end of a project, it is worth considering how it can be implemented at other stages, starting with the research design. Thinking about the kinds of people who could benefit from the research and reaching out to them can be useful in terms of obtaining different perspectives, new sources of corpus data or helping to refine research questions. Most large-scale projects make use of a steering group, usually consisting of a mixture of academics and non-academics, and while this is not usually necessary for student dissertations and theses, it can still be useful to consider the benefit of contacting charities and non-governmental organisations or other relevant institutions such as schools and medical practices. If none are forthcoming at the start, it is worth keeping an eye open for new contacts as the project progresses. Dissemination, carried out early enough, might result in engagement opportunities from interested parties who may be able to contribute new questions or insights to the remainder of the research project. As noted earlier in this chapter, an optional stage of the analysis can sometimes involve identifying problematic or disempowering aspects of language use as well as noting cases of good practice or even making recommendations aimed at improving language use. I think analysts would need to take care not to be naïvely idealistic here. A document of guidelines to journalists or policy makers is unlikely to result in any sort of immediate uptake or change, even if the researcher already has well-established connections or networks. 
Social change is often slow and incremental, and powerful social actors tend to make changes in the face of a concerted and overwhelming response to actions that are widely seen as no longer acceptable. Sharing findings on social media is one way that dissemination can be carried out, and it is worth considering how non-academics can be engaged, for example via social media, blog posts, podcasts, media articles or interviews, exhibitions, events or workshops, short films, animations or comic strips. The challenge here is in translating what can at times be a rather dry procedural analysis involving citing statistics or technical terminology into a series of digestible points that can be connected together to form a
compelling narrative. Academics are trained to write in a style which unfortunately does not make for light reading or include those who have not been trained in the right jargon (see Billig 2013 for a critique of the writing style of social scientists). It is ironic then, that all our hard work might only reach a relatively small number of other academics who are perhaps likely to already agree with us. Unlearning how to write as an academic is one of the more challenging aspects of CADS research and is a skill that should not be under-estimated. In writing reports for non-academics, I usually aim to have a one-page executive summary at the start, and I provide small boxes throughout the report which present the most important findings in digestible bites. I try to avoid technical language and acronyms as much as possible, as well as leaving out most of the methodological detail (which can be implemented as footnotes if needed). Visuals like bar or pie charts can also be helpful, although it is important not to try to tell a multi-layered or overly complex story with them. For a non-academic audience a good visual ought to reveal one or two points which should only take a few seconds to work out. One area that I would not compromise on, however, is in the use of eye-catching titles which over-state or over-simplify a claim or use other techniques associated with ‘clickbait’. So I would avoid statements like ‘British journalists hate Muslims’ or ‘You’ll never believe what these newspapers said about Muslims’ which are misleading or employ an inappropriate tone. Working with a non-academic partner can also bring challenges. Your research questions may not be the ones that the partner wants to ask, and generally I have found that partners are more interested in the content of a corpus as opposed to how language is used to achieve something (e.g. ‘What are the attitudes?’ as opposed to ‘How do authors manipulate people or legitimate their stance?’). 
Some non-partner organisations can have complicated hierarchical structures and may require numerous people to read a report and sign it off before it can be made public. If the partner is providing the corpus data, this may hold up opportunities for academic publishing, particularly if there is anything in the report which has the potential to be controversial. At other times, people in partner organisations may be working to very tight deadlines or have unrealistic expectations about what is possible from a corpus analysis. They may also not be aware of the numerous duties of the average academic. And there can sometimes be a high turnover within other organisations so the person who initially made contact with you may leave after a few months to take on another role, meaning you will need to re-establish the relationship with the organisation
through a new person. None of these issues are insurmountable, but in terms of expectation management, they are worth bearing in mind. A project’s life can sometimes be extended as a result of engagement with others. One consequence of this can involve work that aims to replicate or follow up the study using new data. For our study on the representation of Islam in the press, a few years later we collected a second, equivalent corpus, consisting of newspaper articles from the period 2010–2014. The analysis of this second data set was comparative, based upon investigating aspects of the first corpus to identify patterns of change or stability. The motivation for updating the study was the result of interaction with a media monitoring and advocacy group called MEND (Muslim Engagement and Development). MEND set us additional questions, based on us examining terms we had not spent a lot of time on, such as Islamophobia. While the end of a research project may be marked by the publication of a journal article or book, or the completion of a thesis, in fact many of my own pieces of CADS research have continued as a result of groups finding out about the work, wanting to know more or asking to be shown how to carry out similar research themselves.
Conclusion The topics in this chapter were inspired by my own experiences with corpus-assisted discourse analysis over the last couple of decades, along with insights obtained from supervising and examining research students and discussions that I have had while teaching at the annual Corpus Linguistics Summer School which takes place at Lancaster University. The issues brought up in this chapter are likely to crop up at various times over the course of a research project so I do not advocate that they are only considered during the first few weeks and that this is where firm solutions are identified and then stubbornly adhered to. There should always be room to make adjustments, to identify new questions or directions or to abandon some aspect of a project if it is not working out. For those who are new to the field, it is a good idea to allow yourself more time than you think you need, and not to become discouraged if progress appears to be slow at first. Each new project requires time for researchers to familiarise themselves with the corpus, to identify an engaging and answerable set of research questions and to figure out how to answer them. There is often a learning curve, whereby the first part of the analysis can take a long time to get right, but once it is done, everything else
seems to fall into place. With that said, I have often found that I make incremental improvements on my method as I continue to work with a corpus, to the extent that when I am finished, I often need to go back to the start and redo the earliest work. When a research project is written up it normally has a tidy appearance where decisions and actions appear to naturally follow one another and all of the false starts and dead ends are absent. In other words, the experience of carrying out research can be much messier (and more frustrating) than the stories that we tell one another about it. Do not be disheartened then, if your experience of a piece of corpus research does not resemble the neater narratives you have read about. There is usually a much larger amount of analysis that takes place, compared to what actually appears in a published account of a piece of CADS research, and there can be a resulting tension between wanting to show transparency and needing to provide an accessible and engaging report. For example, a word list containing the top 500 words in a corpus is unlikely to be helpful to readers – as a rule of thumb, if a table is going to be larger than a page, it should be reduced or put in an appendix as a last resort. A PhD thesis, which is longer in scope, might enable you to take the scenic route regarding how a particular method was reached but shorter pieces of work will need to be more direct. Having considered some of the issues that go into planning a piece of corpus-based discourse analysis, the following chapter considers the corpus itself.
Further reading
Baker, P., Gabrielatos, C., Khosravinik, M., Krzyzanowski, M., McEnery, T. and Wodak, R. (2008), ‘A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press’, Discourse and Society 19(3): 273–306. This journal article outlines an iterative method of cycling between qualitative and quantitative approaches in order to carry out discourse analysis.
Taylor, C. and Marchi, A. (eds) (2018), Corpus Approaches to Discourse: A Critical Review, London: Routledge. This edited collection contains chapters from a range of authors giving a reflexive account of methodological issues in CADS, considering overlooked areas or ‘dusty corners’, triangulation ‘identifying blind spots’ and research design ‘avoiding pitfalls’.
Questions for students
1 Consider the following research questions. Can they be improved and if so, how?
(a) Do journalists write unfair negative things about group X?
(b) How do newspapers A, B, C, D, E, F and G represent groups X, Y and Z?
(c) How many times are metaphors used negatively about group X in newspapers?
2 Come up with two CADS research topics. With the first one you should have some prior knowledge or experience of the subject under study, with the second one you will have no knowledge or experience of the subject. Make lists of the pros and cons of researching each one, and think about ways to counteract the cons.
3 Choose two pieces of corpus analysis software (they do not have to be the ones mentioned in this chapter). Make lists of the different features available to each and note what is easy or difficult about using each one. Give each one a score out of 10 based on how useful it would be for a CADS research project.
4 Read a piece of CADS research (ideally the length of a book chapter or a journal article). Create an engaging summary of the study for a non-academic audience. You might want to use visuals to indicate the study’s findings. Afterwards, make notes on what sort of things you left out and how your use of language differed from that of the original study.
3 Corpus Building Introduction As discussed in Chapter 1, one of the potential problems with using corpora (as our only source of data) in the analysis of discourse is that we are dealing with decontextualised data. A corpus is a self-contained entity – issues to do with production and reception can be difficult to ascertain from looking at a frequency list (see Chapter 4) or a concordance (see Chapter 5). In addition, relationships between different texts in a corpus or even between sentences in the same file may be obscured when performing quantitative analyses. A possible step towards countering some of these concerns would be for researchers to familiarise themselves with the corpus. Both Hardt-Mautner (1995a: 55) and Partington (2003: 259) suggest that some form of prior interaction with the texts in a corpus, e.g. reading transcripts or listening to spoken files, will ensure that the discourse analyst does not commence from the position of tabula rasa. One means of familiarisation would be to actually build a corpus from scratch. The process of finding and selecting texts, obtaining permissions, transferring to electronic format, checking and annotating files will result in the researcher gaining a much better ‘feel’ for the data and its idiosyncrasies. This process may also provide the researcher with initial hypotheses as certain patterns are noticed – and such hypotheses could form the basis for the first stages of corpus research. There may be other, more pressing reasons to build a corpus yourself, as the type of corpus you want to examine simply does not exist. This chapter therefore covers issues relating to choosing or building a corpus to carry out discourse analysis on. In terms of corpus building, I address questions to do with corpus size and representativeness, as well as considering some concerns related to different text types (written, spoken or computer-mediated). I also look at issues pertaining to permissions, annotation and
validation. Finally, I consider the case of reference corpora which ideally ought not to be built afresh if an existing one can be used. Therefore, I address a number of considerations connected to choosing a pre-existing reference corpus, and how they are obtained and exploited.
Some types of corpora Before thinking about how one would go about building a corpus, it is useful to know a little about the ways that different corpora can be categorised into types. While the term corpus merely refers to a body of electronically encoded text, it is not the case that a corpus consists of any collection of texts, picked at random. Instead, researchers have produced a range of recognisably different types of corpora, depending on the sorts of research goals that they have had in mind.1 The first, and perhaps most important type of corpus (in terms of discourse analysis) is called a specialised corpus. This would be used in order to study aspects of a particular variety or genre of language. So, for example, we might just be interested in the language of newspapers, or the language used in academic essays, or in spoken conversations between men and women. It would make sense then to only collect texts that conform to these specialised criteria. We may also place restrictions on the corpus in regard to time or place. An example of a specialised corpus would be the Michigan Corpus of Academic Spoken English, the texts in this corpus consisting of transcripts of spoken language recorded in academic institutions across America. We could specialise even further than this, for example, by only choosing texts which refer to a specific topic. For example, Johnson et al. (2003) built a corpus of British newspaper texts that contained references to the concept of political correctness. The criteria for inclusion in this corpus was that the article had to contain a phrase like politically correct, PC or political incorrectness. It may be useful at this point to make a distinction between corpora and text archives or databases. An archive is generally defined as being similar to a corpus, although with some significant differences. 
Leech (1991: 11) suggests that ‘the difference between an archive and a corpus must be that the latter is designed for a particular “representative” function’. An archive or database, on the other hand, is simply ‘a text repository, often huge and opportunistically collected, and normally not structured’ (Kennedy 1998: 4).
Corpora therefore tend towards having a more balanced, carefully thought-out collection of texts that are representative of a language variety or genre. Archives or databases may contain all of the published work of a single author, or all of the editions of a newspaper from a given year. More care is therefore taken when selecting texts to go in a corpus, although even so, there is usually a degree of opportunism or compromise in terms of what the builder would like to include and what is available. The hard theoretical distinction between corpus and text archive is therefore sometimes blurred in practice, and the desire to fulfil the criteria of ‘standard reference’ is what needs to be kept in mind when building corpora. Another aspect of traditional corpus building is in sampling. Many corpora are composed of a variety of texts, of which samples are taken. For example, each member of the Brown family of reference corpora is composed of 500 samples of texts, each around 2,000 words in length. This technique of sampling is in place to ensure that the corpus is not skewed by the presence of a few very large single texts (e.g. whole novels). We would also try to take samples from a range of different places in texts to ensure that our corpus is not comprised of only the first parts of texts (unless we are only interested in the language used at beginnings of texts, e.g. introductions, prologues, prefaces, etc.). If we use equally sized samples, we are more likely to be able to claim that our corpus is well balanced. However, for the purposes of discourse analysis, it may not be such a good idea to build a corpus which includes samples taken at different points from complete texts. We may be more interested in viewing our texts as having beginnings, middles and ends and tracking the ways that language is used at different points within them (see the discussion of dispersion in Chapter 4).
Of course, it is important to bear in mind the types of research questions that we want to ask, and if we are only building a small corpus or utilising a few texts, there may be no need for sampling in any case. However, an awareness of the issue of sampling and the possible advantages or restrictions that it can place on certain forms of analysis are worth considering. For the purposes of discourse analysis, some of the chapters in this book utilise some form of specialised corpus, e.g. a corpus of holiday leaflets (Chapter 4), a corpus of propaganda texts (Chapter 6) or a corpus of political debates on fox hunting (Chapter 7). These are reasonably small corpora (none of them are over 2,000,000 words in size) and they did not take a great deal of time to assemble. They all included complete texts, rather than samples.
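For readers who do want to sample, the idea of taking equally sized extracts from varying positions in a text can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the Brown family: the function name `sample_words`, the default of 2,000 words and the two dummy texts are all hypothetical, and texts shorter than the sample size are simply kept whole.

```python
import random

def sample_words(words, sample_size=2000, rng=None):
    """Return one contiguous sample of `sample_size` words from a random position.

    Texts no longer than `sample_size` are returned in full (an assumption made
    here for simplicity; a real design might exclude or pool short texts).
    """
    rng = rng or random.Random()
    if len(words) <= sample_size:
        return list(words)
    # Choose a start point so that the sample never runs past the end of the text.
    start = rng.randrange(len(words) - sample_size + 1)
    return words[start:start + sample_size]

# Dummy data standing in for a whole novel and a short leaflet.
rng = random.Random(42)  # fixed seed so that the sampling can be repeated
novel = [f"w{i}" for i in range(90_000)]
leaflet = [f"w{i}" for i in range(350)]

samples = [sample_words(text, 2000, rng) for text in (novel, leaflet)]
print([len(s) for s in samples])
```

With sampling of this kind the novel contributes 2,000 words rather than 90,000, so it cannot dominate the corpus, but the extract no longer has a beginning, middle and end, which is the trade-off discussed above.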
One question which beginning corpus builders often ask is, ‘How large should my corpus be?’ There is no simple answer, although there are a few points that are worth bearing in mind. For many (non-discourse oriented) corpus-based studies, a million words of one variety of language (e.g. British English, Indian English), covering a relatively short time period, is viewed as adequate for comparative work on high frequency phenomena (e.g. Leech’s (2003) study of modal verbs across the Brown family of corpora). Perhaps most importantly, the size of the corpus should be related to its eventual uses. Will it be used in order to derive simple frequencies, collocations or word meanings from concordances? Kennedy (1998: 68) suggests that for the study of prosody 100,000 words of spontaneous speech are adequate, whereas an analysis of verb-form morphology would require half a million words. For lexicography, a million words is unlikely to be large enough, as up to half the words will only occur once (and many of these may be polysemous). However, a million words would be enough for analysis of syntactic processes. Regarding using corpora for discourse analysis, it is possible to carry out corpus-based analyses on much smaller amounts of data. For example, as discussed previously in Chapter 1, Stubbs (1996: 81–100) compared two short letters from Lord Baden-Powell, consisting of approximately 330 and 550 words each. Even within these two short texts he was able to show that there were repetitive differences in the ways that certain words were used. With that said, corpus-based studies of small texts often refer to larger reference corpora to augment their analysis. If we are interested in examining a particular genre of language, then it is not usually necessary to build a corpus consisting of millions of words, especially if the genre is linguistically restricted or highly repetitive in some way. 
Shalom (1997) analysed a corpus of personal advertisements sent to a lonely-hearts column in a London-based magazine. She collected a total of 766 adverts, which probably puts her corpus size at between 15,000 and 20,000 words. With this relatively small sample, Shalom was able to demonstrate a range of lexical and grammatical patterns, for example frequent co-occurrences of words like slim and attractive. So specialised corpora that only contain small ‘colony texts’ (Hoey 1986) do not need to be millions of words in length. An average personal advert is only about twenty to thirty words. Each advert represents an individual text, and a page of adverts would be a colony text, the order in which the texts are placed would not alter the meaning of an individual advert. Other corpora of colony texts that are found in newspapers or magazines might include letters to problem pages, recipes or horoscopes.
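The earlier remark that up to half the words in a million-word corpus may occur only once can be checked on any corpus with a short script. The sketch below is an assumption-laden illustration: it interprets ‘words’ as word types (distinct forms), uses a hypothetical helper called `hapax_share`, and relies on simple whitespace splitting rather than the tokenisation of any particular tool.

```python
from collections import Counter

def hapax_share(tokens):
    """Return the proportion of word types that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return hapaxes / len(counts)

# A toy 'corpus'; real use would read and tokenise the corpus files instead.
toy_tokens = "the cat sat on the mat and the dog sat too".split()
share = hapax_share(toy_tokens)
print(f"{share:.0%} of word types occur only once")
```

A high proportion of once-only types suggests that the corpus is too small to say much about individual lexical items, even if it is perfectly adequate for studying frequent grammatical or discursive patterns.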
One consideration when building a specialised corpus in order to investigate the discursive construction of a particular subject is perhaps not so much the size of the corpus, but how often we would expect to find that subject mentioned within it. For example, if we are interested in studying discourses of unmarried mothers in newspapers, which of the following two corpora would be most useful to us? One which has ten million words but mentions unmarried mothers thirty times in total, or one which has fifty thousand words but mentions unmarried mothers six hundred times? The first corpus may be useful as a reference (see below), but the second is likely to tell us more about unmarried mothers due to the higher frequency of occurrences of this topic within it. Therefore, when building a specialised corpus for the purposes of investigating a particular subject or set of subjects, we may want to be more selective in choosing our texts, meaning that the quality or content of the data takes equal or more precedence over issues of quantity. Some types of data are easier to collect than others – for example, corpora containing threatening communications, suicide notes or narratives by asylum seekers might be difficult to come by, and as a result, some specialised corpora might be extremely small. This does not mean that a corpus analysis is not worth doing, but claims about representativeness would need to be adjusted accordingly. There are potential advantages from working with a small corpus – it should be possible to read all of the text in it and carry out a qualitative analysis, which could then be used as a form of triangulation alongside the corpus analysis. An aspect of corpus-based analysis that can often be extremely useful in terms of analysing discourses is the investigation of changes over time (see Chapter 8). Although discourses are not static, one technique of discourse maintenance is by implying or stating that ‘things have always been this way’.
Discourses therefore may have the appearance of being written in stone, for example being cemented by phrases like ‘It is a truth universally acknowledged that . . .’ or ‘since time immemorial . . .’ However, discourses are not static, and one way of investigating their development and change is to use a diachronic corpus. A diachronic corpus is simply a corpus which has been built in order to be representative of a language or language variety over a particular period of time, making it possible for researchers to track linguistic changes within it. For example, the Helsinki Corpus of English Texts: Diachronic Part consists of 400 samples of texts covering the period AD 750 to 1700 (Kytö and Rissanen 1992). A diachronic specialised corpus was used by Rey (2001)
who collected scripts from the television series Star Trek and two related spin-off series, Star Trek: The Next Generation and Star Trek: Deep Space Nine, dating from between 1966 and 1993. By analysing the language used in the scripts, she concluded that ‘female language has shifted away from highly involved linguistic production toward more informational discourse, while at the same time male language has shifted away from highly informational language toward more involved discourse’ (Rey 2001: 155). My own corpus-based research has often focussed on change over time. For example, Baker et al. (2013) examined a corpus of newspaper articles about Islam from 1998 to 2009, with one chapter addressing linguistic trends over the years. The analysis found, for example, that words relating to gender and sexuality were relatively common in 1998–2000 and 2006–2009 but were less common in 2001–2005, a period when articles focussed more on stories related to terrorism.
The use of a diachronic corpus can therefore enable researchers to address the criticism that corpus users tend not to take into account the fact that as society changes language changes with it. Clearly though, a diachronic corpus may not be able to fully take into account language change. For example, a corpus which consists of newspaper texts taken at regular intervals across a twenty-year period, e.g. from 1980 to 2000, will only reveal changes in language from that period – researchers will still have to undertake other forms of analysis if they want to investigate how particular words were used before and after that point. Additionally, if we build a corpus only containing texts from 1980, 1990, 2000 and 2010 we need to bear in mind that we are working with four snapshots and while this may result in some indications of larger trends in terms of change over time, we cannot be sure about fluctuations that may have happened in the missing years such as 1981–1989. A study which compares only two time periods (e.g. 1980 vs 2020) should be especially cautious about claiming that it has identified a long-term trend. We should also take care that our research questions, hypotheses and search terms are appropriate for the time period we are examining. For example, a study which looks at how language has changed to represent women from 1800 to 2000 should not just take into account words like Ms or feminist that are relevant to the year 2000 but also consider terms that have become less fashionable like virtue or dowry. Additionally, while a diachronic corpus can introduce a more dynamic aspect into corpus-based analysis, this can also result in issues connected with over-focussing on change and reifying difference (a point which is addressed in more detail in Chapter 7).
A final type of corpus which is often extremely useful for discourse analysis, although it may not incorporate the main research focus, is a reference corpus. It consists of a large corpus (usually consisting of millions of words from a wide range of texts) which is representative of a particular language variety (often but not always linked to a national language). For example, the British National Corpus (referred to here as BNC 1994) is a reference corpus consisting of approximately one hundred million words of written and spoken data. Its 4,124 texts were mainly produced in the late 1980s and early 1990s, although about 5.5 million words were first published between 1960 and 1984. The written texts consist of extracts from regional and national newspapers, specialist periodicals and journals, academic books and popular fiction, published and unpublished letters, and school and university essays. The spoken part includes a large number of unscripted informal conversations, recorded by volunteers selected from different age, regional and socio-economic groups, together with language collected in different contexts, ranging from formal business or government meetings to radio programmes and phone-ins. It is not always possible to find appropriate reference corpora, especially as they date quickly and the amount of time it takes to build large corpora can sometimes mean that they are already a few years old by the time they are released. The BNC 1994 is really a ‘historical corpus’,2 although a new version containing texts from the 2010s has more recently been created. The Corpus of Historical American English (COHA) is both a reference corpus and a diachronic corpus, containing large amounts of text from 1810 to 2000. The Brown ‘family’ of corpora are also used as reference corpora, although individually each one is much smaller than the BNC 1994 or the COHA. 
The original Brown Corpus consists of approximately one million words of written American English dating from 1961. It contains texts from fifteen different genre categories including press, religion, mystery and detective fiction, science fiction, love story and humour. Its creation has spawned a number of similarly structured corpora, e.g. the equivalent Lancaster-Oslo/Bergen (LOB) corpus (one million words of 1960s British English). There are also American and British versions covering 1931, 1991/2, 2006 and 2021. The tool Sketch Engine contains a wide range of very large (billions of words) reference corpora in numerous languages, many containing texts that have been scraped from the World Wide Web. A potential limitation in working only with a reference corpus is that it might be difficult to obtain keywords from it (as we will see in Chapter 7, keywords are obtained by comparing two corpora together). It is possible to
divide a reference corpus into different sections (e.g. by time period or text type) and compare those together. This would tell you what is distinctive about different parts of the reference corpus, but to acquire the keywords from the whole corpus you would need to use a second reference corpus, ideally one around the same size or even larger.
Capturing data
So returning to the question of how large a corpus should be, the answer very much depends on the type of language that is being investigated. The more specific the use of language, the less need there is to collect millions of words of data. On the other hand, if you are intending to study language use in a relatively general context, it might be a good idea to make use of an existing reference corpus, rather than undertake the time-consuming task of creating one yourself. Large reference corpora are likely to be good sources for uncovering discourses pertaining to an extremely wide range of subjects. They also have the advantage of containing texts from many sources, which may result in a more interesting and varied set of discoveries than, say, just looking at a corpus of newspapers. However, there are also good reasons for building a specialised corpus: reference corpora may not contain enough of the text types you are interested in examining or may not have enough references to the subject(s) you want to investigate. And, as noted earlier, the texts in a reference corpus may be too old to have relevance to the present day. Or you may not be able to find a suitable reference corpus for the language or register type you want to focus on.
As with other methodologies, when building a corpus from scratch it is often useful to carry out a pilot study first in order to determine what sort of texts are available and how easy they are to obtain access to and convert to electronic form. One of the easiest ways to collect corpus texts is to use data which already exist in electronic format. Exploiting this can make the job of the corpus builder much less arduous than in earlier decades. Table 3.1 shows some sites that can be used for collecting corpus data. The interfaces for online text databases can regularly change, so researchers will often need to spend a bit of time working out how to obtain the best results from them. 
Downloading large numbers of texts can be a tedious business and so sites which enable multiple texts to be saved at once are most helpful.
Table 3.1 Popular online corpus building resources
Site | Content | Access
Nexis | Newspaper articles (minus advertising and visuals) | Subscription needed
Factiva | Newspaper articles (minus advertising and visuals) | Subscription needed
Google Groups | Archive of online postings to newsgroups | Free
Oxford Text Archive | Linguistic and literary texts | Free
Project Gutenberg | A library of eBooks | Free
Hansard | Transcripts of UK parliamentary debates | Free
Amazon | Books, product reviews | Approximately 3000 word samples per book
Hathitrust Digital Library | Collection of digitised books | Requires log-in with partner institution
Twitter | Tweets from individuals around the world | Free
Reddit | Social news aggregation, web content rating and discussion website | Free
Blogger | Online diaries or journals | Free
Mumsnet | Forum discussions on a range of topics | Free
IMSDb | Scripts of movies | Free
Tripadvisor | Reviews of hotels, restaurants and attractions | Free
Courts and Tribunals Judiciary | Judgements, Orders and Sentencing Remarks | Free
Failing that, using a web-scraper like HTTrack3 can allow parts of a site to be copied to your hard-drive, recursively building directories, getting html, images and other files. Once the site has been copied, it is usually necessary to strip the files of unwanted text, and some websites are constructed in order to prevent copiers from taking their content in this way, which brings us to another issue, that of copyright and permission (see below). Another option is to use coding skills to collect data. Baker and McGlashan (2017) used Python with the libraries Beautiful Soup (which parses and scrapes data from HTML code) and Selenium (which simulates user behaviour like clicking links) in order to collect news articles and reader comments from the website of an online newspaper. Otherwise, web pages can be saved one by one. As a last resort, for very small corpus-building
projects, I have simply copied and pasted text directly from a website into a text editor like Notepad. However, this is not recommended if a quicker, automatic way can be implemented. Sometimes, internet text is not encoded in HTML/text format, but is instead represented in other formats, such as jpg, gif, png or bmp files or as pdf documents. In the latter case, it may be necessary to obtain software which converts the pdf document to plain text (or otherwise be prepared to do a lot of cutting, pasting and editing to make the document readable). When text is represented as a graphics file it will either need to be keyed in by hand or scanned in; there is no point in simply saving the graphics files, as current concordance programs will not be able to recognise the text within them.
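For readers who want to try the coding route, stripping a saved web page of its markup might look something like the sketch below. It is only an illustration: it uses Python's built-in html.parser module rather than Beautiful Soup (which Baker and McGlashan used), and the class name, function name and sample page are all invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of a page, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text content of an HTML string as one line of plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Invented sample page (an assumption for illustration, not real data).
sample = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Headline</h1><p>First <b>bold</b> paragraph.</p>"
    "<script>track()</script></body></html>"
)
print(html_to_text(sample))
```

A real site will usually need more work than this, since menus, adverts and reader comments also count as visible text and would have to be filtered out separately.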
Scanning and keying in
If texts cannot be obtained from the internet, then in some cases there may be other ways that they can be collected electronically. For some projects I have worked with partner organisations who have been able to provide me with texts they had access to which are not available in the public domain. Crowd-sourcing could be considered as one way to gather data, or placing a request for particular text types at a relevant internet bulletin board or newsgroup may result in access to private collections of data that are not available anywhere else. If existing electronic sources are unavailable, then two other (rather time-consuming) options present themselves. The first involves converting paper documents by running them through a scanner with Optical Character Recognition (OCR) software. For most people this is probably quicker than keying in the document by hand, although it is generally not a 100 per cent accurate process. The print quality of the document is likely to impact on the accuracy of the output, and the data will probably need to be hand-checked, spell-checked and corrected for errors (which, in the worst cases, can be an even lengthier process than typing the data oneself). In general, the texts that respond best to OCR are those which are published in a straightforward format (e.g. do not contain different types of fonts or multiple columns). Newspaper print, which is often very small, can get blurred and smudgy and may contain several articles over a range of different sized columns on the same page, so it does not always respond well to scanning. Kennedy (1998: 80) reports that scanners are likely to have problems with hyphens, apostrophes and certain letters or groups of letters.
The final, and usually last resort of the corpus builder is to key in the text by hand (or pay someone to do it professionally). Kennedy (1998: 79) estimates that an experienced touch typist can achieve about 10,000 words a day, working full-time, and there are numerous companies which offer keying-in services, although this is likely to only be an option if you have access to research funding.
Spoken texts
Certain types of texts will present their own unique problems to the corpus builder. For example, written data is generally much easier to obtain than spoken data. Conversations and monologues (whether occurring on the television, radio, online, at home or the workplace) will need to be transcribed (to an extent) by hand – a task which tends to take much longer than keying in written texts for a number of reasons: parts of spoken texts are sometimes unclear, there can be overlapping dialogue and the audio file needs to be repeatedly listened to as people normally talk faster than the average typist can keep up with. In recent years there have been impressive advances with automatic subtitling tools (such as Otter.ai or Sonix), which can now provide almost instantaneous transcripts of audio files, although accuracy can vary depending on how clear the speech is, meaning that it is sensible to check the transcripts against the original and make corrections if required. Depending on the focus of your research, there is a range of additional information that may or may not be required: who is speaking at any given point and how they speak, e.g. prosodic information such as volume, speed and stress as well as paralinguistic information (laughter, coughing, etc.) and non-linguistic data (dogs barking, cars passing). Pauses and overlap may need to be transcribed too. There may also be problems involved in accurately rendering different types of accents or other phonetic variation, which can add to the complexity of spoken data and may not be covered by automatic subtitling software. Archives containing transcripts of spoken data are already in existence, either on the internet (as with government debates) or as spoken corpora. 
However, it should be noted that in cases where studying linguistic phenomena is not the main goal of archiving this data, these transcripts may have been ‘cleaned’ or glossed in order to remove or limit the effect of interruptions, false starts, hesitations, etc. (see Slembrouck 1992). It is also sometimes possible to obtain scripts from films, plays or television programmes, which contain spoken data (sometimes provided by fans, see
Bednarek 2015). These forms of scripted data do not always reflect how people really speak though. For example, in the spoken section of the BNC 1994, the frequencies per million words for the words yes and yeah are 3,840 and 7,891 respectively. Compare this to the frequencies of the same words in a 400,000-word corpus of scripts from sitcoms. Here, yes only occurs 1,080 times per million words and yeah occurs 3,266 times per million words. Such discourse markers therefore tend to be less frequent in scripted written-to-be-spoken data. Scripts tell us about how language is represented then, as opposed to how it is used naturally, although this is still interesting to examine from a discourse analysis perspective.
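Because the corpora being compared differ so much in size, figures like these are expressed per million words rather than as raw counts. The arithmetic is simple, and a minimal sketch of it is given below, using the unmarried mothers example from earlier in the chapter and the sitcom figure for yes. The function names are my own, chosen only for illustration.

```python
def per_million(raw_count, corpus_size):
    """Convert a raw frequency into a frequency per million words."""
    return raw_count * 1_000_000 / corpus_size


def raw_from_per_million(rate, corpus_size):
    """Recover the approximate raw count behind a per-million figure."""
    return rate * corpus_size / 1_000_000


# Ten million words with thirty mentions, versus fifty thousand words
# with six hundred mentions (the unmarried mothers example).
print(per_million(30, 10_000_000))
print(per_million(600, 50_000))

# 'yes' at 1,080 per million words in a 400,000-word sitcom corpus.
print(raw_from_per_million(1080, 400_000))
```

Seen this way, the smaller corpus mentions the topic 12,000 times per million words against only 3 per million in the larger one, and the sitcom figure for yes corresponds to 432 actual occurrences.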
Copyright and ethics
Issues around gaining permission to include texts in a corpus are not clear-cut. If you are building a corpus which you intend to sell or share with others, then it is important to seek or ascertain that you have permission to use the texts you want to include. In such cases it is worth considering Kennedy (1998: 76), who notes that ‘before texts are copied into a corpus database, compilers must seek and gain the permission of the authors and publishers who hold copyright for the work, or the informed consent of individuals whose rights to privacy must be recognized’. Often, obtaining permissions can be a slow and complex task, particularly for a corpus containing texts from multiple sources. In 2014, a document released by the UK Intellectual Property Office noted a copyright exception to non-profit research involving text and data mining. It ‘allows researchers to make copies of any copyright material for the purpose of computational analysis if they already have the right to read the work (that is, work that they have “lawful access” to). They will be able to do this without having to obtain additional permission to make these copies from the rights holder’ (Intellectual Property Office 2014: 6). This exception only applies to research conducted within the UK, and copyright rules differ from country to country. Additionally, different platforms can have their own terms of service. For example, at the time of writing, Twitter does not explicitly prohibit data collection for research purposes but does specify that you may not ‘access or search or attempt to access or search the Services by any means (automated or otherwise) other than through our currently available, published interfaces that are provided by Twitter (and only pursuant to the applicable terms and conditions), unless you have been
specifically allowed to do so in a separate agreement with Twitter’.4 Facebook5 and Instagram6 do not allow automated means to collect data, while Mumsnet7 notes that users must not be ‘invasive of privacy’ when using the platform. Blogger decrees that posters own the original content of their blog posts. If you intend to publish your research, bear in mind that some publishers may err on the side of caution and decide that certain types of online material should not be collected and analysed in published research unless permission is obtained. In particular, care ought to be taken in cases where data appears on a passworded site or one which a subscription needs to be paid to in order to gain access. Many universities require projects to pass clearance via an ethics committee although such committees often tend to be more comfortable dealing with research carried out on human beings as opposed to research on texts produced by human beings. Even if copyright is not an issue, there are still ethical concerns around using certain types of text. For example, consider online journals or weblogs. Many of these are normally easily accessed and they can provide fascinating insights into a vast range of human experience. However, most bloggers do not expect their posts to be downloaded, subjected to analysis by academics and then appear as part of published research. Ethically, it is good practice to notify bloggers that you would like to include their posts in your corpus, particularly if you intend to quote from them or refer to the blogs by name. If you are intending to make your corpus available to others, and the data is not freely available elsewhere, carrying out anonymisation on identifying information is a good idea. This would ideally involve changing or redacting proper nouns, especially names, although geographical locations may also need to be changed. 
Identities can sometimes be revealed in other ways, for example a negative reference to someone who attended a particular event might be traced with a bit of detective work. Unless you read your whole corpus or are very familiar with it, such cases might be difficult to redact, so ultimately we may have to accept that anonymisation is unlikely to be perfect. If you are not intending to share your corpus with others, then you may not feel the need to redact your corpus when you analyse it, although it is good practice to obscure identities in any examples you quote from in the research. Extra care relating to ethics ought to be taken when collecting text that deals with sensitive topics or personal information or involves people who might be classed as vulnerable in some way. Ensuring that such corpora cannot be accessed by others is important so think carefully about how and where the corpus files are stored, and ideally, ensure that they are password
protected. Our research should not put people at risk and it is worth thinking about the potential consequences (both to ourselves and to others) that could result from carrying out or publishing a piece of research. Corpus-based research might be seen as low risk because a large part of it involves extracting frequency information from large data sets. A word list is unlikely to get anyone into trouble. However, the danger arises when including illustrative extracts or concordance lines from our corpus. If we quote something that was found online, it may be possible for others to locate its source, and thus a great deal of information about the author, by entering a snippet of text into a search engine. In cases where I have viewed a piece of data as potentially problematic to cite, I have simply given a description of it, rather than quoting it word for word, or I have changed the example enough to make it extremely difficult to be traced (noting that I have done so and explaining why). Even if the corpus content itself does not need to be changed, we ought to carefully consider how and what we quote from it in our research. I would be particularly cautious about corpus research on social media which results in the naming and shaming of individuals (e.g. ordinary people who post obnoxious content on Twitter). Even if we do not agree with the person’s views, what could be the unintended consequences of exposing them; for example, how could it impact on their family members? A related ethical issue is in obtaining permission to carry out corpus-based (critical) discourse analysis on texts, which may result in findings which show the texts, and subsequently the text producers, in an unfavourable light, for example by revealing racist or sexist discourses. 
For this reason, it seems unlikely that everyone who carries out corpus analysis (or any form of critical analysis via other methods) always obtains permission from authors or copyright holders before collecting and analysing texts that are already available in the public domain. Otherwise, a great deal of research in existence would have never been published. Ethical guidelines from the Association of Internet Researchers (franzke et al. 2020) have recognised that making oneself known to certain communities might put the researcher at risk from harassment and in some instances the safety of the researcher must be prioritised over informed consent. Having spoken to several academics on the issue of gaining permission to carry out sole-authored, non-commercial research, I have found inconsistency and confusion over whether permission is always required. There sometimes appear to be gaps between the guidelines or rules and their application in practice. Copyright laws can differ from country to country, adding a further dimension of complexity. Additionally, different publishers and funding
bodies may vary in regard to their attitude towards the necessity of permissions. So it is useful to couch permissions in terms of ‘best practice’, but also be aware that there are some cases where permission cannot be obtained. In such cases, including a statement in published work like ‘every possible effort has been made to obtain permission’ may help to safeguard the researcher.
Cleaning the corpus
Once you have obtained corpus data, it can often require a bit of additional work to make it ready for the tool you are using. Some tools require texts to be saved in a particular format (e.g. Unicode or UTF-8) and may produce strange outputs if this is not taken into account. It is worth checking that you are able to upload a few files into the corpus tool(s) you intend to use, and then quickly running a few processes (concordancing, frequency lists) to see what is obtained. Some corpus tools have facilities which will identify the text format and convert it to one they can work with. Even if the tool can cope with the file format, there are various issues that may impede an accurate analysis, particularly duplicates, boilerplate, text rendering and spelling variation. If you are a skilled coder (or know such a person), you can probably deal with these issues fairly easily, so the suggestions below are aimed at people who do not have experience of coding and want to use off-the-shelf software to resolve them.
Duplicates involve multiple copies of the same file appearing in your corpus, which will result in skewed frequency information. The news archive Nexis has options which help to reduce duplicates although I have found that they still let a high number through, with some newspapers being worse offenders than others. For example, some UK newspapers have English, Scottish and Northern Irish versions, or publish a morning and evening edition every day, which can result in up to six copies of the same article appearing in a corpus. Carrying out a few concordance searches on reasonably frequent words, then sorting the concordances alphabetically, is likely to give an impression of the extent of the problem, if it exists. In order to remove such duplicates I have used the text processing tools in WordSmith, first to split each news article into separate text files (see also Scott 2018), then to identify and remove duplicate files. 
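For those who do have some coding experience, the general idea of flagging near-duplicate files can be sketched in a few lines of Python. This is not how WordSmith works internally: the similarity measure here (the percentage of word types shared by two texts) is my own assumption, chosen for simplicity, and the file names and texts are invented.

```python
import re
from itertools import combinations


def word_types(text):
    """Return the set of distinct lower-cased word forms in a text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def similarity(text_a, text_b):
    """Percentage of word types shared by two texts (Jaccard overlap)."""
    a, b = word_types(text_a), word_types(text_b)
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)


def find_duplicates(texts, cutoff=80.0):
    """Return (file_a, file_b, score) for each pair at or above the cut-off."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        score = similarity(text_a, text_b)
        if score >= cutoff:
            pairs.append((name_a, name_b, round(score, 1)))
    return pairs


# Invented toy texts standing in for two regional editions and an unrelated file.
articles = {
    "edition_england.txt": "The council has approved plans for a new library in the town centre",
    "edition_scotland.txt": "The council has approved plans for a new library in the town centre today",
    "tv_listings.txt": "Tonight on television news at six followed by the weather",
}
print(find_duplicates(articles, cutoff=80.0))
```

Comparing every file with every other file becomes slow with very large corpora, so a sketch like this is best suited to corpora of a few thousand texts. As with any such tool, the cut-off value needs to be tried out on the data at hand.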
The Duplicates tool in WordSmith allows you to specify a cut-off in terms of the percentage of lexical similarity between two files, and it is worth experimenting with
different percentages to determine which number will catch most or all of the duplicates without flagging up unproblematic cases. Deciding what counts as a duplicate is another matter. With newspaper data, there can be quite a lot of repetition with certain kinds of articles (e.g. those listing television programmes) appearing very similar. There is no perfect way to decide how to remove duplicates, although analysts should be explicit about which decisions were used (when using Nexis for example, I usually only count two texts as duplicate if they appear in the same newspaper on the same day). Boilerplate refers to repetitive data across each text which is usually related to meta-information about the text, such as copyright messages, the date of publication, the name of the text’s author, or the text’s length (see Figure 3.1 for an example from Nexis). While this information can be useful to know, it is not ideal to include this kind of language data in your analysis. From Figure 3.1, the only information I would want to include in a corpus is the title ‘Am I weird for hoarding junk?’ as the presence of everything else is likely to skew the analysis, e.g. words like documents, features, Sun, England, Wednesday and length would end up being highly frequent. It is possible to remove much of this data using a global search-and-replace function (Notepad++ can also do this while Microsoft Word can get rid of much of it
Figure 3.1 Header information obtained from Nexis.
although it only works with one file at a time). It might be useful to retain some of this information, however, particularly if the filename or file storage system you are using will not help you to identify the date or name of each newspaper. So converting parts of header files into tags (discussed in more detail in the following section), which corpus analysis tools can be instructed to ignore when building frequency lists or carrying out concordances, might be more useful than simply deleting this information. Thus, I may decide to convert the information above into the following tags:
Some corpus software will ignore text between < and > characters, unless instructed otherwise, so this information will not be included in frequency-list data. Additionally, using software like WordSmith, I could restrict my analysis to only those texts which contain a particular tag, in order to consider just one type of newspaper.
Corpus files can sometimes contain surprising renderings of text which can also make it difficult to search for words or obtain accurate frequencies. If you have obtained a text from a database, then it is possible that something went wrong when the text was converted from its original form to the database format. This can be the case, for example, if the texts were converted from paper form using Optical Character Recognition. Additionally, problems can ensue when you make a copy of the text (converting texts from pdf to txt format, for example, can result in errors creeping in). Punctuation can get mis-translated or go astray (particularly look out for apostrophes and single quotes), while emoji can also be rendered incorrectly or not at all. Some emoji are made up of several codes – for example, the emoji of a woman shrugging with a medium skin tone can be rendered as five codes: U+1F937 (person shrugging), U+1F3FD (medium skin tone), U+200D (zero width joiner), U+2640 (female symbol) and U+FE0F (Variation selector-16). When copying this emoji into certain formats, these codes can be separated and lost, so what appears in the corpus is simply the female symbol.8 Gaining familiarity with your corpus data before you start to do any analysis is therefore recommended, either by reading a selection of corpus files or carrying out a few processes to see what is obtained and whether anything looks odd.
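One way of seeing what is actually stored in a file is to list the codes behind each character. The short Python sketch below does this for the five-code shrugging emoji described above. The function name is invented for the example, and the character names printed at the end come from Python's own Unicode tables, so their exact wording may differ from the labels used here.

```python
import unicodedata

# The five codes listed above, written as a single Python string.
shrug = "\U0001F937\U0001F3FD\u200D\u2640\uFE0F"


def code_points(text):
    """List the Unicode code point of every character in a string."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(len(shrug))
print(code_points(shrug))

# Print each code alongside the name Python holds for it.
for ch in shrug:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<no name>"))
```

Although the emoji appears on screen as one symbol, Python counts it as five separate characters, which is why a careless conversion can leave only part of it behind.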
A related issue concerns cases of spelling variation. This can occur in a range of contexts. For example, if you have texts from American and British authors, there is likely to be variation in the spelling of words like colour/color, centre/center and realise/realize. Working with a diachronic corpus may also result in alternative spellings of certain words. Online texts can contain a wide range of spelling errors or typos, or cases where users omit apostrophes so don’t could be rendered as dont. Such differences may be interesting and indeed form part of your analysis. However, the variety of alternative spellings is likely to impact on word frequencies, meaning that some words will miss out on being keywords or collocates, which would not have been the case if they had been spelled consistently. A tool called VARD 2 (see note 9) uses techniques from spell checkers to identify and standardise spelling variants in corpus files, either manually, automatically or semi-automatically, training the tool on a sample from the corpus. It may be useful to work with both the original and standardised versions of the corpus, depending on your research aims.
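The effect of spelling variation on frequencies can be illustrated with a very small Python sketch. It does not reproduce VARD 2, which is far more sophisticated; it simply replaces variants from a hand-made list, and both the list and the sample sentence are invented for the purpose.

```python
import re
from collections import Counter

# A hand-made list of variants and their preferred forms (an assumption
# for illustration; a real project would need a much longer list).
VARIANTS = {
    "color": "colour",
    "center": "centre",
    "realize": "realise",
    "dont": "don't",
}


def tokenise(text):
    """Split a text into lower-cased word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def standardise(tokens, variants=VARIANTS):
    """Replace each known variant with its preferred spelling."""
    return [variants.get(token, token) for token in tokens]


sample = "The colour was odd. I dont like that color, and they don't like the colour either."

original = Counter(tokenise(sample))
standardised = Counter(standardise(tokenise(sample)))

print(original["colour"], original["color"])
print(standardised["colour"], standardised["color"])
print(original["don't"], standardised["don't"])
```

In the original version the counts for colour are split across two spellings (2 and 1), whereas after standardisation they are combined (3). Keeping both versions of the token list, as here, mirrors the advice to retain the original corpus alongside the standardised one.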
Annotation

It is usually recommended that corpus builders apply some form of annotation scheme to their text files, however brief, in order to aid analysis and keep track of the structure of the corpus. Because the conventions for representing typographical features in electronic texts (aspects like line breaks, formatting of headings, paragraphs, pound signs, quotes, etc.) can vary depending on the software used to edit or view the text, it is often sensible to mark up such features using a standardised coding system. One such system is Extensible Markup Language (XML) which was created in the 1990s as a standard way of encoding electronic texts by using codes (also informally called ‘tags’) to define typeface, page layout, etc. In general, the tags are enclosed between ‘less than’ and ‘greater than’ symbols: < >. So for example the start-tag is used to indicate the start of a new section. End-tags contain a forward slash / symbol after the ‘less than’ symbol. So the end of a section would be encoded as . Tags may also contain attributes and values. For example, the tag could be used in a spoken transcription to indicate the occurrence of a pause during speech, the duration being 4 seconds. Here, the attribute is dur (duration) and its value is 4 (seconds).
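To make the ideas of start-tags, end-tags, attributes and values concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The element and attribute names in the sample (text, u, who, pause, dur) are hypothetical and chosen only for illustration; they are not taken from any particular corpus. Note that strict XML requires attribute values to be quoted and an element with no content to be closed, here with the self-closing form.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up: element and attribute names are illustrative only.
sample = '<text><u who="A">well <pause dur="4"/> I suppose so</u></text>'

root = ET.fromstring(sample)

# Walk through every element, printing its tag name and its attributes.
for element in root.iter():
    print(element.tag, element.attrib)

# Collect the duration value of every pause element as a number.
pauses = [int(pause.get("dur")) for pause in root.iter("pause")]
print(pauses)

# Recover the running text with all tags ignored.
words = "".join(root.itertext()).split()
print(words)
```

The last step is roughly what corpus software does when it ignores anything between < and > while counting words.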
Different forms of XML have been employed for a range of purposes. So web pages are encoded in HTML (Hyper Text Markup Language) which uses a predefined set of codes that are based on the general XML rules. For example, bold print is specified in HTML with the code pair and . There is no reason why corpus builders must adopt XML if they encode their corpora, but it is worth knowing about the existence of such codes and what they look like in case they are encountered in other corpora. Also, if an annotation scheme is required, then it seems a shame to have to re-invent the wheel, when a perfectly good set of standards already exists. Corpus analysis packages like WordSmith tend to be capable of handling XML codes, but they may be less equipped to deal with an ad hoc coding system created by a researcher working alone. So in what contexts would annotation be useful for corpus-based discourse analysis? One aspect is in the creation of headers (see Figure 3.2). A header, marked with . . .. normally appears at the beginning of a file in a corpus (or exists as a separate file) and gives reference information about that particular text, perhaps in terms of the title, who created it and when and where it was published. The . . . .
Figure 3.2 Short sample header from a written text.
part of the file contains the actual text itself. Headers can constitute a useful form of record keeping, particularly if a corpus consists of many files from different sources, created at different times. Alternatively, you might want to include this sort of meta-information within file names or sort different types of files into folders that are named accordingly. The header may also contain information about the author or genre of the file, which can be useful at a later analytical stage, if you want to make comparisons between different genres, or say, only look at a subsection of your corpus which was authored by women. It would then be a matter of instructing your analysis software to only consider files that contain a tag like . Other meta-linguistic information that could appear in headers for written files includes the publication date (sometimes using categories such as 2000s, 2010s, 2020s, etc. can be useful when making comparisons over time), medium of text (book, magazine, letter), genre (science fiction, religious, news), level of difficulty, audience size, age and sex of author and target audience. For spoken corpora, we may be interested in encoding information about the different speakers in each text (their age, sex, socioeconomic status, first language or geographic region), whether the text is a monologue or dialogue, public or private, scripted or spontaneous. At the end of Chapter 4, I look in more detail at some of the different speaker classifications in corpora. Leaving the header, the actual body of the text can also be given different types of annotation, some of which are discussed in more detail later in this book (see Chapter 7). Part of speech annotation is one procedure that is commonly assigned to corpora at some stage towards the end of the building process. This usually involves attaching codes to each word (or in some cases groups of words) which denote a grammatical class such as noun, verb or adjective. 
Some encoding systems can be quite complex, encoding different tenses of verbs (past, present, future) or numbers of nouns (singular vs plural). Part of speech annotation can be useful in that it enables corpus users to make more specific analyses. For some words, their meaning and usage can change radically depending on their grammatical class. For example, text, horse, field and hits can all be nouns or verbs. Having an indication of part-of-speech information and being able to specify this within corpus searches can save a lot of time. We may also find it useful to split words and tag different parts of them separately – so with a word like don’t, we could tag the first morpheme (do) as a verb, and the second (n’t) as a negator. This would allow more flexibility when looking at just verbs or just
Figure 3.3 Grammatically tagged sentence from the British National Corpus.
negators or combinations of both. Figure 3.3 shows a grammatically tagged sentence from the British National Corpus, which was annotated using an encoding scheme called the C5 tagset.¹⁰ Each word is preceded by an XML tag (word) which also contains a value, the three-character code which represents the grammatical class. So NN1 indicates a singular common noun while NN2 is a plural common noun. However, part-of-speech information does not always provide the fullest picture of differences in word usage. We may also want to distinguish between semantic differences, e.g. the difference between ‘a hits album’ or ‘hits to the body’. In this case, semantic tags can be applied to words in order to draw out crucial differences in word meaning. And other types of tags can be assigned where necessary, for example in spoken files, to indicate different speakers or types of non-linguistic or prosodic information. When I was building a corpus of personal adverts, I used tags to distinguish between the information which described the advertiser and the text which was related to the sort of person being sought (Baker 2005). A problem with annotation is that carrying it out by hand can be a painfully slow (and often error-prone) process. For this reason, computer programs have been developed which accomplish certain forms of tagging automatically. Some corpus software, like Sketch Engine and #LancsBox, automatically assigns part-of-speech tags to any corpora that are uploaded into it, while Wmatrix (see Chapter 7) enables both part-of-speech and semantic annotation of corpus files. The accuracy rates of automatic taggers rarely reach 100 per cent though. The more predictable a text is in terms of following grammatical rules and having a low proportion of rare words, the greater the likelihood that the tagger will have a high accuracy rate. Grammatically unpredictable texts (such as conversations which contain repetitions and false starts) or texts containing jargon or slang
(which might not be recognised by the tagger’s lexicon) may result in higher error rates. Some taggers attempt to circumvent this problem by incorporating ‘portmanteau tags’, a combination of the two (or more) most likely tags in ambiguous cases. For large corpus building projects, part of the corpus annotation will sometimes be checked and hand-corrected. But even with an unchecked corpus, part-of-speech tagging can still be very useful. A final stage of corpus annotation usually involves some form of validation, by running the corpus through a piece of software which checks the encoding and alerts the researcher to any syntax errors or problems found, such as missing < symbols or incorrectly nested tags. Annotation may sound like a great deal of work, and it is not necessary to carry out tagging for the sake of it. Instead, corpus builders need to think about what sort of research questions they intend to ask of their corpus, and then decide whether or not particular forms of tagging will be required. It is always possible to go back to the building stage at a later point in order to carry out new forms of annotation, once the need has been established. The important point to take away from this section is that different forms of annotation are often carried out on corpora and can result in more sophisticated analyses of data, but that this is not compulsory.
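The validation step described above can be approximated with a few lines of code. This is only a sketch: it uses Python's standard XML parser to test whether a file's mark-up is well formed (every tag opened, closed and correctly nested), which is a weaker check than validating against a full annotation scheme. The function name is invented for illustration.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_string):
    """Return (True, "") if the mark-up parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError as error:
        return False, str(error)
    return True, ""


examples = {
    "correct": "<text><s>Sun and sea.</s></text>",
    "badly nested": "<text><s>Sun and sea.</text></s>",
    "missing < symbol": "<text>Sun and sea./text>",
}

# Report which samples would need correcting before analysis.
for label, xml_string in examples.items():
    ok, message = check_well_formed(xml_string)
    print(label, "OK" if ok else "ERROR: " + message)
```

In practice the same function could be applied to each file in a corpus folder, so that problem files are listed before any frequency counts are made.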
Using a reference corpus

I discussed reference corpora at the beginning of this chapter (and described the British National Corpus and the Brown family of corpora as some examples of reference corpora). Obtaining access to a reference corpus can be helpful for two reasons. First, reference corpora are large and representative enough of a particular genre of language that they can themselves be used to uncover evidence of particular discourses. For example, in Chapter 5 I use the BNC 1994 to examine representations of refugees. Secondly, a reference corpus acts as a good benchmark of what is ‘normal’ in language, against which your own data can be compared. So in other chapters in the book I have built smaller specialised corpora, such as a corpus of holiday leaflets in Chapter 4, where I used the BNC 1994 to explore hypotheses about language that I formed when looking at how words are used in the specialised corpus. Additionally, reference corpora can be useful in terms of showing us typical uses of language, which can then be compared against the findings in the specific corpus we are interested in. In Chapter 6
I compared collocates of the word America in a corpus of terrorist propaganda, alongside its collocates in the BNC 1994, to show how the terrorist propaganda reconceptualises America by linking it to a completely different set of associations to the ones found in the BNC 1994. We can also compare a large reference corpus to a smaller corpus (or even a single file) in order to examine which words occur in the smaller text more frequently than we would normally expect them to occur by chance alone. Obtaining such a list of keywords from a file is a useful technique in the examination of the discourses evoked within it (see Chapter 7). Additionally, reference corpora may help us to test our theories. So I might hypothesise that a certain word occurs in a text in order to achieve a certain stylistic effect, e.g. by sounding typically scientific or informal or masculine, which may be contributing towards the discursive construction of the reader’s or writer’s identity in some way. By looking at the frequency of such a term in a reference corpus, investigating exactly what genres it is likely to occur in, what sort of people use it or what its associations are, I can start to provide evidence for this hypothesis. So clearly, access to a reference corpus is potentially useful for carrying out discourse analysis, even if this corpus is not the main focus of the analysis. Therefore, a pertinent question to ask at this stage is, which reference corpus is applicable to use? As a rule of thumb, we should at least try to obtain reference corpora which reflect some aspect of the smaller corpus or text sample we are studying. For example, if we have obtained a text that was created in the 2020s by British authors, then we would preferably want the reference corpus to also have been created around the same time and contain mainly (or all) British English – in which case, the one-million-word BE21 Corpus would probably suffice. 
If such a corpus is not forthcoming, then trying to find as close a match as possible is preferable. It is, of course, possible to use a completely different reference corpus, but any interesting differences that are thrown up by comparing a smaller text or corpus with a reference corpus may be due to differences in language variety rather than telling us anything of note about the smaller text. So for example, using LOB (1960s British English) as a reference corpus to examine aspects of a British text created in 2021 is likely to reveal diachronic linguistic differences between the early 1960s and 2021. So we may find that modal verbs occur less often in the recent text when compared to LOB. However, this is more likely due to the fact that modal verb use has decreased over time in British English (Leech 2003) rather than telling us something interesting about our particular text under analysis.
So care should be taken when comparing texts with reference corpora to ensure that findings are not due to diachronic or synchronic differences between the two. At times, a perfectly matching reference corpus may not be forthcoming, and in cases like this, a ‘best fit’ will have to suffice, although this needs to be made clear when reporting findings. Another, perhaps more problematic issue, is to do with gaining access to corpora. While such corpora are in existence, they can be expensive to obtain (although much less expensive than building them). Researchers who have funding or are aligned to university departments with spare funds (and a positive attitude towards corpus-based research) will be at an advantage. Some corpus builders allow users limited access for a trial period before buying, or will offer a smaller ‘sample’ proportion of their corpus free of charge. And because large reference corpora can require a great deal of computer processing power to analyse, they are increasingly offered via internet-based platforms, whereby users can access a website which allows concordances, collocations, etc. to be carried out. An advantage of such standalone platforms is that only an account is purchased and corpora do not fill up one’s hard drive or have to be loaded into the software each time. However, this also means that the corpus cannot be easily ported to another software program, so we may have to put up with the affordances of a single tool.
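The comparison of a smaller corpus against a reference corpus, mentioned earlier in this section, can be sketched numerically. The text does not say at this point which statistic is used to identify keywords (that is left to Chapter 7), so the log-likelihood measure below is an assumption chosen because it is widely used for this purpose; the function name and the frequencies are invented for illustration.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood score for one word in a study corpus vs a reference corpus.

    The score is 0 when the word is equally frequent, relative to corpus size,
    in both corpora, and grows as the relative frequencies diverge.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Frequencies we would expect if the word were spread evenly by corpus size.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    for observed, expected in (
        (freq_study, expected_study),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Invented figures: a word occurring 50 times in a 1,000-word text
# and 100 times in a 10,000-word reference corpus.
print(log_likelihood(50, 1_000, 100, 10_000))

# The same relative frequency in both corpora gives a score of zero.
print(log_likelihood(10, 1_000, 100, 10_000))
```

A high score only shows that the difference in frequency is unlikely to be due to chance; as the discussion above stresses, whether it reflects the text itself or a mismatch in language variety still has to be judged by the researcher.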
Conclusion

This chapter has raised a number of issues, both theoretical and practical, concerned with building or obtaining corpora. While there are certain procedures that do need to be followed (e.g. taking into account representativeness or gaining permissions), the researcher ultimately has freedom in choosing what type of corpus to build or use, how to encode the data into an electronic format and how to annotate the text (if at all). Having come this far, it is now time to turn to the remaining part of the book, which is more concerned with analysis, rather than building corpora or justifying their use at all. The next five chapters each consider different methodological procedures of using corpora to analyse discourses, covering a range of different types of data, as well as providing a more critical account of the possible problems that could be encountered. Each chapter gradually builds on the information covered in the others. So with that in mind,
I begin with what is usually considered to be the least complex corpus-based process, analysing frequency data.
Further reading

Hardie, A. (2014), ‘Modest XML for Corpora: Not a standard, but a suggestion’, ICAME Journal 38: 73–103. A journal article providing a sensible set of guidelines for XML tagging of corpora.

Wynne, M. (ed.) (2005), Developing Linguistic Corpora: A Guide to Good Practice, Oxford: Oxbow Books. A more detailed account of different aspects of corpus building including annotation, metadata, character encoding, archiving and distribution.
Questions for students

1 You have decided to carry out a study of the way that people receiving government benefits are represented in newspaper discourse and you want to build a specialised corpus of articles which contain references to such people. You have access to a large database of all UK newspaper articles from the last year, and you can specify search terms in the database to collect certain articles.
(a) How would you develop search terms to collect relevant articles (and not miss any)? Think about good and bad potential search terms.
(b) Some articles mention government benefits just once and in a way not central to the news story. Is it worth collecting those? Is it a good idea to only collect articles that mention government benefits in the headline of the article?
(c) Imagine you have a corpus that contains two billion words of articles about government benefits. What are the pros and cons of having such a large amount of data? What about a corpus of just fifty articles? Is it worth using corpus methods on it?

2 What ethical and copyright issues would need to be considered when collecting a corpus of (a) suicide notes, (b) racist online forum postings, (c) blog posts from LGBTQ+ people around the world?

3 You have built a 100,000 word corpus of British political speeches made between 1945 and 2000 and want to compare it to a reference corpus to find out what is distinctive about it. What are the pros and cons of using the following reference corpora?
(a) The AmE06 (one million words of American English from around 2006).
(b) A 100,000 word reference corpus of British political speeches made between 2001 and 2021.
(c) A ten million word corpus of British newspaper articles published between 1945 and 2000.
4 Frequency, Dispersion and Distribution

Introduction

Frequency is one of the most central concepts underpinning the analysis of corpora. However, its importance to this field of research has resulted in one of the most oft-heard misconceptions of corpus linguistics – that it is ‘only’ a quantitative methodology, leading to a list of objections: frequencies can be reductive and generalising, they can over-simplify, and their focus on comparing differences can obscure more interesting interpretations of data. It is the intention of this chapter to introduce the reader to the uses (and potential pitfalls) of frequency – as one of the most basic aspects of corpus linguistics it is a good starting point for the analysis of any type of corpus. Used sensitively, even a simple frequency list can illuminate a variety of interesting phenomena. This chapter examines how frequency can be employed to direct the researcher to investigate various parts of a corpus, how measures of dispersion and distribution can reveal trends across texts and how, with the right corpus, frequency data can help to give the user a sociological profile of a given word or phrase, enabling greater understanding of its use in particular contexts. None of these processes are without problems, however, so the potential shortcomings of frequency data, and possible ways to overcome them, are also discussed here. This chapter also provides a grounding for the later methodological chapters – without discussing frequency it is more difficult to understand the concepts of collocation and keywords. Frequency is of interest to discourse analysis because language is not a random affair. As we will see in Chapter 6, words tend to occur in relationship to other words, often with a remarkable degree of predictability.
Languages are rule-based – they consist of thousands of patterns which indicate what can and cannot be said or written at any given point. However, despite these rules, people usually have some sort of choice about the sort of language that they can use. This argument is summed up by Stubbs (1996: 107), who writes, ‘No terms are neutral. Choice of words expresses an ideological position.’ It is the tension between these two states – language as a set of rules vs language as free choice – that makes the concept of frequency so important. If people speak or write in an unexpected way, or make one linguistic choice over another, more obvious one, then that reveals something about their intentions, whether conscious or not. For example, as Zwicky (1997: 22) points out, one contested choice is between the use of gay as opposed to homosexual. Other choices could include the use of euphemisms (that way inclined, confirmed bachelor, etc.), derogatory terms (faggot, sissy) or reclaimings of such terms (queer, dyke). Danet (1980) describes a case where a doctor who carried out a late abortion was tried for manslaughter. The language used in the courtroom was an explicit concern of the trial, with lawyers negotiating the different connotations of terms such as products of conception, fetus, male human being, male child and baby boy. The choice of such terms assumes different frames of reference, e.g. baby boy suggests helplessness, whereas fetus expresses a medical position. However, choice need not be lexical. Another type of choice relates to grammatical uses of words: the use of gay as a noun or gay as an adjective. I would argue that ‘he is gay’ is somewhat less negatively biased than ‘he is a gay’. The adjectival usage suggests a person is described as possessing one trait among many possible attributes, whereas the noun usage implies that a person is merely the sum total of their sexuality and no more. 
In describing the analysis of texts using critical discourse analysis, Fairclough (1989: 110–11) lists ten sets of questions relating to formal linguistic features, each one indicating that a choice has been made by the author. These features include pronoun use, modality, metaphors, agency, passivisation and nominalisation. While Fairclough frames his questions in terms of appearing within specific texts, there is no reason why such an analysis of linguistic choices cannot be carried out on a corpus in order to uncover evidence for preference which occurs across a genre or language variety. However, as Sherrard (1991) argues, such a view of choice presupposes that language users feel that they actually have a choice or are aware that one exists. She points out that speakers will always be restricted in their ways of using language – for example, people in the 1950s would not have used a
term like Ms because such a choice was not available to them. It is therefore important that as researchers we are aware of the range of possible choices open to language users and interpret their decisions accordingly. Related to the concept of frequency are those of dispersion and distribution. As well as knowing that something is (or is not) frequent in a text or corpus, being able to determine where it occurs can also be extremely important. Corpora are often made up of numerous texts and simply considering the frequency of a linguistic item might obscure the fact that it is actually quite poorly distributed, perhaps only occurring in a very small number of texts. This might be interesting in and of itself, helping us to identify a set of atypical texts in a corpus, but it means that we should not over-generalise our claims about that word as applying to the whole corpus. Additionally, texts have beginnings, middles and ends and narrative structures are usually imposed on them. It may be relevant to know that a particular word form is more frequent at the start of a text than at the end. It may be useful to ascertain whether its occurrences are all clumped together in one small section of the corpus, or whether the word is a constant feature, cropping up every now and again with regularity. Dispersion analyses are one way that we can take into account the fact that texts are discrete entities within themselves, and they also allow us to begin to consider context, albeit in a rather impressionistic way, although the relevance of context is something which will be developed in further chapters as we proceed. Distribution is a related concept to dispersion, used here to refer to the extent that a linguistic item occurs across different types of texts. 
So dispersion is concerned with whether mentions of an item all occur in the same part of a text or are well spread out across it, and whether such mentions appear at the start, middle or end, whereas distribution considers the extent to which the item occurs across multiple texts, e.g. does it occur more in spoken texts as opposed to written ones?
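The distinction between the two concepts can be illustrated with a short sketch. It assumes that each text has already been split into a list of tokens; the function names, the choice of three segments (start, middle, end) and the toy data are all invented for illustration and do not correspond to a particular corpus tool.

```python
def dispersion(tokens, word, segments=3):
    """Count occurrences of word in each equal-sized segment of one text.

    With segments=3 the result is [start, middle, end].
    """
    counts = [0] * segments
    for position, token in enumerate(tokens):
        if token == word:
            # Work out which segment this position falls into.
            counts[position * segments // len(tokens)] += 1
    return counts


def distribution(corpus, word):
    """Return the frequency of word in every text of a corpus.

    corpus maps a text name to its list of tokens.
    """
    return {name: tokens.count(word) for name, tokens in corpus.items()}


# Toy data, invented for illustration.
leaflet = "sun sea sun sand bar bar club sun bar".split()
print(dispersion(leaflet, "sun"))

toy_corpus = {
    "a.txt": "bar sun bar".split(),
    "b.txt": "sea sand".split(),
    "c.txt": "bar club".split(),
}
print(distribution(toy_corpus, "bar"))
```

The first function looks inside a single text, while the second looks across texts, so a word can be well dispersed within one file and still poorly distributed across the corpus as a whole.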
Join the club

In order to demonstrate some possible uses of frequency, dispersion and distribution I am going to describe how they can be employed on a small corpus of data which consists of twelve leaflets advertising holidays. These leaflets were produced by the British tour operator Club 18–30, which was established in the 1960s and describes itself on its website as being ‘all about having a positive attitude. It’s about the clothes that you wear, the music you listen to and the places you go but more importantly it’s about you.’
The holiday leaflets were all published in 2005 and my goal in collecting them was to investigate discourses of tourism within them. As well as considering the leaflets as a single corpus, I was also interested in whether there was any variation between them – so were there different discourses to be found or was this a relatively homogenous corpus with a great deal of lexical and discursive repetition? Morgan and Pritchard (2001: 165) report how in 1995 (ten years prior to my study) Club 18–30 used a campaign of sun fun and sex . . . Posters were launched carrying slogans such as ‘Discover your erogenous zone’, ‘Summer of 69’ and ‘Girls, can we interest you in a package holiday?’ with a picture of a man, described by The Times as wearing ‘a well-padded pair of boxer shorts’. In addition to this high profile poster campaign (which was later extended to cinemas), the company continued the sexual theme in the pages of their brochures.
However, these kinds of holidays were subsequently criticised in the media. For example, in August 2005, in the UK, Channel 5 broadcast a television programme called ‘The Curse of Club 18–30’ which featured interviews by people who had not enjoyed their holiday, including a man who had been temporarily blinded at a foam party and a girl who had had her head shaved by a gang of drunken men. The Telegraph reported a story in 2003 about a Club 18–30 resort in Faliraki that had been reviewed by holiday goers as being a place to avoid if you are over thirty, teetotal or not promiscuous. Perhaps as a response to these and other criticisms, the Club 18–30 website included a section on ‘myths’ about the holiday operator. One myth reads: ‘All Club 18–30 holidays are about being forced to drink too much and having sex. I’ve heard that the reps shove drink down your throat and make sexual innuendos all the time!’ The Club 18–30 response was: ‘We are not moral guardians – what you do on your holiday is 100% down to you. If you want to drink on holiday, you drink, if you don’t, you don’t – simple as that! Young people drinking abroad is not exclusive to Club 18–30.’ I decided that it would be interesting to consider the extent to which Club 18–30 holiday brochures refuted or supported the ‘myths’ of alcohol consumption and promiscuity. Holiday brochures are an interesting text type to analyse because they are an inherently persuasive form of discourse. Their main aim is to ensure that potential customers will be sufficiently impressed to book a holiday: ‘The drive to create impactful and effective advertising still remains a major advertising challenge, despite the development of sophisticated advertising tracking and evaluation techniques’ (Morgan and Pritchard 2001: 17).
However, as with most advertising texts, writers must take care to engage with what can be a diverse audience in an appropriate way, for example by deciding what aspects of the holiday are foregrounded (or backgrounded) and what assumptions are made about the interests and lifestyles of the target audience. Language, therefore, is one of the most salient aspects of persuasive discourses. The message must be sufficiently attractive for the potential audience to want to engage in something that will essentially result in a financial exchange.

The corpus used in this study involves text taken from a Club 18–30 brochure published in 2005. The brochure contains pages on eleven different locations for holidays so the text from each location was saved as a separate file, while the introduction pages to the brochure were included as an additional file. Table 4.1 shows the filenames and word counts of the twelve leaflets used in the holiday corpus. It can be noted from the filenames of the leaflets (which are derived from their titles), that the leaflets mainly advertise holidays on Spanish and Greek islands. Twelve texts consisting of 17,859 words is very small by corpus linguistics standards. However, when the texts are similar to one another and involve quite a restricted use of language, based around the same style and topic, it is likely that linguistic features will be repeated to the extent that it is possible to identify patterns.

Table 4.1 Holiday leaflets

Filename        Words
cancun.txt        755
corfu.txt       1,257
crete.txt       1,995
cyprus.txt      1,145
gran.txt        1,089
ibiza.txt       2,272
intro.txt       3,637
kos.txt         1,126
mallorca.txt    1,764
rhodes.txt      1,206
tenerife.txt      918
zante.txt         695
Total          17,859

While the analysis of this corpus focusses on the language used in the leaflets, there is also a visual aspect to the holiday leaflets which also plays an important role in how the leaflets are understood and the discourses contained in them. As Hunston (2002: 23) points out, a corpus cannot show that a text does not consist of words alone but is encountered by audiences within its visual and social context (see Kress and van Leeuwen 1996, Kress 1994). As well as drawings, diagrams and photographs, the text itself is not ‘plain’. The font, colour, size and positions of text also play a role in how tourist discourses are likely to be interpreted. Therefore, although this chapter focusses on the analysis of the electronically encoded text in the corpus of leaflets, visuals are also an important part of these leaflets and ought to be examined in conjunction with the corpus-based findings (something I return to at the end of this chapter).
Frequency counts

Using the corpus analysis software WordSmith (version 7), a word list of the twelve text files was obtained. A word list is a list of all of the words in a corpus along with their frequencies and the percentage contribution that each word makes towards the corpus. Considering frequency in terms of proportions is often a more sensible way of making sense of corpus data, particularly when comparisons between two or more data sets of different sizes are made. While WordSmith standardises frequencies in terms of percentages, other corpus tools do this slightly differently. An online tool called CQPweb (which we will look at in the next chapter), standardises frequencies by giving them as occurrences per million words. If your corpus is bigger than a million words, then this is fine. But if you are working with a smaller corpus, then giving frequencies per million words is rather misleading – as you do not have a million words, the count is an extrapolation. When working with three corpora (ranging between 480,000 and 1.7 million words) in Baker et al. (2021), we therefore gave frequencies in terms of per 100,000 words. The formula for calculating a standardised frequency (in occurrences per million words) is as follows:

standardised frequency = (frequency of the item ÷ total number of words in the corpus) × 1,000,000
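A standardised frequency is conventionally obtained by dividing the raw frequency of an item by the size of the corpus and multiplying by the chosen base. The sketch below assumes that convention; the function name and the example figures are invented for illustration, and the base can be switched between one million, 100,000 or 100 (for a percentage).

```python
def standardised_frequency(raw_frequency, corpus_size, per=1_000_000):
    """Scale a raw frequency to occurrences per `per` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be a positive number of words")
    return raw_frequency / corpus_size * per


# Invented example: a word occurring 150 times in a 480,000-word corpus.
raw, size = 150, 480_000

print(standardised_frequency(raw, size))               # per million words
print(standardised_frequency(raw, size, per=100_000))  # per 100,000 words
print(standardised_frequency(raw, size, per=100))      # as a percentage
```

Choosing a base no larger than the corpus itself avoids the extrapolation problem described above, which is why per 100,000 is the more cautious choice for this invented corpus.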
Figure 4.1 shows the output of the word list in WordSmith. The output consists of a window with several tabs that can be clicked on to provide additional information. The main tab is frequency. This tab shows frequencies in terms of the highest first, whereas the alphabetical tab re-orders the word list alphabetically. You might notice that the second most common word is #
Frequency, Dispersion and Distribution
Figure 4.1 Wordlist output of WordSmith.
while the tenth most common is s. This is not an error but due to the way that WordSmith processes text. Its default settings simply group together all numerals (e.g. 5 but not five) into the category # (it is possible to change this if it is important that you distinguish numerals separately). Additionally, in this word list WordSmith has treated the apostrophe character in words like there’s as a space, thus splitting this word into two separate words, there and s. In fact, the default setting of WordSmith is to recognise apostrophes (as long as they are encoded like this ʹ) and treat them as existing within words. However, when I created my corpus from pdf files, apostrophes were rendered as ’ as opposed to ʹ, which meant that WordSmith viewed them as spaces. It only takes a few seconds to go into WordSmith’s settings and add the ’ character to the list of characters that need to be included as part of words. If I wanted, I could also select an option to count numerals as words. There is no single ‘best’ way of doing this; it depends on what is in our corpus and what we are interested in counting. However, it is worth bearing in mind that different pieces of corpus software will have different ways of counting words, and this is one of the reasons why corpus linguists sometimes use the term tokens instead of words. Returning to Figure 4.1, the statistics tab, if clicked on, provides statistical information about the corpus, including the total number of words (tokens), in this case 17,869, the number of distinct words (called types) and the type/token ratio, which is the number of types divided by the number of tokens expressed as a percentage. For example, the word you occurs 348 times in the
corpus, although it only consists of one type of word. We could also refer to the type/token ratio as ‘the average number of tokens per type’. A corpus or file with a low type/token ratio will contain a great deal of repetition – the same words occurring again and again, whereas a high type/token ratio suggests that a more diverse form of language is being employed. Type/token ratios tend to be meaningful when looking at relatively small text files (say under 5,000 words). However, as the size of a corpus grows, the type/token ratio will almost always shrink, because high frequency grammatical words like the and to tend to be repeated no matter what the size of the corpus is. Because of this, large corpora almost always have very low type/token ratios and comparisons between them become difficult. Therefore, WordSmith also calculates what is called a standardised type/token ratio, which is based on taking the type/token ratio of the first 2,000 words in the corpus, then the type/token ratio of the next 2,000 words and the next 2,000 words after that, and so on, and then working out the mean of all of these individual type/token ratios. This standardised type/token ratio is almost invariably higher as a result and is a better basis for comparison between corpora. For example, the type/token ratio of our corpus is 12.49, whereas the standardised type/token ratio is 37.67. Why is the type/token ratio useful? It can give an indication of the linguistic complexity or specificity of a file or corpus. A low type/token ratio is likely to indicate that a relatively narrow range of subjects is being discussed, which can sometimes (but not always) suggest that the language being used is relatively simplistic.
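The token-counting choices and the two ratios described above can be sketched as follows. This is a minimal illustration rather than a description of WordSmith's internals: the function names are hypothetical, the tokeniser only handles unaccented letters and digits, and dropping a final chunk shorter than 2,000 tokens is an assumption about how the standardised ratio is computed.

```python
import re
from typing import List


def tokenise(text: str, keep_apostrophes: bool = True, group_numerals: bool = True) -> List[str]:
    """Split text into lower-cased tokens (ASCII letters and digits only, for simplicity).

    keep_apostrophes: treat ' and ’ as word-internal, so there’s stays one token.
    group_numerals:   replace every all-digit token with the placeholder '#'.
    """
    pattern = r"[a-z0-9]+(?:['’][a-z0-9]+)*" if keep_apostrophes else r"[a-z0-9]+"
    tokens = re.findall(pattern, text.lower())
    if group_numerals:
        tokens = ["#" if tok.isdigit() else tok for tok in tokens]
    return tokens


def type_token_ratio(tokens: List[str]) -> float:
    """Number of types divided by number of tokens, as a percentage."""
    return 100 * len(set(tokens)) / len(tokens) if tokens else 0.0


def standardised_ttr(tokens: List[str], chunk_size: int = 2000) -> float:
    """Mean type/token ratio over consecutive full chunks of `chunk_size` tokens.

    Assumption: a trailing chunk shorter than `chunk_size` is ignored.
    """
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    if not chunks:  # text shorter than one chunk: fall back to the plain ratio
        return type_token_ratio(tokens)
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


# Invented example sentence, not taken from the holiday corpus.
sample = "There’s a bar, there’s a pool and there are 5 bars."
print(tokenise(sample))
print(tokenise(sample, keep_apostrophes=False))
print(round(type_token_ratio(tokenise(sample)), 2))
```

Running the tokeniser with and without keep_apostrophes shows how a single setting changes both the token count and the word list, which is the reason token counts from different tools are rarely identical.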
For example, the standardised type/token ratio in the FLOB corpus of British English (which contains a range of written texts from different genres) is 45.53, whereas the same figure in a sample of transcribed informal spoken conversations from the British National Corpus is 32.96, reflecting the fact that written language tends to contain a higher proportion of unique words, whereas informal spoken language is more lexically repetitive. Comparing this to the standardised type/token ratio of the holiday leaflets (37.67), we find that it falls somewhere in between spoken and written language, a fact we may want to keep in mind for later. Clearly, however, the type/token ratio gives only the briefest indication of lexical complexity or specificity and further investigations are necessary. Looking at Figure 4.1 again, there are several columns of numbers. The Freq. column gives the actual frequencies (sometimes called raw frequencies), whereas the % column gives the overall proportion that a particular word contributes towards the whole corpus. So the word you occurs 348 times and
occurrences of this word contribute to 1.95 per cent of the corpus overall. As noted earlier, when comparing multiple corpora of different sizes it is useful to refer to the % column as this presents a standardised value. However, for the purposes of this chapter, we are only using one corpus, so we can stick with raw frequencies. From Figure 4.1 it becomes apparent that the most frequent words in the corpus are grammatical words (also known as function words). Such words belong to closed grammatical classes, each consisting of a small number of high frequency words (pronouns, determiners, conjunctions, prepositions), and these categories tend not to be subject to linguistic innovation – we do not normally invent new conjunctions or pronouns as a matter of course. With few exceptions, almost all forms of language have a high proportion of grammatical words.1 However, in order to determine whether any of these words occur more often than we would expect, it can be useful to compare our holiday corpus to a corpus of general language such as the British National Corpus 1994. Table 4.2 shows the most frequent ten words in the holiday corpus and their equivalent proportions in the whole BNC and its written and spoken components (I have not included the tokens # or s). Some words in the holiday corpus have similar rates of occurrence in the spoken section of the BNC (to, of, in); other words have higher rates compared to the BNC, including those which generally occur more in the written section (and, for) or the spoken section (on, all). The use of you in the holiday corpus is closer to the figure for the spoken section of the BNC than it is for the written. This suggests that although the texts consist of written documents, there are at least some aspects of the language within them that are similar to spoken language. The high use of you (the only pronoun to appear in the top ten) also suggests a personal style of writing, where the writer is directly addressing the reader. Comparing frequencies of function words can be useful in terms of discerning the register of a text. For example, Biber, Conrad and Reppen (1998) show how frequencies of different function words (among other things) can be used to categorise texts across a range of five stylistic dimensions (for example Dimension 1 is ‘involved vs informational production’).

Table 4.2 Percentage frequencies of the ten most frequently occurring words in the holiday corpus and their equivalencies in the BNC

Rank  Word  Holiday leaflets (%)  BNC (%)  BNC written texts only (%)  BNC spoken texts only (%)
1     the   5.55                  6.20     6.46                        3.97
2     and   3.62                  2.68     2.70                        2.53
3     to    2.64                  2.66     2.70                        2.26
4     a     2.44                  2.21     2.24                        1.99
5     of    1.96                  3.12     3.29                        1.69
6     you   1.95                  0.68     0.45                        2.59
7     for   1.38                  0.90     0.93                        0.64
8     in    1.37                  1.97     2.05                        1.37
9     on    1.15                  0.74     0.74                        0.78
10    all   1.04                  0.28     0.26                        0.42

However, perhaps we can get a better idea of discourses within the corpus if we put the grammatical words to one side for the moment, and only consider the most frequent lexical words and terms (e.g. the nouns, verbs, adjectives and lexical adverbs). Table 4.3 shows what the top ten looks like if we do this. This table gives us a much better idea of what the corpus is about. There are words describing holiday residences (studios, facilities, apartments) and other attractions (beach, pool, club, bar). One aspect of the table worth noting is that bar and bars both appear in the top ten list. Added together, their total frequency is 173, meaning that the noun lemma bar would potentially be the most common lexical term in the corpus. A lemma is the canonical form of a word. Francis and Kučera (1982: 1) define it as a ‘set of lexical forms having the same stem and belonging to the same major word class, differing only in inflection and/or spelling’. Lemmatised forms are sometimes written in small capitals, e.g. the verb lemma walk consists of the verb forms walk, walked, walking and walks.
Table 4.3 The most frequent ten lexical words in the holiday corpus

Rank  Word        Frequency
1     beach       124
2     pool        122
3     studios     116
4     sleep       107
5     club        99
6     facilities  96
7     bar         94
8     private     87
9     bars        79
10    apartments  78
Table 4.4 The most frequent lexical lemmas in the holiday corpus

Rank  Lemma      Frequency
1     bar        173
2     club       144
3     beach      136
4     pool       128
5     studio     116
6     facility   96
7     apartment  80
8=    balcony    78
8=    day        78
10    night      70
Considering that bar is so frequent, it is possible that other lemmas may have important roles to play in the corpus. Therefore, Table 4.4 shows the top ten recalculated to take into account the most frequent lexical lemmas. The ordering of the lexical lemmas list is different to the simple frequency list. Now the two most frequent items are bar and club, which give us a clearer idea about the focus of the holiday brochures. The other words in Table 4.4 are similar to those in Table 4.3, although we now also find day and night appearing as frequent lexical lemmas. Obviously, at this stage, while we can make educated guesses as to why and in what contexts these terms appear (for example, we would expect bar to usually, if not always, refer to contexts of consuming alcohol, rather than, say, an iron bar), it is always useful to verify these guesses using other techniques. In Chapters 5 and 6 I look at two ways to do this (by using concordances and collocates), but in this chapter I am going to focus on frequent clusters (see below). So now that we have a list of frequent terms, what can we do with them, and how can this information be used to tell us anything about discourses within the corpus? Let us examine bar in more detail.
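The step from word forms to lemmas amounts to adding together the frequencies of forms that share a lemma. The sketch below uses some of the word-form frequencies from Table 4.3 and a small hand-made lookup table; the lookup and the function name lemmatise_counts are hypothetical, and a real lemmatiser would also take word class into account, which this sketch ignores.

```python
from collections import Counter

# Word-form frequencies taken from Table 4.3 (only some rows are used here).
form_counts = Counter({"beach": 124, "pool": 122, "studios": 116,
                       "club": 99, "bar": 94, "bars": 79})

# Hypothetical hand-made lookup from inflected form to lemma.
# Forms missing from the lookup are treated as their own lemma.
lemma_of = {"bars": "bar", "studios": "studio"}


def lemmatise_counts(counts: Counter, lookup: dict) -> Counter:
    """Sum the frequencies of all word forms that map to the same lemma."""
    lemma_counts = Counter()
    for form, freq in counts.items():
        lemma_counts[lookup.get(form, form)] += freq
    return lemma_counts


lemma_counts = lemmatise_counts(form_counts, lemma_of)
for lemma, freq in lemma_counts.most_common():
    print(lemma, freq)
```

Because only the forms listed in Table 4.3 are included, the totals for lemmas such as club are lower here than in Table 4.4, where every inflected form in the corpus has been counted.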
Considering clusters

How are bars described in the corpus? In order to answer this question we need to consider frequencies beyond single words. Using WordSmith it is possible to derive frequency lists for clusters of words. A cluster is a fixed sequence of two or more tokens. Related terms to cluster include lexical bundles, chunks,
multi-word sequences, lexical phrases, n-grams, formulas, routines, fixed expressions and prefabricated patterns. These terms have similar meanings but are sometimes used in distinct ways. A lexical bundle, for example, usually only involves sequences of words that occur above a certain frequency. As WordSmith refers to these sequences as clusters, that is the term we will use in this chapter. Creating a ‘word list’ of clusters in WordSmith is slightly complicated. An ‘index’ needs to be created. This is a special type of word list which contains information about the order of words in relation to each other. Once this has been done, we can calculate clusters. For our purposes, we will specify the cluster size as three, which results in the list in Figure 4.2. Using this word list, it is possible to search for clusters which contain the term bar or bars in order to uncover the ways that they are used. An examination of three-word clusters reveals some of the most common patterns: bars and clubs (19 occurrences), loads of bars (6), the best bars (6), bar serving snacks (6), pool and bar (4). There are a number of other clusters which emphasise the number of bars: heaps of bars, plenty of bars, never-ending stream of bars, tons of bars, variety of bars. How else are bars described? Some two-word clusters include well-stocked bar (4), vibrant bars (2), lively bars (3), 24-hour bar (3), excellent bars (4) and great bars (3). What about the second most popular lexical lemma, club? The most common two-word cluster is Club 18–30 (53 occurrences or about 37 per cent of cases of the word club). Other frequent clusters include bars and clubs (19), club tickets (10), club nights (3) and club scene (3). Two-word clusters that involve evaluation include: great club (3), best clubs (2), lively club (1), hottest club (1) and greatest clubs (1).
Figure 4.2 Wordlist of three word clusters.
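Counting clusters is a matter of sliding a window of fixed size along the token sequence and tallying each sequence that appears. The sketch below is a generic illustration and not a description of how WordSmith builds its index; the function names are hypothetical, the sample text is invented, and for simplicity clusters are allowed to run across sentence boundaries.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def clusters(tokens: List[str], size: int = 3) -> Counter:
    """Count every contiguous sequence of `size` tokens."""
    return Counter(tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1))


def clusters_containing(counts: Counter, targets: Iterable[str]) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the clusters that include any target word, most frequent first."""
    wanted = set(targets)
    hits = [(cluster, freq) for cluster, freq in counts.items() if wanted & set(cluster)]
    # Sort by descending frequency, then alphabetically to make the order stable.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Invented toy text, not the holiday corpus.
tokens = "loads of bars and clubs and loads of bars near the best bars and clubs".split()

three_word = clusters(tokens, size=3)
for cluster, freq in clusters_containing(three_word, {"bar", "bars"}):
    print(" ".join(cluster), freq)
```

Filtering the cluster list for bar and bars mirrors the manual search described above; the same two functions work for two-word clusters by setting size to 2.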
At this stage, we can say with certainty that places where alcohol can be consumed are the most frequent concept in the Club 18–30 brochures (taking priority over other places such as the beach, cultural amenities like museums, galleries or historic buildings or the apartment where people will be staying). We also know that bars are repeatedly evaluated positively (great, excellent) as well as being plentiful (heaps, plenty, never-ending, tons, variety) and busy (vibrant, lively). The frequency of such terms coupled with the use of evaluations suggests a process of normalisation: readers of the brochures will be encouraged to believe that bars are an important part of their holiday. However, we need to be careful about jumping to conclusions too early in our analysis. While bar and club are the most frequent lexical lemmas in the holiday corpus, they are also the only lemmas in the top ten that relate to alcohol. We would perhaps expect other alcohol-related words to be reasonably frequent. And sometimes what is not present in a frequency list can be as revealing as what is frequent. With this in mind, I scanned the remainder of the frequency list to see where (if at all) other alcohol-related words appeared: cocktail (3 occurrences), cocktails (3), daiquiri (1), margaritas (1), hungover (1). The single use of hungover was interesting, and I examined its occurrence in the corpus: ‘If you’re too lazy (or hungover!) to cross the road to the beach you can chill out and lounge on a sunbed round the pool area.’ However, despite this statement, which carries an assumption about the reader (that they might be hungover), the list of other alcohol-related terms is very small, while words like beer, lager, wine and alcohol do not appear at all. Perhaps then, it is the case that encouragement to consume alcohol is a Club 18–30 myth and the high frequency of bar is due to other reasons, concerned with, say, the social function of bars.
So while the findings so far are somewhat suggestive, we cannot say at this stage that the brochures actively advocate consumption of alcohol (or if they do, to what degree). In order to begin to answer this question, we need to consider another class of words: verbs. Verbs play a particularly interesting role in tourist discourses as they can often be instructional, giving advice on the sort of activities and behaviours that tourists should and should not engage in while on holiday. Omoniyi (1998: 6) refers to imperative verb phrases as invitational imperatives: the reader is invited to partake in an activity, but is under no obligation to do so. However, invitational imperatives are another way that norms can be imposed in tourist discourses. What verbs, then, are the most frequently occurring in the holiday corpus? Here it is helpful to have carried out a prior part-of-speech annotation (see Chapter 3) of the corpus, so that it is possible to distinguish words which can belong to multiple grammatical categories (e.g. hits can be a noun or a verb). Table 4.5 shows the most frequent lexical verb forms in the corpus.

Table 4.5 The most frequent lexical verbs in the holiday corpus

Rank  Verb   Frequency
1     sleep  107
2     book   34
3     want   30
4     cost   29
5     work   26
6     miss   24
7     make   23
8     chill  29
9     find   18
10    relax  17

Here, overwhelmingly, it looks as if the most frequent verb is concerned with sleeping, a curious and unexpected finding. However, an examination of clusters reveals that all cases of sleep involve the phrase studios/rooms sleep 2–3/3–4, which is used to detail how many people can book an apartment or room. Similarly, the verbs book and cost are also concerned with aspects of booking a holiday or flight, rather than activities to do on the holiday. The remaining seven verbs in Table 4.5 are more interesting to examine. Frequent imperative verb clusters include don’t miss out (16), chill out (15) and make sure (12). A closer examination of chill and relax may be useful here. There are seven references in the corpus to ‘chilling out during the day’. The following sentences taken from the corpus illustrate the main pattern of ‘chilling’ as an invitational imperative:

The hotel pool is a great place to chill-out during the day before starting your night with a drink in Bar Vistanova.

The swimming pool is a great place to chill-out during the hot sunny days and the hotel bar is the ideal choice for your first drink before heading into the vibrant bars of Rhodes Town.

The Partit Apartments are right on the doorstep of Club base Casita Blanca where you can have a swim in the pool, chill out with a cold drink and grab some snacks at the bar or just lounge around in the sun and re-charge your batteries!
The 24 hour bar, Petrino’s, is usually rammed at 5am as everyone gets together to chill out after a heavy night partying!
The verb relax also follows a similar pattern:

The Green Ocean has the all the qualities of a great Club 18–30 property, with the added bonus of being the ideal place to relax and catch up on some well-earned snoozing!

The pool area is the perfect place to hang out whether you want to be active or simply relax and recover.

There is also a fantastic Vital Wellness Centre where you can relax and recharge in the spa or sauna before heading into the vibrant nightlife of Playa Del Ingles.
Chilling/relaxing is therefore suggested as a precursor to ‘starting the night’ or as something to do after a ‘heavy night partying’. It is also used to catch up on sleep, recover or recharge batteries. Therefore, the frequent use of the verbs chill and relax is attached to the presupposition that such activities will be required, due to the fact that holiday makers are expected to be out all night partying. Despite the fact that we do not find frequent imperative phrases telling the reader to party or get drunk, the relatively high number of ‘recovery’ imperatives suggests a more subtle way that alcohol consumption is normalised to the reader.
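The verb frequencies discussed in this section depend on part-of-speech annotation, because forms such as book or work can be nouns as well as verbs. The sketch below shows one way of counting only the lexical verbs in a tagged text. It rests on two assumptions that are not taken from this chapter: that each token is stored in a word_TAG format, and that lexical verb tags begin with VV, as they do in CLAWS-style tagsets. The example sentence is invented.

```python
from collections import Counter
from typing import Iterable

# Invented tagged sentence in an assumed word_TAG format with CLAWS-style tags.
# 'book' appears once as a verb (VV0) and once as a noun (NN1).
tagged = ("you_PPY can_VM chill_VV0 out_RP by_II the_AT pool_NN1 "
          "or_CC relax_VV0 and_CC book_VV0 a_AT1 book_NN1").split()


def lexical_verb_counts(tagged_tokens: Iterable[str], verb_prefix: str = "VV") -> Counter:
    """Count word forms whose tag starts with `verb_prefix` (assumed to mark lexical verbs)."""
    counts = Counter()
    for item in tagged_tokens:
        # Split on the last underscore so that words containing '_' are kept intact.
        word, _, tag = item.rpartition("_")
        if tag.startswith(verb_prefix):
            counts[word.lower()] += 1
    return counts


verb_counts = lexical_verb_counts(tagged)
print(verb_counts.most_common())
```

Without the tags, both occurrences of book would be counted together, which is exactly the kind of ambiguity that the annotation step is meant to remove. The modal can is excluded because its tag does not begin with VV.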
Concordance plots

One word in Table 4.5 is somewhat unexpected: the verb work, which occurs twenty-six times in the corpus. As a separate word it actually only occurs twelve times, but looking at the word list in alphabetical form, I noticed that the word work2live occurred fourteen times, so I have added this to the total. We would perhaps not expect work to feature too often in a corpus about holidays so it is worth exploring this word further. Not all of the citations of work actually refer to paid work. The phrase work on your tan occurs three times, for example, while work out cheaper occurs twice (referring to different holiday packages). So far we have considered the term work simply in terms of the clusters which it tends to be most (and least) often embedded in. However, another way of looking at the word is to think about where it occurs within individual texts and within the corpus as a whole. Does it generally occur at the beginning of each pamphlet,
Figure 4.3 Concordance plot of work2live in the holiday corpus.
towards the end or is it evenly spread out throughout each pamphlet? Also, does it occur more often in certain pamphlets than others? In order to answer these questions it is useful to produce a concordance plot of the term work2live across the whole corpus. A concordance plot gives a visual representation of where a search term occurs (see Figure 4.3). The first row of the concordance plot shows how work2live is dispersed across the entire corpus whereas the next ten rows each show the term’s dispersion across individual files. Two of the files, cancun.txt and gran.txt, did not contain any references to work2live, so do not appear in the plot. As well as giving the filename, we are also told the total number of words in each file, the number of ‘hits’ (occurrences of work2live) and the number of occurrences per 1,000 words in the corpus. A dispersion measure gives a value between 0 and 1 based on how well dispersed a unit is across the corpus or each individual text. This is done by dividing the corpus into eight segments (see Oakes 1998: 190–1). A dispersion measure close to 1 indicates that the unit is uniformly dispersed across the corpus or text, whereas a measure close to 0 indicates that occurrences of the unit appear very closely together, indicating ‘burstiness’ (Katz 1996). Most of the texts have a dispersion measure of 0, although this can be explained because they only have one or two mentions of work2live. To get a clearer understanding of how an item is dispersed, the figure also provides a visual representation (the plot) of where each occurrence of the phrase work2live occurs in the corpus. Each vertical black line represents one occurrence. The plot has also been standardised, so that each file in the corpus appears to be of the same length. This is useful, in that it allows us to compare where occurrences of the search term appear across multiple files.
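Both the plot and a segment-based dispersion score can be sketched from nothing more than the token positions of the hits. The example below draws a text version of one plot row and computes Juilland's D over eight equal segments. Juilland's D is one well-known measure of this kind; treating it as the formula behind the figures reported by WordSmith is an assumption, and all function names here are hypothetical.

```python
import math
from typing import List


def hit_positions(tokens: List[str], term: str) -> List[int]:
    """Return the token offsets at which `term` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def plot_row(positions: List[int], n_tokens: int, width: int = 40) -> str:
    """Text version of one concordance-plot row, scaled to a fixed width.

    Scaling every file to the same width is what makes files of
    different lengths comparable.
    """
    row = ["."] * width
    for pos in positions:
        row[min(width - 1, pos * width // n_tokens)] = "|"
    return "".join(row)


def juilland_d(positions: List[int], n_tokens: int, segments: int = 8) -> float:
    """Juilland's D: 1 means an even spread, 0 means all hits fall in one segment."""
    if not positions:
        return 0.0
    counts = [0] * segments
    for pos in positions:
        counts[min(segments - 1, pos * segments // n_tokens)] += 1
    mean = sum(counts) / segments
    # Population standard deviation of the per-segment counts.
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / segments)
    return 1 - (sd / mean) / math.sqrt(segments - 1)


# Invented file of 800 tokens with a single hit near the start.
tokens = ["work2live" if i == 5 else "x" for i in range(800)]
hits = hit_positions(tokens, "work2live")
print(plot_row(hits, len(tokens)))
print(juilland_d(hits, len(tokens)))
```

With a single hit the score is 0 (up to floating-point rounding), however long the file is, which matches the observation above that files with only one or two mentions tend to receive a dispersion measure of 0.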
Looking at the plot, we can see then that for nine of the files, the term work2live occurs very close to the beginning of each individual text. Why is
this the case? An investigation of the texts themselves is required. By taking a closer look at the beginnings of the pamphlets it becomes apparent that the majority of them begin with a short description of a (possibly fictional) holiday maker.

Clare, Bank Clerk: just worked the last 3 Saturdays – this Saturday she’s having a laugh on the beach . . . don’t miss out, work2live

Thea, Waitress: fed-up with serving food and drink to everyone else. Up for being the customer herself . . . don’t miss out, work2live

Matt, Chef: the pressure of 80 covers a night has done his swede in, now he’s letting off steam . . . don’t miss out, work2live
Therefore, the beginning of each brochure follows the same pattern: [Name of person] [Occupation] [Description of undesirable working conditions] [Description of preferable holiday] don’t miss out, work2live.
This pattern, occurring as it does at the beginning of most of the brochures and always with the same structure, illustrates the over-arching ideology of Club 18–30 as a holiday destination that offers escapism from the mundanity and grind of everyday working life. The pattern ends with two imperatives: don’t miss out and work2live. The phrase work2live therefore implies that working is not living, but a means to living – and that a Club 18–30 holiday constitutes living. So that gives us more of an idea about discourses of tourism in the Club 18–30 holiday leaflets. However, a further aspect of discourses is that they help to construct the identity of both the producer and consumer of a text. For example, the descriptions of characters employed in unfulfilling jobs and escaping to a holiday set up a process of identification for the reader, constructing expectations about why readers may want to go on holiday and what activities they will engage in on arrival. Identity construction and identification are important processes in advertising discourses. Readers may be encouraged to identify with aspirational ideals or with identity constructions which are more realistic, e.g. less aspirational but similar to themselves. In what other ways can reader identity be constructed? Blommaert (2005: 15) suggests, ‘Language users have repertoires containing different sets of varieties . . . People, consequently are not entirely free when they communicate, they are restrained by the range and structure of their repertoires, and the distribution of elements of the repertoires in any society is unequal . . . What people actually produce as discourse will be conditioned
by their sociolinguistic background.’ Therefore, we could argue that lexical choices here are related in some way to identity. On reading this chapter, you may have noticed a number of informal terms and phrases which have appeared in the holiday corpus: heaps of bars or chill out, for example. At this stage we may want to inquire why such terms are so popular in the corpus, and what contribution they make towards identity construction. In order to determine how and why this is the case, it is useful to explore this informal lexis in more detail, both in the holiday corpus and in a general corpus of spoken English.
Comparing demographic frequencies

By examining the frequency list again, I collected the most frequent informal terms in the corpus, which are presented in Table 4.6. Clearly, in compiling this table it was necessary to explore the contexts of some of these words in detail, in order to remove occurrences that were not used in a colloquial or informal way. In addition, the categorisation of a word as being ‘informal’ is somewhat subjective. For example, the most frequent informal word action occurred in phrases like ‘There’s always plenty of action here during the day’ rather than in phrases like ‘The government took action to establish her whereabouts.’ So in the holiday corpus, action refers to general excitement, activity and fun. The table also shows the frequencies per million words of the terms in the written and spoken sections of the British National Corpus.

Table 4.6 The most frequent informal terms in the holiday corpus

Rank  Term     Frequency in holiday corpus  Frequency in written BNC  Frequency in spoken BNC
1     action   30                           239.78                    87.9
2     chill    29                           8.73                      2.22
3     loads    25                           9.15                      64.21
4     mates    15                           7.42                      9.67
5     cool     12                           41.69                     17.7
6     chilled  11                           3.9                       0.58
7     massive  11                           45.48                     32.49
8     fab      10                           1.29                      0.58
9     info     10                           2.26                      0.97
10    tons     10                           12.33                     7.06
Perhaps surprisingly, most of the terms occurred proportionally more often in written rather than spoken British English (the exceptions being loads and mates). However, this is due to the fact that most of the colloquial terms in the list are derived from non-colloquial words, so would be expected to occur more often in formal, written English. For example, tons of steel (formal) vs tons of fun (informal). The spoken texts tend to contain many of the more informal meanings of the words in Table 4.6. And when the words occurred as informal in the written texts in the BNC, a high proportion of these cases referred to reported speech in novels.

So why do we find a reasonably high proportion of informal lexis in the holiday corpus? One line of enquiry I want to explore is to consider which sorts of people are more likely to use informal language in their speech, by referring to a reference corpus. Using the spoken section of the BNC, I explored the demographic frequencies of the words loads, mates, cool, massive, fab, info and tons in order to identify their distributions across different types of speakers. I did not look at chill, chilled and action as these words did not occur very often in their colloquial form in the BNC, even in spoken British English. Therefore the data would need a great deal of combing through and editing before any conclusions could be made. Table 4.7 shows the combined frequencies per million words for these seven words in terms of sex, age and social class.2

Table 4.7 Combined frequencies per million words of loads, mates, cool, massive, fab, info and tons in the BNC for age, sex and social class

Demographic   Group   Frequency per million words
Age           0–14    538.85
              15–24   615.88
              25–34   304.19
              35–44   371.6
              45–59   235.02
              60+     225.32
Sex           Male    376.75
              Female  334.65
Social class  AB      268.63
              C1      369.39
              C2      311.74
              DE      347.8

From this table it can be seen that these terms are most typically used by people in the 15–24 age group (615.88 times per million words) and least in the 60+ group (225.32 times per million words). Additionally, males use these terms slightly more than females, and social class C1 use the terms more than any other social class grouping. The strongest influence on usage seems to be age, whereas the weakest appears to be sex. However, it is important at this stage not to conclude that the most typical speaker of these sorts of words is a composite of these three demographics, e.g. a male aged 15–24 from social class C1. Different demographic factors can cause interaction effects – for example, hypothetically the high proportion of C1 speakers in the table could be due to the fact that these words are very commonly used by C1 females rather than C1 males, whereas the combined occurrences of these words from social groupings AB, C2 and DE could all consist of males only. It is therefore useful to cross tabulate the demographics in order to get a clearer picture of how they interact together (see Table 4.8).

Table 4.8 gives a much more detailed picture of the distribution of frequencies of these informal words in society. While males aged 15–24 from social class C1 use these words quite often (214.4 occurrences per million words) they are by no means the most frequent users. This distinction goes to males aged 15–24 from social class AB (702.76 occurrences per million words). Other high users are males aged 0–14 from social class C1 and females aged 15–24 from social class C2. How can we then link these findings to the presence of these sorts of terms in the holiday corpus? There are a number of possible answers. First, perhaps the texts were written by males aged between 15 and 24. This is possible, but unlikely. More likely then, is that the leaflets were written with certain social groups in mind, emulating the typical language that those groups would use themselves and therefore be familiar with.
Unsurprisingly, Club 18–30 specifically target their age demographic in their brand name, so it makes sense for the company to aim the language in their brochures at a young age group – this is supported by the high use of colloquial terms found in the 15–24 group in the BNC and in the holiday corpus. Although AB speakers, on the whole, tend to use fewer colloquial words than other groups, the exception to this is AB males aged 15–24, who use more of these colloquial terms than anyone else. It perhaps should also be noted that some of the cells in Table 4.8 contained frequencies of zero. This shows up one of the potential limitations of this sort of analysis. We should not conclude that males from social group AB instantly stop using colloquialisms once they reach the age of twenty-five, but rather that the refined data sets of different sorts of speakers are perhaps too small to draw accurate conclusions.
Table 4.8 Combined frequencies per million words of loads, mates, cool, massive, fab, info and tons in the BNC, cross-tabulated for sex, age and social class

        Males                             Females
Age     AB      C1      C2      DE        AB      C1      C2      DE
0–14    187.72  559.69  367.95  417.01    216     56.64   319.71  0
15–24   702.76  263.44  313.17  680.59    422.42  417.81  587.9   406.91
25–34   0       214.4   171.8   178.52    218.48  155.38  201.11  365.21
35–44   125.37  106.52  177.61  0         128.21  119.68  122.73  125.39
45–59   182.07  127.22  45.77   63.75     77.77   138.92  130.3   118.9
60+     43.67   0       54.35   143.82    107.24  68.47   74.4    218.73
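A cross-tabulation of this kind can be sketched by pooling hits and token counts for every combination of sex, age band and social class before normalising. The speaker records below are invented for illustration and are not BNC data, and the function name cross_tabulate is hypothetical. Keeping the token total alongside each rate makes it easier to see when a cell rests on very little speech.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Invented speaker records: (sex, age band, social class, hits, tokens spoken).
speakers: List[Tuple[str, str, str, int, int]] = [
    ("M", "15-24", "AB", 7, 10_000),
    ("M", "15-24", "AB", 0, 2_000),
    ("F", "15-24", "C2", 12, 20_000),
    ("M", "25-34", "AB", 0, 1_500),
]


def cross_tabulate(records) -> Dict[Tuple[str, str, str], Tuple[float, int]]:
    """Pool hits and tokens per (sex, age, class) cell.

    Returns a mapping from each cell to (frequency per million words, total tokens).
    """
    hits = defaultdict(int)
    tokens = defaultdict(int)
    for sex, age, social_class, n_hits, n_tokens in records:
        cell = (sex, age, social_class)
        hits[cell] += n_hits
        tokens[cell] += n_tokens
    return {
        cell: (hits[cell] / size * 1_000_000 if size else 0.0, size)
        for cell, size in tokens.items()
    }


table = cross_tabulate(speakers)
for cell, (per_million, size) in sorted(table.items()):
    print(cell, f"{per_million:.2f} per million", f"({size} tokens)")
```

In this toy data the cell for males aged 25–34 from class AB has a frequency of zero, but it is based on only 1,500 tokens from one speaker, so the zero says more about the size of the sample than about the group.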
So when looking at demographic frequency data it is important to take care before drawing strong conclusions as there are many possible factors at play. When considering different types of demographic data (e.g. age and sex and social class) it should be borne in mind that individual categories may become quite small or in some cases non-existent, meaning that results may be due to the eccentricities of a small range of speakers. It may also be necessary to take into account context such as the location of the speech (at home vs outside) or the audience (Are young people more likely to use informal language around other young people? Will the presence of older people inhibit their informal language?). Additionally, it may be necessary to take into account issues such as reported speech. One example from a young female in the BNC illustrates this: ‘she said there are loads of them on the cycle path without lights on!’ Should this use of loads count in the same way as non-reported speech? To argue that the authors of the holiday leaflets used informal language in order to index youthful identities, we need to assume that they believed that such language was typical of this identity and that the target audience would also ‘read’ the leaflets in the same way. We may have to find other examples to support our case (for example – does this type of informal language occur very frequently in magazines or television programmes aimed at young people?), and also use our judgement of the author’s own linguistic competence (will a highly literate L1 speaker be a better judge of the social nuances and stereotypical demographic distributions of language than an L2 speaker?). So I do not advocate over-reliance on demographic frequency data, nor would I recommend using it to make statements about absolute ‘differences’ in language use between social groups. What is more useful, however, is investigating how a particular word or phrase may be used in order to index
a stereotypical social identity based on age, sex or class or a combination of all three, or other factors (bearing in mind that writers/speakers and audiences may or may not all have access to the same sort of stereotypical notions of language and social identity). What we should be able to glean from the BNC spoken data though, is that the colloquialisms which co-occur in the holiday corpus are most strongly associated with young people and appear to have been used as a means of creating identification and making the message attractive to its target audience. By using a form of language which is strongly associated with youthful identities, the audience may feel that they are being spoken to in a narrative voice that they would find desirable (the voice of a potential friend or partner) or at least are comfortable with. Here it is perhaps useful to bring in additional non-corpus-based evidence, by looking at the visual aspects of the leaflets. An examination of the images used in many of the leaflets seems to support this hypothesis – many of them depict young, conventionally attractive men and women having fun, either in swimming pools or the sea, or at nightclubs or bars. Several of the pictures show young people enjoying a drink together, while one of the brochures contains a full page advertisement for the vodka-based drink WKD. Another advertisement advises holiday makers to ‘pack some condoms and always choose to use one’, while there is also a full page ‘Model search’ contest, looking for ‘3 gorgeous girls and 3 fit fellas to be our models of the year . . . All you have to do is send a full-length picture in swimwear.’ The images of happy holiday makers in the leaflets are perhaps somewhat idealised, everyone is happy, healthy and attractive; the women are all slender, the men muscular; there are no people who are overweight or wearing glasses. 
So while these images may not reflect the physical appearances of many of the potential readers of the leaflets, they do show desirable identities, suggesting to readers that these may be the types of people they will meet while on holiday. In addition, the use of colloquialisms also contributes to normalisation of certain types of youthful identities. It suggests a shared way of speaking for young people, which may not even be noticed by those whom it aims to target. However, young people who do not use informal language may be alerted to a discrepancy between their linguistic identities and those of the people featured in the brochure (and the narrative voice). In a similar way, young people reading the Club 18–30 brochure will be made aware of the implied expectations if they are to take a holiday with the tour operator – clubbing, chilling and more clubbing, which is represented as both attractive
(through the use of positive evaluation) and hegemonic (due to its repetition and high frequency in the brochures).
Conclusion
To summarise, what has the corpus analysis of the Club 18–30 leaflets revealed about discourses of tourism? The analysis of frequent lexical lemmas revealed some of the most important concepts in the corpus (bar, club, etc.) and a more detailed analysis of clusters and individual instances containing these terms revealed some of the ways that holiday makers were constructed, for example, as being interested in information about the variety and number of places to drink which are near their holiday accommodation, and likely to need periods of ‘chilling’ to recover from the excesses of the previous evening. The analysis of the concordance plot for work2live revealed how this term constituted a salient part of the overall discourse in the leaflets, being used at the start of each brochure in a repetitive structure which emphasised how working is a means to living which can be achieved by being on holiday. The leaflets did not explicitly advise holiday makers to get drunk (and elsewhere on the Club 18–30 website, accusations that tourists are encouraged to drink are dismissed as a myth). However, the analysis in this chapter suggests that there are more subtle messages at work. References to sex (another ‘myth’ according to the Club 18–30 website) also do not appear to be frequent in the leaflets; however, an analysis of the visual content suggests that the leaflets engage in sexualised representations of holiday makers, again through implicit messages. As Morgan and Pritchard (2001: 165) note, ‘The sheer dominance of these images – many of them taking up a whole page – creates the brochures’ atmosphere of sexuality.’ Perhaps, in reacting to criticism, Club 18–30 have changed the tone of their leaflets, but at the same time used more oblique references to ensure that certain types of tourist discourses remain intact. 
Finally, by investigating how high frequency informal language occurred in a reference corpus of spoken British English, we were able to gain evidence in order to create hypotheses about how the readership of the holiday leaflets were constructed. Frequency counts can be useful, but as this chapter indicates at various points, their functionality is limited. Their main use is in directing the reader towards aspects of a corpus or text which occur often and therefore may or may not show evidence of the author making a specific lexical choice over
others, which could relate to the presentation of a particular discourse or attempts to construct identity in some way. Comparing the relative frequencies in a text or smaller corpus to a reference corpus is one way of denoting whether a word occurs more or less often than expected (we will look at a more thorough way of doing this in Chapter 7). Examining frequent clusters of words or their dispersions/distributions across a text (or set of texts) may be more revealing than just looking at words in isolation, and as the course of this chapter developed it became clear that context plays an important role in the analysis of particular words, something which is difficult to achieve from looking at frequencies alone. For this reason, the following chapters expand on the notion of simple frequency to consider corpus-based analyses which take into account context. In the following chapter we consider the investigation of concordances in detail.
Step-by-step guide to frequency analysis
1 Build or obtain access to a corpus.
2 Using a corpus tool, obtain a frequency list of the corpus.
3 Identify the most frequent words – are any frequent words unexpected? Use concordances to identify how the words are used in context.
4 Consider the most frequent lexical words (with concordances).
5 Consider the most frequent lemmatised forms (with concordances).
6 Consider the most frequent clusters (with concordances).
7 Examine concordance plots of frequent linguistic forms of interest. Do particular forms consistently occur at a certain place in a text?
8 Consider the distributions of the frequent linguistic forms – are they spread evenly across a wide range of texts or repeatedly used in a small number of texts?
9 How do the frequent linguistic forms compare to other types of corpora? Using reference corpora, try to see if they are typical of certain types of registers or speakers.
10 Try to explain why particular items are frequent in your corpus or why they have particular patterns of dispersion or distribution. What does this reveal about representations, discourses, arguments, legitimation strategies, ideologies or assumed knowledge?
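For readers who would like to see what steps 2 and 8 involve underneath a corpus tool, the following is a minimal sketch in Python. It is an illustration only and not how any particular tool is implemented: the tokenising rule, the function names and the three-file toy corpus are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case a text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def frequency_list(texts):
    """Return (frequency, range) counters for a corpus given as {filename: text}."""
    freq = Counter()   # total occurrences of each word across the corpus
    rng = Counter()    # number of files in which each word occurs at least once
    for text in texts.values():
        tokens = tokenize(text)
        freq.update(tokens)
        rng.update(set(tokens))
    return freq, rng

# Toy corpus, invented for illustration only.
corpus = {
    "a.txt": "The bar is near the club. The club is open late.",
    "b.txt": "Chill by the pool, then head to the bar.",
    "c.txt": "The pool bar serves food all day.",
}

freq, rng = frequency_list(corpus)
for word, count in freq.most_common(5):
    print(f"{word:<6} freq={count}  files={rng[word]}/{len(corpus)}")
```

Reporting the number of files alongside the raw frequency is one simple way of separating a word that is spread across the corpus from one that is repeated within a single text.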
Further Reading
Baker, P. (2013), ‘Male bias and change over time: where are all the spokeswomen?’, in Using Corpora to Analyse Gender, Chapter 4, London: Bloomsbury. This chapter discusses ways of using frequency to determine whether language is becoming less sexist over time.
McEnery, T. (2005), Swearing in English, London: Routledge. A consideration of swearing across different socio-demographic categories in the British National Corpus.
Partington, A. and Morley, J. (2004), ‘At the heart of ideology: Word and cluster/bundle frequency in political debate’, in B. Lewandowska-Tomaszczyk (ed.), PALC 2003: Practical Applications in Language Corpora, Frankfurt/M: Peter Lang, 179–92. This is a key CADS study making use of clusters.
Questions for students
1 Sometimes a word can be important for discourse analysis because it never occurs in a corpus. Can you think of any examples of such cases? A frequency list would not show such words up so how can we know what is absent in a corpus?
2 Word A and word B both occur 100 times in the same corpus. Word A only occurs in one file. Word B occurs across thirty files. To what extent should the analyses of Word A and Word B receive equal weight?
3 Table 4.9 shows frequent three-word clusters from two related corpora. One is of articles in the Daily Express that contain the word Romanians (collected within the decade leading up to Britain’s vote to leave the European Union). The other is of reader comments that appeared under these articles. What do the clusters indicate about potential differences and similarities between the articles and the reader comments?
Table 4.9 Clusters relating to articles about Romanians

Articles                    Reader comments
the number of               the British people
Romanians and Bulgarians    they are not
per cent of                 leave the EU
the Prime Minister          get out of
the Home Office             we have to
tens of thousands           the only way
the right to                do not want
joined the EU               their own country
to work in                  they should be
freedom of movement         the majority of
come to Britain             to pay for
over the past               way of life
5 Concordances

Introduction
So far we have seen some of the ways that frequencies can be used in order to uncover the existence of discourses in text. We have considered word counts and the use of dispersion data which shows the spread and position of particular terms in a text. We have also expanded on the notion of simple lexical frequencies to examine multi-word units or clusters. The notion of clusters is important because it begins to take into account the context that a single word is placed in. Frequency lists can be helpful in determining the focus of a text, but care must be taken not to make assumptions about the ways that words are actually used within it. This is where taking an approach which combines quantitative and qualitative analysis is more productive than simply relying on quantitative methods alone. A concordance analysis is one of the most effective techniques which allows researchers to carry out this sort of close examination. A concordance is a list (usually presented in the form of a table) of the occurrences of a particular search term in a corpus, presented within the context that they occur in; usually a few words to the left and right of the search term. Sometimes people refer to each line in a concordance table as a concordance. As this can potentially be confusing, I make a distinction between a concordance (which is the whole table), and a concordance line (which is one line or row from a concordance table). In order to demonstrate how concordances can be of use to discourse analysis, I want to use this chapter to carry out a case study using a reference corpus. Reference corpora are usually not created specifically for answering questions about discourse, argumentation and representation but because they are often very large they can be useful for exploring a range of subjects.
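To make the definition of a concordance concrete, here is a minimal sketch in Python of how a concordance might be produced from raw text. It is not the method used by any particular corpus tool: the function name, the whitespace tokenisation and the sample sentence are assumptions made purely for illustration.

```python
import re

def concordance(text, pattern, width=5):
    """Return (left, node, right) tuples for each token matching `pattern`.

    `width` is the number of words of context shown on each side of the node.
    """
    tokens = text.split()
    node_re = re.compile(pattern, re.IGNORECASE)
    lines = []
    for i, tok in enumerate(tokens):
        # Strip surrounding punctuation before testing the token.
        if node_re.fullmatch(tok.strip(".,;:!?\"'()")):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text, for illustration only.
sample = ("Aid was sent to the refugees in the camp. "
          "A refugee child arrived yesterday. Refugees need shelter.")

for left, node, right in concordance(sample, r"refugees?", width=4):
    print(f"{left:>30}  {node:<10} {right}")
```

Each tuple corresponds to one concordance line, and the list as a whole is the concordance. Aligning the search term in a central column, as the print statement does, is what makes repeated patterns to its left and right easy to spot.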
An attraction of this approach is that numerous large reference corpora are in existence and can often be accessed via online interfaces so the analyst does not have to spend time locating, collecting and processing texts. However, this can also mean that the analyst is unlikely to be as familiar with the corpus as they would have been if they had collected it themselves, and they probably will not be able to access all of the corpus texts in their original form. It is worth bearing this in mind during the analysis and considering ways that better understanding of context can be incorporated. For example, if a set of interesting examples come from a particular magazine within the corpus, or are written by a particular author, it might be worth spending some time learning about that magazine or author as such information could help in terms of interpreting and explaining the findings. Extra care should be taken when working with a reference corpus (or any corpus really), which contains texts from a culture or country that is relatively unfamiliar to you. The likelihood of reaching an incorrect or incomplete conclusion will be higher in such cases so it is worth investing more time carrying out background research about the culture or asking members of the culture to look at your analyses before you share it with others. Of course, the opposite problem can also occur, where you are so familiar with the context that a corpus was produced in that you do not realise what is unlikely to be known to outsiders, so your analysis becomes difficult to follow. Again, asking others to read your analysis is likely to identify your assumptions about shared knowledge.
Investigating discourses of refugees
Bearing this in mind, the topic I wish to examine is discourses relating to refugees. Refugees are a worthwhile subject to analyse in terms of discourse because they constitute one of the most powerless groups in society. One aspect of this conceptualisation of discourse relating to ‘ways of looking at the world’ is that it enables or encourages a critical perspective of language and society. The Foucaultian view of discourse has been used in connection with critical social research, a form of academic enquiry which aims to achieve a better understanding of how societies work. Fairclough (2003: 202) defines a number of starting questions for critical social research such as ‘how do existing societies provide people with the possibilities and resources for rich and fulfilling lives, how on the other hand do they deny people these possibilities and resources?’ Consequently, Critical Discourse Analysis (CDA) is a form of critical social research that can be applied to a range of texts in
order to address these and other questions. Wodak and Meyer (2001: 96) refer to CDA as ‘discourse analysis with an attitude’, although the lines between DA and CDA are sometimes rather blurred. Van Dijk (2001: 353) notes that CDA does not have a unitary theoretical framework although there are conceptual and theoretical frameworks (e.g. Marxism) which are closely linked to CDA. Two basic questions of CDA are ‘How do (more) powerful groups control public discourse?’ and ‘How does such discourse control the mind and action of (less) powerful groups, and what are the social consequences of such control, such as social inequality?’ (van Dijk 2001: 355). More recently, CDA has sometimes been referred to as CDS (Critical Discourse Studies), a term viewed as encompassing a wider range of frameworks, as CDA is sometimes associated with Fairclough’s approach only. Van Dijk (1996: 91–4) points out that minority groups are frequent topics of political talk and text, but have very little control over their representations in political discourse. Lack of access to journalists means that minority speakers tend to be quoted less often than majority speakers (van Dijk 1991), and those who are quoted tend to either be chosen because they represent the views of the majority, or because they are extremists who are quoted in order to facilitate attack (Downing 1980). In the media, refugees are rarely able to construct their own identities and the discourses surrounding themselves, but instead tend to have such identities and discourses constructed for them, by more powerful people. When we are analysing words that relate to social actors, it can be useful to think about how language is used to represent them. A concordance analysis can enable us to identify repetitive linguistic patterns that can contribute towards particular representations. 
Closer analysis of such representations can reveal them to be (overtly or subtly) positive or negative, and considering the ways that different types of representations can support or contradict one another can help us to identify discourses. So a discourse could be considered as a collection of representations which cohere to form a particular way of looking at the world. A set of related representations around refugees (such as: they are criminal, there are too many of them, they are lazy) could be referred to more generally as contributing towards a more general negative discourse. However, the distinction between representation and discourse can be difficult to make at times, and some people use the terms in ways that appear interchangeable. One way of considering how they differ is to remember that discourses are meant to involve social practices, not just language, but also the wider contexts relating to how the words on a page got there and what effect they had.
The reference corpus used in this study is the British National Corpus 1994 (BNC). To recap from previous chapters, this corpus was built in the 1990s, consisting of approximately 100 million words of writing (90%) and speech (10%) from a wide range of UK-based sources. The BNC is available on numerous online platforms although the one that I use for this study is called CQPweb, which is maintained at Lancaster University. Users need to provide an email address to sign up for a free account. The Standard Query page of the BNC in CQPweb contains a series of menu options for different kinds of analysis, and there is also a text box where searches can be carried out in order to obtain concordances. CQPweb uses Simple Query Syntax, meaning that it is possible to carry out concordance searches on multiple words at the same time, or to use wildcards to expand searches. For example, the wildcard * stands for zero or more characters, so searching for refugee* produces all the cases of the words refugee and refugees in the corpus, as well as less-expected examples involving hyphens like refugee-producing. As I just want to focus on refugee and its plural, I can use a search term which separates these words with a character representing a vertical line and put them in parentheses (refugee|refugees). Other corpus tools have different ways of searching for multiple words. For example, in AntConc (version 4) the equivalent search term would be refugee|refugees, while in WordSmith, the / (solidus) character is used. Our search produces 2,723 hits, of which 1,889 are refugees and 834 are refugee. The plural form is more than twice as frequent as the singular, which suggests a strong tendency to collectivise the concept. 
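The difference between the wildcard search and the two-word search can be illustrated outside CQPweb with a short Python sketch. Python regular expressions are not Simple Query Syntax, so the patterns below are only rough analogues, and the function name and sample text are assumptions made for the example.

```python
import re
from collections import Counter

def count_forms(text, pattern):
    """Count the word forms in `text` that match `pattern` as whole tokens."""
    token_re = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")   # keeps hyphenated words together
    query = re.compile(pattern, re.IGNORECASE)
    forms = Counter()
    for token in token_re.findall(text):
        if query.fullmatch(token):
            forms[token.lower()] += 1
    return forms

# Invented sample text, for illustration only.
text = ("Refugees fled the refugee-producing region. One refugee spoke; "
        "other refugees stayed silent.")

wildcard = count_forms(text, r"refugee.*")        # rough analogue of refugee*
either = count_forms(text, r"refugee|refugees")   # rough analogue of (refugee|refugees)

print(wildcard)
print(either)
print(sum(either.values()), "hits in total")
```

In this toy text the wildcard pattern also picks up refugee-producing, whereas the two-word pattern does not. Counting each form separately is what allows the singular and plural totals to be compared, as with the 1,889 and 834 hits reported above.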
Plural forms can be interesting to consider – for example, Partington (2003: 14) notes that absent quantifiers (occurring when using plural nouns) may result in information being obscured, allowing text producers to be imprecise, thereby creating over-generalisations (e.g. men are rapists). We might want to be on the lookout for these kinds of plural cases when carrying out the analysis. Figure 5.1 shows a screenshot from CQPweb, with the first ten concordance lines on view (to save space I have not shown all fifty lines that appear on the first page of the search). You may notice that the search was not case sensitive, so both Refugees and refugees are present. This is fine for our purposes, although if we had wanted to specify only lower case citations we could have done this via one of the drop-down windows on the initial search page. Each concordance line contains a line number, the filename assigned to the text where each citation came from and then the concordance. Only a small amount of context (about ten words either side of the search term) is shown, although that can be expanded if required. You might also
Figure 5.1 Screenshot of (refugee/refugees) in the BNC via CQPweb.
notice that the concordance lines do not necessarily represent full sentences; rather, they are snippets of context. As noted, the web page only displays the first fifty lines although the next fifty lines can be viewed by clicking on the next-page icon, allowing us to cycle through all fifty-five pages of concordance lines if we wanted to. With 2,723 cases it would take a long time to read every line and it is likely that we will find very similar cases as we proceed, so there are a number of ways to try to reduce the amount of work that is required for an analysis. One approach would be to only investigate concordance lines that contain significant or frequent collocates of the search words (we will consider collocates in the following chapter). However, another way would be to only read a representative sample of the concordance lines rather than all of them. A third way would involve reading all concordance lines but sorting the concordance alphabetically so that similar lines are grouped together, making it easier to work with repeated cases. We will consider these last two approaches, starting with the representative sample. How do we obtain a representative sample? We could just read the first page if we wanted. However, this would not be properly representative of the corpus. The concordance lines are shown in filename order. Each text in the BNC was assigned a filename consisting of letters and numbers, so the first page shows lines from file A03 through to A1V. No matter what search we carry out, files starting with the letter A will appear first in the search. If we intended to give equal amounts of analysis time to all 2,723 cases, that would not matter, but just looking at the first fifty lines will mean that the same files will always be examined, and if those files contain a lot of mentions of our search terms, we may end up only reading from a small number of texts. 
To counter this, CQPweb has a button labelled ‘Show in random order’ allowing us to re-order the concordance lines. Not all concordancers have this facility, so if you are working with one that does not you may need to use an online random number generator to decide which lines to focus on, or use a different way of collecting a sample of lines, e.g. looking only at the tenth, twentieth, thirtieth, etc. lines.
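If a concordance has been exported to a file, both kinds of sample described here can be drawn with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and the list of numbered strings merely stands in for 2,723 real concordance lines.

```python
import random

def random_sample(lines, n, seed=None):
    """Return n concordance lines chosen at random, without replacement."""
    rng = random.Random(seed)   # a fixed seed makes the sample repeatable
    return rng.sample(lines, min(n, len(lines)))

def every_nth(lines, step):
    """Return every `step`-th line (the 10th, 20th, 30th, ... when step=10)."""
    return lines[step - 1::step]

# Stand-in for a concordance of 2,723 lines; real lines would be strings of text.
lines = [f"line {i}" for i in range(1, 2724)]

sample = random_sample(lines, 100, seed=42)
print(len(sample), "random lines, e.g.", sample[:3])
print(every_nth(lines, 10)[:3])
```

Recording the seed alongside the analysis means that the same random sample can be drawn again later, which helps to make the procedure transparent to readers. Note that taking every tenth line from a concordance in filename order still follows that order, although it does spread the sample across the whole corpus.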
Analysing a randomised sample
Now that we have the concordance in random order, another question arises: how many lines should we look at? There is no perfect answer to this and different suggestions have been made. For example, Sinclair (1999) has
suggested initially looking at thirty lines, noting down patterns, then looking at another thirty to see if new patterns are found, then another thirty and so on, until nothing new is uncovered. This method recalls the concepts of closure (McEnery and Wilson 1996: 66) and saturation (Belica 1996: 61–74) related to the representativeness of a corpus. When I conduct a concordance analysis I have found that an examination of 100 lines taken at random is usually enough to allow me to identify the most common features around or uses of the search term as well as some of the less frequent features. Neither of these approaches is likely to provide all of the less frequent features though, and initial concordance searches can always be followed up with additional, more specific searches. An important consideration is whether you want the analysis to only identify the typical usage of a word or all of its uses. If it is just the typical usage, then thirty random lines might be enough to reveal this, e.g. if twenty-five of the thirty random lines show a particular pattern and the other five do not, then we can be reasonably certain that we have identified the typical usage. If the thirty lines reveal five different patterns, with not much difference in frequency between them, then it would make sense to look at a much larger sample. If you suspect that the sample you obtained might not be very representative, you could examine another 100 lines and compare the results from that to the first sample. If the two samples are very different, this would indicate you need to look at an even larger sample. Another approach would be to consider a particular pattern or type of representation in your corpus and look through the concordance lines presented in random order, noting all the cases where your pattern of interest occurs. Once you reach a pre-determined number of cases you can stop, noting how many cases you had to look at in total. 
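The stopping rule just described can be written out as a short Python sketch. It assumes that the analyst has already read the randomised lines and marked each one by hand as containing the pattern or not; the function name and the artificial labels (every fourth line marked as a hit) are hypothetical.

```python
def lines_needed(labels, target):
    """Scan labels in order and stop once `target` hits have been seen.

    `labels` is an iterable of booleans (True = the pattern of interest occurs).
    Returns (lines_read, hits_found).
    """
    hits = 0
    lines_read = 0
    for is_hit in labels:
        lines_read += 1
        if is_hit:
            hits += 1
            if hits == target:
                break
    return lines_read, hits

# Hypothetical hand-coded labels for a shuffled concordance: every 4th line is a hit.
labels = [(i % 4 == 0) for i in range(1, 101)]

read, found = lines_needed(labels, target=5)
print(f"{found} hits in {read} lines: roughly one in every {read / found:.0f}")
```

Dividing the number of lines read by the number of hits gives a rough estimate of how often the pattern occurs, and a larger target gives a steadier estimate at the cost of more reading.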
For example, to find ten representations of refugees which involved water metaphors like ‘flood of refugees’ I had to look at 396 concordance lines. This would suggest that this metaphor occurs once in every forty cases or so (although locating 100 water metaphors rather than ten would give me a more accurate figure).1 Even with sampling, the amount of work required can quickly multiply, particularly if you are comparing several corpora and working with a largish set of polysemous words. For example, a study that involves concordance analyses of the top twenty keywords in a corpus, comparing how they occur across four sub-corpora, can become a lengthy task. If each word has five different uses, that would mean that analysis would result in 100 different uses to describe, and with comparisons made between the four sub-corpora, taking into account 100 lines analysed for each one, that would mean 8,000
concordance lines need to be categorised. The analysis could also potentially make for a very long read. In such cases, the analyst would need to think about ways to get the main analytical points across, rather than telling the reader about everything in detail. For this chapter, we will initially consider just 100 concordance lines, taken at random. So how is this kind of analysis carried out? It can depend on our research question and the extent to which we have a specific focus. The object of creating concordances is to look for patterns of language use, based on repetitions. Identifying such patterns may help us to note discourses or representations, particularly if the patterns are relatively common. So we first need to scan the concordance lines, trying to pick out similarities in language use, by looking at the words and phrases which occur to the left- and right-hand sides of the terms refugee and refugees. This is therefore a much more qualitative, context-driven form of analysis than simply looking at frequency lists. Rheindorf (2019: 33) has criticised the way that manual concordance analysis is carried out ‘without explicating the qualitative method involved, if any’. Similarly, Breeze (2011: 498) warns that such analysis appears to be ‘moved by personal whim rather than well-grounded scholarly principle’. As one of the most qualitative aspects of corpus-assisted discourse analysis, it is therefore important that the way that a concordance analysis is carried out is made clear. One approach could be to use an existing analytical system, such as Halliday’s transitivity framework, or van Leeuwen’s social actor categorisation system. If you have a particular research question in mind at the outset of the analysis, then an existing framework may help you to address it. An advantage of adopting such systems is that they are tried and tested and well-known so will not require enormous amounts of justification and explanation. 
However, an issue with incorporating an existing framework to look at new data is that the data may not always fit into the categories of the framework – there may be ambiguous cases, or cases which suggest that new categories ought to be created. This should not preclude using an existing system but analysts should be prepared to make adjustments to the categories if necessary, as well as explaining why this needed to be done. An alternative would be to approach the data from a more naïve perspective, without specific research questions or a sense of what linguistic features you are interested in. Analysing the corpus becomes a more bottom-up process, where categories and features are identified as you carry out concordance analyses of various words, phrases, keywords or collocational relationships. This can result in the formation of research questions and the
development of a categorisation scheme, if appropriate. With this approach it is helpful to give more detailed information about how the analysis was carried out, how categories were identified and how concordance lines were assigned to them. The analysis below does not use a pre-defined categorisation scheme but instead I created groups of concordance lines which contributed towards similar kinds of representations of refugees. As I collected these representations I tried to consider whether they represented refugees in a positive or negative way, or in a more ambivalent or ambiguous way. We might want to begin by considering the immediate context that the search term occurs in, looking at two or three words to the right or left. Within those 100 concordance lines I found six cases of refugee camp(s), as well as seven cases where refugee(s) was preceded by numbers, e.g. 146,000 refugees. There were also two references to number(s) of refugees, indicating that quantification occurred in almost 10 per cent of the cases in the sample. As I am interested in representation of refugees particularly, I paid special attention to adjectival modifiers, finding cases that described the nationality of refugees (Yugoslav, Liberian, Kurdish, Iraqi, Austrian, Vietnamese, Chinese, Albanian), indicating that the corpus does not focus on any one set of refugees but often includes news articles relating to a range of stories about them that were newsworthy at the time that the corpus was collected. Other adjectives provide other kinds of contexts relating to the situation that refugees are in. For example, refugee(s) are modified with the words squalid, orphaned, panicking and exhausted as shown in Table 5.1.
Table 5.1 Refugees as victims
1 | be incestuous, haram – forbidden. Miriam had been an orphaned | refugee | child brought from the Lebanon at the Sheikha’s request. Amina
2 | occupied territories”, and especially in its heartlands in the squalid | refugee | camps of the Gaza Strip, also posed a serious threat to
3 | Why didn’t anyone use knockout gas on all those panicking | refugees | ?’ ‘They were a sacrifice to purity,’ murmured
4 | local estate balanced between bellicose German troops, jumpy partisans, exhausted | refugees | , escaping British prisoners and the local Fascisti, during the appalling
5 | and did not begin to understand the trauma of being a young | refugee | . And the typical foster parent was not Jewish. This last
6 | otherwise intelligent people like the eminent bishop who greeted a party of | refugee | children, kitted out in their ill-fitting hand-me-downs, with the cheerful
Such lines represent refugees as the victims of unfortunate circumstances. While the second line actually refers to squalid refugee camps as opposed to squalid refugees, because it is refugees who live in such camps, we would also include it as contributing to a ‘victim’ representation. Similarly, I have included the fifth line which refers to the trauma of being a young refugee as even though trauma is not an adjective, this line also indicates the negative impact of being a refugee. The final line, which describes refugee children kitted out in their ill-fitting hand-me-downs also indicates a sense of impoverishment, although we can see that there is no single word here which contributes to the representation but the whole phrase. A related representation involves cases where actions are made in order to provide assistance to refugees. This often involves refugees being the recipient of verb processes carried out by others, like aid, organizing relief and help. Is this a positive or negative representation? The fact that people or organisations are helping refugees indicates one way that it is positive. However, it is perhaps not a very empowering representation for refugees, which may indicate a somewhat negative stance. We might want to ask though, how could refugees be reasonably represented as empowered? Their very nature, being displaced from home, usually due to circumstances beyond their control, is unlikely to enable this, although potential empowering representations might give voice to particular refugees, show them as helping others or making a positive contribution towards their host community. The lines in Table 5.2 suggest a willingness on behalf of others to help refugees, although this perhaps can sometimes be worded in a way which might indicate that such help is not always forthcoming. 
Table 5.2 Refugees as recipients of help
’ after mercy mission A LORRY driver who helped take aid to [refugees] in Croatia was yesterday named Britain’s Top Dad. Harry Pearce
the repeal of the Contagious Diseases Acts and organizing relief for the [refugees] from Turkish atrocities against the Bulgars in 1876. After moving to
The borough willingly assumes its responsibilities to help and cope with [refugees]. Many other London Tory boroughs, however, do not help
for child welfare. On particular issues, such as work with [refugee] children and work with traveller and gypsy families, there is
were grateful to anybody who was prepared to take a couple of [refugees] off their hands and assume responsibility for them. And that is

Table 5.3 The refugee situation
feasibility study will: (i) provide an overview of the [refugee] situation in each country; (ii) establish the field conditions
after the despatch of the Robertson signal. It ran:” the [refugee] and PW situation in 5 Corps area becoming unmanageable and prejudicing operational
the 1949 Armistice Line? Did ‘a just settlement of the [refugee] problem’ imply implementation of Resolution 194 of 1948 (confirming the

Table 5.4 Too many refugees
Commissioner for Refugees have been struggling to cope with the flood of [refugees] and have appealed to the international community to step up relief.
camps in West Germany to help cope with the current surge of [refugees], her Foreign Secretary, Mr Douglas Hurd, promised East Germany
Court judge warned of a serious problem because of a log-jam of [refugee] cases as he gave leave for two of them to seek judicial

The last line refers to anybody who would take a couple of refugees off their hands and assume responsibility for them, implying that the person in question is pleased not to have to look after them anymore. The fact that in the third line the borough is described as willingly helping refugees suggests that this is not always the case, and indeed, if we were to obtain more context from this concordance there is a description of other boroughs which do not help. Table 5.3 indicates another, more clearly negative pattern, referring to the refugee situation or problem. This appears like an impersonal way of representing refugees, and despite the fact that it uses refugee in the singular, it appears to be functioning in a collectivising way – the lines do not refer to a situation caused by a single refugee but as being caused by a set of people. Another negative representation is found in Table 5.4, which occurs in the pattern NOUN of refugee(s). Here, it is indicated that there are too many refugees in the host country – the first two lines referring to difficulties via the verb cope. These lines suggest potential metaphorical representations. Floods literally involve water rather than people, while the term log-jam refers to a natural phenomenon involving a crowded mass of logs that block a river. It is less easy to find a literal interpretation of surge, so this is a case where we might want to use a reference corpus to identify its typical uses. As the BNC is already such a corpus, it is simply a case of carrying out an additional search of the word surge and examining its concordance lines or collocates. This reveals that it typically occurs in phrases like surge of NOUN, where the noun is often power, anger, interest, excitement, demand, support
and activity. Surge tends to be used to refer to emotions or more generally, abstract concepts. These concepts are not usually negative, but it is of note that the phrasing is most commonly used in non-human contexts. We note again that the last line (refugee cases) is a singular use of refugee although it actually refers to refugees in the plural. The singular/plural distinction we noted earlier with refugee/refugees does not appear to be as easily drawn as we had initially thought. Sometimes a metaphorical use of a word can become more frequent than the literal use. In the BNC, an analysis of a random sample of 100 cases of flood indicated that fifty-five were literal and forty-five were metaphorical. However, flood of was almost always metaphorical. Additionally, flood of tended to refer to non-human or abstract cases like interest, fear, imports, information, tears and red ink, with more negative than positive cases occurring in this kind of context. The only notable cases where humans appear to be regularly characterised as a flood involve two related words, refugees and immigrants. There were only thirteen cases of log-jam in the BNC, so here we must exercise caution when interpreting results. However, hyphenated words can sometimes appear without hyphens. Indeed, logjam appears twelve times in the BNC, and log jam occurs five times, so in total we have thirty cases to examine. Only a small number of these refer to human beings. There is a reference to a log-jam party during an event involving a celebrity as well as a case involving a logjam of bewildered novice skiers, although these appear to be exceptional cases. The overall effect of characterising refugees in terms of a flood, log-jam or surge is therefore one which positions them in language often reserved for non-human contexts. In Table 5.5 we see another representation which refers to illegality of refugees, one which appears to be clearly negative. 
Table 5.5 Illegal refugees
announced on July 20 that it was ending financial support for illegal [refugees] – a move seen as marking the beginning of a tougher policy
that, according to a headline in The Times today, bogus [refugees] bleed Britain of £100 million through benefit fraud? Has he seen

The first concordance line indicates a withdrawal of financial support for illegal refugees, indicating that such people are unwanted, whereas the second line refers to bogus refugees who are described as bleeding Britain through benefit fraud. The metaphor casts refugees as harming Britain, perhaps through consuming its blood, bringing to mind cases like leeches or vampires. This is perhaps the most strongly negative line we have seen, although we should note it does not relate to all refugees, only those who are deemed to be bogus. However, here we encounter a potential problem with concordance-line analysis. Each concordance line only provides a very small amount of context around the search term, so while it is possible to scan the lines and obtain a sense of the surface representation, which is in this case negative, we need to bear in mind that representations may be more complex. Within CQPweb, a concordance line can be expanded to provide more context, by clicking on the search term in the middle of the line. If we expand the bogus refugees line the following text appears.

Mr. Janman. Does my right hon. Friend agree that the opportunity for this country to help support genuine refugees abroad through various aid programmes is not helped by the fact that, according to a headline in The Times today, bogus refugees bleed Britain of £100 million through benefit fraud? Has he seen the comments of a DSS officer in the same article that benefit fraud is now a national sport and that bogus asylum seekers think that the way in which this country hands out so much money is hilarious?
It is useful to obtain further information about the text that this line was drawn from, so selecting the menu option for Text info, we see that this is a transcript of a political debate taken from Hansard. The word refugees appears in the speech as the speaker is quoting it from a headline in The Times newspaper. So this is a case of intertextuality (one text referring to another text) and we need to consider how the speaker positions himself in relation to the quote. Reading the extract, the speaker appears to be critical of the headline, noting that it is not helping the country to support genuine refugees. While the speaker does not question whether The Times is accurate to characterise refugees as bogus, he appears to disapprove of the way The Times uses the term. Is this a positive or negative representation of refugees then? The answer is both, to an extent, indicating that for our purposes a simple positive/negative classification of concordance lines will not adequately cover all cases, so we might need to create another category, perhaps called ‘critical uses’. How do we know if a concordance line will reveal this kind of intertextual, critical use? We could expand every line and if we only analyse 100 cases, this should not take too long. Where we intend to examine larger concordances, sometimes there are clues in the concordance line that suggest that the text is part of a quotation. In the case of bogus refugees we can see reference to a headline in The Times, so cases like this are not too difficult to spot. Other possible instances of intertextuality might involve the appearance of quotation marks or verbs of reported speech. Some corpus tools such as WordSmith and CQPweb have an option to allow users to categorise concordance lines. For example, in CQPweb you can select Categorise from the drop-down menu on a concordance page and you will be taken to a page that allows you to define your own labels for different categories (up to 255 categories). This can be a useful facility in terms of helping you to keep track of which lines contribute towards particular representations or contain particular linguistic features you want to quantify.
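The surface clues mentioned above (a named source, quotation marks, verbs of reported speech) can be screened for automatically before deciding which lines to expand. The following is a minimal Python sketch, not a feature of CQPweb or WordSmith: the function name, the list of reporting verbs and the list of source cues are all assumptions made for illustration, and a flagged line still needs to be read in its wider context.

```python
import re
from collections import Counter

# Hypothetical clue lists (assumptions for illustration); extend for your own data.
REPORTING_VERBS = {"said", "says", "claimed", "claims", "according", "warned", "reported"}
SOURCE_CUES = re.compile(r"\b(headline|article|quoted)\b", re.IGNORECASE)
# Quotation marks that are not word-internal, so that Britain’s is not flagged.
QUOTE_MARKS = re.compile(r'(?<![A-Za-z])["“”‘’]|["“”‘’](?![A-Za-z])')


def intertextual_clues(line):
    """Return the surface clues suggesting that a line may quote another text."""
    clues = []
    if QUOTE_MARKS.search(line):
        clues.append("quotation mark")
    words = set(re.findall(r"[a-z]+", line.lower()))
    if words & REPORTING_VERBS:
        clues.append("reporting verb")
    if SOURCE_CUES.search(line):
        clues.append("source cue")
    return clues


lines = [
    "that, according to a headline in The Times today, bogus refugees bleed "
    "Britain of £100 million through benefit fraud? Has he seen",
    "The borough willingly assumes its responsibilities to help and cope with "
    "refugees. Many other London Tory boroughs, however, do not help",
    "should qualify as true refugees fleeing persecution, while the rest are "
    "‘economic migrants’ escaping",
]

# Tally how many lines would need their context expanded.
tally = Counter(
    "check wider context" if intertextual_clues(line) else "no clue found"
    for line in lines
)
for line in lines:
    print(intertextual_clues(line), "|", line[:60])
print(dict(tally))
```

A heuristic like this will over-flag (a plural possessive such as refugees’ counts as a quotation mark) and will miss quotations that carry no surface marker, so it only helps to prioritise which lines to expand.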
Sorting concordances

Dealing with a small sample is manageable, although patterns can be difficult to spot because the concordance lines are either presented randomly or in the order in which they occur in the corpus. While this may make sense in one way, e.g. to start at the beginning and work towards the end, it does not help us to spot language patterns so easily. Because of this, most concordancers give the researcher the option to sort the concordance in various ways. For example, we could sort a concordance table alphabetically one or more places to the left or right of the search term. Figure 5.2 shows part of the concordance for (refugee|refugees) which has been alphabetically sorted one place to the right of the search term. Examining a sorted concordance can save us time as similar cases will appear next to one another. For example, lines 1324 and 1325 both have the word during occurring at the R1 (one place to the right) position. A concordance, when sorted in this way, starts to reveal some of its patterns of language. I analysed all 2,723 lines of the concordance, first sorting the concordance one place to the right, then looking at the same concordance sorted one place to the left. Different ways of sorting a concordance will reveal different patterns. For example, sorting to the right revealed cases of verbs which placed refugees as the agent (doer of the verb), although many of the verbs in the R1 position describe refugees as moving, sometimes in ways that appear to be involuntary, e.g. refugees arrive, cross, flee, trudged. There were very few verbs in the R1 position though, indicating that on the whole refugees are not often represented as freely carrying out actions, and hardly any where refugees were carrying out actions on other people.
Figure 5.2 Sorted concordance.
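The kind of sorting shown in Figure 5.2 can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than how CQPweb or any other concordancer is implemented: the text is split on whitespace, the sample sentence is invented, and the function names (kwic, sort_concordance) are hypothetical.

```python
import re


def kwic(tokens, pattern, window=5):
    """Return (left, node, right) token lists for each token matching pattern."""
    node_re = re.compile(pattern, re.IGNORECASE)
    lines = []
    for i, tok in enumerate(tokens):
        if node_re.fullmatch(tok):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append((left, tok, right))
    return lines


def sort_key(line, position):
    """Word at a position such as 'R1' or 'L2', lower-cased ('' if absent)."""
    left, _node, right = line
    side, n = position[0].upper(), int(position[1:])
    # Reverse the left context so that index 0 is the word nearest the node.
    context = right if side == "R" else left[::-1]
    return context[n - 1].lower() if len(context) >= n else ""


def sort_concordance(lines, position="R1"):
    """Sort concordance lines alphabetically on the chosen position."""
    return sorted(lines, key=lambda line: sort_key(line, position))


# Invented sample text (an assumption, not corpus data).
text = ("refugees flee the city . aid reached the refugees during winter . "
        "a refugee council met . bogus refugees arrive daily")
tokens = text.split()
lines = kwic(tokens, r"refugees?")

for left, node, right in sort_concordance(lines, "R1"):
    print(" ".join(left).rjust(30), node.center(10), " ".join(right))
```

Calling sort_concordance(lines, "L1") on the same lines groups them by the word immediately before the search term instead, which is what brings modifiers such as bogus together.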
Instead, the sorted concordance revealed compound nouns which come after refugee(s), some of which refer to organisations aimed at helping refugees: council (14 cases), organisations (10), association (9), centre(s) (9), office (7), committee (6), groups (5), charity (3). We also find further cases which contribute to the representation of refugees as constituting a problem (outlined in Table 5.3); refugee crisis (25 cases), problem (27), problems (4), question (4), situation (7). An unusual case involved some of the 237 examples of refugee(s) followed by the word from. Many of these simply described where refugees had come from, e.g. He said the US authorities planned to allocate 50,000 places to refugees from the Soviet Union in the next 12 months.
However, not all of these examples of refugees from were used in this way. Five cases described non-literal cases of refugees which tended to appear in the sequence like a refugee from. They are used to allude to a person’s similarity to something else. The comparison is usually not made in a positive way but appears to be a joking way to insult someone. For example, in the first line of Table 5.6 someone is described as sounding like a refugee from a bad horror film. The last two lines of Table 5.6 describe people who are either a refugee from the 1960s or from Woodstock, a festival that took place in 1969. The context here appears to involve a mocking commentary on the fashion sense of the person involved. So while this use of language is relatively rare, there is evidence to suggest that its appearance in the corpus draws on and contributes to a representation of refugees being bad things that are out of place. No examples like this were found in the 100-line random sample we looked at, so the analysis here demonstrates an advantage of considering a larger sorted concordance.

Table 5.6 refugee from
to violence.’ ‘You’re starting to sound like a [refugee] from a bad horror film. The role of mad scientist does
mud that has ruined my gold stilettos. I look like a [refugee] from a Verdi opera, stranded in the damp gentle valleys of
a black hat covering all her hair. She looked like a [refugee] from one of those films that at the time constantly glamourized the
closed his eyes. Mr Hunter was what he resembled: a [refugee] from the sixties. His fat hand squeezed Graham’s shoulder.
laughed and joked with them. One boy, dressed like a [refugee] from Woodstock, was spinning around in circles. His hair was
Looking at the same concordance when it is sorted one place to the left reveals different kinds of patterns, including cases where refugees are modified by adjectives, the majority of which refer to the provenance of the refugee(s) in question (e.g. 72 cases of Kurdish, 29 of Bosnian, 18 of Vietnamese, 16 of Somali). However, we also find cases which contribute to the representation of refugees as illegal, shown in Table 5.7: genuine (34 cases in total), bogus (9), would-be (8), real (2), fake (1), illegal (1), true (1). I have included the words genuine, real and true as although such cases refer to refugees who are not fake, they imply that other refugees exist who might be fake, and Table 5.7 shows instances where a clear distinction is made between such genuine refugees and those who are viewed as not genuine. The distinction between genuine and bogus refugees also complicates the way that we might want to categorise representations as positive or negative. We could simplistically say that references to genuine refugees are good while the bogus ones are seen as bad but I do not think that really captures what is happening when these words are being used. Instead, there seems to be a concern that a proportion of refugees are bogus so are not deserving of help. Our concordance analysis cannot tell us what proportion of refugees actually are genuine or not, or whether concerns about bogus refugees are grounded in reality. As we are looking at a reference corpus containing many stories about different types of refugees from different countries, it would be difficult to find this kind of information in any case. Instead, what we can more easily conclude is that the corpus contains a reasonable number of instances where refugee status is queried, indicating suspicion of the motives of refugees. We could view this as contributing towards anti-immigration discourse.
Table 5.7 Genuine refugees
right to family life. A new Act will guarantee sanctuary to genuine [refugees] but prevent bogus applications for asylum. We are determined to see
system. What we must do is to distinguish between bogus and genuine [refugees]. That is what we will do, and we will do
in the screening process in Hong Kong in order to distinguish between genuine [refugees] and economic migrants; (ii) improved conditions within the Hong
the enormous delays in processing applications, which cause great fear for real [refugees], have encouraged bogus organisers of fraudulent claims, such as the
Hong Kong says that only a handful of them should qualify as true [refugees] fleeing persecution, while the rest are ‘economic migrants’ escaping
A further pattern found with the L1-sorted concordance was of refugees. This appeared in numerous cases that were similar to those shown in Table 5.4, in particular involving water metaphors: (in)flux of refugees (17 cases), flood of refugees (12), (in)flow of refugees (8), stream of refugees (3), volume of refugees (2) and wave(s) of refugees (2). The water metaphor is found in other ways, e.g. when the concordance was sorted one place to the right patterns like refugees flooding (2), flowing (1), streaming (3) and poured (2) were found. The movement of refugees is thus frequently constructed as an elemental force consisting of unwanted water which is difficult to predict and has no sense of control. As with floods, refugees are also represented as difficult to control and are thus dehumanised, as well as being characterised as something which potentially will result in disaster to others. The L1-sorted concordance also reveals verbs which place refugees in the patient position, being the recipient of the actions of others. These include help (8 cases), evacuate (3), granted (3), refused (3), allow (2), given (2), receive (2), smuggling (2), accommodating (1), expel (1), monitor (1), prevent (1) and resettle (1). Refugees are thus represented as being moved about by other people, as well as being the subject of other people’s decisions. As with the earlier analysis which casts them as out of control, these verb collocates further indicate that they are subject to the control of others. What about cases where refugees are clearly represented as having some agency then? The initial analysis of 100 random concordance lines did not find any cases like this so they are likely to be quite rare. I decided to carry out a second concordance search in order to find cases which would be more likely to identify cases of agency. Hunston (2002: 52) describes this as hypothesis testing – carrying out further searches in order to determine the extent to which a pattern occurs. 
The search term I used was refugee* >>2>> *_VV*. Within CQPweb’s search syntax this will return cases of words which start with refugee, followed by a lexical verb within the next two words. This resulted in 596 concordance lines, which is still a lot to work with, although a more manageable number. Some of these lines did not actually reveal cases where refugees had agency, e.g. refugees were resettled, so they could be easily passed over. One pattern which emerged from the analysis of this concordance was cases where refugees were described as carrying out destructive or manipulative actions, as shown in Table 5.8. Perhaps not all of these cases involve refugees acting in intentionally problematic ways – see for example, the third line which involves refugees bringing diseases with them. However, the wording here still involves grammatical agency, leaving it vague as to whether the refugees are to blame for any possible transmission of these diseases. Only two cases were found where refugees were attributed agency in more positive ways, as enriching artistic life and giving up their beds (Table 5.9), so while this representation is present in the corpus, it is rare. When we have a lot of concordance lines to deal with, it can take a while to go through all of them if they have been sorted alphabetically, in order to identify those which have the most frequent patterns. New ways of presenting sorted concordances are being considered, and AntConc version 4 has an option whereby a concordance can be presented with the words which appear most frequently in a particular position (e.g. R1) at the start. This enables the most common patterns to appear first, saving the analyst having to wade through pages of concordance data to find them (see Anthony 2018).
Table 5.8 Refugees as destructive or manipulative
name to broaden its appeal. Horror attack: Three Afghan women [refugees] chopped off the tongue of a 10-year-old girl in Pakistan who refused
demonstrators are chanting Mr Gorbachev’s name, while another 4,000 would-be [refugees] are choking the premises of Bonn’s embassy in Prague. It
children with the shrunken limbs and wrinkled faces of marasmus. The [refugees] brought other diseases with them, mainly eye infections and malaria.
problem is the supply of wood for shelters and fuel. The [refugees] have stripped the surrounding hills almost bare, and spend much of
unending indictment of his Stalinist regime. Worse still, the embassy [refugees] manoeuvred him into an impossible corner. They knew how desperately he
as Gregory discovered on a number of occasions. Among the [refugees] who caused trouble for for the bishops of Tours were Austrapius,

Table 5.9 Positive representations of refugees
skills to augment those of Zurich’s own craftsmen, and other [refugees] enriched its artistic life. Wagner, in retreat from pursuing creditors
nights was spent in a Romanian refugee camp at Terni where some [refugees] gave up their beds to that the three friends would be as
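Two of the procedures used in this part of the analysis lend themselves to a short illustration: searching for a word followed by a lexical verb within the next two words, and presenting a concordance with the most frequent R1 words first. The Python sketch below is an approximation under stated assumptions, not the implementation used by CQPweb or AntConc. It assumes the text is already available as (word, tag) pairs, the sample data are invented and hand-tagged, and the function names are hypothetical.

```python
from collections import Counter
from fnmatch import fnmatchcase


def followed_within(tagged, word_glob, tag_glob, max_gap=2):
    """Indices of words matching word_glob that are followed, within max_gap
    tokens, by a token whose part-of-speech tag matches tag_glob."""
    hits = []
    for i, (word, _tag) in enumerate(tagged):
        if not fnmatchcase(word.lower(), word_glob):
            continue
        window = tagged[i + 1:i + 1 + max_gap]
        if any(fnmatchcase(tag, tag_glob) for _word, tag in window):
            hits.append(i)
    return hits


def r1_word(line):
    """Lower-cased word immediately to the right of the node ('' if none)."""
    _left, _node, right = line
    return right[0].lower() if right else ""


def order_by_r1_frequency(lines):
    """Put lines with the most frequent R1 word first (ties: alphabetical)."""
    counts = Counter(r1_word(line) for line in lines)
    return sorted(lines, key=lambda line: (-counts[r1_word(line)], r1_word(line)))


# Invented, hand-tagged sample (an assumption; tags follow the VV* pattern
# for lexical verbs that appears in the search term above).
tagged = [
    ("refugees", "NN2"), ("were", "VBD"), ("resettled", "VVN"), (".", "PUN"),
    ("The", "AT0"), ("refugee", "NN1"), ("camp", "NN1"), ("is", "VBZ"),
    ("full", "AJ0"), (".", "PUN"),
    ("Refugees", "NN2"), ("fled", "VVD"), ("north", "AV0"),
]
hits = followed_within(tagged, "refugee*", "VV*", max_gap=2)
print(hits)

# Invented concordance lines as (left, node, right) token lists.
lines = [
    (["the"], "refugee", ["crisis", "deepened"]),
    (["many"], "refugees", ["from", "Bosnia"]),
    (["a"], "refugee", ["problem", "remains"]),
    (["some"], "refugees", ["from", "Vietnam"]),
    (["the"], "refugee", ["crisis", "grew"]),
    (["more"], "refugees", ["from", "Somalia"]),
]
ordered = order_by_r1_frequency(lines)
for left, node, right in ordered:
    print(" ".join(left), node, " ".join(right))
```

As with the search itself, a match only shows that a lexical verb follows closely: refugees were resettled is returned even though refugees are not the agent, so the lines still have to be read.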
Semantic preference and discourse prosody

Our concordance-based analysis of the terms refugee and refugees in the British National Corpus has been useful in revealing a range of representations: refugees as victims, as the recipients of official attempts to help, as a natural disaster, as possibly bogus, as a problem, as destructive or manipulative or as helpful. A concordance analysis therefore elucidates semantic preference. Semantic preference is, according to Stubbs (2001b: 65), ‘the relation, not between individual words, but between a lemma or word-form and a set of semantically related words’. For example, in the British National Corpus the word rising co-occurs with words to do with work and money: e.g. incomes, prices, wages, earnings, unemployment, etc. Semantic preference also occurs with multi-word units. For example glass of co-occurs with a lexical set of words denoting ‘cold drinks’: e.g. sherry, lemonade, water, champagne, milk, etc. Semantic preference is therefore related to the concept of collocation (see Chapter 6) but focusses on a lexical set of semantic categories rather than a single word or a related set of grammatical words. For example, refugees have a semantic preference for quantification. One of the pieces of information that is often given about refugees involves how many of them there are. However, semantic preference is also related to the concept of discourse prosody where patterns in discourse can be found between a word, phrase or lemma and a set of related words that suggest a discourse. The difference between semantic preference and discourse prosody is not always clear-cut. Stubbs (2001b: 65) says it is partly a question of how open-ended the list of collocates is. So it may be possible to list all of the words for ‘drinks’, indicating a semantic preference, but a more open-ended category such as ‘unpleasant things’ might be seen as a discourse prosody.
Stubbs (2001b: 88) later notes that even a category of semantic preference will be open-ended, but will contain frequent and typical members. In addition, semantic preference denotes aspects of meaning which are independent of speakers, whereas discourse prosody focusses on the relationship of a word to speakers and hearers, and is more concerned with attitudes. Tables 5.1 and 5.8 could be seen as examples of discourse prosodies, where refugees are described in terms of victims and criminals/nuisances respectively. Semantic preference is therefore more likely to occur in cases where attitudes are not expressed. However, even the absence of an attitude can be significant, for example by showing a speaker’s desire to remain ‘on
the fence’. And the case of writers giving refugees a semantic preference for quantification could involve a discourse prosody – why are writers so concerned with how many refugees there are? When we look at the discourse prosody which associates refugees with growing numbers that are described as a flood, influx or log-jam it is possible that at least in some cases of quantification there is anxiety about a situation that is viewed as spiralling out of control – large numbers of refugees are not seen as a good thing. There is sometimes inconsistency between the exact meanings of semantic preference and discourse prosody. Louw (1993) and Sinclair (1991) refer to another term called semantic prosody, which has been used by other researchers in a way which makes it akin to discourse prosodies. For example, Cotterill (2001) uses the term semantic prosody when analysing the language used in the high-profile trial of O.J. Simpson for marital violence. However, her analysis of phrases like to control and cycle of indicates that they contain patterns of evaluation, which suggests they could be classed as discourse prosodies. Therefore, I refer to discourse prosody in this chapter, although with an awareness that others may class this as semantic prosody (or semantic preference). Many of the linguistic strategies used to refer to refugees – such as referring to them as an indistinguishable mass or vague quantity, using water metaphors, referring to them as bogus or engaging in destructive behaviour, are linked to the more over-arching concept of racist discourse. As van Dijk (1987: 58) describes, there are four topic classes for racist discourses: they are different, they do not adapt, they are involved in negative acts and they threaten our socio-economic interests. 
Hardt-Mautner (1995b: 179) points out, ‘National identity emerges very much as a relational concept, the construction of “self ” being heavily dependent on the construction of “other”.’ The racist constructions of refugees therefore not only construct a threat to the status quo and national identity (which may help to sell newspapers), they also help to construct national identity by articulating what it is not. However, more encouraging aspects of the corpus data suggest a less prejudiced picture. Stereotypes of refugees as criminal nuisances were present in the corpus, yet they were relatively rare. Concordance lines which focussed on the problems encountered by refugees and/or attempts to help them were relatively more frequent, suggesting that there was an awareness of the need for sensitivity when discussing issues connected to immigration in the UK. Indeed, some concordance lines involved cases where a negative representation was given in order to counter it. As Law et al. (1997: 18) found, about three-quarters of news articles concerned with race contained media
frames ‘which seek to expose and criticise racist attitudes, statements, actions and policies, which address the concerns of immigrant and minority ethnic groups and show their contribution to British society, and which embrace an inclusive view of multi-cultural British identity’. A study by Jessika ter Wal (2002: 407) concludes that ‘the British tabloid press no longer seem to merit the overly racist tag that they were given by studies in the early 1980s’. A corpus-based approach is therefore useful, in that it helps to give a wider view of the range of possible ways of discussing refugees. A more qualitative, small-scale approach to analysis may mean that saliency is perceived as more important than frequency – whereby texts which present shocking or extreme positions are focussed on more than those which are more frequent, yet neutral. While it is important to examine extreme cases, it is also useful to put them into perspective alongside a wide range of other cases. In addition, corpus data can help us to establish which sorts of representations are most frequent or popular. For example, the refugees as water metaphor was found to be much more frequent than other metaphors. Rather than simply listing the metaphors which appear in the data then, we are able to get a more accurate sense of which ones are naturalised, and which ones may be particularly salient because they are so infrequent (in Chapter 8 I look more closely at obtaining metaphors from corpora). In attempting to provide explanations for our findings, we should take into account the context that these texts were produced in. Due to the fact that the BNC is a reference corpus, it is difficult to fully take into account all contexts although we should bear in mind that all involve British English and the majority of them were published in the early 1990s (none were published after this period).
It was a period where the UK was led by a Conservative government and numbers of applications for asylum to the UK had risen, from 3,998 in 1988 to 44,840 in 1991.2 The more negative representations of refugees may be explained to an extent by this rise, particularly in certain types of newspapers that are included in the BNC. In terms of distribution of the search term (refugee|refugees), 1,382 cases come from a rather general category called non-academic prose and biography while 457 come from newspapers. Other categories like spoken conversation and fiction and verse contribute fewer cases (5 and 116 respectively).
Points of concern

A concordance analysis is one of the more qualitative forms of analysis associated with corpus linguistics. While concordance programs allow researchers to sort and therefore view the data in a variety of different ways, it is still the responsibility of the analyst to recognise linguistic patterns and also to explain why they exist. A concordance analysis is therefore only as good as its analyst. A potential problem with concordance analyses is that they present us with small amounts of context, which can lead us to identify only surface meaning. As shown earlier, it can be harder to identify cases in concordance lines where someone’s opinion is being quoted in order to criticise it. Another issue involves interpretative positivism (Simpson 1993), where we assume that a certain linguistic form always has the same function. As Hardt-Mautner (1995a) warns, cases of passive constructions (like ‘the woman was attacked’) are not always employed as a way of obscuring the agent who carried out the action. There might be more innocent reasons for use of passives (e.g. the agent appears elsewhere in the text and the author wants to avoid stylistic repetition, or the agent can be easily deduced, e.g. ‘the man was arrested’). So simply identifying the number of passive constructions in a corpus should not be seen as a way of indicating the extent that a corpus contains intentional audience manipulation. We should also note that a concordance analysis like the one in this chapter can only reveal representations rather than truths. We cannot use it in order to ask questions such as how many refugees are bogus or what is it like to be a refugee? If our corpus consisted of interviews with refugees then we might be able to make a stronger claim that our analysis reflected the actual experience of refugees, but a general corpus or even a specific corpus composed only of texts about refugees gives a range of perspectives, some of which may be accurate, some more contested and some completely invented.
As indicated with the intertextual example of refugees bleeding Britain, it should be noted that representations do not always appear in a straightforward way. As with the cases of italicised words in this chapter, authors sometimes refer to words meta-linguistically in order to examine, discuss or critique the word’s usage or meaning. For example, Kaye (1998) carried out an analysis of the words bogus and phoney in British broadsheet newspaper articles about asylum seekers. He found that in 35 per cent of cases of these words, they occurred in the context of writers criticising others for using them. He also found that liberal newspapers used such terms more often, but were also more critical of them, when compared to more conservative newspapers. Additionally, the majority of these words appeared in the context of writers reporting or quoting someone else, usually a politician or government official, suggesting that the newspapers were not taking the lead in setting the agenda, but were largely accepting the agenda as defined by others. Therefore, the context of a word is important in how it contributes towards particular discourses.
One aspect of the concordance analysis that I have not considered so far is that when carrying out searches on a particular subject (particularly a noun), as well as euphemisms and similes for that subject, it might also be the case that it is referred to numerous times with determiners (this, that, these, those) or pronouns (it, them, they, she). Referrals can also be cataphoric – referring to a subject that is to follow it, e.g. ‘I don’t know them personally, but there seem to be refugees everywhere.’ Expanding a corpus search to include pronouns and determiners may yield further evidence of patterns or even completely different representations. However, concordances of pronouns and determiners are likely to include many irrelevant examples – the vast majority of cases of they in the BNC do not refer to refugees. Therefore, taking anaphora and cataphora into account is likely to make the process of analysis more time-consuming, although it may result in a wider set of findings. One way to target the more relevant cases would be to only search for pronouns and determiners which are reasonably close to the main search term. For example, in CQPweb a concordance search of them <<15>> refugees will find cases of the word them within 15 words of refugees. A related issue concerns cases where members of a social group are referred to by their names. If we find any concordance lines that do refer to refugees by name it may be worthwhile carrying out additional searches on those names to see if they uncover further representations, although it may be less easy to generalise any discourses found to the wider class of refugees. However, it should be borne in mind that individuals can be used as prototype examples for a whole class of people, making discourses more subtle, yet still apparent. 
For example, Morrish (2002) reports how newspapers constructed homophobic discourses around a gay government minister, Peter Mandelson, even though the words gay and homosexual were not used in articles about him. In interpreting our findings, we might want to consider the extent to which the representations we have found are problematic, and what an unproblematic set of representations would look like. While it could be argued that representations of refugees as flooding or bogus are negative, perhaps there are real-life cases where people pretend to be refugees to escape detection from a crime or for economic reasons. Even if we could determine the extent to which the bogus representations are reflective of reality, to what extent is it acceptable that we should find these representations in a corpus like the BNC? How many mentions of bogus refugees is too many? Perhaps the more sympathetic representations which focus on the plight of refugees are kinder, although it could also be argued that pitying a group is not especially empowering for them. However, we also need to bear
in mind that refugees are a dispossessed group and perhaps would not be expected to have power in their situation. A ‘sympathy’ representation might do more to ensure that they receive help by persuading people to give to supporting charities or governments to treat them kindly. However, we might also want to critically make a distinction between representations that could be viewed as sympathetic and those which are empathetic. So perhaps representations that emphasise the humanity of refugees would be seen as more helpful. Those which emphasise that the refugee status is a temporary and forced-upon one for people who hold it, and those which aim to provide background information about refugees or show them as individuals rather than representing them as an undistinguished mass would also be a step in the right direction. Representations that come from refugees themselves, through texts that tell their stories or give them a voice in some way would also be humanising, although it can be difficult to give voices to large numbers of people, and it might be the case that we need to devise additional search terms to find such voices. Considering that many references to refugees appear in news articles or texts relating to current affairs, we might want to bear in mind the concept of news values (Galtung and Ruge 1965). News stories tend to favour negativity and if we were to consider a wide range of other social group labels within news articles we would also be likely to find that they would attract negative discourse prosodies too. The question is then, are refugees represented especially unfavourably, even given our expectations that news reporting is likely to be negative about almost everyone? This study cannot fully address that, but it indicates a way that corpus approaches can result in further questions that were not initially thought of. 
It is notable that the flood metaphor did not appear with other social groups (apart from the related group immigrants) when we examined the rest of the BNC, so we could note this as a particularly salient (and problematic) representation of refugees. Another issue relates to frequency. Most forms of corpus analysis can be helpful in allowing us to quantify the extent of different representations in a set of texts. This can allow us to identify which ones are frequent and which are rare. However, we should not assume that a corpus, even a reference corpus like the BNC, contains all the possible representations or discourses around a subject, so a representation might be so infrequent that it never occurs at all and will thus be overlooked in the subsequent analysis. It is often the case that we do not know what we do not know. This is an issue that all forms of critical analyses have to contend with but with a corpus approach,
which is based around frequency and large data sets, we must not be complacent that we have the full picture. What is not there can be even more important than what is there. So how can this issue be resolved? One option could be to consider multiple corpora which are likely to contain different perspectives. The BNC, despite being a reference corpus, only contains discourses that were reasonably common in the UK in the early 1990s. Comparing it to corpora from other contexts might bring up other kinds of representations. If this is not possible, missing representations could be elicited from more targeted reading around the topic or from interactions with a wide range of people from different social backgrounds.
Conclusion
Concordance analysis is the most important aspect of corpus-based discourse analysis – it is where the human analyst takes over from the computer software and where the essential stage of interpretation occurs. Of all of the stages of analysis that are carried out, it is the one which should usually take the most time. However, it should be borne in mind that a concordance-informed discourse analysis is a form of subjective interpretation. The patterns of language which are found (or overlooked) are still likely to be subject to the researcher’s own ideological stance. And the way that they are interpreted may also be filtered through the researcher’s subject position. This is true of many other, if not all, forms of discourse analysis. However, the corpus-based approach at least helps to counter some of this bias by providing quantitative evidence of patterns that may be more difficult to ignore. Additionally, an analyst’s identification of a discourse may not mean that the same discourse is viewed in the same way, if at all, by other readers. Taking account of the issues surrounding text production and reception, as well as the historical context of the subject under discussion, is paramount in supporting the more linguistically informed analysis of a corpus.
Step-by-step guide to concordance analysis
1 Build or obtain access to a corpus.
2 Decide on the search term (e.g. refugee) – bearing in mind that search terms can be expanded to include plurals (refugees), rewordings (aliens), anaphora (them, they) and proper nouns of relevant individuals. In order to do this it might be useful to initially carry out a pilot study, looking closely at a small sample of the corpus, or consulting other sources.
3 Obtain a concordance of the search term(s).
4 If possible, clean the concordance – e.g. by removing lines which are not relevant – for example references that refer to aliens from space rather than aliens as refugees.
5 If there are hundreds or thousands of concordance lines, consider first examining a random sample (e.g. 100 lines) in order to initially identify representations that are likely to be reasonably frequent.
6 Sort the whole concordance on different words to the left and right while looking for evidence of grammatical, semantic or discourse patterns.
7 Note the frequencies of such patterns and try to create categories of representation from them. Use hypothesis testing by conducting additional searches to investigate further evidence of such patterns in the corpus.
8 Investigate the presence of particular terms more closely – e.g. by exploring their collocates or distribution in reference corpora of general language.
9 Note rare or non-existent cases of discourses based on your own intuitions. Carry out further hypothesis testing to try to identify such cases in the corpus or check other corpora to see if they occur there.
10 Attempt to hypothesise why the patterns appear and relate this to issues of text production and reception, as well as considering the extent to which representations appear problematic.
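The mechanical parts of the guide (obtaining a concordance, taking a random sample and sorting on the words to the right) can be sketched in a few lines of Python. This is only an illustration on an invented toy text: the function names (`concordance`, `sort_right`, `sample`), the simple regular-expression tokeniser and the example sentences are all assumptions made for the sketch, and dedicated tools such as CQPweb, #LancsBox or Sketch Engine will tokenise and display lines differently.

```python
import random
import re


def concordance(text, term, width=5):
    """Return (left, node, right) tuples for every hit of `term`."""
    # Hypothetical tokeniser: lower-cased word characters, optional apostrophe part.
    tokens = re.findall(r"\w+(?:'\w+)?", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append((left, tok, right))
    return lines


def sort_right(lines, position=1):
    """Sort lines on the word `position` places to the right of the node."""
    def key(line):
        right = line[2]
        return right[position - 1] if len(right) >= position else ""
    return sorted(lines, key=key)


def sample(lines, n=100, seed=0):
    """Draw a reproducible random sample of at most `n` lines."""
    rng = random.Random(seed)
    return rng.sample(lines, min(n, len(lines)))


# Invented toy text, not taken from the BNC.
toy_text = (
    "Thousands of refugees fled the fighting. Aid workers said the refugees "
    "needed shelter. Some refugees arrived by boat."
)
lines = concordance(toy_text, "refugees", width=3)
for left, node, right in sort_right(lines, position=1):
    print(" ".join(left).rjust(25), node.upper(), " ".join(right))
```

Fixing the random seed means that the same sample can be drawn again, which makes it easier to report exactly which lines were examined. The interpretative steps of the guide (categorising representations and explaining why they occur) remain the analyst’s job.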
Further reading
Baker, P. (2013), ‘Discourse prosodies and legitimation strategies: Revisiting the Daily Mail’s representations of gay men’, in Using Corpora to Analyse Gender, London: Bloomsbury, 97–119. This book chapter uses concordance analysis to focus on press representations around gay men, identifying how discourse prosodies can ‘overlap’ and how analysis of expanded concordance lines can be used to identify legitimation strategies.
Nguyen, L. and McCallum, K. (2016), ‘Drowning in our own home. A metaphor-led discourse analysis of Australian news media reporting on maritime asylum seekers’, Communication Research and Practice 2(2): 159–76. This paper identifies water metaphors in news articles which the authors view as being neutral, as well as some which appear to argue for better treatment of asylum seekers. The article thus cautions against always viewing a water metaphor as negative in the context of immigration.
Taylor, C. (2021), ‘Metaphors of migration over time’, Discourse and Society. https://doi.org/10.1177/0957926521992156. This paper examines conventionalised framings of immigration in The Times newspaper between 1800 and 2018, finding that some metaphors, like the liquid one, persisted throughout the whole period, others were more recent, like the idea of immigrants as invaders or animals; and others had dropped out of use before returning, such as those which frame immigrants as commodities or guests. An article like this helps to provide historic context when focussing on a particular time period like the 1990s.
Questions for students
1 A concordance search works well for specific words and phrases, and is particularly good at identifying representations, but how could it be used to find other discourse-related phenomena like argumentation or legitimation strategies?
2 Try using a general corpus to investigate whether there are discourse prosodies around the following: (a) forms of swan as a verb, (b) forms of derive, (c) forms of grow.
3 Choose a word or phrase that relates to a type of social actor and conduct a concordance search of it in a corpus. Look at a random selection of thirty lines and group the lines using a simple but relevant categorisation system (e.g. whether the social actor is represented in a positive, negative, ambiguous or neutral way). Then examine another thirty cases for the word and apply your classification system again. Did you find the same results in the first set of thirty lines compared to the second set of thirty lines? How useful was the second search?
6 Collocates
Introduction
In the last chapter we saw how carrying out a close analysis of concordance lines can be helpful in revealing evidence for discourses within texts. We also demonstrated that looking at random samples of concordance lines can help to reduce the amount of work required if thousands of lines are produced. However, sampling might not always produce generalisable results and it may fail to reveal all the salient aspects of the concordance. Another problem with relying only on concordance samples is that patterns are not always as clear-cut in a concordance as we would like them to be. Consider the relationship between the words forgive and sins. A concordance of sins in the British National Corpus, sorted one place to the right, reveals the pattern forgive sins, which occurs thirteen times. This is not a particularly large number. However, further sorts, two, three, four and five places to the right reveal that forgive and sins occur near each other in other ways, e.g. forgive our sins, forgive us our sins, forgive me all my sins. In all, sins occurs near or next to forgive a total of twenty-nine times, and if we take into account related forms of these words (e.g. by looking at forgiving and sin, etc. as well), the relationship between the two concepts is even stronger. However, in order to get an idea of the strength of the relationship between forgive and sins, we have had to carry out analyses of the same concordance, sorted in lots of different ways, which is time-consuming and reliant on the researcher’s attention not wandering. It would be more convenient, at least at the beginning, to be simply given a list of words which tend to occur near or next to forgive relatively often – we can worry about where they actually appear and what their relationship means once we know what those words are. All words co-occur with each other to some degree. 
However, when a word regularly appears near another word, and the relationship is statistically
significant in some way, then such co-occurrences are referred to as collocates and the phenomenon of certain words frequently occurring next to or near each other is collocation. As Firth (1957: 11) wrote: ‘You shall know a lot about a word from the company it keeps.’ Collocation is a way of understanding meanings and associations between words which are otherwise difficult to ascertain from a small-scale analysis of a single text. Stubbs (1996: 172) takes the point further: ‘words occur in characteristic collocations, which show the associations and connotations they have, and therefore the assumptions which they embody’. In this chapter our case study involves the analysis of a 1.7 million word corpus of texts relating to Islam that were classified by experts working within the Home Office in the UK as being Extremist (170 texts in total). I refer to this as the Extreme Corpus. Such texts contain ‘endorsement/glorification of violence in contemporary context and/or stark dehumanisation’ (Holbrook 2015: 60). The texts include news reports, magazines, interviews, lectures, biographies, political treatises, statements and guides on topics like bomb-making and computer encryption. Working with these kinds of texts requires special considerations from ethical and legal perspectives. Possession of these texts is illegal in the UK, and I obtained them from police authorities who engaged me to work on them. All of these texts were found in the possession of individuals who had been convicted of terrorism crimes and while we cannot claim that reading these texts caused people to carry out crime, it is likely that they played a role in the radicalisation of these individuals. For legal reasons it is not possible to share this corpus with others, which makes the analysis harder to verify. This is not an ideal situation but if we only carried out research under perfect conditions, then a good deal of important work would not get done. 
When describing this corpus I thought carefully about the amount of information I ought to reveal about the texts. Considering their nature, I decided that it would not be in the public interest to provide the names of individual texts or to quote at length from them, so I only provide context in the form of concordance lines or brief sentences. I provide a content warning when working with texts like this, and it is worth considering the effect that such texts might have on the people who analyse them. At times, researchers working from a discourse perspective may be asked to analyse texts that they find upsetting. There should be no obligation to work with such texts and individuals should not feel bad about withdrawing from a project if they feel that it will adversely affect them. I initially worked with these texts as part of a team and found that this was
useful in terms of allowing me to discuss the personal impact of conducting the analysis. For the purpose of this chapter I am going to focus on a single word – America – which occurred 1,611 times in this corpus and is the seventeenth most common noun as well as the most frequently mentioned country in the corpus. Chilton and Lakoff (1995) have noted that modern states tend to be personified and assigned personalities with decision-making abilities. An analysis of 100 concordance lines taken at random from the corpus data found twenty-three cases where America literally referred to a country while seventy-seven cases represented America as a social actor (e.g. ‘America was sending its armies to the Gulf to invade Iraq’). This could be seen as a form of spatialisation, ‘whereby social actors are represented by means of reference to a place with which they are associated’ (van Leeuwen 1996: 59). In this corpus, America often appears to stand for the political leaders of America although sometimes the word is used in a way which suggests an entire society, e.g. with the phrase ‘America will not live in peace and security until we live it in Palestine’, a possible interpretation is that it is not just American leaders who will not live in peace and security but all American citizens. A potential effect of these different uses of America is to blur the boundary between the American government and ordinary Americans, thus helping to enable violence against ordinary citizens. In this chapter I will examine collocates of America in this corpus by using two tools, #LancsBox and Sketch Engine, each of which affords a different kind of analysis. #LancsBox is a standalone piece of software which can be downloaded onto a computer free of charge while Sketch Engine is a web-based interface which requires a monthly subscription, although there is a free trial period. 
I am interested in considering the ways that America is represented in the corpus through its discourse prosodies, and also how such representations could link to the ways that the text producers try to legitimate violence towards American people.
Deriving collocates
A collocational analysis will not always reveal much of interest, and particularly in cases where a word’s frequency is quite low (say under 100 occurrences) there may not be enough information to help us to identify patterns – or the patterns which occur may be due to idiosyncratic cases so are not especially representative of how the word is used. In such cases, a
concordance analysis is likely to be of more value. However, 1,611 occurrences of America is large enough for an analysis of collocates to be worthwhile. First though, we need to know something about the ways that collocates are calculated. A number of different techniques exist, the simplest of which is to count the number of times a given word appears within, say, a five-word window to the left or right of a search term (some corpus tools give the option not to include cases which cut across sentence boundaries). If we load the Extreme Corpus into #LancsBox and use this procedure on America we get a list of words which includes the, and, of, to, is, it and that in the top twenty collocates. The most common collocates are therefore grammatical or function words: articles, prepositions, conjunctions and pronouns. One of the problems with using a frequency-based technique to calculate collocates is that the most frequent ones almost always tend to be the same kinds of function words. So this does not necessarily tell us anything special about the contexts that a particular word occurs in. A way of resolving this problem is to look further down the list until we come to the lexical words as we did in Chapter 4 when we looked at frequency lists derived from holiday leaflets. If we do this we find words like allies (which occurs 66 times with America), Muslims (also 66 times), war (57 times) and world (46 times). So far so good? Unfortunately, there is another problem. Just because Muslims occurs often as a lexical collocate of America we cannot be certain that frequency is the same as saliency. It is fairly likely that Muslims occurs in this corpus with a great number of other words as well and also that it occurs lots of times in the corpus when the word America is not nearby. In fact, Muslims occurs 4,292 times in this corpus, so in the majority of cases (98.46 per cent of the time) it does not appear near America. 
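The frequency-based procedure described above can be sketched as follows. This is a minimal illustration on an invented sentence, not a reconstruction of how #LancsBox counts: the function name `window_collocates`, the tokeniser and the toy sentence are assumptions, and the sketch ignores sentence boundaries. The last two lines simply redo the arithmetic for Muslims using the figures quoted in the text.

```python
from collections import Counter
import re


def window_collocates(tokens, node, span=5):
    """Count every token found within `span` words either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts


# Invented toy sentence; a real analysis would load the whole corpus.
toy = "the war against america and the allies of america in the war"
tokens = re.findall(r"\w+", toy.lower())
freq = window_collocates(tokens, "america", span=3)
print(freq.most_common(3))

# Figures from the text: Muslims occurs 4,292 times, 66 of them near America.
share_apart = 1 - 66 / 4292
print(f"{share_apart:.2%} of occurrences of Muslims are not near America")
```

Even in this tiny example the function word the comes out as the most frequent collocate, which is the problem with raw frequency that the text goes on to discuss.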
Other than using our intuition, we need a way of taking into account when a relationship between two words is fairly exclusive and not just due to one or both words being common to the corpus. Additionally, we should not discount function words too quickly as at times they can reveal interesting aspects of discourse – for example, in an analysis of a corpus of texts about swearing, McEnery (2005) showed that a strong collocate of swearing was the conjunction and. This was due to swearing being consistently paired with a range of negative phenomena (e.g. violence, sex, drunkenness) in order to cast swearing in the same light. However, because grammatical collocates are usually very frequent, that means we would need to explore them all every time we carry out an analysis, and most of the time they would not reveal exciting findings. We therefore need to rely on techniques that do not just tell us that a relationship is frequent, but it is
special in some way – more frequent than we would normally expect. A measure called Mutual Information (MI) does that. MI is calculated by examining all of the places where two potential collocates occur in a text or corpus. An algorithm then computes what the expected probability of these two words occurring near to each other would be, based on their relative frequencies and the overall size of the corpus. It then compares this expected figure to the observed figure – what has actually happened – and converts the difference between the two into a number which indicates the strength of the collocation – the higher the number, the stronger the collocation. MI is an effect-size measure, meaning that it tells us about the strength of the relationship between two words. Some people have advised that an MI of 3 or above is evidence of collocation (Hunston 2002: 71) while Durrant and Doherty (2010) have indicated that for ‘collocational priming’ to occur (whereby we are triggered to think of one word when we encounter another), a MI score of at least 6 is required. If we examine the collocates of America using MI, we find 367 collocates that have an MI score of 3 or above, while 42 collocates have an MI score of 6 or above. It is possible to take into account lots of collocates at once by grouping similar collocates together, and the analysis using Sketch Engine later in this chapter attempts to do just that. However, one approach that some analysts have taken is to try to focus on a smaller number of collocates that are likely to be highly salient and thus memorable to readers. An option would be to take the highest scoring ten, twenty or fifty collocates and focus an analysis around them. Here we can claim that we have focussed on a set of the most salient collocates although we also have to acknowledge that our cut-off point is arbitrary. The top ten collocates of America using MI include the words leash, miserably, latin, trillions, racism and bleed. 
This reveals some interesting patterns in the corpus. For example, Table 6.1 shows some sample concordance lines where America collocates with bleed and racism. The first three lines indicate a strategy to make America bankrupt, with bleed being used metaphorically, while the final three lines represent America negatively as racist. While these representations are interesting, a potential problem with the top MI collocates is that they are usually infrequent – leash, miserably, latin, trillions, racism and bleed only collocate with America five times each. The patterns identified by MI are very salient – in this corpus bleed and America are much more likely to occur together than apart from one another, but as the pairing is still very infrequent we need to be careful about how much we can generalise. Perhaps bleed America to the point of bankruptcy forms part
of a larger pattern relating to a desire to harm America, or it might be a rare case, unrelated to anything else. We cannot tell from this analysis.

Table 6.1 Concordance of America collocating with bleed or racism

        we are continuing to make  America  bleed to the point of bankruptcy, by
 the success of our plan to bleed  America  to the point of bankruptcy, with
         the same policy: to make  America  bleed til it becomes bankrupt –
 against oppression and racism in  America  ? Or should we believe that one day
 more on the subject of racism in  America  , and with that I write seeking
other states echoing the call for  America  to stop racism. The attempt by

So it would be good to have a collocational measure that produces higher frequency words, although not so high that just grammatical words appear. Therefore, a range of other calculations have been suggested, for example the T score, Z-score (Berry-Rogghe 1973), MI3 (Oakes 1998: 171–2), log-log (Kilgarriff and Tugwell 2001) and log-likelihood (Dunning 1993) scores. New measures of collocation continue to be invented, with later ones including logDice, logRatio, Minimum sensitivity, Delta P and Cohen’s D. Some of the earlier measures, like the T score, produced scores that were strongly dependent on the size of the corpus being used, so later measures have tried to incorporate standardisation, enabling collocation scores to be compared across corpora of different sizes (Gablasova et al. 2017). It is useful to make a distinction between effect size measures like MI which tell us whether a relationship between two words is strong (or not), and hypothesis testing measures like log-likelihood, which tell us whether it is likely that a relationship between two words actually exists. The hypothesis testing measures tend to require more evidence of a relationship (even if the relationship is quite weak), so they tend to favour high frequency words (the top twenty log-likelihood collocates of America were pretty much indistinguishable from the ones I obtained just by looking at the most frequent collocates). The effect size measures tend to be based on the extent that two words appear together rather than apart but as they are less concerned with the extent of evidence, they can privilege low frequency pairs. 
So for example, in the Extreme Corpus, when I looked at collocates of America using MI score, the top collocate was leash, which occurred seven times in the corpus, of which five of these appeared near or next to America. When I used log-likelihood, the top collocate of America was the genitive marker ’s, which occurred 11,942 times in the whole corpus and 307 times with America. It could be argued that leash is a strong collocate – in five out
of seven cases (or 71 per cent of the time it occurs) it is a collocate of America, but as we only have five cases to consider, that does not provide us with much coverage of the corpus so it does not give us much to say, really. On the other hand, we have 307 cases of ’s to talk about but actually, in our corpus ’s only occurs 2.5 per cent of the time with America, so it is a more frequent collocate but the relationship is much weaker. Some of the statistics, like the Dice score, attempt to give a compromise between evidence and strength, and some researchers have tried to cover both bases by saying that they will only consider collocates that have scores above cut-offs for two or more statistics, e.g. both MI and LL. #LancsBox currently has fourteen different ways of calculating collocation while Sketch Engine has seven plus the simple frequency method. That is a lot of measures, and as a result, the proliferation of techniques has the potential to result in confusion, especially for those new to the field. Some of my students struggle to decide which collocational measure is best, and they worry that their findings might be discredited because they might have used a measure that is viewed by some as outdated and thus inferior. A study by Hughes and Hardie (2019: 206) examined brain activity when people were shown different pairs of words, comparing their findings to eight different measures of collocation for the same word pairs based on data from the British National Corpus. The study aimed to test the psycholinguistic validity of different collocational measures. When the brain encounters an unfamiliar word pairing, it produces more activity as it works to process what it has encountered. If brain activity is low, we would expect this to be due to a strong collocation being encountered. 
Therefore, it would be interesting to see if the brain produces less activity for pairs that have a high MI score or whether pairs with a high LL score would correlate with less brain activity. Hughes and Hardie found that brain activity most clearly correlated with collocation measures that combined effect size and significance. This was a small-scale study though, and while it takes us in what I believe is a sensible direction, I would maintain that there is no perfect measure of collocation although it is important to understand what the statistics are actually measuring as this can help you to decide which one will give you a set of collocates that are useful in terms of answering your research questions, and also help you in terms of interpreting your results. It is important to be transparent (e.g. explain the collocational measure you use and the different criteria you used for collocation) and also consistent (if you begin a piece of research using one measure of collocation, do not switch to another halfway through).
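The contrast between an effect size measure and a hypothesis testing measure can be illustrated with the figures given above for leash and ’s. The sketch below is hedged in several ways: it assumes a corpus size of exactly 1,700,000 tokens (the text only says 1.7 million), it uses the textbook form of MI (the base-2 logarithm of observed over expected co-occurrences) without any correction for window size, and it computes log-likelihood from a simple two-by-two contingency table. Tools such as #LancsBox and Sketch Engine build their tables differently, so the scores printed here should not be expected to match their output; only the ranking is of interest.

```python
import math


def mutual_information(o11, f_node, f_coll, n):
    """Effect size: log2 of observed over expected co-occurrence frequency."""
    expected = f_node * f_coll / n
    return math.log2(o11 / expected)


def log_likelihood(o11, f_node, f_coll, n):
    """Hypothesis test: Dunning-style G2 over a 2x2 contingency table."""
    observed = [
        [o11, f_node - o11],
        [f_coll - o11, n - f_node - f_coll + o11],
    ]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = rows[i] * cols[j] / n
            if o > 0:  # a zero cell contributes nothing to the sum
                g2 += o * math.log(o / e)
    return 2 * g2


N = 1_700_000        # assumed corpus size (approximation of "1.7 million")
F_AMERICA = 1_611    # frequency of America, from the text

# (co-occurrences with America, total frequency), both from the text
pairs = {"leash": (5, 7), "'s": (307, 11_942)}

for word, (together, total) in pairs.items():
    mi = mutual_information(together, F_AMERICA, total, N)
    ll = log_likelihood(together, F_AMERICA, total, N)
    print(f"{word:6} MI = {mi:5.2f}   LL = {ll:8.2f}")
```

Under these assumptions leash receives the higher MI score while ’s receives the higher log-likelihood score, which mirrors the trade-off between strength and amount of evidence described in the text.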
For this analysis I am going to use the logDice measure of collocation. It is a statistic adopted both by #LancsBox and Sketch Engine, as well as being the statistic used when creating Word Sketches within Sketch Engine. LogDice expresses the typicality of a collocation (as opposed to its frequency) so it tends to favour reasonably low frequency collocates although usually not as low as the ones obtained using MI. The top ten logDice collocates of America, using the default settings for each tool, are shown in Table 6.2. For each collocate I have provided information relating to the number of times it occurs with America as well as its logDice score. It is useful to give this information so that others can more fully interpret the amount of data you are looking at and the strength or likelihood of the collocates you examine. If your analysis involves hundreds of collocates then you might want to present the longer tables in an appendix or consider adopting a more truncated account, e.g. by writing something like ‘all collocates considered in this analysis occurred between 31 and 155 times with America and had a logDice score between 8.97 and 10.01’. The total number of times the top ten collocates co-occur with America is 659 for #LancsBox and 550 for Sketch Engine, so this gives us a manageable amount of data to examine.

Table 6.2 Top ten logDice collocates of America using #LancsBox and Sketch Engine

      #LancsBox                          Sketch Engine
      Collocate   Frequency   logDice    Collocate   Frequency   logDice
 1    allies      66          10.01      allies      50          9.50
 2    its         155         9.71       its         106         9.16
 3    europe      33          9.29       Europe      30          8.99
 4    countries   41          9.21       s           94          8.94
 5    has         118         9.07       has         100         8.81
 6    her         40          9.07       Israel      29          8.73
 7    war         57          9.02       her         30          8.51
 8    israel      32          9.02       against     59          8.39
 9    against     86          9.02       war         36          8.36
10    today       31          8.97       allied      16          8.18

Perhaps surprisingly, the two lists are different. Only eight collocates appear in both top ten lists (allies, its, Europe, has, her, Israel, war, against). Part of the reason for the discrepancy is the way that the different tools carry out tokenisation. Some count numerals as words, some will separate out words with hyphens, while others will include punctuation as a ‘word’. Additionally, the calculation of collocates can be done slightly
differently from tool to tool. As a result, loading the same corpus into two tools, even with the same settings, is likely to result in slightly different lists of collocates. In Sketch Engine, a word like America’s is broken into two parts, which explains why s appears as a collocate there. #LancsBox does not carry out tokenisation in the same way. We might also note that #LancsBox displays all collocates in lower-case whereas Sketch Engine retains word initial upper case. Another discrepancy relates to the number of times the words collocate with America and the associated logDice scores. With #LancsBox, allies collocates sixty-six times with America and has a logDice of 10.01, whereas with Sketch Engine it only occurs with America fifty times and has a logDice of 9.50. This is due to the fact that the default collocational span is different for the two tools. #LancsBox uses a span of five words either side of the search word whereas this number is three for Sketch Engine. Jones and Sinclair (1974) suggest that significant collocates are found within a span of 4:4, although Stubbs (2001b: 29) points out that there is no total agreement on this. Just as with the different measures for collocation, there is no perfect span that gives ideal results for all words and all corpora. Instead, some spans will provide more helpful results than others for particular words and particular research questions. For example, if I am examining a very frequent noun and want to know about adjectival collocates, using a fairly small span (perhaps two words either side of the search word) is likely to give me what I want. However, if I am looking at a much less frequent word and want to focus on a wider range of collocates, then a span of five words either side might be more useful. Larger spans might result in unwanted collocates, however, where the relationship between two words might be more tenuous.
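To make the effect of span more concrete, here is a minimal Python sketch. It is not a description of how #LancsBox or Sketch Engine work internally. It counts co-occurrences within a window either side of a node word and then applies the standard logDice formula, 14 + log2(2 × f(x,y) / (f(x) + f(y))). The toy text, the whitespace tokenisation and the function names collocate_counts and log_dice are all assumptions made for illustration, so the numbers it produces are not comparable with Table 6.2.

```python
import math
from collections import Counter


def collocate_counts(tokens, node, span):
    """Count words occurring within `span` tokens either side of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def log_dice(f_xy, f_x, f_y):
    """Standard logDice: 14 + log2(2 * f_xy / (f_x + f_y))."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Invented toy text (assumption): lower-cased and split on whitespace,
# treated as one continuous stream of tokens.
text = ("america and its allies will never win this war "
        "the allies of america met today "
        "we oppose america and all of her allies")
tokens = text.split()
freq = Counter(tokens)

for span in (3, 5):
    co = collocate_counts(tokens, "america", span)
    score = log_dice(co["allies"], freq["america"], freq["allies"])
    print(f"span {span}: co-occurrences = {co['allies']}, logDice = {score:.2f}")
```

In this toy text, widening the span from three to five words raises the number of times allies co-occurs with america from two to three, and the logDice score rises with it. This is the same kind of shift seen between the two tools' default settings.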
Categorising collocates

Although some collocates may occur with low frequencies, they may contribute towards a more common pattern if they function in similar ways to other collocates. For that reason, it can be a useful exercise to try to identify similarities between collocates, categorising them according to their grammatical class, function, meaning, representation or theme. The process of categorisation is suggestive of grounded theory, which begins with no initial hypothesis or framework but involves allowing ideas and concepts to emerge through the data (Martin and Turner 1986). Categorisation is a subjective and imperfect process but it can be useful in helping the analyst to
identify leads that might have been missed so they can be investigated further. The categorisation does not constitute the analysis but it aids further analysis. How many collocates should be categorised? There is no simple answer although I have found that between thirty and 100 collocates is usually enough to allow categories to emerge. If you have already categorised fifty collocates and almost all of the next ten collocates you examine can fit reasonably into the existing categories, then you are not finding much that is new so you have probably done enough in terms of getting a sense of that word’s collocational relationships. There are a number of different ways to approach a categorisation system. It is possible to use an existing scheme, like USAS (UCREL Semantic Analysis System) which categorises words according to thesaurus categories (Wilson and Thomas 1997), or LIWC (Linguistic Inquiry and Word Count) which is based upon psychological processes (Tausczik and Pennebaker 2010). Such existing systems can be useful, especially if you have not carried out a categorisation by hand before, although they can also have their limitations as they may not capture more nuanced uses of language that are specific to your corpus. Many words can have multiple meanings. Honey can refer to the food produced by bees or can be a term of endearment. How should it be categorised? If it is always used in a single way in your corpus, then there is no problem. However, if it is used in multiple ways then it is worth looking at a random sample of concordance lines and considering whether one meaning is much more frequent, and choosing that one. If the word’s meanings are similarly frequent, it might be best to simply put it into two or more categories. Again, there is no perfect way although it is important to make your categorisation decisions in such cases explicit. It is likely that some categories will be based upon grammatical features. 
I normally begin with a catch-all ‘grammatical words’ category and then later decide whether it is helpful to break this up into smaller categories, e.g. for modal verbs like would and could, or for pronouns like you and he. I usually have a rule that if only one word appears in a category on its own, then the category is probably not very helpful (especially as we can end up with lots of categories with only one word in them), so I usually call the final row of the categorisation table ‘Others’ and put all these stray words that do not fit anywhere else in there. Table 6.3 shows my attempt at categorising the top fifty logDice collocates of America (derived via #LancsBox). This resulted in six categories. It was fairly easy to place allies and allied together as they were forms of the same
word. Similarly, attack and attacks went together although I added war and 9/11 to them as they all appeared to relate to forms of violence. Another category that immediately suggested itself was to do with locations. I had initially called this category ‘Countries’ but then decided to put in words like west and world which required the label to be changed. The category ‘Leader’ was one which emerged unexpectedly and quite late in the process. It was only through concordance analyses that I realised that led almost always appeared in contexts like ‘The whole world, led by America’, whereas head occurred in cases like ‘the West in general, with America at its head’. Had I stuck to an existing categorisation system like USAS, I probably would not have identified this category as head would have been categorised as a body part. Analysts who are keen on using an existing system might want to consider adapting it if needed. An important point to make about this categorisation system is that it requires a concordance analysis to occur at least twice. An initial concordance analysis was used to get a sense of the gist or general meaning of most of the words (I did not concordance the grammatical words as their meaning was reasonably obvious). However, a full analysis would require more careful concordancing once the categorisation had been completed (which then might result in a few tweaks to the table). So Table 6.3 requires us to ask why two words collocate and how this relates to representations or discourses. For example, why does America collocate with violence words? If we took the collocate attacks, we would see that it occurs fifteen times with America, ten times as a noun and five times as a verb. Ten cases describe attacks upon America and five describe America as carrying out attacks. So the typical pattern has America as the recipient of violence although there are also some cases of America as violent.
Table 6.3 Categorisation of top fifty collocates of America

Category            Collocates
Locations           europe, america, israel, west, countries, world, states, iraq, united, western, britain
Grammatical words   its, has, her, against, in, on, by, is, it, with, and, that, have, what, over, inside, will, been, not, to, itself, of, because, could, as, but, a
Violence            attack, attacks, war, 9/11
Allies              allies, allied
Leader              led, head
Others              today, failed, why, muslims
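The bookkeeping behind a table like this can be sketched in a few lines of Python. The sketch below is an illustration only. The category decisions are still made by the analyst, and the code simply applies the rule that any category left with a single word is folded into ‘Others’. The provisional labels ‘Time’, ‘Failure’ and ‘Question’ and the function name build_table are hypothetical, and only a handful of the fifty collocates are included.

```python
from collections import defaultdict

# Hand-assigned provisional labels for a few collocates from Table 6.3.
# "Time", "Failure" and "Question" are invented labels (assumption) standing
# in for categories that attracted only one word.
labels = {
    "europe": "Locations", "israel": "Locations", "west": "Locations",
    "attack": "Violence", "attacks": "Violence", "war": "Violence",
    "allies": "Allies", "allied": "Allies",
    "led": "Leader", "head": "Leader",
    "today": "Time", "failed": "Failure", "why": "Question",
}


def build_table(labels, min_size=2):
    """Group words by category; move words from categories smaller than min_size to 'Others'."""
    groups = defaultdict(list)
    for word, category in labels.items():
        groups[category].append(word)

    table, others = {}, []
    for category, words in groups.items():
        if len(words) >= min_size:
            table[category] = words
        else:
            others.extend(words)
    if others:
        table["Others"] = others
    return table


for category, words in build_table(labels).items():
    print(f"{category}: {', '.join(words)}")
```

Keeping the provisional labels in the data, rather than deleting them, makes it easy to revisit a stray word later if further collocates turn out to belong with it.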
Often, when we write up our analyses, we will not have space to include a concordance table for each word, phrase or collocational pair that we want to report on. Instead, we should try to summarise the analysis we have done, although it is also useful to provide an illustrative example. Usually, the example ought to be representative of the most typical usage of the feature we are examining, although it can also be pertinent to include unusual cases, as long as they are noted as such. And rather than including a concordance line which simply gives a snippet that may not immediately make sense to the reader, I usually prefer to cite examples as sentences. If a cited sentence is not clear enough, it is worth expanding the example to include additional sentences. If we are comparing examples from different sub-corpora (e.g. articles from different newspapers or different time periods), it might be worth noting the source and/or date of the example you quote. Representative examples of America as a recipient of violence or as violent are shown below.

Is it logical that America attacks us for more than fifty years and we let it live in security and peace?

al-Qaeda spent $500,000 on the September 11 attacks, while America lost more than $500 billion, at the lowest estimate, in the event and its aftermath.

The first example sets up a rhetorical question. America is characterised as carrying out attacks for more than fifty years while the word us is used vaguely and could be interpreted as referring to all Muslims. The author of this text argues that America deserves retaliatory attacks due to its aggressive behaviour towards others over a long period of time. In the second example, the author contrasts the relatively low cost of the 9/11 attacks on America with the financial cost to that country, a way of emphasising America’s losses as a result of apparently low investment on the part of the attackers.
There are germs of two representations here – one of America as warlike, the other of America as weakened. We will return to these later in the chapter, although next I want to turn to a somewhat more complex way of considering collocates.
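The difference between a concordance snippet and a full-sentence citation, discussed above, can be seen in a small Python sketch. This is an assumption-laden illustration, not the way any particular corpus tool builds its concordances. The function name kwic, the whitespace tokenisation and the default window width are all hypothetical choices.

```python
def kwic(tokens, node, width=5):
    """Return (left context, node, right context) for each occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# The first example sentence cited above, split on whitespace.
sentence = ("Is it logical that America attacks us for more than fifty "
            "years and we let it live in security and peace?")
tokens = sentence.split()

for left, node, right in kwic(tokens, "America", width=4):
    print(f"{left:>25} | {node} | {right}")
```

With a window of four words the snippet stops at ‘attacks us for more’, which loses the second half of the rhetorical question. This is why quoting the whole sentence is usually clearer for the reader.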
Collocational networks

One way of visualising the set of collocates of a particular word is to create a collocational network. Figure 6.1 shows a simple collocational network of the top ten collocates of America (shown in lower-case) using #LancsBox.
Figure 6.1 Simple collocational network of America.
The network has the node word (america) in the middle and then a series of lines leading to other words. The shorter the line, the stronger the relationship between the two words. Additionally, the position of the word relative to america gives an indication about where the two words usually appear in relation to one another in the corpus. So against, which appears to the left of america, indicates that this word appears before America in the corpus. I had to alter the settings within #LancsBox to produce this network. Otherwise, the network would have contained 421 collocates making it impossible to read any of the words in it. So I specified that #LancsBox only considered collocates with a logDice score of at least 8.96, giving me the top ten collocates. Figure 6.1 is prettier to look at than a table, being more eye-catching, although with a table we could include extra information about word frequencies and logDice scores. However, a collocational network can prove to be more useful than a table listing the collocates because we can expand it to consider multiple relationships between words simultaneously. In Figure 6.2 I have obtained the collocates of allies from the corpus, which creates a more interesting network. This figure indicates that while America and allies collocate with one another, they both also collocate with two other top ten collocates, her and its. Her and its do not collocate with one another though. This part of the network (referred to as a graph) looks a bit like a parallelogram although due to the additional line running through its middle (from America to allies), we could informally refer to it as a diamond (its official name, according to Graph Theory, is K4-e – see Baker 2016). This type of graph suggests that the two words that do not collocate directly with one another might have similar functions (as is the case with her and its). The fact that america and allies do directly collocate would perhaps indicate that they are
not used synonymously, but it is likely that a direct relationship is made between them in the corpus.

Figure 6.2 Collocational network focussing on America and allies.

It is useful to obtain examples from the corpus, via concordance lines, which illustrate how these four words co-occur. Such cases are shown in Table 6.4. There are no cases where the four words (America, her, its and allies) all appear in the same concordance line – this would be expected because her and its do not collocate with one another. However, there are numerous cases where America is followed by a comma or the word and, then either the word her or its, then the word allies. Its and her thus have an equivalent function in this construction. Table 6.4 shows representations of America and her/its allies as hating Muslims and waging war against the Islamic State as well as being attacked by the authors of the texts. We will look at these kinds of representations later in the chapter; for now, however, I want to focus on how the collocational network leads us to the conclusion that her is used to refer directly to America. All forty instances of America co-occurring with her are used to refer to America as female. Why would this be the case?

Table 6.4 Sample concordance lines showing America collocating with her, its and allies

Today you witness and hear that | America | , her allies and agents have gathered their evil against you at both ends
What more evidence does one need | America | and her allies hate Muslims who want to practice their religion
that my religion required me to fight | America | and her allies as Islam doesn’t shy from stating who is the occupier
And because | America | raised its allies and agents to wage war against the Islamic State alone
It is the only way wars end, and | America | and its allies will never win this war
we will continue to deliver blows to | America | and its allies until we shatter your shackles
Perhaps it is typical of general English? To test this hypothesis, I examined cases where America collocated with her in the British National Corpus. There were forty-eight instances but none of them used her to refer to America. Instead, they occurred in contexts like ‘Aunt Alicia has written to her in America’. However, when I examined a few other countries (Germany, Britain, France, Spain, Russia and Italy) in the corpus, I found around 100 cases where countries were referred to as her. Such cases usually occurred in political or historical texts where the countries were standing for the leaders, e.g. ‘This might have been the moment for Britain to transfer her loyalties to the European community.’ So this is something which does occur in general English, particularly in political or historical texts. We might want to consider whether America is ever characterised as masculine with the equivalent pronoun his or other masculine pronouns like he and himself. No cases like this were found in the corpus (or the BNC for that matter). America is not always feminised in the Extreme Corpus though – we should bear in mind the 155 occurrences of its, which represent America as non-gendered. So it seems that some authors in this corpus position America as non-gendered and some position it as female. One possibility is that the feminising use of her with America is atypical in the Extreme Corpus, only associated with one or two authors. However, it occurs across seventeen different texts so it appears to be a fairly well-distributed phenomenon. Perhaps it is the case that in the Extreme Corpus some writers simply have a tendency to refer to all countries as female. To explore this hypothesis I carried out searches of her, herself and she and then examined concordance lines and collocates. No other countries collocated with female pronouns, despite the fact that some countries were mentioned fairly frequently in the corpus, e.g. 
Iraq (957 cases), Afghanistan (920), Israel (503) and Egypt (410). A question arises then regarding why America is treated differently by some authors who collocate it with a female pronoun. As noted earlier, sometimes countries do get feminised in historical texts so while this explains the finding to an extent, it does not tell us why this only happens to America in the Extreme Corpus. In the general English corpus we saw how when countries are referred to as her, it tends to be in political or historical contexts so this could indicate that in the Extreme Corpus, writers want to represent America as a political actor. Another possible reason is that America stands out as being the most frequently mentioned country and as we will see, it is also one which is represented very negatively, so perhaps its feminisation could be part of a
strategy which draws on a sexist stereotype of women as weak and defenceless, thus making it appear easier to conquer. Another reason could be that in representing America as gendered, the author is attempting to personalise the abstract concept of a country. Ultimately, these texts aim to persuade people to carry out violent acts on other people. In personalising America, the true target is implied: not just the country but its people.

Figure 6.3 Collocational network focussing on America and war.

Figure 6.3 shows the collocational network of America, which has been expanded to show additional collocates of the word war. We can see that war also collocates with words relating to Islam (Islam, Islamic, Muslims) as well as forms of the verb wage (wage, waged, waging). With this network I want to focus on a triangle containing three words, america, war and against, which all collocate with one another. Thirteen concordance lines contain all three words in a co-occurrence relationship, of which a sample is shown in Table 6.5.

Table 6.5 Sample concordance lines showing America collocating with war and against

Egypt will remain a base for the Crusade and an essential participant with | America | in its war against Islam in the name of the war on terror.
and faraway from abstaining to assist | America | and its war against Muslims under the name of the war on terrror
participating with | America | in its war against Islam and Muslims
also confirms the reality of the war | America | wages against the Mujahideen
we will continue to work on in our war against | America | . The nerve is its economy
Bin Laden was killed but did the war against | America | cease? Did the jihad in the Muslim world come to an end?

The majority of cases of this collocational triangle position America as waging a war against Islam and Muslims, not against a particular country
but against a religion and its adherents – a claim which casts America in an extremely negative light. However, the last two lines show a different representation, where the authors of the texts position themselves as being undaunted in a war against America. In the fifth line the author states ‘we will continue to work on in our war against America’ while the final line poses rhetorical questions asking whether Bin Laden’s death ended the war against America. When the authors write about themselves as engaging more actively in war, they appear to present themselves positively, as determined to win. A collocational network can be useful in terms of getting a better sense of how multiple words can work together to create a representation or contribute towards a discourse. We might have come to the same conclusions had we simply looked at concordances containing pairs of collocates shown in Figure 6.1, but expanding the network further can help us to spot patterns we might have missed.
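The two shapes discussed in this section, the triangle and the diamond (K4-e), can also be located automatically once a network is stored as a list of collocating pairs. The following Python sketch is one possible way of doing this and is not a feature of #LancsBox. The edge list contains only the relationships described in the text above, and the function names adjacency, triangles and diamonds are hypothetical.

```python
from itertools import combinations

# Collocating pairs taken from the discussion of Figures 6.2 and 6.3 only;
# a real network would contain many more edges.
edges = [
    ("america", "allies"), ("america", "her"), ("america", "its"),
    ("allies", "her"), ("allies", "its"),
    ("america", "war"), ("america", "against"), ("war", "against"),
]


def adjacency(edges):
    """Build an undirected adjacency map from a list of word pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def triangles(adj):
    """Find sets of three words that all collocate with one another."""
    return {
        (a, b, c)
        for a, b, c in combinations(sorted(adj), 3)
        if b in adj[a] and c in adj[a] and c in adj[b]
    }


def diamonds(adj):
    """Find K4-e shapes: two linked words sharing two collocates that are not linked to each other."""
    found = set()
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            continue
        shared = sorted(adj[u] & adj[v])
        for x, y in combinations(shared, 2):
            if y not in adj[x]:
                found.add(((u, v), (x, y)))
    return found


adj = adjacency(edges)
print(triangles(adj))
print(diamonds(adj))
```

On this small edge list the only diamond found has america and allies as the linked pair and her and its as the unlinked outer pair, which matches the reading of Figure 6.2. The triangle of america, war and against is found as well. A shape found this way is only a lead, and the concordance lines still need to be read to see what the words are doing together.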
Word Sketches

We now turn to the second tool we will use to examine collocates in this chapter, Sketch Engine. This tool allows us to create Word Sketches, which essentially group collocates according to their grammatical relationships, saving us a lot of work and enabling us to more easily identify representations or discourses. Table 6.6 shows part of the output produced by a Word Sketch of America. The columns show collocates arranged by their grammatical relationship with America.

Table 6.6 Word Sketch of America

object_of   subject_of   modifier     modifies         and/or    pp_obj_of     pp_obj_in
force       fail         North        S                Europe    States        Muslims
strike      lead         Europe       France           Israel    Interest      community
confront    have         South        support          Country   ally          happen
defeat      enter        way          Israel           West      leadership    reside
fight       lose         time         security         Britain   heart         occur
hit         s            oppressor    ally             world     face          operation
attack      do           today        Europe           France    support       racism
despise     make         crime        administration   ally      instruction   subdue
exhaust     support      Arabia       failed           west      assistance    tower
aid         start        oppressive   morale           Asia      history       right

Only the top ten collocates are shown in the table although for
many of the columns, dozens of words appear. In the first column, verbs which position America as the object (or grammatical patient) include force, strike and confront. Some examples of these collocates in context are shown below.

America was forced to admit the strength of resistance it faces.
Muslims and others alike, celebrated and prayed for more to strike America.
We should confront America, Israel and the assaulting West.

Note that the words in the Word Sketch are shown in their canonical or base form. So a verb like force will include related forms like forcing, forced and forces. Similarly, a single noun like heart will also include the plural hearts. The second column shows verbs where America is placed in the subject (or agentive) position, like fail, lead and lose.

Therefore, America has failed militarily, as it’s clear to all.
This attack led by America on the mujahidin’s adherence to Islam . . .
Here we could say that America has lost the most important element of global leadership.

The grouping of collocates into different grammatical categories is done automatically in Sketch Engine and as a result errors can appear, so it is important not to take the results at face value. It is possible to derive the concordance lines that contain the collocates in order to check that the relationship claimed by Sketch Engine actually exists. Take for example, the third column which contains words which modify America. We would expect most of these to be adjectives, and oppressive is a good example of a modifier.

the United Nations passes resolutions in support of tyrannical, oppressive America.

However, other words in this column, such as way, time and today, while occurring before America, do not indicate this kind of modifying relationship so we might want to adjust the table to delete these words.

Much the same way, America’s corrupt politicians who support an oppressor
At that time, America was observing the situation for almost a month
Today, America is withdrawing from Afghanistan
The more complex and automated a piece of software is, the more likely it is that mis-categorisations will occur. This does not mean that we ought to avoid using such software but we should always try to verify the findings with concordance-line analysis and discount anything that is in error. Usually Sketch Engine gets most things right so the benefits of using it outweigh the disadvantages, especially if we carry out our own checks of its accuracy. The lists of collocates in Table 6.6 do not constitute the analysis in themselves. However, they make the job of the analyst faster because it becomes easier to spot patterns, particularly as patterns of meanings can be more easily derived from grammatical patterns. So the analysis involves looking up and down the columns, as well as comparing columns together, to try to identify similar kinds of relationships. For example, I noted a set of verbs which position America as the subject, which all construct America as warlike: invade, kill, attack, perpetrate, launch, desecrate, wage, inflict and violate. Collectively, these verbs occur thirty-one times with America. The list of collocates could be said to represent a discourse prosody (discussed in the previous chapter). However, the representation of America as warlike occurs in some of the other columns of the Word Sketch. For example, in the ‘modifies’ column of the Word Sketch, we find references to America’s atrocities, war machine, air force, war, rockets, campaign and defenses. And in the column for ‘of-phrases’ there are references to the weaponry, force and crusaders of America. A sample of these lines is shown in Table 6.7.

Table 6.7 Sample concordance lines showing representations of America as warlike

through their bombings | America | has killed more than a million and a half Iraqi children
which was made possible by the support and weaponry of | America | ? Did he forget about the many
come with a democratic face as | America | ’s war machine is quite often
In the year 1419H | America | launched a cruise missile attack
whose fighters are currently filling the role of | America | ’s ground forces. The Turkish
the war that | America | is waging with its allies

Another representation paints America as engaging in manipulative or criminal behaviour. This occurs when America is the subject of verbs like con, dupe, commit, dazzle, oppress and control, as well as of-phrases which
suggest that America is in control of something, e.g. fear of, slave of, victim of, puppet of and lap of America. A similar pattern is found with to- and for-phrases like servitude to, enslave to, subservience to, prostrate to and lackey for America. America is described as a bully, an aggressor and an enemy, as well as being modified by adjectives like oppressive and tyrannical. It is described as the grammatical possessor of atrocities, arrogance, boasts and evil. This representation draws on an historical association of America as a slave-trading nation both explicitly and through use of terms like enslavement and servitude. Table 6.8 shows a sample of relevant concordance lines. Another representation of America, identified via the Word Sketch, was a set of collocates which positioned it as hated and under attack. A common pattern here involved the verbs which occurred when America was in the object position like despise, strike, confront, fight, hit, attack, exhaust, bleed, shake, humiliate, target, expose, decapitate and devastate. America also follows phrases like retaliate against, struggle against, hatred for, war on and anger at (Table 6.9). The negative representations here use animal terms (United Snakes of America), or reference to America as a cancer. These are forms of dehumanisation which Pisoiu (2012: 139) refers to as denial of humanity. However, while Americans are described as less than human, in other cases America itself is cast in human terms. The reference to America’s ugly face, and metaphorical processes like bleeding or decapitating America, suggest ways in which it is personalised. As we have seen earlier, references to America that use her also suggest a degree of personalisation. These different forms of dehumanisation and personalisation are likely to relate to the fact that America is a country standing for a social actor.
Table 6.8 Sample concordance lines showing representations of America as manipulative or criminal

He’s been bamboozled. | America | duped him and then dumped him.
Led by that leader of international infidelity | America | , a declaration from the [Saudi]
And the more crimes | America | commits, the more mujaidin will be
The reason why he is the lackey of | America | , which is the enemy of Islam and Muslims
The attack was considered the most insulting blow to | America | ’s arrogance. The attempt to provoke
liberate humanity from enslavement and servitude to | America | and its corporations

Table 6.9 Sample concordance lines showing representations of America as hated and under attack

the millions of Muslims around the world who despise | America | . America is not only despised by
So we are continuing this policy in bleeding | America | to the point of bankruptcy
. Biidhnillah, we will decapitate | America | from the rest of the world.
them out of the map completely. | America | is a cancer that needs to be removed along with the West
the strikes uncovered the mask behind | America | ’s ugly face. That mask fell the
said the world’s police force, United Snakes of | America | . The enemies of Allah congregated

In personalising America, it is easier to view it as
representative of a large group of human beings while the dehumanising representations help to legitimate violence towards individual Americans. A final representation of America involves the view that it is failing or losing because it is weak. America is the subject of verbs like fail, lose, fall, pay (the price) and fear, and there are references to America’s demise, meltdown, misery, doom, obtuseness, hubris, failure and disruption. Table 6.10 shows a selection of concordance lines. By counting the number of concordance lines which contribute towards the different representations identified through the analysis, we can gain an impression regarding which ones are more frequent, relative to the others. This is shown in Figure 6.4, where the y axis refers to occurrences per 100,000 words.
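The normalisation behind a per-100,000-words figure is simple arithmetic: the raw count is multiplied by 100,000 and divided by the size of the corpus in words. A minimal Python sketch follows. The corpus size and the raw counts are invented for illustration and are not the figures behind Figure 6.4, and the function name per_100k is hypothetical.

```python
def per_100k(raw_count, corpus_size):
    """Normalise a raw count to occurrences per 100,000 words."""
    return raw_count * 100_000 / corpus_size


# Hypothetical figures (assumption): neither the corpus size nor the raw
# counts are taken from the study.
corpus_size = 2_000_000
raw_counts = {"hated and attacked": 120, "warlike": 90, "female": 40}

rates = {name: per_100k(n, corpus_size) for name, n in raw_counts.items()}
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.1f} per 100,000 words")
```

Normalising in this way matters most when representations are compared across sub-corpora of different sizes. Within a single corpus the ranking of the representations is the same whether raw or normalised counts are used.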
Table 6.10 Sample concordance lines showing representations of America as weak and failing

We all saw together the beginning of | America | ’s demise and the rise of Islam’s head.
military, political, economic and social sciences that | America | ’s star is waning, its economy is shriveling
After it has been let down by the allies, the weak | America | justified its silence by announcing that America is no longer the world police.
If | America | failed to win when it was at its pinnacle of economic strength, how can it win today
Al-Qaeda in particular signifies the doom of | America | , because it is a message of jihad
When | America | is defeated, then is it important
Figure 6.4 Comparisons of representations of America.
The representation of America as hated and attacked is the most frequent one found, while the representation of America as female is least frequent. The ‘hated and attacked’ representation is perhaps so frequent because it presents an image of America as already unpopular and under attack by others – thus encouraging readers to join in. Some of the representations appear to be linked together – the ‘weak and losing’ representation is possibly linked to the feminising one as they both characterise America as able to be defeated. However, other representations represent America as stronger – the warlike one suggests it has military might whereas the one which characterises it as manipulative and criminal also suggests a form of power, albeit one which is grounded in immorality. This representation is another one which helps to legitimate violence towards America – if it is seen as not playing by the rules, then it becomes easier to justify random attacks on its citizens. Why would some representations of America appear to be contradictory in the corpus? It could indicate the fact that the texts are written by a range of authors. However, sometimes different representations are found in the same text. For example, Read these words and examine them carefully: ‘I am certain that Allah is able to protect me and my family, and nothing will afflict us except that which Allah has written for us, and no matter the level America’s strength reaches, there is a Lord stronger than it, and He will defeat it in the end.’
Perhaps the authors of the texts wish to create a representation of America that is both strong and weak. An America that is too strong would perhaps make the task of attacking it appear overly difficult or futile. On the other hand, a representation of America as too weak might remove the urgency for attack. Instead, the combination of representations results in an America which appears to be a worthy opponent but is also beatable. The interrogation of collocates has enabled a set of interesting findings to be uncovered. While we would have expected America to be represented negatively in these kinds of texts, the analysis has revealed details relating to the representations that we might not have considered at the outset. I was not expecting to find cases of America being feminised, while the different uses of dehumanisation and personalisation, and the contradiction between America as powerful and weak, were also unexpected.
Conclusion
As we have seen, collocates are useful in that they help to summarise the most significant relationships between words in a corpus. This can be timesaving and give analysts a clear focus. While #LancsBox and Sketch Engine provide additional ways of thinking about collocates, analysts (especially those who are inexperienced) should not feel compelled to use collocational networks or Word Sketches. In many cases, analysing a simple list of collocates and/or putting them into categories by hand will provide a great deal of analytical mileage. The additional tools should be used only if they enable new patterns to be identified. They should not be used just because they are visually interesting or appear innovative. The fact that Sketch Engine can sometimes miscategorise collocates into the wrong grammatical patterns needs to be borne in mind, while #LancsBox can require the analyst to spend time experimenting with settings in order to produce a collocational network that is not under- or over-populated. Additionally, the output from #LancsBox and Sketch Engine does not constitute an analysis. Instead, it provides the researcher with the means to carry out an analysis. It is important that we do not over-interpret collocational data. As shown throughout this chapter, it was only really when we started to examine concordance lines that contained collocates that we were able to get a proper sense of how the two words in question related to one another.
While collocates may suggest a semantic preference or discourse prosody, concordances will help to flesh these out. Concordance analyses can also help to identify cases where an automatic tool may have made a categorisation error. We should also be aware that different methods of calculating collocation tend to yield different results – the interplay between frequency and saliency needs to be taken into account. The size of a corpus and the frequency of the word you are searching on will impact greatly on the number and type of collocates obtained. It is worth experimenting with different collocational statistics initially – if the ones in the top twenty are mostly grammatical words or are all low frequency words, then this is a sign that a different statistic might be more helpful, or you may need to raise the minimum frequency requirement and recalculate. Similarly, it is worth considering the span and adjusting it if it results in too many or too few collocates. A further consideration is the distribution of a collocate across the different texts in a corpus. If a collocate is reasonably frequent and has a high logDice score (or whatever measure you are using), but the relationship is restricted to only one text in your corpus, then it is difficult to argue that the pairing is a typical one; it is more likely due to idiosyncratic features of that single text. Such a pairing might tell us about a minority representation in the corpus so it is still worth paying attention to it, but we should provide information about the poor distribution in our report and reduce any claims to generalisability accordingly. It can be useful to consider whether the collocates we find are unique to a particular corpus or whether they are more representative of language generally. A Word Sketch of America carried out with the British National Corpus found that many of the collocates in the Extreme Corpus did not appear in the BNC.
For example, in the BNC, when verbs collocated with America in the object position, they referred to processes involving people travelling to America, sometimes on holiday: tour, visit, discover, reach. The representations of America in the Extreme Corpus are therefore unusual ones, not typical of mainstream British English. We could argue that these texts aim to change the associations that people have when they think of a concept like America. By repeatedly using unfamiliar collocations, they have the potential to prime readers to think about America as warlike, amoral, weak, female, etc., rather than as a place to visit. In the following chapter we consider another way of examining saliency in texts – this time not by focussing on words which occur near other words, but simply by considering words which occur more often than we would expect them to by chance alone – keywords.
Step-by-step guide to collocational analysis
1 Build or obtain access to a corpus.
2 Decide on a search term (e.g. America) or a set of search terms, bearing in mind that plurals or other forms, euphemisms, anaphora or relevant proper nouns may be relevant to include.
3 Obtain a list of collocates. It is worth experimenting with different measures, spans and cut-off points.
4 Can the collocates be grouped grammatically, semantically or thematically? A tool like Sketch Engine might help.
5 Obtain concordances of the collocates and look for patterns. This should enable you to uncover the prosodies, representations, discourses or legitimation strategies surrounding the search term.
6 Consider how collocates are distributed across the corpus. Are some more typical of the corpus than others?
7 How do the collocates relate to each other? Try building a collocational network and examine cases where three or more collocates co-occur.
8 Consider using a second corpus, e.g. a reference corpus, to gain a clearer idea regarding whether a collocational pair is particular to the corpus you are examining or more typical of language generally.
9 Attempt to explain why particular discourse patterns appear around collocates.
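Steps 3 and 6 of the guide can be illustrated with a short Python sketch. It is not how Sketch Engine or #LancsBox work internally. It assumes a corpus held as one list of lower-cased tokens per text, and the function name collocates, its parameters and the three sample texts are all invented for illustration. The logDice score is calculated as 14 + log2(2 × pair frequency / (node frequency + collocate frequency)), and the final column counts how many texts contain the pairing, which gives a rough view of distribution.

```python
import math
from collections import Counter


def collocates(texts, node, span=5, min_freq=2):
    """Return (collocate, pair_freq, logDice, texts_with_pair) rows for `node`.

    `texts` is a list of token lists, one per text, already lower-cased.
    """
    word_freq = Counter()   # frequency of every word in the corpus
    pair_freq = Counter()   # co-occurrences of each word with the node
    pair_texts = Counter()  # number of texts in which the pairing occurs

    for tokens in texts:
        word_freq.update(tokens)
        seen_here = set()
        for i, token in enumerate(tokens):
            if token != node:
                continue
            # Words up to `span` places to the left and right of the node
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            for w in window:
                if w != node:
                    pair_freq[w] += 1
                    seen_here.add(w)
        for w in seen_here:
            pair_texts[w] += 1

    rows = []
    for w, f_xy in pair_freq.items():
        if f_xy < min_freq:  # cut-off point: ignore rare pairings
            continue
        log_dice = 14 + math.log2(2 * f_xy / (word_freq[node] + word_freq[w]))
        rows.append((w, f_xy, round(log_dice, 2), pair_texts[w]))
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Invented toy corpus of three short "texts" (not taken from any real corpus)
sample_texts = [
    "the report says america is strong".split(),
    "many people visit america every summer".split(),
    "people visit america and then later people tour america".split(),
]
rows = collocates(sample_texts, "america", span=2, min_freq=2)
for row in rows:
    print(row)
```

Changing span or min_freq and re-running is a quick way of seeing how sensitive a collocate list is to these settings. A collocate whose final column is 1 occurs with the node in a single text only, so any claim based on it should be hedged accordingly.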
Recommended reading
Baker, P. (2016), ‘The shapes of collocation’, International Journal of Corpus Linguistics 21(2): 139–64. This journal article analyses collocational networks of a corpus of news articles and identifies the different kinds of graphs made in collocational networks.
Pearce, M. (2008), ‘Investigating the collocational behaviour of man and woman in the BNC using Sketch Engine’, Corpora 3(1): 1–29. This journal article uses Sketch Engine to compare representations of men and women.
Taylor, C. (2021), ‘Investigating gendered language through collocation. The case of mock politeness’, in J. Angouri and J. Baxter (eds), The Routledge Handbook of Gender and Sexuality, London: Routledge. This book chapter uses Sketch Engine and GraphColl to investigate reports of how men and women are described as being sarcastic.
Questions for students
1 Consider the following collocational network (Figure 6.5) taken from a corpus of tabloid newspaper articles. Discuss why certain words collocate and certain words do not directly collocate with one another, and whether any ideological significance is implied from the network.
Figure 6.5 Collocational network from tabloid news.
2 Try creating a categorisation system for the following collocates of the word anxiety (taken from a corpus of forum posts relating to mental health issues): battle, depression, trigger, phobia, bully, suffer, terrible, overcome, tension, enemy, handle, ruins, beast, flares, issue, crippling, throws, condition, pain, devil, messes, debilitating, insomnia, tricks, bitch, fear, stress, experience, horrible, control, bad, illness
If you only had time to analyse three of these collocates, which would you choose?
3 Consider these sets of collocates of bachelor and spinster (from the English Web 2013):
Bachelor: eligible, irresistible, pad, freedom, busy, teen-like, sexiest, party, dude, lifestyle, drunken, life-long, sworn, good-looking, avowed, hottest
Spinster: sturdy, eccentric, self-sufficient, repressed, unattractive, witchy, bluestocking, hypochondriac, parishioner, Methodist, cat, shelf, bitter, elderly, housekeeper, great-aunt
What discourses or representations do they suggest?
7 Keyness
Frequency revisited
As we saw in Chapter 4, a frequency list can help to provide researchers with the lexical foci of any given corpus. Investigating the reasons why a particular word appears so frequently in a corpus can help to reveal the presence of discourses, especially those of a hegemonic nature. The examination of the most frequent ten lexical lemmas in the holiday corpus provided clues about the overall focus of the leaflets (the top two lemmas: bar and club were used in ways which made much of the importance of being located within access to numerous places to drink). Obviously, the terms themselves do not yield this information; it is only with further explorations – collocational analyses, concordances, comparisons to other similar terms – that such conclusions can be drawn. Yet compiling a frequency list is an important first step in the analysis, giving the researcher an idea about what to focus on. However, simple frequency lists also possess limitations. In order to demonstrate this, I want us to turn to a new research topic and text type: political debates on fox hunting in the British House of Commons. In the UK, fox hunting as it is recognised today has been practised since the seventeenth century (Scruton 1998). There have been numerous attempts to regulate or ban it, stretching back over half a century. In January 2001, according to the BBC, more than 200,000 people took part in fox hunting in the UK, and it was described as ‘one of the most divisive issues among the population’.1 Tony Blair’s Labour Party manifesto in 1997 promised a ‘free vote in parliament on whether hunting with hounds should be banned’. In July 1999 he announced that he would make fox hunting illegal, before the next general election if possible. A government inquiry on hunting with dogs concluded that: ‘This is a complex issue that is full of paradoxes’ (Department for Environment, Food and Rural Affairs 2000: 1). The inquiry
did not attempt to answer the question of whether or not hunting should be banned but did consider the consequences of banning hunting and how such a ban might be implemented. After a number of parliamentary debates and votes, the ban was implemented in February 2005. In order to examine discourses surrounding the issue of banning fox hunting I decided to build a corpus of parliamentary debates on the issue. I collected electronic transcripts of three debates in the House of Commons which occurred prior to votes on hunting. These occurred on 18 March 2002, 16 December 2002 and 30 June 2003. In general, the majority of Commons members voted for the ban to go ahead, although in each debate a range of options could be debated and subsequently voted upon. For example: a complete ban vs hunting with some form of supervision. Two questions I want to focus on here are: what arguments, discourses or representations did those involved in the debates use, and how did they legitimate them? These three debates were not the only ones on banning hunting which had occurred, however; looking back at debates which had occurred as far back as 1997 would have made the factor of diachronic change a much more important consideration. Additionally, debates took place in the House of Lords, which I decided not to examine for the purposes of this chapter. One of the reasons why fox hunting was debated so often in the House of Commons was because the House of Lords rejected an outright ban on a number of occasions, sending the Bill back to the Commons. Therefore the outcome of debates in the House of Commons was significantly different to those in the House of Lords. Examining both sets of debates over a wider range of time is beyond the scope of this chapter. Focussing then on the three debates which I did collect, the first procedure that was carried out was to create a word list of what I refer to as the ‘Foxhunting Debates’ corpus. 
To do this I used the freely available corpus tool AntConc (I am using version 4.0). The corpus size was 129,365 words. Tables 7.1 and 7.2 show the ten most frequent words and lexical items respectively from this corpus. AntConc renders everything in lower-case unless its default settings are altered. To keep things simple, we will not do this, but bear in mind that in Table 7.1, the word i actually refers to upper-case I. An examination of these two word lists does not reveal anything initially interesting that relates to possible discourses of fox hunting within the debate. As with many other frequency lists, the most frequent words tend to be grammatical items such as determiners, prepositions and conjunctions so
Table 7.1 The ten most frequent words in the fox hunting corpus
Rank  Word  Frequency
1     the       8,578
2     to        4,186
3     that      4,119
4     of        3,599
5     and       2,905
6     is        2,599
7     a         2,587
8     in        2,368
9     i         2,320
10    it        1,866
Table 7.2 The ten most frequent lexical words in the fox hunting corpus
Rank  Word      Frequency
1     hon           1,113
2     hunting       1,053
3     mr              711
4     bill            687
5     house           493
6     new             482
7     ban             442
8     minister        438
9     right           433
10    friend          428
it is difficult to know if these words are frequent because they are particularly important to this corpus, or because they are just frequent in general English. The most frequent lexical words (Table 7.2) are perhaps more interesting, although here we find words that we would perhaps have expected or guessed to appear. There are terms of address associated with the context of a parliamentary debate: hon (which is a short form of honourable), Mr, friend and right (the term of address hon. friend appears 397 times in the corpus while right hon. friend occurs 174 times). There are also other words associated with the context of parliament (bill, house, minister) and words connected with the subject under discussion (hunting, ban). Therefore, the most frequent lexical words in this case have only helped to confirm expectations surrounding the genre or the topic of the text, which is not necessarily the most useful finding.
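The step from Table 7.1 to Table 7.2 can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of how AntConc builds its lists. The names word_list, lexical_only and GRAMMATICAL are invented, the stop list is a deliberately tiny assumption, and the sample sentences are made up rather than taken from the debates.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of grammatical words (an assumption;
# a real analysis would use a fuller list or part-of-speech tagging).
GRAMMATICAL = {"the", "to", "that", "of", "and", "is", "a", "in", "i", "it", "be"}


def word_list(text):
    """Count word forms after lower-casing, so that I is counted as i."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)


def lexical_only(freqs):
    """Remove grammatical words so that lexical items rise to the top."""
    return Counter({w: f for w, f in freqs.items() if w not in GRAMMATICAL})


# Invented sample text in the style of a parliamentary debate
sample = ("I beg to move that the Bill be read. "
          "The hon. Member is right that the Bill is new.")
freqs = word_list(sample)
lexical = lexical_only(freqs)
print(freqs.most_common(3))
print(lexical.most_common(3))
```

Even on two sentences, the full list is headed by grammatical words while the filtered list brings forward terms linked to the setting and the topic, which is the same pattern that the two tables show for the whole corpus.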
One way of finding out what lexical items are interesting in a frequency list is to compare more than one list together. If a word occurs comparatively more often in, say, a corpus of modern English children’s stories, when compared to the British National Corpus, we could conclude that such a word has high saliency in the genre of children’s stories and is worth investigating in further detail. So, thinking about comparative possibilities of the fox hunting debate, it might be useful to consider that the debate had two sides and ultimately each speaker had to vote on the issue of banning fox hunting. While it may have been the case that speakers who voted the same way actually approached the subject from very different perspectives and had different reasons for the way they voted, the fact that speakers voted, and that their contributions to the debate would be made with an idea of persuading others to vote the same way as them, suggests one area where conflicting positions may be illuminated. Therefore, it was decided to split the corpus into two. The speech of all of the people who voted to ban fox hunting was placed into one file, while the speech of those who voted for hunting to remain was placed in another. The frequencies of the top ten lexical words from these sub-corpora are presented in Table 7.3. Although it was hoped that this table would reveal more interesting differences, what it has actually shown us are similarities! Seven of the words (hon, hunting, bill, mr, ban, right and way) appear in both lists, and almost all of the words are again connected to either the subject under discussion (hunting) or the context where the debates took place (parliament). Some of the frequencies may be more interesting to examine – for example, the
Table 7.3 The ten most frequent lexical words used by opposing groups in the fox hunting debate
      Anti-hunting          Pro-hunting
Rank  Word     Frequency    Word      Frequency
1     hon            647    hunting         485
2     hunting        568    hon             466
3     bill           458    mr              395
4     new            383    minister        256
5     house          331    bill            229
6     mr             316    people          228
7     ban            264    way             193
8     friend         259    member          183
9     right          256    ban             178
10    way            223    right           177
pro-hunt voters refer to the word ban only 178 times, compared to the anti-hunt voters who used the same word 264 times. And is it relevant that hunting was mentioned 568 times by the anti-hunt voters and 485 times by the pro-hunt voters? As it turns out, the anti-hunt voters contributed more speech to the debates overall (71,194 words vs 58,171 words), so as proportions, these frequencies are more similar than they originally look (for example, ban occurs as 0.37 per cent of the anti-hunt vocabulary and as 0.30 per cent of the pro-hunt vocabulary). Therefore, a measure which takes into account the relative size of both sub-corpora combined with the relative frequencies of each word would be more useful. Fortunately, such a measure exists, in the concept of keyness.
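The proportional comparison described above is simple arithmetic, and a minimal Python sketch may make it concrete. The frequencies and sub-corpus sizes are those reported in the text, while the function name percent_of_corpus and the variable names are my own.

```python
def percent_of_corpus(freq, corpus_size):
    """Express a raw frequency as a percentage of all words in a corpus."""
    return 100 * freq / corpus_size


ANTI_SIZE = 71_194  # words spoken by those who voted for a ban
PRO_SIZE = 58_171   # words spoken by those who voted against a ban

ban_anti = percent_of_corpus(264, ANTI_SIZE)
ban_pro = percent_of_corpus(178, PRO_SIZE)
hunting_anti = percent_of_corpus(568, ANTI_SIZE)
hunting_pro = percent_of_corpus(485, PRO_SIZE)

print(f"ban:     {ban_anti:.3f}% (anti) vs {ban_pro:.3f}% (pro)")
print(f"hunting: {hunting_anti:.3f}% (anti) vs {hunting_pro:.3f}% (pro)")
```

Running the same calculation for hunting shows why raw counts can mislead: the anti-hunt speakers use the word more often in absolute terms, yet once the difference in sub-corpus size is taken into account the two proportions are close.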
Introducing keyness
Corpus tools allow us to compare the frequencies in one word list against another in order to determine which words occur statistically more often in word list A when compared with word list B and vice versa. Then all of the words that do occur more often than expected in one word list when compared to another are compiled together into another list, called a keyword list. And it is this keyword list which is likely to be more useful in suggesting lexical items that could warrant further examination from a discourse analysis perspective. A keyword list therefore gives a measure of saliency, whereas a simple word list only provides frequency. So how is a keyword list compiled? AntConc takes into account the size of each sub-corpus and the frequencies of each word within them. It then carries out statistical tests on each word (as with collocation, different tests can be used, which I will discuss in a moment). For this chapter I am using the default setting (the log-likelihood test) which assigns a p (or probability) value to every word across both corpora. The p value (a number between 0 and 1) indicates the amount of confidence that we have that a word is key due to chance alone – the smaller the p value, the more likely that the word’s strong presence in one of the sub-corpora is not due to chance but a result of the author’s (conscious or subconscious) choice to use that word repeatedly. Because every word across the two corpora is assigned a p value, as corpus users it is up to us to decide how low the p value needs to be before we can label a word as key. The same sort of problem occurs in the social sciences, particularly in experimental psychology – and in general a p value
of about 0.05 (indicating that if there was no difference between the types of language use in the two sub-corpora, there’s a 5% chance that the word in question would be key) or less is taken as a ‘cut-off’ point and therefore viewed as worth reporting. A potential problem with using a p of 0.05 is that it can produce a lot of keywords. For example, AntConc’s default setting only shows keywords with p of < 0.05, which produces 231 anti-hunt and 174 pro-hunt keywords. Changing this in the settings to a p < 0.0001 gives 27 and 28 keywords respectively. These are shown in Table 7.4 (remember, they are all in lower-case because that’s the default way that AntConc displays words in lists). I had to carry out the keywords procedure twice so Table 7.4 actually contains two keyword lists – one for the anti-hunt speakers and the other for the pro-hunt speakers. This entailed first telling AntConc that the anti-hunt corpus was my target corpus and the pro-hunt corpus was the reference corpus, then swapping them around and carrying out the process again. For ease of comparison I have combined the two lists into a single table.
Table 7.4 Keywords when p < 0.0001
    Anti-hunt keyword  Frequency  Likelihood   Pro-hunt keyword  Frequency  Likelihood
1   new                      382      125.44   gray                     64       61.63
2   i                      1,455       56.08   criminal                 38       47.25
3   clause                   215       49.73   exmoor                   39       37.16
4   bill                     458       38.52   people                  228       32.05
5   commons                   46       33.45   mr                      395       32.02
6   house                    331       29.98   minister                256       31.99
7   conclusion                45       29.29   garnier                  28       28.63
8   issue                    185       28.48   lidington                25       28.09
9   dogs                     182       27.79   fish                     23       25.22
10  clear                    112       24.72   portcullis               30       22.68
11  vote                     131       22.07   barker                   20       20.95
12  complete                  40       21.74   regime                   20       20.95
13  kaufman                   39       20.82   shooting                 72       19.62
14  necessary                 58       20.71   shot                     42       19.35
15  parliament               132       19.87   authority                25       18.90
16  foster                    29       19.59   why                     114       18.58
17  lords                     47       19.17   gummer                   26       17.84
18  that                   2,406       19.05   he                      361       17.70
19  can                      206       18.54   fishing                  31       17.65
20  total                     52       17.83   welfare                  90       17.08
21  tonight                   58       17.58   all                     230       16.98
22  statement                 26       16.55   conservation             17       16.78
23  enable                    19       16.35   moral                    30       16.54
24  commitment                28       15.72   citizens                 14       16.28
25  election                  18       15.26   of                    1,737       15.79
26  will                     601       15.24   incontrovertible         24       15.51
27  dr                        42       15.23   luff                     16       15.41
28                                             gregory                  16       15.41
Some corpus analysis tools produce keywords slightly differently, sometimes producing a list that contains what are known as positive and negative keywords. A positive keyword is a word that occurs significantly more often in one corpus, compared to the other, while a negative one occurs significantly less often in one corpus, compared to the other. The frequency columns in Table 7.4 indicate how many times the keyword occurs in the corpus where it is a keyword. The Likelihood score is a number which indicates the extent that a word is distinctive in one text, compared to the other. As noted in the previous chapter, the log-likelihood test is a hypothesis-testing measure, so it does not necessarily tell us anything about the extent or strength of a difference between a word’s frequency in two corpora, but it is more concerned with whether there actually is a difference, no matter how large or small. As a result, the log-likelihood test has been criticised for producing keywords where the differences in frequency are not actually all that big. For example, consider the keyword i which is the second strongest keyword for anti-hunt speakers. It occurs 1,455 times in this subcorpus, comprising 2.04 per cent of all the words in that corpus, but it also occurs 865 times in the pro-hunt corpus (1.48 per cent). This is not a huge difference – i only appears 1.37 times as often in the anti-hunt speech compared to the pro-hunt speech. Compare it to the top keyword, new, which occurs 382 times in the anti-hunt corpus (0.53 per cent of that corpus) and 99 times in the pro-hunt corpus (0.17 per cent).
Relatively speaking, new is more than three times as common in the anti-hunt corpus, compared to the pro-hunt corpus. So is it worth even considering i as a keyword, considering the difference in relative frequency is so small? Furthermore, if the log-likelihood test is producing keywords like i, is it worth using or should we switch to a different test? As with collocation, numerous measures of keyness have been proposed, some which have aimed to focus on the extent or strength of difference, like Dice, %diff or log-ratio. AntConc has the option to sort keywords by effect size, and when this is done, some of the keywords change, and the order they appear in changes too. For example, for the anti-hunt sub-corpus, the top effect size keywords (using the Dice statistic) are high frequency words: the, that and i. However, if I change the effect size measure from Dice to the MI score, lower frequency words are near the top: enable (frequency in the
anti-hunt sub-corpus = 19), election (18) and session (15). Some effect size measures (particularly MI) deprioritise high frequency grammatical words because the proportional frequency differences of a grammatical word in two corpora are rarely large enough to result in a keyword. All texts usually have to rely on grammatical words to some extent – it is hard to write in English without using words like the, and, on and to, so they are normally going to have a baseline frequency, meaning that we are rarely going to see very dramatic differences in them between two corpora. However, because grammatical words are so frequent, even the relatively smaller frequency differences can indicate something of interest, so I do not think we should dismiss them outright. I would argue that the keyword i falls into this category. We would expect it to occur a lot in both sub-corpora because the texts we are examining involve speeches made from a personal point of view. However, there is something interesting in the fact that i occurs that bit more (2.04 per cent vs 1.48 per cent) in the language of those who are against hunting, and we will look at why this is the case a bit later in the chapter. As with collocation then, it is useful to bear in mind that there is no perfect measure of keyness, but instead to have an idea of the kind of difference that the measure is privileging, and to be prepared to adjust the measure if it is not producing helpful results for your analysis. For the majority of cases, the default settings on a tool will produce something of interest, and increasingly, corpus analysis tools try to use some sort of hybrid or compromise measure, resulting in lists of keywords that are both reasonably frequent and indicate a reasonable difference in size between the two corpora. Although I have ordered the top keywords in Table 7.4 by log-likelihood score, I could also note that they all had a Dice score of at least 0.1.
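The contrast between a hypothesis-testing measure and a measure of the size of a difference can be made concrete with a short Python sketch. It uses the widely cited two-corpus log-likelihood formula, in which each observed frequency is compared with the frequency expected if the word were spread evenly across both sub-corpora. Whether AntConc calculates its scores in exactly this way is an assumption on my part, although the results come out close to the Likelihood values in Table 7.4. The function names are invented, and the simple ratio of relative frequencies stands in here for the effect size measures mentioned above rather than reproducing Dice, %diff or log-ratio.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood score for a single word."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def p_value(score):
    """Upper-tail probability of chi-square with one degree of freedom."""
    return math.erfc(math.sqrt(max(score, 0.0) / 2))


def relative_ratio(freq_a, size_a, freq_b, size_b):
    """How many times more frequent, proportionally, the word is in A than in B."""
    return (freq_a / size_a) / (freq_b / size_b)


ANTI_SIZE, PRO_SIZE = 71_194, 58_171

ll_i = log_likelihood(1455, ANTI_SIZE, 865, PRO_SIZE)
ll_new = log_likelihood(382, ANTI_SIZE, 99, PRO_SIZE)
ratio_i = relative_ratio(1455, ANTI_SIZE, 865, PRO_SIZE)
ratio_new = relative_ratio(382, ANTI_SIZE, 99, PRO_SIZE)

print(f"i:   score {ll_i:.2f}, p {p_value(ll_i):.2e}, ratio {ratio_i:.2f}")
print(f"new: score {ll_new:.2f}, p {p_value(ll_new):.2e}, ratio {ratio_new:.2f}")
```

Both words pass the p < 0.0001 cut-off comfortably, but the ratio separates them: i is used only a little more often, proportionally, by the anti-hunt speakers, whereas new is more than three times as common. Sorting by the score and sorting by the ratio would therefore give differently ordered keyword lists.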
So is the keyword list any more helpful than a simple list of raw frequencies? One aspect of the list that perhaps is not that helpful for my purposes is the number of proper nouns within it. Scott (1999) says that keyword lists tend to show up three types of words. First, there are proper nouns. Secondly, the list will probably contain a number of ‘aboutness’ keywords. These tend to be lexical words: nouns, verbs, adjectives, adverbs and are generally those which are most interesting to analyse. Finally, there might be high frequency grammatical words, which Scott says may be more indicative of style than aboutness. Some tools, like WordSmith, allow users to create ‘stop lists’ to remove grammatical words from the keywords procedure altogether. However, as the style of a text may play some role in the discourses within it, it is recommended that such high frequency words are not discarded.
The reason why so many proper nouns occur in this keyword list is due to the fact that each speech in the transcription is prefaced with the speaker’s name, and to the fact that speakers regularly refer to each other’s speeches. Keywords like kaufman, gray, garnier, lidington, barker, gummer, luff and gregory refer to Members of Parliament who wanted fox hunting to remain legal. The proper noun exmoor refers to a region of the UK where hunting often takes place, and is therefore a location which is likely to be affected by a ban. Although in some cases it may be interesting to pursue the use of proper nouns further, at this point I am going to move on to look at some of the other types of words in the list.
Analysis of keywords
In order to see the value of considering grammatical keywords, let us first look at the keyword i (actually the pronoun I) mentioned earlier, as the second strongest keyword in the anti-hunt sub-corpus. Why is this the case? As with ordinary frequency lists, this is unfortunately where the limitations of keyword lists come into play. We may want to theorise about the reasons why I is used so much by anti-hunters, so looking at some of the other keywords may provide clues. However, without knowing more about the context of the word I, as it is used in both sides of the debate, our theories will remain just that – theories. Therefore, it is necessary to examine individual keywords in more detail, by carrying out concordances of them and looking at their collocates. In the anti-hunt speech the keyword i collocates with mental, verbal and relational process verbs like said (63 occurrences), want (61), believe (51), think (47), hope (47), say (46), accept (42), understand (35) and agree (33). It has a similar function in the pro-hunt speech, collocating with the same kind of verbs. It therefore tends to be often used by the speakers to show that they are about to indicate their stance. This is not surprising, considering the fact that we are examining a corpus of debates. It is notable then, that I is used more often by people who were against hunting as opposed to those who were for hunting. This is not the only time that I has occurred as a keyword in a corpus of political debates. In Love and Baker (2015), the authors directly compared two sets of speeches about gay law reform in the British House of Lords. One involved speeches arguing not to equalise the age of consent for gay men that took place in 1999–2000. The other involved speeches arguing not to allow same-sex marriage, taking place in 2013. The first set of debates also had I as a keyword. I would argue that
during the period of the first debate the speakers found it easier to refer to themselves in their speeches when arguing from a position that some people viewed as homophobic. During the 2013 debate, I was relatively less common, suggesting perhaps that speakers opposed to LGBT+ equality were a little more cautious about explicitly associating themselves with their point of view. Thinking about how this relates to the fox-hunting debates, it could be the case that the anti-hunt speakers felt more confident – after all this was a law that had been put forward by the government, it seemed to have support among the populace according to surveys at the time, and the law banning fox hunting was eventually passed. The anti-hunters were arguing from a stronger position then, and perhaps they were aware of this, making them more likely to personalise their discourse than the pro-hunt speakers. Speaking of the pro-hunt speakers, consider the word criminal. Once the proper noun Gray has been discarded, criminal is the strongest keyword used by those who were opposed to a ban on hunting. When a concordance of criminal was carried out on the corpus data, it was found that common clusters containing the word criminal included the criminal law (14), a criminal offence (10), criminal sanctions (6) and a criminal act (3). Deriving collocates of criminal (using a minimum collocate frequency of five and the log-likelihood statistic), produced the following words: law, offence, sanctions, make, attracting, made and act. As forms of the lemma make appear to be a relatively important collocate of criminal, a concordance of make when it occurs within five places to the left or right of criminal was carried out on the pro-hunt section of the corpus (see Table 7.5). 
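The kind of search just described, where forms of make are retrieved only when criminal falls within five places to the left or right, can be sketched in Python as follows. This is an illustration of the principle and not AntConc's own procedure. The function name concordance_near, its parameters and the sample sentence are all invented.

```python
def concordance_near(tokens, node_forms, context_word, span=5, width=6):
    """Return (left, node, right) lines for node forms near `context_word`.

    A line is kept only if `context_word` occurs within `span` tokens of the
    node. `width` sets how many tokens of co-text are shown on each side.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token not in node_forms:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        if context_word in window:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sentence, lower-cased and split on spaces
sample = ("it would be wrong to make this a criminal offence "
          "and nobody should be made to pay").split()
make_forms = {"make", "makes", "making", "made"}
lines = concordance_near(sample, make_forms, "criminal")
for left, node, right in lines:
    print(f"{left:>40}  {node:<6}  {right}")
```

In the sample, make is returned because criminal appears three places to its right, while made is not, because criminal is more than five places away. Listing the forms of the lemma by hand, as in make_forms, is a simplification; a lemmatised corpus would make this step unnecessary.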
What seems clear from the table is that the pro-hunters are using a strategy of framing the proposed fox-hunting ban as criminalizing (making criminal) people (see lines 6 and 9) and that they are against this – the change to the law is evaluated with phrases like wrong in principle (line 3) and not appropriate (line 10). Let us consider another keyword, this time one which occurs more often on the side of the debate of those who want to ban hunting: the word dogs. This word occurs 182 times (0.25 per cent) in the speech of the anti-hunters and 74 times (0.13 per cent) in the speech of those who want hunting to remain legal. So this word is also of reasonably high frequency (referring to the ordinary frequency list we first made, it is the twenty-fourth most frequent lexical item in the corpus) although it is used significantly more often by those who want to ban hunting. Another way of exploring the significance of a word in a corpus is to explore the clusters which occur around or near it. The most common three
Table 7.5 Concordance of make with criminal
 1    With the greatest respect, we do not  make    the criminal law on the basis of opinion polls.
 2     fall those of whom we disapprove or  make    criminal those activities that we do not wis
 3  dom. It is quite wrong in principle to  make    criminal an activity just because a certain
 4      consequences may follow if it were  made    a criminal offence. It would undoubtedly – I do
 5  ggesting that their findings should be  made    part of the criminal law, foxhunting would c
 6    be sufficiently cruel to justify our  making  it a criminal offence and sending people to pri
 7      nt of people support hunting being  made    a criminal offence, while 49 per cent. said th
 8       that they supported hunting being  made    a criminal offence, and 49 per cent. said that
 9   attracting criminal sanctions, but to  make    criminal those people who otherwise want t
10   ent to it so it is not appropriate to  make    it a criminal activity. Secondly, most Member
11    approve of should not necessarily be  made    criminal. The first point that this debate sh
12   te a criminal offence. The Bill would  make    hunting a criminal offence unless it fell withi
Table 7.6 Common clusters containing dogs
     Anti-hunting speech          Pro-hunting speech
     Cluster                F     Cluster                F
 1   hunting with dogs     89     hunting with dogs     25
 2   with dogs is          16     use of dogs            9
 3   use of dogs           10     with dogs is           5
 4   mammals with dogs     10
 5   ban on hunting         7
 6   with dogs the          7
 7   with dogs and          5
 8   with dogs will         5
word clusters which contain the word dogs are shown in Table 7.6. As the word dogs occurs less frequently in the pro-hunt sub-corpus, the number of overall clusters in the right-hand part of the table is smaller. However, what is most clearly similar about both sides of the debate is that when the word dogs is used, it appears as part of the cluster hunting with dogs most often. Although it is the most frequent cluster under analysis here in both sides of the debate, it should be noted that proportionally it occurs more often in the speech of the anti-hunters than in the speech of the pro-hunters.
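The cluster counts and the proportional comparison described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure AntConc itself runs: the function names are my own, the toy sentence is invented, and the two sub-corpus sizes are back-calculated from the rounded percentages given for dogs (182 occurrences at 0.25 per cent, 74 at 0.13 per cent), so they are approximate.

```python
from collections import Counter


def clusters_containing(tokens, target, n=3):
    """Count every n-word cluster (n-gram) that includes the target word."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if target in gram:
            counts[" ".join(gram)] += 1
    return counts


def per_cent(freq, corpus_size):
    """Express a raw frequency as a percentage of the sub-corpus size."""
    return 100.0 * freq / corpus_size


# Toy tokens, invented purely to show the counting step.
toy = "hunting with dogs is cruel and hunting with dogs will end".split()
toy_clusters = clusters_containing(toy, "dogs")

# ASSUMPTION: sizes estimated from the rounded percentages quoted for 'dogs'.
ANTI_SIZE = round(182 / 0.0025)  # roughly 72,800 tokens
PRO_SIZE = round(74 / 0.0013)    # roughly 56,900 tokens

# Raw frequencies taken from Table 7.6.
hunting_with_dogs = (per_cent(89, ANTI_SIZE), per_cent(25, PRO_SIZE))
use_of_dogs = (per_cent(10, ANTI_SIZE), per_cent(9, PRO_SIZE))
```

With these estimated sizes, hunting with dogs comes out as proportionally more frequent in the anti-hunt speech, while use of dogs is proportionally more frequent in the pro-hunt speech even though its raw count is lower there.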
The second most common cluster in the pro-hunting speeches is the phrase use of dogs which occurs nine times. Although use of dogs occurs ten times in the anti-hunting speeches, this phrase occurs proportionally more often in the speech of the pro-hunters (taking into account the relative sizes of the two sub-corpora). As a whole then, it seems that dogs appears in two common types of phrase across the corpus – the more popular hunting with dogs and the less common use of dogs. These two clusters appear to be alternative ways of phrasing the same idea. We might want to ask why this is the case, and we may theorise that it could be because one is more specific (hunting with dogs) whereas the other is vague (use of dogs). However, we need to take care. Maybe use of dogs only appears vague because it occurs as part of a range of clusters such as use of dogs to hunt or use of dogs in hunting, etc. In order to examine this, a concordance of use of dogs was carried out for the whole corpus (see Table 7.7 – the top half of the table contains all the examples from pro-hunters, the bottom half contains all the examples from anti-hunters). From looking at this table, it transpires that in general the phrase use of dogs does not appear as part of statements such as use of dogs to hunt animals. Indeed, the only case where this happens is in line 18, which is used by an anti-hunt speaker. Instead, we find use of dogs on rabbits and rats in line 1 (quite a vague statement), and use of dogs to cull foxes in line 2. The verb lemma cull collocates most strongly in the BNC with from and appears in phrases such as ‘National newspapers cull their stories from all over the country’. Cull therefore appears to be close in meaning to take rather than kill; as with the phrase use of dogs the word cull also acts somewhat euphemistically. 
So a trend here is that one side of the debate shows a stronger preference for using a more explicit way of referring to the outcomes of fox hunting, while the other is sometimes vague or is more likely to use euphemistic references. At the moment, this finding only provides a clue to possible discourses that are being used in the debate, but it is one which is worth bearing in mind as the other keywords are examined. Discarding proper nouns, after criminal, the next strongest pro-hunt keyword is the word people (it is also one of the top ten most frequent lexical words in the pro-hunt sub-corpus). It refers to people whose lives, it is claimed, will be adversely affected by the Bill if it is passed (their livelihoods stopped, their communities threatened and their futures involving a prison term). However, it also refers to (a presumably greater number of) people who do not hunt but are not upset or concerned by those who do, so it is used as part of the argumentum ad populum (‘appeal to the people’) argument which
Table 7.7 Concordance of the use of dogs
Examples from the pro-hunt side of the debate
 1        ms of hunting, perhaps requiring the  use of dogs  on rabbits and rats would be perfectly sen
 2      oes he propose a closed season for the  use of dogs  to cull foxes, but not for the shooting or sn
 3      tee stage bans on hare hunting and the  use of dogs  below ground, as well as the original ban o
 4            ed hunt? We would argue that the  use of dogs  is the most selective and humane method
 5     down to drafting a Bill in terms of the  use of dogs  . Then the practical difficulties begin. In C
 6     d not have fox hunting also rely on the  use of dogs  . The Minister for Rural Affairs will remem
 7     ll. The legislation concentrates on the  use of dogs  . When we start to look at it in detail, we di
 8     sion for various loopholes allowing the  use of dogs  . So, let us be clear that this debate is abo
 9     fewer alternatives are available to the  use of dogs  ”? Mr. Geoffrey Clifton-Brown (Cotswold):
Examples from the anti-hunt side of the debate
10     e for banning hare hunting, or that the  use of dogs  would not result in significantly less sufferi
11  h as ratting, it is equally clear that the  use of dogs  will always be likely to cause less sufferin
12     pest control and, secondly, whether the  use of dogs  will cause significantly less suffering than
13      ath more cruel than that involving the  use of dogs  . Alun Michael: The hon. Gentleman, who
14        -new clause 6-which would permit the  use of dogs  under ground and enable the Secretary of
15       deer hunting and hare hunting and the  use of dogs  under ground that were agreed by the Sta
16       registration of hunting involving the  use of dogs  under ground. However, he would replace t
17     ut there are circumstances in which the  use of dogs  is less cruel than the alternatives and in w
18       should not be discounted to zero. The  use of dogs  to hunt animals is acceptable in certain cir
claims that a proposition must be true if many or most people believe in it. The two keyword lists have only given us a small number of words to examine, and once the proper nouns have been discounted, along with the keywords which relate to parliament (clause, bill, commons, house, vote, parliament, minister, portcullis – which refers to Portcullis House), we are left with a smaller number of words, of which four have already been examined (i, dogs, criminal and people). As that does not leave us with a lot to carry out an analysis on, I changed the settings in AntConc to derive more keywords (the top 100 for each sub-corpus). With more keywords we can start to group them into categories in a way that is similar to the categorisation system I described in the previous chapter on collocates. While the log-likelihood scores of these additional keywords are less impressive than those of the words considered so far, what is interesting about working with a larger list is that it becomes possible to see connections
between keywords, which may not always be apparent at first, but are clearer once they have been subjected to a more rigorous mode of analysis. For example, the top 100 keywords in the pro-hunt debate include the following words: fellow, citizens, Britain, imposing, illiberal, sanctions and offence. All of these keywords are connected in some way to the findings we have already looked at. So sanctions, offence, imposing and illiberal occur in similar ways to the word criminal which was examined above. Table 7.8 shows a concordance of the word illiberal. It is used as a direct reference to the proposed Bill (e.g. licensing regime line 1, legislation lines 2, 4 and 6, Bill lines 3, 4, 5 and 7). It also occurs in three cases with the intensifying adverbs deeply and profoundly (lines 2, 4 and 5) and in four cases it appears as part of a longer list of negative adjectives (e.g. difficult line 1, intolerant and arbitrary line 3, divisive line 6, and ineffectual line 7). The keywords sanctions, offence and imposing cover similar ground, contributing to the discourse of hunting as a civil right. As a different yet related strategy, consider the keywords fellow citizens, people and Britain, cases of which are shown in the concordance in Table 7.9. The term fellow citizens is always preceded by a first person possessive pronoun (my or our). The use of this term looks like a strategy on behalf of pro-hunters to appear to be speaking for and with the people of Britain, thereby implicitly labelling their discourse as the hegemonic one. Note also how in lines 11 and 12, the debater speaks for the people: ‘the people of Britain are beginning to catch on’, ‘for most of the 55 million people in England it is of peripheral interest’. In line 10, hunting is described as well over 100 years old, being framed as one of Britain’s traditions, whereas line 8 refers to it as part of the ‘fabric of rural Britain’. 
In line 6 the speaker mentions going on a march for liberty, while line 7 notes how ‘we in Britain have fought against the persecution of minorities’. There is an underlying nationalist
Table 7.8 Concordance of illiberal (pro-hunt debate)
 1          other place will amend the Minister’s  illiberal  and difficult licensing regime into
 2   r to read this stuff before he brings deeply  illiberal  legislation to the House? Mr. Gr
 3  lating on this topic. The Bill is intolerant,  illiberal  and arbitrary. It will restrict freedo
 4    do so on the ground that it is a profoundly  illiberal  Bill-the kind of legislation that brin
 5    ite, prejudice and bigotry. This profoundly  illiberal  Bill, which should concern everyo
 6    fabric of rural Britain and passed the most  illiberal  and divisive piece of legislation for
 7      in me in voting against this unnecessary,  illiberal  and ineffectual Bill. Mr. Hogg: O
discourse being drawn on here, in terms of: Britain is a good country because it is a place where people are free and minorities are protected. This discourse is used as an argument to allow fox hunting to continue. Finally, consider another keyword used by the pro-hunt speakers: activities (Table 7.10). It occurs as a plural, implying that there is a range of activities (incidentally, the singular form activity is not a keyword). In Table 7.10 there are farming activities, social activities and normal activities. Other things that are labelled as activities include stag hunting, ratting, boxing and
Table 7.9 Sample concordance of fellow citizens, Britain and people (pro-hunt)
 1    able to me and, I believe, to most of my  fellow citizens  . The killing of an animal is justifiable only
 2     a small but significant minority of our  fellow citizens  . I agree with one thing the Minister said.
 3    al freedom, that it will rob some of our  fellow citizens  of their livelihood and take homes from a
 4     7, when the pensions of millions of our  fellow citizens  are affected by a deeply serious crisis fr
 5    at. Of course, I accept that some of our  fellow citizens  genuinely disapprove of hunting with hou
 6        umber of my family and 407,000 of my  fellow citizens  , I took part in the march for liberty and liv
 7       the Third Reich. Down the ages, we in  Britain          have fought against the persecution of min
 8     an who ripped apart the fabric of rural  Britain          and passed the most illiberal and divisive p
 9    that is being practised on the people of  Britain          tonight. Mr. Atkinson: There we have it.
10   turies. Most of the hunts in existence in  Britain          today are well over 100 years old. They hav
11  se to offer the people of Britain, and the  people           of Britain are beginning to catch on.
12    rs speak, but for most of the 55 million  people           in England it is of peripheral interest. Mr.
13   ce to a largely urban nation, millions of  people           people recognise that to criminalise at a str
Table 7.10 Concordance of activities (pro-hunt)
 1   sufficient reason to ban those activities. Many  activities  cause death or injury to animals. Some
 2       them illegal. They are perfectly legitimate  activities  . Mr Simon Thomas: I am interested in
 3           stag hunting, ratting and all the other  activities  about which he has spoken? Mr O’Brie
 4        that would not preclude all sorts of other  activities  , including other forms of pest control
 5        category tonight. Most notable among those  activities  of course, are angling, shooting and fish
 6   ital role in organising a whole range of social  activities  that provide a lifeline in the more remot
 7  uth is that there is no difference between those  activities  ; they are all acceptable or they are all
 8      that is tolerable in relation to the farming  activities  upon which the foxes are predators. It is
 9   oxing, cigarette smoking and a variety of other  activities  . Why is the test applied only to hunting
10   if that principle were applied, people’s normal  activities  would have to be banned. A the ministe
cigarette smoking. As with the term use of dogs, I would argue that activities operates as a form of euphemism or vagueness, making it unclear exactly what is happening, so that a direct response or criticism becomes difficult. The term also creates an association between fox hunting, which is to be made illegal, and other activities which also involve killing animals or causing harm, which are not illegal. Therefore, examining these additional keywords helps to build on the findings we have already uncovered. A number of discourses are then starting to come into focus, particularly for the pro-hunt speakers. For example, the use of terms like criminal, sanctions, offence and imposing suggests a discourse of civil liberties, whereas words like Britain, fellow citizens and people suggest a discourse of shared British identity. It is not clear at this stage whether the keyword activities and the phrase use of dogs suggest another sort of discourse. These euphemisms may simply be due to a stylistic choice, but they suggest that the speakers are at least aware that there are some aspects of their stance which may be best glossed over. The more explicit position taken by the anti-hunt speakers supports this hypothesis. One of the top 100 keywords used in the anti-hunt sub-corpus is barbaric (Table 7.11) – a word which is so loaded with meaning that it hardly appears to require much analysis to unearth an argument – it is used alongside evaluative words like cruel, obscene and bloodthirsty, suggesting a strategy of rhetorical emphasis to make the speaker’s position clear. We might want to look a little closer at the concordance of barbaric though, asking what is being called barbaric? Hunting or foxhunting is
Table 7.11 Concordance of barbaric (anti-hunt)
 1        overnments have legislated to ban cruel and  barbaric  sports, if one can call them sports,
 2    ce. Licence or no licence, foxhunting is cruel,  barbaric  , unnecessary and very ineffective. I
 3     y to prevent it is to introduce a ban on these  barbaric  and bloodthirsty forms of hunting wi
 4        ears will ensure the end of the obscene and  barbaric  sport of hunting wild mammals with
 5    baiting or cock fighting was, and it is just as  barbaric  . One form of cruelty that is often ov
 6       e introduction of a complete banning of such  barbaric  and bloodthirsty so-called sports.
 7      It is not possible to make hunting “slightly”  barbaric  , or to allow animals to be “almost”
 8  ears to give a little more credibility to what is  barbaric  and unacceptable. We are not tal
 9        all I had to say 14 months ago.Hunting is a  barbaric  practice which we in the House sho
10     ts in Dartford think that ending the cruel and  barbaric  sport once and for all is a high prior
11       at they want-a total ban on such a cruel and  barbaric  sport – would it not have been better i
referred to as barbaric in lines 2, 3, 4, 7 and 9 while sport is called barbaric in lines 1, 6, 10 and 11, with speakers noting with disapproval that the word sport is euphemistic, e.g. line 1 ‘if one can call them sports’ and line 6 ‘bloodthirsty so-called sports’. Perhaps an effect of calling hunting a sport is that it trivialises it and positions those who hunt as callous. However, this is a concordance that is missing social actors. In other words, we are told that hunting is barbaric, rather than that people who hunt are barbaric. Perhaps this is a strategy to depersonalise the debate on behalf of the anti-hunt speakers. The pro-hunt speakers talk often about the people who will be affected by the ban (as we have seen, people is a pro-hunt keyword) while the anti-hunt speakers seem to have the opposite strategy, to not mention people who hunt but to refer to hunting itself – as an almost agentless activity.
Similarities and lockwords
So far our keywords analysis has been based on the idea that there are two sides to the debate, and that by comparing one side against another we are likely to find a list of keywords which will then act as signposts to the underlying discourses within the debate on fox hunting. Our analysis so far has uncovered differences between the two sides of the debate. However, in focussing on difference, we may be overlooking similarities – which could be equally important in building up a view of discourse within text. For example, why do certain words not appear as keywords? Considering that barbaric occurred as a keyword in the anti-hunting speeches, another word that I had expected to appear as key in the anti-hunting debates was cruelty. However, this word occurred 124 times in the anti-hunting speeches and 106 times in the pro-hunting speeches. In terms of proportions, taking into account the relative sizes of the two sub-corpora, the anti-hunt speakers actually used the word cruelty proportionally less than the pro-hunters (0.17 vs 0.18 per cent). So while cruelty occurred slightly more often on one side of the debate, this was not a statistically significant difference – clearly the concept of cruelty is important to both sides. However, how would we know (without making an educated guess) that a word like cruelty is worth examining? One solution would be to carry out a different sort of keywords procedure; this time by comparing both sets of debates against a third corpus, one which is representative of general language use. This would produce two keyword lists which would
Figure 7.1 Keywords when the sub-corpora are compared against the same reference corpus.
contain some keywords that only appear in one list, but others which would appear in both. The words which appear in both would indicate lexical similarities. For my reference corpus I used a one million word corpus of written British English called the BE06, containing 500 texts published around 2006. First I compared the anti-hunt sub-corpus against it, obtaining the top twenty keywords. Then I compared the pro-hunt sub-corpus against the BE06, also noting the top twenty keywords. These words are shown as the Venn diagram in Figure 7.1. Looking at the middle part of the figure, we can see a number of shared keywords which occur due to the register of the language: hon, bill, clause, member, gentleman and others that are due to the topic: hunting, ban, foxes, fox.
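The logic of comparing both sub-corpora against a single reference corpus and then looking for overlap can be sketched as follows. This is a minimal illustration and not a reproduction of AntConc's output: the function names are my own, the frequency figures are invented toy data, and the keyness statistic is the standard two-corpus log-likelihood calculation, restricted here to positive keywords (words relatively more frequent in the study corpus).

```python
import math
from collections import Counter


def log_likelihood(a, b, size_a, size_b):
    """Two-corpus log-likelihood for a word with frequencies a and b."""
    total = size_a + size_b
    exp_a = size_a * (a + b) / total
    exp_b = size_b * (a + b) / total
    ll = 0.0
    if a:
        ll += a * math.log(a / exp_a)
    if b:
        ll += b * math.log(b / exp_b)
    return 2 * ll


def top_keywords(study, reference, n=20):
    """Rank words that are relatively more frequent in the study corpus."""
    size_s, size_r = sum(study.values()), sum(reference.values())
    scored = []
    for word, a in study.items():
        b = reference.get(word, 0)
        if a / size_s > b / size_r:  # positive keywords only
            scored.append((log_likelihood(a, b, size_s, size_r), word))
    return [word for _, word in sorted(scored, reverse=True)[:n]]


def venn(keys_a, keys_b):
    """Split two keyword lists into only-A, shared and only-B sets."""
    a, b = set(keys_a), set(keys_b)
    return a - b, a & b, b - a


# Toy frequency lists, invented for illustration only.
anti = Counter({"hunting": 50, "cruelty": 20, "barbaric": 10, "the": 200})
pro = Counter({"hunting": 45, "cruelty": 18, "criminal": 12, "the": 190})
reference = Counter({"the": 6000, "of": 3000, "hunting": 2, "cruelty": 1,
                     "barbaric": 1, "criminal": 3})

only_anti, shared, only_pro = venn(top_keywords(anti, reference, n=3),
                                   top_keywords(pro, reference, n=3))
```

In the toy data, hunting and cruelty fall in the shared middle region, which is the kind of lexical similarity that a direct comparison of the two sub-corpora would hide.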
Table 7.12 Concordance (sample) of cruelty (anti-hunt)
 1      romise, no uncertainty, no delay; a ban on the  cruelty  and sport of hunting in the lifetime of this
 2  ael: I see it very clearly in a Bill that bans the  cruelty  associated with hunting in all its forms. I h
 3       ate about banning cruelty and eradicating the  cruelty  associated with hunting. I have tried to be
 4     law, to be enforceable and to eradicate all the  cruelty  associated with hunting with dogs, and I i
 5      rtant issue for many who want to see an end to  cruelty  and for those who want things to remain a
 6    listen to an organisation that exists to prevent  cruelty  to animals and I remind the hon. Member
 7       enshrining in law the principle of preventing  cruelty  as well as the principle of recognising utili
 8    ke effective and enforceable law. It will tackle  cruelty  , but it also recognises the need to deal wi
 9       ise, is uncompromising in seeking to root out  cruelty  . It will not allow cruelty through hunting
10       mingly, twice, to bring an end to unnecessary  cruelty  to wild mammals. There can seldom in pa
Table 7.13 Concordance (sample) of cruelty (pro-hunt)
 1    and the Bill entirely fails adequately to define  cruelty  or utility. As my hon. Friend the Mem
 2       eedless or avoidable suffering” when defining  cruelty  . The phrase “playing the fish” is no euph
 3   al act. The arbitrary application of the tests of  cruelty  and utility to foxhunting is illogical when
 4      ul unless those who hunt can meet the tests of  cruelty  and utility described by the Minister. Th
 5       . The whole House has heard the definition of  cruelty  , as given by the Minister, relating to ne
 6      ften than not, focuses on cruelty or perceived  cruelty  . I commend the former Home Secretary
 7  . It will not be for the authorities to prove that  cruelty  takes place; if the Bill is enacted, hunti
 8     r described as incontrovertible evidence of the  cruelty  of deer hunting, he must tell us what it i
 9    s. If the Minister is so concerned, where is the  cruelty  test in the autumn for shooting or snari
10      he Minister said that those would not pass the  cruelty  or utility tests. How can he know that?
The word cruelty also appears as a keyword for both sub-corpora. Examining this word in more detail, it becomes apparent that although it occurs with a reasonably comparable frequency on each side of the debate, the ways that it occurs are quite different. The anti-hunt speakers tend to use it in conjunction with words like ban, outlaw, unnecessary, target and eradicate (see Table 7.12). Their speech also tends to assume that cruelty already exists, e.g. ‘The underlying purpose of the Bill is to ban all cruelty associated with hunting with dogs.’ However, those who are pro-hunting question this position – using collocates such as test, tests, prove, evidence and defining (see Table 7.13). Therefore, rather than accepting the presence of cruelty, pro-hunting speakers problematise it: e.g. the full text in line 1 of Table 7.13 is: ‘Cruelty is subjective and comparative, and the Bill entirely fails adequately to define cruelty or utility.’ Comparing two sub-corpora against a reference corpus is therefore a useful way of determining key concepts across the corpus as a whole and will help to address the problem of over-focussing on differences at the expense of similarities. A related process involves conducting a direct comparison between the two sub-corpora, which, rather than identifying words where there is a strong statistical difference, instead focusses on cases where the relative frequencies are very similar. I have termed such words lockwords as their relative frequencies appear locked into place. AntConc does not have a feature to identify lockwords although they can be derived for corpora that have been installed on CQPweb. For example, the top five lockwords derived from comparing the BE06 (written text from 2006) against an equivalent corpus of written text from the 1930s called B-Brown
are at, talk, easy, direct and older. These are words which have almost identical frequencies between the two corpora. Multiple comparisons can be used to identify keywords in cases where we are dealing with more than two sub-corpora. In Baker et al. (2013) we analysed a corpus of British newspaper articles about Muslims and Islam. In order to identify what was lexically distinct about each newspaper, we derived keywords by comparing each newspaper against a sub-corpus containing all the other newspapers, doing this for each newspaper. I refer to this as ‘the remainder method’ of keywords because you are essentially using the remaining bits of a corpus as your reference corpus. This helped us to establish that the Sun was more likely to refer to the concept of evil compared to other papers, the Guardian showed a strong preference for the word Islamist while The Times and Telegraph used Islamic more than other newspapers. The process can also be done to examine change over time, for example, by comparing data from a single year in a corpus, against all the other years. In Brookes and Baker (2021), we adapted the method to analyse features of the annual news cycle in a ten-year corpus of articles about obesity. So rather than compare each year against each other, we compared all articles written in January against those written in February–December and so on. This identified how fluctuations across each year impacted on journalism about obesity – with articles about dieting and gyms in January giving way to a greater focus on sleep and comfort food like butter in February, while summer months like August had stories about fears of looking overweight in swimming suits while on holiday.
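Both procedures described above, lockwords and the remainder method, can be approximated with short functions. The sketch below rests on several assumptions: the lockword measure (how close the ratio of relative frequencies is to 1, on a log scale, with a minimum frequency filter) is one plausible way of operationalising the idea and is not necessarily how CQPweb calculates it; the function names are my own; and the sub-corpus labels and frequencies are invented toy data.

```python
import math
from collections import Counter


def log_likelihood(a, b, size_a, size_b):
    """Two-corpus log-likelihood for a word with frequencies a and b."""
    total = size_a + size_b
    exp_a = size_a * (a + b) / total
    exp_b = size_b * (a + b) / total
    ll = 0.0
    if a:
        ll += a * math.log(a / exp_a)
    if b:
        ll += b * math.log(b / exp_b)
    return 2 * ll


def lockwords(corpus_a, corpus_b, n=5, min_freq=5):
    """Words whose relative frequencies are most alike in the two corpora."""
    size_a, size_b = sum(corpus_a.values()), sum(corpus_b.values())
    scored = []
    for word in corpus_a.keys() & corpus_b.keys():
        a, b = corpus_a[word], corpus_b[word]
        if a >= min_freq and b >= min_freq:
            ratio = (a / size_a) / (b / size_b)
            scored.append((abs(math.log2(ratio)), word))  # 0 means identical
    return [word for _, word in sorted(scored)[:n]]


def remainder_keywords(subcorpora, name, n=10):
    """Keywords for one sub-corpus compared against all the others pooled."""
    study = subcorpora[name]
    rest = Counter()
    for other, counts in subcorpora.items():
        if other != name:
            rest.update(counts)
    size_s, size_r = sum(study.values()), sum(rest.values())
    scored = [(log_likelihood(a, rest.get(w, 0), size_s, size_r), w)
              for w, a in study.items()
              if a / size_s > rest.get(w, 0) / size_r]
    return [w for _, w in sorted(scored, reverse=True)[:n]]


# Toy data, invented for illustration only.
early = Counter({"at": 50, "talk": 10, "war": 40, "web": 5})
late = Counter({"at": 50, "talk": 10, "war": 5, "web": 40})
papers = {
    "paper_a": Counter({"evil": 10, "islam": 20, "the": 100}),
    "paper_b": Counter({"islamist": 12, "islam": 20, "the": 100}),
    "paper_c": Counter({"islamic": 11, "islam": 21, "the": 100}),
}

locked = lockwords(early, late, n=2)
distinct = {name: remainder_keywords(papers, name, n=1) for name in papers}
```

The same `remainder_keywords` function would serve for the month-by-month comparison: the keys of the dictionary would simply be months rather than newspapers.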
Key clusters
Another way of spotting words which occur frequently in two comparable sets of texts but may be used for different purposes is to focus not on keywords per se but on key clusters of words. AntConc does not allow key clusters to be calculated, although other tools such as WordSmith and Sketch Engine do have this facility. Using WordSmith, it is possible to derive word lists of clusters of words, rather than single words. Then two of these lists can be compared against each other, in order to see which combinations of words occur more frequently in one text or corpus when compared with another – a list of key clusters. WordSmith also allows the user to specify the size of the cluster
under examination – generally, the larger the cluster size we specify, the fewer the number of key clusters that are produced. Taking a cluster size of three, a list of key clusters was obtained by comparing the speech of the pro-hunters with those who were against hunting. This list (not shown as a table) contained some interesting clusters. For example, a complete ban and a total ban were key clusters used by anti-hunters. One aspect of the language of the pro-hunters was the fact that they tended to use a number of supportive phrases to show that they agreed with each other: a good point, friend is right, and learned friend. However, I want to focus mainly on clusters containing the word cruelty, because as we have seen, this was a high frequency word which occurred on both sides of the debate (and it was key when compared to the BE06 corpus). And while cruelty itself was not a keyword when the two sides of the debate were compared with each other, it did appear in a number of key clusters, suggesting that although it appeared with similar frequencies, its actual uses when embedded in discourse were more marked. For the anti-hunt speakers, cruelty occurred as key in the following clusters: cruelty associated with (10 vs 0 occurrences) and the cruelty associated (6 vs 0 occurrences). In the pro-hunt speech, cruelty occurred as key in there is cruelty (0 vs 5 occurrences) and is cruelty in (0 vs 5 occurrences). Looking first at cruelty associated with (Table 7.14), it can be seen from the concordance that this phrase is used as part of a particular pattern in almost all cases. There is a reference to the Bill under discussion, then the intention to ban (or outlaw, prevent or eradicate) cruelty associated with hunting (usually with dogs). The language used here assumes that
Table 7.14 Concordance of cruelty associated with (anti-hunt)
 1    erlying purpose of the Bill is to ban all  cruelty associated with  hunting with dogs. The well-est
 2  her provisions of the Bill, it will ban all  cruelty associated with  hunting with dogs. The first gr
 3    n. Friends. The Bill will ensure that all  cruelty associated with  hunting with dogs will be banne
 4   m of the Bill is to deal with the issue of  cruelty associated with  hunting with dogs. The question
 5    o be enforceable and to eradicate all the  cruelty associated with  hunting with dogs, and I invite
 6  see it very clearly in a Bill that bans the  cruelty associated with  hunting in all its forms. I have ju
 7       ut banning cruelty and eradicating the  cruelty associated with  hunting. I have tried to be clinic
 8      on offer today is a complete ban on the  cruelty associated with  hunting with dogs and a compl
 9   rt the principle of the Bill to outlaw the  cruelty associated with  with hunting, I am unhappy with
10     art is the key purpose of preventing the  cruelty associated with  hunting with dogs. That is why i
Table 7.15 Concordance of there is cruelty in (pro-hunt)
 1       , Coastal (Mr. Gummer,) who said that  there is cruelty in  any form of field sport. Of course the
 2   e there is. There is cruelty in shooting.  There is cruelty in  fishing. There is particular cruelty in
 3          f field sport. Of course there is.  There is cruelty in  shooting. There is cruelty in fishing.
 4  e is particular cruelty in coarse fishing.  There is cruelty in  any form of killing animals. There is
 5     cruelty in any form of killing animals.  There is cruelty in  slaughterhouses. We eat meat for our
there is cruelty associated with hunting with dogs: lines 5 to 10 state it as a given fact by using the definite article the. Additionally, in lines 1, 2, 3 and 5 the word all appears as a pre-modifier to the phrase, indicating that the types of cruelty are multiple. Only in line 4 is cruelty not presented as a given but as something more questionable, e.g. ‘to deal with the issue of cruelty . . .’. How about the phrases there is cruelty and is cruelty in which are key in the pro-hunt speech? Both of these clusters occur as part of a longer four-word cluster there is cruelty in (see Table 7.15). This concordance has a less clear pattern – the speakers simply state that there is cruelty in a range of different activities – field sport, fishing, shooting, killing animals and slaughterhouses. However, in carrying out a dispersion plot (see Chapter 4) of there is cruelty in it is apparent that all of the cases of this cluster appear in the same place. Therefore, this cluster appears to be key because it appears as part of a single statement, rather than it being representative of the speech of the pro-hunting debaters. The relevant part of this speech is shown below:

Many hon. Members have spoken about cruelty. I agree with my right hon. Friend the Member for Suffolk, Coastal (Mr. Gummer,) who said that there is cruelty in any form of field sport. Of course there is. There is cruelty in shooting. There is cruelty in fishing. There is particular cruelty in coarse fishing. There is cruelty in any form of killing animals. There is cruelty in slaughterhouses. We eat meat for our pleasure. We wear leather shoes for our pleasure. Any form of killing animals is cruel, but the question is how cruel it is.
Therefore, what this speaker appears to be arguing is that the extent of cruelty is more important than the presence of cruelty, a statement which fits in with the earlier findings relating to the more questioning use of the word cruelty by the pro-hunting speakers.
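The key-cluster procedure discussed in this section amounts to building a frequency list of three-word clusters for each side and then running the same keyness test on clusters instead of single words. The sketch below is a hedged illustration rather than a description of WordSmith's implementation: the function names, the toy token lists and the cut-off of 3.84 (the critical value for p < 0.05 with one degree of freedom) are all my own choices.

```python
import math
from collections import Counter


def log_likelihood(a, b, size_a, size_b):
    """Two-corpus log-likelihood for an item with frequencies a and b."""
    total = size_a + size_b
    exp_a = size_a * (a + b) / total
    exp_b = size_b * (a + b) / total
    ll = 0.0
    if a:
        ll += a * math.log(a / exp_a)
    if b:
        ll += b * math.log(b / exp_b)
    return 2 * ll


def ngram_counts(tokens, n=3):
    """Frequency list of every n-word cluster in a token list."""
    return Counter(" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def key_clusters(tokens_a, tokens_b, n=3, threshold=3.84):
    """Clusters whose frequencies differ between A and B, strongest first."""
    counts_a, counts_b = ngram_counts(tokens_a, n), ngram_counts(tokens_b, n)
    size_a, size_b = sum(counts_a.values()), sum(counts_b.values())
    results = []
    for cluster in counts_a.keys() | counts_b.keys():
        a, b = counts_a[cluster], counts_b[cluster]
        ll = log_likelihood(a, b, size_a, size_b)
        if ll >= threshold:
            side = "A" if a / size_a > b / size_b else "B"
            results.append((ll, cluster, side))
    return sorted(results, reverse=True)


# Toy token lists, invented for illustration only.
side_a = ("ban the cruelty associated with hunting " * 6).split()
side_b = ("there is cruelty in fishing and shooting " * 6).split()

key_for = {cluster: side for _, cluster, side in key_clusters(side_a, side_b)}
```

In the toy example the single word cruelty occurs six times on each side, yet cruelty associated with is key for side A and there is cruelty is key for side B, which mirrors the pattern found in the debate data.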
The fact that the key cluster there is cruelty in can only be attributed to one speaker, and one small part of his speech, would perhaps make us question its relevance to the whole debate. For keyness to be meaningful does it actually need to be evenly distributed throughout a text or corpus? If its relative frequency is simply due to a single case of repetition like this one, is it worth commenting on? One way to take keyword spread into account is to use WordSmith’s key keywords feature, which gives a measure of the distribution of keywords across different files. As my sub-corpora do not consist of multiple files but just one file for each side of the debate, I have instead used AntConc’s Plot function, as described in Chapter 4. This gives a Dispersion measure for a chosen word for each corpus file – a number between 0 and 1, where 1 indicates that a word is widely dispersed throughout the file and 0 means that it is situated in the same place in the file. In the pro-hunt sub-corpus, the word cruelty has a dispersion measure of 0.831 (reasonably well-dispersed), although there is cruelty in has a dispersion measure of 0. We may want to argue then, that as the cluster there is cruelty in occurs in such a restricted way, its saliency as ‘key’ across the whole text can be questioned. However, it could also be argued that there is a strong rhetorical impact of having this cluster occur five times in such close proximity. It is extremely salient in terms of style. The phrase there is cruelty in indicates parallelism, which has the rhetorical effect of persuading and evoking an emotional response. The repetition makes complex ideas easier to process as well as holding the listener’s attention. Crystal (1995: 378) notes that ancient rhetorical structures consisting of lists ‘convey a sense of . . . power and provide a climax of expression which can act as a cue for applause’. 
Therefore, perhaps we should not simply dismiss this particular cluster because its frequency is due to repetition in one speech. However, when reporting the analysis of keyness, it is worth mentioning distribution or dispersion. Again, this requires a closer analysis of words and phrases in the corpus, rather than simply giving frequencies from word lists.
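As a rough illustration of how a dispersion score of this kind behaves, the following Python sketch computes Juilland’s D, one common dispersion measure that also runs from 0 to 1. This is an assumption made for illustration: it is not necessarily the calculation that AntConc’s Plot function performs, and the function name, the ten-segment default and the toy data are all hypothetical.

```python
import math
import statistics


def juilland_d(tokens, target, n_segments=10):
    """Return Juilland's D for `target` in a list of tokens.

    The text is cut into `n_segments` equal-sized parts (any leftover
    tokens at the end are ignored), and D = 1 - V / sqrt(n - 1), where
    V is the population standard deviation of the per-segment counts
    divided by their mean. Returns None if the target never occurs.
    """
    if n_segments < 2:
        raise ValueError("at least two segments are needed")
    segment_length = len(tokens) // n_segments
    if segment_length == 0:
        raise ValueError("text is too short for this many segments")
    counts = [
        tokens[i * segment_length:(i + 1) * segment_length].count(target)
        for i in range(n_segments)
    ]
    mean = statistics.fmean(counts)
    if mean == 0:
        return None
    variation = statistics.pstdev(counts) / mean
    return 1 - variation / math.sqrt(n_segments - 1)


# Hypothetical toy texts of 100 tokens each (not the fox hunting data).
spread_out = ["filler"] * 100
for position in range(5, 100, 10):   # one hit in every segment
    spread_out[position] = "cruelty"

clustered = ["filler"] * 100
for position in range(5):            # five hits, all in the first segment
    clustered[position] = "cruelty"

print(juilland_d(spread_out, "cruelty"))
print(juilland_d(clustered, "cruelty"))
```

In this toy case the evenly spread word scores 1 and the word whose five occurrences all fall in one segment scores 0 (to within floating-point rounding), which mirrors the contrast between cruelty and there is cruelty in described above. The sketch handles single tokens only; a cluster would first need to be located as a sequence of tokens.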
Key categories
A further way of considering keyness is to look beyond the lexical or phrasal level, for example by considering words that share a related semantic meaning or grammatical function. While a simple keyword list will reveal differences between sets of texts or corpora, it is sometimes the case that
lower frequency words will not appear in the list, simply because they do not occur often enough to make a sufficient impact. This may be a problem, as low frequency synonyms tend to be overlooked in a keyword analysis. However, text producers may sometimes try to avoid repetition by using alternatives to a word, so it could be the case that it is not a word itself which is particularly important, but the general meaning or sense that it refers to. For example, it could be the case that the notion of ‘largeness’ is key in one text when compared to another, and this is demonstrated by the writer using a range of words such as big, huge, large, great, giant, massive, etc. – none of which occur in great numbers, but if considered together, would actually appear as key. Thinking grammatically, in a similar way, one text may have more than its fair share of modal verbs or gradable adjectives or first-person pronouns when compared to another text. Finding these key categories could help to point to the existence of particular discourse types – they would be a useful way of revealing discourse prosodies. In order for such analyses to be carried out, it is necessary to undertake the appropriate form(s) of annotation (see Chapter 3). As with all forms of automatic annotation, there is a possibility of tagging error which needs to be taken into account. Therefore, ideally, the annotation should be hand-checked and corrected before key categories are derived. If this is not possible, then any frequent or key tags should at least be checked, and any cases of mis-tagging accounted for. The automatic semantic annotation system used to tag the fox hunting corpus was the USAS (UCREL Semantic Analysis System) (Wilson and Thomas 1997). This semantic tagset was originally loosely based on McArthur’s (1981) Longman Lexicon of Contemporary English.
It has a multi-tier structure with twenty-one major discourse fields, subdivided, and with the possibility of further fine-grained subdivision in certain cases. In some cases, tags can be assigned a number of plus or minus codes to show where meaning resides on a binary or linear distinction. For example, the code T3 refers to ‘Time: Old, new and young; age’, so the word kids is assigned T3− placing it at one end of a linear scale, whereas a word like pensioner would receive T3+. I used a free online tool called Wmatrix which carried out automatic semantic tagging of the two sub-corpora. Wmatrix then compared frequency lists (consisting of the frequencies of the semantic tags) of the two sides of the fox hunting debate to create lists of key semantic tags.
One semantic category which occurred more often in the speech of those who are opposed to hunting was S1.2.5+ ‘Tough/strong’. This was the second key category (when ordered according to log-likelihood scores), consisting of words such as tough, strong, stronger, strength, strengthening and robust (Table 7.16 shows a small sample of these cases). On this side of the debate then, the pro-hunt stance is viewed as weak, whereas the proposed Bill is frequently characterised as tough, strong or robust.
Table 7.16 Concordance (sample) of words tagged as S1.2.5+ ‘Toughness; strong/weak’ (anti-hunt)
No. | Left context | Node | Right context
1 | to the Bill, we would have incredibly | strong | legislation with which to tackle hunti
2 | lleagues to unite today in getting good, | strong | legislation through the House. I hope
3 | n. However, although the current Bill is | strong | in that respect, it does not set the th
4 | hon. Lady’s argument is not especially | strong | . The Bill is good in that it takes us
5 | stands is far from imperfect. It is a very | strong | Bill. It deals with the issue of cruelty
6 | the other Government amendments to | strengthen | the Bill are agreed, I can give the Ho
7 | practicable in their area. The measure is | tough | but fair, and it will be simple to
8 | The tests, as I have said, are | tough | but fair. Supporters of hunting say th
9 | eve in while being seen by the public as | tough | and fair and being strong enough to
10 | will stand the test of time. The Bill is | robust | when tested against all four of these
In the pro-hunt sub-corpus two key semantic tags which stood out were S1.2.6−, which referred to a semantic category called ‘Foolish’, and S1.2.6+ which was called ‘Sensible’. These were the sixth and twelfth most key semantic categories and Table 7.17 shows a small sample from the total number of cases. The categories contain words relating to issues of sense: sensible, reasonable, common sense, rational, ridiculous, illogical and absurd.
Table 7.17 Concordance (sample) of words tagged as S1.2.6+ ‘Sensible’ and S1.2.6− ‘Foolish’ (pro-hunt)
No. | Left context | Node | Right context
1 | he Bill makes illegal only the perfectly | reasonable | sensible and respectable occupations
2 | continuation of hunting. I appeal to all | reasonable | hon. Members to support me in seeki
3 | inal law rather than fiddle around in an | absurd | way with this absurd Minister on this
4 | rmed roast. The debate has not shown a | rational | analysis of the facts: misplaced co
5 | be justified by scientific evidence. The | ridiculous | new clause 13 wrecks it further, and i
6 | this matter. Most people with common | sense | will say, “Why don’t they reach a dea
7 | eds your protection. Mr. Gray: Calm, | sensible | and rational people across Britain a
8 | ss. Why not? That would be a logical, | sensible | and coherent approach. As I have to
9 | method of control in that time is utterly | illogical | Mr. Gray: My hon. Friend makes an
10 | ng-during that time. This ludicrous and | illogical | new clause is the result of a shabby d
The prevalence of these types of words is due to the way that the pro-hunt speakers construct the proposed ban on hunting (as ridiculous, illogical and absurd) and the alternative decision to keep hunting (as reasonable, sensible and rational). While this way of presenting a position would appear to make sense in any argument, it should be noted that the anti-hunt speakers did not tend to characterise the debate in this way. They did not argue, for example, that their position was sensible, reasonable, etc. and that of their opponents was ridiculous and absurd. It is also worth noting that one feature of hegemonic discourses is that they are seen as ‘common-sense’ ways of thinking. To continually refer to your arguments in terms of ‘common-sense’ is therefore quite a powerful legitimation strategy. With this sort of analysis, we are not only seeing the presence of discourses in texts, but we are also uncovering evidence of how they are repeatedly presented as the ‘right’ way of viewing the world. So there is a significant difference in the ways that the two sides of the debate try to position themselves as correct. While the pro-hunt side frames itself in terms of what is sensible, the anti-hunt side uses strength as its criterion. It is difficult to explain the reason for this difference from just looking at concordance lines. Perhaps it was the case that members of parliament received explicit briefings on how to refer to their arguments or perhaps they subconsciously picked up on the language of the people on their own side. Perhaps the two framings occur across other debates and reflect differences in the ways that different political parties engage in evaluation of legislation. The analysis would need to be supplemented with other forms of enquiry to test these hypotheses.
What is notable though is that the two sides are using different criteria to argue that they are right – meaning that they do not engage with each other’s criteria. However, one way that the pro-hunt speakers do appear to engage with the arguments of the other side of the debate was found through an analysis of the key tag G2.2+ ‘Ethical’. This tag (the ninth most key) related to a set of words that included moral, rights, principles, humane, morality, ethical, legitimate, noble and fair. It appears that the pro-hunt speakers are more likely to argue their position from an explicitly ethical standpoint – a somewhat surprising finding considering that the ethical position of ending cruelty to animals would appear to be a more obvious stance for the anti-hunt protesters to have taken. However, a closer examination of a concordance of words which receive the G2.2+ tag (Table 7.18) reveals that the pro-hunt speakers are preoccupied with issues of morality because they wish to question the supposed absolutist ethical standpoint of the anti-hunters. Therefore, their frequent references to ethics are based around attempts to problematise or complicate the ethical position of the anti-hunters: again, this finding complements and widens the analysis of the word cruelty above.
Table 7.18 Concordance (sample) of words tagged as G2.2+ ‘Ethical’ (pro-hunt)
No. | Left context | Node | Right context
1 | e should be careful about imposing our | morality | on other people, someone on the Lab
2 | ople to make up their own minds about | morality | . One of the issues that I dealt with as
3 | In any event, they are surely moral and | ethical | issues to be considered by individu
4 | g, vivisection and slaughter? There are | moral | gradations here and no moral absolut
5 | the Bill that it is based on no consistent | ethical | principle. I was rather pleased when
6 | ere is a complete absence of consistent | ethical | principles in the contents of the Bill.
7 | at not an issue? Is hunting not the more | humane | method of controlling the fox pop
8 | omeryshire (Lembit Öpik). There is no | moral | justification for the Government’s po
9 | questions involved, will he explain the | moral | difference between a gamekeeper us
10 | en. Predators do not consider the moral | rights | and wrongs as we do as human bein
Semantic tagging of the corpus, then, helps to reveal some of the more general categories of meaning which are used in the construction of discourse positions on the different sides of the debate. The pro-hunt speakers talk in terms of what is sensible, whereas the anti-hunt speakers talk in terms of what is strong. On their own, individual words like strong, sensible and rational did not appear as keywords – it was only by considering them as part of a wider semantic category that their importance became apparent. Widening the scope of keywords beyond the lexical level can therefore be a fruitful endeavour.
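The reasoning behind key categories, that several low frequency words can be key when counted together even though none is key alone, can be sketched in Python. The block below uses the log-likelihood formula commonly used for keyness; this is an assumption, as it may differ in detail from what Wmatrix computes. The word frequencies, the corpus sizes and the LARGENESS word set are all invented for illustration and are not taken from the fox hunting corpus.

```python
import math
from collections import Counter


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness score for one item in two corpora."""
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero frequency contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Hypothetical frequencies in two texts of 10,000 tokens each.
LARGENESS = ["big", "huge", "large", "massive"]
text_a = Counter({"big": 4, "huge": 3, "large": 4, "massive": 3})
text_b = Counter({"big": 1, "large": 1})
SIZE_A = SIZE_B = 10_000

# Score each word separately.
word_scores = {
    word: log_likelihood(text_a[word], SIZE_A, text_b[word], SIZE_B)
    for word in LARGENESS
}

# Score the category as a whole by summing its members' frequencies.
category_score = log_likelihood(
    sum(text_a[word] for word in LARGENESS), SIZE_A,
    sum(text_b[word] for word in LARGENESS), SIZE_B,
)

for word, score in word_scores.items():
    print(f"{word}: {score:.2f}")
print(f"LARGENESS category: {category_score:.2f}")
```

With these invented figures, none of the four words individually reaches 6.63, a conventional log-likelihood cut-off corresponding to p < 0.01, whereas the combined category does. A semantic tagger does the grouping automatically; here the category membership is simply listed by hand.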
Conclusion
A keyword list is a useful tool for directing researchers to significant lexical differences between texts. However, care should be taken in order to ensure that we do not give too much attention to lexical differences whilst ignoring differences in word usage and/or similarities between texts. Carrying out comparisons between three or more sets of data, grouping infrequent keywords according to discursive similarity, showing awareness of
keyword dispersion across multiple files by calculating key keywords or examining dispersion plots, carrying out analyses on key clusters and on grammatically or semantically annotated data, and conducting supplementary concordance and collocational analyses will enable researchers to obtain a more accurate picture of how keywords function in texts. Although a keyword analysis is a relatively objective means of uncovering lexical salience between texts, it should not be forgotten that the researcher must specify cut-off points in order to determine levels of salience: more work is required to establish how such cut-off points can influence research outcomes. As with analysis of collocates, it is a good idea to provide an illustrative example or two (space permitting), of keywords in use, from your corpus. And it is also worth considering the order in which you discuss the keywords. A good analysis will try to tell an engaging story, making connections between different keywords which gradually build up an overall picture, as opposed to simply starting with the top keyword and then working down the list. Bear in mind that not all keywords need to be discussed in detail and the keywords do not necessarily need to be tackled in the order that they appear in the list. In this chapter I identified categories of keywords based on how they contributed towards the same discourse, e.g. fellow citizens, Britain, people or illiberal, sanctions, offence, imposing, criminal. This can take time and often requires detailed concordancing to identify how a set of words work together or embody similar meanings or functions. Alternatively, employing some form of collocational network analysis (see Chapter 6) might help to show connections between keywords, although another, even more systematic, approach is described by Clarke et al. (2021) who created an R script to carry out Multiple Correspondence Analysis (MCA) on the keywords in a corpus.
The approach can potentially identify sub-registers, based on grouping related texts which contain similar keywords, and it can also identify potential discourses, based on the ways that some keywords may appear in the same text but not others. For example, when they used MCA to examine news articles about Islam, Clarke et al. (2021) identified nine ‘dimensions’ of keywords, each of which related to discourse. While the tool presents the sets of the keywords, it is up to the human researchers to identify how they contribute towards a particular discourse and to provide that discourse with a meaningful label. For example, one dimension they identified involved a set of keywords which collectively referenced the aftermath of terror attacks (e.g. bombing, suicide, footage), which contrasts
with a second set of keywords which related to news stories about human rights and corruption (court, justice, life). Analysis of these sets of keywords in context revealed that ‘Islam and Muslims are often discussed in the national press in terms of being “behind” on human rights (especially women’s rights) or as having caused death and destruction to innocent victims’ (Clarke et al. 2021: 161). This approach also allows the same keyword to appear as part of different discourses. For example, the keyword Trump appeared in several discourses (US politics, tribalism, human rights and corruption). The MCA approach would not have worked well on the corpus I used in this chapter, as I was essentially only working with two large ‘texts’, one of pro-hunt speeches, the other of anti-hunt speeches. However, I could create a file for each speech separately, and this would allow MCA to be carried out. Indeed, an advantage of MCA is that it can be used on very short texts, e.g. on a corpus where each file is a single tweet or text message. When used sensitively, keywords can reveal a great deal about frequencies in texts which is unlikely to be matched by researcher intuition. However, as with all statistical methods, how the researcher chooses to interpret the data is ultimately the most important aspect of corpus-based research.
Step-by-step guide to keyness analysis
1 Build or obtain access to a corpus.
2 Decide how you will compare your corpus. Can the corpus be split into two or more sub-corpora, or can the whole corpus be compared against a reference corpus? If the latter is the case, you will need to build or obtain access to a second corpus.
3 Obtain list(s) of keywords for the corpora or sub-corpora you are interested in examining. You may want to experiment with different tools, keyness statistics, minimum frequencies and keyness cut-off points.
4 Can the keywords be grouped grammatically, semantically or thematically?
5 Obtain concordances and collocates of the keywords and look for patterns. This should enable you to uncover the topics, stylistic features, prosodies, representations, discourses or legitimation strategies in the corpus.
6 Consider how keywords are distributed or dispersed across the corpus, perhaps using a process like key keywords. Are some keywords more typical of the corpus than others?
7 If you are interested in two or more corpora, can you use a third reference corpus or a tool to identify words of similar frequency (e.g. lockwords)?
8 Consider obtaining key clusters or key semantic or grammatical tags and carry out steps 4–7 on them.
9 Attempt to explain why particular discourse patterns appear around key terms.
Recommended reading
Archer, D. (ed.) (2009), What’s in a Word-list? Investigating Word Frequency and Keyword Extraction, London: Routledge. This edited collection contains chapters on using keyness analysis including a chapter by Mike Scott called ‘In search of a bad reference corpus.’
Brookes, G. and Baker, P. (2021), Obesity in the British Press, Cambridge: Cambridge University Press. This book analyses a corpus of newspaper articles about obesity. Keyword analyses occur in several chapters, allowing the authors to compare different types of newspaper, change over time and articles vs reader comments.
Taylor, C. and Marchi, A. (eds) (2018), Corpus Approaches to Discourse: A Critical Review, London: Routledge. This edited collection contains chapters by Charlotte Taylor on similarity and Costas Gabrielatos on different ways of calculating keyness.
Questions for students
1 What kinds of research contexts or questions would best be answered by analysing a keywords list and which would favour looking at a simple frequency list?
2 Imagine you are comparing two corpora directly for keywords: one corpus is 10,000 words in size, while the other is one million words. The keywords procedure would work well if the larger corpus was the reference corpus. But is it feasible to use a much smaller corpus as a reference corpus? If it is not feasible, is there a way to get round this?
3 Imagine you are analysing a corpus of British newspaper articles about a political campaign. You use various types of reference corpus to obtain different sets of keywords. Try to predict what keywords would be obtained if the reference corpus was (a) an equivalent corpus of American news articles about the same campaign, (b) a corpus of British news articles about a wide range of topics, (c) a corpus of general British English writing.
8 Going Beyond the Basics
Introduction
The previous chapters have considered some of the basic techniques used in CADS – frequency, concordancing, collocation and keyness. These can enable a great deal of corpus-assisted research to be gainfully carried out. However, in those chapters I worked with relatively ‘well-behaved’ corpora of written text. Even the spoken corpus I examined in the previous chapter actually contained transcripts of speeches (most of which had been prepared in advance), and very little interaction between participants. In this chapter I want to return to some of the issues that were raised in Chapter 3 on corpus building, to consider points relating to the analysis of different kinds of corpora, those containing visuals, social media or spoken data. I also examine more complex forms of analysis, involving change over time, corpora of different languages, combining corpus analysis with qualitative analysis of text samples and identifying difficult-to-find features like metaphor. Some of the processes involved in this chapter can be made easier by using programming languages like Python, R or Java or operating systems like Linux, and numerous tools exist that work in these environments, e.g. TWINT is a Twitter scraping tool written in Python. If you have time and the aptitude, equipping yourself with skills in this area is recommended. The corpus software tools described in this book are powerful and can allow you to carry out a lot of analysis, but they have limitations. Programming skills will allow you access to a wider range of existing tools, as well as enabling you to adapt them or create bespoke ones for your own purposes. However, realistically, it can be a difficult enough job to encourage some discourse researchers in the social sciences or humanities to work with any form of computer software, let alone ask them to learn how to code, so I would not place a strong requirement for discourse-analytic researchers to obtain those
kinds of skills, and instead I mainly refer to tools I have discussed elsewhere like WordSmith or Wmatrix, as well as noting other types of software that do not require programming skills, like FireAnt or Excel.
Triangulation and analysis of samples
In Chapter 1 I wrote that a corpus approach could be used as a form of triangulation with other forms of analysis, particularly those which take a qualitative approach, although the variety of techniques within corpus linguistics also means that different forms of corpus analysis can be used to triangulate one another. I want to begin this chapter by returning to the idea of triangulation, thinking further about the ways it can benefit researchers, as well as considering some practicalities around putting it into practice. As I mentioned in my discussion of Baker and Egbert (2016) in Chapter 1, triangulation can involve multiple researchers working on the same corpus. This is also an approach taken by Marchi and Taylor (2009) who separately used corpus techniques to consider how journalists wrote about themselves and their profession in news articles. Both analysts found a set of convergent (broadly similar) and complementary (different but not contradictory) findings, indicating how the triangulation confirmed certain findings but also yielded different perspectives. In a similar study, I compared the reports of five researchers who independently analysed a corpus of newspaper articles about foreign doctors (Baker 2015). Each researcher used a different combination of methods, although everyone incorporated analyses of concordances and collocates. Only one finding was uncovered by all five researchers, that foreign doctors were represented as having poor language skills. Instead, the majority of findings were only made by one researcher, indicating more of a complementary than a convergent picture overall. The two analysts who produced the highest number of findings had taken different strategies – one had carried out a detailed concordance analysis of every occurrence of the phrase foreign doctor, while the other used a wide range of methods including dispersion plots, keywords and key semantic categories.
One way of carrying out triangulation within corpus-assisted discourse analysis is to carry out an analysis of the full corpus, using software, and to supplement this with a more detailed analysis of a smaller number of full
texts from within the corpus. This is an approach taken by Baker and Levon (2015) who looked at a 41.5 million word corpus of newspaper texts which reference different kinds of men in terms of social class (e.g. middle-class man) and ethnicity (e.g. black man). I carried out a corpus analysis which involved looking at collocates of these terms while my co-investigator, Erez Levon, carried out a qualitative analysis of a down-sampled selection of fifty-one articles. The two sets of analyses were then compared in order to identify differences in terms of research findings and the kinds of findings that each approach yielded. For example, the corpus analysis allowed the identification of frequent patterns which were not necessarily evident from a small sample, as well as revealing unexpected collocates which might not have been spotted by a human analyst. This included the collocate well-, which appeared with middle-class men in phrases like well-educated, well-dressed and well-known, indicating ways that a discourse of privilege around such men was realised – the term usually did not indicate who or what had caused these men to be well-educated, well-dressed, etc. Conversely, the collocate self- appeared with working-class men, in contexts like self-made, self-educated and self-motivated, which indicated that success for such men was often represented as being due to their own efforts. On the other hand, the qualitative analysis indicated two recurring themes which related to the ways that social class and ethnicity were discussed in relation to masculinity. One of these was physicality, the other was ambition. So black men were represented as lacking ambition although there was also a representation of a successful black man who was criticised for being inauthentic. Asian men were represented as inherently entrepreneurial although sometimes over-ambitious to the extent of achieving success through immoral means.
The study concluded that the qualitative analysis was useful in terms of showing how certain representations were located within grammatical constructions which enhanced their force, as well as enabling a more detailed interpretation of findings within their broader ideological context. While a corpus analysis is good, then, at identifying patterns which can point to particularly salient representations or discourses, it is often the case that a more in-depth qualitative analysis of an entire text is needed in order to interpret the effects of these representations and discourses and explain the reasons for them. A question arises then, relating to how a sample of texts should be chosen from a corpus. In Baker (2019) I experimented with different approaches to sampling, comparing what sort of results would be obtained if we took different samples of texts from a corpus of articles that contained the seed
word obesity. One sample contained articles which had the most mentions of this word, another contained articles which had the highest concentration of keywords across the whole corpus, a third sample included articles taken from the month during which the most articles were collected, and a final sample included articles taken at random. The analysis found that the different samples provided different kinds of findings. For example, articles which had the most mentions of the word obesity tended to be good for identifying a wide range of perceived causes of obesity. The articles which contained the highest proportion of keywords were good for identifying types of people affected by obesity and potential consequences of it. The articles from the most productive time period in the corpus were also good for identifying types of people affected by obesity while the random sample was most useful in terms of identifying different rhetorical strategies that journalists used to argue various positions. Contrastingly, a corpus analysis of collocates of the terms obese and obesity was better than the sample analyses at identifying representations of obese people. This was a small-scale study and it is unlikely that the findings would be replicated on a different corpus, although it does indicate that the way of collecting a sample is likely to have a bearing on what is found. The articles with the most mentions of the search term also tended to be the longest articles in the corpus and so they were ‘rich’ texts in terms of going into various topics in depth, but such texts were not necessarily typical. The articles which appeared in the month where the topic of obesity was most often written about were also of interest, although they tended to be focussed more narrowly on a smaller number of news stories that had been popular at a particular point in time.
The random articles, on the other hand, were something of a mixed bag, and the extent to which they provide a representative sample is likely to be due to chance. Therefore, in terms of collecting articles that are most typical of a particular corpus, the set which contained the most keywords appeared to be the most effective technique. This was achieved through the use of a freely available tool called ProtAnt which I co-created with Laurence Anthony, the creator of AntConc (Anthony and Baker 2015). ProtAnt takes the files in a corpus, compares the corpus to a reference corpus and then works out the proportion of keywords in each individual corpus file, presenting the files in order, ostensibly indicating which ones are the most typical. The tool works best on files that are of equal size although the technique can be adapted for files that are different sizes by applying a logarithmic weighting.1 Experiments with the tool using corpora consisting of various types of texts found that it was able to identify which
texts were most and least typical of a set. For example, given a corpus containing ten sections of text from the novel Dracula, five from Frankenstein and five more from individual works, ProtAnt was able to pick out the Dracula texts as being most typical of the corpus, whereas when it was given corpora consisting of a single genre of text from the Brown family, each containing one text which belonged to a different genre, it was successful at spotting the odd one out (ranking it as the least typical text or very close to the bottom of the list) in ten out of fifteen cases. ProtAnt therefore has the potential to identify both typical and atypical texts in a corpus, which could be useful in terms of identifying individual texts where dominant and minority discourses might appear.
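The ranking idea behind this kind of tool can be sketched in a few lines of Python. The sketch below is not ProtAnt’s own code and makes simplifying assumptions: the keyword list is supplied ready-made rather than derived from a reference corpus, typicality is taken to be the share of a file’s tokens that are keywords, and no weighting is applied for file length. The function name, file names and toy texts are hypothetical.

```python
def rank_by_keyword_proportion(files, keywords):
    """Rank files from most to least typical.

    `files` maps a file name to its list of tokens; `keywords` is a set
    of lower-case keywords for the corpus as a whole. Each file is
    scored by the proportion of its tokens that are keywords.
    """
    scores = {}
    for name, tokens in files.items():
        hits = sum(1 for token in tokens if token.lower() in keywords)
        scores[name] = hits / len(tokens) if tokens else 0.0
    # Highest proportion first, i.e. the most 'typical' file at the top.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy corpus: three very short 'files'.
files = {
    "file_a.txt": "the count left the castle at night".split(),
    "file_b.txt": "the creature crossed the ice alone".split(),
    "file_c.txt": "the count and the vampire fled the castle".split(),
}
keywords = {"count", "castle", "vampire"}

ranking = rank_by_keyword_proportion(files, keywords)
for name, proportion in ranking:
    print(f"{name}: {proportion:.3f}")
```

Here file_c.txt comes out as the most typical file and file_b.txt, which contains none of the keywords, as the least typical. The top of such a list suggests candidates for close reading of dominant discourses, and the bottom suggests where minority discourses might be found.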
Change over time
Most corpus analysis tools are designed to work with one corpus at a time, although the keywords procedure requires two corpora to be compared. In the previous chapter I discussed the ‘remainder method’ of keywords, which allows multiple corpora (or a single corpus split into sub-corpora) to be compared, resulting in keyword lists for each one. This approach is useful in showing which words are particularly frequent in one corpus (or sub-corpus), compared against everything else. However, a potential issue with this approach is that it might miss more gradated differences, particularly those that might show up in a diachronic analysis. In Baker (2005) I compared personal adverts written by gay men that had been published in 1973, 1982, 1991 and 2000, noting how the frequencies of various categories of words had changed (e.g. references to being masculine or not being involved in the gay ‘scene’ were most frequent in 1991). This analysis was done in a fairly impressionistic way, first by putting frequent words into categories and then comparing their frequencies. However, I could have taken a more automated approach by applying a statistic called the Coefficient of Variance (CV), also known as the coefficient of variation. The CV is a measure of the relative dispersion of a set of data points. It can be expressed as a number (usually between 0 and 100), where the higher the number, the greater the dispersion. A low CV would produce a horizontal line if its corresponding frequencies were plotted as a chart. A high CV might produce a snaking line or one that went steadily up or down. I am not aware of a current corpus tool that works out the CV, although it can be calculated by creating word lists of the (sub-)corpora you want to
compare using WordSmith. Then, within WordSmith’s WordList tool, create a Detailed Consistency List. This is simply a set of multiple word lists of all the (sub-)corpora, shown in a single table. The table can be useful in getting a sense of which words are most frequent across all the (sub-)corpora, and which words are most (or least) frequent between them. To calculate the CVs of all the words, the list can be saved as an Excel file and then worked on within Excel. Doing this on the personal advert data I mentioned above, it is worth pointing out that many of the words were very infrequent (and this is likely to be case with all corpora). For example, thickset only occurred once in 1973 and never in any of the other time periods. For low frequency words, the CV is practically meaningless, so I usually apply a cut-off point to focus on words where there is a reasonably high frequency. For this analysis, I only looked at the 100 words which had the highest combined frequency (shown in the Total column in Figure 8.1). Then within Excel, I worked out the CV by calculating the standard deviation (STDEV) of the words in all four word lists, dividing this by the average (AVERAGE) word frequency and then multiplying by 100. This was achieved with the following formula: =STDEV(B2:E2)/AVERAGE(B2:E2)*100 The part (B2:E2) applies to the columns B, C, D and E which is where the four frequency lists resided in the Excel spreadsheet I had created (see Figure 8.1). If they had appeared in columns M, N, O and P, I would have simply typed (M2:P2) instead. The CV is thus shown in column G. The resulting list indicated a number of words with high CVs. For example, of the adjectives in the corpus, caring had one of the highest CVs
Figure 8.1 Excel spreadsheet indicating Coefficient of Variance over four time periods.
of 95.41 with frequencies of 2, 13, 22 and 51 across the four time periods. Clearly, being caring was seen as more important as time went on. The word active (not shown in Figure 8.1) had quite a high CV of 67.24, with a profile of 71, 48, 13 and 23, being most popular in 1973. Relationship had a middling CV of 39.93 with frequencies of 21, 65, 60 and 62. Here, a change appears to have occurred at some point between 1973 and 1982, with frequencies after 1982 being pretty stable. Similar had one of the lowest CVs (15.08), showing much less difference between the four periods: 65, 74, 82 and 58. The analysis helped me to determine which qualities were perennial – viewed with around equal importance across the four time periods, and which were time-dependent – going in or out of fashion (see Figure 8.2). It should be borne in mind that as the four sub-corpora I was comparing were all the same size, I could directly compare the raw frequencies. However, had the sub-corpora been of different sizes, it would have made more sense to compare standardised frequencies (e.g. occurrences per 10,000 words), otherwise the CVs would give an inaccurate picture of the variation. The CV can be a useful way of identifying words (or clusters or tags) which show the most or least variation between three or more corpora. Of course, a full analysis would then need to involve concordancing words of interest in order to interpret these patterns of frequency, particularly for the high CVs which show differences. An analysis of active, for example, indicated that it was used as a euphemism to refer to a man who took the
Figure 8.2 Frequencies of selected adjectives in gay personal adverts over time.
active (or penetrator) role during sex. Its low frequency in 1991 is indicative of a change that took place in the years leading up to this period, possibly as a result of concern and stigma around HIV-AIDS (during the same period, advertisers were also more likely to claim to be discreet, straight-acting and non-scene). I used the CV technique extensively in Baker (2017) when I compared eight members of the Brown family of reference corpora (consisting of American and British English from 1931, 1961, 1991/2 and 2006). The CV enabled me to identify various words, clusters and linguistic categories which had decreased, increased or stayed the same over time. For example, British English showed large declines in certain modal verbs like may, must and shall, as well as gradable adverbs like very and quite, while there were increases in contracted forms like didn’t and don’t. The analysis pointed to changes in discourse involving processes like densification (fitting more information into fewer words), informalisation/colloquialisation (written language becoming more informal and similar to spoken language) and democratisation (less authoritative-sounding language). Sketch Engine has a similar function for diachronic analysis called Trends, which requires corpora to be time-stamped with tags and some changes made to the configuration file which accompanies each corpus. The Trends function uses the Theil-Sen estimator, a method which fits a line to sample points on a plane using simple linear regression. This essentially assigns a trend score to all words in the corpus which shows the degree of change, along with a p value. Figure 8.3 shows a Trends analysis of the SiBol corpus which contains newspaper articles from 1993 to 2013 (see Partington 2010). The figure indicates the words whose frequencies have increased or decreased the most over that time period, with commenters showing the strongest increase and turnround having the strongest decrease. 
As some of these words are quite low-frequency (e.g. showrunner), we might want to apply a frequency cut-off, or perhaps focus on the high-frequency words, like tweet, in the list. Hocking (2022) used Trends to identify diachronic change in a 235,000 word corpus of artists’ interviews and statements, published between 1950 and 2019. His analysis indicates how artists drew on different kinds of discourses over time. For example, he found that over time artists were increasingly likely to conceptualise their work using words like project, performance, practice or research, while words indicating absoluteness and high modality like must, certainly, true, only and nothing decreased over time, a phenomenon which can be linked to the ways that artists responded to the end of the cultural and social movement known as modernism.
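To give a feel for what a Theil-Sen trend score involves, the following is a minimal sketch in Python. It is not Sketch Engine's own implementation, and it does not reproduce the p value that Trends reports. The estimator takes the slope between every pair of (time, frequency) points and uses the median of those slopes, which makes it less sensitive to a single unusual year than an ordinary least-squares line. The function name and the example figures are invented for illustration.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(times, freqs):
    """Return the median of the slopes between every pair of points.

    times: time stamps (e.g. years) in any order
    freqs: relative frequencies of one word at those times
    """
    slopes = [
        (f2 - f1) / (t2 - t1)
        for (t1, f1), (t2, f2) in combinations(zip(times, freqs), 2)
        if t2 != t1  # skip pairs that share a time stamp
    ]
    return median(slopes)


# Invented relative frequencies (per million words) for two hypothetical words.
years = [2000, 2001, 2002, 2003, 2004]
steady_rise = [1.0, 2.0, 3.0, 4.0, 5.0]
one_spike = [1.0, 2.0, 3.0, 4.0, 100.0]  # same rise, plus one outlying year

print(theil_sen_slope(years, steady_rise))
print(theil_sen_slope(years, one_spike))
```

Both calls give the same slope, because the single spike affects only a minority of the pairwise slopes. A positive score suggests a word that is becoming more frequent and a negative score one that is declining.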
Figure 8.3 Screenshot of Trends analysis in Sketch Engine.
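For readers who would rather use a script than a spreadsheet, the CV calculation described earlier in this section can be sketched in Python as follows. This is an illustrative sketch rather than part of the original analysis. The function names are my own, the frequencies are the ones quoted in the text, and Python's statistics.stdev is used because, like Excel's STDEV, it calculates the sample standard deviation. The helper per_10k is a hypothetical addition showing how raw frequencies could be standardised first if the sub-corpora were of different sizes.

```python
from statistics import mean, stdev


def coefficient_of_variance(freqs):
    """Sample standard deviation divided by the mean, multiplied by 100.

    Mirrors the spreadsheet formula =STDEV(range)/AVERAGE(range)*100.
    """
    average = mean(freqs)
    if average == 0:
        return 0.0  # a word that never occurs shows no variation
    return stdev(freqs) / average * 100


def per_10k(raw_freqs, corpus_sizes):
    """Convert raw frequencies to occurrences per 10,000 words."""
    return [f / size * 10_000 for f, size in zip(raw_freqs, corpus_sizes)]


# Frequencies for 1973, 1982, 1991 and 2000, as quoted in the text.
frequencies = {
    "caring": [2, 13, 22, 51],
    "active": [71, 48, 13, 23],
    "relationship": [21, 65, 60, 62],
}

# List the words from most to least variable.
ranked = sorted(frequencies, key=lambda w: coefficient_of_variance(frequencies[w]), reverse=True)
for word in ranked:
    print(f"{word:<14}{coefficient_of_variance(frequencies[word]):6.2f}")
```

The values printed should agree with those reported above to within rounding in the second decimal place. With sub-corpora of unequal size, each row could be passed through per_10k before the CV is calculated.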
Another approach to change over time would be to examine how the collocates of a word or set of words change across different corpora. In Gabrielatos and Baker (2008) the concept of consistent collocates (or c-collocates) was developed – the idea that the stability of a discourse could be shown through the fact that the same collocates of a word could be found across several years of data. The criterion for a c-collocate in their ten-year corpus of news articles about refugees, asylum seekers, immigrants and migrants was that a collocate needed to appear in at least seven of the ten years. The concept was further developed by McEnery and Baker (2016) who identified four categories of collocates which take into account different patterns of change (or lack of it) over time: consistent, initiating, terminating and transient. They applied specific criteria based on their corpus (Early English Books Online) and the ten time periods it was divided into. To apply a simpler definition of the terms, if we imagine that our data is divided into texts from four time periods, a consistent collocate would be found in all four periods, an initiating one would appear at some point after the first period and remain until the end, a terminating collocate would stop
appearing at some point after the first period and a transient collocate would appear in some periods but show no clear pattern.
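The simplified four-period definitions above can be expressed as a short Python sketch. It encodes one reading of the simplified definitions only, not the specific criteria that McEnery and Baker (2016) applied to their data, and the function names and the "absent" label are my own. A collocate is represented as a list of True/False values, one per time period in chronological order.

```python
def classify_collocate(present):
    """Classify a collocate from its presence (True/False) in each period."""
    if not any(present):
        return "absent"
    if all(present):
        return "consistent"
    first = present.index(True)
    last = len(present) - 1 - present[::-1].index(True)
    # True if the collocate never drops out between its first and last appearance.
    unbroken = all(present[first:last + 1])
    if unbroken and first > 0 and last == len(present) - 1:
        return "initiating"   # arrives after the first period and stays to the end
    if unbroken and first == 0:
        return "terminating"  # present from the start, then drops out for good
    return "transient"        # comes and goes with no clear pattern


def is_c_collocate(present, minimum=7):
    """Gabrielatos and Baker's (2008) criterion: present in at least 7 of 10 years."""
    return sum(present) >= minimum


print(classify_collocate([True, True, True, True]))
print(classify_collocate([False, True, True, True]))
print(classify_collocate([True, True, False, False]))
print(classify_collocate([True, False, True, False]))
```

The presence values themselves would come from running the same collocation measure and cut-off on each time period separately.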
Working with different languages

In recent years there have been efforts by corpus tool designers to enable work with languages that do not use the Latin alphabet system, with various degrees of coverage. Carrying out corpus analysis of discourse on languages other than English can still be challenging though. Even if the tools can effectively process the corpus, other resources, such as reference corpora, can be absent, meaning that a keywords study may be difficult to carry out. English tends to be the default language of many academic journals and this can result in bias towards English-language research. For example, there is quite a bit of corpus-based discourse research on English-language newspapers that are published in China (e.g. Qian 2010, Wang 2018). While this research is worthwhile and interesting, it would also be good to conduct similar analyses on Chinese-language newspapers or to carry out comparisons of the discourses between the Chinese and English newspapers. Partington et al. (2013) refer to such studies as cross-linguistic corpus-assisted discourse studies. One way of carrying out such a study would be to identify important words and their equivalent translations across more than one corpus. For example, Taylor (2017) examined the words community and comunità in news corpora of English and Italian while Schröter and Veniard (2016) compared intégration and Integration in French and German public discourses relating to the topic of migration. In 2018, Rachelle Vessey and I carried out a keywords comparison on two similar corpora that contained English and French texts respectively. The texts related to violent jihad (some of the English ones were also analysed in Chapter 6 of this book). We wanted to determine the extent to which the kinds of persuasive strategies and topics in the two sets of texts differed, and if this could be explained by considering the language context of publication. 
Comparing the two corpora directly against each other would have produced little of use so instead we created two reference corpora, one in French and one in English, in order to obtain two sets of keywords. It was difficult to find an appropriate reference corpus for French so we had to create our own. Even then, finding data was not easy and rather than building a full reference corpus (something like the Brown Corpus which contains numerous registers), we built a
corpus of texts from a French-language newspaper. Subsequently, we also built the British reference corpus out of newspaper texts, which while not ideal, at least meant that the two reference corpora were broadly comparable. We then derived a French keyword list by comparing our French violent jihad corpus against the French news corpus. The same procedure was carried out to create an English keyword list – with the English violent jihad texts compared against the English newspapers. This gave us two sets of keywords. For comparison purposes we took the top 500 keywords from each list, placing them into categories by carrying out concordance, cluster and collocation analyses on them. We engaged in quite a lot of moving back and forth between the two languages to ensure that the categorisation scheme was consistent and worked for both corpora. We then worked out the total relative frequencies of the keywords in each category across the two corpora, and were able to identify which categories contained more tokens overall in one corpus, compared to the other. For example, we found that the English corpus contained many more Arabic words which seemed to have the effect of legitimating the discourse, as in the English-speaking UK, many Muslims trace their lineage to Pakistan and Bangladesh and encounter Arabic in religious scripture. On the other hand, many Muslims in France come from Morocco, Algeria and Tunisia where Arabic is more widely spoken, so it does not have the same association with religious authority. Instead, we found that the French texts used a more formal register and made more use of quotes from scripture (in French, rather than Arabic) to discuss permissions, rights, obligations and laws. The French texts also had more reference to schools, which were barely mentioned in the English texts. This related to the political and legal context in France where religious symbols had been banned in public schools since 2004. 
Our analysis thus indicated interesting differences between the two sets of texts, where stylistic and topic choices could be related to particular audiences and political contexts. Taylor and Del Fante (2020) discuss some of the issues around working with corpora of different languages: the difficulty in identifying translations of words, use of different metaphors across languages, and the fact that categorisations are subjective (which can be especially problematic if using an automated semantic tagging system). It can also be difficult to impose the same techniques on corpora of different languages and expect what emerges to be directly comparable. The concept of a ‘word’ in English is different from that of Chinese, and so using the same collocation measure and cut-off
points for both may not result in a meaningful analysis. Languages do not all contain the same number of words or have the same kinds of word distributions so it is important to be sensitive to salient aspects of the language(s) you are working with. When working with corpora of different languages, it is worth bearing in mind that processes of production and reception can differ enormously between different cultures, so this kind of context-based consideration is crucial in interpreting and explaining findings. For example, in China numerous newspapers are run by the state. This is different to contexts like the UK where newspaper owners aim to use their position both to make a profit and to influence government policy by changing public opinion. A comparison of Chinese and English newspapers on the same topic would need to take into account these and other differences in terms of production and reception when interpreting and explaining findings. A further issue relates to the context that discourse analysis is carried out in. The way we carry out analysis is always going to be filtered through our own internalised discourses that we may only be vaguely aware of, if at all, and reflexivity which aims to identify how our own discourses have influenced our interpretations should form part of the analytical process. However, different cultural contexts can result in different kinds of opportunities relating not only to the kinds of discourses we have access to but the ways that we are able to present our findings. For example, in countries which operate under a liberal democracy, it is reasonably easy for researchers to be critical of government policy. This is not always the case in other parts of the world, and as a result, it may be difficult for some kinds of research to be fully critical. 
Analysts need to guard carefully against assuming that the discourses they identify in texts are truths rather than representations, and I have occasionally seen corpus-assisted discourse analysis of political speech or newspaper articles where the author appears to uncritically accept a representation as reality. Not only is this misleading, but it defeats the purpose of corpus-assisted discourse analysis. I would perhaps be especially careful when working with corpora collected from two or more countries, one of which is your own country. If you end up concluding that your own country is wonderful and the other country is not, then your research may end up being perceived as nationalist propaganda by some. I do not advocate that researchers compromise themselves, so they should think carefully about the topics they research and the sources of data they use.
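The step described earlier in this section, where the total relative frequencies of the keywords in each category were worked out for each corpus, could be sketched as below. This is a hypothetical illustration rather than the procedure actually used in that study. The function name, the placeholder words and categories, and all of the counts are invented, and frequencies are standardised per 10,000 words so that corpora of different sizes can be compared.

```python
from collections import defaultdict


def category_rates(keyword_freqs, categories, corpus_size, per=10_000):
    """Sum the relative frequencies of the keywords assigned to each category.

    keyword_freqs: raw frequency of each keyword in one corpus
    categories:    the category each keyword was placed in by the analyst
    corpus_size:   total number of tokens in that corpus
    """
    totals = defaultdict(float)
    for word, freq in keyword_freqs.items():
        category = categories.get(word)
        if category is not None:  # ignore keywords left uncategorised
            totals[category] += freq / corpus_size * per
    return dict(totals)


# Invented placeholder data: the same category labels are used for both corpora.
categories = {"word_a": "category_1", "word_b": "category_1", "word_c": "category_2"}
corpus_one = category_rates({"word_a": 50, "word_b": 30, "word_c": 20}, categories, 100_000)
corpus_two = category_rates({"word_a": 10, "word_c": 90}, categories, 50_000)

for category in sorted(set(corpus_one) | set(corpus_two)):
    print(category, round(corpus_one.get(category, 0.0), 2), round(corpus_two.get(category, 0.0), 2))
```

In a cross-linguistic study each corpus would have its own keywords, so the important design choice is that the category labels, rather than the words, are shared between the two languages.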
Spoken analysis

One of the most frequent questions I hear at non-corpus conferences after I give a presentation about corpus linguistics is, ‘Can I use corpus linguistics in order to analyse transcripts of spoken interactions like interviews?’ The answer is, yes, although you will probably need to work a bit on getting the transcripts into a ‘corpus-ready’ format. Specifically, this would involve assigning codes or tags so that different (types of) speakers can be identified and compared, along with the encoding of paralinguistic information (if this is relevant to your transcripts) like gesture, facial expression, laughter, whispering, shouting etc. Information like pauses, false starts or use of sarcasm could also be tagged, although there is no need to go overboard and tag lots of features just because they can be tagged, particularly if they are not likely to be considered in the actual analysis. The tags will not change the frequency counts in your data (most corpus software uses the default setting of ignoring them when creating word lists) but will enable you to compare or isolate stretches of text that occur within particular tags, such as all speech by interviewers or those being interviewed, speech of males or females or speech where people are laughing. Transcribed speech can often appear ambiguous and Anderson (2010: 556) recommends that spoken data should be listened to, noting the existence of tools like Transcriber, Praat or ELAN which are able to link audio files to transcripts. She also advocates browsing and listening to longer stretches of speech during the process of analysis. A study I carried out on a spoken corpus involved scripts of the American sitcom Will and Grace (Baker 2005). In actuality, this was a written-to-be-spoken corpus as the actors were reciting lines from a written script, and such text does not always model actual speech very well (there are far fewer ems and ers, overlaps and false starts). 
But it is a reasonable approximation of speech and the same procedures can work on scripts as on naturally recorded speech. In the study, I was interested in what made the characters distinctive from one another so I converted the scripts of the sitcom, which contained a basic notation of each character’s speech, e.g. ‘Will: What are you talking about?’ By using the search and replace functions of a text editor I was able to implement a simplistic tagging scheme whereby the speech of each character was enclosed by a tag (which was their name), and a default closing tag, e.g.
What are you talking about? Will, you’ve had one common problem in all your relationships. You. Jack, I’m good at relationships. Using the tags feature in WordSmith, I created word lists for the speech of each speaker in the corpus, which could then be used for keyword comparisons, e.g. by comparing the speech of Will against Jack, or the speech of Will against everyone else in the corpus. This identified, for example, that Jack employed a more self-centred use of language (higher use of the singular first person pronouns me and my) as opposed to Will’s more other-centred or inclusive language (higher use of you and we). A keyword analysis of interviews could be useful in identifying features of the interview which analysts may not have considered. This could involve comparing different interviewees against one another, or comparing the whole interview corpus against a reference corpus. For example, certain interviewees may use more hedging, hesitation or certain types of pronouns more than others, and this is likely to be revealed through a keyword analysis. Comparing the speech of the interviewer against the interviewees might reveal important differences, such as cases where an interviewer frames a concept in a certain way and interviewees reword this, using their own terms. A consideration of a full corpus of interview speech might help to identify themes or repeated ways of representing certain concepts. Additionally, analysts might want to focus on a pre-determined set of words and identify their collocates or the ways that they occur in multi-word units to obtain a better understanding of their usage. Collocation could also occur with paralinguistic units, so, for example, perhaps certain words are likely to be followed by long pauses or hesitations. A corpus analysis could help to identify consistent cases where conversations do not flow smoothly, perhaps indicating areas where interviewees experience certain emotional states. Dayrell et al. 
(2020) carried out a corpus analysis of seventy-three interviews with minority communities living in the UK from various religions (Muslims, Hindus and Christians). These interviews had been carried out in 2005 and had already been analysed qualitatively, although Dayrell et al. (2020: 114) note that ‘[t]he qualitative, hermeneutical interpretive method (consciously) predisposed the original researchers to follow the interviewee’s lead, and unpack their world of meaning through their own idiosyncratic speech. Corpus techniques on the other hand allowed observations of which the participants were unaware; with an
empirical impulse not always given by qualitative analysis.’ As a result, taking a corpus approach allowed the analysts to identify pronouns like my, our and your as indicating places where interviewees reflected on their own faith, while consideration of the word family, which was found to be frequent across all groups, indicated how religion was intertwined with identity, culture and family background. In other studies, a tagged spoken corpus has facilitated a detailed qualitative analysis, enabling numerous examples of a particular form of interaction to be studied in detail. For example, Partington (2003) worked with two corpora of press briefings held at the White House (around one million words in total). The corpora had been tagged with paralinguistic information, including a tag which indicated laughter which occurred 532 times. This enabled Partington to identify possible locations where teasing occurred in the corpus, and from this he carried out a qualitative analysis which resulted in a categorisation scheme for different forms of teasing. If such tags are not available, then different methods are available for identifying potential linguistic features you are interested in. I used a mixture of introspection and reading samples of text in order to come up with a set of words and phrases where teaching staff indicated disagreement in the MICASE corpus of spoken academic interactions (Baker 2013). My list identified ninety-two such cases, although subsequently, more detailed reading of the entire corpus revealed a small number of additional examples that the search terms had missed. We thus need to accept that such methods may not find every case, but they can at least enable analysis to go ahead without the need to tag an entire corpus by hand. Another system for categorising spoken data is based on the notion of functional discourse units. Egbert et al. 
(2021: 715) argue that ‘conversation is composed of contiguous units that are characterized by coherent communicative purposes’. Analysing the British National Corpus Spoken 2014, they identify how speakers can move from one communicative goal to another several times across a single conversation – for example, commenting on snow that is falling outside, discussing what gift to purchase for their mothers, telling stories about stag and hen parties they have attended, sharing opinions about what makes a good party or making holiday plans. The authors of the paper created a system for categorising communication into Discourse Units based on the functions of the talk at any given point. Nine units were identified, including joking around, engaging in conflict, figuring things out and sharing feelings and opinions. For each segment of conversation, up to three Discourse Units may be present, although
the coding system requires them to be categorised as having either a dominant, major or minor function. A corpus that has been tagged with the system could be used in order to identify typical and atypical ways that conversations develop – for example, joking around might precede engaging in conflict. Additionally, the corpus could be used to identify specific features of different Discourse Units – what kinds of language do people engage in when they are sharing feelings and opinions? Or it could be used in conjunction with other tags allowing us to compare say, patterns of male and female speakers. Although the system was developed with spoken data, in one project I worked on, it was successfully applied to a corpus of online forum interactions, which gave valuable insights into the overall goals of the forum members and the ways that these were achieved. The only proviso with working with the scheme was that the categorisation had to be carried out by hand and checked for accuracy and consistency, making it infeasible to carry it out on millions of words of data. Instead, it was carried out on a smaller sample of the corpus. However, the scheme has potential to enhance our understanding of how discourses can develop in real-time conversations.
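Returning to the speaker tagging described earlier in this section for the sitcom scripts, the search-and-replace step and the per-speaker word lists could be approximated with a short Python script. This is a rough sketch under stated assumptions. The tag format (an opening tag with the speaker's name in lower case and a matching closing tag) is hypothetical and is not necessarily the scheme used in the original study, the speaker labels in the sample dialogue are inferred from its content, and the function names are my own.

```python
import re
from collections import Counter, defaultdict

# A script line of the form "Name: speech".
SCRIPT_LINE = re.compile(r"^(?P<speaker>[A-Za-z]+):\s*(?P<speech>.+)$")
# A tagged stretch of speech, e.g. <will>...</will> (hypothetical tag format).
TAGGED = re.compile(r"<(?P<speaker>\w+)>(?P<speech>.*?)</(?P=speaker)>", re.S)


def tag_script(script):
    """Enclose each character's speech in a tag named after the character."""
    tagged = []
    for line in script.splitlines():
        match = SCRIPT_LINE.match(line.strip())
        if match:  # lines without a "Name:" prefix are skipped
            name = match.group("speaker").lower()
            tagged.append(f"<{name}>{match.group('speech')}</{name}>")
    return "\n".join(tagged)


def word_lists(tagged_text):
    """Build a separate word frequency list for each speaker tag."""
    lists = defaultdict(Counter)
    for match in TAGGED.finditer(tagged_text):
        words = re.findall(r"[a-z']+", match.group("speech").lower())
        lists[match.group("speaker")].update(words)
    return lists


script = """Will: What are you talking about?
Jack: Will, you've had one common problem in all your relationships. You.
Will: Jack, I'm good at relationships."""

tagged = tag_script(script)
print(tagged)
for speaker, counts in word_lists(tagged).items():
    print(speaker, counts.most_common(3))
```

The per-speaker lists could then feed a keyword comparison, for instance one character against another or one character against everyone else. A real transcript would also need decisions about stage directions and paralinguistic tags, which this sketch ignores.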
Visual analysis

Studies combining visual analysis with corpus analysis are still in their infancy and the more established corpus software has yet to meet the challenge of working with corpora that contain both written text and images. However, it is possible to annotate or manipulate corpus texts in different ways so that a visual analysis can be incorporated. In this section I demonstrate how I examined the relationship between written text and image in a small-scale study which considered the representation of obesity in a corpus of newspaper articles from the Daily Mail. Using the Daily Mail’s website search engine I collected 400 articles which contained at least one mention of the word obesity. Each article was saved as a Word document, containing images, and also a text-only document, with the images removed. I was interested in the language that appeared in articles where people were shown, particularly whether or not people’s faces appeared. So I created a file system to group different types of articles together. All of the text-only versions of articles that had originally contained a picture of a person were
placed in one folder, and those which had not contained a picture of a person were put in another folder. For articles that had contained pictures of people, I created two sub-folders, to compare pictures of people whose face was showing vs those where only part of their body appeared or their face was obscured in some way. Using AntConc, I then carried out a series of keyword comparisons of the different sets of articles. For example, I first compared articles that contained a picture of a person vs those which contained a picture without a person. For those articles that contained a picture of a person, I examined the first 100 keywords found, excluding those which occurred fewer than five times. This resulted in the following keywords: weight, Argent, life, gastric, stone, smoking, saving, deaths, surgery, obesity, mother, live, between, study, she, body, James, heart, overweight, than, was, their, who and experts. The list indicated that articles containing images of people tended to place emphasis on a risk to life as a result of obesity, e.g.

James Argent reveals he has undergone life-saving gastric surgery after reaching 27 stone and details harrowing six-week recovery

Weight problems cause more deaths from heart conditions than puffing tobacco, new study suggests

Bad news for men with ‘dad bods’: People with big BELLIES are more likely to develop heart disease – regardless of their BMI, study warns

I then focussed just on the articles containing pictures of people, and compared those where faces were obscured against those where faces were visible. The keywords of people with obscured faces were deaths, smoking, obesity, study, heart, weight, than, more, PA and are. It is interesting how deaths appears as the strongest keyword in this list, suggesting that the emphasis on risk to life is even stronger when photos of people without their faces visible are examined. Deaths is a plural noun which indicates a countable entity as opposed to a process. 
While such articles do show pictures of people, the fact that no faces are shown, accompanied with the word deaths, is suggestive of a sense of distancing or abstraction surrounding people with obesity who die. The analysis required me to split the texts into different folders, although another way that this could have been approached would be to incorporate an encoding system which would then allow me to work with the texts in a single file, or in a single folder. Such an encoding system could encompass each article with opening and closing tags which indicate which type of image is present, for example for articles which contained no images of people I could
place the tag at the start of the article and then at the end. Other tags, such as and could be used to distinguish articles where people’s faces were shown or not. The corpus could then be used with WordSmith which has the facility within the Advanced Settings tab to only read certain parts of files if they occur between particular tags or stretches of text which can be specified by the analyst. This would allow me to create separate word lists for the different types of articles in the files, and then carry out the keyword comparisons from there. Tagging could be carried out in more detail, perhaps to indicate the exact part of a text in a single article which is attached to a certain image. One form of analysis could consider which words appear near a particular type of image, using a collocational span. Images could be marked with different or multiple tags if we wanted to carry out a more complex analysis, for example the sex of people in the images could be identified by using tags like and . We might also want to specify the age or ethnicity of people in images, resulting in tags like . Analysts would therefore be advised to spend some time perhaps first carrying out a qualitative pilot study of the images and some of the surrounding text in order to gain a sense regarding what distinguishes certain images from each other. It might be, for example, that use of colour is important, or we might want to consider whether a person is smiling or if they are looking at the camera or not. Kress and van Leeuwen (1996) have developed a framework for visual analysis which contains a range of features that could be incorporated into such a study, although I would probably warn against developing an annotation scheme that is too ambitious and tries to cover every aspect of every image. 
This kind of tagging scheme for images is quite basic, although still requires no small amount of work on the behalf of the researcher who is probably going to have to do a lot of referring back to the original texts. An ideal corpus tool would be one which contains images embedded within the files, which can then be shown alongside concordance searches of particular words and phrases. There is huge potential for such an analysis – for example, McGlashan (2016) examined the relationship between visuals and written text in a corpus of children’s books that featured same-sex families. He found that a key cluster in this corpus was love each other, and when a concordance of this phrase was examined, it appeared with pictures of families who were embracing in some way. The meaning of love each other was therefore supplemented by these images of people hugging, indicating that love was
expressed through physical contact, although this would not have been found if only the written text of the stories had been considered alone. As there was no credible tool that could have done this analysis, it had to be carried out painstakingly by hand, with the original texts consulted for each concordance line. Another issue with the incorporation of images into a corpus analysis involves the fact that each image would need to be tagged by someone. Automatic Image Taggers which make use of Artificial Intelligence techniques are becoming available. Online tools like Google’s Cloud Vision or Amazon Rekognition will assign a series of tags to images which include identifying objects in the image, facial emotions and other labels (e.g. fashion, purple, performing arts). Baker and Collins (2023) outline the potential in using automatic image tagging in a newspaper corpus. Some taggers require users to pay a subscription, and the tags used in such tools may not describe the features of the images that you are most interested in, or their accuracy may not be ideal (often a confidence score is assigned to each tag). Some tools allow users to define their own tags and train the tool first on an existing dataset, so that more relevant tags can be incorporated into the analysis. There is still plenty of work to do then, both in designing corpus tools that can work with images, and in automatic tagging procedures. A final issue here relates to copyright and publication. Generally, when working with written or spoken corpora, most publishers do not raise queries if authors quote small amounts of text in a book or journal article. The text often occurs as a snippet in a concordance line, or at most a paragraph, and such duplications usually come under ‘fair use’ guidelines. However, incorporating someone else’s image in a published work usually requires permission to be obtained from the copyright holder or the image needs to be purchased. 
This can be a timely, frustrating and costly experience, which has the potential to delay or even prevent publication of certain types of work. It is perhaps no wonder, then, that many corpus linguists simply stick to analysing and quoting written text. One option in working with visual images is simply to describe the images (as I did with the earlier analysis of Daily Mail articles about obesity), rather than replicate them. Another could be to collect a corpus of copyright-free images or to obtain permission from the creators of the images as the corpus is being collected (for example by building a corpus of children’s homework from a local school). However, I suspect that issues around copyright are one reason why corpus techniques of visual analysis lag behind those relating to written text.
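Returning to the automatic image taggers mentioned earlier in this section, the confidence score attached to each tag offers one way of deciding which tags to carry into a corpus. The following is only a minimal sketch in Python: the function name keep_confident_tags, the file names, the scores and the threshold of 0.8 are all assumptions made for illustration, and real tagging services return their results in their own formats.

```python
def keep_confident_tags(image_tags, threshold=0.8):
    """Keep only the tags whose confidence score meets the threshold.

    `image_tags` maps an image name to a dict of {tag: confidence}.
    This structure is hypothetical, not the output of any real service.
    """
    return {
        image: sorted(tag for tag, score in tags.items() if score >= threshold)
        for image, tags in image_tags.items()
    }

# Invented example data: two images with tags and confidence scores.
tagged = {
    "article_01.jpg": {"person": 0.97, "purple": 0.62, "fashion": 0.85},
    "article_02.jpg": {"performing arts": 0.41},
}
confident = keep_confident_tags(tagged)
```

An image left with no tags at all (as with the second one here) is a useful prompt to go back and tag it by hand.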
Social media
Social media texts are attractive to consider in corpus research due to the fact that data is already available in electronic format and can usually be gathered in large amounts. Laurence Anthony’s FireAnt is a freeware social media and data analysis toolkit which allows data to be imported in a number of formats, or can be set to collect data via Twitter (e.g. by specifying that tweets to be collected must contain a certain word). Another tool, NVivo, has the facility to collect Facebook, Twitter or YouTube content using a web-browser extension called NCapture. Bear in mind that use of some tools may not be in line with the terms of service of particular social media sites (see Chapter 3).
Baker and McEnery (2015) examined a corpus of tweets relating to a television programme called Benefits Street. The programme was a documentary about a street where almost all the residents were in receipt of government welfare or ‘benefits’ and had sparked debate in the UK. Keywords were obtained by comparing the corpus against a reference corpus of random tweets collected during the same period. These keywords were then categorised into meaningful groups and collocational networks were obtained of some of them. This enabled a number of discourses to be identified such as an ‘idle poor’ discourse (where residents of the street were represented as lazy, undeserving and irresponsible), or the ‘poor as victims’ discourse (where poor people were represented as genuinely suffering). We had worked with a version of the corpus where retweets were removed, although for an additional part of the analysis we identified the 200 tweets that had been retweeted most often, categorising the discourses in each one by hand. This identified that a third discourse, ‘the richer get richer’ was actually the one that was most frequently retweeted.
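The two steps described above, removing retweets and then ranking tweets by how often they were retweeted, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the Benefits Street study: the function name split_retweets is hypothetical, the sample tweets are invented, and retweets are assumed to begin with ‘RT @user: ’, whereas real exports may flag retweets in a separate metadata field.

```python
from collections import Counter

def split_retweets(tweets):
    """Return (originals, retweet_counts) from a list of tweet strings.

    A tweet is treated as a retweet if it starts with "RT @" (an assumed
    convention). Exact duplicates of original tweets are kept only once.
    """
    originals = []
    retweet_counts = Counter()
    seen = set()
    for text in tweets:
        if text.startswith("RT @"):
            # Strip the "RT @user: " prefix to recover the retweeted text.
            _, _, body = text.partition(": ")
            retweet_counts[body] += 1
        elif text not in seen:
            seen.add(text)
            originals.append(text)
    return originals, retweet_counts

# Invented sample data.
sample = [
    "Watching the programme tonight",
    "RT @viewer1: Watching the programme tonight",
    "RT @viewer2: Watching the programme tonight",
    "Everyone has an opinion on this",
    "RT @viewer3: Everyone has an opinion on this",
]
originals, counts = split_retweets(sample)
most_retweeted = counts.most_common(1)  # use most_common(200) for a top 200
```

Keeping the retweet counts separate from the de-duplicated texts means that frequency-based analysis is not skewed by repetition, while the information about which messages circulated most widely is still available.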
A number of issues are worth bearing in mind, relating to the fact that social media corpora can contain a great deal of repetitive text, e.g. as a result of people retweeting other people’s tweets or forums where people’s replies also contain quotes from earlier messages. Such repetitions are likely to skew a corpus analysis by affording higher frequencies to certain words or phrases. A solution could be to remove any text which originates from an earlier source. However, it may be important to know which parts of a message are specifically in response to earlier messages, particularly if aspects of Conversation Analysis or another qualitative approach is going to be taken with the data. Therefore, it may be a good idea to save different versions of the corpus, one which retains everything in the format it originally occurred
in, the other which only contains unique ‘first time’ entries. Alternatively, an annotation scheme could be developed, in order to highlight such cross-references within the corpus. Social media messages can also contain embedded images or videos, which can sometimes be elided by web-scraping tools, making the interpretation of meaning less easy. And often such messages are embedded within longer exchanges between different users, which may not always be collected by the scraping system, also reducing the capacity of a contextual analysis. Such problems can be overcome (or put aside and the limitations of the corpus acknowledged), although this is likely to take more work in terms of constructing the corpus, for example by linking posts together or incorporating a tagging system for images. Social media text makes use of aspects of computer-mediated communication, which can involve use of punctuation or combinations of letters and numbers, for example the phrase see you soon might be typed as c u soon. Emoticons, involving particular combinations of punctuation, were used to signify emotional affect, e.g. :) indicates a smiling face. Some emoticons might contain characters like / or | which might have special functions in corpus tools. Emoji in the form of single character pictograms like 😂 are increasingly popular. Collins (2020) describes how he built a corpus of texts from a business selling gifts on Facebook by creating tags for images and substituting emoji with their alphanumeric Unicode values, e.g. U+1F600. These values are then treated as tokens. Depending on the corpus tool being used, the settings may need to be changed so that numerals and characters like + are recognised as part of a token. Collins found that the majority of emoji in his corpus reflected positive affective states like applause and smiling. He noted how certain emoji collocated with words, for example 😍 collocated with images of the local area that had been tagged CANDID.
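A substitution of this kind can be sketched in Python as follows. This is not Collins’s own procedure: the function name emoji_to_codes is hypothetical, and the character ranges in the pattern are a rough assumption that covers many common emoji but by no means all of them.

```python
import re

# Rough emoji ranges (an assumption; not a complete list of Unicode emoji).
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def emoji_to_codes(text):
    """Replace each emoji with a token such as U+1F600, set off by spaces."""
    replaced = EMOJI.sub(lambda m: f" U+{ord(m.group()):04X} ", text)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(replaced.split())

converted = emoji_to_codes("well done \U0001F600\U0001F600")
```

Padding each code with spaces ensures that two emoji typed next to each other, or an emoji typed directly after a word, end up as separate tokens.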
The newer generations of corpus analysis tools are better at working with emoji, for example AntConc version 4 includes emoji in frequency lists, although at the time of writing the analyst needs to input a list of emoji into the Token Definition when creating a corpus. Another potentially interesting way of considering social media corpora is through measures of audience reception. Many social networking sites or apps allow readers to signify that they ‘like’ a particular post, through clicking an icon, and this information often accompanies the post. Incorporating this information would allow for an interesting range of research questions to be asked. For example, how does the language of posts that receive many likes compare to those which receive hardly any? Other quantitative information
such as the number of followers that a poster has, the number of replies a post receives, its location or time-stamp, could also be incorporated as tags to enable different forms of comparative corpus analysis.
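As a rough illustration of the kind of comparison suggested above, the sketch below splits a handful of posts by their number of likes and calculates relative word frequencies for each group. Everything in it is an assumption made for the example: the posts are invented, the ‘likes’ field stands in for whatever metadata a platform supplies, and the cut-off of 100 likes is arbitrary.

```python
from collections import Counter

def relative_freqs(posts, per=1000):
    """Word frequencies per `per` tokens across a list of post texts."""
    tokens = [w for text in posts for w in text.lower().split()]
    counts = Counter(tokens)
    return {w: c * per / len(tokens) for w, c in counts.items()}

# Hypothetical posts with a 'likes' value taken from platform metadata.
posts = [
    {"text": "so proud of our team", "likes": 250},
    {"text": "proud day for everyone", "likes": 180},
    {"text": "meeting moved to friday", "likes": 2},
    {"text": "the meeting is cancelled", "likes": 0},
]
threshold = 100  # arbitrary cut-off, chosen only for illustration
popular = [p["text"] for p in posts if p["likes"] >= threshold]
unpopular = [p["text"] for p in posts if p["likes"] < threshold]

popular_freqs = relative_freqs(popular)
unpopular_freqs = relative_freqs(unpopular)
```

With a real corpus the two sub-corpora would normally be compared using a keyness measure rather than raw relative frequencies, but the principle of using reception data to divide the corpus is the same.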
Metaphor
Metaphors are a particularly revealing way of helping to see discourses surrounding a subject. As Fairclough (1989: 119) points out, when we describe x in terms of y, all the ‘ideological attachments’, connotations, associations and their meanings carried by y are projected onto x. For example, if we recall the analysis of newspaper texts in Chapter 5, refugees were variously described in metaphorical expressions that referenced water and package metaphors, e.g. ‘a flood of refugees’. In this sense, they were characterised as an out of control problem with no sense of agency, as well as being depersonalised. The metaphor of refugees as water, then, contributed to an overtly negative discourse. Looking at the presence of metaphors in a corpus and noting their relative frequencies to each other should provide researchers with a different way of focussing on discourse.
Cognitive metaphor theory proposes that common linguistic associations are expressions which reflect the way that our brains actually work – not only do we talk about x in terms of y, but we understand x in terms of y (Lakoff and Johnson 1980, Gibbs and Steen 1999). However, we can understand metaphors in terms of dual levels. As Koller (2004: 9) points out: ‘A bi-level view of metaphor holds that metaphoric expressions witnessed in actual texts are just different realisations of productive underlying metaphors.’ For example, the metaphorical expression ‘I’m just wild about Harry’ references an underlying LOVE IS MADNESS metaphor (Lakoff and Johnson 1980: 49).
A useful resource to know about is the VU Amsterdam Metaphor Corpus (Steen et al. 2010). It contains about 190,000 words from the British National Corpus and has been annotated for a range of different metaphor types (indirect, direct and implicit metaphors).
A search facility is available online, allowing for general research into metaphors in British English, as well as enabling users to check if certain uses of language actually count as metaphors. Unfortunately though, there is not a simple way of carrying out a metaphor-based analysis on a corpus. Deignan (1999: 180) points out the difficulties of using corpus-based techniques to uncover metaphors: ‘there is no automatic way of discovering the linguistic realisations of any conceptual
metaphor, because a computer cannot tell the researcher anything about speaker meaning. Concordances will show the researcher words in their context, but he or she has to process this information.’ Until a linguistic metaphor database is widely available, it looks as if a bottom-up approach is the most likely option. Similarly, Charteris-Black (2004: 35–6) uses a qualitative, two-stage approach to corpus-based metaphor analysis. In stage one, the researcher carries out a close reading of a sample of texts in order to identify candidate metaphors. Words that have a tendency to be used in a metaphoric sense are noted as being ‘metaphor keywords’. In stage two, corpus contexts are examined to determine whether these keywords are metaphoric or literal. To demonstrate this technique, Charteris-Black carried out a close reading of texts arising from the attacks on the Twin Towers on 11 September 2001. He found that a salient phrase was George W. Bush’s vow to fight a ‘crusade against terror’, a term which could either be used in a literal or metaphorical sense. Charteris-Black then looked at the phrase crusade against in three reference corpora in order to see how it was commonly used. He found that strong collocates were words like corruption, slavery and communism, suggesting that crusade against was frequently used as a metaphor, i.e. a term from the domain of religious struggle being used in non-religious domains such as social reform. Charteris-Black (2004: 37) notes that ‘this methodology was used . . . because I believe it is only possible to develop software that searches for certain types of metaphor and not for metaphors that have become conventionalised’. A similar ‘close reading’ approach was taken by Koller (2004) in her analysis of gendered metaphors in business media discourse. She found that the discourse was most frequently conceptualised around traditionally masculine metaphors concerning war and fighting. Semino et al.
(2017) have also adopted a two-stage identification process, first by reading samples of their corpus to identify the semantic fields which contained metaphors, and then by using Wmatrix to identify other cases of words in those semantic fields, which could then be checked in order to locate additional cases of metaphors. Some attempts have been made to automatically derive metaphors from corpora. For example, Sardinha (2007) carried out a corpus-driven analysis of metaphors in a corpus of dissertations. He first derived a list of the top 100 lexical keywords and examined their collocates. He then ran the collocational pairs he found through a program called WordNet3 which compares word pairs and reports the amount of distance or similarity between them in terms of their meaning. He hypothesised that sentences
which contain metaphors will contain words which have dissimilar meanings. For example, if flood and refugees prove to be strong collocates of each other, this may suggest that a metaphorical construct is repeatedly occurring (unless there are literal cases of a flood causing people to become refugees). Therefore, strong collocates which were scored as being dissimilar in terms of meaning were examined in more detail, and indeed, Sardinha found that many of these did prove to be the result of metaphorical constructs. Automatic metaphor identification work has also been carried out within the area of natural language processing. For example, Neuman et al. (2013) developed algorithms which had 71 per cent precision at identifying metaphors in corpora of news articles, while Turney et al. (2011) based their work around a measure of the abstractness level of nouns in a sentence, their hypothesis being that if a noun in an adjective-noun phrase is relatively abstract, then the adjective is likely to be used in a concrete way to explain its meaning, so this is likely to be a metaphor. Their model had a reasonably impressive 79 per cent accuracy. Partington (2003) takes a more typically corpus-driven approach, first obtaining keywords, then looking at clusters that contain those keywords and identifying by concordancing which ones involve metaphors. Philip (2012) also identifies and categorises keywords, arguing that while keywords will tell us what a corpus is about and provide indications of topics and target domains that are likely to feature in metaphors, it is a group of words called low frequency content words (LFCWs) which are where metaphor vehicles and source domains are to be found. Because corpora will typically contain large numbers of LFCWs, it is recommended to pull together inflected forms under their respective lemmas, to reduce the amount of work needed. Then semantic groups can be identified. 
Those groups which do not appear to be related to the keyword groups are likely to be candidates for metaphors. To test this out, I created a small corpus of tweets by the broadcaster Laura Kuenssberg. Comparing the whole of this corpus to the BE06 corpus (one million words of general British English), I identified the top 100 keywords, and also all words which only occurred once in this corpus. I then ran these two sets of words through the semantic tagging tool Wmatrix – so each keyword was essentially assigned a semantic tag – and then I compared the frequent semantic categories which appeared in both lists. I found that there was considerable similarity between the two lists of semantic tags. For example, the most frequent semantic tag for both the keywords and the low
frequency words was Z99 (unmatched). Other semantic tags that occurred close to the top of both lists were Z5, Q2.2, Z1, M1, Z2, T1.3, G1.1, Q2.1 and F1. However, in the low frequency word list, the tag A1.1.1 appeared in sixth position (with forty words that appeared once only being tagged as this) whereas only one keyword was tagged as A1.1.1 (labour). The tag A1.1.1 refers to general actions, making, etc. When I looked at a concordance of words with this tag, I found a high number of metaphors (see Table 8.1 for a sample). While metaphor identification in a corpus is certainly possible, via a number of different techniques or combinations of techniques, it is unlikely that all cases of metaphors can be easily found without the time-consuming work of carefully reading the whole corpus. However, for most corpus research the aim is not to identify every case but to gain a sense of the most typical cases. The techniques that have been developed to date can at least more credibly claim to do this.
Table 8.1 Metaphors tagged as A1.1.1 in Laura Kuenssberg’s tweets
and was a huge political effort to [patch] it back together. This really matters
is politicians who are really [dug] in to the place they represent + shout
and Parliament spent years [tearing] itself apart over Brexit debates as the
Brexit deal in Wales WILL [tweak] the law after all . . . Downing St briefing
Initial research suggests it [spreads] faster, and has notified the WHO
and Martin Kenyon who [bumped] into our media colleagues at CNN in the
Negotiators will be given the [nudge] By PM/VDL to give any ground and
Mel Stride isn’t [mincing] his words – “With little over 24 hours
backbenchers are already pretty [hacked] off after v bumpy few months 4. But
the PM’s [closing] message PM’s announce as expected
some of the impact of [cuts] to govt aid PM was described as
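The comparison between the keyword list and the list of words occurring only once can be sketched in Python as below. This is a simplified illustration, not the procedure I ran in Wmatrix: the function names and the two thresholds are assumptions, and apart from labour, patch, tweak and nudge (which were tagged A1.1.1 in the analysis above), the tag assignments in the example are invented.

```python
from collections import Counter

def hapaxes(tokens):
    """Words that occur exactly once in the token list, in order."""
    counts = Counter(tokens)
    return [w for w in tokens if counts[w] == 1]

def tags_favoured_by_hapaxes(keyword_tags, hapax_tags,
                             min_count=2, max_keyword_count=1):
    """Tags that are frequent among hapaxes but rare among keywords.

    Both arguments map a word to its semantic tag. The thresholds are
    arbitrary and would need adjusting for a real corpus.
    """
    kw = Counter(keyword_tags.values())
    hp = Counter(hapax_tags.values())
    return sorted(
        tag for tag, n in hp.items()
        if n >= min_count and kw[tag] <= max_keyword_count
    )

# Toy data: most of these tag assignments are invented for illustration.
keyword_tags = {"brexit": "Z99", "covid": "Z99", "pm": "G1.1",
                "minister": "G1.1", "labour": "A1.1.1"}
hapax_tags = {"patch": "A1.1.1", "tweak": "A1.1.1", "nudge": "A1.1.1",
              "vdl": "Z99", "cnn": "Z99"}
candidates = tags_favoured_by_hapaxes(keyword_tags, hapax_tags)
```

Tags returned by such a comparison are only candidates: a concordance of the words carrying them still has to be read in order to decide which uses are metaphorical.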
A final point
No method of analysis can do everything, and it is sensible not to over-state the affordances of a corpus-based approach to discourse analysis. Sometimes we have to accept that the corpora, tools and procedures available to us are not up to the job of answering the research questions we have. Perhaps there could be sense in combining a corpus approach with another one, as a form of triangulation, but at times we may want to try a completely different form of analysis, rather than persisting with one that may give incomplete or
inaccurate results. For some of the topics discussed in this chapter, I would advise that the corpus approach can take us so far, but perhaps not as far as we would like to go. This may change in the future but, for example, I still find that complex data types that are interactional and combine words and images, like tweets, are difficult to do full justice to using corpus tools and techniques alone. Similarly, I am not sure that a corpus approach could cope with all the different facets of a corpus of video recordings of human interactions. From my own experience as a PhD supervisor and a researcher, I have been involved in projects that have worked with multi-modal data where the initial goal was to take these complexities into account. However, often it has been the case that by the end of the project, the bulk of the work has been done on the aspects of the data that the corpus tools could easily handle, so analysis of gesture or visuals was side-lined. Our analyses can only be as good as the tools allow, and it is hoped that software designers of the future are able to widen the scope of analysis much further than it currently stands. The suggestions in this chapter therefore offer glimpses of the potential of the corpus approach to go beyond the analysis of written text. We will see where the next couple of decades take us.
Further reading
Collins, L. (2019), Corpus Linguistics for Online Communication: A Guide for Research, London: Routledge. A practical guide to using corpus linguistics with texts like dating apps, online user comments and online learning platforms.
Dayrell, C., Ram-Prasad, C. and Griffith-Dickson, G. (2020), ‘Bringing corpus linguistics into religious studies: self-representation amongst various immigrant communities with religious identity’, Journal of Corpora and Discourse Studies 3: 96–121. This free-to-access journal article reports on a corpus analysis of seventy-three interviews and contains an interesting discussion of the extent to which a corpus approach can reveal new insights into this kind of data.
Lin, Y-L. (2017), ‘Co-occurrence of speech and gestures: a multimodal corpus approach to intercultural interaction’, Journal of Pragmatics 117: 155–67. This journal article examines a spoken corpus which has also been annotated for gesture, carrying out an analysis of the semantic categories of the speech that most commonly occurs with gestures.
Rüdiger, S. and Dayter, D. (eds) (2020), Corpus Approaches to Social Media, Amsterdam: John Benjamins. This book contains chapters relating to the collection and analysis of different kinds of social media texts including WhatsApp, Twitter, Reddit and Facebook.
Taylor, C. and Del Fante, D. (2020), ‘Comparing across languages in corpus and discourse analysis: some issues and approaches’, Meta 65(1): 29. DOI:10.7202/1073635ar. This paper considers the intersection of CADS with translation studies, presenting issues and solutions around comparing corpora of different languages.
Questions for students
1 Working with a friend, pick a word or phrase and carry out a search on Twitter to collect 100 tweets that contain the term. Then separately, consider use of images, animations and videos that appear in the tweets and devise a rough annotation scheme to capture a few aspects of these visuals so it could be used in a corpus analysis. How does your scheme differ to the one your friend created?
2 Below is an excerpt of transcribed speech from the Spoken BNC 2014.
S0597: ɝhey don’t don’t pick on me
S0596: why not?
S0597: cos I would --UNCLEARWORD
S0596: --UNCLEARWORD
S0598: ow (.) too loud
S0597: ɝthat’s up
S0598: oh for fuck’s sake my phone (. . .)
S0596: I don’t understand
S0598: it’s when you ask for (.) okay (.) that’s the thing with captions okay
S0596: ɝyeah but you have to say the caption
S0598: the caption is like erm
S0599: it’s just like a
S0598: ɝwhen someone refuses
S0597: ɝthis is light
S0598: no when someone refuses to give you the answers (.) like er don’t worry that’s it’s past it
S0599: ha ha ha ha
Codes include ɝ overlapping speech, (.) short pause, (. . .) long pause. Try rewriting this excerpt by devising a simple tagging system to indicate different speakers and paralinguistic features. Load the excerpt into a corpus tool and try a few concordance searches or build a word list. Do you get the results you expected? Do you have to alter any of the tool’s settings in order to obtain meaningful results?
3 Build a small corpus of news articles (no more than 100) on a single topic (e.g. relating to some aspect of identity, health, the environment or politics). Using a corpus tool like AntConc retrieve a list of the top ten collocates of one of the words related to the topic and read concordance lines containing these collocates. Then carry out a close reading of five articles using a sampling method (see the section on analysis of samples in this chapter). What findings did the two analytical approaches produce that were similar or different?
9 Conclusion
Introduction
This book has identified some of the most useful methodological techniques of corpus-based research (frequencies, dispersion, distribution, collocations, concordances, keywords, annotation) and shown how they can be effectively used in the analysis of discourse. I have also illustrated the ways that these techniques are useful, in terms of providing a framework for making sense of very large amounts of data, indicating the signposts that are likely to result in a productive analysis, helping to identify linguistic patterns that the human eye might otherwise miss, giving information about language norms as well as rare cases and reducing human bias (to a manageable extent).
What are the main points about language and discourse that our corpus-based analyses have revealed? First, corpus-based discourse analysis is not simply a quantitative procedure but one which involves a great deal of human choice at every stage: forming research questions, designing and building corpora, deciding which techniques to use, interpreting the results and framing explanations for them. This is not true just for discourse analysis, but all forms of analysis that use corpora. Second, corpus-based discourse analysis takes the researcher beyond simple lists of frequencies. Discourse analysis benefits from the inclusion of full texts in corpora (rather than samples), enabling the user to specify dispersion analyses, showing how the presence of lexis or other linguistic features develops across the course of a single text or set of texts. Third, attitudes, representations, arguments and consequently discourses are embedded in language via our cumulative, lifelong exposure to language patterns and choices: collocations, semantic and discourse prosodies. And finally, we are often unconscious of the patterns of language we encounter across our lifetime, but corpora are useful in identifying them: they emulate and reveal this cumulative exposure.
In this chapter I want to return to some of the issues that have been raised at various points throughout the book, particularly those relating to corpus building and corpus analysis. This is followed by some reflections on the impact of corpus-based discourse analysis and my hopes for what the future might hold for the field.
Corpus building
As discussed in the earlier chapters in the book, the design and availability of corpora are paramount to their analysis. Diachronically, language and society are constantly changing (at a somewhat accelerated rate in the last ten or so decades). Therefore, discourses are changing as well. The British National Corpus, containing data collected from 1992 and earlier, which I analysed in Chapter 5, was outdated almost from the time that it became publicly available, and although it continues to be used as a ‘benchmark’ for general contemporary English, it is really a ‘historical’ corpus. At the time of writing, a new spoken BNC with data collected in 2014 is available, while a written equivalent from the same time period has recently been released. However, those corpora will date quickly too. Some aspects of language use do not change as rapidly as others, and if we bear that in mind, corpora like the BNC may continue to function as a benchmark of British English for years to come. However, Burnard (2002: 68) points out: ‘It is a rather depressing thought that linguists of this century may continue to study the language of the nineties for as long as those of the preceding one were constrained to study that of the sixties.’ Ideally, something like the BNC should be built every ten years, and not just for British English. However, this is unlikely to occur, and while innovations in data collection and mark-up have made some aspects of corpus building easier, others, relating to copyright clearance and ethics, can provide obstacles to making such corpora available to all. Web-scraping software has enabled the creation of much larger reference corpora such as the English Web 2020 Corpus (a thirty-eight billion word corpus of internet English collected between 2019 and 2021), although these giants may have weaker claims of representativeness or balance.
At the opposite end of the scale are the smaller, carefully sampled reference corpora, such as the Brown family, which consist of one million words each. Adding new members to the Brown family of corpora is more viable, although ideally a reference corpus should be at least as large as a
target corpus, so the corpora in the Brown family have more limited applications. Blommaert (2005: 37) points out that a major problem of critical discourse analysis is its closure to a particular time frame: ‘the absence of a sense of history in CDA’. Every discourse, according to Blommaert (2005: 136) is a discourse on history (where we see references to a variety of historical time frames) and discourse from history (articulating a set of shifting positions in history). Using corpora of texts that were created decades or centuries ago will help researchers to explore the ways that language was once used, shedding light on the reasons behind current meanings, collocations and discourse prosodies of particular words, phrases or grammatical constructions. Additionally, historic or longitudinal corpora allow us to chart how discourses have been formed, contested and modified over time. We can track the appearance of new words and subsequently the creation of new ideas. Comparing a range of corpora from different historic time periods will give us a series of linguistic ‘snap-shots’ which will allow discourses to appear to come to life – in the same way that the photographic film process works. However, the process of locating historical texts and converting them into electronic form is not likely to be as straightforward as building a modern-day corpus. Ideals concerning representativeness may also need to be compromised in terms of what is available. Another of Blommaert’s criticisms of CDA is also relevant: ‘There is no reason to restrict critical analyses of discourse to highly integrated, Late Modern and post-industrial, densely semiotised First-World societies’ (Blommaert 2005: 35). I am ruefully aware that this book has utilised corpora (both general and specialised) that are made up of English texts. 
In part, this has been due to my own preoccupations, coupled with availability (or lack) of certain types of corpus data, and the fact that my ability to work with other languages is not great. However, as well as the need to continue creating up-to-date general corpora, it is also important that we do not neglect other societies (both past and present) which communicate in languages other than English. At the time of writing, the online corpus analysis tool Sketch Engine contains 529 corpora in ninety-five languages. The languages are not evenly distributed though – eighty-three of these corpora (15.7 per cent) are in English, and while European languages are reasonably well represented, other parts of the world, particularly south of the Equator, are not. It is also the case, as discussed earlier in the book, that there are more written corpora in existence than spoken corpora. Spoken data is more difficult to obtain, even in its raw audio-recorded state (particularly when collecting private conversations) and although automatic transcription
software is improving, it is still not perfect. However, spontaneous spoken data can be particularly useful in helping to identify how discourses are constructed and maintained at grass-roots level. Compared to written language, spoken data can be a more organic, unedited, untidy affair, full of inconsistencies and unconscious verbal tics (one reason why interviews and focus groups can yield good data). Techniques of accurately and rapidly transcribing spoken data are therefore required in order to enable the creation of larger, more up-to-date spoken corpora. And finally, another aspect of corpus building which is particularly relevant for discourse analysis is the fact that context is so important. Corpora that include both the ‘electronic text-only with annotations’ form and the original texts would be useful for making sense of individual texts within them. For example, in the case of newspaper or magazine articles it would be useful to make references back to the original page(s) so we could note aspects such as font size and style, colours, layout and visuals. With spoken texts, the inclusion of links back to digitised sound files would allow the reader to make more sense of a spoken utterance – how something is said often being as important as what is said. The act of reading a spoken transcript, no matter how well annotated, rarely gives the same impression as actually hearing it. It is hoped that in the coming years, there is more focus on research, not only in creating multi-modal corpora, but in determining what such corpora can usefully reveal in terms of questions about discourse.
Corpus analysis
First, it is important to note that a corpus-based analysis will not give a researcher a list of discourses around a subject. Instead, the analysis will point to patterns in language (both frequent and rare) which must then be interpreted in order to suggest the existence of discourses. Also, a corpus-based analysis can only show what is in the corpus. Although it may be a far-reaching analysis, it can never be exhaustive (but to be fair, this rider applies to most if not all forms of analysis). However, because corpora are so large, we may be tempted to think that our analysis has covered every potential discursive construction around a given subject. It is important that we do not over-state our claims, particularly because non-diachronic corpora offer snap-shots of language in use at best, as the situation regarding discourse is always likely to change again in the future. And frequency is not necessarily the same as importance when it comes to discourse.
Sometimes the most powerful discourses are so powerful that they are taken for granted and do not need to be referred to. They may appear so infrequently in a corpus that they do not emerge. Similarly, a topic may be so taboo in a particular culture that it is ‘unspeakable’ and never occurs. While the issue of over-generalising relates to all forms of data analysis, a second point of concern is perhaps more specific to corpus-based analysis. The techniques outlined in this book have tried to offer ways of reducing very large amounts of data into manageable portions, in order to make analysis a humanly possible task. However, even with this reduction, there still can be too much to make sense of. Lists of frequencies, keywords, collocates or concordance tables can run to many hundreds of lines and it may be tempting to focus on aspects of the analysis which help to support our hunches. Transparency in reporting research outcomes is therefore necessary: the reader should be able to challenge or fine-tune the researcher’s findings if they wish. Including word lists or keyword lists as appendices to research, for example, would be helpful in enabling readers to determine whether the researcher has focussed on one aspect of the data while backgrounding others. Then there is the problem of cut-off points. Why only look at the strongest twenty keywords when keyword number twenty-one reveals something relevant? Humans tend to like round numbers, 10, 50, etc., and our cut-off points often reflect this. However, such cut-off points are often a subjective aspect of the analysis (reflecting word-count limits on publications or the researcher’s own patience and endurance). More work needs to be carried out on determining the best compromise between saliency and quantity in terms of cut-off points. 
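The difference between a round-number cut-off and a stated statistical threshold can be illustrated with a minimal Python sketch. The words and keyness scores below are invented placeholders, not results from any corpus discussed in this book, and the function names are hypothetical. The default critical value of 15.13 is the chi-square value for p < 0.0001 with one degree of freedom, which is one commonly used option rather than a recommendation.

```python
# Invented placeholder scores (word -> log-likelihood keyness); not real results.
KEYNESS = {
    "word_a": 412.7, "word_b": 96.3, "word_c": 88.1, "word_d": 51.0,
    "word_e": 20.4, "word_f": 16.2, "word_g": 7.1, "word_h": 2.9,
}


def rank(scores):
    """Return (word, score) pairs ordered from strongest to weakest."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def top_n(scores, n):
    """Round-number cut-off: keep only the n strongest keywords."""
    return [word for word, _ in rank(scores)[:n]]


def above_threshold(scores, critical_value=15.13):
    """Threshold cut-off: keep every keyword whose score exceeds a stated value."""
    return [word for word, score in rank(scores) if score > critical_value]


if __name__ == "__main__":
    print("Top five:       ", top_n(KEYNESS, 5))
    print("Above threshold:", above_threshold(KEYNESS))
```

With these placeholder scores the top-five list omits word_f even though it exceeds the threshold. Neither rule is objectively correct, but stating which rule was used, and supplying the full list as an appendix, lets readers see what a cut-off excluded.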
A linked issue is that the wide variety of alternative statistical techniques available to the corpus user might mean that data can be subtly ‘massaged’ in order to reveal results that are interesting, controversial or simply confirm our suspicions. What if a collocation list derived via log-likelihood does not reveal much of note? Never mind, try again using mutual information. Again, selectivity can undermine claims to impartiality (which some social scientists have argued is a moot point in any case). While there will always be a variety of measures to take into account, one way of lessening the issue of data massage would be to pick the technique which you think works best with the corpus data and then stick to it throughout your particular piece of research. So if you use log-log with a range of −4 to +4 to carry out a collocational analysis of one word, you should use the same parameters for carrying out collocations of other words in the corpus.
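One practical way of holding parameters constant is to define them once and reuse them for every node word. The following Python sketch does this for a simple mutual information calculation. It is an illustration under stated assumptions, not the procedure of any particular corpus tool: the function name, the minimum-frequency filter and the MI formulation (observed co-occurrences divided by an expected value that takes the window size into account) are all choices made for the example.

```python
import math
from collections import Counter

# Set once, then reused unchanged for every node word in the study.
WINDOW = 4      # tokens either side of the node (a span of -4 to +4)
MIN_FREQ = 2    # assumed filter: ignore collocates seen fewer times than this


def collocates(tokens, node, window=WINDOW, min_freq=MIN_FREQ):
    """Return (collocate, MI score) pairs for `node`, strongest first."""
    n = len(tokens)
    freq = Counter(tokens)
    co_occurrences = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start, end = max(0, i - window), min(n, i + window + 1)
        for j in range(start, end):
            if j != i:
                co_occurrences[tokens[j]] += 1

    results = []
    for word, observed in co_occurrences.items():
        if observed < min_freq:
            continue
        # Expected co-occurrences if words were randomly distributed.
        expected = freq[node] * freq[word] * (2 * window) / n
        results.append((word, math.log2(observed / expected)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

Calling `collocates(tokens, "refugees")` and `collocates(tokens, "migrants")` without overriding the defaults guarantees that both analyses use the same span and the same statistic, which is the internal consistency argued for above.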
Fiddling with techniques looks sloppy at best and suspicious at worst, whereas sticking to the same measure at least offers internal consistency. And again, trying to explain the effect of your choices is always a good idea, e.g.: ‘I chose to use the mutual information statistic for carrying out collocations because it tends to favour low frequency lexical words which is a useful statistic for revealing sets of words which contribute towards a range of semantic or discourse prosodies.’ What else have our analyses revealed? That frequency is not always a perfect measure of a hegemonic discourse and that in a general corpus, we should perhaps not attribute equal weight to the variety of sources and range of author types – newspaper texts are likely to be more far-reaching than privately-spoken conversations. When using a general corpus, issues surrounding the varying types of production and reception for all of the texts within can become highly problematic. These concerns are something which corpus linguistics can only tangentially address; they move the focus away from a straightforward linguistic analysis towards the sort of research we would be more likely to find in anthropology, sociology, media and cultural studies. However, it is useful to show awareness of such issues, even if it is not possible to carry out wide-scale analyses of reception and production. How can we address this problem? One option could be to recognise that general corpora consist of a multitude of voices and therefore to use such data sparingly, instead carrying out the analysis of discourses on more specialised corpora, where issues of production and reception can be more easily articulated. I would not discount the use of general corpora altogether – the corpus perspective is one view of language, and while it may not be the only standpoint available to us, it is still an incredibly useful one.
A second solution could be to carry out a more detailed distribution analysis, cross-tabulating the different kinds of texts in a general corpus alongside the different discourses that are elicited. For example, after analysing discourses surrounding a particular subject, e.g. ‘refugees’, we may hypothetically find that tabloid newspapers tend towards what could be termed negative discourses, broadsheet newspapers attempt to present a more neutral stance, whereas the more private texts in the corpus (letters, diary entries, spoken conversations) contain the most extreme discourse stances of all (both negative and positive). Such an analysis would obviously take time to carry out, and in considering different genres in the corpus separately we may find that the cake has been sliced too thinly, meaning that
there is not enough data in certain categories to warrant saying anything of merit. A third possibility could simply be to argue from a perspective that society is inter-connected and all texts influence each other. The view that political, religious, media and business leaders have sole access to the powerful discourses is not the case. If such discourses are so powerful, we would expect them to occur across a range of text types and genres. Therefore, discourses encountered in private conversations are likely to be fuelled by discourses in newspapers or religious speeches. The discourses we examine in texts intended for a very small audience may therefore be reflective of larger sources. They also may be more ‘honest’ stances – people are often less likely to be careful or hedge their opinions when writing or talking to a small informal audience (or no audience in the case of diary entries). And while these discourses would trickle-down – say from the media to personal conversations, the reverse would also be true – public language is often inspired by the personal and private. Returning to a concern that I raised in Chapter 1, corpora give somewhat decontextualised information about written or spoken language; the average corpus is unlikely to reveal anything about, say, the relationship between an image and the writing surrounding it, or the way that a hand gesture can show that a speaker is being serious or joking. Annotation systems may be developed in order to deal with such issues, but on the whole, corpora are collections of texts (and where they are annotated, it tends to be at the grammatical, morphological or semantic level). Hypotheses derived from corpora, therefore, should be tested further, via close examination of single texts in their original form. Sometimes, it really is necessary to go back to the original text, as in Chapter 4 when I analysed holiday leaflets. 
The corpus analysis pointed out the presence or absence of certain discourses of tourism, but examining the text in relationship to the visual images gave a much clearer picture of what was actually happening (the pictures of people enjoying drinks in nightclubs reinforced the high frequencies of bar and club, while explaining why there were very few references to actual drinking in the written text – they occurred in the visuals instead). In addition, in Chapter 7 the dispersion plot of the cluster there is cruelty in revealed that it occurred repeatedly in a small section of the fox hunting corpus, directing my attention to look in more detail at one particular speech. Therefore, corpus analysis is useful for telling us what is ‘normal’ or unusual in a text population – it can tell us where we should dig, but the spadework is still going to be a human endeavour.
Additionally, a corpus can only reveal its own contents to us. It does not tell us much about the world outside. For example, many of the keywords produced in the examination of parliamentary debates in Chapter 7 were words which were stylistically pertinent to the genre. We would therefore need to know more about the types of explicit and implicit language restrictions, conventions and norms that occur in parliamentary language in order to explain that this was the case. We may be able to deduce some of this from a closer examination of certain keywords or linguistic patterns, but the corpus cannot explicitly tell us this itself. Further research into the history of the British Parliament and its relationship to law, British society and other ways of producing discourses would also be essential for a fuller analysis of the discourses of fox hunting in Britain. The parliamentary debates also only revealed discourses of fox hunting that Members of Parliament felt were relevant or suitable to bring up in that particular context. So one discourse that was not drawn on very often in the parliamentary debates concerned the belief that the topic of fox hunting reflected a significant division between upper-class people and the rest of the UK. The fox hunting debate, then, has been constructed elsewhere as ‘really being about social class’ and not fox hunting at all. However, the analysis of keywords in the parliamentary debate did not bring up any words that referred to this social class discourse of fox hunting. Carrying out a specific search in the corpus (e.g. on the word class), it becomes clear that both sides do refer to the class discourse, although only a very small number of times. The word class occurs ten times across the debate and in six of these cases (three on each side of the debate) it is used to reference the social class discourse. 
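Why a word that is split evenly between two sides fails to surface as a keyword can be seen in a small Python sketch of a log-likelihood keyness calculation. This is a generic two-corpus formulation with a sign added to mark direction; the function name and all of the counts in the usage note are hypothetical and are not the figures from the fox hunting study.

```python
import math


def signed_log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of a word in corpus A relative to corpus B.

    A positive value means the word is relatively more frequent in A;
    a negative value means it is relatively less frequent in A
    (a candidate 'negative' keyword).
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)

    ll = 0.0
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    ll *= 2

    more_frequent_in_a = freq_a / size_a >= freq_b / size_b
    return ll if more_frequent_in_a else -ll
```

With hypothetical counts, `signed_log_likelihood(5, 50_000, 5, 50_000)` gives a score of zero, so a word used equally often by both sides cannot be key for either. By contrast, `signed_log_likelihood(10, 100_000, 200, 100_000)` gives a large negative score, which is the pattern a comparison against a differently sourced corpus on the same topic might produce.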
Therefore, the word class was not shown up as key because it did not occur significantly more often on one side of the debate when compared to the other, and it did not occur particularly often when the whole debate was compared to the FLOB Corpus of general British English. One way of revealing this relative backgrounding of the class discourse in the parliamentary debates would be to compare it to a corpus on fox hunting taken from different sources, e.g. radio or television debates or internet bulletin board discussions. Such a comparison may have revealed class to be a ‘negative’ keyword in the parliamentary debates, allowing us to focus on it and the reasons why it occurred so infrequently. Another issue concerns the corpus linguist’s love of comparison, difference and categories. If we compare text a and text b, we are likely to focus on what is different between them. The human fascination with difference is perhaps another cognitive bias, a means of helping us to make
sense of the world. However, if we begin an analysis by assuming that two things are different we may overlook the fact that they actually may be similar in more ways than they differ. This, of course, should not preclude the analysis, but it is worth bearing in mind. So when comparing the keywords used by pro- and anti-fox-hunting debaters in the House of Commons, it was also worth carrying out another comparison which put the whole debate against a corpus of general English. In doing this, it was found that in many ways the debaters shared a lot in terms of the way they spoke and what they spoke about. Some of these linguistic phenomena were due to the fact that the debate was about fox hunting and it is difficult to argue for or against the subject without mentioning fox hunting, while other words used by both sides related to the genre or context of the debate – words connected to Parliament. However, surprisingly, that analysis revealed that the word cruelty was used significantly frequently on both sides of the debate (although in different ways). Identifying the ways that differences and similarities interact with each other is therefore an essential part of any comparative corpus-based study of discourse. And a corpus-based analysis is only as effective as the analytical techniques that are available to us. Using tools like WordSmith and CQPweb I was able to conduct fast analyses of frequencies, keywords and collocations. If I wanted to compare grammatical and semantic categories there are automatic taggers which can attach this kind of information throughout the corpus. However, in order to explore more complex linguistic phenomena such as metaphor and attribution I would have to engage with the corpus by employing less mechanistic procedures. There are a set of related problems here: the fact that when we investigate discourses of a given subject in a corpus, it is easier to focus on direct references (e.g. 
we look for the word refugees) rather than patterns around anaphora (words like them or those). Also, presence tends to take precedence over absence in a corpus, because we often may not know what is missing. This can have consequences for research which examines lexical choice (some terms may simply be missing from a corpus or text) or agency (various actors may be excluded or backgrounded, cf. van Leeuwen 1996). And consequently, certain discourses may not be present. The key point here is that some forms of analysis appear to be more ‘convenient’ than others. The development of corpus tools and techniques which allow for more sophisticated automatic analyses is therefore another area where further work could gainfully be carried out. Finally, I return to the problem of interpretation brought up in Chapter 1. A corpus-based analysis of discourse provides the researcher with the patterns
and trends in language (from the subtle to the gross). People are not computers though, and their ways of interacting with texts are very different, both from computers and from each other. Corpus-based discourse analysis should therefore play an important role in terms of removing bias, testing hypotheses, identifying norms and outliers and raising new research questions. It should not replace other forms of close human analysis but act in tandem with them. The corpus is therefore an extremely useful instrument to add to the workbox of techniques available to discourse analysts. But it should not mean that we can now throw away all of our existing tools.
The impact of corpus-based research
When I wrote the first edition of this book I had little experience of impact-related research. There was a sense that the work had the potential for impact, but the challenge was in getting others outside the field (and outside academia for that matter) to see this. Within academia (at least in the British context) the last decade or so has seen a stronger imperative for social science research to demonstrate impact and contribute to positive change outside academia. Grant-awarding bodies are increasingly likely to fund research where academics work with organisations who want to know the answers to the research questions being set and are willing to get involved, either by providing data, their own questions or participating in the research process in some other way. Since then I have worked with a range of partners including the National Health Service, the Home Office, the Metropolitan Police, the Muslim Council of Britain, Mermaids and Obesity UK. It is certainly the case that with the increasing availability and amount of data (especially in online contexts), a diverse set of organisations urgently require ways to make sense of it. Some organisations, like Mermaids, wish to monitor the language of the media. Others, like the National Health Service, have collected large amounts of patient feedback and want to identify how this data can be used to improve services, while the Home Office needs to be able to identify texts that advocate harmful practices so they can be countered. An advantage of working with non-academic partners is that they can often provide data that it would be difficult to obtain otherwise, and working with such data can result in new methodological challenges and insights. Such work is also highly motivating, knowing that your findings will be put to good use in some way.
What I as a corpus linguist might see as interesting in these datasets is not necessarily what my partner-stakeholders are interested in. When I worked
with the NHS, I was provided with a list of questions relating to the sorts of elements that patients evaluated positively or negatively. However, as I worked with the data I became interested in the reasons that such evaluations were made and the ways that patients used language to strengthen or justify their views, arguing that it is useful to know about such techniques in order to guard against certain types of feedback being taken more seriously than others. As a result, I have often found that when I work with partners, the work can simultaneously consist of two strands, one which answers questions that have been given to me, the other which involves a set of additional questions which emerge as I work with the corpus. Often there is overlap and the two approaches can benefit from one another. An important aspect of corpus-based impact-related work is in communicating the approach, its philosophy and techniques, without losing the interest of the non-academic partner. Be sensitive during meetings with non-academic partners: if they start to look confused or bored, take care to give as much detail as is needed, in a clearly comprehensible way and using engaging examples as much as possible. Corpus linguistics combines expertise from computing, statistics and linguistics so it comes with an array of technical language. This needs to be reined in when talking to non-academic partners as much as possible, especially at the start. And it is also important to manage expectations. Some partners might misunderstand corpus approaches as being simply a case of loading data into a tool, pressing a button, then sitting back and letting the computer provide all the answers. As this book has tried to show, this is not the case – human decision making and endeavour are required at almost every stage – the tools save us time and tell us where to look, but we still have to do the looking ourselves, as well as account for the findings by relating them to a wider context.
With that said, we have a bit of an uphill struggle to convince others of the worth of our approach, not helped by the fact that the label corpus linguistics is non-intuitive. Speaking from personal experience, this is definitely an area where other approaches with more transparent-sounding names like topic modelling and opinion mining can appear more attractive. And sometimes it can be difficult to convince potential partners of the worth of the corpus approach, particularly because academics are not always great at selling themselves. In more than one situation I have seen a partner choose to work with a different set of people, who turned up to give a presentation wearing expensive suits and touting equally expensive software packages which promised that they would provide all the answers. The line I take, that the software is inexpensive, requires human input to get the best from it, and
cannot do everything, while being honest, will not always convince potential partners to choose us. Additionally, when presenting results to partners, it is important not to require them to read through pages of dense methodological description or to interpret endless tables. I usually try to provide a short (less than one page) summary of the most important findings at the start of a report, and then follow on with a more detailed account for those who want to read it. Visuals which summarise the main results can be extremely important in terms of providing a simple, eye-catching headline. The findings can also be incorporated into journal articles or books as required, bearing in mind that writing for a non-academic audience requires unlearning some of the skills we have honed for our academic writing. In an early attempt at impact, I recall putting out a press release detailing the findings of a study on news reporting of immigration. In hindsight, the language used was far too technical and dry, and it was not surprising that the story did not receive much interest.
To the future
When I wrote the first edition of this book, corpus-based discourse analysis was only practised by a handful of scholars, often working separately, and had yet to become a field. The situation is markedly different now. In Chapter 1 I described how edited collections, a journal and a bi-annual conference indicate the establishment of the corpus approach to discourse in its own right. In 2016, my research centre CASS (Corpus Approaches to Social Science) won the Queen’s Anniversary Prize for Higher and Further Education for its work in computer analysis of world languages in print, speech and online. It was a proud moment, not just for the centre, but for everyone who has worked with this methodological approach. It is a field which is in continuous development. I have mentioned elsewhere in this book that new techniques and tools are being created, promising more accurate and far-reaching ways of analysing discourse. This is exciting although it raises the issue of fragmentation of the field into competing approaches and corner-defending, which is not always productive. I hope that it is a field that continues to welcome new participants without placing too many restrictions on entry. There should be as much space for people who want to use established methods on a well-known tool as those who want to engage with the newest pieces of software and statistical techniques.
I also hope to see more engagement with other areas of the social sciences and humanities – there are still numerous under-exploited forms of textual analysis that can benefit from the corpus approach to discourse, and I hope to see the field make further inroads beyond the UK and the English language. As the world becomes more literate and interactions occur more often in online contexts, our approach has never been so well-placed to examine the huge amounts of language data that are produced every day. At the same time, there is more potential than ever for language to be used to reach very large numbers of people, to manipulate their emotions and beliefs and affect their behaviours, not always with positive outcomes. Our approach is at its best when it maintains a balance between human and computational analyses. And there is a great deal of work to be done.
Further reading
McIntyre, D. and Price, H. (eds) (2018), Applying Linguistics. Language and the Impact Agenda, London: Routledge. This edited collection contains thoughtful insights and advice on impact-related research.
Notes
Chapter 1
1 In fact, in the 100-million-word British National Corpus (1994), gay man appears seventeen times, homosexual man occurs six times and heterosexual man appears once. Straight man appears twenty times, of which only two occurrences refer to sexuality (the others mainly refer to the ‘straight man’ of a comedy duo). Man (without these sexuality markers) occurs 58,834 times.
2 For example, the Helsinki Corpus of English Texts: Diachronic Part consists of 400 samples of texts covering the period from 750 to 1700 (Kytö and Rissanen 1992).
3 The Lancaster-Oslo/Bergen (LOB) and the British English 2021 (BE21) corpora respectively.
4 The reasons why these changes in uses of blind over time have appeared are another matter. Perhaps the more negative idiomatic metaphoric uses of blind have always existed in spoken conversation but were censored in written texts because editors required authors to use language more formally. What is interesting though, is that there has been a shift in written discourse which has resulted in blind being conceptualised in a very different way over a sixty-year period.

Chapter 3
1 There are some types of corpora that will not be discussed in this book as they are not essentially relevant for discourse analysis: e.g. parallel/aligned corpora, learner corpora, dialect corpora – instead I refer the reader to McEnery and Wilson (1996), Kennedy (1998) and Hunston (2002) for fuller descriptions.
2 As Burnard (2002: 64) points out, the BNC for example, which was built in the 1990s, only contains two references to the World Wide Web: ‘the BNC is definitely no longer an accurate reflection of the English language’.
3 http://www.httrack.com/
4 https://twitter.com/en/tos
5 https://www.facebook.com/legal/terms
6 https://help.instagram.com/581066165581870
7 https://www.mumsnet.com/i/terms-of-use
8 Thanks to Luke Collins for this example.
9 http://ucrel.lancs.ac.uk/vard/about/
10 See http://ucrel.lancs.ac.uk/claws5tags.html for the list of grammatical codes used.

Chapter 4
1 Exceptions could include extremely restricted forms of language which do not adhere to usual grammatical rules, such as shopping lists.
2 The social classifications in the BNC are based on occupation and are as follows. AB: higher and intermediate managerial, administrative and professional. C1: Supervisory, clerical and junior management, administrative and professional. C2: Skilled manual. DE: Semi-skilled and unskilled manual, casual labourers, state pensioners and the unemployed.

Chapter 5
1 Thanks to Daniel van Olmen for this technique.
2 Research and Statistics Department (1995) Home Office Statistical Bulletin. https://webarchive.nationalarchives.gov.uk/20110218145201/http://rds.homeoffice.gov.uk/rds/pdfs2/hosb1595.pdf

Chapter 7
1 http://news.bbc.co.uk/1/hi/uk/449139.stm

Chapter 8
1 Specifically, this involves taking the logarithm of the total number of key tokens in each file and dividing it by the logarithm of all of the tokens in the respective file (it can be done fairly easily by exporting the data into Excel).
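The calculation described in the Chapter 8 note can also be done outside a spreadsheet. The following Python sketch is one possible rendering; the function name and the input checks are additions made for the example. Because the measure is a ratio of two logarithms, the base of the logarithm does not affect the result.

```python
import math


def log_key_ratio(key_tokens, total_tokens):
    """Log of the number of key tokens in a file divided by the log of all its tokens."""
    if key_tokens < 1:
        raise ValueError("key_tokens must be at least 1 (log of zero is undefined)")
    if total_tokens < 2:
        raise ValueError("total_tokens must be at least 2 (log of 1 is zero)")
    return math.log(key_tokens) / math.log(total_tokens)
```

For instance, a hypothetical file of 10,000 tokens containing 100 key tokens would be scored with `log_key_ratio(100, 10_000)`.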
Glossary
annotation: the practice of assigning additional information to aspects of a text, often by using tags.
cluster: a fixed sequence of words which can be counted as a distinct linguistic item, e.g. the end of.
Coefficient of Variance: a measure of the relative dispersion of a set of data points (e.g. frequencies) which can, for example, be used to identify the amount of change in a linguistic feature over time.
collocate: a word which frequently appears next to or close to another word in a corpus, usually more often than would be expected if the words all appeared in random order.
collocational network: a diagram showing relationships between multiple collocates.
concordance: a table showing all of the occurrences of a word in its immediate context.
concordance plot: a diagram which shows all of the occurrences of a word across the texts in a corpus.
corpus: an electronically encoded collection of texts which have usually been sampled in order to be representative of a particular text type, genre or register.
corpus-assisted discourse studies: a form of linguistic analysis which uses techniques from corpus linguistics in order to answer questions relating to areas of discourse.
corpus based: a form of corpus analysis which aims to examine pre-existing hypotheses about language, using a corpus as a source of information.
corpus driven: a form of corpus analysis which does not begin with particular hypotheses but allows techniques from corpus linguistics to drive the analysis, e.g. by accounting for keywords.
corpus linguistics: a form of linguistic analysis which uses specialist computer software with one or more corpora in order to help human analysts to make sense of linguistic patterns and trends.
critical discourse analysis: a form of analysis which is focussed on identifying the ways that unequal power relationships are discursively embedded in texts and how such relationships are enabled or challenged by wider social structures.
discourse: 1. language as it occurs in context. 2. a way of making sense of the world, often using language in repeated representations, narratives and arguments.
discourse analysis: a range of techniques for identifying and understanding aspects of discourse in real-life uses of language (usually in texts).
discourse prosody: the tendency for a word to co-occur with a set of words or phrases which all suggest a similar evaluation.
dispersion: the positions and extent of spread that a linguistic item is found within the texts in a corpus.
distribution: the extent to which a linguistic item occurs across different texts or text types in a corpus.
frequency: the number of times a linguistic item occurs in a corpus. The raw frequency is the actual number whereas the standardised frequency would be given as a proportion, e.g. the number of occurrences per million words.
frequency list: a list of all of the words (or other linguistic items) in a corpus, along with their frequencies. The list is usually presented in order of frequency or alphabetically.
header: a set of codes or tags, occurring at the start of a file in a corpus or linked to a file in some way, which provides information about the text (e.g. date of publication, author).
keyword: a word which occurs relatively more often in one corpus when compared against a second corpus (which often acts as a benchmark for typical frequencies in language). The frequency difference is usually large and/or unlikely to be the result of chance.
legitimation strategy: a way of making an attitude, discourse or representation appear reasonable, normal or right, often through linguistic means.
lockword: a word which occurs with very similar relative frequencies in two corpora.
part of speech: a category of word which has similar grammatical properties (e.g. noun, verb, adjective).
reference corpus: a type of corpus which is usually very large, often contains a wide range of text types, and acts as a standard reference regarding typical uses and frequencies of linguistic items.
representation: ways of using language to describe and evaluate something, e.g. a social actor.
search term: a word, token, cluster, tag or combination of these, which can be entered into a search box in a corpus tool to produce a concordance and frequency information.
semantic preference: the tendency for a particular word to co-occur with other words or phrases that all have a similar meaning.
tagging: the practice of annotation – in corpus linguistics this is often carried out with computer software (e.g. to assign part of speech tags or semantic tags to words).
thinning: obtaining a smaller sample of concordance data, sometimes taking cases at random.
token: a sequence of characters in a document that have been grouped together to create a meaningful semantic unit. Tokens are often the same as words but some corpus software might define parts of words like ’s, as a token, or split up words with hyphens into separate tokens.
type: a distinct token. There might be eighty-five cases of the token cat in a corpus, but this counts as a single type.
type/token ratio: the number of types in a corpus divided by the total number of tokens. This measure can be used to identify the lexical complexity of a text (a standardised measure is often used when working with a corpus).
wildcard: a character in a search term (such as * or ?) which equals ‘any character’ or ‘zero or more characters’.
Word Sketch: a set of collocates of a word that have been automatically grouped according to their grammatical relationships.
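Three of the measures defined in this glossary (standardised frequency, type/token ratio and the Coefficient of Variance) are simple enough to sketch in a few lines of Python. The function names are hypothetical, and the Coefficient of Variance is computed here as the population standard deviation expressed as a percentage of the mean, which is an assumption about the exact formulation; corpus tools may use the sample standard deviation or report a proportion instead.

```python
import statistics


def per_million(raw_frequency, corpus_size):
    """Standardised frequency: occurrences per million words."""
    return raw_frequency * 1_000_000 / corpus_size


def type_token_ratio(tokens):
    """Number of distinct tokens (types) divided by the total number of tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)


def coefficient_of_variance(values):
    """Standard deviation as a percentage of the mean (population SD assumed)."""
    return statistics.pstdev(values) / statistics.mean(values) * 100
```

As a usage example, `per_million(58_834, 100_000_000)` standardises the frequency of man in the 100-million-word British National Corpus given in the Notes, and `coefficient_of_variance` could be applied to the standardised frequencies of a word across several time-period corpora to gauge how much its use has changed.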
References
Anderson, G. (2010), ‘How to use corpus linguistics in sociolinguistics’, in A. O’Keeffe and M. McCarthy (eds), The Routledge Handbook of Corpus Linguistics. Abingdon: Routledge, 547–62. Anthony, L. (2018), ‘Visualization in Corpus-Based Discourse Studies’, in C. Taylor and A. Marchi (eds), Corpus Approaches to Discourse: A Critical Review. Abingdon: Routledge, 197–224. Anthony, L. and Baker, P. (2015), ‘ProtAnt: A tool for analysing the prototypicality of texts.’ International Journal of Corpus Linguistics 20(3): 273–92. Archer, D. (ed.) (2009), What’s in a Word-list? Investigating Word Frequency and Keyword Extraction. London: Routledge. Bakhtin, M. (1984), Problems of Dostoevsky’s Poetics. Minneapolis: University of Minnesota Press. First published 1929. Baker, P. (2005), Public Discourses of Gay Men. London: Routledge. Baker, P. (2006), Using Corpora to Analyse Discourse. London: Continuum. Baker, P. (2011), ‘Times may change but we’ll always have money: a corpus driven examination of vocabulary change in four diachronic corpora.’ Journal of English Linguistics 39: 65–88. Baker, P. (2013), Using Corpora to Analyse Gender. London: Bloomsbury. Baker, P. (2015), ‘Does Britain need any more foreign doctors? Inter-analyst consistency and corpus-assisted (critical) discourse analysis’, in N. Groom, M. Charles and J. Suganthi (eds), Corpora, Grammar and Discourse: In Honour of Susan Hunston. Amsterdam/Atlanta: John Benjamins, 283–300. Baker, P. (2016), ‘The shapes of collocation.’ International Journal of Corpus Linguistics 21(2): 139–64. Baker, P. (2017), American and British English. Divided by a Common Language? Cambridge: Cambridge University Press. Baker, P. (2019), ‘Analysing representation of obesity in the Daily Mail via corpus and down-sampling methods’, in J. Egbert and P. Baker (eds), Using Corpus Methods to Triangulate Linguistic Analysis. London: Routledge, 85–108. Baker, P. and Collins, L.
(2023), ‘Creating and analysing a multimodal corpus of news texts with Google Cloud Vision’s automatic image tagger.’ Applied Corpus Linguistics 3(1). https://doi.org/10.1016/j.acorp.2023.100043. Baker, P. and Egbert, J. (eds) (2016), Triangulating Methodological Approaches in Corpus-Linguistic Research. London: Routledge.
Baker, P. and Levon, E. (2015), ‘Picking the right cherries?: a comparison of corpus-based and qualitative analyses of news articles about masculinity.’ Discourse and Communication 9(2): 221–336. Baker, P. and McEnery, T. (2005), ‘A corpus-based approach to discourses of refugees and asylum seekers in UN and newspaper texts.’ Language and Politics 4(2): 197–226. Baker, P. and McEnery, T. (eds) (2015), Corpora and Discourse: Integrating Discourse and Corpora. London: Palgrave. Baker, P. and McGlashan, M. (2020), ‘Critical Discourse Analysis’, in S. Adolphs and D. Knight (eds), The Routledge Handbook of English Language and the Digital Humanities. London: Routledge, 220–41. Baker, P. and Vessey, R. (2018), ‘A corpus-driven comparison of English and French Islamist extremist texts.’ International Journal of Corpus Linguistics 23(3): 255–78. Baker, P., Brookes, G. and Evans, C. (2019), The Language of Patient Feedback: A Corpus Linguistic Study of Online Health Communication. London: Routledge. Baker, P., Gabrielatos, C. and McEnery, T. (2013), Discourse Analysis and Media Attitudes: The Representation of Islam in the British Press. Cambridge: Cambridge University Press. Baker, P., Vessey, R. and McEnery, T. (2021), The Language of Violent Jihad. Cambridge: Cambridge University Press. Baker, P., Gabrielatos, C., Khosravinik, M., Krzyzanowski, M., McEnery, T. and Wodak, R. (2008), ‘A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press.’ Discourse and Society 19(3): 273–306. Baldry, A. (2000), Multimodality and Multimediality in the Distance Learning Age. Campobasso: Palladino. Baxter, J. (2003), Positioning Gender in Discourse: A Feminist Methodology. Basingstoke: Palgrave Macmillan. Becker, H. (1972), ‘Whose side are we on?’, in J. D. Douglas (ed.), The Relevance of Sociology. New York: Appleton-Century-Crofts. Bednarek, M. 
(2015), ‘Corpus-assisted multimodal discourse analysis of television and film narratives’, in P. Baker and T. McEnery (eds), Corpora and Discourse: Integrating Discourse and Corpora. London: Palgrave, 63–87. Belica, C. (1996), ‘Analysis of temporal changes in corpora.’ International Journal of Corpus Linguistics 1(1): 61–73. Berry-Rogghe, G. L. M. (1973), ‘The computation of collocations and their relevance in lexical studies’, In A. J. Aitken, R. Bailey and N. Hamilton-Smith (eds), The Computer and Literary Studies. Edinburgh: Edinburgh University Press, 103–12. Bevitori, C. and Johnson, J. (2017), ‘Human mobility and climate change at the crossroad: a diachronic Corpus-Assisted Discourse Analysis of the nexus in
UK and US newspaper discourse.’ Anglistica Aion, An Interdisciplinary Journal 21: 1–19. Bhaskar, R. (1989), Reclaiming Reality. London: Verso. Biber, D. (1988), Variation in Speech and Writing. Cambridge: Cambridge University Press. Biber, D., Conrad, S. and Reppen, R. (1998), Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press. Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999), Longman Grammar of Spoken and Written English. London: Longman. Billig, M. (2013), Learn to Write Badly: How to Succeed in the Social Sciences. Cambridge: Cambridge University Press. Blei, D. and Lafferty, J. (2007), ‘A correlated topic model of Science.’ Annals of Applied Statistics 1(1): 17–35. Blommaert, J. (2005), Discourse. Cambridge: Cambridge University Press. Borsley, R. D. and Ingham, R. (2002), ‘Grow your own linguistics? On some applied linguistics’ views of the subject.’ Lingua Franca 112: 1–6. Breeze, R. (2011), ‘Critical Discourse Analysis and its critics.’ Pragmatics 21(4): 493–525. Brindle, A. (2018), The Language of Hate: A Corpus Linguistic Analysis of White Supremacist Language. London: Routledge. Brookes, G. and Baker, P. (2021), Obesity in the British Press. Cambridge: Cambridge University Press. Brown, G. and Yule, G. (1983), Discourse Analysis. Cambridge: Cambridge University Press. Burnard, L. (2002), ‘Where did we go wrong? A retrospective look at the British National Corpus,’ in B. Kettemann and G. Marko (eds), Teaching and Learning by Doing Corpus Analysis. Amsterdam: Rodopi, 51–70. Burr, V. (1995), An Introduction to Social Constructionism. London: Routledge. Caldas-Coulthard, C. (1993), ‘From Discourse Analysis to Critical Discourse Analysis: The differential re-representation of women and men speaking in written news’, in J. Sinclair, M. Hoey and G. Fox (eds), Techniques of Description – Spoken and Written Discourse. London: Routledge, 196–208. Caldas-Coulthard, C. 
(1995), ‘Man in the news: the misrepresentation of women speaking in news as narrative discourse’, in S. Mills (ed.), Language and Gender: Interdisciplinary Perspectives. London: Longman, 226–39. Caldas-Coulthard, C. R. and van Leeuwen, T. (2002), ‘Stunning, shimmering, iridescent: Toys as the representation of gendered social actors’, in L. Litosseliti and J. Sunderland (eds), Gender Identity and Discourse Analysis. Amsterdam: John Benjamins, 91–108. Cameron, D. (1998), ‘Dreaming the dictionary: Keywords and corpus linguistics.’ Keywords 1: 35–46. Cameron, D. (2001), Working with Spoken Discourse. London: Sage.
Candelas de la Ossa (2019), ‘Exceptionalising intersectionality: a corpus study of implied readership in guidance for survivors of domestic abuse.’ Gender and Language 13(2): 224–50. Charteris-Black, J. (2004), Corpus Approaches to Critical Metaphor Analysis. Basingstoke: Palgrave Macmillan. Chilton, P. (2004), Analysing Political Discourse: Theory and Practice. London: Routledge. Chilton, P. and Lakoff, G. (1995), ‘Foreign policy by metaphor’, in C. Schäffner and A. L. Wenden (eds), Language and Peace. Aldershot, UK, and Brookfield, VA: Dartmouth Publishing Company Limited, 37–59. Chomsky, N. (1957), Syntactic Structures. The Hague: Mouton. Clarke, I., McEnery, T. and Brookes, G. (2021), ‘Multiple Correspondence Analysis, newspaper discourse and subregister. A case study of discourses of Islam in the British press.’ Register Studies 3(1): 144–71. Clear, J., Fox, G., Francis, G., Krishnamurthy, R. and Moon, R. (1996), ‘Cobuild: the state of the art’. International Journal of Corpus Linguistics 1: 303–14. Collins, C. (2020), ‘Working with images and emoji in the 🦆 Dukki Facebook corpus’, in S. Rüdiger and D. Dayter (eds) (2020), Corpus Approaches to Social Media. Amsterdam: John Benjamins, 175–96. Collins, L. (2019), Corpus Linguistics for Online Communication: A Guide for Research. London: Routledge. Cotterill, J. (2001), ‘Domestic discord, rocky relationships: semantic prosodies in representations of marital violence in the O.J. Simpson trial.’ Discourse and Society 12:3, 291–312. Crenshaw, K. (1989), ‘Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics.’ University of Chicago Legal Forum 1: 139–67. Crystal, D. (1995), Cambridge Encyclopedia of the English Language. Cambridge: Cambridge University Press. Danet, B. (1980), ‘ “Baby” or “fetus”: language and the construction of reality in a manslaughter trial.’ Semiotica 32(1/2): 187–219. Dayrell, C., Ram-Prasad, C. 
and Griffith-Dickson, G. (2020), ‘Bringing corpus linguistics into religious studies: self-representation amongst various immigrant communities with religious identity.’ Journal of Corpora and Discourse Studies 3: 96–121. Deignan, A. (1999), ‘Corpus-based research into metaphor’, in Lynne Cameron and Graham Low (eds), Researching and Applying Metaphor. Cambridge: Cambridge University Press. Denzin, N. (1988), ‘Qualitative analysis for social scientists.’ Contemporary Sociology 17(3): 430–2.
Department for Environment, Food and Rural Affairs (2000), The Final Report of the Committee of Inquiry into Hunting with Dogs in England and Wales. Norwich: The Stationery Office. Derrida, J. (1978), Writing and Difference. Chicago: University of Chicago Press. Derrida, J. (1981), Dissemination. Chicago: University of Chicago Press. Downing, J. (1980), The Media Machine. London: Pluto. Duguid, A. and Partington, A. (2018), ‘Absence: you don’t know what you’re missing. Or do you?’, in C. Taylor and A. Marchi (eds), Corpus Approaches to Discourse: A Critical Review. London: Routledge, 38–59. Durrant, P. and Doherty, A. (2010), ‘Are high frequency collocations psychologically real? Investigating the thesis of collocational priming.’ Corpus Linguistics and Linguistic Theory 6(2): 125–55. Dunning, T. (1993), ‘Accurate methods for the statistics of surprise and coincidence’, Computational Linguistics 19(1): 61–74. Egbert, J. and Baker, P. (eds) (2019), Using Corpus Methods to Triangulate Linguistic Analysis. London: Routledge. Egbert, J., Wizner, S., Keller, D., Biber, D., McEnery, T. and Baker, P. (2021), ‘Identifying and describing functional discourse units in the BNC Spoken 2014.’ Text & Talk 41(5–6): 715–37. El Refaie, E. (2002), ‘Metaphors we discriminate by : Naturalized themes in Austrian newspaper articles about asylum seekers’. Journal of Sociolinguistics 5(3): 352–71. Evans, C. (2020), A corpus-assisted discourse analysis of NHS responses to online patient feedback. Unpublished PhD Thesis, Lancaster University. Fairclough, N. (1989), Language and Power. London: Longman. Fairclough, N. (1995), Media Discourse. London: Hodder Arnold. Fairclough, N. (2003), Analysing Discourse: Textual Analysis for Social Research. London: Routledge. Firth, J. R. (1957), Papers in Linguistics 1934–1951. London: Oxford University Press. Flowerdew, J. (1997), ‘The discourse of colonial withdrawal: a case study in the creation of mythic discourse.’ Discourse and Society 8: 453–77. 
Flowerdew, L. (2000), ‘Investigating referential and pragmatic errors in a learner corpus’, in L. Burnard and T. McEnery (eds), Rethinking Language Pedagogy from a Corpus Perspective. Frankfurt: Peter Lang, 145–54. Foucault, M. (1972), The Archaeology of Knowledge. London: Tavistock. Fowler, R. (1991), Language in the News. London: Routledge. Fowler, R., Hodge, B., Kress, G. and Trew, T. (1979), Language and Control. London: Routledge. Francis, W. N. and Kučera, H. (1982), Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin.
franzke, a. s., Bechmann, A., Zimmer, M., Ess, C. and the Association of Internet Researchers (2020), Internet Research: Ethical Guidelines 3.0. Available at https://aoir.org/reports/ethics3.pdf Friginal, E. (2018), Corpus Linguistics for English Teachers: Tools, Online Resources and Classroom Activities. London: Routledge. Friginal, E. and Hardy, J. (eds) (2020), The Routledge Handbook of Corpus Approaches to Discourse. Abingdon: Routledge. Gabrielatos, C. and Baker, P. (2008), ‘Fleeing, sneaking, flooding: a corpus analysis of discursive constructions of refugees and asylum seekers in the UK Press 1996–2005.’ Journal of English Linguistics 36(1): 5–38. Gablasova, D., Brezina, V. and McEnery, T. (2017), ‘Collocations in corpus-based language learning research. Identifying, comparing and interpreting the evidence.’ Language Learning 67(S1): 155–79. Galtung, J. and Ruge, M. (1965), ‘The structure of foreign news: The presentation of the Congo, Cuba and Cyprus crises in four Norwegian newspapers.’ Journal of Peace Research 2(1): 64–90. Gibbs, R. W. Jr. and Steen, G. J. (1999), Metaphor in Cognitive Linguistics. Amsterdam: John Benjamins. Gilbert, N. and Mulkay, M. (1984), Opening Pandora’s Box: A Sociological Analysis of Scientists’ Discourse. Cambridge: Cambridge University Press. Gill, R. (1993), ‘Justifying justice: broadcasters’ accounts of inequality in radio’, in E. Burman and I. Parker (eds), Discourse Analytic Research. London: Routledge, 75–93. Gramsci, A. (1985), Selections from the Cultural Writings 1921–1926, ed. D. Forgacs and G. Nowell Smith, trans. W. Boelhower. London: Lawrence and Wishart. Gupta, K. (2017), Representation of the British Suffrage Movement. London: Bloomsbury. Hajer, M. (1997), The Politics of Environmental Discourse: Ecological Modernization and the Policy Process. Oxford: Oxford University Press. Hall, S., Critcher, C., Jefferson, T., Clarke, J. and Roberts, B. (1978), Policing the Crisis: Mugging, the State, and Law and Order.
London: Macmillan. Hanks, P. (2012), ‘The corpus revolution in lexicography.’ International Journal of Lexicography 24(4): 398–436. Hardie, A. (2014), ‘Modest XML for Corpora: Not a standard, but a suggestion.’ ICAME Journal 38: 73–103. Hardt-Mautner, G. (1995a), Only Connect: Critical discourse analysis and corpus linguistics, UCREL Technical Paper 6. Lancaster: University of Lancaster. Hardt-Mautner, G. (1995b), ‘How does one become a good European: The British press and European integration.’ Discourse and Society 6(2): 177–205. Hocking, D. (2022), The Impact of Language Change on the Practices of Visual Artists. Cambridge: Cambridge University Press.
Hoey, M. (1986), ‘The discourse colony: a preliminary study of a neglected discourse type’, in Talking about Text, Discourse Analysis Monograph no. 13, English Language Research, University of Birmingham, 1–26. Hoey, M. (2005), Lexical Priming. A New Theory of Words and Language. London: Routledge. Holbrook, D. (2015), ‘Designing and applying an “Extremist Media Index”.’ Perspectives on Terrorism 9(5): 56–67. Hollink, L., Bedjeti, A., van Harmelen, M. and Elliot, D. (2016), ‘A corpus of images and text in online news.’ Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), European Language Resources Association, 1377–82. Holloway, W. (1981), ‘ “I just wanted to kill a woman”, Why? The Ripper and male sexuality.’ Feminist Review 9: 33–40. Holloway, W. (1984), ‘Gender differences and the production of the subject’, in J. Henriques, W. Hollway, C. Urwin, C. Venn and V. Walkerdine (eds), Changing the Subject. London: Methuen, 227–63. Holmes, J. (2001), ‘A corpus based view of gender in New Zealand English’, in M. Hellinger and H. Bussman (eds), Gender Across Languages. The Linguistic Representation of Women and Men. Vol 1. Amsterdam: John Benjamins, 115–36. Hughes, J. and Hardie, A. (2019), ‘Corpus linguistics and event-related potentials’, in J. Egbert and P. Baker (eds), Using Corpus Methods to Triangulate Linguistic Analysis. Abingdon: Routledge, 185–218. Hunston, S. (1999), ‘Corpus evidence for disadvantage: issues in critical interpretation’, Paper read at the BAAL/CUP seminar ‘Investigating discourse practices through corpus research: methods, findings and applications’, University of Reading, May 1999. Hunston, S. (2002), Corpora in Applied Linguistics. Cambridge: Cambridge University Press. Hunt, D. and Brookes, G. (2020), Corpus, Discourse and Mental Health. London: Bloomsbury. Hyland, K. and Paltridge, B. (eds) (2013), The Bloomsbury Companion to Discourse Analysis. London: Bloomsbury.
Intellectual Property Office (2014), Exceptions to Copyright: Research. https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/375954/Research.pdf Johansson, S. (1991), ‘Times change and so do corpora’, in K. Aijmer and B. Altenberg (eds), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman, 305–14. Johnson, S., Culpeper, J. and Suhr, S. (2003), ‘From “politically correct councillors” to “Blairite nonsense”: discourses of political correctness in three British newspapers.’ Discourse and Society 14(1): 28–47.
Jones, S. and Sinclair, J. (1974), ‘English lexical collocations.’ Cahiers de Lexicologie, 24: 15–61. Käding, J. (1897), Häufigkeitswörterbuch der deutschen Sprache, Steglitz: privately published. Kahneman, D. and Tversky, A. (1973), ‘On the psychology of prediction.’ Psychological Review, 80: 237–51. Katz, S. (1996), ‘Distribution of common words and phrases in text and language modelling.’ Natural Language Engineering 2(1): 15–59. Kaye, R. (1998), ‘Redefining the refugee: the UK media portrayal of asylum seekers’. in K. Koser and H. Lutz (eds), The New Migration in Europe: Social Constructions and Social Realities. London: MacMillan Press, 163–82. Kennedy, G. (1998), An Introduction to Corpus Linguistics. London: Longman. Kenny, D. (2001), Lexis and Creativity in Translation: A Corpus-based Study. Manchester: St Jerome Publishing. Kilgarriff, A. and Tugwell, D. (2001), ‘WASP-Bench: an MT Lexicographers’ Workstation Supporting State-of-the-art Lexical Disambiguation’. Proceedings of MT Summit VII, Santiago de Compostela, 187–90. Koller, V. (2004), Metaphor and Gender in Business Media Discourse. A Critical Cognitive Study. Houndmills: Palgrave MacMillan. Kress, G. (1994), ‘Text and grammar as explanation’, in U. Meinhof and K. Richardson (eds), Text, Discourse and Context: Representations of Poverty in Britain. London: Longman, 24–46. Kress, G. and van Leeuwen, T. (1996), Reading Images: The Grammar of Visual Design. London: Routledge. Krishnamurthy, R. (1996), ‘Ethnic, Racial and Tribal: The language of racism?,’ in C. R. Caldas-Coulthard and M. Coulthard (eds), Texts and Practices: Readings in Critical Discourse Analysis. London: Routledge, 129–49. Kytö, M. and Rissanen, M. (1992), ‘A language in transition: the Helsinki Corpus of English texts’, ICAME Journal 16: 7–27. Lakoff, G. and Johnson, M. (1980), Metaphors We Live By. Chicago: Chicago University Press. Law, I., Svennevig, M. and Morrison, D. E. (1997), Privilege and Silence. 
‘Race’ in the British News during the General Election Campaign, 1997. Research Report for the Commission for Racial Equality. Leeds: University of Leeds Press. Layder, D. (1993), New Strategies in Social Research. Cambridge: Polity Press. Leech, G. (1991), ‘The state of the art in corpus linguistics’, in K. Aijmer and B. Altenberg (eds), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman, 105–22. Leech, G. (1992), ‘Corpora and theories of linguistic performance’, in Jan Svartvik (ed.), Directions in Corpus Linguistics: Proceedings of the Nobel Symposium 82, Stockholm, 4–8 August 1991. Berlin: Mouton de Gruyter, 105–22.
Leech, G. (2003), ‘Modality on the move: the English modal auxiliaries 1961–1992’, in Roberta Facchinetti, Manfred Krug and Frank Palmer (eds), Modality in Contemporary English. Topics in English Linguistics 44. Berlin and New York: Mouton de Gruyter, 223–40. Leech, G. and Smith, N. (2005), ‘Extending the possibilities of corpus-based research on English in the twentieth century: a prequel to LOB and FLOB.’ ICAME Journal 29: 83–98. Leech, G., Hundt, M., Mair, C. and Smith, N. (2009), Change in Contemporary English: A Grammatical Study. Cambridge: Cambridge University Press. Lin, Y-L. (2017), ‘Co-occurrence of speech and gestures: a multimodal corpus approach to intercultural interaction.’ Journal of Pragmatics 117: 155–67. Louw, B. (1993), ‘Irony in the text or insincerity in the writer? – The diagnostic potential of semantic prosodies’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology: In Honour of John Sinclair. Amsterdam: John Benjamins, 157–76. Louw, B. (1997), ‘The role of corpora in critical literary appreciation’, in A. Wichmann, S. Fligelstone, T. McEnery and G. Knowles (eds), Teaching and Language Corpora. London: Longman, 140–251. Love, R. and Baker, P. (2015), ‘The hate that dare not speak its name?’ Journal of Language, Aggression and Conflict 3(1): 57–86. Lukin, A. (2019), War and its Ideologies. Singapore: Springer. Mahlberg, M. (2013), Corpus Stylistics and Dickens’s Fiction. London: Routledge. Marchi, A. (2010), ‘ “The moral in the story”: a diachronic investigation of lexicalised morality in the UK press.’ Corpora 5(2): 161–89. Marchi, A. and Taylor, C. (2009), ‘If on a Winter’s Night Two Researchers . . . A challenge to assumptions of soundness of interpretation.’ Critical Approaches to Discourse Analysis across Disciplines 3(1): 1–20. Martin, P. and Turner, B. (1986), ‘Grounded theory and organizational research.’ Journal of Applied Behavioral Science 22(2): 141–57. Mautner, G.
(2019), ‘A research note on corpora and discourse: Points to ponder in research design.’ Journal of Corpora and Discourse Studies 2: 1–13. McArthur, T. (1981), Longman Lexicon of Contemporary English. London: Longman. McEnery, T. (2005), Swearing in English. London: Routledge. McEnery, T. and Baker, H. (2016), Corpus Linguistics and 17th-Century Prostitution. London: Bloomsbury. McEnery, T. and Baker, H. (2017), ‘The poor in seventeenth-century England: A corpus-based analysis.’ Token: A Journal of English Linguistics 6: 51–83. McEnery, T. and Baker, H. (2018), Corpus Linguistics and 17th Century Prostitution. London: Bloomsbury. McEnery, A. and Wilson, A. (1996), Corpus Linguistics. Edinburgh: Edinburgh University Press.
McEnery, T., Baker, P. and Hardie, A. (2000), ‘Swearing and abuse in modern British English’, in B. Lewandowska-Tomaszczyk and J. Melia, PALC 99 Practical Applications in Language Corpora. Hamburg: Peter Lang, 37–48. McEnery, T., Xiao, R. and Tono, Y. (2006), Corpus-based Language Studies: An Advanced Resource Book. London: Routledge. McGlashan, M. (2016), The representation of same-sex parents in children’s picturebooks: A corpus-assisted multi-modal critical discourse analysis. Unpublished PhD thesis, Lancaster University. McNeill, P. (1990), Research Methods. Second Edition. London: Routledge. Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Google Books Team, Pickett, J. P., Hoiberg, D., Clancy, D., Norvig, P., Orwant, J., Pinker, S., Nowak M. A. and Aiden, E. L. (2010), ‘Quantitative analysis of culture using millions of digitized books.’ Science 331(6014): 176–82. Morgan, N. and Pritchard, A. (2001), Advertising in Tourism and Leisure. London: Routledge. Morrish, L. (2002), ‘ “That’s so typical of Peter – as soon as there’s a cock-up he tries to sit on it.”: British Broadsheet Press versus Peter Mandleson 1996–2001.’ Paper given at the 9th Annual American University Conference on Lavender Languages and Linguistics, American University, Washington DC. Morrison, A. and Love, A. (1996), ‘A discourse of disillusionment: Letters to the Editor in two Zimbabwean magazines 10 years after independence.’ Discourse and Society 7: 39–76. Mynatt, C. R., Doherty, M. E. and Tweney, R. D. (1977), ‘Confirmation bias in a simulated research environment: an experimental study of scientific inference.’ Quarterly Journal of Experimental Psychology 29: 85–95. Neuman, Y., Assaf, D., Cohen, Y., Last, M., Argamon, S., Howard, N., et al. (2013), ‘Metaphor identification in large texts corpora.’ PLoS ONE 8(4): e62343. Newby, H. (1977), “In the Field: Reflections on the Study of Suffolk Farm Workers”, in C. Bell and H. Newby (eds), Doing Sociological Research. 
London: Allen and Unwin, 108–29. Nguyen, L. and McCallum, K. (2016), ‘Drowning in our own home. A metaphor-led discourse analysis of Australian news media reporting on maritime asylum seekers.’ Communication Research and Practice 2(2): 159–76. Oakes, M. (1998), Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press. O’Keeffe, A., McCarthy, M. and Carter, R. (2007), From Corpus to Classroom: Language Use and Language Teaching. Cambridge: Cambridge University Press.
Omoniyi, T. (1998), ‘The discourse of tourism advertisements: Packaging nation and ideology in Singapore.’ Working Papers in Applied Linguistics 4(22): 2–14. Pang, B. and Lee, L. (2008), ‘Opinion mining and sentiment analysis.’ Foundations and Trends in Information Retrieval 2(1–2): 1–135. Parker, I. (1992), Discourse Dynamics: Critical Analysis for Social and Individual Psychology. London: Routledge. Parker, I. and Burman, E. (1993), ‘Against discursive imperialism, empiricism and constructionism: thirty two problems with discourse analysis’, in E. Burman and I. Parker (eds), Discourse Analytic Research. London: Routledge, 155–72. Partington, A. (1998), Patterns and Meanings. Amsterdam: Benjamins. Partington, A. (2003), The Linguistics of Political Argument: The Spin-doctor and the Wolf-pack at the White House. London: Routledge. Partington, A. (2008), ‘Teasing at the White House: A corpus-assisted study of face work in performing and responding to teases.’ Text and Talk 28(6): 771–92. Partington, A. (2010), ‘Modern Diachronic Corpus-Assisted Discourse Studies (MDCADS) on UK newspapers: An overview of the project.’ Corpora 5(2): 83–108. Partington, A. (2017), ‘Varieties of non-obvious meaning in CL and CADS: from “hindsight post-dictability” to sweet serendipity.’ Corpora 12(3): 339–67. Partington, A. and Morley, J. (2004), ‘At the heart of ideology: Word and cluster/bundle frequency in political debate’, in B. Lewandowska-Tomaszczyk (ed.), PALC 2003: Practical Applications in Language Corpora. Frankfurt/M: Peter Lang, 179–92. Partington, A., Duguid, A. and Taylor, C. (2013), Patterns and Meanings in Discourse: Theory and Practice in Corpus-Assisted Discourse Studies (CADS). Amsterdam: John Benjamins. Pearce, M. (2008), ‘Investigating the collocational behaviour of man and woman in the BNC using Sketch Engine.’ Corpora 3(1): 1–29. Philip, G.
(2012), ‘Locating metaphor candidates in specialized corpora using raw frequency and keyword lists’, in Fiona MacArthur, José Luis Oncins-Martínez, Manuel Sánchez-García and Ana María Piquer-Píriz (eds), Metaphor in Use: Context, culture, and communication. Amsterdam: John Benjamins, 85–106. Pisoiu, D. (2012), Islamist Radicalisation in Europe: An Occupational Change Process. London: Routledge. Potter, J. and Wetherell, M. (1987), Discourse and Social Psychology. London: Sage. Preyer, W. (1889), The Mind of the Child. New York: Appleton. Translation of original German edition of 1882.
Qian, Y. (2010), Discursive Constructions Around Terrorism in the People’s Daily (China) and The Sun (UK) before and after 9/11. Bern: Peter Lang. Quirk, R. (1960), ‘Towards a description of English usage’, Transactions of the Philological Society, 40–61. Rayson, P., Leech, G. and Hodges, M. (1997), ‘Social differentiation in the use of English vocabulary: some analyses of the conversational component of the British National Corpus.’ International Journal of Corpus Linguistics 2: 133–50. Reppen, R., Fitzmaurice, S. and Biber, D. (eds) (2002), Using Corpora to Explore Linguistic Variation. Amsterdam/Philadelphia: John Benjamins. Rey, J. M. (2001), ‘Changing gender roles in popular culture: Dialogue in Star Trek episodes from 1966 to 1993’, in D. Biber and S. Conrad (eds), Variation in English: Multi-Dimensional Studies. London: Longman, 138–56. Rheindorf, M. (2019), Revisiting the Toolbox of Discourse Studies: New Trajectories in Methodology, Open Data, and Visualization. Cham: Palgrave Macmillan. Rich, A. (1980), ‘Compulsory heterosexuality and lesbian existence.’ Signs: Journal of Women in Culture and Society 5: 631–60. Ringbom, H. (1998), ‘Vocabulary frequencies in advanced learner English: a cross-linguistic approach’, in S. Granger (ed.), Learner English on Computer. London: Longman, 41–52. Roch, J. (2020), ‘Friends or foes? Europe and “the people” in representations of populist parties.’ Politics (August 2020), https://doi.org/10.1177/0263395720938537. Rüdiger, S. and Dayter, D. (eds) (2020), Corpus Approaches to Social Media. Amsterdam: John Benjamins. Salahshour, N. (2016), ‘Liquid metaphors as positive evaluations: A corpus-assisted discourse analysis of the representation of migrants in a daily New Zealand newspaper.’ Discourse, Context and Media, 13: 73–81. Sardinha, T. B. (2007), ‘Metaphor in corpora: a corpus-driven analysis of applied linguistics dissertations.’ Revista Brasileira de Linguística Aplicada 7(1): 11–35. Scott, M.
(1999), ‘Definition of Keyness’, WordSmith Tools Help. https://lexically.net/downloads/version_64_8/HTML/keyness_definition.html Scott, M. (2018), ‘A parser for news downloads.’ DELTA: Documentação de Estudos em Lingüística Teórica e Aplicada 34(1): 1–16. Schröter, M. and Veniard, M. (2016), ‘Contrastive analysis of keywords in discourses: intégration and integration in French and German discourses about migration.’ International Journal of Language and Culture 3(1): 1–33. Scruton, R. (1998), On Hunting. London: Yellow Jersey Press. Semino, E., Demjén, Z., Demmen, J., Koller, V., Payne, S., Hardie, A. and Rayson, P. (2017), ‘The online use of Violence and Journey metaphors by patients
with cancer, as compared with health professionals: a mixed methods study.’ BMJ Supportive & Palliative Care 7: 60–66. Shalom, C. (1997), ‘That great supermarket of desire: attributes of the desired other in personal advertisements’, in K. Harvey and C. Shalom (eds), Language and Desire. London: Routledge, 186–203. Sherrard, C. (1991), ‘Developing discourse analysis.’ Journal of General Psychology 118(2): 171–9. Sigley, R. and Holmes, J. (2002), ‘Girl-watching in corpora of English.’ Journal of English Linguistics 30(2): 138–57. Simpson, P. (1993), Language, Ideology and Point of View. London: Routledge. Sinclair, J. (1991), Corpus, Concordance, Collocation. Oxford: Oxford University Press. Sinclair, J. (1996), ‘The search for units of meaning’. Textus 9: 75–206. Sinclair, J. M. (1999), ‘A way with common words’, in H. Hasselgård and S. Oksefjell (eds), Out of Corpora: Studies in Honour of Stig Johansson. Amsterdam: Rodopi, 157–79. Slembrouck, S. (1992), ‘The Parliamentary Hansard “Verbatim” Report: The Written Construction of Spoken Discourse.’ Language and Literature 1(2): 101–19. Steen, G. J., Dorst, A. G., Herrmann, J. B., Kaal, A. A., Krennmayr, T. and Pasma, T. (2010), A Method for Linguistic Metaphor Identification. From MIP to MIPVU. Amsterdam: John Benjamins. Stubbs, M. (1983), Discourse Analysis: the Sociolinguistic Analysis of Natural Language. Chicago: University of Chicago Press. Stubbs, M. (1996), Text and Corpus Analysis. London: Blackwell. Stubbs, M. (2001a), ‘Texts, corpora and problems of interpretation: A response to Widdowson.’ Applied Linguistics 22(2): 149–72. Stubbs, M. (2001b), Words and Phrases: Corpus Studies of Lexical Semantics. London: Blackwell. Stubbs, M. (2002), ‘On text and corpus analysis: A reply to Borsley and Ingham.’ Lingua Franca 112: 7–11. Sunderland, J. (2004), Gendered Discourses. Basingstoke: Palgrave. Swann, J. (2002), ‘Yes, but is it gender?’, in L. Litosseliti and J.
Sunderland (eds), Gender Identity and Discourse Analysis. Amsterdam: John Benjamins, 43–67.
Taine, H. (1877), ‘On the acquisition of language by children.’ Mind 2: 252–9.
Tausczik, Y. and Pennebaker, J. (2010), ‘The psychological meaning of words: LIWC and computerized text analysis methods.’ Journal of Language and Social Psychology 29(1): 24–54.
Taylor, C. (2010), ‘Science in the news: a diachronic perspective.’ Corpora 5(2): 221–50.
Taylor, C. (2017), ‘Togetherness or othering? Community and comunità in the UK and Italian press’, in J. Chovanec and K. Molek-Kozakowska (eds),
Representing the Other in European Media Discourses. Amsterdam: John Benjamins, 55–80.
Taylor, C. (2018), ‘Similarity’, in C. Taylor and A. Marchi (eds), Corpus Approaches to Discourse: A Critical Review. London: Routledge, 19–37.
Taylor, C. (2021), ‘Metaphors of migration over time.’ Discourse and Society 32(4): 443–62.
Taylor, C. and Del Fante, D. (2020), ‘Comparing across languages in corpus and discourse analysis: some issues and approaches.’ Meta 65(1): 29. https://doi.org/10.7202/1073635ar.
Taylor, C. and Marchi, A. (eds) (2018), Corpus Approaches to Discourse: A Critical Review. London: Routledge.
ter Wal, J. (2002), Racism and Cultural Diversity in the Mass Media. Vienna: European Research Center on Migration and Ethnic Relations.
Tognini-Bonelli, E. (2001), Corpus Linguistics at Work (Studies in Corpus Linguistics: 6). Amsterdam/Atlanta, GA: John Benjamins.
Tsolmon, B., Kwon, A.-R. and Lee, K.-S. (2012), ‘Extracting social events based on timeline and sentiment analysis in Twitter Corpus’, in G. Bouma, A. Ittoo, E. Métais and H. Wortmann (eds), Natural Language Processing and Information Systems. Proceedings of 17th International Conference on Applications of Natural Language to Information Systems. Groningen, The Netherlands. Lecture Notes in Computer Science 7337: 265–270.
Turney, P., Neuman, Y., Assaf, D. and Cohen, Y. (2011), ‘Literal and metaphorical sense identification through concrete and abstract context’, in Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, 27–31 July: 680–90.
Tversky, A. and Kahneman, D. (1973), ‘Availability: A heuristic for judging frequency and probability.’ Cognitive Psychology 5: 207–32.
Vallone, R. P., Ross, L. and Lepper, M. R. (1985), ‘The hostile media phenomenon: Biased Perception and Perceptions of Media Bias in Coverage of the “Beirut Massacre”.’ Journal of Personality and Social Psychology 49: 577–85.
van Dijk, T.
(1987), Communicating Racism: Ethnic Prejudice in Thought and Talk. London: Sage.
van Dijk, T. (1991), Racism and the Press. London: Routledge.
van Dijk, T. (1996), ‘Discourse, power and access’, in C. R. Caldas-Coulthard and M. Coulthard (eds), Texts and Practices: Readings in Critical Discourse Analysis. London: Routledge, 84–104.
van Dijk, T. (2001), ‘Critical discourse analysis’, in D. Schiffrin, D. Tannen and H. E. Hamilton (eds), The Handbook of Discourse Analysis. London: Blackwell, 352–71.
van Leeuwen, T. (1996), ‘The representation of social actors’, in C. R. Caldas-Coulthard and M. Coulthard (eds), Texts and Practices. London: Routledge, 32–70.
Wang, G. (2018), ‘A corpus-assisted critical discourse analysis of news reporting on China’s air pollution in the official Chinese English-language press.’ Discourse and Society 12(6): 645–62.
Widdowson, H. G. (2000), ‘On the limitations of linguistics applied.’ Applied Linguistics 21(1): 3–25.
Williams, P. and Chrisman, L. (eds) (1993), Colonial Discourse and Post-colonial Theory: A Reader. London: Longman.
Wilson, A. and Thomas, J. (1997), ‘Semantic annotation’, in R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Texts. London: Longman, 55–65.
Wodak, R. and Meyer, M. (2001), Methods of Critical Discourse Analysis. London: Sage.
Wools, D. and Coulthard, M. (1998), ‘Tools for the trade.’ Forensic Linguistics 5: 33–57.
Wright, D. (2017), ‘Using word n-grams to identify authors and idiolects: A corpus approach to a forensic linguistic problem.’ International Journal of Corpus Linguistics 22(2): 212–41.
Wynne, M. (ed.) (2005), Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books.
Zwicky, A. (1997), ‘Two lavender issues for linguists’, in A. Livia and K. Hall (eds), Queerly Phrased. Oxford: Oxford Studies in Sociolinguistics, 21–34.
Index
absence 24, 41 action research 10 adjectives 82, 135, 152, 174, 198 adverbs 2, 174, 200 agency 125, 229 Amazon Rekognition 211 anaphora 130, 229 Anderson, G. 205 annotation 72–6, 209–10, 237, 238 accuracy 24, 75, 153, 211 grammatical see part-of-speech part-of-speech 2, 49, 74 semantic 49, 184, 216–17 validation 76 anonymisation 67–8 AntConc 47–8, 110, 125, 162, 165–7, 209 Anthony L. 47, 125, 196, 212 Archer D. 190 archive 56–7 argument 25, 172–3, 175–6, 186, 221 audience 23, 74, 85, 101–2, 213, 227 backgrounding 228 Bakhtin M. 10 Baker H. 32, 201 Baker P. 7, 11, 12, 20, 24, 27, 32, 43, 53, 60, 63, 75, 104, 133, 147, 159, 169, 180, 190, 194–7, 200, 205, 207, 211, 212, 219 Baldry A. 9 Bank of English (BoE) Baxter J. 30 B-Brown Corpus 179 BE06 Corpus 179, 216, 235n BE21 Corpus 77 Becker H. 15
Bednarek M. 66 Belica C. 113 Berry-Rogghe G. L. E. 140 Bevitori C. 31 Bhaskar R. 13 bias 6, 13–14, 33, 221 Biber D. 2, 3, 10, 11, 90 Billig M. 51 Blei D. 12 Blommaert J. 4, 16, 97, 223 body 73 boilerplate 70 Borsley R. D. 9 Breeze R. 114 Brezina V. 48 Brindle A. 30 British National Corpus (BNC) 16, 61, 66, 75–7, 89, 98–101, 110, 112, 118, 126, 131, 141, 149, 207, 214, 222, 235n, 236n Brookes G. 30, 180, 190 Brown Corpus 3, 61, 203 Brown family 57, 61, 76, 222–3 Brown G. 3 Burman E. 11 Burnard L. 222, 235n Burr V. 4, 10, 13 CADS 6–8, 12–13, 15, 21, 29–36, 114, 237 Caldas-Coulthard C. R. 6, 22 Cameron D. 4, 9 categorisation 114–15, 143–5, 173, 203 Charteris-Black J. 215 cherry-picking 45 Chilton P. 4, 7, 137
Chrisman L. 4 Cicourel A. V. Clarke I. 188–9 Clear J. 3 clusters 91–2, 104–5, 170–3, 227, 237 Coefficient of Variance 197–200, 237 Cohen’s D 140 Collins L. 211, 213, 218, 236n collocates 136–43, 159, 195, 225, 237 consistent 201 rank by frequency 138 span 143, 158–9 collocational networks 146–51, 159, 160, 237 colloquialisms see informal language colony texts 58 comparison 43–4 computer mediated communication 213 concordance plot 95–6, 104, 237 concordances 107, 109, 132–3, 135, 237 expanded 119 random order 112–14, 133 problems 108 search syntax 110, 238 sorted 120–5, 133 thinned 239 conjunctions see co-ordinators consistency 141, 203, 226 content words see lexical words Conrad S. 90 co-ordinators 138 copyright 66, 68, 79, 211 corpora 1, 237 diachronic 59–60, 223 reference 39, 61–2, 76–80, 123, 179, 238 size 23, 58–9, 62, 88, 139–40, 158, 165, 199 specialised 56, 58–9 spoken 65–6, 99, 205–8, 223–4 corpus assisted discourse studies see CADS corpus linguistics 1–3, 9, 28, 237 applications 3
building 55–80, 222–4 corpus-based/corpus-driven distinction 19, 237 criticisms 8–9, 128–32 Corpus of Historical American English 61 Cotterill J. 127 Coulthard M. 3 CQPweb 48–9, 86, 110, 119, 130, 179 Crenshaw K. 30 critical discourse analysis 6, 11, 41, 82, 108–9, 223, 237 critical discourse studies, see critical discourse analysis critical linguistics 6 critical realism 13 critique 46–7 Crystal D. 183 culturomics 12 cut-off points 25, 139, 141, 159, 166, 188–9, 198, 200, 225 Danet B. 82 database 56–7 Dayrell C. 31, 206, 218 Dayter D. 218 dehumanisation 24, 46, 124, 136, 154–5 de la Ossa A. 30 Deignan A. 214 Del Fante, D. 203, 219 Delta P 140 demographic data 2, 99–101 Denzin N. K. 10 depersonalisation 177, 214 Derrida J. 18–19 description 43 diachronic change 9, 17, 25, 197–202 Dice score 141, 167–8 %diff 167 difference 40–1 discourse 1, 3–6, 28, 59, 221, 237 academic 13, 36–7, 51 cumulative see incremental dominant see hegemonic
hegemonic 16, 24, 28, 103 incremental 15–16, 28 mainstream see hegemonic racist 127 resistant 17–8 tourist 93, 103 discourse analysis 1, 6, 11, 238 discourse marker 4 discourse organiser 3 discourse prosody 126–7, 238 discourse structure 3 discourse units see functional discourse units dispersion 83, 96, 104, 183, 188, 238 dissemination 50 distribution 83, 104, 183, 226, 238 Doherty A. 139 Downing J. 109 Duguid A. 12, 27, 41 duplicates 69–70 Dunning T. 140 Durrant P. 139 Early English Books Online 201 effect size 139–41, 167–8 Egbert J. 20, 194, 207 ELAN 205 El Refaie E. 24 emoji 71, 213 enclitic 48 English-Corpora.org 48–9 English Web 2020 Corpus 222 ethics 67–8, 79 euphemisms 41, 82, 130, 172, 176–7, 199 Evans C. 8, 32 Excel 198 explanation 45 Extensible Markup Language (XML) 72–3 Facebook 67, 213 Fairclough N. 4, 7, 15, 22, 82, 108, 214 feminist linguistics 10 FireAnt 212
Firth J. R. 136 Flowerdew L. 7 Foucault M. 4, 5 Fowler R. 6 Francis W. N. 3, 90 Freiburg Lancaster-Oslo/Bergen (FLOB) Corpus 88, 228 frequency 23, 33, 43, 81–3, 131, 161, 224, 238 standardised 86 frequency list 86, 92, 104, 105, 107, 138, 162–5, 238 Friginal E. 3, 12 function words see grammatical words functional discourse units 207–8 Gablasova D. 140 Gabrielatos C. 11, 53, 201 Galtung J. 131 gender 10, 22, 60, 149–50, 215 Gibbs R. W. 214 Gilbert N. 25 Gill R. 11 Google Cloud Vision 211 grammatical words 19, 88–90, 138, 140, 144–5, 158, 168–9 Gramsci A. 10 Graph Theory 147 Griffith-Dickson, C. 218 Gupta K. 32 Hajer M. 4 Hall S. 16 Halliday M. A. K. 114 Hanks P. 3 Hardt-Mautner G. 6, 55, 127, 129 Hardie A. 79, 141 Hardy J. 12 header 70–1, 73–4, 238 Helsinki Corpus 59, 235n Hocking D. 200 Hoey M. 16, 24, 25, 58 Holbrook, D. 136 Hollink, L. 22
Holloway W. 5 Holmes J. 7 HTML 64, 73 HTTrack 63 Hughes J. 141 Hunston S. 18, 22, 85, 124, 139, 235n Hunt D. 30 Hyland K. 27 hypothesis testing 124, 133, 140, 167 ideology 104, 131 images see visuals impact 50, 230–2 informal language 98–101 Ingham R. 9 Instagram 67 Intellectual Property Office 66 interpretation 2, 9, 13–14, 22–3, 44, 132, 204, 213, 229–30 interpretative repertoires 25 intersectionality 30 intertextuality 120 invitational imperatives 93–4 Johansson S. 3 Johnson J. 31 Johnson M. 214 Johnson S. 56 Jones S. 143 Käding J. 2 Kahneman D. 14 Katz S. 96 Kaye R. 129 Kennedy G. 56, 58, 64–5, 66, 235n Kenny D. 8 keying in 64–5 keyness 165–9 keyness score 167–8, 173 key word in context (KWIC) see concordance keywords 113, 187–190, 196, 197, 203, 206, 209, 212, 216, 228, 238
key categories 183–7 key clusters 180 key keywords 183, 188, 190 negative keywords 167, 228 remainder method 180, 197 Khosravinik M. 11, 53 Kilgarriff A. 140 Koller V. 214–5 Kress G. 86, 210 Krishnamurthy R. 7 Krzyzanowski M. 11, 53 Kučera H. 3, 90 Kytö N. 59 Lafferty J. 12 Lakoff G. 137, 214 Lancaster-Oslo/Bergen (LOB) Corpus 61, 77, 235n #LancsBox 48–9, 75, 137–8, 141–3, 146, 157 Law I. 127 Layder D. 19 learner English 4 Lee L. 12 Leech G. 3, 45, 56, 58, 77 legitimation 41, 104, 133–4, 159, 186, 189, 238 lemma 90, 92, 216 Levon E. 20, 195 lexical priming 24 lexical words 37, 90, 138, 163–4, 168, 172, 216, 226 Lin Y-L. 218 LIWC 144 lockword 49, 179, 190, 238 logDice 140, 142, 147 log-likelihood 140, 165, 167, 170, 173 log-log 140 logRatio 140, 167 Louw B. 3, 127 Love A. 7 Love R. 169 Lukin A. 31
Mahlberg M. 3 Marchi A. 12, 31, 53, 190, 194 Martin P. 143 Mautner G. 20–1 McArthur T. 184 McCallum K. 133 McEnery T. 1, 10, 11, 12, 19, 24, 27, 32, 53, 105, 113, 138, 212, 235n McGlashan M. 63, 210 McIntyre D. 233 McNeill P. 19 metaphor 24, 117, 118, 214–17 Meyer M. 109 MI see Mutual Information MI3 140 Michigan Corpus of Academic Spoken English 56, 207 Michel J. B. 12 Microsoft Word 70 minority groups 30, 109, 128, 206 modality 58, 77, 200 Morgan N. 84, 103 Morley J. 105 Morrish L. 130 Morrison A. 7 Mulkay M. 25 Mumsnet 67 Multiple Correspondence Analysis 188–9 Mutual Information 139–40, 142, 167–8 Mynatt C. R. 14 names see proper nouns Neuman Y. 216 Newby H. 19 Nguyen L. 133 nominalisation 36, 41, 82 nouns 74, 110, 122, 216 NVivo 212 Oakes M. 96, 140 objectivity 13, 37 O’Keefe A. 3
Omoniyi T. 93 opinion mining 12, 231 Optical Character Recognition 64, 71 Otter.ai 65 Oxford Text Archive 63 Paltridge B. 27 Pang B. 12 paralinguistic 65, 205–7, 219 Parker I. 4, 11 Partington A. 1, 7, 8, 12, 15, 20, 21, 25, 27, 41, 55, 105, 110, 200, 202, 203, 216 passives 129 Pearce M. 159 Pennebaker J. 144 permissions 66–8 personalisation 154 Pisoiu D. 154 plurals 110, 118, 133, 152, 159, 175, 209 polysemy 113, 144 portmanteau tag 76 post-structuralism 10, 18, 36 Potter J. 4 Praat 205 Preyer W. 2 Price H. 233 Pritchard A. 84, 103 ProtAnt 196–7 proper nouns 67, 130, 133, 168–9 punctuation 71, 142, 213 p value 165–6, 200 Python 63 Qian Y. 202 qualitative research 2, 8, 10–12, 19–21, 23, 35, 41–2, 53, 59, 107, 114, 128, 193–5, 206–7, 210, 212, 215 quantitative research 2, 9, 18–9, 21, 35, 53, 55, 81, 107, 132, 213, 221 queer theory 10 Quirk R. 3
Ram-Prasad C. 218 Rayson P. 2, 23 recommendations 50 reflexivity 35 repetitions 75, 114, 212 Reppen R. 3, 90 representations 4, 20, 29–32, 46, 109, 113–20, 129–33, 148, 153–9, 195–6, 238 representativeness 56 research questions 38–42, 54, 114 Rey J. M. 59–60 Rheindorf M. 114 Rich A. 5 Ringbom H. 4 Rissanen M. 59 Roch J. 31 Rüdiger S. 218 Ruge M. 131 Salashour N. 31 saliency 128, 138, 158, 164–5, 183, 225 sampling 57, 196 Sardinha T. B. 215–16 Schröter M. 202 Scott M. 6, 47, 69, 168 Scruton R. 161 semantic preference 126–7, 238 semantic prosody 127 sentiment analysis 12 sexuality 5, 13, 60, 82, 103, 235n Shalom C. 7, 58 Sherrard C. 82 similarity 41, 177–8, 229 Sigley R. 7 Sinclair J. 3, 112, 127, 143 Sketch Engine 48–9, 61, 75, 137, 141–3, 157, 159, 200, 223 Slembrouck S. 65 social constructionism 18 social media 212–14 Sonix 65 spelling 72
standardised type/token ratio 88 Steen G. J. 214 stop lists 168 spatialisation 137 Stubbs M. 3, 7, 9, 16, 58, 82, 126, 136, 143 Sunderland J. 5, 19–20 Survey of English Usage 3 Swann J. 10 synonyms 148, 184 T Score 140 tagging see annotation Taine H. 2 Tausczik Y. 144 Taylor C. 12, 27, 31, 41, 53, 134, 159, 190, 194, 202, 203, 219 teamwork 37–8, 51–2, 230–31 ter Wal J. 128 text archive see archive text producers 22, 23, 131, 204, 226 Thomas J. 144, 184 Tognini-Bonelli E. 19 token 87, 143, 239 Tono Y. 27 topic modelling 12, 231 Transcriber 205 transparency 53, 225 transitivity 114 Trends 200–201 triangulation 18–21, 194–7 Tsolmon B. 12 Tugwell D. 140 Turney P. 216 Turner B. 143 Tversky A. 14 TWINT 193 Twitter 66–7, 68, 212, 219 type 87, 239 type/token ratio 87–8, 239 Unicode 69, 213 USAS 144, 184 UTF-8 69
Vallone R. P. 14 van Dijk T. 109, 127 van Leeuwen T. 22, 86, 114, 137, 210, 229 van Olmen D. 236n VARD 2 72 Veniard M. 202 verbs 93–4, 152 modal verbs see modality Vessey R. 202 visuals 22, 86, 102, 208–11, 224, 227
Wang G. 202 weblogs 67 websites see internet text Wetherell M. 4 Widdowson H. G. 9 wildcard 48, 110, 239 Williams P. 4 Wilson A. 1, 10, 19, 113, 144, 184, 235n
Wmatrix 48–9, 75, 184, 216 Wodak R. 7, 11, 53, 109 Wools D. 3 word list see frequency list WordNet 215 Word Sketch 151–3, 239 WordSmith 6, 47–8, 69, 71, 86–7, 91, 110, 168, 181, 206 Wright D. 3 Wynne M. 79
Xiao R. 27 XML see Extensible Markup Language Yule G. 3 Z-score 140 Zwicky A. 82