What's in a Word-list? (Digital Research in the Arts and Humanities) 0754672409, 9780754672401, 9780754680659

The frequency with which particular words are used in a text can tell us something meaningful both about that text and a


English Pages 200 Year 2009


What’s in a Word-list?

Digital Research in the Arts and Humanities

Series Editors
Marilyn Deegan, Lorna Hughes and Harold Short

Digital technologies are becoming increasingly important to arts and humanities research and are expanding the horizons of our working methods. This important series will cover a wide range of disciplines with each volume focusing on a particular area, identifying the ways in which technology impacts on specific subjects. The aim is to provide an authoritative reflection of the ‘state of the art’ in the application of computing and technology. The series will be critical reading for experts in digital humanities and technology issues but will also be of wide interest to all scholars working in humanities and arts research.

AHRC ICT Methods Network Editorial Board
Sheila Anderson, Centre for e-Research, King’s College London
Chris Bailey, Leeds Metropolitan University
Bruce Brown, University of Brighton
Mark Greengrass, University of Sheffield
Susan Hockey, University College London
Sandra Kemp, Royal College of Art
Simon Keynes, University of Cambridge
Julian Richards, University of York
Seamus Ross, University of Glasgow
Charlotte Roueché, King’s College London
Kathryn Sutherland, University of Oxford
Andrew Wathey, Northumbria University

Forthcoming titles in the series

The Virtual Representation of the Past
Edited by Mark Greengrass and Lorna Hughes
ISBN 978 0 7546 7288 3

Modern Methods for Musicology
Prospects, Proposals and Realities
Edited by Tim Crawford and Lorna Gibson
ISBN 978 0 7546 7302 6

What’s in a Word-list?

Investigating Word Frequency and Keyword Extraction

Edited by
Dawn Archer
University of Central Lancashire, UK

© Dawn Archer 2009

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise without the prior permission of the publisher.

Dawn Archer has asserted her moral right under the Copyright, Designs and Patents Act, 1988, to be identified as the editor of this work.

Published by
Ashgate Publishing Limited, Wey Court East, Union Road, Farnham, Surrey, GU9 7PT, England

Ashgate Publishing Company, Suite 420, 101 Cherry Street, Burlington, VT 05401-4405, USA

www.ashgate.com

British Library Cataloguing in Publication Data
What’s in a word-list? : investigating word frequency and keyword extraction. - (Digital research in the arts and humanities)
1. Language and languages - Word frequency
I. Archer, Dawn
410.1'51

Library of Congress Cataloging-in-Publication Data
What’s in a word-list? : investigating word frequency and keyword extraction / [edited] by Dawn Archer.
p. cm. -- (Digital research in the arts and humanities)
Includes bibliographical references and index.
ISBN 978-0-7546-7240-1
1. English language--Word frequency. 2. English language--Word order. 3. English language--Etymology.
I. Archer, Dawn.
PE1691.W5 2008
428.1--dc22
2008034289

ISBN: 978-0-7546-7240-1 (Hardback)
ISBN: 978-0-7546-8065-9 (E-book)

Contents

List of Figures
List of Tables
Notes on Contributors
Acknowledgements
Series Preface

1 Does Frequency Really Matter?
  Dawn Archer
2 Word Frequency Use or Misuse?
  John M. Kirk
3 Word Frequency, Statistical Stylistics and Authorship Attribution
  David L. Hoover
4 Word Frequency in Context: Alternative Architectures for Examining Related Words, Register Variation and Historical Change
  Mark Davies
5 Issues for Historical and Regional Corpora: First Catch Your Word
  Christian Kay
6 In Search of a Bad Reference Corpus
  Mike Scott
7 Keywords and Moral Panics: Mary Whitehouse and Media Censorship
  Tony McEnery
8 ‘The question is, how cruel is it?’ Keywords, Fox Hunting and the House of Commons
  Paul Baker
9 Love – ‘a familiar or a devil’? An Exploration of Key Domains in Shakespeare’s Comedies and Tragedies
  Dawn Archer, Jonathan Culpeper, Paul Rayson
10 Promoting the Wider Use of Word Frequency and Keyword Extraction Techniques
  Dawn Archer

Appendix 1
Appendix 2 USAS taxonomy
Bibliography
Index

List of Figures

1.1 Results for * ago in the BNC, using VIEW

3.1 The frequency of the thirty most frequent words of The Age of Innocence
3.2 Modern American poetry: Delta analysis (2,000 MFWs)
3.3 Modern American poetry: Delta-Oz analysis (2,000 MFWs)
3.4 Modern American poetry: Delta-Lz (>0.7) analysis (3,000 MFWs)
3.5 Authorship simulation: Delta analysis (800 MFWs)
3.6 Authorship simulation: Delta-2X analysis (800 MFWs)
3.7 Authorship simulation: Delta-3X analysis (800 MFWs)
3.8 Authorship simulation: Delta-P1 analysis (600 MFWs)
3.9 Authorship simulation: Delta-Oz analysis (800 MFWs)
3.10 Authorship simulation: Delta-Lz (>0.7) analysis (1,000 MFWs)
3.11 Authorship simulation: changes in Delta and Delta-z from likeliest to second likeliest author: Delta-Lz (1,000 MFWs)

6.1 Berber Sardinha’s (2004, 102) formula
6.2 Saving keywords as text
6.3 Importing a word-list from plain text
6.4 Detailed consistency view of the twenty-two keyboard sets
6.5 Excel spreadsheet of results
6.6 Precision values for text A6L
6.7 Precision values for text KNG
6.8 Precision values for text A6L with genred BNC RCs

7.1 Words which are key keywords in five or more chapters of the MWC
7.2 Words which are key keywords in all of the MWC texts
7.3 The responsible
7.4 Porn is good
7.5 The call for the restoration of decency
7.6 Pronoun use by VALA
7.7 The assumption of Christianity
7.8 Speaking up for the silent majority
7.9 The use of wh-interrogatives by VALA

8.1 Keywords when p […]

[…]) was initiated as a means to sustain community building and outreach in the field of digital arts and humanities scholarship beyond the Methods Network’s funding period.

Note on the text

All publications in English are presumed as published in London unless otherwise stated.

I dedicate this edited collection to my husband, Eddie, to my children, Paul, Peter, Jonathan and Jessica, and to my daughters-in-law, Becky and Charlotte. Thank you, Eddie, for your constant love and support. Thanks kids for ensuring that life is always full of surprises!


Chapter 1

Does Frequency Really Matter?

Dawn Archer

Words, words, words

A hypothesis popular amongst computer hackers – the infinite monkey theorem1 – holds that, given enough time, a device that produces a random sequence of letters ad infinitum will, ultimately, create not only a coherent text, but also one of great quality (for example, Shakespeare’s Hamlet). The hypothesis has become more widely known thanks to David Ives’ satirical play, Words, Words, Words (Dramatists Play Service, NY). In the play, three monkeys – Kafka, Milton and Swift – are given the task of writing something akin to Hamlet, under the watchful eye of the experiment’s designer, Dr Rosenbaum. But, as Kafka reveals when she reads aloud what she has typed thus far, the experiment is beset with seemingly insurmountable difficulties:

‘“K k k k k, k k k! K k k! K ... k ... k.” I don’t know! I feel like I’m repeating myself!’2

In my view, Kafka’s concern about whether the simple repetition of letters can produce a meaningful text is well placed. But I would contend that the frequency with which particular words are used in a text can tell us something meaningful about that text and also about its author(s) – especially when we compare word choice/usage against the word choice/usage of other texts (and their authors). This can be explained, albeit in a simplistic way, by inverting the underlying assumption of the infinite monkey theorem: we learn something about texts by focussing on the frequency with which authors use words precisely because their choice of words is seldom random.3 As support for my position, I offer to the reader this edited collection, which brings together a number of researchers involved in the promotion of ICT methods such as frequency and keyword analysis. Indeed, the chapters within What’s in a Word-list? Investigating Word Frequency and Keyword Extraction were originally presented at the Expert Seminar in Linguistics (Lancaster 2005). This event was hosted by the AHRC ICT Methods Network as a means of demonstrating to the Arts and Humanities disciplines the broad applicability of corpus linguistic techniques and, more specifically, frequency and keyword analysis.

1 The infinite monkey theorem was first introduced by Émile Borel at the beginning of the twentieth century, and was later popularized by Sir Arthur Eddington.
2 D. Ives, Words, Words, Words, Dramatists Play Service, New York.
3 Of course, the extent to which this process is a completely cognitive one is a matter of debate.

Explaining frequency and keyword analysis

Frequency and keyword analysis involves the construction of word lists, using automatic computational techniques, which can then be analyzed in a number of ways, depending on one’s interest(s). For example, a researcher might focus on the most frequent lexical items of a number of generated word frequency lists to determine whether all the texts are written by the same author. Alternatively, they might wish to determine whether the most frequent words of a given text (captured by its word frequency list) are suggestive of potentially meaningful patterns that they might have missed had they read the text manually.4 They might then go on to view the most frequent words in their word frequency list in context (using a concordancer) as a means of determining their collocates and colligates (i.e. the content and function words with which the most frequent words keep regular company). For example, the word ‘ago’ occurs 19,326 times in the British National Corpus (BNC)5 and, according to Hoey,6 ‘is primed for collocation with year, weeks and days’. We can easily confirm this by entering the search string ‘* ago’ into Mark Davies’s relational database of the BNC. In fact, we find that nouns relating to periods of time account for the 20 most frequent collocates of ‘ago’ (see Davies, this volume, for a detailed discussion of the relational database employed here, and Scott and Tribble,7 for a more extensive discussion of the collocates of ‘ago’).
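The two steps just described, building a word frequency list and then inspecting what precedes a node word such as ‘ago’, can be illustrated with a minimal Python sketch. It is not the procedure used by any of the tools named in this chapter: the tokenization rule, the function names and the four-sentence sample are all assumptions made purely for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens (a deliberately naive rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def word_frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def left_collocates(tokens, node):
    """Count the words found immediately before `node` (the '* node' pattern)."""
    return Counter(prev for prev, word in zip(tokens, tokens[1:]) if word == node)

# Invented sample text, not drawn from the BNC.
sample = ("Two years ago we moved house. A few weeks ago it rained, "
          "and three days ago it snowed. Many years ago nobody minded.")
tokens = tokenize(sample)

print(word_frequency_list(tokens)[:3])
print(left_collocates(tokens, "ago").most_common())
```

Even on this tiny sample, the words immediately to the left of ‘ago’ are all nouns denoting periods of time, which is the kind of pattern a concordancer makes visible at scale.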
Figure 1.1 Results for * ago in the BNC, using VIEW

The researcher(s) who are interested in keyword analysis may also be interested in collocation and/or colligation, but they will compare, initially, the word frequency list of their chosen text (let’s call it text A) with the word frequency list of another normative or reference8 text (let’s call it text B) as a means of identifying both words that are frequent and also words that are infrequent in text A, statistically speaking, when compared to text B. This has the advantage of removing words that are common to both texts, and so allows the researcher to focus on those words that make text A distinctive from text B (and vice versa). In the case of the majority of English texts, this will mean that function words (‘the’, ‘and’, ‘if’, etc.) do not occur in a generated keywords list, because function words tend to be frequent in the English language as a whole (and, as a result, are commonly found in English texts). That said, function words can occur in a keyword list if their usage is strikingly different from the norm established by the reference text. Indeed, when Culpeper9 undertook a keywords analysis of six characters from Shakespeare’s Romeo and Juliet, using the play minus the words of the character under analysis as his reference text, he found that Juliet’s most frequent keyword was actually the function word ‘if’. On inspecting the concordance lines for ‘if’ and additional keyword terms, in particular, ‘yet’, ‘would’ and ‘be’, Culpeper concluded that, when viewed as a set, they served to indicate Juliet’s elevated pensiveness, anxiety and indecision, relative to the other characters in the play.

4 M. Scott and C. Tribble, Textual Patterns: Keyword and Corpus Analysis in Language Education (Amsterdam: Benjamins, 2006), p. 5.
5 Produced in the 1990s, the BNC is a 100 million-word corpus of modern British English containing registers that are representative of the spoken and written medium.
6 M. Hoey, Lexical Priming: A New Theory of Words and Language (Routledge, 2005), p. 177.
7 Scott and Tribble, Textual Patterns, p. 43.
8 Normative corpus and reference corpus are often used interchangeably by corpus linguists.
9 J. Culpeper, ‘Computers, Language and Characterisation: An Analysis of Six Characters in Romeo and Juliet’, in U. Melander-Marttala, C. Ostman and M. Kytö (eds), Conversation in Life and in Literature: Papers from the ASLA Symposium (Uppsala: Association Suédoise de Linguistique Appliquée, 2002), pp. 11–30.

Text mining techniques as indicators of potential relevance

As the example of Juliet (above) reveals, a set of automatically generated keywords will not necessarily match a set of human-generated keywords at first glance. In some instances, automatically generated keywords may also be found to be insignificant by the researcher in the final instance (see, for example, Archer et al., this volume), in spite of being classified as statistically significant by text analysis software. This is not as problematic as it might seem. The reason? The main utility of keywords and similar text-mining procedures is that they identify (linguistic) items which are:

1. likely to be of interest in terms of the text’s aboutness10 and structuring (that is, its genre-related and content-related characteristics); and,
2. likely to repay further study – by, for example, using a concordancer to investigate collocation, colligation, etc. (adapted from Scott, this volume).

Put simply, the contributors to this edited collection are not seeking (or wanting) to suggest that the procedures they utilize can replace human researchers. On the contrary, they offer them as a way in to texts – or, to use corpus linguistic terminology, a way of mining texts – which is time-saving and, when used sensitively, informative.

Aims, organization and content of the edited collection

The aims of What’s in a Word-list? are similar to those of the 2005 Expert Seminar, mentioned above:

• to demonstrate the benefits to be gained by engaging in corpus linguistic techniques such as frequency and keyword analysis; and,
• to demonstrate the very broad applicability of these techniques both within and outside the academic world.

These aims are especially relevant today when one considers the rate at which electronic texts are becoming available, and the recent innovations in analytic techniques which allow such data to be mined in illuminating (and relatively trouble free) ways. The contributors also identify a number of issues that are crucial, in their view, if corpus linguistic techniques are to be applied successfully within and beyond the field of linguistics. They include determining:

• what counts as a word
• what we mean by frequency
• why frequency matters so much
• the consistency of the various keyword extraction techniques
• which of the (key)words captured by keyword/word frequency lists are the most relevant (and which are not)
• whether the (de)selection of keywords introduces some level of bias
• what counts as a reference corpus and why we need one
• whether a reference corpus can be bad and still show us something
• what we gain (in real terms) by applying frequency and keyword techniques to texts.

10 M. Phillips, ‘Lexical Structure of Text’, Discourse Analysis Monographs 12 (Birmingham: University of Birmingham, 1989).
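To make the notions of keyword and reference corpus in the list above more concrete, the comparison of a text A with a reference text B can be sketched in Python. This chapter does not commit to a particular significance test, so the log-likelihood statistic used here is an assumption (it is one common choice in corpus linguistics), as are the function names, the 3.84 cut-off (p < 0.05 at one degree of freedom) and the invented frequencies.

```python
import math
from collections import Counter

def log_likelihood(a, b, size_a, size_b):
    """Log-likelihood (G2) for a word seen `a` times in text A and `b` times in text B."""
    expected_a = size_a * (a + b) / (size_a + size_b)
    expected_b = size_b * (a + b) / (size_a + size_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2 * g2

def keywords(tokens_a, tokens_b, threshold=3.84):
    """Return (word, G2, direction) for every word whose G2 reaches `threshold`.

    direction is '+' when the word is relatively more frequent in text A
    and '-' when it is relatively less frequent there.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    results = []
    for word in set(freq_a) | set(freq_b):
        a, b = freq_a[word], freq_b[word]
        g2 = log_likelihood(a, b, size_a, size_b)
        if g2 >= threshold:
            direction = "+" if a / size_a > b / size_b else "-"
            results.append((word, g2, direction))
    return sorted(results, key=lambda item: item[1], reverse=True)

# Invented frequencies: 'the' is equally common in both texts, 'if' is not.
text_a = ["if"] * 12 + ["the"] * 50 + ["and"] * 38        # 100 tokens
text_b = ["if"] * 20 + ["the"] * 500 + ["and"] * 480      # 1,000 tokens

for word, g2, direction in keywords(text_a, text_b):
    print(word, round(g2, 2), direction)
```

In this toy case ‘the’ drops out because its relative frequency is identical in both texts, whereas ‘if’ is flagged as key: a function word can surface as a keyword when its usage departs sharply from the norm set by the reference text.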

Word frequency: use or misuse?

John Kirk begins the edited collection by (re)assessing the concept of the word (as token, type and lemmatized type), the range of words (in terms of their functions and meanings) and thus our understanding of word frequency (as a property of data). He then goes on to refer to a range of corpora – the Corpus of Dramatic Texts in Scots, the Northern Ireland Transcribed Corpus of Speech, and the Irish component of the International Corpus of English – to argue that, although word frequency appears to promise precision and objectivity, it can sometimes produce imprecision and relativity. He thus proposes that, rather than regarding word frequency as an end in itself (and something that requires no explanation), we should promote it as:

• something that needs interpretation through contextualization
• a methodology, which lends itself to approximation and replicability.

Kirk also argues that there are advantages to be gained by paying attention to words of low frequency as well as words of high frequency. In his concluding comments, he touches on the contribution made to linguistic theory by word frequency studies, and, in particular, the usefulness of authorship studies in the detection of plagiarism.

Word frequency, statistical stylistics and authorship attribution

David Hoover continues the discussion of high versus low frequency words, and authorship attribution, focussing specifically on some of the innovations in analytic techniques and in the ways in which word frequencies are selected for analysis. He begins with an explanation of how, historically, those working within authorship attribution and statistical stylistics have tended to base their findings on fewer than the 100 most frequent words of a corpus. These words – almost exclusively function words – are attractive because they are so frequent that they account for most of the running words of a text, and because such words have been assumed to be especially resistant to intentional manipulation by an author.11 Hoover then goes on to document the most recent work on style variation which, by concentrating on word frequency in given sections of texts rather than in the entire corpus, is proving more effective in capturing stylistic shifts. A second recent trend identified by Hoover is that of increasing the number of words analysed to as many as the 6,000 most frequent words – a point at which almost all the words of the text are included, and almost all of these are content words. The final sections of his chapter are devoted to the authorship attribution community’s renewed interest in Delta, a method for identifying differences between texts that is based on comparing how individual texts within a corpus differ from the mean for that entire corpus (following the innovative work of John Burrows). Drawing on a two million word corpus of contemporary American poetry and a much larger corpus of 46 Victorian novels, Hoover also argues that refinements in the selection of words for analysis and in alternative formulas for calculating Delta may allow for further improvements in accuracy, and result, in turn, in the establishment of a theoretical explanation of how and why word frequency analysis is able to capture authorship and style.

11 This means that their frequencies should reveal authorial habits which remain relatively constant across a variety of texts.

Word frequency in context

In Chapter 4, Mark Davies introduces some alternatives to techniques based on word searching. In particular, he focuses on the use he has made of architectures based on relational databases and n-gram12 frequencies when developing corpora (including the 100-million-word Corpus del Español,13 a BNC-based 100-million-word corpus modelled on the same architecture (Variation in English Words and Phrases, VIEW),14 and a 40-million-word Corpus of Historical English).15 Davies’ main proposal is that such architectures can dramatically improve performance in the searching of corpora. For example, the following capture three of the many simple word frequency queries that take no more than one to two seconds on a 100-million-word corpus:

• overall frequency of a given word, set of words, phrase, or substring in the corpus;
• ‘slot-based’ queries, e.g. the most common nouns one ‘slot’ after ‘mysterious’, or z-score rank words immediately preceding ‘chair’; and,
• wide-range collocates, e.g. the most common nouns within a ten-word window (left or right) of ‘string’ or ‘broken’.
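The general idea behind the second kind of query in the list above, looking up stored n-gram frequencies in a relational table rather than rescanning running text, can be sketched with Python’s built-in sqlite3 module. This is only an illustration of the principle and not Davies’ actual architecture: the table layout, the column names, the hand-assigned part-of-speech tags and the toy word sequence are all hypothetical.

```python
import sqlite3
from collections import Counter

# Toy (word, part-of-speech) sequence; the tags are hand-assigned for illustration.
tagged = [("a", "DET"), ("mysterious", "ADJ"), ("stranger", "NOUN"),
          ("met", "VERB"), ("a", "DET"), ("mysterious", "ADJ"), ("stranger", "NOUN"),
          ("near", "PREP"), ("the", "DET"), ("mysterious", "ADJ"), ("island", "NOUN")]

# Count every bigram once, so that later queries read frequencies from a table.
bigrams = Counter(zip(tagged, tagged[1:]))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigram (w1 TEXT, pos1 TEXT, w2 TEXT, pos2 TEXT, freq INTEGER)")
conn.executemany(
    "INSERT INTO bigram VALUES (?, ?, ?, ?, ?)",
    [(w1, p1, w2, p2, n) for ((w1, p1), (w2, p2)), n in bigrams.items()],
)

def nouns_after(word):
    """'Slot-based' query: the most common nouns one slot after `word`."""
    rows = conn.execute(
        "SELECT w2, SUM(freq) AS total FROM bigram "
        "WHERE w1 = ? AND pos2 = 'NOUN' GROUP BY w2 ORDER BY total DESC, w2",
        (word,),
    )
    return rows.fetchall()

print(nouns_after("mysterious"))
```

Because the counting is done once, when the table is built, each query reduces to a filtered lookup; adding further columns (for instance a register label) would allow the same table to serve per-register frequency queries.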

Davies also highlights the importance of developing an architecture that can account for variation through the creation of n-gram frequency tables for each register within a given corpus. The advantage of such an approach is that each n-gram will have an associated frequency according to historical period and register, and this information will be directly accessible as part of a given query. Like Kirk (Chapter 2, this volume), Davies is zealous about word frequency being something that needs interpretation through contextualization. Indeed, he advocates that word frequency ‘be analyzed not just as the overall frequency of a given word or lemma in a certain corpus, but, rather, as the frequency of words in a wide range of related contexts’ (p. 66). Unlike Kirk, however, he does not seem to be readily concerned about the inclusion of low frequency words in any given query. This is because of a potential ‘size issue’ which means that n-gram tables can ‘become quite unmanageable’ when dealing with excessively large corpora (p. 57). Consequently, Davies advocates that, for such corpora, we include just those n-grams that occur three times or more. This is not a problem if one is interested in only the highly-frequent n-grams, of course, but it could make a detailed comparison of sub-corpora potentially problematic.

12 An n-gram is a (usually consecutive) sequence of items from a corpus. The items in question can be characters (letters and numbers) or more usually words.
13 .
14 .
15 .

Issues for historical and regional corpora – first catch your word

In Chapter 5, Christian Kay focuses primarily on variable spelling within historical texts, and the difficulties that this occasions when seeking to ‘catch a word’ in corpora, especially corpora such as the Historical Thesaurus of English (HTE), and a semantic index to the Oxford English Dictionary,16 which is supplemented by Old English materials (published separately in Roberts et al.’s A Thesaurus of Old English17) and, as such, captures English vocabulary from the earliest written records to the present.18 Kay goes on to point out that spelling variation can also create problems when searching corpora relating to (modern-day) non-standard varieties such as the Scottish Corpus of Texts and Speech and the Dictionary of the Scots Language.
Indeed, even the specialized dictionaries that lemmatize common variants (for example, the Dictionary of the Scots Language) are by no means comprehensive. She also demonstrates how homonymy and polysemy can create additional problems for those working with (historical and dialectal) corpora – and this is something that lemmatization may not be able to solve. Kay concludes by suggesting ways of addressing some of these problems using the resources described above, including the development of a rule-based system which predicts possible variants and maps them to the relevant headwords (see also Chapter 9). In addition, Kay touches on the relationship between e-texts (of which there are many) and structured corpora (of which there are few).

16 Oxford English Dictionary (Oxford: Oxford University Press, 1884–, and subsequent edns); OED Online, ed. J.A. Simpson (Oxford: Oxford University Press, 2000–).
17 J. Roberts, C. Kay and L. Grundy, A Thesaurus of Old English (Amsterdam: Rodopi, 2000 [1995]).
18 Word senses within the thesauri are organized in a hierarchy of categories and subcategories, with up to 14 levels of delicacy. The material is held in a database that can be searched on the internet, and is likely to be of use in a range of humanities disciplines.

In search of a bad reference corpus

Mike Scott’s contribution to this edited collection tackles the issue of reference corpora. More specifically, he is interested in determining how bad a reference corpus can be before it becomes unusable (in the sense that it generates keywords that do not help to clarify the aboutness of a target text). As previous chapters have revealed, this issue is particularly pertinent, as good reference corpora are not available for all genres / periods / languages. Using the keywords facility of his own text analysis program, WordSmith Tools,19 Scott’s starting point is the formula proposed by Berber Sardinha, which suggests that the larger the reference corpus, the more keywords will be detected.20 Berber Sardinha also suggests that, as a reference corpus that is similar to the target text (i.e. the text being analysed) will filter out genre features common to both, an optimum reference corpus is one that contains several different genres. Drawing on a series of reference texts of varying lengths (32 in total: 22 BNC texts and 10 Shakespeare plays), Scott explores the different keyword results that are generated by WordSmith Tools for two target texts: an extract from a book profiling business leaders and a doctor/patient interaction. Scott pays particular attention to their ‘popularity’ and ‘precision’ scores as a means of answering three research questions:

1. To what extent does the size of the reference text impact on the quality of the keywords and, if so, is there a point at which the size of the reference text renders the (quality of the) keywords unacceptable?
2. What sort of keyword results obtain if a reference text is used which has little or no relation to the target text (beyond them both being written in the same language)?
3. What sort of keyword results obtain if genre is included as a variable?
Popularity relates to the presence of each keyword in the majority of the reference texts (for example, 20 out of the 22 BNC texts). This is based on the rationale that keywords which are identified using most of the reference texts are more likely to be useful than those identified in only a minority of the reference texts. Precision is computed following Oakes,21 and involves dividing the total number of keywords for each reference text by the number of popular keywords (as determined by the popularity test). Whilst Scott admits that usefulness is a relative phenomenon, which is likely to vary according to research goals (and research goals cannot be predicted with certainty), he contends that it is still worth undertaking such a study, not least because it will help to determine the dimensions that appear to affect the meaningfulness (or not) of generated keywords. These include size in tokens (i.e. frequency), similarity of text-type, similarity of historical period, similarity of subject-matter, etc. More importantly, perhaps, this and later studies will provide a useful means of determining the robustness of the keywords procedure and thus, in turn, its potential usefulness in (non-)linguistic fields. And the indications from this preliminary study look promising; indeed, Scott suggests that even relatively restricted reference corpora can give good results in keyword extraction. That said, Scott notes that a small reference corpus containing a mixture of texts is likely to perform better than a larger corpus with more homogeneous texts.

19 M. Scott, WordSmith Tools, Version 4 (Oxford: Oxford University Press, 2004). WordSmith Tools is probably the most popular text analysis program in corpus linguistics. For more information, see .
20 The critical size of a reference corpus is said to be about two, three and five times the size of the node text: A.P. Berber Sardinha, Lingüística de Corpus (Barueri, São Paulo, Brazil: Editora Manole, 2004), pp. 101–103.

Keywords and moral panics – Mary Whitehouse and media censorship

Tony McEnery also utilizes the keywords facility of WordSmith Tools – in conjunction with his own lexically driven model of moral panic theory22 – as a means of determining the extent of moral panic in the books penned by Mary Whitehouse during the period 1967–77. In brief, words that are found to be key (i.e. statistically frequent) in the writings of Whitehouse (relative to a reference corpus23) are classified according to McEnery’s moral panic categories.24 These categories are heavily influenced by the moral panic theory of the sociologist, Stanley Cohen.
Indeed, they capture the discourse roles thought to typify moral panic discourse, including ‘object of offence’, ‘scapegoat’, ‘moral entrepreneur’, ‘corrective action’, ‘consequence’, ‘desired outcome’ and ‘rhetoric’. Some of the categories are also sub-classified according to pertinent semantic fields: for example, the scapegoat category contains the semantic fields of ‘people’, ‘research’, ‘broadcast programmes’, ‘media’, ‘media organisations and officers’ and ‘groups’. These semantic fields have been generated using a ‘bottom-up’ approach: that is to say, they have been constructed by McEnery, rather than being automatically identified by a text analysis tool (see Baker, Chapter 8, and Archer et al., Chapter 9, for a useful comparison with the ‘bottom-up’ approach). Using this procedure, McEnery is able not only to capture keywords that help establish the aboutness of the moral panic, but also to determine those words (like ‘violence’) that are actually key keywords, i.e. are key in a number of related texts (as well as moral panic categories) in the corpus.25

21 M. Oakes, Statistics for Corpus Linguistics (Edinburgh: Edinburgh University Press, 1998), p. 176.
22 A.M. McEnery, Swearing in English: Bad Language, Purity and Power from 1586 to the Present (Routledge, 2005).
23 McEnery opted to use the Lancaster-Oslo-Bergen (LOB) corpus as his reference corpus. The LOB captures 15 text categories, containing 500 printed texts of British English (approximately 2,000 words each) all of which were produced in 1961.
24 Ibid. See also A.M. McEnery, ‘The Moral Panic about Bad Language in England, 1691–1745’, Journal of Historical Pragmatics, 7/1 (2006): 89–113.

McEnery’s chapter is a good example of the benefits to be gained from combining a keywords methodology with other theories (linguistic and non-linguistic) – not least because it demonstrates the usefulness of corpus linguistic techniques beyond linguistics. In addition, McEnery is one of several authors in this edited collection (see, for example, Baker, Chapter 8, and Archer et al., Chapter 9) who seek to combine a quantitative approach to text analysis with a qualitative approach. Indeed, McEnery specifically focuses on the issue of bad language and, in particular, how bad language was represented by Whitehouse’s organization VALA (Viewers and Listeners’ Association), through an investigation of the collocations and colligations of several of the more prominent key keywords in Whitehouse’s books. Moreover, he discusses those findings not only in respect of their semantic importance, but also in respect of their ideological significance. He argues, for example, that the key keywords within the corrective action category, ‘parents’ and ‘responsible’, serve to generate in- and out-groups, the former being regarded as serious, reasonable and selfless, and the latter, as the antithesis of these qualities.
In addition, a closer inspection of the key keywords in context (using a concordancer) suggests that the in/out-group distinction is heavily related to a dichotomy between (religious) conservatism and liberalism, which, in turn, is comparable to the opposition to bad language voiced by seventeenth-century religious organizations.

25 McEnery suggests that the key keywords approach is especially useful when one is working with large volumes of data (and the volume is such that the number of keywords generated is overwhelming). Key keywords are also useful if the transience of particular keywords may be an issue.

Keywords, fox hunting and the House of Commons

Paul Baker is the third author in this edited collection to utilize the keywords facility in WordSmith Tools – in this case, to examine a small corpus of debates on fox hunting (totalling 130,000 words). The debates took place in the (British) House of Commons in 2002 and 2003, prior to a ban being implemented in 2005. For the purposes of this study, Baker split the corpus into two sub-corpora (depending on whether speakers argue for or against fox hunting to be banned) so that they could be compared with each other, rather than with a more general reference text. The bulk of Baker’s chapter is dedicated to a discussion of the different discourses (or ways of looking at the world) that speakers access in order to persuade others of their point of view, which Baker identifies using concordance analyses of pertinent keywords. For example, he notes how the pro-hunt speakers overused ‘people’, relative to the anti-hunt speakers. Moreover, they tended to use the term to identify those:

• who would be adversely affected by the ban if it was implemented (because of losing their jobs and/or their communities and/or facing the possibility of a prison sentence, if they opted to ignore the ban), and
• who do not hunt, but were not upset or concerned by those who do.
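Comparing two sub-corpora with each other, as described above, amounts to asking whether a word is used more often in one than its overall frequency would lead us to expect. The sketch below uses the log-likelihood statistic, which is one of the measures commonly used for keyness; that choice is an assumption made for illustration, since the summary above does not say which statistic was applied, and the function names are hypothetical.

```python
import math


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Log-likelihood keyness of a word across two (sub-)corpora.

    freq_a, freq_b   -- occurrences of the word in corpus A and corpus B
    total_a, total_b -- total number of running words in A and B
    """
    combined = freq_a + freq_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def overused_in_a(freq_a, total_a, freq_b, total_b):
    """True if the word is relatively more frequent in A than in B."""
    return freq_a / total_a > freq_b / total_b


# Invented counts, for illustration only (not Baker's figures).
print(log_likelihood(300, 70_000, 150, 60_000))
print(overused_in_a(300, 70_000, 150, 60_000))
```

The statistic itself has no direction, so a separate comparison of relative frequencies is needed to say which sub-corpus overuses the word.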

In addition, the pro-hunt speakers also utilized the keywords ‘fellow’, ‘citizens’, ‘Britain’ and ‘freedom’. The first two occurred together as a noun phrase, ‘fellow citizens’, and when used as such were preceded in all cases by a first person possessive pronoun (‘my’ or ‘our’). Baker argues that the pro-hunt speakers could thus be seen to use an hegemonic rhetorical strategy to intimate that it was they (and not their opponents) who were able to speak for and with the people of Britain.

Baker also explores additional ways of using keyness to find salient language differences in texts, including the identification of key semantic categories (also referred to as key domains). A tool that enables such analysis to be undertaken automatically is the UCREL Semantic Analysis System (henceforth USAS, also referred to as the UCREL Semantic Annotation System). Developed at Lancaster University, USAS consists of a part-of-speech tagger, which utilizes CLAWS (the Constituent Likelihood Automatic Word-tagging System), and a semantic tagger that, at its conception, was loosely based on McArthur’s Longman Lexicon of Contemporary English,26 but has since been revised in the light of practical application.27 Currently, the semantic tagset consists of 21 macro categories that expand into 232 semantic fields (see Appendix 2). Once again, Baker focuses on just a few of the most salient key semantic categories. For example, he points out how the semantic category ‘S1.2.6 sensible’ is overused by the pro-hunt speakers in the parliamentary debates (relative to the anti-hunt speakers): words like ‘sensible’, ‘reasonable’, ‘common sense’ and ‘rational’ are used when discussing the reasons for keeping hunting, and ‘ridiculous’, ‘illogical’ and ‘absurd’ when describing the proposed ban on hunting, which prompts Baker to suggest that this may be another example of their hegemonic rhetorical strategy (i.e. presenting one’s view of the world as ‘right’ or ‘common sense’).

26 T. McArthur, Longman Lexicon of Contemporary English (Longman, 1981).
27 See, for example, A. Wilson and J. Thomas, ‘Semantic Annotation’, in R. Garside, G. Leech and A. McEnery (eds), Corpus Annotation: Linguistic Information from Computer Texts (Longman, 1997), pp. 55–65; P. Rayson, D. Archer, S.L. Piao and T. McEnery, ‘The UCREL Semantic Analysis System’, Proceedings of the Workshop on Beyond Named Entity Recognition: Semantic Labelling for NLP Tasks, in association with the Fourth International Conference on Language Resources and Evaluation (LREC, 2004), pp. 7–12.
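The step from keywords to key domains can be pictured as counting semantic tags rather than word forms. The sketch below assumes a toy lexicon that maps a handful of words to the ‘S1.2.6’ label mentioned above; the lexicon, the `UNMATCHED` label and the function name are hypothetical illustrations and do not reproduce the USAS lexicon or its interface.

```python
from collections import Counter

# Toy lexicon, invented for illustration: word form -> semantic tag.
TOY_LEXICON = {
    "sensible": "S1.2.6",
    "reasonable": "S1.2.6",
    "rational": "S1.2.6",
    "ridiculous": "S1.2.6",
    "illogical": "S1.2.6",
    "absurd": "S1.2.6",
}


def domain_frequencies(tokens, lexicon, unmatched="UNMATCHED"):
    """Count tokens per semantic tag instead of per word form."""
    return Counter(lexicon.get(token.lower(), unmatched) for token in tokens)


tokens = "it is sensible and reasonable to keep hunting and absurd to ban it".split()
counts = domain_frequencies(tokens, TOY_LEXICON)
print(counts["S1.2.6"], counts["UNMATCHED"])
```

Each of the three tagged words occurs only once in the sample sentence, yet together they give the domain a count of three. The resulting domain counts for two sub-corpora could then be compared in the same way as word counts.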


Baker concludes by suggesting that keywords offer a potentially useful way of focussing researcher attention on aspects of a text or corpus, but that care should be taken not to over-focus on difference/presence at the expense of similarity/absence. He also suggests that the best means of gaining the fullest possible picture of the aboutness of text(s) is to use multiple reference corpora. For example, one might wish to compare texts of the same type against a (larger) corpus of (more) general language usage as a means of capturing those words that, because they are typical of the text-type or genre, may be too similar to show up as keywords in a same text-type comparison (see my discussion of Romeo and Juliet under ‘Explaining Frequency and Keyword Analysis’).

An exploration of key domains in Shakespeare’s comedies and tragedies

Archer, Culpeper and Rayson also utilize USAS, in this case to explore the concept of love in three Shakespearean love-tragedies (Othello, Antony and Cleopatra and Romeo and Juliet) and three Shakespearean love-comedies (A Midsummer Night’s Dream, The Two Gentlemen of Verona and As You Like It). Their aim is to add a further dimension to approaches that:

• use corpus linguistic methodologies such as keyword analysis to study Shakespeare,28 by systematically taking account of the semantic relationships between keywords through an investigation of key domains; and
• study Shakespeare from the perspective of cognitive metaphor theory,29 by providing empirical support for some of the love-related conceptual metaphors put forward by cognitive metaphor theorists.

In brief, their top-down30 approach involves determining how love is presented in the two datasets and then highlighting any resemblances between their findings and the conceptual metaphors identified by cognitive metaphor theorists. They also discuss how the semantic field of love co-occurs with different domains in the two datasets, and assess the implications this has for our understanding of the concept of love.

As the original USAS system is designed to undertake the automatic semantic analysis of present-day English language, they have opted to utilize the historical version of the tagger. Developed by Archer and Rayson, the historical tagger includes supplementary historical dictionaries to reflect changes in meaning over time and a pre-processing step to detect variant (i.e. non-modern) spellings.31 The inclusion of a variant detector is important when automatically annotating historical texts as it means that variant spellings can be mapped to spellings that the text analysis tool can recognize; this, in turn, means that standard corpus linguistic methods (frequency profiling, concordancing, keyword analysis, etc.) are more effective (see Kay, Chapter 5).

The taxonomy of the historical tagger is the same, at present. However, Archer et al. are using studies such as this to evaluate its suitability for the Early Modern English period.32 Indeed, they comment on the semantic domains that seem to capture the data well in their chapter, whilst also pointing out semantic domains that do not work as well. For example, they explain how the overuse of L3 ‘Plants’ in the love-comedies (relative to the love-tragedies) can be explained in large part by ‘Mustardseed’ (a character’s name) and ‘flower’ (part of the phrase, ‘Cupid’s flower’, i.e. the flower that Oberon used to send Titania to sleep in A Midsummer Night’s Dream). In addition, the bulk of the remaining items in the L3 category capture features of the setting (for As You Like It and A Midsummer Night’s Dream are set in the woods). Nevertheless, even within the L3 category, there are items which have a strong metaphorical association with ‘love’ or ‘sex’. By way of illustration, in As You Like It, Silvius uses an agricultural metaphor (‘crop’, ‘glean’, ‘harvest’, ‘reaps’) to confirm that he is prepared to have Phoebe as a wife in spite of her less-than-virginal state. According to Oncins-Martínez,33 the ‘sex is agriculture’ metaphor and its sub-mappings (‘a woman’s body is agricultural land’, ‘copulation is ploughing or sowing’, etc.) were common in the Early Modern English period.

28 See, for example, Culpeper, ‘Computers, Language and Characterisation’.
29 See, for example, D.C. Freeman, ‘“Catch[ing] the nearest way”: Macbeth and Cognitive Metaphor’, Journal of Pragmatics, 24 (1995): 689–708.
30 Top-down captures the fact that the categories are pre-defined and applied automatically by USAS.
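The effect of the variant-detection step mentioned above can be sketched very simply: variant spellings are mapped to a modern form before any counting takes place, so that their frequencies are pooled. The mapping below is a tiny invented table and the function name is hypothetical; it does not reproduce the UCREL variant detector or its word lists.

```python
from collections import Counter

# Invented variant -> modern spelling table, for illustration only.
TOY_VARIANTS = {
    "loue": "love",
    "haue": "have",
    "vertue": "virtue",
}


def normalise(tokens, variants):
    """Replace each known variant spelling with its modern equivalent."""
    return [variants.get(token.lower(), token) for token in tokens]


tokens = ["I", "loue", "thee", "and", "love", "thee", "still"]

before = Counter(tokens)
after = Counter(normalise(tokens, TOY_VARIANTS))
print(before["love"], after["love"])
```

Without the mapping, ‘loue’ and ‘love’ are counted as two separate items of frequency one; with it, a single item of frequency two is available to frequency profiling or keyword analysis.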
As Archer et al.’s findings demonstrate, then, a keyness analysis does not merely capture aboutness; it can also uncover metaphorical usage, as in this case, or character traits, as in the case of Culpeper,34 discussed above. In addition, their approach can confirm – and also suggest amendments to – existing conceptual metaphors. By way of illustration, they suggest that the container idea within Barcelona Sánchez’s35 ‘eyes are containers for superficial love’ (which, in itself, is a development of Lakoff and Johnson’s36 ‘eyes are containers for the emotions’) is not clearly articulated in the (comedy) data, and that the latter would be better captured by the conceptual metaphor, ‘eyes are weapons of entrapment’. This particular finding is made possible because of their innovative analysis of key collocates at the domain level, using Scott Piao’s Multilingual Corpus Toolkit.37

Like Baker, Archer et al. believe that key domains can capture words that, because of their low (comparative) frequency, would not be identified as keywords in and of themselves.38 However, they are acutely aware that the USAS process is an automatic one, and so will mis-tag words on occasion. Archer et al. therefore suggest that researchers thoroughly check the results of such processes, using a manual examination of concordance lines to determine their contextual relevance. By way of illustration, they comment on the occurrence of ‘deer’, which is assigned to the category L2 ‘Living creatures’ by USAS. Archer et al. found that ‘deer’ (like many items assigned to L2) was used metaphorically, and can be captured by the conceptual metaphor ‘love is a living being’ and the related metaphor ‘the object of love is an animal’. When the concordance lines of these items were checked, they discovered that, although correctly assigned, the bulk of them had strong negative associations, semantically speaking. This finding contrasts with the items that Barcelona Sánchez39 discusses in respect of Romeo and Juliet. Indeed, even the ‘deer’ example is problematic: it is linked to cuckoldry in many of Shakespeare’s plays (e.g. Love’s Labour’s Lost, The Merry Wives of Windsor) and may indicate that it, too, had negative undertones for both Shakespeare and his audience.40

31 D. Archer, T. McEnery, P. Rayson and A. Hardie, ‘Developing an Automated Semantic Analysis System for Early Modern English’, in D. Archer, P. Rayson, A. Wilson and T. McEnery (eds), Proceedings of the Corpus Linguistics 2003 Conference, UCREL Technical Paper Number 16 (Lancaster: UCREL, 2003), pp. 22–31; P. Rayson, D. Archer and N. Smith, ‘VARD Versus Word: A Comparison of the UCREL Variant Detector and Modern Spell Checkers on English Historical Corpora’, Proceedings of the Corpus Linguistics Conference Series On-Line E-Journal, 1/1 (2005).
32 Archer and Rayson are also exploring the feasibility of mapping the USAS tagset to, first, the categories utilized by Spevack, in his A Thesaurus of Shakespeare and, then, to the Historical Thesaurus of English.
33 J.L. Oncins-Martínez, ‘Notes on the Metaphorical Basis of Sexual Language in Early Modern English’, in J.G. Vázquez-González et al. (eds), The Historical Linguistics-Cognitive Linguistics Interface (Huelva: University of Huelva Press, 2006).
34 Culpeper, ‘Computers, Language and Characterisation’.
35 A. Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, Journal of Pragmatics, 24 (1995): 667–88, 679.
36 G. Lakoff and M. Johnson, Metaphors We Live By (Chicago and New York: University of Chicago Press, 1980).
37 S.L. Piao, A. Wilson and T. McEnery, ‘A Multilingual Corpus Toolkit’, paper given at AAACL–2002, Indianapolis, Indiana, USA, 2002.
38 Given many authors/(public) speakers seek to avoid unnecessary repetition by using alternatives to a given word, I would suggest that key domain analysis provides us with a useful means of capturing low frequency words that (although not key in and of themselves) do become ‘key’ when viewed alongside terms with similar meaning (see Rayson 2003, 100–113, for a more detailed exploration of the advantages of the key domains approach).
39 Barcelona Sánchez, ‘Metaphorical Models of Romantic Love in Romeo and Juliet’, p. 683.
40 Culpeper’s investigation is another useful reminder of the importance of checking – as a means of contextualizing – any generated keywords (or key domains). For Culpeper found that some of the nurse’s keywords in Romeo and Juliet (‘god’, ‘warrant’, ‘faith’, ‘marry’, ‘ah’) did not relate to her character at all – or to aboutness for that matter. Rather, they were surge features (or outbursts of emotion), which occurred at points in the play when the nurse was reacting to traumatic events (involving Juliet, in particular). Culpeper, ‘Computers, Language and Characterisation’.
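The manual check of concordance lines recommended above depends on seeing each occurrence of a word with a little of its surrounding text. A minimal keyword-in-context sketch is given below; the function name, the window size and the sample sentence are hypothetical, and a dedicated concordancer offers far more (sorting, wider context, links back to the source text).

```python
def concordance(tokens, node, width=3):
    """Return (left context, node word, right context) for each hit.

    `width` is the number of tokens shown on either side of the node word.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sentence, for illustration only.
sample = "the hunter chased the deer through the wood and the deer fled".split()

for left, node, right in concordance(sample, "deer", width=2):
    print(f"{left:>12} | {node} | {right}")
```

Reading the hits in context is what allows the researcher to decide whether an automatically assigned tag, although formally correct, carries associations that the tag alone does not show.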


Their final sentence is devoted to a call for quantitative analysis to be combined with qualitative analysis. For, like Baker (Chapter 8), they recognize that it is the researcher who must determine their cut-off points in respect of (contextual) salience in the final instance. Indeed, how the researcher chooses to interpret the data is probably the most important aspect of corpus-based research.

Promoting the wider use of word frequency and keyword extraction

In this final chapter, I report on several AHRC ICT Methods Network promotional events (some of which were inspired by the Expert Seminar in Linguistics) that have helped to bring frequency and keyword extraction techniques to a wider community of users. I also address ways in which we might promote word frequency and keyword extraction techniques to an even wider community than we have at present (commercial and academic). In particular, I stress the need for (ongoing) dialogue, so that:

• the keyword extraction community can discover what it is that other research communities are interested in finding out, and then determine how their tools might help them to do so; and
• ‘other’ research communities keep the keyword extraction community informed of (the successes and failures of) research that makes use of text mining techniques, which will allow the latter, in turn, to improve (the functionality of) their text analysis tools further.

How to use this book

I have deliberately incorporated detailed summaries of the contributors’ chapters in this introductory chapter so that readers can ‘pick and choose’ those chapters that seem most relevant to their interests. That said, I would encourage readers with the time and inclination to read the edited collection as a whole, so that they gain a better sense of the different issues that must be considered if we are to utilize word frequency and keyword extraction techniques successfully. The most important message of this edited collection, however, is that the researcher who engages in word frequency/keyword analysis has at their disposal a relatively objective means of uncovering lexical salience/(frequency) patterns that invite – and frequently repay – further qualitative investigation.


Chapter 2

Word Frequency Use or Misuse?

John M. Kirk

Introduction

In this chapter, I shall not be concerned with statistical treatments of word frequency beyond percentage distributions and relativized frequencies per thousand(s) or million(s) words. My primary concern will be frequency as a property of data, and I shall take a critical look at statements like ‘each text comprises 2,000 words’. I shall be concerned with words as tokens, types and lemmatized types; the range of functions and meanings of words; and words and lexemes; and I shall consider words of low frequency as well as of high frequency. In a critical section, I shall ask whether word frequencies are self-explanatory or need explanation, and whether approximation is as useful as precision. I shall refer to a range of well-known corpora of English as well as the three corpora which I have compiled: the Corpus of Dramatic Texts in Scots, the Northern Ireland Transcribed Corpus of Speech (NITCS), and the Irish component of the International Corpus of English (ICE-Ireland). I also wish to discuss, briefly, the following claims:

• Word frequency is the placing of numbers on language or the representation of language through numbers.
• Word frequency provides an instantiation of the claim that ‘linguistics is the scientific study of language’.
• Word frequency promises precision and objectivity whereas the outcome tends to be imprecision and relativity.
• Word frequency is not an end in itself but needs interpretation through contextualization, whence the relativity and comparison.
• Word frequency is not a science but a methodology, which lends itself to replicability.

One of the aims of this chapter is to deconstruct statements of the following type: ‘each text contains (approximately) 2,000 words’, in which there are two issues: the concept (word) and the number (2,000).
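The relativized frequencies referred to in the introduction are obtained by scaling a raw count to a common base, so that texts or corpora of different sizes can be compared. The following is a minimal sketch; the function name and its default base are hypothetical choices made for illustration.

```python
def relative_frequency(count, corpus_size, per=1_000_000):
    """Scale a raw count to a frequency per `per` words.

    per=100 gives a percentage, per=1_000 a rate per thousand words and
    per=1_000_000 a rate per million words.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * per / corpus_size


# 42 occurrences in a text of 2,000 words (illustrative figures).
print(relative_frequency(42, 2_000, per=1_000))
print(relative_frequency(42, 2_000, per=100))
```

The result is only as reliable as the denominator: if ‘2,000 words’ is itself an approximation, or rests on a particular definition of ‘word’, the relativized figure inherits that imprecision.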


Classes of words

Of the many subclassifications of words, one which might suit our present purposes is the taxonomy proposed by McArthur1 which offers eight possible word classes:

1. The orthographic word
2. The phonological word
3. The morphological word
4. The lexical word
5. The grammatical word
6. The onomastic word
7. The lexicographical word
8. The statistical word

To this list, I wish to add a further two classes:

9. The numeral word
10. The discourse word

Of these eight or ten types, it is class eight – the statistical word – which is usually associated with the notion of word frequency. McArthur provides the following definition:

word in terms of occurrences in texts is embodied in such instructions as ‘count all the words on the page’: that is, count each letter or group of letters preceded and followed by a white space. This instruction may or may not include numbers, codes, names, and abbreviations, all of which are not necessarily part of the everyday conception of ‘word’. Whatever routine is followed, the counter deals in tokens or instances and as the count is being made the emerging list turns tokens into types: for example, there could be 42 tokens of the type the on a page, and four tokens of the type dog. Both the tokens and the types however are unreflectingly spoken of as words.2

Statistical words are words or any string of characters bounded by space which can be counted by a computer. No other distinction is made. Such words are regarded as word ‘types’.

1 T. McArthur, ‘What is a Word?’, in T. McArthur (ed.), Living Words: Language, Lexicography and the Knowledge Revolution (Exeter: Exeter University Press, 1999 [1992]). 2 OCEL 1992, reprinted in McArthur, ‘What is a Word?’, p. 47, my emphases.
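The statistical word as defined above, any string of characters bounded by white space, can be counted in a few lines. The sketch below is an illustration only; the function names and the sample sentence are hypothetical.

```python
from collections import Counter


def statistical_words(text):
    """Split on white space only; no other distinction is made."""
    return text.split()


def type_frequencies(tokens):
    """Turn a list of tokens into a frequency list of types."""
    return Counter(tokens)


tokens = statistical_words("the dog saw the other dog")
types = type_frequencies(tokens)

print(len(tokens), "tokens;", len(types), "types")
print(types["the"], types["dog"])
```

Because nothing but white space is consulted, ‘Dog’, ‘dog’ and ‘dog.’ would count as three different types under this routine, which is precisely the kind of decision that a bare word count conceals.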


When the statistical word test is applied to ICE-Ireland,3 what frequency precision do we find? For the present, all figures are based on the beta version of the spoken component. It is regularly stated that the spoken component of an ICE corpus comprises 300 texts each of 2,000 words, thus amounting to 600,000 words in total. In the case of ICE-Ireland, the total is 623,351 words comprising 300 texts ranging from 960 to 2,840 words each. Whereas these totals already exclude markup, they still include X-corpus, editorial comments and partial words (marked up as … and underlined here for presentation), as shown in (1) and (2):

1. Uhm Marie-Louise and I were in you know the Bang and Oluf What is it Olufsen

2. And uh like three thousand eight hundred And there was another one at four hu four thousand two hundred and something
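The gap between the nominal and the actual size of the spoken component can be made explicit with a little arithmetic on the figures quoted above. The variable names are hypothetical; only the three input figures come from the text.

```python
# Figures quoted in the text for the spoken component of ICE-Ireland (beta version).
TEXTS = 300
NOMINAL_WORDS_PER_TEXT = 2_000
ACTUAL_TOTAL = 623_351

nominal_total = TEXTS * NOMINAL_WORDS_PER_TEXT
mean_text_length = ACTUAL_TOTAL / TEXTS
excess = ACTUAL_TOTAL - nominal_total
excess_percent = 100 * excess / nominal_total

print(nominal_total, round(mean_text_length), excess, round(excess_percent, 2))
```

On these figures the average text is closer to 2,078 than to 2,000 statistical words, and the component as a whole exceeds its nominal size by a little under four per cent, before any question is asked about what kind of ‘word’ is being counted.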

The question thus arises whether, in terms of McArthur’s taxonomy, those 623,351 statistical words are also 623,351 orthographic words, 623,351 phonological words, or even 623,351 morphological words. They are not 623,351 lexical words (in the sense of lexical types), even less 623,351 lexemes (in terms of which ‘die’, ‘pass on’ and ‘kick the bucket’ may be considered single lexemes). Let us consider briefly each type of word in turn.

The orthographic word

One instance of an orthographic word is where the word has dual spellings, as in:

airplane, aeroplane; esthetic, aesthetic; archeology, archaeology; connection, connexion; counselor, counsellor; gray, grey; instill, instil; jeweler, jeweller; jewelry, jewellery; libelous, libellous; marvelous, marvellous; mollusk, mollusc; mustache, moustache; panelist, panellist; paralyze, paralyse; analyze, analyse; pajamas, pyjamas; skeptic, sceptic; color, colour; honor, honour; labor, labour; traveler, traveller; traveling, travelling; willful, wilful; woolen, woollen.

These are well-known standardized instances of dual spellings which, as a result of institutionalization, are regarded as ‘American’ or ‘British’. When we investigated those spellings in the written texts of ICE-Ireland, all but a few of which had been published in Ireland, we found that ICE-Ireland is actually more British than ICE-GB, as shown in Table 2.1.

3 ICE-Ireland is the abbreviated name for The Irish component of the International Corpus of English. See J.M. Kirk, J.L. Kallen, O. Lowry, A. Rooney and M. Mannion, The ICE-Ireland Corpus: The International Corpus of English: The Ireland Component (CD) (ICE-Ireland Project: Queen’s University Belfast, 2005 (beta version)).


Table 2.1 Verbal spellings in ‘–ise’ and ‘–ize’

          ICE-NI   ICE-ROI   ICE-GB   Total
‘–ise’    17       9         35       61
‘–ize’    2        1         12       15
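The raw counts in Table 2.1 are easier to compare once each is expressed as a share of all ‘–ise’/‘–ize’ spellings in the same corpus. The sketch below uses only the figures in the table; the data structure and function names are hypothetical.

```python
# Counts taken from Table 2.1.
SPELLINGS = {
    "ICE-NI": {"ise": 17, "ize": 2},
    "ICE-ROI": {"ise": 9, "ize": 1},
    "ICE-GB": {"ise": 35, "ize": 12},
}


def ise_share(counts):
    """Percentage of '-ise' spellings among all '-ise'/'-ize' spellings."""
    return 100 * counts["ise"] / (counts["ise"] + counts["ize"])


def combine(*names):
    """Add up the counts of several sub-corpora."""
    return {
        form: sum(SPELLINGS[name][form] for name in names)
        for form in ("ise", "ize")
    }


for name, counts in SPELLINGS.items():
    print(name, round(ise_share(counts), 1))

# ICE-Ireland as a whole: northern and southern components together.
print("ICE-Ireland", round(ise_share(combine("ICE-NI", "ICE-ROI")), 1))
```

On these counts roughly nine in ten of the relevant spellings in each Irish component are in ‘–ise’, against about three in four in ICE-GB. The absolute numbers are small, so the percentages should be read with corresponding caution.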

Dialect words present another particular instance of the orthographic word as many such words have survived in oral currency and have never had a standardized written form. In Ireland, there are many words for the national crop, the humble potato, which can be listed under the headword ‘potato’, as in the Concise Ulster Dictionary:

Potato: the national crop in all parts of Ireland: potato, pitatie, pirtie, pirta, purta, purty, pitter, porie, pratie, praitie, prae, prata, prater, pritta, pritty, pruta, poota, tater, tattie, totie. (Hiberno-English forms are recorded as pratie, praitie, etc.; Scots forms as pitatie, tattie, tottie; and a southern English form as tater).

For other words, there is no agreed standardized form, as in the various forms of the dialect word for ‘embers’ borrowed from the Irish word ‘griosach’: greeshoch, greesagh, greesach, greesay, greeshagh, greeshaugh, greeshaw, greesha, greshia, greeshy, greesh, grushaw, greeshog, greesog, greeshock

Some words are harder to identify. The word for ‘twilight’ or ‘dusk’ is ‘dailygan’ in the Scots dictionary of Ulster, but the Concise Ulster Dictionary lists:

daylight going, daylit goin, dayligoin, daylight gone, dayligone, dailagone, dailygan, dayligane, dayagone, dayligo

making it unclear whether the underlying base form is ‘daylight going’ or ‘daylight gone’. As statistical words, these orthographic words would be counted separately – as types – whereas they merely represent various pronunciation variants of the same lexical type. Each of these three lists presents only one lexical type.

The BBC is currently running a nationwide dialect project called Voices. It falls into this same trap of counting orthographic variants as separate – it goes so far as to say unique – words. The Voices website4 states (in September 2005):

The Word Map has been highly successful; an initial look at the data suggests 32,000 users have registered […] 23,000-odd unique words (including spelling variations) …

4 .

That list shows that ‘stunnin’, ‘stunnin’’ and ‘stunning’ or ‘smashin’, ‘smashin’’ and ‘smashing’ were counted as three ‘words’ in each set; ‘phaor’, ‘phoar’, ‘pheoar’, ‘phwaar’, ‘phwoaar’, ‘phwor’, ‘phwooooaar’ are each counted as separate ‘words’, etc. Such lists, albeit on a selective basis, are available (August 2007) online.5 With regard to the issue of word frequency, orthographic words present as many difficulties as statistical words.

The phonological word

Phonological words are conceivable as several subtypes: vocalized words (as the set of ‘phoar’ words in the preceding section indicates), partial words (the initial segments of a word but not the complete word, as in (1) and (2) above), orthographic or pronunciation-variable words (presenting different pronunciation variables, as in ‘economics’ or ‘tomato’, or because of shifting stress positions in ‘controversy’, or in the dialect forms above), syllabic words (e.g. ‘gonna’, ‘hadda’, ‘musta’, ‘needti’, ‘wanna’, etc.), or even clausal or intonation-unit words (e.g. ‘spindona’, ‘spindona’, ‘gerritupyi’, etc.). In these ways, phonological words either become orthographic words (which in turn become statistical words) or appear as conflated words which, if counted as statistical words, under-represent the actual total. There is no corpus of segments or syllables, although, interestingly, it is claimed by Crystal (2003) that 25 per cent of speech is made up of only 12 syllables. So with regard to the issue of word frequency, phonological words present many other difficulties too.

The morphological word

Morphological words may be lexical or grammatical words. First, let us consider lexical morphemes. The prefixes ‘cyber–’, ‘e–’, ‘eco–’, ‘euro–’, etc. all became frequent as a result of change in technology, politics or attitudes to the environment. The sudden increase of use of such forms helps to construct the discourses about these new realities. As the Oxford Dictionary of Ologies and Isms shows, many prefixes and suffixes are specific to particular domains – even linguistics can claim ‘glosso–’, ‘grapho–’, ‘logo–’, ‘semio–’, ‘Slavo–’ as prefixes and ‘–eme’, ‘–gram’, ‘–graphy’, ‘–lect’, ‘–lepsis’, ‘–logue’, ‘–onym’, ‘–phasia’, ‘–speak’ and ‘–word’ as suffixes.

5 .


In the southern component of ICE-Ireland, we discovered that clipped words with the suffix ‘–o’ marked colloquial speech, perhaps even slang:6

Defos     S   Slang: ‘definites’ (used especially in replies)
Invos     S   Slang: ‘invitations’
Morto     S   Slang: ‘mortified’
Séamo     S   Form of Séamus
Smarmo    S   In ICE as interjection (< smarmy)

Other forms were:

Relies    S   Slang: ‘relatives’
Sca       S   Slang: ‘news, gossip’ (< scandal)

There are no such forms in ICE-GB.7 Even if the absolute numbers are few, their presence in one corpus and not in another may be interpreted as significant – indicative of innovating colloquialisms, possibly slang words. The more lexical items adopting this ‘–o’ suffix, the more the pattern becomes established. Frequency can thus reveal cultural innovation.

Grammatical morphemes offer numerous challenges. Some are mere variants of a single form, sometimes conditioned by external factors, such as dialect contact in the case of the past tense form of ‘bring’ as ‘brought’ or ‘brung’, or a negated form of ‘could’ as ‘couldn’t’ and ‘couldnae’. ICE-Ireland has six instances of ‘gotten’ alongside ‘got’, each with a clear dynamic meaning. Grammatical variants and grammatical innovations may be interpreted in terms of external contexts, but they may also be indicative of changes in the particular sub-system itself. The form ‘gonna’ may be seen as the output of a grammaticalized progressive ‘go’ construction, but only if ‘gonna’ is transcribed as such. In ICE-Ireland, it was not so transcribed – only the standard ‘going’ was used for every instance of progressive ‘go’, in stark contrast to the British National Corpus where its inclusion was left to the subjective preference of the audio-typists who were transcribing the tapes.

When the statistical word test is applied to grammatical words, the result can be confusing. ‘Is’ and ‘was’ are often shown to rank among the most frequent words, but they are only verb forms; they are neither verb types nor the most frequent verbs – for that we need the total of all forms of BE. Although Table 2.2 presents frequencies of individual forms of ‘be’ in three spoken corpora, which show some consistency across corpora for each form, it does not show the frequency of BE itself. Frequencies of WILL require the sum of ‘’ll’, ‘will’, ‘won’t’ and whatever other spelling variants are used. So, when it comes to word frequency, much caution and qualification is needed concerning the frequency of grammatical words.

Table 2.2 Frequencies of BE forms

Form     ICE-IRL Spoken f.   LONDON-LUND f.   MILLER f.
‘–’s’    17.21               21.64            30.60
‘is’     9.46                10.45            7.05
‘was’    10.64               10.52            7.51
‘be’     6.15                5.46             3.44

6 See also R. Hickey, Dublin English: Evolution and Change, Varieties of English Around the World General Series, vol. 35 (Amsterdam and Philadelphia: John Benjamins Publishing Company, 2006), pp. 138–9.
7 The British component of the International Corpus of English (CD).

The lexical word

As already shown, lexical words are often mistaken for variants of their realization: phonological words, which are rendered in writing as orthographic words, or morphological words (particularly with different noun or verb forms). Much of the interest shown in lexical words surrounds them as lexical types, or lemmatized types, not as families of realizations. Even if we establish frequencies for lexical types – something the statistical word does not do – how are we to interpret the result? Of the many possible contexts, I raise only three here: semantic prosody, attitude raising, and constructions of identity or reality.

Semantic prosody

Following the pioneering work of Louw,8 the notion of semantic prosody is now generally accepted. ‘Utterly’ is regarded as having a negative prosody, i.e. it collocates with words expressing a negative meaning, so that, in ICE-Ireland, we find that prosody confirmed in the six examples: ‘utterly boring’, ‘utterly unacceptable’, ‘condemn utterly’ (x2) (ICE-Ireland).

8 B. Louw, ‘Irony in the Text or Insincerity in the Writer? The Diagnostic Potential of Semantic Prosodies’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology (Philadelphia/Amsterdam: John Benjamins, 1993).

Attitude raising

The common word ‘happy’ seems innocent enough until put into the literature for boy scouts and girl guides by Lord Baden-Powell, who urges that the purpose for girls in life is to make boys happy, whereas the purpose of boys in life is simply to be happy. The accumulation of overuse in those texts is shown by Stubbs9 to turn the word ‘happy’ into a sexist term.

Constructions of identity or reality

In a masterly study of keywords, Baker10 shows how gay identity is constructed very differently by different groups of people. For the House of Lords, key words for the pro-reformers were ‘law’, ‘rights’, ‘sexuality’, ‘reform’, ‘tolerance’, ‘orientation’, ‘sexual’, ‘human’, whereas key words for the anti-reformers were ‘buggery’, ‘anal’, ‘indecency’, ‘act’, ‘blood’, ‘intercourse’, ‘condom’. For the British tabloid press covering crimes on gay men, key words were ‘transiency’, ‘acts’, ‘crime’, ‘violence’, ‘secrecy’, ‘shame’, ‘shamelessness’, ‘promiscuity’. In contact ads, British gay men described themselves as ‘guy’, ‘bloke’, ‘slim’, ‘attractive’, ‘professional’, ‘young’, ‘tall’, ‘non-scene’, ‘good-looking’, ‘active’, ‘caring’, ‘sincere’. In gay fantasy literature, gay men are described as brutes (‘socks’, ‘sweat’, ‘beer’, ‘football’, ‘towel’, ‘team’) or emotionless machines (‘lubed’, ‘jacked’, ‘leaking’, ‘throb’, ‘throbbing’, ‘spurt’, ‘spurts’, ‘pumped’, ‘pumping’). In safer sex awareness leaflets distributed to gay men, gay men are described as animals, and gay sex as violent (‘grunted’, ‘groaned’, ‘grabbed’, ‘shoved’, ‘jerk’, ‘jerked’, ‘jerking’, ‘slapping’, ‘pain’); at the same time, gay men’s language is shown to be informal, non-standard and impolite (‘fucker’, ‘cocksucker’, ‘faggot/fag’, ‘stuff’, ‘yeah’, ‘shit’, ‘hell’, ‘fuckin’, ‘ain’t’, ‘wanna’, ‘gotta’, ‘gonna’, ‘’em’, ‘kinda’, ‘real’, ‘hey’, ‘damn’, ‘good’).
In each of these settings, it was the frequency of words occurring above the norm that created very different discourses in each context and foregrounded the perspective or point of view. Baker shows convincingly that, by studying word frequencies, common, everyday words like ‘human’ or ‘young’, when over-used in particular texts – and thus having a relatively high frequency – become keywords and agents in the creation of, and also discrimination between, those discourses. Similarly, ICE-Ireland creates Ireland – through the use and frequency of various classes of lexical words: dialect words, Irish loanwords, other words in ICE-IRL deemed ‘Irish’ and institutional words (many of them onomastic words)

9 M. Stubbs, Text and Corpus Analysis (Oxford: Blackwell, 1996), ch. 4.
10 P. Baker, Public Discourses of Gay Men, Routledge Advances in Corpus Linguistics, vol. 8 (London and New York: Routledge, 2005).


Table 2.3 Irish loanwords

Fleadh Gaeltacht Poitín Scór

Table 2.4


n,s s s s

t raditional music festival (< irish) irish-speaking district (< irish) illicit distilled spirits (< irish) ‘Tally’ (