English Pages 288 [290] Year 1996
MICHAEL STUBBS
Text and Corpus Analysis
Language in Society

GENERAL EDITOR
Peter Trudgill, Professor in the Department of Language and Linguistics, University of Lausanne

ADVISORY EDITORS
Ralph Fasold, Professor of Linguistics, Georgetown University
William Labov, Professor of Linguistics, University of Pennsylvania

1 Language and Social Psychology. Edited by Howard Giles and Robert N. St Clair
2 Language and Social Networks (Second Edition). Lesley Milroy
3 The Ethnography of Communication (Second Edition). Muriel Saville-Troike
4 Discourse Analysis. Michael Stubbs
5 The Sociolinguistics of Society: Introduction to Sociolinguistics, Volume I. Ralph Fasold
6 The Sociolinguistics of Language: Introduction to Sociolinguistics, Volume II. Ralph Fasold
7 The Language of Children and Adolescents: The Acquisition of Communicative Competence. Suzanne Romaine
8 Language, the Sexes and Society. Philip M. Smith
9 The Language of Advertising. Torben Vestergaard and Kim Schrøder
10 Dialects in Contact. Peter Trudgill
11 Pidgin and Creole Linguistics. Peter Mühlhäusler
12 Observing and Analysing Natural Language: A Critical Account of Sociolinguistic Method. Lesley Milroy
13 Bilingualism (Second Edition). Suzanne Romaine
14 Sociolinguistics and Second Language Acquisition. Dennis R. Preston
15 Pronouns and People: The Linguistic Construction of Social and Personal Identity. Peter Mühlhäusler and Rom Harré
16 Politically Speaking. John Wilson
17 The Language of the News Media. Allan Bell
18 Language, Society and the Elderly: Discourse, Identity and Ageing. Nikolas Coupland, Justine Coupland and Howard Giles
19 Linguistic Variation and Change. James Milroy
20 Principles of Linguistic Change. William Labov
21 Intercultural Communication: A Discourse Approach. Ron Scollon and Suzanne Wong Scollon
22 Sociolinguistic Theory: Linguistic Variation and its Social Significance. J. K. Chambers
23 Text and Corpus Analysis: Computer-assisted Studies of Language and Culture. Michael Stubbs

Text and Corpus Analysis Computer-assisted Studies of Language and Culture
Michael Stubbs
BLACKWELL Publishers
Copyright © Michael Stubbs 1996
The right of Michael Stubbs to be identified as author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

First published 1996

2 4 6 8 10 9 7 5 3 1

Blackwell Publishers Ltd
108 Cowley Road
Oxford OX4 1JF
UK

Blackwell Publishers Inc.
238 Main Street
Cambridge, Massachusetts 02142
USA

All rights reserved. Except for the quotation of short passages for the purposes of criticism and review, no part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher.

Except in the United States of America, this book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired out, or otherwise circulated without the publisher’s prior consent in any form of binding or cover other than that in which it is published and without a similar condition including this condition being imposed on the subsequent purchaser.

British Library Cataloguing in Publication Data
A CIP catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication Data
Stubbs, Michael, 1947–
Text and corpus analysis: computer-assisted studies of language and culture / Michael Stubbs.
p. cm. (Language in society; 23)
Includes bibliographical references and indexes.
ISBN 0-631-19511-4. ISBN 0-631-19512-2 (pbk.)
1. Discourse analysis. 2. Discourse analysis – Data processing. 3. Language and culture. 4. English language – Modality. I. Title. II. Series: Language in society (Oxford, England); 23.
P302.S773 1996
401'.41 dc20
95-31070 CIP

Typeset in 10½ on 12 pt Ehrhardt by Best-set Typesetter Ltd., Hong Kong
Printed in Great Britain
This book is printed on acid-free paper
Contents

Lists of Figures, Concordances and Tables
Series Editor’s Preface
Acknowledgements
Data Conventions and Terminology
Notes on Corpus Data and Software

Part I Concepts and History

1 Texts and Text Types
    1.1 The data
    1.2 The organization of the book
    1.3 A simple model: texts, production and reception
    1.4 Content, author and audience
    1.5 Texts and text types
    1.6 Text types and institutions
    1.7 Three studies
        1.7.1 Advertising
        1.7.2 Newspapers
        1.7.3 Scientific research articles
    1.8 Summary

2 British Traditions in Text Analysis: Firth, Halliday and Sinclair
    2.1 Principles: from Firth via Halliday to Sinclair
    2.2 Alternative traditions
    2.3 Linguistics as an applied social science
    2.4 Grammar and discourse: data and models
    2.5 Methodology: attested data and text corpora
    2.6 Form and meaning, lexis and grammar
    2.7 Routines of cultural transmission
    2.8 Beyond Saussurian dualisms
    2.9 Data, description and theory
    2.10 Description and theory, methods and applications
    Appendix: Notes on the intellectual background

3 Institutional Linguistics: Firth, Hill and Giddens
    3.1 Institutional facts
    3.2 Firth on social semantics
    3.3 Hill on institutional linguistics
    3.4 Dualisms of subject and object, micro and macro
    3.5 Giddens on the constitution of society
    3.6 Projects for an institutional linguistics
    3.7 Sexist language
    3.8 Spoken and written language and linguistic theory
        3.8.1 Dictionaries
        3.8.2 Literature
        3.8.3 School textbooks
        3.8.4 Vocabulary
    3.9 Lexical density: an analysis of two corpora
        3.9.1 Interpretation
        3.9.2 Text samples
    3.10 Conclusion
    Appendix: Further notes on the intellectual background

Part II Text and Corpus Analysis

4 Baden-Powell: a Comparative Analysis of Two Short Texts
    4.1 Organization of the chapter
    4.2 Baden-Powell
    4.3 The texts: production and reception
    4.4 Content analysis
    4.5 HAPPY and HAPPINESS
        4.5.1 Text G: Guides text
        4.5.2 Text S: Scouts text
    4.6 Comparative data
    4.7 Practical implications: sexist language
    4.8 Theoretical implications: meaning in texts and in language
    4.9 Educational implications
    4.10 Other analyses
    4.11 Summary
    Appendix: Baden-Powell’s last messages

5 Judging the Facts: an Analysis of One Text in Its Institutional Context
    5.1 Organization of the chapter
    5.2 Interactions in social institutions
    5.3 The linguistic encoding of facts
    5.4 Words and connotations
    5.5 The data
    5.6 Analysis of the data
        5.6.1 Length and discourse markers
        5.6.2 Connotations of individual words
        5.6.3 Modal verbs
        5.6.4 Presuppositions
        5.6.5 Syntactic complexity
    5.7 Conclusion
    Appendix: Other studies of courtroom language

6 Human and Inhuman Geography: a Comparative Analysis of Two Long Texts and a Corpus
    6.1 Organization of the chapter
    6.2 Introductory discussion
    6.3 The importance of comparative data
    6.4 Criticisms of discourse analysis
    6.5 Texts and text fragments
    6.6 Data and hypotheses
    6.7 Computer-assisted comparative analysis
    6.8 Example 1. Ergative verbs: the syntax of key words
        6.8.1 Definitions and terminology
        6.8.2 Data analysis
        6.8.3 Comparison of texts G and E
        6.8.4 Interpretation
        6.8.5 Comparison of texts G and E and LOB
        6.8.6 Interpretation
        6.8.7 Variation across verbs
        6.8.8 Comparison of texts G and E: passives
    6.9 Interpretative problems
        6.9.1 Further notes on ergatives
        6.9.2 Probabilistic patterns
        6.9.3 Another analysis: unexpected findings
    6.10 Example 2. Projecting clauses
        6.10.1 Definitions
        6.10.2 World-creating predicates
        6.10.3 Comparison of texts G and E
        6.10.4 Interpretation
    6.11 Some principles for text analysis
    6.12 Conclusion: computer-assisted studies
    Appendix: Further notes on ergativity

7 Keywords, Collocations and Culture: the Analysis of Word Meanings across Corpora
    7.1 Organization of the chapter
    7.2 Text and discourse: different senses of ‘discourse’
    7.3 Introductory example: language and nation
        7.3.1 Different discourses
    7.4 Firth on focal words
    7.5 Williams on keywords
    7.6 Other studies of cultural keywords
    7.7 The formal component: collocations
        7.7.1 Quantitative methods
    7.8 Other examples of collocations and semantic prosodies
    7.9 The sociological component: encoding culture in lexis
    7.10 Designing a dictionary of keywords in British culture
        7.10.1 Lifestyle and professionalization
        7.10.2 Example analyses
        7.10.3 Ambiguity of keywords
    7.11 Sample dictionary entries: culture and cultural
    7.12 Conclusion

8 Towards a Modal Grammar of English: a Matter of Prolonged Fieldwork
    8.1 Organization of the chapter
    8.2 Introductory example: propositional information
    8.3 The (limited) relevance of speech act theory
    8.4 Evidentiality, factivity, modality
    8.5 Summary
    8.6 Lexical, propositional and illocutionary commitment
    8.7 Degree and manner of commitment
    8.8 Modality and lexis
        8.8.1 Morphology and pragmatic information
        8.8.2 Lexical commitment
        8.8.3 Vague lexis
    8.9 Modality and illocutionary force
        8.9.1 Explicit illocutionary prefaces
        8.9.2 Two types of speech act
    8.10 Modality and the truth value of propositions
        8.10.1 Simple forms versus ing-forms of verbs
        8.10.2 Verb classes and uses
        8.10.3 Other parallels: can plus verb
        8.10.4 Other verbal forms
        8.10.5 Private verbs
        8.10.6 Logical and pragmatic connectors
    8.11 Modal grammar
    8.12 Applied linguistics
    8.13 Conclusion

9 The Classic Questions
    9.1 Language and corpora
    9.2 Language and thought
    9.3 Conclusion

Notes
References
Subject Index
Name Index

Figures, Concordances and Tables

Figure
Lexical density of 587 samples of written and spoken English

Concordances
Happy and happiness in Girl Guides text
Happy and happiness in Boy Scouts text
REEL, AGGRAVATE, etc. in Judge’s summing-up
MAY and MIGHT in Judge’s summing-up
Verb and noun CLOSE in one school book, sample only
Passives in one school book, small sample only

Tables
Two school books, all ergative verbs, all forms
Two school books and LOB, five ergative verbs, all forms
One school book, two ergative verbs
Two school books, projecting clauses
Two school books, attributed and non-attributed projecting clauses
Two school books, personal and impersonal projecting clauses

Series Editor’s Preface

It is one of the traditions of this series that its volumes have a strong orientation towards linguistic data. Our contention is that the best sorts of work in linguistics are those which, as our title ‘Language in Society’ suggests, are based on research employing as data instances of language as actually used by real people in their everyday lives. We would not deny that research based on linguists’ intuitions about their own native varieties has been responsible for very considerable theoretical progress. But, in the final analysis, if linguistics is not about language as it is actually spoken and written by human beings, then it is about nothing at all.

Michael Stubbs’s volume demonstrates this point with perhaps greater clarity than any previous volume in the series. Here is a work that is based not just on real language data rather than on intuitions, but on vast amounts of real language data. This computer-aided research, based on the immense linguistic corpora which are now available, gives Stubbs’s findings a degree of reliability that is unusual in linguistics, and reveals patterns of usage of which we previously had only a vague notion, or even no knowledge whatsoever.

Stubbs has also taken pains to situate his work in the British, Firthian tradition, which goes back many decades but has tended to be lost from view in recent years. It is probably not too much to claim that this book, relying as it does on the approach of British linguists such as Halliday and Sinclair, represents something of a renaissance of this tradition. It is clear, however, that it will appeal to all linguistics and sociolinguistics scholars, whatever their intellectual background, who would like to see the findings of linguistic science grounded in real language. Some will find this work controversial, but all will find it stimulating.

Stubbs covers a wide scale of topics, ranging from the relationship between syntax and pragmatics to the issue of sexism in language, but all his analyses are presented with the confidence that only corpora-based studies can provide. In keeping with the Firthian tradition, as well as the aims of this series, Stubbs also discusses the problem of transmitting without distortion linguists’ expert knowledge about language to those non-linguists who can benefit from it, in such a way that they can comprehend it. Stubbs has elsewhere shown himself to be very competent in this respect, and this book can only add to his stature in the field.

Peter Trudgill
University of Lausanne

Acknowledgements

I am grateful to many friends and colleagues who provided helpful criticism on earlier versions of this book. Philip Carpenter, Florian Coulmas and Norman Fairclough encouraged the project at an early stage. Judy Delin, Gabi Keck, Greg Myers and Joan Swann made detailed comments on a complete draft. Dwight Atkinson, Wolfram Bublitz, Joanna Channell, Jenny Cheshire, Gill Francis, Andrea Gerbig, Bob Hodge, Susan Hunston, Anthony Johnson, John Sinclair and Dick Watts made helpful comments on various chapters. The book is much better for their good advice. Andrea Gerbig also helped with some of the data analyses and provided text E (for chapter 6). And Anthony Johnson insisted, several times, that I should read Giddens’s work.

Work in corpus linguistics is always based on previous work by many people. Jeremy Clear and Tim Lane helped me to extract data from corpora at Cobuild. My student research assistants in Trier, Anja Helfrich and Susanne Jarczok, helped with text and corpus preparation; and Brigitte Grote, Oliver Jakobs and Oliver Hardt wrote programs for analysing lexical density, producing concordances and analysing collocations.

The texts of Baden-Powell’s last messages to the Girl Guides and the Boy Scouts were obtained from archivists of the Girl Guides Association and the Scout Association. I am most grateful to them for permission to reproduce the texts (in chapter 4), and also for details of their composition and publication. For permission to use the transcript of courtroom data (in chapter 5), I am grateful to the defendant in the case: for obvious reasons, I will not name him here. Simon & Schuster Education, Hemel Hempstead, UK, gave permission to store text G (in chapter 6) in computer readable form and to cite extracts; the text is The British Isles, copyright N. Punnett and P. Webber, Blackwell, 1984. Text E (in chapter 6) is The Ozone Message by D. Kinnear, P. Preuss and J. Rogers, Australian Conservation Foundation, 1989.

For permission to use corpus materials, I am most grateful to: the Norwegian Computing Centre for the Humanities; Longmans Group UK Ltd; and colleagues at the University of Birmingham and at Cobuild, especially Gwynneth Fox, John Sinclair and Malcolm Coulthard, for arranging access to the Cobuild corpus, the Bank of English.

I am grateful to publishers for permission to use material from published articles: in all cases the material has been extensively revised. Chapter 2 is considerably revised from an article in M. Baker et al., eds (1993) Text and Technology, published by John Benjamins. Chapter 3 uses material from an article in M. Pütz, ed. (1992) Thirty Years of Linguistic Evolution, published by John Benjamins, but several sections are new. Chapter 5 is considerably revised from an article in C. Uhlig and R. Zimmermann, eds (1991) Anglistentag 1990 Marburg: Proceedings, published by Max Niemeyer; the chapter also contains much new material. Chapter 6 is revised from an article in Applied Linguistics, 15, 2 (1994); the chapter also contains much new material. Chapter 7 uses some material from an article in Language and Education, 3, 4 (1989), published by Multilingual Matters, but the main part is previously unpublished. Chapter 8 uses material from an article in Applied Linguistics, 7, 1 (1986), published by Oxford University Press; the chapter also contains much new material. Other chapters are previously unpublished.

The publishers apologize for any errors or omissions in the above list and would be grateful to be notified of any corrections that should be incorporated in the next edition or reprint of this book.

Data Conventions and Terminology

1 An important feature of the book is that all data which are analysed in detail are attested in naturally occurring language use. The status of examples is indicated as follows.

[A] attested, actual, authentic data: data which have occurred naturally in a real social context without the intervention of the analyst.
[M] modified data: examples which are based on attested data, but which have been modified (e.g. abbreviated) to exclude features deemed irrelevant to the current analysis.
[I] intuitive, introspective, invented data: data invented purely to illustrate a point in a linguistic argument.

Chapter 2 discusses the importance of such distinctions. Individual examples are not always marked in this way if their status is clear from the surrounding discussion.

2 Other conventions used are more standard.

Double quotation marks “ ” for meanings of linguistic expressions.

Italics for short forms cited within the text, e.g. The German sentence Sie soll sehr klug sein means “She is said to be very clever”.

CAPITAL LETTERS for lemmas. Alternative terms for lemma include dictionary head-word and lexeme. A lemma is abstracted from a set of morphological variants. Conventionally the base form of verbs and the singular of nouns are used to represent lemmas. For example, do and does are forms of the lemma DO.

Asterisk (*) for ill-formed sequences, e.g. ungrammatical or semantically anomalous sentences. For example, *He must can come. *He is a vegetarian and eats meat.

A question mark (?) before a form denotes a string of doubtful or marginal acceptability, e.g. ?he mayn’t come.

By definition, such asterisked and questioned items cannot be attested, and such intuitive judgements on ill-formedness should be treated with care. Corpus data often reveal forms which are thought intuitively not to occur, but which occur frequently and are used systematically. See especially chapter 8.

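
The lemma convention described in this section can be illustrated with a small sketch. It is only an assumption about how word forms might be grouped under a capitalised lemma for counting: the table `FORM_TO_LEMMA` and the function `lemma_frequencies` are hypothetical names, the table is hand-made for two lemmas, and it ignores word class (so verb and noun uses of a form such as close are not distinguished). It is not the software used for the analyses in the book.

```python
from collections import Counter

# Hand-made illustration only: each word form is mapped to the lemma
# that represents it, written in capitals as in the convention above.
FORM_TO_LEMMA = {
    "do": "DO", "does": "DO", "did": "DO", "doing": "DO", "done": "DO",
    "close": "CLOSE", "closes": "CLOSE", "closed": "CLOSE", "closing": "CLOSE",
}


def lemma_frequencies(tokens):
    """Count how often each known lemma occurs, whatever form it takes."""
    counts = Counter()
    for token in tokens:
        lemma = FORM_TO_LEMMA.get(token.lower())
        if lemma is not None:  # forms missing from the table are skipped
            counts[lemma] += 1
    return counts


# [I] invented example sentence
print(lemma_frequencies("She does what they do and closed the door".split()))
```

A frequency count over lemmas rather than word forms brings together variants such as do and does, which would otherwise be counted as unrelated items.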
Notes on Corpus Data and Software

The first generation of computer-readable corpora (up to around one million words) was set up in the 1960s and 1970s. The Brown Corpus is so named because it was prepared at Brown University in the USA by W. Nelson Francis. This consists of one million words of written American English, published in 1961, and sampled as text fragments of 2,000 words each. Such corpora now seem very small, and can easily be handled on desktop PCs. One of the most important points about such corpus work is that linguistic data become public and accessible to other scholars, who can therefore check the interpretations and analyses proposed (see sections 2.9, 9.1).

The following are the computer-readable corpora of spoken and written English which I have used in various ways in this book.

1 London–Lund corpus: 435,000 words of spoken British English. I have used a version of this corpus, which consists of 435,000 running words, comprising 87 texts of 5,000-word samples of adult educated usage, including face-to-face and telephone conversations, lectures, discussions and radio commentaries. The speakers are university academics, students, civil servants, doctors, secretaries, broadcasters and other professional people. Speakers are on variously intimate and distant personal and social terms with each other. (The corpus contains much coded prosodic information, such as tone unit boundaries, pitch, stress and pause phenomena. I have omitted such codings in citing examples.) The sections comprising face-to-face conversation are published in Svartvik and Quirk (1980) which gives a detailed description of the corpus. Other details of the corpus and work based on it are given in Svartvik et al. (1982) and in Biber (1988). This corpus was one part of the data gathered by the Survey of English Usage at the University of London, in the preparation of Quirk et al.’s (1972) A Grammar of Contemporary English, and other associated grammars (e.g. Quirk et al., 1985). (See section 2.5.)

2 LOB (Lancaster–Oslo–Bergen) corpus: one million words of written British English. The LOB corpus was designed as the British equivalent of the Brown corpus: one million words of written British English, also published in 1961, and also sampled as text fragments of 2,000 words each. These samples are from informative texts, such as newspaper texts, learned and scientific writing and imaginative fiction. For a detailed list of the textual categories represented, see ICAME News, 5, 1981, p. 4 (International Computer Archive of Modern English, Bergen). This corpus aims to provide a range of samples of different varieties of English, but could never be representative of the whole English language. There are, for example, many genres which are not represented in LOB. Since it contains samples only of published language, it contains no samples of business correspondence: a huge genre in the modern world.

3 Longman–Lancaster corpus: 30 million words of written English (only small selections are used here). The Longman–Lancaster corpus consists of about 30 million words of written (published) English. For some comparative purposes in this book, I have taken 2,000 word samples from each file, in order to construct a sub-corpus which can be compared with LOB. The corpus contains both fiction and non-fiction, some well known literary works but also published works randomly sampled from books in print. The non-fiction texts are sampled from broad topic fields, including the natural and social sciences, world affairs, commerce and finance, the arts and leisure. Summers (1993) provides details of the design of the corpus.

4 Cobuild corpus of written and spoken English (The Bank of English). This corpus is held at Cobuild in Birmingham, where it is used in the construction of major dictionaries and grammars (including Sinclair, 1987a, 1990, 1995), published by HarperCollins. For description of the early corpus development, see Renouf (1987) and Sinclair (1991a, pp. 13-26). For several articles based on the corpus, see Baker et al. (1993). In 1995, the corpus totalled some 200 million words. For the analyses in this book (e.g. chapter 7), about 120 million words of spoken and written British English were used. This comprises books on many different topics, both fiction and non-fiction, daily newspapers and samples of spoken English, face-to-face interaction, telephone calls and many types of radio programmes.

Several chapters present concordances. These were produced by running computer-readable texts through a computer program, which searches for words in the text, and prints them out in the centre of the page, with a half line of context (usually up to ten words) on each side. This provides a convenient layout for studying how a speaker or writer uses certain words and phrases, and whether there are particular patterns in his or her use of language. I have used commercially available software: the Longman Mini-Concordancer, written for MS-DOS machines by Brian Chandler and available from Longman. This is an interactive program, extremely easy to use, with powerful pattern matching, which can handle texts of up to 50,000 words (given 640kb available RAM). I would highly recommend it for anyone who wants to start work in this area. I have also used other batch software, written by my students in Trier. See the acknowledgements above.
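
The concordance layout described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Longman Mini-Concordancer or the Trier batch software: the function names `concordance` and `format_lines`, the simple tokenisation rule (runs of letters and apostrophes) and the sample sentence are all invented here for the purpose of the example.

```python
import re


def concordance(text, node, width=10):
    """Find every occurrence of the word `node` and return
    (left context, node word, right context) triples, with up to
    `width` words of context on each side."""
    # Assumed tokenisation: runs of letters and apostrophes count as words.
    tokens = re.findall(r"[A-Za-z']+", text)
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


def format_lines(rows):
    """Right-align the left contexts so that the node words line up
    in a single column, as in a printed concordance."""
    pad = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left:>{pad}}  {word}  {right}" for left, word, right in rows]


# [I] invented sample text
sample = ("The corpus is small. A small corpus can still show how a word "
          "is used, and a large corpus shows more.")
for line in format_lines(concordance(sample, "corpus", width=3)):
    print(line)
```

Aligning the node word in one column is what makes recurrent patterns to its left and right easy to see when the lines are read vertically.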