Text and Corpus Analysis: Computer Assisted Studies of Language and Culture 9780631195122

This book provides detailed studies in one of the fastest growing areas of linguistics - corpus analysis - and shows how


English Pages 288 [290] Year 1996

Table of contents:
List of Figures, Concordances and Tables.
Acknowledgements.

Data Conventions and Terminology.

Notes on Corpus Data and Software.

Part I: Concepts and History.

1. Texts and Text Types.

2. British Traditions in Text Analysis: Firth, Halliday and Sinclair.

3. Institutional Linguistics: Firth, Hill and Giddens.

Part II: Text and Corpus Analysis.

4. Baden-Powell: A Comparative Analysis of Two Short Texts.

5. Judging the Facts: An Analysis of One Text in its Institutional Context.

6. Human and Inhuman Geography: A Comparative Analysis of Two Long Texts and a Corpus.

7. Keywords, Collocations and Culture: The Analysis of Word Meanings across Corpora.

8. Towards a Modal Grammar of English: A Matter of Prolonged Fieldwork.

9. The Classic Questions.

Notes.

References.

Name Index.

Subject Index.

MICHAEL STUBBS

Text and Corpus Analysis

Language in Society

GENERAL EDITOR
Peter Trudgill, Professor in the Department of Language and Linguistics, University of Lausanne

ADVISORY EDITORS
Ralph Fasold, Professor of Linguistics, Georgetown University
William Labov, Professor of Linguistics, University of Pennsylvania

1  Language and Social Psychology
   Edited by Howard Giles and Robert N. St Clair

2  Language and Social Networks (Second Edition)
   Lesley Milroy

3  The Ethnography of Communication (Second Edition)
   Muriel Saville-Troike

4  Discourse Analysis
   Michael Stubbs

5  The Sociolinguistics of Society
   Introduction to Sociolinguistics, Volume I
   Ralph Fasold

6  The Sociolinguistics of Language
   Introduction to Sociolinguistics, Volume II
   Ralph Fasold

7  The Language of Children and Adolescents
   The Acquisition of Communicative Competence
   Suzanne Romaine

8  Language, the Sexes and Society
   Philip M. Smith

9  The Language of Advertising
   Torben Vestergaard and Kim Schrøder

10  Dialects in Contact
    Peter Trudgill

11  Pidgin and Creole Linguistics
    Peter Mühlhäusler

12  Observing and Analysing Natural Language
    A Critical Account of Sociolinguistic Method
    Lesley Milroy

13  Bilingualism (Second Edition)
    Suzanne Romaine

14  Sociolinguistics and Second Language Acquisition
    Dennis R. Preston

15  Pronouns and People
    The Linguistic Construction of Social and Personal Identity
    Peter Mühlhäusler and Rom Harré

16  Politically Speaking
    John Wilson

17  The Language of the News Media
    Allan Bell

18  Language, Society and the Elderly
    Discourse, Identity and Ageing
    Nikolas Coupland, Justine Coupland and Howard Giles

19  Linguistic Variation and Change
    James Milroy

20  Principles of Linguistic Change
    William Labov

21  Intercultural Communication
    A Discourse Approach
    Ron Scollon and Suzanne Wong Scollon

22  Sociolinguistic Theory
    Linguistic Variation and its Social Significance
    J. K. Chambers

23  Text and Corpus Analysis
    Computer-assisted Studies of Language and Culture
    Michael Stubbs

Text and Corpus Analysis
Computer-assisted Studies of Language and Culture

Michael Stubbs

Blackwell Publishers

Copyright © Michael Stubbs 1996

The right of Michael Stubbs to be identified as author of this work has been asserted in accordance with the Copyright, Designs and Patents Act 1988.

First published 1996

2 4 6 8 10 9 7 5 3 1

Blackwell Publishers Ltd
108 Cowley Road
Oxford OX4 1JF
UK

Blackwell Publishers Inc.
238 Main Street
Cambridge, Massachusetts 02142
USA

All rights reserved. Except for the quotation of short passages for the purposes of criticism and review, no part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher.

Except in the United States of America, this book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired out, or otherwise circulated without the publisher’s prior consent in any form of binding or cover other than that in which it is published and without a similar condition including this condition being imposed on the subsequent purchaser.

British Library Cataloguing in Publication Data
A CIP catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication Data
Stubbs, Michael, 1947–
Text and corpus analysis: computer-assisted studies of language and culture / Michael Stubbs.
p. cm. (Language in society; 23)
Includes bibliographical references and indexes.
ISBN 0-631-19511-4. ISBN 0-631-19512-2 (pbk.)
1. Discourse analysis. 2. Discourse analysis – Data processing. 3. Language and culture. 4. English language – Modality. I. Title. II. Series: Language in society (Oxford, England); 23.
P302.S773 1996
401'.41 dc20    95-31070 CIP

Typeset in 10½ on 12 pt Ehrhardt by Best-set Typesetter Ltd., Hong Kong
Printed in Great Britain
This book is printed on acid-free paper

Contents

Lists of Figures, Concordances and Tables
Series Editor’s Preface
Acknowledgements
Data Conventions and Terminology
Notes on Corpus Data and Software

Part I  Concepts and History

1  Texts and Text Types
   1.1  The data
   1.2  The organization of the book
   1.3  A simple model: texts, production and reception
   1.4  Content, author and audience
   1.5  Texts and text types
   1.6  Text types and institutions
   1.7  Three studies
        1.7.1  Advertising
        1.7.2  Newspapers
        1.7.3  Scientific research articles
   1.8  Summary

2  British Traditions in Text Analysis: Firth, Halliday and Sinclair
   2.1  Principles: from Firth via Halliday to Sinclair
   2.2  Alternative traditions
   2.3  Linguistics as an applied social science
   2.4  Grammar and discourse: data and models
   2.5  Methodology: attested data and text corpora
   2.6  Form and meaning, lexis and grammar
   2.7  Routines of cultural transmission
   2.8  Beyond Saussurian dualisms
   2.9  Data, description and theory
   2.10  Description and theory, methods and applications
   Appendix: Notes on the intellectual background

3  Institutional Linguistics: Firth, Hill and Giddens
   3.1  Institutional facts
   3.2  Firth on social semantics
   3.3  Hill on institutional linguistics
   3.4  Dualisms of subject and object, micro and macro
   3.5  Giddens on the constitution of society
   3.6  Projects for an institutional linguistics
   3.7  Sexist language
   3.8  Spoken and written language and linguistic theory
        3.8.1  Dictionaries
        3.8.2  Literature
        3.8.3  School textbooks
        3.8.4  Vocabulary
   3.9  Lexical density: an analysis of two corpora
        3.9.1  Interpretation
        3.9.2  Text samples
   3.10  Conclusion
   Appendix: Further notes on the intellectual background

Part II  Text and Corpus Analysis

4  Baden-Powell: a Comparative Analysis of Two Short Texts
   4.1  Organization of the chapter
   4.2  Baden-Powell
   4.3  The texts: production and reception
   4.4  Content analysis
   4.5  HAPPY and HAPPINESS
        4.5.1  Text G: Guides text
        4.5.2  Text S: Scouts text
   4.6  Comparative data
   4.7  Practical implications: sexist language
   4.8  Theoretical implications: meaning in texts and in language
   4.9  Educational implications
   4.10  Other analyses
   4.11  Summary
   Appendix: Baden-Powell’s last messages

5  Judging the Facts: an Analysis of One Text in Its Institutional Context
   5.1  Organization of the chapter
   5.2  Interactions in social institutions
   5.3  The linguistic encoding of facts
   5.4  Words and connotations
   5.5  The data
   5.6  Analysis of the data
        5.6.1  Length and discourse markers
        5.6.2  Connotations of individual words
        5.6.3  Modal verbs
        5.6.4  Presuppositions
        5.6.5  Syntactic complexity
   5.7  Conclusion
   Appendix: Other studies of courtroom language

6  Human and Inhuman Geography: a Comparative Analysis of Two Long Texts and a Corpus
   6.1  Organization of the chapter
   6.2  Introductory discussion
   6.3  The importance of comparative data
   6.4  Criticisms of discourse analysis
   6.5  Texts and text fragments
   6.6  Data and hypotheses
   6.7  Computer-assisted comparative analysis
   6.8  Example 1. Ergative verbs: the syntax of key words
        6.8.1  Definitions and terminology
        6.8.2  Data analysis
        6.8.3  Comparison of texts G and E
        6.8.4  Interpretation
        6.8.5  Comparison of texts G and E and LOB
        6.8.6  Interpretation
        6.8.7  Variation across verbs
        6.8.8  Comparison of texts G and E: passives
   6.9  Interpretative problems
        6.9.1  Further notes on ergatives
        6.9.2  Probabilistic patterns
        6.9.3  Another analysis: unexpected findings
   6.10  Example 2. Projecting clauses
        6.10.1  Definitions
        6.10.2  World-creating predicates
        6.10.3  Comparison of texts G and E
        6.10.4  Interpretation
   6.11  Some principles for text analysis
   6.12  Conclusion: computer-assisted studies
   Appendix: Further notes on ergativity

7  Keywords, Collocations and Culture: the Analysis of Word Meanings across Corpora
   7.1  Organization of the chapter
   7.2  Text and discourse: different senses of ‘discourse’
   7.3  Introductory example: language and nation
        7.3.1  Different discourses
   7.4  Firth on focal words
   7.5  Williams on keywords
   7.6  Other studies of cultural keywords
   7.7  The formal component: collocations
        7.7.1  Quantitative methods
   7.8  Other examples of collocations and semantic prosodies
   7.9  The sociological component: encoding culture in lexis
   7.10  Designing a dictionary of keywords in British culture
        7.10.1  Lifestyle and professionalization
        7.10.2  Example analyses
        7.10.3  Ambiguity of keywords
   7.11  Sample dictionary entries: culture and cultural
   7.12  Conclusion

8  Towards a Modal Grammar of English: a Matter of Prolonged Fieldwork
   8.1  Organization of the chapter
   8.2  Introductory example: propositional information
   8.3  The (limited) relevance of speech act theory
   8.4  Evidentiality, factivity, modality
   8.5  Summary
   8.6  Lexical, propositional and illocutionary commitment
   8.7  Degree and manner of commitment
   8.8  Modality and lexis
        8.8.1  Morphology and pragmatic information
        8.8.2  Lexical commitment
        8.8.3  Vague lexis
   8.9  Modality and illocutionary force
        8.9.1  Explicit illocutionary prefaces
        8.9.2  Two types of speech act
   8.10  Modality and the truth value of propositions
        8.10.1  Simple forms versus ing-forms of verbs
        8.10.2  Verb classes and uses
        8.10.3  Other parallels: can plus verb
        8.10.4  Other verbal forms
        8.10.5  Private verbs
        8.10.6  Logical and pragmatic connectors
   8.11  Modal grammar
   8.12  Applied linguistics
   8.13  Conclusion

9  The Classic Questions
   9.1  Language and corpora
   9.2  Language and thought
   9.3  Conclusion

Notes
References
Subject Index
Name Index

Figures, Concordances and Tables

Figure

Lexical density of 587 samples of written and spoken English

Concordances

Happy and happiness in Girl Guides text
Happy and happiness in Boy Scouts text
REEL, AGGRAVATE, etc. in Judge’s summing-up
MAY and MIGHT in Judge’s summing-up
Verb and noun CLOSE in one school book, sample only
Passives in one school book, small sample only

Tables

Two school books, all ergative verbs, all forms
Two school books and LOB, five ergative verbs, all forms
One school book, two ergative verbs
Two school books, projecting clauses
Two school books, attributed and non-attributed projecting clauses
Two school books, personal and impersonal projecting clauses

Series Editor’s Preface

It is one of the traditions of this series that its volumes have a strong orientation towards linguistic data. Our contention is that the best sorts of work in linguistics are those which, as our title ‘Language in Society’ suggests, are based on research employing as data instances of language as actually used by real people in their everyday lives. We would not deny that research based on linguists’ intuitions about their own native varieties has been responsible for very considerable theoretical progress. But, in the final analysis, if linguistics is not about language as it is actually spoken and written by human beings, then it is about nothing at all.

Michael Stubbs’s volume demonstrates this point with perhaps greater clarity than any previous volume in the series. Here is a work that is based not just on real language data rather than on intuitions, but on vast amounts of real language data. This computer-aided research, based on the immense linguistic corpora which are now available, gives Stubbs’s findings a degree of reliability that is unusual in linguistics, and reveals patterns of usage of which we previously had only a vague notion, or even no knowledge whatsoever.

Stubbs has also taken pains to situate his work in the British, Firthian tradition, which goes back many decades but has tended to be lost from view in recent years. It is probably not too much to claim that this book, relying as it does on the approach of British linguists such as Halliday and Sinclair, represents something of a renaissance of this tradition. It is clear, however, that it will appeal to all linguistics and sociolinguistics scholars, whatever their intellectual background, who would like to see the findings of linguistic science grounded in real language. Some will find this work controversial, but all will find it stimulating. Stubbs covers a wide scale of topics, ranging from the relationship between syntax and pragmatics to the issue of sexism in language, but all his analyses are presented with the confidence that only corpora-based studies can provide.

In keeping with the Firthian tradition, as well as the aims of this series, Stubbs also discusses the problem of transmitting without distortion linguists’ expert knowledge about language to those non-linguists who can benefit from it, in such a way that they can comprehend it. Stubbs has elsewhere shown himself to be very competent in this respect, and this book can only add to his stature in the field.

Peter Trudgill
University of Lausanne

Acknowledgements

I am grateful to many friends and colleagues who provided helpful criticism on earlier versions of this book. Philip Carpenter, Florian Coulmas and Norman Fairclough encouraged the project at an early stage. Judy Delin, Gabi Keck, Greg Myers and Joan Swann made detailed comments on a complete draft. Dwight Atkinson, Wolfram Bublitz, Joanna Channell, Jenny Cheshire, Gill Francis, Andrea Gerbig, Bob Hodge, Susan Hunston, Anthony Johnson, John Sinclair and Dick Watts made helpful comments on various chapters. The book is much better for their good advice. Andrea Gerbig also helped with some of the data analyses and provided text E (for chapter 6). And Anthony Johnson insisted, several times, that I should read Giddens’s work.

Work in corpus linguistics is always based on previous work by many people. Jeremy Clear and Tim Lane helped me to extract data from corpora at Cobuild. My student research assistants in Trier, Anja Helfrich and Susanne Jarczok, helped with text and corpus preparation; and Brigitte Grote, Oliver Jakobs and Oliver Hardt wrote programs for analysing lexical density, producing concordances and analysing collocations.

The texts of Baden-Powell’s last messages to the Girl Guides and the Boy Scouts were obtained from archivists of the Girl Guides Association and the Scout Association. I am most grateful to them for permission to reproduce the texts (in chapter 4), and also for details of their composition and publication. For permission to use the transcript of courtroom data (in chapter 5), I am grateful to the defendant in the case: for obvious reasons, I will not name him here. Simon & Schuster Education, Hemel Hempstead, UK, gave permission to store text G (in chapter 6) in computer readable form and to cite extracts; the text is The British Isles, copyright N. Punnett and P. Webber, Blackwell, 1984. Text E (in chapter 6) is The Ozone Message by D. Kinnear, P. Preuss and J. Rogers, Australian Conservation Foundation, 1989.

For permission to use corpus materials, I am most grateful to: the Norwegian Computing Centre for the Humanities; Longmans Group UK Ltd; and colleagues at the University of Birmingham and at Cobuild, especially Gwynneth Fox, John Sinclair and Malcolm Coulthard, for arranging access to the Cobuild corpus, the Bank of English.

I am grateful to publishers for permission to use material from published articles: in all cases the material has been extensively revised. Chapter 2 is considerably revised from an article in M. Baker et al., eds (1993) Text and Technology, published by John Benjamins. Chapter 3 uses material from an article in M. Pütz, ed. (1992) Thirty Years of Linguistic Evolution, published by John Benjamins, but several sections are new. Chapter 5 is considerably revised from an article in C. Uhlig and R. Zimmermann, eds (1991) Anglistentag 1990 Marburg: Proceedings, published by Max Niemeyer; the chapter also contains much new material. Chapter 6 is revised from an article in Applied Linguistics, 15, 2 (1994); the chapter also contains much new material. Chapter 7 uses some material from an article in Language and Education, 3, 4 (1989), published by Multilingual Matters, but the main part is previously unpublished. Chapter 8 uses material from an article in Applied Linguistics, 7, 1 (1986), published by Oxford University Press; the chapter also contains much new material. Other chapters are previously unpublished.

The publishers apologize for any errors or omissions in the above list and would be grateful to be notified of any corrections that should be incorporated in the next edition or reprint of this book.

Data Conventions and Terminology

1  An important feature of the book is that all data which are analysed in detail are attested in naturally occurring language use. The status of examples is indicated as follows.

[A]  attested, actual, authentic data: data which have occurred naturally in a real social context without the intervention of the analyst.
[M]  modified data: examples which are based on attested data, but which have been modified (e.g. abbreviated) to exclude features deemed irrelevant to the current analysis.
[I]  intuitive, introspective, invented data: data invented purely to illustrate a point in a linguistic argument.

Chapter 2 discusses the importance of such distinctions. Individual examples are not always marked in this way if their status is clear from the surrounding discussion.

2  Other conventions used are more standard.

2.1  Double quotation marks “ ” for meanings of linguistic expressions.
2.2  Italics for short forms cited within the text, e.g. The German sentence Sie soll sehr klug sein means “She is said to be very clever”.
2.3  CAPITAL LETTERS for lemmas. Alternative terms for lemma include dictionary head-word and lexeme. A lemma is abstracted from a set of morphological variants. Conventionally the base form of verbs and the singular of nouns are used to represent lemmas. For example, do and does are forms of the lemma DO.
2.4  Asterisk (*) for ill-formed sequences, e.g. ungrammatical or semantically anomalous sentences. For example, *He must can come. *He is a vegetarian and eats meat.
2.5  A question mark (?) before a form denotes a string of doubtful or marginal acceptability, e.g. ?He mayn’t come. By definition, such asterisked and questioned items cannot be attested, and such intuitive judgements on ill-formedness should be treated with care. Corpus data often reveal forms which are thought intuitively not to occur, but which occur frequently and are used systematically. See especially chapter 8.

Notes on Corpus Data and Software

The first generation of computer-readable corpora (up to around one million words) was set up in the 1960s and 1970s. The Brown Corpus is so named because it was prepared at Brown University in the USA by W. Nelson Francis. This consists of one million words of written American English, published in 1961, and sampled as text fragments of 2,000 words each. Such corpora now seem very small, and can easily be handled on desktop PCs. One of the most important points about such corpus work is that linguistic data become public and accessible to other scholars, who can therefore check the interpretations and analyses proposed (see sections 2.9, 9.1). The following are the computer-readable corpora of spoken and written English which I have used in various ways in this book.

1  London–Lund corpus: 435,000 words of spoken British English. I have used a version of this corpus, which consists of 435,000 running words, comprising 87 texts of 5,000-word samples of adult educated usage, including face-to-face and telephone conversations, lectures, discussions and radio commentaries. The speakers are university academics, students, civil servants, doctors, secretaries, broadcasters and other professional people. Speakers are on variously intimate and distant personal and social terms with each other. (The corpus contains much coded prosodic information, such as tone unit boundaries, pitch, stress and pause phenomena. I have omitted such codings in citing examples.) The sections comprising face-to-face conversation are published in Svartvik and Quirk (1980) which gives a detailed description of the corpus. Other details of the corpus and of work based on it are given in Svartvik et al. (1982) and in Biber (1988). This corpus was one part of the data gathered by the Survey of English Usage at the University of London, in the preparation of Quirk et al.’s (1972) A Grammar of Contemporary English, and other associated grammars (e.g. Quirk et al., 1985). (See section 2.5.)

2  LOB (Lancaster–Oslo–Bergen) corpus: one million words of written British English. The LOB corpus was designed as the British equivalent of the Brown corpus: one million words of written British English, also published in 1961, and also sampled as text fragments of 2,000 words each. These samples are from informative texts, such as newspaper texts, learned and scientific writing and imaginative fiction. For a detailed list of the textual categories represented, see ICAME News, 5, 1981, p. 4 (International Computer Archive of Modern English, Bergen). This corpus aims to provide a range of samples of different varieties of English, but could never be representative of the whole English language. There are, for example, many genres which are not represented in LOB. Since it contains samples only of published language, it contains no samples of business correspondence: a huge genre in the modern world.

3  Longman–Lancaster corpus: 30 million words of written English (only small selections are used here). The Longman–Lancaster corpus consists of about 30 million words of written (published) English. For some comparative purposes in this book, I have taken 2,000 word samples from each file, in order to construct a sub-corpus which can be compared with LOB. The corpus contains both fiction and non-fiction, some well known literary works but also published works randomly sampled from books in print. The non-fiction texts are sampled from broad topic fields, including the natural and social sciences, world affairs, commerce and finance, the arts and leisure. Summers (1993) provides details of the design of the corpus.

4  Cobuild corpus of written and spoken English (The Bank of English). This corpus is held at Cobuild in Birmingham, where it is used in the construction of major dictionaries and grammars (including Sinclair, 1987a, 1990, 1995), published by HarperCollins. For description of the early corpus development, see Renouf (1987) and Sinclair (1991a, pp. 13-26). For several articles based on the corpus, see Baker et al. (1993). In 1995, the corpus totalled some 200 million words. For the analyses in this book (e.g. chapter 7), about 120 million words of spoken and written British English were used. This comprises books on many different topics, both fiction and non-fiction, daily newspapers and samples of spoken English, face-to-face interaction, telephone calls and many types of radio programmes.
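The sub-corpus construction mentioned for the Longman–Lancaster corpus (2,000-word samples taken from each file) might be sketched as follows. This is only an illustration under stated assumptions: the text above does not say where in each file the sample was taken, so the sketch simply takes the first 2,000 whitespace-delimited words, and the names `take_sample` and `build_subcorpus` are hypothetical.

```python
def take_sample(text, size=2000):
    """Return the first `size` whitespace-delimited words of a text.

    Assumption: samples start at the beginning of each file; the
    source does not specify the sampling position.
    """
    return text.split()[:size]


def build_subcorpus(texts, size=2000):
    """Map each file name to its fixed-size sample.

    `texts` is a dict of {file name: full text}. Files shorter than
    `size` words are kept whole.
    """
    return {name: take_sample(body, size) for name, body in texts.items()}


if __name__ == "__main__":
    demo = {"long_file": "word " * 3000, "short_file": "only three words"}
    sub = build_subcorpus(demo)
    for name, words in sub.items():
        print(name, len(words))
```

Equal-sized samples keep word counts comparable across files, which is the property that makes such a sub-corpus comparable with LOB and its 2,000-word fragments.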

Several chapters present concordances. These were produced by running computer-readable texts through a computer program, which searches for words in the text, and prints them out in the centre of the page, with a half line of context (usually up to ten words) on each side. This provides a convenient layout for studying how a speaker or writer uses certain words and phrases, and whether there are particular patterns in his or her use of language. I have used commercially available software: the Longman Mini-Concordancer, written for MS-DOS machines by Brian Chandler and available from Longman. This is an interactive program, extremely easy to use, with powerful pattern matching, which can handle texts of up to 50,000 words (given 640kb available RAM). I would highly recommend it for anyone who wants to start work in this area. I have also used other batch software, written by my students in Trier. See the acknowledgements above.
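The concordance layout described above (search word in the centre, up to ten words of context on each side) can be sketched in a few lines. This is a minimal illustration, not the Longman Mini-Concordancer or the Trier software: the function name `kwic`, the whitespace tokenization, the punctuation stripping and the 40-character column width are all assumptions made for the example.

```python
def kwic(text, keyword, context_words=10, width=40):
    """Return keyword-in-context lines with the keyword in a fixed column.

    Tokens are split on whitespace (an assumption); matching ignores
    case and surrounding punctuation, so "Happy," matches "happy".
    """
    tokens = text.split()
    target = keyword.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.strip(".,;:!?\"'()").lower() == target:
            left = " ".join(tokens[max(0, i - context_words):i])
            right = " ".join(tokens[i + 1:i + 1 + context_words])
            # Right-align the left context so every keyword starts
            # in the same column, as in a printed concordance.
            lines.append(f"{left[-width:]:>{width}}  {tok}  {right[:width]}")
    return lines


if __name__ == "__main__":
    sample = "Be happy and make others happy. Happiness comes from giving."
    for line in kwic(sample, "happy"):
        print(line)
```

Aligning the search word in one column is what makes recurrent patterns to its left and right easy to see when the lines are read vertically.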
