English Pages 499 [500] Year 1992
Directions in Corpus Linguistics
Trends in Linguistics Studies and Monographs 65
Editor
Werner Winter
Mouton de Gruyter Berlin · New York
Directions in Corpus Linguistics Proceedings of Nobel Symposium 82 Stockholm, 4 - 8 August 1991 Edited by
Jan Svartvik
1992
Mouton de Gruyter (formerly Mouton, The Hague) is a Division of Walter de Gruyter & Co., Berlin.
® Printed on acid-free paper which falls within the guidelines of the ANSI to ensure permanence and durability.
Library of Congress Cataloging in Publication Data Nobel Symposium (82nd : 1991 : Stockholm, Sweden) Directions in corpus linguistics : proceedings of Nobel Symposium 82. Stockholm, 4 - 8 August 1991 / edited by Jan Svartvik. p. cm. — (Trends in linguistics. Studies and monographs ; 65) Includes bibliographical references and index. ISBN 3-11-012826-8 (acid-free paper) : 1. Discourse analysis — Data processing — Congresses. 2. Computational linguistics — Congresses. 3. Linguistics — Methodology — Congresses. I. Svartvik, Jan. II. Title. III. Series. P302.3.N63 1991 401'.41 - dc20 92-11586 CIP
Die Deutsche Bibliothek - Cataloging in Publication Data Directions in corpus linguistics : proceedings of Nobel symposium 82, Stockholm, 4 - 8 August 1991 / ed. by Jan Svartvik. Berlin : New York : Mouton de Gruyter, 1992 (Trends in linguistics : Studies and monographs ; 65) ISBN 3-11-012826-8 NE: Svartvik, Jan [Hrsg.]; Nobel Symposium

clears throat
*SAR: see my doggie.
%par: <bef> whispers. <aft> laughs

The extreme case is when nonverbal information is treated as clarification of a missing utterance, indicated by 0 and www on the utterance line:

*ADA: 0 [=! laughs].
%act: <bef> falling
*MOT: whose girl are you?
*SAR: 0 [=! whispers].
*SAR: www.
%exp: Mother and Sarah whisper a yes/no back and forth game about the mike.
%par: whispered and laughing

While it is usually possible for readers to reconstruct the ordering of events, the separation hinders rapid integration of verbal and nonverbal events when reading through the transcript, and it implies a view of nonverbal events as clarifying utterances rather than being communicative acts in their own right. This would be misleading for certain types of discourse in which verbal and nonverbal events are interchangeable, as they are in early child language (Ochs 1979). Both methods - i.e. nonverbal inserted on the utterance line or placed on separate tiers - are good for computer search (since nonverbal in parentheses can be optionally suppressed by computer programs), but the former is much easier for rapid reading.
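The remark that parenthesized nonverbal material "can be optionally suppressed by computer programs" can be illustrated with a minimal sketch. It assumes a convention in which nonverbal events appear in round parentheses on the utterance line; the function name `strip_nonverbal` and the sample line are hypothetical, not taken from any existing tool or transcript.

```python
import re

# Assumed convention: nonverbal events are enclosed in round parentheses
# on the utterance line, e.g. "(clears throat) see my doggie (laughs)."
NONVERBAL = re.compile(r"\s*\([^()]*\)")

def strip_nonverbal(utterance: str) -> str:
    """Return the utterance with parenthesized nonverbal material removed."""
    cleaned = NONVERBAL.sub("", utterance)
    # Collapse any whitespace left behind by the removed material.
    return re.sub(r"\s+", " ", cleaned).strip()

# Hypothetical utterance line, arranged in the inline style described above.
line = "(clears throat) see my doggie (laughs)."
print(strip_nonverbal(line))  # see my doggie.
```

A search program could apply such a filter before matching words, so that the reader-friendly inline layout costs nothing at retrieval time.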
Jane A. Edwards
D. Logical priority

Contextual comments provide the background information needed for interpreting utterances in the discourse. For this reason they tend in many transcription systems to precede the utterances for which they are pertinent. Consider, for example, the following excerpt from Bloom's (1973) original transcript (Allison, third data session):

(M and A sitting on chair; A wearing half-zippered jacket; fingers in her mouth)
M: What did you see? What did you see over there? (M points to monitor)
(A looking at monitor with fingers in her mouth)
A: Mommy/

If contextual comments are instead placed after the relevant utterance, the transcript seems harder to read. Consider, for example, the CHILDES version of the same excerpt:

*MOT: what did you see? what did you see over there?
%sit: Mother and Allison sitting on chair; Allison wearing half-zippered jacket; fingers in her mouth
%gpx: <aft> points to monitor
*ALI: Mommy.
%gpx: looking at monitor with fingers in her mouth

As with the example of prosodic encoding given earlier in this paper, the % lines here are viewed as subordinate and supplementary to the immediately preceding utterance, and are labelled as to type (e.g. sit for situational; gpx for gestural/proxemic). The Bloom version is non-committal with respect to the point at which a contextual comment ceases to be relevant; it remains available for influencing utterances until explicitly overruled by other contextual specifications. This seems in keeping with intuition concerning how interactions are structured. In the dependent tier approach, each % tier pertains to exactly one utterance. In CHILDES, if a contextual comment remains relevant across several utterances, the % line is duplicated beneath each utterance. This causes some ambiguities of reading (e.g. did a particular gesture occur only once and get duplicated across lines? or did it actually occur multiple times?) and decreases compactness (Principle F).
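The dependent-tier relationship described above, in which each % line belongs to the utterance immediately preceding it, can be sketched in a few lines. This is only an illustration under simplifying assumptions (one utterance or tier per line, label and content separated by the first colon); `group_tiers` is a hypothetical helper, not part of the CHILDES software.

```python
def group_tiers(lines):
    """Attach each %-tier line to the most recent *-utterance line."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Split "label: content" on the first colon only.
        label, _, text = line.partition(":")
        if label.startswith("*"):
            records.append({"speaker": label[1:], "text": text.strip(), "tiers": []})
        elif label.startswith("%") and records:
            # A dependent tier pertains to exactly one utterance: the last one seen.
            records[-1]["tiers"].append((label[1:], text.strip()))
    return records

transcript = [
    "*MOT: what did you see? what did you see over there?",
    "%sit: Mother and Allison sitting on chair",
    "%gpx: <aft> points to monitor",
    "*ALI: Mommy.",
    "%gpx: looking at monitor with fingers in her mouth",
]

for record in group_tiers(transcript):
    print(record["speaker"], record["tiers"])
```

Because a tier is tied to a single utterance in this representation, a comment that holds across several utterances has to be repeated in each record, which is the source of the ambiguity noted above.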
Design principles in the transcription of spoken discourse
E. Iconic and mnemonic marking

The use of upward slash for rising intonation in the London-Lund example given above is an excellent example of iconic marking. The use of ac for acceleration is a good example of mnemonic marking in the Gumperz-Berenz conventions noted above. In either case, the meaning of the marking is easily recovered by the reader without much thought, and without much risk of misinterpretation. In contrast, marking a property of speech by means of an arbitrary number, such as, say "1" for rising; "2" for falling, etc., would be both non-iconic and non-mnemonic, and would increase the risk of errors in the data due to faulty encoding.

F. Efficiency and compactness

Compactness (i.e. minimum of non-essential symbols and non-essential separate lines) and efficiency (i.e. minimum redundancy in symbols) are useful in minimizing the burden on the reader's short term memory while reading through the transcript. They also minimize transcriber and coder error by increasing the transparency of the data.

There are several ways to increase compactness. One way involves the use of the minimum number of characters needed to mark a distinction - so long as the abbreviation remains cognitively transparent to the reader (i.e. does not violate principle E). One example of a compact convention is one used often in literature to indicate "spelling aloud", namely encoding the letters as capitals followed by periods:

Leopard's spelled L. E. O. P. A. R. D.

This approach is more compact than the convention used in CHILDES of attaching @l to each letter:

Leopard's spelled l@l e@l o@l p@l a@l r@l d@l.

In addition, the use of capital letters with periods provides better visual separability between the actually spoken items and researcher metacomments (Principle B), with no loss of efficiency in computer search.

Another type of compactness involves the marking of a distinction symbolically rather than with words. For example, overlaps are indicated in some systems simply by being enclosed in square brackets:

A: Which way is it please to Grand Central [Station]?
B: [Turn right] at the light and then straight ahead four blocks, and it's on your left.
CHILDES sometimes uses explicit researcher metacomments, such as "overlap above" for this purpose:

*MOT: [overlap below]
*FAT: [overlap above]

Using square brackets instead gives the following format, which encodes the overlapping more compactly and with greater visual separation between speech and metacomment (Principle B):

*MOT: [so the kids can dance]
*FAT: [you gonna@a take the tape?]

Where ambiguity arises concerning direction of overlap, this can be clarified by inserting matching numerical indices inside the first bracket of the overlapping parts.

Several other examples given previously can also be used to illustrate what is meant by compactness. The use of separate tiers for nonverbal, prosodic, and contextual information requires the reader to process a greater number of characters per utterance, and to scan a greater visual distance (more lines), than does the more commonly used convention of enclosing this information in parentheses on the utterance line and marking it in visually separable ways.

An extreme example of compactness is the use of partiture (Tannen 1984; Ehlich in preparation), which resembles a musical score in that subsequent utterances by a speaker continue on the same line, until the right hand margin of the page is reached:

A: Hi there.         How's it going?
B:           Hello.                   Fine thanks. And you?

This can be seen as an extremely compact way to preserve the time flow of the interaction, since every line is used up to the margin unlike the more usual methods described above. This approach may require special purpose programs for data entry and computer search, such as those developed by Ehlich (described in Ehlich in preparation). The most widespread method, therefore, is in this sense a compromise between the most and least compact approaches:

A: Hi there.
B: Hello.
A: How's it going?
B: Fine thanks. And you?
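The compactness comparison made earlier in this section for "spelling aloud" can be checked by counting characters. The sketch below is only an illustration: the two helper functions are hypothetical, and the per-letter suffix is written here as `@l`, the CHILDES-style letter marker discussed above.

```python
def spell_capitals(word: str) -> str:
    """Capitals followed by periods: L. E. O. P. A. R. D."""
    return " ".join(f"{ch.upper()}." for ch in word)

def spell_suffixed(word: str) -> str:
    """Per-letter suffix marking: l@l e@l o@l p@l a@l r@l d@l"""
    return " ".join(f"{ch.lower()}@l" for ch in word)

caps = spell_capitals("leopard")
tagged = spell_suffixed("leopard")

print(caps, len(caps))      # L. E. O. P. A. R. D. 20
print(tagged, len(tagged))  # l@l e@l o@l p@l a@l r@l d@l 27
```

For a seven-letter word the capital-plus-period form needs 20 characters against 27, and both forms remain equally easy to match with a pattern search.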
2. Consistency for exhaustive retrieval

The single most important property of any data base for purposes of computer-assisted research is that SIMILAR INSTANCES BE ENCODED IN PREDICTABLY SIMILAR WAYS. Unless the variability in transcription and coding is totally predictable by researchers, some important variants of any pattern (word or code) of interest may be overlooked, and results may be accidentally biased and may not generalize. Archives of spoken language often contain variant pronunciations of a single form, which a researcher might wish to treat as functionally equivalent for purposes of a computer search. For example, you is pronounced sometimes as you, or ya, or y' as in the following examples from CHILDES:

*MAR: you know what? (standard form)
*MAR: ya know what? (colloquial)
*MAR: I'll tell ya@e when I'm gonna say. (colloquial, tagged)
*MAR: how dya@a get these things in? (prepended, colloquial, tagged)
*KUR: Ross's wasn't that y'know +... (apostrophe)
*MAR: D'ya know # that boy at school one time # he did this. (prepended, apostrophe)
*MAR: yknow what, (without apostrophe)

Specifying only you in a computer search would lead to retrieval of only the first of these utterances. Because every inconsistency functionally removes relevant data from consideration in a study, it directly constrains the nature and strength of quantitative analytic claims that can justifiably be made about the data. The following types of quantitative statements presuppose that all relevant instances have been retrieved, that is, that the search was "exhaustive":

"Form Y does not occur in Discourse Type 1" (or for Speaker 1)
"5% of the verbs in Discourse Type 1 were X and 25% were Y"
"More of the verbs were of type X than of type Y for Discourse Type 1"

Lacking exhaustiveness, only the following much weaker quantitative statement is possible:

"Form Y was present in Discourse Type 1." (or for Speaker 1)

Non-exhaustive search results cannot even support conclusions such as that Form X is more common than Form Y, since unanticipated variant spellings
of Form Y could have removed them selectively from the reach of the search program. In order to ensure that all such variants are retrieved in computer search, it is necessary to "regularize" (or "normalize") the data, that is, to link each variant (i.e. ya, y', etc.) with the standard form most likely to be used in a search for instances of that type (i.e. you). There are two ways to do this. The first is to construct a conversion table with entries of the following type:

you - ya, y', y

The second way of regularizing the data is to insert the standard form near each variant in the text itself (e.g. I'm teasing ya [= you]). In order for these to be effective means for ensuring exhaustive search, it is essential that they be applied uniformly and exhaustively to the data. That is, if the conversion table is used, it needs to contain all variants contained in the data set. If adjacent regularization is used, a standard form needs to occur near each variant in the text. This task can be most easily accomplished with assistance from a concordance, that is, an exhaustive alphabetized listing of all of the words and codes in the data together with one line of context. Until the variants in the data have been exhaustively normalized, researchers need to guard against underselection by using liberal search commands (which retrieve more than simply the form desired), and by spot-checking by line-by-line reading of the transcript.
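The conversion-table approach can be sketched as follows. This is a simplified illustration, not an existing tool: the table `TO_STANDARD` is a hypothetical one keyed by whole variant tokens as they might be listed in a concordance, and each token is mapped only to you, ignoring any other word fused into it.

```python
import re

# Hypothetical conversion table: variant token -> standard form searched for.
# Compound tokens (e.g. "yknow") are mapped to "you" only, for simplicity.
TO_STANDARD = {
    "ya": "you", "ya@e": "you", "dya@a": "you",
    "y'know": "you", "d'ya": "you", "yknow": "you",
}

def normalize(token: str) -> str:
    """Lower-case a token, drop edge punctuation, and look it up in the table."""
    key = token.lower().strip(".,?!")
    return TO_STANDARD.get(key, key)

def find_utterances(standard, utterances):
    """Return every utterance containing the standard form or a listed variant."""
    return [u for u in utterances
            if any(normalize(tok) == standard for tok in u.split())]

utterances = [
    "you know what?",
    "ya know what?",
    "I'll tell ya@e when I'm gonna say.",
    "how dya@a get these things in?",
    "Ross's wasn't that y'know +...",
    "D'ya know # that boy at school one time # he did this.",
    "yknow what,",
]

# A search for the literal standard form finds only the first utterance.
naive = [u for u in utterances if re.search(r"\byou\b", u)]
print(len(naive), len(find_utterances("you", utterances)))  # 1 7
```

The gain depends entirely on the table being exhaustive: any variant missing from it silently drops out of the results, which is why a concordance is useful for building it.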
3. Systematic contrast among the categories

"Precise" measurement in the physical sciences means assigning values to an object with reference to a well-specified standard, with an acceptable margin of error ("1.2 meters in length, plus or minus .05 meters"). One of the benefits of precise measurement is that objects assigned particular values can be treated as equivalent with respect to the measured property (e.g. length) because a meter always means the same thing regardless of the type of edge to which it is applied. Similarly, with symbols or codes in a transcript, if they are carefully designed, well-specified and applied consistently, those utterances which are marked by the same symbols can be treated as similar with respect to the indicated properties. Unlike categories of length, however, these categories are often qualitative, and their meaning derives not only from the properties
of the events themselves but also from the contrast of the individual coding categories with one another. One example of this type of category is utterance-final delimiters, widely used to encode the illocutionary force of an utterance (e.g. interrogative, directive, assertion, etc.) or its final prosodic trajectory (turn final, continuing, rising). In several systems these are indicated by using punctuation marks in a specialized way: comma for continuing intonation, period for turn final, question mark for rising. It is important to observe, though, that a particular category will cover different ground if it is placed in opposition with (treated as mutually exclusive to) a different number of other categories (for example, an added category for fall-rise intonation).

Sometimes a situation may arise in which more than one category applies. For example, although it is possible to distinguish between emphatic utterances and questions, it is possible for a question to be uttered emphatically. In some systems, only one utterance-final category is allowed per clause. If an utterance is both a question and emphatic, only the emphatic marker is used. Functionally this amounts to assigning codes according to a hidden hierarchy, with emphatic dominating and overshadowing any other utterance-final category with which it co-occurs. This uneven weighting of dimensions artificially forces categories to be mutually exclusive when they aren't, and may obscure distinctions in two ways. First, the dominating category (e.g. Emphatic) hides the co-occurring categories (e.g. Period, Question), making them inaccessible to later computer retrieval. Second, the functions covered by the dominating category differ arbitrarily from utterance to utterance (covering in some cases properties of Period and in others properties of Question), causing the category to be less pure than it would be if both categories were explicitly coded wherever both apply.
Marking both, that is, "double coding", ensures that the categories themselves remain systematically contrastive, and improves the exhaustiveness of computer search.
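The effect of a hidden hierarchy on retrieval can be made concrete with a small sketch. The utterances, category names and ranking below are hypothetical, chosen only to mirror the Emphatic / Question / Period example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Utterance:
    text: str
    codes: frozenset  # utterance-final categories that apply

# Assumed ranking for the single-code scheme: emphatic overrides the others.
HIERARCHY = ("emphatic", "question", "period")

def collapse(utt: Utterance) -> Utterance:
    """Single coding: keep only the highest-ranked category that applies."""
    for category in HIERARCHY:
        if category in utt.codes:
            return Utterance(utt.text, frozenset({category}))
    return utt

def retrieve(utterances, category):
    """Return the text of every utterance coded with the given category."""
    return [u.text for u in utterances if category in u.codes]

# Hypothetical double-coded data.
double_coded = [
    Utterance("you did what", frozenset({"question", "emphatic"})),
    Utterance("where is it", frozenset({"question"})),
    Utterance("stop that", frozenset({"period", "emphatic"})),
]
single_coded = [collapse(u) for u in double_coded]

print(retrieve(double_coded, "question"))  # ['you did what', 'where is it']
print(retrieve(single_coded, "question"))  # ['where is it']
```

Under single coding the emphatic question disappears from a search for questions, while the "emphatic" set mixes questions and statements; double coding keeps both retrievable.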
A field-wide transcription standard

There is much interest presently in establishing a standard for transcription which can serve as a basis for data exchange within and across the several disciplines engaged in research on language. It is useful to consider the implications of the issues discussed here for the form such a standard should take.
At a basic level, there is substantial agreement among researchers concerning what is essential to include in a transcript. This includes the spoken words and who said them, context or ongoing activity, nonverbal events, pauses, truncated words or utterances, and some basic turn-taking information (overlaps, etc.). Beyond this level, preferences diverge, partly due to specific research needs and partly due to overall differences in theoretical orientation (see Edwards - Lampert in preparation). For this reason, a field-wide transcription standard needs to be "minimalistic" in the sense of preserving the most universally needed information, yet at the same time allow for diversity and differing levels of specification in a simply extendable and uniform way. It should also contain conventions for regularizing variants in the data, and for other types of consistency necessary for exhaustive computer search. When applied to a particular archive, it would be desirable also to specify a fixed set of precisely defined obligatory categories, to ensure that all categories mean the same thing across all data sets in the archive (i.e. are systematically contrastive). Finally, it should accomplish all of these things in a maximally readable and minimally theory-committal way.

I have presented elsewhere (Edwards 1989) a proposal for a minimalist standard for child language transcription, which combines conventions from the child language literature (e.g. Brown 1973; Bloom in preparation; Bloom - Lahey 1978; Ochs 1979; Fletcher 1985), and adult discourse (especially Chafe in preparation; Du Bois - Schuetze-Coburn in preparation; and Gumperz - Berenz 1990), with provisions needed for exhaustive search. A similar minimalist set of information serves as the basis for the encoding standards being proposed by the Text Encoding Initiative's subcommittee on "spoken language", headed by Stig Johansson (cf. Johansson 1990; 1991).
The Text Encoding Initiative is an international, interdisciplinary project, headed by Michael Sperberg-MacQueen, with funding from professional associations in linguistics, computer science and the humanities, and entrusted with establishing encoding standards for spoken and written language to facilitate data exchange. An important feature of the TEI approach is the use of a mark-up language (SGML), which involves marking of a distinction in a specific and highly systematic way, which can then be displayed in any of several different ways, as desired by the researcher. For example, overlaps could be displayed either minimally by square brackets or in a highly prominent manner including also vertical alignment. Similarly, speaker turns could be displayed one above the other or in columns or in any other systematically specifiable manner, and nonverbal information could be positioned with reference to time order or placed on dependent tiers beneath utterances as in
CHILDES. This kind of flexibility of display will greatly facilitate use of the same data by researchers with differing theoretical orientations.
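The separation of encoding from display can be illustrated without any SGML machinery. The sketch below is not TEI or SGML markup; it assumes a hypothetical in-memory representation in which a turn is a speaker label plus an ordered list of (kind, text) segments, and it shows the same record rendered in two of the layouts discussed in this paper. Position markers such as <bef> and <aft> are omitted for brevity.

```python
# Hypothetical structured record: speaker plus time-ordered segments.
turn = ("SAR", [("nonverbal", "clears throat"), ("speech", "see my doggie.")])

def render_inline(speaker, segments):
    """Nonverbal events in parentheses on the utterance line, in time order."""
    parts = [f"({text})" if kind == "nonverbal" else text
             for kind, text in segments]
    return f"{speaker}: " + " ".join(parts)

def render_tiers(speaker, segments):
    """Speech on the main line, nonverbal events on dependent % lines below."""
    speech = " ".join(text for kind, text in segments if kind == "speech")
    lines = [f"*{speaker}: {speech}"]
    lines += [f"%par: {text}" for kind, text in segments if kind == "nonverbal"]
    return "\n".join(lines)

print(render_inline(*turn))
print(render_tiers(*turn))
```

Because both displays are derived from one underlying record, a researcher can choose the layout that suits the analysis without re-transcribing or recoding the data.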
Summary and conclusion

This paper discussed seven principles of visual display in a transcript, for highlighting and backgrounding information and indicating relationships among different types of information, illustrated with examples from actual conventions in use in the field. The paper then summarized considerations necessary for effective use of transcript data in computer-assisted research, focussing primarily on issues of regularization and systematic contrast of categories within and across data sets. The paper concluded with a consideration of the relevance of these properties to the design of a cross-disciplinary transcription standard, such as that being developed within the Text Encoding Initiative.

The ability to test hypotheses across the enormous data sets in spoken language archives provides the chance to verify the generality of results across a wide range of speakers or situations, and thereby to strengthen the empirical base of theory in the field. The validity of such research, however, depends upon the accountability of the data records being analyzed. It is hoped that the issues raised here provide a useful basis for further consideration of these and related concerns.
References

Bloom, Lois
1973 One word at a time; the use of single word utterances before syntax. (Janua linguarum, Series minor 154.) The Hague: Mouton.
in preparation "Transcription and coding for child language research: the parts are more than the whole", in: Edwards - Lampert (eds.).
Bloom, Lois - Margaret Lahey
1978 Chapter 1, in: Lois Bloom - Margaret Lahey (eds.), Language development and language disorders. New York: John Wiley & Sons.
Brown, Roger
1973 A first language: the early stages. Cambridge, Mass.: Harvard University Press.
Chafe, Wallace E. (ed.)
1980 The pear stories: cognitive, cultural, and linguistic aspects of narrative production. Norwood, NJ: Ablex.
in preparation "Prosodic and functional units of language", in: Edwards - Lampert (eds.).
Du Bois, John W.
in press "Transcription design principles for spoken discourse research". IPrA Papers in Pragmatics.
Du Bois, John W. - Stephan Schuetze-Coburn
in preparation "Outline of a system for discourse transcription", in: Edwards - Lampert (eds.).
Edwards, Jane A.
1989 Transcription and the new functionalism: a counterproposal to the CHILDES CHAT conventions. Cognitive Science Program Technical Report 58. University of California, Berkeley.
1992 "Transcription in discourse", in: W. Bright (ed.), Oxford International Encyclopedia of Linguistics. Oxford: Oxford University Press.
in preparation "Principles and contrasting systems of discourse transcription", in: Edwards - Lampert (eds.).
Edwards, Jane A. - Martin D. Lampert (eds.)
in preparation Talking data: transcription and coding in discourse research. New York: Erlbaum.
Ehlich, Konrad
in preparation "HIAT: a partiture transcription system for discourse data", in: Edwards - Lampert (eds.).
Fletcher, Paul
1985 A child's learning of English. New York: Blackwell.
Gumperz, John J. - Norine Berenz
1990 Transcribing conversational exchange. Cognitive Science Program Technical Report 63. University of California, Berkeley.
Johansson, Stig
1990 Encoding a corpus in machine-readable form. [MS.]
1991 Some thoughts on the encoding of spoken texts in machine-readable form. [MS.]
MacWhinney, Brian
1991 The CHILDES Project: Tools for analyzing talk. Hillsdale, NJ: Erlbaum.
MacWhinney, Brian - Catherine Snow
1985 "The child language data exchange system". Journal of Child Language 12: 271-296.
Ochs, Elinor
1979 "Transcription as theory", in: Elinor Ochs - Bambi Schieffelin (eds.), Developmental pragmatics, 43-72. New York: Academic Press.
Svartvik, Jan - Randolph Quirk (eds.)
1980 A corpus of English conversation. Lund: Lund University Press.
Tannen, Deborah
1984 Conversational style. Norwood, NJ: Ablex.
Comments by Gösta Bruce
The following very general criteria can be used as a starting point in the evaluation of a transcription system for spoken discourse: manageability (for the transcriber), readability, learnability, and interpretability (for the analyst and for the computer). It is reasonable to think that a transcription system should be easy to write, easy to read, easy to learn, and easy to search. These aspects are, of course, interrelated, but they do not necessarily provide the same answer to the question of the best transcription system. In her paper, Edwards focusses mainly on aspects of readability and interpretability, i.e. on the reception / perception side of a transcription, which admittedly seem to be the most important ones, but considerations about production / execution and learning may also turn out to be relevant. The principles of visual display and readability discussed by Edwards, which generally concern perceptual integration, separation and memory and more specifically spatial arrangement, contrast, transparency and economy, seem straightforward. Her argumentation appears to me, by and large, convincing and hard to disagree with. However, one objection to her last principle concerning efficiency and compactness may be that it should be applied with some caution, as it often seems favorable to maintain a certain amount of redundancy. Instead of trying to dispute the principles examined by Edwards, I will extend the discussion by relating them to my own experience from the transcription of prosody for the revision of the IPA (International Phonetic Alphabet) as well as from the transcription within a project on dialogue prosody. In the light of the current great interest in establishing a field-wide transcription standard in spoken discourse within the Text Encoding Initiative I think it is not unreasonable to ask what we can learn from related experience of transcription using the IPA, i.e. on a general rather than a more specific level (cf. 
Journal of the International Phonetic Association 19: 67-80, 1989). As the coordinator of the working group on Suprasegmental Categories set up for the recent revision of the IPA, I formulated the following, fairly general working principles and guidelines to facilitate the work in the group (cf. Bruce 1989). These principles may be of relevance also for the present discussion:
• It is recommendable to use symbols that are as simple and transparent as possible. This will in all likelihood facilitate a widespread use of the symbols. It does not necessarily mean, however, that iconic symbols are always to be preferred over more abstract ones.
• It may be better to favor already established conventions / symbols for suprasegmental transcription than completely new inventions. This means that the particular set of symbols has already been tested and probably found useful.
• It is advisable to avoid ambiguity in the use of a particular symbol. This recommendation may, however, not be valid in cases where a specific symbol has two different, well established meanings in two unrelated languages and where the risk of confusion in the actual use is minimal anyway.
• We should avoid entirely ad hoc or language-specific symbols and instead favor more general symbols and a system facilitating notation in agreement with language-independent / universal usage.

In our recent work on dialogue prosody within a research project called "Contrastive Interactive Prosody" dealing with Swedish, French and Greek, the auditory analysis has taken the form of a prosody-oriented transcription, i.e. basically an orthographic transcription to which are added prosodic features selected from our model of prosody. In parallel we also make an analysis of the dialogue structure itself including textual, interactive and turn-taking aspects without specific reference to prosodic information (cf. Bruce - Touati 1990). For the symbolization of prosodic features we use IPA symbols as far as possible (stress, phrasing, pausing), while for other categories where the IPA does not provide such diacritics (pitch range, boundary tones) we have developed our own symbols. On the basis of this experience I would therefore like to make a plea for the use of IPA symbols, where possible, in the transcription of prosody of spoken discourse.
To avoid some of the problems with "Systematic contrast among categories" discussed by Edwards I think there is need for a consistent theory or model of prosody to be used in the transcription of speech corpora. This makes me a little bit more skeptical about the realism in the suggestion by Edwards about a theory-neutral standard. To avoid further problems discussed by Edwards about "Systematic contrast" there is also, in my experience, a need for a transcription in which the notation of prosody is kept distinct from
the notation of discourse structure. Therefore, we recommend no direct symbolization of categories like Question Intonation or Continuation Tone, etc. It is only at a later stage, when the auditory prosodic analysis and the analysis of discourse structure are related, that such categories can be established, e.g. a strong, interactive initiative - i.e. one which requires a response - frequently being combined with a particular intonation pattern. For documentation of variation in the choice of intonation pattern accompanying for example Questions, see also Brown et al. 1980.
References

Brown, Gillian - Karen L. Currie - Joanne Kenworthy
1980 Questions of intonation. London: Croom Helm.
Bruce, Gösta
1989 "Report from the IPA working group on suprasegmental categories". Working Papers 35: 25-40. Lund: Department of Linguistics.
Bruce, Gösta - Paul Touati
1990 "Analysis and synthesis of dialogue prosody", in: Proc. ICSLP 90, Vol. 1, 489-492. Kobe: Japan.
Modern Swedish text corpora

Martin Gellerstam
1. Swedish text corpora: do they exist?

Swedish is a small language spoken by some ten million Swedes and studied by perhaps a few hundred scholars. A fair number of them are interested in corpus linguistics or at least find text corpora useful. For obvious reasons, Swedish text corpora have not reached the far corners of the earth as the Brown Corpus has, to name a well-known example of English text corpora. Also, when we do attempt to construct Swedish corpora, we tend to look west for inspiration and guidance, an old habit of ours. Some of our corpora may have been derivative, or "in a sack before they got into a bag" to approximate a Swedish saying.

To make matters worse, I am not quite sure what a real corpus is. In one of the questionnaires flooding our department these days asking about text corpora, I read that "a corpus differs from a collection in that it consists of texts (or parts of texts) by various authors, which have been assembled in a predefined way (often using statistical criteria) to construct a sample of a given language or sublanguage (newspapers, medical interviews, etc.)". This gives you some idea of what a corpus is but quite a few question marks remain. One issue of a newspaper is a corpus in this respect - thus a perfect mirror of the language spoken or written at a certain moment.
2. Why corpus evidence?

Corpora are constructed to answer one or normally many linguistic questions. But do we need corpus evidence? What about the massive, accumulated experience called "linguistic intuition"? Such questions are not asked without risk today but there was a time, not long ago, when the brains of the "ideal speaker/listener" were enough for most purposes. The answer is that the intuition of the speaker/listener is insufficient in many respects. First of all, the intuition of the language user must be sharpened in questions of acceptability. Is it possible to say
den där flickan som jag inte visste vad hon hette 'that girl who I didn't know what she was called'? Of course, this is a question of defining acceptability: is it a matter of the written norm ("the correct way of expression") or is it a matter of grammaticality ("the possible way of expression")? In both cases, corpora can be used as evidence. Acceptability in the latter sense can be considered as a linguistic phenomenon concerning one end of a scale. The other end concerns normality: what is the typical syntactic appearance of a verb like anse 'consider'? Is it

a anser att + SATS       'a considers that + clause'
a anser b + PRED         'a considers b + predicative'
a anser b vara + PRED    'a considers b to be + predicative'?
The need for good data showing what is normal use in various linguistic respects is crucial in lexicographic contexts where vocabulary usage should be summed up. Lexicography is also the field where very large text corpora are a necessary prerequisite. Acceptability and normality are different aspects of the frequency concept. But information about frequency also presupposes information on distribution over text types, semantic categories etc. The intuition of the language user is supported in this respect by the distributional data of the corpus, information on dispersion. For instance, it is a well-known fact that pronouns like I and my are rare in certain types of formal contexts in contrast to passive constructions and nominalizations. But linguistic units are distributed not only over text types but also according to regional, temporal and social factors. We wish to be informed by a dictionary that a word is regional, that it is a children's word or a word that should not be used for reasons of propriety. In all these cases, intuition needs the support of corpus evidence. In matters of support for our intuition, it is necessary that the corpus mirrors the factors of language we want to investigate. But corpora are useful for a lot of things. In many cases the role of the corpus is simply to help the linguist in his or her linguistic creativity, simply to give "good examples" of linguistic usage. In such cases, the construction of the corpus is less important than a wealth of good examples. I am putting the case for corpus evidence not because it is necessary in this forum but in order to point out in what respects a corpus may be useful. Naturally, a corpus is not good for everything. The construction and size of a corpus is governed by the type of question you wish to put. If you
want to study the store of graphemes in a language you can make do with a few pages of a book. But if you want to know anything for certain about infrequent words in a language and their contextual appearance, you need a corpus of tens or hundreds of millions of running words. There is no standard corpus for everything. The type of questions we want to ask also governs the construction of the corpus. If you want to study metaphor in informal speech, you have to find a corpus - or build one - that gives you an answer to your questions, not a perfectly constructed corpus with no possibility of giving you any answers.
3. From corpus to text bank

Text corpora have existed in Sweden for some decades as well as ideas about how they should be put together. Even from an international point of view, Sweden started early collecting and systematizing linguistic data. The first major corpus - of one million running words (what else?) - was collected back in 1965, by what was then the Research Group in Modern Swedish, led by Sture Allén. This happened at a time when corpus people still hid when they saw a theoretical linguist. Working with corpora was "counting words", not only a disreputable but also - according to Chomsky - an unnecessary occupation, since everything needed for linguistic description could be taken from the brains of the "ideal speaker/listener" anyway. The text corpora produced in many countries, including Sweden, at the time of the construction of the Brown Corpus and for about a decade after that, had a strong bias towards frequency counts. The Brown Corpus project appeared as a "Computational Analysis of Present-Day American English". Even if there was a well-defined corpus behind this investigation, the project gave the impression that disclosing frequencies and distributions of words in written American English was the principal aim. This was true of many other corpora at the time: Juilland - Chang-Rodriguez for Spanish, Allén for Swedish, Rosengren for German, etc. They were all "frequency counts" rather than corpus investigations in general. This statistical interest in vocabulary, underlined by extensive introductions giving statistical data about vocabulary measures, explains the strategy behind the construction of the corpora. To be able to draw statistical conclusions from the corpus, you had to make its foundation statistically sound. However, even if there was a fair amount of statistics in these early corpus projects, brought forth by the new powerful data medium, it was not
merely a matter of "counting words". But quantitative data tended to come into the foreground, overshadowing the necessary qualitative basis of the investigations. You cannot count without knowing what to count. By looking at sampling principles, you can see that the focus was on obtaining quantitative data (frequencies of words, constructions, morphemes, graphemes) rather than on compiling a range of corpora useful for different purposes. An important point was to make the corpus so diversified that no individual text could possibly distort the frequency figures. This is apparent in an investigation like that made by Juilland and Chang-Rodriguez (1964), who built their text "worlds" out of samples of isolated sentences from the respective sources (though the time span did not seem to present a problem). The method could be disputed from a statistical viewpoint also, but one fact should be stated: the corpus was of limited value for general corpus use. To generalize, there were two principal strategies for corpus collection at the time: the first (represented by the Brown Corpus) aimed at the collection of a mixed sample of written language, based on stratification of genres and random samples within each category; the second aimed at a detailed investigation of a more limited genre (like newspaper text). Apart from the fact that nobody to my knowledge has succeeded in collecting an interesting corpus of written language (to say nothing of spoken language) by using random sampling, the critics of the two approaches have said roughly the following: the principle of mixed text types (the Brown Corpus) makes reliable conclusions difficult since you do not know what the mixture stands for, whereas the more limited sample (like the Press 65 corpus) is not representative enough of the population Written Language.
From the viewpoint of language statistics it was probably not so important which method was used - for the simple reason that very few indisputable statistical conclusions were drawn from any of the bodies of material. The language statistics had two principal aims. One was to test general statistical theories about vocabulary: type-token ratio, Zipf's law, the size, variation and richness of vocabulary according to ideas put forward by Guiraud, Herdan, Muller and others. The other was to compare frequencies in the sample with frequencies in other corpora or frequency counts. These investigations led to the problems of comparing corpora of different sizes, not least in the area of disputed authorship. The statistics were also used for more practical purposes like building up basic vocabularies or making full-size dictionaries. It is doubtful whether you could talk about a breakthrough in language statistics as a result of these early corpus investigations. Vocabulary measures of the type just mentioned have hardly given us any powerful tools for
exploring new ground in the field of lexicology. On the contrary, there is a tendency to look upon language as a rather odd population, to put it mildly. As time went on, corpora gradually came into their own as useful tools for various linguistic purposes. Behind this development there was a growing interest in corpus studies in general and a closely related interest in lexico-semantic questions. The rapid expansion of computational methods accelerated this development. These factors also initiated new trends in corpus construction. Very soon people found out that they needed to search the corpora systematically, not only via concordances but also via linguistic variables. In short, you wanted, for example, to be able to produce all the instances of object-plus-infinitive without too much labour. This led to the need for syntactically tagged corpora, a project that was carried out on the Brown Corpus early on. Today, these laborious tasks have been made easier by new methods of semi-automated tagging in computational linguistics. People studying lexico-semantic phenomena or making dictionaries soon found out that a corpus of one million running words was inadequate in many respects. If you want to make a dictionary such a corpus is not very large. In fact, when you have discarded accidental word formations, names etc. there are perhaps a mere 50,000 lemmas left, half of which only appear once. Today, a corpus of one hundred million running words is an entirely realistic idea. Finally, people found that the text samples in corpora of the Brown type were too short for investigations based on texts rather than words. The need was felt for longer, continuous texts. All this led to a development away from "the ideal corpus" to what could be called "the text bank model". This means that you collect many different types of texts, define them accurately and, when necessary, mix them into something that could work as a standard corpus. 
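The vocabulary measures mentioned above (type-token ratio, the rank-frequency relation behind Zipf's law, and the share of words that occur only once) can be illustrated with a minimal sketch. The function name, the returned keys and the toy sentence are assumptions made for illustration; the sketch counts word forms, not lemmas, and is not the procedure used in any of the frequency dictionaries discussed here.

```python
from collections import Counter


def vocabulary_profile(tokens):
    """Return basic vocabulary measures for a list of word tokens.

    Hypothetical helper: counts lower-cased word forms, not lemmas.
    """
    freqs = Counter(t.lower() for t in tokens)
    n_tokens = sum(freqs.values())
    n_types = len(freqs)
    # Types that occur exactly once in the sample.
    hapaxes = sum(1 for count in freqs.values() if count == 1)
    # Rank-frequency list, most frequent first. Zipf's law predicts that
    # rank * frequency stays roughly constant in a large sample.
    ranked = [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(freqs.most_common(), start=1)
    ]
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "hapax_share": hapaxes / n_types if n_types else 0.0,
        "ranked": ranked,
    }


# Toy input, invented for illustration only.
sample = "the dog saw the cat and the cat saw a bird".split()
profile = vocabulary_profile(sample)
```

Because the type-token ratio falls as a text grows, figures obtained this way are only comparable between samples of the same size, which is one form of the problem of comparing corpora of different sizes noted above.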
Today, there are many such text banks around the world, some of them pure text archives, others connected to lexicographic and computational linguistic activities. But parallel to this development, the standard corpus has had a revival for different reasons: one is the fact that the Brown Corpus is still such a successful model that new corpora are modelled on it; another is the need to test systems for automatic analysis by means of computational linguistic methods. A project comprising all these elements - standard corpus, larger corpora, automatic tagging, testing of automatic analysis - has just been started as a joint venture by the Universities of Stockholm and Umeå.
4. Representativeness

A carefully collected body of text known as a corpus presupposes that the texts represent language on some level, ranging from the whole language of a nation to certain genres, subject areas, periods of time, etc. But endeavours to produce corpora that are random samples of spoken or written language are rare. The normal method is rather to make a stratified sample on the basis of ideas about the frequency of various text types. The random sampling comes in after that, when the desirable mixture of text types has been formulated. This is a reasonable attitude. Everyone thinking about the concept of language population must agree with that. How should such a population be described? Is it a question of the total output of written language during a certain period or is it the kind of language that people frequently come into contact with? And how are text types distributed in this population? How common is biblical language as compared with texts about chess? And if we take the vocabulary of the various text types, what is the frequency relation between individual words in the population? Is dog more common than cat in the language vocabulary? The question is absurd. But if you say that newspaper texts represent 75 per cent of all written texts in Swedish (which is true), and that a sample should mirror this fact, this sounds just as absurd. How many running words of fiction, or biblical language, or historic texts would be found in such a sample? In practice, most linguistic corpora are stratified. This can be done along two different lines: it can be based on information on the distribution of the total output of written language (you can find statistics to that effect), or the stratification can be based on what people really read, which is a different matter. Huge quantities of parliamentary proceedings are read by very few people, etc.
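The two-step procedure described above (first fix the mixture of text types, then sample at random within each type) might be sketched as follows. The genre names, the proportions and the pool sizes are hypothetical and are not taken from any of the corpora discussed; the function name is likewise an assumption.

```python
import random


def stratified_sample(texts_by_genre, proportions, sample_size, seed=0):
    """Draw a sample whose genre mixture follows `proportions`.

    texts_by_genre: dict mapping genre name -> list of text identifiers
    proportions:    dict mapping genre name -> desired share (summing to 1)
    Quotas are rounded, so the total may differ slightly from sample_size.
    """
    rng = random.Random(seed)
    sample = {}
    for genre, share in proportions.items():
        quota = round(share * sample_size)
        pool = texts_by_genre[genre]
        if quota > len(pool):
            raise ValueError(f"not enough texts in genre {genre!r}")
        # Random sampling happens only inside each stratum,
        # after the mixture has been decided.
        sample[genre] = rng.sample(pool, quota)
    return sample


# Hypothetical holdings and mixture, invented for illustration.
holdings = {
    "press": [f"press_{i}" for i in range(100)],
    "fiction": [f"fiction_{i}" for i in range(50)],
    "official": [f"official_{i}" for i in range(30)],
}
mixture = {"press": 0.5, "fiction": 0.3, "official": 0.2}
drawn = stratified_sample(holdings, mixture, sample_size=20)
```

The sketch makes the point of the discussion concrete: the statistics enter only at the last step, while the mixture itself is a decision taken by the corpus builder.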
But the stratification can also be based on ideas about what questions the texts can present good answers to. There are "one question corpora" (even if you might get reasonably good answers to other questions as well). If you wish to investigate how essay marks correlate with certain social factors, your corpus is already defined by the question. The use of the corpus governs its construction even if you are not interested in answering one question only. Evidently, corpora will differ widely according to what field of linguistics they will be used in. And corpora could be used in many fields of linguistics as Svartvik points out (1986: 9): "lexicography, lexicology, syntax, semantics, word-formation, parsing, question-answer systems, software development, spelling checkers, speech synthesis and recognition, text-to-speech conversion, pragmatics, text linguistics, language teaching and learning, stylistics, machine translation, child language, psycholinguistics, sociolinguistics, theoretical linguistics, corpus clones in other languages such as Arabic and Spanish - well, even language and sex." Finally, we should remember (cf. Section 2) that corpora can be used in linguistic research in other ways besides forming the basis of statistical conclusions. In fact, considering the way corpora are used, the question of representativeness is not always very important. We will take a few examples from Swedish corpora.
5. What questions can be answered with reference to a corpus?

Let us have a look at the way corpora are referred to in papers on Swedish linguistics (for a more detailed discussion of similar examples, see Staffan Hellberg's paper, this volume). I will choose the proceedings of one of the leading conferences on the subject (Svenskans beskrivning 15, 1985). In one of the plenary papers at the conference, Bengt Loman (p. 50) refers to the new possibilities of investigating grammatical structure as well as language use that have been created by the computer-produced text corpora. Out of some thirty papers presented at the conference - many of them theoretical, historical or describing special genres - there were six referring to general Swedish text corpora. The first one (107-117) discusses the context and relative frequency of markers of vagueness liksom 'sort of', på något sätt 'somehow', nånstans 'somewhere' in Swedish speech. The corpus evidence is taken from Conversation in Göteborg (see Appendix). The aim of the paper is to explain the function of markers of vagueness and the corpus offers a rich variety of examples and the relative frequency of different types. The style level of the corpus is described by the author as one of informal conversation which is also the aim of the corpus constructors. There is no further discussion of how representative it is. Another author (193-204) discusses lexicalized phrases like falla i sömn 'fall asleep' and råka i panik 'panic'. She calls them "abstract transition phrases" and argues that there are two types: closed and open. The open ones, allowing new formations within the framework of the archetypal phrase, are demonstrated with reference to new examples created in the corpus Novels 76 and 80 (falla i slummer 'fall into slumber', etc.). The corpus is used as a storehouse of good examples rather than anything else. Still, the choice of fiction - being a creative type of language - appears to be intentional.
This "intuition-aid" type of corpus use is in fact the typical approach. One of the other authors makes an explicit reference to this fact, saying roughly the following (in my translation): "In order that the analysis should not be based on my own intuition only, I have analyzed the verbs (of certain semantic fields) in the Novel 76 corpus. Most of the examples in this paper have been taken from this corpus" (p. 531). There is no discussion of - and no need for - representativeness. However, in one of the papers, the need is felt for reliable data on different types of text. The writer (347-358) discusses the omission of the auxiliary ha 'have' in connection with the past participle in subordinate clauses: När de (hade) kommit fram, fick de middag 'When they had arrived, they were given dinner'. The aim of the paper is to study the distribution of this type of construction in different types of text (especially speech versus writing) and to see whether the type of subordinate clause or the position of it makes any difference. The corpora used for this purpose are Conversation/Debate 67 and Interviews 67 (for speech) and Press 65 and 76 as well as Novels 76 and 81 (for writing). At one end of the scale is formal writing (the press corpora) with many instances of the construction without ha, and at the other end is speech with considerably fewer instances. In between there are the novel corpora with traits of both media. The author also spots a time trend in that the construction without ha decreases from 1965 to 1976. Of course, statements like these would be safer if the corpora were more diversified as to text types. It would also be easier to find answers to the other type of questions asked by the author (position, type of verb, etc.) since such factors tend to correlate with text type and style: verbs in the passive voice favour constructions without ha and the frequency of the passive voice tends to differ according to text type.
6. A survey of Swedish text corpora

Swedish text corpora come in two shapes: those for general purposes and those for special purposes, all within the framework of linguistic studies. Corpora for general purposes are relatively large and corpora tailored for special purposes are relatively small. Of course, the question of corpus size is a matter of opinion, especially at a time when corpora of a hundred million words are not far off. There is also a difference in size owing to the problem of putting speech into written form: collecting a speech corpus the size of the London-Lund Corpus is a much more impressive achievement than collecting
a much larger corpus of written text (apart from the problem of tagging the text). All the same, it might be useful to make a distinction depending on the purpose of the corpora. But before we discuss these corpora I would like to make a few qualifications. The Press 65 corpus is not the first Swedish corpus. In fact, there are three earlier bodies of material that should be mentioned. The first is a frequency count based on reports of the proceedings of the Swedish Parliament and business letters. The second is a frequency count of some half a million running words of Parliamentary proceedings (Widegren 1935). The last and most recent is a frequency count based on half a million running words from fiction, newspaper texts, school essays etc., distributed over a period of fifty years (Hassler-Göransson 1966). The reason why these corpora have not been discussed in this survey is that they are frequency counts rather than text corpora (there is no means of systematic access in the form of concordances). They also represent other purposes besides linguistic study (data to improve stenography and language teaching). I would also like to mention the fact that minor text corpora (or minor machine-readable texts of various kinds used as a basis for individual investigations) are considered to be outside the scope of this survey. So are machine-readable historical texts (see for instance the project called Källtext 'source text' within the Language Bank).

6.1. Corpora for general purposes

Most of the corpora produced for general use were collected by the Department of Computational Linguistics, University of Göteborg. Although suitable for general purposes the collection of these corpora was not altogether altruistic. In fact there were two - or rather three - good reasons for constructing the corpora.
The first reason was the lexicological interest of the department leading to the collection of a diversified set of texts considered to be of use in constructing a dictionary. The second reason was the Language Bank (formerly Logotheque) which started working officially around 1975 with the task of collecting machine-readable texts of various kinds. And a third reason may be added: when the Language Bank started working, the department was well under way describing a corpus of written standard Swedish, the Press 65 corpus (resulting in four volumes of the Frequency Dictionary of Present-Day Swedish). If the Press 65 corpus was built up in the context of standard corpora as a basis of frequency counts, the collection of the following corpora was
very much guided by text bank and lexicographical purposes. Lexicographic work demanded different types of general (but not so many terminological) genres. In the first run these were the following: official prose (a corpus of Parliamentary debates, 1978-1979), legal prose (represented by a collection of laws), general prose (bruksprosa, represented by three newspaper corpora) and fiction (represented by some ten million words from novels). The construction of the Press 65 corpus was accompanied by a discussion of representativeness. In this discussion, critics of the newspaper corpus pointed out the obvious fact that it was not representative of written Swedish in general. The constructors of the corpus (Sture Allén and his colleagues) agreed, regarding the population as one "made up of all texts by the same authors under the same circumstances" (Allén 1971: xxix). But Allén goes on: "When judging the relevancy of the population, one should remember that a large number of other writers may be expected to aim at the type of writing represented by the material, that it covers a central field of written Standard Swedish, and that the number of readers is very large." As we have pointed out before, it was also a deliberate choice - partly in opposition to adherents of the "ideal sample" policy - to collect a well-defined text, a good piece of Swedish, even if it could not tell the whole truth about how words were distributed in Swedish. The policy behind the other text corpora was to collect large chunks of Swedish, typical of different text types. Generally, this collection was not guided by ideas based on print statistics or sales figures, but rather by what seemed to be of interest for lexicographical work. And what is of interest from this point of view seems to be of interest from a general linguistic point of view, at least judging from the use of the corpora.
A small sample of these corpora has been tagged (about a hundred thousand running words from Press 65, see Järborg 1990) but otherwise these large corpora are not tagged. The result of the separation of homographs made on the Press 65 corpus was included in the frequency dictionary but the text was not tagged accordingly, a decision that we have reason to regret today. On the other hand, automatic word class tagging is easier today even if the problem of homography is far from being solved. The Swedish text corpora produced at Lund in the Department of Scandinavian Languages are more interesting from the point of view of tagging. Although smaller in size and partly more specialized as to text type, the corpora Conversation/Debate 67, Interviews 67 as well as "Bruksprosa 70" ('general prose') and "Gymnasistprosa 70" ('16+ Pupils' prose') have all been tagged according to syntax and word class criteria. These corpora have
formed the basis for a wealth of books and reports on written and spoken Swedish, social factors of speech, writing ability among school children, etc. There are a few slightly larger corpora of spoken Swedish, Conversation in Göteborg and the Eskilstuna corpus, both around half a million running words. The Göteborg corpus comprises dialogue, the Eskilstuna corpus interviews. Both corpora have been used for sociolinguistic investigations.
6.2. Corpora for special purposes

The corpora for special purposes are tailored to give good answers to specific questions - but, as often as not, the material could be used to put more questions. The Reportage 76 corpus is collected to shed light on linguistic variation in newspaper language but is a valuable corpus for other questions as well. The corpus "Elevsvenska" ('Pupils' Swedish') from 1977-1978 is tailored to answer questions about writing ability and could be seen as a counterpart of the "Gymnasistprosa" ('16+ Pupils' prose') corpus from 1970. The way boys and girls are referred to in books for boys and girls has been studied in an investigation at the University of Umeå based on a corpus of some 400,000 running words.
6.3. Present corpus investigations

A large Swedish corpus investigation (the SUC project) is under way at the Universities of Stockholm and Umeå. The goal is to produce a corpus of (at least) one million words of running text from different genres (to match the principles of the Brown and LOB Corpora). All words are to be classified for word class and for a set of morpho-syntactic properties. The corpus is also meant to function as a test-bed and a basis for comparison in the development and testing of various models for analysis. The project hopes to be able to take "a considerable step towards a fully automated tagging of unrestricted text". The Helsinki Swedish corpus of roughly one million running words will be tagged in cooperation with the SUC project.
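Once every word carries a word-class tag, a corpus can be searched by grammatical pattern rather than by word form, for instance to retrieve all instances of object-plus-infinitive. The following is a minimal sketch of such a search. The tag labels (PRON, VERB, INF, CONJ), the function name and the English toy sentence are assumptions made for illustration; they do not reproduce the tag set or the data of the SUC project or of any other corpus mentioned here.

```python
def find_tag_pattern(tagged_tokens, pattern):
    """Return every run of words whose tags match `pattern` in order.

    tagged_tokens: list of (word, tag) pairs
    pattern:       list of tags, e.g. ["VERB", "PRON", "INF"]
    """
    if not pattern:
        return []
    hits = []
    width = len(pattern)
    # Slide a window of the pattern's length over the token sequence.
    for start in range(len(tagged_tokens) - width + 1):
        window = tagged_tokens[start:start + width]
        if [tag for _, tag in window] == pattern:
            hits.append([word for word, _ in window])
    return hits


# Toy tagged sentence with invented tag labels, for illustration only.
sentence = [
    ("she", "PRON"), ("saw", "VERB"), ("him", "PRON"), ("leave", "INF"),
    ("and", "CONJ"), ("heard", "VERB"), ("them", "PRON"), ("sing", "INF"),
]
matches = find_tag_pattern(sentence, ["VERB", "PRON", "INF"])
```

A search of this kind is only as good as the tagging behind it: where homographs are left unresolved, a pattern query will both miss instances and return false hits.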
Appendix: Modern Swedish Text Corpora

(a) Written Swedish

1. Språkbanken ('The Language Bank')
Department of Computational Linguistics, University of Göteborg. (For further information, see leaflet with a description of the Language Bank.)

Press 65
One million running words from the following newspapers: GHT, SvD, DN, ST, SDS. Sample: paper tape from four weeks of each newspaper 1965. The text forms the basis of the Frequency Dictionary of Present-Day Swedish 1-4 (a detailed list of articles and authors is published in the fourth volume). Concordance and text in interactive form. Tagged version of one tenth of the corpus (see Järborg 1990).

Press 76
1.3 million running words from the following newspapers: GP, SvD, DN, Arb, SDS. Sample: paper tape from four weeks of each newspaper 1976. The text is being corrected from the point of view of printing errors. Concordance in alphabetic and reverse alphabetic order. Microfiche and tape.

Press 87
4 million running words from Dagens Nyheter (DN). Sample: editorial texts from four separate weeks 1987. The corpus is accessible in interactive form.

Parliamentary debates
4 million running words from Parliamentary debates during the parliamentary year 1978-1979. Sample: all plenary texts during the year. Concordance in alphabetic order with source information (political party, sex, etc. of the speaker). Microfiche and tape.

Novels 76
5.6 million running words from 69 novels published 1976-1977. Sample: all novels published by Bonniers Grafiska Industrier at the time. Half of the novels are written by Swedish authors; the other half are translations into Swedish (mostly from English original texts). Concordance in alphabetical order. Microfiche and tape.
Novels 80
4 million running words from 60 novels by Swedish writers published 1980-1981. Sample: all novels published by Bonniers Grafiska Industrier at the time. Concordance in alphabetic order. Microfiche and tape.

Legal language
0.5 million running words from Svensk Författningssamling 1978-1981. Concordance in alphabetic order. Microfiche and tape.

2. Talbanken ('The Speech Bank')
Department of Scandinavian Languages, Lund University (further information in Westman 1974, Einarsson 1978 and Hultman - Westman 1977)

"Bruksprosa 70"
0.09 million running words from brochures, newspapers, school books and debates 1970-1971. Sample: texts by professional writers considered (by experts) as being of general interest, informative, readable and comprehensible. Text tagged according to Teleman 1974. Paper concordance.

"Gymnasistprosa 70"
0.09 million running words from upper secondary school essays ('centrala prov för gymnasiet åk 3'). Sample: random sample of 150 essays written in Swedish schools, 1970. Text tagged according to Teleman 1974. Paper concordance.

3. Other corpora

Reportage 76
0.06 million running words from morning and evening newspapers (DN, GP, SvD, SDS, Arb; Exp, AB, KvP, GT) and weeklies (Året runt, Hemmets veckotidning, Vi, Lektyr, Se) 1976. Sample: random sample from the 15 text sources. Content analysis (see Strand 1984). Department of Scandinavian Languages, University of Stockholm.

"Elevsvenska"
0.07 million running words of school essays from secondary and upper secondary schools 1977-1978 (see Larsson 1984). Sample: stratified according to school level, type of education, sex and how the essay is marked. Tagged (word class). Frequency dictionary based on the corpus (Kent Larsson, Ordbok över svenska elevtexter).
Books for boys and girls
0.4 million running words from 20 books (published 1972-1980) for boys and girls. Syntactic and semantic classification of some 8000 excerpted words. Sample: stratified sample according to popularity, age and sex of the reader, price and literary quality (see Hene 1984).

Helsinki Swedish corpus
1 million running words from fiction (15 novels) and non-fiction (psychology, social debate, environment, technology, etc.). Sample: no explicit statistical sample. The corpus will be tagged.

SUC corpus (Forthcoming)
One million running words of Modern Swedish. The corpus will be tagged. Sample: a model of the Brown Corpus sampling. (See Källgren 1990.)

(b) Spoken Swedish

Conversation in Göteborg
0.5 million words of spoken dialogue recorded in Göteborg 1979. Sample: random sample. Variables of interest: sex, age, occupation, education and social status of the informants (see Löfström 1982).

Conversation/debate 67
0.048 million running words of conversation and debates recorded at Lund (21 academic informants), 1967-1968 (see Jörgensen 1970). Text tagged according to Loman - Jörgensen 1971 and Teleman 1974.

Interviews 67
0.067 million running words of interviews recorded at Borås 1967-1968. Sample: stratified sample of 32 interviews representing sex and occupation of the informant (see Lindstedt 1977). Text tagged according to Loman - Jörgensen 1971 and Teleman 1974.

Eskilstuna corpus
0.46 million running words of interviews recorded at Eskilstuna 1967-1968. Sample: stratified sample from 83 informants representing age, sex and social factors of the informants (see Nordberg 1985).
References

Allén, Sture et al.
  1971  Frequency dictionary of present-day Swedish 2. Lemmas. Stockholm: Almqvist & Wiksell.
Einarsson, Jan
  1978  Talad och skriven svenska. Lund: Studentlitteratur.
Hassler-Göransson, Carita
  1966  Ordfrekvenser i nusvenskt skriftspråk. Stockholm: Skriptor.
Hene, Birgitta
  1984  Den dyrkade Lasse och stackars lilla Lotta. [Diss.] University of Umeå.
Hultman, Tor - Margareta Westman
  1977  Gymnasistsvenska. Lund: Liber Läromedel.
Järborg, Jerker
  1990  Användning av SynTag. [Working report.] Språkdata, University of Göteborg.
Jörgensen, Nils
  1970  Om makrosyntagmer i informell och formell stil. Lund: Studentlitteratur.
Juilland, Alphonse - E. Chang-Rodriguez
  1964  Frequency dictionary of Spanish words. The Hague: Mouton.
Källgren, Gunnel
  1990  "'The first million is hardest to get': Building a large tagged corpus as automatically as possible", in: Hans Karlgren (ed.), COLING-90. University of Helsinki.
Larsson, Kent
  1984  Skrivförmåga. Studier i svenskt elevspråk. Malmö: Liber.
Lindstedt, Lennart
  1977  "Insamling av ett socialt stratifierat talspråksmaterial", in: Bengt Loman (ed.), Språk och samhälle 3. Lund: Gleerup.
Löfström, Jonas
  1982  "Språkdata", in: Nils Erik Enkvist (ed.), Impromptu speech: a symposium, 371-376. Åbo: Åbo Akademi.
Loman, Bengt - Nils Jörgensen
  1971  Manual för analys och beskrivning av makrosyntagmer. Lund: Studentlitteratur.
Nordberg, Bengt
  1985  Det mångskiftande språket. Om variation i nusvenskan. Lund: Studentlitteratur.
Strand, Hans
  1984  Nusvenskt tidningsspråk. Stockholm.
Svartvik, Jan
  1986  "For W. Nelson Francis", ICAME News 10: 8-9.
Svenskans beskrivning 15
  1985  University of Göteborg, Department of Scandinavian Languages.
Teleman, Ulf
  1974  Manual för grammatisk beskrivning av talad och skriven svenska. Lund: Studentlitteratur.
Westman, Margareta
  1974  Bruksprosa. Lund: Liber.
Widegren, P.G.
  1935  Frekvenser i nusvenskans debattspråk 1. Stockholm.
Comments by Gunnel Engwall
As Martin Gellerstam points out in his paper, Swedish linguists have been interested in corpus linguistics for several decades. This interest has led to a comparatively large number of existing Swedish language corpora and text banks, all of which are enumerated in an extensive appendix to the paper. In order to see what these materials represent, we can classify them into six commonly accepted categories. It is possible, of course, to imagine both fewer and more classes for the categorization, or even divisions according to other kinds of variables, such as levels of style or audience. Keeping to the proposed six categories, we get the distribution shown in Figure 1. In addition to the three corpora representing several genres of written language ("Mixed" in Figure 1), the Swedish materials can thus be classified into three of the first four categories. Altogether the written language materials total 23 million running words, in which the literature genre amounts to 10 million, the learned works genre 4.7 million, the newspapers 6.4 million and the mixed corpora 2 million running words. The spoken Swedish materials consist of interviews recorded in two medium-sized towns on the west coast and in central Sweden (0.5 million running words) and conversational material from the west coast and the south of Sweden (0.5 million), giving a total of 1 million words. Thus, the materials based upon the spoken language are fewer and shorter than the corpora of the written genres. This is actually true for most languages, since spoken language is much harder to establish in an appropriate way. The ratio between the spoken and written materials is probably higher for Swedish than for most other languages. Considering French language corpora, for example, we can note that the ratio is much lower. This is primarily due to the constantly increasing masses of written texts processed by the TLF (Trésor de la langue française).
To give an idea of the situation for French, important examples of French text materials are classified into the same six categories as for Swedish above (see Figure 2).1 Considering only the original corpus of TLF, totaling 72 million running words, the ratio between the spoken and written language becomes less than 1 million to about 80 million running words. As we see, the six categories considered are all represented in French text materials. However, the first category has attracted by far the greatest
Figure 1. Swedish language materials classified into six categories
[Tree diagram: the total population ("la parole") divides into written language and spoken language.]
Written language (mixed: SUC, Bruksprosa 70, Helsinki SweC)
  Literary works: Novels 76, 80; Books for youngsters (72-80)
  Learned works: Legal (Svförfattn. saml. 78-81); School essays 70; School essays 77-8; Parliamentary debates 78-9
  Newspapers: Press 65, 76, 87; Reportage 76
  Letters: (no materials listed)
Spoken language
  Monologue: Intv. Borås 67-8; Intv. Eskilstuna 67-8
  Dialogue: Conversat. Lund 67-8; Conversat. Gbg 79
interest among French linguists, and categories such as learned works and newspapers have only recently come into focus. Already this categorization points to the question of how to classify the linguistic material, and to the problem of representativeness, which Martin Gellerstam also discusses in his paper. The necessity of examining this question has become more and more apparent today. The more we know about text types and contexts, the better we understand the Swedish language; for instance, where we find expressions such as liksom, på något sätt, falla i sömn and råka i panik, quoted in the paper discussed. As the total population of a language cannot be accurately delimited, a choice of texts out of subpopulations becomes crucial. To my mind, this choice ought to be made in a multi-step procedure, similar to the cluster sampling method, including the selection of the category (or categories, if a mixed category corpus is to be constructed), the genre (or genres and subgenres),
Figure 2. Examples of French materials classified into six categories
[Tree diagram: the total population ("la parole") divides into written language and spoken language.]
Written language (mixed: Juilland)
  Literary works: TLF (Dict. des fréqu.); Novels 62-68; Céline
  Learned works: Economic textbooks
  Newspapers: COSTO; Belgian; Swiss
  Letters: Commercial; Tracts; Parole syndicale
Spoken language (Le franç. parlé)
  Monologue: De Gaulle; d'Estaing/Mitterand
  Dialogue: Français fondamental; M. d'Orléans; M. de Louvain
the time period, and finally individual texts or samples of these texts. This procedure gives us stratified samples, as mentioned in Martin Gellerstam's paper, also discussed in Biber (1991). It is particularly the last step, the sampling of specific passages and their size, which has been debated during the last few years. In a study published in 1988, Nelleke Oostdijk states that many existing corpora are inadequate for studies of linguistic variation, as their sample texts are too short. She suggests that the sample length should be around 20,000 running words, instead of 2000 or 5000 as in the Brown, LOB or London-Lund Corpus. This issue does not arise for the majority of the existing Swedish corpora, as they normally include entire issues of newspapers, or complete novels or essays. For my own studies on French best-selling novels the problem arises, however, and accordingly I have undertaken some lexical studies on the variation between samples of 2000 words and of 20,000 words. 2 My results
point to the fact that in many cases even small samples of 2000 words give convincing results. In Biber (1988; 1990) sample size problems are studied more thoroughly and for a large number of variables. From these studies, Douglas Biber draws the conclusion that our existing corpora provide useful material for many studies of variation.3

A third issue to discuss in connection with representativeness is the diversity of texts. How many genres should be included in the corpus? Our answer depends on the purpose and the available resources. For an overall study of the language and the linguistic variations it is obviously essential that the corpus represents all possible genres. In this context we ought to mention also the copyright question that may present a constraint to the corpus builder.4

The issues discussed are all crucial for the collection of text material, whether it be a corpus or a part of a text bank (or text archives). The selection methods used ought to influence our terminology, and should provide the basis for a clear differentiation between the terms. It is obvious that the term "corpus" has been used in different ways and is not unambiguous. It is therefore not surprising to meet Martin Gellerstam's statement that he is not quite sure what a real corpus is. He goes on to quote a definition of "corpus", which is said to consist of texts (or parts of texts) by various authors.5 Before concluding, I would like to propose a slightly broader definition of "corpus" and contrast it with a definition of "text bank".

Corpus: A closed set of texts in machine-readable form established for general or specific purposes by previously defined criteria.

Text bank (or text archives): An open set of texts in machine-readable form, to which new texts can be added continuously. A corpus can be established out of the different texts in the text bank.
Defining a corpus in this way means that the texts chosen to form the corpus may all be written by a single author, provided that the selection procedure and the criteria used are clearly defined in advance.

The classification of the materials into six categories has shown that the spoken language is comparatively well represented among the Swedish machine-readable texts listed in Martin Gellerstam's paper. The diversity is also quite broad, although the representativeness is not always clear. The importance of a deliberate multi-step choice of texts is stressed in my comments, which end with a proposal for a distinction between the terms "corpus" and "text bank".
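The proposed distinction can be illustrated with a small data-structure sketch. It is only one possible reading of the two definitions, written in Python: the class names, the fields and the sample records ("novel A", "daily B", "novel C") are hypothetical, and the selection criteria simply mirror the multi-step choice of category, genre and time period described above. The text bank is an open, growing list; the corpus is a closed set drawn from it by criteria fixed beforehand.

```python
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Text:
    title: str      # hypothetical identifier
    category: str   # one of the six categories, e.g. "literary works"
    genre: str
    year: int


@dataclass
class TextBank:
    """Open set of texts: new texts can be added continuously."""
    texts: List[Text] = field(default_factory=list)

    def add(self, text: Text) -> None:
        self.texts.append(text)

    def build_corpus(self, criteria: Callable[[Text], bool]) -> FrozenSet[Text]:
        """Closed set: a snapshot of the texts meeting predefined criteria."""
        return frozenset(t for t in self.texts if criteria(t))


def novels_1962_1968(t: Text) -> bool:
    # Criteria fixed in advance: category, then genre, then time period.
    return (t.category == "literary works"
            and t.genre == "novel"
            and 1962 <= t.year <= 1968)


bank = TextBank()
bank.add(Text("novel A", "literary works", "novel", 1965))
bank.add(Text("daily B", "newspapers", "press", 1965))
corpus = bank.build_corpus(novels_1962_1968)

bank.add(Text("novel C", "literary works", "novel", 1967))  # the bank grows
print(len(bank.texts), len(corpus))  # the existing corpus is unaffected
```

Because the corpus is a frozen snapshot, later additions to the text bank do not alter it; a new corpus has to be established explicitly by applying the criteria again.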
Notes
1. The French text materials quoted in Figure 2 are described in Dictionnaire des fréquences (1969-1971); Engwall (1984); Fortier (1981); Engwall - Bartning (1989); Danell (1990); Juilland - Brodin - Davidovitch (1970); Lyne (1985); Des tracts en mai 68 (1975); La Parole syndicale (1982); Cotteret - Moreau (1969); Cotteret et al. (1976); Blanche-Benveniste et al. (1990); Gougenheim et al. (1967); Robach (1974); de Kock (1983).
2. For some early results, see Engwall (1977; 1978).
3. For a discussion of this issue, see also the contributions to this volume by Greenbaum, Kennedy and Leech.
4. The concern for general considerations to be taken upon the construction of corpora and text banks, and for standardized information tags and formats within these text materials, has resulted in the international project, the Text Encoding Initiative (see for example Sperberg-McQueen - Burnard 1990).
5. This definition problem is mentioned in several of the papers in this volume, cf. especially the contribution by Francis.
References
Biber, Douglas 1988 Variation across speech and writing. Cambridge: Cambridge University Press.
  1990 "Methodological issues regarding corpus-based analyses of linguistic variation". Literary and Linguistic Computing 5: 257-69.
  1991 Representativeness in corpus design. [Paper for the Pisa conference on European corpus resources.]
Blanche-Benveniste, Claire - Mireille Bilger - Christine Rouget - Karel van den Eynde 1990 Le français parlé. Études grammaticales. (Sciences du langage.) Paris: Éditions du CNRS.
Cotteret, Jean-Marie - René Moreau 1969 Recherches sur le vocabulaire du Général de Gaulle. Analyse statistique des allocutions radiodiffusées 1958-1965. (Travaux et recherches de science politique 3.) Paris: Armand Colin.
Cotteret, Jean-Marie - Claude Émeri - Jacques Gerstlé - René Moreau 1976 Giscard d'Estaing/Mitterand. 54 774 mots pour convaincre. Paris: Presses universitaires de France.
Danell, Karl-Johan 1990 "Corpus de journaux francophones sur ordinateur". Travaux de linguistique 20: 73-82.
de Kock, Josse 1983 "De la fréquence relative des phonèmes en français et de la relativité de ces fréquences". ITL (Louvain) 59.
Des tracts en mai 68. Mesures de vocabulaire et de contenu. 1975 (Travaux et recherches de science politique 31.) Paris: Armand Colin.
Dictionnaire des fréquences. Vocabulaire littéraire des XIXe et XXe siècles 1969-71 4 tomes (tome 1: 4 vol.). (Études statistiques sur le vocabulaire français.) Nancy: Centre de recherche pour un Trésor de la langue française.
Engwall, Gunnel 1977 "G. Perec: Les Choses - sous quelques aspects quantitatifs", in: Lennart Carlsson (ed.), Actes du sixième Congrès des romanistes scandinaves, 67-77. (Acta Universitatis Upsaliensis. Studia Romanica Upsaliensia 18.) Stockholm: Almqvist & Wiksell International.
  1978 "Contenu, vocabulaire et statistique". Cahiers de lexicologie 33: 71-90.
  1984 Vocabulaire du roman français (1962-1968). Dictionnaire des fréquences. (Data linguistica 17.) Stockholm: Almqvist & Wiksell International.
Engwall, Gunnel - Inge Bartning 1989 "Le COSTO - description d'un corpus journalistique". Moderna Språk 83: 343-348.
Fortier, Paul 1981 Le métro émotif: Étude du fonctionnement des structures thématiques dans Voyage au bout de la nuit. Paris: Minard.
Gougenheim, Georges - René Michéa - Paul Rivenc - Aurélien Sauvageot 1967 L'élaboration du français fondamental (Ier degré). Étude sur l'établissement d'un vocabulaire et d'une grammaire de base, nouvelle éd. refondue et augmentée (1re éd. 1956. L'élaboration du français élémentaire). Paris: Didier.
Juilland, Alphonse - Dorothy Brodin - Catherine Davidovitch 1970 Frequency dictionary of French words. (The Romance languages and their structures, First series, F1.) The Hague: Mouton.
Lyne, Anthony A. 1985 The vocabulary of French business correspondence. Genève: Slatkine.
Oostdijk, Nelleke 1988 "A corpus linguistic approach to linguistic variation". Literary and Linguistic Computing 3: 12-25.
Parole syndicale, La. Étude du vocabulaire confédéral des centrales ouvrières françaises 1971-1976 1982 Groupe de Saint-Cloud - Bergounioux, Alain - Michel F. Launay - René Mouriaux - Jean-Pierre Sueur - Maurice Tournier (eds.). Paris: Presses universitaires de France.
Robach, Britt-Inger 1974 Étude socio-linguistique de la segmentation syntaxique du français parlé. (Études romanes de Lund 23.) Lund: Gleerup.
Sperberg-McQueen, C. Michael - Lou Burnard (eds.) 1990 Guidelines for the encoding and interchange of machine-readable texts. TEI. Chicago and Oxford: ACH-ACL-ALLC.
Trésor de la langue française. Dictionnaire de la langue du XIXe et du XXe siècle (1789-1960) 1971 Institut national de la langue française (Nancy). Paris: Klincksieck/Éditions du CNRS/Gallimard.
A new corpus of English: ICE
Sidney Greenbaum
Of making many corpora there is no end, to adapt the cynical comment of Ecclesiastes. Several corpora of English are in existence or are being planned, and the compilers of a new corpus of English need to avoid duplication and to justify the effort and expense. In my initial brief proposal for an International Corpus of English (ICE), I saw as its purpose the facilitation of comparative studies between national varieties in countries where English is the first language or an official additional language (Greenbaum 1988). In this objective, ICE will be a unique resource. Comparison is to some extent already possible between American and British English through the investigation of data in the American Brown Corpus and the British LOB Corpus, but these corpora are restricted to just two national varieties and to just printed material, and furthermore they are somewhat dated since all the printed texts were published in 1961. In contrast, I envisaged that the national corpora that constitute components of ICE would cover many more national varieties, would include spoken and manuscript material as well as printed texts, and would be dated no earlier than 1990. To permit valid comparisons, the components would be assembled along parallel lines within the same short period and would be processed in similar ways. At the time of writing at least 15 regional components are planned. They include the major English-speaking countries where English is predominantly the native language - Australia, Canada, New Zealand, UK, and USA - as well as the heavily populated countries where it is an official non-native language - India, Nigeria, and the Philippines. The East Africa component will comprise material from three countries: Kenya, Tanzania, and Zambia. 
Although the stated purpose of ICE is to provide the means for comparative studies, one conspicuous benefit of the project is that it is stimulating the compilation of national corpora in countries where they had not previously existed - such as Canada, Jamaica, and a number of Anglophone countries in Africa - thereby providing for the first time the resources for systematic study of the national variety as an end in itself. In addition, proposals have emerged for specialized corpora within the ICE framework: written translations from EC languages, Euro-English (the language used in documents
from the European Commission), writing by advanced learners of English, and international communication in which speakers from different countries participate (including those where English is a foreign language).

An international project involving so many participants raises problems in exchange of ideas and agreement on methods. We are attempting to overcome these problems through various means of communication: meetings at the annual ICAME conferences; personal visits to leading ICE centres, such as the Survey of English Usage at University College London (already visited by most ICE participants) and the TOSCA group at the University of Nijmegen; the ICE Newsletters accompanied by discussion documents from the Survey or other research teams (of which the thirteenth appeared in March 1992); frequent correspondence by letters or e-mail; and telephone calls. A two-day workshop is planned to be held before the 1992 ICAME conference in Nijmegen.

Flexibility is built into the project. The original project was purposely left very general, so that ICE participants could contribute to its development. Changes have been introduced as a result of early discussion and subsequent experience. From various teams have come suggestions for acquiring computer and audio equipment and software, and also suggestions for the software that it would be desirable to create. The Survey team, in particular, is devising software for ICE. The project has also been advanced by valuable advice and assistance from members of our international board of advisors.

For comparative studies, all regional teams are required to compile a core corpus that will be a component of ICE. The core corpora will be identical in size and as far as possible also identical in the language material of which they are composed, the period in which the material is produced, and the methods of processing. Each regional corpus also serves independently as a resource for the study of the particular regional variety.
We recognize, however, that some research teams may have the funding, as well as the desire, to extend their corpus in various respects for their own regional investigations. These optional extensions can take different forms; for example:

(1) an expanded corpus, where the same categories are retained in the same proportions as in the core corpus but the corpus is enlarged in size.
(2) a specialized corpus, in which material for a particular category (e.g. newspaper reports, student essays, business letters) is collected beyond what is required for the core corpus. The specialized corpus may also constitute a category that is not in the core corpus (perhaps because it does not range internationally as a significant category, e.g. electronic mail or answerphone) but is felt to be of particular interest for one or more regions.
(3) collections of speech to illustrate nonstandard sociolects or regional dialects, the language of children, or the language of immigrant communities.
(4) a monitor corpus, in the sense of the COBUILD project, that contains vast amounts of material (generally printed) that are not subject to precise categorization and that are continually replaced by newer material.
The same flexible and pragmatic approach has been applied to the collection and processing of material. It is imperative for comparative studies that ICE material should date from the same period, which it has been agreed should be 1990-1993. Spoken texts must be recorded during that period, though some radio and television broadcasts may still be available years afterwards. Printed and manuscript texts can usually be obtained later if necessary. Processing of the material can start at any time. The extent of processing will depend on funding, but I hope that all research teams will at least be able to convert the core corpus into machine-readable form and concordance it for lexical strings. Some research teams will be able to go much further. The Survey has received funding for an initial three years and expects within that period to tag each word in the corpus for its word-class and to parse the whole corpus. We are collaborating with the TOSCA research team, led by Jan Aarts at the University of Nijmegen, on their automatic word-tagging and parsing programs and these programs will become available to ICE teams. We are producing a Corpus Utility that will meet the ICE needs for concordancing, searching, and analysing corpora. In this work we are receiving assistance from the Computer Science Department at University College London, whose students are contributing research projects to the Corpus Utility. The Survey has already made considerable progress in providing facilities for the processing of ICE corpora. We have devised a mark-up system for the corpus and in addition mark-up assistant software that allows (1) word counting (to ensure texts of at least 2000 words), (2) automatic insertion of text unit boundaries, (3) reduced keypresses for insertion of markup symbols, and (4) automatic numbering of text units. Systems to expedite other ICE processes, such as word-class tag selection and syntactic marking, are also being developed. 
We are working on concordance programs designed specifically for ICE requirements.
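Concordancing for lexical strings, the minimum level of processing hoped for above, can be sketched in a few lines. The following Python example is a generic keyword-in-context routine and not the Survey's Corpus Utility; the function name, the tokenization rule and the sample sentence are all assumptions made for illustration.

```python
import re


def concordance(text, target, width=3):
    """Return (left context, keyword, right context) for each hit.

    `width` is the number of words of context kept on each side.
    Tokenization is deliberately crude: runs of letters and apostrophes.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample sentence, not taken from any ICE text.
sample = "the corpus will provide data and the corpus will be parsed"
for left, keyword, right in concordance(sample, "corpus", width=2):
    print(f"{left:>12} [{keyword}] {right}")
```

A program written for ICE would additionally have to ignore or exploit the mark-up symbols and, once the texts are tagged, allow searches on word-class tags as well as on word forms.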
In collaboration with the TOSCA team we are producing a word-class tagset for ICE and an accompanying manual. The changes that we are introducing from the earlier tagset are intended to promote consistency and to speed up the tag selection on the output of the automatic word-tagging. Manual tag selection is necessary because the output of the tagging program often provides more than one possible tag and sometimes the correct tag has to be inserted. If funding permits, we also hope to collaborate with the TOSCA team in refining their automatic parsing program, particularly in its application to spoken texts that exhibit hesitations, repetitions, self-corrections, and anacolutha. Charles Meyer (University of Massachusetts-Boston) has submitted an interesting example of transcribed conversation that illustrates the formidable task of adapting for speech a parser devised for printed texts:

but anyway you were saying Peggy that well I was asking you the general question what um how when you've taken course in the linguistics program have you done mainly just textbook stuff or have you actually this is actual hardcore linguistic scholarship

Our mark-up system provides normalization markers to signal deviations from the norms of standard written English. In written English they cover deviations from the norms of spelling and punctuation as well as syntax. We retain the original text together with normalizations involving insertions, deletions, or replacements - some of them produced by the speaker as self-corrections. We provide these emendations to allow the satisfactory application of the parsing program. Researchers will be able to retrieve the deviations and accompanying normalizations. We expect that at least some of the syntactic deviations will prove to be rule-governed and could therefore be incorporated in a grammar of spoken English.
At all events, the corpus will provide easily accessible data for distinguishing different kinds of such phenomena and the possible motivations for their production. Figure 1 displays a marked-up extract from a judge's summing up. It contains five examples of normalization: in line 2, the word-partial u is expanded by the speaker to under; in line 4, the judge says an ap uh an appropriate, and an ap is shown as deleted; in line 10, the judge corrects on this part of the case to on this part of the summons; in line 11 the word-partial i is expanded by the speaker to is; and in line 13 the speaker repeats the indefinite article a.
 1. <#\>The Court has to decide whether this is an appropriate receiving Court
 2. <}_><-_><._>u<./><-/><=_>under<=/><}/> the <,,>uh<,,> Act and whether the coroner
 3. in Gibraltar is
 4. <}_><-_>an ap<./><-/> uh an appropriate giving Court so far as the evidence
 5. is concerned <#\>Uh Miss uh Rogers explained this at considerable length
 6. and I could take up time in this ruling uh by uh giving reasons why uh in my
 7. judgement she made good her submissions under both Acts<,>
 8. <#\>I don't propose to do so uh because really the only point uh made by
 9. Mister Shields
10. <}_><-_>on this part of the case uh on this part of the summons<=/><}/><,>
11. <}_><-_><._>i<./><-/><=_>is<=/><}/> to say uh that uh the Act<,,> and the
12. rules made thereunder ought to be held not to apply uh to uh
13. <}_><-_>a<-/><=_>a<=/><}/> uh court uh<,> in another jurisdiction
14. specifically the Gibraltar uh <w_>Coroner's<w/> uh
15. Court
Figure 1. A marked-up extract from a judge's summing up
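The retrieval of deviations together with their normalizations can be sketched as follows. This is a minimal illustration in Python and not the Survey's software. It assumes, on the basis of the extract alone, that a normalization span has the shape `<}_><-_>deleted<-/><=_>replacement<=/><}/>` and that every other angle-bracketed symbol is a marker that can be stripped for reading; both assumptions, and all function names, are the sketch's own.

```python
import re

# Marker shapes are read off the extract in Figure 1; their exact
# semantics are an assumption of this sketch, not the ICE specification.
NORMALIZATION = re.compile(
    r"<\}_><-_>(?P<deleted>.*?)<-/><=_>(?P<kept>.*?)<=/><\}/>"
)
ANY_MARKER = re.compile(r"<[^<>]*>")


def strip_markers(text):
    """Remove remaining angle-bracket markers and tidy the whitespace."""
    return " ".join(ANY_MARKER.sub(" ", text).split())


def deviations(text):
    """Return (deleted, replacement) pairs, one per normalization span."""
    return [
        (strip_markers(m.group("deleted")), strip_markers(m.group("kept")))
        for m in NORMALIZATION.finditer(text)
    ]


def normalized(text):
    """Text as passed to a parser: deletions dropped, replacements kept."""
    return strip_markers(NORMALIZATION.sub(lambda m: m.group("kept"), text))


def original(text):
    """Text as spoken: deleted material and replacements both retained."""
    return strip_markers(text)


sample = "<}_><-_><._>u<./><-/><=_>under<=/><}/> the <,,>uh<,,> Act"
print(deviations(sample))
print(normalized(sample))
print(original(sample))
```

Keeping both readings recoverable from one marked-up text is what allows the parser to work on the normalized version while the researcher can still retrieve what the speaker actually said.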
Ideally, we would like to see every regional component pass through all of the following stages:
1. collection of texts, including recordings of spoken material and documentation of all material
2. optical scanning of printed texts and keypunching of other texts
3. mark-up of texts
4. automatic word-tagging
5. tag selection
6. syntactic marking (required to reduce the amount of ambiguity in the output of the automatic parsing program)
7. conversion into standard mark-up
8. automatic parsing
9. parse selection
10. conversion into standard mark-up
11. digitization
Concordancing will be introduced, as the material is processed, for lexical strings and collocations, for word-tags and combinations of word-tags, and for combinations of words with specified word-tags. Syntactic information
can be retrieved from the database after parse selection. Digitization will enable scholars to listen to the sound while reading the concordanced spoken text on the screen. Certain decisions were agreed on early and have been adhered to. We decided that each regional core corpus will contain approximately one million words. The number of words in the corpus is of course arbitrary, but it follows the precedent of the Survey, Brown, and LOB corpora. We felt that a million-word corpus was a practicable aim for all teams. As I have pointed out before, teams have flexibility in that they can extend their corpus to any size they wish outside their core corpus. We recognize that a corpus of a million words is inadequate for some purposes, for example the compilation of dictionaries, but we expect that our core corpora will be large enough to illustrate at least the major grammatical features of each national variety. Our corpora are much smaller than some that are being planned, but we hope to compensate for the smaller size by more intensive analysis, perhaps eventually incorporating semantic and discourse information. The number of words in a text was fixed at 2000. In this we followed the precedent of the Brown and LOB corpora rather than that of the Survey texts of 5000 words, since we preferred to restrict the use of composite texts, i.e. texts drawn from a number of sources, and we wanted to enlarge the number of samples. Composite texts will usually be required for material such as letters or telephone conversations. Teams can go beyond the minimum for their extended corpus; for example, if they wish to include a whole chapter or a whole book, perhaps for research into discourse structure. On the advice of the TOSCA team, a minimum of ten texts, i.e. 20,000 words, will be used for each text category. We agreed that the transcription of speech should be orthographic. 
The primary purpose of the ICE corpus is not directed towards studies of pronunciation or intonation. Since the sound recordings will eventually be made available, such studies will be possible and indeed will be facilitated for those corpora that are digitized for sound. We decided not to include prosodic transcription as a core requirement because of the problems of achieving consistency internationally. In addition, experience has shown that intonation experts prefer to use their own system; with a computerized corpus it is possible for experts to impose their own system on individual versions of the corpus. Again, research teams have the option of adding prosodic transcription or discourse markings to spoken material. For that reason I included with one of the ICE Newsletters a paper by Josef Taglicht outlining suggestions for prosodic transcription.
Punctuation will not be included in speech transcription, but pauses will be indicated by a binary system of brief and longer pauses (related to the tempo of the individual speaker). To assist the automatic parsing program, we have found it necessary to mark what would correspond to the ends of sentences in written English, though we recognize that such marking (which will not be set out as part of the text) is often arbitrary. Capital letters will follow the norms of printed English; the Survey of English Usage uses them at the beginnings of sentences to enhance readability. We have had considerable discussion on the population that will be sampled in our core corpus. We have agreed that the population should be restricted to adults, and for that purpose we have fixed a minimum age of 18. We want to sample educated usage for the core corpus, but there are problems in determining who should be considered as educated. At first, we considered as the essential criterion that the speakers or writers had received formal education through the medium of English to the completion of secondary school, and perhaps in second-language countries to the completion of a first degree. However, some of us felt that this criterion was too restrictive. One suggestion is that we should sample the language of professionals in the widest sense. The groups that would then be represented would include academics, lawyers, politicians, authors, broadcasters, journalists, and business professionals (e.g. managers, accountants). Students in higher education would be included as aspiring professionals. By far the largest amount of discussion has focused on the selection of ICE texts. Three papers have already been published on this topic: Schmied 1989 and 1990 and Leitner 1990. Schmied advocates a multidimensional approach for text categorization. In this approach, all textual and social variables that may influence the language of texts are recorded and used to characterize text types. 
The categorization of texts therefore becomes less significant than the variables exhibited in the texts. Schmied distinguishes the following textual variables, some of which are scalar:

the spoken medium
the presence of one or more other speakers
the presence of non-speaking listeners
the written medium
printed
physical distance between speaker(s) and listener(s)
direction towards a specific audience / readership
prepared
social distance between speaker(s) and listener(s)
subject matter
interactional and informative functions
creative

Social variables include sex, ethnic affiliation (in some countries, indicating the first language), education, region, age, and country (for multi-country corpora).

In practice, Leitner's approach results in a selection of texts similar to that advocated by Schmied. The difference resides primarily in Leitner's conception of the corpus structure as hierarchical and the importance he attaches to the social parameters of domains (such as home and mass media) and sub-domains.

The coding system that ICE is developing for the corpora will record the kinds of variables distinguished by Schmied and Leitner. The Survey expects to create software that will construct subcorpora exhibiting specified variables or combinations of variables. It will then be possible to concordance, search, and analyse subcorpora that are defined by text category and - more importantly - by any combination of textual or social variables. For example, we will be able to select a subcorpus of all public unscripted monologues spoken by males and a separate subcorpus for all such monologues spoken by females. Similarly, we will be able to examine separately all utterances by female speakers in conversation with males, with other females, and with a mixed set of male and female participants. These subcorpora will take account of the sub-texts that are components of a composite corpus text.

We agreed early on that the selection of texts should not be random. Rather, we should make an effort to include the full range of variables. For example, we should ensure that some telephone conversations are between males, others between females, and others still between speakers of both sexes. This requirement does not mean that the proportions should be the same as in the population as a whole. It is sufficient that these three sex variables are represented in telephone conversations.
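Both ideas, constructing a subcorpus from a combination of variables and checking that a text category covers the required range of variables, can be sketched briefly. The Python example below is a hypothetical illustration and not the ICE coding system: the record fields, the category labels and the sample identifiers are assumptions, and only text category and speaker sex are modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class SubText:
    text_id: str                   # hypothetical identifier
    category: str                  # e.g. "telephone conversation"
    speaker_sexes: FrozenSet[str]  # sexes of the participants


def select(texts: List[SubText], category: str, sexes) -> List[SubText]:
    """Subcorpus of texts in `category` whose speakers are exactly `sexes`."""
    wanted = frozenset(sexes)
    return [t for t in texts
            if t.category == category and t.speaker_sexes == wanted]


def missing_combinations(texts: List[SubText], category: str) -> List[List[str]]:
    """Sex combinations not yet represented in `category`."""
    required = [frozenset({"male"}), frozenset({"female"}),
                frozenset({"male", "female"})]
    present = {t.speaker_sexes for t in texts if t.category == category}
    return [sorted(combo) for combo in required if combo not in present]


# Hypothetical records for illustration only.
texts = [
    SubText("T1", "telephone conversation", frozenset({"male"})),
    SubText("T2", "telephone conversation", frozenset({"male", "female"})),
    SubText("T3", "unscripted monologue", frozenset({"female"})),
]

print([t.text_id for t in select(texts, "telephone conversation", {"male"})])
print(missing_combinations(texts, "telephone conversation"))
```

The coverage check asks only whether each combination is represented at all, in line with the decision that proportions need not match those of the population as a whole.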
We appreciate that some text types are not available or are not significant for all countries and for that reason they are excluded from the core corpora, though they may be deposited in a regional extended corpus. Among such text types are sermons, phone-in radio programmes, e-mail, answerphone recordings, and fax transmissions. We have also excluded poetry and certain other highly restricted forms of language, such as legal statutes and advertising.

As I have said earlier, ICE will provide the resources for comparative studies of national varieties and will also provide the basis for the first systematic studies of most national varieties. The data will be invaluable for research into syntax, morphology, vocabulary, and discourse, and through access to the sound recordings for research into phonetics and phonology. ICE will prove to be of particular interest to sociolinguists, and will have practical applications in English language teaching and in language planning. I expect that the corpora and the computational work associated with them will contribute to research in natural language processing. The ICE proposal that I published in 1988 has rapidly attracted a large number of enthusiastic collaborators. The time was ripe for the project.
Notes The research reported in this paper was supported by grant R00 23 2077 from the Economic & Social Research Council.
References
Greenbaum, Sidney 1988 "A proposal for an international computerized corpus of English", World Englishes 7: 315.
Leitner, Gerhard 1990 "International Corpus of English - corpus design - problems and suggested solutions", Computer Corpora des Englischen. CCE Newsletter 4: 25-49.
Schmied, Josef 1989 "Text categorization according to use and user and the International Corpus of English", Computer Corpora des Englischen. CCE Newsletter 3: 13-29.
  1990 "Corpus linguistics and non-native varieties of English", World Englishes 9: 255-268.
Comments by Jan Aarts
Sidney Greenbaum (SG) has given an outline of the set-up and the aims of the International Corpus of English. My comments will be complementary to, rather than critical of what he has been saying. As SG has said, the purpose of ICE is to create a collection of corpora which will provide the means for comparative studies between various national varieties of English. In order to achieve this purpose it is necessary that the corpora should be truly comparable: there should be standardization of the textual categories that are included, of the way in which the texts are edited and of the editorial markup that is added. Standardization of this kind will ensure comparability of the raw text of the ICE corpora. But, as everyone knows, the comparison of corpora containing just raw text cannot go beyond linguistically rather trivial observations. For more sophisticated comparisons it is necessary that the corpora should be enriched with linguistic information - experiences with two other corpora intended for comparison, the Brown and LOB corpora, have shown this: the availability of tagged versions of the two corpora has greatly increased their usefulness. Given the aims of the ICE project, it is therefore obvious that not only should the ICE corpora be linguistically enriched (say, tagged and parsed), but also that this enrichment should be standardized in order to make it possible for comparative studies to be based on the corpora. If we want to achieve this, two questions arise. The first is: what should be the level and what the nature of the enrichment, and the second: how can standardization be achieved? The answer to the first question is not difficult to give. 
Bearing in mind that the corpora are not primarily intended for practical applications (although no doubt their availability will also serve various practical purposes), but are to serve as input data for further linguistic research, it is clear that their enrichment should be at the level of detail and sophistication which is customary in English descriptive linguistics. It is not difficult to determine this level: English language studies have had a long and powerful tradition of grammatical description - over the years (centuries, rather) a vast literature has been accumulated, whose highlights are the grammatical handbooks providing survey descriptions of the English language (Jespersen, Poutsma, Kruisinga). In
the last two decades this line of grammatical handbooks has culminated in the grammars of Quirk - Greenbaum - Leech - Svartvik: A grammar of contemporary English and A comprehensive grammar of the English language. It is no exaggeration to say that no new studies in English grammar can be published without reference to the Quirk grammars - they have become truly standard for every linguist working in this field. The standard for the enrichment of the ICE corpora is therefore given in these grammars. In practice, this means that in the further processing of the corpora we should follow, as much as possible, the basic theoretical assumptions, the method of description, the linguistic notions and terminology, as well as the level of detail of description found in these grammars. If the Quirk grammars provide an answer to the question what the level and the nature of the ICE corpora should be, it might be assumed that the question of how to achieve standardization of enrichment is answered at the same time. One could simply leave it to each national team to conform to the standard set by these grammars in the tagging and parsing of each of the ICE corpora. But this is less simple than it sounds, for at least two reasons. In the first place, one of the major points of corpus analysis is precisely the fact that the analysis of any single corpus will always yield a great many (variants of) constructions and phenomena that are not described in the literature. And even grammars with the coverage of the Quirk grammars can never hope to describe everything that one happens to find in a corpus, let alone in a collection of over twenty one-million word corpora. A second and much more important reason is that for the linguistic enrichment of corpora one needs computational tools (taggers, parsers, parser-generators), and it is these tools that determine to a very large extent the analysis results one gets. 
This means that, if possible, the same tools should be used for the enrichment of the various ICE corpora. And that is where the Nijmegen TOSCA team comes in. As most of you will be aware, the name TOSCA stands for "TOols for Syntactic Corpus Analysis". Most of you will also have a rough idea of the kind of tools that we have developed over the years. The TOSCA tools include a programming environment for the tagging and parsing of corpora, a grammatical formalism geared to the analysis of natural languages (Extended Affix Grammar: a type of two-level attribute grammar), a parser-generator to convert grammars written in this formalism to parsers and a database (LDB) to store syntactic analysis trees, together with a query language. For English, we have a tagger and a parser which have proved their workability in the analysis of a little over a quarter of a million words of written English. In developing the tagger and the formal grammar for English we have followed
the descriptive system of the Quirk grammars as closely as possible. The demands set on the quality of the analyses are high, since the analysed utterances are intended to provide the research data for subsequent linguistic research. For this purpose, a "correct" analysis is understood to be one that is not only syntactically, but also semantically and pragmatically correct, reflecting the contextually appropriate interpretation of each utterance. This requires knowledge of the context as well as general knowledge of the world. Since in the present state of the art such knowledge cannot be made fully explicit and formalized in a grammar, the analysis process has to be interactive, the linguist making interventions to provide additional information or restricting possible parses whenever this is necessary. Intervention takes place at two points in the analysis process: once after tagging and again after parsing to select, if necessary, the contextually appropriate analysis tree. For their use within the framework of the ICE project these tools are being, or will be adapted. In the first place, the ICE tagset is a reduced and simplified version of the TOSCA tagset. Basically, the reason for this adaptation is the fact that so far the TOSCA tools have only been in internal use by the TOSCA team; and, as everyone knows, if you use your own software, this can be done easily and efficiently with a minimum of documentation. The situation changes when the software is going to be employed by external users. Therefore, as SG has said, the original TOSCA tagset was slightly reduced in order to promote consistency and speed up tag selection. A tag within the TOSCA system looks as follows: X(a,b,c), where X is a major word class and a, b and c are "affixes" or "features". These "affixes" may indicate subclasses, or carry a value for a morphological or semantic feature. 
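The tag shape X(a,b,c) lends itself to a simple mechanical reading. The following is a minimal sketch, assuming only the shape described above; the example tags and feature names (`N(com,sing)`, `CONJ`) are invented for illustration and are not taken from the TOSCA or ICE tagsets, and `parse_tag` is a hypothetical helper.

```python
import re

# A major word class, optionally followed by a parenthesised feature list.
TAG_PATTERN = re.compile(r"^\s*([A-Za-z]+)\s*(?:\(\s*([^()]*?)\s*\))?\s*$")


def parse_tag(tag):
    """Split a tag of the form X(a,b,c) into (word_class, [features])."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"not a tag of the form X(a,b,c): {tag!r}")
    word_class, feature_text = match.group(1), match.group(2)
    # A tag without parentheses, or with empty ones, has no features.
    features = [f.strip() for f in feature_text.split(",")] if feature_text else []
    return word_class, features


# Hypothetical tags, for illustration only.
print(parse_tag("N(com,sing)"))
print(parse_tag("CONJ"))
```

Separating the major word class from its features in this way makes it straightforward to map a richer tagset onto a reduced one by dropping or merging feature values.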
Whereas the TOSCA tagset consists of 36 major word class tags and comprises, with the inclusion of the features, a total number of 322 possible tags, the ICE tagset numbers only 28 major categories and a total number of 270 possible tags. At this time, the ICE tagset is definitive and the tagging manual is ready in its final version. Revision of the tagset entails, of course, revision of the grammar, for the tags are the terminal elements of the grammar and if these change, the grammar has to change. So in the near future we shall also have to turn the TOSCA grammar into an ICE grammar. Now, this sort of adaptation is a comparatively simple matter. Far less simple and much more time-consuming will be the adaptation of the grammar for two new varieties of English which we have not yet dealt with in the TOSCA setting: manuscript English and spoken English. We expect that manuscript English will show a fair resemblance to printed English, so that an adaptation of the grammar will suffice. Spoken English, however, may well be a different matter altogether,
and I would not be surprised if it should prove to be necessary to write an entirely new grammar for this variety. For the early stages of the ICE project very close collaboration with the Survey team will be necessary. In a pilot project we have developed a form of collaboration which roughly implies that all automatic processing is done in Nijmegen, while the selection procedures are carried out in London. Close collaboration in the early stages is necessary in the first place because we have to reach complete agreement about all aspects of grammatical description. It is also needed because, as I said, documentation of the analysis system was virtually lacking, owing to the fact that we ourselves were the only users. It was not until collaboration with the Survey team actually started that we produced a TOSCA tagging manual. The first documentation that is now available within the ICE framework is the ICE tagging manual. What will also be needed, once we launch upon the parsing stage, is an ICE handbook of grammatical description, say a readable version in natural English of the ICE grammar. This is needed not only to give an account of the ICE descriptive principles, but also to enable future users to fully understand the ICE analyses. Once we have gained sufficient experience in tagging and parsing and we can confidently expect that the systems are as reliably operational elsewhere as they are in Nijmegen, both the tagger and the parser will be made available to other national ICE teams at cost price, that is the cost that the TOSCA team will have to invest in the maintenance, updating and documentation of tagger and parser. It is our expectation that the tagger will become available fairly soon, while for the parser(s) more time will be needed. 
In this short survey of activities I have completely ignored the little question of funding, but it will be clear that the acquisition of funding is something that requires a lot of time and labour, and that the success of our attempts is essential for the realization of our plans. Ideally, however, there will be, at the end of the ICE project, three versions available of each of the corpora: a raw corpus, a tagged corpus and an analysed corpus.
The diachronic corpus as a window to the history of English

Matti Rissanen

1. Introductory

In this paper, I shall discuss some principles and problems relevant to the compilation of a diachronic corpus. My discussion is based on personal experience and on the observations made by our project group 1 while preparing The Helsinki Corpus of English Texts: Diachronic and Dialectal. The diachronic part of this corpus (henceforth the Helsinki Corpus) is now complete and ready for international scholarly distribution. It is, as far as I know, the first multi-purpose 2 corpus of English so far compiled which covers the time span of several centuries. Its size is c. 1.5 million words and it contains some 400 samples of continuous text from the 8th-century Cædmon's Hymn to the beginning of the 18th century. Shorter texts are included in toto, while longer texts are represented by extracts ranging from 2500 to 20,000 words. A detailed description of the Helsinki Corpus can be found in Kytö (1991). 3

One of the central questions concerning the usefulness of the Helsinki Corpus, like all corpora intended for linguistic analysis, is that of grammatical tagging. So far, our Corpus is untagged, which means that all retrievals must be based on words, their parts or combinations. We are very much aware of the immense importance of tagging and will consider this question seriously in the future. Programs suitable for tagging historical text material are being developed by Keith Williamson at Gayre Institute, Edinburgh, and Douglas Biber at the University of Arizona (formerly at the University of Southern California). It is obvious, however, that equipping the entire Corpus even with a fairly simple tag system would, with our present resources, be a matter of years of hard work.
2. Selection and coding of samples

Our Corpus differs from the previous multi-purpose corpora known to me in that each text sample is described with the aid of a number of parameters comprising a set of two or more values. As can be seen in the following sample taken from our Corpus, the parameter values give shorthand information on the text and its author. The codes are introduced in COCOA format (cf. Oxford Concordance Program).
<Q E2 XX CORP HARLEY> <N LET TO HUSBAND> <A HARLEY BRILLIANA> <C E2>
<D ENGLISH> <V PROSE> <T LET PRIV>
<W WRITTEN> <X FEMALE> <Y 20-40> <H HIGH>
<E INT UP> <J INTERACTIVE> <I INFORMAL>

[} TO MY DEARE HUSBAND S=R= ROBART HARLEY, KNIGHT OF THE BATHE.}]
S=r= - Docter Barker has put my sister into a cours of ientell fisek, which I hope by God's bllsing will doo her much good. My sister giues you thankes for seending him to her. I pray you remember that I recken the days you are away; and I hope you are nowe well at Heariford, wheare it may be, this letter will put you in minde of me, and let you knowe, all your frinds heare are well; and all the nwes I can seend you is, that my Lo. Brooke is nowe at Beaethams Court. My hope is to see you heare this day senet, or to-morrowe senet, and I pray God giue vs a happy meeting, and presarfe you safe; which will be the great comfort of Your most true affectionat wife, Brilliana Harley ("Ragly: the 30 of Sep. 1625.")
This is Lady Brilliana Harley's letter to her husband written in 1625, with parameter codings added to the beginning of the text.4 The top line in the code column can be used for the reference to each example taken from this text. The code values underneath indicate, among other things, that the text is a private letter (T) whose writer and addressee are on intimate terms but have unequal social standing (E). The author is female (X), between 20 and 40 years of age (Y), and of high social position (H). The parameter (C) defines the sub-period in which this text was written. The purpose of our coding is to give the user of the Corpus basic information on the sample in question. Such retrieval programs as the OCP can use the coding system for searches directed only to those samples which fulfil a predefined set of constraints. The parameter coding enables the student to choose between two approaches to the corpus. S/he can either collect all the occurrences of the structure or lexical item s/he is studying, in the entire corpus or a part of it, and then observe the distribution of the instances between the various parameters (e.g. dialect, text type, author, etc.). Alternatively, s/he can restrict his/her searches to samples fulfilling certain constraints (e.g. private letters written by middle-aged wives to their husbands between 1500 and 1640; religious treatises representing the East Midland dialect in 1250-1420) and contrast the instances found with others occurring in samples differently defined. The first approach could be characterized as descriptive and is close to the inductive methods of research; the latter is more dynamic and more deductive. 
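How a retrieval program might use such codings to restrict a search can be sketched as follows. This is a minimal illustration, not the OCP's own mechanism: the header is retyped here without the letter-spacing of the printed sample, its last code is read as I, and the helper names `parse_header` and `matches` are hypothetical.

```python
import re

# One COCOA-style code: a single-letter category followed by its value.
COCOA_CODE = re.compile(r"<\s*([A-Z])\s+([^<>]*?)\s*>")


def parse_header(header):
    """Map each parameter letter in a sample header to its value."""
    return {m.group(1): m.group(2) for m in COCOA_CODE.finditer(header)}


def matches(params, **constraints):
    """True if the sample carries every required parameter value."""
    return all(params.get(code) == value for code, value in constraints.items())


header = """<Q E2 XX CORP HARLEY> <N LET TO HUSBAND> <A HARLEY BRILLIANA> <C E2>
<D ENGLISH> <V PROSE> <T LET PRIV>
<W WRITTEN> <X FEMALE> <Y 20-40> <H HIGH>
<E INT UP> <J INTERACTIVE> <I INFORMAL>"""

params = parse_header(header)
print(params["T"], "|", params["X"], "|", params["Y"])

# Restrict a search to private letters by women in sub-period E2.
print(matches(params, T="LET PRIV", X="FEMALE", C="E2"))
```

The two approaches described above correspond to two uses of the same dictionary: tabulating the parameter values of every hit after an unrestricted search, or calling `matches` first so that only samples fulfilling the chosen constraints are searched at all.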
In actual practice, the student probably uses both methods side by side: s/he most likely begins with an overall survey of the occurrences of the form or construction under scrutiny, and after preliminary results may continue with a check on the influence of varying combinations of constraints on the distribution patterns. Equipping our corpus with parameter codings has compelled us to concentrate carefully on the process of selecting our samples and finding as much relevant information as possible on their characteristics and their authors' backgrounds. An attempt to define the parameter values for each text sample has been a highly rewarding but also a very sobering experience. The further back in time we go, the hazier the contours of the available text material become; for one thing, the anonymity of medieval texts is a rule with relatively few exceptions. We have done a great deal of philological detective work, particularly with our Old and Middle English samples and have also shamelessly exploited the expertise of the foreign scholars who have visited our Department. In many cases, however, we have been compelled to resort to the X value, which indicates either "not applicable or irrelevant to this
sample" 5 or, lamentably often, "not known" or "information too uncertain or inaccurate to be coded". Despite the obvious problems, I would emphatically recommend that all compilers of future corpora should seriously consider appending some kind of descriptive parameter coding to their text samples. This would not only help the user of the corpus enormously, but also encourage (or compel) the compilers to pay special attention to the question of how reliable a picture the corpus gives of the reality of the language it is intended to represent. It is only to be expected that neither the sample selection nor the parameter system can be fully satisfactory or watertight in any corpus: by definition, compiling a corpus means an endless series of compromises. Just as a corpus will never reliably reflect the language in all its varieties and modes of existence, so, too, parameter coding can never hope to give a complete and theoretically valid description of the samples. One of the first decisions a corpus compiler has to make is, indeed, whether to try and follow the demands of logic and theoretical adequacy in his or her work, or just observe the heuristic needs of the potential user of the corpus. In theory, these two aims should be identical; in practice the situation is much more complicated. In compiling the Helsinki Corpus we have fairly consistently let heuristic considerations prevail over theoretical ones. We feel that particularly in regard to a diachronic corpus this is the only sensible alternative. At the early stages of English the text material from which the sampling can be made is scanty and one-sided. And any attempt at a satisfactory coverage of the language of the past is jeopardized by the fact that the most relevant mode for the study of change, i.e. spoken expression, is missing. 
To mention a few of our heuristic solutions, we have not aimed at symmetry and equal length of samples but often give a disproportionately large share to texts which may help the student to reconstruct the vocabulary and constructions typical of spoken language. For related reasons, the inclusion of material representing Old English dialects is not in proportion to the number of extant manuscripts: non-West-Saxon texts are over-represented. Furthermore, very long extracts are taken from the Anglo-Saxon Chronicle because it represents simple, matter-of-fact narration; extracts of the translation of Boethius' De Consolatione Philosophiae are included in five versions to facilitate the study of the translators' solutions in different periods, etc.
3. Structural features of a diachronic corpus

As is implied in the above discussion, at least the following four aspects must be taken into consideration in the compilation of a diachronic multi-purpose corpus:

1. Chronological coverage: the corpus should be representative of all parts of the period(s) it is intended to cover.
2. Regional coverage: the corpus compiler should pay attention to the regional varieties of the language.
3. Sociolinguistic coverage: the texts of the corpus should be produced by male and female authors representing different age groups, social backgrounds and levels of education.
4. Generic coverage: the corpus should contain samples representing a wide variety of genres or types of text.
Evidently, the second, third and fourth points are common to all corpora; only the first is typical of a diachronic corpus.

3.1. Chronological coverage

The language historian's life is a constant wrestling with time, both real and apparent. This is particularly true of the constructor of a diachronic corpus who, even to begin this kind of work, must be obsessed by the idea of the importance of synchronic variation as a major source of diachronic developments. The very idea of a diachronic corpus is to give the student an opportunity to map and compare variant fields or variant paradigms in successive synchronic stages in the past.

In addition to the problem of less choice in texts in the earliest stages of the language, the diachronic corpus compiler also has to face the problem of how to shape the chronological ladder in his corpus. With the Helsinki Corpus we have, even in these decisions, let practical factors outweigh systematic and symmetrical solutions. In Old and Early Middle English, our sub-periods are a century long (from mid-century to mid-century), in Late Middle and Early Modern English, seventy or eighty years. The century principle is first broken in Late Middle English by the sub-periods 1350-1420 and 1420-1500. In this way, it has been possible for us to include the crucial decades of the gradual formation of the 15th-century Chancery standard within one and the same sub-period.

Admittedly, our sub-periods are long, and in future versions of the corpus we may consider shortening them. On the other hand, in Old and Early Middle
English an accurate dating of the texts is impossible, and the definition of the chronological stage of the language represented by the early samples is further complicated by the long manuscript histories. Even the century-long sub-periodization seems to organize the early material in a way that gives interesting results. One example that might be offered is a detail in the Old and Early Middle English pronominal paradigm, the earliest development of the forms of (n)aught 'anything', 'something', ('nothing'). This form goes back to Old English nawiht, nowiht (wiht 'creature', 'thing'). 6 Table 1 illustrates the chronological and dialectal distribution of the forms.

Table 1. The forms of (n)aught in the Old English prose samples and Early Middle English samples dating from 1150-1250: distribution by time and dialect. (WS = West-Saxon; A = Anglian; M = Mercian; N = Northumbrian; X = unspecified dialectal element; K = Kentish; S = Southern; WM = West-Midland; EM = East-Midland).

Period      Dialect            Forms (number of occurrences)

-850        AM(/X)             owiht(e) 1

850-950     WS(/X)             (n)awuht 5, nanwuht 14, (n)owiht 2, (n)auht 27, nawht 1, naht 4, noht 4
            WS/A(M)            nanwuht 1, (n)owiht 2, (n)aht 5, (n)oht(es) 7
            AN or AM(/X)       (no)wiht 4, (n)oht 2

950-1050    WS(/X)             nawuht 2, nauht 1, naht 18
            WS/A(M)            awiht 2, (n)owiht 4, naht 6, noht 10
            AN or AM(/X)       owiht 2, (n)oht 3

1050-1150   WS(/X)             (n)awiht 1, nanwyht 1, auht 1, (n)aht 15, (n)oht 2
            WS/A(M)            (n)aht 7, (n)oht(es) 4

1150-1250   K, S               awiht 1, (n)aht 6, noht(e) 3
            WM                 (n)awi(c)ht 51, nowiht 2, nawt 174, naht 3, noht 26
            EM                 (n)oh(h)t 72, (n)aht 37, nauht 3

[Position in the table not recoverable: noht 7]
As can be seen from Table 1, the longer forms become infrequent in the course of the Old English period. The distinction between the Alfredian texts of the late ninth century and the Ælfrician ones (c. 1000) can be easily seen. In the earliest Middle English, the longer affirmative forms are very rare (2 instances). The long negative forms are much more common, but they are only found in West-Midland texts; almost half the instances (24) occur in the Lambeth Homilies, which are copies of Old English originals.

As mentioned above, our Early Modern English samples are divided into sub-periods of 70 years (1500-1570, 1570-1640, 1640-1710). Even though this division can be seen simply as the result of slicing two centuries into three parts of equal length, it reflects changes in society and the stages in the structural development of English. The first sub-period, EModE1, is in many respects indicative of the Middle English heritage. EModE2 marks the process of rapid and radical change, while the third sub-period reflects the gradual establishment of the present-day structural system of English (see e.g. Nevalainen - Raumolin-Brunberg 1989). The simple figures in Table 2, which compare the occurrences of periphrastic do in affirmative statements (He did see his uncle yesterday) with those of the progressive be + -ing construction in the Helsinki Corpus, give a rough idea of these three stages of development. 7

Table 2. Occurrences of periphrastic do in affirmative statements and of the progressive form in the Early Modern English sub-periods in the Helsinki Corpus (excluding the Bible translations).

                   EModE1       EModE2       EModE3
                   1500-1570    1570-1640    1640-1710
DO (aff. stat.)    285          447          185
BE + -ING           25           38           85
Both periphrastic do and the progressive form show a remarkable proportionate increase in popularity in the late 16th and early 17th century. While the last-mentioned construction continues its steady progress in the second half of the 17th century, the number of occurrences of do drops sharply. The question of whether the two developments are in some way related remains to be answered by future studies, but it does not seem unlikely in view of the general development of verbal syntax in this crucial period in the history of English.
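The size of these shifts can be made explicit with a short calculation on the Table 2 figures. This is a rough sketch only: it works on the raw counts, since the word totals of the three sub-periods are not given in this passage, so the percentages are not normalized frequencies. The function and variable names are illustrative.

```python
# Raw counts from Table 2 (Helsinki Corpus, Bible translations excluded).
periods = ["EModE1", "EModE2", "EModE3"]
counts = {
    "DO (aff. stat.)": [285, 447, 185],
    "BE + -ING": [25, 38, 85],
}


def percent_change(before, after):
    """Relative change from one sub-period to the next, in per cent."""
    return (after - before) / before * 100


for construction, values in counts.items():
    for i in range(1, len(values)):
        change = percent_change(values[i - 1], values[i])
        print(f"{construction}: {periods[i - 1]} -> {periods[i]}: {change:+.0f}%")
```

On these counts both constructions rise by roughly half between the first and second sub-periods; in the third, do falls by more than half while the progressive more than doubles.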
3.2. Dialect

It is impossible to think of a diachronic corpus which would not pay attention to regional dialectal distribution in the periods preceding the establishment of the standard. In the Helsinki Corpus, all samples up to 1500 have been given dialect or localization parameter values; in the case of many 15th-century texts the definition, based on external evidence, simply signals that the sample represents some stage of development of the Southern standard. In Early Modern English, all texts are selected as representing this standard;8 collecting a dialect corpus from this period would need a completely new project [...] and are illustrated by Table 1 above. A quick glance at the distribution of the forms shows how the stem vowel <a> is typical of the West-Saxon dialect while <o> predominates in Anglian. In texts representing a mixture of these two dialects, there is considerable variation. In EME, <a> prevails in the South; it is also the more common variant in the West-Midland texts, while <o> is the predominating vowel in the East-Midland ones. Table 1 also shows how the form (n)auht is typical of the West-Saxon of King Alfred's period - a detail which, to my knowledge, has escaped earlier scholarship (cf. Rissanen, forthcoming).

One problem in defining the dialects, as well as the dates of composition, of the early samples is that these definitions must necessarily be based on earlier linguistic research. Too little is known about the authors and the provenance and manuscript history of the texts to rely on extralinguistic criteria. Thus there is a risk of circularity in dialect studies based on parameter values, and it is only to be hoped that the easy access to the earliest text material provided by the Corpus will help scholars to revise and sharpen the theories and assumptions on Old and Early Middle English dialects.

As mentioned above, the dialect parameter is not applied to the Early Modern English section of the Helsinki Corpus.
Observing geographical variety distributions is, however, possible in this period as well. There are two supplementary corpora under preparation: Older Scots (1450-1700) and Early American English (1620-1720), collected and classified by Anneli Meurman-Solin and Merja Kytö, respectively. These corpora follow, mutatis mutandis, the same format and conventions as the basic Helsinki Corpus and will, in due time, be available for international scholarly use.
3.3. Sociolinguistic factors

While the local dialect parameter is important only with Old and Middle English, the parameters giving sociolinguistic information on the authors are given greater importance in the later sections of the corpus. These parameters are applied only from Middle English on: even though we have reliable information on some Old English authors like King Alfred or Archbishop Wulfstan, this information is too inconsistent to form a basis for any sociohistorical considerations. In the Early Modern English section we have done a great deal of work to give reliable values to the sociolinguistic parameters; future studies will show to what extent sociolinguistic observations can be applied to the Early Modern text material.

I have made a very preliminary and tentative test on the use of hope (noun and verb) in private letters in the second (1570-1640) and third (1640-1710) Early Modern English sub-sections. 9 In their letters, women use hope 40 times per 10,000 words (absolute figures 38/9548) while the corresponding ratio in letters written by men is only 24.4/10,000 words (37/15163). At least in the letters studied, hope seems to convey involvement, attitudinal uncertainty, and concern for the future and for the welfare of the addressee or other persons. Similar surveys could easily be made of such lexical items as beg, pray, etc. Sociolinguistic implications pertaining to the amount of education can be traced in the spelling and overall use of the language in the letters of women when compared with those written by their husbands (cf. the letter written by Lady Brilliana Harley, above).
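The normalization behind these figures is simple enough to spell out. The sketch below recomputes the rates per 10,000 words from the absolute figures quoted above; the function name `per_10k` is merely illustrative.

```python
def per_10k(occurrences, words):
    """Frequency of an item normalized to a rate per 10,000 running words."""
    return occurrences / words * 10_000


# Absolute figures for hope in private letters, as quoted in the text.
women = per_10k(38, 9548)
men = per_10k(37, 15163)

print(f"women: {women:.1f} per 10,000 words")
print(f"men:   {men:.1f} per 10,000 words")
print(f"ratio: {women / men:.2f}")
```

Normalizing in this way is what allows the two groups to be compared at all, since the men's letters in the sample contain well over half as many words again as the women's.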
3.4. Type of text

Text type categorization is a highly relevant but also a difficult and frustrating structural aspect in corpus construction. A glance at our text types in Table 3, below, clearly shows that we have followed heuristic rather than logical principles in labelling our text types. It seems that no theoretically satisfactory classification by text type has so far been developed for the use of corpus compilers; much more research is needed in this field. As with chronological and dialectal categorization, we hope that future studies based on the text type groupings of existing corpora will help create new and more satisfactory models of classification.

One advantage of diachronic text type definitions, in comparison to chronological and dialectal ones, is that they can be based on extralinguistic criteria, and the risk of circularity in results is in this way diminished. These criteria mainly pertain to the subject matter, purpose, discourse situation and relations between the writer and the receiver. Earlier studies on register, formality, discourse types, etc. have greatly helped us in our definitions and general approach to the problem of text types. 10

A diachronic corpus poses special problems concerning text type definitions because it is obvious that the linguistic features typical of various text types do not remain unchanged (cf. e.g. Biber 1988). On the other hand, the implications given by the corpus for generic shift over the centuries are most interesting. As an illustration of these questions I shall present, in the following, a simple survey of the distribution of the personal pronoun forms at three stages in the history of English: 950-1050 (OE3 in the Helsinki Corpus), 1350-1420 (ME3), and 1570-1640 (EModE2). As some text types (OE histories, ME laws and correspondence) are lacking or poorly represented in these sub-periods, examples have also been collected from adjoining subperiods (OE2 for histories, ME4 for laws and correspondence). To simplify the tabulation - and to make the searches easier - I have only collected the subject forms of the pronouns. 11 I assume that the subject forms are sufficiently indicative of the amount of involvement and give relevant information on whether the text is first or second person oriented. 12

As can be seen in Table 3, the differences in the pronoun distribution in the various text types are remarkable and also relatively consistent in different periods. Text types vary in their first, second, or third person orientation in ways that might be expected. The most obviously first person oriented genre is of course the diary; next come private letters, comedies, travelogues with a first person narrator, 13 and, perhaps somewhat surprisingly, documents and official letters. Again not unexpectedly, in handbooks, rules and religious treatises second person subjects are common.
In rules, travelogues, diaries, comedies, letters and state trial records, there are, in general, more first and second person subjects than third person ones: the rate of involvement is high. The lowest rate of involvement is exhibited by law texts, histories and biographies. In Table 3 only those genres are included which appear in at least one of the three sub-periods under scrutiny. The Table shows graphically that the text types are not equally well represented in all sub-periods in the Helsinki Corpus. In the case of some genres (e.g. documents), the coverage of the corpus can be supplemented at a later stage; some will necessarily remain more or less non-diachronic. To diminish the disadvantages of "generic noncontinuity" we have, experimentally, grouped the text types into larger categories which run from Old English to Early Modern English: statutory texts
The diachronic corpus
Table 3. Subject forms of personal pronouns in three sub-periods in the Helsinki Corpus.
Sub-periods: OE (2-)3, (850-)950-1050; ME3(-4), 1350-1420(-1500); EModE2, 1570-1640.
1 STA.LAW1 STA/XX.DOCUM IS.HANDB IS/EX.SCIENCE
3
1
2
3
1
2
138
-
1%
99%
197
16%
1%
83%
164
4
161
140
11
307
50%
1%
49%
31%
2%
67%
9
40
201
125
147
236
196
176
412
4%
16%
80%
25%
29%
46%
25%
22%
53%
110
25
264
13
1
96
141
90
195
28%
6%
66%
12%
1%
87%
33%
21%
46%
122
44
259
29%
10%
61%
-
362
-
103
-
797
-
—
—
—
70
356
73%
32%
11%
57%
15%
IR.RULE
282
200
331
64
144
111
35%
25%
40%
20%
45%
35%
104
81
267
18%
59%
380 19%
1275
23%
348 17%
NN.DIARY NI. FICTION
—
—
—
—
64%
58
9
96
36%
5%
59%
132
60
1286
262
134
946
12
223
9%
4%
87%
20%
10%
70%
5%
95%
280
38
203
35
54
168
414
5
294
54%
7%
39%
14%
21%
65%
58%
1%
41%
138
121
855
12%
11%
77%
-
114 24%
NI.GEOGR.
-
195
63%
NN.BIOGR.
100%
766
8%
NI/NN.TRAV.
109
123
29%
NN.HIST.2
-
153
SERMON
IR/XX.PREF.
3
1
12%
IR.REL.TR.
2
39
EX.EDUC. IR. HOMILY/
2
—
-
153 32% —
-
-
-
-
-
38
18
382
9%
4%
87%
686
5
149
81%
1%
18%
212
232
127
692
272
152
458
44%
22%
12%
66%
31%
17%
52%
68 100%
Table 3 (cont.)

                   OE (2-)3          ME3(-4)              EModE2
                   (850-)950-1050    1350-1420(-1500)     1570-1640
                   1    2    3       1     2     3        1     2     3
XX.PHILOS.         -    -    -       223   109   483      159   59    235
                                     27%   14%   59%      35%   13%   52%
XX.CORR.PR.(3)     -    -    -       596   242   556      520   107   215
                                     43%   17%   40%      62%   13%   25%
XX.CORR.OFF.(3)    -    -    -       122   51    76       104   15    115
                                     49%   20%   31%      44%   7%    49%
XX.COMEDY          -    -    -       -     -     -        359   160   487
                                                          36%   16%   48%
XX.TRIAL           -    -    -       -     -     -        422   237   227
                                                          47%   27%   26%

(1) Middle English figures based on ME4 (1420-1500). (2) Old English figures based on OE2 and OE3 (850-1050). (3) Middle English figures based on ME3 and ME4 (1350-1500).
(STA), secular instruction (IS), religious instruction (IR), expository texts (EX), non-imaginative narration (NN) and imaginative narration (NI).14 The disadvantage with this grouping is that the categories necessarily contain heterogeneous material; we can see, however, that homilies and sermons, histories, and fiction.15 As can be seen from the figures in Table 4, pronominal usage in law texts is consistent in all the three periods in that first and second person subjects are avoided. The only notable exception is the frequent occurrence of we (also two occurrences of I) in Old English laws. In Anglo-Saxon society, legislation was closely associated with the person of the ruler: we speak of the laws of Alfred, Æthelred, Canute, etc. The following extract from Canute's laws exemplifies the use of both first person pronouns:

þis is seo woruldcunde gerædnes, þe ic wylle mid minan witenan ræde, þæt man healde ofer eall Englaland. ðæt is þonne ærest, þæt ic wylle, þæt man rihte laga upp arære & æghwylce unlaga georne afylle, . . . And we lærað, þæt, þeah hwa agylte & hine sylfne deope forwyrce, þonne gefadige man steore, swa hit for Gode sy gebeorhlic & for worulde aberendlic. And we beodað, þæt man Cristene men for ealles to lytlum huru to deaþe ne forræde; (Laws11C 308)16
Table 4. Occurrence of the subject forms of personal pronouns in five text types in the Helsinki Corpus, sub-periods (850-)950-1050, 1350-1420(-1500), and 1570-1640.
I   WE   THOU   YE   HE   SHE   IT   THEY
LAW 950-1050
2 1%
37
-
16%
1350-1500
-
1570-1640
-
-
-
1 1%
126 53%
2
25
1%
18%
-
8 3% -
24
39 16%
10%
75 54%
38 27%
22
2
61
24
20%
2%
56%
22%
691 47% 634
34 2%
40 3%
35%
HISTORIES 850-1050
86 6%
1350-1420
210 16%
1570-1640
12 5%
46
43
17
3%
3%
1%
52
108 8%
26 2%
4% -
-
-
33
47% 72
3%
31%
15%
36
98 7% 28 12%
521 181 13% 87 37%
HANDBOOKS 950-1050
7
2
40
3%
1%
16%
1350-1420
125 25%
1570-1640
165
21%
-
31 4%
_
52
26
97
26
21%
10%
39%
10%
146
1
63
10
116
47
29% 1
0% 175 22%
12%
2% 92 12%
23%
9%
95 12%
79 10%
45
396
64
284
4%
31%
53 4%
5%
23%
56
336 32%
123 12%
244 24%
109
122
0%
146 19%
HOMILIES & SERMONS 950-1050
66
1350-1420
5% 55 5%
296 23%
58 5%
98 9%
67 7%
103
92
12
5% 58
16%
15%
2%
103 22%
11 2%
1350-1420
214
1570-1640
20% 263 30%
1570-1640
122
9%
20%
0%
18%
20%
141
12
46
11
3%
10%
2%
16 3%
18 2%
29% 82 8%
139 29%
45 4%
33 3%
31 3%
121 14%
131 13% 98 11%
115 11%
9 1%
413 39% 219 25%
73 8%
68 8%
FICTION 950-1050
63 6% 3
('This is the secular ordinance that, according to the advice of my councillors, I want to be followed in all England. That is then first that I want rightful laws to be established and all unjust laws abolished . . . And we teach that although anyone offend or commit a serious sin, let the correction be regulated so that it be becoming before God and tolerable before the world. . . . And we command that Christian men should not, indeed, be too lightly condemned to death . . . ' [transl. partly based on Bosworth - Toller 1898])
The very high percentage of the third person neuter pronoun it in Middle and Modern English law text samples is noteworthy. (In OE, of course, even masc. and fem. 3rd p. sg. forms could refer to inanimate subjects.) This is another proof of the highly impersonal character of law texts from late Middle English on. Besides law texts, the histories show the least involvement of the text types included in the Helsinki Corpus, judging by subject pronoun distribution. This genre, too, shows a fair degree of consistency over the centuries. The apparent deviation shown by the figures relating to ME3 is due to the inclusion of (a late manuscript of) the metrical Cursor Mundi, which represents ecclesiastical historical writing and the narration of Biblical and other episodes in a lively and artistically enjoyable way. The figures for the personal pronoun subjects in the two other histories sampled from that period (The Brut and Trevisa's translation of Higden's Polychronicon) are more in accordance with the OE23 and EModE2 historical writings: out of the total of 521 pronominal subjects in these two histories, only 13% have first person and 8% second person subjects. The following extracts illustrate the differences in the narrative technique in Cursor Mundi (3719-3735) and Trevisa's translation:

þis iacob went quan sua was don,
And esau com efter son.
"Fader," he said, "vp on þi bedd,
I haue þe broght quar-of be fedd
O venisun, .i. here þe bring,
Ete and giue me þi blissing."
His fader him asked, "quat art þou?"
"þi sun," he said, "i esau."
"Was þou not at me right now,
And fedd me wit þi fang i trau?"
"I?" he said, "nai, nai goddote,
Moght i not be sua light o fote."
Wit þis gaue ysaac a grane;
"Sun," he said, "right nou was an
þat first me fedd, and sythen me kist,
And me be-suak, þat i ne wist,
Mi benisun now has þi broþer."

Also þe kynge for to have þe more large spens toward Ierusalem, he resignede þe castelles of Berwik and of Rokesburgh to kyng of Scotlond for ten þowsand pound. Also he begiled þe olde man þe riche bisshop of Durham, and he made hym begge his owne province for a greet somme of money. þerfore þe kyng seide ofte in his game, "I am a wonder crafty man, for I have i-made a newe eorle of an olde bisshop." By suche manere while and speche he emptede meny men purses and bagges, and solde dignetees and lordschippes þat longede to þe kyng, as þeyȝ he þouȝte nevere for to come aȝen. (Trevisa 87-8)
In all periods, handbooks are characterized by a high proportion of second person subjects. In Middle and Early Modern English samples the distribution is remarkably similar; the Old English sample, which consists of medical recipes, seems to differ radically from the later representatives of the same genre in being more third person oriented. The distribution can be further clarified if the individual texts providing the Middle and Early Modern samples are studied:
Table 5. Subject forms of personal pronouns in ME and EModE handbooks.

                          1p     2p     3p
Chaucer, Astrolabe        62     26     64
Equatorie of Planets      48     19     36
Treatise on Horses        15     102    136
Gifford, Witches          181    75     279
Markham, Contentments     15     101    133

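As a quick check on the counts in Table 5, the following minimal sketch converts each handbook's counts into percentage shares of first, second and third person subjects. It assumes the reading of the table in which the three figures for each text are the 1p, 2p and 3p counts; the names `TABLE_5`, `person_shares` and `shares` are illustrative only and are not part of the Helsinki Corpus or its software.

```python
# Counts of subject pronouns per handbook, read from Table 5 as
# (first person, second person, third person). Assumed reading of the table.
TABLE_5 = {
    "Chaucer, Astrolabe": (62, 26, 64),
    "Equatorie of Planets": (48, 19, 36),
    "Treatise on Horses": (15, 102, 136),
    "Gifford, Witches": (181, 75, 279),
    "Markham, Contentments": (15, 101, 133),
}


def person_shares(counts):
    """Return each person's share of the subject pronouns as a whole percentage."""
    total = sum(counts)
    return tuple(round(100 * c / total) for c in counts)


# Percentage shares per text, in the same (1p, 2p, 3p) order as the counts.
shares = {text: person_shares(counts) for text, counts in TABLE_5.items()}

if __name__ == "__main__":
    for text, (p1, p2, p3) in shares.items():
        print(f"{text:<24}{p1:>4}%{p2:>4}%{p3:>4}%")
```

On these figures the first person share is lowest in the Treatise on Horses and in Markham and highest in Chaucer and the Equatorie, with Gifford's dialogue in between.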
Table 5 shows that handbooks can be further divided into two basic types: those that give straightforward advice, exemplified by the Old English medical recipes, the ME Treatise on the Horses, and Markham's Country Contentments, and those instructing more indirectly on less matter-of-fact topics. The latter type may make use of personalized address (Chaucer to his son Lewis) or dialogue (Gifford's Dialogue Concerning Witches and Witchcraftes). It is only natural that in the last-mentioned type the rate of first person involvement is higher. Compare the following extracts:
Rekne and knowe which is the day of thy month, and ley thy rewle upon that same day, and than wol the verrey poynt of thy rewle sitten in the bordure upon the degre of thy sonne. Ensample as thus: The yeer of oure Lord 1391, the 12 day of March at midday, I wolde knowe the degre of the sonne. I soughte in the bakhalf of myn Astrelabie and fond the cercle of the daies, the whiche I knowe by the names of the monthes writen under the same cercle. Tho leyde I my reule over this foreseide day, and fond the point of my reule in the bordure upon the firste degre of Aries, a litel within the degre. (Chaucer, Astrolabe 669.C1)

And when þou hast a good hors at þin owen wille loke þat þou be warre bi-tyme þat he take not harme þrouȝ rauhede of blode whereþrouȝ an hors takeþ many euelis. And þus schalt þou knowe when þin hors nedeþ to be I-lete blod. (Treatise on Horses 87)

Dan. Our matter which we come vnto nowe, is the helpe and remedie that is sought for against witches at the hands of cunning men. And now if it please you to propound your questions, I will answere to them the best I can.
M.B. Nay truly, I see already all is naught, but yet I will obiect those things which haue caried me awrie. I take it a man is to seek remedy against euils, & I thought it was euen a gift that God gaue vnto those whom we cal cunning men, that they did very much good by. When a thing is lost, when a thing is stollen, many goe to them, and they help them to it. (Gifford E3V)

After his belly is emptied you shall cloath him first with a single cloath, whilest the heat indureth, and after with more as you shall see occasion require, and when you begin to cloath the horse, then you shall dresse, curry and rubbe him also; (Markham, Country Contentments 74)

In homilies and sermons, the distribution between the persons is relatively stable throughout the history of English, particularly as concerns the ratio between first and second person subjects on the one hand and third person subjects on the other. This may reflect the uniformity of the purpose of these texts, and also of the means for achieving this purpose, both by direct address and exhortation, and by references to Biblical and (particularly in the earlier periods) patristic and legendary material. The most remarkable variation in the samples under scrutiny is traceable in the distribution between the singular and plural first person pronoun: the preacher / author may involve his audience in the discourse by using the plural pronoun. This distinction can be clearly seen if the occurrences of the first and second person forms are studied in the ME and EModE homilies and sermons (see Table 6):
Table 6. Occurrence of first person singular and plural subjects in some ME and EModE homilies and sermons.

                          I      we
Northern Homily Cycle     50     23
Wycl. sermons             5      75
Hooker                    28     71
Smith                     75     21

Wyclif (or his fellow preachers) clearly avoids the first person singular subject, and this practice is shared by Hooker in comparison to Smith.17 The following extracts show typical first person subject usage in each of the four authors:

A, Lord, blith aght vs to be
When we think inwardly þat we
Sal lif ay in þat bigly blis
And neuer of þa mirthes mis.
Gude werkes gladly we suld bigin,
þat vnto þat welth might vs win,
(Northern Homily Cycle II 206)

And so we schulden more haue sorwe for synne þan for any oþer euel. And þus, ȝif we myhten lette synne, we schulden be Godis procuratours, al ȝif we dyen þerfore and profiȝten here no more. But lyue we wel, and God fayluþ not to counselen vs how we schullen do. And þus assente we not to synne, but profiȝte we as God biddiþ vs. (Wycl. Serm. 36 I 376)

Sometimes confessing with Iob the righteous, in treating of things too wonderfull for vs, we haue spoken we wist not what. Sometimes ending their talke, as doth the history of the Macchabees, if we haue done wel, & as the cause required, it is that we desire, if we haue spoken slenderly and barely, we haue done what we could. (Hooker 5)

I charge you in the feare of God that you do not mistake that which is said, for I knowe no learned preacher, nor learned writer of other mind. Yet least you should mistake the matter, as I distinguished of lenders, so I will distinguish of borrowers. (Smith E7V)

Finally, the samples of fiction show a clear bias towards first and second person subjects from Old English on (the OE sample is the story of Apollonius of Tyre, translated from Latin). The lower percentage of the ME figures is probably due to the long descriptive passages in the samples taken from Chaucer's Canterbury Tales.
In general, the preceding diachronic survey of the pronoun subject usage in five text types shows that there is considerable generic consistency in texts representing widely different periods. I also hope it has shown that although the text type classification of the Helsinki Corpus is by no means conclusive, it can give interesting results and provide impetus for further in-depth studies. It is important to note that in interpreting the results of text type comparisons, the individual texts should also be studied: only in this way can the typological heterogeneity be analysed and understood. Ideally, too, variation within a single text should be taken into account; unfortunately, we have only occasionally been able to indicate internal variation in our parameter coding system.
4. Concluding remarks

In sum, it seems obvious that the Helsinki Corpus will enhance the study of the history of English in making the access to textual evidence much easier than before. At the same time it is necessary to remind the users of certain facts which are, in a way, inherent in all corpus studies but particularly in the studies based on a historical corpus, and which can seriously undermine the results obtained. Firstly, the corpus always gives only a limited and biased picture of the reality of language. This is of course a truism but is easily forgotten when the student's instinctive awareness of the shortcomings of the corpus cannot be supported by his/her personal introspection and mastery of the language form studied. Secondly, the corpus only offers a material basis for the analysis; the power and capacity for the interpretation of the evidence can only be obtained through a scrupulous reading of the texts of past centuries and through a continuous effort to understand their most subtle shades of meaning. Thus text corpora should never be allowed to alienate scholars - particularly young scholars - from the study and love of original texts; on the contrary, corpus study should feed their curiosity and imagination. This seems to me fully possible, and I am also convinced that the existence of large corpora will encourage scholars to tackle linguistic problems and carry out ambitious research projects which would earlier have been impossible owing to the excessive toil and trouble of material collection.
Notes

1. I am indebted to all members of our project group for their work and for their stimulating discussions on the structure of the corpus. The following scholars have mainly been responsible for the choice of texts for the various parts of the corpus: Old English, Leena Kahlas-Tarkka, Matti Kilpiö, Ilkka Mönkkönen and Aune Österman; Middle English, Juha Hannula, Leena Koskinen, Saara Nevanlinna, Tesma Outakoski, Kirsti Peitsara, Irma Taavitsainen; Early Modern British English, Terttu Nevalainen, Helena Raumolin-Brunberg; Early American English (supplementary), Merja Kytö; Older Scots (supplementary), Anneli Meurman-Solin. The two supplementary corpora are being prepared. Merja Kytö has been the secretary of the project and it has been directed by the present author.
2. Computerized corpora can be conveniently divided into one-purpose corpora and multipurpose corpora. A corpus of the first-mentioned type is compiled for a particular, clearly-defined research topic and its size, structure and other characteristics are entirely determined by the demands of the topic. A multi-purpose corpus should provide the basis for a variety of studies over an extended period of time.
3. See also Rissanen - Kytö - Palander (forthcoming); Nevalainen - Raumolin-Brunberg (1989) for EModE; Rissanen (1991a) for OE.
4. The typographical conventions are explained in Kytö (1991).
5. As, for instance, the < M > and < K > parameters in Lady Brilliana Harley's letter, which specify the date of the extant manuscript and the contemporaneity of this manuscript with the original text, or the < G > and < F > parameters, which give information on texts translated from other languages. (See Kytö 1991.)
6. This development is discussed in more detail in Rissanen, forthcoming.
7. For a detailed discussion of the occurrence of periphrastic do in the Early Modern English samples of the Helsinki Corpus, see Rissanen (1991). To be valid, the figures should be compared with the instances in which do or the progressive form is not used. This comparison would, however, be too time-consuming for the present purposes. We have tried to make the three EModE sub-periods comparable in terms of the amount of text and the types of the samples (see Nevalainen - Raumolin-Brunberg 1989).
8. How successful we have been in selecting samples representing the standard is estimated by Raumolin-Brunberg - Nevalainen (1990).
9. In the 1500-1570 sub-section the number of occurrences was too low (23) to allow for results.
10. Among many important studies in the field, I would particularly like to mention Halliday - Hasan (1985); Milroy - Milroy (1985); Werlich (1983); Traugott - Romaine (1985); and Biber (1988).
11. Despite this limitation of the forms, the retrieval was relatively time-consuming. In Old English the third person sg. fem. and third person pl. forms are partly homonymous and there is considerable variation in forms (heo, hio, hie, hi, hy, etc.). With these pronouns, as with (h)it and EModE you, the nominative and oblique forms are partly homonymous. Using the Corpus even for simple searches soon convinces the student of the importance of grammatical tagging! At a rough estimate, the net time spent in collecting and sorting the material (c. 26,000 occurrences), and analysing the relevant instances, was approximately four days.
12. The comparability of the third person pronoun occurrences with the first and second person ones is of course diminished by the fact that the degree of pronominalization in the text should also be taken into account. The differences commented on are, however, so obvious that they are not decisively affected by this factor. I hope to return to the question of pronominalization at a later occasion.
13. Mandeville's Travels, which represents the "travelogue" in ME3, is written as a third person narrative.
14. Table 3 gives a rough indication of the grouping of the text types into these categories. Individual texts belonging to one and the same text type may be given different category labels (IS/EX, etc.). The label XX means that the text type is not included in any of the prototypical categories.
15. For the texts representing these genres, see Kytö (1991). The labels 'homilies' and 'sermons' in our terminology reflect the tradition of calling the Old English texts written by Ælfric, Wulfstan, etc. 'homilies' (cf. also the EME Bodley Homilies, Vespasian Homilies, etc.), and e.g. Wyclif's writings 'sermons'. Both types belong to the larger diachronic category 'Religious Instruction'.
16. Bibliographical information on the texts can be found in Kytö (1991). Accurate source references are also given at the beginning of each text file in the corpus.
17. For more accurate comparison, the occurrences of the pronominal forms in quotations and in the preacher's own text should be separated. Naturally, second person pronouns occur much more frequently in quotations than first person pronouns. For one thing, Christ's advice and exhortations to his disciples and others are often quoted. The Northern Homilies contain too much narrative material and direct quotations to offer a good point of comparison; the author seems to favour the first person plural pronoun in his own text.
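The retrieval problem described in note 11 can be sketched in a few lines. This is a minimal illustration, assuming untagged running text and only the five spellings the note names; the sample string is invented and normalized, and `VARIANTS`, `PATTERN` and `count_variants` are hypothetical names, not tools distributed with the Helsinki Corpus.

```python
import re
from collections import Counter

# The spelling variants listed in note 11 (the note's "etc." is not expanded here).
VARIANTS = ["heo", "hio", "hie", "hi", "hy"]

# Whole-word match on any variant; \b stops "hi" from matching inside "hit".
PATTERN = re.compile(r"\b(?:" + "|".join(VARIANTS) + r")\b", re.IGNORECASE)


def count_variants(text):
    """Count each variant form, ignoring case."""
    return Counter(match.group(0).lower() for match in PATTERN.finditer(text))


# Invented, normalized test string (not a quotation from the corpus).
sample = "Heo cwaed thaet hie comon, and hi hit ne wiston."
counts = count_variants(sample)

if __name__ == "__main__":
    for form, n in sorted(counts.items()):
        print(form, n)
```

A string search of this kind retrieves the forms but cannot separate singular feminine from plural, or nominative from oblique; that sorting still has to be done by reading each hit, which is the point the note makes about grammatical tagging.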
References

Biber, Douglas 1988 Variation across speech and writing. Cambridge: Cambridge University Press.
Bosworth, Joseph - T. Northcote Toller 1898 An Anglo-Saxon dictionary. Oxford: Clarendon Press.
Halliday, M.A.K. - Ruqaiya Hasan 1985 Language, context and text: aspects of language in a social-semiotic perspective. Geelong: Deakin University Press.
Kytö, Merja 1991 Manual to the diachronic part of the Helsinki Corpus of English Texts: coding conventions and lists of source texts. Department of English, University of Helsinki.
Milroy, John - Lesley Milroy 1985 "Linguistic change, social network and speaker innovation", Journal of Linguistics 21: 339-384.
Nevalainen, Terttu - Helena Raumolin-Brunberg 1989 "A Corpus of Early Modern Standard English in a socio-historical perspective", Neuphilologische Mitteilungen 90: 67-110.
Raumolin-Brunberg, Helena - Terttu Nevalainen 1990 "Dialectal features in a Corpus of Early Modern Standard English?", in: Graham Caie et al. (eds.), Proceedings from the Fourth Nordic Conference for English Studies, 119-131. Copenhagen: University of Copenhagen.
Rissanen, Matti 1991 "Spoken language and the history of do-periphrasis", in: Dieter Kastovsky (ed.), Historical English syntax, 321-342. Berlin: Mouton de Gruyter.
Rissanen, Matti forthcoming "Computers are useful - for auht I know", in: Fran Colman (ed.), Evidence for Old English: material and theoretical bases for reconstruction, 160-173. Edinburgh: John Donald.
Rissanen, Matti - Merja Kytö - Minna Palander forthcoming The diachronic part of the Helsinki Corpus of English Texts: introduction and pilot studies.
Traugott, Elizabeth Closs - Suzanne Romaine 1985 "Some questions for the definition of 'style' in socio-historical linguistics", Folia Linguistica Historica 6: 7-39.
Werlich, Egon 1983 A text grammar of English. Heidelberg: Quelle & Mayer.
Comments by Gunnel Tottie
The advantages offered by the Helsinki Corpus to historical linguists are obvious, but it is possible that it is the researcher specializing in Present-Day English who will reap the greatest benefits from its existence. Any student of linguistic variation knows that synchronic variation is often a symptom of language change, incipient or ongoing, and that the dichotomy between synchrony and diachrony - however elegant it may be as a theoretical construct - cannot be maintained in such research without causing major analytical problems. Explanatory adequacy concerning phenomena of the contemporary language can never be attained without a historical perspective. However, even with a reasonable background in the history of English, Present-Day English specialists will usually lack expertise in selecting historical material, choosing the best editions of the most appropriate texts for their purposes and making representative samples. The Helsinki Corpus offers precisely this to the linguist specializing in Present-Day English: principled selection and sampling carried out by eminent philologists, experts in their respective periods of research specialization. With the help of WordCruncher, the Helsinki Corpus will indeed be a window on the history of English for those who need it most but who cannot afford the time required for total immersion in earlier periods of the language. However, certain problems will face the Present-Day English specialist wishing to make a diachronic survey of a particular contemporary phenomenon. I will illustrate this with an example taken from my own work concerning the use of indefinite determiners in non-assertive clauses, i.e. the choice between zero, a(n) and any in sentences like She didn't see a cyclist / She didn't see any cyclist or He didn't have friends in London / He didn't have any friends in London.
(My goal is to establish the importance of factors such as specific and non-specific reference as well as regional dialect for the choice of determiner, and especially to chart the use of determiners with singular count nouns. See Tottie forthcoming.) Indefinite noun phrases occurring after the finite verb in sentences negated with not were extracted from a subset of the London-Lund Corpus of English Conversation (the 3 texts published in Svartvik - Quirk 1980) and a subset of the Lancaster-Oslo/Bergen Corpus of Written English (categories A-J, thus expository prose only).1 I found
that any was indeed unusual in written texts, only 9/139 instances, or 6%, compared with 29/163, or 18%, in conversation; cf. Table 1. The difference is highly significant: chi-square 8.735, p < .005, 1 d.f.

Table 1. Indefinite determiners in LOB (categories A-J) and LLC

               any         zero    a(n)    All
LOB (A-J)      9 (6%)      77      53      139
LLC            29 (18%)    74      60      163

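The significance test reported with Table 1 can be retraced from the four counts involved. The sketch below assumes that the test was an ordinary Pearson chi-square on the 2 x 2 table of any versus all other determiners in LOB and LLC, without continuity correction; that assumption is not stated in the text, though the reported value matches it. The function name `chi_square_2x2` is illustrative, and the critical value 7.879 is the standard tabulated figure for p = .005 at 1 d.f.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Rows: LOB (written) and LLC (conversation); columns: any vs. all other determiners.
lob_any, lob_all = 9, 139
llc_any, llc_all = 29, 163

chi2 = chi_square_2x2(lob_any, lob_all - lob_any, llc_any, llc_all - llc_any)

# Standard tabulated critical value of chi-square for p = .005 with 1 d.f.
CRITICAL_005_1DF = 7.879
significant = chi2 > CRITICAL_005_1DF

if __name__ == "__main__":
    print(f"chi-square = {chi2:.3f}, exceeds critical value: {significant}")
```

Under that assumption the statistic agrees with the reported 8.735 to three decimal places.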
Table 2. Any versus other determiners in different text types in Early Modern English (Helsinki Corpus, 1988 version)

Det type   Law    SI     RI     Chr    Trav   Fic    Dr     Tri    PrC    Totals
any        28     4      5      6      -      4      1      12     10     70
zero       32     38     60     29     8      17     18     31     20     253
a(n)       -      13     26     12     6      18     30     21     12     138
Totals     60     55     91     47     14     39     49     64     42     461
%any       49%    7%     5%     13%    0%     10%    2%     19%    24%    15%

Key: SI = secular instruction, RI = religious instruction, Chr = Chronicles, Trav = travel, Fic = fiction, Dr = drama, Tri = Trials, PrC = Private Correspondence
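The %any row of Table 2 is simply the any count divided by the column total, so the ranking of text types can be recomputed directly. The following is a minimal sketch using the any counts and column totals as read from the table; the dictionary names are illustrative. The recomputed law figure (28 of 60) may differ by a point or two from the percentage printed in the table.

```python
# "any" counts and column totals per text type, read from Table 2.
ANY = {"Law": 28, "SI": 4, "RI": 5, "Chr": 6, "Trav": 0,
       "Fic": 4, "Dr": 1, "Tri": 12, "PrC": 10}
TOTAL = {"Law": 60, "SI": 55, "RI": 91, "Chr": 47, "Trav": 14,
         "Fic": 39, "Dr": 49, "Tri": 64, "PrC": 42}

# Share of "any" among all indefinite determiners, per text type, in whole percent.
pct_any = {genre: round(100 * ANY[genre] / TOTAL[genre]) for genre in ANY}

# Overall share across the whole Early Modern English sample.
overall = round(100 * sum(ANY.values()) / sum(TOTAL.values()))

# Text types ordered from the highest to the lowest share of "any".
ranked = sorted(pct_any, key=pct_any.get, reverse=True)

if __name__ == "__main__":
    for genre in ranked:
        print(f"{genre:<5}{pct_any[genre]:>4}%")
    print(f"All  {overall:>4}%")
```

The ordering puts law texts first, followed by private correspondence and trials, with drama and travel at the bottom.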
I then wanted to check how the indefinite determiners, especially any, were used in earlier periods of English, going backwards one step at a time and starting with Early Modern English. The distribution of the relevant determiners in the Early Modern English part of the Helsinki Corpus appears from Table 2, which shows that the overall proportion of any was 15%. This percentage is not very helpful, however, for several reasons. First of all, the Helsinki Corpus obviously consists of only written material. Even so it does not readily compare with the Lancaster-Oslo/Bergen Corpus because of the different principles of composition of the corpora. The Helsinki Corpus incorporates speech-based texts (Trials), texts close to spoken material (Private Correspondence) or intended to represent speech (Drama). On the other hand, although these text types can be assumed to be close to spoken language, we do not know how similar they are. Looking at the proportion of any in the
text categories Trials and Private Correspondence, we can surmise that their higher-than-average relative frequency of any (19% and 24%, respectively) somehow reflects spoken usage. On the other hand, Drama has only 1/49 or 2% any; this could indicate that the samples included do not really succeed in imitating spoken language. More interestingly, however, the law texts, which are definitely not a genre close to speech, have the highest any score of all. A couple of obvious conclusions follow from the above observations. First of all, in order to make comparisons with the Helsinki Corpus, what we need is a Present-Day English counterpart of it, a corpus whose composition closely matches the genres making up the Helsinki Corpus. Given the abundance of printed material from the present day and the availability of scanners, it should not be a very daunting task to produce such a modern counterpart; ideally of course, one would wish to have also the intervening sub-periods leading up to the present time conveniently represented. Secondly, the distribution of indefinite determiners over the different genres in the Early Modern English sample highlights the problem discussed by Rissanen, concerning the stability of text types over time, and illustrated by him with the use of personal pronouns in different genres. It would be interesting to know to what extent the use of any is indicative of text type in different periods of the English language. (Concerning the diachronic study of genres, see Finegan - Biber 1989.) More specifically, one would also like to know to what factor(s) we may attribute the similarity between law texts, trials and private letters in Early Modern English with respect to the use of any. Legal texts need to be exhaustive, and private letters show involvement; trials have characteristics of both categories, in that they must reflect legal terminology and are likely to have a high involvement factor.
It seems likely that the common denominator is a desire for exhaustiveness and emphasis; see examples (1)-(3) and Tottie forthcoming.

(1) And that no Capper Hatter nor any other p~sone shall not take by hymself or any other p~sone to his use for any Cappe made of the fynest... (MOD 1 LA LAW STAT UnKnown:9)
(2) And notwithstanding the old Error amongst you, whiche did not admit any Witnesse to speake, or any other matter to be hearde in the favor of the Aduersarie... (MOD 1 TR TRL THRC SAMPLE:2)
(3) . . . sith I refused to swere, I wolde not declare any speciall parte of that othe that grudged my conscience . . . (MOD 1 PRIL MORE UnKnown:6)
One final comment is prompted by Rissanen's survey of different spellings representing aught and naught in Old and Early Middle English. Clearly, where forms abound, it would be a great help to the researcher if some kind of lemmatization could be provided with the Helsinki Corpus, so that she or he did not have to search the OED in every case. Even if the OED will be available to all of us on CD-ROM in the not too distant future, such an aid supplied with the Helsinki Corpus would offer substantial savings of time.
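The kind of lemmatization aid suggested here could, in its simplest form, be a lookup table from attested spellings to a headword. The sketch below is purely illustrative: the handful of spellings is a small sample chosen for the example, not the inventory surveyed by Rissanen, the token sequence is invented, and `LEMMA_OF`, `lemmatize` and `find_lemma` are hypothetical names rather than anything supplied with the Helsinki Corpus.

```python
# Hypothetical variant-to-lemma table. The spellings are an illustrative
# sample only; a real aid would need the full range of attested forms.
LEMMA_OF = {
    "awiht": "aught", "auht": "aught", "aht": "aught", "oht": "aught",
    "nawiht": "naught", "nauht": "naught", "naht": "naught", "noht": "naught",
}


def lemmatize(token):
    """Return the lemma for a known variant spelling, else the token unchanged."""
    return LEMMA_OF.get(token.lower(), token)


def find_lemma(tokens, lemma):
    """Return the positions of all tokens whose lemma equals the one searched for."""
    return [i for i, token in enumerate(tokens) if lemmatize(token) == lemma]


# Invented token sequence (not a quotation from the corpus).
tokens = "he ne cuthe nawiht ne auht ne dyde".split()
hits = find_lemma(tokens, "aught")

if __name__ == "__main__":
    print("aught:", hits)
    print("naught:", find_lemma(tokens, "naught"))
```

A single query on the headword then returns every listed spelling, so the researcher consults the dictionary once, when the table is built, rather than for each form encountered.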
Notes

1. I thank Matti Rissanen for making the unpublished version of the Helsinki Corpus available to me in 1988, and Merja Kytö for providing practical assistance and details of the composition of the corpus.

References

Finegan, Edward - Douglas Biber 1989 "Drift and the evolution of English style: a history of three genres", Language 65: 487-517.
Svartvik, Jan - Randolph Quirk (eds.) 1980 A corpus of English conversation. Lund: Lund University Press.
Tottie, Gunnel forthcoming "Indefinite determiners in non-assertive clauses in Early Modern English", in: Dieter Kastovsky (ed.), Papers from the Conference on Early Modern English, Vienna, July 7-11, 1991. Berlin: Mouton de Gruyter.
Exploration and application of corpora
Using computer-based text corpora to analyze the referential strategies of spoken and written texts*
Douglas Biber
1. Introduction

Text corpora have proven to be important in numerous linguistic analyses over the last three decades. Research on the linguistic patterns of English by a number of American and European researchers has shown that linguistic analyses based on a collection of texts often do not conform to our prior intuitive expectations. The use of computer-based text corpora, together with computer programs to facilitate linguistic analysis, enables investigations of a scope not otherwise feasible. Surprisingly, though, corpus-based approaches and computational approaches are often not combined. Thus, texts have often been analyzed by hand in corpus-based studies, disregarding the potential of computational analysis; and computational analyses for "natural language understanding" have often focused on only a few sentences, disregarding the ease with which computers can analyze large quantities of text and the importance of robust computational systems. Using either half of this "equation" in isolation has the consequence that only a relatively small quantity of data can be analyzed, and thus investigations are restricted relative to their potential. Corpus-based studies constitute an important advance over previous research in that they are based on naturally occurring discourse, representing actual usage rather than linguists' intuitions. But if they are analyzed by hand, there are few analytical advantages of computer-based corpora over other text corpora. Similarly, computer-based parsers of isolated sentences provide an important advance in natural language processing, but these parsers ignore the potential provided by distributional analysis of grammatical and syntactic features in large collections of texts, and they disregard the importance of developing robust processing systems that can handle a wide range of language data. Many researchers, however, have combined the resources provided by large computer-based text corpora and computer programs for automated
analysis. With respect to functional and stylistic investigations, this combination enables consideration of a large number of linguistic features across many texts and text types. Thus researchers such as Francis - Kucera (1982); Johansson - Hofland (1989); Oakman (1975); and Ross (1973; cf. the other papers in Aitken - Bailey - Hamilton-Smith 1973), have combined computational analyses and computer-based text corpora to compare the linguistic characteristics of texts and text types. With respect to lexicographic research, the results of this combination as applied in the COBUILD Project (Sinclair 1987) were so successful that a number of publishers are now pursuing dictionary projects derived from computer-based corpora using automated computational techniques. Similarly, with respect to research on natural language understanding systems, researchers such as Garside - Leech - Sampson (1987) have combined corpus-based and automated computational approaches to develop natural language understanding systems that are much more robust than previously achieved. In my own previous research, I have exploited the potential provided by automated computational analyses of large computer-based text corpora to investigate linguistic variation among spoken and written text varieties. These combined resources have enabled investigations that are not otherwise feasible, including analysis of many linguistic features across many texts and text types. These types of analyses have been used to investigate the linguistic similarities and differences among many different varieties of English, including: the range of spoken and written varieties (Biber 1986; 1988; 1989), British and American written varieties (Biber 1987), written varieties from different historical periods (Biber - Finegan 1989a), different "stance" types (Biber - Finegan 1989b), and school reading materials (Biber 1991).
In the present paper, I investigate the distribution of referential information in texts using a similar approach, combining the resources of computer-based text corpora and computer programs for automated analysis. In Section 2, I discuss previous related research, dividing studies into three groups: theoretical studies, corpus-based studies, and computational studies. It turns out that in this research area, similar to the situation described above, few previous studies have combined the resources of corpus-based and computational analyses. In Section 3, I describe the methodology of the present study. In Section 3.1, I briefly describe the texts used for analysis. In Section 3.2, I describe the particular linguistic features analyzed, representing different aspects of the distribution of referential information in texts. These features include the information type (given, new), lexical type for given references (pronominal, lexical, repetitive), grammatical type (nominal, pronominal), occurrence in a
chain, chain length, distance among references within chains, and syntactic distribution of information. In Section 3.3, then, I provide a relatively complete description of the computer programs used for analysis. The analysis proceeded in three steps. First, computer programs were used to identify all nominal and pronominal referents in texts, classify those referents as given or new information, identify the referential chain that they belong to, and classify them according to their syntactic environment. The second stage involves hand editing of the output from Step 1, checking for co-referential forms not identified by the program, and for syntactic environments incorrectly identified by the program. Finally, another computer program was used to tally the frequency counts of each referential type, and to compute the average length of referential chains and the frequencies of references in each syntactic environment. In Section 4, I present the main findings of the analysis, focusing on the differences among spoken and written genres with respect to these linguistic features. In Section 5, I briefly discuss the distribution of these features in relation to the textual dimensions identified in Biber (1988). Finally, in the conclusion I summarize the findings and return to the importance of combining the resources of computer-based corpora with the power of automated computational analyses.
2. Background

There is an extensive body of research dealing with the packaging of information in sentences and in texts, including analyses of "given" and "new" information, "topic" and "comment", "theme" and "rheme", and aspects of "cohesion" and "coherence". In the present paper, I have restricted the scope of inquiry to tracking the type and distribution of references in texts. Two particularly useful early papers on the types of information are Chafe (1976) and Prince (1981). Chafe attempts to distinguish among theoretical constructs such as givenness, contrastiveness, definiteness, and topics, while Prince further investigates the informational status of referents in texts and proposes a taxonomy of "assumed familiarity" with three main categories - new, inferable, evoked - and seven subcategories. Research on lexical cohesion (Halliday - Hasan 1976; 1989) also catalogs the types of informational relations in texts, distinguishing among co-reference, co-classification, and co-extension, and within the category of general lexical cohesion, distinguishing among repetition, synonymy, antonymy, and metonymy.
There have also been corpus-based studies of cohesive relations in texts. For example, Grabe (1986) analyzes the distribution of lexical repetitions, inclusions, comparatives, and synonymy / antonymy across 150 texts representing 13 expository types in English. DeStefano - Kantor (1988) analyze the marking of cohesion in mother-child face-to-face conversation and fictional dialogue in basal readers. Cox - Shanahan - Sulzby (1990) analyze cohesion in the expository and narrative written compositions of good and poor elementary readers. A related area of research focuses exclusively on anaphoric referents, tracking the types of referring expressions used for anaphoric reference, the syntactic distribution of referring expressions, and the distance between anaphoric references of different types. Many of these studies are corpus-based. The papers in Givon (1983) address anaphora from the perspective of "topic continuity", focusing on the choice among modified full noun forms, full nouns, pronouns, and zero anaphora, in relation to the distance between subsequent referents and the syntactic distribution of co-referential forms. Topics that are more discontinuous (or less predictable) must be assigned more coding material; topics become more discontinuous as the distance to the last mention becomes greater and as the ambiguity from other referents becomes greater. The paper by Brown (1983) is a good example of this approach, examining topic continuity in written English narrative. Fox (1987a; 1987b) goes beyond this approach, showing how structural factors, such as event-lines, plans, and actions, are also important in predicting the choice of anaphoric form. Fox (1987a) analyzes the marking of anaphora in English conversations and written exposition, and Fox (1987b) analyzes popular fiction. More recently, Lord - Dahlgren (1990) analyze the marking of anaphora in a corpus of news commentaries from the Wall Street Journal. 
There have also been computational approaches to anaphora resolution (see Hirst 1981). To my knowledge, though, no study to date has combined automated computational analyses of text with a corpus-based approach to address the distribution of anaphoric forms across genres. In addition, no previous study has taken a comparative approach to the analysis of anaphora; rather, corpus-based studies have focused on the analysis of some particular genre, such as newspaper texts, popular fiction, or conversation. (The exception to this generalization is Chapter 6 of Fox 1987a, which compares two discourse types: conversation and written exposition.) In the present paper, I thus hope to extend previous research on anaphora in these two respects. First, I present a methodological approach that combines the resources of computer-based text corpora and computer programs for automated linguistic analyses of referential forms. And second, I take an overtly comparative
approach, showing how the types and distribution of referential expressions vary across nine spoken and written genres.1
3. Methodology

3.1. Texts used for analysis

In the present paper, I analyze the distribution of anaphoric forms in 58 texts taken from 9 spoken and written genres of the LOB and London-Lund (LL) corpora. The genres represented are: Press "spot news" reportage, Legal documents (acts and treaties), Humanities academic prose, Technical academic prose, General fiction, Face-to-face conversation, Sports broadcasts, Parliamentary spontaneous speeches, and Sermons. The genre categories here are narrowly defined, since the extent of variation for anaphoric features, across and within genres, was not clear at the outset. The particular texts used for the analysis are listed in Table 1.

Except for conversations (LL category 1) and general fiction (LOB category K), these genres all represent specific sub-categories within the LOB or London-Lund corpora. For example, humanities and technical academic prose are subgenres of the LOB category of "learned and scientific writings" (category J), which also includes natural science, medicine, and social science; the legal acts and treaties are from the LOB category "Miscellaneous" (H), which also includes government reports, foundation reports, industry reports, and a university catalogue; spot news is from the LOB category of press reportage (A), which also includes financial, sports, and society news; sports broadcasts are from the more general category of broadcasts in the LL (category 10), which also includes a broadcast of a funeral, a wedding, and a royal visit; parliamentary speeches are from the LL category of spontaneous speeches (11), which also includes a court case and a dinner speech; and sermons are from the LL category of prepared speeches (12), which also includes a university lecture, cases in court, and a political speech.

No composite texts were used in the analysis. For example, an individual article was extracted from each Press "text" in the LOB corpus, where each 2000-word text sample can comprise several articles. Parliamentary speeches were made up of a single speech on a coherent topic, or a single Question-Answer sequence, where both the Question and the Answer were speeches on the same topic. Sports broadcasts were taken from the on-line reportage of sports events, excluding any introductory commentary preceding the event.
Table 1. Texts used in the analysis

Genre                          Corpus category   Texts in corpus                  No. of texts
Press reportage                LOB - A           11-14, 24, 34-37, 43             10
Legal documents                LOB - H           13, 14a, 14b                      3
Humanities academic prose      LOB - J           61-64, 67-68*                     6
Technical academic prose       LOB - J           71-80                            10
General fiction                LOB - K           1-10                             10
Face-to-face conversation      LL - 1            1-5                               5
Sports broadcasts              LL - 10           1-3, 4a, 4c                       5
Parliamentary spon. speeches   LL - 11           4 (3 "texts"), 5 (2 "texts")      5
Sermons                        LL - 12           1a, 1b, 1c, 1d                    4
Total:                                                                            58
* Texts J65 and J66 had long segments of foreign or literary quotations and were thus excluded from the analysis.
3.2. Anaphora features analyzed

The distribution of anaphoric features was analyzed in the first 200 words of each text, so all feature counts are normed per 200 words. Features were analyzed representing: the overall occurrence of given and new information, the overall frequency of referential expressions, the number and length of anaphoric "chains", the distance between referring expressions within chains, and for given or anaphoric expressions, the choice between lexical repetition and pronominal forms, and the syntactic distribution of forms.

I will use the term "referent" for the different entities referred to in a text, and the term "referring expression" for the linguistic material representing a referent; thus multiple "referring expressions" can refer to a single "referent". I consider only nominal referring expressions in the present analysis. I use the term "anaphor(ic)" for referring expressions that have the same referent as some previous referring expression in a text. "New" expressions
are first-time references to a referent in a text.2 "Given" expressions include lexical anaphors (repetitions and synonyms) and all pronominal forms. I interpreted the category of "synonym" strictly, to include only lexical referring expressions having identical referents (for example, Joe ... the man, but not Pan Am ... airline companies). Within pronouns, I distinguish among true anaphoric pronouns (sharing a referent with a specific previous referring expression in the text), exophoric pronouns (referring to a participant or object in the situation of communication, most frequently I, me, we, us, you), and "vague" pronouns (referring to an action or previous stretch of discourse that does not have an associated previous referring expression).

Following is a list of the linguistic features analyzed:
A. Overall distribution of referring expressions
1. Total referring expressions: the total number of referential phrases in a text, whether given or new.
2. Total different referents: the total number of distinct entities or referents in a text.
3. Total referential chains: the total number of referents that are referred to by multiple referring expressions in a text; each of these constitutes a "chain".
4. Total "dead-end" referents: the total different referents minus the total referential chains, which represents the number of referents that are introduced once but not discussed any further in a text.
5. Chain length: the number of referring expressions included in a referential chain.
6. Average chain length: the mean length of all chains in a text.
B. Distance measures
1. Referential distance: the number of intervening references to other referents between co-referential items in a chain.
2. Average referential distance: the average distance among referring expressions in a chain, averaged over all chains in a text.
3. Maximum referential distance: for each text, the distance measure for the chain which has the largest average distance among referring expressions.
C. Distribution of given and new referring expressions
1. "New" referring expressions: first-time references to a referent in a text.
2. "Given" referring expressions: references that are either textually or situationally evoked, including lexical anaphors (repetitions and synonyms) and all pronominal forms (anaphoric, exophoric, or vague).
D. Distribution and types of given referring expressions
1. Repetition anaphors: lexical repetitions of nouns that share referents with previous mentions.
2. Synonymous anaphors: nouns that clearly share an identical referent with a previous referring expression.
3. Total pronouns: a count of all pronominal forms in a text.
4. Anaphoric pronouns: pronouns that share a referent with a specific previous referring expression in the text.
5. Exophoric pronouns: pronouns that refer to a participant or object in the situation of communication, especially I, me, we, us, you.
6. "Vague" pronouns: pronouns that refer to an action or previous stretch of discourse which is not represented by a previous referring expression in the text.
E. Distance measures of given referring expressions
1. Average pronominal distance: the average distance from previous mentions (regardless of form) to anaphoric pronominal forms in chains.
2. Average repetition distance: the average distance from previous mentions (regardless of form) to repeated lexical forms in chains.
F. Measures of the syntactic distribution of information
1. Frequencies of new referring expressions in: main clauses, prepositional phrases, relative clauses, other dependent clauses.
2. Frequencies of repeated lexical referring expressions in: main clauses, prepositional phrases, relative clauses, other dependent clauses.
3. Frequencies of given pronominal referring expressions in: main clauses, prepositional phrases, relative clauses, other dependent clauses.
4. Distance measures for repeated lexical referring expressions in: main clauses, prepositional phrases, relative clauses, other dependent clauses.
5. Distance measures for anaphoric pronominal referring expressions in: main clauses, prepositional phrases, relative clauses, other dependent clauses.
These feature counts relate to one another in the following ways (features are identified by their section letter and number in the list above): Total chains (A3) and dead-end referents (A4) together equal the number of different referents in a text (A2). Average chain length (A6) equals the total referring expressions (A1) divided by the total different referents (A2). New (C1) plus given (C2) referring expressions equal the total number of referring expressions (A1). Repetition anaphors (D1) plus synonymous anaphors (D2) plus total pronouns (D3) equal the total given referring expressions (C2). And total pronouns (D3) comprises anaphoric pronouns (D4) plus exophoric pronouns (D5) plus "vague" pronouns (D6).
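The relations among the feature counts can be written down as simple consistency checks. The following Python sketch is only an illustration: the class name, field names, and the sample counts are hypothetical and are not taken from the study's data or programs.

```python
from dataclasses import dataclass


@dataclass
class FeatureCounts:
    """Per-text counts; all field names are illustrative assumptions."""
    total_expressions: int
    different_referents: int
    chains: int
    dead_ends: int
    new: int
    given: int
    repetitions: int
    synonyms: int
    total_pronouns: int
    anaphoric: int
    exophoric: int
    vague: int


def average_chain_length(c: FeatureCounts) -> float:
    # As stated in the text: total referring expressions / different referents.
    return c.total_expressions / c.different_referents


def check_relations(c: FeatureCounts) -> list[str]:
    """Return a description of every stated relation that fails to hold."""
    problems = []
    if c.chains + c.dead_ends != c.different_referents:
        problems.append("chains + dead-ends != different referents")
    if c.new + c.given != c.total_expressions:
        problems.append("new + given != total referring expressions")
    if c.repetitions + c.synonyms + c.total_pronouns != c.given:
        problems.append("repetitions + synonyms + pronouns != given")
    if c.anaphoric + c.exophoric + c.vague != c.total_pronouns:
        problems.append("anaphoric + exophoric + vague != total pronouns")
    return problems


# Hypothetical counts for one 200-word sample (not data from the study).
sample = FeatureCounts(
    total_expressions=20, different_referents=13, chains=4, dead_ends=9,
    new=12, given=8, repetitions=3, synonyms=1, total_pronouns=4,
    anaphoric=2, exophoric=1, vague=1,
)
print(check_relations(sample))
print(round(average_chain_length(sample), 3))
```

A check of this kind is useful mainly after hand editing, when a changed chain number or status code can silently break one of the totals.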
3.3. Computational analysis

Two separate computer programs were developed for the analysis of referring expressions in English texts. As input, the first program used "tagged" texts in which the grammatical category of each word was marked. For this purpose, I used the tagged version of the LOB (Lancaster-Oslo/Bergen) corpus (see Johansson - Leech - Goodluck 1978) and the tagged version of the LL (London-Lund) corpus developed for the analysis in Biber (1988). Previous grammatical tagging was required to identify all nominal and pronominal forms (i.e., all referring expressions) in these texts. The analyses reported here thus actually depend on prior computational analyses of texts, to "tag" these two corpora for grammatical categories. The programs used to tag the LOB corpus are described in Garside - Leech - Sampson (1987), and the programs used to tag the LL corpus are described in Biber (1988, Appendix II).

The purpose of the first computer program developed for the present study was to identify and classify all referring expressions in the input text, as illustrated in Table 2. For each word tagged as a noun or pronoun in the input text, this program 1) classifies the informational status of the word as new or given; 2) classifies the form of the word as first-time lexical, repeated lexical, anaphoric pronoun, or exophoric pronoun; 3) identifies the chain number of the word if it is a repeated lexical form; and 4) identifies the syntactic context. This program also outputs the sequential word number
and the sequential line number of each word in the input text, although these are not required for the analyses.

The first computer program marked referential forms as follows: All pronominal forms and repeated lexical forms were marked as having "given" informational status, while all first-time lexical forms were marked as being "new". First and second person pronouns were marked as exophoric, and all other pronouns were marked as anaphoric. All nouns were stored in an array, so that each new noun in a text could be compared to all previous noun forms (with and without plural endings) to check for repeated forms. Each new noun in a text was assigned a new chain number, and subsequent occurrences of that noun received the same chain identification. Pronouns identified as anaphoric were all marked as chain number 0 at this stage, and exophoric pronouns were all marked as chain number 100. Finally, syntactic context was determined strictly from the surface grammar, depending on the closest syntactic boundary (to the left). Four syntactic environments were distinguished: main clause, prepositional phrase, relative clause, and other dependent clause. These are not at all exclusive categories (e.g. prepositional phrases can occur in relative clauses, and vice versa; and both can occur in other dependent clauses, and vice versa). The analysis here considered only the closest syntactic boundary, although future research could distinguish further among the various types and degrees of embedding.

Table 2. Sample output from the computer program to identify and classify referring expressions in texts; from a technical academic prose text in the LOB Corpus (J-72) and a conversation text in the LL corpus (1-1).

Column headings:
A = sequential word number from original text
B = informational status: New or Given
C = form: LEX (full lexical noun - if classified as Given, these are synonyms), RPT (repeated lexical noun), PRO (anaphoric pronoun), PEX (exophoric pronoun), PVG ('vague' pronoun)
D = chain number
E = syntactic context
F = line number in the LOB corpus or LL corpus

Word             A    B   C     D     E      F

J-72:
summary          2    N   LEX   1     MAIN   2
authors          6    N   LEX   2     MAIN   3
testing          9    N   LEX   7     MAIN   3
explosives       11   N   LEX   4     PRP    3
reference        14   N   LEX   5     PRP    4
ability          17   N   LEX   6     PRP    4
test             20   G   RPT   7     PRP    4
presence         24   N   LEX   8     INF    4
differences      27   N   LEX   9     PRP    5
ignition         29   N   LEX   10    PRP    5
probability      30   N   LEX   11    PRP    5
reliability      35   N   LEX   12    PRP    6
test             38   G   RPT   7     PRP    6
tests            45   G   RPT   7     CMP    6
ignition         48   G   RPT   10    REL    7
rates            49   N   LEX   13    REL    7
tests            54   G   RPT   7     CMP    7
class            60   N   LEX   14    PRP    7
discriminators   63   N   LEX   15    CMP    8
ability          67   G   RPT   6     MAIN   9

1-1:
spanish          2    N   LEX   1     MAIN   1
graphology       3    N   LEX   2     MAIN   1
you              7    G   PEX   100   MAIN   3
that             9    N   PRO   4     MAIN   3
joe              11   N   LEX   3     MAIN   4
i                13   G   PEX   100   MAIN   4
it               15   G   PRO   4     MAIN   5
us               17   G   PEX   100   MAIN   5
joe              19   G   RPT   3     MAIN   6
paper            22   G   LEX   4     MAIN   6
i                26   G   PEX   50    MAIN   9
paper            32   G   RPT   4     MAIN   10
i                35   G   PEX   50    MAIN   11
people           42   N   LEX   6     MAIN   12
you              49   G   PEX   50    MAIN   13
this             53   G   PVG   0     MAIN   14

The second stage of the analysis was to edit the output of this program by hand. Status as "given" versus "new", and "lexical" versus "repeated" versus "pronominal", was accurately assigned by the program, except for synonymous nouns, which were changed to "given" and assigned the chain number of the previous nominal form having the same referent. (Also there were a few pronouns that preceded the first full noun reference to a referent in a text; these pronouns were assigned "new" status, and the subsequent noun was changed to "given" status.) Pronominal forms needed to be edited by hand to determine their referent (and therefore their chain number), to confirm their status as anaphoric or exophoric, and to identify any "vague" references that were not exophoric but were not linked to any particular previous referring expression in the text. Finally, syntactic context was accurately assigned (on the basis of the tagged input texts) for prepositional phrases, relative clauses, and other dependent clauses, but these assignments needed to be checked for main clause environments since there is no overt surface marker for main clauses (as there is for these other environments).

Once the referential listings produced by the first program were edited, the second program was run to compute the frequency counts and distance measures listed in Section 3.2. First of all, this program would simply tally the frequencies of each referential type (e.g. new/lexical/main clause; given/repeated/relative clause; given/anaphoric pronoun/main clause; etc.). These frequencies were subsequently added together to give overall counts for total pronouns, given versus new references, etc. The total number of referring expressions is simply the total of all nominal and pronominal forms (per 200 words of text) in the listing produced by the first program. As the second program progressed through the listing, it would keep track of the number of different chain identifications, the number of referring expressions included in each different chain, and the "distance" measures for each reference type (see below).
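The marking rules applied by the first program, before hand editing, can be illustrated with a short Python sketch. This is not the original program: the tag names ("NOUN", "PRON"), the plural-stripping rule, the pronoun set, and the use of a dictionary in place of the noun array are all assumptions made for illustration, and the classification of syntactic context is left out.

```python
from dataclasses import dataclass

# Assumption: the exophoric forms are the ones listed in the text.
EXOPHORIC_FORMS = {"i", "me", "we", "us", "you"}
ANAPHORIC_CHAIN = 0    # placeholder chain number, resolved later by hand
EXOPHORIC_CHAIN = 100  # chain number given to exophoric pronouns


@dataclass
class Marked:
    word: str
    status: str  # "N" (new) or "G" (given)
    form: str    # "LEX", "RPT", "PRO" or "PEX"
    chain: int


def normalize(noun: str) -> str:
    """Crude plural stripping (an assumption, not the original rule)."""
    w = noun.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


def mark_references(tagged: list[tuple[str, str]]) -> list[Marked]:
    """Classify each noun or pronoun in a sequence of (word, tag) pairs."""
    chains: dict[str, int] = {}  # normalized noun -> chain number
    marked = []
    for word, tag in tagged:
        if tag == "NOUN":
            key = normalize(word)
            if key in chains:
                # Repeated lexical form: given, same chain as before.
                marked.append(Marked(word, "G", "RPT", chains[key]))
            else:
                # First-time lexical form: new, next free chain number.
                chains[key] = len(chains) + 1
                marked.append(Marked(word, "N", "LEX", chains[key]))
        elif tag == "PRON":
            if word.lower() in EXOPHORIC_FORMS:
                marked.append(Marked(word, "G", "PEX", EXOPHORIC_CHAIN))
            else:
                marked.append(Marked(word, "G", "PRO", ANAPHORIC_CHAIN))
        # Words with any other tag are not referring expressions.
    return marked


sample = [("ability", "NOUN"), ("test", "NOUN"), ("is", "VERB"),
          ("it", "PRON"), ("tests", "NOUN"), ("you", "PRON")]
for m in mark_references(sample):
    print(m.word, m.status, m.form, m.chain)
```

In this sketch "tests" joins the chain opened by "test", while "it" receives the placeholder chain 0. Synonyms and pronoun referents are deliberately left unresolved, because the study assigns those by hand in the second stage.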
Once the listing for a text was processed, the number of different referents was computed as the total number of different chain numbers used in that text; the total referential chains were the number of chains that had more than one referring expression, and dead-end references were those chains that had only one referring expression. Average chain length is simply the total number of referring expressions divided by the total different referents. Distance measures were computed separately for each reference type. An array was used to keep track of the frequency of references in each additional chain encountered in a text, and to compute the distance measures for multiple references within chains. The program recorded the sequential number of each referring expression (e.g. 1st, 2nd, 3rd) and its chain number; when a subsequent referring expression was found from the same chain, these sequential numbers were subtracted, giving the number of intervening referring expressions. (For example, if the first reference in a chain was the 3rd referring expression in a text, and the next reference in that chain was the 7th referring expression in the text, the distance would be 7 - 3 = 4; i.e. there
are 3 intervening referring expressions, and the second reference is the 4th referring expression from the original reference.) The program computed a cumulative distance measure for each reference type (e.g. lexical repetitions in main clauses, etc.) depending on the type of the second reference (to investigate the question of whether different reference types are used to bridge different distances). The average distance for each type was thus the total cumulative distance for the type divided by the total frequency of the type. Overall average distance for a text was computed by weighting the average distance of each reference type for the frequency of that type, summing the weighted distances, and then dividing by the total references. This second program thus produced linguistic counts and distance measures for each text. These results were then analyzed using a statistical package (SAS), to compare the mean frequency counts and distance measures of each genre. These results are presented in Section 4. 3
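The counts and distance measures computed by the second program can be sketched as follows. This is a hedged reconstruction rather than the original code: the function name, the input format (a list of chain number and reference type pairs in text order), and the reading of "total references" as the number of references that have a previous mention are assumptions, and every entry is assumed to carry a resolved chain number.

```python
from collections import Counter, defaultdict


def summarize(listing: list[tuple[int, str]]) -> dict:
    """Summarize an edited listing of (chain_number, reference_type) pairs."""
    chain_sizes = Counter(chain for chain, _ in listing)
    last_seen: dict[int, int] = {}   # chain -> position of latest mention
    dist_sum = defaultdict(int)      # cumulative distance per reference type
    dist_n = defaultdict(int)        # number of measured links per type

    for position, (chain, ref_type) in enumerate(listing, start=1):
        if chain in last_seen:
            # Distance is credited to the type of the second reference.
            dist_sum[ref_type] += position - last_seen[chain]
            dist_n[ref_type] += 1
        last_seen[chain] = position

    total = len(listing)
    referents = len(chain_sizes)
    chains = sum(1 for size in chain_sizes.values() if size > 1)
    links = sum(dist_n.values())
    return {
        "total_expressions": total,
        "different_referents": referents,
        "chains": chains,
        "dead_ends": referents - chains,
        "average_chain_length": total / referents,
        "average_distance_by_type": {
            t: dist_sum[t] / dist_n[t] for t in dist_n
        },
        # Frequency-weighted mean of the per-type averages.
        "overall_average_distance":
            sum(dist_sum.values()) / links if links else 0.0,
    }


# Hypothetical listing: chain 3 is mentioned as the 3rd and 7th expression,
# so its distance is 7 - 3 = 4, as in the worked example in the text.
listing = [(1, "LEX"), (2, "LEX"), (3, "LEX"), (4, "LEX"),
           (2, "RPT"), (5, "LEX"), (3, "PRO")]
print(summarize(listing))
```

Crediting each distance to the type of the second reference follows the description in the text, which makes it possible to ask whether pronouns and lexical repetitions bridge different distances.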
4. The distribution of reference types across genres and across syntactic environments

Tables 3-12 and Figures 1-5 summarize the distribution of referential types in the genres analyzed here. Tables 3-11 have the same format. At the top they present the results of a General Linear Models procedure (GLM; a procedure similar to ANOVA that does not require balanced cells), and below they present the individual mean scores and standard deviations for each genre. The GLM procedure compares the differences among the mean scores for each genre relative to the differences among the texts within genres. (The size of the standard deviation for each genre reflects the extent of variation among texts within a genre.) If the differences among genres are large relative to the differences among texts within a genre, then that particular measure is a good predictor of genre differences. The F score for each linguistic feature is a statistical representation of the importance of cross-genre variation relative to genre-internal variation, and the p value indicates whether the linguistic feature is a statistically "significant" predictor of genre differences. Statistical significance is influenced by both the extent of the observed differences and by the number of texts analyzed - consideration of more texts makes it possible to achieve significance with weaker relationships. The r² value is a direct indication of the strength or importance of a linguistic feature in predicting genre differences, regardless of the number of texts analyzed. r² can be interpreted as the percentage of
variation in the text frequencies of a linguistic feature that is accounted for by knowing the genre categories. Any r² value over 20% indicates a noteworthy relationship, while r² values over 50% indicate quite strong relationships. Section 4.1 presents the overall results of the distribution of reference types, while Section 4.2 focuses on the syntactic distribution of reference types. 58 texts from 9 spoken and written genres were analyzed for the first section, but only 24 texts from 5 written genres were included in the more specialized analyses of syntactic distribution.

4.1. Overall distribution of reference types across spoken and written genres

Table 3 presents the overall distribution of referring expressions and referents across the nine genres, and this same information is presented graphically in Figure 1.
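The F and r² statistics reported in the tables of this section can be illustrated with a one-way analysis of variance computed by hand. The sketch below is not the SAS GLM procedure used in the study; it is a plain one-way decomposition that also works with unequal group sizes, and the function name and sample values are hypothetical.

```python
from statistics import fmean


def one_way_f_and_r2(groups: dict[str, list[float]]) -> tuple[float, float]:
    """Return (F, r2) for per-text counts grouped by genre."""
    values = [v for texts in groups.values() for v in texts]
    grand_mean = fmean(values)

    # Variation among genre means, weighted by the number of texts.
    ss_between = sum(
        len(texts) * (fmean(texts) - grand_mean) ** 2
        for texts in groups.values()
    )
    # Variation among texts within each genre.
    ss_within = sum(
        (v - fmean(texts)) ** 2
        for texts in groups.values()
        for v in texts
    )
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)

    f_score = (ss_between / df_between) / (ss_within / df_within)
    r2 = ss_between / (ss_between + ss_within)
    return f_score, r2


# Hypothetical counts for three genres (not data from the study).
counts = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
f_score, r2 = one_way_f_and_r2(counts)
print(f_score, r2)
```

Here r² is the share of the total variation that lies between genres, which is the sense in which the text describes it as the percentage of variation accounted for by knowing the genre categories.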
[Figure 1. Totals for reference types in spoken and written genres. Horizontal bar chart by genre (Legal Docs, Tech Prose, Hum Prose, Spot News, Fiction, Sermons, Speeches, Broadcasts, Convers.); x-axis: Frequency (0-70); bars: Total References, Different Referents, 'Deadend' Referents, Total Chains.]
Total referring expressions, presented in Table 3.1 and the bottom bar in Figure 1, represent the overall extent to which a text is referential (versus elaborative, verbal, etc.). Surprisingly, the spoken genres tend to have more referring expressions than written genres. Broadcasts, sermons, and spot news have high numbers of referring expressions, conversations and speeches have
Table 3. Overall distribution of referring expressions and referents across nine spoken and written genres

3.1. Total Referring Expressions
F=6.52; p<.001; r²=51.6%

GENRE           N    Mean
Conversations   5    59.00
Broadcasts      5    68.00
Speeches        5    57.00
Sermons         4    64.75
Gen. Fiction    10   47.90
Spot News       10   63.40
Hum Ac Prose    6    48.50
Tech Ac Prose   10   51.70
Legal Docs      3    52.00
3.2. Different Referents F=6.35; p

[Figure 3. Given versus new references in spoken and written genres. Horizontal bar chart by genre; x-axis: Frequency; bars: New References, Given References.]
cate relative informational focus. It is noteworthy that spot news, humanities academic prose, and technical academic prose are extremely "informational" in both respects: they have the highest absolute frequencies of new referents, and proportionally they use very high percentages of their referring expressions for new references (all three around 65%). In contrast, conversations are markedly non-informational in both respects: they have a markedly low frequency of new referents, and they have a low proportion of their total referring expressions that are new (c. 29%). Political speeches and sermons, which are both informational in purpose but spoken, show relatively high frequencies of new referents in absolute terms but have roughly equal proportions of given and new references. Fiction, which is not informational in purpose but is written, shows a relatively low frequency of new referents in absolute terms but again shows roughly equal proportions of given and new references. Table 7 and Figure 4 further describe the distribution of given references by separately presenting the distribution of lexical repetitions and pronouns (and synonymous referring expressions on Table 7). Figure 4 shows the same degree of extreme variation as Figure 3 but a different distribution across genres. Conversations are again at one extreme, in this case having the most pronouns and fewest lexical repetitions. The other extreme, though, is occu-
Table 7. Distribution of given referring expressions as repetition anaphors, pronominal forms, or synonymous anaphors across nine spoken and written genres

7.1. Repetition Anaphors: F=9.97; p<.001; r2=61.9%
7.2. Total Pronouns: F=36.02; p<.001; r2=85.5%

GENRE            N    7.1 Mean   7.1 SD   7.2 Mean   7.2 SD
Conversations    5      7.40      3.57     33.80      9.49
Broadcasts       5     25.00      8.27     12.80      4.20
Speeches         5     10.80      3.11     17.40      3.04
Sermons          4      8.50      4.35     23.25      6.60
Gen. Fiction    10      3.20      2.34     19.00      5.31
Spot News       10     17.70      6.68      5.00      3.09
Hum Ac Prose     6     10.00      5.40      4.50      2.58
Tech Ac Prose   10     16.60      7.76      1.00      1.56
Legal Docs       3     21.33      2.08      1.66      1.52

7.3. Synonymous Anaphors: F=3.06; p<.01; r2=33.3%
7.4. Pronominal Anaphors: F=12.04; p<.001; r2=66.3%

GENRE            N    7.3 Mean   7.3 SD   7.4 Mean   7.4 SD
Conversations    5      0.40      0.54      7.20      2.38
Broadcasts       5      0.00      0.00     10.60      2.60
Speeches         5      0.00      0.00      6.80      2.68
Sermons          4      0.75      0.50     10.00      4.32
Gen. Fiction    10      2.30      1.94     13.10      6.74
Spot News       10      0.70      0.82      3.50      1.58
Hum Ac Prose     6      0.33      0.81      3.00      1.67
Tech Ac Prose   10      0.90      1.52      0.50      0.70
Legal Docs       3      0.00      0.00      1.66      1.52

7.5. 'Vague' Pronouns: F=12.21; p<.001; r2=66.6%
7.6. Exophoric Pronouns: F=18.98; p [...]

GENRE            N    7.5 Mean   7.5 SD
Conversations    5      4.40     2.701
Broadcasts       5      0.80     1.303
Speeches         5      1.20     1.095
Sermons          4      1.50     1.290
Gen. Fiction    10      0.00     0.000
Spot News       10      0.00     0.000
Hum Ac Prose     6      0.00     0.000
Tech Ac Prose   10      0.00     0.000
Legal Docs       3      0.00     0.000
Graeme Kennedy
Table 2. Distribution of verb forms (adapted from George 1963a: 5)

                      %
Verb stem + ed       46.2
Verb stem (+ s)      33.4
Verb stem + ing      11.2
to + verb stem        9.2

Table 3. Rank ordering of the most frequent verb form uses (adapted from George 1963c: 32)

Rank   Item                             % of verb form tokens
 1     Simple past narrative                   15.6
 2     Simple present actual                   12.0
 3     Simple past actual                       8.3
 4     Simple present neutral                   7.0
 5     Past participle of occurrence            5.9
 6     Past participle of state                 3.3
 7     Verb + to + stem                         2.7
 8     Stem + ing = adjective                   2.5
 9     Stem + ed = adjective                    2.3
10     Plain stem after don't                   1.7
       Total                                   61.3

Table 4. Uses of simple present (from George 1963a)

                                           %
Present or actual moment                  57.7
Neutral (devoid of time reference)        33.5
Habitual-iterative                         5.5
Others                                     3.3
Preferred ways of putting things
Table 5. The expression of "habit" (from George 1963a)

                              %
Present habit
  Simple present             86.1
  Present progressive         9.5
  will + verb stem            4.4

Past habit
  Simple past                77.5
  Past progressive           17.1
  would + verb stem           5.4
frequently the progressive is taught early as the form for "now". Similarly, the main use of the finite simple present is not to express habitual or iterative meaning as Table 4 shows. According to George, "habit" is typically expressed with the verb forms outlined in Table 5. In Quirk et al. (1985: 217) it is suggested that the order of frequency for expressing future time with verb forms is:

1 will, shall or 'll + verb stem
2 Simple present
3 be going to + infinitive
4 Present progressive
5 will/shall + progressive infinitive
The Hyderabad study generally supports this analysis but provides statistical data to give a clearer picture of the relative frequencies. It should be noted, however, that the relatively high frequency of the simple present for expressing future time in the Hyderabad study is not reflected in the emphasis given in the most recent corpus-based grammatical description of English (Sinclair 1990: 255-257). Ota (1963) does not indicate the exact size of the corpus of US English used, although it seems likely that it contained about 150,000 words. It consisted of (i) 10.3 hours of unrehearsed radio conversations and interviews, amounting to 300 pages of transcribed text, (ii) ten TV play scripts and (iii) ten 3000-word samples of academic writing. Ota found 16,189 tokens of the eight finite verb forms in the relative proportions shown in Table 1, and a further 977 passives.
Table 6. Expressing future time (adapted from George 1963a)

                                          %
1  will  + verb stem                     18.7
   shall + verb stem                      9.8
   'll   + verb stem                     12.8
   (subtotal)                            41.3
2  Simple present                        39.4
3  Present progressive                   10.0
4  Others (including be going to)         9.3
Ota set out to study "the probable and improbable rather than the possible and the impossible" (1963: 14). From the large amount of statistical information on verb form use, I will draw attention to just a few of the findings with implications for second language teaching. While providing independent verification of many of George's general findings, Ota provided a more fine-grained analysis of which verbs tend to be associated with particular verb form use and even of which adverbials were most likely to be associated with particular verb forms. For example, in his corpus, contrary to what is frequently taught, today and this year were as likely to be associated with simple past as with simple present (pp. 22-24). Seven verb types in the corpus accounted for 50% of all the 17,166 verb tokens and the 16 verbs listed in Table 7 accounted for 61%. Be was the most frequent, with over 30% of all tokens. It was used in the simple present in 79% of its occurrences and in the simple past in 17%. Such data make explicit not only the extent of the distinction between stative and dynamic verbs but also the extent to which particular verbs are associated with particular tenses.

Ota also draws attention to the extent to which different domains or language varieties can affect verb form frequency, with an extensive series of comparisons between the unscripted radio discussions, TV plays and written sources. For example, the past perfect was five times more frequent in the written texts than in the TV plays, while the past progressive was almost three times more frequent in unscripted radio conversation than in written sources. As part of the discussion of the semantics of verb form usage, Ota also notes a striking characteristic of the stative verbs in the corpus, which not only tend to avoid the progressive (a semantic factor), as has often been shown, but are also eight to ten times more likely to be associated with first
Table 7. Most frequent verbs (adapted from Ota 1963: 66-71)

                                 % of tokens of each verb
Verb      % of verb tokens   Simple    Present       Simple   Past
          in corpus          present   progressive   past     progressive
be             30.7           79.0       0.05         17.0      0.0
think           5.0           87.7       1.3           9.7      0.3
have            4.0           66.0       1.3          23.7      0.1
know            3.6           88.7       0.0           8.2      0.0
say             2.6           37.4       2.5          50.0      0.7
want            2.4           81.1       0.0          17.7      0.2
go              1.7           32.1      27.5          26.5      2.4
get             1.5           47.7       9.2          36.6      1.1
do              1.5           29.3      22.4          23.6      1.9
come            1.4           33.6      11.2          39.0      3.3
have to         1.3           74.8       0.4          24.3      0.0
see             1.3           65.0       0.9          23.0      0.0
make            1.3           36.6       8.8          29.2      0.9
mean            1.2           82.6       0.0          15.4      0.0
feel            0.9           68.7       3.3          22.7      0.7
take            0.9           27.5       9.4          38.9      2.0
Total          61.3
person subjects in statements or questions or with second person subjects in questions than with any other subjects, i.e. I/we know or do you know are much more likely to occur than she knows.

Dušková - Urbanová (1967) undertook a verb form study of a single play (Osborne's Look Back in Anger) containing 24,000 words. They found 2905 indicative verb forms. Table 1 shows that their findings are broadly consistent with the "plays" category of the Hyderabad study and with the general findings of the other studies. Dušková - Urbanová also suggest that the very high frequency of be, have and the modals, which account for 47.3% of all present forms in their corpus but are not found in the progressive, probably distorts the picture given in other studies of the relative frequency of simple present and present progressive for expressing actual present.

Krámský (1969) undertook a comparative study within a corpus of 61,785 words, of equal samples of English fiction, "colloquial style" (three plays) and a small selection of academic texts. The 7550 verb forms counted were
analysed in various ways some of which are summarized in Table 1. Not surprisingly, there were significant differences between sources. For example, the difference in frequency of use between the simple present and the simple past was much more marked in the academic part of the corpus than in the "colloquial". Passive voice forms made up 17.3% of verb forms in the academic section and 2.25% in the "colloquial". This analysis is consistent with the analysis of active and passive voice use in the Brown corpus where passive forms made up 21.95% of the verbs in Category J (learned), whereas only 3.32% were passive in Category P (romance) (Francis - Kučera 1982: 554). Krámský found that the simple past was the most frequently used form in fiction, while the simple present was the most frequent in the other two categories. The detailed analysis supported the view that while there is a definite association of particular verb-form use with particular varieties of English, within each variety there is considerable stability of use.

Another pioneering study which has regrettably had less influence in applied linguistics than it deserved is the corpus study by Joos (1964) of British English verb form use as evidenced in a single work (Bedford's The Trial of Dr. Adams). In Joos's complex study his text, being the account of a courtroom trial, reflects this legal domain with a high number of modal forms. He found 8038 finite forms plus almost 1100 non-finite forms. Of 224 grammatically possible finite forms or combinations of forms, only 79 occurred in the corpus, and 10 of these occurred only once. Table 8 shows the 23 which occurred most frequently. Types 1-15 accounted for 90% of the tokens, and types 1-23 accounted for 95%. The remaining 56 types accounted for just 5% of the tokens.

The broad picture from these studies is clear for those who believe that extent of use is a measure of usefulness. 
Most English verb forms are not used frequently enough to warrant pedagogical attention in the early stages at least. Learners' time and effort would be better spent acquiring vocabulary or pragmatic elements which are rarely part of curricula. Regrettably, this point has not prevented learners of English often being subjected to a pedagogy based on a descriptive grammar or perhaps worse, a comparative grammar where differences are magnified, and idiosyncrasies frequently receive as much attention as the typical. As George (1963a: 1) noted, If, for instance, all the tenses have been taught with equal thoroughness, and drilled with impartial application, then the students' English is likely to differ from native English in two ways: it is likely to show a wider and a more even
Table 8. Order of frequency of most frequent finite verb forms in Joos (1964)

Type   Tokens   % of finite forms   Example
 1     2,853         35.5           I always say no good comes of these cases
 2     2,143         26.7           When the doctor went away did he leave
 3       319          4.0           the defence have decided not to call the doctor
 4       292          3.6           Morphia and heroin were commonly used
 5       249          3.1           both morphia and heroin are administered to people
 6       219          2.7           If there were, I would take them and destroy them
 7       208          2.6           the answers sound as colourless as one can make them
 8       177          2.2           the period when he was prescribing for her
 9       175          2.2           are you standing there and saying as a trained nurse
10       164          2.0           had you made any inquiries before giving evidence
11       115          1.4           I will certainly help you
12        90          1.1           asks if he may put a further question to the witness
13        83          1.0           And you still say so? I do
14        80          1.0           did the doctor ask you for anything? - He did
15        77          1.0           would you have expected the doses to have a fatal result
16        65          0.8           you must believe me
17        65          0.8           whether he might say
18        63          0.8           cases where this amount has been given
19        59          0.7           he told me I should prepare a codicil
20        56          0.7           I did not think you could prove murder
21        34          0.4           He might have given hyoscine
22        32          0.4           the only way in which justice can be done
23        27          0.3           you could have asked this very helpful question
Total                95.0
distribution of usage; and it is likely to show personal usage-habits which may be "correct" in so far as they represent features which could be used by native speakers, but which cause these features to figure disproportionately. The development of computer hardware and software and the availability of the Brown, LOB and London-Lund corpora made possible a flowering of corpus research especially over the last decade. Although most studies were not directed specifically towards second language teaching, they have ranged over many areas of language and language use which I believe have implications for pedagogy.
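The coverage figures quoted for Joos's data (types 1-15 covering about 90% of finite forms, types 1-23 about 95%) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the names `TOKENS`, `TOTAL_FINITE` and `cumulative_coverage` are introduced here, and the count for type 8 is taken as 177, the value implied by its reported 2.2% share of the 8038 finite forms.

```python
from itertools import accumulate

# Token counts for the 23 most frequent finite verb-form types in Table 8.
# Assumption: type 8 is read as 177, as implied by its 2.2% share.
TOKENS = [2853, 2143, 319, 292, 249, 219, 208, 177, 175, 164, 115, 90,
          83, 80, 77, 65, 65, 63, 59, 56, 34, 32, 27]
TOTAL_FINITE = 8038  # all finite forms reported for the corpus


def cumulative_coverage(counts, total):
    """Return the share (in %) of `total` covered by the first k types, for each k."""
    return [100 * running / total for running in accumulate(counts)]


coverage = cumulative_coverage(TOKENS, TOTAL_FINITE)
print(f"types 1-15: {coverage[14]:.1f}%")
print(f"types 1-23: {coverage[22]:.1f}%")
```

A curve of this kind, steep at first and then nearly flat, is what lies behind the argument that a small number of verb forms deserves most of the early teaching time.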
Of interest in light of the earlier research on verb form use was a study by Coates (1983) of the use of modal auxiliaries in the London-Lund and LOB corpora. Her analysis was based on representative samples of about two hundred tokens of each of the modal auxiliaries from each of the two corpora. It is always risky to attempt to re-organize or summarize data analysed by someone else and in this case the task is not made easier by the nature of modal use where, as Coates points out, boundaries are fuzzy and categories rarely discrete. However, the intelligence of Coates's discussion and the richness of detail invite further analysis, and it is hoped that the summary in Table 9, in spite of some rounding off when the data is converted to percentages, does not distort her findings. The importance of the epistemic use of at least half of the modals is evident from Table 9. As Coates points out, attempts to generalize about the use of the modals can be misleading because her data shows clear differences between spoken and written use, and the influence of domain of use. For example, while would is the commonest written modal, it is only the third commonest spoken modal. Will and can are much more common in speech than in writing (and are also learned first by children). With regard to the use of must, for example, Coates notes that . . . in written language, and in language written to be spoken, and in formal spoken language root must occurs more frequently than epistemic must. But in informal spoken language, that is, in normal everyday adult conversation, the inverse is true: epistemic must is preponderant. (48)
The fact that West (1953) showed that 86% of the uses of must are to express obligation and only 12% to express epistemic modality illustrates the need for more genre-sensitive studies for applied linguistic purposes of the frequency of important lexical and grammatical elements. This may be further illustrated by considering the use of must in Section J of the LOB corpus where must occurs in passive constructions like must be assumed in 43% of tokens. The scope of Coates's analysis of modals was widened by Holmes (1988) who identified over 350 linguistic devices used for expressing epistemic modality in English. Beginning with a 50,000-word corpus of spoken and written English to help establish the types, Holmes then studied the frequency of these types in a 640,000-word sample from the Brown, LOB and London-Lund corpora. Holmes shows the importance of words like seem, assume, believe, feel, know, suggest, suppose, chance, doubt, possibility, likely, obvious, clear, apparently, certainly, of course, perhaps in addition to the modal auxiliaries. This information is then used to evaluate the adequacy of some widely-used English teaching texts and reference books and some are
Table 9. Percentages of use of modal auxiliaries in the London-Lund and LOB corpora (based on Coates 1983) SEMANTIC CATEGORIES
Root use
Epistemic
Hypo-
use
thetical
Other
use Obligation-necessity MUST
53
(65)
NEED
87
SHOULD
42
(51)
18
(12)
OUGHT
84
(84)
9
(13)
46
(31)
1
(4)
13
Possibility
Ability
65
(64)
21 (25)
5
(3)
COULD
25
(30)
12 (29)
2
(2)
4
(7)
MAY
4
(22)
16
(6)
74
(61)
MIGHT
1
1 Obliga-
(9)
19 (28) 7
(3)
9
(8)
0
(0)
Permission
CAN
Willingness
21
56 (32)
6 (11) 44
54
0
Intention
tion WILL
13
(8)
SHALL
19
(9)
WOULD
4
(6)
2 (34)
23
(14)
58
(71)
6
(7)
18
(19)
61
(35)
1
(3)
1
(3)
83
(83)
1
(3)
SHOULD
11
(5)
21
(9)
Note: Percentages from LOB corpus in parentheses
found to be wanting. Holmes's analysis should be useful for writers of future pedagogical works.

Corpus analysis of verb form use relevant for language teaching purposes can also be illustrated in a study of English conditionals. Traditionally, learners of English have been taught about three main semantic categories of conditional sentences, namely possible (real, or open, conditions), improbable and impossible (counterfactual) conditions, each realized through particular verb forms in the main and subordinate clauses. Hill (1960) challenged this analysis by listing some 324 potentially acceptable finite verb form combinations for expressing conditions with if. In a detailed analysis of all sentences containing if in the Brown and LOB corpora, Wang (1991) showed that only 76 of the clause combinations described by Hill occur in Brown with a total of 2017 if-tokens. In the case of LOB, 103 of Hill's types occur with a total of 2307 tokens. Furthermore, approximately 75% of the tokens of conditional sentences containing if-clauses in each corpus consist of the verb form combinations shown in Table 10.
Table 10. Verb form use in conditional sentences in the Brown and LOB corpora

                                                                   Percentage of if-tokens
Verb form in if-clause   Verb form in main clause                   Brown     LOB
present simple           present simple                              22.0     22.0
present simple           will/shall/be going to + stem               13.2     12.5
past simple              would/could/might + stem                    11.3     11.1
past simple              past simple                                  6.7      6.8
present simple           should/must/can/may/ought to + stem         10.0      6.4
past perfect             would/could/might have + past participle     3.9      4.1
were/were to             would/could/might + stem                     4.0      4.0
can + stem               present simple                               1.1      3.2
present simple           would/could/might + stem                     1.9      2.4
present simple           imperative                                   1.7      2.0
Total                                                                75.8     74.5
Where the conditional clause begins with unless rather than if the rank ordering for the five most frequent verb form combinations is the same. The semantic categories of the conditional sentences in the Brown and LOB corpora are summarized in Table 11. As Tables 10 and 11 show, there appears to be a very close parallel between US and British English in both the verb form and semantic use of conditional sentences. The only divergence found is a slight difference in clause ordering. Even here, as Table 12 shows, there is a much more striking difference in the clause ordering for if-conditional and unless-conditional sentences than any difference between US and British usage.
Table 11. Semantic categories of conditionals in Brown and LOB corpora

                              Percentage of if-conditional sentences
                              Brown     LOB
Open conditions
  Factual                      47.3     48.3
  Predictive                   28.3     26.3
Hypothetical conditions
  Improbable                   14.2     14.3
  Counterfactual               10.2     11.1
Total                         100      100
Table 12. Clause order for conditional sentences in Brown and LOB corpora (percentages)

                          Brown     LOB
Initial if-clause          77.5     70.5
Initial unless-clause      37.5     24.7
Syntactic and semantic studies

The structure of the conceptual system we express through language has been an important concern of communicative language teaching (Wilkins 1976). Kennedy (1978: 178) suggested, for example, that metaphors of physical motion are very important organizing frameworks for the expression of economic and other phenomena. The following sentences taken from a single page of a newspaper illustrate this:

The oil price tumbled towards the end of May.
After the initial drop, prices took off and climbed steadily until the end of the year.
To understand the significance of this, one has to go back to 1975.
It has got to the point where the government's propaganda machine has again started.
The Gulf confrontation already has demonstrated an ability to intrude on the global economic scene.
The metaphors of motion are not, of course, restricted to the economic sphere as the following sentences from the same paper show. The rising tide of violence shows no sign of subsiding. Immigrants and police have hammered out their differences. The increasingly successful assault on these genetic disorders led the government to launch an organized attempt to find every one of the estimated 50,000 to 100,000 genes in the human body. It is not unusual to find over forty percent of sentences in a journalistic text containing such metaphors. Indeed, Lakoff - Johnson (1980: 4) referred to them as "orientational metaphors" and have argued that these and others demonstrate that "most of our ordinary conceptual system is metaphorical in nature". The structure of the conceptual system has also been explored in a corpus study in which it is suggested that over fourteen percent of the words used are involved in quantification, and that quantification is not expressed mainly by numbers and grammatical quantifiers (Kennedy 1987a). Fourteen subcategories of quantification such as totality, approximation and equivalence and the relative proportions of each are listed, along with the frequencies of the major linguistic devices used for expressing them in Category J of the Brown and LOB corpora. While quantifiers usually receive attention in language teaching, they accounted for only about ten percent of quantification tokens in this study. For expressing the quantificational subcategory of totality, for example, words such as global, total, entirely and make up feature alongside quantifiers such as all or every among the 73 types in the corpus. Such studies of the number of tokens in representative corpora can in turn lead to the development of more informed teaching materials than those based simply on impressionistic grounds. 
Although little statistical data on conceptual structures in English is available, a recent comparative study of the structure of metaphors in medical writing using corpora of over 40,000 words in English, French and Spanish opens up interesting possibilities. Salager-Meyer (1990) found that whereas metaphors associated with structures accounted for 70.6% of metaphors, those associated with processes, functions and relations accounted for only 29.2%. That is, nerve roots or abdominal walls were more likely to be found than migratory pain, or vehicles of infection. Over 85% of the metaphors were in nominal groups. There were no significant differences between the languages. In another study of how conceptual categories or semantic notions are realized in texts, Kennedy (1987b) identified almost 300 linguistic devices
which are used for expressing temporal frequency and then explored how often each of these occurred in the academic English section (J) of the Brown and LOB corpora (about 350,000 words). The most frequent devices were not the expected adverbs of frequency, but rather subordinate clauses. Adjectives such as general and common were more frequent than temporal adverbs such as generally or commonly.

In his influential language teaching syllabus, van Ek (1976: 57) listed seven linguistic devices for expressing the notions of cause, reason and effect in English. These included why, because + subordinate clause, as + subordinate clause, the reason is, the result is, then, so. Although reference grammars such as Quirk et al. (1985) describe some 40 causative devices, there have been few statistical sources for checking whether the types listed in a syllabus such as van Ek's are indeed the most likely to be encountered or required. Altenberg (1987) found that, among 63 types he identified, because and so account for 80% of the causal tokens in a 100,000-word sample from the London-Lund corpus and only 22% of the tokens in a similar-sized sample from the LOB corpus.

In a study of causation, Fang (1990) identified 130 devices which explicitly mark causation and discussed other non-marked ways of expressing causation through juxtaposition, for example. In the LOB corpus, only 11 of the 130 explicit causative devices do not occur. Twenty-three types account for 82% of the 5862 tokens. These are listed in Table 13. Altenberg (1987) suggested that overall the cause-result clause order is about as common as the result-cause order, although there was variation between genres. Fang, on the other hand, has suggested that the preferred order is with the subordinate causative clause following and that sentences in which the subordinate clause beginning with because comes first account for only 6% of sentences which contain because. 
Fang's general finding on clause order in one area of subordination can be supported independently by an analysis of clause order involving the conjunctions when, after and before in Section J of the LOB corpus. Over three-quarters of the tokens have the subordinate temporal clause following the main clause. These findings on clause order with causal and temporal subordination contrast markedly with Wang's findings on conditionals (cited earlier) and deserve further exploration. The importance of genre-specific information is again seen with studies of linguistic devices used for expressing causation. The word since, for example, occurs 542 times in LOB of which only 189 (35%) express causation. In Section J (academic texts), however, 92 out of 144 tokens of since (64%) express causation.
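The genre effect on since can be made concrete by recomputing the proportions from the counts given above. The following is a minimal sketch; the helper `share` and the dictionary `SINCE_COUNTS` are names introduced for illustration, and the figure for LOB outside Section J is derived here by simple subtraction rather than reported in the study.

```python
def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to the nearest integer."""
    return round(100 * part / whole)


# (causal tokens, all tokens) of "since", as reported in the text.
SINCE_COUNTS = {
    "LOB, all categories": (189, 542),
    "LOB, Section J (academic)": (92, 144),
}
# Derived by subtraction (an inference, not a reported figure):
# tokens of "since" in LOB outside Section J.
SINCE_COUNTS["LOB, outside Section J"] = (189 - 92, 542 - 144)

for label, (causal, total) in SINCE_COUNTS.items():
    print(f"{label}: {share(causal, total)}% causal")
```

On these counts, roughly a quarter of the tokens of since outside academic prose are causal, against nearly two thirds within it, which is why a single corpus-wide percentage can mislead a syllabus designer.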
Table 13. Explicit causative devices in the LOB corpus (from Fang 1990)

                      No. of tokens      %
because                    635         10.8
why                        443          7.6
so                         425          7.3
for                        365          6.2
therefore                  296          5.0
effect                     278          4.7
cause                      265          4.5
reason                     258          4.4
thus                       227          3.9
result                     212          3.6
since                      189          3.2
as                         166          2.8
because of                 142          2.4
so that                    139          2.4
then                       135          2.3
so that                    127          2.2
due to                     123          2.1
for (that) reason           93          1.6
lead to                     83          1.4
from                        81          1.4
hence                       52          0.9
as a result of              48          0.8
being                       47          0.8
Total                    4,829         82.0
I have illustrated some findings from corpus research, particularly on verb form use and on the expression of conceptual categories. There have also been many, often isolated, studies of disparate aspects of English which, although small in themselves, throw light on the nature of the task facing a language learner. Bibliographies such as those by Altenberg (1991) and Greenbaum - Svartvik (1990) document the wide range of these studies from the major corpora. Many have been concerned with issues in grammatical and sociolinguistic description: studies, for example, of that, really, some and any, shall and will, tags, and the subjunctive in British and US English. It is not the size of a study which necessarily determines its relevance or importance for language teaching. Even a small study can reveal aspects of language use which could be the basis for whole new directions in research and language teaching.
There would thus be few studies without relevance to language teaching even if the precise nature of that contribution remains to be seen. Some of these studies could affect the naturalness of language learning materials. For example, contrary to an earlier claim by Keenan that subject relatives were more common than object relatives, Fox (1987) found in a sample from a small spoken corpus that there were the same number of subject and object relatives. The relative clauses which one finds in English conversation are quite different from those usually displayed in the linguistic literature. The head of the relative is often non-definite (something, anyone, a guy etc.); and object relatives, rather than appearing with full NP subjects, occur almost exclusively with pronoun subjects: Have you heard about the orgy we had the other night? By contrast, the kinds of relative clauses cited in many studies represent what is often thought to be the central function of relative clauses - identifying a previously introduced referent: I saw the dog that bit the cat. (858)
Fox found very few such identifying relative clauses, and yet these are common in teaching materials.

Biber (1988), in a comparison of spoken and written English based on the LOB and London-Lund corpora, showed how difficult it is to establish basic differences in the formal characteristics of the two varieties. The extent of the importance of discourse items such as hedges, responses and softeners in spoken English, however, is shown by Altenberg (1990) who, in a study of a 50,000-word sample of the London-Lund corpus, found that they made up 9.4% of all word class tokens and were more frequent than prepositions, adverbs, determiners, conjunctions or adjectives. Because discourse items are not handled well in most dictionaries and grammars, they are not part of traditional language teaching, with consequent effects on the naturalness of learners' English.

The linguistic distinctiveness of different varieties is also well illustrated in Johansson (1981), who compared Sections J (learned and scientific) and K-R (fiction) of the LOB corpus. He used a 'distinctiveness coefficient' to identify the nouns, verbs, adjectives and adverbs which were most particularly associated with the two varieties. Table 14 includes some of the most distinctive verbs and adverbs.

An analysis made by Ljung (1990) of the vocabulary of over 50 widely-used TEFL texts showed a major mismatch between pedagogical materials and the most frequent items in the corpus compiled for the Cobuild project. The TEFL texts were found to contain an unusually high proportion of simple, concrete words and a smaller than expected number of more abstract words.
Table 14. LOB categories J and K-R: most distinctive verbs and adverbs (from Johansson 1981)

Verbs                           Adverbs
J               K-R             J                 K-R
measured        kissed          theoretically     impatiently
assuming        heaved          significantly     softly
calculated      leaned          approximately     hastily
occurs          glanced         hence             nervously
assigned        smiled          relatively        upstairs
emphasized      hesitated       respectively      faintly
obtained        exclaimed       commonly          quietly
executed        murmured        separately        abruptly
tested          gasped          consequently      eagerly
corresponding   hurried         similarly         upright
vary            flushed         rapidly           tomorrow
bending         cried           thus              downstairs
varying         eyed            furthermore       gently
loading         staring         sufficiently      anyway
measuring       paused          therefore         maybe
determine       whispered       secondly          swiftly
isolated        waved           ultimately        presently
dissolved       nodded          readily           suddenly
resulting       frowned         effectively       somewhere
defined         shivered        generally         back
occur           muttered        widely            slowly
stressed        shared          strictly          desperately
illustrates     flung           mainly            sharply
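The formula behind Johansson's 'distinctiveness coefficient' is not given in this chapter. As a rough illustration of how a ranking like that in Table 14 could be produced, the sketch below assumes a simple difference-over-sum coefficient applied to frequencies normalised per million words. The function names, the counts and the sample sizes are all hypothetical and are not taken from the LOB corpus.

```python
def per_million(count, sample_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / sample_size


def distinctiveness(freq_a, freq_b):
    """Assumed coefficient in [-1, 1]: +1 means the word occurs only in variety A,
    -1 only in variety B, and 0 means it is equally frequent in both."""
    if freq_a + freq_b == 0:
        raise ValueError("word does not occur in either sample")
    return (freq_a - freq_b) / (freq_a + freq_b)


# Hypothetical counts for one word in two samples of different size.
freq_learned = per_million(40, 160_000)   # invented count and sample size
freq_fiction = per_million(5, 250_000)    # invented count and sample size
score = distinctiveness(freq_learned, freq_fiction)
print(f"{score:+.2f}")
```

Normalising before comparing matters because the two sections being contrasted differ in size; sorting a word list by such a score would place the most variety-specific items at either end.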
Ways in which standard linguistic descriptions can be supplemented for language teaching can also be illustrated from a corpus study by Master (1987) of the use of the definite article, a well-attested learning problem in English. Quirk et al. (1985: 5.52 ff.) noted that although the, a/an and zero can all be used generically, . . . it should not, however, be assumed that the three options are in free variation . . . The is rather limited in its generic function. With singular heads it is often formal or literary in tone, indicating the class as represented by its typical specimen. Teachers of English for academic purposes may thus well wonder whether to avoid teaching generic the. Using a corpus of only 50,000 words from
11 scientific writers, Master found that the accounted for 38% of generic article use, zero 54%, and a/an 8%. Although there was variation between writers, generic the occurred in subject position for 71% of its tokens and only 12% in object position. It was also more prevalent in the first sentence of a paragraph. Master argued that teachers need to have such information available so that the learners do not get the erroneous idea that the article system cannot be explained or learned. Corpus projects which have provided evidence for reconsidering the units of speech and therefore which have implications for language teaching include Carterette - Jones (1974) and the TESS project (Svartvik 1990). The 60,000-word corpus of Carterette and Jones, now available as part of the CHILDES corpus (MacWhinney - Snow 1990), was transcribed without conventional word divisions. It led the authors, on the basis of their statistical analysis, to conclude that the "syllabic phrase" of about 15 phonemes is the basic unit of speech operating under a suprasegmental umbrella. Clearly, when there is much written input for second language learners using the normal word boundaries, there could be a major mismatch since "phonemic words are three times as long as lexical words" (Carterette - Jones 1974: 31). The authors further note that this mismatch may make learning to read difficult. Hieke (1989) has discussed implications of this research for the development of listening comprehension and fluency by second language learners. The TESS project, associated with the London-Lund corpus, has defined the tone unit as the basic prosodic unit, being characterized by contour rather than by pause boundaries, with the average size of each segment of 4.5 words per tone unit being slightly smaller than the pause-defined 5.7-word segments. The findings of both these corpus-based projects fit in well with recent corpus-based work on collocation (e.g. 
Sinclair 1989) in expanding the notion of the word for language learning and use. There have always been those who have found a sharp division between vocabulary and grammar to be less than satisfactory, and there have been increasing numbers of linguists who have reconsidered the units of language acquisition and use (e.g. Nattinger 1980; Pawley - Syder 1983; Peters 1983; Kjellmer 1984). The identification of recurrent word combinations and their frequency in different domains or varieties of use is a promising potential contribution of corpus linguistics to second language teaching. There is considerable potential for descriptive linguistic research, as well as studies of the effects of methodology which incorporate these insights. Sinclair - Renouf (1987) have argued for a lexical syllabus which highlights the common uses of common words and their collocates and which places more emphasis
on the content of language teaching. In so doing, they occupy some of the ground formerly held by Palmer in particular, earlier in the century. The large commercial Cobuild project is the most fully developed contemporary attempt to use a corpus as a basis for the selection of content for English language teaching. Other major reference works, most particularly Quirk et al. (1985), have been associated with a corpus, but the Cobuild project has moved systematically beyond the reference dictionary to a grammar and various pedagogical applications. One contribution has been in the principled selection of lexis for teaching. The innovative Cobuild Dictionary, and the Grammar, could be further enhanced as reference sources for teachers with the addition of more statistical information on the most frequent items in different contexts and if the most frequent 2000 or 3000 words were coded for ease of identification. The Cobuild English Course (Willis - Willis 1988) for adult learners is based on a lexical syllabus derived from the Cobuild corpus. The validity of a pedagogical progression which draws attention to different meanings of certain high frequency words could be the subject of useful research. The significance of a lexical syllabus can be illustrated in corpus research on prepositions and adverbials. Approximately one out of every eight words in the LOB corpus is a preposition, with 13 of the almost one hundred types accounting for 90% of the preposition tokens (Mindt 1989). Like verb forms, prepositions are used very frequently and are hard to learn, when viewed as grammatical items. Corpus studies are beginning to show why. Sinclair (1989: 137), in a detailed study of the word of in context, has suggested that . . . it may ultimately be considered distracting to regard of as a preposition at all . . . we are asked to believe that the word which is by far the commonest member of its class . . . 
is not normally used in the structure which is by far the commonest structure for the class.
That is, of does not typically precede a noun to produce a prepositional phrase which functions as a clausal adjunct. Sinclair shows of to be more sensitive to preceding nouns. Collocational studies of prepositions show major regularities as well as major differences in the word classes which most commonly occur with particular prepositions, and in the semantic functions they serve. Kennedy (1990), for example, has suggested that 63% of the tokens of at in the LOB corpus are represented by about 150 collocations which can be listed easily on a single page. The 28 most frequent types (apart from at + the name of a town or place) are listed in Table 15.
Table 15. Number of tokens of the most frequent right collocations of at in the LOB corpus.

at least                249     at night                 34
at + personal pronoun   236     at the moment (of)       34
at + numeral            181     at the top               31
at all                  175     at times                 30
at last                 111     at the beginning (of)    30
at once                  98     at this time             28
at the same time         92     at work                  26
at the end (of the)      88     at the meeting (of)      25
at home                  83     at that time             24
at the time              77     at the age of            24
at which                 61     at the back (of)         22
at present               57     at any time              21
at first                 50     at the bottom (of)       20
at any rate              34     at the present time      20
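Counts of the kind shown in Table 15 can be approximated with very little code. The following is a minimal sketch only, not the procedure used in the studies cited: it assumes the corpus is already available as a list of word tokens, and the function names, the sample sentence and the list of collocations are hypothetical. It counts the word sequences that immediately follow a node word, and the share of the node word's tokens accounted for by a chosen list of collocations.

```python
from collections import Counter


def right_collocations(tokens, node, n=1):
    """Count the n-word sequences that immediately follow each token of `node`."""
    words = [t.lower() for t in tokens]
    counts = Counter()
    for i, w in enumerate(words):
        # Only count when a full n-word sequence is available after the node.
        if w == node and i + n < len(words):
            counts[" ".join(words[i + 1 : i + 1 + n])] += 1
    return counts


def coverage(tokens, node, collocations):
    """Share of `node` tokens that are followed by one of the listed collocations."""
    words = [t.lower() for t in tokens]
    targets = [c.split() for c in collocations]
    hits = total = 0
    for i, w in enumerate(words):
        if w != node:
            continue
        total += 1
        if any(words[i + 1 : i + 1 + len(t)] == t for t in targets):
            hits += 1
    return hits / total if total else 0.0


# Hypothetical toy input; a real study would read a tokenized corpus instead.
sample = "At least he was at home at last , and at least at the same time".split()
one_word = right_collocations(sample, "at")
three_word = right_collocations(sample, "at", n=3)
share = coverage(sample, "at", ["least", "the same time"])
```

Categories such as "+ personal pronoun" or "+ numeral" in Table 15 depend on word-class information, which a plain token list does not carry, so a tagged corpus or manual sorting would still be needed for those rows.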
Viewed thus as part of a lexical phenomenon rather than as a disembodied grammatical function word, at takes on more manageable pedagogical proportions. Similarly, just as the prepositional and adverbial uses of through are not always easy to distinguish, learners of English do not always find it easy to distinguish through from between on semantic grounds. Kennedy (1991) suggests that the major grammatical distinction between the two words lies in the word class they each most frequently associate with. Table 16, which contains about 40% of the tokens immediately preceding between and through in the LOB corpus, shows that the most frequently occurring words immediately before between are typically nouns, while verbs most frequently precede through. The ways in which between and through may be distinguished also lie not just in the total list of different categories of meaning of each word, but rather in the relative frequencies of use in corpora of each of these meanings. Thus, for example, both between and through can be associated with motion. In LOB, the use of between to express motion accounts for 4.4% of tokens (e.g. She ran between the dining room and the kitchen), whereas a dynamic use of through accounts for 36.8% of tokens (e.g. when he was passing through London). Corpus studies undertaken as small projects as part of teacher education or graduate classes have been reported from many universities. Celce-Murcia
Table 16. Words occurring four or more times immediately before between and through in the LOB corpus

between          No. of tokens     through     No. of tokens

*difference           59           go               36
*relationship         25           pass             33
*distinction          19           come             20
*relation             16           be               15
*gap                  12           and              13
*agreement            11           get              12
*contrast             11           break            10
*distance             11           *him             10
*place                11           run              10
be                    10           *way              9
*comparison            9           *it               8
exist                  9           fall              7
*meeting               9           lead              7
*contact               8           look              7
*link                  8           out               7
and                    7           in                6
in                     7           live              6
as                     6           only              6
*conflict              6           *them             6
*correlation           6           all               5
*gulf                  6           carry             5
lie                    6           cut               4
that                   6           down              4
*time                  6           flash             4
agree                  5           *line             4
*connection            5           one               4
distinguish            5           or                4
*interval              5           right             4
pass                   5           see               4
*border                4           shoot             4
*exchange              4
make                   4
out                    4
*proportion            4
*quarrel               4
*similarity            4
*space                 4
*struggle              4

Total                345           Total           274

Nouns are recorded in their singular form, verbs in their stem form. * = noun or pronoun
(1990), for example, reports a number of such language teaching-related projects which show the kinds of infrequent elements which are often taught to learners of English, and an earlier grammar (Celce-Murcia - Larsen-Freeman 1983) reported statistical findings from a number of student studies which used small corpora of US English. Students and teachers can learn a great deal from such small studies. For example, in language teaching courses, how is typically taught as an interrogative pro-form. In Section J of the LOB corpus, it can be seen quite easily that only 15% of the tokens of how are used in this way, with subject-verb inversion or do-support (How do we know this?). The most frequent use by far is as a subordinator to introduce a nominal clause (I like to see how things are made). The misplaced emphasis of pedagogy could well be one source of the typical learner's error (*How you are?, *I want to find out how is it).
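A small study of this kind typically starts from a concordance, in which every token of the word under study is displayed with a few words of context so that its uses can be sorted by hand. The following is a minimal sketch of such a key-word-in-context listing, assuming the text is already split into tokens; the function name `kwic`, the context width and the sample tokens are illustrative assumptions and do not reproduce any particular concordancing program.

```python
def kwic(tokens, node, width=4):
    """Return one key-word-in-context line for each token of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width) : i])
            right = " ".join(tokens[i + 1 : i + 1 + width])
            # Right-align the left context so the node words line up in a column.
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines


# Hypothetical toy input built from the two example sentences in the text.
sample = "How do we know this ? I like to see how things are made .".split()
lines = kwic(sample, "how")
for line in lines:
    print(line)
```

The classification of each line as interrogative or subordinating is left to the reader of the concordance; the sketch only gathers the evidence in one place.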
Developmental studies

In addition to possible implications of descriptive studies of the kind mentioned, it might be thought that studies of first language acquisition based on diaries and other corpora would have had a major influence on second language learning theory and practice. Early studies often produced frequency data on phonemes and vocabulary in particular (McCarthy 1954). Studies of grammatical development also often produced statistical data. Hunt (1965), for example, using a corpus of 54,000 words of English written by children over the age of ten, showed that the main developmental change was in the length of clauses. As learners get older there is more modification (including relativization) and complementation. A later study by O'Donnell - Griffin - Norris (1967) of a corpus of spoken and written English produced by schoolchildren suggested that the extent of the use of ellipsis may be a better measure of development than measures such as the amount of subordination or nominalization or the relative frequencies of word classes. Such studies had little direct influence on second language theory or practice. From the 1970s, however, a number of important corpus-based developmental studies of first language acquisition did influence second language theory. The data for some of these studies has become more generally available as part of the CHILDES corpus. Perhaps the most well known and influential of these developmental studies was the Harvard project (Brown 1973).
In a psycholinguistic analysis of several hundred hours of the transcribed speech of three children, associated with a state-of-the-art but constantly evolving linguistic theory, Brown characterized first language development as beginning with the expression of a series of between 8 and 15 semantic relations such as agent and action which are later modified or given greater specificity through morphological and syntactic development. Brown's data showed an unexpectedly consistent acquisition order for some 14 linguistic elements such as tense, number, and articles, an order which furthermore was claimed to be unrelated to the frequency of these elements in the parents' speech (1973: 362). Brown's study did much to inspire a decade of second language research studies including the morpheme order studies (e.g. Dulay - Burt 1974) which attempted to show that the putative universal processes of first language acquisition were paralleled in child second language acquisition. Somewhat ironically, this activity tended to give comfort to those who favoured a non-formal language teaching pedagogy. In the coming together of both "communicative" goals and "naturalistic" language teaching theory there was a period in which there was a conspicuous loss of interest in the formal content of what was to be learned, and more emphasis on the messages communicated. Other first language studies meanwhile complemented the psycholinguistic studies. Halliday (1975), for example, recorded the development of a set of sociolinguistic functions in which first language acquisition was interpreted as an interactive sociolinguistic process to realize a set of meanings.
Implications for language teaching

As language teachers are aware, language teaching theory has tended to be influenced periodically by changes in fashion and ideology. From the late 1960s through to the 1980s, corpus-based research, with the exception of that derived from developmental studies of first language acquisition, did not exert any great influence on pedagogy or curricula. In fact, teachers tended to show more interest in the learner and the learning process than in what was being learned. There were a number of other factors which contributed to the decline in influence of corpus studies on language teaching. Firstly, as has been noted above, it became increasingly unclear by the 1970s what the units of language acquisition might be, with the notion of linguistic structure changing and the division between grammar and lexicon becoming less well defined.
Further, as the influence of sociolinguistic concepts of social and situational appropriateness gained widespread acceptance, language teachers were urged to avoid the teaching of language as an unapplied system. In order to teach language as communication, however, it was of course not necessary that the formal basis for communication should have been neglected. There was also a growing awareness that because the context of situation affects the forms and structure of discourse and language behaviour, language teaching should be directed towards particular goals or "specific purposes". From the 1980s it was also being argued by some that the controlled pedagogical input of form and even meanings through selection and gradation could be misguided, that the input for learning should come from interaction and the negotiation of meaning, with the process of interaction initiated by "tasks" making learning possible. Long (1990) suggested, for example, that if language is taught through communication not only for communication, then the distinction between syllabus and methodology becomes blurred. Because interaction and the negotiation of meaning are unpredictable, especially in terms of formal elements, the teacher tends to become an organizer of situations or tasks rather than a source of knowledge about the forms of the language or a controller of a progression of these forms. In this paper, I have attempted to outline something of the range and nature of corpus studies which have produced information on the linguistic nature of the learner's task and which I believe have potential relevance for language teaching theory and practice. As has been noted, however, contemporary language teaching has not always been able to take advantage of this information. 
The influential Council of Europe syllabus was partly a product of professional judgement, "introspection" and experience rather than objective, observational research, since data for a more scientific approach was not easily available (Wilkins 1973: 137). Although the conventional wisdom sometimes results in a non-linguistic syllabus where both the grammar and vocabulary learned are the product of the demands of authentic task-based interaction and negotiation, not all language teachers are willing to limit their role to creating situational opportunities for learning. If the task of curriculum designers and language teachers is to bring together motivating learning tasks or situations (including interesting discourse or text) which provide repeated exposure to salient, useful linguistic and pragmatic elements of the language, corpus linguistics should be able to provide information on frequency of use. By instantiating language produced unselfconsciously when the focus is on messages not form, a corpus can contribute to second language teaching through improvements in the descriptive accuracy of grammars, dictionaries and wordlists. Quantificational data
on frequency can then contribute to syllabus selection and sequencing of the linguistic and sociolinguistic elements of the system. At the classroom level, corpus linguistics can raise the level of teachers' consciousness about the dimensions of the learners' task and influence the selection of content in appropriate authentic texts and activities. Quantitative evidence from texts can provide a necessary check on subjective judgement at all these levels. Texts selected without awareness of how typically they represent salient features of the language can present a chaotic picture of the language, while invented examples can present a distorted version of typicality or an over-tidy picture of the system. However, in spite of the amount that has been discovered about the structure and use of English through corpus linguistics, it should be a matter of concern that many of these rich insights are either not widely known or are ignored. There are a number of issues and problems in the use of the results of corpus analysis which may partly account for this neglect. First, there continues to be a mismatch between the results obtained from representative corpora and the recognition that variation is inherent in language use. Register, domain, topic and medium influence the forms used. Since any statistics are inevitably text sensitive, their validity for applied linguistics is an issue. The ICE project which will provide more modern corpora of a number of regional varieties of English and cover a number of registers should boost confidence in the face validity of corpus studies. The wider availability of optical scanning equipment to enable specialized corpora to be computerized more quickly and the development of more user-friendly software should also increase the use of corpora by language teachers. Second, there is the issue of the reliability of results of analysis arising from corpora of different sizes. 
Corpora have ranged from single books such as that used by Joos to the large and expanding Cobuild corpus. Bongers (1947: 104) argued that "no reliable word count can be made on material under about 1 million words", whereas corpora of over ten million words only reflect changes in the frequency of very rare words depending on the character of the particular texts. The consistency of findings on frequent items such as verb forms from quite small corpora suggests that the "standard" one-million-word corpora are reasonably reliable for high or medium frequency items. Celce-Murcia (1990) suggests that 200,000-word corpora consisting of equal proportions of spoken and written sources are quite adequate for many studies. For textual study of very high frequency items such as of, if the corpus is very large, then sampling appears necessary (Sinclair 1989). In addition to being valid and reliable, the results of corpus research need to be systematic, non-trivial, accessible and coherent if they are to have greater
influence for language teaching. Some corpus studies have been interesting but isolated, "magpie" explorations of quite small phenomena such as how a single word is used. Similarly, a finding that children's sentences get longer as they get older does not necessarily have a pedagogical application. As well as distinguishing between what is scientifically interesting and what is pedagogically useful there will be a need to make corpus research findings accessible through clear and transparent summaries in journals and more importantly through handbooks or manuals. It is not enough to tell teachers that curricula, reference works or teaching materials are based on corpus analysis. Increasingly, the most professional teachers expect evidence to justify positions taken, and teacher trainees should receive statistical information as part of the description of English or whatever language they are learning to teach. After all, chemists know that gold and iron differ not only in their atomic structure but in where they are found and the relative quantities of each available. Similar evidence needs to be available on linguistic elements to guide language teachers as to where it is worthwhile investing effort. Corpus work nowadays is very much associated with the speed and scope which computers can bring to analysis. However, as has been suggested above, many pedagogically significant studies have been undertaken without computers. Further, many studies cannot yet be fully computerized. There is a need for a great deal of laborious hands-on work, particularly on semantic issues, to discover or identify the types which in most cases can then be counted by machine. Without such preparatory or complementary manual analysis, how else do we note, for example, the prevalence of the unmarked juxtaposition of propositions as a relatively common way of expressing causation in English? Language description can be challenged by corpus research. 
It should not be assumed that the types identified in traditional descriptions should be the only things which are quantified. The need for an open mind on the types to be counted is necessary, for example, to ensure that the fixed phrases we snatch from memory in the production of speech, and the prevailing metaphors and conceptual categories we use do not get missed as we record words and sentences. Language teachers should appreciate not just generalized frequency information but reliable generalizations on the nature, structure and probabilistic characteristics of different varieties of language. There is a need too for the pedagogical contribution of corpus linguistics, now closely associated with computers, to be clearly distinguished from computer-assisted language learning. The potential of CALL has yet to be realized and gain widespread acceptance by language teachers. Many of the
drills, exercises and other activities available are relatively trivial or unmotivating or are incompatible with contemporary teaching theories and with learning models which include the open-ended negotiation of meaning and which computers seem less well placed to meet. At present, interaction with databases by more advanced and sophisticated language learners may hold more potential. For example, it can be a moment of real insight for an advanced student using the LOB corpus in this way to find that must is frequently used in the passive in academic texts. Many teachers need persuading that corpus linguistics can make a contribution to their professional activity which is quite independent of whether or not CALL, which tends still to be a technology in search of educational applications, will aid language learning. It would appear to be an opportune time to attempt to bring together a corpus research agenda for second language teaching and a research programme on second language learning. Felix - Hahn (1985: 236), for example, argued, on the basis of a study of the acquisition of the English pronominal system by German 10-12-year-olds in classrooms, that . . . it is clear that the student's learning process cannot be manipulated at will but only within certain narrow limits . . . It appears that teaching efforts are doomed to failure when they are in conflict with naturalistic language acquisition principles.
Rutherford - Sharwood-Smith (1985: 275), on the other hand, proposed a "pedagogical grammar hypothesis" which challenges prevailing "naturalistic" models of second language learning:
Instructional strategies which draw the attention of the learner to specifically structural regularities of the language, as distinct from the message content, will under certain conditions significantly increase the rate of acquisition over and above the rate expected from learners acquiring that language under natural circumstances where attention to form may be minimal and sporadic.
There is clearly room for further research here, but if the pedagogical grammar hypothesis is supported then the role of corpus linguistics in identifying the preferred ways of putting things will be highly relevant. A research agenda, which matched the systematic effort in corpus making represented by the ICE project, would include the systematic study of lexicon, grammar and discourse variables and provide statistical information on their use in defined varieties. Mitchell (1990), for example, describes a semantic classification of 24 ways of comparing adjectives. Reliable information on the relative frequencies in use of these and other ways of making comparisons in different contexts should be available for teachers. The psycholinguistic
and sociolinguistic questions of why one way of making comparisons might be preferred over another are also worthy of research in this and other major grammatical areas. Robust statistics on the use of complex noun phrase structures, verb form use, prepositional collocations, and subordination would be of use to teachers. So too would explanatory studies which explored why, for example, the progressive tends to be so rare in written English, or why voiceless stops seem to be so prevalent as syllable boundary markers (Carterette - Jones 1974). Larger-scale, computer-assisted, developmental corpus studies are needed to throw light on the processes of language acquisition and use, and the data needs to be considered in both traditional and non-traditional ways. Examples might include the relative frequencies of different semantic notions or categories of propositional meaning, or of discourse structure for the expression of sociolinguistic meanings along with the commonest linguistic devices used to express them. Wolfson (1981) provided a striking example of how more corpus-based analysis of sociolinguistic dimensions could have significance for language teaching. Her study of almost 700 compliments found that two-thirds contained one of only five adjectives (nice, good, beautiful, pretty, great) and 85% fell into one of three basic syntactic patterns. With these, a second language learner of English would have little difficulty in producing appropriate compliments usable within almost any context. Sixty years ago the usefulness of frequency information on vocabulary use was taken for granted by English language teachers and a 30-year programme of research culminating in West's General Service List of English Words was the result. 
One of the challenges facing those who work with corpora in the 1990s is to carry out a similarly systematic and comprehensive programme of work across the whole spectrum of language and language use and to make the results of this work easily accessible.
References

Altenberg, Bengt
1987 "Causal ordering strategies in English conversation", in: J. Monaghan (ed.), Grammar in the construction of texts, 50-64. London: Pinter.
1990 "Spoken English and the dictionary", in: Jan Svartvik (ed.), 177-192.
1991 "A bibliography of publications relating to English computer corpora", in: Stig Johansson - Anna-Brita Stenström (eds.), English computer corpora. Selected papers and research guide, 355-396. Berlin: Mouton de Gruyter.
Biber, Douglas
1988 Variation across speech and writing. Cambridge: Cambridge University Press.
Bongers, H.
1947 The history and principles of vocabulary control. Woerden: Wocopi.
Brown, Roger
1973 A first language. Cambridge, Mass.: Harvard University Press.
Carroll, John B. - P. Davies - Barry Richman
1971 The American Heritage word frequency book. New York: Houghton Mifflin.
Carterette, Edward C. - Margaret H. Jones
1974 Informal speech. Berkeley: University of California Press.
Celce-Murcia, Marianne
1990 "Data-based language analysis and TESL", in: Monograph of Georgetown University Round Table on Languages and Linguistics 1990, 1-16. Washington, DC: Georgetown University Press.
Celce-Murcia, Marianne - Diane Larsen-Freeman
1983 The grammar book. Rowley, Mass.: Newbury House.
Coates, Jennifer
1983 The semantics of the modal auxiliaries. Beckenham: Croom Helm.
Coleman, Algernon
1929 The teaching of modern foreign languages in the United States. New York: Macmillan.
Dulay, Heidi C. - Marina K. Burt
1974 "Natural sequences in child second language acquisition", Language Learning 24: 37-53.
Dušková, Libuše - Věra Urbanová
1967 "A frequency count of English tenses with application to teaching English as a foreign language", Prague Studies in Mathematical Linguistics 2: 19-36.
Eaton, H. S.
1940 Semantic frequency list for English, French, German and Spanish. Chicago: Chicago University Press.
Fang Xuelan
1990 A computer-assisted analysis of the notion of causation in English. Victoria University of Wellington. [Unpublished MA thesis.]
Faucett, L. - Harold E. Palmer - Michael West - Edward L. Thorndike
1936 Interim report on vocabulary selection. London: P.S. King.
Felix, Sacha - Angela Hahn
1985 "Natural processes in classroom second language learning", Applied Linguistics 6: 223-238.
Fox, B.A.
1987 "The noun phrase accessibility hierarchy reinterpreted", Language 63, 4: 856-870.
Francis, W. Nelson - Henry Kučera
1982 Frequency analysis of English usage. Boston: Houghton Mifflin.
Fries, Charles C.
1940 American English grammar. New York: Appleton-Century.
1952 The structure of English: an introduction to the construction of English sentences. New York: Harcourt Brace & Co.
Fries, Charles C. - A. Aileen Traver
1940 English word lists. A study of their adaptability for instruction. Washington: American Council on Education.
George, H.V.
1963a Report on a verb-form frequency count. Hyderabad: Central Institute of English. Monograph 1.
1963b A verb-form frequency count: application to course design. Hyderabad: Central Institute of English. Monograph 2.
1963c "A verb-form frequency count", English Language Teaching 18, 1: 31-37.
Gougenheim, G. - R. Michéa - P. Rivenc - A. Sauvageot
1956 L'élaboration du français élémentaire. Paris: Didier.
Greenbaum, Sidney - Jan Svartvik
1990 "Publications using survey material", in: Jan Svartvik (ed.), 11-62.
Halliday, M.A.K.
1975 Learning how to mean. London: Edward Arnold.
Hieke, A.E.
1989 "Spoken language phonotactics: implications for the ESL/EFL classroom in speech production and perception", Language Sciences 11, 2: 197-213.
Hill, L.A.
1960 "The sequence of tenses with if-clauses", Language Learning 10, 3: 165-178.
Holmes, Janet
1988 "Doubt and certainty in ESL textbooks", Applied Linguistics 9, 1: 21-44.
Horn, E.
1926 A basic writing vocabulary. Iowa City: University of Iowa Monographs in Education.
Hunt, Kellogg W.
1965 Grammatical structures written at three grade levels. Champaign, Ill.: National Council of Teachers of English. Research Report 3.
Johansson, Stig
1981 "Word frequencies in different types of English texts", ICAME News 5: 1-13.
Johansson, Stig - Knut Hofland
1989 Frequency analysis of English vocabulary and grammar. Vols. 1 and 2. Oxford: Clarendon Press.
Joos, Martin
1964 The English verb: form and meaning. Madison: University of Wisconsin Press.
Kaeding, J.W.
1897 Häufigkeitswörterbuch der deutschen Sprache. Steglitz: Der Herausgeber.
Kennedy, Graeme D.
1978 "Conceptual aspects of language learning", in: J.C. Richards (ed.), Understanding second and foreign language learning, 117-133. Rowley, Mass.: Newbury House.
1987a "Quantification and the use of English: a case study of one aspect of the learner's task", Applied Linguistics 8, 3: 264-286.
1987b "Expressing temporal frequency in academic English", TESOL Quarterly 21, 1: 69-86.
1990 "Collocations: where grammar and vocabulary teaching meet", in: S. Anivan (ed.), Language teaching methodology for the nineties, 215-229. Singapore: RELC Anthology Series.
1991 "Between and through: the company they keep and the functions they serve", in: Karin Aijmer - Bengt Altenberg (eds.), English corpus linguistics: studies in honour of Jan Svartvik, 95-110. London: Longman.
Kjellmer, Göran 1984
"Some thoughts on collocational distinctiveness", in: Jan Aarts - Willem Meijs (eds.), Corpus linguistics: recent developments in the use of computer corpora in English language research, 163-171. Amsterdam: Rodopi. Krämsky, Jin 1969
"Verb form frequency in English", Brno Studies in English 8: 111-120.
Lakoff, George - Mark Johnson 1980
Metaphors we live by. Chicago: University of Chicago Press.
Ljung, Magnus 1990 A study of TEFL vocabulary. Acta Universitatis Stockholmiensis. (Stockholm Studies in English 78.) Stockholm: Almqvist & Wiksell. Long, Michael H. 1990 "Task, group and task-group interactions", in: S. Anivan (ed.), Language teaching methodology for the nineties. Singapore: RELC Anthology Series 24: 31-50. Lorge, Irving 1949 Semantic count of the 570 commonest English words. New York: Columbia University Press. Mackey, William F. 1965
Language teaching analysis. London: Longman.
MacWhinney, B. - C. Snow 1990 'The child language data exchange system", ICAMΕ Journal 14: 3-25. Master, Peter 1987
"Generic the in Scientific American", English for Specific Purposes 6: 165-186.
McCarthy, Dorothea 1954 "Language development in children", in: L. Carmichael (ed.), Manual of child psychology (2nd ed.), 492-630. New York: Wiley. Miller, George A. 1951 Language and communication. New York: McGraw-Hill. Mindt, Dieter 1989
"Prepositions in LOB and Brown", ICAME Journal 13: 67-70.
Mitchell, Keith 1990
"On comparisons in a notional grammar", Applied Linguistics 11, 1: 52-72.
Nation, I.S.P. 1990
Teaching and learning vocabulary. Rowley, Mass.: Newbury House.
Nattinger, James R. 1980
"A lexical phrase grammar for ESL", TESOL Quarterly 14, 3: 337-344.
O'Donnell, Roy C. - William J. Griffin - Raymond C. Norris 1967 Syntax of kindergarten children: a transformational analysis. Champaign, Ill.: National Council of Teachers of English. Research Report 8.
Ota, Akira 1963
Tense and aspect of present-day American English. Tokyo: Kenkyusha.
Palmer, Harold E. 1933 Second interim report on English collocations. Tokyo: Institute for Research in English Teaching.
Pawley, Andrew - Frances H. Syder 1983 "Two puzzles for linguistic theory: native-like selection and native-like fluency", in: J.C. Richards - R.W. Schmidt (eds.), Language and communication, 191-226. London: Longman.
Peters, Anne M. 1983 The units of language acquisition. Cambridge: Cambridge University Press.
Quirk, Randolph - Sidney Greenbaum - Geoffrey Leech - Jan Svartvik 1985 A comprehensive grammar of the English language. London: Longman.
Rutherford, William E. - Michael Sharwood-Smith 1985 "Consciousness-raising and universal grammar", Applied Linguistics 6: 274-282.
Salager-Meyer, F. 1990 "Metaphors in medical English prose. A comparative study with French and Spanish", English for Specific Purposes 9: 145-159.
Sinclair, John M. 1989 "Uncommonly common words", in: M.L. Tickoo (ed.), Learners' dictionaries: state of the art. Singapore: RELC Anthology Series 23: 135-152.
Sinclair, John M. (ed.) 1990 Collins Cobuild English grammar. London: Collins.
Sinclair, John M. - Antoinette J. Renouf 1987 "A lexical syllabus for language learning", in: Ronald A. Carter - Michael McCarthy (eds.), Vocabulary in language teaching. London: Longman.
Svartvik, Jan (ed.) 1990 The London-Lund Corpus of Spoken English: Description and research. Lund: Lund University Press.
Thorndike, Edward L. 1921 Teacher's word book. New York: Columbia Teachers College.
Thorndike, Edward L. - Irving Lorge 1944 A teacher's word book of 30,000 words. New York: Columbia Teachers College.
van Ek, J.A. 1976 The threshold level for modern language learning in schools. London: Longman.
Wang Sheng 1991 A corpus study of English conditionals. Victoria University of Wellington. [Unpublished MA thesis.]
West, Michael 1953 A general service list of English words. London: Longman.
Wilkins, David A. 1973 "The linguistic and situational content of the common core in a unit/credit system", Systems development in adult language learning. Strasbourg: Council of Europe.
1976 Notional syllabuses. London: Oxford University Press.
Willis, Jane - David Willis 1988 Collins Cobuild English course. London: Collins.
Wolfson, Nessa 1981 "Compliments in a cross-cultural perspective", TESOL Quarterly 15, 2: 117-124.
Comments by Göran Kjellmer
1. Preamble Graeme Kennedy's paper is an excellent historical survey of scholarly corpus work with implications, direct or indirect, for language teaching and learning, with enough detail both to whet one's appetite for the works one is not familiar with and to remind one of the works one happens to know something about. The paper is very useful as a kind of reference work in its own right.
2. On corpora and the computer After Nelson Francis's contribution there is little need to remind you that corpus studies are much older than the computer, if by "a corpus" we mean "a more or less systematic collection of language material brought together for purposes of linguistic study". Let me just add that corpus studies have been particularly natural in classical languages, where comprehensiveness and even exhaustiveness, "total accountability", have been possible to achieve.1 (This is also true of languages like Old English, as evidenced by the Toronto Corpus.) Although it is inaccurate to say that corpus linguistics represents an unbroken tradition dating back many centuries in time, as the Chomsky school has demonstrated and argued so eloquently, a powerful tradition it has been, nevertheless. For instance, the Great Grammarians of the pre-computational age - Jespersen, Poutsma, Kruisinga - all worked with corpora. Corpus work should therefore not be seen as exclusively associated with the computer. The computer makes previously unthinkable research projects possible, adds rigour to the methodology, invites new approaches, but most of the basic principles remain unchanged.
3. Deciding what to teach and learn: on "different" vs. "typical" A great many of the works reviewed in Kennedy's paper bear on the methods and material of second-language teaching and learning. The perspective adopted could be represented by the following quotation: Learners' time and effort would be better spent acquiring vocabulary or pragmatic elements which are rarely part of curricula. Regrettably, this point has not prevented learners of English often being subjected to a pedagogy based on a descriptive grammar or perhaps worse, a comparative grammar where differences are magnified, and idiosyncrasies frequently receive as much attention as the typical. (Kennedy p. 348)
This, I would suggest, is only one side of a complex question. When selecting, in second-language teaching, what elements to teach in which order, it is necessary, I believe, to strike a judicious balance between (a) the elements the learner is likely to mistreat because they are different, maybe insidiously different, from those in his native language, and (b) the elements he is likely to meet and to require. Information on (a) is supplied by contrastive work, corpus or non-corpus (see below), information on (b) is supplied by corpus work. Overemphasis on either an (a) strategy or a (b) strategy can have non-productive or unwanted results.
Consider (a) first. It is evident that elements that are structurally different in the student's native language and his target language (L1 and L2) are more difficult to learn and more likely to cause harm than those which are not and that they hence merit more attention from teacher and student. But it is obviously ill-advised, in teaching L2 to beginners, to attend almost exclusively to such features, for example to teach Swedish beginners English progressive verb forms to the exclusion of the simple present or the simple past, as is sometimes done in Swedish schools, the English progressive having no direct equivalent in Swedish. (According to Table 1, the progressive represents a tiny proportion [about 5%] of English finite verb forms.) Here, indeed, the learners' language is likely to "show personal usage-habits which may be 'correct' in so far as they represent features which could be used by native speakers, but which cause these features to figure disproportionately" (p. 349, quoted from H.V. George).
But consider now the other side of the coin, (b). Computer-based corpus studies have presented us with a more representative and reliable picture of today's language than was ever possible before.
It is obvious that the goal we want to achieve in second-language teaching is to make the learner's second language conform as closely as possible to that picture. However,
it is much less obvious that the best and most economical way to achieve this is to provide him with teaching material congruent in form with representative second-language corpora without taking account of the properties of his first language. If, for example, the verb systems of his first and second language, L1 and L2, show the same statistical characteristics except in one respect where they differ radically, it would be wasteful to devote as much time and effort to every aspect of the verb system of L2; it is arguably profitable educationally to concentrate on the point where L1 and L2 differ at the expense of the other aspects. Similarly, it would be uneconomical to spend a great deal of time on noun-phrase structure when teaching English to Swedish beginners, as Swedish differs little from English in that respect, even though noun-phrase structure is of course a key phenomenon in English sentence building. One should not thus take it for granted that any features characteristic of a representative corpus ought necessarily to be integrated into a learning programme. Kennedy himself suggests that "a finding that children's sentences get longer as they get older does not necessarily have a pedagogical application" (p. 367). And even a moderately extensive corpus of spontaneous speech is sure to contain a certain proportion of anacolutha. To insist on the same proportion of anacolutha in learning materials seems to me little short of perverse. The conclusion to be drawn from the preceding argument, then, seems to be that a compromise between an emphasis-on-difference approach and an emphasis-on-typicality approach would be the most beneficial in the classroom.
Ideally, the balance between (a) and (b) involves finding and making use of tendencies (frequencies, proportions, distributions) related to restricted groups of lexically as well as grammatically defined items (words, collocations), even to individual items, and to different domains of language2 in both L1 and L2 so as to relate, compare and contrast the results. So contrastive corpus studies (L1 vs. L2) should be useful for teaching and learning purposes. Several contrastive corpus studies have been made, as Kennedy demonstrates in his paper, but it is also evident that a great deal of work remains to be done in this area. From the point of view of second-language teaching and learning, this seems to me to be one of the most promising fields of corpus linguistics.
4. On structures in teaching There are scholars who argue, to my mind convincingly, that mental constructs like patterns and systems have an important educational value in second-language teaching as well as in other fields.3 Thus, they claim, second-language students need to be shown the structure/system of a grammatical sphere, for example the verbal system, at least in broad outline, even if some slots are rarely filled. (Some sort of system can surely be assumed to exist in the mind of even the least sophisticated native speaker.) Note that "structural regularities"3 could not be equated with "statistical regularities". That is, a learner may well be helped by being shown the language as a system, where the constituents, by being part of a system, are of equal importance, although the constituents may be very unequal statistically. Withholding such structural information from the learner could therefore deprive him of a useful pedagogical aid. This is of course not to say that all the components of the system should receive an equal amount of attention in the language learning materials (texts, exercises), only that the student should have some idea of the lie of the land before he starts walking into it. The teaching routine that suggests itself could then be: first acquaintance, however superficial, with the system, then information on statistical properties, including information on variation in different registers. It is in the last stage that "the preferred ways of putting things" become relevant.
5. Syllabus selection: on "natural texts" and typicality Selecting texts for language study normally involves a subjective element. This may give rise to some concern that they are not sufficiently representative of the target language. As Kennedy puts it: Texts selected without awareness of how typically they represent salient features of the language can present a chaotic picture of the language, while invented examples can present a distorted version of typicality or an over-tidy picture of the system. (p. 366)
Without much exaggeration, however, it could be said that any natural text of limited length presents a chaotic picture of the language, in the sense that some features will be overrepresented and some underrepresented; it is only the sum total of a great many texts that could be regarded as representative. If we do not want the artificiality of "an over-tidy picture of the system", to be found in invented examples, we shall have to accept a mild version of
linguistic chaos in early texts, confident that representativeness will become progressively greater as the body of texts studied increases.
6. Conclusion The general question raised by Graeme Kennedy's paper is this: to what extent are corpus analyses relevant for second-language teaching? Although it is still possible to debate how they should influence the ways and means of language teaching - corpus statistics need not, in my view, be reflected in pedagogical materials and methods in every detail - there is no question but that corpus analyses of the type described in Kennedy's fine paper can contribute significantly to defining the goal to which language pedagogy should aspire.
Notes
1. Cf. well-known titles such as Lexicon totius latinitatis (by Egidio Forcellini, published posthumously in 1771, "il repertorio fondamentale della lessicografia latina" [Enciclopedia Italiana]) and Glossarium mediae et infimae latinitatis (by Du Cange 1883-1887, "dictionnaire complet de la basse latinité" [L. Favre, editor]).
2. ". . . while there is a definite association of particular verb-form use with particular varieties of English, within each variety there is considerable stability of use" (Kennedy p. 348).
3. Cf. the quotation from Rutherford - Sharwood-Smith 1985 (Kennedy p. 368).
The automatic analysis of corpora John M. Sinclair
1. Introduction I think we can look forward to some very exciting years and decades in linguistics, as we begin to harness the information that we can retrieve from text corpora. The advent of computers has improved the quality of many scientific disciplines in recent years, but in none of them is the effect so profound as it will be in the study of language. For linguistics will see quite new methodologies and argumentations, and the relationship between speculation and fact will alter sharply. This is because it is now possible to check statements about language. There are already working procedures which deal with statements concerning features of language which lie on or near the surface, and we can look forward to more sophisticated analytical routines in the coming years. My aim in this paper is to contribute towards a strategy for planning and designing the next generation of analytic software. I feel that an explicit strategy is necessary because of the current state of this young subject, corpus linguistics. From being a fringe activity with a few devotees and corpora of very modest size, it is accelerating towards intellectual orbit, and is the object of attention of many policy makers. A few years ago, corpora were virtually unknown in the machine analysis of language, and its applications, such as machine translation. Now they are the flavour of the decade.
2. Purposes We do things for a purpose. For what kinds of purpose do we analyse corpora? I would like to distinguish first between specific and general purposes. Specific purposes are those where the end result is the only important issue. A client is identified who needs some device or facility that, for its proper functioning, depends on the analysis of corpora. The client is only interested in the performance of the device, and does not care about its internal architecture, its design or the theoretical basis of its operation.
A spelling checker is a familiar example of a device that fulfils a specific purpose. A corpus can be used to provide a word list and frequency figures that will increase the efficiency of the spelling checker. Nobody cares how it is done, so long as it works. It is reasonable to suppose that in the next few years analysed corpora will be in great demand for specific purposes of this kind, and the academic community will learn a lot from the experience of meeting the requirement of a wide range of clients. General purposes are those where the task is so complex that we have to rely on the application of linguistic principles. It is so unlikely that a combination of ingenious strategies, involving a corpus or not, will produce an acceptable performance, that we prefer to study the processes thoroughly and try to understand them in their own terms before attempting automation. One such general purpose is machine translation. The clients for this service have been so wealthy and demanding that generations of linguists have been lured into treating it as a specific task. This mistake has led to a succession of unfortunate results. It seems at last to be generally agreed that we need to know a lot about the details of the structure of languages before we can translate them by machine. The intensive, and very general study of large corpora is likely to be the next significant step. Another general purpose is the understanding of language by machine. The goal of being able to chat to your computer is still a long way off, but it is now possible to discern how it might be achieved; and the goal is so important that pressure from clients will grow over the next decade, and no doubt tempt the unwary again. Of the two main components, the recognition of speech is the focus of a great deal of research, and the automatic analysis of dictionaries is opening up the possibility of associating a text with its meaning. There is a third set of purposes for which corpora are analysed. 
Language is an object of study for its own sake, and there are plenty of enquiries to be undertaken which do not have to be justified by the existence of hungry clients. They may have difficulty finding sponsors, but that is a practical matter; sometimes they can be dimly discerned as components of specific applications. Close to this third set of purposes is a fourth one, which is easily confused with it. That is the application of pre-existing descriptions to corpora, to check and improve them. A big corpus is an attractive and demanding testbed; weaknesses in logic and lack of clarity in the criteria of a description will quickly be found out. The process of developing the analytical system to prove itself on the corpus will tell researchers quite a lot about the analytical
system, because it will be made clear wherever it does not match the corpus evidence. It happens that most, if not all, of the systems devised without a prior study of corpora fail quite significantly to match the corpus and the mismatch has to be made good by human intervention. This is not unexpected, but it has considerable implications. One is that the reasons for failure might include the fact that the system was originally devised without the benefit of corpus evidence. From the experience of corpus analysis that we have built up so far, I would be virtually certain that this was the case; but the richness of patterning in natural language is such that this causal connection may never be identified. The activity is harmless in itself, but it is dangerous in policy terms. A large and influential group of scholars have fairly suddenly realised that they should have been promoting corpus work and using corpora for many years. They must now transfer to corpus-centred work or change their jobs more fundamentally. But the only expertise they can bring to corpus work is the failed techniques of pre-corpus modelling. The danger is then that our current policy may be heavily influenced by scholars who have little or no experience of corpus work, and whose methods are likely to insulate them from the experience of corpora. There is no substitute for experience of corpus work: the results of even simple questions cannot be predicted, and the retrieval of linguistic facts is a humbling experience. My plea is that we devise methods of analysis that prioritise information about the language that we can derive from the corpus, and not the vindication of models.
3. Guidelines As we go along, I should like to pick out some statements that may form guidelines for software standards. One such is • Analysis should be restricted to what the machine can do without human checking, or intervention. Of course, I am in no position to dictate what others should or should not do. The spectacle of serried ranks of people (actually referred to as "slaves") correcting the errors of an inadequate analysis, is already with us. My argument is for the prioritisation of fully automated analysis. If we do not give this priority, we won't ever get it.
As the size of corpora moves into the hundreds of millions, the futility of reliance on human intervention becomes clearer, although it can be temporarily obscured by throwing money at it. But now the first "monitor corpus" is being built, and that makes the manual tidying process quite impossible. A monitor corpus consists of a huge stream of language in motion (Clear 1987). Of no finite size, it flows across a set of filters, which extract relevant linguistic evidence, and then the text is discarded. The challenge of the monitor corpus is to automate retrieval software. The futility of mixing automatic and manual analysis can be shown by considering two of the enduring ambiguities found in the machine parsing of English. One is the occurrence of a prepositional phrase following a noun. . . . may be able to use any loss on the deal to offset capital gains tax on other transactions . . . which would present the UK company with a formidable new competitor... (Guardian, 19.3.91) There is no grammatical way of showing that on the deal and on other transactions define loss and tax respectively, and that with a formidable new competitor does not define company, but is an adjunct following on present. Hints can be given by a valency grammar, especially for the complementation of present, but it could not be decisive. Lexicostatistical studies could help the interpretation of particular instances, but again could not be decisive, and in these cases could not be comprehensive either. Perhaps English is trying to tell us something. The assignment of a prepositional phrase following a noun may not ultimately be a precise matter, but one for fairly loose interpretation. In the first two instances above the prepositional phrase is pretty redundant and unlikely to require critical interpretation. The lurking ambiguity of the infinitive is more serious and the to/in order to overlap is another such case where the language is neutral. 
In the third example the two virtually antonymic meanings of present with will cause much more trouble than the possibility of with not being associated with present.
The other unresolvable ambiguity is the branching co-ordinate structure.
New owners likely for Grattan and Empire Stores . . . (Guardian, 19.3.91)
Is the phrase Grattan Stores well-formed? Can Empire be used without Stores? The sentence gives no further clues, and it would be difficult to justify the expenditure of much effort to sort it out. Again the preferred policy is to leave the matter specifically unresolved. (Subsequent text shows that Grattan is only used on its own, and that Empire Stores is used for first mention, thereafter Empire.) There is no pre-existing requirement on parsers to take up a definite position on a question couched in conventional descriptive terms. In particular there is no need to be more definite than native speakers have to be. Possibly in the future strategies will be evolved which will be more reliable; until then the indeterminacy of the present state of our grammars should be emphasised, and not camouflaged. In everyday use, speakers and writers of English find very few instances of ambiguity. If it is proving difficult for machines, the reason might be that the machines do not have access to the same language model as the ordinary speaker of English. For example, the machines may be required to make definite decisions about structures, sentence by sentence. In contrast the ordinary user may be keeping a number of provisional points in an undecided state: both what exactly the other person said, and whether it is important to find out.
4. Parsing Parsing is a way of interpreting text. To parse, one replaces a string of characters with a category label. The justification of this activity is that it makes a contribution to the elucidation of the meaning of the text; it makes explicit a set of relationships which are retrieved by competent users of the language. Parsing is specific to each text being parsed; it is an application of generalities to a unique set of circumstances. It is assessed for its fidelity to such things as the intuitive impressions of native speakers. The status of parsing is particularly important because it is one of the rigorous connections between form and meaning. There are, however, no preconditions to parsing. There is no such thing as a complete parse, and there are no accepted standards so far in corpus-based analysis. Parsers used to be assessed by their ability to respond to a series of bizarre and contrived sentences, of a kind which would be only infinitesimally likely to occur. Text study shows this to be irrelevant, but no replacement has yet been offered. Parsing is also a very complex business, requiring the manipulation of thousands of rules in intricate relationships. It may be expedient, as we move to absorb the evidence of long texts, to lower the priority of an all-purpose
comprehensive parser and consider in its place a number of simple operations. Each one would contribute to an eventual parse strategy, but in the meantime each can give some valuable information for research and for limited applications. An example of a simplified parser is a "boundary marker". This device examines each word space and the words on either side. It attempts to discover whether the space also marks the boundary of a more extensive unit - a group, phrase or clause. Punctuation marks in written text assist the process. An efficient boundary marker can have a lot of independent applications, and in combination with other tools it can play an important role in more complex parsing. It may use the same evidence that other tools use, but will deploy the evidence in a unique way. One advantage of a staged approach to parsing is that the same basic facts of the presence and position of word-forms can be put to use in a number of different but complementary strategies. For some time now I have been calling these "partial parsers". Each concentrates on achieving a good success rate in one specific task. The boundary marker performs a simple but generally useful operation. A partial parser which performs a specialised task is the Technical Term Finder of Yang Hui Zhong (1986). As Yang shows, its results are encouraging, when applied to scientific text. It will actually find the nearest things to technical terms in any text, but reports a very small incidence of them in literary fiction. There are no doubt a number of examples of partial parsers to be found. They are software routines which use linguistic evidence in texts to identify incidences of general categories, the aim being to elucidate the meaning of the texts. The approach to automatic analysis advocated here, that of partial parsing, is consistent with another guideline for software standards. • Analysis should be done in real time. 
There are two current approaches to the supply of analysed text in machine readable form. One is to analyse the text in advance of making it available, when time is not critical. Some parsing routines demand this, because they are not designed to operate in real time. The other approach is to hold the text in raw form and analyse it fresh each time some analysis is required; this approach is highly time sensitive. The second approach is preferable on a number of grounds. One is that, with a suite of software tools, great flexibility can be offered. It has been a feature of a corpus resource that each new request is different from previous ones, and a flexible service is essential. Another advantage is that the service can be extended without having to be redone from the start. It also encourages
hierarchical, parallel, and single-pass processing, all of which are necessary for conformity to a third guideline • Operations should be designed to cope with unlimited quantities of text material. The whole notion of batching and of finite corpora will fall away as the full impact of reusable language data is felt. It will simply become impractical to store analysed or annotated text. This position does not preclude special applications which require, for example, an annotated reference corpus. Such things can be provided, but it is a fair prediction that they will become less popular when the tools for handling monitor corpora become available.
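The boundary marker described earlier in this section, together with the guidelines on real-time and unlimited-quantity processing, can be illustrated with a minimal sketch. It is not a reconstruction of any existing tool: the cue lists CLOSING_PUNCT and OPENERS, the tokenising pattern and the function names are all assumptions made for illustration. The routine looks at each word space and the word on either side, reports whether the space is a candidate boundary of a larger unit, and keeps only one token in memory, so that the text can be discarded as it flows past.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical cue lists: a real boundary marker would build these from corpus evidence.
CLOSING_PUNCT = {",", ";", ":", ".", "?", "!"}
OPENERS = {"the", "a", "an", "and", "but", "which", "that", "on", "with", "to"}

# Words (allowing internal hyphens and apostrophes) or single punctuation marks.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokens(lines: Iterable[str]) -> Iterator[str]:
    """Turn a stream of text lines into a stream of tokens."""
    for line in lines:
        yield from TOKEN.findall(line)


def mark_boundaries(lines: Iterable[str]) -> Iterator[Tuple[str, str, bool]]:
    """Yield (left, right, is_boundary) for every word space, in a single pass.

    A space counts as a candidate boundary if the token on its left is a
    punctuation mark or the token on its right is one of the opener words.
    Only the previous token is remembered, so the input may be unbounded.
    """
    prev = None
    for tok in tokens(lines):
        if prev is not None:
            yield prev, tok, (prev in CLOSING_PUNCT or tok.lower() in OPENERS)
        prev = tok
```

Because mark_boundaries is a generator over any iterable of lines, it can be fed from a file, a pipe or a live text feed without change; the cue lists are the only part that would need linguistic justification.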
5. Taggers All parsers are designed to elucidate in some way actual instances of text, and specifically new text, text which has not been parsed before nor used in the construction of the parser. Some other software tools operate in the reverse direction. They extract information from text, and do not contribute directly to the elucidation of its meaning. One such group of tools is those that assign word class tags to the wordforms in a text. Tagging can be seen as a stage towards parsing, but the information it provides is not even a partial parse. Also taggers have uses which are quite independent of parsing. They have an important role in lexicography, where word class and meaning are often associated; and they can provide statistical information for the classification of texts. What taggers do not do is contribute to the understanding of a text. Each word form in turn is seen as a variable within a fixed context, so the routine of assigning tags is a movement from the text to the tag. The text is never seen as an entity with a meaning, but only as a repository of contexts. Like parsers, however, there are no pre-set conditions for taggers. There is no agreed tag set for any language, and taggers vary considerably in their aims and strategies. Most are extensions and adaptations of the traditional word classes, noun, verb, etc. but there is in fact no necessity for them to take over categories that are not machine-validated. The function of a tagger is to reduce the information in a text through a classification process. In the (made-up) example the cat sat on the mat
a typical tagging operation on this example would replace cat by N and mat also by N. The operation thus loses the information that cat and mat are different words, and gains the information that they share the feature "nounness". If there are no preconditions, how do we choose which tags we want, and how do we define them? The answer must be that we build up criteria by evaluating textual evidence through intuitive expectations. We cannot rely on textual evidence alone, because that does not constitute an explanation of anything. We cannot, however, ignore the fact that textual evidence often challenges our intuitions, and must not be suppressed. This position suggests that flexibility is a good policy in constructing a tagger. A language like English has a large amount of word-class overlap, so that very few English words are immediately classifiable. In a lot of cases, the doubt can be resolved by a simple exploration of the environment, but with our present strategies there will be a substantial leftover. From experience so far, it appears likely that there will be a lot of indeterminateness in automatic analysis. The problem is different from the so-called "syntactic ambiguities" which I discussed earlier, where it is not expected that descriptive criteria will contribute to a more specific allocation of relationships in the foreseeable future. In this kind of analysis, there are some instances which give enough clues to be definitely classified and others for which the evidence is insufficient. It is advantageous to put doubtful cases into a more abstract category, which is arranged to be prior in application to the distinction that cannot be resolved. This gives rise to a hierarchy of tags on a scale of uncertainty, which is also, in linguistic terms, a scale of delicacy. Thus, in the text that I have been using for examples, there are the following words ending in -ing.
acquiring    increasing    meeting      selling
backing      leaving       morning      shopping
handling     looming       passing      string
having       making        retailing    taking
A look-up routine would discard string and morning but would hold fire on evening until an examination of the phrasal environment last Friday evening made the meaning clear. The criterion of being immediately preceded by an article identifies backing, retailing, meeting as components of noun groups; meeting can be shown to be the head of its simple group because of the clausal environment; and the dictionary confirms its regular use as a countable noun:
The automatic analysis of corpora
. . . spent yesterday in a meeting and issued no response . . .

Backing is also classified as a noun because it is immediately followed by of. In the case of retailing its position between the and giant shows it to be a modifier. The other instance of retailing is a little more problematic.

But retailing industry analyst Paul Morris . . .

Shopping occurs with a noun on either side and counts as a modifier also. If the verb to be occurs immediately in front we may strongly expect the present continuous tense, as in

. . . it would be making a bid . . .
. . . the market is passing into foreign hands . . . , it will not be increasing its offer

After a comma, we may normally expect a present participle, as in

. . . , leaving the way clear . . .
. . . , handling the sale of . . .

A tagger with this kind of strategy could allocate -ing forms as in Figure 1.
-ing
    not a morpheme: string, morning, evening
    a morpheme
        verbal
            continuous tense: making etc.
            present participle: leaving
        nominal
            modifier: retailing
            head: backing etc.
Figure 1.
In the other instance of retailing the -ing form has to remain simply as "a morpheme", i.e. the tagger recognises that -ing is a live morpheme, but cannot be sure whether it is verbal or nominal.
The instances where the -ing form is a participle preceded by a preposition/conjunction such as with or by are less easy to find because the two words need not be together.

. . . by selling it to the French . . . with a formidable new competitor acquiring 18 per cent of . . .
. . . by taking over two companies . . .

Acquiring may have to stay unresolved at the first level of discrimination, for the time being. We are left with the following:

(a) The problem is one of a relatively small company like Empire not having the backing of a more powerful organisation . . .
Coming between not and the is probably sufficient evidence to identify this having as a participle.

(b) a looming cash crisis
The twin criteria of following an article and preceding one or more nouns are sufficient to locate this instance as a modifier. Intuition, however, tells us that it is not the same kind of modifier as retailing, shopping. A look-up routine will confirm that looming is never a noun, and we can add a step in delicacy to the tagger (see Figure 2).
nominal
    adjective: looming
    noun: shopping
Figure 2.
(c) . . . , even taking into account the very . . .
The look-up should identify taking into account as a compound preposition, thus adding a third option at the second level of discrimination (Figure 3).
morpheme
    verbal
    nominal
    prepositional
Figure 3.
Let us then match a tag set onto our diagram (Figure 4):
(entry condition: words in -ing)
    (REJECT)
    ING
        VING
            PROG
            PL
        N-ING
            MOD
                ADJ
                N-MOD
            N
        PREP
Figure 4.
At any point, if a decision cannot be reliably taken, it is not taken. For example if the tagger cannot decide between ADJ and N-MOD it can still be sure of at least MOD; and so on. The user is thus often given three answers to a binary choice, the third being "can't decide". For most purposes involving a large corpus, this third list can be safely ignored, though if it is of any substantial size or particular constituency then it may constitute the basis of a case for improving the tagger.

This illustration of tagging takes a lot for granted, but it does show how the text can help to shape the tagger. From the study of the "can't decide" lists can come insights into language which further improve the tagger of the future. We can now propose another general guideline for software:

• Software will be designed to operate at more than one level of discrimination, so as to bypass doubtful decisions.
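The back-off behaviour described in this section can be illustrated with a minimal sketch in Python. It is not the software discussed in the paper: the function name tag_ing, the small word lists and the optional nouns look-up are assumptions made for illustration, and the rules are only the simple environment tests mentioned above (a preceding form of to be, a preceding comma, a preceding article, a following of). Each tag is returned as a path through the hierarchy, so an undecided case is a shorter path rather than a guess.

```python
ARTICLES = {"a", "an", "the"}
BE_FORMS = {"am", "is", "are", "was", "were", "be", "been", "being"}
NOT_MORPHEME = {"string", "morning", "evening"}  # toy look-up list (assumed)


def tag_ing(tokens, i, nouns=frozenset()):
    """Return the most delicate tag path the evidence supports for tokens[i].

    `nouns` is a hypothetical look-up of known nouns; it stands in for a
    dictionary and is empty by default.
    """
    word = tokens[i].lower()
    if not word.endswith("ing"):
        return None                      # outside the entry condition
    if word in NOT_MORPHEME:
        return ("not a morpheme",)
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    if prev in BE_FORMS:
        return ("morpheme", "verbal", "continuous tense")
    if prev == ",":
        return ("morpheme", "verbal", "present participle")
    if prev in ARTICLES:
        if nxt == "of":
            return ("morpheme", "nominal", "head")
        if nxt in nouns:
            return ("morpheme", "nominal", "modifier")
        return ("morpheme", "nominal")   # back off one level
    return ("morpheme",)                 # "can't decide" beyond the first level


if __name__ == "__main__":
    print(tag_ing("it would be making a bid".split(), 3))
    print(tag_ing("But retailing industry analyst".split(), 1))
```

The second call returns only the first level of the path, which corresponds to leaving the problematic instance of retailing as simply "a morpheme".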
There is another, more subtle, moral to be drawn from this illustration, but it can only be mentioned here. One criticism of my analysis of -ing forms above could be that it just reflects traditional categories; it finds what it looks for, and it may miss important regularities in the language. The original assumption is that it is sensible to examine -ing forms in isolation from adjectives, for example. It is necessary to mount a regular critique of any analytic software, to explore its limitations, check its results and search for improvements. In time it should be largely freed from its origins in human intuitions and the tradition of grammars - retaining, presumably, a great deal of what it began with but restating and reinterpreting much of it. At some point in the future linguists might have enough confidence to devise self-organising models for some of the analytical tasks, and then there would be another set of insights into the nature of language.
6. Lexical tools

In discussing the properties of taggers and parsers we have been concentrating on grammar. There is a parallel range of lexical tools, which need to be developed, but which are in a much more primitive state.

The basis of lexical patterning is the tendency of words to occur in the vicinity of each other to an extent that is not predicted by chance (Firth 1957). This tendency contributes substantially to the redundancy of language, and so we can make an assumption that it is meaningful. Current work begs a lot of questions - what unit is to be counted, and what constitutes the optimal environment being just two fundamental ones that I shall return to.

The nature of language text compounds the difficulty. Zipf's Law (1935) makes it clear that the vast majority of words occur only infrequently; hence the necessity for corpora in the tens and hundreds of millions of words.

The first stage of lexical analysis is parallel to grammatical tagging: the unit is the word-form and the objective is to determine the collocations of that form, including its contribution to compounds, "fixed" phrases and the like. This tool is called a collocator.

The word-form may not be the best unit for lexical patterning. For example it is conventional to think of meaning as constant across different inflected forms of a word; in such cases the inflections could be conflated together
into lemmas and a lemmatiser used to do the job. The collocational analysis would be simplified a lot.

It has been recognised for many years that lemmatising is not a straightforward matter. There are lots of ambiguous forms like leaves, and tricky decisions like actor, actress. However, the problem is further compounded because the basic assumption - of consistency of meaning across inflection - is not by any means always justified. An adequate approach to lemmatising will develop criteria for judging whether the collocational patterns of two word forms are sufficiently close to justify putting them in the same lemma.

There is another reason why a word-form may not be the best unit for describing collocational patterns. Wherever it appears that more than one form contributes to a single choice of meaning, we can expect the collocational evidence to provide confirmation. That is, the two (or more) word forms involved will have a marked mutual expectancy, and the lexical environment of the two of them together will be different from either of them separately. For example cream does not collocate with cone unless it is immediately preceded by ice. There is a need to develop a software tool for identifying compound lexical items - a compounder. Some work has already been done in this field (Yang 1986; Osborne forthcoming) but a great deal more is needed.

A third reason why a word-form may not be the best unit is that one word may have many meanings, and collocational choices will reflect the meanings. In fact, the collocations can be used to disambiguate. By studying collocational consistency, we can develop criteria for differentiating meaning. Collocational consistency is the intercollocation of collocates. If one word collocates strongly with two others, and yet the other two do not collocate, then the original word is likely to have two different meanings. However if the other two do collocate, then it is likely that the meaning of the original word does not change.
This can readily be shown in examples (Sinclair 1991) and prototype disambiguator software exists. In all of the arguments so far about lexis we have assumed that there is a very strong and detailed correlation between the meaning of a lexical item and the environments in which it is found. Over the last decade researchers have become more and more confident about this correlation, and feel that some of the persistent dichotomies in language description - form and meaning, form and function, meaning and function - may be capable of integration, through the study of corpus evidence. The key to advances in this field is the size and construction of the corpus. The underlying regular correlations are obscured because each instance of
language in use serves a unique communicative purpose, and is not necessarily typical of anything but itself. However if many instances are accumulated, each sharing a common feature such as a word or a phrase, then gradually the individual features are obscured in turn by the general patterns and trends. Eccentric instances end up having little or no effect upon the overall pattern.

The lexical tools discussed so far are all of the "tagger" type rather than the "parser". They operate on texts in order to discover information about the organisation of the language as a whole. We need to develop a parallel set of tools which will return the insights thus gained to the language in use to interpret the lexical meaning of new texts, just as a parser interprets the grammatical meaning of new texts. The procedure is reversed; starting with the generalisations that have been made by the previous analytical work, the computer attempts to find clear instances of these generalisations in action; there will be lots of unclear instances and so, following the guidelines, it is designed to back off from a very specific identification when the evidence is not good enough; it has strategies for coping with unexpected events and it learns from the experience of individual texts.

Parallel to a collocator we need a lexical parser; alongside a compounder we need a phrase finder; reversing a disambiguator gives you an exemplifier. The lexical parser will bring out the collocations in a text instance and relate them to the meaning of the words involved. The phrase finder will bring out the mutually conditioned choices and assess which may be proposed as separate lexical items. The exemplifier will evaluate the instances and assign them to meanings. As with partial parsers, the domains of these software tools overlap, the evidence they use is shared, and they offer prospects of mutual support.
Thus for example the output from the phrase finder can be used to improve the lexical parse, and clear the ground for the exemplifier.
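The notion of collocational consistency introduced in this section (the intercollocation of collocates) can be sketched in a few lines of Python. This is an illustrative assumption, not the prototype disambiguator mentioned above: the function names, the four-word window and the raw frequency threshold min_count are all hypothetical, and a real collocator would test co-occurrence against chance expectation rather than use a bare count. The toy corpus is made up.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=4):
    """Count unordered pairs of different words occurring within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    counts[frozenset((w, v))] += 1
    return counts


def collocates(counts, node, min_count=2):
    """Words that co-occur with `node` at least `min_count` times."""
    found = set()
    for pair, n in counts.items():
        if node in pair and n >= min_count:
            (other,) = pair - {node}
            found.add(other)
    return found


def consistency_groups(counts, node, min_count=2):
    """Group the collocates of `node`: collocates that also collocate with
    each other fall into the same group. More than one group suggests that
    `node` has more than one meaning."""
    remaining = collocates(counts, node, min_count)
    groups = []
    while remaining:
        group = {remaining.pop()}
        frontier = list(group)
        while frontier:
            w = frontier.pop()
            linked = {v for v in remaining
                      if counts[frozenset((w, v))] >= min_count}
            remaining -= linked
            group |= linked
            frontier.extend(linked)
        groups.append(group)
    return groups


# Made-up toy corpus: "bank" in two different environments.
corpus = [
    "river bank water".split(),
    "water river bank".split(),
    "bank money loan".split(),
    "loan money bank".split(),
]
counts = cooccurrence_counts(corpus)
groups = consistency_groups(counts, "bank")
```

In this toy corpus river and water collocate with each other, as do money and loan, but the two pairs never meet, so the collocates of bank fall into two groups.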
7. Lexicogrammar

So far in this paper I have kept grammatical and lexical matters separate, although all the evidence suggests that they must be co-ordinated. For example, the study of collocation brings out the notion of a lexical set, which is a kind of lexical paradigm. By identifying and building up these sets, we can gain further generalisations. Simple and obvious sets are days of the week, time intervals, reporting verbs, and colours. It is not expected that a collocate
of the set will collocate with every one of its members - and at that point we slip from lexis into grammar. Some adaptation of the strategy of the disambiguator - the study of collocational consistency - should contribute to the design of a setter. The parallel tool to apply the results of a setter to the interpretation of texts would be a lexicogrammatical analyser capable of using both lexical and grammatical evidence simultaneously.

In a recent grammar (Sinclair - Fox et al. 1990) there are promising indications of the likely convergence of some grammatical classes and lexical sets. For example, on page 414 there is a list of nouns that operate in a prefacing structure. When you want to comment on a fact, event, or situation, you can use a two-clause structure as follows:

First clause (comment):
    subject:      it
    verb:         be
    complement:   a + noun
Second clause:
    conjunction:  that (optional)

Example: it's a shame he didn't come

Here is a list of the nouns used as complements in this preface:

disgrace    pity        wonder
marvel      shame
nuisance    surprise
The lists are indicative rather than exhaustive in this grammar, but there is a clear semantic linkage among members of this list. The words are loosely related in Roget's Thesaurus; shame and disgrace appear together in 867 and all three of marvel, surprise and wonder are nearby in 864. Nuisance is not very near in 827 and it seems as if the relevant meaning has been overlooked, while pity heads paragraph 905. However these categories largely deal with things like disrepute and painfulness, and the kind of meaning associated with these words in this structure cuts across the classification of the thesaurus.

It is not possible yet to say whether or not this list constitutes a lexical set, or the nucleus of one. But the emergence of more and more such lists, as the grammar becomes more detailed, is encouraging. The general tendency is that the more carefully one specifies the grammatical environment, the more the resultant classes gain lexical coherence.

This line of argument allows the conventional domain of collocation to be extended well beyond the statistical behaviour of individual words. The commonest words in the language show co-occurrence patterns, and grammatical patterns can be considered as lexical in the same way as vocabulary words can be seen as realisations of grammatical items in a grammar (Renouf - Sinclair 1991).
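As a rough illustration of how a closely specified grammatical environment yields a candidate lexical set, the prefacing structure discussed in this section ("it" + be + "a" + noun, optionally followed by "that") can be searched for with a regular expression. The following Python sketch is an assumption for illustration only: the pattern FRAME, the function frame_fillers and the sample sentences are invented here, only three forms of be are covered, and the word after the article is collected without checking that it is a noun, so a real tool would need a tagger or a dictionary look-up as well.

```python
import re
from collections import Counter

# "it" + a form of "be" + indefinite article + the next word (assumed noun).
FRAME = re.compile(r"\bit(?:'s|\s+is|\s+was)\s+an?\s+(\w+)", re.IGNORECASE)


def frame_fillers(sentences):
    """Count the words that fill the complement slot of the frame."""
    fillers = Counter()
    for sentence in sentences:
        for match in FRAME.finditer(sentence):
            fillers[match.group(1).lower()] += 1
    return fillers


# Made-up sample sentences.
sample = [
    "It's a shame he didn't come.",
    "It is a pity that they left early.",
    "It was a surprise that she won.",
    "It's a shame that it rained.",
]
fillers = frame_fillers(sample)
```

Run over a large corpus, a count of this kind would give a first, noisy approximation to the list of complement nouns, which could then be compared with the list in the grammar.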
8. Other software tools

There are some further software tools that are at least in the design stage, and for which a growing need can be discerned. In terms of sophistication, they will build on the combined results of the lexicogrammatical analysis that is set out above.

One package will lie between lexis and grammar, and will be a classifier. Given any two stretches of text that purport to be instances of the same event, the classifier will list the events that are instantiated on the basis of common ground existing between the two texts. The program will be reversible, so that it will also provide grounds for differentiation.

Another valuable tool will do similar jobs on texts as a whole. It will seek internal - linguistic - evidence on which texts can be classified. It is clear from the work of Biber (1988) that patterns of language form can be measured and used to classify texts according to the linguistic choices that make them what they are. Such a tool might be called a typologiser. When it can be used in conjunction with external classifications such as we find in most conventional corpora - genre, authorial origin etc., the internal evidence should give rise to a much sharper classification.

So far all the software has been concerned with just one language. However, there is a considerable interest in providing the same software tools for a range of languages. In Europe some leading institutions are combining in various ways in order to make available compatible corpora and processing and analytical software. In the case of grammatical and lexicogrammatical analysers, and lemmatisers, only the design strategy is general enough to cross language barriers, because most of the detail is specific to the language. This covers taggers, parsers, typologisers and classifiers. When more detail of collocation and set is built in, the lexical software will also become relevant to a single language only.
At present the simple collocators and phrase finders need only discriminate word divisions in the texts on which they operate; thus they have a general utility value.
The new corpus resources are expected to have a profound effect on the translations of the future. Attempts at machine translation have consistently demonstrated to linguists that they do not know enough about the languages concerned to effect an acceptable translation. In principle, the corpora can provide the information. In an exploratory project supported by the Council of Europe, the notion of translation equivalence has been expressed in terms of item and environment. Using a similar argument to that of the disambiguator, pairings of meanings are made between languages, and the way is clear to establishing bilingual and possibly multilingual dictionaries based on corpus evidence. The criteria are the repetitive patterns in the environment of the words studied.

There are a few more guidelines to mention on software standards, qualities and strategies. One follows from the principle that software should regard texts as indefinite in length.

• Speed of processing should take precedence over ultimate precision of analysis.

We are talking here about marginal decisions and diminishing returns. Any analysis must be largely accurate or it is no use at all. Occasionally, however, individual instances may be highly problematic, ambiguous or misleading, and the effort required to resolve the difficulties may be out of all proportion to the value of the precise categorisation. In these cases the software may be set to back off from the difficult decision. Sometimes the software may just get it wrong, and as long as there is no regular pattern to the mistakes, they are unlikely to have a great effect on the results of analysis when the corpus is many millions of words in length.

Another fairly practical guideline is that

• Software should be robust.

That is to say, it should be able to recover quickly from error or other difficult situations. It is another good argument in favour of modularising the analytical software.
If each routine is kept fairly simple, then the range of environment that has to be examined will often be very small. Language analysis is largely a matter of detail, and in creating software we find that detail quickly obscures the strategic plans. Analytical tools are prone to error. Finding errors in a very complex package can be very demoralising. Adaptations and developments are also hazardous in a big package. For these and other reasons I am advocating a modularised kit of software tools. Each will specify input requirements and a variety of output
conventions. Complex routines can be built up for particular applications by organising several tools together.

The expenditure of effort that is envisaged in this paper is considerable, and it is important that the products of the research are available to all who need them. Some agreement on standardisation is needed to overcome the grave difficulties of portability that have characterised corpus linguistics since its outset. All the software that has been outlined in this paper is written in "C" and works under the UNIX operating system. These facilities are not the most widespread at present, because they have been developed for medium sized machines rather than small ones. However, the domestic or personal computer is now powerful enough to run a "C" compiler and to absorb the overheads demanded by UNIX, so there is hope that a useful standard may be generally available to the general public.

To conclude, I shall draw together the guidelines for software design that seem to be appropriate to the nineties in language text processing.

• Analysis should be restricted to what the machine can do without human checking, or intervention.
• Analysis should be done in real time.
• Operations should be designed to cope with unlimited quantities of text material.
• Software will be designed to operate at more than one level of discrimination, so as to bypass doubtful decisions.
• Speed of processing should take precedence over ultimate precision of analysis.
• Software should be robust.
References

Biber, Douglas 1988 Variation across speech and writing. Cambridge: Cambridge University Press.
Clear, J.H. 1987 "Trawling the language: monitor corpora", in: M. Snell-Hornby (ed.), ZURILEX Proceedings. Tübingen: Francke.
Firth, J.R. 1957 "Modes of meaning", in Papers in Linguistics. London: Oxford University Press.
Osborne, Gary forthcoming Computational analysis of idiomatic phrases in Modern English. University of Birmingham. [M. Phil. Thesis.]
Renouf, Antoinette J. - John M. Sinclair 1991 "Collocational frameworks in English", in: Karin Aijmer - Bengt Altenberg (eds.), English corpus linguistics. Studies in honour of Jan Svartvik, 128-143. London: Longman.
Sinclair, John M. 1991 Corpus, concordance, collocation. London: Oxford University Press.
Sinclair, John M. - Gwyneth Fox et al. 1990 The Collins Cobuild English Grammar. London: Collins.
Yang, Hui Zhong 1986 "A new technique for identifying scientific/technical terms and describing science texts", Literary and Linguistic Computing 1, 2: 93-103.
Zipf, G.K. 1935 The psychobiology of language. Boston: Houghton Mifflin.
Comments by Fred Karlsson
The general strategy suggested for design of corpus analysis software in the 1990s is laudable. However, the principles of preferring fully automatic analysis over human intervention, and of having processing speed take precedence over ultimate precision of analysis, are potentially risky, at least if applied without due care. They could lead to lowered quality standards. Even a 1% error rate seems too high in regard to most linguistic phenomena, including the basic task of part-of-speech tagging. This would yield 10,000 errors in 1 million words, 1 million errors in 100 million words. Casual users might be unaware of the limitations of recall and precision of such analysis systems. An interesting kind of remedy is the use of partial or underspecified analyses. In situations of uncertainty, the analyzer would make no decisions but rather leave the alternatives pending. If the uncertainty rate can be brought, for example, to the 1% level, it is quite possible, for many purposes (such as tagging a corpus), to have these decisions made by human intervention. Only the uncertainties would be presented for human evaluation, for example on-line. Such intervention is fast and dependable, compared to the task of having to spot 1% errors by scanning all of an analyzed corpus. It is important both for theoretical and practical reasons to have perfect analysis as one of the central goals of automatic corpus analysis. It is also important to have future natural language processing (NLP) systems more extensively documented, tested, evaluated, and compared to other similar systems, than is presently the case. Another central goal is to promote free or cheap non-commercial scientific use of existing NLP systems. The formalisms should be language-independent (applicable as such to any natural language). Some brands of theoretical linguistics have not paid enough attention, or only a misguided type of attention, to corpora. 
But of course there are also instances of corpus-based studies where the problems addressed and the answers given are more or less trivial. "Good" theory and "educated" corpus study should be united throughout the research process. Real problems should be insightfully described on a sufficient level of abstraction.
Modularity is a central principle of software design. The idea of "partial parsers", each covering some central subproblem, is a promising one. In NLP, module interconnections and interplay are important. Often, decisions to be made in one module depend upon answers provided by other modules. For example, is there a clause boundary in front of a certain instance of the wordform that? A dependable answer presupposes a fairly conclusive parse of the near context, to be performed by another module, which in turn might rely on boundary information. Successful resolution of such interdependencies presupposes, inter alia, monotonic accumulation of information during the parsing process, and powerful declarative means of expressing the relevant restrictions.

How should partial parses and indeterminacy be represented in NLP? One alternative is Constraint Grammar Parsing (CGP, Karlsson 1990) viewing the whole enterprise of morphosyntactic parsing as disambiguation. Consider the various morphological readings that could be attributed to the word-form that, each on its own line below, with its potential syntactic codes (each prefixed by "@") within parentheses:

(1) that <CLB> CS (@CS)
    that ADV (@AD-A>)
    that PRON DEM SG (@SUBJ @OBJ @PCOMPL-S @PCOMPL-O @I-OBJ @<P ...)
    that DET CENTRAL DEM SG (@DN>)
    that PRON SG/PL (@SUBJ @OBJ @PCOMPL-S @PCOMPL-O @I-OBJ @<P ...)
The readings are complementizer (CS = subordinating conjunction, <CLB> = starts a new clause), adjectival intensifier (">" indicating the direction of the head), pronoun in head function (<NONMOD> = not modifier), determiner as modifier of a noun to the right (@DN>), and relative pronoun. The constraints of Constraint Grammar disambiguate morphological readings and syntactic codes equally by way of discarding alternatives. For example, for that, the morphological reading "PRON SG/PL" is the proper one, and its syntactic function is @OBJ, in sentence-final position if there is a transitive verb to the left in the same clause. Using some 1500 constraints of this down-to-earth type, relying decisively on morphological information provided by the lexicon and on extensive corpus studies, a full-scale morphosyntactic parser for English called ENGCG has been developed at the University of Helsinki (Karlsson - Voutilainen - Anttila - Heikkilä 1991). All ambiguities "are there" at the outset, undecided alternatives will be left in the output, constraints discard most (optimally: all
spurious) alternatives. When applied to fresh running text, the error rate of part-of-speech assignment of ENGCG is less than 0.3%. The CG formalism is a compromise between pure qualitative grammar statements and pure probabilistic descriptions. It is fully language-independent. Tagging and parsing are closely related. Part-of-speech tagging could be seen as the basic step of any parser. In parser design, lexicon and grammar should be closely integrated. A central module of a robust parser is a Master Lexicon covering the core vocabulary, 30,000-50,000 lexical items. The Master Lexicon should work in conjunction with a proper and precise morphological analyzer yielding inflectional descriptions such as (1), including base-form reduction (lemmatizing), that serve as input to the disambiguation modules of the parser. In this sense, morphological analysis is indispensable in automatic analysis of any language. One more task facing corpus analysts is determining the proper or optimal corpus size for various types of linguistic problems. The magnitude of this problem grows in parallel with corpus size. For example, if 500,000 hits are retrieved from a 200 million word corpus, powerful software for scanning and structuring the hits will be in great demand.
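The idea of disambiguation by discarding alternatives can be sketched as follows. This Python fragment is only an illustration under stated assumptions: it does not use the actual Constraint Grammar rule formalism or any ENGCG code, the names apply_constraints and sentence_final_object and the context dictionary are hypothetical, and the syntactic code lists are cut off where the listing above is elided. It encodes the single example constraint given in the text for the word-form that.

```python
# Readings of "that" as (morphological reading, syntactic codes).
# The code lists stop where the source listing is elided.
THAT_READINGS = [
    ("CS", ["@CS"]),
    ("ADV", ["@AD-A>"]),
    ("PRON DEM SG", ["@SUBJ", "@OBJ", "@PCOMPL-S", "@PCOMPL-O", "@I-OBJ", "@<P"]),
    ("DET CENTRAL DEM SG", ["@DN>"]),
    ("PRON SG/PL", ["@SUBJ", "@OBJ", "@PCOMPL-S", "@PCOMPL-O", "@I-OBJ", "@<P"]),
]


def sentence_final_object(cohort, context):
    """Hypothetical constraint: in sentence-final position with a transitive
    verb to the left in the same clause, keep only PRON SG/PL as @OBJ."""
    if not (context.get("sentence_final")
            and context.get("transitive_verb_to_left")):
        return cohort  # constraint does not apply: leave alternatives pending
    return [(morph, ["@OBJ"]) for morph, codes in cohort
            if morph == "PRON SG/PL" and "@OBJ" in codes]


def apply_constraints(cohort, constraints, context):
    """Apply each constraint in turn; never discard the last alternative."""
    for constraint in constraints:
        reduced = constraint(cohort, context)
        if reduced:
            cohort = reduced
    return cohort


resolved = apply_constraints(
    THAT_READINGS,
    [sentence_final_object],
    {"sentence_final": True, "transitive_verb_to_left": True},
)
pending = apply_constraints(THAT_READINGS, [sentence_final_object], {})
```

When the context does not license the constraint, all five readings are left in the output, which is the underspecified analysis recommended above for situations of uncertainty.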
References

Karlsson, Fred 1990 "Constraint grammar as a framework for parsing running text", in: H. Karlgren (ed.), Proceedings from the XIIIth Int. Conf. on Computational Linguistics 3: 168-173. Helsinki.
Karlsson, Fred - Atro Voutilainen - Arto Anttila - Juha Heikkilä 1991 "Constraint grammar: a language-independent system for parsing unrestricted text, with an application to English", in: Proceedings from the AAAI-91 Workshop on Natural Language Text Retrieval. Anaheim, CA.
The odd couple: The linguist and the software engineer. The struggle for high quality computerized language aids.

Henry Kucera
Computer-based language corpora have unquestionably had an important impact on various aspects of linguistic research, as the extensive ICAME corpus bibliography so amply demonstrates (Altenberg 1991). Moreover, these computerized databases and the analyses that stem from them have favorably affected research in other fields, primarily in psychology, as in reaction time word-recognition experiments, or - more recently - in artificial intelligence where the search for parsing and retrieval algorithms has increasingly turned to more empirical and even stochastic approaches. My purpose in this paper is to focus on the relation between computational linguistics in general and corpus linguistics in particular on the one hand, and significant practical language aids on the other hand. The availability of these aids is now so commonly expected by computer users of various word processing systems that, in the jargon of the software business, some of them are considered to be "checklist items," which simply means that no respectable product can be sold without them. The most obvious examples of such language tools are, of course, the ubiquitous spelling checkers and, in the English speaking world at least, computerized thesauri; more recently, our commercial friends have put out real or putative "grammar checkers" and even some general dictionaries, not always of overwhelming quality but at least small enough to run on a personal computer and containing some useful search functions for an intelligent user of an office or home machine. As W. Nelson Francis argues in his paper for this volume, the compilation of language corpora is not an invention of the 1960s. Much useful information was derived from many older manually assembled corpora, including word frequency data, useful in language teaching and other aspects of applied linguistics. 
Nevertheless, it was only in the 1960s that rationally constructed corpora first became available to scholars and researchers in computer readable form; that, in turn, made it possible to extract information from such databases not only much more accurately than had been the case with manually assembled collections but also with immensely greater flexibility and speed.

As one who participated in the original conference in 1963 that agreed on the specifications of the Brown corpus (and I am happy to say that two other members of that group are taking part in this Nobel Symposium, Sir Randolph Quirk and W. Nelson Francis), I would have reacted with incredulity if someone, on that February day at Brown University, had actually predicted that the corpus which we were then designing, and some other corpora later constructed along the same lines, would still be in substantial use and demand some twenty-eight years later. But that is indeed the case. I will turn very briefly to some of the reasons for this fact later in the paper.

First, however, let me offer you an unsentimental reminder of what the 1960s were like in our field of scholarly inquiry. Those of us old enough to remember the last twenty-five years of what may, somewhat charitably, be referred to as the intellectual history of linguistic science, know only too well that the prevalent linguistic fashions of the early 1960s were hardly favorable, at least in the United States, to any enterprise that included an examination and analysis of actual language data. The goal then was "to capture", to use the favorite verb of that age, various profound generalizations about the competence of an ideal speaker-listener who, we were instructed, knew his or her language perfectly, had no memory limitations, lived in a completely homogeneous society, and suffered from no distractions, including demands of style or effective communication; all of this inquiry was to be pursued with the ultimate aim, achieved only perhaps in the following millennium, of discovering the basis of a universal grammar by the application of superior reasoning.
Collecting empirical data was thus not considered a worthwhile enterprise in the circles of true believers since, as many of our colleagues from Boston and its suburbs so firmly impressed on us on every suitable or unsuitable occasion, a native speaker of English, for example, could provide the linguist in five minutes with a much greater amount of useful information than even a corpus of a billion words could, if one actually existed. The use of computers to discover anything significant about language only increased the severity of our betrayal. I must confess that I still have in my files a letter from a very well-known linguist of those days who, with something much less than good taste, paraphrased for my benefit - with one single word substitution - the well-known saying of Hermann Goering: "Whenever I hear the word computer, I reach for my gun." But, in a small version of Gorbachev's predicament, we had enemies not only on the left but also on the right. There were many members of the humanistic world in various academic institutions, including Brown University, who had a predictable fear of the new "calculating
The linguist and the software engineer
machines" and little more than contempt for those among us who dared to commit the treason of joining the scientists' camp of vacuum tubes, relays and binary numbers. So, from both sides, we were certainly not spared the labels of word-counting fools, and the predictions were boldly made that we would, at best, turn into bad statisticians and intellectual mechanics. History, as the last three years have so amply demonstrated, has the unpleasant habit of not being particularly kind to self-righteous prophets.
But my purpose is not to rehash the past. The fact that we are gathered here at a Nobel Symposium is sufficient evidence of what has been accomplished. I will thus pass over in silence the failure of the various linguistic prophecies, including the paraphrase of the Reichsmarschall's statement. Rather, my aim will be to look at still another battle or - perhaps more accurately - a strained symbiosis that computer technology has unavoidably created for some of us: the common fate of the odd couple, the linguist and the software engineer.
Although this story has a relatively ancient prehistory, dating from the 1950s and the numerous and varied failures of expensive machine translation projects, it begins in its full flower around 1980 or just before, when the wondrous world of word processing started to take hold of public imagination, at least of those who needed to express themselves in written language. The prospect of incorporating language aids in word processing programs, and almost immediately the need to consider it seriously, soon brought some members of the community of computational linguists into contact - and argument - with those experts who had the task of implementing the linguistic ideas in practice. It was not always a love affair since the different perspectives and standards of the two "cultures" became apparent almost immediately. 
As a participant in numerous such encounters during the last ten years and as a corpus linguist, I would now like to consider some selected but perhaps interesting aspects of this coexistence: the struggle for high-quality language aids and the intelligent utilization of the techniques of corpus linguistics, a battle only partially successful and still clearly unfinished.
It would be too pretentious, in speaking of the two cultures of the linguist and the software engineer, to equate this relatively minor clash of values with the famous paradigm posited by C. P. Snow. In 1959, C. P. Snow delivered his provocative lecture, The Two Cultures and the Scientific Revolution (published in 1961). In it, Snow developed the argument that the intellectual community had divided itself into two disparate entities with different and often contradictory values, attitudes and outlooks on the future of mankind: the world of the scientists, and the culture of the literary intellectuals. In the mini-version of this conflict, the linguist has indeed played the role of the guardian of language research as a serious scholarly enterprise, while
the software engineer, quite understandably, has focused on practical dictates of applied computer science: speed, computer-memory limitations, ease of implementation and, at least ideally, a universality of application of his programs across languages. The task of the linguist in this kind of encounter was surely not made easier by the well-known syndrome from which even the intelligent populace out there suffers regularly, namely the conviction that a native speaker of a language knows enough about it not only to teach it to others but to become an expert at designing functional computer language aids. The result of this belief, held by many, has been a considerable oversupply of inferior products, some of them still present in the best-known word processors. But at least some of the software engineers have learned a few painful lessons about linguistics, just as the computational linguists had much to learn about the practical aspects of software engineering.
In the following sections, I propose to take up the major topics with particular reference to English. Space and time limitations prevent me from doing justice to other languages for which computerized language aids have been designed. Nevertheless, I will conclude the paper with a brief review section of the specific problems that one encounters in working with languages other than English, particularly French, Spanish, German and the Slavic family.
The spelling checkers
To be quite concrete, let me first concentrate, for the purposes of illustration, on the granddaddy of all the word-processing language aids, the "spelling checker". Actually, the name itself is misleading. Most currently available spelling checkers consist of two components: one, which is essential, is a spelling verifier; and the other, now common but in principle optional, is the actual spelling corrector. The verifier algorithm discovers potential misspellings; the corrector algorithm then suggests a set of alternative correct spellings of the potential misspelling discovered by the verifier. It is thus important to bear in mind that the input into the corrector is the output of the verifier and that no practical corrector can exist without verification having taken place first. The opposite, of course, is not true. The original spelling checkers of the late 1970s and the early 1980s were only verifiers: they highlighted potential errors (or simply beeped, as is still the case on some less expensive electronic typewriters) but made no suggestion of how the misspelling was to be corrected. It soon became apparent that verifiers had two distinct shortcomings: if the spelling error was not a typographical one but
one due to the user's ignorance (something for which I have used, in my presentations to the software community, the charitable term "cognitive error"), the verifier offered no help and a search of a conventional dictionary might have been necessary, a search which was not only laborious but could easily fail - as, for example, when the cognitive misspelling involved a wrong initial letter. And even when the nature of the correction of the misspelling was obvious, the user still had to retype the erroneous string. Hence we witnessed, beginning in the early 1980s, the advent of the spelling corrector.
It is important to emphasize that the verification and the correction processes represent two distinct cognitive abilities and, consequently, vastly different algorithmic implementations: a verifier involves essentially the recognition of a word as being or not being an element of a well-formed set. Assuming the availability of a word list (software engineers tend to refer to this list loosely as a "dictionary"), the corresponding algorithm consists essentially of an attempt to match a word in the user's document with an item in the word list. If the word list is a "literal" one, i.e. includes not only base forms but all the inflected forms of a lexical item (i.e. all the elements of a lemma), then the problem appears to be, at least superficially, a purely computational one: making the search as speedy as possible and minimizing the memory requirements for storing the word list.
Spelling correction, on the other hand, involves a much more complex cognitive task, a process of association: the ill-formed string is to be associated with the most likely well-formed one from the available word list and such suggestions (one or more) then offered as candidates for substitution for the error. If one of the suggestions is a feasible one, the user can then automatically select it and replace the error without any retyping, clearly an important convenience. 
While verification appears, as I already mentioned, to be a computational problem without apparent linguistic implications, it soon occurred to the software engineers that this was not the case. A verifier must, of course, check every single string in the examined document, regardless of its frequency; even the English definite article the, which accounts for almost 7% of an average English text, may be misspelled, e.g. through an inversion of two letters (such as hte). Given the extremely skewed frequency distribution of English word-forms, the speed of checking a document becomes a matter of efficient access to the word list. While RAM (Random Access Memory) is a fast access storage medium, a hard disk, an optical disk or a CD-ROM disk are much slower in access speed. Consequently, the most frequent words to be checked should be stored in RAM, while the rest can be checked on a slower storage medium since the search of that portion of the word list will be much less frequent.
The relation between frequency studies and the verification algorithm is clearly a crucial one, especially if we consider the actual statistics: on the basis of the Brown corpus, we can pretty much predict that the 130 most frequent forms of American English (mostly determiners, prepositions, but also some common noun, adjective and verb forms) will account for almost 50% of an average text. After that, the payoff gets progressively smaller but, if we do our frequency study reasonably well, we could probably identify another 3000 word forms or so that will in most cases account for 80% of the tokens in an English text (aside, perhaps, from some specialized genres, such as legal documents). By placing that 3000-word portion of the list into RAM, and the most frequent 130 word forms into some privileged cache section of the computer memory, we can speed up the verification process dramatically. The better the predictive value of our frequency data, the faster our algorithm will be. Word frequency studies were, of course, available in printed form to the verifier designers of the 1980s and, as a matter of fact, with only modest effort such computerized databases as the Brown and LOB corpora could also provide useful frequency information. Since any charge that word lists may have been copied without permission for commercial purposes is extremely difficult to prove, few of our software friends bothered obtaining any permissions from the linguists who compiled the frequency lists. I am not able to say how many word lists of commercial spelling checkers include data derived from corpus research but, by looking for the "smoking gun" in some such lists (e.g. errors intentionally left in the Brown corpus text because they appeared in the original printed sample), I would not be surprised if a great many American English spelling checkers owed at least some debt to our corpus research. 
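The tiered lookup described above can be sketched in a few lines. This is only a minimal illustration, assuming that the three tiers (a small cache, a RAM-resident list and a disk-resident remainder) can be modelled as plain Python sets; the function names and the default sizes of 130 and 3000, taken from the figures quoted above, are otherwise hypothetical.

```python
from collections import Counter


def build_tiers(tokens, cache_size=130, ram_size=3000):
    """Split word forms into cache / RAM / disk tiers by descending frequency."""
    ranked = [word for word, _ in Counter(tokens).most_common()]
    cache = set(ranked[:cache_size])
    ram = set(ranked[cache_size:cache_size + ram_size])
    disk = set(ranked[cache_size + ram_size:])  # stands in for slow storage
    return cache, ram, disk


def verify(word, cache, ram, disk):
    """Return the tier that contains the word, or None for a potential misspelling."""
    for name, tier in (("cache", cache), ("ram", ram), ("disk", disk)):
        if word in tier:  # cheapest tier is consulted first
            return name
    return None


def coverage(tokens, tier):
    """Share of running tokens that a given tier resolves on its own."""
    return sum(1 for token in tokens if token in tier) / len(tokens)


if __name__ == "__main__":
    sample = "the cat saw the dog and the cat".split()
    cache, ram, disk = build_tiers(sample, cache_size=1, ram_size=2)
    print(verify("the", cache, ram, disk), verify("hte", cache, ram, disk))
    print(coverage(sample, cache))
```

The better the frequency ranking predicts the user's text, the larger the share of tokens resolved in the first tier, which is the point made above about the predictive value of corpus frequencies.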
The wealth of language texts in computer-readable form is now so huge that constructing new frequency lists from such large databases is clearly not difficult. But such "raw" frequency studies are not necessarily better (and most likely are less valuable) than a frequency analysis of a smaller but carefully constructed corpus. To corpus linguists the reasons for this are of course obvious: the representativeness of the text is increased by a weighted selection of the number of samples taken from individual genres and subcategories and that, in turn, makes it possible to adjust the overall frequency of a lexical item in the corpus by its dispersion, i.e. by how evenly its occurrences are spread over the numerous selections of the corpus. (For one revealing measure of adjusted frequency, cf. Francis - Kučera 1982.)
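To make the idea of adjusting frequency by dispersion concrete, here is a small sketch. It assumes one widely cited coefficient, Juilland's D, computed over equal-sized corpus sections, and the corresponding usage coefficient U = F x D; this choice is an assumption made for illustration and is not necessarily the measure given in the work cited above.

```python
import math


def juilland_d(section_freqs):
    """Juilland's D: 1.0 for a perfectly even spread, near 0.0 when all
    occurrences fall into a single section (sections assumed equal in size)."""
    n = len(section_freqs)
    mean = sum(section_freqs) / n
    if n < 2 or mean == 0:
        return 0.0
    variance = sum((f - mean) ** 2 for f in section_freqs) / n
    variation = math.sqrt(variance) / mean  # coefficient of variation
    return 1 - variation / math.sqrt(n - 1)


def adjusted_frequency(section_freqs):
    """Raw frequency scaled down by how unevenly the item is dispersed."""
    return sum(section_freqs) * juilland_d(section_freqs)


if __name__ == "__main__":
    print(adjusted_frequency([5, 5, 5, 5]))   # evenly spread item
    print(adjusted_frequency([20, 0, 0, 0]))  # item confined to one section
```

Two items with the same raw frequency of 20 thus receive very different adjusted values, which is why a small but balanced corpus can rank word forms more usefully than a larger unbalanced one.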
The numbers game and other follies
If PC Magazine, InfoWorld or BYTE should be among your favorite reading, the "numbers game" must be a familiar experience. These and similar periodicals regularly review new (or more likely revised) word processing packages and their spelling checkers and offer, as the most substantial if not the only criterion of the value of such linguistic aids, the size of the "dictionary", the software engineer's term for a word list. While eight years ago a lexicon of 50,000 words was considered to be spectacular enough for English, the traders quickly grasped the potential of the game and went to 80,000, to 90,000 and, most recently, to 130,000 word "dictionaries". Some of this is just simple padding (such as the inclusion of spelled-out fractions, e.g. two-seventeenths), some simply a compilation from a variety of good and bad dictionaries.
This kind of game is particularly difficult for a computational linguist to tackle because any rebuttal involves a number of superficially trivial but nevertheless crucial concepts. One is tempted, of course, to ask first what is meant by a "word". Do we have here perhaps a count of lemmas or just of word forms? To no one's surprise, it will turn out, without exception, to be the latter, since the number of word forms is always larger than the number of lemmas (in English possibly by a ratio of 3:1, greater in more highly inflected languages). But that still leaves the important question of whether bigger is better, at least as far as word lists are concerned. As John B. Carroll demonstrated quite convincingly (in Kučera - Francis 1967: 406-424), the ratio of "types" (corresponding in his calculation to word forms) to "tokens" (the length of the text) is a lognormal function. Carroll predicted that the number of new lexical items as the size of the text increases gradually slows to a trickle, to reach, for example, just barely over 200,000 in a sample of 100 million tokens. 
Moreover, many of the lexical items would be hapax legomena, so that their actual identity becomes quite uncertain, being almost totally dependent on the nature of the text. But even more important, there is the unpleasant problem of undetected errors. Peterson (1986) has offered a demonstration that the number of such possible undetected errors increases rapidly with the linear increase in the size of the word list. Peterson's conclusion is that any English word list intended for spelling verification that exceeds 60,000 word forms is not only not worth the price but can be definitely dangerous to one's reputation. I have used the term "collisions" for confusions of this general kind which can cause very troublesome complications in other linguistic tasks as well, such as in automatic speech recognition. At this point, let me first clarify the phenomenon of such collisions with reference to our current discussion. All
spelling checkers are word-based, i.e. the domain of their examination is a string popularly known as a "word", which is somewhat loosely defined as the sequence of letters between two spaces or, if need be, between a space and a trailing but attached punctuation mark (a sentential period, comma, semicolon, etc.). Consequently, any spelling error that creates an acceptable English word will be passed by a spelling checker as correct, resulting in what we might call a fatal error since it leaves a mistake in our finished text and thus defeats the purpose of the language aid. Some of these confusions, known as collisions, are clearly unavoidable: horse and house differ in a single letter (separated only by two other keys on the keyboard) but both are important enough to be included in any respectable word list. Even more serious are common typographical errors that many users make, for example entering form instead of from. Again, a word-based algorithm that cannot tell the difference between the proper place for a preposition or a noun / verb is incapable of handling this problem.
But many dangerous collisions can be avoided: even paperback dictionaries of English include both the words calendar (a system for reckoning time) and calender (machine for smoothing paper). But if one includes in one's word list the latter, which is exceedingly rare, then any misspelling of the former, which is common, will result in a collision: the misspelling will pass as correct and the error will remain undiscovered, at least by the spelling checker. Peterson (1986) presents a statistical proof that it is precisely this type of undetected error that one invites by mechanically enlarging the word list to play the numbers game. Naturally, if the rare calender is omitted from our word list, the misspelled word for "the system for time reckoning" will not pass and we will be saved the embarrassment. 
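The calendar / calender case suggests a frequency-aware way of building a word list. The sketch below is one possible reading of that idea, not a published procedure: a form is left out when it lies one typing error away from a form that is far more frequent. The function names, the ratio threshold and the frequency figures in the usage example are all invented for illustration.

```python
def within_one_edit(a, b):
    """True if a and b differ by exactly one substitution, insertion,
    deletion or transposition of adjacent letters."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:
            return True  # single substitution, e.g. horse / house
        return (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and a[diffs[0]] == b[diffs[1]]
                and a[diffs[1]] == b[diffs[0]])  # transposition, e.g. form / from
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def prune_collisions(freqs, ratio=1000):
    """Keep every form except rare ones that sit one edit away from a form
    at least `ratio` times more frequent (hypothetical threshold)."""
    kept = set(freqs)
    for rare, f_rare in freqs.items():
        for common, f_common in freqs.items():
            if f_common >= ratio * max(f_rare, 1) and within_one_edit(rare, common):
                kept.discard(rare)
                break
    return kept


if __name__ == "__main__":
    # Invented frequencies, for illustration only.
    toy = {"calendar": 5000, "calender": 1, "horse": 4000,
           "house": 9000, "from": 90000, "form": 8000}
    print(sorted(prune_collisions(toy)))
```

With these toy figures calender is dropped, while horse / house and form / from all survive: those collisions involve two common words and cannot be resolved at the level of the word list.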
There is a price we will have to pay for this, of course: should we ever wish to write an article about machines that smooth paper, we would undoubtedly wish to include, at least temporarily, the item calender in our personal dictionary to avoid extensive overflagging of what is not an error in our specialized topic. But that is precisely what personal dictionaries, for which all spelling checkers make reasonable provisions, are for. Let me say only parenthetically that Peterson's conclusion is much too pessimistic because his calculations take no account whatsoever of the frequency or generality of a word that may cause a collision. With the application of some linguistic and word-frequency information, the danger of collisions can be minimized even in a larger word list. There are two other items that I need to mention under the heading of "Other Follies": The tolerance of fatal errors for the sake of making the word list computationally compact and the risk of such errors because of inadequate elementary parsing of the textual strings that are potential word
forms. To show that neither of these follies is theoretical, I will take as my target two of the best-selling English-language word processors, WordPerfect and Microsoft Word. (I offer advance apologies to the latter since Microsoft Corporation has now apparently recognized its misguided habitudes and may soon mend its ways.)
One would expect, at the very least, a spelling checker to detect capitalization errors in proper names. And yet, we have here the best-selling WordPerfect (Macintosh version) which joyfully passes as correct america, boston, washington, tokyo, and even ronald reagan. Here, the software engineers, trying to compress the word list, resorted to an elimination of the distinction between upper and lower case (with upper case obligatory in proper nouns, positionally determined in other words, e.g. at sentence beginning). Such a reduction of the character set saves computer memory and, as a bonus, makes the algorithm simpler. But it also frustrates the user who may rely on the checker to find instances, all too common, where a fast typist has missed the shift key and mistakenly keyboarded an initial letter of a proper name in lower case. It is clear, of course, that this annoying compromise has other far-reaching consequences since such nonsense syllables as ar (common typo for are), nd (not uncommon for and in fast typing), ca, ri, ba, iud and others too numerous to list here, are casually accepted as "correct" because they happen to resemble either Zip Code symbols of US states, symbols of chemical elements or acronyms, all of which, of course, should in correct usage include their proper capitalization.
Although the domain of a spelling checker is the "word" (rather than any larger linguistic string, such as a phrase or a sentence), even this seemingly modest task requires an intelligent though elementary parsing of the text, even if only on the level of the graphic system. Punctuation attached to lexical items (commas, sentential periods, etc.) 
has to be stripped in order for the word to be located in the word list and not flagged as an error. Clearly, this immediately presents the problem of distinguishing between sentential periods and abbreviations. A poor treatment of such distinctions may again be found in the leading word processors: both WordPerfect and Microsoft Word (Macintosh versions) pass st, op, cit, viz and many others, even when they appear without periods. Worse still, in any verification one needs to deal intelligently with embedded punctuation: some of it must be considered equivalent to word boundaries (e.g. syntactically motivated hyphens), while characters that could result from hitting a wrong key should be flagged. Both of our systems fail here miserably, largely because of their failure to parse the graphic text properly, and accept without question such monstrosities as letter,s, mis$ed, m%st, etc.
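The elementary graphic parsing called for above might look roughly like the following. It is a deliberately small sketch under several assumptions: the word list is case-sensitive, abbreviations are stored together with their periods, hyphens are treated as word boundaries, and any other embedded non-letter is flagged. The function name, the character classes and the tiny word list in the usage example are hypothetical.

```python
import re

LEADING = "\"'(["
TRAILING_NO_PERIOD = ",;:!?\"')]"
TRAILING = "." + TRAILING_NO_PERIOD
WORD_SHAPE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def check_token(token, words, abbreviations, sentence_initial=False):
    """Return True if the token is acceptable, False if it should be flagged."""
    core = token.lstrip(LEADING)
    # An abbreviation is accepted only together with its period.
    if core.rstrip(TRAILING_NO_PERIOD) in abbreviations:
        return True
    core = core.rstrip(TRAILING)  # strip attached sentence punctuation
    if not core:
        return True
    # Hyphens count as word boundaries: every part must itself be a word.
    for i, part in enumerate(core.split("-")):
        if not WORD_SHAPE.fullmatch(part):
            return False  # embedded symbol, as in letter,s or mis$ed
        if part in words:
            continue
        # Only the first word of a sentence may match a lower-case entry.
        if (sentence_initial and i == 0 and part[0].isupper()
                and part[0].lower() + part[1:] in words):
            continue
        return False
    return True


if __name__ == "__main__":
    words = {"the", "letters", "letter", "well", "known", "Boston", "are"}
    abbreviations = {"viz.", "op.", "cit.", "St."}
    for tok in ["Boston,", "boston", "viz.", "viz", "letter,s", "well-known", "ar"]:
        print(tok, check_token(tok, words, abbreviations))
```

Because case is preserved and periods belong to the abbreviation entries, boston, viz and ar are flagged here, at the cost of a somewhat larger character set in the stored list, which is the trade-off discussed above.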
As obvious as such shortcomings are, they tend to escape the notice of the reviewers and, in many cases, even of the software engineers who implemented them, simply because the discovery of such failures requires informed linguistic probing. Needless to say, a computational linguist who points them out to the word processing manufacturer only rarely receives much gratitude for his troubles. But the struggle should not be abandoned since, in the final analysis, an improvement in the quality of all such language aids - something that is beginning to be taken seriously by at least some companies - can only profit all parties.
Correction: an associative algorithm
The need for spelling checkers to include a correction module was recognized relatively early. Damerau (1964) claimed, on the basis of empirical evidence, that 80% of spelling mistakes in English are due to four types of errors: missing, extra, transposed or wrong letters. James Peterson, a computer scientist, wrote a whole book on the subject (Peterson 1980) in which he published an actual computer program for an essentially mechanical correction of such typographical spelling errors. The algorithm is based on the assumption that a majority of such errors involves only a single mistake per word string. Given these assumptions, one can then construct a relatively simple algorithm which would essentially go through four "repair" operations in some reasonable sequence, trying to correct the misspelling, i.e. by inverting every two adjacent letters in succession, by substituting, one at a time, each of the letters of the character set in each word position, by inserting, again one at a time, each of the letters in each possible word position, or by deleting one letter of the misspelled string in succession. Whenever any such single attempted repair is made, the result is then checked against the word list and, if a match is found, the suggested correction is presented to the user.
Although these ideas present some computational problems, particularly in access speed if the bulk of the word list is on hard disk, they still form part of many current spelling checkers and have some usefulness in correcting routine typographical errors. One problem with this "brute force" approach, as it became known in the trade jargon, is that some of its subroutines are computationally expensive (e.g. the substitution of each of the 26 letters of the English character set in each position of a misspelled string of six characters in length could involve 156 dictionary lookups and still fail). 
Consequently, spelling checkers usually use this method only for words stored in RAM (i.e. high frequency items).
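The four repair operations can be sketched as follows. This is a minimal illustration of the single-error approach described above, not a reproduction of Peterson's published program; the function names are hypothetical, and the suggestions are simply returned in alphabetical order rather than in any tuned sequence.

```python
import string


def single_edit_candidates(word, alphabet=string.ascii_lowercase):
    """All strings reachable from `word` by one transposition of adjacent
    letters, one substitution, one insertion or one deletion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    substitutes = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    deletes = {l + r[1:] for l, r in splits if r}
    return (transposes | substitutes | inserts | deletes) - {word}


def suggest(word, word_list):
    """Return word-list entries one repair away from a string the verifier rejected."""
    if word in word_list:
        return []  # nothing to correct
    return sorted(c for c in single_edit_candidates(word) if c in word_list)


if __name__ == "__main__":
    print(suggest("hte", {"the", "hate", "he", "hoe"}))
```

Every candidate costs one lookup in the word list, which is why the method is practical only when that list, or the relevant part of it, is held in fast memory.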
Even more serious, however, is the limitation of the "brute force" algorithm to a single error per word string. The method, at least in its original form, is not suitable for dealing with two or more errors per word since it could then result in a computational explosion, i.e. the creation of such a huge number of attempted "corrections" (most of them false) that their continuous checking against the word list would take unrealistically long. Consider, for example, that the string embaresed contains three errors and its fix-up with the brute force routine might easily result - depending on the sequence of the combinatory attempts - in millions of trips to the spelling dictionary. Not surprisingly then, spelling correctors that rely only on a brute force approach have nothing to offer for such errors. Mistakes that involve multiple characters in a string are often motivated by a mismatch between the graphic and the phonological systems, certainly a common situation in English. Because of the history of English orthography, we find many examples of every one of the three possible relations between the orthography and the phonological system: a one-to-one relation between letter and phoneme, as well as one-to-many and many-to-one correspondences. The task of a corrector that can deal with such "phonetic errors", as they are commonly referred to, is then essentially an associative one: finding one or more well-formed lexical items that are most similar to the misspelling. Computers, of course, have no inherent capacity to understand the notion of "similarity". Unlike human brains, which are masterful pattern recognizers (e.g. in their ability to recognize hundreds of human faces of old and new acquaintances even when viewed from different angles and even notice their similarity in, let us say, a family resemblance), computers are quite bad at such tasks. 
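The arithmetic behind the "computational explosion" can be made explicit. The counts below are a rough estimate under simplifying assumptions: a 26-letter alphabet, no removal of duplicate candidates, and the string length treated as constant from one repair to the next. The function names are hypothetical.

```python
def substitution_lookups(length, alphabet=26):
    """Lookups needed to try every letter in every position (the 26 x 6 = 156 case)."""
    return alphabet * length


def single_edit_count(length, alphabet=26):
    """Approximate number of strings one repair away from a string of this length."""
    transpositions = length - 1
    deletions = length
    substitutions = (alphabet - 1) * length  # excluding the letter already there
    insertions = alphabet * (length + 1)
    return transpositions + deletions + substitutions + insertions


def multi_edit_estimate(length, edits, alphabet=26):
    """Rough upper estimate of candidates when `edits` repairs are chained."""
    return single_edit_count(length, alphabet) ** edits


if __name__ == "__main__":
    print(substitution_lookups(6))               # six-letter string, substitutions only
    print(single_edit_count(len("embaresed")))   # one repair on a nine-letter string
    print(multi_edit_estimate(len("embaresed"), 3))
```

For the nine-letter string embaresed a single repair yields about five hundred candidates, while chaining three repairs yields on the order of a hundred million, which is consistent with the "millions of trips" to the dictionary mentioned above.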
Consequently, any pattern recognition or matching - such as one faces in phonetically motivated spelling correction - must be programmed by using a set of sequential rules that adequately capture the most reasonable correspondences. Most spelling correctors included with current versions of popular word processors have some built-in "phonetic" rules. While the exact algorithms or programs have not been published (with one exception, cf. below) and are usually classified as trade secrets, an informed testing procedure can shed light on the approaches used. Most of these algorithms do reasonably well when the phonetic error involves a straightforward correspondence (e.g. the confusion of f and ph, as in the misspelling fysical, which involves a two-letter error) and even with multiple errors, as in the embaresed example given above. The rule of thumb is that whenever we have a manageable error (involving one or two phonetic substitutions for the correct spelling) and the
word is fairly long (six or more letters) most spelling correctors will come up with a reasonable set of suggestions. But if the word is shorter, we can get some abysmal results. Consider, for example, the misspelling quinain (with multiple-letter error) and its treatment by the WordPerfect spelling corrector. We get offered eleven "phonetic suggestions", of which quinine is in position nine, preceded by such unhelpful if not misleading offerings as canaan (with lower case c), cannon, chignon, kinin, etc. Or if we were to type quitte and were not interested in the typographical correction quite, we could rejoice in no less than 80 (eighty!) phonetic alternatives, of which the most likely, quit, occupies number 78 in the displayed list. While that is bad enough, the nature of some of the suggestions is a real cause for despair. Here is a small sample: cit, catha, chaetae, kuwait and kyoto (both with lower case k). And in some other instances, we are even privileged to encounter such wonderful "words" as od'd, cd, oidia. The WordPerfect and some other systems thus tend to offer, at least in some cases, an extremely large and unwieldy number of suggestions, some plausible and some ridiculously far removed from the misspelling. Clearly, the more suggestions, reasonable or not, a spelling checker lists for a particular error, the more likely it is that among them will be the one that the user had in mind. If the entire word list were offered each time, to take an absurd case, then the procedure would probably reach close to 100% "perfection" while simultaneously driving the user to despair. A system that overoffers suggestions thus completely fails one of the most important requirements of a spelling corrector, i.e. that the list of its suggestions should be reasonable, small and properly ordered, so as to allow the user a quick and relatively painless replacement of the misspelling. 
Having said this much about others, it would be unfair if I did not put my cards on the table and talk about a correction system which I designed, at the suggestion of Digital Equipment Corporation, in 1981-1982 and which was eventually licensed and distributed by the Software Division of Houghton Mifflin Co. Since the system is legally protected (by a U.S. patent issued in 1986), it is one of the few (or perhaps the only one) whose algorithm is actually published. The basic idea, which makes this system fast and able to deal with a multiplicity of phonetic and other errors, is the notion of a "skeleton" and of equivalence classes, motivated by grapheme-phoneme correspondences (e.g. f and ph are in the same equivalence class, unstressed vowels are in the same equivalence class, while the likely stressed vowels are in only two classes, front vs. back, etc.). A set of ordered rewrite rules reduces any potential misspelling identified by the verifier to a "skeleton", a string of salient features, with redundant letters not needed for the recognition of the
word eliminated and equivalence classes represented by cover symbols. The resulting skeleton can then be matched to a skeleton representation of the word list where, of course, each skeleton is linked to a correct spelling of the word. The result is that the correction of most errors, even outrageous ones (e.g. newmonya for pneumonia) can be achieved by a single lookup in the dictionary. It needs to be said that the system (which runs on DEC machines, as well as such personal word processors and reference systems as WordStar, Panasonic, Brother, Microsoft Bookshelf, and others) varies in individual implementations and is far from perfect. Its major flaw is that it does not always do very well with purely typographical errors involving consonants (it works fine with vowels). The reason for this is that the implementation of the typographical correction feature, which is incorporated in the prototype, was considered by those licensing the system as too computationally demanding and was generally not implemented. Even if it occasionally fails to offer a good suggestion, however, the system has the advantage that its list of corrections is usually quite small and precise, and that no absurd alternatives are presented. The fact that, even in its imperfect form, this linguistically based system has won a number of adherents is perhaps a reason for some satisfaction that, in the domestic disputes of the odd couple, the linguist does not always lose.
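The skeleton idea can be illustrated with a toy version. The rewrite rules below are hypothetical simplifications written for this sketch and are not the rules of the patented system: all vowels (together with non-initial w and y) collapse into a single cover symbol V, whereas the system described above distinguishes stressed from unstressed vowels and front from back.

```python
import re
from collections import defaultdict

# Ordered rewrite rules (hypothetical); input is lower-cased first, so the
# upper-case V serves as a cover symbol and cannot clash with the letter v.
REWRITE_RULES = [
    (r"^(pn|kn|gn)", "n"),   # silent initial letters
    (r"^ps", "s"),
    (r"^wr", "r"),
    (r"ph", "f"),            # f and ph share an equivalence class
    (r"(?<!^)[wy]", "V"),    # non-initial w and y pattern with vowels
    (r"[aeiou]+", "V"),      # any vowel sequence becomes one cover symbol
    (r"(.)\1+", r"\1"),      # collapse doubled symbols (rr, ss, VV)
]


def skeleton(word):
    """Reduce a string to its skeleton by applying the rules in order."""
    s = re.sub(r"[^a-z]", "", word.lower())
    for pattern, replacement in REWRITE_RULES:
        s = re.sub(pattern, replacement, s)
    return s


def build_index(word_list):
    """Map each skeleton to the correctly spelled words that share it."""
    index = defaultdict(list)
    for word in sorted(word_list):
        index[skeleton(word)].append(word)
    return index


def correct(misspelling, index):
    """A single lookup: fetch the words stored under the misspelling's skeleton."""
    return index.get(skeleton(misspelling), [])


if __name__ == "__main__":
    index = build_index(["pneumonia", "physical", "embarrassed", "quinine"])
    for wrong in ["newmonya", "fysical", "embaresed"]:
        print(wrong, correct(wrong, index))
```

However many letters are wrong, the cost is one reduction and one lookup, which is what distinguishes this approach from the chained repairs of the brute-force method.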
A grammar corrector: Mission Impossible?
The usefulness of spelling correctors is, of course, severely limited by the textual domain which they analyze, i.e. the graphic word. Although some modifications to this can be and have been made, such as the flagging of two successive occurrences of the same word (the the, for example, tends to be a common error, especially across text lines), any substantial improvements in proofreading performance need to enlarge the domain of analysis to at least an entire sentence. In other words, the programs would now have to move into the realm of syntax and automatic parsing; only then can at least the most common errors due to collisions - in the sense discussed above - be minimized: in order to discover that the string form is, in any particular instance, a typo for the preposition from, the program has to determine that the particular sentential structure calls for a preposition, not for a noun or verb. Although this kind of typographical error is quite common, the solution is far from simple. Essentially, the error should be flagged only if a nominal context is excluded and the phrase contains another main verb. A similar
situation obtains with regard to cognitive errors (wrong forms due to phonetic similarity), such as the confusion of their and there, council (noun only) and counsel (noun or verb), illicit and elicit, etc. But collisions of this kind, undiscoverable by a spelling checker, are only part of the story: common errors in American writing include wrong pronoun cases (for my wife and I, and - at least for some stylistic purposes - who instead of whom), subject-verb agreement errors, particularly in cases when the linear distance in the text between the subject head and the verb is considerable (due to the so-called proximity-concord tendency of American usage). Then there is the usage of adjectives instead of adverbs (good news travels slow), and a number of other embarrassing confusions, including to and too, or even two, and simple typographical errors, such as the article a instead of an. In computer-produced writing, moreover, syntactic errors of various kinds tend to be more common, precisely because of the versatility of word processing programs; these systems allow users to modify a portion of a sentence, eliminate a part of it or add to it, paste other text fragments in the middle of it, etc., all of which is then conducive to syntactic mistakes resulting from an inadequate adjustment to the rest of the original sentence. Given the state of the parsing art and the relatively slow progress that has been made in many parsing projects of English, prospects for a grammar corrector might not look promising. This is especially so if one bears in mind that a parser intended for a grammar corrector cannot be based on the assumption that a sentence is well-formed; it must be able to parse adequately both well-formed and ill-formed strings and, moreover, discover why a sentence is ill-formed in order to flag the error and propose a correction.
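The one extension beyond the single graphic word that is mentioned above, the flagging of an immediately repeated word, needs no parsing at all. A minimal sketch follows; the function names are hypothetical, and the decision to ignore repeats separated by punctuation is an assumption of this example rather than a description of any product.

```python
import re

def find_doubled_words(text):
    """Return (word, offset) for each token that repeats the one before it.

    Only whitespace (including a line break) may separate the two copies,
    so "the\nthe" is flagged but "that. That" is not.
    """
    hits = []
    prev_word, prev_end = None, 0
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().lower()
        gap = text[prev_end:match.start()]
        if word == prev_word and gap.isspace():
            hits.append((word, match.start()))
        prev_word, prev_end = word, match.end()
    return hits

def remove_doubled_words(text):
    """Propose a correction by keeping only the first copy of each repeat."""
    return re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text, flags=re.IGNORECASE)
```

Such a check will also flag legitimate sequences like had had, so the proposed correction can only be offered to the user, not applied silently.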
Let me emphasize, at this point, that a grammar program that does not offer useful suggestions of how the sentence is to be corrected is of little use and is likely to be quickly discarded by a frustrated user. Just flagging a sentence as erroneous and proposing some uninformative advice ("rewrite this sentence") is likely to be resented; if the error is one stemming from the ignorance of the particular grammatical structure, as is often the case, the user has few printed resources to turn to for a solution: there is not even the equivalent of an alphabetically ordered dictionary that can be used to find proper spelling. In other words, a grammar corrector must be just that. A grammar verifier or a "corrector" that offers little more than vague counsel will not be a successful proofreading tool. Nevertheless, by setting realistic limits, some success has been achieved. There are now, to the best of my knowledge, three commercially available products that are, or at least claim to be, grammar correctors for English. All three are available on the DOS platform and the first two for the Macintosh as
well: Correct Grammar is marketed by WordStar International, Grammatik IV and Macintosh Grammatik by Reference Software, and RightWriter by RightSoft, Inc. Of these, the last is predominantly a cliche checker (flagging any expression that it has stored in its list of undesirable stylistic habits), and only to a very small extent a syntactic product. Correct Grammar, on the other hand, is a licensed version of Houghton Mifflin's CorrecText, which is the trademark of a serious syntactically-based parser and analyzer of ill-formed strings. To a computational linguist the practical interest of the software community in producing grammar correctors is both a source of joy and frustration. The joy comes particularly from the recognition that here, too, we have an area in which previous corpus work, primarily syntactic tagging and, perhaps most importantly, the statistical information about grammatical properties of English provided by a tagged corpus can prove useful. Many results of the analysis of tagged corpora have been either published or made available on tapes, and have been used, sometimes with permission and at other times probably without, in the research and construction underlying the nascent grammar correctors. In either case, we can at the very least take some pride in an intellectual payoff and, occasionally, even profit from a real one. The frustration, which I find at times almost overwhelming, is due to an almost complete ignorance of the software community (and, needless to say, the self-appointed "expert" reviewers of the popular computer magazines) of the levels of linguistic structure. 
Concretely, I have found it almost impossible to get across the idea to the "other" culture that, for instance, there is a difference between syntactic and stylistic properties of a sentence and that some errors are due to the violation of the rules of English syntax while others are perhaps poor, redundant or hackneyed expressions but can still be quite well-formed syntactically. To the other culture, grammar is all the things above spelling, all mixed together. Syntax errors, cliches, redundancies, split infinitives and, yes, the passive voice, the great American bugaboo, all get thrown together into one huge pot labelled "grammar." Even the best of the products, Correct Grammar, has added to the original algorithm a feature that flags every passive sentence and some which only look like passive (e.g. "I am determined to throw you out!"), and accompanies that by a sanctimonious message that the passive voice may make comprehension difficult. Fortunately, in the latest version the passive checking feature can be turned off. Grammatik IV is an even greater offender by providing false statistics about the passive transgression at the end of its analysis. Since this product cannot properly identify phrase boundaries within a sentence but does, after a fashion, count the number of sentences containing real or
putative passives, the poor writer may find himself accused of having been guilty of 50% passive use when, in actuality, he may have had only some 10% of passive phrases in the entire document. I have tried to convince my software colleagues with a few factual and rational arguments: that the passive exists for a reason and has a function. I have counted the passive phrases in the Declaration of Independence and in the Gettysburg Address (about 13%) and tried to point out that, for example, the expression "all men are created equal" would, if rendered in the active voice, be not only much less effective stylistically but possibly even do some violence to the separation of church and state. I have even provided my friends with reliable statistics about the number of passives in finite predications in the Brown Corpus (about 11.07% in the entire corpus but as high as 24.15% in the Learned genre, cf. Francis - Kucera 1982: 554-555) and pointed out on numerous examples that in certain types of writing one simply cannot manage without the passive voice. All to little avail: the schoolmarm dogma that passives are bad, taken over by some teachers of journalism, appears to be unconquerable. But if one is willing to live with that, turn it off whenever the program allows the option and disregard some of the other oddities (such as charges of verbosity or extensive sentence length), a grammar corrector can help in the proofreading task, even if the best of them probably catches no more than some 80% of true, and potentially embarrassing, grammatical errors, whether they are due to carelessness or ignorance. By and large then, my feelings here - as an old corpus hand - are those of achievement, some real and some still on the horizon.
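The gap between the two passive percentages discussed above is a matter of the unit counted. The sketch below uses an invented ten-sentence document, constructed only to reproduce the 50% and 10% figures of the argument; the data layout and the function name are assumptions of this example.

```python
def passive_rates(document):
    """Return (share of sentences containing a passive,
               share of finite predications that are passive).

    `document` is a list of sentences; each sentence is a list of booleans,
    one per finite predication, True when that predication is passive.
    """
    sentences_with_passive = sum(1 for sentence in document if any(sentence))
    predications = sum(len(sentence) for sentence in document)
    passives = sum(sum(sentence) for sentence in document)
    return sentences_with_passive / len(document), passives / predications

# Invented data: ten sentences of five predications each,
# five of which contain exactly one passive.
with_passive = [True, False, False, False, False]
without_passive = [False] * 5
DOCUMENT = [with_passive] * 5 + [without_passive] * 5

sentence_rate, predication_rate = passive_rates(DOCUMENT)
```

Counted by sentence the writer is charged with 50% passive use; counted by finite predication, the unit used for the Brown Corpus figures, the rate is 10%.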
Other languages
As I pointed out in the introductory paragraphs to this paper, my discussion had to focus on English, simply to keep the length of this contribution manageable. Before concluding, however, I would like to point out some of the additional computational challenges presented by some foreign languages in comparison with English. The major one, of course, is the much greater number of inflected forms. When the verbal system of French or of Spanish generates some forty inflected forms per verb, the question clearly arises whether a "literal" word list, i.e. one including all possible forms, is suitable even for a spelling checker of these languages. Clearly, such a word list would be considerably larger than a list of comparable coverage in English and, without clever compression, possibly much too large for effective
personal-computer use. The alternative is then to enter, in a word list of a highly inflected language, stems only, each carefully coded for part-of-speech information and the proper inflections that it admits, including modifications in the base that a combination of an inflectional suffix may trigger. In essence then, the spelling checker would now be based on a grammatically and morphologically annotated word list which would also have a future potential use in a grammar checker. While, in French or Spanish, such techniques would reduce a literal dictionary perhaps to a quarter of its size, this simple figure must be viewed in conjunction with the efforts needed for its construction as well as the actual computer memory requirements. The annotation system, while certainly feasible (we have actually successfully done it for Spanish) is very complex and, in any practical implementation, requires the availability of a computer-readable dictionary with good morphological classification codes for its entries. The algorithm that relates an inflected form to its base and decides whether the string is indeed well-formed is also vastly more complex than the simple lookup of forms in a literal word list. We then have here just another example of a familiar tradeoff in designing computer programs: larger memory requirements and simpler and faster algorithms, or the reverse. In highly inflected languages, such as Russian or most of the other members of the Slavic family, the base + affix approach is clearly indicated. A Czech spelling verifier developed in Prague that I tested recently works impressively well precisely on this basis. It is only a verifier, however, since the offering of corrections in highly inflected systems, especially those that use a morphological approach in their analysis, turns out to be extremely difficult. 
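The stem-plus-inflection organisation described above can be sketched in a few lines. The inflection classes, the handful of Spanish stems and the function names are hypothetical; only the present indicative is listed, and the modifications of the base that some suffixes trigger are deliberately left out.

```python
# Hypothetical inflection classes: each names the endings it admits.
ENDINGS = {
    "verb-ar": ("o", "as", "a", "amos", "áis", "an"),
    "verb-er": ("o", "es", "e", "emos", "éis", "en"),
    "noun-s": ("", "s"),
}

# Stems only, each coded for the inflection class it belongs to.
STEMS = {
    "habl": "verb-ar",
    "com": "verb-er",
    "libro": "noun-s",
    "casa": "noun-s",
}

def analyse(word):
    """Return every (stem, ending, class) split licensed by the lexicon."""
    word = word.lower()
    analyses = []
    for cut in range(1, len(word) + 1):
        stem, ending = word[:cut], word[cut:]
        inflection_class = STEMS.get(stem)
        if inflection_class is not None and ending in ENDINGS[inflection_class]:
            analyses.append((stem, ending, inflection_class))
    return analyses

def is_well_formed(word):
    """Verify a form: it is accepted if at least one analysis exists."""
    return bool(analyse(word))
```

Here `is_well_formed("hablamos")` succeeds and `is_well_formed("hablimos")` fails, without either form being stored. The tradeoff named in the text is visible even at this scale: the lexicon is smaller than a literal list, but every lookup now runs an analysis loop instead of a single table probe.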
It should also be said that a grammatically and morphologically annotated word list of a similar kind is needed even in English when it comes to grammar correction. While this is an easier task in English in one respect - the number of inflections needed to be considered - it is more difficult in other ways, particularly because of the categorial ambiguity of English lexical items: in Russian we very rarely find that an identical word form can function either as a noun or as a verb, something which, of course, is exceedingly common in English. German and the Scandinavian languages present yet another formidable problem, even on the level of spelling verifiers: closed compounds. While there is clearly a practical limit to the number of elements in a closed compound, theoretically there is none and the set of possible well-formed compounds is, strictly speaking, non-finite; for practical purposes, the set is finite but very large. Such examples from German as the compound
Donaudampfschiffahrtsgesellschaftskapitänswitwe 'the widow of the captain of the Danube Steamship Company' are often cited as extreme examples of compounding. So large is the set even in normal practice, in fact, that we have to abandon all hope of dealing with closed compounds within the framework of a literal database. A compound parsing algorithm is thus needed which can identify the individual elements of a compound and then determine whether each of them is acceptable. (Semantic appropriateness of a closed compound cannot be algorithmically determined, of course, at least not at present.) But compound parsing and error detection cannot be limited to a procedure of finding lexical boundaries, as difficult as that sometimes is. German compounds, for example, may or may not have linking elements that are used to put the compounding stems together. Whether a linking element is used or not is predictable only in terms of the individual words that form the compound. The words Heirat 'marriage' and Heimat 'fatherland' are grammatically identical (feminine nouns) and differ in only one stem letter. Nevertheless, they compound differently: Heirat as a left-hand element takes a linking -s (e.g. Heiratsantrag 'marriage proposal'); Heimat does not (e.g. Heimathafen 'home port'). A similar problem is presented by the linking element -n: Reise 'journey' and Reihe 'row' are again feminine nouns, differing in a single letter. Yet, the first does not allow a linking -n in compounding, as in Reiseführer 'travel guide,' but the second requires one: Reihenfolge 'succession'. There are solutions to these problems but they clearly require additional classification of stems in the dictionary and other linguistic work.
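A minimal sketch of such a compound parser follows, assuming that every stem in the dictionary is coded for the one linking element it requires as a left-hand member. The eight-word lexicon and the function name are hypothetical, and real German stems may admit more than one linking element, which this toy coding does not capture.

```python
# Hypothetical lexicon: stem -> linking element it requires as a
# left-hand compound member ("" means no linking element).
LEXICON = {
    "heirat": "s",
    "heimat": "",
    "reise": "",
    "reihe": "n",
    "antrag": "",
    "hafen": "",
    "führer": "",
    "folge": "",
}

def parse_compound(word):
    """Split a closed compound into stems and linking elements.

    Returns the list of elements, or None if no segmentation is licensed
    by the lexicon and its linking-element codes.
    """
    word = word.lower()
    if word in LEXICON:
        return [word]
    for cut in range(1, len(word)):
        stem, rest = word[:cut], word[cut:]
        if stem not in LEXICON:
            continue
        linker = LEXICON[stem]
        if not rest.startswith(linker):
            continue  # the required linking element is missing
        tail = parse_compound(rest[len(linker):])
        if tail is not None:
            return [stem] + ([linker] if linker else []) + tail
    return None
```

With this coding, `parse_compound("Heiratsantrag")` and `parse_compound("Reihenfolge")` succeed, while the ill-formed Heimatshafen and Reihefolge are rejected even though every stem in them is a good German word. As the text notes, the parser says nothing about whether an accepted compound makes sense.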
Conclusion
In the preceding paragraphs, I tried to demonstrate, albeit in a sketchy form, the role of corpus research and computational linguistics in the design of practical language tools for the millions of computer users who now use word processing programs to do most of their writing. In the past, I have sometimes faced skeptical comments from those who feel that they, as authors, can do better than most of these imperfect systems - and imperfect they are, without exception - and that an acquisition or use of such language aids is not worth the cost or the trouble. As someone who uses word processing very extensively and often, I find such arguments difficult to understand. No degree of arrogance so commonly heard ("I am a good speller!") will get us very far because none of us, unless in bored retirement or compulsively
masochistic, can guarantee never to transpose two letters, hit the wrong key, or even, God forbid, forget that the word supersede is spelled this way because it comes from the Latin word for sitting. It has been convincingly demonstrated, I think, that authors are their worst editors and, especially, the worst proofreaders of their own creations, simply because they read the manuscripts in anticipation of what they wished to say, not what actually occurs on the examined page. Since detailed professional editorial work in document production is hardly a realistic expectation, the facts clearly call for some high quality linguistic tools to aid authors in preparing reasonably well-formed samples of a given language, be it English or other tongues. That leaves us - as linguists - with the struggle for quality and that is, let me assure you, not an easy task. For the past decade, I have had the privilege and the frustration of being a frequent member of the odd couple of the linguist and the software engineer. It was often fun but even more often quite infuriating. The values are different, of course, as they must be: the software engineer, after all, is ultimately responsible for a working product that people will actually buy and use. Nevertheless, the battle for high quality language aids is well worth fighting: it can often make a difference. And it also gives us the satisfaction that the last twenty-eight years of corpus work were not only useful in advancing linguistic knowledge but brought something of an intellectual value to the community at large, regardless of what our colleagues in the Boston suburbs or the paraphrasers of Hermann Goering had to say about it.
References
Altenberg, Bengt 1991 "A bibliography of publications relating to English computer corpora", in: Stig Johansson - Anna-Brita Stenström (eds.), English computer corpora. Selected papers and research guide, 355-396. Berlin: Mouton de Gruyter.
Carroll, John B. 1967 "On sampling from a lognormal model of word-frequency distribution", in: Henry Kučera - W. Nelson Francis, 406-424.
Damerau, Fred J. 1964 "A technique for computer detection and correction of spelling errors", Communications of the ACM 7: 171-176.
Francis, W. Nelson - Henry Kučera 1979 Manual of information to accompany a standard sample of present-day American English. Providence: Brown University Press.
1982 Frequency analysis of English usage: lexicon and grammar. Boston: Houghton Mifflin Co.
Kučera, Henry - W. Nelson Francis 1967 Computational analysis of present-day American English. Providence: Brown University Press.
Peterson, James L. 1980 Computer programs for spelling correction: an experiment in program design. Berlin and New York: Springer-Verlag.
1986 "A note on undetected typing errors", Communications of the ACM 29: 633-637.
Snow, C.P. 1961 The two cultures and the scientific revolution. New York: Cambridge University Press.
Comments by Magnus Ljung
1. Introduction
One of the more obvious ways in which corpus linguistics can be of practical use is in providing a basis for the development of "language aids" for computer users, like spelling checkers and grammar (or style) checkers. This use of corpora is the topic of Henry Kučera's paper. The main aim of the paper is to outline the structure of present and future language aid programs and to discuss the contributions that corpus linguists can make in the development of such programs. Special emphasis is given to the numerous problems arising when corpus linguists and software developers attempt to pool their resources in the development of new programs. The paper contains a wealth of information about software and offers valuable insights into both the strengths and the weaknesses of existing programs. Implicitly it also raises the question how far language aids should be developed, a question with both practical and philosophical implications. In my brief comments here I will focus attention on both aspects of this question, i.e. on the practical issue how far it makes sense to develop language aid programs and on the "philosophical" question what the consequences of further development will be for the language and its users.
2. The limitations of language aids
Kučera's paper discusses two types of language aids, i.e. spelling checkers and grammar checkers. The less sophisticated spelling checkers simply match forms against a wordlist known as the "dictionary" and flag them if they cannot be found. More sophisticated programs also suggest replacements. The grammar checker should both help in resolving those spelling problems whose solution depends on the (surface) syntactic structure in which the word occurs and offer advice on "sensitive" points of grammar and style. The first function presupposes automatic parsing of the syntactic surface structure. The problem is that in many cases this will be of no help. A case in point is the claim in Kučera's paper that a misspelling of from as form can
be detected and corrected by a program which can analyse surface syntactic structure. There are contexts where this can be done. However, the usefulness of such a rule appears doubtful when we consider how easy it is to construct sentences which differ only in the fact that one contains the word form, the other the word from:
(1) Where did you take this from?
(2) Where did you take this form?
(3) I would like to see them from a circle.
(4) I would like to see them form a circle.
It is clearly impossible to write a rule able to determine from (1) and (2) or from (3) and (4) whether the writer intended to write form or from. Since mistakes of this type can only be detected by traditional proof-reading, little will be gained by introducing a rule able to handle only a subsection of the problems.
3. The problem of variation
Another practical problem for spelling checkers and "grammar checkers" alike is linguistic variation in the form of borrowing. A language like Swedish, for instance, imports English lexical items at what some regard as an alarming rate (cf. Chrystal 1988; Ljung 1985, 1988). Since none of the major dictionaries are able to keep up with this influx, we cannot expect the "dictionaries" of our word-processing programs to do any better. Admittedly, the texts with the greatest amount of borrowing represent somehow "special" genres like e.g. technical writing and scientific reports. General spelling checkers and grammar checkers, it might be argued, cannot provide for such texts. This leads to the question what kinds of texts the software producers do have in mind. Kučera's answer is that the language aids should facilitate "document production", a phrase suggesting office documents like reports, business letters and the like. Since the software producers' market is largely (but not exclusively!) to be found in the business world, this limitation is neither unexpected nor reprehensible. However, it brings with it certain assumptions about style and "good grammar" which I feel cannot be accepted. As Kučera himself points out, the public at large has no understanding of the difference between variable syntactic rules, on the one hand, and serious syntactic mistakes on the other. It comes as no surprise that this ignorance
is shared by the software engineers; indeed this is claimed to have been a source of constant irritation to Kučera in his attempts to co-operate with the software producers. As he himself admits, he has had little success in dissuading the software producers from banning the passive. He is also willing to accept the inclusion of certain other prescriptive rules like the ones banning "the use of adjectives as adverbs" (as in e.g. Good news travels slow) and the splitting of infinitives.
4. Conclusion
One consequence of this is to weaken the claim that "previous corpus work . . . can prove useful" in this kind of work: if it is impossible to dissuade the software producers from perpetuating popular prescriptive rules, then the impact of corpus work will be less important than it could have been. Another and more serious consequence has to do with the "ideological" implications of the whole enterprise. There is no doubt something to be said for certain kinds of built-in grammatical and maybe stylistic advice. But if a narrow range of prescriptive stylistic principles is to be made mandatory for everybody using a word-processing program, I see lurking in the background the re-emergence of the proverbial schoolmarm, Miss Fidditch. Previous Misses Fidditch, being non-electronic, had only limited success in their attempts to impose their prescriptive rules on users of the language. But our new electronic Miss Fidditch would be all-powerful and could eventually reduce all prose produced on word processors to a kind of Newspeak unsuspected even by Orwell. It is therefore vitally important that the stylistic advice provided by the word-processing programs is based on some reasonable notion of standard and "good usage". The only way to ensure that is of course to do precisely what Henry Kučera has been doing, i.e. try to reach a compromise between the two cultures. It is clear from Kučera's account that this is no easy matter. Still one must be grateful that there are corpus linguists like Kučera who are willing to fight the good fight.
References
Chrystal, Judith A. 1988 Engelskan i svensk dagspress. Skrifter utgivna av Svenska Språknämnden 74. Stockholm: Esselte Studium.
Ljung, Magnus 1985 Lam anka - ett måste? EIS Report No. 8. Department of English, University of Stockholm.
1988 Skinheads, hackers och lama ankor: engelskan i 80-talets svenska. Stockholm: Trevi.
Probabilistic parsing Geoffrey Sampson
We want to give computers the ability to process human languages. But computers use systems of their own which are also called "languages", and which share at least some features with human languages; and we know how computers succeed in processing computer languages, since it is humans who have arranged for them to do so. Inevitably there is a temptation to see the automatic processing of computer languages as a precedent or model for the automatic processing (and perhaps even for the human processing) of human languages. In some cases the precedent may be useful, but clearly we cannot just assume that human languages are similar to computer languages in all relevant ways. In the area of grammatical parsing of human languages, which seems to be acknowledged by common consent as the central problem of natural language processing - "NLP" - at the present time, I believe the computer-language precedent may have misled us. One of the ideas underlying my work is that human languages, as grammatical systems, may be too different from computer languages for it to be appropriate to use the same approaches to automatic parsing. Although the average computer scientist would probably think of natural-language parsing as a somewhat esoteric task, automatic parsing of computer programming languages such as C or Pop-11 is one of the most fundamental computing operations; before a program written in a user-oriented programming language such as these can be run it must be "compiled" into machine code - that is, automatically translated into a very different, "low level" programming language - and compilation depends on extracting the grammatical structure by virtue of which the C or Pop-11 program is well-formed. To construct a compiler capable of doing this, one begins from a "production system" (i.e. a set of rules) which defines the class of well-formed programs in the relevant user-oriented language.
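To make the notion of a production system concrete, here is a minimal sketch: a toy set of rules for arithmetic expressions and a generic recognizer that is driven entirely by those rules. The grammar, the token names and the function names are assumptions of this example; the recognizer is a simple backtracking top-down procedure, not the algorithm of any actual compiler.

```python
# A toy production system: each nonterminal maps to its alternative
# right-hand sides. Symbols not listed as keys are terminals.
GRAMMAR = {
    "Expr": [["Term", "+", "Expr"], ["Term"]],
    "Term": [["Factor", "*", "Term"], ["Factor"]],
    "Factor": [["(", "Expr", ")"], ["num"]],
}

def derive(symbol, tokens, pos):
    """Yield every position at which `symbol` can end if it starts at `pos`."""
    if symbol not in GRAMMAR:  # terminal: must match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        positions = [pos]
        for part in production:
            positions = [end for start in positions
                         for end in derive(part, tokens, start)]
        yield from positions

def is_well_formed(tokens):
    """Accept only token strings that the rules generate in full."""
    return len(tokens) in derive("Expr", tokens, 0)
```

Here `is_well_formed(["num", "+", "num", "*", "num"])` is true and `is_well_formed(["num", "+"])` is false: the rules alone decide, and anything they do not generate is rejected outright. Changing `GRAMMAR` changes the language accepted without touching the recognizer.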
In fact there exist software systems called "compiler-compilers" or "parser generators" which accept as input a production system for a language and automatically yield as output a parser for the language. To the computer scientist it is self-evident that parsing is based on rules for well-formedness in a language. If one seeks to apply this concept to natural languages, an obvious question is whether rules of well-formedness can possibly be as central for processing
natural languages, which have grown by unplanned evolution and accretion over many generations, as they are for processing formal programming languages, which are rule-governed by stipulation. What counts as a valid C program is fixed by Brian Kernighan and Dennis Ritchie - or, now, by the relevant American National Standards Institute committee - and programmers are expected to learn the rules and keep within them. If a programmer inadvertently tries to extend the language by producing a program that violates some detail (perhaps a very minor detail) of the ANSI rules, it is quite all right for the compiler software to reject the program outright. In the case of speakers of natural languages it is not intuitively obvious that their skill revolves in a similar way round a set of rules defining what is well-formed in their mother tongue. It is true that I sometimes hear children or foreigners producing English utterances that sound a little odd, but it seems to me that (for me and for other people) the immediate response is to understand the utterance, and noticing its oddity is a secondary phenomenon if it occurs at all. It is not normally a case, as the compiler model might suggest, of initially hearing the utterance as gibberish and then resorting to special mental processes to extract meaning from it nevertheless. These points seem fairly uncontroversial, and if the period when linguistics and computer science first became heavily involved with one another had been other than when it was (namely about 1980) they might have led to widespread scepticism among linguists about the applicability of the compiler model to natural language parsing. But intellectual trends within linguistics at that time happened to dovetail neatly with the compiler model. 
The 1970s had been the high point of Noam Chomsky's intellectual dominance of linguistics - the period when university linguistics departments routinely treated "generative grammar" as the centrepiece of their first-year linguistics courses, as the doctrine of the phoneme had been twenty years earlier - and, for Chomsky, the leading task of linguistics was to discover how to formulate a rigorous definition specifying the range of well-formed sentences for a natural language. Chomsky's first book began: " . . . The fundamental aim in the linguistic analysis of a language L is to separate the grammatical sequences which are the sentences of L from the ungrammatical sequences which are not sentences of L . . . " (Chomsky 1957: 13). Chomsky's reason for treating this aim as fundamental did not have to do with automatic parsing, which was not a topic that concerned him. Chomsky has often, with some justification, contradicted writers who suggested that his kind of linguistics is preoccupied with linguistic automation, or that he believes the mind "works like a computer". In the context of Chomsky's thought, the reason to construct grammars which generate "all and only" the
grammatical sentences of different natural languages was his belief that these grammars turn out to have various highly specific and unpredicted formal properties, which do not differ from one natural language to another, and that these universals of natural grammar are a proof of the fact that (as Chomsky later put it) "we do not really learn language; rather, grammar grows in the mind" (Chomsky 1980: 134) - and that, more generally, an adult's mental world is not the result of his individual intellectual inventiveness responding to his environment but rather, like his anatomy, is predetermined in much of its detail by his genetic inheritance. For Chomsky formal linguistics was a branch of psychology, not of computer science. This idea of Chomsky's that linguistic structure offers evidence for genetic determination of our cognitive life seems, when the arguments are examined carefully, to be quite mistaken (cf. Sampson 1980; 1989). But it was influential for many years; and its significance for present purposes is that, when linguists and computer scientists began talking to one another, it led the linguists to agree with the computer scientists that the important thing to do with a natural language was to design a production system for it. What linguists call a "generative grammar" is what computer scientists call a "production system". 1 Arguably, indeed, the rise of NLP gave generative grammar a new lease of life within linguistics. About 1980 it seemed to be losing ground in terms of perceived centrality in the discipline to less formal, more socially-oriented trends, but during the subsequent decade there was a striking revival of formal grammar theory. Between them, then, these mutually-reinforcing traditions made it seem inevitable that the way to produce parsers for natural languages was to define generative grammars for them and to derive parsers from the generative grammars as compilers are derived from computer-language production systems.
Nobody thought the task would be easy: a natural language grammar would clearly be much larger than the grammar for a computer language, which is designed to be simple, and Chomsky's own research had suggested that natural-language generative grammars were also formally of a different, less computationally tractable type than the "context-free" grammars which are normally adequate for programming languages. But these points do not challenge the principle of parsing as compilation, though they explain why immediate success cannot reasonably be expected. I do not want to claim that the compiler model for natural language parsing is necessarily wrong; but the head start which this model began with, partly for reasons of historical accident, has led alternative models to be unjustifiably overlooked. Both Geoffrey Leech's group and mine - but as yet few other computational-linguistics research groups, so far as I know - find it more
428
Geoffrey
Sampson
natural to approach the problem of automatically analysing the grammatically rich and unpredictable material contained in corpora of real-life usage via statistical techniques somewhat akin to those commonly applied in the field of visual pattern recognition, rather than via the logical techniques of formal-language compilation. To my mind there are two problems about using the compiler model for parsing the sort of language found in resources such as the LOB Corpus. The first is that such resources contain a far greater wealth of grammatical phenomena than standard generative grammars take into account. Since the LOB Corpus represents written English, one obvious example is punctuation marks: English and other European languages use a variety of these, they have their own quite specific grammatical properties, but textbooks of linguistic theory never in my experience comment on how punctuation marks are to be accounted for in grammatical analyses. This is just one example: there are many, many more. Personal names have their own complex grammar - in English we have titles such as Mrs, Dr which can introduce a name, titles such as Ph.D., Bart which can conclude one, Christian names can be represented by initials but surnames in most contexts cannot, and so on and so forth - yet the textbooks often gloss all this over by offering rules which rewrite "NP" as "ProperName" and "ProperName" simply as John, Mary, ... Addresses have internal structure which is a good deal more complex than that of personal names (and highly language-specific - English addresses proceed from small to large, e.g. house name, street, town, county, while several European languages place the town before the smaller units, for instance), but they rarely figure in linguistics textbooks at all. Money sums (note the characteristic grammar of English £2.50 v. 
Portuguese 2$50), weights and measures, headlines and captions, and many further items are part of the warp and weft of real-life language, yet are virtually invisible in standard generative grammars. In one sense that is justifiable. We have seen that the original motivation for generative linguistic research had to do with locating possible genetically-determined cognitive mechanisms, and if such mechanisms existed one can agree that they would be more likely to relate to conceptually general areas of grammar, such as question-formation, than to manifestly culture-bound areas such as the grammar of money or postal addresses. But considerations of that kind have no relevance for NLP as a branch of practical technology. If computers are to deal with human language, we need them to deal with addresses as much as we need them to deal with questions. The examples I have just quoted are all characteristic more of written than spoken language; my corpus experience has until recently been mainly with
Probabilistic
parsing
429
the LOB and Brown Corpora of written English. But my group has recently begun to turn its attention to automatic parsing of spoken English, working with the London-Lund Corpus, and it is already quite clear that this too involves many frequent phenomena which play no part in standard generative grammars. Perhaps the most salient is so-called "speech repairs", whereby a speaker who notices himself going wrong backtracks and edits his utterance on the fly. Standard generative grammars would explicitly exclude speech repairs as "performance deviations", and again for theoretical linguistics as a branch of cognitive psychology this may be a reasonable strategy; but speech repairs occur, they fall into characteristic patterns, and practical automatic speech-understanding systems will need to be able to analyse them. Furthermore, even in the areas of grammar which are common to writing and speech and which linguists would see as part of what a language description ought (at least ideally) to cover, there is a vast amount to be done in terms of listing and classifying the phenomena that occur. Many constructions are omitted from theoretical descriptions not for reasons of principle but because they are not very frequent and/or do not seem to interact in theoretically-interesting ways with central aspects of grammar, and although they may be mentioned in traditional descriptive grammars they are not systematically assigned places in explicit inventories of the resources of the language. One example among very many might be the English the more . . . the more ... construction discussed by Fillmore - Kay - O'Connor (1988), an article which makes some of the same points I am trying to make about the tendency for much of a language's structure to be overlooked by the linguist. All this is to say, then, that there is far more to a natural language than generative linguistics has traditionally recognized. 
That does not imply that comprehensive generative grammars cannot be written, but it does mean that the task remains to be done. There is no use hoping that one can lift a grammar out of a standard generative-linguistic definition of a natural language and use it with a few modifications as the basis of an adequate parser. But the second problem, which leads me to wonder whether reasonably comprehensive generative grammars for real-life languages are attainable even in principle, is the somewhat anarchic quality of much of the language one finds in resources such as LOB. If it is correct to describe linguistic behaviour as rule-governed, this is much more like the sense in which car-drivers' behaviour is governed by the Highway Code than the sense in which the behaviour of material objects is governed by the laws of physics, which can never be violated. When writing carefully for publication, we do stick to most of the rules, and with a police car behind him an Englishman keeps
to 30 m.p.h. in a built-up area. But any rule can be broken on occasion. If a tree has fallen on the left side of the road, then common sense overrides the Highway Code and we drive cautiously round on the right. With no police near, "30 m.p.h." is interpreted as "not much over 40". So it seems to be with language. To re-use an example that I have quoted elsewhere (Garside - Leech - Sampson 1987: 19): a rule of English that one might have thought rock-solid is that the subject of a finite clause cannot consist wholly of a reflexive pronoun, yet LOB contains the following sentence, from a current-affairs magazine article by Bertrand Russell: Each side proceeds on the assumption that itself loves peace, but the other side consists of warmongers. Itself served better than plain it to carry the contrast with the other side, so the grammatical rule gives way to the need for a persuasive rhetorical effect. A tree has blocked the left-hand lane, so the writer drives round on the right and is allowed to do so, even though the New Statesman's copyeditor is behind him with a blue light on his roof. In this case the grammatical deviation, though quite specific, is subtle; in other cases it can be much more gross. Ten or fifteen years ago I am sure we would all have agreed about the utter grammatical impossibility of the sentence: *Best before see base of can. But any theory which treated it as impossible today would have to contend with the fact that this has become one of the highest-frequency sentences of written British English. Formal languages can be perfectly rule-governed by stipulation; it is acceptable for a compiler to reject a C program containing a misplaced comma. But with a natural language, either the rules which apply are not complete enough to specify what is possible and what is not possible in many cases, or if there is a complete set of rules then language-users are quite prepared to break them. 
I am not sure which of these better describes the situation, but, either way, a worthwhile NLP system has to apply to language as it is actually used: we do not want it to keep rejecting authentic inputs as "ill-formed". The conclusion I draw from observations like these is that, if I had to construct a generative grammar covering everything in the LOB Corpus in order to derive a system capable of automatically analysing LOB examples and others like them, the job would be unending. Rules would have to be multiplied far beyond the number found in the completest existing formal linguistic descriptions, and as the task of rule-writing proceeded one would
increasingly find oneself trying to make definite and precise statements about matters that are inherently vague and fluid. In a paper to an ICAME conference (Sampson 1987) I used concrete numerical evidence in order to turn this negative conclusion into something more solid than a personal wail of despair. I looked at statistics on the diversity of grammatical constructions found in the "Lancaster-Leeds Treebank", a c. 40,000-word subset of the LOB Corpus which I had parsed manually, in collaboration with Geoffrey Leech and his team, in order to create a database (described in Garside - Leech - Sampson 1987, Chapter 7; and Sampson 1991) to be exploited for our joint NLP activities. I had drawn labelled trees representing the surface grammatical structures of the sentences, using a set of grammatical categories that were chosen to be maximally uncontroversial and in conformity with the linguistic consensus, and taking great pains to ensure that decisions about constituent boundaries and category membership were consistent with one another across the database, but imposing no prior assumptions about what configurations of grammatical categories can and cannot occur in English. In Sampson (1987) I took the highest-frequency grammatical category (the noun phrase) and looked at the numbers of different types of noun phrase in the data, where a "type" of noun phrase is a particular sequence of one or more daughter categories immediately dominated by a noun phrase node. Types were classified using a very coarse vocabulary of just 47 labels for daughter nodes (14 phrase and clause classes, 28 word-classes, and five classes of punctuation mark), omitting many finer subclassifications that are included in the Treebank. 
There were 8328 noun phrase tokens in my data set, which between them represented 747 different types, but the frequency of the types was very various: the commonest single type (determiner followed by singular noun) accounted for about 14% of all noun phrase tokens, while many different types were represented by one token each. The particularly interesting finding emerged when I considered figures on the proportion of all noun phrase tokens belonging to types of not more than a set frequency in the data, and plotted a graph showing the proportion $p$ as a function of the threshold type-frequency $f$ (with $f$ expressed as a fraction of the frequency of the commonest type, so that $p = 1$ when $f = 1$). The 58 points for different observed frequencies fell beautifully close to an exponential curve, $p = f^{0.4}$. As the fraction $f$ falls, $f^{0.4}$ falls much more slowly: as we consider increasingly low-frequency constructions, the number of different constructions occurring at such frequencies keeps multiplying in a smoothly predictable fashion so that quite sizeable proportions of the data are accounted for even by constructions of the lowest frequencies. (More than
5% of the noun phrase tokens in my data set represented constructions which each occurred just once.) If this regular relationship were maintained in larger samples of data (this is admittedly a big "if" - as yet there simply do not exist carefully-analysed language samples large enough to allow the question to be checked), it would imply that even extremely rare constructions would collectively be reasonably frequent. One in a thousand noun phrase tokens, for instance, would represent some noun phrase type occurring not more than once in a thousand million words. Yet how could one hope to design a grammar that generates "all and only" the correct set of constructions, if ascertaining the set of constructions to be generated requires one to monitor samples of that size? Accordingly, our approach to automatic parsing avoids any use of the concept of well-formedness. In fitting a labelled tree to an input word-string, our system simply asks "What labelled tree over these words comes closest to being representative of the configurations in our database of parsed sentences?" The system does not go on to ask "Is that a grammatically 'legal' tree?" - in our framework this question has no meaning. 
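The arithmetic behind the extrapolation a few sentences above can be checked with a short sketch. It assumes that the fitted relationship p = f^0.4 continues to hold at arbitrarily low frequencies (the big "if" already noted), and it uses the rounded figures quoted in the text (8328 noun phrase tokens in roughly 40,000 words, with the commonest type covering about 14% of tokens), so the results are only approximate. The function and variable names are illustrative and are not taken from the original study.

```python
# Rounded figures quoted in the text; treat all results as approximate.
NP_TOKENS = 8328          # noun phrase tokens in the data set
WORDS = 40_000            # approximate size of the Treebank in words
COMMONEST_SHARE = 0.14    # share of tokens belonging to the commonest type
EXPONENT = 0.4            # exponent of the fitted curve p = f ** 0.4


def proportion(f: float) -> float:
    """Proportion p of tokens in types whose frequency is at most f,
    where f is relative to the frequency of the commonest type."""
    return f ** EXPONENT


def threshold(p: float) -> float:
    """Inverse of proportion(): the relative frequency f at which p is reached."""
    return p ** (1 / EXPONENT)


# Types seen exactly once have f = 1 / (token count of the commonest type).
commonest_count = COMMONEST_SHARE * NP_TOKENS
p_once = proportion(1 / commonest_count)

# Frequency threshold below which one noun phrase token in a thousand falls,
# converted into "one occurrence per this many words".
commonest_per_word = commonest_count / WORDS
words_per_occurrence = 1 / (threshold(0.001) * commonest_per_word)

print(f"tokens in once-only types: {p_once:.1%}")
print(f"words per occurrence at p = 0.001: {words_per_occurrence:.2e}")
```

Under these assumptions the once-only types come out at a little over 5% of tokens, and the one-in-a-thousand threshold corresponds to a type occurring about once in a thousand million words, in line with the figures stated in the text.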
This general concept of parsing as maximizing conformity with statistical norms is, I think, common to the work of Geoffrey Leech's team at Lancaster and my Project APRIL, sponsored by the Royal Signals & Radar Establishment at the University of Leeds and directed by myself and, after my departure from the academic profession, by Robin Haigh.2 There are considerable differences between the deterministic techniques used by Leech's team (see, for example, Garside - Leech - Sampson 1987, Chapter 6) and the stochastic APRIL approach, and I can describe only the latter; but although the APRIL technique of parsing by stochastic optimization was an invention of my own, I make no claim to pioneer status with respect to the general concept of probabilistic parsing - this I borrowed from the Leech team. The APRIL system is described for instance in Sampson - Haigh - Atwell (1989). In broad outline the system works like this. We assume that the desired analysis for any input string is always going to be a tree structure with labels drawn from an agreed vocabulary of grammatical categories. For any particular input, say w words in length, the range of solutions available to be considered in principle is simply the class of all distinct tree-structures having w terminal nodes and having labels drawn from the agreed vocabulary on the nonterminal nodes. The root node of a tree is required to have a specified "sentence" label, but apart from that any label can occur on any nonterminal node: a complex, many-layered tree over a long sentence in which every single node between root and "leaves" is labelled "prepositional phrase", say, would in APRIL terms not be an "illegal / ill-formed / ungrammatical"
tree, it would just be a quite poor tree in the sense that it would not look much like any of the trees in the Treebank database. Parsing proceeds by searching the massive logical space of distinct labelled trees to find the best. There are essentially two problems: how is the "goodness" of a labelled tree measured, and how is the particular tree that maximizes this measure located (given that there will be far too many alternative solutions in the solution-space for each to be checked systematically)? The answer to the first question is that individual nodes of a tree are assigned figures of merit by reference to probabilistic transition networks. Suppose some node in a tree to be evaluated is labelled X, and has a sequence of daughters labelled P Q R. This might be a sequence which would commonly be found below an X node in correctly-parsed sentences (if X is "noun phrase", P Q R might be respectively "definite article", "singular noun", "relative clause", say), or it might be some absurd expansion for X (say, "comma", "prepositional phrase", "adverbial clause"), or it might be something in between. For the label X (and for every other label in the agreed vocabulary) the system has a transition network which - ignoring certain complications for ease of exposition - includes a path (designed manually) for each of the sequences commonly found below that label in accurately parsed material, to which skip arcs and loop arcs have been added automatically in such a fashion that any label-string whatever, of any length, corresponds to some route through the network. (Any particular label on a high-frequency path can be bypassed via a skip arc, and any extra label can be accepted at any point via a loop arc.) The way the network is designed ensures that (again omitting some complications) it is deterministic - whatever label sequence may be found below an X in a tree, there will be one and only one route by which the X network can accept that sequence. 
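A much-simplified sketch may help to make the idea of skip and loop arcs concrete. It is not the APRIL implementation: it assumes a network consisting of a single manually designed path (the real networks contain many), and it achieves determinism by one particular policy chosen here for illustration, namely matching each daughter label against the earliest remaining position on the path, skipping what lies before it, and absorbing any label not found on the rest of the path with a loop arc. The names `route`, `MAIN`, `SKIP`, and `LOOP` are hypothetical.

```python
MAIN, SKIP, LOOP = "main", "skip", "loop"


def route(path, labels):
    """Return the single list of arcs by which `path` accepts `labels`.

    Each arc is a (kind, position) pair. Any label sequence at all is
    accepted, and the greedy matching policy yields exactly one route.
    """
    arcs = []
    state = 0  # number of path positions already passed
    for label in labels:
        if label in path[state:]:
            target = path.index(label, state)
            # Bypass every expected label before the match via skip arcs.
            arcs.extend((SKIP, i) for i in range(state, target))
            arcs.append((MAIN, target))
            state = target + 1
        else:
            # Unexpected label: absorb it with a loop arc at this state.
            arcs.append((LOOP, state))
    # Skip whatever remains of the manually designed path.
    arcs.extend((SKIP, i) for i in range(state, len(path)))
    return arcs


noun_phrase_path = ["definite article", "singular noun", "relative clause"]

# A common expansion follows the main path; odd ones use skip and loop arcs.
print(route(noun_phrase_path, ["definite article", "singular noun", "relative clause"]))
print(route(noun_phrase_path, ["definite article", "relative clause"]))
print(route(noun_phrase_path, ["comma", "singular noun"]))
```

Because every label sequence yields exactly one list of arcs, counting how often each arc is traversed by the daughter sequences in a parsed database gives a natural basis for estimating arc probabilities.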
Probabilities are assigned to the arcs of the networks for X and the other labels in the vocabulary by driving the trees of the database over them, which will tend to result in arcs on the manually-designed routes being assigned relatively high probabilities and the automatically-added skip and loop arcs being assigned relatively low probabilities. (One might compare the distinction between the manually-designed high-frequency routes and the routes using automatically-added skip or loop arcs to Chomsky's distinction between ideal linguistic "competence" and deviant "performance" - though this comparison could not be pressed very far: the range of constructions accepted by the manually-designed parts of the APRIL networks alone would not be equated, by us or by anyone else, with the class of "competent / well-formed" constructions.) Then, in essence, the figure of merit assigned to any labelled tree is the product of the probabilities
associated with the arcs traversed when the daughter-strings of the various nodes of the tree are accepted by the networks. As for the second question: APRIL locates the best tree for an input by a stochastic optimization technique, namely the technique of "simulated annealing" (see, for example, Kirkpatrick - Gelatt - Vecchi 1983; Aarts - Korst 1989). That is, the system executes a random walk through the solution space, evaluating each random move from one labelled tree to another as it is generated, and applying an initially weak but steadily growing bias against accepting moves from "better" to "worse" trees. In this way the system evolves towards an optimal analysis for an input, without needing initially to know whereabouts in the solution space the optimum is located, and without getting trapped at "local minima" - solutions which are in themselves suboptimal but happen to be slightly better than each of their immediately-neighbouring solutions. Stochastic optimization techniques like this one have something of the robust simplicity of Darwinian evolution in the natural world: the process does not "know where it is going", and it may be subject to all sorts of chance accidents on the way, but in the long run it creates highly-valued outcomes through nothing more than random mutation and a tendency to select fitter alternatives. As yet, APRIL'S performance leaves plenty to be desired. Commonly it gets the structure of an input largely right but with various local errors, either because the tree-evaluation function fails to assign the best score to what is in fact the correct analysis, or because the annealing process "freezes" on a solution whose score is not the best available, or both. 
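The general shape of such a search can be sketched in a few lines. What follows is a generic simulated-annealing loop, not APRIL's own code: the geometric cooling schedule, the parameter values, the best-so-far bookkeeping, and the toy one-dimensional "landscape" standing in for the space of labelled trees are all assumptions made for illustration. Scores are maximized here, so the traps to be escaped are local maxima; a real parser would use the figure of merit (conveniently in logarithmic form, as in the hypothetical helper `log_figure_of_merit`) as its score.

```python
import math
import random


def log_figure_of_merit(arc_probabilities):
    """Log of the product of arc probabilities (summing logs avoids underflow)."""
    return sum(math.log(p) for p in arc_probabilities)


def anneal(initial, neighbour, score, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Random walk with a steadily growing bias against worsening moves."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    best, best_score = current, current_score
    for step in range(steps):
        # Geometric cooling: the temperature falls from t_start to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbour(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Improvements are always accepted; a worsening move is accepted
        # with probability exp(delta / t), which shrinks as t falls.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


def toy_score(x):
    # Two peaks: a local one at x = 3 (height 2), the global one at x = 15 (height 5).
    return max(2 - abs(x - 3), 5 - abs(x - 15))


def toy_neighbour(x, rng):
    # Move one step left or right, staying inside the range 0..20.
    return min(20, max(0, x + rng.choice((-1, 1))))


# Start on the local peak; early high-temperature moves allow escape from it.
best, best_score = anneal(3, toy_neighbour, toy_score)
print(best, best_score)
```

In a parser of the kind described, `neighbour` would propose a small random change to a labelled tree (for example relabelling a node or moving a constituent boundary), and `score` would evaluate the resulting tree against the transition networks.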
Let me give one example, quoted from Sampson - Haigh - Atwell (1989), of the outcome of a run on the following sentence (taken from LOB text E23, input to APRIL as a string of wordtags): The final touch was added to this dramatic interpretation, by placing it to stand on a base of misty grey tulle, representing the mysteries of the human mind. According to our parsing scheme, the correct analysis is as follows (for the symbols, see Garside - Leech - Sampson 1987, Chapter 7, Section 5): [S[N the final touch] [Vp was added] [P to [N this dramatic interpretation]], [P by [Tg[Vg placing] [N it] [Ti [Vi to stand] [P on [N a base [P of [N misty grey tulle, [Tg [Vg representing] [N the mysteries [P of [N the human mind]]]]]]]]]]].] The analysis produced by APRIL on the run in question was as follows:
[S[N the final touch] [Vp was added] [P to [N this dramatic interpretation]], [P by [Tg [Vg placing] [N it] [Ti[Vi to stand] [P on [N a base of [P of [N misty grey tulle]]]], [Tg[Vg representing] [N the mysteries] [P of [N the human mind]]]]]].] That is, of the human mind was treated as an adjunct of representing rather than as a postmodifier of mysteries, and the representing clause was treated as an adjunct of placing rather than as a postmodifier of tulle. Our method of assessing performance gives this output a mark of 76%, which was roughly average for APRIL'S performance at the time (though some errors which would reduce the percentage score by no more than the errors in this example might look less venial to a human judge). We had then, and still have, a long way to go. But our approach has the great advantage that it is easy to make small incremental adjustments: probabilities can be adjusted on individual transition-network arcs, for instance, without causing the system to crash and fail to deliver any analysis at all for some input (as can relatively easily happen with a compiler-like parser); and the system does not care how grammatical the input is. The example above is in fact a rather polished English sentence; but APRIL would operate in the same fashion on a thoroughly garbled, ill-formed input, evolving the best available analysis irrespective of whether the absolute value of that analysis is high or low. Currently, much of our work on APRIL is concerned with adapting it to deal with spoken English, where grammatical ill-formedness is much commoner than in the edited writing of LOB. To delve deeper into the technicalities of APRIL would not be appropriate here. But in any case this specific research project is a less significant topic for the corpus linguistics community at large than is the general need, which this and related research has brought into focus for me, for a formal stocktaking of the resources of the languages we work with. 
Those of us who work with English think of our language as relatively thoroughly studied, yet we have no comprehensive inventory and classification, at the level of precision needed for NLP purposes, of the grammatical phenomena found in real-life written and/or spoken English usage; I surmise that the same is true for other European languages. By far the largest part of the work of creating the 40,000-word Lancaster-Leeds Treebank lay not in drawing the labelled trees for the individual sentences but in developing a set of analytic categories and maintaining a coherent body of precedents for their application, so as to ensure that anything occurring in the texts could be given a labelled tree structure, and that a decision to mark some sequence off as a constituent of a given category at
one point in the texts would always be consistent with decisions to mark off and label comparable sequences elsewhere. It is easy, for instance, to agree that English has a category of "adjective phrases" (encoded in our scheme as J), core examples of which would be premodified adjectives (very small, pale green); but what about cases where an adjective is followed by a prepositional phrase which expands its meaning, as in: they are alike in placing more emphasis . . . - should in placing . . . be regarded as a postmodifier within the J whose head is alike, or is alike a one-word J and the in placing ... sequence a sister constituent? There is no one answer to this question which any English linguist would immediately recognize as obviously correct; and, while some answer might ultimately be derived from theoretical studies of grammar, we cannot expect that theoreticians will decide all such questions for us immediately and with a single voice: notoriously, theories differ, theories change, and many of the tree-drawing problems that crop up have never yet been considered by theoretical grammarians. But, for probabilistic approaches to NLP, we must have some definite answer to this and very many other comparable issues. Statistics extracted from a database of parsed sentences in which some cases of adjective + prepositional phrase were grouped as a single constituent and other, linguistically indistinguishable cases were analysed as separate immediate constituents of the sentence node, on a random basis, would be meaningless and useless. 
Accordingly, much of the work of creating the database involved imposing and documenting decisions with respect to a multitude of such issues; one strove to make the decisions in a linguistically reasonable way, but the overriding principle was that it was more important to have some clearcut body of explicit analytic precedents, and to follow them consistently, than it was that the precedents should always be indisputably "correct". The body of "parsing law" that resulted was in no sense a generative grammar - it says nothing about what sequences "cannot occur" or are "ungrammatical", which is the distinctive property of a generative grammar - but what it does attempt to do is to lay down explicit rules for bracketing and labelling in a predictable manner any sequence that does occur, so that as far as possible two analysts independently drawing labelled trees for the same novel and perhaps unusual example of English would be forced by the parsing law to draw identical trees. Although my own motive for undertaking this precedent-setting task had to do with providing a statistical database to be used by probabilistic parsers, a thoroughgoing formal inventory of a language's resources is important for
NLP progress in other ways too. Now that NLP research internationally is moving beyond the preoccupation with artificially-simple invented examples that characterized its early years, there is a need for research groups to be able routinely to exchange quantities of precise and unambiguous information about the contents of a language; but at present this sort of information exchange is hampered in the domain of grammar by the fact that traditional terminology is used in inconsistent and sometimes vague ways. For instance, various English-speaking linguists use the terms "complement", or "predicate", in quite incompatible ways. Other terms, such as "noun phrase", are used much more consistently in the sense that different groups agree on core examples of the term; but traditional descriptive grammars, such as Quirk et al. (1985) and its lesser predecessors, do not see it as part of their task to define clearcut boundaries between terms that would allow borderline cases to be assigned predictably to one category or another. For computational purposes we need sharpness and predictability. What I am arguing for - I see it as currently the most pressing need in the NLP discipline - is taxonomic research in the grammatical domain that should yield something akin to the Linnaean taxonomy for the biological world. Traditional grammars describe constructions shading into one another, as indeed they do, but the analogous situation in biology did not prevent Linné imposing sharp boundaries between botanical species and genera. Linné said: Natura non facit saltus. Plantae omnes utrinque affinitatem monstrant, uti territorium in mappa geographica; but Linné imposed boundaries in this apparent continuum, as nineteenth-century European statesmen created colonial boundaries in the map of Africa. The arrangement of species and genera in the Linnaean system was artificial and in some respects actually conflicted with the natural (i.e. 
theoretically correct) arrangement, and Linné knew this perfectly well - indeed, he spent part of his career producing fragments of a natural taxonomy, as an alternative to his artificial taxonomy; but the artificial system was based on concrete, objective features which made it practical to apply, and because it did not have to wait on the resolution of theoretical puzzles Linné could make it complete. Artificial though the Linnaean system was, it enabled the researcher to locate a definite name for any specimen (and to know that any other botanist in the world would use the same name for that specimen), and it gave him something approaching an exhaustive conspectus of the "data elements" which a more theoretical approach would need to be able to cope with. If no-one had ever done what Linné did, then Swedish biologists would continually be wondering what British biologists meant (indeed, Lancastrian biologists would be wondering what Cambridge biologists meant) by, say,
cuckoo-pint, and whether cuckoo-pint, cuckoo flower, and ragged robin were one plant, two, or three. Since Linné, we all say Arum maculatum and we know what we are talking about. Computational linguistics, I feel, is still operating more or less on the cuckoo-pint standard. First let us do a proper stocktaking of our material, and then we shall have among other things a better basis for theoretical work. In one area an excellent start has already been made. Stig Johansson's Tagged LOB Corpus Manual (Johansson 1986) includes a great deal of detailed boundary-drawing between adjacent wordtags of the LOB tagset. Leech's and my groups have refined certain aspects of Johansson's wordclass taxonomy, making more distinctions in areas such as proper names and numerical and technical items, for instance, but we could not have done what we have done except by building on the foundation provided by Johansson's work; and it is interesting and surprising to note that, although Johansson (1986) was produced for one very specific and limited purpose (to document the tagging decisions in a specific tagged corpus), the book has to my knowledge no precedent in the level of detail with which it specifies the application of wordclass categories. One might have expected that many earlier linguists would have felt the need to define a detailed set of wordclasses with sufficient precision to allow independent analysts to apply them predictably: but apparently the need was not perceived before the creation of analysed versions of electronic corpora. With respect to grammatical structure above the level of terminal nodes, i.e. the taxonomy of phrases and clauses, nothing comparable to Johansson's work has been published. 
I have referred to my own, unpublished, work done in connexion with the Lancaster-Leeds Treebank; and at present this work is being extended under Project SUSANNE (sponsored by the Economic & Social Research Council), the goal of which is the creation of an analysed English corpus significantly larger than the Lancaster-Leeds Treebank, and analysed with a comparable degree of detail and self-consistency, but in conformity with an analytic scheme that extends beyond the purely "surface" grammatical notations of the Lancaster-Leeds scheme to represent also the "underlying" or logical structure of sentences where this conflicts with surface structure.3 We want to develop our probabilistic parsing techniques so that they deliver logical as well as surface grammatical analyses, and a prerequisite for this is a database of logically-parsed material. The SUSANNE Corpus is based on a grammatically-annotated 128,000-word subset of the Brown Corpus created at Gothenburg University in the 1970s by Alvar Ellegård and his students (Ellegård 1978). The solid work already done by Ellegård's team has enabled my group to aim to produce
a database of a size and level of detail that would otherwise have been far beyond our resources. But the Gothenburg product does have limitations (as its creator recognizes); notably, the annotation scheme used, while covering a rather comprehensive spectrum of English grammatical phenomena, is defined in only a few pages of instructions to analysts. As an inevitable consequence, there are inconsistencies and errors in the way it is applied to the 64 texts from four Brown genres represented in the Gothenburg subset. We are aiming to make the analyses consistent (as well as representing them in a more transparent notation, and adding extra categories of information); but, as a logically prior task, we are also formulating and documenting a much more detailed set of definitions and precedents for applying the categories used in the SUSANNE Corpus. Our strategy is to begin with the "surfacy" Lancaster-Leeds Treebank parsing scheme, which is already well-defined and documented internally within our group, and to add to it new notations representing the deep-grammar matters marked in the Gothenburg files but not in the Treebank, without altering the well-established Lancaster-Leeds Treebank analyses of surface grammar. (For most aspects of logical grammar it proved easier than one might have expected to define notations that diverse theorists should be able to interpret in their own terms.) Thus the outcome of Project SUSANNE will include an analytic scheme in which the surface-parsing standards of the Lancaster-Leeds parsing law are both enriched by a larger body of precedent and also extended by the addition of standards for deep parsing. (Because the Brown Corpus is American, the SUSANNE analytic scheme has also involved broadening the Lancaster-Leeds Treebank scheme to cover American as well as British usage.) 
Project SUSANNE is scheduled for completion in 1992; its product will subsequently be published, with the annotated corpus itself distributed in electronic form by the Oxford Text Archive, and the analytic scheme to which the annotations conform issued as a book by Oxford University Press. Corpus-builders have traditionally, I think, seen the manuals they write as secondary items playing a supporting role to the corpora themselves. My view is different. If our work on Project SUSANNE has any lasting value, I am convinced that this will stem primarily from its relatively comprehensive and explicitly-defined taxonomy of English grammatical phenomena. Naturally I hope - and believe - that the SUSANNE Corpus too will prove useful in various ways. But, although the SUSANNE Corpus will be some three times the size of the database we have used as a source of grammatical statistics to date, in terms of sheer size I believe that the SUSANNE and other existing analysed corpora described in Sampson (1991) are due soon to be eclipsed by much larger databases being produced in the USA, notably
the "Penn Treebank" being created by Mitchell Marcus of the University of Pennsylvania. The significance of the SUSANNE Corpus will lie not in size but in the detail, depth, and explicitness of its analytic scheme. (Marcus's Treebank uses a wordtag set that is extremely simple relative to that of the Tagged LOB or Brown Corpora - it contains just 36 tags (Santorini 1990); and, as I understand, the Penn Treebank will also involve quite simple and limited indications of higher-level structure, whether because the difficulty of producing richer annotations grows with the size of a corpus, or because Marcus wishes to avoid becoming embroiled in the theoretical controversies that might be entailed by commitment to any richer annotation scheme.) Even if we succeed perfectly in the ambitious task of bringing every detail of the annotations in 128,000 words of text into line with the SUSANNE taxonomic principles, one of the most significant long-term roles of the SUSANNE Corpus itself will be as an earnest of the fact that the rules of the published taxonomy have been evolved through application to real-life data rather than chosen speculatively. I hope our SUSANNE work may thus offer the beginnings of a "Linnaean taxonomy of the English language". It will be no more than a beginning; there will certainly be plenty of further work to be done. How controversial is the general programme of work in corpus linguistics that I have outlined in these pages? To me it seems almost self-evidently reasonable and appropriate, but it is easy to delude oneself on such matters. The truth is that the rise of the corpus-based approach to computational linguistics has not always been welcomed by adherents of the older, compilation-oriented approach; and to some extent my own work seems to be serving as the representative target for those who object to corpus linguistics. (I cannot reasonably resent this, since I myself have stirred a few academic controversies in the past.) 
In particular, a series of papers (Taylor - Grover - Briscoe 1989; Briscoe 1990) have challenged my attempt, discussed above, to demonstrate that individually rare constructions are collectively so common as to render unfeasible the aim of designing a production system to generate "all and only" the constructions which occur in real-life usage in a natural language. My experiment took for granted the relatively surfacy, theoretically middle-of-the-road grammatical analysis scheme that had been evolved over a series of corpus linguistics projects at Lancaster, in Norway, and at Leeds in order to represent the grammar of LOB sentences in a manner that would as far as possible be uncontroversial and accordingly useful to a wide range of researchers. But of course it is true that a simple, concrete theoretical approach which eliminates controversial elements is itself a particular theoretical approach, which
the proponents of more abstract theories may see as mistaken. Taylor et al. believe in a much more abstract approach to English grammatical analysis; and they argue that my findings about the incidence of rare constructions are an artefact of my misguided analysis, rather than being inherent in my data. Their preferred theory of English grammar is embodied in the formal generative grammar of the Alvey Natural Language Tools ("ANLT") parsing system (for distribution details see note 1 of Briscoe 1990); Taylor et al. use this grammar to reanalyse my data, and they argue that most of the constructions which I counted as low-frequency are generated by high-frequency rules of the ANLT grammar. According to Taylor et al., the ANLT system is strikingly successful at analysing my data-set, accurately parsing as many as 97% of my noun phrase tokens.⁴ My use of the theoretically-unenlightened LOB analytic scheme is, for Briscoe (1990), symptomatic of a tendency for corpus linguistics in general to operate as "a self-justifying and hermeneutically sealed sub-discipline". Several points in these papers seem oddly misleading. Taylor et al. repeatedly describe the LOB analytic scheme as if it were much more a private creation of my own than it actually was, thereby raising in their readers' minds a natural suspicion that problems such as those described in Sampson (1987) might well stem purely from an idiosyncratic analytic scheme which is possibly ill-defined, ill-judged, and/or fixed so as to help me prove my point. One example relates to the system, used in my 1987 investigation, whereby the detailed set of 132 LOB wordtags is reduced to a coarser classification by grouping certain classes of cognate tags under more general "cover tags". Referring to this system, Taylor et al. comment that "Sampson . . . does not explain the extent to which he has generalised types in this fashion"; "Sampson . . . 
gives no details of this procedure"; Briscoe (1990) adds that an attempt I made to explain the facts to him in correspondence "does not shed much light on the generalisations employed . . . as Garside - Leech - Sampson (1987) does not give a complete listing of cover-tags". In fact I had no hand in defining the system of cover tags which was used in my experiment (or in defining the wordtags on which the cover tags were based). The cover tags were defined, in a perfectly precise manner, by a colleague (Geoffrey Leech, as it happens) and were in routine use on research projects directed by Leech at Lancaster in which Lolita Taylor was an active participant. Thus, although it is true that my paper did not give enough detail to allow an outsider to check the nature or origin of the cover-tag system (and outside readers may accordingly have been receptive to the suggestions of Taylor et al. on this point), Taylor herself was well aware of the true situation. She (and, through her, Briscoe) had access to the details independently of
my publications, and independently of Garside - Leech - Sampson (1987). (They had closer access than I, since I had left Lancaster at the relevant time while Taylor et al. were still there.) Then, although Taylor - Grover - Briscoe (1989) and Briscoe (1990) claim that the ANLT grammar is very successful at parsing the range of noun phrase structures on which my research was based, the status of this claim is open to question in view of the fact that the grammar was tested only manually. The ANLT grammar was created as part of an automatic parsing system, and Taylor et al. say that they tried to check the data using the automatic parser but had to give up the attempt: sometimes parses failed not because of inadequacies in the grammar but because of "resource limitations", and sometimes so many alternative parses were generated that it was impractical to check whether these included the correct analysis. But anyone with experience of highly complex formal systems knows that it is not easy to check their implications manually. Even the most painstakingly designed computer programs turn out to behave differently in practice from what their creators intend and expect; and likewise the only satisfactory way to test whether a parser accepts an input is to run the parser over the input automatically. Although much of the text of Briscoe (1990) is word for word identical with Taylor et al. (1989), Briscoe suppresses the paragraphs explaining that the checking was done manually, saying simply that examples were "parsed using the ANLT grammar. Further details of this process . . . can be found in Taylor et al. (1989)" (the latter publication being relatively inaccessible). I was particularly surprised by the success rate claimed for the ANLT grammar in view of my own experience with this particular system. 
It happens that I was recently commissioned by a commercial client to develop assessment criteria for automatic parsers and to apply them to a range of systems; the ANLT parser was one of those I tested (using automatic rather than manual techniques), and its performance was strikingly poor both absolutely and by comparison with its leading competitor, SRI International's Core Language Engine ("CLE": Alshawi et al. 1989). I encountered no "resource limitation" problems - the ANLT system either found one or more analyses for an input or else finished processing the input with an explicit statement that no analyses were found; but the latter message repeatedly occurred in response to inputs that were very simple and unquestionably well-formed. Sentences such as Can you suggest an alternative?, Are any of the waiters students?, and Which college is the oldest? proved unparsable. (For application-related reasons my test-set consisted mainly of questions. I cite the examples here in normal orthography, though the ANLT system requires the orthography of its inputs to be simplified in various ways: for example, capitals must
be replaced by lower-case letters, and punctuation marks eliminated.) I did not systematically examine performance on the data-set of Sampson (1987), which was not relevant to the commission I was undertaking, but the grammar had features which appeared to imply limited performance on realistically complex noun phrase structures. The only form of personal name that seemed acceptable to the system was a one-word Christian name: the lexical coding system had no category more precise than "proper name". Often I could find no way to reconcile a standard form of real-life English proper name with the orthographic limitations imposed by the ANLT system on its inputs - I tried submitting the very standard type of sovereign's name King John VIII in each of the forms:
    king john viii
    king john 8
    king john eight
    king john the eighth
but each version led to parsing failure.⁵ It is true that the ANLT system tested by me was "Release 1", dated November 1987, while Taylor - Grover - Briscoe (1989) discuss also a "2nd release" dated 1989. But the purely manual testing described by Taylor et al. seems to me insufficient evidence to overcome the a priori implausibility of such a dramatic performance improvement between 1987 and 1989 versions as their and my findings jointly imply. A problem in any theoretically-abstract analytic approach is that depth of vision tends to be bought at the cost of a narrow focus, which overlooks part of the richness and diversity present in the data. Taylor et al. are open about one respect in which this is true of their approach to natural language parsing: in reanalysing my data they stripped out all punctuation marks occurring within the noun phrases, because "we do not regard punctuation as a syntactic phenomenon". That is, the range of constructions on which the ANLT parsing system is claimed to perform well is not the noun phrases of a 40,000-word sample of written English, but the noun phrases of a sample of an artificial language derived by eliminating punctuation marks from written English. With respect to my data-set this is quite a significant simplification, because more than a tenth of the vocabulary of symbols used to define the noun phrase structures are cover tags for punctuation marks. Of course, where the ANLT system does yield the right analysis for an input it is in one sense all the more admirable if this is achieved without exploiting the cues offered by punctuation. But on the other hand punctuation is crucial to many of the constructions which I have discussed above as needing more attention than
they have received from the computational linguistics of the past. A Harvard-style bibliographical reference, for instance, as in Smith (1985: 32) writes is largely defined by its use of brackets and colon. It would be unfortunate to adopt a theory which forced one to ignore an aspect of the English language as significant as punctuation, and I do not understand Taylor et al.'s attempt to justify this by denying that punctuation is a "syntactic phenomenon": punctuation is physically there, as much part of the written language as the alphabetic words are, and with as much right to be dealt with by systems for automatically processing written language. I do not believe that the choice of a concrete rather than abstract intellectual framework, which allows researchers to remain open to such phenomena, can reasonably be described as retreat into "a self-justifying and hermeneutically sealed sub-discipline". The most serious flaw in Taylor et al.'s paper is that they misunderstand the nature of the problem raised in Sampson (1987). According to Taylor et al., I assumed that in a generative grammar each distinct noun phrase type "will be associated with one rule", and I argued that "any parsing system based on generative rules will need a large or open-ended set of spurious 'rules' which . . . only apply once"; Taylor et al. point out, rightly, that a typical generative grammar will generate many of the individual constructions in my data-set through more than one rule-application, and consequently a relatively small set of rules can between them generate a relatively large range of constructions. But corpus linguists are not as ignorant of alternative approaches to linguistic analysis as Taylor et al. suppose. I had explicitly tried in my 1987 paper to eliminate the possibility of misunderstandings such as theirs by writing: "the problem is not that the number of distinct noun phrase types is very large. 
A generative grammar can define a large (indeed, infinitely large) number of alternative expansions for a symbol by means of a small number of rules." As I went on to say, the real problem lies in knowing which expansions should and which should not be generated. If extremely rare constructions cannot be ignored because they are collectively frequent enough to represent an important part of a language, then it is not clear how we could ever hope to establish the class of constructions, all of the (perhaps infinitely numerous) members of which and only the members of which should be generated by a generative grammar - even though, if such a class could be established, it may be that a generative grammar could define it using a finite number of rules. Briscoe (1990, note 3) comments on a version of this point which I made in a letter prompted by the publication of Taylor - Grover - Briscoe (1989), but in terms which suggest that he has not yet understood it. According to Briscoe, I "impl[y] that we should declare
rare types ungrammatical, by fiat, and not attempt to write rules for them". I have written nothing carrying this implication. Taylor et al. examine the residue of noun phrases in my data-set which they accept that the ANLT grammar cannot deal with, and they suggest various ways in which the ANLT rule-set might be extended to cope with such cases. Their suggestions are sensible, and it may well be that adopting them would improve the system's performance. My suspicion, though, is that with a real-life language there will be no end to this process. When one looks carefully to see where a rainbow meets the ground, it often looks easy to reach that spot; but we know that, having done so, one is no closer to the rainbow. I believe the task of producing an observationally adequate definition of usage in a natural language is like that. That is why I personally prefer to work on approaches to automatic parsing that do not incorporate any distinction between grammatical / well-formed / legal and ungrammatical / ill-formed / illegal. But let me not seem to claim too much. The compilation model for language processing has real virtues: in particular, when the compilation technique works at all it is far more efficient, in terms of quantity of processing required, than a stochastic optimizing technique. In domains involving restricted, relatively well-behaved input language, the compilation model may be the only one worth considering; and it seems likely that as NLP applications multiply there will be such domains - it is clear, for instance, that language consciously addressed by humans to machines tends spontaneously to adapt to the perceived limitations of the machines. And even in the case of unrestricted text or speech I am certainly not saying that my probabilistic APRIL system is superior to the ANLT system. To be truthful, at present neither of these systems is very good. 
Indeed, I would go further: it is difficult to rank natural language parsers on a single scale, because they differ on several incommensurable parameters, but if I had to select one general-purpose English-language parsing system as best overall among those now existing, I would vote for the CLE - which is a compiler-like rather than probabilistic parser. The CLE too leaves a great deal to be desired, and the probabilistic approach is so new that I personally feel optimistic about the possibility that in due course it may overtake the compilation approach, at least in domains requiring robust performance with unrestricted inputs; but at present this is a purely hypothetical forecast. What I do strongly believe is that there is a great deal of important natural-language grammar, often related to cultural rather than logical matters, over and beyond the range of logic-related core constructions on which theoretical linguists commonly focus; and that it will be very regrettable if the discipline
as a whole espouses abstract theories which prevent those other phenomena being noticed. How far would botany or zoology have advanced, if communication among researchers was hampered because no generally-agreed comprehensive taxonomy and nomenclature could be established pending final resolution of species relationships through comparison of amino-acid sequences in their respective genomes? We need a systematic, formal stocktaking of everything in our languages; this will help theoretical analysis to advance, rather than get in its way, and it can be achieved only through the compilation and investigation of corpora.
Notes
1. In fact these two concepts were not independent developments: the early work of Chomsky and some of his collaborators, such as M.P. Schützenberger, lay at the root of formal language theory within computer science, as well as of linguistic theory - though few of the linguists who became preoccupied with generative grammars in the 1960s and 1970s had any inkling of the role played by the equivalent concept in computer science.
2. "APRIL" stands for "Annealing Parser for Realistic Input Language". Project APRIL was funded under Ministry of Defence contract no. D/ER1/9/4/2062/151; work has been suspended since mid-1991 when Haigh was transferred to another post.
3. "SUSANNE" stands for "Surface and Underlying Structural Analyses of Natural English". Project SUSANNE is funded under ESRC grant no. R000 23 1142.
4. Taylor et al. quote the percentage to two places of decimals, but I hardly imagine their claim is intended to be so precise.
5. The name in the test data was King Henry VIII, but it happened that the name Henry was not in the ANLT dictionary and therefore I made the test fair by substituting a name that was.
References
Aarts, E. - J. Korst
    1989 Simulated annealing and Boltzmann machines. Chichester: Wiley.
Alshawi, H. et al.
    1989 Research programme in natural language processing: final report. Prepared by SRI International for the Information Engineering Directorate Natural Language Processing Club. [Alvey Project no. ALV/PRJ/IKBS/105, SRI Project no. 2989.]
Briscoe, E.J.
    1990 "English noun phrases are regular: a reply to Professor Sampson", in: Jan Aarts - Willem Meijs (eds.), Theory and practice in corpus linguistics, 45-60. Amsterdam: Rodopi.
Chomsky, Noam
    1957 Syntactic structures. The Hague: Mouton.
    1980 Rules and representations. Oxford: Blackwell.
Ellegård, Alvar
    1978 The syntactic structure of English texts. (Gothenburg Studies in English 43.) Göteborg: Acta Universitatis Gothoburgensis.
Fillmore, Charles J. - P. Kay - Mary Catherine O'Connor
    1988 "Regularity and idiomaticity in grammatical constructions", Language 64: 501-538.
Garside, Roger - Geoffrey Leech - Geoffrey Sampson (eds.)
    1987 The computational analysis of English. London: Longman.
Johansson, Stig
    1986 The Tagged LOB Corpus users' manual. Bergen: Norwegian Computing Centre for the Humanities.
Kirkpatrick, S. - C.D. Gelatt - M.P. Vecchi
    1983 "Optimization by simulated annealing", Science 220: 671-680.
Quirk, Randolph - Sidney Greenbaum - Geoffrey Leech - Jan Svartvik
    1985 A comprehensive grammar of the English language. London: Longman.
Sampson, G.R.
    1980 Making sense. Oxford: Oxford University Press.
    1987 "Evidence against the 'grammatical/ungrammatical' distinction", in: Willem Meijs (ed.), Corpus linguistics and beyond, 219-226. Amsterdam: Rodopi.
    1989 "Language acquisition: growth or learning?", Philosophical Papers 18: 203-240.
    1991 "Analysed corpora of English: a consumer guide", in: Martha Pennington - V. Stevens (eds.), Computers in applied linguistics. Clevedon, Avon: Multilingual Matters.
Sampson, G.R. - R. Haigh - E.S. Atwell
    1989 "Natural language analysis by stochastic optimization: a progress report on Project APRIL", Journal of Experimental and Theoretical Artificial Intelligence 1: 271-287.
Santorini, Beatrice
    1990 Annotation manual for the Penn Treebank project. [Preliminary draft dated 28.3.1990.] University of Pennsylvania.
Taylor, Lolita - Claire Grover - E.J. Briscoe
    1989 "The syntactic regularity of English noun phrases", Proceedings of the Fourth Annual Meeting of the European Chapter of the Association for Computational Linguistics, University of Manchester Institute of Science and Technology.
Comments by Benny Brodda
Until quite recently, the community of computational linguists has, by and large, been divided into two almost disjunct subgroups: one where the typical member applies computationally trivial operations (concordance-making, word-counting) to enormous amounts of data (millions, or even hundreds of millions, of words of running text); and another group where the typical member applies sophisticated procedures (parsing, semantic analysis) to ridiculously small sets of data (a "database" consisting of 20 to 40 sentences chosen out of the linguist's hat and a lexicon comprising a hundred or so words). For some time now, however, quite a few computational linguists have realized that both theoretical linguistics and applied linguistics call for methods using sophisticated analyses of linguistic material which is both qualitatively and quantitatively realistic. Geoffrey Sampson is a good representative of this last group. What he tries to do (and also manages to do, he claims - we have not seen a printout from an actual run, nor an actual demonstration) is to find one parse tree out of an existing stock of manually obtained parse trees (based on an analysis of 40,000 words of running text) that is most likely to match a given sentence. The method he employs in finding such an optimal tree is called "simulated annealing" (cf. Aarts - Korst 1989; Kirkpatrick - Gelatt - Vecchi 1983 in Sampson's reference list), a method that has the advantage of always finding an overall optimal tree, even if there is no perfect match found in the existing stock. My first comment does not only concern Sampson's approach, but I think it is symptomatic that (at least in this paper) he has very little to say about morphology. It might be the case that the occasional occurrence of an -s, -es or -ed word-ending can be handled in ad-hoc ways when parsing English text, but I think it is important that the morphological component is not neglected. 
There are other languages in which the morphology is probably more important - and certainly more complex. There is a great deal of research being done on computational morphology in other languages, and there exists today a widely accepted method for doing morphological analysis - I am referring to Koskenniemi's Two-Level model (1983) - which combines impressive performance with solid linguistic output,
and this goes for English as well as for Finnish, German or Swedish and other languages with a richer morphology. I think it would be a good idea also for the English-speaking world to start taking automatic morphological analysis seriously. As Fred Karlsson has pointed out several times, even in English the morphological component carries more information than one would think, if one really tries to make use of it. In his paper Sampson expresses a widely held misconception about "productions", viz. that these are just the computer world's name for what we linguists call "generative rules". By correcting this misconception I get an opportunity to promote one of my hobby horses: the use of production systems in computational linguistics. During the 1940-1950 period the American mathematician Emil Post presented a series of papers in Bulletins from the American Mathematical Society, promoting an alternative framework to Alan Turing's theory of computation and computability. Mathematically it is trivial to show that these two frameworks are equivalent, but where Turing's theory emphasizes the mechanical side of computation, Post's model emphasizes the creative side of it; it is this latter quality, I think, that has made the production framework more and more popular among people outside the engineering side of computer design. Post's own way of defining productions was a bit peculiar, but in the modern view of what a production is (cf. Smullyan 1961), production systems are today conceived as a very general framework for describing formal manipulations of formal structures (cf. also Rosner 1983). In short, a production is a kind of generalized rewrite rule. Production systems are equally well suited for generation, parsing, or any other manipulation of linguistic structures one can think of, when it comes to computational linguistics. The production framework may even be the way to describe formal analyses of language, e.g. 
in the format described by Smullyan; see also Brodda 1988, where a prototypic production system is presented, and in this case used for tracking down the turn-structures in the dialogue texts of the London-Lund Corpus. My third comment again concerns something not mentioned in Sampson's paper: my question is whether he has thought about his parsing scheme in the following terms. I conceive of probabilistic parsing as a problem that has an inherent Bayesian "flavour". "Ordinary" probabilistic models usually carry with them some kind of predictions: "If we do like this, then this or that will happen with probabilities so and so." Bayesian models can be thought of as describing (in statistical terms) how it just "was". A typical situation where Bayes' theorem (Bayes 1763) applies is when some event (out of a set of possible events) has occurred, but where we, for some reason or other, are
not in a position to observe the event directly; what we see is the result of the event. We can see the debris of the collision, so to say, not the collision itself, and what we want to do is to "count backwards" from what we can observe to what was the most likely event that occurred. When a writer / speaker produces an utterance, he has (hopefully) arranged his thoughts in some logico-linguistic structure prior to outputting his thoughts. The actual utterance output does not, however, reproduce the original structure of the message; what we get is only a linearized, indirect reflection of it. In script we get only the sequencing of the words to go by (and some occasional punctuation marks); in speech we also get prosodic indications. Question: what was the most likely structure of the original message, given that we observe the linearized message we actually get? For the basic problem discussed in his paper (i.e. the problem of finding an optimal parse tree for a given sentence) Sampson does not need to speculate about what goes on inside people's heads when they write or talk, but I still see Sampson's problem as essentially Bayesian. To me his problem looks very much like a model problem in decision theory, a branch of statistical reasoning that typically employs Bayesian thinking (cf. Raiffa 1968). It may be the case that this view - acknowledging that the problem is by nature Bayesian - merely has philosophical consequences in Sampson's case. Yet, as Sharman has shown, at least when it comes to automatic word tagging, one can get really interesting algorithms by looking upon it this way (Sharman 1990; cf. Geoffrey Leech, this volume). So far I have commented chiefly on things Sampson has not done. Let me now turn to what he actually has done. 
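The Bayesian reading of the parsing problem outlined above can be written out in a line. The symbols are assumptions introduced here for illustration and are not taken from Sampson's paper: T ranges over candidate structures, and S is the linearized word string actually observed.

```latex
\[
P(T \mid S) \;=\; \frac{P(S \mid T)\,P(T)}{P(S)},
\qquad
\hat{T} \;=\; \arg\max_{T} P(T \mid S) \;=\; \arg\max_{T} P(S \mid T)\,P(T).
\]
```

Since P(S) is the same for every candidate T, it drops out of the maximization; "counting backwards" from the observed string then amounts to weighing how probable each structure is in itself against how probably it would have surfaced as S.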
First of all I will express my appreciation of the high standards he has set for his project, and also my admiration for the way he has pursued it; Sampson is obviously not afraid of hard work, either manual or intellectual. When it comes to actual analyses, Sampson does not provide many details, but let me comment briefly on the analysis of the following sentence, which is the single example he actually discusses in depth: The final touch was added to this dramatic interpretation, by placing it to stand on a base of misty grey tulle, representing the mysteries of the human mind. He first gives the analysis of this sentence as he himself would have done it, and then the analysis produced by the parser. There are a couple of errors in the parser's output but, on the whole, the automatic analysis is not very deviant from the manual. As I see it, the parser's major mistake is to stick the representing clause in the wrong place, viz. as an adjunct of (by) placing
. . . (as Sampson says); the parentheses indicate that it is placed as an adjunct of to stand rather than as a postmodifier to tulle. The other mistake the parser makes is, I think, more interesting, viz. when it places the prepositional phrase of the human mind as an adjunct of representing rather than as postmodifier of the mysteries. This example is just an instance of the notoriously difficult "PP Attachment Problem": In a construction of the type V PP, is the prepositional phrase inside or outside the VP? In a construction of the type NP PP, is the prepositional phrase a postmodifier of the displayed NP, or does it modify something further to the left? (Cf. John Sinclair, this volume, for a couple of very nice examples of the problem discussed now.) In many "ordinary" (strictly algorithmic, non-probabilistic) parsers, this problem is often left unsolved in the syntactic component of the parser. For example, a string like PP PP PP is given a syntactic structure like a garden rake rather than a tree. The final attachment of each PP is then left to the semantic component (which may not solve the problem). Sometimes lexical valency may help: from the lexical phrase add X to Y one can conclude that the PP to the stew belongs to the VP in He added more pepper to the stew. This is a general statement; in particular cases one can be much more specific. In a configuration of the type Det X of Y, where Det denotes a determiner and X a word, the word of starts a prepositional phrase which, in well over 95% of all cases, is a postmodifier of X, regardless of the nature of X and Y. (It also points out X as a noun, but that is not the issue here.) With the aid of some simple pattern matching program of the Grep type, this statement can readily be verified by anyone having access to some English texts in his computer; the pattern is common in most text types.
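A search of the Grep type for the Det X of Y pattern might look like the following minimal sketch. It uses Python's standard `re` module in place of Grep; the determiner list and the function name `find_det_x_of_y` are assumptions made for illustration. The search only collects candidate instances: whether the of-phrase really is a postmodifier of X still has to be judged by a reader.

```python
import re

# Hypothetical, deliberately short determiner list; a real check would use a fuller one.
DETERMINERS = ("the", "a", "an", "this", "that", "these", "those")

# Det X of Y: a determiner, a single word X, the word "of", and the following word Y.
DET_X_OF_Y = re.compile(
    r"\b(?P<det>" + "|".join(DETERMINERS) + r")\s+(?P<x>[A-Za-z-]+)\s+of\s+(?P<y>[A-Za-z-]+)",
    re.IGNORECASE,
)


def find_det_x_of_y(text):
    """Return (det, x, y) triples for every non-overlapping match in text."""
    return [
        (m.group("det").lower(), m.group("x"), m.group("y"))
        for m in DET_X_OF_Y.finditer(text)
    ]


# The example sentence discussed in the comments above.
sample = (
    "The final touch was added to this dramatic interpretation, by placing it "
    "to stand on a base of misty grey tulle, representing the mysteries of "
    "the human mind."
)
hits = find_det_x_of_y(sample)
for det, x, y in hits:
    print(det, x, "of", y)
```

On the example sentence the pattern picks out two candidates, a base of and the mysteries of. Extending the pattern so that Det X stands for an arbitrary NP would need more than a single-word slot for X, which is why that case is harder to check this way.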
The statement above can be extended to any configuration where Det X denotes an arbitrary NP, but that is of course harder to verify in this simple way. Just for the fun of it, I scanned through Sampson's preprint text manually and found just under 100 instances of the generalized pattern above, none of which had an of-phrase not modifying Det X. I find it somewhat peculiar that Sampson's system has missed this simple, heuristic generalization. (In the sentence I made a concordance of the 20,000 words in this novel, the word-group printed in boldface might be an example of an of-phrase not being a postmodifier of X; I myself would say that it is rather an adjunct of made.)
In collaboration with Geoffrey Leech and his colleagues at Lancaster, Sampson has put great effort into trying to arrive at a coherent and systematic notational system for their analyses. As a Swede I am, of course, enchanted by Sampson's allusions to Carl von Linné's work in natural history - more specifically, to Linné's naming scheme for all living objects, a work which has of course had a profound influence on the biological sciences. Sampson by no means claims to be the Linné of linguistics, and I am in full agreement with his argument that it is about time that (computational) linguists try to arrive at something similar for syntactic notations. It is such a tremendous job to produce a large database of syntactically analysed texts (no matter whether it is done manually or semi-automatically), and it is such a misuse of resources to develop a new notational system for each new project. By arriving at some systematic notational system, the science of computational linguistics can become truly cumulative - it is almost a prerequisite.
In the whole field of computational linguistics there is today a strong movement towards achieving "reusable" results: at the meeting in Berlin of the European Chapter of ACL in April 1991 there was a section devoted to a discussion of, and a plea for, reusable lexicons, reusable parsers, reusable tagging systems, and what have you. Here Sampson pleads for reusable syntactic analyses. Another reflex of this movement is the work on standardization now being undertaken in the Text Encoding Initiative. Sampson's approach is squarely in line with this new and important direction of research in our science. Another prerequisite for our science to become cumulative is that text corpora, lexicons, analysed texts, and so on, are actually made available to the scientific community (and at a reasonable price). Sampson represents a tradition in which the standards were once set by W. Nelson Francis and Henry Kučera at Brown University, when they openly and generously started to distribute their creation, the now venerable Brown Corpus, to any interested user.
This openness has created the basis for some remarkable and successful international research projects, first expressed through the Lancaster-Oslo-Bergen axis, and later through the London-Lund axis - and the success story continues. Viewed in retrospect by an outsider to the field of English studies, this openness has created a tremendous boost to the study of the English language all over the world. It is remarkable that this successful strategy has not automatically set a standard also for the study of other languages.
References
Bayes, Thomas R. 1763 "Towards solving a problem in the doctrine of chances", in: Philosophical Transactions of the Royal Society. London: 370-418.
Brodda, Benny 1988 "Tracing turns in the London-Lund Corpus with BetaText", Literary and Linguistic Computing 3: 71-104.
Koskenniemi, Kimmo 1983 Two-level morphology: a general computational model for word-form recognition and production. Publication No. 11, Department of General Linguistics, University of Helsinki.
Raiffa, Howard 1968 Decision analysis. Reading, Mass.: Addison-Wesley.
Rosner, Michael 1983 "Production systems", in: M. King (ed.), Parsing natural languages, 35-58. New York: Academic Press.
Sharman, Richard 1990 Hidden Markov model methods for word tagging. Report 214, IBM UK Scientific Center, Winchester.
Smullyan, Richard 1961 Theory of Formal Systems. Annals of Mathematical Studies. New York.
Postscript
On corpus principles and design
Randolph Quirk
Caught between being flattered by the invitation and terrified by the challenge, I yielded to the former and accepted in some trepidation Jan Svartvik's invitation to write what he entitled a Postscript to the Symposium. In a subsequent letter of May 1990, he made it clear that he did not have in mind the tangential and irrelevant afterthoughts that one associates with the postscript to a letter but rather some attempt to sum up the main thrust of the Symposium and its achievement. The tall order was getting taller, and it was to get taller still. In June 1991, I was instructed to complete my postscript as a sort of prescript and hand it to the Secretariat before our proceedings began. And although you have not yet received the papers on which your summing-up will be based, Professor Svartvik's letter continued, never fear: they will arrive before you fly off to South America at the beginning of July. And so they did, too, and I no longer had to wonder how I might while away the hours of flying between London and Argentina. I had everything with me except Charles Fillmore's paper, and I offset that deprivation by equipping myself with a xerox of his 1988 article on let alone, confident that it would add the requisite Fillmorean piquancy to whatever awaited my taste buds in the two yellow tomes from Lund. I was struck, as we must all have been, by the richness and variety these volumes displayed, and I felt that my summing-up might well simply repeat John Dryden's verdict on the works of Chaucer: " 'Tis sufficient to say . . . that here is God's plenty." My remarks are necessarily based on the papers in their pre-print form. I shall not attempt to discuss them in detail, since my comments (written despite a distractingly beautiful view of the Andes from my hotel window) might conflict with the final form of the papers and could not anticipate the impact upon them of discussion during the Symposium itself. But yes, God's plenty. 
We have variety in the particular corpora discussed, some of them familiar like Brown and LOB, some less familiar (to me at least) like Matti Rissanen's historical data, some with a special orientation of another kind like the mother-child interaction studied by Ruqaiya Hasan. The papers reflect a wide range of purpose, too: for example, language teaching, devising mechanised language aids, investigating the process of reasoning. We have specific issues in grammatical description, such as Douglas Biber's
inquiry into the distribution of anaphoric forms across genres - a model study of corpus data by computational means. We have the broader thrust of Michael Halliday's paper on grammar, an aspect of linguistics - he says "with too much theory and too little data", and he explores in characteristically limpid aphoristic prose the place of probability theory in the linguistic analysis of discourse: "a register is a syndrome of lexicogrammatical probabilities". The interpenetration of lexicon and grammar is the focus of John Sinclair's work too, as he examines not merely the potentiality but the necessity for automated analysis of data with the new generation of corpora reaching a size which, he says, puts them beyond "human intervention". Even broader issues are confronted by Wallace Chafe, Geoffrey Leech, and Geoffrey Sampson. In the course of presenting his corpus inquiry into the "light subject" and the "one new idea" constraints, not to mention his disquisition upon anaticide, Professor Chafe makes a robust defence of introspection and also insists on the need to supplement corpus data by other modes of inquiry - a matter to which I shall return. Geoffrey Leech's paper reminds us that while this Symposium is on corpus study, the nerve centre of such work is now the study of corpora by computational means and that computer science has a vital bearing upon linguistics well beyond the study of corpora. In consequence, he takes us back to the information theory of the Shannon era and forward to empirical modes of inquiry into linguistic competence, Chomsky's I-language. Geoffrey Sampson presents us with computer science at a still more technical level, arguing for an approach that will be more appropriate in harnessing the computer to handle natural language such that we can achieve a grammatical taxonomic inventory for such a language as English comparable to what Linnaeus achieved for botany in Stockholm and Uppsala 250 years ago. 
So, richness and variety amounting indeed to God's plenty: except in one rather striking respect; the virtually unvaried concentration on only one of God's languages. Of the papers before us, only two go beyond English, and while these exceptions appropriately deal with Swedish, the language of Alfred Bernhard Nobel, I did find myself fearing again that the current dominance of English as a language of exposition was increasingly influencing scholars to make it also the almost exclusive object of study. It is bad enough that linguists in English-speaking countries are so often unable to read studies within our own discipline that are perversely written in exotic languages like French and German, without the danger of a blinkered concentration on English preventing us from taking stock of the corpora comprising Spanish, Portuguese, German, Italian and other languages, not to mention the theoretical work on processing and analysing them. For example, the work
of Ataliba de Castilho and his team of 40 on Brazilian Portuguese. In this connection, I was glad to note references to corpora of other languages made in some of the Symposium papers: by Graeme Kennedy, for example, and by the wise and witty Henry Kučera. In the course of his paper, Geoffrey Sampson remarks that in computational corpus studies we "still have a long way to go", a sentiment so closely echoed by some of the other contributors that it seemed to me it might constitute in itself a not inappropriate summary of the Symposium. An equally reasonable conclusion, however, is the virtual converse: What a long way we have come. This is well brought out in several of the papers, but especially of course in the charmingly urbane review by Nelson Francis, reminding us of the long healthy roots that linguistics has in empirical studies of language data if not actually - until quite recently - in the disciplined study of corpora as these are now conceived. But if the Nobel meeting indicates both how far we have come and how far we still have to go, it has also been a salutary reminder of the extent to which some of the issues with most topical concern are those that have exercised us for thirty years. In February 1960, I read a paper to the Philological Society in London (cf. Quirk 1968) which set out the principles for the Survey of English Usage. These were anchored in a corpus of contemporary natural language data which
(a) would be representative of the spoken and written grammatical repertoire mastered by mature native speakers in their varied roles at work or play; and
(b) would be subjected to exhaustive and non-selective study: the vital principle of total accountability.
That 1960 paper expressed indebtedness to two of the scholars present at this Stockholm meeting, Michael Halliday and Nelson Francis; and in connection with the principle of total accountability, I quoted from the work of J.R. Firth, who had been Halliday's teacher and mine, on the importance of studying every item in a text if we were to discover the extent to which "words are mutually expectant and mutually prehended" - a theme assiduously developed over the years by another of the scholars present here, John Sinclair. My paper made it clear, however, that the Survey would not be confined to corpus study, and I shall be saying more on this below. But for the present, so much for February 1960. Just three years later, in February 1963, I was invited to Brown University in Providence, Rhode Island, to join Nelson Francis, Henry Kučera, John B.
Carroll, Philip Gove, and Freeman Twaddell in a brain-storming few days in which these principles were applied to the entirely innovative concept of a computerised corpus, the completed work on which was in due course to become the model for the LOB corpus and much else. Ruqaiya Hasan wonders in her paper here whether we corpus planners of the early nineteen-sixties saw ourselves as iconoclastic. In fact, rather than destroying, I think we saw ourselves as restoring and modernising long-valued icons. But we certainly knew we were swimming against the tide that was flowing in Massachusetts Bay towards the "Boston suburbs", as Kučera puts it; and if we weren't waving, there were many of our contemporaries who were convinced we were drowning. The papers at this Symposium show that the principles informing the Survey and the Brown corpus underpin the very exciting work that is going on in corpus linguistics all over the world. But of course much has changed. The Brown innovation of making a corpus machine-readable (the Survey at first used the computer only for number-crunching and model-building: cf. Carvell - Svartvik 1969) has transformed the way in which total accountability can be implemented, and as software for automated analysis becomes more sophisticated this transformation looks set to accelerate. The other big change, also attributable to developments in computational technology, relates to scale. In the 1960's we modestly thought in terms of corpora a million words in extent. John Sinclair has more recently been working - as have others - with a corpus comprising tens of millions, and the corpus in which I myself am currently involved sees one hundred million words as a by no means excessive or daunting target. This is the British National Corpus, on which I should like to dilate for a few minutes. Let me begin with a summary of the main features.
Key points
• Size: we plan to collect at least 100m words. Over 10m words of natural speech will be included in the corpus, transcribed orthographically though not prosodically.
• Availability: after completion of the project, we plan to make the corpus widely available as a truly national resource.
• Balance: the corpus will be underpinned by a detailed design specification so that it will be as representative as possible of modern British English, drawing from a very broad range of text types. Books, magazines, drama, TV programmes, letters, memos, telephone conversations, meetings, advertisements, etc. will all be included.
• Software: the project plans to make available with the corpus a set of software tools to enable researchers to gain access to this large body of data without having to buy or develop basic text-processing facilities.
• Strategy: this project is part of a large-scale strategy to enhance information technology in the UK, through the development of more intelligent and sophisticated language processing capabilities in computers. It is parallel to similar initiatives in other EC countries, particularly in Italy and Spain, and also in the USA.
• Standardisation: the texts in the corpus will contain embedded codes indicating bibliographic details, lists, headings, footnotes, paragraphs and other significant textual features. The project will take account of the recommendations of the Text Encoding Initiative, the aim of which is to establish a standard computer markup for the exchange of machine-readable text. The TEI is an international project funded by the US National Endowment for the Humanities, the Commission of the European Community, and the Andrew W. Mellon Foundation.
Planned uses of the corpus
• Lexicography. The corpus will provide a body of new data on word meaning, grammar, and usage which will assist our understanding of the workings of the English language and will inform all kinds of reference works. Empirical statistical data on word frequencies, word classes (nouns, verbs, adjectives, adverbs, etc.), spelling preferences and so on will be derived from the corpus.
• Linguistic research. Researchers have too often relied on inventing examples to illustrate points of grammar or to test theories. The corpus will provide a standard basis for investigating phenomena and testing competing linguistic theories. Stylistic studies of the variation between different text types will be based on a large standardised data sample. Results obtained by different researchers can be compared and evaluated directly if the data material from which they work is in a standard form.
• Language technology. To develop working systems for processing language, we require very high levels of sophistication in software design. Statistical techniques, requiring very large samples of text, are increasingly used in machine translation, speech recognition (voice input), speech synthesis (speaking computers), spelling and grammar checkers for word processing and desk-top publishing, hand-held electronic books, and other advanced developments in information technology. It is widely recognised that progress is hindered at present by the lack of suitable text corpora to be used as raw data.
The consortium
• The work is being conducted by a consortium comprising: Oxford University Press, Longman Group UK, the University of Lancaster's Unit for Computer Research in the English Language (UCREL), Oxford University Computing Service (OUCS), and the British Library. W. & R. Chambers are in the process of joining the consortium.
• OUP is the Lead Participant of the consortium and was responsible for putting together the proposal to the Department of Trade and Industry (DTI).
• The project is planned to run for 39 months. It started officially on 1st January 1991.
• The lead partner's production of the Oxford English Dictionary on CD-ROM established it as an innovative force in electronic text materials. The Oxford Advanced Learners' Dictionary has been available as an electronic research tool since 1988. Work on collecting a pilot corpus has been in progress since 1989.
• Longman has been publishing dictionaries for over 250 years, among them Johnson and Roget. The supply of the machine-readable version of the Longman Dictionary of Contemporary English has benefited the research community in machine translation and other natural language processing research. A design for a reliable corpus system (for the Longman-Lancaster Corpus) has been implemented over the last few years.
• Oxford University Computing Service has established and maintained the Oxford Text Archive, a large repository of individual machine-readable texts for use by scholars, and is a major partner in the Text Encoding Initiative (see above under "Key points").
• The University of Lancaster's UCREL has a long involvement in corpus analysis and development, and has been one of the pioneers of probabilistic techniques for linguistic analysis of corpora.
• The British Library has vast experience in the field of paper and electronic text handling, and is the most appropriate organization for the long-term development of the corpus as a national resource.
Funding
• The project was successful in its bid for funding under the DTI's "Information Engineering Advanced Technology Programme" which is a joint funding programme with the Science and Engineering Research Council.
• The estimated cost of the project is £1.2m. The DTI and SERC have offered to pay up to 50% of this total cost.
• The universities and the British Library qualify for 100% funding; the commercial participants receive a subsidy proportional to their expenditure such that public funding accounts for no more than 50% of the total costs.
Organisation
The British National Corpus Project has three levels of committee structure:
Advisory Council chaired by Dr Anthony Kenny (President of the British Academy) and including
Michael Brady (Professor of Engineering, Oxford University)
Christopher Butler (Professor of English, Oxford University)
Charles Clark (Copyright Adviser to the Publishers' Association)
David Crystal (Language Consultant)
Nicholas Ostler (Consultant to DTI on Speech and Language Technology)
Sir Randolph Quirk (University College London; Chairman, British Library Advisory Committee)
Henry Thompson (Human Communications Research Centre, Edinburgh University).
The role of the Council is to take a critical overview of the project from a national perspective and to advise the Project Committee on matters relating to the interests of the speech and natural language community.
Project Committee is the executive committee for the project and consists of three representatives from OUP and one representative from each of the participating organisations.
Task Groups will deal with each of the main operations of the project.
Personnel
Oxford University Press
S. Murison-Bowie (Project Director) is Director of Electronic Publishing and Development
T. Benbow (Project Committee Chairman) is Director of the OUP Dictionary Department
J.H. Clear is the Project Manager
Longman
D. Summers is the Managing Director of Longman Dictionaries
S. Crowdy is the Longman Systems Development Editor
University of Lancaster
G.N. Leech is Professor of Linguistics and Modern English
R. Garside lectures in the Department of Computing
Oxford University Computer Service
S. Hockey is Director of the CTI Centre for Literature and Linguistic Studies
L.D. Burnard is Director of the Oxford Text Archive
British Library
T. Cannon is Assistant Director of the Research and Development Department
Work programme overview
There are five main operations in the project: Design, Copyright & Permissions, Data Capture & Encoding, Storage & Distribution, and Corpus Processing.
• Design deals with defining the range of text types to be sampled, agreeing the number and size of samples and specifying the encoding of text features. OUP, Longman, Lancaster and OUCS all have expertise in this area and will work together on this topic. The design operation will be complete by month 10.
• Copyright & Permissions involves establishing appropriate copyright clearance to store texts in the corpus and distribute them for research purposes. We anticipate that several thousand different copyright permissions will be required for 100 million words. The first four months concentrated on establishing a form of copyright permission acceptable to the authors, publishers and agents. OUP and Longman take primary responsibility for this operation.
• Data Capture & Encoding will be the largest and most costly single operation. Texts will be scanned by Optical Character Reader, converted from other machine-readable forms or keyed manually. Audio recordings of meetings, casual speech, radio and TV broadcasts will be transcribed. All texts in the corpus will be stored and distributed using a uniform target encoding scheme, which will conform with the international Standard Generalised Markup Language (SGML) and the recommendations of the Association for Computing in the Humanities, Association for Literary and Linguistic Computing, and Association for Computational Linguistics' Text Encoding Initiative, with which the project will liaise closely. It is planned that most of the 100m words will be collected by the end of the second year. OUP and Longman will manage the data capture. OUCS will be responsible for text encoding.
• Storage & Distribution. Archiving, cataloguing, retrieval, and distribution procedures will be developed, both computational and administrative. The contents and classification of texts in the corpus will be recorded in a data-base catalogue for reference and distribution. OUCS and the British Library will be mainly involved in this operation.
• Corpus Processing. A suite of corpus processing tools will be developed, which can be used for searching and retrieving information from the corpus. In order to enhance its value, the complete corpus of 100 million words will be annotated automatically with word-class tags. Lancaster will take primary responsibility for such corpus processing.
By the standards of the BNC and like corpora of the 1990s, corpora of one million words can be seen only as inadequate samples. Indeed, this was clear to me in 1960 when I spoke of the need not merely for supplementary corpora to serve special inquiries but of the need above all for psycholinguistic techniques for eliciting data from subjects.
To my mind, the scalar orders of magnitude confronting us in current and future corpora in no way reduce the need for developing elicitation procedures. Wallace Chafe's paper seems to suggest that these are relevant only to researchers studying languages other than their own, for example inviting a native informant to judge acceptability or to supply a linguistic specimen. But in a series of studies beginning with the Quirk - Svartvik book of 1966, my colleagues and I have demonstrated that sophisticated elicitation procedures could establish for one's own language statistically significant generalisations which resisted introspection and could scarcely be imagined as emerging from corpus scrutiny alone (though corpus data could often be the best clue to the issues worth such further investigation).
Let me give two examples. The first is a pair of results from a paper written in collaboration with Ruth Kempson (Kempson - Quirk 1971), in which we looked inter alia at complementations with -ing versus the infinitive. While some verbs such as enjoy regularly take -ing and some such as expect regularly take the infinitive, there are other verbs that can take either: She started singing, She started to sing. Using the "forced-choice" technique, where subjects must use both constructions but can please themselves as to which of two vacant spaces best suits each, we found that there was a significant (and obviously rule-governed) preference (chi-square value 19.45, p