Extending the Scope of Corpus-Based Research : New Applications, New Challenges [1 ed.] 9789042029248, 9789042011366

Extending the scope of corpus-based research: new applications, new challenges is a collection of articles which highlig


English Pages 257 Year 2003


Author / Uploaded
Sylviane Granger
Stephanie Petch-Tyson


Contents

List of contributors
Preface

I. Corpora and methodology

Using the MF/MD method for automatic text classification
Inge de Mönnink, Niek Brom & Nelleke Oostdijk

Scientific experiments in parsed corpora: an overview
Sean Wallis

WebCorp: providing a renewable data source for corpus linguists
Antoinette Renouf

Normalization and disfluencies in spoken language data
Nelleke Oostdijk

Textual structure and segmentation in online documents
Pam Peters & Adam Smith

II. Corpora in language description

Shall and will as first person future auxiliaries in a corpus of Early Modern English texts
Maurizio Gotti

The role of gender in the use of MUST in Early Modern English
Arja Nurmi

From corpus data to a theory of talk units in spoken English
Joybrato Mukherjee

The BNC and the OED. Examining the usefulness of two different types of data in an analysis of the morpheme eco
Bernhard Kettemann, Martina König & Georg Marko

Lexical gaps
Göran Kjellmer

The use of native lexical items in English texts as a codeswitching strategy
Hajar Abdul Rahim & Harshita Aini Haroon

The structure of children’s writing: moving from spoken to adult written norms
Geoffrey Sampson

III. Corpora in foreign language learning and teaching

On clefts and information structure in Swedish EFL writing
Mia Boström Aronsson

Contrasting learner corpora: the use of modal and reporting verbs in the expression of writer stance
JoAnne Neff, Emma Dafouz, Honesto Herrera, Francisco Martinez, Juan Pedro Rica, Mercedes Diez, Rosa Prieto & Carmen Sancho

Learning English prepositions in the Chemnitz Internet Grammar
Josef Schmied

Integrating networked learner oral corpora into foreign language instruction
Pascual Pérez-Paredes

List of Contributors

Hajar Abdul Rahim, Universiti Sains Malaysia, Malaysia
Harshita Aini Haroon, Universiti Utara Malaysia, Malaysia
Mia Boström Aronsson, University of Göteborg, Sweden
Niek Brom, University of Nijmegen, The Netherlands
Emma Dafouz-Milne, Universidad Complutense de Madrid, Spain
Inge de Mönnink, University of Nijmegen, The Netherlands
Mercedes Diez Prados, Universidad de Alcalá, Spain
Maurizio Gotti, Università di Bergamo, Italy
Sylviane Granger, University of Louvain, Belgium
Honesto Herrera, Universidad Complutense de Madrid, Spain
Bernhard Kettemann, University of Graz, Austria
Göran Kjellmer, University of Göteborg, Sweden
Martina König, University of Graz, Austria
Georg Marko, University of Graz, Austria
Francisco Martinez, Universidad Complutense de Madrid, Spain
Joybrato Mukherjee, University of Bonn, Germany
JoAnne Neff, Universidad Complutense de Madrid, Spain
Arja Nurmi, University of Helsinki, Finland
Nelleke Oostdijk, University of Nijmegen, The Netherlands
Pascual Pérez-Paredes, Universidad de Murcia, Spain
Stephanie Petch-Tyson, University of Louvain, Belgium
Pam Peters, Macquarie University, Australia
Rosa Prieto, Escuela Oficial de Idiomas de Valdezarza, Spain
Antoinette Renouf, University of Liverpool, United Kingdom
Juan Pedro Rica, Universidad Complutense de Madrid, Spain
Geoffrey Sampson, University of Sussex, United Kingdom
Carmen Sancho Guinda, Escuela Técnica Superior de Ingenieros Aeronáuticos, Spain
Josef Schmied, Chemnitz University of Technology, Germany
Adam Smith, Macquarie University, Australia
Sean Wallis, University College London, United Kingdom

Preface
Sylviane Granger & Stephanie Petch-Tyson
University of Louvain

The choice of ‘Future challenges for Corpus Linguistics’ as the theme for ICAME 2001 (see note 1), the conference which gave rise to this volume, reflected how far Corpus Linguistics has come in a relatively short space of time. After all, it is not typically until one is firmly established, in whatever discipline, that one feels the need to ask oneself where the future lies. Equally, the theme reflected the crucial importance of addressing the issue of where we go from here, precisely at the point where the groundwork has been done and the validity of Corpus Linguistics (CL) as a core methodology in language research is no longer in question.

The panel discussion held at the end of the conference identified some key areas felt by the discussants to be in need of special attention. One recurrent theme was the need for a clearer relationship between CL research and linguistic theory, with more hypothesis testing, better statistics and fewer facts for facts’ sake. The simple yet pertinent question raised by Charles Fillmore, ‘are facts interesting for their own sake?’, will have provided a challenge in itself for many at the conference. Another challenge identified by several speakers was that of ensuring a higher degree of methodological standardization, a move which would enable research results to be more directly comparable. The issue of corpus annotation was also raised by more than one panel discussant, from different viewpoints and particularly within the perspective of developing a new generation of corpora for the 21st century. On the one hand, it was recognized that linguists need to pay urgent attention to the types of annotation systems under development and to work hand in hand with the people developing them, so that the systems developed are the ones that linguists actually want.
On the other hand, there was a call for more enriched annotation systems, to enable the corpus linguist to get closer to the communicative activity underlying language and thus to make more interesting observations about language. As regards the type of language being collected and the types of analysis being conducted, it was felt that more differentiation needs to be made between speech and writing. Obviously, the collection of large-scale speech corpora is a relatively new phenomenon, but with the advent of such corpora and the promise of increasing technological capabilities, such as speech and text linkage, an extremely important and exciting new area of corpus research will become viable. Finally, the feeling was expressed that although CL has convincingly argued its relevance for FL teaching, this relevance has yet to be proven and that there is thus a need for more concrete and tested teaching applications.

Many of the articles in this volume show how these ‘future challenges for corpus linguistics’ are already being addressed by today’s researchers. The
five articles in section 1, Corpora and methodology, each offer a forward-looking perspective on corpus research. De Mönnink et al. attempt to reproduce the now well-known Biber method of automatic text classification using a word-class tagged corpus. They find that the enriched syntactic information was a significant benefit, simplifying and improving the search for linguistic features and encouragingly, that Biber’s multi-feature/multi-dimension method was indeed largely reproducible. This finding will be interesting for many researchers who have found Biber’s method instinctively highly attractive but difficult to replicate and shows one advantage that a word-class tagged corpus can bring. Wallis’s focus is on the need for sound experimental methodologies in the CL community. He highlights the importance for everyone in the research community of setting up transparent, reproducible, hypothesis-based, scientific experiments and demonstrates how certain issues in particular are of key importance for those working with parsed corpora, illustrating this with experiments carried out using the parsed ICE-GB corpus. Oostdijk’s interest is also parsed corpora, this time the implications that disfluencies in spoken language may have for the design of a parser for speech. She analyses various types of normalized disfluencies in the parsed ICE-GB corpus, concluding that a majority of disfluencies (hesitations being by far the most prevalent form) have no impact on the underlying syntax and therefore require no special provision in a speech parser. Others however, do indeed have an impact and will require a different approach. The creation of a fully-fledged parser for speech will be of major benefit to the CL community and one which will provide a significant step forward in understanding some of the crucial syntactic differences between the spoken and written language. 
The final two articles in the section, by Renouf and Peters & Smith, both deal with new corpus resources. Renouf’s article reports on the remarkable ‘Webcorp’ tool, which has been created specifically for linguists to support Internet searches and shows how the web, accessed through WebCorp, can offer linguistic evidence not available from any other corpus resource. Peters & Smith’s concern is the effect of the electronic medium (e-documents as found on the web) on layout and document design. They compare global and local structural elements of e-documents and traditional printed documents and find that the electronic medium indeed has some impact, particularly on local structure, and speculate as to whether the e-document may bring about the demise of the paragraph, which would be a change for all dyed-in-the-wool paragraph lovers! Section 2, Corpora in language description, accurately reflects the diversity of interests found at ICAME conferences. The first two articles analyze aspects of Early Modern English. Gotti investigates the use of the modal auxiliaries shall and will in a corpus of Early Modern English texts, comparing his findings with the prescriptive rules for the formation of future sentences in seventeenth century grammars and discovering there to be more variation in use than predicted in the grammars. Nurmi’s sociolinguistic investigation of the modal auxiliary must traces the development of the two main meanings of the auxiliary (personal obligation and logical necessity) between the 15th and 17th centuries. She reaches the interesting conclusion that the increase in use of these

two different meanings is gender-linked. In a somewhat different vein, Mukherjee’s article introduces the talk-unit model as a theoretical framework for describing linguistic structures created through the interaction of syntax and intonation. Testing this framework on corpus data, he convincingly demonstrates how corpus-based methods can lead to new theoretical concepts of spoken language not covered by existing descriptions.

The following three articles describe different areas of corpus-based lexical research. Kettemann et al.’s contrastive analysis of corpus (the British National Corpus) and dictionary (the Oxford English Dictionary) data relating to use of the morpheme eco establishes the overwhelming superiority of the corpus, in purely quantitative terms, in revealing a wide range of different eco-words. At the same time it shows the complementarity of the two data types and the consequent desirability, for lexical studies, of combining different data sources. Kjellmer’s study of lexical gaps within the category of adjectives and de-adjectival nouns in the Cobuild corpus establishes six factors which broadly determine whether a de-adjectival noun is likely to be formed, and identifies potentiality as playing a particularly powerful role in this area of the lexicon. In their study of codeswitching in the English writing of Malay native speakers, Abdul Rahim and Haroon identify lexical gaps between languages as one relatively infrequent cause, but find that the majority of occurrences in their corpus result from an intentional strategy, aimed at making use of particular familiar connotations present in the Malay word but not in the English equivalent. The final article in the section demonstrates how grammatical annotation can inform and enrich comparative studies of speech and writing. In this article, Sampson analyses developmental aspects of children’s written English, comparing it both to spoken and written language.
In section 3 the focus moves from the study of native English to foreign language learning and teaching. The first two articles are both corpus studies of written learner language. Aronsson’s investigation of cleft and pseudo-cleft structures in the written English of Swedish university students reveals that the students are either unaware of, or unable to manipulate, the thematic and textual properties of these structures and have a consequent tendency to both overuse and misuse them. Neff et al. compare the English writing of students from a variety of mother tongue backgrounds for the use of modal and reporting verbs in the expression of writer stance. They find that the student writers have difficulties using modal verbs which may stem from typological, instructional and sociocultural factors, and highlight the need for EFL instruction to place more emphasis on pragmatic modal contexts in L2. Schmied’s article features an Internet Grammar developed to present students with real language data and language rules through inductive and deductive learning strategies. Using the example of prepositions, a notoriously difficult area of grammar, Schmied attempts to show how graphic representations of expansions of meaning in prepositions have advantages over previous treatments in helping learners discover major areas of difference between prepositions. The final article, by Pérez-Paredes, also draws on discovery as a way of learning, this time through an integrated computer networking environment which enables the teacher to compile and exploit oral corpora from his own students’ production, which he can then use for pedagogical purposes with the students, while simultaneously offering students access to the data for self-directed study.

This last article, firmly anchored in speech research and teaching, is one of only three in the volume which investigate the spoken rather than the written medium (see also Oostdijk and Mukherjee). It is clearly to be hoped that this imbalance is addressed by researchers as they begin to work with the large spoken corpora now becoming available. Interesting work is being done both using and developing annotation systems but, again, it is a minority, not a majority, of articles that use annotated corpora. The articles by Schmied and Pérez-Paredes both describe highly interesting and intuitively helpful corpus-based pedagogical approaches, but the benefit of these approaches over more traditional approaches has yet to be proven. As technology makes more and more things possible, new horizons are opened up, at the same time showing how much there is left to be discovered. The challenges of the 21st century are clearly being addressed, but it is comforting that there is still much with which to keep researchers occupied for the forthcoming decades!

Note

1  ICAME 2001, the Twenty-Second Conference of the International Computer Archive of Modern and Medieval English, was held in Louvain-la-Neuve, Belgium, in May 2001.

I. Corpora and Methodology

Using the MF/MD method for automatic text classification
Inge de Mönnink, Niek Brom and Nelleke Oostdijk
University of Nijmegen

Abstract

In corpus linguistics, but also in computational linguistics and information retrieval, there is an increasing demand for the automatic classification of large amounts of text(s). In his research, Biber uses the Multi-Feature/Multi-Dimension (MF/MD) method to obtain a classification of English texts. A major disadvantage of his approach is the heavy reliance on the frequency count of complex grammatical features which are hard to retrieve automatically. In this paper, we investigate whether Biber’s MF/MD method can be used for automatic text classification. For this purpose, the MF/MD method is applied to the ICE-GB corpus, using three different sets of linguistic features. The results indicate that automatic text classification is indeed feasible using word class tags as input for the MF/MD method.1

1. Introduction

Text classification is the process by which a classificatory format is constructed. One application of the classification method is document clustering. In document clustering, the documents in a given set are grouped in a way which maximizes within-cluster similarity and minimizes between-cluster similarity. Thus, unlike document categorization, the clustering task does not assume an existing classification of documents.2

Document clustering has its applications in areas such as corpus linguistics, computational linguistics and information retrieval. In corpus linguistics, Biber and Finegan use cluster analysis to “provide an overall typology of text types that can be used to specify the interrelations existing among different kinds of texts in terms of their strategic exploitation of linguistic variables for functional purposes” (Biber and Finegan 1986: 19). In information retrieval, document clustering is used to improve the efficiency of retrieval programs.

    If a collection can be divided up into a set of N conceptually coherent clusters, then queries could first be compared against representations of each of the N clusters. Ordinary retrieval could then be applied only within the top cluster or clusters, thus saving the cost of comparing the query to the documents in all of the other more distant clusters. (Jurafsky and Martin 2000: 659)

In computational linguistics, document clustering may prove useful for parsing and machine translation purposes. Sublanguages typically share a number of common features. This may make them easier to handle for parsers and MT systems (cf. Arnold et al. 1994: 159-163 and Oostdijk 1996: 205-6).

    The chief attraction of sublanguage and text type restriction to MT researchers is the promise of improved output, without the need to artificially restrict the input. Restricting the coverage to texts of particular types in certain subject domains will allow one to profit from regularities and restrictions in syntactic form and lexical content. (cf. Arnold et al. 1994: 160)

The classification technique which is used in corpus linguistics is very different from that used in information retrieval.3 Biber’s classification is based on the (co-)occurrence of grammatical features, whereas in information retrieval keywords or index-terms are used. One disadvantage of Biber’s approach is that it is hard, if not impossible, to retrieve automatically the grammatical features which he uses as input. This would make his approach useless for information retrieval purposes. If, however, a meaningful classification into text types can be obtained by using the result of automatic annotation (e.g. word classes and/or syntactic structures) as input, the MF/MD method can be used for fully automatic classification, and may well prove useful for application in information retrieval and/or computational linguistics.

At the time that Biber conducted his research, no corpora were available that had been annotated with detailed syntactic information. Since then, however, the fully parsed ICE-GB corpus (Nelson 1996) has become available. It is against this background that we decided to conduct a study which aimed to answer the question whether fully automatic classification is feasible using Biber’s MF/MD method on a different set of data. At the same time we wanted to clarify whether the data set has any effect on the results obtained in applying the MF/MD method.
The research conducted by Biber in which he uses the MF/MD method has been criticized by a number of people, including Oostdijk (1988), Altenberg (1989), and most recently Lee (2000). The criticisms that are put forward concern various aspects of the method and the way in which it has been applied. Among these are the size and the nature of the corpus and the samples used, and the selection of grammatical features. It has also been suggested that the dimensions as postulated by Biber (1990) are not as pronounced as he claims. One of the questions that so far have remained unanswered is the following: would it have made a difference if Biber had used a different set of grammatical features?

2. Description of the MF/MD method

Before we describe our experiments (section 3) and the results thereof (section 4), in this section we first summarize the Multi-Feature/Multi-Dimension method as used by Biber. The MF/MD method consists of the following four steps (Biber and Finegan 1986):

I    Preliminary analyses
     - review of previous research to identify potentially important linguistic features and genres
     - establish frequency of occurrence of those features in the texts in the corpus
II   Factor analysis
     - perform factor analysis on frequencies, clustering features that co-occur with a high frequency
     - interpretation of factors as textual dimensions, through assessment of the communicative function(s) most widely shared by the features constituting each factor
III  Compute factor scores
     - for each factor, compute a factor score for each text
     - analysis of the distribution of the factor scores among the genres
     - further functional interpretation of the textual dimensions
IV   Cluster analysis
     - clustering of texts that are most similar to one another
     - interpretation of the clusters as underlying ‘text types’

3. Description of the experiments

The present study includes three experiments. In these experiments, steps I to III of the MF/MD method are performed on the fully parsed, one-million-word ICE-GB corpus (Nelson 1996), using three different sets of linguistic features:

1. Biber’s set of 67 variables
2. a set of 129 word class tags
3. a set of 103 sentence structures
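Steps I and III of the method, as applied to any of the three feature sets, can be sketched roughly as follows. This is a minimal illustration and not the procedure used in the study: the function names, the toy counts and the loadings are hypothetical, the factor analysis of step II is assumed to have been run elsewhere, and the 0.35 salience cut-off is an assumption of the sketch.

```python
from statistics import mean, pstdev

def normalize(counts, n_words, per=1000):
    """Step I: turn raw feature counts into frequencies per `per` words."""
    return {feature: c * per / n_words for feature, c in counts.items()}

def standardize(texts):
    """z-score every feature across all texts (mean 0, standard deviation 1)."""
    features = sorted({f for t in texts for f in t})
    stats = {}
    for f in features:
        values = [t.get(f, 0.0) for t in texts]
        stats[f] = (mean(values), pstdev(values))
    return [
        {f: (t.get(f, 0.0) - m) / sd if sd else 0.0 for f, (m, sd) in stats.items()}
        for t in texts
    ]

def factor_score(z_text, loadings, salient=0.35):
    """Step III: add the z-scores of features with a salient positive loading
    and subtract those with a salient negative loading (assumed cut-off)."""
    score = 0.0
    for feature, loading in loadings.items():
        if abs(loading) >= salient:
            score += z_text.get(feature, 0.0) * (1 if loading > 0 else -1)
    return score

# Toy input: two texts of 2,000 words each (made-up counts, not corpus data).
texts = [
    normalize({"that_rel": 4, "past_tense": 30}, n_words=2000),
    normalize({"that_rel": 1, "past_tense": 80}, n_words=2000),
]
loadings = {"that_rel": 0.60, "past_tense": -0.45}  # hypothetical step II output
scores = [factor_score(z, loadings) for z in standardize(texts)]
```

Standardizing before summing keeps very frequent features from dominating a factor score, which is why the sketch works on z-scores and not on the normalized frequencies directly.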

In the first experiment, Biber’s application of the MF/MD method as described in Biber (1988) was copied as meticulously as possible on the ICE-GB corpus.4 While making use of the syntactic annotation available in the ICE-GB corpus, it was attempted to stay as close as possible to Biber’s 67 grammatical features. In other words, we expressed Biber’s algorithms in terms of Fuzzy Tree Fragments (FTFs). By means of this experiment we wanted to investigate whether the classification as postulated by Biber would hold when the research was replicated on a different corpus. In the attempt to copy the linguistic features, some obscurities, inconsistencies and shortcomings in the original algorithms came to light. The availability of syntactic annotation provided the opportunity to search for most linguistic features using only one (complex) FTF and to improve on some of the original search schemes. Only changes that improved the precision and recall of the original algorithms were carried through. For example, Biber uses the following search scheme to find “that relative clauses on subject position”5:

N + (T#) + that + (ADV) + AUX / V

An example of such a structure provided by Biber (1988: 234) is the following:

(1) ... the dog that bit me ...

In other words, this search scheme tries to find all relative clauses which are introduced by the pronoun that, where the relative clause functions as postmodifier in a noun phrase and where that is the subject of the postmodifying clause. In ICE-GB, this structure can be found by applying the FTF as shown in Figure 1. However, this FTF produces some extra occurrences of postmodifying that relative clauses, which are missed when using Biber’s search scheme. Most notably these are instances where the head of the noun phrase is not realized by a noun but for example by a pronoun or proform (ex. 2), or where the relative clause is preceded by another postmodifying phrase or clause (exs. 3-4).6

(2) Is her mother [the one (that had the stroke)] [s1a-036-104]
(3) It is [an approach (to evolution) (that has been fostered by biologists with a particular interest in cognition, intelligence and culture)]. [w1a-009-011]
(4) There is [a dentist (I know) (that’s got a clock in his house ...)] [s1a-046-001]
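Why a purely linear scheme misses cases such as (2) and (4) can be illustrated with a small matcher over tagged tokens. This is only a sketch under stated assumptions: the tag names are a simplified, hand-assigned tagset and not the ICE-GB or Biber tagsets, T# is read here as an optional tone unit boundary, and the function names are hypothetical.

```python
def matches_scheme(tagged, i):
    """True if the linear pattern N + (T#) + that + (ADV) + AUX/V starts at index i.

    `tagged` is a list of (word, tag) pairs with simplified, hand-assigned tags.
    """
    def tag(j):
        return tagged[j][1] if j < len(tagged) else None

    def word(j):
        return tagged[j][0].lower() if j < len(tagged) else None

    if tag(i) != "N":          # the scheme requires a noun as head
        return False
    j = i + 1
    if tag(j) == "T#":         # optional tone unit boundary
        j += 1
    if word(j) != "that":      # "that" must follow the head directly
        return False
    j += 1
    if tag(j) == "ADV":        # optional adverb
        j += 1
    return tag(j) in {"AUX", "V"}

def find_all(tagged):
    """Indices of all nouns at which the scheme matches."""
    return [i for i in range(len(tagged)) if matches_scheme(tagged, i)]

ex1 = [("the", "ART"), ("dog", "N"), ("that", "PRON"), ("bit", "V"), ("me", "PRON")]
ex2 = [("Is", "V"), ("her", "PRON"), ("mother", "N"), ("the", "ART"), ("one", "PRON"),
       ("that", "PRON"), ("had", "V"), ("the", "ART"), ("stroke", "N")]
ex4 = [("a", "ART"), ("dentist", "N"), ("I", "PRON"), ("know", "V"),
       ("that", "PRON"), ("'s", "AUX"), ("got", "V"), ("a", "ART"), ("clock", "N")]

hits = {"ex1": find_all(ex1), "ex2": find_all(ex2), "ex4": find_all(ex4)}
```

In this toy run only example (1) is found: in (2) the head is a pronoun, and in (4) another clause intervenes between the head noun and that. A query over the parse tree does not depend on adjacency in the word string, which is the advantage the FTF is said to bring.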

Figure 1. Fuzzy Tree Fragment for “that relative clauses on subject position”

For the second experiment, the full set of 332 tags found in the ICE-GB corpus was reduced to make it suitable for factor analysis. Factor analysis can only reliably be performed if the number of variables stands in a ratio of at most 1 to 3 to the number of observations. In other words, if the number of variables (in this case the number of tags) is 332, the number of text samples should ideally be 996 or higher. As it is, the ICE-GB corpus contains 500 texts, making a reduction of the full set of tags necessary. At the same time, the choice of variables should be such that as little linguistic information as possible is lost. Therefore, the reduction was established by ignoring some minor word classes (e.g. pause, punc, interjec) and features (e.g. comp, disc, procl), by ignoring incomplete tags (e.g. V(ditr), N(sing)), and by ignoring all features for some major word classes (e.g. ADJ, NUM).

Besides word class tags, syntactic structures can also be automatically obtained from the ICE-GB corpus. In our third experiment we wanted to establish whether using syntactic structures as input could lead to an acceptable text classification. Obviously, structures are not as easy to retrieve automatically as tags. While real-time tagging of texts with a performance of 95% or higher is nowadays state of the art, automatic parsing of large amounts of text(s) is still hardly ever undertaken. This would make syntactic structures less suitable as input for information retrieval purposes. However, Biber uses a combination of lexical items, tags, and complex structures, and it is interesting to investigate whether using only tags or only syntactic structures leads to comparable results.

The number of structures available is enormous. So what should we choose as input? It was obvious that we had to make a selection, since, again, the number of variables could only be 167 or less. We decided to take sentence patterns as input. This has the advantage that, if the results are favourable, the output of skeleton parsing can be used instead of full parsing, making it more useful for application in automatic classification. Also, the input is maximally different from the type of linguistic features that Biber uses. In this way we can establish whether using a different set of grammatical features as input would lead to different results. To establish the set of sentence patterns to be used, we looked at the functions of the daughters of the highest node in the sentence. This resulted in a total of 4815 different functional structures. To reduce this set, we only included those structures with a frequency of occurrence of 70 or higher.
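The two reductions described above, the 1-to-3 variable budget and the frequency cut-off for sentence patterns, amount to simple arithmetic and filtering. The sketch below is illustrative only: the function names are hypothetical, and the sentence patterns are made-up tuples of function labels, not ICE-GB data.

```python
from collections import Counter

def variable_budget(n_texts, ratio=3):
    """Largest number of variables that keeps `ratio` observations per variable."""
    return n_texts / ratio

def texts_needed(n_variables, ratio=3):
    """Smallest number of texts needed for `n_variables` variables."""
    return n_variables * ratio

def frequent_patterns(patterns, min_freq=70):
    """Keep only the sentence patterns that occur at least `min_freq` times."""
    counts = Counter(patterns)
    return {p: c for p, c in counts.items() if c >= min_freq}

needed = texts_needed(332)     # 332 tags x 3 = 996 texts
budget = variable_budget(500)  # 500 texts / 3, about 166.7, which the text rounds to 167

# Toy pattern list: each pattern is the sequence of functions of the
# daughters of the top node (labels are illustrative).
toy = [("SU", "VB", "OD")] * 80 + [("SU", "VB")] * 75 + [("A", "SU", "VB", "OD")] * 12
kept = frequent_patterns(toy, min_freq=70)
```

With the toy list, the two patterns occurring 80 and 75 times survive the cut-off and the one occurring 12 times is dropped.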
This resulted in a set of 103 sentence structures.5

The last two experiments differ from the first in the fact that the set of linguistic features is not based on previous micro-analysis. While this may complicate or even render impossible the functional interpretation of linguistic features in terms of an underlying dimensional structure, it has the clear advantage that no prior research into the communicative functions of features is needed to carry out the MF/MD analysis. If it turns out that a factor analysis on tags and/or sentence structures still results in a meaningful text classification, this method can be used for document clustering applications.

4. Description of the results

In this section, we discuss the results of the factor analysis and the computation of the factor scores. For the computation of the factor scores, the MF/MD method uses the genres of the corpus used, in this case the genres of the ICE-GB corpus. First, a factor score is computed for every separate text. Then the average score per genre is computed. In ICE-GB the text categories are organised in a hierarchical structure. At the lowest level 32 genres are distinguished. When we compared the factor scores for these 32 genres, no significant differences were found. This was caused by the low number of texts in some of the genres. To obtain significantly distinct scores, we had to use the categories one level up in the genre hierarchy of ICE-GB. At this level 12 different categories are distinguished.

To be able to compare our results with those of Biber, the factor scores of the ICE-GB categories had to be compared with the scores for categories in Biber’s corpus. Biber used a selection of the Lancaster-Oslo/Bergen (LOB) corpus and the London-Lund Corpus (LLC), supplemented with personal and professional letters. His corpus contains a total of 23 categories. A comparison between the ICE-GB categories and Biber’s categories is not straightforward. In some cases a category of Biber does not have an equivalent in ICE-GB; in other cases several of Biber’s categories form one category in ICE-GB. In the discussion of the results below, we use letters to refer to the different categories. Table 1 gives an overview of the categories in both corpora and the abbreviations used. Where possible, we use the original LOB label.

Table 1. Text categories and their abbreviations

A B C D E F G H J K L M N P R S T U V W X Y Z

Categories in Biber’s corpus Press: reportage Press: editorials Press: reviews Religion Skills, trades and hobbies Popular lore Belles lettres, biography, essays Miscellaneous (official documents) Learned and scientific writing General fiction Mystery and detective fiction Science fiction Adventure and western fiction Romance and love story Humour Personal letters Professional letters Telephone conversations Face-to-face conversations Public conversations Broadcast Prepared speeches Spontaneous speeches

Categories in ICE-GB A Reportage B Persuasive writing E Instructional writing G H J K-R

Non-professional writing Non-academic writing Academic writing Creative writing

ST Correspondence UV

Private dialogue

W Public dialogue XY Scripted monologue Z

Unscripted monologue
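The two-step computation described above (a factor score per text, then a mean score per genre) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the usual MF/MD convention that a text's factor score is the sum of the standardised frequencies of the features loading positively on the factor minus the sum for the features loading negatively. The feature names, loadings and frequencies are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def standardise(column):
    """Convert one feature's normalised frequencies to z-scores across all texts."""
    mu, sigma = mean(column), stdev(column)
    return [(value - mu) / sigma for value in column]


def factor_scores(frequencies, positive, negative):
    """Per-text score: z-scores of positively loading features summed,
    minus the summed z-scores of negatively loading features (assumed convention)."""
    z = {feature: standardise(column) for feature, column in frequencies.items()}
    n_texts = len(next(iter(frequencies.values())))
    return [
        sum(z[f][i] for f in positive) - sum(z[f][i] for f in negative)
        for i in range(n_texts)
    ]


def mean_score_per_genre(scores, genres):
    """Average the per-text factor scores within each genre."""
    grouped = defaultdict(list)
    for score, genre in zip(scores, genres):
        grouped[genre].append(score)
    return {genre: mean(values) for genre, values in grouped.items()}


# Hypothetical toy data: frequencies per 1,000 words for four texts.
frequencies = {
    "private_verbs": [30.0, 28.0, 6.0, 4.0],
    "contractions": [25.0, 27.0, 3.0, 1.0],
    "nouns": [150.0, 160.0, 260.0, 270.0],
}
genres = ["UV", "UV", "J", "J"]  # category labels as in Table 1

scores = factor_scores(
    frequencies,
    positive=["private_verbs", "contractions"],
    negative=["nouns"],
)
genre_means = mean_score_per_genre(scores, genres)
```

With so few texts per genre the means are only illustrative; as noted above, genres with a low number of texts do not yield significantly distinct scores.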

In the first experiment, the (normalized) frequency counts resulting from the fuzzy tree fragments were used as input for factor analysis. The resulting factorial structure consists of five factors, instead of Biber’s seven factors. When comparing both factorial structures, we find some differences, but also some

Using the MF/MD method for automatic text classification

striking similarities. For example, the 13 variables that have a salient positive loading on Factor 1 in our factorial structure also load on Biber’s first factor and the variables in our Factor 3 all load on Biber’s Factor 2. Some of the differences that we found can be explained by the use of the improved search algorithms. The feature ‘wh-relative clause on subject position’, for example, was improved to include cases where the head of the noun phrase is not realized by a common noun, or where the postmodifying relative clause is preceded by a prepositional phrase. In Biber’s calculations this variable has a positive loading on Factor 3, while in our factorial structure it has a high negative loading on Factor 1. If we compare the mean factor scores of the ICE-GB genres with Biber’s scores, we find that Biber’s first dimension ‘involved versus informational’ is clearly reflected in the mean scores for our Factor 1, with private and public conversations on one end of the scale, and academic writing on the other (cf. Figure 2). Biber’s second dimension ‘narrative versus non-narrative’ is reflected in the scores for our Factor 3, with creative writing on one end of the scale, and instructional writing on the other (cf. Figure 3). The distribution of scores for our Factor 2 is not directly reflected in Biber’s study, but it seems to make a clear distinction between spontaneous speech on the one hand, and scripted speech and writing on the other (cf. Figure 4).

[Figure 2: plot not reproduced. Two parallel scales of mean factor scores per text category, one for Biber (1988) and one for Brom (2000), running between ‘inv.’ (involved) and ‘inf.’ (informational).]

Figure 2. Mean factor scores on ‘involved vs. informational production’.

[Figure 3: plot not reproduced. Two parallel scales of mean factor scores per text category, one for Biber (1988) and one for Brom (2000), running between ‘non-n.’ (non-narrative) and ‘nar.’ (narrative).]

Figure 3. Mean factor scores on ‘narrative versus non-narrative discourse’.

[Figure 4: plot not reproduced. Scale of mean factor scores per ICE-GB category (Brom 2000), running between ‘written’ and ‘spoken’.]

Figure 4. Mean factor scores on ‘written versus spoken production’.

The results from our second and third experiment cannot be compared with Biber’s results directly, since we are dealing with both a different corpus and a different set of linguistic features, but we can compare the mean factor scores for the genres with the results of our first experiment. On doing so, we find that the results for the set of tags are very similar to those of the first experiment. Again, we find a distinction between involved vs. informational (cf. Figure 5), narrative vs. non-narrative (cf. Figure 6), and written vs. spoken (cf. Figure 7).

[Figure 5: plot not reproduced. Two parallel scales of mean factor scores per ICE-GB category between ‘inv.’ and ‘inf.’, one for word classes and one for Biber’s features (both Brom 2000).]

Figure 5. Mean factor scores on ‘involved versus informational production’, Biber’s features versus word classes.

[Figure 6: plot not reproduced. Two parallel scales of mean factor scores per ICE-GB category between ‘non-n.’ and ‘nar.’, one for word classes and one for Biber’s features (both Brom 2000).]

Figure 6. Mean factor scores on ‘non-narrative versus narrative discourse’, Biber’s features versus word classes.

[Figure 7: plot not reproduced. Two parallel scales of mean factor scores per ICE-GB category between ‘written’ and ‘spoken’, one for word classes and one for Biber’s features (both Brom 2000).]

Figure 7. Mean factor scores on ‘written versus spoken production’, Biber’s features versus word classes.

While approximately the same distinctions are found for the set of structures, the mean factor scores are far less distinct. However, we cannot conclude from this that syntactic structures are useless as input for the MF/MD method in general. Perhaps clause structures would have given better results than the sentence structures we used. Clause structures are closer to the type of syntactic structures which Biber used as input.

5. Conclusions

In the current study Biber’s classification (cf. Biber 1988) was largely reproduced, using the same linguistic features. It was shown that the availability of syntactic annotation simplified and improved the search for the linguistic features considerably. At the same time, a factor analysis carried out on the frequency counts of only a set of word class tags resulted in largely the same classification. These results indicate that Biber’s factorial structure is not simply a consequence of his choice of corpus or his choice of linguistic variables, but that it can be replicated using a different corpus and a different set of linguistic features. On the other hand, the original text classification was not obtained when we used sentence structures as input. This indicates that not just any input will lead to the same classification.

The current study also brought to light some limiting conditions on and suggestions for the successful implementation of the MF/MD method. With regard to the design of the corpus, it can be concluded that the number of texts in some of the 32 genres in ICE-GB is too small to obtain significant differences in factor scores (using the ANOVA test). For a successful implementation of the MF/MD method the number of texts per genre should be increased. It was also shown that the MF/MD method can be applied more successfully given the availability of a more accurately annotated corpus.

On the whole it can be concluded that, while Biber’s factorial structure was largely reproduced, the text classification for English is not yet stable and can still be improved bearing in mind the findings of the present study. Assuming for the moment that the current classification is a useful one, the present study indicates that automatic classification of large amounts of (tagged) texts is possible using the word class labels as input for the MF/MD method. Whether the classification is indeed useful for information retrieval, machine translation and/or parsing purposes will have to be addressed in further research.

Notes

1. This article is largely based on the research that was carried out by Niek Brom for his MA-thesis (Brom 2000).

2. See Jurafsky and Martin (2000: 658) and van Rijsbergen (1999: 36-38).

3. To our knowledge, no automatic classification has been attempted so far in computational linguistics for the purpose of improving parser efficiency.

4. Biber (1988) contains a more elaborate description of the MF/MD method which was first described in Biber and Finegan (1986).

5. Biber’s description “that relative clauses on subject position” is misleading. What is meant here is “that relative clauses with that in subject position”.

6. When an example is taken from the BNC, the code between dashes is a sample reference code from the BNC. For an explanation of the codes see Burnard (1995).

7. The most frequent structure is SUBJECT-VERB-SUBJECT COMPLEMENT, with a frequency of 5390.

References

Altenberg, B. (1989), ‘Review of ‘Variation across speech and writing’ by D. Biber (1988)’, Studia Linguistica 43(2): 167-174.

Arnold, D., L. Balkan, S. Meijer, R. Humphreys and L. Sadler (1994), Machine translation: An introductory guide. London: Blackwells-NCC. http://clwww.essex.ac.uk/MTbook/

Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge University Press.

Biber, D. (1990), ‘Methodological issues regarding corpus-based analysis of linguistic variation’, Literary and Linguistic Computing, 5: 257-269.

Biber, D. (1995), Dimensions of register variation. Cambridge: Cambridge University Press.

Biber, D. and E. Finegan (1986), ‘An initial typology of English text types’, in: J. Aarts and W. Meijs (eds), Corpus Linguistics II. Amsterdam: Rodopi. 19-46.

Brom, N. (2000), Drie keer de MFMD methode. Een tekstclassificatie methode toegepast met variatie in taalkundige informatie. Unpublished MA-thesis. Nijmegen: University of Nijmegen.

Burnard, L. (1995), Users reference guide for the British National Corpus. Version 1.0. Oxford: Oxford University Computing Services.

Jurafsky, D. and J.H. Martin (2000), An introduction to natural language processing, computational linguistics, and speech recognition. Upper Saddle River, New Jersey: Prentice-Hall.

Lee, D. (2000), Modelling variation in spoken and written English: the multidimensional approach revisited. Lancaster: Lancaster University.

Nelson, G. (1996), ‘The design of the corpus’, in: S. Greenbaum (ed.), Comparing English worldwide: The International Corpus of English. Oxford: Clarendon Press. 27-35.

Oostdijk, N. (1988), ‘A corpus linguistic approach to linguistic variation’, Literary and Linguistic Computing, 3: 12-25.

Oostdijk, N. (1996), ‘Using the TOSCA analysis system to analyse a software manual corpus’, in: R. Sutcliffe, H. Koch and A. McElligott (eds), Industrial parsing of software manuals. Amsterdam: Rodopi. 179-206.

van Rijsbergen, C.J. (1999), Information retrieval, second edition. CD-ROM version. http://www.dcs.gla.ac.uk/~iain/keith/index.htm

Scientific experiments in parsed corpora: an overview

Sean Wallis
University College London

Abstract

Our research community is producing sizable corpora containing detailed structural analysis. In order to investigate volumes of data such as parsed corpora and communicate our results, we must agree effective experimental methods. Such methods should be transparent, defensible and consistent with conventional approaches to scientific evidence. In this paper, and in accompanying pages on the internet,1 we demonstrate how a standard categorical variation paradigm may be deployed in corpus linguistics, provide examples of experiments, and present some of the central issues in experimental design applied to structured parsed corpora.

1. Introduction

Corpus linguistics – the study of authentic linguistic data in order to support and develop scientific theories – has been through a recent renaissance as new, large and structurally detailed corpora have been produced. Cheap computing power, large datasets and more sophisticated query systems have all combined to provide linguists with a myriad of possibilities for research. In particular, parsed corpora (also known as treebanks), consisting of sequences of grammatically analysed sentences, offer a great number of research possibilities. Such corpora include Susanne (Sampson 1995), The British Component of the International Corpus of English (ICE-GB, Nelson et al. 2002) and the University of Pennsylvania Treebank (Marcus et al. 1993). The annotation in a parsed corpus permits researchers to evaluate hypotheses regarding the interaction of specific linguistic features and structures. In this paper, we outline the scientific method as it applies to all corpus data – parsed or otherwise. Certain issues, however, are particularly acute with respect to a dataset consisting of a series of tree structures. During the construction phase of corpus research, it has been necessary to focus on annotation methods (Wallis and Nelson 1997; Wallis 2003). During the release of ICE-GB, we focused on developing natural and intuitive query systems to allow linguists to ‘get to grips’ with the data (Wallis et al. 1999). As such structurally annotated corpora become more established, it becomes necessary to ensure that our experimental methodologies are sound and sensitive to the possibilities that these corpora offer. Outlined below are the basics of the scientific method.

2. Experimentation

2.1 Defining a scientific experiment

A scientific experiment is an empirical test of a hypothesis. A hypothesis is a statement that may be believed to be true but is not verified, for example:

1. The element ’s is a clitic rather than a word.
2. The word whom is used less frequently in spoken English.
3. Whom rather than who is spoken less often than it is written in 1990’s British English.

In each case, the issue for the researcher is to devise an experiment to decide whether evidence collected from empirical data supports or contradicts the hypothesis. Compare Hypotheses 2 and 3 above. The more general the hypothesis, (a) the more difficult it is to collect evidence to test the hypothesis, and (b) the more likely it is that this evidence might support a number of other explanations. Researchers need to turn general hypotheses into a programme of experiments: a series of more specific hypotheses that are more easily testable. The art of experimental design is to collect and evaluate data so that conclusions may be drawn which support or refute these specific hypotheses. We discuss the collection of data in the next section. In brief, a simple experimental design consists of the following:

a) a dependent variable (DV), which may or may not depend on:
b) an independent variable (IV), which varies over the normal course of events.

Thus for Hypothesis 3, taking data from ICE-GB, the independent variable would be ‘spoken versus written’, and the dependent variable, whether whom was used in place of who, when the speaker had the choice (see Section 4.1 below). Note that in this case both variables consist of simple categorical alternatives.2 Once the data has been collected, the strength of the correlation between dependent and independent variables is calculated using a statistical formula and tested against a threshold (‘critical’) value.

a) If the measure is small, the variables are probably independent, i.e. they do not affect one another.
b) If the measure is large, it means that the variables correlate, i.e. the presence of one phenomenon may be dependent on another.

2.2 Experiments and proof

However, the existence of a correlation does not prove a hypothesis. A correlation between two phenomena, A and B, does not prove that A causes B. They may correlate because they are both caused by some other factor C, or the reverse may even be true, i.e. B may in fact cause A. The sample may not be representative or the researcher may even be correlating two aspects of the same phenomenon, i.e. demonstrating a relationship between A and A! To be useful, therefore, an experiment must be evaluated through a theoretical argument.

2.3 Experiments and disproof

If an experiment yields a non-significant result, this does not disprove the hypothesis. The conventional language used when describing an experiment is couched in double negatives. The default position is the null hypothesis: the negation of the hypothesis the experiment is designed to establish. For our Hypothesis 3, this would be the statement ‘there is no difference between spoken and written 1990’s British English in the usage of whom rather than who.’ If a test does not find sufficient variation, the researcher can only report that the null hypothesis cannot be rejected. This is not the same as saying that the original hypothesis is wrong, rather that the data does not support rejecting the notion that nothing is happening. Collecting more data, revising terms or redefining variables can all contribute to obtaining significant results.

2.4 The value of an experiment

The value of an experimental procedure is that experiments permit researchers to advance a position. If independent evidence points to the same general conclusion, the research programme may be on the right track. An effective research programme makes novel predictions that one can evaluate experimentally (Lakatos 1978). Competing bodies of theory, according to Lakatos, progress or regress depending on whether they productively generate novel predictions or degenerate into a patchwork of exceptions. Moreover, within a community of researchers, provided experimental methods are transparent, others can reproduce experiments with different, or identical, data. This has two important practical implications for us.

a) A shared corpus becomes a focus of discussion within the research community as well as a testbed for theories.
b) As a community, we have to establish agreed standards of evidence, experimental design and reporting.

3. Designing an experiment

An experiment consists of at least two variables: a dependent and an independent variable. The experimental hypothesis summarises the experimental design and is couched in terms of these variables. The dependent variable in Hypothesis 3 is the usage of whom versus the usage of who where both are equally applicable; our independent variable is spoken versus written. Let us take data from ICE-GB, although we could equally take data from other sources. Our experimental hypothesis is (more strictly) a more specific version than our previous one (Hypothesis 3), in this case:

The pronoun whom is a less frequent alternate for who in spoken, compared to written, British English, sampled in a directly comparable way to ICE-GB.

This last caveat is just another way of saying that we can only use our experimental results to make claims about similar kinds of British English.

3.1 Constructing a contingency table

In order to evaluate the hypothesis, we perform a series of searches in the corpus and construct a table, called a contingency table, which summarises the data. An outline table for this example is provided in Table 1 (‘∨’ and ‘∧’ are logical or and and, respectively). This table helps us to organise our data to perform an appropriate significance test for categorical data: the $\chi^2$ (chi-square) contingency correlation test. The table is a simple 2 × 2 contingency table, i.e. where both variables have two possibilities. We obtain the values in the four grey cells (spoken or written, who or whom) from the corpus, and then calculate row and column totals. Using the ICE Corpus Utility Program (ICECUP, Nelson et al. 2002) and ICE-GB, we obtain Table 2.

Table 1. A simple contingency table for two Boolean variables (DV × IV).

Dependent variable (columns): use of whom over who. Independent variable (rows): spoken or written.

 | DV = who | DV = whom (O) | TOTAL (E*)
IV = spoken | spoken ∧ who | spoken ∧ whom | spoken ∧ (who ∨ whom)
IV = written | written ∧ who | written ∧ whom | written ∧ (who ∨ whom)
TOTAL | (spoken ∨ written) ∧ who | (spoken ∨ written) ∧ whom | (spoken ∨ written) ∧ (who ∨ whom)

3.2 Performing the test

Given the choice, is whom less likely to be spoken than written? The data certainly seems to suggest it. According to Table 2 there are similar numbers of cases of whom in the spoken and written sections of the corpus. But there are nearly twice the number of uses of pronoun who in the spoken part as in the written. The critical point is, if you picked a case at random, not knowing whether it was written or spoken, the odds that it was who would increase if you then found out that it was spoken. The statistical test takes this large variation in total frequency into account.

Table 2. Completing Table 1 with data from ICE-GB.

Dependent variable (columns): use of whom over who. Independent variable (rows): spoken or written.

 | DV = who | DV = whom (O) | TOTAL (E*)
IV = spoken | 1,336 | 40 | 1,376
IV = written | 735 | 43 | 778
TOTAL | 2,187 | 83 | 2,270

The $\chi^2$ test compares the difference between the observed distribution, O, and an expected distribution, E, calculated by scaling the ‘TOTAL’ column (“who or whom”, labelled “E*”) so that it has the same total as the observed distribution. The formula for chi-square is the sum of the squared difference between each pair of values from the observed and expected distributions, divided by the value in the expected distribution, written $\sum (o-e)^2/e$, where $o \in O$ and $e \in E$. We then determine the threshold critical value for $\chi^2$, which must be beaten by our value of $\chi^2$. There are two cells in the distribution, so the number of degrees of freedom, $df = n-1 = 1$. By convention, we can accept an error of 1 in 20 (0.05), so we obtain $\chi^2_{crit}(1, 0.05) = 3.841$. Observed $O = \{40, 43\}$, $E^* = \{1376, 778\}$, scale factor $SF = 83/2270 = 0.0366$, so expected $E \approx \{50, 33\}$. $\chi^2 = \sum (o-e)^2/e = 10^2/50 + 10^2/33 \approx 2 + 3 = 5$. Since $\chi^2$ > critical value, the result is significant, and the null hypothesis, i.e. that the use of “whom” does not correlate with whether the text was written or spoken, is rejected.

As we observed, just because a result is statistically significant does not mean that the result is linguistically very meaningful. It could be explained by other factors, e.g. a difference in sampling between the two samples. We have to argue for a particular interpretation of the result by referring back to the original corpus.
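The test procedure described above can be sketched in a few lines. This is an illustrative sketch, not ICECUP functionality: the function names are hypothetical, the critical value is the one quoted in the text, and the scale factor is computed directly as the ratio of the two column sums (an assumption about how the scaling is meant to be carried out), so the statistic need not coincide with the rounded hand calculation.

```python
CRITICAL_VALUE_DF1_P05 = 3.841  # threshold quoted in the text for df = 1, error level 0.05


def expected_distribution(observed, baseline):
    """Scale the baseline (E*) so that it sums to the same total as the observed values."""
    scale = sum(observed) / sum(baseline)
    return [value * scale for value in baseline]


def chi_square(observed, expected):
    """Sum of (o - e)^2 / e over paired observed and expected values."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [40, 43]     # whom: spoken, written (Table 2, column O)
baseline = [1376, 778]  # who or whom: spoken, written (Table 2, column E*)

expected = expected_distribution(observed, baseline)
statistic = chi_square(observed, expected)

# The null hypothesis is rejected only if the statistic exceeds the critical value.
significant = statistic > CRITICAL_VALUE_DF1_P05
```

Keeping the scaling and the summation as separate functions makes it easy to swap in a different baseline, for example one proportional to sample size rather than to the number of who-or-whom cases.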
A related consideration is the size of the effect.3 A small variation in a large sample may be significant, but not necessarily interesting.

4. Scientific experiments with lexical and wordclass data

Presented above is one example of an experiment in lexical variation, which compared lexical usage between text genres. The dependent variable was defined by the incidence of whom when the choice of either who or whom was present. The experiment was restricted to pronouns in order to screen out cases where the choice is not present: for example, in metalinguistic uses or proper names:

It should be from who from whom from whom I think [ICE-GB S1B-005 #78].
...For Whom the Bell Tolls... [ICE-GB W2B-004 #62].

4.1 Absolute versus relative frequency

This experiment implicitly broke with a common tradition in corpus linguistics and lexicography that concerns itself with the absolute frequency of words. Rather, what was examined was the relative frequency of whom given some other condition, in this case, when the speaker could feasibly choose who or whom. The illustration below has been given to show why this is both important and beneficial.

Suppose you read that trains are becoming safer and that between 1990 and 2000, the number of accidents on the railways fell by ten percent. But what if the number of journeys fell by a third over the same time period? Or the average journey length halved? Should you believe what you read? In either case the relative risk of injury (per journey or distance travelled) has risen.

The absolute frequency of a word tells you how frequent the word is in the corpus. But the reason that a word is present in the first place will probably depend on many factors that are irrelevant to a particular experimental hypothesis. As we have seen, the employment of relative frequencies focuses on variation where there is a choice. A researcher may need to check each case in the corpus to see if such a choice really exists. This is particularly important where cases may be classified by tagging or parsing. The employment of statistical tests has not obviated the requirement to interpret sentences. Experimental results are a principled way of focusing browsing and exemplification.

In the previous example, the use of relative frequencies means that the expected distribution was calculated by scaling the total “who or whom” column. If we calculated E on the basis of absolute frequencies, it would be proportional to sample size (in ICE-GB, the ratio of spoken:written is 3:2). As was observed, Table 2 indicates that the ratio of the number of cases of pronoun who or whom is 2:1 (spoken:written). Relative frequency focuses experiments on the linguistic choice being made rather than the volume of data.

4.2 Lexical and grammatical interaction

So far we have examined experiments where context affects a linguistic choice. Attributional studies reverse this implication by attempting to predict context – authorship and origin – from linguistic choices. A rather more central issue for linguists, however, is how to carry out research into the interaction between two or more lexical or grammatical variables. Let us suppose we wish to investigate two-word patterns consisting of the preposition of, from or to, followed by the pronoun which or what, and evaluate whether the choice of the first affects the second word in these patterns. When we are collecting data from the corpus, we have to ensure that we extract the frequency for each specific form separately and correctly. Not all sentences containing of and which contain of which. We have to count each distinct lexical pattern in the greyed cells in Table 3 separately and then sum the totals. The statistical evaluation then proceeds as before. As an aside, note that in some circumstances one might also consider whether the choice of the second word has an effect on the choice of the first. Given our limited knowledge of the structure of mental processes one should probably be a little wary of the assumption that all linguistic choice implications are mirrored by word order.
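The counting requirement described above, that only immediately adjacent pairs such as of which count as cases, can be sketched as follows. This is a simplified, hypothetical illustration: it matches lower-cased word forms only and ignores wordclass tags, whereas a real study would restrict matches to prepositions and pronouns as tagged in the corpus.

```python
from collections import Counter

PREPOSITIONS = ("of", "from", "to")
PRONOUNS = ("which", "what")


def count_patterns(tokens):
    """Count only adjacent preposition + pronoun pairs, one count per distinct pattern."""
    counts = Counter()
    for first, second in zip(tokens, tokens[1:]):
        if first in PREPOSITIONS and second in PRONOUNS:
            counts[(first, second)] += 1
    return counts


def contingency_table(counts):
    """Arrange the pattern counts as IV (preposition) x DV (pronoun), then sum the totals."""
    table = {p: {q: counts[(p, q)] for q in PRONOUNS} for p in PREPOSITIONS}
    for p in PREPOSITIONS:
        table[p]["TOTAL"] = sum(table[p][q] for q in PRONOUNS)
    column_totals = {q: sum(table[p][q] for p in PREPOSITIONS) for q in PRONOUNS}
    column_totals["TOTAL"] = sum(column_totals[q] for q in PRONOUNS)
    table["TOTAL"] = column_totals
    return table


# Hypothetical token sequence: it contains both "of" and "which" twice,
# but only one adjacent "of which".
tokens = "the house of which i spoke and which i know of is to what end".split()
table = contingency_table(count_patterns(tokens))
```

Here the second which and the second of occur in the same sequence without forming the pattern, so they are not counted.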

Table 3. A contingency table where cases consist of two-word lexical patterns, so of which refers to cases where which is immediately preceded by of.

Dependent variable (columns): which or what. Independent variable (rows): preposition.

 | DV = which | DV = what | TOTAL
IV = of | of which | of what | of what or of which
IV = from | from which | from what | from what or from which
IV = to | to which | to what | to what or to which
TOTAL | of which, from which or to which | of what, from what or to what | All

5. Scientific experiments with grammatical data

The issue of relating two or more lexical/grammatical elements together becomes central when we turn to the question of performing experiments on a parsed corpus. Using a parsed corpus has two main advantages for experiments:

a) It is easier to be more precise when establishing the grammatical typology of a lexical or grammatical item, saying, “retrieve this item in this grammatical context”. One can also vary the precision of our definitions by adding or removing features and topological attributes such as the position of a clause.
b) It is easier to precisely relate two items, e.g. “of and what are both part of the same prepositional phrase”.

In addition, given that we are interested in, say, prepositional phrases, one can enumerate a typology of such cases and thereby gather other examples that we had not previously considered. Against this perspective is the observation that an experiment on a parsed corpus must necessarily be in the context of a particular set of assumptions, i.e. the grammar employed. However, we should be clear that this does not make experiments unscientific. Rather, it implies that we must qualify our results – “an NP, according to this grammar”, etc. In fact, as Putnam (1981) points out, scientific theories are never assumption-free.4 In a parallel-parsed corpus containing two or more different grammatical analyses of each sentence, one could go further still and evaluate which aspects of one framework predict which aspects of another. (In practice, unfortunately, the cost of building a large corpus of this nature would currently be prohibitive.)

5.1 Specifying cases with Fuzzy Tree Fragments

ICECUP employs grammatical models called Fuzzy Tree Fragments (FTFs: Wallis et al. 1999) to establish the relationship between two or more lexical or grammatical elements.5 As well as permitting the extraction of an exhaustive set of well-defined complex structures from a parsed corpus, FTFs permit the study of the interaction between grammatical terms. FTFs express generalised tree structures by, among other things, weakening ordering and immediacy constraints between nodes, so that you can say this node eventually follows that node, for example.

Consider the relationship between adverbs and prepositions in clauses containing an adverbial phrase followed by a prepositional phrase, such as “No I load up with fast film” (ICE-GB S1A-009 #19, Figure 1, left). A researcher might wish to investigate cases defined by a particular basic structure (Figure 1, right). A research question might be, in such cases, is the existence of a phrasal feature in the preposition node related to the presence of the phrasal feature in the preceding adverb? With ICECUP the method is as follows (visualised in Figure 2).

1. Draw up a contingency table where the DV is the presence of the phrasal feature in the preposition node, and the IV, the existence of the phrasal feature in the preceding adverb node.
2. Compose four FTFs using the same basic structure, one for each of the four permutations with and without the phrasal feature in either node.
3. Perform these four searches and complete the contingency table, calculating the number of cases where the feature is absent by subtraction from the results from the more general case.6
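Step 3 above can be sketched as a small calculation. The sketch assumes one reading of the procedure: that "without the phrasal feature" means the feature is left unspecified in the FTF, so the four searches return the count for the bare structure, for each feature on its own, and for both together. The function name and the counts are hypothetical and are not ICE-GB results.

```python
def complete_table(n_all, n_iv, n_dv, n_both):
    """Fill a 2 x 2 table from four search counts, deriving feature-absent cells by subtraction.

    n_all  : matches for the basic structure, no feature specified
    n_iv   : matches with the feature on the adverb node (IV true)
    n_dv   : matches with the feature on the preposition node (DV true)
    n_both : matches with the feature on both nodes
    """
    table = {
        (True, True): n_both,                         # IV true,  DV true
        (True, False): n_iv - n_both,                 # IV true,  DV false
        (False, True): n_dv - n_both,                 # IV false, DV true
        (False, False): n_all - n_iv - n_dv + n_both, # IV false, DV false
    }
    if any(count < 0 for count in table.values()):
        raise ValueError("inconsistent search counts: a derived cell is negative")
    return table


# Hypothetical counts for the four searches.
table = complete_table(n_all=200, n_iv=60, n_dv=50, n_both=30)
row_totals = {iv: table[(iv, True)] + table[(iv, False)] for iv in (True, False)}
```

The negative-cell check is a cheap safeguard: if the more specific searches return more cases than the general one, the FTFs are not variants of the same basic structure.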

Figure 1. An example case of an FTF match in ICE-GB and an FTF skeleton for a set of closely related queries.

[Figure 2: diagram not reproduced. The results of the FTF queries are put in a contingency table with DV = PREP(phras) and IV = ADV(phras), each taking the values True and False, plus row and column totals.]

Figure 2. Constructing a contingency table to hold the results of the four FTF queries. Remaining cells are calculated by subtraction from totals.

5.2 Strict independence of cases in grammar

One problem that rapidly becomes apparent in parsed corpora is that individual cases may not be strictly independent and can therefore interact with one another. To be strict, one would need to insist that each case must be taken from a different source text. However, this is impractical to all intents and purposes. The limited number of available parsed samples, coupled with increasingly sophisticated techniques to extract cases, produces this problem in a number of different ways. These are listed below from the most grammatically explicit to the most incidental7. Types 1-4 arise where two cases are to be found in the same text unit. 1. Cases fully overlap one another. This is possible if an FTF contains unordered links between siblings or words, but it is rare. 2. Cases partially overlap one another, i.e. part of one case coincides with part of another. This is not uncommon. There are two distinct types of partial overlap. a) Where overlapping nodes in the tree match the same node in the FTF. This can arise with eventual links, e.g. if the link in Figure 1 between adverbial and prepositional phrases was eventual rather than immediate, then it could match overlapping cases in sequences like [AVP], [AVP], [[PP]] or [[AVP]], [PP], [PP]. b) Where overlapping nodes in the tree refer to different parts of an FTF, e.g. an NP within an NP or with co-ordination (is hunting and fishing and shooting one or two cases of co-ordination?).

36 3. 4. 5.

Sean Wallis Overlapping matching regions is not the only measure of interaction, however. Even if the FTF identifies a single clause node, for example, one case can dominate or subsume another – a clause within a clause. Similarly, one case might be related to another via a simple construction, e.g. co-ordination. Finally there are circumstances of rhetoric and echo: the construction may be repeatedly employed by the same speaker (possibly in the same utterance) or echoed by another speaker.

We need to avoid or at least minimise the effect of all of these. A simple check is to review those text units containing more than one matching FTF. As a general rule, however, we refer back to our original point. It is a good idea to ensure that data is sampled from as large a number of different texts as possible. Researchers should be particularly careful if they are compelled to base their results on a small number of texts (for example, if they are looking at specialised language use), or if they have relatively few cases. Conversely, this problem is less serious with common cases (say, over fifty in each category).⁸

5.3 Performing experiments with ICECUP

ICECUP obtains instances of grammatical structures in parsed corpora on well-defined principles. However, as we have seen, simply obtaining a set of matching cases is only the beginning of the experimental process. In the current release of the ICECUP software, one can use the software to form queries and extract results, but contingency tables must be manually constructed. As well as having to perform the statistical calculation by hand each time, this also means that (a) every permutation of an FTF must be worked out and each search performed individually (naturally, this limits the complexity of experiments),⁹ and (b) case interaction must be manually evaluated. There is no way of automatically accounting for case interaction. Despite these limitations, however, one can perform experiments that simply were not feasible previously. We believe that ultimately the best way to deal with these issues is by supporting scientific experiments in software (Nelson et al. 2002), providing tools to help researchers design experiments, automating the collection of statistics, and attempting to evaluate and compensate for case interaction. Moreover, beyond this, one could permit researchers to investigate multivariable interactions by defining a space of hypotheses which the computer could search and evaluate, a process called knowledge discovery (Wallis and Nelson, 2001). This procedure goes a stage further by performing not one but a series of experiments of increasing complexity, looking for interactions between many different variables defined by the researcher. Yet no degree of computational sophistication removes the necessity for linguistic debate about the implications of a scientific experiment by any method so undertaken.
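The manual step described above, building a contingency table from individual search totals and computing the test statistic, is simple to sketch. The code below is an illustration only and is not part of ICECUP. The counts are invented, the function names are our own, and a count that cannot be searched for directly is assumed to be derived by subtraction from the total. The statistic is the ordinary Pearson χ², compared against the standard critical value for one degree of freedom at the 0.05 level.

```python
def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def derived_count(total, known):
    """Derive the complementary count when only the total and one case are searchable."""
    return total - known

# Hypothetical search totals: (all cases, cases with the feature present).
spoken_total, spoken_present = 200, 120
written_total, written_present = 200, 80

table = [
    [spoken_present, derived_count(spoken_total, spoken_present)],
    [written_present, derived_count(written_total, written_present)],
]

CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05
stat = chi_square(table)
print(round(stat, 2), stat > CRITICAL_1DF_05)  # 16.0 True
```

With these invented figures every expected cell is 100, so the statistic is 4 × (20² / 100) = 16. The calculation says nothing about case interaction, which still has to be assessed separately.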

6. Conclusions

It is perfectly possible to perform robust scientific experiments on corpora, whether lexical, tagged or parsed. The presence of a parse analysis in a corpus permits easier and more reproducible experiments in grammar, but it also requires that the researcher critically engages with the grammatical analysis. In the first place this means that the researcher must check how consistently the framework was applied to the dataset, and, in designing experiments and reporting results, be constantly aware of issues of compatibility with alternative frameworks. In the longer term the perspective discussed here may eventually permit the community to critique analyses in a non-circular fashion and contrast frameworks in terms of their predictive and explanatory power.

As with any experimental design process, it is essential to specify the null hypothesis correctly, i.e. in terms of the relative, rather than absolute, frequency of an event. This means designing an experiment around the following question: What factors influence a speaker’s behaviour when she is given the choice of one construction over another?

Finally, scientific experimentation does not mean ignoring the sentence level description. It is often necessary to examine the corpus in a naturalistic fashion in order to determine a) whether a speaker genuinely has a choice, b) whether cases are independent from one another, and c) whether the annotation scheme, and thus the empirical results, can be related to other published results based on a different corpus or grammar.

Acknowledgments

I am grateful to a number of people for contributing to this line of argument and to this particular paper. Gerry Nelson, Tom Lavelle, Evelien Keizer and Gabriel Ozón cannot be held responsible for the views described here but nonetheless acted as critical reviewers and guinea pigs along the way. Much of the perspective was developed in the context of experiments in knowledge discovery (Wallis and Nelson 2001). The purpose of this paper is to translate this technical perspective into a practical research method that is applicable to existing tools and resources, and to encourage the debate about the implications of such a paradigm shift.

Notes

1

These pages are to be found in the FTF web site at the Survey of English Usage. See http://www.ucl.ac.uk/english-usage/ftfs. Unless given in full, all references cited below are relative to this point. Thus the main discussion on experimentation starts at http://www.ucl.ac.uk/english-usage/ftfs/experiment.htm. A lengthy discussion of scientific methodology in parsed corpora is in chapter 9 of Nelson, Wallis & Aarts (2002).

2

In this paper we concern ourselves with discrete categorical variables. We are concerned with distinct types of thing, rather than graduated alternatives (cf. ‘written-to-be-spoken’) or numerical quantities. The same general approach, with a different statistical test replacing χ², may be applied in such cases.

3

For more on measuring the size of an effect, see experiment2.htm.

4

See also the questions on experimentation in faqs.htm.

5

You can download ICECUP for free with a 20,000 word sample corpus from http://www.ucl.ac.uk/english-usage/ice-gb/sampler.

6

An extended discussion of this experiment is included in experiment4.htm.

7

See also experiment2.htm.

8

See experiment4.htm for a discussion on quantifying case interaction.

9

In ICECUP 3.1 you can specify that a feature is absent explicitly, but in ICECUP 3.0 you have to subtract the feature-absent case from the case where its presence is unspecified (the total). See experiment3.htm.

References

Lakatos, I. (1978), Mathematics, science and epistemology. Cambridge: Cambridge University Press.
Marcus, M., M.A. Marcinkiewicz and B. Santorini (1993), ‘Building a Large Annotated Corpus of English: The Penn Treebank’, Computational Linguistics, 19: 313-330.
Nelson, G., S.A. Wallis and B. Aarts (2002), Exploring Natural Language: Working with the British Component of the International Corpus of English. Amsterdam: John Benjamins.
Putnam, H. (1981), ‘The “Corroboration” of Scientific Theories’, in: I. Hacking (ed.) Scientific Revolutions. Readings in Philosophy. Oxford: Oxford University Press. 60-79.
Sampson, G. (1995), English for the computer: The Susanne Corpus and analytic scheme. Oxford: Clarendon Press.
Wallis, S.A. (2003), ‘Completing parsed corpora: from correction to evolution’, in: A. Abeillé (ed.), Treebanks: Building and Using Parsed Corpora. Boston: Kluwer. 61-71.
Wallis, S.A., B. Aarts and G. Nelson (1999), ‘Parsing in reverse – Exploring ICE-GB with Fuzzy Tree Fragments and ICECUP’, in: J.M. Kirk (ed.), Corpora Galore, papers from ICAME-98. Amsterdam: Rodopi. 335-344.
Wallis, S.A. and G. Nelson (2001), ‘Knowledge Discovery in Grammatically Analysed Corpora’, Data Mining and Knowledge Discovery, 15: 307-340.

WebCorp: providing a renewable data source for corpus linguists

Antoinette Renouf

Research and Development Unit for English Studies, University of Liverpool

Abstract

The many electronic text corpora available nowadays present ever fewer obstacles to a wide range of corpus linguistic study. However, corpora are expensive resources to create and to update, and there remain problems for linguists if they seek access to very large, very recent, or changing language. The World Wide Web, whilst intended as an information source, is an obvious resource for the retrieval of linguistic information, being the largest store of texts in existence, freely-available, covering a range of domains, and constantly added to and updated. Individual linguistic researchers have been trying to retrieve instances of rare or neologistic language use from the web by manipulating existing web search engines. Whilst this strategy is possible, in particular via Google, the output is rather haphazard and not linguist-friendly. The Research and Development Unit for English Studies has been seeking to remedy the situation through the creation of ‘WebCorp’, a tool designed to search the Internet and provide on-line tailored access to linguists. A demonstration tool is available at http://www.webcorp.org.uk. This paper will report on the research initiative and highlight some of the issues involved.

1. Introduction

A previously unimaginable number and range of electronic text corpora are now available to corpus linguists, from small and sampled collections to very large textual databases. Whilst this wealth of data makes possible many types of corpus-based research, particularly in the formerly rather inaccessible areas of lexis and lexico-grammar, it has inherent limitations. In practical terms, the corpus data and software may not be available without the appropriate computer access, licences, and so on. More fundamental linguistic limitations relate to the size, age and static nature of the corpora, which can preclude certain kinds of linguistic empirical investigation, for instance the study of very rare, new or changing language features. An alternative source of linguistic information is the web, a publicly available data resource containing a vast and evolving accumulation of texts. Admittedly, this is not constructed or managed with the rigour or for the purposes of a corpus. It is a muddle of multilinguality; it operates a loose definition of ‘text’ which includes all manner of extraneous matter; text dating is sporadic and linguistically uninterpretable, so that neither the latest coinages nor the elements of language change across time that are undeniably in there are traceable by means of chronological organisation. Nevertheless, as a renewable resource which in itself costs the linguistic community nothing to create or access, it is worthy of serious consideration.


The web itself is larger than any corpus. Estimates vary, but on the basis of extrapolation from AltaVista figures for sample words, we calculate its size, in terms of the searchable texts, to be currently at over 50 billion words and growing. In addition to size, the web obviously offers range; many specialised textual domains are represented. The problem of identifying these other than by Yahoo or Open Directory will be alleviated in due course, as the URLs become more transparent and the mark-up protocols are tightened up. Web text is up-to-date. By this is meant not that the web consists exclusively of yesterday’s or even today’s language use, but that it is not subject to the same delays in creation that dog designed corpus initiatives. Web texts are a combination of old and new, but ‘old’ by web standards generally means texts from the late nineties and 2000. The web is constantly updated, with several million pages being added every day. Consequently, it provides ever more - and more recent - data, and corresponding opportunities for retrieving fresh findings. In holding texts across time, which contain instances of language change which could be traceable, the web could even meet some of the needs of modern diachronic corpus linguists (Renouf, 2002).

2. Some strategies for accessing the web as a linguistic resource

2.1 Off-line processing

The web is being targeted as a linguistic resource from various quarters and in different ways. Some linguists, such as Kübler and Foucou (2000), are extracting texts, and meta-texts, which together make up corpora deemed to be representative of something, such as ‘general language use’, or a technical field. Fairon (1999, 2000) downloads entire newspaper sites for processing. Kilgarriff (2001), on the other hand, is currently collecting reference sets of URLs based on the BNC typology, with which a user can create domain-specified corpora by downloading, without copyright infringement.

2.2 Enquiry by search engine

Another strategy is to exploit the functionality of web search engines. Standard engines operate by searching the web for factual information containing a specified search term. A small effort of imagination recasts this in corpus linguistic terms as searching the web for contexts containing a target word or phrase. A growing number of researchers (Bergh et al. 1998; Brekka 1999, 2000) have been driven, by the absence or insufficiency of evidence in existing corpora for rarer or newer linguistic items and features, to attempt a trawl of the web by this means. Search engines are not, however, designed to accommodate such an approach, and the consequent negotiation entails tedious serial searching and downloading of sometimes individually thin pickings, followed by painstaking manual editing of whole texts. For the word cull, featuring heavily in British news from spring 2001 in relation to foot-and-mouth disease, AltaVista yielded 14,005 returns on Nov. 13th, 2000, using its Advanced Search Facility.


We present the first few, fairly typical contexts, from which it will be plain that the output, whether or not relevant from the point of view of topic, is neither linguist-friendly in format, nor rich in relevant instances of usage of the word cull itself.

Wilkes-Barre Scranton Penguins Hockey Club - the Official Website of the AHL A Wilkes Barre/Scranton Penguins are the AAA affiliate of the NHL’s Pittsburgh Penguins playing in the AHL.
URL: http://www.wbspenguins.com/
Related pages Translate
Topic: Scranton, Pennsylvania - Sports and Recreation

Alternative Music, NPR News -KCRW 89.9 FM Alternative, Eclectic, World, Pop, Jazz, Electronic, House and Hip Hop music and NPR, PRI, BBC and VOA news. Listen live or on-demand with RealAudio.
URL: http://www.kcrw.org/
Related pages Translate
Topic: Live Music Broadcasts

Top Internet News Stories from DataSegment.com Top Internet News Stories from DataSegment.com (Top Internet News Stories)
URL: http://breakingnews.datasegment.com/top_internet_stories/
Translate

The Register 24 August 2001 Updated: 20:22 GMT. Flirting tops recession-beating US wireless agenda. Tasteless maybe - but shockingly cluefull. 24 August 2001...
URL: http://www.theregister.co.uk/content/7/index.html
Related pages Translate

Only at AltaVista Return 8 does a potentially relevant context occur:

BBC News | FOOT AND MOUTH HOMEPAGE. | SPORT. | WEATHER. | WORLD SERVICE. | MY BBC. Search BBC News Online. "> You are in: In Depth: Foot and mouth. Front Page World UK UK...
URL: http://news.bbc.co.uk/hi/english/in_dept...uth/default.stm
Related pages Translate

As a linguistic search tool, Google is unique in extracting context for search terms, and has built in some refinements which will be shown later, but it retrieves only one instance of a search term from a given web page on which it may actually occur several times. A search engine also covers only a slice of the web. Furthermore, it can retrieve information only for the search terms in its periodically-updated index. Google could not yet trace Sophiegate, of April 1st 2001 vintage, in early May.

3. A new linguistic tool: the WebCorp system

A further leap of imagination reveals the web to be ripe for exploitation by software tailored to find and retrieve contextualised instances of words and phrases. Such information could serve the linguistic community in areas of linguistic, pedagogic, lexicographic and other endeavour by filling the information gaps left by traditional corpus data. This was the point of departure for the WebCorp project, in the Unit at Liverpool. The WebCorp tool is being developed according to an intensive and ambitious two-year project plan, which has been informed in part by the copious feedback received in response to a simple pre-project prototype software demonstrator that we installed on the web back in May 2000. Among the many unsolicited expressions of enthusiasm for the WebCorp tool, Michael Rundell stated in his paper entitled ‘The biggest corpus of all’ (Rundell 2000) that:

...a major breakthrough is at hand, in the form of a stunning new website that produces real ‘concordances’. As with Altavista and others, http://www.webcorp.org.uk/ [i.e. WebCorp] searches the entire Internet for your query. But in this case the output is a proper concordance with an amount of surrounding context which the user (that’s you) can specify in advance. The results, in other words, look very similar to what you might get from the BNC or COBUILD Direct - but in this case the "source data" is the vast store of text on the entire Internet.

3.1 Diagram of WebCorp system operation

The WebCorp system operates as follows. The tool currently has six stages, as shown in figure 1. It first interfaces with the user request, which can be a word or a contiguous phrase, converting it into a format acceptable to a selection of search engines. It then piggy-backs on one or other of these that has been specified by the user. Each search engine follows its own procedure for searching a section of the web for texts containing the specified language item. Once the engine has traced the search term, via its own index, to a candidate text, WebCorp downloads that text temporarily into memory and extracts the appropriate linguistic context, processing and collating it before presenting it to the user. In its current form, the Graphical User Interface looks as shown in figure 2. It offers various options to the user. The user can submit a keyword consisting of a word or phrase. The user can select one of 5 search engines: currently, Google, Altavista, Northern Light, FAST, and MetaCrawler. Case sensitivity may be specified or not. There is a choice of output format: HTML, HTML KWIC tables, and plain ASCII text. The user may wish to have a display of URLs, or for the sake of readability, omit these and have an unbroken set of KWIC contexts.
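The context-extraction stage can be illustrated with a small sketch. This is not the WebCorp implementation, whose code is not described here. It assumes that a page has already been downloaded and reduced to plain text, and it shows only how a word or contiguous phrase might be located and returned with a user-specified span of words to the left and right, with optional case sensitivity. The function name `kwic` and the sample text are hypothetical.

```python
def kwic(text, term, span=5, case_sensitive=False):
    """Return (left, node, right) contexts for a word or contiguous phrase."""
    tokens = text.split()
    term_tokens = term.split()
    n = len(term_tokens)

    def same(a, b):
        return a == b if case_sensitive else a.lower() == b.lower()

    lines = []
    for i in range(len(tokens) - n + 1):
        # Ignore surrounding punctuation when comparing, but keep it in the output.
        window = [t.strip('.,;:!?"\'()') for t in tokens[i:i + n]]
        if all(same(w, t) for w, t in zip(window, term_tokens)):
            left = " ".join(tokens[max(0, i - span):i])
            node = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + span])
            lines.append((left, node, right))
    return lines

# Hypothetical page text standing in for a downloaded web document.
page = ("The company launched a corporate portal for staff. "
        "A corporate portal replaces the older intranet.")

for left, node, right in kwic(page, "corporate portal", span=3):
    print(f"{left:>25}  {node}  {right}")
```

Whitespace tokenisation is a deliberate simplification. Real web pages would also need mark-up removal and duplicate handling before any such extraction.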

Figure 1. WebCorp System Diagram (components: User Interface, WebCorp, Search Engine, Web Texts; stages numbered 1-6)

Figure 2. Graphical User Interface of the WebCorp Tool

The number of concordance lines may be specified. For the purposes of search refinement, a particular site domain may be specified in terms of any part of a URL; for example, .fr or .ac.uk. The concordance span may be set at between 1 and 50 words to the left and to the right of the node. Users not wishing to wait for results may receive them by selecting an email option. Further refinements to this interface are in progress.

4. Sample WebCorp searches

The next section of this paper will present some of the linguistic information which can usefully be retrieved from the web.

4.1 Study of neologisms

One area of linguistic interest is the search for neologisms and new word uses. One might have noticed in computing journals that a new term, corporate portal, appears to be taking on some of the role of the earlier term intranet. The APRIL project graph (Pacey et al. forthcoming) confirms a changeover in the frequency with which the two terms (intranet and portal, since the latter occurs only in corporate portal) occur in the Independent newspaper from 1989-1999.

Figure 3. Frequency of occurrence of portal vs. intranet between 1990 and 2000


A search using the WebCorp tool with the AltaVista search engine yields 57 occurrences for corporate portal, which can be extracted in neat, one-line contexts as follows:

1, is using a mobile bizli.com , an employee-only next month by delivering a 10, 1999 TIBCO launches joins forces with TIBCO on plans to get into the developed what we call a scalable, widely deployed The intranet programming Management of your The genesis of the Using Domino to build a News Yahoo and SAP target would spice up Yahoo’s

corporate portal corporate portal corporate-portal corporate portal corporate portal corporate portal corporate portal corporate portal corporate portal corporate portal corporate portal corporate portal corporate portal corporate portal corporate portal

to give traveling consultants access to access intranet applications from platform and related tools capable building effort, hosting service TIBCO Yahoo plans to get into business through a partnership with to power this Internet business software on the market. < market-leader goes global! and e-commerce application based on state-of-the demand. movement is the runaway intranet Seagate Software distributes business space By Ian Lynch intranet technology by enabling users

4.2 Study of rare uses

One of the known processes of change in modern English grammar is the shift from person who to person which type of construction discussed by Mair (1998). The WebCorp tool can furnish examples of such rare language use, which supplement the few found by Mair in a number of corpus sources, and thus help, as Bergh (1998) says, to ‘improve predictive power in determining what is “real English”’. WebCorp used Altavista to generate 88 returns for person which; an extract is shown:

on self entries. Here the person which entered his entry can decide on
the plan, or by any person which is a disqualified person with
related person” means any person which was under common control
to make first contact. The person which has written the search ad for
your first contact. The person which has written the ad will
Registrar: a registrar is the person which is authorized to enter and
46: In Reply toL The person which posted it has to be me
guestbook! You are the person which visits my page
same person or by any person which controls or is controlled by, or
The person which posted it has to be
15 years, or both. A person which is an organization shall,
passed by, He saw a person which was blind from birth.
by a household or person which generates less than 100 kilos

4.3 Study of productive linguistic features

Another area of development in the English lexicon that one might wish to study is the productivity of some morphemes. However, the chronological, quantified diachronic study that productivity implies to a corpus linguist is not yet possible on the web, due to the lack of textual authorship dating. Nevertheless, hypotheses based on diachronic data sources, such as the chronologically-stored and processed journalistic text collections and the APRIL morphological database at Liverpool, or the FLOB and Frown corpora at Freiburg, can be tested against the larger web resource. The user might note that the ‘e-’ splinter, an abbreviation of electronic, is gaining status as a full-blown affix, and is taking over in popularity from cyber and techno in creating new formations. The APRIL tool compares e- and cyber- frequencies of use in 10 years of Independent text (figure 4).

Figure 4. Frequency of occurrence of e- vs. cyber- between 1990 and 2000

E-words can be searched for with our WebCorp tool. One finds that a subset of the ‘e’ words is evolving in text, whereby an E-word can be created not simply by attaching ‘e’ to a root form, but also by taking a word like retail, which has e as the vowel in its initial syllable, and clipping the letters preceding the ‘e’. Neologisms arise in this way from semantically transparent and appropriate bases such as freelancers, which becomes elancers; and retail becomes etail, as shown below:

WebCorp output for search term “etail”

1998 – Web-based retail, or etail as the word du jour
Weeks. Close behind are the etail enablers, those that make the effort
11.5 million if the etail meets targets over next
Can be viewed at www, etail .ie home | web sites | brochures
Providing goods from one supplier etail has arrangements with several top
Your purchases are made through etail on your behalf. Members of
On your behalf. Members of etail will receive discounts on the
Normal RRP of products since etail h negotiated discounts with all
etail shopping experience now… etail Shopping
Personal details. Jump to About etail Shopping. The Checkout New Special Offers for Your
Secure Shopping Contacting etail Shopping
Your Shopping Trolley Login to etail Shopping Search Help

4.4 Study of unconventional use

In corpus-based study of language change, one can encounter unexpected usage in otherwise conventional sources. Investigating the continuing regularisation of irregular verbs (e.g. Mair 1998), for example, one finds in the Independent newspaper several linguistic and metalinguistic uses of maked, gived and catched:

It maked sense to police chiefs
It does not matter to them who maked their laws
Past tenses as ‘hited’, ‘maked’, ‘singed’ and ‘goed’.
Technical papers gived the total number of fuel elements changed
No sooner had I said this than Guiseppe catched hold of my hand
The girls bowled, batted, ran and catched as well as most men could

WebCorp would be a resource to turn to for confirmatory or counter evidence. However, alongside conventional usage, the Web is strewn with typographical errors, as well as uninformed, colloquial, provisional and improvised language use of the spontaneous kinds encouraged particularly in chat rooms and news lists; and Web-based text is very often also written by non-native speakers. So one can find instances of almost any odd formation. There is not room in this article to indulge in detailed illustration of the full joys of Web invention; nor is it pertinent. The point is that it means that the Web is not a place to go for reliable confirmation of correct usage unless the user is in a position to evaluate what he/she finds. Such evaluation is usually achievable through native-speaker intuition, but these means are not open to the non-native speaker-learner. There is a need to discover some clues as to the status, bona fide or otherwise, of what is being observed, that can allow the language learner to recognise error and thus make the web a more usable language source.

If we take a questionable form such as maked above, we find 92 instances yielded by WebCorp via Google on 14th June 2002. Closer observation reveals among these a range of error types, often errors in combination. The purpose here is not to identify the kinds of errors that occur per se, but to see whether there is anything in the environment which can indicate to someone with no recourse to native-speaker judgement that a word is erroneously or conventionally spelt and used. Of the 92 instances of maked, 54 are native-speaker typographical errors, intended to be the words marked or naked, as shown in examples (1)-(4):

(1) All mandatory items are maked with an asterisk (*) (45 occs.)
(2) Assessment is by internally maked homework and external examination. (3 occs.)
(3) Within a week of starting Accolate, I noticed a maked improvement of my symptoms. (1 occ.)
(4) it was considered great fun to eject maked classmates out into the winter (5 occs)

These errors are identifiable to the native-speaker through contextual clues in the form of more (or less) established collocates (here, respectively: items, with, asterisk, *; assessment, homework; noticed, improvement; fun, eject and winter). WebCorp currently offers a collocational profile which reveals some of this information (see 6.2). A further 8 instances appear to be mis-typings of makes, created by native speakers out of carelessness or perhaps through a last-minute decision to change tense. The fact that they occur incongruously in text otherwise set in the present is the clue that aids this interpretation. See examples (5)-(6):

(5) If…he has only just realised this, it maked one wonder if he has ever played the game
(6) In ‘Living on thin air’ there was a prediction that in the future the tax system will be under threat. As companies become bigger it maked the economies of scale possible

In (6) above, the writer appears to be caught between the conventional rules of reporting, which require past tenses to follow the reporting phrase ‘there was a prediction that’, and the informal tendency not to sequence reported verbs. Then there are 11 instances of maked which are intended to be made. These fall into two categories: 9 that occur in contexts liberally sprinkled with orthographic and grammatical inaccuracies, often on message boards or chat rooms (so identifiable in the URL by such key words as messages, games, club, connection, or dubious elements such as dontlike or diy) – and so perhaps recognisable as errors, as in:

(7) The park looked decidely scruffy, and stupid things like Submission still having the Virtual Queue lines maked the park look like a tacky fun park
(8) strumming while Derek (Fish Dick) maked his way throught the crowd
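The URL clues just mentioned lend themselves to a crude automatic screen. The sketch below is an illustration under stated assumptions and not a WebCorp feature: the keyword list is simply the set of URL elements named above, the function names are hypothetical, the example URLs are invented, and each URL is assumed to include its scheme (http://). It also shows how a site-domain restriction such as .ac.uk might be tested.

```python
from urllib.parse import urlparse

# URL elements named in the text as signs of message boards or chat rooms.
INFORMAL_URL_KEYWORDS = ("messages", "games", "club", "connection", "dontlike", "diy")

def informal_source(url, keywords=INFORMAL_URL_KEYWORDS):
    """Return the keywords found in the host or path of a URL (empty list if none)."""
    parsed = urlparse(url.lower())
    haystack = parsed.netloc + parsed.path
    return [k for k in keywords if k in haystack]

def in_domain(url, domain_part):
    """True if the host name of the URL contains the given domain fragment."""
    return domain_part.lower() in urlparse(url.lower()).netloc

# Invented example URLs.
urls = [
    "http://www.example.com/messages/board12.html",
    "http://www.example.ac.uk/papers/report.html",
]
for url in urls:
    print(url, informal_source(url), in_domain(url, ".ac.uk"))
```

Plain substring matching will over-trigger (any host or path containing the letters club is flagged), so a result of this kind is at best a hint about the likely status of a form and cannot replace judgement of the context itself.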

There are 2 further instances of maked for made which are isolated errors in otherwise accurate text, and so probably slips of the brain, but correspondingly less identifiable as errors by the non-native speaker. See examples (9)-(10):

(9) Club stalwart Alan Elliott maked the trip up to get a day’s sailing under his belt
(10) One of the fans in one of the PC’s started to maked a noise. Upon investigation (with the intention of replacing the fan)

There seem to be just 12 instances of non-native speaking errors, all where maked is intended to express the past tense of make, as in (11)-(12):

(11) Alhambra forests … maked it necesary the application of an special program of control
(12) the cinema called "de pipas" (“of pipes”) because it was maked only to win money and without interest.

Instances like (11)-(12) may be recognised by some non-native speakers by the fact that they occur in text authored by people with non-English names, or containing a number of non-native speaking errors as indicated. The remaining 7 instances of maked are either not erroneous, as in the 3 cases of Early English spellings; see (13):

(13) Of auentours þat fel bi dayes, Wher-of Bretouns maked her layes.

or they are deliberately wrongly spelt, employed as jokes in 3 cases, as in (14):

(14) He ‘can not believe you wood get someone who maked all those gramar mestakes to WORk for you’

and once in a literary (science-fictional) context, to convey a sense of atmosphere through imitating spoken dialect, in (15):

(15) Eye Rock or someplace such, just like as the South’uns maked a gather for themselves and clans out at the West

It seems, then, that it is orthographic and lexical clues in the immediate and larger context of a particular word instance that are the learner’s best hope of knowing how to judge its accuracy. At the moment, there is no standardised referencing of Web documents such that they are marked for ‘native-speaking’ status. The current state of Web annotation poses problems for all serious users as to how far the data can be relied on to model acceptable or typical use for pedagogic and lexicographic applications. Domain specification (see 6.4), whereby message boards and chat rooms are ignored, would help. Country specification, such as .uk (see 6.5), might limit the retrieval of non (English)-native-speaking text and word use (though not careless or humorous native-speaking contributions). An automated spellchecker, that is to say a master word index against which items were checked and graded, would be another useful filter, although one man’s error can be another man’s creative use, and today’s error next year’s norm. A cumulative collocate bank for the words indexed could rate a Web word for probable misspelling on the basis of its context. It is hoped that the next generation of XML annotation, together with advances in the Semantic Web initiative, will increase the chances of providing some metalinguistic guide as to the status, provenance and thus reliability of the words under investigation.

5. The user

5.1 User search habits

As noted earlier, the design of the WebCorp System is informed by usage and user comment via the feedback mechanism on the web. Among other things, we note the types of terms that users tend to submit. A sample of these is shown in Table 1.

Table 1. Sample WebCorp Search Terms

abruptity           eskamoteur          B.Sc.
albeit              rehabilitation      spinach is bad
diffuse tension     virtual classes     pastoral
token               shockjock           steven bird
sophie-gate         market drive        elder rights
blandity            hear back from      hopefulty
milleaux            misfortunate        gargle
the better          tip of my tongue    fixed for
at long last        disinvestment       snafu
"geht sich aus”     coolosity           kicked the bucket
gave a lecture      la-di-da            gobsmacked
cleave              ballpark            un*
brockoli            tweenager           holistic
doing my head in    ick                 racialism
odiferous           hot-desking         is your call
can reach me        if you don’t mind   zine

5.2 User requirements for improvement

As of May 2001, the features and improvements most commonly requested by users via our feedback tool were, in descending order:

· Increase speed
· Regular Expressions, pattern matching, wildcards
· Full sentence output
· Language selection / detection
· Add more search engines (FAST, Northern Light)
· Ability to customise max. no. of concordances to be returned - useful for slower connections
· Collocation
· Discontinuous phrases - words within certain distance
· Support for double-byte characters (e.g. Chinese)
· More elimination of duplicate results, on same/different pages/sites
· Sorting on left or right words
· Senseval-style output format option¹
· Wild cards

6. Improvements in WebCorp functionality

Taking into account the user feedback above, but also following our own planned programme of development, we have moved through successive versions of the WebCorp tool. Progress is incremental but swift. Since the prototype tool was first reported on at the Corpus Linguistics 2001 conference in Lancaster, two new versions have emerged, the first offering improvements including smaller font and compact presentation for concordance lines, numbered concordance lines, and HTML-centred keywords.

6.1 Formatting

By way of illustration, I show below the application of one particular set of format options, producing a 10-word context, HTML-formatted concordance extract with keyword emboldened and line numbering for convenience of reference. This was retrieved via the Northern Light search engine at midday on May 2nd, 2001 (at a time when the neologism was still not accessible in the indexes of either AltaVista or Google).

Sample WebCorp output for the recent neologism, ‘Sophiegate’:

1. isn’t likely to end " Sophiegate " soon. Word is, the newspaper
2. called R-JH. Thanks to Sophiegate , she’s stepped down and
3. the sharp end of the Sophiegate skewer. Tony Blair put on
4. Britain’s closet republicans. Sophiegate is a huge blow to
5. monarchy faces tough choices over " Sophiegate " tapes. Apr 06 2001 17
6. interest between the two. The " Sophiegate " affair will also be a
7. to say that the recent " Sophiegate " scandal involving Sophie Rhys-Jones
8. lucrative deal. The so-called " Sophiegate " scandal led many newspaper edit
9. Pak-origin scribe set up ’ Sophiegate ’. Pope commemorates Good Friday
10. isn’t likely to end " Sophiegate " soon. Word is, the newspaper
11. column Marketing & PR|Press & publishing ’ Sophiegate ’: what the papers s
12. 2001. Mark Lawson on the Sophiegate . 31 Mar 2001. Mark Lawson

Thus, in this particular case, WebCorp was able to extract up-to-the-month results for a vogue formation.
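The formatting options described in this section (a fixed word context, a highlighted keyword and numbered lines) can be illustrated with a minimal keyword-in-context sketch. This is not WebCorp's code: the function names `kwic` and `format_kwic`, the whitespace tokenisation and the reading of a "10-word context" as five words on each side are all assumptions made for illustration.

```python
import re

def kwic(text, keyword, context_words=5):
    """Collect (left, node, right) triples for each hit of `keyword`."""
    tokens = text.split()
    # Allow punctuation or quotation marks attached to the node word.
    node = re.compile(r"^\W*" + re.escape(keyword) + r"\W*$", re.IGNORECASE)
    hits = []
    for i, token in enumerate(tokens):
        if node.match(token):
            left = " ".join(tokens[max(0, i - context_words):i])
            right = " ".join(tokens[i + 1:i + 1 + context_words])
            hits.append((left, token, right))
    return hits

def format_kwic(hits, width=40):
    """Number the hits and align the node word in a fixed column."""
    lines = []
    for n, (left, token, right) in enumerate(hits, start=1):
        lines.append(f"{n:>3}. {left[-width:]:>{width}}  {token}  {right[:width]}")
    return lines

# Invented sample text, loosely modelled on the extract above.
sample = ('The so-called "Sophiegate" scandal led many newspaper editors to comment. '
          'Thanks to Sophiegate, she stepped down.')
for line in format_kwic(kwic(sample, "Sophiegate", context_words=3)):
    print(line)
```

Matching on whole whitespace-delimited tokens keeps the sketch short; a real tool would need proper tokenisation and would work on text extracted from HTML rather than on a plain string.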

6.2 Collocation

The most recent publicly-available version (4.7) of the WebCorp tool incorporates type/token counts for web pages, improvements in speed of search and retrieval, and contiguous collocational statistics. Simple collocational information, based on a word span of 4 words to the right and left of the keyword, is shown in table 2, taking rage as the search term.

Table 2. Extract of Collocational Profile for “rage” (excluding stopwords)

Word            Total   Counts in occupied positions, in order from L4 to R4 (empty positions not shown)
air             43      41 1 1
UK              22      8 1 5 1 1 6
computer        6       6
98              6       1 4 1
rage            6       1 1 1 1 1 1
Britain         6       3 2 1
new             5       1 2 2
International   5       3 2
Aggression      4       3 1
Internet        4       2 2
Air             4       4
Transport       4       2 2
introduced      4       2 2
99              4       1 3
incidents       4       1 3
links           4       2 2
Sep             4       2 2
so-called       3       3
Spotlight       3       3
BA              3       1 2

Key Phrases: air rage; computer rage; Air rage

6.3 Precision through additional search terms

Precision in retrieval of linguistic information is required, as it is for information retrieval. That is to say, the user wants to see relevant and only relevant search results, particularly from a heterogeneous environment like the web, which will often swamp the user without some kind of filtering. Google is currently the only search engine which refines search by means of additional search terms. Taking the word cull, I assume that the user wishes to test the hypothesis that the word has changed in meaning since the foot and mouth epidemic began this spring, that it no longer means ‘strengthen a herd by removal and slaughter of the weaker specimens’, but has simply become a euphemism for ‘kill’. First using Google’s Advanced Search, I specify that the contexts returned must be ‘in English’, last updated in the ‘past 3 months’, and must contain the phrase ‘foot and mouth’ somewhere in the text. I am presented with the following output on Nov. 13th, 2001, extracted from a total of 4,150 returns:

BBC News | FOOT AND MOUTH
... Farm vaccine report launched Finnie presses for meat exports Foot-and-mouth clean-up complete Cull delay ’worsened epidemic’ Legal threat over pyre clean-up ...
Description: Ongoing collection of news articles, reports, forums, audio and video. From BBC News. UK.
Category: Society > Issues > Animal Welfare > Farming > News and Media
news.bbc.co.uk/hi/english/in_depth/uk/2001/foot_and_mouth/default.stm - 44k - Cached - Similar pages

Guardian Unlimited | Special reports | Special report: foot ...
... 09.10.01 Foot and mouth inquiry told of ’needless killing’. 04.10.01 Swifter cull ’would have curbed foot and mouth’. ...
Description: Ongoing collection of news, commentary, audio, graphics and interactive guides to the outbreak.
www.guardian.co.uk/footandmouth/0,7368,441391,00.html - 66k - Cached - Similar pages

Foot it around and Mouth Off in Edinburgh
... If you want to get involved in debate around the Foot and Mouth cull then follow this link. site map.
www.mouthoff.org.uk/ - 3k - Cached - Similar pages

Foot and Mouth Disease (FMD) site presented by Cybersavvy UK
... a report which states that the cull came too late and was scientifically ... using a pneumonia cure to stop foot and mouth reaching his livestock. Although he ...
Description: For Cumbria and the Yorkshire Dales. Latest figures, headlines, links to FMD resources, commentary,...
Category: Society > Issues > Animal Welfare > Farming > News and Media
www.webpr.co.uk/fmd/ - 47k - Cached - Similar pages

Yorkshire Dales - foot and mouth - news from Daelnet
... FIGURES showing that hundreds of thousands of cattle slaughtered in the foot and mouth cull did not in fact have the disease threw a massive bombshell into the ...
www.daelnet.co.uk/news/foot_and_mouth/foot_and_mouth_110501.cfm - 18k - Cached - Similar pages

CIWF Press Releases 2001
... 25th April 2001, FOOT AND MOUTH SPREAD THROUGH GOVERNMENT FIASCO. 23rd April 2001, MASS CULL BRANDED FUTILE AS FOOT AND MOUTH THREATENS DEER. ...
www.ciwf.co.uk/PRs/2001/2001.htm - 13k - Cached - Similar pages

Foot in Mouth
... Articles on Page 2 Foot and Mouth Disease Like lemmings our ...
www.silentmajority.co.uk/FootInMouth/ - 101k - Cached - Similar pages
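The refinement just described, in which a page is kept only if it contains an additional phrase as well as the search word, can be sketched as follows. This is a hedged illustration rather than the method of any search engine or of WebCorp: the function names, the example URLs and page texts, and the decision to treat hyphens as spaces (so that ‘foot-and-mouth’ matches ‘foot and mouth’) are all assumptions.

```python
import re

def normalise(text):
    """Lower-case the text and treat hyphens and runs of whitespace as single spaces."""
    return re.sub(r"[\s\-]+", " ", text.lower()).strip()

def filter_pages(pages, node, required_phrase):
    """Return the URLs of pages containing both the node word and the required phrase."""
    node_re = re.compile(r"\b" + re.escape(node.lower()) + r"\b")
    phrase = normalise(required_phrase)
    kept = []
    for url, text in pages.items():
        norm = normalise(text)
        if phrase in norm and node_re.search(norm):
            kept.append(url)
    return kept

# Hypothetical pages, invented for the example.
pages = {
    "http://example.org/a": "Mass cull begins as foot-and-mouth spreads",
    "http://example.org/b": "Seal cull criticised by campaigners",
    "http://example.org/c": "Foot and mouth: farms reopen",
}
print(filter_pages(pages, "cull", "foot and mouth"))
```

Only the first page survives: the second lacks the phrase and the third lacks the search word. The Google listing above shows the result of the same kind of filter applied by the search engine itself.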


It will be noted that the Google output is not easy to read, and that from the linguistic point of view, several contexts have been truncated. In contrast, the application of the ‘Additional Filter’ option for the WebCorp tool, using AltaVista, produces 156 concordance lines on Nov. 13th, 2001, of which an extract (with contexts which could be specified as longer, and with hypertext links to the original text) is shown below:

     Britain to update EU over foot-and-mouth  cull  British farm minister Nick Brown
against foot and mouth disease, involving the  cull  of up to 100,000 animals which
  set your edition UK to relax foot-and-mouth  cull  = LONDON, England (CNN) --
             to be put down after surviving a  cull  on a farm in Devon in the
                       Massive foot-and-mouth  cull  begins. Authorities will look to bu
         LONDON, England -- A mass nationwide  cull  of livestock is being carried out in
      Horse Owners FMD Information - stop the  cull  Send a link | Link to us BBC
          vets are preparing for a widespread  cull  of wild animals in the latest de
            birds. Also at risk of a possible  cull  are herds of deer spotted gathered
      ago, had been contained without a major  cull  but accepted that the latest cases
slaughtered but the possibility of a wildlife  cull  raised fears of a considerably
    farmland?" Attempts to further extend any  cull  would lead to logistical problems
   was concerned that experts involved in the  cull  could spread the disease on their
   disturb settled groups of animals during a  cull  and send them further afield. Colin
   the current evidence would justify a major  cull  of wildlife. It they were to
         to extend nationwide the pre-emptive  cull  of healthy animals within two miles
          the epidemic 23 Mar 01 | Wales Mass  cull  begins on Welsh border 23 Mar 01
      butterfly in danger from foot and mouth  cull  One of Britain’s rarest butterflies be
             threatened by the foot and mouth  cull  The Marsh Fritillary’s last stronghold
         own livestock of any type. The kill/  cull  policy now in place effects us all
                               Foot-and-mouth  cull  challenged -- With the acrid smoke

6.4 Precision through domain specification

Another recent WebCorp refinement is domain specification. Web URLs are not yet transparent, and the only alternative offered by search engines is the indexed information provided by YAHOO and Open Directory, which have pre-indexed downloads of the web. Using WebCorp, however, it is possible to restrict web search by specifying a part or all of a site domain (e.g. .uk or .fr). A linguistic question might concern the use of EU terms as part of the globalisation process in languages. If the user wishes to observe the extent to which the French-originated term acquis communautaire is being used in English (Renouf 2003), it is possible to do so by restricting the URL search to texts within the UK. An extract of the 127 results produced by WebCorp using AltaVista is shown here:

   and legislation consistent with the  acquis communautaire  Some of the candidates
   ouropean legislation, the so-called  acquis communautaire  The only negotiations will
        European Union forces too much  acquis communautaire  on them. That is the
                with some parts of the  acquis communautaire  was one on which almost
of European legislation, the so-called  acquis communautaire  The only negotiations will
                the 31 chapters of the  acquis communautaire  and has closed 11 chapters
               qualified majority. The  acquis communautaire  means that the arrangeme
         and the implementation of the  acquis communautaire  the faster they can join
         it difficult to implement the  acquis communautaire  However, I am extremely
         meeting the challenges of the  acquis communautaire  but that it also concerns
irrevocably binding. The principle of ’  acquis communautaire  ’ means that once any legis
                  with the ECJ and the  acquis communautaire  would be remedied.The
             occupied field and of the  acquis communautaire  are unchallengeable? Mr.
      completely right. The concept of  acquis communautaire  is very important in this
                   This is part of the  acquis communautaire  which is binding on all

6.5 Precision through language specification

A further refinement in linguistic search is the specification of a particular language for the source text. The WebCorp tool will soon be refined, in conjunction with our collaborating search engine company, to handle different languages. Meanwhile, the means available for restricting language is simple-minded but effective: it is to specify the particular section of the URL which designates country. Thus, to find instances of cyberterroriste in French text, one would specify ‘.fr’ as the ‘domain’ option. Using AltaVista, this generates 8 results, shown below:

WebCorp output for search term “cyberterroriste”

moigner d’une nouvelle menace d’ordre " quelle personne peut devenir un de l’internautes un Internet est devenu pour le mail-bombing qui permettent de s’improviser véritable démon. Un messages électroniques en provenance d’un informatique et le qualificatif de

cyberterroriste cyberterroriste cyberterroriste cyberterroriste cyberterroriste cyberterroriste cyberterroriste cyberterroriste

". On parle plus de d Détruire des données un outil de plus en au moindre risque. On le un anarchiste. À côté de M. Schmidt - Oui. Un qui colle au technohéros

7. Future plans

Our future plans for improvements to the WebCorp system have been identified above. Three additional areas of development are in operation behind the scenes.

7.1 Grid involvement

The first involves the careful monitoring of next-generation Internet activities. The web explosion will lead to its being superseded by the Grid (Foster & Kesselman 1999). ‘Grid’ is not an acronym but a metaphor for the next stage of Internet and Web organisation, whereby electronic data retrieval and processing activities are conceptualised as basic utilities for society, by analogy with gas or electricity, on a global scale. ‘Grid’ designates the philosophy behind the provision of vastly increased computing resources, entailing such measures as distributed and shared computing processes. By about 2005, this or a similar resource will usher in a new generation of electronic facilities (hardware, middleware and new distributed ways of storing and accessing text), which will probably involve text being accessed via the replacement Internet but not sitting directly on it. WebCorp is well placed to develop in step with Grid initiatives at Liverpool.

7.2 Internationalisation

The second stage of development involves the internationalisation of WebCorp in collaboration with colleagues abroad. By ‘internationalisation’ is meant the introduction of measures to allow the identification and handling of other languages on and via the Web. Language identification will improve search refinement, whilst foreign language handling will speed up response, possibly also effected through local (foreign) site location.

7.3 Standardisation

The third line of development involves the standardisation of Web text markup, with particular reference to the dating of text. The current dating mechanism is unreliable, uninterpretable and of little use in linguistic terms. Our experience shows that just over half of web servers return a ‘Last-modified’ header, but this fails to indicate whether the date, if given, indicates date of authorship, date of extensive editing and updating, date of complete rewriting, or simply the date when a minor typographical error was removed. The W3C has proposed the ‘Resource Description Framework’ (RDF) as a meta-standard, one feature of which is intended to require the page author to specify ‘a date associated with an event in the life cycle of the resource’. Among the qualifiers are ‘Created’ and ‘Modified’.² These would be valuable sources of information both in modern diachronic corpus study, and in the study of text editions and versions. We are actively supporting this standardisation initiative (Kehoe & Renouf 2002).

8. Concluding remarks

This paper has discussed the need of the linguistic community for access to a large-scale, renewable source of information about recent and current language use. It has demonstrated that the web, when accessed by WebCorp, offers linguistic evidence that is not supplied by existing text corpora, or which supplements meagre evidence for rarer or older aspects of language use. A basic system functionality is relatively simple to achieve; the real challenges for WebCorp lie in developing a closer understanding of the web’s structure and content, and in devising ways of compensating for the current limitations of search engines in order to produce a maximally efficient, informative and user-friendly tool.


Acknowledgement

I acknowledge with thanks the funding of the WebCorp project by the EPSRC, and the computational ingenuity of Mike Pacey and Andrew Kehoe, who have developed WebCorp software hitherto.

Notes

1 Senseval is an initiative established in the US as a means of evaluating competing software tools for their efficacy in identifying sense relations by various means.

2 http://dublincore.org/documents/dcmes-qualifiers/

References

Bergh, G., A. Seppaenen and J. Trotta (1998), ‘Language corpora and the Internet: a joint linguistic resource’, in: A.J. Renouf (ed.), Explorations in Corpus Linguistics. Amsterdam: Rodopi. 41-56.
Brekke, M. (1999), ‘When "Empiry" strikes back: a corporal confrontation’, Norwegian School of Economics, Norway.
Brekke, M. (2000), ‘From BNC to the Cybercorpus: a quantum leap into chaos?’, in: J. Kirk (ed.), Corpora Galore. Amsterdam: Rodopi. 227-247.
Fairon, C. (1998-1999), ‘Glossanet: parsing a Web site as a corpus’, in: C. Fairon (ed.), Analyse lexicale et syntaxique: Le système INTEX, Lingvisticae Investigationes Tome XXII (Volume special). Amsterdam/Philadelphia: John Benjamins Publishing. 327-340.
Fairon, C. and B. Courtois (2000), ‘Extension de la couverture lexicale des dictionnaires électroniques du LADL à l’aide de GlossaNet’, in: Actes du Colloque JADT 2000: 5e Journées Internationales d’Analyse Statistique des Données Textuelles, Lausanne. Vol. 1. 189-196.
Foster, I. and C. Kesselman (eds) (1999), The GRID: Blueprint for a New Computing Infrastructure. USA: Morgan Kaufmann Publishers Inc.
Kehoe, A. and A. Renouf (2002), ‘WebCorp: applying the Web to linguistics and linguistics to the Web’, in: Proceedings of 11th International World Wide Web Conference, Honolulu, Hawaii, 7-11 May 2002. (http://www2002.org/CDROM/poster/67/)
Kilgarriff, A. (2001), ‘Web as corpus’, in: P. Rayson, A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds), Proceedings of the Corpus Linguistics 2001 Conference, UCREL. 342-344.
Kübler, N. and P.-Y. Foucou (2000), ‘A Web-based environment for teaching technical English’, in: L. Burnard and T. McEnery (eds), Rethinking Language Pedagogy: Papers from the third international conference on language and teaching. Peter Lang GmbH: Frankfurt am Main. 65-73.
Mair, C. (1998), ‘Last of the old, or first of the new?’, in: A.J. Renouf (ed.), Explorations in Corpus Linguistics. Amsterdam: Rodopi. 123-133.
Renouf, A. (2002), ‘The time dimension in Modern English corpus linguistics’, in: B. Kettemann et al. (eds), Proceedings of the TALC 2000 Conference, Univ. of Graz. Amsterdam: Rodopi. 27-41.
Renouf, A. (2003), ‘Shall we hors-d’oeuvres? The use and abuse of Gallicisms in English’, in: E. Laporte, C. Leclère, M. Piot and M. Silberztein (eds), Syntaxe, Lexique et Lexique-Grammaire. Volume dédié à Maurice Gross. Lingvisticae Investigationes Supplementa, 24: 523-543. Amsterdam/Philadelphia: John Benjamins Publishing Co.
Rundell, M. (2000), ‘The biggest corpus of all’, Humanising Language Teaching, 2(3); May 2000 (http://www.hltmag.co.uk/may00/idea.htm)

Normalization and disfluencies in spoken language data

Nelleke Oostdijk
University of Nijmegen

Abstract

This paper seeks to investigate a number of phenomena that are considered to be characteristic of spoken language use, in particular disfluencies such as hesitations, false starts and self-corrections. The aim is to gain insight into the nature, frequency and distribution of these phenomena, so that we may consider the implications this has for the construction of a parser geared towards the analysis of spoken language data. The study is based on the normalized data found in the spoken part of ICE-GB.

1. Introduction

Spoken language corpora present interesting new challenges to researchers who are working on the construction of parsers. While over the past years efforts have been directed predominantly at the implementation of parsers for the linguistic annotation of written corpora, the idiosyncrasies of spoken language data have been largely neglected. The provisions that were made in order to deal with, for example, the spoken passages in written fiction should be considered for what they are, viz. ad hoc provisions that were made to prevent the parsing process from being upset and forced to a halt (cf. Aarts and Oostdijk 1997; Oostdijk 2000a). Meanwhile, in the light of the absence of parsers geared to the analysis of spoken language, it is not surprising to find that parsers that were originally constructed for parsing written language data are being applied for the analysis of spoken data. The overall performance of the parser on such data will prove rather poor. What we see then is that in order to accommodate the parser, language data are being normalized, i.e. they are made to conform to what are postulated to be the ‘rules of grammar’. In the present paper I investigate the practice adopted in parsing ICE-GB. The aim is to find out whether for (particular types of) disfluencies it would be feasible to refrain from normalizing.

2. Normalization in ICE-GB

ICE-GB, the British component of the International Corpus of English (Greenbaum, ed. 1996) comprises some one million words of spoken and written English produced by adult, educated, native speakers of British English. The texts in the corpus date from 1990-1994 inclusive, i.e. all texts were originally published or recorded during this period (Nelson 1996a). The corpus has been fully tagged for part-of-speech information, while it has also been syntactically annotated. In the annotation process the TOSCA-ICE parser was used (Oostdijk 2000b). Prior to the linguistic annotation of the corpus, all the material, both spoken and written, was marked up, using two types of markup: (1) textual markup, which was added to the texts themselves and typically encodes features of the original text that are lost when it is converted into a computerized text file, and (2) bibliographical and biographical markup, which was stored externally in the form of a file header for each text (Nelson 1996b: 36). While it was observed that “spoken texts, and especially dialogues require much more markup than written ones”, the set of (textual) markup symbols was designed also to include symbols that could be used “to indicate such features as pauses, speaker turns and overlapping segments” (ibid. 39-42). Nelson’s discussion of the various markup symbols, however, reveals that quite a number of these do not so much preserve features that would otherwise be lost, but were introduced on the grounds that without these markup symbols automatic parsing would be problematic:

    This system for marking overlaps was adopted because complete speaker turns are essential for parsing. The marking scheme indicates the overlapping without making the turns discontinuous. (Nelson 1996b: 41)

Especially with the spoken data from the corpus, textual markup has been applied to normalize the input for the parser:

    Spoken English is characterized by a wide range of nonfluencies which are not found in writing. … These phenomena are transcribed as they occur, and the markup for them will be of particular interest to researchers studying the interaction between speakers. However, they may be seen as disruptions of the underlying syntax, and as such are problematic from the point of view of automatic parsing. We use the general term ‘normalization’ to describe the method of using markup to deal with them. (ibid.)

Table 1 lists the subset of markup symbols that were used in the process of normalization. For the present study especially two types of normalization are of interest, viz. normalization involving normative deletion and that involving normative insertion. The first type of normalization typically is applied with repetitions, self-corrections, and hesitations. For example, in (1)-(1’) where the demonstrative pronoun is found to be repeated, the first instance of the pronoun is marked up for normative deletion and will not form part of the input to the parser.²

Table 1. Markup symbols used for normalizing texts¹

…   discontinuous word, e.g. fan-bloody-tastic
…   normalized version of a discontinuous word, e.g. bloody fantastic
…   incomplete word
…   normative deletion
…   normative insertion
…   part of a repetition or self-correction which is retained for parsing
…   these symbols enclose every instance of normalization
…   overlapping string
…   two or more overlapping string sets

(1) If one’s gone through those those uh procedures then it wouldn’t take very long I think to uh clean up the rest
(1’) If one’s gone through those those uh procedures then it wouldn’t take very long I think to uh clean up the rest

In a similar fashion the self-correction in (2)-(2’) and the hesitation in (3)-(3’) are dealt with:

(2) Unfortunately there is tends to be a bias towards publishing only positive results
(2’) Unfortunately there is tends to be a bias towards publishing only positive results
(3) A a and more likely I suppose primary because your first degree has got to relate pretty closely if you’re going to get on a course in most cases
(3’) A a and more likely I suppose primary because your first degree has got to relate pretty closely if you’re going to get on a course in most cases

Normalization involving normative insertion includes the use of markup to insert elements in the input that are considered to be essential for being able to parse the string. Thus a word (e.g. an article, relative pronoun or subordinator) that was omitted may be inserted. Normative insertion can be used in combination with normative deletion. For example, where a word was left incomplete it will be marked for normative deletion and the complete form will be introduced through normative insertion (cf. (4)-(4’)).³

(4) So you’ve got twenty odd pension
(4’) So you’ve got twenty odd pension pensioners
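The interplay of normative deletion and normative insertion can be made concrete with a small sketch. It does not use the ICE markup symbols themselves; instead each token carries a hypothetical status label (`keep`, `delete` or `insert`), which is an assumption made purely for illustration, as are the function names. In the first example only the repeated pronoun is marked, following the description of (1) above; how filled pauses such as "uh" are treated is not addressed here.

```python
KEEP, DELETE, INSERT = "keep", "delete", "insert"

def as_transcribed(tokens):
    """What the speaker actually said: everything except inserted material."""
    return " ".join(word for word, status in tokens if status != INSERT)

def as_parser_input(tokens):
    """What the parser sees: deletions dropped, insertions included."""
    return " ".join(word for word, status in tokens if status != DELETE)

# Start of example (1): the first instance of the repeated pronoun is deleted.
example_1 = [("If", KEEP), ("one's", KEEP), ("gone", KEEP), ("through", KEEP),
             ("those", DELETE), ("those", KEEP), ("uh", KEEP), ("procedures", KEEP)]

# Example (4): the incomplete word is deleted and the full form inserted.
example_4 = [("So", KEEP), ("you've", KEEP), ("got", KEEP), ("twenty", KEEP),
             ("odd", KEEP), ("pension", DELETE), ("pensioners", INSERT)]

print(as_transcribed(example_4))
print(as_parser_input(example_4))
print(as_parser_input(example_1))
```

Keeping both views derivable from one annotated token list mirrors the point made in the text: the transcription is preserved as spoken, while the parser receives a normalized string.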

3. Data collected from ICE-GB

For the present study I collected all the instances from the spoken part of ICE-GB where the original input had been normalized. The spoken subcorpus comprises 300 samples of some 2,000 words each. The samples represent widely different types of speech, ranging from direct spontaneous conversations to scripted speeches.⁴ In all I found as many as 10,104 instances by searching for strings enclosed between the markup symbols … and ….⁵ The distribution of the normalizations over the three main text categories represented in the material is displayed in Table 2.

Table 2. Distribution of normalizations over the three main text categories

Text category            Total no. of words   No. of normalizations per 100 words
Dialogue                 360,000              18.9
Monologue, unscripted    140,000              17.4
Monologue, scripted      100,000              4.7
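As a quick arithmetic cross-check of the figures quoted in this section, the overall rate of normalization can be computed from the sample count, the approximate sample size and the total number of instances. The variable names are illustrative, and the total word count is only approximate because the samples contain "some 2,000 words each"; the overall rate is computed here and is not a figure reported in the paper.

```python
samples = 300
words_per_sample = 2_000        # approximate, per the text
normalizations = 10_104

total_words = samples * words_per_sample
rate_per_100 = normalizations / total_words * 100
rate_per_1000 = normalizations / total_words * 1000

# Word totals per text category, as given in Table 2.
category_words = {
    "Dialogue": 360_000,
    "Monologue, unscripted": 140_000,
    "Monologue, scripted": 100_000,
}

print(total_words)                                   # 600000
print(round(rate_per_100, 2))                        # 1.68
print(round(rate_per_1000, 1))                       # 16.8
print(sum(category_words.values()) == total_words)   # True
```

The category totals in Table 2 add up to the same 600,000 words. The per-category values in the table are closer in magnitude to the overall rate per 1,000 words than to the rate per 100 words; the sketch does not attempt to settle which unit the table intends.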

The distribution found here seems to support the hypothesis that the frequency of occurrence (as well as the types) of disfluencies correlates with the degree of interaction, degree of spontaneity or preparedness. However, as the breakdown of the main text categories into the various types of speech underlying these shows, the text categories are by no means homogeneous and the differences between the types of speech are quite substantial (cf. Figure 1).

4. Types of disfluencies

[Figure 1. Distribution of normalizations over different text types (no. of normalizations per 100 words). Vertical axis: 0 to 40. Text types: A = direct conversations; B = distanced conversations; C = classroom lessons; D = broadcast discussions; E = broadcast interviews; F = parliamentary debates; G = legal cross-examinations; H = business transactions; I = spontaneous commentaries; J = unscripted speeches; K = demonstrations; L = legal presentations; M = news broadcasts; N = broadcast talks; O = scripted speeches.]

Previous studies of disfluencies (e.g. Fromkin 1973a,b, 1988; Garrett 1988; Garnham et al. 1982; Hotopf 1983; Nooteboom 1973; Tanenhaus 1988) have identified many and diverse instances where the production of the targeted segment, syllable, word or structure ran less smoothly, was delayed or yielded a faulty result. In studies originating from the fields of phonetics and phonology the focus primarily has been on what collectively are referred to as sound errors and the circumstances in which they were produced. Psycholinguistic investigations typically have concentrated on various types of lexical error and the conditions under which these occur. Results from these studies have been used as evidence in shaping linguistic theory. While the aim of the present study is much more modest, insights that have been obtained especially with regard to the different types of disfluencies are valuable in developing a classification scheme that may be used for describing the normalizations. The next step then will be to apply this scheme to the data collected from ICE-GB and see whether conclusions can be drawn regarding the feasibility of handling certain types of disfluencies without prior normalization. The scheme that was developed can be found in Table 3. The classification scheme distinguishes between disfluencies in which the speaker edits his or her speech as he/she moves along on the one hand, and those errors that have gone unnoticed by the speaker or that the speaker has chosen not to correct on the other hand. In the latter case it is the annotator/linguist who observes the error and marks it for normalization.

Table 3. Types of disfluencies

I. Speaker-edited
   A. hesitations
   B. self-corrections
   C. further detailing, specifications
   D. changes of plan
II. Non-speaker edited
   A. incomplete items
   B. errors

Each of the types is briefly discussed and exemplified below.⁶

IA. Hesitations

As observed by for example Hotopf (1983: 158) hesitations serve the purpose of playing for time whilst deciding how to proceed or making a lexical choice. The class of hesitations comprises immediate repetitions of segments, syllables, single and multiple words, etc. Consider the following examples:⁷

(5) Uh sh… shall I go and get one [S1A-069-160]
(6) I’m bi … bilingual so I suppose [S1A-069-110]
(7) They’re not great social animals computer computer scientists [S1A-014-264]
(8) But what what do people think of it who who are familiar with the authors [S1B-026-140]
(9) I think we must we must try and schedule this for next week mustn’t we [S1B-078-179]

IB. Self-corrections

While it has been claimed that hesitations (and pauses for that matter) can be used to buy the speaker time, it has also been observed that in most communicative situations the time available for pausing is limited as there is the risk of losing the floor to another speaker or losing the audience’s attention. Under such conditions, it has been suggested, errors may occur which then need to be repaired at some later stage. With self-corrections two subtypes can be distinguished:⁸

1. self-corrections that typically occur with errors involving the anticipation, substitution, deletion, or perseveration of segments, syllables or words;
2. self-corrections that follow the erroneous application of a grammatical rule or failure to apply a certain rule.

The first subtype comprises self-corrections of widely diverse errors, including vowel substitutions (e.g. (10)), consonant deletions (e.g. (11)), anticipation of initial consonants (e.g. (12)), perseveration of vowels (e.g. (13)), anticipation of initial syllables (e.g. (14)), and blends (e.g. (15)).

Normalization and disfluences in spoken language data


(10) but if the correct endpoint is when it baw … boils round balls round the knife then uh it should come right every time [S1A-057-193]
(11) Yeah I mean definily definitely can [S1A-029-212]
(12) Fibroblasts are the part of the dermis that will grow and por form part of the wound [S2A-058-140]
(13) She was readmitted on the fifteenth of January to the Queen Victoria Hospital East Grimstead and had a fourth operation on the sixteenth of January and then what’s called a proplast implint impl … a proplast implant was introduced [S2A-062-043]
(14) Uhm so the the well there is were two bar waves of barbarian settlements that these communities are responding to [S2A-060-006]
(15) Uh I c I mean uhm I can sort of read it to see because I I knew his life the personal uh signification significance of the pictures [S1A-015-100]

An example of the erroneous application of a grammatical rule is given in example (16).

(16) Now this is coming comes into the repertory of caricature I think in the seventeen seventies this use of a kind of single shorthand element [S2A-057-098]

IC. Further detailing, specification

Partial repetition of what was said before occurs when the speaker makes an attempt to be more precise in expressing what he wants to convey. A higher degree of precision may be obtained through identification (e.g. (17)), specification (e.g. (18)), or modification (e.g. (19)).

(17) She was thinking Lizzie was thinking of coming but [S1A-073-198]
(18) The horses The old black Irish horses of the Household Cavalry know the sound of the trumpet and they immediately break into a trot themselves [S2A-011-085]
(19) And we’ve had developments in particularly in New Zealand and Canada ¾ [S2A-044-066]

ID. Change of plan

Where a speaker abandons his current train of thought and changes the plan for what he is going to say, usually larger strings are involved. Normalization in such cases is quite often problematic. For example,

(20) No but actually you weren’t supposed to you were advised not to drink water in Leningrad because they they have this special bug that’s only found there and in Wolverhampton [S1A-014-188]
(21) ¾ and you’ve made key assumptions that those are perhaps the you need to try and relate those assumptions and the the errors in that area those areas ¾ [S1B-020-097]

IIA. Incomplete items

Items may be found to be incomplete as speakers abandon what they were saying or are interrupted by other speakers.9 This applies to words as well as to phrases and clauses. For incomplete words the full form is provided by the annotator/linguist through normative insertion. Incomplete phrases or clauses on


Nelleke Oostdijk

the other hand, generally are marked for normative deletion while no material is inserted.

(22) That is hilar hilarious [S1A-041-129]
(23) and it has become a a much greater part of of of the of the work that they do and I think uhm [S1A-002-066]

IIB. Errors

Errors that are observed by the annotator/linguist and are marked for normalization are diverse in nature. They include sound errors, lexical errors, and grammatical errors. In all cases the correct form is introduced through normative insertion. Consider the following examples:

(24) and he would accept that it’s totally inagdequate … inadequate to provide the sort of answers he has given regarding care in the community ¾ [S1B-056-041]
(25) But the output from the com from the wizard is also also has to be constrained in terms of insofar as he has to use very limited sentence constructions ¾ [S2A-035-114]
(26) I didn’t see too many much evidence of of dead children [S2A-050-154]
(27) and there was no kind of painting space that you was were trying to get in it [S1B-018-062]

5.

Findings

Before I go on to discuss what I found in the data I collected, I should point out that there are also a number of things I did not find, or at least did not find to be as frequent as I had expected. This is partly because of the approach I took and partly because of the nature of the data and the way they have been annotated. Thus, as I mentioned before, I only included in this study the instances that could be identified on the basis of a proper markup. Moreover, in these data certain types of disfluencies are likely to be underrepresented for various reasons. For example, since the data have only been transcribed at an orthographic level using ordinary spelling conventions, it has been difficult to capture certain features; this would explain the infrequent occurrence of sound errors. Another example is that of lexical substitution errors: these are quite difficult to detect for anyone but the speaker. Unlike the speaker, the annotator/linguist does not have immediate access to the larger context/setting in which the error was produced. Consequently, if the speaker does not identify the error, it is bound to remain unnoticed. Finally, certain disfluencies (especially self-corrections) have been more or less embedded in the syntactic structure, for example through the use of a construction by means of ‘or rather’, and therefore do not occur as normalizations.

What the classification of the disfluencies in the data collected from the spoken part of ICE-GB brought to light was that normalization was most frequently applied with speaker-edited disfluencies, especially hesitations and changes of plan (cf. Table 4). Only in relatively few instances had the annotator/linguist found the need to normalize the data. Surprisingly, about half of the instances here classified as errors were normalizations of substandard forms such as dunno, gonna, wanna, and innit, which typically occurred in the direct conversations. Here it seems there is a deviation from the overall principle to normalize only those instances where the syntax is disrupted. From a parsing point of view there is no problem whatsoever with substandard forms. It appears then that normalization has been used here to impose some prescriptive norm about what constitutes proper English.

Table 4. Relative frequency and distribution of various types of disfluencies

                                              %
I. Speaker-edited
   - hesitations                           61.8
   - self-corrections                       5.6
   - further detailing, specifications      3.7
   - changes of plan                       25.6
   - miscellaneous10                        2.1
II. Non-speaker-edited
   - incomplete items                       0.5
   - errors                                 0.7
Total                                     100

While from a parsing point of view most types of disfluencies are indeed problematic, with hesitations it seems feasible to refrain from normalizing the input; in fact, some fairly simple adaptations of the rules would allow us to deal with most of the instances automatically. Thus we can exploit the fact that where hesitations involve the repetition of a single word (the most frequent type of hesitation by far, cf. Table 5), it is usually a function word, more specifically an article, a preposition, or a conjunction. With hesitations involving the repetition of multiple words there appears to be a tendency to repeat the beginning of a clause (typically the subject and the verb) or a prepositional phrase (usually the preposition followed by an article or a pronoun). In many instances overt hesitation markers (uh, uhm) are present which signal the hesitation. While the repetition of words can be handled by an adaptation of the rules, typically for the repetition of a segment or a syllable it would be most practical to filter these out during the preprocessing phase. Segments and syllables can be easily detected since they do not normally correspond to instances of a regular wordform.12

Table 5. Relative frequency and distribution of hesitations

Hesitations involving the                     %
   - repetition of a segment               15.8
   - repetition of a syllable               8.8
   - repetition of a single word           57.5
   - repetition of multiple words          16.1
   - other11                                1.8
Total                                     100
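The preprocessing idea described above can be illustrated with a minimal Python sketch. It is not the procedure used for ICE-GB: the toy word list `LEXICON`, the marker set `HESITATION_MARKERS` and the function name `filter_hesitations` are hypothetical, and the input is assumed to be already tokenized, with pause symbols removed. The sketch drops overt hesitation markers, drops fragments that are not regular wordforms, and collapses the immediate repetition of a single word.

```python
# Toy wordform list for the examples below; a real preprocessor would
# consult a full lexicon (hypothetical stand-in, not part of ICE-GB).
LEXICON = {
    "i", "i'm", "bilingual", "so", "suppose", "shall", "go", "and", "get",
    "one", "but", "what", "do", "people", "think", "of", "it", "who",
    "are", "familiar", "with", "the", "authors",
}
HESITATION_MARKERS = {"uh", "uhm"}


def filter_hesitations(tokens):
    """Return tokens with markers, fragments and immediate repeats removed."""
    kept = []
    for token in tokens:
        word = token.lower()
        if word in HESITATION_MARKERS:
            continue  # overt hesitation marker
        if word not in LEXICON:
            continue  # segment or syllable that is not a regular wordform
        if kept and kept[-1].lower() == word:
            continue  # immediate repetition of a single word
        kept.append(token)
    return kept


# Examples (5) and (8), tokenized by hand.
print(filter_hesitations("Uh sh shall I go and get one".split()))
print(filter_hesitations(
    "But what what do people think of it who who are familiar "
    "with the authors".split()))
```

The sketch is deliberately naive. Fragments that happen to coincide with a wordform (the exceptions mentioned in note 12) pass the lexicon check, legitimate repetitions of a word would be collapsed as well, and repetitions of multiple words, as in example (9), are not handled at all.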

6.

Concluding remarks

In this paper I presented the results of an investigation into the nature, the frequency and distribution of disfluencies in spoken language data. The aim was to find out whether it would be feasible to refrain from normalizing the data prior to parsing. The study was based on normalized data from the spoken part of ICE-GB. What I found was that nearly all disfluencies I came across were speaker-edited, while hesitations were the most frequent type of disfluency by far. Since hesitations can be easily detected and dealt with during a preprocessing stage there is no need for normalization, nor do we need to adapt the rules of the grammar underlying the parser. For other types of disfluencies, however, such an approach is as yet not feasible and additional research is required. While this study has concentrated on what types of disfluencies occurred, future research should be carried out to establish where the syntax is disrupted and should investigate the role of discourse markers in signalling disfluencies.

Notes

1  The markup symbols listed here are a subset of a larger set of markup symbols used in ICE. The full set is described in Nelson (1996b).

2  Examples (1)-(3) are authentic examples from ICE-GB used by Nelson (1996b) to illustrate the use of markup. The respective location codes of these instances are S1A-024-061, S2A-033-030 and S1A-033-174. The location codes consist of a text sample reference (e.g. S1A-024; see also the Appendix) followed by the line number in the text.

3  S1A-082-100.

4  The composition of the spoken part of ICE-GB is given in more detail in the Appendix.

5  The actual number of normalizations is higher since instances where the markup was found to be erroneous (because for example it was incomplete) have not been included here. Note also that with regard to the frequency with which normalization is applied, there is an uncertain effect caused by the fashion in which the parsing units are identified. Occasionally normalization may be avoided by identifying multiple parsing units.

6  In the examples that follow normative deletions are crossed out, while normative insertions are underlined. In principle all instances have been included in full as they occur in the corpus (including any errors); where this is not the case the symbol ‘¾’ is used to indicate that material has been omitted.

7  Not included here are hesitation markers such as uh and uhm. The reason for this is that they are not normalized. Instead, they are filtered out prior to parsing after they have been identified in the tagging phase.

8  Cf. Fromkin (1988: 121).

9  Another factor that plays a role here is the quality of the sound file from which the data were transcribed.

10  Since it was not always easy to assign a disfluency to one particular class, the miscellaneous type was introduced.

11  ‘Other’ refers to for example hesitations involving the repetition of the first part of a compound word (cf. e.g. (7)) and repetitions of a word + a segment or syllable.

12  Obvious exceptions are the segments a and I, and syllables such as you, up, in, etc.

References

Aarts, J. and N. Oostdijk (1997), ‘Handling discourse elements in syntax’, in: U. Fries, V. Müller and P. Schneider (eds), From Ælfric to the New York Times. Studies in corpus linguistics. Amsterdam: Rodopi. 107-123.
Fromkin, V. (1973a), ‘The non-anomalous nature of anomalous utterances’, in: V. Fromkin (ed.), 144-163.
Fromkin, V. (1973b), ‘Appendix’, in: V. Fromkin (ed.), 243-269.
Fromkin, V. (ed.) (1973), Speech errors as linguistic evidence. The Hague: Mouton.
Fromkin, V. (1988), ‘Grammatical aspects of speech errors’, in: F.J. Newmeyer (ed.) (1988a), 117-138.
Garnham, A., R. Shillcock, G. Brown, A. Mill and A. Cutler (1982), ‘Slips of the tongue in the London-Lund corpus of spontaneous conversation’, in: A. Cutler (ed.), Slips of the tongue and language production. Amsterdam: Mouton. 251-263.
Garrett, M. (1988), ‘Processes in language production’, in: F.J. Newmeyer (ed.) (1988b), 69-96.
Greenbaum, S. (1996), ‘Introducing ICE’, in: S. Greenbaum (ed.), 3-12.
Greenbaum, S. (ed.) (1996), Comparing English worldwide. The International Corpus of English. Oxford: Clarendon Press.
Hotopf, W. (1983), ‘Lexical slips of the pen and tongue: What they tell us about language production’, in: B. Butterworth (ed.), Language production. Vol. 2. Development, writing and other language processes. London: Academic Press. 147-199.
Nelson, G. (1996a), ‘The design of the corpus’, in: S. Greenbaum (ed.), 27-35.
Nelson, G. (1996b), ‘Markup systems’, in: S. Greenbaum (ed.), 36-53.
Newmeyer, F.J. (ed.) (1988a), Linguistics: The Cambridge Survey. II Linguistic theory: Extensions and implications. Cambridge: Cambridge University Press.
Newmeyer, F.J. (ed.) (1988b), Linguistics: The Cambridge Survey. III Language: Psychological and behavioral aspects. Cambridge: Cambridge University Press.
Nooteboom, S. (1973), ‘The tongue slips into patterns’, in: V. Fromkin (ed.), 144-163.
Oostdijk, N. (2000a), ‘Towards a model for the description of language use’, in: C. Mair and M. Hundt (eds), Corpus linguistics and linguistic theory. Amsterdam: Rodopi. 275-288.
Oostdijk, N. (2000b), ‘Corpus-based English linguistics at a cross-roads’, English studies. A journal of English language and literature, 81(2): 127-141.
Tanenhaus, M. (1988), ‘Psycholinguistics: An overview’, in: F.J. Newmeyer (ed.) (1988b), 1-37.

Appendix

A schematic representation of the composition of the spoken part of ICE-GB is given in Figure 2 (cf. Nelson 1996a: 29). Between brackets the number of texts per text type is given.

                                           text sample references
Dialogue (180)
   Private (100)
      direct conversations (90)            S1A-001 to S1A-090
      distanced conversations (10)         S1A-091 to S1A-100
   Public (80)
      class lessons (20)                   S1B-001 to S1B-020
      broadcast discussions (20)           S1B-021 to S1B-040
      broadcast interviews (10)            S1B-041 to S1B-050
      parliamentary debates (10)           S1B-051 to S1B-060
      legal cross examinations (10)        S1B-061 to S1B-070
      business transactions (10)           S1B-071 to S1B-080
Monologue (120)
   Unscripted (70)
      spontaneous commentaries (20)        S2A-001 to S2A-020
      unscripted speeches (30)             S2A-021 to S2A-050
      demonstrations (10)                  S2A-051 to S2A-060
      legal presentations (10)             S2A-061 to S2A-070
   Scripted (50)
      broadcast news (20)                  S2B-001 to S2B-020
      broadcast talks (20)                 S2B-021 to S2B-040
      speeches (not broadcast) (10)        S2B-041 to S2B-050

Figure 2. Composition of the spoken part of ICE-GB.
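The location codes attached to the examples (a text sample reference followed by a line number, cf. note 2) can be related to the text types of Figure 2 mechanically. The following Python sketch shows one way of doing so; the names `TEXT_TYPES`, `LOCATION_CODE` and `describe_location` are hypothetical and the ranges are simply transcribed from the figure.

```python
import re

# (prefix, first text, last text, text type), transcribed from Figure 2.
TEXT_TYPES = [
    ("S1A", 1, 90, "direct conversations"),
    ("S1A", 91, 100, "distanced conversations"),
    ("S1B", 1, 20, "class lessons"),
    ("S1B", 21, 40, "broadcast discussions"),
    ("S1B", 41, 50, "broadcast interviews"),
    ("S1B", 51, 60, "parliamentary debates"),
    ("S1B", 61, 70, "legal cross examinations"),
    ("S1B", 71, 80, "business transactions"),
    ("S2A", 1, 20, "spontaneous commentaries"),
    ("S2A", 21, 50, "unscripted speeches"),
    ("S2A", 51, 60, "demonstrations"),
    ("S2A", 61, 70, "legal presentations"),
    ("S2B", 1, 20, "broadcast news"),
    ("S2B", 21, 40, "broadcast talks"),
    ("S2B", 41, 50, "speeches (not broadcast)"),
]

# Text sample reference (e.g. S1A-069) followed by a three-digit line number.
LOCATION_CODE = re.compile(r"(S[12][AB])-(\d{3})-(\d{3})")


def describe_location(code):
    """Split a location code and look up its text type in Figure 2."""
    match = LOCATION_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a location code: {code!r}")
    prefix = match.group(1)
    text_no = int(match.group(2))
    line_no = int(match.group(3))
    for type_prefix, first, last, text_type in TEXT_TYPES:
        if prefix == type_prefix and first <= text_no <= last:
            return {
                "sample": f"{prefix}-{match.group(2)}",
                "line": line_no,
                "text_type": text_type,
            }
    raise ValueError(f"no text type listed for {code!r}")


print(describe_location("S1A-069-160"))
```

For example (5), with location code S1A-069-160, the lookup yields the text sample S1A-069, line 160, in the direct conversations.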

Textual structure and segmentation in online documents

Pam Peters and Adam Smith

Macquarie University

Abstract

Despite the flexibility of the electronic medium, the computer screen itself puts constraints on the shape of the discourse accessed through it. The effects are likely to show up in longer documents, where the communication strategies relied on for the printed page would need to be modified and/or supplemented. This paper describes a comparative study of structural units in print and electronic documents (p-texts taken from AUS-ICE, and e-texts from a new corpus of electronic documents (EDOC), a selection of hypertexts that were “digitally born”, i.e. published first in electronic form). The study focuses on segmentation and local structures in a matched sample of e-texts and p-texts from two genres of prose (informational and instructional). The findings point to more explicit marking of local structures in e-texts, with much more regular use of sections and multilevel headings. In the electronic medium lists are a more important aspect of text structure and paragraphs less so.

1.

Introduction

The digital revolution has raised all kinds of expectations about the electronic medium, the information it can deliver as the “information superhighway”, and the communities of people who can interact through the internet as an electronic “global village”. Governments and international bodies see it as an inexpensive way of distributing information. Online educators embrace it to extend their reach and connect with students in a variety of cultural contexts. But the success of global communication comes back to the effectiveness of information delivery, which is not to be taken for granted.

Ways of articulating the structure of extended discourse have developed on the printed page over more than five centuries (Peters 2002), at all levels from chapter divisions down to sentence lengths and the punctuation system. These have evolved against the backdrop of double-page spreads, on which usually arbitrary amounts of text are “poured” (Nunberg 1990: 51). The counterpoint between what is displayed on twinned, enumerated pages and the explicit structuring of text is something we rarely consider – the rhythmic regularity of the first and the meaningful variability of the second. Yet between them they enable readers to calibrate their forward progress through a text. They are there to a greater or lesser extent in all varieties of publishing, though the structural signals in a novel are much more sparing than in expository texts, such as textbooks. The textbook’s subdivisions and levels of structure are continually signaled through headings and the chunking of paragraphs into sections. In academic publications, the internal structure of the discourse is less explicitly marked, and writers rely more on the implicit stages of argument to carry readers on, paragraph by paragraph. Both learned monographs and journal articles continue the tradition of the 19th century critical essay, fostering discursive projection of the self and subjective cognition with the fluid structures it entails (Jackson 1997: 487).

Sentence patterns in printed texts vary with the genre. In textbooks they will be fully worded in the body of the text, and fragmentary in headings and captions. There will be both declarative and imperative sentences, depending on how interactive the book is intended to be. In discursive nonfiction, sentences are more uniform: typically declarative and argumentative, and longer rather than shorter. These differences are reflected in the average sentence lengths found in comparable categories of the Brown corpus. In Category J (learned and scientific) the average was 22.31, well above that of Category E (skills etc), where the average was 18.53 (Francis and Kucera 1982: 552). Thus both explicit structural marking and the pattern of sentences (the most basic structural unit) seem to correlate with conventional genres of prose.

How far should we expect these features to be transferred into the digital counterparts of print documents? Both printed page and the computer screen present visual forms of language, so that they have more affinity with each other than with spoken language. But the computer screen imposes its own constraints on the delivery of information – a single page of adaptable shape, instead of a fixed double-page spread. The page is continuous like a papyrus scroll, masking its ultimate length. For readers there is no small danger of becoming disoriented within a long document, a problem of which hypertext researchers have been well aware (McKnight et al. 1991: 65–6).
There is still relatively little published advice on the subject, though the pioneering Yale Style Manual (online) highlights the need for a structural response, using more systematic chunking of information and frequent locating devices and signals of place. Web-advisers such as Nielsen (2000: 104) emphasize the need to present information in scannable form, i.e. as a list of dot points rather than as continuous discursive sentences – as a way of dealing with the general problem of screen reading and extracting information from a continuous page. Improved screen resolution may make a difference in the longer run, but research reported by Thurstun (2000: 73) suggests that with improved layout, the readability of a text can be greatly enhanced. Part of the purpose of this research was to see how far those who publish e-documents were already anticipating these problems. Are longer e-documents providing more frequent signals of their internal structure, more conspicuous segmentation of discourse? Are lists being used more frequently as a means of streamlining batches of similar information? Is this impacting on the average length of sentences? The differential effects of genre were also a point of interest. Are some genres of prose more adaptable and responsive to the demands of electronic communication than others? The online author’s sensitivity to audience might also be a factor. Those used to addressing the general public might adapt more quickly than those in the academic tradition.

Apart from these research questions, this study was motivated by the need to develop guidelines on electronic documentation for the Australian Government.1 The desire to provide online access to volumes of official information entails the question as to how to optimize communication in individual documents. The need for research-based advice is felt here, and in any institution that wants to facilitate online communication with the public, as citizens or students.

2.

Materials for the research

2.1

P-documents

Though the internet presents a wonderful array of genres, new and old, shorter and longer, personal and public, our research objectives took us to the second rather than the first in each of those pairs. To make the comparison with established genres, we had to focus on those for which there were print counterparts. Longer documents present more structural challenges than shorter ones; and this typically means public rather than personal texts. This too was in keeping with our aim of shedding light on the presentation of public information, and also the design of instructional documents.

Informational and instructional discourse are in fact contrasting genres in Longacre’s (1974: 200) seminal taxonomy. He calls them the “expository” and the “procedural” respectively. Though they share the general properties of monologic discourse, the expository (informational) distinguishes itself by its emphasis on content rather than audience. The procedural (instructional) meanwhile is oriented towards the audience, and interacts linguistically with them through the use of the second person. Longacre’s categorizations arise out of field work with oral cultures, but they seem to tap the fundamentals of written/printed discourse in those genres – and (we might expect) digital discourse as well.

The existing Australian corpora (ACE and ICE-AUS) provide comparative raw material in the two chosen categories of discourse (informational and instructional). ACE is a computerized collection of printed texts from 1986, while ICE-AUS includes texts from 1991–1995. ACE (like the Brown corpus) includes several categories of informational prose (F, G for the general reader; J for the academic), and instructional prose (E for the general reader, H for institutional management). The ICE corpus has informational texts of general and academic kinds in W2B and W2A and instructional and regulatory texts (W2D).
By combining texts from the two corpora we created a research corpus of 100 samples (each consisting of 2000 words) for further analysis and annotation, as shown in Table 1. The ICE texts were already marked for such things as headings, paragraph breaks and sentence boundaries (text units), though the ACE markup had to be elaborated in those respects. Both ACE and ICE needed the additional marking of levels of heading, lists, bullets (see below, section 3), so that the texts of the research corpus matched the system implemented in the corpus of electronic documents (next section).

Table 1. Constituency of research corpus

Category                            ACE texts                              ICE texts                      Total
Informational (learned)             J: Learned & Scientific writings (10)  W2A: Humanities (10),          25
                                                                           Social Sciences (5)
Informational (general reader)      F: Popular lore (5)                    W2B: Humanities (10),          25
                                    G: Belles lettres, biography,          Social Sciences (5)
                                    essays (5)
Instructional (admin./regulatory)   H: Government docs., foundation &      W2D: Admin./regulatory (10)    25
                                    industry reports etc. (15)
Instructional (general reader)      E: Skills, trades & hobbies (15)       W2D: Skills/hobbies (10)       25
Total                                                                                                     100

2.2

E-documents

To carry out this research, a new corpus of electronic documents had to be assembled. The EDOC corpus, as it is code-named, had to contain two categories of text: informational and instructional, those being the prime focus for the project. A target of 50 texts was set for each category, to create a corpus of approximately 200,000 words. EDOC was compiled in 2000/2001, and therefore at a relatively early stage of electronic document development. We might expect them to show many of the conventions of the print documents before them, as well as pioneering aspects of electronic documentation. The corpus is designed as a kind of benchmark in time, and a resource for diachronic comparisons as well as intercomparisons between the two genres represented in it.

For proper comparison with p-documents, the e-documents had to have originated (i.e. been first published) as online texts – thus “digitally born”. So any which were mounted as PDF files were discounted, on the assumption that they had in the first instance been drafted to appear as paper documents, and that the electronic version was simply a clone. Texts were identified from the Australian National Library’s bibliography of online publications, and also by means of internet searches limited to Australia (with the aid of Yahoo and Altavista). Care was taken to exclude documents which were simply remounts from overseas sites. Linked extensions to other documents were also disregarded, whereas those which were a means of dividing the document into manageable chunks were followed through and included in the sample (see Appendix 1, Table B).

EDOC consists of whole texts, because our primary interest is in how the authors had decided to structure and segment them, not their lexical or idiomatic elements (Sinclair 1991: 19). This means of course that they vary in length (from 141 to 8754 words). In practice the instructional texts were typically less than 2000 words (the ACE/ICE standard), and informational ones often more than that (see below, section 3). The brevity of some texts entailed the need to excerpt several from the same site, in order to create a sample of at least 1000 words wherever possible. Composite samples were less often needed in the collection of informational texts, which were often more than 2000 words long. Instructional material from the total of 50 websites has yielded over 75,000 words, whereas informational material from the same number of sites provides more than 135,000. The notional target in terms of total word count was therefore achieved. More importantly, EDOC consists of whole texts, and we have equal numbers of websites in the two parts of the corpus, and so a comparable sample of authoring practices.

We looked for sites managed by an institution or community group, to ensure (as far as one could) that they were under stable editorial management, and not likely to be idiosyncratic or personal in content or style. The contents of the e-texts are as wide as those in comparable categories from ACE and ICE. Individual e-texts were extracted from the webpages containing them, and saved as ASCII files. Material such as navigational bars, graphics, captions, supplementary external hotlinks and copyright disclaimers was excluded from this version of the text, which was the one to be wordcounted and annotated. These elements were retained in an archived version of the text, which contains all the webpages’ associated files so that their layout can be fully reproduced.

The visual format and structure of e-documents is the product of many parameters of text presentation, not all of which are stable in electronic delivery.
Some, such as the typefaces used, the text measure (= width of the text), and the amounts of white space used in vertical separations of blocks of text, are infinitely variable, depending on the settings of the browser used by the reader. Although these aspects can be controlled by means of a cascading style sheet, it is a relatively new facility for online publications, and attached only to a subset (25%) of the samples. We therefore decided to focus on aspects of text structure and segmentation that are built in by the writers and/or web-authors of e-documents, and not subject to modification or manipulation by the browser. They include:

- text units (= complete sentences, or sentence fragments in lists)
- lists
- bullets (of various shapes) and other enumerative devices
- paragraphs
- multiple heading levels

These features have been marked with distinguishing codes compatible with those used in annotating ICE texts, so that they could be used in the ACE/ICE research corpus as well. The markup allows us to track and quantify the segmentation of the document at every level from major sections down to sentences and items within lists. The marking of bullets will show their role in complementing ordinary punctuation. Their increasing use suggests the scope for more up-front punctuation, where most English punctuation is sentence-final.

3.

Texts for the pilot study and summary findings

The pilot study reported here compares a set of EDOC texts from 8 websites, 4 informational, 4 instructional, with a set of 8 p-texts extracted from the ACE/ICE research corpus. Within each generic set, 2 are from academic or bureaucratic sources, and 2 from sources aimed at the general reader. These parameters are built into the ACE/ICE categories, and though not used in the sampling process for EDOC, they are readily applied to website material, according to whether the site is managed by an academic body/government department, or by commercial or community interests. We thought it of no small interest to see whether the structural properties we were investigating were associated with particular audiences, as well as particular genres of discourse.

The topics of the two sets of pilot texts were comparable, reflecting on the one hand various institutional and academic matters, such as selection committee procedures, online instruction, cultural studies; and on the other, the individual and recreational interests addressed by commercial and community publications, such as pets, home decoration, popular music. (See the table in Appendix 2 for the full titles of the sample set.) Because e-texts vary in length, those chosen supply about the average number of words for the genre of material: i.e. informational texts of about 2700 words, and instructional texts totaling about 1550 words.

Word counts were taken within the various structural and segmental units (text unit, paragraph, listed items) in both types of text, to discover what range there might be within the sample. The working hypothesis was that the size of units at each level in the structural hierarchy (from section down to sentence) would be smaller in the e-documents than the p-documents, in keeping with the physical constraints of the computer screen. Full statistical results are presented in the tables in Appendix 1. For discussion purposes, some of the results are averaged in Tables 2a and 2b below. These results do not support the working hypothesis in its strongest form, of systematic reductions in the size of all structural units from the print texts to the electronic ones. There are nevertheless a number of interesting signs of adaptation at particular levels, as discussed in the next section.

4.

Comparative findings on structural features of p-texts and e-texts

The most decisive differences between the two types of texts are at the highest level of segmentation, though some generic differences at the lower levels (paragraph, sentence) also merit discussion.

Textual structure and segmentation in online documents


Table 2a. Summary results for structural and segmental units in p- and e-texts

                               Total    Total minus headings   Total        Ave. Sentence
                               Words    and listed items       Sentences    Length
P-informational   General      3942     3763                   208          18.09
                  Academic     4075     4058                   133          30.51
                  Info Total   8017     7821                   341          22.94
P-instructional   General      4121     3789                   220          17.22
                  Academic     4160     2565                   138          18.59
                  Inst Total   8281     6354                   358          17.75
E-informational   General      4605     4067                   175          23.24
                  Academic     4470     4010                   167          24.01
                  Info Total   9075     8077                   342          23.62
E-instructional   General      2669     2573                   132          19.49
                  Academic     2272     1249                    65          19.22
                  Inst Total   4941     3822                   197          19.40
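The last column of Table 2a can be recomputed from the two columns before it: average sentence length is the word total minus headings and listed items, divided by the number of sentences. The following is a minimal sketch in Python under that assumption; the dictionary layout and the function name `average_sentence_length` are illustrative choices, not part of the study's own tooling.

```python
# Figures copied from Table 2a:
# (words minus headings and listed items, total sentences).
TABLE_2A = {
    ("P-informational", "General"): (3763, 208),
    ("P-informational", "Academic"): (4058, 133),
    ("P-informational", "Total"): (7821, 341),
    ("P-instructional", "General"): (3789, 220),
    ("P-instructional", "Academic"): (2565, 138),
    ("P-instructional", "Total"): (6354, 358),
    ("E-informational", "General"): (4067, 175),
    ("E-informational", "Academic"): (4010, 167),
    ("E-informational", "Total"): (8077, 342),
    ("E-instructional", "General"): (2573, 132),
    ("E-instructional", "Academic"): (1249, 65),
    ("E-instructional", "Total"): (3822, 197),
}


def average_sentence_length(words, sentences):
    """Words of continuous text per sentence, rounded to two decimals."""
    return round(words / sentences, 2)


for (text_set, audience), (words, sentences) in TABLE_2A.items():
    asl = average_sentence_length(words, sentences)
    print(f"{text_set:16} {audience:9} {asl:6.2f}")
```

Run as a script, this prints one line per row of the table, which makes it easy to compare the recomputed averages with the published column.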

Table 2b. Summary results for structural and segmental units in p- and e-texts

                               Total    Ave. Sentences/   No. of      Ave. Paras/
                               Paras    Para.             Sections    Section
P-informational   General       71      2.93               4          17.8
                  Academic      26      5.12               2          13.0
                  Info Total    97      3.52               6          16.2
P-instructional   General      107      2.06              35           3.1
                  Academic      85      1.62              19           4.5
                  Inst Total   192      1.86              54           3.6
E-informational   General       56      3.13               7           8
                  Academic      19      8.79               2           9.5
                  Info Total    75      4.56               9           8.3
E-instructional   General       42      3.14              27           1.6
                  Academic      49      1.33              22           2.2
                  Inst Total    91      2.16              49           1.9
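One way to read the final column of Table 2b is as a ratio of e-text to p-text section size for each genre and audience. The sketch below is a hedged illustration in Python: the nested dictionary and the function name `e_to_p_ratio` are assumptions made for this example, and the input values are simply copied from the table.

```python
# Average paragraphs per section, copied from the final column of Table 2b.
PARAS_PER_SECTION = {
    "informational": {
        "p": {"General": 17.8, "Academic": 13.0},
        "e": {"General": 8.0, "Academic": 9.5},
    },
    "instructional": {
        "p": {"General": 3.1, "Academic": 4.5},
        "e": {"General": 1.6, "Academic": 2.2},
    },
}


def e_to_p_ratio(genre, audience):
    """Size of the average e-text section relative to its p-text counterpart."""
    cell = PARAS_PER_SECTION[genre]
    return round(cell["e"][audience] / cell["p"][audience], 2)


for genre in PARAS_PER_SECTION:
    for audience in ("General", "Academic"):
        print(f"{genre:14} {audience:9} {e_to_p_ratio(genre, audience):.2f}")
```

A ratio below 1 means that the e-text sections hold fewer paragraphs than the print ones; a value near 0.5 corresponds to sections of about half the size.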


Pam Peters & Adam Smith

4.1 Sectioning and levels of heading

The outstanding finding of the pilot analysis is the heavier segmentation of the e-documents, creating sectional units much more regularly than in the p-texts. This result is clear in the averages of the final column of Table 2b, and unequivocal in the individual scores in Tables A and B in Appendix 1. The average section of an e-text always contains far fewer paragraphs than the p-text equivalent, and the trend is the same in the longer informative documents, and the typically shorter instructional documents. The difference shows up most finely in the results for the instructional texts, where sectioning has always been a way of distinguishing successive steps in a procedure. Still, the e-texts take this further than the p-texts in the samples used in the pilot study, with only half the number of paragraphs per section, whether the texts are meant for general or academic/bureaucratic readers.

This unequivocal finding affirms the importance of sectional divisions within what might otherwise be chapters of a printed book. The fact that they are getting smaller means of course that they appear more regularly within e-texts than p-texts, raising the likelihood that some internal division of the text will be visible on most screenfuls delivered by the electronic browser.

Coupled with the more frequent use of section divisions are the more fully articulated sets of headings found in the e-texts. Though informational texts in each medium work with fewer levels than the instructional (only one or two, as opposed to three or four), the results for “Levels of heading” in the tables of Appendix 1 show that four levels is typical for instructional e-texts, and one for informational p-texts. The different levels are embodied in different typefaces, although these are not necessarily the same for every reader (except when controlled by a cascading style sheet: see section 2.2 above).
Whatever the basis of contrast, it serves to alert readers to the levels of headings, and so of nested structures within sections of the text. Given that those sections are constrained in length, there is a good chance of readers seeing more than one level of heading on most screenfuls of text. The underlying hierarchies of information are thus more visible and signaled more regularly.

4.2 Paragraphing

The effects of the electronic medium on paragraph lengths are not so uniform in the texts of the pilot study. Table 2b shows no general reduction of the lengths of paragraphs in terms of how many sentences they contain – except for the academic instructional e-texts, where the average is down on its print counterpart. This average is however brought down by one particular text (from the Centre for Flexible Learning), as Table B in Appendix 1 shows. It was created by online instructional designers at a university, i.e. those within academia who are most likely to be attentive to such matters. All other e-texts in the pilot study, whether for general or academic readerships, actually produce longer paragraphs than their print counterparts.

The longer paragraph is clearly associated with academic informational writing in both e-texts and p-texts, and it is remarkable how very long the paragraphs in the academic e-texts are, averaging almost nine sentences per paragraph. That this average is the product of two relatively similar values, not an outlier, can be seen from details in Table B of Appendix 1. The longer paragraphs of the texts concerned, from the Australian Humanities Review (see Figure 1) and Globe: E-Journal of Contemporary Art, suggest the continuation of the discursive essayist tradition cultivated in the humanities, already noted, which resists dividing the text into anything but paragraphs. In fact it makes paragraphs equivalent to the sectional unit. Neither e-texts nor p-texts in the academic informational category have any sectional divisions (or, put another way, they are all one section).

Figure 1. Opening paragraphs of academic informational e-text

Shorter paragraphs have long been associated with popular media, and are a touchstone of informational discourse for general audiences, by the averages shown in Table 2b. The p-texts of this type are from mass-circulating magazines (Australian Women’s Weekly, The Bulletin), and their average paragraphs of around three sentences are quite similar to those of the comparable e-texts (Addicted to Noise, Journal of Social Change). The instructional texts intended for general audiences show somewhat more divergence across the two media, the p-texts presenting a typical two-sentence paragraph, and e-texts one of three sentences plus. However, the e-texts in this case (from Permaculture and the Golden Retriever Club) are also ones which make particularly frequent use of sections with four levels of heading, and so the internal structure of the document is effectively signaled thereby.

4.3 Sentences in text

When it comes to sentence lengths in continuous text, the overall averages in each set of e-texts are higher than those of the p-texts, contrary to the working hypothesis. The consistency of the results for the instructional texts in both media is remarkable, an average of 19.4 words for e-texts and 17.75 words for the p-texts (see Table 2a). These figures position themselves on either side of the Brown corpus average for “skills” texts in category E of 18.53 words (Francis and Kucera 1982: 552). The average for the e-texts is taken upwards by the two general reader texts already mentioned for their longer paragraphs (Permaculture and the Golden Retriever Club), both produced by community interest groups, and therefore not subject to the editorial processes of commercial publishing which might perhaps have trimmed them. Many websites are mounted independent of mainstream publishing, so we may expect some divergences from the existing stylistic norms.

Yet the result for one pair of e-texts did seem to support the hypothesis. For academic informational e-texts, the average sentence length is shorter than in comparable p-texts. The difference is heightened by the fact that the sentences of the p-texts (from Critical Review and Caught in the draught) are very long, averaging over 30 words per sentence (see Appendix 1, Table A). This is well above the average of 22.31 words registered in the American academic texts (category J of the Brown corpus), according to Francis and Kucera (1982: 552). It makes a striking contrast with both general informational p-texts and all the instructional p-texts. Sentences with “studied amplitude”, loved by Samuel Johnson, are still the hallmark of “serious” prose in the modern era (Gordon 1966: 144–6, 155). The very high average for academic informational p-texts puts into relief that of the comparable e-texts, whose average (just on 24 words) is a good deal closer to the Brown corpus average for category J.
Still that average, and that of the general informational texts, are both relatively high. If keeping sentence lengths down makes for easier reading on screen, then the trend needs to be reversed by the writers of e-texts. The data on sentence lengths do not support Haussamen’s forecast (1994: 21) that English sentences will tend to get shorter as the electronic medium neutralizes the differences between oral and written styles.
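The comparison with the Brown corpus averages can be laid out as simple differences. The sketch below, in Python, uses only the figures quoted in this section (the Table 2a averages and the two Brown corpus values); the variable names and the helper `difference_from_benchmark` are hypothetical and serve only to show the arithmetic.

```python
# Brown corpus averages quoted in the text (Francis and Kucera 1982: 552).
BROWN = {"E (skills)": 18.53, "J (academic)": 22.31}

# Average sentence lengths from Table 2a, paired with the Brown category
# each is compared with in the discussion.
PILOT = [
    ("P-instructional (total)", 17.75, "E (skills)"),
    ("E-instructional (total)", 19.40, "E (skills)"),
    ("P-informational (academic)", 30.51, "J (academic)"),
    ("E-informational (academic)", 24.01, "J (academic)"),
]


def difference_from_benchmark(sample_average, benchmark):
    """Signed gap in words per sentence between a sample and a benchmark."""
    return round(sample_average - benchmark, 2)


for label, average, category in PILOT:
    gap = difference_from_benchmark(average, BROWN[category])
    print(f"{label:28} {average:5.2f}  vs Brown {category}: {gap:+.2f}")
```

A negative gap means the sample falls below the Brown figure, a positive gap that it exceeds it, so the two instructional sets sit on either side of the category E average while both academic informational sets sit above category J.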

5. Extended uses of lists

One other dimension of interest is the use of lists in e- and p-documents. These have a pragmatic function in instructional documents, both as a means of itemizing raw materials, and coordinating parallel operations. They have a strong presence in all the instructional p-texts, with from 2 to 12 lists presented within the 2000-word samples (see Appendix 1, Table A). Within those lists, the individual items can be quite long relative to the average discursive sentence, as in the documents from Burke’s Backyard and the NSW Dept. of Agriculture. Some lists include items with multiple sentences. They provide an alternative way of structuring information which would get lost in discursive paragraphs.

Although only a minority of the e-texts contain lists, they present novelties in two directions. One is that a list appears within one of the informational e-texts, Journal of Social Change, where it serves as a closing summary to the article. It consists of 5 relatively long items, averaging 21.6 words (each a full sentence), though still shorter than the discursive sentences in the body of the article. Lists are a conspicuous addition to the repertoire of structural devices now used in informational texts. They draw attention to themselves by creating their own array in white space, and by their use of local enumerative devices, including bullets. The wording of successive items is often made parallel, which provides further visual and (subliminal) aural support for a sometimes complex batch of items.

Lists are a feature of both the academic instructional e-texts. In the SCOAP Selection Committee document of just under 1200 words there are three bulleted lists whose items are worded at an average of 17.8 words. The average includes the short (4–7 word) items of one list, used for display; and the longer items (ranging from 9 to 66 words, with multiple sentences) in two other lists, in which key issues are articulated. Both types of list occur freely in the Centre for Flexible Learning (CFL) document, which contains 12 of them sprinkled through a text of just over 1000 words. The average length of listed items is down to 11.4 words (range 1–28 words). The CFL document (see Figure 2) shows interestingly how much of the substance of an instructional document can be turned into lists. In fact it creates hierarchies within lists. The higher level points marked with filled bullets usually consist of longer items, and the sublists within of shorter ones marked by hollow bullets. Lists make up more than 60% of the wording of the text, as indicated in Table B in Appendix 1. Thus lists seem to be taking the place of the paragraph as a structural device, at least in this instructional e-document. They meet the needs of the e-reader, by providing clear visual structuring of clusters of information which is easily scanned within the confines of the computer screen.

Figure 2. Use of lists in instructional e-text

A common feature of lists in both p-texts and e-texts is their use of bullets. They assume a variety of shapes: ball (filled or unfilled), lozenge, arrow, hand, asterisk, among others; but their constant function is to mark each of the items in a list. Their presence at the start of each item makes final punctuation less important, and the trend is to leave it out and to allow the continuing white space on the line to mark the end of the item. A combination of bullets and white space is becoming standard punctuation for lists (Style Manual 2000:142–3). Surveys conducted through Australian Style (1996) showed that the inclination to rely on white space correlated with the length of the listed items. More than 70% of respondents endorsed it when the items were 1–3 words long, but only 54% when they were 4–8 words long. The role of white space is discussed as part of the punctuation system in some relatively recent accounts (e.g. Quirk et al. 1985), though bullets themselves are not yet recognized. Because they preface anything from words and phrases to whole sentences and multiple sentences, they do not fit neatly into the standard punctuation classification of stops that appear sentence-internally or at sentence-end. Rather they belong to the set of metapunctuation marks such as quotation marks and parentheses, which can be used with smaller and larger strings of language. Unlike those, bullets are not paired, but complemented by white space. Despite their variable shape, bullets perform a regular punctuative function which lends itself to the articulation of e-documents, and needs to be included in future accounts of the punctuation system.

6. Summary and conclusions

The working hypothesis of this pilot study – that segmental units in the e-texts would be systematically smaller than those in comparable p-texts – was supported only at the highest level of discourse structure in the samples, i.e. where the text divides into sections. Even so, the academic informational e-texts were still presented as single undivided stretches of argument, much like those of their print counterparts. By contrast, the academic instructional e-texts were highly segmented, as were all the e-texts intended for general readers. There are thus factors of audience and genre cutting across those of the medium itself. We need not expect all genres to accommodate themselves to the new medium at the same pace, and those which are more audience-oriented (i.e. the instructional) might indeed be expected to adapt faster, in anticipating the navigational and reading needs of the screen user. The investigation found no general trend towards shorter paragraphs in the e-texts. By the same token, it highlights the increasing use of lists that create an intermediate level of structure between paragraph and sentence. This organizational device is beginning to be seen in informational e-discourse, making its appearance in one text meant for the general reader. In those intended for academic readers, the principle of seamless argument still prevails (perhaps they were not written with screen-readers in mind). Where lists are used, they would seem to offer a kind of tradeoff between complexity of material and clarity of presentation. They streamline complex elements of exposition into parallel items, sometimes including multiple sentences. The extra white space which they entail reduces the visual density of information (Kerr 1986:375), which might otherwise form paragraph blocks that dominate the screen. 
Instead the discursive part of the paragraph may be just two or three sentences framing lists of six or seven items, and there can be lists within lists, as in the CFL document in Figure 2. Whether the text seems to be pushing too much of its material into lists is another issue. At any rate it highlights the tendency of the majority of e-documents examined in this pilot study: to implement more levels of structure between sentence and paragraph (i.e. lists), and between paragraph and whole texts (headed sections).

The sample of e-documents used for this study is just under 7% of the corpus, and the trends in electronic documentation will be more fully describable once the annotation of texts is complete. But the study suggests that some writers, web-authors and publishers are anticipating the needs of screen readers, whether consciously or intuitively. The emphasis on local structures – sections marked by means of multilayered headings, and lists with bullets – correlates with a greater sense of making visual the underlying structure of textual material. All such things help readers to keep a sense of their place in the document, and to scan and absorb blocks of information within the infinity of hypertext and cyberspace. These structural responses to the electronic medium may have their roots in earlier genres (cf. Jucker 2000), but their domains of use are now expanding.

The trends described tend to downplay the importance of paragraphs in electronic discourse. Paragraphs are subsumed within larger headed sections, and are less distinctive than lists, having much less visible internal structure. The status and function of paragraphs is in any case quite variable from one kind of discourse to another. They hold their place on the double-page spread of a printed book, but their role in the continuity of an electronic document is less secure. Whether they become the chief casualty of electronic documentation – and/or a special feature of printed prose – remains to be seen.

Note

1 The compilation of EDOC in 2000/2001 was made possible by a research grant from the Australian Government’s Department of Finance and Administration. Its annotation was facilitated by grants from Macquarie University in 2001/2002.

References

Australian Style (1996), Feedback Report on survey 8.5(1): 13–14.
Francis, N. and H. Kucera (1982), Frequency Analysis of English Usage: Lexicon and Grammar. Boston: Houghton Mifflin.
Gordon, I. (1966), The Movement of English Prose. London: Longman.
Haussamen, B. (1994), ‘The future of the English sentence’, Visible Language, 28(1): 4–25.
Jackson, G. (1997), ‘Literary Theory and the Essay’, in: T. Chevalier (ed.), Encyclopedia of the Essay. London & Chicago: Fitzroy Dearborn. 487–493.
Jucker, A. (2000), ‘Multimedia und Hypertext. Neue Formen der Kommunikation oder alter Wein in neuen Schläuchen?’, in: G. Fritz and A. Jucker (eds), Kommunikationsformen im Wandel der Zeit. Vom mittelalterlichen Heldenepos zum elektronischen Hypertext. Tübingen: Niemeyer. 7–28.
Kerr, S. (1986), ‘Instructional Text: the transition from page to screen’, Visible Language, 20(4): 369–392.
Longacre, R. (1983), The Anatomy of Speech Notions. Lisse: Peter de Ridder Press.
McKnight, C., A. Dillon and J. Richardson (1991), Hypertext in Context. Cambridge: Cambridge University Press.
Nielsen, J. (2000), Designing Web Usability. Indianapolis: New Riders.
Nunberg, G. (1990), The Linguistics of Punctuation. Stanford, CA: Center for the Study of Language and Information.
Peters, P. (2002), ‘Textual Morphology from Gutenberg to the E-book’, in: A. Fischer, G. Tottie and P. Schneider (eds), Text Types and Corpora. Studies in Honour of Udo Fries. Tübingen: Gunter Narr Verlag. 77–90.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A Comprehensive Grammar of the English Language. London & New York: Longman.
Sinclair, J. (1989), Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Style Manual for authors, editors and printers (6th edition, 2002). Milton, Qld: John Wiley.
Thurstun, J. (2000), ‘Screen reading: challenges of the new literacy’, in: D. Gibbs and K-L. Krause (eds), Cyberlines: Languages and Cultures of the Internet. Melbourne: James Nicholas Publishers. 61–78.
Yale Style Manual: http://info.med.yale.edu/caim/manual/

Appendix 1

Table A. Results for structural and segmental units in p-texts (* numbers in brackets: different topics covered in the sample)

G = General, A = Academic/bureaucratic

Words, sentences and paragraphs

                                             Total    Total minus     Total       Ave.       Total    Ave.
                                             Words    headings and    Sentences   Sentence   Paras    Sentences/
                                                      listed items                Length              Para.
INFORMATIONAL
G  The Australian Women’s Weekly             1877     1707            110         15.5       42       2.62
G  The Bulletin                              2065     2056             98         21.0       29       3.38
A  Critical Review                           2016     2009             60         33.5       11       5.5
A  Caught in the draught: essays on
   contemporary Australian society
   and culture                               2059     2049             73         28.1       15       4.9
INSTRUCTIONAL
G  The Australian Home Beautiful             2039     1837            109         16.9       59       1.85
G  Burke’s Backyard Vol. 3                   2082     1952            111         17.6       48       2.3
A  Macquarie University                      2050     1238             55         22.5       35       1.57
A  NSW Department of Agriculture             2110     1327             83         16.0       50       1.66

Sections, headings and lists

                                             No. of      Ave.       Levels of   No. of   Total    Ave. Words
                                             Headed      Paras/     Heading     Lists    listed   per Item
                                             Sections    Section                         items
INFORMATIONAL
G  The Australian Women’s Weekly             2 (2)*      21         1           0
G  The Bulletin                              2           14.5       2           0
A  Critical Review                           1           11         1           0
A  Caught in the draught: essays on
   contemporary Australian society
   and culture                               1           15         1           0
INSTRUCTIONAL
G  The Australian Home Beautiful             9 (4)*      6.6        3           3        26       7.8
G  Burke’s Backyard Vol. 3                   26 (3)*     1.8        3           2         6       21.7
A  Macquarie University                      11          3.2        4           7        39       19.9
A  NSW Department of Agriculture             8           6.3        2           12       36       21.8

Itemisation device: Bullet (entered for three of the four texts containing lists)

Table B. Results for structural and segmental units in e-texts (** Sections separated by screen rather than heading)

G = General, A = Academic/bureaucratic

Words, sentences and paragraphs

                                             Total    Total minus     Total       Ave.       Total    Ave.
                                             Words    headings and    Sentences   Sentence   Paras    Sentences/
                                                      listed items                Length              Para.
INFORMATIONAL
G  Addicted to Noise - music magazine        2395     2053            103         20.0       30       3.43
G  Journal of Social Change and
   Critical Inquiry                          2210     2014             72         27.9       26       2.76
A  Globe - E Journal of Contemporary Art     1941     1810             81         22.3       10       8.1
A  Australian Humanities Review              2529     2200             86         25.6        9       9.5
   Info Total                                9075     8077            342                    75
INSTRUCTIONAL
G  Permaculture Research Institute website   1358     1298             65         19.96      23       2.83
G  Golden Retriever Club of Victoria
   (GRCV) website                            1311     1275             67         19         19       3.53
A  Selection Committee on On-line
   Australian Publications website           1194      816             37         22.1       21       1.76
A  Centre for Flexible Learning website      1078      433             28         15.5       28       1

Sections, headings and lists

                                             No. of      Ave.       Levels of   No. of   Total    Ave. Words   Itemisation
                                             Sections    Paras/     Heading     Lists    listed   per Item     Device
                                                         Section                         items
INFORMATIONAL
G  Addicted to Noise - music magazine        3**         10         2           0
G  Journal of Social Change and
   Critical Inquiry                          4           6.5        2           1         5       21.6         Bullet
A  Globe - E Journal of Contemporary Art     1           10         1           0
A  Australian Humanities Review              1           9          1           0
   Info Total                                9           8.3
INSTRUCTIONAL
G  Permaculture Research Institute website   14          1.6        4           0
G  Golden Retriever Club of Victoria
   (GRCV) website                            13          1.5        4           0
A  Selection Committee on On-line
   Australian Publications website           10          2.1        3           3        16       17.8         Bullet
A  Centre for Flexible Learning website      12          2.3        4           12       52       11.4         Bullet

Appendix 2

Source                                                            Title of sample
The Australian Women’s Weekly                                     A Favourite Every Year +
The Bulletin                                                      The Underclass
Critical Review                                                   The Turn to Ethics in the 1990s
Caught in the draught: essays on contemporary
  Australian society and culture                                  Racism
The Australian Home Beautiful                                     What you need to know about DIY paving +++
Burke’s Backyard Vol 3                                            Cat breeds
Macquarie University                                              Research Policy and Grants Guide
NSW Department of Agriculture                                     Selection Committee Reference Manual
Addicted to Noise - Online music magazine                         The Superjesus - Mark II: The Jet Age
Journal of Social Change and Critical Inquiry                     Law Sites Revisited: Looking at Differences
Globe - E Journal of Contemporary Art                             The Art of Conversation
Australian Humanities Review                                      The Imaginary Eco-(Pre-) Historian
Permaculture Research Institute website                           Worm Farms
Golden Retriever Club of Victoria (GRCV) website                  Basic Obedience
Selection Committee on On-line Australian Publications website    SCOAP Work Plan
Centre for Flexible Learning website                              Teaching and Learning

II. Corpora in Language Description

Shall and will as first person future auxiliaries in a corpus of Early Modern English texts

Maurizio Gotti
Università di Bergamo

Abstract

This article analyses the use of shall and will for first person subject future time reference in a corpus of Early Modern English texts. The analysis focuses on the uses of these modal auxiliaries both in interrogative and non-interrogative sentences, and examines their occurrences (both from a quantitative and a qualitative point of view) in different text types and for the performance of various pragmatic functions (e.g. prediction, intention, promise and proposal). The results are compared to the rules for the formation of future sentences given in a number of grammar books published in the seventeenth century.

1. Introduction

This paper analyses the use of shall and will for first person subject future time reference in a corpus of Early Modern English texts. The period taken into consideration is 1640-1710 and the texts analysed are those included in the third section of the Early Modern English part of The Helsinki Corpus of English Texts. The use of shall and will for future time reference has been the object not only of various previous analyses, but also of specific rules pointed out in several of the grammar books published in the seventeenth century. As regards the former, one of the first studies carried out on the subject is Fries (1925), whose investigation is based on a survey of the usage of shall and will in fifty English dramas from 1560 to 1915 (two dramas of roughly the same date were selected for approximately every decade in that period). In examining these texts, Fries divided the instances into three groups: (1) will and shall in independent declarative statements; (2) will and shall in questions; (3) will and shall in subordinate clauses. As regards the first group, will has been found to be more frequently used than shall with first person subjects (with average percentages of 80% and 20% respectively in the period taken into consideration here). In direct questions, on the other hand, shall overwhelmingly predominates (97% vs 3%); even in the few instances in which will occurs in the first person, the majority of cases consists of ‘echo-questions’, in which will repeats the use of the same modal auxiliary in the previous sentence. As regards subordinate clauses, the data are fairly well-balanced, with a slight majority of shall-forms (53.4% vs 46.6%).1

In her analysis of a corpus of texts taken from a wider section of the Helsinki Corpus than ours, Merja Kytö (1991) finds an initial increase in the use of will with first person subjects, reaching a peak from the 1570s to the 1640s; this increase in the use of will is particularly noticeable in colloquial language (e.g. private letters) and speech-based texts (e.g. sermons and trial proceedings).2 Later the use of will decreases, probably owing to the regulating influence of grammarians, who started advocating shall in first person and will in second and third person uses. As regards the period taken into consideration here, the use of the two modal auxiliaries with first person subjects is quite well-balanced, with a slight majority for shall (51.7% vs 48.3%); will, however, occurs more frequently with dynamic uses of the main verb (in 71% of all cases), while shall is the auxiliary favouring stative uses (in 66% of all cases). In direct questions, first person shall is entirely dominant (100% of all cases).

2. The rules in contemporary grammar books 3

Wallis (1653) is the first English grammarian to introduce a distinction in the use of the two auxiliaries to indicate future time reference.4 The author is conscious of the originality of his observation and therefore highlights the motivation for such a distinction, which is to be found in common usage:

    Shall and will indicate the future: it shall burn, it will burn. It is difficult for foreigners to know when to use the first form and when the second (we do not use them both interchangeably), and no other description that I have seen has given any rules for guidance, so I thought I ought to give some; if these rules are observed they will prevent any mistakes being made. (ibid. 94/339)5

As can be seen, these notes are principally meant for foreigners learning English, not for native speakers of the language. The differentiation in the use of the two auxiliaries is based on pragmatic criteria,6 linking the appearance of each modal to specific speech acts to be performed:

    In the first person shall simply indicates a prediction, whereas will is used for promising or threatening. In the second and third persons shall is used for promising or threatening, and will of a straightforward prediction. I shall burn, you will (thou wilt), he will; we shall, ye will, they will burn all simply predict what will happen; whereas I will, you shall (thou shalt), he shall, we will, ye shall, they shall burn are used for guarantees or pledges of what will happen. (ibid. 94-5/339)

A distinction of usage is also made by Cooper (1685), who amplifies Wallis’s specifications stating that shall used with first persons indicates declaration, while with second and third persons it stands for an order:

    Shall in primis personis innuit declarationem, in secundis & tertiis, mandatum; ut I shall prepare, we shall prepare, you shall prepare, ye shall prepare, he shall prepare, they shall prepare. (ibid. 143)


The functions of will are identified in promise, intention or decision (when used with a first person) or in promise, declaration or decision (when used with second or third persons):

    Will in primis personis denotat promissum, intentionem vel resolutionem; in secundis & tertiis promissum, declarationem, vel resolutionem; ut I will prepare, we will prepare, you will prepare, ye will prepare, he will prepare, they will prepare. (ibid. 143)

Cooper then specifies that the pragmatic function of the form expressing future time reference also depends either on its specific meaning or on the locutor’s aim or particular emphasis:

    Sive declaratio, promissum, mandatum vel resolutio significatur, a sensu & loquentis proposito & emphasi, apparet. (ibid. 143)

Miege (1688) adopts Cooper’s explanation, but not in its complete form: the various uses of shall are retained and differentiated according to the person of the subject, while those of will are unified for all persons:

    Future is a Tense, which affords some Nicety in the proper and distinct Use of its Signs, Shall and Will. For Shall, in the first Persons, denotes a Declaration; and, in the second and third Persons, a Command or Injunction. But Will doth every where import a Promise, Intention, or Resolution. Thus, when I say, I shall go, or I will go, I make a Declaration of my Willingness or Resolution to go. But if I say, you shall go, there’s a plain Injunction. And this Foreiners chiefly need take special Notice of, who are apt to confound the Use of these two Signs. (ibid. 71)

Aickin (1693) also derives his description from Cooper’s text, from which he borrows the examples as well. It should be noted, however, that he simplifies the uses of will.

    Shall and will declare the future, as, when I shall love. Note that, shall in the first persons signifies a declaration of ones mind, in the second and third a command. I shall prepare, thou shalt prepare. So will in the first persons, signifies a promise, in the second and third a declaration, as I will prepare, ye will prepare, he will prepare. (ibid. part 2, p. 11)

The analysis of these grammars has thus helped to outline the main uses of the two auxiliaries in future contexts identified by early English grammarians. In particular, as regards first person subjects, the future with shall is identified with prediction or declaration, while that with will is generally deemed to perform the pragmatic functions of promise, intention or resolution.
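The rules surveyed in this section amount to a lookup from grammarian, auxiliary and grammatical person to one or more pragmatic functions. The sketch below restates them as a small Python table. It is only a paraphrase for illustration: the English labels, the two person classes ("first" versus "other") and the function name `prescribed_functions` are assumptions made here, not terminology used by the grammarians themselves.

```python
# Paraphrase of the seventeenth-century rules summarised above.
# Keys: (grammarian, auxiliary, person class).
# "first" = first person subjects; "other" = second and third person subjects.
RULES = {
    ("Wallis", "shall", "first"): ("prediction",),
    ("Wallis", "shall", "other"): ("promise", "threat"),
    ("Wallis", "will", "first"): ("promise", "threat"),
    ("Wallis", "will", "other"): ("prediction",),
    ("Cooper", "shall", "first"): ("declaration",),
    ("Cooper", "shall", "other"): ("order",),
    ("Cooper", "will", "first"): ("promise", "intention", "decision"),
    ("Cooper", "will", "other"): ("promise", "declaration", "decision"),
    ("Miege", "shall", "first"): ("declaration",),
    ("Miege", "shall", "other"): ("command", "injunction"),
    ("Miege", "will", "first"): ("promise", "intention", "resolution"),
    ("Miege", "will", "other"): ("promise", "intention", "resolution"),
    ("Aickin", "shall", "first"): ("declaration",),
    ("Aickin", "shall", "other"): ("command",),
    ("Aickin", "will", "first"): ("promise",),
    ("Aickin", "will", "other"): ("declaration",),
}


def prescribed_functions(grammarian, auxiliary, person):
    """Functions a grammarian assigns to an auxiliary for person 1, 2 or 3."""
    if person not in (1, 2, 3):
        raise ValueError("person must be 1, 2 or 3")
    person_class = "first" if person == 1 else "other"
    return RULES[(grammarian, auxiliary, person_class)]


for name in ("Wallis", "Cooper", "Miege", "Aickin"):
    print(f"{name:7} I shall: {prescribed_functions(name, 'shall', 1)}"
          f" | I will: {prescribed_functions(name, 'will', 1)}")
```

Laid out this way, the table makes two points easy to check: all four authors separate first person shall from first person will, and Miege is the only one whose entries for will are identical across persons.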

3. The analysis of the corpus

The corpus analysed consists of long extracts from a number of texts published in the period 1640-1710, and contains a total number of 171,040 words. The various texts have been grouped into different textual categories, each representing a main text type. The specific texts and textual categories included in the corpus are as follows (the date and length of each text are added in brackets; more detailed information about the single texts may be found in the manual to the Helsinki Corpus – cf. Kytö 1996):

LAW:
- Statutes (1695-99; 13,180 words)
HANDBOOKS:
- Walton, The Compleat Angler (1676; 5,370 words)
- Langford, Plain and Full Instructions to Raise All Sorts of Fruit-Trees (1699; 6,000 words)
SCIENCE:
- Hooke, Micrographia (1665; 6,060 words)
- Boyle, Electricity & Magnetism (1675-6; 5,220 words)
EDUCATIONAL TREATISES:
- Locke, Directions Concerning Education (1693; 5,200 words)
- Hoole, A New Discovery of the Old Art of Teaching Schoole (1660; 6,120 words)
PHILOSOPHY:
- Preston, Boethius (1695; 8,820 words)
SERMONS:
- Tillotson, Sermons (1671/1679; 6,600 words)
- Taylor, The Marriage Ring (1673; 5,870 words)
PROCEEDINGS, TRIALS:
- The trial of Titus Oates (1685; 7,730 words)
- The trial of Lady Alice Lisle (1685; 6,030 words)
HISTORY:
- Burnet, History of My Own Time (1703; 5,820 words)
- Milton, The History of Britain (1670; 5,820 words)
TRAVELOGUE:
- Fiennes, The Journeys of Celia Fiennes (1698; 5,140 words)
- Fryer, A New Account of East India (1698; 5,330 words)
DIARIES:
- Pepys, The Diary of Samuel Pepys (1666-7; 5,140 words)
- Evelyn, The Diary of John Evelyn (1689-90; 6,070 words)
BIOGRAPHY, AUTOBIOGRAPHY:
- Fox, The Journal of George Fox (1694; 5,560 words)
BIOGRAPHY, OTHER:
- Burnet, Some Passages of the Life and Death of … Earl of Rochester (1680; 6,170 words)
FICTION:
- Samuel Pepys’ Penny Merriments (1684-5; 6,560 words)
- Behn, Oroonoko (1688; 5,480 words)
DRAMA, COMEDIES:
- Vanbrugh, The Relapse (1697; 7,190 words)
- Farquhar, The Beaux Stratagem (1707; 5,550 words)
LETTERS, PRIVATE:
- Haddock (Richard, Sr; Richard, Jr; Nicholas), Strype, Oxinden (Henry; Elizabeth), Hatton (Charles; Frances; Alice; Anne; Elizabeth), Pinney (Jane; John), Henry (Philip) (various dates; 13,140 words)
LETTERS, NON-PRIVATE:
- Somers, Spencer, A letter by the Privy Council, Capel, Charles II, Osborne, Aungier, A letter by the Commissioners (various dates; 5,870 words)
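As a quick consistency check on the corpus description, the word counts listed for the individual texts can be summed per category and overall. The following is only a minimal sketch: the dictionary name and layout are illustrative choices, not part of the original study, and the figures are simply copied from the list of texts.

```python
# Word counts per text, copied from the corpus description.
# The variable names are illustrative assumptions, not the author's.
CORPUS_WORDS = {
    "LAW": [13_180],
    "HANDBOOKS": [5_370, 6_000],
    "SCIENCE": [6_060, 5_220],
    "EDUCATIONAL TREATISES": [5_200, 6_120],
    "PHILOSOPHY": [8_820],
    "SERMONS": [6_600, 5_870],
    "PROCEEDINGS, TRIALS": [7_730, 6_030],
    "HISTORY": [5_820, 5_820],
    "TRAVELOGUE": [5_140, 5_330],
    "DIARIES": [5_140, 6_070],
    "BIOGRAPHY, AUTOBIOGRAPHY": [5_560],
    "BIOGRAPHY, OTHER": [6_170],
    "FICTION": [6_560, 5_480],
    "DRAMA, COMEDIES": [7_190, 5_550],
    "LETTERS, PRIVATE": [13_140],
    "LETTERS, NON-PRIVATE": [5_870],
}

# Size of each textual category and of the whole corpus.
category_totals = {name: sum(sizes) for name, sizes in CORPUS_WORDS.items()}
corpus_total = sum(category_totals.values())

print(corpus_total)  # 171040, the total stated in the text
```

The per-category sizes obtained in this way (for instance 11,370 words for handbooks) are the denominators needed for any normalised frequency by text type.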

The corpus taken into consideration contains 242 first person subject shall- and will-forms, 91 with will and 151 with shall (contracted forms have not been included in the number, owing to the impossibility of attributing them either to shall or will with any certainty); of these, however, only 216 may be considered to express futurity, as 26 shall-forms clearly denote present time reference and appear mainly in set formulas introducing an exposition or concept or in phrases referring to the exposition itself:

(1) To which I shall adde, that possibly the Celerity of the motion of the Flame upwards, may render it very difficult for the Electrical Emanations (BOYLE: 26)

The reference to the present time is often underlined by the use of the adverb now, as in the following case:

(2) I shall now say no more then that no man can have a more real heart toward any then hath to Thee (HOXINDEN: 293)

The other verbal constructions containing shall and will have different semantic and pragmatic values, although they all refer to a future time: some have a predominantly deontic function, while others have a merely dynamic value concerning futurity (the terms ‘deontic’ and ‘dynamic’ referring to modality are used according to Palmer 1986/2001).7 The occurrences found in the corpus are summed up in Table 1.
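The normalised figures that accompany the raw counts can be reproduced with a one-line calculation. The sketch below is an illustration only: the function name is hypothetical, and the scale of 10,000 words is inferred from the reported values (125 future shall-forms in 171,040 words correspond to the 7.3 given for the shall total, and 91 will-forms to 5.3).

```python
def per_10k(count: int, words: int) -> float:
    """Frequency of a form per 10,000 words, rounded to one decimal place.

    Hypothetical helper; the scale is an assumption inferred from the
    bracketed values reported for the corpus totals.
    """
    return round(count / words * 10_000, 1)


CORPUS_SIZE = 171_040  # total number of words in the corpus

shall_rate = per_10k(125, CORPUS_SIZE)  # future shall-forms (151 minus 26)
will_rate = per_10k(91, CORPUS_SIZE)    # will-forms

print(shall_rate, will_rate)  # 7.3 5.3
```

The same function applied to a single text type uses that category's word count as the denominator, e.g. `per_10k(34, 12_040)` for the 34 shall-forms found in fiction.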

Table 1. Occurrences of first person subject shall- and will-forms in future expressions. Normalised data – showing the frequency of the modal per 10,000 words – are given in brackets.

SHALL
          Inter      Pred        Inten      Prom       TOT
LAW       -          -           -          -          -
HANDBKS   2 (1.8)    2 (1.8)     5 (4.4)    1 (0.9)    10 (8.8)
SCIENCE   -          -           1 (0.9)    -          1 (0.9)
EDUCA.    -          2 (1.8)     1 (0.9)    -          3 (2.7)
PHILO.    2 (2.3)    1 (1.1)     3 (3.4)    -          6 (6.8)
SERMONS   -          1 (0.8)     1 (0.8)    -          2 (1.6)
TRIALS    -          2 (1.5)     -          -          2 (1.5)
HISTORY   -          -           -          1 (0.8)    1 (0.8)
TRAVEL    -          -           -          -          -
DIARIES   -          -           1 (0.9)    -          1 (0.9)
BIOGRA.   1 (0.9)    2 (1.7)     1 (0.9)    -          4 (3.4)
FICTION   11 (9.1)   12 (10)     9 (7.5)    2 (1.7)    34 (28.2)
DRAMA     7 (5.5)    16 (12.6)   4 (3.1)    3 (2.4)    30 (25.9)
LETTERS   -          19 (10)     10 (5.3)   2 (1.1)    31 (16.3)
Total     23 (1.3)   57 (3.3)    36 (2.1)   9 (0.5)    125 (7.3)

WILL
          Inter   Pred      Inten      Prom8      Prop       TOT
LAW       -       -         -          -          -          -
HANDBKS   -       -         9 (7.9)    2 (1.8)    -          11 (9.7)
SCIENCE   -       -         -          -          -          -
EDUCA.    -       -         2 (1.8)    -          -          2 (1.8)
PHILO.    -       -         1 (1.1)    -          -          1 (1.1)
SERMONS   -       1 (0.8)   4 (3.2)    -          -          5 (4)
TRIALS    -       -         11 (8)     6 (4.4)    -          17 (12.4)
HISTORY   -       -         1 (0.8)    -          -          1 (0.8)
TRAVEL    -       -         -          -          -          -
DIARIES   -       -         1 (0.9)    -          -          1 (0.9)
BIOGRA.   -       -         4 (3.4)    -          -          4 (3.4)
FICTION   -       -         8 (6.6)    5 (4.2)    11 (9.1)   24 (19.9)
DRAMA     -       -         4 (3.1)    3 (2.4)    2 (1.6)    9 (7.1)
LETTERS   -       -         15 (7.9)   1 (0.5)    -          16 (8.4)
Total     -       1 (0.1)   60 (3.5)   17 (1)     13 (0.8)   91 (5.3)

3.1 Interrogative sentences

In interrogative sentences only shall has been found to occur; apart from one case (a rhetorical question), interrogative sentences appear either in the dialogues of fictional works, comedies and handbooks or in quotations of direct speech reported in other text types, such as philosophical works and biographies. Most questions (12) express the pragmatic function of asking for the addressee’s opinion and commonly have a first person plural subject, as can be seen in the following quotations:

(3) Where shall we lye the next night? (PENNY: 120)
(4) What maid shall we have? (PENNY: 120)
(5) What shall we do, Sir? (FARQUHAR: 60)

In two cases the expression shall we introduces questions conveying the pragmatic value of suggestions:

(6) Shall we follow them over the water? (WALTON: 211)
(7) Shall we kill the Rogues? (FARQUHAR: 63)

Six, instead, are requests for advice; as they concern the locutor’s future action, they all have a first-person singular subject:

(8) How shall I do that? (PHILO: 135)
(9) But good Jack how shall I do to behave my self at that time amongst so many? (PENNY: 118)
(10) But how shall I get off without being observ’d? (FARQUHAR: 64)

Three interrogative sentences perform a merely predictive function, as in the following case:

(11) In what a condition shall I be, if I Relapse after all this? (BURNETROC: 146)

In two of these predictive cases the shall-forms are elliptical and therefore are not followed by any other verb but rely on the one mentioned in the previous sentence, as the following example illustrates:

(12) Y.FASH. Well, you shall have your Choice when you come there. MISS. Shall I? (VANBR I: 63)

In the instance below, the question performs the function of consulting the addressee’s wishes and can be paraphrased by the expression ‘when do you want me to …’:

(13) ARCH. When shall I come? MRS SULL. To Morrow when you will. (FARQUHAR: 59)

In one case, the shall-form is included in a rhetorical question and has the function of simulating a dialogue with the reader, thus enabling the author to provide more information as though he had been requested to do so. Here is the example of this strategy:

(14) What then? Shall we put our selves into the Company of those which I have before shewed to resemble Beasts? (BOETHPR: 183)

3.2 Non-interrogative sentences

First person pronoun statements in non-interrogative sentences appear in all text types except legal texts and travelogues. This absence from travelogues reflects an overall low number of shall- and will-forms throughout the travelogue section of the corpus,9 due to the very limited recourse to futurity typical of the genre. The case of legal texts is quite different, as shall-forms appearing in the general corpus represent 45% of all cases; also 6 instances of will-forms out of the total of 566 are drawn from statutes. The absence of the particular first person form from our subcorpus is of course due to the fact that shall has only been found in third person verbal forms, as the future events or obligations dealt with in legal texts commonly concern agents referred to in a general and impersonal way, the subject often being ’he or she’, ’any person or persons’, ’any Justice or Justices of the Peace’, ’hee shee or they’ etc. (For a more complete analysis of shall and will in EModE statutes cf. Gotti 2001). The presence of first person subject statements is also very limited in other text types such as scientific treatises, educational books, philosophical works, history books, biographies and diaries. As regards the first three of these genres, the lack may be due to the impersonal style commonly adopted in them. In the case of history books, biographies and diaries on the other hand, this limited use is quite understandable, as they commonly report events which happened in the past (thus requiring a past tense). The highest number of instances of first person subject future forms has been found in the category of fiction, drama and letters, as these text types commonly contain expressions concerning prediction and intention. 
Another genre with a high number of occurrences of first person pronoun future forms is that of handbooks; this can be explained by the dialogic structure of the first text in which they recur, while in the second, first-person forms relate to the very personal style of the text itself. The frequency of first person subjects is also comparatively high in trial proceedings, a genre involving a remarkably high use of volition, commonly denoted by clear intentionality and commitment on the locutor’s part.

As regards the use of the two modal verbs, first person statements mainly adopt shall (in 102 cases out of 193, i.e. 53%), but will is also widely used (91 out of 193, i.e. 47%). Comparing the normalised data of Table 1, we can see that, generally speaking, there are some text types which have a similar number of occurrences of shall- and will-forms, while others show great discrepancies. As seen above (cf. Kytö 1991), the distinction between these two categories has been attributed to the medium used or the level of formality. Our data confirm that the rise of first person will in this period comes from its use in speech-based texts, with will-forms more frequent than shall-forms in sermons and trial proceedings. As regards the attribution of the same reason to the use of colloquial language (typical of private letters), however, our corpus data seem to paint a different picture, as can be seen in Table 2.

Table 2. Occurrences of first person subject shall- and will-forms in private and non-private letters

                      SHALL                    WILL                SHALL + WILL
                      Pred  Inten  Prom  TOT   Inten  Prom  TOT    Pred  Inten  Prom  TOT
PRIVATE LETTERS       19    10     1     30    11     1     12     19    21     2     42
NON-PRIVATE LETTERS   -     -      1     1     4      -     4      -     4      1     5
Total                 19    10     2     31    15     1     16     19    25     3     47
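The contrast between the two kinds of letters is easier to see when the raw counts are turned into proportions. The snippet below is a minimal sketch under the assumption that the relevant figures are the shall and will totals of Table 2 (30 vs 12 in private letters, 1 vs 4 in non-private letters); the names used are illustrative, and the non-private sample is too small for the percentage to be more than indicative.

```python
# Totals of first person shall- and will-forms in letters (Table 2).
# Dictionary layout and function name are illustrative assumptions.
LETTERS = {
    "private": {"shall": 30, "will": 12},
    "non-private": {"shall": 1, "will": 4},
}


def will_share(counts: dict) -> float:
    """Percentage of will-forms among all shall- and will-forms."""
    total = counts["shall"] + counts["will"]
    return round(100 * counts["will"] / total, 1)


shares = {kind: will_share(counts) for kind, counts in LETTERS.items()}

for kind, share in shares.items():
    print(f"{kind}: {share}% will")
# private: 28.6% will
# non-private: 80.0% will
```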

Although first person subject future forms are not very frequent in non-private letters, they tend to be expressed more commonly by will than by shall (4 occurrences vs 1); moreover, in private letters shall-forms are more frequent than will-forms (30 vs 12). Instead, as the analysis below will confirm, the adoption of these two auxiliaries greatly depends on the different pragmatic uses which they perform.

3.2.1 Shall-forms

3.2.1.1. Prediction

In the majority of cases (57 out of 102, i.e. 56%) shall indicates prediction and is the modal par excellence expressing this pragmatic use (57 out of 58, i.e. 98%). In such statements shall expresses no intention and is typically followed by stative verbs, such as be, have, expect, need, know, etc., as in examples 15-17 below:

(15) O what a happy man shall I be, what a good housewife thou hast been, thou hast good cloathes too. (PENNY: 116)
(16) A week – why, I shall be an old Woman by that time. (VANBR I: 62)
(17) I received Dr Kings letter; but I shall not need much of his phiseck, for I thank God I am much better. (FHATTON I: 148)

Predictive statements are often preceded by such verbs as hope, believe, expect or expressions such as I’m afraid, as in examples 18-21 below:

(18) But I hope, Madam, at one time or other, I shall have the Honour to lead your Ladyship to your Coach there. (VANBR I: 38)
(19) I believe I shall like your cook very well. (FHATTON I: 148)
(20) but if ever you expect that I shall be friends with you, there must be two things granted. (PENNY: 271)
(21) Ay, Sir, I’m afraid we shall find a difficult Job on’t. (VANBR I: 58)

Predictions are often included in statements expressing result or consequence, as can be seen in the following quotations:

(22) But you must not look towards me, for then I shall laugh (PENNY: 118)
(23) Good lack they will keep such a do when they come in to eat it, and taking their leaves of us, and throwing the stocking, and thing or other, that I shall wish them all far enough. (PENNY: 119)

Another frequent context for predictive statements is within sentences which have a temporal value, as in examples 24-26:

(24) When we shall see Him, there is no beauty that we should desire him. (BURNETROC: 142)
(25) When I shall be that happy thing her Husband. (VANBR I: 63)
(26) I do not yet know when I shall leave this twone. [sic] (ANHATTON I: 212)

There were two cases in the corpus where the shall-forms expressing prediction are made stronger so as to provide assurance to the interlocutor. One of these instances of assurance relies on a content-oriented booster, intended to increase the illocutionary force of the speech act by underlining the certainty of the proposition asserted and thus emphasize its validity:

(27) and then we shall have good handsel indeed (PENNY: 120)

In the other case, the value of the proposition is stressed by means of a change in the conventional word order:

(28) Now on my Knees, my Dear, let me ask your pardon for my Indiscretion, my own I never shall obtain. (VANBR I: 40)

3.2.1.2. Intention

36 of the 102 occurrences of shall (35%) are used to express intention, and for the expression of intention, shall is the modal of choice in 37.5% of cases. Intentional statements containing shall are most frequent in letters, fictional works and comedies, as in the examples below:

(29) ye tickets are guineas a piece, wch is a little to much for me to throw away; so I shall not be there (ALHATTON I: 245)
(30) But fear not, Mary thou has not erred, step, step in there, step in and I shall declare unto her that thou, according to the Light, art praying for a Holy Sister, whom one of the Prophaned caused to go astray. (PENNY: 149)
(31) Sir, if you have no cause neither, I desire to know who you are; for till I know your Name, I shall not ask you to come into my House; (VANBR I: 58)

Intentional statements containing shall also occur frequently in handbooks and philosophical works, where they are employed to perform various argumentative functions. For example, they are often used to refer to an exposition or a concept which is to be presented subsequently:

(32) there is no rule without an exception, and therefore being possest with that hope and patience which I wish to all Fishers, especially to the Carp-Angler, I shall tell you with what bait to fish for him. But first you are to know … (WALTON: 297)
(33) That thou shalt do so, I shall make clear to thee by undeniable Reasons, if thou wilt but grant me those things which a little before I have laid down as Conclusions. (BOETHPR: 141)
(34) and it will contribute much to thy Cure to know these things, although I am confined within the narrow Bounds of Time, I shall endeavour to give thee some taste of them. (BOETHPR: 191)

In these types of texts the statements containing shall-forms may also aim to guide the reader’s decoding activity or clarify the author’s discourse strategies, as the following instances show:

(35) I shall therefore give you three or four more short observations of the Carp, and then fall upon some directions how you shall fish for him. (WALTON: 294)
(36) But I shall confine my discourse to the Dead. (BURNETROC: 144)

3.2.1.3. Promise

The corpus also contains nine instances of first person shall-forms used in statements which perform the pragmatic function of promise (corresponding to 9% of shall-forms in non-interrogative statements). Although shall is used in promise statements in 35% of cases, will is in fact used more often (17 occurrences). The majority of the occurrences of shall are in comedies, fictional works and letters, as in the examples below:

(37) Sir, (my Lord, I meant) you may speak to me about what you please, I shall give you a Civil Answer. (VANBR I: 62)
(38) The King has directed me to attend him tomorrow about the matters of yr Excellencie’s last letter and I shall not bee wanting to acquainte you with his Maties pleasure so soon as I know itt (OSBORNE: 22)

Promissory statements are sometimes reinforced by specific expressions, such as the phrase ye may be sure or the temporal adverb ever:

(39) ye may be sure we shall not recompense you with any molestation, but shall provide rather how we may friendliest entertain ye; (MILTON X: 144)
(40) Yea, I say unto thee, I shall thank her, O Mary, I shall ever commend thee for a sanctified Sister among our friends (PENNY: 151)

3.2.2 Will-forms

3.2.2.1. Prediction

Will is very seldom used to express prediction in first person subject statements – only one quotation of will-forms out of 91 (i.e. 1%) performs this pragmatic function and constitutes a very small percentage (1 out of 58, i.e. 2%) of all future statements expressing prediction found in the corpus. The predictive will-form occurred in the category of sermons and – more specifically – in the text by Tillotson, which indicates the possibility of an idiosyncratic usage by this author.

(41) God hath appointed Guides and Teachers for us in matters of Religion, and if we will be contented to be instructed by them in those necessary Articles and Duties of Religion, (TILLOTS II, ii: 447)

3.2.2.2. Intention

Intention is the pragmatic function which first person subject will-forms are most commonly chosen to perform (60 instances out of 91, i.e. 66%); and will is much more frequently used than shall (62.5% vs 37.5%). Intentional statements appear in almost all kinds of texts, but most particularly in trial proceedings, handbooks, letters and fictional works. The instances of first person futures containing will identified in trial proceedings commonly underline clear intentionality and commitment on the locutor’s part, as shown in examples 42-44:

(42) My Lord, I submit; I will be diretced by the Court in any thing that is fair, and not injurious to my Defence (OATES IV, 75: C2)
(43) Come, I will ask thee another Question: When was the first time thou heard’st Nelthorp’s Name? (LISLE IV: 115, C1)
(44) I will assure you, Nelthorp told me all the Story before I came out of Town (LISLE IV, 121: C2)

In handbooks the high number of occurrences of first person future will-forms is often justified by the pragmatic value of the speech-acts with an argumentative purpose. Indeed, sometimes the use of will expresses the locutor’s willingness or reluctance to contrast a previously expressed utterance:

(45) against all which any honest man may make a just quarrel, but I will not, I will leave them to be quarreled with, and kill’d by others; (WALTON: 214)
(46) But we will not at this time discourse of these. (BOETHPR: 181)

In other cases they are used to refer to an exposition or a concept to be presented subsequently:

(47) But first I will tell you how to make this Carp that is so curious to be caught, so curious a dish of meat, as shall make him worth all your labour and patience; (WALTON: 298)
(48) and will, as you desire me, tell you somewhat of the nature of most of the Fish that we are to angle for; (WALTON: 217)
(49) And before I proceed further, I will expresse my minde in the next two chapters touching the erecting of a Petty-Schoole, and how it may probably flourish (HOOLE: 213)

In the other text types, will commonly underlines the locutor’s commitment to the semantic content of the proposition:

(50) for I will speake nothinge to you but in pointe of law (FOX: 84)
(51) I will write ye particulars of our fight as soon as wee come into any port. (RHADDJR: 42)
(52) I will assure Thee that I have had such care in sending to thee that I usually do not defer writing till the last houre and am as careful to send them in good time as may bee. (HOXINDEN: 276)

In some quotations the intentionality of the statement including will is highlighted by the use of inversion or by the contrast with other modal auxiliaries, such as must (expressing obligation) or can (conveying the meaning of ability):

(53) therefore Satan, though I defie thee and all thy Works, yet will I go in unto Mary as I have said: (PENNY: 147)
(54) therefore Mary, I say I must, nay I will, and if thou deniest the refreshing of a Brother, thou are not worthy to be called a Sister. (PENNY: 151)
(55) and I am sure I both can and will tell you more than any common Angler yet knows. (WALTON: 217)

In some instances will-forms have been found to express refusal to perform an action, as in the following quotations:

(56) Oh fie no, I will not ask him, he will take it for an affront (PENNY: 119)
(57) Then you shall dress it, Sir; for if any body looks upon it, I won’t. (VANBR I: 40)

3.2.2.3. Promise

The corpus analysed includes 17 instances of promises expressed by first person subject will-forms; they represent the main form of expression of this pragmatic function, as they correspond to 65% of all promissory quotations. This kind of statement has been found in trials, fictional works, comedies and handbooks. Here are a few examples:

(58) My Lord, I will tell the Truth as near as I can. (LISLE IV: 114, C1)
(59) PISC. but you must do me one courtesie, it must be done instantly. HOST. I will do it, Mr. Piscator, and with all the speed I can. (WALTON: 216)
(60) TOM. Thirdly, if any Gentle Woman comes to have me take measure of her, you must forthwith go out of the Room, and leave us together and not be jealous. IONE. All this I will observe. (PENNY: 268)

Promises are sometimes reinforced by specific words, such as the harmonising noun promise (as in the first of the following quotations), the expression be assur’d (in the second) or the verb to swear (in the third):

(61) trust god in his promisses, for hee have saide, I will never leave thee nor for sake thee, (JPINNEY: 18)
(62) Madam, be assur’d I will protect you, or lose my life. (FARQUHAR: 61)
(63) TOM. First, you shall kiss my hand and swear that you will acknowledge me to be your Lord and Master. IONE. I will Sir. (PENNY: 268)

In one case the promise concerns an action which is harmful to the interlocutor, and thus represents an instance of a threat:

(64) I must know the Truth of that; remember that I gave you fair Warning, do not tell me a Lye, for I will be sure to treasure up every Lye that thou tellest me, and thou may’st be certain it will not be for thy Advantage: (LISLE IV: 114, C1)

3.2.2.4. Proposal

Several of the occurrences of first person will statements are used in proposals (13 out of 91, i.e. 14%); in fact, only will is used in this function, and no instances of shall-forms were found. ‘Proposal’ will only appears in fictional works and comedies, always with a first person plural subject, as in the examples below:

(65) Yea, Mary, thou hast even said, and now this first refreshment is over, let us wait another motion from the Light within and till then, if thou shalt think fit, we will sing a song of Son. (PENNY: 148)
(66) Come Mary, we will depart unto the Congregation of those Saints that be of our Notions. (PENNY: 152)

The proposal often comes as a reply to a request for advice, as can be seen in the following examples:

(67) KA. What shall we do for Clothes? JO. In troth Kate we will save that money, those that we have will serve very well. (PENNY: 117)
(68) KA. What musick shall we have? JO. We will have old Rowly and his company. (PENNY: 118)
(69) KA. What maid shall we have? JO. We will have a lusty wench, who may be able to do our work, for fourty shillings the year (PENNY: 120)

4. Conclusion

The analysis of the corpus has shown that the rules contained in the grammars of the period oversimplify the range of uses of the two modal auxiliaries with first person subjects; first person future expressions with shall do not merely denote prediction or declaration, nor do the ones with will only express promise, intention or resolution. Shall is used in interrogative sentences to express various pragmatic functions, such as asking for the addressee’s opinion, giving suggestions, requesting advice, and inquiring about the addressee’s wishes. In addition to these uses, questions starting with shall-forms perform a predictive function; occasionally they are used as rhetorical devices to serve argumentative purposes. In non-interrogative sentences shall mainly indicates prediction, being used almost exclusively (98% of all cases of prediction use shall). Shall-forms are also used to express intention, representing an alternative to the use of will-forms; however, the quantitative difference between the two modals is relevant, as shall is used in only 37.5% of all statements expressing intention. The corpus also contains 9 instances of first person shall-forms used in statements with the pragmatic function of promise, about one third of all forms expressing promise in the corpus.

Will conversely is very seldom used to express prediction in first person subject statements. As this use has only been found in Tillotson’s sermons, its occurrence may arguably be due to the author’s idiosyncratic behaviour rather than general usage. The most widespread pragmatic function performed by first person subject will-forms is that of intention (66% of all occurrences). Will is used much more commonly in this function than shall (62.5% vs 37.5%). In particular, a comparison of the use of the two modal auxiliaries in homogeneous contexts points to the adoption of will where a more marked degree of intentionality is to be denoted, as in the following example:

(70) To doe it contemptibly I would not advise her, but if with Credit I shall not be against it. But not to medle with the Scot: I will rather maintaine her (though she hath grieved me). (JOPINNEY: 59)

First person subject will-forms are also commonly used to express promise (in 65% of cases). Finally, 14% of the occurrences of first person will statements represent instances of proposals, a speech act which relies exclusively on this modal. As seen above, while shall is used in questions starting with shall we, no instances of shall-forms have been found in non-interrogative ‘proposal’ sentences. A confirmation of the different behaviour of the two modal verbs can be found in the following example, in which the alternation in the use of shall and will denotes the different pragmatic values realized by the two auxiliaries – the first corresponding to prediction and the second to proposal:

(71) we must furnish [the house] before, and lay in some Ale, that we may be able to invite all the wedding people to drink with us, and then we shall have good handsel indeed, and we will also have a good Gammon of Bacon, and that will make the drink go down merrily. (PENNY: 120)
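The percentages quoted in this section all derive from the counts of non-interrogative first person forms per pragmatic function. The following is a minimal sketch of that arithmetic; the data structure and function names are illustrative assumptions, while the counts are those reported in the analysis (57, 36 and 9 shall-forms; 1, 60, 17 and 13 will-forms).

```python
# Non-interrogative first person forms by pragmatic function,
# as reported in the analysis. Names are illustrative assumptions.
FUNCTION_COUNTS = {
    "prediction": {"shall": 57, "will": 1},
    "intention": {"shall": 36, "will": 60},
    "promise": {"shall": 9, "will": 17},
    "proposal": {"shall": 0, "will": 13},
}


def share(part: int, whole: int) -> float:
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Share of each modal within a given pragmatic function.
by_function = {}
for function, counts in FUNCTION_COUNTS.items():
    total = counts["shall"] + counts["will"]
    by_function[function] = {modal: share(n, total) for modal, n in counts.items()}

# Distribution of the 91 will-forms across the pragmatic functions.
will_total = sum(counts["will"] for counts in FUNCTION_COUNTS.values())
will_profile = {
    function: share(counts["will"], will_total)
    for function, counts in FUNCTION_COUNTS.items()
}

print(by_function["intention"])   # {'shall': 37.5, 'will': 62.5}
print(will_profile["intention"])  # 65.9
```

Keeping the two views apart (modal within function vs function within modal) avoids conflating, for example, the 62.5% share of will among intentional statements with the roughly 66% of will-forms that express intention.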

As can be seen from the analysis above and from the results obtained, first person subject uses of shall- and will-forms are more varied than indicated by those Early Modern English grammarians taken into consideration here. This discrepancy may be due to the fact that at that time the level of sophistication in the analysis of linguistic data was more limited and therefore fewer categories were employed to encompass many meanings. If we compare the usage of these modal auxiliaries in the eighteenth century to Present-day English data (cf. Coates 1983), the range of pragmatic functions expressed by the two auxiliaries has remained more or less the same. However, quantitative data show that will has become more and more frequently used to express futurity, thus ousting, to a certain degree, its ‘rival’ auxiliary shall, especially in the American varieties of the English language.

Notes

1 The results of this analysis have caused a heated debate, and have been bitterly criticized by many, particularly as regards the conclusions drawn from the data presented. Cf., for example, Taglicht:

Neither Fries nor those who have accepted his conclusions have really come to grips with the problems involved, [...] they have misinterpreted some parts of the evidence and ignored others. [...] Fries’s case against the descriptive validity of the grammarians’ rules is partly based on general arguments of very dubious cogency. (Taglicht 1970: 197, 199)

2 According to Rissanen (2000), this ‘change from below’ had important consequences for the development of the standard usage of these modal auxiliaries.

3 The grammars taken into consideration are the following:
- John Wallis, Grammatica Linguae Anglicanae, London, 1653 (EL 142);
- Christopher Cooper, Grammatica Linguae Anglicanae, London, 1685 (EL 86);
- Guy Miege, The English Grammar, London, 1688 (EL 152);
- Joseph Aickin, The English Grammar, London, 1693 (EL 21).
The acronym EL stands for ‘English Linguistics 1500-1800’, the series of texts on microfiches selected and edited by R.C. Alston for The Scolar Press.

4 In fact Wallis was preceded in making this distinction by George Mason’s Grammaire Angloise, London, 1633 (EL 261):

Le signe du futur est, shall ou will, mais il n’en faut pas vser indifferemment: car si vous vsez de ce signe, shall, quand il faut dire, will, il a mauvaise grace, outre qu’il semblera que vous parliez d’audace: example, vous pouvez dire elegamment, If I doe eate that, I shall be sicke, si je mange cela, je seray malade: au lieu que si vous disiez, I will be sicke, il sembleroit que volontairement vous voulussiez estre malade: ainsi vous pouvez dire: I hope you will be my good friend, j’espere que vous me serez amy: If you doe that, you shall bee beaten or chidden, Si vous faites cela, vous serez battu ou tancé: but I shall not, mais non seray: but you shall not chuse, mais vous ne choisirez pas, cest a sçauoir, ce ne sera pas à vostre chois: Pour le faire court, il est mal-aisé d’en bailler reigle certaine, parquoy je vous renvoye a l’vsage, auquel, afin de mieux y parvenir, nous vous proposerons la variation de certains verbes. (pp. 25-6)

However, Wallis’s grammar has been innovative in many other aspects. For the analysis of its most relevant innovations see, among others, Vorlat (1975) and Robins (1986).

5 The quotations from Wallis’s grammar are not taken from the original Latin text, but from its translation into English by J.A. Kemp (John Wallis, Grammar of the English Language, London: Longman, 1972). The first page after the quotation refers to the original text, while the second refers to the translation.

6 The great innovation of Wallis’s pragmatic approach has been particularly appreciated in recent studies, such as Arnovick:

A careful examination of the Wallis Rules reveals the way in which they tell us ‘how to do things with words’. The formulations indicate whether a speaker commands, promises, or predicts; they indicate whether the speaker is the agent of volition or prediction or whether he or she is the questioner of such agency. By distinguishing speaker attitude and speaker involvement the Wallis paradigm makes formal distinctions of modality which are central to the utterance of a speech act. (Arnovick 1997: 140-1)

7 It has naturally been difficult on occasion to separate mere futurity from deontic futurity, as in many cases the semantic value of the verbal form seemed to fit the two possibilities; the analysis of the context, however, has helped to decide whether futurity alone was meant or whether a sense of willingness or intention was also involved.

8 This column also includes one instance of threat, which is a pledge to do an action which is harmful to the interlocutor, and can be considered therefore an illocutionary subclass of promises (Searle 1978: 58).

9 No shall-forms have been found in any of the travelogues included in the corpus, and only two will-forms appear in them: one expresses prediction, while the other is a case of non-futurity, as it refers to habitual actions. For the results of the analysis of all the central modals included in the third part of the Early Modern English section of the Helsinki Corpus cf. Gotti et al. (2002).

References

Arnovick, L.K. (1997), ‘Proscribed collocations with shall and will: the eighteenth-century (non-)standard reassessed’, in: J. Cheshire and D. Stein (eds), Taming the vernacular. London: Longman. 135-151.
Coates, J. (1983), The semantics of the modal auxiliaries. London: Croom Helm.
Fries, C.C. (1925), ‘The periphrastic future with shall and will in Modern English’, Publications of the Modern Language Association of America, XL: 963-1024.
Gotti, M. (2001), ‘Semantic and pragmatic values of shall and will in Early Modern English Statutes’, in: M. Gotti and M. Dossena (eds), Modality in specialized texts. Bern: Peter Lang. 89-111.
Gotti, M., M. Dossena, R. Dury, R. Facchinetti and M. Lima (2002), Variation in central modals: a repertoire of forms and usage in Late Middle English and Early Modern English. Bern: Peter Lang.
Kytö, M. (1991), Variation and diachrony, with Early American English in focus. Frankfurt am Main: Peter Lang.
Kytö, M. (1996), Manual to the diachronic part of the Helsinki Corpus of English Texts. Helsinki: Department of English, University of Helsinki.
Palmer, F.R. (1986, 2001), Mood and modality. Cambridge: Cambridge University Press.
Rissanen, M. (2000), ‘Standardisation and the language of early statutes’, in: L. Wight (ed.), The development of Standard English: 1300-1800. Cambridge: Cambridge University Press. 117-130.
Robins, R.H. (1986), ‘The evolution of English grammar books since the Renaissance’, in: G. Leitner (ed.), The English reference grammar. Tübingen: Max Niemeyer Verlag. 292-306.
Searle, J. (1978), Speech acts. Cambridge: Cambridge University Press.
Taglicht, J. (1970), ‘The genesis of the conventional rules for the use of shall and will’, English Studies, 51, 3: 193-213.
Vorlat, E. (1975), The development of English grammatical theory 1586-1737. Leuven: Leuven University Press.

The role of gender in the use of MUST in Early Modern English

Arja Nurmi
Research Unit for Variation and Change in English, Department of English, University of Helsinki

Abstract

This study uses the 2.7 million word Corpus of Early English Correspondence to trace the development of the modal auxiliary MUST and its two main meanings, ‘personal obligation’ and ‘logical necessity’, through the span of the corpus, from the early 15th century to the late 17th. The approach is sociolinguistic, with particular reference to gender differences in the use of the auxiliary. The main finding is that MUST increased steadily in frequency during the Early Modern English period. This increase is found both in the language of men and women, with men leading in the use of MUST in the 15th century and women reaching and overtaking their level of usage by the 17th century. The use of MUST also increases in both meanings, but the ‘logical necessity’ meaning is clearly less frequent than the ‘personal obligation’ sense. The rise of ‘logical necessity’ seems to stem from educated men, from whom its use spreads to other men and to educated women.

1. Introduction

This article1 continues my attempt to trace the sociolinguistic variables associated with the core modals in late Middle and Early Modern English. The main focus is on the different meanings of MUST and their association with one social variable, gender. This is a pilot study into the feasibility of such an undertaking, and will function as the model for a later application of the same method to other modal auxiliaries. As Krug (2000: 256) points out, we have very little frequency information on the history of modal auxiliaries. The research presented in this article is one step in that direction, as it provides a general outline of the frequency of MUST from the 15th to the 17th centuries while concentrating on the more central question of combining meanings and social variables.

This study uses the Corpus of Early English Correspondence (CEEC),2 a 2.7 million word corpus covering the years c.1410–1681 and consisting of approximately 6000 personal letters written by nearly 800 letter writers. The CEEC was compiled for the purpose of applying modern sociolinguistic methods to the study of language history, and has proved a powerful tool in the description of late Middle and Early Modern English sociolinguistic variation connected with various changes in progress during that period (see e.g. Nevalainen and Raumolin-Brunberg (eds) (1996), Nurmi (1999) and Palander-Collin (1999)).

On the basis of a pilot study using the Corpus of Early English Correspondence Sampler (Nurmi 2002a; see also Nurmi 2002b), it can be said that there was a clear increase in the use of WILL and an equally clear decline of SHALL in the sixteenth century, but the rest of the core modals experienced very little change in frequency. MUST was chosen as the next object of study for three reasons. Firstly, it is the least frequent of the core modals, and therefore the practicalities of combining meanings with social variables are less arduous than with other modals. Secondly, because MUST itself is originally the past tense form of MOTEN, there is no corresponding past tense (or tentative) form. Therefore the estimation of when a past tense form is merely a past tense form, and when it is an independent auxiliary carrying its own meaning (as in the case of SHALL/SHOULD and WILL/WOULD), is left out of the equation. Finally, there is some previous corpus-based research, which makes it possible to compare my findings from the CEEC to those of Biber et al. (1998).

2. The meanings of MUST

During the Middle English period MUST, the past tense form of MOTEN, gradually gained independent status. The last quotes in the Oxford English Dictionary with MUST in its old use date from the late 15th century. The Middle English Dictionary lists numerous instances of MUST with past tense meaning through the 15th century up to the year 1500. The first quotes of MUST as a present tense form in the OED date from the 14th century, while in the MED the earliest past tense forms with present or future meaning are found in the late 12th century. Both the dictionaries give a detailed analysis of the various meanings of the auxiliary. In a corpus-based study, however, it is necessary to have as few meaning categories as possible, otherwise, when connected to sociolinguistic variables, the results would suffer from what Rissanen (1989: 18) calls “The Mystery of Vanishing Reliability”. The model of meaning applied to MUST in this study is the fairly simplified one found in Biber et al. (1999: 485, 494–495). The instances of MUST are classified under two labels: ‘personal obligation’ and ‘logical necessity’. Biber’s terms are very close to those found in Quirk et al. (1985: 224–225), who divide the uses of MUST into ‘(logical) necessity’ and ‘obligation or compulsion’. Leech (1987: 78) makes the same division, and points out that ‘logical necessity’ “can easily be weakened to ‘reasonable assumption’”, often leading to an informed guess rather than a chain of strictly logical thinking. Quirk et al. (1985) mention further the distinction between epistemic (logical) necessity and root necessity (the latter expressing a necessity where no human control is implied). The sense of root necessity is very rare in the CEEC, which contains only a handful of examples. These examples, as well as all negative instances, were left out of the study. 
Mindt (1995: 115) separates three modal meanings of MUST: ‘obligation’ which occurs in 37% of the instances in his Present-day English corpus; the equally frequent ‘inference/deduction’ (which corresponds to ‘logical necessity’) and ‘necessity’, which accounts for 24% of instances. The example of necessity given by Mindt (he looked at his watch and announced he must be on his way ) would be classified as ‘personal obligation’ in this study.

The application of a model based on the Present-day English meanings of MUST is not without its problems, but preliminary results seem to indicate that this model can be applied to Early Modern English data without doing great injustice to the material. Furthermore, Biber et al. have already applied this model to slightly later historical data, using the ARCHER corpus (Biber et al. 1998: 205–210). It is also possible to group the meanings presented in the OED and MED under these two headings. On the basis of the OED and the MED the ‘personal obligation’ meaning is clearly the older of the two: the first instance of the ‘logical necessity’ meaning in the OED is quoted from 1652. In the CEEC there are several cases of MUST in this sense from an earlier time. Examples 1 and 2 present cases of the ‘logical necessity’ meaning of MUST in the CEEC:

(1) A great part to haue al thes things is to desire to haue them: and altho Glorye and honest name are not the verye endes wherfor thes thinges are to be folowed, yet surly they must nedes folowe them, as light folowth fire, though it wer kindled for warmth. (WYATT: 1537 Thomas Wyatt, 38)
(2) you must thincke we are brought to a lowe ebbe when the last weeke the archdukes ambassador was caried to see the auncient goodly plate of the house of Burgundie (CHAMBERLAIN: 1613 John Chamberlain, I,434)

Examples 3 and 4 are instances of the ‘personal obligation’ meaning of MUST in the CEEC:

(3) All the hedges and fences must be allso presently made. (HOLLES: 1630 John Holles, III,404)
(4) I must confess this sodaine allteration of your purpose and promise makes me imploye my patience and dewtie; (BARRINGTON: 1629 Thomas Barrington, 96)

Phrases like I must confess (say, let you know), as in example (4), are fairly common in the CEEC. While I agree with Visser (1963–1973: 1807), who comments on these, saying that the “notion of compulsion or obligation is considerably obscured”, these have still been included in the ‘personal obligation’ group. Biber et al. (1999: 495) also give an example of this type of expression, including it under the ‘personal obligation’ label.

3. The development of MUST in late Middle and Early Modern English

A total of 2858 instances of MUST were retrieved from the CEEC: the range of variation and the general trends in the development of MUST remained more or less within the picture suggested by CEECS, the CEEC Sampler (Nurmi 2002a). Table 1 shows the development of MUST in the CEEC. I divided the full CEEC into three subcorpora, covering the 15th, 16th and 17th centuries. The first and last centuries fall two decades short of a full century, but this is not likely to cause any great problems in the interpretation of the results. The analysis revealed an upwards trend in the frequency of the auxiliary, from 7.2 instances of MUST per 10,000 words in the 15th century, first to 8.1 in the 16th and then to 13.2 in the 17th century. The difference between the 15th and 16th centuries is not statistically significant, but the rise from the 16th to the 17th is highly significant (p < […]

[…] (S > P or S < P), and on whether their frequency in child writing is closer to that in speech or to that in published writing (C ≈ S or C ≈ P).

S > P, C ≈ P

Nominal clauses, verbless clauses, antecedentless relative clauses, and bare non-finite clauses are less frequent in published writing than speech, and the child-writing figure is closer to published writing; as we might put it, children have successfully learned to ration their use of them. In the case of verbless clauses, the child-writing figure is actually much lower even than the published-writing figure, which perhaps reflects teachers’ injunctions to “write in complete sentences”. A similar relationship obtains between the three figures for nominal clauses, which at this point I do not understand; I would not have guessed beforehand that this category was commoner in speech than writing. (Possibly one explanation might be the frequency, in speech, of introductory hedging phrases like I think … or you know …?, where the material following think or know will be analysed as a nominal clause object of the respective verb; I have not yet looked into this.) Antecedentless relative clauses, and bare non-finite clauses, do feel like relatively “intimate” constructions: the latter because their use is restricted mainly to the verb make meaning “force” and to verbs of perception, and the former because formal prose tends to favour explicit antecedents (think of the way that stuffy writing uses that which in contexts where what would be far more idiomatic); so the differences between the three genres are unsurprising.

S > P, C ≈ S

Adverbial clauses are commoner in speech than in published writing, and the child-writing figure is about the same as the speech figure.

S < P, C ≈ S

Present participle clauses, comparative clauses, with clauses, and special as clauses are more frequent in published writing than in speech, and the child-writing figure remains closer to the speech figure.

S < P, C ≈ P

Finally, there are relative clauses (with explicit antecedents), whiz-deleted relative clauses, and past participle clauses. These are constructions used more frequently in published writing than in speech; and the frequencies in the child writing are closer to the former than to the latter. (In the case of past participle clauses, the child writing frequency is admittedly not far from the mid-point between the other two genres.) These three categories are also, logically speaking, varieties of the same construction, in which a nominal element is postmodified by a clause in which the nominal plays a grammatical role. A whiz-deleted relative is a relative clause in which the main verb is a form of BE and in which that verb, and the relative pronoun, are “understood” rather than made explicit. A past participle clause is, or at least can be, a whiz-deleted relative clause based on a passive construction, where what is left after the relative pronoun and BE are suppressed begins with a past participle. In our scheme, the category “past participle clause” also covers tagmas which are similar in their internal structure but occur in functions other than noun postmodifiers, e.g. (to hear the winner’s name) called out; but most past participle clauses in the child-writing sample are cases functioning as reduced relatives. (The great majority of these are clauses based on the participles called or named, e.g. (a road) called the Ring, (a girl) named Jennifer.)

Summing up, then: if we think of children’s acquisition of writing skills as, in part, the replacement of the grammatical habits of conversational speech with the norms of adult writing, it seems that, at the stage represented by our child writing data, the children have already achieved much of this adaptation with respect to phrasal constructions (whether this means using more of one type or fewer of another type); but less adaptation has occurred with respect to clause constructions. For a number of types of clause, the children’s written usage remains closer to spoken norms (C ≈ S). For various clause-types which are used less in published writing than in speech (S > P), the children have learned to reduce their usage. But the only clause categories used more in published writing than in speech and where the child writing has risen close to the published norms (S < P, C ≈ P) are various kinds of (full or reduced) relative clauses.11

10. The complexity of the relative construction

It seems easily understandable that children will take longer to adapt to adult norms (where these involve increased rather than decreased use) in the case of subordinate clauses, which are complex structures, than in the case of phrases. I find it more surprising that adaptation occurs sooner with relative clauses than other kinds of subordinate clause. This surely cannot be because relative clauses are structurally simpler; considered as abstract formal structures, relative clauses seem strikingly more complicated than some other subordinate clause types. Assuming that declarative main clauses can be seen as basic, producing a relative clause involves modifying that basic structure by deleting some element which may be related only remotely to the main verb of the declarative structure (for instance it may be a subordinate constituent of an immediate constituent of the structure), yielding a word-sequence that would be bizarre in isolation. In some cases an appropriate relative pronoun is used, and if the relativized item is object of a preposition then the preposition may be shifted (“Pied-Piped”) to precede the relative pronoun. Cases with zero relative pronoun are formally simpler, but are arguably no simpler to master, since the logical relationship between clause and antecedent is inexplicit and may be very diverse. Adverbial clauses or nominal clauses, by contrast, are constructed simply by prefixing a subordinating conjunction to a declarative structure; in the case of nominal clauses, although they may be signalled by the conjunction that, not even this is necessary. Admittedly, these two categories are S > P cases (their formal simplicity may perhaps be relevant to their high speech frequency), so there is no issue about how children develop the skill of using them in writing. But, among S < P categories, present participle clauses, for instance, are surely no more formally complex than relative clauses (one might well see them as less complex), yet their child-writing frequency is little different from their speech frequency. It is not easy to understand why relative clauses should lead present participle clauses so strikingly in the degree to which child writers increase their use of them towards adult written norms.

11. Simple vs. complex relatives

It is true that some relative clauses are simpler than others. A subject relative (a relative clause in which the relativized item is clause subject) has the same shape as a declarative clause, with a wh- pronoun in place of the subject, and since the logical and surface position of the relative pronoun is immediately adjacent to the antecedent it is straightforward to interpret. Likewise a relative in which the relativized item is the whole of an adjunct of the clause (e.g. (every time) we hit a wave, where the relativized item is a Time adjunct of hit) has the same shape as a declarative (there is no obvious gap, because adjuncts are optional extras), and the logical relationship between relative clause and antecedent is usually clear because the antecedent is a general noun like time or place. If relative clauses are more profuse in child writing than the complexity of the construction would lead one to expect, one might guess that this is because children confine themselves to the simplest types of relative, so that for them the construction is not a complex one. Children’s written English might occupy an earlier point on the Keenan-Comrie relativization hierarchy (Keenan & Comrie 1977) than adult written English.

From a limited sampling it appears that this is not so. It would be tedious to check all the relative clauses in our data manually, so I checked forty “full” relative clauses (i.e. not whiz-deleted relatives or past participle clauses) from each of the speech, child writing, and published writing samples.12 I classified the 120 relative clauses according to whether the relativized element is:

A, subject of the relative clause: (the Christmas story,) which took place many years ago
B, an entire adjunct of the relative clause: (every time) we hit a wave
C, object or complement of the relative clause: (a small animal) they catch
D, a constituent of a phrase constituent of the relative clause: (the person) to whom it points
E, a constituent of a phrase constituent of a phrase constituent of the relative clause: (some flowers […]) that I do not know the name of
F, a constituent of a subordinate clause constituent of the relative clause: (I am in K. [House],) which I naturally think is the best
G, a constituent of a phrase constituent of a subordinate clause constituent of the relative clause: (a village dance) which the headmistress has forbidden any of the girls to go to

(Examples in italics are quoted from the child-writing sample in each case.)

Intuitively, the sequence A to G roughly corresponds to complexity of relative clause types, so if it were true that relative clauses in children’s writing were simpler than in adult writing, one would expect the breakdown to show higher figures for child writing than for published writing in rows A and B, with the child-writing figures declining to zero in lower rows.13 It is true that the simpler relative structures are more frequent than the more complex structures in all three genres, but in other respects the figures by no means conform to that prediction:14

      speech   child writing   published writing
A       16          18                25
B        7           7                 5
C       13           7                 1
D        2          (4)                5
E        –           2                 3
F       (2)          1                 1
G        –           1                 –
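The share of the simplest relatives (types A and B) in each sample can be checked with a few lines of arithmetic. The following is a minimal Python sketch and not part of the original study: the counts are copied from the table above, with the bracketed figures (4) and (2) entered as plain counts and dashes as zero, and the names `counts` and `simple_share` are illustrative assumptions.

```python
# Counts of relative-clause types A-G in each 40-clause sample,
# copied from the table above. Bracketed figures are entered as
# plain counts and dashes as 0 (an assumption made for this sketch).
counts = {
    "speech":            {"A": 16, "B": 7, "C": 13, "D": 2, "E": 0, "F": 2, "G": 0},
    "child writing":     {"A": 18, "B": 7, "C": 7,  "D": 4, "E": 2, "F": 1, "G": 1},
    "published writing": {"A": 25, "B": 5, "C": 1,  "D": 5, "E": 3, "F": 1, "G": 0},
}


def simple_share(row, simple_types=("A", "B")):
    """Return the fraction of clauses in `row` that are of the simple types."""
    total = sum(row.values())
    return sum(row[t] for t in simple_types) / total


# Share of A- and B-type relatives per genre.
shares = {genre: simple_share(row) for genre, row in counts.items()}

for genre, share in shares.items():
    print(f"{genre}: {share:.1%}")
```

On these figures the A and B types account for 23 of 40 relatives in speech, 25 of 40 in child writing, and 30 of 40 in published writing.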

The proportion of A- and B-type relatives is actually considerably higher in the published writing than either the child-writing or speech sample. These samples are admittedly small and possibly unrepresentative, but if the frequency of relative clauses in child writing were explainable in terms of children using simple versions of the construction, one might expect this to be visible even in small samples. The figures seem to imply that the relative clause construction used in the child-writing sample is the full adult relative-clause construction; and, hence, that complexity of different constructions is not a reliable predictor of the extent to which the constructions will be deployed in child writing.

12. Unanswered questions

If children make heavy written use of relative clauses earlier than some simpler constructions, one would like to know what it is about relative clauses that permits or encourages this. Are relative clauses for some reason more useful, in the kinds of written communication represented in the Nuffield material, than some other subordinate-clause types which are needed for adults’ more diverse communicative goals? Is it that relative clauses, though formally complex, represent a more straightforward development from simpler and earlier written usage than some other constructions? At this stage I cannot even guess at the answers. In other respects, too, it is clear that the foregoing has only begun to scratch the surface of what can potentially be learned from resources like the LUCY and CHRISTINE Corpora. Once consistently annotated samples are available in machine-readable form, the questions one can ask about the acquisition of writing skills are limited only by the researcher’s ingenuity.

Acknowledgements

I am grateful to Anna Babarczy and Alan Morris for their contributions to the research resources used in this study, and to Gerald Gazdar, Adam Kilgarriff, and Anna Babarczy for comments on versions of the paper. Responsibility for its shortcomings is mine alone.

Notes

1

The CHRISTINE and LUCY projects were/are sponsored by the Economic and Social Research Council (UK) under contracts R 000 23 6443 and R 000 23 8146. Stage I of the CHRISTINE Corpus is available for downloading from www.grsampson.net (follow link to “downloadable resources”); when complete the LUCY Corpus will be made as accessible as copyright restrictions permit.

2

Of the forty text files in CHRISTINE Stage I, Release 2, file T40 was omitted from this study because of a format error which interfered with the statistics extraction software.

3

It is intended that, when complete, the LUCY Corpus will also contain a section of adult writing which is ephemeral or which, even if published, has a relatively high incidence of deviations from standard usage; for that reason, items of the latter kind were excluded from the “published writing” sample which has already been annotated.

4

In an earlier study (Sampson 1997) I used the term “depth” in a different sense, inspired by the work of Victor Yngve (e.g. 1961), to refer to the extent to which parse-trees contain left-branching structures. (The left-branchingness measure of that study, applied to the present data, gives mean figures which are similar for all three genres.) “Depth” in the present paper refers to distance between leaf and root nodes, and not to a measure of asymmetry between left and right branching.

5

Unfortunately the depth figures in the study quoted above are not directly comparable with those shown here; in that study I averaged over all words, including discourse items not contained in clauses (which were assigned depth 0); this was a reasonable approach in research which compared the oral output of different speakers, but becomes less appropriate when speech is compared with writing.

6

See Sampson (1999: §9) on the rules by which our scheme annotates such cases.

7

I have not checked whether the wordiness differential might be partly attributable to phrase within phrase recursion of the kind just illustrated — it is not entirely clear how, formally, one should tease apart the contributions of different types of recursion; but, impressionistically, noun phrase within prepositional phrase within noun phrase structures seem very common in the speech data.

8

More strictly, between the raw figures from which those percentages are calculated. The chi-squared test does not apply to percentages.

9

For detailed definitions of these categories and the subordinate-clause categories discussed below, see Sampson (1995). It is not possible in a brief space to illustrate the full range of constructions covered, but I give one example from the child-writing sample for each category:

all the first formers                 noun phrase
had been                              verb group
in the world                          prepositional phrase
very small                            adjective phrase
as soon as she’s used to her toys     adverb phrase
the other two                         number phrase
any of the girls                      determiner phrase
Mary Todd’s                           genitive phrase

10

Again I give a single example from the child writing for each category (wording in brackets is included to show the context, and is not part of the example tagma):

to keep her out of trouble                                    infinitival clause
if he had him                                                 adverbial clause
that it is a Four of Diamonds                                 nominal clause
(they go in to dinner,) then the second bell                  verbless clause
(by) adding some more to it                                   present participle clause
(one pup) who looked just like his mother                     relative clause
What I like doing                                             antecedentless relative
(make us) do the right thing                                  bare non-finite clause
(a girl) named Jennifer                                       past participle clause
(as black) as Alan’s is fair                                  comparative clause
(a yellow door …) with the name wrote on it                   with clause
(field archers do not use sights) as target archers do        special as clause
(“Amazon Adventure”) also by Willard Price                    whiz-deleted relative

11

It is interesting to compare these findings with those of Perera (1984), an excellent book which is the only previous substantial study of the grammar of child writing known to me, though written slightly too early to exploit the possibilities now opened up by computer manipulation of machine-readable annotated corpora. Perera’s table 19, p. 232, does not match our finding of child writing assimilating to adult norms earlier with respect to use of relative clauses than other subordinate clause types (though she does note that relative clauses increase in frequency more rapidly during the school years than other clause types, p. 234). Exact comparisons between Perera’s and our findings are difficult, for one thing because her statistics relate to children’s speech and children’s writing but do not give comparative figures for adult writing.

12

For speech I took all the full relative clauses (omitting one case whose type could not be determined because it was broken off before completion) in CHRISTINE files T07, T14, T21, T28, T35, and the first five from T20. For child writing I took the first twenty relative clauses in both the 9-10-year-old and the 11-12-year-old files, which were not in a systematic sequence; the forty cases were produced by nine 9-year-olds, seven 11-year-olds, and one 12-year-old. For published writing I took the first twenty cases from a passage of Independent sports reporting (part of BNC file A4B) and from an extract from a book on provision of legal services in Britain (part of BNC file GVH).

13

At one point (1998: 109), Miller & Weinert claim in effect that the standard English relative clause construction occurs in spontaneous speech only in patterns A to C. They note the existence of an alternative construction which occurs only in speech, and is more transparent because the relativized item is represented by a pronoun in its logical position, as in the book that I found these words on its pages; Miller & Weinert say that the relativized item can play far more diverse roles in this latter construction. However, Miller & Weinert’s claim about the spoken use of the standard construction seems to be contradicted by examples they quote at other points (e.g. the shop I bought it in, their p. 106 — and see our data in the table below). The alternative construction involving a “shadow pronoun” does occur in our CHRISTINE speech data, though impressionistically it is far rarer than standard relative clauses, and so far as I have noticed it does not occur at all in the child writing data.

14

The bracketed figure in row D corresponds to the fact that the one 9-year-old type D relative clause in the sample is deviant: (there are many others [scil. birds]) in which I often read about. The bracketed figure in row F relates to the spoken example (all) she’s supposed to do now, which by the rules of our scheme is analysed as having the relativized item as object of an infinitival clause subject of supposed. One might well prefer to see BE supposed to as a quasi-modal construction, in which case the F figure under “speech” would reduce from 2 to 1 and the C figure would increase from 13 to 14.

References

Burnard, L. (1995), Users Reference Guide for the British National Corpus Version 1.0. Oxford University Computing Services.
Handscombe, R.J. (ed.) (1967a), The Written Language of Nine and Ten-Year Old Children. (Nuffield Foreign Languages Teaching Materials Project, Reports and Occasional Papers, no. 24.) Leeds University.
Handscombe, R.J. (ed.) (1967b), The Written Language of Eleven and Twelve-Year Old Children. (Nuffield Foreign Languages Teaching Materials Project, Reports and Occasional Papers, no. 25.) Leeds University.
Keenan, E.L. and B. Comrie (1977), ‘Noun phrase accessibility and Universal Grammar’, Linguistic Inquiry, 8: 63-99.
Miller, J. and R. Weinert (1998), Spontaneous Spoken Language: Syntax and Discourse. Oxford: Clarendon Press.
Perera, K. (1984), Children’s Writing and Reading: Analysing Classroom Language. Oxford: Basil Blackwell; in association with André Deutsch.
Sampson, G.R. (1995), English for the Computer. Oxford: Clarendon Press.
Sampson, G.R. (1997), ‘Depth in English grammar’, Journal of Linguistics, 33: 131-51; reprinted as ch. 4 of Sampson (2001).
Sampson, G.R. (1999), ‘CHRISTINE Corpus, Stage I: Documentation’. http://www.cogs.susx.ac.uk/users/geoffs/ChrisDoc.html.
Sampson, G.R. (2001), Empirical Linguistics. London & New York: Continuum.
Yngve, V.H. (1961), ‘The depth hypothesis’, in: R. Jakobson (ed.), Structure of Language and its Mathematical Aspects. American Mathematical Society (Providence, Rhode Island). 130-138; reprinted in F.W. Householder (ed.), Syntactic Theory I: Structuralist. Harmondsworth: Penguin. 115-123.

III. Corpora in Foreign Language Learning and Teaching

On clefts and information structure in Swedish EFL writing

Mia Boström Aronsson
Göteborg University

Abstract

Learner writing is known to differ from native speaker writing in several ways, for instance in terms of frequency of certain words or structures. This study looks into the use of different types of cleft constructions, a type of focusing device over-represented in Swedish advanced learners’ written English. On the basis of material from the Swedish component of the International Corpus of Learner English (ICLE) and comparable native speaker writing, the study discusses some differences between Swedish advanced learners’ and native speakers’ use of it-clefts and pseudo-clefts in argumentative writing. Even though cleft constructions exist in Swedish too, it is possible that learners are not fully aware of their thematic meanings and their effect on the text when writing in English, since the learners sometimes appear to use these constructions without taking the textual consequences into account.

1. Introduction

Several studies have shown that advanced learner writing often deviates from native speaker writing in different ways, particularly in terms of frequencies of different words or structures. This study discusses some differences between Swedish advanced learner writing and native speaker writing reflected in the learners’ use of cleft constructions. These are a type of focusing device that is over-represented in Swedish advanced learner writing. The study comprises two analyses: an analysis of the examples in isolation, which shows how the use of these constructions may reflect differences between the argumentative styles of learners and native speakers, and an analysis of the examples in their context, which indicates that the learners have problems with the distribution of information in their texts. To begin with, some terminological issues will be dealt with. The terms thematic structure and information structure will be defined and the effect of clefts on the thematic structure and the information structure will be discussed.

2. Thematic structure and information structure

Thematic structure is made up of the components theme and rheme. These terms are defined differently within different approaches. This study follows Halliday (1994:37), who identifies the theme on the basis of its position in the clause, initial position. This means that in a typical English declarative sentence with the SVC-order, the subject makes up the theme, whereas the rest of the sentence makes up the rheme. According to the Hallidayan approach (1994:52), the theme may extend over several components, but always ends with the first experiential element, i.e. an element that is participant, circumstance, or process. This element is called the topical theme (Halliday 1994:53).

Information structure is discussed in terms of given and new information. Given information is usually, but not necessarily, placed as the starting point of the utterance, in thematic position, whereas new information is usually placed in focus at the end of the sentence, in the rheme (Halliday 1994:296ff).

The thematic structure can be varied in different ways to achieve different rhetorical effects. The use of cleft constructions is one way to manipulate the thematic structure to focus on and add emphasis to a certain element. As can be seen in the following examples from Collins (1991:1, 3), the it-cleft in (1b) and the pseudo-cleft in (1c) add focus to the subject ‘Tom’, which is not in focus in the regular declarative sentence in (1a):

(1a) Tom offered Sue a sherry
(1b) It was Tom who offered Sue a sherry
(1c) The one who offered Sue a sherry was Tom

Cleft constructions also have other properties, namely an exclusiveness implicature and an existential presupposition. A cleft implicates that something is expressed exclusively, in (1b) and (1c) who it was that offered Sue a sherry, and presupposes the existence of something as a fact, in (1b) and (1c) that ‘someone offered Sue a sherry’. (See further Collins 1991:69ff; Huddleston 1984:464ff; Johansson 1996:129ff.) However, these aspects of the cleft construction are not the focus of the current research.

3. Definition of clefts

This study includes it-clefts and different types of pseudo-clefts. The definition of pseudo-clefts is based on Collins’s (1991:27) definition, a comparatively wide one including three types of pseudo-clefts: wh-clefts, i.e. constructions with a fused relative clause that begins with what, who, where, when, why, or how, th-clefts, i.e. constructions in which the relative clause begins with the and one of the “pro-form equivalents of the English interrogatives (thing, one, place, time, reason, way)”, and all-clefts. These constructions are illustrated in the following examples taken from Collins (1991:32):

(2a) What the car needs is a new battery.
(2b) The thing the car needs is a new battery.
(2c) All the car needs is a new battery.
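
The three opening patterns just described lend themselves to a rough surface check. The following Python sketch is an illustration only and is not the retrieval procedure used in this study: the function name pseudo_cleft_type and the two constants are assumptions introduced here. The heuristic looks only at the first one or two words of a sentence, so it would also flag interrogatives such as What does the car need?, and it says nothing about reversed pseudo-clefts. Any hit would therefore still have to be checked manually.

```python
import re
from typing import Optional

# Opening words of the three pseudo-cleft types, following the definition
# quoted above from Collins (1991:27). The constant names are hypothetical.
WH_WORDS = ("what", "who", "where", "when", "why", "how")
TH_NOUNS = ("thing", "one", "place", "time", "reason", "way")


def pseudo_cleft_type(sentence: str) -> Optional[str]:
    """Guess the type of a basic pseudo-cleft from its opening words.

    Returns "wh", "th", "all", or None. This is a surface heuristic:
    it does not verify that the sentence actually is a cleft.
    """
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return None
    if words[0] in WH_WORDS:
        return "wh"
    if words[0] == "the" and len(words) > 1 and words[1] in TH_NOUNS:
        return "th"
    if words[0] == "all":
        return "all"
    return None
```

Applied to examples (2a)-(2c), the sketch returns the labels wh, th and all respectively, whereas a plain declarative such as The car needs a new battery is left unclassified.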

The study includes both basic pseudo-clefts, i.e. constructions in which the relative clause is in the theme, as in (3a), and reversed pseudo-clefts, i.e. constructions with the relative clause in the rheme (Collins 1991:3), as in (3b):

(3a) What you need most is a good rest. (Quirk et al. 1985:1388, my underlining)
(3b) A good rest is what you need most. (Quirk et al. ibid, my underlining)

Since this study is concerned with cleft constructions as representing a choice made by the writer as regards how to distribute information in the sentence, only examples which have non-cleft counterparts following the basic sentence pattern are included. As can be seen in example (4), the pseudo-clefts in (2a)-(2c) can be changed into regular declarative sentences if the word(s) introducing the relative clause and BE are omitted:

(4) What/The thing/All the car needs is a new battery → The car needs a new battery

A construction such as Collins’s (1991:59) example that’s how it operates, on the other hand, does not have a non-cleft counterpart following the basic sentence pattern (*It operates that) and would not be included in this study. Furthermore, since the focus of the study is on aspects of the thematic properties of clefts, only examples in thematic position in main clauses are included. Thus, the study does not include examples in which the cleft is preceded by a thematic prepositional phrase or a subordinate clause, as in (5a) and (5b), or cleft constructions in subordinate clauses, as in (5c):

(5a) In these days of instability, this is just what we need. (ICLE-SW-UG-053, theme underlined)
(5b) While it can be demonstrated that both sides of the capital punishment argument have strengths and weaknesses, it is the “anti-death” argument that ultimately presents a more powerful case. (ICLE-US-MRQ-0003.1, theme underlined)
(5c) This will probably also be the case if one feels that it is society that needs protection. (ICLE-SW-UG-016, theme underlined)

Regarding examples such as (6), below, the it-cleft is not considered subordinate, since modal clauses such as I believe and I think are treated as what Halliday (1994:354ff) describes as ‘metaphors of modality’ similar to expressions such as probably and certainly, thus only making up a modal theme and not the topical theme. The underlining in example (6) illustrates the thematic structure, consisting of the modal theme and the theme of the it-cleft, which is seen as a predicated theme (Halliday 1994:60, 96ff):

(6) I believe it is only the storeowners who realize this process is occurring… (ICLE-US-MICH-0009.1)

4. Material

The study is based on approximately 204,000 words from the SWICLE corpus, the Swedish component of the International Corpus of Learner English. This subcorpus consists of argumentative essays produced by native Swedish students in their second year of English university studies. The learner writing is compared to an equally large sample of similar writing taken from the LOCNESS corpus (Louvain Corpus of Native English Essays). The native speaker sample used here consists of approximately 150,000 words of essays produced by American university students and 54,000 words of A-level essays written by British students. (For a more detailed description of these corpora, see Granger 1998.)

5. Frequency

A study of the distribution of it-clefts and pseudo-clefts in the two corpora indicates that both it-clefts and pseudo-clefts are more frequent in Swedish advanced learner writing than in native speaker writing. There are 85 thematic it-clefts in the learner corpus and only 42 in the native speaker corpus, whereas there are 163 thematic pseudo-clefts in the learner corpus and only 75 in the native speaker corpus.¹ An analysis of the it-clefts in isolation shows that the learners use more or less the same types of constructions as native speakers of English and that they use them to highlight the same types of sentence elements as native speakers do. In both corpora, it-clefts formed with a that-clause are the most common, making up 63.5% of the examples in the learner writing and 78.6% in the native speaker writing, whereas the second most common type, with a who-clause, accounts for 15.3% of the learner examples and 19% of the native speaker examples. In both corpora constructions with where are infrequent, occurring only once in the native speaker corpus and twice in the learner corpus. The learners also use constructions with a zero-that-clause in 10.6% of the cases and constructions with a which-clause in 8.2% of the cases, whereas there are no such constructions among the native speaker examples. Table 1 illustrates the frequency of different types of relative clauses in it-clefts in Swedish advanced learner writing and native speaker writing. Relative clauses formed with a zero-that-clause are generally more informal than constructions with that or which (Quirk et al. 1985:1252). Although there are only a few examples, the difference between the learners’ and the native speakers’ constructions in this respect may indicate a preference among the learners for informal constructions.

Table 1. Types of relative clauses in it-clefts

          Learners      Native speakers
That      54 (63.5%)    33 (78.6%)
Who       13 (15.3%)     8 (19.0%)
Zero       9 (10.6%)     -
Where      2 (2.4%)      1 (2.4%)
Which      7 (8.2%)      -
Total     85            42

As regards it-clefts with a which-clause, these do not seem to be uncommon in English even though they are not represented among the native speaker examples in this study. For example, Collins’s (1991:35) study of cleft constructions in the LOB and the London-Lund corpora showed that almost 7% of the it-clefts were formed with which. Collins’s figures are not fully comparable to the figures in this study, though, since Collins’s (1991:34) definition of clefts is somewhat wider than the one used here, including, for example, constructions without a highlighted element with experiential functions, as in (7), which are not included in this study:

(7) it’s not that Mervyn’s TOTALLY unreliable (Collins ibid)

As far as the function of the highlighted elements in it-clefts is concerned, it-clefts are most often used to focus on the subject. Such constructions make up 47.1% of the learners’ examples and 52.4% of the native speakers’ examples. Adjuncts are highlighted almost as often as subjects in both learner and native speaker writing, 43.5% in the learner writing compared to 45.2% in the native speaker writing. Objects are rarely highlighted in the two corpora, 5 times (5.9%) in the learner corpus and only once (2.4%) in the native speaker corpus, whereas the complement of a preposition is highlighted three times in the learner writing (3.5%) and not at all in the native speaker writing. These figures are illustrated in Table 2.

Table 2. The function of the highlighted element in it-clefts

                             Learners      Native speakers
Subject                      40 (47.1%)    22 (52.4%)
Adjunct                      37 (43.5%)    19 (45.2%)
Object                        5 (5.9%)      1 (2.4%)
Complement of preposition     3 (3.5%)      -
Total                        85            42
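
Since the two samples are of approximately the same size, the overall counts reported at the beginning of this section can be compared directly, but it may be helpful to express them per 100,000 words and as learner-to-native ratios. The Python sketch below is an illustration added for this purpose. It assumes exactly 204,000 words per sample, a figure which the study gives only as approximate, and the names CORPUS_WORDS, per_100k and summarise are hypothetical.

```python
# Assumed size of each sample; the study gives this only as an approximation.
CORPUS_WORDS = 204_000

# Thematic cleft counts as reported in the text of this section.
counts = {
    "it-clefts": {"learners": 85, "native": 42},
    "pseudo-clefts": {"learners": 163, "native": 75},
}


def per_100k(count, words=CORPUS_WORDS):
    """Normalise a raw count to occurrences per 100,000 words."""
    return count / words * 100_000


def summarise(table):
    """Return normalised frequencies and the learner/native ratio per construction."""
    rows = {}
    for construction, c in table.items():
        rows[construction] = {
            "learners_per_100k": round(per_100k(c["learners"]), 1),
            "native_per_100k": round(per_100k(c["native"]), 1),
            "ratio": round(c["learners"] / c["native"], 2),
        }
    return rows


summary = summarise(counts)
for construction, row in summary.items():
    print(construction, row)
```

Under these assumptions, both construction types occur roughly twice as often in the learner sample as in the native speaker sample, with ratios of about 2.0 for it-clefts and 2.2 for pseudo-clefts.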

As far as pseudo-clefts are concerned, some differences can be noted between the learners’ and the native speakers’ examples when studied in isolation. This study touches upon two areas that involve differences in the use of pseudo-clefts as a result of different argumentative styles. The wh-cleft is the most common type in both learner writing and native speaker writing. However, this type makes up a larger share of the total number of examples in the learner writing than in the native speaker writing, since approximately 80% (131 instances) of the examples in the learner writing are wh-clefts, whereas in the native speaker writing about 60% of the examples (46 instances) are wh-clefts. In terms of over-representation, the learners’ use of wh-clefts, and more specifically constructions with what, appears to account for the entire over-representation of pseudo-clefts, since th-clefts are almost as frequent in the native speaker writing as in the learner writing (20 instances compared to 25). All-clefts are slightly more common in the native speaker writing than in the learner writing (9 instances compared to 7).² Table 3 illustrates the frequencies of the different types of pseudo-clefts in learner writing and native speaker writing.

Table 3. The frequency of different types of pseudo-clefts

              Learners       Native speakers
Wh-clefts     131 (80.4%)    46 (61.3%)
Th-clefts      25 (15.3%)    20 (26.7%)
All-clefts      7 (4.3%)      9 (12.0%)
Total         163            75
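
The observation that wh-clefts account for virtually the whole over-representation of pseudo-clefts can be checked with simple arithmetic on the counts in Table 3. The following Python sketch is an illustration and not part of the original analysis. The function name surplus_by_type is hypothetical, and comparing raw differences presupposes, as stated in Section 4, that the two samples are equally large.

```python
# Counts of thematic pseudo-clefts per type, taken from Table 3.
learners = {"wh-clefts": 131, "th-clefts": 25, "all-clefts": 7}
native = {"wh-clefts": 46, "th-clefts": 20, "all-clefts": 9}


def surplus_by_type(learner_counts, native_counts):
    """Learner count minus native speaker count for each pseudo-cleft type."""
    return {t: learner_counts[t] - native_counts[t] for t in learner_counts}


surplus = surplus_by_type(learners, native)
total_surplus = sum(surplus.values())  # equals 163 - 75

for cleft_type, diff in surplus.items():
    print(f"{cleft_type}: {diff:+d}")
print(f"total: {total_surplus:+d}")
```

Of the 88 additional pseudo-clefts in the learner sample, 85 are wh-clefts, whereas th-clefts contribute only 5 and all-clefts are 2 fewer than in the native speaker sample.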

One type of construction that contributes substantially to the high frequency of pseudo-clefts in learner writing is pseudo-clefts formed with What we/What I…, as in (8a) and (8b):

(8a) What we need now is a global strategy since pollution knows no national boundaries. (ICLE-SW-UL-014)
(8b) What I m³ trying to say is that modern society with its sterile impact doesn’t exactly stimulate you to hold on to your dreams… (ICLE-SW-UL-056)

Such constructions account for almost a third of all thematic wh-clefts in the learner writing (41 instances), whereas there are only three such examples among the native speaker examples, all of them formed with the first person singular pronoun ‘I’. Thus, the difference between the learners’ and native speakers’ examples here may well reflect the Swedish writers’ greater tendency to apply a more involved and subjective style in their argumentative writing.

Another difference in the use of pseudo-clefts that may be connected to different argumentative styles is found in constructions that highlight an adjunct, as in (9):

(9) The reason why fox hunting should be banned is because it is a cruel sport. (Alevel-FH-05)⁴

Such constructions are, in contrast to pseudo-clefts in general, far more common in the native speaker writing than in the learner writing. In the native speaker corpus, adjuncts are almost as frequent as subjects and objects as highlighted elements of pseudo-clefts, making up approximately a quarter of the examples (18 instances). In the learner corpus, on the other hand, adjuncts are only placed as the highlighted element in seven of the pseudo-clefts, i.e. about 4% of the cases. Table 4 illustrates the frequency with which different types of sentence elements are placed as highlighted elements of pseudo-clefts.

Table 4. Highlighted elements in pseudo-clefts

                             Learners      Native speakers
Subject                      68 (41.7%)    22 (29.3%)
Object                       51 (31.3%)    23 (30.7%)
Adjunct                       7 (4.3%)     18 (24.0%)
Complement of preposition    25 (15.3%)     6 (8.0%)
Predicate                    12 (7.4%)      6 (8.0%)
Total                       163            75

In native speaker writing, constructions which highlight an adjunct are mainly used to explain and to motivate the writer’s arguments, as in (10a) and (10b):

(10a) College football should have a 16 team playoff system at the end of the regular season and use their bowl games for neutral sites. College football should take the top 16 teams in the country, pair them up into two brackets, and let them play it out. The reason it should be only 16 teams is so that the playoffs won’t last so long and 16 is a fair number to have when selecting teams. (ICLE-SCU-0002.3, my italics)⁵
(10b) As the fetous begins its life in a test tube, and the sperm is selected, this means that the sex of the sperm could also be selected. The way the sex can be chosen is by using genetics. The women has 2 X chromosomes and the man has one X and a Y-chromosomes. (ICLE-ALEV-0005.8, my italics)

Although the Swedish EFL corpus also contains some instances of highlighted adjuncts, most of these are not used to fill this function in the argumentation. Instead, they function mainly to identify a point in time or a place, as in (11a), or to describe something, as in (11b):

(11a) The first time I went to a foreign university was in 1988. (ICLE-SW-ULL-047)
(11b) But I truly think that what people watched day and night during the “Desert Storm” operation was pure propaganda for Bush and his whole administration. To me the war was a perfect excuse for Bush to postpone dealing with problems at home. It is a well-known strategic move for a politician to start or escalate a conflict if his/her popularity is declining. Bush became a hero all of a sudden! The way he and his generals were portrayed was as the “good guys”, and the nation loved them. Except for the people in San Francisco. (ICLE-SW-ULX-022, my italics)

The idea that learner and native speaker writing differ in argumentative style has been suggested by Lorenz (1998) in a study of adjective intensification in learner writing. According to Lorenz (1998:64), “EFL student writing appears to be more geared towards creating an impression than towards arguing a case”. This finding seems to receive support from the present study, both as regards the learners’ non-use of highlighted adjuncts to bring the argumentation into focus and as regards their generalized overuse of emphatic constructions.

6. Textual aspects of the use of cleft constructions

A study of the learners’ examples in their context indicates that the need for a cleft is sometimes doubtful. That is, a less marked construction that would not place such focus on the highlighted element might have been a more natural choice. Because such differences are usually not a question of correct or incorrect use but rather a matter of degree, these differences between learner writing and native speaker writing cannot be discussed in terms of exact numbers. The study indicates, however, that it is not uncommon for the learners’ examples of it-clefts and pseudo-clefts to involve some aspect that makes the choice of the particular construction appear unmotivated, whereas this does not occur often in native speaker writing.

Since this study is concerned with writing, it has been necessary to rely on the information retrieved from the context to interpret which word(s) the writers want to focus on and, on the basis of this, to determine whether the use of a cleft appears to be motivated or not. For example, in the it-cleft in (12), it can be understood that focus falls on ‘local’ and not on ‘history’, since the previous sentence sets up a contrast between ‘national history’ and ‘local history’:

(12) Something that is even more interesting than our national history is our local history, which is hardly taught at all in school. I believe it is the local history in particular we want to know more about. (ICLE-SW-ULX-034, my italics)

In other cases, however, it may be more difficult to determine which word(s) are meant to be in focus. In example (13), emphasis on one of the words in the highlighted element of the it-cleft would suggest a contrastive reading, for example, that it was the Mazetti-company in Malmö, not in Stockholm, that offered free canned cacao during a whole year:

(13) I liked writing stories in order to shun that dreary wartime. Once I succeeded in winning a contest It was the Mazetti-company in Malmö which offered free canned cacao during a whole year to the pupil who could write the best essay about “milk as the best way of keeping you healthy”. (ICLE-SW-ULL-021, my italics)

Considering the context, though, a contrastive reading is not possible, since there is nothing in the context with which to contrast the company or the city mentioned. Instead, the it-cleft is possibly being used as an impersonal construction so as to avoid placing new information first in the sentence, while the emphatic effect and the exclusiveness implicature associated with it-clefts are not considered by the writer. Since it-clefts, according to Collins (1991:69ff, 171ff), give thematic prominence to the highlighted element, which is interpreted as identifying something exclusively, the use of an it-cleft which does not appear to be motivated in the context may be confusing to readers trying to find a relevant interpretation of the exclusiveness aspect of the cleft construction.

According to Halliday (1967:204), information focus is a kind of emphasis used by the speaker to mark out the part of a message that s/he wants to appear informative. Since the learners’ examples sometimes focus on and emphasize elements which do not seem to need extra focus or emphasis judging from the context, the learners appear to have problems with the distribution of information in the text. The fact that learners of English may have problems with the distribution of information has been noted by, for example, Francis (1989:220), who claims that learners of English often produce texts “which appear disconnected and disjointed because there is no clear information structure”. Francis also points to the role of unintended emphasis, and claims that

    […] an incomplete understanding of the meanings of both theme and focus of new information often leads to unintended emphases, making it difficult for the reader/teacher to understand the points being made. (Francis 1989: 220)

As regards Swedish students’ use of constructions that add more emphasis than is required, it is possible that Swedish learners are affected by the fact that emphatic constructions may be more emphatic in English than they are in Swedish. Moreover, Swedish is more tolerant than English of rearrangements of sentence elements. Fronting, for instance, another construction used to add prominence to a certain element, is not, according to Hjulmand & Schwarz (1998:316), as emphatic in Danish as it is in English, being mainly used as a linking device. Due to similarities between the Scandinavian languages, it is likely that the same goes for Swedish. It thus seems possible, likely even, that Swedish learners use cleft constructions as a way of manipulating sentence structure without being aware of the emphatic effect of English clefts and their potentially negative effect on the thematic and information structure. In example (14), for instance, the learner makes ‘in Lund’ a marked theme by placing it as the highlighted element of an it-cleft:

(14) Lately I have started to appreciate State Media in Sweden. After having spent twenty years of my life literally “in the forrest” without TV 3, Channel 4, MTV, CNN, […], Super-Channel and another odd number of the same kind, I came to Lund. It was in Lund that I first discovered ‘Discovery’ and all the other channels mentioned above. At first it was exciting: click, click, click. I could watch whatever I wanted: (ICLE-SW-ULX-009, my italics)

Since there is nothing in the text that contrasts with ‘in Lund’, the exclusiveness implicature created by the cleft construction is not appropriate in this example, and the cleft places more emphasis on the prepositional phrase than may be necessary in the context. Example (15) illustrates another case in which an it-cleft focuses on an element which does not seem to need to be in focus:

(15) I do not think that people really want their children to grow up in a world where you can not drink the water from a stream in the mountains, or breath the air in the cities without a gas-mask. The problem is that we are in a period where people have not yet experienced what pollution can do to the world. Scientists speak about holes in the ozon layer and abnormal fluctuations in the weather, but it is not often that people encounter environmental destruction themselves. The problem for the world then, is that it is the actions of the people living now that will decide the future for the earth. (ICLE-SW-UG-006, my italics)

In this case, the it-cleft appears somewhat awkward because it focuses on frequency by highlighting ‘often’ even though it is the fact ‘that people encounter environmental destruction themselves’ which should be in focus. Similarly, in example (16), the rearrangement of sentence elements by means of an it-cleft seems to place more focus on the highlighted element than the context requires:

(16) Surely integration would be the ideal solution but the existing cultureclashes tells us it is not working. The immigrants have to be fully accepted with their different cultures and languages but at the same time they have to reach out to the Swedish society and try to understand it. I am sure that it is going to be a different and perhaps a bit twisted view they get of their new country, but who said the differnet has to be so bad. (ICLE-SW-UL-045, my italics)

The learners’ difficulties with information distribution are also reflected in their use of cleft constructions that turn an element into a marked theme even though there is no reason for the particular element to be thematic, and indeed even when giving it thematic status may have a negative effect on the thematic development of the paragraph. The importance of the choice of theme for the development of a text is obvious from Halliday’s (1994:336) statement that “[t]he choice of Theme, clause by clause, is what carries forward the development of the text as a whole”. Thus, the theme of a clause should be chosen in such a way that it contributes to the development of the text. Judging from the findings in this study, this is not always the case in learner writing. Examples (17) and (18) illustrate how learners sometimes use reversed pseudo-clefts which place new information thematically, as the highlighted element, and presupposed material in the rheme, even though new information is generally placed rhematically:

(17) Nevertheless I would like the third category of dreamers to have the same magnitude as scientists and politicians have. This category is related to the romantics in terms of reaction against technology but not technology as such. The way technology is used as a source of profit instead of a tool for progress is what these “dreamers” oppose to. You find them in Africa, asking why modern technology is used in diamond mining but not in welldrilling. You find them in Europe, asking why technology isn’t used to inhibit unemployment instead of increasing it. They are people who materialize the importance of dreams in changing the current development of our mutual world. (ICLE-SW-UL-059, my italics)

In (17), the learner uses a reversed pseudo-cleft in which the relative clause presupposes that ‘these “dreamers”’, from an earlier sentence, are opposed to something, whereas the thematic, highlighted element introduces what the dreamers are opposed to. Since the relative clause contains information which is presented as if it can be assumed by the reader, whereas the highlighted element presents new information, the use of a basic pseudo-cleft, What these “dreamers” are opposed to is the way technology is used as a source of profit instead of a tool for progress, would link this sentence more closely to the preceding text. Similarly, the reversed pseudo-cleft in example (18), below, taken from an essay that discusses the question of how to integrate immigrants, is not well integrated in the text, because new information is placed thematically, as the highlighted element, whereas presupposed material is placed in the rheme. The presupposition of the relative clause, ‘that we should aim for something’, can be inferred from the preceding discussion and would tie this sentence more closely to the preceding text if made thematic (e.g. What we should aim for is the golden middle):

(18) To be able to create the society we want we have to stop looking at immigrants as a burden. New influences from all over the world should of course be seen as assets, and some thing we could benefit from. Something is wrong when we waste valuable knowledge, letting immigrants with university degrees claim benefits instead of helping them find proper jobs. Then again, the question of how to integrate immigrants in our society is extremely difficult and we can never expect to come up with straightforward solutions. The golden middle is what we should aim for and that means hard work for both Swedes and immigrants. (ICLE-SW-UG-001, my italics)

It-clefts, too, are sometimes used to make an element thematic even though this does not fit in with the thematic development of the paragraph. In (19), the learner uses an it-cleft which makes the noun phrase ‘our generation’ thematic, even though it does not fit in with the thematic development:

(19) Nevertheless modernisation also brought urbanization, unemployment, housing problems and various forms of stress, to mention but a few of the immediate consequences. In more recent days we have discovered how the industry has also affected the planet we live on, not to mention the other beings who also dwell here, trying to share the room with us. The industry and its followers have come to symbolize the destroyers of God s planet that he gave to man. This has all been in the name of one thing - money. In their quest for wealth the industrial leaders have blatantly disregarded all concern for public welfare as well as our environment, and it is our generation that has to take the consequences. (ICLE-SW-UL-067, my italics)

Since the essay from which this extract is taken deals with technology versus imagination and does not involve any comparison of generations, the shift in focus appears quite unmotivated. Similarly, in example (20), the learner appears to use an it-cleft for the sake of the existential presupposition, but the vague connection to the context and the fact that the thematic, focused noun phrase ‘Miss Quested’ does not fit into the thematic development make this construction appear unmotivated here:

(20) The Marabar Caves are in the centre of the novel right form the first line. They are described in great detail; said to be uninteresting in themselves but still captures everyone’s imagination. They could be seen as a symbol for the whole of India. Secretive and impossible to understand even for the Indians. In these caves Miss Quested experiences the assault, imagined or real, Mrs Moore looses her faith in God and Dr Aziz his belief in people. Forster tries to describe India, it is not just Miss Quested who wants to know the real India. The surroundings are carefully described and important for several events in the story. The impact of the weather is also of importance. The story is divided in three parts. (ICLE-SW-ULL-020, my italics)

As can be seen in the extract, the first sentence of the third paragraph makes the author, ‘Forster’, thematic, whereas ‘describe India’ is rhematic. The it-cleft then shifts the focus to ‘Miss Quested’, who is placed as the highlighted element of the it-cleft, whereas the relative clause of the it-cleft implies that ‘someone wants to know the real India’. The next sentence again deals with the act of describing. This time the noun phrase ‘The surroundings’ is placed as the theme of the sentence. Obviously, there is no clear line of thematic development in this paragraph and the it-cleft confuses the reader instead of helping to interpret the text.

Non-native speakers’ potential difficulty in handling thematic progression in texts has been noted by, for instance, Mauranen (1993), who studied Finnish speakers’ written English and written Finnish. According to Mauranen (1993:254), Finnish writers had problems with thematic progression in their English writing which they did not have when writing in Finnish. Mauranen (1993:255) claims that writing in a foreign language entails an extra processing load which gives the writers problems in the textual organization that they do not have in their native language. This explanation also seems plausible in this study, which has shown how Swedish learner writers use cleft constructions to give unwarranted thematic prominence to elements in the text.

7. Conclusion

The study of cleft constructions in Swedish advanced learner and native speaker writing shows that both it-clefts and pseudo-clefts are over-represented in Swedish advanced learner writing. A study of the examples in isolation and in context indicates that several different factors contribute to the differences between learner and native speaker writing. The study of the form of the pseudo-clefts showed that the learners frequently use pseudo-clefts with What I/What we…, which are barely used at all in the native speaker writing, whereas the study of the highlighted elements showed that constructions that highlight an adjunct, which are used to explain or to motivate the writer’s argument, are more frequent in native speaker writing than in learner writing. These differences appear to reflect the fact that the learners apply a different argumentative style in their writing from the native English writers. The study of the different examples in their context indicates that the high frequency of it-clefts and pseudo-clefts in learner writing may partly be the result of the learners’ having difficulties with the distribution of information in their texts. These difficulties result in unmotivated focus and emphasis, and implications of contrastiveness when there is none. This may be the result of differences between Swedish and English as regards how emphatic such constructions appear. The learners’ difficulties with the distribution of information may also result from the burden of processing a foreign language, as reflected in learners’ use of clefts that place an element as a marked theme even though this does not fit in with the thematic development of the paragraph, and can even make the text more difficult to follow.

Acknowledgements

I am grateful to Karin Aijmer and Jennifer Herriman for helpful comments on an earlier version of this article.

Notes

1 Similar differences between learner writing and native speaker writing are found when both thematic and non-thematic clefts are considered.

2 A study of both thematic and non-thematic pseudo-clefts shows similar tendencies, with almost as many th-clefts in native speaker writing as in learner writing and slightly more all-clefts in native speaker writing than in learner writing.

Mia Boström Aronsson

3 This and other mistakes in the learner corpus examples have been reproduced exactly as they were submitted.

4 Even though it does not change my point about the use of pseudo-clefts highlighting adjuncts in native English, it should be noted that it is not considered good English to use ‘The reason why’ followed by ‘is because’, but ‘is that’ would be preferable. (See Quirk et al. 1985:1006.)

5 ‘The reason why’ should not usually be followed by ‘so that’, but rather by ‘that’. (See also Note 3.)

References

Collins, P. C. (1991), Cleft and Pseudo-Cleft Constructions in English. London: Routledge.
Francis, G. (1989), ‘Thematic selection and distribution in written discourse’, Word, 40(1-2): 201-221.
Granger, S. (1998), ‘The computer learner corpus: a versatile new source of data for SLA research’, in: S. Granger (ed.), Learner English on Computer. London: Longman. 3-18.
Halliday, M.A.K. (1967), ‘Notes on transitivity and theme in English. Part 2’, Journal of Linguistics, 3: 199-244.
Halliday, M.A.K. (1994), An Introduction to Functional Grammar. 2nd ed. London: Arnold.
Hjulmand, L.-L. and H. Schwarz (1998), A Contrastive Grammar of English for Danish Students. Frederiksberg: Samfundslitteratur.
Huddleston, R. (1984), Introduction to the Grammar of English. Cambridge: Cambridge University Press.
Johansson, M. (1996), ‘Contrastive data as a resource in the study of English clefts’, in: K. Aijmer, B. Altenberg, and M. Johansson (eds), Languages in Contrast: Papers from a Symposium on Text-based Cross-linguistic Studies. Lund 4-5 March 1994. Lund: Lund University Press. 127-150.
Lorenz, G. (1998), ‘Overstatement in advanced learners’ writing: stylistic aspects of adjective intensification’, in: S. Granger (ed.), Learner English on Computer. London: Longman. 53-66.
Mauranen, A. (1993), Cultural Differences in Academic Rhetoric: A Textlinguistic Study. Frankfurt am Main: Peter Lang.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A Comprehensive Grammar of the English Language. London: Longman.

Contrasting learner corpora: the use of modal and reporting verbs in the expression of writer stance

JoAnne Neff, Emma Dafouz, Honesto Herrera, Francisco Martínez & Juan Pedro Rica
Universidad Complutense de Madrid

Mercedes Díez
Universidad de Alcalá

Rosa Prieto
Escuela Oficial de Idiomas de Valdezarza

Carmen Sancho
Universidad Politécnica de Madrid

Abstract

This article presents part of the results from research carried out by the SPICLE 1 team on argumentative texts written in English by student writers, both native and non-native speakers from several L1 backgrounds. The aim of the study was to compare how these writers construct stance by examining their use of devices of evidentiality, specifically, modal verbs (can, could, may, might and must) and nine reporting verbs (suggest, wonder, argue, explain, express, recognise, say, show, and state). The texts of American university writers were contrasted with those produced by five EFL groups (speakers of Spanish, Dutch, Italian, French and German). The results showed that the EFL writers either overuse or underuse modal verbs in comparison with the American writers. Regarding the use of reporting verbs, native writers use a wider range of verbs, many of which carry a higher pragmatic import for stance taking. This research is significant not only for the comparison of typological and pragmatic differences but also for the study of interlanguage features and the teaching and learning of writing conventions.

1. Introduction

It is well known that one of the most difficult aspects of constructing argumentative texts for both native and non-native university writers is learning how to modalise both the propositions which they themselves put forth and those which they attribute to someone else. Holmes (1983: 100) has noted that the sociolinguistic competence of English as a Foreign Language (EFL) students in both speaking and writing involves learning the social and cultural values of the target community. She divides the grammatical classes of modal verbs, lexical verbs, adverbial constructions, nouns and adjectives into three categories of
devices used in expressing propositions: personalised (subjective modality in Halliday 1985: 333); impersonalised (objective modality in Halliday 1985: 333) and depersonalised (abstract rhetors in Hyland 2000). She points out that these devices serve at least two simultaneous functions: the expression of certainty/doubt concerning the proposition but also the speaker/writer’s attitude towards the audience. In line with this finding, Hinkel (1995) has shown that non-native speakers’ use of modal verbs reflects the pragmatic frameworks and norms specific to first language (L1) environments, which may differ considerably from those expected in second language (L2) conceptual structures. In the scrutiny of more than seven hundred essays written by Asian non-native speakers of English, Hinkel found that, besides being topic-bound, the use of a number of modal verbs (mostly those of obligation and necessity) heavily depended on authority or moral values intrinsic to the concrete Oriental philosophy of the students. As well, Thompson and Yiyun (1991: 366) point out that it is a common experience for teachers of Academic English to have difficulties in identifying clearly the kind of stance that EFL student writers intend to construct when citing other authors, that is, whether the student writers intend to accept the author’s viewpoint or reject it. Their conclusion is that students should be trained in identifying the different layers of reporting structures and the evaluative meanings attached to them. It appears, then, that EFL students may experience difficulties in academic writing for a number of reasons. There may be typological differences between the L1 and L2, such as the existence of a subjunctive mood, which can be used to express modality, instead of a modal verb. 
There may also be differences in the politeness strategies used by different discourse communities, either in the types or frequencies of the hedges and boosters used as well as in the types and number of devices used for personalised, impersonalised or depersonalised constructions, as noted by Holmes (1983). Finally, EFL writers may not realise that different evaluative meanings are attached to different reporting verbs or they may not have sufficient command of a variety of devices (hedges and boosters) with which to express the appropriate sociocultural competency. In this study, part of a larger project2 funded by the Spanish Ministry of Education, we compare the different constructions of stance by native and non-native writers with regard to their modalisation of propositions (Halliday 1985: 335) through the use of modal verbs (can, could, may, might and must) and reporting verbs (argue, explain, express, recognise, say, show, state, suggest, and wonder). These nine verbs were found to be among the twelve most frequently used in a preliminary analysis of thirty reporting verbs used by both Spanish and American university writers.3 By stance, we mean “the dialogically enacted positioning of a social agent with respect to alignment, power, knowledge, belief, evidence, affect and other socially salient categories” (Du Bois 2001). We define modality in the broad sense, meaning the source and the reliability of speaker/writer’s knowledge, with the co-hyponyms evidentiality, referring to source, and modality, referring to certainty of knowledge (Dendale and Tasmowski 2001). Thus, we accept
Palmer’s (1986) proposal that evidentiality be included within modality, the former being a means for signalling epistemic attitude towards the information. The focus on modality here follows from a wider comparative study of the construction of stance by native writers (American university students and professional editorialists) and Spanish EFL university students. Three research questions were posed:

1) Are there quantitative and/or qualitative differences between the native (American university writers) and non-native use (Dutch, French, German, Italian and Spanish) of modal and reporting verbs as evidentiality devices?
2) Are there quantitative and/or qualitative differences in the use of these devices among the various non-native groups?
3) Do the results point to typological and/or sociocultural differences in the construction of writer stance with modal and reporting verbs?

2. The modalisation of propositions

Modal verbs are used to express the speaker/writer’s attitude toward the nonfactual and non-temporal elements of the situation under consideration. Thus, the use of a modal verb always entails a speaker/writer’s judgement or opinion. The function of non-epistemic can,4 as Hoey (1994: 42) notes, is evaluative. It is less concerned with the hypothetical (signalled by may and might) than with assessing the possibility, which may be physical, of permission, or non-restriction (Lewis 1986: 104). With can, the situation is judged as a truly existing possibility. It signals that, in the writer’s judgement, a definite possibility exists. For this reason, and particularly when denoting non-restriction, can is often involved in presenting a change from problem to solution, as in this text written by an American college student:

(1) What do Cindy Crawford, Kate Moss, Naomi Campbell, and Kristy Turlington have in common? All of these women are extremely beautiful, all are top notch models, but most importantly all are understood to be the "perfect and ideal" women. The 1990’s base these standards or expectations on every woman, and every woman who doesn’t resemble Barbie is perceived as imperfect. … every woman may not look at the famous models and want to resemble them, but I admit that I am the opposite -- I only wish that I could be as flawless as Cindy Crawford. … Plastic surgery is on an increase these days, it seems that everywhere you look, girls as young as fourteen are having breast implants, or facial operations, what they don’t realize though, is that implants or tucks don’t last forever, and are a leading cause of Cancer. Liposuction is used to remove fat, but is a very dangerous procedure that could lead to bleeding to death, at the slightest mistake made by a surgeon. … Any wealthy woman could go to a plastic surgeon, and explain what she wants to look like, but we all can understand and be truly happy with ourselves… (LOCNESS, AUW)

The function of could is to signal the existence of remote possibility. In example (1), stressed could deductively evaluates the hypotheses of being beautiful, of bleeding to death, and of wealthy women going to plastic surgeons. These instances of could occur in clauses for which it is not possible to either affirm or deny the hypothesis, given that the events have not taken place. The only other possibility for the reader, then, is to agree or disagree with it (Winter 1982: 198). What differentiates could from might is that the former signals unilateral possibility, while might maintains open possibility, i.e. various options may be considered (Gresset 2001). Writers must advance their own viewpoints but should counterbalance them by introducing the arguments of others, a rhetorical strategy that student writers may not always be aware of. It is in the critical examination of evidence that the epistemic modals may and might are often used. The same relationship is true of this pair of modal verbs as between can and could, in that might signals a more remote possibility than may. But what differentiates them from can and could is that with the former the volitionality of the speaker/writer is involved in creating the possibility (Lewis 1986: 113). Propositions including may and might are speculations or hypotheses. These modals and other signals of suspension of fact, such as irrealis constructions, for instance, concessive constructions with if, establish a speculative context in which the writer can put forward a hypothesised situation.

3. Stance-taking with reporting verbs

In their study on evaluation in reporting verbs, Thompson and Yiyun (1991: 372-373) first make a classification of these verbs into those that denote the author’s stance (the person whose information is being reported) and the writer’s stance (the stance adopted by the student writers themselves), both of which are concerned with the truth/correctness of the reported proposition. However, they also note that if a (student) writer reports information from another author using, for example, the verb recognise, the writer is giving, at the same time, an interpretation of the author’s stance (in their framework, the author’s behaviour interpretation). Thus, verbs like suggest and wonder signal the non-factive when the (student) writer puts forward a proposition; but when the proposition being presented is that of another author, the writer is also signalling his/her interpretation of the author’s stance. This latter type also includes argue and recognise, as compared to factive but neutral (non-interpretative) verbs such as say, state, and express. Explain and show, on the other hand, signal presentation of facts, either by the writer or another author.

4. Analysis

In order to compare the expression of modality and evidentiality in a variety of non-native writing, we analysed argumentative texts from the International Corpus of Learner English (ICLE), written by non-native speakers of the following languages: Dutch, French, German, Italian and Spanish.5 The performance of these EFL groups, representative of Romance languages (French-speaking university writers, FUW; Italian university writers, IUW; and Spanish university writers, SUW) and Germanic languages (Dutch university writers, DUW; and German university writers, GUW), was contrasted with that of a reference group of American university writers (AUW), using material from the LOCNESS corpus (Louvain Corpus of Native English Essays). The analysis of data in all corpora was conducted with OUP WordSmith Tools 3.0, specifically the Wordlist and Keywords tools, and was divided into five steps. After a search for the five modal verbs and the nine reporting verbs mentioned above, the frequencies for each item in every corpus were counted. These data were then run through the Keywords tool in order to calculate the chi-square test of significance with Yates correction for a 2 × 2 table. Then, a comparison between each non-native corpus and the native reference corpus was carried out. All items showing a p value lower than 0.05 were considered statistically significant. To compensate for the uneven number of words among corpora, we normalised the statistically significant figures by 10,000 words. Finally, a graph of every item showing the statistical differences was produced. The chi-square values for all the quantitative results appear in the Appendix.

5. Findings for the modal verbs
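The significance test and normalisation described in the Analysis section can be illustrated with a short Python sketch. It is not the WordSmith Tools procedure itself: it simply assumes a 2 × 2 table (occurrences of an item versus all other tokens, in a learner corpus and in the reference corpus), applies the chi-square test with Yates correction, and normalises a raw frequency per 10,000 words. The function names and the counts in the usage example are hypothetical and are not taken from the corpora.

```python
import math


def chi_square_yates(freq_a, size_a, freq_b, size_b):
    """Chi-square with Yates correction for a 2 x 2 table (1 degree of freedom).

    Rows are the two corpora; columns are 'the item' vs 'all other tokens'.
    Returns (chi_square, p_value).
    """
    a, b = freq_a, size_a - freq_a
    c, d = freq_b, size_b - freq_b
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("each row and column of the table needs a non-zero total")
    # Yates correction: shrink |ad - bc| by n/2, but never below zero.
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    chi_square = n * corrected ** 2 / denominator
    # For one degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


def per_10k(freq, corpus_size):
    """Normalise a raw frequency to occurrences per 10,000 words."""
    return freq * 10_000 / corpus_size


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    chi2, p = chi_square_yates(freq_a=60, size_a=10_000, freq_b=35, size_b=10_000)
    print(f"chi-square = {chi2:.2f}, p = {p:.4f}, significant = {p < 0.05}")
    print(f"normalised: {per_10k(60, 10_000):.1f} vs {per_10k(35, 10_000):.1f}")
```

Normalising only after the test, as the text describes, keeps the significance calculation on raw counts, so that corpus size still informs the result.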

In answer to the first two research questions, the results concerning the use of modal verbs by the non-native writers show patterns of over- and underuse of certain modals. The case of can is especially interesting since it is overused by all non-native writers, the highest frequencies being found in the Italian and Spanish learner writing (Figure 1). The principal difference between the use of can in the native texts (AUW), on the one hand, and in the EFL student texts, on the other, is that the former use can less in the dynamic sense of ability and more in the sense of possibility of non-restriction. Also, native uses of can denote a definite possibility, as in:

(2) Money of itself is neither good nor bad. One’s attitude about money is the deciding factor that determines what influence money has. It can be an influence for evil if the need for whatever it buys is possessing the buyer … (LOCNESS, AUW)

In two L1 Romance language EFL groups, the Spanish and the Italian, can (which signals a definite possibility of non-restriction) is sometimes incorrectly used, when could (the remote possibility) would be more appropriate. The use of
can in this example from the Italian corpus can be compared to the use of could in example (1) above, or to the use of the more hypothetical might:

(3) This is the first of a hundred questions that I ask to myself every time that I read or I hear about a capital execution. Sometimes I think that this is the right solution and that it can be the only way to improve our society but then when I understand that it is very difficult to lower the criminality I ask to myself how the death of someone can be considered a right thing. (IUW)

Figure 1. The use of can by native and non-native writers

We assume that this erroneous use is caused, at least in part, by the use of the modal can in both the Spanish and Italian texts in an epistemic sense, which reflects the more hypothetical senses of the Spanish modal poder and the Italian modal potere. This over-extension in meaning may account for part of the drastic overuse of can by both the Spanish and Italian EFL writers, the two groups with the highest frequency for this modal. The overuse may also be due to a teaching effect, in that this is the first modal EFL students learn and they may thus believe that it can be used in exactly the same contexts as their own modals poder and potere. This mistaken use of can contrasts sharply with the correct use of this modal by the Dutch and German EFL writers, who have more tokens of can + be, all indicating a definite possibility, as in:

(4) … is the automobile man’s worst friend? It is a fact that it can be very helpful, it makes us much more independent and mobile.’ (GUW)

Most of the use of can by the Dutch writers is in passive voice clauses, as in:

(5) … the same rule can be applied to artists (DUW)

In the French corpus, on the other hand, there are only fourteen instances of can + be, most indicating a definite possibility, although there are also three erroneous uses, as in:

(6) … now they favour a world of power and economy. This can be a reason for the disappearance of dreaming …’ (FUW)

Here could would have been more appropriate to signal a unilateral possibility, rather than can, which indicates a definite possibility. While the Dutch, French and German EFL writers also showed a significant overuse of can, this is not caused by choosing this modal instead of the more remote and unilateral could, as with the Spanish and Italian writers. One other similarity between the Spanish and the Italian EFL writers should be noted: these two groups of writers manifest the same type of error in constructions with anticipated it. This results in the construction of sentences with two subjects, like the following example from the Italian corpus:

(7) Television influences people beyond the limit of danger; it can be noticed a relationship between the unjustified death of a student and the previous telecast of a film in which ... (IUW)

Or this one from the Spanish corpus:

(8) It can be appreciated a contrast between the archetypal couple that will finally get married and the secondary characters in which the author ... (SUW)

We assume that this type of error is due to typological differences of Italian and Spanish with English. That is, these constructions in the EFL student texts are probably due to transfer from the native languages, both of which are null-subject, allow for a flexible word order and permit a ‘reflexive passive’ construction. In this construction, a verb carries a reflexive preposed particle to construct an impersonal subject and the topic of the clause is introduced in the direct object position, as in the Spanish translation of example (8):

(9) Se puede apreciar un contraste entre la pareja arquetípica que finalmente se casa y los personajes secundarios en los que el autor ...

The opposite tendency can be observed for could, which was underused by all EFL student writers except for the Italian group (Figure 2).

Figure 2. The use of could by native and non-native writers

The Spanish writers show the lowest frequency for could, a result which may be related to their great overuse of can. On the other hand, the Italian overuse of could may be related to having misjudged this modal (instead of may, or might) as a signal for hypothetical reasoning, as in these examples from the IUW corpus:

(10) Following these considerations for every single woman will be very difficult to obtain the permission to have an artificial insemination but according to me it would be the right thing to do: for instance, when the child will go to school or will establish contacts with other children, he could ask to himself why he has not a family as the others have, with a father and a mother and not only with a mother and this could be a problem for him. A further reason why I am against the artificial insemination for single women is that if some of them have never had sexual intercourse, the decision to gave birth to a son is against God and against human nature. (IUW)

(11) … These people are unfortunately not used to arms and so they make some fatal mistakes by exercising. But, not only may arms be deadly for people who use them, they could also injure a member of the gun owner’s family as, for instance, a curious child who identifies his father’s gun with a toy. (IUW)

In example (10), apart from the null subject of the first clause (also frequent in the Spanish texts), this Italian student writer uses the auxiliary will to indicate future in the temporal clause beginning with ‘for instance,..’, as it would be in Italian, and then constructs the hypothetical clause with could. If this is a hypothetical proposition – which it seems to be since it begins with ‘for instance’ – the best
solution in English would be to use the present tense in the temporal clause and the most hypothetical modal (might) in the main clause. In example (11), the second clause with could is as speculative as the previous clause with may, yet the Italian student writer has used could as if this modal signalled a hypothesis. The usual combination of clauses with a ‘not only/but also’ construction involves two equal verbs, often in the present tense, but when modal auxiliaries are used, the first clause frequently has a verb in the present tense and a modal auxiliary – usually a speculative one -- in the second, as in: ‘Not only is it dangerous personally, but it might also provoke an international crisis.’ As Quirk et al. (1985: 941) note, this construction suggests ‘that the content of the first clause is surprising, and that of the second clause, often reinforced by an adverb such as also or even, is still more surprising.’ However, in example (11), the student writer uses a modal auxiliary in the first clause to indicate a hypothesis, but then regresses to a less speculative modal auxiliary in the second clause (could). This seems to suggest that the Italian writer believes that may and could have the same epistemic value. Mistaken uses of could, such as the ones in examples (10) and (11) on the part of the Italian student writers account for approximately 15 per cent of their 596 tokens of could. As Biber et al. (1999) have noted, may is extremely common in academic prose. Thus, it is not surprising that this modal has a high frequency in the American writers’ texts. In the EFL texts, may was underused by all the EFL groups except the French (Figure 3), who show no significant difference from the AUW, and, are, therefore, not represented on the chart. Here, the Spanish subcorpus exhibits once again the lowest frequency, a trend which is sustained by a similar low occurrence of might (Figure 4). 
As previously noted, may is frequently used to put forward a hypothetical proposition. This is later negated, sometimes in a clause marked by an adversative conjunction, such as yet or but, as in example (12) or the combination of a concessive clause and a negative adjective which indicates the opposite of the expected, as in example (13), both (12) and (13) being taken from the AUW texts:

(12) She compares the lack of coverage of female athletes to the overwhelming responses to the swimsuit issue. Some may argue that the women in the magazines are better to look at than those on the court. But players are not asking for a comparison of looks, but for a sense of respect as a woman. (LOCNESS, AUW)

(13) They want the audience to see this as an uncontrollable problem, but as any college student can tell you, while all these facts may be true, they are not uncontrollable. (LOCNESS, AUW)

Figure 3. The use of may by native and non-native writers

Figure 4. The use of might by native and non-native writers

These rhetorical strategies are infrequently used by the Spanish writers, which would account, at least in part, for the low numbers of both may and might. As for the use of the modal verb might, all the EFL groups, except the SUW, have at least double the frequency shown in the texts of the American writers (Figure 4). Since may is almost five times as frequent in academic texts (Biber et al. 1999) as might, which signals a more remote possibility than may,
the EFL writers, except for the Spanish, may be interpreting might as more mitigating. Thus, its use (instead of may) may constitute a politeness strategy, as it has been found to do in unsigned editorial texts (Neff et al. 2001). Regarding the use of must, only the writing of native speakers of Germanic languages presents statistically significant differences with the reference group (Figure 5) in its underuse of this modal. This may be caused partially by a preference for verbs like the German sollen (should) as a politeness mitigator.

Figure 5. The use of must by native and non-native writers

In a previous analysis by Neff et al. (2000), the construction we can proved to have the highest frequency for two-word clusters within the SUW subcorpus. This result motivated an investigation into the uses of we can and several other clusters consisting of we + modal verb among the DUW, FUW, GUW, IUW and SUW in comparison to AUW. As can be seen in Table 1, which shows the frequencies per ten thousand words, all of the EFL texts had a statistically significant overuse of we can, except for the German writers.

Table 1. Cluster we + modal verb

        we can   we could   we may   we might   we must
AUW     1.5      0.5        0.4      0.1        1.5
DUW     3.9      —          —        —          —
FUW     11.6     2.7        2.4      0.6        2.6
GUW     —        —          —        —          0.6
IUW     7.0      1.3        —        —          —
SUW     17.7     —          —        —          2.8
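The cluster counts reported in Table 1 can be approximated with a simple two-word (bigram) count. The sketch below is only an illustration under stated assumptions: the tokeniser is a basic regular expression, not the one used by WordSmith Tools, and the function names and the sample sentence in the usage note are hypothetical.

```python
import re
from collections import Counter

MODALS = ("can", "could", "may", "might", "must")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def we_modal_clusters(text):
    """Count 'we + modal' two-word clusters and normalise them per 10,000 words."""
    tokens = tokenize(text)
    counts = Counter()
    # Walk over adjacent token pairs and keep those of the form 'we' + modal.
    for first, second in zip(tokens, tokens[1:]):
        if first == "we" and second in MODALS:
            counts[f"we {second}"] += 1
    rates = {cluster: n * 10_000 / len(tokens) for cluster, n in counts.items()}
    return counts, rates
```

Calling `we_modal_clusters("We can see that we must act. We can say so.")` returns the raw counts together with the rates per 10,000 words; on a real corpus the rates, not the raw counts, are what can be compared across subcorpora of different sizes.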

The groups with an L1 Romance language – the Italians, the French, and most of all, the Spanish – had the highest frequencies for this cluster, which most frequently involved a lexical verb denoting mental or verbal processes, such as see, observe, find, and say, as in the following three examples from the Spanish texts:

(14) Regarding love we can say that it is also attainable if you have enough money to afford or if your parents agree to it (SUW)

(15) We can see the double personality of young Marlow, who is not able to maintain his position in love affairs. On the contrary, Hastings can do it very well (SUW)

(16) The problems that we can find are two: on the one hand, the overcrowding of the prisons that makes difficult, to a great extent, the application of these plans of rehabilitation (SUW)

The French texts show an overuse of we + all of the modal verbs: we can (11.6), we could (2.7), we may (2.4), we might (0.6) and we must (2.6), in comparison to the AUW frequencies. Particularly notable was their use of the clusters involving say: we can say, we could say, and we may say. Both the Spanish and the Italian texts showed the highest frequencies for we can see, we can find, and we can say, the SUW being the group that most overused these clusters. Finally, the overuse of we must deserves further comment as it is twice as frequent in the SUW texts as in those of the AUW. A full 53% of the tokens of we must are used with reporting verbs such as point out, add, appreciate, say, underline, and state, as shown in the following examples:

(17) To finish, we must indicate that in 19th century people had not liberty to choose between to be Christian, atheistic or something like that;… (SUW)

(18) But the world of women is still a dropout in the political world. At last, we must state that it is due to the feminist struggle that the sexuality of women is put on the same level as the masculine one, and it is not so hidden. (SUW)

Some of the other uses of we must by the Spanish writers, as in example (19), appear to be quite similar to the way in which the American writers use this expression, as in (20):

(19) In my opinion we have to forget the feminism of the seventies because this term identifies us with inequality, oppression and victimization. And instead we must focus our energies on the realization of our personal and political power. However, we must take into account that we have some qualities counting against us … (SUW)

(20) Therefore, we must take responsibility for the water... (AUW)

However, the profusion of their use in the SUW texts, one coming right after another as in example (19), tends to maintain the emphasis on the speaker’s necessity and not on a more abstract necessity. Even if the two uses of must in (19) are quite different rhetorically, when must is used so frequently with we, it may appear to border on face-threatening acts which make strong impositions on the reader for common action, a tendency which Sancho (2001) found in her analysis of deontic modals in aeronautical research articles written in English by Spanish engineers. For the most part, these we clusters are used by writers with L1 Romance languages to present new topics, as in examples (14) through (16) above. However, these clusters also have the pragmatic function of including the reader in the writer’s discourse community and assuming that the information presented is common knowledge, instead of constructing a more impersonal reader-in-the-text stance, such as it might be argued (Neff et al. 2001), which does not oblige the reader to take on board the proposition.

6. Findings for the reporting verbs

Table 2 displays the frequencies of reporting verbs per ten thousand words. The Dutch writers’ behaviour is the closest to that of the reference group since they do not show statistically significant differences for any of the reporting verbs except for argue, which is underused by all the EFL writers. This reporting verb constitutes an important rhetorical device, since it allows the writer to put forward another author’s argument without presupposing its acceptance, either by the writer or by the reader. It does imply, though, that the author has produced some sort of evidence in favour of his/her proposition.

Table 2. Reporting verb use

       suggest  wonder  argue  explain  express  recognize  say   show  state
AUW    1.2      1.2     5.3    1.4      1.5      1.6        13.2  7.4   8.6
DUW    -        -       1.4    -        -        -          -     -     -
FUW    -        -       1.1    3        2.9      0.3        19.2  9.5   0.8
GUW    0.5      2.7     1.3    -        -        0.4        -     4.8   1
IUW    2.6      -       1.8    2.6      2.6      -          15.9  -     1.4
SUW    -        -       0.9    -        -        0.6        21.4  -     0.6
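The normalisation behind Table 2 can be illustrated with a short sketch. It assumes that each cell is simply the raw count divided by the corpus size and scaled to ten thousand words; the corpus sizes come from note 5 and the raw counts for argue from the appendix, while the function and variable names are illustrative only.

```python
# Corpus sizes in words, as given in note 5 of this chapter.
CORPUS_SIZES = {
    "AUW": 149_790, "DUW": 237_631, "FUW": 287_683,
    "GUW": 203_647, "IUW": 226_988, "SUW": 194_845,
}

# Raw frequencies of "argue" per corpus, as listed in the appendix.
ARGUE_COUNTS = {"AUW": 80, "DUW": 33, "FUW": 33, "GUW": 27, "IUW": 40, "SUW": 17}


def per_ten_thousand(raw_count: int, corpus_size: int) -> float:
    """Normalise a raw count to occurrences per 10,000 words (one decimal)."""
    return round(raw_count / corpus_size * 10_000, 1)


# Normalised rate of "argue" for every corpus.
argue_rates = {
    corpus: per_ten_thousand(count, CORPUS_SIZES[corpus])
    for corpus, count in ARGUE_COUNTS.items()
}
print(argue_rates)
```

Under this assumption the computed rates agree with the argue column of Table 2, which suggests that the table was derived from the raw appendix counts in this way.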

Another verb predominant in the native texts but infrequent in most non-native ones is state, which serves approximately the same rhetorical function as that of argue, except that it more neutrally signals affirm or express belief. The reporting verb par excellence in all sub-corpora is say, which, indicating utter or speak, carries less pragmatic import than either argue or state. The speakers of Romance languages use this verb much more often than the Dutch or German speakers, who show no significant difference in comparison to the AUW. Of the three L1 Romance language groups, the Spanish writers use say most often. The Italian and the French groups, on the other hand, show a preference for two verbs, explain and express, which are rare both in the American texts and in the rest of the EFL texts.


JoAnne Neff et al.

The general underuse of the verbs argue and state is most marked in the SUW corpus. This fact may be attributable to insufficient training or to the excessive formality and/or contextual specificity surrounding some of these verbs in Spanish: argüir (contend), aducir (adduce), argumentar (argue), declarar (state), manifestar (show), sostener (claim), defender (advocate). The main Spanish translation for state in dictionary entries, for example, is declarar, a verb mainly used in administrative and legal settings. Furthermore, the overuse of say by all Romance language groups might bear a relationship to the contextual meanings of some verbs in the native languages with regard to the channel through which the reporting act is realised. For example, the Spanish decir (say) is commonly used to report both oral and written messages, which in English might be rendered differently, as in The man says, The law states. One last case worth mentioning is the exclusive overuse of wonder by the German writers. Most occurrences of this verb are accompanied by a first person singular pronoun, but there are also examples with you and we, all of which denote an informal stance, as can be seen in the following examples from the GUW corpus:

(21) This means that I will have to break this habit. I wonder what the conclusion of my next essay is going to be like (GUW)
(22) You should not wonder if a man says: "The woman has to do the household, she has to cook and to stay at home with the baby and give him/her the love (GUW)
(23) The same is found in much smaller units of life: How often do we wish that the working day should be over? How often do we wonder about the ending of a book, the outcome of a story? (GUW)

In summary, the AUW show a rather balanced use of the reporting verbs say, state, show and argue, whereas in some of the non-native groups there is a heavier reliance on one reporting verb (say) and a much reduced repertoire of other verbs which might allow the EFL writers to report authors’ propositions with greater or lesser grades of certainty or doubt. These examples from the AUW corpus, all of which allow the writers to present counterbalanced argumentation, are not common in the EFL corpora:

(24) Statistics recently state that out of all the "hate crimes committed in the United States, robbery and murder are amongst the highest” with robbery being ... (Locness, AUW)
(25) …The pro-gun activists, however, argue that firearms actually prevent murders, rapes and burglaries (AUW)
(26) Some people argue that professional soldiers would make up a group of mercenaries, way too distant from the rest of the society (AUW)

7. Conclusion

This study confirms previous research (Cook 1978; Hinkel 1995) which has shown that, owing to typological, instructional or sociocultural factors, nonnative speakers encounter difficulties in the use of English modals. A case in point is the overuse of can by Spanish and Italian writers, in part due to transfer from the L1 epistemic meanings of the Spanish modal poder and the Italian modal potere. A large part of the EFL students’ problems must surely be caused by not distinguishing between can as a definite possibility, and could as a remote possibility. On the other hand, they may also have difficulties in discriminating between could as a unilateral possibility, and might as signalling various options and therefore facilitating the construction of a reader-in-the-text. The occurrence of clusters of we + modal verb + verb of mental/verbal process in the texts of writers with an L1 Romance language may also reveal a transfer of positive politeness strategies (i.e. towards affiliation and commitment). The Spanish results, for example, suggest a transfer from Spanish to English of the sociolinguistic norms for formal writing. It remains to be seen how developmental factors (i.e. interlanguage stages) shape the informants’ modal choices in each subcorpus. It will also be necessary to identify the causes of large quantitative differences in the use of certain reporting verbs among EFL writers with L1 languages pertaining to the same family, for example, regarding the overuse of the reporting verbs explain and express in both the FUW and IUW subcorpora, but not in the Spanish subcorpus. However, in other cases, writers with a Romance L1 do behave similarly. For instance, although frequently used by all EFL groups, say is – of all reporting verbs with a low illocutionary force – most overused by the EFL writers with an L1 Romance language. 
This suggests that these non-native students, even at quite advanced stages, have not acquired a broad enough range of reporting verbs from which to select the most appropriate for the context. In this way, the study has implications for the teaching of EFL academic writing. Further contrastive studies should be carried out in the three Romance languages – French, Italian and Spanish – to determine the extent of possible L1 influence on the use of reporting verbs, that is, whether the L1 also uses a narrow range of reporting verbs in argumentative writing. Two further teaching implications can be drawn. Firstly, it would be useful to raise awareness of the inherent difficulties presented by certain modal verbs because of their lack of exact correspondence with those of the native language. For this purpose, modal verbs in native texts should be analysed in context, so that EFL students can better grasp the difference between sets of modals. Secondly, it would be advisable to provide non-natives with a detailed contrastive view of epistemic and pragmatic modal contexts in L1 and L2, placing emphasis on the stylistic and communicative effects produced by different hedges and boosters.


Notes

1 SPICLE is the Spanish team which contributed texts of Spanish university writers to the International Corpus of Learner English (ICLE), Centre for English Corpus Linguistics, Université catholique de Louvain, Belgium.

2 The larger study is titled ‘A Contrastive Analysis of Evidentiality in English and Spanish: A Corpus Study’, BFF2000-0699-C02-01, Universidad Complutense, Madrid.

3 One reporting verb frequently used by the American university writers, claim, was used only seven times by the Spanish EFL writers, and, thus, was not considered in this analysis.

4 Can is only epistemic in interrogative and negative clauses, as in ‘Can it be that ...?’

5 The number of words contained in each EFL corpus was as follows: Dutch, 237,631 words; French, 287,683 words; German, 203,647 words; Italian, 226,988 words and Spanish, 194,845 words. The American university texts from the LOCNESS corpus totalled 149,790 words.

References

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman Grammar of Spoken and Written English. London: Longman.
Cook, W. (1978), ‘Semantic structure of English modals’, TESOL Quarterly, 12: 5-16.
Dendale, P. and L. Tasmowski (2001), ‘Introduction: evidentiality and related notions’, Journal of Pragmatics, 33: 339-348.
Du Bois, J. (2001), ‘Taking a stance: constituting the stance differential in dialogic interaction’, paper, American Anthropological Association, San Francisco (November).
Gresset, S. (2001), ‘Towards a contextual micro-analysis of the non-equivalence of MIGHT and COULD’, I International Congress on Modality, University of Verona.
Halliday, M. A. K. (1985), An Introduction to Functional Grammar. London: Edward Arnold.
Hinkel, E. (1995), ‘The use of modal verbs as a reflection of cultural values’, TESOL Quarterly, 29: 325-341.
Hoey, M. (1994), ‘Signalling in discourse’, in: M. Coulthard (ed.), Advances in Written Text Analysis. London: Routledge. 26-45.
Holmes, J. (1983), ‘Speaking English with the appropriate degree of conviction’, in: C. Brumfit (ed.), Learning and Teaching Languages for Communication: Applied Linguistic Perspectives. London: Center for Information on Language Teaching and Research. 100-121.
Hyland, K. (2000), Disciplinary Discourses: Social Interactions in Academic Writing. London: Longman.
Lewis, M. (1986), The English Verb: An Exploration of Structure and Meaning. London: Language Teaching Publications.
Neff, J., E. Dafouz, M. Díez, H. Herrera, F. Martínez, R. Prieto, J. P. Rica and C. Sancho (2001), ‘Subjective and objective modalization of evidentiality in native and non-native argumentative texts’, I International Congress on Modality, University of Verona.
Neff, J., E. Dafouz, M. Díez, F. Martínez, R. Prieto and J. P. Rica (2000), ‘The construction of writer stance in native and non-native texts’, XVIII Congreso Nacional AESLA, Universidad de Barcelona.
Palmer, F. R. (1986), Mood and Modality. Cambridge: Cambridge University Press.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A Comprehensive Grammar of the English Language. London: Longman.
Sancho, C. (2001), ‘Epistemic tides and personal strongholds: embedded discursive spaces in the research article’, XIII European Symposium on Language for Specific Purposes, University of Vaasa, Finland.
Thompson, G. and Y. Yiyun (1991), ‘Evaluation in the reporting verbs used in academic papers’, Applied Linguistics, 12(1): 365-382.
Winter, E. (1982), Towards a Contextual Grammar of English: The Clause and its Place in the Definition of Sentence. London: Allen & Unwin.

Appendix

Modal verb use per EFL group as compared to the control group, the American university writers (WordSmith Output) [D = Dutch; A = American; S = Spanish; G = German; F = French; I = Italian]

Comparison  Word   Freq. DUW  % DUW  Freq. AUW  % AUW  Keyness  P (Chi square)
D vs. A     can    1.173      0,49   514        0,34   47,6     0,000000
D vs. A     might  189        0,08   48         0,03   33,1     0,000000
D vs. A     may    236        0,1    196        0,13   7,9      0,004881
D vs. A     could  365        0,15   290        0,19   8,5      0,003598
D vs. A     must   148        0,06   163        0,11   24,2     0,000001

Comparison  Word   Freq. SUW  % SUW  Freq. AUW  % AUW  Keyness  P (Chi square)
S vs. A     can    997        0,51   514        0,34   54,7     0,000000
S vs. A     could  281        0,14   290        0,19   12,2     0,00048
S vs. A     might  18         0,01   48         0,03   21,8     0,000003
S vs. A     may    107        0,05   196        0,13   54,7     0,000000


Comparison  Word   Freq. GUW  % GUW  Freq. AUW  % AUW  Keyness  P (Chi square)
G vs. A     might  169        0,08   48         0,03   35,7     0,000000
G vs. A     can    858        0,42   514        0,34   13,4     0,000247
G vs. A     could  294        0,14   290        0,19   12,4     0,000432
G vs. A     may    148        0,07   196        0,13   29,4     0,000000
G vs. A     must   177        0,09   163        0,11   4,1      0,043297

Comparison  Word   Freq. FUW  % FUW  Freq. AUW  % AUW  Keyness  P (Chi square)
F vs. A     can    1.377      0,48   514        0,34   41,7     0,000000
F vs. A     could  458        0,16   290        0,19   6,6      0,010031

Comparison  Word   Freq. IUW  % IUW  Freq. AUW  % AUW  Keyness  P (Chi square)
I vs. A     can    1.215      0,54   514        0,34   72,5     0,000000
I vs. A     could  596        0,26   290        0,19   18       0,000022
I vs. A     might  144        0,06   48         0,03   16,9     0,00004
I vs. A     may    216        0,1    196        0,13   10,2     0,001405
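The keyness values in these tables can be approximated with a short sketch. The chapter does not describe WordSmith's exact procedure, so the code assumes a chi-square test on a 2x2 table (target word against all other words, learner corpus against AUW) with Yates' continuity correction; the function name is illustrative, and the corpus sizes are those given in note 5.

```python
def chi_square_yates(freq_a: int, size_a: int, freq_b: int, size_b: int) -> float:
    """Chi-square for a 2x2 table (word vs. all other words, corpus A vs. corpus B)
    with Yates' continuity correction. This is an assumed formula, not one
    stated in the chapter."""
    observed = [freq_a, freq_b, size_a - freq_a, size_b - freq_b]
    total = size_a + size_b
    word_total = freq_a + freq_b
    other_total = total - word_total
    # Expected counts under the hypothesis that the word is equally
    # frequent (relative to corpus size) in both corpora.
    expected = [
        word_total * size_a / total,
        word_total * size_b / total,
        other_total * size_a / total,
        other_total * size_b / total,
    ]
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


# "can" in the Dutch corpus (1,173 of 237,631 words) against AUW (514 of 149,790).
keyness_can_dutch = chi_square_yates(1173, 237_631, 514, 149_790)
print(round(keyness_can_dutch, 1))
```

For can in the Dutch comparison, this assumed formula yields a value close to the 47,6 listed in the appendix, which makes the Yates-corrected chi-square a plausible reading of the Keyness column.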

Use of we + modal verb per EFL group as compared to the control group, the American university writers (WordSmith Output) [D = Dutch; A = American; F = French; G = German; I = Italian; S = Spanish]

Comparison  Word      Freq. DUW  % DUW  Freq. AUW  % AUW  Keyness  P (Chi square)
D vs. A     we can    92         0,04   23         0,02   16,1     0,00006

Comparison  Word      Freq. FUW  % FUW  Freq. AUW  % AUW  Keyness  P (Chi square)
F vs. A     we can    335        0,12   23         0,02   121,9    0,000000
F vs. A     we could  77         0,03   7          >0,01  23,9     0,000001
F vs. A     we may    69         0,02   6          >0,01  21,8     0,000003
F vs. A     we must   74         0,03   22         0,01   5        0,025697
F vs. A     we might  16         >0,01  1          >0,01  4,9      0,027209

Comparison  Word      Freq. GUW  % GUW  Freq. AUW  % AUW  Keyness  P (Chi square)
G vs. A     we must   12         >0,01  22         0,01   6,1      0,01386


Comparison  Word      Freq. IUW  % IUW  Freq. AUW  % AUW  Keyness  P (Chi square)
I vs. A     we can    159        0,07   23         0,02   54,8     0,000000
I vs. A     we could  29         0,01   7          >0,01  5,4      0,020342

Comparison  Word      Freq. SUW  % SUW  Freq. AUW  % AUW  Keyness  P (Chi square)
S vs. A     we can    344        0,18   23         0,02   205,3    0,000000
S vs. A     we must   54         0,03   22         0,01   5,9      0,014792
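Raw counts of we + modal clusters such as those tabulated above could be obtained along the following lines. This is only a sketch: it matches the pronoun immediately followed by one of five modals and ignores intervening adverbs, the helper name is illustrative, and the sample text is adapted from learner sentences quoted earlier in the chapter.

```python
import re
from collections import Counter

# "we" immediately followed by one of the modals discussed in the chapter.
WE_MODAL = re.compile(r"\bwe\s+(can|could|may|might|must)\b", re.IGNORECASE)


def count_we_clusters(text: str) -> Counter:
    """Count each 'we + modal' cluster in a text, ignoring case."""
    return Counter(f"we {match.group(1).lower()}" for match in WE_MODAL.finditer(text))


sample = (
    "And instead we must focus our energies on the realization of our "
    "personal and political power. However, we must take into account that "
    "we have some qualities counting against us. Therefore, we must take "
    "responsibility for the water."
)
clusters = count_we_clusters(sample)
print(clusters)
```

A fuller treatment would also need to handle negation, inserted adverbs (we can also say) and the following verb of mental or verbal process, none of which this sketch attempts.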

Use of reporting verbs per EFL group as compared to the control group, the American university writers (WordSmith Output) [D = Dutch; A = American; F = French; G = German; I = Italian; S = Spanish]

Comparison  Word       Freq. DUW  % DUW  Freq. AUW  % AUW  Keyness  P (Chi square)
D vs. A     admit      39         0,02   10         >0,01  6,1      0,013229
D vs. A     indicate   15         >0,01  2          >0,01  4,1      0,042514
D vs. A     maintain   4          >0,01  14         >0,01  10       0,001547
D vs. A     argue      33         0,01   80         0,05   47,9     0,000000
D vs. A     state      51         0,02   129        0,09   81,3     0,000000

Comparison  Word        Freq. FUW  % FUW  Freq. AUW  % AUW  Keyness  P (Chi square)
F vs. A     imply*      72         0,03   5          25,1   0,3      0,000001
F vs. A     say         552        0,19   198        0,13   20,2     0,000007
F vs. A     explain     87         0,03   21         0,01   9,9      0,001693
F vs. A     express     83         0,03   22         0,01   7,7      0,005659
F vs. A     show        272        0,09   111        0,07   4,5      0,034371
F vs. A     recognize*  9          >0,01  0,3        0,02   0,0      0,000008
F vs. A     argue       33         0,01   80         0,05   65,5     0,000000
F vs. A     state*      22         >0,01  [?]        0,09   0,0      0,000000

* Row garbled in the source; values are given in the order printed, and [?] marks a missing value.

Comparison  Word       Freq. GUW  % GUW  Freq. AUW  % AUW  Keyness  P (Chi square)
G vs. A     wonder     54         0,03   18         0,01   8,2      0,004163
G vs. A     admit      33         0,02   10         >0,01  5,7      0,017139
G vs. A     suggest    10         >0,01  18         0,01   4,6      0,031205
G vs. A     conclude   7          >0,01  16         0,01   5,9      0,01521
G vs. A     note       3          >0,01  11         >0,01  6,1      0,013516
G vs. A     show       98         0,05   111        0,07   9,4      0,002142
G vs. A     recognize  9          >0,01  24         0,02   11,2     0,000803
G vs. A     argue      27         0,01   80         0,05   44,7     0,000000
G vs. A     state      20         >0,01  129        0,1    117,4    0,000000

Comparison  Word       Freq. IUW  % IUW  Freq. AUW  % AUW  Keyness  P (Chi square)
I vs. A     suggest    60         0,03   18         0,01   8,4      0,003798
I vs. A     explain    60         0,03   21         0,01   5,9      0,015098
I vs. A     express    60         0,03   22         0,01   5,2      0,022655
I vs. A     say        360        0,16   198        0,13   4,1      0,043375
I vs. A     argue      40         0,02   80         0,05   35,2     0,000000
I vs. A     state      31         0,01   129        0,09   109,9    0,000000

Comparison  Word       Freq. SUW  % SUW  Freq. AUW  % AUW  Keyness  P (Chi square)
S vs. A     say        416        0,21   198        0,13   31       0,000000
S vs. A     disagree   4          >0,01  12         >0,01  5,3      0,021868
S vs. A     recognize  11         >0,01  24         0,02   8        0,004711
S vs. A     claim      16         >0,01  32         0,02   9,6      0,001951
S vs. A     argue      17         >0,01  80         0,05   58,5     0,000000
S vs. A     state      11         >0,01  129        0,09   133,1    0,000000

Learning English prepositions in the Chemnitz Internet Grammar

Josef Schmied
English Language and Linguistics, Chemnitz University of Technology

Abstract

This article uses part of the Internet Grammar that was developed at Chemnitz University to illustrate a new way of presenting real language data and language rules in the same ‘grammar’ with an inductive component, called the discovery section, and a deductive component, called the explanations section. I have focused on prepositions here in order to demonstrate how grammar can be presented and to discuss how learners use both inductive and deductive learning strategies in their work to come to terms with these extremely polysemous forms of English. The new computer-based learning environment allows us not only to set up a new form of ‘grammar’ but also, for the first time, to look empirically at how such a learning resource is exploited by individual learners to develop their own internal grammar from an external grammar (an electronic grammar supported with corpus examples).

1. Work in progress: Using the Internet Grammar to observe active language learning

The Internet Grammar on which this article is based¹ has essentially two functions: on the one hand, it is a learning aid which can be used by students to improve their English generally or to find out about specific (contrastive) constructions. On the other hand, the grammar is a research tool, not only for contrastive language analysis, but also for investigating learning strategies. Thus the results of research into recorded learning efforts can be fed back directly into the learning tool, so that in an ideal world, a constant improvement can be achieved through incorporating learners’ feedback. The principal aims of the project are:

· to present a grammar in hypertext structure (making more consistent use of the potential of this format than the London Internet Grammar, for instance, which still reveals its derivation from Greenbaum’s 1996 Oxford English Grammar),
· to monitor learner behaviour through a tracking mechanism registering learners’ movements and input on the website itself, and via oral or email discussion and comments relating to the website, and thus
· to draw conclusions about the ways that learners of English can develop their own ‘personalised’ internal grammar from a corpus-informed electronic grammar.

In this context, two aspects of the Internet Grammar are important:

· It is user-specific; in other words, our intention is to present a grammar that is relevant to the individual needs of the learner. The research presented in this paper is aimed at advanced learners who are already close to proficiency level. Even at this level, however, the category of prepositions is an area in which inconsistency and idiomaticity make intensive autonomous work useful, even essential.
· It is contrastive in so far as it is written primarily for advanced German learners of English. Thus, our search mechanisms and our examples are based on an English-German translation database and the exercises and tasks in both the deductive and inductive sections concentrate on those areas of grammar where, despite considerable overlap, the German learner of English encounters difficulties. Even advanced German learners of English can be deceived by seemingly parallel structures.

2. Analysing English prepositions for the language learner

2.1. Prepositions as a special borderline case between the lexicon and grammar

Prepositions are a notoriously difficult field for foreign language learners in general, partly because detailed information on prepositions is not presented in an appropriate system either in traditional grammar books or in traditional dictionaries. This failing can be explained by the fact that prepositions have unique syntactic as well as semantic specifications. Prepositions are syntactic link words connecting nouns to verbs, to other nouns and, occasionally, to other word classes. The choice of preposition, however, often depends on the meaning of the syntactic element that determines it. Put in terms of dependency theory, prepositions depend partly on the preceding noun, verb, etc., and partly on the following noun. This double dependency of prepositions is not always specified in grammars, even if they distinguish between ‘free’ and ‘bound’ prepositions (as in the Longman Grammar of Written and Spoken English (LGWSE) 1999: 74) or between adverbial and complementation (as in the Comprehensive Grammar of the English Language (CGEL) 1985: 657). Prepositions also overlap, at least in English and German, with other word classes, such as adverbs and particles (cf. LGWSE 1999: 76f, esp. Table 2.5, and below). All this makes it very difficult to use or to categorise prepositions on the basis of their surface value, although exactly this is done in simple corpus linguistic analyses. Thus Capel (1993: 11) provides the following ‘Guide to concordance pages’ for from, for instance:

Main meaning: You use from to indicate who or what is the source or provider of something. eg 10
Other meanings:
Indicating a range, e.g. 1,2.
Mentioning the cause of something, eg 31
Changing from one thing to another, eg 3,19

Although there is a vast amount of literature on prepositions (cf. Wibbelt 1993; Rauh 1991; esp. Huppertz 1991), few studies both agree on the theoretical framework beyond the core meanings and are based on a detailed analysis of corpus data. Since Herskovits (1986), Lindstromberg (1998) and Radden (1989), prepositions have also been described in a cognitive framework, explicitly (Boers 1996) or implicitly (Lindstromberg 1998) using explanations such as:

The basic function of in is to refer to a situation where one object (the ‘trajector’) is contained within another (the ‘landmark’). However, even if we focus only on those uses that are concerned with relations between physical space (as opposed to examples such as in trouble), we find that in is used in a whole range of situations where there is only an approximation to this ideal meaning (Lee 2001: 19).

Although the explanations section of the Internet Grammar also takes a cognitive approach, the discovery section cannot. To carry out a sophisticated, quantitative, semantically-based corpus analysis would require that as yet rare animal, the semantically-tagged corpus, since there are no generally accepted simple tools (and categories) that can do the job satisfactorily. Not to mention the fact that experienced linguistic labour is usually too scarce and too valuable to be assigned to the task of going through millions of words, as would be the case in the translation corpus used for this research, for example.

2.2. English and German prepositions compared

A frequency list of the most common words in the translation corpus (using Wordsmith Tools) shows that prepositions are among the most common words in English and German: in the English list, we find of in 2nd, to in 4th, in in 5th, for in 9th and with in 13th position; in the German list, in is in 4th, zu 6th, von 8th, auf 11th, mit 15th and für 21st position. Prepositions thus occupy similar positions in the frequency hierarchy in the two languages.² Although there are a few differences between the position of equivalents, of and for in English being for instance more common and thus seeming to fulfil more functions than von and für in German, some prepositions, such as in and with, are relatively similar in their status in the hierarchy. It must, however, be emphasised that this first analysis is based solely on surface parallels. A more detailed analysis of the actual usage, which is only possible with a good comparable database like our translation corpus, reveals some interesting divergences, for instance in the behaviour of English and German in and English with and German mit (cf. Schmied 1998).

3. From prototypes to transferred usage to idiomaticity: the example of from

The following discussion starts from the assumption that the most frequent English prepositions have developed from their local meaning, indicated in the following figures as PLACE, which is then further extended into the fourth dimension TIME. These meanings are usually seen as prototypical and constitute the basis of the discussion of the semantics of English prepositions in grammar (e.g. Downing & Locke 1992; Leech & Svartvik 1994 or Quirk et al. 1985). Thus³ we see the local/temporal meanings as prototypical at one end of the spectrum, idiomatic expressions as exceptions at the other. The ‘meaning’ of a preposition in English is often determined by the context, mainly by the preceding verb, noun or adjective or the following noun, so that it has to be learnt as a relatively fixed collocation by foreign language learners. These lexicon-specific meanings will be grouped under the heading LEXICON. In modern dictionaries of the English language, the additional senses are usually listed after examples illustrating the prototypical and figurative meanings.⁴ This study focuses its attention on the learner rather than the linguist and on how grammatical information has to be presented to ensure an efficient learning process. When German learners read the detailed explanations of prepositions in dictionaries (but also in many web-based grammar explanations), they often find them of little help because many of the basic and original meanings are listed and explained using examples that are intuitively understood by German learners. This is not surprising as German and English are closely related and the cognitive systems of prepositions have not diverged greatly in the past 2000 years.⁵ Thus, German learners tend to assume that all their problems with prepositions are based on ‘idiomaticity’ and they think, "you just have to learn it by heart". But, as will be shown below, many such usages can be explained independently of individual cases and so they are not really ‘irregular’ and ‘singular’.
In this, I agree with Lindstromberg (1998: 1): Basically, I do not believe that prepositions are generally used in a quirky, idiomatic fashion. Yes, prepositions are sometimes used idiomatically in expressions that have to be learned one by one. But only a small minority of prepositional uses are thoroughly idiomatic. In a category situated between prototypes and idiomaticity and which we call ‘figurative TRANSFER’, we will put together semantically related cases, which often exhibit interesting metaphorical shifts from the original but are not lexicon-specific. Categories and labels for these multiple cases could be selected from the case grammar categories developed since Fillmore’s (1968) ‘case’ or, alternatively, one can try to harmonise the standard dictionary definitions. We have followed this latter path. The meanings for the English preposition from, which look so similar to the German von, have been taken from the standard dictionaries used in Germany (e.g. the Oxford Advanced Learner’s Dictionary or the Cambridge International Dictionary of English as monolingual references, or Langenscheidt’s Handwörterbuch as a bilingual reference). These dictionaries follow the traditional route of explaining more prototypical, local and temporal usages first, before moving on to more figurative usages. This can be exemplified by data from our translation corpus.


The following diagram of the preposition from (Figure 1) illustrates the extent to which the prepositions have developed further and further away from the basic local meaning. Here a plus sign (+) indicates the occurrence and a minus (-) the non-occurrence of the German equivalent. Of course, these indications of partly overlapping structures are only tentative and their validity will have to be proven by a large quantitative analysis. In a hypertext format, the corpus examples and the explanations can be made visible by clicking on the diagram presented, and so an overview can always be gained by returning to the diagram ‘surface’. When we look at the actual usage of prepositions, it becomes clear that the temporal and local usages constitute, as expected, the vast majority of examples, but prepositions are so common that other cases can already be found in relatively few texts, since, according to Lee 2001: 28, "basic spatial notions can be applied to a wide range of non-spatial situations" (Figure 1). A standard example for the LOCAL meaning of the preposition from indicates SOURCE:

(1) A modified version which relies on the pressure modulation of the gas cells to select the radiation emitted from the centre of the carbon dioxide lines is now incorporated into the operational equipment for current weather satellites.
Eine modifizierte Version, bei der die Selektion der vom Zentrum der Kohlendioxidlinien emittierten Strahlung auf der Modifizierung des Druckes der Gaszellen beruht, wurde in die operationale Ausrüstung der gegenwärtigen Wettersatelliten aufgenommen. (Burroughs)
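The plus/minus marking described above could be approximated automatically on aligned sentence pairs. The following is a minimal sketch, not the procedure used for the Internet Grammar: it assumes that von and its contraction vom count as the German equivalent of from, the helper name is hypothetical, and the two pairs are shortened from corpus examples in this section.

```python
import re

# Assumption: these surface forms count as the German equivalent of "from".
# "vom" is the contraction of "von dem".
VON_FORMS = {"von", "vom"}


def mark_equivalent(german_sentence: str) -> str:
    """Return '+' if a form of 'von' occurs in the German translation, else '-'."""
    tokens = re.findall(r"[a-zäöüß]+", german_sentence.lower())
    return "+" if any(token in VON_FORMS for token in tokens) else "-"


# English-German pairs shortened from the corpus examples in this section.
PAIRS = [
    ("the radiation emitted from the centre of the carbon dioxide lines",
     "der vom Zentrum der Kohlendioxidlinien emittierten Strahlung"),
    ("the change from winter to summer",
     "der Wechsel zwischen Winter und Sommer"),
]

marks = [mark_equivalent(german) for _, german in PAIRS]
print(marks)
```

A surface check of this kind cannot tell whether the German form actually translates from rather than some other element of the sentence, which is one reason why the indications in the diagram are described as tentative.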

A further prototypical meaning of the local meaning is DISTANCE, where from introduces the first part of a pair and the second follows with to. This pair can be found not only in local but also in temporal and figurative meanings, a typical glide from distance towards spectrum or change. The German pair is usually von – bis; (2) and (3) are exceptions because different prepositions are chosen (in and zwischen - und depending on the head noun Wechsel with a slight change in meaning):

(2) The first satellite pictures amply confirmed how much detail about the weather could be obtained, from the local to the global level.
Die ersten Satellitenaufnahmen bestätigten nachdrücklich, wie viele Details des Wettergeschehens sich sowohl im lokalen wie im globalen Maßstab beobachten lassen - ... (Burroughs)

(3) Another feature that stands out clearly on the polar projections is the change from winter to summer in the northern Hemisphere.
Ein anderes Detail, das auf den Polarprojektionen klar zu erkennen ist, ist der Wechsel zwischen Winter und Sommer auf der Nordhalbkugel. (Burroughs)


[Figure 1: diagram of the usages of from, arranged from PLACE and TIME through figurative TRANSFER to LEXICON and GRAMMAR, with the node labels source, begin, distance, origin, material, spectrum, level, cause, deduce, protection, prevention, difference and change, each marked with its German equivalent (von, aus, nach, vor, als).]

Figure 1. Prototypical, figurative and idiomatic usages of from with their German translation equivalents

The expansion of from into the TEMPORAL meaning is clear in:

(4) From that day on people have been dying in Hatti land.
Seit diesem Tage sterben die Menschen in Hatti-Land, ... (A.W. Crosby)

The typical figurative expansion of from can be seen in (5):

(5) This automatic system, which uses primary satellite data drawn from a computer-compatible archive to write directly on to film, achieves much higher relative and absolute adjustments of geometry and colour density.
Mit diesem automatischen System, das die aus einem computerkompatiblen Archiv abgerufenen primären Satellitendaten zur direkten Belichtung eines Filmes benutzt, läßt sich eine viel bessere relative und absolute Kontrolle der Geometrie und Farbdichte erzielen. (Burroughs)

From ORIGIN and MATERIAL there is only a minor step to other transferred cases such as CAUSE and DEDUCE, even in a mental sense ((6) and (7), both with indirect translations):

(6) The west to east scan results from the spinning of the satellite.
Dabei wird die Abtastung in Ost-West Richtung durch die Rotation des Satelliten ermöglicht. (Burroughs)

(7) The measurements of each element in this scanning process are then transmitted to a ground receiver and the image recreated from these data.
Die bei diesen Abtastprozessen erhaltenen Daten werden zu einem Empfänger am Boden übertragen und dort zu einem Gesamtbild zusammengesetzt. (Burroughs)

A more stative meaning that is related to the local/temporal meaning of distance is DIFFERENCE, as in (8), which also shows the particle away, typically added for emphasis:

(8) ... towards the Sun things obviously look very different from directions away from the Sun.
Blickt man zur Sonne hin, sehen die Dinge offensichtlich wesentlich anders aus, als wenn man in die der Sonne abgewandte Richtung schaut. (Davies/Brown)

The following cases are examples of idiomatic uses, all with specific features: (9) and (10) show that there are still polysemantic nuances and a certain relationship to CAUSE, (11) shows a sequence of particle and preposition (a phrasal prepositional verb) and (12) introduces a clause (cf. below) and could be replaced by an infinitive:

(9) The season was excessively wet and cold, and both invaders and defenders suffered from hunger, because the hostilities had prevented sowing and therefore harvesting. Dieser Winter war außergewöhnlich feucht und kalt, und Invasoren wie Verteidiger litten unter einer Hungersnot, weil die Feindseligkeiten die Aussaat und damit auch die Ernte des Getreides verhindert hatten. (A.W. Crosby)

(10) The trade between Norway and Iceland languished as well, and fifteenth-century Iceland suffered severely from neglect, almost to the point of demise. Auch zwischen Norwegen und Island lag der Handelsaustausch darnieder, und im Laufe des 15. Jahrhunderts geriet die Insel fast vollkommen in Vergessenheit. (A.W. Crosby)


Josef Schmied

(11) The Guanches died off from a multitude of causes. Das Aussterben der Guanchen hatte vielfältige Gründe. (A.W. Crosby)

(12) Such hypothetical particles have gained the name tachyons; these are forbidden by the theory of relativity from crossing the light barrier the other way, i.e. they can never travel slower than light. Man hat ihnen den Namen »Tachyonen« gegeben. Die Relativitätstheorie verbietet diesen Teilchen das Überschreiten der Lichtgeschwindigkeit in entgegengesetzter Richtung, das heißt sie können sich niemals langsamer als Licht bewegen. (Davies/Brown)

4. Expansion of the concept to other local prepositions: at, by, in and with

Having established the principles of our approach, we can now expand it to include other prepositions. Because of the limitations of space, we will illustrate the clines in the form of diagrams and leave aside the examples from our corpus. Again, in cognitive linguistics the complexity has already been noted (Lee 2001: 19):

The preposition at provides a particularly clear example of the flexibility and abstraction involved in the coding of spatial relationships. Herskovits (1986: 128-40) argues that the function of at is to locate two entities at precisely the same point in space and construe them as geometric points. This provides an elegant account of various characteristics of the use of at, but it clearly involves a considerable degree of abstraction and idealisation.

But this still does not explain corpus occurrences like at the initiative of the United Nations, at regular intervals, at high speed, at wavelengths around 5 mm, etc. The major difference between at, from and by on the one hand and in and with on the other is that in and with are often "misused" for grammatical purposes, i.e. to introduce non-finite clauses (as in the two cases in (13)) or to form an adverbial component, as in (14), where the phrases could be replaced by the corresponding adverbs quantitatively and additionally.

(13) These can provide important information about the motion and stability of the atmosphere which is of value in measuring wind and in forecasting the development of severe thunderstorms. Auf diese Weise lassen sich wichtige Informationen über Bewegung und Stabilität der Atmosphäre gewinnen, was besonders für die Messung von Winden und Vorhersagen über die Entwicklung heftiger Gewitter von Wert ist. (Burroughs)

(14) The animals provided the humans with a diet not likely to produce a Chaucerian friar’s jowls, but one nourishing and sufficient in quantity. What the mother and daughters needed in addition, they could obtain by bartering the extra food and wool. Die Tiere lieferten den Menschen ausreichend nahrhafte Kost, und was die Mutter und ihre Töchter darüber hinaus benötigten, beschafften sie sich im Austausch gegen ihre überschüssigen Nahrungsmittel und Mollys Wolle. (A.W. Crosby)

Figures 2-5 in Appendix 1 also indicate how far transfer of the German equivalents (von, in, mit, an and bei, respectively) is possible. They emphasise a grey area located between the original meanings to be found in grammars and the idiomatic lexicon-related meanings to be found in dictionaries, and this grey area is of particular interest for two reasons:

· These meanings occur frequently enough to justify separate dictionary entries, and
· they can be deduced from the original meanings, as indicated by the arrows in the diagrams above.

Such meanings constitute a problem area, however, since the transfer of CAUSE, for instance, occurs not only from the local meaning of from but also from that of with (as in with rage/anger). To make matters worse for German learners of English, their mother tongue is more consistent here, using aus in both cases. But this simply shows that the two languages carve out different areas from an underlying continuum.

5. Student reactions

Although the Internet Grammar is not yet complete and has not yet been released on-line for general usage, preliminary test runs and intensive discussions with students have resulted in some general and some preposition-specific observations:

1. Given free choice, students clearly prefer the explanations section to the discovery section, since this is the approach they are used to from school and they think that grammar is the sum of rules that has to be applied to language instead of an abstraction of regular patterns that has to be distilled out of language.

2. The fact that the learner’s internalised grammar overlaps only partially with the external descriptive grammar and that the translation corpus can be used to test this overlap is a real revelation to students: constructing their own internal grammar leads them from the theoretical "know that" to the practical "know how to". But the practical construction of explorative search queries requires intensive training. Thus finding examples to counter the dictionary rule that different collocates with from may be feasible, but drawing conclusions from the frequency differences of beneath in various text types is not easy. It is simply not the case that "[b]y studying the language in concordance form, learners at all levels can discover the central and typical patterns of English" (Capel 1993: 4). The following concordance sample from a random search of from not translated as von is intended to illustrate that patterns are not in fact all that obvious, if we want to verbalise what can be visualised as a simple arrow:

From the earliest days meteorologists have been developing equipment to address practical problems.
Pythagoras’ theorem can be deduced from the axioms, but the proof involves a fairly complicated chain of reasoning.
but the ashes and rocks from this eruption were old volcanic material from the crater region
but he did not flinch from its implications.
they discovered from the end of the century on, that an unlimited readiness to assimilate was not enough.
My approach can be summed up in two statements, from which a distinctive methodology flows.
this heat and the shock from the core collapse also heated the material which enveloped the core.
the next step is to take intermarket relationships into consideration to see if the individual conclusions make sense from an intermarket perspective.
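A search such as "from not translated as von" can be pictured as a filter over aligned sentence pairs. The following Python sketch is only an illustration under stated assumptions: it is not the query mechanism of the Internet Grammar, the function name `concordance` and the list `PAIRS` are hypothetical, and the last sentence pair is invented (the other three are taken from examples (4), (6) and (11) above).

```python
import re

# Hypothetical aligned (English, German) sentence pairs. The first three are
# taken from examples (4), (6) and (11); the last pair is invented.
PAIRS = [
    ("From that day on people have been dying in Hatti land.",
     "Seit diesem Tage sterben die Menschen in Hatti-Land."),
    ("The west to east scan results from the spinning of the satellite.",
     "Dabei wird die Abtastung in Ost-West Richtung durch die Rotation "
     "des Satelliten ermöglicht."),
    ("The Guanches died off from a multitude of causes.",
     "Das Aussterben der Guanchen hatte vielfältige Gründe."),
    ("The image was received from the satellite.",
     "Das Bild wurde von dem Satelliten empfangen."),
]


def concordance(pairs, source_word, excluded_target, width=30):
    """Return keyword-in-context lines for source_word, keeping only pairs
    whose German sentence does not contain excluded_target as a whole word."""
    src = re.compile(r"\b%s\b" % re.escape(source_word), re.IGNORECASE)
    tgt = re.compile(r"\b%s\b" % re.escape(excluded_target), re.IGNORECASE)
    lines = []
    for english, german in pairs:
        if tgt.search(german):
            continue  # the sentence contains the excluded equivalent
        for match in src.finditer(english):
            left = english[max(0, match.start() - width):match.start()]
            right = english[match.end():match.end() + width]
            lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


for line in concordance(PAIRS, "from", "von"):
    print(line)
```

The filter works at sentence level only: it cannot tell whether a von in the German sentence actually renders the from in question, and fused forms such as vom are not caught. This is one reason why, as noted above, constructing useful explorative queries requires training.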

3. The hypertext structure indicated above, in which the meanings of prepositions are laid out as a continuum and concrete examples are always available for illustration, helps students to see the relationship between fuzzy rules and a wide spectrum of examples and counter-examples.

4. Although students would prefer a 1:1 relationship between lexical units, i.e. one English preposition = one German preposition, like in = in, or at least between sememes (i.e. METHOD = METHOD, like METHOD durch = METHOD by), they are nonetheless able to develop a feeling for patterns and they find them more practical (i.e. easier to apply) than simply learning dependent prepositions always together with their lexical context or head. On a lower learner level, prepositional synonymy may be used, such as from = away, of, off, off from, on, out to and with (cf. Lindstromberg 1998: 300-6); on a higher level, semantic cases in the widest sense may be appropriate – our Internet Grammar may be able to help us find out.

6. Conclusion

In this article, I have described a system that will enable a learner of English (such as a user of our Internet Grammar) to recognise the major areas of difference, and particularly the possibilities of figurative meaning in English prepositions, a system going beyond the usual list of ‘idiomatic expressions’. Although other scholars (e.g. Radden 1989) have pointed out the ‘expansion of local prepositions’, the graphic representation with underlying examples is a unique feature of our Internet Grammar. Further work will have to be undertaken to corroborate the proposed system in two ways: first, a quantitative analysis will be undertaken to determine where the borderline between regular transfer and lexeme-specific idiomatic expressions lies, and secondly, users of the Grammar will be involved in investigating to what extent this approach is more useful than previous treatments. We intend to do this in the future both by inviting our users to give us their subjective views which we can then analyse and also by means of objective tests of their learning results in this area. Generally, the use of a translation corpus as a tool in language analysis and language learning has proved useful, and we intend to continue along this path until we can assemble a thorough qualitative and quantitative treatment of English prepositions in comparison.

Notes

1. The project has been financed by the German Research Association (DFG) since 1998 as part of the New Media research group at Chemnitz University of Technology. It also serves as a basis for other e-learning projects. I wish to thank my collaborators Christoph Haase, Angela Hahn, Naomi Hallan, Diana Hudson Ettle and Sabine Reich for many interesting and thought-provoking discussions. An introduction to the project is given in Schmied 1999, technical details in Gorlow et al. 2001.

2. Minor differences are caused by the morphological endings of German articles, so that for instance die, der, das, den are separate entries in the German list whereas the is only one entry in the English list.

3. The difference between core and prototypical meanings (e.g. Bennett 1975 and Hawkins 1985) is not relevant for our pedagogical purposes, so that our students do not have to know the principles of cognitive (originally "space"!) grammar and can follow their intuitions about more central and more peripheral meanings and compare them to frequency differences in corpora.

4. If prepositional phrases are idiomatic in the sense that the composite meaning cannot be deduced easily, as in be with s.o. 1. stay in s.o.’s house 2. understand, they are clearly marked as IDIOM or even entered in separate lexical entries in modern learner dictionaries.

5. Recently (e.g. in O’Dowd 1998: 144) the historical development has been seen as a process of grammaticalisation, "a series of unidirectional paths of change for certain lexical elements from relatively unconstrained expressions to increasingly constrained morphosyntactic functions, or from more concrete to progressively abstract meanings".


Translation corpus texts used

Burroughs, W.J. Watching the World’s Weather. CUP, 1991 / Die Weltwettermaschine. Birkhäuser, 1993.
Crosby, A.W. Ecological Imperialism: The Biological Expansion of Europe, 900-1900. CUP, 1986 / Die Früchte des weißen Mannes: Ökologischer Imperialismus 900-1900. Campus, 1991.
Davies, P.C.W. & J. Brown. Superstrings. A Theory of Everything? CUP, 1988 / Superstrings. Eine Allumfassende Theorie der Natur in der Diskussion. DTV, 1989.

References

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), The Longman Grammar of Spoken and Written English. London: Longman (referred to as LGSWE).
Bennett, D. C. (1975), Spatial and Temporal Uses of English Prepositions: An Essay in Stratificational Semantics. London: Longman.
Boers, F. (1996), Spatial Prepositions and Metaphor: A Cognitive Semantic Journey along the UP-DOWN and the FRONT-BACK Dimensions. Tübingen: Narr.
Cambridge International Dictionary of English (1995). Cambridge: Cambridge University Press.
Capel, A. (1993), Collins Cobuild Concordance Samplers 1: Prepositions. London: Harper Collins Publishers.
Dirven, R. (1989), ‘Space prepositions’, in: R. Dirven (ed.), 519-550.
Dirven, R. (ed.) (1989), A User’s Grammar of English: Word, Sentence, Text, Interaction. Frankfurt am Main: Lang.
Downing, A. and P. Locke (1992), A University Course in English Grammar. New York: Prentice Hall.
Fillmore, C.J. (1968), ‘The case for case’, in: E. Bach and R.T. Harms (eds), Universals in Linguistic Theory. New York: Holt, Rinehart & Winston. 1-88.
Greenbaum, S. (1996), Oxford English Grammar. Oxford: Oxford University Press.
Gorlow, E., Ch. Haase, N. Hallan, D. Hudson-Ettle and J. Schmied (2001), Internet Grammar. Technical Documentation. http://www.tu-chemnitz.de/phil/InternetGrammar/manual.html.
Hawkins, B. W. (1985), The Semantics of English Spatial Prepositions. PhD dissertation, UCSD. Trier: L.A.U.T.
Herskovits, A. (1986), Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English. Cambridge: Cambridge University Press.
Huppertz, A. (1991), ‘Bibliography on prepositions’, in: G. Rauh (ed.), Approaches to Prepositions. Tübingen: Gunter Narr. 9-28.
Langenscheidts Handwörterbuch Englisch (1991). London: Longman.
Lee, D. (2001), Cognitive Linguistics. An Introduction. Oxford: Oxford University Press.
Leech, G. and J. Svartvik (1994), A Communicative Grammar of English. London: Longman.
Lindstromberg, S. (1998), English Prepositions Explained. Amsterdam: John Benjamins.
Lindkvist, K. G. (1950), Studies on the Local Sense of the Prepositions IN, AT, ON and TO in Modern English. Lund Series of English, 22. Lund & Copenhagen: Munksgaard.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A Comprehensive Grammar of the English Language. London: Longman (quoted as CGEL).
O’Dowd, E. (1998), Prepositions and Particles in English: A Discourse-Functional Account. New York: Oxford University Press.
Oxford Advanced Learners’ Dictionary (1995). Oxford: Oxford University Press.
Radden, G. (1989), ‘Figurative use of prepositions’, in: R. Dirven (ed.), 551-576.
Rauh, G. (ed.) (1991), Approaches to Prepositions. Tübingen: Gunter Narr.
Schmied, J. (1998), ‘To choose or not to choose the translation equivalent. The case of English with and German mit’, in: R. Schulze (ed.), Making Meaningful Choices. Tübingen: Narr (Language in Performance). 207-222.
Schmied, J. (1999), ‘Applying Contrastive Corpora in modern contrastive grammars: The Chemnitz Internet Grammar of English’, in: H. Hasselgård and S. Oksefjell (eds), Out of Corpora. Studies in Honour of Stig Johansson. Amsterdam: Rodopi. 21-30.
Wibbelt, C. Z. (ed.) (1993), The Semantics of Prepositions. Berlin: De Gruyter.
Wood, F. T. (1967), English Prepositional Idioms. London: Macmillan.


Appendix 1

[Diagram for in; layout not recoverable from the extracted text. Surviving labels: PLACE, TIME, SPACE, figurative TRANSFER, LEXICON, GRAMMAR (ADVERB, CLAUSAL); meanings: period, condition, direction, participation, material, knowledge, possession, reason, fashion, measure, perspective, government; German equivalents shown: in, bis, aus, an.]

Figure 2. Prototypical, figurative and idiomatic usages of in with their German translation equivalents

[Diagram for with; layout not recoverable from the extracted text. Surviving labels: PLACE, TIME, figurative TRANSFER, LEXICON, GRAMMAR (AND, ADVERB, CLAUSAL: circumstance, situational); meanings: company, method, cause, instrument, opposition, separation, comparison, duration, be with you 1+2; German equivalents shown: mit, bei, durch, wegen, gegen, von, wie.]

Figure 3. Prototypical, figurative and idiomatic usages of with with their German translation equivalents

[Diagram for at; layout not recoverable from the extracted text. Surviving labels: PLACE, TIME, LEXICON, + SUPERLATIVE; meanings: location, point, cause, amount, condition, judgement, direction; German equivalents shown: an, wegen, mit, zu, in.]

Figure 4. Prototypical, figurative and idiomatic usages of at with their German translation equivalents

[Diagram for by; layout not recoverable from the extracted text. Surviving labels: PLACE, TIME, figurative TRANSFER, LEXICON, GRAMMAR (AGENS); meanings: location, period, method, measure, limit; German equivalents shown: bei, bis, durch, mal, von.]

Figure 5. Prototypical, figurative and idiomatic usages of by with their German translation equivalents

Integrating networked learner oral corpora into foreign language instruction

Pascual Pérez-Paredes

Universidad de Murcia, Spain

Abstract

As Computer Assisted Language Learning (CALL) spreads, educational institutions and students are becoming more familiar with the use of computers in foreign language learning. The capabilities and functionalities of CALL environments are progressively expanding and it is thus not uncommon to find universities or secondary schools in Europe, Asia and the States where different learning formats combine to provide teachers and students with rich and varied learning experiences. This article explores ways of incorporating oral learner corpora into mainstream CALL environments and presents a framework for data gathering that integrates state-of-the-art technology and network multipoint approaches, the underlying principle being that Computer Based (CBT) and Instructor Led Training (ILT) can both play a decisive role in boosting the use of customized oral learner corpora in language teaching institutions. Alongside, the article contains a thorough description of tools and procedures and a discussion of potential implications for foreign language learning and teaching.

The reality that we perceive is that technology can offer new kinds of interactions and activities for both learners and teachers; new instructional settings can be created; new opportunities for learning can be designed. (Spector and Davidsen 2000: 243)

1. Introduction

In computer programming, integration is the process of combining software or hardware components or both into an overall system; in everyday language it is the act or process of making whole or entire. The present article aims to introduce a framework for corpus data gathering and implementation that integrates state-of-the-art technology and network multipoint approaches into a learning environment.1 At the same time, it will describe the tools and procedures involved in a specific application of networked learner oral corpora with second and third year advanced students of English at the University of Murcia (UMU). With the spread of increasingly multi-faceted Computer Assisted Language Learning (CALL) environments, educational institutions and students are becoming more familiar with the use of computers in foreign language learning. There are many examples of universities or secondary schools in Europe, Asia and the States where interesting and enriching technology programmes have been implemented.2


Simultaneously, computer learner corpora (CLC) have, not surprisingly, become an important resource for both linguists and teachers. There are a number of reasons for this. First, there is the conviction that learner corpora can provide language teaching professionals and language researchers with a better insight into the language actually used by foreign language students. Second, there is the spread of classroom applications of language awareness processes. Evidence from Applied Linguistics supports the notion that explicit attention to form can facilitate second language learning (DeKeyser 1998; Norris & Ortega 2000). From cognitive perspectives on language learning, it has been stressed that noticing, in other words noting, observing or paying special attention to particular language items, is generally a prerequisite for learning (Schmidt 1990, 1993; Robinson 1996; Skehan 2001), in which case learner language corpora would appear to be an ideal resource. Foreign Language teaching (FLT) professionals and linguists are very much interested in using corpora, as witnessed by the success of symposia such as ICAME and TALC (Teaching and Language Corpora). However, exclusively oral corpora of students’ foreign language are scarce. Written language is, no doubt, still dominant in the field. Very recently, Basturkmen (2001) has claimed that ELT has focused its attention on describing and teaching the written language and talks about spoken language as being neglected. Halliday (1990) has expressed the same view and regrets a situation which affects linguistic analysis in general. This article sets out to explore ways of incorporating oral learner corpora into mainstream CALL environments within a technology-enhanced e-learning approach, which has been the basis for our research and our primary source of learner feedback. 
This environment is characterised by (1) the presence of the teacher in the computer facility; (2) the fact that the sessions are, to different degrees, live, face-to-face and instructor-led; and (3) the use of materials which in some way have been previously delivered to students, as they are familiar with the type of tasks underlying the corpus. Typically, this environment is asynchronous. A technology-delivered e-learning approach was not considered, as the learner audience and the instructor actually met to share the learning sessions.

2. Underlying motivations behind the compilation of Learner Oral Corpora (LOC)

Using digitised learner corpora for teaching and learning purposes is a very recent idea. The corpus ‘revolution’ has come about so forcefully that the learning / teaching community has not had enough time to fully appreciate and assimilate the potential contributions of language corpora. Below is a list of six key motivating factors involved in the decision to compile and use a corpus of spoken learner English at the University of Murcia. The first two are common to both written and spoken corpora:

(1) A corpus can contribute to a better understanding of students’ use of the foreign language (Granger and Tribble 1998).

(2) It can offer teachers classroom data that are not frequently analysed, fundamentally because classroom dynamics make it extremely difficult for teachers to monitor every learner’s performance. Usually, teachers tend to concentrate on fragmented chunks of discourse and, more often than not, students are too aware of this monitoring task, which in different ways can make learners shy away from otherwise natural use of the FL.3 Such a corpus can give teachers the chance to examine student performance in detail, both as individuals and as a group.

In addition, LOC might prove useful in at least the following ways:

(3) Learners’ oral performance can be diagnosed and measured based on both qualitative and quantitative information. This way, teachers have more opportunity to reflect on their students’ output and both on-the-spot and continuous assessment can be enhanced.

(4) Group assessment is favoured. Teachers can establish accurate performance comparisons between two individuals, between two groups of students or even between learner and native speaker corpora. This is extremely useful when it comes to going deeper into students’ oral output, a territory often neglected by educators and learners themselves, perhaps in the belief that the very nature of oral discourse is unapprehendable and elusive.

(5) Teachers can build monitor learner corpora that can contribute to changes and adjustments in their methodology, particularly those aspects more directly connected with developing students’ oral skills.

(6) LOC can be used to promote students’ language awareness of both segmental and suprasegmental aspects of students’ FL production.

These six factors might potentially determine different compilation, annotation (if applicable) and access criteria. In this sense, Granger (1998: 8) lists some of the features relevant to learner corpus building, distinguishing between language and learner variables.
In the first group we find medium, genre, topic, technicality and task setting; in the second we find age, sex, L1, region, other foreign languages studied or spoken, L2 level, learning context and practical experience. These features are to be carefully considered and weighed up when designing learner corpora. It seems that compilers have two options here. The first is to opt for heterogeneous samples of learner language and account for them on different levels, mainly those of representation, tagging and codification (Llisterri 1999: 54); the second is to strive for homogeneous collections of texts in terms of the two broad categories presented above. The ultimate choice depends on the research aims of the corpus builder and individual corpora should be compiled to serve particular research purposes (cf. Sánchez et al. 1995). This research adheres to the second approach, as functional parameters are deemed to be absolutely necessary when the design stages of a learner oral corpus are first sketched out. The proposal here is to link learner oral corpora to their expected audience, functionality, access technology and network delivery system. In the main, there are three major types of learning formats that can be adapted to learner oral corpus implementation and which could be of interest to teachers and researchers alike: asynchronous self-study, synchronous instructor-led events and, third, group work. These formats can combine with the six benefits of LOC outlined above to offer a wide range of potential activities for the FL classroom.

3. Gathering the LOC data

Compiling oral corpora is time-consuming and presents researchers with important functional problems. Following Biber’s 1993 benchmark work, some of the sampling decisions will necessarily affect the type and number of texts included, the selection of text samples from within texts and the length of these samples. However, in the case of LOC, text representativeness is arguably not the primary source of concern as all but the most advanced learners typically have limited assets in terms of range of text types they are likely to produce. If one can make this assumption, data collection is greatly simplified and can be restricted to major communicative functions or, to use BNC terminology, could be fully FL context-governed. In most cases, (excluding Specific Purposes contexts), teachers and researchers would be satisfied with a LOC consisting of one single domain, general communication purposes, with different communicative situations determining further subclassifications.4 When gathering LOC various options are available. The Louvain International Database of Spoken English Interlanguage (LINDSEI) project has opted for a traditional approach to the issue. The data is orthographically transcribed. This approach I call traditional, as audio and transcripts remain separate parts. In most work, the recorded files have a marginal use, mainly as a source to transcribers. An example of this is Jones’ (1997) spoken German corpus. Berglund (1999) has worked on the exploitation of the oral component of the BNC but still the written medium is used. At the Ohio University digital recordings of conversations have been transferred onto a Linux workstation to create the VIC corpus.5 The recorded conversations are being transcribed into written English text, stored in ASCII files appropriate for use in later stages of the phonetic transcription process. 
There have been different efforts to link audio and text, this way integrating the corpus into, to some extent, one single entity. Cauldwell (1992) at the University of Birmingham used hyperlinks on transcriptions to play CD audio. CGI scripting has been used by the Language Technical Group at Edinburgh University.6 But there is no doubt that aligning is still the main linking strategy. Approaches differ, though. Sound Writer (UC Santa Barbara) aligns conversational transcriptions with the corresponding digitised sounds and so do Entropic, Softsounds, Waves Plus and S. Tools. Speech recognition is bound to play a key role in aligning software in the near future and several applications have made headway in this interesting field. Olive, sponsored by the Telematics Application Program of the European Union (Language Engineering), aligns scripts with the soundtrack of video recordings. ALTA (Automated Link of Transcript and Audio) generates links between transcripts and digitised video / audio files. Haddock (1996) reports a different technique. Keywords are identified in audio files and then transcribed by the recognition process. This is a very convenient way to index chunks of language of interest to researchers and corpus users. It is for these corpus users to judge whether these tools are useful and practical. Within the medical sector, DVI has introduced a new digital dictation system, called DVI Voice Power, which, implementing Automatic Speech Recognition (ASR), is used as a dictation collector.7

Integration would seem to be a key priority for contexts involving a cycle of data gathering, research, teaching, assessing and needs-analysis. Clearly, for the sake of simplicity and convenience, the same tools should be shared by all the corpus users, irrespective of their role. From our own experience at UMU, this range of tools works to an optimum when they comply with the following requirements:

(a) Students should be able to
(1) Record and retrieve files at individual workstations
(2) Record and retrieve (the same) files at different workstations simultaneously
(3) Engage in real-time communication with other users at individual workstations
(4) Engage in real-time communication with other users at different workstations

(b) Teachers should be able to
(1) Launch media players and corpus analysis software applications on any workstation or at least have the chance to let students know how to do it
(2) Communicate with students in real time through an audio network or real time Microsoft Messenger-like applications
(3) Create portfolios for various courses, groups, individual students or data-gathering initiatives
(4) Broadcast corpus files to individual workstations
(5) Broadcast corpus files to group workstations

In addition, the tools should have transcription facilities. An environment offering these features should meet the needs of a wide range of corpus audiences (researchers, teachers and students), functionalities (whether the corpus use is teacher- or researcher-controlled or student-controlled) and network delivery systems (asynchronous self-study, instructor-led learning/teaching or small group work). The next two sections give an overview of the uses of LOC at UMU.
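The linking of audio and text discussed in this section, whether by alignment or by keyword indexing, can be pictured as a list of transcript segments that each carry a time span in the recording. The Python sketch below is a minimal illustration under stated assumptions: it does not reproduce the data format of any of the tools named above, the names `Segment`, `find_keyword` and `timestamp` are hypothetical, and the sample utterances are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One stretch of speech with its position in the audio file."""
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str
    text: str      # orthographic transcription of the stretch


def find_keyword(segments: List[Segment], keyword: str) -> List[Segment]:
    """Return the segments whose transcription contains the keyword
    (case-insensitive substring match)."""
    needle = keyword.lower()
    return [s for s in segments if needle in s.text.lower()]


def timestamp(seconds: float) -> str:
    """Format an offset in seconds as mm:ss for display next to the text."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Invented sample data for illustration only.
SEGMENTS = [
    Segment(0.0, 3.2, "A", "What did you do at the weekend?"),
    Segment(3.2, 7.9, "B", "I have been to the beach with my family."),
    Segment(7.9, 12.4, "A", "I have not been there this year."),
    Segment(65.0, 69.5, "B", "Next weekend we will go again."),
]

for seg in find_keyword(SEGMENTS, "weekend"):
    print(timestamp(seg.start), seg.speaker, seg.text)
```

With a structure of this kind, a text search returns offsets at which playback can start, which is the practical benefit of keeping audio and transcript as one entity rather than as separate parts.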

4. Networking LOC

Hughes (1997: 293) states that “in order to exploit machines efficiently and cost effectively it is important to link them in a network”. This author is particularly interested in highlighting issues affecting data shareability and network organization. Kern and Warschauer (2000: 1) define Network Based Language Teaching (NBLT) as the use of computers connected to one another in either local or global networks to teach language. The stress is laid on the communication facilities presented by new technologies. Chapelle (2001:20) remarks that LAN activities are primarily “built around learner-learner interactions”. Network technologies, in most cases, belong to one of two different groups. Local area network (LAN) technologies interconnect devices that are close to each other, usually in the same room, building or campus. Wide area network (WAN) technologies connect devices that can be kilometres apart. Nowadays these differences are not so important as fibre optic cables allow LAN technologies to connect devices many kilometres apart while simultaneously improving the speed of WANs. LANs and WANs are facilitating the integration of research, data gathering and teaching in almost every university in the Western world. At the Modern Languages laboratory facility at UMU, LOC researchers and compilers for the first time have the opportunity to (1) record students’ performance8 in natural communication interaction with other classmates in the facility; (2) record learner foreign language which can be the result of a myriad of communicative interactions such as face-to-face pair work, telephone pair work, face-to-face group work, telephone group work and whole class work - needless to say, individual performance recording is also possible - and (3) store the data in different digital formats. 
Working on a LAN facility, and making use of Tandberg TLC 3000 language laboratory software, the media player Divace Duo by Teleste Educational (see note 9) was chosen to meet the requirements outlined in section 3 above. Divace Duo is not only an advanced media player; it also offers users the chance to add alphabetical transcription to audio files. Later, these captions can be removed and called back when necessary. If transcriptions are already available in plain text/ASCII format, copy and paste makes it easier for teachers and researchers to link audio and text.

5. Networked LOC in action: hands-on experience

As pointed out above, our hands-on experience with LOC draws on the six functionalities outlined in section 2. Let us review them in the light of practical classroom activities which derive from the use of LOC in a networked learning environment. It is justifiable to consider that networked LOC expose real learner FL use. This is interesting in itself, as researchers, teachers and students can have the opportunity to view language as a product and to frame the object of their research/teaching and learning.

Integrating networked LOC into foreign language instruction

A case in point with my students is the underuse of contractions. The following concordance lines give us a glimpse of this:

Figure 1. I have in UMU students’ sample

There were no corresponding concordance lines for ’ve. In such cases, bookmarked digitised LOC files have major advantages, as they give teachers the opportunity to make LOC users aware of crucial linguistic features in learners’ output. With a digitised LOC in which a teacher has bookmarked points in the flow of discourse where strong forms such as those in Figure 1 are overused, students have the chance to listen to their actual instances of speech and relate those to the contents of the language course they are taking. Other uses of bookmarking at UMU include form-focused, awareness-raising noticing of sloppy segmental pronunciation, faulty discourse organization and misuse, or total absence, of word linking. Similarly, students are asked to identify the existence or absence of such characteristics in their own or in their classmates’ oral productions. Drawing a parallel, we could say that bookmarking in digital audio files equates to the tagging of written corpora and, best of all, does not exclude the use of meta-information in text files, as these can be kept separately. Figure 2 shows the Divace Duo media player with bookmarks above the control buttons. This facilitates remedial teaching, as students’ oral output can be easily hyper-linked to digitised files that can help them become more aware of their deficiencies and indeed their strengths. Through the UMU intranet, students are given the chance to complete a protocol that guides them through an examination of their own discourse.
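The contrast between full and contracted forms discussed above can be checked with a few lines of code once transcriptions are available as plain text. The following is only a minimal sketch in Python: the function name `kwic`, the context width and the sample sentences are invented for illustration and are not part of the UMU data or of the Divace Duo software.

```python
import re


def kwic(text, pattern, width=30):
    """Return keyword-in-context lines for every match of a regex pattern."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the keywords line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines


# Invented sample standing in for transcribed learner speech (not UMU data).
sample = (
    "I have been in Murcia for two years. I have a brother and "
    "I have never visited London. We have got a small flat."
)

full_forms = kwic(sample, r"\bI have\b")
contracted = kwic(sample, r"\bI['’]ve\b")

for line in full_forms:
    print(line)
print("I have:", len(full_forms), "| I've:", len(contracted))
```

A teacher could run the same two searches over a whole class's transcriptions and use the counts to decide which stretches of audio are worth bookmarking.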


Figure 2. Screenshot of Teleste Divace Duo

Qualitative assessment of learner oral productions is similarly favoured through the creation of topic-based group portfolios where LOC are stored and retrieved without difficulty. Students discussing environmental issues might decide to explore other groups’ opinions and language use through their own media player. In this sense, as Carter (1993: 145) states: “we are more likely to see things perspectively, creatively and with understanding if things are viewed not in isolation but set alongside each other, compared and contrasted. There are innumerable opportunities within the system of a language for contrast to be generated”. The same principle underlies Granger’s (1998a) corpus-based contrastive analysis approach. Similar to this idea is that of monitor LOC. These are easily enlarged as students continue to contribute oral output within the same academic year or in successive years. In a similar way, group, level and individual assessment is likely to be enhanced, as qualitative analysis of language is fostered in this approach. For this, it is advisable for transcriptions to be prepared first as .txt or similar files that can be moved into the digital player environment.

Finally, and central to the whole approach taken here, is the notion that networked LOC have vast potential in FL teaching through their awareness-raising capabilities. I believe that, having the chance to access their own or their fellow students’ oral FL production, learners will acquire “a greater self-consciousness about the forms of the language they use” (Carter 1993) and that it is necessary to “recognise that (…) language is a system and that it is for the most part very systematically patterned” (ibid. 1993: 142). It seems that both students and teachers can benefit from such an approach.
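The ideas of topic-based portfolios, bookmarked audio and a monitor LOC that grows over successive years can be pictured as a simple data structure. The sketch below is hypothetical throughout: the class names (`Bookmark`, `Recording`, `MonitorCorpus`), the fields and the file names are assumptions made for illustration, and they do not describe how Divace Duo or the UMU facility actually store their data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Bookmark:
    seconds: float  # offset from the start of the audio file
    label: str      # teacher's annotation, e.g. "strong form"


@dataclass
class Recording:
    audio_file: str
    topic: str
    group: str
    year: int
    transcript_file: str = ""  # plain-text transcription, kept separately
    bookmarks: List[Bookmark] = field(default_factory=list)


class MonitorCorpus:
    """A corpus that simply grows as new recordings are contributed."""

    def __init__(self) -> None:
        self.recordings: List[Recording] = []

    def add(self, recording: Recording) -> None:
        self.recordings.append(recording)

    def by_topic(self, topic: str) -> List[Recording]:
        """Retrieve a topic-based portfolio across groups and years."""
        return [r for r in self.recordings if r.topic == topic]

    def bookmarked(self, label: str) -> List[Tuple[str, float]]:
        """List (audio file, offset) pairs for every bookmark with this label."""
        return [
            (r.audio_file, b.seconds)
            for r in self.recordings
            for b in r.bookmarks
            if b.label == label
        ]


# Invented entries for illustration only.
corpus = MonitorCorpus()
corpus.add(Recording("g1_env.wav", "environment", "group 1", 2002,
                     "g1_env.txt", [Bookmark(12.5, "strong form")]))
corpus.add(Recording("g2_env.wav", "environment", "group 2", 2003,
                     "g2_env.txt", [Bookmark(40.0, "word linking"),
                                    Bookmark(71.2, "strong form")]))
corpus.add(Recording("g1_travel.wav", "travel", "group 1", 2003))

print([r.audio_file for r in corpus.by_topic("environment")])
print(corpus.bookmarked("strong form"))
```

Keeping the transcription as a separate text file, referenced from the recording, mirrors the point made above that bookmarks in the audio do not exclude meta-information held in text files.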
Lynch (2001) argues that collaborative transcribing and editing can encourage students to focus on form in their language output, while Kindt and Wright (2001) have stressed the fact that language teachers’ effectiveness can improve significantly through working on empirical data. Access to the systematic patterns in the flow of oral discourse has traditionally been highly restricted: just compare typical feedback on learner oral language with that on essays or assignments. If NBLT continues its growth, networked LOC may establish themselves as a new standard for ELT practices. However, research shows that this approach is not problem-free. Gu and Xu (1999: 181) have pointed out that NBLT involves changing teachers’ roles and challenging their old educational beliefs, training students in basic computer skills and, last but not least, continuous administrative and technical support. Although ability-improvement oriented, the activities described here cannot be considered pedagogical CALL (Chapelle 1997) as they are not primarily aimed at fostering linguistic interaction.

The advantages of the paradigm are varied and, to some extent, learning-context sensitive. However, the following will hold true in every situation:

(1) Shareability: different LOC users find themselves at ease within the same self-contained, all-purpose environment. Not only are the data shared but also the tools which manage them.

(2) Integration: the approach favours a natural introduction of corpora in FLT. Within this paradigm, corpora are no longer viewed as a one-day-trip-to-the-computer-facility add-on.

(3) FLT curricular support: it is common sense that technology should support the curriculum, not determine it. Our approach does not interfere with any specific type of learning methodology, as the data gathering process is built into the very learning and teaching environment, leaving teachers free to implement their own curricula.

(4) Self-access scalability: one of the applications of networked LOC is the possibility for students to access the data simultaneously and to do this in different autonomous environments. Following the Australian Adult Migration Education Program (Gardner and Miller 1999: 52-55), there would seem to be three types of environments of interest to networked LOC users: the study centre, as a complement to classroom teaching; the remediation centre, where remedial teaching is carried out; and the self-directed learning centre, where students can develop language and metalanguage awareness skills that can help them become more independent in their language learning experience.
Implicit in the discussion above is the belief that Computer Based Training (CBT) and Instructor Led Training (ILT) can both play a decisive role in boosting the use of customized oral learner corpora in language teaching institutions. Networked LOC can be compiled, properly stored and delivered to end-users via a computer using standard LAN technology. Following Rosenberg (2001: 29), we can state that this paradigm focuses on a broad view of learning and teaching/research solutions which go beyond the traditional framework. In the same manner, it aims to add to the relatively small amount of research into computer networking and language learning (Kern and Warschauer 2000) and, very modestly, to present a framework that can help researchers and FLT teachers gather and use LOC in an integrated cycle and alleviate some of the problems pointed out by Leech:

However, the general problem that the history of spoken and written language corpora reveals is that the corpora which are easiest to compile are not necessarily those which are most useful for language learning purposes. This leads to the question: what kinds of corpora do we need to develop, to make up the deficit between what corpora exist, and what corpora are needed for the best applications to language teaching? (Leech 1997: 18)

The integrated computer corpus-based approach put forward in this article is, in Spector and Davidsen’s (2000: 243) terms, realistic as it makes use of combined technologies to make up for one of the deficits Leech seems to be referring to, namely the lack of oral corpora that can meet the different needs of the Foreign Language Teaching community. Again, integration is always a plus.

Notes

1

In everyday language multipoint can be understood as multiple. Following the Technical Specifications and Technical Reports for a 3rd Generation Mobile System (3GPP) based on evolved GSM core networks, which are globally applicable, we should talk about communication configurations which involve the use of more than one network terminal.

2

See Jackson (2001) for a full description of learning formats.

3

We understand “natural” here as a classroom situation where the foreign language is used spontaneously, is indispensable for interpersonal communication within a given activity and where fairly predictable language is at play. See Gołębiowska (1990) for a further discussion of the concept.

4

Obviously teachers and researchers have to decide on the scope and aim of their academic interests.

5

"VIC" stands for Variation in Conversation. The corpus compilers and designers claim that they want to develop a corpus of conversational spoken American English that can be used to quantify the phonological variation in word production that occurs in casual speech. More information at http://vic.psy.ohio-state.edu/index.html.

6

"CGI" stands for "Common Gateway Interface." CGI is the method through which a web server, or a LAN server in a network, can obtain data from (or send data to) documents, databases, and other programs, and present that data to viewers via the web or similar. A CGI can be written in any programming language, but Perl is the most widely used.

7

"DVI", Digital Dictation and Voice Information System, is produced by Narratek (http://www.narratek.com). DVI offers dictation solutions, user-friendly management reporting functions, network/Internet transmission of digital voice files, a variety of transcription stations, and interface capabilities.

8

In the US, the Reporters’ Committee for Freedom of the Press (RCFP) has drafted a document entitled Can We Tape? A Practical Guide to Taping Phone Calls and In-Person Conversations in the 50 States and D.C. In general terms, it is legal to tape a face-to-face conversation when your recorder can be seen. The consent of all parties is presumed in these situations. Teachers and researchers should carefully consider the ethical issues which may affect the recording of their students’ language output, even if this has a formative or scientific purpose.

9

For information on the Tandberg, Teleste and Divace range of educational solutions, visit http://www.divace.com. For further information on modern language laboratories, see Pérez-Paredes (2002).

References

Abbey, B. (ed.) (2000), Instructional and Cognitive Impacts of Web-Based Education. Hershey: Idea Group Publishing.
Basturkmen, H. (2001), ‘Descriptions of spoken language for higher-level learners: the example of questioning’, ELT Journal, 55(1): 1-10.
Biber, D. (1993), ‘Representativeness in corpus design’, Literary and Linguistic Computing, 8: 243-257.
Berglund, Y. (1999), ‘Exploiting a large spoken corpus: an end-user’s way to the BNC’, International Journal of Corpus Linguistics, 4(1): 29-52.
Carter, R. (1993), ‘Language awareness and language learning’, in: M. Hoey (ed.), Data, Description, Discourse. London: Harper Collins. 139-149.
Cauldwell, R. (1992), ‘Direct encounters with fast speech on CD-Audio to teach listening’, System, 24(4): 521-528.
Chapelle, C. (1997), ‘CALL in the year 2000: still in search of research paradigms?’, Language Learning & Technology, 1(1): 19-43.
Chapelle, C. (2001), Computer Applications in Second Language Acquisition. Cambridge: Cambridge University Press.
Debski, R. and M. Levy (eds) (1999), World CALL. Lisse: Swets and Zeitlinger.
DeKeyser, R. (1998), ‘Beyond focus on form: Cognitive perspectives on learning and practical second language grammar’, in: C. Doughty and J. Williams (eds). 42-63.
Doughty, C. and J. Williams (eds) (1998), Focus on Form in Classroom Second Language Acquisition. Cambridge: Cambridge University Press.
Gardner, D. and L. Miller (1999), Establishing Self-Access. From Theory to Practice. Cambridge: Cambridge University Press.
Gołębiowska, A. (1990), Getting Students to Talk. Hertfordshire: Prentice Hall.
Granger, S. (ed.) (1998), Learner English on Computer. Harlow: Longman.
Granger, S. (1998a), ‘The computer learner corpus: a versatile new source of data for SLA research’, in: S. Granger (ed.). 3-18.


Granger, S. and C. Tribble (1998), ‘Learner corpus data in the foreign language classroom: form-focused instruction and data-driven learning’, in: S. Granger (ed.). 199-209.
Gu, P. and Z. Xu (1999), ‘Improving EFL learning environment through networking’, in: R. Debski and M. Levy (eds). 169-184.
Haddock, N. (1996), ‘Structuring voice records using keyword labels’, in: Proceedings of CHI’96, Vancouver. URL at http://www.acm.org/sigchi/chi96/proceedings/intpost/Haddock/hn_txt.htm.
Halliday, M.A.K. (1990), Spoken and Written English. Oxford: Oxford University Press.
Hughes, G. (1997), ‘Developing a computing infrastructure for corpus-based teaching’, in: A. Wichmann et al. (eds). 292-308.
Jackson, R. (2001), ‘Web based learning resources library’. URL at http://www.outreach.utk.edu/weblearning/ (accessed on 14/05/01).
Jones, R. (1997), ‘Creating and using a corpus of spoken German’, in: A. Wichmann et al. (eds). 146-156.
Kern, R. and M. Warschauer (2000), ‘Theory and practice of network-based language teaching’, in: M. Warschauer and R. Kern (eds). 1-19.
Kindt, D. and M. Wright (2001), ‘Integrating language learning and teaching with the construction of computer learner corpora’. URL at http://www.nufs.ac.jp/~dukindt/media/corpora.pdf (accessed on 7/05/03).
Leech, G. (1997), ‘Teaching and language corpora: a convergence’, in: A. Wichmann et al. (eds). 1-24.
Llisterri, J. (1999), ‘Transcripción, etiquetado y codificación de corpus orales’, RESLA 1999, 53-82.
Lynch, T. (2001), ‘Seeing what they meant: transcribing as a route to noticing’, ELT Journal, 55(2): 124-132.
Muñoz, C. (ed.), Trabajos en lingüística aplicada. Barcelona: Univerbook SL.
Norris, J. and L. Ortega (2000), ‘Effectiveness of L2 instruction: a research synthesis and quantitative meta-analysis’, Language Learning, 50(3): 417-528.
Pérez-Paredes, P. (2002), ‘From rooms to environments: techno-short-sightedness and language laboratories’, International Journal of English Studies, 2(1): 59-80.
Pérez-Paredes, P. and P. Cantos Gómez (eds) (2002), ‘New trends in computer assisted language learning and teaching’, International Journal of English Studies, 2(1).
Robinson, P. (1996), ‘Learning simple and complex language rules under implicit, incidental, rule-search and instructed conditions’, Studies in Second Language Acquisition, 18: 27-62.
Rosenberg, M. (2001), E-Learning. New York: McGraw-Hill.
Sánchez, A. (2000), ‘Language teaching before and after “digitilized corpora”. Three main issues’, Cuadernos de Filología Inglesa, 9(1) Corpus-based Research in English Language and Linguistics. Universidad de Murcia. 5-37.


Sánchez, A., R. Sarmiento, P. Cantos and J. Simón (1995), Cumbre. Corpus lingüístico del español contemporáneo. Fundamentos, metodología y aplicaciones. Madrid: SGEL.
Schmidt, R. (1990), ‘The role of consciousness in second language learning’, Applied Linguistics, 11(2): 129-158.
Schmidt, R. (1993), ‘Awareness and second language acquisition’, Annual Review of Applied Linguistics, 13: 206-226.
Skehan, P. (2001), ‘The role of a focus on form during task-based instruction’, in: Muñoz (ed.), Trabajos en lingüística aplicada. Barcelona: Univerbook SL. 11-24.
Spector, J. and P. Davidsen (2000), ‘Designing technology enhanced learning environments’, in: B. Abbey (ed.). 241-261.
Warschauer, M. and R. Kern (eds) (2000), Network-based Language Teaching: Concepts and Practice. Cambridge: Cambridge University Press.
Waters, A. and M. Vilches (2001), ‘Implementing ELT innovations: a needs analysis framework’, ELT Journal, 55(2): 133-141.
Wichmann, A., S. Fligelstone, T. McEnery and G. Knowles (eds) (1997), Teaching and Language Corpora. Harlow: Longman.