The Cambridge Handbook of Learner Corpus Research

1 Introduction: learner corpus research – past, present and future
Sylviane Granger, Gaëtanelle Gilquin and Fanny Meunier

Written and spoken data produced by learners have always been a key resource for the study of second language acquisition (SLA). However, for a long time the data used were rather artificial, i.e. resulting from highly controlled language tasks, and therefore not necessarily a reflection of what learners do in more natural communication contexts. In addition, the data samples were usually quite small, often involving no more than a handful of learners, and therefore raised concerns in terms of representativeness. The combined wish to address these two issues and produce more learner-aware/learner-focused pedagogical tools prompted the emergence of learner corpora, which can be defined as electronic collections of natural or near-natural data produced by foreign or second language (L2) learners and assembled according to explicit design criteria. Learner corpora gave rise to a flurry of studies, which have come to be grouped under the umbrella term of 'learner corpus research' (LCR). This new research strand emerged in the late 1980s as an offshoot of corpus linguistics, a field which had shown great potential in investigating a wide range of native language varieties (diachronic, stylistic, regional) but had neglected the non-native varieties. In the case of English, by far the most widely investigated language at the time, this neglect was hardly justified, given that the number of non-native speakers far exceeds that of native speakers.

Having access to electronic collections of L2 data presents two significant advantages. First, as these collections are usually quite large and are collected from a great number of learners, they are arguably more representative than smaller data samples involving a limited number of learners. Second, being in electronic format, the data can be analysed with a whole battery of software tools that greatly speed up the analysis and enable a wide range of investigations that either cannot be performed manually at all or only at huge cost in terms of human resources.


Part-of-speech taggers, for example, assign to each word in a learner corpus a tag that indicates its grammatical category, thereby facilitating investigations into learners' use of specific grammatical categories such as prepositions or auxiliary verbs. Concordance programs, on the other hand, have contributed to bringing lexis and phraseology to the forefront of L2 studies: they generate frequency lists of both single words and phrases, present all the instances of a linguistic item in their immediate linguistic context and include functionalities such as automatic extraction of collocations and word clusters or identification of keywords. For the analysis of errors, researchers can rely on error editors, which allow the insertion of error annotations into text files and play a key role in the type of error analysis carried out within LCR, known as computer-aided error analysis. Recent developments in automatic error detection and correction offer hope for increased automation of this key aspect of learner corpus research.

Although still quite young, the field of learner corpus research has already undergone remarkable developments. At first, learner corpus studies were limited to learner English. This was understandable in view of the position of English as the major lingua franca internationally, but in an increasingly multilingual society it is good to see the LCR field embrace an ever larger number of different L2s. The 'Learner corpora around the world' website maintained by the University of Louvain1 currently contains 137 learner corpora, 82 (60%) representing L2 English, the rest focusing on other languages (Arabic, French, German, Korean, Spanish, etc.). In terms of medium and text type, the dominant focus was – and to a large extent still is – on writing, in particular essay writing, but there is a general diversification of data types and, especially, a growing number of projects on learner speech. Another significant trend concerns the research design: while there is still a preponderance of cross-sectional studies, i.e. studies that sample data from learners at a single point in time, the design of longitudinal corpora made up of data sampled from the same learners across time is showing a slow but steady rise.

There is also a growing awareness among the learner corpus community of the need to pay greater attention to individual variability. Learner corpora lend themselves particularly well to the study of whole learner populations, and this global perspective undoubtedly has many benefits, especially from a teaching perspective. However, in view of the high degree of variability between and within learners, the exclusive use of aggregate data may be misleading. The growing application of more sophisticated statistical techniques in LCR has begun to remedy this weakness. Learner corpus researchers have started to realise that by using the appropriate statistics, it is possible to combine the best of both worlds: keep the group perspective, which constitutes one of the strengths of LCR, while at the same time taking individual variability into account.

1 www.uclouvain.be/en-cecl-lcworld.html (last accessed on 13 April 2015).
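The kind of processing that taggers and concordancers automate can be made concrete with a few lines of code. The sketch below uses Python with NLTK purely as an example; the chapter does not prescribe any particular tool, the learner sentence is invented, and the NLTK model names may vary slightly across library versions.

```python
# Minimal sketch of what POS taggers and concordancers automate.
# NLTK is used here purely for illustration (pip install nltk).
from collections import Counter
import nltk

# Resource names may differ across NLTK versions.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

learner_text = "I am agree with this opinion because it depend on the situation."

# 1. Part-of-speech tagging: each word receives a grammatical-category tag.
tokens = nltk.word_tokenize(learner_text)
tagged = nltk.pos_tag(tokens)   # e.g. ('agree', 'VB'), ('depend', 'VB')

# 2. Frequency list: counts of single words, the basis of lexical profiling.
freq = Counter(t.lower() for t in tokens if t.isalpha())

# 3. A bare-bones KWIC concordance: every hit of a search item in context.
def kwic(tokens, node, span=3):
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            print(f"{left:>30}  [{tok}]  {right}")

kwic(tokens, "agree")
```

Real concordancers add collocation statistics and keyword extraction on top of this, but the underlying operations are of the same kind.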


The Cambridge Handbook of Learner Corpus Research, with its aim to provide a state-of-the-art introduction to all the facets of the fast-expanding field of learner corpus research, reflects these recent developments. For example, it describes studies in a variety of L2s, deals with both speech and writing, includes a chapter on longitudinal research design and points to the need to attach greater importance to individual learners. At the same time, however, it is representative of the field in its current state, which necessarily means that certain target languages, text types, techniques or objects of study are less frequently dealt with than others.

While LCR shares with mainstream SLA research the objective of gaining a better understanding of the mechanisms of foreign or second language acquisition, LCR stands out because of its strong applied orientation. Initially limited to the sphere of foreign language teaching, it now includes a wide range of applications – in particular in natural language processing (NLP), such as automated scoring and automatic error detection and correction – which accordingly take pride of place in this handbook. As a result, LCR has become a truly interdisciplinary field at the crossroads between corpus linguistics, second language acquisition, language teaching and natural language processing. This makes LCR particularly fertile ground, but also brings with it the need to pull together the different research strands that are still insufficiently integrated.

Several recent initiatives have been launched to foster greater synergy. The Learner Corpus Association,2 set up in 2013, acts as an interdisciplinary forum for discussion and exchange of information on learner corpus research and coordinates the planning of a biennial international conference, the Learner Corpus Research Conference.3 A new journal launched in 2015, the International Journal of Learner Corpus Research,4 provides a dedicated publication outlet for research covering methodological, theoretical and applied work in any area of learner corpus research.

In line with these initiatives, The Cambridge Handbook of Learner Corpus Research aims to provide the rapidly growing community of researchers, teachers and students who are interested in this field with an overview of all the key aspects of learner corpus research. The handbook is subdivided into five main parts:

1. learner corpus design and methodology
2. analysis of learner language
3. learner corpus research and second language acquisition
4. learner corpus research and language teaching
5. learner corpus research and natural language processing.

2 www.learnercorpusassociation.org/ (last accessed on 13 April 2015).
3 LCR2011 in Louvain-la-Neuve (Belgium), LCR2013 in Bergen (Norway) and LCR2015 in Nijmegen (the Netherlands).
4 https://benjamins.com/#catalog/journals/ijlcr/main (last accessed on 13 April 2015).


Each of the chapters, all written by experts in their fields, introduces a different facet of LCR and presents a state-of-the-art review, placing emphasis on theoretical, methodological and applied aspects of wider relevance.

The large number of chapters in Part I reflects the importance of corpus design and methodology for a good corpus analysis. The following topics are tackled: learner corpus design and collection, learner corpus methodology, learner corpora and psycholinguistic research, learner corpus annotation, speech annotation, error annotation and statistics for learner corpus research. Part II deals with the main foci of linguistic analysis in learner corpus research: lexis, phraseology, grammar, discourse and pragmatics. Part III situates learner corpus research within the general field of SLA and highlights more particularly the issues of transfer, formulaic language, developmental patterns, variability and the impact of the learning context. Part IV considers the links between LCR and language teaching, both in general and specific settings, introduces the notion of 'pedagogic corpus' and gives an overview of learner-corpus-informed pedagogical materials and testing practices. The last part is devoted to NLP applications, in particular automatic grammar- and spell-checking, automated scoring and automatic identification of the learner's native language.

As a result of the interdisciplinary nature of LCR, each of the different research domains represented in the handbook has its own paradigms, theories and methodologies, and it is essential to respect these. While it would not have been possible (nor indeed desirable) to present a unified theoretical or methodological framework, we have taken great care to enhance the coherence of the volume by including a large number of cross-references to help readers navigate through the chapters and confront different perspectives.

All the chapters in the handbook follow the same general format. After an introduction to the topic, the authors expand on a number of issues which they consider to be of particular importance. The third section describes in some detail two to four representative studies. This is an important section: a handbook is meant not only to provide key theoretical information but also to contain precise guidelines about how to do research in the field, and the representative studies supply models of how to conduct learner corpus analyses. The fourth section takes a critical look at past and current research and points to promising future directions. This last section is especially relevant as the field of LCR is still very new, and twenty-five years after its advent it is time to take stock of the progress made and identify priorities for the future.

To further assist users of the handbook, recommended key readings are provided at the end of each chapter, together with a short summary and an indication of their relevance to the topic of the chapter. These key readings, as well as the extended general bibliography, will allow researchers in the field to delve more deeply into all the aspects addressed. The handbook also features four indexes to facilitate navigation.


Besides the traditional author and subject indexes, there is one for all the corpora referred to in the volume and another for the software tools.

One of our main priorities in editing this handbook has been to cater for both novice and seasoned learner corpus researchers. For budding or would-be researchers, it will serve as an accessible introduction to all aspects of the field, with no unnecessary jargon and with all technical terms defined and illustrated. The representative studies described in each chapter make up a how-to guide for conducting learner corpus research in a wide range of areas. For more experienced learner corpus researchers, the handbook will act as a prompt to embrace new perspectives: revise some of their methodological practices, add a new theoretical dimension, adopt a higher level of computational or statistical sophistication, enrich the interpretative side of the analysis or imagine new applications. A greater awareness of the interdisciplinary nature of LCR could also be an incentive to collaborate more closely with researchers in other disciplines.

This handbook aims to provide a comprehensive survey of learner corpus research. The multifaceted picture of the field that emerges from the different chapters highlights some of the major strengths of the research conducted to date but also points to shortcomings that need to be addressed and gaps that need to be filled. It is our hope that the handbook will be useful to a wide range of researchers from different disciplinary backgrounds. We also hope that it will play a key role in turning LCR into a fully mature field with stronger theoretical substrates and increased methodological rigour, and that it will contribute to the production of a wide range of exciting new applications.


2 From design to collection of learner corpora
Gaëtanelle Gilquin

1 Introduction

Since the development of the field of second language acquisition (SLA), which Gass et al. (1998: 409) situate in the 1960s or 1970s, use has been made of authentic data representing learners' interlanguage. However, what has characterised many of these SLA studies is the small number of subjects investigated and the limited size of the data collected. This can be illustrated by the case studies selected by Ellis (2008: 9–17) as an 'introduction to second language acquisition research': Wong Fillmore's (1976, 1979) study of five Mexican children, Schumann's (1978) study of Alberto, Schmidt's (1983) study of Wes, Ellis's (1984, 1992) study of three classroom learners and Lardiere's (2007) study of Patty. While such studies have allowed for a very thorough and detailed analysis of the data under scrutiny (including individual variation and developmental stages), their degree of generalisation can be questioned (Ellis 2008: 8). In this respect, the expansion of corpus linguistics to the study of interlanguage phenomena has opened up new possibilities, materialised in the form of learner corpora.

Like any corpus, the learner corpus is a 'collection of machine-readable authentic texts (including transcripts of spoken data) which is sampled to be representative of a particular language or language variety' (McEnery et al. 2006: 5). What makes the learner corpus special is that it represents language as produced by foreign or second language (L2) learners. What makes it different from the data used in earlier SLA studies is that it seeks to be representative of this language variety. This element is emphasised by some of the definitions of learner corpora found in the literature, e.g. Nesselhauf's (2004: 125) definition as 'systematic computerized collections of texts produced by language learners' (emphasis added), where 'systematic' means that 'the texts included in the corpus were selected on the basis of a number of – mostly external – criteria
(e.g. learner level(s), the learners' L1(s) [mother tongue(s)]) and that the selection is representative and balanced' (Nesselhauf 2004: 127). Design criteria are essential when collecting a learner corpus and will therefore be dealt with as one of the core issues (Section 2.2).

Another issue when defining learner corpora is their degree of naturalness. Granger's (2008a: 338) definition of learner corpora as 'electronic collections of (near-) natural foreign or second language learner texts assembled according to explicit design criteria' (emphasis added) suggests that they may comprise texts that are not, strictly speaking, naturally occurring texts.1 This is because, for learners (especially foreign language learners), the target language fulfils only a limited number of functions, most of which are restricted to the classroom context. When learners engage in activities like writing a mock letter to an imaginary friend or doing role-plays with their classmates, the main objective is for them to practise and improve their skills in using the target language rather than to convey a genuine message. Data collected in such situations therefore do not represent the linguistic output of 'people going about their normal business' (Sinclair 1996), as would be expected of fully natural data. However, as is the case with corpora in general (see Gilquin and Gries 2009: 6), learner corpora may display varying degrees of naturalness, even when collected within the context of the school/university, from the more natural (e.g. the computer-mediated interactions between German and American students gathered in Telekorp; see Belz 2006)2 to the more constrained (e.g. the retellings of a silent Charlie Chaplin movie included in the Giessen-Long Beach Chaplin Corpus; Jucker et al. 2003), through the semi-natural case of essay writing (e.g. ICLE, the International Corpus of Learner English; Granger et al. 2009), a pedagogical task that is natural in the context of the language learning classroom. In accordance with this continuum, and following Nesselhauf (2004: 128), learner data collected with more control on the language produced (e.g. the translations contained in the UPF Learner Translation Corpus; Espunya 2014) may be considered 'peripheral learner corpora'. When so much control is exerted that the learner is no longer free to choose his/her own wording, for instance in the case of a reading-aloud task, the term 'learner corpus' will normally be avoided.3

1 The definition also underlines, like Nesselhauf's (2004), the importance of design criteria in the compilation of learner corpora (see Section 2.2).
2 Telekorp is the Telecollaborative Learner Corpus of English and German. It contains data produced by the students in their L1 and L2.
3 It must be pointed out, however, that, e.g., Atwell et al. (2003) refer to ISLE (Interactive Spoken Language Education) as a corpus, although it includes recordings of German and Italian learners reading English texts. According to Gut (2014: 287), such collections of 'decontextualized sentences or text passages that are read out or repeated' qualify as 'peripheral types of learner corpora'. See also Chapter 6 (this volume) for a very broad use of the term 'learner corpus', covering highly constrained types of spoken data.


Note that 'database' is sometimes used to refer to collections of learner data that have been gathered from both natural and less natural contexts, for example LINDSEI, the Louvain International Database of Spoken English Interlanguage (Gilquin et al. 2010), which is made up of (in decreasing order of naturalness) free informal discussions, monologues on a set topic and picture descriptions.

Related to the concept of naturalness is what could be referred to as the degree of monitoring (in the sense of Krashen 1977) or editing of the data included in the corpus. Some data are produced with no prior planning and no subsequent editing (this is typically the case of speech, which by its very nature is more spontaneous than writing). When given sufficient time before and/or after language production, however, the learner can organise his/her discourse more carefully and (in the case of written discourse) revise and improve the text, possibly with the help of reference tools or feedback from an instructor. Some recent learner corpus projects which aim to make the writing process visible show various stages in the drafting of a text and thus reflect different degrees of editing/monitoring. The Hanken Corpus of Academic Written English for Economics (Mäkinen and Hiltunen 2014), for example, consists of the first drafts and final versions of end-of-term papers (before and after the teacher's feedback). The Marburg Corpus of Intermediate Learner English (MILE), on the other hand, seeks to represent the changes made during the writing process by marking deletions, additions or line breaks when digitising the learners' (handwritten) data (see Kreyer 2014). As noted by Kreyer (2014: 56), such alterations are interesting in that they can be 'regarded as an additional window onto the development of L2 competence'.

The above characterisation of the learner corpus normally excludes corpora like the ELFA (English as a Lingua Franca in Academic Settings) corpus, which contains data produced by L2 users (rather than L2 learners, see Mauranen 2011), and like ICE (International Corpus of English), which includes data produced by speakers of indigenised varieties of English – often, rather confusingly, referred to as English as a Second Language, but differing from the varieties included in learner corpora in that these indigenised varieties are used in countries where English is not a native language but has an official or semi-official status (see Chapter 19, this volume, on these 'new' varieties of English). However, these distinctions are not always clear-cut. The NUS Corpus of Learner English (Dahlmeier et al. 2013), for example, contains data produced by Singaporeans, who are speakers of an indigenised variety of English rather than learners of English in the strict sense; in this case, the use of the term 'learner corpus' might be justified by the fact that the data included in it were produced by undergraduate university students, not adult users. Nesselhauf (2004: 128), however, notes that the term 'learner' (and hence 'learner corpus') may also be applied to adult speakers 'in countries in which the status of the language in question is somewhere between foreign and second language (for example English in Hong Kong)'.


2 Core issues

2.1 Learner corpus typology

Several types of learner corpora can be distinguished, differing along one or more dimensions, some of which are common to all corpora while others are specific to learner corpora.

The first dimension, which is crucial in determining how the data will be collected and turned into a corpus, is that of medium. Learner corpora can consist of written texts or transcriptions of spoken discourse. Unsurprisingly, the first learner corpora, which started to be collected in the late 1980s, were of the former type. Spoken corpora, which are more laborious to collect (see Section 2.3), only appeared later. Today, written learner corpora are still more numerous than spoken learner corpora – they are over twice as common according to the list of 'Learner Corpora around the World' (LCW) compiled by the University of Louvain4 – but a number of spoken learner corpora have become available over the last few years and have started to form the basis of extensive research. Among spoken learner corpora, a distinction can be made between those that simply consist of written transcriptions of spoken discourse and those that are distributed with their corresponding sound files and thus give access to the speech signal; the terms 'mute spoken corpus' (Chapter 6, this volume) and 'speech corpus' (Wichmann 2008) can be used to describe this difference. Some learner corpora include both written and spoken data, like the Santiago University Learner Corpus.5 As newcomers to the field, multimodal (or audio-visual) learner corpora (like MAELC, the Multimedia Adult ESL Learner Corpus; Reder et al. 2003) include video recordings, which give access to new domains of investigation like the analysis of learners' gazes or gestures, such as Hashimoto and Takeuchi's (2012) study of non-verbal elements in presentations, based on their Multimedia Learner Corpus of Basic Presentation (MLCP), in which each video-recorded presentation is accompanied by peer evaluations from the audience.

Genre is another aspect that may serve to categorise learner corpora. In principle, any genre (or combination of genres) may be represented in a learner corpus. However, in practice, the variety of genres tends to be limited as a result of (i) the restricted number of genres for which a second or (especially) foreign language variety is actually used (see Section 1) and (ii) learner corpus compilers' preference for certain genres, for example argumentative essays among written learner corpora, which correspond to over half of the written learner corpora included in the LCW list. Most learner corpora to date correspond to language as it is used for general purposes, but recently language for specific purposes (LSP) learner corpora have made their appearance.

4 www.uclouvain.be/en-cecl-lcworld.html (last accessed on 13 April 2015).
5 www.sulec.es/ (last accessed on 13 April 2015).


Unlike general learner corpora, which are mainly collected within the framework of general language courses, LSP learner corpora are made up of 'discipline and genre-specific texts written by learners within the framework of LSP or content courses' (Granger and Paquot 2013: 3142; see also Chapter 21, this volume). An example of such a corpus is the Active Learning of English for Science Students (ALESS) Learner Corpus (Allen 2009), which consists of research papers written by Japanese students majoring in science. Particularly interesting are learner corpora that contain a variety of genres, like the MiLC Corpus (Andreu Andrés et al. 2010), which includes, among others, essays, reports, formal and informal letters, summaries and business letters, as these kinds of corpora make it possible to compare interlanguage across genres.

Learner corpora can also be distinguished on the basis of the target language they represent. English, the language of the first learner corpora that were collected, is still the most predominant target language. However, over the last few years, new projects have been launched that seek to collect data representing other target languages, most notably French (e.g. FLLOC, French Learner Language Oral Corpora),6 German (e.g. Falko, Fehlerannotiertes Lernerkorpus)7 and Spanish (e.g. CEDEL2, Corpus Escrito del Español L2; see also Section 3.1),8 which are the most widely represented target languages after English according to the LCW list. While most learner corpora are monolingual, containing data from only one target language, a small number of learner corpora are multilingual, like the MiLC Corpus mentioned above, which contains learner data in Catalan, English, French and Spanish, or the USP Multilingual Learner Corpus (Tagnin 2006), which has English, German, Italian and Spanish as target languages.

Besides the target language, one has to take the learner's mother tongue into account. Among the learner corpora that contain data produced by a single L1 population ('mono-L1 learner corpora'), it seems, on the basis of the LCW list, that Asian learners are the most widely represented, e.g. the Taiwanese Learner Corpus of English (Shih 2000) or the Japanese English as a Foreign Language Learner (JEFLL) Corpus (Tono 2007), but many other L1 populations are represented as well. Quite a few learner corpora (about a third of all the learner corpora included in the LCW list) are 'multi-L1'; in this case, learners from several L1 populations have contributed to the corpus (see Granger (2012a: 12) on the distinction between mono- and multi-L1 learner corpora). One such corpus is the International Corpus of Learner Finnish (Jantunen 2011), which contains data produced by learners of Finnish from several mother-tongue backgrounds, including Estonian, German, Polish, Russian and Swedish.

6 www.flloc.soton.ac.uk/ (last accessed on 13 April 2015).
7 www.linguistik.hu-berlin.de/institut/professuren/korpuslinguistik/forschung/falko (last accessed on 13 April 2015).
8 www.uam.es/proyectosinv/woslac/collaborating.htm (last accessed on 13 April 2015).


Multi-L1 learner corpora are very useful for the study of L1 influence (see Chapter 15, this volume) as they are generally made up of subsets of data that are comparable across the different L1 populations, thus making it possible to isolate interlanguage features that are typical of certain populations. It should be noted that some multi-L1 learner corpora do not allow for such comparisons as the learners' L1 is not identified (or at least not precisely enough), e.g. the EF-Cambridge Open Language Database (EFCAMDAT),9 which currently includes information about the learners' nationalities but not about their L1.

In the same way as a general corpus may include data from one period in time (synchronic corpus) or from several periods (diachronic corpus), a learner corpus may be a snapshot of learners' knowledge of the target language at a particular moment or a representation of the evolution of their knowledge through time. Most learner corpora are of the former type, being made up of cross-sectional data. Corpora that seek to gather learner output produced at different stages in their development are called longitudinal corpora. They may vary in density depending on how often the data are gathered: the more regular the data collection, the denser the corpus. The Longitudinal Database of Learner English (LONGDALE)10 is a project that aims to follow the same learners over a minimum period of three years, with at least one data collection per year. Increasing the number of collections per year would make the corpus denser. Belz and Vyatkina (2008: 33) use the term 'developmental learner corpus' to refer to dense corpora 'in which learner performance is documented at close intervals or at all points of production' – in their case, data from Telekorp that were collected from the same students over a two-month period (see Chapter 13, this volume). Longitudinal (and developmental) corpora make it possible to investigate learners' progress (or lack thereof) over time and are therefore a precious resource (see Chapter 17, this volume). However, because such corpora are difficult to compile (among other things because some learners drop out during the course of the data collection), there are very few currently available. For want of longitudinal learner corpora, researchers may instead resort to corpora of pseudo-longitudinal data (Gass and Selinker 2008: 56–7), also referred to as quasi-longitudinal data (Granger 2002: 11). Such corpora are gathered at a specific point in time but from (different) learners representing different proficiency levels. The NICT JLE Corpus (National Institute of Information and Communications Technology Japanese Learner English Corpus; Izumi et al. 2004) is quasi-longitudinal as it contains data produced by different learners and divided into nine proficiency levels. Learner corpora may also include both truly and quasi-longitudinal data, as illustrated by the Corpus of Learner German (CLEG13).11

9 http://corpus.mml.cam.ac.uk/efcamdat/index.php (last accessed on 13 April 2015).
10 www.uclouvain.be/en-cecl-longdale.html (last accessed on 13 April 2015).
11 http://korpling.german.hu-berlin.de/public/CLEG13/CLEG13_documentation.pdf (last accessed on 13 April 2015).


A distinction can also be drawn between global and local learner corpora. Most learner corpora are global, being part of large-scale projects and being collected among learners who are subjects providing data for inclusion in the corpus. Local learner corpora, on the other hand, are typically collected by teachers among their students, who are both contributors to and users of the corpus. The objective of this approach is to identify one's own learners' specific needs through a corpus analysis of their output and thus provide tailor-made solutions to their problems. Mukherjee and Rohrbach (2006) illustrate the compilation and use of a local learner corpus, the Giessen-Göttingen Local Learner Corpus of English (see also Millar and Lehtinen (2008) for an overview of the compilation and analysis of local learner corpora).12

Finally, learner corpora can be distinguished on the basis of their origin and the main purpose for which they were created. Commercial learner corpora are collected by publishing houses with a view to developing pedagogical materials (dictionaries, coursebooks, etc.) based on authentic learner output (see Chapter 22, this volume). Most of the time, these corpora are not publicly available. The two most notable examples of commercial learner corpora are the Longman Learners' Corpus13 and the Cambridge Learner Corpus.14 Unlike commercial learner corpora, academic learner corpora are initiated by researchers and/or teachers working in educational settings and interested in learning more about interlanguage (possibly with pedagogical aims in mind).

A number of learner corpora have already been mentioned above, many of which are publicly available for research purposes. However, there are some cases in which it may be necessary, or desirable, to collect one's own data. This might be because, as with local learner corpora, the researcher wants to have access to data collected in a specific environment (his/her classroom, school, area, etc.), or because the ready-made learner corpora that are available do not suit his/her research purposes (e.g. they are too small, do not contain enough metadata or, in the case of (mute) spoken learner corpora, have not been transcribed in sufficient detail). The next sections will provide an overview of how to go about collecting a learner corpus, starting with the issue of design criteria. The focus will be on those features that are specific to learner corpora; for a good overview of the issues to be considered when compiling a (general) corpus, see the chapters collected in O'Keeffe and McCarthy (2010, Section II).

12 A further type of corpus is sometimes recognised in between global and local learner corpora, namely in-house learner corpora, i.e. 'local reference learner corpora which reflect the production of a given learner population' (Rankin and Schiftner 2011: 430). In this case, the contributors and the users are not the same students, but they come from the same population (typically, the same school/university), which enhances the relevance of the analyses of these data for the users.
13 www.pearsonlongman.com/dictionaries/corpus/learners.html (last accessed on 13 April 2015).
14 www.cambridge.org/gb/cambridgeenglish/about-cambridge-english/cambridge-english-corpus (last accessed on 13 April 2015).


2.2 Design: environment, task and learner variables

The importance of adopting strict criteria when designing a corpus has been regularly emphasised in the literature (e.g. Atkins et al. 1992). In the case of learner corpora, design criteria are even more crucial given the highly heterogeneous nature of interlanguage, which can be affected by many variables related to the environment, the task and the learner him-/herself. Before embarking on the collection of a learner corpus, the researcher therefore has to think carefully about what exactly will be included in the corpus for, as pointed out by Granger (2013a: 3235), '"mixed bag" collections of L2 data present little interest'. In this respect, the dimensions presented in the previous section will obviously have a role to play: whether one wishes to collect, say, spoken or written learner data will have an influence on the way the corpus will be compiled and how the data will be analysed and interpreted. But there are many other variables that could be taken into account. These variables can pertain to the environment in which the data are collected, the tasks which the subjects are carrying out during the data collection, and the learners whose performances are being recorded.

In terms of the environment, a major distinction can be made between cases where the target language is a native language that is used in everyday interactions in the learner's environment (second language) and cases where the target language has no such functions and is normally confined to the classroom (foreign language). In addition, one can distinguish between collection of the data in an educational setting (at school/university) and in a natural setting (outside school/university). This distinction is especially relevant for second language learner corpora since second languages can be used in a wider variety of contexts, but foreign languages can sometimes also be used outside the educational setting, for example when a learner writes a letter or an email to a pen friend from home.

Task variables are closely related to the notions of medium and genre (see Section 2.1). Producing an argumentative essay or orally describing a picture, for instance, will activate very different mechanisms and will offer different possibilities for controlling the way the task is performed. Written tasks can involve variables like time constraints (did the learner have a limited amount of time available to write the text?), availability of reference tools (dictionaries or grammar books), intertextuality (did the learner have access to secondary sources such as articles or other students' essays?) and computerisation (did the learner write by hand or using a computer?). Task variables for spoken learner corpora include preparation time (did the learner have time to think about what s/he was going to say?), written support (did the learner have access to some written support, either notes of his/her own or text to which s/he is supposed to react?) and technique of recording (e.g. was the technique invasive or not?).


In addition, one should consider whether the task was part of an exam, as exam conditions may place students under increased pressure. Topic also has an important role to play as it may influence certain aspects of learner production (especially more lexical ones).

Unsurprisingly, many of the variables that affect the nature of interlanguage concern the learners themselves. Some of these variables are general, being applicable to any speaker/writer, native or not, e.g. age, gender, country/area, mother tongue. Other variables are more specifically relevant to learners, like the parents' native languages, the language(s) spoken at home, the learner's proficiency level, exposure to the target language inside the classroom (e.g. number of years spent learning the target language, pedagogical materials used) and outside the classroom (e.g. contact with the target language in everyday life, stays in target-language countries), or knowledge of other foreign languages. Different measures of the learner's proficiency and/or motivation may also be provided. The PAROLE Corpus is an example of a learner corpus that offers a particularly wide variety of measures, including motivation, listening comprehension skills, grammatical and lexical competence, aptitude for grammatical analysis and phonological memory (see Hilton 2008). In the case of spoken learner corpora, other participants may be involved in addition to the learner, and it may be useful to include variables about these participants too. In LINDSEI, for example, information was also gathered about the interviewer (gender, mother tongue, knowledge of other foreign languages and familiarity with the learner) as it was thought to have a possible influence on the learner's production (for example, a learner may be more likely to resort to words from his/her mother tongue if the interviewer has knowledge of this language; see also Chapter 13, this volume, on the potential impact of the relationship between the interlocutors).

Not all these variables should necessarily be controlled for, but (at least) some of them should be recorded. In other words, the learner corpus compiler does not have to take all these variables into account when deciding what to include in the learner corpus, but s/he should keep a record of as many of them as possible so that their impact on the learner's linguistic behaviour can be assessed (see Chapters 15, 18 and 19, this volume, on the attested influence of some of these variables). When designing a learner corpus, the researcher should therefore identify the features that will be shared by all the data (i.e. variables that are kept constant) and those that can vary across the data. S/he may, for instance, want to restrict the learner corpus to data produced by Italian-speaking learners with at least five years of learning Spanish, but leave it unspecified whether the learners should have spent time in a Spanish-speaking country or not (although this variable may still be recorded).


Note that some variables are less likely to be kept constant in a learner corpus, for example gender: whenever possible, corpus designers will seek to strike a balance between male and female learners rather than targeting only males or females.

Usually, information about the variables is gathered through a form that is (partly) completed by the learner. This 'learner profile questionnaire' is often combined with the consent form that learners are required to sign if they allow their data to be used for research purposes. For certain variables, it might be necessary to have the learners take a test, for example to determine their proficiency level or motivation. Finally, recording all these metadata is of little use if they are not made available to the corpus user, together with the actual data produced by the learners. As Burnard (2005: 31) puts it, '[w]ithout metadata, the investigator has nothing but disconnected words of unknowable provenance or authenticity'. Minimally, the metadata could be given in the header of each text making up the corpus (ideally in an XML-like format). In ICLE, the metadata are not integrated into the text files directly, but are included in a database which is linked to the text files, so that the user can select any number of variables and then extract the part of the corpus that corresponds to these criteria.
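As an illustration of how such metadata can drive corpus queries, the sketch below stores header attributes with each text and extracts the subcorpus matching a set of variable values. The field names and values (l1, stay_abroad, etc.) are invented for illustration and do not reproduce the actual ICLE database schema.

```python
# Sketch of metadata-driven subcorpus selection, in the spirit of the
# ICLE interface described above. Headers and values are invented.
import xml.etree.ElementTree as ET

files = {
    "text001.xml": """<text id="text001">
        <header l1="Italian" years_spanish="6" stay_abroad="yes" gender="F"/>
        <body>Texto del aprendiz...</body>
    </text>""",
    "text002.xml": """<text id="text002">
        <header l1="Italian" years_spanish="3" stay_abroad="no" gender="M"/>
        <body>Otro texto...</body>
    </text>""",
}

def select(files, **criteria):
    """Return the texts whose header matches all the given variable values."""
    hits = []
    for name, xml in files.items():
        header = ET.fromstring(xml).find("header")
        if all(header.get(k) == v for k, v in criteria.items()):
            hits.append(name)
    return hits

# e.g. Italian-speaking learners who have spent time abroad:
print(select(files, l1="Italian", stay_abroad="yes"))   # ['text001.xml']
```

The same principle scales to a separate database table linked to the text files, as in ICLE: the query logic stays the same, only the storage changes.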

2.3 Collection of learner corpora

Once the learner corpus has been carefully designed, the first concrete step in collecting it is to select the subjects who will contribute to it. In practice, the learners tend to be recruited among the students with whom the compiler is in (direct or indirect) contact. When the task performed is integrated into the students' pedagogical activities (e.g. an essay written within the frame of an exam), all the students may be expected to participate, and the selection will then be based on which students gave permission for their data to be used and which fulfil the criteria established during the design of the corpus (see Section 2.2). When the task performed is not part of the learners' normal curriculum, on the other hand, the compiler will often be dependent on their willingness to participate voluntarily. In this case, self-selection may introduce a bias in that certain types of learners may be more likely to volunteer than others (e.g. female rather than male learners, learners who are self-confident, motivated and/or consider their proficiency level to be relatively high). This, some may argue, can compromise the balance and representativeness of the learner corpus. However, it should be emphasised that the compiler is still free to remove some data from the corpus if they do not match the predefined criteria. Furthermore, as McEnery et al. (2006: 73) point out, the notions of balance and representativeness should be 'interpreted in relative terms, i.e. a corpus should only be as representative as possible of the language variety under consideration', as '[c]orpus-building is of necessity a marriage of perfection and pragmatism'.


CEDEL2 is an example of a corpus where the contributors, while volunteers and hence self-selected, come from a large, diverse and thus presumably representative pool of learners, since calls for participation were distributed via a wide range of mailing lists and the learners could contribute data to the corpus via an online application from anywhere in the world (see Lozano and Mendikoetxea 2013 and Section 3.1).

The next steps involved in the collection of a learner corpus differ widely depending on the type of corpus that is collected. In what follows, a major distinction will be made between written and spoken learner corpora. Some other types of learner corpora will also be mentioned in passing.

Written learner corpora start with either handwritten or typed texts. Handwriting was the norm when the first learner corpora were compiled, which involved keyboarding by the researcher. This part can be quite tricky, as the texts have to be reproduced exactly as they are, including the learners' errors but without introducing additional ones. Illegible handwriting can further complicate the task. Having typewritten texts scanned and converted through optical character recognition is another method of collection (here again the researcher should check that the result is an exact reproduction of the learner's output), but today most written learner corpora start straight from computerised versions of the learners' texts, either transferred electronically to the corpus compiler or directly uploaded (and even typed) via an online interface, which can also serve to collect the metadata related to the learner and to the text produced. Once the raw texts have been collected, some mark-up may be added, such as a header containing a reference and details about the text, or metatextual information within the text itself, indicating, for example, formatting and layout properties. It may also be necessary to identify (by means of special tags) and/or remove some chunks of text, especially quotations (which do not represent the learner's own use of language and may therefore have to be excluded from the analysis of the corpus) and elements that may reveal the learner's identity; a minimal sketch of this step is given below.

If the learner corpus design is cross-sectional or quasi-longitudinal (see Section 2.1), the corpus compilation is complete once the data of all the selected learners have been collected. For longitudinal written learner corpora, the above procedure has to be repeated among the same learners at different points in time, as many times as required, depending on the desired density of the corpus.
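The mark-up and anonymisation step just described might look as follows in practice. This is a minimal sketch: the tag names (quote, name), the metadata fields and the list of identifying strings are all invented for illustration and do not correspond to any particular corpus's conventions.

```python
# Sketch of post-collection mark-up: wrap a raw learner text in a header
# and tag material that should not count as learner language.
import re

def mark_up(raw_text, text_id, metadata, known_names=()):
    # Tag direct quotations so they can be excluded from analysis.
    text = re.sub(r'"([^"]*)"', r'<quote>"\1"</quote>', raw_text)
    # Mask elements that could reveal the learner's identity.
    for name in known_names:
        text = text.replace(name, "<name/>")
    header = "".join(f"<{k}>{v}</{k}>" for k, v in metadata.items())
    return f'<text id="{text_id}"><header>{header}</header>\n{text}\n</text>'

print(mark_up('As Smith says, "culture matters". I lived in London.',
              "text003", {"l1": "French", "level": "intermediate"},
              known_names=("London",)))
```

In a real project the list of identifying strings would come from the learner profile questionnaire (names, institutions, home towns) rather than being hard-coded.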


Spoken corpora start from a sound, not a text. Collecting spoken learner data therefore requires as an initial step that the spoken output be recorded. This should be done with high-quality equipment so that the sound files are fully exploitable, also for phonetic purposes. The first spoken learner corpora were recorded on cassette tapes, which had to be digitised when more modern technologies became available. Nowadays, most recording equipment produces sound files which can be imported straight onto a computer.

The recordings form the basis of transcription, that is, the transformation of an oral format into a written one. The transcription process can be performed via a simple text-editing program or using more sophisticated tools, e.g. transcript editors like CLAN, Praat or EXMARaLDA (see Chapter 6, this volume), which, by showing the waveform or spectrogram of the audio files, can facilitate the transcription process. Varying degrees of precision can be aimed at when transcribing the data, from very basic orthographic transcription, which just seeks to reproduce the words uttered by the learners, to very detailed phonological and phonetic transcription, which shows how the words were actually pronounced by the learners; both orthographic and phonological/phonetic transcription can be more or less broad or narrow. For obvious reasons of economy (see below), most spoken learner corpora are made up of orthographic transcripts. The LeaP (Learning Prosody in a Foreign Language) corpus is one of the few exceptions: next to a word tier which contains an orthographic transcription of the data, it includes tiers for syllables, segments, tone and pitch (Gut 2012; see Section 3.3). The degree of delicacy of the transcription will mainly depend on the resources available (time and money) and the research purposes. If the corpus is primarily compiled to carry out lexical analyses of spoken interlanguage, then an orthographic transcription is probably sufficient; if, on the other hand, the main goal is to investigate learners' pronunciation and prosody, it might be worth investing in a narrower type of phonetic transcription. More often than not, however, a (spoken) learner corpus is compiled with no one particular research question in mind, or at least with a view to allowing the larger community of linguists to benefit from it as well. In such cases, pragmatism may prevail over perfection (see McEnery et al.'s (2006) quotation earlier in this section) and the compilers may decide to keep the transcription relatively broad, not only to reduce the costs and efforts involved, but also in acceptance of the fact that a spoken learner corpus, however delicate its transcription, will never answer all of the questions that a syntactician, semanticist, phonetician or SLA specialist may want to study, and that the user of the corpus may therefore have to add a level of transcription him-/herself before embarking on a specific research project.

It should be underlined at this stage that even a 'simple' orthographic type of transcription can be quite costly. Ballier and Martin (2013: 33) estimate that one word of 'simple' orthographic transcription costs about one euro. In terms of time, it was calculated within the framework of the LINDSEI project that each minute of learner speech requires some twenty to thirty minutes for transcription (including post-transcription checks); the sketch below turns these figures into a rough budget.
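Taking the figures just cited at face value, a rough transcription budget can be computed as follows. Only the per-minute effort and per-word cost come from the sources quoted above; the corpus size and the assumed speech rate are illustrative assumptions.

```python
# Back-of-the-envelope transcription budget for an orthographically
# transcribed spoken learner corpus, using the figures cited in the text.
hours_of_speech = 10                    # assumed corpus size
minutes = hours_of_speech * 60

# LINDSEI estimate: 20-30 minutes of work per minute of speech.
low, high = minutes * 20 / 60, minutes * 30 / 60   # person-hours
print(f"Transcription time: {low:.0f}-{high:.0f} person-hours")
# -> Transcription time: 200-300 person-hours

# Ballier and Martin (2013): roughly one euro per transcribed word.
words_per_minute = 100                  # assumed average speech rate
cost = minutes * words_per_minute * 1.0
print(f"Approximate cost: {cost:,.0f} euros")
# -> Approximate cost: 60,000 euros
```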


In addition, transcribing speech verbatim is a complex undertaking. If this is true of any type of speech, it is all the more so of learner speech, which tends to be difficult to decode because of the many dysfluencies and errors (including pronunciation errors) that it contains and that have to be transcribed (Gilquin and De Cock 2011). This has been shown to lead to 'a higher degree of perceptual reconstruction by the transcriber in L2 than in L1' (Detey 2012: 234), possibly influenced by the transcriber's L1 (Bonaventura et al. 2000), and to a substantially lower rate of inter-transcriber agreement (Zechner 2009). In other words, because transcribers sometimes hear different things when listening to learner speech, they may come up with different transcriptions of the same stretch of discourse.

There are also more specific problems when transcribing spoken interlanguage, such as the issue of how to deal with deviant forms (e.g. choregraphy instead of choreography or womans instead of women). While reproducing the deviant form may prevent it from being extracted automatically from the corpus (if the researcher uses the standard form as a search item), normalising the form results in a loss of information (in the case of choregraphy, for instance, the possible influence of the L1 if the learner is French-speaking, as the equivalent French word is chorégraphie). Matters get even worse when mispronunciation results in a different word (e.g. law pronounced as low or dessert pronounced as desert) or when the word simply does not exist in the target language. Admittedly, confused pairs of words or invented words occur in written learner corpora too and can present problems for the automatic extraction of words, but in a written corpus it is the learner who selects a particular spelling, whereas in a spoken corpus it is the transcriber who is responsible for choosing a certain transcription. One way out is to record both the attested form and a normalised target form, as illustrated in the sketch below.

As a final note on transcription, it must be said that attempts have been made to transcribe learner speech (semi-)automatically, either as a first step before manual correction (see Bonaventura et al. 2000) or with the aim of developing automatic speech recognition software applicable to non-native speech (e.g. Wang and Schultz 2003). However, it is fair to say that there is still a long way to go before spontaneous learner speech can be transcribed accurately in a fully automatic manner, so that for years or even decades to come, researchers will probably have to go through the arduous and time-consuming process of manual transcription (unless they can have 'Turkers' do the work for them, see Evanini et al. 2010).15

In the case of longitudinal spoken learner corpora, as with longitudinal written learner corpora, the procedure has to be repeated at several points in time. Multimodal learner corpora, being made up of video-recorded speech, rely on some of the steps described above for their compilation. In comparison with spoken learner corpora, they require video recording of the speech event and may involve some sort of 'transcription' of the video as well (e.g. indication of the gestures made by the learner).

Turkers are users of the crowdsourcing Amazon Mechanical Turk platform (www.mturk.com, last accessed on 13 April 2015) who get paid to perform (usually simple) tasks online.

Downloaded from https:/www.cambridge.org/core. University of Liverpool Library, on 06 Mar 2017 at 21:48:02, subject to the Cambridge Core terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781139649414.002

22

GILQUIN

video so that the two can be examined and queried simultaneously. Since these features are common to all corpora, native and non-native, they will not be further discussed here. As for the post-processing of learner corpora (lemmatisation, part-of-speech tagging, error annotation, phonetic annotation, etc.), this will be dealt with in Chapters 5 to 7 (this volume).
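As an illustration of the retrieval problem raised above (a deviant form such as choregraphy escaping a search for the standard form), approximate string matching offers a partial workaround at query time. The sketch below is a minimal illustration using Python's standard difflib module; the toy token list and the 0.8 similarity cut-off are arbitrary choices rather than features of any existing concordancer.

    import difflib

    # Toy token list standing in for a transcribed interview.
    tokens = ['the', 'choregraphy', 'of', 'the', 'dance', 'and', 'the',
              'choreography', 'of', 'the', 'second', 'one']

    def fuzzy_search(target, tokens, cutoff=0.8):
        """Return (position, token, similarity) for tokens close to the target form."""
        hits = []
        for i, tok in enumerate(tokens):
            ratio = difflib.SequenceMatcher(None, target, tok).ratio()
            if ratio >= cutoff:
                hits.append((i, tok, round(ratio, 2)))
        return hits

    # A search for the standard form now also retrieves the deviant spelling.
    print(fuzzy_search('choreography', tokens))
    # [(1, 'choregraphy', 0.96), (7, 'choreography', 1.0)]

Lowering the cut-off increases recall at the cost of precision, and the hits still need manual vetting, but such a step can recover deviant forms that an exact-match query would miss.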

3 Representative studies

This section presents three studies which describe the design and collection of different types of learner corpora. The first one, by Lozano and Mendikoetxea (2013), deals with the compilation of a written learner corpus, while the other two concern the compilation of a spoken learner corpus. In the case of Jendryczka-Wierszycka (2009), the corpus can be defined as a mute spoken learner corpus, whereas in the case of Gut (2012) it is a speech learner corpus with speech–text alignment (see Section 2.1 on the distinction between mute and speech corpora). The three studies also represent a range of target languages: Spanish in Lozano and Mendikoetxea (2013), English in Jendryczka-Wierszycka (2009), and English and German in Gut (2012).

3.1 Lozano, C. and Mendikoetxea, A. 2013. 'Learner corpora and second language acquisition: The design and collection of CEDEL2', in Díaz-Negrillo, A., Ballier, N. and Thompson, P. (eds.), Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: Benjamins, pp. 65–100.

Lozano and Mendikoetxea (2013) describe the compilation of CEDEL2 (see above and Lozano 2009a), a cross-sectional corpus of L2 Spanish compositions written by English-speaking learners. They demonstrate that CEDEL2 is a well-designed and carefully constructed corpus by showing how it follows the ten key design criteria set out by Sinclair (2005). Their main arguments are summarised here:

1. Content selection: the texts included in CEDEL2 have not been selected on the basis of the language they contain; they are supposed to represent the use of learner Spanish under natural conditions.
2. Representativeness: CEDEL2 represents a large sample of learners, from all proficiency levels and writing on a wide range of topics.
3. Contrast: a corpus of native Spanish, designed according to the same criteria as the learner corpus, allows for legitimate comparisons between native and non-native writing.
4. Structural criteria: CEDEL2 is simply structured according to the writers' L1 (English or, for the native comparable corpus, Spanish) and the learners' proficiency level (beginner, intermediate or advanced).
5. Annotation: tags (in XML format) are stored separately from the texts (in raw text format).
6. Sample size: CEDEL2 is made up of complete, unedited texts which may vary in length.
7. Documentation: detailed information about the structure of CEDEL2, as well as about the learners who contributed to it and their compositions, is available.
8. Balance: while limited to written language, CEDEL2, as a specialised corpus, is claimed to provide a good basis for the study of interlanguage phenomena.
9. Topic: the twelve composition topics writers could select from are assumed to be varied enough to elicit a large range of linguistic phenomena.
10. Homogeneity: texts submitted online that do not satisfy the design criteria are not included in CEDEL2.

In accordance with the need for metadata underlined in Section 2.2, Lozano and Mendikoetxea (2013) also explain how they have collected information through two forms to be completed by each participant: (i) a learning background form, which asks for writers' personal details (age, gender, institution, etc.), linguistic details (L1, parents' L1, stay in Spanish-speaking countries, etc.) and self-rated proficiency in speaking, listening, reading and writing (in Spanish and in other languages they may have learned); (ii) a composition form, which includes the composition itself, but also information about background research (did the writer conduct any research before writing the composition, and if so, how long and by what means?), composition title (among the twelve possible topics), writing location (in class, at home or both) and writing tools if any (dictionaries, spell-checkers, native help, etc.).

In addition to considerations concerning the design of CEDEL2, Lozano and Mendikoetxea (2013) describe the current state of the corpus (which included about 750,000 words produced by some 2,500 participants in March 2011 but continues to be expanded, with an intended target of 1 million words), the distribution of the data it contains and the preliminary post-processing it has undergone. What is particularly interesting about this corpus is that, unlike most learner corpora which are collected in a small number of environments (often depending on the location of the researchers involved in their compilation, see Section 2.3), CEDEL2 was collected via a web application, through which speakers of Spanish all over the world were invited to contribute. This results in a wide range of writer profiles, using different varieties of (learner and native) Spanish. Another advantage of the corpus is that it comes with an assessment of each learner's proficiency level. This is done via the learning background form, which requires learners to self-rate their proficiency in the four skills (see above). In addition, learners' actual proficiency is assessed by means of an independent and standardised placement test, the University of Wisconsin placement test, which the participants can take online. As
will be noted in Section 4, proficiency is a variable that is often lacking (or determined with insufficient precision) in learner corpora, and this double proficiency measure in CEDEL2 is therefore a major asset (also because, as suggested by the authors, it allows the comparison between self-rated and real proficiency). The compilation of a native counterpart to the learner corpus should be underlined as well, as it makes it possible to compare native and non-native writing using data that are fully comparable since they were collected according to the same design criteria. Finally, it is noteworthy that the paper is written from an SLA perspective and that CEDEL2 is designed to answer questions that are (also) of interest to SLA researchers. The paper and the corpus therefore represent a laudable attempt to bring learner corpus research and SLA closer together (see Chapter 14, this volume, on the relation between the two fields), on the grounds that 'if corpus-based research is going to make a significant contribution to the field of SLA, new, well-designed corpora need to be made available to the research community' (p. 89).

3.2 Jendryczka-Wierszycka, J. 2009. 'Collecting spoken learner data: Challenges and benefits. A Polish L1 perspective', in Mahlberg, M., González-Díaz, V. and Smith, C. (eds.), Proceedings of the Corpus Linguistics Conference, University of Liverpool, UK, 20–23 July 2009.

Written by one of the partners in the LINDSEI project, Jendryczka-Wierszycka's (2009) paper has the interesting feature that it not only describes the compilation of a component of LINDSEI (the Polish component), but also underlines the many challenges one can face when collecting spoken learner data. Jendryczka-Wierszycka starts by introducing the project as a whole, meant as a spoken counterpart to ICLE.16 Being a multi-L1 learner corpus, LINDSEI is made up of several subcorpora that are each compiled according to the same principles, which ensures comparability across the different subcorpora. These principles include the fact that the data consist of informal interviews and that the participants are advanced foreign language learners of English. The structure of the interviews, in three parts, also needs to be adhered to: after choosing one among three set topics and talking about it for a few minutes, the learners answer questions about what they have just said and about more general topics like hobbies or life at university, and finally they are asked to describe a four-picture cartoon. In addition, a learner profile questionnaire has to be completed by every learner who contributes to LINDSEI. The questionnaire gathers information about the learner (age, gender, stay in an English-speaking country, other foreign languages known, etc.) and also includes the learner's consent for his/her data to be used for research purposes.

16 See the LINDSEI handbook (Gilquin et al. 2010) for more details.
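Questionnaire variables of this kind are easiest to exploit later if they are stored in a structured, machine-readable form from the start. The sketch below models such a record in Python; the field names are illustrative and do not reproduce the actual LINDSEI learner profile.

    from dataclasses import dataclass, asdict, field
    from typing import List, Optional

    @dataclass
    class LearnerProfile:
        """One learner's metadata record; field names are illustrative only."""
        learner_id: str
        age: int
        gender: str
        l1: str
        other_foreign_languages: List[str] = field(default_factory=list)
        months_in_english_speaking_country: int = 0
        year_at_university: int = 3
        consent_given: bool = False
        cefr_level: Optional[str] = None   # filled in if independently assessed

    profile = LearnerProfile(learner_id='PL001', age=22, gender='F', l1='Polish',
                             other_foreign_languages=['German'],
                             months_in_english_speaking_country=2,
                             consent_given=True)
    print(asdict(profile))   # ready to be serialised alongside the transcript

Keeping such records in a fixed format (e.g. exact ages rather than age ranges) also anticipates the standardisation of metadata called for in Section 4 below.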


Among the challenges mentioned by Jendryczka-Wierszycka, the first one has to do with student recruitment. She notes that '[s]ince getting people's time, even for the sake of science, is well-known not to be the easiest task if there is no money involved, we had low expectations of the number of volunteers for our corpus'. In the end, Jendryczka-Wierszycka was able to recruit fifty-one students by appealing to them during lectures and by announcing a prize draw among the volunteers.

In the description of these students' characteristics, which are also summarised in an appendix, another problem is alluded to, which is valid for all the LINDSEI subcorpora (and many other learner corpora too, see Section 4), namely the identification of the learners' proficiency level. In LINDSEI, all learners should be in their third or fourth year at university, and on this basis are expected to be (upper-intermediate to) advanced learners of English. However, Jendryczka-Wierszycka rightly points out that this 'may be a faulty assumption as the level naturally differs from one university to another even within one country, not to mention university level differences worldwide'. In the particular case of Polish learners, she adds that the quality of English classes that are taught in different schools across Poland is so uneven that the number of years of English at school cannot be a good indication of the learners' proficiency either.

Another major challenge that is described at length in the paper is the transcription of the data. Although LINDSEI comes with its own transcription guidelines, which are outlined on the project website,17 Jendryczka-Wierszycka recognises that transcribing the interviews (which was done with the help of the SoundScriber software)18 was not an easy task. Besides technological difficulties (one of the recorded interviews would not play back), some passages were unintelligible, due, among other things, to overlapping speech and external noises, and certain items (like fillers) proved particularly hard to transcribe consistently. Jendryczka-Wierszycka explains that in the transcription process she was helped by a group of M.A. English linguistics students. While this reduced the time of transcription considerably, it also involved training of the transcribers, good coordination of the group and correction of the transcripts by the coordinator. All in all, Jendryczka-Wierszycka notes that each interview (lasting about fifteen minutes) required an average of five hours to be transcribed, which included listening to the sound file at least twice and making a final check of the transcript. The transcription phase took seven months, as against two months for the recording of the interviews.

17 See www.uclouvain.be/en-cecl-lindsei.html (last accessed on 13 April 2015). Note that the conventions described by Jendryczka-Wierszycka correspond to an earlier version of the transcription guidelines. Thus, the use of square brackets and vertical alignment to signal overlapping speech has now been replaced by the <overlap /> tag, which is inserted in each of the two overlapping utterances, while foreign words are now marked by means of <foreign> … </foreign>, in lieu of italics.

18 www-personal.umich.edu/~ebreck/code/sscriber/ (last accessed on 13 April 2015).
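These effort figures can be cross-checked against the LINDSEI-wide estimate cited earlier in this chapter (twenty to thirty minutes of transcription per minute of learner speech). The back-of-the-envelope calculation below uses only the numbers reported above:

    # Figures reported above for the Polish component of LINDSEI.
    interviews = 51           # one interview per recruited student
    interview_minutes = 15    # approximate length of an interview
    hours_per_interview = 5   # average transcription time, final checks included

    # Transcription effort per minute of recorded speech:
    print(hours_per_interview * 60 / interview_minutes)   # 20.0 minutes per minute

    # Rough total transcription workload for the whole subcorpus:
    print(interviews * hours_per_interview)               # 255 person-hours

At twenty minutes per recorded minute, the Polish figures thus sit at the lower end of the range quoted for the project as a whole.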


The paper also provides some useful statistics about the composition of the Polish component of LINDSEI, as well as a summary of two case studies published by the same author: one on vague language and the other on discourse markers. Finally, it briefly describes the author's attempt to apply a part-of-speech (POS) tagger designed for the annotation of native data, CLAWS, to learner data, underlining the problems that appeared and proposing some possible solutions to them. The paper ends with the hope that the corpus can become a useful resource not only for linguists but also for language teachers and translators.

3.3 Gut, U. 2012. 'The LeaP corpus. A multilingual corpus of spoken learner German and learner English', in Schmidt, Th. and Wörner, K. (eds.), Multilingual Corpora and Multilingual Corpus Analysis. Amsterdam: Benjamins, pp. 3–23.

This paper by Gut (2012) describes the compilation and annotation of the LeaP corpus, a corpus of spoken learner German and learner English totalling over twelve hours of recording which, as its name indicates (LeaP stands for Learning Prosody in a Foreign Language), is primarily aimed at studying the second language acquisition of prosody.19 Unlike a learner corpus like LINDSEI, which is made up of data from a rather specific learner population (young, relatively advanced learners in their third or fourth year at university), the LeaP corpus was designed to be as representative as possible of the German and English interlanguages, including data from a wide range of learners (seventeen mother-tongue backgrounds, ages from 18 to 60, first contact with the target language from 1 to 33 years of age, etc.). Certain groups of learners were selected so as to answer predefined research questions, e.g. a group of very advanced, native-like learners 'to test what type of ultimate phonological attainment is possible' (p. 5) and a group of learners who were recorded before and after a course in pronunciation to measure the impact of formal training. In addition, a few native speakers of German and English were recorded as a baseline. The corpus also includes different types of speech: free speech collected in a semi-structured interview setting, prepared reading of a story, semi-spontaneous retelling of this story and reading of a list of nonsense words. To ensure the high quality of the audio files and their possible phonological exploitation, the recordings took place in a sound-treated chamber.

19 The LeaP corpus is freely available for research purposes at http://corpus1.mpi.nl/ds/imdi_browser/?openpath=MPI671070%23 (last accessed on 13 April 2015).

The paper also describes the detailed transcription and annotation of the corpus. Being a speech (rather than mute) learner corpus (see Section 2.1), the LeaP corpus is distributed with its audio files in the form of time-aligned phonological and phonetic transcriptions, where the transcription is linked to the corresponding part of the recording by means of time-stamps set at the beginning and end of each relevant unit (word, syllable, phoneme, etc.). This text-to-sound alignment makes it possible, by means of appropriate software like Praat (which was used to annotate the LeaP corpus), to have simultaneous access to transcription and sound – the latter represented in Praat by a waveform, a spectrogram and a pitch track. Annotations are similarly aligned, with each type of annotation constituting an individual tier. The LeaP corpus contains up to eight tiers, six of them carried out manually and the other two added automatically:

1. Phrase tier: division into intonation phrases, with indication of interrupted intonation phrases, unfilled pauses, hesitation phenomena, elongated phonemes and some non-speech events (noise, breath, laughter).
2. Word tier: orthographic transcription and annotation of the beginning and end of each word.
3. Syllable tier: broad phonetic transcription and annotation of the beginning and end of each syllable.
4. Segment tier: vocalic intervals, consonantal intervals and intervening pauses.
5. Tone tier: pitch accents and boundary tones.
6. Pitch tier: phonetic properties of pitch range (initial high pitch, final low pitch, intervening pitch peaks and valleys).
7. POS tier: automatic annotation of parts of speech.
8. Lemma tier: automatic annotation of lemmas.

For each minute of recording, an average of 1,000 events were annotated. This was done by six annotators who received training in annotation (criteria for the division into intonation phrases, annotation schemes, etc.). The reliability of the manual annotation was measured by means of inter-annotator agreement (to what extent do all annotators agree on the annotation of the same recording, i.e. how stable are the annotations?) and intra-annotator agreement (to what extent does an annotator agree with him-/herself when annotating the same recording twice, i.e. how reproducible are the annotations?). Both measurements yielded differing results, depending on the complexity of the task, so that '[t]he higher the number of different categories in an annotation scheme, the lower the agreement' (p. 11). Experience with annotation was also shown to have a positive influence on the reproducibility of the annotations.

The corpus includes metadata (which Gut refers to as 'non-linguistic annotation') comprising a wide range of information such as date and place of the recording, gender, age and native language of the learner, duration and type of stays abroad, prosodic knowledge, motivation and attitude towards the target language, importance self-attributed to competence in pronunciation, and even experience and ability in music and in acting. These metadata are integrated into the corpus data, which are in an XML-based format specially developed for the
corpus, the Time Aligned Signal data eXchange (TASX) format. Analysis of the corpus is possible on the basis of this format or through conversion to other file formats compatible with various search tools. The paper ends with an illustration of how the LeaP corpus can be used to explore fluency in learner German and learner English, thus convincingly demonstrating that ‘a corpus with rich annotations and a standardised data format despite having a relatively small size offers numerous possibilities of testing previous concepts and claims in L2 acquisition research’ (p. 20).
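The agreement measurements reported for LeaP lend themselves to a small worked example. Agreement of this kind is often quantified with chance-corrected coefficients such as Cohen's kappa, which discounts the agreement two annotators would reach by guessing; the sketch below is a generic Python illustration of that statistic, not necessarily the measure used in the LeaP project.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Chance-corrected agreement between two annotators over the same items."""
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        freq_a, freq_b = Counter(labels_a), Counter(labels_b)
        # Probability of agreeing by chance, given each annotator's label distribution.
        expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
        return (observed - expected) / (1 - expected)

    # Two annotators labelling ten intonation phrases as rising (R) or falling (F).
    annotator_1 = ['R', 'R', 'F', 'F', 'R', 'F', 'R', 'R', 'F', 'R']
    annotator_2 = ['R', 'F', 'F', 'F', 'R', 'F', 'R', 'R', 'R', 'R']
    print(round(cohens_kappa(annotator_1, annotator_2), 2))
    # 0.58: raw agreement is 0.8, but 0.52 of it would be expected by chance

Because such coefficients correct for chance, they make agreement comparable across annotation schemes of different complexity, which matters given Gut's observation that agreement drops as the number of categories increases.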

4 Critical assessment and future directions

In this section, we will reconsider the three core issues outlined in Section 2 from a more critical perspective, pointing to certain limitations of the typology, design and collection of learner corpora, and adding some suggestions for possible future developments in these three areas.

In terms of learner corpus typology, it must be recognised that there is a striking imbalance between the types of learner corpora that are currently available. There are more written than spoken corpora, more general than specific corpora, more corpora of English than of any other language, and more corpora containing cross-sectional than longitudinal data. This imbalance is a partial reflection of the ease with which data can be collected: the transcription of spoken texts is more time consuming than the keyboarding, scanning or electronic collating of written texts, learners of English for general purposes are more numerous than learners of English for specific purposes and, especially, learners of other languages, and multiple data collection from the same learners requires heavier logistics than one-off data collection. Because of researchers' tendency to collect data from learners who are easy to reach (see Section 2.3), we also notice a predominance of learner corpora representing relatively advanced university students, often majoring in the target language, whereas beginners and young learners are less often represented.20

20 An exception is the International Corpus of Crosslinguistic Interlanguage (ICCI), which consists of data produced by beginner to lower-intermediate learners of English; see http://cblle.tufs.ac.jp/llc/icci/ (last accessed on 13 April 2015).

While all sorts of practical constraints may make it difficult to collect certain types of learner data, a collective effort should nonetheless be made not only to enlarge our repertoire of learner corpora but also to diversify it, as the types of learner corpora that are lacking are bound to provide invaluable information about interlanguage (see, e.g., Chapter 17, this volume, on longitudinal corpora). In particular, two types of learner corpora that are currently extremely rare seem to hold special promise: multimodal and local learner corpora. While the
former open up a whole range of new possibilities in the study of interlanguage by adding the picture to the text and sound, the latter invite teachers and students alike into the field of learner corpus research by making them both providers and beneficiaries, thus resulting in learner corpora being directly useful to those for whom, ultimately, they have been compiled. It might also be interesting to collect more multilingual mono-L1 learner corpora like the USP Multilingual Learner Corpus (Tagnin 2006), which contains data produced by Brazilian learners in different foreign languages. Unlike monolingual multi-L1 learner corpora (like ICLE), which make it possible to identify the universal vs L1-specific problems learners encounter when learning a specific target language, such corpora offer the opposite perspective and show what difficulties learners of a given mother-tongue background experience when learning a foreign language.

As for the types of learner corpora that are likely to appear in the near future, it seems as if we might be moving towards multidimensional learner corpora (or perhaps databases) that contain several subsets of data designed according to similar criteria but representing different language varieties (including a native counterpart), different media, different genres, different tasks, different acquisition settings, etc., so that legitimate comparisons can be drawn and reliable statements can be made about the possible influence of these factors. In view of the growing awareness of the internal variation of learner corpora and the individuality of learners (see Gilquin and Granger 2015), it may be expected that, whenever possible, these data will be collected from the same learners, who will be required, say, to produce spoken and written interlanguage, write an essay in timed and untimed conditions, describe a picture in the target language and in their mother tongue, or participate in linguistic experiments whose results will be recorded and then confronted with their more natural production. The ASU Corpus (Hammarberg 2010) is a first step in this direction, since it contains spoken and written data produced at regular intervals and through various tasks by the same learners of Swedish, and also includes a native Swedish counterpart built in a similar way (though, obviously, with different informants).

Moving on to design criteria, we can applaud the fact that most learner corpora are built according to some (more or less strict) criteria and, above all, that they are often accompanied by information about the profile of the learners who contributed to the corpus and about the circumstances in which the data were contributed. As regards the choice of variables that are recorded, we can only agree with Granger (2004: 126) that 'there are so many variables that influence learner output that one cannot realistically expect ready-made learner corpora to contain all the variables for which one may want to control'. One variable that is of crucial importance but whose identification has often been less than optimal, however, is that of proficiency. In ICLE, for example, proficiency is determined by means of external criteria like age and number of years of English at
university (Granger 1998a: 9). Yet, several linguists have objected that such external criteria do not necessarily offer an accurate representation of a learner's proficiency level (e.g. Thomas 1994; Pendar and Chapelle 2008; see also Section 3.2)21 and that a more objective measure of proficiency should therefore be provided. As explained by Carlsen (2012: 165), there are two ways of doing this: through learner-centred or text-centred methods. The former (which also include external criteria like age) determine proficiency by examining the learner's characteristics, for example by having him/her take an independent, standardised proficiency test like the University of Wisconsin placement test for CEDEL2 (see Section 3.1) or the Oxford Quick Placement Test for MiLC and WriCLE22 (cf. Mediero Durán and Robles Baena 2012). Text-centred methods, on the other hand, examine the text itself to establish proficiency, as was done with the ASK corpus of Norwegian L2, whose individual texts were rated according to the Common European Framework of Reference for Languages (Council of Europe 2001) – see Carlsen (2012). In effect, such measures provide 'a description of the quality of one single essay produced by each individual learner, rather than a more independent assessment of the learners' overall proficiency' (Thewissen 2013: 79). Each method has its disadvantages: assessment on the basis of an independent placement test may not reflect the level of the text since 'one and the same learner may perform slightly differently from one day to the next or from one test to another' (Carlsen 2012: 168), whilst assessment on the basis of the corpus texts runs the risk of circularity as the texts are rated according to their linguistic features and then analysed linguistically to say something about the learner's proficiency (see Hulstijn 2010). However, both types of measure constitute an improvement over impressionistic evaluation of the proficiency level and should thus be encouraged in the design of learner corpora. In this respect, we can certainly welcome the fact that some of the most recent learner corpus projects have integrated an objective proficiency score in their design (see above examples).

21 This has been confirmed by a CEFR (Common European Framework of Reference for Languages; Council of Europe 2001) evaluation of a sample of essays from ICLE which, on the basis of the external criterion of number of years of English at university, were supposed to represent the same level, but whose actual scores ranged from B2 and lower (40%) to C2 (less than 20%) (see Granger et al. 2009: 11–12). The same sort of contrast emerged from the CEFR rating of a sample of LINDSEI (see Gilquin et al. 2010: 10–11).

22 WriCLE stands for Written Corpus of Learner English; see www.uam.es/proyectosinv/woslac/Wricle/ (last accessed on 13 April 2015).

Another variable that could usefully be improved is that of exposure to the target language. While traditionally this has been limited to a description of the acquisition setting (foreign or second language environment), the number of years of instruction in the target language and the time spent in a target-language country, there are many other elements that could have an influence on learners' degree of exposure, especially in today's high-tech world, where resources and contacts in other languages
are at a learner’s fingertips. If a learner spends all his/her free time watching TV series or playing multiplayer online games in English, for instance, this is likely to have more impact on his/her knowledge of English than a two-week holiday in the UK with his/her family.23 In an attempt to approximate the learner’s full experience with the target language, information could be gathered about incidental learning in everyday life through reading, entertainment, social networking, etc., along the lines of the questionnaire found in Schmitt and Redwood (2011:  206–7) or, for a much more detailed version, Freed et al.’s (2004a) ‘language contact profile’. Probably as important as the refinement of certain variables, however, is the urgent need to standardise the metadata that come with learner corpora. This would not necessarily mean that all learner corpora should include exactly the same metadata, but if they do include a certain type of information, it should follow a specific format (e.g. precise age in years and months rather than ranges of years), so that results related to these variables can be compared across different learner corpora. General initiatives have been undertaken to make recommendations about the selection and presentation of metadata (e.g. Dublin Core Metadata Initiative)24 but, to date, similar initiatives specifically concerned with learner corpus metadata are still lacking. Standardisation is key to the successful compilation and encoding of learner corpora too. In addition to adopting good practices such as those recommended by Sinclair (2005) – see Section 3.1 – it would be desirable, in order to increase compatibility between different learner corpora, to follow the same guidelines to represent text in electronic format (e.g. form of corpus headers, transposition of typographical features, indication of quotations, conventions of transcription). Again, such initiatives exist, like the Text Encoding Initiative (TEI),25 but few learner corpora so far have applied these standards (for an exception, see BACKBONE, a corpus whose annotation relies on TEI-compliant XML and which includes a number of interviews with non-native speakers of English; see Kohn 2012).26 Equally important in order to allow the community to benefit from a learner corpus are the availability of detailed documentation describing the compilation of the corpus and, of course, the accessibility of the corpus data (including sound/video files if appropriate) in the first place, so that studies based on these can be replicated and more studies can be undertaken. This is not necessarily obvious: in Schiftner’s (2008) survey, documentation of the learner corpus projects turned out to be ‘scattered and often scarce’ (p. 48), and only half of the learner corpora

23

See the ReCALL special issue edited by Cornillie et al. (2012) on the role of digital games for language learning.

24

http://dublincore.org/ (last accessed on 13 April 2015).

25

www.tei-c.org/ (last accessed on 13 April 2015).

26

Of special interest in this respect is the SACODEYL Annotator, which makes it possible to create XML TEI-compliant annotations (see Pérez-Paredes and Alcaraz-Calero 2009).

Downloaded from https:/www.cambridge.org/core. University of Liverpool Library, on 06 Mar 2017 at 21:48:02, subject to the Cambridge Core terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781139649414.002

32

GILQUIN

were publicly available (though not always free of charge). It might be that one of the reasons (partly) accounting for certain researchers' tendency not to disseminate their learner corpora has to do with the subjects involved in the collection of such corpora, viz. mainly young people, whose participation may require a special set-up.27 Thus, while every effort should be made, whatever the type of corpus, to follow ethical procedures and maintain anonymity, this is all the more crucial with young subjects, whose safety and welfare should be preserved at all costs, during but also after the data collection. This can be particularly problematic with multimodal learner corpora, which capture the participant's voice and face, and which should therefore be anonymised appropriately to conceal his/her identity – while bearing in mind that their complete anonymisation may limit the types of analyses that they allow (e.g. facial feature analysis; see Adolphs and Knight (2010: 43–4) on the anonymisation of multimodal corpora). Obtaining consent for the data to be collected and used for research purposes may also be a complex task, as consent may have to be granted by a parent or guardian if the subject is under-age and/or by a teacher, a headmaster or even a higher-level authority if the data collection takes place in a school; if the learner is old enough to give consent him-/herself, the researcher should make sure that s/he fully understands the nature of the research before signing the consent form.

27 See, e.g., the Guidelines for Research with Children and Young People published by the National Children's Bureau (Shaw et al. 2011).

Finally, looking at what might be the learner corpus of the future, it is likely that new technologies will have a major role to play in how it is collected (see also Chapter 18, this volume). Learner corpora of computer-mediated communication (like Telekorp) are an early illustration of this. Multiplayer online games, mentioned above for their possible impact on learners' knowledge of foreign languages, as well as the recent trend of massive open online courses (MOOCs), can also be used as a way of collecting learner data. Besides computers, data could be collected via smartphones and tablets, whose popularity among young people would contribute to the non-intrusive character of the process. These technologies, because they are part and parcel of the everyday life of the new generation of learners, make it possible to move corpus collection away from the academic setting and into a more natural environment, thus coming closer to the ideal of genuine communications that should be included in a corpus (Sinclair 1996; see Section 1). In a way, this is a natural development, since the knowledge of an L2 is an inherent feature of individuals, which is with them all the time and not just during the (limited) periods when they are actually learning it. It is therefore only normal that learner corpora, if they are to serve as repositories of interlanguages, should strive to reflect learners' full experience with the L2 as accurately as possible.

Key readings

O'Keeffe, A. and McCarthy, M. (eds.) 2010. The Routledge Handbook of Corpus Linguistics. London: Routledge.
The second section of this handbook (pp. 29–103), entitled 'Building and designing a corpus: What are the key considerations?', covers the basics of corpus compilation. It provides a step-by-step guide for how to build a spoken, written, small specialised and audio-visual corpus, and a corpus that represents a certain language variety (like American or academic English). Although the chapters do not deal specifically with learner corpora, they provide information that is relevant to learner corpus compilation, and also include a few references to learner corpora (especially in the chapter on small specialised corpora).

Granger, S. 1998a. 'The computer learner corpus: A versatile new source of data for SLA research', in Granger, S. (ed.), Learner English on Computer. London: Longman, pp. 3–18.
This is one of the founding texts that introduced the learner corpus, situating it within the broader fields of corpus linguistics, second language acquisition and foreign language teaching, describing the main language- and learner-related design criteria relevant to learner corpus building (with an illustration by means of ICLE) and pointing to some of the difficulties involved in compiling a learner corpus.

Pravec, N. A. 2002. 'Survey of learner corpora', ICAME Journal 26: 81–114.
Though slightly dated, this survey of learner corpora of (written) English provides detailed information about the attributes (size, availability of learner background information, format, etc.) of the ten corpora that were available at the time, with a view to helping researchers select the corpus that is the most suitable for their purposes.

Tono, Y. 2003. 'Learner corpora: Design, development and applications', in Archer, D., Rayson, P., Wilson, A. and McEnery, T. (eds.), Proceedings of the Corpus Linguistics 2003 Conference, UCREL Technical Paper 16. Lancaster University, pp. 800–9.
This paper provides a good overview of the considerations that should be kept in mind when compiling (and analysing) a learner corpus. It also provides a survey of some twenty learner corpora and their features (including size, types of subjects and texts), as well as some directions for the future.


Schiftner, B. 2008. 'Learner corpora of English and German: What is their status quo and where are they headed?', Vienna English Working Papers 17(2): 47–78.
In addition to a detailed description of twenty-six English and five German learner corpora, the paper considers developments in English and German learner corpus compilation, with special emphasis on problems related to design and accessibility, and it offers useful suggestions for the compilation of English and German learner corpora, some of which might be relevant to other target languages as well.


3 Learner corpus methodology
Marcus Callies

1 Introduction

In contrast to other types of data that have traditionally been used in second language acquisition (SLA) research, learner corpora provide systematic collections of authentic, continuous and contextualised language use by foreign/second language (L2) learners stored in electronic format. They enable the systematic and (semi-)automatic extraction, visualisation and analysis of large amounts of learner data in a way that was not possible before. Access to and analysis of learner corpus data is greatly facilitated by the digital medium, and their sheer quantity can give SLA theories a more solid empirical foundation alongside experimental data. As is the case with other instruments and techniques of data collection or pre-compiled databases, the choice of method(s) depends on the object(s) of study and the research question(s) being asked, and in turn, findings and results are highly dependent on the method(s) or database(s) chosen. Carrying out research by means of a learner corpus may be conceptualised as a process involving various steps that range from the choice of research approach and the selection of the appropriate corpus to the annotation, extraction, analysis and interpretation of the data (see, e.g., Granger 2012a).

This chapter will provide an overview of current practices, developments, challenges and future perspectives in learner corpus methodology. It first addresses several principal ways in which learner corpora can be used, and then describes the two most commonly practised types of analysis. It also highlights the possibilities and advantages of combining learner corpus data with (quasi-)experimental methods and presents a critical assessment of current practices in learner corpus analysis and an outlook on methodological developments in the field.


2 Core issues

There are several principal ways in which learner corpora can be used methodologically. As in corpus linguistics in general, one can draw a tripartite distinction between corpus-informed, corpus-based and corpus-driven approaches, depending on the kind of evidence the corpus data is needed for, and the degree of involvement of the researcher with respect to data retrieval, analysis and interpretation. It is important to stress that these are not strict distinctions but that the three types partially overlap and merge into one another. Some of the ways learner corpora are used combine two or even all three approaches.

In the corpus-informed approach, a learner corpus is used as a general reference source for information, for example to check a researcher's intuition or to provide evidence for the occurrence of a specific linguistic feature (e.g. the frequency of a certain word, phrase or construction) or a certain type of error. In that sense, researchers make use of information retrieved from a corpus on a meta-level but do not work with corpus data proper.

The corpus-based approach can be narrowly considered 'a methodology that avails itself of the corpus mainly to expound, test or exemplify theories and descriptions that were formulated before large corpora became available to inform language study' (Tognini-Bonelli 2001: 65). In that view, researchers work with corpus data and use them to provide primary evidence for the nature of learner language to either confirm or refute existing hypotheses, often comparing learner language to that of native speakers. However, the notion 'corpus-based' is problematic as it is used not only in this narrow sense but more often in a much wider and more general sense for basically any work that makes use of a corpus.

A third kind of approach is often referred to as corpus-driven and presupposes the least degree of involvement on the part of the researcher in that it is strictly based on computer techniques for data extraction and evaluation. Corpus-driven approaches use minimal prior assumptions about language structure and are more inductive, in that the corpus itself is the data and the patterns of language use it represents are noted as ways of expressing regularities and exceptions in language. The role of the researcher is to formulate questions and to draw conclusions derived from what corpus data reveal when subjected to statistical analysis rather than using the data to test a research hypothesis by approaching a corpus with a number of preconceived ideas. Thus, 'corpus-driven' is used in the sense of 'data-driven' (see Francis 1993: 139).

Another major distinction in the way learner corpora can be explored is between quantitative and qualitative analyses. The research methodology that underlies the quantitative analyses that are typically used in learner corpus research (LCR) is primarily deductive, product-oriented and designed to test a specific hypothesis, which can then be confirmed or rejected, or refined and re-tested. Quantitative data represent 'hard' data in that they are identifiable, classifiable, quantifiable and thus
subject to refined statistical analysis. They are generalisable and more easily replicable than qualitative data. Qualitative analyses, on the other hand, are typical of research methodologies in which the observation of speakers and the description and explanation of their language within naturally occurring social and cultural settings is considered fundamental in revealing the factors that underlie language use. The research approach that underlies qualitative data is primarily heuristic, process- and discovery-oriented, not intended to test a specific hypothesis, but rather to generate hypotheses. It focuses on in-depth investigations of linguistic phenomena grounded in the context of authentic, communicative samples of language, adopting an exploratory, inductive approach to the empirically based study of how the meanings and functions of linguistic forms interact with diverse ecological characteristics of language used for communication (see Hasko 2013a, 2013b).

The majority of learner corpora currently available are cross-sectional in that they comprise data gathered from a large and diverse number of informants that allow the study of L2 learners' language use at a single point in time. Learner corpora also have the potential to enable investigations of the developmental processes that underlie L2 learning (see Chapter 17, this volume). However, researchers have to make do with a relative scarcity of truly longitudinal learner corpora, i.e. corpora that include data collected from individuals or a small group of informants at periodic intervals over a prolonged period of time in order to obtain information about language development (see, e.g., Hasko 2013c). Therefore, L2 development is often examined by means of quasi-/pseudo-longitudinal corpus projects in which data are still collected at a single point in time but from several groups of learners at different proficiency levels (e.g. Thewissen 2013). There are increasing efforts to compile truly longitudinal learner corpora (see Chapter 2, this volume) and a larger number of studies using a longitudinal approach can be expected in the future (see, e.g., the papers in Hasko and Meunier 2013).

The specific ways in which a learner corpus can be used in a corpus-informed, corpus-based or corpus-driven approach, either quantitatively or qualitatively, largely depend on the object of study and how the data have been processed and annotated. Like other electronically available corpora, learner corpora can be used in their raw, unannotated form with no additional linguistic information and mark-up added to the corpus texts. However, for many types of investigation, it is useful to have the corpus annotated for specific features (see Chapters 5 and 6, this volume). A specific type of annotation that is particularly relevant for learner data is error annotation (see Section 2.1 below and Chapter 7, this volume). Learner corpora can also be exploited to investigate patterns and determinants of interlanguage variability (see Chapter 18, this volume). Owing to the fact that most learner corpora are compiled according to explicit design criteria, extra care being taken of the many
learner- and task-specific variables that affect L2 learning and production (see Chapter 2, this volume), it is possible to examine learner corpus data with regard to the influence of one or more such variables, e.g. time spent in a country where the target language is spoken, or task conditions such as timing or access to reference works. Studies employing Contrastive Interlanguage Analysis (see Section 2.1) focusing on the learners' native language (L1) have been the most popular so far, but many observed differences between native speakers and learners that seem to be caused by the L1–L2 difference can also be examined vis-à-vis the many variables that can potentially influence linguistic choices.

Arguably, the general methodology and procedure employed in LCR to date has mostly been corpus-based, quantitative, cross-sectional and comparative. This preference has to do with the predominance of quantitative approaches in corpus linguistics at large, where researchers typically take corpora as being representative of a certain language variety and work with aggregate data to abstract away from individual language users, despite the fact that learner data are usually subject to a significant degree of inter-learner variability.1 It also has to do with the scarcity of truly longitudinal learner corpora mentioned above, the strong (often pedagogically motivated) interest in transfer effects in LCR (Chapter 15, this volume), and possibly also the lack of training in sophisticated computational techniques needed to carry out truly corpus-driven research.

1 It is obvious that more fine-grained, qualitative analyses and comparisons are essential to uncover individual learner differences (see Section 4 for further discussion).
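The point about aggregate data can be made concrete with a small invented example: a group-level relative frequency may suggest that 'the learners' favour an item when the pattern is in fact driven by a single individual. All learners and figures below are fabricated for illustration.

    # Tokenised essays by three invented learners from the same subcorpus.
    essays = {
        'learner_1': 'this is very very very good and very nice'.split(),
        'learner_2': 'the results are good although not conclusive'.split(),
        'learner_3': 'we found the results good and conclusive'.split(),
    }

    # Aggregate view: one relative frequency for the whole group.
    all_tokens = [tok for toks in essays.values() for tok in toks]
    print('group:', round(all_tokens.count('very') / len(all_tokens), 2))   # 0.17

    # Per-learner view: the apparent group preference comes from one individual.
    for learner, toks in sorted(essays.items()):
        print(learner, round(toks.count('very') / len(toks), 2))
    # learner_1 0.44, learner_2 0.0, learner_3 0.0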

2.1 Contrastive Interlanguage Analysis and Computer-aided Error Analysis

As a first step towards the actual analysis of learner corpus data, specific linguistic structures (single words, multi-word sequences or more complex grammatical constructions) are identified as objects of investigation and extracted (semi-)automatically from learner corpora by means of text retrieval software programs like WordSmith Tools (Scott 2012) or AntConc (Anthony 2014), which provide detailed information on the use of words, phrases and constructions by L2 learners. Such programs usually offer a set of tools that greatly facilitate the extraction, visualisation and analysis of large amounts of learner data. Standard tools include word lists providing all words and their frequencies of occurrence in the corpus; keyword lists that compare the frequencies of words in two corpora and calculate the over- or underrepresentation of certain words in one corpus relative to the other; and concordancers that display search items in their neighbouring context and which can be sorted and exported in various ways according to the user's needs. Further standard options include the detection and visualisation of the distribution/dispersion of a search
item across a corpus, the extraction of left- and right-hand collocates, as well as clusters/n-grams of variable length. When relying on such (semi-)automatic tools, researchers need to bear in mind that the characteristics of learner data necessitate extra care in terms of search strategies, post-editing and interpretation because learner production is subject to spelling mistakes, word-formation and grammatical errors that can make an exhaustive automatic extraction of target forms difficult (see, e.g., Granger and Wynne 1999). In the analysis of learner data, researchers need to take extra care to treat learner language as if it were a new foreign language, not assuming that learners are necessarily using the same structures or patterns to express the same functions as native speakers would (see also Section 4). Learner data need to be examined carefully when trying to discover recurrent patterns before making generalisations.

The representation, extraction and analysis of spoken learner data face additional challenges, many of which have to do with the fact that a multimodal form of communication (spoken language, prosody, gesture) needs to be represented in a written format (see also Chapter 6, this volume). Unclear, unintelligible passages, noise, overlapping speech and the coding and representation of filled and unfilled pauses and other dysfluencies present specific challenges in transcription. The type and level of detail of a transcription obviously depends on the research goal(s), but corpora of spoken interlanguage often come with an all-purpose, general type of transcription. Ideally, the corpus should provide access to transcribed text and sound/prosody (in aligned form) for the researcher to be able to check stress and intonation patterns that are key when analysing information structure and other pragmatic phenomena, and which are also needed to examine the phonological integrity of multi-word units and the significance of pauses. However, very few learner corpora, if any, offer this. Thus, researchers often have to make do with learner corpora that provide access to transcribed speech in the form of text only.

The extracted data are then usually related to those obtained from reference-variety (often native-speaker) control corpora by comparing (normalised) frequency counts, applying statistical tests and procedures (see Chapter 8, this volume) or by further manually coding/annotating more complex patterns of usage. The findings are subsequently analysed and interpreted in terms of quantitative and qualitative differences between the two populations, often similar to practices in traditional contrastive analysis. However, in LCR this type of comparison has been extended and is often combined with a corresponding analysis of language produced by different groups of learners, a method that has become known as Contrastive Interlanguage Analysis (CIA). CIA as introduced by Granger (1996) is probably the most widely used methodological approach in LCR. It involves two types of comparison: first, a comparison of interlanguage (IL) data with native language (NL) data, and second, a comparison of different types of IL data, usually from learners of different L1 backgrounds.
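To make these standard tools concrete, the following sketch reimplements two of them in miniature: a frequency list and a keyword comparison based on the log-likelihood (G2) statistic widely used in corpus linguistics for this purpose. It is a toy Python illustration, not the algorithm of WordSmith Tools or AntConc, and the two one-sentence 'corpora' merely stand in for real data.

    import math
    from collections import Counter

    def log_likelihood(freq1, size1, freq2, size2):
        """Dunning's log-likelihood (G2) for one word's frequencies in two corpora."""
        expected1 = size1 * (freq1 + freq2) / (size1 + size2)
        expected2 = size2 * (freq1 + freq2) / (size1 + size2)
        g2 = 0.0
        for observed, expected in ((freq1, expected1), (freq2, expected2)):
            if observed > 0:                 # 0 * log(0) is taken to be 0
                g2 += observed * math.log(observed / expected)
        return 2 * g2

    # Two toy 'corpora'; in practice these would be full token lists.
    learner = 'i think that it is very important that we think about it'.split()
    reference = 'the evidence suggests that the effect is relatively small'.split()

    learner_freq, reference_freq = Counter(learner), Counter(reference)
    n1, n2 = len(learner), len(reference)

    def keyness(word):
        return log_likelihood(learner_freq[word], n1, reference_freq[word], n2)

    # Word list and keyword list in one pass: words ranked by how strongly
    # their frequency differs between the two corpora.
    for word in sorted(set(learner) | set(reference), key=keyness, reverse=True):
        print(word, learner_freq[word], reference_freq[word], round(keyness(word), 2))
    # 'the' ranks first here: absent from the learner sample but frequent in the
    # reference, while 'that' scores low because it is frequent in both.

In CIA terms, a word whose normalised frequency is higher in the learner corpus than in the reference corpus, and whose G2 value exceeds the chosen significance threshold, would be reported as overused, and as underused in the opposite case.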


The first type of comparison aims at uncovering the characteristics and patterns of use that distinguish learners from native speakers in terms of quantitative differences (in current terminology referred to as ‘overuse’ and ‘underuse’) as well as qualitative differences (‘misuse’). The comparison of different interlanguages is carried out to establish whether the observed learner characteristics and differences from native speakers are typical of a certain learner group or context of use (e.g. features that could be effects of cross-linguistic influence), or if they can be generalised to a wider learner population and to more situations of use irrespective of the L1 (e.g. features that are developmental in nature or influenced by contextual variables). A serious but yet unresolved issue in CIA is the question of corpus comparability, in particular as to the appropriate basis of comparison for learner corpus data, i.e. against which yardstick learner data should be compared and evaluated. Should this be only corpora representing the language of (monolingual) native speakers? And if so, what variety should serve as the comparative basis? And should researchers compare learner data to L1 peers, e.g. novice writers of similar academic standing (students), or expert writers (professionals)? The choice of control corpus has significant implications for learner corpus analysis and the interpretation of findings (see Sections 3 and 4 for further discussion). Recently, Granger (2015) has proposed a revised model (CIA²), which explicitly acknowledges the central role played by variation in interlanguage studies and is thought to be more in line with the current state of foreign language theory and practice. Most importantly, Granger introduces the concepts of ‘reference language varieties’ and ‘interlanguage varieties’. The term ‘reference language varieties’ replaces the native-speaker target (NL) in ‘traditional’ CIA, indicating that there is a large number of different reference points against which learner data can be compared (traditional inner-circle varieties, outer-circle varieties, as well as corpora of competent L2 users, e.g. a corpus of texts produced by expert language users, for example academic writers, who may or may not be native speakers). Granger stresses that the word ‘reference’ in particular makes it clear that the corpus does not necessarily need to represent a norm. The term ‘interlanguage varieties’ is introduced to acknowledge the inherent variability of learner language and to draw attention to the large number of variables whose effect on L2 use should be investigated more in LCR. The second popular method that specifically serves to analyse learner corpus data is Computer-aided Error Analysis (CEA; Dagneaux et al. 1998). This method is based on learner corpora that have been annotated for errors according to a standardised system of error tags (e.g. Dagneaux et  al. 2008). Error-tagging systems usually consist of taxonomies of errors based on structural linguistic categories and thus include tags for errors that relate to categories like form (e.g. spelling), grammar, lexis, register, style, word (missing or redundant word), lexico-grammar and


The second popular method that specifically serves to analyse learner corpus data is Computer-aided Error Analysis (CEA; Dagneaux et al. 1998). This method is based on learner corpora that have been annotated for errors according to a standardised system of error tags (e.g. Dagneaux et al. 2008). Error-tagging systems usually consist of taxonomies of errors based on structural linguistic categories and thus include tags for errors relating to categories like form (e.g. spelling), grammar, lexis, register, style, word (missing or redundant word), lexico-grammar and punctuation (see, e.g., Chapter 7, this volume, and Díaz-Negrillo and Fernández-Domínguez (2006) for a review of error-tagging systems for learner corpora). Annotation can be carried out with the help of software programs to speed up the tagging process (e.g. Hutchinson 1996; O'Donnell 2014).

Obviously, CEA has some clear advantages over traditional error analysis (EA). It is based on a standardised error taxonomy, allows searches and counts for specific error categories by means of a software program, and enables the researcher to sort errors in various ways and to analyse them in context.2 Having said that, CEA is also subject to some of the challenges and limitations that hold for traditional EA at the levels of error identification, classification and description. The method considers errors rather than looking at the full picture of learner production, but non-errors can reveal just as much about learner language as errors, e.g. when it comes to avoidance phenomena, which cannot be accounted for by EA. In addition, various problems with error identification and the reconstruction of the intended L2 forms may emerge.

Given that an error can be defined as a deviation from the norms of the target language, the first problem is the definition of the norm, which very much depends on the context of language usage. Are spoken or written norms considered, and which national variety of a language serves as the basis? Does an EA focus only on grammatical correctness (i.e. formal breaches of the code) or also on pragmatic appropriateness (functional-pragmatic 'misuse' of the code, which may be much more difficult to judge)? Second, EA cannot account for the difference between overt and covert errors, since covert errors are not easily recognised. Overt errors are clear deviations from the norm, whereas covert errors are structures that are superficially well formed but do not mean what the learner intends them to mean, and thus may only be correct by chance (Corder 1971). Third, one needs to distinguish between errors and mistakes (Corder 1967). Errors are usually systematic, with learners being unaware of the specific problem at hand. They result from a lack of L2 knowledge and reflect deficits in competence. By contrast, mistakes are non-systematic and temporary, often slips of the pen or tongue, and are considered performance phenomena. They are often recognised, either instantly or in retrospect, by the learner, who is able to correct them. However, this seemingly clear distinction is difficult to put into practice if a researcher does not have access to background information about the respective learner. Finally, there are the well-known difficulties with error identification and the reconstruction of intended L2 forms due to the indeterminacy and ambiguity of the causes and sources of errors (see also Chapter 24 on the issue of the target hypothesis).

2 A publicly available concordancer based on a 50,000-word error-tagged sample of the Polish component of the International Corpus of Learner English (ICLE; Granger et al. 2009) is made available by Przemysław Kaszubski at http://ifa.amu.edu.pl/~kprzemek/concord2adv/errors/errors.htm (last accessed on 13 April 2015).
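To give a feel for what error-tagged data make possible, the following Python sketch retrieves error code, erroneous form and correction from an inline annotation and counts errors by category. The tag format shown is hypothetical, loosely inspired by schemes in which a code precedes the faulty form and the correction follows between dollar signs; actual tagsets and file formats differ from project to project.

```python
import re
from collections import Counter

# Hypothetical inline annotation (codes, forms and format invented).
tagged = ("He (GVT) have $has$ many (XNUC) informations $information$ "
          "and he (LS) assisted $attended$ the lecture .")

# Retrieve (code, error, correction) triples, then count categories.
pattern = re.compile(r"\((?P<code>[A-Z]+)\)\s+(?P<err>.+?)\s+\$(?P<corr>.+?)\$")
errors = pattern.findall(tagged)
by_category = Counter(code for code, _, _ in errors)

print(errors)       # [('GVT', 'have', 'has'), ('XNUC', 'informations', ...), ...]
print(by_category)  # Counter({'GVT': 1, 'XNUC': 1, 'LS': 1})
```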


2.2 Combining learner corpora and experimental methods

Learner corpora provide authentic, continuous and contextualised language use by L2 learners. Granger (2008b: 261) highlights the length of language samples and the context in which the language has been produced as the most important criteria for a learner corpus:

the notion of 'continuous text' lies at the heart of corpushood. A series of decontextualized words or sentences produced by learners, while being bona fide learner production data, will never qualify as learner corpus data. In addition, it is best to restrict the term 'learner corpus' to the most open-ended types of tasks, viz. those tasks that allow learners to choose their own wording rather than being requested to produce a particular word or structure.

It is important to point out that not all learner corpora contain perfectly naturalistic data like those envisaged by Granger as ideal learner corpus data (see also Chapter 2, this volume). Thus, some corpora may be considered non-prototypical or peripheral, while others are referred to as databases. For example, the Louvain International Database of Spoken English Interlanguage (LINDSEI; Gilquin et al. 2010) contains free and structured interviews as well as picture descriptions, and the Longitudinal Database of Learner English (LONGDALE; Meunier and Littré 2013) includes an even wider variety of spoken and written data types, ranging from free compositions and narratives to guided writing tasks and grammaticality judgements.

When compared to multi-million-word native-language reference corpora, most learner corpora are still comparatively small, which means that they cannot be used to study all aspects of learner language. They are particularly useful for investigations of high-frequency phenomena at all linguistic levels (e.g. morphology, grammar, lexis, discourse), but may prove limited or even unsuitable for the study of infrequent, highly L2-specific or optional phenomena.

(Quasi-)experimental data, mostly gained through elicitation, have traditionally been favoured in SLA research. Elicitation tasks are designed to make informants produce a certain linguistic feature, ideally without raising their awareness by concealing the actual research purpose. Such techniques are typically used when researchers working with an analytic-deductive research design with a narrow focus intend to test a specific hypothesis (see, e.g., Gass and Mackey (2007) for an overview of elicitation techniques commonly used in SLA research). In such designs, many learner and contextual variables have to be controlled, which is extremely difficult, if not impossible, in non-experimental settings. Moreover, researchers are sometimes faced with the problem that the linguistic phenomenon under study is difficult to obtain because it is unlikely to be produced frequently enough in spontaneous written or spoken language. The target structure may be used only sporadically, avoided (for various reasons), or just accidentally not occur because of the limited size of a corpus; that is, it may be underrepresented, a problem also known as construct underrepresentation. In such cases, elicited data are often the only option.

Experimental data can also be useful as corroborating or converging evidence to supplement corpus data, most importantly in the form of between-methods triangulation. In triangulated research designs, a specific research question can be examined from different perspectives by drawing on various kinds and sources of information for analysis to produce either converging or diverging observations and interpretations. The purpose of triangulation is to gain a more complete picture of the phenomenon under study and to increase the reliability and validity of instruments and results, thereby strengthening the confidence in research findings and the conclusions drawn from them. Typically, a triangulated research design draws on several independent sources of evidence in which method- and/or observer-inherent bias is counterbalanced (Callies 2012).

For example, Callies (2009) studied the production and comprehension of lexico-syntactic means of information highlighting by German learners of English as a Foreign Language (EFL), using triangulated written learner corpus data, experimental data and retrospective interviews to produce corroborating evidence. The comparative analysis of argumentative writing was based on material from the International Corpus of Learner English (ICLE) and the Louvain Corpus of Native English Essays (LOCNESS), while the experimental data included elicited production and metapragmatic assessment in the form of written questionnaires. The findings show a clear overrepresentation of subject-prominent structures (it-clefts, existentials, presentationals, extraposition) and an underrepresentation of certain lexico-grammatical focusing devices (e.g. emphatic do and pragmatic markers) in the learner data. The retrospective interviews provided evidence for the hypothesis that, in contrast to lexical means such as intensifiers, even advanced learners have no conscious awareness of syntactic means of information focusing.

Gilquin and Gries (2009) suggest that there is no strict dichotomy between corpora on the one hand and experiments on the other. Corpora, just like linguistic data in general, can be located on a continuum of naturalness of production and collection. While theoretical and methodological scepticism has been an obstacle to the establishment of genuine bi-directional links between SLA and LCR (see Chapter 14, this volume), more recently, researchers in both camps have realised the potential of combining different types of learner data (e.g. Gilquin 2007; Callies 2009; Meunier and Littré 2013; Mendikoetxea and Lozano 2015).

3 Representative studies

The specific ways in which learner corpora can be used, (1) in a corpus-informed, corpus-based or corpus-driven approach, (2) with a focus on quantitative or qualitative analysis, and (3) cross-sectionally or (quasi-)longitudinally, are exemplified in the representative studies presented in this section.


3.1 Gilquin, G. and Paquot, M. 2008. 'Too chatty: Learner academic writing and register variation', English Text Construction 1(1): 41–61.

The study by Gilquin and Paquot (2008) exemplifies CIA in a corpus-based, quantitative and cross-sectional approach, and also discusses questions of corpus comparability, in particular the appropriate basis of comparison for learner corpus data. Their aim was to study how upper-intermediate to advanced EFL learners express a number of rhetorical functions that are particularly prominent in academic discourse, and how this compares with novice and expert native-speaker writers. In particular, the authors wanted to investigate the extent to which the learners use spoken-like features in their academic prose.

The basis for the study was a list of some 350 items representing words and phrases frequently used in rhetorical functions as part of a general academic vocabulary, identified in Paquot's (2010) Academic Keyword List. The use of these items was then analysed and compared on the basis of three different corpora: (1) a corpus of expert academic native-speaker English sampled from the British National Corpus (BNC), consisting of samples from books and journal articles in several disciplines; (2) a corpus of spoken native-speaker English, also sampled from the BNC, including a wide variety of spoken registers such as broadcast documentary and news, interviews and lectures; and (3) a corpus of learner writing, i.e. the second version of the ICLE (Granger et al. 2009), which contains argumentative and literary essays produced by upper-intermediate to advanced EFL learners from a variety of L1 backgrounds. The learner texts were restricted to purely argumentative essays written in untimed conditions under which the learners had access to reference tools, and included essays produced by learners from fourteen different L1 backgrounds. The frequencies of the selected items were compared across the three corpora to identify cases of over- and underuse (determined on the basis of chi-square tests) that may be shared by learners from a wide range of L1 backgrounds.

On the basis of the analysis of the learner data and their comparison with the native-speaker control corpora, the authors suggest that learner writing of the type represented in the ICLE is characterised by a 'chatty' style in that learners have a strong tendency to use features that are more typical of speech than of academic prose. More specifically, learners tend to overuse words and phrases which are more likely to appear in native speakers' speech, and to underuse more formal expressions typical of native speakers' academic writing. This leads to a higher degree of visibility and subjective involvement in their texts, which suggests that they are largely unaware of register differences. Gilquin and Paquot provide four possible explanations for this lack of register awareness: the influence of speech (discarded by the authors), L1 transfer (exemplified in terms of 'transfer of register' by means of the striking overuse of let us / let's in the French component of the ICLE), teaching-induced factors, and developmental factors.


To examine the last factor, the authors aimed to find out whether the learners' 'chatty' style was due to their being L2 users or to their being novice writers who had not yet acquired the conventions of academic writing. Thus, they additionally compared the learner data to similar writing produced by English L1 peers (novice writers of similar academic standing) as included in the LOCNESS. The findings indicate that the novice native-speaker writers share the learners' problem with register to a certain extent, also overusing items which are more typical of speech than of writing, thus occupying an intermediate position between expert academic writing and learner writing.

Gilquin and Paquot's study clearly illustrates the explanatory power of CIA vis-à-vis transfer effects, and it also highlights the importance of the choice of control corpus. At the same time, it provides evidence for the necessity of controlling for task-specific variables, which can have significant implications for the interpretation of findings. While the ICLE texts are treated as academic writing and are thus compared with expert academic writing in the form of books and research papers sampled from the BNC, it seems likely that the observed 'chatty' style and over-involvement could partly be a genre and task effect. The large majority of the ICLE texts do not represent academic writing in a narrow sense but differ from academic prose in some important respects. First, they are often loosely characterised as 'essays', a cover term for a general text type that is open to subjective interpretation (student writers may differ considerably in what they consider an essay), which makes a comparison with more specific academic text types difficult. Second, they are argumentative texts whose communicative purpose is not to inform but rather to argue for a certain position, voice a personal opinion or persuade an (unspecified) audience. Learners are also explicitly prompted to give their personal opinions (see the list of the most popular essay topics in Granger et al. 2009: 6ff.).

3.2 Díez-Bedmar, M. B. and Casas Pedrosa, A. V. 2011. 'The use of prepositions by Spanish learners of English at university level: A longitudinal analysis', in Kübler, N. (ed.), Corpora, Language, Teaching, and Resources: From Theory to Practice. Bern: Peter Lang, pp. 199–218.

This study is an example of CEA used in a longitudinal research design. The authors examine developmental trajectories and illustrate how these can be captured by means of an error-tagged learner corpus, focusing on accuracy in the use of prepositions by Spanish EFL learners. The authors compiled a truly longitudinal learner corpus consisting of written compositions produced for exams or in exam-like situations by twenty-eight Spanish students of English philology over a span of four academic years at a university in the south of Spain. To ensure comparability, all compositions were timed and students did not have access to reference materials.


Similar to other longitudinal corpus projects, the compilation of the corpus had to cope with the fact that only thirteen of the twenty-eight students who took part at the beginning of the project finished their degree in four years. Thus, the number of essays in the corpus varies from one year to another. For the study under discussion, a corpus comprising 164 compositions, totalling 69,980 words, was analysed. The corpus was error tagged using an error editor (Hutchinson 1996) and an error-tagging manual (version 1.1 of Dagneaux et al. 2008). Instances of misused prepositions were retrieved from the general error category 'Lexical Selection' using WordSmith Tools. The error categories 'Word Missing' and 'Word Redundant' were also checked for cases involving prepositions. In this way, the authors uncovered twenty-seven prepositions that caused problems for the learners. All instances of the correct uses of these prepositions were also extracted using WordSmith Tools. Frequencies of correct and incorrect uses per year were calculated for each preposition to see whether there was an overall statistically significant developmental pattern. Comparing percentages of correct and incorrect uses, the findings reveal four patterns of development with regard to the accuracy with which the students use prepositions:

• positive evolution, or a decrease in the percentage of errors over time; prepositions following this pattern are around, as, at, between, by, during, for, in, inside, like, of and to
• stability, i.e. no errors in three out of four academic years (about, across, back, behind, besides, instead of, into, throughout, under and up to)
• fluctuation from one academic year to the next (along and from); e.g. errors with from increased from year 1 to year 2, then fell in year 3, to rise again in year 4
• negative evolution, or an increase in errors over time (on, since and with).

In sum, this study highlights the advantages and potential of CEA, especially when implemented in a longitudinal research design, while it also illustrates the challenges faced by a truly longitudinal corpus project in that it is extremely difficult to follow the same learners over longer periods of time.
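The logic behind such trajectory classifications is easy to operationalise, as the following Python sketch shows for invented per-year error percentages; the figures are illustrative only and do not reproduce the study's data.

```python
# Invented % of erroneous uses in years 1-4 for three prepositions.
error_rates = {
    "in":   [18.0, 12.5, 9.0, 6.0],
    "from": [10.0, 15.0, 8.0, 14.0],
    "on":   [4.0, 6.5, 9.0, 11.0],
}

def classify(rates):
    """Label a trajectory of error percentages across academic years."""
    diffs = [b - a for a, b in zip(rates, rates[1:])]
    if rates.count(0.0) >= 3:
        return "stability"           # error-free in most years
    if all(d <= 0 for d in diffs):
        return "positive evolution"  # errors decrease over time
    if all(d >= 0 for d in diffs):
        return "negative evolution"  # errors increase over time
    return "fluctuation"             # errors rise and fall across years

for prep, rates in error_rates.items():
    print(prep, "->", classify(rates))
# in -> positive evolution, from -> fluctuation, on -> negative evolution
```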


3.3 Mendikoetxea, A. and Lozano, C. 2015. 'Conceptual and methodological interfaces in SLA research: Triangulating corpus and experimental data in L2 subject–verb and verb–subject alternations'. Unpublished manuscript.

Mendikoetxea and Lozano (2015) show how corpus and experimental data can be combined in complementary fashion to gain insights into the processes that shape the development of L2 learners' interlanguage grammars. They investigate the Subject–Verb (SV) / Verb–Subject (VS) alternation in the grammar of Spanish EFL learners to examine the role played by linguistic interfaces, i.e. interactional processes between different language components such as pragmatics and syntax, in the acquisition of SV-/VS-structures in L2 English.

Their starting point is the well-established observation that EFL learners with various L1 backgrounds produce postverbal subjects in intransitive (X)VS-structures where the logical subject is an NP (e.g. *It will happen something exciting). English, as a canonical SV-language, allows VS-constructions like there-existentials and locative inversion only in a limited number of discourse-pragmatically motivated contexts that are strongly influenced by the end-focus and end-weight principles: postverbal subjects are allowed if they are in focus and complex/heavy. Verbs appearing in VS-structures belong to a semantic class that denotes location, existence or appearance (e.g. happen, exist, come or appear), traditionally termed unaccusative in formal approaches, as opposed to the other type of intransitives, referred to as unergatives, which typically denote activities (e.g. cry, speak or sing). By contrast, null-subject languages like Spanish show apparently free SV/VS alternations regulated by verb type (postverbal subjects are favoured with unaccusative verbs, not with unergatives) and information structure (focused subjects appear in sentence-final position, independently of verb type).

While most previous studies on interface phenomena in SLA have been experimental, the authors advocate the systematic combination of naturalistic production data (learner corpora) and experimental data (acceptability judgements) in the search for converging evidence to obtain a fuller picture of the nature of interface phenomena in SLA (which they refer to as the 'methodological interface'). They carried out a corpus study on the basis of two corpora representing essay writing produced by upper-intermediate Spanish university students of English: the Spanish component of the ICLE, and the Written Corpus of Learner English (WriCLE),3 using the LOCNESS as a native-speaker control corpus. The corpus data confirmed previous research in that the learners produced postverbal subjects only with unaccusatives and never with unergatives, in both structurally impossible and possible sentences (e.g. *It has appeared some cases of women who have killed their husbands; There exist about two hundred organizations such as Greenpeace). The corpus data also brought to light something that had gone unnoticed in previous experimental studies: unaccusativity is a necessary but not a sufficient condition for the production of postverbal subjects by learners, since two additional factors, the end-focus and end-weight principles, also regulate the position of the subject in their L2. In addition, the corpus data suggest that learners had difficulties in syntactically encoding the elements that occur preverbally: even advanced learners omitted preverbal material in certain contexts (using zero-subjects as in …*because exist the science technology and the industrialisation), while underusing there (There exist positive means of earning money) and overusing it as the default subject placeholder (…*it will not exist a machine or something able to imitate the human imagination).

3 Information about this corpus can be accessed at www.uam.es/proyectosinv/woslac/Wricle/ (last accessed on 13 April 2015).


Therefore, a follow-up experimental study was designed to test learners' knowledge of the preverbal field. The authors set up an acceptability-judgement task, administered online over the internet, in which learners had to judge sentences that were structurally similar to those produced in the corpus on a five-point Likert scale ('1' being fully ungrammatical, '5' fully grammatical). A total of 378 subjects completed the test, including a control group of English native speakers. Learners were grouped into six proficiency levels after completing an online version of the Oxford Placement Test. The results show that PP-locative inversion (e.g. In some places still exist popularly supported death penalty) was the inverted structure preferred by learners at all proficiency levels, followed by existential there-constructions. Ungrammatical it-insertion and zero-structures were dispreferred across proficiency levels.

In sum, the triangulation of corpus and experimental data produced converging evidence: the learners obey the Unaccusative Hypothesis and there is a gradient scale in the production/acceptance of the four unaccusative structures. One of the main differences between the corpus and the experimental data is that it-insertion structures were overrepresented in the corpus, while in the experiment they were accepted at the same level as zero-insertion. The authors show how SLA research can benefit from studies located at a methodological interface, i.e. a triangulation of contextualised, naturalistic production data in the form of learner corpora and experimental data to study the same linguistic phenomenon. This approach seems particularly relevant and fruitful when the object of study is a relatively infrequent and highly marked L2 construction whose choice is influenced by a complex interplay of syntactic and discourse-pragmatic factors.
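Acceptability-judgement data of this kind are typically reduced to mean ratings per structure and proficiency level before being set against corpus production rates. The following Python sketch shows that aggregation step with invented ratings; the numbers are not taken from Mendikoetxea and Lozano's study.

```python
import statistics

# Invented judgements (1 = fully ungrammatical, 5 = fully grammatical)
# for two structure types at two proficiency levels; a real study would
# involve hundreds of informants and more conditions.
ratings = {
    ("locative inversion", "B1"): [4, 5, 3, 4, 4],
    ("locative inversion", "C1"): [5, 4, 5, 5, 4],
    ("it-insertion",       "B1"): [3, 2, 3, 2, 3],
    ("it-insertion",       "C1"): [1, 2, 1, 2, 1],
}

for (structure, level), scores in ratings.items():
    print(f"{structure:18} {level}: mean = {statistics.mean(scores):.2f}, "
          f"sd = {statistics.stdev(scores):.2f}")
```

Triangulation then consists in checking whether such acceptance profiles converge with, or diverge from, the production rates observed in the corpus.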

4 Critical assessment and future directions

This section discusses and evaluates current practices in learner corpus methodology and analysis, and provides a brief outlook on methodological developments in the field. As stated earlier in this chapter, the large majority of studies in LCR have been corpus-based, quantitative, cross-sectional and comparative. They typically employ CIA and focus on the learners' L1 as a variable, comparing quantitative findings to those obtained from native-speaker control corpora of different kinds. The dominance of CIA and CEA, their underlying assumptions, as well as some mutual theoretical and methodological scepticism, pose major challenges for LCR in winning its place among the established research methodologies in SLA.


Moreover, the practice of too often using learner corpora (or components of them) as aggregate data sets, without reflecting on, let alone controlling for, the influence of learner and task variables that have been shown to influence L2 acquisition and production, needs to be refined in order to achieve methodological advances in the field and to establish more genuine bi-directional links with SLA. Additionally, it seems that in LCR the potential effects of the learners' L1 are often overestimated and readily taken as the most obvious explanation, despite evidence that observed differences between native speakers and learners may need to be considered vis-à-vis other factors (see Ädel 2008; Gilquin and Paquot 2008). Thus, LCR has a lot to gain from embracing a variationist perspective, taking full advantage of the recorded metadata that pertain to both learner and task variables. I will first address the question of corpus comparability and then discuss the variability of learner language vis-à-vis current methodological practices in LCR.

The use of a native-speaker yardstick in LCR has been criticised for falling prey to the 'comparative fallacy' in that it fails to recognise learner language as a variety in its own right. While this criticism is certainly too harsh (see Granger 2009a, 2013a, 2015 for discussion and counterarguments), it is true that CIA is based on an underlying pedagogical perspective that considers native speakers' language as a kind of benchmark against which differences and features of learner language are evaluated and characterised as native-like or, rather, non-native-like. This perspective has certain advantages in that it facilitates empirical work, because it provides a norm against which learner data can be measured and evaluated, and because a normative perspective is often needed in foreign language teaching. But it has also been criticised as centring on a target-deviation perspective in which interlanguages are merely seen as more or less successful attempts to reproduce an implicit target-language norm. Learners try to do what adult native speakers do, but do so less well, thus exhibiting an imperfect and deficient imitation of the target language. This view also has the disadvantage that it privileges particular kinds and methods of learner data analysis and can lead to a monolingual bias and a 'discourse of deficit' (e.g. Cook 1999; Ortega 2013). This is also reflected in the terminological practice of labelling quantitative differences between native speakers and learners as 'overuse' and 'underuse', while qualitative differences are referred to as 'misuse'. The evaluative character of this practice was criticised as early as the first years of LCR by Leech (1998: xix–xx), who issued a warning not to confuse description and prescription: 'Although the non-natives can be technically described as "underusing" … expressions, this term "underuse" and the contrasting "overuse" should not be used in a judgemental spirit. They should be interpreted, to my mind, not prescriptively but descriptively, as a convenient shorthand for "significantly more/less frequent use by NNSs [non-native speakers] than by NSs [native speakers]".'


The terms 'overrepresentation' and 'underrepresentation' may be more descriptive, neutral options. The practical usefulness of a native-speaker control corpus as an operationalised 'abstracted corpus norm' that 'provides a very good yardstick for improving learner language in terms of native-like lexicogrammaticality, acceptability and idiomaticity' (Mukherjee 2005: 16) is clearly at hand. But the question remains whether researchers are always comparing like with like or, possibly out of necessity and for lack of better alternatives, make do with a corpus that is conveniently available. Should control corpora represent the language of (monolingual) native speakers? This neglects the fact that L2 learners are bi- or multilingual speakers by definition, and hence the only reasonable norm would be a bi-/multilingual rather than a monolingual one (see Cook's (1999) concept of 'multi-competence'). And which variety should serve as the comparative basis? This question was put forward as early as 1998 by Leech (1998: xix–xx), who drew attention to the fact that L1 peers (e.g. novice writers of similar academic standing) 'do not necessarily provide models that everyone would want to imitate'. On the other hand, and with good reason, the practice of comparing learner data against an 'unrealistic standard of "expert writer" models' (Hyland and Milton 1997: 184), i.e. trained professional users, has been considered 'both unfair and descriptively inadequate' (Lorenz 1999a: 14). A way out of this dilemma, and a means of overcoming the monolingual bias in SLA (see, e.g., Ortega 2013), where an idealised image of a native speaker is often treated as the gold standard, would be to conceive of native-like proficiency as a 'gradual, probabilistic phenomenon that transcends a native–non-native speaker divide' (Wulff and Gries 2011: 61). This would be well in line with the aims of LCR to describe and explain the nature and development of interlanguages, and with a variationist perspective on SLA in which learners are seen as systematically passing through a series of varieties that are characterised by a particular lexical repertoire and interaction of organisational principles (Klein and Perdue 1992, 1997).

Methodological practices in LCR have to account more fundamentally for the inherent variability of learner language. The majority of learner corpus studies are based on comparative analyses of texts produced by L1 and L2 speakers/writers; their findings are thus subject to corpus comparability, and the generalisations derived from them are valid only provided that variables such as proficiency, text type or task setting have been sufficiently controlled and documented. The study by Ädel (2008) has shown, for instance, that observed characteristics of learner texts are not necessarily due to the writers' status as non-native speakers but can in fact result from differences in task setting (e.g. prompt, timing, access to reference works), and possibly task instruction and imagined audience. A specific example of how task variables can influence writers' use of first-person pronouns is presented by Breeze (2007), who compared the use of pronouns in two text types, i.e. TOEFL (Test of English as a Foreign Language)-style essays and reports, both written as homework assignments by Spanish EFL learners.


Breeze found a crucial difference in the use of pronouns between the two tasks. While the students used I, we, you and various possessives in the essays, they hardly used I or you in the reports but preferred we and our. Breeze argues that the students evidently perceived a generic difference between the essay and the report task. They were generally able to modulate their style in accordance with the task and seemed to have some awareness of the genre conventions governing the report. One possible explanation for this is that the report was correctly perceived as a type of collective task involving others, whereas the essay prompt seemed to be asking for a personal opinion, thus triggering a highly personal reaction.

Problems of corpus comparability as to (sub-)register and genre are exemplified in a study by Granger and Paquot (2009a), who compared the use of reporting verbs on the basis of the ICLE and a corpus of academic prose (book samples and articles) written by expert L1 writers. The authors discuss various aspects that make the two corpora not fully comparable (2009a: 197) and note that

some of the differences between learner and academic writing … may be due to differences in text type … Learners' underuse of these verbs may thus be (at least) partly explained by a difference in text type as there is no need in argumentative writing to situate one's opinion against what has been written in the literature and typically, argumentative essays do not contain tables and graphs and are too short to include internal reference to chapters and sections. (2009a: 208)

While a strict control of variables is essential to ensure the validity, replicability and generalisability of research findings, the fact that most learner corpora are still comparatively small compared to native-language reference corpora may cause difficulties when researchers do try to control key variables. For example, in a study of English emphatic do in the German and French components of the LINDSEI, Callies (2013a) found an underrepresentation of this lexico-grammatical device for expressing contrastive emphasis in both learner corpora when compared to the native-speaker corpus used, the Louvain Corpus of Native English Conversation (LOCNEC). It was assumed that the significantly lower frequency counts in the learner data may partially be an effect of the task and/or the interviewer. It is well known that interlanguage variation is influenced by a number of external sociolinguistic factors that have to do with the situational context of language use, e.g. task, topic and interlocutor (see, e.g., Ellis 2008: 141ff.). It is thus possible that L2 learners may be less inclined to disagree or object (and hence experience much less need to use the linguistic means that signal contrastive emphasis) when they are interviewed by a native speaker who is of the opposite sex and not familiar to them, rather than when interviewed by a same-sex non-native speaker whom they know. Although variables such as the interviewer's mother tongue, sex and distance/closeness to the interviewee have been recorded as metadata in the LINDSEI, their influence cannot (in the current version) be assessed on a broad basis because of the small corpus size: strict control of all the relevant variables results in a very small dataset of sometimes only a handful of interviews.
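The shrinkage that strict variable control produces can be made visible in a few lines of code. The following Python sketch filters a set of invented interview metadata; the field names and values are hypothetical and do not reproduce the actual LINDSEI metadata scheme.

```python
import pandas as pd

# Invented metadata for ten interviews in the spirit of a spoken
# learner corpus (all fields and values hypothetical).
meta = pd.DataFrame({
    "interview_id":     range(1, 11),
    "interviewer_L1":   ["English", "German", "English", "English", "German",
                         "English", "German", "English", "German", "English"],
    "interviewer_sex":  ["F", "M", "F", "M", "F", "F", "M", "F", "M", "F"],
    "known_to_learner": [False, True, False, False, True,
                         True, False, False, True, False],
})

# Each additional selection criterion shrinks the usable subset further.
subset = meta[(meta.interviewer_L1 == "English")
              & (meta.interviewer_sex == "F")
              & (~meta.known_to_learner)]
print(len(meta), "interviews in total;", len(subset), "after strict control")
```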


Learner corpora are typically used to abstract away from individual learners in order to arrive at a corpus-based description of a specific learner group defined by L1, institutional status or proficiency level. In line with standard practice in corpus linguistics, researchers then usually aggregate and analyse data drawn from all learners who meet a chosen selection criterion. However, it is important to remember that learner corpus data are usually subject to a significant degree of inter-learner variability. Much of this variability is likely to be caused by inter-learner differences in proficiency level, which has been a fuzzy variable in learner corpus compilation and analysis (Granger et al. 2009: 10). Due to practical constraints, it is often operationalised by 'learner-centered' methods (Carlsen 2012: 165) such as institutional status. For example, in the compilation of the ICLE, learners' proficiency level was assessed globally by means of external criteria: learners were considered advanced because of their institutional status as 'university undergraduates in English (usually in their third or fourth year)' (Granger et al. 2009: 11). However, the results of human rating of twenty essays per ICLE subcorpus according to the levels of the Common European Framework of Reference for Languages (CEFR; Council of Europe 2001) show that the proficiency level of learners represented in the ICLE actually varies between (higher) intermediate and advanced. While some subcorpora predominantly seem to include learners from the CEFR's B2 level (e.g. Chinese, Japanese, Tswana and Turkish EFL learners) or C1 level (Bulgarian, Russian and Swedish learners), others show a higher degree of intra-group variability (Czech, German and Norwegian learners) (Granger et al. 2009: 12).

Recent studies have reported significant individual differences (Mukherjee 2009; Callies 2013a, 2013b) and confirm that global proficiency measures based on external criteria alone, e.g. the time spent learning English at school and university, are not reliable indicators of proficiency for corpus compilation. A measure such as institutional status is likely to mask intra-group variability in proficiency level. However, individual differences have often gone unnoticed or have tended to be disregarded in learner corpus analysis, and are thus not reported, in favour of (possibly skewed) average frequency counts. Mukherjee (2009), observing an extremely uneven distribution of the pragmatic marker you know in the German component of the LINDSEI, concludes that 'the fiction of homogeneity that is often associated with the compilation of a learner corpus according to well-defined standards and design criteria may run counter to the wide range of differing individual levels of competence in the corpus' (2009: 216).


Finally, I would like to propose two desirable methodological developments in the field: first, a function-to-form approach to the analysis of learner language and a function-driven type of annotation of learner corpora; and second, the use of learner corpora as an operationalised, empirical method for the quantitative and qualitative description of L2 proficiency.

The advantages and potential of one specific type of learner corpus annotation, error tagging, have been pointed out earlier in this chapter. Besides POS tagging, other forms of annotation, e.g. annotation for discourse-pragmatic features, are rarely found, both in corpus linguistics in general and in LCR in particular. Valuable as it is, any kind of form-driven corpus annotation and query is likely to identify and represent only an incomplete picture of how learners use language. First, this type of approach most often singles out a finite set of the most typical items but neglects the use of a wider range of linguistic means for expressing a particular function (Paquot's (2010) study of exemplification in learner writing being a notable exception). A larger set of potential options is indispensable if one wants to find out how frequently a structure 'could have occurred but in fact didn't because the concept was expressed differently' (Hoffmann 2004: 190, emphasis in original). Second, one would have to take into account a notion like 'conceptual frequency' (ibid.), which refers to the frequency of a concept or a function, rather than that of a form. A related concept is 'opportunity of use' (Buttery and Caines 2012), which refers to the number of occasions on which a linguistic feature could have been used. Such measures capture the probability of a particular linguistic form being encountered in a particular text based on what is known about the frequency of the associated function. For example, a quantitative analysis of the use of the contrastive connector in contrast in learner writing has to take into account information on the genre, and also on the topic, of the text, as not all genres or topics may suggest the use of that item in discourse in the first place.
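A minimal sketch of this idea is to relate the number of occurrences of a form to the number of opportunities for its function, rather than to raw text length alone. The texts, counts and opportunity figures below are invented, and identifying 'contrast contexts' would in practice require manual or function-driven annotation.

```python
# text_id: (token_count, contrast_contexts_identified, hits_of_'in contrast')
texts = {
    "essay_01": (620, 4, 1),
    "essay_02": (540, 0, 0),   # topic invites no contrasts at all
    "essay_03": (710, 6, 3),
}

for text_id, (tokens, opportunities, hits) in texts.items():
    per_10k = hits / tokens * 10_000
    if opportunities:
        print(f"{text_id}: {per_10k:.1f} per 10k tokens, "
              f"{hits}/{opportunities} opportunities realised")
    else:
        # A raw frequency of zero here is not 'underuse': the function
        # never arose, so the form had no opportunity to occur.
        print(f"{text_id}: {per_10k:.1f} per 10k tokens, no opportunities")
```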


In sum, with its focus on the use of single language forms rather than the functions they serve, a form-driven type of annotation does not allow for a truly usage-based perspective on learner production, in which learners' experience with language in particular social settings is the focus of attention. A function-to-form approach to the analysis of learner language and a function-driven type of annotation of learner corpora offer various advantages. They aim to link the occurrence of particular language forms to the functions that these forms serve in discourse, e.g. the lexico-grammatical means used to express contrastive emphasis. At the same time, this type of annotation allows for a deepening of our understanding of form–function mappings in discourse by accounting for unexpected, interlanguage-specific structures that are bound to occur in learner language. These will often be overlooked in form-driven searches because learners may (creatively) use non-canonical, strategic means to express communicative functions (see Hirschmann et al. 2007; Lüdeling 2008). A good example of such unexpected, strategic behaviour is learners' use of questions to highlight information, as pointed out by Callies (2009: 138). Only a function-driven annotation makes possible the identification and documentation of a near-complete inventory of the lexico-syntactic means used to express various communicative functions in discourse. Once a corpus has been annotated for discourse functions, this approach allows corpus searches that retrieve a stock of relevant forms, for example all lexico-syntactic structures expressing contrast, in one query, whereas in a form-oriented approach many more, and partially very complex, queries are needed, which are then subject to extensive manual post-processing due to a rather low(er) precision rate.

Learner corpora also have the potential to increase transparency, consistency and comparability in the assessment of L2 proficiency, and are thus increasingly being used in this field (see Chapter 23, this volume; Callies et al. 2014). Learner corpora present an option to inform, supplement and possibly advance the way proficiency is operationalised in the CEFR, and they may also be used in a more data-driven approach to the assessment of proficiency that is partially independent of human rating. L2 proficiency as a construct is considered multi-componential in nature, and its dimensions are frequently captured and operationalised by the parameters of complexity, accuracy and fluency, which have been used both as performance descriptors for the assessment of L2 learners' oral and written skills and as indicators of the proficiency underlying learners' performance. An operationalisation of a quantitatively and qualitatively well-founded description of advanced proficiency in terms of linguistic criteria for the assessment of advancedness is still lacking. Recently, Ortega and Byrnes (2008a) have discussed four partially overlapping global measures commonly used to operationalise advancedness: institutional status, standardised tests, late-acquired linguistic features, and a concept they call 'sophisticated language use in context', in which advancedness is conceptualised not only in terms of 'purely linguistic accomplishments' but also, among other things, in terms of literacy, 'choice among registers' and 'voice' (Ortega and Byrnes 2008a: 8). This construct may be applied to the context of academic writing by using linguistic descriptors that are characteristic of this register and can thus also be used for L2 assessment. Wulff and Gries's (2011: 61) definition of accuracy as a 'proficient selection of constructions in their preferred constructional context in a particular target genre' would be a specific instance of sophisticated language use in context.
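The complexity, accuracy and fluency parameters mentioned above can be given very rough corpus-based operationalisations, as in the following Python sketch. The measures shown are deliberately crude proxies, and the text, timing and error count are invented; serious studies use far more refined indices (e.g. subordination ratios, weighted error taxonomies, pruned speech rate).

```python
import re

text = ("Although the experiment failed, the researchers persisted. "
        "They beleived that the results would improve.")
minutes_taken = 2.0   # invented timing for a timed writing task
error_count = 1       # e.g. from an error-tagged version ('beleived')

sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
words = re.findall(r"[A-Za-z']+", text)

complexity = len(words) / len(sentences)   # mean sentence length
accuracy = error_count / len(words) * 100  # errors per 100 words
fluency = len(words) / minutes_taken       # words per minute

print(f"complexity: {complexity:.1f} words/sentence, "
      f"accuracy: {accuracy:.1f} errors/100 words, "
      f"fluency: {fluency:.1f} words/minute")
```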


Key readings

Ädel, A. 2006. Metadiscourse in L1 and L2 English. Amsterdam: Benjamins.
This book provides a large-scale study of metadiscourse (commentary on ongoing discourse) in the writing of L1 and L2 university students. The author finds that L2 speakers' overuse of metadiscourse strongly marks them as lacking in communicative competence. Appendices 1 and 2 contain highly useful and instructive discussions of issues of corpus comparability and control corpora.

Dagneaux, E., Denness, S. and Granger, S. 1998. 'Computer-aided error analysis', System 26(2): 163–74.
This is the 'classic' text introducing CEA. It features a case study demonstrating the methodological approach, based on a 150,000-word corpus of English written by intermediate and advanced French learners of English. It illustrates how the corpus is annotated for errors by means of an error editor and a comprehensive error-classification system.

Granger, S. 1996. 'From CA to CIA and back: An integrated approach to computerized bilingual and learner corpora', in Aijmer, K., Altenberg, B. and Johansson, M. (eds.), Languages in Contrast: Text-based Cross-linguistic Studies. Lund University Press, pp. 37–51.
This is the 'classic' text that introduces CIA, arguing that contrastive analysis (CA) has been revitalised as a major approach in corpus-based translation and interlanguage studies because of the corpus-linguistic turn and the availability of computerised bilingual and learner corpora. It also proposes that CA and CIA can be combined.

Gilquin, G. 2000/2001. 'The Integrated Contrastive Model: Spicing up your data', Languages in Contrast 3(1): 95–123.
This paper introduces an integrated model that combines CA and CIA and illustrates it by means of causative constructions in English and French. It argues that contrastive corpus data can help explain some of the characteristics of learners' interlanguage vis-à-vis the concept of transfer.

Granger, S. 2012a. 'How to use foreign and second language learner corpora', in Mackey, A. and Gass, S. M. (eds.), Research Methods in Second Language Acquisition: A Practical Guide. London: Basil Blackwell, pp. 7–29.
This chapter offers a concise, beginner-friendly introduction to learner corpora and learner corpus methodology, with project ideas and exercises. It features study boxes that discuss instructive case studies, including their methodological approach and statistical tools.


4 Learner corpora and psycholinguistics

Philip Durrant and Anna Siyanova-Chanturia

1 Introduction

Corpus linguistics and psycholinguistics have historically had rather different goals and methodologies. Corpus linguistics has been mainly concerned with identifying patterns of use in samples of language which are representative of a particular speech community or text type. It deals with the end products of language use, tends to be descriptive and exploratory, and emphasises social contexts. Psycholinguistics, in contrast, is concerned with understanding the mental processes and representations involved in on-line language comprehension and production. It focuses on testing theoretical models through carefully controlled laboratory experimentation. Given these different aims and approaches, it is unsurprising that attempts to combine the two areas of enquiry have at times been controversial and are still relatively uncommon.

However, increasingly, researchers from each area are feeling the need to draw on theories and methods from the other. This is very much in line with a more general movement in language studies towards drawing on multiple research methods (see, e.g., Schönefeld 2011). We believe that much is to be gained from such interaction, but that, as with all interdisciplinary work, great care needs to be taken if the two are to work together in meaningful ways. The theoretical arguments for and against combining psycholinguistics and corpus linguistics in general have been discussed at length elsewhere (e.g. Gries 2010a: 334–8; McEnery and Hardie 2012: 192–224) and will not be repeated here. Rather, we will review the ways in which psycholinguistics and learner corpora have interacted in recent research and attempt to draw specific lessons about what can be gained from such interactions, and about some of the pitfalls such interdisciplinary research needs to avoid if it is to be conducted effectively.


2 Core issues

Learner corpora and psycholinguistics have interacted in two main ways. The first, currently far more popular, type of enquiry involves using corpus data to draw psycholinguistic conclusions. The second type takes integration one step further by using both corpus and experimental data. Work of this kind is currently much less common, but, we will argue, potentially more useful. The current section will deal with each of the two types in turn.

2.1 Learner corpus data as evidence for psycholinguistic hypotheses

By far the most common types of study linking learner corpora with psycholinguistics have used corpus data to make psycholinguistic claims about second language learning. Certain topics lend themselves more naturally than others to these types of study, and three areas in particular have drawn the majority of researchers' attention: formulaic language, usage-based models of second language learning and L1 transfer.

Several learner corpus studies have attempted to address the question of whether adult second language learners acquire and use language formulaically, that is, relying on prefabricated chunks rather than exploiting combinatorial linguistic mechanisms. A number of researchers have claimed that adult second language (L2) learners differ from child first language (L1) learners in that the former tend to learn and process language on a word-by-word basis, whereas the latter acquire a large repertoire of multi-word expressions (multi-word verbs, collocations, idioms, speech routines, lexical bundles, etc.), which enables them to deal with language in larger chunks (e.g. Kjellmer 1982; Wray 2002). It is claimed that this difference accounts for much of the lack of fluency and idiomaticity of non-native speech (Pawley and Syder 1983). Researchers have employed learner corpora in an attempt to determine whether adult L2 production is as lacking in formulaic language as this hypothesis suggests (De Cock et al. 1998; Oppenheim 2000; Foster 2001; Nesselhauf 2005; Durrant and Schmitt 2009).

One obvious criticism of such attempts is that it is impossible really to know whether a piece of language appearing in a corpus is or is not a formula (i.e. a prefabricated chunk) for the person who produced it. As Sinclair (1991) pointed out, the patterns which are found in corpora are the product of a number of factors, including the structure of the material world and sociolinguistic conventions, in addition to psycholinguistic mechanisms. Any direct inference from corpus to mind therefore requires further justification.


Studies have taken a variety of approaches to deciding which strings of language should be counted as formulas. Oppenheim (2000) and Foster (2001) relied on native-speaker judgement: either the researchers themselves (Oppenheim 2000) or a panel of informed judges (Foster 2001) read through the transcripts and picked out sequences which they believed to be formulaic. Nesselhauf (2005) identified sequences based on a set of criteria (including consultation of dictionaries) which aimed to determine whether a word pairing was relatively fixed or part of a free combination. De Cock et al. (1998) and Durrant and Schmitt (2009) used frequency information from corpora to determine formulaicity. In the former study, formulas were combinations of words which occurred more than a given number of times in the learner corpus under study; in the latter, they were word pairs which co-occurred frequently in a large reference corpus of native-speaker production.
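A frequency-based identification procedure of this kind can be sketched in a few lines of Python; the n-gram length, the threshold and the miniature 'corpus' below are invented for illustration.

```python
from collections import Counter

def recurrent_sequences(texts, n=3, min_freq=2):
    """Find n-grams recurring across a set of texts, a rough analogue of
    frequency-based formula identification."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return {seq: c for seq, c in counts.items() if c >= min_freq}

# Invented miniature learner corpus of three speakers' utterances.
corpus = [
    "i think that it is a good idea",
    "i think that we should go",
    "it is a good idea i think",
]
print(recurrent_sequences(corpus))
# {('i', 'think', 'that'): 2, ('it', 'is', 'a'): 2, ('is', 'a', 'good'): 2,
#  ('a', 'good', 'idea'): 2}
```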

Downloaded from https:/www.cambridge.org/core. University of Liverpool Library, on 06 Mar 2017 at 21:45:32, subject to the Cambridge Core terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781139649414.004

60

DURRANT AND SIYANOVA-CHANTURIA

In quantifying the amount of formulaicity in their corpora, no attempt is made to disentangle the language produced by different individuals. No account is therefore taken of possible variation between individuals. This is problematic because, first, L1 acquisition researchers have long observed that some children are more 'formulaic' in their approach to learning a language than others (Nelson 1981; Peters 1983; Pine and Lieven 1993). If the same is true for second language learning, conflating data from different learners may provide averages which are not fair reflections of the language systems of any individuals. Second, without an estimate of variation, it is impossible to judge whether findings can be generalised to a broader population of learners. In Oppenheim's (2000) study, the formulaic language produced by each learner is quantified separately, which overcomes the first problem, though the small sample size (N = 6) means that it does not overcome the second. Durrant and Schmitt (2009) also quantify formulaicity separately for each contributor, this time with a larger sample (N = 30). By ascribing levels of formulaicity to each writer, rather than to the corpus as a whole, they are able to treat texts as separate units (akin to individual participants in an experiment). This renders visible the variation between texts and allows the use of standard inferential statistics (in Durrant and Schmitt's case, t-tests) to evaluate the generalisability of differences between natives and non-natives. Analysis of this sort, in which quantitative data are tied to individual language users or texts, rather than to the corpus as a whole, seems a necessary step if learner corpus work is to enable generalisations about learners' language systems.

A second psycholinguistic topic which has attracted learner corpus research is that of usage-based models of second language learning. The term 'usage-based' has been applied to a number of related linguistic theories (for overviews, see Kemmer and Barlow 2000; Croft and Cruse 2004). In the context of second language acquisition research (see, e.g., Ellis 2008a; Gries 2008a), usage-based theories view language knowledge as a structured inventory of symbolic units (constructions), which vary in schematicity, from concrete words and morphemes to abstract syntactic structures. Since all units on this scale are considered to be constructions, there is no dichotomy between grammar and lexis. Both the route of acquisition and the structure of the mature system are dependent on the input the learner receives and are particularly influenced by the type and token frequency of constructions (see also Chapter 16, this volume). Language structure emerges from the piecemeal learning of concrete usages and a gradual process of abstraction as commonalities between usages are identified. Language use and language knowledge are therefore intimately bound up with each other, and the traditional Chomskyan dichotomy between competence and performance (Chomsky 1957) is rejected. As a number of researchers have observed (Eskildsen and Cadierno 2007; Gries 2008a), the rejection of the dichotomies between
competence and performance and between lexis and grammar, and the importance ascribed to frequency of occurrence, have led to a particularly strong pairing between usage-based models and corpus linguistics. The general approach of learner corpus studies in this area has been to track the development over time of a single construction (or related set of constructions) in the production of a single learner (or small groups of learners) to see whether this development matches the predictions of usage-based models (Mellow 2006; Eskildsen and Cadierno 2007; Eskildsen 2009, 2012). This requires both extensive data collection over an extended period and coding of the corpus for occurrences of the construction under study before analysis can take place. Analyses usually provide a series of 'snapshots' of features such as the range, accuracy and complexity of constructions used, or the frequency and diversity of lexis found in each construction, at a number of points in time. The researchers then compile these snapshots for particular features of interest and trace their development over time.

A slightly different approach is taken by Blom et al. (2012), who use the level of accuracy of production in obligatory contexts of the third-person singular -s morpheme as the dependent variable in a logistic regression, with variables which are posited by usage-based models to be relevant (such as type and token frequency of the word form and its lemma, and the child's overall vocabulary size) entered as predictors.

As with the formulaic language studies, this research relies on the assumption that the production data found in corpora provide evidence for the linguistic representations in learners' minds. Indeed, the rejection of Chomsky's (1957) performance–competence distinction means that a close relationship between language use and the mental linguistic system is a basic theoretical commitment of usage-based models, which hold that 'usage mirrors knowledge and that linguistic knowledge is conceptualized as linguistic experience' (Eskildsen and Cadierno 2007: 2). While this means that the research is consistent in its own terms, it is important to note the danger of circularity here: the data which are presented in support of usage-based learning models only count as valid evidence if we presuppose one of the key tenets of these models. Researchers working in other paradigms may well remain unimpressed. This problem is acknowledged by Mellow (2006), who notes that the predictions about usage which he sets out to test are also consistent with rival universal grammar models.

An important methodological step which has the potential to enrich research of this type, and perhaps make it more persuasive, is taken by Eskildsen (2012). In addition to his quantitative analysis of the development of constructions over time, he employs a qualitative conversation analysis of his corpus data to uncover the situational contexts in which various forms of the constructions are used. In this way, he is able to show how the emergence and development of particular constructions
are based in the communicative demands of particular conversational exchanges. This adds an important dimension to our understanding of how constructions develop and, by taking explicit account of the influence of situational factors on the language found in the corpus, guards against the naive ascription of patterns of use directly to psychological mechanisms.

A third way in which learner corpora have been used to support psycholinguistic conclusions can be seen in studies of L1 transfer (see also Chapter 15, this volume). Unlike the research reviewed so far, these tend to be situated in a universal grammar framework. Studies of this sort have tested a range of hypotheses about language learning and representation by determining whether learners with contrasting first languages make different types of errors in their L2 production or over-/underuse particular linguistic items, in comparison with native-speaker norms. Zdorenko and Paradis (2008), for example, test the 'fluctuation hypothesis' of article acquisition, according to which L2 learners – like L1 learners – have full access to universal grammar principles and parameters and so fluctuate between different possible parameter settings for articles until the input enables them to set the parameter to the appropriate value. The authors hypothesise that, while this model will be correct for learners whose L1 does not have articles (such as Mandarin and Japanese), learners whose L1 does have articles (such as Spanish and Romanian) will directly transfer the parameter settings from their first language. Examining a longitudinal corpus of narratives produced by child learners with a variety of L1s, they find little evidence of differences in the patterns of errors produced by learners with contrasting L1s, and so reject their transfer hypothesis.

A similar approach is taken by Rankin (2012), who tests the hypothesis that, while advanced learners can achieve mastery of 'narrow syntax' (the basic, context-independent rules of the grammar), transfer effects may persist in situations where syntax interacts with other modules of the grammar, and particularly at the interface between syntax and discourse-pragmatics. He contrasts errors made in corpora of writing by students whose L1s have a verb-second constraint (German and Dutch) with those made by writers whose L1 does not have such a constraint (French). He finds that, while all students demonstrate good mastery of the syntax of verb inversion in English, only the former group tend to wrongly use inversion in certain discourse situations.

As with the other research reviewed in this section, studies of transfer can be criticised on the grounds that the corpus alone cannot provide direct evidence regarding the psychological processes and mechanisms under investigation. This is acknowledged by Rankin (2012), who observes that, because corpus data cannot give information about the mechanisms involved in on-line language processing, it is not possible to conclude with certainty whether transfer effects are a result of representation or of processing, a distinction he equates with that between errors (which are the
result of systematic differences in underlying competence) and mistakes (which are caused by intermittent problems with on-line production). Usage-based models reject such distinctions and so can ignore them without risk of inconsistency; within generative models, however, where such distinctions are key, the failure of corpus research to deal with them is problematic.
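The logic of such cross-L1 comparisons can be made concrete with a small worked example: a contingency table of error counts by L1 group, tested with a chi-squared test. The counts below are invented purely for illustration and are not Rankin's data.

```python
# Sketch: do two L1 groups differ in their rate of inversion errors?
from scipy.stats import chi2_contingency

# Rows: L1 groups; columns: [inversion errors, correct contexts].
observed = [[34, 466],   # L1 German/Dutch (verb-second L1s)
            [6, 494]]    # L1 French (no verb-second constraint)

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```

A significant result of this kind shows only that the groups' output differs; as the discussion above stresses, attributing the difference to transfer in underlying representation requires further argument.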

2.2 Integrating learner corpus and experimental research

As the above review has argued, studies which base psycholinguistic conclusions on corpus data alone tend to be vulnerable to the criticism that their evidence underdetermines their conclusions. The language productions represented in corpora are the result of a wide range of both psycholinguistic and situational factors which cannot be easily disentangled. So, while corpus data can suggest potentially fruitful hypotheses about psycholinguistic mechanisms, inferences must be seen as, at best, tentative hypotheses in need of further confirmation through other methods.

Given this limitation, it is important that learner corpus researchers with an interest in psycholinguistics learn to work in cross-disciplinary ways, integrating corpus data with experimental methods (see also Chapter 3, this volume). Truly interdisciplinary corpus-psycholinguistic work offers, we believe, great potential in that the two disciplines can give complementary perspectives on linguistic phenomena. Corpus data capture the productions of large numbers of speakers, can be collected naturalistically (and so can have high contextual validity), and can be used in an exploratory way to suggest new hypotheses. Psycholinguistic methods, on the other hand, allow for the testing of hypotheses through tightly controlled experiments (providing high construct validity) and allow us to monitor on-line language production and comprehension.1

1 We define on-line language processing (production and comprehension) as processing happening in real time. In on-line studies, reaction times and/or brain activity are recorded while participants perform a task in a laboratory setting under significant time pressure.

While important steps towards integration have been taken by researchers working with L1 data (see, e.g., the reviews in Gilquin and Gries 2009; McEnery and Hardie 2012), interdisciplinary work making use of learner corpora is, at present, both scarce and of variable quality. We can identify three types of integration that learner corpus researchers have explored.

The first is psycholinguistic studies which use learner corpora as a source of experimental stimuli. The ability of corpora to provide reliable quantitative information about language has long been exploited by psycholinguists to generate experimental items meeting particular criteria. This sort of method has been extremely popular as experimenters are often interested in measuring the influence on processing of exactly the types of variables which corpora are good at providing, such as frequencies of occurrence and levels of association between linguistic elements. Use of learner corpora in this sort of work is much less common, however. Exceptions are Siyanova and Schmitt (2008), who used a learner corpus to identify a set of collocations for an on-line judgement task, and Millar (2011), who looked at the effects of attested learner errors on native-speaker comprehension. Both studies are discussed in detail below.

A second type of combination is seen in Payne and Ross's (2005) study of the influence of working memory on L2 oral production, also described in detail below. The combination seen here is interesting in that it reverses the normal relationship between experimentally and corpus-generated variables. A common technique in experimental psycholinguistic studies (Ellis et al. 2008; Arnon and Snider 2010; Durrant and Doherty 2010) is to use corpus-linguistic data to quantify a variable (e.g. word or phrase frequency) and then measure the effect of that variable on some experimentally measured aspect of language processing or representation (e.g. word recognition time). In Payne and Ross's (2005) study, however, corpus data provide the output variables while experimental data are used to measure predictor variables. Learner corpus studies have often related their corpus findings to background data, such as L1 or years of learning, that are thought likely to influence learners' production. Payne and Ross's study can be seen as an extension of this method to incorporate variables which are measured by psycholinguistic means.

The third type of integration of corpus and experimental work draws on data from both types of study to triangulate on a single question. An example of this is Siyanova and Schmitt (2008). This study represents a more thorough integration of corpus and psycholinguistic research in that it both uses a learner corpus as a source of stimuli for a psycholinguistic experiment and triangulates the findings of this experiment with a more traditional learner corpus study. Again, this study will be discussed in more detail below.
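The first of these designs, in which a corpus-derived variable is used to predict an experimentally measured one, can be sketched in a few lines. Here, item-level response times are regressed on log phrase frequency; all numbers are invented for illustration.

```python
# Sketch: corpus frequency as a predictor of experimental response times.
import numpy as np
from scipy.stats import linregress

log_freq = np.array([1.2, 2.5, 3.1, 0.8, 2.0, 3.6, 1.7, 2.9])  # corpus-derived
mean_rt = np.array([712, 645, 601, 748, 668, 584, 690, 622])   # ms, per item

fit = linregress(log_freq, mean_rt)
print(f"slope = {fit.slope:.1f} ms per log-frequency unit, "
      f"r = {fit.rvalue:.2f}, p = {fit.pvalue:.4f}")
```

Payne and Ross's design, discussed above, reverses this arrangement: the corpus supplies the outcome variables and the experiment supplies the predictors.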

3 Representative studies

The following sections will look in detail at the three studies described in the previous section as examples of three ways of integrating learner corpora with experimental psycholinguistic methods, and discuss what lessons can be learnt from each that might inform future work.

3.1 Millar, N. 2011. 'The processing of malformed formulaic language', Applied Linguistics 32(2): 129–48.

Millar's (2011) study of the effects of learners' 'atypical collocations' on native speakers' reading processes is a rare example of a study which uses a learner corpus as a source for materials in a psycholinguistic experiment.

This study was motivated by claims that second and foreign language learners should be encouraged to learn 'formulaic language', defined in the paper as combinations of words which are used frequently and which appear to be stored and processed holistically by native speakers. Millar notes that one of the arguments put forward for teaching such sequences is that their holistic psycholinguistic status makes them easy for native speakers to process and so facilitates communication. A learner is more easily understood, it is claimed, if they use a formula than if they use a novel utterance. Millar set out to test this claim empirically by measuring the effect on native speakers' comprehension of learners' failure to use an appropriate formula. Since the claim is not that natives will fail to understand non-formulaic language, but rather that they will understand it with greater difficulty, he employed a psycholinguistic methodology – self-paced reading – to measure the processing load involved in handling incoming formulaic and non-formulaic strings of language.

Formulaic language includes a wide range of rather different phenomena which are probably not strictly comparable with each other (e.g. clichés, idioms, lexical bundles, collocations), so Millar restricted his attention to a particular type of collocation: adjacent two-word combinations whose frequency suggests they may be formulaic. Examples include married life, free time and ideal partner. Drawing on a 180,000-word corpus of English essays written by 960 different Japanese university students studying English as a Foreign Language, he identified a number of what he calls 'atypical collocations'. These were word pairs which are 'learner collocations', in that they:

1. occur in the corpus at least twice, and in the writing of at least two different learners
2. are 'statistically significant', in that the probability of their attaining their observed frequency by chance, given the frequencies of their constituent words, is lower than 5 per cent
3. are atypical, in that they:
   a. do not occur in the British National Corpus (BNC)
   b. were judged by the intuition of the researcher to be non-native-like.

For each atypical collocation, Millar identified a native-like collocation which expresses the apparently intended meaning. For example, the atypical pairing marriage life was matched with the native-like collocation married life. This was done by searching for the atypical pairings' component words in the BNC and identifying their collocates. In each case, only one component word was changed to create the typical collocation. The original atypical pairings were then classified as either 'lexical misselections' (where an inappropriate lexeme has been used, e.g. best (ideal) partner) or 'misformations' (where an inappropriate form of a lemma has been used, e.g. culture (cultural) background).
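The 'statistically significant' criterion (point 2 above) can be implemented in more than one way, and the description here does not specify which test Millar used. One simple possibility is a binomial test of the observed pair frequency against the co-occurrence rate expected if the two words combined only by chance; the sketch below, with invented counts, shows the idea.

```python
# Sketch: is a learner word pair more frequent than chance combination
# of its two words would predict? (One possible reading of criterion 2.)
from scipy.stats import binomtest

def is_significant(pair_count, w1_count, w2_count, n_tokens, alpha=0.05):
    # Probability of the two words co-occurring at any given position
    # under independence.
    p_chance = (w1_count / n_tokens) * (w2_count / n_tokens)
    result = binomtest(pair_count, n=n_tokens, p=p_chance,
                       alternative="greater")
    return result.pvalue < alpha

# e.g. a pair occurring 7 times in a 180,000-token learner corpus
print(is_significant(pair_count=7, w1_count=40, w2_count=310,
                     n_tokens=180_000))
```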

Thirty-two atypical/native-like collocation pairs were embedded in contrived sentence contexts, such that each sentence had two versions – one including an atypical collocation and one including a native-like collocation. Plausible contexts were identified with reference to sentences in which the native-like collocations appeared in the BNC, and their naturalness was confirmed by a panel of native-speaker informants. A set of filler sentences was also created. These were sentences containing native-speaker bigrams and were included to distract participants from the focus of the experiment. Finally, the sentences were divided into two lists, each containing sixteen sentences with atypical collocations, sixteen with native-like collocations and sixteen filler sentences. Paired atypical and native-like collocations never appeared in the same list. In the main experiment, each participant read only the sentences from one list, and thus saw one version of each sentence.

The experiment was a self-paced reading task, created using Lancaster University's own PsyScript application.2 Native-speaker participants were shown the sentences one word at a time on a computer screen. Participants paced their own reading by clicking on a mouse button to advance through the words. To ensure that participants read the sentences for comprehension, a simple true/false question followed each sentence. The key data were the lengths of time participants spent reading each word, as recorded by the computer program.

2 More information can be found at https://open.psych.lancs.ac.uk/software/PsyScript.html (last accessed on 13 April 2015).

Millar hypothesised that reading times would be longer on the second word of atypical collocations, reflecting greater difficulty of processing. The results confirmed this, with reading times on sentences containing atypical collocations being significantly higher on the second word of the collocation and on the two words following the collocation than they were on the corresponding words following native-like collocations (the analysis could not be extended beyond this span as the collocations were placed near the end of the sentences). The difference in reading times for atypical and native-like collocations was higher for lexical misselections than for misformations, though the relatively small number of the latter type makes it hard to base any firm generalisations on this difference.

Many learner corpus studies have looked at how the phraseology of learner writing differs from that of native speakers (see, e.g., the studies of formulaic language described in Section 2.1; see also Chapter 10, this volume). Such studies are important in that they enable exploratory descriptions of learner language. However, they remain at the level of the text, dealing with the products of communication only. The implications of these findings for processes beyond the text remain speculative. Millar's study takes the important further step of investigating how textual differences affect the communication process, in terms of their actual
(rather than assumed) impact on readers. This integration of textual data from the corpus with processing data about individuals' moment-by-moment interaction with language is a crucial one if we hope to properly understand the significance of learner corpus findings. Millar's study offers a good template for how integration of this type might be achieved, and we would encourage researchers to follow his example.

If further research in this area is to be effective, however, it is also important to reflect on what can be learned from this study's limitations. Millar describes the study as an extension of more traditional investigations of learner errors, which, he notes, have relied on native speakers' intuitions regarding the severity of errors. He argues that measures of on-line processing provide a more objective view of the effects of errors than intuition-based methods. The learner corpus is therefore primarily used as a bank of authentic learner divergences from native-speaker norms. Importantly, however, it is only the collocations themselves which are drawn from the corpus. The broader contexts in which they are set are fabricated, based on contexts which were found to be typical for corresponding native-like collocations. What the experimental participants read, therefore, were not authentic learner productions, but rather elements of atypical utterances which had been embedded in other contexts. Since these contexts were designed on the basis of contexts which were typical for the native-like collocations, it is perhaps unsurprising that the native-like collocations were read more easily. It remains unclear whether the original learner utterances would have caused similar problems. The broader point here relates to a recurring problem of corpus linguistics: identifying features from across a large number of texts (e.g. extracting recurring collocations from a corpus) involves abstracting away from the original textual contexts, and those contexts may include information which is essential to a proper understanding of the phenomena being studied. Where learner corpora are being used as a source of authentic examples, therefore, it is important that authenticity is appropriately thoroughgoing, incorporating as much of the context as is practicable.

From the experimental side, the use of the self-paced reading task as a measure of processing difficulty is open to question. The major downside of this method is that the reader cannot skip words or regress while reading (something readers do all the time, especially in highly predictable contexts, such as those provided by frequent collocations) but is forced to read each word, one word at a time. This renders a word-by-word self-paced reading task rather unnatural (very much unlike the eye-tracking method; for an overview, see Roberts and Siyanova-Chanturia (2013) and Siyanova-Chanturia (2013)). It is admittedly the case that the eye-tracking method is not as cheap or as available to the applied linguistics community as one could wish. Reaction-time experiments, however,
if designed well, can reveal a great deal about on-line processing. Given that the focus of Millar's study was on multi-word expressions rather than single words, it would perhaps have been more enlightening to perform a phrase-by-phrase, rather than a word-by-word, self-paced reading experiment, allowing for 'holistic' presentation of the target collocation. Since Millar observed slower reading times not only on the second word of the atypical collocation but also on the subsequent words, the processing difficulty would still have been observable, while the reading, at least that of the collocation itself, would have been more naturalistic under holistic presentation. This method of presentation would also cohere well with Millar's hypothesis that collocations are stored holistically (which, it needs to be pointed out, is a rather speculative and still empirically unsupported proposition). If collocations are – as Millar assumes – holistic units, presenting them as two separate words may serve to undermine the key aspect of natural holistic processing which the experiment aims to tap.

A further disadvantage of word-by-word self-paced reading is that it may interfere with readers' creating a natural prosodic contour for the collocation. This matters because prosody appears to play a role in syntactic ambiguity resolution (Fodor 2002) and in multi-word expression processing (e.g. Lin 2012). Lastly, according to Millar, no instructions were given to participants as to whether the sentences should be read silently or aloud. This resulted (Millar reports) in some sentences (or even selected words) being read aloud and others silently. It needs to be pointed out that reading aloud is only half as fast as silent reading (Field 2004), and the two modalities must not be used interchangeably in one experiment. This might have introduced a confound and potentially led to longer reading times for some (but not other) collocations. These points demonstrate that important methodological questions need to be answered when setting out to conduct a psycholinguistic experiment, and interdisciplinary researchers, who may not be steeped in the discipline of psycholinguistics, need to take special care to address these.

3.2 Payne, J. S. and Ross, B. M. 2005. 'Synchronous CMC, working memory, and L2 oral proficiency development', Language Learning and Technology 9(3): 35–54.

Payne and Ross (2005) aimed to investigate the influence of working memory capacity on various aspects of learners' performance in oral and internet chatroom tasks. The study builds on previous work (Payne and Whitney 2002) which had suggested that taking part in synchronous computer-mediated communication (such as chatrooms) can help to develop L2 learners' ability to engage in oral conversation, and that the benefits are greatest for learners with lower working memory capacity. This previous work had theorised that such learners may be especially helped by such tasks because, by reducing the pace of discussion (as
compared with normal oral conversation) and allowing interlocutors to review previous contributions whenever necessary, they reduce the burden on working memory (Payne and Whitney 2002).

Participants were twenty-four learners of Spanish as a foreign language at an American university, taking part in a course which involved weekly chatroom tasks including discussion of readings and personal themes, role-plays, and co-construction and understanding of video content. Instructors also took part in the discussions, helping to sustain them and sometimes providing corrective feedback. Transcripts from 150 different chatroom sessions (20 different tasks undertaken by 6–8 separate groups) of around 50 minutes each were compiled to form the learner corpus. In addition to their online tasks, participants also completed oral proficiency tests at the beginning and end of the course, in which they were asked to speak on a topic for approximately five minutes. While the results of the oral tests were included in the analysis (see below), they did not form part of the corpus.

The corpus was divided into two parts, corresponding to the first and second halves of the semester. Data were collected on the total number of words, utterances and turns per chat and on the number of 'repetitions' (reuse of a word or phrase from an earlier turn by another participant) and 'relexicalisations' (in which an earlier idea is repeated but the structure and/or one of the main grammatical words or phrases is altered).

For its predictor variables, the study drew on Baddeley's (1986, 2000) model of working memory, and in particular his concepts of 'phonological working memory capacity' (PWMC) and 'central executive'. Phonological working memory maintains phonetic information for a short period, while the central executive is the system responsible for allocating attention to particular tasks. Payne and Ross measured participants' PWMC through a task in which they listened to three sets of eight pseudo-words read aloud. After each set, they were asked to select the items they had heard from a list. The central executive was measured through a more complex task in which participants first viewed fifteen sets of sentences, containing sixty sentences in total. In each set, sentences were displayed one at a time, with a seven-second gap between each. Participants were asked to (1) decide whether each sentence made sense and record their judgement by pressing an appropriate button; and (2) remember the final word of each sentence. After all sentences had been viewed, participants were asked to select the words they had remembered from a list. The list included both the target words and, for each target word, two distractors. In each case, one distractor was related in meaning to the target word (e.g. for the target word girl, a distractor might be woman) and the other was a final word from one of the sentences in a previous set. For the following analyses, participants were divided into equal-sized 'high' versus 'low' working memory capacity groups for each of the two measures based on a median split.
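The median-split step is simple to state in code; the scores below are invented for illustration. Section 4 returns to the question of whether dichotomising a continuous measure in this way is advisable at all.

```python
# Sketch: forming equal-sized 'high' and 'low' groups by a median split.
import numpy as np

scores = np.array([11, 14, 9, 17, 13, 8, 15, 12, 10, 16, 7, 18])
median = np.median(scores)
group = np.where(scores > median, "high", "low")
for s, g in zip(scores, group):
    print(s, g)
```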

Analysis showed a significant reduction in both repetition and relexicalisation from the first to the second half of the course, with no significant differences between high and low working memory capacity groups. The number of words produced per session, the number of utterances per session, and the number of turns per session all increased from the first to the second half of the course. These increases were all significantly greater for the high than the low central executive group. The increase in words per session was also higher for the low than the high PWMC group.

To unpack the apparently opposing effects of the two types of working memory, the researchers divided participants into four groups according to their scores on the two working memory constructs (low PWMC/low central executive; low PWMC/high central executive; high PWMC/low central executive; high PWMC/high central executive). Significant overall differences were found between the four groups, with post hoc tests showing that the high phonological working memory/low executive control group showed significantly less increase than the other groups in the number of words per session. The researchers conclude that there may be an interaction between the different types of working memory, which means that the role of working memory in computer-mediated communication may be more complex than previously supposed. Looking at the results of the oral proficiency tests, the high PWMC group showed a significantly greater improvement than the low group, while no difference was found between the high vs low central executive groups.

A central technique of corpus linguistics is that of examining quantitative variations in linguistic features across predefined variables (across text types, medium of communication, categories of writer, etc.). Payne and Ross's study is of great interest in that it shows how such analyses can incorporate psycholinguistically defined variables. This is an important step in that it enables us to integrate hypotheses about cognitive mechanisms (in Payne and Ross's case, Baddeley's model of working memory) into our analysis, and so to make principled claims about the causal mechanisms behind production.

As with Millar's study, however, there is also much to be learned from the limitations of Payne and Ross's work. One important point concerns the operationalisation of the variables. As was noted above, the pseudo-word task (a variant of the traditional 'nonword repetition task') differed from traditional tasks in that it did not literally involve repetition but rather selection of items from a list. However, no rationale is provided for this change from the more usual method, and no argument is provided for the new method's ability to tap phonological working memory. Similarly, the rather complex task used to measure reading span is presented without argument, leaving the reader with no way of judging the extent to which this succeeds in measuring the targeted construct. Given the rather confusing (and unexpected) picture
which comes out of the analysis, it is worth questioning how well the two intended constructs were captured by these techniques.

Secondly, given the rich nature of the data collected, it is rather surprising that the researchers are so quick to lump their quantitative data into broad categories. Although they have data for each of twenty chatroom sessions, rather than tracing performance across this time frame, the sessions are divided rather bluntly into first versus second half of the semester. A great deal of potentially informative detail is thereby removed from the analysis with no explanation. Similarly, the decision to divide participants into two broad groups of low versus high working memory excludes a large amount of detail about the variation in capacity across individuals. An analysis which retained these rich data might well have produced a more accurate and nuanced picture of the influence of working memory on production.

3.3 Siyanova, A. and Schmitt, N. 2008. 'L2 learner production and processing of collocation: A multi-study perspective', The Canadian Modern Language Review / La Revue canadienne des langues vivantes 64(3): 429–58.

The overall aim of this study was to assess advanced second language learners' knowledge of collocations. The researchers use three separate methods to approach this question from different angles. Study 1 uses a learner corpus and a comparable native-speaker corpus to ask whether L2 learners produce as many high-frequency collocations as natives in their writing. Study 2 uses a judgement task to ask whether L2 learners have as accurate a sense of collocation frequency as natives. Study 3 uses a timed judgement task to ask whether L2 learners process collocations in a similar way to natives.

In the first study, 810 adjective–noun combinations were extracted from a corpus of thirty-one essays written by Russian advanced learners of English (part of the International Corpus of Learner English, ICLEv1 (Granger et al. 2002)). Analysis showed that roughly half of these learner collocations appeared at least six times in the 100-million-word BNC. Around one-quarter were not attested in the BNC, while another quarter were found to appear five times or fewer. Analysis of the native-speaker corpus (22 argumentative essays that were part of the Louvain Corpus of Native English Essays, LOCNESS)3 found a similar profile for native speakers. The authors conclude that 'a large percentage of the learners' collocations could be considered appropriate' (Siyanova and Schmitt 2008: 437) and that L2 learners do not, as some researchers had claimed, underuse native-like collocations.

3 www.uclouvain.be/en-cecl-locness.html (last accessed on 13 April 2015).

An obvious shortcoming of Study 1 is that – like the text-based studies reviewed in Section 2.1 – it deals only with the products of collocation knowledge. This may be an incomplete reflection of what learners
know and it tells us nothing about how collocations are processed. The remaining studies in the article aim to provide alternative perspectives on collocation knowledge which take these aspects into account.

In Study 2, the researchers looked at learners' intuitions by asking a group of native and non-native speakers to rate the typicality of adjective–noun pairs in a judgement task. Two groups of collocations were selected from those analysed in Study 1: frequent native-like collocations and infrequent learner collocations. Care was taken to ensure that frequent collocations were typical – this was judged on the basis of BNC frequency, mutual information scores, and their appearance in two collocation dictionaries. The infrequent learner collocations did not appear in the BNC or the collocation dictionaries. To allow for a finer-grained analysis, the frequent collocation group was subdivided into high-frequency (occurring more than 100 times in the BNC) and medium-frequency (occurring 21–100 times in the BNC) collocations. The frequent and infrequent collocations were inserted in a questionnaire, in which participants (60 native speakers and 60 non-native-speaking staff and students at two British universities) were asked to rate each collocation on a six-point Likert scale, from 1 (very uncommon) to 6 (very common).

Analysis revealed that native speakers' intuitions mirrored the BNC frequency data rather better than did the non-natives' (a similar finding has been observed in Italian, see Siyanova-Chanturia and Spina (in press), where native but not non-native speaker judgements were found to correlate with the L1 reference corpus for some of the target items). While L2 learners' ratings did reliably distinguish frequent from infrequent collocations, they were found to be very similar for high- and medium-frequency collocations. In contrast, L1 speakers reliably distinguished frequent from infrequent and high- from medium-frequency collocations. Natives were also far more decisive in their ratings, making use of a wider range of scores, such that, on average, they gave frequent collocations higher, and infrequent collocations lower, scores than did the non-natives. Perhaps as a result of this, native speakers' ratings also correlated more strongly (r = 0.58) with BNC frequency data than did non-native ratings (r = 0.44). Study 2 therefore seems to have uncovered differences between native and non-native knowledge of collocation which were not evident in the corpus-based study.

The final study pushes the investigation further by looking at the mental processing involved in making acceptability judgements. As was discussed in the review of Millar's article above, it is commonly assumed that high-frequency collocations are processed more efficiently than novel word combinations, and that the more frequent a collocation, the more efficient the processing is likely to be. To determine whether advanced learners process collocations in a native-like way, Siyanova and Schmitt repeated Study 2 using an 'on-line' method – that is, one in which not only participants' judgements but also the time they took to reach these judgements were recorded. The task, performed with twenty-seven native and twenty-seven non-native English speakers (not
involved in Study 2), was the same as in Study 2, but this time it was performed through a computer program which recorded response times for each judgement. The results for the judgements closely matched those of Study 2: whereas native speakers reliably distinguished low-, medium- and high-frequency combinations, non-natives distinguished low-frequency combinations from the others, but did not distinguish medium- from high-frequency pairs. Interestingly, the reaction-time data mirrored these results exactly: both native and non-native participants responded more rapidly to high-/medium-frequency collocations than to low-frequency collocations, whereas natives, but not non-natives, also responded more rapidly to high- than to medium-frequency collocations. This appears to indicate that native speakers' more finely tuned ability to distinguish collocation frequency is mirrored in their language processing.

One of the merits of this article is that, by investigating learner collocations from a number of different perspectives, it enables a more in-depth analysis of the phenomenon than any of the studies could have provided alone. The contrast between the corpus-based Study 1 and judgement-based Studies 2 and 3 is particularly instructive. The judgement studies appear to demonstrate differences between learner and native collocation knowledge and processing which were not evident in the corpus-based study. At the same time, the equivalence in collocation use seen in the corpus-based study appears to suggest that, in terms of the language they produce, learners are able to overcome limitations in intuition and processing to produce writing that is (at least in the ways it was evaluated in this study) native-like.

As with the other studies reviewed, however, this article has a number of shortcomings, on which it will be instructive to reflect. We have noted that an important claim of the corpus-based study is that learners' use of collocations is similar to that of native writers. It is imperative to recognise, however, that this claim is based (much like Millar's (2011) extraction of atypical collocations) on a rather decontextualised analysis. Collocations are judged to be 'appropriate' if they are attested in the BNC. However, no attempt is made to ensure that these attested collocations are appropriate to the context in which they were originally used. While the study shows that learners are similar to natives in the extent to which they draw on frequent pairs, therefore, it does not tell us whether they are using them in similar ways. It is possible that a more detailed corpus analysis would reveal important differences between learners and natives in their collocation use.

Turning to the psycholinguistic side of the study, while the on-line judgement task used in Study 3 represents an important step in showing how our understanding of a phenomenon can be deepened through psycholinguistic methods, two important shortcomings should be noted with its experimental set-up. First, it relied on custom-built reaction-time software and used a standard computer mouse as an interface (the
latter point also applies to the Millar (2011) study reviewed above). This is important as the accurate measurement of reaction times through a computer is a rather problematic task, easily affected by such factors as the refresh rate of the screen, the processing speed of the computer, other computer programs that may be running at the same time, and the speed with which the computer registers inputs from peripherals such as the mouse and keyboard. Because of the possible distorting effects of these factors, it is in general advisable to use professionally designed software (such as E-Prime or DMDX)4 in combination with a 'button box' (an input device that allows for more accurate timing of an event, e.g. a button press, than a conventional keyboard) to register responses.

4 More information about E-Prime is available at www.pstnet.com/eprime.cfm (last accessed on 13 April 2015). More information about DMDX is available at www.u.arizona.edu/~kforster/dmdx/dmdx.htm (last accessed on 13 April 2015).

Second, insufficient steps were taken in Study 3 to ensure that the experimental stimuli were matched for all of the characteristics that are known to affect on-line processing. In particular, the individual words which appeared in each combination were not deliberately matched in terms of frequency and length. It is crucial to bear in mind when designing experiments of this sort that only when the stimuli have been closely matched for various lexical properties, and the experimental design and/or instrument are sound, can one hope for the study to be both reliable and replicable (although much subsequent research has confirmed the findings of Study 3; see Siyanova-Chanturia and Martinez (2014) for a review).
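Checks of this kind are normally run before the experiment: the two stimulus conditions are compared on each lexical property, and the researcher wants to see no significant differences. A minimal sketch, with invented item properties:

```python
# Sketch: verifying that two stimulus conditions are matched on word
# frequency and length before running the experiment.
import numpy as np
from scipy.stats import ttest_ind

freq_a = np.array([3.1, 2.8, 3.4, 2.9, 3.0])  # log word frequency, condition A
freq_b = np.array([3.2, 2.7, 3.3, 3.0, 2.9])  # condition B
len_a = np.array([5, 7, 6, 8, 5])             # word length in letters
len_b = np.array([6, 7, 5, 8, 6])

for name, a, b in [("frequency", freq_a, freq_b), ("length", len_a, len_b)]:
    t, p = ttest_ind(a, b)
    print(f"{name}: t = {t:.2f}, p = {p:.3f}")  # p well above .05 is desired
```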

4 Critical assessment and future directions

We believe that there is much to be gained from integrating learner corpus and psycholinguistic research. The methodologies of the two fields have complementary strengths and can provide information about different aspects of learner language to yield a more robust and more complete picture than either could give in isolation. Corpus studies can suggest hypotheses about language processing for psycholinguists to test, and vice versa; methodologies from each field can be used to quantify a fuller range of variables for analysis; and corpus and experimental studies can be triangulated to provide a range of complementary perspectives on a single phenomenon.

While corpus/psycholinguistic integration has much to offer, great care needs to be taken when pursuing such research. Corpus-only studies which attempt to draw psycholinguistic conclusions rely on theoretical assumptions which are often open to debate and so should be treated as correspondingly tentative. It is especially important that we avoid the temptation of automatically ascribing all patterns found in the corpus
to features of the mind without further interrogation. The three studies discussed in the previous section provide good examples of the ways in which this can be achieved.

A further area where care needs to be taken is in the use of statistics in studies that combine learner corpora and psycholinguistic techniques (see Chapter 8, this volume). As we discussed in Section 2.1, the tendency amongst corpus linguists to conflate quantitative data from across large numbers of language users may lead to misleading central tendencies and makes the generalisability of findings across language users impossible to estimate. While statistics of this sort can make sense in some research contexts (e.g. in making generalisations about particular varieties of a language), they are often ill-suited to drawing psycholinguistic conclusions about language users themselves. Approaches such as Durrant and Schmitt's (2009), in which data are quantified separately for each language user, may offer a way forward here.

In relation to Payne and Ross's (2005) study, we have also argued that full use needs to be made of the richness of the quantitative data provided by both corpus and experimental methods. Premature conversion of interval data into categories (as seen in the conversion of working memory scores into two broad 'high' vs 'low' groups) seems to offer few advantages. Researchers should therefore be prepared to make use of statistical methods that can accommodate continuous independent variables (e.g. correlation, regression, ANCOVA), rather than relying on methods which look for differences between groups (such as the t-test and ANOVA).

Finally, researchers should also be aware of new statistical approaches to data analysis that are gaining ground in studies of first and second language acquisition and processing, such as, for example, mixed-effects modelling (see Arnon and Snider 2010; Siyanova-Chanturia et al. 2011; for an overview, see Cunnings 2012). Mixed-effects modelling is used in the analysis of repeated-measures data with subjects and items as crossed random effects. According to Baayen et al. (2008), one of the many advantages of this method is that it is possible to take into account the effects that unfold during the course of an experiment (e.g. effects of learning and fatigue), as well as to consider other potentially relevant covariates.

Perhaps most importantly, and most problematically, learner corpus researchers who wish to integrate their work with psycholinguistics need to familiarise themselves thoroughly with the methodologies they are adopting. In our brief review, we have seen examples of inappropriate tasks, failures to control for confounding variables, and the questionable use of self-built software and inappropriate hardware, all of which may reduce the validity of these studies from a psycholinguistic perspective. We are conscious of having held the studies reviewed to rather exacting standards on this count. Our aim is not to disparage the researchers involved. Interdisciplinary work of this sort is a relatively recent phenomenon and few applied linguists have a thorough familiarity with both
corpus and psycholinguistic methods. Most of us involved in this area have therefore been obliged to feel our way rather tentatively, and the authors of the present chapter are as guilty as anyone of making imperfect use of psycholinguistic methods (as the review of Siyanova and Schmitt's paper makes clear). This is, perhaps, an inevitable part of attempting to break new interdisciplinary ground. However, if the corpus-psycholinguistic enterprise is to develop further, it is crucial that researchers deepen their methodological understanding on both fronts.

A constant danger of interdisciplinary work in general is that of misappropriating or misusing methods or ideas that have developed within particular disciplinary contexts. This has been seen here in the premature ascription of features of corpora to mental constructs, in the inappropriate use of corpus statistical methods, and in the rather shaky adoption of psycholinguistic methods by researchers from other research backgrounds. Going forward, the greatest challenge facing corpus-psycholinguistic researchers, like that facing interdisciplinary researchers in general, is both to reflect the intellectual and methodological rigour which has developed in the two areas and to combine them in original ways that make sense in the context of the issues they aim to address.
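To make the statistical recommendations above concrete, the sketch below fits a mixed-effects model in which working memory enters as a continuous predictor rather than a median split, with random intercepts for participants. The data file and column names are hypothetical; note, too, that statsmodels handles fully crossed subject-and-item random effects only approximately (via variance components), so researchers often turn to R's lme4 for such designs.

```python
# Sketch: mixed-effects analysis of per-session output with a continuous
# working memory predictor. 'chat_sessions.csv' is a hypothetical file
# with columns: participant, session (1-20), wm_score, words.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("chat_sessions.csv")
model = smf.mixedlm("words ~ wm_score * session", data,
                    groups=data["participant"])
print(model.fit().summary())
```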

Key readings

Gilquin, G. and Gries, S. Th. 2009. 'Corpora and experimental methods: A state-of-the-art review', in Gilquin, G. (ed.), Corpora and Experimental Methods. Special issue of Corpus Linguistics and Linguistic Theory 5(1): 1–26.

This review paper raises important questions with regard to the interdisciplinary research (or lack thereof) that makes use of both corpora and experimental methods. It is argued that psycholinguists frequently exploit the potential of combining corpus and experimental data. In contrast, corpus linguists (and applied linguists, for that matter) rarely do so. The paper is enlightening in that it presents succinctly the compelling benefits of combining the two methods. A must-read for any corpus linguist contemplating the possibility of combining their corpus evidence with experimental data.

Gries, S. Th. 2012a. 'Corpus linguistics, theoretical linguistics, and cognitive/psycholinguistics: Towards more and more fruitful exchanges', in Mukherjee, J. and Huber, M. (eds.), Corpus Linguistics and Variation in English: Theory and Description. Amsterdam: Rodopi, pp. 41–63.

This is another paper by Gries that strongly argues for the benefits of interdisciplinary research and presents a case for why corpus linguistics should enter into more 'mutually beneficial relations' with
its neighbouring fields, such as theoretical linguistics, cognitive linguistics and psycholinguistics.

Durrant, P. and Doherty, A. 2010. 'Are high-frequency collocations psychologically real? Investigating the thesis of collocational priming', Corpus Linguistics and Linguistic Theory 6(2): 125–55.

This paper investigates whether frequency of occurrence in a corpus is a reliable indicator of psycholinguistic priming between words. The findings suggest that corpus frequency does indeed reflect the psychological reality of collocations. The implications of this study are important in that frequency-based methods can be used to identify useful collocations for a second language learner to acquire. This study is also of interest because it was the first to demonstrate that frequency-based collocational priming can exist independently of psychological association.

Salsbury, T., Crossley, S. A. and McNamara, D. S. 2011. 'Psycholinguistic word information in second language oral discourse', Second Language Research 27(3): 343–60.

This longitudinal corpus study is interesting in that it explores the development of word knowledge in the oral discourse of English language learners using four psycholinguistic indices of word knowledge: concreteness, imageability, meaningfulness and familiarity. As such, the focus of the study is on vocabulary depth rather than breadth. The results suggest that over the course of one year, learners' productive vocabularies became more abstract, less context dependent and more tightly associated. These findings offer new insights into how vocabulary knowledge can be measured in future studies of L2 lexical development.

Siyanova-Chanturia, A., Conklin, K. and van Heuven, W. J. B. 2011. 'Seeing a phrase "time and again" matters: The role of phrasal frequency in the processing of multi-word sequences', Journal of Experimental Psychology: Learning, Memory and Cognition 37(3): 776–84.

Outside the idiom domain, very few studies have looked at the on-line processing of formulaic language in an L2. This study examines the processing of binomial phrases in L1 and L2 using an eye-tracking paradigm. Similar to Arnon and Snider (2010), mixed-effects modelling was used. The findings showed that both native and non-native speakers were sensitive to phrasal frequency as well as phrasal configuration. These results support the view that each and every occurrence of a linguistic unit contributes to its degree of entrenchment in one's memory.


5 Annotating learner corpora

Bertus van Rooy

1 Introduction

Corpus annotation refers to 'the practice of adding interpretative, linguistic information to an electronic corpus of spoken and/or written language data' as well as the end product of this process (Leech 1997a: 2, emphasis original). A learner corpus, like any other corpus, consists in the first instance of the raw text, an electronic version of the word tokens originally produced by the learner (or transcribed from the recording in the case of a spoken corpus). A learner corpus offers three advantages to researchers in comparison with the traditional data sources used in second language acquisition (SLA) and foreign language teaching (FLT) research: size (there is much more data), variability (more individuals and a wider range of text types can be included) and automation of many aspects of data analysis. However, the value of the resource is increased even further by the availability of the additional layers of information that come from relevant forms of annotation (Granger 2004: 124–28; Gilquin and De Cock 2011).

Error tagging has been the most frequent type of annotation used in learner corpus research up to now (Meurers 2009: 469; Rastelli 2009: 57; Díaz-Negrillo et al. 2010: 2; Rehbein et al. 2012: 2) and is the topic of a separate chapter (Chapter 7, this volume). Within computational linguistics, a substantial amount of research is directed at the development of automatic tools for the annotation of data, targeting linguistic properties such as word class, syntactic structure or semantic fields (e.g. Garside et al. 1997; van Halteren 1999; Jurafsky and Martin 2000; Mitkov 2004). Some of these tools have been applied to learner corpora as well, especially part-of-speech, or word class, tagging (henceforth abbreviated as POS tagging) and syntactic parsing. Research on the use of existing annotation tools, as well as the development of customised annotations for the linguistic properties of learner corpora, is the subject of this chapter.



The annotation of learner corpora for grammatical features increases the value of such corpora for researchers, because larger volumes of data can be extracted with greater ease than from an unannotated, or raw, corpus. This chapter reviews the nature and types of linguistic annotation before highlighting other core issues specific to learner corpora, such as the uses of annotated learner corpora, criteria for assessment and the implications of annotation accuracy for research. Chapter 6 (this volume) complements this chapter by looking at the annotation of spoken language.

The typical studies reviewed in this chapter look specifically at the effect of learner error on annotation accuracy, at the assumptions made when annotation tools developed for native-speaker data are used on learner corpora, and at the architecture for annotations at the word-class level. A study of the possibilities offered by syntactic parsing for learner corpus research concludes the presentation of representative studies.

The most important issue that still requires conceptual debate and resolution concerns the assumptions made when dealing with the non-canonical, or erroneous, parts of learner corpora. Two views are currently proposed: either a target hypothesis should be formulated, so that a corrected version of the learner data is subjected to annotation, or the learner data should be left uncorrected, so that the annotation can be used precisely to discover underlying properties of learner data beyond the reconstruction of an intended target-language form. Beyond this conceptual debate, other important tasks for future research are identified.

2 Core issues

2.1 Types of annotation

Among descriptive linguistic types of annotation, POS tagging is the most frequently encountered form in learner corpora and generally acts as the first type of grammatical annotation for any corpus (Leech 1997a: 5). An example of a sentence from the Tswana Learner English corpus, tagged with the CLAWS tagger and the C7 tagset, is the following:

(1) We_PPIS2 find_VV0 that_CST in_II fact_NN1 these_DD2 people_NN are_VBR the_AT most_RGT exposed_JJ to_II media_NN not_XX to_TO mension_VVI the_AT fact_NN1 that_CST there_EX is_VBZ forever_RT AIDS_NP1 awareness_NN1 campaigns_NN2 launged_VVN through_RP out_RP the_AT county_NN1._ (ICLE-TS-NOUN-0005.1)



In regular orthographic form this sentence reads:

(2) We find that in fact these people are the most exposed to media not to mension the fact that there is forever AIDS awareness campaigns launged through out the county.

In the sentence, there are a number of clear learner errors, such as the spelling of mension (mention) and launged (launched), and a concord error in the existential construction, with the notional subject 'AIDS awareness campaigns' requiring a plural form of the verb 'to be'. However, in principle, these characteristics of the learner sentence are separate from the POS tags, and POS tags can also be assigned to learner data, including the erroneous forms. The tags indicate the word class to which each word belongs, e.g.

• Verbs: find (VV0=verb, base form), are (VBR=verb be in the present tense, plural form), mension (VVI=verb, infinitive form), launged (VVN=verb, past participle);
• Nouns: fact (NN1=singular common noun), people (NN=common noun, neutral for number), campaigns (NN2=plural common noun), AIDS (NP1=singular proper noun).

The tags not only indicate word class in its broadest sense, but usually also aim to distinguish more specific subclasses, such as inflectional classes of verbs, or common as opposed to proper nouns, as illustrated by these tags.

In principle, it is possible to develop annotation for any level of linguistic analysis. At the grammatical levels, apart from POS tagging, syntactic parsing is also often encountered (Leech 1997a: 12). Syntactic parsers attempt to analyse the syntactic structure of sentences at various depths of analysis. Shallow parsing simply attempts to divide a sentence into consecutive word chunks that form categories like verb or noun phrases, without much attention to internal constituency and embedded constructions (Carroll 2004: 234). Dependency parsing attempts to determine whether a particular word functions as the head or a dependent of a syntactic phrase, and to attach the dependents to their heads, as illustrated by the following example from Geertzen et al. (2013: 5):

(3a) I hope we can learn English together.

(3b)
1  I         PRP  2  nsubj
2  hope      VBP  0  root
3  we        PRP  5  nsubj
4  can       MD   5  aux
5  learn     VB   2  ccomp
6  English   NNP  5  dobj
7  together  RB   5  advmod



In the syntactic parse (3b), the words are numbered 1–7 as they occur in the sentence, followed by the POS tag. The next column represents the word number of the syntactic head on which a particular word is dependent. Since word 2 (hope) is the main verb, it is not itself dependent on anything else. Its subject, the pronoun I, is marked with a 2, which means that it is dependent on the verb. The relation is labelled 'nsubj', which means it is the subject of the verb.

Context-free parsing will not only identify the boundaries and label phrase-structure constituents, but will also indicate the internal, hierarchical structure of constituents in a range of notations that can usually be translated back to a syntactic tree of the kind used in early transformational grammar formalisms (Jurafsky and Martin 2000: 326–94). At an even more advanced level, constraint-based formalisms for parsing combine hierarchical structures with syntactic features familiar from frameworks such as Head-Driven Phrase Structure Grammar or Lexical-Functional Grammar, to deal with agreement features and long-distance relations within a sentence (Jurafsky and Martin 2000: 395–446).

Semantic tagging involves assigning a word to a semantic field or making more refined distinctions as required by the purpose of the research. Each word (or each content word) is assigned a tag that represents its semantic field, which enables the determination of the most frequent semantic concerns in a text or a corpus (Wilson and Thomas 1997). Semantic tagging can be illustrated by examples from the UCREL Semantic Analysis System (USAS), taken from Archer et al. (2002). A word is assigned to a general discourse field, represented by the upper-case letter at the beginning of the tag, followed by a digit that represents the first subdivision in the field. After these two compulsory elements, various further optional codes can be assigned, such as finer subdivisions, indicated by a second digit after a decimal point, and positive or negative scaling of the word, as exemplified by (4a–c).

(4a) chance      A1.4
(4b) apron       B5
(4c) guilt trip  E4.2-

In example (4a), the following codes are used: Discourse field A (general and abstract terms), first subdivision A1 (general), second subdivision A1.4 (chance, luck); the codes in example (4b) are: Discourse field B (the body and the individual), first subdivision B5 (clothes and personal belongings); and, in example (4c): Discourse field E (emotional actions, states and processes), first subdivision E4 (happy/sad), second subdivision E4.2 (contentment), E4.2- (negative scaling).
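As a concrete illustration of the annotation types surveyed so far, the following minimal sketch produces POS tags and a dependency analysis along the lines of (3b) with an off-the-shelf pipeline. It assumes the open-source spaCy library and its small English model, neither of which is discussed in this chapter; their Penn Treebank POS tags and dependency labels will not coincide exactly with the CLAWS C7 tags in (1) or the labels in (3b).

import spacy

# Assumes spaCy and its small English model have been installed
# (pip install spacy; python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")
doc = nlp("I hope we can learn English together.")

for token in doc:
    # spaCy marks the root as its own head; print 0 for it, as in (3b)
    head = 0 if token.head is token else token.head.i + 1
    print(token.i + 1, token.text, token.tag_, head, token.dep_.lower())

Learner spellings like mension in (1) would be handled by the model's guessing strategies, with the accuracy risks discussed in Section 2.5.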


While some work has been done on POS tagging, starting with de Haan (2000), and more recently on syntactic parsing (e.g. Geertzen et al. 2013), Ragheb and Dickinson (2013: 169) maintain that work on learner corpus annotation is still very limited. Publicly available annotated corpora, which can be used to compare results and develop software tools for learner corpus analysis, are even scarcer (Nagata et al. 2011: 1210).

Beyond these forms of annotation, which correspond to typical levels of linguistic analysis, a number of application-oriented (Leech 1997a: 15) or problem-oriented (de Haan 1984) annotation systems can be distinguished. Leech (1997a: 15) specifically mentions error tagging for learner corpora as one such system, while de Haan (1984: 123–39) develops a proposal for a very detailed system of analysis of noun phrases, in which different properties are annotated manually and used for further analysis.

Smith et al. (2008: 167) point out that very little attention has been given to the manual annotation of corpora (but see Springer 2012 and Detey 2012), and that such annotations are usually not integrated with the actual corpus but kept in separate files, often spreadsheets or other database structures. They note that 'linguists who are new to corpus-based studies may fail to be sensitized to its [manual annotation's] fundamental value for many types of linguistic analysis' (2008: 167). They review a number of solutions for the manual annotation of data and propose guidelines for how manual annotations can best be integrated with other forms of annotation in future.

2.2 Uses of annotation

Leech (1997a: 4), Granger (2004: 128) and Díaz-Negrillo et al. (2010: 2) identify the availability of additional information for the extraction of data and the analysis of a corpus as the chief value of linguistic annotation, particularly to the extent that the annotation makes it possible to retrieve categories from the data that would otherwise be irretrievable. Appropriate annotation makes it possible, for instance, to determine whether a POS category, such as noun, occurs more or less frequently in a learner corpus than in some reference corpus, or how two parts of speech, such as nouns and pronouns, compare in frequency. With a raw corpus, it is either not possible, or prohibitively time-consuming, to count all the nouns, as opposed to a closed class like pronouns or modal auxiliaries that can be extracted and counted with relative ease. Such counts can in turn be used for further analysis and interpretation, for instance to determine the most salient differences between a group of learners and a native control corpus (Granger and Rayson 1998), test hypotheses about the order of morpheme acquisition (Tono 2000a), or examine recurrent POS tag sequences from a functional point of view (Aarts and Granger 1998; Borin and Prütz 2004).
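As a minimal illustration of such frequency counts, the sketch below extracts tag frequencies from text in the word_TAG format of example (1). The tag prefixes used for grouping are assumptions based on the C7 examples above, not a full treatment of the tagset.

import re
from collections import Counter

def tag_frequencies(tagged_text):
    """Count the POS tags in a CLAWS-style word_TAG text, as in example (1)."""
    return Counter(re.findall(r"\S+_(\S+)", tagged_text))

freq = tag_frequencies(
    "We_PPIS2 find_VV0 that_CST in_II fact_NN1 these_DD2 people_NN are_VBR"
)
# Group tags by prefix, e.g. all common and proper nouns (NN*, NP*)
nouns = sum(n for tag, n in freq.items() if tag.startswith("N"))
print(freq.most_common(), nouns)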



Annotation is particularly useful to disambiguate forms that have multiple functions, for example to prepare the way for an analysis of linguistic structures such as complement clauses (Biber and Reppen 1998), where there is a need to distinguish the uses of the word form that as a demonstrative determiner (e.g. that man) from those where it is a subordinating conjunction (the man said that he…). Another example of an ambiguity that can be resolved by good POS tagging is the English word form to. Granger (2002: 17–18) notes that a POS-tagged corpus shows that the prepositional usage is underrepresented in learner language, while the infinitive usage is not. A syntactically parsed corpus makes it possible to retrieve even more complex structures, like clause-initial adverbials, which can be used to answer functional and information-structure questions, such as differences in preference for particular adverbial types between native and advanced learner writers (van Vuuren 2013). However, as shown by de Haan (1984), manual annotation is another possible strategy for getting at more complex structures, such as noun phrase post-modification.

The linguistic annotation of learner corpora is also valuable to the field of natural language processing (Ragheb and Dickinson 2012: 966; Chapters 24 to 27, this volume). Applications that perform automatic scoring of essays or generate automatic feedback for learners are trained on such annotated corpora, but they also perform various kinds of tagging as part of their processing of learner data. Futagi et al. (2008) report on a system that automatically detects collocation errors in non-native learners' writing, using POS tags as part of its identification strategy. Likewise, Meurers (2009: 470) notes that, alongside the needs of linguistic research into learner corpora, natural language processing research into feedback systems and learner modelling depends on systematic, accurate and consistent annotation of learner corpora.

2.3 Criteria for annotation

Linguistic annotation of learner corpora needs to meet certain criteria in order to perform the functions we require of it. These criteria apply to corpus annotation in general, but certain aspects are particularly important in the context of learner corpora.

The first criterion is the meaningfulness of the annotation. Annotation should add value to a corpus and make it possible to extract from it information that goes beyond the information already contained in the word forms. A more elaborate tagset with refined classifications is therefore valuable (Meunier 1998: 20), but this value is limited by the extent to which such annotations maintain consistency (Granger 2002: 18). If the tags are too idiosyncratic, or require extensive judgement on the part of the analyst, there is a risk that the annotations become less consistent. Unusual features and/or errors in learner language may also pose classification problems that are in principle difficult to resolve, as will be discussed later in this chapter.


A second criterion, which is often in a trade-off relationship with meaningfulness, is the accuracy and consistency of annotation. Any form of annotation of a corpus should be as accurate as possible. The same form should always be tagged in the same way, so that the user can have confidence in extracting information from a tagged corpus and know that the information will be accurate.

Accuracy is usually broken down into two related notions, recall and precision, following van Rijsbergen (1979: 144–50). Precision measures how many of the tokens that received a specific tag (such as NN1) received that tag correctly – thus, given a particular tag, how certain can you be that it is correct? A score for precision is calculated by dividing the number of tokens that received a tag correctly by the total number of tokens that received that particular tag. Recall measures how many tokens that should receive a particular tag do in fact receive it – thus, if you are interested in a particular linguistic category, what proportion of instances can you retrieve (or recall) from the corpus using a particular tag? It is calculated by dividing the total number of tokens that received a specific tag correctly by the total number of tokens that should have received that tag.

To maximise recall requires that as many instances as possible of a particular category, such as transitive verbs, are actually tagged as transitive verbs. In practice, this may come at a cost to precision, if too many non-transitive verbs are also tagged as transitive verbs. By contrast, in an attempt to maximise precision, a tagging system may favour a particular classification only if it is very certain, and thus miss many other instances of a category. The measures of precision and recall may therefore be in competition to a degree. One way of combining the two measures is by calculating a so-called F-score, which is the harmonic mean of precision (P) and recall (R), according to the following formula:

F = 2(P × R) / (P + R)

The numeric value of the F-score will lie in the range between the numeric values of precision and recall; for example, if the recall on a particular task is 0.8 and the precision is 0.9, the F-score will be 0.847. However, the F-score will be closer to the lower of the two values, and in extreme cases, such as a recall of 0.1 and a precision of 0.9, the F-score is a mere 0.18, thus much closer to the lower of the two values. It therefore gives a clear indication that such a system does not perform very well.
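A minimal sketch of these three measures as they might be computed for a single tag, using the counts defined above; the illustrative counts in the usage line are chosen so as to reproduce the 0.8/0.9 worked example from the text.

def precision_recall_f(n_correctly_tagged, n_tagged, n_should_be_tagged):
    """Precision, recall and F-score for one tag, following the definitions above:
    n_correctly_tagged - tokens that received the tag and deserved it
    n_tagged           - all tokens that received the tag
    n_should_be_tagged - all tokens that should have received the tag
    """
    p = n_correctly_tagged / n_tagged
    r = n_correctly_tagged / n_should_be_tagged
    return p, r, 2 * (p * r) / (p + r)

# Reproduces the worked example: precision 0.9, recall 0.8 -> F of about 0.847
p, r, f = precision_recall_f(72, 80, 90)
print(round(p, 3), round(r, 3), round(f, 3))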


The third set of criteria concerns the formatting of the annotation. Annotation should always be inserted in a text in such a way that the raw text remains recoverable. The words of the raw corpus should in principle be recoverable from the annotated corpus, but, conversely, it should be possible to deal with tags and tag sequences independently of the words (Leech 1997a: 6). This is best achieved by using a consistent format for tags, for instance XML (Extensible Markup Language), a subset of SGML that was specifically designed to make documents portable across a range of platforms and at the same time more legible to the human eye (W3C 2008). Díaz-Negrillo et al. (2010: 6) implement such a solution when they develop a complex annotation system with multiple layers, which can be stored and retrieved separately, depending on the purpose of the research (see Section 3.3 for more detail on the corpus and the annotation). Furthermore, the tags should ideally lend themselves to easy interpretation, which involves a trade-off between transparent labels/abbreviations and short labels that do not interrupt the visual inspection of the flow of textual data with interspersed tags. It is more valuable to have a tag such as NN1 for a singular noun and NN2 for a plural noun, where the NN code denotes common noun (as opposed to proper noun, tagged NP) and the digits 1 and 2 refer to singular and plural respectively, than an alternative of opaque codes such as Xa and Xb. In this way, the analysis of the tagged corpus, supported by a manual that spells out the meaning and structure of the category labels, or tags, remains a more manageable task than working with less transparent and interpretable tags.

A final consideration for corpus annotation concerns the efficiency and ease of the process. In general, the more refined a classification system is, the longer it takes to perform the annotation. In manual annotation, particularly if the annotation is aimed at a specific grammatical structure, it will save time to first annotate the data with an automatic annotation tool, to at least get a rough first indication of where the target forms are potentially located in the corpus. The human annotator can then evaluate these annotated instances and accept or reject a particular classification, or insert a new set of annotations. In automatic annotation, the choice of an annotation system should be driven by run-time in trade-off with the maximum degree of accuracy. It is no use taking a very easy-to-use system that gives results in a fraction of a second if that system is not accurate enough for research to be based on the annotation produced. On the other hand, if the time taken to perform a particular annotation approaches (or even exceeds) the time it would take to simply read through a corpus and identify all instances of a particular item by hand, then clearly the annotation is inefficient.
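Returning to the recoverability criterion, the sketch below shows one way a token-level XML annotation could keep the raw text and the tag sequence independently accessible. The element and attribute names are invented for illustration and do not follow any particular corpus's document format.

import xml.etree.ElementTree as ET

# One sentence, with each word wrapped in a <w> element carrying its POS tag
# as an attribute; the raw text lives in the element text, so both layers
# stay independently recoverable.
s = ET.Element("s")
for word, tag in [("We", "PPIS2"), ("find", "VV0"), ("that", "CST")]:
    w = ET.SubElement(s, "w", pos=tag)
    w.text = word

print(ET.tostring(s, encoding="unicode"))      # the annotated form
print(" ".join(w.text for w in s.iter("w")))   # the raw text, recovered
print([w.get("pos") for w in s.iter("w")])     # the tag sequence alone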

2.4 Process of annotation

Annotation can take place in a number of ways. The most efficient method, in terms of time and cost, is fully automatic tagging, where a computer program assigns tags to data by itself, without interacting with the user. This is often possible for POS tagging, as exemplified by the


International Corpus of Learner English (ICLE), which is tagged with CLAWS7 (Garside and Smith 1997). An intermediate option is interactive tagging, where part of the work is done by a computer program, but the user has to make certain decisions or insert certain markers. This is often the case with syntactic parsers that aim for full parsing (Meunier 1998: 21). Ragheb and Dickinson (2013: 173) describe a project in which a syntactically parsed learner corpus is developed interactively: automatic annotation inserts POS tags, after which human annotators check the tags for accuracy before creating dependencies by dragging an arrow between words. In such cases, the workload is shared between human and computer, but obviously the investment of time is higher than with automatic annotation. The final option is fully manual annotation, where a human analyst inserts all tags directly into the corpus, although the use of smart editors can alleviate the time demand. Such an editor also helps to avoid typing errors if it offers a drag-and-drop menu for the insertion of tags, as is the case with some of the software for error tagging (see Chapter 7, this volume).

Automatic annotation usually proceeds in three steps. The first step is to identify the elements that need to be tagged, a process called tokenisation. In the case of POS tagging, the tokens are in most cases orthographic words, and tokenisation is therefore usually a simple process. However, if the purpose is morphemic analysis or the tagging of higher-level syntactic constituents, tokenisation is already a more complex task. The second step is potential tag assignment, which is usually performed by looking up the tags that are associated with a particular token. In the case of POS tagging of relatively unambiguous forms, such as the articles or auxiliary verbs, this step yields a single tag, but ambiguous forms, like the English word to, receive multiple tags (such as 'infinitive particle' and 'preposition'). When a form is encountered that has not been included in the training data of the annotation program, the program employs a range of techniques to guess the tag, and may even guess a number of potential tags for the item, based on its morphological form and/or its syntactic position. In the case of learner corpora, Nagata et al. (2011: 121) note that spelling mistakes, capitalisation errors and other learner forms often present this kind of problem for tagging programs, since the learner forms would not have been encountered in the training data used to develop the tagger. The existence of multiple candidate tags for a single token necessitates a third step, tag disambiguation, where a computer program applies various algorithms (e.g. rules, constraints, distance metrics in multiple planes or statistical probability calculation) to determine the most likely tag for a particular form in a particular context. Probabilistic taggers, for instance, calculate the most likely tag among the available options on the basis of statistical formulas.
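A toy sketch of the three steps, with an invented two-entry lexicon and single hand-written fall-back and disambiguation rules standing in for the rich models of a real tagger; none of this reflects any actual tagger's internals.

# Invented mini-lexicon mapping forms to candidate C7-style tags.
LEXICON = {"the": ["AT"], "to": ["TO", "II"], "media": ["NN"], "exposed": ["JJ", "VVN"]}

def tokenise(sentence):
    # Step 1: tokenisation; orthographic words only, so a split suffices here
    return sentence.split()

def candidate_tags(token):
    # Step 2: potential tag assignment via lexical lookup, with a crude
    # morphological guess for unknown (e.g. misspelled learner) forms
    if token.lower() in LEXICON:
        return LEXICON[token.lower()]
    return ["VVD", "VVN"] if token.endswith("ed") else ["NN1"]

def disambiguate(candidates, previous_tag):
    # Step 3: pick one candidate; a real tagger uses contextual
    # probabilities, here one toy rule plus a default stands in
    if previous_tag == "AT" and "NN" in candidates:
        return "NN"
    return candidates[0]

def tag(sentence):
    tags, prev = [], None
    for token in tokenise(sentence):
        prev = disambiguate(candidate_tags(token), prev)
        tags.append((token, prev))
    return tags

print(tag("exposed to the media"))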


It may seem as if automatic annotation should be used wherever possible, but it is not that simple. An important caveat is that automatic systems are not perfect, and may make mistakes in annotating linguistic features of corpora. These annotation errors may be due to the system or to the idiosyncratic properties of the learner language. It is therefore important to establish the accuracy of linguistic annotation before proceeding with further analysis of a learner corpus. Should the accuracy not be adequate for the purpose (i.e. the precision and/or recall is too low), the researcher should take steps to compensate for the problem, either by improving the accuracy of the annotation or by performing further manual annotations.

2.5 Measuring and improving the quality of the annotations

Researchers have tested the accuracy of POS taggers and parsers on learner corpora in a number of studies. Depending on the accuracy, a researcher can either proceed with a particular analysis or take steps to improve the quality of the data annotation. Accuracy for POS taggers trained on native-speaker English and French data exceeds 96% at present. The recent ACL (2012) review indicates accuracy in the range between 96.5% and 97.5% for English taggers trained and tested on the Wall Street Journal, and accuracy of 97.68% to 97.8% for French taggers trained on the Le Monde corpus. Even for learner corpora, accuracy of up to 95% has been reported by de Haan (2000), although van Rooy and Schäfer (2002, 2003a) report a lower figure of just below 90% for the Tswana Learner English corpus tagged with two taggers, and 96.3% with the CLAWS tagger, provided spelling errors are removed beforehand. Rehbein et al. (2012) report an overall accuracy of 93.8% on a German learner corpus.

Accuracy for parsers is a little more complex to measure, depending on the type of parsing that is undertaken. Nagata et al. (2011: 1217) report that for the specific case of head noun identification in shallow parsing, recall is 90.3% and precision is 90.7%. Geertzen et al. (2013) examine accuracy during dependency parsing and report two kinds of measures: a labelled attachment score (LAS) and an unlabelled attachment score (UAS). LAS measures 'the proportion of word tokens that are assigned both the correct head and correct dependency relation label', whereas UAS measures 'the proportion of word tokens that are assigned the correct head regardless of what dependency relation label is assigned' (Geertzen et al. 2013: 8). They report an overall LAS of 89.6% and a UAS of 92.1% for their data, although the parser performs better on sentences without any learner errors (LAS = 92.6%; UAS = 95.0%) than on sentences that contain at least one learner error (LAS = 83.8%; UAS = 87.4%) (Geertzen et al. 2013: 9).
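Given gold-standard and predicted analyses, the two attachment scores can be computed directly from per-token (head, label) pairs; a minimal sketch under that assumed representation:

def attachment_scores(gold, predicted):
    """LAS and UAS as defined above. Both arguments are lists of
    (head_index, dependency_label) pairs, one pair per word token,
    aligned with each other."""
    n = len(gold)
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    return las, uas

# Token 3 gets the right head but the wrong label: UAS counts it, LAS does not
gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj")]
print(attachment_scores(gold, pred))  # (0.666..., 1.0)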


There are two main reasons for automatic annotation systems to perform less well on learner corpora than they do on native-language data. The first, and obvious, reason is the occurrence of learner errors, as reviewed earlier in this chapter. The second reason for the poorer performance of taggers and other annotation systems is that they are developed on the basis of a particular type of data, usually native-speaker data from registers such as newspaper writing. When confronted with a different type of data, produced by language learners, the taggers do not perform as well, since more structures occur that are not well represented in the training data (Díaz-Negrillo et al. 2010: 4; Rehbein et al. 2012: 1). Gilquin and De Cock (2011: 149) further note that automatic annotation systems achieve an even lower accuracy on spoken learner data, in part because the annotation tools are trained on written data and therefore fail to deal adequately with some of the typical features of spoken (learner) language. Nevertheless, despite these potential risks, a number of researchers (e.g. Geertzen et al. 2013: 11–13) report that the performance of annotation systems on learner language is relatively good, because of the overall simplicity of learner language, and because some errors, especially semantic ones, are not particularly detrimental to the process of grammatical annotation.

When deciding to analyse an annotated learner corpus, Granger (1997) recommends that the researcher start with an evaluation of annotation accuracy on a sample of the corpus, unless such information is already known for the corpus in question. By assessing the overall accuracy rate (or the accuracy of a subset of tags that are central to the research project), the researcher can report the precision and recall of the tags of interest, and then judge whether the accuracy is sufficient to proceed with an analysis.

If the accuracy of annotation of the category of interest is not adequate, the researcher can take a number of steps to improve the quality of data extraction from the annotated corpus. The distinction between precision and recall as metrics for the evaluation of accuracy is quite useful in this regard. If the recall of a particular annotation category is high but its precision somewhat lower, most instances in the corpus will have been retrieved by the corpus query, but a number of irrelevant instances will have been retrieved as well. Smith et al. (2008: 163–4) note that this is quite typical of any kind of corpus query and that manual cleaning of the search results should simply take place. However, if the precision is high but the recall somewhat lower, most instances retrieved by the corpus query are valid instances of what the researcher is looking for, but too many valid instances have been missed. In such a case, the researcher should determine the most prevalent ways in which the category of interest has been misclassified, additionally retrieve the tags most frequently assigned to it in error, and perform manual clean-up on the combined results.


This process can be illustrated by a brief example taken from the original data set of van Rooy and Schäfer (2003a) in their study of the accuracy of a number of different POS taggers on the Tswana Learner English corpus. They determined that the CLAWS tagger, using the CLAWS7 tagset, classified the past-tense forms (VVD) correctly in 27/28 instances, so the recall for this tag is adequate (96.4%). However, the past participle forms (VVN) were tagged correctly only 78/85 times, for a recall of 92 per cent. This may still be adequate in a coarse-grained study, but often it will not be accurate enough. Of the seven instances that were missed, however, six were tagged erroneously as VVD, i.e. as simple past-tense forms. By also retrieving VVD, the researcher can therefore catch almost all instances of VVN and, at the same time, correct the over-extraction of past-tense forms. The tag VVD has a lower precision (27/35 or 77.1%), because in six instances this tag was misapplied to past participle forms (and in a further two to adjectives). Any study of past-tense forms using the VVD tag should therefore include manual cleaning to remove unwanted past participle forms.

Thouësney (2011) proposes three consecutive strategies to improve overall tagger accuracy on a corpus of learner French. The first step is to identify and classify unknown lemmas by improving lexical lookup across versions with lower-case and upper-case spelling, followed by manual correction of the remaining unknown lemmas. Thereafter, a set of rules handcrafted by the researcher is run on the tagger output to correct common errors. The third step is to use information from error tags that were already inserted in the corpus manually to improve the POS tagging. After application of these steps, the accuracy of the POS tags on the corpus of learner French data (as measured by the F-score) rose from 78.03% to 96.61%.

Dickinson and Meurers (2003) propose that identical word n-grams with different tags should be investigated for tagging errors. The basic intuition is that if the same sequence of words (n-gram) is tagged differently by the same tagger (which they label a variation n-gram), there is a good chance that there is a tagging error in at least one of the n-grams. Applying this metric, they determine that 2,436 out of 2,495 variation n-grams contained tagging errors in the part of the Wall Street Journal corpus they used as data. This computational technique can obviously be extended to learner corpora as well, although it may be somewhat less effective, since learner errors are less likely to form repetitive variation n-grams and other variations (Rehbein et al. 2012: 4). It nevertheless remains an efficient way of identifying potential tagging errors and finding a solution to correct them.
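A minimal sketch of the variation n-gram idea: it assumes the corpus is available as a flat list of (word, tag) pairs and simply collects word n-grams that occur with more than one tag sequence, leaving inspection of the candidates to the analyst.

from collections import defaultdict

def variation_ngrams(tagged_tokens, n=3):
    """Return word n-grams that occur with more than one tag sequence
    (variation n-grams in the sense of Dickinson and Meurers 2003).
    tagged_tokens is a list of (word, tag) pairs for the whole corpus."""
    variants = defaultdict(set)
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        words = tuple(w.lower() for w, _ in window)
        variants[words].add(tuple(t for _, t in window))
    return {words: tags for words, tags in variants.items() if len(tags) > 1}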


Another computational solution to tagging errors is to run multiple taggers on the same corpus and compare their output. Provided that the equivalences between tagsets can be determined, there is a good chance that some of the errors made by one tagging program will not be repeated by another, and all such cases of disagreement between taggers can thus be exploited to improve the tagging. Differences in the output of taggers can therefore be used to predict likely tagging errors for manual correction. Rehbein et al. (2012: 8–9) find that, in a sample corpus of learner German writing comprising 1,921 tokens, there are 33 to 68 tagging errors, depending on which tagger is used. Of these tagging errors, only two are tagged incorrectly in the same way by all three taggers they use, and would therefore not be identifiable by adopting this strategy. The remaining tagging errors, though, would be identified and could be corrected by human annotators.

Depending on the needs of a research project, a researcher can therefore choose to work around annotation errors by using smarter manual extraction strategies, or implement a range of interactive or automatic techniques to improve the overall accuracy of annotation and then proceed with the extraction of data for the main research question.
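A sketch of the disagreement strategy, under the assumption that the taggers' outputs have already been token-aligned and mapped to a common tagset, as the discussion above requires:

def tagger_disagreements(tokens, outputs):
    """Flag positions where taggers disagree, as candidates for manual checking.
    outputs maps a tagger name to its list of tags, one tag per token,
    already converted to a shared tagset."""
    flagged = []
    for i, token in enumerate(tokens):
        tags = {name: tag_list[i] for name, tag_list in outputs.items()}
        if len(set(tags.values())) > 1:
            flagged.append((i, token, tags))
    return flagged

tokens = ["exposed", "to", "media"]
outputs = {"tagger_a": ["JJ", "TO", "NN"], "tagger_b": ["VVN", "TO", "NN"]}
print(tagger_disagreements(tokens, outputs))  # flags position 0 only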

3 Representative studies

Four studies have been selected for presentation in this section, based on their value in illustrating important research directions in the field.

A specific complication of corpus annotation is that automatic annotation software is typically trained on native-speaker data, but such tools are then used to annotate learner data. Van Rooy and Schäfer (2003a) examine the influence of learner errors on the accuracy of three POS taggers used for learner English data.

Corpora are sometimes annotated for learner errors and sometimes for grammatical properties. Combining the two poses special challenges, particularly to the extent that data containing errors are annotated for word class or syntactic properties. Hirschmann et al. (2007) propose to formulate an explicit target hypothesis and proceed from there to develop an annotation system for the main syntactic elements in learner German data.

Not all researchers agree with the idea of a target hypothesis. An alternative that has received considerable support in recent years is to annotate the data simultaneously at multiple levels. Díaz-Negrillo et al. (2010) develop a system for annotating word-class properties at three different levels simultaneously, which offers additional information and insight into the nature of learner language properties.

Finally, the annotation of learner language should go beyond the level of word-class classification. Geertzen et al. (2013) investigate the


accuracy of dependency parsing and obtain very good results even at the syntactic level.

3.1 van Rooy, B. and Schäfer, L. 2003a. Automatic POS tagging of a learner corpus: The influence of learner error on tagger accuracy. Paper presented at Corpus Linguistics 2003, Lancaster University.

Van Rooy and Schäfer (2002) extend the work of de Haan (2000) on the effect of learner errors on tagging errors. They separate the effects of spelling errors and grammatical errors, and determine the relative severity of learner errors for tagging accuracy. They also extend the scope of investigation by comparing three taggers rather than evaluating a single tagger. This leads to the research by van Rooy and Schäfer (2003a), which has not been published beyond the original conference presentation and is therefore reported here. The two questions they ask are: how accurate are different taggers in assigning POS tags to learner corpus data? And what is the effect of learner errors on POS tag errors?

The data for the evaluation and analysis is drawn from a random sample of thirteen complete essays in the Tswana Learner English (TLE) corpus, a part of the International Corpus of Learner English (ICLE). When punctuation is included in the count, the total number of tokens for the evaluation is 5,618. Three different taggers are used: the TOSCA-ICLE tagger, which was originally developed for the International Corpus of English project (Aarts et al. 1998) and adjusted for ICLE (de Haan 2000), the Brill tagger (Brill 1999) and the CLAWS tagger (Garside and Smith 1997). These taggers were selected because they were easily available, well documented and represented different types of computational architectures.

The sample corpus was corrected for spelling errors first, since the effect of spelling errors on tagging errors had already been reported by van Rooy and Schäfer (2002), who found that tagger accuracy improved by 2–3 per cent after the removal of spelling errors. Thereafter, all other learner errors were identified and classified inductively. The spell-corrected version was then tagged automatically by all three taggers. These taggers used different tagsets, and the output of each tagger was therefore corrected manually by the two researchers. The results were stored in an Excel document, with the token, original tag and corrected version in separate columns. A tag for every learner error was added in another column, based on the prior classification of errors.

The corpus sample used for the evaluation was judged to contain 753 learner errors, but the accuracy of the POS taggers was surprisingly good, as shown in Table 5.1. Learner errors contributed a substantial proportion to the overall error rate, ranging between roughly one-sixth and one-third across the three taggers, as shown in Table 5.2.


Table 5.1. Overall accuracy of POS tags in the 5,168-token sample from the TLE

Tagger                  CLAWS    TOSCA    Brill
Number of tag errors    189      654      753
Tagger accuracy         96.3%    88.0%    86.3%

Table 5.2. Contribution of learner errors to tagging errors of the three taggers

Tagger                                          CLAWS    TOSCA    Brill
Number of tag errors in edited sample           189      654      753
Number of tag errors due to learner errors      61       165      122
Contribution of learner errors to tag errors    32.3%    25.2%    16.2%

An examination of the data reveals that certain types of learner errors have a very serious influence on tagger accuracy, while others have almost no effect. Errors related to the incorrect use of articles and prepositions, and the incorrect assignment of number features within noun phrases (nouns, pronouns and determiners), have almost no effect on tagger accuracy. The reasons seem straightforward. If the wrong article is used, then the article used is simply tagged. Prepositions are similarly a very simple category from a syntactic viewpoint: if the wrong preposition is used, it will still be tagged correctly as a preposition. Nouns, pronouns and determiners are tagged for singular and plural. A form that occupies the correct syntactic position but has the wrong number will simply be tagged in terms of the form used, as long as the other syntactic properties are not disturbed.

Other learner errors affect tagging accuracy more adversely. Those related to the concord, tense and aspect features of verbs frequently result in tagging errors. This is because a verb that appears in an unmarked form in a context where a marked form is expected (e.g. she have) is often tagged as something other than, for example, the simple present form that it is presumed to be in context. A probabilistic tagger is confronted with a form that would not have occurred in the native-speaker training data from which it extracted the regularities used to compute the tags for any new corpus. It therefore guesses, guided by a statistical calculation, which tag is the most likely one, given the form it observes in a particular context.


At clause level, incorrect verb complements are a serious cause of tagging errors. This is particularly problematic for the TOSCA tagger, which aims to classify verbs for their transitivity features, as illustrated by the following example, where bother has both an object noun phrase and an object clause:

(5) they do not bother themselves to use a condom (ICLE-TS-NOUN-1001.1)

Lexical errors also lead to incorrect tagging in many cases. If an incorrect lexical item is used, the item will always be tagged incorrectly whenever the contextually appropriate tag does not occur in the lexicon for that particular word. An example of a lexical error is the clause …we learn them our greetings, where learn is tagged as a monotransitive verb rather than as a ditransitive verb. This error occurs because the correct lexical item teach has not been used, and in the TOSCA lexicon the verb learn does not have the option of being tagged as ditransitive, only as monotransitive or intransitive. However, this problem does not occur in taggers with less fine-grained tagsets, which assign the same tag to all lexical verbs and do not differentiate by complement type.

3.2 Hirschmann, H., Doolittle, S. and Lüdeling, A. 2007. 'Syntactic annotation of non-canonical linguistic structures', in Davies, M., Rayson, P., Hunston, S. and Danielsson, P. (eds.), Proceedings of the Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.

Hirschmann et al. (2007) argue that POS and syntactic annotation of learner data is useful, but that such a system often cannot be applied to sentences that contain erroneous forms. Error analysis, by contrast, focuses only on sentences that contain identifiable errors, and the error annotations are therefore not encoded in a way that allows such sentences to be compared to canonical sentences in terms of the frequencies of particular grammatical structures. Hirschmann et al. offer a proposal for an annotation system for learner data that combines the annotation of word class, syntactic structure and learner error in a multi-level annotation.

The key construct they define is the contrast between canonical and non-canonical linguistic structures, where non-canonical is defined as 'structures that cannot be described or generated by a given linguistic framework' (Hirschmann et al. 2007: 1). This notion can be applied to learner data, and also to spoken language or computer-mediated communication. They identify examples of learner errors in the Fehlerannotiertes Lernerkorpus des Deutschen als Fremdsprache (Falko, the error-annotated corpus of learner German) that are not, in principle, describable with POS tags. For instance:


(6) …die individuelle Lernenentwicklung der Schülern…
'…the individual learner+development of-the pupils…'

The form Schülern is in the dative case, whereas after der it should be in the genitive case (here simply Schüler). In terms of its syntactic position, it is used as a genitive, but in terms of its inflectional form, it is a dative. Tagging it as a dative disregards its syntactic use, but tagging it as a genitive disregards its dative case form. Such a form is therefore non-canonical in the terms proposed by Hirschmann et al. (2007).

Based on the identification of a non-canonical structure, Hirschmann et al. (2007) argue that the analyst should propose an explicit target hypothesis: a correction that reformulates the non-canonical sentence in canonical terms with the least possible difference from the original structure. Rehbein et al. (2012) add that the analyst is in any case bound to interpret the learner data in some way or another in order to annotate it, so it is better to make the target hypothesis explicit than to annotate deviations without doing so.

The researchers used the Falko corpus, which consists of German written by learners of different levels at two German universities and one American university. A sample of the corpus was manually annotated for non-canonical sentences. The TreeTagger, with the Stuttgart-Tübingen Tagset (STTS), was used for the automatic identification of lemmas in the corpus, as well as for assigning POS tags. For the manual annotation of non-canonical sentences, the Extensible Markup Language for Discourse Annotation (EXMARaLDA) annotation tool was used (details mainly come from Doolittle 2008). The lemmas and POS categories were assigned automatically by the STTS TreeTagger. Thereafter, a target hypothesis was supplied for every non-canonical sentence in another tier. Finally, error annotation was supplied by annotating the deviation from the canonical target hypothesis in a separate tier. Syntactic annotation was done in terms of topological fields, in yet another tier. A partial representation of the annotation, from Hirschmann et al. (2007: 10), is given in Figure 5.1.

Hirschmann et al. (2007) mainly outline the procedure and show how it should be applied. They find that the model they propose, with an explicit target hypothesis, offers a more accurate and comprehensive way to annotate the linguistic properties and errors in learner data. The model has the further value of being applicable to other forms of text, such as spoken language, dialects or computer-mediated communication (CMC), where non-canonical structures are also frequently encountered. Doolittle (2008) reports on the findings of an analysis using this annotation system: the overall proportion of non-canonical structures, some of the main findings related to non-canonical structures, as well as aspects of inter-rater reliability.


Figure 5.1 Annotation proposed by Hirschmann et al. (2007). [Figure not reproduced: a multi-tier analysis of the learner sentence Er tatsächlich war sehr wohlhabend gewesen., whose target hypothesis reads Er war tatsächlich sehr wohlhabend gewesen. The tiers record the word, the target hypothesis, a description of the deviation (token inserted, token deleted) and the topological field annotation of the target hypothesis (initial field, left bracket, middle field, right bracket).]
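One way such a tiered architecture might be modelled in code; the tier names follow Figure 5.1 and are illustrative only, not Falko's or EXMARaLDA's actual encoding.

from dataclasses import dataclass
from typing import Optional

@dataclass
class TierSlot:
    """One alignment slot across the tiers shown in Figure 5.1."""
    word: Optional[str]       # the learner's token; None where the target inserts one
    target: Optional[str]     # the target hypothesis token; None where it deletes one
    deviation: Optional[str]  # e.g. 'token inserted', 'token deleted', or None
    field: Optional[str]      # topological field of the target, e.g. 'middle field'

# The word-order error from Figure 5.1: the target hypothesis inserts 'war'
# after 'Er' and deletes the learner's 'war' after 'tatsächlich'.
slots = [
    TierSlot("Er", "Er", None, "initial field"),
    TierSlot(None, "war", "token inserted", "left bracket"),
    TierSlot("tatsächlich", "tatsächlich", None, "middle field"),
    TierSlot("war", None, "token deleted", None),
]
learner_text = " ".join(s.word for s in slots if s.word)
target_text = " ".join(s.target for s in slots if s.target)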


3.3 Díaz-Negrillo, A., Meurers, D., Valera, S. and Wunsch, H. 2010. 'Towards interlanguage POS annotation for effective learner corpora in SLA and FLT', Language Forum 36(1–2): 139–54.

Díaz-Negrillo et al. (2010) investigate the most appropriate form of linguistic annotation for learner corpora. They note that the use of POS taggers and other tools trained on native data constitutes a case of domain transfer, where a tool is transferred from one domain to another, with the complication that the new data is more challenging than the data from the original domain. Such an approach is represented by studies of the accuracy of POS taggers on non-native data (e.g. van Rooy and Schäfer 2002, 2003a). Like Hirschmann et al. (2007), they observe that there are systematic cases in the data where a single POS tag cannot be assigned to an example. Rather than forcing an answer on the data by means of a target hypothesis, however, they ask whether a three-level POS tagging system would not simultaneously enable a more accurate description of the data and serve as useful input to the analysis of learner error.

Díaz-Negrillo et al. (2010) used a 39,015-word section of the Non-native Spanish Corpus of English (NOCE) that had been tagged manually for learner errors. The corpus was tagged for POS with three different taggers, the TreeTagger, TnT and the Stanford Tagger (Meurers and Wunsch 2010). POS categories in native-speaker data are identified on the basis of three types of evidence that usually converge: the morphological (inflectional) form of a word, the lexical base (with possible typical derivational affixes) and the syntactic distribution (Díaz-Negrillo et al. 2010). The authors argue that in learner data, many errors are detected exactly where a mismatch occurs between the three sources of evidence for POS categories. Since the three taggers have different fall-back strategies to deal with unusual forms, they often give different tags for the same form (Meurers and Wunsch 2010). These mismatches were interpreted in terms of the layers of evidence by Díaz-Negrillo et al. (2010).

The main finding of the study is that the threefold split in POS information improves the insight that the tagging yields. Whereas solutions have to be specified by decree if a single POS tag has to be selected, the threefold division enables an accurate classification of the actual linguistic form in terms of its separate features. The analysis of the data shows that there are various mismatches between the classification variables, and it is exactly these mismatches that offer new insight into the grammatical properties of the learner errors.

A stem-distribution mismatch occurs when a stem that is unambiguously from one word class is used in the distributional slot (function) of another class, as shown by example (7):

(7) you can find a big vary of beautiful beaches


The stem vary is, lexically speaking, a verb, but it is used in the distributional slot of a noun. Díaz-Negrillo et al. (2010) remark that this is the type of error de Haan (2000: 74) labelled 'word class transfer'. The second mismatch that Díaz-Negrillo et al. (2010) identify is a combined stem-distribution and stem-morphology mismatch, illustrated by example (8):

(8) one of the favourite places to visit for many foreigns

Distributionally, this is a nominal slot, but the stem form is adjectival, representing the first mismatch. Moreover, the adjective stem also acquires a nominal plural suffix, representing a stem-morphology mismatch as well. They point out that the inflectional and derivational properties of a word's morphology should be kept apart when considering the most appropriate classification of learner data. The third type of mismatch is between stem and morphology only, as in (9):

(9) this film is one of the bests ever costumes

A form that is a predicative adjective in terms of stem (and distribution) receives a nominal suffix, resembling a noun in terms of its morphology. The fourth and final mismatch is between distribution and morphology, where the inflectional morphology does not match the distributional slot, as illustrated in (10):

(10) for almost every jobs nowadays

While the noun job is permitted to take a plural form, it cannot do so in the environment immediately after every. Thus, it functions in the distributional slot of a singular noun but displays the morphology of a plural noun, hence the mismatch.

Based on this overall typology, Díaz-Negrillo et al. (2010) conclude that it is feasible to assign a POS tag for each of the three separate levels. They argue that this approach represents an improvement over assigning a single tag and at the same time yields valuable information about the learners' language.
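A sketch of how such a tripartite classification might be represented and queried for mismatches; the field names and coarse word-class values are illustrative, not the encoding used by Díaz-Negrillo et al.

from dataclasses import dataclass

@dataclass
class TripartitePOS:
    """Three parallel word-class judgements for one token, in the spirit of
    Díaz-Negrillo et al. (2010)."""
    stem: str          # class suggested by the lexical base, e.g. 'verb'
    distribution: str  # class suggested by the syntactic slot
    morphology: str    # class suggested by the inflectional form

    def mismatch(self):
        """Return the pairs of levels that disagree, or an empty list."""
        levels = [("stem", self.stem), ("distribution", self.distribution),
                  ("morphology", self.morphology)]
        return [(a, b) for i, (a, va) in enumerate(levels)
                for b, vb in levels[i + 1:] if va != vb]

# Example (8), 'foreigns': adjectival stem in a nominal slot with nominal morphology
print(TripartitePOS("adjective", "noun", "noun").mismatch())
# -> [('stem', 'distribution'), ('stem', 'morphology')]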

While the noun job is permitted to take a plural form, it cannot do so in the environment immediately after every. Thus, it functions in the distributional slot of a singular noun, but it displays the morphology of a plural noun, hence the mismatch. Based on this overall typology, Díaz-Negrillo et al. (2010) conclude that it is feasible to assign a POS tag for each of the three separate levels. They argue that this approach represents an improvement on the approach of assigning a single tag and yields valuable information about the learners’ language at the same time. 3.4 Geertzen, J., Alexopoulou, T. and Korhonen, A. 2013. ‘Automatic linguistic annotation of large scale L2 databases:  The EF-Cambridge Open Language Database (EFCamDat)’, in Miller, R.  T., Martin, K.  I., Eddington, C. M., Henery, A., Miguel, N. M., Tseng, A., Tuninetti, A. and Walter, D. (eds.), Proceedings of the 31st Second Language Research Forum (SLRF). Carnegie Mellon: Cascadilla Press. Geertzen et al. (2013) introduce a very large corpus of learner English, the EF-Cambridge Open Language Database (EFCAMDAT), and inquire about Downloaded from https:/www.cambridge.org/core. University of Liverpool Library, on 06 Mar 2017 at 21:49:04, subject to the Cambridge Core terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781139649414.005

Annotating learner corpora

99

the accuracy of dependency parsing using natural language processing (NLP) tools. The new source of data introduced by Geertzen et al. (2013), EFCAMDAT, consists of 551,036 scripts, written by 84,864 learners of English across a range of proficiency levels. In total, the corpus contains about 2.9 million sentences that represent about 33 million words. The corpus was POS tagged and thereafter parsed syntactically with the Stanford Parser, using the Penn Treebank tagset of thirty-six tags. Parsing was done by marking heads and dependency relations between word-level units and the syntactic heads. A stratified sample of 1,000 sentences was tagged and parsed, and thereafter corrected by two trained linguists. The sentences were selected to represent learners from the five countries most widely represented among the learner population (Brazil, China, Russia, Mexico and Germany) and to represent the proficiency levels of the learners evenly. Another sample of 100 sentences was corrected by both linguists to determine inter-annotator agreement, which was established at 95.3%. After consultation with a third linguist, certain cases were resolved and inter-annotator agreement increased to 97.1%. Annotators marked learner errors and all tagging and parsing errors, and provided corrections for the tagging and parsing errors. It turned out that 33.8% of the sentences contained at least one learner error. The most frequent types of learner errors were spelling and capitalisation errors, although a range of morphosyntactic and semantic irregularities, as well as missing words, were also detected. Two main metrics were used to determine the accuracy of the syntactic parsing: the LAS, which measured the percentages of word tokens that were assigned correctly to heads and had their dependencies labelled correctly, and UAS, which measured the percentages of word tokens that were assigned to the correct heads, irrespective of the dependency labels (see Section 2.5). The principal finding was that the Stanford Parser achieved very high accuracy in the course of the automatic parsing of the data, even better than the scores achieved for parsing the Wall Street Journal (WSJ) corpus. The LAS was 89.6% (compared to 84.7% for the WSJ corpus) and the UAS was 92.1% (compared to 87.2% for the WSJ corpus). POS tagging accuracy was 96.1%. If POS tagging errors were included in the scores, the LAS dropped to 88.6% and the UAS to 90.3%. At sentence level, 54.1% of the sentences contained no labelled attachment error, and 63.2% contained no unlabelled attachment error, while 73.1% of the sentences contained no POS tagging errors. Learner errors interfered with the tagging. Of the 593 words that were associated directly with learner errors, 49.2% received an incorrect POS tag or incorrect assignment of syntactic dependencies. Learner errors mainly affected POS tag assignment and head or dependency classification only as a consequence of erroneous POS tags (Geertzen et al. 2013: 9). Downloaded from https:/www.cambridge.org/core. University of Liverpool Library, on 06 Mar 2017 at 21:49:04, subject to the Cambridge Core terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781139649414.005
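Both metrics reduce to simple token-level comparisons between a gold-standard analysis and the parser output. The following minimal Python sketch illustrates the computation; the (head index, dependency label) representation and the toy analyses are illustrative assumptions, not the EFCAMDAT format:

    def attachment_scores(gold, predicted):
        """Compute UAS and LAS over parallel lists of (head, label)
        tuples, one tuple per word token."""
        assert len(gold) == len(predicted)
        correct_heads = 0      # head correct, label ignored -> UAS
        correct_labelled = 0   # head and label both correct -> LAS
        for (g_head, g_label), (p_head, p_label) in zip(gold, predicted):
            if g_head == p_head:
                correct_heads += 1
                if g_label == p_label:
                    correct_labelled += 1
        return correct_heads / len(gold), correct_labelled / len(gold)

    # Toy example: five tokens, one of which gets the right head but
    # the wrong dependency label.
    gold = [(2, 'det'), (0, 'root'), (2, 'dobj'), (5, 'case'), (2, 'nmod')]
    pred = [(2, 'det'), (0, 'root'), (2, 'nsubj'), (5, 'case'), (2, 'nmod')]
    uas, las = attachment_scores(gold, pred)
    print(f'UAS = {uas:.1%}, LAS = {las:.1%}')   # UAS = 100.0%, LAS = 80.0%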


As the learners' proficiency levels increased, so did the accuracy of the tagging and parsing. Geertzen et al. (2013) conclude that the use of a parser on learner corpora yields a data source with reliable morphosyntactic annotation, which becomes a useful tool for researchers investigating learner language, even at the lowest proficiency levels. They propose two reasons for the success of the syntactic parsing in their project. First, learner language tends to be simpler than native-speaker data, typically with shorter sentences (on average, 11 words per sentence in EFCAMDAT, in contrast to 21 words per sentence in the WSJ corpus), which makes the sentences easier to parse. Second, the system is quite robust, and many learner errors, especially semantic ones, do not affect the tagging or parsing accuracy.

4 Critical assessment and future directions

Linguistic annotation adds a new layer of information to a learner corpus. It enables improved retrieval of data, even where ambiguous forms exist, and consequently allows the learner corpus researcher to conduct types of analyses that are not possible with a raw corpus. Apart from POS tagging, however, the full potential of corpus annotation has not yet been harnessed: only limited work has been conducted on syntactic parsing, and almost none on other types of annotation. Important advances that have emanated from work on learner corpus annotation include work on the accuracy of annotation and on the influence of learner errors on annotation accuracy. Systems have been developed to capture both conventional linguistic features and learner errors in the same representation, giving rise to multi-layer annotations. One very important issue influences approaches to tagging: whether a target hypothesis is formulated to improve tagging accuracy, or whether the learner data is annotated without such an intervening step.

4.1 Approaches to learner corpus annotation

The initiative to improve POS or syntactic annotation for learner corpus data often rests on the assumption that learner forms are either analysable in the same way as native-speaker data or that the intended form of a particular instance can be inferred and tagged. This assumption is called the 'target hypothesis' by Hirschmann et al. (2007) and Rehbein et al. (2012). It is implicitly assumed by de Haan (2000) and van Rooy and Schäfer (2002, 2003a), but not explicitly formulated. The opposing view, formulated most forcefully by Rastelli (2009), is that such an approach falls prey to the comparative fallacy, erroneously imposing the categories of the target language on the interlanguage production of the learner. The arguments in favour of these two approaches are reviewed in this subsection, showing that both have merits and can be employed usefully in learner corpus research, even if the two views are to a degree irreconcilable. At the very least, the two approaches are likely to lead to different types of inquiries into learner language.

Given that learner corpus data contain forms that are 'non-canonical' (in their terms), Hirschmann et al. (2007) argue that the annotation of POS properties and syntactic structure requires a target hypothesis, which can then be tagged and parsed by a system developed for the target language. A target hypothesis is defined as a manually corrected version of the non-canonical learner utterances, including the correction of spelling, word-formation and word-order errors (Rehbein et al. 2012: 4). Once a target hypothesis is provided, the normalised form is tagged. This assumption is implicit even in cases where a corpus is 'simply tagged' by a tagger, such as the way CLAWS was used to POS tag ICLE, because the tagger takes the surface morphological form to be the basic form in most cases of mismatch between surface form and distributional evidence. However, exceptions do occur, which is exactly why Hirschmann et al. (2007) propose to make the target hypothesis explicit.

There are a number of advantages to formulating a target hypothesis and tagging the presumed corrected or canonical form. If the sentences that contain learner errors are not analysable in similar terms to the more canonical sentences, a comprehensive overall view of the data in a corpus is not possible. Hirschmann et al. (2007: 13) argue that the annotation of learner corpora often deals separately with the canonical and non-canonical structures, which makes it difficult to compute instances of underuse and overuse of structures. By supplying a target hypothesis, all structures can be included in the extraction of data from an annotated corpus. The second advantage is that the POS tags and syntactic annotation of the target hypothesis can be compared to the original form produced by the learner; the difference between the original form and the target hypothesis can then be used for annotating the error (Hirschmann et al. 2007: 13). Finally, the use of a target hypothesis leads to a very considerable improvement in POS tagging accuracy, enabling more accurate retrieval of information from the corpus. Rehbein et al. (2012: 4) argue that very accurate POS tags are required for syntactic parsing. By means of a target hypothesis, they improve the accuracy of the POS tags from 93.8% to 98.7%, which in turn leads to a significant improvement in parsing accuracy. By contrast, further manual correction of tagging errors does not lead to a significant further improvement in parsing accuracy.
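In computational terms, a target hypothesis is simply a second, corrected token layer aligned with the learner's original tokens, and the divergences between the two layers drive the error annotation. A minimal Python sketch of this two-layer representation follows; the data format is an illustrative assumption, not the encoding actually used by Hirschmann et al. (2007) or Rehbein et al. (2012):

    # Each token keeps the learner's original form alongside a manually
    # corrected target-hypothesis form (illustrative format only).
    sentence = [
        # (learner form, target hypothesis)
        ('you',     'you'),
        ('can',     'can'),
        ('find',    'find'),
        ('a',       'a'),
        ('big',     'big'),
        ('vary',    'variety'),   # verb stem used in a noun slot
        ('of',      'of'),
        ('beaches', 'beaches'),
    ]

    # The target layer is what gets POS tagged and parsed; the
    # mismatches between the layers are what get error tagged.
    target_text = ' '.join(target for _, target in sentence)
    errors = [(i, orig, target)
              for i, (orig, target) in enumerate(sentence)
              if orig != target]
    print(target_text)   # you can find a big variety of beaches
    print(errors)        # [(5, 'vary', 'variety')]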


A number of researchers argue against the use of a target hypothesis. The main ground for the rejection is a theoretical one: the target hypothesis works on the basis of the distance between the actual learner form and some reconstruction of the 'intended' utterance in the target language. Reliance on such a reconstruction is what Bley-Vroman (1983) termed the comparative fallacy. Rastelli (2009: 59) argues that the comparative fallacy leads researchers to assume that the target language structure is what the learner was aiming at and that there is a valid distance from the target to be measured. He argues that this leads to an incorrect analysis, which fails to detect the internal structure of the learner's competence. With learners at lower and even intermediate proficiency levels, it is often not possible even to infer what the learner's target is, and thus the analysis of learner data in terms of a target hypothesis involves a lot of guesswork on the part of the analyst (Rastelli 2009: 59). The degree to which an analysis based on such guesswork can reveal anything meaningful about the learner's competence is questionable, according to Rastelli (2009: 59). Rather, Ragheb and Dickinson (2011: 114) argue that by focusing on the textual evidence, the analyst should aim to uncover how the learner uses language at that particular point. In a similar vein, Díaz-Negrillo et al. (2010: 180–1) note that the robustness needed to classify the data comes at the price of hiding essential characteristics of learner data.

These researchers suggest two avenues for overcoming the comparative fallacy. Díaz-Negrillo et al. (2010) and Ragheb and Dickinson (2011) argue in favour of multi-level POS tagging, where the various sources of conflicting evidence from different levels are separated out and made available for analysis. The mismatches between levels, as shown earlier, become informative about the competence of the learner. Rastelli (2009), however, takes an even more radical position. He argues that the analyst should take the output of the tagger trained on target language data without correction and should interpret the tagging output (especially 'tagging errors') as evidence of similarities between forms in the learner data that are not normally regarded as belonging to similar categories in native-language production. He calls the correspondences imposed by the tagger 'virtual categories', which reveal how a form is used in a way in which it would not be used in the target language. This is exemplified in (11), from Rastelli (2009: 64), where a formal deviation in learner Italian yields an adjective form sbagliato ('wrong') used where a noun sbaglio ('mistake') would have been used by native speakers. By tagging the deviant form as an adjective, it is put in a virtual correspondence with other adjectives:

(11)

quando la donna è in bagno lei faccia un sbagliato
when the woman is in bathroom she does-SUBJ a wrong-ADJ
'when she is in the bathroom she makes a mistake'

No general consensus has yet emerged among researchers as to whether a target hypothesis or a multi-layered annotation system gives the best results, but both approaches can be used productively in different kinds of research projects.
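The contrast between the two approaches is easy to see in data-structure terms: a multi-level annotation keeps one tag per evidence layer and lets the layers disagree, instead of replacing the learner form with a corrected one. A minimal Python sketch in the spirit of Díaz-Negrillo et al. (2010) follows; the tag values and layout are illustrative assumptions, not their published scheme:

    from collections import namedtuple

    # One POS value per evidence layer; disagreement between the
    # layers, not a corrected form, is what signals the learner error.
    Token = namedtuple('Token', 'form stem dist morph')

    tokens = [
        # 'a big vary of beautiful beaches': verb stem in a noun slot
        Token('vary', stem='V', dist='N', morph='N'),
        # 'for many foreigns': adjective stem, nominal slot and suffix
        Token('foreigns', stem='ADJ', dist='N', morph='N'),
        # 'for almost every jobs': singular slot, plural morphology
        Token('jobs', stem='N', dist='N.sg', morph='N.pl'),
    ]

    for t in tokens:
        layers = {'stem': t.stem, 'distribution': t.dist,
                  'morphology': t.morph}
        if len(set(layers.values())) > 1:
            print(f'{t.form}: mismatch across layers {layers}')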

4.2 Unexplored avenues for future research

The annotation of learner corpora increases the value of the data for research because the information added by annotation makes it possible to extract data from learner corpora that would otherwise not be accessible. POS tagging and syntactic annotation have been undertaken, but various research projects point to the challenges and pitfalls associated with the annotation of learner corpora. Learner corpus researchers should therefore use annotation with an awareness of its limitations, including the assumptions that are made when annotating data with tools that were designed for native-speaker data in the first instance. Beyond POS tagging and syntactic parsing, automatic annotation of learner data has not yet been explored on an extensive scale. More research is required to determine the feasibility, accuracy and value of semantic annotation and of annotation at higher levels, such as discourse or pragmatic annotation (see Chapters 12 and 13, this volume).

A possibility that has only received limited attention (e.g. Nagata et al. 2011; Ragheb and Dickinson 2013) is to customise annotation tools for learner data, using computational techniques. This can be done by using teaching material rather than newspaper corpora as training data for the development of annotation tools, or by customising the categories of analysis for the specific features of learner language. Alternatively, as attempted by Nagata et al. (2011), different computational algorithms can be used for the development of annotator programs, e.g. rule-based or machine-learning architectures, to determine whether other computational techniques yield better results for learner data. A last need in the field is for more publicly available annotated data, especially data that has been manually improved, to serve as a gold standard against which other annotation tools can be assessed. Nagata et al. (2011) and Geertzen et al. (2013) represent bold first attempts to make syntactically parsed corpora available to the research community, but more data is required.
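The intuition that training data should match the target domain can be illustrated with a deliberately simple model. The Python sketch below trains the same unigram tagger on two different sources and scores it against a gold-standard learner sample; all three mini-corpora are invented placeholders, not real resources:

    from collections import Counter, defaultdict

    def train_unigram(tagged_sents):
        """Map each word to its most frequent tag in the training data."""
        freq = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                freq[word.lower()][tag] += 1
        return {w: c.most_common(1)[0][0] for w, c in freq.items()}

    def accuracy(model, gold_sents, default='NOUN'):
        toks = [(w, t) for sent in gold_sents for w, t in sent]
        hits = sum(model.get(w.lower(), default) == t for w, t in toks)
        return hits / len(toks)

    teaching = [[('I', 'PRON'), ('like', 'VERB'), ('music', 'NOUN')]]
    newspaper = [[('Shares', 'NOUN'), ('fell', 'VERB'), ('sharply', 'ADV')]]
    learner_gold = [[('I', 'PRON'), ('like', 'VERB'), ('beaches', 'NOUN')]]

    for name, train in [('teaching material', teaching),
                        ('newspaper', newspaper)]:
        print(name, accuracy(train_unigram(train), learner_gold))
    # The model trained on material closer to learner language scores
    # higher on the learner sample, mirroring Nagata et al.'s finding.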

Key readings

de Haan, P. 2000. 'Tagging non-native English with the TOSCA-ICLE tagger', in Mair, C. and Hundt, M. (eds.), Corpus Linguistics and Linguistic Theory. Amsterdam: Rodopi, pp. 69–79.

The author proposes ways to improve POS-tagged versions of corpora from the ICLE project through a process of tag correction using the Tag Selection Tool in the TOSCA tagging system. He examines a number of different types of learner errors to determine how best to correct POS-tagging errors in such contexts and proposes a taxonomy of error types and tag-correction solutions for these.

van Rooy, B. and Schäfer, L. 2003b. 'An evaluation of three POS taggers for the tagging of the Tswana Learner English Corpus', in Archer, D., Rayson, P., Wilson, A. and McEnery, T. (eds.), Proceedings of the Corpus Linguistics 2003 Conference, Lancaster University, 28–31 March 2003, UCREL Technical Paper 16. Lancaster University, pp. 835–44.

This paper proposes a strategy for gradually determining the POS tag categories that are most problematic to get right and for implementing systematic manual correction strategies to develop a more accurately annotated corpus of learner data.

Thouësny, S. 2011. 'Increasing the reliability of a part-of-speech tagging tool for use with learner language', in Proceedings from the Pre-conference (AALL'09) Workshop on Automatic Analysis of Learner Language: From a Better Understanding of Annotation Needs to the Development and Standardization of Annotation Schemes. Tempe, AZ: Arizona State University.

This paper proposes three strategies to improve POS-tagger accuracy: identify and classify unknown lemmas, correct common tagging errors with rules, and use information from an error-tagged corpus to improve POS tagging. Together the strategies improve POS tagging accuracy to above 97 per cent for a corpus of learner French data.

Rastelli, S. 2009. 'Learner corpora without error tagging', Linguistik Online 38(2): 57–66.

The author argues against POS tag correction as an instance of the comparative fallacy. Instead, he proposes to interpret the output of a POS tagger trained on native-speaker data from the target language as indicative of virtual categories, or alternative groupings in interlanguage data, providing new insight into the emerging structure of the learner data.

Ragheb, M. and Dickinson, M. 2011. 'Avoiding the comparative fallacy in the annotation of learner corpora', in Granena, G., Koeth, J., Lee-Ellis, S., Lukyanchenko, A., Botana, G. P. and Rhoades, E. (eds.), Selected Proceedings of the 2010 Second Language Research Forum: Reconsidering SLA Research, Dimensions, and Directions. Somerville, MA: Cascadilla Proceedings Project, pp. 114–24.

The authors develop a proposal for learner language annotation without a target hypothesis, based on three principles: all words must be tagged for POS and syntactic dependencies, not just errors; linguistic evidence from the text is used in assigning categories, rather than inferences based on information about target structures or the native language of the learners; and a multi-layered description should be provided along the lines of Díaz-Negrillo et al. (2010).

Rehbein, I., Hirschmann, H., Lüdeling, A. and Reznicek, M. 2012. 'Better tags give better trees – Or do they?', Linguistic Issues in Language Technology 7(10): 1–18.

The authors explore the effect of the accuracy of POS tags on the accuracy of syntactic parsing. They find that tagging a version of a corpus that has been supplied with a target hypothesis leads to an improvement in POS tags. They also assess the value and the amount of time taken to implement other correction strategies. However, they find that syntactic parsing is quite robust and that the various POS-correction strategies therefore do not result in a statistically significant improvement in parsing accuracy.

Nagata, R., Whittaker, E. and Sheinman, V. 2011. 'Creating a manually error-tagged and shallow-parsed learner corpus', in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, OR, pp. 1210–19.

The authors develop a corpus of Japanese learner English that is manually error tagged, and also POS tagged and shallow-parsed. They explore the effectiveness of different algorithms and training data for the development of POS taggers for the learner corpus and find that training (or developing) a tool using teaching materials rather than a general corpus yields improved results on a learner corpus, compared to the results of the same tool on native-speaker data.


6 Speech annotation of learner corpora
Nicolas Ballier and Philippe Martin

1 Introduction

Spoken learner corpora are still rare today. In a comprehensive catalogue of existing learner corpora1 containing more than 140 entries, only 38 correspond to spoken data. One reason for this scarcity is that spoken corpora are more costly (in terms of money, time and technology) to collect and annotate, which results in less spoken than written corpus data. It is symptomatic, in this regard, that the British National Corpus only includes 10 per cent of spoken data, while one could argue that people talk more than they actually write. Of the few spoken learner corpora that are available, most do not come with audio files, but simply consist of transcripts (more or less detailed) prepared by the corpus compilers (we call these 'mute spoken data'; see Section 2.1). Such transcripts are suitable for the analysis of lexical or syntactic features, for instance, but they tend to be less adequate for phonetic studies, as they do not usually present a fine-grained level of speech annotation. This can be related to the difficulty of automating annotation of the speech chain: no immediate phonetic transcriptions of words can be provided, automatic syllable division for a language like English is very problematic, and prosodic units like accent phrases (nuclei) or intonational phrases (groups of prosodic units) cannot be automatically labelled, although some algorithms have been implemented for prosodic layers of annotation.

1 www.uclouvain.be/en-cecl-lcworld.html (last accessed on 13 April 2015).

For phonetic research to be possible, access to the acoustic signal must be provided. Since this type of research has mostly focused on specific features which require a controlled protocol, it has often relied on more constrained types of learner data. Typically, second/foreign language (L2) learners are asked to read several versions of a carrier sentence, where only one word (or even one sound) is changed, to measure the perceived differences with native pronunciations. The emphasis is then not so much on learner spoken production as a whole but on the acoustic analysis of a particular feature. More recently, learner corpora representing naturally occurring speech and specifically targeted at phonetics have also become available. Such corpora include text–sound alignment and sometimes several layers of phonetic/phonemic annotation. All these spoken learner corpora open the way for the empirical analysis of pronunciation and prosody in read and naturally occurring speech (for English, see, for example, Gut 2009; Tortel and Hirst 2010), thus contributing to the field of interlanguage phonology.

Interlanguage phonology focuses on the different stages of acquisition of a given phoneme by L2 learners. As in the general analysis of interlanguage (Selinker and Lamendella 1981), interlanguage phonology has moved away from contrastive analysis to more systematic investigations of errors in order to generalise learning processes, especially when compared with results from first language (L1) acquisition. Various theoretical frameworks have been proposed, advocating different kinds of phonological constraints and assessing the importance of the markedness of a given feature in the observation of L1 transfers (see Eckman 2004; Rasier and Hiligsmann 2007 for an overview; Major 2008). The overall domain, inspired by the notion of interlanguage, is known as 'interphonology', and though the number of spoken learner corpora suitable for this kind of analysis is still limited, many studies have appeared in the last few years. Most of the previous research was devoted to identifying the 'critical period' in learning foreign languages or the 'factors' in speech that give learners away. In the 1990s, the emphasis was on segmental issues, which led to the formulation of models accounting for the influence of native languages on non-native realisations (see Best 1995; Major 2001; Flege et al. 2003). In recent decades, the agenda has shifted to suprasegmental questions (Gut 2009; Hincks and Edlund 2009; He et al. 2012).

This chapter deals with the annotation of spoken learner corpora which makes the study of interphonology possible. The term 'annotation' is used here in a broad sense to cover both the transcription of speech and 'the practice of adding interpretative, linguistic information to an electronic corpus' (Leech 1997a: 2, emphasis original). This is justified by the fact that transcription is also interpretative and either implicitly or explicitly theory driven (see Ochs 1979; Davidson 2009). Díaz-Negrillo and Thompson (2013: 22) claim that '[a]s learner corpora are in essence a language corpus type, they share with other language corpora basic corpus linguistic principles that relate to corpus design, data processing, data analysis and corpus tools design, albeit with an obvious degree of specialisation'. This is even truer for spoken learner data. The fact that speech requires specific tools to visualise and analyse the physical realisations of the recording makes its written representation an important issue in the field. Indeed, for their analysis to be possible, spoken corpora have to be transcribed, which means that a textual representation of speech has to be given. When learner corpora are distributed without the sound file, the transcripts need to be as explicit as possible for the rendition of speech.

Although it may appear counterintuitive, it should be stressed that transcription of speech is more complex than the mere transcription of the individual words spoken by the recorded speaker. Often, punctuation signs and capitalisation are omitted, to avoid presupposing phrasing (i.e. segmentation of the speech chain) and the potential overall structure of the utterance. Such a transcription is known as 'orthographic transcription', where choices have to be made between gonna, going to or goin', for example. If the pronunciation is matched too closely, one runs the risk of missing occurrences when launching a word-based query. A common strategy consists in adopting a 'standard orthographic transcription', whose word forms are consistent with dictionaries. If this strategy is adopted, the orthographic transcription does not usually reflect the different phonetic realisations and tends to be standardised.

As for annotation in the strict sense, this can involve adding different types of linguistic information. However, we will confine our discussion to speech annotation, which consists in adding information about the pronunciation of sounds (vowels and consonants), known as the 'segmental level', and about prosodic structures (syllables, accents, rhythm, contours, etc.), known as the 'suprasegmental level'. Types of annotation that can equally apply to written learner corpora, such as part-of-speech (POS) tagging and parsing or error tagging, will not be dealt with here; readers are referred to Chapters 5 and 7 (this volume), respectively, for a discussion of these types of annotation. Note that because speech annotation is a highly specialised field, this chapter will necessarily involve a certain degree of technicality.
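The query problem raised above by non-standard spellings can be made concrete with a minimal Python sketch: a normalisation table maps reduced forms to standard orthography so that a word-based search does not miss tokens. The mapping below is a toy illustration, not a published transcription convention:

    NORMALISE = {'gonna': 'going to', "goin'": 'going', 'wanna': 'want to'}

    def standardise(transcript):
        """Rewrite reduced spellings as standard orthographic forms."""
        return ' '.join(NORMALISE.get(tok, tok) for tok in transcript.split())

    raw = "he was goin' to the beach and gonna swim"
    print(standardise(raw))
    # he was going to the beach and going to swim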

2 Core issues

Section 2 covers what we take to be the main issues of the field. First, we present and illustrate the different types of spoken learner corpora. Then, we deal with the transcription of speech features, i.e. the main conventions adopted to render spoken data in written transcripts. The last two issues are the annotation of pronunciation and prosody, and the computer software programs that can be used to annotate and/or analyse spoken learner corpora.

2.1 Types of spoken learner corpora

The term 'learner corpus' normally implies a certain degree of naturalness, so that some researchers make the point that a collection of reading-aloud tasks does not qualify as a learner corpus (see Chapter 2, this volume). However, such tasks will be taken into account here because: (1) several annotated corpora include different types of speech styles, among which is read speech (e.g. LeaP, Learning Prosody in a Foreign Language); (2) the more annotated a spoken learner corpus is, the less natural it tends to be. The latter point is probably linked to (i) the fact that such annotation requires excellent sound quality, recorded, for example, in an anechoic chamber (which does not fit well with spontaneous conversations in their original context), and (ii) the usefulness of having access to a comparable native sample, which is possible if native and non-native participants are asked to read the same extract (see ANGLISH, AixOx, LeaP). The final reason for inclusion is more pragmatic: there are not so many annotated spoken learner corpora, hence the necessity to cast one's net wide.

We make a distinction between spoken-based corpora (where transcriptions have been proposed, but the corresponding sound is not made available, such as LINDSEI, the Louvain International Database of Spoken English Interlanguage) and spoken corpora such as LeaP, where access to the acoustic signal is possible (the physical dimension of the sound, usually stored in a WAV or MP3 sound file). Among the corpora that are accompanied by their corresponding sound files, we distinguish corpora that have been time-aligned, so that it is possible to hear portions of the learner's actual speech, usually speaker turns, by playing the sound file ('speaking corpora'), and what we call 'phonetic corpora', where smaller units of speech can be visualised. For this latter type of corpus, the annotation of the sound file corresponds to the linguistic units, not only at the word level, but sometimes even at the syllable or sound-unit level. Such corpora usually require specific tools for their analysis, unlike spoken-based corpora that can be queried on the basis of text files with a concordancer. Typically, web-based interfaces such as that of the French Learner Language Oral Corpora (FLLOC)2 only allow queries on strings of text, so that spoken corpora of this kind are actually 'mute corpora', or spoken-based rather than spoken corpora.

Given that they have different aims (and research questions), learner corpora based on actual recordings may adopt different strategies and tools for annotation. Table 6.1 sums up the different kinds of spoken learner corpora that are available, categorises the different research questions that can be addressed and exemplifies some of the notions under scrutiny. Phonetic corpora are not yet the most frequent type available. Most spoken learner corpora are designed with English as the target language, produced by learners from different L1s, e.g. Polish for the PELCRA Learner English Corpus3 or Japanese for the NICT JLE (Japanese Learner English) Corpus,4 but other target languages are represented as well. Table 6.2 sums up the main features of some existing spoken corpora, from the least to the most richly annotated.

2 www.flloc.soton.ac.uk/index.html (last accessed on 13 April 2015); see also the SPLLOC (Spanish Learner Language Oral Corpora) interface: www.splloc.soton.ac.uk/search.php (last accessed on 13 April 2015). Results return the relevant excerpt including the string at utterance level and access to the corresponding CLAN, XML and WAV files.
3 http://pelcra.pl/plec/research (last accessed on 13 April 2015).
4 http://alaginrc.nict.go.jp/nict_jle/index_E.html (last accessed on 13 April 2015).


Table 6.1. A typology of spoken learner corpora (adapted from Ballier and Martin 2013)

Type of alignment
– Mute spoken corpora (transcripts): time alignment is irrelevant; transcriptions are available independently (standoff annotation).
– Speaking corpora: time alignment is crucial for the study of disruption, fluency, pauses.
– Phonetic corpora: time alignment of data is crucial, as well as syllable alignment and, if possible, segment alignment.

Status of the signal
– Mute spoken corpora: sound clips are optional and can be played but not analysed.
– Speaking corpora: sound files are associated with the transcript and relevant sections of the sound file can be played.
– Phonetic corpora: vital; part of the annotation is signal-based and the signal can be visualised (pitch, spectrograms).

Type of annotation
– Mute spoken corpora: mostly text transcripts: turn-taking, hesitations, repetitions, retracings, sometimes pauses and vowel lengthening.
– Speaking corpora: multiple layers are possible: lemmatisation, POS tagging (CLAN); phonemic targets, phonetic realisations (Phon).
– Phonetic corpora: usually several layers: phones (phonemic targets); pitch measurements and transcription; frequency measurements; syllable boundaries and tone-unit boundaries (phrasing).

Query interface and typical tool
– Mute spoken corpora: the favourite interface is a unidimensional (text) concordancer, with possible collocates, clusters.
– Speaking corpora: specific query modules allow complex multidimensional queries for multiple annotation layers (CLAN, Phon, NXT).
– Phonetic corpora: two options: (i) XML format to query textual data, to be later exported to speech software; (ii) speech concordancers: WinPitch or Praat plugins.

Typical investigation of learner speech features
– Mute spoken corpora: 'lexically based' in the broad sense (phraseological units, grammaticality of morphemes, clauses and sentences): lexical frequency, creativity and morphological productivity.
– Speaking corpora: time-based: fluency (silent/filled pauses); speech rate, tempo, reformulations, complexity.
– Phonetic corpora: signal-based: rhythm, 'accent', pronunciation, conformity to native grapho-phonemics (read speech), deviation from expected frequency range.

Typical candidates for learner 'errors'
– Mute spoken corpora: lexical confusion, lexical attrition, but also grammar, morphology, phraseology, etc.
– Speaking corpora: 'repairs', dysfluency, retracings, hesitations, stutterings (CLAN); phone substitutions, resyllabifications, phonologisation (Phon).
– Phonetic corpora: 'non-standard' segmental or suprasegmental realisations: phone substitution, stress misplacement, stress clash, focus displacement, unexpected signal modulation, phonetic transfers.

Select publications
– Mute spoken corpora: Tono (2005), De Cock (2007).
– Speaking corpora: Osborne (2007), Hilton (2009).
– Phonetic corpora: Neri et al. (2006), Chen et al. (2008), Gut (2009), Tortel (2009).

L2

L1

Task

Transcription/annotation

LINDSEI1

Mute

English

Various

Informal interviews and picture description

NICT JLE2

Mute

English

Japanese

Interview tests

FLLOC3

Speaking

French

English

SPLLOC4

Speaking

Spanish

English

PAROLE5

Speaking

LeaP6

Phonetic

Italian, Various French and English German and Various English

ANGLISH7

Phonetic

English

French

Elicited imitation, role-play, story retelling Interviews, narratives, pair discussions Summary and running commentary of small video clips Read speech, prepared speech, free speech, story retelling and nonsense word lists Read speech and monologues

Orthographic with some speech features and phonetic/prosody information (text concordancer) Error tagging (lexis, grammar) and speech features (text concordancer) Orthographic with POS tagging (CLAN)

AixOx8

Phonetic

French and English

English and French

Read speech and monologues

IPFC9

Phonetic

French

Various

Word lists and interviews

1. 2. 3. 4. 5.

6.

7. 8. 9.

Orthographic with POS tagging (CLAN), error tier, code-switching Orthographic with POS tagging (CLAN)

Intervocalic intervals, words, predications, syllables, tones, contours, lemmas, POS tagging (Praat) Words, phones, intervocalic intervals, syllables, feet (Praat) Words, phones, syllables (SPPAS), syllable structures (Praat) Words, liaisons (Praat, Dolmen)

www.uclouvain.be/en-cecl-lindsei.html (last accessed on 13 April 2015). http://alaginrc.nict.go.jp/nict_jle/index_E.html (last accessed on 13 April 2015). www.flloc.soton.ac.uk/LANGSNAP/index.html (last accessed on 13 April 2015). www.splloc.soton.ac.uk/ (last accessed on 13 April 2015). The manual is only accessible in French: www.umr7023.cnrs.fr/sites/sfl/IMG/pdf/PAROLE_manual.pdf (last accessed on 13 April 2015). www.philhist.uni-augsburg.de/de/lehrstuehle/anglistik/angewandte_sprachwissenschaft/workshop/pdfs/ LeapCorpus_Manual.pdf (last accessed on 13 April 2015). http://sldr.org/voir_depot.php?lang=en&id=731&version=1 (last accessed on 13 April 2015). http://sldr.org/voir_depot.php?lang=en&id=784&version=2 (last accessed on 13 April 2015). www.projet-pfc.net/le-projet-pfc-ef.html (last accessed on 13 April 2015).


2.2 Transcribing speech features

Conventions vary for the annotation of speech features. Some paralinguistic features such as voice quality (creaks, vocal fry, breathy voice) are rarely annotated (see Adolphs and Knight (2010) on the annotation of the Nottingham Multi-Modal Corpus). However, other structural units such as speakers' turns, pauses or overlaps are usually indicated. Hesitations, false starts, repetitions and self-corrections are also normally annotated. For instance, the Salford Corpus, a corpus of anglophones learning French that is part of the FLLOC, makes the following distinction in its annotation guidelines between repetitions and retracings: the former ought to be marked in the transcripts in the following way: les [/] les petits hommes (repetition), whereas retracing for one word is indicated as et puis le [//] les hommes, and retracing for more than one word as et puis [//] les garçons.5

5 www.flloc.soton.ac.uk/salford/conventions.html (last accessed on 13 April 2015).

A distinction is usually made between empty pauses and filled pauses (when something is pronounced and a filler such as uh or uhm is transcribed). The LINDSEI conventions indicate vowel lengthening (but not its duration) and distinguish between (eh) [brief], (er), (em), (erm), (mm), (uhu) and (mhm), whereas ANGLISH has (euh), the most common filler in the L1 (French).

Investigations of mute corpora have focused on discourse markers, fluency and repairs, where differences with natives can often be attributed to the frequency and position of markers (Liu 2013). For discourse markers (see also Chapter 13, this volume), studies have investigated, for instance, the uses of so (Watanabe 2010; Buysse 2012) and general extenders such as and stuff or something like that (Buysse 2014) in combination with other discourse features such as like, you know, I don't know (Aijmer 2009) and well (Aijmer 2011; Polat 2011). Aijmer (2004) reports the use of I don't know in learner data before, between and after constituents and in combination with other markers. The investigation of discourse features in that sense is mostly a pragmatic analysis of lexical material. Examining oral data from the Cambridge ESOL examination (Cambridge Learner Corpus), McCarthy (2010: 11) makes the point that 'native-like linkages in turn-construction and back-channelling' are more crucial than strict grammatical adequacy. He highlights the role of turn-openings for successful interactions, a phenomenon he describes as 'confluence'. Osborne (2007) studies fluency features in a database of French learners of English, the PAROLE Corpus (Hilton et al. 2008). He assesses the different metrics used to measure fluency, which help to distinguish styles of delivery but do not characterise differences in pronunciation. Fluency has also been investigated for the automated scoring of test responses (see Chapter 26, this volume).

Having analysed hesitation markers among learners of English and natives, Gilquin (2008a) concludes that it is not 'the presence vs absence of such features that distinguishes between NNS [non-native speech] and NS [native speech] performance, but their frequency and distribution' (see also Lennon 1990: 392–3). Gilquin and De Cock (2013) suggest that there is a continuum between errors of competence and errors of performance, between errors and dysfluencies. Researchers working on hesitations and repairs may benefit from specific tags in mute spoken-based texts for this analysis. However, it is important to bear in mind that sentence intonation may structure linguistic units (such as stress groups) differently at the utterance level. Native speakers have better control of sentence prosody than learners, and especially the ability to resort to 'prosodic erasing' (Blanche-Benveniste and Martin 2011: 135), in particular by reformulating and restructuring discourse after false starts. Native speakers may also successfully connect successive prosodic chunks by modulating their voice quality (e.g. creaky voice). The difference in sentence perception by both native and L2 speakers also depends on the types of self-correction strategies used by native speakers. These phenomena can only be observed at the prosodic level, with the sound file, hence the need for more prosodically detailed corpora.

The categorisation of the speech features described so far might be deemed to be heavily lexically based and somehow a consequence of our habits of dealing with written data. Hard-core phoneticians have other categories, which we present in the following section.
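Because markers like [/] and [//] are plain text, frequency and distribution counts of the kind Gilquin (2008a) relies on can be extracted mechanically from such transcripts. A minimal Python sketch follows; the regular expressions and the sample utterance are illustrative assumptions, not the official CLAN implementation:

    import re

    def disfluency_counts(utterance):
        return {
            'retracings': len(re.findall(r'\[//\]', utterance)),
            'repetitions': len(re.findall(r'\[/\]', utterance)),
            'filled_pauses': len(re.findall(r'\((?:eh|er|em|erm|mm|euh)\)',
                                            utterance)),
        }

    print(disfluency_counts('et puis le [//] les (euh) les [/] les hommes'))
    # {'retracings': 1, 'repetitions': 1, 'filled_pauses': 1}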

2.3 Annotating pronunciation and prosody

2.3.1 Segmental and suprasegmental features

From a phonetic point of view, annotation of spoken learner data can take place at either the segmental level of pronunciation or the suprasegmental level of prosody (see Table 6.3). Although the boundaries are somewhat artificial, this synthetic representation allows the presentation of the domain and the introduction of the main technical terms. The terms 'pronunciation' and 'prosody' refer to the realisation of the phonemic, or segmental, units and the realisation of the units above the phoneme, or suprasegmental units, respectively. The pivotal linguistic notions and units of the speech chain are listed in the first row (linguistic units). Pronunciation and prosody may refer to the segmental and suprasegmental components of 'foreign accent', and a reminder of the key concepts of the phonetics/phonology divide is in order (see Section 2.3.2). Learner phonetic realisations can traditionally be analysed according to segmental features (the realisation of individual sounds, known as 'phonemes' at the phonological level and 'phones' – grouped into syllables – at the phonetic level) and suprasegmental features. In the segmental category, we have excluded issues related to how words are divided into syllables (syllabification). In the suprasegmental category ('above the syllable'), we follow the conventional distinction


Table 6.3. A simplified presentation of the phonetic domains for the annotation of spoken learner data

Linguistic units
– Pronunciation (segments): consonants; vowels; syllables.
– Prosody (suprasegments): stress; rhythm; intonation (tonality, tone, tonicity).

Acoustic correlates
– Consonants: formants. Vowels: formants. Syllables: not so clear for all syllabic transitions.
– Stress: duration, fundamental frequency (F0), intensity. Rhythm: duration, stress. Intonation: duration, F0, pauses and phrasing.

Learner realisations and candidates for criterial features
– Consonants: final devoicing, consonant cluster reduction. Vowels: phone substitutions, phonetic transfers. Syllables: resyllabifications; templatic transfers.
– Stress: stressed syllable misplacement. Rhythm: syllable-timing; stress-timing. Intonation: prosodic transfers; non-syntactic phrasing; focus displacement, tone substitution.

Annotation layer in learner corpora
– Consonants: phone tier (ANGLISH, LeaP, Tortel 2009). Vowels: phone tier (Méli 2013). Syllables: syllable tier (ANGLISH, Tortel 2009).
– Stress: accent tier (Chen et al. 2008). Rhythm: intervocalic interval tier (ANGLISH, LeaP). Intonation: prosodic (ToBI) annotation (LeaP).


Figure 6.1 Screen capture of LeaP in Praat (Gut 2009). Top: waveform; middle: spectrogram; bottom: annotation layers (tiers)

between stressed and unstressed syllables (lexical and boundary stress assignment and rhythm) and prosody (using pauses, lengthening and pitch modulation to convey the so-called 'prosodic structure' of the sentence).

The second row of Table 6.3 presents the physical characteristics of the acoustic signal (known as the 'acoustic correlates' of the signal) that can be investigated with speech software. In Figure 6.1, for example, the formants correspond to the horizontal black 'lines' on the spectrogram; they are known as F1, F2, F3 and F4, counting up from the bottom. These formants are linked with articulatory movements (lip-rounding, back and front articulation, etc.). Intensity is proportional to the loudness of the sound, as visible in the variation of the amplitude of the waveform. For intonation, the fundamental frequency, or F0, is correlated with the vocal fold vibration and displayed as pitch curves.

The third row presents some of the learner errors that can be classified within this simplified framework. The errors are categorised in this table as 'non-standard' segmental or suprasegmental realisations, which may turn out to be due to the influence of the L1. To investigate such issues, specific layers of annotation have been used in previous studies, exemplified in the last row.

Several kinds of transfer from the learner's L1 can be observed (see also Chapter 15, this volume). Phonetic transfers are realisations which borrow phonological features of another language (e.g. aspiration of initial plosives). Phonological transfers (see Mennen 2007) are systematic uses of features that are encoded in the grammar of the L1. An example of a phonological transfer is the tendency of French learners to lay the stress on the final syllable of the stress group. Some transfers can be related to syllable structures. For a French realisation of comfortable (Tortel 2013), the absence of the final syllabic consonant (due to syllable-timed pronunciation) results in an epenthetic vowel (œ) and in a four-syllable realisation of the word com-for-ta-ble that reproduces the preferred syllable type (CV,


i.e. any consonant + any vowel) of French. Similarly, difficult is realised as [di fi kœlt] in French learners' production, whereas English dictionaries have /ˈdɪfɪkəlt/. This type of realisation can be seen as a templatic transfer (resyllabification), where the Romance tendency to structure syllables as CV templates transfers onto the pronunciation of the syllable in English.

In some cases, the phonological transfers bear on phrasing (Herment, Ballier et al. 2014) or on accent rules (Avanzi et al. 2012). The latter investigated possible intonation transfer between Swiss German and French ('français fédéral') with sophisticated statistical tools and extensive acoustic analysis. They showed that Swiss speakers who have a Swiss-German dialect as their mother tongue retain features of their mother tongue in French as an L2. They produce only 74 per cent of the sandhis (liaisons). Typically, non-native speakers would not make the liaison in pâtes italiennes (Italian pasta), where a linking /z/ would be expected among natives. For a phrase like grand émoi (great stir), they often stress both words, whereas natives would form a single stress group and stress its last syllable. Admittedly, these pronunciation issues probably involve other phonetic transfers as well, such as the devoicing of final voiced consonants.
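Once a target transcription and a learner transcription are available as aligned phone sequences, such substitutions can be detected mechanically. A minimal Python sketch follows, using SAMPA-like symbols to avoid IPA encoding issues; the alignment is simplified to equal-length sequences and the transcriptions are illustrative:

    # Target vs learner phone tiers for 'difficult' (SAMPA-like symbols:
    # I = lax i, @ = schwa, 9 = the French front rounded vowel).
    target  = ['d', 'I', 'f', 'I', 'k', '@', 'l', 't']
    learner = ['d', 'i', 'f', 'i', 'k', '9', 'l', 't']

    substitutions = [(i, t, l)
                     for i, (t, l) in enumerate(zip(target, learner))
                     if t != l]
    print(substitutions)
    # [(1, 'I', 'i'), (3, 'I', 'i'), (5, '@', '9')]
    # i.e. tense-for-lax vowel substitution plus the epenthetic [œ]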

2.3.2 The annotation of pronunciation

Compared with written corpora, spoken learner corpora are still in their infancy. One reason for this is that in learner speech the target, i.e. the expected form, is easier to identify than the actual learner output. The distinction between target and actual realisations is not always proposed in spoken corpora. It is implemented, for example, in the software Phon (Rose et al. 2006; see Section 2.4), but the distinction between the learner output and the expected form lies at the heart of the matter for any phonetic analysis. Phoneticians have long worked with a similar distinction, i.e. phoneme vs (allo)phone.

Two levels of granularity typically exist in phonetics and phonology: phonemic transcription (using the inventory of sounds acknowledged for the linguistic system of the target language) and phonetic transcription (attempts at reproducing the learners' phones). A reinterpretation of the phonetics/phonology divide is summed up in Table 6.4. There is a complex interaction between the units of the system and their different realisations. Phonology deals with abstract units (the phonemes, under 'Phonemic transcription' in Table 6.4). Phonetics is aimed at describing the details of phones (i.e. the realisations of phonemes). These phones also include allophones, i.e. rule-governed realisations of the same abstract phoneme, like the clear and dark (velarised) realisations of /l/ (under 'Phonetic transcription'). There are potentially several degrees of granularity in phonetic transcriptions (detailed notation of realisations, capturing articulatory details of the sound production, position of the tongue, lip-rounding, role of articulators). It is difficult for phoneticians to give unquestionable detailed phonetic transcriptions (see Delais-Roussarie and Post 2014 for a thorough presentation of how the


Table 6.4. The phonetics/phonology divide in transcriptions of learner speech at the segmental level

Phonemic transcription (PHONOLOGY)
– Types (phonemes from the reference variety of the target language): phonemes, usually represented between slanted brackets, e.g. /l/.
– Expected targets.

Phonetic transcription (PHONETICS)
– Expert notation of the tokens, specifying the articulatory roles of the organs of speech: phones (allophones), typically represented between square brackets, using diacritics to indicate phonetic details such as velarisation [ɫ] or devoicing [l̥].
– Realisations (actual rendition by learners of the target language).

International Phonetic Alphabet (IPA) can be used to convey details of sounds). Variation at the phonetic level is almost infinite, which is why it cannot be automated, at least if the aim is a precise transcription of the learner's actual output. As a result, learner corpora typically include the transcription (usually orthographic) of the target and at best indicate the standard phone substitution that can be encountered. This is the case, for example, in the English Speech Corpus of Chinese Learners (ESCCL; Chen et al. 2008), where a devoiced realisation of the final consonant in was is coded as [wʌs] in the Standard tier, with first the target and then the learner realisation within the annotation tag.

What Table 6.4 suggests is that we should aim for a double layer of annotation for segments, one representing the intended target and the other reproducing (or interpreting) the learner realisation. This is a very complex issue (see also Chapter 7, this volume). A learner pronouncing the word boat with a very open vowel that sounds like the word bat would be transcribed as 'boat' in an orthographically transcribed corpus like LINDSEI, but several choices could be made for a phonetic transcription ([bat], [bɑt] or [bæt]), each one making assumptions about the kind of vowel a learner may have in his/her system. Should [œ] in [di fi kœlt] be regarded as a misrealisation of /ə/ or /ʌ/, or as a transfer of the French front open round vowel /œ/ onto /ʌ/? Should [di fi kœlt] be transcribed [ˈdi ˈfi ˈkœlt] or [di fi kœlt] for a syllable-timed pronunciation that does not distinguish between stressed and unstressed syllables? Kenworthy (2000) offers a detailed phonetic transcription of three actual recordings of the same fairy tale as pronounced by Spanish, Italian and Japanese learners, but this degree of detail is the exception rather than the rule. Ideally, the risk of phonetic transcription should be taken, however questionable, for assumed misrealisations/'errors'. For the time being, when a 'phonetic' tier exists in learner corpora, it actually represents a 'phonemic' (quasi-phonological) transcription of the learner speech. In some cases (AixOx), it is automatically derived from a phonetic


dictionary, which somehow assumes a given phonological system for the learner (a General American transcription in the case of AixOx).

There have been few investigations into learner modifications of articulatory settings. Mortreux (2008) has shown that very advanced non-native speakers (lecturers in English) have modified their articulatory habits so as to leave palatographic prints (traces of the tongue tip on the alveolar ridge) similar to those of natives when realising [t] and [d] in English (phonemes whose place of articulation is dental in French but alveolar in English). Voice onset time (VOT) measurements confirmed a correlation between articulatory habits and learners' proficiency levels: very advanced learners pronounce these sounds with a delay in the burst of the consonant similar to that of English natives.

One way of getting around the difficulty of detailed phonetic transcription involves working on the acoustics of the speech signal. For consonants, plosives are analysed by investigating the duration of the interval between the burst of the consonant and the vibration of the vocal folds for the following vowel (VOT) (see Section 4). Vowels are analysed by comparing learner formants against native reference values (we present the principle in Section 3.1 for Méli 2013). For segmental features, automated formant interpretations are complex to achieve, as variability among natives is an established fact and speech perception by natives is the crucial element. Morrison (2005) has suggested 3D plotting of native responses to learner realisations, which means that the investigation of learner segmental features needs to be supplemented by perception studies. For instance, above a certain multidimensional threshold of F1 values and duration (see Section 2.3.1), L1-Spanish vowels were interpreted as /iː/ or /ɪ/, varying either in duration or formants. In the long run, a systematic investigation of the vowel distinctions tested by natives could pave the way for expert systems that categorise learner productions, emulating a native ear.
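The principle of comparing learner formants with native reference values can be sketched very simply in Python: place each vowel token in F1/F2 space and find the nearest reference vowel. The reference values below are rough, textbook-style figures chosen for illustration, not measured norms from any of the corpora discussed here:

    import math

    REFERENCE = {          # (F1, F2) in Hz, approximate adult male values
        'i:': (280, 2250),
        'I':  (360, 2100),
        'e':  (530, 1850),
        'ae': (700, 1700),
    }

    def nearest_vowel(f1, f2):
        """Return the reference vowel closest to the token in F1/F2 space."""
        return min(REFERENCE,
                   key=lambda v: math.dist((f1, f2), REFERENCE[v]))

    # A learner token intended as /I/ but produced with a very open vowel:
    print(nearest_vowel(640, 1750))   # -> 'ae', closer to the bat vowel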

2.3.3

The annotation of intonation

In this chapter, we assume that intonation refers to specific syllables in discourse. Even the simple representation of a fundamental frequency graph in frequency/time coordinates presupposes the use of specific speech analysis algorithms, which can change, sometimes to a large extent, the resulting melodic curves. The analysis of these melodic curves is not that simple. Small corpora of spoken learner production started to be collected in the 1980s (see James (1982), Lepetit (1992) and Grippon (2009) for French, among others). These first attempts illustrate the choices to be made with respect to transcription principles. The annotation of intonation pertains to the use of a particular transcription system to assess learner production, for example on speech segments assumed to be phonologically and/or perceptually important. Obviously, the process presupposes a sound knowledge of the intonation events, their importance and their


distribution, according to the theory selected by the annotator. The annotator may consider that there is no satisfactory intonation model for the language under study (e.g. James 1982) or, alternatively, may adopt a specific model to transcribe the learner's performance (which allows for comparison with native-speaker models). In the first case, mathematical tools may be adopted to simplify the representation of the data, namely the fundamental frequency and intensity curves. Such tools will then generate a score supposed to reflect learner performance. In the second case, the strategy consists in embracing (and trusting) a theoretical model to adequately represent the most relevant aspects of the prosodic realisations. The first approach can be called purely empirical, as no linguistic explanatory principle is implied in the transcription or in the assessment of the learner, while the second is theory-based and will of course depend on the annotator's confidence in the selected model. This does not mean that an empirical annotation system is free from theoretical presuppositions, even if it is automated. The INTSINT system (Hirst 2007), for example, replaces complex fundamental frequency curves (with macro- and micro-prosodic variations) with recalculated ('smoothed') curves, assuming that both are perceptually equivalent. One widely used annotation system is ToBI (Tones and Break Indices, ToBI 2014),6 which relies on the autosegmental-metrical framework (Goldsmith 1990). ToBI assumes rules of well-formedness: not only do stressed syllables need to be stressed, but specific tonal targets have to be attained for specific contours. Studies based on French and Romance prosodic corpora suggest that the system does not account for the range of targets observed (Martin 2009). Other approaches prefer a more intuitive representation, for example describing pitch movement (rising, flat, falling, etc.) after the British tradition. This is the case, for example, for the analysis of questions in Herment, Ballier et al. (2014). Taking into account learner needs and comparisons between native models and learner realisations, we will make some suggestions in Section 4. With (Mennen 2007; Gut 2009) or without (Tortel 2009; Herment, Tortel et al. 2014) specific references to the autosegmental-metrical model, learner intonation has been analysed in terms of prosodic transfers, investigating areas such as phrasing, pitch curves and rhythm. An aspect of speech that plays a very important role in intonation is phrasing, i.e. the segmentation employed by speakers in the production of sentences. It is expected that phrasing should correspond to important syntactic boundaries, irrespective of the type of acoustic markers selected (pause, interruption, boundary tone, etc.). It is relatively easy to evaluate the correctness of a learner's phrasing relative to a native-speaker model and to characterise it statistically. Herment, Ballier et al. (2014) have shown that less-advanced learners of English tend to chunk their questions into three or four intonation phrases, reproducing the French prosodic structure, whereas more advanced learners adopt a phrasing with fewer intonation phrases, thus coming closer to native speakers of English. Cauvin (2013) suggests that phrasing could be used as a criterial feature for identifying different learner proficiency levels. Other aspects of intonation have also been examined, such as the well-formedness of contours, i.e. the realisation of the pitch curve for stressed syllables. Gut (2009), for example, has investigated the proportion of rising, flat and falling boundary tones, as well as the percentage of nuclear tones (this study will be described in more detail in Section 3.2). Tortel and Hirst (2010) and Tortel (2013) have used the ANGLISH corpus to study differences in rhythm between French advanced learners of English (postgraduate students in English), intermediate learners (A-level students) and native speakers. Each of the three speaker types was recorded in a group of twenty. The duration of intervocalic and interconsonantal intervals was measured and computed using different kinds of metrics (percentage of vocalic duration, standard deviation, measures of successive intervals). In spite of a high degree of complexity in the statistical analysis, the experiment could not classify the three groups with more than 69 per cent accuracy on the basis of these metrics alone. The hypothesis that languages like French and English have strongly opposed rhythmical systems may still hold, but the methodology fails to capture it. Admittedly, this may be due to the fact that the annotation of these intervals was first introduced for L1 and the underlying phonological model was developed for infants.

6 www.ling.ohio-state.edu/~tobi/ame_tobi/annotation_conventions.html (last accessed on 13 April 2015).
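The rhythm metrics mentioned above are straightforward to compute once interval durations are available. The sketch below, a toy illustration under the assumption that vocalic and consonantal interval durations have already been extracted from a time-aligned tier, computes three metrics of the kind used in such studies (%V, the standard deviation of intervals, and the normalised Pairwise Variability Index); the duration values are invented.

    from statistics import mean, stdev

    def percent_v(vocalic, consonantal):
        """%V: proportion of total duration that is vocalic."""
        total = sum(vocalic) + sum(consonantal)
        return 100 * sum(vocalic) / total

    def delta(intervals):
        """DeltaV / DeltaC: standard deviation of interval durations."""
        return stdev(intervals)

    def npvi(intervals):
        """Normalised Pairwise Variability Index over successive intervals."""
        pairs = zip(intervals, intervals[1:])
        return 100 * mean(abs(a - b) / ((a + b) / 2) for a, b in pairs)

    vocalic = [0.12, 0.08, 0.15, 0.09]       # hypothetical vowel durations (s)
    consonantal = [0.07, 0.11, 0.06, 0.10]   # hypothetical consonant durations (s)

    print(percent_v(vocalic, consonantal), delta(vocalic), npvi(vocalic))

As the negative ANGLISH result suggests, the difficulty lies less in the arithmetic than in whether such interval-based metrics capture the rhythmic differences at all.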

2.3.4

The tokenisation of learner speech

Now that we have outlined the two main domains of phonetics for learner speech, we can raise another central issue. Phonetic units can be analysed in the signal but, so far, the technology cannot go beyond the word level for automated tasks. The recorded speech can be processed by 'forced alignment' (see SPPAS in Section 4), but the automation of other linguistic units is more problematic, especially at the phone level. Tokenisation cannot therefore be automated for the phonetic level (phone realisation, i.e. categorisation of the learner sound production). To grasp the distinction between learner output and expected realisations, a multi-layered annotation of the corpus distinguishing targets and actual realisations represents a good strategy for the segmental level. A detailed automatic phonetic annotation of actual realisations is error-prone (and open to debate, with many possible diacritics to code articulatory details of the pronunciation and decisions as to the acceptability of the realisations to be made in the very act of transcription). The target layer can be phonologically deduced from the lemmatisation of the corpus by grapheme-to-phoneme conversion. Targets should be clearly stated and can be transcribed using automatic phonetisers that assume a pronunciation model (for the time being, mostly a form of American English, as transcribed in the free online Carnegie Mellon University dictionary).7 For this endeavour, finer-grained and variety-tailored electronic resources are required (we need to accommodate the electronic resources to the underlying pronunciation model, whether General American English or Standard British English, for example). As for the suprasegmental level, this double target/actual labelling is daunting since it presupposes that a single target might be conceivable for intonation, that is, that each utterance corresponds to a single intonation pattern. However, just as a phoneme may encompass a set of acceptable phones, we believe that an utterance may encompass a set of acceptable pragmatic and personal decisions marked by various tonic and prosodic choices (tone, tonicity, tonality). A phonological annotation of the target requires a prosodic model that would at best indicate a canonical rendition of the utterance (see Section 4 for new directions in automatic prosodic annotation). Spoken learner corpora work at this very fruitful interface of phonetics and phonology, re-enacting the type vs token opposition in the guise of the target vs actual realisation, forcing interlanguage phonologists to be explicit in their model of 'acceptable' prosody and pronunciation, which at the very least requires explicit reference pronunciation models and prosodic modelling of the range of possible intonations of an utterance. At the segmental level, variation between the target and the actual realisation is mostly paradigmatic (phone substitution, insertion or deletion such as [h] dropping or initial [h] insertion), with any syntagmatic variation at the phonemic level being taken care of by (re)syllabification (epenthesis, syllable affiliation). At the suprasegmental level, variation is much more complex and both paradigmatic (stressed or unstressed syllable, high or low contour) and syntagmatic (phrasing or tonality). Taking these syntagmatic and paradigmatic dimensions of the speech chain into account is complex, and speech software has adopted different interfaces to cater for the needs of researchers.

7 www.speech.cs.cmu.edu/cgi-bin/cmudict (last accessed on 13 April 2015).
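To make the target-layer idea concrete, the following sketch derives a target (phonemic) tier from an orthographic transcription by dictionary lookup, as described above. It assumes NLTK and its CMU dictionary data are installed (nltk.download('cmudict')); any other grapheme-to-phoneme resource could be substituted, and a real pipeline would also handle punctuation and out-of-vocabulary words.

    from nltk.corpus import cmudict

    pronunciations = cmudict.dict()   # maps lowercase words to ARPAbet variants

    def target_tier(utterance):
        """Return a list of (word, target transcription) pairs."""
        tier = []
        for word in utterance.lower().split():
            variants = pronunciations.get(word)
            # Keep the first listed variant; out-of-vocabulary words are left
            # untranscribed and would need manual (or G2P) treatment.
            tier.append((word, variants[0] if variants else None))
        return tier

    print(target_tier("the learner requires support"))

Note that this yields only the assumed target layer (here, a General American model); the actual learner realisation still has to be transcribed or interpreted by the researcher.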

2.4

Annotation software

In this section, we present some of the tools that can be used for the transcription and annotation of spoken learner corpora. Almost all of the programs work with a sound file (usually a WAV file) and a text file (called a 'TextGrid' in Praat), used to store the annotation. Most of these programs allow for a multi-layered presentation, which is quite typical of spoken corpora but still relatively rare for written corpora.


Table 6.5. Some speech software features and their strengths for the analysis of spoken learner data

CLAN
Research focus: dysfluencies, phrasing.
Main assets for learner corpus researchers (and some examples of possible queries): complex queries, POS tagging, built-in metrics for syntactic complexity (e.g. what is the mean length of the utterances?).

Phon
Research focus: phone substitution, templatic transfers.
Main assets: exports to Praat in the latest version; automatic narrow transcription of the target; query engine with built-in phonological categories (e.g. what are the plosives that are most often dropped by learners in coda position, whether word-finally or word-initially?).

Praat
Research focus: acoustic analysis, prosody, formants, perception analysis, prosodic morphing.
Main assets: strong research community, numerous scripts, plugins and tutorials; perception tests can be run with Praat; occasional fine-tuning required for pitch tracking; the search function is less powerful than that of a concordancer, but Praat-based concordancers such as Dolmen1 have been developed; not suitable for multimodal analysis at this stage (e.g. scripts can be written to ask queries like: what are the mean F1 and F2 for this vowel? What is the rhythm of this speaker?).

ELAN
Research focus: multimodal analysis.
Main assets: multiple tiers, customisable categories for annotation of gestures (e.g. what are the gestures that co-occur with this word?).

EXMARaLDA
Research focus: multimodal analysis.
Main assets: no straight access to the speech signal, excellent import/export facilities to other programs, XML format (e.g. what are the gestures that co-occur with this word?).

WinPitch
Research focus: multimodal analysis, acoustic analysis, syllable duration, prosodic morphing.
Main assets: useful 'on-the-fly' alignment; imports of many formats (XML); imports and exports to Praat; UNICODE; multi-method pitch tracking; built-in concordancer with Excel output (e.g. what are the prosodic realisations of this word in this context?); similar features to CLAN, ELAN (multimedia) and Praat combined.

1 www.julieneychenne.info/dolmen/ (last accessed on 13 April 2015).

As evidenced in Figure 6.1 (Section 2.3.1), which shows a sample of LeaP in Praat, several layers (or 'tiers') of annotation allow several levels of analysis. In Table 6.5 and the description that follows, we have – somewhat subjectively – ordered the programs according to their degree of sophistication. Tools for multimodal analyses offer the widest range of functionalities and are therefore ranked among the most sophisticated software.


CLAN8 displays data vertically, so that the speech appears on the screen very much like a text. In that sense, it is one of the most intuitive programs. The other programs display the transcripts horizontally, in line with a timeline of the speech signal. CLAN has been used to investigate fluency (Osborne 2007; Hilton 2009). The software allows the transcription, coding, analysis and sharing of transcripts of conversations linked to either audio or video media. It uses the CHILDES conventions for notations and interfaces with ELAN, EXMARaLDA, Praat, Phon and SALT. It was intensively used in the FLLOC and SPLLOC projects. Phon9 is a piece of software that allows the investigation of phone substitution, phonetic transfers, phoneme acquisition and resyllabifications according to the syllabification of the L1 ('templatic transfers'). So far, it has mostly been used for first language acquisition, but the functionalities have also been applied to second language acquisition (Rose et al. 2007; Kutasi 2013). A series of dictionaries (and various algorithms for syllabification) are used to display phonemic transcription and allow for the automatic annotation of the target level. The researcher has to type in the words uttered in the corpus (orthographic layer) and transcribe phonetically what is pronounced by the speaker. This is an important distinction: the actual realisation has to be interpreted by the researcher, but it can be pre-formatted with the existing dictionaries. The target is automated but can be customised. All tiers can now be exported to Praat. Praat10 (the Dutch word for 'talk') is a computer software package for the phonetic analysis of speech. It was designed, and continues to be developed, by Paul Boersma and David Weenink at the University of Amsterdam. It enables multi-layered transcription of speech files and most commands can be scripted to perform various analysis tasks. As illustrated in Figure 6.1, the upper part represents the waveform, and the part below represents a wide-band spectrogram displaying F0 and formants. Scripts can be written to automate tasks, such as the calculation of pauses (Osborne 2007) or the extraction of formant values (Méli 2013). ELAN11 (EUDICO Linguistic Annotator, Wittenburg et al. 2006) was created by the Max Planck Institute and displays data on a horizontally layered set of tiers. The transcription resembles a musical score, with the different layers aligned with time and videos or sounds. ELAN has been used for discourse annotation (see, for instance, Querol-Julián 2010). Osborne (2007) comments on the possibilities of automating the detection of pauses with ELAN. EXMARaLDA12 is more complex because it can deal with different file formats and different types of data. It can communicate with other tools



and has a very rich set of options. It has been used for written learner corpora with several layers of annotation, such as Falko (see Chapter 7, this volume) and ALeSKo (Zinsmeister and Breckle 2010). It incorporates an interface for annotation (PARTITUR), a corpus manager for handling metadata, and the EXMARaLDA Analysis and Concordance Tool (EXAKT), a KWIC (keyword-in-context) built-in concordancer. Sarré (2010) has investigated French learners' English productions in a hybrid multimodal environment, comparing their interactions in videoconferences, chats and social networks. The EXMARaLDA software (Schmidt 2009) was used to compute and compare interactions in the different modes, paving the way for an innovative study of online learning communities. WinPitch13 (Martin 2014) is a more general-purpose software program for speech transcription and acoustical analysis, with emphasis on functions linked to prosodic analysis. It allows multi-layer annotation and a unique on-the-fly alignment process. Speech segments can be played and slowed down (up to seven times) to facilitate their transcription through insertions of boundaries that allow transcripts to be aligned with the signal (time-aligned data). It can import and export other formats like Praat TextGrid files, and perform automatic IPA transcription, as well as syllabic and intersyllabic segmentation. The program is also multimodal and can read and display video files in many formats, and in this respect is similar to ELAN and CLAN.14

8 http://childes.psy.cmu.edu/clan/ (last accessed on 13 April 2015).
9 www.phon.ca/phontrac/wiki/Downloads (last accessed on 13 April 2015).
10 www.praat.org (last accessed on 13 April 2015).
11 Max Planck Institute for Psycholinguistics. The Language Archive, Nijmegen, the Netherlands. http://tla.mpi.nl/tools/tla-tools/elan/ (last accessed on 13 April 2015).
12 www.exmaralda.org/ (last accessed on 13 April 2015).
13 www.winpitch.com/ (last accessed on 13 April 2015).
14 The IRCOM website provides a more exhaustive comparison (in French) of tools for spoken and multimodal data: http://ircom.huma-num.fr/site/p.php?p=ressourceslogiciels (last accessed on 13 April 2015).

3

Representative studies

This section presents two studies that are representative of the possible uses of phonetic corpora. The first (Méli 2013) illustrates research on pronunciation (segmental analysis), while the second (Gut 2009) deals with suprasegmental issues; both rely on Praat.

3.1 Méli, A. 2013. 'Phonological acquisition in the French–English interlanguage: Rising above the phoneme', in Díaz-Negrillo, A., Ballier, N. and Thompson, P. (eds.), Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: Benjamins, pp. 207–26.

Méli (2013) analyses segmental realisations of French learners of English in a cross-sectional study of three groups of students. The analysis targets phonemic distinctions and raises the theoretical issue of whether 'perceived dissimilarity' is a hindrance or an advantage for the acquisition of the sounds of the L2 that do not exist in the learner L1. The realisations of the interdental fricative, as well as some of the phonemic asymmetries for vowels, {/i/ in French, /ɪ/–/iː/ in English} and {/u/ in French, /ʊ/–/uː/ in English}, were analysed with Praat.



Figure 6.2 Learner distinctions of short /ɪ/ and long /iː/ (after Méli 2013). [Vowel plot on two Bark-difference axes, F3-F1 against F3-F2, with crosses marking native reference values for /iː/ and /ɪ/, squares marking learner realisations, and a dividing line between the two phoneme categories.]

Eighteen second- and fourth-year university students were recorded for this study, partly in the context of the Longitudinal Database of Learner English (LONGDALE) project (see Chapter 17, this volume). After studying the acoustic characteristics of certain aspects of native speech (the first three formants of vowels – see Section 2.3.1 for the analysis of vowels), it is possible to plot learner realisations of vowels. 'Vowel plots', as illustrated in Figure 6.2, are graphs displaying the vowels on two axes (corresponding to the subtracted values of the formants, i.e. F3-F1 and F3-F2), which represent native (crosses) and learner (squares) realisations of vowels. The two crosses in Figure 6.2 correspond to the ideal target; they represent average values calculated for the natives from corpus-based studies published in the Journal of the International Phonetic Association. The dividing line represents the phonemic distinction supposedly made by natives. This graph illustrates the difficulty for learners of reproducing, for example, the distinction between short /ɪ/ and long /iː/. The standard methodology is to use formant values to compute the phonetic distance between realisations of vowels (the study used the Bark Difference Metric method). The paper does not shy away from the complex issues of normalisation procedures for the formants of the vowels (see Thomas (2011) for an introduction) and compares native and non-native vowels using Lobanov-normalised formant values. Although the paper uses a recent set of reference values (Ferragne and Pellegrino 2010) as a benchmark for the assessment of the quality of the vowels, the context of occurrence of the vowels is different. Ferragne and Pellegrino (2010) measured only vowels in /h_d/ sequences to calculate mean formant values for reference purposes, and a larger corpus-based study may lead to a more nuanced view of these results.
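The two ingredients of this methodology, speaker normalisation and Bark-difference coordinates, can be sketched as follows. This is a minimal illustration of the general technique, not Méli's actual scripts; all formant values are invented placeholders.

    from statistics import mean, stdev

    def hz_to_bark(f):
        return 26.81 * f / (1960 + f) - 0.53   # Traunmueller (1990)

    def lobanov(values):
        """Normalise one speaker's measurements for one formant (z-scores)."""
        m, s = mean(values), stdev(values)
        return [(v - m) / s for v in values]

    def bark_difference(f1, f2, f3):
        """Return the plotting coordinates (Z3-Z1, Z3-Z2) for one vowel token."""
        z1, z2, z3 = hz_to_bark(f1), hz_to_bark(f2), hz_to_bark(f3)
        return z3 - z1, z3 - z2

    # One speaker's F1 measurements across vowel tokens (hypothetical):
    f1_tokens = [310, 420, 350, 600, 540]
    print(lobanov(f1_tokens))
    print(bark_difference(300, 2200, 2900))   # a single /i:/-like token

The Lobanov step removes speaker-specific vocal-tract differences so that learner and native tokens can be compared in the same space; the Bark differences then provide the two axes used in Figure 6.2.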


phoneme’ (i.e. in its wider context), the study stresses the importance of coarticulation errors at syllable level, but does not compute such effects for vowels (see below). The importance of distributions is shown only for the realisation of the fricative:  interdental in bath, but alveolar (i.e. [z]) in with. The analysis of the interdental reveals the importance of phonotactics (the syntax of phonemes) and possibly the importance of lexical frequency. For instance, among the 879 tokens of the interdental fricative that were analysed, the 113 occurrences of think exhibited a frequency effect (in this case, increased accuracy). The hypothesis of such frequency effects ought to be tested on a larger scale (function words like determiners were discarded in the study); for instance, it is possible that some words or ‘islands of reliability’ such as I think might work like lexical ‘magnets’, structuring phonetic realisations. The productions of learners and their distance from native reference values suggest that the /Ȑ/~/uɏ/ distinction is fuzzier (‘undercategorised’) than /ǰ/~/iɏ/ if we accept the principle that production is an acceptable image of categorisation. This phonetic analysis of formants seems to indicate different learning patterns for these sets of phonemes (quite possibly, /Ȑ/~/uɏ/ is acquired later than /ǰ/~/iɏ/). It still remains to be seen whether the task (monologues and interviews) has a significant effect on realisations, since it was not taken into account as a variable in this study. The other variable to be considered is syllable structure for vowels. French /i/ can occur in open or closed syllables, so that the mapping of the two phonemes (/ǰ/~/iɏ/ in English) into one (/i/ in French) is more complex if we take into account the phonological distribution of vowels. Analysing the perception and the categorisation of the realisations by learners themselves may lead us to qualify Méli’s results. When subjects are faced with variation in duration as well as in formant values in perception tests, the categorisation of the realisations might be more complex to formulate. The question of categorisation would be the ultimate goal of this kind of investigation. Is the phoneme acquired, i.e. are its variable realisations sufficiently stable to be unambiguously perceived as the target phoneme? This clearly requires perception tests to investigate categorisation. 3.2 Gut, U. 2009. Non-native Speech: A Corpus-based Analysis of Phonological and Phonetic Properties of L2 English and German. Frankfurt: Peter Lang. The LeaP corpus (Gut 2009) contains 176 recordings of different learner groups representative in terms of age, gender, native languages, level of competence, exposure to the target language, etc. The corpus is Praat-based but also includes XML files. The annotation is illustrated in Figure  6.1. Learners were asked to read a fairy tale and to reformulate it afterwards, and the corpus includes standard tasks such as reading and interviews. English and German were the target languages, annotated manually with ESPS/waves+ and Praat. The transcription tiers involved were orthographic


3.2 Gut, U. 2009. Non-native Speech: A Corpus-based Analysis of Phonological and Phonetic Properties of L2 English and German. Frankfurt: Peter Lang.

The LeaP corpus (Gut 2009) contains 176 recordings of different learner groups representative in terms of age, gender, native languages, level of competence, exposure to the target language, etc. The corpus is Praat-based but also includes XML files. The annotation is illustrated in Figure 6.1. Learners were asked to read a fairy tale and to reformulate it afterwards, and the corpus includes standard tasks such as reading and interviews. English and German were the target languages, annotated manually with ESPS/waves+ and Praat. The transcription tiers involved were orthographic (phrase and word levels), syllabic, segmental (using SAMPA15), tonal and pitch (using a modified ToBI notation for both pitch accents and boundary tones). On the pitch tier, four categories of pitch were used: first peak, final low, intervening peak and intervening valley. Inter-annotator agreement was tested statistically. Extensive statistical results are given. For example, the percentage of resyllabification of intervocalic consonants realised by non-native and native speakers of English is 64.3% and 93.7%, respectively, in all contexts (VCV, VCCV and VCCCV), which shows the importance of syllables in the analysis. One of the main features of Gut's (2009) study is that it focuses on intonation phrasing produced by learners, i.e. the chunking of the utterance into intonation groups, the aim being to observe the correspondence or mismatch between syntactic units and intonation groups. As for the analysis of intonation, results were, as expected, heavily dependent on the notation system used. Comparisons were made between native and non-native speakers, but only in terms of global information, such as highest and lowest fundamental frequency values. Although an inventory of English and German contours has been established on native-speaker data using the ToBI notation system, complete data sets of learner realisations were not made available. Despite the relative lack of a coherent theoretical intonational model for the languages recorded – a situation that may weaken the statistical significance of the conclusions drawn from these observations – this publication does provide a great deal of experimental data pertaining to most aspects of learner prosodic realisations. Gut (2009: 233) lists the percentages for each different tone produced by non-natives, showing that rises are more frequent in non-native German (27.14% of 2,185 tokens) than in non-native English (14.68% of 2,723 tokens) for read speech. Irrespective of detailed semantic intentions, the overall percentage of falls and rises has been computed for learners and native speakers: the observations yield an interesting characterisation of speaking styles and possibly learner tasks, as fall-rise is less frequent in free speech than in retellings or read speech. However, a more detailed investigation of the semantic implications and expectations of a prosodic pattern might also be desirable.

15 www.phon.ucl.ac.uk/home/sampa/ (last accessed on 13 April 2015).
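Once a tonal tier exists, proportions of the kind Gut reports are a matter of simple counting. The sketch below is a toy illustration, not Gut's procedure: it tallies hypothetical ToBI-style boundary-tone labels and prints their relative frequencies.

    from collections import Counter

    # Hypothetical boundary-tone labels read off a tonal tier:
    boundary_tones = ["H-H%", "L-L%", "L-H%", "H-L%", "L-L%", "H-H%", "L-L%"]

    counts = Counter(boundary_tones)
    total = sum(counts.values())
    for tone, n in counts.most_common():
        print(f"{tone}: {n} ({100 * n / total:.1f}%)")

The substantive work, of course, lies in the annotation itself: as the text notes, such counts are only as meaningful as the notation system behind them.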

4

Critical assessment and future directions

This section discusses some of the challenges of the field. We first insist on the need for larger and more richly annotated corpora of learner speech, and more varied types of corpora, in particular multimodal ones that include gestures. We then highlight the need to design more tools


for the automation of annotation tasks. Finally, we describe some pedagogical applications that can benefit from learner corpus research. Firstly, we should aim for larger and more richly annotated databases of learner speech. By current standards (Ballier and Martin 2013), multi-layered annotated phonetic corpora are considered large when they comprise over three hours of recordings. The AixOx corpus (Herment, Tortel et al. 2014) has thirty hours of recorded data, but some files are not fully annotated. In addition, learner corpora should preferably contain more natural speech as opposed to read speech. It would also be useful to collect longitudinal corpus data to be able to study the development of learners' speech productions across time. The LONGDALE initiative (see Chapter 17, this volume) is a first step in this direction. A recent study by Myssyk (2011), conducted at Paris Diderot on a subset of the LONGDALE Diderot corpus (Goutéraux 2013), shows the potential of the corpus to identify the criterial features distinguishing different proficiency levels of L2 speakers. In this study, Myssyk analysed the production of twenty-two French learners of English reading a two-line excerpt from The Selfish Giant by Oscar Wilde. The annotation of errors led to the identification of sixteen criterial features whose importance was then correlated to normalised ratings given by teachers and experts. The ratings were consistent and showed that some phonological characteristics were good candidates for criterial features. The ranking of features yielded a hierarchy of phonological criteria for this passage that revealed well-documented pronunciation issues for French native speakers (e.g. the sound /iː/ in sweet as opposed to the sound /ɪ/ in musicians or king). However, other criteria were deemed more relevant than suggested in the literature on contrastive error analysis, such as the absence of aspiration for /p/ (evidenced by measures of VOT). This segmental analysis of pronunciation showed the prominence of some features over others (avoidance of front vowels or lip-rounding for the central vowel /ʌ/, as in lovely, one and some, was the third criterial feature in order of importance). Another important aspect that should be investigated and recorded in learner corpora is body language. Do learners use the same body language when speaking in L2 and in L1? How do gestures correlate with their repairs or hesitations? The research questions in multimodality are fascinating but have only reached a preliminary stage. Filmed interviews are scarce and usually entail more restrictive consent forms to respect the speakers' rights to their own images. There are technical issues (number of cameras available for each session, annotation software) and theoretical choices to be made. The OTIM project (Blache et al. 2007) tries to propose standard annotation guidelines for multimodality. Knight et al. (2008) discuss the hand tracker used for the Nottingham Multi-Modal Corpus. The larger the corpora, the more acute the need to automate the analysis. The second critical desideratum is the design of more efficient and reliable automatic annotation techniques. Current techniques to


align the orthographic transcription and annotate the different layers of annotation are extremely time consuming. Researchers need solutions to automate segmental and suprasegmental annotation. One promising avenue is the design of computer-aided decision-making techniques based on routines. For example, SPPAS 1.5 (Bigi and Hirst 2012)16 can be used to annotate spoken learner corpora phonemically with a database of grapheme-to-phoneme conversions for American English, French, Chinese and Italian. To help with the phonetic transcription of learner speech, the orthographic transcription can be customised to the needs of learner realisations. For instance, to account for the actual pronunciation of Greenstead by learners, SPPAS will generate /i/ in the transcription if prompted in the orthographic transcription, for an approximate rendering of the pronunciation (see Bigi et al. 2012 for examples of the alignment accuracy pending modifications of the text to be converted into phonemes). Other forced-alignment methods have been tested with learner data, and intensity appears to be the most robust cue for automatic alignment of learner data (Ferragne 2013). For computer-aided annotation of prosody, some programs have been developed to 'translate' the signal into stylised representations of prosody with algorithms.17 For example, MOMEL, which is included in SPPAS, is a coding program for melodic curves. It replaces complex fundamental frequency curves obtained through acoustical analysis with a much simpler representation (Hirst and Espesser 1993). The companion program INTSINT performs an automatic prosodic annotation from MOMEL curves using tone levels (High, Low, etc.) and tone changes (Up, Down, etc.). Prosogram18 is another automatic annotation tool, detecting changes in fundamental frequency greater than a glissando threshold19 and replacing pitch movements below 0.32 semi-tones per second by a flat segment (Mertens 2004). As can be seen in Figure 6.3 (a screen capture of an excerpt from the ANGLISH corpus), a rather good approximation of syllable nuclei is reached. The general feature of syllable-timing and deaccentuation of this learner speech is evidenced by the series of flat sequences, with the stress placed on the second syllable of passengers, for instance. Even though the system was originally designed for French, these Praat scripts can operate on any language, but the highlighting of contours will not necessarily match the syllable divisions. This is illustrated in Figure 6.3, where the French speaker has quite a syllable-timed (more specifically, CV-patterned) pronunciation and this is reflected in the Prosogram representation.
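The principle behind INTSINT-style coding can be conveyed with a toy example. The fragment below is only an illustration of the idea of relabelling stylised pitch targets relatively; the real MOMEL and INTSINT algorithms are considerably more sophisticated, and the pitch values and tolerance are invented.

    def code_targets(f0_targets, tolerance=5.0):
        """Label each stylised pitch target (Hz) relative to the preceding one."""
        labels = []
        for prev, cur in zip(f0_targets, f0_targets[1:]):
            if cur > prev + tolerance:
                labels.append("U")    # up
            elif cur < prev - tolerance:
                labels.append("D")    # down
            else:
                labels.append("S")    # same
        return labels

    print(code_targets([120, 180, 175, 140, 95]))   # hypothetical targets

Such relative coding abstracts away from absolute pitch and speaker range, which is precisely what makes it attractive for comparing learner and native contours.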

16 http://sldr.org/SLDR_data/Disk0/preview/000800/?lang=en (last accessed on 13 April 2015).
17 See also Bartkova et al. (2012) for PROSOTRAN.
18 http://bach.arts.kuleuven.be/pmertens/prosogram/download.html (last accessed on 22 August 2014).
19 The glissando threshold (Rossi 1971) corresponds to the level of pitch variation which can be perceived.


Figure 6.3 A sample of ANGLISH using Prosogram as a device for computer-aided decision-making

This example is telling in that the syllable division is only partially reflected: the four syllables of this particular pronunciation of the word comfortable are outlined in the four sections of the word, but the division of never is lost in the automatic analysis.

The final issue we wish to address is that of computer-assisted pedagogical applications (see also Chapters 20 and 22, this volume). There has been very active research on speech technology within computer-assisted language learning (CALL) (for a survey, see Hardison 2009). A range of CALL programs have been designed with a view to helping learners improve their pronunciation. As noted by Tsurutani et al. (2006), the software usually takes one of the following two forms: 'playback and comparison' or 'visible speech aids'. For comparison with the expected realisation, these authors have actually used learner data to improve their software. They have refined the accuracy of their automatic speech recognition system for Japanese tokens produced by Australian learners by incorporating a list of common learner errors. Systems for computer-assisted pronunciation teaching (CAPT) try to be less L1-dependent, but current models based on L1/L2 rules perform better in terms of accuracy. Some complex metrics have been developed to improve automatic scoring (see also Chapter 26, this volume). Witt (2012) gives an overview of the phonetic features used for pronunciation scoring and assesses current challenges for CAPT, such as the need for robust detection of errors and corrective feedback. For articulatory feedback, experiments are conducted with visual representations of the tongue movements to help learners improve their pronunciation (Engwall 2012). Despite some commercial claims (e.g. Talk to Me),20 learner prosody is not really assessed as such. In some cases, commercial software grades learners, but without any specific feedback on prosodic errors. The grading algorithms are usually kept confidential but so far have not provided explicit guidelines for the realisation of prosodic contours.

20 http://auralog.software.informer.com/ (last accessed on 13 April 2015).


For teaching purposes, one of the crucial points is the comparability of learner and native realisations, as well as the accessibility and user-friendliness of the software for displaying these. Empirical comparison was used in early analyses of learner prosodic performance. James (1982), for example, compared the model and imitation melodic curves using a template cut out of cardboard. Later, various mathematical functions were proposed to evaluate the overall differences between the two fundamental frequency curves, the instructor's and the student's. One of the persistent problems with this approach pertains to the expected difference in rhythm between the two utterances. A conventional way to address this difficulty consists in implementing a dynamic time warping algorithm, automatically aligning important sections of the sentences. As these algorithms are generally based on spectral analysis, differences in acoustic realisations of syllables, vowels and consonants can cause the algorithm to fail. Empirical comparison can be very sophisticated and use a large set of parameters derived from acoustic data. Herment, Ballier et al. (2014), for example, used the ProsodyPro analyser (Xu 2013), which can deliver more than ten different parameters, in the hope that some will be linguistically pertinent, but the limitations of this strategy remain the same: the comparison is executed on acoustic parameters that are often difficult for learners to interpret, resulting in a 'parrot' requirement from the system, with no specific phonological insight. Indeed, the learners are assessed on the similarity of their realisations to the model. It would be important in this case to evaluate the significance of each of these parameters phonologically, or at least perceptually. A better theoretical intonation framework that could be easily understood by language learners should be adopted or designed. Differences between model and learner production could then be meaningful to both researchers and users, as they help to focus on what is important. Popular intonation notation systems such as ToBI are quite difficult for learners to understand, whatever their connection to meaningful theoretical frameworks. It might be advisable to avoid theoretically laden annotation schemes such as ToBI (see the appendix in Wells 2006 for a balanced translation of ToBI into British notational systems) or to offer translation into other systems (Chen et al. 2008). An intonation notation system closer to phonetic data (i.e. fundamental frequency and intensity curves, segment duration) might be easier for learners to understand and use. It may be beneficial to use a common system based on articulatory realisations by learners, directly linked to their production and perception. The Prosogram software may be a step in the right direction in this respect, as it can easily be automated for high-quality pitch curves and to detect the perceptual correlate of intonation, the glissando threshold.
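For readers unfamiliar with dynamic time warping, the following sketch shows the core idea on raw F0 vectors. It is a minimal textbook implementation under invented contour values, not the procedure of any system cited above, which would typically warp spectral features rather than pitch alone.

    def dtw_cost(a, b):
        """Return the cumulative DTW distance between two F0 sequences."""
        INF = float("inf")
        cost = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
        cost[0][0] = 0.0
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                     cost[i][j - 1],      # deletion
                                     cost[i - 1][j - 1])  # match
        return cost[len(a)][len(b)]

    model   = [110, 150, 180, 160, 120]        # hypothetical teacher contour (Hz)
    learner = [115, 140, 150, 175, 165, 125]   # hypothetical learner contour (Hz)
    print(dtw_cost(model, learner))

The warping absorbs the rhythmic mismatch between the two utterances, which is exactly the problem identified above; what it cannot do is tell the learner which of the remaining differences are phonologically meaningful.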


The choice of annotation has consequences not only for the learner's prosodic assessment but also for the pedagogical approach presented to learners. In the case of automatic annotation, the instructions given to learners may simply consist in imitating as closely as possible some model sentences. In a theory-based approach, some instructions (such as the realisation of a melodic rise on specific syllables) can be given to learners to better reproduce what is considered essential rather than the whole set of parameters in all their detailed variations. Other annotation strategies and learner tools are possible, especially if, as is increasingly the case (see ANGLISH, AixOx), similar data is recorded by natives and non-natives. WinPitch has a language teaching and learning version (Martin 2005), which allows the display of parallel utterances (model/learner). More specifically, speech synthesis allows learners to modify their own intonation according to a model. A French as a Foreign Language course (Léon et al. 2013) has been implemented with this software. This emphasis on the modification of melody by synthesis (prosodic morphing) has been under-researched but is very promising for raising learners' awareness. In other words, more perceptual studies are needed to circumscribe the boundaries between the variable learner realisations and the range of acceptable realisations within a given reference model.

Key readings

Arche, M. J. 2008. SPLLOC Transcription Conventions.
A must-have for any corpus annotator, especially for spoken corpora, these annotation guidelines for the Spanish Learner Language Oral Corpora (SPLLOC) explain the methodology followed when annotating the corpus with CLAN. The guidelines are crucial for any spoken corpus and their accuracy dramatically improves inter-rater agreement, in other words the reliability of the annotation. These transcription conventions are detailed and try to answer any questions that annotators may have.

Gut, U. 2010. The LeaP Corpus. A Phonetically Annotated Corpus of Non-native Speech.
This is the most comprehensive description of a spoken learner corpus to date. Based on German learners of English (Gut 2009), the protocol of the LeaP corpus mixes interviews, read speech and retelling of a story. This research project was designed to investigate stress assignment, sentence intonation and rhythm. Speakers' pitch range can be calculated using this corpus.

Chen, H., Wen, Q. and Li, A. 2008. 'A learner corpus – ESCCL', in Barbosa, P. A., Madureira, S. and Reis, C. (eds.), Proceedings of Conference on Speech Prosody 2008. Campinas, Brazil.
The English Speech Corpus of Chinese Learners (ESCCL) contains recordings of various learners of English with four different educational backgrounds. Subjects originated from different parts of China and


dialectal areas and either read a prepared text or interacted in a topic-based spontaneous dialogue. The annotation system employed in the corpus combines the British (Prehead – Head – Nucleus – Tail) and American (ToBI) systems and also uses the Praat multi-tiered system. Pronunciation errors can be investigated, as the corpus includes a tier where learner productions and expected targets are annotated.

Neri, A., Cucchiarini, C. and Strik, H. 2006. 'Selecting segmental errors in non-native Dutch for optimal pronunciation training', International Review of Applied Linguistics in Language Teaching 44: 354–404.
Neri et al. (2006) present their methodology for segmental error annotation and inter-annotator agreement of segmental labelling. The reference pronunciations/gold standards were defined from a previous phonetic study on Dutch by Gussenhoven (1999). Consonants appear to be more problematic for non-native speakers of Dutch, even though differences can be observed based on learners' L1. A total of 810 fragments (15 sentences × 54 speakers) of learners from eleven different L1s (and three levels of proficiency) were tested.

Gilquin, G. and De Cock, S. 2013. 'Errors and disfluencies in spoken corpora: Setting the scene', in Gilquin, G. and De Cock, S. (eds.), Errors and Disfluencies in Spoken Corpora. Amsterdam: Benjamins, pp. 1–32.
In the introduction to a collected volume, Gilquin and De Cock (2013) challenge the relevance of the traditional distinction between errors and dysfluencies in both native and learner language. They note that (i) the boundary between errors and dysfluencies is hard to find (although errors may be linked to competence and dysfluencies to performance) and (ii) as native speakers' oral productions contain errors and dysfluencies as well, the difference with learners may pertain to the frequency of these occurrences in spontaneous speech rather than a difference in nature.


7 Error annotation systems

Anke Lüdeling and Hagen Hirschmann

1

Introduction

The categorisation and investigation of errors made by foreign or second language learners is an interesting and fruitful way of studying accuracy and other aspects of learner language (Corder 1967, 1981; Dagut and Laufer 1982; Ellis and Barkhuizen 2005, among many others). In addition to being an analytical tool for assessing the ‘quality’ of a text, error analysis, if done correctly, sheds light on the hypotheses a learner has about the language to be learned. Missing or incorrect articles, for instance, can point us to a better understanding of the learner’s ideas of definiteness; certain lexical errors tell us that a learner might not be able to use the appropriate register, etc. Error analysis (henceforth EA) is a research method and, as for any other method, there are a number of issues to take into account when applying it. These issues include the categorisation and assignment of error types as well as the (linguistic and extra-linguistic) contextualisation of errors. It is, for example, often necessary to consider the larger context in order to decide whether a definite article is required. Knowing the first language (L1) of a learner and the circumstances under which a text was produced can be crucial in understanding a register error. Since the early days of EA, some of these methodological issues have sometimes been neglected, and EA has often been criticised (for an overview of the criticism, see e.g. Dagneaux et al. 1998). It has also long been recognised that the study of language acquisition processes needs to be reproducible and testable. Complementing experimental data of various types, learner corpora can be a valuable source of data for reproducible studies of language acquisition – but only if they are well designed, well described and publicly available. Corpus data must be interpreted and categorised to be useful. In this chapter, we focus on one way of interpreting learner corpus data, namely error annotation as the explicit and transparent way of marking errors in a learner corpus. Error


annotation is one step in computer-aided error analysis (CEA), a term introduced by Dagneaux et al. (1998) to refer to error analysis conducted on the basis of learner corpora. We will describe how error annotation schemes are designed, how they can be queried, and what opportunities and problems error annotation brings.

2

Core issues

2.1

Annotation

Unannotated corpus data can be used for many research questions. But whenever one wants to search for categories of something – all finite verbs, all sentences under five words, all orthography errors – and not strings, it is useful to assign these categories to the corpus data. The utterance in example (1), for instance, is annotated with part-of-speech categories,1 lemmas, and noun phrases. It is now possible to search for noun phrases that contain conjunctions or for noun phrases that use singular nouns without a preceding article.

(1)
                 The    learner   requires   support   and    guidance
part of speech   AT0    NN1       VVZ        NN1       CJC    NN1
lemma            the    learner   require    support   and    guidance
noun phrases     [NP The learner]            [NP support and guidance]
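A search of the kind just mentioned is easy to express once the annotation exists. The sketch below is our own toy illustration, not part of any corpus tool: it finds singular nouns (NN1) that are not immediately preceded by an article (AT0) in the token/tag pairs of example (1).

    # Tokens and tags follow example (1); the data structure is illustrative.
    tokens = [("The", "AT0"), ("learner", "NN1"), ("requires", "VVZ"),
              ("support", "NN1"), ("and", "CJC"), ("guidance", "NN1")]

    hits = [word for i, (word, pos) in enumerate(tokens)
            if pos == "NN1" and (i == 0 or tokens[i - 1][1] != "AT0")]
    print(hits)   # ['support', 'guidance']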

Error annotation works in the same way; segments from a learner corpus are annotated with an error category. Technically, annotation is the assignment of a category to a segment of the corpus (see also Chapter 5, this volume). It is done in the corpus and not somewhere else such as on a file card, in a spreadsheet, or in a statistical table.2 Often the category is taken from a finite tagset, as in the part-of-speech layer in example (1) or, as we will see, an error category from a predefined error tagset. Sometimes this is not possible because the values that can be used are infinite or unforeseeable, as in a lemma layer or in the target hypothesis layers that we will introduce below. Annotation is categorisation and thus involves a necessary loss of information. The same data can be categorised in different ways, even for the same type of information, depending on the criteria one wants to use.

1 The utterance is taken from the British National Corpus; lemmas and noun phrases are added by us. The part-of-speech tags are from the CLAWS tagset (Garside and Smith 1997); AT0 stands for article, NN1 for singular noun, VVZ for finite verb, CJC for conjunction.
2 Linguistic data itself can be spoken or written. In this chapter we assume that the sound waves constituting spoken language data are represented by some kind of written representation, be it an IPA transcription or an orthographic transliteration, which then will be the base for further annotation as discussed here (see e.g. Lehmann 2004; Himmelmann 2012; Chapter 6, this volume).


There are, for example, many part-of-speech tagsets (see Atwell 2008), some focusing on the syntactic properties of words, others on the morphological properties, etc. An annotation layer thus never codes the 'truth' – rather it codes one way of interpreting the corpus data. Explicitly annotating the data means that the interpretation of the data is available to the reader of the analysis. It is important that annotation is separable from and will not corrupt the corpus data (Leech 2005). The corpus is separated into tokens – tokens are technically just the smallest units in a corpus and could be of any length and complexity, but typically in European languages they constitute something like 'graphemic words' – with all the problems that notion entails (see, e.g., Schmid 2008 for a discussion). Annotation can pertain to any unit of the corpus. The sequence in the corpus to which a category applies is called an exponent. There is 'subtoken' annotation such as phonetic or phonological annotation, token annotation such as part-of-speech annotation, and annotation spanning several tokens, such as the annotation of the noun phrases in (1), idiomatic sequences, sentences, or paragraphs. The annotation itself can come in various formats. Next to the assignment of a simple category to a given token (such as a part-of-speech category 'NN1' to a singular noun) or a sequence of tokens, we find different types of hierarchical annotation (such as constituency trees) or pointing relations (such as the members of an anaphoric chain).

2.2

Corpus architectures

Because error annotation can, in principle, occur in different formats and attach to any exponent, and because there can be many layers of error annotation (see Section 2.5.1), a multi-layer corpus standoff architecture is very useful. In standoff architectures (see Carletta et al. 2003; Chiarcos et al. 2008, among many others) it is possible to define as many independent annotation layers as necessary. This means that different people, using different tools, can work on the same data and all their analyses can be consolidated. It also means that different interpretations of the same data can be kept apart. In accordance with what we said at the beginning, this makes it possible to test different hypotheses on the same data. We will show below that computer-aided error analysis often uses other annotation layers such as parts of speech or syntactic annotation in addition to the error categories, which is another reason why multi-layer architectures are helpful. That said, many existing learner corpora are not coded in corpus standoff architectures but use some kind of inline format where the annotation is not represented separately from the corpus data.
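A minimal sketch of what 'standoff' means in practice may be helpful here. In the fragment below, which is our own illustration with invented layer names and categories rather than any existing corpus format, the token sequence is stored once and each annotation layer points into it by token span, so that layers can be added, removed or revised independently.

    from dataclasses import dataclass

    @dataclass
    class Span:
        start: int      # index of first token (inclusive)
        end: int        # index after last token (exclusive)
        value: str      # category or free-text value (e.g. a target hypothesis)

    tokens = ["he", "have", "seen", "the", "films"]

    layers = {
        "pos":    [Span(0, 1, "PRP"), Span(1, 2, "VHP"), Span(2, 3, "VVN"),
                   Span(3, 4, "AT0"), Span(4, 5, "NN2")],
        "error":  [Span(1, 2, "agreement")],   # one error tag on 'have'
        "target": [Span(1, 2, "has")],         # target hypothesis layer
    }

    # Layers are independent: a new analysis can be added without touching
    # existing ones, and conflicting interpretations can coexist.
    for span in layers["error"]:
        print(tokens[span.start:span.end], "->", span.value)

Because each layer only references token indices, two annotators' conflicting error analyses of the same data can be stored side by side and queried separately, which is precisely the testability argument made above.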

2.3

Error analysis

Errors may concern language production as well as language reception. In the following, we will exclusively discuss language-production errors.


Since the 1960s, the notion of what constitutes an error, how errors can be classified, and what role errors play in language acquisition has changed. We do not have space here to give an overview of the history of EA but can only sketch some of the major trends. For more comprehensive overviews of error analysis, see Corder (1981: 35ff.), Ellis (1994: 47ff.), James (1998), and Díaz-Negrillo and Fernández-Domínguez (2006). The scientific study of learner errors is based on the assumption that errors are a surface reflex of the learner’s internal grammar or interlanguage (Selinker 1972). This notion was influenced by generative models of grammar that assume a systematic internal grammar for native speakers. The interlanguage of a learner is assumed to be just as systematic, although different from the internal grammar of native speakers. Influenced by the idea of a systematic interlanguage, the perception of errors as expressions that are simply ill-formed and chaotic has given way to a concept of describing errors in a systematic way. Single errors are not useful for the study of language acquisition because an error might occur for any number of reasons and only some of these reasons might have to do with the learner’s interlanguage, others being ‘performance’ errors due to tiredness, inattention, etc. It is thus necessary to study types of errors with many tokens and compare contexts and situations. One distinction that mirrors this is the error vs mistake distinction. ‘Performance’ errors  – called mistakes  – might point to processing issues but are not relevant for the study of interlanguage. ‘Competence’ errors (or simply ‘errors’), on the other hand, might point to non-target-like structures in the interlanguage. While this distinction might be theoretically valid, it cannot be made in a corpus analysis because there is typically no way of knowing what the learner knew and which other tasks, feelings, etc. might have influenced his or her production. As an example, consider a (simplified) typing issue. The layout of a keyboard influences the frequency of typing mistakes. Keys that are next to each other are substituted for each other more often than keys that are further away from each other. On a QWERTY keyboard this could explain a number of n for m substitutions and it might be argued that they are simply mistakes and not influenced by the learner’s knowledge of the target grammar. What happens, however, if a learner of L2 German substitutes n for m at the end of a word that should be in dative case (the accusative article den instead of the dative article dem, say)? This could be a pointer to an interlanguage problem or it could just be a typo. This simple example shows that the distinction between errors and mistakes can be made after a careful analysis of the data but not in the error annotation itself. In tandem with the idea that errors can be described systematically, the notion of the function of errors in the acquisition process also changed (cf. Corder 1967, 1981). Errors used to be interpreted as violations of language rules that should be avoided. Today, certain types of errors are seen as signalling necessary stages on the way to target-like language. Many

Many of these necessary errors have to do with segmentation and productivity. Roughly speaking, learners first learn complex forms, such as inflected or derived words, without segmenting them. At that stage, they might not make errors but cannot use the language productively. At later stages, they might understand how to segment some of the forms and detect regularities, which leads to overgeneralisation because they have not yet learned the exceptions. They might actually make more errors at this stage, but these errors are a sign of a deeper understanding of the language and are therefore considered to be crucial in the acquisition process.

The idea that certain errors are typical for a given acquisition stage is fundamental in some acquisition models. Klein and Perdue (1997), for instance, hypothesise a common language-independent stage for untutored adult second language learners, which they call the basic variety. After observing second language learner groups with ten distinct first-language–second-language relations, Klein and Perdue describe common structural properties that occur for each group independent of the L2 and the L1. One example of such a property is 'no inflection in the basic variety, hence no marking of case, number, gender, tense, aspect, agreement by morphology' (Klein and Perdue 1997: 11). These properties are said to appear at a certain stage in the acquisition and will (for many learners) be overcome at some point as they proceed in their acquisition.3 In a model like this, the appearance and disappearance of certain types of errors can be treated as benchmarks for the acquisition process.

The idea that a learner's interlanguage is systematic and that it can be analysed by looking at errors has been criticised repeatedly. One of the most fundamental pieces of criticism stems from Bley-Vroman (1983).4 He states that error analysis is always done from a native-speaker perspective and that the analysis of a learner variety through the native perspective will not reveal the true properties of the learner's text. Bley-Vroman calls this the 'comparative fallacy'; similar issues have been raised by other researchers (Klein and Perdue 1997, for instance, use the term 'closeness fallacy' for a similar problem). Bley-Vroman (1983: 2) goes as far as to say that the comparative fallacy pertains 'to any study in which errors are tabulated … or to any system of classification of interlanguage (IL) production based on such notions as omission, substitution or the like'. Bley-Vroman's criticism is valid, and any error annotation that uses a native 'standard' against which an error analysis is performed is problematic. This certainly has to be taken into account, and there are many researchers who try to do error analysis in ways that avoid (or minimise) the comparative fallacy, mainly by carefully explaining and motivating every step in the analysis so that possible biases can be identified (see Tenfjord, Hagen and Johansen 2006; Ragheb and Dickinson 2011; Reznicek et al. 2013).

3 This is a simplified account of Klein and Perdue's acquisition model. Other influences on learner language development discussed in that model are information structure and communicative needs.

4 Bley-Vroman doubts the systematicity of interlanguage itself but this does not have to concern us here.

Ellis (1994: 50ff.) names four distinct steps in error analysis: (1) identification, (2) description, (3) explanation, and (4) evaluation of errors. It is important to keep these steps apart – conceptually and technically.

2.4 Identification and description of errors

In order to distinguish errors from non-errors in learner utterances, it would be helpful to have a clear definition of what an error is. The first idea that comes to mind is that an error is a violation of a rule. To operationalise this, everything about our linguistic behaviour would have to be codified and described by rules, which is, of course, not possible. Many linguistic models make a distinction between grammar and usage, or between grammar errors and appropriateness errors.5 Grammar, it is argued, can be codified by (categorial) rules, and grammar errors are therefore easy to detect and describe. Usage, on the other hand, is quantitative rather than categorial, and appropriateness errors depend more on interpretation. If this were true, people would always agree on grammatical errors, but we would expect some disagreement over the identification and classification of appropriateness errors. Lennon (1991: 182) seems to refer to both types of error when he defines an error as 'a linguistic form or combination of forms which, in the same context and under similar conditions of production, would, in all likelihood, not be produced by the speakers' native counterparts'. This definition results from the observation that 'to be fully nativelike language must be not only grammatical but also appropriate' (Lennon 1991: 184). Unidiomatic expressions or stylistically inappropriate forms can be grammatical in a strict sense. In Section 2.5.1 we will outline a way to deal with the difference between grammar and appropriateness errors. In the following, we will argue that even the identification and classification of grammar errors is far from easy and that interpretation of the data is involved from the very start. In order to assess how a given learner expression would be used by native speakers, it is necessary to provide an alternative expression. Compare examples (2)–(4).

(2) She must saves money.
(3) She must saved money.
(4) She must my.

5 The term 'usage' here is very broad and encompasses phenomena people have described as belonging to pragmatics, information structure, register, etc.

Example (2) is clearly ungrammatical. We could provide a grammatical equivalent by changing the verb form: She must save money. This makes it plausible to interpret the error as a verb morphology error (no 3rd person -s in the scope of a modal) or a predicate structure error. The assumption of which 'correct' utterance corresponds to the erroneous utterance is called reconstruction, target form or target hypothesis. For example (3), we can construct several target hypotheses which seem equally likely. Without further context, the following are equally possible and plausible: She must have saved money or She must save money. The error would be analysed differently depending on which target hypothesis is chosen. If the context does not help to disambiguate between these possibilities, it is impossible to say which target hypothesis is to be selected. Example (4) shows an utterance which does not provide enough information to formulate a target hypothesis at all. In order to create a corresponding grammatical sentence, one would have to add so much information that innumerable target hypotheses are possible. This means that, in (4), it is impossible to appropriately analyse the error that causes the ungrammaticality of the utterance; the only sensible error category would be 'uninterpretable'.

Examples (2)–(4) show that even clear grammatical errors involve interpretation. This is even more true for appropriateness errors.6 Appropriateness is judged differently by different people and can involve all linguistic levels: words can be inappropriate (e.g. the use of maybe instead of perhaps in an academic register), syntactic structures can be inappropriate (e.g. a sentence with three complicated sub-clauses might be inappropriate in a conversation with a child), etc. While it is sometimes possible to mark a grammar error by looking at a sentence in isolation, the identification of appropriateness errors needs linguistic and extra-linguistic context. But just like grammar errors, appropriateness errors can only be found with the help of a target hypothesis. The target hypothesis might look different from one that is constructed for clear grammar errors. One possible way of dealing with this problem is the introduction of several target hypotheses; another might be the use of task-based corpora where the purpose of a learner utterance is clearly constrained by the context. Errors cannot be found and analysed without an implicit or explicit target hypothesis – it is impossible not to interpret the data. It is important to note that the construction of a target hypothesis makes no assumptions about what a learner wanted to say or should have said: the analyser cannot know the intentions of the learner. The 'correct' version against which a learner utterance is evaluated is simply a necessary methodological step in identifying an error.

6 The distinction between grammaticality and appropriateness is similar (although not equivalent) to what Ellis (1994: 701) and others have called overt vs covert errors. Overt errors are 'apparent in the surface form of the utterance' while covert errors are visible only in a broader context, 'when the learner's meaning intention is taken into account'.

In some learner corpora, target hypotheses are not given explicitly. Considering the high cost of the analysis, this is understandable. It is, however, a problematic decision: an error-annotated corpus which does not provide target hypotheses hides an essential step of the analysis. This could lead to mistakenly assuming that the error annotation present in a corpus is the 'truth' or 'correct analysis' instead of just one among many interpretations (similar to what Rissanen 1989 calls 'God's truth fallacy'). This is why more and more learner corpora are offering explicit target hypotheses along with error classes.

In error annotation it is necessary to assign one or more categories to each error; sometimes a token contains several errors, such as a morphological error and an orthographic error, and in such cases it should be possible to assign both. Errors can be categorised according to many criteria – the exponent (one token, multiple tokens, etc.), the grammatical level (syntax, morphology, register, etc.), and many more. Which categorisation scheme and level of granularity is chosen depends on the research purpose (see Section 2.5.2).

2.5 Error marking

In Sections 2.3 and 2.4 we have argued that error analysis is an interpretation of the primary data on many levels:

1. The identification of an error always depends on a target hypothesis. There can be more than one target hypothesis for any given learner utterance.
2. Even with the same target hypothesis there can be many different descriptions of an error. Error categories depend on the research question, the grammatical model, etc.
3. There can be several explanations for each description.

Error annotation schemes differ widely with respect to what counts as an error, the format of error coding, scope, depth of analysis, etc. This is, of course, mainly due to the different research questions (which lead to different categorisations). In the following we will compare some of the more common strategies found in existing error annotation schemes. Rather than aiming for a comprehensive list of existing systems, we will concentrate on the underlying conceptual issues.

2.5.1 Target hypotheses

As stated in Section 2.4, the identification and categorisation of errors depend on an implicit or explicit target hypothesis. In this section we want to show how target hypotheses can be made explicit and what can be done if there are several competing target hypotheses. This pertains to the error exponent (sometimes also called extent of an error, or error domain) as well as to the error category (sometimes called error tag).

Clear-cut grammar errors seem easiest to deal with. Consider example (5) from the Falko corpus.7

(5) dass eine Frau zu Hause bleibt, um sich um den Kindern und dem Haus zu kümmern [fk008_2006_07_L2v2.4]
    'that a woman stays at home in order to care for the children-DAT and the house-DAT'

7 The Falko (Fehlerannotiertes Lernerkorpus) corpus contains essays written by advanced learners of German as a foreign language (Lüdeling et al. 2008).

The coordinated noun phrase den Kindern und dem Haus is in dative case while the verb kümmern subcategorises for an um-PP, which itself subcategorises for accusative case. In this sentence, the description of the error seems clear – we would analyse it as a case error or a subcategorisation error. However, what is the error exponent? Is it the coordinated noun phrase den Kindern und dem Haus? Or do we assume two errors – one for each of the conjuncts of the noun phrase? Our decision involves assumptions about the syntactic structure of the sentence with regard to case marking and coordination. Are these the same assumptions that we would make in example (6), which is analogous in many respects (we are here only interested in the sequence dem Führerschein und das Fahrenlernen and ignore the other problems in the sentence)? We have a verb that subcategorises for a certain PP (mit and dative) and a subcategorised coordinated noun phrase. The learner uses the correct form dem Führerschein but the incorrect das Fahrenlernen within one coordinative structure. Here the error exponent cannot be the coordinated noun phrase and it is much less clear what type of error we are seeing.

(6) […] kann auch mit dem Führerschein und das Fahrenlernen eines PKS verglichen werden. [cbs003_2006_09_L2v2.4]
    '… could be compared to a driver's license-DAT and learning-NOM/ACC to drive a car.'

Whichever way we decide, it becomes clear that if the analysis is not made explicit, there is always the danger that such problems are (inadvertently) overlooked and that parallel cases in the data are treated differently. This has consequences for the error count and the final analysis (see Lüdeling 2008 for an experiment).

Often it is impossible to give one clear target hypothesis. Consider example (7) (taken from Weinberger 2002: 30). Here we see a number mismatch between the subject and the verb, which should agree. In a target hypothesis either the subject or the verb could be changed, see (8). Depending on which target hypothesis we choose, we might have a subject number error or a verb number error.

(7) Jeder werden davon profitieren.
    'Each-SG will-PL profit from this.'

(8) LU    Jeder werden davon profitieren.
    TH 1  Jeder ('everyone-SG') wird ('will-SG') davon profitieren.
    TH 2  Alle ('everyone-PL') werden ('will-PL') davon profitieren.

In this and the examples that follow: LU = learner utterance, TH = target hypothesis.

There are several possibilities in cases like (7). One possibility is to look at the context and decide whether there is a cue in it that points to one or the other option. This might often be feasible but there are two problems. First, the option that the annotator chooses might be influenced by his or her research interest; he or she might even see only one of the options. Second, if analogous cases are sometimes resolved one way and sometimes resolved in a different way, it is impossible to do a systematic search. Better alternatives for handling such cases are to either consistently resolve them in the same way (say, always change the subject, independent of context) or to give them an abstract mismatch tag (here: subject–verb agreement mismatch).

While grammatical (or overt) errors may be difficult to analyse, appropriateness errors are even more challenging. Consider example (9) from the International Corpus of Learner English, ICLE (Dagneaux et al. 2005).

(9) It sleeps inside everyone from the start of being, it just waits for opportunity to arose and manifest itself. (ICLE-CZ-PRAG-0018.2)

Here the learner is writing about negative traits like greed, stating that they are present in every human being and emerge in situations which support them. The utterance It sleeps inside everyone from the start of being is not ungrammatical in a strict sense, but it is still not quite idiomatic. The sequence from the start of being is not likely to be used by 'the speakers' native counterparts' (Lennon 1991: 182). Expressions like since birth or from the beginning sound more native-like in this situation. The first decision is whether to mark this as an error at all – it is easy to see that 'idiomaticity' is gradual (Lennon's phrase 'in all likelihood' reflects that). Example (10) shows three possible target hypotheses for the first part of (9). Again, the error exponent (as well as the resulting error description) differs, but these target hypotheses also differ in abstractness. Target Hypotheses 1 and 2 provide an alternative wording, while Target Hypothesis 3 is more abstract and says only that this part of the learner utterance is unidiomatic, conflating an implicit target hypothesis with an error tag (the annotator can only know that this expression is unidiomatic if he or she knows a more idiomatic expression).

(10) LU    it sleeps inside everyone from the start of being
     TH 1  it sleeps inside everyone since birth
     TH 2  it sleeps inside everyone from the beginning
     TH 3  it sleeps inside everyone UNIDIOMATIC

Different target hypotheses are not equivalent; a target hypothesis directly influences the following analysis. The Falko corpus consistently has two target hypotheses – the first one deals with clear grammatical errors and the second one also corrects stylistic problems.

(11) Dependance on gambling is something like dependance on drugs (…) (ICLE-CZ-PRAG-0013.3)

The need for such an approach becomes clear in (11). The learner utterance in (11) contains a spelling error: the two occurrences of dependance have to be replaced by dependence. From a more abstract perspective, the whole phrase dependence on gambling sounds unidiomatic if we take into account that the learner wants to refer to a specific kind of addiction. Similarly, dependence on drugs appears to be a marked expression as opposed to drug addiction. An annotation that takes this into consideration has to separate the description into an annotation of the spelling error and an annotation of the stylistic error, so that neither piece of information is lost. Example (12) illustrates this:

(12) LU    Dependance on gambling
     TH 1  Dependence on gambling
     TH 2  Gambling addiction

The examples in this section show how important the step of formulating a target hypothesis is – the subsequent error classification critically depends on this first step. In order to operationalise this first step of the error annotation, one can give guidelines for the formulation of target hypotheses, in addition to the guidelines for assigning error tags, which also need to be evaluated with regard to consistency (see Section 2.6). The problem of unclear error identification has been discussed since the beginnings of EA. Milton and Chowdhury (1994) already suggested
that sometimes multiple analyses should be coded in a learner corpus. If the target hypothesis is left implicit or there is only one error analysis, the user is given an error annotation without knowing against which form the utterance was evaluated. In early corpora (pre-multi-layer, pre-XML) it was technically impossible to show the error exponent because errors could only be marked on one token. In corpora that use an XML format it is possible to mark spans, and target hypotheses are sometimes given in the XML mark-up. Only in standoff architectures, however, is it possible to give several competing target hypotheses. Examples of learner corpora with consistent and well-documented (multiple) target hypotheses are the Falko corpus, the trilingual MERLIN corpus (Wisniewski et al. 2013) or the Czech as a Second Language corpus (Rosen et al. 2014).

2.5.2 Error tagsets

Error annotation systems8 assign categories to errors. As stated above, the types and granularity of the error categories depend on the research question. Technically, there is an error every time the learner utterance differs from the target hypothesis. The error tag describes the type of the error within a given error annotation scheme. There are systems that annotate errors on all grammatical levels, and there are systems that tag only one specific type of phenomenon such as, for example, errors pertaining to the marking of modality or tense. Some error annotation systems assign grammatical, lexical or other linguistic error categories. Consider example (13), which covers the second half of the sentence in (9), and example (14). Both examples contain target hypotheses which provide grammatical structures for the respective ungrammaticalities in the original learner utterances. The errors in the learner utterances are made visible by the points where the target hypotheses deviate from the original learner forms. The analysis of these errors is provided in two different ways. The first we call edit-distance-based error tagging, a form-based description of the edit operations that have to be performed in order to generate the target form out of the original learner form. The second error tag is a linguistic interpretation of the deviation. Edit-distance-based annotation schemes consist of categories like 'change', 'delete', 'insert', sometimes 'move-source', 'move-target', or what Lennon (1991: 189) calls errors of substitution, over-suppliance, omission or permutation. Once a target hypothesis is given, this can be done automatically (a minimal illustration follows example (14) below). While a distance-based annotation scheme might not look very interesting in itself, it can become very useful in combination with other layers of annotation, such as part of speech or lemma (for an example, see Reznicek et al. 2013). One could then find all cases where an article was inserted or deleted.9 Distance-based systems are often only the first step in the analysis – it is possible to add linguistically motivated error types on further annotation layers.

8 They are sometimes called error taxonomies, a term we avoid because many of them are not taxonomies in a technical sense.

9 There is another equally likely target hypothesis where an is changed into some. This would yield different error tags.

(13) LU    It just waits for opportunity to arose and manifest itself
     TH    It just waits for DET opportunity to arise and manifest itself
     D-BT  INSERT (DET before opportunity); CHANGE (arose → arise)
     L-BT  missing article; inflection // orthography

D-BT = distance-based tagging, L-BT = linguistically based tagging

(14) LU    He tries to get an information about this profession
     TH    He tries to get information about this profession
     D-BT  DELETE (an)
     L-BT  superfluous article
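As noted above, once a target hypothesis is given, the distance-based tags can be derived automatically from an alignment of the two token sequences. The following minimal sketch – not the pipeline of any existing corpus project, and assuming simple whitespace tokenisation – does this with Python's standard-library difflib:

```python
# Minimal sketch of automatic distance-based error tagging: align the
# learner utterance (LU) with the target hypothesis (TH) and map the
# alignment operations onto the tags CHANGE / DELETE / INSERT.
import difflib

def distance_tags(lu, th):
    lu_tok, th_tok = lu.split(), th.split()
    matcher = difflib.SequenceMatcher(None, lu_tok, th_tok)
    tags = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "replace":   # a token form was changed
            tags.append(("CHANGE", lu_tok[i1:i2], th_tok[j1:j2]))
        elif op == "delete":  # superfluous in LU, absent from TH
            tags.append(("DELETE", lu_tok[i1:i2], []))
        elif op == "insert":  # missing in LU, supplied by TH
            tags.append(("INSERT", [], th_tok[j1:j2]))
    return tags

# Example (14): the superfluous article receives a DELETE tag.
print(distance_tags("He tries to get an information about this profession",
                    "He tries to get information about this profession"))
# -> [('DELETE', ['an'], [])]
```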

Linguistically based tagging systems interpret the difference between the learner utterance and the target hypothesis with respect to a given grammatical or pragmatic model. For the arose case in example (13) they could use a tag for orthographic errors or for inflectional errors, or use an ambiguous or 'undecided' tag. Díaz-Negrillo and Fernández-Domínguez (2006) discuss different linguistically motivated error-tagging systems for learner corpora, pointing out that 'the way the linguistic information is organised in taxonomies varies from system to system' (2006: 93). Often the error tags are conflated with part-of-speech or word-class information. As just one example, consider how sentences (13) and (14) would be tagged according to the ICLE 'Error Tagging Manual' (Version 1.2; Dagneaux et al. 2005). Errors there are divided into eight major categories: 'form'; 'grammar'; 'lexico-grammar'; 'lexis'; 'word redundant', 'word missing', 'word order'; 'punctuation'; 'style'; 'infelicities'. In 'grammar', 'lexico-grammar' and 'lexis', word classes are referred to explicitly, which means that the interpretation of many learner errors directly depends on the part of speech involved. Other errors depend on morphological, graphemic, word-placement or stylistic problems, regardless of a specific word class (e.g. 'word order'). For missing elements like the article in (13) or superfluous elements like the article in (14) there is no intuitive way to anticipate whether to assign the error to the word class involved (article) or to the fact that an element is missing or redundant, but the manual tries to provide an unambiguous solution for all errors. According to Dagneaux et al. (2005), the missing article in (13) receives the error tag 'GA' (for 'grammar-article'), the superfluous article in (14) receives the tag 'XNUC' (for 'lexico-grammar-noun, uncountable/countable'), and the error in the form arose in (13) is tagged as 'GVM' for 'grammar-verb-morphology'.

2.6 Evaluation of error annotation

The usefulness of error-annotated corpora depends on the consistency of the annotation. Error annotation in learner corpora is mostly done manually, and in this section we are concerned with the evaluation of manual annotation (see Chapters 25 and 26, this volume, for more on automatic annotation and evaluation). Evaluation of annotation reliability is always necessary, but because there are so many potentially controversial decisions to make in error annotation (target hypothesis, tagset, error assignment, etc.), evaluating the annotation is both especially difficult and especially crucial. Evaluation of annotation is done in one of two ways: either one has a gold standard (a corpus with annotations deemed to be correct) and evaluates the annotation against it,10 or one has several annotators annotate the same subcorpus using the same tagset and guidelines and evaluates how often and where they agree (called inter-annotator agreement, inter-rater reliability, or inter-coder reliability, see Carletta 1996; Artstein and Poesio 2008). Evaluation is a necessary step in assuring the consistency of annotation. It shows which categories are clearly defined and can be assigned unambiguously, and which categories or guidelines are unclear and therefore assigned inconsistently. Evaluation is typically an iterative process – guidelines are reformulated after an evaluation, evaluated again, etc., until the result is consistent.

10 The standard measures here are recall, precision, and the f-measure; these are found in all statistics introductions, see, e.g., Baayen (2008) or Gries (2009).
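The following toy sketch – invented annotations, with tag labels loosely modelled on the ICLE codes discussed in Section 2.5.2 and 'OK' marking tokens judged error-free – illustrates both evaluation modes: precision, recall and f-measure against a gold standard, and Cohen's kappa as one common chance-corrected agreement statistic:

```python
# Toy sketch of the two evaluation modes: precision/recall/f-measure
# against a gold standard, and Cohen's kappa for two annotators.
# All annotations below are invented for illustration.
from collections import Counter

def precision_recall_f(predicted, gold):
    """predicted, gold: sets of (token_index, error_tag) decisions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def cohen_kappa(ann1, ann2):
    """ann1, ann2: parallel lists of tags for the same tokens."""
    n = len(ann1)
    observed = sum(a == b for a, b in zip(ann1, ann2)) / n
    c1, c2 = Counter(ann1), Counter(ann2)
    expected = sum(c1[t] / n * c2[t] / n for t in set(c1) | set(c2))
    return (observed - expected) / (1 - expected)

gold = {(1, "GA"), (3, "GVM"), (9, "GA")}
pred = {(1, "GA"), (5, "GA"), (9, "GA")}
print(precision_recall_f(pred, gold))   # (0.67, 0.67, 0.67) on this toy data

annotator1 = ["OK", "GA", "OK", "GVM", "OK", "OK", "XNUC", "OK", "OK", "GA"]
annotator2 = ["OK", "GA", "OK", "GVM", "OK", "GA", "OK", "OK", "OK", "GA"]
print(round(cohen_kappa(annotator1, annotator2), 2))  # 0.65
```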

2.7 The main uses of error annotation

There are many studies that use error-annotated corpora for interlanguage research.11 In Section 3 we highlight three studies using error annotation. At this point, we want to give an overview of the general types of error studies. Qualitative studies that focus on a small number of errors by single learners or a few selected learners are often the first step in the error analysis and give rise to hypotheses that can then be tested quantitatively. The study by Brand and Götz (2011) is a good example of this. Brand and Götz, using the error-tagged German component of the Louvain International Database of Spoken English Interlanguage (LINDSEI-GE), which is made up of spoken learner English produced by speakers with L1 German, investigate different properties of fluency and accuracy. They provide overall statistics for all learners but, crucially, they also include a detailed study of five selected learners, which leads to a deeper understanding of the interaction.

11 There are many learner corpora with (partial or complete) error annotation – here we list only a few examples: for L2 English the International Corpus of Learner English (ICLE, Granger 2003a), the Hong Kong University of Science and Technology Corpus (HKUST, Milton and Chowdhury 1994) and the Cambridge Learner Corpus (CLC, Nicholls 2003); for L2 French the French Interlanguage Database (FRIDA, Granger 2003b); for L2 German Falko (Lüdeling et al. 2008); for L2 Czech the Acquisition Corpora of Czech (AKCES, Hana et al. 2012); for L2 Norwegian the Andrespråkskorpus (ASK, Tenfjord, Meurer and Hofland 2006); and for L2 Arabic the Pilot Arabic Learner Corpus (Abuhakema et al. 2008).

Early studies in EA sometimes counted raw error frequencies in order to find out which linguistic phenomena seemed to be especially difficult for learners. Diehl et al. (1991), for instance, motivate their corpus-based contribution on the acquisition of the German declension system with the observation that inflectional errors within the noun phrase are by far the most frequent errors among different learner groups. Error annotations are just like other linguistic annotations, and studies using error categories can follow standard corpus-linguistic methodology. This can go far beyond the simple exploratory study by Diehl et al. (1991) into statistical hypothesis testing, multivariate analysis and modelling (for an overview, see Gries 2008a). We cannot explain the different methods in detail here but want to structure our overview according to a simple distinction between two types of quantitative corpus studies introduced in Biber and Jones (2009) – 'type-A studies' (pp. 1291ff.) and 'type-B studies' (pp. 1298ff.).

Type-A studies focus on one linguistic phenomenon in one corpus with the aim of understanding how different variants are distributed. Biber and Jones give the example of subordinate clauses in English, which can be introduced by that or by nothing (he thinks that she should smile more often vs he thinks she should smile more often). The goal of a type-A study is to identify linguistic and extra-linguistic features that influence the choice between the variants. In their study they find that a combination of register, the frequency of the embedding verb, and the nature of the subject in the subordinate clause plays a role. In that sense, type-A studies are detailed analyses of linguistic behaviour. Because the choice of variant is typically influenced by many factors, some of which interact, they are often described using multifactorial models.

Type-A studies offer an important gain in error-based language research. Rather than looking only at errors, type-A studies can choose a linguistic phenomenon (e.g. articles in noun phrases) and count all errors (missing articles, wrong articles) as well as all correct instances of article placement in a corpus. The relation between erroneous structures and correct structures of the same type allows more meaningful and precise conclusions than the observation of errors without a comparison with correct cases.

Type-B studies, on the other hand, compare (counts of) categories across different corpora. Most recent learner corpus studies compare two (or more) learner corpora or a learner corpus and a native speaker corpus; in learner corpus research this is called Contrastive Interlanguage Analysis (CIA), see Granger (1996) and Chapter 3 (this volume). Learner language is multifaceted, which leads to multi-method study designs (see Tono 2004; Gries 2008a; Brand and Götz 2011, and many others). For many research questions, type-A studies and type-B studies are combined, and often the analysis of error categories is combined with the analysis of other categories in the corpus.

Type-A studies are typically driven by function, not form – the basic idea being that there are several ways to express the same function. In that sense, they are variationist in nature, although their purpose might be different from that of other (sociolinguistic or diachronic) variationist studies (cf. Labov 1978, 2004). Type-B studies compare different corpora – either different L2 corpora or a learner corpus and a native speaker corpus. Granger (2002) suggests using multiple comparisons, involving different L2s as well as L2 vs L1, in order to tease apart the different influences on learner language. Many type-B studies do not involve error tags but compare either lexical forms or annotation categories; the studies reported on here do, of course, use error annotation. Type-B studies can be cross-sectional or longitudinal.

Ideally, the corpora that are being compared differ only in one extra-linguistic variable, such as L1 or proficiency level, while all other external variables are kept stable. As a result, quantitative differences between the measured categories in the different corpora can be interpreted as an effect of the extra-linguistic variable itself. The ICLE subcorpora, for example, are collected according to the same criteria except for the learner's L1 and can be compared to identify L1 influences.12

12 This is an idealisation. The teaching method, previously studied languages, and possibly situational parameters in the collection may also differ.

In Section 2.3, we briefly sketched acquisition models that hypothesise that certain types of errors are typical, and possibly necessary, for a given acquisition stage. These hypotheses would have to be verified in longitudinal studies. Genuine longitudinal studies compare the same learners across different acquisition stages (ideally using the same or at least comparable tasks). Such genuine longitudinal corpora are rare and are therefore sometimes replaced by quasi-longitudinal corpora in which different learner groups with different proficiency levels are investigated (see Chapter 17, this volume). A recent quasi-longitudinal study based on ICLE is Thewissen (2013), which aims at measuring the development of language acquisition by comparing error tags, annotated according to the ICLE guidelines, across different proficiency levels. In order to assess proficiency levels, the learner texts (223 essays) were rated by professional raters according to precise rating guidelines.

We want to briefly mention one problem that pertains to all type-B studies and is often ignored. Learner corpora are typically collections of texts by different learners. The corpus design specifies a number of external variables such as L1, level of proficiency, text type or mode of acquisition, and the texts within the collection are then treated as a homogeneous corpus. Put in statistical terms: all texts are seen as samples from the same population. This implies that the internal grammars of all the people who contribute to the corpus follow the same system. The corpus is then compared to another corpus which differs in (at least) one design parameter. This is statistically problematic whenever the within-group variation is too high or when there are clusters within the corpus, because then the samples cannot stem from one population. At the very least, variances should be reported, but often it might be necessary to calculate a model that takes such effects into account (see Evert 2006; Gries 2009 for more on this issue).

There are studies which go a step beyond classical type-A or type-B studies and combine both types in order to find out which categories are overused or underused by learners compared to native speakers. Good examples are the two papers reported in Section 3: Maden-Weinberger (2009) (Section 3.1) and Díez-Bedmar and Papp (2008) (Section 3.2). These studies use variationist designs which are able to analyse the overuse and underuse not only of forms but also of functions. For studies like these, one needs a corpus with (often quite specific) error tags as well as grammatical information like part-of-speech tags or even syntactic annotation (compare Hirschmann et al. 2013 for a study on the use of modification by learners of L2 German which uses a parsed corpus). Carefully done, type-A studies or combined studies are one way to avoid the comparative fallacy. Studies that start with functions instead of forms are difficult to carry out because they have to find all places in a corpus where the function under consideration is used, which is a matter of interpreting the data. How hard this problem is depends on the phenomenon: it is not problematic to find all nouns in a corpus and see whether they are preceded by an article, but it is very difficult to see which sentences 'require' modality. Again, standoff architectures would be helpful, and it is good practice to make the complete information available.

Type-A and type-B studies use error annotation to investigate acquisition processes: the learner corpora are annotated independently of the learner, and the learner never sees the annotation. Such acquisition studies feed back into teaching only indirectly. But error annotation can be and has been used directly in teaching. In many settings, a learner is allowed to produce drafts of an assignment which are commented on by a teacher, corrected by the student, resubmitted, etc., until the final version is turned in for grading (see Burstein et al. 2004). A number of very interesting corpora have developed from this setting. These are treated in more detail in Chapters 20 and 22 (this volume), but we will sketch one such corpus (Section 3.3) in order to illustrate the possibilities after we have introduced two representative type-A studies.
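To make the quantitative core of such an overuse/underuse comparison concrete, here is a minimal sketch with invented counts (not data from any of the studies cited in this chapter). It asks whether an error category is relatively more frequent in a learner corpus than in a reference corpus, using the log-likelihood ratio G2 computed over a 2 × 2 frequency table:

```python
# Sketch of a type-B overuse test (all counts invented): is error
# category X relatively more frequent in a learner corpus than in a
# reference corpus? G2 is computed over the 2 x 2 table of hits vs
# non-hits and compared to the chi-square distribution with df = 1.
import math

def log_likelihood_g2(k1, n1, k2, n2):
    """k = error tokens, n = corpus size in words; assumes 0 < k < n."""
    p = (k1 + k2) / (n1 + n2)          # pooled error rate under H0
    def ll(k, n, rate):
        return k * math.log(rate) + (n - k) * math.log(1 - rate)
    return 2 * (ll(k1, n1, k1 / n1) + ll(k2, n2, k2 / n2)
                - ll(k1, n1, p) - ll(k2, n2, p))

# 312 article errors in 200,000 learner words vs 95 in 250,000 native words
g2 = log_likelihood_g2(312, 200_000, 95, 250_000)
print(round(g2, 1))  # well above 3.84, the 5% critical value for df = 1
```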

3 Representative studies

3.1 Maden-Weinberger, U. 2009. Modality in Learner German: A Corpus-Based Study Investigating Modal Expressions in Argumentative Texts by British Learners of German. Unpublished Ph.D. thesis, Lancaster University.

A typical variationist study is reported in Maden-Weinberger (2009), which analyses the expression of modality by learners of German as a foreign language with L1 English. Maden-Weinberger aims to analyse to what extent modality is expressed differently by these learners in comparison to German native speakers, and which words, structures or morphemes are difficult to acquire. To achieve this objective, she collects, annotates and compares the Corpus of Learner German (CLEG), consisting of argumentative essays written by English learners of German as a foreign language; a comparable German native speaker corpus (KEDS, Korpus von Erörterungen deutscher Schüler), collected from German secondary school students; a native English corpus (LIMAS, Linguistik und Maschinelle Sprachbearbeitung); and a German–English translation corpus (INTERSECT, International Sample of English Contrastive Texts). The last two corpora are used to compare the L2–L1 differences with general differences in expressing modality in native English and German. The investigation focuses on a variable or function – the expression of epistemic and deontic modality – and starts with defining possible forms to express this function, such as modal verbs, adverbs and adverbials, or subjunctive mood. Maden-Weinberger annotates all forms functioning as epistemic or deontic modal expressions that can be found in her corpus data. In addition, she tags all errors regarding modality. In this way, she can see which forms are used by learners and which of these seem especially difficult. She shows that the learners in her study use different modal expressions than the German native speakers in the comparable native speaker corpus: while the learners avoid (underuse) modal verbs, they overuse various adverbial words and phrases to express epistemic modality, although German modal verbs like werden 'will' are generally overused by the learners. She concludes that the epistemic function that modal verbs in German can have is especially hard for English learners to acquire.

3.2 Díez-Bedmar, M. B. and Papp, S. 2008. 'The use of the English article system by Chinese and Spanish learners', in Gilquin, G., Papp, S. and Díez-Bedmar, M. B. (eds.), Linking up Contrastive and Learner Corpus Research. Amsterdam: Rodopi, pp. 147–75.

Díez-Bedmar and Papp (2008) analyse the use of articles in English by two learner groups whose L1s differ with respect to articles and the marking of definiteness: Chinese and Spanish. While Chinese does not use articles at all, the Spanish article system is similar to the English one, with rather subtle differences regarding pragmatic aspects. The authors analyse the misuse of articles by Chinese and Spanish learners of English from a language-transfer perspective, with the hypothesis that both learner groups will produce pragmatic article errors, while Chinese learners will also produce strictly grammatical article errors. In an extensive corpus study, student essays of Chinese and Spanish learners of English as a Foreign Language are collected and analysed. Again, an error analysis is combined with a CIA. Different semantic features that influence the type of article, such as genericity, definiteness and specificity of the respective noun phrase, are annotated and taken into account in the statistical analysis. The CIA shows that the Chinese learners grammatically avoid (underuse) articles in contrast to Spanish learners of English. The error analysis shows that the Chinese learners produce more errors in all semantic contexts (except for erroneous zero articles in generic uses) and that specific pragmatic contexts are more difficult for both learner groups (indefinite articles in generic contexts show the least accurate uses for both groups).

Both exemplary studies show that it can be very helpful to study one phenomenon in detail and annotate additional information (such as the type of modality or the factors influencing article choice), rather than working with a general tagset that aims at addressing many different types of errors at once (see Meunier 1998 for a discussion of the granularity of tagsets). The studies also show that error analysis can be fruitfully combined with other methods of analysis such as CIA.

3.3 Lee, J., Yeung, C. Y., Zeldes, A., Reznicek, M., Lüdeling, A. and Webster, J. 2015. 'CityU corpus of essay drafts of English language learners: A corpus of textual revision in second language writing', Language Resources and Evaluation. doi: 10.1007/s10579-015-9301-z.

Our third case study concerns the use of error annotation in teaching. In a corpus of academic L2 English collected at the City University of Hong Kong, students were allowed to submit as many drafts as they wanted before turning in the final assignment. The teachers commented on each submission and mostly used error tags from a predefined error tagset. Each student then made the corrections he or she wanted to make and resubmitted the text. Submissions and teacher feedback were collected and stored as a parallel corpus so that each version of the text can be compared with the other versions. Example (16) is a corrected version of the sentence in (15). Note that the student corrected the article (the program → a program) after getting an error code 'wrong article'. The verb inflection (have improve → have been improved) was changed but is still not correct.

(15) I learned the function of the Visual Basic and due to the debugging of the program, I have improve my understanding in the structure of the program code. it would be useful at next time I write the program.

(16) I learned the function of the Visual Basic and due to the debugging of the program, I have been improved my understanding at the structure of the program code. It will be useful next time I write a program.

Corpora like these can be used to assess what effect an error code has on the student and how many errors are actually corrected and how they are corrected. Studies that use error-annotated corpora in this way include Wible et al. (2001) and O’Donnell (2012a).

4 Critical assessment and future directions

Error-annotated corpora make it possible to investigate whether, for example, learners of L2 German underuse modification or whether they just underuse specific types of modifiers, how speech rate and accuracy interact in L2 English, and which errors appear and disappear in which stage of acquisition. We believe that there are still many open methodological and conceptual issues in the study of learner corpora. Some of these (corpus design, acquisition processes, statistical modelling) are discussed in other chapters of this handbook. The open issues that pertain to error annotation include reproducibility and replicability, the interpretation of errors, and the combination of error studies with other learner data.
If learner corpora are available with error annotation, and if the error annotation is done in a transparent and clear way, error studies become reproducible and results become replicable.

In this chapter we showed that error annotation always depends on an interpretation of the data. Bley-Vroman (1983) warns against analysing one variety (the learner language) through the eyes of another variety (the 'standard' of the target language) and says that the properties of learner language can only be understood if the L2 variety is studied as a genuine variety in and of itself. While Bley-Vroman is certainly right, it is also clear that many of the properties of learner language can only be understood if learner language is compared to target language structures. The first step in error annotation is the reconstruction of a target grammar utterance – called target hypothesis – against which the learner utterance is evaluated. There can be many such target hypotheses for a given learner utterance, and whichever one is chosen in a given corpus influences the error exponent and the error tag – and the analysis that follows from these. We argued that it is useful to make the target hypothesis explicit and to use a corpus architecture that allows multiple target hypotheses. Error annotation studies often combine the evaluation and counts of the error tags with other corpus information such as part-of-speech tags, syntactic analysis or statistical patterns of lexical information within the corpus. Most of these studies combine several methods, which helps minimise the comparative fallacy.

Another interesting issue is that of the standardisation of error tagsets. While some degree of standardisation in terms of edit-distance-based tagging (as described above) is useful, there is some doubt as to the desirability of more fine-grained standardisation. The scope and granularity of an error tagset depend on the phenomenon to be studied and the research question to be answered. In flexible corpus architectures it is possible to add one or several layers for the fine-grained analysis of a given error type or phenomenon. The development of an error tagset for a given phenomenon can be viewed as the most important step in understanding it (and is thus an integral and necessary part of research).

Because it involves so many decisions, manual error annotation is time consuming. In the future we will see more and more semi-automatic and automatic methods for error annotation (see Chapter 25, this volume). It is especially important to test and report the reliability of error annotation, be it manual or automatic. There are different ways of testing reliability: manual annotation is typically tested by comparing the decisions made by two or more annotators (inter-annotator agreement, inter-rater reliability, see Section 2.6), while automatic annotation is typically evaluated by comparing its output with a manually annotated gold standard. In the future, we will also see even more statistical modelling of errors and other properties in learner corpora (see Chapter 8, this volume). This corresponds to the trend in grammar and acquisition models – away from categorial, algebraic models towards probabilistic, usage-based models.

Key readings

Corder, S. P. 1967. 'The significance of learner's errors', International Review of Applied Linguistics in Language Teaching 5(1–4): 161–70.
This paper is the first and still very useful approach to integrating the notion of learner errors into a comprehensive analysis of learner language and theory of second language acquisition. It contains the basic concepts of linguistic errors that were discussed in this chapter, argues for the necessity of errors in the language acquisition process, and discusses similarities and differences between the first language and second language acquisition process.

Lennon, P. 1991. 'Error: Some problems of definition, identification, and distinction', Applied Linguistics 12(2): 180–96.
This article defines and discusses basic concepts in error analysis and fundamental distinctions of error types. This is exemplified by a learner corpus study of advanced English learners.

Ellis, R. and Barkhuizen, G. 2005. Analysing Learner Language. Oxford University Press.
This book introduces and discusses the essential methods for a comprehensive analysis of spoken and written learner language. Chapter 3 is dedicated to error analysis, providing a historical and theoretical background and leading the reader through the different steps of a state-of-the-art error analysis.

Díaz-Negrillo, A. and Fernández-Domínguez, J. 2006. 'Error tagging systems for learner corpora', Revista Española de Lingüística Aplicada 19: 83–102.
The article provides an overview of existing error taxonomies. Díaz-Negrillo and Fernández-Domínguez compare different learner corpora implementing error classifications and discuss the conceptual differences of the approaches.

Granger, S. 2008b. 'Learner corpora', in Lüdeling, A. and Kytö, M. (eds.), Corpus Linguistics. An International Handbook. Volume 1. Berlin: Mouton de Gruyter, pp. 259–75.
In this article, Sylviane Granger explains the basic methodology of using learner corpora in the study of second language acquisition. Alongside other essential methods in learner corpus research, she describes different aspects of error annotation and how it can be used in acquisition studies, in computer-assisted language learning and in teaching.

Dagneaux, E., Denness, S. and Granger, S. 1998. 'Computer-aided error analysis', System 26(2): 163–74.
The paper explains how EA problems can be overcome by using error-annotated corpora, introducing data from the International Corpus of Learner English and the error tagset used in the corpus.

Reznicek, M., Lüdeling, A. and Hirschmann, H. 2013. 'Competing target hypotheses in the Falko corpus: A flexible multi-layer corpus architecture', in Díaz-Negrillo, A., Ballier, N. and Thompson, P. (eds.), Automatic Treatment and Analysis of Learner Corpus Data. Amsterdam: Benjamins, pp. 101–23.
The authors argue for explicit and multiple target hypotheses in a multi-layer corpus architecture. In addition to the methodological problem of deciding on one target hypothesis, they show that different target hypotheses (and, based on these, different error tags) highlight different types of errors.

8 Statistics for learner corpus research

Stefan Th. Gries

1 Introduction Over the last decades, second/foreign language acquisition (S/FLA) has become an ever larger, more diverse, and more productive discipline. This evolution notwithstanding, for most of that time SLA research seems to have favoured experimental and introspective data over the exploration or analysis of corpus data (cf. Granger 2002:  5). Fittingly, Mackey and Gass (2005), for example, devote not even two pages to the topic of corpora in learner corpus research (LCR) (in Chapter 3, which is nearly sixty pages long), and in Chapter  9, which covers quantitative methods of analysis, corpus data play no role (the later Mackey and Gass (2012) includes a chapter on LCR, however). Similarly, Tyler (2012) discusses many experimental results in great detail but summarises a mere handful of corpus studies. Despite this neglect, corpus data have now become a major source of data in S/FLA research, both on their own and in combination with experimental data. This is in particular due to the increasing availability of corpora of learner language (most of them on learner English), which offer researchers the opportunity to study a wide range of questions regarding: • • •

• how learners from different mother-tongue (L1) backgrounds use English in speaking and writing
• how the use of English by learners with a particular L1 differs from that of learners with other L1s
• how the use of English by learners differs from that of native speakers.

However, corpora contain nothing but frequency data: they reveal whether a linguistic element x does or does not occur in a corpus (n_x > 0 or n_x = 0), whether x occurs in a part a of a corpus (e.g. a register, dialect, variety, speaker group) or not (n_(x in a) > 0 or n_(x in a) = 0), whether x occurs


with y or not (n_(x and y) > 0 or n_(x and y) = 0). Thus, whatever a (corpus) linguist is interested in needs to be (i) operationalised in terms of frequencies of (co-)occurrence and (ii) analysed with the tools of the discipline that deals with quantitative data, statistics. In this chapter, I will survey the ways in which corpus-based research in SLA has utilised, or has yet to utilise, statistical methods. While I will attempt to cast a wide net and cover a variety of different approaches and tools, this survey can, of course, only be selective.
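To make the operationalisation step concrete, the following Python snippet counts occurrence and co-occurrence frequencies in a toy corpus; the sentence and all counts are invented for illustration:

```python
from collections import Counter

tokens = "i think that this is quite good but i think we can do better".split()

# Occurrence frequencies n_x for every type x ...
freq = Counter(tokens)

# ... and co-occurrence frequencies n_(x and y), here for adjacent word pairs
bigrams = Counter(zip(tokens, tokens[1:]))

print(freq["think"], bigrams[("i", "think")])  # -> 2 2
```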

2 Core issues

2.1 Statistical methods in LCR

The simplest kind of statistics in (corpus) linguistics is general descriptive statistics, i.e. statistics describing some state of affairs in the data. The most frequent ones in LCR include:

• frequencies of occurrence of linguistic elements as observed frequencies, as normalised frequencies (per cent, per thousand words, per million words), as ranks of such frequencies, or as statistics computed from such frequencies (e.g. type–token ratios, vocabulary richness/growth statistics)
• frequencies of co-occurrence or association measures that do not involve statistical significance testing, like mutual information (MI) or odds ratios; such measures quantify the association of one linguistic item (typically a word) to another (typically a word or a syntactic pattern/construction), in which case we talk about collocation or colligation/collostruction, or the association of a word to one of two corpora (which is what, statistically, the method of keywords boils down to)
• measures of central tendency such as means or medians
• dispersion measures, which should accompany averages, such as standard deviations, standard errors, median absolute deviations, or interquartile ranges
• correlation measures such as Pearson's r or Kendall's τ.

Second, there are tools from the domain of inferential statistics, in the form of statistical tests returning p-values (which determine how likely it is that an obtained result is due to chance variation alone) or in the form of confidence intervals (providing likely ranges into which observed results may fall); the most common ones involve:

• significance tests of two-dimensional frequency tables involving chi-square tests or, much more rarely, Fisher–Yates or similar exact tests
• association measures that do involve significance tests (e.g. the log-likelihood ratio G², z, t; see Evert 2009)


• significance tests for differences between measures of central tendency involving t-tests, U-tests or Kruskal–Wallis tests, as well as significance tests for correlations.¹

Currently, most of the statistics used in LCR are covered by these categories, but more advanced tools are available. First, in inferential statistics, there is the area of multifactorial regression modelling. A multifactorial regression is a statistical model trying to predict a dependent variable/response (often the effect in a hypothesised cause–effect relationship, either a numeric variable or a categorical outcome such as a speaker's choice of one of two or more ways of saying the same thing) on the basis of multiple independent variables/predictors (usually the potential causes of some effect), using a regression equation. Such a regression helps to quantify each predictor's significance and/or importance ('does this predictor help make the prediction more accurate or not, and how much so?') and direction of effect ('which of the possible outcomes does this predictor make more likely?'). Thus, a regression equation is little more than the mathematical way of expressing something such as If a possessor is animate and a possessee is inanimate, then the speaker is x times more likely to encode this relation with a possessive s-genitive than with an of-genitive. This type of approach – as well as its 'sister approaches' of classification trees and other classifiers – is extremely powerful in allowing researchers to investigate the impact of multiple predictors on a linguistic choice simultaneously, but it is still very much underutilised; this method and its advantages will be discussed in detail below.

Second, there is the area of multivariate exploratory tools, such as hierarchical cluster analysis, principal components analysis, correspondence analysis, multidimensional scaling, and others. These methods do not try to predict a particular outcome such as a speaker's choice on the basis of several predictors and typically do not return p-values from significance tests – rather, they find structure in variables with an eye to allowing researchers to detect groups of variables/expressions that are similar to each other but different from everything else. Such results can then either be interesting in their own right or inform subsequent (regression) modelling. I will discuss these techniques very briefly below, too. The next section will survey some studies that have utilised some of these methods.
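Before moving on, a brief illustration of the exploratory perspective may help. The sketch below clusters randomly generated, purely hypothetical frequency profiles of learner texts with hierarchical cluster analysis; no outcome is predicted and no p-value is returned:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(5)

# Hypothetical feature matrix: one row per learner text, one column per
# linguistic feature (e.g. normalised frequencies of connectors, modals)
profiles = rng.random((8, 12))

# Ward's method groups texts whose frequency profiles are similar to each
# other but different from everything else
z = linkage(profiles, method="ward")
print(fcluster(z, t=3, criterion="maxclust"))  # assign the 8 texts to 3 clusters
```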

2.2 Applications involving simple descriptive statistics

2.2.1 Frequency data

Just about every empirical learner corpus study reports some kind of frequency data.

¹ See Gries (2013a) for a detailed hands-on explanation of how these statistics are computed in the context of LCR and Gries (2013b) for statistics in linguistics in general.


Figure 8.1 Visualisation of the data in Hyland and Milton (1997: 189) [log frequency in corpus (y-axis, c. 5.5–8.0) against log rank of modality expression (x-axis, 0.0–2.0); plotted expressions include will, may, think, would, always, usually, know, in fact, actually and probably]

Consider, as a first example, Hyland and Milton (1997), who compare ways in which native speakers (NS) and non-native speakers (NNS), here Cantonese-speaking learners of English, express modality. Among other things, they report overall frequencies of expressions of epistemic modality, finding that NS and NNS exhibit considerable similarities of usage. In addition, they used sorted top-10 frequency lists of epistemic modality expressions for NS and NNS. While they do not dwell on this, especially for the NNS they find a very Zipfian distribution, i.e. a distribution that is highly typical of linguistic data, where a small set of the types (here, the top ten types) accounts for a large proportion (here 75%) of the total tokens. Their NNS data are visualised in Figure 8.1, with the log of the frequency of an expression on the y-axis, the log of the rank of the frequency of an expression on the x-axis, and the expressions plotted at their coordinates. Further statistics they report include normalised frequencies (per thousand words) for different grammatical ways of expressing epistemic modality (e.g. with modal or lexical verbs, adverbials, adjectives or nouns) or these frequencies grouped into different ability grades.
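Such Zipfian distributions are easy to inspect: on a log-log scale they approximate a straight line, and the slope of a least-squares fit quantifies how strongly the top few types dominate. A minimal Python sketch with invented frequencies:

```python
import numpy as np

# Hypothetical token frequencies of the ten most frequent modality
# expressions, sorted in descending order
freqs = np.array([1200, 640, 410, 300, 180, 150, 90, 70, 55, 40])
ranks = np.arange(1, len(freqs) + 1)

# Slope of log frequency regressed on log rank: the closer to -1 (and the
# better the linear fit), the more Zipfian the distribution
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(round(slope, 2))
```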


Table 8.1. Raw/normalised frequencies per million words (pmw) of quite (from Hasselgård and Johansson 2011: 46)

                 LOCNESS   ICLE-SP   ICLE-FR   ICLE-NO   ICLE-GE
Frequency             67        63        78        92       147
Frequency pmw        205       318       380       437       623

Table 8.2. Laufer and Waldman's (2011: 660) extended frequency data on V–N collocations

                    LOCNESS   ILCoWE:    ILCoWE:        ILCoWE:   Totals
                              advanced   intermediate   basic
V–N collocations      2,527        852            162        68    3,609
Non-collocations     22,242     12,953          2,895     1,465   39,555
Totals               24,769     13,805          3,057     1,533   43,164

An example whose orientation is representative of much current LCR is Hasselgård and Johansson (2011). Like many learner corpus studies, they report different kinds of frequency data with an eye to over-/underuse in the learner data, as well as results of simple significance tests. For instance, one of their case studies is concerned with the frequencies of quite in the Louvain Corpus of Native English Essays (LOCNESS) and four components of the International Corpus of Learner English (ICLE) (Norwegian, German, Belgian-French and Spanish), shown in Table 8.1. They report the results of chi-square tests comparing each frequency from the ICLE components to the LOCNESS frequency and state that 'quite is overused in all the learner groups', that all learners but the Spanish ones differ significantly from the NS data and that 'the overall frequency distribution … thus seems to reflect the Germanic–Romance distinction' (pp. 45–6).

Then, there is interesting work bridging the gap from frequency statistics to association measures, namely research on collocations that does not involve measures of collocational strength but relies, for instance, on collocational dictionaries. One such example is Laufer and Waldman (2011). They compare the use of verb–noun (V–N) collocations by NS (LOCNESS) with that of learners in the Israeli Learner Corpus of Written English (ILCoWE); verb–noun candidates were considered a collocation if they were listed in at least one combinatory/collocational dictionary. Laufer and Waldman then test whether NS and NNS differ with regard to the number of V–N collocations with a chi-square test and find that the NS produce significantly more V–N collocations. They then proceed to group the NNS into three proficiency groups, as represented in Table 8.2, and conduct eight different chi-square tests on this table. They summarise their results by stating that the NS produce significantly more collocations than the learners and that, within the NNS, only the advanced and basic NNS differ from each other significantly.
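The kind of chi-square comparison reported by Hasselgård and Johansson can be reproduced from Table 8.1 alone; in the Python sketch below, the corpus sizes are back-calculated from the raw and per-million-word figures and are therefore only approximations:

```python
from scipy.stats import chi2_contingency

# Raw frequencies of quite and frequencies per million words (Table 8.1)
freq = {"LOCNESS": 67, "ICLE-SP": 63, "ICLE-FR": 78, "ICLE-NO": 92, "ICLE-GE": 147}
pmw = {"LOCNESS": 205, "ICLE-SP": 318, "ICLE-FR": 380, "ICLE-NO": 437, "ICLE-GE": 623}

# Approximate corpus sizes, back-calculated from the two rows of the table
size = {c: round(freq[c] / pmw[c] * 1_000_000) for c in freq}

# One 2x2 test per learner corpus against the LOCNESS baseline:
# rows = corpus, columns = (quite, all other words)
for c in ["ICLE-SP", "ICLE-FR", "ICLE-NO", "ICLE-GE"]:
    table = [[freq[c], size[c] - freq[c]],
             [freq["LOCNESS"], size["LOCNESS"] - freq["LOCNESS"]]]
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{c}: chi2 = {chi2:.2f}, p = {p:.4f}")
```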


Other studies also largely based on raw and normalised frequencies of elements and basic statistical comparisons of frequencies include Altenberg (2002) and Götz and Schilk (2011). The former is concerned with the uses of causative make by American NS as well as French and Swedish NNS. Altenberg reports many tables of observed frequencies and percentages that indicate how learners differ in their use of causative make from the American NS and from each other, which Swedish equivalents of causative make are used how often, which English equivalents of causative göra are used how often, etc. Götz and Schilk (2011) contrast the frequencies of 3-grams in spoken L1 English from the British component of the International Corpus of English (ICE-GB) and the Louvain Corpus of Native English Conversation (LOCNEC), in spoken L2 English from the Indian component of the ICE (ICE-IND), and in spoken learner English from the German component of the Louvain International Database of Spoken English Interlanguage (LINDSEI-GE), and then perform G²-tests to determine which observed frequencies differ from each other. Another similar example is Gilquin and Granger (2011), who explore the use of into across four ICLE subcorpora – the Dutch, French, Spanish and Tswana components – and in NS English. They, too, report relative frequencies of into per 100,000 words and comparisons using the G²-statistic.
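Since G²-tests recur throughout this literature, it may help to see the computation spelt out. The following function implements the standard log-likelihood ratio formula G² = 2 Σ O ln(O/E) for a 2×2 table; the counts in the usage line are invented for illustration:

```python
import math

def g2(table):
    """Log-likelihood ratio G2 for a 2x2 frequency table [[a, b], [c, d]],
    e.g. occurrences vs non-occurrences of a 3-gram in two corpora."""
    (a, b), (c, d) = table
    n = a + b + c + d
    # Expected frequencies under the null hypothesis of independence
    exp = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
           [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    return 2 * sum(o * math.log(o / e)
                   for row_o, row_e in zip(table, exp)
                   for o, e in zip(row_o, row_e) if o > 0)

# Hypothetical counts: a 3-gram occurring 30 times in 100,000 words of one
# corpus and 12 times in 120,000 words of another
print(round(g2([[30, 99_970], [12, 119_988]]), 2))
```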

2.2.2 Association measures and other (monofactorial) significance tests

A different group of studies involves frequency data but uses them more as a basis for association measures quantifying how much two elements are attracted to, or repelled by, each other; as mentioned above, some of these association measures involve statistical significance tests (see Evert 2009). Sometimes, such studies also utilise simple monofactorial statistics, e.g. significance tests for measures of central tendency or correlations.

One study that fruitfully combines different statistical tools is Zinsmeister and Breckle (2012), who explore the annotated learner corpus (Annotiertes Lernersprachenkorpus, ALeSKo) of German essays produced by NS and by Chinese learners of German (NNS). Apart from a comparison of frequent 3-grams, they also discuss frequencies of part-of-speech 3-grams. As a cut-off point for over-/underuse, they do not use the G²-test ('because of the small size of the ALeSKo corpus', p. 84, n. 25) but the difference between ranks in frequency lists. However, they also use several more sophisticated tools: to study the lexical complexity of their corpora, they compute and test type–token ratios and vocabulary growth rates for both subcorpora. Similarly, they compute summary statistics for both and test for significant differences using non-parametric U-tests.

Durrant and Schmitt (2009) is another interesting case in point. They compare the use of adjective–noun and noun–noun collocations by Bulgarian


learners of English with that of NS; the collocations were extracted from essays and their strength was quantified using the t-score (to highlight more frequent collocations) and MI (to highlight less frequent collocations). These values were classified into seven and eight bands, respectively, so that the authors could explore with t-tests how much NS and NNS use collocations of particular strengths. Results for the t-scores indicate that NNS make greater use of collocations in terms of tokens (but that this is in part due to their overuse of some favourite collocations), whereas results for MI indicate that NNS make less use of collocations in terms of tokens.

Apart from collocational studies, S/FLA research has also begun to target colligations/collostructions, i.e. the association of words to syntactic patterns. One of the first studies to explore verb-construction associations is Gries and Wulff (2005), who compare the attraction that verbs exhibit to the ditransitive and the prepositional dative constructions in NS corpus data (based on Gries and Stefanowitsch's (2004) distinctive collexeme analysis) to advanced German learners' sentence-completion behaviour in a priming study (finding a significant positive correlation), but also to the constructional preferences of the German translational equivalents of these verbs (finding no correlation). This is interesting because it suggests that the tested German learners have internalised the frequency distributions of English verbs in constructions rather than falling back on what their L1 would have them do.

A final related example is Ellis and Ferreira-Junior (2009a). They study six different hypotheses regarding the acquisition of three verb-argument constructions: the verb-locative construction (e.g. My squirrel walked into the kitchen), the verb-object-locative construction (e.g. My squirrel carried the nuts into the kitchen), and the ditransitive construction (e.g. My squirrel gave the other squirrel a nut). More specifically, their study is concerned with the distribution of verbs in the verb slots of these constructions (is it Zipfian?) and with whether the first-learned verbs in the constructions are more frequent in, more strongly attracted to, and more prototypical of, the construction. Their study is one of the first to use a directional measure of association, ΔP, i.e. an association measure that does not quantify the association between two elements x and y in a bidirectional fashion, but makes it possible to distinguish the association from x to y from the one from y to x (cf. Gries 2013c). Their exploration is based on the European Science Foundation Second Language Database and shows that the type–token distributions in the verb slots are Zipfian and that first-learned verbs are highly frequent in, strongly attracted to, and prototypical of, the respective constructions.
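To make these measures concrete, the sketch below computes MI, the t-score and both directions of ΔP from a single 2×2 co-occurrence table; the counts are invented, and the formulas follow the standard definitions surveyed in Evert (2009) and Gries (2013c):

```python
import math

def assoc_measures(a, b, c, d):
    """Association measures for a 2x2 co-occurrence table:
    a = freq(x with y), b = freq(x without y),
    c = freq(y without x), d = freq(neither x nor y)."""
    n = a + b + c + d
    e11 = (a + b) * (a + c) / n               # expected co-occurrence frequency
    mi = math.log2(a / e11)                   # (pointwise) mutual information
    t = (a - e11) / math.sqrt(a)              # t-score
    dp_y_given_x = a / (a + b) - c / (c + d)  # directional delta-P: x -> y
    dp_x_given_y = a / (a + c) - b / (b + d)  # directional delta-P: y -> x
    return mi, t, dp_y_given_x, dp_x_given_y

# Hypothetical counts for a verb and a construction
print(assoc_measures(a=120, b=880, c=300, d=98_700))
```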

2.3 Applications involving multifactorial statistics (regression modelling)

Very recently, LCR has begun to recognise the power of regression approaches and researchers are becoming familiar with the basic logic


underlying regression-analytic approaches. Regression approaches of the type mentioned above offer many advantages:

• as mentioned above, they allow us to include multiple predictors in an analysis
• with multiple predictors, one can explore interactions between variables, i.e. one can test whether one variable has an effect on how another variable is correlated with the dependent variable; also, non-linear effects can be explored
• regression modelling provides a unified framework to understand many seemingly unrelated tests. For instance, instead of trying to learn many monofactorial tests (e.g. chi-square tests, t-tests, Pearson correlations, U-tests) and then regression modelling separately, it is useful to understand that monofactorial tests can often be seen as the simplest possible cases of a monofactorial regression
• while regression modelling is typically used in a hypothesis-testing context, there are extensions that allow the researcher to also perform (guided) exploration of the data
• regressions generate predictions (with confidence intervals) of how a response will behave, which often allows for seamless integration of results from different studies of whatever type (observational, experimental, simulations, etc.).

The remainder of this section is devoted to exemplifying these advantages. As a first simple example, let us return to Hasselgård and Johansson's (2011: 46) data shown in Table 8.1. Even a simple one-dimensional frequency list such as this one can benefit from a regression-analytic approach. Here, where one is interested in frequencies, one useful kind of regression is a monofactorial Poisson regression (cf. Gries 2013b: Section 5.4.3). Such a regression tries to predict, or model, the frequencies of quite (the dependent variable) in each of the different corpora (the independent variable), as shown in (1).²

(1) a. FREQ ~ CORPUS (L1 vs SP vs FR vs NO vs GE)
    b. dep. variable ~ ('as a function of') predictors

If this approach is applied to Hasselgård and Johansson's data, one finds that indeed all learner varieties are significantly different from the LOCNESS baseline. However, as mentioned above, one can now undertake more detailed exploration using so-called general linear hypothesis tests – a method that allows the researcher to test, for instance, whether different L1 data differ from each other significantly (cf. Bretz et al. 2010).

² An offset was included to account for the fact that the corpus sizes differ, but this does not affect the general logic.


Figure 8.2 Visualisation of the final model on Hasselgård and Johansson's (2011: 46) data [predicted frequency of quite, from about 50 to 150, for the conflated NO/ROMANCE group, GE and LOCNESS]

The results suggest that the two Romance languages can indeed be conflated without a significant loss of accuracy, but that the two Germanic languages cannot. If one followed Occam's razor, one would therefore conflate the two Romance languages in a second regression model, which then reveals that (i) the two Germanic languages do not behave similarly, but that (ii) the Norwegian data are not significantly different from the two Romance languages' frequencies. The final results of a third model that conflates the Norwegian and the two Romance data points show results quite different from Hasselgård and Johansson (see Figure 8.2): the postulated Germanic–Romance distinction collapses because (i) the two Germanic languages do not behave identically and (ii) the two Romance languages are not different from Norwegian.³

More interesting applications involve additional complexity and result in powerful explorations of learner corpus data. As discussed above, this 'additional complexity' can result both from different independent linguistic variables and from their interactions. However, another crucial level of complexity arises when the corpus source, or speaker group or L1, is not only included as a predictor but also allowed to interact with all others. This step is simultaneously the most important and the most underutilised one; the present discussion borrows from Gries and Deshors (2014). Imagine a regression where one tries to predict the choice of may or can – let us call this variable FORM – on the basis of two linguistic predictors: NEGATION (whether the clause in which the speaker has to choose may/can is negated or not) and ASPECT (whether the clause in which the speaker has to choose may/can features neutral or perfect/progressive aspect). In addition, there is another predictor CORPUS, which specifies the L1 of the speakers (let us say, native vs French vs Chinese). Several models are conceivable:

³ For advanced readers who object to sequential model simplification or the use of p-values of the above kind, it should be noted that a single regression with planned contrasts also reveals that the alleged Germanic cluster is not homogeneous.


(2) FORM ~ CORPUS + ASPECT + NEGATION
(3) FORM ~ CORPUS + ASPECT + NEGATION + ASPECT : NEGATION
(4) FORM ~ CORPUS + ASPECT + NEGATION + CORPUS : ASPECT + CORPUS : NEGATION + ASPECT : NEGATION + CORPUS : ASPECT : NEGATION

The model in (2) already goes beyond much previous work because it embodies a multifactorial regression where several predictors, not just one, are studied simultaneously. However, it may still be lacking because it does not include interactions: one will not learn, say, whether the effect of NEGATION is the same for both levels of ASPECT (or vice versa). For instance, if negated clauses in general have a higher probability of may, is that equally true for both aspects? The model in (3) answers this question by including the interaction ASPECT : NEGATION and returning a regression coefficient and a p-value for this interaction. However, the model that should really be fit is that in (4) because here not only the two linguistic predictors interact with each other, but all predictors – including CORPUS – do. These interactions, which contrast different speaker groups, are what most work in contrastive analysis and Contrastive Interlanguage Analysis is implicitly about, but they are too rarely tested explicitly:

• the interaction CORPUS : ASPECT tests whether the effect that ASPECT has on FORM (can vs may) is the same in the three L1 speaker groups
• the interaction CORPUS : NEGATION tests whether the effect that NEGATION has on FORM is the same in the three L1 speaker groups
• the interaction CORPUS : ASPECT : NEGATION tests whether the interaction of ASPECT and NEGATION has the same effect on FORM in the three L1 speaker groups.
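In a formula-based statistics environment, the models in (2)–(4) can be written down almost verbatim. The Python (statsmodels) sketch below uses randomly generated stand-in data, so the coefficients are meaningless; it is only meant to show how the three specifications translate into code:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1200

# Synthetic stand-in data: in a real study, each row would be one
# may/can token annotated for its linguistic context
df = pd.DataFrame({
    "corpus":   rng.choice(["NS", "FR", "ZH"], n),
    "aspect":   rng.choice(["neutral", "perf_prog"], n),
    "negation": rng.choice(["yes", "no"], n),
})
df["form"] = rng.integers(0, 2, n)  # 1 = may, 0 = can

# Model (2): main effects only
m2 = smf.logit("form ~ corpus + aspect + negation", data=df).fit(disp=False)
# Model (3): adds the linguistic interaction ASPECT : NEGATION
m3 = smf.logit("form ~ corpus + aspect + negation + aspect:negation",
               data=df).fit(disp=False)
# Model (4): CORPUS interacts with everything; 'corpus * aspect * negation'
# expands to all main effects plus all two- and three-way interactions
m4 = smf.logit("form ~ corpus * aspect * negation", data=df).fit(disp=False)
print(m4.summary())
```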

Thus, only this type of regression will quantify whether any linguistic predictor does different things in NS vs NNS as well as in NNS1 (e.g. French) vs NNS2 (e.g. Chinese). On the basis of such a first regression model, one can then trim the model to the minimally adequate one by (i) weeding out, one by one, independent variables that do not contribute enough predictive power to the model and (ii) conflating levels of predictors that do not differ enough to be distinguished.

One recently developed approach (Gries and Deshors 2014) adds a new exploratory twist to regression modelling, namely a way to study in detail the following questions: (i) 'given the linguistic/contextual situation the NNS is in right now, what would a NS do?' and (ii) 'what determines the degree to which NNS do not make the choices a NS would have made?'. These are central questions raised a long time ago, but hardly studied accordingly: we need to be 'comparing/contrasting what non-native and native speakers of a language do in a comparable situation' (Pery-Woodley 1990: 143, quoted from Granger 1996: 43, my emphasis).


situation’ (Pery-Woodley 1990: 143, quoted from Granger 1996: 43, my emphasis). Most previous LCR has adopted a very lax interpretation of ‘comparable situation’, namely that the NS and the NNS data were produced, e.g., in a ‘similar essay-writing/speech situation’; some better ones do at least control for topics (see below). However, with a regression-analytic mindset, a much more realistic and revealing approach can be pursued:  Gries and Deshors (2014) develop a protocol called MuPDAR (for Multifactorial Prediction and Deviation Analysis with Regressions); see also Gries and Adelman (2014) and Wulff and Gries (2015). First, one uses a multifactorial regression to see why NS make a particular choice. Second, if the fit of that regression is good, then that regression equation is applied to the NNS, which is the statistical way of asking (i) above, ‘what would a NS do here?’ Then, one determines where the NNS did not make the choice that a NS would have made and explores, with a second regression, which of the annotated factors explain when NNS do not behave as NS would. The authors show that French NNS often make non-NS choices with negated clauses as well as with can in perfective/progressive and may in neutral aspect, plus they have more difficulties with may with animate than with inanimate subjects. Similarly, Gries and Adelman (2014) show that NNS have difficulties in making NS-like subject realisation choices in Japanese precisely when the subject referents are not completely discourse-new or completely discourse-given but in the grey area in between. Such results are nearly impossible to obtain with mere over-/underuse counts and require methods with a fine-grained and contextualised view of the data.

3 Representative studies

3.1 Gries, St. Th. and Wulff, S. 2013. 'The genitive alternation in Chinese and German ESL learners: Towards a multifactorial notion of context in learner corpus research', International Journal of Corpus Linguistics 18(3): 327–56.

Gries and Wulff (2013) is a study involving the above-mentioned regression approach. They study the genitive alternation in English as represented in (5) by comparing the constructional choices of native speakers of British English to those of Chinese and German learners of English.

(5) a. the squirrel's nut        s-genitive    possessor's possessed
    b. the nut of the squirrel   of-genitive   possessed of possessor

Previous studies of the genitive alternation in native speaker data have uncovered a large number of factors that co-determine which genitive


speakers choose. As with many other alternations, these factors are from many different levels of analysis and include:

• morphosyntactic and semantic features: number, animacy and specificity of possessor and possessed, as well as the semantic relationship between possessor and possessed (e.g. possession, attribution, participant/time and event)
• processing-related features: length and complexity of possessor and possessed, the previous choice made by a speaker, information status of possessor and possessed
• phonological features such as rhythmic alternation (the preference to have stressed and unstressed syllables alternate) or segment alternation (the preference for CV structures).

They retrieve approximately 3,000 examples of of- and s-genitives from the ICE-GB (for the NS data) and from the Chinese and German components of the ICLE (for the learner data), and annotate them for the above features. Given the fact that the genitive alternation is obviously a multifactorial phenomenon, they adopt a regression-analytic approach along the lines discussed in the previous section, and since the dependent variable – the choice of genitive – is binary (of vs s), they perform a logistic regression analysis, i.e. an analysis that determines (i) which of the annotated features and their combinations predict all speakers' genitive choices best, (ii) if/how the NNS differ from the NS in their genitive choices, and (iii) if/how the two NNS groups differ from each other.

In order to determine the most parsimonious model of the genitive alternation, they undertake a manual model selection process that weeds out predictors that do not significantly help predict the genitive alternation while at the same time controlling for collinearity, i.e. the omnipresent and potentially dangerous phenomenon that predictors are too highly related to each other and thus do not allow the researcher to identify which predictor has what kind of effect. Their final model is then shown to explain the data very well, based on a correlation coefficient and on the high accuracy with which the model can classify the speakers' genitive choices (>93%).

Several interesting findings emerge once the results are visualised (multifactorial regressions are usually much easier to understand when represented graphically; visualisation should nearly always be provided for results). One is an effect across all three speaker groups such that segment alternation patterns are indeed weakly preferred. Another is an interaction of POSSESSOR NUMBER and LENGTH DIFFERENCE (between the possessor and the possessed), which is represented in Figure 8.3.
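One common way to operationalise the collinearity check just mentioned is the variance inflation factor (VIF); the sketch below uses a synthetic stand-in for the annotated genitive data (all column names hypothetical) and is not necessarily the authors' procedure:

```python
import numpy as np
import pandas as pd
import patsy
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)

# Synthetic stand-in for an annotated genitive data set (hypothetical)
genitives = pd.DataFrame({
    "possessor_animacy": rng.choice(["animate", "inanimate"], 500),
    "possessor_number":  rng.choice(["sg", "pl"], 500),
    "length_diff":       rng.integers(-10, 30, 500),
})

# Build the design matrix the regression would use and compute each
# predictor's VIF; values far above ~10 are commonly taken to signal
# problematic collinearity
X = patsy.dmatrix("possessor_animacy + possessor_number + length_diff",
                  data=genitives, return_type="dataframe")
for i, name in enumerate(X.columns):
    if name != "Intercept":
        print(name, round(variance_inflation_factor(X.values, i), 2))
```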


Figure 8.3 The effect of the interaction of POSSESSOR NUMBER : LENGTH DIFFERENCE on the choice of s-genitives in Gries and Wulff (2013) [two panels, for singular and plural possessors: predicted probability of the s-genitive (x-axis, 0.00–0.08) against length difference (y-axis, –10 to 30)]

The x-axis represents the predicted probability that a speaker would use the s-genitive, the y-axis represents the difference LENGTH POSSESSOR minus LENGTH POSSESSED (in characters), and the small s's and p's represent the predicted probabilities of the s-genitives for singular and plural possessors; the left panel highlights the curve for singular possessors (and plots the plurals in grey for the sake of easy comparison), the right panel focuses on the curve for plural possessors.

This is an interesting finding (which normal chi-square test analyses could not really deliver) because, while nearly every analysis of the genitive alternation has found that LENGTH DIFFERENCE matters (in a general short-before-long tendency), this interaction shows that the effect is more pronounced with singular than with plural possessors. This is presumably because there is a general avoidance of s-genitives with plural possessors for articulatory reasons, which means that, for LENGTH DIFFERENCE to have any impact at all, the length difference has to be quite large to 'overpower' that avoidance. Finally, one interaction that shows how the NNS differ from the NS involves the specificity of the possessed: all speakers prefer s-genitives with non-specific possesseds, but the German NNS do so only weakly while the Chinese NNS do so strongly. In sum, this study is instructive in how it showcases the power of regression-analytic approaches, viz. the ability to study multiple determinants of a phenomenon and their interactions as they affect the language of learners from different L1s.

3.2 Paquot, M. 2014. 'Cross-linguistic influence and formulaic language: Recurrent word sequences in French learner writing', in Roberts, L., Vedder, I. and Hulstijn, J. (eds.), EUROSLA Yearbook 14. Amsterdam: Benjamins, pp. 240–61.

A study that is interesting for its use of monofactorial tests and frequency-based observations is Paquot (2014). She explores English 2/3/4-grams containing a lexical verb produced by French learners of English to determine how much of the learners' idiosyncratic use of n-grams is due to L1 transfer and what kinds of transfer effects can be found. She retrieves all 2/3/4-grams from the ICLE that occur 5+ times and computes their normalised frequencies per 100 words in order to control for the


fact that corpus parts are not always equally large, which makes it impossible to conduct frequency comparisons based on raw frequencies. She then compares the French learners' mean n-gram frequencies with those of the other nine learner groups; laudably, she

• uses pairwise Wilcoxon rank-sum tests for this rather than the more commonly but often incorrectly used alternatives of t-tests or ANOVAs; in this case, Wilcoxon tests are more appropriate because the data on which the tests are run will violate the assumptions that t-tests/ANOVAs make
• applies corrections for multiple testing rather than using the traditional significance level of 0.05 for all these tests; in this case, this is appropriate, or even required, because she studies one and the same data set with multiple (pairwise) significance tests.
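This combination of rank-based tests and multiple-testing corrections is straightforward to implement; the Python sketch below uses invented per-text frequencies (all values hypothetical) and the Holm method as one possible correction:

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)

# Hypothetical per-text frequencies (per 100 words) of one n-gram for
# the French learners and for nine other L1 groups
french = rng.exponential(0.4, 40)
others = [rng.exponential(0.25, 40) for _ in range(9)]

# One Wilcoxon rank-sum (= Mann-Whitney U) test per pairwise comparison ...
pvals = [mannwhitneyu(french, group, alternative="two-sided").pvalue
         for group in others]

# ... then a correction for running multiple tests on the same data set
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
for p, r in zip(p_adj, reject):
    print(f"adjusted p = {p:.4f}, significant: {r}")
```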

On the basis of these tests, a variety of n-grams whose frequencies in the French data differ significantly from five other learner groups are identified: 228 n-grams (154 2-grams, 59 3-grams and 15 4-grams) showing intra-L1-group homogeneity and inter-L1-group heterogeneity. A more qualitative route is used to determine the degree of congruity between French and the French learners' English, whereby each n-gram's use in the learner data is compared to what the translation equivalent in French would be (also controlling for the effect that topic might have).

Paquot finds that the large majority of the 228 significantly overused n-grams identified as described above are referential expressions (>86%), but also that many of these are more likely due to the choice of essay topic by the learners, because a large majority of these n-grams only appear in French learners' essays discussing one particular topic (the creation and future of Europe); these n-grams include both statistically overrepresented content words and function words marking tense. However, a variety of n-grams whose function was classified as 'discourse organisers' or 'stance markers' exhibited significant overuse by the French learners that could not be attributed to the essay topic. Of these, several are part of a longer chunk, which allows for a subsequent analysis of French translation equivalents: some overused n-grams turn out to result from their use in teaching materials, but many others can be shown to be due to several categories of transfer effects, such as transfer of:

• semantic properties (cf. on the contrary and au contraire)
• collocational/colligational properties (cf. according to me and selon moi)
• functions and discourse conventions (cf. let us not forget that and n'oublions pas que)
• L1 frequency (cf. from this point of view and de ce point de vue).

In sum, this study is instructive in its use of statistical tools (the use of non-parametric statistics given the non-normal data it studies) and the ways in which the statistical results – while not multifactorial per se – are


carefully controlled for potentially epiphenomenal effects (topic choice) and include careful comparisons with frequency effects in L1s other than the targeted French learners.

4 Critical assessment and future directions

LCR has made an important contribution to S/FLA in that it has shown how corpus frequencies are correlated with many central notions in S/FLA research and has thus raised awareness of the important role that all kinds of frequency information play in language acquisition and learning (cf. Ellis 2002). This development coincided with the general recognition in linguistics that corpus data – long shunned while generative linguistics was the dominant linguistic framework – have a lot to offer. However, while this positive development has drastically increased the number of corpus studies in LCR, the somewhat obvious fact that corpus methods are by definition distributional and quantitative has not yet led to an analogous increase in the statistical sophistication of LCR; neighbouring fields such as sociolinguistics, psycholinguistics or corpus linguistics in general exhibit an overall larger degree of sophistication.

After having discussed a variety of core issues and representative studies in the previous sections, I will now discuss a range of problems that LCR studies often manifest (regardless of any insights they may still offer) as well as make a variety of suggestions as to how LCR needs to evolve to come to grips with the immensely complex nature of its questions and data. Specifically, I will first turn to a few problems that arise from how data and analyses are reported and make a few easy-to-implement suggestions to address these problems (Section 4.1), before I turn to problems in how analyses are conducted and what needs to be done instead of, or on top of, current practices (Section 4.2). Crucially, the goal is not to dismantle any studies mentioned here, but to anticipate the common objections that the points to be made are neither as frequent nor as harmful as I claim they are.

4.1 Problems with how statistical analyses are reported

The simplest problems to fix pertain to how data and methods are characterised. For the former, many studies unfortunately provide only normalised frequencies of the phenomena studied. A case in point is Altenberg (2002: 44), who provides only normalised frequencies of make/göra in source texts and their translations; another is Connor et al. (2005), who report a variety of statistical results without any representation of the data thus analysed. This is problematic because it rules out follow-up analyses or replications: one can neither replicate the tests nor conduct additional ones without knowing the exact distributions of the data or, minimally, the sample sizes. As for the latter, often the statistical tests


undertaken are not described comprehensively enough for readers to understand, let alone try to replicate or extend, the analysis. For example, many LCR studies report using chi-square tests, but they do not report what kind of chi-square test they used (goodness-of-fit or independence) or whether they used a correction for continuity, and they do not report the chi-square values themselves (only p-value ranges).

[Table 14.1 Implicational scaling of the learners' syntactic (S) and morphological (M) structures across PT stages; table body not recoverable]

Notes. Stage 2: SVX and plural-s marking; Stage 3: XP-adjunction and NP agreement; Stage 4: SV-inversion and inter-phrasal agreement; Stage 5: subordinate clauses, subjunctive and interclausal agreement.

Learners were then classified according to which PT stage they had reached, using the emergence criteria (Pienemann 1998; Pallotti 2007), which state that a structure has to have been used productively and systematically at least four times to assume it has been acquired. Systematicity aims to avoid counting rote-learned words or expressions as having been acquired by requiring a sufficient number of contexts of use, and productivity refers to a structure being used with a variety of lexical items or in a variety of contexts. The level of the learners was statistically analysed using implicational scaling, which makes it possible to interpret the cross-sectional data of learners’ morphological or syntactic development as representative of an individual learner over time (Bonilla 2015: 62). Table 14.1 shows the implicational scaling for all learners for both the syntactic (S) and morphological (M) structures. The thick line delineates acquisition of a given stage according to the emergence criteria outlined above. To the left of the line, a given stage has been acquired, and to the right of the line, the stage has not yet been acquired. Shaded areas


indicate that the syntax typical of a given stage has been acquired, but not yet the morphology (Bonilla 2015: 63). As can be seen from the table, the learners show a clear progression through the stages, which emerged in the order predicted (statistical tests indicate that the table is 100 per cent scalable). There is also evidence that each of the five predicted stages is indeed an independent stage, acquired one by one (Bonilla 2015: 63). Interestingly, as indicated by the shaded areas, learners seemed to acquire the syntax associated with a given stage before the morphology, as no instance was found of morphology emerging before the syntax in any of the learners. The stages predicted by PT were found to emerge in the set order predicted for all learners, thus supporting the PT claim that processing limitations are what constrains L2 development.

This study is a good illustration of research taking an SLA theory as its starting point and using a corpus of learner language in order to test the hypotheses it makes. From an LCR perspective, this study could perhaps have better exploited the corpus it is based on: it could have investigated all the learners available rather than a small subset, and it could also have used the range of spoken data available for each learner, rather than just one speaking activity. It could also possibly have made more systematic use of the sophisticated corpus analysis tools now available, and of the fact that the data was already tagged for parts of speech, rather than carrying out what seems to be its own annotation of the raw transcripts. However, this study usefully illustrates how a research agenda arising from current SLA theorising can be tested effectively on existing learner corpora.

3.3 Myles, F. and Mitchell, R. 2012. Learning French from Ages 5, 7 and 11: An Investigation into Starting Ages, Rates and Routes of Learning amongst Early Foreign Language Learners. ESRC End of Award Report RES-062-23-1545.

The corpus used in this study was designed to investigate the role of age in instructed learners, an issue which is of considerable theoretical as well as practical interest. It is of theoretical interest because it can tell us about differences in the way in which we learn and process foreign languages at different ages, as a result of differences in the brain mechanisms involved, either due to normal developmental or aging processes, or because innate abilities to learn languages are no longer available after the critical period (Herschensohn 2007; Muñoz 2008b; Birdsong 2009; Muñoz and Singleton 2011; DeKeyser 2012). And it is of practical interest because many countries around the world have introduced foreign languages in the primary curriculum, assuming that 'earlier is better', without a clear understanding of differences in learning processes at different ages (Muñoz 2006, 2008a).


The project⁴ had four broad objectives:

1. to document the development of linguistic competence among young English classroom learners of French at three different starting ages (five, seven and eleven), and identify similarities and differences
2. to compare rates of development at different ages after the same amount of classroom exposure
3. to document and compare the children's learning strategies and attitudes at different ages
4. through this evidence, to contribute to theoretical understandings of second language acquisition among young learners, and consequently inform current primary language initiatives and educational practices in the UK and internationally.

In order to be confident that any differences in learning were due to the age of the participants rather than other factors, all other variables were kept constant as far as possible. This was achieved as follows:

• Three intact classes in two schools sharing similar socio-economic status (predominantly working class) were identified: 27 children in Year 1 (5–6-year-olds); 26 in Year 3 (7–8-year-olds); and 19 in Year 7 (11–12-year-olds).
• All children were complete beginners, as ascertained by a pre-test.
• The same teacher employed by the project provided all the teaching (38 hours over 19 weeks), following the same scheme of work across the age groups (with minor variations in delivery according to age).

All classes were video-recorded so that there would be a complete record of the input children were exposed to and of how they engaged with that input. The testing was designed on the basis of the input received, as follows:

• a pre-test consisting of a group interview (testing previous knowledge of French and general awareness about France and the French language) and a receptive vocabulary test
• a mid-project test (after 18 hours' teaching), consisting of a story retelling, elicited imitation, role-play and an input-based receptive vocabulary test (controlled for factors in the input such as frequency, recency – how recently a word has been heard prior to testing – and type of input, e.g. story, song)
• an end test (after 38 hours' teaching) repeating all mid-project tests, plus a working memory test
• finally, a delayed post-test (2 months later), consisting of the same tasks as mid-project, and a short one-to-one learner interview (in English) about French learning (e.g. attitudes, strategies, motivation).

⁴ The project Learning French from ages 5, 7 and 11: An investigation into starting ages, rates and routes of learning amongst early foreign language learners was funded by ESRC grant RES-062-23-1545 (2009–11). The research team included Florence Myles, Rosamond Mitchell, Annabelle David, Sarah Rule, Christophe dos Santos and Kevin McManus. Full details on www.flloc.soton.ac.uk/primary/index.html (last accessed on 13 April 2015).


Table 14.2. Corpus and experimental data collected

Corpus data (all groups):
• Classroom interaction: all classes video-recorded and transcribed in CHAT format; coded for gestures
• Mid test: RP (role-play); SR (picture-based story retelling)
• Final test: RP; SR
• Delayed post-test: RP; SR
• Interview (in English)
• Focus groups (in English)

Experimental data (all groups):
• Pre-test: group interview; receptive vocabulary test
• Mid test: EI (elicited imitation); receptive vocabulary test
• Final test: EI; receptive vocabulary test
• Delayed post-test: EI; receptive vocabulary test
• Other measures: working memory test; literacy score


Focus groups were also held in English with all the children, in groups of three to six, asking them about their attitudes, motivation and learning strategies. As can be seen from Table 14.2, the database collected is a mixture of corpus and experimental data, as is often necessary in SLA research to address specific research agendas (Gilquin and Gries 2009; Meunier and Littré 2013).

The corpus of classroom interaction was used to design the tests. As the children had no access to French outside the classroom and were complete beginners at the start, it was possible to track all instances of input and output, including the modality (e.g. song, story, role-play, teacher gesture). For example, the vocabulary test was designed to incorporate words which the children produced frequently, words they had heard but not produced, either as classroom management talk (e.g. silence) or in stories (e.g. chien – dog), words with varying frequency and/or recency in the input, cognates, words supported by teacher gesture, words in songs, etc.

In terms of receptive vocabulary learning, there was little difference in performance between the groups. Frequency in the input was the most important factor for successful learning across all groups. However, recency was more important for the 5-year-olds than for the older learners. The younger learners were slower at the beginning but caught up later; this could be because their working memory is less developed at that age. Cognates were well learnt, but words heard in songs were not


(even though the songs themselves were). Teacher gesture could aid saliency and therefore noticing and learning, but could also distract from linguistic input (as children can understand the message without having to process the input).

In terms of grammar, there was a clear age advantage. The younger learners had comparatively more difficulty with longer utterances and more complex language. Significant correlations were found between all aspects of development and working memory, except between working memory and receptive vocabulary in the younger children (5-year-olds), which suggests that their working memory is not yet sufficiently developed to aid vocabulary learning. The correlations between working memory and grammar were particularly strong, suggesting that more developed processing abilities are especially important when dealing with grammar, and as working memory develops throughout childhood, this gives the older children an advantage. A highly significant correlation was also found between literacy and vocabulary development in the 5- and 7-year-olds. In sum, both working memory and literacy support language learning in young learners, and as they develop during the course of childhood, the older learners make more progress in the case of grammar, and learn faster in the case of vocabulary. The focus group and interview data showed that the older children made use of a wider range of cognitive strategies to aid learning, but that the younger children were very enthusiastic.

What this study has shown is that, by combining corpus data with experimental data, links can be established between the classroom experience of learners and their development. By having a digital record of all classroom interaction and learner data ranging from semi-spontaneous to experimental, at various intervals during the course of this longitudinal study, insights could be gained which would not have been achievable otherwise. The design of this corpus – longitudinal, oral, containing a range of different tasks administered across learners and age groups and complemented by other measures as necessary (working memory scores, literacy scores) – enabled the research questions to be investigated thoroughly.

4 Critical assessment and future directions

On the basis of the needs of SLA theory outlined in Section 2.1.2, we can conclude that the increase in the number, size and diversity of learner corpora, as well as the increased sophistication of the tools used to exploit them, has gone a long way towards meeting these needs. There are, however, important gaps which remain to be filled, and which can be summarised as follows:




• More oral data is needed. Currently the overwhelming majority is written, which promotes the use of metalinguistic explicit knowledge and makes investigation of implicit knowledge more problematic (Myles 2005, 2008; Saito 2012; Tono et al. 2012). In Granger et al.'s (2013) state-of-the-art volume, over 81 per cent of the learner corpora used are written.
• The communicative activities and tasks used to gather L2 data should be varied, including tasks encouraging the production of infrequent or rare constructions. The essence of learner corpora is their continuous and relatively unconstrained nature, and many activities learners typically engage in in the classroom give rise to such discourse, other than written essays, which currently form the bulk of the data available.
• Corpora incorporating a wide range of different languages need to be collected, as both L1s and L2s. In Granger et al. (2013), 76 per cent of the corpora are L2 English. In particular, the number of languages typologically unrelated to Indo-European languages (essential for testing many current SLA hypotheses) needs to increase.
• All proficiency levels need to be represented. Currently, most corpora are of advanced learners, and include a single proficiency level, making it difficult to investigate L2 development (Tono et al. 2012).
• Native controls performing the same tasks as the learners should be included. This is now the norm in most corpora, although how comparable the tasks are is not always completely obvious (Buttery and Caines 2012: 193).

There are clear signs that some of these gaps are being filled, and there is much evidence of a growing dialogue and rapprochement between SLA and LCR. For example, a community of scholars interested in LCR is emerging, spearheaded by the pioneering Centre for English Corpus Linguistics led by Sylviane Granger at the University of Louvain, and an international Learner Corpus Association was established in 2013, facilitating networking and the sharing of resources, as well as organising a biennial conference. I agree with Granger (2009a: 28) when she concludes that '[t]he future of the field [learner corpus research] is bright … it is slowly but surely being integrated into SLA, a movement which is due both to recognition among SLA researchers of the value of the LC [learner corpus] approach and a corresponding recognition among LC researchers of the importance of SLA findings'. But although I agree that the future is bright, I think there is still much work to be done: SLA researchers are on the whole rather slow in embracing the possibilities offered by LCR methodologies. This is undoubtedly partly because the corpora available do not always provide the data which would be necessary to test SLA hypotheses rigorously, and partly because collecting the kind of corpus which could give answers to some of the questions the field is currently asking

Downloaded from https:/www.cambridge.org/core. University of Liverpool Library, on 07 Mar 2017 at 07:05:45, subject to the Cambridge Core terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781139649414.014

330

MYLES

is a highly time-consuming and resource-intensive task. This of course makes it all the more necessary for corpora to be shared and widely used, but this remains ‘work in progress’. And although LCR is now much better informed by SLA research, there are still many studies which remain rather descriptive, focusing primarily on learner errors and contrastive analysis, often without the theoretical frameworks which would enable rigorous interpretation or explanation of the data. The present handbook is a very welcome initiative in the direction of closer collaboration between the two fields. The present chapter has been written from the point of view of an SLA theorist who has been constructing and using learner corpora for over twenty years with the specific goal of investigating learner development and testing SLA hypotheses. The analysis presented obviously reflects these theoretical motivations. Many of the current learner corpora are suitable for all kinds of analyses which adopt a ‘bottom-up’ approach to research, that is, which aim to extract patterns and tendencies from large samples of data. These endeavours should continue, of course, as we have learnt much from them about e.g. lexical development, idiomaticity, common errors or problem areas. These corpora should be well designed and documented, as has been stressed for many years now (Granger 2009a; see also Chapter  2, this volume). But their design principles are often rather different from those meeting the requirements of more ‘top-down’ SLA approaches, which need learner data to test hypotheses arising from SLA theorising.

Key readings

Mitchell, R., Myles, F. and Marsden, E. 2013. Second Language Learning Theories. Third edition. Oxon: Routledge.
An overview of current SLA theories. For each theory, the claims and scope of the theory are outlined, as well as the way in which it views language, the learner and the learning process, and the methodological tools favoured. Summaries of key studies illustrate the different theoretical approaches.

Chaudron, C. 2003. ‘Data collection in SLA research’, in Doughty, C. and Long, M. (eds.), The Handbook of Second Language Acquisition. Malden, MA: Blackwell, pp. 762–828.
This paper offers an overview of the methodologies used by SLA researchers to collect data, depending on their specific research agendas. Issues of reliability and validity are discussed within the context of each data-collection procedure, as well as the generalisability of the findings arising from the various approaches and the importance of triangulation. Although the focus is not on LCR methodologies, this
overview is an excellent analysis of the theoretical reasons behind methodological choices in SLA.

Gilquin, G. and Gries, S. Th. 2009. ‘Corpora and experimental methods: A state-of-the-art review’, in Gilquin, G. (ed.), Corpora and Experimental Methods. Special issue of Corpus Linguistics and Linguistic Theory 5(1): 1–26.
A review of studies combining corpus and experimental methodologies, showing that psycholinguists regularly combine the two but use corpora in different ways when compared to corpus linguists, and that corpus linguists rarely make use of experimental data to complement their own investigations.

Meunier, F. and Littré, D. 2013. ‘Tracking learners’ progress: Adopting a dual “corpus cum experimental data” approach’, The Modern Language Journal 97(S1): 61–76.
This is a study investigating the development of the English tense and aspect system by French learners and demonstrating the necessity of combining the analysis of a longitudinal corpus with experimental methodologies in order to fully understand the difficulties faced by learners.

Mitchell, R., Domínguez, L., Arche, M., Myles, F. and Marsden, E. 2008. ‘SPLLOC: A new database for Spanish second language acquisition’, in Roberts, L., Myles, F. and David, A. (eds.), EUROSLA Yearbook 8. Amsterdam: Benjamins, pp. 287–304.
This article describes the design principles and methodology behind the construction of an oral corpus of L2 Spanish which aims to support a focused research agenda investigating learner development with respect to the verb phrase, clitic pronouns and word order.

Ortega, L. and Byrnes, H. (eds.) 2008b. The Longitudinal Study of Advanced L2 Capacities. New York: Routledge.
This volume stresses the importance of longitudinal data for research into advanced L2 learning and teaching, and argues this need has been neglected so far. The editors conclude with a proposal for a systematic programme of research for the longitudinal investigation of advanced L2 capacities.

15 Transfer and learner corpus research

John Osborne

1 Introduction

Transfer is the influence that previous knowledge or skills have on future learning. Specifically applied to the language domain, it is often referred to as ‘language transfer’, which Odlin (1989: 27) defines as ‘the influence resulting from similarities and differences between the target language and any other language that has been previously (and perhaps imperfectly) acquired’. Traditionally, this influence is seen as potentially having positive effects which can facilitate learning (‘positive transfer’ or ‘facilitation’) or negative effects that can inhibit learning (‘negative transfer’ or ‘interference’), but, as we shall see, transfer is a more complex phenomenon than a simple positive–negative opposition might suggest. Since the 1980s many writers on the subject, from Kellerman and Sharwood Smith (1986) onwards, have preferred to use the term ‘cross-linguistic influence’ (or CLI) to refer to language transfer. Other terms, such as ‘cross-linguistic transfer’ and ‘interlanguage transfer’, are also encountered in the literature, along with earlier expressions such as the ‘cross-associations’ discussed by Sweet (1900). Interest in transfer clearly goes back well before the development of electronic corpora, and to appreciate the added value that learner corpora have brought to the study of these phenomena, it is useful to have an overview of the background to transfer studies and the gradual development of a research agenda and accompanying methodologies, to which, over the last twenty years, learner corpora have given a new impetus.

In its most general sense, ‘transfer’ is a consequence of our expectation that the world is regular: ‘When we have lived any time, and have been accustomed to the uniformity of nature, we acquire a general habit, by which we always transfer the known to the unknown, and conceive the latter to resemble the former’ (Hume 1793: 604). As a basic component of factual reasoning, transfer has long been a subject of interest for the
psychology of learning, with the aim of determining to what extent, and in what conditions ‘learning in one context enhances (positive transfer) or undermines (negative transfer) a related performance in another context’ (Perkins and Salomon 1994: 6452). Scientific investigation of transfer, however, has found it to be a difficult phenomenon to pin down. The classic studies by Thorndike and Woodworth in the early twentieth century found that practice in estimating the surface area of a series of rectangles did not result in an improved ability to estimate the area of similarly sized non-rectangular objects (Thorndike and Woodworth 1901) or, in a different domain, that learning Latin did not have any impact on performance in other subjects (Thorndike 1923). It appears, then, that ‘far’ transfer, into a context that is different from that of the original learning, does not take place easily. But what of transfer in more closely related contexts, such as learning a new language, particularly one which shares features with the language(s) already known? The idea that when language systems come into contact, speakers will carry elements of one language over onto the other is probably as old as language contact itself. In the case of English, Ranulph Higden was already complaining in the fourteenth century that the language was ‘corrupted’ (apayred) by contact with the Danes and Normans (contemporary translation by John of Trevisa, in Babbington 1869: 159). In language pedagogy, where there is also a long-standing interest in cross-linguistic influences, concern has more often focused on ‘forward’ transfer, from the first language onto the second. In his Didactica Magna, Comenius raises two points that continue to be relevant in discussions of transfer in language learning. Firstly, there is no need to give learners explicit instruction on what they already know: ‘In writing the rules for the new language the already known must be continually kept in mind, so that stress may be laid only on the points in which the languages differ’ (1657: 206). Secondly, learning more than one foreign language opens the possibility that transfer may take place between the newly acquired languages (‘lateral’ transfer), and so they need to be kept separate in the learner’s mind: ‘One language should always be learnt after, and not at the same time as, another; since otherwise both will be learned confusedly’ (ibid.: 205). Pedagogical interest in cross-association became more prominent with the Reform Movement of the late nineteenth century and its concern to avoid what Franke (1884) termed ‘zigzagging’ between languages (see Howatt and Widdowson 2004:  187–209). Sweet (1900:  55)  echoes Comenius’ advice that each language should be learnt thoroughly before embarking on the study of another, and adopts a cautious approach to the similarities between languages: We are naturally inclined to assume that the nearer a foreign language is to our own, the easier it is … But this very likeness is often a source

of confusion. It is a help to the beginner who merely wants to understand the allied language, and is content with a rough knowledge; but it is a hindrance to any thorough knowledge because of the constant cross-associations that are sure to present themselves. Sweet concludes his remarks on cross-associations with a suggestion that opens the way to later work on contrastive analysis: ‘a good deal of help might be afforded by systematic summaries of the conflicting associations – the confusions and divergences – between each pair of languages’ (ibid.: 198). The rationale behind such systematic comparisons, as developed fifty years later by Fries (1945) and Lado (1957), starts from the assumption that ‘individuals tend to transfer the forms and meanings, and the distribution of forms and meanings of their native language and culture to the foreign language and culture’ (Lado 1957: 2) and posits that ‘in the comparison between native and foreign language lies the ease or difficulty in foreign language learning’ (ibid.: 1). Lado is careful to point out that the comparative analysis needs to be validated ‘by checking it against the actual speech of students’ (ibid.: 72) and that there will undoubtedly be individual variations and difficulties that were underestimated. He remains nevertheless confident that the learning difficulties identified by this method will ‘prove quite stable and predictable for each language background’ (ibid.: 72). Not everyone shared this confidence, however. French (1949) voices scepticism based on his compilation of authentic errors collected from learners in many parts of the world, representing a wide range of first languages. He observes that if similar errors are found thousands of miles apart, from learners whose first languages are entirely unrelated, then the idea that such errors are due to cross-association is untenable. Since they are ‘common’ errors, he argues, they must have a common origin, which he suggests is to be found in the learner’s honest endeavour to ‘use his brain’ and make sense of patterns in the target language. The ‘strong’ version of the contrastive hypothesis  – ‘the assumption that we can predict and describe the patterns that will cause difficulty in learning, and those that will not cause difficulty, by comparing systematically the language and culture to be learned with the native language and culture of the student’ (Lado 1957:  vii)  – was not widely adopted. The weak version, on the other hand, starts from the evidence provided by linguistic interference and ‘requires of the linguist only that he use the best linguistic knowledge available to him in order to account for observed difficulties in second language learning’ (ibid.: 126). The error analysis studies of the 1970s therefore reversed the contrastive procedure, starting from the actual speech of students in order to establish inventories and classifications of observed forms as a basis for describing the learner’s interlanguage. The learner’s version of the target language

could then be compared with the target language itself, so as to identify divergences (see James 1998: 5). Although error analysis succeeded in placing the study of interlanguage on a firmer empirical footing, the quantity and scope of the evidence used remained relatively limited, leaving the possibility that certain phenomena might be over- or underestimated, or even missed altogether. If the only data used come from learner productions, then it will not be possible to detect cases where learners avoid using a feature precisely because it presents a difficulty. Schachter (1974) demonstrated that, contrary to the predictions of contrastive analysis, Japanese and Chinese learners of English produced fewer errors in the use of relative clauses than did speakers of Persian and Arabic. However, comparison of the total number of relative constructions used by the four groups of learners, and by native speakers of English, showed that the proportion of relative clauses produced by the Japanese and Chinese learners was unusually low, suggesting that they were using an avoidance strategy. There are two methodological lessons to be drawn from this. Firstly, it is the mismatch between the results of contrastive and error analysis respectively that draws attention to the possible existence of a phenomenon that error analysis alone might miss; the two are thus complementary rather than alternative ways of looking at interlanguage. The second lesson is that both contrastive and error analysis will fail to capture certain phenomena if their data are limited to decontextualised fragments of learner language: The first step in an error analysis is the extraction of errors from the corpus. In many cases the corpus is then excluded from further consideration as the investigator focuses on the task of organising the errors. This seemingly innocuous move (abandoning the corpus) provides what some consider to be the most devastating criticism of the whole EA [Error Analysis] enterprise. (Schachter and Celce-Murcia 1977: 444–5)

2 Core issues

2.1 Data for transfer studies

With the advent of computers, an obvious step was to store error inventories in digital form, in order to facilitate cross-referencing and the rapid recovery of all errors corresponding to specified search criteria. An early example is Janitza (1990), who between 1973 and 1990 compiled a database of errors extracted from the exam scripts of French-speaking undergraduate learners of German. The extracts were assigned to 152 error types falling into two main categories, ‘interference errors’ and ‘non-interference errors’, and stored on a PC in dBASE III. An interesting characteristic of the database is that the timespan of the data collection,

over a total of seventeen years, makes it possible to track evolutions in the error types, which Janitza attributes partly to new learning transfer effects resulting from the introduction of audio-visual methodologies in the 1970s. In other respects, though, a database of this kind suffers from the same limitations as the hand-built inventories that preceded it, in that it is built up from fragments of texts, removed from their context, and included on the basis that they contain already-identified errors. Of these, Janitza notes that interference errors make up a considerable part, but observes that ‘only a really sizeable corpus can contribute to the well-known debate on interference’ (1990: 104, my translation). The subsequent development of computer-based learner corpora is partly a response to this need. Leech (1998) suggests that developments in learner corpora will be useful in answering two questions concerning, respectively, the areas of error and the proportion of non-target-like behaviour peculiar to native speakers of a language A, as opposed to that which is shared by all learners of a target language T, irrespective of their first language. He concludes: ‘It appears odd that SLA [second language acquisition] has not yet provided a clear answer to these questions, especially to the second one’ (1998: xv).
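The kind of cross-referencing and rapid error recovery that a database like Janitza’s affords is easy to picture in modern terms. The sketch below is a toy illustration in Python; the records, field names and query helper are invented for the example and are not taken from the original dBASE III system.

```python
# Toy model of a cross-referenced error inventory; all records and field
# names are invented for illustration.
from collections import Counter

errors = [
    {"year": 1975, "category": "interference",     "type": "gender agreement"},
    {"year": 1975, "category": "non-interference", "type": "verb placement"},
    {"year": 1988, "category": "interference",     "type": "false friend"},
    {"year": 1988, "category": "interference",     "type": "gender agreement"},
]

def query(records, **criteria):
    """Recover all error records matching the specified search criteria."""
    return [r for r in records if all(r[key] == value
                                      for key, value in criteria.items())]

# Track how one error category evolves over the collection period,
# as Janitza did for interference errors.
by_year = Counter(r["year"] for r in query(errors, category="interference"))
print(by_year)  # Counter({1988: 2, 1975: 1})
```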

2.2 Identifying transfer

That SLA has not provided a clear estimation of the proportion of target language behaviour that can be related to a learner’s first language is immediately apparent from overviews of transfer research. Ellis (1985:  29)  summarises studies carried out between 1971 and 1983, in which the proportion of errors ascribed to transfer from the learner’s mother tongue (L1) ranges from 3% to 51%. As Ellis remarks, some of the discrepancy can be explained by factors such as the learners’ ages, their level or the distance between the target language (English) and their first languages (Spanish, Italian, Chinese, Arabic and others). But even allowing for these factors, there is clearly still a wide divergence in how different researchers identify transfer errors. Why should this be so, and why is it after all not so ‘odd’ that SLA has not yet provided an answer to Leech’s question? At first sight, it is a straightforward empirical issue: when we look at the features that appear in the language behaviour of learners from different linguistic backgrounds, which features are common to all learners, which features appear only in those who share a specific L1, and which appear in those whose L1 is one of a given group of languages? In practice, the difficulties begin as soon as we start to list the kinds of features that we want to look for in the data. If the purpose is to identify possible instances of transfer, we need to have an idea of what types of transfer can occur and how they might surface in production. Odlin (1989) organises his survey of transfer into four areas of ‘structural’ transfer:  discourse, semantics, syntax, and phonetics,

phonology and writing systems. Jarvis and Pavlenko (2007:  19–26) propose a more comprehensive ten-dimensional scheme for characterising types of cross-linguistic influence, taking into account the area of language knowledge/use concerned (phonological, lexical, etc.), directionality, cognitive level (linguistic or conceptual), type of knowledge (implicit or explicit), intentionality (intentional or not), mode (productive or receptive), channel (aural or visual), form (verbal or non-verbal), manifestation (overt or covert) and outcome (positive or negative). Combining all of these categories potentially gives thousands of different types of transfer, and yet it could be argued that the taxonomy would still be incomplete, since some of the categories are open to further distinctions. Negative transfer, for instance, includes not just the production of errors (themselves subdividable into further categories), but also overproduction and underproduction. The second difficulty is that just as there are many potential types of transfer, there are also many variables that might affect whether or not transfer actually takes place. Odlin (1989: 129–50) discusses a number of these ‘non-structural factors’ (individual variation, age of acquisition, linguistic awareness and social context), to which could be added other variables such as the frequency of the linguistic items concerned, the amount and type of exposure, level of education, other languages known, degree of awareness, the type of task, the geographical, historical or typological proximity of the languages involved, markedness, etc. (see R. Ellis 1994: 315–35; Jarvis 2000; Ellis and Larsen-Freeman 2006; see also Chapter 18, this volume, on variability in learner language). Disentangling all of these factors and understanding their interplay is a formidable task, requiring large amounts of the right kind of data and appropriate methodologies, so it is perhaps not so surprising that, despite many years of transfer research, there are still many questions unanswered.

2.3 Methodology

The methodology of comparison-based transfer studies essentially turns around the choice of what to compare with what, and for what purpose. To take a relatively straightforward example, a researcher interested in investigating possible transfer from the learner’s first language (A) to the target language (T) could use any of the types of data listed in Table 15.1. A traditional contrastive analysis, as described by Lado (1957; see above), will list similarities and divergences between languages A and T and, in its ‘strong’ form, predict the difficulties that learners will encounter. These predictions can then be verified by checking whether the problems actually appear in samples of ILA. However, this does not necessarily demonstrate that their presence is due to transfer, because other possible causes have not been ruled out.

Table 15.1. Data for studying transfer between languages

                      Samples of …                   By speakers of …     Data type
1st language data     Language A                     Language A           L1A
                      Language T                     Language T           L1T
Interlanguage data    Language T                     Language A           ILA
                      Language T                     Languages B, C…      ILB,C…
External data         Parallel (translation) texts   Languages A and T    A//T
                      Comparable texts               Languages A and T    A↔T

The identification of these other causes was the main concern of the interlanguage studies of the 1970s, which took as their starting point the analysis of selected data from ILA,B,C… in order to make hypotheses about the nature of learners’ approximative systems (Nemser 1971) or idiosyncratic dialects (Corder 1971) and the interaction of factors in determining them. These studies were clear that further discussion of language transfer needed to be held in abeyance until these questions were resolved: ‘Until the role of some of these other factors is more clearly understood, it is not possible to evaluate the amount of systemic interference due to language transfer alone’ (Richards and Sampson 1974: 5). A less restrictive approach to transfer analysis involves a reconsideration of the predictive role of contrastive analysis which, although overemphasised by early proponents, usefully brings out features that can be compared against those found in error analysis and learner production data. As James (1998: 6) puts it succinctly, ‘TA [Transfer Analysis] is something salvaged from CA [Contrastive Analysis] and added to EA [Error Analysis]’. The ‘Integrated Contrastive Model’ (ICM) proposed by Granger (1996) combines computer-based contrastive analysis (CA) and Contrastive Interlanguage Analysis (CIA). The first (CA) uses comparable corpora and/ or parallel corpora (A↔T and A//T) to investigate the degree of similarity between the two languages. The second (CIA) compares ILA both with ILB,C… and with L1T, i.e. interlanguage data from speakers of language A  is compared with that from speakers of other languages, and with native-speaker data. In this way, transfer effects can be gradually pinned down through ‘constant to-ing and fro-ing’ (Granger 1996:  46)  between the comparisons. If, for example, comparison of ILA and L1T reveals that a certain feature is less frequent in the interlanguage of speakers of language A  than in comparable productions of native speakers, then this apparent underuse is a candidate transfer effect. Comparisons using comparable and parallel corpora (A↔T and A//T) will confirm whether or not the feature is indeed less frequent in language A than in language T. If it is, then returning to the interlanguage data and comparing ILA with ILB,C… will show whether underuse is specific to speakers of language A,

in which case there is a good case for transfer, or is a general characteristic of learner interlanguage. For a more detailed presentation of this approach, and an example of its application to causative constructions in interlanguage, see Gilquin (2000/2001).

Jarvis (2000) proposes a ‘Unified Framework’ for studying L1 influence. Along with two other components – a definition of transfer and a list of external variables to be controlled – this framework defines three types of evidence that must be considered in order to make a case for or against L1 influence (Jarvis 2000: 252–9; Jarvis and Pavlenko 2007: 41–8). In a revised version of the framework, Jarvis (2010) includes an additional type, thus giving the four types of evidence listed below, which result from comparisons both within and between groups and within and between languages.

1. Intra-group homogeneity (within-group similarities): do all learners who speak the same L1 perform in the same way when using the target language (L2)? Answers to this question can be found by comparing individual performances within ILA.
2. Inter-group heterogeneity (between-group differences): do comparable learners who have different L1s perform differently when using the same L2? This involves comparing the interlanguage data from different groups, ILA and ILB,C….
3. Cross-linguistic performance congruity (between-language similarities): is there a parallel between learners’ use of a feature in the L2 and their use of a corresponding feature in their L1? For transfer effects from just one source language, this means comparing ILA and L1A. If more than one source language is concerned, then the comparisons will also require data from L1B,C… to be compared with ILB,C….
4. Intralingual contrasts (within-language differences): are there differences in learners’ performance on features in a given target language according to how those features correspond to features in their first language? This would involve listing features in Language T that correspond to features in Language A and those that do not, rather as in a classic contrastive analysis, and then measuring whether there are significant differences in learners’ performance with respect to these two lists of features.

Gilquin (2008b) describes a three-phase model of ‘detection, explanation, evaluation’ (DEE), which specifies the types of comparison that can be useful at each phase. In the detection phase, two types of comparison are used. The first is a comparison between the learners’ interlanguage and their L1 (i.e. ILA compared with L1A), to test for what Jarvis calls cross-linguistic congruity. The second comparison is with the interlanguage of learners who have different L1s (i.e. ILA compared with ILB,C…), to establish whether there is heterogeneity between groups. In the explanation phase, comparisons between parallel corpora (A//T) and comparable corpora (A↔T) are used to establish the degree of cross-linguistic
similarity, with a view to providing an explanation for the presence or absence of transfer. Finally, in the evaluation phase, two remaining types of comparison can be used, not to provide evidence that transfer is taking place, but to judge whether it is pedagogically important. Comparing ILA with L1T, the interlanguage of language A speakers with native-speaker usage of target language T, can serve to distinguish between positive and negative transfer, while comparisons within ILA can be used to discover to what extent a transfer problem is widespread and therefore merits pedagogical attention. The methodologies outlined above make use of similar types of comparisons, but in different combinations, not always in the same order, and for slightly different purposes. The more recent methodologies, post-EA onwards, all share a conviction that one type of evidence is not enough to identify transfer effects, and that it is necessary to associate two or more comparisons before making claims about the presence or absence of transfer effects. One respect in which the models differ is in relation to the use of native-speaker data. Whereas Jarvis’s (2000, 2010) framework uses L1T data for examining intralingual contrasts as just one of the four types of evidence available, both the ICM and DEE models explicitly include comparison between ILA and L1T as a phase in the study. As the authors acknowledge (Granger 2004: 132–3; Gilquin 2008b: 14–15), this inevitably raises the issue of the ‘comparative fallacy’ (Bley-Vroman 1983), namely whether comparison with native-speaker usage prevents the interlanguage from being looked at in its own terms, rather than being seen simply in terms of deviation. If this is the only point of comparison, then undoubtedly there is a danger of falling into a comparative fallacy, but combining it with other comparisons, as recommended by the methodologies in question, largely obviates this risk. It could also be argued that ILA/L1T comparisons are, after all, not entirely fallacious. For many language learners, the productions of native speakers are a major source of input, and it is legitimate to ask to what extent their interlanguage grammars correspond to those which have provided the input (see Lardiere 2003). Finally, the usefulness of these comparisons will be determined by the purposes of the study. The last ten years have seen considerable interest in advanced learner varieties (Labeau and Myles 2009) and in the characteristics of ‘near-nativeness’ (Sorace 2003) or ‘nativelikeness’ (Abrahamsson and Hyltenstam 2008, 2009). As the terms suggest, a major aim of these studies is to determine whether, and in what respects, the performance of such speakers differs from that of native speakers, and why. For this, comparisons between the two groups are obviously indispensable, and may incidentally shed interesting light on native performance at the same time. The methodologies outlined above also differ in the importance that they accord to computer-based data. Granger’s ICM model and Gilquin’s

DEE model both use corpus data at all stages of the investigation, whereas Jarvis illustrates his framework through a study based on experimental data. There is no reason in principle, though, why it could not be successfully applied to corpus data (see Jarvis and Pavlenko 2007: 230). With SLA research using increasingly large data sets, and corpus-based investigations complementing their analyses with other types of data, the borderline between corpus-based and experimental research has become increasingly tenuous: ‘Not only do we need large datasets in order to be able to generalise our findings, but some of the structures which are crucial for informing current debates are rarely found in learner data. They therefore either must be elicited specifically, or large datasets are needed in order to maximise their chance of being present’ (Myles 2005:  376). Strictly speaking, all learner corpora are built from elicited data, since they contain language that has been produced in response to an external task. Whether these data are authentic in the sense of being ‘the genuine communications of people going about their normal business’ (Sinclair 1996) is really a question of degree (see also Chapter 2, this volume). A dissertation written by a non-native speaker is clearly a part of normal academic business; writing an essay is a normal class activity, but is more restricted to the context of language learning; oral tasks may be simulations of things people do in their normal business, and so on. There is no clear cut-off point between authentic and contrived data. Similarly, there is a continuum of methodological approaches to learner corpus data, from corpus-based or hypothesis-driven approaches to those which are corpus-driven or hypothesis-finding (Granger 1998a:  15–16; Barlow 2005: 344).
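To make the comparisons outlined in this section concrete, the sketch below computes per-text relative frequencies of a single feature in three invented datasets: the interlanguage of speakers of language A (ILA), that of another L1 group (ILB) and native-speaker production (L1T). The corpora and the feature are stand-ins, and a real study would add the A↔T and A//T comparisons and inferential statistics.

```python
# Frequency profiles for one feature across (hypothetical) corpora, in the
# spirit of CIA: group means suggest over-/underuse, per-text values show
# intra-group (in)homogeneity.
from statistics import mean

def rel_freq(text, feature):
    """Relative frequency of a feature per 1,000 tokens in one text."""
    hits = sum(1 for token in text if token.lower() == feature)
    return 1000 * hits / len(text) if text else 0.0

# Invented miniature corpora: lists of tokenised texts.
il_a = [["he", "left", "early", "yesterday"], ["she", "left", "too"]]
il_b = [["however", "he", "left"], ["she", "left", "early", "however"]]
l1_t = [["however", "she", "stayed"], ["however", "he", "waited", "there"]]

feature = "however"
for name, corpus in [("IL_A", il_a), ("IL_B", il_b), ("L1_T", l1_t)]:
    per_text = [rel_freq(text, feature) for text in corpus]
    print(name, "mean:", round(mean(per_text), 1), "per text:", per_text)

# Apparent underuse in IL_A relative to both L1_T and IL_B makes the feature
# a candidate transfer effect, to be checked against A<->T and A//T data.
```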

3 Representative studies

This section presents three examples of learner-corpus-based transfer studies in, respectively, lexis, grammar and discourse.

3.1 Paquot, M. 2013. ‘Lexical bundles and L1 transfer effects’, International Journal of Corpus Linguistics 18(3): 391–417.

After pronunciation, vocabulary is undoubtedly the area of L2 use where transfer effects are the most immediately perceptible. It is also the area where transfer is most likely to lead to misunderstanding, embarrassment or involuntary amusement. Johansson (2008: 9) quotes a Newsweek report of Ingmar Bergman as saying, ‘I still have a tremendous amount of lust to make movies’. Lado (1957: 84) considers such ‘deceptive cognates’ to be ‘sure-fire traps’, and remarks that ‘their similarity in form to words in the native language raises their frequency in student usage above that normal for the language’. Although inventories or dictionaries of ‘false-friends’
have been available at least since the eighteenth century (for an extensive bibliography of dictionaries of false friends, see www.lipczuk.buncic.de/, last accessed on 13 April 2015), more diffuse forms of lexical transfer, involving under- and overuse of lexical items, collocational patterns in lexis, lexical bundles and other phraseological phenomena, are more difficult to study without corpus data. They have largely escaped notice in traditional error analysis and interlanguage studies because they do not necessarily result in clearly erroneous usage or misunderstanding. For many language learners, modelling their usage on the collocational or phraseological preferences of native speakers is not a major issue; they have more immediate communicative needs. But for more-advanced learners, wishing to use the language for certain professional or academic purposes, developing native-like patterns may be of interest. It is thus useful to be able to identify such patterns and the ways in which learner usage diverges from them, and why.

The study presented here (Paquot 2013) and a follow-up study (Paquot 2014) apply Jarvis’s (2000) methodological framework, discussed above, to the investigation of possible transfer effects in the use of lexical bundles (i.e. recurrent sequences of words) by French-speaking learners of English. The initial study focuses on three-word sequences containing a verb and addresses two related questions: to what extent can French learners’ use of lexical bundles be attributed to L1 influence, and what kind of transfer (of form, of function) is most noticeable? In order to make the necessary comparisons, within and between groups, the study draws on several corpora. Learner data were taken from the first version of the International Corpus of Learner English (ICLE) (Granger et al. 2002). The structure of ICLE into subcorpora, each corresponding to a different L1, makes it possible to investigate both intra-group homogeneity (within the French subcorpus) and inter-group heterogeneity (between the French and the nine other ICLE subcorpora used for the study). The third type of evidence, cross-linguistic congruity (i.e. whether there is a parallel between the French learners’ usage in L2 English and a corresponding feature in L1 French), requires data from corpora of French writing. Three such corpora were used: the French WaCKy corpus (frWaC; wacky.sslmit.unibo.it/doku.php?id=corpora, last accessed on 13 April 2015), the humanities component of the Scientext corpus (scientext.msh-alpes.fr, last accessed on 13 April 2015) and the Corpus de Dissertations Françaises (CODIF), a corpus of native French essay writing compiled at the Université catholique de Louvain.

Three-word lexical bundles were first extracted from the ICLE French subcorpus using WordSmith Tools 5 (Scott 2008) and then checked by hand to select those – 273 in all – that contained a verb. Relative frequencies for these 273 three-word sequences were then calculated across
all ten of the ICLE subcorpora, and statistical tests were run to determine whether there were significant differences in use among them. A  one-way between-groups analysis of variance (ANOVA) identified eighty-seven lexical bundles that showed differences in use among the subcorpora. To determine to what extent a specific group, in this case the French-speaking learners in ICLE, was responsible for the difference, a multiple comparison procedure (Dunnett’s test) was used. This identified thirty-four bundles – sequences such as be tempted to or we may wonder – that showed significant differences between the French subcorpus and at least half of the other subcorpora. In order to exclude bundles whose frequency was possibly due to the topic of the texts in which they appeared (for example, the recurrence of the sequence say that Europe in essays about the future of Europe), only those bundles which appeared in a range of essays were retained, leaving twenty lexical bundles for which topic influence could reasonably be ruled out. These were then used to examine the third type of evidence, cross-linguistic congruity. For the majority of the IL bundles in question, formally equivalent bundles were found in the L1 French corpora, for example être tenté(es) de for be tempted to. For the remaining bundles, all containing first-person we, French bundles with equivalent meaning or function were found, for example, on peut se demander for we may wonder. Combined evidence of intra-group homogeneity, inter-group heterogeneity and cross-linguistic congruity thus provides a good case for claiming that more than half of the French-speaking learners’ idiosyncratic use of lexical bundles (20 out of the 34 sequences that showed significant differences in use compared with the other subcorpora) can be attributed to influence from their L1. A follow-up study (Paquot 2014) uses a similar methodology on a larger data set, covering two- to four-word bundles without restriction on their composition (i.e. not necessarily containing a verb). The main methodological difference is in the statistical treatment, using pairwise Wilcoxon rank sum tests rather than ANOVA and post-test, to compare the French learners’ use with that of the nine other groups in ICLE. With a larger data set and more restrictive statistical tests, which selected a lower percentage of significant lexical bundles and did not identify many of the bundles found to be L1-induced in the earlier study, the proportion of French-speaking learners’ lexical bundles attributable to cross-linguistic influence is lower than in the earlier study, but L1 influence is clearly present nevertheless. This follow-up study also adds two new findings. Firstly, the extent of L1 influence increases with the size of the lexical bundles, accounting for 12.3% of two-word sequences, but 20% of four-word sequences. Secondly, some of the L1-induced bundles appearing in this larger data set do not just display idiosyncratic preferences or comparative overuse but are semantically inappropriate (even if instead of even though; on the contrary rather than on the other hand).
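The core of this pipeline can be sketched in a few lines. The miniature ‘essays’ below are invented stand-ins for the ICLE subcorpora, and the code shows only the comparison step: per-essay relative frequencies of one bundle, a one-way ANOVA across L1 groups (as in the 2013 study) and pairwise rank-sum tests (as in the 2014 follow-up). A post-hoc procedure such as Dunnett’s test would still be needed to confirm that the French group drives any overall difference.

```python
# Per-essay relative frequencies of one lexical bundle, compared across
# (invented) L1 subcorpora; requires scipy.
from scipy.stats import f_oneway, mannwhitneyu

def bundle_freq(essay, bundle):
    """Relative frequency of one three-word bundle per 1,000 trigrams."""
    tokens = [t.lower() for t in essay]
    grams = list(zip(tokens, tokens[1:], tokens[2:]))
    return 1000 * grams.count(bundle) / len(grams) if grams else 0.0

bundle = ("be", "tempted", "to")    # one of the 273 verb bundles
subcorpora = {                      # hypothetical stand-ins for ICLE data
    "FR": [["we", "may", "be", "tempted", "to", "say"],
           ["one", "might", "be", "tempted", "to", "agree"],
           ["this", "claim", "is", "true"]],
    "DE": [["we", "should", "state", "this", "clearly"],
           ["this", "claim", "is", "true", "indeed"],
           ["few", "would", "be", "tempted", "to", "argue"]],
    "ES": [["we", "can", "say", "this"],
           ["it", "is", "quite", "clear"],
           ["this", "view", "seems", "right"]],
}
per_group = {L1: [bundle_freq(e, bundle) for e in essays]
             for L1, essays in subcorpora.items()}

print(f_oneway(*per_group.values()))      # Paquot (2013): one-way ANOVA
for L1 in ("DE", "ES"):                   # Paquot (2014): rank-sum tests
    print(L1, mannwhitneyu(per_group["FR"], per_group[L1]))
```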

Paquot concludes that the transfer effects identified in her two studies lie at the interface between lexical transfer and discursive transfer (cf. Jarvis and Pavlenko 2007: 20) and fall into four major types: transfer of semantic properties, transfer of collocational and colligational preferences, transfer of functions and discourse conventions, and transfer of L1 frequency. Together, the two studies are a good illustration of how close analysis of learner corpora can not only provide a sound methodological basis for claims about L1 transfer but also broaden the scope of lexical transfer studies beyond traditional questions of form–meaning correspondence.

3.2 Spoelman, M. 2013a. ‘Prior linguistic knowledge matters: The use of the partitive case in Finnish learner language’, Acta Universitatis Ouluensis. Series B, Humaniora 111.

Morphology and syntax have received considerable attention in second language acquisition, starting well before the advent of computerised learner corpora. Grammatical features are recurrent in language, errors are relatively easy to detect, and data can be elicited from production tasks, acceptability judgements, etc. What, then, is the added value of using learner corpora to study grammatical transfer? The most obvious answer is that many grammatical choices are determined by interaction between morphology, syntax, semantics and pragmatics; questions that arise at the interface typically depend on numerous contextual factors and will be difficult to answer without sufficient data for patterns to become discernible. The example presented here (Spoelman 2013a) uses the methodologies discussed above in Section 2.3 – the Integrated Contrastive Model and Jarvis’s (2000, 2010) framework – to investigate the use of the partitive case in L2 Finnish by L1 speakers of Estonian, Dutch and German.

The partitive case represents a severe learning challenge for L2 speakers of Finnish. It is one of the most frequently used cases, but its use ‘remains a constant struggle for learners of Finnish’ (Spoelman 2013a: 14). What follows is a very simplified overview of the main issues. For a detailed account of the development and use of the Finnish partitive, readers are referred to the relevant section of Spoelman’s study (ibid.: 25–80). The basic problem for learners of Finnish is to know when it is appropriate to put partitive marking on nominal components in a sentence: objects, subjects and predicatives (i.e. adjective or noun phrases in copula constructions). There are three main factors that can determine the choice between partitive and other cases (nominative, genitive or accusative), briefly characterised below.

1. Aspectual boundedness: this has to do with the aspectual characteristics of the situation referred to, whether it is bounded or unbounded, resultative or irresultative, telic or atelic (i.e. whether or not it has an
inherent end-point). For example, in the Finnish equivalents of (a) The hunter shot [at] the hare [but did not kill it] and (b) The hunter shot the hare [and killed it], the word for ‘hare’ would carry partitive marking in (a) but not in (b).
2. Polarity (affirmative vs negative sentences): for example, in the Finnish equivalents of (a) I did not receive the letter and (b) I received the letter, the word for ‘letter’ would carry partitive marking in (a) but not in (b).
3. Quantitative boundedness: this has to do with (in)definiteness and whether the entity referred to is divisible or indivisible. It is in some respects similar to the mass–count distinction more familiar in English. For example, in the Finnish equivalents of (a) I bought (some) bread and (b) I bought a book, the word for ‘bread’ in (a) would carry partitive marking, but ‘book’ in (b) would not. However, the same lexical item in different contexts can require different case marking. For instance, in the equivalents of (a) Here is some coffee and (b) Here is your coffee, the word for ‘coffee’ would carry partitive marking in (a) but not in (b).

Briefly, all three of these factors can intervene in determining the case marking of object nominals, while polarity and quantitative boundedness are relevant for the marking of subject nominals in existential constructions, and only quantitative boundedness affects the marking of predicatives. In Estonian, a language closely related to Finnish, case marking is largely similar as regards object and subject nominals, apart from some subtle differences; object marking, for example, can differ because certain verbs have a different aspectual reading in Estonian. The case marking of predicatives, however, is substantially different, since partitive predicatives have very limited occurrence in Estonian. As for the two Germanic languages concerned in the study, although German has preserved a four-case declension system, it has no equivalent to the Finnish–Estonian partitive case, while modern Dutch has lost nearly all of its morphological case distinctions. The main research questions addressed by Spoelman’s study are therefore whether the patterns of use, overuse and underuse of the partitive extracted from the Estonian learner corpus are similar to or different from those extracted from the German and Dutch learner corpora, and to what extent any potential similarities and differences can be attributed to cross-linguistic influence.

The learner data for the study were extracted from the Estonian, German and Dutch subcorpora of the International Corpus of Learner Finnish (ICLFI). All of the learner essays were assigned to a Common European Framework of Reference for Languages (CEFR; Council of Europe 2001) level (between A2 and C2) by two independent raters. They were then prepared for analysis in three phases. First, they were tagged for partitive
marking using a Microsoft Word macro designed to identify all possible endings of partitive-case-marked words: regular and irregular endings, morphological exceptions, endings resulting from morphological contractions and incorrectly inflected forms. Second, all the tagged phrases were classified into the three types described above: partitive objects, partitive subjects and partitive predicatives, using another set of macros. Finally, errors of over- and underuse in the learner productions were manually detected and the corpora were error tagged. WordSmith Tools 5 (Scott 2008) was used to obtain absolute frequencies of partitive objects, subjects and predicatives, and to obtain error frequencies from the learner corpora. Altogether, five frequency measures were used for the analysis:

1. total partitive-case-marked objects, subjects and predicatives
2. partitive overuse errors
3. correctly used partitive objects, subjects and predicatives (calculated by subtracting all the occurrences in category 2 from those in category 1)
4. partitive underuse errors
5. partitive requiring contexts (calculated by adding together categories 3 and 4).

An online log-likelihood calculator (ucrel.lancs.ac.uk/llwizard.html, last accessed on 13 April 2015) was used for comparison of the frequencies of partitive objects, partitive subjects and partitive predicatives observed from the different learner corpora and those observed from the reference corpus (a subset of the Native Finnish Corpus), which is a subcorpus of the Corpus of Translated Finnish compiled at the University of Joensuu.

Detailed results of these comparisons are given in part 5 of Spoelman’s study (2013a: 199–342), but the main findings can be summarised as follows (see Spoelman 2013b for a more detailed summary, focusing specifically on learners’ use of partitive objects). In the Estonian subcorpus, error rates for overuse were low, with no significant difference between the three types of partitive (partitive objects, subjects and predicatives). For underuse, error rates were significantly higher for partitive subjects than for partitive objects, and significantly lower than for partitive predicatives. In comparison, the German and Dutch subcorpora showed relatively high rates of overuse for partitive subjects and partitive predicatives, but not for partitive objects. As for underuse, the German subcorpus showed a higher rate of errors for partitive predicatives than for objects or subjects, whereas the Dutch subcorpus showed no striking differences in underuse error rates for the three types of partitive. In view of the finding that Estonian learners’ underuse error rates are higher for partitive subjects than for partitive objects, and higher
again for partitive predicatives, Spoelman concludes that L1–L2 similarities in the marking of partitive objects are an aid to learners, while differences between Finnish and Estonian existential sentences result in negative L1 influence on partitive subject marking, and the lack of nominative–partitive opposition in Estonian predicatives results in a higher underuse rate still for predicatives in L2 Finnish. The patterns observed in the German and Dutch subcorpora, different from those of the Estonian learners, show no evidence of L1 influence. Instead, Spoelman suggests that the frequency of basic non-inflected objects and predicatives can be attributed to restrictive simplification, while overuse of the partitive is plausibly a consequence of overgeneralisation.

As regards language transfer, the most interesting aspect of this study is that it makes a case for the simultaneous operation of positive and negative transfer effects between related languages, Estonian and Finnish, effects which are absent from the German and Dutch subcorpora. From a methodological point of view, it is also clear that the complex interplay between over-/underuse of morphological marking and the syntactic and semantic conditions that determine case marking could not have been successfully studied without careful analysis of learner corpus data.
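The statistic computed by calculators of this kind is straightforward to reproduce. Below is a minimal implementation of the standard two-corpus log-likelihood (G2) measure, applied to invented counts rather than Spoelman’s actual figures; the comments also restate the arithmetic linking her five frequency measures.

```python
# Two-corpus log-likelihood (G2), as computed by calculators like the
# UCREL wizard cited above; the counts in the example are invented.
from math import log

def log_likelihood(freq1, size1, freq2, size2):
    """G2 for a feature occurring freq1 times in a corpus of size1 tokens
    and freq2 times in a corpus of size2 tokens."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    g2 = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:               # a zero count contributes nothing
            g2 += observed * log(observed / expected)
    return 2 * g2

# Spoelman's derived measures follow directly from the five counts above:
#   correctly_used    = total_tagged   - overuse_errors    (measure 3)
#   required_contexts = correctly_used + underuse_errors   (measure 5)

# e.g. partitive subjects in a learner subcorpus vs a native reference
# corpus; with one degree of freedom, G2 > 3.84 is significant at p < 0.05.
print(f"G2 = {log_likelihood(120, 50_000, 410, 100_000):.1f}")  # G2 = 29.1
```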

Downloaded from https:/www.cambridge.org/core. University of Liverpool Library, on 07 Mar 2017 at 07:08:21, subject to the Cambridge Core terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781139649414.015

Transfer and learner corpus research

349

Table 15.2. Corpora used in Neff van Aertselaer (2008) Novice Learner ILA ICLE-SP

Native L1T LOCNESS

Expert Target language L1T English editorials

First language L1A Spanish editorials

as they are reflected in the text and so, ‘at least initially, analysis must begin with surface structure because that is what a text is’ (ibid.: 20). Most corpus-based studies of discourse have therefore attempted to discover how the relative frequency of certain surface features may reflect preferred patterns of argumentation, cohesion, author stance, hedging, etc. The features in question may be lexical items, collocations, lexical bundles (Cortes 2004; Hyland 2008b), vocabulary-based discourse units (Biber, Csomay, Jones and Keck 2004) or syntactic units (Demol and Hadermann 2008; Callies 2008); whatever their linguistic status, their presence in the text is potentially an indicator of how learners introduce, organise and contextualise ideas and information. Interpretation of these features, however, also needs to take into account the fact that the texts have been produced by people who are not only language learners but also, in many cases, inexpert writers in any language. Consequently, their discourse style may in fact be influenced by three factors: discursive transfer, linguistic limitations on the repertoire that they can choose from, and their degree of expertise as writers and familiarity with different text types. Finally, as Connor (2002) suggests, there may be a diachronic dimension to contrastive studies, as existing cultural differences are blurred by the emergence of, for example, new patterns of ‘Eurorhetoric’, or as a globalised business environment leads to a more homogenised style in texts such as job application letters. In the study presented below, Neff van Aertselaer (2008) examined possible effects of discourse transfer in learner writing, using the two principal dimensions of comparison suggested above, namely not only learner vs native-speaker, but also ‘novice’ vs ‘expert’ writer. The learner data for the study came from the Spanish component of ICLE, and the novice native data from the Louvain Corpus of Native English Essays (LOCNESS). The expert data were taken from the English–Spanish contrastive corpus of editorials held at the Universidad Complutense de Madrid. The corpus sources are summarised in Table 15.2. Specifically, Neff van Aertselaer looked at discourse strategies for engaging the reader in the argumentative process. These ‘interactional phrases’ were grouped into three patterns: 1. hedging expressions formed with it + adjective + extraposed clause (It is possible/likely that…) or adverbs (possibly, probably) and their Spanish equivalents (Es posible/probable, posiblemente, probablemente) Downloaded from https:/www.cambridge.org/core. University of Liverpool Library, on 07 Mar 2017 at 07:08:21, subject to the Cambridge Core terms of use, available at https:/www.cambridge.org/core/terms. https://doi.org/10.1017/CBO9781139649414.015


2. certainty expressions and adverbs (It is clear/certain…, clearly, certainly) and their Spanish equivalents (Es cierto…, claramente, seguramente)

3. impersonalising passive constructions (It is known, It is said) and the corresponding reflexive constructions in Spanish (Se dice).

The purpose of the study was to 'distinguish novice writer features from preferred or non-preferred rhetorical features' (Neff van Aertselaer 2008: 85). Comparing the frequency of the three types of interactional phrases across the four corpora gave mixed results. For the first type, contrasting ILA with L1T experts showed a low occurrence of hedging phrases in the ICLE-SP texts compared with the English expert texts. This could be due to the fact that the Spanish learners are novice writers who have not yet mastered the conventions of hedging in academic writing. But the frequency of hedging phrases in LOCNESS texts, also written by novices, is slightly higher than in the English expert texts, suggesting that low use of hedging phrases is in fact not a general effect of novice writing. The possibility that it may be a result of transfer is given support by the significantly lower rate of hedging phrases in the last group of texts, those written by Spanish experts, as compared with English expert texts.

For the second type of interactional phrase, no significant difference was found in the number of certainty expressions used across the four corpora, although both novice-writer groups did use more 'forceful' adjectives/adverbs of certainty (e.g. it is obvious that) than the expert texts. The third interactional strategy, the use of impersonalising passive constructions, was more frequent in both the Spanish learner texts and the Spanish expert texts than in the English novice and expert texts, suggesting that the use of phrases like it is believed that… or it is known that… in ICLE-SP is the result of transfer.

Neff van Aertselaer concludes that there are in fact three factors that influence the use of interactive phrases by Spanish-speaking learners of English. Two are developmental factors, either linguistic (incomplete mastery of English modality, including modal adverbs) or authorial (inappropriate use of forceful adjectival phrases and adverbs). The third is a transfer factor: a preference for reflexive impersonal constructions and fewer lexical hedging phrases, influenced by L1 Spanish.

There are important practical lessons to be drawn from such comparisons, not only for the teaching of academic writing, which Neff van Aertselaer (ibid.: 97) herself mentions, but also, one might add, for language assessment. If the kinds of features described above at least partly reflect differences in the writers' rhetorical preferences, then it is reasonable to suppose that, as readers, they will also have partly differing expectations as to what makes a good text. Consequently, the importance accorded to criteria for judging text quality may well vary from one reader-assessor to another (see Carlsen 2010).


Descriptors such as those used in the CEFR (Council of Europe 2001) are based on experienced teachers' intuitions about the typical characteristics of texts produced by learners at different levels of proficiency. But these need to be checked against the evidence of learner corpora, not only to establish whether they correspond to the features actually found in learner texts, but also to determine whether unconventional features result from lack of linguistic control or from cultural transfer in discourse. A similar question, incidentally, could be asked in the reverse direction, about the way in which rhetorical transfer may affect learners' ability to comprehend the structure of texts, but that is another issue.
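To make the kind of frequency comparison reported above concrete, the sketch below counts a small set of hedging phrases in each of four corpus files and normalises the counts per 10,000 words. It is a minimal illustration only: the file names and the phrase inventory are invented for the example and are not Neff van Aertselaer's actual search terms.

```python
import re

# Illustrative hedging patterns of the 'it + adjective + extraposed
# clause' and adverb types; not the study's actual inventory, and a
# real study would use language-specific patterns for Spanish files.
HEDGES = [
    r"\bit is (?:possible|likely|probable) that\b",
    r"\bpossibly\b",
    r"\bprobably\b",
]

def hedges_per_10k(text):
    """Count hedging-phrase hits, normalised per 10,000 words."""
    words = len(text.split())
    hits = sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in HEDGES)
    return 10_000 * hits / words if words else 0.0

# Hypothetical files standing in for ICLE-SP, LOCNESS and the two
# expert editorial corpora.
for name in ["icle_sp.txt", "locness.txt", "editorials_en.txt", "editorials_es.txt"]:
    with open(name, encoding="utf-8") as f:
        print(name, round(hedges_per_10k(f.read()), 2))
```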

4 Critical assessment and future directions

Transfer studies have often been open to criticism for basing claims on insufficient evidence. It is not enough to identify errors in learner production and attribute them to transfer solely on the grounds that similar forms exist in the learners' L1, if other possible sources of error have not been reasonably ruled out. Using corpus data does not of itself answer this criticism, of course, and learner corpus studies of cross-linguistic influence can suffer from various shortcomings. Examining only one type of evidence is one; others include reporting differences without using appropriate statistical tests to verify that the differences are significant, or giving insufficient information about the tests used (see Chapter 8, this volume), comparing corpora that are not strictly equivalent, and providing insufficient information about the learners involved, the types of tasks or the sampling of learner productions for the studies to be replicable. Nevertheless, we hope to have shown that the use of learner corpora, coupled with clearly defined research questions and an appropriate methodology, has already made a significant contribution to the complex and long-standing question of transfer in language learning. Continuing exploitation of existing data and the development of new corpora have the potential to complement, refine and diversify the research already carried out, in ways that will be suggested below.

As Hasselgård and Johansson (2011: 37) observe, the role of learner corpora has been important in expanding the empirical basis for transfer studies and in making them more easily replicable: 'Whereas earlier work was generally limited in scale and range, it now became possible to increase the size and variety of the material; and whereas the material used earlier rarely went beyond the individual researcher, the new electronic corpora could be developed as research tools to be used more generally by scholars in the field'. Hasselgård and Johansson go on to suggest some areas in which the empirical basis needs further diversification and where work is still in progress (ibid.: 56):


corpora that include more different registers, as in the Varieties of English for Specific Purposes Database (VESPA),5 and large-scale longitudinal corpora, such as the Longitudinal Database of Learner English (LONGDALE).6 Respectively, these can provide data for examining questions about domain-related transfer (e.g. are there specific types of, say, phraseological or rhetorical transfer according to specialised text types?) and about the evolution of transfer effects at different stages of acquisition. To these could be added the continuing development of learner corpora in other target languages and of multilingual comparable corpora, not only for studying possible transfer effects between different pairings of languages, but also for looking at those cross-linguistic phenomena which 'could not exist without a minimum of three languages in the mind' (De Angelis and Dewaele 2011a: vii).

The age-spread of learner corpora is another area ripe for diversification. To date, most learner corpora have focused on young adults, particularly in a university context. The development, in many European countries, of pre-secondary-level language learning and of associated research (Nikolov 2009) offers new opportunities to study how cross-linguistic influences operate in younger minds.

Another fruitful area for research lies in the development of oral and multimodal corpora (see Chapter 2, this volume). This involves time-consuming work of transcription and annotation, particularly if all the features of oral production – pauses, hesitations, backtracking, incomplete units, interruptions, etc. – are to be preserved in the transcription, but spoken language corpora allow for the detection of transfer phenomena that might not appear in written production and provide evidence of how learners backtrack to deal with problems that they themselves have noticed (Kormos 1999, 2000). They also add entirely new dimensions to transfer research, such as cross-linguistic effects in hesitation phenomena (Rose 2013) or the influence of phonological transfer on listeners' perception of fluency in learner speech (Götz 2013: 147–68). Fully multimodal corpora extend the possibilities of analysis to paralinguistic features in speech, and particularly to the use of gestures (for overviews of gesture in second language learning see McCafferty and Stam 2008; Gullberg 2008, 2010).

A good example of this is work by Brown and Gullberg (A. Brown 2007; Brown and Gullberg 2008) on the encoding of motion events in Japanese and English by Japanese monolinguals, Japanese learners of L2 English and English monolinguals. The focus of their study was how these speakers encoded path (i.e. the direction of movement) and manner (i.e. how the movement is effected), either verbally or by gesture. To classify gestural encoding, they used McNeill's (2000: 53–5) distinction between 'manner fog', where the manner of movement is not encoded verbally, but is added in an accompanying gesture (as is more usual in 'verb-framed' languages such as Japanese), and 'manner modulation', where manner is encoded in speech, but not in the accompanying gesture, which encodes only path (as is more common in 'satellite-framed' languages such as English).

5 www.uclouvain.be/en-cecl-vespa.html (last accessed on 13 April 2015).
6 www.uclouvain.be/en-cecl-longdale.html (last accessed on 13 April 2015).


Their results showed two kinds of transfer taking place in the Japanese learners' productions. Both in L1 and in L2 speech, the learners encoded manner less than monolingual English speakers, thus showing signs of forward transfer from their L1 on to the L2. In gesture, their occasional use of 'manner fog', entirely absent from the English monolingual productions, also suggests L1–L2 transfer. Conversely, their use of path-only gestures indicates the opposite tendency: a de-emphasising of manner, typical of satellite-framed languages like English, being transferred to their L1. If the non-monolingual speaker's language competence is viewed as a whole, then this two-way transfer can be seen as a convergence between two language systems, a process in which 'distributional frequencies in the L1 and L2 begin to merge' (A. Brown 2007: 360).

Transfer studies can also benefit from methodologies developed for corpus-driven contrastive analysis outside language learning. For example, where languages have shared properties, there may nevertheless be probabilistic differences in the way they deploy common structures (see Wiechmann 2011 for an example of differing preferences in English and German relative constructions). The methodologies and statistical tools used in such studies, while probably unfamiliar to most interlanguage researchers, offer possibilities for more complex and finer-grained transfer studies through the convergence of interlanguage and contrastive approaches. Fruitful interactions between corpus research methods and those used in linguistics, psychology and second language acquisition (see Gilquin and Gries 2009) also provide new ways of examining the complexities of language transfer. This is particularly so in the investigation of conceptual transfer (Odlin 2005, 2008; Jarvis and Pavlenko 2007: 112–52; Pavlenko 2011), where evidence of transfer both in production and in reception can be derived from speech data, naming tasks, sentence interpretation, classification and similarity tasks, acceptability judgements and self-reports.

Finally, in the development of frameworks for understanding the interaction of languages in the mind, we need to accommodate not just transfer between pairs of languages (whether positive or negative, forward or reverse), but all of the possible consequences of what Cook (2003a: 2) has termed multi-competence, i.e. 'knowledge of two or more languages in one mind'. A promising direction in recent research is to view this multiple evolving knowledge as a complex system composed of subsystems for different languages (Larsen-Freeman 1997), which 'interact and have fuzzy borders' (de Bot et al. 2013: 213). Within this framework, the purpose of transfer studies will be to determine in what ways learning is shaped by previous experience of one or more languages:


Such shaping in SLA has traditionally been referred to as transfer, whereby learners' initial experience in using their first language leads to a neural attunement to the first language, which affects their second language learning experience (Ellis and Larsen-Freeman 2006). Of course, the effects of this experience can occur in third and subsequent language acquisition as well, making the mix of these dynamic systems all the more complex. (Larsen-Freeman and Cameron 2008a: 133)

Teasing apart the factors in this complex mix requires convergence between different disciplines and methodological approaches. Learner-corpus-based research has at least three important contributions to bring to the collective endeavour: a concern for language samples that are as uncontrived as possible, methods for annotating and comparing large sets of data, and the data themselves, which, thanks to the rapid development of interest in learner corpora, are available in an increasing variety of language mixes, registers, written and spoken modes, learning contexts and cross-sectional or longitudinal designs. They constitute a cumulative resource; if the corpora remain available to the research community after the project that originally generated them is finished, then data collected at any time can be revisited with new techniques. Ellis and colleagues (Ellis and Ferreira-Junior 2009a; Ellis and Larsen-Freeman 2009), for example, use data originally collected more than twenty years earlier in the European Science Foundation (ESF) project (Perdue 1993) to investigate the effects of frequency, prototypicality and contingency of form–function mapping on the acquisition of verb-argument constructions. Similarly, although it is more than twenty years since the first launch of ICLE, new techniques continue to be applied to its data, notably in the development of an automatic detection-based approach to transfer (Jarvis and Crossley 2012; Chapter 27, this volume).

Weinreich (1953: 11) described interference in speech as being 'like sand carried by a stream', as distinct from habitualised borrowings in language, which are 'the sedimented sand deposited on the bottom of a lake'. By the same analogy, transfer in interlanguage is like unsettled sand carried by multiple currents. To understand patterns in its movement, the combination of methods from corpus linguistics, contrastive analysis, second language acquisition and psychology offers fruitful ways of analysing both new and existing data.

Key readings

Odlin, T. 1989. Language Transfer: Cross-linguistic Influence in Language Learning. Cambridge University Press.


Although edited volumes of papers devoted to language transfer appeared earlier in the 1980s (Gass and Selinker 1983; Kellerman and Sharwood Smith 1986), Odlin's book was one of the first full-length overviews of the question, published shortly after Ringbom (1987), which is presently out of print.

Jarvis, S. and Pavlenko, A. 2007. Crosslinguistic Influence in Language and Cognition. London: Routledge.

This is, as the authors state in their preface, the first book-length review of cross-linguistic influence since Odlin (1989). It is the most complete survey of cross-linguistic influence currently available.

Ringbom, H. 2007. Cross-linguistic Similarity in Foreign Language Learning. Clevedon: Multilingual Matters.

Ringbom is one of the pioneers of modern transfer research. Unlike the tradition inherited from error analysis, largely concerned with the effects of negative transfer, Ringbom (2007: 5) starts from the position that '[s]imilarity is basic, difference secondary' and examines the advantages that cross-linguistic similarity brings to the task of learning a language.

Cook, V. (ed.) 2003b. Effects of the Second Language on the First. Clevedon: Multilingual Matters.

This book is devoted entirely to the effects of reverse transfer, exploring the proposition that L2 users' knowledge of their L1 is not the same as that of monolingual speakers. Although most of the studies collected here are not corpus-based, the theoretical and methodological questions that they discuss are of great interest to anyone concerned with the nature of 'multi-competence' in the language user and its consequences on the L1.

Cenoz, J., Hufeisen, B. and Jessner, U. (eds.) 2001. Cross-linguistic Influence in Third Language Acquisition: Psycholinguistic Perspectives. Clevedon: Multilingual Matters.

De Angelis, G. and Dewaele, J.-M. (eds.) 2011b. New Trends in Crosslinguistic Influence and Multilingualism Research. Bristol: Multilingual Matters.

These two volumes, published ten years apart, are both devoted to cross-linguistic influence in speakers of three or more languages. If L2 learners are different from monolinguals, those who have acquired a third or subsequent language are different again, in at least two respects: they come to the learning task as already experienced learners, and the possible directions and effects of transfer are multiplied. Together, the two volumes provide compelling evidence that cross-linguistic influence is a much more complex process than the one-to-one phenomenon that might appear from consideration of just two languages.


Gilquin, G., Papp, S. and Díez-Bedmar, M. B. (eds.) 2008a. Linking up Contrastive and Learner Corpus Research. Amsterdam: Rodopi.

Although the book is not specifically dedicated to questions of language transfer, the papers collected in this volume all look at learner corpora from a contrastive perspective, combining learner corpus analysis with contrastive analysis and/or comparing several learner varieties.

Tono, Y., Kawaguchi, Y. and Minegishi, M. (eds.) 2012. Developmental and Crosslinguistic Perspectives in Learner Corpus Research. Amsterdam: Benjamins.

This edited volume is a product of the International Corpus of Crosslinguistic Interlanguage (ICCI) project, initiated in 2007. An interesting particularity of the ICCI project is that it focuses on productions from younger learners (grades 3 to 12) in seven countries: Austria, China, Hong Kong, Israel, Poland, Spain and Taiwan.


16 Learner corpora and formulaic language in second language acquisition research

Nick C. Ellis, Rita Simpson-Vlach, Ute Römer, Matthew Brook O'Donnell and Stefanie Wulff

1 Introduction

Just how proficient are second language learners in using formulaic language? Do formulaic phrases play a role in second language acquisition (SLA)? These are the two questions to be addressed here using evidence from learner corpus research. Whilst Krashen and Scarcella (1978) argued that formulaic language was outside the creative language process, Ellis (1996) proposed that learners' long-term knowledge of lexical sequences in formulaic phrases serves as the database for language acquisition. The current chapter addresses the apparent paradox whereby analyses of learner language show that second/foreign (L2) learners typically do not achieve native-like formulaicity and idiomaticity (Pawley and Syder 1983; Granger 1998b), whereas longitudinal analyses of learner corpora such as Myles (2004) show that formulaic phrases can provide learners with complex structures beyond their current grammar, and that resolving the tension between these grammatically advanced chunks and the current grammar drives the learning process forward.

Usage-based theories of language hold that L2 learners acquire constructions from the abstraction of patterns of form–meaning correspondence in their usage experience, and that the acquisition of linguistic constructions can be understood in terms of the cognitive science of concept formation, following the general associative principles of the induction of categories from experience of the features of their exemplars (Robinson and Ellis 2008; Hoffmann and Trousdale 2013).


In natural language, the type–token frequency distributions of the occupants of each part of a construction, their prototypicality and generality of function in these roles, and the reliability of mappings between these all affect the learning process. Child-language researchers (Tomasello 2003; Lieven and Tomasello 2008) and L2 researchers (Ellis 2013) have proposed that formulaic phrases with routine functional purposes play a large part in this experience, and the analysis of their components gives rise to abstract linguistic structure and creativity: '[t]he typical route of emergence of constructions is from formula, through low-scope pattern, to construction' (Ellis 2002: 143).

Researching these issues necessitates bringing together a range of methods to triangulate with learner corpus research (see also Chapter 3, this volume). Learner corpora are essential in showing the evidence of learner formulaic use, and dense longitudinal corpora allow the charting of the growth of learner use (Paquot and Granger 2012). But the analysis of large corpora of everyday usage like the British National Corpus (BNC)1 and the Corpus of Contemporary American English (COCA)2 is a necessary adjunct in order to get a picture of the typical language experience which serves learners as their evidence for learning (McEnery and Hardie 2012). Furthermore, psycholinguistic experiments are necessary to look at learners' implicit knowledge of linguistic structures and the strengths of association of their components as they affect on-line processing in language comprehension and production (e.g. Ellis 2002; Schmitt 2004; see also Chapter 4, this volume). We concur with Gilquin and Gries (2009: 9) that '[b]ecause the advantages and disadvantages of corpora and experiments are largely complementary, using the two methodologies in conjunction with each other often makes it possible to (i) solve problems that would be encountered if one employed one type of data only and (ii) approach phenomena from a multiplicity of perspectives'.

2 Core issues

2.1 L2 processing is sensitive to the statistical properties of formulaic language

Research in psycholinguistics, corpus linguistics and cognitive linguistics demonstrates that language users have rich knowledge of the frequencies of forms and of their sequential dependencies in their native language (Ellis 2002). Language processing is sensitive to the sequential probabilities of linguistic elements at all levels from phonemes to phrases, in comprehension as well as in fluency and idiomaticity of speech production.

1 www.natcorp.ox.ac.uk/ (last accessed on 13 April 2015).
2 http://corpus.byu.edu/coca/ (last accessed on 13 April 2015).
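As a concrete illustration of what 'sequential probabilities' means in practice, the sketch below estimates bigram transitional probabilities – P(next word | current word) – from a token sequence. This is a toy computation of our own, not a procedure taken from any of the studies cited.

```python
from collections import Counter

def transition_probabilities(tokens):
    """Estimate P(next word | current word) from a token sequence."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    histories = Counter(tokens[:-1])
    return {(w1, w2): n / histories[w1] for (w1, w2), n in bigrams.items()}

tokens = "to tell the truth is to tell the price".split()
probs = transition_probabilities(tokens)
print(probs[("tell", "the")])   # 1.0: 'tell' is always followed by 'the' here
print(probs[("the", "truth")])  # 0.5: 'the' is followed by 'truth' or 'price'
```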


This sensitivity to sequence information in language processing is evidence of learners' implicit knowledge of memorised sequences of language, and this knowledge serves as the basis for linguistic systematicity and creativity. The last ten years have seen substantial further research confirming native and L2 users' implicit knowledge of linguistic constructions and their probabilities of usage (Ellis 2012a; Rebuschat and Williams 2012). Illustrative recent studies demonstrating second language learners' implicit knowledge of the sequential probabilities of linguistic elements include the following.

Jiang and Nekrasova (2007) examined the representation and processing of formulaic sequences using on-line grammaticality judgement tasks. English as a second language speakers and native English speakers were tested with formulaic and non-formulaic phrases matched for word length and frequency (e.g. to tell the truth vs to tell the price). Both native and non-native speakers responded to the formulaic sequences significantly faster and with fewer errors than they did to non-formulaic sequences. Conklin and Schmitt (2007) measured reading times for formulaic sequences versus matched non-formulaic phrases in native and non-native speakers of English. The formulaic sequences were read more quickly than the non-formulaic phrases by both groups of participants.

Ellis and Simpson-Vlach (2009) and Ellis et al. (2008) used four experimental procedures to determine how the corpus-linguistic metrics of frequency and mutual information (MI, a statistical measure of the coherence of strings) are represented implicitly in native and non-native speakers of English, and how this knowledge affects their accuracy and fluency of processing of the formulas of the Academic Formulas List (AFL, Simpson-Vlach and Ellis 2010; see Section 3.1 for further details). The language-processing tasks in these experiments were selected to sample an ecologically valid range of language-processing skills: spoken and written, production and comprehension, form-focused and meaning-focused. They were: (1) speed of reading and acceptance in a grammaticality judgement task where half of the items were real phrases in English and half were not, (2) rate of reading and rate of spoken articulation, (3) binding and primed pronunciation – the degree to which reading the beginning of the formula primed recognition of its final word, and (4) speed of comprehension and acceptance of the formula as being appropriate in a meaningful context. Processing in all experiments was affected by various corpus-derived metrics: length, frequency and mutual information. Frequency was the major determinant for non-native speakers, but for native speakers it was predominantly the MI of the formula which determined processability.

Durrant and Schmitt (2009) extracted adjacent English adjective–noun collocations from two learner corpora and two comparable corpora of native student writing and calculated the t-score and MI score in the BNC for each combination extracted.


This study also found that non-native writers rely heavily on high-frequency collocations like good example or long way, but that they underuse less frequent, strongly associated collocations like bated breath or preconceived notions. They conclude 'that these findings are consistent with usage-based models of acquisition while accounting for the impression that non-native writing lacks idiomatic phraseology' (2009: 157).

Such findings argue against a clear distinction between linguistic forms that are stored as formulas and ones that are openly constructed. Grammatical and lexical knowledge are not stored or processed in different mental modules, but rather form a continuum from heavily entrenched and conventionalised formulaic units (unique patterns of high token frequency, such as Hi! How are you?) to loosely connected but collaborative elements (patterns of high type frequency, such as the generic slot-and-frame pattern Put [NP] on the table, which generates a variety of useful tea-time commands: Put it on the table, Put the bread on the table, Put the knives and forks on the table, Put some plates on the table, etc.) (Ellis 2008c; Robinson and Ellis 2008; Ellis and Larsen-Freeman 2009; Bybee 2010; Ellis 2012b).

That learners are sensitive to the frequencies of occurrence of constructions and their transitional probabilities suggests that they learn these statistics from usage, tallying them implicitly during each processing episode. Linguistic structure emerges from the conspiracy of these experiences (Ellis 1998, 2011). Hopper (1987: 143), in laying the foundations for Emergent Grammar, argued that '[t]he linguist's task is in fact to study the whole range of repetition in discourse, and in doing so to seek out those regularities which promise interest as incipient sub-systems. Structure, then, in this view is not an overarching set of abstract principles, but more a question of a spreading of systematicity from individual words, phrases, and small sets'.

2.2 Three different statistical operationalisations of formulaic language

Section 2.1 argued against a firm distinction between linguistic forms that are stored as formulas and ones that are openly constructed. Instead it proposed that formulaicity is a dimension to be defined in terms of the strength of serial dependencies occurring at all levels of granularity and at each transition in a string of forms. At one extreme are formulaic units that are heavily entrenched (high token frequency, unique patterns); at the other are creative constructions consisting of strings of slots, each potentially filled by many types. Broadly, the more frequent and the more coherent a string, the faster it is processed. It follows that formulas need to be operationalised in statistical terms that measure frequency and coherence.


Statistical operationalisations allow triangulation with corpus samples of the usage which serves as the source of our knowledge of formulaicity and patterns in language. Corpus-linguistic techniques provide a range of methods for the quantification of recurring sequences (as clusters, n-grams, collocations, phrase-frames, etc.) and for gauging the strength of association between the component words. Three broad options for the basis of determination of formulaic sequences are frequency, association and native norms. Each is considered in turn in the following subsections. (For further studies of phraseological patterning in learner language, see Chapter 10, this volume.)

2.2.1 Frequency

Formulas are recurrent sequences. One definition, then, is that we should identify strings that recur often. This is the approach of Biber and colleagues (Biber et al. 1999; Biber, Conrad and Cortes 2004), who define lexical bundles solely on the basis of frequency. This has the great advantages of being methodologically straightforward and having face validity. We all agree that high-frequency strings like How are you?, Nice day today and Good to see you are formulaic sequences. But we also know some formulas that are not of particularly high frequency, like blue moon, latitude and longitude and raining cats and dogs. And other high-frequency strings, like and of the or but it is, do not seem very formulaic. Definitions in terms of frequency alone result in long lists of recurrent word sequences that collapse distinctions that intuition would deem relevant. N-grams consisting of high-frequency words occur often. But this does not imply that they have clearly identifiable or distinctive functions or meanings; many of them occur simply by dint of the high frequency of their component words, often grammatical functors. The fact that a formula is above a certain frequency threshold does not necessarily imply either psycholinguistic salience or coherence (Schmitt et al. 2004).
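The purely frequency-based definition can be made concrete in a few lines. The sketch below extracts all n-grams whose recurrence meets a per-million-words threshold, in the spirit of Biber and colleagues' lexical bundles; the threshold default and the toy input are ours, chosen only for illustration.

```python
from collections import Counter

def lexical_bundles(tokens, n=4, per_million=10):
    """Return n-grams whose frequency meets a per-million-words threshold
    -- the purely frequency-based definition of a lexical bundle."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    threshold = per_million * len(tokens) / 1_000_000
    return {ngram: c for ngram, c in counts.items() if c >= threshold}

# With a toy input the threshold is tiny, so nearly every string recurs
# 'often enough' -- one reason raw frequency over-generates candidates.
tokens = ("how are you today " * 5).split()
print(lexical_bundles(tokens, n=3))
```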

2.2.2 Association

Psycholinguistically salient sequences, on the other hand, like once in a blue moon, on the other hand or put it on the table, cohere much more than would be expected by chance. They are 'glued together' and thus measures of association, rather than raw frequency, are more relevant. There are numerous statistical measures of association available, each with their own advantages and disadvantages (Evert 2005; Gries 2008c, 2009, 2012b, 2013d). For example, MI is a statistical measure commonly used in information science to assess the degree to which the words in a phrase occur together more frequently than would be expected by chance (Oakes 1998; Manning and Schütze 1999). A higher MI score means a stronger association between the words, while a lower score indicates that their co-occurrence is more likely due to chance. MI is a scale, not a test of significance, so there is no minimum threshold value; the value of MI scores lies in the comparative information they provide.


MI privileges coherent strings that are constituted by low-frequency items, like longitude and latitude.
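The contrast between raw frequency and association can be seen by computing both MI and the t-score (the measure used alongside MI by Durrant and Schmitt) for a word pair from its corpus counts. The formulas are the standard corpus-linguistic ones; the counts below are invented for illustration.

```python
import math

def mi_and_t(f_xy, f_x, f_y, n):
    """Pointwise MI and t-score for a word pair.

    f_xy: pair frequency; f_x, f_y: individual word frequencies;
    n: corpus size in tokens. Expected co-occurrence under
    independence is f_x * f_y / n.
    """
    expected = f_x * f_y / n
    mi = math.log2(f_xy / expected)
    t = (f_xy - expected) / math.sqrt(f_xy)
    return mi, t

# A rare but tightly bound pair ('bated breath') scores high on MI;
# a frequent but loosely bound pair ('good example') scores high on t.
print(mi_and_t(f_xy=20, f_x=25, f_y=300, n=100_000_000))
print(mi_and_t(f_xy=500, f_x=120_000, f_y=90_000, n=100_000_000))
```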

2.2.3 Native norms

Definitions purely in terms of frequency or association might well reflect that language production makes use of sequences that are ready made by the speaker or writer, but these need not necessarily be native-like. Non-native academic writing can often be identified by the high frequency of phrases that come from strategies of translation from the L1 (mother tongue) (like make my homework or make a diet), or formulas that occur frequently in spoken language but which are frowned upon as informal in academic writing (like I would like to talk about or I think that…) (Gilquin and Paquot 2008). An additional, divergent, criterion for formulaicity is that it reflects native-like selection and native-like fluency (Pawley and Syder 1983). Thus we can also operationalise the formulaicity of L2 language by how well it uses the formulaic sequences and grammatico-lexical techniques of the norms of its reference genre. For example, as we will see in Section 3.2, O'Donnell et al. (2013) search for instances of formulaic academic patterns of the AFL (Simpson-Vlach and Ellis 2010) in corpora of native and non-native English academic writing at different levels of proficiency. They show that L2 learners' writing is less rich in the use of these native-norm-derived academic formulas than that of expert native writers.

We are only beginning to explore how these different statistical and corpus-based operationalisations affect acquisition and processing, and this is a research area where much remains to be done. There is strong consensus that research on formulaic language, phraseology and constructions is in dire need of triangulation across research in first and second language acquisition, corpus linguistics, usage-based linguistics and psycholinguistics (Ellis 2008c; Gries 2008c, 2009; Divjak and Gries 2012), and shared operationalisations rest at the foundations of this enterprise.

2.3 L2 learners have difficulty mastering native-like formulaic language

The fields of applied linguistics and SLA showed early interest in multi-word sequences and their potential role in language development. Corder (1973) coined the term holophrase to refer to unanalysed multi-word sequences associated with a particular pragmatic function; Brown (1973) called them 'prefabricated routines'. One of the main research questions for SLA researchers at the time was: do prefabricated routines pose a challenge to the traditional view of L1 learning as a process by which children start out with small units (morphemes and words) and then gradually combine them into more complex structures?


Do children alternatively and/or additionally start out from large(r) chunks of language which they then gradually break down into their component parts? Early studies did not yield conclusive results (a good discussion can be found in Krashen and Scarcella 1978). For example, Hakuta (1976), based on data from a 5-year-old Japanese learner of English, argued in favour of a more fine-grained distinction between prefabricated routines and prefabricated patterns, that is, low-scope patterns that have at least one variable slot. Wong Fillmore's (1976) dissertation project was one of the first to track more than one child over a longer period of time; her analysis suggested that ESL (English as a Second Language) children do in fact start out with prefabricated patterns which they gradually break down into their component parts in search of the rules governing their L2, which, in turn, ultimately enables them to use language creatively.

There were only a few early studies on adult L2 learners (Wray 2002: 172–98 provides a detailed overview). The general consensus, however, was that while adult L2 learners may occasionally employ prefabricated language, there was less evidence than in children's data that knowledge of prefabricated language fosters grammatical development in adult L2 acquisition (L2A). Hanania and Gradman (1977), for instance, studied Fatmah, a native speaker of Arabic. Fatmah was 19 years old at the time of the study, and she had received little formal education in her native language. When speaking English, Fatmah used several routines that were tied to specific pragmatic situations; however, the researchers found her largely unable to analyse these routines into their component parts. Similarly, Schumann (1978), who investigated data from several adult L2 learners with different native language backgrounds, found little evidence in favour of prefabricated language use. A slightly different picture emerged in Schmidt's (1983) well-known research on Wes, a native speaker of Japanese who immigrated to Hawaii in his early thirties. Wes seemed to make extensive use of prefabricated routines. However, while this significantly boosted Wes's fluency, his grammatical competence remained low. Ellis (1984), looking at the use of prefabricated language in an instructional setting, suggested that there is considerable individual variation in learners' ability to make the leap from prefabricated routines to the underlying grammatical rules they exemplify. Krashen and Scarcella (1978) were outright pessimistic regarding adult learners' ability even to retain prefabricated routines, and cautioned against focusing adult learners' attention on prefabricated language because '[t]he outside world for adults is nowhere near as predictable as the linguistic environment around Wong Fillmore's children was' (Krashen and Scarcella 1978: 298).

In their classic analysis of formulaic language usage in SLA, 'Two puzzles for linguistic theory: Nativelike selection and nativelike fluency', Pawley and Syder (1983) put the clear case that L2 speakers, despite considerable knowledge of L2 grammar, still make productions that are unidiomatic.


Likewise, in her analysis of the incidence of formulaic language in French students' advanced EFL (English as a Foreign Language) writing, Granger (1998c) showed that learners made less use of formulaic expressions and collocations than native writers.

The studies reviewed here suggest a potential difference in formulaic use between ESL learners, who are exposed to large amounts of naturalistic spoken language, and EFL learners, who are not. Learning which usages are normal or unmarked and which are unnatural or marked requires a huge amount of immersion in the speech community. Language learning is essentially a sampling problem – the learner has to estimate the native norms from a sample of usage experience (Ellis 2008b). Many of the forms required for idiomatic use are of relatively low frequency, and the learner thus needs a large input sample just to encounter them:

Becoming idiomatic and fluent requires a sufficient sample of needs-relevant authentic input for the necessary implicit tunings to take place. The 'two puzzles for linguistic theory', nativelike selection and nativelike fluency (Pawley and Syder, 1983), are less perplexing when considered in these terms of frequency and probability. There is a lot of tallying to be done here. The necessary sample is certainly to be counted in terms of thousands of hours on task. (Ellis 2008b: 152)

2.4 L2 longitudinal research: from formula to low-scope pattern to creative construction?

That L2 learners have difficulty in acquiring the full range of native-like formulaic expressions does not mean that some high-frequency formulas do not play a part in language acquisition. There are recent longitudinal studies in support of this developmental sequence. Particular formulas, high in frequency, functionality and prototypicality, might serve as pacemakers. Myles and colleagues (Myles et al. 1998; Myles et al. 1999; Myles 2004) analysed longitudinal corpora of oral language in secondary school pupils learning French as a foreign language in England. The study investigated the development of chunks within individual learners over time, showing a clear correlation between chunk use and linguistic development:

In the beginners' corpus, at one extreme, we had learners who failed to memorise chunks after the first round of elicitation; these were also the learners whose interlanguage remained primarily verbless, and who needed extensive help in carrying out the tasks. At the other extreme, we had learners whose linguistic development was most advanced by the end of the study. These were also the learners who, far from discarding chunks, were seen to be actively working on them throughout the data-collection period.


These chunks seem to provide these learners with a databank of complex structures beyond their current grammar, which they keep working on until they can make their current generative grammar compatible with them. (Myles 2004: 153)

This study is such a landmark that we have chosen it for further detailed examination in Section 3.3.

Eskildsen and Cadierno (2007) investigated the development of do-negation by a Mexican learner of English. Do-negation learning was found to be initially reliant on one specific instantiation of the pattern, I don't know, which thereafter gradually expanded to be used with other verbs and pronouns as the underlying knowledge seemed to become increasingly abstract, as reflected in token and type frequencies. The emerging system was initially based on formulaic sequences, and development was based on the gradual abstraction of regularities that link expressions as constructions (see also Eskildsen 2012).

Mellow (2008) describes a longitudinal case study of a 12-year-old Spanish learner of English, Ana, who wrote stories describing fifteen different wordless picture books during a 201-day period. The findings indicate that Ana began by producing only a few types of complex constructions that were lexically selected by a small set of verbs, which gradually then seeded an increasingly large range of constructions.

Sugaya and Shirai (2009) describe the acquisition of Japanese tense–aspect morphology by the L1 Russian learner Alla. In her ten-month longitudinal data, some verbs (e.g. siru 'come to know', tuku 'be attached') were produced exclusively with the imperfective aspect marker -te i-(ru), while other verbs (e.g. iku 'go', tigau 'differ') were rarely used with -te i-(ru). Even though these verbs can be used in any of the four basic forms, Alla demonstrated a very strong verb-specific preference. Sugaya and Shirai follow this up with a larger cross-sectional study of sixty-one intermediate and advanced learners who were divided into thirty-four lower- and twenty-seven higher-proficiency groups using grammaticality judgement tasks. The lower-proficiency learners used the individual verbs in verb-specific ways, and this tendency was stronger for the verbs denoting resultative state meaning with -te i-(ru) (e.g. achievement verbs) than the verbs denoting progressive meaning with -te i-(ru) (e.g. activity, accomplishment verbs). Sugaya and Shirai conclude that learners begin with item-based learning and 'low-scope patterns' and that these formulas allow them to gradually gain control over tense–aspect. Nevertheless, they also consider how memory-based and rule-based processes might co-exist for particular linguistic forms, and how linguistic knowledge should be considered a 'formulaic–creative continuum'.

Having said that, there are studies of L2 that have set out to look for the developmental sequence from formula to low-scope pattern to creative construction in a learner corpus and found less compelling evidence. These are reviewed below.


Bardovi-Harlig (2002) studied the emergence of future expressions involving will and going to in a longitudinal corpus study of sixteen adult ESL learners (mean length of observation: 11.5 months; 1,576 written texts, mainly journal entries, and 175 oral texts, either guided conversational interviews or elicited narratives based on silent films). The data showed that future will emerges first and greatly outnumbers the use of tokens of going to. Bardovi-Harlig (2002: 192) describes how the rapid spread of will to a variety of verbs suggests that 'for most learners, there is either little initial formulaic use of will or that it is so brief that it cannot be detected in this corpus'. There was some evidence of formulaicity in early use of going to: '[f]or 5 of the 16 learners, the use of I am going to write stands out. Their productions over the months of observation show that the formula breaks down into smaller parts, from the full I am going to write about to the core going to where not only the verb but also person and number vary. This seems to be an example of learner production moving along the formulaic–creative continuum' (2002: 197). But other learners showed greater variety of use of going to, with different verbs and different person-number forms, from its earliest appearance in the diary. Bardovi-Harlig (2002: 198) concludes that 'although the use of formulaic language seems to play a limited role in the expression of future, its influence is noteworthy'.

Eskildsen (2009) analysed longitudinal oral second language classroom interaction for the use of can by one student, Carlo. Can first appeared in the data in the formula I can write. But Eskildsen noted how formulas are interactionally and locally contextualised, which means that they may be transitory in nature, their deployment over time being occasioned by specific recurring usage events.

Hall (2010) reports a small-scale study of the oral production of three adult beginner learners of ESL over a nine-week period in a community language programme meeting three days per week for two hours each day. A wide variety of tasks was used to elicit the data, which included picture description and semi-structured interviews. Hall reports that formulas were minimally present in the learner output and that constructions and formulas of similar structure co-existed, but that a developmental relationship between formulas and constructions was not clearly evident. He concludes that the amount of elicited data was too limited to substantiate the learning path under investigation, and that more controlled task dimensions were also needed.

3 Representative studies

We have chosen four research studies to illustrate a range of different approaches to these issues.


The first identifies formulas from corpora of genre-specific language and then assesses L1 and L2 knowledge of these formulas in psycholinguistic experiments. The second uses cross-sectional learner corpora to investigate the development of formulaic language in first and second language writing, investigating effects of statistical operationalisation in terms of frequency, association and native norm. The third is a mixed-methods longitudinal corpus-plus-experimentation study of the role of formulas in language learning in secondary school. The fourth tracks constructions over time in a longitudinal corpus of naturalistic second language acquisition in adults, investigating type–token frequency distributions in verb-argument constructions over time, the ways in which native-speaker usage guides learner language, and how constructions develop following psychological principles of category learning, complementing observational description with computational simulations.

3.1 Simpson-Vlach, R. C. and Ellis, N. C. 2010. 'An academic formulas list: New methods in phraseological research', Applied Linguistics 31(4): 487–512.

Our first representative study is not a learner corpus study per se, but one which uses corpus techniques to identify the formulas in academic language so that learner knowledge of these could then be evaluated, firstly by using psycholinguistic approaches (Ellis et al. 2008; Ellis and Simpson-Vlach 2009) and secondly by searching for these expressions in learner corpora (O'Donnell et al. 2013).

Simpson-Vlach and Ellis (2010) used corpus-linguistic techniques to identify the phraseology specific to academic discourse. The resultant Academic Formulas List includes formulaic sequences identified as (1) frequent recurrent patterns in corpora of written and spoken language, which (2) occur significantly more often in academic than in non-academic discourse, and (3) inhabit a wide range of academic genres. Three-, four- and five-word formulas occurring at least ten times per million words were extracted from corpora of 2.1 million words of academic spoken language [Michigan Corpus of Academic Spoken English, MICASE,3 and selected academic spoken BNC files], 2.1 million words of academic written language [Hyland's (2004a) research article corpus, plus selected academic writing BNC files], 2.9 million words of non-academic speech [the Switchboard4 corpus] and 1.9 million words of non-academic writing [the FLOB5 and Frown6 corpora, gathered in 1991 to reflect British and American English over fifteen genres]. The program Collocate (Barlow 2004) allowed the authors to measure the frequency of each n-gram along with the MI score for each phrase.

3 http://quod.lib.umich.edu/m/micase/ (last accessed on 13 April 2015).
4 https://catalog.ldc.upenn.edu/LDC97S62 (last accessed on 13 April 2015).
5 http://clu.uni.no/icame/manuals/FLOB/INDEX.HTM (last accessed on 13 April 2015).
6 http://clu.uni.no/icame/manuals/FROWN/INDEX.HTM (last accessed on 13 April 2015).
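The keyness step described above – testing whether a formula is over-represented in the academic corpora relative to the non-academic ones – can be sketched with the standard two-corpus log-likelihood statistic. The function below is the usual Dunning-style computation, not code from the study itself, and the counts in the example are invented.

```python
import math

def log_likelihood(a, b, n1, n2):
    """Log-likelihood for a phrase occurring a times in a corpus of n1
    tokens and b times in a reference corpus of n2 tokens."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in corpus 2
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

# Invented counts for one phrase in ~4.2M academic vs ~4.8M
# non-academic tokens; values above 3.84 are significant at
# p < .05 for one degree of freedom.
print(log_likelihood(a=900, b=350, n1=4_200_000, n2=4_800_000))
```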


The total number of formulas appearing in any one of the four varieties at the threshold level of ten per million words was approximately 14,000. In order to determine which formulas were more frequent in the academic corpora than in their non-academic counterparts, the authors used the log-likelihood (LL) statistic (Oakes 1998) to determine the formulas which were statistically more frequent, at a significance level of p < […] Undergraduate), with, if anything, L2 learners producing more formulas than their native peers. O'Donnell et al. suggest that these are likely effects of text sampling on the recurrence of formulaic patterns, with the prompt questions driving the more common formulaic sequences in ICLE (e.g. the opium of the masses, the birth of a nation, the generation gap, ICLE French) and LOCNESS7 (e.g. the Joy Luck Club, in Le Myth de Sysiphe, the root of all evil). MICUSP (especially MICUSP-NS) generated common formulaic sequences from reference sections (e.g. American Journal of Public Health, Hispanic Journal of Behavioral Sciences, levels of psychological well-being). The Hyland corpus, with its greater diversity of topics across disciplines, showed fewer of these sampling foci.

For AFL-defined formulas, there were clear effects of high levels of expertise (Expert > A-grade Graduate ≈ Undergraduate), but no effect of L1/L2 status. The expert (Hyland corpus) authors were senior scholars who had had multiple-year university training and experience in getting published in peer-reviewed journals.

7 www.uclouvain.be/en-cecl-locness.html (last accessed on 13 April 2015).


They were clearly differentiated from both the novice academic writers who contributed to ICLE and LOCNESS, and those who produced A-grade MICUSP papers on their way to developing expert writing skills and becoming accepted members of academic communities of practice. The fact that there were no effects of L1/L2 status suggests that these means of expression are as novel and specialised for natives as for non-natives. These analyses thus show clear effects of the operationalisation of 'formulaic language' and of the choices underlying the design of different corpora. We will consider the implications further in Section 4.

3.3 Myles, F., Mitchell, R. and Hooper, J. 1999. 'Interrogative chunks in French L2: A basis for creative construction?', Studies in Second Language Acquisition 21(1): 49–80.

In an extensive study of secondary school pupils learning French as a foreign language in England, Myles (Myles et al. 1998; Myles et al. 1999; Myles 2004) analysed longitudinal corpora of oral language in sixteen Beginners [Years 7, 8 and 9 (11–14 years old), tracked over the first 2¼ years, using thirteen oral tasks (2–3 per term over six terms)] and sixty Intermediates [20 classroom learners in each of Years 9, 10 and 11 studied cross-sectionally using four oral tasks (three repeated from the Beginners project)]. These data showed that multimorphemic sequences which go well beyond learners' grammatical competence are very common in early L2 production. Notwithstanding that these sequences contain such forms as finite verbs, wh-questions and clitics, Myles denies that this is evidence for the sequences being openly created by syntactic means from the start of L2 acquisition, because the relevant functional projections were not initially present outside chunks. Analyses of inflected verb forms suggested that early productions containing them were formulaic chunks. These structures, sometimes highly complex syntactically (e.g. in the case of interrogatives), cohabited for extended periods of time with very simple sentences, usually verbless or, when a verb was present, normally untensed. Likewise, clitics first appeared in chunks containing tensed verbs, suggesting that it is through these chunks that learners acquire them. Myles characterises these early grammars as consisting of lexical projections and formulaic sequences, showing no evidence of open syntactic creation. 'Chunks do not become discarded; they remain grammatically advanced until the grammar catches up, and it is this process of resolving the tension between these grammatically advanced chunks and the current grammar which drives the learning process forward' (Myles 2004: 152).

The results of this extensive corpus study were reported in three or four papers, each concentrating on different linguistic constructions.

3.3 Myles, F., Mitchell, R. and Hooper, J. 1999. 'Interrogative chunks in French L2: A basis for creative construction?', Studies in Second Language Acquisition 21(1): 49–80.

In an extensive study of secondary school pupils learning French as a foreign language in England, Myles (Myles et al. 1998; Myles et al. 1999; Myles 2004) analysed longitudinal corpora of oral language in sixteen Beginners [Years 7, 8 and 9 (11–14 years old), tracked over the first 2¼ years, using thirteen oral tasks (2–3 per term over six terms)] and sixty Intermediates [20 classroom learners in each of Years 9, 10 and 11, studied cross-sectionally using four oral tasks (three repeated from the Beginners project)]. These data showed that multimorphemic sequences which go well beyond learners' grammatical competence are very common in early L2 production. Although these sequences contain such forms as finite verbs, wh-questions and clitics, Myles denies that they are evidence of the sequences being openly created by syntactic means from the start of L2 acquisition, because the relevant functional projections were not initially present outside chunks. Analyses of inflected verb forms suggested that early productions containing them were formulaic chunks. These structures, sometimes highly complex syntactically (e.g. in the case of interrogatives), cohabited for extended periods of time with very simple sentences, usually verbless or, when a verb was present, normally untensed. Likewise, clitics first appeared in chunks containing tensed verbs, suggesting that it is through these chunks that learners acquire them. Myles characterises these early grammars as consisting of lexical projections and formulaic sequences, showing no evidence of open syntactic creation. 'Chunks do not become discarded; they remain grammatically advanced until the grammar catches up, and it is this process of resolving the tension between these grammatically advanced chunks and the current grammar which drives the learning process forward' (Myles 2004: 152). The results of this extensive corpus study were reported in three or four papers, each concentrating on different linguistic constructions. Myles's conclusion for the relationship between formulaic chunks and creative construction was not that the direction of development was one of integration (from words to formulas) or one of differentiation (from formulaic phrases to their components), but rather that 'creative construction and chunk breakdown clearly go hand in hand' (Myles et al. 1999: 76):

We see, on the one hand, chunks becoming simpler and more like other constructions present in the grammar at a given time and, on the other hand, creative constructions becoming more complex as elements from the chunks feed into the process. It is as if, at any one time, learners are attempting to resolve the tension between complex but communicatively rich chunks on the one hand and simple but communicatively inadequate structures on the other hand. This is a dynamic tension that drives forward the overall development of the L2 system. (Myles et al. 1999: 77)

3.4 Ellis, N. C. and Ferreira-Junior, F. 2009a. 'Constructions and their acquisition: Islands and the distinctiveness of their occupancy', Annual Review of Cognitive Linguistics 7: 187–220. Ellis, N. C. and Ferreira-Junior, F. 2009b. 'Construction learning as a function of frequency, frequency distribution, and function', The Modern Language Journal 93: 370–86.

Ellis and Ferreira-Junior (2009a, 2009b) were interested in the processes of integration and differentiation of formulaic and semi-formulaic phrases in the acquisition of more schematic constructions in naturalistic second language acquisition. They therefore investigated effects of type–token distributions in the slots comprising the linguistic form of three English verb-argument constructions (VACs), namely verb locative (VL), e.g. Tom walked to the store, verb object locative (VOL), e.g. he put the book on the shelf, and ditransitive (VOO), e.g. he sent his son some money, in the speech of second language learners in the European Science Foundation (ESF) corpus (Feldweg 1991; Perdue 1993; Dietrich et al. 1995). The ESF project collected the spontaneous and elicited second language of adult immigrants recorded longitudinally in interviews every four to six weeks for approximately thirty months. Ellis and Ferreira-Junior focused upon seven ESL learners living in Britain whose native languages were Italian (n=4) or Punjabi (n=3). The ESF corpus includes transcribed data from 234 sessions for these ESL learners and their native-speaker conversation partners during a range of activities. Goldberg (2006) had previously argued for child language acquisition that the Zipfian8 (Zipf 1935) type–token frequency distribution of verbs in natural language might optimise construction learning by providing one very high-frequency exemplar that is also prototypical in meaning. Ellis and Ferreira-Junior (2009b) confirmed that, in the naturalistic L2A of English, VAC verb type–token distribution in the input is Zipfian and learners first acquire the most frequent, prototypical and generic exemplar (e.g. go in VL, put in VOL, give in VOO). Ellis and Ferreira-Junior (2009a) further illustrate how acquisition is affected by the frequency and frequency distribution of exemplars within each island of the construction (e.g. [Subj V Obj Obl(PATH/LOC)]), by their prototypicality, and, using a variety of psychological and corpus-linguistic association metrics, by their contingency of form–function mapping and by the degree to which the different elements in the VAC sequence (such as Subj V Obj Obl) are mutually informative and form predictable chunks. The highest-frequency elements seeding the learners' VL pattern were go to the shop, the VOL pattern put it on the table and the VOO pattern they give me money. We will describe in more detail in Section 4 the cycles of integration and differentiation whereby overlapping chunks of formulaic phrases resonate with creative constructions. Ellis and Larsen-Freeman (2009) used computational (emergent connectionist) models to test theories of how these various factors play out in the emergence of constructions as generalised linguistic schemas from the ESF learners' analysis of patterns in their usage history.

8 In natural language, Zipf's (1935) law describes how the highest-frequency words account for the most linguistic tokens. The frequency of words decreases as a power function of their rank in the frequency table, with the most frequent word occurring approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
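The Zipfian type–token pattern described in footnote 8 is easy to check on slot data. Below is a minimal sketch with invented token counts for verbs in a VOL slot; it illustrates the rank-frequency check only, and is not the authors' analysis code.

```python
from collections import Counter

# Hypothetical verb tokens observed in a VOL (verb-object-locative) slot;
# in a real study these would be extracted from tagged/parsed corpus data.
vol_verbs = (["put"] * 120 + ["take"] * 55 + ["see"] * 38 +
             ["get"] * 30 + ["bring"] * 24 + ["leave"] * 18)

counts = Counter(vol_verbs)
ranked = counts.most_common()

print(f"{'rank':>4} {'verb':>6} {'freq':>5} {'freq x rank':>11}")
for rank, (verb, freq) in enumerate(ranked, start=1):
    # Under an idealised Zipfian distribution, freq x rank is roughly
    # constant (each product is close to the top frequency).
    print(f"{rank:>4} {verb:>6} {freq:>5} {freq * rank:>11}")
```

If the products in the last column stay roughly flat while frequency drops steeply with rank, the slot's type–token distribution is broadly Zipfian, with the top verb (put) as the pathbreaking, prototypical exemplar.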

4 Critical assessment and future directions

4.1 Corpus design and formulaicity

The research reviewed above allows us to identify aspects of corpus design which affect the incidence of formulaicity and which inform the design and analysis of future studies.

1. There are several well-justified but divergent operational definitions of formulaicity. Choices of operationalisation entail that different researchers are potentially researching and theorising different phenomena.

2. Formulaicity may vary as a function of first vs second language acquisition. L1 acquisition (L1A) may indeed be more formulaic than L2A. When child L1 learners are learning about language from formulaic frames (Mintz 2003; Tomasello 2003; Ambridge and Lieven 2011) and the analysis of sequences of words (Kiss 1973; Elman 1990; Redington and Chater 1998), they are learning from scratch about more abstract categories such as verb, pronoun, preposition, noun or transitive frame. It is debatable whether the units of early L1A are words at all (Peters 1983). Adult L2 learners already know about the existence of these units, categories and linguistic structures. They expect that there will be words and constructions in the L2 which correspond to such word classes and frames. Once they have identified them, or even once they have searched them out and actively learned such key vocabulary, they are therefore more likely to attempt creative construction, swapping these elements into corresponding slots in frames. Transfer from the L1 is also likely to affect the process (Granger 1998c; Chapter 15, this volume). The more learners attempt word-by-word translation from their L1, the more they deviate from L2 idiomaticity. There is unconscious transfer too (Jiang and Nekrasova 2007).

3. The amount and type of language exposure is influential (e.g. ESL vs EFL) (Groom 2009; Reppen 2009). Children are naturalistic language learners, learning from thousands of hours of interaction and input. While some adults learn naturalistically, others take grammar-rich courses, and foreign language environments provide only restricted access to authentic language. Thus second language can be more formulaic than foreign language.

4. For studies that seek to trace the development of formulaic language, the data has to be dense enough to identify repeated uses at the time of emergence (Tomasello and Stahl 2004). The use of formulas and constructions is determined by context, function, genre and register. If elicitation tasks vary, the chance of sampling the same formula and its potential variants diminishes accordingly. Myles (2004) demonstrates that an understanding of L2A can only come from analysis of extensive representative corpora of language sampled in the same learners over time. This, with transcription, mark-up, checking and distribution, entails huge effort. Myles also illustrates how supplementing the language data with targeted psycholinguistic experimental tasks, focused upon times of critical change, can enhance the value of the corpus description. The field of child language acquisition became a scientific enterprise upon the recognition of the need for proper longitudinal corpora describing individual language development (Brown 1973). More recently, this has become recognised as a need for dense longitudinal corpora of naturalistic language development that capture perhaps 10 per cent of the child's speech and the input they are exposed to, collected from 2–4 years old when the child is undergoing maximal language development (Maslen et al. 2004; Behrens 2008), or even a complete corpus of a learner's situated language development (Roy 2009). Making the evidence of learner language available through CHILDES and TalkBank (MacWhinney 2000) has transformed the study of child language acquisition. Although beginnings have been made for L2, for example the ESF longitudinal corpora (Klein and Perdue 1992), we must strive together for a similar richness of evidential sources for SLA research too (Ortega and Iberri-Shea 2005).

5. As in all other areas of language processing, recognition of formulas is easier than production. Ellis and Ferreira-Junior (2009a, 2009b) showed that naturalistic adult L2 learners used the same verbs in frequent verb-argument constructions as are found in their input experience, with the relative ordering of the types in the input predicting uptake with correlations in excess of r = 0.90. Nevertheless, while they would accurately produce short, simple formulaic sequences such as come in or I went to the shop, structurally more complex constructions were often produced incorrectly. Thus psycholinguistic studies of formula recognition may identify wider knowledge than is evidenced in formula production in learner corpora.

6. Modality, genre and task are also important. Using the range of methods of O'Donnell et al. (2013) described in Section 3.2, Ellis et al. (2009) showed that oral language was much denser in formulaic language than written news reporting or light fiction. Likewise, the greater the working-memory demands of the processing task, the greater the need to rely on formulas: Kuiper (1996) analysed 'smooth talkers' – sports commentators and auctioneers who are in communicative contexts which place pressure on them to observe what is transpiring around them, analyse these happenings in short-term memory and formulate speech reports describing what is observed in real time without getting left behind. Smooth talkers use many formulas in their speech – recurrent sequences of verbal behaviour, whether conventional or idiosyncratic, which are sequentially and hierarchically organised. The faster the action, the more difficult it is for the commentator to provide an instantaneous commentary. By contrasting fast-action commentators (horse races, antique and livestock auctioneers) with slow-action commentators (cricket, real estate auctioneers), Kuiper showed that the fast-action commentators made much more use of formulas than the slow-action ones did. We expect similar resort to formulaic language whenever L1 or L2 language users have to speak under conditions of high cognitive demand.

7. Corpus design features, including the number of participants, the nature of their tasks and prompts, the amount of language they produce, etc., are potent determinants of outcome. There remains much basic research to be done to assess how formulaicity is affected by potential independent variables of concern for control purposes (text length, type–token ratio, mean length of utterance, entropy, vocabulary frequency profiles, number of speakers, range of prompts and topics, etc.; a sketch of some of these control measures follows this list) and by variables of greater theoretical weight, including potential text variables such as spoken/written genre, potential subject variables such as native vs second language status, proficiency and education, and potential situational variables such as degree of preparation, rehearsal and working-memory demand.

8. With so many variables in play in the emergence of linguistic constructions and system (Ellis 2011), an essential part of testing theories of development includes their investigation using computational models as applied to learner corpus data (see further, Ellis 2012b).
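As a concrete illustration of the control variables listed in point 7, here is a minimal sketch computing a few of them (text length, type–token ratio, mean length of utterance and token entropy) for a toy transcript. The whitespace tokenisation and the sample utterances are invented for illustration; real studies would use proper tokenisers and transcription formats.

```python
import math
from collections import Counter

# Invented learner utterances standing in for a transcribed sample.
utterances = [
    "i went to the shop",
    "put it on the table",
    "i don't know",
    "put it in the box",
]

tokens = [tok for utt in utterances for tok in utt.split()]
counts = Counter(tokens)

text_length = len(tokens)            # total tokens
ttr = len(counts) / text_length      # type-token ratio
mlu = text_length / len(utterances)  # mean length of utterance (in words)
# Shannon entropy (bits) of the token distribution.
entropy = -sum((f / text_length) * math.log2(f / text_length)
               for f in counts.values())

print(f"tokens={text_length}, TTR={ttr:.2f}, MLU={mlu:.2f}, "
      f"entropy={entropy:.2f} bits")
```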

4.2 The roles of formulaic language in SLA

The evidence reviewed above demonstrates that (1) language learners have substantial statistical knowledge of the units of language and their phraseological patterning; (2) when one compares second/foreign language to first language, the former displays a smaller range of formulaic expressions; (3) formulaic language can serve in the language acquisition process. Let us bring these three facts together.

Some formulaic sequences are readily learnable by dint of being highly frequent and prototypical in their functionality – How are you?, It's lunchtime, I don't know, Good example, I am going to write about… and the like. These are good candidates for construction seeds. Other formulaic sequences are not readily learnable – these are of low frequency, often indeed rare, and many are non-transparent and idiomatic in their interpretation (e.g. once in a blue moon, bated breath). As idioms they must be learned as such. However, learners require considerable language experience before they encounter these once, never mind sufficient times to commit them to memory (Ellis 2008b; Ellis et al. 2008). This is why learners typically do not achieve native-like idiomaticity (Pawley and Syder 1983; Granger 1998b; Durrant and Schmitt 2009). These low-frequency, low-transparency formulas are targets for learning rather than seeds of learning.

In the huge middle ground between high and low token-frequency formulaic expressions, there is interaction. Let us consider this 'formulaic–creative continuum' (Sugaya and Shirai 2009: 440), the 'repeated cycles of integration and differentiation' (Studdert-Kennedy 1991: 25) or the 'dynamic tension that drives forward the overall development' (Myles et al. 1999: 77) in further detail, with the aid of a corpus, of course. Begin with the formula (i) put it in, and put it in its context of usage in a large corpus of English, such as COCA: put it in occurs 3,620 times (numbers may differ because the corpus is always growing). Consider it as a formulaic exemplification of the schematic verb-object-locative (VOL) verb-argument construction (VAC), which can describe a routine generic caused-motion function of moving something to a new place or in a new direction. Compare it to other VOL VACs. Search for put it [i*], where [i*] is the wildcard for any preposition. This is very common (8,065 token occurrences), from put it in (3,620), put it on (1,926) and put it onto (745) (all highly functional, stereotypical, formulaic phrases in their own right), with the distribution then dropping rapidly to a heavy right tail of items that appear just once, such as put it away. These frequencies broadly follow a Zipfian distribution (Zipf 1935; Solé et al. 2005; Ninio 2011; see footnote 8), as in language overall, but not following the particular ordering found in language as a whole – each slot attracts particular types of occupants (Ellis and O'Donnell 2012). A learner would get a very good idea of locatives by abstracting over these types and tokens of prepositions.

Next consider the types of verbs that work in these constructions. Searching [v*] it [i*], where [v*] is the wildcard for any verb, produces put it in (3,608), give it to (2,521), do it in (2,059), put it on (1,917) (again all formulaic)… There are many more types here but the frequencies still follow a Zipfian distribution. Figure 16.1 shows the results of a parallel analysis of the verb types in VOL constructions from the native English speakers in the ESF corpus from Ellis and Ferreira-Junior (2009a). There is some noise, but abstracting over the verb types, of which put takes the lion's share in useful, stereotypically functional formulaic phrases such as put it in, put it on and put it onto, the learner would get a pretty good idea of the semantics of caused-motion verbs.

Figure 16.1 The Zipfian type–token frequency distribution of verb lemmas in the VOL VAC in the native English participants of the ESF project (based on Ellis and Ferreira-Junior 2009a). The plot shows frequency of NS use as VOL by verb lemma, with put by far the most frequent, followed by take, see, get and bring, down to a long tail of verbs (e.g. want, withdraw) occurring only once or twice.

Back to COCA, a more specific search with put it in the * generates put it in the oven (53), put it in the refrigerator (28), put it in the back (27), put it in the freezer (26), … put it in the hold (2). The sorts of everyday places where people put things are pretty clear in their semantics too, when averaged thus. And who puts? Searching [p*]/[n*] put it, where [p*]/[n*] is the wildcard for any pronoun or noun, generates you put it (1,067), he put it (975), I put it (891), … who put it (72), official put it (62), etc. The learner would get a clear idea of the sorts of entities who do the putting. There are exceptions, but there is semantic coherence over the general exemplar cloud.
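Slot-and-frame searches of this kind can be approximated outside COCA's web interface. The sketch below runs a put it [preposition]-style query over a toy POS-tagged text using regular expressions; the tagged sentences and the simplified tag set (_V, _PRON, _P, _DET, _N) are invented stand-ins for what a real tagger and query engine would provide.

```python
import re
from collections import Counter

# Toy POS-tagged text: word_TAG tokens (invented for illustration).
tagged = ("she_PRON put_V it_PRON in_P the_DET oven_N . "
          "he_PRON put_V it_PRON on_P the_DET shelf_N . "
          "they_PRON put_V it_PRON in_P the_DET box_N . "
          "i_PRON put_V it_PRON onto_P the_DET tray_N .")

# Equivalent of the COCA query "put it [i*]": the verb put, then it,
# then any token tagged as a preposition.
pattern = re.compile(r"put_V it_PRON (\w+)_P")
prepositions = Counter(match.group(1) for match in pattern.finditer(tagged))

# Type-token profile of the preposition slot, most frequent first.
for prep, freq in prepositions.most_common():
    print(prep, freq)
```

Swapping the literal put for a verb wildcard, or the preposition slot for a noun-phrase slot, gives the other searches described in the text.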

In each of these analyses there is a broadly Zipfian type–token frequency distribution within the slot; the most frequent, pathbreaking slot-filler for each VAC is much more frequent than the other members; and the most frequent slot-filler is semantically prototypical and generic of the VAC island as a whole. This analysis in COCA was seeded with a frequent formulaic prototype VOL, put it in, with its characteristic form and its generic interpretation. Scrutiny of its component slots and the types they attract in usage generated other VOLs with high-frequency prototypical occupants. Abstracting over the typical types in the various slots results in a generalised schema for the VOL, with the different slots becoming progressively defined as attractors. Each slot in each construction thus makes a significant contribution to its identification and interpretation (Tomasello 2003; Goldberg 2006; Ellis and Ferreira-Junior 2009a, 2009b; Ellis and Larsen-Freeman 2009; Bybee 2010; Ambridge and Lieven 2011; Ellis and O'Donnell 2012).

Is the notion of language acquisition being seeded by formulaic phrases and yet learner language being formula-light illogical? Is this 'having your cake and eating it too'? Pawley and Syder (1983) thought not. While much of their classic article concentrated on the difficulty L2 learners had in achieving native-like formulaic selection and native-like fluency, they nevertheless stated: '[i]ndeed, we believe that memorized sentences are the normal building blocks of fluent spoken discourse, and at the same time, that they provide models for the creation of many (partly) new sequences which are memorable and in their turn enter into the stock of familiar uses' (1983: 208). Granger's (1998c) analysis of collocations and formulas in advanced EFL writing showed likewise that 'learners use fewer prefabs than their native-speaker counterparts' while at the same time they use some lexical teddy bears as 'general-purpose amplifiers' in booster and maximiser phrases – 'the analysis showed a highly significant overuse of very as the all-round amplifier par excellence … one could postulate that the learners' underuse of -ly amplifiers is compensated for by their overuse of very' (1998c: 151). At this stage of learning, very [adj] is the 'all-round amplifier par excellence', the memorised and prototypical model of amplifier phrases yet to come.

The present characterisation of the developmental sequence 'from formula to low-scope pattern to creative construction' is less true to the traditional idea of a formula as categorically defined, and more so to that of formulaicity as a variable reflecting sequential dependencies in usage and degree of entrenchment in the learner's mind. To properly investigate these questions, we need more longitudinal studies based on dense data (see also Chapter 17, this volume), more studies that compare formulaic language in L1 vs L2, more studies that compare formulaic language development in second vs foreign language acquisition, and more studies that compare formulaic language in recognition vs production. Only then will we be able to put rich, quantitative flesh on the core, skeletal claim that 'grammar is what results when formulas are re-arranged, or dismantled and re-assembled, in different ways' (Hopper 1987: 145).

Key readings

Robinson, P. and Ellis, N. C. (eds.) 2008. Handbook of Cognitive Linguistics and Second Language Acquisition. London: Routledge.
This edited collection brings together leading researchers in usage-based first and second language acquisition. Usage-based approaches hold that we learn language from our experience of language, and that formulaic language plays a key role. This is the first volume to extend cognitive-linguistic analyses across L1A and L2A.

Polio, C. (ed.) 2012. Topics in Formulaic Language. Special issue of Annual Review of Applied Linguistics 32.
This is a recent, comprehensive and broad-ranging collection of twelve articles reviewing cognitive perspectives in L1A, L2A, language processing, language disorders, formulaic language pedagogy, and social perspectives on formulaic language.

Paquot, M. and Granger, S. 2012. 'Formulaic language in learner corpora', Annual Review of Applied Linguistics 32: 130–49.
This is a recent state-of-the-art review of learner corpus studies.

Rebuschat, P. and Williams, J. N. (eds.) 2012. Statistical Learning and Language Acquisition. Berlin: Mouton de Gruyter.
Linguistic constructions are acquired from experience of input following associative learning principles. This collection on statistical language learning considers theories of how type–token frequency patterns in the input, patterns that can only be ascertained from corpus analysis, drive the statistical learning that results in categorisation.

Hoffmann, Th. and Trousdale, G. (eds.) 2013. The Oxford Handbook of Construction Grammar. Oxford University Press.
This is a recent collection on construction grammar and usage-based acquisition. A central theme is the interplay between formulaic language and more open constructions and their synergy in language acquisition, knowledge, processing, and change.

17 Developmental patterns in learner corpora

Fanny Meunier

1 Introduction

The understanding and description of learners' developmental patterns have been at the core of second language acquisition (SLA) research for about forty years now.1 Kramsch (2000: 315) defines SLA as being concerned 'with the process by which children and adults acquire (learn) second (third or fourth) languages in addition to their native language' and adds that SLA is interested 'in the nature of these learners' language and their development throughout life'. Similarly, Ortega (2012: 133) writes that one of the key issues in SLA is to 'shed light on how interlanguage development proceeds over time, from initial emerging representations to a full-blown, mature system of the new language'. As evidenced by the two preceding quotes, the correlated notions of progress and time are central in SLA.

Learner corpora are one possible data type that can be used to analyse interlanguage development. Granger (2008a: 338) defines learner corpora as 'electronic collections of (near-) natural foreign or second language learner texts assembled according to explicit design criteria'. Although learner corpus research (LCR) has – from its onset – paid specific attention to the design criteria of learner corpora and to the collection of metadata (see Granger 1998a; Chapter 2, this volume), an even better control of some of the variables at play in SLA has only recently become a central concern in LCR. This focus on variables (be it in learner corpus data collection or analysis), combined with the use of ad hoc inferential statistics (see Chapter 8, this volume), now makes it possible for learner corpus specialists to use and analyse variables as dependent variables, potential predictors for the linguistic features of texts, or as dynamic factors that should be taken into account in the learning process.

Another key feature of LCR is that the data can be stored electronically and (pre-)processed with the help of (semi-)automatic corpus tools. This computational treatment makes it possible to analyse the production data of numerous learners (as opposed to more traditional SLA studies, which typically involve few participants). As a result, LCR studies are able to replicate earlier SLA studies but on much larger populations. Murakami's (2013a) replication study on the order of acquisition of morphemes is one example of replication studies carried out within an LCR paradigm. The aim of the present chapter is to illustrate how current learner corpora and LCR methods can be used to track development in the acquisition/learning of a language other than the mother tongue, both at group and individual level.

1 Ortega (2013: 2) suggests taking Selinker's (1972) field-defining publication as a convenient official marker of the disciplinary beginnings of SLA.

2 Core issues

2.1 Longitudinal vs pseudo-longitudinal

2.1.1 Study design

Unlike cross-sectional studies, which examine the language behaviour of a group or groups of language learners at a single point in their development, longitudinal studies are defined by Johnson and Johnson (1999) as studies which examine the language behaviour of one or more subjects as that behaviour develops over time. Longitudinal study designs thus follow the same individual(s) over time and collect language-related data from this/these individual(s) at different points in time. Longitudinal research is defined as 'emphasizing the study of change and containing at minimum three repeated observations on at least one of the substantive constructs of interest' (Ployhart and Vandenberg 2010: 97). This minimum of three data-collection points makes it possible to fit a developmental line and visualise potential effects on that line (linear progression or regression, U- or reversed U-shaped behaviour).2 Obviously, the more collection points there are, the more refined the interpretation of the development can be. When longitudinal data are collected at numerous intervals, the notion of 'dense data collection' is often used, especially when the corpus is accompanied by rich metadata. One example of a dense corpus (although devoted to the acquisition of the mother tongue) is the Human Speechome Corpus, which contains about 10 million words of transcribed recordings of child–caregiver interactions in natural contexts, corresponding to about 120,000 hours of audio and 90,000 hours of video and capturing an estimated 70 per cent of the child's first three years of waking hours3 (see Roy et al. 2012 for further details on the corpus). The trade-off between the density of the data and the number – or representativeness – of the subjects whose language data is collected is still an unresolved issue in LCR (see Section 4 below for further comments).

Collecting longitudinal learner corpus data is a real challenge, as it is both time consuming and requires much planning ahead. Another non-negligible problem, in today's stuck-in-fast-forward, competitive research agenda and publication timing, is that the analysis can only start when the entire data collection is over. Myles (Chapter 14, this volume) also mentions the prohibitive costs of collecting longitudinal data and the fact that research funders do not like to commit resources for very long periods of time. Attrition (i.e. the sometimes significant number of participants dropping out before each data-collection point) is another major challenge in dealing with longitudinal data, especially when it comes to learners of a second/foreign language whose learning histories cannot be predicted for certain (such as in the case of students who do not turn up for tests or data collections, decide to give up their studies, or change school or option). Such difficulties in collecting data mean that the high demand for longitudinal learner corpora is – quite unsurprisingly – met by few research teams collecting such data types.

When it is not possible to follow the same individuals over time, researchers can carry out a comparison of cross-sectional studies of different groups of learners at different developmental stages. Using such an approach yields what Johnson and Johnson (1999) call a pseudo-longitudinal effect, as the learners' productions compared do not come from the same learners, hence the use of the 'pseudo' prefix. The 'time' variable (which can be measured directly in longitudinal studies) is thus measured in pseudo-longitudinal designs by a proxy such as age or proficiency level. In other words, instead of following one group of students through every step of their progress in acquiring a target language, researchers compare several groups of learners displaying different levels of proficiency. Those groups, whilst containing different learners, nonetheless often share a number of characteristics in order to warrant some homogeneity (e.g. same mother-tongue background or same learning context). Such study designs are thus called pseudo-longitudinal (Johnson and Johnson 1999; Huat 2012: 197) or quasi-longitudinal (Granger 2002; Thewissen 2013). The longitudinal and pseudo-/quasi-longitudinal designs are graphically summarised in Figures 17.1a and 17.1b. They will be further illustrated with concrete examples in Section 3.

Figure 17.1a Graphical representation of a longitudinal study design (Student 1: data collection at times 1, 2 and 3)

Figure 17.1b Graphical representation of a pseudo- or quasi-longitudinal study design (proxy for time used here: proficiency level; Students 1, 2 and 3 at proficiency levels A1, A2 and B1 respectively)

It is important to stress that individual trajectories can only be accessed indirectly in quasi- or pseudo-longitudinal studies, as the data-collection procedure is inherently cross-sectional. In such designs, only group development can be measured. Individual variation within each group or sub-group can, however, be analysed. With longitudinal study designs, in contrast, group progress, individual variation within groups and individual trajectories can all be analysed. This requires the use of, for instance, multi-level modelling – also referred to as hierarchical linear modelling or mixed-effects models (see Raudenbush and Bryk 2002; Baayen et al. 2008; Cunnings 2012; Chapter 8, this volume). Multi-level modelling allows a variety of predictors to be analysed, with 'time' being a key predictor in longitudinal studies: do participants become more proficient as time goes by and, if so, how strong is the effect of time? Such statistical modelling can be applied to individuals within groups as well as to individuals as individuals, by analysing both endpoints and trajectories.

2 One of the limitations of two-wave studies (i.e. with only two data-collection points) is that any and all change from Time 1 to Time 2 will by default be linear (i.e. a straight line), which makes it impossible to determine a more precise form of change (Singer and Willett 2003: 9–10).
3 To my knowledge, there is no equivalent dense learner corpus available.
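The kind of multi-level model just described can be sketched with standard statistical libraries. The snippet below fits a random-intercept growth model (accuracy as a function of time, with a per-learner random intercept) using statsmodels' MixedLM on simulated data; the data frame, effect sizes and noise levels are invented, and a real analysis would add further predictors and possibly random slopes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulated longitudinal data: 30 learners, 3 data-collection points each.
learners = np.repeat(np.arange(30), 3)
time = np.tile([0, 1, 2], 30)
learner_intercept = rng.normal(0, 0.5, 30)[learners]  # individual variability
accuracy = 0.6 + 0.1 * time + learner_intercept + rng.normal(0, 0.1, 90)

data = pd.DataFrame({"learner": learners, "time": time, "accuracy": accuracy})

# Random-intercept model: fixed effect of time, random intercept per learner.
model = smf.mixedlm("accuracy ~ time", data, groups=data["learner"])
result = model.fit()
print(result.summary())  # the 'time' coefficient estimates group-level growth
```

The fixed 'time' coefficient captures group progress, while the estimated random-effect variance quantifies how much individual learners depart from that group trajectory.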

2.1.2 Learner corpora for developmental studies

Longitudinal learner corpora, which follow the same set of participants over multiple data-gathering sessions, are not very numerous.4 They include, among others:

• five subcorpora of the FLLOC (French Learner Language Oral Corpora) project,5 viz. the LANGSNAP Corpus, the Newcastle Corpus, the Progression Corpus, the Brussels Corpus and the Salford Corpus
• some of the subcorpora of the InterFra (Interlangue française) corpus6
• the longitudinal subcorpus of the Corpus Ecrit de Français Langue Etrangère (CEFLE)7
• the Barcelona English Language Corpus (BELC)8
• some subcorpora of the Corpus of Learner German (CLEG13)9
• the Telecollaborative Learner Corpus of English and German (Telekorp), a special kind of longitudinal corpus which contains bilingual contrastive learner data of computer-mediated communication between native German speakers and American non-native speakers of German (Belz and Vyatkina 2008)
• the LONGDALE (Longitudinal Database of Learner English) project10 (for further details, see Section 2.2).

When learner corpora are collected according to a pseudo-longitudinal design, a different but comparable set of participants is used for each data-gathering session. The different samples of participants recruited for each separate data collection are comparable for certain attributes relating to the study carried out (language studied, type of instructional setting, etc.) and typically differ in terms of age or proficiency level. The learner corpora collected using a pseudo-longitudinal design are more numerous and include, among others:

• four subcorpora of the FLLOC project mentioned earlier, viz. the Young Learners Corpus, the Linguistic Development Corpus and the UEA Corpus
• some of the InterFra subcorpora
• the Cambridge Learner Corpus (CLC)11
• the Cambridge English Profile Corpus (CEPC)12
• the National Institute of Information and Communications Technology Japanese Learner English Corpus (NICT JLE)13
• the Japanese EFL Learner Corpus (JEFLL Corpus) (see Tono 2000b)
• the Spanish Learner Language Oral Corpora (SPLLOC)14
• some CLEG13 subcorpora.

Table 17.1 provides a brief summary of the corpora listed above, their design, the targeted language studied and the data type covered.

4 For a survey of existing longitudinal learner corpora, see www.uclouvain.be/en-cecl-lcworld.html (last accessed on 13 April 2015).
5 See the project webpage at www.flloc.soton.ac.uk/index.html (last accessed on 13 April 2015) for a description of the FLLOC subcorpora. Each subcorpus is described in detail (types of learners, tasks, transcription conventions, headers used, database content).
6 For a detailed description, visit www.su.se/romklass/interfra (last accessed on 13 April 2015).
7 http://projekt.ht.lu.se/cefle/information/le-sous-corpus-longitudinal/ (last accessed on 13 April 2015).
8 www.ubgral.com/corpus.html (last accessed on 13 April 2015).
9 http://korpling.german.hu-berlin.de/public/CLEG13/CLEG13_documentation.pdf (last accessed on 13 April 2015).
10 www.uclouvain.be/en-cecl-longdale.html (last accessed on 13 April 2015).
11 www.cambridge.org/gb/elt/catalogue/subject/custom/item3646603/Cambridge-English-Corpus-Cambridge-Learner-Corpus/?site_locale=en_GB (last accessed on 13 April 2015). See also Chapters 22 and 23 (this volume) for more information on the corpus.
12 www.englishprofile.org/index.php/corpus (last accessed on 13 April 2015).
13 http://alaginrc.nict.go.jp/nict_jle/index_E.html (last accessed on 13 April 2015).
14 www.splloc.soton.ac.uk (last accessed on 13 April 2015).

Table 17.1. Select list of learner corpora collected according to a longitudinal or pseudo-longitudinal design

Corpus name | Longitudinal design | Pseudo-longitudinal design | Target language | Data type
LONGDALE (several subcorpora) | yes | – | English | Oral and written data
FLLOC (several subcorpora) | yes | yes | French | Oral data
InterFra (several subcorpora) | yes | yes | French (some other L2s are also present in some of the subcorpora, such as Swedish, Spanish, English and Italian) | Oral and written data
CLC | – | yes | English | Written responses to tests of English for Speakers of Other Languages
CEPC | – | yes | English | Oral and written data
NICT JLE | – | yes | English | Oral data
JEFLL | – | yes | English | Written data
CEFLE (longitudinal subcorpus) | yes | – | French | Written data
BELC | yes | – | English | Oral and written data
SPLLOC | – | yes | Spanish | Mainly oral data
CLEG13 | yes | yes | German | Written data
Telekorp | yes | – | German and English | Computer-mediated communication

The corpora listed in Table 17.1 have been collected in the framework of projects explicitly aiming at developmental studies. It should be added that some pure cross-sectional learner corpora have also been used to track development. Despite the fact that pure cross-sectional research cannot tell us anything about intra-individual or inter-individual change processes, the cross-sectional design initially adopted in some learner corpora has been supplemented by post hoc proficiency-level assessment of the subjects, thereby allowing some researchers to perform pseudo-longitudinal research. For instance, some of the essays collected in the International Corpus of Learner English (Granger et al. 2009) have been assessed for proficiency after collection. This post hoc assessment variable made the adoption of a pseudo-longitudinal design possible, as for instance in Thewissen's (2013) study on accuracy developmental patterns. This approach was also adopted in one of the FLLOC subcorpora, viz. the Reading Corpus, where students from an initially cross-sectional corpus (34 secondary school students, aged 16, who had all been learning French for five years, receiving four 35-minute lessons a week) were shown to display huge variation in terms of oral proficiency (from very little contribution to conversation to a level comparable with native speakers of French).

2.2 Analysing group and/or individual trajectories in LCR: some key considerations

Learners typically progress through various stages of proficiency, and in a vast majority of cases there is an overall positive correlation between the time and effort spent learning an additional language and the evolution of proficiency in the target language. This is clearly apparent in – and also intrinsically characteristic of – most educational settings, where a majority of learners gradually step from one level to a slightly higher one and tend to become increasingly proficient throughout their instructional path. The descriptors for linguistic competence included in the Common European Framework of Reference for Languages (CEFR; Council of Europe 2001) or in the ACTFL (American Council on the Teaching of Foreign Languages; see www.actfl.org, last accessed on 13 April 2015) Assessment of Performance toward Proficiency in Languages (AAPPL) clearly illustrate this approach to proficiency development. The CEFR and AAPPL descriptors are regularly used to inform curriculum design, language syllabuses, learning materials, tests (be they automated or not), language policies and teacher training programmes. Working with learner corpora can help inform pedagogical decisions, materials and practices at larger group and proficiency levels (see Section 3 in this chapter for some illustrations, and also Chapters 20 to 23, this volume, for discussions of the various types of pedagogical applications of LCR).

An exclusive focus on larger groups would, however, be too restrictive, as numerous types of individual differences are at play in SLA (see, for instance, Bigelow and Watson 2012; Duff 2012; Skehan 2012; Ushioda and Dörnyei 2012; Williams 2012 for recent research on individual differences). Individual differences typically include aptitude, motivation, identity issues, personality traits, type of working memory, socio-educational background, language proficiency in the mother tongue (L1) and other languages learnt, but also numerous aspects related to cognitive restructuring. As Bylund and Athanasopoulos (2014) recently put it: 'the extent and nature of cognitive restructuring in L2 [foreign/second language] speakers is essentially a function of variation in individual learners' trajectories'. Describing and understanding individual learning trajectories is thus – besides being valuable in itself – also essential to gaining a better understanding of group trajectories. The improved design, updated storage facilities and increasingly powerful descriptive and inferential methods of analysis used in LCR now make it possible to focus on learners as a group and learners within a group, but also on individual trajectories as such, a too often neglected aspect in the early days of LCR.

Not all the variables potentially related to individual differences in SLA can be recorded in the metadata of learner corpora. Metadata related to personality traits, identity issues and L1 aptitude are, to my knowledge, not included in current learner corpora. In contrast, variables such as age or information on the socio-educational background usually are. More recent longitudinal learner corpora also include complementary data types (other than authentic production data) to facilitate access to cognitive features of SLA. In the LONGDALE project, for instance, the same students are followed over a period of at least three years and data collections are organised at least once per year. The database contains argumentative essays, narratives and informal interviews, but also more guided types of productions (such as picture descriptions). Experimental data is also included for some of the subcorpora. The metadata are stored in a comprehensive learner profile which is gathered during each data-collection session. The variables include, among others: age, gender, educational background, country, language background, variables pertaining to the task and, when available, information on the proficiency levels of the students as measured by internationally recognised tests.

The subcorpora in the FLLOC project also contain rich data types. As explained in the previous section, some of these corpora are longitudinal and others are cross-sectional. The LANGSNAP (Languages and Social Networks Abroad Project) Corpus is longitudinal and documents the development of modern language students' knowledge and use of the target language over a 23-month period including a 9-month stay abroad. LANGSNAP was collected to investigate learners' evolving social networks while abroad, the factors influencing type and amount of language engagement abroad, the kinds of learning opportunities afforded by target language interaction in a year-abroad context, and the relationship between social networking, affect, social interaction and language learning. The data collected include authentic oral interactions but also day-long participant observation ('shadowing'). The Young Learners Corpus, also part of the FLLOC project, is cross-sectional but its design makes a pseudo-longitudinal approach possible. The corpus aims to document the development of linguistic competence among young classroom learners of French at three different starting ages, in primary and early secondary school classrooms, and to identify similarities and differences (comparison of the rates of development at different ages after the same amount of classroom exposure; comparison of the classroom-learning strategies used by children at different ages and their attitudes to language learning). The corpus contains about forty hours of French language teaching for each of those three groups of learners. All language classes were recorded and, in addition, testing of the learners' French language proficiency took place at four different stages.
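Learner profiles of this kind are essentially structured records that can be filtered and cross-tabulated against the production data. Below is a minimal sketch of how such metadata might be represented and queried; the field names and values are illustrative inventions, not the actual LONGDALE schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LearnerProfile:
    # Illustrative fields only; not the LONGDALE schema.
    learner_id: str
    age: int
    gender: str
    country: str
    l1: str
    other_languages: Tuple[str, ...]
    proficiency: Optional[str]  # e.g. a level from a recognised test
    wave: int                   # data-collection session (1, 2, 3, ...)

profiles = [
    LearnerProfile("L001", 18, "F", "BE", "French", ("Dutch",), "B2", 1),
    LearnerProfile("L001", 19, "F", "BE", "French", ("Dutch",), "C1", 2),
    LearnerProfile("L002", 18, "M", "BE", "French", (), None, 1),
]

# Select all wave-2 records for learners whose L1 is French: linking
# metadata to text files is what enables longitudinal sub-sampling.
wave2_french = [p for p in profiles if p.wave == 2 and p.l1 == "French"]
print(wave2_french)
```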

2.3 The focus of developmental studies in LCR

Numerous longitudinal studies have been carried out in SLA. Ortega and Iberri-Shea (2005) explain that the focus of such studies is often strictly linguistic (concentrating mainly on L2 morphology), and that the studies typically involve few participants (although more recent studies tend to include more participants). The authors note, however, that there has been a recent broadening of the linguistic focus and also of the epistemological approach to language development, as illustrated, for instance, by longitudinal studies situated within a Vygotskian sociocultural framework. Ortega and Iberri-Shea (2005) refer to Belz and Kinginger's (2002) study of address form use as an example of such a new epistemological stance. The study documents critical incidents that contributed to the learning of indexical politeness (the use of French tu/vous and German Du/Sie) by two fourth-semester foreign language students at an American university taking part in a telecollaboration project. All the learners' interactions were collected in a learner corpus.

Thanks to its powerful lexical analysis techniques (in particular the extraction of collocations and recurrent word sequences), corpus linguistics has shifted the focus of linguistic analysis from grammar to lexis, and more especially phraseology, and LCR developmental studies reflect this trend. For example, Horst and Collins's (2006) longitudinal study tracks vocabulary growth and draws on an 80,000-word longitudinal corpus consisting of narrative texts produced by 210 beginner-level francophone learners of English. The samples were collected at four 100-hour intervals of intensive language instruction and the authors used lexical frequency profiling techniques. The authors find that although learners continue to use large proportions of frequent words over time, their productive vocabulary features fewer French cognates, a greater variety of frequent words and more morphologically developed forms. Marsden and David (2008) describe vocabulary use during semi-spontaneous oral production amongst instructed learners of French and Spanish at two different stages in the British educational system: Year 9 (near beginners) and Year 13 (approximately low intermediates). This pseudo-longitudinal study compares lexical diversity (range or variety of vocabulary used) across languages and across years, including analyses of different word classes. A final illustration is Bestgen and Granger's (2014) study, which focuses on phraseology and aims to assess the role played by phraseological competence in the development of L2 writing proficiency and text quality assessment. The authors use CollGram, a technique that assigns to each pair of contiguous words (bigrams) in a learner text two association scores, viz. mutual information and t-score, which are computed on the basis of a large reference corpus (see the sketch after this section). The results show a longitudinal decrease in the use of collocations made up of high-frequency words that are less typical of native writers. As the study is conducted both longitudinally and pseudo-longitudinally, it also helps identify the respective contribution of each research design to the study of L2 writing development. Other lexically oriented studies tracking learners' development include Chen (2013b) on phrasal verbs, Crossley and Salsbury (2011) on lexical bundles (for a detailed description, see Chapter 10, this volume), Kobayashi (2013) on the comparison between spoken and written productions, and Verspoor et al. (2012) on various types of lexical chunks (among other features examined in the study).

Other studies focus on morphology, grammar and syntax, often taking advantage of linguistic annotation tools such as part-of-speech (POS) taggers or, more rarely, parsers (see Chapter 5, this volume). For example, Vyatkina's (2013a) investigation of the development of grammatical complexity features relies on a POS-tagged corpus (see Section 3.3 for more details). Van Vuuren (2013), on the other hand, uses a syntactically annotated longitudinal corpus of student writing and compares it to a native reference corpus. She focuses on information structural transfer and analyses clause-initial adverbials in English as a Foreign Language writing produced by Dutch learners. Cross-linguistic differences in the information status of clause-initial position in a verb-second language like Dutch (compared to a Subject-Verb-Object order language like English) are hypothesised to result in an overuse of clause-initial adverbials in the writing of advanced Dutch learners of English. She observes that although there is a clear development in the direction of native writing, transfer of information structural features of Dutch can still be observed even after three years of extended academic exposure.

Some other papers address the relationship between lexical development and morphosyntactic measures. One example is David et al.'s (2009) study on lexical development in instructed L2 learners of French. The authors of this cross-sectional study analyse the relationship between lexical development and morphosyntactic measures in sixty instructed learners of French in Years 8, 10 and 12. This better understanding of SLA among young learners is meant to inform current primary language initiatives and educational practices in the United Kingdom and internationally.
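The two bigram association scores used in the CollGram technique mentioned above can be illustrated compactly. The sketch below uses the standard corpus-linguistic formulas for mutual information and t-score; the word and bigram counts are invented, and CollGram's exact implementation details may differ.

```python
import math

def bigram_association(f_xy, f_x, f_y, n):
    """Mutual information and t-score for a bigram (x, y).

    f_xy: bigram frequency; f_x, f_y: frequencies of the individual
    words; n: size of the reference corpus in tokens.
    """
    expected = f_x * f_y / n
    mi = math.log2(f_xy / expected)              # favours rare, exclusive pairs
    t_score = (f_xy - expected) / math.sqrt(f_xy)  # favours high-frequency pairs
    return mi, t_score

# Invented counts for 'bated breath' in a 100m-token reference corpus:
# rare words that almost always co-occur -> high MI.
print(bigram_association(f_xy=95, f_x=100, f_y=5_000, n=100_000_000))
# Invented counts for 'of the': a very frequent combination -> low MI
# but a very high t-score.
print(bigram_association(f_xy=450_000, f_x=3_000_000,
                         f_y=6_000_000, n=100_000_000))
```

The contrast between the two measures is what makes them useful developmentally: high-t-score/low-MI bigrams tend to be built from high-frequency words, whereas high-MI bigrams are the more native-like, lexically specific collocations.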

2.4

Learner corpora, production data: what’s in a name?

Different terminological options may sometimes lead to the conclusion that some data types are underrepresented. However, some mainstream, non-corpus-based SLA studies using data not referred to as learner corpora can be very similar to learner corpus studies and hence highly relevant to LCR. For example, Serrano et al. (2012) carry out a longitudinal analysis of the effects of one year abroad. They analyse the progress of fourteen Spanish-speaking learners of English during a one-year stay at a British university. Both oral and written data were collected (at three data-collection points) and the samples were analysed in terms of fluency, syntactic complexity, lexical richness and accuracy. Two main research questions are investigated: (i) does L2 proficiency in oral and written production develop at the same pace while abroad or is improvement in one modality faster than in the other? and (ii) can learners' individual variables, such as attitudes or chances to interact abroad, explain certain aspects of language development in oral and written production? Students' background information (referring to language attitude and language use) and authentic production data were collected and transcribed. The Computerized Language Analysis program (MacWhinney 2000) and the Statistical Package for the Social Sciences (SPSS) were used for the coding and analyses of the writing samples. The descriptive statistics for oral and written productions included fluency (syllables/minute), syntactic complexity (clauses/T-unit), lexical richness (Guiraud's Index) and accuracy (errors/T-unit). The results of the statistical analyses indicate that, while a few months abroad might be sufficient for some gains in oral performance, improvement in written production is slower. The type of interaction experienced and some attitudinal features were shown to partly explain language development in some areas. Whilst Serrano et al.'s (2012) study would have escaped bibliographical searches relying on a keyword such as 'learner corpus', as the word is not used once in the article, it clearly has all the features of a longitudinal learner corpus study and is of high relevance for LCR.

Another example is Ferris et al.'s (2013) longitudinal study on written corrective feedback in L2 writing. The authors adopt a longitudinal (15-week semester), multiple-case (ten university-level L2 writers) classroom research design to address the impact of written corrective feedback for individual L2 writers. Although the focus of the study is primarily on students' descriptions of their own self-monitoring processes as they revise marked papers and write new texts, Ferris and colleagues set up data files for each of the ten students; those files include, among other things, the marked and revised texts (annotated for errors) and progress charts for each of the learners. After each of the first three timed writings, the researchers marked the 3–4 most prominent error patterns in each text written by the ten participants. In the study, 'prominent' could mean most frequent, most serious for overall text effectiveness, or some combination of the two. The individual researchers marking the texts were asked to use their best judgement about which 3–4 error types to mark. The authors argue that this procedure is similar to what classroom teachers do if they choose to mark student errors selectively rather than comprehensively. The types of errors marked by the researchers include article usage, lexical choices, missing words, sentence structure, agreement and punctuation. The ten case-study participants were shown to make a wide variety of errors in their timed writing assignments, and individual learners' error patterns changed over the course of the semester, with, for instance, some learners making fewer word choice errors (vocabulary) but more sentence-level errors (syntax). Here too, whilst some differences appear between what could be called learner corpus research and Ferris et al.'s (2013) work (notably more reliance on intuition in the marking of the errors, no clear definition of what a prominent error may be, and no error-tagging scheme provided in the article), the similarities with learner corpus data are numerous, even if the term 'corpus' is not used in the article. Those similarities include the collection of metadata and of production data, the annotation of data and the establishment of progress charts. An additional strength of the article is the combination of various data types to best pinpoint the impact of instruction.

As has been shown in this section, longitudinal and pseudo-longitudinal studies point to the need for richly documented (multi-)data and can be carried out using corpus data only or corpus data complemented by other data types. Whether the term learner corpus is explicitly mentioned or not, the validity of analysing authentic production data as one of the key data types to access learners' developmental patterns can no longer be questioned. To quote Larsen-Freeman and Cameron (2008b: 210), using learner corpora 'give[s] us access to stabilized patterns and variability around them'.
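To make measures of the kind used by Serrano et al. (2012) concrete, the following minimal Python sketch computes Guiraud's Index and the two 'per T-unit' ratios from counts an analyst would supply after manual clause, T-unit and error annotation. The token sample and all counts are invented for illustration, and fluency (syllables/minute) is left out, as it additionally requires timing information and syllabification.

```python
import math

def guiraud_index(tokens):
    # Guiraud's Index of lexical richness: word types / sqrt(word tokens)
    return len({t.lower() for t in tokens}) / math.sqrt(len(tokens)) if tokens else 0.0

def per_t_unit(count, n_t_units):
    # Generic "per T-unit" ratio, used for clauses/T-unit and errors/T-unit
    return count / n_t_units if n_t_units else 0.0

sample = "the student arrive yesterday and she was very happy to see the campus".split()
print(round(guiraud_index(sample), 2))  # lexical richness of the sample
print(per_t_unit(14, 9))                # e.g. 14 clauses over 9 T-units
print(per_t_unit(3, 9))                 # e.g. 3 errors over 9 T-units
```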

3 Representative studies

This section presents four studies illustrating the various designs presented in Section 2, with the first two being pseudo-longitudinal and the last two longitudinal. The publications also focus on various linguistic features and L2s: demonstrative reference for Austrian learners of English (Schiftner and Rankin 2012), prefabricated sequences for Swedish learners of French (Bartning and Forsberg 2006), syntactic complexity features for American learners of German (Vyatkina 2013a), and tense and aspect acquisition for French learners of English (Meunier and Littré 2013). The number of learners/subjects included in the representative studies also varies, from two focal learners up to thirty-eight learners, and the data-collection points range from three up to fourteen.

3.1 Schiftner, B. and Rankin, T. 2012. 'The use of demonstrative reference in English texts by Austrian school-age learners', in Tono, Y., Kawaguchi, Y. and Minegishi, M. (eds.), Developmental and Crosslinguistic Perspectives in Learner Corpus Research. Amsterdam: Benjamins, pp. 63–82.

Schiftner and Rankin's (2012) study seeks to identify developmental patterns in the usage of demonstrative reference in the written production of beginner and intermediate Austrian learners of English. The data used come from the International Corpus of Crosslinguistic Interlanguage (ICCI).16 Seven data sets coming from the Austrian subcorpus of ICCI have been used, focusing on learners from Grade 5 to Grade 11 (i.e. with the age of learners ranging from approximately 10 to 17). Part of the Louvain Corpus of Native English Essays (LOCNESS)17 has been used as the native-speaker reference corpus. Concordances for the four forms of demonstratives (that, those, this and these) were extracted using WordSmith Tools (Scott 2012). The demonstrative forms were coded manually for a number of grammatical and referential properties: grammatical function, proximity (proximal – P – or distal – D), number, type of reference (exophoric, anaphoric, cataphoric) and referent (noun phrase or proposition). It should be noted that learners sometimes use distal pronouns (e.g. that) to express proximal reference, hence the existence of, for instance, 'that P' annotations.

The results show that the overall frequency of demonstratives in all the learner subcorpora is lower than the overall frequency in the native corpus. However, there are differences between the individual demonstratives. The pronoun that is consistently overused (also when used as a proximal demonstrative) across all levels, while the pronoun this is consistently underused. Despite many differences between the native and non-native corpora, even some low-proficiency learners show similarities with native speakers in their use of demonstratives. For example, demonstratives are used most frequently as short-range anaphors (i.e. referring to rather close antecedents in the text). The authors also briefly comment on the pedagogical implications of their work. They suggest expanding the scope of teaching demonstratives beyond the properties of reference and proximity to include explicit comments on the larger syntactic patterns in which demonstratives occur.

16 See http://cblle.tufs.ac.jp/llc/icci/ (last accessed on 13 April 2015).
17 www.uclouvain.be/en-cecl-locness.html (last accessed on 13 April 2015).
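A first, purely frequency-based pass over such data can be automated before the manual coding stage. The sketch below uses invented file names and deliberately ignores the fact that that also functions as a conjunction and relativiser, which is precisely why Schiftner and Rankin coded their concordance lines by hand; it simply compares relative frequencies of the four demonstrative forms in a learner subcorpus and a native reference corpus.

```python
import re
from collections import Counter

DEMONSTRATIVES = ("this", "that", "these", "those")

def tokenise(path):
    # Crude tokeniser: lower-cased alphabetic strings only
    with open(path, encoding="utf-8") as f:
        return re.findall(r"[a-z']+", f.read().lower())

def per_10k(tokens):
    # Relative frequency of each demonstrative per 10,000 running words
    counts = Counter(t for t in tokens if t in DEMONSTRATIVES)
    return {w: 10_000 * counts[w] / len(tokens) for w in DEMONSTRATIVES}

learner = per_10k(tokenise("icci_austria_grade9.txt"))  # hypothetical file names
native = per_10k(tokenise("locness_sample.txt"))

for w in DEMONSTRATIVES:
    print(f"{w:<6} learner {learner[w]:6.1f}  native {native[w]:6.1f}  (per 10,000 words)")
```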

3.2 Bartning, I. and Forsberg, F. 2006. 'Les séquences préfabriquées à travers les stades de développement en français L2', in Actes du 16e congrès des romanistes scandinaves. Department of Language and Culture, Roskilde University.

Bartning and Forsberg (2006) analyse the stages of development in prefabricated sequences (PSs) produced by Swedish learners of French. They map the acquisition of PSs onto stages of morphosyntactic acquisition for the same learners. The six stages, described in detail in Bartning and Schlyter (2004a, 2004b), are labelled as initial, post-initial, intermediate, lower advanced, mid advanced and upper advanced. Only the first five stages have been examined in the present study. Bartning and Forsberg (2006) divide the PSs into five main categories: lexical PSs (e.g. coup de foudre, faire la fête); grammatical PSs (e.g. pas du tout, être en train de); discursive PSs (parce que, je veux dire que, tout à fait); interlanguage PSs, which are syntactically or semantically deviant from native-like PSs but nonetheless used by learners as holistic and repeated PSs (such as the repeated use of c'est tout passé bien instead of the target native PS, which is tout s'est bien passé); and autobiographical PSs (e.g. je m'appelle, j'ai x ans).

A total of thirty semi-guided interviews have been used: twenty-five interviews with Swedish learners of French – coming from the InterFra (Bartning 2002) and Lund corpora (Granfeldt 2005) – and five interviews with native speakers of French. The beginner learners in the corpus had no or almost no French prior to data collection (and were aged 19 to 30); the intermediate ones had an average of 3.5 years of French (and were aged 16 to 18); and the advanced learners had from 4.5 up to 6 years of instruction in French and were university students aged 19 to 26. As can be seen, the design of the study is pseudo-longitudinal and the proxy used is proficiency level correlated with number of years of instruction in the target language (but not correlated with age). In this study, the authors provide the numbers of PSs per type – and per learner too – and also individual and average percentages of words included in PSs out of the total number of words produced.

The results show that lexical PSs are those that most distinguish learners from native speakers. Learners display a lack of progress from the initial to intermediate stages. Those first three stages are followed by substantial progress in stages four and five (i.e. lower and mid advanced). However, at these more advanced stages, the types and frequencies of PSs displayed by learners are still quite different from those produced by native speakers of French. Grammatical PSs are the only ones that do not display clear progress, with similar percentages found throughout the various stages. The use of discursive PSs, in contrast, increases significantly from the initial to post-initial stage. Autobiographical PSs are found mainly in initial stages and are almost non-existent at more advanced stages. This is due to the fact that autobiographical PSs are typical of discussion topics at initial stages of proficiency. The interlanguage PSs decrease significantly with proficiency: from 21% of interlanguage PSs at the initial stage to only 1% at the mid advanced stage. In terms of overall frequencies, Bartning and Forsberg (2006) show that the proportion of PSs in learners' speech increases with proficiency. The authors nonetheless conclude that whilst verbal morphology displays what they call a strict development (p. 19), prefabricated language does not seem to follow such strict development and is more sensitive to input and to the communicative style of individual learners.
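The key quantitative measure here – the percentage of words falling inside PSs – is easy to derive once the sequences themselves have been identified. A minimal sketch follows, assuming a hypothetical inline mark-up in which annotators wrap each PS in [PS ...]; the French example utterance is invented.

```python
import re

PS = re.compile(r"\[PS ([^\]]+)\]")  # assumed annotation scheme: [PS pas du tout]

def ps_word_percentage(annotated):
    # Words inside annotated PSs as a share of all words produced
    ps_words = sum(len(seq.split()) for seq in PS.findall(annotated))
    total = len(PS.sub(lambda m: m.group(1), annotated).split())
    return 100 * ps_words / total if total else 0.0

utterance = "alors [PS je veux dire que] le film etait [PS pas du tout] interessant"
print(round(ps_word_percentage(utterance), 1))  # -> 58.3 (% of words inside PSs)
```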

3.3 Vyatkina, N. 2013a. 'Specific syntactic complexity: Developmental profiling of individuals based on an annotated learner corpus', The Modern Language Journal 97(S1): 11–30.

Vyatkina (2013a) analyses specific syntactic complexity by studying the developmental profiling of individuals on the basis of an annotated learner corpus. The study aims to track the development of syntactic complexity – with a focus on individual developmental pathways – and aims at pedagogical improvements in the teaching of writing to beginners. The author provides an in-depth analysis of the writing of two beginner L2 German learners (with L1 English) over four semesters of collegiate language study by using developmental profiling techniques. The study design is longitudinal, with multiple and relatively dense data-collection waves (fourteen measurement occasions corresponding to the fourteen units seen in the textbook used in class). Vyatkina explores variation in terms of the frequency of some complexity features (such as coordinate, nominal and non-finite verb structures) and, to do so, she uses corpus analysis techniques with semi-automatic corpus annotation: initial automatic POS tagging, manual checking of the output, and manual selection and counting of more complex structures. The developmental dynamic is explored as sets of complexification strategies (see Ortega 2012) or repertoires of choices (as described in Ortega and Byrnes 2008a) of specific syntactic structures, used by learners at each of the fourteen measurement occasions. In so doing, Vyatkina analyses learner development in terms of multidimensional variability and non-linear relationships between the instructional progression and individual developmental paths.

The study follows the 'instruction-embedded total-sampling approach' (Byrnes et al. 2010: 165), in which writing samples were rough drafts of essays written by the students in response to curricular tasks rather than to external experimental tasks. The participants were students enrolled in a beginning German language programme at university over four sequential 16-week-long semesters. All classes were taught by graduate teaching assistants who followed a uniform syllabus and used the same textbooks. Each writing task concluded a corresponding textbook chapter and reflected the book's instructional content, including the focus on selected grammar structures. During the first three semesters, students typed each essay in class under timed conditions. They were required to write during the whole 50-minute-long class period and were allowed to use online dictionaries but not online translators, textbooks or notes. During the fourth semester, they wrote essays at home under untimed conditions and were allowed to use reference materials. The very last essay was again timed and written under controlled conditions. As argued by Vyatkina, variation in tasks and topics may affect linguistic complexity. However, the present study does not focus on these specific effects but rather on how two different learners respond to one and the same task at each time point.

The data collected were then annotated for syntactic complexity. The learner corpus was first tagged automatically for fifty distinct word classes using the TreeTagger for German (Schmid 1994) and the output was manually checked. The annotation of specific complex structures was then performed using a mixture of automatic and manual searches (e.g. searching for coordinating conjunctions using the Concord function of WordSmith Tools (Scott 2012) and then counting different coordinate structures manually based on the context of each retrieved example). More complex structures (e.g. clause types and infinitive constructions) were annotated manually by two independent annotators.

Without going into the details of each specific syntactic measure used in the study, it can be said that whilst the results show a general developmental trend towards increased frequency and range of syntactic complexity features, the trajectories of the two focal learners reveal divergence between the learners in the second half of the observation period. One male participant readily responds to instruction but abandons some syntactic features when progressing to the next task, whilst the other focal learner balances both previously learned and new features in her writing. The pedagogical implications derived from the study have direct classroom relevance. Vyatkina (2013a) proposes the design of rubrics listing specific lexical and syntactic features (with examples) associated with each level-appropriate writing task, i.e. some sort of idealised writing profiles, with model texts and the inclusion of contextual use and functions, in order to raise learners' awareness of what their expected developmental targets are.
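The automatic first step of such a workflow can be reproduced in a few lines of code. The sketch below – file names are invented, and the resulting counts would still require the manual contextual checking Vyatkina describes – reads TreeTagger's standard token/tag/lemma output and tracks the rate of coordinating conjunctions (STTS tag KON) across the fourteen measurement occasions.

```python
def read_treetagger(path):
    # TreeTagger prints one "token<TAB>tag<TAB>lemma" line per token
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t") for line in f if "\t" in line]

def tag_rate_per_100(rows, target_tags):
    # Frequency of the target POS tags per 100 running tokens
    hits = sum(1 for row in rows if row[1] in target_tags)
    return 100 * hits / len(rows) if rows else 0.0

for occasion in range(1, 15):
    rows = read_treetagger(f"learnerA_essay{occasion:02d}.tt")  # hypothetical files
    print(f"occasion {occasion:2d}: "
          f"{tag_rate_per_100(rows, {'KON'}):4.1f} coordinating conjunctions per 100 tokens")
```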

3.4 Meunier, F. and Littré, D. 2013. 'Tracking learners' progress: Adopting a dual "corpus cum experimental data" approach', The Modern Language Journal 97(S1): 61–76.

In this article, Meunier and Littré (2013) track French learners' progress in the acquisition of the English tense and aspect system. To do so, they adopt a dual approach and use both corpus and experimental data. The first part of the article reviews the status of longitudinal research on the acquisition of tense and aspect in SLA and explains that L2 longitudinal studies have often mirrored typical L1 longitudinal studies by tracing the first steps of language acquisition (either by children or adults learning the L2). Other common features of tense and aspect studies in SLA are that a majority of the studies focus on very few learners or have selected learners from a variety of mother-tongue backgrounds (which helps uncover universal paths of acquisition but limits further investigations into transfer effects). In their longitudinal study, Meunier and Littré want to highlight the value of a learner corpus approach to the study of tense and aspect development (more specifically in that study, simple present vs present continuous) but also insist on the importance of combining learner corpus data with other data types (grammaticality judgement tests and grammaticality cum interpretation tasks). This combination makes it possible to go beyond the identification of difficulties (or lack thereof) in the learners' acquisition over time and to try and uncover the cause(s) of some remaining language difficulties even after many years of exposure to, instruction in and use of English.

The authors have decided to focus on more advanced levels of acquisition by analysing the written productions of a cohort of thirty-eight French-speaking English language and literature students at the University of Louvain, with each participant having contributed three argumentative essays (one per year over a three-year span). These essays are part of the LONGDALE project (presented in Section 2.2). Whilst a total of thirty-eight learners might not appear very impressive by corpus standards, the study nonetheless has a much larger sample size than most of the existing longitudinal studies on tense and aspect, which generally have an average number of informants lower than ten (and often lower than five). For the very few studies on tense and aspect carried out on more than twenty subjects (see, for instance, Klein and Perdue 1992; Bardovi-Harlig 1992), the informants came from a mixture of native-language backgrounds.

The results of the multi-level regression analysis show that the time predictor has a positive effect on the decrease in tense and aspect errors produced by learners over a period of three years, both at group and individual levels. It is interesting to note that various statistical models were also fitted on more data (to test the impact of attrition, taking into account essays produced by learners at year 1 and year 3 but not at year 2, for instance). These models included list-wise deletion (i.e. selecting only the thirty-eight participants who wrote three essays), taking only a subset of participants (i.e. only the groups that did not significantly differ from other groups were included, meaning that participants who dropped out between year 1 and year 2 and those who joined the study in year 3 were rejected) or taking all participants (i.e. all participants, including those with attrited data). Irrespective of the model tested, time has a significant effect on the reduction of tense and aspect errors (a reduction ranging from about 16% to 27%, depending on the model fitted). The top ten error pairs were reorganised into three categories, namely aspect-only errors (i.e. when the tense is correct and the aspect is not), tense-only errors (i.e. when the tense is incorrect and the aspect correct) and mixed tense and aspect errors (i.e. when both tense and aspect are incorrect). The results show that more than 50% were aspect-only errors, with progressive present/simple present ranking at the top and accounting for 25% of all the tense–aspect errors. The experimental data analysis, carried out to help trace the reason for the persistence of errors related to the progressive aspect, shows that whilst learners master the most salient elaboration of the progressive (viz. ongoingness), their understanding of less core uses (e.g. planned events) is much less precise.

Subsequent guidelines for classroom teaching are proposed in the study, among which the idea that teachers of advanced French learners of English should not necessarily review the whole range of uses of tenses and aspect from year to year (as is commonly done). They should no longer teach the most prototypical uses to more advanced learners and should spend more time on less frequent or less core uses (e.g. the modal meanings of the present progressive).
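For readers who want to see what a multi-level (mixed-effects) analysis of this general kind looks like in practice, here is a minimal sketch using the statsmodels library. The input file is a hypothetical long-format export with one row per essay, and the single-predictor model is a simplified stand-in for the fuller specifications tested in the article, not a reconstruction of them.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical export: columns learner, year (1-3), errors_per_tunit
df = pd.read_csv("longdale_tense_aspect.csv")

# Random-intercept model: does time predict a drop in tense/aspect errors,
# while allowing each learner their own baseline error level?
model = smf.mixedlm("errors_per_tunit ~ year", data=df, groups=df["learner"])
print(model.fit().summary())
```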

4 Critical assessment and future directions

The discussion of the core issues and the presentation of representative studies have pointed to the fact that learner corpora are solid and reliable data sources to trace learners' proficiency development in an L2. As pointed out in Section 1, many variables recorded as metadata in LCR can be used as dependent variables, potential predictors or dynamic factors impacting SLA. As for the learners' productions, they can be analysed as being representative of larger groups or populations (on the basis of the variables encoded in the corpus), but within-group variability and individual trajectories can also be accessed. The linguistic focus of the studies presented is also extremely varied and encompasses all dimensions of the complexity, accuracy and fluency paradigm (Housen et al. 2012). In the last paragraphs I would like to stress what I consider essential steps to be taken for a sound and healthy development of longitudinal learner corpus studies.

First, whilst proxies (such as proficiency level) can undoubtedly be used to circumvent the difficulties inherent in the collection of longitudinal data (see Section 2.2), it is essential that sustained efforts be devoted to the collection of longitudinal data, as 'longitudinal designs can uniquely help researchers document the lengthy trajectories of adults who strive to become multicompetent and multicultural language users' (Ortega and Byrnes 2008a: 18). The collection of longitudinal data goes hand in hand with the need for new practices/requirements in learner corpus data collection. I would, for instance, plead for the collection of information related to proficiency in the learners' L1. This would enable researchers to be much more specific in their future analyses and interpretations of bi- and multi-literacy practices. L1 production data should also ideally be collected, as it would enable an integrated comparison of the learners' proficiency levels in their L1 and L2 and would greatly enhance the interpretation of the results for individual trajectories (including access to features of learning disabilities such as dysorthographia or dysgraphia for writing and dyslexia for speaking).

Another requirement is the ongoing/dynamic collection of metadata. The importance of metadata in learner corpus studies is paramount, and perhaps even more so in longitudinal designs, as researchers follow learners over a much longer period of time. This naturally implies that the initial metadata collected at, say, Time 1 will not be self-sufficient. Researchers will want to document the learning paths (courses taken, stays abroad, amount and type of language practice, etc.). Such rich and dynamic metadata will be essential for a refined understanding and interpretation of future research results.
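One way to operationalise such dynamic metadata is to store, alongside the stable learner profile, one dated metadata record per data-collection wave. A minimal sketch follows; the fields shown are illustrative assumptions, not a proposed standard, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Wave:
    # Metadata that can change between data-collection points
    date: str
    courses_taken: List[str]
    months_abroad_so_far: float
    weekly_l2_use_hours: float

@dataclass
class LearnerRecord:
    # Stable profile information plus one metadata record per wave
    learner_id: str
    l1: str
    l1_proficiency_score: Optional[float] = None  # rarely collected today; see above
    waves: List[Wave] = field(default_factory=list)

rec = LearnerRecord("S038", "French")
rec.waves.append(Wave("2013-10", ["Academic Writing I"], 0.0, 4.5))
rec.waves.append(Wave("2014-10", ["Linguistics II"], 6.0, 9.0))  # after a stay abroad
```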

Second, in terms of linguistic features analysed, lexis (single or multi-word units) and grammar have occupied pride of place in LCR. The lexis–grammar interface and the patterned nature of language have also been central. Communication strategies, in contrast, have been the poor relation. Future research should also focus on these strategies, viz. how language learners maintain communication, make meaning and negotiate meaning. Although such issues have been partially addressed, mainly in LCR studies using computer-mediated communication, they nonetheless deserve more attention in the future.

A third issue that I find essential to consider is that of corpus size and representativeness as measured, among other things, by the number of subjects included in a developmental study. Case-study approaches are not always particularly valued in corpus-linguistic circles; Ortega and Byrnes (2008a: 9) even speak of 'the contested legitimacy of the approach in certain social science circles, including sectors of applied linguistics'. That said, the highly valued exponential growth of corpus sizes, as exemplified by the collection of what Davies calls 'second generation mega corpora',18 and the numerous benefits that can be gained from the analysis of such corpora may not necessarily be the desired path to follow when it comes to longitudinal learner corpora. Whilst 'big is beautiful' is still a valid motto in corpus circles, smaller but much 'denser' longitudinal learner corpora should also be valued, and collected. To quote Polat (2011: 3754), 'despite its time-consuming and labor-intensive collection process, the use of a dense developmental corpus seems to be a very promising research approach, especially if paired with more qualitative analyses'. As mentioned in Section 2.1, learner corpus collection implies a constant trade-off between the density of the data and the number and/or representativeness of the subjects whose language data is collected. More subjects are typically involved in cross-sectional data-collection designs, which lend themselves more naturally to quantitative analyses. In longitudinal designs, in contrast, fewer subjects are often involved but more qualitative studies can be performed, as individual trajectories can be analysed in detail.

18 See http://davies-linguistics.byu.edu/ling485/for_class/corpora_notes.htm (last accessed on 13 April 2015).

In order to combine the strengths of various approaches, a mixed approach can be used. Johnson et al. (2007: 112) state that mixed-methods research (MMR) is 'becoming increasingly articulated, attached to research practice, and recognized as the third major research approach or research paradigm, along with qualitative research and quantitative research'. The promotion of MMR in LCR is the fourth issue that I would like to identify. Bergman (2008: 1) defines MMR as 'the combination of at least one qualitative and at least one quantitative component in a single research project or program'. As for Ibbotson (2013: 2), he explains that no matter whether 'the focus is on language processing, acquisition, or change', in usage-based linguistics knowledge of a language 'is based in knowledge of actual usage and generalizations made over usage events' and the complexity of language 'emerges not as a result of a language-specific instinct but through the interaction of cognition and use'. LCR, with its exclusive focus on (semi-)authentic language use and analysis, constitutes one of the usage-based paradigms and, as such, lends itself well to MMR. It must be kept in mind, however, that a quality standard for MMR is, as Hashemi and Babaii (2013: 828) state, to achieve high degrees of integration at various stages of the study. This integration has, for instance, been achieved in Rosi's (2009) study on the acquisition of aspect in Italian L2. Her research involves work on a native Italian corpus and on longitudinal learner corpus data, which is supplemented by the analysis of experimental data; there is constant to-ing and fro-ing between quantitative and qualitative analyses, which are carried out and interpreted within a cognitive and connectionist framework.

Finally, the last issue concerns availability: more data should be made available to a larger research community. This wish, expressed repeatedly by Myles (2008 and other publications), goes beyond making the data available and includes making statistical code, task prompts or coding systems available in order to favour replication studies and enhance the expertise in analysing longitudinal data. As argued by Littré (2014), this may help other researchers better understand the choices made in a specific study, but it also represents a commitment to the openness and transparency that is central to the scientific endeavour. The author adds that whilst a necessary balance must be struck between openness and other practical considerations (the time needed to collect the data and priority in analysing it, for instance), openness would also allow other research teams to identify potential errors and improve upon previous analyses.

Key readings

Ortega, L. and Byrnes, H. (eds.) 2008b. The Longitudinal Study of Advanced L2 Capacities. New York: Routledge.
This edited volume, whilst not focusing exclusively on learner corpora, is a must-read for any researcher interested in longitudinal research. It provides key theoretical and methodological reflections on the longitudinal study of advanced capacities and includes chapters that report on empirical longitudinal investigations of various types (descriptive, quasi-experimental, qualitative and quantitative). Chapter 4 in the volume, written by Florence Myles, is more specifically dedicated to the investigation of learner language development with electronic longitudinal corpora.


Tono, Y., Kawaguchi, Y. and Minegishi, M. (eds.) 2012. Developmental and Crosslinguistic Perspectives in Learner Corpus Research. Amsterdam: Benjamins.
This volume provides an overview of current research on the use of learner corpora perceived from developmental and cross-linguistic perspectives. Eleven chapters of the book focus on the proficiency development of young learners of English as an L2 on the basis of the International Corpus of Crosslinguistic Interlanguage (ICCI). The other articles present studies carried out on spoken learner corpora and on learner corpora of languages other than English (French and Japanese).

Hasko, V. and Meunier, F. (eds.) 2013. Capturing L2 Development through Learner Corpus Analysis. Special issue of The Modern Language Journal 97(S1).
This special issue of The Modern Language Journal is entirely devoted to the role that learner corpora can play in uncovering the developmental processes in L2 learning. The introductory chapter offers a critical discussion of the aspects in which the disciplines of LCR and SLA would benefit from closer interdisciplinary engagement. The six articles included in the volume address syntactic complexity, contiguous and discontiguous multi-word unit use, the numeral classifier system, tense and aspect acquisition and the development of L2 accuracy learner profiles, and illustrate various learner corpus research designs.

Belz, J. A. and Vyatkina, N. 2008. 'The pedagogical mediation of a developmental learner corpus for classroom-based language instruction', Language Learning and Technology 12(3): 33–52.
This article explores the pedagogically mediated use of a learner corpus in language teaching and in the developmental analysis of second language acquisition. It also addresses the issue of authentication in corpus-driven language pedagogy. The authors illustrate how an ethnographically supplemented developmental learner corpus may contribute to second language acquisition research via dense documentation of micro-changes in learners' language use over time.

Castello, E., Ackerley, K. and Coccetta, F. (eds.) In press. Studies in Learner Corpus Linguistics: Research and Applications for Foreign Language Teaching and Assessment. Bern: Peter Lang.

This edited volume contains five articles devoted to longitudinal corpora. The authors of these articles have all used LONGDALE data and address the following topics: tense and aspect issues in an error-annotated corpus of, respectively, French and German learner writing; the evolution of word classes in a variety of written assignments produced by Dutch learners and individual learners' differences that can affect vocabulary and syntactic control; French learners' ways of expressing attitudinal stance in oral communication; metadiscursive features, more specifically the evolution of the use of it-extraposition in the reading reports and argumentative essays produced by Italian learners; and finally the use and short-term impact of corpus literacy practices and data-driven learning activities in first-year Italian language students.


18 Variability in learner corpora

Annelie Ädel

1 Introduction

Corpora and corpus-based methods can make a contribution to the study of variability in learner language for two main reasons. One reason is that the study of linguistic variation itself is particularly amenable to quantitative and corpus-based analysis. The corpus, especially when used in combination with metadata about the learners represented and about the situation in which the language was produced, enables the researcher to quantify and compare data in systematic ways. The quantitative corpus results can then be used to verify or falsify claims made in the second language acquisition (SLA) literature or to generate new hypotheses about learner language. Another reason is that the focus on naturally occurring language in corpus work means that the types of learner data studied represent authentic language use. There is much experimental work in SLA, which means that the language analysed is produced in an experimental setting (such as a laboratory), typically solely for the express purpose of linguistic analysis. While there are many good reasons for the experimental elicitation of linguistic data – the complexity of language use is reduced; the language production and variables potentially affecting it can be controlled; the likelihood of capturing relevant types of linguistic output can be maximised – it is also the case that such data simply do not represent the full gamut of authentic language use.

Almost inevitably, researchers who study learner corpus data will encounter linguistic variability and will need to account for it. Learner corpus research has paid a great deal of attention to the influence of the mother-tongue background on learner language (see Chapter 15, this volume), but it has tended to neglect other factors that may exert an influence and that may serve to account for some of the variability attested in learner corpora.

This chapter will discuss some of these alternative factors and demonstrate how important they can be in language production in general and in foreign/second language production in particular.

2 Core issues

Language is not a static phenomenon, but rather varies – sometimes considerably – depending on why it is used, where it is used, by whom it is used, and so on. Variability occurs at all levels of language: speakers make choices in pronunciation, morphology, vocabulary, grammar, information structure and politeness level, among others. Linguistic variability depends on a number of non-linguistic factors, including 'the speaker's purpose in communication, the relationship between [participants], the production circumstances, and various demographic affiliations that a speaker can have' (Reppen et al. 2002: vii). That language use is characterised by variability is true whether the user speaks the language natively or is a learner. It is probably intuitively clear also to the non-linguist that the language of the learner is likely to change in conjunction with increased, or even decreased, proficiency. The trajectory from beginner to intermediate to advanced levels of proficiency constitutes a very noticeable type of language variation – as with child language acquisition – even if it takes place along a continuum rather than between discrete categories.

'Variability' and the related term 'variation' are polysemous. Both terms are used interchangeably (and will be used interchangeably also in this chapter) to refer to differences among learners as well as differences within an individual learner. The former involves different speakers expressing the same meaning by means of different forms ('inter-learner variation'), while the latter involves a single speaker using different linguistic forms to express the same meaning on different occasions ('intra-learner variation').1 Both inter-learner and intra-learner variation can be investigated using learner corpora, but the focus of learner corpus research has very much been on inter-learner variation. Corpora have been used cross-sectionally to study patterns of variation which distinguish different groups of speakers. In traditional SLA, this type of variation has often been approached from the perspective of differences in performance, for example based on proficiency in target language use or assessment scores. It is a fundamental goal of SLA to describe and explain the factors that condition variation in the level of proficiency attained by learners. This relates to the question that is the Holy Grail of the language teacher: how best to help facilitate learning and enhance student performance, thus increasing proficiency.

1 Attempts to create terminological order have been made, such as a recent suggestion to use 'variation' for the former and 'variability' for the latter (Verspoor and van Dijk 2013).


Intra-learner variation is defined as 'instability in learners' linguistic systems' (H. D. Brown 2007: 392). Researchers often emphasise the fact that language acquisition typically does not entail a steady acquisition of one language rule after another. For example, in one and the same essay, even a relatively advanced learner (of English) will produce linguistic output (here, examples of subject–verb agreement) which both follows the native-speaker rule – The student does not…; Instead the student presents… – and violates it – The student also have a tendency to…; …which this student have succeeded in… The term 'learner variety' has emerged in the SLA literature, referring to 'a coherent linguistic system produced by a language learner' (Dimroth 2013: 3256). It is used with the express purpose of avoiding treating such developing systems as deficient, seeing them instead as languages in their own right.2

2 The term 'interlanguage' is seen as suggesting that 'learner languages are hybrid systems situated in-between two "real" languages' (Dimroth 2013: 3256), the source language and the target language.

How is variability conceptualised and measured by linguists? Essentially, linguists who are interested in language in use have accumulated a great deal of evidence to show that linguistic variation is not random, but rather highly systematic. One of the major goals of many branches of linguistics is to account for the variability found in language and to establish the causes of such variability. One way of investigating what conditions linguistic variation is through the notion of correlation. The types of correlations that linguists are concerned with are those that can be established, whether positively or negatively, between a linguistic phenomenon, on the one hand, and some variable X, on the other. A 'linguistic phenomenon' is represented by some kind of linguistic data at some level of language: phonology, morphology, vocabulary, syntax, discourse. 'Variable X' stands for any kind of factor that may affect the way language is used and may therefore explain variability; the most commonly described categories here are learner-specific (e.g. having to do with motivation), social (e.g. having to do with the relationship between the people interacting), situational (e.g. having to do with the purpose of communication), cognitive (e.g. having to do with processing effort) and linguistic (having to do with the linguistic system itself). It is often the case that more than one variable has a role to play in linguistic variation; the relative influence of several variables can be estimated through statistical methods (see Chapter 8, this volume). It is beyond the scope of this chapter to consider all of these variables – especially keeping in mind that 'the factors that can bring about variation in learner output are numerous, perhaps infinite' (Ellis 1994: 49) – but a number of them will be illustrated here, with a focus on learner-specific variables.

The variability examined is typically between different groups of learners, so that the researcher can take advantage of the generalising power of corpus-based methodologies and establish systematic differences and similarities in language use between different populations of speakers. That is not to say that variability within individual learners is not of interest, as we saw above; however, this topic is underexplored in mainstream corpus linguistics.
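In its simplest corpus-based form, such a correlation can be computed from two values per learner: the value of variable X taken from the metadata and a normalised frequency for the linguistic phenomenon. A minimal sketch with invented numbers follows; Spearman's rank correlation is chosen here only because learner metadata are often not normally distributed, not because it is the method any particular study uses.

```python
from scipy.stats import spearmanr

# One value per learner: variable X (e.g. months of immersion, from the metadata)
# and a relative frequency for the feature under study (e.g. per 1,000 words)
months_abroad = [0, 0, 3, 6, 6, 9, 12, 12, 18, 24]
feature_rate = [1.2, 0.8, 1.9, 2.4, 2.1, 3.0, 2.8, 3.5, 3.9, 4.1]

rho, p = spearmanr(months_abroad, feature_rate)
print(f"Spearman's rho = {rho:.2f}, p = {p:.3f}")
```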

2.1 Types of variability

A learner who aims for full proficiency in a second/foreign language (L2) needs to acquire not only native-like grammar, but also native-speaker patterns of variation, including the ability to style-shift in moving from formal to informal situations (Bayley and Tarone 2011: 50). This is sometimes illustrated through the well-known distinction between 'linguistic competence' (knowledge of the linguistic system) and 'communicative competence' (knowledge of how to use language in specific social situations). In looking more closely at causes of linguistic variation, we will be drawing a distinction between 'learner-specific variables', on the one hand, and 'general variables', on the other.3 The former variables are specific to language learners and are based on a model presented in Yorio (1976), while the latter variables are social and situational ones which apply generally to all types of speakers and language production. Research in SLA has paid specific attention to the former but has tended to neglect the latter.

3 This distinction largely overlaps that made between 'learner variables which characterize the learner and task variables which pertain to the language situation' (Granger 2008b: 264; emphasis added).

2.1.1 Learner-specific variables

The six main categories of Yorio's (1976: 61) model of learner-specific variables are as follows:

• native language
• age
• cognition
• input
• affective domain
• educational background.

The role of the native language in learning a foreign/second language is one of the most widely studied variables in SLA, as pointed out above. It is dealt with in Chapter 15 (this volume), while this chapter focuses on other variables. The next category in Yorio's list – age – has also been the subject of considerable research in SLA, especially targeting the similarities and differences in language development between child, adolescent and adult learners.


Perhaps the most widely known outcome of this research is the so-called 'critical period hypothesis', which holds that language, at least at the phonological level, is acquired less efficiently after puberty. Age is often linked to the notion of 'developmental stages' (see Chapter 17, this volume). One of the ways in which such stages have been operationalised in learner corpus research is in terms of school years. For example, in a study of Japanese learners' acquisition of verb structure in English (Tono 2004), the overuse/underuse/misuse of a selection of verb patterns was investigated by dividing the learners into groups based on school grade: 7–8, 9–10 and 11–12. 'Years of schooling' was found to have a considerable impact on verb use. However, age is interrelated with biological factors (cf. the critical period) as well as social factors (such as parental and peer group influence), so it may not be quite as straightforward as one might think.

The next category is cognition, which has long been a major topic in SLA. Cognition subsumes general intelligence and language aptitude as well as proficiency. Aptitude cannot be directly observed, but is operationalised in the form of a test – such as the Modern Language Aptitude Test (MLAT) – 'which aims to predict phenomena that characterise second language acquisition … and the extent to which successful SLA occurs as a result' (Robinson 2013: 129). Although it is referred to frequently, proficiency is a variable that tends to be operationalised in fuzzy ways in learner corpus research (see, e.g., Granger et al. 2009; Carlsen 2012; Chapter 2, this volume). In fact, the categories of proficiency, developmental stages and age tend to be conflated. Typically, the more easily recorded criterion of 'number of years of L2 schooling' is preferred to subjecting learners to a standardised proficiency test, although the latter would give a more detailed and accurate picture of student levels. In Section 3.1 below, we will see an example of issues involved in studying proficiency as a variable in learner corpus research.

The category of input is the one in which the SLA pendulum has tended to swing most radically. In the behaviourist approach, input was everything, while in the generative approach, input was largely irrelevant. Extreme behaviourism can be said to focus on observable behaviour at the expense of cognitive phenomena, while extreme generativism only pays attention to cognition. In recent usage-based and psycholinguistic approaches, input – and especially frequency effects in second language acquisition (see, e.g., Ellis 2012b; Chapter 16, this volume) – is again seen as a crucial factor in SLA. In some learner corpus research, input (or 'exposure') has been operationalised as the L2 textbooks used with instructed learners. In a study of learners of English in a Japanese classroom setting (Tono 2004), a corpus of textbook material was created, which could then be compared to a complementary corpus of student production, for example to examine the degree of overlap between input and output. In this particular setting, the learners had very limited exposure to English outside the classroom, so this was deemed a successful operationalisation.
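The overlap measure itself is straightforward once both corpora are available in plain text. A minimal sketch follows; the file names are invented, and a real study would work with lemmas and frequency bands rather than raw word types.

```python
import re

def word_types(path):
    # The set of distinct lower-cased word forms in a plain-text corpus file
    with open(path, encoding="utf-8") as f:
        return set(re.findall(r"[a-z']+", f.read().lower()))

textbook = word_types("textbook_corpus.txt")  # hypothetical file names
learner = word_types("learner_corpus.txt")

shared = learner & textbook
print(f"{len(shared) / len(learner):.0%} of learner word types also occur in the textbook input")
```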

There are, of course, many different types of L2 exposure, which bring with them different processing conditions for the language learner. SLA researchers often make a distinction between 'explicit' learning, which is intentional and requires effort, and 'implicit' learning, which relies heavily on exposure. This has been much studied in SLA, above all through the traditional distinction between classroom learning and immersion. Section 3.2 gives an example of a corpus-based study on the effects of immersion, operationalised as the number of months spent in a target-language country (see also Chapter 19, this volume).

The category labelled 'affective domain' includes sociocultural factors, personality factors and motivational factors. The affective domain in general – and motivation in particular – has been re-evaluated in recent work in SLA. Motivation was originally examined predominantly in a Canadian context (such as the classic work by Gardner, e.g. 1996), but as more and more contexts have been studied, the description of its mechanisms has become more fine-grained. Above all, due to the spread of English as a lingua franca and the increasing diversity of the communities of English language speakers, traditional ways of analysing motivation with reference to the degree of the learner's identification with the target language community have increasingly been abandoned in favour of the 'internal domain of self', for example through Dörnyei's (2009) model based on 'people's visions of themselves in the future' (cf. Ushioda 2013). Following this trend, motivation research in SLA has shifted focus from 'what' questions (what reasons motivate people to learn languages?) and 'why' questions (why are some people more motivated to learn languages than others?) to 'how' questions (how does motivation develop and emerge in interaction with the social learning environment?) (Ushioda 2013: 3763). It seems that, in this process, researchers investigating motivation are paying more attention to qualitative analysis of individual learners' behaviour, perhaps at the expense of variable-based, quantitative analysis aiming to make generalisations about different learner populations. The traditional focus of motivation research on predicting and identifying 'what types of external factors (e.g. teacher behaviors or strategies) may have a positive effect on learner motivation, and what forms of motivation may result in optimal learner behaviors and outcomes' (Ushioda 2013: 3764) is giving way to investigations of motivation as a process, where the individual learner and the specific context of learning are brought to the fore. Such changes in theoretical concerns inevitably lead to changes in methodology: the heavy reliance on large-scale empirical self-report data is being relaxed as interview studies emerge; these involve a much smaller number of informants but offer more in-depth description of the experience of motivation.

The category of educational background, finally, has to do with aspects such as whether the language learner is literate or illiterate. Furthermore, if the learner is literate, it makes a difference whether the learning has taken place in a non-professional, general academic setting, or in a professional setting with a greater degree of specialisation in a given domain. There is very little corpus-based research on the effects of these different conditions.

2.1.2 General variables

An additional group of variables likely to have a bearing on learner corpus research is referred to here under the label of 'general variables'. This refers to a type of variation that is not so much about the specifics of second language acquisition, but is rather about the ability, or need, to vary one's language output depending on social and situational circumstances.4 One of the areas in which such variation is a core concern is sociolinguistics. Sociolinguists have demonstrated that social variables such as gender, socio-economic status, level of education, social roles, ethnicity or identity have an impact on language use. To take an example from a classic native-speaker study (from Norwich, England), Trudgill (1974) showed that the lower the socio-economic status of speakers, the more likely they were to pronounce as /n/ (rather than /ŋ/) the last sound of the -ing form in words such as working. In fact, not only social variables were found to influence language (pronunciation in this case), but also situational variables, such as the level of formality. The more formal the situation, the more closely the speakers monitored their speech, and the less likely they were to select /n/ over /ŋ/, regardless of socio-economic class.

It has taken a while for sociolinguistic approaches to gain ground in SLA research, as the field has been dominated by a cognitive paradigm and a strong focus on the learner's cognitive processes. As an illustration of how relatively recent this shift is, in a review article on sociolinguistic approaches to SLA from 2007, the author felt a need to state that '[t]he L2 learner's mind, unlike my laptop computer, processes L2 data differently in response to different social variables' (Tarone 2007: 839), responding to rather extreme generativist views in which learner language development is a purely cognitive phenomenon, independent of social context. Increasingly, however, in contemporary models and theories of SLA, the learner is viewed 'as a social being whose cognitive processing of the L2 is affected by social interactions and social relationships with others, including those others who provide L2 input and corrective feedback' (Tarone 2007: 840). Research in SLA shows that it is not only native speakers who use a range of different styles in different social situations, but learners too. It is predominantly through the so-called 'variationist' SLA approach that the impact of social factors on variation in interlanguage has been studied (see, e.g., Preston 1996; Regan 2013).

4 Note that there is a small number of variables in this general category that may be specific to learners. Learner corpus researchers typically use the term 'task variables' for circumstances that are relevant only/primarily to language learners; translation exercises and picture descriptions are a case in point (although a picture description task could also be used in a psycholinguistic study not involving learners).


Not only sociolinguistic approaches are concerned with variation in language, but many other branches of linguistics also see it as their task to describe and explain linguistic variation of different kinds – for example, dialectology targets variation based on region; diachronic linguistics studies historical change; and genre analysis and register analysis deal with variation across types of texts. A register can be defined by a particular configuration of situational variables. The following list (from Biber and Conrad 2003: 175) presents a useful summary of situational variables, based on classic work in sociology and anthropology:

• the participants, their relationships and their attitudes toward the communication
• the setting, including factors such as the extent to which time and place are shared by the participants, and the level of formality
• the channel of communication
• the production and processing circumstances
• the purpose of the communication
• the topic or subject matter.

It is beyond the scope of this chapter to discuss all these variables, but it is nevertheless important to indicate the complexity of social interaction through language. The importance of the participants and their relationships is probably obvious: consider how speakers vary their language when addressing young children versus people their own age; people they know well versus strangers; equals versus superiors, etc. An example from SLA research will serve to illustrate the impact of who the participants are: researchers who have had interview data collected through fieldworkers of the same ethnicity as the informants have found that learner behaviour is affected by whether or not the participants are of the same ethnicity. In one study it was found that 'Thai learners of English used more Thai variants when interviewed by ethnic Thais than by ethnic Chinese' (reported in Bayley and Tarone 2011: 43). Participants in a communicative situation draw on a range of attributes (in addition to ethnicity) to assess their relative similarity, as has been much discussed in Communication Accommodation Theory (e.g. Giles and Ogay 2007), which essentially studies aspects of how speakers modify their linguistic behaviour based on how they perceive the participants.

The important role played by the communicative setting has long been recognised in SLA, where experimental research has been used to show that it makes a difference whether the type of learner language studied is produced, for example, in a language laboratory, in the classroom, or for the explicit purpose of testing the learner's knowledge in some area. The setting typically relates to the purpose of communication, and it is also relevant to the channel and the production/processing circumstances, so there is a certain amount of overlap between different variables.

For example, if we consider the two basic channels of communication, writing and speaking, one of the major differences between them is that a writer has more time than a speaker to plan and edit the message. The message is also affected by the possibilities for interaction and by whether participants share time and/or place, which represents another difference between typical speech and typical writing. The findings in Ädel (2008), reported on in Section 3.3, are explained by reference to observations that the amount of time available and the possibilities for interaction have an effect on the linguistic output. Section 3.3 will also illustrate production/processing circumstances, which have to do with how much time is available in language production: for example, if a text is written under time pressure, as in the case of a sit-down essay assignment, this will have linguistic consequences. Finally, if one considers the concept of 'linguistic taboo' and the notion that some topics are best talked about in a circumscribed way, or not at all, it is perhaps not surprising that the topic itself can have a considerable influence on language use. In a learner language context, there is the simple fact that L2 speakers know how to talk about some topics better than others and have more restricted linguistic repertoires than native speakers. 'Topic' can also be interpreted as 'essay prompt': SLA work has shown that, in the context of essay tasks, even small differences in prompts or assigned topics affect the written production (Hinkel 2002: 162–4). In a corpus-based study of writing by learners of German, for which the learners wrote on one of four topics, the choice of topic itself was found to strongly influence the use of syntactic modifiers (Golcher and Reznicek 2011).

3 Representative studies

Variability in learner language is typically studied using a comparative method, such as Contrastive Interlanguage Analysis (CIA; see Granger 1996; Chapter 3, this volume). For example, learner corpus results are often compared to the results of a 'control corpus' of comparable native-speaker linguistic production. There are many thorny issues involved in deciding on (or even finding) a specific control corpus, however. Take the case of comparing student learner writing – such as an argumentative term paper for a university course – to professional native-speaker writing – such as an opinion piece for a newspaper. In such a comparison, the analyst is not holding constant as many variables as possible, because the two populations differ not only in native-speaker status, but also in professional-writer status. Furthermore, it could be argued that the two types of writing represent different genres, serving different purposes and targeting different audiences. That said, it is sometimes the case that language learners and native speakers of a given language simply do not produce the same types of genres, in which case researchers have to make do with samples that are as comparable as possible.

Learner corpus research also commonly involves comparison of different learner groups, most typically learners of the same target language who speak different first languages. This design is useful for examining the effects of the first language on language learning, such as transfer effects. As always, however, the researcher needs to ascertain that a given comparison of group A to group B is valid, by ensuring that the two sets of corpus material are maximally comparable. However, researchers are often faced with difficulties here as well. Tono (2012), for example, mentions an issue in collecting written data produced by novice learners of English with different L1s, due to different educational practices in different countries. In many countries where English is taught in primary school, the focus is on spoken rather than written skills, so there is very little written production. What the researchers did was to 'collect data opportunistically at the earliest stage where written output was feasible and made available' (2012: 28). In order for a learner corpus to be suitable for variation studies, its design needs to be specified (see Chapter 2, this volume). In particular, the corpus needs to have metadata relevant to the variable(s) to be studied. Ideally, the metadata should provide information not only about the learner population itself (such as how much exposure to the target language the individual learners have had) but also about the type of data represented in the corpus (such as whether the writing was timed or untimed in the case of student essays). Perhaps the best-known example of learner corpus metadata is the 'learner profiles' of the International Corpus of Learner English (see Granger et al. 2009), which record more than twenty variables related to both learners and task. The fact that these variables are included in the metadata is an acknowledgement of the range of learner-oriented and situational factors that can potentially impact language production. As we shall see, such profiles provided crucial information for all three studies reviewed below. Once they have accumulated results from comparable sets of data, learner corpus researchers typically make use of statistical measures in order to establish which of the differences (if any) found between different groups are statistically significant – that is, likely not to be due to mere chance – and to what degree the different variables have had an impact on the linguistic phenomenon under study. Since second language acquisition is such a complex phenomenon, dependent on many different factors, it makes good sense to use statistics as an aid in measuring the relative importance of different variables, for example by applying multivariate analyses (see Chapter 8, this volume).5

5 See also Tono (2004), which is unusually explicit in explaining the method behind weighting several factors affecting second language acquisition.

All three studies reviewed below used some kind of inferential statistics as an aid in evaluating and interpreting learner data. By way of illustrating how variables can be investigated via learner corpora, the next three subsections take a closer look at three corpus-based studies. They are concerned with three different variables and their possible effects on learner language, and they also study three different linguistic phenomena: proficiency level and its effect on vocabulary (Pendar and Chapelle 2008), immersion and its effect on collocation (Groom 2009) and task settings and their effects on writing style from the perspective of involvement/detachment (Ädel 2008). Although all three studies are concerned with (a) writing and (b) English, the issues raised are of general relevance to learner corpus research of all types.

3.1 Pendar, N. and Chapelle, C. A. 2008. 'Investigating the promise of learner corpora: Methodological issues', CALICO Journal 25(2): 189–206.

In SLA, the development of language proficiency has been studied along the dimensions of linguistic fluency, accuracy and complexity (e.g. based on Skehan's (1998) framework): a proficient speaker is able to communicate fluently and produce accurate statements using complex vocabulary and grammar. This has been operationalised by linking the dimensions to specific linguistic features (identified in previous research). The ability to identify different learner levels, in a reliable and automatic fashion, is a highly desirable goal in that it could be applied to placement, assessment, measuring the effects of instruction, and so on. As pointed out by Pendar and Chapelle (2008), the issue of variation, especially in advanced learner writing, is critical to both assessment and teaching in higher education (p. 191), particularly in English as a Second Language contexts. Reliable automatic methods for evaluating exactly how proficient a given L2 performance is could save a great deal of time and effort (see Chapter 26, this volume). Pendar and Chapelle's aim was rather exploratory: to find out whether indicators of learner levels could be found by using (a) a large corpus, (b) automatic methods of analysis and (c) statistical methods. For the last of these, they used a method called 'decision trees' to help determine which indicator(s) made a difference. The dependent variable was learner level (even though this turned out to be not so easy to pinpoint, as we shall see), while the main independent variable – that is, the information tested for its ability to make predictions about the dependent variable – involved vocabulary choice in the form of lexical features associated in previous research with different advanced learner levels (see Chapter 9, this volume). Given the methodology adopted, it was important for the researchers to use features that could be located in learner writing through automatic computer searches. In the long run, this type of research could lead to applications for checking to what extent learner writers are able to manage situational variation in a given domain: in particular, adapting vocabulary to register.

The material used for the study was culled from the International Corpus of Learner English (ICLE; Granger et al. 2002), which consists of intermediate to advanced learner writing by university students of English in their third and fourth years of study. Although there was no unambiguous indication of the specific competency levels of the ICLE population, the corpus metadata contained different types of information that could be associated with learner levels. Of these, the researchers combined three length-of-study variables: (1) number of years studying English at school; (2) number of years studying English at university; and (3) number of months of exposure to English in an English-speaking country. These correspond to input in Yorio's (1976) model. The three variables were combined according to explicitly stated criteria (for example, based on the median value in the corpus as a whole). This formed the basis for assigning the selected essays to a level of low, mid or high. The corpus texts were then divided into different groups on the basis of information drawn from the metadata connected to each text. Note that this was also done in the other two studies reviewed below. The lexical features were searched for by means of three different sets of word lists; this was done in order to test which one(s) would work as useful indicators of proficiency level. One of the lists was based on two widely known word lists: the General Service List (GSL; West 1953) and the Academic Word List (AWL; Coxhead 2000). This list was divided into four subsets: the first 1,000 words of the GSL, the second 1,000 words of the GSL, levels 1–10 of the AWL, and words found in neither the GSL nor the AWL. The higher levels of the AWL include more 'difficult' words (e.g. intrinsic, whereby), and the lower levels include words that are more common in academic contexts (e.g. research, interpretation). The assumption was that less-proficient writers would use a smaller proportion of the higher-level words and, accordingly, a larger proportion of the lower-level words.
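To make the logic of the lexical-profile method concrete, the following Python sketch computes, for each essay, the proportion of tokens falling into each word-list subset and feeds these proportions into a decision-tree model of the kind used in the study. It is a minimal illustration under invented data: the word lists, essays and level labels are placeholders, and scikit-learn's DecisionTreeClassifier merely stands in for whatever decision-tree implementation Pendar and Chapelle actually used.

    # Minimal sketch: lexical profiling plus decision-tree modelling.
    # All word lists, essays and labels below are invented placeholders;
    # in the study they would be the GSL/AWL subsets and ICLE essays.
    from sklearn.tree import DecisionTreeClassifier

    gsl_1k = {"the", "people", "make", "money", "work"}            # first 1,000 GSL words (toy)
    gsl_2k = {"trend", "follow", "washing"}                        # second 1,000 GSL words (toy)
    awl = {"research", "interpretation", "intrinsic", "whereby"}   # AWL levels 1-10 (toy)

    def lexical_profile(text):
        """Return the proportion of tokens in each subset (plus off-list words)."""
        tokens = text.lower().split()
        counts = [0, 0, 0, 0]
        for token in tokens:
            if token in gsl_1k:
                counts[0] += 1
            elif token in gsl_2k:
                counts[1] += 1
            elif token in awl:
                counts[2] += 1
            else:
                counts[3] += 1
        total = max(len(tokens), 1)
        return [c / total for c in counts]

    essays = ["people make money and work",
              "the research follows a trend",
              "an intrinsic interpretation whereby research is framed"]
    levels = ["low", "mid", "high"]   # derived from length-of-study metadata

    model = DecisionTreeClassifier().fit([lexical_profile(e) for e in essays], levels)
    print(model.predict([lexical_profile("the interpretation of research")]))

In the actual study, of course, the feature space was much larger and the model was evaluated on data outside the training set, which is precisely where its limited generalisability became apparent.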

Generally speaking, the AWL was a relatively good predictor for more advanced learners, although the results were somewhat mixed and did not allow any unambiguous matching of lexical profiles to learner levels. Since many lexical features were included in the analysis – remember that there were three different lists, each with several subsets – the researchers used a statistical method, decision-tree modelling, to help them evaluate which combination of lexical features had the strongest predictive power vis-à-vis the dependent variable (learner level). Although the modelling was found to be not very reliable – that is, not particularly generalisable to new data that was not already in the training data – the method was found to be promising. Pendar and Chapelle emphasise that it was a problem for their analysis that the proficiency variable was not clear and reliable (cf. Carlsen 2012, who presents this as a general problem in learner corpus design), and that this probably contributed to the statistical test not yielding very robust results. They also suggest that the fact that they only studied lexical features in the essays – for the sake of ease of automatic identification – and not syntactic or discourse-level phenomena associated with language development probably also contributed to the poor performance of the statistical test. The overall evaluation of the lexical-profile method and the statistical modelling method used was that they are promising, but that further research is needed, using larger corpora and, above all, making use of corpora which are based on clearer classifications of texts with respect to the proficiency level of the writers. One of the learner-specific variables used in Pendar and Chapelle – number of months of exposure to English in an English-speaking country, listed as (3) above – was also used by Groom (2009) in a study of the effects of immersion, not on vocabulary in terms of individual word choice, but on the co-selection and co-occurrence of words in native-like, expected ways.

3.2 Groom, N. 2009. 'Effects of second language immersion on second language collocational development', in Barfield, A. and Gyllstad, H. (eds.), Researching Collocations in Another Language – Multiple Interpretations. Basingstoke: Palgrave Macmillan, pp. 21–33.

Thanks to corpus data, researchers have become increasingly aware of the fact that language tends to occur in chunks rather than through the selection of individual words, one after another. One of the terms used to describe such linguistic co-selection is 'collocation', which is generally viewed as two or more words which have established a tendency to co-occur (see Chapter 10, this volume). Groom's (2009) study set out to answer the question of whether a learner's development of collocations benefits from immersion, that is, from work or study in an L2 setting. When it comes to learners and collocation, it is typically assumed that immersion has a positive effect on collocational development: the more exposure to an L2 environment, the more native-like a learner's co-selection patterns will be. However, a corpus-based study of L1 German learners of English (Nesselhauf 2005) found that the number of correct collocations in their writing was only slightly improved by increased exposure to English in English-speaking countries and, surprisingly, that there was a negative correlation between time spent in an L2 setting and the number of different collocations produced by the learners.6 Groom found these findings startling and decided to investigate the two claims by means of a different way of operationalising 'idiomatic language' and defining 'collocation'. Nesselhauf had taken a phraseological approach, defining collocation in qualitative (not quantitative) terms, and studied only verb + object combinations, such as make money, follow a trend and do the washing up. The findings had not been subjected to inferential statistical analysis.

6 A precursor of the German subcorpus of the ICLE was used for the Nesselhauf (2005) study, which drew on metadata from the learner profiles to determine the extent of immersion.

What Groom (2009: 21) did was adopt a corpus-linguistic approach to the phenomenon of co-selection, defining it in quantitative terms as 'two or more words occurring near each other in a text' (based on Sinclair 2003). More specifically, he used computer algorithms to retrieve the data, by means of (a) lexical-bundles analysis and (b) node-and-collocate analysis. Lexical bundles are multi-word sequences that recur frequently and are distributed widely across different texts (Biber 2010: 170), such as statistically significant, in terms of, as a result of. The node-and-collocate analysis involved mapping the co-occurrence patterns of the ten most frequent prepositions (within a span of four words to the left and four words to the right of the preposition), which were then subjected to two statistical tests: t-score and mutual information (MI). The learner material used for the study was taken from the Uppsala Student English Corpus (USE; Axelsson 2000), consisting of essays written by Swedish university students in their second term of English Studies. Like the ICLE, this corpus comes with metadata describing both the learners and the task. The metadata used as an indicator of 'immersion' was the number of months the student writers had spent in an English-speaking environment. Based on this information, a selection of the essays was divided into two groups of approximately 250,000 words each: the USE 0 set, written by students who had spent less than one month in such an environment, and the USE 12+ set, written by students who had reported at least one year in an English-speaking environment. Using the software AntConc (Anthony 2006), all two-word, three-word, four-word and five-word lexical bundles were automatically retrieved from the corpus. In order for a bundle to be included, it needed to reach a certain frequency level (set at 40 per million words) and a certain spread across the texts in the corpus (set at at least five texts).
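The bundle-retrieval step can be approximated in a few lines of code. The sketch below is a simplified stand-in for what AntConc does rather than a reconstruction of its actual algorithm: it extracts all n-grams of length two to five and keeps those that meet a normalised frequency threshold (40 per million words) and a dispersion threshold (occurrence in at least five texts). The corpus is a toy placeholder.

    # Simplified lexical-bundle extraction with frequency and dispersion cut-offs.
    from collections import Counter

    def lexical_bundles(texts, n_min=2, n_max=5, per_million=40, min_texts=5):
        """texts: a list of token lists. Returns bundles meeting both thresholds."""
        freq, spread = Counter(), Counter()
        total_tokens = sum(len(t) for t in texts)
        for tokens in texts:
            grams_in_text = set()
            for n in range(n_min, n_max + 1):
                for i in range(len(tokens) - n + 1):
                    gram = tuple(tokens[i:i + n])
                    freq[gram] += 1
                    grams_in_text.add(gram)
            spread.update(grams_in_text)   # each bundle counted once per text
        cutoff = per_million * total_tokens / 1_000_000
        return {g for g in freq if freq[g] >= cutoff and spread[g] >= min_texts}

    # Toy corpus: six identical five-token 'texts'.
    corpus = [["in", "terms", "of", "the", "results"]] * 6
    print(sorted(lexical_bundles(corpus)))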

The analysis showed that the number of lexical bundles was consistently higher for each bundle length in the USE 0 group. In fact, the longer the bundles, the greater the difference was. Groom's interpretation of the result, however, was not that immersion leads to decreased use of collocation, but rather that it may indicate that the USE 12+ group is 'relying less on an overused set of known lexical bundles, or that the formulaic sequences that they do use are inflected by a greater degree of constituency and positional variation than is the case with the students in the USE 0 group' (Groom 2009: 28). Unlike a lexical-bundles analysis, which retrieves units without taking potential pattern variability into account (e.g. treating as separate bundles to a large extent and to a very large extent, or a quantitative study of … and the study was mainly quantitative), a node-and-collocate analysis allows for internal variability, which is why it is useful to apply both measurements to the same set of data. However, this type of analysis is more restricted than lexical-bundles analysis in that the analyst needs to work with a predetermined list of node words. For this purpose, the ten most frequent prepositions in each corpus set were selected as nodes; these turned out to be the same in both groups. AntConc was used to calculate t-scores and MI scores for the prepositions (i.e. the nodes) and their co-occurring words (i.e. the collocates). The collocates of the nodes were required to have a co-occurrence frequency of at least ten to be included, and there was also a threshold value for statistical significance. Interestingly, the results of this analysis showed the reverse trend: the number of collocations (both types and tokens) was higher in the USE 12+ group according to both statistical tests, and for the majority of prepositions. By looking at the data this way, the expected pattern was found: a positive rather than a negative correlation between the frequency of collocations and extent of immersion. At this point, an additional step was taken in the analysis, involving a qualitative check of the learners' use of prepositions from the perspective of accuracy. (Recall that this was checked also in Nesselhauf's (2005) study of the German learners.) In order to make this feasible, a sampling procedure was adopted: for each preposition, five random 100-line concordances were manually checked for collocation errors; to do so, the analyst used his native-speaker intuition, a large native-speaker English corpus (the Bank of English) and Google-based searches for verification. Again, this method yielded the expected results, with the USE 12+ group displaying consistently fewer collocational errors – on average, half as many as the USE 0 group – and the difference between the groups was statistically significant.
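For readers unfamiliar with the two association measures, both can be computed from four simple counts: the observed node–collocate co-occurrence frequency, the corpus frequencies of the node and the collocate, and the corpus size. The formulas below follow common corpus-linguistic practice; AntConc's exact normalisation may differ in detail, and the counts shown are invented.

    import math

    def expected(f_node, f_collocate, corpus_size, span=8):
        # Expected co-occurrences under independence, within a window of
        # `span` tokens (four to the left and four to the right of the node).
        return f_node * f_collocate * span / corpus_size

    def mi_score(observed, f_node, f_collocate, corpus_size, span=8):
        return math.log2(observed / expected(f_node, f_collocate, corpus_size, span))

    def t_score(observed, f_node, f_collocate, corpus_size, span=8):
        return (observed - expected(f_node, f_collocate, corpus_size, span)) / math.sqrt(observed)

    # Invented counts for the preposition 'of' (node) and 'terms' (collocate)
    # in a 250,000-word corpus set.
    print(mi_score(120, 8000, 300, 250_000))   # ~0.64
    print(t_score(120, 8000, 300, 250_000))    # ~3.94

As is generally noted in the corpus-linguistic literature, the two measures foreground different collocates: MI tends to privilege relatively exclusive, often low-frequency pairings, while the t-score favours high-frequency ones, so reporting both gives a fuller picture.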
Each task starts with examples drawn from authentic learner production (e.g. A recent piece of news reveals that football players are taking drugs), followed by a brief explanation of the error. Particular focus is placed on potential transfer from the learners' mother tongue. For instance, the task devoted to 'Degree adverbs: the use of very' features two examples of incorrect use – This have been *very commented on and This point is *very related to the previous one – followed by the following comment: 'These errors may be influenced by the fact that the Spanish adverb muy is more widely used than its English equivalent very. Thus, if we translate these sentences into Spanish we would use muy with comentado (commented) or relacionado (related). In English, instead, the adverb much and degree adverbs like those below [adverbs ending in -ly such as highly, strongly, closely] should be used.' (Mendikoetxea et al. 2010: 187) As the comments are meant for undergraduates in English who have a fairly good background in the descriptive grammar of English, Mendikoetxea et al. have been able to make use of technical terms such as 'count/uncount', 'noun phrases' or 'degree adverbs' without having to explain them. Each task contains a range of exercises which may be deductive, i.e. rule-driven (e.g. error correction, cloze or translation), or inductive, i.e. data-driven (e.g. concordance-based exercises). The translation exercises focus on grammatical points that show degrees of discrepancy between English and Spanish. The error-detection and correction exercises, which are often denounced in the literature, proved to be very popular with students, who found them 'motivating and fun' (ibid.: 184). One of the distinctive features of this study is that, unlike other pedagogically oriented studies, it does not stop short at pedagogical implications but includes the design and actual classroom use of pedagogical resources.

3.2 Harrison, J. 2015. 'The English Grammar Profile', in Harrison, J. and Barker, F. (eds.), English Profile in Practice. Cambridge University Press, pp. 28–48.

Like Mendikoetxea et al.'s work, Harrison's (2015) study reports on a research project that relies on learner corpus data to describe learners' grammatical knowledge and generate teaching materials that address learners' attested needs. In this case, however, the corpus is very big and global rather than small and local. The resource used is the 55-million-word Cambridge Learner Corpus, which contains exam-related data at different proficiency levels representing a wide range of mother-tongue backgrounds. The study is situated within the wider English Profile research programme conducted by a team from Cambridge University Press and Cambridge English Language Assessment. The project has a vocabulary branch, the English Vocabulary Profile, and a grammar branch, the English Grammar Profile, which is the topic of Harrison's chapter. The focus is not on errors, but on the development of grammatical knowledge in terms of the structures used at the different levels (from A1 'Breakthrough' to C2 'Mastery') and the increasing sophistication with which learners use the structures they have learned.

The project focuses on eighteen areas of grammar traditionally covered in ELT grammars of English (adjectives, adverbs, determiners, passives, present time, etc.), which are themselves subdivided into a number of categories, such as present simple and present continuous for the section on present time. The description of each grammatical feature covers three key aspects of learner use: (1) form: the development of learners' use of a feature in different forms and structures; (2) lexico-grammatical development: the ability to use the feature with an increasing range of vocabulary; and (3) use: the ability to use the feature for a wide range of functions and in a wide range of contexts. The article contains ample illustrations of these three aspects. An example of formal development is the superlative, which is used with an infinitive at B2 (the cheapest way to travel) and at C1 with a postmodifier which strengthens the superlative (the shortest time possible). An example of lexico-grammatical development is the increasing range of adverbs used between the auxiliary and the main verb in the past continuous (A2: I was just watching; I was always dreaming vs B2 and later stages: she was wistfully walking; I was constantly trying). Results show that the main type of competence that emerges at the B levels, and therefore serves to define the difference between the lower and higher levels, is lexico-grammatical competence. The third aspect looks at the different uses of structures in terms of the functions they perform. Each grammatical structure has a number of different uses – a phenomenon referred to as 'grammatical polysemy' – which learners progressively acquire. For example, from its core use to refer to actions in progress at some point in the past at A2 level (it was raining when I arrived), the past continuous is progressively used to express other functions, such as making requests (I was wondering if I could…) at B2 level or talking about repeated events which are undesired (I was always running out of time) at C2. The aspect of learner use also includes the progressive acquisition of politeness strategies and features typical of formal registers. One major finding of the study, which has particularly important pedagogical implications, is that it should not be assumed that once a structure has been taught at a given level, it need not be revisited at a later stage: there is a constant need for revision and recycling of grammatical knowledge. The learner-corpus-based descriptions are meant to inform a range of grammar-based activities. They are turned into teacher-friendly sections called 'grammar gems'12 which will be made available to teachers. One of them – on the future tense – is included in the article together with examples of follow-up activities.

12 www.englishprofile.org/index.php/grammar-gems (last accessed on 13 April 2015).

This section gives a very detailed description of the uses of the future typical of each proficiency level and is illustrated with examples from the learner corpus. The latter feature is a particularly welcome change, as learner corpus examples have so far been used in pedagogical tools almost exclusively to illustrate erroneous use.

3.3 Cowan, R., Choo, J. and Lee, G. S. 2014. 'ICALL for improving Korean L2 writers' ability to edit grammatical errors', Language Learning and Technology 18(3): 193–207.

This article by Cowan et al. (2014) reports on an ICALL program, the ESL Writing Tutor, which aims to raise advanced Korean ESL learners' awareness of persistent errors in their writing. All stages in the design of the program are described: from learner corpus collection to error selection, tagging and analysis, design of targeted ICALL units, classroom experiment and assessment of the effectiveness of the program. The ESL Writing Tutor is a corpus-based program: it relies on a corpus of writing of approximately 230,000 words produced by Korean undergraduate and graduate students enrolled in ESL remedial writing courses at an American university. The learners' proficiency level, measured by two tests (the TOEFL and the English as a Second Language Placement Test), is reported as advanced. Manual inspection of a sample of the corpus identified four types of persistent grammatical errors: passivisation of unaccusative verbs (e.g. the lecture was lasted for three hours), articles (e.g. I had same experience), noun phrases with quantifiers (e.g. most of Americans like hamburgers) and demonstrative determiners (e.g. this mistakes). These four types of errors were subsequently examined in the full corpus; all the erroneous instances were error tagged and error frequencies computed. To be selected for analysis, an error needed to have been identified as such by two raters and to occur over 10 per cent of the time in obligatory contexts. Individual units were designed with a view to helping Korean learners overcome their attested difficulties with these grammar points. Each unit utilises 'a combination of explicit instruction and metalinguistic prompting feedback for error correction' (p. 196). After a first section aimed at refreshing students' memory of the rules relating to the grammar topic, students are asked to judge the grammaticality of a series of sentences and, at a later stage, to detect and correct errors in short and long passages. One of the strong features of the program is the variable feedback provided during students' efforts at error identification and correction. All the ungrammatical sentences used in the error-detection and correction exercises are taken from the learner corpus, and about a quarter of the sentences contain no errors. An interesting aspect of the interface is that teachers can introduce their own sentences, thereby allowing for some degree of customisation.
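The selection criterion just described reduces to a simple rate computation. The sketch below illustrates it with invented tallies: an error type is retained only if both raters identified it and it occurs in more than 10 per cent of obligatory contexts.

    # Hypothetical tallies: (error count, obligatory contexts, flagged by both raters).
    error_data = {
        "unaccusative passive": (42, 180, True),
        "article": (95, 600, True),
        "comparative": (12, 400, True),   # below the 10% threshold, so excluded
    }

    selected = [error_type
                for error_type, (count, contexts, both_raters) in error_data.items()
                if both_raters and count / contexts > 0.10]
    print(selected)   # ['unaccusative passive', 'article']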

The effectiveness of the program was tested in a controlled experiment involving forty Korean learners, twenty-two of whom took the four CALL lessons (the CALL group); eighteen received no such instruction (the NO CALL group). The task involved editing grammatical errors in a 615-word text in which thirty-two test items were embedded together with an equal number of correct examples. The experiment included a pre-test, a first post-test which took place five weeks after the pre-test, and a second post-test five months later. The results show that the CALL group made significant progress between the pre-test and post-test 1 in all four grammatical categories, and between the pre-test and post-test 2 for passives, quantifiers and demonstratives but not for articles. In contrast, the NO CALL group showed no significant improvement between test administrations. The authors conclude that the ESL Writing Tutor is an effective pedagogical approach for addressing persistent grammatical errors and can serve as a useful supplement to in-class writing instruction. They also advocate more extended use of learner corpus data for the design of ICALL courseware on the grounds that it is a richer source of data than 'the production tasks currently used in ICALL programs, since these are somewhat artificial and restrict output' (p. 204). More generally, they highlight the role that the synergy between corpus linguistics and CALL can play in improving L2 instruction and furthering our understanding of developmental stages in second language acquisition.

3.4 Gilquin, G., Granger, S. and Paquot, M. 2007a. 'Learner corpora: The missing link in EAP pedagogy', Journal of English for Academic Purposes 6(4): 319–35.

Gilquin et al.'s (2007a) study reports on a two-year collaborative corpus-based project between researchers from the University of Louvain and Macmillan Education. The aim of the project was to apply insights from learner corpora to the creation of writing materials to be integrated into a monolingual learners' dictionary, the second edition of the Macmillan English Dictionary for Advanced Learners (MEDAL) (2007). The corpus-informed materials included error notes (described in Section 2.2.1), sections on aspects of English grammar, spelling and punctuation that remain problematic at an advanced proficiency level, and a thirty-page academic writing section, which is the focus of Gilquin et al.'s article. The article starts with an overview of the contribution of corpora to the field of English for Academic Purposes (EAP). The survey shows that corpus-based studies, mostly based on native corpora of professional writing, have helped uncover a number of distinctive features of academic discourse. As regards lexis, studies have demonstrated the existence of a general academic vocabulary that is found in a wide range of disciplines and has its own EAP-specific phraseology. The authors argue that these studies, while offering highly valuable insights into EAP, need to be complemented with insights gleaned from learner corpora with a view to identifying the difficulties that learners experience when writing academic texts.

These areas of difficulty include collocational patterning (miscollocations), pragmatic appropriacy (anomalous use of modals, boosters, hedges, etc.) and a range of discourse features (e.g. misuse as well as over- and underuse of connectors). Corpus-based EAP materials cover some of these features, but as they tend to be based exclusively on native-speaker corpora, they only cater for the errors and infelicities that are shared by non-native writers and novice native writers. This leaves a lot of ground uncovered, which could be filled by analysis of learner corpora of academic texts. The article describes an extended writing-aid section included in the MEDAL. This section concentrates on twelve rhetorical functions that are particularly prominent in academic writing, such as comparing and contrasting, introducing examples, reporting and quoting, and expressing personal opinions. The focus is on 350 EAP markers extracted with the corpus-driven method described in Paquot (2007). Both native and learner data were used to inform the section. A native corpus was necessary to provide a detailed description of the markers, particularly their phraseological patterning; the resource used was the 15-million-word academic subcorpus of the British National Corpus.13 The learner corpus was the 3.5-million-word International Corpus of Learner English (Granger et al. 2009), which contains academic essays written by foreign language learners from sixteen different mother-tongue backgrounds. These two corpora were compared to highlight the features that distinguished learner from native production. The different learner subcorpora were also compared with each other to assess the degree of generalisability of the results: only the features that were shared by at least half of the learner populations were discussed in the writing section. The difficulties revealed by this method include problems of frequency, register, positioning, semantics and phraseology, all of which are amply illustrated in the article. Particular attention is paid to items which learners tend to overuse (such as for example or for instance), and alternatives are presented with a view to helping learners widen their lexical repertoire (for the function of exemplification, alternative expressions like X is an example of Y or an example of Y is X are suggested). Problems of register confusion (e.g. overuse of really, of course or absolutely) are also highlighted. Graphs are included to help learners visualise the differences between learner and native-speaker use. The article shows the benefit of adding an L2 perspective to the production of corpus-based EAP resources. One weakness is that the generic nature of the resource precludes the insertion of L1-specific notes, a weakness which has been addressed in subsequent work at Louvain within the framework of the LEAD project (see Section 2.2.1).

13 www.natcorp.ox.ac.uk/ (last accessed on 13 April 2015).
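The over- and underuse comparisons that underpin such findings are typically quantified with a measure like the log-likelihood ratio, computed from an item's frequency in each corpus and the two corpus sizes. The sketch below uses a standard two-corpus formulation; the frequencies are invented and are not figures from the study.

    import math

    def log_likelihood(freq_learner, size_learner, freq_native, size_native):
        """Two-corpus log-likelihood for a single item (assumes non-zero frequencies)."""
        total = freq_learner + freq_native
        e_learner = size_learner * total / (size_learner + size_native)
        e_native = size_native * total / (size_learner + size_native)
        return 2 * (freq_learner * math.log(freq_learner / e_learner)
                    + freq_native * math.log(freq_native / e_native))

    # Invented counts for 'for example' in a 3.5-million-word learner corpus
    # versus a 15-million-word native academic corpus.
    ll = log_likelihood(900, 3_500_000, 1_800, 15_000_000)
    print(round(ll, 2))   # values above ~3.84 are significant at p < 0.05

Whether a significant result counts as overuse or underuse depends on the direction of the difference: in this invented example the learner frequency exceeds its expected value, so the item would be overused relative to the native corpus.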

4 Critical assessment and future directions

Learner corpora hold great potential for reference and instructional materials design: they provide hitherto inaccessible information on what learners can and cannot yet do at different levels of proficiency. The data is particularly rich, as all linguistic phenomena are embedded in extended contexts of use, which greatly contributes to the analysis. However, a survey of the field confirms Flowerdew's (2012: 283) conclusion that 'learner corpora are still at the periphery of language teaching'. The educational sector seldom uses learner corpus data, and when it does, the approach tends to be both limited and one-sided, i.e. the learner corpus is only used to check and/or illustrate suspected errors. In other words, the driving force is still the author's intuition rather than the learner corpus itself. This explains why learner-corpus-informed content often looks so familiar, leaving an internationally recognised expert like Michael Swan unimpressed: for him, the language items uncovered 'have been standard in ELT [English Language Teaching] syllabuses for half a century or more' (Swan 2014: 93). One major weakness in current learner-corpus-based applications is the marked focus on errors. This is not a bad thing per se, especially as errors extracted from learner corpora are 'real' errors, i.e. they reflect learners' attested rather than potential difficulties. In addition, thanks to error annotation, they can be investigated much more systematically than in the past and analysed in an extended context. However, it is time to go beyond errors and focus on other dimensions of language proficiency, in particular complexity. Analysis of learner corpus data representing different proficiency levels makes it possible to identify the emergence of increasingly complex lexical items and grammatical structures. Besides exercises focused on errors, it would be good to include exercises that push learners to move out of their comfort zone and produce language of increasing sophistication. Projects like the English Profile or TREACLE hold great promise in this regard. Another conclusion that can be drawn from a survey of current applications is that the power of learner corpus data only comes into its own when a strong, corpus-driven approach is adopted – still a rare occurrence these days. While pedagogical intuition is important, indeed critical, when designing language-teaching resources, language practitioners only have a partial awareness of students' difficulties. For example, they are much more likely to notice repeated use of some words or structures than underuse or absence. A comparison between a learner corpus and a reference corpus (expert or native) brings out both over- and underuse in a matter of seconds. However, much of the usefulness and reliability of learner corpus insights depends on the quality and size of the learner corpus used. Two key issues need careful consideration, as they may significantly impact the generalisability of the results: one concerns the type of data, the other the type of learner.

To date, most learner corpora have been collected by teachers or academic researchers in the context of language classes, the assignments being either part of normal teaching activities or set specifically for the purpose of corpus collection. A more limited number of learner corpora have been initiated by language certification bodies like Cambridge English Language Assessment. In this case, the data is exam-related: it consists of scripts from nationally or internationally organised standardised language exams. Both types of data have their strengths and weaknesses. The first type has the advantage of not being linked to a unique syllabus, but its usefulness is often impaired by the limited corpus size and the absence of a clear indication of proficiency status. Some researchers have tried to circumvent this weakness by having the data rated post hoc (see, for example, Thewissen in press), but this procedure is costly and fraught with difficulties, not least that of finding reliable raters. The second type of corpus is a mirror image of the first: the learner corpus tends to be very large and reliably stratified in terms of proficiency, but the downside is that it is linked to a well-defined syllabus, which may bring about some degree of circularity: the learners master a particular language feature at a particular level because it is part of the syllabus at that level. Swan (2005: 96) highlights this danger in connection with the English Profile programme and warns against modelling standard language descriptors such as those for the Common European Framework of Reference for Languages (CEFR) (Council of Europe 2001) on the basis of examination material from one particular examining body. Clearly, if the aim of the research is to identify general descriptors for a particular L2, it is essential to check the criterial features against both exam and non-exam data. This is something that the protagonists of the English Profile project are aware of and have started to address (McCarthy 2014: 16). The second issue relates to the mother-tongue background of the learners. There is some tension between the results of learner corpus research, which point to a high degree of transfer-related features in learner language (see Chapter 15, this volume), and the priorities of the educational publishing sector, which favours generic tools for understandable market reasons. An investigation of phrasal verbs carried out by Negishi et al. (2012: 15) is revealing in this respect. The results highlight a discrepancy between the CEFR-linked English Vocabulary List of phrasal verbs and the degree of difficulty for Japanese learners. The authors conclude that '[s]pecific L1 characteristics, such as proximity in terms of language family and lexico-grammatical similarities and differences might affect the order and the degree of phrasal verb acquisition'. In spite of compelling evidence of L1 specificity, the prevailing option is still the traditional 'one-size-fits-all' model (Rundell 2006: 742), a model that Tomlinson (2012: 158) is highly critical of: 'In attempting to cater for all students at a particular age and level, global coursebooks often end up not meeting the needs and wants of any.'

The fact remains that, in addition to L1-specific features, learner language also displays developmental features which affect all learners, whatever their mother-tongue background. Negishi et al. (2012: 15–16) highlight this duality: 'Put in a much wider perspective, it might be possible that there is a core group of linguistic items such as words, phrasal verbs, etc. which show a general pattern of increasing difficulty, whereas there are some peripheral linguistic items that are context-specific or perhaps culture-specific. The acquisition of those peripheral linguistic items might be affected by the syllabus adopted, teaching materials used, the learner's L1, etc.' The best way of implementing this duality in reference and instructional materials is to promote 'glocalization' (Gray 2002), i.e. the smooth integration of the global and the local. Learner corpora – if they are carefully designed and analysed – can play a key role in the development of glocal resources that account for both generic and context-specific features. The only devices that can realistically host glocal resources are electronic ones. A modular database design enables customisation based on a number of variables, in particular learners' proficiency level and L1 background, a major desideratum in view of the high degree of variability revealed by analysis of learner corpus data. Rundell (2007: 50) sees future dictionaries as sets of 'reference components which customers can mix and match according to their needs'. This 'mix-and-match' model could fruitfully be applied to CALL software development and computer-based learning tasks. Indeed, at a time when the boundaries between language resources – dictionaries, grammars, language-learning tools, writing aids – are becoming increasingly fluid, there is much to be gained from a rapprochement between electronic lexicography and CALL (Granger 2011b, 2012b). These resources should be – at least partly – web-based so as to allow for maximum flexibility and accessibility on a range of devices: not only computers, but also tablets and mobile phones. Although Gray (2002: 165) still sees a place for printed core coursebooks, he predicts that the demand for online add-ons will grow so as to ensure a better match between coursebook materials and users. The corpus world is replete with laments that the corpus revolution has not yet reached the language-teaching world. When it comes to explaining this state of affairs, there is a lot of 'passing the corpus buck' going on: teachers blame publishers, publishers blame teachers, researchers blame publishers, etc. In fact, each actor should accept part of the blame. As suggested by Harwood (2005), publishers should be more daring and open to research. And learner corpus researchers should do more than point to some vague pedagogical implications. As regards grammar, for example, Meunier and Reppen (2015: 514) make clear that '[i]t should therefore also be part of the corpus linguists' agenda to provide textbook writers with clear guidelines as to which type of core grammatical information is worth including in textbooks.'

The voice of teachers should not be ignored either, as '[a] course book which adheres to linguistic principles but ignores teacher expectation may not succeed' (McCarten 2010: 417). McCarthy (2008) argues for a shift in the relationship between teachers, academics and publishers: rather than seeing teachers as passive consumers, we should involve them as active participants, whose voices 'should be respected and heeded if the applied linguistics profession is to be a two-way street' (p. 565). In all this, we should keep one thing in mind: the bottom-up approach that characterises learner corpus research is a massive undertaking which will inevitably take time. As Leech (1997b: 22) rightly observes, '"the corpus revolution" is a misnomer for a change which is taking place gradually, as suitable materials become available'.

Key readings

Flowerdew, L. 2012. Corpora and Language Education. Basingstoke: Palgrave Macmillan.

This volume provides a user-friendly overview of corpus linguistics with a particular focus on its applications to language education. There are numerous references to the use of learner corpora and their relevance to teaching, particularly in Chapter 6 ('Applying corpus linguistics in research arenas') and Chapter 7 ('Applying corpus linguistics in teaching arenas').

Meunier, F. 2010. 'Learner corpora and English language teaching: Checkup time', Anglistik: International Journal of English Studies 21(1): 209–20.

This article provides a good overview of the links between learner corpus research and English language teaching. It deals with the two types of learner corpus use: indirect, i.e. based on learner corpus data for delayed pedagogical use, and direct, i.e. relying on learner corpus data for immediate pedagogical use in the classroom.

Campoy-Cubillo, M. C., Bellés-Fortuño, B. and Gea-Valor, L. (eds.) 2010. Corpus-based Approaches to English Language Teaching. London: Continuum.

This volume contains a collection of papers on the application of corpus-based approaches to English language teaching. Part 3, entitled 'Learner corpora and corpus-informed teaching materials', is of particular interest: it contains eight studies describing the use of learner corpus data in a wide range of contexts.

De Cock, S. and Granger, S. 2005. 'Computer learner corpora and monolingual learners' dictionaries: The perfect match', Lexicographica 20: 72–86.

The aim of this article is to highlight the contribution of learner corpus (LC) data to monolingual learners' dictionaries (MLDs) of English. The authors highlight the properties of LC data for pedagogical lexicography and describe the types of LC-based information that can be integrated into MLDs. By way of illustration, two LC-informed MLDs are compared: the Longman Dictionary of Contemporary English (2003) and the Cambridge Advanced Learner's Dictionary (2003).

Hawkins, J. A. and Filipović, L. 2012. English Profile Studies 1. Criterial Features in L2 English: Specifying the Reference Levels of the Common European Framework. Cambridge University Press.

This book describes the English Profile programme, a project which aims to develop reference-level descriptors for English to accompany the CEFR. It contains a detailed account of how the Cambridge Learner Corpus has been used to identify criterial features for the different levels of the CEFR. These features can be negative (errors) or positive (correct properties). The last chapter suggests practical applications of the research for materials design.

Wible, D., Kuo, C.-H., Chien, F.-Y., Liu, A. and Tsao, N.-L. 2001. 'A web-based EFL writing environment: Integrating information for learners, teachers, and researchers', Computers and Education 37: 297–315.

In this article, Wible and colleagues describe an interactive web-based tool called Intelligent Web-based Interactive Language Learning (IWiLL), which allows both students and teachers to create and use an online database of Taiwanese learners' essays and teachers' error annotations. This environment is extremely attractive both for learners, who get immediate feedback on their writing and have access to lists of errors they are prone to produce, and for teachers, who progressively and painlessly build a large database of learner data from which they can draw to develop targeted exercises.

Granger, S., Kraif, O., Ponton, C., Antoniadis, G. and Zampa, V. 2007. 'Integrating learner corpora and natural language processing: A crucial step towards reconciling technological sophistication and pedagogical effectiveness', ReCALL 19(3): 252–68.

This article advocates the integration of learner corpus data and natural language processing for the design of CALL programs.

The authors advocate a realistic approach that reconciles the capabilities of NLP tools and the realities of foreign language teaching. The integration of NLP techniques and learner corpus data is illustrated via two applications: NLP-based error detection and feedback, and an NLP-based error analysis interface.

23 Learner corpora and language testing
Fiona Barker, Angeliki Salamoura and Nick Saville

1 Introduction

The field of language testing and assessment (LTA) is concerned with the development of valid and reliable assessments that measure language ability through specific tasks for particular purposes. Language testers seek to measure hidden (latent) traits in order to make inferences about an individual's language ability. Through well-designed tasks we can observe behaviours to which we attach a test score; test scores therefore provide evidence for an individual's ability in a specific language skill or for their overall language competence. Language tests are used for a wide range of purposes, including: establishing someone's current language-learning needs (placement testing), establishing areas of weakness (diagnostic testing) or measuring a student's language ability (proficiency testing). For an introduction to LTA and its terminology, see ALTE Members (1998), McNamara (2000) and Hughes (2003). LTA traces its origins back to the imperial civil service examination system in China, and the first ESOL (English for Speakers of Other Languages) test, the Certificate of Proficiency in English, was launched in 1913 (see Weir and Milanovic 2003). Language testers started to use language corpora only in the 1990s, despite the fact that corpora have played an increasingly important role in education, publishing and related fields since they were first computerised, starting with the Brown Corpus of Written American English in the 1960s (Francis and Kučera 1964). In this chapter we consider what benefits learner corpora bring to LTA and describe the ways in which they are being used or have the potential to be used in future. Whilst this chapter focuses on the development and application of learner corpora to LTA, collections of texts produced by native users of English (and other languages) have also been used to develop or validate various task formats or language tests since the 1990s.

Taylor and Barker (2008) provide an overview of the use of native corpora for LTA, describing how key developments in corpus design after the Brown Corpus informed LTA, and noting how written native corpora were joined by spoken native corpora, then learner corpora, and onwards into the expanding range of learner corpora available or under development (see Chapter 2, this volume).

The range of applications of learner corpora to LTA is fairly wide, and many overlap with uses of native corpora. On a practical level, Barker (2004) describes how both types are used to develop word and structure lists that test writers refer to when producing tests. Hargreaves (2000) explains how collocations were identified in native reference corpora and a corpus of learners' exam scripts (the Cambridge Learner Corpus, described in Sections 2.1 and 3.1) and fed into a new task type for an advanced general English proficiency test. On a more theoretical level, both types of corpora are used to develop rating scales and linguistic descriptions of learner writing and speech across the proficiency continuum (see Hawkey and Barker 2004; Hendriks 2008). Learner corpora can be used to help revise existing test formats (Weir and Milanovic 2003) or to develop new ones (Biber, Conrad, Reppen et al. 2004), to create word or phrase lists used by teachers, curriculum planners or test writers (Capel 2010, 2012) and to select new testing material at different proficiency levels. The analysis of learners' errors provides evidence about what distinguishes one proficiency level from another (see Hawkins and Buttery's (2010) notion of 'criterial features', and also Hawkins and Filipović (2012), described in Section 3). Corpora of learner texts enable researchers to compare how learners perform on various tasks or topics over time, according to proficiency level or by any other variable collected alongside the corpus texts.

A persuasive reason to use corpus data for language testing is that corpora provide empirical evidence that balances teachers' and test writers' intuitions and expertise in designing and rating assessments. Furthermore, corpora provide real language across a range of settings, thus positively enhancing what the test purports to measure.

2 Core issues

We now focus on how learner corpora have been developed specifically for use in LTA, noting significant developments and exploring the implications that corpus evidence and related analytic techniques have for language assessment. We focus on large-scale language proficiency testing, which tends to have high-stakes outcomes for test takers, although we also refer to studies where learner corpora support smaller-scale language tests.

2.1 Types of learner corpora and models of learner corpus building

We can identify four types of corpora that are useful for LTA.

The first type comprises corpora of learner exam material. The 56-million-word (and growing) Cambridge Learner Corpus (CLC) consists of written exam scripts and associated questions and metadata, and is a dynamic monitor corpus developed by Cambridge English Language Assessment and Cambridge University Press in the UK (see Section 3.1).1 Another example is the NICT Japanese Learner English Corpus (NICT JLE Corpus), consisting of transcripts of audio-recorded speech samples from 300 hours of ACTFL-ALC Standard Speaking Tests.2 A native corpus was also compiled to enable comparison between native speakers and learners, which was particularly useful for the errors not captured by the error tagset. The NICT JLE Corpus provided evidence for the CEFR-J, an adapted version of the Common European Framework of Reference for Languages (CEFR; Council of Europe 2001) for English language teaching in Japan. The CEFR-J consists of 'Can Do statements' at twelve levels, with a new Pre-A1 level and the two lowest levels divided into three (A1) and two (A2) sublevels, reflecting the needs of English language learners in Japan (Negishi et al. 2013).

The second type includes corpora of texts elicited from learners for use by test developers and researchers. Here we find the Cambridge English Profile Corpus (CEPC; see Alexopoulou 2008) and the Cambridge Corpus of Academic English (CAMCAE).3 Their corpus-building model involves individuals completing a selection of tasks online (CEPC) or uploading their own choice of work to an online portal (CAMCAE), with a local coordinator who is either a teacher or an academic manager in the relevant institution.4 The 10-million-word Longman Learners' Corpus5 was compiled from contributions by teachers and learners.

The third type of corpus consists of (non-learner) texts used in the target language use situation. For example, the static T2K-SWAL Corpus developed for the Educational Testing Service (ETS) test provider included a range of academic text types and has informed many studies of discourse structure and genre analysis (Biber, Conrad, Reppen et al. 2004). The T2K-SWAL Corpus also underpinned the research programme that led to the development of the internet-based Test of English as a Foreign Language, TOEFL iBT®. Other test providers commission external expertise to build their corpora (see the Pearson International Corpus of Academic English (PICAE) of written and spoken texts; Ackermann et al. 2010).

1. See www.cambridge.org/corpus/ (last accessed on 13 April 2015).
2. American Council on the Teaching of Foreign Languages – Association of Language Companies; see www.actfl.org/ (last accessed on 13 April 2015).
3. See www.cambridge.org/camcae/ (last accessed on 13 April 2015).
4. See www.englishprofile.org/index.php/corpus (last accessed on 13 April 2015).
5. See www.pearsonlongman.com/dictionaries/corpus/learners.html (last accessed on 13 April 2015).

Whilst practical and potentially cost-saving, using external expertise has implications for several aspects of corpus design, a key one being quality control of the nature and proficiency level of the texts collected for the corpus.

Finally, the fourth type consists of general learner corpora that are useful for testing purposes. Examples include the International Corpus of Learner English (ICLE; Granger et al. 2009), with a centrally managed model of corpus building by teams in different locations, and the Michigan Corpus of Upper-level Student Papers (MICUSP).6 No data-collection and corpus-building model is appropriate for all purposes or contexts, and one can find out what works (and, more importantly, what does not) by reading accounts of corpus building, for example Brand and Kämmerer's (2006) account of compiling the German component of the Louvain International Database of Spoken English Interlanguage (LINDSEI) or Römer and O'Donnell's (2011) account of designing, compiling and classifying the MICUSP by genre.

6. See www.elicorpora.info/ (last accessed on 13 April 2015).

2.2 Implications of corpus design for language testing

The design of a corpus must be fully understood in order for a language tester to make the best selection for their intended purpose, whether this is for research or for operational use, i.e. to inform a new test or to improve an existing test. A corpus user should also consider the generalisability of the research findings and any decisions based on these, thinking carefully about how and in what context corpus results will be used.

Two facets of corpus design that are crucial to language testing are learner-specific variables (the learners' demographic information) and general variables (the nature and context of the learners' texts; see Chapter 18, this volume). The range of learner-specific variables collected can vary widely, but would normally include information about a learner's mother tongue(s) and nationality, and their language-learning history (for example, any education partly or fully in that language), together with demographic information such as age or gender (see Chapter 2, this volume).

A potentially problematic learner-specific variable is the learner's current proficiency level in the language of interest. This might be a teacher's/examiner's assessment of a learner's level (e.g. CEFR level B2) or stage of learning (e.g. upper-intermediate), or the result of a placement test such as the Cambridge English Placement Test.7 A learner can also self-assess their own level based on 'Can Do statements' such as these, taken from the 'Common Reference Levels: global scale' (Council of Europe 2001: 24):

A1: Can understand and use familiar everyday expressions and very basic phrases aimed at the satisfaction of needs of a concrete type. Can introduce him/herself and others and can ask and answer questions about personal details such as where he/she lives, people he/she knows and things he/she has. Can interact in a simple way provided the other person talks slowly and clearly and is prepared to help.

C1: Can understand a wide range of demanding, longer texts, and recognise implicit meaning. Can express him/herself fluently and spontaneously without much obvious searching for expressions. Can use language flexibly and effectively for social, academic and professional purposes. Can produce clear, well-structured, detailed text on complex subjects, showing controlled use of organisational patterns, connectors and cohesive devices.

A learner's proficiency level may also be provided by peer assessment or the automated assessment of their output (we revisit the automated rating of writing in Section 2.4; see also Chapter 26, this volume). There are other important questions to bear in mind: Was the level assignment done before the learner produced the text for the corpus or afterwards? Did the levels provided by several raters display an acceptable level of agreement? Was the rating checked for intra- and inter-rater reliability and consistency? (See ALTE Members 1998; Davies et al. 1999 for definitions of these terms.)

Salamoura (2008) describes how non-exam data can be aligned to the CEFR, what she calls 'uncharted territory'. For example, the research team developing the CEPC collected three types of evidence for each learner's proficiency level: self-assessment and teacher assessment of a learner's current writing proficiency level (the former using a self-assessment grid, Council of Europe 2001: 26–7), and results from a 25-question multiple-choice test which indicated the level of exam learners should work towards.8 Further evidence will include teacher and examiner ratings as well as automarking of the texts written by learners (see Alexopoulou 2008). Salamoura (2008: 5) states that:

In systematising the linking of research data to the CEFR, it would be perhaps best to view alignment, and in particular evidence for alignment to the CEFR, as a continuum … Data that derive from exams which are already aligned to the CEFR will undoubtedly carry strong evidence for alignment whereas non-exam data may hold lighter evidence. For instance, when one can obtain only self-assessment statements or teacher assessment in relation to what learners can or cannot do in a second language across the CEFR scale, this evidence will not be as strong as evidence about these learners' CEFR level which originates from a reliable CEFR-aligned exam.

7. See www.cambridgeenglish.org/placement/ (last accessed on 13 April 2015).
8. See www.cambridgeenglish.org/test-your-english/ (last accessed on 13 April 2015).
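The rater-agreement questions above can be answered with standard statistics. As a minimal illustration (not part of any study described in this chapter), the sketch below computes Cohen's kappa for two raters' CEFR level assignments using scikit-learn; the ratings shown are invented.

```python
# Minimal sketch: inter-rater agreement on CEFR level assignments.
# The two rating lists below are invented for illustration.
from sklearn.metrics import cohen_kappa_score

rater_a = ["B1", "B2", "B2", "C1", "B1", "B2", "C1", "C2"]
rater_b = ["B1", "B2", "C1", "C1", "B1", "B1", "C1", "C2"]

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# values above roughly 0.6-0.8 are conventionally read as substantial.
kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Raw (percentage) agreement, for comparison with kappa.
raw = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"Raw agreement: {raw:.0%}")
```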

Assigning a level to any text should be approached with caution, however, as it takes training and practice to interpret, extend and apply the underspecified CEFR descriptors (see, for example, Thewissen (2013), described in Section 3.2).

When we consider general variables, there are further issues to contend with when designing or selecting a suitable learner corpus to provide evidence for a specific testing purpose. On a basic level, we start with general versus specialised language, as the level of specificity of any text within a corpus could have a large impact on its relevance for language assessment. Furthermore, as a test should be based on evidence from the target language domain, it is important to consider whether the corpus matches its genre, mode, format, topic and so forth. For assessment purposes, the nature of the task that learners respond to is crucial. As discussed in Section 2.1, learners are likely to be required to respond to a specific task or select from a set of tasks (CEPC; ICLE), to upload a self-selected piece of their own work (CAMCAE), or to respond to one or more exam tasks (CLC). The popular 'web as corpus' approach to corpus development (e.g. Kilgarriff and Grefenstette 2003) is harder to apply to learner data, as demographic information about online text producers is difficult to obtain for web-crawled texts. However, it is an appropriate way of collecting a large amount of material quickly from a particular domain (PICAE).

An ongoing challenge for corpus researchers is the comparability of one set of corpus results with another. Naturally, each corpus is designed with specific research goals in mind, with associated metadata collected, meaning that true comparability between corpora is not always possible, although there are family resemblances between pairs or groups of corpora, such as the Michigan Corpus of Academic Spoken English and the Michigan Corpus of Upper-level Student Papers (MICASE/MICUSP) in the US and the British Academic Spoken English Corpus and the British Academic Written English Corpus (BASE/BAWE) in the UK.9 Such corpora share design features, which means that they are comparable to a certain degree. However, corpus researchers still need to bear in mind any differences in the period of data collection between corpora, as there are bound to be differences in faster-changing aspects of the language, such as the lexical level, which can potentially skew corpus results. A related point is that any corpus reliant on people entering information is liable to human error, as people may avoid questions that they do not understand or do not wish to answer, so caution should always be exercised when interpreting corpus findings.

Every corpus should have a specifications document that provides a detailed description of the types of texts within it, their collection methodology and the ways in which they have been formatted and annotated (for example the ICLE handbook, Granger et al. 2009).

9. See www.elicorpora.info/ for MICASE/MICUSP; www2.warwick.ac.uk/fac/soc/al/research/collect/ for BASE/BAWE (both last accessed on 13 April 2015).

A specifications document may also provide information on intrinsic aspects of the tasks, such as the topic, purpose and intended recipient of the resulting text. There are also various extrinsic aspects of corpus design, including:

1. Main purpose for writing/speaking: Did the learners know that their production was going into a corpus? Was the writing or speech collected primarily for another purpose, for example as a test or as part of a learning management system (e.g. the EF-Cambridge Open Language Database, EFCAMDAT)?10
2. Participant choice: Did the learners have a choice in what they submitted to the corpus? If not, could this have affected their contribution in quantity or quality? If so, was it an open or limited choice? Did all learners have the same amount of choice?
3. Time: Did the learners have the same amount of time to produce the text, or would this vary by their level, location or some other variable?
4. Preparation: Did learners undergo any specific preparation before contributing to the corpus? Did the learners produce a rote-learnt piece of text, which they may have done in a testing situation if this was their test-taking strategy or if they could not attempt any of the tasks they were presented with?
5. Motivation: What was the learners' motivation to complete the task? Was it a high-stakes testing situation that they had an incentive to do well in (i.e. they were sitting an exam that would contribute to them getting a specific job, access to higher or further education, etc.), or were they taking a test for personal reasons?
6. Recycling of task language: Did the learners reuse parts of the task rubric in their responses, or elements of their partner's responses in a speaking test?

10. See http://boreas.mml.cam.ac.uk/efcd/index.php (last accessed on 13 April 2015).

These six sets of questions suggest that no two learner corpora can be identical, as there is a range of contextual features – both intrinsic to the corpus design and beyond the corpus itself – which stem from the heterogeneity of the contributing learners. The key point here is that the characteristics of a test taker influence how they perform on any test, so this should be acknowledged and taken into account in any research based on responses to those tasks.

To be of maximal use, we suggest that a learner corpus needs to have the following characteristics as a minimum (a minimal sketch of how such a record might be stored follows at the end of this section):

• information about each learner's first language and nationality or region
• language-learning background: whether in school or private lessons, number of years, intensity, private or state language schools, tutors, self-study, etc.

• their actual level of performance (often given as a CEFR level)
• the learner's age and/or stage of learning
• detailed information about the nature of the task(s) undertaken by learners, and the situation or context in which they completed the task(s) or took the test.

The following variables often take secondary importance:

• gender, which is not necessarily a key feature but is straightforward to collect
• intensity of language learning, which may be related to contact with native or proficient users and any significant immersion periods spent in a country where the target language is spoken
• reason for taking a language test (only relevant to a corpus containing tests).

The other optimal characteristics of learner corpora for LTA depend on the aims of each research study: for example, a corpus builder may wish to include only texts from learners who show good command of the language, or any level of performance may be acceptable. Table 23.1 shows the metadata collected by the two best-known learner corpora, CLC and ICLE.11 The decision about what metadata to collect is crucial, as it is often difficult to return to participants to request further information once data collection has taken place, although planned longitudinal data collection permits this, as in the Longitudinal Database of Learner English (LONGDALE; see Section 2.4).12 Once the learner-specific and general variables have been established and data collected, the next stage is deciding how to annotate the corpus texts to turn them into a searchable and analysable data set.

11. See ICLE guidelines at www.uclouvain.be/en-317607.htm (last accessed on 13 April 2015).
12. See www.uclouvain.be/en-cecl-longdale.html (last accessed on 13 April 2015).
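To make the variables above concrete, here is a minimal sketch of how a single learner-corpus record covering the minimum and secondary characteristics listed above might be structured. The field names and example values are our own invention for illustration; they do not reproduce the schema of the CLC, ICLE or any other corpus.

```python
# Minimal sketch of a learner-corpus text record covering the minimum
# variables suggested above; field names are illustrative, not any
# corpus's actual schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LearnerText:
    text: str                       # the learner's written production
    first_language: str             # e.g. "French"
    nationality: str                # country or region
    learning_background: str        # school/private lessons, years, intensity
    proficiency_level: str          # e.g. CEFR "B2"
    task_description: str           # task type, topic, test/non-test context
    age: Optional[int] = None
    stage_of_learning: Optional[str] = None
    # Secondary variables: straightforward to collect but less critical.
    gender: Optional[str] = None
    immersion_months: Optional[int] = None
    test_purpose: Optional[str] = None   # only relevant to exam corpora

record = LearnerText(
    text="In my opinion, the internet have changed our lifes ...",
    first_language="French",
    nationality="Belgium",
    learning_background="8 years at state school",
    proficiency_level="B2",
    task_description="argumentative essay, untimed, no reference tools",
    age=21,
)
print(record.proficiency_level)
```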

2.3 Annotating learner corpora for LTA

The key methods of annotating and error coding learner corpora are addressed in Chapters 5 to 7 (this volume) and will only be briefly discussed here in relation to selecting or developing a corpus for LTA. Learner corpora developed specifically to inform language assessment are rare, so this description of annotation systems will focus on the CLC. The CLC raw data – text files keyed in verbatim from handwritten exam papers – are being supplemented by exams taken on computer. The keyed-in data files are marked up in various ways, although this is not necessary for basic concordancing or lexical analysis using software packages such as WordSmith Tools (Scott 2012) or scripting languages. Various forms of tagging can be added to raw texts, starting with part-of-speech (POS) tagging.
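To make the tagging step concrete, the sketch below POS-tags a constructed learner sentence with the freely available NLTK toolkit. This is a generic illustration only: it is not the tagger used on the CLC, and the sentence is invented.

```python
# Minimal sketch: POS tagging a constructed learner sentence with NLTK.
# This generic tagger is an illustration, not the one used on the CLC.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "He have been in London since three years ."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # [(word, tag), ...] using Penn Treebank tags
print(tagged)

# Once tagged, grammatical categories can be isolated, e.g. prepositions
# (tagged IN), which facilitates studies of learners' preposition use.
prepositions = [w for w, t in tagged if t == "IN"]
print(prepositions)
```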

Table 23.1. Metadata in CLC and ICLE

Participant
CLC: Age (single numbers and ranges); Gender; First language; Nationality (by country); Education level; Years studying English; Full-time student or not; Took a preparation course; Reason for taking test (select from a set); Previously sat the same exam (i.e. is this a re-sit?); Other Cambridge English exams taken; Area of work (if employed)
ICLE: Age; Male/Female; Native language; Nationality; Father's mother tongue; Mother's mother tongue; Language(s) spoken at home (if more than one, please give the average % use of each); Education: primary school – medium of instruction; Secondary school – medium of instruction; Current studies; Current year of study; Institution; Medium of instruction: English only/other language(s) (specify)/both; Years of English at school; Years of English at university; Stay in an English-speaking country: where?/when?/how long?; Other foreign languages in decreasing order of proficiency

Scores
CLC: CEFR level of the writing; Overall scores/grades on the exam, by component and for the 1, 2 or more writing tasks completed
ICLE: –

Task rubric
CLC: The task is provided alongside the response – users see an image of it
ICLE: Essay/Title

Context
CLC: Year of exam; Exam name; CEFR level of the exam
ICLE: Approximate length required; Conditions: timed/untimed; Examination: yes/no; Reference tools: yes/no; What reference tools?: bilingual dictionary/English monolingual dictionary/grammar/other

Once a corpus has been POS tagged, it can be parsed to display grammatical structures which can then be exploited in various ways, for example in Hawkins and Filipović's (2012) exploration of criterial features in learner data within the English Profile programme, explored further in Section 3.1. Both POS tagging and grammatical parsing can be challenging on learner texts, which may contain very little connected prose or complex errors. Once a corpus is POS tagged – and optionally parsed – a common (and crucial) stage is to undertake error annotation, which is described in relation to two studies in Section 3 (see also Chapter 7, this volume). We now turn to some recent learner corpus developments which have implications for language testing.

2.4 Recent learner corpus developments

Different types of learner corpora include collections of language-for-specific-purposes materials and longitudinal corpora. The Varieties of English for Specific Purposes Database (VESPA) contains written English-for-specific-purposes texts from learners with various mother tongues (L1s), covering a range of academic disciplines, text types and levels of expertise.13 Collaborative research is encouraged within the learner corpus community, which should ensure that research can feed into improvements in the formal testing of language for specific purposes; such testing requires large, current data sets from both non-expert and expert users to inform realistic tests.

Another relatively young project is LONGDALE, which includes written and spoken longitudinal data collected by teams worldwide. The learners' demographic details, information about the tasks and the responses themselves are being collected, and all contributing students take two language tests which provide an objective measure of their proficiency level. Research based on this corpus and others will surely lead to improvements in our understanding of second language acquisition (SLA) and of how teaching and learning inform learners' written and spoken output (see Meunier and Littré 2013; Chapter 17, this volume).

Beyond English, learner corpora of language tests include the Lexicon of Spoken Italian by Foreigners (LIPS) collection of second and foreign language Italian (Barni and Gallina 2009), which incorporates 1,500 speaking tests from the Certificate of Italian as a Foreign Language. Gallina (2010) reports on a study investigating lexical acquisition and development in order to produce word lists for comparing vocabulary size and range with native data, guiding the selection of input texts for future versions of this test and exploring learners' proficiency in productive skills. In a comparative study, Mendikoetxea (2006) compares the properties which influence word order in the interlanguage of L2 learners of English and Spanish using two written learner corpora: WriCLE (L1 Spanish–L2 English) and CEDEL2 (L1 English–L2 Spanish).

13. See www.uclouvain.be/en-cecl-vespa.html (last accessed on 13 April 2015).

There are various technical innovations which are helping to develop the LTA field. From computational linguistics, machine learning techniques are being applied to learner corpora, as in the work of Yannakoudakis et al. (2011), who automated the assessment of ESOL exam scripts through the extraction of features and the calculation of each feature's contribution to a learner's overall performance on two writing tasks. Their experimental results indicate that their system can achieve levels of performance similar to those of examiners rating the same set of scripts, which has implications for the use of human raters, to be returned to in Section 4.
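The sketch below gives a schematic flavour of this approach: a simple regression model is trained to predict writing scores from word n-gram features. It is emphatically not a reproduction of Yannakoudakis et al.'s system, which used much richer lexical and grammatical features and a ranking model; the scripts and scores here are invented.

```python
# Schematic sketch of corpus-based automated scoring: predict a writing
# score from word n-gram features. Scripts and scores are invented; the
# real systems discussed above use far richer features and models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

scripts = [
    "I am agree with the statement because ...",
    "In conclusion, the advantages outweigh the disadvantages ...",
    "My town is very beautifull and I like live here ...",
    "The evidence suggests that urbanisation has accelerated ...",
]
scores = [3.0, 4.5, 2.5, 5.0]   # examiner marks (invented)

# Word unigrams and bigrams as a crude stand-in for richer
# lexical and grammatical features.
vectoriser = TfidfVectorizer(ngram_range=(1, 2))
X = vectoriser.fit_transform(scripts)

model = Ridge().fit(X, scores)

new_script = ["I am agree that the city life have many problems ..."]
predicted = model.predict(vectoriser.transform(new_script))
print(f"Predicted score: {predicted[0]:.1f}")
```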

2.5 How learner corpora inform LTA

Corpus-informed language testing involves developing a picture of a specific language feature across proficiency levels, or identifying a language-related issue relevant to a particular group of learners, through an analysis of corpus data that often incorporates triangulation with other data or analytical approaches (see, e.g., Hawkey and Barker 2004). Learner corpora can inform various stages in the 'lifecycle' of a language test (Saville 2003). Here we summarise their applications in defining test user requirements and a test's overall purpose, test design, and task rating.

2.5.1 Defining user requirements and a test's overall purpose

Learner corpora show us what learners of a language can do at a specific proficiency level, which can then inform what is tested. Learner corpus evidence provides a qualitative analysis that balances the language tester's quantitative analysis of test data and informs the writing of test materials. The University of Michigan's Testing Division (now known as Cambridge Michigan Language Assessments, CaMLA) used a corpus of written and spoken B2-level general English tests to inform the scoring criteria for the Examination for the Certificate of Competency in English (ECCE).14 This was done by analysing the test takers' output and revising the rating scales to include five levels on the following criteria: content and development, organisation and connection of ideas, linguistic range and control, and communicative effect, so that they better reflect the linguistic features of the learners' output. Similarly, Hawkey and Barker (2004) analysed both actual and experimental general language test performances to support tabula rasa manual analysis in the development of a Common Scale for Writing across a wide range of proficiency levels and types of English, in order to propose key language features that distinguish performance at four proficiency levels.

14. See www.cambridgemichigan.org/ecce (last accessed on 13 April 2015).

The resulting scale of descriptors is still provided in exam handbooks and seems to have succeeded in its aim to 'assist test users in interpreting levels of performance across exams and locating the level of one examination in relation to another' (p. 122).

Native corpora have been used more often than learner corpora to inform tests of language for specific purposes (LSP), supporting the demand for domain-specific language tests for financial or legal fields, for example. LSP learner corpora are beginning to appear, however, such as VESPA, introduced in Section 2.4. A well-regarded academic English corpus is MICASE, a corpus of American university speech from native and non-native users (Simpson et al. 2002). CaMLA has used this corpus to develop and validate various language tests, including designing new listening test items from information about word frequencies for the high-level Examination for the Certificate of Proficiency in English (ECPE).15 The newly designed items discriminated well between high- and low-scoring test takers and were subsequently used in live tests.

Native corpora have long informed vocabulary testing, including Coxhead's (2000) Academic Word List, which was based on frequency counts from an academic corpus, and Nation and Beglar's (2007) Vocabulary Size Test, which drew on data from the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA).16 Learner corpora can also provide authentic materials on which test writers can base their own vocabulary test items. Current areas of growth include compiling academic frequency word lists, research on lexical bundles, and core and specialised vocabulary (see Horner and Strutt 2004; Jamieson 2005).
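The frequency counts that underlie such word lists are conceptually simple. Below is a minimal sketch of deriving a ranked frequency list from a toy two-sentence 'corpus'; tools such as WordSmith Tools produce this kind of output (with dispersion and keyness statistics) at scale.

```python
# Minimal sketch: a ranked word-frequency list of the kind that feeds
# word-list construction; the two-text "corpus" is invented.
import re
from collections import Counter

corpus_texts = [
    "The analysis of the data suggests that the results are reliable.",
    "Further analysis of the corpus data is required.",
]

counts = Counter()
for text in corpus_texts:
    counts.update(re.findall(r"[a-z]+", text.lower()))

# The most frequent items; in practice one would also compute dispersion
# and compare frequencies against a reference corpus.
for word, freq in counts.most_common(5):
    print(f"{word}\t{freq}")
```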

2.5.2 Test design

Learner corpora can show us to what extent demographic variables (age, L1, language-learning experience, etc.), the testing mode (paper-based or computer-based) and the learning environment affect the learners' output, which can inform the design of language tests. Learner corpora are being used alongside native corpora and other types of data sources to underpin the English Profile interdisciplinary research programme, which is developing reference level descriptors for English.17 The more than 56-million-word Cambridge Learner Corpus underpins English Profile research, and complementary written and spoken corpora are being collected to extend the reach of this research programme (see Alexopoulou 2008). English Profile data collection involves individuals and teams contributing existing sets of written or spoken data – from teaching or testing events in various educational contexts – together with learners' metadata.

15. See www.cambridgemichigan.org/resources/ecpe (last accessed on 13 April 2015).
16. See www.natcorp.ox.ac.uk/ for the BNC and http://corpus.byu.edu/coca/ for COCA (last accessed on 13 April 2015).
17. See www.EnglishProfile.org (last accessed on 13 April 2015).

Such corpora aim to extend the range of mother tongues, text types and proficiency levels that exist in the CLC (which consists of writing scripts at levels A1 to C2). These additional data sets also provide English Profile Network researchers with the ability to explore hypotheses about linguistic features that seem to be criterial for identifying a particular proficiency level or first-language (or language-family) influence (see Chapter 27, this volume). Individual students are also targeted with an online data-collection portal which includes self-assessment, teacher assessment and external assessments of each learner's proficiency level before they produce any writing on the website. Researchers are in the process of having these learner texts rated both by expert human raters according to CEFR scales and by an automated text classifier (Yannakoudakis et al. 2011).

Learner corpora also help language testers to explore collocational patterns in learners' written or spoken production, indicating which patterns are common or less frequent at certain levels, thereby approving or guiding their inclusion in a language test. Additionally, corpora that are tagged for errors can reveal the most frequent errors or misuses of specific collocational pairings – together with many other features – thereby suggesting suitable distractor items for multiple-choice questions.

Test writers also use learner corpora to describe various linguistic domains more accurately. For example, Horner and Strutt (2004) analysed a subset of a business English vocabulary list obtained from a corpus of business exam scripts. They asked native and non-native informants to apply four meaning-based categories to 600 items on the list, which were:

• Category 1: the word rarely if ever occurs outside this domain (e.g. cost-benefit analysis).
• Category 2: the word is used both inside and outside this field but with a radical change in meaning (e.g. bull relating to the stock market).
• Category 3: the word is used both inside and outside this field; the specialised meaning can be inferred through its meaning outside the field (e.g. option relating to share dealing).
• Category 4: the word is more common in the specialised field than elsewhere (e.g. credit). (Horner and Strutt 2004: 6)

Horner and Strutt (2004: 8) found that both groups had problems in applying the categories consistently when asked to identify core and non-core vocabulary.

2.5.3 Task rating

Task rating can be informed by learner corpora and associated analytical techniques, following around thirty years of work on automatically evaluating writing, which can be linked to the detection and analysis of errors in learner output. The ETS test provider in the USA started developing automated systems for assessing writing in the 1990s, using various corpora and NLP techniques (Burstein et al. 2004; Chapter 26, this volume). Deane and Gurevich (2008) describe the use of a corpus of native and non-native writing-test data (responses to the same TOEFL® writing-test prompt) through which they contrasted the phraseology and content of both groups. Such research has implications for the automatic rating of writing and for systems which provide evaluative – and sometimes formative – feedback for learners.

Test providers and other researchers provide online evaluation services whereby a learner or teacher uploads a piece of written text and receives personalised feedback. This process can contribute to corpus development if the evaluation system captures data-usage permissions and background information alongside learners' language samples; see, for example, Andersen et al.'s (2013) self-assessment and tutoring system.

Despite the advantages of consistency of rating and reduced time provided by automated assessment of extended text (AAET; see Xi 2010a; Whithaus 2013), test providers vary in their use of this technology. The internet-based TOEFL iBT® combines one automated rating and one human rating for writing tasks but uses only human rating for speaking tasks. The Pearson Test of English (Academic) features fully automated scoring for all parts of the test. It is worth noting that impact and practicality often drive the use of AAET, although, generally speaking, technology is used alongside humans rather than instead of them.

3 Representative studies

This section presents two recent studies of how learner corpora can be analysed to reveal features of language that can inform language assessment. The first study concerns the identification of features that distinguish between proficiency levels in exam corpus data, with teaching and assessment end-users in mind (Hawkins and Filipović 2012). The second study uses a learner corpus of mostly non-exam essays to provide descriptions of L2 accuracy patterns which have pedagogical relevance and potential implications for assessment (Thewissen 2013).

3.1 Hawkins, J. A. and Filipović, L. 2012. Criterial Features in L2 English: Specifying the Reference Levels of the Common European Framework. English Profile Studies 1. Cambridge University Press.

Hawkins and Filipović (2012) report on research taking place within the English Profile programme. English Profile aims 'to analyse language produced by learners of English in order to throw light on what they can and can't do with the language at each of the CEFR levels' (UCLES/Cambridge University Press 2011: 6).

It is a long-term research endeavour whose research strands focus on identifying lexical, grammatical and functional features of learner English at the different proficiency levels. The research on vocabulary, which led to the online English Vocabulary Profile resource, has been completed, but will be revisited in the future (Capel 2010, 2012).18 English Profile is interested in the learning dimension and the interconnections between LTA, SLA, theoretical linguistics, corpus linguistics and computational linguistics, and provides an important melting-pot for input from these areas in pursuit of a specific goal: the production of reference level descriptions for the English language, as required by the Council of Europe.19

The Cambridge Learner Corpus (CLC) is the central resource for most English Profile researchers. It is a unique learner corpus of Cambridge English exam scripts written by candidates worldwide from 1993 to the present day (more than 138 first languages and 203 countries of origin are represented). The largest collection of its type, the CLC has over 56 million words from more than 250,000 learners across all proficiency levels, representative of the global candidature. Twenty-four different language exams representing general and specific domains are included.20 In addition to the candidates' writing (two or more texts for most learners in this corpus), the CLC contains demographic information together with the overall score and grade received for all components of the exam, not just the writing component, meaning that scores for the speaking, listening, reading and Use of English components are included.

This corpus has been marked up with a set of eighty-eight error codes, one of the most comprehensive and precise error-coding systems available (Nicholls 2003), so that researchers can explore what learners can do – and what common errors they make – by CEFR level, first language and any other variable in the corpus. The error codes form a two-letter system: the first letter shows the general error type (e.g. omission) and the second letter shows the word class involved, for example FN = wrong form used on a noun (see Nicholls 2003). The CLC error-coding process remains predominantly manual but is gradually being automated for specific error types: Ø. E. Andersen (2011: 1) notes that 'manual error annotation of learner corpora is time-consuming and error-prone, whereas existing automatic techniques cannot reliably detect and correct all types of error', and describes how both methods can be used successfully together through the automatic detection and correction of trivial errors, enabling the expert coders to concentrate on errors which cannot yet be handled mechanically.
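By way of illustration, the sketch below tallies two-letter error codes in a toy error-annotated sentence. The inline <CODE>erroneous|corrected</CODE> markup is a simplification invented here for the example; it mimics the error/correction logic of Nicholls's (2003) scheme but is not the CLC's actual markup.

```python
# Minimal sketch: counting two-letter error codes in error-annotated text.
# The <CODE>wrong|right</CODE> markup is a simplification invented here,
# not the CLC's actual format; the codes follow the two-letter logic
# described above (e.g. FN = wrong form of a noun, MD = missing determiner).
import re
from collections import Counter

annotated = (
    "I have <FN>informations|information</FN> about "
    "<MD>|the</MD> <RV>said|mentioned</RV> problem."
)

# Each match yields (error code, erroneous form, correction).
errors = re.findall(r"<([A-Z]{2})>([^|<]*)\|([^<]*)</\1>", annotated)

code_counts = Counter(code for code, _, _ in errors)
print(code_counts)   # e.g. Counter({'FN': 1, 'MD': 1, 'RV': 1})

# Error-code frequencies per learner, level or L1 can then be compared
# across the corpus to look for characteristic error profiles.
```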

18. See www.englishprofile.org/index.php/wordlists/free-subscription (last accessed on 13 April 2015).
19. See www.coe.int/t/dg4/linguistic/Source/DNR_Guide_EN.pdf (last accessed on 13 April 2015).
20. See www.CambridgeEnglish.org/exams for further information (last accessed on 13 April 2015).

The CLC has also been POS tagged and parsed using the Robust Accurate Statistical Parser (RASP; Briscoe et al. 2006), which enables it to be searched by grammatical relations, a feature rarely found in learner corpora. English Profile researchers can therefore obtain very detailed and accurate syntactic analyses of learner English which reveal a mapping of learner syntax and error patterns across the CEFR levels. Because the English Profile approach takes grammatical relations into account, sophisticated kinds of grammatical analysis are possible. The annotated CLC can be searched in various ways: through scripted searches, through an in-house set of tools which has been replaced by a version of Sketch Engine (Kilgarriff et al. 2004), and via a visualisation tool (see Alexopoulou et al. 2013).

The rationale behind Hawkins and Filipović's (2012) study is that there are certain linguistic properties that are 'characteristic and indicative of L2 proficiency at each of the CEFR levels and that distinguish high levels from lower levels' (p. 11). They surmise that these properties – criterial features – are the basis on which teachers and examiners make their practical assessments of a learner's proficiency level, alongside the extent to which a learner fulfils the communicative functions required of a task. Hawkins and Filipović seek to identify these criterial features empirically so that they can distinguish each proficiency level from A1 to C2 in English. In relation to grammar, they aim to discover the structures that are used by learners at one level (e.g. B1) but that are not fully acquired by learners at the level below (e.g. A2), as well as the structures learners do not seem to master until they have progressed to the next level (e.g. B2). They focus on two main types of criterial features: correct linguistic properties that have been acquired at a specific level and generally remain stable at higher levels ('positive features'), and incorrect properties (errors; 'negative features') that occur with a characteristic frequency at a level or levels. It is important to note that '[b]oth the presence and absence of the errors, and especially their characteristic frequency, can be criterial for a level … no single feature can be criterial or distinctive for a whole level; only clusters of features have the potential to be criterial' (UCLES/Cambridge University Press 2011: 9).

There are two main types of results arising from Hawkins and Filipović (2012). The first is a set of grammatical criterial features that distinguish each level – for example, at A2 level: simple sentences; sentences with clauses joined by that; descriptive phrases introduced by a past participle; simple direct wh-questions; simple sentences using an infinitive; some modals (see UCLES/Cambridge University Press 2011: 11–33). More detailed descriptions of the criterial features, with illustrations from learner examples, were also produced, which would be of direct use to teachers, curriculum writers and language testers. The second group of findings concerns error types that significantly improve in learners' production between adjacent levels.

For example, between A2 and B1 they list nine error types that significantly improve, including anaphor agreement, as in It's three years old and he works very well, but I would like a new computer. It is when learners reach the C levels, however, that a substantial reduction in the number of lexico-grammatical errors is observed: from B2 to C1, twenty-eight error types show significant improvement, and from C1 to C2 an additional thirty-five improve. Every finding is accompanied by a representative example from the CLC at the appropriate proficiency level.

In a follow-up study, Alexopoulou et al. (2013) investigated features that distinguish learner performance at B2 level from performance below B2. While Hawkins and Filipović (2012) took a theory-driven approach in their search for criterial features, their springboard being a set of linguistic hypotheses about the nature of learner language development, Alexopoulou et al. (2013) followed a complementary, data-driven approach (see also Chapter 26, this volume).

3.2 Thewissen, J. 2013. 'Capturing L2 accuracy developmental patterns: Insights from an error-tagged EFL learner corpus', The Modern Language Journal 97(S1): 77–101.

The second study also investigates developmental patterns of errors, known as trajectories, using ICLE, which consists of sixteen subcorpora of around 200,000 words each from different mother-tongue backgrounds (Granger et al. 2009). ICLE was instigated in 1990 to provide data which could be contrasted with native user data and also to provide comparisons within and between different language backgrounds for Contrastive Interlanguage Analysis (CIA; Granger 1998a), which involves comparing native and non-native varieties of the same language, and different non-native varieties of the same language (see Chapter 3, this volume). The former analysis is possible because the Louvain team also collected the Louvain Corpus of Native English Essays (LOCNESS), consisting of argumentative essays from British and American students.21 The latter analysis 'enables the researcher to find out whether certain features in the L2 production of a specific L1 group of learners is actually a result of L1 transfer or whether it is a feature more generally present in learner output of a certain target language (and thus potentially a universal feature of L2 productions)' (Nesselhauf 2006: 148). ICLE lends itself to such analysis with its comparable L1 subcorpora and, according to Nesselhauf (2006: 151), a 'recurrent result is that the role of the L1 in L2 production is even greater than commonly assumed'.

ICLE contains around 3.5 million words collected from undergraduate students of English in their third or fourth year of study who, Nesselhauf (2006: 146) notes, 'are considered advanced learners on the basis of their status and not on the basis of their actual language proficiency'.

21. See www.uclouvain.be/en-cecl-locness.html (last accessed on 13 April 2015).

This criticism of the original corpus design has been addressed by Thewissen (2013), who had 223 French, German and Spanish ICLE essays rated by two to three experts according to CEFR levels B1–C2 for linguistic competence (pp. 78–9). The essays are around 500–1,000 words in length – longer than many in the CLC – and are either argumentative or literature essays based on a list of titles provided (although it is not clear whether students were given any choice). All essays are collected according to standard guidelines; the majority were untimed and written by learners who used reference works (dictionaries, grammars, etc.) but who were not permitted to ask a native speaker to check their work. The range of essays is far more limited than in the CLC: the guidelines state that descriptive, narrative or technical subjects are not suitable for submission and favour essay titles that require opinion or the weighing up of evidence, both of which appear in the CLC along with many other formats and task types.

All students fill in the learner profile which provides the information listed in Table 23.1 and which is used in some studies. For example, Nesselhauf (2006: 152–3) explored whether collocation production by German-speaking students was informed by the circumstances of text production (dictionary use, timing, exposure to English, length of stay(s) in an English-speaking country, and teaching and informal exposure to English). Nesselhauf (2006: 152) found that 'time pressure led to the production of slightly fewer collocations and slightly more deviant collocations … learners who used a dictionary produced slightly more collocations but exactly the same percentage of deviant collocations as those learners who did not'. More importantly for pedagogy, Nesselhauf (2006: 153) observed that 'collocations are not taught in German-speaking countries in a way which leads to their appropriate acquisition … exposure in more natural settings does seem to lead to a degree of improvement – albeit small – in a learner's collocational performance'.

ICLE version 2, used by Thewissen (2013), is POS tagged and is supplied with querying software through which users can select a set of texts according to particular variables. The resulting list of essay files can be viewed text by text, printed, or saved for use with other corpus-processing software. After obtaining her essays, Thewissen had them rated on five linguistic areas, including grammatical accuracy as well as coherence and cohesion, from which an overall CEFR level was calculated, and then marked up for errors using the Louvain error-tagging taxonomy (Dagneaux et al. 1998) (Thewissen 2013: 79). The error system consists of fifty-four error types; its seven main categories are formal, grammatical, lexical, lexico-grammatical, punctuation, style and other errors. The errors were quantified by both error tags and POS tags using 'potential occasion analysis', a method of counting errors that considers the number of errors of different types 'in relation to the number of times a learner could potentially have committed such an error' (p. 81).

This was done by counting errors against twenty-one 'POS denominators' – for example, all personal pronouns in the POS-tagged data – rather than against the total number of tokens or sentences in the corpus. Statistical analysis – a four-way ANOVA – was then carried out to identify the developmental trajectories of the forty-five error categories in the data set.

This study revealed three kinds of developmental pattern: strong, weak and non-progressive. Strong patterns involved a statistically significant difference in behaviour between at least one pair of adjacent proficiency levels, e.g. B1 vs B2. These included twenty-two error types, the clearest exponent being B1 > B2 > C1 > C2, i.e. errors decreasing with each increase in proficiency, although this only applied to the total number of errors per text and to lexical confusion between two words (p. 84). The next most frequent strong pattern was B1 > [B2 > C1 > C2], which shows significant progress from B1 to B2 followed by stabilisation; seventeen error types follow this pattern, including spelling errors and uncountable-noun errors. The eight weak patterns showed a statistically significant difference in behaviour between at least one pair of non-adjacent proficiency levels, e.g. B2 vs C2, as shown by relative pronouns: B2 > C2 and B1 > C2. The sixteen non-progressive patterns were error types that show no significant change from B1 to C2, including missing punctuation and tenses. Thewissen (2013: 92) states that 'the presence of tense errors (GVT) in this pattern is an important finding, indicating that tense usage constitutes a rather improvement-resistant area for the EFL [English as a Foreign Language] groups', which is backed up by other studies in SLA and CIA.

Thewissen's (2013) analysis suggested that the clearest evidence for improvement is found between B1 and B2. She found almost no U-shaped patterns in her data set (where errors increase after B1 and then decline by C2), which differs from Hawkins and Filipović's (2012) findings; Thewissen attributes this to the lack of A-level data in ICLE. Overall, developmental trajectories similar to those of Hawkins and Filipović (2012) were observed, in that 'error developmental patterns tend to be dominated by progress and stabilisation trends and that progress is often located between B1 and B2' (Thewissen 2013: 77).

There are several implications for SLA. Firstly, Thewissen suggests that the non-linear aspect of development was borne out by this study, with only two error types displaying a linear development – total errors and lexical single errors (p. 94). Three other main patterns were identified: stabilisation-only errors (34% of error types), progress and stabilisation errors (60%) and a few cases of regression.
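To make potential occasion analysis concrete, the sketch below computes pronoun-error rates against a POS denominator (all personal pronouns) rather than against total tokens; all counts are invented for illustration.

```python
# Minimal sketch of potential occasion analysis: error counts are
# divided by the number of occasions for that error type (a POS
# denominator), not by total tokens. All counts are invented.

# Pronoun-error counts and total personal pronouns per proficiency level.
data = {
    # level: (pronoun errors, personal pronouns in the subcorpus)
    "B1": (42, 1_200),
    "B2": (30, 1_500),
    "C1": (18, 1_400),
    "C2": (15, 1_600),
}

for level, (errors, pronouns) in data.items():
    rate = errors / pronouns   # errors per potential occasion
    print(f"{level}: {rate:.1%} of pronouns are erroneous")

# Whether adjacent-level differences (B1 vs B2, etc.) are significant
# would then be tested statistically, as in Thewissen's (2013) ANOVA.
```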

4 Critical assessment and future directions

As we have shown, there is both commonality and diversification in contemporary uses of learner corpora, which suggests that careful notice should be taken of all approaches in order to determine the best models for future work in this area.
should be taken of all approaches in order to determine the best models for future work in this area. Nesselhauf (2006:  142)  identifies three ways in which learner corpora contribute to language pedagogy. The most common use is the identification of non-native features in L2 production, which improves pedagogical materials. Secondly, ‘the analysis of L2 productions yields insights into processes of second language acquisition, such as L1 transfer, communication strategies or developmental sequences, which in turn can inform language teaching materials and/ or syllabuses’ (Nesselhauf 2006: 142; see also Thewissen 2013). Less commonly, the learner corpus is directly exploited in the language classroom: ‘in a  data-driven learning approach, evidence from native speaker corpora can to some degree be complemented by negative evidence from a learner corpus’ (Nesselhauf 2006: 142). The first two uses are the most relevant to assessment, although perhaps the use of learner corpora versus native corpora extracts  – perhaps a well-chosen set of concordance lines – would be suitable to form part of an assessment in the future. We propose that research studies such as those reported in Section 3 are key to the future enhancement of our understanding of what learners can do at each level, the errors they make and the patterning of their developmental trajectories. These two approaches represent complementary perspectives on the use of learner corpus data to inform language testing practices. The two corpora mentioned above have their own strengths and weaknesses, which we summarise here as being representative of exam (e.g. CLC) and non-exam (e.g. ICLE) production data. Exam production data has a number of strengths, starting with consistency of test delivery, hence all learners within (and between) sessions in the corpus have the same experience. There is also a consistency of the format of learner samples within (and between) sessions, noting, however, that there may be revision to test formats over time. Exam production data also provide many responses to multiple test questions, although there may be different numbers of texts collected from each question, which could lead to skewing in lexical research, although this is also an issue with any imbalanced corpus. A major strength of the CLC is its in-built proficiency level based on an extensive historical, conceptual and empirical linking to the CEFR – we would argue that this cannot be replicated elsewhere.22 The approach taken by Thewissen (2013) – retrospective double marking of learner texts – is the next best approach to texts that have not been assigned a level at the point of collection. Exam corpora also provide information about other skills as all test takers’ scores may be available, although the test output itself of other skills may not be available. Another weakness with exam corpora is that learners may underperform under exam conditions. However, exams reflect what 22


However, exams capture what learners can do on their own, and so may reflect what learners can achieve just as well as classroom or homework tasks do. McCarthy (2014) stresses the need to complement exam data with non-exam data.

Corpora of non-exam production data also have strengths. For example, it is possible to collect more demographic information than can be gathered alongside an exam, where test administration imposes time pressures (see Table 23.1). Non-exam production corpora also have the potential to reach a wider demographic, as corpus builders are not limited to learners who take a language test at the end of a course of study, and the data are not linked to a particular syllabus, hence there is arguably less risk of circularity in this type of corpus. There is also the potential to include a wider variety of text types, which are arguably more realistic than some exam texts. One weakness of non-exam production corpora is that data may be collected in different ways where multiple teams are involved, resulting in inconsistencies. Comparability may also be an issue between teams if they interpret guidelines differently, with some teams offering a choice of essay prompts and others not. However, this can be remedied with the use of online data-collection portals, for example those used for CEPC or CAMCAE. Another weakness of some learner corpora is their less than rigorous (or non-existent) allocation of proficiency levels – though in the case of ICLE, the post hoc analysis by Thewissen (2013) assigned learner essays a level.

An important volume for CIA and learner corpus research is Gilquin et al. (2008a). Most relevant to assessment is Ädel's (2008) chapter, which considers the comparability of corpora, demonstrating 'that differences in foreign language production attributed to the learner's mother tongue may in fact result from discrepancies in task setting (i.e. how much time is available) and/or intertextuality (i.e. whether access is given to secondary sources)' (Gilquin et al. 2008b: viii). This is one of the few ICLE-based studies which explicitly refers to the implications of the research for assessment, although these have more to do with managing testing events to enable learners to write better argumentative texts than with informing the language tests themselves.

So what is on the horizon? There are many areas being taken forward, both within language test providers (whose key business is in developing and delivering tests but who are increasingly offering language-learning courses and associated support and reference materials) and by research teams. There are clear ways in which corpus design can be improved for applications in LTA. For example, Díez-Bedmar (2011) notes that indications of impairment are usually excluded from corpus design, which could be rectified. The fairness and accessibility of language tests to all potential candidates is a key topic in LTA, and this supports adding such information to corpora.


There are also a growing number of corpus-informed studies relevant for assessment; see, for example, some of the studies reported in Granger et al. (2013), including using learner corpus data to improve distractors in multiple-choice grammar questions (Usami 2013); profiling learner proficiency using error and syntactic analysis (Murcia-Bielsa and Macdonald 2013); a new corpus, the Corpus of Academic Learner English (CALE), to study and assess advanced language proficiency (Callies and Zaytseva 2013b); integrating learner corpus data into the assessment of spoken interaction in English in an Italian university context (Castello 2013); and intonational phrasing as a potential indicator for establishing prosodic learner profiles (Cauvin 2013). Academic researchers will continue to focus on more theoretical aspects of language proficiency, such as establishing the developmental patterns of second language acquisition, or on describing what learners can do with language, as the English Profile research teams are doing. Language testing practitioners will continue to take the findings from these studies and find ways of applying them to existing tests, or of informing new ones.

Beyond this, we envisage a number of key growth areas in the application of learner corpus linguistics to language testing. Firstly, the collection of more data from lower proficiency levels and younger language learners will be necessary, because many existing learner corpora focus on advanced-level learners. Secondly, more academic and domain-specific corpora will be collected, to enable researchers to explore further the nature of academic literacy and other domains, leading to the expansion of the notion of learner along the content–field knowledge cline as well as the proficiency cline. Thirdly, there will be an increase in spoken corpora for specific language families or L1s to complement existing written corpora, especially corpora capturing data from the same learners. Fourthly, technological advances in data capture and storage (e.g. learner management systems, cloud computing) will enable the collection of 'big data' as well as longitudinal learner data, which will shed light on the developmental trajectory of learner language across the lifespan and at different proficiency levels. There will also be further improvements in tagging and parsing systems for learner data, automated error coding of learner language, automated assessment of writing and speaking, and the provision of feedback to learners and teachers. In general, technology will continue to play a crucial role in rating language performance and providing evaluative feedback in collaboration with human raters. We believe that learner corpora have a crucial role to play in the evaluation of language proficiency and will continue to develop even more rapidly in this digital age than in their first three decades of existence.


Key readings

Alderson, J. C. 1996. 'Do corpora have a role in language assessment?', in Thomas, J. A. and Short, M. H. (eds.), Using Corpora for Language Research. London: Longman, pp. 248–59.

This article was the first published in the LTA field that discussed the application of corpora to language assessment, so it is an important starting point for anyone interested in this area.

Taylor, L. and Barker, F. 2008. 'Using corpora for language assessment', in Shohamy, E. and Hornberger, N. H. (eds.), Encyclopedia of Language and Education, Volume 7: Language Testing and Assessment, 2nd edn. New York: Springer, pp. 241–54.

This chapter summarises the development and use of corpora in language testing, focusing on high-stakes uses, and provides a useful summary of the development of computerised corpora since the 1960s and their links to the LTA literature.

Salamoura, A. and Saville, N. 2010. 'Exemplifying the CEFR: Criterial features of written learner English from the English Profile programme', in Bartning, I., Martin, M. and Vedder, I. (eds.), Communicative Proficiency and Linguistic Development: Intersections between SLA and Language Testing Research. Eurosla Monographs Series 1, pp. 101–31.

This chapter summarises English Profile research, which combines hypotheses from second language acquisition and psycholinguistics with corpus data to develop reference level descriptions for English, focusing on 'criterial features' which distinguish CEFR levels from one another.

Schlitz, S. A. (ed.) 2010. Exploring Corpus-informed Approaches to Writing Research. Special issue of Journal of Writing Research 2(2).

This special issue contains many useful articles on corpus-informed approaches to analysing learner writing, which seek to explain language phenomena based on naturally occurring language from various corpora.


24 Learner corpora and natural language processing

Detmar Meurers

1 Introduction

Natural Language Processing (NLP) deals with the representation and the automatic analysis and generation of human language (Jurafsky and Martin 2009). Learner corpora collect the language produced by people learning a language. The two thus overlap in the representation and automatic analysis of learner language, which constitutes the topic of this chapter.

We can distinguish three main uses of NLP involving learner corpora. First, NLP tools are employed to annotate learner corpora with a wide range of general properties and to gain insights into the nature of language acquisition or typical learner needs on that basis. On the one hand, this includes general linguistic properties from part of speech and morphology, via syntactic structure and dependency analysis, to aspects of meaning and discourse, function and style. On the other, there are properties specific to learner language, such as different types of learner errors, again ranging from the lexical and syntactic to discourse, function and usage. The use of NLP tools for annotation can be combined with human post-editing to eliminate potential problems introduced by the automatic analysis. NLP tools can also be integrated into a manual annotation set-up to flag annotations that appear inconsistent across comparable corpus instances (Boyd et al. 2008), to automatically identify likely error locations and to refine manual annotation (Rosen et al. 2014). The use of NLP for corpus annotation will be an important focus of this chapter. A detailed discussion of the spell- and grammar-checking techniques related to error annotation can be found in Chapter 25 (this volume).

Second, NLP tools are used to provide specific analyses of the learner language in the corpus.


For instance, for the task of native language identification discussed in detail in Chapter 27 (this volume), classifiers are trained to automatically determine the native language of the second/foreign language learner who wrote a given text. In another NLP task, learner texts are analysed to determine the proficiency level of the learner who wrote a given text (Pendar and Chapelle 2008; Yannakoudakis et al. 2011; Vajjala and Lõo 2013; Hancke and Meurers 2013), a task related to the analysis of developmental sequences and criterial features of different stages of proficiency (Granfeldt et al. 2005; Rahkonen and Håkansson 2008; Alexopoulou et al. 2010; Tono 2013; Murakami 2013a, 2013b) and to the popular application domain of automatic essay grading addressed in Chapter 26 (this volume).

The third type of NLP application in the context of learner corpora is related to the previous two but, unlike them, is not designed to provide insights into the learner corpus as such. Instead, the learner corpus is used only to train NLP tools, specifically their statistical or machine learning components. The trained NLP tools can then be applied to learner language arising in other contexts. A tool trained on a learner corpus to detect particular types of learner errors can, for example, be used to provide immediate, individualised feedback to learners who complete exercises in an intelligent tutoring system. Such Computer-Assisted Language Learning (CALL) systems to which NLP has been added are commonly referred to as Intelligent Computer-Assisted Language Learning (ICALL) – cf. Chapter 22 (this volume). While traditionally the two fields of learner corpus research and ICALL developed independently and largely unconnected (but see Granger et al. 2007), the NLP analysis of learner language for corpus annotation is essentially an offline version of the online NLP analysis of learner language in ICALL, where the learner is waiting for feedback.

Complementing the NLP analysis of learner language, the other use of NLP in the language-learning context (Meurers 2013) analyses the native language to be learned. Examples of the latter include the retrieval of pedagogically appropriate reading materials (e.g. Brown and Eskenazi 2005; Ott and Meurers 2010), the generation of exercises (e.g. Aldabe 2011) and the presentation of texts to learners with visual or other enhancements supporting language learning (Meurers et al. 2010). The standard NLP tools have been developed for native language and thus are directly applicable in that domain. The analysis of learner language, on the other hand, raises additional challenges, which figure prominently in the next section on the core issues, from corpus representation as such to linguistic and error annotation.

2 Core issues

2.1 Representing learner data and the relevance of target hypotheses

At the fundamental level, representing a learner corpus amounts to encoding the language produced by the learner and its metadata, such as information about the learner and the task performed (see Granger 2008b: 264).


For the spoken language constituting the primary data for research on first language acquisition (e.g. CHILDES, the Child Language Data Exchange System; MacWhinney 2000), for most work on uninstructed second language acquisition (e.g. ESF, the European Science Foundation Second Language corpus; Perdue 1993) and for a small part of the instructed second language acquisition corpora (e.g. NICT JLE, the National Institute of Information and Communications Technology Japanese Learner English corpus; Tono et al. 2004), this involves the question of how to encode and orthographically transcribe the sound recordings (Wichmann 2008: 195ff.). Written language, such as the learner essays typically collected in instructed contexts (e.g. ICLE, the International Corpus of Learner English; Granger et al. 2009), also requires transcription for handwritten learner texts, though essays typed by learners are increasingly common. The language typed into CALL systems is starting to be systematically collected, supporting the creation of very large learner corpora such as EFCAMDAT, the EF-Cambridge Open Language Database (Geertzen et al. 2013). Learner corpora can also be collected from websites such as Lang-8 (Brooke and Hirst 2013), where non-native writers can receive feedback from native speakers.

While the fundamental questions around how to represent spoken and written language in corpora are largely independent of the nature of the language being collected – good general corpus-linguistic discussions can be found in McEnery et al. (2006) and Lüdeling and Kytö (2008, 2009) – there are important aspects of representation that are specific to learner language. Researchers in second language acquisition emphasise the individual, dynamic nature of interlanguage (Selinker 1972) and focus on characterising its properties as a language system in its own right. At the same time, the analysis of language, be it manual linguistic analysis or automatic NLP analysis, was developed for and trained on well-formed native language. When trying to analyse learner data on that basis, one encounters forms and patterns which cannot be analysed in terms of the targeted native-language system. Consider, for example, the learner sentence in (1), taken from the Non-native Corpus of English (NOCE; Díaz-Negrillo 2007), consisting of essays by intermediate Spanish learners of English.

(1) People who speak another language have more opportunities to be choiced for a job because there is a lot connection between the different countries nowadays.

In line with the native English language system, the verbal -ed suffix of the word choiced can be identified as past tense, and the distributional slot between to be and for a job is syntactically appropriate for a verb.


But the stem choice in English can only be a noun or an adjective. In this example, and in a systematic set of cases discussed in Díaz-Negrillo et al. (2010), it is thus not possible to assign a unique English part of speech to learner tokens.

In examples such as (1), it seems straightforward to analyse the sentence as though the learner had written the appropriate native English form chosen in place of the interlanguage form choiced. Yet even for such apparently clear non-word cases, where a learner used a word that is not part of the target-language system, different native-language words may be inferred as targets (e.g. selected could be another option for the example above), and the subsequent analysis can differ depending on which target is assumed. When we go beyond the occurrence of isolated non-words, the question of which level of representation of learner corpora can form the basis for the subsequent analysis becomes more pronounced. For example, consider the sentence in (2), written by a beginning learner of German, as found in the Error-Annotated German Learner Corpus (EAGLE; Boyd 2012: 135ff.). The sentence includes only well-formed words, but the subject and the verb fail to show the subject–verb agreement required by German grammar.

(2) Du        arbeiten           in Liechtenstein.
    you.2SG   work.1PL/3PL/INF   in Liechtenstein

Given that agreement phenomena always involve (at least) two elements, there is a systematic ambiguity in determining grammatical target forms. If we take the second-person subject du at face value, the corresponding second-person verb form arbeitest is the likely target. If we instead interpret the verb as it was written, we have to assume that it is a finite form and thus postulate the corresponding third-person plural sie ('they') or first-person wir ('we') as subject to obtain a well-formed target-language sentence. Fitzpatrick and Seegmiller (2004) present a study confirming that it is often difficult to decide on a unique target form. Considering this difficulty, one needs to document the reference on which any subsequent analysis is based in order to ensure valid, sustainable interpretations of learner language. Lüdeling (2008) thus argues for explicitly specifying such target hypotheses as a representation level of learner corpora. Rosen et al. (2014: 80–1) confirm that disagreement in the analysis of learner data often arises from different target hypotheses being assumed; we discuss their findings for the Czech as a Second Language (CzeSL) corpus in the first case study in Section 3.1. While there seems to be a growing consensus that a replicable analysis of learner data requires the explicit representation of target hypotheses in learner corpora, what constitutes a target hypothesis and how it is obtained needs clarification.


There are two pieces of evidence that one can take into account in determining a target hypothesis. On the one hand, one can interpret the forms produced by the learner bottom-up, in terms of a linguistic reference system such as the targeted native-language system codified in the standard corpus annotation schemes. One can then define a target hypothesis which encodes the minimal form change required to turn the learner sentence into a sentence which is well formed in terms of the target-language grammar. A good example is the Minimal Target Hypothesis (TH1) made explicit in the annotation manual of the German learner corpus Falko (Reznicek et al. 2012: 42ff.). An alternative, incremental operationalisation of a purely form-based target hypothesis is spelled out in Boyd (2012). Both approaches explicitly define what counts as a minimal form change. They do not try to guess what the learner may have wanted to say and how this could have been expressed in a well-formed sentence; instead, they determine the minimal number of explicitly defined form changes needed to turn the learner sentence into a grammatical sentence in the target language. While this makes it possible to uniquely identify a single target hypothesis in many cases, some cases require multiple possible target hypotheses. These can readily be represented in corpora using multi-layer standoff annotation (Reznicek et al. 2013; Chapter 7, this volume).

On the other hand, one can determine target hypotheses using top-down information about the function of the language and the meaning the learner was trying to express, based on what we know about the particular task and expectations about human communication in general. The top-down, meaning-driven and the bottom-up, form-driven interpretation processes essentially interact in any interpretation of human language. For interpreting learner language, this interaction is particularly relevant, given that the interlanguage forms used by language learners cannot be fully interpreted in terms of the established linguistic reference systems developed for the native language. This is particularly evident for learner language such as the Basic Variety that is characteristic of uninstructed second language acquisition (Klein and Perdue 1997), which lacks most grammatical form marking. The fact that learner language offers limited and hard-to-interpret form-driven information bottom-up makes corpora with explicit task contexts particularly relevant for learner corpus research aimed at drawing valid inferences about learners' second language knowledge and development. Consider, for example, the learner sentences in (3), written by Japanese learners of English, as recorded in the Hiroshima English Learners' Corpus (HELC; Miura 1998).

(3) a. I don't know his lives.
    b. I know where he lives.


Both sentences are grammatically well formed in English. Given that form-based target hypotheses are defined to consist of the minimal form change needed to obtain a sentence that is grammatical in the target-language system, these target hypotheses are identical to the learner sentences in both cases. However, if we go beyond the form of the sentence and take the context and meaning into account, we find that both sentences were produced in a translation task to express the Japanese sentence meaning I don't know where he lives. We can thus provide a second, meaning-based target hypothesis for the two sentences. On that basis, we can analyse the learner sentences and, for example, interpret them in terms of the learners' capabilities to use do-support and negation, and to distinguish semantically related words with different parts of speech (cf. Zyzik and Azevedo 2009).

While the example relies on an explicit task context in which a specific sentence encodes the meaning to be expressed for this translation exercise, the idea of going beyond the forms in the sentence towards meaning and function in context is generally applicable. It is also present in the annotation guidelines used for the learner essays and summaries collected in the Falko corpus. The Extended Target Hypothesis (TH2) operationalised in Reznicek et al. (2012: 51ff.) takes into account the overall text, the meaning expressed, the function and information structure, and aspects of the style. While such an extended target hypothesis provides an important reference for a more global, functional analysis, it cannot as such be made explicit in the same formal way as the minimal-form-change target hypothesis TH1. To ensure sufficient inter-annotator agreement for TH2 annotation, task design arguably requires particular attention. The importance of integrating more task and learner information into the analysis of learner data is confirmed by the prominent evidence-centred design approach in language assessment (Mislevy et al. 2003).

A global, meaning-based target hypothesis may also seem to come closer to an intuitive idea of the target hypothesis as 'what the learner wanted to say', but such a seemingly intuitive conceptualisation of target hypotheses would be somewhat naive and more misleading than helpful. Learners do not simply write down language to express a specific meaning. They employ a broad range of strategies to use language in a way that achieves their communicative or task goals. Indeed, strategic competence is one of the components of language competence distinguished in the seminal work of Canale and Swain (1980), and Bachman and Palmer (1996) explicitly discuss planning how to approach a test task as a good example of such strategic competence. In an instructed setting, second/foreign language learners know that form errors are one of the aspects they are typically evaluated on, and they may therefore strategically produce language in a way that minimises the number of form errors. For example, Ott et al. (2012: 59–60) found that the learners in the Corpus of Reading Comprehension Exercises in German (CREG) simply lift material from texts or use familiar chunks, a strategy which, for example, allows the learners to avoid generating the complex agreement patterns within German noun phrases.


In a similar English learner corpus, Bailey (2008) found that this strategy was used more frequently by less-proficient learners, who made fewer form errors overall (but less frequently answered the question successfully).

A second reason for rejecting the idea that a target hypothesis is 'what the learner wanted to say' is that learners do not plan what they want to say in terms of full-fledged target-language forms (though learners may access chunks and represent aspects at a propositional level, cf. Kintsch and Mangalath 2011). Even when learners produce apparently grammatical target-language forms, their conceptualisation of the language forms does not necessarily coincide with the analysis in terms of the target-language system. For example, Amaral and Meurers (2009) found that learners could not interpret feedback provided by an intelligent tutoring system because they had misconceptualised contracted forms.

Summing up this discussion, target hypotheses are intended to provide an explicit representation that can be interpreted in terms of an established linguistic reference system, typically that of the language being acquired. It is also this targeted native language for which the linguistic annotation schemes and NLP tools have been developed. The form-based and the meaning-based target hypotheses discussed above are two systematic options that can serve as a reference for a wide range of language analyses. Conceptually, a target hypothesis needs to make explicit the minimal commitment required to support a specific type of analysis/annotation of the corpus. As such, target hypotheses may consist of only one or a couple of words instead of full sentences, and more abstract target-hypothesis representations may help avoid the overcommitment that would be entailed by specifying the full surface forms of sentences, e.g. where multiple word orders are possible in a given context.
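To make the notion of a minimal form change concrete, the sketch below documents a form-based target hypothesis as a token-level edit record, in the spirit of Falko's TH1 (our own minimal illustration, not the Falko tool chain). The choice of the target form itself – chosen rather than, say, selected – remains an annotation decision; the code merely records the resulting edits so that they could be stored on a separate annotation layer.

# A minimal sketch: documenting a form-based target hypothesis as the
# token-level edits that turn the learner sentence into the assumed target.
import difflib

learner = "People have more opportunities to be choiced for a job".split()
target = "People have more opportunities to be chosen for a job".split()

matcher = difflib.SequenceMatcher(a=learner, b=target)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        # e.g. prints: replace ['choiced'] -> ['chosen']
        print(op, learner[i1:i2], "->", target[j1:j2])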

2.2 Annotating learner data

The purpose of annotating learner corpora is to provide an effective and efficient index into relevant subclasses of data. As such, linguistic annotation serves essentially the same purpose as the index of a telephone book. A telephone book allows us to efficiently look up people's phone numbers by the first letter of the last name. The alternative of doing a linear search – reading through the phone book from beginning to end until one finds the right person – would be possible in theory but would generally not be efficient enough in practice. While indexing phone-book information by the first letter of the last name is typical, it is only one possible index – one that is well suited to the typical questions one tries to address using alphabetically ordered phone books. For Chinese names, on the other hand, the number of strokes in the name as written in the logographic writing system is typically used instead, with the radical-and-stroke sorting used in Chinese dictionaries being another option.


For other questions which can be addressed using the same telephone-book information, we need other indices. For example, consider a situation in which someone called us, we have a phone that displays the caller's number, and we now want to find out who called. We would need a phone book that is indexed by phone numbers. Or, to be able to efficiently look up who lives on a particular street, we would need a book that is indexed alphabetically by the first letter of the street name.

Taking this running example one important step further, consider what it takes to look up the phone numbers of all the butchers in a given town. Given that a phone book typically does not list professions, we need an additional resource to first determine the names of all the butchers. If we often want to look up people by their profession, we may decide to add that information to the phone book so that we can more readily index the data based on it. Any such index is an interpretation of the data, giving us direct access to specific subsets of data which are relevant in a particular perspective.

Each layer of annotation we add to corpora as collections of language data serves exactly that purpose: providing an efficient way to index language data so as to retrieve the subclasses of data that help us answer common (research) questions. For example, to pick out occurrences of the main verb can as in Dario doesn't want to can tuna for a living, we need part-of-speech annotation that makes it possible to distinguish such occurrences of can from the frequent uses of can as an auxiliary (Cora can dance.) or as a noun (What is Marius doing with that can of beer?), which cannot readily be distinguished by looking only at surface forms in the corpus.

Which subclasses are relevant depends on the research question and how corpus data is involved in addressing it. For Foreign Language Teaching and Learning (FLTL), the questions are driven by the desire to identify and exemplify typical student characteristics and needs. For Second Language Acquisition (SLA) research, learner corpora are queried to inform the empirical basis on which theories of the acquisition process and its properties are developed and validated. General linguistic layers of annotation, such as parts of speech or syntactic dependencies, are useful for querying the corpus for a wide range of research questions arising in FLTL and SLA – much like annotating telephone-book entries with professions allows us to search for people to address different needs, from plumbers to hairdressers. On the other hand, annotating all phone entries with the day of the week on which each person was born would not provide access to generally relevant classes of data. Which types of annotation one can and should provide for learner corpora, using automatic or manual annotation or a combination of the two, is an important research issue at the intersection of learner corpus and NLP research.
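The sketch below illustrates part-of-speech annotation as such an index, running NLTK's off-the-shelf tagger (an assumed set-up; any comparable tagger would do) over the can examples above. Since the tagger is trained on native English, it may well mis-tag the main-verb use of can; its output is a starting point for indexing, not ground truth.

# Sketch: part-of-speech tags as an index distinguishing uses of 'can'.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentences = [
    "Dario doesn't want to can tuna for a living.",
    "Cora can dance.",
    "What is Marius doing with that can of beer?",
]
for s in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(s))
    # keep only the tag(s) assigned to 'can' in each sentence
    print([(w, t) for w, t in tagged if w.lower() == "can"])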

2.2.1 Linguistic annotation

A wide range of linguistic corpus annotation schemes have been developed for written and spoken language corpora (compare, e.g., Garside et al. 1997; Leech 2005; see also Chapters 5 and 6, this volume), and the NLP tools developed over the past two decades support the automatic identification of a number of language properties, including lexical, syntactic, semantic and pragmatic aspects of the linguistic system. For learner corpora, the use of NLP tools for annotation is much more recent (de Haan 2000; de Mönnink 2000; van Rooy and Schäfer 2002, 2003b; Sagae et al. 2010). Which kinds of annotation schemes are relevant and useful for addressing which learner corpus research questions is only starting to be discussed.

For advanced learner varieties, the annotation schemes and NLP tools originally developed for native language, especially edited news text, can seemingly be applied. On closer inspection, even this requires some leeway when checking whether the definitions in the annotation schemes apply to a given learner corpus example. In NLP, such leeway is generally discussed under the topic of robustness. Real-life NLP applications such as a machine translation system should, for example, be able to translate sentences even if they contain some spelling mistakes or include words not encountered before, such as a particular proper name. Robustness in corpus annotation allows the NLP tools to classify a given learner language instance as a member of a particular class (e.g. a particular part of speech) even when the observed properties of the instance differ from what is expected for that class, e.g. when the wrong stem is used, as in the case of choiced discussed for example (1). At a given level of analysis, robustness thus allows the NLP tools to gloss over those aspects of learner language that differ from the native language for which the annotation schemes and tools were developed and trained. In other words, robustness at a given level of analysis is intended to ignore the differences between the learner and the native language at that level.

In contrast, many of the uses of learner corpora aim to advance our understanding of language acquisition by identifying characteristics of learner language. For such research, the particularities and variability of learner language at the level being investigated are exactly what we want to identify, not gloss over robustly. In Section 2.1 we already discussed a key component for addressing this issue: target hypotheses (specifically the form-based TH1). We can see those as a way of documenting the variation that a robust analysis would simply have glossed over. Target hypotheses require the researcher to make explicit where a change is required to be able to analyse the learner language using a standard linguistic annotation scheme. A learner corpus including target hypotheses and linguistic annotation thus makes it possible to identify both the places where the learner language diverges from the native-language norm and the general linguistic classes needed for the retrieval of relevant subsets of learner data.


At the same time, such an approach cannot be the full solution to analysing the characteristics of learner language. It amounts to interpreting learner language in a documented way, but still in terms of the annotation schemes developed for native language instead of annotation schemes defined to systematically reflect the properties of interlanguage itself. This is natural, given that linguistic category systems arose on the basis of a long history of data observations, based on which a consensus about the relevant categories emerged. Such category systems are thus difficult to develop for the individual, dynamic interlanguage of language learners. But if we instead simply use a native-language annotation scheme to characterise learner language, we run the danger of committing a comparative fallacy, 'the mistake of studying the systematic character of one language by comparing it to another' (Bley-Vroman 1983: 6). Given the insight from hermeneutics (see http://plato.stanford.edu/entries/hermeneutics; an accessible introduction to hermeneutics and radical constructivism can be found in Winograd and Flores (1986), under whose perspective humans are autopoietic systems evolving in a constant hermeneutic circle) that every interpretation is based on a given background, it is evident that we can never perceive anything as such. However, it is possible to limit the degree of the comparative fallacy entailed by the annotation scheme used. The idea is to annotate learner language as closely as possible to the specific dimensions of observable empirical properties. For example, traditional parts of speech encode a bundle of syntactic, morphological, lexical and semantic characteristics of words. For learner language, Díaz-Negrillo et al. (2010) proposed instead to employ a tripartite representation with three separate parts of speech, explicitly encoding the actually observable distributional, morphological and lexical stem information. Consider the examples in (4).

(4) a. The intrepid girl smiled.
    b. He ambulated home.
    c. The king of France is bald.

In terms of the distributional evidence for the part of speech of the word intrepid in sentence (4a), between a determiner and a noun we are most likely to find an adjective. Illustrating the morphological evidence, in (4b) the word ending in the suffix -ed is most likely to be a verb. Finally, in (4c) the word of is lexically unambiguous, so that looking up the isolated word in a dictionary is sufficient for determining that it is a preposition. For native language, the three sources of empirical evidence converge and can be encoded by one part-of-speech tag. For learner language, these three information sources may diverge, as was illustrated by example (1).
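A minimal sketch of how such a tripartite representation could be encoded is given below; the record layout and tag labels are our own illustration, restating the analysis of choiced from example (1).

# Sketch: three separately motivated part-of-speech values per token,
# following the tripartite proposal of Díaz-Negrillo et al. (2010).
from dataclasses import dataclass

@dataclass
class TripartitePOS:
    token: str
    distributional: str  # evidence from the syntactic slot
    morphological: str   # evidence from affixes
    lexical: str         # evidence from the stem's dictionary category

choiced = TripartitePOS(
    token="choiced",
    distributional="VERB",  # slot between 'to be' and 'for a job'
    morphological="VERB",   # -ed suffix
    lexical="NOUN/ADJ",     # stem 'choice' is a noun or adjective
)
print(choiced)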


To avoid some of the complexity of such multidimensional tagsets, Reznicek and Zinsmeister (2013) show that the use of underspecified tags (leaving out some information) and portmanteau tags (providing richer tagsets, combining information) can lead to an improved part-of-speech analysis of German learner language.

In the syntactic domain, encoding classes close to the empirical observations can be realised by breaking down constituency in terms of (a) the overall topology of a sentence, i.e. the sentence-level word order, (b) chunks and chunk-internal word order and (c) lexical dependencies. What is encoded in the overall topology of a sentence depends on the language and includes grammatical functions and discourse aspects, but the prevalence of notions such as fronting or extraposition in linguistic characterisations of data illustrates the relevance of such a global, topological characterisation of sentences. For some Germanic languages, this characterisation can build on the tradition of topological fields analysis, on the basis of which automatic NLP analyses have been developed for German (Cheung and Penn 2009). Topological fields are also starting to be employed in the analysis of learner language (Hirschmann et al. 2007). Chunks are widely discussed in the context of learner language (though often in need of a precise operationalisation and corpus-based evaluation), but the dependency analysis requires more elaboration here.

To pursue the envisaged analysis close to the specific empirical observations, one must carefully distinguish between morphological, syntactic and semantic dependencies. This is, for example, the case in Meaning Text Theory (Mel'čuk 1988), and it is reflected in the distinction between the analytical and the tectogrammatical layer of the Prague Dependency Treebank (Böhmová et al. 2003). (The tectogrammatical layer is the underlying structure at the heart of Prague School dependency analysis; in contrast to the surface-based analytical layer, it focuses on those aspects which contribute to the semantic and pragmatic interpretation.) Against this background, we can distinguish two types of dependency analyses which have been developed for learner language. On the one hand, we find surface-evidence-based approaches that aim at providing a fine-grained record of the morphological and syntactic evidence (Dickinson and Ragheb 2009; Ragheb and Dickinson 2012), such as observable case marking or agreement properties. On the other, there are approaches which essentially target a level of semantic dependencies (Rosén and Smedt 2010; Ott and Ziai 2010). The goal here is to robustly abstract away from learner-specific forms and from constructions where syntax and semantics diverge (such as English case-marking prepositions or the interpretation of the subject of non-finite constructions) in order to encode the underlying function–argument relations from which the sentential meaning can be derived. For example, dependency parsing is used as part of a content-assessment system analysing learner responses to reading comprehension questions (Hahn and Meurers 2012). King and Dickinson (2013) report on the NLP analysis of another task-based learner corpus supporting the evaluation of meaning. For learner data from a picture-description task, they obtain very high accuracies for the extraction of the core functor–argument relations using shallow semantic analysis.


For any dependency analysis of learner data to be useful for research, the essential question is which kinds of dependency distinctions can reliably be identified given the information in the corpus. This is starting to be addressed in recent work (Ragheb and Dickinson 2013). Relatedly, when using parsers to automatically assign dependency analyses for learner language, one needs to be aware that the particular parsing set-up chosen impacts the nature and quality of the dependency analysis obtained. Comparing two different computational approaches to dependency parsing German learner language, for example, Krivanek and Meurers (2013) show that a rule-based approach was more reliable in identifying the main argument relations, whereas a data-driven parser was more reliable in identifying adjunct relations. This is intuitively plausible, given that statistical approaches can use the world knowledge encoded in a corpus for disambiguation, whereas the grammar-based approach can rely on high-quality subcategorisation information for the arguments.
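To illustrate what an off-the-shelf dependency analysis looks like in practice, the sketch below extracts a few core relations with spaCy (an assumed set-up; any comparable parser could be substituted). As discussed above, such parsers are trained on native language, so the analyses they assign to learner sentences must be interpreted with care.

# Sketch: extracting functor-argument relations with a dependency parser.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I know where he lives.")
for token in doc:
    if token.dep_ in ("nsubj", "dobj", "ccomp"):
        print(f"{token.text:8} --{token.dep_}--> {token.head.text}")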

2.2.2 Error annotation

A second type of annotation of learner corpora, error annotation, targets the nature of the difference between learner data and native language (see Chapter 7, this volume). Given the FLTL interest in identifying, diagnosing and providing feedback on learner errors, and the fact that learner corpora are commonly collected in an FLTL context, error annotation is the most common form of annotation in the context of learner corpora (Granger 2003b; Díaz-Negrillo and Fernández-Domínguez 2006). At the same time, error annotation is only starting to be subjected to the rigorous systematisation and inter-annotator agreement testing established for linguistic annotation, which will help determine which distinctions can reliably be annotated based on the evidence available in the corpus. The analyses becoming available in the NLP context confirm that the issue indeed requires scrutiny. Rozovskaya and Roth (2010a) find very low inter-annotator agreement for error classification of English as a Second Language sentences. Even for the highly focused task of annotating preposition errors, Tetreault and Chodorow (2008a) report that trained annotators failed to reach good agreement. Rosen et al. (2014) provide detailed inter-annotator agreement analyses for the Czech as a Second Language corpus, making concrete for which aspects of error annotation good agreement can be obtained and what this requires – a study we discuss in detail in Section 3.1. For corpus annotation to support reliable, replicable access to systematic classes of data in the way explained at the beginning of Section 2.2, it is essential to reduce annotation schemes to those categories that can reliably be assigned based on the evidence available in the corpus.
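The sketch below illustrates the kind of agreement testing referred to above, computing Cohen's kappa over two annotators' error labels for the same ten tokens. The labels are invented; real studies compute agreement over full error-annotation layers and typically report per-category results as well.

# Sketch: inter-annotator agreement for error labels via Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["AGR", "OK", "TENSE", "OK", "OK", "SPELL", "AGR", "OK", "OK", "TENSE"]
annotator_b = ["AGR", "OK", "OK", "OK", "OK", "SPELL", "AGR", "OK", "OK", "TENSE"]

print(f"kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")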

2.2.3 Automatic detection and diagnosis of learner errors

In terms of tools for detecting errors, learner corpus research has long envisaged automatic approaches (e.g. Granger and Meunier 1994), but the small community at the intersection of NLP and learner corpus research is only starting to make headway. The conceptual challenges mentioned above and the unavailability of gold-standard error-annotated learner corpora hinder progress in this area. Corpora with gold-standard annotation are essential for developing and evaluating current NLP technology, which is generally built using statistical or supervised machine learning components that need to be trained on large, representative gold-standard data sets. Writing aids for native speakers, such as the standard spell- and grammar-checkers, may seem like a natural option to fall back on. However, such tools rely on assumptions about typical errors made by native speakers, which are not necessarily applicable to language learners (Flor et al. in press). For example, Rimrott and Heift (2008: 73) report that 'in contrast to most misspellings by native writers, many L2 misspellings are multiple-edit errors and are thus not corrected by a spell checker designed for native writers'. Examples of such multiple-edit errors include lexical competence errors such as German Postkeutzah → Postleitzahl ('postal code') or grammatical overgeneralisations as in gegehen → gegangen ('went').

The overall landscape of computational approaches for detecting and diagnosing learner errors can be systematised in terms of the nature of the data that is targeted, from single tokens via local domains to full sentences. Pattern-matching approaches target single tokens or local patterns to identify specific types of errors. Language-licensing approaches attempt to analyse an entire learner utterance to diagnose its characteristics. In the following, a conceptual overview is provided, with Chapter 25 (this volume) spelling the topic out further.

Pattern-matching approaches traditionally employ error patterns explicitly specifying surface forms. For example, an error pattern for English can target occurrences of their immediately preceding is or are, to detect learner errors such as (5) from the Chinese Learner English Corpus (CLEC; http://purl.org/icall/clec).

(5) Their are all kinds of people around us.

Such local error patterns can also be defined in terms of annotations such as parts of speech, to allow the identification of more general patterns. For example, one can target more or less followed by an adjective or adverb, followed by then – an error pattern instantiated by the CLEC learner sentence in (6).


(6) At class, students listen more careful[ADJ] then any other time.
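A minimal sketch of such surface error patterns, written as plain regular expressions, is given below. Production systems such as LanguageTool use richer patterns over tokens and part-of-speech tags; a purely surface-based more/less … then pattern, as here, will overgenerate, since it cannot check that the intervening word really is an adjective or adverb.

# Sketch: surface error patterns as regular expressions.
import re

patterns = {
    "their_are": re.compile(r"\btheir\s+(?:is|are)\b", re.IGNORECASE),
    "comparative_then": re.compile(r"\b(?:more|less)\s+\w+\s+then\b", re.IGNORECASE),
}

sentences = [
    "Their are all kinds of people around us.",
    "At class, students listen more careful then any other time.",
]
for s in sentences:
    for name, pattern in patterns.items():
        if pattern.search(s):
            print(f"{name}: {s}")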


Error pattern matching is commonly implemented in standard grammar-checkers. For example, the open-source LanguageTool (http://languagetool.org) provides a general implementation in which typical learner error patterns can be specified. While directly specifying such error patterns works well for certain clear cases of errors, more advanced pattern matching splits the error identification into two steps. First, a context pattern is defined to identify the contexts in which a particular type of error may arise. Then, potentially relevant features are collected, recording all properties which may play a role in distinguishing erroneous from correct usage. A supervised machine learning set-up can then be used to learn how to weigh the evidence to accurately diagnose the presence of an error and its type. For example, given that determiner usage is a well-known problem area for learners of English, a context pattern can be used to identify all noun chunks. Properties of the noun and its context can then be used to determine whether a definite, an indefinite or no determiner is required for this chunk. This general approach is a very common set-up for NLP research targeting learner language (e.g. De Felice 2008; Tetreault and Chodorow 2008b; Gamon et al. 2009), and it raises important general questions for future work in terms of how much context and which linguistic properties are needed to accurately diagnose which types of errors. Note the interesting connection between these questions and the need to further advance error annotation schemes based on detailed analyses of inter-annotator agreement.
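The sketch below illustrates this two-step set-up for article selection with a handful of invented training items: noun chunks are assumed to have been identified by the context pattern, simple features of each chunk are collected, and a classifier predicts whether a/an, the or no article is required. The feature names and values are purely illustrative; real systems extract much richer contextual features from very large corpora.

# Sketch: supervised classification for article selection in noun chunks.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train = [  # (features of a noun chunk, required article)
    ({"head": "information", "countable": False, "known": False}, "ZERO"),
    ({"head": "job", "countable": True, "known": False}, "A"),
    ({"head": "sun", "countable": True, "known": True}, "THE"),
    ({"head": "advice", "countable": False, "known": False}, "ZERO"),
]
features, labels = zip(*train)
vectoriser = DictVectorizer()
classifier = LogisticRegression().fit(vectoriser.fit_transform(features), labels)

test_chunk = {"head": "connection", "countable": True, "known": False}
print(classifier.predict(vectoriser.transform([test_chunk]))[0])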


Language-licensing approaches go beyond characterising local patterns and attempt to analyse complete sentences. These so-called deep NLP approaches are based on fully explicit, formal grammars of the language to be licensed. Grammars essentially are compact representations of the wide range of lexical and syntactic possibilities of a language. To process with such grammars, efficient parsing algorithms are available to license a potentially infinite set of strings based on finite grammars. On the conceptual side, grammars can be expressed in two distinct ways (Johnson 1994). In a validity-based grammar set-up, a grammar is a set of rules; a string is recognised if and only if one can derive that string from the start symbol of the grammar. A grammar without rules licenses no strings and, essentially, the more rules are added, the more different types of strings can be licensed. In a satisfiability-based grammar set-up, a grammar consists of a set of constraints; a string is grammatical if and only if it satisfies all of the constraints in the grammar. A grammar without constraints thus licenses any string, and the more constraints are added, the fewer types of strings are licensed. A number of linguistic formalisms have been developed for expressing such grammars, from basic context-free grammars lacking the ability to generalise across categories and rules to the modern lexicalised grammar formalisms for which efficient parsing approaches have been developed, such as Head-Driven Phrase Structure Grammar (HPSG), Lexical-Functional Grammar (LFG), Combinatory Categorial Grammar (CCG) and Tree-Adjoining Grammar (TAG) – cf. the grammar framework overviews in Brown (2006).

To use any of these approaches in our context, we need to consider that linguistic theories and grammars are generally designed to license well-formed native language, which raises the question of how they can license learner language (and identify errors as part of the process). There are essentially two types of approaches for licensing learner language, corresponding to the two types of formal grammars introduced above. In a validity-based set-up using a regular parser, so-called mal-rules can be added to the grammar (see, e.g., Schwind 1990; Matthews 1992) to license and thereby identify ill-formed strings occurring in the learner language. For example, Schwind (1990: 575) defines a phrase-structure rule licensing German noun phrases in which the adjective follows the noun. This is ungrammatical in German, so the rule is marked as licensing an erroneous structure, i.e. it is a mal-rule. A mal-rule approach requires each possible type of learner error to be pre-envisaged and explicitly encoded in every rule in which it may surface. For small grammar fragments, as needed for exercises effectively constraining what the learner is likely to produce (Amaral and Meurers 2011: 9ff.), this may be feasible; but for a grammar with any broader coverage of language phenomena and learner error types, writing the mal-rules needed would be very labour-intensive and error-prone. Some types of errors can arise in a large number of rules; for example, subject–verb agreement errors may need to be accommodated in any rule realising subjects together with a finite verbal projection. Following Weischedel and Sondheimer (1983), meta-rules can be used to express such generalisations over rules. For example, they define a meta-rule that allows subject–verb agreement to be relaxed anywhere, and one allowing articles to be omitted in different types of noun phrases.

Rule-based grammars license the infinite set of possible strings by modularising the analysis into local trees. A local tree is a tree of depth one, i.e. a mother node and its immediate children. Each local tree in the overall analysis of a sentence is independently licensed by a single rule in the grammar. Thus a mal-rule also licenses a local tree, which in combination with other local trees licensed by other rules ultimately licenses the entire sentence. The mal-rule approach is conceptually simple when the nature of an error can be captured within the local domain of a single rule. For example, a mal-rule licensing the combination of an article and a noun disagreeing in gender (le[MASC] table[FEM]) can be added to a French grammar. The fact that rules and mal-rules in a grammar interact requires very careful grammar (re)writing to avoid unintended combinations.


The situation is complicated further when the domain of an error is larger than a local tree. For example, extending the word orders licensed by a rule S → NP VP by adding the mal-rule S → VP NP makes it possible to license (7a) and (7b).

(7)  a.   Mary [loves cats].
     b. * [loves cats] Mary.
     c. * loves Mary cats.

The order in (7c), on the other hand, cannot be licensed in this way, given that it involves reordering words licensed by two different rules, so that no single mal-rule can do the job – unless one writes an ad hoc, combined mal-rule for the flattened tree (S → V NP NP), which would require adding such rules for combinations with all other rules licensing VPs (intransitive, ditransitive, etc.) as well. Lexicalised grammar formalisms using richer data structures, such as the typed feature structure representation of signs used in HPSG, make it possible to encode more general types of mal-rules (cf. Heift and Schulze 2007). Similarly, mildly context-sensitive frameworks such as TAG and CCG provide an extended domain of locality that could in principle be used to express mal-rules encoding errors in those extended domains.

To limit the search space explosion commonly resulting from rule interaction, the use of mal-rules may be restricted. One option is to include the mal-rules in processing only when parsing a sentence with the regular grammar fails. However, this only reduces the search space for well-formed strings; if parsing fails, the question of which mal-rules need to be added is not addressed. An intelligent solution to this question was pursued by the ICICLE system (Interactive Computer Identification and Correction of Language Errors; Michaud and McCoy 2004), which selects groups of rules based on learner modelling. For grammar constructs that the learner has shown mastery of, it uses the native language rule set, but no rules are included for constructs beyond the developmental level of the learner. For structures currently being acquired, both the native rule set and the mal-rules relating to those phenomena are included. In probabilistic grammar formalisms such as probabilistic context-free grammars (PCFGs), or when using LFG grammars with optimality-theoretic mark-up, one can also try to tune the licensing of grammatical and ungrammatical structures to the learner language characteristics (cf. Wagner and Foster 2009).

The second approach to licensing sentences which go beyond the native-language grammars is based on constraint relaxation (Kwasny and Sondheimer 1981).

It relies on a satisfiability-based grammar set-up, or a rule-based grammar formalism employing complex categories (feature structures, first-order terms), for which the process of combining information (unification) and the enforcement of constraints can be relaxed. Instead of writing complete additional rules, as in the mal-rule approach, constraint relaxation makes it possible to eliminate specific requirements of regular rules, thereby admitting additional structures normally excluded. For example, the feature specifications ensuring subject–verb agreement can be eliminated in this way to also license ungrammatical strings. Relaxation works best when there is a natural one-to-one correspondence between a particular kind of error and a particular specification in the grammar, as in the case of subject–verb agreement errors being directly linked to the person and number specifications of finite verbs and their subject argument. One can also integrate a mechanism corresponding to the meta-rules of the mal-rule set-up, in which the specifications of particular features are relaxed for particular sets of rules or constraints, or everywhere in the grammar.

In this context, one often finds the claim that constraint relaxation does not require learner errors to be pre-envisaged and therefore should be preferred over a mal-rule approach. Closer inspection makes it clear that such a broad claim is incorrect: to effectively parse a sentence with a potentially recursive structure, it is essential to distinguish those constraints which may be relaxed from those that are supposed to be hard, i.e. always enforced. Otherwise either parsing does not terminate, or any learner sentence can be licensed with any structure, so that nothing is gained by parsing.

Instead of completely eliminating constraints, constraints can also be associated with weights or probabilities, with the goal of preferring or enforcing a particular analysis without ruling out ungrammatical sentences. One prominent example is the Weighted Constraint Dependency Grammar (WCDG) approach of Foth et al. (2005). The as yet unsolved question raised by such approaches (and the probabilistic grammar formalisms mentioned above) is how the weights can be obtained in a way that makes it possible to identify the likely error causes underlying a given learner sentence.

The constraint-relaxation research on learner error diagnosis has generally developed handcrafted formalisms and solutions. At the same time, computer science has studied Constraint Satisfaction Problems (CSP) in general and developed general CSP solvers. In essence, licensing a learner utterance is dealt with in the same way as solving a Sudoku puzzle or finding a solution to a complex scheduling problem. Boyd (2012) presents an approach that explores this connection and shows how learner error analysis can be compiled into a form that can be handled by general CSP solvers, with diagnosis of learner errors being handled by general conflict-detection approaches.
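
The core idea of distinguishing relaxable soft constraints from hard ones can be sketched in a few lines (a schematic toy illustration of our own, not the implementation of any of the cited systems): agreement features may be relaxed during unification, with each relaxation logged as a candidate error diagnosis, while conflicts on hard features still cause failure:

```python
SOFT_FEATURES = {"num", "pers"}  # relaxable agreement features (assumed)

def unify(fs1, fs2):
    """Unify two flat feature dicts. Returns (result, relaxations):
    conflicts on soft features are relaxed and logged as candidate
    error diagnoses; a conflict on a hard feature fails (None)."""
    result, relaxations = dict(fs1), []
    for feat, val in fs2.items():
        if feat in result and result[feat] != val:
            if feat in SOFT_FEATURES:
                relaxations.append((feat, result[feat], val))
            else:
                return None, relaxations  # hard constraint: never relaxed
        else:
            result[feat] = val
    return result, relaxations

# Subject-verb agreement for '*he walk': 3sg subject, non-3sg verb form
# (the lexical entries are hypothetical, simplified to agreement only).
subject_agr = {"num": "sg", "pers": 3}
verb_agr = {"num": "pl", "pers": 3}
fs, errors = unify(subject_agr, verb_agr)
print(errors)  # [('num', 'sg', 'pl')] -> diagnose an agreement error
```

Declaring the category feature hard while leaving agreement features soft is precisely the kind of pre-envisaging of error types that, as noted above, constraint relaxation cannot avoid.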

Finally, while most approaches to licensing learner language use standard parsing algorithms with extended or modified grammars to license ill-formed sentences, there is also some work modifying the algorithmic side instead. For instance, Reuer (2003) combines a constraint-relaxation technique with a parsing algorithm modified to license strings in which words have been inserted or omitted, an idea which in essence moves generalisations over rules in the spirit of meta-rules into the parsing algorithm.

Let us conclude this discussion with a note on evaluation. Just as the analysis of inter-annotator agreement is an important evaluation criterion for the viability of the distinctions made by an error annotation scheme, the meaningful evaluation of grammatical-error-detection approaches is an important and under-researched area. One trend in this domain is to avoid the problem of gold-standard error annotation as reference for testing by artificially introducing errors into native corpora (e.g. Foster 2005). While this may be a good choice to monitor progress during development, such artificially created test sets naturally reflect the properties of learner data only in a very limited sense and do not eliminate the need ultimately to evaluate an approach on authentic learner data with gold-standard annotation. A good overview of the range of issues behind the difficulty of evaluating grammatical-error-detection systems is provided in Chodorow et al. (2012) and Chapter 25 (this volume).
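
To make the artificial-error idea concrete, the following toy sketch (in the spirit of, but not reproducing, Foster 2005) corrupts a native sentence with a subject–verb agreement error and records the error position, yielding a labelled development test item:

```python
def introduce_agreement_error(tokens):
    """Strip the -s from the first plausible 3sg verb form; returns
    the corrupted tokens and the error position (None if unchanged)."""
    for i, tok in enumerate(tokens):
        # crude heuristic for a 3sg verb form; a real system would use
        # POS tags to avoid corrupting plural nouns by mistake
        if tok.endswith("s") and tok not in {"is", "was", "has"}:
            return tokens[:i] + [tok[:-1]] + tokens[i + 1:], i
    return list(tokens), None

corrupted, pos = introduce_agreement_error(["she", "walks", "home"])
print(corrupted, pos)  # ['she', 'walk', 'home'] 1
```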

3 Representative studies

The following two case studies take a closer look at two representative approaches spelling out some of the general issues introduced above. The first case study focuses on a state-of-the-art learner corpus for which detailed information on the integration of manual and automatic analysis, as well as detailed inter-annotator agreement information, is available. The second case study provides a concrete example of error detection in the domain of word-order errors, a frequent but under-researched type of error that also allows us to exemplify how the nature of the phenomenon determines the choice of NLP analysis used for error detection.

3.1 Rosen, A., Hana, J., Štindlová, B. and Feldman, A. 2014. 'Evaluating and automating the annotation of a learner corpus', Language Resources and Evaluation 48(1): 65–92

To showcase the key components of a state-of-the-art learner corpus annotation project integrating insights and tools from NLP, we take a look at the Czech as a Second Language (CzeSL) corpus based on Rosen et al. (2014). The corpus consists of 2.64 million words with written and transcribed spoken components, produced by foreign language learners of Czech at all levels of proficiency and by Roma acquiring Czech as a second language. So far, a sample of 370,000 words from the written portion has been manually annotated.

[Figure 24.1 An example of the multi-tier representation of the CzeSL corpus (Rosen et al. 2014: 72). Tier 0 shows the learner sentence 'Myslim že by kdy byl se svim ditem', tier 1 the corrected word forms 'Myslím že kdyby byl se svým dítětem', and tier 2 the final target hypothesis 'Myslím, že kdybych byl se svým dítětem, …', with error tags such as incorInfl, stylColl, wbdOther, incorBase and agr linking tokens across tiers.]

The corpus is encoded in a multi-tier representation. Tier 0 encodes the learner text as such, tier 1 encodes a first target hypothesis in which all non-words are corrected, and tier 2 is a target hypothesis in which syntax, word order and a few aspects of style are corrected. The differences between tiers 0 and 1 and between tiers 1 and 2 can be annotated with error tags. Depending on the nature of the error, the annotations link individual tokens across two tiers, or they can scope over multiple tokens, including discontinuous units.

Figure 24.1 exemplifies the CzeSL multi-tier representation. Tier 0 at the top of Figure 24.1 is the sentence as written by the learner. This learner sentence includes several non-words, of which three require changes to the inflection or stem, and one requires two tokens to be merged into a single word. Tier 2, shown at the bottom of the figure, further corrects an agreement error to obtain the target hypothesis, of which a glossed version is shown in (8).

(8)  Myslím,    že    kdybych  byl       se    svým  dítětem, …
     think.SG1  that  if.SG1   was.MASC  with  my    child
     'I think that if I were with my child, …'
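
How such a multi-tier representation with cross-tier links might be encoded can be sketched as follows (a simplified reconstruction based on Figure 24.1; the tag names and alignments are illustrative, not the actual CzeSL exchange format):

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    lower: tuple  # token indices on the lower tier
    upper: tuple  # token indices on the higher tier
    tags: list = field(default_factory=list)  # error tags on this link

tier0 = ["Myslim", "že", "by", "kdy", "byl", "se", "svim", "ditem"]
tier1 = ["Myslím", "že", "kdyby", "byl", "se", "svým", "dítětem"]

links_0_1 = [
    Link((0,), (0,), ["incorInfl", "stylColl"]),   # Myslim -> Myslím
    Link((2, 3), (2,), ["wbdOther"]),              # by kdy -> kdyby (merged)
    Link((6,), (5,), ["incorInfl", "stylColl"]),   # svim -> svým
    Link((7,), (6,), ["incorInfl", "incorBase"]),  # ditem -> dítětem
]

for link in links_0_1:
    src = " ".join(tier0[i] for i in link.lower)
    tgt = " ".join(tier1[i] for i in link.upper)
    print(f"{src!r} -> {tgt!r} {link.tags}")
```

Links spanning multiple lower-tier tokens, as for the merged 'by kdy', show why the representation cannot be reduced to simple token-by-token correction pairs.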

The errors in individual word forms treated at tier 1 include misspellings, misplaced word boundaries, errors in inflectional and derivational morphology, incorrect word stems, and invented or foreign words. The tier is thus closely related to the minimal form change target hypothesis we discussed in Section 2.1, but focuses exclusively on obtaining well-formed individual words rather than full syntactic forms. Such a dedicated tier for individual word forms is well motivated considering the complex morphology of Czech.

The target form encoded in tier 2 addresses errors in agreement, valency, analytical forms, word order, pronominal reference, negative concord, and the choice of tense, aspect, lexical item or idiom.

The manual annotation process is supported by the annotation tool feat, developed for this purpose. The annotation started with a pilot annotation of sixty-seven texts totalling almost 10,000 tokens. Fourteen annotators were split into two groups, with each group annotating the sample independently. The Inter-Annotator Agreement (IAA) was computed using the standard Cohen's kappa metric (κ, cf. Artstein and Poesio 2008). Since tiers 1 and 2 can differ between annotators, error tags are projected onto tier 0 tokens for computing the IAA. The feedback from the pilot annotation was used to improve the annotation manual and the training of the annotators, and to modify the error taxonomy of the annotation scheme in a few cases. The annotation was then continued by thirty-one annotators, who analysed 1,396 texts totalling 175,234 words.

Both for the pilot and for the second annotation phase, a detailed quantitative and qualitative discussion of IAA results and confusion matrices is provided in Rosen et al. (2014: 76), one of which is shown in Table 24.1. We see that the annotators showed good agreement at tier 1 for incorrect morphology (incor*: κ > 0.8) and improper word boundaries (wbd*: κ > 0.6), and also for agreement errors (agr: κ > 0.6) and syntactic dependency errors (dep: κ = 0.58). Such errors can thus be reliably annotated given the explicit, form-based target hypothesis. On the other hand, pronominal reference (ref), secondary (follow-up) errors (sec), errors in analytical verb forms/complex predicates (vbx) and negation (neg) show a very low IAA level, as do tags for usage and lexical errors (κ