Developmental and Crosslinguistic Perspectives in Learner Corpus Research [1 ed.] 9789027271723, 9789027207715

This volume provides a state-of-the-art overview of current research and developments on the use of learner corpora perc

English Pages 367 Year 2012

Developmental and Crosslinguistic Perspectives in Learner Corpus Research

Tokyo University of Foreign Studies (TUFS) Studies in Linguistics
For an overview of all books published in this series, please see http://benjamins.com/catalog/tufs

Volume 4 Developmental and Crosslinguistic Perspectives in Learner Corpus Research Edited by Yukio Tono, Yuji Kawaguchi and Makoto Minegishi

Developmental and Crosslinguistic Perspectives in Learner Corpus Research Edited by

Yukio Tono Yuji Kawaguchi Makoto Minegishi Tokyo University of Foreign Studies

John Benjamins Publishing Company Amsterdam / Philadelphia

The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.

Library of Congress Cataloging-in-Publication Data
Developmental and crosslinguistic perspectives in learner corpus research / edited by Yukio Tono, Yuji Kawaguchi, Makoto Minegishi.
p. cm. (Tokyo University of Foreign Studies (TUFS), studies in linguistics, ISSN 1877-6248 ; v. 4)
Includes bibliographical references and index.
1. Corpora (Linguistics) 2. Language acquisition--Research. 3. Computational linguistics. I. Tono, Yukio. II. Kawaguchi, Yuji. III. Minegishi, Makoto.
P128.C68D48 2012
410.1’88--dc23 2012003775
ISBN 978 90 272 0771 5 (Hb ; alk. paper)
ISBN 978 90 272 7172 3 (Eb)

© 2012 – Tokyo University of Foreign Studies
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.
John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA

Contents

Message from the President
  Ikuo KAMEYAMA (President, Tokyo University of Foreign Studies)
Center for Corpus-based Linguistics and Language Education
  Makoto MINEGISHI (GCOE Project Leader)
Introduction
  Yukio TONO, Yuji KAWAGUCHI and Makoto MINEGISHI

Part 1 The International Corpus of Crosslinguistic Interlanguage (ICCI)

The English Profile: Using Learner Data to Develop the CEFR for English
  Nick SAVILLE
International Corpus of Crosslinguistic Interlanguage: Project Overview and a Case Study on the Acquisition of New Verb Co-occurrence Patterns
  Yukio TONO
Compilation and Exploration of ICCI Corpus for Learner Language Research
  Huaqing HONG
The Use of Demonstrative Reference in English Texts by Austrian School-age Learners
  Barbara SCHIFTNER and Tom RANKIN
The Role of Conventionalized Language in the Acquisition and Use of Articles by Polish EFL Learners
  Agnieszka LEŃKO-SZYMAŃSKA
The Use of Intensifying Adverbs in Learner Writing
  Pascual PÉREZ-PAREDES and María Belén DÍEZ-BEDMAR
Profiling EFL Learners’ Writing Performance by Syntactic Complexity: A Corpus-based Study
  Austina SHIH and May MA
A Cross-sectional Analysis of the Use of the English Article System in Spanish Learner Writing
  María Belén DÍEZ-BEDMAR and Pascual PÉREZ-PAREDES
Lexical Richness and Variation in the Writing of School-age EFL Learners at Different Learning Stages and Different Educational Systems
  Tammar LEVITZKY-AVIAD
Use and Misuse of Cohesive Devices in the Writings of EFL Chinese Learners: A Corpus-based Study
  Yongbing LIU and Huiping ZHANG
Normalising Frequency Counts to Account for ‘opportunity of use’ in Learner Corpora
  Paula BUTTERY and Andrew CAINES

Part 2 Issues of Learner Corpus Research: Focus on Speech Data

Spanish Learners’ Production of French Close Rounded Vowels: A Corpus-based Perceptual Study
  Isabelle RACINE
Coding an L2 Phonological Corpus: From Perceptual Assessment to Non-native Speech Models. An Illustration with French Nasal Vowels
  Sylvain DETEY
Design and Analysis of Asian English Speech Corpus: How to Elicit L1 Phonology in L2 English Data
  Mariko KONDO
Lexical Profile of French Learner Speech: The Case of Japanese University Students
  Kaori SUGIYAMA
What’s (not) in a Corpus? What to Look for in a Learner Corpus of Spoken English
  Hiroko SAITO
The Use of Multi-word Units in Learner Language Narratives: Are there Qualitative and/or Quantitative Differences between Japanese ESL Learners and EFL Learners?
  Asako YOSHITOMI
Corpus-based Analysis of Lexical Collocations by Intermediate Japanese Language Learners: With a Focus on the Verb Suru
  Ayano SUZUKI and Tae UMINO

Index of Proper Nouns
Index of Subjects
Contributors

Message from the President
Ikuo KAMEYAMA (President, Tokyo University of Foreign Studies)

It is a great honor for me to introduce a new volume of the TUFS Studies in Linguistics, entitled Developmental and Crosslinguistic Perspectives in Learner Corpus Research. This book was originally to be based on the symposium, which had to be cancelled after the devastating March 11 earthquake and tsunami. I am truly grateful to all the contributors to this volume, from twelve countries around the world, for their continued support for our project and their prayers for our country’s recovery from the disaster.

I will now dwell briefly on the Global COE Program, which began in April 2007. This program is an effort by the Japanese government to strengthen its support for research and educational institutions in which internationally renowned work is taking place. The program was developed to take advantage of world-class resources to help foster the development of creative researchers who can lead in their fields, and to strengthen research and education in Japan’s centers of graduate education.

In 2007, proposals were solicited in five areas. The program submitted in the area of humanities by our university was one of 12 selected nationally. The humanities category encompasses fields as diverse as philosophy, art, psychology, education and archaeology. The submission from our university was the only one selected in the area of linguistics. We believe this reflects the high level of research and education at our institution.

Our submission, entitled “Corpus-based Linguistics and Language Education,” emphasizes a field of empirical linguistics based on the use of corpora. The program’s goal is to foster the growth of advanced researchers with international perspectives. This program continues the research conducted under the “Usage-based Linguistic Informatics” 21st Century COE Program, which concluded in March 2007. The new program will build on the international joint research framework that was created over the past five years to achieve two goals, with the support of the entire university:
1. To further develop a comprehensive education program for graduate students
2. To give graduate students opportunities to perform fieldwork, build and analyze corpora, and receive language education and training, both in Japan and overseas.

I am not an expert in linguistics, nor do I have a deep scholarly understanding of corpus linguistics. However, as a scholar of literature, I have a keen interest in the possibilities inherent in the field. The corpus concept was introduced into my area of specialization, Russian literature, in the late 1980s. As far as I know, this resulted in the creation of corpora for the works of authors such as Fyodor Dostoevsky and Andrei Platonov. However, it is not yet clear how effective the corpus concept will be in the development of the study of literature. In contrast, corpus-based linguistics seeks not to use linguistic data to understand the latent properties of a text as a closed system, but to understand the linguistic structure and function of a language within a larger context. So, I believe that corpus linguistics provides us with higher objectivity and richer possibilities in the field of humanities.

Still, it is my opinion that the greatest hurdles for corpus-based linguistics are yet to come. Humans are creatures that cannot help but seek out meaning and ways of systemizing matters. It is evident that corpus linguistics has not been a field that describes only the actual uses of languages, but one that finds ways to generalize creative discoveries and to extend its insights. Its value lies in its ability to push itself. For corpus-based linguistics to grow creatively as a human science, we must help young researchers to develop innovative and unique capabilities for analysis. I believe that this is where the real importance of the current G-COE Program lies.

In conclusion, I would like, as president of this institution, to express my sincere respect to all the leading researchers who attended this symposium, for their untiring efforts. More importantly, I hope that the young scholars who attended the symposium have imbibed some of the passion that was on display, and I hope that it will help them to grow into internationally competitive researchers.

January 6, 2012

Center for Corpus-based Linguistics and Language Education
Makoto MINEGISHI (GCOE Project Leader)

The Center for Corpus-based Linguistics and Language Education (CbLLE) was established with the express aim of building an education and research center with unique strengths in the study of linguistic diversity and in usage-based research on linguistic structure and language education. The Center builds on the strengths of the nationally high-ranking Graduate School of Area and Culture Studies of the Tokyo University of Foreign Studies (TUFS) and of the Research Institute of Languages and Cultures of Asia and Africa (ILCAA). Its educational and research uniqueness is achieved by integrating three core areas of activity: (a) collection and analysis of naturally occurring language use data through field research, (b) compilation and analysis of large-scale corpora of language use data from a wide range of languages, and (c) application of corpus-based linguistic analyses to language education and pedagogy. A few details of the work being done in these core areas follow.

Field linguistics: The field linguistics program supports fieldwork-based research on a typologically diverse set of languages, including not only the world’s major languages but also lesser-studied languages. They include languages of Africa, Eurasia, and North America. It also aims at advancing typological research on the basis of primary data from a broad range of languages. It provides students with solid training in the methodology of collecting, processing, and analyzing field data. The project has undertaken fieldwork-based study of a diverse range of languages of the world (lesser-studied languages in particular) and typologically informed description of these languages. Some of the projects under this category are: Compilation of a Word List for Field Research on Khwe Languages; Field-work based study of under-studied speeches of India; Collation of Spontaneous Conversational Data of Individual Languages such as Swahili, Russian, French, Spanish, and Turkish.

Corpus linguistics: The program in corpus linguistics supports analysis of a large amount of language use data and compilation of corpora, which feed into linguistic informatics research and also into descriptive and typological research. Some of the specific targets are: Building electronic corpora and developing analysis and processing tools in order to support new ways of analyzing language data and multipurposing of the data; Developing multilingual and multifunctional integrative corpora of language use for major languages on the basis of language use data collected in language teaching classrooms, blogs, etc.; Conducting international collaborative research and providing support in the development and utilization of tools for corpus creation, morphological analysis, electronic dictionary creation, and text analysis. The projects undertaken here include: Development of Electronic Dictionaries for Russian as well as Thai (separately); Corpus Compilation of Data from Medium/Minor Language Groups; Development of Utility Manuals for German Corpus; Preparation of Introductory Text-book on Lexicology based on Corpus Data; Research on Corpora for Minor Language Group in EU Countries.

Linguistic informatics: The linguistic informatics program builds on the research in the field linguistics and corpus linguistics components to significantly advance research in language pedagogy. It seeks to do so by taking into account the factors of linguistic and cultural diversity through analysis of corpora of language use in actual contexts of language instruction, including naturally occurring conversations and learners’ language use. A few studies undertaken in this context are: Research on Lexicon/language-use based on Corpora for Various Fields; Language Processing/education Technology; POS Search Engine for Spoken French as well as Spoken Spanish; Basic Research on E-learning through Moodle; Corpora of Learners’ Language Use (both as an internal project and as an international collaborative project); Creation of Language Tests based on Error Analysis of Language Use of English Learners.

The GCOE trains researchers and educators who have a clear understanding of the nature and significance of linguistic and cultural diversity and can take a flexible research approach to language structure and language education. This project equips young researchers with a broad foundation for linguistic research by providing practical training in field research, corpus-based research and language education. These training programs support integrative research on linguistic and cultural diversity and usage-based linguistics by effectively connecting field data collection, data analysis, and educational application of theoretical insights obtained from the analysis.

The specific projects and tasks listed above form part of the larger plan of building an international research and education center with the more general targets described below. The Center seeks to build a world-leading research and education center in the study of linguistic diversity and in the usage-based research of linguistic structure. The national and international infrastructure for supporting the GCOE is being built through the following activities:

Formation of an international network of collaborative research: Collaboration in corpus creation and in development of analysis tools (such as electronic dictionary systems); building a network of international collaboration and academic exchange in linguistic research and teaching within the framework of the ‘Consortium for Asian and African Studies’, which has its headquarters at the University.

Expansion of opportunities for academic interaction across institutions and across countries: Expanding opportunities for young researchers, as well as established scholars, within Japan and abroad to assemble and interact through visiting scholar programs and through employment.

Support program for young researchers: Providing young researchers with financial and technical support for linguistic field research, corpus creation, and education research in the field; and providing young researchers with financial support for professional development (including presentation at international conferences).

Active international dissemination of research results: Building an information technology infrastructure that supports active electronic dissemination of research results; and publishing the research results in a series of publications through international publishers that specialize in linguistics, the present volume being a contribution towards this aim.

Introduction
Yukio TONO, Yuji KAWAGUCHI and Makoto MINEGISHI

Learner language has always been a focus of research interest in second language acquisition (SLA) since the discipline of applied linguistics was established in the 1960s. Pit Corder was among the first to take a special interest in learner language as an independent system, narrowing in on the role of the learner’s errors in describing the process of language development (Corder 1967). Selinker coined the term “interlanguage” (Selinker 1972) to give learner language an independent status, proposing that it was a transitional system from the mother tongue to the target language. In the 1980s, instead of looking only at errors, a more comprehensive analysis of learner language was proposed, which led to so-called “performance analysis” (Dulay, Burt & Krashen 1982). Their research was closely linked to the Input Hypothesis proposed by Krashen (1985), which emphasized the built-in mechanism of second-language (L2) acquisition and underestimated the role of formal instruction in SLA. This theory was criticized by Gregg (1984) and others, and the importance of formal instruction was reaffirmed in a famous review paper by Long (1983). In the meantime, descriptive studies of learner language continued and a new hypothesis of the L2 developmental process in naturalistic settings was proposed by a group of L2 German researchers (e.g., Meisel, Clahsen, and Pienemann 1981; Pienemann 1984). At the same time, Tarone drew attention to the more qualitative aspects of interlanguage by revealing the variability of interlanguage systems within the same individual learner (Tarone 1983). After that, the analysis of interlanguage in its entirety gradually declined as attention shifted to the more subtle relationship between input and interactions in the classroom and to aspects of interlanguage pragmatics.

In the early 1990s, the advent of computer technology brought a new focus on learner language. It became increasingly possible to store a large amount of learner production data on computers for textual analysis. In the U.S., child acquisition researchers started to share their recordings of child-caretaker interactions and their transcripts, which laid the foundation for the Child Language Data Exchange System (CHILDES) project (MacWhinney 1995). In Europe, the International Corpus of Learner English (ICLE) was launched as a sub-component of the International Corpus of English (ICE) project (Granger 1994). Since then, a number of papers using ICLE have been published (see the website of CECL at UCL: http://www.uclouvain.be/en-cecl-lcBiblio.html). ICLE was a collection of essays written by English major students at university level. One of the unique features of ICLE is that learners with various mother tongue backgrounds were included in the design for cross-comparison. Granger (2002) called this “Contrastive Interlanguage Analysis (CIA)” (ibid: p.12). CIA involves two types of comparison: native speaker (NS) vs. non-native speaker (NNS) comparisons for investigating non-native features of learner language, and NNS vs. NNS comparisons for identifying first language (L1)-related features (e.g., errors or overuse/underuse phenomena) as well as universal learner language properties. As ICLE became more and more influential, there was also a growing interest in spoken learner corpora; resources such as the Louvain International Database of Spoken English Interlanguage (LINDSEI) (Gilquin et al. 2010) or the National Institute of Information and Communications Technology Japanese Learner English Corpus (NICT JLE) (Izumi et al. 2004) are now available.

While learner corpus research is gaining ground and the types of data and research topics have diversified (e.g., Granger 1998; Granger et al. 2002; Gilquin et al. 2008), there is still one serious problem: research on the beginning stages of learning or acquisition is still very scarce. This is primarily because most learner corpus building projects to date focus on adult learners, especially university students, from whom data are easy to collect. Most large learner corpora were constructed by collecting learners’ essays, either online or in an electronic format (e.g., Word documents), whereas younger learners usually do not use PCs or the Internet for writing. The same applies to commercial publishers’ learner corpora, such as the Cambridge Learner Corpus (CLC). It consists of Cambridge exam scripts written mainly by those who wanted to study at universities in the U.K.
Thus, the primary population for the corpus consists of young adults and older people who have already finished their secondary school education. Although Cambridge offers exams for younger learners, the corpus breakdown shows that such data are lacking.

With this background in mind, we aimed to launch a project to collect corpora of data from younger learners, focusing on the very beginning stages of learning, as well as a lower-intermediate level. It was fortunate that this project was supported by the five-year Global COE project, granted to Tokyo University of Foreign Studies (TUFS) by the Ministry of Education, Culture, Sports, Science, and Technology (MEXT). The project was called the “International Corpus of Crosslinguistic Interlanguage (ICCI).” We initially planned to compile bidirectional learner corpora, which is the reason why we called the project “crosslinguistic.” By “bidirectional,” we originally meant that we would compile not only corpora of English writing by learners from different countries/regions (e.g. Spain, China, Israel, Poland, etc.), but also corpora of TUFS students specializing in the languages spoken in those regions (Spanish, Chinese, Hebrew, Polish, etc.). This initial aim was not fully realized because of the enormous amount of work involved in compiling the English corpus sections, but we aim to move on to produce something bidirectional in the future.

Like the previous volumes in the TUFS Studies in Linguistics series by John Benjamins, the papers in this volume demonstrate the depth and breadth of international linguistic research carried out at TUFS. Originally, this book was to be based on the international symposium on learner corpora, which was scheduled to be held from the 13th to the 15th of March. It was three days before this symposium that the magnitude 9.0 earthquake and the massive tsunami struck northern Japan, followed by the devastating nuclear power plant failure. All the related meetings were cancelled and most research had to be temporarily stopped, but we decided to move ahead and publish this volume as a token of our gratitude to all those involved in the project.

This volume provides an up-to-date snapshot of recent research and developments in the use of learner corpora. It is divided into two parts. Part I focuses on the ICCI project and Part II is devoted to other variations of learner corpora, with a special emphasis on spoken learner corpora and learner corpora of languages other than English.

In the first chapter in Part I, Nick Saville, one of the guest speakers for the symposium mentioned above, shares with us the ongoing research of the English Profile Programme (EPP). He describes the historical background of the EPP, and explains basic notions such as reference level descriptions, the origins of the CEFR levels, and how the CEFR is linked to the Cambridge exams. He then introduces the idea of criterial features and how the EPP researchers are tackling this problem by using the Cambridge Learner Corpus (CLC).
He clarifies a number of important concepts in the CEFR and the EPP, which provides us with important background for understanding where this exciting new research agenda has emerged from.

Yukio Tono, the director of the ICCI project, gives an overview of the project in terms of its design criteria. Following Saville’s chapter, Tono focuses on the importance of younger learners’ data and demonstrates how the ICCI data will help to improve the description of criterial features for CEFR levels; new verb co-occurrence frames are taken as an example. He also stresses the importance of setting up formal procedures for objectively evaluating criterial features, using learner corpora and statistical methods.

Huaqing Hong works as a computational linguist for the ICCI project. In chapter 3, Hong describes the technical aspects of data collection and corpus compilation in the ICCI project. Here we can see all the details of the ICCI data in terms of learner profile information (country, grade, gender, mother tongue, genre, topic, etc.). Hong goes on to describe the process of formatting texts and linguistic annotations. The ICCI data have been prepared in different formats to be used with different interfaces (WordSmith, AntConc, Xaira, etc.). He also built an original query package for the ICCI data, which has many user-friendly features. Hong elaborates on each of the features, demonstrating how the ICCI data can be exploited using his tool.

Schiftner and Rankin, the ICCI coordinators in Austria, investigate the use of demonstrative reference in Austrian learner writing. Their research design is quite ambitious in that it covers both the dependent and independent use of demonstratives, and includes demonstratives used for both exophoric and endophoric reference. They observe the overuse of the demonstrative that and show that it could be the result of a number of different possible underlying processes, including L1 influence, the effects of an L2-internal system, or the learner’s preference for simpler clause structures, among others. Their findings suggest that the identification of gaps between native speakers and learners will help to improve the quality of formal instruction.

Agnieszka Leńko-Szymańska, the ICCI coordinator in Poland, aims in chapter 5 to establish the extent to which the use of articles by native speakers and Polish learners of English as a foreign language at different proficiency levels can be explained as being motivated by the use of conventionalized language, or “lexical bundles.” Using the ICCI and the ICLE, she compares the frequencies of the articles the and a/an between learner corpora at different proficiency levels and native corpora (FLOB and FROWN). A novel aspect of Leńko-Szymańska’s investigation is that she uncovers the learners’ awareness of the existence of recurrent word combinations containing articles, as well as the fact that this sensitivity grows stronger as proficiency increases.
Pascual Pérez-Paredes and María Belén Díez-Bedmar, the ICCI coordinators in Spain, present the results of a cross-sectional study on the use of intensifying adverbs by Spanish primary and secondary school students. Their study documents the use of intensifying adverbs in very young EFL learners below the age of 16. They show that the use of intensifying adverbs can be a good predictor of the differences in performance between grade 7 and the other three groups (grades 8 through 10). They suggest that it takes several years for very young learners of English to start using adverbs other than very.

Austina Shih and May Ma, the ICCI coordinators in Taiwan, investigate the syntactic complexity of Taiwanese learners of English. They use a tool to produce various metrics of syntactic complexity, and find the mean length of a T-unit to be the best indicator for nonadjacent levels of proficiency.

Although they admit that too much reliance on an automatic tool can be dangerous, it is certainly revealing to have a bird’s eye view of syntactic complexity measures across different groups of learners and to speculate on the reasons for a difference. Shih and Ma also suggest an interesting direction for future research.

Chapter 8, also by María Belén Díez-Bedmar and Pascual Pérez-Paredes, presents the results of a six-year cross-sectional study on the use of the English article system by Spanish secondary school students. They find that there is a significant relationship between an increase in the correct use of articles and its contexts defined by Bickerton’s (1981) and Huebner’s (1983) taxonomies. They also find that the lack of articles constantly poses problems for the learners.

Tammar Levitzky-Aviad, the ICCI coordinator in Israel, investigates the lexical richness and variation of EFL learners sampled from the ICCI. The findings suggest that it is difficult for EFL learners to attain a proficiency level high enough to produce sophisticated vocabulary beyond 3,000 words. These results may seem rather obvious, considering the nature of the tasks used in the ICCI, where learners are only required to use everyday-life vocabulary; however, Levitzky-Aviad claims that more focused teaching of vocabulary should be recommended in order to teach L2 vocabulary beyond the basic 2,000 words.

Liu Yongbing, the ICCI coordinator in China, and his student Zhang Huiping compare three groups of Chinese learners of English (Hong Kong, Taiwan, and Mainland) in the use/misuse of conjunctive cohesive devices. Their findings reveal that these learners are able to use the three basic cohesive devices (additive, adversative, and causal) in their writing, and that there are four types of misuse identified as systematic negative transfers from Mandarin Chinese. They also show that there are some regional differences, which might be related to local contexts of teaching English.
Part I closes with a chapter by Paula Buttery and Andrew Caines. Paula Buttery was another speaker invited to the March symposium but was unable to attend. Instead, she has contributed a chapter with her colleague on how to normalize learner data appropriately in order to prevent a misleading picture of learner development. Even though it is somewhat preliminary in nature, this is a thought-provoking study. They seek a way to eliminate the effects of text length, topic, and task upon opportunities to use a linguistic property in a corpus. They set out to investigate the relationship between text length and two language properties: the mean length of utterance, and the variety and quantitative usage of adverbs. Researchers involved in learner corpora all face a similar kind of dilemma, and Buttery and Caines try to give answers to this difficult question.


Yukio TONO, Yuji KAWAGUCHI and Makoto MINEGISHI

Part II contains seven chapters focusing on corpora of spoken texts, as well as corpora in languages other than English.

Isabelle Racine presents a corpus-based perceptual study of the French /y/ and /u/ produced by Spanish learners of French. She performed an experiment in which French native listeners had to identify the vowel (/y/, /u/) of four monosyllables produced by two groups of Spanish learners (one from Geneva and one from Madrid) and a group of native French speakers in two tasks (repetition and reading). The results reveal a better identification rate for /u/ than for /y/, suggesting that a new vowel with no phonemic or phonetic equivalent in the L1 (/y/) is more difficult to learn than a vowel that is phonemically but not phonetically similar (/u/). Moreover, for /y/, orthography seems to interfere negatively, with more errors for the words produced in the reading task. By contrast, orthography acts positively for /u/, with better performance for the words produced in the repetition task. Lastly, although her results show surprisingly better performance for the Spanish learners in Madrid, a detailed analysis reveals that this difference can be attributed to individual differences among learners.

Sylvain Detey aims to show that a coding system such as schwa coding in the Phonologie du Français Contemporain project can be a useful procedure for exploiting an L2 phonological corpus for research-oriented purposes. Detey goes on to argue that such a coding procedure can turn an L2 database into a rated L2 database that can be used in the field of applied L2 speech sciences. Finally, he shows that such an L2 coding issue highlights the socio- and psycholinguistic links between L2 speech assessment and speech models.

Mariko Kondo introduces the Asian English Speech cOrpus Project (AESOP), which aims to collect a large-scale L2 English speech corpus of Asian language speakers.
The corpus is unique in that it is designed to illustrate common and language-specific phonological features among Asian language speakers, and uses common platforms and procedures. Kondo reports on some of the features of AESOP, such as annotations for segmental and suprasegmental characteristics of English based on read speech and semi-spontaneous speech data, and an automatic phoneme alignment system specifically customized for Japanese speakers. Kaori Sugiyama analyses the use of words by Japanese learners of French in terms of lexical diversity, sophistication, and density. Based on a small oral corpus, a vocabulary profile analysis was performed, using VocabProfil, a French version of the Lexical Frequency Profile (Laufer and Nation 1995). She finds significant differences in the use of high-frequency content words, middle-frequency words, and off-list words indicated by the Guiraud index, between three subgroups. She suggests that Japanese learners of French can be classified based upon these lexical measures.
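
For readers unfamiliar with the Guiraud index mentioned above, it is commonly defined as the number of distinct word types divided by the square root of the number of tokens. The following is only a minimal sketch of that definition in Python; the function name `guiraud_index` and the simple regular-expression tokenizer are assumptions made for illustration and do not reproduce VocabProfil or the procedure used in Sugiyama’s chapter.

```python
import math
import re

# Hypothetical, deliberately naive tokenizer: runs of letters, optionally
# joined by an apostrophe or hyphen (so "j'ai" counts as one token).
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def guiraud_index(text: str) -> float:
    """Return types / sqrt(tokens) for a text; 0.0 if the text has no tokens."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    if not tokens:
        return 0.0
    types = set(tokens)
    return len(types) / math.sqrt(len(tokens))


if __name__ == "__main__":
    # Five tokens, four distinct types ("le" occurs twice).
    print(guiraud_index("le chat voit le chien"))
```

Because the denominator grows only with the square root of text length, the index is less sensitive to text length than a plain type-token ratio, although it does not remove the effect entirely.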


Hiroko Saito argues that there seems to be an imbalance between the number of written and spoken corpora being developed, as well as in the amount of research carried out using learner corpora. She finds that research concerning the prosody of English spoken by Japanese learners is scarce. In this chapter, Saito introduces her own experiment, which suggests the possible influence of the learner’s dialect on the pronunciation of the interlanguage, and discusses what kind of information might be incorporated in order to develop a decent spoken learner corpus.

Asako Yoshitomi examines learner language narratives, focusing on the use of multi-word units, by Japanese learners of ESL and Japanese learners of EFL. The study assumes a post-cognitivist paradigm, and aims to investigate whether differences in the social and linguistic contexts in which learners acquire English have a significant effect on their performance in oral production. Findings indicate that there may be quantitative as well as qualitative differences between the two types of learners. However, differences between the learners and the native speakers of English were much more considerable, indicating the difficulty of thinking for speaking in an additional language, even for pre-pubescent advanced learners with ESL experience.

The final chapter of the volume is written by Ayano Suzuki and Tae Umino. They investigate the use of verb-noun collocations in Japanese learner corpora, focusing on the verb suru [do]. This corpus of Japanese learners is also one of the G-COE projects at TUFS. The results reveal that when using the verb suru, learners tend to use the -o suru combination more fixedly than native speakers do. Additionally, when using the sahen verb, native speakers often used “o + noun of Japanese origin + suru” in addition to “Sino-Japanese noun + suru,” whereas learners tend to stick to “Sino-Japanese noun + suru.” The results of the analysis of each composition task also reveal that learners tend to use the “noun + particle + verb” pattern, whereas native speakers possess a wider variety of expressions other than “noun + particle + verb.” On the basis of these results, Suzuki and Umino conclude that learners have a tendency to use collocations based on fixed connections.

This series of snapshots of learner corpus research is addressed to researchers in (applied) corpus linguistics, foreign language educators with an interest in ongoing research related to language learning, and postgraduate teachers and students working towards the development of original studies in this area. We hope that they will find a solid background and fresh stimuli to further contribute to this fast-developing field. On a final note, we wish to thank the series editors at John Benjamins, the external as well as internal reviewers for their constructive comments on our draft chapters, and the contributors for their noteworthy submissions.
Thanks also go to the MEXT for generously funding our research, and TUFS for giving us this wonderful opportunity to share our research with the world.

References

Bickerton, D. 1981. Roots of Language. Ann Arbor, MI: Karoma Press.
Corder, S.P. 1967. “The significance of learner’s errors”. International Review of Applied Linguistics 5. 161-170.
Dulay, H., M. Burt and S. Krashen. 1982. Language Two. New York: Oxford University Press.
Gilquin, G., S. De Cock and S. Granger. 2010. Louvain International Database of Spoken English Interlanguage. Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S. 1994. “The Learner Corpus: a revolution in applied linguistics”. English Today 10. 25-33.
Granger, S. 2002. “A Bird’s eye view of learner corpus research”. Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching, Granger, S., J. Hung and S. Petch-Tyson (eds). Amsterdam: John Benjamins. 3-33.
Granger, S., E. Dagneaux, F. Meunier and M. Paquot. 2009. International Corpus of Learner English. Version 2. Louvain-la-Neuve: Presses universitaires de Louvain.
Gregg, K. 1984. “Krashen’s monitor and Occam’s razor”. Applied Linguistics 5:2. 79-100.
Huebner, T. 1983. A Longitudinal Analysis of the Acquisition of English. Ann Arbor, MI: Karoma.
Izumi, E., K. Uchimoto and H. Isahara (eds). 2004. Nihon-jin 1200-nin no Eigo Speaking Corpus (A spoken corpus of 1200 Japanese-speaking learners of English). Tokyo: ALC Press.
Krashen, S. 1985. The Input Hypothesis: Issues and Implications. London and New York: Longman.
Laufer, B. and P. Nation. 1995. “Vocabulary Size and Use: Lexical richness in L2 written production”. Applied Linguistics 16:3. 307-322.
Long, M. 1983. “Does second language instruction make a difference: A review of research”. TESOL Quarterly 17:3. 359-382.
MacWhinney, B. 1991. The CHILDES Project: Tools for Analyzing Talk. New York: Lawrence Erlbaum.
Meisel, J.M., H. Clahsen and M. Pienemann. 1981. “On determining developmental stages in natural second language acquisition”. Studies in Second Language Acquisition 3. 109-135.
Pienemann, M. 1984. “Psychological constraints on the teachability of languages”. Studies in Second Language Acquisition 6. 186-214.
Selinker, L. 1972. “Interlanguage”. International Review of Applied Linguistics 10. 209-232.
Tarone, E. 1983. “On the variability of interlanguage systems”. Applied Linguistics 4:2. 142-164.
Tono, Y. (ed). 2007. Nihonjin Chukousei 10000-nin no Eigo Corpus (JEFLL Corpus: A Corpus of 10,000 Japanese EFL Learners). Tokyo: Shogakukan.

The English Profile: Using Learner Data to Develop the CEFR for English

Nick SAVILLE

1. Introduction

The Common European Framework of Reference (CEFR) appeared in its published form in 2001, ten years after the Rüschlikon Conference of 1991, which concluded that a “common framework of reference” of this kind would be useful as a planning tool to promote “transparency and coherence” in language education. Since its publication the document itself has been translated into 37 languages and has been disseminated widely in Europe, Asia and Latin America (see Little, 2006 for an overview). However, in its published format the CEFR was intended to be “a work in progress”, to be adapted and modified as necessary to meet the needs of users (see, for example, the CEFR Japan).

The six reference levels (A1 to C2) have been very influential in the fields of curriculum development, language teaching, and above all in assessment (see Coste, 2007). The levels are described through the Global Scale and Illustrative Descriptors, which can be applied to the learning and teaching of any language.

To ensure that the Framework can be adapted to local contexts and purposes, the Council of Europe has encouraged the production of instruments and support materials to complement the CEFR. These instruments (sometimes known as the CEFR toolkit) include Reference Level Descriptions (RLDs) for national and regional languages. The RLDs are a new generation of language-specific descriptions which identify the forms of a given language (words, grammar, etc.) at each of the six reference levels; these forms can be set as objectives for learning or used in the assessment of each level of proficiency.

The Language Policy Division of the Council of Europe produced a general Guide for the production of RLDs, which was discussed at a seminar held in Strasbourg in December 2005. The participants discussed how to set up projects and reported on progress already being made for various languages (see the Council of Europe website for details). The RLD for German, Profile deutsch (covering A1, A2, B1, B2), had already been published as a result of an international collaborative project and was presented in detail at this seminar. Projects representing seventeen languages were presented, including a proposal for English put forward by the University of
Cambridge. This proposal subsequently became known as the English Profile Programme (EPP).

The founder members of the EPP first met in Cambridge in 2005 to discuss the possibility of setting up an RLD project for English; these included representatives from several departments of the University of Cambridge led by Cambridge ESOL (Cambridge University Press, the Research Centre for English and Applied Linguistics and the Computer Laboratory), together with representatives from the British Council, English UK, and the University of Bedfordshire (Centre for Research in English Language Learning and Assessment, CRELLA). As a result, an initial three-year project was set up with project coordination based in Cambridge, and since then regular progress reports have been submitted to the Language Policy Division through Dr. John Trim, who has acted as an observer on the Council of Europe’s behalf. After the first three years, the programme was extended with a growing network of collaborators around the world (including Japan) and the long-term EPP was established.

From an early stage the EPP placed particular emphasis on empirical research rooted in data such as learner corpora of English, and on the collection of representative samples of learner language which could be used to explore language development across the reference levels. In order to achieve this it was necessary to set up a network of collaborators in different parts of the world who could supply samples of speaking and writing produced by learners. This aspect of the project received external funding from the European Commission and has been underway since 2008/9. It has also required technical resources for developing electronic corpora so that the samples of learner language can be stored and accessed effectively.

2. RLD and the Cambridge Learner Corpus

At the outset, in the first phase of the programme, an extensive resource of learner data derived from the Cambridge ESOL examinations proved invaluable in gaining insights into the language of learners at each level. Samples of writing from the Cambridge English examinations have been systematically collected from cohorts of learners across the proficiency continuum since the early 1990s. These samples have been stored in a large database in order to create a well-defined corpus of learner writing, both within and across the proficiency levels. This database is known as the Cambridge Learner Corpus or CLC. The CLC consists of learners’ written English from the Cambridge ESOL examinations covering the ability range from A2 to C2, together with metadata (gender, age, first language) and evidence of overall proficiency based on marks in the other components (typically reading, listening and
speaking). While lexical analysis had been carried out for many years by researchers in Cambridge ESOL, error coding and parsing of the corpus have extended the kinds of analyses which can be carried out and have allowed the research teams to investigate a wider range of English language features (see Nicholls, 2003 on the error-coding system which has informed EP research). A computational strand of research was introduced into the EPP at the outset and the CLC has been tagged and parsed using the Robust Accurate Statistical Parser (RASP) by researchers in the Computer Laboratory under the supervision of Professor Ted Briscoe (Briscoe, Carroll & Watson, 2006).

Although these data come from an examination context, they have provided a unique opportunity to investigate learner writing linked to the CEFR level system. This is because the proficiency standards represented by the CEFR global scale can be found in the linguistic behaviour of learners who have been candidates for the Cambridge examinations dating back many decades.

3. The origins of the CEFR levels

The CEFR levels were not originally based on a theory of second language acquisition (SLA) or on a psycholinguistic analysis of how people learn languages; rather, their origins lay in the nature of educational systems and in the organisation of learning across the educational cycles, e.g. in school-based cohorts and in classrooms. In fact, the origins can be found in the language learners themselves: the people (both young people and adults) across Europe in the many diverse educational contexts where “instructed learning” was taking place in the 1970s and 1980s (e.g. state schools, adult education, university language centres, private language schools, cultural institutes and so on).

The authors of the CEFR provided a comprehensive way of capturing what language learners may need to learn in order to make progress and to meet their learning objectives; in so doing they also set out realistic targets for judging success in terms of a conventional level system which broadly fits many educational contexts. The primary concern of those involved was to help practitioners with the task of organising their language teaching/learning within formal educational systems for many different types of learner, from kindergarten to high school graduation (K12), and throughout life (in higher education and work-related contexts, for travel and migration, and for a range of social purposes such as holidays, heritage links, cultural interests, etc.).

The six reference levels emerged from this way of thinking. Starting as far back as the 1960s, those who have shaped the evolution of a common framework over a period of 40 years or so have sought to characterise these progressive stages in language learning/teaching (from
beginner to advanced level), and have tried to capture the conventional understandings of language proficiency through a descriptive approach (i.e. what constitutes for them a practical representation of the developing interlanguage of learners). In other words, in developing a comprehensive frame of reference, the authoring group and advisors set out to formalise existing understandings of levels and the ways in which learning content was being graded for learners at different stages in various educational contexts across the continent (see, for example, Taylor and Jones, 2006).

Take, for example, the top level, C2. This is not described with reference to the competence of a well-educated native speaker (however that may be defined), but is conceived of as the highest level that learners might aspire to reach within the normal educational processes available to them in learning another language. Language specialists such as interpreters, professional writers and so on may develop skills which exceed the C2 level, thus allowing for a possible D level.

By systematically formalising what already existed, and by attempting to provide a useful framework of levels and other categories, the aim was to introduce greater transparency and coherence to the very complex picture which had begun to emerge. The widespread use of the CEFR in the ten years since its publication is a testimony to its usefulness and demonstrates its success in meeting its main objective.

Continuity with earlier Council of Europe projects was particularly relevant to the CEFR authors and was deliberately built into their approach. The CEFR was based on levels which were already in use in Europe from the 1970s onwards, including the Waystage and Threshold specifications. The authors also drew heavily on examination systems which were already widely used at that time, including the Cambridge English: First (FCE) and Cambridge English: Proficiency (CPE) examinations. This point is endorsed by the authors of the CEFR. See the introductory chapter of SILT 33 by North et al. (2010) and other references to origins and appropriate uses of CEFR. See also Trim in Green (2011 forthcoming).

4. The Cambridge exams and the CEFR

The Cambridge exams, therefore, share the same origins as the CEFR levels; they informed the CEFR and were informed by it in a process of convergence over many years. For example, the original test specifications for the Cambridge English: Preliminary (PET) and Cambridge English: Key (KET) were directly based on Waystage and Threshold 1990; these specifications have been reviewed and revised in light of the CEFR (2001) but have remained essentially the same since the early 1990s, when both tests were introduced into the Cambridge English suite of general English
examinations alongside FCE, Cambridge English: Advanced (CAE) and CPE. In addition to their common origins in the cohorts of European learners and test takers (as noted above), the subsequent reviews and revisions of the Cambridge English suite (FCE, CAE and CPE) since the late 1990s have sought to maintain alignment with the CEFR’s principles (its functions, domains of use and content) as well as to enhance the precision of alignment to the levels in light of the evolving understandings of the reference levels themselves. For example, cross-comparisons with the standards set for other languages through participation in international benchmarking exercises have helped to ensure alignment of standards in speaking and writing skills (see the work of the ALTE Manual Special Interest Group since 2004, www.alte.org).

Nowadays the results of many Cambridge examinations are reported with reference to three CEFR levels: for example, test takers for Cambridge First (FCE/B2) take it because a B2 level examination best fits their current stage of learning. If they do well and get the top grade (A), their result is reported with reference to the C1 level. On the other hand, if they do relatively poorly and narrowly fail to reach the B2 standard, their result is reported with reference to the B1 level. This is made possible by the way the examinations are constructed and statistically calibrated. All the examinations within the Cambridge English system are calibrated using item-banking methods and anchoring procedures based on Rasch analysis (a form of item response theory).

It can be claimed therefore that the Cambridge examinations are strongly aligned with the CEFR, and so an analysis of learner writing elicited during the examination process provides useful information about the linguistic features of learners’ English corresponding to the functional descriptions which underpin the CEFR itself.

5. The CLC and criterial features

The CLC now contains approximately 45 million words taken from over 500,000 samples of writing produced by about 200,000 different learners from many different first language backgrounds. It is now the largest database of learner writing of its kind, with large samples for each level across the range, A2 to C2. Because the writing samples were collected under examination conditions, they are described and categorised in a highly consistent way. They are also systematically accompanied by metadata about the learners (L1, age, gender, etc.) together with other measures of proficiency (e.g. the writers’ scores in reading, listening and speaking).

The fact that the writing was produced in an examination context is a major strength, but there are also a number of limitations with regard to the use of these data. On the positive side, we can be sure that the writing was actually produced by the learners and that assistance was not provided by a more
proficient writer, such as a teacher or parent. The scripts can therefore be considered “external representations” of the learners’ internalised knowledge of English which they have deployed in doing the writing unassisted. In addition, because of the sampling frame used and the number of scripts in the corpus, the interlanguage of these learners at each level is not the direct result of any single course or method of teaching, although to some extent their knowledge will (we hope) have been influenced by the syllabus and method used to teach them. Students following courses in preparation for Cambridge First (FCE/B2) in different parts of the world experience many different learning environments and course materials, and are exposed to many diverse opportunities for learning outside of the classroom. The internalised knowledge of the syntactic features and lexico-grammatical system of English is therefore not a direct result of a single syllabus or method which is shared by all learners in a category (e.g. FCE/B2).

Inevitably, the samples of writing in the CLC are numerous but represent a limited range of possible discourse types. Some aspects of the written performance may also be determined by the examination context and influenced by task effects such as the topic. However, this does not detract from the usefulness of this corpus data for investigating features of learner language at different proficiency levels. While there is “noise” in this kind of data (e.g. boundary effects where samples from different examinations at contiguous levels overlap), sub-corpora can be carefully selected so that the written samples in each category represent distinctively different levels of performance. This is achieved by selecting samples written by learners who are judged to be in the middle of the proficiency band (e.g. as assessed by indicators of their overall proficiency).

The written texts in these sub-corpora can then be used to exemplify the language of the learners at the increasing levels of proficiency. The sub-corpora can also be analysed to determine which linguistic features of English occur at each level. If certain features apparently correlate with one level more than another, this information can be subjected to further study and confirmed in future analyses. This kind of analysis provides a data-driven explanation of what it means to have acquired English at each of the CEFR levels.

Another innovative feature of the EPP is the concept of “criterial features”, as described in a number of publications available on the English Profile (EP) website. This notion is central to the approach taken within the EP to specify the reference levels for English and has been set out in detail by Hawkins and Filipović (forthcoming). They explain what criterial features are and elaborate on their practical and theoretical relevance. They also provide a taxonomy of the features which have emerged during the research carried out by them and the other members of the EP team in Cambridge (Hawkins and Buttery, 2009, 2010).
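
The two steps described above, selecting scripts from the middle of each proficiency band and then comparing how often a feature occurs at each level, can be illustrated with a short Python sketch. It is not the English Profile team’s procedure: the `Script` record, the 0-100 `overall_score` scale, the cut-off values and the simple token-matching notion of a “feature” are all hypothetical simplifications introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Script:
    """Hypothetical record for one exam script."""
    level: str            # CEFR level of the examination, e.g. "B2"
    overall_score: float  # assumed overall proficiency indicator, 0-100
    tokens: list[str]     # tokenised text of the script


def mid_band(scripts, low, high):
    """Keep only scripts whose overall score lies within [low, high]."""
    return [s for s in scripts if low <= s.overall_score <= high]


def rate_per_thousand(scripts, feature):
    """Occurrences of `feature` per 1,000 tokens, computed for each level."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for s in scripts:
        totals[s.level] += len(s.tokens)
        hits[s.level] += sum(1 for t in s.tokens if t == feature)
    return {lvl: 1000 * hits[lvl] / totals[lvl] for lvl in totals if totals[lvl]}


if __name__ == "__main__":
    scripts = [
        Script("B1", 55, "i think that he is nice".split()),
        Script("B1", 95, "although tired she nevertheless continued".split()),
        Script("B2", 60, "although it rained we went out although late".split()),
    ]
    selected = mid_band(scripts, low=40, high=70)  # drops the borderline B1 script
    print(rate_per_thousand(selected, "although"))
```

Normalising by the number of tokens at each level keeps larger sub-corpora from appearing to use a feature more often simply because they contain more text.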


The concept of criteriality is not restricted to the lexico-grammatical domain which Hawkins and his colleagues have concentrated on, but is applicable to other aspects of language and language learning (phonetics, form-function relations, semantics and pragmatics, etc.). In addition to describing the “real language” used by learners, the English Profile has sought to investigate the learning dimension and to connect the empirical work with relevant SLA and linguistic research. The EP researchers in Cambridge are interested in “how learners learn English” and how different factors interact under various contextual conditions. They are addressing questions of the following kind:

• how do the different kinds of criterial features (lexical semantic, morpho-syntactic, syntactic, discourse, notional, functional, etc.) interrelate?
• to what extent does the criteriality of features vary depending on the L1 of the learner?
• which criterial features can be used as diagnostics at the individual learner level?
• what is the effect of task type on learner production and criterial features?
• how does the type of context in which the English is produced help explain the findings?
• which linguistic features realise which language functions across the CEFR levels?

The emerging performance patterns are informative for our understanding of SLA, for example the order of acquisition of linguistic features and the interaction of factors such as frequency and transfer from the first language. It is hoped that findings from the EP may contribute to new aspects of theory and provide useful insights for developing a model of L2 acquisition.

6. Conclusion

Possible criterial features which have emerged have now been published in an English Profile Handbook, available to download from the EP website. These cover both the vertical dimension, showing how a single feature develops across the proficiency range, and the horizontal dimension, showing how the features cluster to characterise a given level (e.g. B1). The concept of “criteriality” is now being extended to the analysis of other samples of learner language and in particular to speech. Spoken language data is being collected and corpora are being built with the necessary computational tools to enable the research to be extended in that direction. In addition, a particular focus on the C-levels continues with the collection of academic writing samples. It is hoped that these strands of
research will be published in the new EP studies series and will make a useful contribution to our field by providing insights into using the CEFR for the learning, teaching and assessment of English. References Alexopoulou, T. 2008. “Building new corpora for English Profile”. Research Notes 33. 15-19. Briscoe, E., R. Carroll and R. Watson. 2006. “The second release of the RASP system”. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions (Sydney, Australia). Available online; http://acl.ldc.upenn.edu/P/P06/P06-4020.pdf Capel, A. 2009. “A1-B2 vocabulary: Insights and issues arising from the English Profile Wordlists project”. Paper presented at the English Profile Seminar (Cambridge, 5-6 February 2009). Coste, D. 2007. “Contextualising the use of the CEFR for Languages”. Paper presented at the Council of Europe Policy Forum (Strasbourg 2007). Available online; www.coe..int/T/DG4/Linguistic/Source /Source Forum07/ D-Coste_contextualise_EN.doc Council of Europe. 2001. Common European Framework of Reference for Languages: Learning, Teaching, assessment. Cambridge: Cambridge University Press. Green, T. 2008. “English Profile: Functional progression in materials for ELT”. Research Notes 33. 19-25. Green, T. 2011. Functional Progression Revisited, English Profile 2. Cambridge: Cambridge ESOL/CUP. Hawkins, J.A. 1994. A performance theory of order and constituency. Cambridge: Cambridge University Press. Hawkins, J.A. 2004. Efficiency and complexity in grammars. Oxford: Oxford University Press. Hawkins, J.A. 2009. “An efficiency theory of complexity and related phenomena”. Complexity as an evolving variable, Gill, D., G. Sampson and P. Trudgill (eds). Oxford: Oxford University Press. Hawkins, J.A. and L. Filipović. 2011. Criterial features of English across the reference levels of the CEFR, English Profile 1. Cambridge: Cambridge ESOL/CUP. Hawkins, J.A. and P. Buttery. 2009. “Criterial features in learner corpora: Theory and illustrations”. 
Paper presented at the English Profile Seminar (Cambridge, 5-6 February 2009).

The English Profile

25

Hawkins, J.A. and P. Buttery. 2010. “Using learner language from corpora to profile levels of proficiency: Insights from the English Profile Programme”. English Profile Journal 1:1. 1-23. To appear online; www.englishprofile.org
Hendriks, H. 2008. “Presenting the English Profile Programme: In search of criterial features”. Research Notes 33. 7-10.
Jones, N. 2000. “Background to the validation of the ALTE Can Do Project and the revised Common European Framework”. Research Notes 2. 11-13.
Jones, N. 2001. “The ALTE Can Do Project and the role of measurement in constructing a proficiency framework”. Research Notes 5. 5-8.
Jones, N. 2002. “Relating the ALTE framework to the Common European Framework of Reference”. Common European Framework of Reference for Languages: learning, teaching, assessment. Case studies, Alderson, J.C. (ed). Strasbourg: Council of Europe. 167-183. Available online; www.coe.int/T/DG4/Portfolio/documents/case_studies_CEF.doc
Jones, N. 2009. “The classroom and the Common European Framework: towards a model for formative assessment”. Research Notes 36. 2-8.
Khalifa, H. and C. Weir. 2009. Examining Reading: research and practice in assessing second language reading, Studies in Language Testing 29. Cambridge: University of Cambridge ESOL Examinations and Cambridge University Press.
Kurteš, S., and N. Saville. 2008. “The English Profile Programme: An overview”. Research Notes 33. 2-4.
Little, D. 2006. “The Common European Framework of Reference for Languages: Contents, purpose, origin, reception and impact”. Language Teaching 39:3. 167-190.
Little, D. 2007. “The Common European Framework of Reference for Languages: Perspectives on the making of supranational language education policy”. Modern Language Journal 91. 645-655.
McCarthy, M. and N. Saville. 2009. “Profiling English in the real world: what learners and teachers can tell us about what they know”. Paper presented at American Association for Applied Linguistics Conference (Denver, Colorado, 21-24 March 2009).
Milanovic, M. 2009. “Cambridge ESOL and the CEFR”. Research Notes 37. 2-5.
Nicholls, D. 2003. “The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT”. Available online; http://ucrel.lancs.ac.uk/publications/CL2003/papers/nicholls.pdf


North, B., W. Martyniuk and J. Panthier. 2010. Introduction to Studies in Language testing 33. Cambridge: University of Cambridge ESOL Examinations and Cambridge University Press.
Parodi, T. 2008. “L2 morpho-syntax and learner strategies”. Paper presented at the Cambridge Institute for Language Research Seminar (Cambridge, UK, 8 December 2008).
Salamoura, A. 2008. “Aligning English Profile research data to the CEFR”. Research Notes 33. 5-7.
Salamoura, A. and N. Saville. 2009. “Criterial features across the CEFR levels: Evidence from the English Profile Programme”. Research Notes 37. 34-40.
Shaw, S. and C. Weir. 2007. Examining Writing: research and practice in assessing second language writing, Studies in Language Testing 26. Cambridge: University of Cambridge ESOL Examinations and Cambridge University Press.
Taylor, L. 2004. “Issues of test comparability”. Research Notes 15. 2-5.
Taylor, L. and N. Jones. 2006. “Cambridge ESOL exams and the Common European Framework of Reference (CEFR)”. Research Notes 24. 2-5.
Trim, J.L.M. 2009/2001. Breakthrough. Available online; www.englishprofile.org
Van Ek, J. and J.L.M. Trim. 1990a/1998a. Threshold 1990. Cambridge: Cambridge University Press.
Van Ek, J. and J.L.M. Trim. 1990b/1998b. Waystage 1990. Cambridge: Cambridge University Press.
Van Ek, J. and J.L.M. Trim. 2001. Vantage. Cambridge: Cambridge University Press.

International Corpus of Crosslinguistic Interlanguage: Project Overview and a Case Study on the Acquisition of New Verb Co-occurrence Patterns

Yukio TONO

1. Introduction

The goal of this chapter is twofold. First, it gives a brief overview of the International Corpus of Crosslinguistic Interlanguage (ICCI) project and the design criteria used to build the corpus. Second, it presents results from a case study that uses the ICCI to investigate the acquisition of new verb co-occurrence patterns at different stages of interlanguage (IL) development.

The ICCI project was launched in 2007 at the start of the five-year government-funded Global COE program at Tokyo University of Foreign Studies. The primary purpose of the ICCI project was to build a corpus of younger learners of English with beginner and lower-intermediate proficiency levels. The focus on younger learners stems from the lack of beginner learners’ data in learner corpus research. Most studies using ICLE (Granger et al. 2009) and other major learner corpora only examine upper-intermediate or advanced learners, primarily because data collection is much easier for adults. Research using adult learners’ data has thus provided an incomplete picture of IL systems, focusing mainly on stylistic or pragmatic aspects of IL development. As a result, there is a lack of research on more basic lexical and syntactic development in the early stages.

This gap is especially prominent now that learner corpus research is being integrated into the larger project of identifying the language features that characterize foreign language development according to the Common European Framework of Reference (CEFR). It is difficult to describe the early stages of language use given the lack of data on beginner (A1 and A2 level) learners. For these reasons, the ICCI project can potentially make a significant contribution to the field.

2. The corpus design criteria of ICCI

Since the breakdown of the ICCI is detailed elsewhere (see Hong in this volume), only the basic principles of the corpus design criteria for our project are described here.


2.1. Target learner groups

The ICCI project is an international collaborative research project in which seven countries/regions have participated (Austria, China, Hong Kong, Israel, Poland, Spain, and Taiwan). In each country, written essay data by younger learners were collected. It was important for us to make the ICCI data comparable to the JEFLL Corpus (Tono 2007). The JEFLL Corpus comprises writings ranging from the first year of junior high school (year seven in the US education system) to the third year of senior high school (year twelve). Therefore, we aimed to collect data produced by novice learners in each country/region.

One of the difficulties we faced is that in most of the participating countries, English is taught in primary school with a focus on oral communication skills rather than writing. It was difficult to obtain written data from these younger learners, which forced us to collect data opportunistically at the earliest stage where written output was feasible and made available. Part of the data set was collected from as early as the third grade in primary schools, but this was limited to pupils in Hong Kong. Primarily, however, data were collected from the sixth year onwards.

Regarding the breakdown of the regions/countries, we planned for a minimum number of participants, i.e., approximately 100 from each school year (years six to twelve), totaling 700 files. We expected the corpus size to average around 70,000 to 100,000 words for each region/country. However, the corpus size turned out to be larger in the case of Austria, and smaller in the cases of Poland and Spain, because of slightly skewed sampling from lower or upper levels. The Israel team provided more than 2,000 electronic files and they were included in the corpus. The Polish team also contributed more than 2,000 files, but since they required manual transcription, only 751 files were processed for inclusion. In total, we have 6,700 files (533,924 running words), which is a reasonable size to compare against the JEFLL Corpus (10,038 students; 669,281 running words).

2.2. Writing tasks

Writing tasks were also designed to mirror the tasks used in the JEFLL Corpus. The JEFLL Corpus used six topics (three argumentative and three narrative/descriptive). Only four of these topics could be adopted for the ICCI project because the other topics were culture-specific and difficult to use in other countries. The four topics (food and money for argumentative tasks, and school events and funny stories for narrative tasks) were comparable to the JEFLL data, even though some customization was made to meet the local needs of each school. Overall, more than 2,000 essays were collected for each of the food and money topics, and around 1,000 for each of the school events and funny stories topics.


Following the JEFLL Corpus, the writing tasks were assigned as in-class timed essays. Twenty minutes were allotted for these assignments and students were not allowed to prepare or use dictionaries. There was time allowed for revisions and all the compositions were collected at the completion of the task. For primary school pupils, twenty-minute writing tasks could be too long; however, for comparison purposes, the same tasks were given to those younger learners as well.

2.3. Learner profile information

Descriptions of learner profiles are crucial for sub-classifying the corpus according to learner variables. Each learner was asked to provide at least the following information:

(a) Country/region
(b) Nationality
(c) Mother tongue
(d) Gender
(e) Grade (age)
(f) Years of English instruction
(g) Overseas experience (if any)
(h) External exam scores (if any)

The web query interface developed by Huaqing Hong (see Hong in this volume) makes it possible to create and compare subcorpora according to these learner variables.

2.4. Mark-up and annotations

For the moment, the ICCI data are available to project members in various formats suited to individual needs. The simplest version is in plain ASCII format, organized into directories by country and grade. The metadata are stored in an Excel file and kept separate from the main body. There is also a version with the header information at the top of each text, which makes it possible to perform subcorpus queries using commercial software such as WordSmith (Scott 2008) or MonoConc Pro (Barlow 2000). An XML version conforms to the XCES guidelines and is fully annotated for POS and lemma information. This version is especially useful when indexed and queried by Xaira (Burnard 2004). Finally, a vertical format of one word per line with POS, lemma, and USAS tags, processed by Wmatrix (Rayson 2009), is also available for further implementation into a database system.
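As a rough illustration of how the vertical format and the separately stored learner metadata might be used together, the sketch below reads one-token-per-line text and selects files by learner variables. The tab-separated column order (word, POS, lemma, USAS), the file names, the metadata fields, and the sample tags are all assumptions made for this example; they do not describe the actual ICCI files.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str
    usas: str


def parse_vertical(text: str) -> list[Token]:
    """Parse one-token-per-line text; columns are assumed to be tab-separated."""
    tokens = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines, e.g. between sentences
        word, pos, lemma, usas = line.split("\t")
        tokens.append(Token(word, pos, lemma, usas))
    return tokens


def select_files(metadata: dict, **criteria) -> list[str]:
    """Return the names of files whose metadata match every criterion."""
    return sorted(
        name
        for name, info in metadata.items()
        if all(info.get(key) == value for key, value in criteria.items())
    )


# Hypothetical sample text; the tags are illustrative only.
SAMPLE = "I\tPPIS1\ti\tZ8\nlike\tVV0\tlike\tE2+\nrice\tNN1\trice\tF1\n"

# Hypothetical learner metadata keyed by file name.
METADATA = {
    "hk_0001.txt": {"region": "Hong Kong", "grade": 6},
    "at_0001.txt": {"region": "Austria", "grade": 9},
}

tokens = parse_vertical(SAMPLE)
# Count lemmas of tokens whose POS tag marks a lexical verb (assumed "VV" prefix).
verb_lemmas = Counter(t.lemma for t in tokens if t.pos.startswith("VV"))
austrian_files = select_files(METADATA, region="Austria")
```

Keeping the metadata in a separate lookup, as in this sketch, mirrors the plain-text distribution described above, where learner variables are stored apart from the essays themselves.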


3. CEFR and learner corpora: a new research agenda

As we built our corpus, an interesting new research paradigm emerged in relation to the Common European Framework of Reference (CEFR). The CEFR is increasingly popular, not only in Europe but also in the rest of the world, as a generic framework for foreign language proficiency guidelines. Linking findings from learner corpora to the CEFR makes the results more readily accessible to language policy makers and practitioners. It is therefore important to investigate how learner corpus research should be linked to the CEFR. This chapter presents the rationale of such an approach as one of the ways in which the ICCI data are exploited.

3.1. What is the CEFR?

The Common European Framework of Reference (CEFR) is a descriptive scheme that can be used to analyze L2 learners’ needs, specify L2 learning goals, guide the development of L2 learning materials and activities, and provide orientation for the assessment of L2 learning outcomes (Little 2006: 167). The descriptive scheme has a vertical and a horizontal dimension. The vertical dimension uses ‘can do’ descriptors to define six levels of communicative proficiency in three bands (A1, A2: basic user; B1, B2: independent user; C1, C2: proficient user). There are also scales for listening and reading (reception), spoken production, written production, spoken interaction, and written interaction. The horizontal dimension is concerned with the learners’ communicative language competences and strategies.

Since the CEFR is language-independent, each country in Europe is now developing a set of Reference Level Descriptions (RLDs), i.e., a set of linguistic features that form the criteria for each CEFR level. The French publisher of the CEFR has begun producing a series of reference books, each of which is devoted to a single proficiency level in French (Beacco et al. 2004, 2006). A project with German, Swiss, and Austrian funding has developed Profile deutsch, an interactive CD-ROM that presents the CEFR in German together with a functional-notional resource and functional and systematic treatments of German grammar, among others (Glaboniat et al. 2005). In the case of English, the English Profile (EP) is responsible for RLD development (see Saville in this volume). One unique feature of the EP program is the use of a corpus-driven approach to identify “criterial features” of English for each CEFR level.

3.2. Criterial features and LCR

Salamoura and Saville (2009) defined a “criterial feature” as follows:


A ‘criterial feature’ is one whose use varies according to the level achieved and thus can serve as a basis for the estimation of a language learner’s proficiency level. So far the various EP research strands have identified the following kinds of linguistic feature whose use or non-use, accuracy of use or frequency of use may be criterial: lexical/semantic, morpho-syntactic/syntactic, functional, notional, discourse, and pragmatic. (Salamoura and Saville 2009: 34)

What is unique about their project is that they look for criterial features by analyzing learner corpora with the CEFR level classifications. Hawkins and Buttery (2010), for example, have identified four types of features that may be criterial for distinguishing between CEFR levels. Table 1 shows the classifications.

Table 1. Possible criterial feature types

Type of feature: Acquired/Learnt language features
Description: Correct properties of English that are acquired at a certain L2 level and that generally persist at all higher levels, e.g., property P acquired at B2 may differentiate [B2, C1 and C2] from [A1, A2 and B1] and will be criterial for the former.

Type of feature: Developing language features
Description: Incorrect properties or errors that occur at a certain level or levels with characteristic frequencies. Both the presence versus absence of the errors, and the characteristic frequency of error can be criterial for the given level or levels, e.g., error property P with a characteristic frequency F may be criterial for [B1 and B2].

Type of feature: Acquired/Native-like usage distributions of a correct feature
Description: Positive usage distributions for a correct property of L2 that match the distribution of native speakers (i.e., L1 users). The positive usage distribution may be acquired at a certain level and will generally persist at all higher levels and remain criterial for the relevant levels.

Type of feature: Developing/Non-native-like usage distributions of a correct feature
Description: Negative usage distributions for a correct property of L2 that do not match the distribution of native speakers (i.e., L1 users). The negative usage distribution may occur at a certain level or levels with a characteristic frequency F and remain criterial for the relevant level(s).

The EP researchers have done preliminary studies with regard to the criterial features, using the Cambridge Learner Corpus (CLC) (Williams 2007; Parodi 2008; Hendriks 2008; Hawkins and Buttery 2010; Filipovic 2009). The CLC currently comprises approximately forty-five million words of written learner data, roughly half of which is coded for errors. It has also been parsed using the Robust Accurate Statistical Parser (RASP) (Briscoe, Carroll, and Watson 2006). As these reports show, the CLC mainly covers the A2 level and above, which is why the EP researchers started to build a new corpus, the Cambridge English Profile Corpus (CEPC), that focuses on lower-proficiency students’ writing and speech. The CEPC is not yet available and, despite the researchers’ initial goals, it consists largely of university students’ data. The fact therefore remains that they still lack data from younger learners at the novice and beginning stages of learning.

Considering the sheer size of the CLC with its error annotations and the framework of the CEFR, however, the EP program will surely create a new field in learner corpus research. Those who are interested in using learner corpora in SLA research can compare their findings with the EP research on criterial features. Those who are involved in syllabus and materials design will find the RLDs for English very informative once those items are actually identified. Test developers will be able to use the EP research results to improve test design and contents. Since the goal of the ICCI is quite similar to that of the EP researchers, and since we have the advantage of having more data from younger learners, it would be extremely valuable to pursue the same research questions and validate findings against each other. Research using the ICCI can also contribute to the identification of lexical and grammatical criterial features, thus improving reference level descriptions. In the next section, I present a report on a case study that uses the ICCI to identify new verb co-occurrence patterns as criterial features.

3.3. New verb co-occurrence patterns as criterial features

While the influence of the EP program will definitely increase in learner corpus research, it is also necessary to have an objective method of verifying its claims and findings. One challenging aspect of the EP program is that the CLC is not publicly available; rather, it is reserved for in-house use by staff at Cambridge ESOL and Cambridge University Press for test and materials development. Yet it is possible to verify the EP findings against comparable learner corpora. In this chapter, I report the findings of a validation study that uses the ICCI data to examine one of the proposed criterial features of CEFR levels in English, namely the new verb co-occurrence patterns identified by Williams (2007).


3.3.1. Williams’s (2007) study

Williams (2007) extracted verb co-occurrence patterns from the syntactically annotated version of the CLC. She found that there is a clear progression in the data from A2 to B2 in the appearance of new verb co-occurrence frames. Table 2 shows a portion of her findings. Since the primary focus in this study is the beginning level, all the verb co-occurrence patterns assigned to A2 are listed in Table 2. For B1 and above, only the major patterns are listed. For the full list, see Hawkins and Buttery (2010: 12-13). Hawkins’ original hypothesis was that there is a progression from A2 to C2, but Williams found no evidence for new verb co-occurrence frames at the C levels. It appears that these basic constructions of English have been acquired by B2. Hawkins suggested that a more subtle kind of analysis is required in order to capture progress at the C levels (ibid: 12).

Table 2. New verb co-occurrence frames in different CEFR levels (based on Williams 2007)

Verb co-occurrence frames | Examples | CEFR level
NP-V | He went | A2
NP-V (reciprocal Subj) | They met | A2
NP-V-PP | They apologized [to him] | A2
NP-V-NP | He loved her | A2
NP-V-Part-NP | She looked up [the number] | A2
NP-V-NP-Part | She looked [the number] up | A2
NP-V-NP-PP | She added [the flowers] [to the bouquet] | A2
NP-V-NP-PP (P = for) | She bought [a book] [for him] | A2
NP-V-V(+ing) | His hair needs combing | A2
NP-V-VPinfinitival (Subj Control) | I wanted to play | A2
NP-V-S | They thought [that he was always late] | A2
NP-V-NP-NP | She asked him [his name] | B1
NP-V-VPinfin (Wh-move) | He explained [how to do it] | B1
NP-V-S (Wh-move) | He asked [how she did it] | B1
NP-V-P-S (whether = Wh-move) | He thought about [whether he wanted to go] | B1
NP-V-NP-AdjP (Obj Control) | He painted [the car] red | B2
NP-V-NP-as-NP (Obj Control) | I sent him as [a messenger] | B2
NP-V-NP-S | He told [the audience] [that he was leaving] | B2
NP-V-P-VPinfin (Wh-move) (Subj Control) | He thought about [what to do] | B2

Another notable finding by Williams is that the progression from A2 to B2 correlates with the frequencies of these co-occurrence frames in the British National Corpus (BNC). In other words, learners first acquire the more frequent frames used by English native speakers and later they progressively acquire less frequent frames. Table 3 shows the average token frequencies for the verb co-occurrences found by Williams, and their average frequency ranking, in a number of corpora including the BNC.

Table 3. Frequencies for verb co-occurrence frames in English corpora (including BNC)

Average token frequencies in the BNC etc. for the verb co-occurrence frames appearing at each learner level
A2 | B1 | B2/C1/C2
1,041,634 | 38,174 | 27,615

Average frequency ranking in the BNC etc. for the verb co-occurrence frames appearing at each learner level
A2 | B1 | B2/C1/C2
8.2 | 38.6 | 55.6

Williams’s findings suggest that the verb complementation patterns that emerge in the interlanguage serve as good candidates for criterial features of CEFR levels. As shown in Table 2, however, she could not identify A1-level criterial features since the CLC did not contain data from A1-level learners. Therefore, it is necessary to confirm whether some of the basic constructions such as NP-V or NP-V-NP should appear earlier than the A2 level. In the next section, I discuss how the ICCI data set was exploited to validate Williams’s findings.

4. Method

4.1. Research questions

The following research questions were formulated for this study.

(1) Are there any A2 verb co-occurrence patterns in Williams’ list that should be better defined as being criterial for the A1 level?
(2) Are there any A2 verb co-occurrence patterns in Williams’ list that can better serve as criterial features for the B1 level?

These two questions were raised because the CLC only covers learners at the A2 level and beyond, which leaves open the possibility that some constructions in fact appear much earlier. In addition, Williams’ list contains relatively complex structures (e.g., V-NP-Part) among the A2 criterial features that usually appear much later in the school curriculum. Thus, there is a possibility that some of the constructions classified as A2 criterial features should be moved to much later stages.


4.2. Procedure

4.2.1. Subcorpora based on CEFR

In order to extract verb co-occurrence patterns from ICCI and JEFLL, it was necessary to classify texts into subcorpora according to CEFR levels. To this end, all the files were processed and grouped together according to the average text length. Since no CEFR level was identified for each essay, a measure of total text length was used to re-classify the essays. This was done because the essay tasks in ICCI and JEFLL were controlled in terms of time allotment (twenty minutes), and text length is one of the strong predictors of proficiency levels as well as readability (cf. Pitler & Nenkova 2008). Interestingly, Cambridge KET (for A2) and PET (for B1) require writing tasks to be completed in twenty minutes with a minimum text length of 35 words and 100 words, respectively. Table 4 shows the breakdown of subcorpora from ICCI and JEFLL according to the text length.

Table 4. CEFR-level subcorpora of ICCI and JEFLL

Test | Writing spec. | CEFR-level | No. of files | Average text length
n/a | n/a | A1 | 1341 | 35.46
KET | 35 words (20 min) | A2 | 12042 | 64.41
PET | 100 words (20 min) | B1 | 3063 | 117.63
FCE | not specified (1 hour) | B2 | 297 | 164.51

The classification was made by obtaining the average text length for each grade in each country/region. JEFLL has six grades (years seven to twelve). The average text length generally increased from grade to grade up to the 11th grade (see Table 5). The average text length for the 12th grade, however, is lower than that for the 11th grade. This finding is related to the fact that the 12th grade consists of more private and public school students, whose proficiency levels tend to be lower than those of the students attending schools attached to national universities. Since the 11th grade subcorpus has more national school students, we decided to assign the 12th grade to A2 and the 11th grade to B1.

Table 5. The average text length of the JEFLL subcorpora

Grade | Token | Ave. text length | Assigned CEFR level
7 | 51149 | 36.72 | A1
8 | 159736 | 60.62 | A2
9 | 117766 | 74.11 | A2
10 | 91096 | 72.59 | A2
11 | 170555 | 86.27 | B1
12 | 78979 | 66.43 | A2

Regarding the boundaries of A1 and A2, it was found that in most countries/regions, there was a divide between novice learners and more advanced learner groups. The former produced approximately 30-40 running words on average and the latter produced at least 60-70 words. This finding fits quite nicely with the instructional guideline proposed by KET (A2 level), which requires at least thirty-five words in twenty-minute writing tasks. Since the next level, PET (B1 level), requires 100 words in twenty minutes, the students who can write more than 35 words and fewer than 100 words should belong to A2. Thus, those students who produced an average of 60-70 words were classified into the A2 level.

4.2.2. Extraction of verb complementation patterns

All the data for JEFLL and ICCI were tagged and parsed using RASP (cf. 3.2). While there are several different output formats available in RASP, the most popular output, which resembles the Penn Treebank format, was used. The following is a sample output for the sentence “My favorite food is KimChi.”

    (|My:1_APP$| |favorite:2_JJ| |food:3_NN1| |be+s:4_VBZ| |KimChi:5_NP1| |.:6_.|) 1 ; (-7.664)
    upenn: 1
    (TOP (S (NP (APP$ My:1) (JJ favorite:2) (NN1 food:3)) (VP (VBZ be+s:4) (NP1 KimChi:5))) (. .:6))
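To make the structure of this bracketed output concrete, the following sketch reads the tree into nested lists and lists, for each VP, the verb together with the labels of the constituents that follow it. It is a plain-Python stand-in written for illustration, not the Tregex-based procedure used in the study, and the function names `parse_tree` and `vp_frames` are hypothetical.

```python
import re

# The tree line from the sample output above.
SAMPLE = ("(TOP (S (NP (APP$ My:1) (JJ favorite:2) (NN1 food:3)) "
          "(VP (VBZ be+s:4) (NP1 KimChi:5))) (. .:6))")


def parse_tree(s: str):
    """Parse a bracketed tree string into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                    # consume "("
            node = [tokens[pos]]        # the first symbol is the node label
            pos += 1
            while tokens[pos] != ")":
                node.append(read())
            pos += 1                    # consume ")"
            return node
        leaf = tokens[pos]              # a word such as "food:3"
        pos += 1
        return leaf

    return read()


def vp_frames(node, found=None):
    """Collect (verb, [labels of its following sisters]) for every VP."""
    if found is None:
        found = []
    if isinstance(node, list):
        label, children = node[0], node[1:]
        if label == "VP":
            kids = [c for c in children if isinstance(c, list)]
            # Only handle VPs whose first daughter is a verb preterminal.
            if kids and kids[0][0].startswith("V") and isinstance(kids[0][1], str):
                verb = kids[0][1].split(":")[0]   # drop the token index
                found.append((verb, [k[0] for k in kids[1:]]))
        for child in children:
            vp_frames(child, found)
    return found


tree = parse_tree(SAMPLE)
frames = vp_frames(tree)
```

For the sample sentence, `frames` holds a single entry pairing the verb `be+s` with one following sister labelled `NP1`. Looking at the full list of sisters, rather than at a single daughter, is what allows a frame with one object to be told apart from a frame with two.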

In order to extract verb complementation patterns, Tregex (Levy and Andrew 2006) was used. It was difficult to identify syntactic patterns of entire sentences because a pattern, e.g., [V-NP], might occur recursively in various positions in a sentence. Instead, all the fragments containing specified patterns from every candidate clause or sentence were extracted. The patterns [V-NP] and [V-NP-NP] were distinguished by specifying the presence/absence of sister elements of the NP in preceding/following positions. Table 6 shows the verb complementation patterns found by Williams (2007) as criterial features for the A2 level and their associated Tregex syntax. Figure 1 shows the output of Tregex using its graphical interface. The precision of the extraction was checked by randomly sampling the matches. The overall precision of RASP was 65-70% for raw learner data, which contains various misspellings and foreign words (e.g., people and place names). This made the precision of the matches slightly lower than normal (approximately 70%).

4.2.3. Data analysis

First, the corpus size of each subcorpus was adjusted to 50,000 words and all the instances of each verb complementation pattern were


Table 6. Verb complementation patterns and corresponding Tregex queries

Verb frames: NP-V; NP-V (reciprocal Subj); NP-V-PP; NP-V-NP; NP-V-Part-NP; NP-V-NP-Part; NP-V-NP-PP; NP-V-NP-PP (P = for); NP-V-V (+ing); NP-V-VPinfinitival (Subj Control); NP-V-S
Tregex syntax (fragments as preserved): VP > S … S … S … S … (V $++ NP) … VP … S … S … S < (/VV.*/ . (TO . VV0)) … VP > S …

Freq | Exp | Cont. chisq | Obs-exp | PHolm | Dec | Q
… | … | … | … | … | *** | 0.023
… | … | … | … | … | *** | 0.021
… | … | … | … | … | *** | 0.009
… | … | … | … | 2.70E-61 | *** | 0.008
296 | 251.98 | 7.79 | > | 3.63E-17 | *** | 0.006
164 | 195.92 | 5.33 | < | 4.67E-13 | *** | 0.005
472 | 440.61 | 2.21 | > | 1.88E-05 | *** | 0.005
363 | 389.11 | 1.75 | < | 0.000142 | *** | 0.004
152 | 182.18 | 5.13 | < | 1.39E-12 | *** | 0.004
1617 | 1600.97 | 0.15 | > | 0.567705 | ns | 0.003


.000; Q = 0.021). This means that the NP-V-NP pattern tends to occur significantly more often at the A1 level than at the B1 level. On the other hand, the pattern NP-V-S showed the opposite tendency. It occurred significantly more frequently in the B1 subcorpora (χ2 = 32.62; p < .001; Q = 0.008), while it turned out to be an antitype for A1 (χ2 = 56.29; p < .001; Q = 0.009). The findings suggest that a canonical structure such as NP-V-NP is more often used at the beginning level, whereas a construction such as NP-V-S is typically used at more advanced levels.

Table 8 summarizes the results of HCFA for the variable configuration (verb frames) x (A1). Please note that all of these verb co-occurrence frames are classified as A2-level criterial features by Williams (2007). Most verb frames were found to be antitypes for the A1 level, which indicates that they do not occur as often as expected. Thus, most items with the symbol “<” [...]

Table 8. The results of HCFA for the A1 level

Verb frames | CEFR | Freq | Exp | Cont. chisq | Obs-exp | PHolm | Dec | Q
NP-V-NP | A1 | … | … | … | > | 2.71E-40 | *** | 0.023
NP-V-S | A1 | 7 | 68.86 | 56.29 | < | 9.75E-20 | *** | 0.009
NP-V-Vinf | A1 | 164 | 195.92 | 5.33 | < | 4.67E-13 | *** | 0.005
intrans | A1 | 13 | 8.83 | 2.07 | > | 0.000240 | *** | 0.001
NP-V-NP-Part | A1 | 1 | 7.39 | 3.62 | < | 3.42E-25 | *** | 0.001
NP-V-NP-PP | A1 | 61 | 71.82 | 1.51 | < | 0.000613 | *** | 0.001
NP-V-NP-PP(for) | A1 | 15 | 21.57 | 2.00 | < | 2.98E-05 | *** | 0.001
NP-V-Part-NP | A1 | 5 | 12.26 | 3.84 | < | 7.76E-11 | *** | 0.001
NP-V-V(+ing) | A1 | 27 | 33.80 | 1.41 | < | 0.000951 | *** | 0.001
NP-V-PP | A1 | 293 | 302.54 | 0.28 | < | 0.389626 | ns | 0.001
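The relation between the Freq, Exp, Cont. chisq, and Obs-exp columns can be sketched as follows, on the assumption that each contribution is the usual Pearson term (O - E)^2 / E and that “>” and “<” mark observed frequencies above and below expectation. The source does not spell out the computation, so this is an illustration rather than the procedure actually used; the three rows are copied from Table 8.

```python
def contribution(observed: float, expected: float) -> float:
    """Pearson chi-square contribution of a single cell: (O - E)^2 / E."""
    return (observed - expected) ** 2 / expected


def direction(observed: float, expected: float) -> str:
    """'>' when a cell is over-represented (type), '<' when under-represented."""
    return ">" if observed > expected else "<"


# frame: (Freq, Exp), copied from Table 8 (A1 level)
ROWS = {
    "NP-V-S": (7, 68.86),
    "intrans": (13, 8.83),
    "NP-V-PP": (293, 302.54),
}

for frame, (obs, exp) in ROWS.items():
    print(f"{frame:10s} {direction(obs, exp)} {contribution(obs, exp):6.2f}")
```

Computed this way, the values come out close to, though not identical with, the Cont. chisq column, which may reflect rounding of the expected frequencies or details of the HCFA implementation that are not documented here.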

Table 9. The results of HCFA for the A2 level

Verb frames | CEFR | Freq | Exp | Cont. chisq | Obs-exp | PHolm | Dec | Q
NP-V-PP | A2 | 472 | 440.61 | 2.21 | > | 1.88E-05 | *** | 0.005
NP-V-NP | A2 | 1617 | 1600.97 | 0.15 | > | 0.567705 | ns | 0.003
NP-V-Vinf | A2 | 267 | 285.33 | 1.19 | < | 0.002737 | ** | 0.003
intrans | A2 | 9 | 12.86 | 1.22 | < | 0.002097 | ** | 0.001
NP-V-NP-Part | A2 | 4 | 10.77 | 3.77 | < | 7.85E-11 | *** | 0.001
NP-V-Part-NP | A2 | 12 | 17.85 | 1.79 | < | 9.58E-05 | *** | 0.001
NP-V-S | A2 | 95 | 100.28 | 0.24 | < | 0.491867 | ns | 0.001
NP-V-V(+ing) | A2 | 45 | 49.23 | 0.43 | < | 0.198904 | ns | 0.001
NP-V-NP-PP | A2 | 104 | 104.60 | 0.03 | < | 0.867857 | ns | 0
NP-V-NP-PP(for) | A2 | 29 | 31.41 | 0.17 | < | 0.495903 | ns | 0

Table 10. The results of HCFA for the B1 level

Verb frames | CEFR | Freq | Exp | Cont. chisq | Obs-exp | PHolm | Dec | Q
NP-V-NP | B1 | 1292 | 1413.84 | 10.59 | < | 6.24E-30 | *** | 0.021
NP-V-S | B1 | 142 | 88.56 | 32.62 | > | 2.70E-61 | *** | 0.008
NP-V-Vinf | B1 | 296 | 251.98 | 7.80 | > | 3.63E-17 | *** | 0.006
NP-V-PP | B1 | 363 | 389.11 | 1.75 | < | 0.000142 | *** | 0.004
NP-V-NP-PP | B1 | 104 | 92.37 | 1.44 | > | 0.001519 | ** | 0.002
NP-V-V(+ing) | B1 | 55 | 43.47 | 3.11 | > | 1.06E-06 | *** | 0.002
NP-V-NP-Part | B1 | 20 | 9.51 | 11.35 | > | 3.47E-19 | *** | 0.001
NP-V-NP-PP(for) | B1 | 37 | 27.74 | 3.09 | > | 1.56E-06 | *** | 0.001
NP-V-Part-NP | B1 | 25 | 15.76 | 5.18 | > | 4.94E-10 | *** | 0.001
intrans | B1 | 10 | 11.35 | 1.86 | < | 0.549782 | ns | 0
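The PHolm column presumably reports p-values adjusted for multiple testing with Holm's step-down method, and the Dec column the resulting significance decision; the source does not describe either step, so both readings are assumptions. Under those assumptions, a minimal sketch of the adjustment and of a star coding consistent with the rows above (*** below .001, ** below .01, * below .05, otherwise ns) might look like this; the raw p-values in the example are invented.

```python
def holm_adjust(pvalues: list[float]) -> list[float]:
    """Holm step-down adjustment; results are returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        value = min(1.0, (m - rank) * pvalues[i])
        # Enforce monotonicity: an adjusted p never falls below an earlier one.
        running_max = max(running_max, value)
        adjusted[i] = running_max
    return adjusted


def stars(p: float) -> str:
    """Significance coding assumed for the Dec column."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return "ns"


raw = [0.01, 0.04, 0.03]          # invented raw p-values for illustration
adjusted = holm_adjust(raw)       # approximately [0.03, 0.06, 0.06]
decisions = [stars(p) for p in adjusted]
```

Applied to values taken from Tables 9 and 10, `stars(0.002737)` gives `**` and `stars(0.567705)` gives `ns`, which agrees with the Dec entries of the corresponding rows.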

that the pattern “intrans” was an antitype for A2. This finding supports the previous results presented in Table 8 that the intransitive use is more criterial for the A1 level. The verb frames involving particles (NP-V-Part-NP and NP-V-NP-Part) were found to be antitypes for A2. Although Williams (2007) claims that these frames will also serve as criterial features for A2, my results show that it is not the case. Table 10 shows the results of HCFA for B1, which support the claim that constructions involving particles are more likely to be criterial for the B1 level. The results suggest that the verb frames involving particle constructions were much more frequent in B1 corpora. Additionally, the frames which are found to be types for B1 (e.g., NP-V-S, NP-V-Vinf, and NP-V-V(+ing)) are likely to be more criterial for B1, even though the constructions themselves will have already appeared at the A2 level.


Table 11. Revision of A2-level verb co-occurrence frames by Williams (2007)

Verb frames | Williams (2007) | The present study
NP-V | A2 | A1
NP-V (reciprocal Subj) | A2 | A1
NP-V-NP | A2 | A1
NP-V-PP | A2 | A2
NP-V-NP-PP | A2 | A2
NP-V-NP-PP (P = for) | A2 | A2
NP-V-V(+ing) | A2 | A2
NP-V-VPinfinitival (Subj Control) | A2 | A2
NP-V-S | A2 | A2
NP-V-Part-NP | A2 | B1
NP-V-NP-Part | A2 | B1

Table 11 summarizes a revision of the list of new verb co-occurrence frames as criterial features. The results show that when tested against younger learners’ data, some of the verb frames (NP-V, NP-V-NP, and “intrans”) originally assigned to A2 criterial features are actually better candidates for A1 criterial features. On the other hand, the results also show that some frames, such as NP-V-Part-NP and NP-V-NP-Part, should serve as criterial features for B1 instead of A2. While it is difficult to determine exactly the point at which linguistic features start to appear in learners’ output, we should take into account the relative significance of features occurring significantly more frequently at a given CEFR stage than at the other levels. The HCFA provides interesting possibilities for evaluating the contribution of each variable configuration to the entire model, and thus it suits the purpose of the present study very well.

6. Discussion

6.1. The contribution of the ICCI data

So far, I have shown how the ICCI data can shed some light on the nature of interlanguage in terms of criterial features for CEFR levels. By looking at the data produced by younger learners, it is possible to describe in detail the very beginning stage of learning/acquisition. Most learner corpora available to date were collected from intermediate and advanced learners of English, which tends to make researchers focus on the characteristics of interlanguage that intermediate or advanced learners find problematic, such as stylistic or pragmatic aspects. For beginning-level learners, however, different issues need to be addressed. They have difficulties with more basic syntactic and lexico-grammatical problems, such as verb complementation patterns, complex noun and adverbial phrases, and sentence alternation patterns, all of which involve form-function mapping of lexis and grammar between L1 and L2.

The A1 and A2 levels of the CEFR consist of a core component of English usually taught at school, and it is crucial to describe in detail what language is taught and in what order. The English Profile Programme is currently working on identifying concrete language items to serve as criterial features for different CEFR levels by using the CLC. As their findings from the CLC start to appear, it is crucial that we are able to review and validate the results. The description of the A1 level is difficult with the CLC data so far, and the availability of the ICCI data will surely contribute to a better understanding of the A1 and A2 levels. This replication study of Williams (2007) shows that the ICCI data improved the classification made by Williams in several ways; some verb frames originally assigned to A2 should be moved to A1-level criterial features. The CLC did not seem to provide sufficient data to distinguish between A1 and A2-level learners. In addition, the verb frames involving particles better served as criteria for B1, which is intuitively more natural [1] when they are compared with other, more basic structures in the list of A2 criterial features, such as NP-V-PP.

6.2. The direction of research into CEFR criterial features

There are some implications and issues about the ICCI project and the case study presented here. First, while it is true that the CLC is larger than our corpus and therefore perceived to be more reliable, I argue that the CLC is not balanced in terms of CEFR levels, i.e., it is skewed toward higher proficiency level learners. Consequently, the claims made using the CLC should be more carefully scrutinized, especially regarding the beginning levels. It is often assumed that the bigger the corpus, the better, but skewed data will only produce skewed results.
As corpus linguists, we sometimes need to play the role of “watchdog” to examine the validity of claims. To do this, specialized learner corpora like the ICCI have great potential.

[1] Interestingly, the NP-VP-Part construction (e.g., “I got up at 7”) was not included in Williams’ list. This pattern happened to occur very frequently at the A1 level and could be a good criterial feature for A1.

Second, since the CLC is a corpus available only for in-house use at Cambridge University Press and Cambridge ESOL, we cannot check the data directly. In this case, we can only depend on the description of data analysis in a given paper. In my case study, I tried to replicate what Williams (2007) did in her work. The analysis procedure was sometimes poorly documented and unclear. Williams (2007), for instance, did not specify any descriptive statistics from the CLC that formed the basis of the categorizations of verb frames. There needs to be a formal procedure for setting the threshold that determines whether a particular verb co-occurrence pattern is “new” at a given CEFR level. This is crucial because entire arguments will be based on “criterial features” and it would be very confusing if the method of selecting criterial features differed from study to study. Thus, it is important to formalize the procedure of extracting and determining some language features as criterial. In this sense, I hope that the learner corpus researchers who are interested in this research agenda combine their collective knowledge and expertise to make the analysis procedure as explicit as possible.

Third, as more criterial features are identified, a dilemma will emerge of how to decide the relative weights of individual criterial features. Some features should be more fundamental than others; some features are more peripheral. We are currently creating lists of those features, but sooner or later we need to know which criterial features are crucial in designing syllabi or language tests. So far, we lack a formal procedure to do so. In terms of the Reference Level Descriptions, a simple list of criterial features would suffice. In order to use the list for effective curriculum planning or assessment, however, information about priority among criterial features would have to be included. In the future, more research should be done in this direction.

6.3. Methodological implications

There are a few methodological implications, as well as some cautionary notes, to make. First, the accuracy rate of RASP could be problematic in interpreting the results. The average accuracy rate for learner data is around 60-70%, much lower than that for native speakers’ data.
This may not be a problem in this particular study because both CLC and ICCI/JEFLL were parsed with RASP without any post-editing, which means the quality of the baseline data was the same. However, I found that RASP could not retrieve all the instances of S-V-S constructions, for example. RASP + Tgrep identified seven constructions from sampled corpora; however, a simple concordancing of “think” produced more than thirty results. Many of them were erroneous sentences such as the following.

a) I think buy mane games to the compiuter. (is06_379)
b) ... becuze he thinked theres problem with ghost in the house. (is06_237)
c) I don’t no what I am do, I think give you 10,200$, it’s ok? (is06_115)

This means that learners attempt to use the verb “think,” but cannot produce target-like utterances. The results in this study stem from only correctly parsed portions of the data, which were sufficient to achieve the aims of the present study. Yet we should be aware that the analysis of interlanguage systems becomes extremely difficult at the beginning stage of acquisition, due to so many non-target productions. Using a parser that is trained only on well-formed sentences might not be the best solution for that sort of data. In order to capture the entire picture of interlanguage systems, it may be necessary to look at both correct and incorrect use of language, and to use text analysis tools that can handle non-target productions.

Secondly, the HCFA turned out to be very useful in evaluating the relative influences of different variable configurations, but it should be noted that this was not the approach taken by Williams (2007). Methodologically, Poisson regression could be an alternative,[2] and we should discuss what standard statistical procedure could determine whether some linguistic features are criterial for a particular CEFR level. This is a very interesting question and merits special attention in future research.

Finally, some portions of the ICCI data need to be more balanced. In particular, distributions across countries/regions by school years are not well balanced, so findings should be carefully interpreted, especially when comparisons are made across grades by countries/regions. In the present study, no comparison was made across countries/regions. The effects of tasks, especially essay topics, could be strong when investigating the use of vocabulary. In particular, words related to “food” and “money” are more frequent in the ICCI data than in ordinary texts. The classification of CEFR-level subcorpora was made only according to total text length. We cannot deny, however, the possibility that some students in the samples who had very high proficiency in English did not do the task seriously, resulting in shorter texts.
Overall, the chances are slim, but some motivational factors could affect the quality of the tasks, which is different from exam situations like the Cambridge exams.

[2] Stefan T. Gries (personal communication)

7. Conclusions

This paper has introduced the ICCI project and has shown how corpora of younger learners could be exploited in the context of CEFR-related learner corpus research. Considering the lack of beginning level learners’ data, our corpus will be a valuable addition to the existing learner corpus resources. The present study has also shown that the analysis of the ICCI could shed light on the nature of interlanguage at early stages, which provides relevant data for those who are involved in teaching English in primary and secondary education. There is a need to set objective procedures and criteria for identifying and validating criterial features for CEFR levels using learner corpora. I hope that the ICCI will contribute to rigorous research in this growing new field.

References

Barlow, M. 2000. Monoconc Pro (Version 2). Houston: Athelstan.
Beacco, J.-C., S. Bouquet and R. Porquier (eds). 2004. Niveau B2 pour le français (utilisateur/apprenant indépendant): Textes et références. Paris: Didier.
Beacco, J.-C., M. de Ferrari and G. Lhote (eds). 2006. Niveau A1.1 pour le français: Référentiel et certification (DILF) pour les premiers acquis en français. Paris: Didier.
Briscoe, E., J. Carroll and R. Watson. 2006. “The second release of the RASP System”. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions (Sydney, Australia). Available online at http://acl.ldc.upenn.edu/P/P06/P06-4020.pdf
Burnard, L. 2004. “BNC-Baby and Xaira”. TALC 2004: Proceedings of the Sixth Teaching and Language Corpora conference, Granada. 84-85.
von Eye, A. 1990. Introduction to configural frequency analysis: the search for types and antitypes in cross-classifications. Cambridge: Cambridge University Press.
Filipovic, L. 2009. English Profile - Interim report. Internal Cambridge ESOL report, April 2009.
Glaboniat, M., M. Müller, P. Rusch, H. Schmitz and L. Wertenschlag. 2005. Profile deutsch. CD-ROM version 2.0 and accompanying manual. Berlin: Langenscheidt.
Granger, S., E. Dagneaux, F. Meunier and M. Paquot. 2009. International Corpus of Learner English. Version 2. Louvain-la-Neuve: Presses universitaires de Louvain.
Gries, S.T. 2009. Statistics for Linguistics with R. Walter de Gruyter.
Hawkins, J.A. and P. Buttery. 2010. “Criterial features in learner corpora: Theory and illustration”. English Profile Journal 1: e5. (DOI: 10.1017/S2041536210000103)
Hendriks, H. 2008. “Presenting the English Profile Programme: In search of criterial features”. Research Notes 33. Cambridge: Cambridge ESOL. 7-10.
Levy, R. and G. Andrew. 2006. “Tregex and Tsurgeon: tools for querying and manipulating tree data structures”. 5th International Conference on Language Resources and Evaluation (LREC 2006). Available online at http://nlp.stanford.edu/pubs/levy_andrew_lrec2006.pdf
Little, D. 2006. “The Common European Framework of Reference for Languages: Content, purpose, origin, reception and impact”. Language Teaching 39. 167-190.
Parodi, T. 2008. “L2 morpho-syntax and learner strategies”. Paper presented at the Cambridge Institute for Language Research Seminar (Cambridge, UK, 8 December 2008).
Pitler, E. and A. Nenkova. 2008. “Revisiting readability: A unified framework for predicting text quality”. Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics. 186-195.
Rayson, P. 2009. Wmatrix: a web-based corpus processing environment. Computing Department, Lancaster University. http://ucrel.lancs.ac.uk/wmatrix/
Salamoura, A. and N. Saville. 2009. “Criterial features of English across the CEFR levels: evidence from the English Profile Programme”. Research Notes 37. 34-40.
Scott, M. 2008. WordSmith Tools version 5. Liverpool: Lexical Analysis Software.
Tono, Y. (ed). 2007. Nihonjin Chukousei 10000-nin no Eigo Corpus. (JEFLL Corpus: A Corpus of 10,000 Japanese EFL Learners). Tokyo: Shogakukan.
Tono, Y. 2010. “Learner corpus research: some recent trends”. Corpus, ICT, and Language Education, Weir, G. and S. Ishikawa (eds). Glasgow: University of Strathclyde Publishing. 7-17.
Williams, C. 2007. “A preliminary study into the verbal subcategorisation frame: Usage in the CLC”. RCEAL. Cambridge: University of Cambridge. Unpublished manuscript.

Compilation and Exploration of ICCI Corpus for Learner Language Research

Huaqing HONG

1. Introduction

The International Corpus of Crosslinguistic Interlanguage (ICCI; http://corpus.nie.edu.sg/icci/index.jsp) is an international joint project for building a learner corpus. It was initiated in 2007 by Prof. Yukio Tono of Tokyo University of Foreign Studies (TUFS), Japan, and started in 2008. The aim of the project is to compile corpora from the production data of young English learners from different proficiency levels and L1 backgrounds across the world. Currently, twelve scholars from nine countries/regions (Hong Kong, Austria, Israel, China, Japan, Poland, Singapore, Spain, and Taiwan) actively contribute to this project.

The ICCI is one of the research projects within the framework of the Global COE (G-COE) Program. The G-COE is a five-year government-funded project for promoting research through centers of excellence in specialized fields. TUFS has been identified as one such leading research institute in the field of linguistics and language education. The special theme of this COE program for TUFS is “corpus-based linguistics and language education,” in which three major disciplines, namely, field linguistics, corpus linguistics, and language education, are to be closely linked to each other to train researchers who are competent in doing research on various aspects of language from the integrated perspectives of these three fields.

This paper is part of a study aimed at making a cross-linguistic interlanguage corpus of the written performance of young EFL learners from seven EFL countries/regions, namely, Hong Kong, Israel, China, Japan, Poland, Spain, and Taiwan. It sets out to investigate how such a learner corpus can be designed and compiled in compliance with corpus linguistics standards for interlanguage research. Thus, a multidimensional approach, drawn mainly from some common corpus linguistics techniques, was followed to foreground the similarities and differences among the student participants from different cultural backgrounds. Some innovative features were also integrated into the corpus database to serve the research needs and diversified interests of the team members. It was primarily intended to be useful for researchers in the areas of applied linguistics. We also believe it can shed light on the integration of corpus techniques with EFL studies.

The creation of the ICCI corpus database consists of three basic steps: the compilation of the corpus, the processing of the data, and the exploration of the corpus. First, we discuss the main decisions taken regarding the compilation of the corpus and the way it would be internally structured. Then, we present the current state of development and integration of corpus processing tools. The integration of processing tools followed naturally from the compilation of the corpus itself, and we developed our own tools only when we could not find existing ones that could be easily adapted and integrated into our environment. The final section is devoted to explaining the exploration tools and the online query package that has been implemented according to the internal organization of the corpus.

2. Corpus compilation

The main objective of the ICCI database is to support the research activities of the members from different countries/regions. Although the team members had some knowledge about corpus linguistics and corpus tools, we did not expect them to spend too much time on processing the data to make them compatible with common corpus tools, such as WordSmith (Scott, 2008), AntConc (Anthony, 2006), Xaira (http://www.oucs.ox.ac.uk/rts/xaira/), etc. To this end, the ICCI database and its associated processing tools would need to provide the computational basis requested by the researchers. These common basics mainly include 1) providing ready data in various formats for further analysis with different tools, and 2) presenting a common platform that could be remotely accessed to pull out the data for different purposes, such as concordances, wide context, sub-corpus generation, and corpus comparison. Based on these requirements, a number of design criteria were set up and a specific internal organization was adopted.

2.1. Design criteria

Based on the objectives and requirements, a number of design criteria were set:

• Sampling of the texts: The current corpus texts were collected from seven European and Asian countries/regions (Austria, Hong Kong, Israel, China, Poland, Spain, and Taiwan). These texts were produced in 2009 and 2010 by over 7000 students in Grades 3 to 12. The texts were originally handwritten within twenty minutes in class, and were then scanned and transcribed to machine-readable texts. The thematic areas covered two genres, i.e. argumentative and descriptive, and over a dozen topics.
• Representativeness of the corpus: The corpus has to be as representative as possible. Of the more than 7000 texts collected, only the 6700 valid ones were included in the corpus. Although the data are not evenly distributed across country, gender, age, mother tongue, genre, and topic, we consider the corpus statistically valid and acceptable.
• Flexible adaptation: The corpus was compiled to be flexible enough to be easily adapted to meet the team members’ different research interests and requirements. Thus, a variety of data formats and a sophisticated query platform have to be provided.
• Technical implementation: The researchers are not expected to be involved in the actual technical implementation, so that they can focus primarily on the analysis. The technical implementation thus had to require a minimal learning curve from the researchers. Conversely, the technical process demanded a lot from the development team.

2.2. Corpus components

At present, the corpus comprises 6700 transcripts, totaling more than half a million words. The detailed distribution is presented in Tables 1 to 6. As shown in Tables 1 to 6, the corpus data were mainly categorized into two types based on 1) the students’ background information, such as the country/region (where they are currently located), mother tongue (home language), grade (age), and gender; and 2) the text information, i.e., the genre and topic of the texts that the students produced. We believe such student profile information would be useful for any analysis of the linguistic properties of the data.

Table 1. The distribution of the corpus data by Country

Country      Files   Tokens
Austria        769    98822
China          951    80678
Hong Kong      790    64596
Israel        2028   127705
Poland         751    56821
Spain          737    47647
Taiwan         674    57655
TOTAL         6700   533924

Table 2. The distribution of the corpus data by Grade

Grade        Files   Tokens
3              108     3784
4               46     3596
5              281    12431
6             1204    64096
7             1340    96274
8              956    83069
9              918    81951
10             740    71041
11             683    68868
12             412    47565
Unknown         12     1249
TOTAL         6700   533924

Table 3. The distribution of the corpus data by Gender

Gender       Files   Tokens
Female        1729   152488
Male          1711   128927
Unknown       3260   252509
TOTAL         6700   533924

Table 4. The distribution of the corpus data by Mother Tongue

Mother Tongue   Files   Tokens
Albanian            4      486
Amharic             5      288
Arabic              8     1011
Armenian            5      366
Bengali             1       67
Bosnian             4      450
Bulgarian           6      713
Chinese          2421   203931
Croatian            9     1112
English            29     3320
Finnish             1      132
French              2      298
German            609    79693
Greek               2      172
Hebrew           1867   114751
Hungarian           7      962
Hindi               4      642
Italian             3      459
Japanese            2      175
Korean              1       96
Malayalam           2      145
Malaysian           1       95
Persian             8      975
Polish            773    59172
Portuguese          3      334
Romanian            6      713
Russian           141    11503
Serbian            14     1462
Slovakian           3      402
Spanish           742    48121
Swedish             1      186
Tagalog             4      347
Turkish             8     1038
Ukrainian           1      125
Unknown             2      101
Urdu                1       81
TOTAL            6700   533924

Table 5. The distribution of the corpus data by Genre

Genre           Files   Tokens
Argumentative    2494   186375
Descriptive      4206   347549
TOTAL            6700   533924

Table 6. The distribution of the corpus data by Topic

Topic                   Files   Tokens
After School                1       39
Birthday                  200    18395
Breakfast Picture         272    12328
Describing Yourself        68     5426
Eating Out                 43     5065
Event or Funny Story      351    24760
Exciting Dream             74    10914
Film                      692    60815
Food                     1840   140297
Funny Story               510    32130
Good or Bad Day           181    30847
Important Event             1      116
Money                    2299   179677
Postcard                   65     5475
School                    102     7615
Schoolyard Picture          1       25
TOTAL                    6700   533924
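The breakdowns in Tables 1 to 6 are tallies of files and tokens per metadata category. As a minimal sketch of how such a tally could be computed, assuming each transcript carries a small metadata record, the Python below counts files and tokens for any chosen field. The `TextProfile` class, its field names, and the three sample records are hypothetical illustrations, not the ICCI database layout or real corpus data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TextProfile:
    """Per-file metadata; field names are illustrative, not the ICCI schema."""
    file_id: str
    country: str
    grade: str
    gender: str
    genre: str
    tokens: int


def distribution(profiles, field):
    """Return {category: (files, tokens)} for one metadata field."""
    files = Counter()
    tokens = Counter()
    for profile in profiles:
        key = getattr(profile, field)
        files[key] += 1          # one record corresponds to one file
        tokens[key] += profile.tokens
    return {key: (files[key], tokens[key]) for key in files}


# Invented toy records (assumption), used only to exercise the function.
sample = [
    TextProfile("t1", "Austria", "5", "Male", "Descriptive", 48),
    TextProfile("t2", "Austria", "6", "Female", "Argumentative", 120),
    TextProfile("t3", "Poland", "6", "Unknown", "Descriptive", 75),
]

by_country = distribution(sample, "country")
by_genre = distribution(sample, "genre")
```

Keeping files and tokens in the same pass mirrors the two columns reported in each table, and the same function serves every category because the field name is passed in as a parameter.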


3. Corpus processing

3.1. Pipeline procedure

In a corpus such as the one we describe, the data need to be processed into various formats through a number of pipelined procedures. These steps include the following main tasks:

• Collecting students’ written articles;
• Transcribing to plain texts;
• Adding header information;
• Converting to various formats;
• Tagging the texts using a POS tagger;
• Indexing the database;
• Building an online query package;
• Query and analysis.

Every task requires one or more specific pieces of software. To manipulate the data, we used any available tools that could be easily adapted to our working environment. Whenever possible we adapted already existing tools, such as EditPlus (http://www.editplus.com/), Wmatrix (Rayson, 2009), MMAX2 (http://mmax2.sourceforge.net/), and AppFuse 2 (http://appfuse.org/display/APF/Home). If there were no handy tools, we developed some utilities to facilitate the process. In the following sections, we discuss some of the stages of the processing, and briefly explain the main deliverables at each of these stages.

3.2. Lexical corpus

The lexical corpus was built by manipulating the raw texts of the data. The collected hard copies of the written essays were scanned and saved as PDF files, and stored for later reference. A group of student assistants was asked to transcribe them to plain text files. The transcripts were then further formatted with certain mark-ups according to the SGML standard. The mark-ups were inserted with the predefined student background profiles and text types. The following is an excerpt from a sample transcript.




country='Austria' school='Secondary school' schoolname='GRG 13 Wenzgasse' schoollevel='' overseas='1' classhour='' extralesson='' region='' textbook='' year='5' class='1F' studentid='' name='' sex='male' mothertongue='German' otherscore='' medium='written' genre='descriptive' topic='describing yourself' dicuse='' preparation='' time='20min' feedback='' date='' transcriber='Antoniya'

Hi, Micheal Jackson I'm Günther. I live in Vienna. I have short black hair and I'm thin. I like to play football and Playstation. That is really cool! I like to swim in the swiming-pool! When it's sunny outside I play football! Your friend, Günther!

To ensure the files could be readily used with common corpus tools, such as WordSmith and AntConc, the raw transcripts were converted to plain texts by removing the SGML tags and metadata information. Apart from the version with SGML headers, two other versions of the lexical corpus of plain texts were thus produced: a Unicode version for WordSmith and a UTF-8 version for AntConc.
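The conversion described above, which strips the mark-up and saves the text in two encodings, could look roughly like the following Python sketch. It assumes that all mark-up appears as angle-bracket tags, that the “Unicode” version means UTF-16, and that element names such as `doc` and `p` merely stand in for the actual ICCI tags, which this sketch does not attempt to reproduce. The function names and the sample string are likewise hypothetical.

```python
import re
from pathlib import Path

# Matches any angle-bracket tag; assumes attribute values never contain ">".
TAG = re.compile(r"<[^>]+>")


def strip_markup(sgml_text: str) -> str:
    """Remove every <...> tag and drop the blank lines left behind."""
    text = TAG.sub("", sgml_text)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def write_plain_versions(sgml_text: str, out_dir: Path, name: str):
    """Write a UTF-8 and a UTF-16 plain-text copy; return both paths."""
    plain = strip_markup(sgml_text)
    utf8_path = out_dir / f"{name}.utf8.txt"
    utf16_path = out_dir / f"{name}.utf16.txt"
    utf8_path.write_text(plain, encoding="utf-8")
    utf16_path.write_text(plain, encoding="utf-16")
    return utf8_path, utf16_path


# Hypothetical mini-transcript; the tag names are placeholders (assumption).
sample = (
    "<doc country='Austria' genre='descriptive'>\n"
    "<p>I live in Vienna.</p>\n"
    "<p>I like to play football!</p>\n"
    "</doc>\n"
)
plain_text = strip_markup(sample)
```

A regular expression is enough here because only the running text is kept and the tags are discarded wholesale; a real SGML parser would be the safer choice if attribute values had to be read as well.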


3.3. POS-tagged corpus

Once the lexical corpus was ready, part-of-speech (POS) tagging and semantic tagging were introduced to process the data. The Wmatrix corpus processing tools were used at this stage. This process consists of two stages. First, we uploaded the data to the Wmatrix server working folder to get the initial tagging results. Before downloading the results for later use, we looked at the list of unknown words that were identified by Wmatrix. These included misspelled words, foreign characters, wrong punctuation marks, etc. Such noise could potentially cause further errors at a later stage. Thus, a cleaning procedure was proposed, which involved going back to the corresponding texts to correct them. Only the transcribers’ errors were corrected, and the students’ original errors were preserved. Second, after the cleanup, the data were ready for the final tagging performed by Wmatrix, i.e., POS and semantic tagging. A sample output of Wmatrix is presented below:

… Hi , Micheal Jackson I 'm Gunther . …

Each line has one token tagged with the corresponding POS and semantic categories. The tagged result files were further converted to the XCES (http://www.xces.org/) XML format, and were ready to be indexed by the Xaira indexer. An excerpt from the sample file that is ready for Xaira is presented below:




icci0001 Austria 5 Male German Descriptive Describing Yourself 48 icci0001

Hi , Micheal Jackson I 'm Gunther . …

The XCES XML files were then indexed using the Xaira indexer, and a Xaira-queryable corpus was delivered so that the researchers could explore it using the Xaira client.

3.4. Corpus database

The corpus data were processed into several formats so that the researchers could directly use them with different corpus tools. Three lexical corpora, namely, SGML-encoded files (with header info and mark-ups; see http://www.isgmlug.org/sgmlhelp/g-index.htm), Unicode encoded plain texts (text only), and UTF-8 encoded plain texts (text only), can be used with WordSmith and AntConc. The Xaira-indexed, POS-tagged files can be used with the Xaira client. Apart from these standalone corpus tools, it would be better for the researchers if the data could be queried directly through a web interface, and database technology is well suited to this. An in-house indexer was thus designed and implemented to index the corpus data into a MySQL database. The database essentially consists of the following tables:

• Table of profile of each file/student
• Table of tokens
• Table of paragraphs
• Table of POS tags
• Table of frequency counts

When these tables were successfully indexed, the database was ready to be uploaded to the web server to work with the web query interface. In the next section, we discuss how the corpus can be explored with the database and the online query package.

4. Corpus exploration

The main purpose of building a corpus is for the researcher to observe the linguistic properties in the data. The different formats of the corpus data were intended to cater to different aspects of linguistic research. The tool devoted to such explorations can play a crucial role in the benefit obtained from compiling the corpus (Badia et al., 1998; Davies, 2005). Therefore, in order to satisfy the needs of various researchers, we designed the web-based query package with some sophisticated features. Some of the characteristics of the web query package are listed below:

• It has a user-friendly interface;
• It has a low learning curve for most users;
• It is accessible to as many users as possible at the same time;
• It is easy to generate the output for further analysis;
• It is reasonably robust and efficient to pull out the query results;
• It is compatible with common platforms (e.g., Windows and Mac OS) and web browsers (e.g., IE, Firefox, and Safari).

The query package consists mainly of three modules: lexical query, corpus comparison, and corpus generation. These are briefly discussed in the following sections.

4.1. Lexical query

The lexical query module is a web-based query interface to search the ICCI database. The figure below depicts the main page of the query interface, which resembles the BYU BNC (http://corpus.byu.edu/bnc/) for ease of use. However, the background process and the database structure are quite different. The ICCI corpus query interface was created using the AppFuse 2 framework and a MySQL database.
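The table layout listed in section 3.4 (profiles, tokens, paragraphs, POS tags, and frequency counts) can be illustrated with a small relational sketch. The Python below uses the standard-library sqlite3 module as a stand-in for MySQL; all table names, column names, tag labels, and sample rows are assumptions made for illustration and do not reproduce the in-house indexer.

```python
import sqlite3

# Hypothetical schema mirroring the five tables named in section 3.4.
SCHEMA = """
CREATE TABLE profile   (file_id TEXT PRIMARY KEY, country TEXT);
CREATE TABLE pos_tag   (tag TEXT PRIMARY KEY);
CREATE TABLE paragraph (para_id INTEGER PRIMARY KEY,
                        file_id TEXT REFERENCES profile(file_id));
CREATE TABLE token     (token_id INTEGER PRIMARY KEY,
                        para_id INTEGER REFERENCES paragraph(para_id),
                        position INTEGER, word TEXT,
                        tag TEXT REFERENCES pos_tag(tag));
CREATE TABLE frequency (word TEXT PRIMARY KEY, freq INTEGER);
"""


def load(conn, texts):
    """Index (file_id, country, [(word, tag), ...]) records into the tables."""
    conn.executescript(SCHEMA)
    for file_id, country, tagged_words in texts:
        conn.execute("INSERT INTO profile (file_id, country) VALUES (?, ?)",
                     (file_id, country))
        cur = conn.execute("INSERT INTO paragraph (file_id) VALUES (?)",
                           (file_id,))
        para_id = cur.lastrowid  # one paragraph per file in this toy example
        for position, (word, tag) in enumerate(tagged_words):
            conn.execute("INSERT OR IGNORE INTO pos_tag (tag) VALUES (?)",
                         (tag,))
            conn.execute(
                "INSERT INTO token (para_id, position, word, tag) "
                "VALUES (?, ?, ?, ?)", (para_id, position, word, tag))
    # Precompute the frequency table from the token table.
    conn.execute("INSERT INTO frequency "
                 "SELECT lower(word), COUNT(*) FROM token GROUP BY lower(word)")


def hits_by_country(conn, word):
    """Count occurrences of a word per country by joining the tables."""
    return conn.execute(
        "SELECT p.country, COUNT(*) FROM token t "
        "JOIN paragraph g ON t.para_id = g.para_id "
        "JOIN profile p ON g.file_id = p.file_id "
        "WHERE lower(t.word) = ? GROUP BY p.country ORDER BY p.country",
        (word.lower(),)).fetchall()


# Invented toy data with made-up tag labels (assumption).
TOY_TEXTS = [
    ("t1", "Austria", [("I", "PRON"), ("like", "VERB"), ("money", "NOUN")]),
    ("t2", "Poland", [("money", "NOUN"), ("money", "NOUN")]),
]
conn = sqlite3.connect(":memory:")
load(conn, TOY_TEXTS)
money_hits = hits_by_country(conn, "money")
```

Storing one row per token with its position is what allows a query to combine lexical, POS, and profile conditions in a single join, which is the kind of flexibility a web query interface needs.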


When a user defines the search options and submits a query, the query algorithm outputs a list of results for the user to choose in order to see the concordances.

By clicking “a lot of money”, we can have the following concordance lines:

There are several unique features that are rarely seen in other online query corpora. Firstly, we integrated a sorting function into the concordance display. By selecting the context as 0, L1, L2, L3, L4, L5, R1, R2, R3, R4, or R5, together with ascending or descending order, users can sort the concordance lines according to the defined options. This is a common function of standalone corpus tools such as WordSmith and AntConc, but it is rarely seen in online searchable corpora due to technical hurdles. This was also why a database structure was adopted for our purpose. Secondly, a larger context for the concordance lines can be viewed by clicking the corresponding transcripts. By default, only the three lines before and the three lines after the query item are displayed. A wider context can be set by the users if required. Thirdly, users can select the “View Statistics” option to pull out the statistical report about the query item. Two very important features, hits in files (HF) and HF_tokens, were also introduced to enable the researchers to better scrutinize the results.
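The sorting options and the hits-in-files count can be made concrete with a short sketch. The Python below is a minimal illustration under stated assumptions, not the ICCI query algorithm: it tokenizes on whitespace, treats position "0" as the node word itself, and counts HF as the number of distinct files containing at least one hit. The function names and the three toy texts are hypothetical; HF_tokens is not attempted here.

```python
def concordance(files, node):
    """Return (file_id, left_words, node_word, right_words) for each hit."""
    hits = []
    for file_id, text in files.items():
        words = text.split()
        for i, word in enumerate(words):
            if word.lower() == node.lower():
                hits.append((file_id, words[:i], word, words[i + 1:]))
    return hits


def sort_key(hit, position):
    """Key for position '0' (node), 'L1'..'L5' or 'R1'..'R5'."""
    _, left, node_word, right = hit
    if position == "0":
        return node_word.lower()
    side, distance = position[0], int(position[1:])
    # Left context is reversed so that L1 is the word next to the node.
    context = left[::-1] if side == "L" else right
    return context[distance - 1].lower() if distance <= len(context) else ""


def sorted_hits(hits, position, descending=False):
    """Sort concordance lines by the word at the chosen context position."""
    return sorted(hits, key=lambda hit: sort_key(hit, position),
                  reverse=descending)


def hit_statistics(hits):
    """Total hits and hits in files (distinct files with at least one hit)."""
    return {"hits": len(hits), "hits_in_files": len({hit[0] for hit in hits})}


# Invented toy texts (assumption), one string per file.
toy_files = {
    "t1": "I have a lot of money and a lot of time",
    "t2": "She has a lot of friends",
    "t3": "No money here",
}
lot_hits = concordance(toy_files, "lot")
by_r2 = [hit[3][1] for hit in sorted_hits(lot_hits, "R2")]
stats = hit_statistics(lot_hits)
```

In this toy data the word occurs three times but in only two files, which is exactly the distinction between raw hits and hits in files.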


As shown in the figure above, the statistical counts display not only the hits or instances found in the defined search, but also the number of files in which the hits or instances were found (hence the name Hits in Files). We believe this is particularly important for a learner corpus of this kind. Traditionally, only the hits and the total tokens of the corpus are provided to researchers. With that information alone, it is not possible to know how many students contributed to the hits (which could have been only a few students). With the Hits in Files information (the actual number of students, as each student produced one file), we can see how many students used the query item in their writings. Similarly, the HF_Tokens figure provides some additional information.

Finally, lexical query can be used with the POS tags. By using the drop-down menu for inserting POS, users can insert a POS tag to query with or without a lexical string. For instance, querying “a lot [N*]” produces the following results:


The last item of the list gives the total number of instances. Clicking it makes the concordance window list all the concordance results. This is particularly useful when a user wants to have a comprehensive list of all the concordances in one window.

4.2. Corpus comparison

The corpus comparison module was designed to provide a web-based interface for users to compare any two subsets of the corpus. When two groups of corpus files are selected using predefined categories, the algorithm processes the query and outputs a log-likelihood result for the comparison. The linguistic features to be compared can be lexis, POS, or semantic features. A sample result is provided below:

As we can see from the statistical results and the visualized graph, the differences between the two groups of students in using certain lexis in their writings are statistically significant. The interface is similar to the Wmatrix corpus comparison, but we designed and implemented the log-likelihood algorithm ourselves to run on the ICCI web server.

4.3. Corpus generation

Since the corpus comprises 6700 files, it could be difficult for users to separate out some files for certain purposes. A user-friendly web interface is helpful in this context, and we therefore designed the corpus generation module. By selecting from the predefined categories, users can submit a query requesting the algorithm to generate a sub-corpus. An example of the result with the search options set as “China Grade 10 Female students’ writings” is shown below:

Although this module is still under development, some users have found it to be particularly helpful in partitioning the corpus into different sub-corpora for individual research purposes.

In brief, the three modules reported here can be integrated to help researchers explore the corpus data fully. Since the query modules are still under development, our first aim in the immediate future is to enhance the performance of the algorithms in every module as well as of the query package as a whole.

5. Conclusion

In this paper, we described the compilation of the ICCI database, and how it could be explored using a sophisticated web query package. After briefly discussing the criteria used to design the corpus, we described the pipelined procedures in data processing and manipulation. To provide researchers with ready-to-use data compatible with different handy corpus tools, different versions of lexical corpora and POS-tagged corpora were compiled. In addition, we demonstrated the design, implementation, and significance of a web-based query package to fully explore the ICCI corpus database. It was primarily intended to be useful for researchers in the various areas of applied linguistics. We also believe it can shed light on the integration of corpus techniques with EFL studies.
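As a supplementary illustration of the corpus comparison module (section 4.2), the log-likelihood statistic for one item in two sub-corpora can be sketched as follows. The details of the ICCI implementation are not given in this paper, so the Python below assumes the standard two-corpus log-likelihood formulation commonly used for keyword comparison; the function name and the frequencies in the example are invented for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood for one item.

    freq_a, freq_b: occurrences of the item in sub-corpus A and B.
    size_a, size_b: total tokens in sub-corpus A and B.
    """
    total = size_a + size_b
    # Expected frequencies if the item were equally common in both corpora.
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Invented frequencies (assumption): same relative frequency, then a contrast.
same_rate = log_likelihood(10, 1000, 20, 2000)
contrast = log_likelihood(30, 1000, 10, 1000)
```

A value above 3.84 corresponds to p < 0.05 with one degree of freedom, so the second call above would count as a significant difference while the first would not.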


References

Anthony, L. 2006. “Developing a Freeware, Multiplatform Corpus Analysis Toolkit for the Technical Writing Classroom”. IEEE Transactions on Professional Communication 49:3. 275-286.
Badia, T. et al. 1998. “IULA’s LSP Multilingual Corpus: compilation and processing”. Paper presented at the ELRA conference, Granada, 29-31 May, 1998.
Davies, M. 2005. “The advantage of using relational databases for large corpora: speed, advanced queries, and unlimited annotation”. International Journal of Corpus Linguistics 10. 301-328.
Rayson, P. 2009. Wmatrix: a web-based corpus processing environment. Computing Department, Lancaster University. http://ucrel.lancs.ac.uk/wmatrix/
Scott, M. 2008. WordSmith Tools version 5. Liverpool: Lexical Analysis Software.

The Use of Demonstrative Reference in English Texts by Austrian School-age Learners

Barbara SCHIFTNER and Tom RANKIN

1. Introduction

The central question underlying the present study is how the use of demonstratives in the writing of school-age Austrian learners develops over time. Confirming the intuitive supposition that demonstrative use in learner writing diverges from patterns found in native speaker writing, previous corpus studies have revealed problematic issues with the use of demonstratives in learner English (Blagoeva 2002, Leńko-Szymańska 2004, Petch-Tyson 2000). All of these studies focused on high proficiency, tertiary-level data of learners with a wide range of L1s and dealt primarily with the function of demonstratives as anaphora markers. They identified various patterns of under-, over-, and misuse, some of which seem to be universal to all learners, while others may be influenced by specific L1 backgrounds (for details, see section 2).

In the English language, there are two demonstratives, namely this and that, both of which inflect for number (these/those). They can be used independently as pronouns or dependently as determiners or degree modifiers (Huddleston & Pullum 2002: 1504). In establishing demonstrative reference, they “mark the NP as definite” (Huddleston & Pullum 2002: 373), that is, they mark something as known to the recipient because it was already mentioned or will be mentioned in the text (textual or endophoric reference) or because it refers to a known entity outside the text (situation or exophoric reference) (cf. Biber et al. 1999: 347, Halliday & Hasan 1976: 31ff). Demonstratives perform a large number of functions, including endophoric and exophoric reference as well as time reference, marking the focus of attention, etc. These functions relate to the basic semantic meanings of nearness (this/these) and farness (that/those), that is, to a focusing vs. a distancing function (Chen 1990).
Botley and McEnery (2001a, 2001b) perform corpus-based analyses of writing from different genres and provide a wealth of findings on the distribution of demonstrative expressions in English. Their findings are drawn upon in the rest of the paper as a point of comparison based on large-scale corpus analyses of expert native writing.¹


Considering the manifold and often not clear-cut use of demonstrative reference, it is not surprising that learners of English seem to struggle with it. This paper presents an exploratory study of school-age learners of English whose L1 is German. The aim is to identify patterns in the development of Austrian learners’ use of demonstrative reference. The study thus complements previous studies of demonstrative use by extending the focus to beginners and intermediate learners. The focus on German L1 learners is related to the authors’ specific research interest in Austrian learners, which is based on their teaching and research context. In addition, the linguistic properties of German provide an interesting point of comparison to English and the L1s considered in previous learner corpus studies.

¹ We refer to ‘expert’ native writing in what follows as shorthand to distinguish between the corpora used by Botley and McEnery (2001a, 2001b) and the LOCNESS corpus of native English student and pupil writing. Botley and McEnery analyzed the American Printing House for the Blind (APHB) Corpus, the Associated Press (AP) Corpus, and the Hansard Corpus.

2. Demonstratives in learner writing – Previous studies

Even though intuitive assumptions suggest that the use of demonstratives causes problems for learners of English, only a few studies on this issue have been undertaken thus far (e.g., Blagoeva 2002, Leńko-Szymańska 2004, Petch-Tyson 2000). These studies unanimously revealed patterns of over-, under-, and misuse, which, it seems, cannot always be interpreted straightforwardly. This difficulty relates to Leńko-Szymańska’s (2004: 90) observation that the problems advanced learners have are not usually explicit errors, but are often related to patterns of use that diverge from native usage. Petch-Tyson (2000), in her cross-linguistic study of learner English by Dutch, French, Finnish, and Swedish L1 speakers, identified a general underuse of demonstratives as well as problems in establishing successful reference using demonstratives. She specifically points out the advanced learners’ difficulties in making what she terms “situation reference,” that is, endophoric reference used to “refer to higher-order entities such as events, propositions, facts etc., which are often non-nominal antecedents” (Petch-Tyson 2000: 45, we will henceforth refer to this type of reference as “propositional reference”). What is particularly interesting in light of the results presented in the present paper (see sections 6 and 7) is that even though Petch-Tyson identified a general underuse of demonstratives, she also found that the singular distal demonstrative pronoun that was consistently overused by all learner groups. Petch-Tyson hypothesizes that in cases where demonstratives are used for endophoric reference to propositions, the frequent use of the pronoun that, if accompanied by the underuse of

this, could be related to different approaches to developing an argument. She suggests that “that, because of its function of shifting focus across topics, may contribute to creating a less persuasive rhetorical effect than a text which is developed using this” (Petch-Tyson 2000: 56).

While Petch-Tyson’s study focuses on cross-linguistic aspects of demonstrative use in learner writing, Leńko-Szymańska’s (2004) study is concerned with specific aspects of demonstrative use by L1 Polish learners of English. Like Petch-Tyson, Leńko-Szymańska also focuses on endophoric reference; she states that exophoric reference does not seem to be a problem for Polish learners and that it is not a prominent feature of argumentative writing. Interestingly, the study reveals quite different patterns of demonstrative use by Polish learners from those suggested by Petch-Tyson for Dutch, French, Finnish, and Swedish learners. Most importantly, demonstratives are generally over- rather than underused by Polish learners, a finding that Leńko-Szymańska relates to the lack of articles in Polish; one of the ways to mark definiteness in Polish is the use of demonstratives. The fact that this overuse is especially high with distal demonstratives cannot be explained by L1 transfer. Rather, it is hypothesized that this “preference for distal demonstratives” is related to the learners’ “awareness that distal demonstratives, particularly that, are less marked than proximal ones” (Leńko-Szymańska 2004: 104-105).

Also working on learners of English with a Slavic L1, Blagoeva (2002) examines the English writing of Bulgarian speakers. Her work provides an interesting point of contrast as she notes that English and Bulgarian have functionally similar systems of distal and proximal demonstrative pronouns and determiners (although Bulgarian has a wider range of inflectional forms).
However, the singular proximal form in Bulgarian is the only one that can be used for extended reference to facts or longer stretches of text, while both of the singular forms this and that can be used in English to perform the same function (Blagoeva 2002: 301). Despite this distinction, it is found that L1 Bulgarian learners also overuse the distal demonstratives that/those (Blagoeva 2002: 305). Blagoeva (2002: 306) suggests that an “indisputable reason” for divergence from target norms is L1 interference; however, it is not fully explained what is meant by interference in this instance. In order to paint a comprehensive picture of demonstrative use in Austrian learner writing, the present study encompasses both the dependent and independent use of demonstratives and includes demonstratives as used for both exophoric as well as endophoric reference. We also code for types of referent; this expands on the previous studies of learner corpora, which concentrated on the quantitative distribution of demonstrative expressions. The following sections will outline the use of demonstratives in English and


German before presenting the methodology employed for the corpus study and the results.

3. Demonstrative expressions: German vs. English

German and English are grammatically similar in terms of the paradigm for demonstrative pronouns and determiners (the more elaborate German inflectional system for number, case, and gender agreement notwithstanding). Table 1 outlines the paradigm.

Table 1. Demonstratives in English and German

                        German   English
Pronoun     Proximal    dies-    this/these
            Distal      jen-     that/those
Determiner  Proximal    dies-    this/these
            Distal      jen-     that/those

There are, however, two main complicating factors in the German demonstrative system in comparison to English. Firstly, while the German paradigm as illustrated above appears to mark the proximal/distal distinction consistently, the use of the distal forms jen- is increasingly rare and tends to be replaced by forms of the definite article or the proximal dies- in combination with adverbial equivalents of here and there (Duden 2006: 295), as illustrated in examples (1) and (2).

(1) #Bitte gib mir jenes Buch.
    Please give me that book.
(2) Bitte gib mir das Buch da.
    Please give me the book there.

Secondly, in addition to personal pronouns er/sie/es, etc. (he/she/it ...), German has a complementary system of demonstrative pronouns, which are homophonous with forms of the definite article and are inflected for number, gender, and case (see Table 2).

Table 2. German personal and corresponding demonstrative pronouns

       Personal                          Demonstrative
       Masc    Fem    Neut    Pl         Masc    Fem    Neut    Pl
Nom    er      sie    es      sie        der     die    das     die
Acc    ihn     sie    es      sie        den     die    das     die
Gen    seiner  ihrer  seiner  ihrer      dessen  deren  dessen  deren
Dat    ihm     ihr    ihm     ihnen      dem     der    dem     denen


There is no proximal/distal distinction in usage of these series of pronouns; rather, the choice between a demonstrative and personal pronoun is determined by an interplay of grammatical and discourse factors as each type of pronoun tends to prefer antecedents with different properties, such as whether or not an antecedent is a discourse topic (see for example Bosch, Rozario & Zhao 2003; Bosch, Katz & Umbach 2007). The choice of demonstrative expression in English is likewise constrained by a range of factors connected to the discourse status of an entity such as the cognitive and textual accessibility of the referent (cf. Botley & McEnery’s 2001b investigation of Ariel’s 1988, 1990 “Accessibility Hierarchy”). Given constraints of space, it would not be possible to do justice to the extensive range of factors that have been suggested as influencing the choice of demonstrative expression in English and German; thus, we refer readers to the studies mentioned above and the references cited therein for more information.

In addition to the general linguistic description of differences in the usage and distribution of demonstratives in English and German, two courses of government-approved school textbooks for English teaching in Austrian lower secondary schools were consulted to examine how the differences may be presented to learners (Gerngross et al. 2007, 2008a, 2008b, 2009; Harmer 2007; Harmer & Avecedo 2008, 2009, 2010). It is acknowledged that textbooks provide only a partial insight into what actually happens in the foreign language classroom, but they are a useful indication nevertheless. The main trend that emerged from the textbook review was that where demonstratives were touched upon, it was exclusively in terms of deictic reference and the relative physical or temporal distance from the speaker/hearer. This was reinforced by the use of illustrations where entities were either physically close to or remote from the people depicted.
The deictic expressions here and there were often supplied in exercises to guide pupils towards the use of either the proximal or distal demonstrative. Drawing on these linguistic and educational insights, a range of research questions were addressed in the corpus study.

4. Aims, hypotheses and research questions

Learner corpus linguistics in general has tended to address the production of advanced proficiency learners. Our study of lower proficiency school-age learners is mainly exploratory in nature and seeks to identify developmental patterns in the usage of demonstrative reference in the written production of Austrian learners of English. Our main, guiding research question was, therefore, to establish how the usage of demonstratives develops over time in the writing of school-age Austrian learners. In doing this, we also aim to extend previous studies by incorporating a fine-grained coding system to illustrate different types of reference relations and antecedent preferences in order to provide more qualitative data on the distribution of demonstratives.

Given previous findings that advanced proficiency learners of English exhibit non-target usage of demonstratives in terms of patterns of over/underuse, we hypothesize that this area of English poses difficulties to learners in general and, therefore, that the lower proficiency learners in our study will also show evidence of non-target usage in terms of the quantitative distribution of demonstratives in their written production. Assuming L1 transfer affects the use of demonstratives and given the increasing lack of a proximal/distal distinction in German demonstrative pronouns, we predict that distal demonstratives may show particular evidence of non-target usage by German-speaking learners. Specifically, we assume that proximal demonstrative determiners and pronouns may be overused while the distal demonstratives may be underused. Given the lower proficiency level of our learner groups, we define quantitative over- and underuse with respect to patterns in the Louvain Corpus of Native English Essays (LOCNESS), rather than expert native writing (see 5.2).

5. The study

5.1. Corpus data

The data used for this study are taken from the International Corpus of Crosslinguistic Interlanguage (ICCI), more specifically the Austrian subcorpora. The data are subdivided into seven data sets (henceforth called corpora), with one data set per grade, spanning learners from grade five to grade eleven (from approximately age 10 to age 17). All of the learners who contributed to the data speak German as their L1.² The text types compiled in these corpora range from more descriptive and narrative texts in grades five to ten to more argumentative texts in grade eleven. A subset of LOCNESS was used as an L1 reference corpus.
LOCNESS was specifically designed as a reference corpus for advanced learner writing, and thus consists of student writing rather than expert writing. The corpus contains a collection of British A-level argumentative essays. Since the students who produced these essays should be roughly the same age as the learners in grade 11, this subsection was chosen for the comparative analysis. For a detailed breakdown of the corpora used, see Table 3.

² A small number of learners who speak German as a second language were also included in this study.

Table 3. The corpora used in this study

Corpus       No. of Texts   Tokens
ICCI_5       60             3629
ICCI_6       148            12604
ICCI_7       167            25912
ICCI_8       139            23993
ICCI_9       118            19232
ICCI_10      69             12104
ICCI_11      68             12962
LOCNESS-AL   114            60209

Genre differences obviously have a significant effect on the language produced by both native and non-native speakers. The data contain a

range of different genres (descriptive, narrative, argumentative). The data in the ICCI_11 subcorpus are made up of texts elicited by means of an argumentative task and are thus most directly comparable to LOCNESS-AL in terms of genre. While attempts were made to be as consistent as possible with the text types produced in data collection, the different levels of age, literacy, and proficiency in English dictated the types of texts that learners could be expected to produce. For example, a descriptive task, which would be demanding for a grade 5 learner, would naturally be an unsuitable task for a higher grade learner. Similarly, formulating more complex argumentative texts would be beyond the abilities of younger learners. The differences in genres will certainly have an influence on the outcome of the study and will have to be taken into consideration in the interpretation of the results.

5.2. Methodology

Concordances for the four forms of demonstratives were extracted using WordSmith Tools (Scott 2008). The results were sorted to include only demonstrative usages; in practice, this meant discarding occurrences of that used as a relative pronoun or complementizer. The remaining concordances were then entered into an Excel spreadsheet and coded manually for a number of grammatical and referential properties. The rationale for the coding of referential relations and types of referents is based on the system used by Botley and McEnery (2001a & b), which was itself based on previous categorizations in the literature (Halliday & Hasan 1976; Himmelmann 1996; Lakoff 1974) as well as categories derived from large-scale corpus data (see Botley & McEnery 2001: 9-10). Their system was simplified to leave the categories illustrated in Table 4. Such a coding scheme extends the previous learner corpus work on demonstratives and types of reference by permitting a finer-grained analysis of over-, under-, and misuse of specific referential properties of the demonstratives in addition to a consideration of overall quantitative comparisons of the different grammatical forms of demonstratives.

Table 4. Coding scheme

Grammatical Properties
  a. Grammatical Function
     i. Pronoun
     ii. Determiner
  b. Proximity
     i. Proximal
     ii. Distal
  c. Number
     i. Singular
     ii. Plural

Referential Properties
  a. Reference
     i. Exophoric
     ii. Anaphoric
     iii. Cataphoric
  b. Referent
     i. NP
     ii. Proposition

6. Results

The detailed annotation of the data revealed interesting insights into the use of demonstratives across all levels. In a first step, the overall frequency of demonstratives in all L2 corpora as well as the native corpus was compared. To present the differences between the levels more concisely and make the development easier to follow, grades 5 & 6, 7 & 8, and 9 & 10 were grouped together, which leaves four learner corpora and one native reference corpus. As depicted in Figure 1, the overall frequency of demonstratives in every ICCI subcorpus is lower than the overall frequency in LOCNESS-AL.

Figure 1. Normalized frequency of demonstratives in ICCI corpora & LOCNESS-AL (per 10,000 words)
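The extraction and coding steps described in the methodology can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study itself used WordSmith Tools and an Excel spreadsheet, the names `concordance`, `CodedHit`, and `reference_shares` are hypothetical, and the codes assigned below are invented for demonstration rather than taken from the authors' annotation.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Proximity and number follow mechanically from the form (Table 4, b and c).
FORMS = {
    "this": ("proximal", "singular"),
    "these": ("proximal", "plural"),
    "that": ("distal", "singular"),
    "those": ("distal", "plural"),
}
NODE = re.compile(r"\b(this|that|these|those)\b", re.IGNORECASE)


def concordance(text, width=30):
    """Return (left context, node, right context) for each candidate form."""
    lines = []
    for match in NODE.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append((left, match.group(0).lower(), right))
    return lines


@dataclass
class CodedHit:
    """One manually coded concordance line (hypothetical record layout)."""
    form: str        # this / that / these / those
    function: str    # "pronoun" or "determiner"
    reference: str   # "exophoric", "anaphoric" or "cataphoric"
    referent: str    # "NP" or "proposition"

    @property
    def proximity(self):
        return FORMS[self.form][0]

    @property
    def number(self):
        return FORMS[self.form][1]


def reference_shares(hits):
    """Percentage of coded hits per reference type."""
    counts = Counter(hit.reference for hit in hits)
    total = sum(counts.values())
    return {kind: 100 * n / total for kind, n in counts.items()}


# Learner sentence quoted in the chapter (ICCI_7). The search only finds
# candidates; relative or complementizer "that" still needs manual removal.
sample = "The cold air drew little ice-flowers on my window and that was very beautiful."
candidates = concordance(sample)

# Illustrative codes only, not the authors' annotations.
coded = [
    CodedHit("that", "pronoun", "anaphoric", "proposition"),
    CodedHit("this", "determiner", "exophoric", "NP"),
    CodedHit("that", "pronoun", "anaphoric", "NP"),
    CodedHit("these", "determiner", "anaphoric", "NP"),
]
shares = reference_shares(coded)
```

Deriving proximity and number from the form keeps the manual coding limited to the three properties that genuinely require a human judgment: grammatical function, type of reference, and type of referent.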


Table 5. Distribution of demonstratives (per 10,000 words) and statistical significance
dark shading = significant “overuse”/light shading = significant “underuse” (* p < .05/** p < .01/*** p < .001)

Corpus      this-D    this-P    that-D    that-P    these-D   these-P   those-D   those-P
ICCI_5      ***5,51   ***24,80  2,76      8,27      0,00      0,00      0,00      0,00
ICCI_6      ***1,59   ***4,76   0,00      **13,49   0,00      **0,79    0,00      0,00
ICCI_7      **23,93   ***13,12  ***10,42  ***15,05  0,00      ***0,39   0,00      0,00
ICCI_8      ***22,51  ***5,84   4,58      ***15,42  ***0,42   *2,08     0,42      ***0,83
ICCI_9      41,08     ***17,68  ***16,12  ***19,24  ***4,68   *1,56     0,00      0,00
ICCI_10     29,74     ***9,91   *9,09     ***16,52  ***2,48   0,00      0,83      **0,83
ICCI_11     ***10,80  ***14,66  *0,77     ***18,52  ***6,94   *1,54     2,31      2,31
LOCNESS-AL  36,37     65,77     3,82      5,31      22,26     5,48      1,66      6,31
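The normalized figures in Table 5 and the kind of significance test that typically accompanies them can be illustrated as follows. Two assumptions need stating clearly: the chapter does not name the test it used, so the log-likelihood statistic shown here is simply one measure commonly applied to corpus frequency comparisons, and the raw counts (24 and 32) are back-computed from the normalized values for that-P and the token counts in Table 3 rather than reported by the authors.

```python
import math


def per_10k(count, tokens):
    """Frequency normalized to occurrences per 10,000 words."""
    return count / tokens * 10_000


def log_likelihood(a, size_a, b, size_b):
    """Log-likelihood (G2) for a item occurring a times in corpus A and b times in corpus B."""
    expected_a = size_a * (a + b) / (size_a + size_b)
    expected_b = size_b * (a + b) / (size_a + size_b)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / expected_a)
    if b:
        g2 += b * math.log(b / expected_b)
    return 2 * g2


# that-P: raw counts inferred from Table 5 (assumption), tokens from Table 3.
icci_11 = per_10k(24, 12962)
locness = per_10k(32, 60209)
g2 = log_likelihood(24, 12962, 32, 60209)

# 10.83 is the chi-square critical value for p < .001 with one degree of freedom.
significant_at_001 = g2 > 10.83
```

Under these assumptions the inferred counts reproduce the tabulated values of 18,52 and 5,31, and the statistic of roughly 19 exceeds the 10.83 threshold, which is at least consistent with the three asterisks shown for that-P in ICCI_11.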

Figure 1 also shows that plural forms are hardly ever used. This is the case especially for the distal plural forms (those-D and those-P), which occur a total of 11 times in all Austrian-ICCI subcorpora and very rarely in the LOCNESS-AL data. In the learner data, plural demonstratives occur almost entirely in grades 9-11, that is, they are basically not found in the production of lower secondary students.

The distribution of the different types of demonstratives reveals a quite different picture. Even though demonstratives are used much less frequently in the learner texts than in the native control corpus, the pronoun that (that-P) is consistently “overused” across all levels, while the pronoun this (this-P) is consistently “underused”³ (see Table 5 for normalized frequencies and information on statistical significance). A similar pattern of over- and underuse occurs in the use of this and that as determiners (that-D and this-D), shown in Table 5, although the patterns with determiners are not as consistent. The singular demonstrative determiners, both distal and proximal, are very rarely used at all in grades 5 and 6.⁴ Both forms are used a lot more often as of grade 7; whereas the determiner this remains “underused”, that is already “overused” in grade 7 (see Table 5). There is a marked drop in the frequency of use of both this-D and that-D from grade 10 to grade 11, which coincides with a consistently less frequent use of this-D for exophoric reference from grade 8 to grade 11 (see Figure 2). That-D is rarely used for exophoric reference in all corpora and it is generally not used often enough to warrant conclusive findings.

³ The use of the common terminology “underuse” and “overuse” does not seem appropriate in the comparison of different levels, since the learners are in the process of acquiring the language, which means that they may simply not have encountered some items at a specific stage. Nonetheless, such frequency measures are an interesting tool for an exploratory study of L2 development such as the one described here. The terms “overuse” and “underuse” are therefore to be understood descriptively, in the sense of a “more or less frequent” use.
⁴ For this reason, grades 5 and 6 are not included in the more detailed analysis in Figure 2.

Figure 2. Percentage of endophoric (anaphoric & cataphoric) and exophoric reference of this-D

Figure 3. Percentage of endophoric (anaphoric & cataphoric) and exophoric reference (overall distribution)

This development from exophoric to endophoric reference observed for this-D coincides with a general trend in the use of demonstratives in the ICCI corpora. As depicted in Figure 3, there is a tendency towards a more frequent


use of endophoric (mostly anaphoric) reference in higher grades, which may suggest a development toward the increased use of demonstratives as linguistic markers of information structure to establish coherence relations within a text. This may certainly have some connection to the text types represented in the different corpora, especially in ICCI_11, which is solely argumentative; nonetheless, the trend seems indicative of the learners’ development as writers of more complex and increasingly cohesive texts. In addition, it is interesting to note that this high use of anaphoric reference is very much in line with the findings of Botley and McEnery (2001a), namely that anaphoric reference is the most common form of demonstrative reference across different types of texts in expert native speaker writing.

The quantitative results indicate that the learners in the study show various patterns of over- and underuse of demonstrative pronouns and determiners in comparison to the writing of native speaker pupils (cf. Table 5). In general, the emergent pattern is the persistent overuse of the distal that and underuse of the proximal this, especially as pronouns but also to a large extent as determiners. Nevertheless, in certain areas, such as patterns of demonstrative reference (i.e., frequency of anaphoric reference), the learners show similarity in usage to expert native speaker writing (as discussed in Botley and McEnery 2001a). Therefore, it seems that certain properties of the use of demonstrative expressions are relatively unproblematic for learners, even at lower levels of proficiency. The question then arises as to how exactly the over/underuse of specific demonstrative expressions is qualitatively distributed and whether this distribution may reveal the influence of the L1 or more general trends in the development of grammar and literacy skills.
We take, therefore, the statistical distribution as a guide to those elements that warrant closer qualitative investigation and we turn below to a more detailed consideration of the distribution and use of that. This aspect of our investigation is particularly pertinent, as previous studies also found that that was particularly prone to over/misuse.

7. Discussion

7.1. Overuse of ‘that’

As stated above and illustrated in Table 5, the overuse of that as a pronoun is consistently statistically significant from grade 6 to grade 11 (this is also true to some extent of that as a determiner). With regard to more advanced Finnish learners of English, Petch-Tyson (2000, cf. section 2 above) argued that the overuse of the pronoun that may be related to argumentative structures and differing strategies in referring to propositions. To see whether propositional reference was also an issue in the writing of Austrian lower proficiency learners, the distribution of reference to NPs and to propositions was determined for all instances of endophoric reference of the pronoun that. As can be seen in Figure 4, the use of that to refer to NPs is very common in the learner corpora (up to 40% of occurrences) as well as in the LOCNESS corpus (33%).

Figure 4. Endophoric use of the pronoun that: percentage of reference to NPs and propositions

There is an increase in the use of that for propositional reference in grade 11, which points to the fact that propositional reference is more common in argumentative writing (a higher use of propositional reference in grade 11 was attested for demonstratives in general). What is interesting, however, is that this higher percentage of propositional reference is not accompanied by a significant increase in the overuse of that; the pronoun that is also overused in grades 6 to 10, where it is frequently used for NP reference. These findings suggest that in the case of school-age learners, the overuse of the pronoun that is unlikely to be related to argumentative structures, even more so since it also occurs in the lower grades where the text types produced by the learners are descriptive or narrative rather than argumentative (cf. description of the data in section 5.1).

We hypothesize that the extremely frequent use of the pronoun that in Austrian learner writing, even at lower levels, to establish a wider range of anaphoric relations is related to the system of demonstrative pronouns available in German. Interestingly, the extremely frequent use of the distal demonstrative runs counter to the development in German, where the use of the distal demonstratives is actually decreasing (cf. the discussion of German demonstratives in section 3). At first sight, it seems then that processes other than L1 transfer are at work. Taken together with the results discussed


above, it would appear that the overuse of that emerges as a universal pattern independent of the influence of the L1.⁵ It is perhaps the case that that is in some way more salient for learners. Recall that Leńko-Szymańska (2004: 105) proposes that learners have an awareness that distal demonstratives are less marked than proximals.⁶ It is worth considering that that is also the demonstrative most common in conversation (and literary fiction), whereas this, these, and those are more frequent in academic prose than in conversation, fiction, and news genres (Biber et al. 1999: 349). It can therefore be assumed that learners are exposed to that much more frequently than to the other demonstratives and this fact is likely to be reflected in their production.

However, it is possible that L1 German is still exerting an influence on the overuse of that. Apart from the demonstratives that signal proximity, German also has a complementary system of demonstrative pronouns in addition to personal pronouns er/sie/es (cf. section 3, Table 2). The nominative and accusative form of the demonstrative equivalent to the personal pronoun es (“it”) is das. There is a certain amount of overlap between the range of contexts where German das and English that can occur; for example, they can both refer to NPs, stretches of text, and propositions. However, there are also rather subtle differences in usage, where das would be possible in German but the use of that in English would be interpreted with special emphasis. These contexts are illustrated in examples (3) to (5), where the pronoun it or relative clauses would be more appropriate, but where das in similar contexts in German would be neutral.

(3) The cold air drew little ice-flowers on my window and [that] was very beautiful. (ICCI_7)
(4) My parents were mad at me, Leonard never wanted to meet me again – [that] was quite a bad situation. (ICCI_9)
(5) There you can see how the make bread and tea. [That] was very interesting. (ICCI_7)

The ICCI data thus provide further evidence that unmarked forms are more liable to overuse, independent of L1 influence. However, given that the unmarked that in English is also cognate with German das, this general tendency towards overuse of that in learner English (cf. Blagoeva 2002, 5

6

However, given that the previous studies have so far only studied European languages, ‘universal’ must be treated with some caution. Leńko-Szymańska (2002: 105) bases her idea of markedness on Lyons (1977: 647), who argues that “this is marked and that unmarked [...]: there are many syntactic positions in which that occurs in English and is neutral with respect to proximity and any other distinctions based on deixis.”

76

Barbara SCHIFTNER and Tom RANKIN

Leńko-Szymańska 2004) may be reinforced by a degree of L1 influence. Thus, the referential relations encoded by demonstratives are often different as compared to native English usage, where other pronominal forms might be preferred. As a rule, patterns of usage in learner writing should not be described as errors; rather, there are recurrent preferences for simpler subject-predicate sentence structures where the occurrence of demonstratives stands out. Such structures would be less frequent if there were an increased use of complex sentence structures with relative clauses or other pronominal forms. This dynamic reflects perhaps not only the lower proficiency L2 level of the learners, but also a lower level of literacy and a preference for simpler sentence structures, which might also be reflected in their L1 written production. However, yet again, results from expert native writing suggest that such usage could not be considered an error in any sense, but rather that it is a strategy relied on more frequently by learners. Botley and McEnery (2001b: 230) conclude that both proximal and distal demonstratives in English are generally short-range anaphors that refer to antecedents in either the same or the preceding sentence. It is, therefore, not a problem that learners establish short-range referential links between sentences; rather, the more frequent use of simple main sentence structures requires a more frequent use of demonstratives for this purpose. In summary, there are various possible analyses of the overuse of that, none of which provides us with the full picture in isolation. While there may be an influence of L1 German, this could also be reinforced by a general preference for unmarked forms in the L2. In addition, we have noted that there is a preference for simple main clauses in the learner writing, and given that this is the case, demonstrative expressions are used more frequently to establish intersentential links. 
The observation of overuse might therefore be seen to be the result of a number of different possible underlying processes.

7.2. Post-modification of that-P

One specific example of differences in usage between the learner and native productions has to do with post-modification patterns of the pronoun that. These patterns are particularly interesting in light of similar findings reported by Leńko-Szymańska (2004: 101) for the post-modification of those. She points out that two different groups of advanced learners of English show similar preferences for the post-modification of those with relative clauses at twice the rate of native speakers. In contrast, post-modification with prepositional or participial clauses in the learner writing occurs at half the rate of that in the native corpus (Leńko-Szymańska 2004: 103).7

The Use of Demonstrative Reference in English Texts by Austrian School-age Learners


In LOCNESS-AL, that used with an NP referent has a typical prepositional post-modification structure (NP ... that of), which is used in 80% of the instances where the pronoun that refers to an NP (see examples 6-8).

(6) Another problem is that of understaffing. (LOCNESS-AL)
(7) The first argument is that of the parents. (LOCNESS-AL)
(8) The average speed of traffic in Central London is no faster than that of 200 years ago. (LOCNESS-AL)

This structure is not used at all by the learners in ICCI Austria. Again, this pattern would seem to be indicative of a general avoidance of more complex sentence structures, which appear to be beyond school-age learners’ level of literacy and L2 proficiency. Interestingly, taken together with Leńko-Szymańska’s results, it would seem that there are certain complex sentence structures that even advanced university level students of English do not fully master.

7 This of course raises the question of why post-modification with relative clauses should be more readily acquired and used by learners in contrast to post-modification by participle and prepositional clauses. It is possible that the explicit attention given to relative clauses as a topic for grammar teaching in ELT enhances the saliency of these sorts of structures for learners. We return to issues of teaching in the conclusion.

8. Summary and conclusions

The present study adds to previous learner corpus research concerning problems in the distribution of demonstrative pronouns and determiners in the written production of learners of English. One interesting finding is that, in comparison to analyses of large corpora of expert native writing, even low proficiency learners in some areas show similarities in their use of demonstrative expressions, that is, demonstratives are used most frequently as short-range anaphors and that usually refers to propositions. However, in line with previous learner corpus studies, there remains evidence of patterns of usage that are both quantitatively and qualitatively divergent from native speaker norms. Particularly prominent in this regard is the infrequent use of plural forms and the comparatively infrequent use of the singular proximal pronoun this. In contrast, the singular distal pronoun that is consistently overused. We have analyzed this as being the result of a combination of two factors. Firstly, there is a preference for relatively simple subject-predicate sentence structures, where that is often used to refer to a range of previously mentioned entities or propositions (cf. Botley & McEnery’s [2001b: 26] finding that the distal form is more likely to be used to refer anaphorically to propositions). Secondly, this may show the influence of the close German cognate das, which may occur in a range of contexts where this would be preferred in English.

These results provide interesting points of comparison with the results from previous studies. Even though it is possible that the preference for that in the Austrian ICCI data betrays the influence of L1 German, these results are very much in line with the studies discussed in section 2, which found an overall preference for distal demonstratives. As proposed by Leńko-Szymańska (2004), it is perhaps the case that there is a general trend among learners to overuse the most salient or least marked forms. In conjunction with the evidence from a range of different L1 groups, this new data from L1 German speakers may lend further weight to this idea.

A further interesting point of comparison with previous studies is the proficiency level of the learners. Blagoeva (2002), Leńko-Szymańska (2004), and Petch-Tyson (2000) all studied corpora of relatively advanced university-level students of English. Yet the patterns of over/misuse are similar in ICCI, indicating that the issues that remain problematic at later stages of proficiency already arise at lower levels.

At this point, it may be worth entering a caveat. As noted above, there are differences in the makeup of the ICCI corpora and the more advanced learner corpora referred to above. In particular, the text types are necessarily somewhat different in order to facilitate the collection of data from relatively low-proficiency learners. These distinctions may have an impact on the comparability of the corpora in some respects; however, the fact that the patterns of usage appear consistent for the use of demonstratives indicates that this issue is consistently problematic regardless of genre.

Given that demonstratives seem to be problematic for learners at a range of proficiency levels, a natural question would be to ask whether this might be addressed in formal instruction.
Since all the learners considered in the corpus studies have been formally instructed, it is interesting to note that the detailed referential properties of demonstratives tend not to be addressed in ELT materials. Obviously, the materials we consulted were specifically aimed at the Austrian market, but it seems to be more generally true as well.8 As Cowan (2008: 205) notes, “[w]ith regard to demonstrative determiners, most textbooks teach the concept that this/these and that/those are used to indicate different degrees of proximity. This is frequently practiced in dialogs in which students use the demonstratives to refer to items that are either near or far away.” Cowan (2008: 205) suggests going beyond this to introduce concepts of time, relevance, and given vs. new information with upper-intermediate and advanced level students. This would seem to be a sensible course of action with more advanced learners. However, based on our results and those of Leńko-Szymańska (2004), it could be suggested that it might be worthwhile to expand the scope of teaching demonstratives beyond the properties of reference and proximity to include considerations of the syntactic patterns in which demonstratives occur. This is not generally the subject of explicit instruction (at least as far as can be ascertained from the information in textbooks) and it is an area that appears difficult to master. We have seen that the ICCI students, perhaps unsurprisingly, do not use any complex post-modification structures, and Leńko-Szymańska (2004) reports that there are differences in the way demonstrative pronouns are modified in learner vs. native production. Specifically, there is an avoidance of participial and prepositional post-modification. This is an area that apparently remains problematic given that there are still differences at advanced levels of proficiency, and it may benefit from explicit instruction.

8 It is worth acknowledging again that the information provided in course books does not necessarily accurately reflect what happens in practice in the classroom. It is, nonetheless, a useful indication of the sort of elements that one can assume are addressed in formal instruction.

9. References

Ariel, M. 1988. “Referring and Accessibility”. Journal of Linguistics 24. 65-87.
Ariel, M. 1990. “Referring Expression and +/− Coreference Distinction”. Reference and Referent Accessibility, Fretheim, T. and J.K. Gundel (eds). Amsterdam: John Benjamins. 13-33.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. 1999. The Longman Grammar of Spoken and Written English. London: Longman.
Blagoeva, R. 2002. “Demonstrative reference as a cohesive device in advanced learner writing: a corpus-based study”. Advances in Corpus Linguistics. Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23), (Göteborg, 22-26 May 2002), Aijmer, Karin and Bengt Altenberg (eds). 297-307.
Bosch, P., T. Rozario and Y. Zhao. 2003. “Demonstrative Pronouns and Personal Pronouns. German der vs. er”. Proceedings of the EACL2003. Budapest. Workshop on The Computational Treatment of Anaphora. Available at: http://www.aclweb.org/anthology-new/W/W03/W03-2609.pdf (28 June 2011).
Bosch, P., G. Katz and C. Umbach. 2007. “The Non-subject Bias of German Demonstrative Pronouns”. Anaphors in Text: Cognitive, formal and applied approaches to anaphoric reference, Schwarz-Friesel, M., M. Consten and M. Knees (eds). 145-164.
Botley, S. and T. McEnery. 2001a. “Demonstratives in English: A Corpus-Based Study”. Journal of English Linguistics 29:1. 7-33.

Botley, S. and T. McEnery. 2001b. “Proximal and Distal Demonstratives: A Corpus-Based Study”. Journal of English Linguistics 29:3. 214-233.
Chen, R. 1990. “English Demonstratives: A Case of Semantic Expansion”. Language Sciences. 12:2/3. 139-153.
Cowan, R. 2008. The Teacher’s Grammar of English: A course book and reference guide. Cambridge: Cambridge University Press.
Duden. Die Grammatik. 2006. (4th edition). Dudenredaktion. Mannheim: Bibliographisches Institut & F.A. Brockhaus.
Gerngross, G., H. Puchta, C. Holzmann, J. Stranks and P. Lewis-Jones. 2007. More! Student’s Book 1. Helbing Languages.
Gerngross, G., H. Puchta, C. Holzmann, J. Stranks and P. Lewis-Jones. 2008a. More! Student’s Book 2. Helbing Languages.
Gerngross, G., H. Puchta, C. Holzmann, J. Stranks and P. Lewis-Jones. 2008b. More! Student’s Book 3. Helbing Languages.
Gerngross, G., H. Puchta, C. Holzmann, J. Stranks and P. Lewis-Jones. 2009. More! Student’s Book 4. Helbing Languages.
Halliday, M. and R. Hasan. 1976. Cohesion in English. London: Longman.
Harmer, J. 2007. Your Turn 1. Textbook. Wien: Langenscheidt.
Harmer, J. and A. Avecedo. 2008. Your Turn 2. Textbook. Wien: Langenscheidt.
Harmer, J. and A. Avecedo. 2009. Your Turn 3. Textbook. Wien: Langenscheidt.
Harmer, J. and A. Avecedo. 2010. Your Turn 4. Textbook. Wien: Langenscheidt.
Himmelmann, N.P. 1996. “Demonstratives in Narrative Discourse: A Taxonomy of Universal Uses”. Studies in Anaphora, Fox, B.A. (ed). Amsterdam: John Benjamins. 205-254.
Huddleston, R. and G.K. Pullum. 2002. The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.
Lakoff, R.T. 1974. “Remarks on This and That”. Papers from the 10th Regional Meeting of the Chicago Linguistics Society. Chicago: Chicago Linguistics Society. 345-356.
Leńko-Szymańska, A. 2004. “Demonstratives as anaphora markers in advanced learners’ English”. Corpora and Language Learners, Aston, G., S. Bernardini and D. Stewart (eds). Amsterdam: Benjamins. 89-107.
Lyons, J. 1977. Semantics. Cambridge: Cambridge University Press.
Petch-Tyson, S. 2000. “Demonstrative expressions in argumentative discourse: A computer corpus-based comparison of non-native and native English”. Corpus-based and Computational Approaches to Discourse Anaphora, Botley, S.P. and T. McEnery (eds). Amsterdam: John Benjamins. 43-64.
Scott, M. 2008. WordSmith Tools. Version 5. Liverpool: Lexical Analysis Software.

The Role of Conventionalized Language in the Acquisition and Use of Articles by Polish EFL Learners

Agnieszka LEŃKO-SZYMAŃSKA

1. Introduction

The article system is one of the most pervasive features of English grammar. The articles the and a/an are among the most frequent lexical items, amounting to over 6% and over 3% of any spoken and written discourse, respectively. Yet, in spite of their prominence, articles seem particularly hard for EFL learners to acquire. Even advanced students seem to struggle with the correct use of articles and frequently make errors. Students whose mother tongues do not have article systems find it especially difficult to acquire this feature of English grammar. However, even students whose L1s have articles find it difficult to use English articles accurately.

Many studies have examined the acquisition of articles by EFL learners, and several theories and hypotheses have been proposed to account for this process. Some considered the emergence of articles along with other morphemes of English (Hakuta, 1976; Tarone, 1985, among others) while others concentrated solely on the article system (Parrish 1987; Thomas 1989; Butler 2002; Jarvis, 2002; Ekiert 2004; Li & Yang 2010; Crompton 2011; Díez-Bedmar and Papp 2008; Díez-Bedmar 2010; Díez-Bedmar and Pérez Paredes, this volume; among others). However, researchers are still far from agreeing on a single model that explains the process of the acquisition of articles.

The difficulty in acquiring articles has been ascribed to problems in mastering a complex system of grammatical, semantic, and pragmatic relations. Earlier studies usually concentrated on the analysis of obligatory contexts for article use and the analysis of learners’ accuracy levels in these contexts. Such an approach implies the belief that the use of articles is rule-based, and that learners gain higher and higher levels of mastery of these rules throughout the process of learning.

Yet, it has long been asserted that language processing is not solely rule-based. Sinclair (1990) proposes two complementary principles explaining language use.
According to him, in addition to being built from scratch (based on generating grammatical structures), language is processed through the idiom principle, i.e., by selecting ready-made multi-word chunks of language from the phrasicon. Articles are frequently part of such multi-word expressions. Thus, it can be hypothesized that at least some obligatory contexts for the use of articles are acquired within larger lexical phrases, and are processed as such. So far, few studies have considered the acquisition of articles from this perspective, and those that did tended to treat the idiomatic uses of articles only marginally.

The study reported in this paper is meant to fill this gap. It aims to establish which uses of the articles the and a/an in student writing can be accounted for by the learners’ use of conventionalized multi-word phrases rather than by the application of rules relating to grammatical, semantic, and pragmatic relations. Before reporting the results of the study, the paper describes the use of articles in English, reviews the most relevant studies on the acquisition of articles by EFL learners, and discusses conventionality in language. It has to be noted that this study is exploratory in nature. The proposed interpretation of the EFL learners’ use and acquisition of articles calls for a fully-fledged investigation requiring much more data and space. Yet, certain tendencies can be highlighted, which will hopefully inspire further research in this area.

2. English article system

Articles are part of a larger class of determiners whose role is to indicate the reference of a noun. They convey the semantic categories of definiteness and specificity, but they also depend on the grammatical category of countability. The English article system includes two core members: the definite article the and the indefinite article a, with its variant an.
Many grammarians also distinguish the zero article (Quirk et al., 1985; Biber et al., 1990, among others), and some others go even further and propose two categories of zero and null articles (Palmer 1939); however, these two categories are not realized lexically in English.1 The indefinite article indicates a reference to an indefinite identity (specific or non-specific), and can also be used to make a generic reference. It is used only with singular countable nouns. The same functions are also performed by the zero article, which is used with plural countable nouns and uncountable nouns. The definite article is used for definite or generic reference. The distribution of articles for the different kinds of reference is best summarized by Bickerton’s (1981) semantic wheel. He classifies noun phrases according to two categories of referentiality: [±specific reference] ([±SR]) and [±hearer knowledge] ([±HK]). The category of specific reference relates to the speaker’s intention to refer to a specific entity, while the category of hearer knowledge refers to the speaker’s assumption about the hearer’s ability to infer the referent. These two categories combine in a semantic wheel to produce four types of NP environments (Figure 1).

1 This paper will follow the description of the article system proposed by Quirk et al. (1985) and Biber et al. (1990), and will not differentiate between the zero and null articles.

Figure 1. Bickerton’s (1981) semantic wheel for NP reference (adapted from Huebner, 1983)

The list of four types of NP environments proposed by Bickerton (1981) and Huebner (1983) was later extended by Thomas (1989), who distinguished another NP type encompassing idioms and other conventional uses. In this type of noun phrase, the article is not used to mark a specific type of reference, but its use is motivated by convention. The five types of NP contexts are summarized in Table 1 (adapted from Díez-Bedmar & Papp, 2008). It has to be noted that the taxonomy is only an approximation; it does not fully exhaust all possible contexts for the use of articles but captures the most prototypical ones.

Table 1. Taxonomy of the English article system (adapted from Díez-Bedmar and Papp, 2008)

Type 1
Features: [−SR, +HK]
Environment: generic nouns
Articles: a, the, Ø
Examples: Ø Fruit flourishes in the valley. Ø Elephants have trunks. The Grenomian is an excitable person. They say the elephant never forgets. A paper clip comes in handy. An elephant never forgets.

Type 2
Features: [+SR, +HK]
Environment: referential definites; previous mentions; specified by entailment; specified by definition; unique in all contexts; unique in a given context
Articles: the
Examples: Pass me the pen. The idea of coming to the UK was ... I found a book. The book was ... The first person to walk on the moon ...

Type 3
Features: [+SR, −HK]
Environment: referential indefinites; first-mention nouns
Articles: a, Ø
Examples: Chris approached me carrying a dog. I’ve bought a new car. A man phoned. I keep sending Ø messages to him. I’ve got Ø friends in the UK. I’ve managed to find Ø work.

Type 4
Features: [−SR, −HK]
Environment: non-referential nouns; attributive indefinites; non-specific indefinites
Articles: a, Ø
Examples: Alice is an accountant. I need a new car. I guess I should buy a new car. A man is in the ladies, but I haven’t seen him. Ø Foreigners would come up with a better solution.

Type 5
Environment: idioms; other conventional uses
Articles: a, the, Ø
Examples: All of a sudden, he woke up. In the 1950s, there weren’t many cars. His family is now living Ø hand to Ø mouth.

3. L2 acquisition of the English article system

The L2 acquisition of the English article system has been widely studied within different theoretical frameworks (cf. García-Mayo & Hawkins, 2009 and Bowles et al., 2009 for generative accounts). The aim of various research projects has been (a) to establish the order in which the three English articles appear accurately in obligatory contexts in learner production; (b) to establish the patterns of overuse and underuse of individual articles in interlanguage; and (c) to propose an explanation or a model of the acquisition process. One common tradition of investigating the emergence of articles uses Huebner’s (1983) taxonomy to trace the patterns of emergence of the three articles in the four NP contexts. Such studies attempt not only to trace the order of the development of article use in interlanguage but also to establish which contexts appear to be particularly problematic for L2 learners.
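The four feature-defined NP environments of the taxonomy can be expressed as a small lookup table keyed by the two binary features. The following Python sketch is illustrative only: the names NPType, WHEEL, classify_np, and article_allowed are hypothetical and do not come from the studies cited, and Type 5 (idioms and other conventional uses) is left out because it is not defined by the [±SR] and [±HK] features.

```python
from typing import NamedTuple, Tuple


class NPType(NamedTuple):
    number: int                 # Type 1-4 in the taxonomy
    environment: str            # short label taken from Table 1
    articles: Tuple[str, ...]   # articles listed for this environment


# Keys are (specific_reference, hearer_knowledge), i.e. ([±SR], [±HK]).
WHEEL = {
    (False, True): NPType(1, "generic nouns", ("a", "the", "Ø")),
    (True, True): NPType(2, "referential definites", ("the",)),
    (True, False): NPType(3, "referential indefinites", ("a", "Ø")),
    (False, False): NPType(4, "non-referential nouns", ("a", "Ø")),
}


def classify_np(specific_reference: bool, hearer_knowledge: bool) -> NPType:
    """Map a feature combination to its NP environment."""
    return WHEEL[(specific_reference, hearer_knowledge)]


def article_allowed(article: str, specific_reference: bool,
                    hearer_knowledge: bool) -> bool:
    """Check whether an article is listed for the given environment."""
    return article in classify_np(specific_reference, hearer_knowledge).articles


if __name__ == "__main__":
    # "I found a book." -> first mention: [+SR, -HK]
    print(classify_np(True, False))
    # "The book was ..." -> previous mention: [+SR, +HK]
    print(article_allowed("the", True, True))
```

A lookup of this kind only covers the prototypical contexts; as the text notes, the taxonomy is an approximation and does not exhaust all contexts of article use.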
Prior studies show that L2 acquisition of articles is problematic overall, and the system is not fully mastered even by fairly advanced L2 learners. Many apparently more complex morphosyntactic features of English (such as the tense and aspect system) stabilize earlier in learner production, compared to the article system. Master (2002: 332) proposes three reasons for the difficulty of the article system:

(a) Articles are among the most frequently occurring function words in English, making continuous rule application difficult over an extended stretch of discourse.
(b) Function words are normally unstressed and consequently very difficult, if not impossible, for a non-native speaker to discern, thus affecting the availability of input in the spoken mode.
(c) The article system stacks multiple functions onto a single morpheme, a considerable burden for the learner, who generally looks for a one-form-one-function correspondence in navigating the language until the advanced stages of acquisition.

Ekiert (2004) observes that an additional reason for the difficulty of articles is the tendency of L2 learners to process the language for meaning, during which they overlook function words such as articles.

Master (1987, as quoted in Master, 1997) was among the first researchers to point out that the acquisition of the English article system differs for EFL learners with different L1s. The article system seems to pose more problems for those learners whose mother tongues do not have an article system (such as Polish or Chinese). However, even students whose first languages possess an article system (such as Spanish) were found to have problems with the accurate rendering of definiteness and specificity in English (Díez-Bedmar, 2010; Díez-Bedmar and Pérez Paredes, this volume). Prior studies show that learners with [+ART] mother tongues generally display a greater degree of accuracy in the use of articles than learners with [−ART] L1s (Thomas, 1989; Díez-Bedmar and Papp, 2008). The interlanguage of [+ART] L1 students demonstrates that a/an and the are mastered earlier than the zero article. There is less agreement among researchers about the order of accuracy for the definite and indefinite articles, with Díez-Bedmar and Papp (2008) postulating the a > the sequence, and Thomas (1989) proposing the the > a sequence.

For learners with a [−ART] mother tongue, the zero article dominates during the early stages of acquisition, and its overuse persists until advanced stages of proficiency (Parrish, 1987; Master, 1997; Díez-Bedmar & Papp, 2008). However, this conclusion is problematic given the fact that the zero article is not realized lexically or morphologically; thus, it is impossible to distinguish between the use of the zero article and the failure to supply any article. Thus, Master (1997) concludes that “acquisition is largely by default” (p. 216); and this concern is also expressed by Parrish (1987) and Thomas (1989).
Studies on the acquisition of the definite article the and the indefinite article a by this group of learners also show rather conflicting results. Some studies (Parrish, 1987; Thomas, 1989; Master, 1997; Li & Yang 2010) demonstrate that the definite article is integrated into the interlanguage before the indefinite article. On the other hand, Young (1996), Liu and Gleason (2002), and Díez-Bedmar and Papp (2008), who looked at the acquisition of articles by Czech and Slovak, and Chinese EFL learners respectively, propose that the indefinite article is used accurately from the early stages of acquisition and that the definite article is mastered rather late. Researchers seem to agree on the existence of a period of the-flooding in which learners overuse the definite article in all the contexts, before they start to use it accurately for marking definite reference (Huebner, 1983; Young, 1996; Master, 1997).

Most of the studies discussed so far attempted to establish the NP contexts (as defined in Bickerton’s (1981) semantic wheel) that posed particular problems for L2 learners. Here also, the results differ between learners with [+ART] and [−ART] languages; however, the general observation is that the referential definites (Type 2) requiring the use of the in [+SR, +HK] contexts are the easiest to acquire (Li and Yang 2010), while the use of the three articles in the generic context (Type 1) seems to be problematic (Ekiert, 2004; Li & Yang, 2010; Crompton, 2011; Díez-Bedmar and Pérez Paredes, this volume). For instance, Díez-Bedmar and Papp (2008) find that both Spanish and Chinese students overused the zero article with generic nouns; Ekiert (2004) also demonstrates that Polish EFL and ESL learners overused the zero article (or more likely omitted the article). Moreover, Thomas (1989) and Díez-Bedmar and Papp (2008) claim that the overgeneralization of the definite article is a common phenomenon in specific indefinite contexts, and also in non-specific indefinite contexts (Types 3 and 4). Parrish (1987) and Thomas (1989) also find that the was overused in the [+SR, −HK] (Type 3) and [−SR, +HK] (Type 1) environments, but not in the [−SR, −HK] (Type 4) environments.

What is particularly relevant to the present study is the fact that all but two studies exclude the conventional uses of articles from their analysis. Even though they acknowledge the existence of the contexts in which the use of an article is motivated by convention rather than by the referentiality and specificity of a given NP, they omit the idiomatic uses of articles from their results (Butler, 2002; Díez-Bedmar & Papp, 2008) or the discussion (Thomas, 1989).
However, Thomas (1989) remarks that a high accuracy level for one particular use of the indefinite article in the [+SR, −HK] context may, in fact, be due to the learners acquiring the structure there is a/an as a single chunk. The only two studies that include the idiomatic uses of articles in their results and analyses are Ekiert (2004) and Li and Yang (2010). These studies demonstrate that this type of article application is particularly problematic for Polish and Chinese EFL learners, respectively, and is among the most frequent sources of errors, together with the generic context. Li and Yang (2010) also demonstrate that the idiomatic use achieves high accuracy levels only at very advanced stages of proficiency. Both Ekiert (2004) and Li and Yang (2010) used a gap-filling elicitation technique to collect their data. The idiomatic items in their tests include:

‘in the 1960s’
‘thrown out of Ø work’
‘living Ø hand to Ø mouth’
‘all of a sudden’
‘flies in the face’
‘in Ø front’
‘game of Ø cat and Ø mouse’
‘getting Ø cold feet’
‘in Ø space’
‘a pain in the neck’
‘in the mood’
‘poor as a mouse’

What should be noted is that most of these expressions are true idioms or set phrases, and, except for in the 1960s, they are not productive.

4. Conventionality and probability in language

All the studies discussed in the previous section assumed that the use of articles in discourse is rule-governed, i.e., in generating each NP, a language user chooses among the, a/an, or the zero article based on his/her implicit or explicit knowledge of the structural, semantic, and pragmatic rules related to the expression of referentiality, specificity, and countability in English. Even if some researchers acknowledged that certain applications of articles were conventional and not rule-based, they discarded these uses as marginal.

However, it has long been recognized that language is not always produced by combining its smaller constituents according to syntactic rules. Sinclair (1990) advocates the existence of two principles governing the generation of language. On the one hand, during the production of a new utterance, the syntactic rules generate a sentence structure with empty slots marked for the part of speech, and these slots are filled out with words. Thus, the open choice principle asserts that language is a series of open choices performed at each slot and restricted only by the requirement of grammatical correctness and semantic consistency. At the same time, another principle is also responsible for language processing. The idiom principle was formulated by Sinclair (1990: 110) as follows:

A language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices even though they may appear to be analysable into segments.

According to Sinclair (1990), a great proportion of language is built from semi-preconstructed phrases that are stored as single units in the mental lexicon (or the phrasicon). These phrases are often canonical in terms of their grammatical structure and are semantically transparent, i.e., not idiomatic; yet, they represent conventionalized ways of encoding certain meanings. There are many different types of lexical phrases, such as phrasal verbs, collocations, idioms, and formulaic expressions (see Leńko-Szymańska, 1997 for an overview). They are characterized by different degrees of syntactic productivity, compositionality of meaning, continuity, and grammatical canonicity.

Lexical bundles, or n-grams, are one type of multi-word chunk. Biber et al. (1999) define them as extended collocations or sequences of three or more words that show a statistical tendency to co-occur, regardless of their idiomaticity and their structural status. Biber et al. (1999) assert that lexical bundles are not fixed expressions because they do not form single semantic units, and they are often not recognized as such by native speakers; yet, they constitute the basic building blocks of accurate, natural, and idiomatic language (pp. 989-991). In most cases, n-grams cut across phrasal and clausal boundaries. They can be composed of the beginning of a main clause followed by the beginning of an embedded clause (e.g., I don’t know why), or of a noun phrase followed by the preposition that typically introduces its complement (e.g., a reason for). Articles are also frequently a part of n-grams.

Lexical bundles are relatively frequent, and they occur in the language produced by different speakers and in different situations. They can occur with different frequencies: the most common lexical bundles appear as often as 1000 times per 1 million words. The average frequency of a three-word bundle is 25 times per 1 million words. Biber et al. (1999) propose that the cut-off point for a four-word sequence should be 10 per 1 million tokens, and that these occurrences have to be spread across at least five different texts in a register. The cut-off point for 3-grams should be relatively higher as these sequences are more common and relatively lower for the less common 5- and 6-grams (Biber et al., 1999: 990-991).

The concept of n-grams was particularly investigated within the framework of cognitive science.
This gave rise to probabilistic accounts of language that claim that “grammatical rules may be associated with probabilities of use, capturing what is linguistically likely, not just what is linguistically possible” (Chater & Manning, 2006: 336). The n-gram models of language aim to explain both language processing and acquisition through the language users’ and the language learners’ sensitivity to the frequencies of patterns in language (see Chater & Manning, 2006 for an overview). During the acquisition of a second language, learners become familiar with formulaic expressions, including lexical bundles, and process these expressions as single units in language production and comprehension.[2] Thus, it can be hypothesized that certain uses of articles in interlanguage are generated as parts of recurrent expressions, and are not a result of the

[2] See Wray (2002) for a detailed account of the acquisition and use of formulaic language by L2 learners.

The Role of Conventionalized Language in the Acquisition and Use of Articles by Polish EFL Learners


application of a rule (even if the use of an article in a given context complies with a particular rule). The aim of this study is to establish the extent to which the use of articles by native speakers and Polish learners of English as a foreign language at different proficiency levels can be explained as being motivated by the use of conventionalized language, more specifically lexical bundles. Prior studies demonstrated that the interlanguage of learners with [−ART] mother tongues (including Polish) would be characterized by the overuse of the zero article at each proficiency level. Since the interpretation of the use of the zero article is problematic (as it is not clear whether the learner applied a zero article or failed to use an article at all), the current study concentrates solely on the emergence of the definite article (the) and the indefinite article (a/an).

5. The study

5.1. Data and tools

The data used in this study were drawn from several corpora. Essays written by Polish learners at different proficiency levels were drawn from the ICCI and ICLE databases. The FLOB and FROWN corpora served as sources of native data. All these corpora are briefly described below.

The International Corpus of Crosslinguistic Interlanguage (ICCI) was described adequately in Chapter 2. Its Polish component consists of 751 essays (56 821 words in total) written by students from grades 4 to 12. Even though the corpus contains cross-sectional data, it has to be noted that the grades are only rough estimates of learners’ levels. Since the Polish educational system is not very uniform as far as teaching foreign languages is concerned, the successive grades do not necessarily represent higher levels of language proficiency. This is why the data were drawn only from three grades, namely, grade 6, grade 9, and grade 12, which mark important turning points in the Polish educational system: the end of the primary, the lower secondary, and the higher secondary schools.
They roughly represent the beginner, pre-intermediate, and intermediate levels. It has to be pointed out, however, that a lot of variability exists within each level. To complement the ICCI data, the Polish section of the International Corpus of Learner English (ICLE) was used as a source of data from advanced learners of English. The ICLE is a commercially available learner corpus compiled at Université Catholique de Louvain, Belgium. The corpus contains argumentative essays written by advanced learners of English from different mother tongue backgrounds. The essays are 500 words long on average, and they were written by third- and fourth-year university students in English departments from across the world. The first release of the corpus


Agnieszka LEŃKO-SZYMAŃSKA

Table 2. Summary of the data used in the study

Corpus    Level             No. of texts    Tokens
ICCI 6    Beginner                    84      4505
ICCI 9    Pre-intermediate           101      6768
ICCI 12   Intermediate                85     12361
ICLE      Advanced                   363    230658
FLOB      Native                            999652
FROWN     Native                           1005948

consists of eleven sections, each corresponding to a different first language, containing about 200,000 running words (equivalent to about 400 essays). The Polish component comprises 363 essays, which amount to 230 065 tokens.

The learner production was compared with native data taken from the Freiburg-London-Oslo-Bergen (FLOB) and Freiburg Brown (FROWN) corpora. Both are commercially available reference corpora of British and American English, respectively, which match each other in size and composition. Both include 1 million running words, and contain published texts from the beginning of the 1990s. Each corpus consists of fifteen sections (marked with letters A to R) that correspond to different genres of written language. Since Polish learners of English are exposed to both varieties of English, the two corpora were assumed to form together a good representation of the native English language in its most common variants; therefore, they were consulted jointly in the study. The reason why these smaller reference corpora were chosen over larger representations of the English language, such as the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA), is that the data analysis required multiple comparisons of tens of thousands of lexical bundles, which could be done automatically using the WordSmith Tools package for the smaller corpora, but would have required individual consultations for the large corpora. It was assumed that the size of 2 million tokens was sufficient to trace the occurrences of frequent lexical bundles.

In addition to the WordSmith Tools package, a program called Collocate was used to generate lists of lexical bundles from the learner and native corpora. Table 2 presents a summary of the data used in the study. The sets of learner data will be henceforth referred to as corpora, even though they are only sections of larger databases.

5.2. Analysis

As the initial step in data analysis, the frequencies of the definite and indefinite articles in the corpora were tabulated. Next, lists of all three-word


combinations containing the articles the and a/an were generated for each learner corpus using Collocate. The statistical measure used in the generation of the lists was Mutual Information, and the thresholds were set very low to include all three-word sequences in the lists. Thus, the lists contained inventories of three-word sequences with the definite and indefinite articles that occurred at least once in a given learner corpus, such as I have a, go to the, pizza the best, and an email then. A list of three-word lexical bundles containing definite and indefinite articles was also generated for the native corpus. Finally, the lists of three-word sequences in the learner data were compared against the native corpus in order to detect those combinations that in fact function as lexical bundles in English. Since the reference corpus was small, a slightly lower cut-off point was adopted for the identification of n-grams. A three-word sequence had to occur at least twenty times in the native corpus, i.e., 10 times per 1 million words (which was the threshold that Biber et al. (1999) proposed for 4-grams). The analysis concentrated solely on three-word bundles, and disregarded longer sequences. This was done because the production of lower-level students contained very few instances of longer phraseological units. Additionally, four- and five-word bundles are built of smaller sequences (3-grams); thus, any article that was part of a longer bundle would anyway be included in the lists of 3-grams. Since the focus of the analysis was not the development of phraseology in interlanguage, but detecting which uses of articles are conventional, the analysis of 3-grams seemed sufficient to account for these uses. However, it had to be borne in mind during the analysis of the frequencies of articles within bundles that sometimes two 3-grams could instantiate a single use of an article. 
For instance, at the end and the end of form one 4-gram, at the end of, with a single use of the definite article. Moreover, the two 3-grams I have a and a lot of can be combined in the phrase I have a lot of, which in itself is not a lexical bundle but still accounts for only one occurrence of the article a. This is why the number of articles in phraseological units did not correspond to the number of 3-gram tokens in a corpus, and had to be corrected. This was done manually for the learner corpora; for the native corpus, the frequencies of recurrent four-grams were subtracted from the total frequencies of their constituent 3-grams. Such a procedure might have led to the omission of certain combinations of 3-grams whose frequency as a four-gram was below the cut-off point. Therefore, the corrected numbers for the native corpus were an approximation rather than a precise count.
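The procedure described in this section can be illustrated with a minimal sketch in Python. It is not the Collocate or WordSmith Tools workflow used in the study; the function names, the lowercased whitespace tokenization, and the toy data are assumptions made purely for illustration. The sketch lists the three-word sequences that contain an article, keeps those that reach the cut-off of 20 occurrences in a reference corpus, and counts each article only once even when it falls inside two overlapping 3-grams.

```python
from collections import Counter

ARTICLES = {"the", "a", "an"}  # assumption: tokens are already lowercased


def trigrams(tokens):
    """Yield (start_index, trigram) for every three-word window."""
    for i in range(len(tokens) - 2):
        yield i, tuple(tokens[i:i + 3])


def article_trigram_counts(tokens):
    """Count every three-word sequence that contains at least one article."""
    return Counter(tg for _, tg in trigrams(tokens) if ARTICLES & set(tg))


def select_bundles(reference_counts, min_count=20):
    """Keep the 3-grams that reach the cut-off in the reference corpus."""
    return {tg for tg, n in reference_counts.items() if n >= min_count}


def conventional_article_positions(tokens, bundles):
    """Return the positions of articles that lie inside at least one bundle.

    Positions are collected in a set, so an article shared by two
    overlapping 3-grams (e.g. 'at the end' and 'the end of') counts once.
    """
    covered = set()
    for i, tg in trigrams(tokens):
        if tg in bundles:
            covered.update(j for j in range(i, i + 3) if tokens[j] in ARTICLES)
    return covered


if __name__ == "__main__":
    # Hypothetical toy data standing in for the native and learner corpora.
    reference = ("at the end of the day " * 20).split()
    learner = "i was at the end of a road".split()

    bundles = select_bundles(article_trigram_counts(reference), min_count=20)
    covered = conventional_article_positions(learner, bundles)
    total_articles = sum(tok in ARTICLES for tok in learner)
    print(f"{len(covered)} of {total_articles} articles occur inside bundles")
```

Counting article positions rather than 3-gram tokens is one way of obtaining the corrected figure directly; the manual correction and the four-gram subtraction described above pursue the same goal by other means.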


Table 3. Frequencies of the and a/an in the corpora

Corpora                     the     %[3]     a/an     %[3]
ICCI 6                       63    1.40%       75    1.66%
ICCI 9                      137    2.02%      140    2.07%
ICCI 12                     446    3.61%      259    2.10%
ICLE                      12197    5.28%     5663    2.45%
FLOB + FROWN (Native)    126297    6.30%    53570    2.67%

Table 4. Results of the Log Likelihood tests of significance

                        stat. sign.        LL    p
the
ICCI 6 vs. ICCI 9            *           6.15    p < 0.05
ICCI 9 vs. ICCI 12           *          38.44    p < 0.0001
ICCI 12 vs. ICLE             *          70.95    p < 0.0001
ICLE vs. Native              *         356.14    p < 0.0001
a/an
ICCI 6 vs. ICCI 9                        2.35    p > 0.05
ICCI 9 vs. ICCI 12                       0.01    p > 0.05
ICCI 12 vs. ICLE             *           6.53    p < 0.05
ICLE vs. Native              *          37.21    p < 0.001

5.3. Results

Table 3 presents the frequency of the articles the and a/an in the corpora. Overall, both articles are underused by the Polish learners at each level, but this underuse diminishes with increasing proficiency. The standardized frequencies of the definite and indefinite articles stay at the same level for beginner and pre-intermediate learners. However, since a/an is half as frequent in English as the, the underuse of a/an is not as pronounced for lower levels (almost two-thirds of the native use) as the underuse of the, whose frequency represents less than a quarter of the native norm. The frequencies of both articles rise steadily across proficiency levels; however, given the fact that the frequency of a/an is relatively high even at the beginner level, the growth from stage to stage is either minimal or statistically non-significant. The frequencies of both articles come close to the native norm at the advanced stage; however, the definite article in particular remains underused by Polish advanced learners. Table 4 presents the results of the Log Likelihood test examining the statistical significance of the differences among the levels.

[3] Percentage of the total number of words in a corpus.
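The chapter does not state which formulation of the Log Likelihood statistic was used. The sketch below assumes the two-corpus formulation that is common in corpus linguistics, in which the observed frequency of an item in each corpus is compared with the frequency expected if the item were distributed in proportion to corpus size. The function name and its parameters are illustrative. Applied to the counts of the in ICCI 6 and ICCI 9 from Table 3 (63 in 4505 tokens vs. 137 in 6768 tokens), it gives a value of approximately 6.15.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Two-corpus log-likelihood for one item (assumed formulation).

    freq1, freq2: occurrences of the item in corpus 1 and corpus 2
    size1, size2: total number of tokens in corpus 1 and corpus 2
    """
    total_freq = freq1 + freq2
    total_size = size1 + size2
    # Expected frequencies if the item were spread proportionally to size.
    expected1 = size1 * total_freq / total_size
    expected2 = size2 * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


if __name__ == "__main__":
    # Counts taken from Table 3: 'the' in ICCI 6 vs. ICCI 9.
    print(round(log_likelihood(63, 4505, 137, 6768), 2))
```

Values above 3.84 correspond to p < 0.05 with one degree of freedom.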


Table 5. Frequencies and proportions of the conventional uses of the definite article

1            2         3        4         5            6
Corpus       the       Types    Tokens    Corrected    %
ICCI 6        63        13         22        21        33%
ICCI 9       137        24         34        32        23%
ICCI 12      446        85        142       121        27%
ICLE       12197       679       4962      4225        35%
Native    126297       922      38749     36650        29%

Table 6. Frequencies and proportions of the conventional uses of the indefinite article

1            2         3        4         5            6
Corpus       a/an      Types    Tokens    Corrected    %
ICCI 6        75         6         10         9        12%
ICCI 9       140        13         23        23        16%
ICCI 12      259        40         79        73        28%
ICLE        5663       179       1427      1321        23%
Native     53570       224       9902      9256        17%

The frequencies are not indicative of the accuracy with which learners use the articles; in fact, it might often be the case that an article is used incorrectly. Yet, the steady growth in frequencies suggests that learners become more and more sensitive to the frequency with which the two articles are used in the target language.

Tables 5 and 6 present the frequencies of three-word bundles containing the definite and indefinite articles, respectively. The second column in both tables repeats the number of occurrences of the and a/an in each corpus. The numbers in columns 3 and 4 report the frequencies of 3-grams (types and tokens, respectively). Column 5 presents the number of articles that are part of three-word bundles. This number is lower than the number of 3-gram tokens as some occurrences of articles make up two (or even three) 3-grams simultaneously (as discussed in the previous section). Finally, column 6 reports the proportion of the conventional use of articles in relation to the total number of articles in every corpus.

Several conclusions can be drawn from an analysis of these tables. First, the use of articles within lexical bundles is not as pervasive a phenomenon in English as was expected at the outset of the study. The frequencies in the native corpus show that n-grams are responsible for around 30% of the instances of the, and 17% of the instances of a/an. Still, 70% of the uses of the and over 80% of the uses of a/an are sanctioned


Table 7. Standardized frequencies of conventional and rule-based uses of the definite article

           Corpus size       the    Convent. use    per 10 000    Rule-based use    per 10 000
ICCI 6            4505        63              21            47                42            93
ICCI 9            6768       137              32            47               105           155
ICCI 12          12361       446             121            98               325           263
ICLE            230658     12197            4225           183              7972           346
Native         2005600    126297           36650           183             89647           447

Table 8. Standardized frequencies of conventional and rule-based uses of the indefinite article

           Corpus size      a/an    Convent. use    per 10 000    Rule-based use    per 10 000
ICCI 6            4505        75               9            20                66           147
ICCI 9            6768       140              23            34               117           173
ICCI 12          12361       259              73            59               186           150
ICLE            230658      5663            1321            57              4342           188
Native         2005600     53570            9256            46             44314           221

semantically and pragmatically. The indefinite article is part of recurrent word combinations much less frequently than the definite article in native English. Every third to fourth occurrence of the is within a lexical bundle, as opposed to every sixth occurrence of a/an in a 3-gram. This imbalance is reflected in the conventionalized uses of both articles by Polish EFL learners. The proportions of the occurring in lexical bundles are generally higher than the proportions of a/an across the levels. Moreover, with increasing proficiency and the increased frequency of article use, learners tend to rely more and more on the conventional use of the articles. The proportions of conventional uses of both articles are above the native norm, particularly at the advanced level.

By taking into consideration only the proportions of conventionalized uses of articles in the overall occurrences of the and a/an in each corpus, the general underuse of articles is excluded from the picture. However, if the standardized frequencies of conventional and rule-based uses of the articles are considered, new facts emerge. These frequencies are presented in Tables 7 and 8. An analysis of Tables 7 and 8 shows that conventional and rule-based occurrences of the articles are both underrepresented at the lower levels of proficiency. Both uses grow with years of learning; yet, the conventional selections achieve and even surpass native-like frequencies (for the and a/an, respectively), while the rule-based occurrences remain underused by advanced learners. Thus, while both uses of articles are equally responsible


Table 9. Twenty most frequent 3-grams containing the definite article the, together with their frequencies

ICCI 6: go to the, 10; in the summer, 2; went to the, 1; the most important, 1; at the end, 1; the end of, 1; end of the, 1; get to the, 1; for the last, 1; end of the, 1; at the moment, 1; and in the, 1; and at the, 1

ICCI 9: go to the, 4; of the best, 3; in the morning, 3; I think the, 3; the end of, 2; what is the, 1; the story of, 1; the rest of, 1; the problem is, 1; the man who, 1; the first time, 1; the door to, 1; one of the, 1; on the front, 1; left in the, 1; is the most, 1; is on the, 1; is in the, 1; in the world, 1; in the summer, 1

ICCI 12: one of the, 14; go to the, 9; of the best, 7; all the time, 5; the fact that, 4; in the end, 3; in the same, 3; is on the, 3; it is the, 3; of the most, 3; around the world, 2; away from the, 2; based on the, 2; came to the, 2; end of the, 2; in the future, 2; of the great, 2; say that the, 2; the end of, 2; the most important, 2

ICLE: one of the, 137; the fact that, 135; on the other, 104; the other hand, 92; the right to, 74; in the world, 67; at the same, 65; the same time, 65; of the world, 60; it is the, 53; to be the, 50; the most important, 49; the development of, 48; the number of, 43; all over the, 41; aware of the, 39; the influence of, 39; the problem of, 39; over the world, 38; of the most, 37

Native: one of the, 719; the end of, 337; out of the, 332; the united states, 326; part of the, 293; some of the, 288; the fact that, 254; end of the, 229; the u s, 210; the number of, 199; on the other, 197; most of the, 191; the rest of, 190; the use of, 181; in the first, 180; it was the, 171; at the same, 161; the first time, 155; the same time, 153; to be the, 152

for the underuse of articles at the lower proficiency levels, it is only the rule-based selections that remain underused at the advanced level. The overreliance of the intermediate and advanced learners on conventionalized word combinations containing a/an is notable, especially since in native writing these combinations are less varied than those containing the (224 vs. 922 types).

Again, not all the conventional uses of articles are correct. The students sometimes use a lexical bundle in contexts where it is not appropriate, as in Example 1 below, drawn from Grade 12:

(1) There is a much comteinations [combinations]. icci_pol0705

In fact, such errors are even better evidence of the learners’ use of articles within ready-made chunks of language without reference to a semantic, pragmatic, or structural context. Tables 9 and 10 present the twenty most frequent 3-grams in each corpus, together with their frequencies. For the corpora ICCI 6 and ICCI 9, these lists contain all the lexical bundles that occur in the corpora.
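The figures reported in Tables 5 to 8 follow from simple arithmetic on three numbers per corpus: the corpus size, the total number of occurrences of an article, and the corrected number of occurrences inside lexical bundles. The sketch below, in which the function name and the dictionary keys are illustrative assumptions, treats every article outside a bundle as rule-based and standardizes both counts per 10 000 words. The input values are those given in Table 7 for the definite article in ICCI 6.

```python
def usage_profile(corpus_size, total_articles, conventional):
    """Derive the proportion and standardized frequencies for one corpus.

    corpus_size:     number of tokens in the corpus
    total_articles:  all occurrences of the article in the corpus
    conventional:    corrected number of occurrences inside lexical bundles
    """
    rule_based = total_articles - conventional  # everything outside bundles
    return {
        "conventional_share": conventional / total_articles,
        "conventional_per_10k": conventional / corpus_size * 10_000,
        "rule_based": rule_based,
        "rule_based_per_10k": rule_based / corpus_size * 10_000,
    }


if __name__ == "__main__":
    # 'the' in ICCI 6: 4505 tokens, 63 occurrences, 21 inside bundles.
    profile = usage_profile(4505, 63, 21)
    for key, value in profile.items():
        print(key, round(value, 2))
```

Rounded to whole numbers, the result matches the ICCI 6 row of Table 7: 47 conventional and 93 rule-based uses per 10 000 words, with a conventional share of 33%.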


Table 10. Twenty most frequent 3-grams containing the indefinite article a/an, together with their frequencies

ICCI 6: a lot of, 4; this is a, 1; will be a, 1; I have a, 1; have a lot, 2; to buy a, 1

ICCI 9: a lot of, 8; I have a, 3; this is a, 2; to become a, 1; they have a, 1; it with a, 1; it was a, 1; it is a, 1; is a very, 1; he was a, 1; has been a, 1; an example of, 1; a bit of, 1

ICCI 12: a lot of, 15; this is a, 5; he is a, 5; to find a, 3; to be a, 3; there is a, 3; it is a, 3; is a good, 3; I have an, 2; to have a, 2; they have a, 2; is also a, 2; I have a, 2; he was a, 2; a group of, 2; would be a, 1; to make a, 1; to become a, 1; to be an, 1; there are a, 1

ICLE: a lot of, 87; there is a, 70; it is a, 70; as a result, 59; to be a, 53; in such a, 37; to have a, 33; is not a, 30; a number of, 27; a kind of, 27; a result of, 25; a matter of, 24; a means of, 23; a source of, 21; a member of, 19; it is an, 18; in a way, 18; as a means, 18; of such a, 16; a part of, 16

Native: there is a, 298; it was a, 281; there was a, 263; a number of, 258; to be a, 240; a lot of, 231; it is a, 195; a couple of, 141; he was a, 137; as a result, 135; more than a, 128; a series of, 122; to have a, 110; a variety of, 109; a matter of, 105; this is a, 99; to make a, 95; a kind of, 91; had been a, 89; would be a, 87

Most of the lexical bundles in Tables 9 and 10 represent highly conventionalized uses of the articles. However, it has to be pointed out that some 3-grams did not exempt a language user from making a semantically and pragmatically motivated choice. For example, the bundles it was a and it was the are among the most frequent combinations in the native language; the first was only a little more frequent than the latter (281 vs. 171).

5.4. Discussion

Even though the learners’ use of the zero article was not addressed directly in this study, the analysis of the frequencies of the and a/an confirms the earlier observations made by several researchers that the zero article dominates the language production at lower levels of proficiency. This conclusion is, in Master’s (1997) terms, “by default”. The fact that the learners underused the and a/an implies that they overused the zero article, although in fact it is impossible to distinguish between a learner’s deliberate use of the zero article and a failure to supply any article at all. The results also seem to lend support to the a > the hierarchy of difficulty of the other two articles (Young, 1996; Liu & Gleason, 2002; Díez-Bedmar & Papp, 2008). Even though the accuracy of the two articles was not analyzed in this study, the frequencies seem to indicate that the indefinite article was


integrated in the interlanguage much earlier than the definite article. The learners’ awareness of the existence of recurrent word combinations containing articles was also demonstrated in this study. Butler (2002) observes that students form idiosyncratic rules regulating the use of articles. Quite a lot of the idiosyncratic rules that were identified in her study could be traced back to the learners’ sensitivity to the occurrence of articles in certain lexical contexts:

    To make sense of what they observed in terms of actual English article usage, some learners tried to find a solution by hypothesizing word-article collocational rules. [...] words that belonged to different word classes (e.g., prepositions, nouns, verbs, adjectives, and adverbs) were reported by the learners to have certain collocational relationships with articles. Among them, the most frequently mentioned pseudo collocational rules were those involving prepositions, though it is not entirely clear why this was so. One possibility might be the relatively high frequency with which prepositions appear with articles in English discourse. [...] Furthermore, it appears that it was not easy for learners to discard such collocational rules once they became accustomed to using them, especially when they thought that there were no other evident rules to rely on. Even some of the advanced learners expressed their belief in certain nongeneralizable word-article collocations. (Butler, 2002: 468)

Butler (2002) calls this sensitivity to lexical contexts a pseudocollocational knowledge, thus implying that it acts as a poor substitute for the native speaker’s true knowledge of the rules that regulate the use of articles. However, the cognitive accounts of language processing that give prominence to the probabilistic rules of language suggest that native speakers also frequently resort to collocational knowledge during language production. According to Butler (2002), the learners who were particularly prone to making frequency-sensitive hypotheses were the students at the higher (but not the highest) proficiency levels. This observation is in keeping with the results obtained in this study, in which the over-representation of the conventionally-motivated article selections was demonstrated for intermediate and advanced learners.[4] Finally, the results of the study contradict Ekiert’s (2004) and Li and Yang’s (2010) observations that the idiomatic uses of articles are particularly

[4] It can be presumed that Butler’s (2002) advanced learners were at higher levels of language proficiency than the advanced learners in this study. They lived and studied in the United States, whereas the Polish learners in this study lived and studied in their native country.


problematic for EFL learners. The results of the present study suggest that 35% of the occurrences of the and 23% of the occurrences of a/an fell within conventionalized units, and that the frequency of these uses surpassed the native frequencies. Such a discrepancy could be a result of the different definition of idiomatic use that was adopted in the present study. While the earlier studies focused on true idioms that were highly conventionalized but very infrequent word combinations, this study defined phraseology as “frequently recurring word combinations.” If students are sensitive to the frequencies of word combinations, then idioms can be problematic, since they are not very frequent in the input. The use of articles within idioms cannot be explained by semantic and pragmatic rules, and this further complicates their processing.

To illustrate the extent to which the knowledge of lexical bundles could be responsible for the use of articles in learner production, an essay from the Polish component of the ICCI (Grade 11) is provided as a sample.

(2) One of the most important think in my life is food. Every day I eat somethink about 3 meals. First is breakfast. I love eating scrumble eggs, sandwiches with ham or cornflakes with milk. Second breakfast I eat at school. Most of time it’s sandwich or apple. When I back home, I eat diner. I’m big fan of potatoes, so they are usually have them. I’m not vegetarian, I love meat, but also each kinds of fishes. When I finish my homework, I always have supper. I usually make soup or sandwiches for hole family. I love eating snacks and sweets. Sometime I bake my favorite cake at the weekand for my friends, and next we eat at all. I hate fast food, because for me it isn’t food, but ENT036 rubish ENT036. I think that is better to buty some healthy vegetables and fruits, make salad, than pay for “nothing” like hamburger or chips. icci_pol0672

The learner failed to use articles in this text, except in two lexical bundles.[5] Even though these uses can be explained in reference to the semantic and pragmatic context, it seems unlikely that this was the learner’s motivation in his/her selection. A more probable explanation is that the student was still at a very early stage in the development of the article system, characterized by a heavy overuse of the zero article (or more likely, by the failure to use articles). However, he/she already has a grasp of frequent lexical bundles, and applies this knowledge successfully in his/her production.

[5] At the weekend is not, in fact, a lexical bundle according to Biber et al.’s (1999) definition, as it does not meet the criterion of frequency. However, it is still a frequently recurring word combination in English.


Finally, the limitations of the present study have to be acknowledged. First, the accuracy of the use of lexical bundles was not analyzed. Such an examination would certainly throw additional light on the processes of acquisition that are discussed here. Moreover, the zero article was not included in the analysis, even though it occurs in many lexical bundles. Tracing the frequencies of these bundles would also contribute to a better understanding of the acquisition of the English article system, even though analyzing the zero article is more troublesome (as discussed in the introduction). Lastly, since the use of articles within lexical bundles in many cases is still structurally, semantically, and pragmatically canonical, it can never be asserted with certainty which of the two motivations is responsible for each selection of an article. This can only be discovered through interviews with learners, when they can explain their reasons for their particular choices.

6. Conclusions

The study reported in this paper was exploratory in nature. It aimed to draw attention to the role of the conventional uses of language in the selection of articles by learners of English. By analyzing the overall frequencies of articles and the frequencies of articles in lexical bundles, the study demonstrated that students with increasing proficiency became increasingly sensitive to the frequencies of articles and their reoccurring lexical contexts, and the conventional uses of lexical combinations became increasingly responsible for the selection of articles in the interlanguage. The awareness of the existence of the idiomatic uses of articles was certainly not new; however, until now, researchers had assumed that idiomatic uses played a marginal role in the acquisition of the article system by EFL learners.
The present study demonstrated that even though a large proportion of the occurrences of articles were motivated by structural, semantic, and pragmatic rules related to the expression of referentiality, specificity, and countability in English, the role of conventional language in the acquisition and use of articles should not be underestimated.

References

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. 1999. Longman Grammar of Spoken and Written English. Harlow: Pearson Education Limited.
Bickerton, D. 1981. Roots of Language. Ann Arbor, MI: Karoma Press.


Bowles, M., T. Ionin, S. Montrul and A. Tremblay (eds). 2009. Proceedings of the 10th Generative Approaches to Second Language Acquisition Conference (GASLA 2009). Somerville, MA: Cascadilla Proceedings Project.
Butler, Y.G. 2002. “Second language learners’ theories on the use of English articles: an analysis of the metalinguistic knowledge used by Japanese students in acquiring the English article system”. Studies in Second Language Acquisition 24. 451-480.
Chater, N. and C.D. Manning. 2006. “Probabilistic models of language processing and acquisition”. TRENDS in Cognitive Sciences 10. 335-344.
Crompton, P. 2011. “Article errors in the English writing of advanced L1 Arabic learners: The role of transfer”. Asian EFL Journal. Professional Teaching Articles 50. 4-34.
Díez-Bedmar, M.B. 2010. “From secondary school to university: the use of the English article system by Spanish learners”. Exploring Corpus-based Research in English Language Teaching. B. Belles-Fortuno, M.C. Campoy and L. Gea-Valor (eds). Jaume: Publicacions de la Universitat Jaume I. Collecció Estudis Filològics. 45-55.
Díez-Bedmar, M.B. and S. Papp. 2008. “The use of the English article system by Chinese and Spanish learners”. Linking up Contrastive and Learner Corpus Research, G. Gilquin, S. Papp and M.B. Díez-Bedmar (eds). Amsterdam, Atlanta: Rodopi. 147-175.
Ekiert, M. 2004. “Acquisition of the English article system by speakers of Polish in ESL and EFL settings”. Teachers College, Columbia University Working Papers in TESOL & Applied Linguistics 4. 1-23.
García-Mayo, M. and R. Hawkins (eds). 2009. Second Language Acquisition of Articles: Empirical Findings and Theoretical Implications. Amsterdam: Benjamins.
Hakuta, K. 1976. “A case study of a Japanese child learning English as a second language”. Language Learning 26. 321-351.
Huebner, T. 1983. A Longitudinal Analysis of the Acquisition of English. Ann Arbor, MI: Karoma.
Jarvis, S. 2002. “Topic continuity in L2 English article use”. Studies in Second Language Acquisition 24. 387-418.
Leńko-Szymańska, A. 1997. The Structure of the Mental Lexicon and Its Implications for Teaching Foreign Language Vocabulary. Unpublished Ph.D. dissertation. University of Łódź.
Li, H. and L. Yang. 2010. “An investigation of English articles’ acquisition by Chinese learners of English”. Chinese Journal of Applied Linguistics 33. 15-31.


Liu, D. and J. L. Gleason. 2002. “Acquisition of the article “the” by nonnative speakers of English”. Studies in Second Language Acquisition 24. 1-26.
Master, P. 1987. A Cross-linguistic Interlanguage Analysis of the Acquisition of the English Article System. Unpublished Ph.D. dissertation, UCLA.
Master, P. 1997. “The English article system: Acquisition, function, and pedagogy”. System 25. 215-232.
Master, P. 2002. “Information structure and English article pedagogy”. System 30. 331-348.
Palmer, H.E. 1939. A Grammar of Spoken English on a Strictly Phonetic Basis (Second edition). Cambridge: Heffer.
Parrish, B. 1987. “A new look at methodologies in the study of article acquisition for learners of ESL”. Language Learning 37. 361-383.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (eds). 1985. A Comprehensive Grammar of the English Language. London: Longman.
Sinclair, J.M.H. 1990. Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Tarone, E. 1985. “Variability in interlanguage use: A study of style-shifting in morphology and syntax”. Language Learning 35. 373-403.
Thomas, M. 1989. “The acquisition of English articles by first- and second language learners”. Applied Psycholinguistics 10. 335-355.
Wray, A. 2002. Formulaic Language and the Lexicon. Cambridge: Cambridge University Press.
Young, R. 1996. “Form-function relations in articles in English interlanguage”. Second Language Acquisition and Linguistic Variation, R. Bayley and D.R. Preston (eds). Amsterdam: John Benjamins Publishing Company. 135-175.

Software

Barlow, M. 2004. Collocate 1.0: Locating collocations and terminology. Houston, TX: Athelstan.
Scott, M. 2008. WordSmith Tools version 5. Liverpool: Lexical Analysis Software.

The Use of Intensifying Adverbs in Learner Writing

Pascual PÉREZ-PAREDES and María Belén DÍEZ-BEDMAR

1. Introduction

Intensification is a kind of linguistic grading that adds expressive richness to one’s message. Partington (1993: 178) describes it as the desire to “exploit hyperbole”; it is “a vehicle for impressing, praising, persuading, insulting, and generally influencing the listener’s reception of the message.” De Devitiis, Mariani, and O’Malley (1989: 121), in their account of the grammar of concepts and communicative functions, devote a chapter to the notion of degree and the “language used to indicate different levels of intensity.” According to them, adverbs of degree qualify other adverbs, adjectives, or verbs by increasing or decreasing the degree of intensity of the word to which they refer.

However straightforward this qualifying function may seem, Philip (2007, 2008) identifies adverbs in general, and intensifying adverbs in particular, as problematic for learners of English: a small group of these adverbs is over-used by learners, while others are rare in learner data. The first group includes adverbs such as quite, really, and very; the second includes highly and deeply, among others. Downing and Locke (2006: 488) state that the former group can intensify almost any adjective, while the latter “are more limited to specific types of adjectives or to individual ones.”

Our research examines the use of intensifying adverbs in the written production of young (533,000 words). It included the production of learners in grades 5 to 10. The learners’ ages ranged from eleven (grade 5) to sixteen (grade 10). This section consisted of 226 descriptive essays on the topic “Which is your favourite film? What happens in it?” and 273 argumentative essays on the topic “Imagine you win the lottery. What would you choose to do with the money?” The learners who contributed their data to this section were enrolled either in the last two years of primary education (grades 5 and 6), or in the four years of compulsory secondary education (grades 7 to 10). Table 1 shows the number of words and the number of contributors per topic and grade. Our informants were kids in the final two years of primary education and the four years of compulsory secondary education. This sets our study apart from other learner language analyses, where the informants were young adults or adults. Our analysis followed an incidental design, as these learners could not possibly know or foresee that we would study their use of intensifying adverbs.

4. Results

4.1. Grade 5: 11-year-old students

Only 15 kids contributed their essays to the Spanish component of the ICCI corpus. At this stage, essay writing is not taught or extensively used, so the teachers were very reluctant to see their students’ writing tested or


Table 1. Number of words and contributors per essay topic and grade

Academic year                             Topic: Film                Topic: Money
                                          Contributors   Words       Contributors   Words
Grade 5: Primary education                8              271         7              126
Grade 6: Primary education                13             599         6              161
Grade 7: Secondary education, Year 1      52             2,892       82             2,750
Grade 8: Secondary education, Year 2      47             4,276       56             3,476
Grade 9: Secondary education, Year 3      46             4,008       96             4,419
Grade 10: Secondary education, Year 4     60             4,187       26             987
Total                                     226            16,233      273            11,919

assessed by a third party. This made it extremely difficult for the compilers of the corpus to gather a more significant number of samples. The mean of the word count in this group was 26.4 (SD = 12.8): 33.8 (SD = 11.2) for those who wrote on the topic related to film and 18 (SD = 8.9) for those learners who wrote on the topic related to money. Only two learners in the group that wrote on films used an intensifying adverb:

(1) Because [very] ENT036 zoothy ENT036. My favourite magician is Alex1 (icci_esp0005)
(2) Weverly Place vacaciones en el Carive " porque is [very] fun (icci_esp0006)
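The group means reported for each grade appear to follow directly from the counts in Table 1, as words divided by contributors. The sketch below is a minimal illustration under that assumption; the dictionary layout and the function names are ours, not part of the ICCI tooling, and the standard deviations cannot be recovered this way because they require the individual essay lengths.

```python
# Contributor and word counts per grade, copied from Table 1.
# Layout (an illustrative choice): grade -> ((film contributors, film words),
#                                            (money contributors, money words))
TABLE_1 = {
    5: ((8, 271), (7, 126)),
    6: ((13, 599), (6, 161)),
    7: ((52, 2892), (82, 2750)),
    8: ((47, 4276), (56, 3476)),
    9: ((46, 4008), (96, 4419)),
    10: ((60, 4187), (26, 987)),
}


def mean_words(contributors, words):
    """Mean essay length, assuming one essay per contributor."""
    return words / contributors


def grade_means(table):
    """Return the overall, film and money mean word counts for each grade."""
    means = {}
    for grade, ((film_n, film_w), (money_n, money_w)) in table.items():
        means[grade] = {
            "all": mean_words(film_n + money_n, film_w + money_w),
            "film": mean_words(film_n, film_w),
            "money": mean_words(money_n, money_w),
        }
    return means


def column_totals(table):
    """Sum each Table 1 column over the six grades."""
    film_n = sum(film[0] for film, _ in table.values())
    film_w = sum(film[1] for film, _ in table.values())
    money_n = sum(money[0] for _, money in table.values())
    money_w = sum(money[1] for _, money in table.values())
    return film_n, film_w, money_n, money_w


if __name__ == "__main__":
    for grade, row in grade_means(TABLE_1).items():
        print(f"Grade {grade}: all={row['all']:.2f} "
              f"film={row['film']:.2f} money={row['money']:.2f}")
    print("Column totals:", column_totals(TABLE_1))
```

Running the script prints one line of means per grade followed by the column sums, which can be set against the figures given in the running text.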

Thirteen kids out of fifteen did not use any kind of adverbial grading. None of the kids used any kind of general adverb in their writing.

4.2. Grade 6: 12-year-old students

Grade 6 is the last year of primary education in Spain. Essay writing was not used at this level, which explains the reluctance of schools to contribute to the ICCI corpus. At this level, 19 learners contributed to the ICCI Spanish sub-corpus. The mean of the word count in this group was 40.0 (SD = 13.2): 46.0 (SD = 9.1) for those who wrote on films and 26.8 (SD = 11.2) for the learners who wrote on the topic related to money. Eight degree adverbs were found in the grade 6 writings, all of them in the essays on films and all of them occurrences of very:

(3) heros to salve the world. This [very] good. I like.I dress up the (icci_esp0043)
(4) about one monkey but the monkey it’s [very] big. Some persons go to an island (icci_esp0045)

1 The transcriptions reflect the actual production of the learners.


Fourteen kids out of nineteen did not use any kind of adverbial grading at all. There were only two instances of general adverbs in their writing, namely, late and an instance of non-degree too.

4.3. Grade 7: 13-year-old students

Grade 7 is the first year of compulsory secondary education in the Spanish educational system. At this level, 134 learners contributed to the Spanish component of the ICCI corpus. The mean of the word count in this group was 42.1 (SD = 30.3): 55.6 (SD = 35.4) for those who wrote on films and 33.5 (SD = 23) for the learners who wrote on the topic related to money. Thirty degree adverbs were found in the grade 7 essays (eleven in the essays on films and nineteen in the essays related to money), all of which were occurrences of very:

(5) The assesin, a brilliant teacher, is a [very] difficult enemy for Sherlock and he goes with Watson (icci_esp0147)
(6) A blue sea ... white send and my [very] cold orange juice. After, We are go (icci_esp0075)

Adverbial grading was not used by 84.6% of the kids writing on films and 71% of those writing on money. At least 11.5% of the kids in the film group used an intensifying adverb, while 8.5% of the money group used it once. The Mann-Whitney test found no statistically significant differences between the film and the money groups with regard to the use of very (U = 2094.5, z = −0.283; p = 0.777). There were 31 instances of general adverbs in the learners’ writing in this sub-corpus, including six instances of non-degree too, four occurrences of soon, and two instances of always.

4.4. Grade 8: 14-year-old students

Grade 8 is the second year of compulsory secondary education in the Spanish educational system. At this level, 103 learners contributed their essays to the Spanish component of the ICCI corpus. The mean of the word count in this group was 75.26 (SD = 46.9): 90.9 (SD = 59.6) for those who wrote on films and 62 (SD = 26.6) for those learners who wrote on the topic related to money. Seventy-six intensifying adverbs were found in the grade 8 writings (35 in the film essays and 41 in the money essays), most of which were occurrences of very. For the first time in our sample, two were instances of too:


(7) This film is [very] beautiful and mex a history of love and (icci_esp0291)
(8) I went last summer and I ENT036 liKed ENT036 [too] much. I’ll ENT036 keep (icci_esp0212)

Adverbial grading was not used by 63.9% of the kids writing on films and 60.7% of those writing on the topic related to money. At least 17% of the kids in the film group used an intensifying adverb, while 18% used it once in the money group. The Mann-Whitney test found no statistically significant differences between the film and the money groups with regard to the use of very (U = 1269.5, z = −0.360; p = 0.719). There were 34 instances of general adverbs in the learners’ writing in this sub-corpus, including eight instances of also, six occurrences of much, and three instances of finally.

4.5. Grade 9: 15-year-old students

Grade 9 is the third year of compulsory secondary education in the Spanish educational system, and 142 learners at this level contributed to the Spanish component of the ICCI corpus. The mean of the word count in this group was 59.35 (SD = 40.3): 87.13 (SD = 46.9) for those who wrote on films and 46.03 (SD = 28.6) for those learners who wrote on the topic related to money. Seventy-five intensifying adverbs were found in the grade 9 writings (44 in the film essays and 31 in the money essays), most of which were occurrences of very; for the first time in our sample, two were instances of really, and one was an instance of so:

(9) she has lost of memory, but she is [really] funny. Marlin and Doris (icci_esp0420)
(10) I’m ENT036 [so] happy, beacuse last week I won the lottery (icci_esp0328)

Adverbial grading was not used by 50% of the kids writing on films and 77.2% of those writing the money essays. At least 28.3% of the kids in the film group used an intensifying adverb, while 17.6% used it once in the money group. The Mann-Whitney test found statistically significant differences between the film and the money groups with regard to the use of very (U = 1560, z = −3.52; p < .001). There were eighty-five instances of general adverbs in the learners’ writing in this sub-corpus, including 17 instances of non-degree too and also, 16 occurrences of finally, and six instances of always.

4.6. Grade 10: 16-year-old students

Grade 10 is the fourth and final year of compulsory secondary education in the Spanish educational system; 86 learners at this level contributed their


essays to the Spanish component of the ICCI corpus. The mean of the word count in this group was 60.16 (SD = 30.3): 69.78 (SD = 29.3) for those who wrote on films and 37.9 (SD = 18.9) for those learners who wrote on the topic related to money. Thirty-eight intensifying adverbs were found in the grade 10 writings (thirty-two in the film essays and six in the money essays), most of which were occurrences of very; two were instances of so, and one was an instance of really:

(11) Last year I won the lottery. I was [very] surprised. I could n’t believe it (icci_esp0474)
(12) him to converter in an animal, but is [so] difficult, but they get it (icci_esp0547)

Adverbial grading was not used by 65% of the kids writing on films and 80.2% of those writing the money essays. At least 25% of the kids in the film group used an intensifying adverb, while 24.3% used it once in the money group. The Mann-Whitney test found no statistically significant differences between the film and the money groups with regard to the use of very (U = 649.5, z = −1.58; p = 0.113). There were 37 instances of general adverbs in the learners’ writing in this sub-corpus, including ten occurrences of finally, five instances of non-degree too, and four instances of only.

4.7. Frequency of use of intensifying adverbs per grade

Only very, too, so, and really occurred in our data. Although really was not tagged as a degree adverb by Wmatrix,2 we decided to include it in our study as long as it complied with the scope of an intensifying adverb as described in Hewings (2005) and Carter and McCarthy (2006). There were no instances of the other intensifying adverbs discussed in Carter and McCarthy (2006), such as absolutely, fully, greatly, or quite. Adverb-driven intensification was not attested in a high percentage of the analyzed writings. It was only in grade 9 that 50% of the learners used intensifying adverbs. However, in grade 10, the number of writers who did not use intensification was higher. Significant differences regarding the frequency of use of very were only found between grades 7 and 8 (U = 5319.5, z = −4.06; p < .001). The differences between grades 8 and 9, and between grades 9 and 10 were not significant, although the former showed a more steady tendency towards significance (U = 6729, z = −1.29; p = 0.197). When non-adjacent grades were compared, significant differences regarding the frequency of use of very were found between grades 7 and 9 (U = 8026.5,

2 Wmatrix was used to tag the ICCI corpus. For more information, visit http://corpus.nie.edu.sg/icci/index.jsp


z = −3.1; p = 0.002), and grades 7 and 10 (U = 5041.5, z = −2.28; p = 0.022). We found the highest ratios in grades 8 and 9, as opposed to grade 7. The results of grades 5 and 6 were too idiosyncratic to be taken into account. However, the lack of data for this level justified our effort to document the use of intensifying adverbs at this stage of learner language development. Very preceded an adjective in 162 occurrences: 22 times before beautiful, 14 times before good, 12 times before big, ten times before funny, nine times before happy and sad, and eight times before interesting and intelligent. Only 1.2% of the adjectives in our corpus data (1,373) were intensified by adverbs. In 26 instances, very was used before a noun, which pointed to errors in the use of intensifying adverbs. These errors accounted for 11.4% of the total number of intensifying adverbs in our data.

5. Discussion

This paper presented the results of a cross-sectional analysis of the use of intensifying adverbs by school students, over six grades at the primary and the compulsory secondary education levels. The age of the informants ranged from 11 to 16, which added extra value to this study as most of the extant learner language research in this area was conducted using data from young adults in their early twenties (Pérez-Paredes, 2010a). The informants in this study were asked to write either a descriptive essay on the topic “Which is your favourite film? What happens in it?” or an argumentative essay on the topic “Imagine you win the lottery. What would you choose to do with the money?” The learners were allowed to express their ideas freely, with no space restrictions or excessively tight limitations of time. Thus, the designers of the ICCI corpus tried to avoid some of the issues identified by Ädel (2008) regarding task setting, specifically the amount of time the kids had to write the essays.
The writing tasks in the ICCI corpus were not designed to elicit specific discrete linguistic items, something that necessarily frames the interpretation of the results within incidental research designs, as is the case in corpus-driven research. Our first research question involved analyzing whether the learners used adverb intensification at all, and consequently, whether quantitative differences existed in the use of intensifying adverbs over the period of education covered by our study. Significant differences regarding the frequency of use of very were found only between grades 7 and 8, but not between the other 1-grade intervals, that is, between grades 8 and 9, and between grades 9 and 10. Our data suggest that there was a gap between the learners in grade 7 and those in grade 8, at least with regard to the use of very, the most standard intensifying adverb. Díez-Bedmar and Pérez-Paredes (this volume) also find a significant difference between grades 7 and 8 in the


total number of words written by the learners who contributed their essays to the ICCI corpus. On average, 30.4% of our informants in secondary school (grades 7–10) used adverb-driven intensification. In a previous contrastive analysis, Pérez-Paredes et al. (2011) found that the percentage of Spanish young adults using adverbial hedges in spontaneous speech was 34%, as compared to 75% of the British speakers who accomplished the same speaking tasks. In these two studies, the percentage of adverb users remained constant despite the age differences. Although the number of informants was far too limited to extrapolate from the learning context where the data was gathered, it would appear that grade 8 is a cut-off point for very young Spanish EFL learners, as they start to make more extensive use of adverb-driven intensification. Our results suggest that students in the second year of compulsory secondary education (grade 8) not only write more words per essay, but also make use of intensification in a more extended way. It remains to be seen whether the lack of significance in the difference in frequency of use of very between grades 8 and 9, and between grades 9 and 10, respectively, would reveal the presence of a prolonged acquisition stage where learners start to use these amplifiers. Even though these adverbs are used as “building bricks” (Granger, 1998: 151), the learners showed a more sophisticated awareness of the possibilities of language, which included the use of intensification. If we accept that intensification is a step forward in the way learners display an ever-increasing sophistication in the use of the language (De Devitiis, Mariani & O’Malley, 1989), we would need to conclude that this linguistic notion would be used more widely as the learners go up the academic ladder. 
Our study confirms this, as significant differences were found between grades 7 and 9 and grades 7 and 10 regarding the frequency of use of very, which reinforces the hypothesis of the incremental nature of vocabulary acquisition (Schmitt, 2010b; Frantzen, 2010). These significant differences can be accounted for from different angles. First, these could be due to the salience of adverbs (Philip, 2008), which may set in the learners’ repertoire as they grow more mature. Second, Laufer (2010) concludes her review of vocabulary learning by highlighting the effectiveness of word-focus instruction, which is widely used in the Spanish educational context. This finding, however, would be inconclusive if we did not examine the accuracy with which intensifying adverbs were used across the grades in our study. In grade 7, the first year of compulsory secondary education, the learners used very erroneously one out of three times; the rate fell to one out of seven times in grade 8 and to one out of fourteen times in grade 9, but rose again to one out of every three times in grade 10. The data found in grades 7, 8, and 9 confirm the incremental nature of vocabulary acquisition. As for grade 10,


a closer examination revealed that one learner was responsible for 50% of all the errors, and a second learner was responsible for 30% of the errors. Further research is needed to examine whether new stages appear later at the non-compulsory secondary education level, and to what extent these new stages involve a more extensive and varied use of intensifying adverbs, that is, when too or so are more prominent in the discourse of learners and/or when learners start using adverbs that intensify specific types of adjectives (Downing & Locke, 2006). In general, our results suggested that very young learners of English would take several years to start using a more varied set of intensifying adverbs other than very. No uses of diminishers or downtoners were found in our data, which suggests that this area of intensification was not explored by our informants.

Our second research question dealt with the quantitative differences in the use of intensifying adverbs according to the essay topic. The Mann-Whitney test found no statistically significant differences between the film and the money groups for grades 7, 8, and 10 regarding the use of very. It was only in grade 9 that a significant difference could be found between the film and the money groups. In the film group, the kids found more room to express the extent to which a characteristic held (Biber et al., 1999). Further, there was a significant difference in the number of words written by the two groups (U = 18273.5, z = −7.84; p < .001), which highlighted the influence of the topic as a communication trigger.
In the film group, the informants wrote about a girl: very courageus and attractive (icci_esp0421), a very intelligent magician (icci_esp0422); a ship where a very poor man and a very rich woman fall in love (icci_esp0426); this same man is very handsame (icci_esp0431); a girl named Cenicienta3 who met two mice very friendly (icci_esp0456); and Lord Voldemort, who has a snake very big (icci_esp0462). Of the 46 occurrences of very in the film group, only two uses could be labeled as errors: died very people (icci_esp0426) and very years later (icci_esp0437). In the money group, however, we found fewer uses of adverb-driven intensification as well as more errors (seven, to be precise). Perhaps the nature of the task led the informants to intensify quantity, and this could explain these errors: very shoes (icci_esp0349) or very clothes (icci_esp0350). These writers certainly wanted to express “greater or less” extent “than usual or than that of something else in the neighboring discourse” (Biber et al., 1999: 555), but faced problems in relating the notions of quantity and intensification. This is a research area that requires more attention.
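The film-versus-money comparisons discussed in this section rest on the Mann-Whitney U test, reported as U with a z score and a p value. The sketch below shows one common way of obtaining these three figures: average ranks for ties, the smaller of the two U values, and a tie-corrected normal approximation without continuity correction. The chapter does not state which software or options produced its statistics, so this is an assumption rather than a reconstruction of the authors' procedure; the function names and the per-essay counts in the usage lines are invented for illustration.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of equal values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney(sample_a, sample_b):
    """Return (U, z, two-sided p) using the normal approximation."""
    n1, n2 = len(sample_a), len(sample_b)
    combined = list(sample_a) + list(sample_b)
    ranks = average_ranks(combined)

    # U for sample A from its rank sum; report the smaller of the two U values.
    u_a = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u_a, n1 * n2 - u_a)

    # Variance of U with the standard correction for tied observations.
    n = n1 + n2
    tie_sizes = {}
    for value in combined:
        tie_sizes[value] = tie_sizes.get(value, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_sizes.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:  # every observation identical: no evidence of a difference
        return u, 0.0, 1.0

    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u, z, p


if __name__ == "__main__":
    # Invented counts of "very" per essay, for illustration only.
    film_counts = [0, 0, 1, 2, 0, 3, 1, 0]
    money_counts = [0, 0, 0, 1, 0, 0, 1, 0]
    u, z, p = mann_whitney(film_counts, money_counts)
    print(f"U = {u}, z = {z:.2f}, p = {p:.3f}")
```

Because most essays contain no intensifier at all, per-essay counts are heavily tied at zero, which is why the tie correction matters for data of this kind.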

3 Cenicienta is Spanish for Cinderella.


Traditional interlanguage contrastive analysis considered overuse and underuse phenomena. Hsue-Hueh Shih (2000), for instance, found that Taiwanese intermediate-advanced adult learners of English overused deeply and underused particularly. While the contrast between the learner corpus in their research and the BNC could present important interpretation challenges, it is necessary to acknowledge the role that interlanguage contrastive as well as cross-sectional intralanguage corpus linguistics can play in unveiling the functional, discourse-level uses of a particular word, a string of words, or a grammatical category. In this sense, the incremental nature of vocabulary acquisition—that our results suggest—would be better understood if we examined the increasing frequency of use of other general adverbs. In grade 7, the learners started to display a more extensive use of adverbs, although it was still very limited. The range and the number of general adverbs were more extensive in grade 9. The use of finally suggests the presence of a more structured discourse that relied on adverbs. Similarly, the frequency and focus adverbs at this stage were more common than in the previous years. Not surprisingly, really appears for the first time in our data in grade 9. Our results differ from those discussed in Lorenz (1999), where a more varied array of intensification repertoire was found in German advanced EFL learners. The author concludes that the problems concerning the use of intensifiers were related to word-formation, to position in the sentence, to collocation, to scalar incompatibility between languages, and, interestingly, to overstatement. Philip (2008) claims that adverbs continue to be treated as adjective derivations in the EFL curriculum, which could explain many of the difficulties that students face when using adverbs. 
In the light of the results of the current study, this statement does not do justice to intensifying adverbs, at least in the case of Spanish speakers, as they could easily relate their muy with English very at the lexical and functional levels (Sacks, 1971).

6. Conclusions

Schmitt (2010a: 7) suggests that “second language learners do not need to achieve native-like vocabulary sizes in order to use English well.” The author holds that “a more reasonable vocabulary goal for these learners is the amount of lexis necessary to enable the various forms of communication in English.” The main issue concerns the forms of communication in English that are necessary for general competence and, in our case, the extent to which intensification is a necessity for very young EFL learners. Schmitt (2010a) agrees on the distinction between competence and native-likeness in different genres. For instance, most speakers would need basic oral communication skills for everyday communication and survival,


but could simultaneously be native-like users of the genre in which they are professionally engaged, regardless of whether it is nanotechnology research articles or e-mail writing related to export/import. In this context, interlanguage contrastive analysis could be of great help. For instance, Pérez-Paredes et al. (2011) find a narrower range of uses of adverbial hedges in Spanish speakers learning English for Academic Purposes, which indicates a lack of rhetorical awareness of the hedging device manifested by adverbials (sort of, kind of, maybe, and almost). The authors suggest that this low awareness could have a negative impact on the professional careers of these students if poor persuasive language skills were maintained. Further study is required to assess whether this low awareness about the role of intensification is common to other writers in the ICCI corpus, or is an exclusive trait of Spanish learners, or is related to the kind of expressions that very young learners are ready to use. The results of our research suggest that lexical choice and the range of meanings expressed by writers are sensitive to the register (Swales & Burke, 2003) and to the writing task in the FLT context. In grade 9, we found statistically significant differences between the film and the money groups regarding the frequency of use of very. Although the use of this particular adverb does not present any challenges to learners from a phraseological or pattern-study perspective, it remains to be seen whether the teaching of intensification in schools goes beyond the introduction of selected adverbs that correspond word for word with other adverbs, at least in Spanish. This could partially explain some of the errors that were found in our data, as well as the correct uses of the perfectly matching functionalities of these adverbs in Spanish and English.
There are claims that adverbs were treated as adjective derivations in FLT; further research would be required to confirm whether this is indeed the case with regard to intensifying adverbs. The lack of inflectional suffixes in this class of adverbs may be the reason why these adverbs are not noticed by very young learners of English. However, this could not be validated due to the design limitations of our research. The range of uses of very, too, so, and really in our data seems to neutralize the dichotomy between the open-choice principle and the idiom principle that has characterized the theoretical accounts of the role of corpus linguistics. Most of the grammar studies reviewed in this paper highlight that intensification is almost exclusively realized through a very limited set of adverbs that combine freely with nouns or adjectives that are pre-modified in the noun phrase. While some of these words may demand specific intensifying adverbs, none were attested in our data. Further research is required to examine whether this holds true only for intensifying adverbs or whether it can be extended to the general adverb class. The role of corpus


linguistics and learner corpora (Gilquin, Granger & Paquot, 2007; Pérez-Paredes, 2010b) in this context is of paramount importance.

References

Abello-Contesse, C., R. Chacón-Beltrán and M. Torreblanca-López (eds). 2010. Insights into Non-native Vocabulary Teaching and Learning. Bristol: Multilingual Matters.
Ädel, A. 2006. Metadiscourse in L1 and L2 English. Amsterdam: John Benjamins.
Alderson, J.C. 2005. Diagnosing Foreign Language Proficiency. London: Continuum.
Alderson, J.C. 2007. “Judging the frequency of English words”. Applied Linguistics 28:3. 383-409.
Altenberg, B. and M. Tapper. 1998. “The use of adverbial connectors in advanced Swedish learners’ written English”. Learner English on Computer, Granger, S. (ed). London/New York: Addison Wesley Longman. 80-93.
Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. 1999. Longman Grammar of Spoken and Written English. London: Longman.
Brand, C. and S. Kämmerer. 2006. “The Louvain International Database of Spoken English Interlanguage LINDSEI: Compiling the German component”. Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, Kohn, K., S. Braun and J. Mukherjee (eds). Frankfurt/Main: Peter Lang. 127-140.
Carter, R. and M. McCarthy. 2006. Cambridge Grammar of English. Cambridge: Cambridge University Press.
Chacón-Beltrán, R., C. Abello-Contesse and M. Torreblanca-López. 2010. “Vocabulary Teaching and Learning: Introduction and Overview”. Insights into Non-native Vocabulary Teaching and Learning, Abello-Contesse, C., R. Chacón-Beltrán and M. Torreblanca-López (eds). 1-14.
Collins Cobuild English Usage. 1992. Glasgow: Harper Collins.
Conrad, S. and D. Biber (eds). 2001. Variation in English: Multi-dimensional studies. London: Longman.
Downing, A. and P. Locke. 2006. English Grammar: A University Course. Abingdon and New York: Routledge.
De Devitiis, G., L. Mariani and K. O’Malley. 1989. English Grammar for Communication. Harlow: Longman Group.


De Haan, P. 1999. “English writing by Dutch-speaking students”. Out of Corpora, Hasselgård, H. and S. Oksefjell (eds). Amsterdam: Rodopi. 203-212.
Eubank, L., J. Bischof, A. Huffstutler, P. Leek and C. West. 1997. “‘Tom eats slowly cooked eggs’: Thematic verb-raising in L2 knowledge”. Language Acquisition 6. 171-199.
Eubank, L. and S. Grace. 1998. “V-to-I and inflection in non-native grammars”. Morphology and its interfaces in L2 knowledge, Beck, M. L. (ed). Amsterdam: John Benjamins. 69-88.
Fries, C.C. 1952. The structure of English. New York: Harcourt Brace.
Foley, M. and D. Hall. 2003. Longman Advanced Learners’ Grammar. Harlow: Pearson Education Limited.
Francis, N. 1958. The Structure of American English. New York: Ronald Press.
Frantzen, D. 2010. “Evidence of Incremental Vocabulary Learning in Advanced L2 Spanish Learner”. Insights into Non-native Vocabulary Teaching and Learning, Abello-Contesse, C., R. Chacón-Beltrán and M. Torreblanca-López (eds). 126-144.
Gilquin, G., S. Granger and M. Paquot. 2007. “Learner corpora: The missing link in EAP pedagogy”. Journal of English for Academic Purposes 6:4. 319-335.
Granger, S. 1998. “Prefabricated patterns in advanced EFL writing: Collocations and formulae”. Phraseology, Cowie, A.P. (ed). Oxford: Clarendon Press. 145-160.
Hewings, M. 2005. Advanced Grammar in Use. Second Edition. Cambridge: Cambridge University Press.
Hsue-Hueh Shih, R. 2000. “Compiling Taiwanese Learner Corpus of English”. Computational Linguistics and Chinese Language Processing 5:2. 87-100.
Huddleston, R. and G.K. Pullum. 2002. The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press.
Huddleston, R. and G.K. Pullum. 2005. A Student’s Introduction to English Grammar. Cambridge: Cambridge University Press.
Laufer, B. 2010. “Form-focused Instruction in Second Language Vocabulary Learning”. Insights into Non-native Vocabulary Teaching and Learning, Abello-Contesse, C., R. Chacón-Beltrán and M. Torreblanca-López (eds). 15-27.
Lawley, J. 2010. “Conspicuous by Their Absence: The Infrequency of Very Frequent Words in some English as a Foreign Language Textbooks”. Insights into Non-native Vocabulary Teaching and Learning, Abello-Contesse, C., R. Chacón-Beltrán and M. Torreblanca-López (eds). 145-156.


Lorenz, G. 1999. Adjective intensification - learners versus native speakers: A corpus study of argumentative writing. Amsterdam: Rodopi.
Osborne, J. 2008. “Adverb placement in post-intermediate learner English: a contrastive study of learner corpora”. Linking up Contrastive and Learner Corpus Research, Gilquin, G., S. Papp and M. Díez-Bedmar (eds). Amsterdam: Rodopi. 127-146.
Parrott, M. 2000. Grammar for English Language Teaching. Cambridge: Cambridge University Press.
Partington, A. 1993. “Corpus evidence of language change: The case of the intensifier”. Text and Technology, Baker, M., G. Francis and E. Tognini-Bonelli (eds). Amsterdam: John Benjamins. 177-192.
Pavón, V. and F. Rubio. 2010. “Teachers’ concerns and uncertainties about the introduction of CLIL programmes”. Porta Linguarum 14. 45-58.
Pérez Basanta, C. 1996. “La integración de los contenidos léxicos en los métodos comunicativos: una cuestión pendiente”. Segundas jornadas sobre estudio y enseñanza del léxico, Durán, L. and P. Bertrán (eds). Granada: Método. 229-310.
Pérez-Basanta, C. 2005. “Assessing the receptive vocabulary of Spanish students of English philology: An empirical investigation”. Towards an understanding of the English language, past, present and future: Studies in honour of Fernando Serrano, Martínez-Dueñas, E. (ed). Granada: Universidad de Granada. 545-564.
Pérez-Paredes, P. 2010a. “The death of the adverb revisited: Attested uses of adverbs in native and non-native comparable corpora of spoken English”. Exploring new paths in language pedagogy lexis and corpus-based language teaching, Jaén, M.M., F.S. Valverde and M.C. Pérez (eds). London: Equinox Publishing. 157-172.
Pérez-Paredes, P. 2010b. “Appropriation and integration issues in corpus methods and mainstream language education”. Corpus Linguistics in Language Teaching, Harris, T. and C. Pérez-Basanta (eds). Berlin: Peter Lang. 53-73.
Pérez-Paredes, P., P. Sánchez-Hernández and P. Aguado. 2011. “The use of adverbial hedges in EAP students’ oral performance: a cross-language analysis”. Researching Specialized Languages. Studies in Corpus Linguistics 47. Bhatia, Sánchez-Hernández and Pérez-Paredes (eds). Amsterdam: John Benjamins. 95-114.
Philip, G. 2008. “Adverb use in EFL student writing: from learner dictionary to text production”. Proceedings of EURALEX XIII International lexicography congress. Retrieved October 5, 2009 from http://amsacta.cib.unibo.it
Roberts, P. 1956. Patterns of English. New York: Harcourt.


Sánchez-Hernández, P. and P. Pérez-Paredes. 2005. “Examining English for Academic Purposes students’ vocabulary output: Corpus-aided analysis and learner corpora”. RESLA 2005. 201-12.
Quirk, R. and S. Greenbaum. 1973. A University Grammar of English. London: Longman.
Sacks, N.P. 1971. “English very, French très, and Spanish muy: A Structural Comparison and Its Significance for Bilingual Lexicography”. PMLA 86:2. 190-201.
Schmitt, N. 2010a. Researching Vocabulary: A Vocabulary Research Manual. Research and Practice in Applied Linguistics. Houndmills: Palgrave MacMillan.
Schmitt, N. 2010b. “Key Issues in Teaching and Learning Vocabulary”. Insights into Non-native Vocabulary Teaching and Learning, Abello-Contesse, C., R. Chacón-Beltrán and M. Torreblanca-López (eds). 28-40.
Sledd, J. 1959. A Short Introduction to English Grammar. Chicago: Scott Foresman.
Swales, J. and A. Burke. 2003. “It’s really fascinating work: Differences in evaluative adjectives across academic registers”. Corpus analysis: Language structure and language use, Leistyna, P. and C. Meyer (eds). Amsterdam: Rodopi. 1-18.

Profiling EFL Learners’ Writing Performance by Syntactic Complexity: A Corpus-based Study

Austina SHIH and May MA

1. Introduction

This study has two parts. The first part is an overview of the Taiwanese data. It reports the design of data collection, data distribution, and the results of questionnaires on learners’ English learning experiences. The second part is a study of syntactic complexity based on the data. It presents a computational system and explores the syntactic complexity of the learners’ writing. Some initial findings are summarized. The paper concludes with a discussion of the limitations and proposed future research.

2. Overview of the data

2.1. Data collection

Data were collected from six public schools in Taiwan. To balance the sample in terms of students’ abilities, three junior high schools and three senior high schools were sampled, and the students in each school were ranked as advanced, upper-intermediate, or lower-intermediate according to the overall academic performance of their school. The writing task was supervised by English teachers, who had been provided with rubrics about the task to ensure consistent administration. To complete the task, students wrote on an assigned prompt for twenty minutes without the aid of any reference tools, and afterwards they filled out a questionnaire on their background and English learning experiences.

2.2. Statistical distribution of the data

To obtain a representative sample, learners were sampled from almost every grade in each school. In total, 734 essays were collected. Two writing prompts about money and food were developed in line with the prompts used in the JEFLL Corpus. The two prompts were:

A: Please describe your favorite food and explain why you like it.
B: You got NT$2,000 for your birthday. Please write down what you will do with the money and explain why.

Students assigned the food prompt were expected to write a descriptive composition, whereas students assigned the money prompt were expected to write an expository composition. Table 1 shows the distribution of the samples among schools and grades and the prompts assigned.

Table 1. Sample distribution across grades and schools

                      Junior high school                  Senior high school
Level                 Grade   Prompt   No. of samples     Grade   Prompt   No. of samples
Advanced              7       B        36                 10      B        35
                      8       A        36                 11      A        43
                      8       B        34                 11      B        41
                      9       A        34                 12      A        31
  Sub-total                            140                                 148
Upper-intermediate    7       B        36                 10      B        39
                      8       A        33                 11      A        38
                      9       A        35                 11      B        38
                                                          12      A        40
  Sub-total                            104                                 155
Lower-intermediate    7       B        40                 11      B        38
                      8       B        31                 12      A        38
                      9       A        40
  Sub-total                            111                                 76
Subtotal                               355                                 379
Total                                                                      734

Samples containing too little English output were considered invalid and excluded from analysis. The invalid cases included the following: blank (24 cases), fewer than two sentences/five words (22 cases), written in Chinese only, and only the prompt copied. Table 2 shows the distribution of valid samples across grades.

Table 2. Sample distribution across grades

Grade   No. of samples   Percentage   No. of valid samples
7       112              15%          100
8       134              18%          116
9       109              15%          89
10      74               10%          72
11      196              27%          193
12      109              15%          104
Total   734                           674

The software AntConc 3.2 was employed to calculate the types and tokens of the data. Tables 3 and 4 display the statistics of types and tokens both by grade and by school. The average essay lengths on the two prompts were close, with a difference of five words. Senior high school students wrote significantly longer texts than did junior high school students. However, among the junior high schools, essay length was negatively correlated with students’ levels. The mean essay length of the advanced junior high students was the shortest, 15 words shorter than that of the lower-intermediate students.

Table 3. Statistics of types and tokens by grade

Grade          Prompt   No. of essays   Type     Token     Mean tokens per essay
7              B        100             916      7,290     73
8              A        56              647      3,714     66
8              B        60              657      3,827     64
9              A        89              946      6,445     72
10             B        72              987      6,426     89
11             A        78              1,348    8,153     105
11             B        115             1,519    11,359    99
12             A        104             1,543    10,473    101
7, 8, 10, 11   B        347             2,397    28,902    83
8, 9, 11, 12   A        327             2,536    28,785    88
All 6 grades   A&B      674             3,858    57,687    86

Table 4. Statistics of types and tokens by school

School                  Prompt   No. of essays   Type     Token     Mean tokens per essay
Advanced JH*            A&B      114             1,018    7,118     62
Upper-intermediate JH   A&B      83              913      5,851     70
Lower-intermediate JH   A&B      108             1,116    8,307     77
JH Total                A&B      305             1,866    21,276    70
Advanced SH**           A&B      143             2,034    14,460    101
Upper-intermediate SH   A&B      153             1,879    15,792    103
Lower-intermediate SH   A&B      73              1,100    6,159     84
SH Total                A&B      369             3,169    36,411    99

* Note. JH: junior high school, ** SH: senior high school.
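The type and token figures in Tables 3 and 4 were produced with AntConc. As a rough illustration of what such counts involve, the following minimal sketch computes types, tokens, and mean tokens per essay for a list of essays. The tokenization rule and the function names are assumptions made for this example only; AntConc's token definition is configurable and may differ, so the sketch is not expected to reproduce the published figures.

```python
import re
from collections import Counter

# Assumption: a token is a run of ASCII letters, optionally with one internal
# apostrophe (e.g. "don't"). This is a simplification chosen for the sketch.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lower-case the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def type_token_stats(essays):
    """Return (types, tokens, mean tokens per essay) for a list of essay strings."""
    counts = Counter()
    for essay in essays:
        counts.update(tokenize(essay))
    tokens = sum(counts.values())   # running words
    types = len(counts)             # distinct word forms
    mean_length = tokens / len(essays) if essays else 0.0
    return types, tokens, mean_length


if __name__ == "__main__":
    # Hypothetical two-essay sample, not taken from the corpus.
    sample = ["I like rice.", "I like noodles and rice."]
    print(type_token_stats(sample))
```

Pooling all essays of a grade or school into one call gives the group-level type count, which is why the type totals in the tables are not the sum of the individual rows.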

Another analysis was conducted to investigate lexical coverage by determining how many of the words the students used were included in the wordlists of the General English Proficiency Test¹ (GEPT). The writing samples, grouped according to school, were first transformed into wordlists and then compared with the GEPT wordlists. The results are shown in Table 5. The essays of the advanced junior high school students, which had a shorter mean text length, contained more intermediate and high-intermediate level vocabulary than did the essays by the other junior high school students, and this result correlates positively with their English proficiency. More evidence of this correlation was found among the high school students, whose vocabulary range was broader and included more words belonging to higher level GEPT wordlists. Words that were not found in the GEPT wordlists (off list) consisted mostly of proper nouns and low-frequency words.

¹ The GEPT, a criterion-referenced test, is divided into five levels: elementary, intermediate, high-intermediate, advanced, and superior, corresponding to levels A2, B1, B2, C1, and C2 of the Common European Framework of Reference for Languages: Learning, Teaching, Assessment (CEFR), respectively. Learners who pass the lower three levels have abilities roughly equivalent to those of a junior high school graduate, a high school graduate, and a university graduate (non-English major), respectively. The wordlists were developed for the lower three levels in order to assist test takers in preparing for the tests.

Table 5. Correlation between the students’ vocabulary and GEPT wordlists

                        Elementary level        Intermediate level      High-intermediate level   Off list
                        (2263 words)            (2684 words)            (3264 words)
School                  Words in common   %     Words in common   %     Words in common   %       No. of words   %
Advanced JH*            603               84%   61                9%    14                2%      37             5%
Upper-intermediate JH   575               88%   44                7%    7                 1%      27             4%
Lower-intermediate JH   668               87%   54                7%    11                1%      35             5%
Advanced SH**           960               67%   323               23%   95                7%      55             4%
Upper-intermediate SH   940               71%   277               21%   52                4%      47             4%
Lower-intermediate SH   667               82%   98                12%   27                3%      19             2%

* Note. JH: junior high school, ** SH: senior high school.

2.3. Learners’ profile

Each participating student was asked to complete a questionnaire on their English learning experiences after the writing task. Please refer to the Appendix for the questionnaire and the statistics of the results. A brief description of the results follows.

Based on a total of 734 responses collected, 55% of the respondents were male, 43% female, and 2% unspecified. Nearly 70% of the students had started to learn English at the ages of 6 (13%) to 10 (15%). Approximately 55% of them received 4 to 5 hours of English instruction in school per week, and 50% of them received one to four extra hours of English instruction outside school. Since most high school students in Taiwan need to take high school and college entrance exams which include the English subject test, English instruction is mostly test-oriented. Nearly 70% of school instruction focuses on grammar and 25% on reading, which are two major components in the English subject tests. Writing is another component, so approximately 40% of the students felt the need to improve their writing abilities. Of the respondents in this study, 43% received instruction on writing almost every day or often, and 39% did so occasionally.

3. Analysis of syntactic complexity

3.1. Introduction

Many analyses in L2 research have focused on different aspects of structural features to establish relationships between syntactic complexity
and L2 development. Syntactic complexity, or grammatical complexity, is one indicator of developmental measures in L2 writing (Wolfe-Quintero et al. 1998, Ortega 2003). The study of syntactic complexity has been widely used to evaluate L2 writing by computing occurrences of the target measures. However, findings have been inconclusive as to which syntactic complexity measures are best for L2 writing. There has also been a lack of investigation with writing samples from lower proficiency levels (Lu 2011, Ishikawa 1995). As Ortega (2003) concluded in her synthesis of cross-sectional studies of university-level L2 writing, differences in instructional settings create a need for more studies conducted in an EFL setting.

In his study of Chinese college learner writing, Lu (2011) examined 14 syntactic complexity measures. In order to process a large amount of data, he proposed an analysis model and presented a number of interesting findings. Following Lu’s analysis model, the second part of this paper attempts to investigate, in terms of syntactic complexity, EFL data from learners with a lower proficiency.

As Wolfe-Quintero et al. (1998) concluded, syntactic complexity measures in L2 writing give the most consistent and significant results when proficiency is defined by program or school level. In Taiwan, a nationwide standardized curriculum is taught to both junior and senior high school students. In addition, at the end of the third year of junior high school education, students take an entrance examination and are assigned to schools in accordance with their test scores.² Therefore, in order to unify proficiency levels in the present study, school grade was used to define proficiency group differences. Specifically, this study seeks to answer the following research questions:

1. Does the writing of higher proficiency level students demonstrate more sophisticated use in terms of syntactic complexity?
2. Which syntactic measures significantly assess between-grade differences?
3. What syntactic measures can be proposed as good indicators for developmental indices of EFL writing across the 6 Taiwanese high school grades?

² Students take the Basic Competence Test for Junior High School Students before they are sorted into senior high schools.

3.2. The analyzer: overview

In research on syntactic complexity, analysis is often conducted manually. Yet, in recent years, advances in technology have provided more tools that allow analysis of larger-scale data sets. Instead of establishing a new measurement scale of syntactic complexity, this study employs an analysis tool which evaluates 14 syntactic complexity indices (Lu 2010, 2011).³ This system, trained with Chinese learner data, consists of three components: Stanford Parser, Tregex, and the syntactic complexity analyzer. After the removal of Chinese and unidentified tags and the exclusion of 17 invalid samples, a total of 657 writing samples were processed and the output generated. These measures and the descriptions are listed in Tables 6 and 7.

³ Please refer to Lu (2010, 2011) for details of the definitions of the measures and technical information.

Table 6. 9 syntactic complexity measures

Measure   Structure
W         Words
S         Sentences
VP        Verb phrases
C         Clauses
T         T-units
DC        Dependent clauses
CT        Complex T-units
CP        Coordinate phrases
CN        Complex nominals

Table 7. 14 syntactic complexity indices

Type 1: Length of production
  MLS     Mean length of sentence: number of words divided by number of sentences
  MLT     Mean length of T-unit: number of words divided by number of T-units
  MLC     Mean length of clause: number of words divided by number of clauses
Type 2: Sentence complexity
  C/S     Clauses per sentence: number of clauses divided by number of sentences
Type 3: Subordination
  C/T     Clauses per T-unit: number of clauses divided by number of T-units
  CT/T    Complex T-units per T-unit: number of complex T-units divided by number of T-units
  DC/C    Dependent clauses per clause: number of dependent clauses divided by number of clauses
  DC/T    Dependent clauses per T-unit: number of dependent clauses divided by number of T-units
Type 4: Coordination
  T/S     T-units per sentence: number of T-units divided by number of sentences
  CP/T    Coordinate phrases per T-unit: number of coordinate phrases divided by number of T-units
  CP/C    Coordinate phrases per clause: number of coordinate phrases divided by number of clauses
Type 5: Particular structures
  CN/T    Complex nominals per T-unit: number of complex nominals divided by number of T-units
  CN/C    Complex nominals per clause: number of complex nominals divided by number of clauses
  VP/T    Verb phrases per T-unit: number of verb phrases divided by number of T-units

* Note. Tables 6 & 7 are adopted with kind permission from Dr. Lu (2011).
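Since every index in Table 7 is a ratio of two of the structure counts in Table 6, the final step of such an analysis can be sketched in a few lines. The sketch below is illustrative only and is not the code of Lu's analyzer: the class and function names are hypothetical, the counts are assumed to have been obtained already (for instance from parsed output), and returning None for a zero denominator is a choice made for this example.

```python
from dataclasses import dataclass


@dataclass
class StructureCounts:
    """Frequencies of the nine structures in Table 6 for one writing sample."""
    W: int   # words
    S: int   # sentences
    VP: int  # verb phrases
    C: int   # clauses
    T: int   # T-units
    DC: int  # dependent clauses
    CT: int  # complex T-units
    CP: int  # coordinate phrases
    CN: int  # complex nominals


# Each index in Table 7 as (numerator, denominator) over the Table 6 counts.
INDEX_DEFINITIONS = {
    "MLS": ("W", "S"), "MLT": ("W", "T"), "MLC": ("W", "C"),
    "C/S": ("C", "S"),
    "C/T": ("C", "T"), "CT/T": ("CT", "T"),
    "DC/C": ("DC", "C"), "DC/T": ("DC", "T"),
    "T/S": ("T", "S"), "CP/T": ("CP", "T"), "CP/C": ("CP", "C"),
    "CN/T": ("CN", "T"), "CN/C": ("CN", "C"), "VP/T": ("VP", "T"),
}


def complexity_indices(counts):
    """Return the 14 indices; an index is None when its denominator is zero."""
    result = {}
    for name, (numerator, denominator) in INDEX_DEFINITIONS.items():
        bottom = getattr(counts, denominator)
        result[name] = getattr(counts, numerator) / bottom if bottom else None
    return result


if __name__ == "__main__":
    # Hypothetical counts for a single essay.
    essay = StructureCounts(W=100, S=10, VP=20, C=15, T=12, DC=3, CT=3, CP=4, CN=8)
    for name, value in complexity_indices(essay).items():
        print(name, value)
```

Keeping the definitions in a single table makes it easy to check each ratio against Table 7 and to handle very short samples, where some denominators may be zero.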

Table 8. Mean syntactic complexity values and standard deviations (SD) by grade

          Grade 7         Grade 8         Grade 9         Grade 10        Grade 11        Grade 12
          (n* = 95)       (n = 114)       (n = 88)        (n = 71)        (n = 187)       (n = 102)
Measure   Mean    SD      Mean    SD      Mean    SD      Mean    SD      Mean    SD      Mean    SD
MLS       12.338  5.737   11.984  4.824   12.010  6.090   14.574  4.772   13.605  4.246   12.917  4.281
MLT       10.354  4.391   9.774   2.869   9.990   4.457   12.987  3.713   12.327  3.645   11.375  2.742
MLC       7.285   2.916   7.209   1.499   6.663   1.580   7.656   1.756   7.854   1.808   7.329   1.205
C/S       1.743   0.738   1.688   0.641   1.852   0.954   1.960   0.661   1.774   0.559   1.780   0.600
C/T       1.457   0.490   1.375   0.366   1.525   0.637   1.750   0.537   1.600   0.440   1.565   0.350
CT/T      0.310   0.243   0.322   0.244   0.370   0.244   0.518   0.256   0.432   0.220   0.439   0.209
DC/C      0.229   0.150   0.250   0.144   0.280   0.143   0.375   0.156   0.338   0.136   0.330   0.123
DC/T      0.380   0.320   0.375   0.280   0.460   0.345   0.719   0.451   0.584   0.370   0.546   0.303
T/S       1.195   0.345   1.230   0.353   1.225   0.506   1.125   0.192   1.109   0.180   1.132   0.225
CP/T      0.312   0.336   0.180   0.179   0.228   0.239   0.174   0.212   0.257   0.219   0.248   0.198
CP/C      0.237   0.288   0.138   0.130   0.159   0.187   0.111   0.133   0.170   0.157   0.161   0.125
CN/T      0.798   0.496   0.795   0.453   0.881   0.513   1.106   0.638   1.259   0.591   1.128   0.482
CN/C      0.555   0.318   0.583   0.312   0.593   0.320   0.659   0.349   0.798   0.341   0.720   0.266
VP/T      1.782   0.605   1.749   0.540   1.790   0.822   2.335   0.760   2.090   0.638   1.887   0.463

*Note. n = number of writing samples.
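Because Table 8 reports group sizes, means, and standard deviations, the kind of between-grade test used later in this section (one-way ANOVA with Scheffe's post hoc test) can be approximated from the summary statistics alone. The sketch below is an illustration under that assumption and is not the authors' procedure: the function names are hypothetical, the values are recomputed from rounded table entries, and the critical F value has to be looked up separately in a table or a statistics package.

```python
def anova_from_summary(ns, means, sds):
    """One-way ANOVA from group sizes, means and sample standard deviations.

    Returns (F, df_between, df_within, ms_within).
    """
    k = len(ns)
    total_n = sum(ns)
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means))
    ss_within = sum((n - 1) * sd ** 2 for n, sd in zip(ns, sds))
    df_between, df_within = k - 1, total_n - k
    ms_within = ss_within / df_within
    f_stat = (ss_between / df_between) / ms_within
    return f_stat, df_between, df_within, ms_within


def scheffe_statistic(n_i, mean_i, n_j, mean_j, ms_within, k):
    """Scheffe statistic for one pairwise contrast among k groups.

    The pair differs significantly when this value exceeds the critical
    F(k - 1, N - k) at the chosen alpha level.
    """
    return (mean_i - mean_j) ** 2 / (ms_within * (1 / n_i + 1 / n_j) * (k - 1))


if __name__ == "__main__":
    # MLT for grades 7 to 12, copied from Table 8.
    ns = [95, 114, 88, 71, 187, 102]
    means = [10.354, 9.774, 9.990, 12.987, 12.327, 11.375]
    sds = [4.391, 2.869, 4.457, 3.713, 3.645, 2.742]
    f_stat, df_b, df_w, ms_w = anova_from_summary(ns, means, sds)
    print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
    # Grade 9 versus grade 10.
    print(scheffe_statistic(ns[2], means[2], ns[3], means[3], ms_w, len(ns)))
```

The same two functions can be applied to any other row of Table 8 by swapping in that measure's means and standard deviations.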

3.3. Results and discussion

Lu (2011) assumed that good candidates for syntactic complexity indices are those measures that “progress linearly” and are significantly related to proficiency level. In his analysis of the 14 syntactic complexity indices, ten progressed in a linear fashion. Among them, seven progressed linearly from Level 1 to Level 3. He suggested that these seven indices would be good L2 writing developmental indices.

However, in relation to the first research question in the present study, there are no substantial findings. Because the results vary, these 14 measures do not indicate clear and consistent progress across the six grades. Nevertheless, as shown in Table 8, increases in 10 measures from grades 8 to 10 were observed (the exceptions being MLC, T/S, CP/T, and CP/C). There seems to be slight growth in a wide range of grammatical structures in line with the learners’ proficiency levels in these three grades. In addition, among them, mean length of sentence (MLS) and mean length of T-unit (MLT) increased from grades 9 to 10 by mean values ranging from 2.56 to almost 3.⁴ A reasonable inference about this increase is that grade 9 students are required to prepare for the nationwide senior high school entrance examination, which enhances their performance at this phase.

⁴ In her research synthesis of twenty-five studies, Ortega (2003) suggested that for MLT, a critical magnitude of between-proficiency differences should be at least two words.
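The two observations above (which measures rise from grade 8 to grade 10, and whether a difference reaches the two-word magnitude for MLT) are simple checks on the grade means. A minimal sketch follows, using the means copied from Table 8; the function names are hypothetical and chosen for this example. Run on these means, the first check singles out the same ten measures named in the text, and no measure rises at every step across all six grades.

```python
# Grade means copied from Table 8 (grades 7 to 12, in that order).
GRADES = [7, 8, 9, 10, 11, 12]
MEANS = {
    "MLS":  [12.338, 11.984, 12.010, 14.574, 13.605, 12.917],
    "MLT":  [10.354, 9.774, 9.990, 12.987, 12.327, 11.375],
    "MLC":  [7.285, 7.209, 6.663, 7.656, 7.854, 7.329],
    "C/S":  [1.743, 1.688, 1.852, 1.960, 1.774, 1.780],
    "C/T":  [1.457, 1.375, 1.525, 1.750, 1.600, 1.565],
    "CT/T": [0.310, 0.322, 0.370, 0.518, 0.432, 0.439],
    "DC/C": [0.229, 0.250, 0.280, 0.375, 0.338, 0.330],
    "DC/T": [0.380, 0.375, 0.460, 0.719, 0.584, 0.546],
    "T/S":  [1.195, 1.230, 1.225, 1.125, 1.109, 1.132],
    "CP/T": [0.312, 0.180, 0.228, 0.174, 0.257, 0.248],
    "CP/C": [0.237, 0.138, 0.159, 0.111, 0.170, 0.161],
    "CN/T": [0.798, 0.795, 0.881, 1.106, 1.259, 1.128],
    "CN/C": [0.555, 0.583, 0.593, 0.659, 0.798, 0.720],
    "VP/T": [1.782, 1.749, 1.790, 2.335, 2.090, 1.887],
}


def increases_steadily(measure, first_grade, last_grade):
    """True if the mean rises at every step from first_grade to last_grade."""
    start, end = GRADES.index(first_grade), GRADES.index(last_grade)
    values = MEANS[measure][start:end + 1]
    return all(a < b for a, b in zip(values, values[1:]))


def mean_difference(measure, grade_a, grade_b):
    """Mean of grade_b minus mean of grade_a for the given measure."""
    values = MEANS[measure]
    return values[GRADES.index(grade_b)] - values[GRADES.index(grade_a)]


if __name__ == "__main__":
    rising = [m for m in MEANS if increases_steadily(m, 8, 10)]
    print(len(rising), "measures rise from grade 8 to grade 10:", rising)
    diff = mean_difference("MLT", 9, 10)
    print(f"MLT, grade 9 to 10: {diff:.3f} words; at least two words: {diff >= 2}")
```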

In previous studies, these 14 indices were reported to be indicators of syntactic complexity in different areas of L2 research (Ortega 2003). As shown in Table 9, one-way ANOVA followed by Scheffe’s post hoc test detected a significant difference in 12 of these measures, the exceptions being clauses per sentence (C/S) and T-units per sentence (T/S). Each of the 12 discriminates at least one pair of proficiency levels. Furthermore, the following six measures discriminate at least six pairs of levels: mean length of T-unit (MLT), dependent clauses per clause (DC/C), dependent clauses per T-unit (DC/T), complex T-units per T-unit (CT/T), complex nominals per T-unit (CN/T), and verb phrases per T-unit (VP/T). Based on this initial observation, we assume that these six measures are good candidates for discriminating proficiency levels. Among them, MLT discriminates six pairs of levels, five of which reach a magnitude of at least 2 (2.337 to 3.213). Unlike the previous finding that MLT and MLS discriminated adjacent levels (Lu 2011), MLT is proposed as the best indicator for both adjacent and nonadjacent levels of proficiency in this Taiwanese high school corpus. On the other hand, C/S and T/S did not discriminate any pairs of proficiency levels in the present study. This confirms the finding in Lu (2011) that T/S is not a good indicator.

Table 9. Significant between-level differences*

Grades   7-10     8-10     9-10     10-12    7-11     8-11     9-11     7-12     8-12     8-7      10-7
MLS               2.589*
MLT      2.633*   3.213*   2.998*            1.973*   2.552*   2.337*
MLC                        0.992*                              1.190*
C/S
C/T      0.294*   0.375*                              0.224*
DC/C     0.146*   0.125*   0.964*            0.109*   0.088*            0.101*   0.080*
DC/T     0.338*   0.343*   0.258*            0.203*   0.209*            0.165*   0.170*
CT/T     0.207*   0.195*   0.147*            0.122*   0.110*            0.129*   0.116*
T/S
CP/T     0.138*                                                                           −0.132*  −0.138*
CP/C     0.126*                                                                           −0.099*  −0.126*
CN/T     0.307*   0.310*                     0.460*   0.463*   0.377*   0.329*   0.332*
CN/C                                         0.243*   0.215*   0.205*   0.165*
VP/T     0.552*   0.585*   0.548*   0.447*   0.307*   0.340*   0.303*

*Note. This table summarizes one-way ANOVA followed by Scheffe’s post hoc test of significant differences (p < .05).

4. Conclusion and implications

4.1. Summary of findings

The work reported in this paper is part of a 5-year collaborative project.
With the aid of the learner background profile, further analysis can be done in view of the sociolinguistic repertoires that the learners adapt in the development phase. Although the analysis of syntactic complexity was still in progress when this study was reported, the preliminary results agree with some previous studies. First, though no distinct linear pattern across the six grades was found, the average scores on 10 measures increased steadily from grades 8 to 10, indicating growth in the learners’ grammatical maturity in this range. Second, MLT is proposed as the best indicator for discriminating both adjacent and nonadjacent levels of proficiency. Nonetheless, the findings should be interpreted cautiously and not generalized to other L2 groups with different L1 backgrounds or proficiency levels.

4.2. Limitations and future research

Lu’s syntactic analyzer, though trained with Chinese EFL writing samples and achieving high reliability, was evaluated with more advanced learner data than those in the present study (Lu 2011). Due to time constraints, only a small number of occurrences were identified by human annotators. Even though the results were in accordance with the scores obtained by the analyzer, we recommend that future research in this area provide more empirical support for the reliability of this analyzer, particularly in the analysis of data from learners of lower proficiency.

Furthermore, learners are likely to experience a generally “slower pace of development” in an EFL instructional context (Ortega 2003, p. 512). As stated earlier, the results in this study did not identify a clear progression across all levels in terms of syntactic complexity. In Chang & Li’s (2008) study on Taiwanese senior high school textbooks, complex sentences were the most frequent structures observed in the three versions of the 18 textbooks they surveyed. In view of this, apart from quantifying these structures, in-depth analysis could also be performed to establish the link between the teaching materials and the non-linear complexification observed in this Taiwanese learner corpus.

The exploratory findings of this study lead to a variety of future research options. First, not only can frequency counts of various syntactic measures be obtained, but one can also gain access to a wide range of measures that immediately identify the potential area of interest without having to calculate syntactic features manually. Some researchers have also suggested that the research scope be expanded to incorporate multiple measures and sample sizes as well as heterogeneous L1 backgrounds (Ortega 2003, Lu 2011). Moreover, combined with other types of analysis such as major word category profiling (Granger & Rayson 1998), unique lexical behaviors can be ascertained and analyzed.

To further analyze the data, the writing samples will be rated and scored. Both a holistic and an analytic score will be given based on the rating scales⁵ with specific criteria for relevance, grammatical use, lexical use, coherence, organization, etc. The scoring data will be triangulated with the analyses of learners’ syntactic and lexical performances and their learning backgrounds in the hope that the results will provide a clearer picture of learners’ language development and benefit teachers and learners.

⁵ We would like to adopt the writing scale of the Intermediate Writing Test of GEPT. The English ability of the test takers who pass this level is roughly equivalent to that of a senior high school graduate in Taiwan. For more information, please refer to https://www.gept.org.tw.

Acknowledgements

The authors wish to thank Tien-en Kao, Executive Director of the Language Training and Testing Center (LTTC), for his support of this project. They also wish to acknowledge Ms. Mei-jiang Chiu of National Feng-Hsin Senior High School for her assistance in collecting the samples.

References

Chang, W.C. and I. Li. 2008. “Examining English Grammar Instruction in Taiwan’s Senior High Schools: A Discourse/Pragmatic Perspective”. English Teaching and Learning 32:3. 123-155.
Granger, S. and P. Rayson. 1998. “Automatic Profiling of Learner Texts”. Learner English on Computer, Granger, S. (ed). London/New York: Longman. 119-131.
Ishikawa, S. 1995. “Objective Measurement of Low-Proficiency EFL narrative writing”. Journal of Second Language Writing 4:1. 51-69.
Lu, X. 2010. “Automatic Analysis of Syntactic Complexity in Second Language Writing”. International Journal of Corpus Linguistics 15:4. 474-496.
Lu, X. 2011. “A Corpus-Based Evaluation of Syntactic Complexity Measures as Indices of College-Level ESL Writers’ Language Development”. TESOL Quarterly 45:1. 36-62.
Ortega, L. 2003. “Syntactic Complexity Measures and Their Relationship to L2 Proficiency: A Research Synthesis of College-Level L2 Writing”. Applied Linguistics 24. 492-518.
Wolfe-Quintero, K., S. Inagaki and H. Kim. 1998. Second Language Development in Writing: Measures of Fluency, Accuracy and Complexity. National Foreign Language Resource Center, University of Hawaii.


Appendix: Learners’ questionnaire on their English Learning Experiences

1. I was born in _____ (yy/mm).

The respondents were students from grade 7 to grade 12.

2. I started to learn English when I was ________ years old.

Age          No.   Percentage (%)
4 & under    58    8%
5            72    10%
6            89    13%
7            101   14%
8            86    12%
9            107   15%
10           107   15%
11           36    5%
12 & above   57    8%
Blank        22    3%

3. Gender

Male     419   57%
Female   300   41%
Blank    15    2%
Total    734

4. I have studied in an English-speaking country before.

Never                                 661   90%
In the U.S.                           22    3%
In other English speaking countries   15    2%
Blank                                 34    5%
Total                                 734

5. The length of my study in the English-speaking country

Time for overseas study   No. of responses   Percentages
Less than 6 months        275                37%
6 months~1 year           11                 1%
1~2 years                 3                  0%
More than 3 years         4                  1%
Blank                     441                60%
Total                     734

6. The number of English classes I have in school each week

No. of classes   No. of responses   Percentage
4                173                24%
5                213                29%
6                119                16%
7                171                23%
8 or more        19                 3%
Blank            39                 5%
Total            734


7. I attend extra-curricular English classes each week.

No. of classes   No. of responses   Percentages
0                288                39%
1~2              115                16%
3~4              257                35%
5~6              26                 4%
6 or more        7                  1%
Blank            41                 6%
Total            734

8. The English classes in my school focus mainly on __________.

Instruction focus                       No. of responses   Percentages
Grammar                                 425                58%
Reading                                 191                26%
Listening                               29                 4%
Speaking                                16                 2%
Writing                                 13                 2%
Grammar, Reading                        7                  1%
Grammar, Listening                      2                  0%
Grammar, Listening, Reading, Writing    1                  0%
Grammar, Listening, Speaking            1                  0%
Blank                                   49                 7%
Total                                   734

9. I think I need to improve my English _______ ability the most.

Skills                                  No. of responses   Percentages
Writing                                 300                41%
Speaking                                162                22%
Listening                               127                17%
Reading                                 52                 7%
Listening, Reading, Writing, Speaking   16                 2%
Listening, Reading, Writing             1                  0%
Listening, Writing, Speaking            1                  0%
Listening, Writing                      7                  1%
Listening, Speaking                     6                  1%
Listening, Reading                      1                  0%
Reading, Writing, Speaking              2                  0%
Reading, Writing                        5                  1%
Reading, Speaking                       2                  0%
Writing, Speaking                       4                  1%
Blank                                   48                 7%
Total                                   734


10. After school, I read English newspapers, magazines, storybooks or novels.

Almost every day   37    5%
Often              115   16%
Occasionally       442   60%
Never              133   18%
Blank              7     1%
Total              734

11. After school, I use English to write diary entries, blogs, letters, email, or online instant messages.

Almost every day   12    2%
Often              40    5%
Occasionally       304   41%
Never              369   50%
Blank              9     1%
Total              734

12. My school English teacher teaches how to write English sentences or paragraphs.

Almost every day   64    9%
Often              246   34%
Occasionally       282   38%
Never              128   17%
Blank              14    2%
Total              734

13. The large-scale standardized English proficiency tests I have taken

GEPT Elementary          106   14%
GEPT Intermediate        92    12%
GEPT High-Intermediate   12    2%
Cambridge ESOL tests     17    2%
Other tests              11    1%
None                     518   69%

A Cross-sectional Analysis of the Use of the English Article System in Spanish Learner Writing

María Belén DÍEZ-BEDMAR and Pascual PÉREZ-PAREDES

1. Introduction

The use of the English article system, i.e., the, a(n), Ø,¹ poses problems to students of English as a Foreign Language (FL, henceforth) from various first language (L1, henceforth) backgrounds (Thomas, 1989; Bataineh, 1997; Kambou, 1997; Robertson, 2000; Tono, 2000; Butler, 2002; Chuang, 2005; Díez-Bedmar, 2005; Kaszubski, 2005; Prat Zagrebelski, 2005; Chuang and Nesi, 2006; Díez-Bedmar and Papp, 2008; Díez-Bedmar, 2010a, 2010b). Even though students whose L1 is [−ART] struggle more when using the article system (Snape, 2006; Díez-Bedmar and Papp, 2008), the mastery of the definite, indefinite, and zero articles (at the interface between semantics and pragmatics) is problematic for all students regardless of their L1. Apart from the fact that the highest percentage of errors is made by [−ART] L1 students, the difference found between [−ART] and [+ART] L1 students is the accuracy order in their use of articles. In the case of students with a [−ART] L1, the zero article ranks first, followed by the definite and the indefinite articles, while in the case of students with [+ART] L1s, the definite article is first on the list, followed by the indefinite and the zero articles (Master, 1987; Thomas, 1989).

¹ In this paper, the term “zero article” will be used to refer to any case where no article is used in the Noun Phrase. Thus, the distinction between “zero article” and “null article” (see, for instance, Celce-Murcia and Larsen-Freeman, 1999) will not be made.

According to Master (2002), the difficulty in the use of the article system stems from three main aspects. The first one is related to the high frequency of occurrence of articles, which are among the five most frequently used words in English (Master, 1997; Leech, Rayson, and Wilson, 2001; Sinclair, 1991). The second one corresponds to the fact that articles are normally unstressed, and are, therefore, difficult for a learner to single out in oral language. Finally, the use of the article system involves the knowledge of many semantic and discourse aspects which are encoded in a single word.

The number of studies analyzing the use of the article system in English by second or foreign language learners of English has increased since the early publications in which the article system was but another grammatical
morpheme considered (Hakuta, 1976; Huebner, 1979, 1983; Tarone, 1985; and so on). In the earliest publications on article use, the presence of articles in obligatory contexts was discussed, without any detailed subclassification of contexts. This degree of specification came with Bickerton’s (1981) use of the binary semantic and discourse-pragmatic features, namely, speaker reference [±SR] and hearer knowledge [±HK], and Huebner’s subsequent taxonomy, which proved to be a turning point in the history of the analysis of article use. The consideration of these two features gave rise to four noun phrase contexts in which the article system could be used (see Section 3). In the first context, generic nouns, with the features [−SR, +HK], are found, and the definite, indefinite, and zero articles may be used. The second context includes referential definites, with the features [+SR, +HK], and only the definite article is possible. In the third context, there are first mention nouns or referential indefinites, with the features [+SR, −HK], which may be accompanied by the indefinite and zero articles. Finally, in the fourth context, there are non-referentials, attributive indefinites, and non-specific indefinites, with the features [−SR, −HK], and the indefinite and zero articles are also used. Apart from these four contexts, a fifth one which includes idiomatic expressions and the conventional uses of the article system was considered later (Thomas, 1989; Goto Butler, 2002; Ekiert, 2005).²

Since Bickerton’s (1981) and Huebner’s (1983) publications, most studies have used their taxonomy to analyze the students’ use of the article system in the FL. This has been especially the case in studies exploring the use of the article system by students whose L1 is [−ART], such as Polish (Ekiert, 2005; Leńko-Szymańska, this volume), Chinese (Díez-Bedmar and Papp, 2008; Haiyan and Lianrui, 2010), or Japanese (Humphrey, 2007).
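The four contexts defined by [±SR] and [±HK] can be summarized as a small lookup table, which may help when annotating noun phrases in a learner corpus. The following is a minimal sketch of that summary only: the names are hypothetical, the labels paraphrase the description above, and the fifth context (idiomatic and conventional uses) is not modeled.

```python
# Contexts keyed by (speaker reference, hearer knowledge); True stands for "+".
# The article sets follow the four-way description in the text above.
CONTEXTS = {
    (False, True): ("generic nouns", {"the", "a(n)", "Ø"}),
    (True, True): ("referential definites", {"the"}),
    (True, False): ("first mention nouns or referential indefinites", {"a(n)", "Ø"}),
    (False, False): ("non-referentials", {"a(n)", "Ø"}),
}


def possible_articles(sr, hk):
    """Return the set of articles that may occur in the [SR, HK] context."""
    return CONTEXTS[(sr, hk)][1]


def is_possible(article, sr, hk):
    """True if the article is among those listed for the [SR, HK] context."""
    return article in possible_articles(sr, hk)


if __name__ == "__main__":
    for (sr, hk), (label, articles) in CONTEXTS.items():
        features = f"[{'+' if sr else '−'}SR, {'+' if hk else '−'}HK]"
        print(features, label, sorted(articles))
```

A table of this kind only records which articles are possible in each context; deciding which context a given noun phrase belongs to still requires manual or semi-automatic annotation.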
The use of the article system by Spanish learners of English has also been studied (see Section 2). Although their command of the definite, indefinite, and zero articles surpasses that of students with a [−ART] language, some problems are found in their use of articles. Previous studies involving groups of Spanish learners used the features [±SR] and [±HK] to analyze the articles and contexts which were problematic (see Section 2). However, no cross-sectional study with these binary features has been undertaken so far to gain a deeper understanding of the students’ problems with contexts and articles in different academic years. Therefore, this paper aims to explore the use of the article system by Spanish secondary school students during the six years of their compulsory and optional secondary education, using Bickerton’s (1981) and Huebner’s (1983) taxonomies to

Precisely, Leńko-Szymańska (this volume) has discussed article use and conventionalized language in the context of lexical bundles.

A Cross-sectional Analysis of the Use of the English Article System in Spanish Learner Writing


conduct an Interlanguage Analysis (IL analysis, henceforth) (Selinker, 1972). The objectives of this paper are to analyze cross-sectionally the following aspects: i) the relationship between the number of words written and the type of contexts and articles used; ii) the frequency of use of the three articles; iii) the correct and incorrect uses of the three articles per context; iv) the correct and incorrect uses of each article per context; v) the effective use of articles per context and individually; and vi) the effective selection of the appropriate article per context. By exploring these six issues, this paper will contribute to the second type of studies dealing with the analysis of the article system (Liu and Gleason, 2002: 2), i.e., the one comprising the studies that analyze the acquisition process of the article system by second or FL learners of English.

The rest of the paper is organized as follows. First, the main studies dealing with the use of the article system by Spanish learners of English will be reviewed in Section 2. Section 3 will describe the methodology employed to carry out the research. The main results will be provided in Section 4, and discussed in Section 5. Finally, the main conclusions of the paper will be drawn in Section 6.

2. The use of the article system by Spanish learners of English

Despite using different learner corpora and error taxonomies, previous analyses of the written production by Spanish students of English at different levels indicate that the percentage of errors triggered by the use of articles is not high, either at the secondary education level or at the university level (Bueno González, 1992; García Gómez and Bou Franch, 1992; Jiménez Catalán, 1996; Valero Garcés, 1997; Crespo García, 1999; Wood Wood, 2002; Rodríguez Aguado, 2004; Díez-Bedmar, 2005, 2010a, b, c; Snape, 2006; Díez-Bedmar and Papp, 2008).
At the secondary education level, two (Computer-aided) Error Analyses, (C)EAs, (Dagneaux, Denness, and Granger, 1998) report that the use of determiners accounted for 9.62% of the total number of errors (Bueno González, 1992), and that the percentage of errors in the substitution of articles represented 3.18% of the total number of errors (Jiménez Catalán, 1996).3 The use of articles by learners has also been analyzed using learner data taken from the English exam for the University Entrance Examination (just before entering university). As reported in previous studies, the percentage of errors caused by the definite, indefinite, or zero articles was not found to be high. For instance, the (C)EA in Crespo García’s (1999) 3

Since these studies did not specify the differences in the uses of determiners (Bueno González, 1992), or any other problems with articles (Jiménez Catalán, 1996), only the results reported in these studies are discussed here.


study reveals that these errors accounted for 12% of the total number of errors in the learner corpus, and indicates that the zero article triggered more errors (77% of the total number of errors in the use of articles), followed by the indefinite article (14.6%), and the definite article (8.4%). Another (C)EA (Rodríguez Aguado, 2004) concludes that the students taking this exam showed an overuse of the definite article in contexts where the zero article would have been preferred. First-year university students’ productions were analyzed by Valero Garcés (1997) and Díez-Bedmar (2005) in two (C)EAs. These two studies report that the problems found in the use of the articles amounted to 5.2% and 5.3% of the total number of errors in the two learner corpora considered, respectively. The L1-induced errors in the productions by second-year university students were considered by García Gómez and Bou Franch (1992), who found that the problems with the article system stemmed from the overgeneralization of the definite article, which amounted to 7.4% of the errors in the corpus. Most of the studies on the use of the article system by Spanish learners of English provide the percentage of errors made by the students when using the article system, without specifying the contexts or the specific article uses which are more problematic. In fact, only two studies considered the students’ use of articles when expressing generic, non-generic, definite, and indefinite references. The first one was the IL analysis by Wood Wood (2002). The results obtained reveal that students made more errors when trying to express generic reference (51.68%), indefinite reference (27.28%), and definite reference (11.08%), in decreasing order of frequency. In the case of generic reference, the main problem was found in those instances where the zero article should have been used, but the definite article was used instead. 
García Mayo (2008) focused on the uses of the non-generic definite article by replicating Liu and Gleason’s (2002) study, and found an underuse of the definite article in non-generic references, which significantly decreased as the students’ level changed from the elementary to the low-intermediate level. L1 transfer was found to be the reason behind the students’ overuse of the definite article in non-generic reference, and an indirect relation with the students’ proficiency level was established in such cases. Bickerton’s (1981) semantic wheel and Huebner’s (1983) taxonomy have only been used in two studies on the written production by Spanish learners of English. The first one by Díez-Bedmar and Papp (2008) followed the Integrated Contrastive Model (Granger, 1996; Gilquin, 2000/2001). The authors claim that Spanish first-year university students of English showed the highest percentage of problems when using the indefinite and the zero articles in type 1 contexts. These findings were in line with the results of


the Contrastive Analysis conducted, since it is not possible to use the zero article to express generic reference in type 1 contexts in Spanish, and because students did not use the indefinite article to express generics when writing in their L1, although this was possible.

The second study (Díez-Bedmar, 2010a) compared the use of the article system by final-year secondary school students and first-year university students. The results confirm that type 1 contexts were the most difficult ones for students at both academic levels. The highest percentage of errors was found in the use of the zero article in type 1 and type 3 contexts, and in the use of the indefinite article in type 4 contexts for secondary school students. In the case of university students, the uses of the definite and zero articles in type 1 contexts were the most problematic ones. For both learner groups, the problems were also related to the overuse of the definite article when the zero article would have been preferred; in the case of the university students, the problems were also related to the use of the zero article where the definite one would have been used by a native speaker of English.

The changes in the students’ use of articles from the secondary education level to the university level were also analyzed. The findings revealed that there were statistically significant improvements in the use of the definite article in type 2 contexts, the zero article in type 1 and type 3 contexts, and the indefinite article in type 4 contexts. However, the number of errors in the use of the definite article in type 1 contexts increased significantly. A further analysis of the first-year students’ use of the articles at the beginning of the academic year and at the end also indicated that the use of the definite article in type 2 contexts and the zero article in type 3 contexts improved significantly; however, this was not the case with the zero article in type 4 contexts.

3. Method

The learner corpus used for this paper is a section of the Spanish subcorpus of the International Corpus of Crosslinguistic Interlanguage (ICCI). This section was compiled in January 2010 in a state secondary school in Jaén (Spain), and is composed of the texts written on the topic “Which is your favourite film? What happens in it?” To have a balanced representation of the students’ texts cross-sectionally at the different levels of their secondary education, twenty-five texts from each academic year were randomly selected.4 The learner corpus that was analyzed comprised 13,645 words, distributed as shown in Table 1.

In the Spanish education system, there are four compulsory years of secondary education (i.e., 1 ESO, 2 ESO, 3 ESO, and 4 ESO), and two further years of optional secondary education (1 Bachillerato and 2 Bachillerato) for those students who would like to proceed to university.

Table 1. Distribution of number of words and mean of words per composition for each academic year

| Academic year | Number of words | Mean and standard deviation |
|---|---|---|
| Year 1: 1 ESO | 1,552 | M = 62.08, SD = 32.61 |
| Year 2: 2 ESO | 2,822 | M = 112.88, SD = 54.22 |
| Year 3: 3 ESO | 2,201 | M = 88.04, SD = 50.69 |
| Year 4: 4 ESO | 1,737 | M = 69.48, SD = 32.94 |
| Year 5: 1 BACH | 1,805 | M = 72.30, SD = 43.06 |
| Year 6: 2 BACH | 3,538 | M = 141.12, SD = 43.35 |

Figure 1. Bickerton’s (1981) semantic wheel

The theoretical framework for conducting the IL analysis (Selinker, 1972) of the students’ use of the, a(n), and Ø was Bickerton’s (1981) semantic wheel and Huebner’s (1983) taxonomy. This framework is based on two binary semantic and discourse-pragmatic features. The first one is speaker reference [±SR], i.e., whether it is specific or not, and the second one is hearer knowledge [±HK], i.e., whether the hearer shares the knowledge transmitted by the speaker or not. As a result of the combination of these features, four contexts are possible when using the articles, as summarized in Bickerton’s (1981) semantic wheel (Figure 1), and exemplified in Table 2. Following the tagging systems developed by Díez-Bedmar and Papp (2008) and further explored in Díez-Bedmar (2010a, 2010b), the correct and incorrect uses of the articles were identified and manually tagged according to the taxonomies in Tables 3 and 4, respectively.5 5

The incorrect uses of the articles were identified with the help of a native speaker of English.


Table 2. Classification and examples of the four contexts, following Thomas (1989) and Goto Butler (2002)

| Type | Definition | Articles | Examples |
|---|---|---|---|
| Type 1 [−SR, +HK] | Generic nouns | a, the, Ø | Ø Fruit flourishes in the valley. Ø Elephants have trunks. The Grenomian is an excitable person. They say the elephant never forgets. A paper clip comes in handy. An elephant never forgets. |
| Type 2 [+SR, +HK] | Referential definites, previous mention, specified by entailment, specified by definition, unique in all contexts, unique in a given context | the | Pass me the pen. The idea of coming to the UK was ... I found a book. The book was ... The first person to walk on the moon ... |
| Type 3 [+SR, −HK] | Referential indefinites, first-mention nouns | a, Ø | Chris approached me carrying a dog. I’ve bought a new car. A man phoned. I keep sending Ø messages to him. I’ve got Ø friends in the UK. I’ve managed to find Ø work. |
| Type 4 [−SR, −HK] | Non-referential nouns, attributive indefinites, non-specific indefinites | a, Ø | Alice is an accountant. I need a new car. I guess I should buy a new car. A man is in the ladies, but I haven’t seen him. Ø Foreigners would come up with a better solution. |
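Because the four contexts follow mechanically from the two binary features, the classification can be sketched as a small lookup table. This is an illustrative sketch only, not part of the authors' tagging procedure: the names `Context`, `CONTEXTS`, `classify`, and `is_permitted` are hypothetical, and the permitted article sets are simply those listed in the Articles column of Table 2.

```python
from typing import NamedTuple


class Context(NamedTuple):
    """One of the four noun phrase contexts (hypothetical helper type)."""
    number: int
    label: str
    articles: frozenset


# Keys are (speaker reference, hearer knowledge), i.e. ([±SR], [±HK]).
CONTEXTS = {
    (False, True): Context(1, "generic nouns", frozenset({"a", "the", "Ø"})),
    (True, True): Context(2, "referential definites", frozenset({"the"})),
    (True, False): Context(3, "referential indefinites", frozenset({"a", "Ø"})),
    (False, False): Context(4, "non-referential nouns", frozenset({"a", "Ø"})),
}


def classify(sr: bool, hk: bool) -> Context:
    """Return the context type for a given feature combination."""
    return CONTEXTS[(sr, hk)]


def is_permitted(article: str, sr: bool, hk: bool) -> bool:
    """Check whether an article is listed for the context in Table 2."""
    return article in classify(sr, hk).articles


if __name__ == "__main__":
    # "Pass me the pen": specific referent known to the hearer.
    print(classify(sr=True, hk=True).number)
    # The zero article is not listed for that context.
    print(is_permitted("Ø", sr=True, hk=True))
```

Keying the table on the feature pair keeps the sketch close to the [±SR]/[±HK] notation used in the text; the fifth context (idioms and conventional uses) is deliberately left out because it is not defined by these two features.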

Table 3. Tagging system for the correct uses of articles (Díez-Bedmar and Papp, 2008)

| Article used by the learner | Context 1 | Context 2 | Context 3 | Context 4 |
|---|---|---|---|---|
| Definite article | 1DA | 2DA | | |
| Indefinite article | 1IA | | 3IA | 4IA |
| Zero article | 1ZA | | 3ZA | 4ZA |
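The tags for correct uses combine a context number with an article code, which makes them easy to validate and tally once they have been extracted from the corpus. The following is a minimal sketch under that assumption; the helper names `parse_tag` and `tally_by_context` are hypothetical, and how the tags were actually stored in the corpus files is not specified in the text.

```python
import re
from collections import Counter

# Article codes as they appear in the tags for correct uses.
ARTICLES = {"DA": "definite", "IA": "indefinite", "ZA": "zero"}

# The eight tags listed for correct uses of the articles.
VALID_TAGS = {"1DA", "1IA", "1ZA", "2DA", "3IA", "3ZA", "4IA", "4ZA"}

TAG_PATTERN = re.compile(r"([1-4])(DA|IA|ZA)")


def parse_tag(tag: str) -> tuple:
    """Split a tag such as '3ZA' into (context number, article name)."""
    if tag not in VALID_TAGS:
        raise ValueError(f"not a tag for a correct use: {tag!r}")
    match = TAG_PATTERN.fullmatch(tag)
    return int(match.group(1)), ARTICLES[match.group(2)]


def tally_by_context(tags) -> Counter:
    """Count correct uses per context from an iterable of tags."""
    return Counter(parse_tag(tag)[0] for tag in tags)


if __name__ == "__main__":
    # Invented tag sequence, for illustration only.
    print(tally_by_context(["1ZA", "2DA", "2DA", "4IA"]))
```

Rejecting combinations such as `2IA` up front mirrors the taxonomy, in which only the definite article counts as a correct use in context 2.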

Table 4. Error tagging system for articles (Díez-Bedmar and Papp, 2008)

Definite article Indefinite article Zero article

Context 1 1GAIA 1GAZA 1GADA 1GAZA 1GADA 1GAIA

Article used by the learner Context 2 Context 3

2GADA 2GADA

3GADA 3GAZA 3GADA 3GAIA

Context 4

4GADA 4GAZA 4GADA 4GAIA


Once the learner corpus had been (error-)tagged, WordSmith Tools version 5 (Scott, 2008) was used to retrieve and quantify the uses of the articles. Finally, statistical tests were run with the software package SPSS, version 15.

4. Results

The results obtained are discussed in six sections that correspond to the objectives of the paper.

4.1. Number of words written per academic year

As shown in Table 1, the mean of words written per academic year did not present a steady evolution from year 1 to year 6 in secondary education, as corroborated by the non-normal distribution of the data indicated by the Levene test (p < .05). To find out whether the differences in the mean of words written per academic year were significant from one year to another, a Kruskal-Wallis test was run. Since the result revealed differences (H(5) = 46.572, p < .001), Mann-Whitney tests were conducted to see where those differences were located. As a result, two important stages were found once the Bonferroni correction had been applied. The first one was located between years 1 and 2 (U = 119, z = −3.755, p < .05), and the second one was between years 5 and 6 (U = 59.5, z = −4.91, p < .05). In both cases, the mean of words written in the higher academic year, i.e., years 2 and 6, respectively, was significantly higher.

4.2. Frequency of use of the three articles

A total of 1,780 (correct and incorrect) uses of the articles were identified in the learner corpus. In decreasing order of frequency, the definite article was used most frequently (1,104), followed by the indefinite article (427), and the zero article (249). The uses of the three articles over the six academic years did not follow a steady evolution. In fact, the Kruskal-Wallis test indicated that there were significant differences when considering the uses of the definite, indefinite, and zero articles (H(5) = 33.57, p < .001; H(5) = 11.71, p < .05; and H(5) = 32.65, p < .001, respectively).
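The logic of the Bonferroni correction used after an omnibus test can be sketched in a few lines. This is a hedged illustration, not the authors' SPSS procedure: the text does not state how many pairwise comparisons were corrected for, so both the adjacent-year scheme and the all-pairs scheme below are assumptions, as are the function names and the base level of .05.

```python
from itertools import combinations

YEARS = range(1, 7)  # six academic years


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison significance level after Bonferroni correction."""
    if n_comparisons < 1:
        raise ValueError("at least one comparison is required")
    return alpha / n_comparisons


def is_significant(p_value: float, alpha: float, n_comparisons: int) -> bool:
    """Compare a raw p-value against the corrected threshold."""
    return p_value < bonferroni_alpha(alpha, n_comparisons)


# Assumption A: only consecutive years are compared (1-2, 2-3, ..., 5-6).
adjacent_pairs = [(year, year + 1) for year in range(1, 6)]

# Assumption B: every pair of years is compared.
all_pairs = list(combinations(YEARS, 2))

if __name__ == "__main__":
    for name, pairs in (("adjacent", adjacent_pairs), ("all", all_pairs)):
        print(name, len(pairs), bonferroni_alpha(0.05, len(pairs)))
```

Under the adjacent-year assumption the corrected threshold is .05 divided by 5, while comparing all pairs divides it by 15, so the choice of scheme changes which differences count as significant.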
Posterior Mann-Whitney tests and the application of the Bonferroni correction revealed that the differences in the mean of article use per composition and academic year were significant between years 5 and 6 in the use of the definite, indefinite, and zero articles (U = 91, z = −4.30, p < .01; U = 160, z = −2.98, p < .005; and U = 91, z = −4.39, p < .01, respectively), and between years 1 and 2 in the case of the definite article (U = 149.5, z = −3.17, p < .01). Although the increase in the number of definite, indefinite, and zero articles used between years 5


and 6 could be due to the parallel significant increase in the mean of words written between those two years, it is important to highlight that at the other stage when there was a significant increase in the mean of words written (i.e., between years 1 and 2), only the definite article showed a statistically significant difference.

4.3. Correct and incorrect uses of the three articles per context

As stated in Section 3, Bickerton’s (1981) semantic wheel and Huebner’s (1983) subsequent taxonomy divide the use of the articles into four possible contexts. A study of the correct and incorrect uses of articles per context (Figure 2) reveals that the percentage of incorrect uses of articles was 7.8%. If the different contexts are considered, context 1 posed the most problems for students (23.25% of errors), followed by context 3 (11.79%), 4 (6.96%), and 2 (5.20%), in descending order of frequency. After checking the non-normal distribution of the correct and incorrect uses of articles per academic year (p > .05, in both cases), a Kruskal-Wallis test was run to see whether the cross-sectional analysis of the data pointed to any significant difference regarding the students’ correct and incorrect uses of articles. The results obtained revealed that there were no statistically significant differences in the incorrect uses of articles per context (p ≥ .05 in all contexts). However, the differences in the correct use of articles per context were found to be significant in context 1 (H(5) = 14.531, p < .01), context 2 (H(5) = 31.149, p < .001), and context 4 (H(5) = 28.079, p < .001). Mann-Whitney tests

Figure 2. Correct and incorrect uses of articles per contexts


were run to follow up on these results. After applying the Bonferroni correction, the correct uses of articles in context 2 showed a significant increase in the mean of articles correctly used in such contexts from years 1 to 2 (U = 149.5, z = −3.173, p < .005). The correct uses of articles in three contexts showed statistically significant differences between years 5 and 6 (context 1: U = 183, z = −2.895, p < .005; context 2: U = 104, z = −5.055, p < .001; and context 4: U = 159.5, z = −3.079, p < .005).

4.4. Correct and incorrect uses of each article per context

Once the general picture of the correct and incorrect article use per context had been analyzed, the analysis of the correct and incorrect use of each article per context was undertaken. Different scenarios were found depending on the context and the specific article considered. As was the case with the data analyzed earlier, the use of each article in each context and year showed a non-normal distribution (p > .05). After the statistical analyses, significant differences were found in the correct use of the zero article in type 1 contexts (H(5) = 16.275, p < .01), the use of the definite article in type 2 contexts (H(5) = 21.249, p < .001), and the uses of the indefinite article and the zero articles in type 4 contexts (H(5) = 18.402, p < .005 and H(5) = 26.998, p < .001, respectively). The posterior Mann-Whitney tests and Bonferroni corrections confirmed that there were two important stages in the data. First, there was a statistically significant difference in the correct use of the definite article in type 2 contexts from year 1 to year 2 (U = 149.5, z = −3.173, p < .005). Second, the correct use of the zero article in type 1 contexts (U = 180.5, z = −3.246, p .05), a Kruskal-Wallis test was conducted to find out the possible differences in the effective use of the articles over the six years.
The effective uses of the zero article in context 1 (H(5) = 11.827, p < .05), the definite article in context 2 (H(5) =


31.249, p < .001), and the indefinite article (H(5) = 18.338, p < .005) and the zero article in context 4 (H(5) = 27.221, p < .001) were found to be significant. Mann-Whitney tests were run, and the Bonferroni correction was applied. The findings obtained point to statistically significant differences in the effective uses of the following articles: i) the definite article in context 2 from year 1 to 2 (U = 149.5, z = −3.173, p < .005); ii) the zero article in context 1 from year 5 to 6 (U = 197.5, z = −2.760, p < .01); iii) the definite article in context 2 from year 5 to 6 (U = 104, z = −4.055, p < .001); iv) the indefinite article in context 4 from year 5 to 6 (U = 199, z = −2.734, p < .01); and v) the zero article in context 4 from year 5 to 6 (U = 193, z = −2.474, p < .01).

4.6. The effective selection of article per context and academic year

The last aspect that was analyzed in the data took into consideration the effectiveness in the selection of the appropriate article per context. The number of correct uses of an article in a context minus the incorrect uses of the other articles in that context (i.e., when the context required the use of a specific article, but the students provided an incorrect one) was calculated. The non-normal distribution of the data (p < .05 in all contexts) was considered for the statistical analyses. The results of the Kruskal-Wallis test revealed statistically significant differences in the effective selection of the definite article in type 2 contexts (H(5) = 27.434, p < .001), the indefinite article in type 4 contexts (H(5) = 18.338, p 11.402 > 9.731. In other words, there are differences in mean tokens per sentence among the three groups, with a descending order: HK 13.055 > TW 11.402 > MC 9.731.

3.2. Method: Annotating scheme and frequency count

As noted earlier, Halliday and Hasan (1976) have provided a framework where English cohesive devices are classified into five major categories.
Among them, three types of devices (lexical, reference, and conjunction) have been investigated in relation to ESL/EFL writings; the substitution and ellipsis devices are generally believed to occur primarily in oral communication. For exploratory purposes, in this study we examine only one of the three main types of CDs, namely conjunctive devices. We know that cohesive devices, whether oral or written, play an indispensable part in language communication. Despite the various names (e.g., discourse marker, discourse connectives) and definitions given by different scholars from different perspectives (e.g., Fraser, 1999), there is a basic consensus: CDs “are a class of lexical expressions drawn primarily from the syntactic classes of conjunctions, adverbs, and prepositional phrases. With certain exceptions, these CDs signal a relationship between the interpretation of the segment they introduce, S2, and the prior segment, S1” (Fraser, 1999: 931). According to this definition, conjunctive devices can be subdivided into three main categories (see Table 2 below). The CDs have a core meaning that is “procedural, not conceptual, and their more specific interpretation is negotiated by the context, both linguistic and conceptual” (Fraser, 1999: 931). For example, CDs such as and, also, too, as well, besides, and in addition introduce new information that strengthens the old information already present in the text. In other words,

Table 2. Three classes of conjunctive devices

| Additive devices (adding) | Adversative devices (contrast) | Causal devices (causality) |
|---|---|---|
| and, also, too, as well, besides, in addition | but, yet, however, nevertheless | because, so, because of, since, therefore, as a result |


Yongbing LIU and Huiping ZHANG

by adding new information, it strengthens the old assumption (Sperber & Wilson, 2001). The CDs but, yet, however, and nevertheless introduce new information that is contrary, or even directly opposed to, the old information, suggesting the abandonment of the old assumptions (Sperber & Wilson, 2001). Similarly, CDs such as because, so, because of, since, therefore, and as a result introduce new information that may combine with old information to form new contextual implications, for example, causality.

This scheme was used to annotate the texts of the Chinese learner sub-corpora of the ICCI and calculate the overall and peak frequencies of the CDs used in the texts produced by the three groups of learners, respectively. In order to facilitate the annotation, some treatments were applied to the texts. First, all the original XML markup was removed with the software Text Editor. Then, the texts were tagged with parts of speech by means of the TreeTagger software. After all the treatments, the AntConc program was used for item searching and observation.

Counting the CD too required a more complicated procedure: the tokens of too meaning “to an excessive degree” (see http://dict.youdao.com/), i.e., those used to modify adjectives or adverbs, were deducted from the total. First, the tokens of too used to modify an adjective were retrieved by writing the regular expression “\S+_RB\s\S+_JJ\s” in AntConc’s “search bar,” and after “sorting” them in alphabetical order, the concordance lines with too were observed and counted. Similarly, the tokens of too used to modify adverbs were retrieved by writing the regular expression “\S+_RB\s\S+_RB\s” in the “search bar,” and after “sorting” them in alphabetical order, the concordance lines with too were observed and counted.

4. Results and discussion

In this section of the article, we report the occurrence of specific features with checked marks in the appropriate categories of conjunctive devices. Analysis of the data involves the calculation and synthesis of the frequency and percentage of CD occurrences devoted to each subcategory. It is the relative overall occurrence and non-occurrence of features that allows for our discussion of the research questions posed earlier. Results pertaining to each of the major categories form the basis of the following sections. As noted earlier, actual CD tokens were classified in terms of the following three types: additive, adversative, and causal devices. The numbers of these CDs used by each of the three groups were counted, followed by the determination of the frequencies of the conjunctive devices in each category. Examples from the corpus are provided to exemplify the descriptions where necessary.
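The procedure for separating additive too from intensifying too, described in Section 3.2, can be approximated programmatically. The sketch below is not the authors' AntConc workflow: it assumes the tagged text is a single string of whitespace-separated `word_TAG` tokens (the format implied by the regular expressions quoted in Section 3.2), it narrows those expressions to the word too itself, and the function name and sample sentence tags are hypothetical.

```python
import re

# Any occurrence of "too" tagged as an adverb.
ANY_TOO = re.compile(r"\btoo_RB\b", re.IGNORECASE)

# "too" immediately followed by an adjective (JJ) or an adverb (RB),
# i.e. the "to an excessive degree" reading that is deducted.
INTENSIFIER_TOO = re.compile(r"\btoo_RB\s+\S+_(?:JJ|RB)\b", re.IGNORECASE)


def count_additive_too(tagged_text: str) -> int:
    """Total tokens of 'too' minus those modifying an adjective or adverb."""
    total = len(ANY_TOO.findall(tagged_text))
    intensifiers = len(INTENSIFIER_TOO.findall(tagged_text))
    return total - intensifiers


if __name__ == "__main__":
    # Invented sample with illustrative tags.
    sample = (
        "I_PP like_VVP it_PP too_RB ._SENT "
        "It_PP is_VBZ too_RB big_JJ ._SENT "
        "He_PP runs_VVZ too_RB fast_RB ._SENT"
    )
    print(count_additive_too(sample))
```

An automatic count of this kind would still need the manual inspection of concordance lines that the authors describe, since a tagger can mislabel the word following too.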

Use and Misuse of Cohesive Devices in the Writings of EFL Chinese Learners


Table 3. Types and tokens of CDs used by three groups

| Groups | CDs | Types | Tokens |
|---|---|---|---|
| ML | And, Also, Too, But, However, Yet, Because, So, Because of, Therefore | 10 | 1125 |
| HK | And, Also, Too, As well, Besides, In addition, But, However, Yet, Nevertheless, Because, So, Because of, Since, Therefore | 15 | 1256 |
| TW | And, Also, Too, Besides, In addition, But, However, Because, So, Because of, Since, Therefore | 12 | 1289 |
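The type and token totals in Table 3 can be cross-checked against the per-device counts reported in Tables 4-6. The sketch below does this with plain dictionaries; the counts are copied from those tables (where the ML group is labeled MC), while the variable and function names are hypothetical. A device counts as a type for a group if it occurs at least once.

```python
# Raw frequencies per group, copied from Tables 4-6.
COUNTS = {
    "ML": {
        "and": 563, "also": 88, "too": 48, "as well": 0, "besides": 0,
        "in addition": 0, "but": 157, "however": 3, "yet": 2,
        "nevertheless": 0, "because": 203, "so": 52, "because of": 7,
        "since": 0, "therefore": 2,
    },
    "HK": {
        "and": 548, "also": 82, "too": 71, "as well": 1, "besides": 7,
        "in addition": 5, "but": 196, "however": 9, "yet": 3,
        "nevertheless": 1, "because": 224, "so": 91, "because of": 7,
        "since": 9, "therefore": 2,
    },
    "TW": {
        "and": 639, "also": 62, "too": 80, "as well": 0, "besides": 4,
        "in addition": 1, "but": 175, "however": 4, "yet": 0,
        "nevertheless": 0, "because": 240, "so": 71, "because of": 4,
        "since": 6, "therefore": 3,
    },
}


def summarize(counts: dict) -> dict:
    """Return (number of types used, number of tokens) for each group."""
    return {
        group: (sum(1 for n in freqs.values() if n > 0), sum(freqs.values()))
        for group, freqs in counts.items()
    }


def share(group: str, device: str) -> float:
    """Percentage of a group's CD tokens accounted for by one device."""
    freqs = COUNTS[group]
    return round(100 * freqs[device] / sum(freqs.values()), 2)


if __name__ == "__main__":
    print(summarize(COUNTS))
    print(share("HK", "and"))
```

Deriving the totals from the raw counts, rather than typing them in, makes it easy to spot a transcription slip in any one table.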

4.1. The overall tendency of CD use by the three groups

Overall, the ML group produced 10 different CDs, the HK group produced 15 different CDs, and the TW group produced 12 different CDs (as shown in Table 3). There are about three CD differences between the TW and HK groups, while the variation between the ML and HK groups is much larger at five CD differences. Similarly, there are also notable differences between ML and TW or HK groups in terms of the total frequency of CD use (as shown in Table 3). The ML group used 1,125 tokens, whereas the TW group used 1,289 and the HK group used 1,256 tokens. These figures show that, among the three groups, the HK group used more CDs than the other two in terms of type, but the differences are not very great. Of significance are the differences between the ML and HK groups in terms of the overall frequency of CD types (10 vs. 15) and tokens (1,125 vs. 1,256). In relation to the differences of mean tokens per sentence between the three groups noted earlier (a descending order of HK 13.055 > TW 11.402 > MC 9.731), it can be seen here that the HK group not only used more CDs but also wrote relatively longer sentences. These figures suggest that the cohesive aspects of the writings of the HK group are much more diverse and sophisticated than their ML and TW counterparts.

4.2. The most frequently used CDs

From Tables 4-6, it can be seen that and is the most frequently used CD to express adding (ML 80.54%, HK 76.75%, TW 81.30%), but is the most

Table 4. Frequencies of additive devices

| Adding | MC (n) | MC (%) | HK (n) | HK (%) | TW (n) | TW (%) |
|---|---|---|---|---|---|---|
| And | 563 | 80.54% | 548 | 76.75% | 639 | 81.30% |
| Also | 88 | 12.59% | 82 | 11.48% | 62 | 7.89% |
| Too | 48 | 6.87% | 71 | 9.94% | 80 | 10.18% |
| As well | 0 | 0.00% | 1 | 0.14% | 0 | 0.00% |
| Besides | 0 | 0.00% | 7 | 0.98% | 4 | 0.51% |
| In addition | 0 | 0.00% | 5 | 0.70% | 1 | 0.13% |
| Total frequencies of each group | 699 | | 714 | | 786 | |

Total frequencies of all groups and its percentage in all CDs: 2199 (59.92%)

Table 5. Frequencies of adversative devices

| Contrast | MC (n) | MC (%) | HK (n) | HK (%) | TW (n) | TW (%) |
|---|---|---|---|---|---|---|
| But | 157 | 96.91% | 196 | 93.78% | 175 | 97.77% |
| However | 3 | 1.85% | 9 | 4.31% | 4 | 2.23% |
| Yet | 2 | 1.23% | 3 | 1.43% | 0 | 0.00% |
| Nevertheless | 0 | 0.00% | 1 | 0.48% | 0 | 0.00% |
| Total frequencies of each group | 162 | | 209 | | 179 | |

Total frequencies of all groups and its percentage in all CDs: 550 (14.99%)

Table 6. Frequencies of causal devices

| Causality | MC (n) | MC (%) | HK (n) | HK (%) | TW (n) | TW (%) |
|---|---|---|---|---|---|---|
| Because | 203 | 76.89% | 224 | 67.27% | 240 | 74.07% |
| So | 52 | 19.69% | 91 | 27.33% | 71 | 21.91% |
| Because of | 7 | 2.65% | 7 | 2.10% | 4 | 1.23% |
| Since | 0 | 0.00% | 9 | 2.70% | 6 | 1.85% |
| Therefore | 2 | 0.76% | 2 | 0.60% | 3 | 0.93% |
| Total frequencies of each group | 264 | | 333 | | 324 | |

Total frequencies of all groups and its percentage in all CDs: 921 (25.09%)

frequently used to express contrast (ML 96.91%, HK 93.78%, TW 97.77%), and because is the most frequently used to express causality (ML 76.89%, HK 67.27%, TW 74.07%). Importantly, there is no marked difference in the use of these CDs among the three groups. The CDs that mark adding, causality and contrast are recurrent, with and ranking first (ML 50.04%, HK 43.63%, TW 49.57%), because second (ML 20.44%, HK 17.83%, TW


18.62%), and but third in average frequency (ML 13.96%, HK 15.61%, TW 13.58%). Additionally, there is no marked difference among the three groups of learners.

4.3. Overall differences among the three groups

There are some overall differences among the three groups of learners in terms of CD use. Apart from the most frequently used discourse markers (and, but, because), the HK and TW groups used additional CDs more than the ML group, including besides (HK, 7 times; TW, 4 times; ML, 0 times), too (HK, 71 times; TW, 80 times; ML, 48 times), and in addition (HK, 5 times; TW, 1 time; ML, 0 times) to mark adding, however to mark contrast, and so and since to mark causality (refer to Tables 4-6). These numbers show that the writings of the HK and TW groups are relatively more diverse in terms of CD use, and thus the sentences are more complex than those of the ML group, even though the differences are not very great.

Both besides and and can be used to mark an additive relationship; however, besides, which means “making an additional point” (see http://dict.youdao.com/), is more specific in meaning. The HK and TW groups used more CDs in marking different relationships, whereas the ML group used relatively few. For example, consider the following cases of besides:

(1) My favorite food is beef noodles. I like the beef noodles because it can keep me warm in cold Winter. Besides, it is very delicious, too. (tw11081.txt)
(2) After I got HK$500 for my birthday, I will save $300 in the bank and take the money out if there is an emergency. Besides, saving the money in bank is safe and I had no need to worry about my money. (hk87222.txt)

From the examples, we can see that the HK and TW learners have fairly clear understandings of the differences between the two CDs and and besides, and that they used besides appropriately. And normally combines with its structural status as a coordinating conjunction to have one basic function, that is to continue a cumulative set (e.g., continue an action, continue a topic) (Fraser, 1999). Obviously in the above examples, the CD also is used to make an additional point rather than to continue an action or topic. From Table 3, it can be seen that also is more frequently used by the three groups of learners than the other two CDs, with too ranking second and as well ranking third. This ordering in terms of use frequency is similar to that of English native speakers. However, in the cases of also placed at the beginning of a sentence, the ML and TW groups never or seldom used it in their writings, while the HK group used it more often for emphasizing new information. In terms of the use of commas, all groups have recognized


Yongbing LIU and Huiping ZHANG

that also can be followed by a comma when it is used at the beginning of a sentence, but the ML group might have mistaken the comma for a rigid rule, because in the few cases where also was used, it was followed by a comma. In addition, all three groups might have mistakenly understood that too can only be placed at the end of a sentence, or they may be unaware of its flexibility in use to achieve different contextual effects, given that there are no cases where too was used in positions other than at the end of a sentence. In terms of comma usage, the ML and TW groups frequently used too at the end of a sentence with a comma, while the HK group displayed more flexibility in terms of its position, with patterns similar to those of English native speakers; that is, they placed a comma in only 1/5 of the cases located at the end of a sentence.

4.4. The misuse of “and then, and also”

As noted earlier, the frequencies of and among the three groups were significantly higher than those of all the other discourse markers. The reason for this finding may be that and can be used flexibly and combined with other discourse markers to suggest other kinds of relationships, such as and then, and yet, and thus, and therefore, and and also, to mark time successiveness, causality, or intentionality (He, 2000: 176). Consider the following examples:

A: John turned the key and the engine started.
B1: John turned the key and then the engine started. (and implying time successiveness)
B2: John turned the key and therefore the engine started. (and implying causality)
B3: John turned the key in order to cause the engine to start. (and implying intentionality)

Yet, upon a closer look at the combination of and with some other CDs in the current corpus, we find that and then and and also are most frequently used by all three groups. This finding suggests a negative transfer of the Chinese CDs ranhou (meaning then or and then) and erqie (meaning moreover) into their English writing to mark “a sequence of acts or event,” and the pattern has even become a pet phrase in English oral communication (Zhang & Liu, 2010). The following are some examples from the learner corpus.

(1) Today is my birthday. My parents told me that I must do something for myself. firstly, we clean the room and then I help my mother do some cooking. (cn00006.txt)
(2) I want use the money to buy a telephone. Because many of my classmates own it. I also want to have one, and then they will not laugh at me. (cn20247.txt)
(3) I want buy a bike, because I walk to go to school every day, and then I can ride a bike to school, ... (tw12111.txt)
(4) At first, I’ll buy some CDs, because I like listen music. And then, I’ll buy some gifts

Use and Misuse of Cohesive Devices in the Writings of EFL Chinese Learners


for my parents, if my NT$2,000 aren’t they give. (tw12070.txt)
(5) So I think I will use the money to buy The Twilight Saga 4 and then maybe I will also buy some decoration for my bedroom such as flower. (hk87226.txt)
(6) It must be very good. My friend and I will be very happy. And then I will go shopping with my friend too. (hk87267.txt)
(7) ... my classmates told me my mother is very becautiful, I feel very happy, and also need to thank my parents. (cn00090.txt)
(8) Take a bowl and pour the milk, cream and sugar in it and also add the flavouring oil and mix them well. (hk87102.txt)

4.5. The misuse of also in terms of its position

The CD also is one of the most frequently used devices in the corpus. Table 7 shows that the overall frequency of misuse in the writings of the ML group is higher than that of the TW group and that the HK group seldom misused it. Type 1 is the most frequent misuse, especially in the ML and TW groups; the other types occur only a few times in any group. These findings may indicate that there are differences between the learners in the three learning environments. According to the English convention for also as a CD (Quirk et al., 1985), it can be used after any modal or link verb, but it is seldom used before a modal verb. However, the ML and TW groups of Chinese learners very often put also before different modal verbs. In fact, this has become a kind of misused collocation in this learner corpus, as in also can and also will (also can occurred most frequently; refer to Table 7).

Table 7. Frequencies of also misuse by type

Types of misuses                               MC   HK   TW
Type 1  Also located before a modal verb       19    1   11
Type 2  Also located before a link verb         3    1    2
Type 3  Also misused in a negative sentence     2    1    0

Table 8. Frequencies of the misused combinations of also with modal verbs

Different modals combined with also            MC   HK   TW
Also can                                       17    0   11
Also will                                       2    1    0

As shown in Table 8, there is a marked difference between the HK group and the other two groups in the misuse of also can and also will. The ML and TW groups used these two combinations very frequently while the HK group used them only once. We know that both the ML and TW groups share Mandarin as their mother tongue, but the HK group’s mother tongue is Cantonese (a Chinese dialect that is very different from spoken Mandarin). In Mandarin, ye (meaning also) is a free CD, which can be placed before nearly all the modal verbs or a link verb, such as “... ye neng” (also can), “... ye hui” (also will), “... ye yinggai” (also should), and “ye shi” (also is/am/are/was/were). On the other hand, there are no such combinations or expressions in Cantonese. This difference between the three groups may contribute to the different results in terms of the misuse of also plus modal verbs in their English writings. This finding also suggests that there is a systematic negative transfer of Chinese in these learners’ English writings. Consider the following examples from the corpus:

(1) When I grow up, my friends usually invite me to join their party. I feel very happy, then I hope my parents also can hold a birthday party for me, they were very angry with me and told me directly “NO” ... (cn00116.txt)
(2) I want to buy a computer for my mother, because when she has computer, she will not feel lonely, and we also can talk through internet, I hope my mother will come back soon ... (cn20267.txt)
(3) NT$2000 is not many! A red envelope from my perant is more than it. But it also can buy many thing. First, I will buy maby ten comic books ... (tw12127.txt)
(4) My favorite food is steak. Steak is yummy, but it’s too expensive. Steak also can good for us to be tall. (tw11097.txt)

From the above examples, we can see that also can is used instead of the correct word order can also. However, the correct form and the misused form are both present in the Chinese learners’ corpus. Through text observation, we found that some learners used the correct and incorrect forms interchangeably in the same text, which may be considered a case of “free variability” in the learner’s interlanguage. An example of this case is presented below.

(1) I will give it to some poor family. Although it isn’t so much, it also can help them. I can also buy some trees. (tw12047.txt)
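The co-occurrence of the two word orders in a single text can be checked mechanically. The following is a minimal sketch, not the authors' procedure: the function name and the plain regular-expression matching are assumptions made for illustration, and the sample is the tw12047.txt excerpt quoted above.

```python
import re


def count_also_modal_orders(text, modals=("can", "will")):
    """Count 'also + modal' and 'modal + also' sequences in a text.

    Hypothetical helper for illustration only; it matches surface
    strings and does no tagging or parsing.
    """
    lowered = text.lower()
    counts = {}
    for modal in modals:
        # Learner order discussed in the chapter, e.g. "also can"
        counts[f"also {modal}"] = len(re.findall(rf"\balso\s+{modal}\b", lowered))
        # Target-like order, e.g. "can also"
        counts[f"{modal} also"] = len(re.findall(rf"\b{modal}\s+also\b", lowered))
    return counts


sample = (
    "I will give it to some poor family. Although it isn't so much, "
    "it also can help them. I can also buy some trees."
)
counts = count_also_modal_orders(sample)
print(counts)
```

A text in which both orders have non-zero counts for the same modal is a candidate for the kind of variation described here; deciding whether it really is free variation still requires reading the text.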

This finding suggests that in the initial stages of L2 development learners are likely to exhibit a fair amount of variation. When learners first internalize new linguistic items, they do not know precisely what functions they realize in the target language and the result is “free variability” (Ellis, 1985: 84). “Free variability” is the major source of instability in interlanguage because the learner will try to improve the efficiency of his interlanguage system by developing clear-cut understandings of form-function relationships (Ellis, 1985: 95). When new forms enter the interlanguage, they are likely to be used in free variation. In subsequent stages, the learner will progressively sort out forms into functional pigeon holes, and it is likely that the first sorting out will not establish the form-function correlations of the target language (Ellis, 1985). According to Ellis (1985), this process may take several sortings, and many learners may never entirely achieve the target form-function relationship. The sorting process is a continuous one as long as new forms are assimilated, because each new form will require further functional reorganization in order to resolve the issue of “free variability” (Ellis, 1985).

4.6. The misuse of “because”

From Table 9, we can see that because is another CD that is frequently misused by the three groups of learners, with the ML group ranking first (98 times), the TW group ranking second (74 times), and the HK group ranking third (26 times). There are two types of misuse: 1) it is misused to introduce a fragment sentence, and 2) it is misused in combination with another CD, so. The frequency of misuse among learners in the ML group is much higher than those of the other two groups. Again, the HK group ranks the lowest in terms of misuse of this CD.

Table 9. Frequencies of because introducing a fragment sentence

Sub-corpus     MC   HK   TW
Frequencies    98   26   74

In Chinese, yinwei (because) can be used to form a grammatically correct sentence, which is used to state the reason for the proposition conveyed in the prior sentence. This Chinese linguistic feature was transferred by the Chinese learners into their English writings. The following examples show this transfer:

(1) First, I want to buy some new pen. Because my pen was bad. (tw12002.txt)
(2) First I want to buy a cloth to my mother, and buy a T-shirt to my father. Because my father and mother love me very much. (cn08008.txt)
(3) I like black chocolate very much. Because black chocolate is making very many dessert. (hk87002.txt)

From the examples, we can see that a causal relation (or meaning) is very clear between the first and the second sentences. In written English, because should be used as a CD to connect two clauses to form a single sentence expressing causality, but these learners very often used because to introduce a separate sentence, resulting in a fragment rather than a cohesive sentence.
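As a rough illustration of how such fragments might be located in a corpus file, the sketch below flags sentences that begin with because and contain no comma, on the assumption that a comma would usually separate the because-clause from a following main clause. This is a crude surface heuristic introduced here for illustration, not the identification procedure used in the study, and its output would need manual checking.

```python
import re


def flag_because_fragments(text):
    """Return sentences that start with 'Because' and contain no comma.

    Heuristic assumption: a sentence-initial because-clause with no comma
    has no main clause attached, so it is a candidate fragment.
    """
    # Split after sentence-final punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        starts_with_because = re.match(r"because\b", sentence, flags=re.IGNORECASE)
        if starts_with_because and "," not in sentence:
            flagged.append(sentence)
    return flagged


sample = "First, I want to buy some new pen. Because my pen was bad."
fragments = flag_because_fragments(sample)
print(fragments)
```

A sentence such as "Because it rained, we stayed home." is not flagged, since the comma signals an attached main clause.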


In addition, all three groups of learners very often combined because with so when expressing a causal relationship, which is a marked transfer from the Chinese “yinwei...suoyi” sentence structure. The following examples show this transfer:

(1) I would like to have a becautiful shoes. Because my friends have many kind of becautiful shoes, so I also want buy one, but my parents told me that I can not buy shoes untill next year, so I want to buy a pair of shoes. (cn20236.txt)
(2) First, I will buy a birthday cake for my birthday party, and I will do that in the Pizza Hut. Because I do that before when I was seven, so I think this time I can do that also. (hk87301.txt)
(3) I will like to make my birthday cake with my best friend Audrey too. Because she know how to make a cheese cake and she know where can buy the meterial, so I will give some money to her. I guess I will use lower than HK$500. (hk87309.txt)

5. Conclusion and implications

In summary, the overall tendency of CDs used by the three groups of Chinese learners is that among the three sub-categories of conjunctive devices, additive devices (59.92%) accounted for the largest percentage of use, followed by causal (25.09%), and finally adversative (14.99%) devices. As described earlier, all three groups of Chinese learners appeared to be aware of the three sub-categories of conjunctive devices. Additive devices were the most frequently used because they are perhaps the easiest CDs to acquire and use in order to connect clauses and sentences in writing. The cohesive item with the highest frequency was and (1,750), which belongs to the group of easy words and expressions Chinese learners start to learn as soon as they have access to English. Among adversative devices, but was used with the highest frequency (528). However and yet occurred occasionally in the writings, but many others, such as on the contrary and instead, were never used. This finding may imply that these learners were not competent enough to use other cohesive items to indicate transition of meaning, such as however, rather, and on the contrary. Among causal devices, because (667) was the most frequently used item, followed by so and because of. Other items, including as a result and thus, were not used.

Comparatively, the HK group produced relatively longer sentences; the TW group produced relatively shorter sentences, and the ML group produced the shortest sentences in their writings. The HK group used more CDs in terms of type than the ML and TW groups, suggesting that the writings of the HK group are more diverse in terms of CD use and more complex in terms of sentence structure than those of the ML and TW groups, even though the differences are not very great.


There are four types of CD misuse identified as systematic negative transfers from Mandarin Chinese. Learners from all three groups tend to map Chinese CD ranhou to mark a sequence of actions or events in their writings. Constructions using also + modal verbs are more frequently used by the ML and TW learner groups, which reflects a negative transfer of Chinese word order. Because is frequently used in fragment clauses by these two groups and occasionally combines with so to express causal relationships, suggesting a negative transfer of Chinese syntactic rules. It seems that the students were capable of using a few devices to bridge the previous clause or segment(s) and the following one(s) to make their writing clear and logical. However, only those commonly used items such as and, but, because, and so were used to accomplish these functions, whereas items learned later, such as furthermore, on the contrary, moreover, in addition, and nevertheless, never occurred in their writings. Although these devices are introduced in the course book, they might not be emphasized in the classroom teaching, or the learners tend to use the easy ones rather than the complicated ones (more constraints on the use of these CDs) since the learners were given limited time to recall these CDs in their writings. Nonetheless, as their use of CDs is limited, the writings of the three groups are very basic both in terms of text organization and genre structure. No distinctive features of genre structure can be identified between the narrative, explanatory, and argumentative categories. Finally, there are some features that distinguish the HK group from the other two in terms of sentence complexity, variety of CD use, and genre structure. 
The findings of this study confirm the results from previous research that suggest different CDs can cause different contextual effects, appropriate use of different CDs largely determines whether a text is cohesive and coherent, and ESL/EFL learners choose different or limited CDs in comparison with native speakers (e.g., Fung & Ronald, 2007; Liu & Braine, 2005; Zhang, 2000). The findings also confirm that there are systematic linguistic transfers (e.g., Chan, 2004; Odlin, 2005) and that learning contexts affect ESL/EFL learning processes (the HK group uses English as an L2). In conclusion, there are several implications we think are important. First, there is a need in the teaching of writing to make these beginning learners aware of the importance and use of alternative cohesive devices in their writings. Second, since all the learners, especially the ML and TW groups, were found to have problems using some basic CDs effectively and accurately, explicit instruction with examples should be provided by teachers in the classroom, rather than hoping for an accumulated, general awareness through L2 development. This is especially important because these learners have limited access to English outside the classroom. More importantly,


this study identified some features of CD use in terms of cross-linguistic transfer; therefore, we plan to make continued use of the ICCI to do a more sophisticated corpus-based analysis of this phenomenon in the future. It is important to recognize that the data set used in this study is small and other variables, such as different grades and school types (key or ordinary), are not controlled. Consequently, the findings on cross-linguistic transfer are still preliminary and exploratory. Nonetheless, from this tentative study we could develop a more sophisticated research design for further study of the interlanguage features of Chinese learners of English.

References

Chan, A. 2004. “Syntactic transfer: Evidence from the interlanguage of Hong Kong Chinese ESL learners”. Modern Language Journal 88. 56-74.
Ellis, R. 1985. Understanding Second Language Acquisition. Oxford: Oxford University Press.
Ferris, D. and J. Hedgcock. 1998. Teaching ESL composition. Mahwah, NJ: Erlbaum.
Fung, L. and C. Ronald. 2007. “Discourse markers and spoken English: native and learner use in pedagogic settings”. Applied Linguistics 23:3. 410-439.
Fraser, B. 1999. “What are discourse markers?”. Journal of Pragmatics 31. 931-952.
Green, C.F., E.R. Christopher and L. Jacquelyn. 2000. “The incidence and effects on coherence of marked themes in interlanguage texts: a corpus-based enquiry”. English for Specific Purposes 19. 99-113.
Halliday, M.A.K. and R. Hasan. 1976. Cohesion in English. London: Longman.
He, Zhaoxiong. 2000. A New Introduction to Pragmatics. Shanghai: Shanghai Foreign Language Education Press.
Hinkel, E. 2003. “Simplicity without elegance: Features of sentences in L1 and L2 academic texts”. TESOL Quarterly 37. 263-275.
Jafarpur, A. 1991. “Cohesiveness as a basis for evaluating compositions”. System 19. 459-465.
Johnson, D.P. 1992. “Cohesion and coherence in compositions in Malay and English”. RELC Journal 23. 1-17.
Liu, M. and G. Braine. 2005. “Cohesive features in argumentative writing produced by Chinese undergraduates”. System 33. 623-636.
Odlin, T. 2005. “Crosslinguistic influence and conceptual transfer: what are the concepts?”. Annual Review of Applied Linguistics 25. 3-25.
Palmer, J.C. 1999. “Coherence and cohesion in the English language classroom: the use of lexical reiteration and pronominalisation”. RELC Journal 30. 61-85.
Pienemann, M. 1985. “Learnability and syllabus construction”. Modelling and assessing second language acquisition, Hyltenstam, K. and M. Pienemann (eds). San Diego, CA: College Hill Press. 23-77.
Pica, T. 1985. “Linguistic simplicity and learnability: Implications for language syllabus design”. Modelling and assessing second language acquisition, Hyltenstam, K. and M. Pienemann (eds). San Diego, CA: College Hill Press. 137-153.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik. 1985. A comprehensive grammar of the English language. New York: Longman.
Read, J. 2000. Assessing vocabulary. Cambridge: Cambridge University Press.
Reid, J. 1993. Teaching ESL writing. Englewood Cliffs, NJ: Prentice Hall.
Shaw, P. and E.T.-K. Liu. 1998. “What develops in the development of second language writing”. Applied Linguistics 19. 225-254.
Sperber, D. and D. Wilson. 2001. Relevance: Communication and Cognition. Beijing: Foreign Language Teaching and Research Press.
Vaughan, C. 1991. “Holistic assessment: what goes on in the rater’s mind?”. Assessing Second Language Writing in Academic Contexts, Hamp-Lyons, L. (ed). Norwood, NJ: Ablex.
Zhang, M. 2000. “Cohesive features in exploratory writing of undergraduates in two Chinese universities”. RELC Journal 31. 61-93.
Zhang, H.P. and Y.B. Liu. 2010. “English teachers’ discourse markers in the classroom: A corpus-based study”. Foreign Languages Teaching and Research 40:5. 23-30.

Normalising Frequency Counts to Account for ‘opportunity of use’ in Learner Corpora

Paula BUTTERY and Andrew CAINES

1. Introduction

Several of the largest learner corpora have been compiled for the purpose of producing language resources rather than testing specific learning hypotheses. Examples from UK publishers include the Longman Learners’ Corpus¹, which is used to compile the Longman Active Study Dictionary; the Cambridge Learner Corpus (Nicholls 2003), which is used to compile Cambridge University Press dictionaries such as the Cambridge Advanced Learner’s Dictionary, as well as student materials such as Carter et al. (2011); and the World English Corpus², on which the Macmillan English Dictionary is based. The goal of our research is to develop a methodology for employing these large learner corpora in the testing of language learning hypotheses. In this paper we take the first steps towards this goal by highlighting some of the issues which demonstrate why the development of such a methodology is important. We use the error-coded section of the Cambridge Learner Corpus as an example to investigate: 1) the relationship between learners’ increasing proficiency and mean length of utterance; 2) the relationship between learners’ increasing proficiency and the variety and quantity of their adverb use. We show that, when using large learner corpora, it is essential to consider appropriate normalisation for the linguistic components under investigation; otherwise a misleading picture of learner development may be inferred. In particular, we discuss how topic, task and text length may affect the ‘opportunity of use’ of a given linguistic component. We address the issue of text length by using a native speaker email corpus, compiled from short self-contained texts, to normalise for examination script length.

¹ Pearson Education Ltd. http://www.pearsonlongman.com/dictionaries/corpus/learners.html
² Macmillan Publishers Ltd. http://www.macmillandictionary.com/aboutcorpus.htm

2. Background

In an ideal controlled experiment all variables are held constant except for the one we wish to measure. For a physical example, if we were to measure the maximum speed of different trains we would need to control for any possible variables such as track quality, load and weather conditions. We might do this by running the trains over the same piece of track with the same load on the same day (while the weather holds). Controlling for the variables allows us to have a fair comparison of train performance. However, with a corpus that is not designed for our specific purposes we lose the luxury of a controlled experimental set-up and we must be sure to account for all changing variables within our analysis of the data. Every corpus is compiled according to its own guidelines; only very occasionally are corpora designed to complement one another.³ Therefore we must pay particular attention not only to the nature of the corpus but also to any compilation documentation in order to understand what might be varying across language samples in addition to the target variable we are interested in.

With this in mind, let us consider the constitution of our example corpus. The Cambridge Learner Corpus (CLC) is a collection of examination scripts written in English by non-native students from all over the world, who therefore have a wide variety of first languages (L1s). The scripts come from the suite of examinations offered by ‘Cambridge ESOL (English for Speakers of Other Languages) Examinations’, henceforth referred to as ESOL. The corpus is still growing: it currently contains approximately 135,000 exam scripts. For the work presented here, we took only pass scripts from the subsection of the corpus which has been error coded. This amounts to a 4 million word subcorpus containing around 11,500 examination scripts.

All exam scripts in the corpus have been assigned one of five proficiency levels in accordance with the Common European Framework of Reference (CEFR)⁴. In order of increasing proficiency, the levels are A2, B1, B2, C1, C2 and they are assigned on the basis of ESOL examiners’ grading. We do not include the A1 set of scripts in our analysis as there are too few of these to make a meaningful contribution. Table 1 shows how many ‘pass’ examination scripts there are at each CEFR level in our error-coded subcorpus of the CLC. Word counts and mean document length are also given for each CEFR subsection (mean script length has been calculated by dividing the total number of words at a level by the total number of exam scripts at the same level).

³ Rare examples being the Brown family of corpora (e.g. Francis 1965; Johansson, Leech and Goodluck 1978); and The Limerick Corpus of Irish English (Farr, Murphy and O’Keeffe 2002) being designed to match the Cambridge Nottingham Corpus of Discourse in English (Carter and McCarthy 1995).
⁴ See http://www.coe.int/t/dg4/linguistic/cadre_en.asp The Council of Europe’s ‘The Common European Framework of Reference for Languages’ which ‘provides a basis for the mutual recognition of language qualifications’ (Council of Europe 2001).

Table 1. Exam script and word counts per CEFR level for pass scripts in the error-coded section of the Cambridge Learner Corpus.

CEFR level   Total num. of exam scripts   Total num. of words   Mean script length
A2                                  976                32,931                33.74
B1                                4,070               517,792               127.22
B2                                3,513             1,383,567               393.84
C1                                1,831             1,023,293               558.87
C2                                1,194               892,830               747.76
Overall                          11,584             3,850,413               332.39
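The mean script lengths in Table 1 follow directly from the two count columns (total words divided by total scripts at each level). As a small worked check, the sketch below recomputes them; the dictionary and variable names are illustrative choices, and the counts are copied from Table 1.

```python
# (number of exam scripts, number of words) per CEFR level, copied from Table 1
TABLE_1 = {
    "A2": (976, 32_931),
    "B1": (4_070, 517_792),
    "B2": (3_513, 1_383_567),
    "C1": (1_831, 1_023_293),
    "C2": (1_194, 892_830),
}

# Mean script length = total words at a level / total scripts at that level
mean_length = {
    level: round(words / scripts, 2)
    for level, (scripts, words) in TABLE_1.items()
}

# The 'Overall' row pools all levels before dividing
total_scripts = sum(scripts for scripts, _ in TABLE_1.values())
total_words = sum(words for _, words in TABLE_1.values())
overall_mean = round(total_words / total_scripts, 2)

for level, value in mean_length.items():
    print(level, value)
print("Overall", total_scripts, total_words, overall_mean)
```

Note that the overall mean is computed from pooled totals, so it is weighted by the number of scripts at each level and is not the simple average of the five per-level means.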

Table 2a. Language families of learners’ L1s, as proportions (%) of examination scripts per CEFR level.

Language family    A2     B1     B2     C1     C2
Isolates           7.76   5.56   14.5   10.8   13.9
Afro-Asiatic       1.63   2.13   1.63   1.78   0.16
Austro-Asiatic     0.72   0.09   0.14   0.05   0.08
Austronesian       0.2    0.26   2.28   0.65   0
Caucasian          0      0.05   0      0      0
Dravidian          0.2    0.47   1.24   0.86   0
Indo-European      78.8   79.8   59.0   69.9   78.7
Niger-Congo        0      0.07   0      0      0
Tai-Kadai          0.1    0.14   2.02   0.76   0.25
Sino-Tibetan       10.4   10.8   17.9   14.6   6.7
Uralic             0.2    0.56   1.35   0.54   0.25

Table 2b. Language families of learners’ L1s, as proportions (%) of examination scripts per CEFR level: further detail for selected language groups.

Language family   Language group   A2     B1     B2     C1     C2
Isolate           Japanese         3.58   1.69   5.23   4.27   5.43
Indo-European     Germanic         3.27   19.8   8.49   10.8   11.2
                  Greek            2.96   0.26   2.95   6.92   4.61
                  Indo-Iranian     3.78   2.32   4.41   5.51   0.08
                  Romance          62.1   48.5   33.6   35.8   48.0
                  Slavic           6.23   8.66   9.55   10.8   14.8
                  Other            0.41   0.32   0.03   0.11   0
Sino-Tibetan      Chinese          10.2   10.8   17.8   14.6   6.7
                  Tibeto-Burman    0.2    0.05   0.06   0      0

In Table 2a we group the exam scripts at each CEFR level by the language family of the learners’ L1s⁵. The two language families most frequently occurring as the L1 of an ESOL learner (Indo-European and Sino-Tibetan) are further broken down into language sub-groups in Table 2b. Japanese is another frequent L1 among ESOL learners, and so it is also included in Table 2b.

Table 3. The ratio of examination types at each CEFR level.

CEFR level   General English scripts   Business English scripts   IELTS scripts
A2                             99.8%                       0.0%            0.2%
B1                             78.7%                      21.1%            0.2%
B2                             68.0%                       9.1%           22.9%
C1                             66.9%                      17.5%           15.6%
C2                             99.5%                       0.0%            0.5%

With the exception of the proficiency test, test, the International English Language Testing System (IELTS), all of the passing grades from a given examination can be mapped to a single CEFR level (Taylor 2004).6 However, it is possible for several examinations to map to the same CEFR level. For instance, level B1 is aligned with passing grades in both the Preliminary English Test (PET) and Business English Certificate Preliminary (BECP) examinations. Where this is the case it is important for us to know the ratio of these examinations within our corpus since it may have an impact on our analysis. Table 3 shows the ratio of examination types at each CEFR level: notice that we have a single predominant examination for each CEFR level. The introductory text at the start of each examination paper provides instructions on how to answer the questions contained therein and often specifies word limits for the students’ answers. The rubric varies between examinations. Table 4 shows how passing grades from examinations within our subcorpus map onto the CEFR levels together with the typical word limits imposed on the examinees.7 Taking the above into consideration we can now attempt to control experimental variables when testing hypotheses using this corpus. For any hypothesis that investigates learner progression we may use our five CEFR levels to define five subcorpora. Within each subcorpus the proficiency level of the scripts may be considered constant within the specified bounds of the CEFR scale. To test a hypothesis we can then compare how linguistic 5

6

7

The language isolates referred to in Table 2a are Basque, Japanese, Korean, Mongolian and Turkish; aside from Basque, which is widely accepted as unrelated to any other extant language, these have been argued to belong to various families—not least Altaic—but have also been argued to each be isolates. That is the view which is taken here, since there is considerable controversy surrounding their true status. ESOL’s webpage about the BEC qualification (http://www.cambridgeesol.org/exams/ professional-english/bec.html) suggests that it does not in fact map to a single CEFR level. An A grade in the BEC maps to the level above the target examination level, introducing a possible complication that we will not address here since numbers of BEC examinations are small and the points we illustrate are not dependant upon this detail. Word limit information has been collated from Taylor 2004 and sample papers available on the ESOL website: http://www.cambridgeesol.org/exams

Normalising Frequency Counts to Account for ‘opportunity of use’ in Learner Corpora

191

Table 4. The Cambridge ESOL examinations which feature in our error-coded section of the CLC together with typical total word limits for an exam paper. Exam A2 B1 B2 C1 C2 International English Language Testing System ≥400 ≥400 ≥400 ≥400 ≥400 (IELTS) Key English Test (KET) 25–358 Preliminary English Test (PET) 135–145 Business English Certificate Preliminary (BECP) 90–120 First Certificate in English (FCE) 240–330 Business English Certificate Vantage (BECV) 160–190 Certificate in Advanced English (CAE) 180–220 Business English Certificate Higher (BECH) 320–390 Certificate of Proficiency in English (CPE) 600–700

phenomena are exhibited within each subcorpus. However, to make this a fair comparison we need to control for any variability across subcorpora.

3. Controlling for variation across subcorpora

3.1. Population of candidates

As stated previously, the scripts have been written by candidates from all over the world. First, then, we need to consider whether the population of examination candidates represented within each subcorpus is approximately the same. It has been established that learners from different language groups make different types of learning errors (Richards 1971, Zobl 1980, Ellis 1994, among others), so it will be essential to control for the L1 of the candidate across subgroups. For instance, if we are investigating article usage, it would not be a fair comparison if the majority of candidates at one proficiency level were native speakers of Mandarin (a language which has no article system) while the candidates at another proficiency level were mostly native speakers of Spanish (a language which does exhibit articles).

Within the CLC each script is annotated with meta-information about the candidate, which means that controlling for population variables is a relatively straightforward task. Tables 2a & 2b showed the distribution of L1s by CEFR level. We can control for L1 across the proficiency level subcorpora by selecting only scripts from candidates of the same L1 (or the same language group); or, if we do not wish to constrain ourselves to a single L1 or language group for our study, we can ensure that the distribution of L1s is similar for all subsections of the corpus.

⁸ The Key English Test requires only 25–35 words of continuous prose; the other sections of the examination involve word completion and a non-continuous writing task such as form completion.
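The L1 control described in Section 3.1 might be sketched as follows. This is a minimal illustration rather than the procedure used on the CLC: the record fields `level` and `l1`, the function names and the toy records are hypothetical, and downsampling to equal numbers of scripts per L1 at every level is only one assumed way of making the L1 distributions similar.

```python
import random
from collections import Counter, defaultdict


def l1_distribution(scripts, level):
    """Proportion of scripts per L1 within one proficiency level."""
    counts = Counter(s["l1"] for s in scripts if s["level"] == level)
    total = sum(counts.values())
    return {l1: n / total for l1, n in counts.items()}


def balance_by_l1(scripts, seed=0):
    """Downsample so that every level holds the same number of scripts per L1."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s in scripts:
        groups[(s["level"], s["l1"])].append(s)
    levels = sorted({s["level"] for s in scripts})
    l1s = sorted({s["l1"] for s in scripts})
    balanced = []
    for l1 in l1s:
        # The scarcest level sets the quota for this L1 (possibly zero).
        quota = min(len(groups[(level, l1)]) for level in levels)
        for level in levels:
            balanced.extend(rng.sample(groups[(level, l1)], quota))
    return balanced


# Hypothetical metadata records: A2 is mostly Spanish, B1 mostly Mandarin.
scripts = (
    [{"level": "A2", "l1": "Spanish"}] * 3
    + [{"level": "A2", "l1": "Mandarin"}]
    + [{"level": "B1", "l1": "Spanish"}]
    + [{"level": "B1", "l1": "Mandarin"}] * 2
)
balanced = balance_by_l1(scripts)
print(l1_distribution(balanced, "A2"), l1_distribution(balanced, "B1"))
```

Selecting a single L1 is the simpler alternative mentioned in the text; the sketch covers the second option, where several L1s are retained but their proportions are equalised across levels.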


Paula BUTTERY and Andrew CAINES

3.2. Opportunity of use

‘Opportunity of use’ refers to the opportunity within a script for a candidate to use a linguistic component (whether a lexical item or syntactic construction). For a fair comparison between any two CEFR subcorpora, opportunity of use should be controlled for. Imagine that one of the proficiency levels contains scripts that describe ‘a day at the seaside’, while another contains scripts that answer questions on global finance. It would not be fair to compare, for instance, adjectival usage between these proficiency levels, since the opportunity for adjectival use is likely to be very different for each topic.

In Table 3 we saw that there was general homogeneity of examination at each CEFR level. This means that there will inevitably be a high level of homogeneity within a proficiency level subcorpus in terms of the tasks which have been set, the topics discussed and the average script length, but relatively low homogeneity between CEFR subcorpora. This is problematic since task, question topic and script length are all likely to affect opportunity of use.

The difference in script length between the proficiency levels means that any raw frequency counts of linguistic features should be treated with caution. An apparent increase in, for example, adverb use at the higher proficiency levels may simply be a consequence of greater opportunity to use such features, rather than an indication of improved ability with the English language. All else being equal, one can assume that the longer an exam script, the greater the chance the candidate has of exhibiting a linguistic property (i.e. there is a monotonically increasing relationship between script length and exhibition of a linguistic property). The question is, what exactly is this relationship?

Previous work on the CLC has adopted a method of normalising open class linguistic properties by corpus size, e.g. lexical verbs, errors or relative clauses per million words (Hawkins and Filipovic forthcoming; Hawkins and Buttery 2010), without proving that this is a valid thing to do. Relativisation by number of words inherently assumes a linear relationship between opportunity of use and script size. That is, if the script length doubles then the candidate has twice the chance to exhibit a property. In turn this implies that shorter scripts can be considered to be representative samples from longer scripts. A first research question then, which we will discuss below, is whether this assumption is true. The issue of topic and task effect on opportunity of use will be left as the subject of future research.

4. Normalisation for script length

The assumption under scrutiny is that there are known relationships between the opportunity of use of linguistic components and a document’s

Normalising Frequency Counts to Account for ‘opportunity of use’ in Learner Corpora


length. A linguistic component here refers to any quantifiable property of language that we may wish to investigate. Linguistic components fall into several broad categories: meta-language components (e.g. mean length of utterance); lexical components; type components (e.g. counts of a word type, either a closed class such as prepositions or an open class such as adverbs); and syntactic components (e.g. subcategorization frames for a verb). For meta-language components it is often assumed that there is a constant relationship with script length: that for a native speaker, mean length of utterance is approximately constant regardless of script size. For other components a linear relationship with script length is often assumed: that if one doubles the script length one simultaneously doubles the opportunity of use for any given linguistic component.

The goal of our work is to establish appropriate methods for the normalisation required to control for opportunity of use. In time, we will investigate the relationship between script length (as well as topic and task) and each of the linguistic component categories listed above, but here we investigate just two. Firstly, a meta-language component: the mean length of utterance; and secondly, an open class token component: the variety and quantitative usage of adverbs. We chose to analyse adverbs since they are grammatically optional and yet often add vital content to discourse. Native usage of adverbs is consequently subtle and we expect progression towards ‘native-like’ usage to develop gradually with experience (and hence for progression to be seen at all proficiency levels).

The crucial point here, however, is how to quantify native-like usage so that we might measure progression towards it. We need to know how a native user of English would use adverbs under constraints similar to those faced by the learners. In an ideal world we would have native speakers of English answer the same questions as the learners. We could then use the native speaker texts to normalise any effects caused by script length. Unfortunately, no such data is available. So in order to investigate the effect of script length on component counts, we have constructed a corpus of emails written by native speakers to act as a benchmark against which the CLC scripts can be compared and normalised.

5. Email corpus

We collated a corpus of email messages written by native speakers of English to test the effect of script length on linguistic component counts. The corpus is an anonymised collection of messages addressed to the authors by more than fifty participants and does not include any of our outgoing responses. This design offers (a) diversity of contributor to the corpus and (b) no unwanted concentration of material from one or two people, given that language use is a product of experience and therefore idiosyncratic in nature. The corpus contains just under 20,000 documents and a total of 1.6 million words. The emails range in length between 10 and 1500 words and we excluded any emails which were not self-contained (i.e. one-line replies and interspersed replies were removed). The corpus contains a mixture of business and personal mails (which will be useful when we come to investigate topic in future work). Note that we will not be quoting from any of the emails (copyright remains with the sender), but simply generating statistics about language use.

Previous work on the CLC has compared learner language to native speaker language within sections of the British National Corpus (henceforth BNC; 2007). Regardless of genre, the documents which make up the written sections of the BNC are lengthier than those in our email corpus, and are often excerpts from an even longer text. Rather than the BNC or other such native speaker corpora, we believe our email corpus offers a more appropriate native speaker comparison for the CLC. Each email is a short self-contained text, just as the examination scripts are short self-contained texts designed to communicate what has been demanded in the question. Thus our email corpus affords a method of investigating how linguistic properties change with text length for native speakers, providing a benchmark for comparison with the CLC.

6. Email and learner corpus comparison

6.1. Mean length of utterance

A simple indicator of learner progression through the CEFR levels is the meta-language component, mean length of utterance (MLU). We have calculated the MLU for every script in our CEFR-level subcorpora by using the automatic sentence boundary detector contained in the RASP toolkit (Briscoe et al. 2006). The sentence boundary detector uses the immediate context (capitals, other punctuation etc.) to distinguish between full stops used to end sentences and those used to end abbreviations (including titles and initials). The program assumes there is a sentence boundary wherever there is a blank line, or white-space which is preceded by valid sentence-final punctuation and followed by a capital letter.

Figure 1 plots MLU against document length for (a) the CEFR levels in the CLC, with the points shaded more darkly the higher the proficiency level, and (b) the email corpus. Using the same sentence boundary software we calculate the MLU for sentences in the fictional writing section of the written BNC to be around 25.⁹ At first glance then, it appears that as learners ascend the proficiency levels they are progressing towards something like the norm for native speakers. But as already mentioned we are not comparing like with like. The analysis does not take into account any notion of the learners’ task (which is to produce a short self-contained text). Many documents in the BNC are longer texts which are themselves excerpts from much larger texts such as novels, essays and news articles, whereas in the CLC each script is a complete text in itself. Comparison with our email corpus is therefore more appropriate.

⁹ The fictional writing section was chosen over the other sections (such as the newspaper text or science writing) as an approximate match to the subject matter in the exam scripts; recall from Table 3 that the majority of scripts in the CLC are General English or IELTS rather than Business English.

Figure 1. Scatter plots of MLU with document length: (a) CLC; (b) email corpus

Figure 1b shows how native speaker MLU changes with the length of the email. A horizontal line on this graph would have indicated that MLU was fairly constant regardless of text length. However, the graph shows an increase in MLU with text length similar to that in the CLC graph in Figure 1a. Hence we cannot attribute the learners’ rising MLU entirely to increased proficiency but must allow for the underlying phenomenon which occurs in texts written by native speakers.

Figure 2 shows the major trends for MLU by document length in the CLC and email corpora. It is the change in distance between these lines that really signifies learner progress. The lines moving closer together indicate that the learners are moving closer to native-like language use. Thus, we see that it is not simply changes in raw counts that indicate learning progression, but rather changes in the distance to comparative counts produced by native speakers under the same constraints. We see here that a reduced text length decreases the MLU for native speakers. A comparison with the native speaker MLU in texts of the same lengths as the examination scripts is therefore required to control for opportunity of use for learners at different levels. By controlling for what native speakers would do under the same constraints as the learners, we can be more confident that we are measuring progression. A narrowing of the gap between learner and native speaker, as is seen in Figure 2, indicates the learner's improving proficiency.

Figure 2. Trends in MLU against document length: comparison of the CLC and email corpus
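The comparison argued for in Section 6.1 can be sketched in code. What follows is an illustrative sketch only: the boundary rule is a simplified regular-expression approximation of the behaviour described for the RASP sentence boundary detector (it makes no attempt to handle abbreviations), and the function names, the 100-word bin width and the toy texts are assumptions introduced here rather than part of the study.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed simplification of the rule in the text: a boundary is a blank line,
# or white-space preceded by . ! or ? and followed by a capital letter.
_BOUNDARY = re.compile(r"\n\s*\n|(?<=[.!?])\s+(?=[A-Z])")


def mlu(text):
    """Mean length of utterance, in words, for one document."""
    sentences = [s for s in _BOUNDARY.split(text) if s.strip()]
    if not sentences:
        return 0.0
    return mean(len(s.split()) for s in sentences)


def mean_mlu_by_length(docs, bin_width=100):
    """Average MLU of documents grouped into document-length bins."""
    bins = defaultdict(list)
    for text in docs:
        bins[len(text.split()) // bin_width].append(mlu(text))
    return {b: mean(values) for b, values in bins.items()}


def mlu_gap(learner_docs, native_docs, bin_width=100):
    """Native minus learner MLU, for length bins present in both corpora."""
    learner = mean_mlu_by_length(learner_docs, bin_width)
    native = mean_mlu_by_length(native_docs, bin_width)
    shared = sorted(learner.keys() & native.keys())
    return {b: native[b] - learner[b] for b in shared}


# Toy texts of similar length, so both fall into the same length bin.
learner_docs = ["I like cats. I like dogs."]
native_docs = ["I like cats and I like dogs."]
print(mlu_gap(learner_docs, native_docs))
```

The point of the design is that learner MLU is never read on its own: it is only compared with native MLU from documents of matching length, so that a shrinking gap, rather than a rising learner value, is taken as evidence of progression.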


Figure 3. Scatter plots of adverb counts with document length: (a) CLC; (b) email corpus

6.2. Adverb use

We next investigated the use of adverbs in the CLC and email corpora. Adverbs were identified by analysing each examination script and email using the part-of-speech tagger from the RASP toolkit. This tagger uses the CLAWS-2 tagset (Garside 1987), which identifies general adverbs with an RR tag, as illustrated in the following example:

“He sleeps peacefully”
He_PPHS1 sleeps_VVZ peacefully_RR

Figure 3 shows adverb use per document on the y-axis against document length along the x-axis for both the CLC and email corpora. Figure 3b, for the email corpus, shows an approximately linear relationship between text length and average number of adverbs. This indicates that, when investigating the development of adverb usage between CEFR levels, normalising adverb counts by subcorpus size (to give, for instance, occurrence per million words) is a reasonable thing to do. The CLC data in Figure 3a shows that learner adverb use also rises as document length increases. But, as with the MLU, it is the difference between this relationship and the one exhibited by the native speakers that is actually an indication of learner progression.
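To make the normalisation discussed above concrete, here is a small sketch. It assumes tagged text in the `word_TAG` form shown in the example; the function names and the toy figures are hypothetical, and the least-squares fit through the origin is simply one assumed way of checking whether adverb counts grow in proportion to document length.

```python
def count_general_adverbs(tagged_text):
    """Count tokens whose tag (the part after the last underscore) is RR."""
    return sum(
        1
        for token in tagged_text.split()
        if "_" in token and token.rsplit("_", 1)[1] == "RR"
    )


def per_million(count, n_words):
    """Relative frequency expressed per million words."""
    return 1_000_000 * count / n_words


def slope_through_origin(lengths, counts):
    """Least-squares slope b for the model: count = b * length (no intercept)."""
    numerator = sum(x * y for x, y in zip(lengths, counts))
    denominator = sum(x * x for x in lengths)
    return numerator / denominator


# Tagged example from the text.
print(count_general_adverbs("He_PPHS1 sleeps_VVZ peacefully_RR"))

# Hypothetical documents whose adverb counts are proportional to length.
lengths = [100, 200, 400]
counts = [5, 10, 20]
rate = slope_through_origin(lengths, counts)  # adverbs per word
print(per_million(rate * 1000, 1000))         # the same rate per million words
```

If the relationship really is linear through the origin, the per-word rate is the same at every document length, which is the condition under which a per-million-words figure is a fair summary.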


Figure 4. Trends in adverb counts against doc-length: comparison of the CLC and email corpus

Figure 4 shows the trends¹⁰ for adverb use by document length in the CLC and email corpora. In documents up to 800 words long, the amount of adverb use in the learner data is as one might expect: slightly less than that in the native corpus. The gap between the native and non-native trend lines does not noticeably close, suggesting that there is not much progress towards native-like adverb use across the lower proficiency levels in which script length is lowest. Above a document length of 800 words the learner trend line deviates dramatically from the email trend line. Normally this would indicate that the learners are becoming less native-like, but in fact this is indicative of the changing composition of the examination papers (and hence also the tasks and topic) between the lower and higher levels. Further inspection reveals that answers to the Business English questions can (not infrequently) have a zero adverbial count. This indicates that topic and task are of central importance to language use, and therefore to research into second language acquisition. We will return to this issue in more detail in future work.

¹⁰ ‘Trends’ have been calculated using the ‘stat_smooth’ function in the ggplot2 package for R (http://had.co.nz/ggplot2/); stat_smooth uses a generalised additive model to plot a smoothed line with a confidence band indicating upper and lower standard error bounds.

Regardless of the interference from topic, there is a difference between the adverb use trends for native speakers and learners that needs to be accounted for. For instance, one possible interpretation of the constant distance between the CLC and email corpus trend lines in the lower CEFR levels (where exam scripts are generally shorter) is that learners are making no progress towards native-like use. To investigate whether this is the case, it is not sufficient simply to present counts of adverb use. Instead, we need to examine the variety of adverbs used by learners and native speakers.

In Figure 5 there are two heatmaps representing how often each of the fifty most frequent adverbs from the corpora is used in texts of the lengths specified on the x-axis. The adverbs are listed on the y-axis in order of rank frequency, from high to low running down the axis. The tiles darken to indicate higher frequency counts for that adverb at that particular text length. White tiles indicate zero occurrence in the corpus for that adverb at that document length.

Figure 5. Heatmaps of individual adverb use against doc-length: (a) CLC; (b) email corpus

In the CLC (Figure 5a) we see that at the low proficiency CEFR levels (that is, with smaller document lengths) the range of adverbs in use is much smaller than that of native speakers: there are many more white tiles at this side of the graph. It is only as document length increases that a range of adverbs comparable to the native speakers is introduced. In other words, in this case it is the increasing range rather than quantity of adverb use which indicates improvement in learner proficiency. In contrast, in the native speaker email corpus (Figure 5b) the fifty adverbs are of approximately equal frequency across all text lengths. We can thus see from this data that the adverbial type-token ratio becomes more native-like as learners progress through the levels.

7. Discussion

We have presented linguistic analyses of mean length of utterance and adverb usage in the error-coded section of the Cambridge Learner Corpus in comparison to a corpus of native speaker emails. We discussed how in normal experimental design we hold all variables constant except for the one we wish to measure. However, since detailed and testable hypotheses were not formed in advance of the CLC’s construction, there are now variables contained therein which need to be identified and studied so that we may gain a fuller understanding of their influence on language use.

Some initial work into identifying ‘criterial’ linguistic differences between the CEFR proficiency levels (e.g. Hawkins and Filipovic forthcoming, Hawkins and Buttery 2010) has made assumptions regarding methods of normalisation. But we would add caveats to the frequency counts that have been produced, as we believe these should be normalised to account for document length, topic and task at the very least. We have mainly focused upon the issue of document length in this paper, while maintaining an awareness of topic and task effect. It will be necessary to revisit the latter issues in full in future research.

Of particular concern in the CLC are variables relating to ‘opportunity of use’. We discussed how demographic variables such as the learner’s L1 can be controlled through analysis of metadata. Variables internal to the examination papers include the required answer length, the question topic and the question task (‘write a postcard’ versus ‘write a speech’), all of which may be hypothesised to affect opportunity of use. To investigate these issues, we created a benchmark corpus of native speaker email messages comprising short self-contained texts comparable to the examination scripts.

We found that mean length of utterance displays a similar trend in both the CLC and email corpus, increasing as document length increases. Without the email corpus we might have incorrectly inferred a progressive trend in MLU that is in fact largely due to the increase in script size between CEFR levels. We noted that it is the difference between the CLC and email trends that shows any real convergence toward the native speaker norm.

In our second investigation we found that adverb counts within native speaker texts have an approximately linear relationship with text length. This suggests that previous work which normalised such counts by corpus size was justified in doing so. However, we also found that adverb usage was largely affected by examination paper (which amounts to variation within topic and task); this has not been accounted for in previous work. In particular, scripts from business examinations were found to frequently have a zero adverb count.

Finally, we showed that one should be careful when discussing differences between trend lines for broad linguistic components. Specifically, with the adverb use data, we could have been misled into perceiving a low rate of progression towards native-like use at the lower CEFR levels, since the distance between trend lines is approximately constant. However, a finer-grained lexical analysis showed that it is the increasing range and not quantity of adverb use which indicates improvement in learner proficiency at those levels.
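The distinction drawn above between the range and the quantity of adverb use can be illustrated with a short sketch. It assumes `word_TAG` input like the RASP example in Section 6.2 and counts only RR-tagged tokens; the function name, the bin width and the toy documents are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def adverb_range_and_quantity(tagged_docs, bin_width=100):
    """Per document-length bin: distinct adverbs (types) and total uses (tokens)."""
    per_bin = defaultdict(Counter)
    for doc in tagged_docs:
        pairs = [tok.rsplit("_", 1) for tok in doc.split() if "_" in tok]
        # Look the bin up first so that bins without adverbs are still reported.
        counter = per_bin[len(pairs) // bin_width]
        for word, tag in pairs:
            if tag == "RR":
                counter[word.lower()] += 1
    return {
        b: {"types": len(c), "tokens": sum(c.values())}
        for b, c in sorted(per_bin.items())
    }


# Toy tagged documents: three adverb tokens, but only two distinct adverbs.
docs = [
    "He_PPHS1 sleeps_VVZ peacefully_RR",
    "She_PPHS1 runs_VVZ quickly_RR",
    "He_PPHS1 runs_VVZ quickly_RR",
]
print(adverb_range_and_quantity(docs))
```

Two corpora can share the same token count in a bin while differing in the number of types, which is the situation the heatmaps in Figure 5 bring out for the lower CEFR levels.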


In summary, when investigating the development of a linguistic component we go through each of the following steps:

1) Investigate the variation of that component within the population of native speakers;
2) Investigate the effect of text length on that component for native speakers;
3) Investigate the effect of topic on that component for native speakers;
4) Investigate the effect of task on that component for native speakers.

We have discussed steps one and two in this paper, and touched on steps three and four. We will return to a fuller investigation of topic and task in future work. It is our belief, however, that all four steps are of equal importance. Crucially, when we attempt to analyse learner progression we must consider only the difference between the observed trend in the learner corpus and the native speaker corpus, rather than the learner trend by itself.

The purpose of the line of research introduced here is to understand the influence of uncontrolled variables on language use so that we can normalise for their effects in the CLC and other learner corpora. But the work we have started has relevance beyond the field of learner corpora. Corpora are often created without a specific hypothesis to test. Most corpora become all-purpose research resources from which we can at least gain a snapshot of language use in a certain domain in a particular time period, but they have usually not been tightly controlled in their design. Many corpora are financed on an agenda (the lexicographers’ British National Corpus¹¹ (2007) and the military’s Switchboard Corpus¹² (Godfrey and Holliman 1997), to name but two examples) but are otherwise unconstrained except for an attempt to achieve some sort of demographic and/or textual balance. The same is true of the CLC, funded by Cambridge University Press and Cambridge ESOL. Our aim, then, is to identify methods of accounting for uncontrolled, non-target variables, which are applicable to corpus linguistics in general.

¹¹ The BNC was funded by a consortium including the dictionary-making publishers Longman (now Pearson Education), Chambers and Oxford University Press.

¹² The Switchboard Corpus was funded by DARPA (Defense Advanced Research Projects Agency), the research and development arm of the U.S. Department of Defense.

Acknowledgements

We would like to thank Cambridge University Press and ESOL Examinations for use of the Cambridge Learner Corpus and iLexIR for the use of iLexIR search. We are greatly indebted to Mike McCarthy for his written comments on a previous draft of this paper, as well as Ted Briscoe, Anne O’Keeffe, John Hawkins, Geraldine Mark and Nick Saville for interesting discussions and helpful feedback about this work.

References

Briscoe, E., J. Carroll and R. Watson. 2006. “The Second Release of the RASP System”. Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions. Sydney, Australia.

The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk

Carter, R. and M. McCarthy. 1995. “Grammar and the Spoken Language”. Applied Linguistics 16. 141-158.

Carter, R., M. McCarthy, G. Mark and A. O’Keeffe. 2011. English Grammar Today. Cambridge: Cambridge University Press.

Council of Europe. 2001. Common European Framework of Reference for Languages. Cambridge: Cambridge University Press.

Ellis, R. 1994. The Study of Second Language Acquisition. Oxford: Oxford University Press.

Farr, F., B. Murphy and A. O’Keeffe. 2002. “The Limerick Corpus of Irish English: design, description and application”. Teanga 21. 5-29.

Francis, W. Nelson. 1965. “A standard corpus of edited present-day American English”. College English 26. 267-273.

Garside, R. 1987. “The CLAWS Word-tagging System”. The Computational Analysis of English: A Corpus-based Approach, Garside, R., G. Leech and G. Sampson (eds). London: Longman. See also: http://ucrel.lancs.ac.uk/claws/

Godfrey, John J. and E. Holliman. 1997. Switchboard-1 Release 2. Philadelphia: Linguistic Data Consortium.

Hawkins, John A. and Paula J. Buttery. 2010. “Criterial Features in Learner Corpora: Theory and Illustrations”. English Profile Journal 1. Cambridge: Cambridge University Press. 1-23.

Hawkins, John A. and L. Filipovic. forthcoming. Criterial Features in the Learning of English: Specifying the Reference Levels of the Common European Framework. Cambridge: Cambridge University Press.

Johansson, S., G. Leech and H. Goodluck. 1978. Manual of Information to Accompany the Lancaster-Oslo/Bergen corpus of British English. Oslo: University of Oslo.

Nicholls, D. 2003. “The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT”. Proceedings of the Corpus Linguistics 2003 Conference, Archer, D., P. Rayson, A. Wilson and T. McEnery (eds).

Richards, J. 1971. “Error analysis and second language strategies”. Language Sciences 17. 12-22.

Taylor, L. 2004. “IELTS, Cambridge ESOL examinations and the Common European Framework”. Research Notes 18. 2-3. Cambridge: University of Cambridge ESOL Examinations. http://www.cambridgeesol.org/rs_notes/offprints/pdfs/RN18p2-3.pdf

Zobl, H. 1980. “Developmental and transfer errors: their common bases and (possibly) differential effects on subsequent learning”. TESOL Quarterly 14. 469-479.

Spanish Learners’ Production of French Close Rounded Vowels: A Corpus-based Perceptual Study

Isabelle RACINE

1. Introduction1 In the field of Second Language Acquisition (SLA), the use of corpora in studies of L2 learners’ interlanguage(s) (Selinker 1972; Vogel 1995) is still fairly rare; most of the existing corpus-based work has focused instead on lexical and morphosyntactical properties of the learners’ interlanguage. As pointed out by Gut (2009), there is a clear lack of corpus-based research in the field of L2 pronunciation, and it is only recently that L2 oral corpora have been created in order to study L2 phonetics and phonology on both the segmental and supra-segmental levels (Trouvain & Gut 2007; Meng, Tseng, Kondo, Harrison & Visceglia 2009). Examples of these studies include L2 Dutch (Neri, Cucchiarini & Strik 2006), Polish (Cylwik, Wagner & Demenko 2009), German, and English in Europe (Gut 2009) and Asia (Visceglia, Tseng, Kondo, Meng & Sagisaka 2009). According to Gut (2009) and Zampini (2008), most research conducted over the past forty years in L2 phonetics and phonology is based on data limited mostly to laboratory settings. They focus on a small number of structures produced by a limited number of speakers, and most of them use English as the L2 or L1. Research in this area should be extended to include other languages such as Spanish, Japanese, and French. Similarly, in the case of French, relatively few oral corpus studies have been devoted to analyses of the phonetico-phonological systems of non-native speakers. In the past twenty years, non-native oral corpora have been put to use in the field of the L2 acquisition of French, including ESF (Perdue 1993), LANCOM (Debrock & Flament-Boistrancourt 1996) and FLLOC 1

I would like to express my deepest gratitude to Sylvain Detey for his precious and relevant comments on the final version of this manuscript, and to Sandra Schwab for her help with the design of the experiment, the statistical analyses, and the redaction of this chapter. In addition, I would like to thank the editors of this volume, Yukio Tono, Yuji Kawaguchi, and Makoto Minegishi, for all their work and the two anonymous reviewers for their relevant comments on the first version of the manuscript. All remaining errors are mine.

206

Isabelle RACINE

(Myles & Mitchell 2007). Their focus, however, is mainly set on lexical and morphosyntactical aspects. This gap in the field of L2 French phonetics and phonology was the starting point of the project InterPhonologie du Français Contemporain (InterPhonology of Contemporary French, henceforth IPFC) (Detey & Kawaguchi 2008; Racine, Detey, Zay & Kawaguchi in press; see also Detey, this volume), while two others followed suit: the COREIL corpus (Delais-Roussarie & Yoo 2010) and another one collected by a team in Paris III University (Pillot-Loiseau, Amelot & Fredet 2010). The aim of the IPFC project is to build a large oral corpus of various learners of French as a foreign language with the general purpose of describing their interphonological systems. The protocol is based on the one used for French native speakers in the project Phonologie du Français Contemporain: usages, variétés et structure (Phonology of Contemporary French: usages, varieties and structure, henceforth PFC) (Durand, Laks & Lyche, 2002, 2005, 20092). It includes six tasks: the repetition of a specific L1-related wordlist, the reading of this specific wordlist, the reading of a generic wordlist (PFC), the reading of a generic text (PFC), and finally, the performance of two types of semi-spontaneous speech, one with a native speaker and another between two learners. The specific wordlist is divided into two parts: words common to all surveys and words selected in order to test L1-dependent phenomena. At the current stage of the project, six surveys are being carried out with speakers of Dutch, English (Canada), German, Greek (Cyprus), Italian, and Norwegian, and two corpora with learners of Japanese and Spanish have almost been completed.3 In what follows, we will focus on the Spanish element of the IPFC project (Racine et al. in press).4 The Spanish corpus is made up of two groups of Spanish students who learned French in two different contexts. 
The first group, with sixteen students at the moment and four others to be recorded shortly, is studying French in a French-speaking environment 2 3 4

See the website of the project: http://www.projet-pfc.net. See the website of the project: http://cblle.tufs.ac.jp/ipfc/. The IPFC-Spanish project is supported by a grant from the Swiss National Science Foundation (100012_132144/1). Data collection has also been made possible by the support of the University of Geneva and the Academic Society of Geneva. I would also like to thank Nathalie Bühler and Maria Luisa Fernández for their help in the collection and the transcription of the data in Geneva and Madrid. Thanks also to the members of the IPFC-Spanish team—Françoise Zay, Sandra Schwab, and Maria Ángeles Barquero—for all their work on the project, and to Sylvain Detey and Yuji Kawaguchi for sharing with me the heading of the IPFC project. Lastly, I would like to express all my gratitude to the Spanish students in Geneva and Madrid who shared their time and their French with us for the benefit of the project, as well as to the native participants of the experiment.

Spanish Learners’ Production of French Close Rounded Vowels

207

(Geneva, Switzerland), while the other group, with six students already recorded and fourteen more to come, is studying French in a Spanishspeaking environment (Madrid, Spain). All learners come from central Spain and are advanced learners of French: B2-C1 level, according to the Common European Framework of Reference for Languages (CEFRL). The research focus of IPFC-Spanish is put on the segmental aspects (nasal vowels, rounded vowels, voiced stops, etc.) as well as on the supra-segmental aspects (accentuation and prosodic aspects of discourse organization) of the learners’ performance. In collaboration with the IPFC-Japanese team, a series of perceptual and acoustic evaluations has already been carried out on the production of nasal vowels by advanced Spanish and Japanese learners of French (Detey, Racine, Kawaguchi, Zay, Buehler & Schwab 2010; Racine, Detey, Bühler, Schwab, Zay & Kawaguchi 2010). Regarding the prosodic aspects, a first analysis of a subset of words drawn from the wordlists has already been carried out (Schwab to appear) and a preliminary analysis of the prosodic organization of the text read aloud is in progress (Barquero, Racine & Baqué to appear). The study presented in this chapter constitutes the first step of a larger research project we are conducting within the framework of IPFCSpanish. Its general purpose is to assess the quality of realization5 of the French rounded vowels (/y, u, ø, o, œ, ɔ/) produced by the Spanish learners. In what follows, we will present a corpus-based perceptual study in which the realization of the two close rounded vowels /y/ and /u/ is evaluated by French-speaking native listeners. Note that we have confined ourselves to the results from a subset of the two groups of Spanish students. 2. The French and Spanish vocalic systems Among the challenges that all learners of French must face is the mastery of the vocalic system. 
The system contains a minimum of thirteen vowels, depending on the geographical variety we refer to.6 As Kamiyama & Vaissière (2009) pointed out, the acquisition of L2 vowels is especially important because, on the one hand, their perception has been shown to be less categorical than that of consonants and, on the other hand, the continuous nature of their articulation makes them more difficult to define, for teachers to explain and thus for learners to reproduce. The French vocalic

5 This notion must not be confused with intelligibility or accentedness, or even comprehensibility or acceptability (see Munro 2008).

6 The varieties of French have a vowel system containing a minimum of 13 and a maximum of 16 vowels. Two elements do not exist in all the varieties: /ɑ/ and the nasal /œ̃/. In addition, the phonemic status of schwa (/ǝ/) remains controversial (for a complete review, see Lyche 2010).

Isabelle RACINE

system is particularly challenging for Spanish learners since their L1 has a relatively sparse vocalic inventory with only five vowels: /i/, /e/, /a/, /o/, and /u/. Thus, for Spanish learners, learning French implies the acquisition of eight additional vowels, among them a series of three nasal vowels (/ɑ̃/, /ɔ̃/, /ɛ̃/) and three front rounded vowels (/y/, /ø/ and /œ/). If we compare the systems of oral vowels in French and Spanish, as in Figure 1, we notice that the three French front rounded vowels (/y/, /ø/ and /œ/) differ from each other on the basis of degree of aperture (or height) and from other vowels in terms of front/back (e.g. /y/ vs. /u/) and rounding distinctions (e.g. /y/ vs. /i/).7 Spanish speakers are familiar with these characteristics because they use them to differentiate the vowels in their L1; for example, in Spanish /i/ is distinguished from /e/ on the basis of degree of aperture, while it contrasts with /u/ on rounding and backness criteria. However, the rounding distinction in Spanish is always combined with another criterion in order to contrast vowels: the two rounded vowels are both back vowels, while the three unrounded vowels are also front vowels.

In addition, Meunier, Frenck-Mestre, Lelekov-Boissard & Le Besnerais (2003) have shown that the French system appears to be globally more closed and more posterior than the Spanish one. This is clear when we consider Figures 2 and 3, from Meunier et al. (2003), which present the Spanish (on the left) and French (on the right) vocalic spaces (oral vowels only). The dispersion area for /e/ in Spanish covers the areas of both /e/ and /ɛ/ in French. Meunier et al. (2003) first point out that in both languages there is little or no overlap between categories, which means that speakers produce distinct vowels. Second, for both systems, the production space of back vowels (/u/, /o/ and, for French, /ɔ/) is highly restricted compared to the front vowels (/i/, /e/ and, for French, /ɛ/).
This could easily be explained by the fact that the back of the mouth is less mobile than the front. Magnen (2009), who also comments on the two figures from Meunier et al. (2003), notices that Spanish /e/ and /a/ have a larger dispersion area than their French counterparts. Since the French vocalic system contains more elements than the Spanish one, there is less variation because each space is dedicated to a particular realization. Finally, Magnen (2009) points out that these representations are evidence of the extreme variation characterizing speech sounds. For the same vowel, we find multiple realizations due to intra- and inter-speaker variability, which both native and non-native listeners have to deal with.8

7 Let us briefly remind the reader of the relations between the acoustic and the articulatory properties of vowels. F1 values are determined by the position of the jaw (open or closed). F2 values are a function of the tongue’s position (front or back) but also of the lips’ position (rounded or not) (cf. Meunier 2007).

8 For a discussion on how variability challenges perception, see Racine (2008) for native speakers and Detey (2009) for non-native speakers.

Figure 1. The French and Spanish oral vocalic systems (Magnen, Billières & Gaillard 2005).
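The size of the learning task implied by the two inventories can be sketched as a simple set difference. The sketch below assumes the minimal 13-vowel French inventory (that is, without /ɑ/, /œ̃/ and schwa, as described in footnote 6) and the five Spanish vowels listed in the text; the variable and function names are illustrative only.

```python
# Minimal sketch: which French vowels have no phonemic counterpart in Spanish?
# Assumption: the minimal 13-vowel French inventory (without /ɑ/, /œ̃/ and schwa).

FRENCH_ORAL = {"i", "e", "ɛ", "a", "y", "ø", "œ", "u", "o", "ɔ"}
FRENCH_NASAL = {"ɑ̃", "ɔ̃", "ɛ̃"}
SPANISH = {"i", "e", "a", "o", "u"}

FRONT_ROUNDED = {"y", "ø", "œ"}


def new_vowels(l2, l1):
    """Return the L2 vowels that have no phonemic counterpart in the L1."""
    return l2 - l1


french = FRENCH_ORAL | FRENCH_NASAL
additional = new_vowels(french, SPANISH)

print(len(french))                         # size of the minimal French inventory
print(sorted(additional))                  # vowels a Spanish learner has to add
print(sorted(additional & FRENCH_NASAL))   # the nasal vowels among them
print(sorted(additional & FRONT_ROUNDED))  # the front rounded vowels among them
```

Under these assumptions, the difference contains the three nasal vowels, the three front rounded vowels, and the two open-mid oral vowels /ɛ/ and /ɔ/. Note that /u/ is not in the difference: it counts as phonemically "old" even though, as discussed later in the chapter, it is phonetically different in the two languages.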

Figures 2 and 3. Spanish (left) and French (right) vocalic spaces (Meunier et al. 2003). The dispersion area of each vowel has been calculated on the basis of 30 exemplars of the same vowel, each produced by three native speakers of both languages.9

9

See Meunier et al. (2003) for more details about the methodology.
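To illustrate what a dispersion area in the F1/F2 plane can mean, here is a minimal sketch that computes the area of the one-standard-deviation covariance ellipse of a set of vowel tokens. This is an assumption made for illustration: it is not necessarily the procedure used by Meunier et al. (2003), and the formant values below are invented placeholders, not measurements.

```python
import math


def dispersion_area(tokens):
    """Area of the 1-SD covariance ellipse of (F1, F2) tokens, in Hz^2.

    tokens: list of (f1, f2) pairs for one vowel category.
    Uses the population covariance; area = pi * sqrt(det(covariance)).
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    mean_f1 = sum(f1 for f1, _ in tokens) / n
    mean_f2 = sum(f2 for _, f2 in tokens) / n
    var_f1 = sum((f1 - mean_f1) ** 2 for f1, _ in tokens) / n
    var_f2 = sum((f2 - mean_f2) ** 2 for _, f2 in tokens) / n
    cov = sum((f1 - mean_f1) * (f2 - mean_f2) for f1, f2 in tokens) / n
    det = var_f1 * var_f2 - cov ** 2
    # Guard against tiny negative values caused by rounding.
    return math.pi * math.sqrt(max(det, 0.0))


# Hypothetical placeholder tokens (NOT measured data): a tight and a spread-out cloud.
tight = [(300, 2000), (310, 2000), (300, 2050), (310, 2050)]
spread = [(300, 2000), (400, 2000), (300, 2300), (400, 2300)]

print(dispersion_area(tight) < dispersion_area(spread))
```

With such a measure, the observation reported above (larger dispersion areas for Spanish /e/ and /a/ than for their French counterparts) would correspond to larger ellipse areas for the Spanish categories.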

Regarding the French close rounded vowels /y/ and /u/, even though the former must be considered a new category for Spanish learners, the status of the latter is not quite so evident; phonemically, /u/ is an existing sound in the L1 system of our learners, but phonetically, the /u/ sounds of Spanish and French are not equivalent. In the experiment presented in Section 3, we examine the production of these two vowels by our learners and, more precisely, how these L2 productions are perceptually assessed by French native listeners in comparison to those of French native speakers.

3. The French close rounded vowels /y/ and /u/: a perceptual assessment

3.1. Objectives and initial hypotheses

As mentioned above, most work in the field of L2 phonology has been undertaken with English as L1 or L2. Thus, we find a considerable amount of research involving Spanish or French learners of English and English learners of Spanish or French (e.g. Face & Menke 2009; Flege 1987; Flege 1991; Morrison 2006; Flege, Munro & Fox 1994; Zampini 1994, 1998, among others). However, as far as we know, studies on the acquisition of the French vocalic inventory by Spanish learners are still scarce (cf. Baqué & Cañada 2005; Billières, Magnen & Gaillard 2006; Gaillard, Magnen & Billières 2006; Magnen et al. 2005; Magnen 2009; Poch-Olive & Harmegnies 1992). Moreover, the majority of these studies focused primarily on L2 perception, which, according to most researchers, is mastered prior to production (see Escudero 2006 for a review). In the study we present here, our focus is different: we examine the productions of Spanish learners through a perceptual assessment performed by French native speakers. We chose this approach because, before examining the productions of the learners in terms of acoustic measurements, we first needed to determine which aspects of their productions are problematic from a native point of view.
Therefore, our research questions are as follows:

1) Are the productions of both vowels easily identified by French listeners? And is /u/, which is phonemically but not phonetically identical for our learners, perceived differently from a native /u/?
2) Does the task used to elicit our learners’ productions have an impact on the assessment carried out by the native listeners?
3) Are the productions of the two groups of learners (Spanish from Geneva vs. Spanish from Madrid) assessed equally?

The first question concerns the difficulty involved in pronouncing the two vowels /u/ and /y/ for the Spanish learners. Billières et al. (2006) showed that Spanish learners of French have more difficulty in producing the vowel /y/ than /u/. In a longitudinal study carried out in the framework of the verbotonal method (Intravaia 2000; Renard 1979), Billières et al. (2006) analyzed the acquisition strategies developed by the Spanish learners when

producing French sounds that they recognized as difficult (/y/ vs. /u/, /v/ and /z/). According to most researchers, difficulty in producing new sounds can be attributed to imperfect perceptual ability. Evidence indicates that if phonological contrasts cannot be perceived, speakers will have difficulty producing them (Rochet 1995). Several studies have documented the inability of adult learners to discriminate speech contrasts that do not exist in their native language (see Ioup 2008 for a review). According to models of speech learning such as the SLM (Speech Learning Model, Flege 1995, 2003) or the L2LP (Second Language Linguistic Perception model, Escudero 2005), the processes and mechanisms that govern L2 acquisition are not different from those used for L1 learning, except that L2 learners already have an existing system for speech perception, which interferes with the L2 system. At the initial stage of L2 learning, L2 vowels and consonants are filtered through the L1 system (Trubetzkoy 1939; Trubetzkoy 1958). On the basis of this notion of “phonological deafness”, we can expect difficulties in producing the French vowel /y/ since it is phonemically and phonetically a new sound for the Spanish learners. This problem might occur at least for the words they produced in the repetition task, since in order to perform this task correctly, they first have to perceive the vowel correctly.

For the majority of L2 learning models, however, the Contrastive Analysis Hypothesis (CAH, Lado 1957), which predicts that those aspects of the L2 sound system that are different from the L1 will be most difficult to acquire, is not sufficient to explain all errors. Transfer is merely one factor among others (see Major 2008 for a review). Another notion that has been investigated to explain L2 errors is the “phonetic similarity/dissimilarity” of L1 and L2 sounds.
This concept is used in Flege’s SLM and in Best’s PAM (Perceptual Assimilation Model, Best 1994, 1995) to predict the relative difficulties that listeners will have in perceptual differentiation of non-native segmental contrasts.10 Following Flege’s SLM, the filtering done by the L1 system will progressively disappear as learners acquire elements of L2 that need to be phonetically distinguished (Flege 2003). Gradually, new categories will be formed depending on the perceived phonetic dissimilarity between the L2 sound and the closest L1 sound. SLM postulates that the greater the perceived phonetic dissimilarity is, the more likely it is that a new category will be created. According to Major (2008), this is due to the fact that the larger the phonetic differences are, the more easily they tend to be noticed,

10 It should be noted, however, that the notion of perceived similarity remains controversial even though, according to both models, it contributes to an explanation of the learning difficulties that adult L2 learners have when mastering the L2 phonological system. Further cross-language studies are therefore needed in order to define this notion more precisely (see Strange & Shafer 2008 and Major 2008 for a discussion).

and so learning is more likely to take place. In contrast, minimal differences often go unnoticed, resulting in non-learning and in persistence of transfer. Flege (1987) showed, for example, that adult native English learners of French were able to produce a more phonetically accurate French /y/ than /u/. This result was explained by the greater phonetic distance between French /y/ and the closest English vowel, as compared with the distance between the French /u/ and the closest English vowel. In research on the acquisition of the phonemes /y, u, ø/ by Japanese learners of French, Kamiyama & Vaissière (2009) confirmed Flege’s observations. Taken together, these two studies suggest that a “similar” L2 phone, one that has a phonemic equivalent in the L1 but is phonetically different from it, as in the case of French /u/ for Spanish learners, might be difficult to produce accurately. Based on these studies, we can assume that Spanish learners will produce a vowel that corresponds acoustically more to a Spanish /u/. Our experiment will determine if the French native listeners are sensitive to this difference. If this is the case, we expect a difference in the perceptual assessment between the two groups of Spanish learners and the control French native group.

As regards the second question, another factor that has to be taken into account in the production of L2 sounds is the impact of orthography. It is especially relevant for /y, u/ because in French, the grapheme ⟨u⟩ does not correspond to /u/ but to /y/, whereas in Spanish (as in many other languages), it corresponds to /u/. Our previous work on the production of French nasal vowels by Spanish and Japanese speakers (Detey et al. 2010, Racine et al. 2010) confirmed that the role of orthography cannot be neglected when examining learners’ production (see also Detey 2009).
Our study showed that on the one hand, orthography had a positive impact for both populations on the vocalic quality of the French nasal vowels, with better results for the words produced in the reading task than for the words produced in the repetition task. On the other hand, orthography had a negative impact on postvocalic excrescences, with a higher rate of postvocalic excrescences in both populations for the words produced in the reading task than for words produced in the repetition task. Regarding /u/ and /y/, we hypothesize that orthography will interfere negatively during the production of /y/ and therefore, for this vowel, we expect better results for the words produced in the repetition task of the IPFC corpus than for those produced in the reading task. For /u/, by contrast, orthography should have a positive effect and help the learners to correctly identify and produce this vowel. If such is the case, the words drawn from the reading task should receive a better assessment since the target vowel that they have to produce is clear, whereas in the repetition task some perceptual difficulty may remain in the identification of the target vowel.

The third question concerns another factor that has been extensively investigated recently, namely experience with, or length of exposure to, the L2, in addition to L1 and L2 use and length of residence in the L2 environment (Ioup 2008). Research has shown that, as one would expect, the more the L2 and the less the L1 is used, the weaker the foreign accent; conversely, the less the L2 and the more the L1 is used, the stronger the foreign accent (see Major 2008 for a review). For instance, native-speaker input plays an important role in Flege’s SLM. In this model, L2 learning is considered a long process that requires a large amount of native input in order to succeed. If this is the case, we can hypothesize that the production of the Spanish learners who live in a French environment will be significantly better than that of the Spanish learners who study French in Madrid and are therefore less exposed to oral French.

In order to test these three assumptions, the remainder of the chapter presents a perceptual assessment of /y/ and /u/ performed by French native listeners on words produced in two different tasks (repetition and reading) by our populations of learners and native speakers of French.

3.2. Method

3.2.1. Participants

The speakers were five native French speakers (four women and one man) and ten Spanish learners of French (eight women and two men). Of the learners, five were students at the University of Geneva and five were studying French in Madrid.11 They were selected from the IPFC-Spanish corpus on the basis of their proficiency level in French: B2-C1 according to the Common European Framework of Reference for Languages (CEFRL). Thirty native French speakers took part in the perceptual experiment as listeners. All were students from the University of Neuchâtel (Switzerland).

3.2.2. Material

Four monosyllabic words from the word lists in the IPFC protocol were selected: two contained the close back rounded vowel /u/ and two contained the close front rounded vowel /y/. Each vowel appeared in two different contexts: VC (i.e. bu “drink-PAST-PART-MASC”, bout “end”) and CVC (i.e. boule “bowl”, bulle “bubble”). All four words were produced twice by each

11 As the language they use for their studies in Geneva is French and as they live in a French-speaking environment, the Spanish learners in Geneva are much more exposed to French in their everyday lives and use it more often than the Spanish learners in Madrid. According to their questionnaire, the five learners in Madrid are only exposed to French during 3 or 4 hours of French classes per week. None of them use it in their everyday life.

speaker, in a repetition task and in a reading task. The final stimulus set consisted of 120 words (15 speakers × 4 words × 2 tasks).

3.2.3. Procedure

Participants were instructed to listen carefully to the native and non-native productions of the four monosyllabic words, and then to decide which vowel (/u/ or /y/) they perceived by clicking on the appropriate button.12 The experiment was carried out via an Internet platform (www.labguistic.com).13

3.2.4. Data analysis

For the vowel identification task, a correct vowel identification rate was calculated as a function of the group of speakers (native = N, Spanish from Geneva = SG, and Spanish from Madrid = SM), vowel (/u/ or /y/), context (VC vs. CVC), and production task (repetition task vs. reading task). We analyzed the data by means of mixed-effects regression models (e.g. Baayen, Davidson & Bates 2008), in which the participants and stimuli were entered as random terms. Only the predictors showing an effect or involved in a significant interaction were retained. All statistical analyses were run with the statistical software R (R Development Core Team 2007), and the mixed-effects models were computed with the package lme4 (Bates & Sarkar 2007). For the sake of clarity, the results and figures are presented in percentages, although all statistical analyses have been performed on raw data.

3.3. Results

A mixed-effects model was run with the response (correct vs. incorrect) as the dependent variable, and with group of speakers (N, SG, SM), vowel (/u/ and /y/), context (VC and CVC), and task (repetition and reading) as the fixed factors. The model revealed no significant effect of context. This predictor was therefore removed, which did not change the effects of the other predictors. Results showed three main effects. First, a vowel effect (F (1, 3548) = 37.27, p < 0.001) with a higher correct identification rate for /u/ than for /y/ (respectively 96.79% vs. 66.68%); second, a task effect (F (1, 3548) =

12 Note that native listeners were also asked to perform a ‘goodness’ judgment on the vowel they perceived, using a 1 to 5 rating scale (1 = very good exemplar; 5 = other vowel). The better the exemplar, the lower the number. They had 6,000 ms to carry out both tasks (vowel identification and goodness rating). Because the statistical analyses are in progress, the ‘goodness’ results are not presented in this chapter.

13 We would like to thank Pierre Ménétrey for the development of this platform and for all the adaptations he carried out in order to make this experiment possible.

147.11, p < 0.001) with a higher correct identification rate for the words produced in the repetition task (91.16%) than for those produced in the reading task (72.44%); third, a group effect (F (2, 3548) = 34.00, p < 0.001) with a higher correct identification rate for the French speakers (99.16%) compared to both groups of learners (SG = 68.40%, β = 4.44, z = 13.54, p < 0.001 and SM = 77.64%, β = 3.81, z = 11.59, p < 0.001). Surprisingly, there was a higher correct identification rate for Spanish learners from Madrid than for Spanish learners from Geneva (SM = 77.64%, SG = 68.40%, β = –0.64, z = –5.81, p < 0.01). As vowel interacted significantly with group (F (2, 3548) = 16.21, p < 0.001) and task (F (1, 3548) = 75.52, p < 0.001), we ran a separate model for each of the two vowels (/y/ and /u/), with the response (correct vs. incorrect) as the dependent variable and with group and task as factors. The results for each vowel are presented in the following two sections.

3.3.1. Results for /y/

Figure 4 presents the results for the close front rounded vowel /y/ stimuli used in the experiment. We observed a strong effect of task (F (1, 1772) = 314.32, p < 0.001), with better performance for the words produced in the repetition task (86.28%) than for those produced in the reading task (47.34%). Results also showed an effect of group (F (2, 1772) = 52.85, p < 0.01), with better performance for native speakers (99.17%)

Figure 4. Percent correct response for the vowel /y/ as a function of group (French speakers, Spanish learners from Geneva and Spanish learners from Madrid) and task (Repetition vs. Reading).

than for both groups of Spanish learners (SG = 42.05%, β = 5.17, z = 11.27, p < 0.001; SM = 59.23%, β = 4.44, z = 9.70, p < 0.001); and again, there was a higher correct identification rate for Spanish learners from Madrid than for Spanish learners from Geneva (SM = 59.23%, SG = 42.05%, β = –0.72, z = –6.03, p < 0.01). There was also a clear interaction between group and task (F (2, 1772) = 6.90, p < 0.01), indicating that the impact of the task was not the same for each of the three groups. Unsurprisingly, the type of task made no difference for native speakers (Repetition = 99.00%, Reading = 99.33%, β = –0.41, z = –0.45, n.s.). On the other hand, the pattern of results was identical for both groups of Spanish learners, with better performance for the words produced in the repetition task than for those produced in the reading task (for SG: repetition = 70.59%, reading = 13.51%, β = 2.91, z = 13.06, p < 0.001; for SM: repetition = 89.27%, reading = 29.19%, β = 3.17, z = 13.48, p < 0.001).

To summarize, the native perceptual assessment we carried out on the /y/ productions showed first an effect of the task, with better results for the words produced in the repetition task than for those produced in the reading task for the two groups of Spanish learners. For the natives, unsurprisingly, results did not differ as a function of the task. Second, our results revealed a better performance for native speakers than for both groups of Spanish learners but, more interestingly, a difference between the two groups of learners, with better results for the Spanish learners from Madrid than for those from Geneva, whatever the task.

3.3.2. Results for /u/

Figure 5 presents the results for the close back rounded vowel /u/ stimuli used in the experiment. The results showed two main effects.
First, a task effect (F (1, 1774) = 5.65, p < 0.05) with a higher identification rate for the words produced in the reading task (97.53%) than for those produced in the repetition task (96.05%); second, a group effect (F (2, 1774) = 13.62, p