Working with Portuguese Corpora 9781441190505, 9781472593641, 9781472570000

Although Portuguese is one of the main world languages and researchers have been working on Portuguese electronic text c

227 47 9MB

English Pages [348] Year 2014

Table of contents :
Cover
Half-title
Title
Copyright
Dedication
Contents
List of Contributors
Foreword
Acknowledgements
Introduction
References
Part 1: Lexis and Grammar
1 Looking at Collocations in Brazilian Portuguese Through the Brazilian Corpus
1. Introduction
2. Goal and research questions
3. Corpora and method
4. Results
5. Conclusion
Acknowledgements
References
Appendix
Notes
2 Lexical Bundles in Brazilian Portuguese
1. Introduction
2. Goals and methods
3. Frequency of bundles
4. Functional classification of lexical bundles
5. Conclusion
Acknowledgements
References
Notes
3 Changing ‘Faces’: A Case Study of Complex Prepositions in Brazilian Portuguese
1. Introduction
2. The corpora and the analysis
3. Conclusion
Acknowledgements
References
Part 2: Lexicography
4 The Corpus do Português and the Frequency Dictionary of Portuguese
1. Introduction
2. Corpus do Português – texts
3. Corpus do Português – importance of genre balance
4. Corpus do Português – annotating the texts
5. Corpus do Português – lexical and grammatical queries
6. Corpus do Português – semantically-based queries
7. Corpus do Português – comparing genres, dialects and time periods
8. The Frequency Dictionary of Portuguese
References
Notes
5 PtTenTen: A Corpus for Portuguese Lexicography
1. Introduction
2. Word sketches and the Sketch Engine
3. Corpus collection
4. Language technology tools for processing Portuguese
5. Into the Sketch Engine
6. Regional variants
7. Conclusion
References
Notes
Part 3: Language Teaching and Terminology
6 Idiomaticity in a Coursebook for Teaching Brazilian Portuguese as a Foreign Language
1. Introduction
2. Research methodology
3. Data analysis
4. Final considerations
Acknowledgements
References
Notes
7 Retrieving (Onco)mastology Terms in Portuguese Corpora
1. Introduction
2. Methodological procedures
3. Results
4. Final considerations
References
Appendix
Notes
Part 4: Translation
8 Understanding Portuguese Translations with the Help of Corpora
1. Introduction
2. Translating from English into Portuguese
3. Comparing translated and non-translated Portuguese
4. Conclusions
References
Notes
9 The Per-Fide Corpus: A New Resource for Corpus-Based Terminology, Contrastive Linguistics and Translation Studies
1. Introduction
2. The Per-Fide corpus in the context of Natural Language Processing
3. Corpus processing pipeline
4. Applications of the Per-Fide corpus in cross-linguistic research
5. Concluding remarks
Acknowledgements
References
Notes
10 The CoMET Project: Corpora for Teaching and Translation
1. Introduction
2. The CoMET Project
3. CorTec
4. CorTrad
5. Final comments
References
Notes
Part 5: Corpus Building and Sharing
11 Corpora at Linguateca: Vision and Roads Taken
1. A short history
2. Corpora at Linguateca now: What’s up?
3. Concluding remarks
Acknowledgements
References
Notes
12 The Reference Corpus of Contemporary Portuguese and Related Resources
1. Introduction
2. The Reference Corpus of Contemporary Portuguese
3. Related resources
4. Conclusion
References
Notes
13 C-ORAL-BRASIL: Description, Methodology and Theoretical Framework
1. Introduction
2. The architecture
3. Methodological aspects
4. Concluding remarks
References
Notes
Part 6: Parsing and Annotation
14 PALAVRAS: A Constraint Grammar-Based Parsing System for Portuguese
1. Background: A modular, rule-based parsing architecture
2. Palmorf: a lexicon-based, analytical annotation scheme
3. Morphosyntactic disambiguation and constraint grammar
4. Structural annotation: Dependency syntax and constituent trees
5. Corpus annotation and format filtering
6. Semantic annotation
7. Integrating probabilistic information from corpora
8. Beyond the sentence: anaphora annotation
9. Non-standard data varieties
10. Conclusion
References
Notes
15 New Corpora for ‘New’ Challenges in Portuguese Processing
1. Introduction
2. Linguistic annotation: A bridge between Natural Language Processing and Corpus Linguistics
3. Recent annotation projects at NILC
4. Final remarks
Acknowledgements
References
Index

Recommend Papers

Exploring English with Online Corpora [2 ed.] 1137438126, 9781137438126

This is an essential guide to using digital resources in the study of English language and linguistics. Assuming no prio

120 81 Read more

Working With Diagrams 9781800735590

Arising from the need to go beyond the semiotic, cognitive, epistemic and symbolic reading of diagrams, this book looks

156 66 4MB Read more

Working with Kundalini 9781620558812, 9781620558829

A guide to moving gracefully through the 3-phase process of Kundalini awakening • Explains the three phases of Kundalini

100 54 737KB Read more

Working With the Person With Schizophrenia 9780814788813

The person with schizophrenia poses a formidable challenge even to the experienced clinician. Bizarre, unpredictable beh

147 43 151MB Read more

Working Effectively with Unit Tests

456 56 584KB Read more

Working Effectively with Unit Tests

381 57 5MB Read more

Working with Unattached Youth 9781136251474

196 106 614KB Read more

Working with LARSP 0713161175, 0713161183

This series is the first to approach the problem of language disability as a single field. It attempts to bring togeth,e

370 58 15MB Read more

Working With Emotional Intelligence 9780553903218

117 54 511KB Read more

Portuguese Cookbook: From Funchal to Madeira Discover Delicious European Cooking with Easy Portuguese Recipes

Portuguese 101. Get your copy of the best and most unique Portuguese recipes from BookSumo Press! Come take a journey w

116 37 2MB Read more

Working with Portuguese Corpora
9781441190505, 9781472593641, 9781472570000

Author / Uploaded
Tony Berber Sardinha
Telma de Lurdes São Bento Ferreira (editors)

0 0 0
Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up

File loading please wait...

Citation preview

Working with Portuguese Corpora

Also available from Bloomsbury Corpus Linguistics and Linguistically Annotated Corpora, Sandra Kuebler and Heike Zinsmeister Key Terms in Corpus Linguistics, Michaela Mahlberg and Matthew Brook O’Donnell Lexicology and Corpus Linguistics, M. A. K. Halliday, Wolfgang Teubert and Colin Yallop Working with German Corpora, edited by Bill Dodd Working with Spanish Corpora, edited by Giovanni Parodi

Working with Portuguese Corpora Edited by Tony Berber Sardinha and Telma de Lurdes São Bento Ferreira

Bloomsbury Academic An imprint of Bloomsbury Publishing Plc LON DON • OX F O R D • N E W YO R K • N E W D E L H I • SY DN EY

Bloomsbury Academic An imprint of Bloomsbury Publishing Plc 50 Bedford Square London WC1B 3DP UK

1385 Broadway New York NY 10018 USA

www.bloomsbury.com BLOOMSBURY and the Diana logo are trademarks of Bloomsbury Publishing Plc First published 2014 Paperback edition first published 2015 © Tony Berber Sardinha, Telma de Lurdes São Bento Ferreira and Contributors, 2014 Tony Berber Sardinha and Telma de Lurdes São Bento Ferreira have asserted their right under the Copyright, Designs and Patents Act, 1988, to be identified as the Editors of this work. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage or retrieval system, without prior permission in writing from the publishers. No responsibility for loss caused to any individual or organization acting on or refraining from action as a result of the material in this publication can be accepted by Bloomsbury or the author. British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.

ISBN: HB: 978-1-4411-9050-5 PB: 978-1-4742-6284-2 ePDF: 978-1-4725-7000-0 ePUB: 978-1-4725-7001-7 Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress. Typeset by Fakenham Prepress Solutions, Fakenham, Norfolk NR21 8NN Printed and bound in Great Britain

To Marilisa and Julia To Ricardo and Maria Sofia

Contents List of Contributors Foreword Mike Scott Acknowledgements

ix xv xvi

Introduction Tony Berber Sardinha and Telma de Lurdes São Bento Ferreira 1 Part 1 Lexis and Grammar 1 2 3

Looking at Collocations in Brazilian Portuguese Through the Brazilian Corpus Tony Berber Sardinha

9

Lexical Bundles in Brazilian Portuguese Tony Berber Sardinha, Rosana de Barros Silva e Teixeira and Telma de Lurdes São Bento Ferreira

33

Changing ‘Faces’: A Case Study of Complex Prepositions in Brazilian Portuguese Tania Maria Granja Shepherd

69

Part 2 Lexicography 4 The Corpus do Português and the Frequency Dictionary of Portuguese Mark Davies 5

PtTenTen: A Corpus for Portuguese Lexicography Adam Kilgarriff, Miloš Jakubíček, Jan Pomikalek, Tony Berber Sardinha and Pete Whitelock

89

111

Part 3 Language Teaching and Terminology 6 7

Idiomaticity in a Coursebook for Teaching Brazilian Portuguese as a Foreign Language Telma de Lurdes São Bento Ferreira

131

Retrieving (Onco)mastology Terms in Portuguese Corpora Rosana de Barros Silva e Teixeira

147

Part 4 Translation 8

Understanding Portuguese Translations with the Help of Corpora Ana Frankenberg-Garcia

161

viii Contents 9

The Per-Fide Corpus: A New Resource for Corpus-Based Terminology, Contrastive Linguistics and Translation Studies José João Almeida, Sílvia Araújo, Nuno Carvalho, Idalete Dias, Ana Oliveira, André Santos and Alberto Simões

10 The CoMET Project: Corpora for Teaching and Translation Stella E. O. Tagnin

177

201

Part 5 Corpus Building and Sharing 11 Corpora at Linguateca: Vision and Roads Taken Diana Santos

219

12 The Reference Corpus of Contemporary Portuguese and Related Resources Maria Fernanda Bacelar do Nascimento, Amália Mendes, Sandra Antunes and Luísa Pereira

237

13 C-ORAL-BRASIL: Description, Methodology and Theoretical Framework 257 Tommaso Raso and Heliana Mello Part 6 Parsing and Annotation 14 PALAVRAS: A Constraint Grammar-Based Parsing System for Portuguese Eckhard Bick

279

15 New Corpora for ‘New’ Challenges in Portuguese Processing Sandra Maria Aluísio, Thiago Alexandre Salgueiro Pardo and Magali Sanches Duran 303 Index

323

List of Contributors José João Dias de Almeida is an assistant professor at the University of Minho in the Computer Science Department. His research interests focus on compilers, natural language processing, and scripting tools. He has participated in several projects related to parallel corpora and terminology. Sandra Maria Aluísio has been a computer science professor at the University of São Paulo since 1988. She has a PhD in artificial intelligence – specifically, in natural language processing – from the University of São Paulo. She teaches and supervises students completing their theses and dissertations at the University of São Paulo in natural language processing and artificial intelligence in education, using primarily the machine learning approach. Her research efforts have focused on corpus linguistics, PoS tagging, automatic adaptation of Portuguese texts, semantic role labelling and semantic resources, automatic detection of discourse structure, scientific writing tools, automatic term extraction, and computer adaptive testing. Sandra Antunes has been a fellow at the Centre of Linguistics of the University of Lisbon (CLUL) since 1998 and has worked in several projects in the area of corpus linguistics. Her research has focused on spoken corpus (discourse organization and conversational interaction), regular polysemy, and multiword expressions (MWE). She is currently completing her PhD thesis exploring Portuguese MWE properties from a semantic, syntactic, lexical, and pragmatic standpoint. Her publications include co-authored chapters in books published by John Benjamins (2005, 2006) as well as papers in international journals. Sílvia Araújo is an assistant professor of the Institute of Arts and Human Sciences at the University of Minho. Her research interests focus on corpus linguistics and contrastive studies, specializing in the study of diathesis, tense, and aspect mainly in Portuguese and French. Some of her major publications include articles related to the NLP research project Per-Fide – Parallelizing Portuguese with six different languages. Maria Fernanda Bacelar do Nascimento is a senior researcher at the Centre of Linguistics of the University of Lisbon (CLUL). Her research efforts have focused primarily on corpus linguistics, with particular regard to constitution and exploitation of large corpora, lexical and syntactic analysis, and register variation (in Portuguese and its national varieties). She has participated in several national and international research projects, including Fundamental Portuguese, The Reference Corpus of Contemporary Portuguese, Varieties of African Portuguese, LE-PAROLE,

x

List of Contributors

SIMPLE, and C-ORAL-ROM. She is a member of the Organizing Committee of Gramática do Português, published in October 2013 by Fundação Calouste Gulbenkian. Her publications include book chapters published by John Benjamins (2005, 2006), as well as papers in international journals. Tony Berber Sardinha is associate professor of linguistics at São Paulo Catholic University (PUCSP), where he works at the Graduate Program in Applied Linguistics (LAEL) and at the Centre for Language Research, Resources and Information (CEPRIL). He has a PhD in English from the University of Liverpool, under the supervision of Prof. Michael Hoey. He is a researcher with the Brazilian Science and Research Council (CNPq), and a founding member of RaAM (The International Association for Researching and Applying Metaphor) and ALSFAL (Latin American Association for Systemic Functional Linguistics). He has numerous publications on corpus linguistics, language teaching, and metaphor, and serves on the editorial board of major journals and book series in those fields. Eckhard Bick is Project Leader for the VISL project at the University of Southern Denmark (Odense), where he works as a researcher at the Institute of Language and Communication (ISK), and co-founder of a small Danish language technology start-up, GrammarSoft ApS. Eckhard Bick has a background in medicine, English and paedagogics (Bonn University), but also holds the Danish equivalent of an MA (cand. mag.) in Nordic Languages and Portuguese (Århus University). In 2000, he defended his dr.phil. thesis in linguistics, on the automatic parsing of Portuguese. Current work includes research within Natural Language Processing, machine translation, and corpus linguistics, with a focus on Romance and Germanic languages, for many of which Eckhard Bick has developed extensive Constraint Grammar systems. Nuno Carvalho is currently a PhD student at the University of Minho in the MAP-i doctoral program. His thesis is in the area of program comprehension, devising ways to understand how a program works without any previous knowledge about its implementation or design goals. His main areas of research interests are domain-specific languages, automatic construction of compilers and other language-based tools, any general problems related to language design and implementation, and natural language processing. Mark Davies is professor of (corpus) linguistics at Brigham Young University in Provo, UT, USA. He is the author of A Frequency Dictionary of Portuguese (Routledge, 2007). He has also authored three other books and more than sixty other publications dealing with corpus creation and use, especially in the area of historical and genre-based variation (for Portuguese, Spanish, and English). In addition to creating the Corpus do Português (with Michael Ferreira), he is also the creator of many other corpora that are freely available from http://corpus.byu.edu. These resources are used by more than 100,000 distinct users each month, and they serve as the basis for several hundred academic publications each year.

List of Contributors

xi

Idalete Dias is an assistant professor of the Institute of Arts and Human Sciences at the University of Minho. Her main research areas include lexicography, semantics, corpus linguistics, and digital humanities. Her publications include the co-authored Portuguese– German Dictionary of Idioms/Idiomatik Portugiesisch–Deutsch/Dicionário Idiomático Português–Alemão (2012) and articles related to the NLP research project Per-Fide – Parallelizing Portuguese with six different languages. She is currently coordinating digital humanities projects that focus on digital text representation and text encoding. Magali Sanches Duran received her master’s degree in 2004 and her PhD degree in linguistics in 2008 from São Paulo State University (UNESP), specializing in pedagogical lexicography. Her main interest is to develop lexical resources for both human and computational use. Since 2009, she has been a post-doctoral researcher at the Centre of Computational Linguistics (NILC) at the University of São Paulo (USP), dedicating her efforts to corpus annotation with semantic role labelling in order to provide linguistic resources that will enable improvements in Portuguese automatic processing. Ana Frankenberg-Garcia holds a PhD in applied linguistics from Edinburgh University and was jointly responsible for creating COMPARA, a parallel corpus of English and Portuguese (available online at www.linguateca.pt/COMPARA). She is Senior Lecturer in Translation Studies at the University of Surrey. Her work on the applied uses of corpora has been published in international, peer-reviewed books and journals, including International Journal of Lexicography, International Journal of Corpus Linguistics, Corpora, Language Teaching, and English Language Teaching Journal. In 2011 she co-edited New Trends in Corpora and Language Learning (Continuum). She has been working for Oxford University Press since 2011 as chief editor of a new lexicographic project expected to be published in 2015. Miloš Jakubíček is a software engineer and an NLP researcher. His research interests are devoted mainly to two fields: effective processing of very large text corpora and parsing of morphologically rich languages. For the past five years he has been involved in the development of the Sketch Engine corpus management suite, working on fast indexation and querying of billion words corpora. He is a fellow of the NLP Centre at Faculty of Informatics, Masaryk University, Brno, Czech Republic, where his interests lie mainly in syntactic analysis and its practical applications. Adam Kilgarriff is both Director of Lexical Computing Ltd., and a research scientist working at the intersection of lexicography, corpus linguistics and language technology. His company has developed the Sketch Engine http://www.sketchengine. co.uk, a leading tool for corpus research used for linguistic research, translation and dictionary-making at Oxford University Press, Cambridge University Press and many other companies and universities. His PhD, on ‘polysemy’, was from the University of Sussex and he has since worked at Longman Dictionaries and the University of Brighton. He is a Visiting Research Fellow at the University of Leeds. He is active in

xii

List of Contributors

moves to make the web available as a linguists’ corpus and was the founding chair of ACL-SIGWAC (Association for Computational Linguistics Special Interest Group on Web as Corpus). Heliana Mello is an associate professor of English and linguistics at UFMG, Brazil. Her research efforts have concentrated on language contact and change, corpus linguistics with a special interest in corpus compilation, and methodologies for corpus exploitation. Her publications include a monographic book about language contact in the genesis of Brazilian Portuguese (1997) and co-authorship in books focusing on corpus linguistics. She is a joint coordinator of the C-ORAL-BRASIL spontaneous speech corpus project. Amália Mendes is a senior researcher at the Centre of Linguistics of the University of Lisbon (CLUL). She obtained her PhD in Portuguese linguistics in 2001 from the Faculty of Letters of the University of Lisbon. She has participated in several national and international projects in the area of corpus linguistics, including the Reference Corpus of Contemporary Portuguese. She is a member of the Organizing Committee of Gramática do Português, published in October 2013 by Fundação Calouste Gulbenkian. Her publications include book chapters published by John Benjamins, Springer-Verlag, and Peter Lang as well as papers in international journals. Her main research interests are corpus compilation and annotation, multiword expressions, and lexical semantics. Ana Oliveira holds a degree in European Languages and Literature. She has worked as a research grantee at the Centre for Humanistic Studies of the University of Minho (Portugal) within the scope of the Per-Fide project for the compilation of parallel corpora between Portuguese and six other languages (Spanish, French, Italian, German, English, and Russian). Her academic interests focus on ontologies and digital archiving. Thiago Alexandre Salgueiro Pardo received his degree in computer science (1999) and MSc degree (2002) from the Federal University of São Carlos (UFSCar) and his PhD degree (2005) from the University of São Paulo (USP), in Brazil. Since 2006, he has been a full-time professor and researcher at USP. He is mainly interested in natural language processing and computational linguistics, with research efforts on text summarization, semantic and discourse analysis and parsing, and text mining. Luísa Pereira is a collaborator in CLUL’s Research Sector ANAGRAMA (Grammatical Analysis and Corpora), University of Lisbon, where she works on corpus linguistics for the Reference Corpus of Contemporary Portuguese (CRPC) and on projects using this resource, such as LE-PAROLE, Varieties of African Portuguese, and COMBINA-PT. Her publications, mostly in collaboration, include papers about corpora and their uses and have been presented to scientific journals, conferences,

List of Contributors

xiii

and in the book How to Use Corpora in Language Teaching, published by John Benjamins (2004). Jan Pomikalek is senior software engineer at NetSuite, Brno, and holds a PhD in Informatics from Masaryk University (Czech Republic). His research interests include web corpora, web crawling, character encoding detection, language detection, boilerplate cleaning, and de-duplication. Tommaso Raso is full professor of linguistics at UFMG, Brazil. After having worked in medieval philology and professional writing as an associate professor at the University of Venice, he concentrated his research on spoken corpus compilation and on phonetic and pragmalinguistic analysis of speech. His publications include a critical edition of a medieval didactic manuscript (Colacchi, 2001), a book on bureaucratic writing (Carocci, 2005), a co-authored book on professional writing (Zanichelli, 2002), and the co-authored books Prosody and Pragmatics (FUP, 2012) and C-ORALBRASIL I (UFMG, 2012). André Santos is currently working at SAPO (http://sapo.pt). The subject of his master’s thesis, for which the work was performed in the context of the Per-Fide project as a research grant holder, was the development and implementation of methods for the improvement of the quality of bitext alignments and the extraction of knowledge from the alignments. His main areas of interest are named entity recognition and relationship extraction, domain-specific languages, and natural language processing in general. Diana Santos is the leader of Linguateca, a network for fostering research and development in the processing of the Portuguese language, paying special attention to evaluating and making resources available and free. Her research has spanned several fields of computation linguistics, with an interest in translation, contrastive semantics, and Portuguese grammar. She is currently at the University of Oslo, after having worked in several research institutes, such as INESC (Portugal), IBM, and SINTEF (Norway). Telma de Lurdes São Bento Ferreira earned her MA in applied linguistics from São Paulo Catholic University (PUCSP). She is the coauthor of Muito prazer – fale o português do Brasil, a Brazilian Portuguese teaching course series (Disal, 2008, 2012), and the co-editor of Tecnologias e mídias no ensino de inglês: o corpus nas ‘receitas’ (Macmillan, 2012). She is currently a researcher at the Corpus Linguistics Research Group (GELC) in São Paulo, Brazil, and her interests focus on lexicography and Brazilian Portuguese as a foreign language. Tania Maria Granja Shepherd is an associate professor of English (ESL) at Rio de Janeiro State University. Her publications include contrastive studies of English and Portuguese from a corpus linguistics and systemic functional perspective. She is the

xiv

List of Contributors

co-editor of Linguística da Internet (2013), published by Editora Contexto, which focuses on major digital registers in both English and Portuguese. Rosana de Barros Silva e Teixeira earned her degree in communications/journalism (1998), her teaching degree in Portuguese language (1999) and her MA in applied linguistics and language studies at PUCSP (2011). Currently, she is a state-employed school teacher of Portuguese and teaches classes in Portuguese for special purposes at the Department for the General Coordination of Specialization. She has completed advanced training and extension courses (COGEAE/PUCSP) and is a member of the Corpus Linguistics Research Group (CNPq/PUCSP). Her interests include corpus linguistics, terminology/terminography, lexicography, discourse analysis and production of academic texts. She is the author of Glossário de Oncomastologia: um repertório de termos sobre o câncer de mama (‘Glossary of Oncomastology: a repertoire of terms about breast cancer’), published by Olho d’Água (2013), and articles in the field of corpus linguistics/terminology/French discourse analysis. Alberto Simões is a PhD in natural language processing affiliated with the Polytechnic Institute of Cávado and Ave (Portugal) and works as a researcher at the Centre for Humanistic Studies at Universidade do Minho. His research interests focus on parallel corpora alignment, probabilistic translation dictionaries, and bilingual terminology extraction. Some of his major publications include NATools (a statistical word aligner workbench in Procesamiento del Lenguaje Natural 31, 2003), Makefile: Parallel dependency specification language (in Anne-Marie Kermarrec, Luc Bougé and Thierry Priol, editors, Euro-Parl 2007, volume 4641) and Portuguese English word alignment: Some experiments (in LREC 2008 – The 6th edition of the Language and Resources Evaluation Conference, Marrakech, 2008). Stella E. O. Tagnin is a lecturer in English and translation at the University of São Paulo. Her research efforts have focused on corpus linguistics, mainly in connection with translation, terminology, and English teaching. She is the author of O Jeito que a Gente Diz (Disal, 2005, 2013), and has organized various collections of articles on corpus linguistics (Humanitas 2008, HUB Editorial, 2010, 2013, 2014 in press). Her other publications include Vocabulário para Culinária Inglês-português (SBS, 2008) and various articles in national and international books and journals. She coordinates the CoMET Project (www.fflch.usp.br/dlm/comet), which consists of three corpora: CorTec (a technical corpus), CorTrad (a translation corpus), and CoMAprend (a learner corpus) – all freely available for online investigation. Pete Whitelock is Principal Language Engineer for Academic Dictionaries at Oxford University Press. He is responsible for corpora and a range of Language Engineering activities involved in dictionary creation, improvement and exploitation.

Foreword This volume is a timely tour de force. Portuguese as a major world language steps forward computationally in its pages to show speakers of other languages how corpus tools and resources enliven and enhance the study of linguistic patterning. In centres both within the Portuguese-speaking nations and outside, a wealth of exciting research is taking place at the very frontiers of knowledge, not just pushing forwards what is known about the language but honing the cutting edge of corpus research itself. As a long-time user of the language I find this exciting both as a description of a language I’m interested in and also as a set of ideas about corpus methods. You are holding in your hands an invitation to watch corpora in action. Mike Scott Aston University and Lexical Analysis Software Ltd.

Acknowledgements Our debts of gratitude go to the authors who have kindly contributed their writing and given their time to this collection, to Mike Scott for the foreword, and to the staff at Bloomsbury, Gurdeep Mattu and Andrew Wardell, for their kind and prompt assistance in the editorial process. We are also grateful to the anonymous reviewers for their expert feedback on the chapters. We would also like to thank Fapesp (São Paulo Research Foundation, Brazil) and CNPq (The National Council for Scientific and Technological Development, Brazil) for their support of our research.

Introduction Tony Berber Sardinha and Telma de Lurdes São Bento Ferreira

Although Portuguese is one of the main world languages and researchers have been working on Portuguese electronic text collections for decades (e.g., Kelly, 1970; Biderman, 1978; Bacelar do Nascimento et al., 1984; Berber Sardinha, 2005; Unesp Ciência, 2013; Castro, 2010; Santos, 2011), this is the first volume in English that encapsulates the exciting and cutting-edge corpus linguistic work being done with Portuguese language corpora on different continents. The book includes chapters by leading corpus linguists dealing with Portuguese corpora across the world and their contributions explore various methods and how they are applicable to a wide range of language issues. The book is divided into six sections, each covering a key issue in corpus linguistics, from the use of corpora in language description, both synchronic and diachronic, to real-world uses such as dictionary-making, translation, language teaching and terminology; from corpus design and compilation to annotation, parsing and the online availability of corpora in portals and gateways; and from software development for linguists and language researchers to applications for the general public. Together these sections present the reader with a broad picture of the field. Part 1 comprises chapters that report on research into the lexis and grammar of Brazilian Portuguese. The opening chapter, by Berber Sardinha, explores the use of the Corpus Brasileiro, a one-billion-token, register-diversified corpus of Brazilian Portuguese, as a means to track collocations in individual newspaper texts from the Brazilian Register Variation Corpus. The author argues for a central place for texts in collocation research, as they are the natural environment for collocations. The main goal is to reveal how many word combinations in the texts are actual collocations (i.e., they occur frequently enough in the language to reach statistical significance), the answer to which is not available in the literature, despite some half-century of corpus linguistic investigations into collocation. The results indicate that collocation is present in the texts 90 per cent of the time. Chapter 2, by Berber Sardinha, Silva e Teixeira and São Bento Ferreira, investigates lexical bundles, the most frequently occurring fixed word sequences in a register, in the Brazilian Register Variation Corpus (5.6 million words, 48 registers). The study is the first to provide a language-wide perspective on bundles in any language. The results showed wide variation across the registers, which was interpreted as evidence of routinization – that is, the massive use of routine word sequences as a constitutive feature of a register. The results were also compared cross-linguistically and it was

2

Working with Portuguese Corpora

found that English and Spanish have similar levels of bundles in research articles as Brazilian Portuguese, which in turn was seen as a reflection of widespread international conventions for academic writing. Chapter 3, by Shepherd, focuses on complex prepositions from the point of view of grammaticalization, with the goal of showing that seemingly synonymous constructions have distinct meanings and acquired preposition status in different periods of time. The study shows that the noun face (an English cognate) became grammaticalized as part of the prepositions em face de (in the face of), em face a (in the face of) and face a (facing); however, despite the fact that they share the same lexical item, each of them has a specialized sense. The chapter argues for an approximation between grammaticalization theorists and corpus linguists, as both share similar premises about the nature of language change. Part 2 presents major online corpora, focusing on how these resources can be explored in lexicography. Chapter 4, by Davies, focuses on the Corpus do Português as well as the Frequency Dictionary of Portuguese (jointly authored with Ana Maria Raposo Preto-Bay), based on the corpus. The Corpus do Português was developed by Davies and Michael Ferreira (Georgetown University, USA) and shares the same architecture as Davies’s other online corpora, such as the Corpus of Contemporary American English (COCA) and the Corpus of Historical American English (COHA). The corpus, with 45 million words of both the Brazilian and European varieties, is fully available online and covers a wide time period, from the 12th to the 20th centuries, making it a unique resource for diachronic studies of Portuguese. Chapter 5, by Kilgarriff, Jakubíček, Pomikalek, Berber Sardinha and Whitelock, introduces ptTenTen, a large corpus of web texts, containing more than 1.9 billion tokens of Brazilian and European Portuguese. The corpus was collected as a resource for the creation of the Oxford Portuguese Dictionary, a bilingual Portuguese–English, English–Portuguese dictionary, and is now available online at the Sketch Engine website. The chapter details the challenges faced in collecting such a large corpus, including web crawling, cleaning and parsing. It also presents an analysis of the main keywords in both varieties, illustrating the potential of Sketch Engine as a tool for the lexicographer as well as for the corpus linguist in general. Part 3 centres on two major applications of corpus research: language teaching and terminology, both of which are geared to ‘real-world problems.’ Chapter 6, by São Bento Ferreira, analyses the texts in a coursebook for Portuguese as a foreign language, with a view to determining whether these texts present a suitable picture of the lexicogrammar to students. The method consists of comparing each 3-gram in the coursebook to both the Brazilian Register Variation Corpus (5.6 million words) and the Brazilian Corpus (1.1 billion tokens), recording the matches and their frequencies. The results suggest that the textbook offers an idealized version of the language, where each text maximizes the use of a few lexical sequences and constantly slips typical written language units into spoken texts. The chapter is a cautionary tale of the dangers of manipulating language for educational purposes. Chapter 7, by Silva e Teixeira, looks at the issue of term identification by computer by employing and testing the success rate of four different tools in retrieving breast cancer terms. The tools were chosen with humanities students and researchers in

Introduction

3

mind; hence, they all had easy-to-use, point-and-click interfaces. The four software programs had comparable performance levels, identifying a core set of terms, but they each also pulled out a set of unique terms. The general conclusion is that term extraction is best served by combining different tools, as each one will contribute a number of different terms. The chapter takes the reader through the decision-making process in a step-by-step manner and illustrates the findings with examples of the terms, forming a glossary that has been recently published. Part 4 is devoted to translation and includes chapters focusing on different aspects of translated texts and translators’ options. Chapter 8, by Frankenberg-Garcia, explores systematic features of translated texts by comparing translations and non-translations through the 1.5 million-word COMPARA parallel corpus. The results showed consistent patterns in translation, such as a preference for fronting certain elements, choosing less common words over more typical ones, including foreign words and a tendency for formality. In addition, the study dispels myths and incorrect perceptions about translated Portuguese texts, for instance that they are more verbose than English texts. In fact, the results show a non-significant difference between originals and translations. The findings provide important insights for anyone working in translation – from professionals to educators to translation software developers. Chapter 9, by Almeida, Araújo, Carvalho, Dias, Oliveira, Santos and Simões, introduces the Per-Fide corpus, a multilingual parallel collection that encompasses six languages: English, Russian, French, Italian, German and Spanish. The corpus is bidirectional, with Portuguese as the pivot language (it is always the search or target language). The chapter details the various stages involved in the preparation of the corpus, including automating tasks such as compilation and alignment, estimating the quality of the resources, developing flexible tools and offering materials to the community. In addition, the chapter illustrates how the corpus can be used in crosslinguistic translation research, providing examples of translation issues between Portuguese and English or French. Chapter 10, by Tagnin, presents the corpora available online through the CoMET Project (Multilingual Corpora for Teaching and Translation) portal. CorTec is a comparable corpus with original technical texts in both English and Portuguese. CorTrad is a parallel corpus with three subcorpora – technical-scientific, science journalism and literary – in English and Portuguese. The chapter illustrates the use of these resources for translation and terminology by showing how to generate word frequency lists for the corpora, examine the listings, choose particular words, concordance them and spot the most appropriate translation choices. The queries illustrated in the chapter contribute insights for training teachers, students and translators, both novice and experienced. Part 5 includes three chapters related to corpus building and the online availability of major corpora of Portuguese. Chapter 11, by Santos, chronicles the creation, evolution and current state of Linguateca, a pioneering portal for corpus-based resources in Portuguese. Linguateca provides users with access to a wide range of online Portuguese corpora and corpus-processing tools, free of charge. The portal was launched as early as 1999, which – in ‘web years’ – is equivalent to a distant past (to put this in perspective, Google was founded just one year before and major web

4

Working with Portuguese Corpora

browsers like Firefox and Safari were released only four or five years later). This feat demonstrates Santos’s commitment to the ideal of ensuring unrestricted access to technology for the public at large. Chapter 12, by Bacelar do Nascimento, Mendes, Antunes and Pereira, presents the Corpus de Referência do Português Contemporâneo (Reference Corpus of Contemporary Portuguese, CRPC), the largest multi-variety corpus of Portuguese, with more than 310 million words, available online. The corpus has been compiled by the Centre of Linguistics at the University of Lisbon (CLUL) since 1988. It comprises samples of Portuguese found on all the continents where the language is spoken (Europe, Americas, Africa and Asia), from ten different countries, including those not commonly represented in Portuguese corpora, such as Goa, Macao and East Timor. More than any other resource, the CRPC attests to the international status of the Portuguese language. Chapter 13, by Raso and Mello, introduces C-ORAL-BRASIL, the Brazilian member of the C-ORAL family of spoken corpora. The corpus was compiled at Minas Gerais Federal University to represent a wide cross-section of spoken Brazilian Portuguese, specifically from the city of Belo Horizonte. It is distributed on CD-ROM, which includes not only the transcriptions of the interactions, but also the audio files. The chapter makes the case for the faithful accounting of prosodic breaks in transcription, which can only be accomplished through the careful examination of the audio stream. The corpus comprises an impressive range of registers, some of which are rarely if ever found in other corpora, such as the interaction among soccer players during a game, the recording of which required sophisticated equipment and good planning. The book closes with Part 6, which is concerned with the complex yet vital issues of parsing and annotating corpora. Chapter 14, by Bick, is devoted to the history, implementation and application of the PALAVRAS parser, the de facto, most influential automatic annotation tool in Portuguese corpus linguistics and Natural Language Processing. Although few users know that its name is an acronym of ‘Portuguese Automatic Linguistic Analysis by means of a Versatile and Robust Annotation System,’ and not simply the Portuguese equivalent of ‘words’, to most users PALAVRAS and Eckhard Bick are household names. The parser is behind most major corpus projects around the world, such as the various corpora at Linguateca, NILC, Sketch Engine, CEPRIL and CoMET – many of which are documented in this volume. Chapter 15, by Aluísio, Pardo and Sanches Duran, discusses the role of linguistic annotation in corpus-based research, particularly in Natural Language Processing, on which many practical applications depend. The discussion is framed in the context of the impressive corpus processing infrastructure built by Núcleo Interinstitucional de Linguística Computacional (NILC), which includes a wide range of Natural Language Processing tools and procedures devoted to parsing, annotating and modelling language. The chapter details how these resources have been channelled to automate tasks such as summarization and simplification, which are intended to help the public to deal better with the increasing amount of information made available online. We hope readers will engage with the book with as much joy as we had editing it, for this collection is a testimony to the vitality of the field of corpus-based research in Portuguese.

Introduction

5

References Bacelar do Nascimento, M. F., Marques, M. L. G. and Segura da Cruz, M. L. (1984), Português Fundamental: Vocabulário e Gramática. Lisbon: Centro de Linguística da Universidade de Lisboa. Berber Sardinha, T. (ed.) (2005), A Língua Portuguesa no Computador. São Paulo: Mercado de Letras / FAPESP. Biderman, M. T. C. (1978), Teoria Linguística (Linguística Quantitativa e Computacional). Rio de Janeiro / São Paulo: LTC. Castro, I. (2010), Lindley Cintra. [ONLINE] Available at: http://cvc.instituto-camoes.pt/ conhecer/bases-tematicas/figuras-da-cultura-portuguesa/1426-lindley-cintra.html. Kelly, J. R. (1970), ‘A computational frequency and range list of five hundred BrazilianPortuguese words’. Luso-Brazilian Review, 7, (2), 104–13. Santos, D. (2011), ‘Linguateca’s infrastructure for Portuguese and how it allows the detailed study of language varieties’. Oslo Studies in Language, 3, (2), 113–28. Unesp Ciência (2013), Fransisco Borba: O garimpeiro das palavras. Available at: http:// www2.unesp.br/revista/?p=6347

Part One

Lexis and Grammar

1

Looking at Collocations in Brazilian Portuguese Through the Brazilian Corpus Tony Berber Sardinha

1. Introduction One of the most enduring contributions of corpus linguistics to our understanding of language use is collocation. Although the basic notion of collocation as the juxtaposition of words has been around for centuries (Barnbrook et al., 2013), it took the advent of electronic corpora for collocation to be established as a central construct in linguistic theory. Electronic corpora and the tools needed to explore them revealed the systematic nature of collocation as well as its pervasiveness in language. According to Hoey (2009, p. 34), ‘it was John Sinclair who demonstrated instrumentally the existence of collocation (Sinclair et al., 1970/2004) and in this sense he can be deemed the discoverer of collocation.’ The discovery of collocation by the methods pioneered by Sinclair in the 1960s dramatically changed the way we see collocation to this day – no longer as individual instances that we come across piecemeal in our routine linguistic experience, but as evidence of larger, more abstract language-organizing principles, such as the idiom principle (Sinclair, 1991) and phraseology (Barnbrook et al., 2013). Yet this shift in perspective has also meant that we have lost track of the importance of collocation in shaping texts, despite the fact that texts are, after all, the raw material of which corpora are made. Although corpora contain texts, the textual boundaries are normally lost in these collections (either because they were not taken into account during compilation or because the analysis software ignores them); thus, with most corpora, when we retrieve collocations we end up with a picture of collocation presence in the language rather than in the texts represented in those collections. As Scott and Thompson (2000, p. 3) argue: Although a good deal of Corpus Linguistics work is not […] text-oriented, since the objective is very often to make claims about verb-patterns, neologisms, etc. in the English language or some other language, it does find all its genuine data at this level […].

What we see in the corpora is therefore an abstraction from the actual individual uses of collocation in context. The power to identify and observe vast amounts of collocation

10

Working with Portuguese Corpora

is certainly one of the main appeals of corpus research and should not be dismissed, but it does create a disconnect between the presence of collocation in corpus listings and the reality of the employment of collocation in language production. This is a true turning point, from the perception of collocation in texts, both written and spoken, where they were noticed a few at a time, to their inspection in corpora, where hundreds or thousands of instances of collocation can be seen at once. There are both advantages and disadvantages to this. An advantage is that we can abstract information from the individual instances to achieve a picture of a whole language or a variety, for instance, by considering evidence that cuts across the individual texts in which they originally occurred. This cannot be reliably achieved by taking stock of individual instances. A disadvantage is arguably that we can lose track of the importance that each individual instance plays in the constitution of the actual text in which it is found. I want to claim that the cumulative presence or absence of collocation in individual texts should be a matter of importance to linguistics in general and to corpus linguistics in particular. More specifically, since the use of collocation in text is seen as a defining characteristic of natural, fluent text, describing collocation in individual texts should help to reveal how the text comes into existence and works as it does in the real world. However, in order to appreciate the role of collocation in text, we do need corpora to function as the backdrop against which each individual instance of lexical co-occurrence in the texts is to be examined. Hence, we bring the text back into the picture of collocation research, but at the same time we do not deny the role of corpora in the process because, as Sinclair showed, collocation can only be demonstrated in corpora. A few studies have considered a text perspective on collocation. For example, Hunston and Francis (2000) provide a thorough description of English based on patterns – that is, ‘a phraseology frequently associated with (a sense of) a word, particularly in terms of the prepositions, groups and clauses that follow the word’ (p. 3). The existence of the patterns was ascertained from a corpus perspective through the cumulative evidence afforded by inspecting a corpus. However, once the corpus-based description was completed, Hunston and Francis (2000) adopted a text perspective of the patterns, examining them in detail in individual texts to determine how the patterns occurred both hierarchically and linearly in the sentences. They called the linear perspective ‘text flow’ (where patterns overlap) or ‘text strings’ (where no pattern overlap occurs), which enabled them to identify how one pattern leads on and attaches itself to the next pattern. For instance, the sentence ‘I wanted to ensure that you could send me a university award form’ contains a ‘verb + to infinitive clause’ pattern in ‘wanted to ensure’; ‘ensure that’ realizes a ‘verb + that clause’ pattern in ‘ensure that you could…’ while ‘send me … a form’ is a token of the ‘verb + noun + noun’ pattern in ‘send me a university award form’ (p. 212). The authors concluded that ‘pattern flow is an extremely common phenomenon and can be found in all kinds of speech and writing’ (p. 212) and that ‘naturally-occurring discourse, written or spoken, occurs as a sequence of patterns’. Another such work is Hunston (2010), who returned to the issue of the presence of patterns in texts by taking a sentence that occurs only once on the web, and is therefore probably unique, to demonstrate that – despite its uniqueness – ‘[the sentence] does

Collocations in Portuguese

11

“feel” familiar, because it is composed of elements that are highly predictable because they are part of the patterning of English’ (pp. 164–5). She demonstrated how the sentence comprises several patterns that are not very frequent in a reference corpus individually, but combine to produce a sentence that is perfectly acceptable and idiomatic in English. Finally, Mason (2007) attempted ‘to model the linguistic experience of a speaker through use of a general reference corpus, the British National Corpus (BNC)’ (p. 3) by verifying which multi-word units (MWUs – 3- and 4-grams) found in the BNC occur in sample texts. For instance, the sentence ‘the papers presented at the conference will be available in proceedings on the first day’ (p. 4) is fully accounted for by the BNC MWUs: each 3 and 4 gram in the sentence exists in the reference corpus. Each MWU latches on to the following one, either wholly or in part, thereby prospecting the ones that follow. ‘Prospecting’ is an important concept in linear grammar, as it restricts the choice of subsequent elements and leads us towards the idiom principle, away from the open-choice model. These chains of MWUs are interpreted as reflecting the property that much language use is routine, which is desired since ‘[by] going back to routine usages we make it easier for the recipients to understand what we are saying, as it involves less effort to process something that one has already encountered before’ (Mason 2007, p. 5). In fact, the presence of routine elements renders texts fluent and familiar, which are psychological reflections of encountering natural samples of language use. According to Hoey (2005, p. 8): We can only account for collocation if we assume that every word is mentally primed for collocational use. As a word is acquired through encounters with it in speech and writing, it becomes cumulatively loaded with the contexts and co-texts in which it is encountered, and our knowledge of it includes the fact that it co-occurs with certain other words in certain kinds of context.

The instrumental use of collocation for language production has also long been recognized by practitioners in foreign language teaching, as one of the major goals of teaching a foreign language is to enable students to achieve fluency. As argued here, the employment of collocation in texts helps to make language output more fluent and natural. As a result, we have witnessed a healthy surge of interest in teaching collocation in both language-teaching coursebooks and corpus linguistic research. As McCarthy and O’Dell (2005, p. 4) argue, in English language teaching, learning collocations will help students ‘speak and write English in a more natural and accurate way’ whereas failing to use them can cause their language to ‘sound unnatural and might perhaps confuse.’ To summarize, the discovery of collocation is the result of the introduction of electronic corpora, which made it possible to see mounting evidence of the routine associations between words occurring near each other in texts. At the same time, corpora have traditionally erased the boundaries between texts, thereby making it harder to appreciate the role of the cumulative presence of linear collocations in the text. The goal of the present paper is then to return to texts as the locus of collocation by verifying automatically how each word combination fulfils or fails to fulfil

12

Working with Portuguese Corpora

the typical collocations in Brazilian Portuguese. In so doing, I hope to derive some observations about the presence, density and distribution of collocation, as well as its variation, in language. Although the data are from Brazilian Portuguese, the findings should be relevant to a broader understanding of collocation in text in other similar languages as well.

2. Goal and research questions The primary goal of this study is to verify the extent of the use of collocation in newspaper texts, including frequency, distribution and variation. The corpus is the newspaper reportage subcorpus of the Corpus Brasileiro de Variação de Registro (CBVR; Brazilian Register Variation Corpus). The research questions addressed in this study are as follows: 1. How many collocations are present in the texts? 2. Is there variation across and within the texts with respect to the frequency of collocation? If so, how can such variation be accounted for?

3. Corpora and method Two corpora were utilized in this study: A specialized newspaper reportage corpus (from the CVBR) and the general register-diversified Corpus Brasileiro (Brazilian Corpus). The former comprises 20 different newspaper stories, amounting to 11,467 words; the latter is a multi-register corpus with 1.1 billion tokens (see Table 1.1). The newspaper register texts are not included in the Brazilian corpus; therefore, any attestations found in the newspaper corpus will not be the result of shared texts between the two corpora. The CBVR was tagged for part of speech using the PALAVRAS parser (see Bick, this volume), while the Brazilian Corpus was annotated with the Tree-Tagger trained with Pablo Gamallo’s parameter file (http://www.cis. uni-muenchen.de/~schmid/tools/TreeTagger/). The Brazilian Corpus is available on both the Sketch Engine (sketchengine.co.uk; see Kilgarriff, Jakubíček, Pomikalek, Berber Sardinha and Whitelock, this volume) and Linguateca (linguateca.pt; see Santos, this volume). A distinction is made between simple co-occurrence and collocation proper, based on Hoey (2005, p. 3), who stated that ‘[w]henever I need to refer to the occurrence of two or more words within a short space of each other, I shall talk of “lexical co-occurrence”’. When two words occur near each other once in a single stretch of text, they create a lexical co-occurrence; when these words co-occur in a corpus a critical number of times, they produce a collocation. The difference is therefore statistical and, consequently, our definition of collocation is also based on statistical criteria. Hoey defines collocation as ‘the relationship a lexical item has with items that appear with greater than random probability in its (textual) context’ (Hoey 1991, pp. 6–7).

Collocations in Portuguese

13

Table 1.1 Composition of the Brazilian Corpus Subcorpus* Theses and dissertations Articles Newspaper Education, various SESSIONS OF CONGRESS Wikipedia Reports and manuals Legislation, various Literature, various Conference proceedings INTERVIEWS STATE SENATE PROCEEDINGS PRESIDENTIAL SPEECHES Religion, various Bible Manuals Biographies Magazines SCREENPLAYS Essays (crônicas) Drug labels SOCCER BROADCASTS Short stories PRESIDENTIAL TV DEBATES Horoscope Total

Tokens 310,972,387 258,585,002 253,732,527 89,398,389 77,139,578 45,910,768 13,742,224 9,097,447 8,659,955 6,947,244 4,003,975 3,977,450 1,803,404 914,786 859,004 708,239 534,965 494,974 289,389 160,525 113,228 86,323 60,777 22,033 4,319 1,088,218,912

% 28.58% 23.76% 23.32% 8.22% 7.09% 4.22% 1.26% .84% .80% .64% .37% .36% .17% .08% .08% .07% .05% .05% .03% .01% .01% .01% .01% Duplicate values. For the complete details of these steps, see Silva e Teixeira (2011).

2.1.2 WordSmith Tools 3.0 WordSmith Tools 3.0 is a suite containing three programs: WordList, KeyWords and Concordance. KeyWords is used here for term-candidate extraction. The KeyWords tool produces a list of positive keywords, which are words that have a higher frequency in the main corpus than in the reference (negative keywords are those whose frequency is higher in the reference corpus). First, both MAMAtex and Banco de Português 1.0 were loaded into the WordList program to generate a frequency list for words for each of the corpora separately. Then, the frequency parameter in KeyWords was adjusted to ≥ 7 (as discussed above). The program then compared one word frequency list against the other and returned a list of keywords (unigrams). The resulting positive keywords were pasted into the Excel file. As a keyword list by definition includes only unique words, there was no need to wean out repeated items, unlike with other tools.

2.1.3 e-Termos I loaded the (onco)mastology text files into this tool and set the frequency parameter to ≥ 7, which produced a list of term candidates (unigrams). These candidates were copied into the Excel file. As before, some of the listed candidates were compounds, such as autoexame (self-examination); these were broken down into single words. Duplicate items were also eliminated, as described before.

2.1.4 ZExtractor I adjusted the minimum frequency to ≥ 7 and then loaded the research and the reference corpora into the program. This returned a list of keywords (unigrams). As with the other programs, this list of unigrams (candidates) was copied into a worksheet.

150

Working with Portuguese Corpora

2.2 Step 2: Extraction of mutual candidates5 In order to determine which candidates were selected by all four tools, a list of mutual candidates was drawn up. This step made use of the VLOOKUP function in Excel (Formulas > subgroup Function Library > button Look up and Reference > VLOOKUP). This feature finds a particular value (entry) in a source list – in this case, each term candidate on each of the four lists. It also tries to match each candidate on each list against all other lists and, upon filtering the data (Home > subgroup Edit > Sort and Filter button > Filter), instances that were not present (#N/A) across all lists were removed. The successful hits (mutual candidates) were then pasted manually into an Excel worksheet. As a means of ensuring the consistency of this matching procedure, a final check was run on the resulting data contained to verify that all and only the mutual term candidates were present. The ‘IF’ function, which checks if a condition is true or false, was used for this: menu Formulas > subgroup Function Library > button Logical > IF.

2.3 Step 3: Extraction of unique candidates6 Once a list of mutual term candidates was obtained, another list had to be drawn up, this time showing only the unique candidates for each tool. The VLOOKUP function was again used, as was data filtering (see previous section). This time, however, non-available candidates (#N/A) indicated that they were present in a given list but not in the others. In the same way as with the mutual candidates, a check was run with the resulting list of unique terms to ascertain whether it was correct. In addition, a manual hand-and-eye analysis was performed in order to confirm whether filtered candidates were truly unique because, depending on the program, certain characters were omitted, which would mistakenly produce unique

Figure 7.1 Summary of procedures

Retrieving Terms in Portuguese Corpora

151

candidates when in fact they were already present. For example, candidate ‘BRCA1’ was reported to be present only in the Corpógrafo list; however, the form ‘BRCA’ without the number 1 was found on the lists from other programs via this manual check (the number 1 was featured on the main corpus either separated from ‘BRCA’ by a space or by a hyphen). This check was run using concordances. Whenever a falsely unique candidate was found, it was eliminated from the list and entered into the shared list.

2.4 Step 4: Concordance analysis Having both the mutual and unique candidates, the next step was to determine whether each was a true term (true positive) or not (false positive). This was done via concordancing, using the Concordance option available in either WordSmith Tools or Corpógrafo. A candidate was considered to be a term if it designated a concept of the (sub)area in question – namely, if it had an ‘unusually specific value’ (Cabré, 1999, p. 124). In addition to this criterion, which is pivotal in determining term status, others were applied, such as semantic unpredictability (stability of form and meaning) as well as co-occurrence, usage and frequency of use (Barros, 2007, pp. 42–50). To illustrate, consider the term candidate biópsia (biopsy): it passes the first test for stability of form, but when it is looked at in a concordance, we find that it is frequently part of a collocation with cirúrgica (surgical), forming a different term biópsia cirúrgica (surgical biopsy). The collocation reveals a new complex term based on a single word term, which retains a relationship of similarity with the former term: in this case, the former is the generic term (biopsy = hypernym) while the latter corresponds to a type of the former (surgical biopsy = hyponym). Note that words removed from the list of candidates for having frequencies lower than seven would still make the candidate list if they were collocates of a candidate. This was the case for ooforectomia (oophorectomy) and imunoterapia (immunotherapy), for example, which appeared as collocates of the candidate hormonioterapia (hormonal therapy). Thus, although they were not present in the list of candidates common to all programs nor in the list of candidates unique to each, they were recovered and incorporated into the list of terms based on conceptual ‘clues’, such as the presence of morphology from Greek. These ‘clues’ relate to the semantic properties of suffixes (-ectomia, -oma) (-ectomy, -oma) and radicals (terapia, grafia) (therapy, graphy), together with other morphemes of a similar nature (oofor-; oophor-) or of Latin origin (ultra-), which often, for example, form the names of processes in medical science in general. For Delvizio and Barros (2008, p. 1409), ‘in medical terminology it is common for the morphological structure of a word to reveal features of the phenomenon to which it applies.’ On the other hand, candidates such as benignidade (benignity) did not make the final term lists. This is an example of a false positive (a candidate contributed by the programs as mutual or unique and having a minimum frequency of 7, but that does not designate a concept belonging to the field in question) as it refers to the non-invasive condition of a tumour, which disqualifies the cancer as a degenerative disease. Thus ‘benignity’ might have representativeness within the nomenclature of

Working with Portuguese Corpora

152

mastology, which covers all sorts of diseases related to the breast, but not specifically within (onco)mastology, which is concerned with breast cancer. It is important to clarify that having a specific conceptual value in (onco)mastology does not mean that consideration was given only to those terms that originated in the (sub)area, such as mastectomia (mastectomy); items were considered to be terms if, although originating from related areas, such as the terms ‘biopsy’ (from pathology) and ‘gene’ (from genetics), they were seen to have linguistic-conceptual productivity (usefulness) in the configuration of the (sub)specialization in question. The final term list containing the true positive, concordance-checked candidates was used as the basis for a glossary of (onco)mastology terms. To illustrate, a sample of the 237 terms from Silva e Teixeira (2013) is shown in the Appendix (Table 7.3); the full list covers aspects such as oncogenesis, types of breast cancer, forms of spread, diagnosis and screening, treatment, staging and prognosis.

3. Results The ability of each tool to detect common and unique terms was tested using the lists of candidates described in the previous section. To determine how well each tool performed, an accuracy rate, specifically with regard to the proper extraction of terms, was calculated. The results appear in Table 7.2 and Figure 7.2. Table 7.2 Counts of true and false positives for each program Candidates

Corpógrafo 4.0

WST 3.0

e-Termos

ZExtractor

True positives False positives Total

204 536 740

211 758 969

207 1,226 1,433

203 576 779

Figure 7.2 Rate of accuracy and error for the tools

Retrieving Terms in Portuguese Corpora

153

To test whether significant differences existed among the accuracy rates for the programs, a chi-square (χ2) statistical test was run. The results showed a statistically significant difference among the programs with respect to true positive and false positive terms (χ2 = 45.232, df = 3, p = 0.000). Figure 7.2 shows that three of the four tools used (Corpógrafo 4.0, WST 3.0 and ZExtractor) had hit rates (true positive candidates) at around 20 per cent and error rates (false positive candidates) at around 70 per cent. In contrast, e-Termos ranked fourth in terms of accuracy, with around 7 per cent less accuracy than the software in third place (i.e., WST). The success rate of the tools becomes more apparent when assessing the results for the total candidates inventoried by the applications – based, as mentioned previously, on what was shared by all and unique to each. These numbers suggest that, while Corpógrafo 4.0, WST 3.0 and ZExtractor maintain an average for hits and misses, e-Termos fell short of expectations regarding the accurate identification of true positive candidates. Bagot (1999, 2001, cited in Oliveira, 2009a) suggests that statistical tools for the purpose in reference have a mean noise (false positive candidates) of approximately 75 per cent, whereas the noise for tools driven by linguistic methods varies between 55 per cent and 75 per cent. Within these parameters, both Corpógrafo 4.0 and ZExtractor seem to meet expectations, unlike the other two programs.

4. Final considerations The results suggest that some tools perform better under the conditions set up in this study, with Corpógrafo and ZExtractor outranking the others. At the same time, because there was variation with respect to the terms retrieved by each one, it also indicates that no single tool available for term extraction can provide all the term candidates. The number of candidates retrieved by each tool was different and each tool supplied at least some useful terms. Ignoring the output of any of these tools could lead to some terms not being picked up, which in turn would result in the extraction activity being less reliable. As such, a possible recommendation coming out of this project is that researchers should use more than one tool if they want to achieve better term retrieval than would be achieved with a single tool. However, as shown, the output of any single program – or, indeed, of all four combined – cannot guarantee a complete recall of all terms in the corpus. In addition, it is important that researchers use their knowledge of the field in order to check whether each candidate is a true term or not. Working with a specialist in the field could be useful for validating them. It should be noted that the method suggested in this chapter was operationalized for researchers who are not computer experts, as it does not require any programming knowledge. All tools are user-friendly with intuitive point-and-click GUIs. In addition, the support provided by Excel was essential and easy to obtain through its built-in commands, which can be learned by using the detailed tutorials available in the program itself or on the web.

154

Working with Portuguese Corpora

We thus hope to have contributed to the field of term extraction in Portuguese by showing how people who are not computer experts can use available tools to retrieve terms from corpora.

References Bagot, R. E. (1999), Extracció de Terminologia: elements per a construcció d’un SEACUSE (Sistema d’Extracció Automàtica de Candidats a Unitats de Significació Especialitzada). PhD Dissertation, Universidade Pompeu Fabra. —(2001), ‘Extracción de terminologia: Elements per la construcción de un extractor’. TradTerm, 7, (1), 225–50. Barros, L. A. (2007), Conhecimentos de Terminologia Geral para a Prática Tradutória. São José do Rio Preto: NovaGraf. Berber Sardinha, T. (2000), Banco de Português v.1. [ONLINE] Available at http://www2. lael.pucsp.br /corpora /bp —(2005), ‘A influência do tamanho do corpus de referência na obtenção de palavraschave usando o programa computacional WordSmith Tools’. The ESPecialist, 26, (2), 183–204. Cabré, M. T. (1999), La Terminología: Representación y comunicación. Barcelona: Institut Universitari de Linguística Aplicada. Delvizio, I. A. and Barros, L. A. (2008) ‘Processos de criação lexical na terminologia médica: Das ruas aos laboratórios’, in J. S. Magalhães and L. Travaglia, L. (eds), Múltiplas Perspectivas em Linguística. Uberlândia: UFU, p. 1404–11. Fromm, G. (2004), ‘Ferramentas de análise lexical computadorizadas: Uma aplicação prática’. Factus, 1, (3), 153–64. —(2008), ‘VoTec: Uma ferramenta para terminógrafos, tradutores e alunos de Letras’. Paper presented at XI Mini-Enapol: Tratamentos do Léxico: diversidade cultural, a multiconceptualização do mundo, Uberlândia Federal University. Kennedy, G. (1998), An Introduction to Corpus Linguistics. New York: Longman. Maia, B., Sarmento, L. and Santos, D. (2004), Corpógrafo. [ONLINE] Available at http:// labclup.letras.up.pt/corpografo —(2005), ‘Introduzindo o Corpógrafo – Um conjunto de ferramentas para criar corpora especializados e comparáveis e bases de dados teminológicas’. Terminómetro, 7, 61–2. Matuda, S. (2008), ‘Fraseologia no futebol: Um estudo bilíngue baseado em corpus’. Domínios da Linguagem, 2 (2). Moreira, A. C. S. (2010), Terminologia e Tradução: Criação de uma Base de Dados Terminológica do Turismo Baseada num Corpus Paralelo Português-Inglês. PhD Dissertation, Universidade de Vigo. Moreira Filho, J. L. (2009), ZExtractor. [ONLINE] Available at http://www.fflch.usp.br/dl/ li/x/?p=559 Nazar, R. (2010), A Quantitative Approach to Concept Analysis. PhD Dissertation, Universitat Pompeu Fabra. Oliveira, L. H. M. (2009a), e-Termos: Um Ambiente Colaborativo Web de Gestão Terminológica. PhD Dissertation, Universidade de São Paulo, São Carlos. —(2009b), e-Termos. Ambiente Colaborativo Web de Gestão Terminológica. [ONLINE] Available at http://www.etermos.cnptia.embrapa.br/index.php Scott, M. (1999), WordSmith Tools version 3. Oxford: Oxford University Press.

Retrieving Terms in Portuguese Corpora

155

Silva e Teixeira, R. B. (2011), Termos de (Onco)mastologia: Uma abordagem mediada por corpus. Master’s Thesis, São Paulo Catholic University. —(2013), Glossário de Oncomastologia: Um repertório de termos sobre o câncer de mama. São Paulo: Editora Olho d’Água. Teixeira, E. D. (2005), ‘Tradução e Terminologia Plurilíngue – a Línguística de Corpus como Proposta de Aproximação’. Paper presented at 53º Encontro do GEL, Universidade Federal de São Carlos. —(2008a), A Linguística de Corpus a Serviço do Tradutor: Proposta de um dicionário de Culinária voltado para padronização textual. PhD Dissertation, Universidade de São Paulo. —(2008b), ‘Tradução culinária e ensino: Um exemplo de metodologia de avaliação utilizando etiquetagem e o WordSmith Tools’. Domínios da Linguagem, 4, 2–32.

Appendix Table 7.3 Sample of terms Portuguese term

English term

carcinoma adenoide-cístico quimioterapia adjuvante hormonioterapia adjuvante carcinoma apócrino inativadores de aromatase inibidores de aromatase ATM metástase axilar linfadenectomia axilar membrana basal BCL2 biópsia braquiterapia BRCA-1 BRCA-2 câncer de mama reconstrução mamária c-erbB-2 CA 15-3 CA 27-29 carcinoma in situ catepsina D CEA CHEK2 quimioterapia exame clínico doppler colorido tomografia computadorizada cirurgia conservadora

adenoid cystic carcinoma adjuvant chemotherapy adjuvant hormonal therapy apocrine carcinoma aromatase inactivators aromatase inhibitors ATM axillary lymph-node metastasis axillary lymphadenectomy basement membrane BCL2 biopsy brachytherapy BRCA-1 BRCA-2 breast cancer breast reconstruction c-erbB-2 CA 15-3 CA 27-29 carcinoma in situ cathepsin D CEA CHEK2 chemotherapy clinical examination colour Doppler computerized tomography conservative surgery

156

Working with Portuguese Corpora

Portuguese term

English term

core biopsy exame citológico mamografia digital metástase a distância carcinoma ductal in situ receptor de estrogênio biópsia excisional radioterapia externa PAAF biópsia de congelação gene mastectomia radial a Halsted exame histopatológico HMMR hormonioterapia receptores hormonais exame de imagem imunoterapia biópsia incisional radioterapia intraoperatória carcinoma invasivo carcinoma ductal infiltrante

core biopsy cytological examination digital mammography distant metastasis ductal carcinoma in situ oestrogen receptor excisional biopsy external radiotherapy FNA frozen-section biopsy gene Halsted radical mastectomy histopathological examination HMMR hormonal therapy hormone receptors imaging examination immunotherapy incisional biopsy intraoperative radiotherapy invasive carcinoma invasive ductal carcinoma

Notes 1 Although the program was not designed for terminology research, many researchers in the field have been using it for this purpose, which serves to justify the analysis of its performance. It is also referred to here as WST. 2 As breast cancer is a matter for concern and consequently a subject for scientific research, experts say that a subspecialty, (onco)mastology (onco = gr. ógkos [‘tumour’] + mastology = gr. mastós [‘breast’] + logos [‘study’] + -y), is in the process of being formalized, given that it already provides the name for a specialized sector within the discipline of mastology at, for example, the São Paulo State School of Medicine at São Paulo Federal University (Unifesp). 3 In 2008, at the time of the compilation of the main corpus, the spelling of this word (in Brazilian Portuguese) still used a hyphen, unlike today, due to the New Spelling Agreement approved in 2009. 4 Bearing in mind that I intended to find keywords using the WordSmith Tools 3.0 program, which presents them as unigrams (one lexeme), as I only had the list of unigrams from the reference corpus, I thought it was sensible to standardize the extraction of term candidates for all programs on unigrams, despite being aware that the majority of terms in many fields are presented syntagmatically. The intention was to use the list of unigrams as an access point to the corpus; however, when choosing the size of the search string in Corpógrafo, the choice is the minimum, which means that the program can provide candidates for complex terms (with more than one orthographic form).

Retrieving Terms in Portuguese Corpora

157

5 Mutual candidates were taken as a basis on the rationale that the majority of terms (true positive candidates) would be among them. 6 The decision was made to also investigate the unique ones in order to determine what each particular tool would return as a term (true positive candidate).

Part Four

Translation

8

Understanding Portuguese Translations with the Help of Corpora Ana Frankenberg-Garcia

1. Introduction Judging by the way people generally criticize machine translation output or complain about on-screen translation, where actors’ spoken dialogue often differs quite substantially from subtitled translations displayed on screen, it seems that individuals who can speak more than one language can quite readily detect translation mistakes – at least the most glaring ones. What is not so easy to pin down is the opposite – namely, what constitutes a decent translation, or more generally what translation is all about. With the gradual establishment of translation studies as an independent academic discipline over recent decades, there has been a growing concern with describing translation from a neutral, non-judgemental perspective. According to Frawley (1984), translated texts are different from the source texts that give rise to them and, at the same time, are different from target language texts that are not translations. This ‘third code’ (Frawley, 1984, p. 168) is therefore a type of language that deserves being studied in its own right. Not very long ago, it was only feasible to examine translations one text at a time; however, as Baker (1993) predicted, with the advent of corpora, it is now possible to take descriptive approaches to translation to new levels. Comparable corpora of translated and non-translated texts enable us to analyse enormous quantities of translated and non-translated language and to understand some of the general characteristics that distinguish the two. Furthermore, parallel corpora (i.e., corpora of source texts aligned with their respective translations; see Almeida et al.’s, Santos’s and Tagnin’s chapters in this volume) enable us to uncover prevailing trends that cannot be identified when examining individual source texts and translations one by one. In this chapter, I use corpora to examine general tendencies that characterize what translators do when they translate from English into Portuguese and to examine what could make Portuguese translations read differently from texts originally written in Portuguese. With this, my aim is to describe several distinguishing features of translated Portuguese in terms of lexis, syntax and discourse.

Working with Portuguese Corpora

162

Because the focus of the study is on what is normal rather than what is anomalous or stands out as a mistake, many of the issues raised here often go unnoticed; however, it is precisely by noticing them that we can obtain a better understanding of Portuguese translation. This awareness is not only of purely academic interest, but can also have important implications for translator education and the development of translation software.

2. Translating from English into Portuguese This section focuses on what is normal in the translation of English into Portuguese. In particular, using a parallel corpus of English source texts aligned with their Portuguese translations, I will report on three short studies that were carried out to uncover trends not immediately visible to the naked eye.

2.1 Counting words Counting words (or characters or lines) is an important concern of many translators because it serves as the basis for how they price their work. The number of words in source texts is rarely identical to the number of words in translated texts. In educated circles, it is often claimed that changes in text length are largely due to cultural and rhetorical differences between languages, with some languages being naturally wordier than others. With regard to English and Portuguese, it has been claimed that Portuguese is the more verbose of the two. According to McKenny and Bennet (2011, p. 248), ‘While English values succinctness, clarity and objectivity, […] Portuguese […] is characterised by a general “wordiness” and redundancy.’ If one follows this line of thought, then Portuguese translations should contain more words than English source texts; however, I believe the most significant changes affecting word counts in translation do not have to do with contrastive rhetoric, but with the morphological and syntactic characteristics of the language pairs involved. When compared to English, Portuguese is a morphologically very rich language. This is not the place for an in-depth contrastive analysis of the two linguistic systems, but it is clear that the number of verb inflections possible in Portuguese is substantially greater than those possible in English; it is also clear that, unlike English, Portuguese common nouns are marked for gender, adjectives inflect accordingly and so on. Therefore, contrary to the previous claim, many ideas that have to be expressed by X words in English can be converted into fewer Portuguese words. For example, ‘it was raining’ (three words) becomes chovia (one word), ‘I know’ (two words) becomes sei (one word), ‘a female cat’ (three words) becomes uma gata (two words), ‘he tried it’ (three words) becomes experimentou-o (one word),1 and so on. Therefore, when translating from English into Portuguese, provided that there are no major changes in content, it is unlikely that the overall number of words should expand in the process. If anything, the total number of words should decrease. To verify this, in the present study, I looked at word counts in COMPARA,2 a bidirectional parallel corpus of

Portuguese Translations

163

Table 8.1 Comparison of word counts in English source texts and Portuguese translations in COMPARA Mean EN STwords* 24,075.12 PT TTwords** 23,700.15 Ranks TTwords < STwords TTwords > STwords TTwords = STwords Total

N 21 13 0 34

Wilcoxon Signed Ranks Test Z (Based on positive ranks) Asymp. Sig. (2-tailed)

No. of bi-texts Std. Dev.

Std. Error Mean

34 34

1,851.441 1,911.247

10,795.665 11,144.387 Mean Rank 17.43 17.62

Sum of Ranks 366.00 229.00

TTwords – STwords –1.171 0.242

* English source texts ** Portuguese translations

English and Portuguese fiction (Frankenberg-Garcia and Santos, 2003). The 34 bi-texts and approximately 1.5 million words involving English-to-Portuguese translation in a corpus representing the work of 25 professional Portuguese translators indicated that there was a 2.5 per cent average decrease in the number of words from source texts to translations. To assess whether the lack of an increase in the overall number of words could be a general trend rather than a distortion caused by the idiosyncratic behaviour of just a few translators, a Wilcoxon Matched-Pairs Signed-Ranks Test was applied to the word counts of the 34 bi-texts considered in the analysis.3 As shown in Table 8.1, the differences between the number of words in the source texts and in the translations were not significant. In other words, these results indicate that it is unlikely that word counts will increase when translating from English into Portuguese. Thus, even if Portuguese is held to be more verbose, one cannot say that translation will expand the text in terms of word counts – at least not as far as translated fiction is concerned. Thus, when pricing their work, English–Portuguese translators need not be overly concerned about whether to count the number of words in the source text or in the translation in order for their work to be more lucrative.4

2.2 Inserting words Although it is true that the total number of words in texts translated from English to Portuguese does not seem to increase, this does not mean that Portuguese translators do not occasionally add information not present in the originals. As amply debated in the literature (see, e.g., Blum-Kulka, 1986; Klaudy, 1998; Pym, 2005; Frankenberg-Garcia, 2009), there seems to be a general tendency for translators to make translations more explicit than source texts. Simple word counts do not tell us much about this phenomenon because, as I discuss in Frankenberg-Garcia (2009), it

164

Working with Portuguese Corpora

is hard to differentiate between the extent to which changes in word counts are merely due to morphosyntactic differences between languages and the extent to which they can be attributed to actual additions or deletions of propositional meanings. Note that a translation can convey more information than a source text even when it has fewer words. In Example (1) below, taken from Frankenberg-Garcia (2009), the English source text is seven words long, whereas the Portuguese translation, which contains an extra piece of information not present in the source text (então), is only five words long. (1) Source Translation Back translation

What have I got to complain about? De que me queixo ? What have I got to complain about ?

One way to find out whether translated Portuguese contains propositional meanings not present in English source texts is to compare aligned texts while looking out for particular features that might have been added to the translations. When exploring a small, bidirectional corpus of 120,000 running words of economic texts in English and Portuguese, Amador (2013) observed that the adverb também (also) stood out in the Portuguese translations because, in 12 per cent of its occurrences, there were no matching words in the English source texts that gave rise to them. In the present study, I attempted to find out whether it would be possible to observe findings similar to those of Amador (2013) in the much larger COMPARA corpus of fiction. As can be seen in Example (2), taken from the COMPARA corpus, the addition of the adverb também renders the Portuguese translation more explicit than the English source text. In particular, there is nothing in the source text to indicate that the interlocutor’s understanding is shared with anyone else, but in the translated text the addition of também gives the idea that there is someone else that shares the same understanding. (2) Source Is that your understanding? Translation é assim que vê as coisas? Back translation Is that how you see things ? A search for também among the Portuguese translations in COMPARA (796,566 words) disclosed 1,029 hits. As can be seen in Table 8.2, a brief inspection of the parallel concordances then showed that there are several different English words that can generate também in Portuguese: ‘also’, ‘too’, ‘either’, ‘neither’, ‘as well’, ‘so does someone’, ‘oneself ’ and ‘so on’. Yet there were also many concordances where também found no equivalent in the English source texts. Each one of the 1,029 concordances for também in the Portuguese translations was therefore inspected manually for the absence of an English equivalent in the parallel source text.5 The results indicate 279 cases of explicitation (i.e., 27.1 per cent of the occurrences of também in translated Portuguese found no equivalent in their corresponding English source texts). In order to find out whether the addition of também to Portuguese translations was a trend of translated Portuguese fiction in general, rather than the work of just a handful of translators represented in the corpus, it was hypothesized that the occurrences of também in the translations would be significantly greater than the occurrences of English equivalents in the source texts. To test this

Portuguese Translations

165

Table 8.2 Parallel concordances for também in COMPARA English source texts

Portuguese translations

The other guys noticed her .

Os outros tipos repararam nela. E  é lindo de morrer.

It’s  incredibly, heartstoppingly beautiful. I heard her breathing settle into a deep, slow rhythm before I dropped off . Anyway, I’d better stop, or I’ll miss the 5.40 . If you had a Fall, we. It’s not in Edward’s nature .

Ouvi a sua respiração tomar um ritmo lento e profundo, té que adormecei . De qualquer modo, é melhor parar ou ainda perco  o comboio das 5.40. Se vocês tiveram um pecado original, nós . não é da natureza de Edward.

hypothesis, a Wilcoxon Matched-Pairs Signed-Ranks Test was conducted to compare the occurrences of também and its English equivalents in the 34 English–Portuguese bi-texts in the corpus.6 The results are summarized in Table 8.3. As can be seen, the figures indicate that the addition of também with no corresponding matches in the source texts was highly significant, with a probability of more than 99.9 per cent that these differences did not occur by chance. These results suggest that, even if Portuguese translations are not significantly longer than English source texts, they can contain additional cohesive devices that contribute to rendering translated Portuguese more explicit than the English source texts. Table 8.3 Comparison of também in Portuguese translations (TT) in COMPARA and their equivalents in English source texts (ST) também_ST também_TT

Mean

N

Std. Dev.

Std. Error Mean

22.06 30.26

34 34

14.719 17.908

2.524 3.071

Ranks tambémTT < tambémST tambémTT > tambémST tambémTT = tambémST Total Wilcoxon Signed Ranks Test Z (Based on negative ranks) Asymp. Sig. (2-tailed)

N 0 33 1 34

Mean Rank 0.00 17.00

Sum of Ranks 0.00 561.00

também_TT – também_ST –5.016 .000

166

Working with Portuguese Corpora

2.3 Changing the position of words Having analysed word counts and the addition of the adverb também in English translated into Portuguese, I would like to finish this section with an example of a translation change affecting discourse. Whereas in English the unmarked position of adverbs of time is at the end of the clause, in Portuguese there seems to be a preference for inserting time adverbs in the sentence-initial position. I was therefore interested in finding out whether Portuguese translators tend to move time adverbs to the front of the clause in the process of translation. Using the Portuguese translations in the COMPARA corpus again, a search for hoje (today) was conducted in order to retrieve parallel concordances like those in Table 8.4, and the position in the clause of hoje and adverbial expressions containing this word, like hoje em dia (nowadays), hoje à tarde (this afternoon) and so on, was compared with the position of their equivalents in the English source texts. Table 8.4 Parallel concordances for hoje in COMPARA English source texts

Portuguese translations

I couldn’t decide what tie to wear . Bobby Moore died , of cancer. Even , I can’t think of myself like that. I think people pay a lot of money for old things . I hardly ever ride on a bus. Who were those zombies out there ?

, não conseguia decidir que gravata usar. morreu Bobby Moore, de câncer. Mesmo , não consigo imaginar-me como tal. Acho que se paga bom dinheiro por essas coisas velhas. , quase não tomo ônibus. Quem, eram aqueles zumbis lá no auditório ?

Both the concordances where the English time adverb was already in the clause-initial position, as in (3), and the concordances with no English equivalent to hoje adverbs, as in (4), were disregarded. Thus, the analysis only took into account the cases in which the hoje adverbs were brought forward in relation to where their English equivalents were placed, as in (5), and the cases where they remained in the same non-clauseinitial position as their English counterparts, as in (6).7 (3) Source but I could rule the world Translation mas sinto-me uma rainha Back translation but I feel like a queen It’s one of those days. (4) Source Translation é um daqueles dias.8 Back translation is one of those days. (5) Source I won’t go in , I’ll ring. Translation não vou. Vou telefonar. Back translation I won’t go. I’ll ring. (6) Source You must get a man here , Sergeant. Tem que aqui mandar um homem , sargento. Translation Back translation You must send here a man , Sergeant.

Portuguese Translations

167

The analysis disclosed 181 expressions equivalent to hoje adverbs that were in the non-clause-initial position in the English source texts, 81 of which were fronted in the process of translation. To test whether the fronting of hoje adverbs in Portuguese translations was not something done by just a handful of translators represented in the corpus, but rather constituted a general trend, it was hypothesized that, if nothing noteworthy was happening to these adverbs, the number of hoje adverbs in non-clause-initial position in English source texts would not significantly differ from the number of adverbs that remained in that same position in the Portuguese translations. A Wilcoxon Matched-Pairs Signed-Ranks Test was thus applied comparing the two, the results of which are presented in Table 8.5. As can be seen, the mean number of hoje adverbs in the non-clause-initial position in source-texts was 5.32 and an average of only 2.94 remained in the same place in translation. The differences between the two were highly significant, which leads one to reject the hypothesis that nothing special was happening to the position of these adverbs in the process of translation. Because the actual changes observed involved fronting hoje adverbs, one can conclude that there seems to be a general trend of this occurring in the translation of English into Portuguese. In this part of the study, I used a parallel corpus to identify three features that appear to be the norm in the translation of English into Portuguese fiction: texts do not tend to expand in terms of number of words; the cohesive device também is often added to the translated text when there is no corresponding term in the source text; and there is a tendency to front hoje time adverbs. The next section focuses on what distinguishes translated from non-translated Portuguese. Table 8.5 Comparison of hoje adverbs in the non-clause-initial position in English source texts and adverbs that remained in that same position in corresponding Portuguese translations Hoje Adverbs

Mean

N

Std. Deviation

Std. Error Mean

Non-clause initial in ST Same position in TT

5.32 2.94

34 34

8.601 5.548

1.475 0.952

Ranks Non-clause initial in ST < Same position in TT Non-clause initial in ST > Same position in TT Non-clause initial in ST = Same position in TT Total Wilcoxon Signed Ranks Test Z (based on negative ranks) Asymp. Sig. (2-tailed)

N 0 25 9 34

Mean Rank 0.00 13.00

Sum of Ranks 0.00 325.00

Non-clause initial in ST – Same position in TT –4.423 0.000

168

Working with Portuguese Corpora

3. Comparing translated and non-translated Portuguese People familiar with reading translations and texts that are not translations might realize that there are some differences between them, but it is often difficult to pin down what exactly makes translated texts sound different from texts that are not translations. Drawing on a couple of recent corpus-based studies that compare translated and non-translated Portuguese and then carrying out a contrastive analysis of prepositions, I will describe a number of features that set the translated and non-translated Portuguese apart. Although the analyses described in the previous section were only feasible with recourse to a parallel corpus of English source texts aligned with their corresponding Portuguese translations, parallel alignment is not necessary here. Instead, the studies described in this section draw on comparable corpora of translated and non-translated Portuguese. More specifically, while the previous section examined the English source texts and Portuguese translations in the COMPARA corpus, this section will report on findings from that same corpus that draw upon original Portuguese fiction and Portuguese fiction translated from English. This is only possible, of course, because, as shown in Figure 8.1, COMPARA is a bidirectional corpus. Thus, while the previous section drew on data from (b) in Figure 8.1, this section is based on data from (c). The translated Portuguese component contains 796,566 words (41 translations) and the non-translated Portuguese part of the corpus is comprised of 639,360 words (34 source texts).

Figure 8.1 Possible directions of analysis in the COMPARA corpus (FrankenbergGarcia, 2006)

3.1 Foreign words in original and translated Portuguese fiction One of the most immediately obvious features that sets original and translated texts apart is the presence of foreign words. Translators resort to loans as a strategy when certain words in the source texts find no equivalent in the translation language and they sometimes use loans deliberately to confer a foreign flavour to the translation (Vinay and Darbelnet, 1995). However, as I argue in Frankenberg-Garcia (2005), the use of loans is not a prerogative of translated texts; people writing directly in

Portuguese Translations

169

their native languages may also resort to foreign words. Thus, it cannot be the mere presence of foreign words that makes translations read differently from non-translated texts; there must be specific differences with regard to how loans are used. In Frankenberg-Garcia (2005), I examined the use of foreign words in the translated and non-translated Portuguese texts in the COMPARA corpus, focusing on the frequency and language distribution of those loans. It was found that in original Portuguese there were on average only 1.5 foreign words for every ten thousand words compared to 24.3 in translated Portuguese. This means that translated Portuguese contained on average approximately 16 times more foreign words than original Portuguese. Note, however, that around one-third of the foreign words and expressions in the Portuguese translations were already present in the English source texts that gave rise to them, as illustrated in Example (7) (foreign words in emphasis); as such, it was not just a question of Portuguese translators using loans as a translation strategy, as in Example (8). In approximately one-third of the cases, the foreign words were already there and the translators simply preserved them. (7) Source Translation (8) Source Translation

Well, c’est normal. Bem, c’est normal. I’m writing this on my laptop on the train to London. Estou escrevendo isto no meu laptop no trem para Londres.

In Frankenberg-Garcia (2005), it was also found that the Portuguese original fiction texts in COMPARA contained foreign words from only four different languages: English, Latin, French and German. The loans were used in very few texts and none of the loan languages seemed to prevail. In contrast, the translated Portuguese texts contained loans from fifteen different languages, with English (the language of the source texts) being by far the most prevalent ones. As in the non-translated Portuguese texts, the translated texts also exhibited loans from French, German and Latin. However, in addition to the loan languages already seen in non-translated Portuguese, the translated Portuguese texts contained loans from Italian, Spanish, Hebrew, Afrikaans, Greek, Japanese, Hawaiian, Czech and Yiddish. Thus, the fact that Portuguese translators resort to loans as a translation strategy only explains in part why translated and non-translated fiction differ with regard to the presence of foreign words. What is not so evident is that English fiction seems to tolerate more foreign words and more loan languages than original Portuguese fiction. When English fiction is translated into Portuguese, the loans already present in the source text are generally transposed to the translation.

3.2 Lemmas in original and translated Portuguese It is not only the comparatively high frequency of foreign words and the presence of loan languages that are not normally used in original Portuguese fiction that might confer a different feel to translated Portuguese fiction, but the choice of Portuguese words themselves can also differ between the original and translated texts. Although the words employed in translations are constrained by the words used in the source texts that give rise to them, people writing directly in Portuguese are free to use

170

Working with Portuguese Corpora

whatever Portuguese words come to mind. Could there be any remarkable differences between the two? In Frankenberg-Garcia (2008), I compared the distribution of noun, verb, adjective and adverb lemmas in original and translated Portuguese, again using the COMPARA corpus. With regard to nouns, the main differences found were that nouns used for classifying things, such as gênero/espécie/tipo (type), membro (member), grupo (group), lista (list) and maioria (majority), as well as nouns used for specifying manner, like tom (tone), modo (manner), expressão (expression), aspecto (aspect) and atitude (attitude) occurred at least twice as frequently in translated Portuguese. In contrast, nouns referring to human beings – like sobrinho (nephew), moço (young man), menino (boy), velha (old lady), soldado (soldier), menina (girl), velho (old man), padre (priest), senhora (lady), dono (owner), senhor (gentleman), primo (cousin) and rei (king) – and nouns closely associated with the Portuguese psyche – like lembrança (memory), saudade (nostalgia), alma (soul) and tristeza (sadness) – occurred at least twice as frequently in non-translated Portuguese. The analysis of adjectival lemmas revealed that evaluative adjectives reflecting personal opinions and feelings – like maravilhoso (wonderful), evidente (obvious), especial (special), horrível (horrible), suficiente (enough) and principal (main) – occurred at least twice as frequently in translated Portuguese, whereas adjectives describing observable traits – like gordo (fat), grosso (thick), igual (equal), nu (naked), rico (rich) and morto (dead) as well as emotions like triste (sad) and alegre (happy) – were at least twice more common in non-translated Portuguese. With regard to verbs, linking verbs (e.g., encontrar-se (find oneself), tornar-se (become) and sentir-se (feel)), reporting verbs (e.g., revelar (reveal), exclamar (exclaim), lamentar (regret), sugerir (suggest), comentar (comment) and replicar (reply)), verbs used to indicate movement (e.g., inclinar-se (to lean), regressar (return), dirigir-se (turn to), baixar (lower), virar-se (turn), apanhar (catch), apoiar (lean), voltar-se (turn), acenar (nod) and abanar (shake)) and verbs that frequently precede other verbs (e.g., tentar (try), conseguir (manage) and permitir (allow)) occurred at least twice as frequently in translated Portuguese. On the other hand, highly lexical verbs closely related to the dramatic language of literary texts (e.g., vencer (win), fugir (run away), beijar (kiss), cantar (sing), quebrar (break), sonhar (dream), amar (love), roubar (steal), chorar (cry), matar (kill), morrer (die) and nascer (be born)) were at least twice as frequently used in non-translated Portuguese. The analysis of adverbial lemmas, in turn, showed that adverbs with the -mente suffix (e.g., profundamente (deeply), absolutamente (absolutely), completamente (completely), simplesmente (simply), perfeitamente (perfectly) and imediatamente (immediately)) and other adverbs of manner (e.g., demasiado (too) and bastante (rather)) occurred at least twice as frequently in translated Portuguese whereas the adverbs that stood out in original Portuguese were mostly adverbs of time and frequency (e.g., enfim (finally), logo (soon), ontem (yesterday), jamais (never), amanhã (tomorrow) and hoje (today)). Interestingly, the analysis also revealed that some of the lemmas that were noticeably more frequent in the translated Portuguese had near synonyms that were markedly more frequent in non-translated Portuguese. For example, the following

Portuguese Translations

171

pairs show the preferred lemma in translated Portuguese followed by the preferred synonym in non-translated Portuguese: recordação / lembrança (souvenir), escola / colégio (school), edifício / prédio (building), compreender / entender (understand), recordar / lembrar (remember), reparar / notar (observe), observar / examinar (observe / examine), decidir / resolver (decide), obrigar / mandar (force/order), manter / guardar (keep), apanhar / recolher (pick / gather), completamente / todo (completely) and finalmente / enfim or afinal (finally). Although these synonymous pairs seem to reflect mostly linguistic differences, with translated Portuguese appearing to express preference for the more formal option, many of the noun, verb, adjective and adverb lemma differences described earlier seem to mirror actual contrasts in culture.

3.3 Prepositions in original and translated Portuguese The analysis carried out in Frankenberg-Garcia (2008) focused on contrastive lexis; however, there might also be grammatical features that distinguish translated from non-translated Portuguese. In the present study, I conducted a brief exploratory analysis to examine prepositions in translated and non-translated Portuguese. A search for the overall distribution of a random selection of core Portuguese prepositions in COMPARA revealed that three of them might be over-represented in translated Portuguese. The results of this summary analysis are presented in Table 8.6. Although there was nothing particularly remarkable about most prepositions, após (after), durante (during) and perante (before) were selected for further analysis Table 8.6 Distribution of a selection of core prepositions in translated and non-translated Portuguese (per 100,000 words) Preposition a + ao(s) + à(s) ante após até com conforme de + da(s) + do(s) desde durante em + num + numa entre exceto + excepto mediante para perante por sem sob sobre

Translated Portuguese 5,069.5 1.6 21.3 164.6 1,125.1 2.9 6,725.8 33.9 101.1 1,116.8 99.9 7.0 0 1,276.7 11.7 707.5 166.7 27.0 147.8

Non-Translated Portuguese 4,608.2 3.6 12.8 148.0 952.5 6.7 6,328.7 41.9 33.3 1,012.9 102.6 1.6 0.5 937.7 5.2 577.8 228.5 23.8 140.3

172

Working with Portuguese Corpora

because (1) they had a frequency of more than 10 hits per 100,000 words in at least one corpus (the prepositions with lower frequencies were not considered to be representative enough) and (2) their distributions seemed to be markedly different in translated and non-translated Portuguese. To assess whether these distinctive distributions were not being distorted by just a handful of translators or authors, a more fine-grained analysis was carried out by investigating their distributions separately in the 41 translated and 34 non-translated Portuguese texts in the corpus, the results of which are summarized in Table 8.7. Next, for each separate preposition, a Mann-Whitney Test9 was applied to determine whether the differences observed were significant. Results are summarized in Tables 8.8, 8.9 and 8.10. Table 8.7 Distribution of após, durante and perante in translated and non-translated Portuguese Text

Statistic

Translated Portuguese Mean no. of hits/100 k words N SD Non-translated Mean no. of hits/100 k words Portuguese N SD

após

durante perante

20.7 41 18.7 12.1 34 16.5

92.2 41 52.0 32.6 34 31.0

11.3 41 14.6 4.8 34 8.1

Table 8.8 Comparison of após in translated and non-translated Portuguese Text type

N

Mean Rank

Sum of Ranks

Translated Portuguese Non-translated Portuguese Total

41 34 75

43.70 31.13

1,791.50 1,058.50

Mann-Whitney U Wilcoxon W Z Asymp. Sig. (2-tailed)

463.500 1,058.500 –2.497 0.013

Table 8.9 Comparison of durante in translated and non-translated Portuguese Text type

N

Mean Rank

Sum of Ranks

Translated Portuguese Non-translated Portuguese Total

41 34 75

50.10 23.41

2,054.00 796.00

Mann-Whitney U Wilcoxon W Z Asymp. Sig. (2-tailed)

201.000 796.000 –5.280 0.000

Portuguese Translations

173

Table 8.10 Comparison of perante in translated and non-translated Portuguese Text type

N

Mean Rank

Sum of Ranks

Translated Portuguese Non-translated Portuguese Total

41 34 75

43.12 31.82

1768.00 1082.00

Mann-Whitney U Wilcoxon W Z Asymp. Sig. (2-tailed)

487.000 1,082.000 –2.337 .019

As can be seen, in all three cases, the differences observed were statistically significant. In other words, it is highly unlikely that the differences between the use of após, durante and perante in translated and non-translated Portuguese are merely due to chance or the individual choices of just a handful of translators. Interestingly, all three prepositions sound rather formal and have less formal syntactic options with similar meanings that could be used in their place: após is equivalent to the less formal prepositional phrase depois de; perante is equivalent to the less formal prepositional phrase diante de; and durante can be replaced by the less formal temporal clause enquanto. Thus, just as with the synonymous pairs of lemmas discussed in the previous section, this seems to be yet another indication that translated Portuguese fiction has a tendency for formality.

4. Conclusions Corpora enable us to access large quantities of text and describe language from the viewpoint of a vast number of users. This chapter used a bidirectional parallel corpus of English and Portuguese of three million words to describe a few distinguishing features of translated Portuguese fiction. The trends unveiled in this study are not idiosyncrasies, but rather the result of an analysis of the combined work of the 25 Portuguese translators represented in the corpus. In the first part of the study, the analysis was based on English source texts aligned with Portuguese translations and three conclusions were reached. First, despite the general belief that Portuguese tends to be less concise than English, the translation of English into Portuguese does not tend to produce a significant increase in the number of words. Second, despite the fact that translated Portuguese texts do not seem to expand in terms of the number of words, Portuguese translators may nevertheless add words to the translation that have no equivalent in the source texts. In particular, this study showed that there is a strong tendency for Portuguese translators of fiction to insert the adverb também where its equivalent was not present in the English text, thereby rendering the translation more explicit than its source text. Third, there appears to be a propensity for Portuguese translators to front adverbs of time when

174

Working with Portuguese Corpora

they are in non-clause initial position in the source text. More specifically, the analysis showed that when adverbs like ‘nowadays’, ‘these days’, ‘today’, ‘this morning’, ‘tonight’ and ‘now’ are in non-clause initial position in English, there is a tendency for them to be fronted in Portuguese translation. The second part of the present chapter did not examine what happened from source text to translation. Instead, it summarized the findings of two previous studies comparing translated and non-translated Portuguese and reported on new findings that bring to light factors that can contribute to making translations read differently from texts that are not translations. The first study showed that translated fiction tends to contain considerably more foreign words and tends to make use of a wider range of loan languages than non-translated Portuguese fiction. The fact that translated fiction contains more foreign words is not very surprising, given that the use of loans is a common translation strategy; however, what would not have been easily observable without recourse to a parallel corpus was that one-third of the foreign words present in the Portuguese translations and the unusual loan languages used actually originated in the English source texts. The second study compared the distribution of noun, verb, adjective and adverb lemmas in translated and non-translated Portuguese and unveiled a series of lexical contrasts between the two. A few of the differences observed could have been easily anticipated, such as the conspicuous absence of the very Portuguese word saudade (nostalgia) from translations and – probably because of the influence of the English cognate ‘finally’ – the preference for the adverb finalmente instead of its synonyms enfim or afinal, which are more common in non-translated Portuguese. The analysis, however, also disclosed a number of unexpected and remarkable linguistic and cultural contrasts and a tendency for formality, all of which can play an important role in making translated Portuguese fiction different from non-translated Portuguese fiction. Finally, an analysis was carried out to explore whether prepositions might also have distinctive distributions in translated and non-translated Portuguese. The results revealed that the prepositions após, durante and perante occurred significantly more frequently in translations. Because these prepositions have less formal syntactic equivalents that could have been used instead, these observations support the idea that there is a propensity for a more formal register to be used in translated Portuguese. This study used corpora to uncover a series of trends portraying what is normal (rather than what is unusual) in Portuguese translations. Although it is relatively easy to spot what stands out as anomalous, it is only possible to capture what is standard practice when analysing large quantities of text. The findings reported here will, hopefully, not only contribute to our general understanding of Portuguese translation, but also provide valuable insights to those working in translator education and the development of translation software. Countless other analyses could be carried out using parallel and comparable corpora and this chapter aims in part to inspire others to develop further research in this area.

Portuguese Translations

175

References Amador, P. (2013), Universais da Tradução: A Explicitação Através de um Estudo de Textos Económicos. MA Thesis, ISLA Lisbon Campus. Baker, M. (1993), ‘Corpus linguistics and translation studies. Implications and applications’, in M. Baker, G. Francis and E. Tognini-Bonelli (eds), Text and Technology: In Honour of John Sinclair. Amsterdam, Philadelphia, PA: John Benjamins, pp. 233–50. Blum-Kulka, S. (1986), ‘Shifts of cohesion and coherence in translation’, in J. House and S. Blum-Kulka (eds), Interlingual and Intercultural Communication: Discourse and Cognition in Translation and Second Language Acquisition Studies. Tübingen: Gunter Narr, pp. 17–35. Frankenberg-Garcia, A. (2005), ‘A corpus-based study of loan words in original and translated texts’, in P. Danielsson and M. Wagenmakers (eds), Proceedings of the Corpus Linguistics 2005 Conference. Available at http://www.birmingham.ac.uk/ research/activity/corpus/publications/conference-archives/2005-conf-e-journal.aspx —(2006), ‘Using a parallel corpus in translation practice and research’, in Proceedings of Contrapor 2006, 1ª Conferência de Tradução Portuguesa, pp. 142–8. —(2008), ‘Suggesting rather special facts: A corpus-based study of distinctive lexical distributions in translated texts’. Corpora, 3, (2), 195–211. —(2009), ‘Are translations longer than source texts? A corpus-based study of explicitation’, in A. Beeby, P. Rodríguez and P. Sánchez-Gijón (eds), Corpus Use and Translating. Amsterdam and Philadelphia, PA: John Benjamins, pp. 47–58. Frankenberg-Garcia, A. and Santos, D. (2003), ‘Introducing COMPARA: The PortugueseEnglish parallel corpus’, in F. Zanettin, S. Bernardini and D. Stewart (eds), Corpora in Translator Education. Manchester: St. Jerome, pp. 71–87. Frawley, W. (1984), ‘Prolegomenon to a theory of translation’, in W. Frawley (ed.), Translation, Literary, Linguistic and Philosophical Perspectives. London and Toronto: Associated University Presses, pp. 159–75. Klaudy, K. (1998), ‘Explicitation’, in M. Baker (ed.), Encyclopedia of Translation Studies. London: Routledge, pp. 80–5. McKenny, J. and Bennet, K. (2011), ‘Polishing papers for publication: palimpsests or procrustean beds?’, in A. Frankenberg-Garcia, L. Flowerdew and G. Aston (eds), New Trends in Corpora and Language Learning. London: Continuum, pp. 247–62. Pym, A. (2005), ‘Explaining explicitation’, in K. Karoly and A. Foris (eds), New Trends in Translation Studies. In Honour of K. Klaudy. Budapest: Akadémia Kiadó, pp. 29–34. Vinay, J. and Darbelnet, J. (1995), Comparative Stylistics of French and English: A Methodology for Translation. Amsterdam, Philadelphia, PA: John Benjamins.

Notes 1 One orthographic word in this case – i.e., a string of characters separated by spaces. 2 Available online at www.linguateca.pt/COMPARA. Version 13.1.22 dated 01/05/2011 was used in the present study. 3 This test was chosen because the data did not meet the normality assumption required for the paired t-test. The Wilcoxon Matched-Pairs test focuses on the differences between paired data – in this case, the differences between the number of words in

176

4

5

6 7 8 9

Working with Portuguese Corpora

source texts and translations for each separate bi-text – but, unlike the paired t-test, it does not assume normal distribution. The same does not appear to be true, however, for translators working in the opposite direction (i.e., Portuguese to English), where a reverse analysis drawing on the remaining 41 PT-EN bi-texts in COMPARA indicates an average 13 per cent increase in the number of words and that the differences between the mean number of words in STs and TTs is highly significant. Although the search interface to COMPARA allows for queries involving alignment constraints, where one can look up parallel concordances with também on the Portuguese side of the concordance but without also or too or any other possible equivalent on the English side, it was not feasible to carry out the analysis automatically because of the large variety of matches conceivable. Again, this test was used because the data were not normally distributed. In the entire corpus, there were only two instances in which hoje adverbs were actually moved to the end of the clause in translation. They were considered to be too marginal and were therefore also disregarded in the analysis. Note that the insertion of the adverb here is another indication of explicitation. This test is used to compare two independent samples of data that do not satisfy the criterion of normal distribution, as was the case here.

9

The Per-Fide Corpus: A New Resource for Corpus-Based Terminology, Contrastive Linguistics and Translation Studies José João Almeida, Sílvia Araújo, Nuno Carvalho, Idalete Dias, Ana Oliveira, André Santos and Alberto Simões

1. Introduction The Per-Fide project is a joint collaboration between researchers at the Department of Informatics and the Institute of Arts and Humanities at the University of Minho, Portugal. The acronym Per-Fide stands for Portuguese (P) in parallel with six languages: English (E), Russian (R), French (F), Italian (I), German/Deutsch (D) and Spanish/Español (E). First, we expound on the role of the Per-Fide project within the context of existing corpora that include the Portuguese language in its different variants – namely, European Portuguese, Brazilian Portuguese and Portuguese spoken in African countries (Angola, Mozambique, Guinea-Bissau, Cape Verde, São Tomé and Príncipe). The idea of creating a multilingual parallel corpus project in which Portuguese assumes a pivotal role arose primarily due to the fact that the majority of online corpora that include Portuguese are either monolingual or bilingual. Furthermore, these corpora focus mainly on one specific text type. Consequently, the few multilingual parallel corpora that include Portuguese consist of a relatively small Portuguese subcorpus1 and provide limited search facilities mainly due to the fact that the Portuguese texts have not been morphologically tagged and/or syntactically annotated. Our second goal in this chapter is to provide an overview of the design criteria for the development of tools and resources in the various stages of the Per-Fide corpora construction process, focusing particularly on automation, validation, generalization and resource sharing. Here, a brief description of the workflow components involved in the pre- and post-alignment phases will be included. Finally, we draw attention to several practical applications of the current features of the Per-Fide corpus in translation practice and contrastive linguistic studies, focusing on the use and potential of probabilistic translation dictionaries and the role that parallel corpora can play in translating idioms.

178

Working with Portuguese Corpora

2. The Per-Fide corpus in the context of Natural Language Processing In discussing the current status of the computational processing of the Portuguese language, particular mention must be made of the work developed by the Language Resource Center for Portuguese, Linguateca (see Santos, in this volume). Most of the corpora compiled by Linguateca are monolingual, such as the CETEMPúblico corpus and its Brazilian counterpart CETENFolha, two large corpus collections of articles from the Portuguese newspaper Público and the Brazilian newspaper Folha de S. Paulo, respectively. The most noticeable parallel corpus project being developed by Linguateca2 is the Portuguese–English Parallel Translation Corpus COMPARA (Frankenberg-Garcia and Santos, 2002). COMPARA is a bidirectional parallel corpus of English and Portuguese literary text extracts. The majority of Portuguese source texts were written by Portuguese or Brazilian authors, but the corpus also contains texts by Mozambican and Angolan authors. The original English texts were written by authors from the United Kingdom, the United States and South Africa. In some cases, more than one translation of the same source text has been included in the corpus and each corpus text has been annotated with simple text-related metadata consisting of the following elements: author, translator, title, publishing and copyright information, extract start and end page numbers, and the number of tokens, words and types. The

Figure 9.1 The COMPARA search interface and query results

The Per-Fide Corpus

179

COMPARA corpus can be queried via the DISPARA interface (Santos, 2002), which provides simple and advanced search facilities.3 To formulate a simple query, users need only define the search direction (Portuguese–English or English–Portuguese) and enter the search term. In simple query mode, all texts in the collection will be searched. The advanced query features include searching by text(s), author(s), publication date and varieties of Portuguese or English. Furthermore, users can specify the query type and, consequently, how the query results are to be presented (see Figure 9.1). The parallel concordance query displays and highlights all occurrences of a specific word or expression in the source language and its translation equivalent in the sentence-level contexts in which they appear. Since both the Portuguese and English texts are part-of-speech tagged, the parallel concordance query can be further refined by using the part-of-speech tags option. Since part-of-speech-annotated corpora allow users to perform more complex and sophisticated searches involving patternbased queries, to take advantage of this feature, users must be somewhat familiar with the corpus query syntax,4 a powerful information retrieval and corpus analysis tool roughly defined in terms of regular expressions consisting of an attribute and its value: [attribute = ‘value’]. For example, if we want to look for three-word Portuguese sequences beginning with any adjective, followed first by a noun and then by an adjective, the query can have the following form, where ‘pos’ stands for any part-ofspeech tag: [pos = ‘ADJ’][pos = ‘N’][pos = ‘ADJ’]. This pattern captures sequences like jovem médica asiática (young Asian doctor), principais nações industriais (top industrial nations) and pequenas tarefas domésticas (humdrum domestic tasks). The COMPARA search interface allows for two types of frequency distribution queries: MM

MM

text-specific frequency search options, which map the frequency of each word form of a given lemma; and corpus-specific frequency search options, which show how the search term is distributed across the texts in the corpus by mapping the search term to textual sources, authors, variety of English or Portuguese and original or translated text. Frequency data for verb tense, person, number and gender are available only for Portuguese.

Moreover, the COMPARA advanced search mode includes a filter option that allows for the specification of alignment constraints. For example, if we want to find all occurrences of the Portuguese word bonito that have been translated as ‘nice’ in the English text collection, this mechanism for narrowing down the search results will exclude the instances in which other translation equivalents, such as ‘beautiful,’ ‘good looking,’ ‘handsome’ and ‘pretty’ occur. In terms of the multilingual corpora that include Portuguese, particular emphasis must be placed on the OPUS project, ‘a growing multilingual corpus of translated open-source documents available on the Internet’ (Tiedemann and Nygaard, 2004, p. 1183). Currently, the OPUS multilingual search interface5 (Tiedemann, 2012) consists of 16 specialized subcorpora drawn from heterogeneous-specific domains: legislative and parliamentary texts (European Constitution and European Parliament

180

Working with Portuguese Corpora

Proceedings), economic and financial documents (European Central Bank), medical texts (European Medicines Agency), technical computer-related texts/manuals (PHP scripting language, OpenOffice software suite) and subtitle collections (OpenSubtitle. org corpus). European Portuguese is included in seven and Brazilian Portuguese in three of the 16 subcorpora. Generating a parallel concordance via the OPUS search interface begins with the selection of a subcorpus and a search language. The resulting query panel allows us to perform simple and more complex searches as well as control the concordance output. If we are interested in a bilingual or multilingual query, then one or multiple target languages must be selected. The hits for the search language and the selected target languages will be displayed in the same concordance results window. As such, OPUS is particularly suited to researchers who wish to carry out contrastive linguistics studies in more than two languages. Advanced search options (e.g., lemma, part of speech, phrase structure information) are currently not available for most of the languages of the subcorpora, including Portuguese. For example, in the medical subcorpus, only three of the 22 languages have been lemmatized and part-of-speech tagged: English, French and Italian. As is the case with COMPARA, the OPUS search facility can be optimized using the corpus query syntax, as shown in Figure 9.2. It is important to note that, in general terms, these two corpus projects follow the same query syntax, but differences occur concerning the attribute and value tagsets6 used for morphosyntactic annotation and lemmatization. To give a simple illustration,

Figure 9.2 The OPUS search interface and query results

The Per-Fide Corpus

181

suppose we want to find occurrences of the lemma ‘take’ (as a support verb) (Gross, 1998), followed by a noun, such as ‘take care’, ‘take offence’, ‘take place’, and ‘take part’. In COMPARA, this can be achieved using the following query: [lema = ‘take’][pos = ‘N.*’]. The same result can be obtained in OPUS using the structure [lem = ‘take’] [pos = ‘NN’]. Note the differences between the attributes lema/lem and the tags used to annotate the grammatical class of the tokens: N/NN for nouns. The OPUS interface offers several options for displaying search results: the vertical parallel concordance output and the horizontal KWIC (KeyWord in Context) concordance output. The former displays the sentence-level contexts in which the search term occurs and the corresponding target language sentences in parallel vertical alignment. The latter shows from 5 to 15 words on either side of the keyword/term and the corresponding target language translations in parallel horizontal alignment. If more context is needed to the left and/or right of the query term, the context option allows the user to specify the size of the context in terms of the number of sentences or paragraphs on either side of the term. As with COMPARA, OPUS users can also specify alignment constraints. One of the main concerns of the OPUS project is to provide the research community with multilingual corpus-based resources and corpus processing tools, such as language-specific taggers and parsers. The parallel corpora prepared in XML Corpus Encoding Standard (XCES) format are freely downloadable from the OPUS project website. This component, which aims to make resources and tools publicly available, is also a key feature of the Per-Fide project. As noted in the introduction, the Per-Fide project grew out of the need to develop significant multilingual corpus-based resources in which the Portuguese language in its different variants plays a pivotal role. The Per-Fide parallel corpora are bidirectional and Portuguese is always either the search or target language in combination with six other languages: Spanish, Russian, French, Italian, German and English. As the resulting language pair combinations (PT ↔ ES/RU/FR/IT/DE/EN) are not commonly found in existing multilingual parallel corpora, interesting contrastive linguistic studies can be performed, including the comparison of lexical, semantic and syntactic patterns between these languages. Another distinctive property of the Per-Fide project is that it covers a wide range of text types: religious texts (main sources: the Vatican, Comboni Missionary Community, Taizé Community), literary texts, official documents and legal texts (JRC-Acquis, EuroParl, EurLex), journalistic texts and technical texts (economics and finance, technology, health and medicine, social sciences, philosophy, tourism, gastronomy). Needless to say, one of the most complicated tasks in the text-selection process is obtaining copyright clearance for both originals and translations. This is particularly true for literary texts.7 The heterogeneous nature of the texts also raises key issues concerning the appropriate document classification scheme and metadata handling. Choosing or developing a classification scheme is not a straightforward matter, due in part to the fact that some texts can be classified as belonging to different categories. For example, more often than not, texts with a spiritual and religious content can be considered borderline cases between fiction and non-fiction/literary and religious texts. Furthermore, it was clear from the outset that the project classification scheme had to be flexible enough to incorporate new text types. Preparatory work undertaken

182

Working with Portuguese Corpora

to develop a project-specific classification system included the study of document classification schemes, such as the Universal Decimal Classification (UDC; McIlwaine, 2000) and the UNESCO Thesaurus.8 Another main goal of the Per-Fide project is to provide each text with a bibliographic record containing not only basic metadata elements (e.g., title, author, editor), but also more specific items, such as language, translator, text type and literary period. For this purpose, we decided to use the metadata encoding scheme developed by the Text Encoding Initiative (TEI) (Sperberg-McQueen and Burnard, 2002). Given that not all project members involved in the text selection and classification process were familiar with TEI, a web interface was designed containing a bibliographic data entry form to generate a TEI header automatically for each text.

3. Corpus processing pipeline In what follows, we describe the design goals that underlie the development and management of the Per-Fide project tools that comprise the Per-Fide corpora pipeline: MM

MM

MM

MM

Automation: making all of the corpora processing tasks as systematic and automatic as possible ensures the rapid processing of new texts and reprocessing of existing corpora when a new tool is developed and incorporated or a bug is fixed. Validation: much time and effort have been dedicated to developing tools that compute metrics. These tools can be used to evaluate, to a certain extent, the quality of the generated resources. Generalization: the development of tools that are general enough to be used in other contexts is an asset-building strategy. Therefore, most of the Per-Fide tools are available to the community through an open-source licence. The overall quality of the tools developed will most certainly benefit from community feedback. Resource sharing: the Per-Fide project provides resources on at least three distinct levels. First, the project’s web interface allows querying and browsing of the corpora collection and other relevant content, such as Probabilistic Translation Dictionaries (PTDs). Secondly, the available corpora, PTDs and other intermediate resources are also available for public download as files in standard formats. Finally, following the semantic web trends, a RESTful Application Programming Interface (API) (Fielding and Taylor, 2002) is also publicly available, providing queries and operations that allow the implementation of third-party tools that can easily be integrated with Per-Fide resources.

Preparing documents to be included in corpora and enriching the corpora obtained involve several different steps that result in a complex network of dependencies and conditional tasks determined by the original format and state of the documents, the type of resource extraction, and the intended use. The corpus pipeline9 can be divided into two main phases: pre-alignment and post-alignment. The pre-alignment phase includes tasks such as text cleaning, normalization and the alignment task itself.

The Per-Fide Corpus

183

Post-alignment tasks include corpora tokenization, segmentation and tagging as well as the generation of derived resources such as the extraction of PTDs and terminology. This phase includes the process of making the corpora available for online querying and download. Both the large amount of documents and the many alternative and conditional tasks involved reduce the feasibility of manual maintenance. The automation of these processes presents several interesting challenges. This section describes the two main phases that comprise the corpus-building process. The tasks are currently in different stages of integration in the workflow, ranging from fully integrated and automated steps to prototypes that are still being tried, tested and tuned.

3.1 Pre-alignment phase The alignment process takes as its input pairs of plain text files. These files must resemble each other very closely in terms of size and structure. If this is not the case, the performance of the alignment tool will decrease.

3.1.1 Cleaning documents Text documents retrieved for the purpose of being automatically processed often present several types of ‘noise,’ such as structural residue (e.g. page numbers, headers and footers, footnotes), text encoding and mark-up syntax (e.g. notation used for sections, paragraphs and sentences), which are obstacles to any further use of the texts in the corpus. This is particularly true for the conversion of documents available in portable document format (PDF) to plain text. For example, mathematical formulae and tables can render parts of the resulting textual document completely illegible. In fact, for some documents, the simple conversion of the document to plain text format can be difficult depending on which tool generated the original PDF file. In order to reduce the noise in these documents and its impact on subsequent processing, documents are pre-processed with the Text-Per-Fide-BookCleaner, a tool designed to clean and normalize unwanted elements (Santos and Almeida, 2011). This tool makes use of an ontology of document structure elements to detect and remove unwanted parts.

3.1.2 Finding pairs of documents One of the challenges of preparing files for alignment is finding, within a large collection, pairs of files to be aligned – that is, pairs of files where each is a translation of the other in a distinct language. Depending on the origin of the candidate documents, it is sometimes possible to extract information for the pairing of the documents from their names or universal resource locators (URLs). This typically happens when the document pairs are retrieved from the same source, like a website, using crawling mechanisms (Almeida et al., 2002). However, when the documents come from a variety of distinct sources, different naming conventions might have been used, which means that methods for finding candidate pairs for alignment must

184

Working with Portuguese Corpora

rely on the contents of the documents. This is particularly the case with literary texts, where the translations in the different languages are obtained from different sources. A method for solving this problem consists in comparing documents based on language-independent elements (LIEs) (Santos, 2011). Examples of such elements are year references (e.g., 1755) and proper names (e.g., Hamlet). The set of LIEs in every file is extracted and the sets compared to each other. Files presenting a high percentage of LIEs in common are proposed as candidate pairs for alignment. This approach is similar to that used for the word alignment of parallel corpora (Tiedemann, 2003). Although this method will help organize and detect document pairs from the set of documents collected, some prior organizational efforts will greatly contribute to achieving document pairing. Therefore, the collected documents were added into a collaborative repository (SVN) and stored in a hierarchical tree of directories, making the task of detecting document pairs easier.

3.1.3 Synchronizing documents Text-alignment tools are very sensitive to differences in the documents resulting from the insertion and/or deletion of text and the inversion of the order of paragraphs or sentences. It is quite common to see entire sections of books, such as biographical notes and prefaces to a given edition, absent in the translations. Such cases often render the entire alignment unusable: once the aligner tool desynchronizes, it is very difficult to synchronize later on in the alignment process. Document synchronization, then, is a process of aligning two documents at the section level (Santos et al., 2012). This process can be used either to define hard anchor points (which the aligner can use for synchronization purposes) or to split the original pair of documents into smaller parallel chunks. It also allows for the identification of non-matching sections that can be removed later. Another alternative for defining anchor points consists in the use of bilingual dictionaries or similar resources, like the Unambiguous-Concept Translation Sets (UCTSs) (Santos et al., 2012). As will become clear later on, the advantage of this approach is that users do not need to identify the document sections, only a set of synchronization points. The alignment itself is carried out using the easy-align tool that is part of the Open Corpus Workbench package (Evert and Hardie, 2011). The Per-Fide workgroup is currently analysing the alignment quality of another aligner, namely HunAlign (Varga et al., 2005).

3.2 Post-alignment phase Once the texts have been aligned, they can be given as input to the next workflow component available in the Per-Fide environment. The alignment process typically culminates in the generation of a parallel corpus, encoded using the Translation Memory Interchange (TMX) format. An example of two alignment entries in the TMX file follows. Note that neither the document header nor footer is shown. The extract in Figure 9.3 is taken from a corpus composed of free software internationalization messages.

The Per-Fide Corpus

185

Figure 9.3 Pair of translation units in TMX format from a corpus of free software internationalization messages As soon as these TMX files are added to the Per-Fide corpora collection, an automatic procedure defines the processing operations needed to build the commonly available resources. These instructions are stored in a standard Unix Makefile.10 There are two main advantages for adopting this technology: (a) Many operations described in the Makefile are computationally intensive and can take several days to compute without parallelization. Thus, one of the goals is to minimize the number of operations to be performed, making sure that an operation is only executed when required, either because the resource is not available, or a resource has changed and dependent resources need to be updated. Makefiles allow for the specification of dependencies in chains and the built-in rules verify whether a resource (file) really needs to be calculated. (b) It is possible to take advantage of many other built-in features of Makefiles. One of these features is used to parallelize long operation sequences. This means that many computations can be performed simultaneously to save time. For example, one might divide the corpus into smaller chunks and annotate each chunk using a parallel process, thus making the full annotation process faster, much like the well-known MapReduce technique (Lin and Dyer, 2010). Another major concern is that all the operations described need to be language-aware. The computed resources need to be created for every language and every language pair available in the TMX file. In practice, this is achieved by using a tool that manages all available resources and Makefiles. It also provides a web interface for the administration of the project environment. Moreover, it is general enough to be used in other contexts, outside the scope of this project. To implement this tool it was necessary to

186

Working with Portuguese Corpora

devise a set of documents, specifically a corpus manifesto, that lists, for each corpus, all the files related to it.

3.2.1 Segmentation and tokenization Segmentation is the task of dividing text into sentences (or other segments), whereas tokenization is the task of dividing sentences into tokens. Although these tasks have a precision level of 99 per cent on most tools, they are not straightforward processes. In Per-Fide, we decided to use a natural language processing library, named FreeLing (Padró, 2011), to perform segmentation and tokenization operations. Although FreeLing does not currently support all of the languages in the Per-Fide corpora, its syntax for defining new segmentation and tokenization tools is quite simple and extensible. Once TMX files with the desired languages are available, tokenization can take place. This step is carried out earlier in the workflow as many subsequent operations can take advantage of tokenization. The tool developed for the tokenization of TMX files is aware of the TMX annotation and applies the correct tokenization module for that language.

3.2.2 Part-of-speech tagging Corpora become particularly valuable resources when annotated (or tagged) with part-of-speech information. Here, again, the NLP library FreeLing was used. It includes two different part-of-speech (PoS) tagging approaches that can be used for this task: a method based on Hidden Markov Models (tagger training based on sequences of annotations and the prediction of the more probable annotation that follows, based on the possible PoS chosen by the morphological analyser) and another based on the Relax Algorithm (relaxation labelling is a family of energy-functionminimizing algorithms that change the labelling in accordance with a set of constraint rules; Manning and Schütze, 1999). Unfortunately, FreeLing does not include data for both methods for all supported languages, therefore the method was chosen depending on the language and the tagging data available. When both methods are available for a specific language, the Hidden Markov Model was chosen. FreeLing is also able to detect and tag different kinds of constructions, like proper names and locutions (multi-word terms). Finally, a Chart Parser is also available, allowing for the annotation of tree structure information, but this kind of annotation has not yet been included in our corpora. Currently, tools are being developed for the annotation of the TMX files. This is being done in such a way as to allow the annotated files to be encoded later in the IMS Corpus Workbench (see below).

3.2.3 Probabilistic Translation Dictionaries Probabilistic Translation Dictionaries (PTDs) are translation dictionaries that map words from one language to a set of possible translations. Each of these translations has a probability value. Further details on how PTDs are computed can be found in Simões and Almeida (2003). A PTD entry example extracted from a Portuguese– English dictionary is shown in Figure 9.4.

The Per-Fide Corpus

187

Figure 9.4 A Probabilistic Translation Dictionary (PTD) entry In the corpus pipeline, PTDs are computed for every language pair available. In addition to computing one PTD for every language pair, an accumulated PTD is also compiled that aggregates all the PTDs calculated, in order to improve both the coverage and the quality of the dictionary. The different PTDs can then be browsed using the Internet interface, or the files can be downloaded for offline processing.

3.2.4 IMS Corpus WorkBench indexing The IMS Corpus WorkBench is undoubtedly the most widely used tool for managing and indexing corpora for fast querying. For each TMX file, we create a monolingual corpus for every available language, as well as a parallel corpus for each language combination. This step is crucial for the web interface to be able to rapidly query the available corpora. When a corpus is queried, either through the web interface (by humans using a browser) or the web service (more oriented to machine–machine communication), all the resources required to answer the query have already been computed. This greatly improves the query response time, and even queries that return many millions of hits are displayed almost immediately.

3.2.5 Unambiguous-concept translation sets The translation of certain terms or lexical units does not pose ambiguity problems; in principle, they are always translated the same way. We call these ‘unambiguous concepts.’ Examples include: MM MM MM MM

MM

proper names (Londonen ~ Londrespt ; Oportoen, Portoen ~ Portopt); technical terminology (fileen ~ ficheiropt ; folderen, directoryen ~ pastapt, directoriapt); possible synonyms (wolframen, tungstenen ~ volfrâmiopt); morphological agreement constraints that need to be kept (Israelpt ~ Израильru (nominative or accusative case), Израилемru (instrumental case), Израиляru (genitive case), Израилюru (dative case)); and months, seasons, weekdays, numerals, cardinals, etc.

Unambiguous-concept translation sets (UCTSs) can be exported/extracted from resources, produced manually, or extracted automatically from PTD files. They can be used for such tasks as partial synchronization or alignment as well as the assessment of the alignment process (in the translation sector, terminology is also used for quality assurance).

188

Working with Portuguese Corpora

3.2.6 Bi-word sets A bi-word set (BWS) is a collection of strongly related word pairs. Each bi-word ‘wordL1, wordL2’ tells us that wordL2 is a possible translation of wordL1 (although there may be other translation candidates). The main difference between UCTSs and BWSs is the notion of (un)ambiguity. In BWSs: MM MM MM

terms can be ambiguous; relations are unidirectional; and one term can appear in more than one entry.

In UCTSs, on the other hand: MM

MM MM

lexical units are expected to be translated by a term belonging to a small set of well-defined concepts; each term in a UCTS list can only be found in exactly one UCTS; and relationships are bidirectional.

BWSs include pairs of words whose relation is not so strong as with UCTSs: resten = descansopt, resten = descansarpt, resten = repousopt, resten = pausapt, pauseen = pausapt, breaken = pausapt. UCTS and BWS can be used for the analysis of translation (or alignment) quality. Consider, for example, the UCTS that defines the translation of ‘Oporto’ as Porto. If, after extracting a translation dictionary from a corpus, Porto is missing from the translations of ‘Oporto’, it can be deduced that something went wrong in the alignment process. BWSs are more useful during the alignment process. They can be used as clues (soft anchor points) for the aligner tool. See, for instance, Tiedemann (2003) for a discussion on clue-based alignment procedures.

3.2.7 Evaluation, metrics and quality The automatic evaluation of the alignment process and the derived resources is particularly challenging when dealing with large amounts of files and data as it is difficult to identify what metrics to use to infer their quality. Nonetheless, some elements can be measured and provide clues to evaluate the quality of the resource. One of these, which refers to the evaluation of TMX files, is the metric known as the percentage of 1:1 correspondences (i.e. a single sentence aligned with one single sentence) versus other types of correspondences. Although non-1:1 correspondences can occur in correctly aligned texts, a high percentage of these usually indicates a low-quality alignment. Another method of evaluating alignments is to check for the presence of UCTSs in translation units, as illustrated in the previous section. If a term from a UCTS appears in one language segment from a translation unit then one of the accepted translations should appear in the aligned language segment. If not, it is highly probable that the translation unit has been translated incorrectly. Even without the same level of confidence as with UCTSs, this same kind of approach can be performed using BWSs or PTDs.

The Per-Fide Corpus

189

3.2.8 Using resources to improve our tools As a corollary to the previously mentioned design goals concerning generalization and resource sharing, our tools are often implemented as clients of each other – namely, resources generated by some tools can be used by other tools in order to improve results. For example, although the UCTS extraction requires corpora alignment, the alignment can take advantage of extracted UCTSs for a better alignment quality. Therefore, one can align a corpus, extract UCTSs and use them to re-align the same corpus. The UCTSs generated from PTDs can be used in the above-mentioned process of document synchronization to split the texts before the alignment as well as to synchronize the alignment tools.

3.2.9 Query interface Once all of the resources have been calculated and made available, they can be immediately queried by any user. This can be done using the Per-Fide query Internet interface.11 The project environment also provides a web service that applies a set of operations, via an easy-to-use RESTful public interface (API), to available resources that can be incorporated into other tools to build more complex applications. This web service provides a set of well-defined online operations for querying project resources. The query results are provided in a set of well-defined XML schemas. This is a useful component for other tools that want to take advantage of the resources described and keep up to date with the new resources being built and/or updated. In addition to all of these query options, all of the linguistic resources are available for download, either for offline operations or for building new resources.

4. Applications of the Per-Fide corpus in cross-linguistic research There is an increasing research interest in corpora and their applications and the potential for development in this area is immense. As stated by Granger (2010, p. 7), any field that relies on the analysis of two or more languages can benefit from corpus-based cross-linguistic research. In addition to the undeniable utility of parallel concordances in translation studies, bilingual lexicography and machine translation (Granger et al., 2007), their potential for second-language learning is enormous and there is a significant body of literature demonstrating how language learners can benefit from their use (e.g., Aston, 2001; Granger, 2003; Sinclair, 2004; FrankenbergGarcia, 2005). Students at the beginner and intermediate levels are still very dependent on dictionaries. The use of concordances as a tool for language learning at the beginner’s level can be motivating and rewarding for learners because this tool can provide contextualized examples that encourage such students to explore the meanings and uses of words in authentic contexts (cf. St. John, 2001). Portuguese learners of French who look up démarche in a Portuguese–French bilingual dictionary, for example, will

190

Working with Portuguese Corpora

encounter several possible translations for the word (modo de andar, passo, atitude, comportamento, modo de pensar, procedimento, diligências, trâmites, etc.) and might find it difficult to choose which term to use. If they look up démarche in the L1–L2 direction of a Portuguese–French parallel corpus like Per-Fide, they will not only be able to see different ways in which démarche has been rendered in Portuguese, but also the different contexts in which each term was used (see Examples 1a through 7b). (1a) […] la communauté internationale est invitée à suivre cette . (1b) […] a comunidade internacional é convidada a seguir esta . (2a) Les analyses prospectives et socio-économiques représentent une partie importante de la . (2b) Uma parte importante da é constituída por análises prospectivas e socioeconómicas. (3a) Une telle se heurterait cependant à deux inconvénients majeurs. (3b) Um deste tipo levanta, contudo, dois inconvenientes significativos. (4a) La Région flamande aurait suivi la même et fixé ses propres objectifs de qualité. (4b) a Região da Flandres teria feito a mesma e fixado os seus próprios objectivos de qualidade. (5a) Il est important que la Cour adopte une cohérente pour décider d’exercer ou non sa compétence. (5b) É importante que o Tribunal de Justiça adopte uma coerente quando decidir se deve ou não considerar-se competente. (6a) […] la de la Commission n’appelle aucune réserve […] (6b) […] a da Comissão não levanta quaisquer reservas […] (7a) Une anormale et des chutes ont été des événements indésirables très fréquemment rapportés avec olanzapine. (7b) Os efeitos adversos muito frequentes associados com o uso da olanzapina neste grupo de doentes, foram perturbações na e quedas. It is clear that, used as a complement to (or instead of) monolingual or bilingual dictionaries, parallel concordances can help learners understand foreign words they do not know as well as the contexts in which the words are appropriate (FrankenbergGarcia, 2005, p. 191). For the verb implementar, the dictionary Infopédia12 offers the following equivalents in French: accomplir, exécuter and implémenter. The bilingual concordance depicted below, which we extracted from the Per-Fide corpus, illustrates the different possibilities of translating the Portuguese collocation (cf. Iriarte Sanromán, 2001; Grossmann and Tutin, 2003) implementar medidas into French. Note that none of the French verbs proposed by Infopédia appear in the bilingual concordance (see Examples 8a through 13b). (8a) Existem procedimentos para desenvolver e de controlo de riscos. (8b) Il existe des procédures pour élaborer et de maîtrise des risques.

The Per-Fide Corpus

191

(9a) […] os Estados-Membros deverão eficazes de acompanhamento e controlo. (9b) […] les États membres devraient efficaces de suivi et de contrôle. (10a) Insiste-se aqui na necessidade de destinadas a preservar a quantidade dos recursos naturais. (10b) On insiste ici sur la nécessité de destinées à préserver la quantité des ressources naturelles. (11a) […] a Opel Nederland considerou necessário de neutralização em Outubro e em Dezembro de 1996. (11b) […] Opel Nederland a jugé utile de correctives en octobre et en décembre 1996. (12a) Se se tornar necessário fiscais para alcançar os objectivos acordados, então, em minha opinião, esta via terá fracassado. (12b) S’il devait s’avérer nécessaire d’ fiscales pour atteindre les objectifs convenus, cette voie serait alors pour moi un échec. (13a) Cada parte tem o direito de adoptar e mais rigorosas do que as enunciadas nas disposições da presente convenção. (13b) Chacune des parties contractantes a le droit d’adopter et d’ plus rigoureuses que celles qui sont énoncées dans la présente convention. Whereas no single case of the French verb implémenter followed by the phrase des mesures was found in the Per-Fide corpus, the English translation of the Portuguese locution implementar medidas is almost exclusively translated as the combination of the verb ‘implement’ and the noun ‘measures’ (see Examples 14a through 15b). (14a) Insiste-se aqui na necessidade de destinadas a preservar a quantidade dos recursos naturais. (14b) The emphasis here is on the necessity of to preserve the quantity of natural resources. (15a) Cada parte tem o direito de adoptar e mais rigorosas do que as enunciadas nas disposições da presente convenção. (15b) Each Contracting Party has the right to adopt and being more stringent than those resulting from the provisions of this Convention. The fact that parallel concordances provide not only linguistic equivalents, but also the contexts in which different terms are equivalent (prendre / introduire / appliquer / mettre en œuvre / mettre en place … des mesures), can help learners decide which term is appropriate in a specific context. This tool can be especially helpful when learners or translators have to deal with idiomatic expressions for which there are no simple, direct translations available in their mother tongue. The following examples, taken from the Per-Fide corpus with French as the source language, demonstrate that – when confronted with an idiomatic phrase such as couper les cheveux en quatre (to split hairs) – translators can draw upon a number

192

Working with Portuguese Corpora

of different translation alternatives which, in this case, might be more or less synonymous (see Examples 16a through 19b). (16a) Les chercheurs sont des gestionnaires, des ingénieurs, des collectionneurs, des , ou des artistes. (16b) Os investigadores podem ser gestores, engenheiros, coleccionadores, , fantasiadores ou artistas. (17a) Monsieur le Président, il ne faut pas , comme disent les Français. (17b) Senhor Presidente, não devemos , como dizem os franceses. (18a) Au Conseil de ministres, je dirai: arrêtez de . (18b) Ao Conselho de Ministros direi: . (19a) C’est essentiel afin que le Parlement européen ne devienne pas un lieu où l’on mais un lieu que la Commission et le Conseil prennent au sérieux. (19b) É essencial para que o Parlamento Europeu não seja um local de e para que a Comissão e o Conselho o levem a sério. These examples clearly show that idioms are ‘one of the most relevant manifestations of the creative potential of any language, as evidenced by the richness of their images, the originality of their metaphors as well as the variety and malleability of their structure’ (Alvarez, 2007, p. 160). Indeed, it is not always easy to grasp the metaphorical nuances of such phrases, as demonstrated in Examples (20a) and (20b), in which the proposed translation was too literal and therefore unable to render the meaning of the original structure. (20a) Le commissaire a évoqué la nécessité d’un mouvement proactif dans le sens de la prestation de services, mais nous n’allons pas commencer à ; au contraire, nous devons faire quelque chose de productif. (20b) O Senhor Comissário falou da necessidade de medidas proactivas dirigidas à prestação de serviços, mas não podemos dedicar-nos todos ; pelo contrário, temos de produzir algo também. Idioms are unquestionably a critical area of languages in general, inasmuch as their global meaning cannot be apprehended by aggregating the individual meaning of each constituent. When combined, isolated lexemes generate new meaning (Mejri, 1997), which results in the construction of syntagmatic structures whose figurative value refers to a specific reality with a particular meaning. Thus, it comes as no surprise that the translation of such structures should pose such a multitude of challenges. Since Google Translate is one of the most widely used translation resources, being used on a fairly regular basis by translators, we decided to test the quality of the translation of idioms provided by this machine translation system. As can be seen in Examples (21a) through (21e), the French idiom couper les cheveux en quatre (to split hairs) has been incorrectly translated into Portuguese, Spanish, English and German (besides the incorrect translation of the expression couper les cheveux en quatre, Example (21e) presents an incorrect translation of the phrase comme disent les Français).

The Per-Fide Corpus

193

(21a) French: Monsieur le Président, il ne faut pas , comme disent les Français. (21b) Portuguese: Sr. Presidente, , como dizem os franceses. (21c) Spanish: Señor Presidente, , como dicen los franceses. (21d) English: Mr. President, , as the French say. (21e) German: Herr Präsident, , wie die Französisch Wort. Google Translate returns literal translations of the lexical units that make up the French idiom. In translating the idiom poser un lapin (to stand someone up), Google Translate does not provide a literal translation of the elements that comprise the French expression (poser un lapin à quelqu’un, literally ‘to put a rabbit to someone’) in any of the four target languages (see Examples 22a through 22e), but the results leave much to be desired. Note that the surrounding context does not point the system toward a more appropriate translation, as would be expected. (22a) French: Je l’ai attendue tout l’après-midi. Elle n’est pas venue à notre rendez-vous. Elle m’. (22b) Portuguese: Esperei toda a tarde. Ela não veio ao nosso encontro. Ela . (22c) Spanish: Esperé toda la tarde. Ella no vino a nuestro encuentro. . (22d) English: I waited all afternoon. She did not come to our rendezvous. She . (22e) German: Ich wartete den ganzen Nachmittag. Sie wollte nicht zu unserem Treffpunkt kommen. Sie . Thus, it becomes clear that the ability to query corpora is, beyond any doubt, a feature that needs to be introduced in training courses for translators and other professionals in related fields. The Per-Fide search interface allows for simultaneous L1 and L2 queries. This kind of search can be useful, on the one hand, for determining if a word in L1 corresponds to another word in L2, such as if casa (house) can be translated as ‘box’ and, if so, in which contexts; on the other hand, it can be used to identify the various equivalent terms of the word in L2, such as the word casa corresponding to ‘box’, ‘home’, ‘house’, ‘place’, ‘section’, etc. In order to narrow down the search in L1 to a particular concept (e.g., casa referring to a type of housing facility), we can choose corresponding terms, such as ‘home’ or ‘house’ in L2, and all the occurrences featuring these last two terms will be retrieved (PT: casa – EN: home / house; see Table 9.1). The search query involving the word casa can be refined if we wish, for example, to determine lexical patterns in which the word casa is followed by the preposition de. The Per-Fide query interface will return the instances of casa de as shown in Table 9.2. The results of the simple search casa and the more refined search casa de enable us to see that the Portuguese multi-word lexical unit casa de férias has two different English translation equivalents: ‘holiday home’ and ‘resort home’. It would be an interesting task to use the Per-Fide search facilities to look into the semantic nuances of the English units, which is beyond the scope of this chapter.

194

Working with Portuguese Corpora

Table 9.1 Search for casa Portuguese

English

casa casa particular casa natal casa de morada de família casa de férias […] casa casa das máquinas casa familiar-tipo casa provincial casa do clero casa de leilões […]

home private home native home family home / matrimonial home holiday home […] house wheel house typical family house provincial house priest’s house auction house […]

Table 9.2 Occurrences of casa de in the Per-Fide corpus Portuguese

English

casa de fim-de-semana casa de férias casa de hóspedes casa de acolhimento […]

weekend-house resort-home boarding house guest house […]

When searching for multi-word lexical units containing words connected by prepositions, it is possible to specify which prepositions we wish to include in the search as well as their various contracted forms, such as máquina (‘machine’) followed by the preposition de (of) or the contractions do, da, dos and das (‘of ’ plus a definite article). The following formulae can be used to perform this search: (a) máquina d. (b) máquina (de|da|do|dos|das) In search query (a), the period operator (.) matches any single character. Therefore, only occurrences that feature the word máquina followed either by the preposition de or the singular contracted forms da or do will be retrieved. In query (b), the disjunction operator (|) expresses an alternative. In this case, instances of the word máquina followed by all the specified singular and plural contracted forms will be matched. Both queries (a) and (b) will only retrieve occurrences that include the word máquina followed by the preposition de and the specified contracted forms (see Table 9.3). Regardless of the fact that the Per-Fide corpus is still under construction and we are currently working on the part-of-speech tagging, it is already possible, as demonstrated

The Per-Fide Corpus

195

Table 9.3 Search results for occurrences with máquina (de | da | das | do | dos) Portuguese

English

French

máquina de escrever máquina de lavar loiça máquina de lavar roupa (para uso doméstico) máquina de barbear eléctrica sem cabeça máquina de depilar com motor eléctrico incorporado máquina de escolha de notas com retalhadora integrada […]

typewriter dishwasher (household) washing machine electric shaver with the head removed hair-removing appliance with self-contained electric motor banknote sorting machine with an integrated shredder […]

machine à écrire lave-vaisselle lave-linge (ménager) rasoir électrique sans tête appareil à épiler à moteur électrique incorporé machine de tri équipée d’un broyeur intégré […]

in the previous examples, to run simple single-word or multi-word search queries. Moreover, the corpus offers a supplementary resource for corpus querying: PTDs. By generating a PTD, the query system provides a paradigmatic family of functional equivalents along with the respective percentage of direct correspondence between the source term and various possible target terms within the selected corpus (see the following example of PTDs in the EuroParl corpus) and in all the corpora included in Per-Fide (see Figure 9.5).

Figure 9.5 Micro- and Mega-PTD for the word ‘frame’

196

Working with Portuguese Corpora

The option ‘ptd’ was selected to initiate the search for the English word ‘frame’, which, as can be seen in Figure 9.5, has several equivalents in Portuguese. The results obtained with this option allow users to visualize the translation alternatives of the source term in different contexts, which they access by clicking on the arrow located alongside each term listed in the PTD. PTDs can be used to compile terminological lists, which is potentially useful for the production of bilingual glossaries or dictionaries. It goes without saying that the list of terms generated by the PTD becomes highly valuable to translators if enhanced with the study of the term in context. In other words, the work of the translator can benefit significantly from switching between the isolated terms given by the PTD and their contextualized bilingual concordance. Users can begin by choosing the bilingual concordance without even looking at the PTD (as we saw in the simple query of idioms). In the bilingual concordance, users must skim through all of the occurrences, which might reach the hundreds, in order to identify the term that best suits the context of the word they wish to translate. Thus, users who activate the PTD as a query option can increase the efficacy and output efficiency of their work by immediately circumscribing the range of alternatives at their disposal to translate a given expression. Hence, it is essential to expand the corpus both qualitatively (text types) and quantitatively (amount of bi-texts) so that the terms listed in the PTD are representative of as many usage contexts as possible. Mega-PTDs aim to complement the data supplied by the PTD of a single corpus (micro-PTD) and can encompass contexts that the micro-PTD was unable to retrieve. In the Mega-PTD, it is possible to depart from an L1 to an L2 and revert to L1. For example, one of the equivalents of ‘frame’ is the term quadro; if we click on the latter,

Figure 9.6 Micro- and Mega-PTD of quadro

The Per-Fide Corpus

197

Table 9.4 Micro-PTD of casa in different subcorpora of the Per-Fide corpus Comboni

Vatican

DGT

ECB

house 68.02% home 9.13%

house 52.36% home 29.67%

box 65.55%

mint 52.23%

we will view the respective micro-PTD in the source language (L1), where we can once again have access to its translation equivalents and their contexts, as can be seen in Figure 9.6. It thus becomes clear that PTDs, by allowing for the alternation between L1 and L2, have a cyclical nature that can be automatically activated either through the micro- or mega-PTD by clicking on the four centrifugal arrows. It is interesting to note in Table 9.4 that the results obtained with the micro-PTDs might diverge depending on the type of corpus under analysis. The functional equivalents of casa (‘house’ / ‘home’) become more specific when our query involves more technical corpora linked to the financial and economic field: Casa might not correspond to ‘home’ or ‘house’ and might be translated, for example, as ‘mint’ (Casa da Moeda) in the European Central Bank (ECB) corpus or as ‘box’ in the European Commission Directorate-General for Translation (DGT) corpus. This clearly illustrates the importance of building corpora based on a diversified text typology, as can be seen from the fact that the level of the technicality of terms gradually increases in some types of corpora featuring specialized language.

5. Concluding remarks The Per-Fide corpus sets itself apart from other corpora mainly due to the number of languages involved and the central role played by Portuguese. Furthermore, it will be made freely available to the research community for searching and downloading, along with the terminological and lexicographic material produced in the context of this project. As observed by Kraif (2006, p. 15), both alignment and bilingual concordance tools still remain largely underexplored. Indeed, some of their features, such as the automatic extraction of bilingual lexicons (PTDs) or the query process based on morphosyntactic annotation and lemmatization, are unknown to many students, linguists and translators. The Per-Fide Project has organized a series of workshops demonstrating the potential of these resources and tools in different research domains. Our mission has been to help different target groups work efficiently with corpus-based instruments and apply corpus querying methods in their research and professional activities.

198

Working with Portuguese Corpora

Acknowledgements The Per-Fide Project is supported in part by a grant (Reference No. PTDC / CLELLI / 108948 / 2008) from the Portuguese Foundation for Science and Technology, and it is co-funded by the European Regional Development Fund. We would like to thank all contributing authors, translators, publishers, and institutions for their generosity in allowing us to include their texts in the Per-Fide corpus.

References Almeida, J. J., Simões, A. and Castro, J. A. (2002), ‘Grabbing parallel corpora from the web’. Procesamiento del Lenguaje Natural, 29, 13–20. Alvarez, M. L. O. (2007), ‘As expressões idiomáticas nas aulas de ELE: Um bicho de sete cabeças?’, in I. González Rey (ed.), Les Expressions Figées en Didactique des Langues Étrangères. Fernelmont: Proximités E.M.E, pp. 159–79. Aston, G. (ed.) (2001), Learning with Corpora. Houston: Athelstan. Evert, S. and Hardie, A. (2011), ‘Twenty-first century Corpus Workbench: Updating a query architecture for the New Millennium’. Paper presented at Corpus Linguistics 2011, University of Birmingham, UK. Fielding, R. T. and Taylor, R. N. (2002), ‘Principled design of the modern Web architecture’. ACM Transactions on Internet Technology, 2 (2), 115–50. Frankenberg-Garcia, A. (2005), ‘Pedagogical uses of monolingual and parallel concordances’. ELT Journal, 59, (3), 189–98. Frankenberg-Garcia, A. and Santos, D. (2002), ‘COMPARA, um corpus paralelo de português e de inglês na Web’. Cadernos de Tradução, 9 (1), 61–79. Granger, S. (2003), ‘The International Corpus of Learner English: A new resource for foreign language learning and teaching and second language acquisition research’, TESOL Quarterly, 37, (3), 538–46. —(2010), ‘Comparable and translation corpora in cross-linguistic research. Design, analysis and applications’. Journal of Shanghai Jiaotong University, 2, 14–21. Granger S., Lerot J. and Petch-Tyson S. (eds) (2007), Corpus-based Approaches to Contrastive Linguistics and Translation Studies. Beijing: Foreign Language Teaching and Research Press. Gross, M. (1998), ‘La fonction sémantique des verbes supports’. Travaux de Linguistique, 37, 25–46. Grossmann, F. and Tutin, A. (2003), Les Collocations: Analyse et traitement. Amsterdam: De Werelt. Iriarte Sanromán, Á. (2001), A Unidade Lexicográfica. Palavras, Colocações, Frasemas, Pragmatemas. Braga: University of Minho. Kraif, O. (2006), ‘Qu’attendre de l’alignement de corpus multilingues?’, in Revue Traduire, 4e Journée de la Traduction Professionnelle, 210, 17–37. Lin, J. and Dyer, C. (2010), Data-Intensive Text Processing With MapReduce. San Rafael, CA: Morgan & Claypool Publishers. Manning, C. and Schütze, H. (1999), Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. McIlwaine, I. (2000), The Universal Decimal Classification: A Guide to its Use. The Hague: UDC Consortium.

The Per-Fide Corpus

199

Mejri, S. (1997), Le Figement Lexical. Descriptions Linguistiques et Structuration Sémantique. Tunis: Publications de la Faculté des lettres Manouba. Padró, L. (2011), ‘Analizadores Multilingües en FreeLing’. Linguamática, 3 (2), 13–20. Santos, A. (2011), Contributions for Building a Corpora-Flow System. Master’s thesis, University of Minho. —(2002), ‘DISPARA, a system for distributing parallel corpora on the Web’, in N. Mamede and E. Ranchhod (eds), Advances in Natural Language Processing (PorTAL 2002), Berlin/Heidelberg: Springer-Verlag, pp. 209–18. Santos, A. and Almeida, J. J. (2011), ‘Text::Perfide::BookCleaner, a Perl module to clean plain text books’, Paper presented at 27th Conference of the Spanish Society for Natural Language Processing (SEPLN 2011), University of Huelva, Spain. Santos, A., Almeida, J. J. and Carvalho, N. (2012), ‘Structural alignment of plain text books’, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). CD-ROM. Simões, A. M. and Almeida, J. J. (2003), ‘NATools – a statistical word aligner workbench’. Procesamiento del Lenguaje Natural, 31, 217–24. Sinclair, J. (ed.) (2004), How to Use Corpora in Language Teaching. Amsterdam: John Benjamins. Sperberg-McQueen, C. M. and Burnard, L. (eds) (2002), Guidelines for Text Encoding and Interchange. Oxford: University of Oxford, Humanities Computing Unit. St John, E. (2001), ‘A case for using a parallel corpus and concordancer for beginners of a foreign language’. Language Learning & Technology, 5, (3), 185–203. Tiedemann, J. (2003), ‘Combining Clues for Word Alignment’, in Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pp. 339–46. —(2012), ‘Parallel Data, Tools and Interfaces in OPUS’, in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012). CD-ROM Tiedemann, J. and Nygaard, L. (2004),‘The OPUS corpus – parallel and free’, in Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). CD-ROM. Varga, D., Németh, L., Halácsy, P., Kornai, A., Trón, V. and Nagy, V. (2005), ‘Parallel corpora for medium density languages’, in Proceedings of the RANLP 2005, pp. 590–6.

Notes 1 Although it is beyond the scope of this article to provide an exhaustive list of the parallel corpora which include a Portuguese subcorpus, mention should be made of the following projects. (i) The Oslo Multilingual Corpus (OMC) consists of an English–Norwegian–Portuguese translation subcorpus containing 15 English original fiction texts and their translations into Norwegian and Portuguese. Due to copyright reasons, access to the OMC is only available to researchers and graduate students at the universities in Oslo and Bergen. For details of the OMC, see http://www.hf.uio. no/ilos/english/services/omc/ (ii) The Linguistic Corpus of the University of Vigo (CLUVI) includes four small parallel subcorpora of Portuguese as source or target language: TURIGAL, a 1.3-million-word corpus of Portuguese–English tourism texts; the 900,000 word corpus of English–Portuguese literary texts; PALOP, a 600,000

200

Working with Portuguese Corpora

word corpus of Portuguese–Spanish postcolonial literature; and PEGA, a 70,000 word corpus of Portuguese–Galician literary texts. For further information on CLUVI, see http://sli.uvigo.es/CLUVI/index_en.html 2 For further information on the activities being conducted and the resources made available by the Linguateca, see http://www.linguateca.pt/ 3 http://www.linguateca.pt/COMPARA/ 4 In order to realize the full potential of electronic corpora, most of today’s linguists depend on the availability of specialized software tools. The IMS Corpus Workbench (CWB: http://cwb.sourceforge.net/) is a widely used architecture for corpus analysis, originally designed at the IMS, University of Stuttgart. The central component of the Corpus Workbench (cf. Evert and Hardie, 2011) is the corpus query processor CQP. Its query language allows sophisticated searches both for individual words and lexicogrammatical patterns. 5 http://opus.lingfil.uu.se/bin/opuscqp.pl 6 For a detailed account of the attribute and value tagsets used for morphosyntactic annotation and lemmatization in COMPARA and OPUS, see the handout from the workshop Como pesquisar em corpora (How to query corpora) in the scope of the I International Per-Fide Conference on Corpora and Translation: http://per-fide.ilch. uminho.pt/site.pl/workshop.pt 7 For a regularly updated list of the collaborators and the texts included in the Per-Fide corpora, see http://per-fide.ilch.uminho.pt/ 8 For a detailed account of the UNESCO Thesaurus, see the UNESCO website http:// www2.ulcc.ac.uk/unesco/ 9 In computer science, a pipeline usually refers to a sequence of operations where data are handled from one operation to the next – in this specific context, operations over corpora. 10 A simple file to describe how compilation or other operations over files are executed. 11 http://perfide.ilch.uminho.pt/query 12 Porto Editora is the leading educational publisher in Portugal, specializing in educational manuals, dictionaries, and multimedia products both online and offline. The lexicographic resources are available from the service Infopédia: (http://www. infopedia.pt/).

10

The CoMET Project: Corpora for Teaching and Translation Stella E. O. Tagnin

1. Introduction Computer corpora have now been in use for linguistic research for more than 30 years, but freely available corpora in languages other than English are still scarce. Comparable and parallel corpora are even fewer, especially for the English– Portuguese pair. To the best of our knowledge, aside from CoMET, the only other parallel English–Portuguese corpus available that includes the Brazilian Portuguese variant is COMPARA (Frankenberg-Garcia and Santos, 2002, 2003; see FrankenbergGarcia’s and Santos’s chapters in this volume), which consists of extracts from literary texts. Although it has ceased to be updated, it is still available online. This chapter will focus on two Portuguese–English corpora that are part of a project developed at the University of São Paulo (Brazil) – namely, the CoMET Project (Multilingual Corpus for Teaching and Translation). The first corpus is CorTec, a technical comparable corpus, while the second one is CorTrad, a parallel corpus. Each corpus will be described in detail with various examples of usage.

2. The CoMET Project The CoMET Project (Multilingual Corpora for Teaching and Translation) (Tagnin, 2010) was created in 1998, but the corpora were made available online only in 2005 (www.fflch.usp.br/dlm/comet). The project consists of three corpora: a technical corpus (CorTec), a translation corpus (CorTrad) and a learner corpus (CoMAprend). As the last one does not involve Portuguese, it will not be addressed here. CorTec is a comparable corpus with original technical texts in both English and Portuguese. It currently features 20 distinct technical areas and is intended to be constantly enlarged to include a widening range of fields. It is especially useful for translators, terminologists and ESP teachers and students. CorTrad is a parallel corpus with three subcorpora: technical-scientific, science journalism and literary. The technical-scientific subcorpus comprises a Brazilian

202

Working with Portuguese Corpora

cookbook translated into English, the science journalism corpus consists of 1,072 articles translated from Portuguese into English and the literary corpus features 28 Australian and 20 Canadian short stories translated from English into Portuguese. A distinctive feature of CorTrad is that it offers multiple versions of the same text – that is, the original text and up to three versions of its translation. This allows users to track changes that have been made from the first draft to the final published text. Both corpora feature built-in search tools. CorTec allows users to build wordlists and extract concordances as well as n-grams, whereas CorTrad presents an elaborate array of search possibilities according to the peculiarities of each subcorpus. All three subcorpora of CorTrad are syntactically annotated and PoS tagged, thereby enabling users to carry out specific grammatical searches. In addition, the journalism subcorpus can display the distribution of a word or phrase by document, text type, date of publication and topic. The literary subcorpus can display word and phrase distribution by either book or author whereas the technical-scientific one can show them by text type, part of the work, or recipe. As all three subcorpora have been semantically tagged for ‘colour’ and ‘clothing’ in Portuguese and only for ‘colour’ in English, searches can also be made for distribution by semantic field (colour, race, wine, human) or semantic group (colour, clothing). Thanks to all functionalities offered, the CoMET corpora have been extensively used both for language teaching and translator training.

3. CorTec 3.1 Description This comparable technical corpus consists of 20 English and Portuguese technical subcorpora in the following fields: astronomy, autoclaves, coffee, computer science (hardware, computer science) software, cooking (with two separate corpora), cultural tourism, ecotourism, fashion, soccer, hypertension, kidney failure, legal documents, linguistics, magnetic flowmeters, nutritional supplements, prosthodontics, photography and tourism (hotel facilities). An organic chemistry subcorpus will be included shortly while soccer and coffee will be updated and the two separate cooking corpora will be conflated. Most subcorpora were carefully compiled by graduate students for their research, making sure that the texts collected covered the same domains and subjects, were of the same text types, were from similar periods of time, etc. Other subcorpora, built by students studying specialization in translation, were revised to meet the quality standards required for inclusion in the CorTec corpus. The interface is easy to use. It also offers details related to the contents of each corpus. For instance, clicking on the name of a subcorpus reveals information about its contents, the number of types and tokens in Portuguese. Figure 10.1 provides a translation into English of the information that users would see about the ecotourism subcorpus on their screens. To investigate the corpus, users first select the field and then the language

The CoMET Project

203

(either English or Portuguese). Next, they can choose from three tools: frequency counter, concordancer and n-gram generator. Frequency counters display wordlists by frequency or alphabetical order (Figures 10.2 and 10.3). They also offer more detailed information on types and tokens in the corpus (Figure 10.4).

Figure 10.1 Content information from ecotourism corpus

Figure 10.2 Wordlist of the ecotourism subcorpus in frequency order

204

Working with Portuguese Corpora

Figure 10.3 Wordlist of the ecotourism subcorpus in alphabetical order

Figure 10.4 Ecotourism subcorpus – type and token details Concordances can be retrieved by an exact expression or word (‘same as’) or ‘starting with’, ‘ending with’, or even ‘containing’ a fragment of a word (Figure 10.5). Users might also wish to expand the context of the search word, which is done by simply clicking on the word in the concordance display. N-grams – that is, combinations of words – can be obtained for sequences of 2, 3, or 4 words using the n-gram generator.

The CoMET Project

205

Figure 10.5 Retrieving concordance lines

3.2 Usage Comparable corpora are extremely useful for various types of contrastive studies (Johansson and Hofland, 1994; Santos, 1995; Tagnin and Teixeira, 2004), machine translation-related issues (Isabelle, 1992; McEnery and Wilson, 1993; Somers, 1993), translation teaching (Malmkjaer, 1998; Zanettin, 1998; Frankenberg-Garcia, 2002), as well as translation in general (Baker, 1995; Sharoff, 2004, 2006; Philip, 2009; Yun and Defeng, 2010). Because they consist of original texts in both languages, they offer authentic terminology (i.e., terms actually used in each field). For instance, if we search the legal documents corpus, we find that although the English word ‘contract’ has a Portuguese cognate (i.e., contrato), their frequencies in the relevant subcorpora are strikingly different. Whereas ‘contract’ appears in position

206

Working with Portuguese Corpora

166, contrato ranks as one of the most frequent words in the Portuguese subcorpus (15th position). This is a clear indication that they are not always translation equivalents, at least not in principle. Given that both subcorpora were built to contain texts covering the same topics, it would be expected that translation equivalents would rank in quite similar positions in both subcorpora. Interestingly, the first content word in the English subcorpus is ‘agreement’ (19th position, which is quite close to contrato’s 15th position). An examination of the context in which these words occur will confirm their equivalence, as shown in Examples (1), (2) and (3), manually selected from the concordance lines for each word.1 (1) This shall be binding upon and inure to the benefit of the Parties and their permitted successors and assigns. O presente é celebrado em caráter irrevogável e irretratável, obrigando as partes por si, seus herdeiros ou sucessores. (2) Either Party may terminate this in the event the other Party has materially breached or defaulted in the performance of any of its obligations hereunder. O também poderá ser rescindido caso uma das partes descumpra o estabelecido nas cláusulas do presente instrumento. (3) The term of this shall expire June 30, 2001. O presente terá prazo de (xxx), iniciando-se no dia (xxx), e terminando no dia (xxx). CorTec is also useful for teaching English or Portuguese for special purposes. From the wordlists, relevant vocabulary can be extracted and users can examine the context in which the word is used by looking at the concordance lines. For example, the wordlist of the kidney failure English subcorpus shows that, apart from function words, possible relevant terms – among the 50 most frequent words – would be as shown in Table 10.1. Table 10.1 Partial wordlist for the kidney failure subcorpus #

Word

Freq.

9 10 16 19 24 30 32 33 34 37 42 43 48 49

renal patients disease esrd dialysis failure chronic kidney protein blood study treatment risk transplantation

1726 1383 813 638 532 461 424 383 373 361 322 316 289 288

The CoMET Project

207

Concordance lines for any of these words will confirm that they are indeed a term or part of longer terminological phraseologies, such as ‘disease’ in ‘polycystic kidney disease’, ‘end-stage renal disease’, ‘renal disease’, ‘glomerular disease’, ‘thin basement membrane disease’, ‘chronic renal disease’ and a few others. In contrast, doença, the corresponding Portuguese word for ‘disease’, does not appear among the 50 most frequent words in the Portuguese counterpart of the corpus. In addition, although ‘disease’ occurs 813 times throughout the corpus, doença occurs only 239 times. This seems to indicate that many names of diseases in Portuguese do not carry the word doença. For instance, if we try to find out what ‘ESRD’ (638 occurrences) means and run a concordance for it, we will find that it stands for ‘end-stage renal disease’. We should expect a phraseology with doença in Portuguese because we have ‘disease’ in English and, indeed, we find doença renal (renal disease) (95 occurrences), doença renal crônica (chronic renal disease) (18) and doença renal crônica terminal (terminal chronic renal disease) (3). However, the number of occurrences in both languages is strikingly different, which leads us to seek another equivalent. A possible candidate (high up on the frequency list) is insuficiência (insufficiency). Concordance lines indicate various phraseologies, among them, insuficiência renal (renal insufficiency) (397 occurrences), insuficiência renal crônica (chronic renal insufficiency) (265) and insuficiência renal crônica terminal (terminal chronic renal insufficiency) (25). The last one is also cited as IRCT, with 17 occurrences. Thus, the corpus shows that a possible equivalent for end-stage renal disease (ESRD) is insuficiência renal crônica terminal (IRCT) (Examples 4 and 5). (4) também que os pacientes com a maioria dos pacientes com (5) of secondary to RVD (RVD-ESRD) , chronic kidney failure The frequency lists also show that the number of occurrences of insuficiência is very close to that of ‘failure’: 422 for insuficiência and 461 for ‘failure’. A closer examination of the concordance lines for both words reveals 20 occurrences for ‘end(-)stage renal failure’ and 25 for insuficiência renal crônica terminal – again, very close in number, showing that they are possible equivalents. Thus, by resorting to this type of corpus, teachers will be able to teach the terminology that is actually used in the respective field, instead of relying on ESP textbooks that, more often than not, focus on vocabulary that is not the most relevant or frequently used by that field’s specialists (Santos, 2011).

4. CorTrad 4.1 Description CorTrad arose from a cooperation between our CoMET Project and Linguateca (Teixeira et al., 2012; see Santos’s chapter in this volume) in 2008. Although CoMET is responsible for collecting and saving the texts that constitute the corpora in a text

208

Working with Portuguese Corpora

format, Linguateca is in charge of both the computational implementation based on DISPARA (Santos, 2002) – a system to make parallel corpora available online – and part-of-speech annotation and semantic tagging. Annotation was carried out differently for each language: the Portuguese part of CorTrad was syntactically annotated by PALAVRAS (Bick, 2000; see Bick’s chapter in this volume) while the English counterpart was PoS-annotated by CLAWS (Rayson and Garside, 1998). Semantic tagging followed the corte-e-costura (made-to-measure) method (Santos and Mota, 2010; see Santos’s chapter in this volume). As previously mentioned, CorTrad (Teixeira et al., 2012) is composed of three subcorpora: science journalism, technical-scientific and literary. With the exception of the first of these, the subcorpora include more than one version of the translated text, which enables users to track changes across different versions. The texts of the science journalism corpus come from a Brazilian science magazine published by a state funding agency (i.e., FAPESP). All texts were originally written in Portuguese and then translated into English. Although the magazine is published in hardcopy versions, the English translation is available online only (http://www. revistapesquisa.fapesp.br/?lg=en). Twenty issues were included, covering the period from 2001 to 2003. Text types range from letters, articles and news to interviews, editorials and cover stories. A detailed list can be obtained that shows its content by genre and subject. The technical-scientific corpus consists, to date, of a Brazilian cookbook (Bacellar, 2005) translated into English (Bacellar, 2008) by two Brazilian translators (version 1) and revised by an American native speaker and specialist in the area (version 2).2 We still hope to be able to add the final published version. The book is divided into sections (Hot Days, Cold Days, Special Occasions, For When You Are Short of Time), which are introduced by the author’s comments and then further broken down into topics. For instance, Hot Days has sections such as On the Veranda, A Vegetarian Lunch, Summer Vacations, and At the Seashore. Each of these include related recipes. Both the author’s comments and the recipes make up the corpus. All in all, the total number of words amounts to about 130,000. The literary corpus is composed of 28 Australian and 20 Canadian short stories, which were translated by students from a translation specialization course at the University of São Paulo. It features three versions of the translation: the students’ first draft, a corrected version (including the teacher’s corrections/suggestions) and the final published version, revised by a professional translator. The composition of each corpus is shown on its respective first screen (Figure 10.6). In addition to being a multi-version corpus, CorTrad’s search functionalities have been customized for each subcorpus. All three subcorpora show results in concordance format or through the distribution of types, lemmas, PoS, verbal tense and/or pronoun case, person and/or number, morphological gender, syntactic function, semantic field and group (of colour or clothing). In addition, the science journalism subcorpus allows users to investigate the corpus by document, genre, date of publication and subject; the technical-scientific corpus shows results by section of book, recipe and text type while the literary corpus displays results by short story and author.

The CoMET Project

209

Figure 10.6 CorTrad screen showing part of the content of the literary subcorpus A distinctive feature of CorTrad is that all three Portuguese subcorpora have been semantically tagged for ‘colour’ and ‘clothing’; thus far, the English ones have only been tagged for ‘colour’ (Santos and Mota, 2010; see Santos’s chapter in this volume). To the best of our knowledge, this is one of the first semantically tagged bilingual corpora freely available for investigation online.

4.2 Usage CorTrad is especially useful for training teachers (Frankenberg-Garcia, 2000) and translators, both novice and professional (Santos and Oksefjell, 1999). Comparing uses and frequencies across different text types, for instance, can make users aware of the adequate vocabulary for each text variety. For instance, the Portuguese verbs achar (find) and acreditar (believe) are used differently in journalistic articles and press interviews. Whereas acreditar is commonly used in articles (92 occurrences), achar only occurs 11 times in these texts. In contrast, achar is almost equally frequent in interviews (57 occurrences) and articles (50). The same relationship holds for their English correspondents: 59 occurrences of ‘believe’ occur in articles as opposed to only 12 in interviews. ‘Think’, however, is more common in interviews (57 occurrences), with only 29 occurrences in articles. As CorTrad offers up to three translations for the same original text, it allows investigators to detect recurrent translation problems as well as possible improvements on a given translation. The example in Table 10.2 shows a very literal first version, in which sat is translated as havia sentado, a better rendering in the second version (havia participado; had participated) and an overall more fluent translation, including syntactic adaptations, such as inverting ‘sat’ (participara) and ‘chaired’ (presidira) because of the preposition de required by participar. Version two opted for

Working with Portuguese Corpora

210

the original sequence and omitted the preposition (havia participado ou presidido um congresso), thereby introducing a grammatical error, as the verb participar requires the preposition de. Table 10.2 Original plus three translations for text featuring ‘chaired’ Version

Text

Original

This Bodhisattva had never on a committee or one. Esta Bodhisattva nunca em um comitê ou um. Esta Bodhisattva nunca ou um congresso. Essa Bodisatva, certamente, jamais ou de um comitê.

First version Second version Final version

Colour and clothing tagging adds a new dimension of investigation possibilities. For instance, if we ask for the distribution of branco (white) by semantic field in the published translations of the literary corpus, we get the results shown in Table 10.3. Table 10.3 Distribution of branco by semantic field Semantic field

Frequency

Cor (colour) cor: raça (colour: race) cor: vinho (colour: wine) cor: humana (colour: human)

13 8 2 1

This indicates that branco (white) is used 13 times to name a colour, 8 times to refer to race, twice as a type of wine and once as the colour of a human attribute, such as cabelo branco (white hair). A comparison of the most frequent colours in the three subcorpora3 also shows interesting results (Table 10.4). Table 10.4 Most frequent colours in the three CorTrad subcorpora Subcorpus

Word (translation, frequency)

Science news (original)

cor (colour, 86), branco (white, 57), vermelho (red, 43), verde (green, 42), negro (black, 37), buraco negro (black hole, 31), ultravioleta (ultraviolet, 28), infravermelho (infrared, 27), azul (blue, 25), amarelinho (small yellow [lit.], 22) branco (white, 69), azul (blue, 48), vermelho (red, 7), cor (colour, 26), negro (black, 25), preto (black, 22), verde (green, 19), amarelo (yellow, 18), pálido (pale, 15), cinza (grey, 13) dourar (to brown, 341), branco (white, 144), vermelho (red, 122), verde (green, 113), preto (black, 39), cor (colour, 39), tinto (red – for wine, 33), dourado (brown, 27), amarelo (yellow, 16), roxo (purple, 14)

Short stories (original) Cookbook (translated)

The CoMET Project

211

The occurrence of amarelinho (diminutive for amarelo, yellow) in the science journalism corpus is a surprising result until we look at the concordance lines, which reveal that it is the name of a disease that affects citrus trees. Also of interest is the occurrence of dourar (turn something into the colour of gold) – which is not actually the name of a colour, like red, white, or blue – topping the list in the cookbook. In fact, the verb dourar in this context does stand for a colour and corresponds to the English ‘golden (brown)’, as in Examples (6) and (7). (6) até os rolinhos começarem a dourar na superfície e nas laterais, perdendo o jeito de massa crua. until rolls are on all sides and lose the appearance of raw dough. (7) polvilhe com o parmesão e leve ao forno por 30 minutos, até dourar. sprinkle them with Parmesan cheese and bake for 30 minutes, until . Another study that will bring out the recurrent collocations from each text type is to search, for instance, for the collocates of the lemma branco (white) in the different subcorpora (Table 10.5). Table 10.5 Collocates of the lemma branco in the three CorTrad subcorpora Subcorpus

Freq.

Collocate (translation, frequency)

Science news

45

Short stories

54

Cookbook

126

glóbulo (globule, 4), cabelo (hair, 4), cubo (cube, 3), mancha (stain, 3), pelagem (fur, 2), luz (light, 2,) população (population, 2), célula (cell, 2) mão (hand, 9), homem (man, 8), pena (feather, 6), cabelo (hair, 5), látex (latex, 2), galão (gallon, 2), blusa (blouse, 2) vinho (wine, 45), parte (part, 14), chocolate (chocolate, 12), arroz (rice, 8), pele (skin, 8), pão de forma (sandwich bread, 7), milho (corn, 5), pimenta-do-reino (black pepper, 3), fumaça (smoke, 3), carne (meat, 3), peixe (fish, 3)

Without even looking at the subcorpus column, users would be able to detect scientific terms, such as glóbulo branco (white globule), luz branca (white light) and célula branca (white cell) as well as cooking terms, such as vinho branco (white wine), chocolate branco (white chocolate), arroz branco (white rice). The middle row shows less specific collocations as they belong to the general vocabulary: mão branca (white hand), homem branco (white man) and latex branco (white latex). Pena branca (white feather) is actually the name of one of the short stories in the literary corpus. Colours are also pervasively used in figurative and idiomatic language, but are rarely translated into the same colour – if into a colour at all. All the English examples in Table 10.6 feature a colour, although no colour at all appears in the Portuguese renderings.

212

Working with Portuguese Corpora

Table 10.6 Idiomatic expressions with colours in English and their Portuguese translations English source texts

Portuguese translations

just declared itself It was to be . She never refused to go to Melbourne, but it was her hoodoo city, . The dog knew they were coming, and barked . would quarrel with her till knowing about

. Nunca se negava a ir a Melbourne, mas era uma cidade de azar, . O cachorro sabia que eles estavam vindo e latiu . de discutir com ela até Eu entendia de

Cultural differences can be seen in some cooking ingredients (Examples 8 and 9). (8) red wine vinho tinto (dyed, tinted wine) (9) red cabbage repolho roxo (purple cabbage) Also interesting are expressions with clothes. Some have cognates (Example 10). (10) He says the government will set an example in . Ele diz que o governo estabelecerá um exemplo de . Others have equivalents with another type of clothing (Example 11). (11) desafio de se pôr nos sapatos (put yourself in the shoes) de um arqueólogo task of of an archaeologist Creative translations, still within the clothing vocabulary, can also be retrieved (Examples 12 and 13). (12) The arrangement suited Henry extremely well. O esquema caiu como uma luva para Henry (The arrangement fitted Henry like a glove). (13) think of what you will wear pense na roupa (think about the clothes) As we hope to have shown, translators – whether trainees, novices, or professionals – will find CorTrad useful for finding synonyms, thereby enabling them to enrich their vocabulary, find creative solutions and solve various types of translation problems. Teachers will also find it helpful to aid them in detecting common translation discrepancies, which can then be used to prepare teaching materials. Last, but not least, a translation researcher will find CorTrad’s variety of text types and functionalities to be a rich source of material and tools to engage in contrastive and translation studies.

The CoMET Project

213

5. Final comments We aimed to present our CoMET Project and two of the corpora it comprises: CorTec, the technical corpus, and CorTrad, a parallel English–Portuguese translation corpus. We described each one of them and illustrated their possibilities for research with several examples. They are both dynamic corpora, which means they are constantly being enlarged. In the very near future, CorTec’s corpora will also include domains such as prosthodontics (dentistry), hotels (tourism) and organic chemistry. CorTrad will be enlarged with a corpus of Mercosud legislation, and scientific abstracts. We hope that this material will contribute to further research in a pair of languages for which computational resources are still scarce.

References Bacellar, H. (2005), Cozinhando para Amigos. São Paulo: DBA. —(2008), Cooking for Friends. Translated from Portuguese by E. D. Teixeira and A. H. C. Andrade Lamparelli, and revised by V. Klie. São Paulo: DBA. Baker, M. (1995), ‘Corpora in Translation Studies: An overview and some suggestions for future research’. Target, 7, (2), 223–43. Bick, E. (2000), The Parsing System Palavras – Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Ph.D. Dissertation. Århus: Århus University Press. Frankenberg-Garcia, A. (2000), ‘Using a translation corpus to teach English to native speakers of Portuguese’. A Journal of Anglo-American Studies, 3, 65–78. —(2002). ‘COMPARA, language learning and translation training’, in B. Maia, J. Haller and M. Ulyrch (eds), Training the Language Service Provider for the New Millennium. Porto: FLUP, pp. 187–98. Frankenberg-Garcia, A. and Santos, D. (2002) ‘COMPARA, um corpus paralelo de português e de inglês na Web’. Cadernos de Tradução, 9 (1), 61–79. —(2003), ‘Introducing COMPARA, the Portuguese-English parallel translation corpus’, in F. Zanettin, S. Bernardini and D. Stewart (eds), Corpora in Translation Education. Manchester: St Jerome, pp. 71–87. Isabelle, P. (1992), ‘Current Research in Machine Translation: A reply to Somers’. Machine Translation, 7, (4), 265–72. Johansson, S. and Hofland, K. (1994), ‘Towards an English-Norwegian parallel corpus’, in U. Fries, G. Tottie, and P. Schneider (eds), Creating and Using English Language Corpora. Amsterdam: Rodopi, pp. 25–37. Malmkjaer, K. (1998), ‘Introduction’, in K. Malmkjaer (ed.), Translation and Language Teaching: Language Teaching and Translation. Manchester, UK: St Jerome, pp. 1–11. McEnery, T. and Wilson, A. (1993), ‘Corpora and translation: Uses and future prospects’. UCREL Technical Papers, University of Lancaster. Philip, G. (2009), ‘Arriving at equivalence: Making a case for comparable general reference corpora in translation studies’, in A. Beeby, P. Rodríguez Inés and P. Sánchez-Gijón (eds), Corpus Use and Translating: Corpus use for learning to translate and learning corpus use to translate. Amsterdam: John Benjamins, pp. 59–73.

214

Working with Portuguese Corpora

Rayson, P and Garside, R. (1998), ‘The CLAWS Web Tagger’. ICAME Journal, 22, 121–3. Santos, A. G. dos (2011), Working Closely with Corpora. Proposta de Ensino de Colocações Adverbiais em Inglês para Negócios, Sob a Luz da Linguística de Corpus. Master’s Thesis, University of São Paulo. Santos, D. (1995), ‘On grammatical translationese’, in K. Koskenniemi (ed.), Short Papers presented at NODALIDA 95, pp. 59–66. Available at http://www.linguateca.pt/Diana/ download/SantosNodalida96.pdf —(2002), ‘DISPARA, a system for distributing parallel corpora on the Web’, in E. Ranchod and N. J. Mamede (eds), Proceedings of Advances in Natural Language Processing (Third International Conference, PorTAL 2002), pp. 209–18. Santos, D. and Mota, C. (2010), ‘Experiments in human-computer cooperation for the semantic annotation of Portuguese corpora’, in N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias (eds), Proceedings of the International Conference on Language Resources and Evaluation (LREC 2010), pp. 1437–44. Santos, D. and Oksefjell, S. (1999), ‘Using a parallel corpus to validate independent claims’. Languages in Contrast, 2, (1), 117–32. Santos, D., Tagnin, S. E. O. and Teixeira, E. D. (2011), ‘CorTrad search features and translation studies: a pilot study on colours, clothing and food domains’. Paper presented at ICAME 2011. Sharoff, S. (2004), ‘Harnessing the lawless: Using comparable corpora to find translation equivalents’. Journal of Applied Linguistics and Professional Practice, 1, (3), 333–50. —(2004), ‘Translation as problem solving: Uses of comparable corpora’, in Yuste Rodrigo, E. (ed.), Proceedings of the Third International Workshop on Language Resources for Translation Work, Reseach & Training (LR4Trans-III). Paris: ELRA (European Language Resources Association). Available at: http://www.ifi.unizh.ch/cl/yuste/ LR4Trans-III/materials/proceedingsLR4TransIIIey.pdf Somers, H. L. (1993), ‘Current research in Machine Translation’. Machine Translation, 7, 231–46. Tagnin, S. (2010), ‘The COMET Project: Comparable and parallel corpora for the English-Portuguese pair.’ Paper presented at The International Symposium on Using Corpora in Contrastive and Translation Studies (UCCTS2010). Tagnin, S. E. O. and Teixeira, E. D. (2004), ‘British vs. American English, Brazilian vs. European Portuguese: How close or how far apart? – a corpus-driven study’. Lodz Studies in Language, 9, 193–208. Teixeira, E. D., Santos, D. and Tagnin, S. E. O. (2012), ‘CorTrad: Um novo corpus paralelo multiversão para o par de línguas português-inglês’, in T. Shepherd, T. Berber Sardinha and M. V. Pinto (eds), Caminhos da Linguística de Corpus. Campinas: Mercado de Letras, pp. 151–76. Yun, X. and Defeng, L. (2010). ‘Specialized Comparable Corpora and Pragmatic Translation’. Paper presented at the International Symposium on Using Corpora in Contrastive and Translation Studies (UCCTS2010). Zanettin, F. (1998). ‘Bilingual comparable corpora and the training of translators’. Meta, 43, (4), 1–14.

The CoMET Project

215

Notes 1 The markers (xxx) identify portions omitted to preserve anonymity. 2 Contrary to what some people might believe, cooking is a highly technical domain, with a very specialized vocabulary. In addition, it involves a great number of cultural differences when translating between two different languages (see Tagnin and Teixeira, 2004). 3 Tables 10.4 and 10.5 as well as some of the following examples are from Santos, Tagnin, and Teixeira (2011).

Part Five

Corpus Building and Sharing

11

Corpora at Linguateca: Vision and Roads Taken Diana Santos

1. A short history In the late 1990s, access to Portuguese data in electronic form was scarce and was considered one of the bottlenecks limiting the advance of natural language processing of Portuguese (Santos, 1999a). The launch of AC/DC1 by Linguateca, therefore, was intended to increase significantly the amount of such data as well as raising its quality and improving the way it was annotated and classified. To the best of my knowledge, AC/DC was the first service on the Internet to provide free and unencumbered access to a set of Portuguese language materials for linguists wanting to conduct research on Portuguese. Nowadays, given the web and the large amount of Web 2.0 materials everywhere, the need to increase free access to textual data is probably hard to understand as the current requirement is not so much for text or even annotated text, but for tools that can further process those large quantities. The situation has advanced beyond the need for help with gathering to the need for help with filtering, making sense of the large quantities, and using corpus linguistics methodology and findings when interrogating corpora. Still, it might be at least of some historical interest to review the conditions and the progress made since 1999, together with the sub-goals and results obtained in each period by Linguateca, which will constitute the first half of this chapter (for more on the activities of Linguateca in other fields, such as evaluation, or as a general catalyst of research on Portuguese, see Santos, 2009). The second half of this chapter presents in detail the current capabilities and what I envisage as the AC/DC cluster’s near future. This chapter is thus not meant to be an inventory or catalogue of the many resources that are (or have been) offered by Linguateca; rather, the purpose is to reflect on our practice as far as corpora are concerned. It is definitely not a comparative paper either. This book provides the information which will enable readers to compare the many different projects dealing with Portuguese corpora.

220

Working with Portuguese Corpora

1.1 In the beginning (1998–2000) We feared that an entrance into the dictionary business would be counterproductive by (possibly) disturbing the only (computational) language-related business area dealing with Portuguese where there was apparently no crisis (at least in Portugal, where dictionaries received public funding, but were not corpus-based). At the same time, we expected that partnerships with dictionary publishers might ultimately be a good source of data (and even income) for the computational processing of Portuguese. Consequently, instead of creating free dictionaries, our first actions at Linguateca were devoted to substantially increasing the availability of electronic corpora for Portuguese – an endeavour that was not commercially exploited at all (at that time). Meeting Eckhard Bick on the Internet (through a discussion on corpora list) very soon after deciding to embark on a large corpus venture had a decisive influence, as the use of his PALAVRAS2 parser (see Bick, this volume) in annotating the corpora provided enormous added value to the available Portuguese corpora.3 A comparison of Santos and Ranchhod (1999) with Santos and Bick (2000) is clear evidence of the leap forward. Graça Nunes, from NILC, whom I met at PROPOR 1999 in Évora (Portugal), was instrumental in making AC/DC relevant overseas and we obtained permission from NILC to make (part of) their NILC corpus – a large contemporary Brazilian Portuguese corpus (see Aluísio, Pardo and Duran, this volume) – available through AC/DC. Shortly thereafter, a similar virtual meeting occurred with Ana FrankenbergGarcia, with whom it was arranged that a parallel English–Portuguese corpus would be launched by Linguateca, which resulted in COMPARA (Frankenberg-Garcia and Santos, 2003). Two other major projects – CETEMPúblico and Floresta Sintá(c)tica – were also launched. CETEMPúblico, a corpus including nine years (1991–9) of Público, a major newspaper in Portugal, brought two challenges: copyright clearance and the sheer size of the material (ca. 200 million words), which was at that time a very large corpus (Rocha and Santos, 2000). To illustrate how quickly technology can make decisions obsolete, the corpus, in addition to web access and web downloading, was distributed on CD and sent (free of charge) by post, something that is no longer necessary. Yet the most ambitious project of all was the first (and so far, the only public) treebank for Portuguese, Floresta (Afonso et al., 2002; Freitas et al., 2008), a joint venture with Eckhard Bick and the VISL project. This project has had a considerable international impact,4 although less so on the community of researchers doing computational processing of Portuguese. Not everything worked out as planned in this first phase, but the most serious failure was probably the vain attempt to create a corpus ‘chooser’ that selected subparts of (AC/DC and other Linguateca-available) corpora according to a set of categories (e.g., sentence length, syntactic complexity, genre), which never quite got off the ground. A good description of the expectations and preoccupations surrounding Linguateca at that time can be found in Santos (2000).

Corpora at Linguateca

221

1.2 Steady development (2001–2002) A new organizational phase in Linguateca5 required launching several nodes (that is, sites on the Linguateca network) and therefore defining new and different goals. During this phase, priority was given to increasing corpus size (both in terms of more text and different genres), while also performing some evaluation. Accordingly, we evaluated CETEMPúblico’s impact and reflected on what could be learned from its deployment in Santos and Rocha (2001). We also assessed both the quality and import of PALAVRAS annotation of the AC/DC corpora (Santos and Gasperin, 2002). A new service associated with the AC/DC corpora, meant to provide lexical frequencies, was launched and dubbed Ordenador (which has not changed much since its launch). We also created a Brazilian counterpart of CETEMPúblico entitled CETENFolha, with help from NILC and with material from Folha de São Paulo. It was smaller (as it only concerned data from 1994), but was completely free for us to use and redistribute. It was no longer distributed on CD as, at the time of its launch (i.e., September 4, 2002), this format was already obsolete (our users did not ask for CDs, preferring to download or access the corpora online). In addition, we focused attention on our users, having presented the first user analysis of the Linguateca site in 2002 (Santos, 2002a), including the use of AC/DC (Santos and Sarmento, 2003). At that time, we also worked on improving the search system for the pages indexed by the Linguateca catalogue, called Busca, using three modes: a people search (researchers and developers in the area), a free text search and a publication search (for publications in the area). However, we soon gave up on developing it. This was a potential ‘corpus’ (of specific web pages, identified by our catalogue and indexed by our search system) the value of which we never harvested.

1.3 New attempts, use of AC/DC corpora for evaluation (2003–2004) In this phase, we decided to broaden our support for corpus research by devising a web service for tagging corpora on the fly, which we called AnELL (Mota and Moura, 2003), developed in conjunction with the LabEL group at the Technical University of Lisbon, and also by improving the functionalities available for use with COMPARA by annotating it with PALAVRAS. Above all, the main effort of Linguateca at that time was to motivate the community to participate in an evaluation contest (avaliação conjunta in Portuguese, see Santos et al., 2003; Santos and Rocha, 2003; Santos, 2007) – that is, a shared task to compare the performance of systems and the maturity of a scientific area. Thus, the AC/DC corpora were actually used to create material for evaluation, but the corpora were not developed further at that time. In addition, attempts were made to add other texts to AC/DC using the VISL service6 in the context of the Leva e Traz system (Aires, 2005), although they never

222

Working with Portuguese Corpora

became mainstream in the sense that we did not create new corpora to incorporate into AC/DC. This was because we were fighting on too many fronts, as it were, and did not see AC/DC enlargement as a priority. Moreover, there was always the problem of copyright clearance. At AC/DC, we redistribute material, so we need copyright permission. At that time, we also decided that CETEMPúblico distribution was no longer relevant in CD format, so we made it available only via download from the web, in two formats: as an annotated corpus in text format and as a CWB file,7 for those power users who wanted access to the features present in that format. WPT-03 (Cardoso et al., 2007), a large freely distributed web collection,8 did not make it to AC/DC proper due to the technical limitations of the underlying system and tools; consequently, users could not query the (parsed) corpus online.

1.4 Human revision and semantics take over (2005–2007) In this period, due to the successful training of human resources in the subprojects of COMPARA and Floresta, we decided to focus on improving the annotation of our corpora. Thus, we moved on to revise the syntactic tagging of the Portuguese subcorpus of COMPARA (Santos and Inácio, 2006; Inácio and Santos, 2006). In hindsight, we can now say that we should not have directed so much effort to COMPARA, the most copyright-bound of the corpora at Linguateca.9 The result was that the revised annotation could only benefit COMPARA users and only through its web search interface, but not other users interested in features only accessible in the texts themselves. In any case, the work with COMPARA went on to include semantic aspects – namely, colour (all words related to colour were tagged as such, together with the assignment of one main colour to each). The assessment of the human revision work that was required turned out to shape the future of corpus work in Linguateca. In my view at least, the annotation of the colour domain turned out to be the most interesting activity. We also conducted a study with COMPARA users examining actual searches to try to understand the reasons why some queries lacked useful results, which was again innovative and – as far as we know – the first user study about corpus querying on the Internet (Santos and Frankenberg-Garcia, 2007). As for the other corpora, we again used AC/DC to create several evaluation materials for CLEF, the international evaluation contest for cross-lingual information retrieval (Peters et al., 2004; Rocha and Santos, 2007a), whose goal was to evaluate the ability of systems to query information across languages, and for HAREM, a Linguateca-organized international contest for Portuguese named-entity recognition systems (Santos and Cardoso, 2007; Mota and Santos, 2008), whose goal was to compare the performance of the existing systems and reveal the state of the art in this area. This process was interesting for two reasons. In addition to helping create better evaluation materials, such as the CHAVE collection, consisting of two years of Folha de São Paulo and Público with the evaluation pool for CLEF queries (Santos and

Corpora at Linguateca

223

Rocha, 2005) and the CDHAREM golden collection(s) with correctly annotated named entities and relations among entities in Portuguese (Rocha and Santos, 2007b), we brought new materials into AC/DC, enhancing the corpus portfolio with humanrevised data, thereby contributing to advance corpus research. On yet another front, a new phase in Floresta was initiated (Freitas et al., 2008) and we added new materials, such as blogs and scientific texts, and deployed a new browsing system – namely, Milhafre (Freitas and Rocha, 2008). However, due to unfortunate external circumstances, we had to stop such efforts much sooner than expected, leaving a huge treasure to mine.

1.5 Semantics is key (2008–2011) For many reasons, the end of 2008 was a turning point at Linguateca. Briefly put, this was brought about by a radical change (a decrease) in the funding model, so that only Oslo and Lisbon (at FCCN) remained, resulting in several unstable contracts for most of the staff. Several obstacles impeded further development of COMPARA (the project was frozen on December 31, 2008). At the same time a more flexible system for semantic annotation was deployed in order to increase the order of magnitude of the amount of text annotated, and new annotation work on the other corpora was started. With corte-e-costura (made-to-measure) (Santos and Mota, 2010), the semantic fields of colour, clothing and feelings (the set of feelings started with ‘fear’; see Maia and Santos, 2012) were annotated in the AC/DC cluster (for more on motivation and initial results, see Santos, 2011). A new joint venture in parallel corpora, called CorTrad, was initiated with Stella Tagnin and Elisa Duarte Teixeira from the CoMET project at the University of São Paulo (see Tagnin, this volume; Tagnin et al., 2009). In this way, further genres were added to the AC/DC cluster, including cooking, translated short stories and scientific news – all coming from CorTrad. Translation memories (Portuguese into English) were also made available in specialized domains, but have not yet been included in a search interface through DISPARA due to the lack of human resources and time.10 Furthermore, a new pedagogic prototype – PoNTE11 – was developed for Portuguese to Norwegian and Norwegian to Portuguese, including students’ translations at the University of Oslo. Both CorTrad and PoNTE rely largely on reusing DISPARA (Santos, 2002b) with different search functionalities. In addition to this new emphasis on contrastive materials, and due to the fact that several Linguateca team members moved into teaching positions, we also started to deploy educational programs such as Ensinador (Teacher) (Simões and Santos, 2011) to provide teaching materials in Portuguese language and linguistics in the form of cloze (i.e., gap-filling) tests. Inspired by real users’ needs, new services were created. Comparador allowed for comparing results from two different searches in the AC/DC cluster whereas VARRA (Freitas et al., in press) enabled users to validate semantic relations in corpora (that is, given a particular semantic relation between two words, it looks for corpus evidence

224

Working with Portuguese Corpora

that supports that relation). As additional annotation of the corpora themselves, we also added lexical semantic information to each lemma form (possible synonyms, antonyms and hypernyms) from two lexical ontologies for Portuguese: PAPEL (Gonçalo Oliveira et al., 2008, 2010) and TeP (Maziero et al., 2008).

1.6 Summing up, documentation and other approaches Nowadays, numerous services and corpora are available on the Internet, the most recent of which is the CLUL (Centro de Linguística da Universidade de Lisboa) online corpus (see Bacelar do Nascimento, Mendes, Antunes and Pereira, this volume), but I do not think that this means the AC/DC initiative was in vain. Rather, I believe that, especially for Portuguese, it served as a role model. Most existing systems serving Portuguese corpora might have been inspired or developed to improve upon AC/DC (even if this is not avowed by their developers). Of course, one could also say that some might have been developed to compete with AC/DC, especially in cases where the model (use of CWB, web access availability) is the same but the only difference is that the data remain under the control of the corpus owners. In any case, my belief is that, if the AC/DC corpora had not been free on the Internet, those researchers would have tried to take advantage of giving access to their corpora (either by requiring partnerships or even a financial counterpart). Another important contribution from Linguateca’s corpus work is the huge amount of documentation created over the years, as web pages, tutorials and research papers, such as those by Freitas and Afonso (2008), Silva et al. (2008), Inácio and Santos (2008), Freitas et al. (2011), Santos et al. (2011a) and Silva and Santos (2012), have revitalized our work even after our corpora themselves have become outdated and their query systems obsolete. This is a hope that we share with Geoffrey Sampson, who stated (2003) that he considered his greatest contribution to linguistics to be his book English for the Computer (Sampson, 1995; in which he describes problems and solutions in the annotation of English of the SUSANNE corpus), regardless of the fact that people kept using, reusing and citing the corpus while almost completely disregarding the underlying linguistic work.12 It is clearly positive to be in good (intellectual) company; for Portuguese, there is far less published empirical work than for English, so our efforts seem even more likely to be praised and reused, but they do not appear to be so. The same seems to have happened with the ‘obligatory’ citing of Bick’s PhD thesis (Bick, 2000) when using PALAVRAS. How many people actually discuss the linguistic options instead of mentioning the resource or the parser? How many people make a clear distinction between the theoretical framework behind PALAVRAS, expounded in several places by Eckhard Bick (with which people may disagree; see Bick, this volume), and its implementation in the parser, with its many shortcomings and its ‘bugs,’ which are an unavoidable result of an automatic program and are obviously unintended? The possibility of replicating studies based on our public and legally cleared data is again something for which we had high hopes, as can be seen from the suggestions about citing particular corpus examples in Santos (2000) – hopes that we have no longer. Of the few people who do cite our corpora, none (to my knowledge) provide

Corpora at Linguateca

225

examples along with the citation (that is, indicating the version of the corpus, the query issued, etc.). This might be because either the examples are not highly relevant or people trust that it is easy to recover them if one simply queries the site once more, but it might also indicate that people do not consider it important that their work be checked and replicated by others, contrary to the claim we made in Santos and Oksefjell (1999) – namely, that replication in corpus work was of supreme importance. Another question worth discussing is the balance between working for others and doing research of our own. This is a very important issue, cf. the criticism voiced in Nunes (2008) on an evaluation of Linguateca. Indeed, if one project (in this case, Linguateca) has the main goal of creating, improving and making available resources for others, then how is this compatible with having Linguateca staff use the data for their own research or for developing scientifically interesting (and publishable) systems or services? After all, there is no denying that resource creators are those who know the resource best and have most influence on the options taken. Furthermore, the whole process of making a resource available and disseminating its uses and benefits requires that (i) at least some examples and measures be published regarding that resource and (ii) some documentation be written that demonstrates how to use the resource. Where should we draw the line between necessary publishing in order to make the resource known and our own scientific promotion? In Santos (1999b), I wrote that there were usually three kinds of people (roles) involved in corpus development (namely, builders, users and tool developers) and that this separation into three communities with different needs and fields of activity was often damaging for all. Maybe we have gone too far in the other direction in the Linguateca circle, as we develop (fully public) resources in which we are interested and with which we play before users even know they exist (that is, we are avid users of our own resources and of their respective tools). Another important view on the limits of corpus dissemination activities at Linguateca has been expressed by Belinda Maia (1997, 2003). The realization of these limits gave rise to Corpógrafo (Sarmento et al., 2006; Maia and Matos, 2008), a tool developed in the context of Linguateca.13 Maia’s philosophy is radically different from that of the AC/DC cluster and can best be summarized as follows: give the users the tools and allow them to create their own corpora. This makes a lot of sense, especially because there are more genres, subjects and issues than any body of texts (or even the web) can account for and also because of the legal constraints that make some kinds of texts private by definition. In other words, no matter how many corpora at Linguateca we compile and make available, we will never be able to satisfy the (majority of) users. Corpógrafo was undeniably very successful pedagogically (comparable only to Floresta in terms of the number of users/downloads), but we lack information about the extent of the research carried out by its users. On the one hand, numerous term papers at universities have used Corpógrafo, but most teachers are overworked with the teaching load and furthermore students’ work is not made public, so it is natural that we do not hear about the results. On the other hand, this path poses considerably higher work demands on the user, who has to choose and download her/his texts and thus become a corpus compiler. At the individual level, this takes much of a user’s time and keeps them from doing research.

226

Working with Portuguese Corpora

As Maia (2012, personal communication) reported when I asked her to try to quantify the work done with Corpógrafo: The term papers referred to above are usually the result of quite considerable work in preparing private corpora, analyzing them and extracting terminology to the database. For instance, a typical piece of terminology work done at FLUP requires the student to find texts and create comparable corpora in a very specific domain, and then extract terminology data and connect the terms using a semantic relations tool that then generates a conceptual diagram. Examples from 2011 are microphones, digital cameras, well logging, olive oil, chocolate production, bipolar disease, all in English and Portuguese, and divorce legislation in English and French. Once the data have been extracted, there is a tool very similar to Bootcat that can retrieve further texts from the internet. […] There have been other interesting theses and articles produced using Corpógrafo that have little to do with terminology. […] This type of work has been carried out by individuals who use the n-gram and concordance tools creatively.

Gauging how much the user or corpus provider should do is a balance difficult to achieve from the point of view of a project like Linguateca, the main aim of which is to increase the quality and quantity of, among others, corpus-based R&D in Portuguese, with limited funding. There is a way out of this dilemma: ideally, small and specialized corpora created with Corpógrafo might later be included (possibly anonymized or scrambled in such a way that prevents users from building texts back up from fragments while at the same time protecting copyright) in large specialized corpora that might be put to good use by many other users. Unfortunately, this has never taken place,14 largely for copyright and logistical reasons. Surely, scraping corpora off the web (as done by CorTec of CoMET15 – see Tagnin, in this volume – or by Floresta) is no longer a problem, and – as far as size is concerned – this method allows for larger amounts of texts to be collected more quickly than any which individuals with a specific interest can come up with.

2. Corpora at Linguateca now: What’s up? Let us turn now to a description of the present capabilities of the AC/DC cluster and the kind of studies that we envisage and hope to make possible with its corpus data. Table 11.1 provides a simple quantitative description of the current AC/DC corpora, according to different axes and issues. We have a stable infrastructure that steadily incorporates more information and allows for joint searching across all of our corpora (selecting in AC/DC corpus todos juntos; meaning ‘all together’) as well as in each particular corpus. It is important to note that, because we are committed to our users, we need to guarantee backward compatibility. Now that we have enabled searching all of our corpora at once, we are also able to perform general queries across all of them without losing sight of each variety. This allows users, for example, to verify that tanto quanto and tanto como (both meaning

Corpora at Linguateca

227

Table 11.1 AC/DC material by genre, part of speech and semantic categories (July 2012) Category

Word forms

Brazil Portugal Other or unknown Newspaper Fiction Oral Technical Other Nouns Verbs Proper noun** Adjectives Adverbs Grammatical words Colour Clothing Fear and courage

59,640,988 240,453,311 739,395 270,527,960 17,120,908 472,583 4,222,250 5,169,834 58,733,442 36,829,885 26,544,455 17,147,949 14,662,541 98,507,674 257,296 181,664 115,414

Different word forms* Different lemmas 589,685 1,183,667 – 1,216,994 353,248 27,468 142,847 – 338,357 340,946 579,416 142,205 10,147 1,450 2,863 1,952 1,520

729,013 1,693,530 – 1,913,222 235,199 16,003 109,360 – 258,472 78,030 1,633,439 87,428 8,958 383 1,245 597 255

* Case insensitive. ** Proper names with more than one word in the AC/DC corpora are spelled word by word, but in their lemmas, the individual words are joined by an equal sign.

‘as much as’) have complementary distributions in text from Brazil and from Portugal (see Table 11.2, which shows that the former is preferred by Brazilians and the latter by Portuguese) or that the position of sempre (always) is different according to the tense with which it co-occurs, as Table 11.3 shows (in both the imperfect (Imperfeito) and the present tenses the adverb almost always follows the verb, whereas in the perfect preterit (Perfeito) it precedes the verb in most cases). Table 11.2 Frequencies of tanto como and tanto quanto by variety Variety

tanto como

tanto quanto

tanto []* como

tanto []* quanto

Portugal Brazil

1,412 85

1,676 588

22,603 3,292

5,216 3,573

Table 11.3 Frequencies of pre- and post-posed sempre with its most frequent Portuguese tenses (main verbs only) Tense

main verb sempre

sempre main verb

Presente (Present) Perfeito (Perfect preterit) Imperfeito (Imperfect)

27,490 9,275 3,924

7,967 13,382 878

228

Working with Portuguese Corpora

2.1 A range of different services AC/DC has evolved from being a specific service based on corpus queries that provides concordances and distributions16 (as most other services do), in addition to an Open/IMS-CWB corpus workbench infrastructure, to becoming a set of services and subprojects dealing with corpora in a multitude of ways. I will thus return to Ordenador, VARRA and Ensinador, which have already been mentioned in this chapter, to discuss them in more detail. The first subservice, Ordenador, displays frequency and rank for a word or regular expression. It has been active for more than 10 years, providing both general frequencies (overall, by variety and by corpus) and information obtained by regular expressions. We are not sure how useful this particular service is because many people ask for frequency lists instead of using it (and download them all), but it was easy to implement. The next service developed on top of the AC/DC infrastructure was VARRA (Freitas et al., forthcoming), a more advanced and complex environment dealing with semantic relations (antonymy, caused by, is-a, meronymy, etc.) in Portuguese to help support a discovery process about how to express these notions in Portuguese and about what kinds of semantic relations were expressed at all. We soon realized that two sufficiently different tasks were at stake: validation (i.e., given a proposed relation, finding corpus examples for subsequent human inspection) and discovery proper (i.e., finding new relations or relation instantiations), which requires a considerably higher number of queries by the researcher than does the previous task. Another quite different service is Ensinador (Simões and Santos, 2011), an AC/ DC spin-off that creates a wide variety of cloze tests (and their original sentences, the ‘solutions’) from the underlying AC/DC corpora. This turned out to be useful for teaching Portuguese grammar to foreign students as well as amassing linguistic data for specific linguistic research questions. Other ways of using the AC/DC corpora, still at the prototype level, have also been implemented: MM

MM

Comparador, already mentioned, which allows users to compare two different searches and see them in parallel; and Distribuidor, which significantly extends the kind of distribution queries that can be requested (for example by any number of attributes: colour by tense by variety) and enables better searches for structural attributes.

Yet, from a linguistic point of view, the major improvement of the AC/DC cluster was the annotation of semantic domains such as colour or feelings, which made our corpora the first to have such large amounts of annotation – some of which was revised by hand – as will be described in the next section.

2.2 The addition of semantic annotation We started work in computational semantics in Linguateca with HAREM and included some named entity classifications in the CDHAREM corpus (Freitas et al.,

Corpora at Linguateca

229

2009), but a major breakthrough was the application of large-scale colour annotation to all corpora. As previously mentioned, the first attempt took place in COMPARA, following the identification of additional interest in comparing two different languages (Santos et al., 2008), but the set of all AC/DC corpora was soon added. Preliminary data can be read in Santos (2011), while in Santos et al. (2011b) and Freitas et al. (2012), we tackled differences in colour between the two national varieties of Portuguese (i.e., from Brazil and from Portugal), inspired by both the setup of the CONDIVport corpus (Soares da Silva, 2011) and the remarks about language differences in Ellis (1993). In his excellent book, Ellis argues for the incommensurability of temperature words in German and English and comments on the specific position in the noun phrase assigned to colour adjectives in English. As a result, a huge amount of data related to the semantic domain of colour is available to the whole community, and – although not yet revised – similar information is available for other domains, such as clothing (for early explorations with CorTrad, see Santos et al., 2012) and fear (Maia and Santos, 2012).

2.3 Improving synergy with other research directions Finally, in an attempt to develop synergy from other Linguateca projects and resources, or other public-domain resources for Portuguese, we implemented a two-way crossover between PAPEL, a large-scale lexical ontology based on a published dictionary, and AC/DC. On the one hand, corpora were annotated for synonymy, hypernymy and antonymy (obtained from PAPEL and TeP). On the other hand, with Folheador (Costa, 2011; Gonçalo Oliveira et al., 2012), an interface for querying lexical semantic information, one can search both VARRA and AC/DC on the fly and obtain examples of the semantic triples (i.e., word co-occurrences) involved.

3. Concluding remarks I chose to devote the present chapter to AC/DC, even though Floresta Sintáctica and Corpógrafo are arguably Linguateca’s most important contributions to Portuguese corpora – at least if we measure importance by the number of users and, therefore, impact. However, I believed that writing a paper on either of these subjects would require co-authorship (or main authorship) with (at least) Eckhard Bick, Susana Afonso and Cláudia Freitas for the former and (at least) Belinda Maia, Luís Sarmento and Luís Miguel Cabral for the latter. But I believe that I have played the role of the main leader as far as AC/DC is concerned. Yet it would be rather odd if the only reason to write about AC/DC was my inability to write as single author on other subjects. On the contrary, I still believe AC/DC was (also) a major contribution to Portuguese linguistics, currently still is, and will continue to be so in the future. In fact, AC/DC is now being promoted as the backbone (and the data source) of two new Linguateca initiatives, which should in fact transform AC/DC into an important contribution for the whole community:

230 MM MM

Working with Portuguese Corpora

The writing of a corpus-based grammar for Portuguese (cf. Santos, 2012) The creation of a richer infrastructure for lusophone cultural studies, whose first step was the organization of Págico to study Wikipedia materials for that purpose (Mota et al., 2012)

These recently started projects will allow the whole community interested in Portuguese language and culture to do research and studies on our language in an unprecedented way. Let me conclude by stressing that, whatever the worth of AC/DC, it exists thanks to all data providers and corpus builders who enabled us to make available, and enrich, their corpora.

Acknowledgements AC/DC owes deep gratitude to the many people who have contributed their efforts throughout the years. I would like to highlight especially Paulo Rocha as well as Renato Haber, Luís Costa, Cláudia Freitas, Cristina Mota and the many users who have helped improve it. Last but not least, I want to thank all the corpus providers who trusted us with their data. Eckhard Bick, by supplying and steadily improving PALAVRAS is, of course, a key factor and partner in our success. As should be clear from the references, the addition of Hugo Gonçalo Oliveira and Alberto Simões to our team was essential to the development of the newer AC/DC services, and the continuous expert help of Rosário Silva in the semantic annotation has been absolutely priceless. Linguateca has, throughout the years, been jointly funded by the Portuguese Government, the European Union (FEDER and FSE), UMIC, FCCN and FCT. Since 2011, I have received financial support from the University of Oslo to improve the AC/ DC infrastructure, as well as technical support from the Research Computing Services group, both of which are deeply appreciated.

References17 Afonso, S., Bick, E., Haber, R. and Santos, D. (2002), ‘Floresta Sintá(c)tica: A treebank for Portuguese’, in M. G. Rodríguez and C. P. S. Araujo (eds), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation, pp. 1698–703. Aires, R. V. X. (2005), Uso de Marcadores Estilísticos para a Busca na Web em Português. PhD dissertation. University of São Paulo at São Carlos. Bick, E. (1996), ‘Automatic parsing of Portuguese’, in L. S. García (ed.), Anais do II Encontro para o Processamento Computacional de Português Escrito e Falado, pp. 91–100. —(2000), The Parsing System ‘Palavras’: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Århus: Århus University Press. Buchholz, S. and Green, D. (2006), ‘Quality control of treebanks: Documenting, converting, patching’, in N. Calzolari, K. Choukri, A. Gangemi, B. Maegaard, J.

Corpora at Linguateca

231

Mariani, J. Odjik and D. Tapias (eds), Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pp. 26–31. Buchholz, S. and Marsi, E. (2006), ‘CoNLL-X shared task on Multilingual Dependency Parsing’, in Proceedings of the Tenth Conference on Computational Natural Language Learning, pp. 149–65. Cardoso, N., Martins, B., Gomes, D. and Silva, M. J. (2007), ‘WPT 03: A primeira colecção pública proveniente de uma recolha da web portuguesa’, in D. Santos (ed.), Avaliação Conjunta: um Novo Paradigma no Processamento Computacional da Língua Portuguesa. Lisbon: IST Press, pp. 279–88. Costa, H. (2011), ‘O desenho do novo Folheador’. Linguateca/FCCN Report. Ellis, J. M. (1993), Language, Thought and Logic. Evanston, IL: Northwestern University Press. Evert, S. (2009), The CQP Query Language Tutorial. [ONLINE] Available at: http://cwb.sourceforge.net/temp/CQPTutorial.pdf Frankenberg-Garcia, A. and Santos, D. (2003), ‘Introducing COMPARA, the PortugueseEnglish parallel translation corpus’, in F. Zanettin, S. Bernardini and D. Stewart (eds), Corpora in Translation Education. Manchester: St Jerome, pp. 71–87. Freitas, C. and Afonso, S. (2008), Bíblia Florestal: Um manual lingüístico da Floresta Sintá(c)tica. [ONLINE] Available at http://www.linguateca.pt /Floresta /BibliaFlorestal /completa.html Freitas, C. and Rocha, P. (2008), Primeiros vôos com o MILHAFRE. [ONLINE] Available at http://www.linguateca.pt /Floresta /milhafre /tutorial.milhafre.html Freitas, C., Rocha, P. and Bick, E. (2008), ‘Floresta Sintá(c)tica: Bigger, Thicker and Easier’, in A. Teixeira, V. L. Strube de Lima, L. Caldas de Oliveira and P. Quaresma (eds), Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008), pp. 216–19. Freitas, C., Santos, D. and Gonçalves, A. (2011), Perguntas já respondidas sobre o AC/ DC: Desde como começar até uso complexo de funcionalidades poderosas. [ONLINE] Available at: http://www.linguateca.pt /acesso /PJR_ACDC_Tudo.pdf Freitas, C., Santos, D. and Silva, R. (2012), ‘Corpos e cores: Colorindo a descrição da língua portuguesa’, in D. Dutra and H. Mello (eds), Anais do X Encontro de Linguística de Corpus: Aspetos metodológicos dos estudos de corpora. Belo Horizonte: Faculdade de Letras da UFMG, pp. 76–99. Freitas, C., Santos, D., Gonçalo Oliveira, H. and Quental, V. (in press), ‘VARRA: Validação, Avaliação e Revisão de Relações semânticas no AC/DC’, in Atas do ELC 2010. Campinas, SP: Mercado de Letras. Freitas, C., Santos, D., Mota, C., Gonçalo Oliveira, H. and Carvalho, P. (2009), ‘Detection of relations between named entities: Report of a shared task’, in Proceedings of the NAACL HLT Workshop on Semantic Evaluations: Recent Achievements and Future Directions, SEW-2009, pp. 129–37. Gonçalo Oliveira, H., Costa, H. and Santos, D. (2012), ‘Folheador: Browsing through Portuguese semantic relations’, in Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL2012, pp. 35–40. Gonçalo Oliveira, H., Santos, D., Gomes, P. and Seco, N. (2008), ‘PAPEL: A dictionarybased lexical ontology for Portuguese’, in A. Teixeira, V. L. Strube de Lima, L. C. de Oliveira and P. Quaresma (eds), Computational Processing of the Portuguese Language, 8th International Conference, Proceedings (PROPOR 2008), pp. 31–40. Gonçalo Oliveira, H., Santos, D. and Gomes, P. (2010), ‘Extracção de relações semânticas

232

Working with Portuguese Corpora

entre palavras a partir de um dicionário: O PAPEL e sua avaliação’. Linguamática, 2, (1), 77–94. Inácio, S. and Santos, D. (2006), ‘Syntactical annotation of COMPARA: Workflow and first results’, in R. Vieira, P. Quaresma, M. G. Volpes Nunes, N. Mamede, C. Oliveira and M. C. Dias (eds), 7th Workshop on Computational Processing of Written and Spoken Language (PROPOR 2006), pp. 256–9. —(2008) Documentação da anotação da parte portuguesa do COMPARA. [ONLINE] Available at http: //www.linguateca.pt /COMPARA/ DocAnotacaoPortCOMPARA.pdf Maia, B. (1997), ‘Do-it-yourself corpora … with a little bit of help from your friends’, in Barbara Lewandowska-Tomaszczyk and P. J. Melia (eds), PALC 97: practical applications in language corpora. Lodz: Lodz University Press, pp. 403–10. —(2003), ‘Some Languages are more equal than others: Training translators in terminology and information retrieval using comparable and parallel corpora’, in F. Zanettin, S. Bernardini and D. Stewart (eds), Corpora in Translation Education. Manchester: St Jerome, pp. 43–53. Maia, B. and Matos, S. (2008), ‘Corpógrafo V4 – Tools for Researchers and Teachers using Comparable Corpora’, in P. Zweigenbaum, É. Gaussier and P. Fung (eds), LREC 2008 Workshop on Comparable Corpora, pp. 79–82. Maia, B. and Santos, D. (2012), ‘Who’s afraid of … what? – in English and Portuguese’, in S. Oksefjell Ebeling, J. Ebeling and H. Hasselgård (eds), Aspects of Corpus Linguistics: Compilation, Annotation, Analysis. Helsinki: Research Unit for Variation, Contacts, and Change in English. Available at: http://www.helsinki.fi/varieng/series/volumes/12/ maia_santos/ Maziero, E. G., Pardo, T., Di Felippo, A. and Dias-da-Silva, B. C. (2008), ‘A Base de Dados Lexical e a Interface Web do TeP 2.0 – Thesaurus Eletrônico para o Português do Brasil’, in VI Workshop em Tecnologia da Informação e da Linguagem Humana (TIL), pp. 390–2. Mota, C. and Moura, P. (2003) ‘ANELL: A Web System for Portuguese Corpora Annotation’, in J. Baptista, I. Trancoso, M. G. Volpe Nunes and N. J. Mamede (eds), Computational Processing of the Portuguese Language: 6th International Workshop, PROPOR 2003, pp. 184–8. Mota, C. and Santos, D. (eds) (2008), Desafios na Avaliação Conjunta do Reconhecimento de Entidades Mencionadas: O Segundo HAREM. Linguateca. Mota, C., Simões, A., Freitas, C., Costa, L. and Santos, D. (2012), ‘Págico: Evaluating Wikipedia-based information retrieval in Portuguese’, in N. Calzolari, K. Choukri, T. Declerck, M. Uğur Doğan, B. Maegaard, J. Mariani, J. Odijk and S. Piperidis (eds), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2015–22. Nunes, M. G. V. (2008), Relato sobre a parceria Linguateca-NILC. [ONLINE] Available at http://www.linguateca.pt /Linguateca10anos /Apresentacoes /AprNunesL10.pdf Oksefjell, S. and Santos, D. (1998), ‘Breve panorâmica dos recursos de português mencionados na Web’, in V. L. Strube de Lima (ed.), III Encontro para o Processamento Computacional do Português Escrito e Falado (PROPOR 98), pp. 38–47. Peters, C., Braschler, M., Choukri, K., Gonzalo, J. and Kluck, M. (2004), ‘The future of evaluation for cross-language information retrieval systems’, in M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa and R. Silva (eds), Proceedings of LREC 2004, Fourth International Conference on Language resources and Evaluation, pp. 841–4. Rocha, P. A. and Santos, D. (2000), ‘CETEMPúblico: Um corpus de grandes dimensões de linguagem jornalística portuguesa’, in M. G. Volpe Nunes (ed.), Actas do V Encontro

Corpora at Linguateca

233

para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR 2000), pp. 131–40. —(2007a), ‘CLEF: Abrindo a porta à participação internacional em avaliação de RI do português’, in D. Santos (ed.), Avaliação Conjunta: um Novo Paradigma no Processamento Computacional da Língua Portuguesa. Lisbon: IST Press, pp. 143–58. —(2007b), ‘Disponibilizando a Colecção Dourada do ACONTECIMENTO>HAREM através do projecto AC/DC’, in D. Santos and N. Cardoso (eds), Reconhecimento de Entidades Mencionadas em Português: Documentação e Actas do HAREM, a Primeira Avaliação Conjunta na Área, pp. 307–26. Sampson, G. (1995), English for the Computer: The SUSANNE Corpus and Analytic Scheme. Oxford: Clarendon Press. —(2003), ‘Thoughts on two decades of drawing trees’, in A. Abeillé (ed.), Treebanks: Building and Using Parsed Corpora. Dordrecht: Kluwer, pp. 23–41. Santos, D. (1999a), Processamento computacional da língua portuguesa: Documento de trabalho. [ONLINE] Available at http://www.linguateca.pt /branco/index.html —(1999b), ‘Disponibilização de corpora através da WWW’, in P. Marrafa and M. A. Mota (eds), Linguística Computacional: Investigação Fundamental e Aplicações: Actas do I Workshop sobre Linguística Computacional da Associação Portuguesa de Linguística. Lisbon: Colibri, pp. 323–46. —(2000), ‘O projecto Processamento Computacional do Português: Balanço e perspectivas’, in M. G. Volpe Nunes (ed.), Actas do V Encontro para o Processamento Computacional da Língua Portuguesa Escrita e Falada (PROPOR 2000), pp. 105–13. —(2002a), ‘Um centro de recursos para o processamento computacional do português’. DataGramaZero – Revista de Ciência da Informação, 3, (2). Available at http://www. dgz.org.br/fev02/Art_02.htm —(2002b), ‘DISPARA, a system for distributing parallel corpora on the Web’, in E. R. and N. J. Mamede (eds), Third International Conference, PorTAL 2002, pp. 209–18. —(2007), ‘Avaliação conjunta’, in Santos, D. (ed.), Avaliação Conjunta: um Novo Paradigma no Processamento Computacional da Língua Portuguesa. Lisbon: IST Press, pp. 1–12. —(2009), ‘Caminhos percorridos no mapa da portuguesificação: A Linguateca em perspectiva’. Linguamática, 1, (1), 25–58. —(2011), ‘Linguateca’s infrastructure for Portuguese and how it allows the detailed study of language varieties’, in J. B. Johannessen (ed.), Language Variation Infrastructure, pp. 113–28. —(2012), ‘The next step for the translation network’. In D. Santos, K. Lindén and W. Ng’ang’a (eds), Shall We Play the Festschrift Game? Essays on the Occasion of Lauri Carlson’s 60th Birthday. Dordrecht: Springer, pp. 35–52. Santos, D. and Bick, E. (2000), ‘Providing Internet access to Portuguese corpora: The AC/ DC project’, in M. Gavriladou, G. Carayannis, S. Markantonatou, S. Piperidis and G. Stainhaouer (eds), Proceedings of the Second International Conference on Language Resources and Evaluation, LREC 2000, pp. 205–10. Santos, D. and Cardoso, N. (eds) (2007), HAREM, a Primeira Avaliação Conjunta de Sistemas de Reconhecimento de Entidades Mencionadas para Português: Documentação e Actas do Encontro. Linguateca. Santos, D. and Frankenberg-Garcia, A. (2007), ‘The corpus, its users and their needs: A user-oriented evaluation of COMPARA’. International Journal of Corpus Linguistics, 12, (3), 335–74.

234

Working with Portuguese Corpora

Santos, D. and Gasperin, C. (2002), ‘Evaluation of parsed corpora: Experiments in user-transparent and user-visible evaluation’, in M. G. Rodríguez and C. P. Suárez Araujo (eds), Proceedings of LREC 2002, the Third International Conference on Language Resources and Evaluation, pp. 597–604. Santos, D. and Inácio, S. (2006), ‘Annotating COMPARA, a grammar-aware parallel corpus’, in N. Calzolari, K. Choukri, A. Gangemi, B. Maegaard, J. Mariani, J. Odjik and D. Tapias (eds), Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pp. 1216–21. Santos, D. and Mota, C. (2010), ‘Experiments in human-computer cooperation for the semantic annotation of Portuguese corpora’, in N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner and D. Tapias (eds), Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, pp. 1437–44. Santos, D. and Oksefjell, S. (1999), ‘Using a Parallel Corpus to Validate Independent Claims’. Languages in Contrast, 2 (1), 117–32. Santos, D. and Ranchhod, E. (1999), ‘Ambientes de processamento de corpora em português: Comparação entre dois sistemas’, in Actas do IV Encontro sobre o Processamento Computacional da Língua Portuguesa (Escrita e Falada), pp. 257–68. Santos, D. and Rocha, P. (2001), ‘Evaluating CETEMPúblico, a free resource for Portuguese’, in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pp. 442–9. —(2003), ‘AvalON: Uma iniciativa de avaliação conjunta para o português’, in A. Mendes and T. Freitas (eds), Actas do XVIII Encontro da Associação Portuguesa de Linguística, pp. 693–704. —(2005), ‘The key to the first CLEF in Portuguese: Topics, questions and answers in CHAVE’, in C. Peters, P. Clough, J. Gonzalo, G. Jones, M. Kluck and B. Magnini (eds), Multilingual Information Access for Text, Speech and Images: 5th Workshop of the CrossLanguage Evaluation Forum (CLEF 2004), pp. 821–32. Santos, D. and Sarmento, L. (2003), ‘O projecto AC/DC: Acesso a corpora / disponibilização de corpora’, in A. Mendes and T. Freitas (eds), Actas do XVIII Encontro da Associação Portuguesa de Linguística, pp. 705–17. Santos, D., Costa, L. and Rocha, P. (2003), ‘Cooperatively evaluating Portuguese morphology’, in N. J. Mamede, J. Baptista, I. Trancoso and M. G. Volpe Nunes (eds), Computational Processing of the Portuguese Language, 6th International Workshop, PROPOR 2003, pp. 259–66. Santos, D., Mota, C. and Soares da Silva, A. (2011a), Guarda-fatos: Notas sobre a anotação do campo semântica do vestuário nos corpos do AC/DC. [ONLINE] Available at http://www.linguateca.pt /acesso /GuardaFatos.pdf Santos, D., Silva, R. and Freitas, C. (2011b), ‘Pluralidades na cor: Contrastando a língua do Brasil e de Portugal’, in A. S. da Silva, A. Torres and M. Gonçalves (eds), Línguas Pluricêntricas: Variação Linguística e Dimensões Sociocognitivas. Pluricentric Languages: Linguistic Variation and Sociocognitive Dimensions. Braga: Aletheia, pp. 555–72. Santos, D., Silva, R. and Inácio, S. (2008), ‘What’s in a colour? Studying and contrasting colours with COMPARA’, in Proceedings of the Sixth International Conference on Language Resources and Evaluation, pp. 255–62. Santos, D., Tagnin, S. E. O. and Teixeira, E. D. (2012), ‘CorTrad and Portuguese-English translation studies: Investigating colours’, in S. Oksefjell Ebeling, J. Ebeling and H. Hasselgård (eds), Aspects of Corpus Linguistics: Compilation, Annotation, Analysis.

Corpora at Linguateca

235

Helsinki: Research Unit for Variation, Contacts, and Change in English. Available at: http://www.helsinki.fi/varieng/series/volumes/12/santos_tagnin_teixeira Sarmento, L., Maia, B., Santos, D., Pinto, A. and Cabral, L. (2006), ‘Corpógrafo V3: From Terminological Aid to Semi-automatic Knowledge Engine’, in N. Calzolari, K. Choukri, A. Gangemi, B. Maegaard, J. Mariani, J. Odjik and D. Tapias (eds), Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), pp. 1502–5. Silva, R. and Santos, D. (2012), ‘Arco-íris: Notas sobre a anotação do campo semântico da cor em português’. [ONLINE] Available at http://www.linguateca.pt /acesso/ArcoIris. pdf Silva, R., Inácio, S. and Santos, D. (2008), Documentação da anotação relativa à cor no COMPARA. [ONLINE] Available at http: //www.linguateca.pt /COMPARA / DocAnotacaoCorCOMPARA.pdf Simões, A. and Santos, D. (2011), ‘Ensinador: Corpus-based Portuguese grammar exercises’. Procesamiento del Lenguaje Natural, 47, 301–9. Soares da Silva, A. (2011), ‘Measuring and parameterizing lexical convergence and divergence between European and Brazilian Portuguese’, in D. Geeraerts, G. Kristiansen and Y. Peirsman (eds), Advances in Cognitive Sociolinguistics. Berlin/New York: de Gruyter, pp. 41–83. Tagnin, S. E. O., Teixeira, E. D. and Santos, D. (2009), ‘CorTrad: A multiversion translation corpus for the Portuguese-English pair’. Arena Romanistica, 4, 314–23.

Notes 1 Standing for Acesso a Corpos / Disponibilização de Corpos (corpus access / corpus providing), stressing the relationship between users and providers. 2 PALAVRAS has as official reference Bick (2000), but has been around for at least five years before that publication date, as evidenced by Bick (1996). 3 To see the available corpora at the time, check Oksefjell and Santos (1998). 4 It was used in CoNLL (Buchholz and Marsi, 2006), discussed in Buchholz and Green (2006), and has been reused for several evaluation contests since then. 5 At that time, it was still called CrdLP: Centro de Recursos distribuído para o processamento computacional da Língua Portuguesa. 6 It has been possible to submit through the Internet text to be parsed by PALAVRAS in the context of the VISL project (led by Eckhard Bick at the University of South Denmark) for more than a decade. 7 AC/DC is implemented using the powerful Open CWB (previously IMS-CWB) corpus environment; see Evert (2009) and http://cwb.sourceforge.net/ 8 Both WPT-03 and WPT-05, corresponding to a complete web crawl of the Internet in Portugal, are available from http://www.linguateca.pt/WPT/ 9 Since COMPARA was compiled from many fiction texts owned by different publishers, almost all texts have more stringent copyright conditions than any AC/DC corpus. 10 These memories, anonymized, are only available for download in a text format from http://www.linguateca.pt/Repositorio/ 11 Available from http://www.linguateca.pt/PoNTE/ 12 To quote: ‘From our point of view, the explicit annotation scheme is the central output

236

Working with Portuguese Corpora

of our research effort, and the corpora that we develop in the process of debugging the annotation scheme should be seen as secondary by-products (though in practice it seems that this scale of priorities is not one which others can easily be persuaded to share).’ 13 It should be mentioned that the ultimate goal of Corpógrafo is to be a system for comparable corpora, although, for reasons I cannot discuss here, this has thus far not been fully attained. 14 Only Linguateca’s team has/had access to the whole set of corpora, and we know therefore that this has not been done. 15 This is not to say that the compilation of these corpora did not follow a very careful design or that the texts were chosen at random. Quite the contrary. This remark is solely aimed at the copyright issue. 16 Distribution, also called ‘repartição’ in Portuguese, indicates relative frequency according to a given feature, which can be genre, text, or any other feature that the corpus contains as meta-data. Absolute numbers do not mean anything; distribution is the first quantitative measure in a corpus that can provide some information about the linguistic phenomenon under investigation. 17 As a matter of principle I prefer a bibliography style with names in full (see http://linguistlist.org/issues/6/6-527.html), but because of editorial constraints this could not be implemented in the present chapter.

12

The Reference Corpus of Contemporary Portugueseand Related Resources Maria Fernanda Bacelar do Nascimento, Amália Mendes, Sandra Antunes and Luísa Pereira

1. Introduction The extraordinary growth of computer applications, particularly over the last two decades, has enabled the compilation and exploration of large corpora and lexica. These linguistic resources play a fundamental role in the areas of theoretical linguistics and Natural Language Engineering. Combining these two areas of knowledge can, in fact, result in the development of a large number of applications, such as new and straightforward descriptions of languages based on real data, contrastive studies between varieties of a particular language aimed at finding factors of unity and diversity, cross-linguistic contrastive studies, grammars, lexica and dictionaries, terminologies, assisted translation materials, language teaching materials, and computer tools and applications for processing natural language. Having this principle in mind and following the tradition at the Centre for Linguistics at the University of Lisbon (CLUL)1 of collecting and studying real language data, a large electronic corpus – the Corpus de Referência do Português Contemporâneo (Reference Corpus of Contemporary Portuguese, CRPC) – has been compiled and constantly updated and enlarged at CLUL since 1988. The CRPC currently contains approximately 310 million words, searchable through a user-friendly interface, and it is envisaged as a monitor corpus (from which one can extract balanced subcorpora) that can serve as a sample of the Portuguese language (in both its written and spoken varieties). In the next sections, we will describe the CRPC and how it forms the basis for important resources developed at CLUL.

2. The Reference Corpus of Contemporary Portuguese The CRPC is a large electronic Portuguese corpus that has been under development at CLUL since 1988.2 Presently, this corpus contains 311.4 million words from written and spoken language, reflecting both regional and national varieties of Portuguese

238

Working with Portuguese Corpora

Table 12.1 Varieties covered by the CRPC and number of tokens for each Characteristics Time span Mode Spoken registers Written text types

Words Before 1900 1901–1970 1970–2012 Written Spoken Formal Informal Newspapers Books Periodicals Parliament sessions Supreme court Leaflets Correspondence Miscellania

1,000,000 2,600,000 307,800,00 309,812,943 1,652,707 431,228 1,221,479 110,503,376 20,557,296 7,581,850 163,267,089 2,927,953 80,833 88,370 4,806,176

(Portugal, Brazil, Angola, Cape Verde, Guinea-Bissau, Mozambique, São Tomé and Principe, Goa, Macao and East Timor), distributed as shown in Table 12.1. The CRPC covers written texts (309.8 million words) from different genres (e.g., newspapers, books, periodicals, parliament sessions, decisions of the Supreme Court of Justice, leaflets, correspondence, miscellaneous) and spoken transcriptions (1.6 million) of informal and formal sessions (the latter of which includes radio and television programmes – namely, features, news, talk shows, interviews, sports – as well as political speeches and debates, conferences, preaching and teaching). From a chronological point of view, the CRPC includes texts from the second half of the 19th century until today, most of which were produced after 1970 (Bacelar do Nascimento et al., 2000; Bacelar do Nascimento, 2000). Table 12.1 shows the constitution of the CRPC and its broad variety of text types. Table 12.2 Composition of the CRPC Country

Tokens

Portugal Angola Brazil Macau Cape Verde Mozambique São Tomé Guiné-Bissau Timor Goa Total

291,311,212 10,801,990 3,562,947 2,093,538 1,474,682 1,152,465 562,887 389,437 125,984 1,840 311,476,982

The Reference Corpus of Contemporary Portuguese

239

In order to enhance access to our corpus, the written part of the CRPC was recently enriched with linguistic information and metadata, undergoing a full process of cleaning and preparation for online queries. CRPC is now available on the CQPWeb platform (Mendes et al., 2012; Généreux et al., 2012),3 which enables extensive search options. Amongst the available options are: (i) restricted queries over specific text types or text varieties; (ii) queries for words, regular expressions, lemmas, part-of-speech tags, nominal chunks; (iii) the sorting and downloading of concordances results; (iv) frequency and information on the distribution of the search item in the texts of the corpus; (v) collocations information; (vi) lexical comparison of subcorpora; and (vii) subcorpora customization.4 As an example, Figures 12.1 and 12.2 show a query for the common noun lemma poder (power), restricted by both variety (Portugal only) and register (newspaper), with a concordance display.

Figure 12.1 A search for the common noun lemma poder (power) in the CRPC web interface

Figure 12.2 Concordances for the common noun lemma poder (power) from the CRPC web interface

240

Working with Portuguese Corpora

The spoken subpart of the CRPC has been developed under specific projects and is constituted by several subcorpora that will be described in Part 3.1 below. Almost all of them are publicly available, either freely or for purchase.

3. Related resources The CRPC has been used in numerous national and international research projects and academic studies, many of which have resulted in linguistic resources that are available online.5 Some of these resources deserve particular attention and will be described in the following sections.

3.1. Spoken corpora of European Portuguese 3.1.1. Português Fundamental (1984) Having Français Fondamental as a reference (as well as its Spanish counterpart), Português Fundamental was the first spoken corpus collected at CLUL and was developed between 1970 and 1974. At the time, the task of establishing the vocabulary essential for communication when teaching a foreign language was performed by teachers and textbook authors, guided only by their intuition. In order to change this method, the project aimed to provide information on the Portuguese vocabulary often used in everyday life situations. To collect this vocabulary, two corpora were compiled: the Frequency Corpus (Corpus de Frequência) and the Availability Corpus (Corpus de Disponibilidade). (i) The Frequency Corpus (Corpus de Frequência) This heterogeneous spoken corpus aimed to be representative of the Portuguese language and, as in all of these types of corpora, particular attention was given to informants’ specific sociolinguistic characteristics. It was intended that information would be from all districts of continental Portugal and islands, and include different ages, and diversified social and professional backgrounds. From a total of 1,800 recordings (approximately 500 hours) of spontaneous spoken communication on different themes of everyday life, a total of 1,400 were selected and 700,000 words were transcribed, making up the Frequency Corpus. From this corpus, the total list of occurring word forms was extracted (25,107 different word forms), together with their frequency. This list was lemmatized and used to define the set of lemmas with frequency equal to or greater than 40 (1,179 lemmas), thereby forming the Frequency Vocabulary. The sample of the transcriptions of the spoken corpus published in 1987 (140 recordings) is freely available for download.6 The transcriptions of this sample have recently been revised according to the CHAT guidelines,7 text-to-sound aligned with the EXMARaLDA software (Schmidt, 2012), and automatically lemmatized and annotated with PoS information. This updated version of the Português Fundamental Corpus is freely available for research purposes in the ELRA catalogue.8

The Reference Corpus of Contemporary Portuguese

241

Table 12.3 Example of word lists from the Fundamental Vocabulary, sorted alphabetically by lemma (including contracted forms) and by reverse frequency Alphabetical order (25,107 word forms) Frequency order (1179 lemmas) Frequency A, art a as à às da das na nas pela pelas … COMPANHIA companhia companhias companhiazitas … ZONA zona zonas

Frequency

38,973 21,907 3,977 2,579 1,395 3,859 1,209 3,138 485 335 89 123 83 39 1 262 201 61

SER, v DE NÃO E TER PARA ESTAR EU DIZER IR MAS HAVER POR … JUVENTUDE MEIO-DIA PARTO PORREIRO RURAL, adj VANTAGEM

34,740 33,160 22,519 22,090 12,968 8,938 8,268 8,262 6,887 6,724 6,392 6,350 6,143 40 40 40 40 40 40

(ii) The Availability Corpus (Corpus de Disponibilidade) Although special care was taken to cover a range of different themes in the Frequency Corpus, some vocabulary had nonetheless a very low or even zero occurrence in the recordings due to the fact that these lexical items were only used in specific contexts and because they were often replaced by deictics or other elements in the conversation. These cases were then addressed through the selection of 30 topics with lower probability of occurrence in spontaneous spoken discourse but admittedly essential to communication (e.g., politics, working relationships, the human body, health and illness, professions and trades). These topics were addressed in a supplementary survey, called the Availability Corpus (Corpus de Disponibilidade), which was implemented between 1970 and 1974. It consisted of questionnaires in which informants were asked to identify what they believed were the most significant nouns, adjectives and verbs related to the topics at hand. The result is a set of answers to questionnaires, called the Availability Corpus (which is not, in fact, a textual corpus). The questionnaires further led to a vocabulary of 481,800 words from specific topics, named the ‘Availability Vocabulary’. After 1974, a supplementary survey was administered to cover themes considered sensitive before the revolution of April 259 because of the political censorship.

242

Working with Portuguese Corpora

3.1.2. C-ORAL-ROM – Integrated reference corpora for spoken Romance languages (2004) The C-ORAL-ROM corpus was developed to answer the need to increase the spoken language resources for Romance languages.10 A multilingual corpus of spontaneous spoken language of the four main Romance languages (French, Italian, European Portuguese and Spanish) was compiled and made available. The corpus comprises 1,200,000 words (300,000 words for each language), covering both formal and informal speech. A similar approach was recently followed for Brazilian Portuguese and resulted in the C-ORAL-BRASIL, as described by Raso and Mello in this volume. The European Portuguese corpus contains 153 recordings, lasting 30 hours in total. The corpus design (with a matrix for all the languages) is represented in Table 12.4. Table 12.4 The C-ORAL ROM European Portuguese corpus contents Informal register Family/Private

Public

Total informal

Conversations Dialogues Monologues Total Conversations Dialogues Monologues Total

24,449 62,738 46,005 133,192 1,817 23,119 7,710 32,646 165,838

Business Conference Law Political Debate Political Speech Prof. Explanation Preaching Teaching Total Interviews News Reportage Scientific Press Sport Talk Shows Weather Forecast Total Private

10,215 9,750 6,315 8,923 8,649 6,473 6,127 9,822 66,274 14,570 1,859 10,762 9,923 5,676 17,396 1,930 62,116 24,365 152,755

Formal register Natural Context

Media

Telephone Total formal

The Reference Corpus of Contemporary Portuguese

243

This resource for spoken Romance languages is a benchmark for corpus design, dialogue representation, prosodic annotation, part-of-speech (PoS) tagging, multimedia storage and speech analysis. It comprises several components: (i) a multimedia corpus containing, for each text: (a) the acoustic source; (b) the orthographic transcription, in CHAT format and enriched with the tagging of terminal and non-terminal prosodic breaks; (c) session metadata containing essential information for speakers, recording the situation and session; (d) text-to-sound synchronization – namely, the alignment between the acoustic source and the transcribed utterances; (e) a second orthographic transcription with lemma and PoS tags of each form in the transcribed texts; and (f) frequency lists for both forms and lemmas; (ii) software tool for speech analysis, with simultaneous access to acoustic and textual information (WinPitch Corpus, developed by Pitch France);11 (iii) a concordancer tool, which allows searches within both text-only and PoS-tagged files (Contextes, developed by Jean Véronis);12 and (iv) appendixes containing descriptions of the four subcorpora as well as the procedures followed and choices made by each team during corpus design and preparation (e.g., orthographic transcription, lemmatization, tagging), in addition to comparative linguistic studies on lexical and structural strategies in the four languages as well as models and standard linguistic measures of spoken language variability. C-ORAL-ROM is available in two versions: (i) one with permission to explore the materials, as described above, available on eight DVDs distributed by ELRA;13 and (ii) an encrypted version (that does not allow for the full extraction of concordances, for example), available on one DVD that accompanies the 2005 book published by John Benjamins (Cresti and Moneglia, 2005; Bacelar do Nascimento et al., 2005).14

3.2 Corpora of Portuguese varieties 3.2.1 Spoken Portuguese – geographical and social varieties (2001) Considering that the use of authentic spoken texts in the teaching of Portuguese as a foreign language was not a common practice (written texts were often used to reproduce spontaneous speech), the main goal of the Spoken Portuguese corpus15 was to collect and transcribe recordings of the different varieties of the Portuguese language in the world, thereby contributing to the improvement of production and comprehension skills by students of Portuguese as a second language. This corpus represents real communication by sociolinguistically diverse speakers having Portuguese as their mother tongue or as a second language. It consists of informal conversations among acquaintances, friends and relatives as well as formal acts (radio programmes or conferences) in a total of 86 recordings (8h44m) and 91,966 tokens. The corpus covers the Portuguese spoken in Portugal (30 transcribed recordings), Brazil (20), Angola (5), Cape Verde (5), Guinea-Bissau (5), Mozambique (5), São

244

Working with Portuguese Corpora

Tomé and Principe (5), Macao (5), Goa (3) and East Timor (3), covering the period from 1970 to 2001, with 70 per cent of recordings being produced between 1990 and 2001. Users can explore this corpus to improve their listening skills, particularly their ability to understand different varieties of Portuguese not limited to spoken aspects (e.g., pronunciation, prosody, intonation contour, accent), but also including morphological, lexical, syntactic and discursive characteristics (Bettencourt Gonçalves and Veloso, 2000). This resource was first published on CD-ROMs that included the recordings, orthographic transcriptions, text-to-sound synchronization and metadata information about the speakers (e.g., origin, sex, age, professional status, level of education) as well as the place, date and situation in which the recording was made (Bacelar do Nascimento, 2001). All these materials are also freely available for download from the project’s webpage.16 The transcriptions of the Spoken Portuguese corpus have also been revised according to the CHAT guidelines, text-to-sound aligned with EXMARaLDA software, and automatically lemmatized and annotated with PoS information. This new version is freely available for research purposes in the ELRA catalogue.17

3.2.2 Africa corpus Given the notorious lack of studies on African varieties of Portuguese, two interrelated projects18 were designed and conducted to fill this gap – namely, Linguistic Resources for the Study of African Varieties of Portuguese (2006) and Properties of African Portuguese Varieties compared with European Portuguese (2008). Both provide resources for a description of five varieties, simultaneously contributing to a better understanding of the differences in spoken and written productions of native speakers from several countries. Specifically, the main goal of the Linguistic Resources for the Study of the African Varieties of Portuguese project was to constitute, analyse and make available online a 3.2-million-word corpus of written and spoken texts, including five different subcorpora comparable in size, time frame and genre, of 640,000 words each, corresponding to the varieties of Angola, Cape Verde, Guinea-Bissau, Mozambique, and São Tomé and Principe, as shown in Tables 12.5 and 12.6.

Table 12.5 Corpus design of African Portuguese varieties, by country Variety Angola Cape Verde Guinea-Bissau Mozambique São Tomé and Principe TOTAL

Spoken 27,363 25,413 25,016 26,166 25,287 129,245

Written

Total

613,495 612,120 615,404 615,297 614,563 3,070,879

640,858 637,533 640,420 641,463 639,850 3,200,124

The Reference Corpus of Contemporary Portuguese

245

Table 12.6 Design of the African Portuguese varieties corpora, written and spoken registers Text type and Register Written

Size Book Newspaper Miscellaneous Formal and Informal

Spoken Total

20% (120,000) 50% (340,000) 26% (156,000) 4% (24,000) 100% (640,000)

These corpora were automatically annotated for lemma and PoS, and some difficult cases prone to automatic error tagging were revised manually. The Africa Corpus allows for the study of each African variety, but also for inter-corpora comparative studies of the varieties, which on the one hand point to the existence of a common core vocabulary and grammar across the varieties, with variations that result from discursive and pragmatic particularities and, on the other hand, highlight aspects of linguistic unity or diversity that characterize the spoken Portuguese of all five African countries (Bacelar do Nascimento et al., 2008a; Bacelar do Nascimento et al., 2008b). The analysis of these comparable subcorpora yielded the following results: (i) lists of lemmas and forms with frequency data, broken down by subcorpus and by genre (Table 12.7); (ii) contrastive word indexes (lemmas and forms) that occur in each subcorpus with frequency data, divided by genre (Table 12.8); (iii) comparative description of the vocabulary of the subcorpora – word formation processes and syntactic and morphosyntactic phenomena – as a result of quantitative and statistical analysis; and (iv) comparative study between the linguistic processes of each variety and those of European Portuguese (Table 12.9). These lists and comparisons are freely available from the project’s webpage.19 Table 12.7 Example of word indexes (lemma and form) that occur in Angola Lemma

PoS

ABA ABA ABA ABACATE ABACATE ABACATEIRO ABACATEIRO ABACAXI ABACAXI ABACAXI

CN*

*Common Noun

CN CN CN

Word form aba abas abacate abacateiro abacaxi abacaxis

Oral

Written Total

0 0 0 0 0 0 0 0 0 0

9 8 1 1 1 1 1 2 1 1

9 8 1 1 1 1 1 2 1 1

fruto frutos frutuosa frutuoso frutuosas fuba fubas fuças fugam fundilho fundilhos

FRUTO

CN CN ADJ ADJ ADJ CN CN CN V CN CN

PoS

OR* 1 0 0 0 0 1 2 0 1 0 0

WRT** 37 8 0 0 0 6 0 2 0 0 0

ANG

PoS key: CN: Common Noun; ADJ: Adjective; V: Verb.

* Oral ** Written

FUÇA FUGAR FUNDILHO

FUBA

FRUTUOSO

Form

Lemma OR 1 0 0 0 0 0 0 0 0 0 0

WRT 21 8 1 1 0 0 0 0 0 0 0

CV OR 0 0 0 0 0 0 0 0 0 0 0

WRT 21 8 1 1 1 0 0 0 0 0 0

GUI

Table 12.8 Frequency of some forms in the five subcorpora in spoken and written registers OR 1 1 0 0 0 0 0 0 0 0 0

WRT 18 13 0 0 0 0 0 1 0 0 1

MOZ OR 3 1 0 0 0 0 0 0 0 0 0

WRT 21 28 0 0 0 1 0 0 0 1 0

ST

124 67 2 2 1 8 2 3 1 1 1

TOTAL

246 Working with Portuguese Corpora

The Reference Corpus of Contemporary Portuguese

247

Table 12.9 Examples of a comparative analysis of each variety with European Portuguese Linguistic process

Varieties

Word formation

Mozambican Portuguese emplasticar depressar sabadal

European Portuguese plastificar apressar relativo a sábado

Verbal government

Mozambican Portuguese há filhos que pais não filhos pais

European Portuguese há filhos que pais não filhos pais

Nominal and subjectpredicate agreement

Angolan Portuguese

European Portuguese

muito cara muito caras São Tomean Portuguese European Portuguese a maneira a dar-lhe com um machim maneira de lhe dar com um machim

Clitics

Cape Verdean Portuguese e roupas

European Portuguese e roupas

Guinea-Bissau Portuguese European Portuguese comportar na sociedade comportar na sociedade na sociedade

3.3 Manually annotated subcorpora 3.3.1 LE-PAROLE Corpus (1998) The LE-PAROLE corpus is the result of a European initiative to develop corpora and lexica for all European Union (EU) languages according to mutual design and composition principles, using both linguistic and computer resources already available in the EU countries. The use of common tools and integrated models ensured the development of multilingual resources and comparable studies (for more details, see the project’s webpage).20 The languages involved in the LE-PAROLE project are Belgian, French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian,

248

Working with Portuguese Corpora

Portuguese and Swedish. For each language, a 3-million-word corpus was built on a shared set of design, markup, and annotation, including a 250,000-word corpus tagged with PoS annotation (Bacelar do Nascimento et al., 1998). The LE-PAROLE corpora were classified and encoded according to the common core Parole Encoding Standard. There were agreed parameters for time of production (no texts older than 1970 were allowed) as well as for publication medium (inclusion of specific proportions of texts from a closed set of categories – namely, ‘book’, ‘newspaper’, ‘periodical’, and ‘miscellaneous’). The content of the Portuguese corpus in terms of percentage of tokens is as follows: MM MM MM MM

Newspaper: about 65 per cent, from 1996 to 1997, 3 publications; Book: about 20 per cent, 12 titles from 3 publishing houses; Periodical: about 5 per cent, 7 weekly issues of 1 title, from 1996; and Miscellaneous: about 10 per cent, 8 titles.

As for the corpus annotation, an equal proportion of the corpus (up to 250,000 running words) was PoS-tagged for each language, based on a common core PAROLE tagset that was extended to include a set of language-specific features. Disambiguation was checked manually. The Portuguese corpus (raw and annotated) is available for sale in ELRA’s catalogue.21 For most of the languages involved, a lexicon of 20,000 entries was also implemented (see Part 3.4.2 below).

3.3.2 CINTIL corpus (2006) CINTIL – Corpus Internacional do Português22 – is a corpus of Portuguese made up of 1 million annotated tokens, hand-checked by expert human annotators (Branco and Silva, 2004; Barreto et al., 2006). The corpus is the result of a joint enterprise between CLUL and FCUL (Faculty of Sciences of the University of Lisbon) in the scope of the TagShare project.23 The annotation comprises information on parts of speech, openclass lemma and inflection, multiword expressions (adverbial expressions and closed PoS classes), multiword proper names (for named entity recognition) and specific tags Table 12.10 Design of the CINTIL corpus

Written

News Fiction Other Other Total

Spoken

Formal/Natural Formal/Media Formal/Phone Informal/Private Informal/Public Informal/Phone Total

Grand total

33.96% 16.8% 7.07% 7.07% 57.8%

404,690 200,194 84,240 84,240 689,124

8.18% 7.45% 4.05% 18.26% 4.05% 0.19% 42.2% 100%

97,499 88,727 48,284 217,604 48,221 2,287 502,622 1,191,746

The Reference Corpus of Contemporary Portuguese

249

for spoken discourse (to account for extralinguistic and paralinguistic elements and fragmented words).24 The corpus contains both written (58 per cent) and spoken (42 per cent) texts, as presented in Table 12.10. The corpus is available online for concordancing through a user-friendly interface that allows for both simple and advanced search options. The available search options include simple orthographic forms, regular expressions, PoS, nominal and verbal inflection, named entities and metadata. Furthermore, the complete corpus is available for sale in ELRA’s catalogue.25

3.4 Lexica 3.4.1 Fundamental vocabulary of Portuguese As mentioned in Part 3.1.1, the Português Fundamental corpus was used to identify the fundamental lexicon for European Portuguese. The Frequency Corpus (spoken corpus) and the Availability Corpus (questionnaires) were the source of two specific vocabularies, which resulted in the Fundamental Vocabulary of Portuguese, with 2,217 words, published in 1984 (Bacelar do Nascimento et al., 1984). Two further volumes were published containing a detailed description of the methods used in compiling, analysing and establishing this vocabulary in addition to a set of documents resulting from these collections and analysis that included a sampling of the transcriptions of the recorded conversations; lemmatized lists with frequency information, sorted alphabetically and by decreasing frequency, extracted from both corpora; and a joint list of the lemmas of these two corpora (Bacelar do Nascimento et al., 1987a, 1987b).

3.4.2 PAROLE/SIMPLE lexicon (2000) The PAROLE lexica cover 12 of the 15 languages included in the LE-PAROLE project – namely, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish. The lexica contain 20,000 entries each per language, including both PoS and syntactic information, according to the PAROLE common encoding standards. The Portuguese lexicon (Marrafa et al., 1999) is available for sale in ELRA’s catalogue.26 A follow-up of this work was undertaken by the SIMPLE (Semantic Information for Multifunctional Plurilingual Lexica) project,27 which aimed to incorporate semantic information into a set of morphosyntactic units of the PAROLE lexica. This resulted in a set of mutual multifunctional syntactic lexica. The attributes of the semantic units include examples, definitions, semantic features (domain, inheritance template, partial qualia information) and relations (synonymy, hyponymy and predicative representation). The resulting SIMPLE Portuguese lexicon contains 10,438 such units.28 The Portuguese PAROLE lexicon is also available on the ELRA catalogue.29

3.4.3 Multifunctional computational lexicon of contemporary Portuguese (2000) Another lexical resource based on a subcorpus of the CRPC is the Multifunctional Computational Lexicon, which provides frequency, lemma, and PoS information and

250

Working with Portuguese Corpora

Figure 12.3 Corpus design for the extraction of the Multifunctional Computational Lexicon of Contemporary Portuguese followed what were at the time state-of-the-art standards (cf. Bacelar do Nascimento, 2006). Special care was taken with the design of the subcorpus that gave rise to the lexicon: the corpus contains 16,210,438 words, from both spoken (856,195 words) and written (15,354,243 words) registers, as shown in Figure 12.3. From this subcorpus, 26,443 lemmas were then extracted, resulting in 140,315 tokens (with a minimum lemma frequency of 6) that constitute the lexicon. All word forms were automatically tagged, lemmatized and manually revised (Bacelar do Nascimento, 2006; Amaro and Barreto, 2004). Thus, each token and lemma is followed by both morphosyntactic and quantitative information regarding Table 12.11 Example of the two lists of the Multifunctional Computational Lexicon of Contemporary Portuguese sorted alphabetically by lemma and by decreasing frequency Alphabetical order

Decreasing order

@ a (P) # 23858 a (P) # 7258 à (S+P) # 1020 as (P) # 4040 às (S+P) # 486 da (S+P) # 3243 das (S+P) # 1603 … @ aba (N) # 117 aba (N) # 63 abas (N) # 54 … @ lisonjeiro (A) # 11 lisonjeiro (A) # 6 lisonjeiras (A) # 2 lisonjeiros (A) # 2 lisonjeira (A) # 1

@ a (P) # 23858 a (P) # 7258 à (S+P) # 1020 as (P) # 4040 às (S+P) # 486 da (S+P) # 3243 … @ ambicioso (A) # 231 ambicioso (A) # 115 ambiciosa (A) # 59 ambiciosos (A) # 48 ambiciosas (A) # 9 … @ zombeteiro (A) # 6 zombeteiro (A) # 4 zombeteira (A) # 1 zombeteiras (A) # 1

The Reference Corpus of Contemporary Portuguese

251

the number of occurrences in the corpus. This quantitative information is presented as both a probability value (to account for the fact that there is an error rate regarding the PoS tagging) and exact values of frequency. The lexicon entries are also listed in both alphabetical and reverse frequency order, as shown in Table 12.11. All the data are available for download from the project’s webpage, where more detailed information on the project can also be found.30

3.4.4 COMBINA-PT: word combinations in the Portuguese language (2006) Another lexical resource based on the CRPC is a lexicon of multiword expressions (Mendes et al., 2006), developed in the scope of the COMBINA-PT project.31 As it is widely known, the lexicon does not consist mainly of single lexical items, but appears to be populated with numerous chunks that are more or less predictable, although not fixed (Firth, 1955). In fact, the availability of large amounts of textual data and the development of computer technologies and corpus-based approaches have enabled the identification of complex patterns of word associations, showing that speakers use a large number of prefabricated phrases that constitute single choices (Sinclair, 1991). For the extraction of lexical associations, a subcorpus of the CRPC was designed, with 50 million written words from different genres, as illustrated in Table 12.12. Table 12.12 Corpus design for the extraction of lexical associations Genre Newspapers Books

Magazines Miscellaneous Leaflets Supreme court decisions Parliament sessions Grand total

Words Fiction Technical Didactic Total Informative Technical Total

30,000,000 6,237,551 3,827,551 852,787 10,917,889 5,709,061 1,790,939 7,500,000 1,851,828 104,889 313,962 277,586 50,966,154

The extraction of significant associations was implemented through a software application that extracts groups of 2, 3, 4 and 5 tokens from the corpus, together with several types of information, including: (i) the number of elements of the group; (ii) the distance between the elements of the group – groups of 2 tokens can be either contiguous or separated by a maximum of 3 tokens (e.g., conjuntura internacional (international situation), conjuntura económica internacional (international economic situation), whereas groups of more than 2 tokens must be contiguous;

252

Working with Portuguese Corpora

(iii) the frequency of the group at a specific distance and in all occurring distances; (iv) the frequency of each element of the group in the corpus; (v) lexical association measure (the groups automatically extracted are sorted using the lexical association measure Mutual Information (MI));32 and (vi) concordance lines (in KWIC format) of the group in the corpus, together with an index code pointing to the occurring position in the corpus. A sample of the large candidate list extracted from the corpus was hand-checked. This sample was established using the best MI values, ranging between 8 and 10 (for details on the results of the MI statistical measure, see Evert and Krenn (2001) and Pereira and Mendes (2002)). Through the manual validation, multiword expressions that presented some syntactic and semantic cohesion were selected, paying particular attention to four important aspects: (i) lexical and syntactic fixedness that can be observed through the possibility of replacing elements, inserting modifiers and changing the syntagmatic structure or gender/number features; (ii) total or partial loss of compositional meaning, indicating that the meaning of the expressions cannot be predicted by the meaning of the parts; (iii) frequency of occurrence, which might reveal sets of favoured co-occurring forms (i.e., expressions that might be semantically compositional but occur with a higher frequency than any other alternative expression of the same concept, which could point to an initial stage of lexicalization); and (iv) syntactic category of the groups (verbal phrases, noun phrases, sentences). The lexicon covers multiword expressions with different degrees of lexicalization, ranging from idiomatic expressions (i.e., fully lexicalized) with lexical and syntactic restrictions – deitar cedo e cedo erguer dá saúde e faz crescer (early to bed and early to rise makes a man healthy, wealthy and wise); a sangue frio (in cold blood) – to collocations (i.e., non-idiomatic expressions where the elements reveal a tendency to co-occur in certain contexts (lufada de ar fresco (breath of fresh air); condenar ao fracasso (doomed to failure)). The lexicon of multiword expressions33 is organized as follows: each multiword expression in the lexicon is linked to the lemma of its node word – namely, a single word from different PoS categories (e.g., fogo (fire)) – and is also linked to a group lemma, which corresponds to the canonical form of the expression (e.g., arma de fogo (firearm)) and covers all the variants of the multiword expressions that occurred in the corpus (e.g., arma de fogo (firearm); armas de fogo (firearms)). In all, the lexicon comprises 1,180 main lemmas, 14,153 group lemmas and 48,154 word combinations (Bacelar do Nascimento et al., 2006; Mendes et al., 2006). This resource can be useful in several areas, including psycholinguistics (development of hypotheses about the representation of an individual’s mental lexicon, semantic memory and cognitive processes in general), lexicography (improvement of coverage in modern dictionaries), second-language acquisition (enhancement of acquiring significant word combinations in Portuguese, which will make the student’s speech and writing sound more natural) and computational linguistics (development

The Reference Corpus of Contemporary Portuguese

253

and evaluation of language-processing tools capable of dealing with expressionspecific issues, like automatic unit recognition or tagging and parsing problems).

4. Conclusion For more than twenty years, the CRPC (Corpus de Referência do Português Contemporâneo) has been enlarged and updated for new technologies, having become the reference monitor corpus of Portuguese that it is today. It has been widely used in both linguistic research projects and the development of software tools for the computational processing of Portuguese. The most relevant of these projects have been mentioned here, but many more were carried out at CLUL, along with numerous academic studies that also used the CRPC as the basis for their research. To continue improving this large corpus, future work should include: (i) adding more searchable metadata tags on the CQPWeb platform; (ii) introducing a language spotter for the few pockets of foreign language present in the corpus; (iii) enlarging linguistic annotation to cover nominal and verbal inflection and addressing issues related to multiword expressions; (iv revising the design of the corpus to improve representativeness; and (v) contacting publishers and authors for copyright clearance in order to make part of the CRPC freely available.

References Amaro, R. and Barreto, F. (2004), ‘Multifunctional Computational Lexicon of Contemporary Portuguese: An available resource for multitype applications’, in Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC), pp. 1075–8. Bacelar do Nascimento, M. F. (2000), ‘O Corpus de Referência do Português Contemporâneo e os projectos de investigação do Centro de Linguística da Universidade de Lisboa sobre variedades do português falado e escrito’, in E. Gärtner, C. Hundt and A. Schönberger (eds), Estudos de Gramática Portuguesa (I). Frankfurt am Main: Biblioteca Luso-Brasileira, Centro do Livro e do Disco de Língua Portuguesa, pp. 185–200. —(ed.) (2001), Português Falado, Documentos Autênticos, Gravações audio com transcrições alinhadas (CD-ROM), Lisbon: CLUL e Instituto Camões. —(2006), ‘Um novo léxico de frequências do português’, in Miscelânea de Estudos in memoriam José Herculano de Carvalho, Revista Portuguesa de Filologia, Vol. XXV, tomo 1, Instituto de Língua e Literatura Portuguesa; Faculdade de Letras da Universidade de Coimbra, Coimbra, pp. 121–40. Bacelar do Nascimento, M. F., Garcia Marques, M. L. and Segura da Cruz, M. L. (1984), Português Fundamental. Vocabulário e Gramática. Lisbon: INIC, CLUL.

254

Working with Portuguese Corpora

—(1987a), Português Fundamental. Métodos e Documentos. Inquérito de Frequência, Vol. 1. Lisbon: INIC, CLUL. —(1987b), Português Fundamental. Métodos e Documentos. Inquérito de Disponibilidade, Vol. 2. Lisbon: INIC, CLUL. Bacelar do Nascimento, M. F., Marrafa, P., Pereira, L. A. S., Ribeiro, R., Veloso, R. and Wittmann, L. (1998), ‘LE-PAROLE – Do corpus à modelização da informação lexical num sistema-multifunção’, in Proceedings of XIII Encontro Nacional da Associação Portuguesa de Linguística, pp. 115–34. Bacelar do Nascimento, M. F., Pereira, L. and Saramago, J. (2000), ‘Portuguese corpora at CLUL’, in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), pp. 1603–7. Bacelar do Nascimento, M. F., Bettencourt Gonçalves, J., Veloso, R., Antunes, S., Barreto, F. and Amaro, R. (2005), ‘The Portuguese corpus’, in E. Cresti and M. Moneglia (eds), C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins, pp. 163–207. Bacelar do Nascimento, M. F., Mendes, A. and Antunes, S. (2006), ‘Typologies of multiword expressions revisited: A corpus-driven approach’, in Y. Kawaguchi, S. Zaima and T. Takagaki (eds), Spoken Language Corpus and Linguistic Informatics. Amsterdam: John Benjamins, pp. 227–44. Bacelar do Nascimento, M. F., Gonçalves, J. B. , L. Pereira, L., Estrela, A. and Oliveira, S. (2008a), ‘Aspectos de unidade e diversidade do português: As variedades africanas face à variedade europeia’. Veredas, 35–60. Bacelar do Nascimento, M. F., Estrela, A., Mendes, A. and Pereira, L. (2008b), ‘On the use of comparable corpora of African varieties of Portuguese for linguistic description and teaching/learning applications’, in Proceedings of the Second Workshop on Building and Using Comparable Corpora (LREC), pp. 39–46. Barreto, F., Branco, A., Ferreira, E., Mendes, A., Bacelar do Nascimento, M. F., Nunes, F. and Silva, J. R. (2006), ‘Open resources and tools for the shallow processing of Portuguese: The TagShare project’, in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pp. 1438–43. Bettencourt Gonçalves, J. and Veloso, R. (2000), ‘Spoken Portuguese: Geographic and Social Varieties’, in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), pp. 905–8. Branco, A. and Silva, J. (2004), ‘Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese’, in Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), pp. 507–10. Church, K. W. and Hanks, P. (1990), ‘Word association norms, mutual information, and lexicography’, Computational Linguistics, 16, (1), pp. 22–9. Cresti, E. and Moneglia, M. (eds) (2005), C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins. Evert, S. and B. Krenn (2001), ‘Methods for the Qualitative Evaluation of Lexical Association Measures’, in Proceedings of the 39th Meeting of ACL, pp. 188–95. Firth, R. J. (1955), ‘Modes of meaning’, in Papers in Linguistics 1934–1951, Oxford: Oxford University Press, pp. 190–215. Généreux, M., Hendrickx, I. and Mendes, A. (2012), ‘Introducing the Reference Corpus of Contemporary Portuguese online’, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), pp. 2237–44. Marrafa, P., Gonçalves, J. B., Mendes, A. and Veloso, R. (1999), ‘A sintaxe do LE-PAROLE’, in P. Marrafa, and M. Mota (eds), Linguística Computacional.

The Reference Corpus of Contemporary Portuguese

255

Investigação Fundamental e Aplicações, Lisbon: Associação Portuguesa de Linguística/ Edições Colibri, pp. 191–205. Mendes, A., Antunes, S., Bacelar do Nascimento, M. F., Casteleiro, J. M., Pereira, L. and Sá, T. (2006), ‘COMBINA-PT: a large corpus-extracted and hand-checked lexical database of Portuguese multiword expressions’, in Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), pp. 1900–5. Mendes, A., Généreux, M., Hendrickx, I., Pereira, L., Bacelar do Nascimento, M. F. and Antunes, S. (2012), ‘CQPWeb: uma nova plataforma de pesquisa para o CRPC’, in Textos Seleccionados do XXVII Encontro Nacional da Associação Portuguesa de Linguística. Lisboa: APL, pp. 466–77. Pereira, L. and Mendes, A. (2002), ‘An electronic dictionary of collocations for european Portuguese: methodology, results and applications’, in Proceedings of the 10th International Congress of EURALEX, pp. 841–49. Sinclair, J. (1991), Corpus, Concordance and Collocation. Oxford: Oxford University Press. Schmidt, T. (2012), ‘EXMARaLDA and the FOLK tools – two toolsets for transcribing and annotating spoken language’, in Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), pp. 236–40.

Notes 1 http://www.clul.ul.pt/index.php 2 http://www.clul.ul.pt/en/resources/183-reference-corpus-of-contemporaryportuguese-crpc 3 http://alfclul.clul.ul.pt/CQPweb/ 4 For more information about all the query options, please consult: http://alfclul.clul.ul.pt/CQPweb/doc/CRPCmanual.v1_en.pdf; http://alfclul.clul.ul.pt/CQPweb/doc/shortsyntax.v1_en.pdf 5 http://www.clul.ul.pt/en/resources 6 https://www.clul.ul.pt/en/resources/ 84-spoken-corpus-qportugues-fundamental-pfq-r 7 http://childes.psy.cmu.edu/manuals/CHAT.pdf 8 http://catalog.elra.info/product_info.php?products_id=1173 9 On April 25, 1974, a military coup overthrew, with the support of the population, the dictatorial regime, in what is known as the Carnation Revolution (Revolução dos Cravos). 10 http://www.clul.ul.pt/en/research-teams/189-c-oral-rom-integrated-reference-corporafor-spoken-romance-languages http://lablita.dit.unifi.it/coralrom 11 http://www.winpitch.com/ 12 http://sites.univ-provence.fr/veronis/logiciels/Contextes/index-fr.html 13 http://catalog.elra.info/product_info.php?products_id=757 14 http://benjamins.com/#catalog/books/scl.15/main 15 This corpus was the result of the project entitled Spoken Portuguese – Geographical and social varieties (more information at http://www.clul.ul.pt/en/ research-teams/195-spoken-portuguese-geographical-and-social-varieties). 16 http://www.clul.ul.pt/en/research-teams/ 195-spoken-portuguese-geographical-andsocial-varieties 17 http://catalog.elra.info/product_info.php?products_id=1172

256

Working with Portuguese Corpora

18 http://www.clul.ul.pt/en/research-teams/186-linguistic-resources-for-the-study-of-theafrican-varieties-of-portuguese http://www.clul.ul.pt/en/research-teams/185-properties-of-african-portuguesevarieties-compared-with-european-portuguese 19 http://www.clul.ul.pt/ en/research-teams/186-linguistic-resources-for-the-study-ofthe-african-varieties-of-portuguese 20 http://www.clul.ul.pt/en/research-teams/197-le-parole 21 http://catalog.elra.info/product_info.php?products_id=765 22 http://cintil.ul.pt 23 http://www.clul.ul.pt/en/research-teams/188-tagshare-tagging-and-shallowmorphosyntactic-processing-tools-and-resources http://tagshare.di.fc.ul.pt/ 24 The annotation guideline with all the linguistic information encoded in CINTIL is available at: http://nlxserv.di.fc.ul.pt/tagsharecorpus/guidelines.pdf 25 http://catalog.elra.info/ product_info.php?products_id=1102 26 http://catalog.elra.info/product_info.php?products_id=713 27 http://www.clul.ul.pt/en/research-teams/196-simple-semantic-information-formultifunctional-plurilingual-lexica 28 For more information on the LE-PAROLE+SIMPLE projects, visit the following website: http://www.ub.edu/gilcub/SIMPLE/simple.html 29 http://catalog.elra.info/product_info.php?products_id=713 30 http://www.clul.ul.pt/en/research-teams/194-multifunctional-computational-lexiconof-contemporary-portuguese 31 http://www.clul.ul.pt/en/research-teams/187-combina-pt-word-combinations-inportuguese-language 32 Mutual information is a lexical association measure that calculates the frequency of each group in the corpus and contrasts this frequency with the corpus frequency of each word of the group (Church and Hanks, 1990). 33 The lexicon and the user manual are available from the project’s webpage (http://www. clul.ul.pt/sectores/linguistica_de_corpus/manual_combinatorias_online.php) and the Meta-Share repository (http://www.meta-net.eu/meta-share).

13

C-ORAL-BRASIL: Description, Methodology and Theoretical Framework Tommaso Raso and Heliana Mello

1. Introduction The C-ORAL-BRASIL (Raso and Mello, 2012)1 is a Brazilian Portuguese spontaneous speech corpus, especially representative of the Mineiro (Minas Gerais state) diatopy, primarily from the metropolitan area of the state capital, Belo Horizonte. The texts were recorded between 2006 and 2011 using sophisticated wireless equipment in order to guarantee highly accurate acoustic quality. C-ORAL-BRASIL is structured to be comparable with the C-ORAL-ROM corpora (Cresti and Moneglia, 2005)2 for French, Italian, Spanish and European Portuguese. Here, we will provide key information about the corpus as well as the motivation for its architecture and sampling in order to demonstrate the advantages that the methodology that was followed affords for the study of spontaneous speech, mainly from a pragmatic perspective (Mello, 2012). The corpus DVD contains: MM

MM

MM MM

MM

The multimedia corpus. Each individual text is offered in the following formats: audio (wav), transcription (rtf) and aligned file (xml) through WinPitch software (Martin, 2005) as well as txt files; the metadata: title, file name, participant’s name abbreviation and main sociolinguistic characteristics (gender, age, school level, occupation and role played in the interaction), recording date, place, context and topic, corpus branch, duration in time and number of words, acoustic quality, transcribers’ and revisers’ names, and any commentary considered useful; the tagged corpus (Bick, 2012) in both a full and a simplified version (xml and txt); frequency lists, spreadsheets with relevant measurements and statistics about the informants; and a book, in PDF format, in which audio examples are linked to the text carrying the corpus description, a presentation of the theory behind it, the explanation of transcription and segmentation criteria and their validation, a discussion of the main speech measurements, and a description and discussion of the parser used for the lexical and morphosyntactic tagging.

258

Working with Portuguese Corpora

The corpus transcription format follows CHAT (MacWhinney, 2000), implemented for prosodic annotation (Moneglia and Cresti, 1997); the corpus is segmented into utterances and tone units (Raso, 2012b; Mello et al., 2012). An utterance is defined as the minimal unit with pragmatic autonomy, and its identification is marked by a prosodic break perceivable as terminal. (1) (bfammn03)3 *ALO: mas os filho também nũ são fácil também juntou os filho todo foram lá e trouxeram o corpo na força // (but the sons too they are not easy either they all meet (they) go there and bring the body by force) The linguistic sequence in Example (1) can, in principle, be segmented in different ways. A simple reading induces the interpretation of mas os filho também nũ são fácil também (‘but the sons too they are not easy either’) as an entity (i.e., one utterance), as it is syntactically autonomous; the rest can be interpreted as one or more entities. Nevertheless, by listening to the sequence, it is clear that there is just one autonomous entity, segmentable as in Example (2). (2) *ALO: mas os filho também nũ são fácil também / juntou os filho todo / foram lá e trouxeram o corpo na força // The double slash marks a terminal break (i.e., the utterance boundary) whereas the single slash marks a tone unit boundary. In fact, the first part of the sequence could seem autonomous in reading, but is not perceived as such by listening to the actual recording. Example (3) shows an opposite situation: (3) (bfamdl04) *SIL: tava no jornal // de ontem // (it was on the newspaper // from yesterday) Here, we seem to have only one utterance, with a clause and a temporal adjunction. However, listening to it, we notice two autonomous units (marked by slashes). Finally, Example (4) shows that the same linguistic sequence can be interpreted by a reader as a negative assertion, although listening to it makes it clear that it is an affirmative assertion: (4) (bpubdl01) *PAU: não // tá dando a altura daquele que a Isa marcou lá / né // (no // it has the height of that one that Isa marked there / doesn’t it //)

which is also interpretable by the reader as: (it doesn’t have the height of that one that Isa marked there / does it //)

These examples illustrate the fact that terminal breaks do not necessarily match the pauses that can be heard in the sound files. The opposite is also true: a pause, even a long one, does not imply an utterance boundary. Therefore, only by listening to a verbal sequence is it possible to determine where a pragmatically interpretable

The C-ORAL-BRASIL Corpus

259

prosodic unit ends. Hence, it is not possible to prosodically analyse speech without audio, nor is it possible to transcribe speech without marking the reference units that allow for its segmentation. These elements are not readily noticed in transcriptions where prosodic features are missing, nor can speech reference units be automatically inferred from pauses as pauses do not necessarily coincide with prosodic breaks (Moneglia, 2005). The main reasons why the C-ORAL-BRASIL is speech-to-text-aligned by utterance are as follows. Alignment is a crucial aspect in the study of speech. Transcriptions that do not explicitly mark speech features cannot claim to represent speech. In such transcriptions, the real object of the study would be the transcription itself, which represents a special variety of writing, but not the actual speech. Crucially, these transcriptions lack prosodic information. In our view, it is impossible to study speech adequately without its acoustic information as that alone allows for the recognition of the main categories of speech, illocution being the most basic one (Cresti, 2000b).4 In fact, Example (5) shows that a pure syntactic and semantic analysis that does not result from an illocutive interpretation cannot account for the understanding of speech. (5) (bfamdl04) *KAT: o quê // (what //) *SIL: copos // copos de Urano / que tem aí // (glasses // glasses from Urano / that are here //) *KAT: copos de quê // (glasses made of what //) *SIL: Urano // (Urano //) *KAT: Urano // (Urano //) *SIL: é // Urano // Urano // (yeah // Urano // Urano) It is only through the inspection of the different illocutions that we can recover the different meanings of Urano in the different utterances; the first illocution (*SIL) stands for a confirmation, the second (*KAT) for an incredulity expression and the third (with two illocutions; *SIL) for conclusions, performed with two different attitudes. Such communicative functions cannot be recovered through just the semantic or morphosyntactic forms alone.

2. The architecture By ‘spontaneous speech’, we mean speech that is planned at the same moment it is performed – namely, speech that does not deliver a previously (totally or partially) planned text, like acted speech or even a previously planned discourse (Nencioni, 1983; Cresti, 2000a; Biber, 1988; Blanche-Benveniste et al., 1990; Miller and Weinert, 1998; Givón, 1979; Moneglia, 2005, 2011). Spoken events that can be classified as spontaneous show: (i) multimodal face-to-face interaction (including not only speech, but also facial expressions, gestuality, paralinguistic and extra-linguistic elements);

260

Working with Portuguese Corpora

(ii) intersubjective reference to the deictic space (people engaged in the speech event share deictic reference – the ‘here and now,’ as it were); (iii) mental programming concomitant with vocal performance (there is no pre-programming for what is going to be uttered; speech flows concomitantly with the cognitive processing involved in the interaction); and (iv) contextually undetermined linguistic behaviour – namely, unforeseen behaviour (linguistic behaviour cannot be predicted based on knowing the speakers or the situation, as the interaction unfolds without any pre-programming). A long tradition of sociolinguistic studies (Berruto, 1987; Biber and Conrad, 2001; Biber et al., 1998; Gadet, 1996a, 1996b, 1997, 2000, 2003; Halliday, 1989) has focused on the value of sociological and contextual parameters in defining speech qualities and has highlighted their variability. There are many types of spontaneous speech, and they vary according to the following parameters: (a) the possible structural varieties of the communicative event (monologue, dialogue, conversation); (b) the communicative channel; (c) the sociological context – namely, the social domain of the event (family, private, public life); (d) the programming conditions (partially or totally programmed versus non-programmed speech); (e) possible register and genre varieties; (f) sociolinguistic factors (gender, age, school level, speaker’s occupation); (g) geographic origin; (h) speech event task; and (i) topic. Therefore, planning a spoken corpus is a complex task that must ensure the representativeness of the principal variations explored by the different types of events in spontaneous speech (Berruto, 1987; Biber, 1988; De Mauro et al., 1993; Gadet, 1996a, 1996b, 2003). Speech resources built with the purposes of developing technology (telephone information, health interactions, map tasking) have been produced in controlled situations. This allows for very high acoustic quality but, on the other hand, represents restricted semantic domains with highly foreseeable linguistic behaviour. C-ORAL-BRASIL, like C-ORAL-ROM, includes data from natural contexts, which necessarily reduces acoustic quality and can cause problems during the recording process. C-ORAL-BRASIL made great efforts to obtain the best possible acoustic quality for recordings made in different environments by using sophisticated wireless equipment. An important goal of this corpus is to achieve comparability with the C-ORAL-ROM corpora. Comparable corpora have been built for written language comprising parallel corpora or corpora on the same specialized topic. For spoken corpora, topic comparability implies building up corpora of readings, leading to a loss in spontaneity. In speech, comparability can be easily achieved only in strongly controlled situations, but if we assume that – in order for speech to be properly documented – it is necessary to have it recorded with the largest possible textual typology variation. Consequently, the

The C-ORAL-BRASIL Corpus

261

more textual variation we have, the less comparability we achieve. Therefore, comparability among the corpora of the C-ORAL projects results from the application of specific compilation parameters. Thus far, the C-ORAL-BRASIL corpus project has launched only the informal part of its spontaneous speech corpus, which comprises casual interactions in family or public settings. The formal part, which includes lectures, professional speech, media and telephone interactions, is still being compiled. The informal corpus features 208,130 words, distributed in 139 texts of approximately 1,500 words each. A few texts included might be larger (up to 5,000 words) or smaller (so long as they maintain textual autonomy). The 139 texts were divided into two contexts: private/familiar (159,364 words) and public (48,766 words); for each context, the texts were divided similarly among three interactional typologies: monologues, dialogues and conversations (i.e., dialogic texts with more than two participants). Texts were transcribed using the CHILDES-CLAN format (MacWhinney, 2000) implemented for prosodic annotation (Moneglia and Cresti, 1997). The prosodic annotation features the segmentation of the speech flow in utterances (double slash) and tone units (single slash);5 interrupted utterances (+) and retractings ([/n])6 are also annotated. Transcriptions follow traditional orthography, with significant exceptions due to the need to capture speech phenomena that can exhibit ongoing grammaticalization and lexicalization processes so that they can be computed and statistically studied.7

2.1 Pragmatic perspective and diaphasic variation A corpus of true spontaneous speech must portray in the best possible way diaphasic (i.e., situational) variation. In fact, what constrains the structure of speech for the most part is not variation across speakers or topics, but rather variation across the speech actions being performed. Especially under a pragmatic perspective, it is crucial to document the differences in verbal behaviour depending on the different tasks that speakers perform in different situations. The sociolinguistic tradition allows us to identify the main domains of formal speech, but not informal speech, as it is an open-ended system. Therefore, while for formal speech it is feasible to list a number of typical contexts, given that it encompasses predictable interactional scenarios such as professional discourse, ceremonial speech and the like, in informal speech this cannot be achieved because the range of situational interactions among humans is open ended; therefore, the goal is to document the widest range of situations, as no specific context can be considered, in principle, more typical than another. In order for this goal to be achieved, considering that in general the cost (in terms of both time and money) of compiling a spoken corpus is much higher than that of a written one, it is important to maximize the effectiveness of corpus compilation situations by collecting as many different texts as possible and by setting text size to be a portion of speech that contains a fully completed textual unit. Texts averaging 1,500 words are long enough both to guarantee that the interaction will be autonomous and to include key syntactic and pragmatic properties (Blanche-Benveniste et al., 1990; Scarano, 2003), allowing for the representation of a wide variety of situations.

262

Working with Portuguese Corpora

Within the informal register, the partition between private/familiar versus public contexts refers to the role that a participant plays, whether s/he acts as an individual, as in interactions with relatives or friends, or in a professional or institutional capacity, as in interactions between clients and sellers, students and professors, or citizens and civil servants. Approximately 75 per cent of the corpus represents the private/familiar context, as this context also comprises a larger portion of natural human interactions. Each context includes three different typologies of interactions: (i) a monologic typology, in which a speaker builds a spoken text, (almost) without any interaction; (ii) a dialogic typology, in which two interlocutors interact; and (iii) a conversational typology, in which three or more speakers interact. The text characteristics are strongly conditioned by the interaction typology, especially in the opposition between monologic versus interactional.8 However, it must be highlighted that, unlike the formal register, the informal one does not have, in principle, perfect monologic texts. There will almost always be some kind of interaction. The monologic typology is built by long turns and, within them, by very highly articulated utterances with complex information structures and many tone units. The reference to the situational context is usually poor, while a great amount of cognitive contextualization is necessary. Depending on the textual typology of the monologic text, the more frequent illocutions change, but the illocutionary variation is poor. On the other hand, interactional typologies involve short turns and small and informationally patterned utterances; in these cases, the reference to the situational context is strong, rendering a high amount of verbal contextualization unnecessary while the illocutionary variation inside the same text is very high.9 In addition to this first important distinction between monologic and interactive typologies, the second most relevant factor in speech variation is associated with textual genre and the actions performed through each textual typology. In the monologic typology, speech structure depends mainly on textual genre: life narratives, professional explanation of a given issue (such as an engineer’s account of a construction procedure), argumentation, joke, recipe, story, etc. In dialogues and conversations, variation is basically due to the task that speakers perform: a chat between friends is highly different from a couple’s quarrel or an interaction between seller and client, among the players in a football game, between a personal trainer and an athlete, between mother and crying child, between two interactants performing a task together, etc. It is evident that, in each activity, the actions to be performed change, as do turn size, amount of silence, etc. These observations should be sufficient to emphasize how crucial true diaphasic variation is in a spontaneous speech corpus. Variation in speech structure cannot be picked up by variation among speakers or topics. Different speakers perform the same speech action (illocution) in basically the same way, and the change in topic in chats or interviews does not lead to structural variation (i.e., illocution- or information-based; Cresti, 2000b).

The C-ORAL-BRASIL Corpus

263

2.2 Diastratic variation Although diaphasic variation was the main parameter underlying the corpus architecture, diastratic (i.e., sociolinguistic) variation is also observed. What is methodologically important is that, while speech act (diaphasic) variation has no chance of being documented when a corpus is designed only for sociolinguistic (diastratic) variation, our methodology shows that diastratic variation is a natural consequence of diaphasic variation (Cresti and Moneglia, 2012). C-ORAL-BRASIL features 362 speakers. For 68.23 per cent of them, gender, age, origin and school level are documented. The fact that nearly one-third of the speakers are not documented in full is due to the fact that they entered the recording context in an unforeseen way. This strongly reinforces the point that the recording context was strictly uncontrolled. Moreover, they account for only 1.91 per cent of the corpus words. A breakdown of word groupings is shown in Table 13.1. Table 13.1 Word groupings: 1 to 6,309 Groupings

Speakers

1–247 words 280–627 words 649–908 words 933–1,016 words 1,134–1,400 words 1,455–1,663 words 1,777–1,994 words 2,140–2,455 words 2,611–2,901 words 3,550–3,738 words 4,211–4,327 words 6,309 words Total

161 81 37 16 26 17 7 10 2 2 2 1 362

Table 13.1 shows that 44.5 per cent of the speakers utter up to 247 words, accounting for just 3.92 per cent of the corpus. Table 13.2 shows a breakdown of these word groupings. Table 13.2 Word Groupings: 1 to 247 Groupings

Speakers

1–22 words 25–47 words 54–72 words 77–95 words 99–115 words 136–164 words 172–185 words 204–247 words Total

82 27 10 13 6 9 5 8 161

264

Working with Portuguese Corpora

Table 13.2 shows that the great majority of the non-documented speakers (109) utters up to 47 words and more than half of them (82) utter up to 22 words. These data show that the percentage of words uttered by non-documented speakers is irrelevant (1.91 per cent), but at the same time is a guarantee of spontaneity in recordings and demonstrates lack of control over contexts. Table 13.1 further shows that the corpus features a small group of speakers (5) that utter more than 3,550 words each, representing 10.63 per cent of the corpus words. These speakers appear in more than one recording in different situations and can be studied to see how the same speaker’s structure varies by context. Gender representation is balanced in terms of the number of words: 50.36 per cent of the speakers are female and 49.64 per cent are male. However, in terms of the number of speakers, the majority is female (203 versus 159 male; one informant uttered just one word and his/her gender was not identified). Age groups in the corpus match the distribution of age in the population of Brazil as follows (measured in words): 27.13 per cent of the speakers belong to group A (18–25 years old); 30.28 per cent to group B (26–40 years old); 31.01 per cent to group C (41–60 years old); 8.05 per cent to group D (more than 60 years old); 1.61 per cent are underage; and 1.91 per cent were not documented for age. The corpus is very well balanced as far as speakers 18 years and older are concerned (groups A, B and C), considering that group D (older than 60) comprises a smaller number of individuals in Brazilian society and is, therefore, smaller than the other age groups in the corpus. As far as the number of speakers is concerned, 75 are in group A (1 is registered in group A in one interaction and later in group B), 88 pertain to group B, 64 to group C, 15 to group D and 11 to group M. As for school level, the corpus has higher representation from high and mid schooling levels, which make up the bulk of the population that actually speak more standardized varieties of Brazilian Portuguese, but a low schooling level is also sufficiently represented in C-ORAL-BRASIL. In terms of word numbers, 15.79 per cent are on level 1 (no more than 7 years of schooling), 40.76 per cent on level 2 (college graduates whose degree was never used for their occupation) and 40.66 per cent on level 3 (either using their college degree for their job or have a post-graduate degree). As for the number of speakers, 46 are in group 1, 101 in group 2, 104 in group 3 and one speaker appears once in both groups 2 and 3. The last diastratic aspect to be mentioned is the speakers’ occupation, which is an open category and cannot be treated like previous ones. Looking at the metadata, the importance of occupations linked to the field of education is clear. This happens for several reasons: because professors and students who worked in the corpus compilation appear in the recordings, because they looked for informants in their social environment (which, of course, is linked to the educational system), and because age group A is largely formed of students. Nevertheless, in the group linked to education, we find students and professors from different faculties as well as teachers, school directors and school clerks of different levels. Yet a significant part of the informants have occupations outside the education system. The corpus features many shop attendants and sellers, artists, public clerks, liberal professionals from a wide range of fields (e.g., lawyers, doctors, psychologists, dentists, engineers, physiotherapists), housekeepers, technicians, brokers, craftsmen, labourers, masons, managers, farmers and many other occupations.

The C-ORAL-BRASIL Corpus

265

2.3 Other aspects of the architecture 2.3.1 Diatopy As previously mentioned, the diatopic variation of C-ORAL-BRASIL is essentially that of the Mineiro (i.e., from the southeastern state of Minas Gerais) variety of Brazilian Portuguese. A corpus of this size must strive to incorporate other types of variation (diaphasic, diastratic, etc.) within a single diatopy. The same happens with the C-ORAL-ROM corpora, which represent the regions of Madrid, Marseille, Florence and Lisbon (Cresti et al., 2002). In all of the corpora, speakers of other regions and countries are present, as a large metropolitan area implies a percentage of people from outside; however, what is mandatory for each corpus is that more than 50 per cent of the speakers directly represent the chosen diatopy. For the C-ORAL-BRASIL, this diatopy is the metropolitan area of Belo Horizonte, the capital city of the state of Minas Gerais. Table 13.3 shows the range of speaker origins. Table 13.3 Speakers’ origins Origin Belo Horizonte Other cities in Minas Gerais state Other Brazilian states Other countries Unknown Total

Speakers 138 89 19 2 114 362

Apart from those speakers whose origin was unknown – who, as we have already seen, account for a tiny percentage of the corpus tokens – 55.6 per cent of the speakers are from Belo Horizonte and 35.9 per cent from other cities in the state of Minas Gerais (many from the nearby cities of Contagem, Betim and Sete Lagoas). Therefore, 91.5 per cent of the documented speakers represent the Mineiro variety; more specifically, more than 50 per cent represent the Belo Horizonte area.

2.3.2 Overview of textual specifications One of our main efforts was to avoid as much as possible, especially in dialogues, the incidence of chats (casual conversations with no specific purpose) and interviews; that is, situations in which the interlocutors do not perform an activity other than speaking. These are already the most well-documented situations in other oral corpora available, the easiest to record, and also the least interesting if the goal is to document speech structure. Among dialogues, only eight of the 48 texts can be considered chats or interviews. Among conversations, the incidence of chats is higher because dialogic interaction is more frequently marked by a lack of specific speech actions (cf. Sbisà and Turner, 2013). For conversations, 17 of 42 texts can be considered chats, but none of those occurred in public contexts. We focused specific attention on recording in moving contexts as static and dynamic actionality can be treated as two different actional macro-domains. Four dialogues

266

Working with Portuguese Corpora

were recorded completely or partially between informants in a moving car that one of them was driving. Of the recordings, 19 are dynamic: conversation bfamcv03 features friends playing snooker, bfamcv05 features friends playing football (with very high acoustic quality), bfamcv10 features a group preparing lunch, bpubcv01 features a visit to a blood donation centre, bpubcv09 features a gym session and other conversations feature dynamic parts. Among the dialogues, in addition to those already mentioned, bfamdl01 was recorded in a supermarket as two friends were shopping, bfamdl04 features two maids cleaning a kitchen and other rooms, bfamdl05 features a real estate broker driving and showing different apartments to his sister, bfamdl26 features a mother and daughter cleaning an apartment, bpubdl02 and bpudl06 are recordings inside a store while a client tries on shoes and dresses, bpubdl03 features a gym lesson with a personal trainer, bpubdl05 is a visit to a bee-keeping farm, bpubdl07 features two waiters preparing and serving pizza at a party and other dialogues also feature dynamic parts. A few texts are longer than the average. The decision to include these in the corpus was made in order to document a longer textual development or due to the specific characteristics of the texts. For example, bfamdl09 and bfamdl31 have around 3,000 words each. The latter is especially interesting, as it documents two parallel dialogues. In fact, two microphones were placed on two informants who were repairing windows at home and were expected to interact with one another. The distance between them caused their interaction to happen, as expected, in only some circumstances. Most of the time each of them is interacting with two other unforeseen participants, and two different dialogues took place without any overlapping. We thought it was interesting to document parallel dialogues, although this phenomenon also happens in parts of other recordings. Among monologues, bfammn14 features more than 4,800 words. This monologue is carried out by an informant from Serra do Cipo,10 an area whose linguistic variety is considered particularly interesting (another informant from the same area is documented in bfammn29). Meanwhile, bpubdl07 has more than 3,100 words and features waiters preparing and serving pizza at a party. This recording documents a particularly interesting context as the waiters move around and have many small interactions with different interlocutors, giving rise to speech acts not easy to document but very common in real life, such as greeting and thanking. Especially among monologues, some texts are smaller than the average and have fewer than 1,000 words (1 of the dialogues, 3 of the conversations and 12 of the monologues); in a few cases, they have only a few hundred words. They are all concluded textual entities. Three additional recordings warrant further observations. Recording bfamcv06, which features a birthday party, includes an entire text with overlapping voices, but it is clear enough to be understood; we thought it would be interesting to document this aspect of speech, although of course it appears, with less evidence, in other recordings. Recording bfamdl12 documents the interaction between an infant and his mother: the infant cries and the mother talks to him, trying to calm him down. Although we have only one speaker here, it is clear that this text documents an interaction and not a monologue as the mother speaks in reaction to the infant’s actions. Therefore, the text must be considered a dialogue, with different turns, reacting to different actions of the interactant. Something similar happens in

The C-ORAL-BRASIL Corpus

267

bpubdl03, in which a personal trainer tells a client how to work out; the client is almost silent, but the trainer’s turns are interactive with respect to the client’s movements.

2.3.3 An information-tagged minicorpus In order to enable cross-linguistic research on information structure11 and illocutions, two comparable minicorpora were selected for Brazilian Portuguese and Italian (Raso and Mittmann, 2012). The two minicorpora feature approximately 32,000 words in 5,500 utterances each. The texts were chosen to achieve the widest diaphasic variation possible while still ensuring good acoustic quality and the presence of unique speakers. A complex manual system was used to tag the minicorpora. The tagging was double-validated: first by rater agreement among the three human annotators, and then through a comparison between the Italian and Brazilian annotating teams.12 A sample of the annotated minicorpus is shown in (6). (6) bfamcv01 *LUI: com certeza es nũ vão participar /COM uai //PHA (they will certainly not take part in it /COM see //PHA) In this example, the single bar marks the end of the comment unit, which is immediately followed by a phatic unit, closing the end of the utterance (marked by double bars). The criteria for the informational tagging are documented in many of the works by the Italian and Brazilian C-ORAL teams, and are based on the Language into Act Theory (Cresti, 2000a).13 The goal of the two teams was to tag the minicorpora illocutionarily and to make possible studies in order to analyse speech crossing lexicalmorphosyntactic, informational and illocutionary tagging, taking advantage of the full potential of the resource elaborated by Panunzi and Gregori (2012) that allows for the automatic analysis of the three different levels.

3. Methodological aspects 3.1 Acoustic quality Recording speech with high-quality acoustics in natural contexts is a challenging task, but it is crucial for spontaneous speech corpora. As we previously argued, it is not enough to make recordings that allow for the transcription of the main segmental aspects only; a speech corpus that aims to document not only the lexicon and morphosyntax, but also phonetics and pragmatics must have a much higher acoustic quality and must be speech-to-text aligned with appropriate software. Of course, it is easy to obtain good acoustics in labs or sound studios in controlled situations, but it is much more difficult to do so in natural contexts, especially with wide diaphasic variation. It is important to have high-quality wireless recording equipment;14 it is also crucial to plan the recording situations very carefully and record extra time for each session as well as extra sessions in order to be able to choose those of higher quality.

Working with Portuguese Corpora

268

Table 13.4 Recording quality assessment Properties

Label

Very high quality. Excellent microphone response. Almost the entire recording is appropriate for almost all phonetic study. Almost no overlapping. Almost no background noise. F0 computation possible for (almost) the entire file. High quality. Excellent microphone response. Most of the recording is appropriate for almost all phonetic research. Few overlappings. Almost no background noise. F0 computation possible for (almost) the entire file. Medium quality. Good or medium microphone response. Many parts of the recording are appropriate for phonetic analysis. F0 computation possible for most of the file. Few overlappings and not much background noise. Low quality. Medium microphone response. F0 computation possible in at least 60% of the file. Even when F0 computation is not trustworthy, the recording is clearly understandable. Low quality. Medium or low microphone response. F0 computation possible in at least 60% of the file. A few parts might not be clearly understandable.

A

AB

B

BC C

In this project we recorded three times the number of final corpus texts, and each recording was, on average, four times longer than the published transcribed duration. Table 13.4 shows the characteristics of all the acoustic quality labels used for the corpus, and Table 13.5 indicates the acoustic quality of all the texts. All recordings are in wav format and most are in stereo. Overall, 60 per cent of the recordings were in high or very high quality; only 23 per cent were in low quality. Acoustic quality tends to be lower in conversations, which naturally have more overlappings and more challenges in terms of microphone management. In principle, low quality should be accepted only when the recording is particularly interesting for its diaphasic, diastratic, or diatopic aspects and it is Table 13.5 Audio file acoustic quality File* bfamcv bpubcv bfamdl bpubdl bfammn bpubmn Total

A**

AB

8 1 7 5 13 3 40

11 2 14 2 10 4 43

B 4 – 6 2 10 4 25

BC 6 1 5 1 1 2 14

C 5 5 3 1 2 1 18

Total 34 9 35 11 36 14 139

*Filenames follow C-ORAL-ROM conventions: b = Brazilian; fam = private/familiar; pub = public; cv = conversation; dl = dialogue; mn=monologue. ** Key to acoustic quality labels: See Table 13.4.

The C-ORAL-BRASIL Corpus

269

impossible for the recording to avoid quality problems, such as background noise in a supermarket. Acoustic quality is crucial for pragmatic analysis, which in turn can be better appreciated in the next section in which the importance of prosody for illocutionary and informational studies is explored (Raso, 2012b; Mello and Raso, 2011; Moneglia, 2011).

3.2 Speech segmentation Speech segmentation criteria represent a very important and original aspect of the C-ORAL-ROM and C-ORAL-BRASIL projects.15 Criteria for segmenting speech have been widely discussed, and no consensus is evident in the literature (Moneglia, 2005). In written texts, the relative ease in recognizing reference units, such as sentences, is generally accepted and not controversial. It is possible to choose different units for analysing written texts, but all of them are generally discrete categories pertaining to syntax (Abeillé, 2003). Since the mid-1980s, there has been a lot of debate on the question of how to segment speech (Blanche-Benveniste, 1997; Biber et al., 1999; Cresti, 2000a, 2000b; Miller and Weinert, 1998; Quirk et al., 1985). Even if there is some agreement that the utterance should be taken as the reference unit, different authors have different definitions (see the discussion in Scarano, 2003). Among them, some are noteworthy, such as Biber et al. (1999), who identified an utterance as a C-unit (i.e., an entity with or without a clause structure), and Blanche-Benveniste (1997), who referred to utterances as prosodic domains (noyaux) containing the realization of chains of macrosyntactic phenomena (e.g., pronominal cohesion) within the context of her pronominal approach with a macro-syntactic domain based on a modalized noyau. Other definitions (e.g., Voghera, 1992) normally imply a verbal nucleus in a speech unit; however, this does not take into account the fact that many languages include spontaneous speech verbless autonomous units with an incidence of about 30 per cent of the overall number of utterances (Cresti, 2001, 2005a, 2005b; Raso and Mittmann, 2012). Therefore, utterance definitions based on clause structure or predication – that is, taking a verb as the utterance nucleus – cannot account for spontaneous speech units. As we previously discussed, in the C-ORAL-BRASIL, as in the C-ORAL-ROM, prosodic breaks are taken to be the most relevant feature in determining utterance boundaries.16 Tone units are those portions of speech separated by prosodic breaks, and a general correspondence between tone units and information units has been recognized since Halliday (1976). Hence, it is possible to divide the speech flow into information units. In fact, the perceptual relevance of prosodic breaks is strong; however, this is not sufficient for individualizing utterances as the correspondence between information unit and utterance is not mutual. An information unit might not overlap with an utterance; it might just be part of it. An utterance can match one single information unit or more than one, in which case it will be realized by more than one tone unit (Cresti, 2000a; Hart et al., 1990). However, the correlation between prosodic break and utterance can be maintained, considering that classic linguistic studies (Crystal, 1975; Karcevsky, 1931) identified the utterance as having a prosodic

270

Working with Portuguese Corpora

break terminal profile. As such, prosodic breaks that end utterances can be distinguished from those that are not conclusive – that is, terminal versus non-terminal profiles. Consequently, utterances presenting a bi-univocal correspondence with tone units (one utterance made up of one tone unit) can be distinguished from those not presenting this feature (one utterance made up of more than one tone unit.) In both C-ORAL-BRASIL and C-ORAL-ROM, the identification of terminal prosodic breaks was considered a heuristic method for determining utterance boundaries. Each sequence that ended with a terminal break within the speech flow was considered an utterance based on the assumption that linguistic actions (i.e., speech acts) are necessarily correlated with prosody and that it is prosody that makes the interface between illocutionary and locutionary acts. Performing illocutionary acts is therefore considered the main property that a linguistic event must present in order to be considered an utterance. The illocutionary force determines how the propositional content of the utterance must be interpreted. This explains why the utterance is defined as the minimal linguistic unit that allows a pragmatic interpretation Prosodic features allow a competent individual to interpret linguistic activity. Competent speakers/listeners are very skillful in detecting even subtle voluntary prosodic variations (Hart et al., 1990), which is what happens when a linguistically codified prosodic profile is performed to express an illocutionary force or an information unit (Raso, 2012b; Moneglia, 2011). Segmentation identifies speech act boundaries, but of course it does not label them. Identifying the conclusion of a speech act and labelling its illocutionary function are two different tasks. In the introduction, we showed how speech is segmented in the corpus and briefly explained the main perceptual motivations for the segmentation. It is crucial to understand that speech and writing cannot be analysed according to the same criteria. Prosody, which is absent in writing, is the main structural characteristic of speech. Through prosody, it is possible to identify the reference units of speech, label these units for illocutionary features, and segment the utterance in information units, identifying their specific functions (Cresti, 2000a; Moneglia, 2011; Raso, 2012b; Moneglia and Cresti, 2006). Example (7) shows how different the analysis for speech and writing can be. (7) Não espera aqui em cima não In speech, the communicative value of this sequence depends on how we segment it: a) Não. Espera aqui em cima. Não. (No. Wait up here. No.) b) Não espera. Aqui? Em cima? Não? (Don’t wait. Here? On top? No?) c) Não. Espera aqui! Em cima não! (No. Wait here! Not up there!) (Shouldn’t wait up here? No?) d) Não espera aqui? Em cima? Não! (Shouldn’t wait here? Up? No!) e) ...

The C-ORAL-BRASIL Corpus

271

These and other segmentation possibilities show how many speech acts we can have in the sequence and what types they are. Without prosody, we cannot make any of the following decisions: 1. What boundaries, in a sequence of speech, allow us to individualize the different actions performed? Through the verbal sequence, is the speaker performing one, two, or more actions? Which words pertain to each action? 2. What kind of speech act is being performed? A question, order, request, assertion, expression of surprise, etc.? How many information units make up the utterance? What are their specific functions (Cresti, 2011; Cresti and Firenzuoli, 1999)? All of these questions can only be answered through prosody. Note that even punctuation cannot fully represent all possible illocutions or information structures; however, the main point is to highlight that the reference unit in speech is the utterance and that it corresponds to a speech act (Austin, 1962). Of course, the segmentation criteria require that the people doing the segmenting should be properly trained and that their decisions should be statistically validated through inter-rater agreement statistics.17 We paid great attention to both, and the validation after the first revision, which was not the last one, yielded a Kappa (Fleiss, 1971) of 0.86, indicating excellent inter-rater agreement.

3.3 Transcriptions An important implementation of C-ORAL-BRASIL was the choice of a specific set of transcription criteria for the segmental part. We wanted to capture a great quantity of phenomena that might be subject to grammaticalization and lexicalization in order to study them through quantitative methodology and statistical criteria as well as measure their co-occurrence and the systemic relationships among them. The criteria were based on the following parameters: (i) transcriptions should be able to represent phenomena subject to grammaticalization and lexicalization (e.g., subject and negation cliticization, loss of verbal morphology, demonstrative reduction, articulated preposition contraction, loss of the verb ser in cleft constructions, government changes, aphaeresis); (ii) they should be readable, excluding phenomena of an exclusively phonetic nature, without clear grammatical effect; and (iii) they should ensure that the transcribers were consistent in their job, choosing perceivable phenomena clearly. An example of this last aspect is that of the cliticization of subject pronouns: although the distinction between tonic and clitic forms of the second and third person is relatively easy to perceive (você(s) versus cês (you) and ele(s) or ela(s) (he or she) versus e’, es, ea, eas), the same is not true for first-person singular and plural as many times it is not clear whether people utter eu, e, o (I) or nós, no, nu, n’ (we); in this

272

Working with Portuguese Corpora

case, we decided against representing the opposition between tonic and clitic forms orthographically. All these phenomena are already known in linguistics, but they have never been systematically documented or studied through corpora. Only corpus-based studies of spontaneous speech can verify (a) how recurrent these and other phenomena are in spontaneous speech; (b) to what extent they coexist and determine deep changes in the system; (c) which grammaticalized phenomena most trigger others; and (d) what their distribution is from a sociolinguistic viewpoint. If these phenomena were not marked in the transcription, it would be impossible to handle them. In fact, all forms with unusual spellings, compared to their traditional forms, were added to the parser (Bick, 2012), thereby enabling a large number of studies to be conducted about ongoing linguistic changes that would be impossible to code by hand. We want to stress the importance of our efforts in choosing, computationally implementing, and validating the transcription criteria statistically. They should be considered among the most advanced criteria of their kind for exploring spontaneous Brazilian Portuguese speech at present.18 We conducted two validations: the first before the last revision and the second after the last revision. The baseline for the validation was 10 per cent of the utterances of each text, chosen randomly. The validation was divided into two modules: one aimed at validating the transcription as a whole and the other concentrated only on the non-orthographic phenomena. The result was outstanding. The errors in terms of words in the transcription as a whole were 0.81 per cent. The incidence of errors was low at 0.43 per cent (considering only non-orthographic criteria items – namely, errors related to transcription variability of specific speech phenomena). The item that presented the most errors was articulated prepositions (e.g., pru; from para + o, ‘for/to’ + ‘the’), with 3.28 per cent.

4. Concluding remarks The C-ORAL-BRASIL corpus, through its methodological and architectural parameters, represents a new advancement in the growing field of spontaneous speech corpora, as it contains high-quality audio files, utterance segmented and validated transcriptions, and speech-to-text alignment in addition to PoS and parsed transcription files. The C-ORAL-BRASIL resources can be used in studies ranging from text linguistics to phonetics and we hope that it will inspire the development of similar initiatives both in Brazil and in other countries.

References Abeillé, A. (2003), Treebanks Building and Using Parsed Corpora. Dordrecht: Kluwer Academic. Adolphs, S. (2008), Corpus and Context: Investigating pragmatic functions in spoken discourse. Amsterdam and Philadelphia: John Benjamins.

The C-ORAL-BRASIL Corpus

273

Adolphs, S. and Carter, R. (2013), Spoken Corpus Linguistics: From monomodal to multimodal. London: Routledge. Austin, L. J. (1962), How to Do Things with Words. Oxford: Oxford University Press. Berruto, G. (1987), Sociolinguistica dell’Italiano Contemporaneo. Rome: La Nuova Italia Scientifica. Biber, D. (1988), Variation Across Speech and Writing. Cambridge: Cambridge University Press. Biber, D. and Conrad, S. (2001), ‘Register variation: A corpus approach’, in D. Schiffrin, D. Tannen and H. Hamilton (eds), The Handbook of Discourse Analysis. Oxford: Blackwell, pp. 175–96. Biber, D., Conrad, S., and Reppen, R. (1998), Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press. Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999), The Longman Grammar of Spoken and Written English. Harlow-Essex: Pearson Education. Bick, E. (2012), ‘A anotação grammatical do C-ORAL-BRASIL’, in T. Raso and H. Mello, (eds), C-ORAL-BRASIL I: Corpus de Referência do Português Brasileiro Informal. Belo Horizonte: Editora UFMG, 2012, pp. 223–54. Blanche-Benveniste, C. (1997), Approches de la Langue Parlée en Français. Paris: Ophrys. Blanche-Benveniste, C., Bilger, M., Rouget, Ch., Van Den Eynde, K. and Mertens, P. (1990), Le Français Parlé: Études Grammaticales. Paris: Éditions du C.N.R.S. Carter, R. and McCarthy, M. (1995), ‘Grammar and the spoken language’. Applied Linguistics 16, 141–58. —(2006), The Cambridge Grammar of English: A Comprehensive Guide to Spoken and Written Grammar and Usage. Cambridge: Cambridge University Press. Cresti, E. (1995), ‘Speech acts units and informational units’, in E. Fava (ed.), Speech Acts and Linguistics Research, Proceedings of the Workshop, July 15–17, 1994, pp. 89–107. —(2000a), Corpus di Italiano Parlato. Florence: Accademia della Crusca. —(2000b), ‘Critère illocutoire et articulation informative’, in M. Bilger (ed.), Corpus, Méthodologie et Applications Linguistiques. Paris: Champion, pp. 350–67. —(2001), ‘Per una nuova definizione di frase’, in P. Bongrani, A. Dardi, M. Fanfani, and R. Tesi (eds), Studi di Storia della Lingua Italiana Offerti a Ghino Ghinassi. Florence: Le Lettere, pp. 511–50. —(2005a), ‘Enunciato e frase: Teoria e verifiche empiriche’, in M. Biffi, O. Calabrese and L. Salibra (eds), Italia Linguistica: Discorsi di Scritto e di Parlato – Nuovi Studi di Linguistica Italiana per Giovanni Nencioni. Siena: Protagon, pp. 249–60. —(2005b), ‘Notes on lexical strategy, structural strategies and surface clause indexes in the C-ORAL-ROM spoken corpora’, in E. Cresti and M. Moneglia (eds), C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam, Philadelphia, PA: John Benjamins, pp. 209–56. —(2011), ‘The definition of Focus in the framework of the Language into Act Theory (LAT)’, in H. Mello, A. Panunzi, T. Raso, T. (eds), Pragmatics and Prosody. Illocution, Modality, Attitude, Information Patterning and Speech Annotation. Florence: FUP, pp. 39–82. Cresti, E, and Firenzuoli, V. (1999), ‘Illocution et profils intonatifs de l’italien’. Revue Française de Linguistique Appliquée, 4, (2), 77–98. Cresti, E., Moneglia, M., Bacelar do Nascimento, M. F., Moreno-Sandoval, A., Véronis, J., Martin, P., Choukri, C., Mapelli, V., Falavigna, F., Cid, A. and Blum, C. (2002), The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual romance corpus. [ONLINE] Available at: http://www.lrec-conf.org/proceedings/ lrec2002/pdf/290.pdf

274

Working with Portuguese Corpora

Cresti, E. and Moneglia, M. (eds) (2005), C-ORAL-ROM: Integrated Reference Corpora For Spoken Romance Languages. Amsterdam: John Benjamins. —(2012), ‘Prefácio’, in T. Raso, T and H. Mello (eds), C-ORAL-BRASIL I: Corpus de Referência do Português Brasileiro Falado Informal. Belo Horizonte: Editora UFMG, pp. 5–25. Crystal, D. (1975), The English Tone of Voice. London: Edward Arnold. De Mauro, T., Mancini, F., Vedovelli, M. and Voghera, M. (1993), Lessico di Frequenza dell’Italiano Parlato. Milan: ETAS. Fleiss, J. L. (1971), ‘Measuring nominal scale agreement among many raters’. Psychological Bulletin, 76, 378–82. Gadet, F. (1996a), ‘Niveaux de langue et variation intrinsèque’. Palympsestes, 10, 17–40. —(1996b), ‘Variabilité, variation, variété’. Journal of French Language Studies, 1, 75–98. —(ed.) (1997), ‘Special issue on la variation en syntaxe’. Langue Française, 115. —(2000), ‘Vers une sociolinguistique des locuteurs’. Sociolinguistica, 14, 99–103. —(2003), La Variation Sociale en Français. Paris: Ophrys. Givón, T. (ed.) (1979), Discourse and Syntax. New York: Academic Press. Halliday, M. A. K. (1976), System and Function in Language: Selected Papers. Oxford: Oxford University Press. —(1989), Spoken and Written Languages. Oxford: Oxford University Press. Hart, J., Collier, R. and Cohen, A. (1990), A Perceptual Study on Intonation: An Experimental Approach to Speech Melody. Cambridge: Cambridge University Press. Karcevsky, S. (1931), ‘Sur la phonologie de la phrase’. Travaux du Cercle linguistique de Prague, 4, 188–228. MacWhinney, B. J. (2000), The CHILDES Project: Tools for Analyzing Talk (3rd edn). Mahwah, NJ: Lawrence Erlbaum Associates. Martin, P. (2005), ‘WinPitch Corpus: A text-to-speech analysis and alignment tool’, in E. Cresti, M. Moneglia (eds), C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins, pp. 40–51. McCarthy, M. (1998), Spoken Language and Applied Linguistics. Cambridge: Cambridge University Press. Mello, H. (2012), ‘Os corpora orais e o C-ORAL-BRASIL’, in T. Raso, T. and H. Mello (eds), C-ORAL-BRASIL I: Corpus de Referência do Português Brasileiro Informal. Belo Horizonte: Editora UFMG, pp. 31–54. Mello, H. and Raso, T. (2009), ‘Para a transcriçao da fala espontânea: O caso do C-ORALBRASIL’. Revista Portuguesa de Humanidades, 13, (1), 153–78. —(2011), ‘Illocution, modality, attitude: Different names for different categories’, in H. Mello, A. Panunzi and T. Raso (eds), Pragmatics and Prosody. Illocution, Modality, Attitude, Information Patterning and Speech Annotation. Florence: FUP, pp. 1–18. Mello, H., Raso, T., Mittmann, M.,Vale, H. and Côrtes, P. (2012), ‘Transcrição e segmentação prosodica do Corpus C-ORAL-BRASIL: Critérios de implementação e validação’, in T. Raso and H. Mello (eds), C-ORAL-BRASIL I: Corpus de Referência do Português Brasileiro Falado Informal. Belo Horizonte: Editora UFMG, pp. 125–76. Miller, J. and Weinert, R. (1998), Spontaneous Spoken Language. Oxford: Clarendon Press. Mittmann, M. M. and Raso, T. (2012), ‘The C-ORAL-BRASIL informationally tagged minicorpus’, in H. Mello, A. Panunzi and T. Raso (eds), Pragmatics and Prosody. Illocution, Modality, Attitude, Information Patterning and Speech Annotation. Florence: FUP, pp. 151–83. Moneglia, M. (2005), ‘The C-ORAL-ROM resource’, in E. Cresti and M. Moneglia

The C-ORAL-BRASIL Corpus

275

(eds), C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins, pp. 1–70. —(2011), ‘Spoken Corpora and Pragmatics’. Revista Brasileira de Linguística Aplicada, 11, (2), 479–519. Moneglia, M and Cresti, E. (1997), ‘L’intonazione e i criteri di trascrizione del parlato adulto e infantile’, in U. Bortolini and E. Pizzuto (eds), Il Progetto CHILDES Italia. Pisa: Del Cerro, pp. 57–90. —(2006), ‘C-ORAL-ROM. Prosodic boundaries for spontaneous speech analysis’, in Y. Kawaguchi, S. Zaima, and T. Takagaki (eds), Spoken Language Corpus and Linguistics Informatics. Amsterdam: John Benjamins, pp. 89–113. Moneglia, M., Raso, T., Mittmann, M. M. and Mello, H. (2010), ‘Challenging the perceptual relevance of prosodic breaks in multilingual spontaneous speech corpora: C-ORAL-BRASIL / C-ORAL-ROM’. Paper presented at Speech Prosody 2010, Satellite Workshop on Prosodic Prominence: Perceptual, Automatic Identification. Chicago. Nencioni, G. (1983), Di scritto e di parlato: Discorsi linguistici. Bologna: Zanichelli. Panunzi, A. and Gregori, L. (2012), ‘An XML model for multi-layer representation of spoken language’, in H. Mello, A. Panunzi and T. Raso (eds), Pragmatics and Prosody. Illocution, Modality, Attitude, Information Patterning and Speech Annotation. Florence: FUP, pp. 133–50. Quirk, R., Greenbaum, S., Leech, G. and Svartvik, J. (1985), A Comprehensive Grammar of the English Language, London: Longman. Raso, T. (2012a), ‘O Corpus C-ORAL-BRASIL’, in T. Raso and H. Mello (eds), C-ORALBRASIL I: Corpus de Referência do Português Brasileiro Falado Informal. Belo Horizonte: Editora UFMG, pp. 55–90. —(2012b), ‘O C-ORAL-BRASIL e a Teoria da Língua em Ato’, in T. Raso and H. Mello (eds), C-ORAL-BRASIL I: Corpus de Referência do Português Brasileiro Falado Informal. Belo Horizonte: Editora UFMG, pp. 91–123. Raso, T. and Mello, H. (eds) (2012), C-ORAL-BRASIL I: Corpus de Referência do Português Brasileiro Falado Informal. Belo Horizonte: Editora UFMG. Raso, T. and Mittmann, M. M. (2009), ‘Validação estatística dos critérios de segmentação da fala espontânea no corpus C-ORAL-BRASIL’. Revista de Estudos da Linguagem, 17, (2), 73–91. Raso, T. and Mittmann, M. M. (2012), ‘As principais medidas da fala’, in T. Raso and H. Mello (eds), C-ORAL-BRASIL I: Corpus de Referência do Português Brasileiro Falado Informal. Belo Horizonte: Editora UFMG, pp. 173–221. Sbisà, M. and Turner, K. (eds) (2013), Pragmatics of Speech Actions. Berlin: de Gruyter. Scarano, A. (ed.) (2003), Macro-syntaxe et Pragmatique. L’Analyse Linguistique de l’Oral. Rome: Bulzoni. Simon, A. C. (2004), La Structuration Prosodique du Discours en Français. Une Approche Multidimensionelle et Expérientelle. Bern: Peter Lang. Voghera, M. (1992), Sintassi e Intonazione nell’Italiano Parlato. Bologna: Il Mulino.

Notes 1 The C-ORAL-BRASIL Project was financed by Fapemig, CNPq and UFMG. 2 For a comparison between C-ORAL-BRASIL and C-ORAL-ROM, see Raso (2012a) and Mittmann and Raso (2012).

276

Working with Portuguese Corpora

3 All the examples cited in this paper can be listened to on the C-ORAL-BRASIL DVD. 4 The view we embrace about what a speech corpus should encompass follows the tradition founded by Cresti (1995, 2000a) and Blanche-Benveniste (1997), among others, and represents third-generation corpora. This view is in stark contrast to that of scholars who have been studying speech through transcriptions, mostly through conventional corpus linguistics methods such as lexical and collocate frequencies and syntactic patterning (cf. Biber et al., 1999, Carter and McCarthy, 1995, 2006, among others). Even among scholars who claim to focus on discourse phenomena, the study of a transcription without resorting to sound has been the most common procedure, as attested by McCarthy (1998). For an in-depth discussion about the canonical study of speech corpora, see Adolphs (2008) and references mentioned therein. For new ways of studying speech corpora, which incorporate third-generation, multimodal corpora, see Adolphs and Carter (2013). 5 For the segmentation theoretical frame, see Cresti (2000a); for the segmentation and validation methodology, see Mello et al. (2012), Raso and Mittmann (2009), and Moneglia et al. (2010). 6 Figures indicate the count of retracted words. 7 For the transcription criteria, see Mello and Raso (2009) and Mello et al. (2012). 8 See Raso and Mittmann (2012). 9 See Raso (2012b) and Raso and Mittmann (2012). 10 A region about 100 km away from Belo Horizonte, on the south side of the Espinhaço mountain range. 11 Information structure refers to the organization of information into utterances. It is governed by pragmatic principles, and its linguistic organization is marked by prosody (Cresti, 2000a). 12 The two minicorpora, entirely annotated, are available through the IPIC database search at: http://lablita.dit.unifi.it/app/dbipic/ 13 For more information on the Italian team, see http://lablita.dit.unifi.it/; for the Brazilian team, see www.c-oral-brasil.org. For the criteria behind the informationtagged minicorpora, see Mittmann and Raso (2012). 14 The equipment used for C-ORAL-BRASIL is described on the corpus DVD (Raso and Mello, 2012). 15 For C-ORAL-BRASIL, see Mello et al. (2012), Raso (2012b), Raso and Mittmann (2009), and Moneglia et al. (2010). 16 For the relationship between prosodic breaks and utterance boundaries, see Simon (2004). 17 For details on the training process and the validation, see Mello et al. (2012), Moneglia et al. (2010), and Raso and Mittmann (2009). 18 For a complete list of and discussion on the transcription criteria, see Mello and Raso (2009) and Mello et al. (2012).

Part Six

Parsing and Annotation

14

PALAVRAS: A Constraint Grammar-Based Parsing System for Portuguese Eckhard Bick

1. Background: A modular, rule-based parsing architecture A corpus without grammatical annotation is like a city map without place or street names – navigating it is far less efficient and informative and, for certain tasks, all but impossible. Without grammatical tags and structure, corpus users have access only to very simple character string statistics, and are forced to deduct correlations and more complex statistics based from indirect, non-explicit clues and patterns. Even low-level corpus tasks such as frequency lists will profit from grammatical annotation, allowing the user to subsume different inflected forms of the same word under one lemma or base form, or to distinguish between homographs belonging to different word classes. Therefore, since the early days of corpus linguistics, corpus annotation tools have been among computational linguists’ primary concerns, historically moving from low-level part-of-speech (POS) tagging to more complex tasks like syntactic parsing and semantic annotation. With the advent of stronger computers and the inexorable progression of Moore’s law, two developments in particular can be noted: on the content side, corpora have grown larger, from hand-scanned books to robot-harvested Internet corpora containing billions of words. On the methodological side, datadriven methods have become more realistic and the balance between rule-based and probabilistic taggers/parsers has shifted in favour of the latter. However, even sophisticated, unsupervised machine-learning systems profit from richly annotated training data, which are often produced by hand-correcting the output of a rule-based parser. Furthermore, even without revised training data, the two approaches may interact. First, quality may be traded for speed by ‘reverse-engineering’ a rule-based parser’s output into a probabilistic system. Second, rules may be edited or learned based on statistical information gained from annotated corpora, in an iterative process of grammar writing and corpus annotation. For Portuguese, the PALAVRAS1 parser is probably the most well-known exponent of the rule-based2 camp within parsing technology. Now incorporating over 6,000 linguistic rules, it has been actively developed for almost 20 years, with morphological analysis introduced as early as 1993, automatic syntactic tree generation in 1996, and

280

Working with Portuguese Corpora

semantic classification (of nouns) in 1999. The core grammar has grown considerably over the years, and various modules, both analytic and applicational, have been added since its inception, prompted by cooperative projects, user feedback and corpus evaluation – underscoring two important design differences between rule-based and probabilistic systems: first, a modular rule-based system can more easily adapt to different tasks, since specific changes and exceptions can be entered in a controlled fashion. While an ML system can be modelled to different tasks, it often requires expensive training or tuning data to do so, and no common core can be retained across variants. Second, a rule-based system can learn from its errors and grow over time, with no effort wasted. A probabilistic system, on the other hand, is reset as a whole when improvements are made, and interfering with its internal workings of ‘majority rule’ by adding manual exceptions for rare constructions may even be detrimental to overall performance (Chanod and Tapanainen, 1994). PALAVRAS owes its versatility to its modular Constraint Grammar architecture, where consecutive level-specific grammars, while drawing on shared lexical resources, focus on different annotational and functional aspects, progressing from inflectional analysis and morphological disambiguation to syntactic function tags, tree structures and semantics. Each level of annotation focuses on a separate type of tag or tags, ordered in distinct tag fields with optional discriminatory prefixes; however, all annotation, even structural information, is expressed as tags and linked to tokens. This way, annotation information is maximally local (rather than distributed) and easy to filter by tag name or format, including a conversion of tag fields into XML feature-attribute pairs.

2. Palmorf: a lexicon-based, analytical annotation scheme Palmorf, the winning system of the first Portuguese Morpholympics,3 organized in the context of the joined Portuguese NLP evaluation initiative AvalON (Santos and Rocha, 2002), is the morphological stage of the PALAVRAS parsers. Though ultimately morphological tasks – not least tokenization and disambiguation – are handled in a distributed way, Palmorf can be described as a multi-tagger that will, for each input token, provide a so-called ‘cohort’ (list) of possible readings: “” “via” PRP “via” N F S “ver” V IMPF 1/3S IND VFIN

Analysis is performed as outward-inward morphological decomposition, i.e. by recognizing (and removing) inflection endings and affixes one by one until a ‘legal’ lexical root is identified. Decomposition rules cover inflection paradigms, suffixation, prefixation, orthographic variation and combinatorial laws for morpheme combination, both in terms of category and phonology. For reasons of linguistic transparency, a lexicon of 70,000 lexeme entries is used, with information on morphological potential rather than full forms. The example below illustrates this lexicon: the first field contains the word form, followed by various fields containing information

The PALAVRAS Parsing System

281

about word class (sf = female gender noun), semantic class (tool-cut = cutting tool), inflections and pronunciation information (root alterations = AaiDd, vowel shift i/e for root-stressed inflexional forms), valency markers ( = que-verb), verb complementation ( = verb allowing human subject), semantic atomic feature markers ( with upper case for positive features and lower case for negative features, e.g. i = not moving, J = movable) and, finally, an ID number. faca#=########22056 sugerir-#1##AaiDd######47837

The lexicon also contains information on valency potential (e.g., for transitive), semantic prototype (e.g., for path location or for a cutting tool), selection restrictions ( for human-subject verb), domain, etc.; but these are so-called ‘secondary tags’ designed to provide context for disambiguation rules and higher-level analysis. Palmorf ’s input are tokens created by a preprocessor with some contextual and lexical information. The preprocessor can thus distinguish between in-token dots, commas and hyphens as opposed to real punctuation; it can handle uppercase normalization, orthographic variation and quote attachment; and it can create polylexicals via lexicon lookup and a named entity recognition module. PALAVRAS basic definition of ‘word class’ (PoS) is morphological, not syntactic, with the main distinctor being inflexion category inventory. Thus, nouns (N) are defined as words having gender as a lexeme category and number as a word-form category, while both gender and number are word-form categories in adjectives (ADJ) and lexeme categories in proper nouns (PROP). In the pronoun class, a similar distinction is made between ‘adjectival’, gender-number inflecting determiner pronouns (DET) and ‘noun-like’ non-inflecting independent pronouns (INDP), as well as personinflecting personal pronouns (PERS). This morphological view on PoS leads to a somewhat relaxed attitude towards double lexicon entries for adjectives (or determiners) that sometimes function as np heads and, in other grammar traditions, would be considered nouns in such cases. PALAVRAS does not necessarily provide the noun tag but can rely on its syntactic function tag to re-tag (filter) adjectives as nouns where they occur as e.g., subject or object heads (os fracos, the weak; o sim, yes; o não-avançar, the non-advancement; o cinco, five).

2.1 Derivation Palmorf has an intricate system of derivational analysis, handling prefixation, suffixation and even ‘circumfixation’ (suffixes depending on the presence of a certain prefix) in a rule-based way (Bick, 2000, 13ff). Thus, the analyser has compositional restrictions concerning affix-root combinations and understands phonological-graphical mutations (vowel elision, consonant doubling, etc.). It will also allow for word class changes (e.g., adjective->noun: -idade) or class-transparency (e.g., -inho, where part of speech doesn’t change with affixation), and recognize productive LatinGreek ‘terminological’ suffixes (e.g., medical like -fobia) and their common-language analogues (disco+grafia, discography; jazz+ófilos, jazz-aficionado, politiqu+ês, political jargon, haltero+fil+istas, weight lifters) as well as combined prefixation/suffixation

282

Working with Portuguese Corpora

(superfaturamento, over-billing, biodegradável, biodegradable) and letter-name chaining (peemedebistas, a follower of the (Brazilian) PMDB party -> P+M+D+B+istas). The system does not, however, strive to perform full (etymological) derivation tagging – instead, it provides derivational analyses only where the full lexeme could not be found in its lexicon. Palmorf ’s derivation tags are thus useful for increasing lexical coverage, or for lexicographical studies into new words in Portuguese (e.g., sambódromo, sambadrome; jurnalês; newspaperese), but do not claim full coverage for ordinary words. Consequently, the tagger’s original use of baseforms (root excluding affixes), was later changed to lexeme tagging (root including affixes).

2.2 Lexical heuristics Though they account for only a small fraction of the total word forms in running text, derivation-proof unknown word forms4 are more difficult to handle than those amenable to derivation, due to their functional diversity and lack of a clear morphological marker. In a study by Bick (1998), it is shown that lexico-morphological heuristics – at least for a morphology-rich language like Portuguese – can be based on structural clues and the systematic exploitation of derivational and inflectional sublexica. Three types (i.e., orthographic errors/variants,5 underivable Portuguese words6 and foreign loan words)7 accounted each for about one third of all cases, demanding three different optimization strategies. Thus, foreign loan words are typically nouns or noun phrases, and derivation attempts might lead to false positive PoS tagging. In ‘real’ Portuguese words, however, structural clues – like inflection endings and suffixes – should be emphasized. In addition, Portuguese misspellings profit from specific rules about letter gemination (letter doubling), inversions, accented character variation, etc. Since prefixes have very little bearing on the probability of a word’s PoS or inflectional categories, only inflection endings and suffix lexica are used when searching for derivational clues. Working with a lower root length of 3 letters, hypothetical root words such as ‘xxxar’ (V/ADJ), ‘xxxo’ (N/ADJ) and ‘xxx’ (N) were entered into the lexeme lexicon, and Palmorf successively replaced ever-longer left-hand chunks of an unknown word form with ‘xxx’, until an inflexional or derivational match is encountered or the whole word is replaced by ‘xxx’. For a word like ontogeneticamente (ontogenetically), six analysable endings combinations of decreasing length could be found; however, preferring long derivations over short ones, the tagger would stop at the xxxticamente level. xxx(t)icamente->‘ontogene’+ico+mente_ADV xxxamente->‘ontogenetico’+amente_ADV xxxente->‘ontogeneticamer’+ente_ADV xxxe->‘ontogeneticamentar/er/ir’+e_V xxx->‘ontogeneticamente’_N

Besides the typical stems ending in -o/-a/-r, plain ‘xxx’ is used to accommodate for foreign nouns with ‘un-Portuguese’ spelling. To forestall an unwanted derivation bias towards verbs, certain frequent verbo-nominal ambiguities (e.g., English –er nouns,

The PALAVRAS Parsing System

283

Latin-Portuguese –ia nouns and –ar adjectives) were also entered. In ambiguous cases, shorter derivations have priority, and contextual clues will be exploited by ordinary CG rules. In a test run, true Portuguese words performed best (5 per cent PoS errors, 98.6 per cent accuracy for nouns), while misspellings performed somewhat lower, but with a similar PoS distribution. Loan-word analyses were usually wrong when analysed as non-nouns, but noun accuracy was actually higher in this group (92.6 per cent) than in the Portuguese misspellings group (84.4 per cent), reflecting the strong noun bias of the former, and the risk of mis-tagging closed-class words as nouns in the latter. Exploiting these results, heuristic-statistical PoS choices should incorporate clues like ‘lusoid’ (accented forms) and ‘angloid’ (‘y’, ‘w’, ‘th’) letter patterns.

2.3 Polylexicals / Multiword Expressions (MWE) Palmorf was designed to serve a syntactic system, not to run in isolation. Therefore, syntactic considerations are honoured already at the morphological stage. Thus, recognizing polylexical forms for classes such as adverbs and prepositions (known in Portuguese as locuções) is essential for context-based disambiguation and syntax since it makes context patterns less complex and avoids awkward and semantically unattractive analyses of unnecessary depth (e.g., a fim de que, em vez de). More problematic and less amenable to closed-list treatment is the concept of polylexical (i.e. multi-word) nouns (e.g. cobra cascavel – rattlesnake) and verb incorporation (estar com fome – to be hungry). These are important for semantic disambiguation and machine translation (MT) but could be avoided at the syntactic level. Thus, many lexically encoded MWE of this type remain unexpressed in PALAVRAS corpus annotation. An exception is name MWEs, partly because they often contain a lot of unorthodox (foreign) material and structure that could otherwise be mis-analysed, partly because PALAVRAS has a strong focus on Named Entity Recognition (NER, Part 6), where MWEs facilitate the assignment of semantic tags and syntactic integration into the rest of the sentence.

3. Morphosyntactic disambiguation and constraint grammar In most of its annotation modules, PALAVRAS relies on Constraint Grammar8 (Karlsson et al., 1995), and the parser is intertwined with the CG paradigm at both the methodological and the descriptive levels. Unfiltered, all grammatical information is expressed as traditional token-based CG tags, with field markers such as prefixes (e.g., @ for syntax, § for semantic roles, % for NER, “…” or [ …] for lemma, for secondary tags), and internally favours dependency over constituent grammar because the former can be expressed as token relations (for an example of the issues discussed in this and the following sections, please refer to the annotation sample provided in the Appendix). PALAVRAS Portuguese grammar is one of the largest (and longest-living) CGs in existence, and changes were made on several occasions to GrammarSoft’s CG

284

Working with Portuguese Corpora

compilers (first vislcg and later CG3) because a new PALAVRAS module needed more expressivity (e.g., tag and set unification, named relations).

3.1 Disambiguation rules and set definitions As a method, CG focuses on contextual disambiguation and information mapping. A core task is to remove, select or modify ambiguous morphological tagging, targeting individual tags to remove or select one or more lines in a readings cohort. Considerable robustness arises from the fact that the grammar does not even have to express directly what is correct but can progressively chop away options until only one remains. Rule (a) is a simple local-context rule discarding finite verb readings if a safe (C) article is found directly to the left (-1 position). Rule (b) is a global-context rule selecting finite verbs if scanning left (*-1) all the way to the sentence start (>>>) or a clause boundary word (e.g., a conjunction or relative) does not find (BARRIER) a competing VFIN, and if there is no other VFIN anywhere to the right (*1) either. (a) REMOVE VFIN (-1C ART) (b) SELECT VFIN (*-1 >>> OR CLB-WORD BARRIER VFIN) (NOT *1 VFIN)

Rules can refer to tags either directly, including lemma and word form, or through set definitions. (c) LIST VFIN = PR IMPF PS FUT COND ; (tenses and conditional) (d) LIST CLB-WORD = KS KC (“que” ) ; (subordinator, coordinator, relative, etc.)

Sets can be combined with each other in various ways, e.g., ORed or ANDed. The special sets >>> and 3 primeiro_ [primeiro]__ADJ_M_S_@>N_#2->3 fabricante_ [fabricante]__N_M_S_@SUBJ>_#3->17_§AG mundial_ [mundial]_ADJ_M_S_@N3 de_ [de]_PRP_@N3 «ratos» [rato]__N_M_P_@P5_§RES para_ [para]_PRP_@N7 computador, [computador]__N_M_S_@P9_§FIN a_ [o]__ART_F_S_@>N_#12->13 empresa [empresa]__N_F_S_@APP_#13->3_§ID suíça [suíço]_ADJ_F_S_@N13 Logitech, [Logitech]__PROP_F_S_@N13_§ID apresentou [apresentar]___V_PS_3S_IND_@STA_#17->0_§PRED esta_ [este]__DET_F_S_@>N_#18->19 semana [semana]__N_F_S_@17_§LOC-TMP em [em]__PRP_@17 uma [um]__ART_F_S_@>N_#21->22 feira [feira]__N_F_S_@P20_§LOC especializada [especializar]_V_PCP_PAS_F_S_@N22 que [que]__INDP_F_S_@SUBJ>_#24->25_§TH teve [ter]___V_PS_3S_IND_@FS-N22__§ATR lugar [lugar]__N_M_S_@25_§INC em [em]_PRP_@25 Basileia [Basileia]__PROP_F_S_@P27_§LOC (Suíça) [Suíça]__PROP_F_S_@N27_§LOC um [um]__ART_M_S_@>N_#32->33 equipamento [equipamento]__N_M_S_@17_§TH periférico [periférico]_ADJ_M_S_@N33 denominado [denominar]__V_PCP_PAS_@ICL-N33_§ATR «Audioman» [Audioman]__PROP_M_S_@35_§ATR-RES que [que]__&hum_INDP_M_S_@SUBJ>_#39->40__§AG permitirá [permitir]___V_FUT_3S_IND_@FS-N37_§ATR dotar [dotar]___V_INF_@ICL-40__§EV os [o]__ART_M_P_@>N_#42->43 computadores [computador]__N_M_P_@41_§BEN de [de]_PRP_@41 «orelhas» [orelha]__N_F_P_@P44_§TH

The PALAVRAS Parsing System

301

Notes 1 Portuguese Automatic Linguistic Analysis by means of a Versatile and Robust Annotation System. 2 As opposed to statistical/probabilistic systems. Rule-based approaches include, besides Constraint Grammar, also Chomskyan generative rewriting rules and topological field analysis. 3 A contest of this kind aims at measuring performance across automated systems of linguistic analysis, with a shared task and a shared gold standard annotation as reference. See, for instance, http://www.linguateca.pt/ aval_conjunta/ morfolimpiadas/ morpholympics.html or, for specific comparisons, http://www.linguateca.pt/ aval_ conjunta/ morfolimpiadas/ comp_dourada_fig.html, where PalMorf equals system A in the evaluation tables. 4 One example was the word ‘ontogeneticamente’ (ontogenetically), where neither the adjective ‘ontogenético’ nor the scientific prefix ‘onto’ was found in the lexicon at evaluation time. The first would have allowed an ‘-amente’ adverbial derivation, the second could have helped to identify the – existing – adjective ‘genético’ as a root candidate. 5 e.g. ‘balangou’ (balançou), ‘alfaltada’ (asfaltada) 6 e.g. ‘inventimanha’, ‘itamaroxia’, ‘coruptograma’ 7 e.g. ‘cast’ (from English), ‘entente’ (from French), or misspelled as ‘entaente’ 8 The concept and methodology of Constraint Grammar is explained in Part 3.1. Like Generative Grammar, CG is rule-based, but compared with the latter, it is more reductionist and robust, expressing linguistic regularities not as structural rewriting rules, but as contextual limitations (constraints). Rules operate on ambiguous alternative readings (e.g. morphological analyses, ambiguous part of speech, syntactic function alternatives like subject/object or semantic role candidates like agent/ patient), and will always leave one surviving reading, even for unorthodox, creative or wrong textual input. 9 The (c) hints at the differences between Brazilian and European Portuguese, and underlines the fact that the treebank covers both variants. Notwithstanding the copyright pun (also intended), all data are freely accessible to everybody on the Internet. 10 In computational linguistics, a treebank is a collection of syntactically analysed, and usually manually revised, sentences. Formats, grammatical conventions and size differ a lot between individual treebanks. The Floresta treebank itself is among the bigger ones, and was built by converting a PALAVRAS Constraint Grammar analysis into constituent trees (and later, dependency trees). 11 The Java interface opens when choosing ‘Refined’ in the search interface at http://corp. hum.sdu.dk/cqp.pt.html (see Guided Tour explanations on the site). 12 ‘Secondary tag’ means here that the tag is intended to help disambiguation of tags on other levels (e.g., morphological or syntactic), but is not to be disambiguated itself at this level. The word ‘margem’, for instance, comes with two semantic prototypes, (abstract place, ‘margin’) and (natural place, ‘river bank’). At the semantic analysis level, e.g. for machine translation, these may become primary tags, and can be disambiguated using CG rules operating on already-established syntactic context and structural relations. A relation to a dependency head meaning ‘write’ or ‘read’, for instance, may trigger the reading.

302

Working with Portuguese Corpora

13 MUC = Message Understanding Conference. Seven MUC-conferences were held between 1987 and 1997, initiated by DARPA (Defense Advanced Research Projects Agency). 14 The distinction between form and function is central to Constraint Grammar, and to many aspects of linguistics in general. Here, lexical form is a category that can be looked up in the lexicon, while lexical function is a category type that is instantiated through contextual constraints, for instance, a verb projecting a certain function onto its verb, as explained for the ‘civitas’ category in the next sentence. 15 HAREM = Avaliação Reconhecimento de Entidades Mencionadas (Evaluation of Named Entity Recognition). 16 Since the rare readings are part of the cohort fed to the CG disambiguator, their word classes compete on equal terms with those of the more likely readings, and even a 1 per cent word class error rate may account for a much larger percentage in relative terms – if the reading in question should have had a 1:10,000 proportion, and is disambiguated on the basis of word class alone, the wrong reading will be 100 times over-represented in relative terms. This will not change the accuracy of the parser as a whole, but if a corpus user searches for rare wrong readings of a given word, he will – without statistical culling – find sufficient wrong examples to irritate him. 17 The most refined, in this respect, are the English and Danish parsers, but frequency information is also used by the Spanish, French, German, Swedish and various other parsers in the CG family. 18 While the English and Danish parsers have CG framenets at their disposal, only the Portuguese PALAVRAS parsers has a database of verb-argument relations enriched with frequency information for combinations of syntactic function tags on the one hand and syntactic roles on the other. 19 Romance languages do not build compounds like in some Germanic languages (German, Danish, Dutch …). With the exception of Latin/Greek science derivatives, Romance words have only one root (plus affixes), and what would be a compound in German, would be expressed with pp attributes in Portuguese.

15

New Corpora for ‘New’ Challenges in Portuguese Processing Sandra Maria Aluísio, Thiago Alexandre Salgueiro Pardo and Magali Sanches Duran

1. Introduction Corpora are one of the main driving forces for better Natural Language Processing tools and applications. In addition, annotated corpora are one of the main meeting points for linguists and computer scientists working in the growing field of research known as corpus annotation, as they are the place where linguistic knowledge is systematized and codified as well as the starting point for system training and development. The availability of annotated corpora has allowed not only the acceptance and widespread visibility of linguistic theories and models, but also the development of good part-of-speech (PoS) taggers, syntactic and discourse parsers and semantic analysers. Moreover, in the last decade, annotated corpora have enabled significant advances, such as improvements in machine translation, the rise of text simplification techniques and other text adaptation methods, and the development of writing support tools. For Portuguese, only recently did more sophisticated annotated corpora and tools start to be built. In this chapter, we report on our recent experience at the Núcleo Interinstitucional de Linguística Computacional (NILC; http://www.nilc.icmc. usp.br) in designing and building such corpora, focusing on three specific corpora intended to support work on semantic role label analysis (Propbank-Br), multidocument summarization (CSTNews corpus) and text simplification (the corpora of simplified texts in the PorSimples project). We introduce each corpus in the light of seven main methodological questions that are currently receiving attention in corpus annotation: (1) deciding on what linguistic phenomena to annotate; (2) building balanced and representative corpora for the envisioned phenomena; (3) selecting and training annotators; (4) developing simple, fast and reliable annotation procedures; (5) deciding on the computational interface for the annotation task and determining its influence on the results;

304

Working with Portuguese Corpora

(6) evaluating the annotation and selecting suitable agreement measures; and (7) deciding on how to store and make the corpus available. The organization of this chapter is as follows. In Part 2, we discuss why linguistic annotation is important for both Natural Language Processing (NLP) and corpus linguistics. In Part 3, we analyse three specific corpora in the light of the abovementioned seven points.

2. Linguistic annotation: A bridge between Natural Language Processing and Corpus Linguistics Corpora have been largely used in linguistic studies, but corpus annotation is still more often used for NLP purposes than for linguistic purposes. From a corpus linguistic point of view, once the annotation of a corpus has been concluded, the labels assigned can be used as strings for running corpus searches, promoting a shift from queries based solely on words to more sophisticated searches (see, for example, a searching tool available for recovering excerpts of the corpus Floresta Sintá(c)tica, automatically annotated by the parser PALAVRAS (Bick, 2000), at http://www. linguateca.pt/Floresta/milhafre/). Meanwhile, from an NLP perspective, an annotated corpus can be used to train classifiers that automatically perform the same annotation task as that performed by human annotators. Several techniques are used to make the machine learn a task from annotated corpora, although these are not our focus here. Among the NLP tools that might benefit from corpus annotation, we cite, (i) at the level of the word, PoS and semantic taggers (i.e., automatic classifiers that assign respectively grammatical and semantic categories to the words in a text); (ii) at the syntactic level, parsers that perform syntactic analyses and display them in a range of different output formats, like syntactic trees, for example; (iii) at the syntactic/semantic level, semantic role label classifiers, which assign labels such as agent, patient, beneficiary and instrument to the arguments of a predicate; and (iv) at the discourse level, co-reference annotators (i.e., programs that handle the process of adding information about anaphoric links in a text, such as connecting a pronoun them to its antecedent). These layers of automatic analysis have mutual benefits for other tools and applications. For example, PoS taggers are useful for, among other things, distinguishing words that have the same spelling but different meanings or pronunciations; semantic taggers aid in disambiguating word senses; syntactic parsers can help with the task of simplifying a text by shortening its sentences as a means to improve its readability; semantic role labelling benefits the task of answering wh-questions about an event described in a text; and co-reference resolution can help grammar checkers decide whether pronouns have a reference in the text. Due to the increasing importance of annotated corpora, since the early 1990s many efforts have been made to provide infrastructure to corpus creation and annotation. Examples of these efforts include the Expert Advisory Group on Language Engineering Standards (EAGLES; http://www.ilc.cnr.it/EAGLES96/home.html), which provided standards for building very large-scale language resources; the Text

New Challenges in Portuguese Processing

305

Encoding Initiative (TEI; http://www.tei-c.org/), which develops and maintains a standard for the representation of texts in digital form for online research, teaching and preservation; recent corpus encoding and annotation initiatives using XML (Extensible Markup Language) as the markup language, like XCES (Corpus Encoding Standard for XML; http://xces.org/); and guidelines for good practice to develop linguistic corpora (e.g., Wynne, 2005). Currently, many fields in NLP agree that annotated corpora, if they follow principled guidelines, can give rise to systems that deal with linguistic phenomena long considered too complex to be automatically addressed. For Portuguese, which is the focus of this chapter, Floresta Sintá(c)tica (Afonso et al., 2002) has been used as a reference corpus for work on syntax (e.g., Wing and Baldridge, 2006; Silva et al., 2010); PLN-BR-FULL corpus (Muniz et al., 2007; Bruckschen et al., 2008), with PoS annotation, has been used for identifying and analysing complex predicates (Duran et al., 2011); PropBank.Br (Duran and Aluísio, 2012), which had its first version released in 2011 (and is currently being annotated with verb senses mapped to Propbank), is already being used for the development of semantic role-labelling systems (Manchego and Rosa, 2012; Fonseca and Rosa, 2012); and CorpusTCC (Pardo and Nunes, 2004), Summ-it (Collovini et al., 2007) and CSTNews (Cardoso et al., 2011a) have been used to develop discourse parsing (Pardo and Nunes, 2008; Maziero and Pardo, 2011) and co-reference resolution (Gonçalves et al., 2008; Souza et al., 2008). In addition to parsing and analysis tools, other applications have benefited from annotated specialized corpora. For example, CorpusDT (Feltrim et al., 2003) was used to develop writing support tools (Feltrim, 2004; Souza and Feltrim, 2013) and both the CSTNews and the Rhetalho corpora (Pardo and Seno, 2005) have been used for training, developing and evaluating both single and multi-document summarization methods and systems (e.g., Uzêda et al. 2010; Castro Jorge and Pardo, 2010; Cardoso et al., 2011b; Ribaldo et al., 2012). Moreover, corpora of original and simplified texts (Caseli et al., 2009) were used (i) to train a model to ‘translate’ natural sentences into simplified ones (Specia, 2010), (ii) to create a system that learns the appropriate degree of simpliﬁcation according to a given literacy level (Gasperin et al., 2009) and (iii) to test text readability measurement methods (Scarton and Aluísio, 2010; Aluísio et al., 2010). In addition, corpora of technical manuals (Muniz, 2011) were used to help authors perform lexical simplification tasks in technical manuals with the SIMPLIFICA editor (Muniz et al., 2011) and both parallel and aligned corpora (Aziz and Specia, 2011) have been used for training statistical machine translation systems (Aziz et al., 2009; Caseli and Nunes, 2009; Beck, 2011; Beck and Caseli, 2013). Most of these corpora were developed and made available by NILC, along with many others during the centre’s long history in corpus development. A comprehensive overview of NILC corpora, including annotated and non-annotated corpora, is presented by Nunes et al. (2010) and can be found on the NILC webpage. The traditional linguistic levels usually taken into account in written text processing in both NLP and corpus linguistics (as sketched by Jurafsky and Martin, 2009) are morphology, syntax, semantics, discourse and pragmatics. Figure 15.1 shows some

306

Working with Portuguese Corpora

Figure 15.1 State-of-the-art applications for Portuguese and the annotation level of corpora on which they were based state-of-the-art applications developed for Portuguese over time and their relationship with the annotation in the corpora on which they were developed. The NILC grammar checker, for instance, was fully developed in 1998 and was based on the Corpus NILC (Pinheiro and Aluísio, 2003), which had not been annotated at the time. The best summarization systems for Portuguese have emerged since 2010 and were mainly based on the CSTNews corpus, which had both semantic and discourse annotation. State-of-the-art machine translation systems for Portuguese have just started to explore more than simple part-of-speech and phrase/n-gram information (Beck, 2011; Beck and Caseli, 2013). Text simplification for Portuguese texts involves the most linguistic levels, from morphology to discourse. In general, corpus annotation can be divided into five main phases: (i) theory, (ii) preparation, (iii) annotation, (iv) evaluation and (v) delivery. In the first phase, one needs to select the task/problem in order to make initial decisions concerning what to annotate and how. The preparation phase is responsible for collecting the corpus, deciding on which annotation software to use, and hiring/selecting and training annotators. In the third phase, the actual annotation task is carried out and the project manager is in charge of monitoring the progress and giving feedback to the team. In the evaluation phase, the quality of the annotated data is assessed. The last phase is less complex as it involves formatting and delivering the annotated corpus for public or private use. Just as with corpus creation and annotation, the uses and applications of corpora have evolved over time. Current corpora projects in NLP aim to build widely available corpora, following interchangeable annotation schemes and standards, with robust linguistic theories and models underlying the annotation. In NLP, both corpus creation and annotation have started to be seen as a science and, as such, must follow strict scientific methods (Hovy and Lavid, 2010).

New Challenges in Portuguese Processing

307

Currently, in order to have reliable corpora and, therefore, trustworthy applications, it is usual to employ several annotators (working either online or on-site) not only to scale the annotation to a large number of texts, but also to guarantee that the phenomena under investigation are handled systematically and consistently by more than one person. Double-blind annotation (two annotators for each text) avoids possible idiosyncratic bias and enables the computation of agreement measures, which provide an indication about how trustworthy the annotation is. As expected, such an approach to corpus annotation requires a strict control over the process, which includes (i) making strategic corpus engineering decisions; (ii) providing annotation training so that the annotation is as uniform as possible; (iii) producing guidelines to support the task; and (iv) registering both the history of the whole process and the valuable learned lessons. Online annotation schemes (e.g., through Amazon’s Mechanical Turk) enable crowdsourcing and have ushered in the e-science world, but they also call for more control over the whole process and require researchers to deal with problems such as depending on annotators of different expertise levels (see Callison-Burch and Dredze, 2010, for details). Time has shown that good annotated corpora last for decades. In addition, a consensus has been reached that annotating a corpus is one of the main meeting points for linguists and computer scientists working with NLP and corpus linguistics as it allows linguists and related professionals (a) to harvest data in order to investigate phenomena of interest and eventually to model or explain them; (b) to make linguistic theories and models clear and systematized enough to be applied to actual data; and (c) to test, validate and/or improve theories and models. At the same time, reliably annotated corpora are a means for computer scientists to develop, train and test their systems. This enables multidisciplinary work in an area that has experienced both communication and collaboration difficulties, as Dias da Silva (1996) effectively pointed out. Accordingly, corpus annotation evolution and relevance have posed to the research community some methodological questions on corpus design. Seven of these questions synthesize the discussion to date and are mandatory for any annotation that is carried out (Hovy and Lavid, 2010). They can be summarized as follows: (1) decision on what linguistic phenomena to annotate; (2) balancing and representativeness of the corpus for the envisioned phenomena; (3) selection and training of annotators; (4) development of simple, fast and reliable procedures of annotation; (5) decision on what annotation interface to use for the task and the analysis of its influence on the results; (6) evaluation of the annotation and decision on which agreement measures are suitable for such annotation; and (7) description, storage and availability of the corpus. Such areas refine the five main phases or stages previously presented. In what follows, we introduce and discuss some initiatives of corpus annotation for Portuguese language in the light of these seven areas.

Working with Portuguese Corpora

308

3. Recent annotation projects at NILC 3.1. PropBank.Br 3.1.1 Deciding on the linguistic phenomena to annotate One of the new challenges in Portuguese automatic processing is to add a semantic layer of annotation over syntax. Among several possibilities of semantic analysis, semantic role labelling (SRL) is one that has been adopted for several languages. SRL comprises three steps: (i) delimitation of argument takers (a verb, e.g., ‘to love’, or complex predicates with more than one element, e.g., ‘to take care’), (ii) delimitation of arguments and (iii) assignment of role labels from a pre-defined list, such as agent, instrument, beneficiary and recipient. To bring Portuguese processing up to date with that of other languages and, at the same time, to benefit from the know-how already available for other languages, we chose to add a semantic layer with semantic role labels in a Brazilian Portuguese corpus that had already been annotated by a syntactic parser and had already been manually revised (i.e., a tree bank). The result is a humanannotated corpus that enables machine learning for future automatic annotation. This can benefit several NLP tasks based on machine learning approaches, as these labels will enrich their set of features. Once the aim of the annotation is defined, we had to choose between developing a new approach for SRL or adopting an approach already tested for other languages, like Propbank (Palmer et al., 2005) or FrameNet (Baker et al., 1998), originally conceived for English. A new approach would require more time and resources than we had at our disposal; thus, this was not an option. As this was a post-doctoral project, we had limited resources: a single linguist and a period of one year to develop the annotation task. These limitations influenced the project design. We chose Propbank instead of FrameNet because Propbank has a smaller set of role labels than FrameNet, thereby facilitating human annotation and enabling future uses for NLP purposes. Specifically, annotation in Propbank is theory free (uses generic labels) and relies on a set of six numbered arguments (Arg0-Arg5) as well as a small set of labels for annotating modifiers (ARGMs, such as location, time, manner and direction). Let’s take an example from Propbank to illustrate the SRL task. Example (1) is annotated with three numbered labels and with two ARGMs, as shown in the following, after having the argument taker aceitar identified: (1) Ele não podia aceitar nada de valor dos seus clientes (He couldn’t accept anything of value from his clients) [Arg0 Ele ] [ArgM-NEG não ] [ArgM-MOD poderia ] [V aceitar ] [Arg1 nada de valor] [Arg2 dos seus clientes] The Propbank annotation scheme requires that semantic role labels and the verb sense identification (verb sense ID) be immediately annotated. It is supported by a lexical resource called framefile, in which generic role labels are ‘translated’ into very specific role labels. For example, in the framefile ‘speak,’ there is a role set with the following role labels:

New Challenges in Portuguese Processing

309

Arg0: talker Arg1: subject/language Arg2: hearer

Therefore, this makes it easy for annotators to decide which generic label to assign to each argument in the corpus. For example, in the sentence ‘Mary spoke to John about his atrocious breath,’ ‘Mary’ is Arg0, ‘John’ is Arg2 and ‘about his atrocious breath’ is Arg1; in the sentence ‘Mary speaks five languages,’ ‘Mary’ is Arg0 and ‘five languages’ is Arg1 (in this case there is no Arg2).

3.1.2 Balance and representativeness of the corpus According to Propbank guidelines, the SRL annotation should be performed over syntactic trees (see Figure 15.2). For this reason, we needed a syntactically parsed corpus that preferably had been checked by human experts. When we initiated our project, the only corpus that met those requirements was the Bosque corpus (http:// linguateca.pt), a subcorpus of the Floresta Sintá(c)tica project (Afonso et al., 2002). Therefore, we decided to use Bosque as a starting point.

Figure 15.2 Semantic role labelling annotation over a syntactic tree Bosque has 9,368 sentences; we selected all of the sentences that were of Brazilian Portuguese (4,213 sentences extracted from newspaper Folha de São Paulo in 1994). We decided to annotate only full lexical verbs, as previous analyses showed us that auxiliary verbs, including temporal, modal, aspectual and passive voice auxiliaries, have predictable roles and could be automatically annotated. For example, the verb começar in the pattern começar a + infinitive (start to + infinitive) is an aspectual verb and therefore was not annotated. Each sentence was duplicated as many times as the number of main verbs it contained, so that each verb could be annotated as a separate instance. The resulting corpus had 6,142 instances and 1,068 different verbs.

3.1.3 Selecting and training annotators Although the project Propbank-Br did not require selecting different annotators, as it had a sole annotator, training was not ignored and it was carried out as selftraining. The Propbank guidelines (http://verbs.colorado.edu/~mpalmer/projects/ace/ PBguidelines.pdf) and framefiles (http://verbs.colorado.edu/verb-index/index.php)

310

Working with Portuguese Corpora

were used as guidelines for role label assignment. Every time a language-specific problem arose, the annotator made a decision and recorded it in a training manual for Portuguese annotation, which was used for her own consultation as time passed and past decisions were no longer ‘fresh’ in her mind.

3.1.4 Developing a simple, fast and reliable procedure of annotation One of the lessons learned during the annotation was that the simpler the task, the faster and more reliable it is. Delays resulted from cases not foreseen in the training manual, almost all of which related to language-specific features. When a particular problem occurred for the first time and we had no solution, we flagged the sentence for subsequent analysis. When the same problem arose again later, we were forced to make a decision about how to treat it, which entailed a revision of all sentences flagged for subsequent analysis in order to ensure consistent annotation. Such decisions were added to the training manual.

3.1.5 Choosing an annotation interface Most annotation projects develop tailor-made annotation tools, based on the procedure decisions made previously. Due to our limited project resources, developing our own annotation tool was out of the question. Therefore, we made a list of requirements and compared several freely available annotation tools. This process has been reported by Duran et al. (2010). Ultimately, we chose SALTO (Burchardt et al., 2006), a tool developed for German FrameNet and therefore suitable for semantic role labelling over syntactic trees. The SALTO tool has a graphic interface that facilitates the annotation task: for each instance, the annotator had only to click on a role label and drag it to the syntactic constituent node. SALTO has fully met our expectations. An important feature is the possibility of editing frames during annotation, which is useful for adding new labels to the set of existing labels when necessary. The possibility of editing, however, must not affect labels already assigned and is not adequate for projects that involve several annotators, as they might change the set of labels midway through an annotation task and produce divergences with other annotators. SALTO has been conceived for double-blind annotation – that is, it allows for the same text to be sent to two different annotators. When the annotation is concluded, SALTO calculates the agreement between the two annotators and highlights cases where they diverged for the project coordinator to peruse. In addition to assigning role labels to the arguments of the verb, SALTO enables annotation of sentences as a whole (using sentence flags) and tokens (using word tags). These resources were used to cluster similar queries for later analysis. SALTO provides the option of exhibiting only flagged sentences or only those sentences flagged with a specific flag, which is very useful in revising the annotation of difficult cases. For example, we used this feature to spot sentences with an omitted subject for us to decide how to treat them later.

New Challenges in Portuguese Processing

311

3.1.6 The description, storage and availability of the corpus The first version of Propbank-Br (an XML file) is available at http://www.nilc.icmc.usp. br/ portlex/. We will soon release a second version, containing verb senses mapped to Propbank (English) verb senses. We are also constructing frame files – that is, a file for each verb containing a role set for each of its senses. The frame files will give support for the annotator’s decisions. Propbank-Br is currently being used as a training corpus to develop automatic SRL annotators. After this step, we will have met the requirements to undertake a large-scale project in SRL, employing several annotators: MM MM MM

A lexical resource with verbs and their ID senses (framefiles); A training manual in Portuguese; and An automatic SRL annotator for pre-annotation (so that human annotators will revise the automatic annotations instead of making them from scratch).

3.2. CSTNews 3.2.1 Choosing the linguistic phenomena to annotate The CSTNews corpus (Cardoso et al., 2011a) was developed to support research on multi-document summarization, which is the task of producing a unique summary from a group of texts on the same topic (Mani, 2001). In particular, CSTNews aims to make available reliable discourse annotation for subsidizing deep linguistic processing in summarization, which requires knowing which discourse relations hold among text passages/segments (usually sentences or clauses) in order to select appropriate segments for the summary. For instance, if two sentences show an equivalence relation (and thus have similar content), only one of them should be included in the summary (otherwise, a summary with redundancies would be produced). The discourse annotation in CSTNews models both single and multi-document relationships. Although single text annotation can show how the segments within a text relate to one another (e.g., by elaboration, contrast and cause–effect relations), the multi-document annotation can display some of these same previous relations among different texts and other relationships, such as content overlap, citation among documents and modality. To model the single document relations, we used rhetorical structure theory (RST) (Mann and Thompson, 1987); for multi-document relations, we adopted cross-document structure theory (CST) (Radev, 2000; Maziero et al., 2010). For instance, the sentences in Examples (2) and (3) both show a cause–effect RST relationship (the aeroplane accident caused 17 people to die). (2) Ao menos 17 pessoas morreram após a queda de um avião no Congo (At least 17 people died in an aeroplane crash in Congo) (3) A aeronave se chocou com uma montanha e caiu sobre uma floresta próxima ao aeroporto (The aeroplane collided with a mountain and crashed down in a forest near the airport) At the same time, the sentence in Example (2) shows an overlap CST relationship

312

Working with Portuguese Corpora

with the sentence in Example (4) (from another text) as both show some overlapping content (the accident in Congo). (4) O porta-voz das Nações Unidas informou que houve um acidente aéreo na localidade de Bukavu—na República Democrática do Congo—na quinta-feira à tarde (The United Nations spokesman informed the public that there was an aeroplane accident in Bukavu – Democratic Republic of Congo – on Thursday afternoon)

3.2.2 Balance and representativeness of the corpus The corpus has 140 Brazilian Portuguese news texts distributed in 50 clusters, amounting to 2,088 sentences and 47,240 words. Each cluster contains two or three texts on the same topic. Per cluster, the corpus contains an average of 2.8 texts, 41.76 sentences and 944.8 words. The texts were collected in the middle of 2007 from varied sections (Daily News, World, Sports, Science, Economy, Politics and Money) from mainstream online Brazilian news agencies, such as Folha de São Paulo, O Estado de São Paulo, O Globo, Jornal do Brasil and Gazeta do Povo. It is interesting to note that, as the selection of texts to compose the corpus was driven by topic relevance (in order for these topics to be commented on and covered by different news sources), the distribution of texts was not uniform among sections and agencies. For instance, some sections, such as World and Daily News, have far more texts than others, such as the Science and Money sections. Overall, after corpus annotation (Cardoso et al., 2011a), it is possible to realize that the main discourse relations desired and useful for summarization purposes (both single and multi-document) can be observed in the corpus. Moreover, the usual behaviour of other discourse-annotated corpora can also be noted, with some relationships being more common than others.

3.2.3 Selecting and training annotators RST and CST annotations were performed by different groups. RST annotation was performed by eight computational linguists, whereas CST annotation was performed by four. All of the annotators had some experience in discourse, but not necessarily in the discourse models that were adopted. For each annotation, the training phase lasted from two to four weeks, resulting in annotation manuals (with rules, exceptions and examples) and sometimes in some refinements in the discourse models (as in the case of CST, as reported by Maziero et al., 2010) due to adaptations to text genre, domain and language. The training phases finished when the only remaining disagreements were those due to natural language ambiguity and subjectivity, as evidenced during the discussion of the disagreement cases.

3.2.4 Developing a simple, fast and reliable annotation procedure For each discourse model, the annotation procedure took about two months of daily one-hour meetings. The daily annotation proved to be good for consistency in annotation and was successful in creating a regular commitment with the annotators. Each day provided enough time to annotate one or two clusters, usually in groups of

New Challenges in Portuguese Processing

313

two or three annotators for RST and one or two for CST. Annotating in groups was useful for discussions and dealing with doubts. It also allowed each group to consult other groups whenever there were difficult questions or decisions to be made. Each group was randomly formed for each annotation session in order to avoid bias in the process. In special annotation sessions, usually held once a week, all of the annotators annotated the same texts in order to compute agreement.

3.2.5 Choosing an annotation interface and assessing its influence on the results For RST annotation, the RSTTool (O’Donnell, 2000) was used as it is widely known and used in the area, being easy to use and having a graphical interface that is customized for RST annotation. Although it proved to be difficult in some annotation points, the annotators performed very well in using it. For CST annotation, a simple visual tool was produced, named CSTTool (Aleixo and Pardo, 2008). It helped the annotators detect and select which passages to relate. Both tools produce output file formats (XML, as well as other formats) that are commonly used in the area. The simplicity and practicality of the tools played a decisive role in making the annotation faster and more reliable. In addition, as the tools already incorporate some of the discourse models’ characteristics and restrictions, they avoided some errors, letting the annotators know of possible annotation problems.

3.2.6 Evaluating the annotation and choosing agreement measures For both discourse models, the traditional kappa agreement measure (Carletta, 1996) was used on the common annotations that were performed. In order to have a more refined view of the annotation, simple per cent counts were computed for total, partial and null agreements among annotators, allowing for the identification of the most problematic cases. The measures were applied to the specificities of each discourse model: for RST, agreement was computed not only for the identified relationships, but also for the text segments that were related (as their granularity can vary) and their status in the annotation; for CST, the agreement on the direction of the relationships (when directionality applied) was also tested. For RST, agreement was similar to that found in related work in the literature; for CST, agreement was better than the only other annotation effort in the area so far.

3.2.7 The corpus, its storage and availability The corpus and its annotations are available for research purposes on the research group’s web page and at the SUCINTO project portal (www.icmc.usp.br/~taspardo/ sucinto). The annotations have been registered and published in technical reports and papers. The annotation data are those produced by the annotation tools adopted, in common XML formats for discourse representation. Finally, it is worth noting that CSTNews contains not only the reported annotations, but also several other annotations and bits of information that are useful for summarization, such as text-summary alignments, different types of summaries for each text and cluster, temporal and aspect annotations and sense annotation for nouns.

Working with Portuguese Corpora

314

3.3. The corpora of simplified texts of the PorSimples Project 3.3.1 Choosing the linguistic phenomena to annotate The PorSimples project (http://www.nilc.icmc.usp.br/wiki/index.php/Principal) was funded by Fapesp, a Brazilian funding agency, and by MS Research from 2007 to 2010. It aimed to produce text simplification tools for promoting digital inclusion and accessibility for people with low levels of literacy as well as possibly other kinds of reading disabilities. The project’s goal was to help these readers process documents available on the Internet, such as texts published on governmental sites or by relevant news agencies, both of which are expected to be of importance to a large audience comprised of various literacy levels. The language of the texts was Brazilian Portuguese, for which there was no text simplification system at the time. Two types of simplification – natural and strong – were proposed in the PorSimples project. The rationale behind this decision was that different types of simplification are needed for readers with different literacy levels (basic and rudimentary), children learning to read and people with cognitive disabilities. Table 15.1 provides examples of Table 15.1 An example of an original text (A) and its simplified versions (B and C) (Caseli et al., 2009) Version

Text

Em uma entrevista com a imprensa, convocada para responder a acusações de corrupção durante seu mandato como prefeito na cidade de Ribeirão Preto, o ministro Antonio Palocci Filho (Fazenda) disse que deixou seu cargo disponível, mas por recomendação do presidente Luiz Inácio Lula da Silva, permaneceria no governo. (In a press conference called to answer corruption charges during his term as mayor of the city of Ribeirão Preto, Minister Antonio Palocci Filho (Treasury) said he made his position available but, at the recommendation of President Luiz Inácio Lula da Silva, would remain in the cabinet.) B, simplified O ministro Antonio Palocci (Fazenda) disse numa entrevista com a imprensa que vai deixar o seu cargo, embora o presidente Lula o aconselhou a permanecer no governo. (Minister Antonio Palocci (Treasury) said in a press conference that he will leave his position, although President Lula advised him to remain in the government. ) C, simplified O ministro Antonio Palocci é o ministro da Fazenda. Antonio Palocci disse em uma entrevista com a imprensa que vai deixar o seu cargo. Mas ele disse que o presidente Lula o aconselhou a permanecer no governo. (Minister Antonio Palocci is the Treasury Minister. Antonio Palocci said in a press conference that he will leave his position. But he said that President Lula advised him to remain in the government.) A, original

New Challenges in Portuguese Processing

315

an original text from an online Brazilian newspaper (translated here from Portuguese) in (A), its natural simplification in (B) and its strong simplification in (C). The sentence in (B) can be further simplified if broken down into shorter ones, as shown in (C), which illustrates the application of all simplification operations (as defined in a manual) and which can be useful for a low-literacy readership. In addition, the tools were designed to help users improve their reading skills over time. The difference between these two types of simplification (natural and strong) is the degree of application of simplification operations to the sentences. For strong simplification, we apply a set of pre-defined simplification operations to make each sentence as simple as possible, whereas for natural simplification these operations are applied only when the resulting text remains natural. Naturalness is based on a group of features that are hard to define using hand-crafted rules, which were learned from examples of natural simplifications, using a machine learning approach (Gasperin et al., 2009). This approach can learn rules (or statistical models) by generalizing from examples provided by annotated corpora (for details about the machine learning approach to NLP tasks, see Mitchell, 1997). In order to build text simplification tools, it is important to compare general-use, non-simplified texts with their corresponding simplified versions (i.e., use parallel corpora of original simplified texts). This way, we can investigate which kinds of changes should be applied, what resources are necessary in order for them to work and how to evaluate the simplification task. In addition, a parallel corpus like this can be directly used with statistical techniques to learn simplification rules.

3.3.2 Balance and representativeness of the corpus The first corpus manually simplified in the PorSimples project is composed of 104 texts from the Zero Hora newspaper. This corpus was selected because this newspaper had a corresponding simplified version geared toward children. Therefore, this parallel corpus can be useful as a means to evaluate the proposed simplification operations used for automatically generating newspaper versions for children; it can also serve as a training corpus in a machine learning approach for learning simplification operations. The second corpus, composed of 50 texts, was selected from the Folha de São Paulo newspaper – namely, from its Caderno da Ciência (Science section) – as we wanted to test if the simplification operations learned automatically could also be used in different genres. To provide an overview of the features of a parallel corpus of the original simplified texts, Table 15.2 shows the total number of sentences, words and average sentence length (in words) of the original, natural and strong simplified texts in the Zero Hora corpus. In the last column, we see that a considerable reduction occurred with respect to individual sentence lengths. The overall text length is longer than the original, which was expected, as simplification usually yields the repetition of information in different sentences, particularly when splitting operations are performed. In the Zero Hora corpus, among the syntactic phenomena addressed during the simplification process in the PorSimples project (i.e., apposition, coordinate clauses, passive voice, relative and subordinate clauses), we found that for natural

316

Working with Portuguese Corpora

Table 15.2 Statistics on the original and simplified corpora (natural and strong; Caseli et al., 2009) Statistic

Original (O) Natural (N) Strong (S)

# of sentences 2,116 # of words 41,897 Average sentence length 19.8

3,104 43,013 13.85

3,537 43,676 12.35

Change from O to S +67.15% +4.246% –37.63%

simplifications the most common operation was lexical simplification, followed by splitting sentences, dropping parts of the text and changing discourse markers for simpler and/or more frequent ones. As for strong simplifications, the most frequent operations were splitting sentences and lexical substitution. With regard to simplification operations in the Zero Hora corpus (i.e., non-simplification, strong rewriting, simple rewriting, subject–verb–object order, passive to active voice, inversion of clause order, splitting sentences, joining sentences, dropping one sentence, dropping sentence parts and lexical substitution), we found that the most frequent are coordinate, relative and subordinate clauses. In general, these are the most difficult cases to simplify, according to studies carried out in our project (see details in Caseli et al., 2009).

3.3.3 Choosing and training annotators Both corpora were simplified by only one linguist, who was an expert in text simplification, which is considered a drawback of this project. The support provided for her to perform the task was the manual of simplification operations created during the PorSimples project (Specia et al., 2008), in addition to an annotation edition tool called Simplification Annotation Editor (Caseli et al., 2009) to support lexical and syntactical simplifications.

3.3.4 Developing a simple, fast and reliable annotation procedure To perform the simplification task, we developed a three-step process with the help of the Simplification Annotation Editor, which was considered user-friendly by the project’s annotator. First, the source text (original version) is created (or simply opened from a file) and possibly revised to correct punctuation and spelling mistakes. Second, natural simplifications are produced and logged. Finally, from these, the strong simplifications are generated.

3.3.5 Choosing an annotation interface and assessing its influence on the results A great deal of effort was put into the annotation process of PorSimples as we were interested in fostering the research area on text simplification in addition to building all the necessary tools to meet the project’s requirements. During the project, we proposed (i) a XCES-based annotation schema, (ii) an annotation edition tool and

New Challenges in Portuguese Processing

317

(iii) a portal to access parallel corpora of original-simplified texts. The editor that supported the linguist’s simplification task worked at both the lexical and syntactic levels, allowing, at the first level, the replacement of words and discourse markers with simpler (non-ambiguous) and/or more frequent ones. In order to perform lexical simplification, the editor used a number of resources: a list of simple words taken from a dictionary for young people written by Biderman (2005) and from children’s sections in newspapers, a list of the most frequent words taken from both the NILC corpus (Pinheiro and Aluísio, 2003) and the list of discourse markers of Pardo and Nunes (2004). We considered to be simple the words used in texts addressed to children, in dictionaries for young people, or the most frequent words taken from large corpora. As for the syntactic level, the editor proposed syntactic operations predefined in the manual created for the project, in addition to joining, dropping and rewriting operations used in other simplification projects for English language. The linguistic resource used in this mode is the parser PALAVRAS (Bick, 2000), which provided the syntactic trees with the identification of both clauses and their elements that were the focus of the simplification rules. For example, in order to split a non-restrictive relative clause, the simplification operation uses syntactic clues (clauses tagged as relative), lexical clues (relative pronoun) and punctuation clues such as commas. For the original sentence in Example (5), the simplification operation will generate the sentences in Example (6). (5) Mais de 20 pessoas foram mordidas por palometas que vivem nas águas da barragem Sanchuri. (More than 20 people have been bitten by gold piranhas, which live in the waters of the Sanchuri dam.) (6) Mais de 20 pessoas foram mordidas por palometas. Palometas vivem nas águas da barragem Sanchuri. (More than 20 people have been bitten by gold piranhas. Gold piranhas live in the waters of the Sanchuri dam.)

3.3.6 Evaluating the annotation and choosing agreement measures Because all of the simplification annotations on both corpora were performed by a single expert linguist in this task, there was no opportunity to calculate annotation agreement measures. This is considered to be a major drawback of the PorSimples project, which can be resolved through the collaborative compilation of new parallel corpora, for example addressing new genres other than newspapers.

3.3.7 The corpus, its storage and availability Both simplified corpora are publicly available in XCES encoding standard, in which the source documents are plain texts and all the annotations are stored in stand-off XML documents, making it possible to record the simplification operations and align natural and strong simplified sentences. These parallel corpora (i.e., the original texts with their natural and strong simplified versions) are available to download as zip files from the PorSimples wiki (http://www.nilc.icmc.usp.br/wiki/index.php/Tools) and the Portal of Parallel Corpora of Simplified Texts, which contains all the different versions

318

Working with Portuguese Corpora

(i.e., original, revised, natural and strong simplified) – these can be searched through the database available at the portal (http://www.nilc.icmc.usp.br/portal/). From a parallel corpus, one can query the portal to do the following: MM MM MM MM MM

recover all original sentences that were modified during the simplification, see the lexical substitution pairs (complex and simple words), access the XCES annotation and the resources used, download dictionaries of simple words and the list of discourse markers, search the corpus for the original and simplified texts (including statistics), the alignment between such texts, and the syntactical constructions considered to be the actual simplified texts.

4. Final remarks Corpus creation and annotation have become an important part of research projects in both corpus linguistics and NLP. Corpora are no longer conceived merely as a necessary step towards end-user applications; rather, they represent in themselves a research challenge and sometimes turn out to be more difficult to build than the final intended tasks, imposing the need to understand, systematize, explain and validate theories and models in actual data. Time has shown that good annotated corpora can be defining factors in the acceptance of new ideas and methods and for the development of good systems. We believe that the more refined the annotation, the more sophisticated will be the tasks that NLP and corpus linguistics can tackle. In this chapter, we have provided evidence of this by describing the design decisions of three new corpora aimed at supporting research in complex tasks, such as semantic and discourse analysis, summarization and simplification. We hope not only that these corpora will be useful to a variety of new challenges, but also that their compilation history might inspire investments and new scientifically sound efforts in corpus creation and annotation.

Acknowledgements The authors are grateful to Microsoft Research and FAPESP for supporting this work.

References Afonso, S., Bick, E., Haber, R. and Santos, D. (2002), ‘Floresta sintá(c)tica: A treebank for Portuguese’, in Proceedings of the 3rd International Conference on Language Resources and Evaluation, pp. 1698–703. Aleixo, P. and Pardo, T. A. S. (2008), ‘CSTTool: Uma Ferramenta Semi-automática para Anotação de Córpus pela Teoria Discursiva Multidocumento CST’. Technical Report NILC-TR-08–03, São Carlos, SP.

New Challenges in Portuguese Processing

319

Aluísio, S. M., Specia, L., Gasperin, C., and Scarton, C. E. (2010), ‘Readability assessment for text simplification’, in Proceedings of the 5th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 1–9. Aziz, W. F. and Specia, L. (2011), ‘Fully automatic compilation of Portuguese-English and Portuguese-Spanish parallel corpora’, in Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology, pp. 234–8. Aziz, W. F., Pardo, T. A. S. and Paraboni, I. (2009), ‘Statistical phrase-based machine translation: Experiments with Brazilian Portuguese’, in Anais do VII Encontro Nacional de Inteligência Artificial, pp. 769–78. Baker, C. F., Fillmore, C. J. and Lowe, J. B. (1998), ‘The Berkeley FrameNet project’, in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pp. 86–90. Beck, D. E. (2011), ‘Syntax-based statistical machine translation using tree automata and tree transducer’, in Proceedings of the Student Session of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 36–40. Beck, D. E. and Caseli, H. M. (2013), ‘Tree-based statistical machine translation: Experiments with the English and Brazilian Portuguese pair’. Learning and Nonlinear Models, 11, 11–25. Bick, E. (2000), The Parsing System ‘PALAVRAS’: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. PhD Dissertation, Århus University. Biderman, M. T. (2005), Dicionário Ilustrado de Português. São Paulo: Editora Ática. Bruckschen, M., Muniz, F., Souza, J. G. C., Fuchs, J. T., Infante, K., Muniz, M., Gonçalves, P. N., Vieira, R. and Aluísio, S. M. (2008), ‘Anotação linguística em XML do corpus PLN-BR’. Technical Report NILC-TR-09–08, São Carlos, SP. Burchardt, A., Erk, K., Frank, A., Kowalski, A. and Pado, S. (2006), ‘SALTO – a versatile multi-level annotation tool’, in Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy. Callison-Burch, C. and Dredze, M. (2010), ‘Creating speech and language data with Amazon’s Mechanical Turk’, in Proceedings of the NAACL-HLT Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pp. 1–12. Cardoso, P. C. F., Maziero, E. G., Castro Jorge, M. L. R., Seno, E. M. R., Di Felippo, A., Rino, L. H. M., Nunes, M. G. V. and Pardo, T. A. S. (2011a), ‘CSTNews – a discourseannotated corpus for single and multi-document summarization of news texts in Brazilian Portuguese’, in Proceedings of the 3rd RST Brazilian Meeting, pp. 88–105. Cardoso, P. C. F., Pardo, T. A. S. and Nunes, M. G. V. (2011b), ‘Métodos para sumarização automática multidocumento usando modelos semântico-discursivos’, in Proceedings of the 3rd RST Brazilian Meeting, pp. 59–74. Carletta, J. (1996), ‘Assessing agreement on classification tasks: The Kappa statistic’. Computational Linguistics, 22, (2), 249–54. Caseli, H. M. and Nunes, I. A. (2009), ‘Statistical machine translation: Little changes big impacts’, in Proceedings of the 7th Brazilian Symposium in Information and Human Language Technology, pp. 1–9. Caseli, H. M., Pereira, T. F., Specia, L., Pardo, T. A. S., Gasperin, C. and Aluísio, S. M. (2009), ‘Building a Brazilian Portuguese parallel corpus of original and simplified texts’, in Proceedings of the 10th Conference on Intelligent Text Processing and Computational Linguistics, pp. 59–70. Castro Jorge, M. L. R. and Pardo, T. A. S. (2010), ‘Experiments with CST-based multidocument summarization’, in Proceedings of the ACL Workshop TextGraphs-5: Graph-based Methods for Natural Language Processing, pp. 74–82.

320

Working with Portuguese Corpora

Collovini, S., Carbonel, T. I., Fuchs, J. T., Coelho, J. C. B., Rino, L. H. M. and Vieira, R. (2007), ‘Summ-it: Um corpus anotado com informações discursivas visando à sumarização automática’, in Proceedings of the 5th Workshop in Information and Human Language Technology, pp. 1605–14. Dias da Silva, B. C. (1996), A Face Tecnológica dos Estudos da Linguagem: O Processamento Automático das Línguas Naturais. PhD Dissertation, Universidade Estadual Paulista. Duran, M. S. and Aluísio, S. M. (2012), ‘Propbank-Br: A Brazilian treebank annotated with semantic role labels’, in Proceedings of the 8th International Conference on Language Resources and Evaluation. Duran, M. S., Amancio, M. A. and Aluísio, S. M. (2010), ‘Assigning wh-questions to verbal arguments: Annotation tools evaluation and corpus’, in Proceedings of the 7th Conference on International Language Resources and Evaluation, pp. 1445–51. Duran, M. S., Ramisch, C., Aluísio, S. M. and Villavicencio, A. (2011), ‘Identifying and analyzing Brazilian Portuguese complex predicates’, in Proceedings of Multiword Expressions: from Parsing and Generation to the Real World, pp. 74–82. Feltrim, V. D. (2004). Uma Abordagem Baseada em Córpus e em Sistemas de Crítica para a Construção de Ambientes Web de Auxílio à Escrita Acadêmica em Português. PhD Dissertation, Universidade de São Paulo. Feltrim, V. D., Aluísio, S. M. and Nunes, M. G. V. (2003), ‘Analysis of the rhetorical structure of computer science abstracts in Portuguese’, in Proceedings of the Corpus Linguistics Conference, pp. 212–18. Fonseca, E. R. and Rosa, J. L. G. (2012), ‘An architecture for semantic role labelling on Portuguese’, in Proceedings of the International Conference on Computational Processing of the Portuguese Language, pp. 204–9. Gasperin, C., Maziero, E. G. , Specia, L., Pardo, T. A. S. and Aluísio, S. M. (2009), ‘Natural Language Processing for social inclusion: A text simplification architecture for different literacy levels’, in Anais do XXXVI Seminário Integrado de Software e Hardware (SEMISH 2009), em conjunto com o CSBC 2009 – XXIX Congresso da Sociedade Brasileira de Computação, pp. 387–401. Gonçalves, P. N., Rino, L. H. M. and Vieira, R. (2008), ‘Summarizing and referring: Towards cohesive extracts’, in Proceedings of the ACM Symposium on Document Engineering, pp. 253–6. Hovy, E. H. and Lavid, J. M. (2010), ‘Towards a “science” of corpus annotation: A new methodological challenge for Corpus Linguistics’. International Journal of Translation Studies, 22, (1), 13–36. Jurafsky, D. and Martin, J. H. (2009), Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Upper Saddle River, NJ: Prentice Hall. Manchego, F. A. and Rosa, J. L. G. (2012), ‘Towards semi-supervised Brazilian Portuguese semantic role labelling: Building a benchmark’, in Proceedings of the International Conference on Computational Processing of the Portuguese Language, pp. 210–17. Mani, I. (2001). Automatic Summarization. Amsterdam: John Benjamins. Mann, W. C. and Thompson, S. A. (1987), ‘Rhetorical structure theory: A theory of text organization’. Technical Report ISI/RS-87–190. Maziero, E. G., Jorge, M. L. C. and Pardo, T. A. S. (2010), ‘Identifying multidocument relations’, in Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science, pp. 60–9. Maziero, E. G. and Pardo, T. A. S. (2011), ‘Multi-document discourse parsing using

New Challenges in Portuguese Processing

321

traditional and hierarchical machine learning’, in Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology, pp. 1–10. Mitchell, T. M. (1997), Machine Learning. New York, NY: WCB/McGraw-Hill. Muniz, F. A. M. (2011). Extração de Termos de Manuais Técnicos de Produtos Tecnológicos: Uma Aplicação em Sistemas de Adaptação Textual. Master’s Thesis, Universidade de São Paulo. Muniz, F., Watanabe, W. M., Scarton, C. E. and Aluísio, S. M. (2011), ‘Extração de termos de manuais técnicos de produtos tecnológicos: Uma aplicação em sistemas de adaptação textual’, in Anais do SEMISH 2011 – XXXVIII Seminário Integrado de Software e Hardware, em conjunto com o CSBC 2011 – XXXI Congresso da Sociedade Brasileira de Computação, pp. 1293–306. Muniz, M., Paulovich, F. V., Minghim, R., Infante, K., Muniz, F., Vieira, R. and Aluísio, S. M. (2007), ‘Taming the tiger topic: An XCES compliant corpus portal to generate subcorpus based on automatic text topic identification’, in Proceedings of the Corpus Linguistics Conference. Available at: http://ucrel.lancs.ac.uk/publications/CL2007/ paper/257_Paper.pdf Nunes, M. G. V., Aluísio, S. M. and Pardo, T. A. S. (2010), ‘Um panorama do Núcleo Interinstitucional de Linguística Computacional às vésperas de sua maioridade’. Linguamática, 2, (2), 13–27. O’Donnell, M. (2000), ‘RSTTool 2.4 – A markup tool for rhetorical structure theory’, in Proceedings of the International Natural Language Generation Conference, pp. 253–56. Palmer, M., Gildea, D. and Kingsbury, P. (2005), ‘The proposition bank: An annotated corpus of semantic roles’. Computational Linguistics, 31, (1), 71–105. Pardo, T. A. S. and Nunes, M. G. V. (2004), ‘Relações retóricas e seus marcadores superficiais: Análise de um corpus de textos científicos em português do Brasil’. Technical Report n. 231, São Paulo, SP. Pardo, T. A. S. and Nunes, M. G. V. (2008), ‘On the development and evaluation of a Brazilian Portuguese discourse parser’. Journal of Theoretical and Applied Computing, 15, (2), 43–64. Pardo, T. A. S. and Seno, E. R. M. (2005), ‘Rhetalho: Um corpus de referência anotado retoricamente’, in Anais do V Encontro de Corpora. Available at: http://www.icmc.usp. br/pessoas/taspardo/VEncontroCorpora2005-PardoSeno.pdf Pinheiro, G. M. and Aluísio, S. M. (2003), ‘Córpus NILC: Descrição e análise crítica com vistas ao projeto Lacio-Web’. Technical Report NILC-TR-03–03, São Carlos, SP, Brazil. Radev, D. R. (2000), ‘A common theory of information fusion from multiple text sources step one: Cross-document structure’, in Proceedings of the 1st ACL SIGDIAL Workshop on Discourse and Dialogue, pp. 74–83. Available at: http://clair.si.umich.edu/~radev/ papers/acl-disc00.pdf Ribaldo, R., Akabane, A. T., Rino, L. H. M. and Pardo, T. A. S. (2012), ‘Graph-based methods for multi-document summarization: exploring relationship maps, complex networks and discourse information’, in Proceedings of the 10th International Conference on Computational Processing of Portuguese, pp. 260–71. Scarton, C. E. and Aluísio, S. M. (2010), ‘Análise da inteligibilidade de textos via ferramentas de processamento de língua natural: Adaptando as métricas do Coh-Metrix para o Português’. Linguamática, 2 (1), 45–66. Silva, J., Branco, A., Castro, S. and Reis, R. (2010), ‘Out-of-the-box robust parsing of Portuguese’, in Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language, pp. 75–85.

322

Working with Portuguese Corpora

Souza, J. G. C., Gonçalvez, P. N. and Vieira, R. (2008), ‘Learning coreference resolution for Portuguese texts’, in Proceedings of the 8th International Conference on Computational Processing of the Portuguese Language, pp. 153–62. Souza, V. M. A. and Feltrim, V. D. (2013), ‘A coherence analysis module for SciPo: Providing suggestions for scientific abstracts written in Portuguese’. Journal of the Brazilian Computer Society (19), 59–73. Specia, L. (2010), ‘Translating from complex to simplified sentences’, in Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language, pp. 30–9. Specia, L., Aluísio, S. M. and Pardo, T. A. S. (2008), ‘Manual de simplificação sintática para o Português’. Technical Report NILC-TR-08–06, São Carlos, SP. Uzêda, V. R., Pardo, T. A. S. and Nunes, M. G. V. (2010), ‘A comprehensive comparative evaluation of RST-based summarization methods’. ACM Transactions on Speech and Language Processing, 6 (4), 1–20. Wing, B. and Baldridge, J. (2006), ‘Adaptation of data and models for Probabilistic parsing of Portuguese’, in Proceedings of the Workshop on Computational Processing of the Portuguese Language, pp. 140–9. Wynne, M. (ed.) (2005), Developing Linguistic Corpora: A Guide to Good Practice. Oxford: Oxbow Books.

Index adverb 15, 30–2, 70, 92–3, 106, 108, 123, 164, 166–7, 170–1, 173, 174, 176, 227, 248, 283, 285, 287, 289, 291–2, 295, 301 agreement measures 313, 317 alignment 3, 95, 102, 161–2, 164, 168, 173, 176, 177, 179, 181–4, 187–9, 197, 240, 243–4, 257, 259, 267, 272, 287, 305, 313, 317–8 aligned file 183, 257 aligned texts 164 parallel 168 speech-to-text 259, 272 text-to-sound 240, 243, 244 Amador, P. 164 anaphora, annotation 285, 293–5 AnELL 221 annotation and tagging 1–4, 10, 12, 15–16, 18, 21, 26, 32, 35–7, 39, 41, 44, 55, 61, 63, 70, 74, 76, 84, 90, 92, 94, 96, 102, 104, 108, 111, 113, 117–19, 126, 134, 137–9, 152, 155–6, 161, 177–81, 183–6, 188–9, 192–5, 197, 200–2, 205, 207–9, 210, 215, 219–24, 226, 228–36, 239–45, 247–8, 250–8, 261, 264–5, 267, 272–7, 279–322 CLAWS 208, 288 corte-e-costura 208, 223 CST annotation 313 CSTTool 313 discourse annotation 306, 311 discourse parsing 305 double-blind annotation 307, 310 manual annotation 94 multi-document annotation 311 nominal chunks 239 PALAVRAS 4, 12, 15, 117–19, 220–1, 224, 230, 235, 279–81, 283–91, 293–8, 301–2, 304, 317 prosodic annotation 258

prosodic breaks 243, 258, 269–70, 298 RST annotation 313 semantic annotation 222, 228–9 syntactic parsing 279 text annotation 311 Tree-Tagger 12, 94 two-level annotation 297 argumentation 20–1, 31, 262 atomic semantic features 289 backward compatibility 226 Bagot, R. E. 148, 153 Baker, M. 161 Barros, L. A. 151 baseline levels 135, 136–7, 138, 140, 143 bi-word sets (BWSs) 188 Biber, D. 33–8, 40, 45–6, 63, 132, 142, 259–60, 269, 276 Blum-Kulka, S. 163 Bolinger, D. 17 bundles 1, 2, 33–4, 35, 37, 39–49, 51, 53–67, 132, 142, 289–90 ceiling 40–1, 64 cut-off points 38–9 discourse organizing 34, 35, 47, 57, 58, 60, 61, 62–3 frequency 38, 42, 48–53, 64 functional classification 35, 40, 46, 53–4 identification and focus 35 referential 34, 35, 47–8, 57–9 selection criteria 37–41, 46, 64 stance 34, 46–7, 60, 61, 62 Cabré, M. T. 151 chi-square 153 classifiers 304 CLEF 222 co-occurrence 10, 12, 20, 73, 151, 229, 252, 271 co-reference resolution 304, 305

324 Index COBUILD 112 Coh-Metrix 20–1, 26, 30 coherence 117 cohesion 35, 165, 167, 252, 269 collocate 13–17, 22, 24, 26, 69, 71, 75, 77, 79, 82–3, 95, 98–104, 112, 114, 125–6, 151, 211, 276 candidates 13 investigation of 98 true 13 collocation 1, 9–14, 16–26, 28–31, 75, 77, 111–14, 131–2, 151, 190, 211, 239, 252 density distribution 25 discoverer of 9 discovery of 9, 11–12 independence from grammar 26 independence from word frequency 24 percentage of 16, 26 presence 9, 10, 16, 20, 23, 26, 29, 30 text perspective on 10 variation within individual texts 25 colour 202, 209, 210–12, 222, 223, 227, 229 Comparador 223, 228 comparing corpora dialects 103–4 genres 102–3 time periods 104–5 complex predicates 305 concordances 3, 14–15, 35, 75, 80, 95–7, 110, 112–13, 116, 142, 144, 149, 151–2, 164–6, 176, 179–81, 189–91, 196–8, 202–8, 211, 226, 228, 239, 243, 252 connectors 71 constituent trees 285–6 Constraint Grammar 280, 283–4 contrastive analysis, contrastive linguistics 162, 168, 177, 180 copyright issues 220, 222, 253 Corpógrafo 147–8, 151–4, 156, 225–6, 229, 236 corpora, named AC/DC 219, 221–3, 229–30 Africa 244–7 Banco de Português 147 Brazilian Corpus see Corpus Brasileiro

Brazilian Register Variation Corpus (CBVR) see Corpus Brasileiro de Variação de Registro C-ORAL-BRASIL 257–72, 296, 297 C-ORAL-ROM 242–3, 257, 260–1, 265, 269–70, 275 CETEMPúblico 178, 220, 221–2, CETENFolha 75, 80, 178, 221 CoMET Project 201–2, 213 CorTec 201, 202–7 CorTrad 201, 207–12, 223 COMPARA 162, 163, 178–9, 221, 222 CONDIVport 229 Corpus Brasileiro de Variação de Registro (CBVR) 12–13, 34, 36–7, 44–5, 133, 134–5, 136 Corpus Brasileiro 9, 12–14, 23–4, 133, 135, 137, 140–41, 142 Corpus de Referência do Português Contemporâneo (CRPC) 237–40, 253 Corpus do Português 75, 76, 77, 79, 82, 83, 89–105 Corpus Internacional do Português (CINTIL) 248–9 LE-PAROLE 247–8 MAMAtex 147 Per-Fide 177, 181–2, 189–92, 197 PorTex 133, 135, 137–40 Português Fundamental 240, 241 Spoken Portuguese 243–4 corpus, corpora annotated 303, 304, 305, 307, 312, 318 baseline (see CBVR) 133, 135 bidirectional 162, 164, 173 comparable 161, 174, 201, 202, 205, 260 free access 219 interface 193–7, 202 monitor 237 multilingual 242 parallel 161, 173–4, 177, 181, 201, 220, 260 Portuguese-English 201 reference 4, 11, 13, 15, 23, 26–7, 133, 135, 137, 140–2, 145, 147, 149, 156, 237, 239–43, 245, 247, 249, 251, 253 specialized 147, 179, 226, 305 training 311, 315

Index corpus-based lexica COMBINA-PT 251–3 Fundamental Vocabulary of Portuguese 249 Multifunctional Computational Lexicon of Contemporary Portuguese 249–51 PAROLE 249 SIMPLE 249 syntactic information 249 corpus design 1, 242–4, 250, 251, 307 genre balance 92–3 corpus linguistics 1, 4, 9–11, 33–4, 69, 72–4, 84, 133, 147, 219, 276, 279, 304–5, 307, 318 corpus-query systems 111–12 corpus, searching collocations 95, 112, 239 distribution 239 downloading 239 frequency 239, 250 lemmas 95, 239 part-of-speech 95, 239, 243 queries 94–5, 97, 98, 239 regular expressions 239, 249 sorting 239 CorpusEye 288, 298 CQPWeb platform 239, 253 crawling 2, 115, 116, 126, 183, 235 cross-document structure theory (CST) 311 cross-text variation 18 CWB 112, 200, 222, 224, 228, 235 Darbelnet, J. 168, 175 dependency syntax 285–6 derivation 281–2 diaphasic 261–2 diastratic 263–4 diatopy 257, 265 dictionary 1, 2, 70, 89–90, 105–8, 111, 115, 177, 182, 184, 186–200, 220, 229, 252, 255, 317–18 dimensions of variation 20, 26, 31 features 31 multidimensional analysis 20–1 scores 21 disambiguation 248, 280–1, 283–5, 288–91, 293–4, 296–8, 301–2

325

discourse 20, 31, 33–5, 37, 40, 45, 47, 53, 55–64, 67, 74, 82, 161, 166, 241, 249, 259, 261, 276, 303–6, 311–13, 316–18 marker 31–2, 316–18 models 312, 313 relations 311, 312 DISPARA 179, 208, 223 Distribuidor 228 e-Termos 147–9, 152–4 Ensinador 223, 228 evaluation contest 221–2, 235 EXMARaLDA software 240, 244 explicitation 164–5, 176 fiction 38, 50, 51, 53, 58, 93, 102–3, 107, 164, 167–9, 173–4 fixedness 252 Floresta Sintá(c)tica 220, 223, 229, 286–7 fluency 11, 26, 132 Folheador 229 foreign words 168–9, 174, 190 format filtering 287–8 frame files 311 FrameNet 293, 308 Frawley, W. 161 Frequency Dictionary of Portuguese 89–90, 105–8 frequency-based materials 105 frequency 2, 3, 12, 15, 18, 23–6, 30, 34, 37–40, 42–6, 48, 63–4, 69, 76, 78–9, 82, 89–91, 93–101, 103–8, 114, 123, 132, 140–2, 148–9, 151, 169–70, 172, 179, 203, 205, 207, 209–11, 221, 227–8, 236, 239–41, 243, 245, 249–52, 256, 257, 276, 279, 293–4, 302 genre-based lists 107–8 global-context rule 284 grammar checker 306 grammaticalization 69, 72, 73–4, 83, 84 abstractization 78, 79 desemanticization 73, 74 directionality 72 extension 73, 74 layers 79, 80 grammatically-oriented lists 107

326 Index Halliday, M. A. K. 26–7, 74, 123, 260, 269 HAREM 222, 290 heuristicity batches 284 historical texts 296–7 Hoey, M. 9, 11–2, 141 human revision 222–3 hypernymy 30, 151, 229, 293 hyponymy 151, 249 idiomaticity 11, 16, 24, 131–5, 137–43, 191, 211–12, 252 idiom principle 9, 11, 16, 131, 132 idiomatic expressions 252 idiomatic texts 131–2 index 133 procedure for determining 135 illocution 259 information structure 267 infrastructure 226, 228, 230 inter-rater agreement statistics 271 interaction 4, 45–6, 257, 259–62, 264–7, 285–6, 289 keywords 2, 95–7, 119–25, 148–9, 156, 181 positive 149 unigrams 149 Klaudy, K. 163 language-independent elements (LIEs) 184 lemma 15, 24, 45, 75, 89, 92–6, 101–3, 105–6, 110, 112, 114–15, 117–19, 125, 169–74, 179–81, 197, 200, 208, 211, 224, 227, 239–41, 243–6, 248–50, 252, 279, 283–4, 286, 293, 297–8 lexical heuristics 282–3 lexicalization 252, 261, 271 lexicography 2, 87, 111–19, 121, 123, 125–6, 189, 197, 200, 252, 282, 298 lexis 1, 2, 10, 12, 16–17, 33–5, 37, 39–47, 49, 51, 53, 55, 57, 59, 61, 63–4, 69, 71–2, 74, 83–4, 94, 103–4, 110, 112, 117–18, 124, 127, 132, 141–3, 148, 161, 170–1, 174, 181, 187–8, 193–4, 221, 224, 229, 239, 241, 243–4, 249, 251–2, 257, 267, 276,

280–3, 286, 289–92, 295, 302, 305, 308–9, 311, 316–18 literacy 20–1, 31, 305, 314–15 literacy levels 305, 314 loans 168, 169, 174 local-context rule 284 machine learning 308, 315 McKenny, J. 162 meaning and usage 99 metadata 178, 181–2, 239, 239, 244, 249, 253, 257, 264 Microsoft Excel 148 Milhafre 223 morphological decomposition 280 morphology 45, 63, 151, 162, 177, 186–7, 208, 271, 279–84, 288, 290, 293–4, 296–7, 301, 305–6 morpholympics 280, 298 morphosyntactic disambiguation 283–4 multidimensional analysis (MD) 20–1, 26, 31 multiword expressions (MWE) 251–2, 283, 287, 290 mutual information (MI) 95, 99, 234–5, 252, 256, 318 name entity recognition (NER) 289–90 n-grams 34, 40, 45, 132, 136, 140, 148–9, 156, 202–4, 226, 306 classification 140–1 shared 135, 136 see bundles, multiword expressions non-translated Portuguese 168, 169, 170, 171–3, 174 numerical frequency tags 294 open choice principle 11, 132 orality 4, 20, 31, 75, 227, 242–3, 245, 255, 257, 259–61, 263–5, 267–72, 296–8 orthographic transcription 243, 244 Págico 230 PALAVRAS see annotation and tagging Palmorf 280 PAPEL 224, 229 phonology 131, 280, 281 phraseology 9, 10, 22, 207

Index PonTE 223 prepositions 2, 10, 15, 31, 69–84, 118, 171–4, 193–4, 209–10, 271–2, 283, 285–7, 291–2, 295 complex 69, 70, 71, 73, 74, 79, 82, 83 probabilistic translation dictionaries (PTDs) 182, 186–7 Pym, A. 163 Rayson, P. 208 readability 304 regional variants 119–26, 237–8, 243–7 register 12, 20–1, 26–7, 31, 33–4, 36–8, 41–6, 48, 53–67, 78, 106, 133–5, 142, 147, 238–9, 245–6, 250, 260, 262 relation propagation 295 replicating studies 225 research centres CEPRIL 4, 147 CLUL 4, 224, 237, 240, 248, 253–6 CoMET 4, 201–3, 205, 207, 209, 211, 213–15, 223, 226 Linguateca 178, 207–8, 219–21, 224, 229–30, 287 NILC 21, 220–1, 303, 305, 306, 308, 311, 314, 317–18 representativeness 58, 132, 151, 196, 240, 253, 257, 303, 307, 309, 312, 315 rhetorical structure theory (RST) 311 routinization 34, 44–5, 63–4 Scott, M. 9, 34, 112, 132, 147 segmentation 14, 25, 131, 183, 186, 188, 257–9, 261, 267, 269–72, 276, 298, 311, 313 semantic prosody 80–1, 83 semantic prototypes 289 semantic role labeling 303, 304, 305, 308, 310 semantic roles 290–1 semantic unpredictability 151 semantics 21, 69, 71, 73, 75–83, 89, 95, 98–103, 106, 108, 151, 181–2, 193, 202, 208–9, 210, 222–4, 226–30, 249, 252, 256, 259–60, 279–81, 283, 285, 287–95, 300–1, 303–6, 308–10, 318 SET rules 285

327

simplification 303, 306, 314, 315, 316–18 operations 315, 316, 317 process 315 types of 315 Sinclair, J. McH. 9, 10, 131–2, 189, 251 Sketch Engine 2, 4, 12, 14, 27, 111–13, 118–19, 125–6 specific conceptual value 152 speech act 263, 270–1 spontaneous speech 259–60 SPSS 18, 23, 37, 143 summarization 4, 303–6, 311–13 synonyms 2, 70, 81–3, 95, 99, 100–3, 110, 125, 170–4, 187, 212, 224, 229, 249 syntactic parsers 304 syntax 69–70, 72, 82, 84, 117, 161–2, 164, 173–4, 179–81, 183, 186, 197, 200, 208–9, 220, 222, 244–5, 249–50, 252, 256–7, 259, 261, 267, 269, 275–6, 279–81, 283, 285, 288, 290–4, 296–8, 301–5, 308–10, 315, 317 teaching 1–3, 11, 33, 35, 77, 108, 131–3, 137, 141, 143, 201–2, 205–6, 212, 223, 225, 228, 237–8, 240, 242–3, 286, 288, 298, 305 coursebook/textbook 131, 133, 143 Portuguese as a foreign/second language 131, 243 use of corpora 131 terminology 2, 3, 122, 147–8, 151, 156–7, 183, 187, 205, 226 automatic retrieval of terms 147–8 computer-assisted analysis 147 concordance analysis 151 false positive terms 151 true positive terms 151 TeP 224 text accessibility 314 text versions 314–15 thesaurus 100, 113 tokenization 183, 186, 280, 287, 296–7 tone units 258 tools 2–4, 9, 12, 21, 69, 74–5, 89, 94, 111–12, 117–18, 125–7, 147–57, 177, 179, 181–9, 191, 197, 200, 202–3, 212, 219, 222, 225–6, 237, 243, 247, 253–6, 279, 281, 285, 289, 300, 303–5, 310, 313–17

328 Index transcribed speech 297–8 translated Portuguese 168, 169, 170, 171–3, 174 translation 1, 3, 17, 32–3, 89, 99, 105–6, 142, 161–79, 181–2, 184–93, 196–202, 205–6, 208–13, 223, 237, 283, 294, 301, 303, 305–6, 308, 315 machine translation (MT) 294, 303, 305, 306 memories 223 Portuguese translators 163, 169 problems detection 209, 212 strategy 169 studies 161, 177, 189 translator training 202, 209 translation memory interchange (TMX) 184–6 transpositors 71 treebanks 220, 223, 229–30, 272, 286–8, 291–2, 298, 301, 318 unambiguous-concept translation sets (UCTSs) 184, 187

validation 228 VARRA 223, 228 verb-argument frames 292 Vinay, J. 168 VISL 220–1, 235, 286, 288–9, 291, 294, 298 weather-versus-climate perspective 26–7 word sketches 2, 4, 12, 14, 27, 111–15, 118–19, 125–6, 305 wordlists 149, 202–4, 206 WordSmith Tools 75, 112, 147–9, 151, 156 WPT–03 222 writing support tools 303, 305 written, writing 2, 10, 14, 24, 34, 37, 39, 41, 53, 57, 60, 113, 132–3, 135–45, 147, 161, 178, 208, 225, 237–9, 243–5, 248–51, 260–1, 269, 286–7, 296–8, 305, 317 ZExtractor 147