THE OXFORD HANDBOOK OF
CORPUS PHONOLOGY
OXFORD HANDBOOKS IN LINGUISTICS
The Oxford Handbook of Applied Linguistics, Second edition. Edited by Robert B. Kaplan
The Oxford Handbook of Case. Edited by Andrej Malchukov and Andrew Spencer
The Oxford Handbook of Cognitive Linguistics. Edited by Dirk Geeraerts and Hubert Cuyckens
The Oxford Handbook of Compounding. Edited by Rochelle Lieber and Pavol Štekauer
The Oxford Handbook of Compositionality. Edited by Markus Werning, Edouard Machery, and Wolfram Hinzen
The Oxford Handbook of Computational Linguistics. Edited by Ruslan Mitkov
The Oxford Handbook of Field Linguistics. Edited by Nicholas Thieberger
The Oxford Handbook of Grammaticalization. Edited by Heiko Narrog and Bernd Heine
The Oxford Handbook of Historical Phonology. Edited by Patrick Honeybone and Joseph Salmons
The Oxford Handbook of the History of English. Edited by Terttu Nevalainen and Elizabeth Closs Traugott
The Oxford Handbook of the History of Linguistics. Edited by Keith Allan
The Oxford Handbook of Japanese Linguistics. Edited by Shigeru Miyagawa and Mamoru Saito
The Oxford Handbook of Laboratory Phonology. Edited by Abigail C. Cohn, Cécile Fougeron, and Marie K. Huffman
The Oxford Handbook of Language Evolution. Edited by Maggie Tallerman and Kathleen Gibson
The Oxford Handbook of Language and Law. Edited by Peter Tiersma and Lawrence M. Solan
The Oxford Handbook of Linguistic Analysis. Edited by Bernd Heine and Heiko Narrog
The Oxford Handbook of Linguistic Interfaces. Edited by Gillian Ramchand and Charles Reiss
The Oxford Handbook of Linguistic Minimalism. Edited by Cedric Boeckx
The Oxford Handbook of Linguistic Typology. Edited by Jae Jung Song
The Oxford Handbook of Sociolinguistics. Edited by Robert Bayley, Richard Cameron, and Ceil Lucas
The Oxford Handbook of Translation Studies. Edited by Kirsten Malmkjaer and Kevin Windle
THE OXFORD HANDBOOK OF
CORPUS PHONOLOGY
Edited by
JACQUES DURAND, ULRIKE GUT, and GJERT KRISTOFFERSEN
Great Clarendon Street, Oxford, OX2 6DP, United Kingdom
Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

© editorial matter and organization Jacques Durand, Ulrike Gut, and Gjert Kristoffersen 2014
© the chapters their several authors 2014

The moral rights of the authors have been asserted

First Edition published in 2014
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2014933501

ISBN 978-0-19-957193-2

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY
Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
Contents
List of Contributors
1. Introduction Jacques Durand, Ulrike Gut, and Gjert Kristoffersen
PART I PHONOLOGICAL CORPORA: DESIGN, COMPILATION, AND EXPLOITATION
2. Corpus Design Ulrike Gut and Holger Voormann
3. Data Collection Bruce Birch
4. Corpus Annotation: Methodology and Transcription Systems Elisabeth Delais-Roussarie and Brechtje Post
5. On Automatic Phonological Transcription of Speech Corpora Helmer Strik and Catia Cucchiarini
6. Statistical Corpus Exploitation Hermann Moisl
7. Corpus Archiving and Dissemination Peter Wittenburg, Paul Trilsbeek, and Florian Wittenburg
8. Metadata Formats Daan Broeder and Dieter van Uytvanck
9. Data Formats for Phonological Corpora Laurent Romary and Andreas Witt
PART II APPLICATIONS
10. Corpus and Research in Phonetics and Phonology: Methodological and Formal Considerations Elisabeth Delais-Roussarie and Hiyon Yoo
11. A Corpus-Based Study of Apicalization of /s/ before /l/ in Oslo Norwegian Gjert Kristoffersen and Hanne Gram Simonsen
12. Corpora, Variation, and Phonology: An Illustration from French Liaison Jacques Durand
13. Corpus-Based Investigations of Child Phonological Development: Formal and Practical Considerations Yvan Rose
14. Corpus Phonology and Second Language Acquisition Ulrike Gut
PART III TOOLS AND METHODS
15. ELAN: Multimedia Annotation Application Han Sloetjes
16. EMU Tina John and Lasse Bombien
17. The Use of Praat in Corpus Research Paul Boersma
18. Praat Scripting Caren Brinckmann
19. The PhonBank Project: Data and Software-Assisted Methods for the Study of Phonology and Phonological Development Yvan Rose and Brian MacWhinney
20. EXMARaLDA Thomas Schmidt and Kai Wörner
21. ANVIL: The Video Annotation Research Tool Michael Kipp
22. Web-based Archiving and Sharing of Phonological Corpora Atanas Tchobanov
PART IV CORPORA
23. The IViE Corpus Francis Nolan and Brechtje Post
24. French Phonology from a Corpus Perspective: The PFC Programme Jacques Durand, Bernard Laks, and Chantal Lyche
25. Two Norwegian Speech Corpora: NoTa-Oslo and TAUS Kristin Hagen and Hanne Gram Simonsen
26. The LeaP Corpus Ulrike Gut
27. The Diachronic Electronic Corpus of Tyneside English: Annotation Practices and Dissemination Strategies Joan C. Beal, Karen P. Corrigan, Adam J. Mearns, and Hermann Moisl
28. The LANCHART Corpus Frans Gregersen, Marie Maegaard, and Nicolai Pharao
29. Phonological and Phonetic Databases at the Meertens Institute Marc van Oostendorp
30. The VALIBEL Speech Database Anne Catherine Simon, Michel Francard, and Philippe Hambye
31. Prosody and Discourse in the Australian Map Task Corpus Janet Fletcher and Lesley Stirling
32. A Phonological Corpus of L1 Acquisition of Taiwan Southern Min Jane S. Tsay
References
Index
List of Contributors
Joan C. Beal is Professor of English Language at the University of Sheffield. Her research interests are in the fields of sociolinguistics/dialectology and the history of the English language since 1700. She has published widely in both fields.

Bruce Birch is currently Departmental Visitor in Linguistics at the Australian National University in Canberra. His research has focused on the development of a usage-based approach to the analysis of prosodic structure including intonation, as well as issues involved in the documentation of endangered languages. He has been collecting data and contributing to the building of online corpora for Iwaidja and other highly endangered languages of Northwestern Arnhem Land, Australia, since 1999.

Paul Boersma received an M.Sc. in physics from the University of Nijmegen in 1988 and a PhD in linguistics from the University of Amsterdam in 1998 for a dissertation entitled ‘Functional Phonology’. Since 2005 he has been Professor of Phonetic Sciences at the University of Amsterdam.

Lasse Bombien works as a researcher at the Universities of Munich and Potsdam. In 2006 he received his MA in Phonetics at the University of Kiel, and in 2011 a D.Phil. in Phonetics at the University of Munich. His areas of interest include speech production, articulatory coordination, sound change, effects of prosodic structure on phonetic detail, the phonetics and phonology of Scandinavian languages, techniques for the investigation of speech production, and software development.

Caren Brinckmann studied computational linguistics and phonetics at Saarland University (Saarbrücken, Germany) and Tohoku University (Sendai, Japan). For her master’s thesis in 2004 she improved the prosody prediction module of a German speech synthesis system with corpus-based statistical methods. As a researcher at the national Institute for the German Language (IDS, Mannheim, Germany) she subsequently focused on regional phonetic variation, word phonology, and automatic text classification with statistical methods. In 2011 she left academia to apply her data-mining skills in a major German e-commerce company.

Daan Broeder has a background in electrical engineering, is deputy head of the TLA unit at the Max Planck Institute for Psycholinguistics (Nijmegen, The Netherlands), and has for many years been the senior developer responsible for all infrastructure and metadata development. He plays leading roles in European and national projects, such as all
metadata-related work in TLA and CLARIN, and is the responsible convener for ISO standards on metadata and persistent identifiers.

Karen P. Corrigan is Professor of Linguistics and English Language at Newcastle University. She researches language variation and change in dialects of the British Isles, with a particular focus on Northern Ireland and northeast England. She was principal investigator on the research project that created the Newcastle Electronic Corpus of Tyneside English (2000–2005), and fulfilled the same role for the Diachronic Electronic Corpus of Tyneside English project (2010–2012) at Newcastle University.

Catia Cucchiarini obtained her PhD in phonetics from the University of Nijmegen. She worked at the Centre for Language and Education of K.U. Leuven in Belgium, and has been working at the University of Nijmegen on various projects on speech processing and computer-assisted language learning. She has supervised PhD students and has published many articles in international journals. In addition to her research activities, she has since 1999 been working at the Nederlandse Taalunie (Dutch Language Union) as a senior project manager for language policy and human language technologies.

Elisabeth Delais-Roussarie is a senior researcher at the CNRS, Laboratoire de Linguistique Formelle, Paris (Université Paris-Diderot). She has worked on several topics in sentence phonology, such as the modelling of intonation and accentual patterns in French, the phonology–syntax interface, and prosodic phrasing in French. Her recent work has focused on the development and evaluation of prosodic annotation systems and tools that facilitate a corpus-based approach in sentence phonology and in the L2 acquisition of prosody.

Jacques Durand is Emeritus Professor of Linguistics at the University of Toulouse II – Le Mirail and a Member of the Institut Universitaire de France. He was formerly Professor at the University of Salford, Director of the CLLE-ERSS research centre in Toulouse, and in charge of linguistics at CNRS headquarters. His publications are mainly in phonology (particularly within the framework of Dependency Phonology, in collaboration with John Anderson), but he also worked in machine translation in the 1980s and 1990s within the Eurotra project. Since the late 1990s he has coordinated two major research programmes in corpus phonology: Phonology of Contemporary French, with M.-H. Côté, B. Laks, and C. Lyche, and Phonology of Contemporary English, with P. Carr and A. Przewozny.

Janet Fletcher is Associate Professor of Phonetics at the University of Melbourne. She completed her PhD at the University of Reading and has held research positions at the University of Edinburgh, Ohio State University, and Macquarie University. Her research interests include articulatory and acoustic modelling of coarticulation, and prosody and intonation in Australian English and Australian Indigenous languages.

Michel Francard is Professor of Linguistics at the Catholic University of Louvain (Louvain-la-Neuve) and founder of the VALIBEL research centre in 1989. His main
research interests include linguistic variation (especially the lexicography of peripheral French varieties) and the evolution of endangered languages in the globalized linguistic market. His most recent book, Dictionnaire des belgicismes (2010), illustrates the emergence of an autonomous norm within a variety of French outside France.

Frans Gregersen is Professor of Danish Language at the Department of Scandinavian Studies and Linguistics, University of Copenhagen, and has been director of the Danish National Research Foundation LANCHART (LANguage CHAnge in Real Time) Centre since 2005. The centre webpage with current publications may be found at www.lanchart.dk.

Ulrike Gut holds the Chair of English Linguistics at the Westfälische Wilhelms-University in Münster. She received her PhD from Mannheim University and her postdoctoral degree (Habilitation) from Freiburg University. Her main research interests include phonetics and phonology, corpus linguistics, second language acquisition, and worldwide varieties of English. She has collected the LeaP corpus and is currently involved in the compilation of ICE-Nigeria.

Kristin Hagen is Senior Engineer at the Text Laboratory, Department of Linguistics and Scandinavian Studies at the University of Oslo. For many years she has worked on the development of speech corpora such as NoTa-Oslo and the Nordic Dialect Corpus. She has also worked in other language technology domains such as POS tagging (the Oslo-Bergen tagger), parsing, and grammar checking. Her background is in linguistics and in computer science.

Philippe Hambye is Professor of French Linguistics at the University of Louvain. His research mainly includes work in sociolinguistics relating to variation of linguistic norms and practices in the French-speaking world, language practices in education and work, and language policies, with a special interest in questions of legitimacy, power, and social inequalities.

Tina John is a lecturer in linguistics at the University of Kiel. After her MA graduation in Phonetics, Linguistics, and Computer Science in 2004, she joined the developer team of the EMU System. In 2012 she obtained her PhD in Phonetics and Computational Linguistics at the University of Munich. Her areas of interest, in addition to the development of the EMU System and algorithms in general, are all kinds of speech data analysis (e.g. finding acoustic correlates) as well as analyses of text corpora.

Michael Kipp is Professor for Interactive Media at Augsburg University of Applied Sciences, Germany. Previously he was head of a junior research group at the German Research Center for AI (DFKI), Saarbrücken, and Saarland University. His research topics are embodied agents, multimodal annotation, and communication and interaction design. He developed and maintains the ANVIL video annotation tool.

Gjert Kristoffersen is Professor of Scandinavian Languages at the University of Bergen. His research interests are synchronic and diachronic aspects of Scandinavian phonology,
especially Norwegian and Swedish prosody from a variationist perspective. He is the author of The Phonology of Norwegian, published by Oxford University Press in 2000.

Bernard Laks is Professor at Paris Ouest Nanterre University and a senior member of the Institut Universitaire de France. A former director of the Nanterre linguistics laboratory, he is, with Jacques Durand, Chantal Lyche, and Marie-Hélène Côté, director of the ‘Phonologie du français contemporain’ corpus and research programme (PFC). He has published extensively in phonology, corpus linguistics, variation, cognitive linguistics, the history of linguistics, and modelling.

Chantal Lyche is currently Professor of French Linguistics at the University of Oslo. She has been adjunct Professor at the University of Tromsø and an associate member of CASTL (Center for Advanced Studies in Theoretical Linguistics, Tromsø). She has published extensively on French phonology and is the co-founder of a research programme in corpus phonology: ‘Phonology of Contemporary French’ (with Jacques Durand and Bernard Laks). Since the 1990s she has focused more specifically on varieties of French outside France, particularly in Switzerland, Louisiana, Mauritius, and Africa. In addition, she has worked on the study of large corpora from a prosodic point of view. She is also the co-author of a standard textbook on the phonology of French, and is actively involved in the teaching of French as a foreign language.

Brian MacWhinney, Professor of Psychology, Computational Linguistics, and Modern Languages at Carnegie Mellon University, has developed a model of first and second language acquisition and processing called the Competition Model, which he has also applied to aphasia and fMRI studies of children with focal lesions. He has developed databases such as CHILDES, SLABank, BilingBank, and CABank for the study of language learning and usage. He is currently developing methods for second language learning based on mobile devices and web-based tutors and games.

Marie Maegaard holds a PhD from the University of Copenhagen. She is currently Associate Professor of Danish Spoken Language at the Department of Scandinavian Research, University of Copenhagen, and is in charge of the phonetic studies at the Danish National Research Foundation LANCHART (LANguage CHAnge in Real Time) Centre. The centre webpage with current publications may be found at www.lanchart.dk.

Adam J. Mearns was postdoctoral research associate on the Diachronic Electronic Corpus of Tyneside English project (2010–2012) at Newcastle University. His background is in the lexical semantics of Old English and the history of the English language.

Hermann Moisl is a Senior Lecturer in Computational Linguistics at the University of Newcastle, UK. His background is in linguistics and computer science, and his research interests and publications are in neural language modelling using nonlinear attractor dynamics, and in methodologies for the preparation and cluster analysis of data abstracted from natural language corpora.
Francis Nolan is Professor of Phonetics in the Department of Linguistics at the University of Cambridge. He studied languages at Cambridge before specializing in phonetics. After a first post in Bangor (North Wales) he returned to Cambridge, developing research interests in phonetic theory, connected speech processes, speaker characteristics, variation in English, and prosody—the latter two united in the IViE project in the late 1990s. He has been active in forensic phonetic research and casework. He is currently President of the British Association of Academic Phoneticians.

Marc van Oostendorp is a researcher at the Meertens Instituut of the Royal Netherlands Academy of Arts and Sciences, and Professor of Phonological Microvariation at Leiden University. His main interests are models of geographical and social variation, the relation between language as a property of the mind and language as a property of a community, and alternatives to derivational relations in the phonology–morphology interface.

Nicolai Pharao is Assistant Professor at the Danish National Research Foundation’s Centre for Language Change in Real Time, LANCHART. He received his PhD in linguistics in 2010 with the dissertation ‘Consonant Reduction in Copenhagen Danish’. His research includes corpus-based studies of phonetic variation and change, and experimental studies of the relationship between phonetic variation, social meaning, and language attitudes. He is particularly interested in how the usage and social evaluation of phonetic features influence the representation of word forms in the mental lexicon.

Brechtje Post is Lecturer in Phonetics and Phonology at the University of Cambridge. Her research interests centre on the syntax–phonology interface and intonation, which she investigates from a phonetic, phonological, acquisitional, cognitive, and neural perspective. She has published in journals such as Linguistics, Journal of Phonetics, Language and Speech, Cognition, and Neuropsychologia.

Laurent Romary is Directeur de Recherche at INRIA, France, and guest scientist at Humboldt University in Berlin, Germany. He carries out research on the modelling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He is the chairman of ISO committee TC 37/SC 4 on Language Resource Management, and has been active as a member (2001–2007), then chair (2008–2011), of the TEI (Text Encoding Initiative) Council. He currently contributes to the establishment and coordination of the European DARIAH infrastructure for the arts and humanities.

Yvan Rose is currently Associate Professor of Linguistics at Memorial University and co-director of the PhonBank Project within CHILDES. He received his PhD in Linguistics from McGill University. His research concentrates on the nature of phonological representations and of their acquisition by young children. He investigates these questions through software-assisted methods, implemented in the Phon program for the analysis of transcript data on phonology and phonological development.

Thomas Schmidt holds a PhD from the University of Dortmund. His research interests are spoken language corpora, text and corpus technology, and computational
lexicography. He is one of the developers of EXMARaLDA and the author of the Kicktionary, a multilingual electronic dictionary of football language. He has spent most of his professional life as a researcher at the University of Hamburg. He also worked as a language resource engineer for a commercial company and as a research associate at ICSI Berkeley and at the Berlin-Brandenburg Academy of Sciences. Currently he heads the Archive for Spoken German (AGD) at the Institute for the German Language in Mannheim.

Anne Catherine Simon is Professor of French Linguistics at the Catholic University of Louvain (Louvain-la-Neuve, Belgium). She has been in charge of the VALIBEL research centre since 2009. Her research in French linguistics is in the areas of the prosody and syntax of spoken language, and their interaction in various speaking genres. Her dissertation, ‘Structuration prosodique du discours en français’, was published in 2004. She is co-author of La variation prosodique régionale en français (2012).

Hanne Gram Simonsen is Professor of Linguistics at the Department of Linguistics and Scandinavian Studies at the University of Oslo. Her research interests include language acquisition (in particular phonology, morphology, and lexicon) and instrumental and articulatory phonetics, as well as clinical linguistics (language disorders in children and adults). She has published on these topics in journals such as Journal of Child Language, Journal of Phonetics, and Clinical Linguistics and Phonetics.

Han Sloetjes is a software developer at the Language Archive, a department of the Max Planck Institute for Psycholinguistics. He has been involved in the development of the multimedia annotation tool ELAN since 2003, and currently has primary responsibility for supporting, maintaining, and further developing the application.

Lesley Stirling is Associate Professor of Linguistics and Applied Linguistics at the University of Melbourne. She has a disciplinary background in linguistics and cognitive science, and has published work on a variety of topics in descriptive and typological linguistics, semantics, and discourse analysis. One research interest has been the relationship between dialogue structure and prosody, involving collaborative cross-disciplinary research funded by the Australian Research Council.

Helmer Strik received his PhD in physics from the University of Nijmegen, where he is now Associate Professor of Speech Science and Technology. His research addresses both human speech processing (voice source modelling, intonation, pronunciation variation, speech pathology) and speech technology (automatic speech recognition and transcription, spoken dialogue systems, and computer-assisted language learning and therapy). He has published over 150 refereed papers, has coordinated national and international projects, and has been an invited speaker at international events.

Atanas Tchobanov is a research engineer at the MoDyCo CNRS lab. He has been active in the field of web implementations of oral corpora since 2001. His research interests also include data analysis and unsupervised learning of phonological invariants.
Paul Trilsbeek is currently head of archive management at the Language Archive, Max Planck Institute for Psycholinguistics, Nijmegen. He studied sonology at the Royal Conservatory in The Hague, after which he worked at the Radboud University in Nijmegen as a music technologist in the Music, Mind, Machine project. His experience in working with audiovisual media turned out to be of great value in the domain of language resource archiving, in which he has been working since 2003.

Jane S. Tsay received her PhD in Linguistics from the University of Arizona. She was a postdoctoral research fellow at the State University of New York at Buffalo from 1993 to 1995. She is currently a Professor of Linguistics and the Dean of the College of Humanities at the National Chung Cheng University in Taiwan. Her research interests include phonology acquisition, experimental phonetics, corpus linguistics, and sign language phonology. She has constructed the Taiwanese Child Language Corpus, based on 330 hours of recordings of young children’s spontaneous speech. She is also the co-director of the Sign Language Research Group at the University and has compiled a Taiwan Sign Language online dictionary. Her recent research, besides phonology acquisition, is on the phonological structure of spoken and signed languages.

Dieter van Uytvanck studied computer science at Ghent University and linguistics at the Radboud University, Nijmegen. After graduating he worked at the Max Planck Institute for Psycholinguistics in Nijmegen. Since 2008 he has been active in the technical setup of the CLARIN research infrastructure (www.clarin.eu), and as of 2012 he is director at CLARIN-ERIC.

Holger Voormann received a degree in computer science from the University of Stuttgart. He worked as a research associate at the IMS Stuttgart and held several positions in IT companies. He is now a freelance software developer and consultant, and is involved in the development of several open source projects, such as the Platform for Annotated Corpora in XML (Pacx).

Andreas Witt received his PhD in Computational Linguistics and Text Technology from Bielefeld University in 2002, and continued there for the next four years as an instructor and researcher in those fields. In 2006 he moved to Tübingen University, where he participated in a project on ‘Sustainability of Linguistic Resources’ and in projects on the interoperability of language data. Since 2009 he has headed the Research Infrastructure group at the Institute for the German Language in Mannheim.

Florian Wittenburg works in the Language Archive at the MPI in Nijmegen, a collaboration between the Max Planck Society, the Berlin-Brandenburg Academy of Sciences, and the Royal Dutch Academy of Sciences.

Peter Wittenburg has a diploma degree in Electrical Engineering from the Technical University Berlin and in 1976 became head of the technical group at the newly founded Max Planck Institute for Psycholinguistics. He has had leading roles in various European and national research projects including the DOBES programme, CLARIN, and EUDAT,
as well as ISO initiatives. He is the head of the Language Archive, a collaboration between the Max Planck Society, the Berlin-Brandenburg Academy of Sciences, and the Royal Dutch Academy of Sciences.

Kai Wörner holds a PhD in text technology from the University of Bielefeld. After finishing his university studies at Gießen University, he worked as a web developer in Hamburg. He is currently the managing director of the Hamburg Centre for Language Corpora and a research assistant in the language resource infrastructure project CLARIN-D. His research interests are corpus and computational linguistics. He is one of the developers of the EXMARaLDA system.
CHAPTER 1

INTRODUCTION

JACQUES DURAND, ULRIKE GUT, AND GJERT KRISTOFFERSEN
Corpus phonology is a new interdisciplinary field of research that has only begun to emerge during the last few years. It has grown out of the need for modern phonological research to be embedded within a larger framework of social, cognitive, and biological science, and combines methods and theoretical approaches from phonology, both diachronic and synchronic, phonetics, corpus linguistics, speech technology, information technology and computer science, mathematics, and statistics. In the past, phonological research relied predominantly on descriptive methods, but while new methods such as experimentation, acoustic-perceptual and aerodynamic modelling, and psycholinguistic and statistical methods have recently been introduced, the employment of purpose-built corpora in phonological research is still in its infancy. With the increasing number of phonological corpora being compiled all over the world, the need arises for the international research community to exchange ideas and find a consensus on fundamental issues such as corpus annotation, analysis, and dissemination as well as corpus data formats and archiving. The time seems right for the development of standards for phonological corpus compilation and especially corpus annotation and metadata.

It is the aim of this Handbook to address these issues. It offers guidelines and proposes international standards for the compilation, annotation, and analysis of phonological corpora. This includes state-of-the-art practices in data collection and exploitation, theoretical advances in corpus design, best practice guidelines for corpus annotation, and the description of various tools for corpus annotation, exploitation, and dissemination. It also comprises chapters on phonological findings based on corpus analyses, including studies in fields as diverse as the phonology–phonetics interface, language variation, and language acquisition. Moreover, an overview is provided of a large number of existing phonological corpora and tools for corpus compilation, annotation, and exploitation.

The Handbook is structured in four parts. The first part, ‘Phonological Corpora: Design, Compilation, and Exploitation’, contains contributions on general issues in phonological corpus compilation, annotation, analysis, storage, and dissemination.
In chapter 2, Ulrike Gut and Holger Voormann describe the basic processes of phonological corpus design, including data compilation, data selection, and data annotation as well as corpus storage, sustainability, and reuse. They address fundamental questions and decisions that compilers of a phonological corpus are inevitably faced with, such as questions of corpus representativeness and size, raw data selection, and corpus sharing. On the basis of these reflections, the authors propose a methodology for corpus creation.

Many of the issues raised in chapter 2 are developed in the next three chapters. Chapter 3 is concerned with corpus-based data collection. In it, Bruce Birch discusses some key issues such as control over primary data, context and contextual variation, and the observer’s paradox. He further classifies various widespread data collection techniques in terms of the amount and type of control they exert over the production of speech, and gives a comprehensive overview of data collection techniques for purposes of phonological research. The task of phonological corpus annotation is described in chapter 4. Elisabeth Delais-Roussarie and Brechtje Post first discuss some theoretical issues that arise in the transcription and annotation of speech, such as segmentation and the assignment of labels. Furthermore, they provide a comprehensive overview and evaluation of the various systems that are in use for the annotation of segmental and suprasegmental information in the speech signal. Chapter 5 provides an overview of the state of the art in automatic phonetic transcription of corpora. After introducing the most relevant methodological issues in this area, Helmer Strik and Catia Cucchiarini describe and evaluate the different techniques of (semi-)automatic phonetic corpus transcription that can be applied, depending on what kind of data and annotations are available to corpus compilers.

The next two chapters are concerned with the exploitation and archiving of phonological corpora. In chapter 6, Hermann Moisl presents statistical methods for analysing phonological corpora, focusing in particular on cluster analysis. Illustrating his account with the Newcastle Electronic Corpus of Tyneside English (which is presented in Part IV), he describes and discusses in detail the process and benefits of applying the technique of clustering to phonological corpus data. Chapter 7 is concerned with corpus archiving and dissemination. Peter Wittenburg, Paul Trilsbeek, and Florian Wittenburg discuss how the traditional model of corpus archiving and dissemination is changing dramatically, with digital innovations opening up new possibilities. They examine various preservation requirements that need to be met, and illustrate the use of advanced infrastructures for data accessibility, archiving, and dissemination.

The last two chapters of Part I are concerned with the concepts of metadata and data formats. In chapter 8, Daan Broeder and Dieter van Uytvanck describe some of the major metadata sets in use for the compilation of corpora, including OLAC, TEI, IMDI, and CMDI. They further give some practical advice on which metadata schema to use or how to design one’s own if required. Finally, chapter 9 addresses basic issues that are important for corpus compilers with regard to the choice of data format.
Laurent Romary and Andreas Witt argue for providing the research community with a set of standardized formats that allow a high reuse rate of phonological corpora as well as better interoperability across tools used to produce or exploit them.
They describe some basic concepts related to the representation of annotated linguistic content, and offer some proposals for the annotation of spoken corpus data.

The second part of this Handbook, ‘Applications’, is devoted to how speech corpora can be put to use. Each chapter considers how corpus-based methods may enrich and improve research within different subfields of phonology such as phonetics, prosody, segmental phonology, diachrony, first language acquisition, and second language acquisition. These topics and perspectives should by no means be regarded as exhaustive; they are but a few examples of many possible ones that are intended to show the usefulness of corpus-based methods.

In chapter 10, Elisabeth Delais-Roussarie and Hiyon Yoo take as their starting point the various data and methods commonly used for research in phonetics and phonology. This leads them to a definition of what can be considered (1) a corpus and (2) a corpus-based approach to the two disciplines. The rest of the chapter is devoted to post-lexical phonology and prosody, such as liaison in French, suprasegmental phenomena such as phrasing or intonation, and the use of corpora in phonetic research.

The topic of chapter 11, written by Hanne Gram Simonsen and Gjert Kristoffersen, is segmental phonology from a variationist point of view. An ongoing change whereby a formerly laminal /s/ is turned into apical /ʂ/ before /l/ in Oslo Norwegian is shown to be governed by a complex set of phonological and morphological constraints that could not have been identified without recourse to corpus-based methods. The corpora used in their analysis are described in chapter 25 of this volume.

Chapter 12 takes up again one of the topics of chapter 10 in greater detail: French liaison. Based on the PFC corpus (see chapter 24, this volume), Jacques Durand shows how recourse to corpora has contributed to a better understanding of perhaps one of the most thoroughly analysed phenomena in the phonology of French. Durand argues that previous analyses are to a certain extent flawed because they are based on data which are too scarce, occasionally spurious, and often uncritically adopted from previous treatments. The PFC corpus has helped to put the analysis on firmer empirical ground and to chart which areas are relatively stable across speakers and which are variable.

The topic of chapter 13, written by Yvan Rose, is phonological development in children. Following a discussion of issues that are central to research in phonological development, the chapter describes some solutions, with an emphasis on the recently proposed PhonBank initiative (see also chapter 19 of this volume) within the larger CHILDES project. Finally, chapter 14 is concerned with second language acquisition. Here, Ulrike Gut shows how research on the acquisition and structure of L2 phonetics and phonology can profit from the analysis of phonological corpora of second language learner speech. A second objective of this chapter is to discuss how corpora can support the creation of teaching materials and teaching curricula, and how they can be employed in classroom teaching and learning of phonology.

Part III of the Handbook concerns ‘Tools and Methods’. A number of tools, systems, or methods in this section have become standard in the field and are used in a large number of research projects. Thus, chapter 15 by Han Sloetjes provides an overview of ELAN, a stand-alone tool developed at the Max Planck Institute for Psycholinguistics in Nijmegen in the Netherlands.
ELAN is a generic multimedia annotation tool which is not restricted to the analysis of spoken language, since it is also applied in sign language research, gesture research, and language documentation, to name just a few. It offers powerful descriptive strategies, since it supports time-aligned multilevel transcriptions and permits annotations to reference other annotations, allowing for the creation of annotation tree structures.

By contrast, EMU, presented in chapter 16 by Tina John and Lasse Bombien, is a database system for the specific analysis of speech, consisting of a collection of software tools for the creation, manipulation, and analysis of speech databases. EMU includes an interactive labeller which can display spectrograms and speech waveforms, and which allows the creation of hierarchical as well as sequential labels for a speech utterance. A central concern of the EMU project is the statistical analysis of speech corpora. To this end, EMU interfaces with the R environment for statistical computing.

Like EMU, Praat, devised by Paul Boersma and David Weenink at the University of Amsterdam, is a computer program for analysing, synthesizing, and manipulating speech and other sounds, and for creating publication-quality graphics. A speech corpus typically consists of a set of sound files, each of which is paired with an annotation file, and metadata information. Paul Boersma’s introduction to Praat in chapter 17 demonstrates that the strengths of this tool lie in the acoustic analysis of the individual sounds, in the annotation of these sounds, and in browsing multiple sound and annotation files across the corpus. Moreover, corpus-wide acoustic analyses, leading to tables ready for statistical analysis, can be performed by the Praat scripting language, which is thoroughly described and illustrated by Caren Brinckmann in chapter 18. As stressed by this author, building a speech corpus and exploiting it to answer phonetic and phonological research questions is a very time-consuming process. Many of the necessary steps in the corpus-building process and the analysis stage can be facilitated by scripting. Caren Brinckmann demonstrates how scripts can be employed to support orthographic transcription, phonetic and prosodic annotation, querying, analysis, and preparation for distribution.
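By way of illustration (a sketch of ours, not an example from Brinckmann’s chapter), a corpus-wide analysis of this kind takes only a few lines of Praat script. The fragment below loops over all sound files in a hypothetical directory corpus/, measures the mean fundamental frequency of each file, and prints the results as a tab-separated table; the directory name and the pitch analysis settings (75–600 Hz) are illustrative assumptions, not recommendations.

    # Sketch: tabulate mean F0 for every .wav file in an assumed folder "corpus/"
    writeInfoLine: "file", tab$, "mean_F0_Hz"
    files = Create Strings as file list: "wavs", "corpus/*.wav"
    n = Get number of strings
    for i to n
        selectObject: files
        name$ = Get string: i
        sound = Read from file: "corpus/" + name$
        pitch = To Pitch: 0, 75, 600
        f0 = Get mean: 0, 0, "Hertz"
        appendInfoLine: name$, tab$, fixed$ (f0, 1)
        removeObject: sound, pitch
    endfor
    removeObject: files

A table produced in this way can be saved and read into a statistics environment such as R, so that the path from corpus to table to statistical analysis remains fully scripted.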
The contribution by Yvan Rose and Brian MacWhinney (chapter 19) is centred on the PhonBank project. The authors provide a description of the tools available through the PhonBank initiative for corpus-based research on phonological development as well as for data sharing. PhonBank is one of ten subcomponents of a larger database of spoken language corpora called TalkBank. Other areas in TalkBank include AphasiaBank, BilingBank, CABank, CHILDES, ClassBank, DementiaBank, GestureBank, Tutoring, and TBIBank. All of the TalkBank corpora use the CHAT data transcription format, which enables a thorough analysis with the CLAN programs (Computerized Language ANalysis). The PhonBank corpus is unique in that it can be analysed both with the CLAN programs and with an additional program, called Phon, which is designed specifically for phonological analysis. The authors provide an introduction to Phon and then widen the discussion to methodological issues relevant to software-assisted approaches to phonological development, and to phonology more generally.
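To make the data format concrete, the following is a skeletal CHAT transcript (an invented example of ours, not taken from PhonBank). Header lines begin with @, speaker tiers with *, and dependent tiers with %; in PhonBank data a %pho tier carries a phonetic rendering of the utterance. Real files contain further obligatory headers, such as @ID lines for each participant, and a tab follows each colon.

    @Begin
    @Languages: eng
    @Participants: CHI Target_Child, MOT Mother
    *CHI: more cookie .
    %pho: mɔ kʊki
    *MOT: you want more cookies ?
    @End

Because every tier in such a file is machine-readable, the same transcript can be queried with the CLAN programs or analysed in Phon at the level of individual segments.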
In chapter 20, Thomas Schmidt and Kai Wörner provide an overview of EXMARaLDA. This is a system for creating, managing, and analysing digital corpora of spoken language which has been developed at the University of Hamburg since 2000.
From the outset, EXMARaLDA was planned to serve a variety of purposes and user communities. Today, the system is used, among other things, for corpus development in pragmatics and conversation analysis, in dialectology, in studies of multimodality, and for the curation of legacy corpora. This chapter foregrounds the use of EXMARaLDA for corpus phonology within the wider study of spoken interactions. As part of the overview, three corpora are presented—a phonological corpus, a discourse corpus, and a dialect corpus—all constructed with the help of EXMARaLDA.

Chapter 21 by Michael Kipp is devoted to ANVIL, a highly generic video annotation research tool. Like ELAN (cf. chapter 15), ANVIL supports three activities which are central to contemporary research on language interaction: the systematic annotation of audiovisual media (coding), the management of the resulting data in a corpus, and various forms of statistical analysis. In addition, ANVIL also allows for audio, video, and 3D motion-capture data. The chapter provides an in-depth introduction to ANVIL’s underlying concepts, which is especially important when comparing it to alternative tools, several of which are described in other chapters of this volume. It also highlights some of the more advanced features (such as track types, spatial coding, and manual generation) that can significantly increase the efficiency and robustness of the coding process.

To conclude this part, chapter 22 by Atanas Tchobanov returns to an issue raised in Part I of this volume, namely web-based sharing of data for research, pedagogical, or demonstration purposes. Modern corpus projects are usually designed with a web-based administration of the corpus compilation process in mind. However, transferring existing corpora to the web proves to be a more challenging task. This chapter reviews the solutions available for designing and deploying a web-based phonological corpus. Furthermore, it gives a practical step-by-step description of how to archive and share a phonological corpus on the web. Finally, the author discusses a novel approach, based on the latest HTML5 and relying only on the now universal JavaScript language, implemented in any modern browser. The new system he presents runs online from any HTTP server, but also offline from a hard drive, CD/DVD, or flash drive, thus opening up new possibilities for the dissemination of phonological corpora.

In line with its title, ‘Corpora’, Part IV of this Handbook aims at presenting a number of leading corpora in the field of phonology. Even a book of this size cannot remotely hope to reference all the worthwhile projects currently available. Our aim has therefore been a more modest one: that of giving an overview of some well-known speech corpora exemplifying the methods and techniques discussed in earlier parts of the book and covering different countries, different languages, different linguistic levels (from the segmental to the prosodic), and different perspectives (e.g. dialectology, sociolinguistics, and first and second language acquisition).

In chapter 23, Francis Nolan and Brechtje Post provide an overview of the IViE Corpus of spoken English. IViE stands for ‘Intonational Variation in English’ and refers to a collection of audio recordings of young adult speakers of urban varieties of English in the British Isles made between 1997 and 2002.
These recordings were devised to facilitate the systematic investigation of intonational variation in the British Isles, and have served as a model for similar studies in other parts of the world. This chapter sets out by describing the reasoning behind the choices made in designing the corpus, and surveys some of the research applications in which recordings from IViE have been used.

Chapter 24 is devoted to an ongoing programme concerning spoken French which was set up in the late 1990s: the PFC Programme (Phonologie du Français Contemporain: usages, variétés et structure), which is one of the largest databases of spoken French of its kind. In their contribution, Jacques Durand, Bernard Laks, and Chantal Lyche attempt to show the advantages of a uniform type of data collection, transcription, and coding, which has led to the construction of an interactive website integrating advanced search and analysis tools and allowing for the systematic comparison of varieties of French throughout the world. They also emphasize that, while the core of the programme has been phonological (and initially mainly segmental), the database permits applications ranging from speech recognition to syntax and discourse—a point made by many other contributors to this volume.

In chapter 25, Kristin Hagen and Hanne Gram Simonsen provide a description of two speech corpora hosted by the University of Oslo: NoTa-Oslo and TAUS (see also chapter 11, where research based on these corpora is reported). Both corpora are based on recordings of spontaneous speech from Oslo residents, with the NoTa-Oslo speech recorded in 2005–2006 and the TAUS speech in 1972–1973. These two corpora permit a thorough synchronic and diachronic investigation of speech from Oslo and its immediate surroundings, which can be seen as representative of Urban East Norwegian speech. In both cases, the web search interface is relatively simple to use, and the transcriptions are linked to audio files (for both NoTa-Oslo and TAUS) and video files (for NoTa-Oslo). NoTa-Oslo and TAUS are both multi-purpose corpora, designed to support research in different fields, such as phonology, morphology, syntax, semantics, discourse, dialectology, sociolinguistics, lexicography, and language technology. This makes the corpora very useful for most purposes, but it also means that they cannot immediately meet the demands of every research task. The authors show how NoTa-Oslo and TAUS have been used for phonological research, but also discuss some of the limitations entailed by the types of interaction at the core of the corpus—a discussion most instructive for researchers wishing to embark on large-scale socio-phonological projects.

Chapter 26 by Ulrike Gut turns to another area where phonological corpora are proving indispensable—that of second language acquisition. The LeaP corpus was collected in Germany at the University of Bielefeld between 2001 and 2003 as part of the LeaP (Learning Prosody in a Foreign Language) project. The aim has been to investigate the acquisition of prosody by second language learners of German and English, with a special focus on stress, intonation, and speech rhythm, as well as the factors influencing the acquisition process and its outcome. The LeaP corpus comprises spoken language produced by 46 learners of English and 55 learners of German as well as recordings with 4 native speakers of English and 7 native speakers of German.
This chapter is particularly useful in providing a detailed discussion of methods concerning the compilation of a corpus designed for studying the acquisition of prosody: selection of speakers, recordings, types of speech, transcription issues, annotation procedures, data formats, and assessment of annotator reliability.
In chapter 27, by Joan C. Beal, Karen P. Corrigan, Adam J. Mearns, and Hermann Moisl, the focus is on the Diachronic Electronic Corpus of Tyneside English (DECTE), and particularly on annotation practices and dissemination strategies. The first stage in the development of DECTE was the construction of the Newcastle Electronic Corpus of Tyneside English (NECTE) between 2000 and 2005. NECTE is what is called a legacy corpus, based on data collected for two sociolinguistic surveys conducted on Tyneside, northeast England, in c.1969–1971 and 1994, respectively. The authors concentrate in particular on transcription issues relevant for addressing research questions in phonetics/phonology, and on the nature of and rationale for the text-encoding systems adopted in the corpus construction phase. They also offer some discussion of the dissemination strategy employed since completion of the first stage of the corpus in 2005. The exploitation of NECTE for phonetic/phonological analysis is described in Moisl’s chapter in Part I of this Handbook. Insofar as the researchers behind NECTE have been pioneers in the construction of a unique electronic corpus of vernacular English which was aligned, tagged for parts of speech, and fully compliant with international standards for encoding text, the continuing work on the subcorpora now included within DECTE is of interest to all projects having to deal with recordings and metadata stretching back in time.

Interestingly, the following chapter (28), on LANCHART, by Frans Gregersen, Marie Maegaard, and Nicolai Pharao, focuses on similar issues concerning Danish. The authors give an outline of the corpus work done at the LANCHART Centre of the Danish National Research Foundation. The Centre has performed re-recordings of a number of informants from earlier studies of Danish speech, thus making it possible to study variation and change in real time. The chapter deals with the methodological problems posed by such a diachronic perspective in terms of data collection, annotation, and interpretation. Gregersen, Maegaard, and Pharao then focus on three significant examples: the geographical pattern of the (əð) variable, the accommodation to a moving target constituted by the raising of (æ) to [ɛ], and finally the covariation of three phonetic variables and one grammatical variable (the generic pronoun) in a single interview.

Chapter 29, written by Marc van Oostendorp, is devoted to phonological and phonetic databases at the Meertens Institute in Amsterdam. This centre was founded in 1930 under the name ‘Dialect Bureau’ (Dialectenbureau) before being named in 1979 after its first director, P. J. Meertens. Originally, the institute had as its primary goal the documentation of the traditional dialects as well as folk culture of the Netherlands. In the course of time, this focus has broadened in several ways. From a linguistic standpoint, the Institute has widened its scope to topics other than the traditional dialects. Currently it comprises two departments, one of Dutch Ethnology and one of Variation Linguistics. Although the documentation of dialects has made significant progress, considerable effort has recently gone into digitizing material and putting it online.
Van Oostendorp’s contribution seeks to describe the two most important databases on Dutch dialects which are available at the Meertens Institute: the Goeman–Taeldeman–Van Reenen Database and Soundbites.
He concludes by presenting new research areas at the Meertens Institute and by pointing out some desiderata concerning them.

Chapter 30 is concerned with the VALIBEL speech database. Anne Catherine Simon, Philippe Hambye, and Michel Francard present the ‘speech bank’ which has been developed since 1989 at the Centre de recherche Valibel of the Catholic University of Louvain (Belgium). This speech database, which is one of the largest banks of spoken French in the world, is not a homogeneous corpus but rather a compilation of corpora, collected with a wide range of linguistic applications in mind and integrated into a system allowing for various kinds of investigation. The authors give a thorough description of the database, with special attention to the features that are relevant for research in phonology. Although the first aim of VALIBEL was not to build up a reference corpus of spoken French, but to collect data in order to provide a sociolinguistic description of the varieties of French spoken in Belgium, Simon, Hambye, and Francard show how the continuing gathering of data for various research projects has finally resulted in the creation of a large and controlled database, highly relevant for research in a number of fields, including phonetics and phonology.

In chapter 31, Janet Fletcher and Lesley Stirling focus on prosody and discourse in the Australian Map Task corpus. The Australian Map Task corpus is part of the Australian National Database of Spoken Language (ANDOSL), which was collected in the 1990s for use in general speech science and speech technology research in Australia. It is closely modelled on the HCRC Map Task, which was designed in the early 1990s by a team of British researchers to elicit spoken interaction typical of everyday talk in a controlled laboratory environment. Versions of this task have been used successfully to develop or test models of intonation and prosody in a wide range of languages, including several varieties of English (as illustrated by Nolan and Post in chapter 23). The authors show how the Australian Map Task has proved to be a useful tool with which to examine different prosodic features of spoken interactive discourse. While the intonational system of Australian English shares many features with other varieties of English, tune usage and tune interpretation are argued to remain variety-specific, with the Map Task proving to be a rich source of information on this question. The studies summarized in this contribution also illustrate the flexibility of Map Task data in permitting correlations of both micro-level discourse units such as dialogue acts and larger discourse segments such as Common Ground Units with the intonational and prosodic features of Australian English. The chapter includes a detailed discussion of annotation and analytical techniques for the study of prosody, thus complementing the contribution of Nolan and Post at the beginning of this part of the Handbook.

The Handbook concludes with a chapter by Jane S. Tsay describing a phonological corpus of L1 acquisition of Taiwan Southern Min. In her contribution, Tsay outlines the data collection, transcription, and annotations for the Taiwanese Child Language Corpus (TAICORP), including a brief description of computer programs developed for specific phonological analyses. TAICORP is a corpus of spontaneous speech between young children growing up in Taiwanese-speaking families and their carers.
The target language, Taiwanese, is a variety of Southern Min Chinese spoken in Taiwan.
(Taiwanese and Southern Min are used interchangeably by the author.) Tsay shows how a well-designed phonological corpus such as TAICORP can be used to throw light on many issues beyond phonology, such as the acquisition of syntax (syntactic categories, causatives, classifiers) and of pragmatic features. From a phonological point of view, much of the literature on child language acquisition has focused primarily on universal innate patterns (e.g. markedness constraints within Optimality Theory), but many contemporary studies have also argued that frequency factors are highly relevant and indeed more crucial than markedness. Only corpora such as TAICORP allow investigators to test competing hypotheses in this area. As is argued in most chapters of this volume, the construction of corpora cannot be divorced from theory construction and evaluation.

The idea for this Handbook was born during an ESF-funded workshop on phonological corpora held in Amsterdam in 2006. We would like to thank all participants of this event and of the summer school on Corpus Phonology held at Augsburg University in 2008 for their discussions, comments, and commitment to this emerging discipline of corpus phonology. Our thanks also go to Eva Fischer and Paula Skosples, who assisted us in the editing process of this Handbook. We hope that it will be of interest to researchers from a wide range of linguistic fields, including phonology, both synchronic and diachronic, phonetics, language variation, dialectology, first and second language acquisition, and sociolinguistics.
PART I

PHONOLOGICAL CORPORA: DESIGN, COMPILATION, AND EXPLOITATION
CHAPTER 2

CORPUS DESIGN

ULRIKE GUT AND HOLGER VOORMANN
2.1 Introduction

Corpus phonology is a new interdisciplinary field of research that has emerged over the last few years. It refers to a novel methodological approach in phonology: the use of purpose-built phonological corpora for studying speakers' and listeners' knowledge and use of the sound system of their native language(s), the laws underlying such sound systems, and the acquisition of these systems in first and second language learning. Throughout its history, phonological research has employed a number of different methods, including the comparative method, experimental methods (Ohala 1995), and acoustic-perceptual and aerodynamic modelling taken from phonetics and integrated into the approach of laboratory phonology (Beckman and Kingston 1990: 3). The usage of purpose-built corpora in phonological research, however, is still in its infancy (see the chapters in part II of The Oxford Handbook of Corpus Phonology).

Corpus linguistics as a method for studying the structure and use of language can be traced back to the 18th century (Kennedy 1998: 13). Modern corpora began to be collected in the 1960s and have quickly developed into one of the main methods of linguistic inquiry. It is now widely acknowledged that corpus-based linguistic research allows the modelling and analysis of language and language use as a valid alternative to linguistic research based on isolated examples of language.

The development of corpus linguistics has proceeded in several waves (e.g. Renouf 2007; Johansson 2008). The few corpora that were compiled in the 1960s and 1970s were relatively small in size and were mainly used for lexical studies. In the 1980s, the number of different types of corpora increased rapidly and the first multi-million word corpora were created; these were employed for a wide range of linguistic studies including lexis, morphosyntax, language change, language variation, and language acquisition. In the past few years, the World Wide Web has been increasingly used as a corpus for morphosyntactic studies, and an entirely new type of corpus has appeared: the first phonological corpora. Accordingly, reflecting the technological possibilities and respective purposes of the different periods, the term 'corpus' has
been defined in many different ways. In general, it is used to refer to a substantial collection of language samples in electronic form that was assembled for a specific linguistic purpose (Sinclair 1991: 171). No agreed definition of what makes a corpus a phonological corpus exists as yet. This chapter attempts such a definition by outlining what a phonological corpus consists of and in what way it differs from other types of corpus (section 2.2).

Researchers have only begun collecting phonological corpora in the past few years. Precursors of phonological corpora were developed in the 1980s. These were so-called speech corpora that were assembled for technological applications such as text-to-speech synthesis, automatic speech recognition, or the evaluation of language processing systems (see Gibbon et al. 1997). They are, however, of limited use for phonological inquiry, as they typically contain recordings in a very restricted range of speaking styles (usually only a small set of sentences read out by different speakers, or many sentences read by one speaker) and do not include time-aligned phonological annotations. The spoken language corpora of the 1990s (e.g. the London-Lund Corpus (Svartvik 1990), the IBM Lancaster corpus of spoken English, and the Bergen Corpus of London Teenage Language (Breivik and Hasselgren 2002)) were collected with the aim of studying grammatical aspects of speech. Thus, they contain a more representative sample of speaking styles, but typically do not contain time-aligned phonological annotations either. It is only in the past few years that corpora have begun to be compiled with the express purpose of studying phonological phenomena. Such phonological corpora include the PFC (Phonologie du Français Contemporain; see Durand, Laks and Lyche, this volume), the IViE corpus comprising different regional British varieties (see Post and Nolan, this volume), and the LeaP corpus of learner German and English (see Gut, this volume).

The small number of extant phonological corpora indicates how little experience has been gained so far in compiling this type of corpus. It is therefore the second aim of this chapter to describe the entire design process of phonological corpora and to suggest some best practice guidelines that will help future corpus compilers to avoid common pitfalls and problems. Moreover, a theory of corpus creation, agile corpus creation (Voormann and Gut 2008), will be presented.

This chapter is structured in the following way: after a definition of a phonological corpus is presented in section 2.2, section 2.3 discusses the most important elements in the design of phonological corpora. These include corpus storage, sustainability, sharing and reuse (section 2.3.1), questions of corpus representativeness and size (2.3.2), and raw data selection (2.3.3), as well as the issue of time-aligned phonological annotations (2.3.4). The chapter concludes with a discussion of theories of the corpus creation process (section 2.4) and a conclusion and outlook (section 2.5).
2.2 Definition of a Phonological Corpus

No unanimously accepted definition of what constitutes a phonological corpus exists to date. In order to describe the essential properties and functions of a phonological corpus, a general description of the term 'corpus' and its properties is given first.
2.2.1 Characteristics of a Corpus

A corpus comprises two types of data: raw (or primary) data and annotations. The raw data of a linguistic corpus consists of samples of language. The types of raw data range from handwritten or printed texts and electronic texts to audio and video recordings. Some researchers only accept 'authentic language', i.e. language that was produced in a real communicative situation, as raw data and exclude recordings of individual sentences or text passages that are read out or repeated by speakers (e.g. McCarthy and O'Keefe 2008: 1012). Corpora containing the latter type of raw data have been classified as peripheral corpora (Nesselhauf 2004: 128) or databases. What all types of raw data have in common is that they have been selected but not altered or interpreted by researchers, and are accessible in the original form in which they were produced by speakers and writers.

The term 'annotation' refers to additional (or secondary) information about the raw data of the corpus that is added by the corpus compilers. It can be divided into linguistic and non-linguistic information. Examples of linguistic annotations are orthographic, phonemic, and prosodic transcriptions as well as part-of-speech tagging, semantic annotations, anaphoric annotations, and lemmatization. Annotations are always products of linguistic interpretation (see also Lehmann 2007: 17). Even an orthographic transcription reflects researchers' decisions—for example, in the choice between the spelling gonna and the form going to. By the same token, annotations of syntactic, morphological, semantic, or phonological phenomena are the results of interpretive processes resulting from the application of particular theoretical frameworks and perspectives.

Non-linguistic corpus annotations are generally referred to as metadata, and include information about the corpus as a whole (e.g. who collected it, where, when, and for which purpose); about the language samples (e.g. where and when they were produced); about the speakers/writers (e.g. their age, native language, and other languages); about the situation in which the language samples were produced (e.g. addressee, audience, event, purpose, time, date, and location); and about the recording process (e.g. what microphones and recording devices were used, and what the recording conditions were).

Not every collection of raw language data with corresponding annotations constitutes a corpus. The language sample should be representative (see section 2.3.2; McEnery and Wilson 2001: 29; Sinclair 2005). Some researchers furthermore claim that every corpus 'is assembled for a specific purpose' and that in the case of linguistic corpora, the purpose is the study of language structures and language use (e.g. Atkins et al. 1992: 1; Wynne 2009: 11; McCarthy and O'Keefe 2008: 1012). Thus, both language archives and the World Wide Web do not qualify as corpora in this strict sense since they were not collected for a linguistic purpose (e.g. Atkins et al. 1992: 1). By contrast, other researchers argue that the World Wide Web, if an informed selection of web pages is made, can serve as a corpus for linguistic research (Renouf 2007; Hoffmann 2009).

While the usage of corpora in the past has been mainly restricted to the study of the structures and use of language, new opportunities for the applications of corpora have recently opened up. These include technical applications such as the training of automatic translation systems and the employment of corpora in the development of
dictionaries and grammars (e.g. Biber et al. 1999) as well as in language teaching (e.g. Kettemann and Marko 2002; Sinclair 2004; Gut 2006; Römer 2008). Modern corpora are available in electronic form and are thus machine-readable, so that a rapid (semi-)automatic analysis of large amounts of data in a given corpus can be realized.
2.2.2 Definition of a Phonological Corpus

No commonly agreed definition of the term 'phonological corpus' exists yet. Phonological corpora can be divided into two types: speech databases and phonological corpora proper (see also Wichmann 2008: 187). The raw data of speech databases typically consists of lists of individual words, sets of sentences, or text passages read out by speakers under laboratory conditions. These recordings are well suited for instrumental analyses but, because of their highly controlled nature, might be of restricted use for the study of phonological phenomena other than those which the corpus compilers had in mind. Since the speech they contain is produced in a highly specific communicative situation, its phonological properties differ from those of speech produced under more 'natural' conditions such as in informal conversations, during conference speeches, in radio discussions, or in interviews (Summers et al. 1988; Byrd 1994; see also Birch, this volume). Raw data collected in these authentic communicative situations, by contrast, might suffer from a lower recording quality owing to background noises and speaker overlaps. Phonological corpora, compared to speech databases, have a wider application, and are collected with the purpose of studying the phonological structures of a language or language variety and their usage and acquisition as a whole.

In our view, speech databases and phonological corpora should not be seen as a dichotomy but rather as the two endpoints of a continuum (see also Birch, this volume) with many possible intermediate forms, such as corpora containing spontaneous speech elicited in a controlled way and covering a very restricted topic, as in Map Tasks (e.g. Stirling et al. 2001). We will therefore not include a specific purpose in our definition of a phonological corpus. Phonological corpora can be designed for different purposes, and speech databases might be later converted and reused as phonological corpora by adding time-aligned phonological annotations.

A phonological corpus is thus defined here as a representative sample of language that contains

• primary data in the form of audio or video data;
• phonological annotations that refer to the raw data by time information (time-alignment); and
• metadata about the recordings, speakers, and corpus as a whole.

This definition is thus very close to Gibbon et al.'s (1997: 79) definition of a spoken language corpus as 'any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organisations'.
FIGURE 2.1 Orthographic and phonemic annotation of part of an utterance in Praat.
In detail, according to the above definition, phonological corpora always contain raw data in the form of audio or video data (thus excluding written or sign language corpora). Strategies for selecting such raw data are discussed in section 2.3.3 below.

Time-aligned phonological annotations constitute the second prerequisite of a phonological corpus. They can include phonemic and phonetic transcriptions on the segmental level, and transcriptions of suprasegmental phenomena such as stress and accent, intonation, tone, pitch accents, pitch range, and pauses (see Post and Delais-Roussarie, this volume). The minimal requirement in terms of annotation for a corpus to be a phonological corpus is a time-aligned orthographic annotation plus one level of time-aligned phonological annotation. The term 'time alignment' refers to a technique that links linguistic annotations to the raw data. Figure 2.1 illustrates a time-aligned annotation, carried out with the speech analysis software Praat, of part of an utterance that begins with 'Well and the mouse . . .'. Beneath the speech waveform, the annotation is displayed on two different tiers. The top tier shows the boundaries of each of the individual words together with an orthographic transcription. On the bottom tier, the speech is segmented into phonemes, and the individual phonemes are transcribed phonetically using SAMPA, the computer-readable adaptation of the IPA (Wells et al. 1992).

For time-aligned annotations, the boundaries of annotated elements are defined by time stamps in the underlying text file that is created by the speech analysis software. This means that information about the exact beginning and end of each annotated phonological element is available in the corresponding file. The annotation illustrated in Figure 2.1 thus provides direct access from each annotated element to the corresponding primary data, i.e. the original recordings. By clicking on any annotated element, the matching part of the recording will be played back. This is not only useful for the annotation and analysis of the corpus, allowing for items in question to be listened to repeatedly, but it also facilitates automatic corpus analyses: on the basis of the text file, specifically designed software tools can calculate phonetic phenomena such as the mean length of words or phonemes and the exact alignment of pitch peaks and valleys with the phonemes in the speech signal.
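To illustrate, the following minimal sketch extracts interval durations from one tier of a long-format Praat TextGrid—the kind of time-stamped text file described above—and computes a mean segment duration. The file name and tier name are hypothetical, and the parser assumes a well-formed long-format file:

```python
import re

def interval_durations(textgrid_path, tier_name="phonemes"):
    """Collect (label, duration) pairs from one tier of a long-format
    Praat TextGrid by reading the xmin/xmax time stamps of its intervals."""
    text = open(textgrid_path, encoding="utf-8").read()
    # Isolate the requested tier, then cut it off at the next tier (if any).
    tier = re.split(r'name = "%s"' % re.escape(tier_name), text)[1]
    tier = re.split(r'item \[\d+\]:', tier)[0]
    # Each interval consists of an xmin, an xmax, and a text entry.
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"')
    return [(label, float(xmax) - float(xmin))
            for xmin, xmax, label in pattern.findall(tier)
            if label.strip()]          # skip empty (pause) intervals

segments = interval_durations("recording01.TextGrid")
print("mean segment duration: %.3f s"
      % (sum(d for _, d in segments) / len(segments)))
```

In practice, dedicated libraries or Praat's own scripting language would be used for such tasks; the point here is only that time-aligned annotation makes measurements of this kind a matter of a few lines of code.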
The third requirement of a phonological corpus is that it includes metadata about the recording (e.g. date, place), the speaker/s (e.g. age, dialect background), the recording situation (e.g. situational context), and the corpus as a whole (e.g. collectors, annotation schemas that were chosen).
2.3 Corpus Design

Designing any corpus requires careful planning that takes into account the entire life cycle of the corpus. All decisions that are made before and during raw data collection and annotation will determine the eventual usability of the corpus. The first considerations of corpus designers should therefore centre round the issues of corpus use, sustainability, sharing, and reuse.
2.3.1 Storage, Sustainability, Sharing, and Reuse

One of the principal issues to be addressed before the compilation of a phonological corpus can begin is that of corpus storage and sustainability (see also Wynne 2009). If the corpus is to be used over a long period of time, its sustainability and preservation need to be ensured. Corpus storage firstly involves organizing the provision of institutionalized archiving facilities that guarantee continued access to the corpus. Infrastructures for the storage of language resources have, for example, been created by the LDC (http://www.ldc.upenn.edu/) and the CLARIN initiative (http://www.clarin.eu/; see also Wittenburg, Trilsbeek and Wittenburg, this volume). Furthermore, in order to compile a sustainable corpus, corpus creators should choose a data format and annotation tools that will be able to keep up with future technical developments. Adaptation to future changes is easier when available standards are used during corpus creation and when adequate documentation is provided. In particular, a standardized data format should be chosen (see Romary and Witt, this volume) and annotation tools should be selected that have the prospect of being further developed and maintained in the future.

The issues of standardization and documentation also apply to the sharing and reuse of phonological corpora. Although Johansson (2008: 35) claims that the sharing of resources is a 'rather novel aspect of the development of corpus linguistics', it is a central requirement for theoretical advancement in linguistics as a whole and phonology in particular. Examples abound of corpora that cannot be reused because the compilers did not envisage sharing their data: the original recordings of the spoken part of the British National Corpus (BNC), for instance, cannot be made available because permission was only sought from the speakers for publishing the transcripts. Obtaining declarations of consent now is impossible due to the anonymization procedure, and because no lists of the recorded speakers seem to have been made (Burnard 2002). Lack of planning can also be seen in the example of the Spoken English Corpus, whose orthographic transcriptions
had to be time-aligned with the original recordings in retrospect, requiring a considerable investment of time and effort. Together with automatically generated phonemic and intonation transcriptions, the corpus now constitutes a phonological corpus under the name of Aix-Marsec corpus (Auran et al. 2004). Currently, the reuse (and extension) of existing corpora is still impeded by the fact that the annotation tools used for corpus compilation have different data formats regarding both metadata and annotation, which results in limited interoperability. Moreover, there are as yet no commonly accepted ISO guidelines for the encoding of phonological corpora, and the existing TEI guidelines for the encoding of spoken language remain inadequate for collectors of phonological corpora. In fact, the only quasi-phonological corpus (the annotations are not time-aligned) encoded according to the TEI guidelines so far is the NECTE corpus (e.g. Moisl and Jones 2005).

Legal issues requiring consideration at the outset of phonological corpus compilation include both licensing and permissions. For a publication of the corpus that provides access to all of its potential users, a declaration of consent that allows the subsequent distribution of the data has to be signed by every speaker who contributes raw data to the corpus. This declaration of consent should ideally include all possible forms of publication, even via media that cannot yet be envisaged at the point of corpus compilation. Most universities now provide standard forms for declarations of consent which contain a short description of the purpose of the study and descriptions of the intended use of the data by researchers. Surreptitious recordings such as those obtained for the spoken language sub-corpus of the BNC (Crowdy 1993) no longer meet ethical standards. It is no longer acceptable to record speakers and inform them only afterwards that they have been recorded. Consent must be obtained prior to the recording.

The reuse and enhancement of corpora by researchers other than those directly involved in the compilation process is only possible given adequate documentation of both the corpus content and the corpus creation process. Publishing the corpus under a license such as the Creative Commons license (http://creativecommons.org) ensures that all subsequent modifications of the corpus by other researchers are required to be made available again under the same license. The corpus creation process itself can be made open, as is the case for the Nigerian component of the International Corpus of English (ICE Nigeria), which was compiled at the University of Münster (Wunder et al. 2010). Its creation process is documented on http://pacx.sf.net with a video tutorial that shows how to create a corpus from scratch in three minutes with the tool Pacx. Reuse of the corpus also requires the accessibility of appropriate annotation and search tools. Ideally, these are freely available open source tools that allow researchers both to add their own annotations to the corpus and to search the corpus automatically. ELAN (see Sloetjes, this volume) and Phon (see Rose and MacWhinney, this volume) are examples of such tools.
2.3.2 Representativeness and Size

Representativeness and a sufficient size are two commonly agreed-on requirements for corpora (e.g. McEnery and Wilson 2001: 29; Sinclair 2005; Lehmann 2007: 23). Yet
representativeness has been described as a 'not precisely definable and attainable' goal (Sinclair 2005) or as impossible (Atkins et al. 1992: 4), and many corpus compilers state that their corpus does not reach this goal (e.g. Johansson et al. 1978: 14 on the LOB corpus) or that representativeness has to be sacrificed for other needs (Hunston 2008: 156).

The term 'representativeness' is usually used to refer to the objective that the raw data of a corpus should constitute a sample of a language or a language variety that includes its full range of variability. It should thus provide researchers with as accurate as possible a picture of the occurrence and variation of linguistic phenomena, and the potential to generalize the corpus-based findings to a language or language variety as a whole. It is of course never possible for a corpus to be representative in the strict sense, since this presupposes an exact knowledge of the distribution and frequency of linguistic structures in the language or language variety in question. It is extremely difficult to define what, for example, 'British English' is, let alone to decide which linguistic features it does and does not contain (see also Clear 1992: 21).

Typically, corpus collectors try to solve this methodological problem by using an intelligent sampling technique. Sampling, the selection of raw data, can be carried out as simple random sampling—choosing raw data without any predefined categorization—or as stratified random sampling, for which units or categories are established from which random samples are subsequently drawn. It is generally believed that the second approach achieves higher representativeness (Biber 1993: 244). The sampling frame that is created for stratified random sampling can be either linguistically motivated or rely on demographic features.

A sampling frame based on linguistic criteria could consist of a predetermined set of different text types or speaking styles, usually illustrated as different cells of a table (e.g. Hunston 2008: 154). The aim of raw data collection then is to fill all such cells with an equal number of language samples (usually measured in words). This approach has some inherent problems. The first question to be asked is whether the different text types or speaking styles should be taken from language production or language perception. Several authors have argued that the types of language people hear do not match the types of language they produce (e.g. Atkins et al. 1992: 5; Clear 1992: 24ff.). For instance, while very few people ever address others in public speeches, these public speeches are heard by many. Some researchers argue that only production defines the language under investigation (Clear 1992: 26) and therefore suggest including the text types of language production rather than perception in a corpus. Others hold that the selection of text types should be based on the inclusion of both language production and perception (Atkins et al. 1992: 5). The second question concerns the distribution of the different text types in the corpus. If, as has been argued, the text types should be proportionally sampled, i.e. should match their relative distribution in the target language (e.g. Biber 1993: 247), all phonological corpora would have to consist of 90 per cent conversation, the estimated share of this text type in the average speaker's production.
Biber (1993: 244) states that ‘identifying an adequate sampling frame [for spoken texts] is difficult’.
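For illustration, stratified random sampling over a predefined sampling frame can be sketched in a few lines. The catalogue of candidate recordings and the two sampling dimensions (speaking style and age group) are invented for the example:

```python
import random
from collections import defaultdict

# Hypothetical catalogue: one entry per candidate recording, described
# by the categories of the sampling frame.
candidates = [
    {"id": "r001", "style": "conversation", "age": "18-30"},
    {"id": "r002", "style": "reading", "age": "18-30"},
    {"id": "r003", "style": "conversation", "age": "31-50"},
    # ... one entry per available recording
]

def stratified_sample(recordings, per_cell=5, seed=42):
    """Draw the same number of recordings from every cell of the
    sampling frame (here: speaking style x age group)."""
    cells = defaultdict(list)
    for rec in recordings:
        cells[(rec["style"], rec["age"])].append(rec)
    rng = random.Random(seed)   # fixed seed makes the sample reproducible
    sample = []
    for cell, members in sorted(cells.items()):
        if len(members) < per_cell:
            print("warning: cell %s has only %d candidates"
                  % (cell, len(members)))
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample
```

Simple random sampling would correspond to a single draw over the whole catalogue; the stratified version guarantees that every cell of the frame is filled, at the price of having to define the frame in advance.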
A demographic sampling frame selects not samples of language but speakers, and does so on the basis of standard social criteria such as age, gender, social class, regional background, and socioeconomic status. Often, combined linguistic and demographic sampling frames are used in corpus compilation: the collectors of the spoken part of the BNC, for instance, attempted to include recordings with British speakers of all regions, socioeconomic groups, and ages, and to cover a wide range of communicative situations (Crowdy 1993: 259).

The dimensions of variation that should be included in a representative phonological corpus need to have an empirical basis, i.e. they should be determined by empirical studies of the extent and type of phonological variation and its constraining factors. Dimensions of variability that have been identified in previous research include

• speaker groups of different age, gender, regional background, and socioeconomic status;
• speaker states (physiological and emotional);
• situational contexts (e.g. interactional partners);
• communicative contexts;
• speaking style.

The systematic influence of topic as a dimension of phonological variability remains to be conclusively demonstrated and requires further empirical investigation (see Biber 1993: 247).

It is obvious that true representativeness is not always possible in corpus design. Together with issues of availability and willingness of speakers, also at issue are external factors such as the number of project members involved in corpus compilation and the duration and extent of funding.

Likewise, it is very difficult to determine the optimal size of a corpus. With the ever-increasing capacity of computers to store gigabytes of data, the formerly arbitrarily established 'ideal' corpus size of one million words—at least for corpora with written texts as primary data—has now been superseded by the motto 'the larger the better' (Sinclair 1991: 18; Clear 1992: 30). Methodological and technological limitations constraining corpus size have largely been eliminated. However, no convincing proposals offering rigorous arguments for ideal corpus size have been published yet, and systematic studies in this area have yet to be conducted. It is increasingly accepted that the representativeness of a corpus does not correlate with a particular corpus size (e.g. Clear 1992: 24; Biber 1993: 243; McCarthy and O'Keefe 2008: 1012).

Currently available phonological corpora vary enormously in size. At the small end of the scale, specialized corpora such as the LeaP corpus of learner German and learner English (see Gut, this volume) or the NIESSC (National Institute of Education Spoken Singapore Corpus) corpus of spoken educated Singapore English (Deterding and Low 2001) consist of 12 hours and 3.5 hours of recorded speech respectively. Although both corpora aim to include a representative sample of speakers, the problem remains that the small corpus size may cause difficulties in the interpretation and generalization of the results. At the large end of the scale are corpora such as the IViE corpus, which contains 36
hours of speech data from 108 speakers whose intonation was transcribed (see Nolan and Post, this volume).

The optimal size of a corpus is therefore one that requires a minimum of time, effort, and funding for corpus compilation but that, at the same time, guarantees that the distribution of all linguistic features is faithfully represented. Biber (1993) has shown for written language that some structural features are distributed linearly and others in a curvilinear way. For the former type of linguistic feature, an increased corpus size implies a linearly growing representation in the corpus, while for the latter type an increased corpus size results in an overrepresentation. Empirical research of the same nature appears not to be available for phonological features. Only future research can determine whether these and/or other types of distributional patterns exist for phonological structures and how such patterns interrelate with sample size. In fact, it will be the increasing compilation and availability of purpose-built phonological corpora that allow for the first research of this kind, contributing in turn to refinements in the design of future corpora.
2.3.3 Raw Data Selection

The 'authenticity' of the language contained in a corpus is the central argument that is typically used to point out the advantages of corpus linguistics in comparison with linguistic research based on invented sentences. It has been claimed that only on the basis of language samples that were actually produced, i.e. utterances that were used in real life and have proven their status as communicative vehicles, can the structure and usage of language be studied. While this is possibly true for research on morphosyntax, lexicon, and pragmatics, the problem of 'non-authentic' language production for phonological investigation is a more intricate one. Even speech elicited in very controlled situations can be considered 'natural', since it is appropriate in the very specific communicative situation of phonetic experiments (see also Birch, this volume). Although it has been shown in many studies that the prosodic properties of speech vary with speaking style, and that read speech exhibits distinct phonological properties from spontaneous 'natural' speech (e.g. Gut 2009), this should be taken as an argument in favour of including as many different speaking styles as possible in a phonological corpus, as was the case for the collection of the IViE corpus (see Nolan and Post, this volume), the PFC (see Durand, Laks and Lyche, this volume), and the LeaP corpus (see Gut, this volume). Distinctions in the phonological domain between speech produced under laboratory conditions and 'natural' speech are a much-needed focus for future research.

Data collection and selection should be driven by external rather than internal criteria (Clear 1992: 29; Sinclair 2005). External criteria are defined situationally and refer to different language registers, text types, or communicative situations. Internal criteria refer to the distribution of certain linguistic structures or the frequency of particular words or constructions. Raw data selected according to internal criteria might be biased by the researchers' purposes, and may therefore fail to achieve a high level of representativeness (see section 2.3.2).
A further consideration when embarking on data collection is the time and effort required to gather different types of raw data. It is well known, for example, that it is much more difficult to obtain language samples from speakers of lower socioeconomic status and educational level than from speakers with higher socioeconomic status and educational level (see Oostdijk and Boves 2008). Moreover, not all collected recordings will be usable, especially those that have been recorded in non-laboratory conditions. Crowdy (1993: 261) reports that, of the recordings made for the spoken part of the BNC, only about 60 per cent were usable. When, or preferably before, making recordings, potential difficulties with subsequent annotation should already be considered. These include the separation of the different speakers in multi-speaker conversations, and the overall intelligibility, which might be reduced by speaker overlaps and background noise. It is generally a good idea to first collect a small pilot corpus and annotate and analyse this before proceeding to collect more raw data (see section 2.4 on the corpus creation process).
2.3.4 Time-Aligned Phonological Annotations

Before embarking on corpus annotation, a suitable annotation tool needs to be chosen that fulfils the specific requirements of the corpus creation process. In order to ensure the usability and enhanceability of the corpus, the tool should store the created files in a standardized data format such as XML. Moreover, these files should be readable by other tools, e.g. corpus search tools, without requiring complex conversion routines. If the corpus annotators work in different locations, the tool should provide facilities for central storage and repeated modifications of the annotations. Pacx, for instance, which is used in the compilation of several corpora, is a tool that supports the entire corpus creation process including raw data storage, annotation and metadata storage in XML, collaborative annotation, and automatic checks of annotation consistency and annotation errors, as well as simple corpus searches (Wunder et al. 2010).

The first type of annotation every phonological corpus needs to contain is a time-aligned orthographic transcription. This is indispensable for adding further annotations such as an automatic phonemic transcription or part-of-speech tagging and lemmatization. Due to the current lack of standardization, many decisions are required for an orthographic transcription, which include the spelling of colloquial forms, the representation of truncated words, and the usage of punctuation symbols. In the case of phonological corpora that contain non-native speech, this requirement is even more challenging, since often a decision needs to be taken whether to transcribe the words and forms that were actually produced or those that might have been intended. While these decisions and the decision regarding what to annotate as the smallest unit of speech (words, turns, utterances) might differ according to the purpose of the specific corpus, it is essential for the use and reuse of the corpus that they are documented in a very detailed way, and that this documentation is accessible to all future users of the corpus. One way of documenting the decisions that were made during corpus annotation is to document them directly in the corpus or to set up a Wiki page that can be continuously updated by all annotators.
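To make the idea of XML-based storage of time-aligned annotations concrete, the following sketch writes a time-aligned orthographic tier using Python's standard library. The element and attribute names form an invented, minimal schema for illustration only, not any published standard:

```python
import xml.etree.ElementTree as ET

# Invented minimal schema: one <tier> per annotation level, one
# <segment> per annotated unit, with start/end times in seconds.
tier = ET.Element("tier", type="orthography", speaker="S1")
for start, end, word in [(11.99, 12.25, "the"), (12.25, 12.61, "mouse")]:
    seg = ET.SubElement(tier, "segment", start=str(start), end=str(end))
    seg.text = word

ET.ElementTree(tier).write("recording01_ortho.xml",
                           encoding="utf-8", xml_declaration=True)
```

Because the time stamps, rather than the order of the elements, carry the alignment, further tiers (phonemic, prosodic) can be added later and related to the same stretch of signal.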
The second type of annotation that makes a corpus a phonological corpus is phonological annotation, which includes phonetic and phonemic transcription as well as transcription of prosodic features such as intonation and stress. Phonological annotations can have different formats and should be—in order to allow the reuse of the corpus—both well documented and as theory-independent as possible (Oostdijk and Boves 2008: 644). The degree of phonetic detail represented in the annotations depends on the type and purpose of the corpus (see also Delais-Roussarie and Post, this volume, and Cucchiarini and Strik, this volume). Annotation of a corpus compiled for the documentation of a language with which the corpus collectors have a limited familiarity, for example, requires a far greater level of detail than annotation of a phonological corpus for a language that has a well-described phonology. For the former case, Lehmann (2007: 22) suggests the following guidelines:

• Before the description of the phonological system of a language is completed, all variations down to the smallest allophonic level should be annotated. Some variation previously considered irrelevant might turn out to have functional relevance or be theoretically interesting.
• Transcriptions of the phonetic details should be annotated on different tiers so that all the variants are linked to the corresponding invariant on a more abstract level.

Manual annotations, and especially phonological annotations, are very time-consuming, taking up to an hour for a minute of speech. Furthermore, several studies have shown that annotator inconsistencies and errors are inevitable (e.g. Kerswill and Wright 1990; Stirling et al. 2001; Gut and Bayerl 2004). It is therefore advisable to carry out repeated measurements of annotator consistency and accuracy (a simple agreement measure is sketched at the end of this section), or to use a tool that carries them out automatically (see Oostdijk and Boves 2008: 663; Voormann and Gut 2008).

One general requirement for all annotations is that they are separated into different tiers, with each tier representing one speaker, one type of linguistic level, or one event. Thus, non-linguistic events such as laughter, noises, and background voices should be annotated on a separate tier, as should all annotations additional to the orthographic tier (e.g. repairs, false starts, disfluencies, and hesitations). However, there is as yet no common definition across the different annotation tools of what a tier is. Neither have the concept and form of an annotation yet been specified formally. In the future, a language similar to DTD or XML Schema for XML needs to be created in order to allow formal specification of phonological corpus design. At present, the fact that nearly every annotation tool uses a different data format hinders interoperability.
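The agreement measure mentioned above can be as simple as Cohen's kappa, computed here from scratch under the simplifying assumption that two annotators have labelled the same, already-aligned sequence of units; the pitch-accent labels are invented:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same units,
    assuming the two label sequences are aligned one-to-one."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# e.g. two annotators' pitch-accent labels for the same ten syllables
ann_a = ["H*", "L*", "H*", "0", "H*", "0", "L*", "H*", "0", "0"]
ann_b = ["H*", "L*", "0",  "0", "H*", "0", "L*", "H*", "H*", "0"]
print("kappa = %.2f" % cohens_kappa(ann_a, ann_b))   # -> kappa = 0.69
```

Values near 1 indicate agreement well above chance; repeated checks of this kind on a shared subsample make annotation problems visible early.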
2.4 The Corpus Creation Process

Successful corpus creation is constrained by multiple factors. Restrictions on funding and time determine the corpus size and representativeness as well as the richness
of corpus annotation and the accuracy of the annotations. Currently available corpora seem to suggest that these properties stand in a trade-off relationship, in the sense that the improvement of one of them results in the weakening of another. Theories of corpus creation address the problem of simultaneously maximizing corpus size as well as the quality and quantity of the annotations while minimizing the time and cost involved in corpus creation.

Traditionally, corpus creation has been divided into separate phases that are carried out in a sequential manner: when all data has been collected, corpus creators will devise an annotation schema or decide to use an already established one. This is followed by an annotation phase, and only after this has been completed will the corpus be searched (e.g. Wynne 2009: 15). Some modern theories of corpus creation suggest a new approach. Biber (1993) and Atkins et al. (1992), for example, propose that corpus compilation should proceed as a cyclic process. Feedback from corpus users or corpus searches of a small sample corpus should be employed to provide guidelines for further data collection.

Taking this idea several steps further, Voormann and Gut (2008) suggest an iterative theory of corpus creation, agile corpus creation. The central ideas of agile corpus creation are the reorganization of the traditional linear and separate phases of corpus design and the recognition of potential sources of errors during corpus creation. Modelled on agile software development, the theory of agile corpus creation replaces the linear-phase approach with a cyclic and iterative small-step process that turns the traditional sequence on its head. The starting point is a corpus query that drives the compilation of a prototypical mini-corpus containing the essential components: the data format, the preliminary annotation schemas, some of the annotations, and a search tool for the execution of a query. As the first step of corpus creation, a corpus query is formulated, which drives the development of a first version of the annotation schema. Once the annotation schema has been specified, a small part of the primary data is annotated accordingly. This is followed by the next corpus query, and thus the second cycle.

The theory of agile corpus creation claims that possible errors can occur at any time during the corpus creation process, and that it is therefore cheaper and quicker to incorporate opportunities for concomitant and continuous modifications in the cyclic corpus creation process. With the first corpus query, all previous aspects of the corpus creation are analysed: Is the annotation schema suitable for the analysis of the corpus query? Is the annotation consistent? Design errors, annotation errors, and conceptual inadequacies in the annotation schema thus become visible at a very early stage. Even more importantly, the early corpus queries constitute an evaluation of the annotation process. The annotations are checked for inter- and intra-annotator agreement as a measure of reliability. Subsequently, sources of inconsistency are identified and appropriate steps to improve the annotation schema are taken. Corpus queries thus monitor intra- and inter-annotator reliability, as well as structural inadequacies. Importantly, the method of agile corpus creation allows corpus creators to know when the corpus has reached a size sufficient for the specific research question.
An open source tool that supports agile corpus creation is Pacx (http://pacx.sf.net).
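The cycle can be summarized in a deliberately schematic sketch; every function and data structure below is a toy stand-in for project-specific work, not the Pacx API or any other real tool:

```python
def agile_corpus_creation(queries, raw_data, batch_size=10):
    """Toy model of the agile cycle: each corpus query drives a schema
    revision, the annotation of a small batch, and an evaluation step."""
    corpus, schema = [], set()
    pending = list(raw_data)
    for query in queries:
        schema.add(query["needs"])                 # 1. query revises the schema
        batch, pending = pending[:batch_size], pending[batch_size:]
        corpus += [{"recording": r, "tiers": sorted(schema)}
                   for r in batch]                 # 2. annotate a small sample
        hits = [c for c in corpus if query["needs"] in c["tiers"]]
        print("query %-13s -> %3d analysable recordings"
              % (query["name"], len(hits)))        # 3. the query itself checks
    return corpus                                  #    schema, annotation, size

agile_corpus_creation(
    [{"name": "t-deletion", "needs": "phonemes"},
     {"name": "nuclear tones", "needs": "intonation"}],
    raw_data=["rec%02d" % i for i in range(30)])
```

In a real project, step 3 would also include the consistency checks described above; the point of the sketch is only the ordering, in which querying precedes, rather than follows, the bulk of annotation.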
2.5 Conclusion and Outlook

Phonological corpus design that has the aim of producing a valuable tool for phonological and phonetic research requires careful planning. Not only do data collection and annotation principles need to be taken into account, but also considerations regarding software choice, licenses, documentation, and archiving. Only a phonological corpus that makes use of both the available standards in terms of data format and annotation, along with state-of-the-art sampling techniques and corpus creation tools, will prove to be accessible from a long-term perspective and will be reusable and enhanceable for future researchers. For both existing and planned phonological corpora, following internationally recognized standards for data formats, annotations, and the integration of metadata is of the highest importance. The standardization of some aspects of phonological corpora has just begun, and important decisions regarding data model, data format, and annotation will have to be taken by the international community in the near future. The growing infrastructure and an increasing number of tools that support phonological corpus creation, storage, and distribution, however, paint a positive picture for future developments.
CHAPTER 3

DATA COLLECTION

BRUCE BIRCH
3.1 Introduction

In addition to giving a summary of techniques useful in the collection of data for the purposes of assembling phonological corpora, this chapter attempts to provide some theoretical background to the task of data collection, examining key issues including control over primary data, context and contextual variation, and the observer's paradox. In section 2, I introduce the concept of a data collection continuum, suggesting that it is helpful to think of data collection techniques in terms of how much control, and the type of control, they assert over the production of speech. Section 3 deals with the problem of context: how the existence of different types of linguistic context and the interaction between contexts impact on primary data. In section 4, the closely related issues of 'observer effects' and the question of what is 'natural' data are examined in some detail. Sections 5–7 attempt to give an overview of data collection techniques, and section 8 provides some practical advice on recording equipment and recording technique.
3.2 The Data Collection Continuum

It is perhaps useful to recognize two types of corpus which provide data for phonological analysis, and which differ in terms of their origin (see Gut and Voormann, this volume). One type results from defining a research question, and subsequently designing speech experiments or elicitation tools and stimuli explicitly intended to capture the speech data required for the exploration of the question. The resulting corpus, especially in the case of speech experiments, will tend to be relatively small, and intentionally limited in terms of the contexts in which target segments/words/phrases occur. The intentional limitation of the primary data in such approaches is what is referred to by the notion of 'control'.
A second type is generated in order to make available a 'representative' sample of the language or languages being investigated. It does not result from the need to answer a specific research question, but rather is intended to be used by researchers investigating a broad range of questions. This type of corpus, as a result of its intention to be 'representative', will necessarily be large and richly complex in terms of the contexts in which yet-to-be-identified target segments/syllables/words, etc. occur. It is for the most part intentionally 'uncontrolled', the speech it contains exemplifying a wide spectrum of observable communicative events.

Controlled context            Uncontrolled context
Scripted                      Unscripted
Single research question      Multiple research questions
Small sample                  Large sample
Single communicative event    Broad range of communicative events

FIGURE 3.1 The Data Collection Continuum.
Variation in the amount of control exercised over primary data may be expressed in terms of a data collection continuum (see Figure 3.1). At one pole of the continuum are scripted speech experiments and reading tasks which attempt, more or less successfully, to maximize control over contextual factors identified as relevant in order to eliminate 'noise'. At the other pole are, for example, language documentation projects which aim to capture the richness and diversity of a particular language, exerting relatively little intentional control over contextual factors within any given text, being more concerned with capturing a diverse range of unscripted communicative events. In between these two poles sit data elicitation techniques which contain elements of intentional control and limiting, and which at the same time set out to capture a representative sample of unscripted language use. A well-known and often-employed example of such a technique is the HCRC Map Task (see section 6.3), which typically aims to capture multiple instantiations of target phenomena (e.g. phrases providing environments known to trigger t-deletion in English, or words composed entirely of sonorants and vowels, which provide reliable unbroken F0 traces) by limiting the names of locations on a map in terms of their segmental composition, at the same time as aiming to elicit dialogue motivated by the communicative needs of participants negotiating a shared task, rather than by the prompting of an experimenter or script.

At the controlled end of the continuum, no experiment is perfect, and noise will tend to seep into the data despite the best efforts of those involved in the design and collection processes.1 At the uncontrolled end, a multitude of unintended contextual factors
1 For a description of this process with regard to efforts to examine neutralization of voicing in final obstruents in German, see Kohler (2007).
are always present, such that (a) the data collected will inevitably be limited in ways of which the collector is unaware (potential target phenomena will be absent or under-represented), and (b) contexts crucial for the investigation of particular research questions will be absent (potential target phenomena may be present and well represented, but are absent in desired contexts). Regardless of where the corpus sits on the data collection continuum, therefore, it will exhibit gaps. These gaps are typically exposed as a result of attempts to use the corpus to answer a research question. As a result of the identification of such a gap, elicitation techniques may be employed in order to expand the corpus in a specific direction in order to fill the identified gap.

At the uncontrolled end of the spectrum, where a corpus intended for multiple analyses consists mainly of recordings of communicative events occurring in the everyday lives of speakers, gaps may be filled via the addition of subcorpora which sit further toward the controlled end of the continuum. For example, in order to make a corpus of value for the investigation of the phonological correlates of contrastive focus, it may be necessary to employ or develop some nonverbal elicitation tools designed specifically for the purpose, perhaps along the lines of the QUIS2 stimuli set produced at the University of Potsdam. Data gathered through the use of these stimuli will form a subcorpus of your overall corpus, as well as a subcorpus of the total QUIS corpus generated as a result of the use of the same stimuli across a variety of the world's languages. At the controlled end of the continuum, to fill identified gaps it may be necessary to modify the restrictions initially placed on the data by adding missing contexts, for instance through the addition of task-oriented elicited data to a corpus which consists only of read speech.

The identification of gaps in your corpus by trying it out (if this is feasible) is probably the most reliable way to gain information on how to modify your data collection techniques to accommodate the research questions for which the corpus is intended (see the discussion of 'agile corpus creation' in Gut and Voormann, this volume). Both large and small corpora often include subcorpora from different places on the data collection continuum in order to achieve this aim. A corpus may contain, for example, a mixture of read speech or elicited words/paradigms, etc.; data obtained through the use of stimuli such as the Map Task; and interviews with subjects about major events in their lives.
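As a concrete illustration of the kind of segmental control exercised in Map Task design, the sketch below filters a pronunciation lexicon for landmark names composed entirely of sonorants and vowels, the property that yields unbroken F0 traces. The mini-lexicon and its SAMPA-style transcriptions are invented for the example:

```python
# SAMPA-style symbols for English sonorant consonants and vowels.
SONORANTS_AND_VOWELS = set("m n N l r w j".split()) | set(
    "i: I e { A: Q O: U u: V @ 3: eI aI OI @U aU I@ e@ U@".split())

# Invented mini-lexicon: orthographic form -> phonemic transcription.
lexicon = {
    "mill":   ["m", "I", "l"],
    "lagoon": ["l", "@", "g", "u:", "n"],   # /g/ disqualifies it
    "meadow": ["m", "e", "d", "@U"],        # /d/ disqualifies it
    "marina": ["m", "@", "r", "i:", "n", "@"],
}

landmarks = [word for word, phones in lexicon.items()
             if all(p in SONORANTS_AND_VOWELS for p in phones)]
print(landmarks)   # -> ['mill', 'marina']
```

Run over a full pronunciation lexicon, a filter of this kind generates a pool of candidate landmark names from which map designers can choose.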
3.3 Context and Contextual Variation

Before beginning on the task of data collection for phonological corpora, an understanding of the issues surrounding context and contextual variation is an absolute requirement.
2 Skopeteas et al. (2006).
As Pierrehumbert (2000a) has pointed out in support of the approach taken by Bruce (1973) in his study of Swedish intonation:

Much early work on prosody and intonation (such as Fry 1958) takes citation forms of words as basic. Insofar as the intonation of continuous speech was treated at all, it was in terms of concatenation and reduction of word patterns which could have been found in isolation. Bruce, in contrast, adopted the working hypothesis that the 'basic' word patterns were abstract patterns whose character would be revealed by examining the full range of variation found when words are produced in different contexts. . . . The citation form is then reconstructed as the form produced in a specific prosodic context—when the word is both phrase-final and bears the main stress of the phrase. The importance of this point cannot be overemphasized. In effect there is no such thing as an intonation pattern without a prosodic context. The nuclear position and the phrase-final position are both particular contexts, and as such leave their traces in the intonation pattern. (Pierrehumbert 2000a: 17)
If we accept the claim that analysis of intonational phenomena and, by extension, of phonological and phonetic phenomena in general will be incomplete in the absence of examination of the 'full range of variation found when words are produced in different contexts', then it may seem to follow that the aim of data collection for the purposes of corpus-based phonological analysis must correspondingly be to capture language in as broad a range of contexts as possible. So, for example, capturing occurrences of the phoneme /r/ in some non-rhotic varieties of English in both prevocalic and preconsonantal position would be essential in order to provide, on the one hand, tokens where /r/ is realized (e.g. in the phrase butcher in town) and, on the other, where it is deleted (as in the phrase the butcher shop). Further, it would be necessary to include examples which are word-final vs. word-initial, utterance-initial vs. utterance-final, and so on, in order to determine the impact of prosodic constituent boundaries on realization.

However, an exhaustive study will require that contextual factors other than those of a purely phonological nature be taken into account. Grammatical, syntactic, and pragmatic contextual factors, as well as discourse genre and social context, have all been shown to impact on, or interact with, phonological structure. Moreover, we can also expect (or at least cannot rule out) that interactions between these different types of context exist—that, for example, social context may influence syntactic structure, which in turn will impact phonological structure, and so on. For example, it has been shown that speech elicited from speakers in the context of a laboratory reading task contrasts systematically with, on the one hand, reduced speech occurring in unobserved conversations (e.g. Summers et al. 1988; Byrd 1994) and, on the other, hyperarticulated speech (e.g. Uchanski et al. 1996).

It soon becomes apparent, therefore, that aiming to obtain a 'complete' set of contexts for a single phenomenon, let alone for a large range of phenomena, is to set oneself an infinite task, as there is 'no principled upper limit' (Himmelmann 2006) to the number
and type of discoverable contexts. Or, as Sinclair puts it, language is a 'population without limits, and a corpus is necessarily finite at any one point' (Sinclair 2008: 30).

Given the complexity of language as an object of study, it has been necessary for linguistics as a discipline to break the study of language up into different subdisciplines, and in turn to create further divisions within these subdisciplines. Thus, within the subdiscipline of phonology, it is possible, for example, to specialize in the behaviour of certain segments, or in the study of intonation systems, and so on. Correspondingly, the corpus required by a specialist in intonation will vary considerably from that, say, required by someone studying the behaviour of post-alveolar segments. To extend the example given above, examination of the way in which the intonation system of a language signals narrow contrastive focus, i.e. whether the language has a tendency (like English) to de-accent repeated material or not (like Spanish), will require a corpus which includes sentence pairs containing such repeated material (e.g. pairs such as I don't understand that. You DO understand that). Such context will be of little interest to the analyst focusing on the phonological or phonetic behaviour of post-alveolar segments, who may want multiple examples of the same word in contrasting carrier phrases or frame sentences containing a specific range of phonological contexts.

It is the task of phonological analysis, like any other kind of linguistic analysis, to confront and deal with the complexity resulting from the presence of, and interactions between, contextual factors in order to proceed at all. The myriad of contextual factors affecting the realization of constituents such as syllables and words must be identified (as far as this is possible) as a first step. From that point, there are two directions in which to move. Towards the 'controlled' end of the data collection continuum, data collectors aim to artificially reduce the complexity of language by eliciting linguistic data for which certain contextual factors are controlled. This approach is exemplified by the experimental paradigm, in which subjects produce speech in a context which has been specifically designed to control for, or eliminate, certain contextual factors which would otherwise introduce 'noise' to the data collected. In such experiments speakers may be repeating, for instance, 'nonce' or 'meaningless' words; reading; or responding to grammatical but semi-nonsensical or totally nonsensical verbal stimuli. The same approach is also exemplified by a linguist eliciting word lists in the early stages of compiling a language documentation or grammar of an unknown language. In this situation, the linguist compels the subject to suspend the usual contextualization of words in discourse, eliciting from them instead sequences of single words or phrases uttered in isolation.

A second direction to move in, rather than attempting to artificially reduce the complexity of linguistic data, is to develop approaches and tools which allow for the analysis of relatively 'uncontrolled' data. In other words, rather than focusing on modifying or controlling the data at the production end, such approaches attempt to develop sophisticated analytical techniques which can filter complex, less controlled data in a variety of ways in order to make it available to a wide spectrum of analyses.
Sophisticated annotation techniques such as those developed within the EAGLES framework (http://www.ilc.cnr.it/EAGLES96/home.html) are one such tool, in which the mark-up of data allows
for the inclusion and exclusion of an array of contextual factors and combinations of factors. Clearly, your data collection activities need to reflect which of these two directions, or which elements of each, will be followed in order to analyse the data. Crucially, the complexity of language as an object of research guarantees that even the largest ‘multi-purpose’ corpus may well not contain enough tokens of a given phonological phenomenon to provide falsifiable results for a specific research question.3 Simply by amassing a large amount of data, the collector cannot hope to have as a result anything approaching a corpus which furnishes evidence for the infinite number of possible phonological research questions which it may be required to answer.

If the intention of the data collector is to build a large multi-purpose corpus, the corpus should be viewed as an evolving process, gaining richness and complexity over time, rather than as a static object. The evolution of this richness and complexity will be facilitated greatly through attempts to use the corpus for the purposes of analysis. Users will inevitably identify gaps where required data is not present. These gaps can then be filled via more data collection. In this way, use of the corpus is crucial, feeding back into the structure of the corpus, gradually increasing its value for a wide range of phonological research questions.

3 Biber (1993) provides some statistically supported estimates of required corpus size for various target phenomena.
3.4 The Observer’s Paradox

The use of the words ‘natural’ or ‘naturalistic’, or the phrase ‘naturally occurring’, in regard to speech is problematic. All speech is in some sense ‘natural’. Wolfson suggests that the notion of ‘natural’ speech is ‘properly equivalent to that of appropriate speech; as not equivalent to unselfconscious speech’ (Wolfson 1976: 189), and that speech collected from native speakers in interviews or laboratory tasks is speech appropriate, and therefore natural, to those contexts. If subjects are selfconscious in such contexts, their speech will reflect this in an appropriate way, for example by being more careful in character, with consequences for phonological analysis. Wolfson’s point is that careful speech produced under such conditions is in no way unnatural, suggesting further that the only reliable method of obtaining unselfconscious data is through unobtrusive observation in a diverse range of contexts. Recording speech unobtrusively while remaining within the bounds of ethically sound practice, however, presents its challenges. The difficulties involved in capturing unselfconscious speech led William Labov to coin the term ‘observer’s paradox’, characterizing it in the following way: ‘the aim of linguistic research in the community must be to find out how people talk when they are not being systematically observed; yet we can only obtain this data by systematic observation’ (Labov 1970: 32).
This has important consequences for the goal of collecting unselfconscious speech data, as large observer effects will skew the data, undermining the aim of capturing typical language use. Extreme observer effects are present, for example, in the process of direct transcription, where speakers are intentionally encouraged to focus on their own speech production in the process of orally disambiguating the segmental composition of words and phrases to allow for accurate written representations. A transcriber’s inability to reproduce a word correctly may elicit from the speaker slow, hyperarticulated tokens of the word, long words being perhaps broken into separate intonation units syllable by syllable. Data collected under these conditions, while quite ‘natural’ given Wolfson’s clarification of the term, is highly atypical in the context of the speaker’s everyday use of language, produced under specific and perhaps infrequently occurring conditions. Subsequent claims made on the basis of phonological analysis of such data may therefore be limited to the specific context in which the data were collected rather than applying to the language as a whole.

For this reason it has been crucial to develop techniques in phonological data collection which in some way distract the subject or subjects from the goal of the exercise, reducing their awareness of the fact that every sound they make may eventually be scrutinized in minute detail. Labov’s solution to this dilemma involved, among other things, systematic limitation of interview topics. As summarized by Jack K. Chambers, ‘of the topics used by Labov [who was eliciting narrative monologues], the most successful in making the subjects forget the unnaturalness of the situation were the recollection of street games and of life-threatening situations. Most reliable in eliciting truly casual speech were fortuitous interruptions by family members and friends while the tape recorder was turned on’ (Chambers 1995: 20). Labov found that when speakers narrated stories with which they were emotionally engaged, they would tend to lose the selfconsciousness brought about by the interview setting. (Section 3.6 will deal in more detail with techniques aimed at obtaining unselfconscious data.)

On the other hand, although the obtaining of unselfconscious speech data is an important requirement for corpus-based phonology in general, this is not to say that obtaining selfconscious data is unimportant. Himmelmann (2006: 9), writing in the context of language documentation, identifies ‘observable linguistic behaviour’ as a major target of data collection in the language documentation context, suggesting that it encompasses ‘all kinds of communicative activities in a speech community, from everyday small talk to elaborate rituals’. Thought about in this way, the communicative events selected will depend on the goals of the data collector and the intended use of the resulting corpus. Under the heading of communicative events, however, we must include not only the usual range of events observable in a given speech community, but also the less usual, such as the transcription session exemplified above. To return to that example, the fact that a speaker is able to break single words into separate intonation units under certain conditions offers evidence regarding the relative independence of the intonation system of the language under investigation, suggesting that words in that language do not have a single corresponding prosodic structure, but rather have a number of variants.
Such evidence would be difficult, if not impossible, to find in more frequently occurring communicative events.
In this way, the most controlled forms of speech elicitation nevertheless result in the collection of ‘natural’ data (though likely to be selfconscious) of certain value for phonological analysis. This suggests an amendment to Labov’s statement quoted above, adapted to the specific aims of data collection for phonological analysis, where the aim must be to collect the speech of individuals ‘when they are not being systematically observed’, and also when they are being systematically observed. All data is ‘naturally occurring’, and the availability of data collected under highly controlled conditions on the one hand and relatively uncontrolled conditions on the other is of significant value to much phonological analysis.

In the following sections the data collection continuum provides an organizing principle. Section 3.5 deals with data collection techniques which are at the highly controlled end of the continuum, where every word spoken is dictated by a script provided by the collector. This section includes reading tasks and elicitation of word lists and paradigms. Section 3.6 focuses mainly on the use of nonverbal stimuli designed to give the collector a certain amount of control over the resulting spoken text, but at the same time eliciting unscripted responses from subjects. Section 3.7 then looks at techniques where the collector exerts virtually no control over the resulting text other than by attempting to ensure that it is representative of a certain type of communicative event.
3.5 Highly Controlled Techniques

At this point, it is important to make a distinction between two contrasting situations in which linguists find themselves. On the one hand, linguists may be recording data in a linguistic and cultural context with which they are familiar, and on the other hand they may be working in an alien cultural context, recording a language with which they have far less familiarity. In the first case, the data collector will typically have relatively little trouble implementing elicitation tasks, whereas in the second it may be (1) difficult to establish common ground, and (2) difficult to inspire enough confidence in speakers that you understand the information they are providing. It is ideal in this context to find a native speaker collaborator who can organize participants, conduct interviews, and make recordings. This allows you to take a back seat, and inspires confidence in subjects that the information they are providing is being understood and processed in a natural way, allowing them to produce the unselfconscious narrative and dialogue you are aiming for.
3.5.1 Reading Tasks

Read speech is an effective way of restricting primary data to the precise needs of a research question. Many corpora include read speech as a counterpoint to unscripted material. For many research questions it is useful to be able to contrast the behaviour
of target phenomena contextualized in a discourse with ‘citation forms’. An example of a corpus which takes this approach is the IViE Corpus,4 produced by a team at the University of Oxford, which was created for the purposes of analysing intonation across nine varieties of English occurring within the British Isles. The read stimuli consisted of short sentences intended to elicit declarative intonation, question intonation, etc., as well as a short passage based on the Cinderella fairy tale. To this were added Map Task data and recordings of ‘free conversation’. The obvious advantage of reading tasks is that they guarantee the inclusion of target phenomena previously identified by the researcher. It must be borne in mind, however, that reading is unlikely to produce unselfconscious speech, although the use of ‘distractor tasks’, such as assembling a simple construction while speaking, can alleviate this to some extent. An obvious limitation is that subjects must be literate, rendering reading tasks useless in work with non-literate individuals and cultures.

4 http://www.phon.ox.ac.uk/files/apps/IViE/
3.5.2 Elicitation of Word Lists

In situations where the linguist is not a native speaker of the target language, and is perhaps dealing with an oral culture, the elicitation of word lists, sentences, and paradigms is the rough equivalent of a reading task in terms of the degree of control exerted over the data. Target words and phrases are elicited via a simple translation exercise, with the subject being instructed beforehand as to the requirements of the task. For example, you may require three repetitions of a word or phrase, or you may provide a carrier phrase designed to meet the needs of a specific research question. As with read speech, this kind of task guarantees the recording of target phenomena, and will provide an example of speech from a specific type of communicative event which will be useful for comparison with unscripted speech.
3.6 The Use of Nonverbal Stimuli

In response to the problem of minimizing observer effects, linguists, including phonologists and phoneticians, have put effort over the years into the development of nonverbal stimuli, such as images, video, slideshows, and animations. Although the design of the majority of these stimuli is not necessarily motivated by the potential use of resultant corpora for phonological analysis, such tools nevertheless provide a reliable method of obtaining relatively unselfconscious data, either narrative or dialogue. They also provide models on the basis of which new stimuli can be designed to meet the needs of specific analysts. In this section, I present a few of the better-known examples of this genre, without claiming to provide an exhaustive review.
3.6.1 Film: The Pear Story

An early example of the use of nonverbal stimuli to elicit speech is The Pear Story, a film of around six minutes in length designed by Wallace Chafe (Chafe 1980) and a research team based at the University of California in the 1970s to test how a single filmed narrative would be reproduced verbally across different cultures and languages. By showing a man harvesting pears, which are stolen by a boy on a bike, the film was intended to reference ‘universal’ experiences (harvesting, theft, etc.), thereby making it suitable for use in a wide range of cultural contexts. The film contains no dialogue or voiceover. Several scenes were included to elicit particular responses. For example, a scene showing a boy falling off a bike and spilling pears was intended to elicit how languages encode cause and effect. After viewing the film, subjects were interviewed individually by a fellow native speaker of similar social status, who asked them, ‘You have just seen a film. But I have not seen it. Can you tell me what happens in the film?’ The resulting recorded narratives were around two minutes long.

One basic principle behind the design of The Pear Story is common to most nonverbal stimuli used for linguistic elicitation. Stimuli should be adapted to the culture of the speakers of the target language or languages. If the material is to be used in cross-linguistic, typological research, it needs to be intelligible to speakers of all of the languages involved. A second principle is the inclusion of material designed to stimulate responses useful for the exploration of a specific research question. While not specifically targeted at the collection of phonological data, The Pear Story provides a template on the basis of which such specifically targeted stimuli can be produced. By taking some of the design features as a starting point, a film-based stimulus can be created with phonological analysis in mind. In films where the target audience speaks a single language, characters, locations, and props can be chosen according to the phonological composition of their names; events or sequences of events can be selected in order to elicit specific information structure, with resultant predictable consequences for intonation patterns; and so on.
3.6.2 Animated Sequence: Fish Film

A highly successful example of a nonverbal stimulus designed to answer a specific research question, this time in the area of syntax, is the Fish Film (Tomlin 1997). This is a computer animation in which subjects are required to describe an unfolding drama enacted by two fish, one dark and one light-coloured, in real time. In each trial one fish is cued visually by an arrow. At a certain point, one fish eats the other fish. In cases where the cued fish was eaten, subjects tended to use the passive voice, responding with, ‘The dark fish was eaten by the light fish’. In cases where the cued fish ate the other fish, subjects used the active voice: ‘The light fish ate the dark fish’. The subjects’ attention to the cue
influenced the choice of voice and correspondingly the syntactic subject. Tomlin’s results were robust, with 90 per cent of subjects choosing the cued fish as the syntactic subject.

Animated sequences and slideshows provide a useful way of collecting unscripted but focused data, particularly when available resources do not permit the making of a film. The elicitation of a real-time description or commentary featured in this exercise is a technique which can be applied in the contexts of other tasks (viewing images, films, television series, etc.).
3.6.3 Shared Task: Map Task

Another highly successful stimulus for the production of naturalistic corpora intended for phonological analysis is the HCRC Map Task (http://groups.inf.ed.ac.uk/maptask/), developed by the Human Communication Research Centre at the University of Edinburgh. The Map Task was devised specifically in response to the issues discussed in the first part of this chapter: on the one hand, as a result of the dominance of the experimental paradigm ‘much of our knowledge of language is based on scripted materials, despite most language use taking the form of unscripted dialogue with specific communicative goals’; and on the other, samples of naturally occurring speech present ‘the problem of context: critical aspects of both linguistic and extralinguistic context may be either unknown or uncontrolled’. Moreover, ‘huge corpora may fail to provide sufficient instances to support any strong claims about the phenomenon under study’. The intention of the Map Task developers, therefore, ‘was to elicit unscripted dialogues in such a way as to boost the likelihood of occurrence of certain linguistic phenomena, and to control some of the effects of context’. The Map Task website describes the resulting stimulus task in the following way:

The Map Task is a cooperative task involving two participants. The two speakers sit opposite one another and each has a map which the other cannot see. One speaker—designated the Instruction Giver—has a route marked on her map; the other speaker—the Instruction Follower—has no route. The speakers are told that their goal is to reproduce the Instruction Giver’s route on the Instruction Follower’s map. The maps are not identical and the speakers are told this explicitly at the beginning of their first session. It is, however, up to them to discover how the two maps differ.
The maps consist of named, graphically depicted landmarks. Discrepancies between the landmarks on the two maps are of three types: in one type, a given landmark is present on only one of the maps; a second type involves two identically drawn and positioned landmarks having different names; and in the third type, a landmark appears twice on the Instruction Giver’s map, and only once on the Follower’s map. The differences between the two maps make the negotiation process more complex, thus tending to produce longer, more animated dialogues than would be the case if the maps were identical. The designers of the maps had control over the names of landmarks, which were guaranteed to occur frequently throughout the corpus, and could therefore tailor the outcome to some extent to meet their research needs. For example, a name like vast
meadows would provide evidence about t-deletion, or chestnut tree would be ideal for the measurement of glottalization, and so on.

Unlike Chafe’s film, The Pear Story, the Map Task was not originally designed to be used in a wide variety of cultural contexts. For instance, in its original form it could not be used with non-literate societies. However, the task does lend itself to adaptation for such contexts. An adaptation of the Map Task was used recently with Iwaidja speakers in Northwestern Arnhem Land, Australia (Birch and Edmonds-Wathen 2011). Although the basic principle of the task was retained, a map was created which did not rely on literate subjects. Landmarks were based on commonly occurring, easily recognizable objects such as creeks, pandanus trees, magpie geese, oysters, and so on. Participants were shown pictures of the landmarks beforehand and asked to call out their names in order to record ‘citation forms’. Instead of being placed face to face across a barrier, participants were placed side by side with a barrier between them, a more culturally appropriate seating configuration. Because one of the intended uses of the resulting data was acoustic analysis of vowel formants in relation to metrical structure, accent, and focus, landmarks with names containing low central vowels were chosen, since these vowels provide the most robust F0, formant, and intensity traces.
3.6.4 Games and Other Tasks: MPI Nijmegen Language and Cognition Group

The Language and Cognition (L & C) Group based at the Max Planck Institute for Psycholinguistics at Nijmegen in the Netherlands under the direction of Stephen Levinson has nearly twenty years’ experience in the production of elicitation tools aimed at exploring the cognitive infrastructure of human language. Over the last two decades they have produced an array of nonverbal elicitation tools aimed at exploring cross-linguistic semantic domains such as sound, time and space, emotion, and colour. They have also addressed areas such as the acquisition of social cognition by infants, and ‘event packaging’ across languages. The materials include films, card games, construction tasks, image sets, and much more, and are publicly available on agreement to conditions of use. As with the other stimuli discussed in this section, the L & C tools can be used in their current form to stimulate narrative and dialogue for phonological analysis, or may be used as inspiration for the development of new tools designed with phonological research questions in mind.
3.7 Collecting ‘Uncontrolled’ Data

At the uncontrolled end of the data collection continuum are simply recordings of people talking—about anything. Although such recordings have the disadvantage, discussed above, that they may well not contain enough instances of desired target
phenomena to be useful, they do have the advantage that, if data are captured in such a way as to eliminate or at least minimize observer effects, users of the resulting corpus may be confident that they are accessing unselfconscious speech in one of a range of communicative events engaged in by speakers in the course of their everyday language use. In both familiar and unfamiliar cultural contexts, where the intention is to record in some sense a representative sample of a language, it may be necessary to do some preparatory research in order to establish a typology, or partial typology, of communicative events, speech registers, etc. Hymes’ S-P-E-A-K-I-N-G heuristic (setting/scene, participants, ends, act sequence, key, instrumentalities, norms, genre) provides a good starting point for this (Hymes 1974).
3.7.1 ‘Ready-Made’ Corpora

At this end of the data collection spectrum, the collection of linguistic data for the purposes of phonological analysis is essentially indistinguishable from the collection of linguistic data for grammatical, semantic, pragmatic, or syntactic analysis, or even for nonlinguistic analysis. This falls out from the fact that the ‘representative sample’ the collector is aiming to obtain is not modified in any way by consideration of the research questions for which it is intended to provide raw data. For this reason, data which is collected with no future linguistic analysis in mind—for example, the archived recordings of a radio or television network—may do just as well as recordings made by a linguist for a language documentation project. They may even be better in that the people doing the recording are highly experienced in the field and will therefore typically make high-quality recordings. This is undoubtedly the easiest form of data collection! A fine example of phonological research using a ready-made corpus is the study by Shattuck-Hufnagel et al. (1994), who made use of a Massachusetts radio speech corpus for the exploration of variation in pitch-accent placement within words containing more than one metrical foot.
3.7.2 The Use of Video You Have Recorded

Recording an event of importance or relevance to the speakers you are targeting (a ceremony or other social occasion with a complex structure is a good choice, but there are many options) is an excellent way of obtaining flowing conversation and narrative. Having shown the film to your speakers, you can follow the Pear Story procedure by having a native speaker who has not seen the film ask people to narrate the story. Alternatively, you can prepare interview questions regarding the content of the film, then conduct interviews (either in person or preferably via a native speaker collaborator) with one or more subjects who have knowledge about the content. A further technique is to record a
real-time commentary from different speakers, asking subjects to describe the action of the film as it unfolds.
3.7.3 The Use of Archival Visual Material

Asking people to view photographs or films of relevance to them, such as family photographs or film of places and events familiar to them which they have perhaps not seen for some time, is a reliable way of eliciting relatively unselfconscious speech, as subjects lose themselves in reliving places and events from their past. People will enjoy sharing with each other experiences they have in common, and will frequently have different interpretations of events which may result in animated discussions.
3.7.4 Interviews Without Nonverbal Stimuli

Following some preparatory research along the lines suggested in the introduction to this section, you may choose some topics for interviews or discussions to be conducted without the use of nonverbal stimuli. The topic clearly needs to be complex enough to stimulate discussion or narrative, and needs to be something subjects (a) like talking about and (b) are comfortable talking about in public.

One of the areas of greatest complexity in the languages of Northwestern Arnhem Land in Northern Australia, for example—and also an area which I personally found people liked talking about—was kinship and kinship terminology. As I was working for a language documentation project at the time, I was interested in actually understanding the system and in collecting terms for entry into a dictionary. I found that getting two speakers talking together based on questions such as ‘How is A related to B?’, ‘Why does A call B [insert kin term]?’, or ‘Why is A promised to B as a marriage partner?’ was a reliable way to elicit long and convoluted conversations and explanations (which then took weeks to transcribe and translate).
3.7.5 Partnering Other Data Collection Activities

In the context where I have worked most (indigenous languages of Australia) I have found that collaborating with researchers in other fields has been an effective way to gather high-quality unscripted data. Most recently, for example, I have been working with researchers interested in documenting indigenous ecological knowledge both on land and at sea. Bringing on board an indigenous collaborator to act as translator and interpreter in interview situations, we have collected a large amount of narrative and dialogue. The fact that the researchers involved in the process are specialists in their field (marine biologists, ecologists, ethnobotanists, etc.) inspires confidence in speakers
that they are being understood when discussing in depth specialized areas of knowledge which the average linguist cannot be expected to know about.
3.7.6 Ethical Considerations

Recall Labov’s finding (referred to above) that truly casual speech data were collected as a result of fortuitous interruptions by family members while the tape recorder was switched on. In this and similar situations, ethical considerations require that care be taken to protect the interests of the people who have generously given their time to assist you. It may be the intention of the collector to reduce or even eliminate speakers’ awareness of the fact that they are being recorded, yet every word they say will potentially be analysed in great detail; if in the process they say things which they do not really intend others to hear, it is only fair to make them aware of this possibility. For example, although recording subjects gossiping about other members of their own community ‘behind their backs’ may not be an obvious issue if the resultant text is used purely by researchers in an academic context, it certainly would be problematic if the recording or transcription reached the ears of the people who were the object of the gossip. In cases where potentially contentious material has been recorded, a good strategy is to replay the recording to the speaker or speakers involved, seeking their approval or otherwise to use the data.
3.8 Equipment and Recording Technique

3.8.1 Before You Start . . .

. . . you need to decide a few things. First, do you require video images as well as audio, or will a good audio recording satisfy the needs of the project? Assuming that the main focus of your project is to capture spoken language, a good audio record is clearly your first priority. However, an accompanying video record has several advantages. Video may help you disambiguate, for example, what speakers are talking about when they point to something, or refer to someone, say a child, who is in the room but is silent and therefore unmonitored by your audio device. Video will also display gestures which may be of interest for the analysis of prosody—for example, where the nod of a head may coincide with an accented syllable in a word. Sometimes simply being able to look at a speaker’s mouth may help disambiguate speech sounds, especially in cases where transcribers have shaky knowledge of the dialect or language they are transcribing.

If you decide to use video, however, be aware that you will perhaps need more personnel for your data collection activities than would be the case for an audio-only documentation. Managing a camera, an external microphone, and, in the case of an interview situation, the direction and development of the content will typically be too demanding for a single person. That said, in situations where people are stationary, and
are performing a task not requiring the intervention of the data collector, cameras and mikes can be set up such that minimal adjustment is required during a session, allowing the operator to simply monitor video and sound to ensure that both are functioning as desired. Be aware that if you choose to include video in your documentation, you will not be relying on the inbuilt microphone of the camera. Inbuilt mikes are intended for amateur or home use only, in situations where the picture, for example, recording the antics of the family pet, is more important than the sound—not your situation. Moreover, inbuilt mikes will pick up the sound of your hands moving about the camera body, and are also susceptible to even low wind conditions when recording outdoors.
3.8.2 What Format Are You Archiving and Annotating?

You will need to think about, or get advice about, the required archive formats for the recordings you are making. Various video and audio formats are available, differing in quality and in terms of the amount of disk space they require. The generally accepted archive format for audio is uncompressed WAV, also referred to as ‘linear PCM’ (Pulse Code Modulation). Using a compressed format makes no sense in language documentation or data collection, as it results in loss of data and is designed for quick download and space-saving on disks. Video is a different matter. Video files take up far more disk space than audio files, and if they are secondary to your goal, you may decide to use a compressed format such as MPEG2 or MPEG4, which will reduce uncompressed video to a fraction of its original size while presenting it at a standard equivalent to the average commercially available DVD. On the other hand, the increasing affordability of large quantities of disk space may make it feasible for you to archive full-quality video.

A further consideration will be ensuring the formats you archive are compatible with annotation and analysis tools such as ELAN5 or PRAAT.6 Having determined the amount of disk space available to you, and having researched which formats will be friendly to the programs you have selected, you will need to choose your equipment accordingly. Not every video camera or audio device will output the formats you need. Ensuring that your equipment is compatible with your required archive and annotation formats will save time and bother down the track when you discover, for instance, that all your files need to be converted before they can be used.

5 http://www.lat-mpi.eu/tools/elan/
6 http://www.fon.hum.uva.nl/praat/
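Since the difference between a compliant and a non-compliant file is invisible in a file browser, it can pay to check recordings programmatically before archiving them. The following is a minimal sketch, in Python (standard library only; the filename and the 16-bit/48 kHz target are illustrative assumptions, not a prescribed standard), of such a check:

import wave

# Hypothetical filename; wave.open only accepts uncompressed (linear PCM)
# WAV files, so a wave.Error here is itself a warning sign.
with wave.open("session01_speaker03.wav", "rb") as f:
    rate = f.getframerate()        # samples per second, e.g. 48000
    bits = f.getsampwidth() * 8    # bit depth, e.g. 16
    channels = f.getnchannels()    # 1 = mono, 2 = stereo

# Flag anything below the assumed archive standard of 16-bit, 48 kHz PCM.
if rate < 48000 or bits < 16:
    print(f"Warning: {rate} Hz, {bits}-bit falls short of the archive standard")
else:
    print(f"OK: {rate} Hz, {bits}-bit, {channels} channel(s)")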
3.8.3 Where Are You Recording?

Having obtained high-quality equipment and set up your mikes appropriately, etc., you may be surprised to find that you are not achieving the sound quality you need. This
may be because the acoustics of the space you are using for the recording are impacting adversely on the final product. Hard interior surfaces, or simply the shape or dimensions of a built environment, reflect and distort sound waves in ways that the human brain typically factors out, meaning that what sounds fine to you when having a conversation in a room may sound harsh and unpleasant when you monitor it through headphones or listen back to your recording. This means you will need to source a good room, and a good position within that room, by doing a trial recording before you begin using it as a data collection location proper. Always use headphones to monitor the sound: you cannot trust your ears in these situations. Recording outdoors typically avoids problems associated with room recording, although in the case of (for example) a naturally occurring rock configuration in the vicinity, you may have similar issues. Your worst enemy outdoors, however, is wind. Even a slight wind will cause distortion on unprotected mikes, and a simple foam pop-cover will not help much either. Writing as someone who mostly records outdoors, I find a professional quality windshield (such that you will have seen in use by media crews, film crews, etc) indispensable. I’ve used them in strong winds with excellent results. Whether indoors or outdoors, unless you are in a soundproof booth, your recordings can be compromised by extraneous sound such as traffic noise, the activities of people or animals in the vicinity, and household equipment such as fans, refrigerators, and air conditioners. It will be necessary to eliminate such noise as far as possible before pressing the record button. Switching off ceiling fans or air-conditioning units in hot climates may be counterintuitive (and uncomfortable), but it will make a huge difference to the quality of your recording.
3.8.4 Audio Recorders

There is now a fairly wide range of digital audio recording devices on the market, referred to as ‘field recorders’ or, more technically, ‘linear PCM recorders’. These devices have the capacity to record sound in uncompressed WAV format, the format which will produce the best results for you. Avoid recording in compressed formats such as the popular mp3 format. Compressed formats result in loss of data. The resultant files are much smaller than their uncompressed counterparts, but the quality will never compare favourably. Typically, these recorders have built-in stereo mikes. You may use these if necessary (they are typically of better quality than built-in mikes on video camcorders), but using a good-quality external microphone will produce better results. As was mentioned above in relation to video cameras, built-in mikes on field recorders will pick up the movement of your hands on the device. In addition, the use of an external mike (or mikes) allows the operator freedom of placement, or movement, during the recording session, while continuing to monitor the sound. As most good-quality microphones have an XLR connection, it is advisable to purchase an audio device which correspondingly has XLR sockets. Most of the cheaper field recorders have mini-jack sockets, rather than XLR sockets, and while there is no problem obtaining a cable to connect the mike to the
recorder, the mini-jacks themselves, and the sockets in the recorders, are often not strong enough for your needs, and may become noisy and unreliable after relatively little use. Your field recorder should have an SD (Secure Digital) Card slot, allowing you to record hours of audio before copying files from the card to a hard drive and erasing the card ready for the next recording session.
3.8.5 Video Camcorders

Video camcorders today are capable of directly outputting the files you need for archiving and annotation purposes, thus delivering you from the process of file conversion down the track. Whereas DV tapes required capturing and conversion, today’s cameras record ready-to-archive files on SD cards which are also ready to open in editing and annotation programs. There is now a bewildering array of cameras on the market, and a huge range of variation in terms of price, quality, and intended use. Your choice, as always, will be determined by your budget and your needs, and the compatibility of the camera with your computer, your archiving format, and your annotation and analysis programs.

For example, a data collector who is archiving MPEG2 video and uncompressed 16-bit 48 kHz WAV and who uses a MacBook Pro laptop computer in the field may choose to use a camera such as the JVC GY-HM100, which records uncompressed WAV audio and Mac-friendly Quicktime-readable MPEG2 video files onto two standard (Class 6) SD cards, currently available up to 32 GB each, allowing a total of around four hours of uninterrupted recording time. As soon as the recording is finished, access to it will be relatively seamless. The SD cards can be inserted into a slot in the computer and the video can be read instantly by the computer’s software. If desired, the video can be edited immediately on the SD card, but copying the files onto an external hard drive first is good practice. In fact, ideally, a back-up copy will be stored on a second hard drive immediately, before reinserting the SD cards into the camera and erasing the data in preparation for the next recording. An uncompressed WAV file, which was encoded by the camera along with the video, can now be exported using Quicktime on the Mac. Both the audio and video files are instantly ready for use by the annotation program ELAN, and can be edited if desired using Final Cut Pro. The video files are easily converted to smaller formats if required (a sketch of this extraction and conversion step is given at the end of this section).

Choosing the wrong camera will affect your workflow adversely, so it is definitely worth investing some time into research before you buy. Ensuring a good match between camera, computer, software, and archiving needs will lay the basis for an efficient workflow. Because you are interested in obtaining high-quality audio, you will be using a good-quality external microphone in conjunction with your camera. As was mentioned above in relation to audio devices, cheaper cameras have mini-jack rather than XLR connections, and while there is no problem obtaining a cable to connect the mike to the camera, a camera with stereo XLR sockets built in, or with the capacity to connect to an external XLR socket unit, is preferable.

In general you should use a tripod for your camera where possible, as it will result in a steadier picture, and you should also try to avoid gratuitous zooming and panning as
it will distract the viewer. Obviously, you will practise using your eventual set-up before you begin data collection, and if you are inexperienced, you will obtain advice or training from experts at the outset.
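To return to the file-handling workflow sketched above: if your camera does not export a separate WAV track, or if you need to convert video to a smaller format, a conversion tool can do the work in one step. The following is a minimal sketch in Python (the filename is hypothetical, and it assumes the widely used ffmpeg tool is installed on the system) of extracting an uncompressed 16-bit 48 kHz WAV track from a camera file:

import subprocess

# Hypothetical camera file copied from the SD card.
src = "clip0001.mov"

# Extract the audio track as uncompressed 16-bit linear PCM at 48 kHz.
subprocess.run([
    "ffmpeg", "-i", src,
    "-vn",                   # drop the video stream
    "-acodec", "pcm_s16le",  # 16-bit linear PCM
    "-ar", "48000",          # 48 kHz sampling rate
    "clip0001.wav",
], check=True)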
3.8.6 Microphones

Good microphones are essential ingredients in linguistic data collection, especially when phonetic and phonological analyses are to be the principal uses of the corpus you are creating. Mikes vary in a number of ways. First, there is directionality. Unidirectional mikes are designed to minimize sensitivity to sound other than that coming from the direction in which they are pointed. For example, when pointed at a subject in a room full of people talking, the subject’s voice will be privileged over all other voices in the room. Conversely, an omnidirectional mike would pick up more of the room noise, privileging those speakers in closest proximity. In general, unidirectional microphones will be more suitable for your purposes, although if recording a group conversation, for example, you may choose an omnidirectional microphone.

The microphone should be placed as close as possible to the subject’s head, without causing discomfort. If the mike is placed too close, you will be picking up air release (by plosives, for example). Ideally, the mike will be mounted on a stand or boom at an optimal distance from the speaker’s head. You will be monitoring the sound as you move the mike so that you can find the optimal position. A good alternative is to use lapel mikes. These come in both wired and wireless designs, the wireless type being referred to as ‘radio’ mikes. These have the advantage of being less intrusive than their wired counterparts when the aim is to elicit unselfconscious speech, as the wearer will readily forget that they are wearing the mike. A disadvantage of both types of lapel microphone is that they will sometimes pick up the sound of the wearer adjusting their sitting or standing position, or accidentally making hand contact with the mike, or the clothing to which it is attached. This is especially the case with wired lapel mikes.

There are sometimes compatibility issues between microphones and recording devices and cameras. It is therefore a good idea to seek advice on which microphones best suit your recording device or camera of choice before making a purchase.
3.8.7 A Note on Metadata

Recording metadata about your recordings is crucial, as with the passage of time, important contextual knowledge about the recordings risks being lost forever. Best practice is to record metadata for each recording more or less as soon as the recording is made. What you choose to record will partly depend on the nature of your project. However, details about the speaker, the recorder, the equipment used, the content of the recording, and the location should all be part of basic metadata.
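In practice, even a very simple machine-readable record kept alongside each audio file goes a long way. The following is a minimal sketch in Python of one such record; the field names and values are invented for illustration, not a prescribed standard (documentation projects often follow established metadata schemes such as IMDI or OLAC instead):

import json

# Hypothetical per-recording metadata record.
metadata = {
    "recording_id": "doc-2011-07-14-003",
    "date": "2011-07-14",
    "location": "field site, community hall",
    "speakers": [{"id": "S03", "sex": "F", "age": 54, "language": "Iwaidja"}],
    "recorded_by": "B. Birch",
    "equipment": {"device": "field recorder", "microphone": "wired lapel"},
    "content": "Map Task session, Instruction Giver role",
    "audio_file": "doc-2011-07-14-003.wav",
}

# Write the record next to the recording as soon as the session ends.
with open("doc-2011-07-14-003.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)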
CHAPTER 4

CORPUS ANNOTATION
Methodology and Transcription Systems

ELISABETH DELAIS-ROUSSARIE AND BRECHTJE POST
4.1 Introduction and Background

In the last thirty years, corpus-based approaches have become very important in language research, notably in syntax, morphology, phonology, and psycholinguistics, as well as in language acquisition (Rose, this volume, and Gut, this volume) and in Natural Language Processing (e.g. tools development, corpus-based Text to Speech). The use of corpora opens new perspectives. First, it provides new insights into the way in which linguistic events can be captured and defined. Some aspects that have long gone unnoticed have been observed and described by looking at different types of data in large corpora (see e.g. Durand 2009 for various concrete examples in several domains of linguistics). Secondly, it offers new perspectives on the way in which structures and processes can be accounted for, e.g. through statistical modelling as proposed for phonological phenomena in probabilistic phonology (see e.g. Pierrehumbert 2003a) or in stochastic OT (e.g. Boersma and Hayes 2001).

Two major factors have allowed corpus-based approaches to become prominent in recent decades: the increase of available data and the development of new tools. The number of available corpora for language research has increased exponentially in recent decades. Here we consider as corpora any sets of machine-readable spoken or written data in electronic format which are associated with sufficient information/documentation to allow the data sets to be reused in different studies.1 Two distinct types of information need to be provided:

• metadata which specify the content of the corpus, the identity of the speakers, the ways the data have been collected, etc. (see e.g. Durand, Laks and Lyche, this
volume). The information provided in the metadata is crucial in order to evaluate the representativity and the homogeneity of the data. Moreover, it should be sufficiently specific to allow any studies that are based on the corpus to be replicated or compared with other corpus-based studies.
• metadata that describe the data format used (see Romary and Witt, this volume), the types of annotation proposed (Part of Speech (PoS) tagging, syntactic parsing, etc.), and the symbols and linguistic categories used in the annotation procedure, etc.

The second factor that explains the increase of corpus-based research in the discipline is the development of tools that facilitate the analysis and the annotation of large amounts of language data, e.g. taggers and parsers for PoS tagging (see e.g. Valli and Véronis 1999 and 2000; Véronis 1998 and 2000), automatic speech recognition (ASR) for segmental annotation and alignment (see Cucchiarini and Strik, this volume), and various custom-built database tools designed to analyse large datasets (see e.g. Phon and Phonbank (Rose and MacWhinney, this volume), CLAN and Emu (John and Bombien, this volume)).

The task of annotating can be seen as consisting of assigning a label to an element or an interval in the data, where the label marks a specific event in the text or speech signal. If the event is linguistic in nature, the label normally represents a linguistic unit of some sort, such as a word, a phoneme or segment, a syntactic phrase, a prosodic phrase, a focused element, or a tone.2 However, labels can also represent communicatively meaningful nonlinguistic events occurring in speakers’ productions, like pauses, hesitations, interruptions, and changes in loudness and tempo. Deciding what to annotate depends on the purposes of the annotation, and will typically be constrained by the needs of the users, the size of the corpus, and the tools and the manpower that are available to create them.

In this chapter we will focus on annotations that provide information about the phonological/phonetic dimension of the speech sample or dataset (i.e. annotation types that are relevant for corpus-based research in phonetics and phonology). As with other types of annotation, phonetic/phonological annotation can be seen as the assignment of a label to a specific unit in the data. In the case of speech data, however, several aspects have to be taken into consideration in the segmentation of the speech signal and the assignment of labels. In section 4.2 we will discuss some of the basic issues that arise in the transcription and annotation of speech. Sections 4.3 and 4.4 will be concerned with the encoding of segmental and suprasegmental information, respectively.
1 This definition echoes that of Gibbon et al. (1997) for speech and oral data: ‘A corpus is any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by people in other organisations’ (Gibbon et al. 1997).
2 This conception of annotation prevails in the formal representation proposed by Bird and Liberman (2001) under the name ‘annotation graph’. Any annotation type can in fact be represented by an acyclic graph.
4.2 Basic Issues: Abstraction, Segmentation, and Discretization

Unlike written data, which are already an abstract linguistic representation of language, audio data can be captured either in the form of various types of representation that are physically derived directly from the speech signal or as a representation that abstracts information from the speech signal by taking into account the content that is transmitted. Examples of physically derived representations are the digital audio file itself, in a format like .wav, or the speech pressure wave, a spectrogram, a fundamental frequency (F0) trace, or a smoothed version of the F0 trace in which microprosodic effects that are not relevant for perception are removed from the representation (e.g. Prosogram, see Mertens 2004a; 2004b; and section 4.2.2 below). Like the speech signal itself, these representations are continuous in nature. Examples of symbolic representations are phonemic transcriptions (e.g. in IPA, see International Phonetic Association 1999; and section 4.3.2.2 below) and intonational transcriptions (e.g. in ToBI, see Silverman et al. 1992, Beckman et al. 2005, and section 4.3.1), but also orthographic transcriptions (in which case the audio data are effectively transformed into written data). Such transcriptions are discrete and symbolic in nature. Hence, before starting the annotation, the transcriber has to decide for what purpose the data are being transcribed (or even whether a transcription is indeed the most efficient way of achieving the objectives), and what requirements this imposes on the annotator and the annotations and tools used.

Annotations serve the purposes of making the data searchable, quantifiable for specific research purposes, and comparable with other datasets. In other words, they improve accessibility to the data for a large community of users. In any case, the annotation will consist of assigning a label to a portion of the speech signal, where the label provides information in relation to the unit in a conventionalized way (the part of speech for a word, its function and/or category for a syntactic phrase, etc.). The convention will depend on theoretical and methodological considerations, and the objectives of the annotation will determine the choice of segmentation process and the choice of labels.

When the annotation is abstract, symbolic, and discrete, the speech chain is analysed as a linear sequence of units or intervals to which labels are assigned. Assigning a label to an interval presupposes two sub-tasks (a minimal sketch of the resulting data structure is given after this list):

• A segmentation of the speech signal or text into units to which labels can be assigned. These units may take different forms depending on the type of data. In written data, they may consist of syllables, words, sequences of words, etc.; in speech data, they will be either an interval or a point in time.
• The definition of the labels or sets of symbols.
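To make these two sub-tasks concrete, here is a minimal sketch in Python of the kind of data structure that results: a tier of time-aligned intervals, each carrying one label. The times and labels are invented for illustration; real systems (TextGrid tiers in Praat, tiers in ELAN, or annotation graphs in the sense of Bird and Liberman 2001) are richer, but rest on the same pairing of intervals with symbols.

from dataclasses import dataclass

@dataclass
class Interval:
    start: float  # segmentation sub-task: where the unit begins (seconds)
    end: float    # ...and where it ends
    label: str    # labelling sub-task: a symbol from an agreed set

# One tier per type of event; invented values for the French word paille [paj].
phones = [Interval(0.00, 0.08, "p"), Interval(0.08, 0.21, "a"),
          Interval(0.21, 0.30, "j")]
words = [Interval(0.00, 0.30, "paille")]

# A point in time (e.g. a tone target) is simply an interval with start == end.
tones = [Interval(0.26, 0.26, "H*")]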
In some cases the annotation task appears to be quite trivial, as the labels and units are almost given by the data/linguistic form. Consider, for instance, annotation in parts of speech (PoS). Usually, the labels correspond to the PoS categories that are commonly agreed on (verb, noun, etc.), while the segments of speech or text that are being labelled correspond to the words or lexical entries present in the text. In other cases, segmentation and choice of labels are less straightforward. An example is annotation of syntactic categories and phrases, where segmentation and labelling are typically carried out relative to a specific theoretical paradigm, and the specifics of the theory need to be understood for the labelling to be carried out or interpreted. In any case, regardless of how theory-dependent the annotations are, they will provide an array of information that can be used to formulate the queries that are essential for research to be carried out with the corpus data.

For the annotation of speech data, the segmentation task is further complicated by the fact that the audio signal contains a wide array of simultaneously occurring information of a linguistic and nonlinguistic nature, covering more localized events like an individual sound segment in a word (generally referred to as segmental events) as well as more global events which extend over longer stretches of speech like intonation contours (generally referred to as suprasegmental or prosodic events). For example, in (1) the sound sample corresponds to a conversation between two speakers, who are sometimes speaking simultaneously (Figure 4.1).3 Several events occur in parallel in the signal:

• the sounds ([œ] and [e]) are produced at the same time, since the two speakers overlap;
• the realization of the sound [e] coincides with an overlap and a hesitation realized by the other speaker;
• prosodically, the segment [ɑ̃] at the end of évidemment coincides with the end of an F0 rise. Note, however, that the rise is realized over the entire syllable /mɑ̃/ and not just on the segment.

(1) A: évidemment ça peut pas se faire à pied quoi euh
    B: et c’est loin du centre-ville

3 The audio data associated with the examples treated in the chapter can be downloaded from the following website: http://makino.linguist.univ-paris-diderot.fr/transcriptionpho (Last accessed 16/09/2013)
FIGURE 4.1 Representation of the signal for (1): waveform and F0 trace (50–300 Hz) over time (0–2.868 s), with orthographic and phonetic annotation tiers. Corpus ACSYNT-COAE 8.
Different annotation systems take different approaches to segmentation. The events that are being annotated may be considered from a formal point of view, representing their acoustic or auditory properties, or from the point of view of their function in language. For any annotation to be successful, these properties and functions need to be explicitly identified and disentangled, at least to the extent that they are directly relevant to the objectives that are to be met by the annotation. This will have consequences for how the speech continuum is discretized in order to enable a labelling of the units. The degree to which different annotation systems propose clear guidelines on what to encode, and how to proceed, varies, but for oral corpora a standard has been proposed which specifies the events that have to be encoded, such as speaker turns, overlaps, and background noise (see EAGLES 1996). The annotation of this type of information falls outside the scope of this chapter, so it will not be discussed here.

The focus here is on the annotation of phonetic and phonological information in the speech signal,4 where we distinguish between segmental and suprasegmental phenomena, reflecting current practice in the field. Segmental information in speech corpora tends to be annotated—at least implicitly—at an abstract phonological level, while suprasegmental (or prosodic) transcription systems analyse the signal at phonological as well as auditory/acoustic levels.

In the hypothetical case in which the segmental transcription is carried out directly from the audio signal (i.e. regardless of the signal’s linguistic content), segmentation and assigning labels would have to be done independently of the language being transcribed. Such an undertaking is probably impossible at the segmental level, since for any one label, the interpretation of the relevant properties of the signal (e.g. formant structure for vowels) as representing that label rather than another crucially depends on how an individual language exploits those properties in cueing different sound segments (e.g. for vowels, one language’s /i:/ may be another language’s /ɪ/).

4 Phonology is primarily concerned with structure: studying how languages organize sounds to convey meaning. Phonetics, by contrast, is primarily concerned with form: the study of sounds used in speech, described in terms of their production, transmission, and perception. By implication, phonological information will always be language-dependent.
Another example is sequences which contain sounds with double articulation such as [aɪ] and [ɛj] in French paille (straw) or soleil (sun), and which would generally be transcribed as [paj] (or [pɑj]) and [solɛj], respectively, rather than [paɪ] or [solɛɪ]. This reflects the fact that double articulation vowels and diphthongs do not contrast with other types of vowel in French, and are therefore not considered primitives in the French phonemic inventory. Just like context-dependent and predictable phonetic variation in the production of the phonemes, double articulation would only be transcribed when a narrow phonetic transcription is required (as opposed to what is termed a broad phonemic transcription). Similarly, what is transcribed as one phonological category, with its own label, in one language may be more appropriately considered as a sequence of two labels in another (e.g. affricates like /ts/ that could also be sequences of /t/ and /s/).

By contrast, segmentation and assigning labels without phonological analysis is conceivable at the suprasegmental level to some extent, at least in the transcription of durational properties or melodic variation. For the transcription procedure to be cross-linguistically valid, segmentation would have to occur at the syllabic level, since this is the only unit of analysis that can reliably be claimed to have universal value (see e.g. Segui 1984; note that language-specific knowledge of syllabification will need to be referred to). The assignment of labels representing suprasegmental information cued by duration and melodic variation could be based either on local variation or on more general phonetic and psycho-acoustic information. That is, for the encoding of durational information, the speed of articulation and the internal composition of syllables may be taken into account (e.g. when automatically evaluating the degree of lengthening of a syllable in the speech stream). For melodic variation, information such as thresholds of glissando can be exploited. Note that this information is actually used in a number of stylization tools, most notably the Prosogram (Mertens 2004a; 2004b; section 4.2.2 below).
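To illustrate how such psycho-acoustic information can be operationalized, the following is a minimal sketch in Python that classifies the pitch movement on a syllable as level or as a rise or fall, depending on whether its rate of change exceeds a glissando threshold of the form G/T². The syllable values are invented, and the setting G = 0.32 st/s is one commonly cited in the stylization literature; it is an assumption here, not the definitive Prosogram parameter.

import math

def semitones(f_hz: float, ref_hz: float = 1.0) -> float:
    # Convert a frequency in Hz to semitones relative to a reference.
    return 12.0 * math.log2(f_hz / ref_hz)

def classify(start_hz: float, end_hz: float, dur_s: float, g: float = 0.32) -> str:
    # Compare the pitch change rate over the syllable (semitones per second)
    # against an assumed glissando threshold of g / T**2.
    rate = (semitones(end_hz) - semitones(start_hz)) / dur_s
    threshold = g / dur_s ** 2
    if abs(rate) < threshold:
        return "level"
    return "rising" if rate > 0 else "falling"

# Invented example: a 200 ms syllable over which F0 rises from 180 to 240 Hz.
print(classify(180.0, 240.0, 0.20))  # prints "rising"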
discussed in further detail in the context of the systems and methodologies that are used to annotate the speech signal for phonological (and phonetic) information. This will allow us to show how different systems take different approaches to addressing these issues.
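Since the glissando threshold mentioned above is central to how stylization tools such as the Prosogram decide whether a melodic movement is perceptually relevant, a minimal sketch may help. It assumes the threshold formula G = 0.16/T² semitones per second for a movement of duration T seconds, following 't Hart et al. (1990) and Mertens (2004a); the function names and example values are our own illustration, not part of any published tool.

```python
import math

def glissando_threshold(duration_s, k=0.16):
    """Minimum pitch-change rate (semitones per second) for a movement of
    the given duration to be perceived as a glide rather than a level tone
    ('t Hart et al. 1990); Prosogram uses k = 0.16 or 0.32."""
    return k / (duration_s ** 2)

def is_dynamic(f0_start_hz, f0_end_hz, duration_s):
    """True if the pitch movement over a syllabic nucleus exceeds the
    glissando threshold and should therefore be stylized as a rise or
    fall instead of being flattened to a static pitch level."""
    interval_st = 12 * math.log2(f0_end_hz / f0_start_hz)  # size in semitones
    rate_st_per_s = abs(interval_st) / duration_s
    return rate_st_per_s >= glissando_threshold(duration_s)

# A 120 -> 150 Hz rise over 100 ms is ~3.9 st at ~39 st/s, well above the
# 16 st/s threshold for that duration, so it is perceived as a glide.
print(is_dynamic(120.0, 150.0, 0.10))  # True
```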
4.3 Encoding Segmental Phonetic and Phonological Information

Systems that are commonly used to represent (or annotate) the segmental dimension of speech are presented here. In 3.1, a detailed description of the various levels of representation will be given. Section 3.2 will be devoted to a presentation of major systems, including an evaluation of their advantages and limitations.
4.3.1 Transcription Procedures and Levels of Representation

To represent the segmental dimension of the message, different levels of representation have been proposed for different types of transcription (see e.g. Gibbon et al. 1997). Two points are essential in the definition of the levels: (i) the degree of granularity or detail observed in the transcription, and (ii) the way in which the transcription abstracts away from the reality of the speech signal.
4.3.1.1 Approaching Segmental Transcription at a Linguistic/Phonological Level

Among the six different levels of annotation proposed in the EAGLES Handbook on Standards and Resources for Spoken Language Systems (see Gibbon et al. 1997), three depend on taking into account the conceptual dimension of the message, i.e. its meaning. These levels of transcription will primarily give access to the words or lexical units that compose the message.

4.3.1.1.1 Transcription from Script

At this level, the transcription aligned with the speech signal consists of replicating the scripts or texts that were given to the speaker during the recording session. This level is thus only possible for read scripted speech (lists of words or sentences, text reading tasks, etc.), as in the PFC corpus (see Durand et al., this volume) and some of the IViE corpus (see Nolan and Post, this volume). In this case the transcription is not always accurate, as it does not necessarily represent what the speaker actually said. Repetitions, hesitations, and false starts are not encoded at all, as is shown in (2). Example (2b) provides an orthographic transcription from the script, while (2a) represents orthographically what was actually said by the speaker.
(2) Extract from the PAC Reading task (Californian speaker)
a. Sequence produced by a speaker from California5
If television evangelists are anything like the rest of us, all they really want to do in Christmas week is snap at their families, criticize their friends and make their neighbours' children cry by glaring at them over the garden fence. Yet society expects them to be as jovial and beaming as they are for the other fifty-one weeks of the year. If anything more... more so.
b. Transcription from the Reading script
If television evangelists are anything like the rest of us, all they really want to do in Christmas week is snap at their families, criticize their friends and make their neighbours' children cry by glaring at them over the garden fence. Yet society expects them to be as jovial and beaming as they are for the other fifty-one weeks of the year. If anything more so.

4.3.1.1.2 Orthographic Transcription

At this level, the transcription consists of encoding in standard orthographic form the linguistic units that are part of the message. This level of transcription tends to be used for transcribing or annotating large amounts of speech data. Note that this type of transcription is recommended by the TEI and the CES for oral corpora (see e.g. EAGLES 1996; Gibbon et al. 1997). Orthographic transcriptions allow all the words produced by the speaker to be represented, even in cases of false starts, hesitations, and incomplete sentences or words. As a consequence, the transcriptions obtained are more accurate than any transcription based on the recording script alone. Moreover, orthography may be used to represent a wide range of data types (read scripted speech, spontaneous speech, etc.).

(3) Extract from a formal conversation between a researcher and a female speaker (Lancashire, PAC data)6
er, I loved school when, and I loved primary school best and, and when I started school and er I went to school in Essex and er (silence) I always lo/ I had really good teachers, I was really lucky and because my sister had a few like dud teachers but mine were really nice and I used to like draw a picture and put: I love Mrs (name) (laughter) and sit on her knee and stuff. Er, so that was really lucky, and er, (it was) er, quite a sort of progressive school and you could wear jeans and, like in France, you could wear jeans, any clothes and any, any clothes er, you wanted and we
MD: Yeah,
5 The sound files of the examples given in this chapter may be downloaded from the following website: http://makino.linguist.univ-paris-diderot.fr/transcriptionpho/
6 Unscripted words or events such as false starts, hesitations, etc. are given in bold in the example.
did like cookery and woodwork and quite adv/, different things whereas when I moved to Bolton, I went to a school where you couldn't, you had to wear a skirt or a dress if you're a girl and er, boys had to wear short trousers (laughter) and er, you had to wear indoor shoes when you were indoor, like plimsolls or (tra/), like little slippers and we had nothing like cookery or woodwork or nothing sort of creative and it was all like textbooks and things so I didn't like that when I moved there, but I made nice, good friends.

Even though this level of representation leads to accurate transcriptions of the speech chain (in terms of the words produced), it has its limitations: orthographic representation does not allow the transcriber to represent how the speaker says something, for instance in the case of sandhi phenomena (e.g. intrusive [r] in American English, liaison or schwa deletion in French, etc.). In the orthographic transcription in (4a), for instance, there is no indication of whether liaison between the underlined words occurs or not. Only a phonemic transcription as in (4b) provides such information.

(4) Extract from an informal conversation between two colleagues (ACSYNT, COAE 16)
a. Orthographic transcription
Euh parce par exemple je suis allée à Lille bon c'était pour le boulot hein mais c'était une ville dont je me faisais une idée complètement euh l'horreur totale quoi et en fait j'ai été plutôt agréablement étonnée quoi hein
b. Phonemic transcription of the underlined sequences
[ʒəsɥizale], [setɛynvil]

Transcribing data with standard orthography may be problematic when the speech samples do not fit the forms expected by standard language systems, for instance in the speech of bilinguals, language learners, and speakers with a speech pathology. In many cases, the forms that are produced are not in the lexicon, and are hence difficult to transcribe accurately in standard orthography.

4.3.1.1.3 Citation-phonemic Representation

At this level, phonetic symbols are used to represent the linguistic signs that are present in the message. To a certain extent this representation is equivalent to an orthographic transcription, from which it can easily be derived automatically. That is, a citation-phonemic representation corresponds to a phonemic representation that results from the concatenation of the phonemic representations associated with each lexical item present in the message. Elision and sandhi phenomena are not represented at all, since the citation-phonemic representation ignores any variations in pronunciation that occur in continuous speech. For instance, for the sentence in (5), the citation-phonemic representation (5b) is equivalent to the concatenation of the representations of each word taken in isolation (5a).
(5) Non je les ai trouvés un peu difficiles. Il y avait des mots qui euh qui se enfin je qui se suivaient un petit peu et j'arrivais pas trop à faire le la liaison quoi (ACSYNT, BAVE)
a. Concatenation of the representations of each word taken in isolation (using SAMPA, section 3.2.2.2). Note that latent consonants are given in parentheses in the symbolic representation.
n o~ + Z @ + l e (z) + e + t R u v e (z) + e~ (n) + p 2 + d i f i s i l @ (z) + i l + i + a v e (t) + d e (z) + m o t (z) + k i + 2 + k i + s @ + a~ f e~ + Z @ + k i + s @ + s H i v E/ (t) + e~ (n) + p @ t i (t) + p 2 + e + Z @ + a R i v E/ (z) + p a (z) + t R o (p) + a + f E R + l @ + l a + l j e z o~ + k w a
b. Citation-phonemic representation (using SAMPA)
n o~ Z @ l e (z) e t R u v e (z) e~ (n) p 2 d i f i s i l @ (z) i l i a v e (t) d e (z) m o t (z) k i 2 k i s @ a~ f e~ Z @ k i s @ s H i v E/ (t) e~ (n) p @ t i (t) p 2 e Z @ a R i v E/ (z) p a (z) t R o (p) a f E R l @ l a l j e z o~ k w a

As shown in (5), a citation-phonemic representation is comparable to an orthographic representation in terms of accuracy. However, it makes possible queries that need to refer to phonological units or events, such as potential liaison contexts.
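The concatenative nature of a citation-phonemic representation makes it easy to derive mechanically from an orthographic transcription, given a pronunciation lexicon. The following sketch illustrates the principle with a toy SAMPA lexicon modelled on example (5); the lexicon entries and the function are our own illustration, not an existing tool.

```python
# Toy SAMPA lexicon; latent (liaison) consonants in parentheses, as in (5a).
LEXICON = {
    "non": "n o~",
    "je": "Z @",
    "les": "l e (z)",
    "ai": "e",
    "trouvés": "t R u v e (z)",
    "un": "e~ (n)",
    "peu": "p 2",
    "difficiles": "d i f i s i l @ (z)",
}

def citation_phonemic(words):
    """Concatenate the citation forms of each word; no connected-speech
    processes (elision, liaison, schwa deletion) are applied at this level."""
    return " + ".join(LEXICON[w.lower()] for w in words)

print(citation_phonemic("Non je les ai trouvés un peu difficiles".split()))
# n o~ + Z @ + l e (z) + e + t R u v e (z) + e~ (n) + p 2 + d i f i s i l @ (z)
```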
4.3.1.2 Approaching Segmental Transcription at a Phonetic/Acoustic Level

The levels of representation that have been described so far do not allow a direct representation of the finer detail of what the speaker actually said beyond linguistic, phonologically contrastive information. The transcriptions focus on representing the linguistic content of the message and the lexical items that compose it. Three further types of representation can be defined, which presuppose a segmentation into phonemes while enabling direct access to phonemic and/or subphonemic (phonetic) information. The construction of these phonological representations requires knowledge of the language, attentive listening to the signal, and even, in some cases, the use of acoustic representations of the signal (spectrogram, etc.).

4.3.1.2.1 Broad Phonetic or Phonemic (Phonotypic) Transcription

A broad phonetic transcription provides a phonemic representation of the speaker's actual pronunciation of the words in the speech signal, at a contrastive phonological level. Unlike a citation-phonemic representation, it will indicate phenomena that are characteristic of connected speech, such as liaison, intrusive /r/, consonant deletion, vowel reduction, and assimilation. Hence, the resulting transcription tends to be more detailed than a citation-phonemic representation. The beginning of the extract in (5) can be transcribed as (6) in a broad phonemic transcription.
(6) Non je les ai trouvés un peu difficiles
n o~ Z l e z e t R u v e e~ p 2 d i f i s i l

Such a representation can easily be derived from an orthographic or a citation-phonemic representation of the data:

• automatically, by using ASR and forced alignment algorithms on the citation-phonemic representation, or even the orthographic transcription; all segments that appear in the citation-phonemic representation and that are not produced in the signal will be erased (Cucchiarini and Strik, this volume);
• semi-automatically, by listening to the contexts in which continuous speech phenomena may apply, and modifying the automatically derived citation-phonemic transcriptions accordingly.

Broad phonetic representation relies on the use of a clearly defined, limited set of symbols, like those of the IPA or its machine-readable extension SAMPA, but only symbols that have the status of phonemes are taken into consideration to encode the output of connected speech processes. By offering more phonetic detail than the citation forms, this level constitutes a balance between accuracy of representation and ease of derivation.

4.3.1.2.2 Narrow Phonetic Transcription

To achieve a transcription at this level of representation, the transcriber has to listen very carefully to the signal, sometimes combining this with visual inspection of the waveform and spectrogram. All segments have to be represented by the phonetic symbols or combinations of symbols and diacritics (e.g. in IPA, section 3.2.2 below) which correspond most closely to the sound sequence that is actually produced. Allophonic variants are encoded when necessary, necessitating the use of symbols and diacritics with both phonemic and subphonemic (often allophonic) status in the language. For instance, the aspiration of a plosive in onset position—one of the allophonic variants of voiceless plosives in English—would be encoded in the transcription, as is shown in (7) (transcribed as [ʰ] in the example), as well as processes like the nasalization of a vowel (marked [˜]), regressive place assimilation (/n/ pronounced as [m]), and the pre-glottalization of a stop in coda position (transcribed as [ʔ] in (7)).

(7) Ten bikes were stolen from the depot yesterday evening.
[tʰɛ̃mbɑɪʔks] . . .

This level of representation is very accurate, but it is also time-consuming to produce. Hence, narrow phonetic transcription should only be used in cases in which it is absolutely necessary:

It is better not to embark without good reason on this level of representation, which requires the researcher to inspect the speech itself, as this greatly increases the
resources needed (in terms of time and effort). If the broad phonetic (i.e. phonotypic) level is considered sufficient, then labelling at the narrow phonetic level should not be undertaken. (Gibbon et al. 1997)
4.3.1.2.3 Acoustic Phonetic Transcription

This level of transcription provides very detailed information about the various elements and phases that occur during the realization of a sound. For a plosive, an acoustic phonetic transcription will indicate all the phases that can be distinguished in its production, including the oral or glottal closure, any aspiration, the release burst, etc. Such a transcription can only be achieved when the transcriber refers to visual representations of the acoustic signal (e.g. spectrum, spectrogram, and speech pressure wave). The labels that encode the acoustic information would normally be aligned to the signal in a graphical representation.
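In practice, such signal-aligned labels are stored in tiered, time-stamped annotation structures (interval tiers in Praat TextGrids, for instance). The sketch below shows the underlying idea in plain Python, without assuming any particular tool's file format; the tier contents are invented for illustration, loosely following example (7).

```python
# One tier = ordered (start_s, end_s, label) intervals over the recording.
phone_tier = [
    (0.00, 0.08, "t_h"),   # aspirated plosive: closure + burst + aspiration
    (0.08, 0.15, "E~"),    # nasalized vowel
    (0.15, 0.21, "m"),     # /n/ assimilated to [m] before /b/
    (0.21, 0.30, "b"),
]

def labels_between(tier, t0, t1):
    """Return the labels of all intervals overlapping the window [t0, t1],
    i.e. the segments audible in that stretch of the signal."""
    return [lab for (s, e, lab) in tier if s < t1 and e > t0]

print(labels_between(phone_tier, 0.05, 0.18))  # ['t_h', 'E~', 'm']
```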
4.3.1.3 Deciding on a Level of Transcription

Six different levels of segmental transcription have been presented here, which provide different information about phonemic and subphonemic events in the speech signal. To decide which one to use, a number of factors need to be taken into account: the size of the corpus, the objectives, and the degree of detail required by the user. Thus, if the corpus has been developed to carry out research on e.g. discourse and conversation, and more precisely on turn-taking, an orthographic transcription in which speaker turns are indicated is sufficient. By contrast, if the purpose of the research is to study pronunciation variants of certain phonemes, a narrow phonetic transcription may be required, at least for the phonemes of interest and their immediate context.

The difficulty of the transcription task depends crucially on the level of representation, but also on the nature or genre of the speech samples that are to be transcribed. For instance, scripted data recorded in a soundproof booth are much easier to transcribe than informal spontaneous conversation between several speakers, because the former are much more predictable, and more likely to stay closer to citation-form pronunciations of the lexical items in the speech signal. Automatic procedures may also be more successful on formal speech, since they typically rely on citation-form speech (or derivations from citation-form speech).

Two distinct classes of systems can be used to provide a transcription of segmental information in the speech signal: orthographic systems and phonemic/phonetic systems. In languages with a segment-based orthographic system which closely follows the sound changes in the language, the two classes of representation may not be that far removed from each other, but the symbolic representations obtained in the two systems will differ, and offer different research perspectives. However, since a number of tools can be used to transform an orthographic transcription into a citation-phonemic representation and vice versa, orthographic transcription can be considered as pivotal between alphabetical and phonotypic representations.
4.3.2 Most Commonly Used Systems

As mentioned in section 3.1, two distinct types of system can be used to annotate a speech corpus for phonetic/phonological research: the orthographic system of any given language, or a set of phonetic symbols that represent the sounds present in the signal. However, the two systems cannot be applied in all cases, as they may not offer the same degree of granularity in the way in which they represent the speech signal. In this section, the most commonly used systems will be presented, and their advantages and limitations will be discussed.
4.3.2.1 Representing the Segmental Dimension of the Speech Signal by Means of Alphabetical Writing Systems7

As mentioned in the previous section, making an orthographic transcription of a speech file consists of representing the linguistic content of the speech signal symbolically, in orthographic form. This necessarily implies that the signal is interpreted by the transcriber, at least to the extent that the orthography serves as a means not only to express a sound sequence but also to refer to a concept or an idea (de Saussure's sign: see de Saussure 1916). This level of transcription—which is often applied in large speech corpora (the GARS corpus of spoken French, the spoken sections of the British National Corpus, etc.) and which is recommended by the TEI and EAGLES for the transcription of spoken corpora (EAGLES 1996; Burnard 1993; 1995; Sperberg-McQueen and Burnard 1999)—has a number of advantages:

• The system is easy to use (since it does not require any knowledge of special symbols or segmentation conventions).
• It allows transcription of large datasets.
• It provides transcriptions that are readable for all potential user groups.
• The transcriptions can serve as input to computational tools for automatic language processing, which can be used to automatically generate linguistic annotations from the text (e.g. syntactic parsing and tagging, phonemic representations, etc.).

Nevertheless, producing an accurate orthographic transcription can be an arduous task, in spite of the range of tools that are available to assist the transcriber (cf. e.g. Delais-Roussarie 2003a; 2003b for a review; Sloetjes, this volume; Boersma, this volume; Kipp, this volume). In fact, a number of factors make the task difficult (cf. e.g. Blanche-Benveniste and Jeanjean 1987):

• It is often difficult to hear precisely what is being said due to distortions of the signal (background noise, etc.), especially in the case of spontaneous speech which is not recorded in a quiet environment.

7 Only orthographic alphabetical systems are covered here, since they represent segmental information more closely than other writing systems. This does not imply that other writing systems do not share some of the advantages and limitations of the orthographic systems discussed here.
• Transcribers will tend to auditorily reconstruct elements that they cannot readily interpret as part of the message, which may result in erroneous interpretations (unexpected words produced out of context, unfamiliar pronunciations in dialectal or non-native speech, etc.).
• A variety of preconceptions can cause transcription errors, linguistic preconceptions in particular (i.e. when transcribers rely on their knowledge and representations of the language, and erroneously reinterpret what they hear).
• Sometimes the speech signal allows for multiple interpretations which need to be disambiguated in writing (e.g. auditory [lɑ̃dʁwaulɔ̃nɛ] in French can correspond with orthographic l'endroit où l'on est as well as l'endroit où l'on naît; note that such ambiguities are usually resolved by the context and/or phonetic detail in the signal).

A number of recommendations have been proposed (see e.g. EAGLES 1996; Burnard 1995) with a view to ensuring accuracy and ease of use (by human and machine), notably:

• Transcriptions need to observe standard orthographic conventions as much as possible; generally adopted conventions are also used for contractions (e.g. gonna or wanna in English; t'as in French).
• Hesitations, false starts, repetitions, and all other forms of self-correction need to be transcribed literally.
• Numbers, formulae, and other symbols need to be represented as written words.
• Abbreviations and acronyms are transcribed, distinguishing between abbreviations that are pronounced as words and those that are pronounced as a series of initials (NATO vs. U.S.A.).
• Only major punctuation marks are used (i.e. those used at the end of the sentence, like question marks, exclamation marks, and full stops).

Two of these recommendations are the subject of debate: the use of punctuation, and the reliance on standard orthographic conventions. A number of editors of oral corpora refuse to use punctuation on the grounds that (i) punctuation is part of the written code, and/or (ii) there are no sentences in speech (cf. e.g. Blanche-Benveniste and Jeanjean 1987). Nevertheless, many speech corpora use punctuation to segment speech into utterances (cf. the spoken sections of the BNC, the CHAT conventions, etc.). This choice is justified in various ways. French (1992), for instance, states clearly which indicators the transcriber should use to segment the speech signal into utterances:

Try to be guided by intonation—the rises and falls in the voice—as well as by the words themselves. If it sounds as though someone has finished a sentence and gone to another (their voice drops, they take a breath and start on a higher note), then it's probably safe to start a new sentence.
Payne (1995) proposes a definition of the sentence in spoken language when he addresses the issue:

The resulting functional sentence is perhaps difficult to define precisely in linguistic terms, but as an essentially practical unit creates few problems for transcribers, as it is using their intuition about when speakers are starting, continuing and completing what they are saying on the one hand, and when they are abandoning an incomplete utterance on the other. (Payne 1995: 204)
As mentioned before, many guidelines and recommendations which aim to establish a standard for the design of oral corpora insist on the use of standard spelling to represent the linguistic content of the audio signal (e.g. TEI, EAGLES 1996). However, standard spelling cannot account for the various realizations that occur in connected speech, and as such does not very accurately represent what has effectively been said by speakers. In any case, an orthographic transcription using standard spelling cannot represent the occurrence of phenomena such as liaison or optional schwa deletion in French, or intrusive /r/ in American English. In order to overcome this limitation, some tricks which lead to a departure from standard spelling have been used in transcribing some oral data. For instance, in the orthographic transcription of an oral corpus of Acadian French (Péronnet-Kasparian: see Chevalier et al. 2003), standard spelling is not observed in transcribing some words. The sequences je suivais and je suis are transcribed chuivais and chuis, respectively, to account for the assimilation after schwa elision (8).

(8) Orthographic transcription and tricks
y . . . fait-que j'ai euh travaillé deux ans à la firme comptable après mon bacc. / pis chuivais ces cours de correspondance-là / quand j'ai vu que ça marchait pas les cours / j'ai euh / appliqué à l'Université Laval pour la licence// j'ai faiT un an là pour euh obtenir ta licence pis là ton coaching de l'été qu(i) est de / juin/ juillet/ août / genre revue de tout c'que/ que t'as vu / euh de là / on fait note/ j'ai fait mon examen de comptable agréé // euhm // après mon / ma licence à Québec / là chuis déménagée à Montréal / pis j'ai travaillé pour une firme

Other tricks are used to account for schwa deletion (le p'tit instead of le petit) and other phonological and phonetic phenomena occurring in connected speech. In general, such tricks are problematic for several reasons:

• A lack of consistency may occur. In (8), for instance, schwa deletion is encoded in c'que, but nothing is clear concerning the realization of sequences such as de l'été or genre revue de tout.
• Transcriptions using such tricks may not be very easy to read in comparison to a transcription using standard spelling.
• Transcriptions using such tricks cannot be further annotated by using automatic tools such as a tagger or a parser: many words or orthographic forms cannot be properly labelled, as they are not in the dictionary.

A clear advantage that the recommendations offer is that they facilitate the interpretation and automatic processing of the transcribed data or texts. In general, orthographic transcriptions—both from scripted speech and from careful listening—provide readily interpretable representations of the linguistic message, while allowing for the automatic generation of phonemic transcriptions. They are relevant for all further processing of the data, since they make systematic searches for specific phenomena possible (linguistic forms, phonemes, specific phonemic environments, etc.). However, certain types of data can prove difficult to transcribe when nonstandard forms are used, as in regional varieties, learner varieties, and child speech.
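A common workaround for this last problem is to keep the trick spellings in the transcription but map them back to standard orthography in a separate normalization pass before tagging or parsing. The lookup table below is a hypothetical fragment based on the forms in (8); a real project would maintain a much larger, corpus-specific table and preserve both versions of the text.

```python
# Hypothetical normalization table for the 'tricks' seen in (8).
NORMALIZE = {
    "chuis": "je suis",
    "chuivais": "je suivais",
    "c'que": "ce que",
    "p'tit": "petit",
}

def normalize(tokens):
    """Replace non-standard spellings with their standard equivalents so
    that downstream tools (taggers, parsers) can find them in the lexicon."""
    out = []
    for tok in tokens:
        out.extend(NORMALIZE.get(tok.lower(), tok).split())
    return out

print(normalize("là chuis déménagée à Montréal".split()))
# ['là', 'je', 'suis', 'déménagée', 'à', 'Montréal']
```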
4.3.2.2 Representing the Segmental Dimension with Phonetic Symbols

To provide a transcription that represents the sounds produced, and not only the words pronounced by the speaker, the transcriber tends to use systems in which a symbol corresponds to a sound. Such systems have to be used to transcribe data at any of the following levels of representation (cf. in particular section 3.1.2): citation-phonemic representation, phonemic representation, and phonetic representation. The most commonly used sound–symbol systems are the IPA and some machine-readable extensions of the IPA: SAMPA, ARPABET, and X-SAMPA. Some of these systems will be presented in detail in the next sections (see also International Phonetic Association 1999 and Wells 1997). Note that all these systems presuppose phonological knowledge in the sense that symbols are assigned to segments that correspond roughly to phonemes.

4.3.2.2.1 The International Phonetic Alphabet

The IPA was first developed in 1888 by phoneticians and teachers of French and English under the auspices of the International Phonetic Association (cf. International Phonetic Association 1999). Its development was intended to facilitate the teaching of pronunciation in foreign languages, at least in terms of the sound segments (phonemes) in words, avoiding the complications introduced by orthographic representation. According to this system, a limited set of symbols should make it possible to represent any and all of the sound segments that are used contrastively in the languages of the world. To achieve this aim, the IPA builds on three fundamental theoretical assumptions:

• The number of symbols is strictly limited (188 symbols, which represent vowels and pulmonic and non-pulmonic consonants, and 76 diacritics).
• The system can be used to transcribe any language or variety of a language; in other words, it is independent of any given language.
• The symbols are assigned to segments (not to phonemes), which presupposes a segmentation of the speech stream into pre-theoretical units.
These principles should facilitate data sharing, as well as transcription validation. Moreover, different languages or varieties of a language can be compared, as different symbols are assigned to different sounds.8 The symbols used to represent the vowels and the pulmonic and non-pulmonic consonants are given in (9) (reprinted with the permission of the International Phonetic Association).

(9) The IPA Chart (consonants and vowels): THE INTERNATIONAL PHONETIC ALPHABET (revised to 2005), © 2005 IPA. [The chart tabulates pulmonic consonants by place of articulation (bilabial, labiodental, dental, alveolar, postalveolar, retroflex, palatal, velar, uvular, pharyngeal, glottal) and manner (plosive, nasal, trill, tap or flap, fricative, lateral fricative, approximant, lateral approximant); non-pulmonic consonants (clicks, voiced implosives, ejectives); and vowels by frontness (front, central, back) and height (close, close-mid, open-mid, open). Where symbols appear in pairs, the one to the right represents a voiced consonant or a rounded vowel; shaded areas denote articulations judged impossible.]

8 Databases like UPSID (see Maddieson 1984) are constructed on the basis of comparisons between phonological inventories represented in IPA symbols. However, it should be borne in mind that one phonemic symbol can represent different phonetic realizations.
However, the principles just mentioned are in fact assumptions which run, to some extent, counter to fact: in its use the system is not truly independent of individual languages. First, an individual symbol does not always represent the same acoustic/phonetic reality. In fact, its precise realization varies from one language to another. For instance, the symbol [p] is used to transcribe a voiceless bilabial plosive in French and in English, even though from an auditory and acoustic point of view the two sounds are different, with more aspiration in English than in French. This illustrates the important role of the notion of contrast in determining the mapping between sounds and symbols in this system, which already implies a high level of abstraction. Second, as mentioned in sections 2 and 3.1, for any transcription to be made, the message needs to be interpreted, and hence the transcription cannot be made independently of a given language.

In spite of these potential limitations, the IPA is widely used, and often serves as a medium for exchanging and analysing linguistic data. Three factors contribute to its popularity:

• By representing the continuous sound signal as a string of segments, the IPA offers an intuitive way to represent the speech signal that is to a certain extent comparable to orthography.
• As the locus of contrast, the segment represents a fundamental unit for the segmentation of the speech signal which encodes only linguistically relevant aspects of speech (as well as some phonetic detail).
• By adopting the segment as its fundamental transcription unit while allowing for considerable flexibility in the precise realization of the transcription symbol, the IPA allows for more or less detailed transcriptions, depending on the level of abstraction at which the transcription is carried out—phonemic or phonetic.

Distinguishing between different levels of representation does not call into question the need for the segmentation and phonemic analysis of the speech signal before any transcription can take place. The phonetic detail is rendered by the symbol chosen to represent the sound effectively realized. Thus, in examples (10) and (11), phonological knowledge allows us to segment the linguistic message, but the selection of the symbols can either be based on the abstract phonological form, as in examples (10a) for French and (11a) for English, or represent at a phonetic (or allophonic) level what the speaker actually produced, as in examples (10b) and (11b).

(10) C'est parti pour sa quatrième campagne présidentielle. (FOR-LEC)
a. Phonemic representation
/sε paʁti puʁ sa katʁijεm kɑ̃paɲ pʁezidɑ̃sjεl/
b. Phonetic representation9
[sε paxti puχ sa cʰatxijεmə kɑ̃mpaɲə pxεzidɑ̃sçεl]

9 Blank spaces are inserted between orthographic words here to make the transcription easier to read.
(11) Ten bikes were stolen from the depot yesterday evening.
a. Phonemic representation
/tεnbɑɪks . . ./
b. Phonetic representation
[tʰɛ̃mbɑɪʔks . . .]

While examples (10a) and (11a) provide a phonemic transcription of the utterances, the phonetic transcriptions in examples (10b) and (11b) reveal how the sound segments were actually realized by the speaker. The French example in (10b) shows that some of the symbols that are used to transcribe the actual realization of the segments are not phonemes of French; they merely represent pronunciation variants. Thus, the /ʁ/ in parti is pronounced as a velar fricative instead of a uvular one. Similarly, the velar plosive /k/ of quatrième is produced as a dorso-palatal aspirated plosive. By contrast, the English example (11b) shows that sometimes, when a different symbol is used to represent the pronunciation variant in the phonetic transcription, the alternative symbol also represents a contrastive segment in the phonemic inventory of the language, as in the case of the alveolar nasal /n/ in ten, which is pronounced as bilabial [m] in connected speech.

It should be borne in mind, though, that even though the transcriptions in examples (10b) and (11b) offer a high degree of detail, they do not directly correspond to the reality of the signal. Nevertheless, a number of points mentioned above can be distinguished, which can be considered an integral part of the transcription task:

• Transcription requires the discretization of a speech signal which is inherently continuous in nature.
• The units or segments that are chosen to discretize the signal represent loci of phonological contrast or opposition. Thus, in example (10), the sound /s/ in c'est could enter into opposition with e.g. /t/ (t'es parti). If exchanging one phonetic segment for another (i.e. replacing one symbol with another) gives rise to a difference in meaning, the two segments are in opposition, and they correspond to variants of two distinct phonemes.
• A difference in the choice of symbols can be used to encode different degrees of granularity, ranging from phonemic citation-form transcription to acoustic-phonetic transcription.

To conclude, as a transcription system the IPA has a number of important characteristics. It is based on the assumptions that (i) the speech signal can be analysed as a string of segments; (ii) segments are units that are relatively neutral from a theoretical point of view, while they represent possible loci for contrast and opposition; (iii) the set of labels (or symbols) is strictly limited (in spite of the use of diacritics); (iv) each symbol corresponds to a single sound segment in the speech signal; and (v) the sound signal can be
represented at a more or less abstract level, where the level of abstraction is marked by the symbol that is chosen to represent the realization of the sound segments in the signal.

In general, the IPA allows for finer-grained transcriptions of phonetic realizations than are possible in orthography. However, the resulting transcriptions are difficult to read for untrained users. Moreover, they cannot be used as input for linguistic processing such as grammatical annotation of the data. Finally, the IPA does not resolve all the issues that arise when nonstandard speech data are transcribed orthographically, for instance in first and second language acquisition data, or pathological speech. This is because segmenting the speech signal into a phonemic string is a prerequisite for transcription into IPA symbols.

4.3.2.2.2 Machine-readable Extensions of the IPA: SAMPA, X-SAMPA, and ARPABET10

Not all the IPA symbols are machine-readable alphanumeric symbols (or ASCII characters). Therefore, they cannot always be typographically represented. A number of systems were developed to address this issue. They have mostly been used in speech technology (TTS or text-to-speech synthesis, and speech recognition). They mainly consist of a fairly straightforward modification of the IPA in which IPA symbols are translated into alphanumeric code, which makes them usable in computing. Examples of the best-known systems are given in (12) and (13) below. In (12), the transcriptions with SAMPA and X-SAMPA of two sentences are presented.11 In (13), transcription with ARPABET is illustrated.12

(12) a. The North Wind and the Sun were disputing which was the stronger.
With the IPA
ðə ˈnɔɹθ ˌwɪnd ən (ð)ə ˈsʌn wɚ dɪsˈpjutɪŋ ˈwɪtʃ wəz ðə ˈstɹɑŋɡɚ
With SAMPA
D@ nO4T wInd @n @ sVn w@` dIspju4IN wItS w@z D@ st4ANg@`
With X-SAMPA
D@ nOr\T wInd @n @ sVn ǀ w@` dIspju4IN wItS w@z D@ str\ANg@
b. La bise et le soleil se disputaient, chacun assurant qu'il était le plus fort.
With the IPA
la biz e lə sɔlɛj sə dispytɛ ʃakɛ̃ asyʁɑ̃ kiletɛ lə ply fɔʁ
With SAMPA and X-SAMPA
la biz e l@ sOlEj s@ dispyte SakE~ asyRA~ kilete l@ ply fOR

(13) Transcriptions with ARPABET
a. Thursday is transcribed as /TH ER1 Z D EY2/
b. Thanks is transcribed as /TH AE1 NG K S/
c. Father is transcribed as /F AA1 DH ER/

10 Some other machine-readable systems have been developed in house for research purposes. However, they are not as widely accepted as the systems we present here. An example is the system that was developed in the 1980s at Orange Labs, and which is used to transcribe French data in speech technology:
(i) La bise et le soleil se disputaient, chacun assurant qu'il était le plus fort.
a. /la biz e l@ sOlEj s@ dispyte SakE~ asyRA~ kilete l@ ply fOR/ in SAMPA
b. /L A B I Z EI L EU S AU L AI Y S EU D I S P U T AI CH A K UN A S U R AN K I L EI T AI L EU P L U F AU R/ in Orange Labs' annotation system for TTS
11 SAMPA and X-SAMPA have mostly been developed in Europe, X-SAMPA consisting of an extension of SAMPA, which was originally developed for individual languages. A complete description of the symbols used in SAMPA and X-SAMPA can be found, respectively, at http://www.phon.ucl.ac.uk/home/sampa/ (last accessed 16/09/2013) and http://en.wikipedia.org/wiki/Extended_Speech_Assessment_Methods_Phonetic_Alphabet#Summary (last accessed 16/09/2013).
12 ARPABET is a phonetic transcription code developed by the Advanced Research Projects Agency (ARPA) during a project on speech understanding (1971–1976). The system is presented in great detail at http://en.wikipedia.org/wiki/Arpabet (last accessed 16/09/2013).

As can be seen in (12), the IPA symbols that correspond to alphanumeric symbols remain in general unchanged in SAMPA and X-SAMPA. For instance, IPA /p/ is transcribed /p/ in SAMPA and X-SAMPA, and this holds for all languages in which a voiceless bilabial stop contrasts with other sound segments. However, symbols which do not correspond directly to ASCII characters, like the dental and bilabial fricatives /ð/ and /β/, are translated as /D/ and /B/ respectively in SAMPA as well as X-SAMPA. The IPA diacritics also have their equivalents, but they are generally placed on the same typographical line as the symbol that they modify. However, suprasegmental information is encoded differently in the two types of system. In the IPA, prosodic symbols are normally inserted on the same typographical line as the segmental symbols, unlike in e.g. SAMPA, where they are used on a separate transcription tier (or line, in typographic terms). This difference allows for multiple use of the same symbol in SAMPA and SAMPROSA (the prosodic version of SAMPA) to represent two distinct sound objects. For instance, the symbol H represents the labio-palatal approximant /ɥ/ in SAMPA, as well as a High tone in SAMPROSA.

In ARPABET, by contrast, the representation of IPA symbols follows very different principles, as can be seen in (13). Every phoneme is represented by one or two capital letters. For instance, /ɔ/ is represented by AO, and /ʒ/ by ZH. Stress is indicated by digits that are placed at the end of the stressed syllabic vowel. Intonational phenomena are represented by punctuation marks that are used in the same way as they are in the written language. In contradistinction to X-SAMPA, ARPABET was developed for American English only.

Having been developed as machine-readable versions of the IPA, the systems implicitly adopt the phonological nature of the IPA system:

• They are language-dependent, which means that a full transcription can only be made if the sound system of the language is known.
• Each symbol that represents a phoneme in a language has its equivalent in the transcription system, where one symbol corresponds to one sound segment.
• The speech signal is treated as a sequence of segments (beads on a string).

SAMPA, X-SAMPA, and ARPABET thus offer the same advantages and drawbacks as the IPA, but they were primarily devised to make phonemic transcription in ASCII symbols possible.
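Because these systems are essentially symbol-for-symbol recodings of the IPA, conversion between them can be automated once the inventory for a language is fixed. The fragment below sketches the idea for a small French SAMPA subset taken from example (12b); the table is deliberately partial and the function is illustrative, not a full converter (a complete one would also have to tokenize multi-character symbols such as A~ out of a raw string).

```python
# Partial SAMPA -> IPA table (French subset used in example (12b)).
SAMPA_TO_IPA = {
    "b": "b", "i": "i", "z": "z", "e": "e", "l": "l", "@": "ə",
    "s": "s", "O": "ɔ", "E": "ɛ", "j": "j", "d": "d", "p": "p",
    "y": "y", "t": "t", "S": "ʃ", "a": "a", "k": "k", "R": "ʁ",
    "f": "f", "A~": "ɑ̃", "E~": "ɛ̃",
}

def sampa_to_ipa(symbols):
    """Convert a list of SAMPA symbols (already tokenized, so that
    multi-character symbols like 'A~' are single items) into IPA."""
    return "".join(SAMPA_TO_IPA[s] for s in symbols)

print(sampa_to_ipa(["l", "a", "b", "i", "z"]))  # labiz
print(sampa_to_ipa(["S", "a", "k", "E~"]))      # ʃakɛ̃
```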
4.4 Transcription Systems and Suprasegmental Information

Most issues that arise in the transcription and annotation of segmental information also arise in the transcription of suprasegmental, prosodic information (see section 2). However, unlike segmental transcription, where the IPA—or SAMPA—is generally accepted as a useful standard, there is no commonly accepted transcription system for suprasegmental information. A number of different systems have been developed which are based on different theoretical approaches and pursue different objectives. Hence, they deal in different ways with the issues that arise in the transcription of suprasegmental information.

In this section, we will exemplify different approaches and objectives by presenting some commonly used transcription systems (see Llisterri 1994 for a more comprehensive survey of prosodic transcription systems). We will focus on symbolic representation systems, i.e. systems that provide a discrete symbolic representation or a phonological representation of various prosodic phenomena (and, to a certain extent, their phonetic realization). This means that not all of the systems that have been developed to stylize the F0 curve or to decompose it into smaller components will be reviewed here, even though they can be a very helpful—and sometimes even crucial—tool for developing a linguistic analysis or a symbolic representation of a specific speech sample. Only two of these systems will in fact be included, and only because they are used to compute intermediate steps in an annotation process that yields a symbolic transcription (see e.g. sections 4.2.2 and 4.3.3 on Mertens' transcription system and INTSINT, respectively).13

13 Examples of systems that provide a stylization of the F0 curve are TILT (Taylor 2000), Momel (Hirst and Espesser 1993), Prosogram (Mertens 2004a; 2004b), and the stylization system based on the IPO framework ('t Hart et al. 1990). The system developed by Fujisaki provides an analysis of the F0 curve that allows it to be decomposed into two distinct components (e.g. Fujisaki 2006).

We will mostly concentrate on the transcription of intonational events, since a number of systems have been proposed in this domain that are used more widely, and
because intonation transcription systems tend to include devices for encoding the phenomena that are assumed to be closely related to intonation—phrasing and accentuation. Thus, the representation of suprasegmental phenomena like rhythm and the evolution of voice quality over time will not be dealt with in this chapter.

We will distinguish two types of system, based on the way in which they segment the speech signal for encoding the suprasegmental information (see section 4.1.2): contour-based systems (the IPA and Mertens' system: section 4.2) and target-based systems (Momel-INTSINT, ToBI, and IViE/IV: section 4.3). A second difference between the systems, which cuts across this distinction, is between systems that allow for (semi-)automatic transcription (Mertens' system and Momel-INTSINT) and those that are manual (the IPA, ToBI, and IViE/IV). A third distinction, which also cuts across the contours/targets distinction, is between systems that are designed to be language-specific (ToBI, and to a certain extent the IPA) and those that are applicable to any language, even if the language's suprasegmental system is not known (Momel-INTSINT, and arguably IViE/IV). This distinction is related to the level of linguistic abstraction that the systems are designed to represent in the transcription, since an abstract phonological representation of the suprasegmental characteristics of the speech signal depends on a linguistic interpretation of the signal, which is of necessity language-specific, while a surface phonetic representation does not necessarily do so.14 Prosogram targets a perceptual phonetic level, ToBI an abstract phonological level, and IViE/IV was developed to do both; Momel-INTSINT and the IPA arguably also fall into the latter category, but as we will see below, this is problematic.

In section 4.1, the different suprasegmental phenomena that need to be transcribed will be introduced briefly, and the key theoretical issues that have given rise to different approaches in transcription will be summarized. In sections 4.2 and 4.3, contour-based and target-based systems will be reviewed, respectively, including a summary discussion of the main strengths and weaknesses of each system. These discussions will clarify why developing a single transcription system that would be accepted and used by the entire research community has proved difficult. For each type, automatic or semi-automatic systems will be presented (Mertens' system and INTSINT) as well as manual systems (the IPA and ToBI).

14 Leaving aside the perceptual effects of prior experience (e.g. Cumming 2011 for native language effects in the perception of speech rhythm).
4.4.1 Suprasegmental Phenomena and Segmentation

4.4.1.1 Intertwined Phenomena

Suprasegmental information in the signal encompasses an array of different phenomena that are closely intertwined. Intonation, or the melody of speech, can be used to convey linguistic as well as nonlinguistic information (see e.g. Ladd 1996;
Gussenhoven 2004). That is, an utterance's intonation contour can signal its pragmatic function, for instance when a speaker produces a falling pitch movement for a declarative utterance, or rising pitch for an interrogative. Intonation can also cue when a new topic is broached by the speaker, it can be used to highlight information, and it conveys functions in conversational interactions like turn-taking and floor-holding. At the same time, intonation can convey nonlinguistic information like the speaker's attitude or emotion (often referred to as paralinguistic information), or extralinguistic information about the speaker, like gender.

Intonation can also be used to mark prosodic phrasing, or the chunking of speech into linguistically relevant units like words, phrases, and utterances (e.g. Truckenbrodt 2007a; 2007b). The edges of the units are typically cued by changes in pitch, loudness, and duration (e.g. Wightman et al. 1992; Streeter 1978). Rhythm and accentuation are closely intertwined with phrasing and intonation (see e.g. Beckman 1986; Liberman 1975). Intonation contours are usually analysed in terms of accentual and phrase-marking pitch movements, and phrasing and accentuation are important contributors to the perception of rhythm in a specific language (Prieto, Vanrell et al. 2012). In many languages, accent placement can signal aspects of information structure like focus distribution. The potential locations of accents in an utterance are usually determined by the metrical structure, which indicates the elements in words that can be stressed (syllables, morae, or feet). In tone languages, words can also be marked by lexical tone, which is used to distinguish word meanings, and is part of the lexical representation of the word together with the vowels and consonants that define the word segmentally.

A transcriber's first task is to decide which phenomena need to be represented for the transcription to meet its objectives. For instance, if the transcription is carried out for a study of the phonetic correlates of turn-taking, the locations of accents and phrasal edges will be relevant, as well as the type of intonation contour, in addition to various other factors. The transcriber may also need to decide how the relationships between the phenomena at issue are to be represented (e.g. how intonation contours are associated with phrases). As the review below illustrates, the existing transcription systems for suprasegmental information differ in this respect.
4.4.1.2 Suprasegmental Segmentation Units and the Discretization of the Speech Flow

The segmentation of the speech stream into transcription units for suprasegmental labelling depends on the choice of phenomena that are being described (e.g. syllables for the transcription of individual accents, or Intonation Phrases for the transcription of intonation contours; see Silverman et al. 1992), as well as on the theoretical perspective that underlies the transcription system. Two approaches to the decomposition of intonation contours into discrete elements have been proposed (global vs atomistic: Bolinger 1951):

• contour-based analyses, which decompose intonation in terms of configurations of pitch movements;
• target-based analyses, which decompose intonation in terms of configurations of turning points with distinct pitch levels.

Examples of the former are the Dutch model developed at the IPO (Collier and 't Hart 1981; 't Hart et al. 1990) and what is generally referred to as the British tradition, e.g. Halliday (1967), O'Connor and Arnold (1961/1973), and Crystal (1969). Here, the intonation contour of an utterance is modelled as a sequence of discrete pitch movements (e.g. rises and falls). The movements are the primitives of the intonation structure; at the phonological level, the inventory of the movements and the way in which they can be combined is given. The IPA has its roots in the British tradition of intonation analysis, and is therefore a clear example of a contour-based approach.

Examples of the target-based approach to intonation analysis are INTSINT (Hirst and Di Cristo 1998) and the Autosegmental-Metrical framework (Bruce 1977; Pierrehumbert 1980), in which intonation contours are analysed as linear sequences of locally specified targets, which are linked up by means of phonetic transitions. The phonetic targets are fundamental frequency maxima and minima that correspond to tones (High or Low) in the phonological representation, and these tones associate with specific locations in the utterance (metrically strong syllables (T*) or the edges of prosodic phrases (T-, T%) in the Autosegmental-Metrical framework). The inventory of pitch accents and boundary tones may vary from one language to another, and the phonetic realization of their targets is defined by a set of language-specific implementation rules. These approaches originate in American structuralist descriptions that decompose the intonation contour into its component levels (e.g. Low, Mid, High, and Extra High: Pike 1945; Trager and Smith 1951). Momel-INTSINT, ToBI, and IViE/IV are all target-based transcription systems (see Hirst et al. 2000 for Momel-INTSINT; Beckman et al. 2005 for ToBI; and Grabe et al. 2001 for IViE). Although Prosogram (Mertens 2004a; 2004b) uses symbols for pitch levels in its symbolic representation of intonation, it is classified as a contour-based approach here, because it does not decompose pitch movements into turning points which are associated with certain locations in the utterance and between which pitch is interpolated. Instead, it specifies pitch on a syllable-by-syllable basis while taking changes in pitch into account.

The transcription of suprasegmental phenomena other than intonation (e.g. metrical structure and phrasing) is treated indirectly in the suprasegmental transcription systems presented here, as will become clear in sections 4.2 and 4.3.
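The target-based view lends itself to a very simple computational representation: a contour is a list of time-stamped F0 targets contributed by pitch accents and boundary tones, and the phonetic transitions between targets are filled in by interpolation. The sketch below illustrates this with an invented H* L-L% sequence; the anchor times and Hz values are arbitrary, and plain linear interpolation stands in for whatever language-specific implementation rules a full model would define.

```python
# A toy Autosegmental-Metrical analysis: tones anchored to points in time.
# (tone label, anchor time in s, F0 target in Hz) -- all values invented.
targets = [
    ("%L", 0.00, 120.0),   # initial boundary tone
    ("H*", 0.45, 220.0),   # pitch accent on a metrically strong syllable
    ("L-", 0.90, 110.0),   # phrase accent
    ("L%", 1.20, 100.0),   # final boundary tone
]

def f0_at(t, targets):
    """Phonetic transition between targets; here, plain linear interpolation."""
    (_, t0, f0), *rest = targets
    if t <= t0:
        return f0
    for _, t1, f1 in rest:
        if t <= t1:
            return f0 + (f1 - f0) * (t - t0) / (t1 - t0)
        t0, f0 = t1, f1
    return f0

print(round(f0_at(0.225, targets)))  # 170 Hz: halfway up the rise to H*
```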
4.4.2 Contour-based Transcription Systems and the Encoding of Intonational Phenomena

4.4.2.1 The IPA

Although the IPA was originally developed for the transcription of segmental information, it was further extended to include suprasegmental transcription symbols. The relevant symbols are given in the tables in (13) under the headings Suprasegmentals and Tones and word accents (in the computer-compatible version of the IPA, the
SAMPROSA subset of symbols for suprasegmental information is used on a separate transcription tier; see http://www.phon.ucl.ac.uk/home/sampa/samprosa.htm; last accessed 16/09/2013).

(13) IPA symbols representing suprasegmental events

SUPRASEGMENTALS: primary stress ˈ, secondary stress ˌ, long ː, half-long ˑ, extra-short ˘, minor (foot) group |, major (intonation) group ‖, syllable break ., linking (absence of a break) ‿

TONES AND WORD ACCENTS
LEVEL: e̋ extra high, é high, ē mid, è low, ȅ extra low; ↓ downstep, ↑ upstep
CONTOUR: ě rising, ê falling, e᷄ high rising, e᷅ low rising, e᷈ rising-falling; ↗ global rise, ↘ global fall
The symbols '|' and '‖' represent minor and major intonation group boundaries, respectively. Their definition depends on the language and on the aims of the transcriber. Two levels of stress can be indicated by placing the symbol 'ˈ' for primary stress or 'ˌ' for secondary stress immediately before the stressed (or accented) syllable in the word. Lexical tone can be transcribed with a set of symbols that covers all of the lexically distinctive movements that are associated with words in tone languages. A set of four symbols is provided for the transcription of intonational pitch movements: '↓' and '↑' stand for down- and upstep respectively, and '↘' and '↗' mark falling and rising movements.15 They represent intonation over the whole of the minor or major intonation group, and they are placed before the syllable on which the pitch movement or register change takes place. For instance, the transcription of a typical suprasegmental realization of the first sentence of 'The North Wind and the Sun' in American English is given in (14) (International Phonetic Association 1999: 44).

(14) The North Wind and the Sun were disputing which was the stronger, when a traveller came along wrapped in a warm cloak.
‖ ðə ˈnɔɹθ ˌwɪnd ən (ð)ə ˈsʌn ǀ wɚ dɪsˈpjutɪŋ ǀ ˈwɪtʃ wəz ðə ˈstɹɑŋɡɚ ↘ ‖ wɛn ə ˈtɹævlɚ ǀ ˌkem əˈlɑŋ ǀ ˌɹæpt ɪn ə ˈwɔɹm ˈklok ↘ ‖

The IPA system offers a number of advantages. First, it allows the transcriber to encode a wide range of suprasegmental phenomena, including intonational phrasing, stress placement, intonation, and lexical tone. Also, the different phenomena can be encoded independently of one another. Second, transcription can be done at a pre-theoretical level, without necessarily anticipating a possible phonological analysis of the data, possibly with the exception of stress. That is, the symbols can be used to transcribe the observed prosodic realizations without requiring a full understanding of the phonological and prosodic system of the language.

Nevertheless, since the transcriber is forced to place a symbol in a specific location, and since the symbol necessarily refers to a specific stretch of speech, the transcriber will have to decide at which locations the changes in pitch that occur are relevant for the description of intonation contours in the language at issue, and over which domains the movements transcribed extend (i.e. does the movement which the symbol refers to extend over a single syllable, a group of syllables, or a word group, and what are the defining features of a minor and a major intonation group in the language?). Compare, for instance, the overall rising–falling pitch movements marked in the solid boxes in Figures 4.2 and 4.3, which would be an acceptable production of the first accent in a
15 The symbols ˆ and ˇ may also be used to represent a falling or a rising contour, respectively. In contradistinction to ↘ and ↗, which indicate a fall or a rise that spans a whole prosodic unit, ˆ and ˇ represent a movement that occurs on a single syllable.
0.1433 0 –0.1144 0 400
Time (s)
3.232
Pitch (Hz)
300 200
70 The banana
from Guatemala
0
has an extra quality. 3.232
Time (s)
FIGURE 4.2 Time-aligned orthographic transcription of The banana from Guatemala has an extra quality with an utterance-initial globally rising-falling movement (solid box), the segmentation unit over which the relevant accent extends (dotted box), with the middle of the accented vowel marked (double arrow).
FIGURE 4.3 Time-aligned orthographic transcription of La banana de Guatemala és de bona qualitat with an utterance-initial globally rising-falling movement (solid box), the segmentation unit over which the relevant accent extends (dotted box), with the middle of the accented vowel marked (double arrow).
Compare, for instance, the overall rising–falling pitch movements marked in the solid boxes in Figures 4.2 and 4.3, which would be an acceptable production of the first accent in a neutral declarative in English and Catalan, and many other languages (data from the April Project: Prieto, Vanrell et al. 2012); the IPA transcriptions are given in (15a) and (15b).

(15) a. The banana from Guatemala has an extra quality. /baˈnana/
     b. La banana de Guatemala és de bona qualitat. /baˈnana/
As the example shows, although the same IPA symbols could in principle be used to transcribe the pitch movements in both cases ('↗' and '↘'), different linguistic categories are represented in the two languages, which is reflected in the alignment of the peak relative to the accented syllable (marked by the double arrow in the figures). In Catalan (15b), the rise continues through the accented syllable in banana to a peak in the final syllable of the word, followed by a falling movement that is the transition between the high point and the start of the following intonational event. The movement is normally analysed as a rising accent (L+H* in Cat_ToBI: Prieto et al. 2009). In British English (15a), by contrast, the rising part of the movement ends in a peak in the accented syllable, and is followed by a fall to the following accented syllable (ma in Guatemala, here). The pitch movement is analysed as a fall (e.g. Crystal 1969; H*L in IViE: Grabe et al. 2001) which is preceded by a rising movement from the beginning of the utterance (or phrase).

The difference between these analyses arises primarily from a difference in segmentation. The segmentation unit that is considered central in the analysis of the Catalan example is the accented syllable plus any pre-accentual syllables, while in British English it is the accented syllable plus any post-accentual syllables (also referred to as the nuclear tone: Crystal 1969). In fact, the prosodic extension of the IPA has been criticized for being too heavily informed by a tonetic theory of stress-marking, which makes it relatively inflexible (see Gibbon 1990). Although its theoretical assumptions are rooted in the British tradition, the IPA is nevertheless nonprescriptive on these points. In fact, as the transcriptions of the text 'The North Wind and the Sun' recorded in different languages in the Handbook of the IPA illustrate, the conventions that are adopted for segmentation and symbol assignment diverge wildly. For instance, for some transcribers the major intonation group corresponds to the written sentence (usually marked by a full stop in the text), but for others it corresponds to the clause or Intonational Phrase (usually marked by a comma). This inconsistency makes it difficult to directly compare transcriptions produced by different transcribers in this system.

A transcriber will also need to refer to his/her knowledge of the language in order to reliably identify the accented syllables in the speech stream, since what is perceived as prominent is language-specific, depending on the way in which various acoustic correlates conspire to signal it. For instance, a bisyllabic word like railing with stress on the first syllable may be pronounced with a relatively prominent final syllable by a speaker of Punjabi English. To the average standard Southern British English listener, this is likely to sound like a stressed syllable.

The IPA stress symbols are also problematic because they obscure the relationship between perceived prominence and the linguistic structure that it is associated with, which depends on the linguistic status of prominence (or accent) in the language. For languages with fixed stress (e.g. Finnish, in which stress always falls on the first syllable of the word; see Ladd 1986 for a discussion), or languages which can be argued not to have lexical stress at all (e.g.
French, in which prominences tend to mark the right or left edge of a word group: see among others Di Cristo 2000b), marking primary and secondary stress can only be meaningful if stress is not entirely predictable, and if the two can be distinguished on principled grounds. In French, for
instance, the opposition is locational (i.e. initial vs. final) rather than one of level, as the IPA labels 'primary' and 'secondary' suggest (Astesano 2001; Di Cristo 2011).

We can conclude that, as is the case for segmental transcription (sections 2 and 3.2), neither the segmentation of the speech stream into transcription units nor the way in which symbols are assumed to associate with the units can be truly language-independent. As a consequence, only transcribers who are familiar with the linguistic system of the language will be able to provide valid and consistent transcriptions using the IPA.

Also, the prosodic extension of the IPA does not offer the same flexibility as the segmental transcription system. Prosodic systems that are not yet known will be difficult to transcribe, because the system requires the transcriber to make linguistically motivated choices in segmentation and symbol assignment. Moreover, unlike the segmental IPA symbols, the suprasegmental system lacks transparency, which can lead to inconsistency in transcriptions, while its theoretical assumptions leave it relatively inflexible and necessitate a linguistic interpretation of the speech stream during transcription.
4.4.2.2 Mertens' Transcription System and Prosogram

The system proposed by Mertens can be seen as a contour-based transcription system that differs from the IPA in two respects: first, it is language-independent; and secondly, it is a semi-automatic system, and as such likely to be more robust than the IPA. The symbolic transcriptions proposed by the system are automatically generated from Prosograms. Prosograms are semi-automatic and language-independent stylized representations of intonation contours (Mertens 2004a; 2004b; 2011; http://bach.arts.kuleuven.be/pmertens/prosogram/; last accessed 16/09/2013).

Transcription takes place in three stages. First, the speech signal is segmented into syllable-sized units, which are motivated by phonetic, acoustic, or perceptual properties. The automatic segmentation tool uses an estimation of variation in intensity to identify the boundaries of the segmentation units.16 The segmentation is indicated by vertical dotted lines in Figure 4.4.

At the second stage, the F0 curves associated with the segmentation units serve as input to an F0 stylization procedure. The stylization is based on a model of human pitch perception, which takes into account a number of factors that have been shown to affect the perception of pitch in intonation. For instance, loudness and duration have been found to play a role in whether a change in pitch is perceived as a glissando movement as opposed to a sudden jump: if F0 changes more rapidly than the threshold indicated by loudness and duration, the F0 change will be perceived as a sudden jump rather than a glissando movement. Apart from the glissando threshold, the system also takes into account the differential glissando threshold for perceiving a change in slope when it is sufficiently large, as well as changes in spectral properties and signal amplitude as cues to boundaries (e.g. demarcating syllable nuclei).
16 Manually or semi-automatically generated segmentations into syllables or syllabic nuclei can also serve as input to Stage 2.
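To make the role of the glissando threshold concrete, the following sketch tests whether an F0 change within a syllabic nucleus would be heard as a pitch movement. It assumes the commonly cited threshold formula G = k/T² semitones per second with k = 0.16; the threshold actually used in Prosogram is configurable, so this constant is an assumption for illustration only:

# Sketch: decide whether an F0 change within a syllabic nucleus would be
# perceived as a pitch movement (glissando) or as level pitch, using the
# commonly cited glissando threshold G = k / T^2 semitones per second.
import math

def semitones(f1_hz, f2_hz):
    """Pitch interval between two frequencies, in semitones."""
    return 12.0 * math.log2(f2_hz / f1_hz)

def is_glissando(f0_start_hz, f0_end_hz, duration_s, k=0.16):
    """True if the rate of pitch change exceeds the threshold k / T^2."""
    rate = abs(semitones(f0_start_hz, f0_end_hz)) / duration_s  # ST/s
    threshold = k / (duration_s ** 2)                           # ST/s
    return rate > threshold

# A 100 -> 120 Hz rise over 150 ms: perceived as a movement or as level?
print(is_glissando(100.0, 120.0, 0.150))   # True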
FIGURE 4.4 Prosograms of perceived pitch (in semitones, ST) for The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
Stylized F0 markers are level or show an inclination or curve, expressed on a musical scale in semitones, indicated by the bold lines in Figure 4.4. Their relative height is calibrated globally with reference to the speaker's pitch range, and locally to a three-syllable window to the left (capped at 450 ms).

In Mertens (2011), the markers are translated into tonal symbols at a third stage, illustrated in Figures 4.5 and 4.6. Each syllable is assigned a single symbol for its pitch height (B = Bottom of the speaker's range, L = Low, M = Mid, H = High, and T = Top of the speaker's range) or for glissandos within syllables that exceed the perception thresholds discussed above (R = Rise, F = Fall). Combinations of glissando symbols are used to transcribe complex pitch movements (RF or FR). Glissando symbols can also be combined with a symbol indicating level pitch when the slope of the change in F0 changes significantly (HF, MF, LR, HR, etc.).

The result is a symbolic transcription which provides more detail about perceptually relevant pitch in the signal than other symbolic transcription systems that make reference to pitch levels (e.g. INTSINT and ToBI), without requiring the transcriber to take a position on the theoretical nature and the inventory of intonation units that are deemed to be relevant at the abstract linguistic level for the language that is being transcribed (i.e. which contours there are, whether they are analysable as tones, what phrasal units are involved, etc.). The advantage is that intonational events can be transcribed even if the intonation system of the language in question is not known, but the disadvantage is that, since the transcription is not categorical or discrete, it does not allow the transcriber to distinguish between perceptually relevant F0 information and information that is phonologically relevant in the language.
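The third stage can be illustrated with a sketch that maps one stylized syllable target onto a pitch-height symbol (B, L, M, H, T) or a glissando symbol (R, F). The quantile boundaries and the glissando criterion used here are our own illustrative choices, not Mertens' actual calibration:

# Sketch of the third stage: map a stylized pitch target (start, end, in
# semitones) onto one of the pitch-height symbols relative to the speaker's
# range, or onto a glissando symbol when the within-syllable movement is
# large enough. All thresholds below are illustrative assumptions.
def pitch_symbol(start_st, end_st, range_bottom_st, range_top_st,
                 glissando_st=3.0):
    movement = end_st - start_st
    if abs(movement) >= glissando_st:          # audible glissando
        return "R" if movement > 0 else "F"
    rel = (start_st - range_bottom_st) / (range_top_st - range_bottom_st)
    if rel < 0.10: return "B"                  # bottom of the range
    if rel < 0.35: return "L"
    if rel < 0.65: return "M"
    if rel < 0.90: return "H"
    return "T"                                 # top of the range

# Syllable targets (start, end) in ST, for a speaker range of 80-95 ST
for tgt in [(82.0, 82.5), (90.0, 90.5), (84.0, 88.0)]:
    print(tgt, pitch_symbol(*tgt, range_bottom_st=80.0, range_top_st=95.0))
# -> L, H, R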
FIGURE 4.5 Prosograms of perceived pitch (in semitones, ST, top) with symbolic representations of pitch height added (bottom) for Et j’ai donc été à l’école Remington apprendre la sténodactylo (extract from Mertens 2011).
FIGURE 4.6 Prosograms of perceived pitch (in semitones, ST, top) with symbolic representations of pitch height added (bottom) for The North Wind and the Sun were disputing which was the stronger when a traveler came along wrapped in a warm cloak.
Clear strengths of the system are that it takes into account acoustic parameters other than F0, while ensuring robustness by taking the syllable (or syllable nucleus) as the minimal unit of input for a (semi-)automatically generated symbolic transcription. A weakness, however, is that segmentation is error-prone when intensity at phrasal edges is low, while manual segmentation and segmentation corrections can be quite costly. Another potential weakness is that the system is not geared towards the analysis of prosodic boundaries and prominence relations. Finally, it remains to be seen whether the system is truly language-independent: in principle, if all and only auditory perceptual information is represented, language-specific effects on the interpretation of acoustic cues to suprasegmental features should be taken into account, as well as effects that are attributable to the human auditory system.
4.4.3 Target-Based Transcription Systems and Intonational Phenomena

4.4.3.1 ToBI

The ToBI (Tones and Break Indices) transcription system was originally developed to transcribe intonation and prosodic phrasing in American English (Silverman et al. 1992; Beckman et al. 2005). The system is couched in the Autosegmental-Metrical framework (e.g. Bruce 1977; Pierrehumbert 1980) and was based on the analysis of American English proposed by Pierrehumbert (1980) and Pierrehumbert and Beckman (1988). It has since been adapted to many other languages and language varieties (Jun 2005a; 2014; Prieto and Roseano 2010), which has led to the introduction of some additional features to account for the diversity of the prosodic systems under study (e.g. separate transcription tiers for underlying and surface phonology in Korean: Jun 2005b).

In the original ToBI system, transcription takes place on four tiers (see Beckman et al. 2005; and also http://www.ling.ohio-state.edu/~tobi/ame_tobi/annotation_conventions.html; last accessed 16/09/2013):

• a tonal tier, which gives the pitch accents, phrase accents, and boundary tones that are realized in the speech stream (their number and nature depend on the intonational system of the language);
• a break index tier, which gives a rating of the degree of juncture between words and phrases (5 levels);
• an orthographic tier, which gives a transcription of orthographic words and phenomena such as filled pauses (the system does not impose any particular conventions);
• a miscellaneous tier for transcriber comments.

As the example in Figure 4.7 shows, the transcription tiers are time-aligned with the sound wave and the fundamental frequency trace of the speech sample, where each element on each tier receives its own time stamp(s). Pitch accents consist of one- or two-tone labels (H*, L*, L*+H, L+H*, and H+!H* for American English) in which the tone that is associated with the accented syllable is marked by a star (but see Arvaniti et al. 2000 for a discussion). Other pitch accent types, including three-tone labels, have been introduced to transcribe pitch accents in other languages (see Jun 2005a). Downstep, the lowering of a high tone relative to a preceding high tone, is marked by placing the diacritic ! immediately before the high tone that is lowered according to the analysis. Phrase accents mark intermediate phrase boundaries (labelled 3 or higher on the break index tier) and can be High (H-) or Low (L-); H- tones can also be downstepped. Boundary tones mark the edges of Intonation Phrases (Break Index level 4), and they can also be High (H%) or Low (L%).
FIGURE 4.7 Time-aligned ToBI transcription for the utterance The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.
The Break Index values described in the original ToBI guidelines (http://www.ling.ohio-state.edu/~tobi/ame_tobi/annotation_conventions.html) are intended to represent the degree of juncture realized, but they are partly determined with reference to morphosyntactic structure. For instance, a value of 0 is used for cases of clear phonetic marks of clitic groups, e.g. the medial affricate in contractions of did you or a flap as in got it; level 1 is defined with reference to the word alone (most phrase-medial word boundaries); and levels 3 and 4 are defined with reference to intonational realization alone (e.g. 3 is marked by a single phrase tone affecting the region from the last pitch accent to the boundary). Disfluencies are also marked at this level. Various diacritics can be used to indicate uncertainty about the strength of a juncture or a tone in the relevant tiers.

Where these conventions are followed, there is considerable redundancy among tiers (e.g. between boundary tone labels and break index label 4, between phrase tone labels and break index label 3, and between break index locations and the orthographic tier). As the authors of the guidelines point out, routines can be used for the automatic generation of redundant labels, which will improve consistency and save transcriber time (http://www.ling.ohio-state.edu/~tobi/ame_tobi/annotation_conventions.html). A more general issue with Break Indices is that they have proved quite elusive when determined on the basis of prosodic realization alone, resulting in low transcriber agreement. External criteria can be invoked to improve consistency, but in that case the question arises whether this makes transcribing Break Indices altogether superfluous.

The example in Figure 4.7 also shows that transcription takes place at a phonemic level of representation, i.e. the symbols and segmentation units represent phrasal and intonational elements that function contrastively in the language that is being transcribed. This implies that the symbols and units used to transcribe a specific data set are
language-specific and, conversely, that the transcriber will need to work with a known set of symbols which represent the primitives of the prosodic system of that language.

The second implication of transcription at a phonemic level is that the transcription symbols and segmentation units that are used cannot be theory-neutral, since they are in effect used to provide a phonological analysis of the data. That is, in using the ToBI transcription system one not only considers that the underlying principles of the Autosegmental-Metrical framework hold for the language that is being transcribed (phrasing and intonation are considered to be closely intertwined, and accented syllables and phrasal edges function as the loci of contrast; turning points rather than holistic contours are the primitives of analysis, etc.), but also accepts the language-specific phonological generalizations made in the definition of the symbols (phonotactic constraints that may apply to tonal configurations, or constraints on the syntax–phonology mapping which determine the permissible prosodic phrasing structures in the language).

The language-specific, phonological nature of the transcription system can be considered a strength when a linguistic analysis of the prosodic phrasing and the intonational events in the speech sample is required by the objectives of the transcription, and if the language at issue has been successfully analysed in an Autosegmental-Metrical framework. Since transcribers are forced to choose between contrasting categories, not only the form but also the linguistic interpretation of the form is simultaneously encoded. For instance, in a language that has a contrast between L+H* and L*+H, a rising pitch movement that is associated with an accented syllable has to be either L+H* or L*+H. By choosing one over the other, the transcriber records not only a possible difference in form (e.g. in the timing of the targets relative to the accented syllable) but also the associated difference in meaning.

ToBI is not suitable for the transcription of suprasegmental phenomena other than phrasing and intonation. For instance, metrical or rhythmic structure may be realized in the signal by means of durational cues instead of pitch accents, or suprasegmental cues may occur in the signal for units that are larger than the Intonation Phrase (e.g. conversational turns). These phenomena may be phonological in nature, and they may not be marked by a specific tonal configuration, but in any case ToBI was not designed to deal with such data.

The phonological nature of the system can also be considered a weakness in the context of data sets in which the phonological system of the speaker is in flux or (partly) unknown, such as first and second language learner data, pathological speech data, and data from languages or language varieties that have not been fully analysed. Such data sets can only be transcribed in systems that allow for transcriptions at a phonetic level (see Gut, this volume), although a ToBI-style system can be used insightfully to draw up hypotheses about a speaker's suprasegmental phonology.
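The four-tier architecture can be made concrete with a small sketch representing a ToBI-style annotation as time-stamped label tracks; the data are invented, and in practice such tiers would be stored in a time-aligned annotation file (e.g. a Praat TextGrid). The routine at the end illustrates the kind of consistency check the guidelines mention, here verifying that every break index 4 co-occurs with a boundary tone:

# Invented ToBI-style annotation, one track per tier (times in seconds).
tobi_annotation = {
    # tonal tier: pitch accents, phrase accents, boundary tones
    "tones":  [(0.32, "H*"), (0.81, "L-"), (1.45, "L+H*"), (2.10, "L-L%")],
    # break index tier: juncture rating (0-4) at each word boundary
    "breaks": [(0.55, "1"), (0.84, "3"), (1.70, "1"), (2.10, "4")],
    # orthographic tier: (start, end, word) intervals
    "words":  [(0.10, 0.55, "the"), (0.55, 0.84, "North"),
               (0.84, 1.70, "Wind"), (1.70, 2.10, "and")],
    # miscellaneous tier: transcriber comments
    "misc":   [(1.45, "uncertain accent type")],
}

def check_redundancy(annotation, tol=0.05):
    """Flag break-index-4 marks that lack a co-timed boundary tone (...%)."""
    problems = []
    for t, bi in annotation["breaks"]:
        has_boundary_tone = any(abs(tt - t) <= tol and lab.endswith("%")
                                for tt, lab in annotation["tones"])
        if bi == "4" and not has_boundary_tone:
            problems.append(t)
    return problems

print(check_redundancy(tobi_annotation))   # [] -> tiers are consistent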
4.4.3.2 IViE / IV

The IViE system was developed for the transcription of intonational variation in English spoken in the British Isles, for which the phonological systems are unknown (Grabe
et al. 2001; see Nolan and Post, this volume). Originally modelled on ToBI, it offers three transcription tiers to record the prosodic information in the speech signal, separating auditory phonetic and abstract phonological levels of representation for the transcription of intonation, in addition to a rhythmic tier on which perceptual prominence and phrasing can be marked. The three tiers are motivated by research on crossvarietal differences in English intonation, which shows that varieties can differ in (i) the location of rhythmic prominences, (ii) the phonetic realization of particular pitch accents, and (iii) the inventory of contrasting intonation contours (Grabe and Post 2004; Grabe 2004). IV is an adaptation of IViE that can be used for other languages, and which includes a fourth prosodic tier, which allows for the transcription of global intonational events that operate across sequences of phrases or utterances (e.g. register shift at topic boundaries; see Post et al. 2006; Delais-Roussarie et al. 2006).

The IV template consists of six tiers, as shown in Figure 4.8.

FIGURE 4.8 Time-aligned IV transcription of The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.

On the highest tier, miscellaneous comments and alternative transcriptions can be recorded. On the lowest, the text of the utterance is transcribed, word by word, as in ToBI; labellers begin a transcription by filling the orthographic tier. The middle tiers provide information about prosody. One (just above the orthographic tier) is dedicated to encoding the location of rhythmically prominent syllables and boundaries. No distinction is made here between lexically stressed and intonationally accented syllables, and, unlike in ToBI, no distinction in boundary weight is made. Hesitations and interruptions that affect the rhythmic properties of the utterance are also marked here. In the following tier (third from the lowest), the pitch movement surrounding rhythmically prominent syllables is determined, which can capture auditory phonetic differences in implementation between languages or
dialects of languages. For instance, in Cambridge English, pitch movements are compressed when they are realized on IP-final words with little scope for voicing, but in Liverpool English they are truncated in the same context (Grabe et al. 2000), as is illustrated in Figure 4.9.

FIGURE 4.9 Different phonetic implementations of the same phonological category in British English transcribed on the local phonetic tier: a nuclear fall (H*L %) is compressed in Cambridge and truncated in Liverpool (read passage data from the IViE project).

Figure 4.9 shows how a difference in phonetic realization is transcribed on the phonetic tier in IViE, while the symbols on the other prosodic tiers are the same for the two dialects in the figure. The realization of pitch accents is transcribed within Implementation Domains or IDs. In English, an ID contains (i) the preaccentual syllable, (ii) the accented syllable, and (iii) any following unaccented syllables up to the next accented syllable. Hence, IDs overlap by one syllable. Pitch is transcribed on three levels: h(igh), m(id), and l(ow). The levels are transcribed relative to each other, with capital letters indicating the pitch level of the prominent syllable relative to that of the surrounding syllables.

Finally, on the fourth tier, and only after rhythmic structure and pitch movement have been transcribed, phonological generalizations are made and noted on the phonological tier. Labels are based on existing Autosegmental-Metrical models of the language in question, making the tier language-specific as well as theory-specific (e.g. Gussenhoven's analysis of British English is adopted for the standard variety spoken in southern England: Gussenhoven 2004).

Symbolically representing auditory phonetic suprasegmental information has the advantage that phenomena can be captured and quantified even when their linguistic relevance is not yet established. This can be advantageous in the analysis of crosslinguistic or crossdialectal phenomena like truncation and compression in intonational implementation, but also when the transcriber is confronted with learner data and pathological speech, for which surface realizations can be observed but the underlying phonological system cannot. Since phonological interpretation can follow auditory interpretation, the intonational system does not need to be fully understood for the transcription to be possible. Figure 4.10 illustrates how the non-target-like production of a Spanish L2 learner of English can be transcribed in IViE.
FIGURE 4.10 IViE transcriptions of the intonation contours produced in the utterance You are what you are by a native speaker of standard Southern British English (top panel) and a Spanish L2 learner of English (bottom panel).
A comparison of the realization of the first pitch accent produced by the native speaker and the learner shows a clear difference in realization in what globally looks like an overall rising–falling contour, with a peak that is timed much earlier in the accented syllable are for the native speaker than for the learner. Also, the contour continues to rise after the accented syllable in the L2 English example, while it falls in the native speaker's speech. These differences are transcribed on the local phonetic tier, while the phonological tier shows possible phonological analyses for the contours that are observed. The comments tier is used here to highlight the fact that the label on the phonological tier is tentative at this stage. It indicates that the learner's intonation contour could either be a straightforward case of transfer of a phonological category from Spanish to English (since L+H* rises are very common in this position in Spanish, but they do not occur in this native variety of English), or that the pitch accent has been phonetically implemented in a non-target-like way, with an unusually late peak.

Another strength of the system is that the dissociation of intonational and prominence tiers allows for the transcription of prominent syllables (and prosodic boundaries) which are not marked by a pitch movement. In addition, the multilinear time-aligned transcription of prosodic events at different levels allows for the integration of phonetic, phonological, and rhythmic/prosodic information in the transcription.

However, the IV(iE) system also has the limitation that it relies on segmentation into Intonation Phrases and other intonational units, which are language-dependent, theory-internally defined, and not always robust because they are not always easy to identify. Moreover, like ToBI and the IPA, it is a manual system, which makes transcription time-consuming and liable to inconsistencies and errors.
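The definition of Implementation Domains given above lends itself to a simple sketch: given the indices of the accented syllables, each ID runs from the preaccentual syllable up to (but not including) the next accent, so that consecutive IDs share one syllable. The syllabified example is invented for illustration:

# Sketch: compute IViE Implementation Domains (IDs) over a syllable string,
# given the indices of accented syllables. Consecutive IDs overlap by one
# syllable (the preaccentual syllable of the following accent).
def implementation_domains(n_syllables, accents):
    """Return (start, end) index pairs, end exclusive, one ID per accent."""
    ids = []
    for k, a in enumerate(accents):
        start = max(a - 1, 0)                           # preaccentual syllable
        nxt = accents[k + 1] if k + 1 < len(accents) else n_syllables
        ids.append((start, nxt))                        # up to the next accent
    return ids

syllables = ["the", "ba", "NA", "na", "from", "gua", "te", "MA", "la"]
accents = [2, 7]                                        # NA and MA accented
for start, end in implementation_domains(len(syllables), accents):
    print(syllables[start:end])
# -> ['ba', 'NA', 'na', 'from', 'gua', 'te'] and ['te', 'MA', 'la']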
4.4.3.3 Momel-INTSINT

Momel-INTSINT is a semi-automatic target-based transcription system. It generates a transcription of the intonation contour based on a series of turning points, which are automatically calculated from the fundamental frequency by the Momel algorithm (Hirst and Espesser 1993; Hirst et al. 2000). By discarding all microprosodic F0 variation in the signal (i.e. all the F0 variation that results from the phonetic characteristics of the various segments and applies below the level of the syllable), and linking the peaks and valleys that it detects in the derived curve (i.e. the turning points), the algorithm transforms the discontinuous raw F0 curve into a continuous curve, which is intended to function as its auditory equivalent. Subsequently, each turning point is automatically assigned one of the eight INTSINT symbols given in (16) (detection errors have to be corrected manually).

(16) INTSINT symbols
• symbols representing absolute values defined relative to speaker register: Top (T), Mid (M), Bottom (B);
• symbols representing relative values defined locally with reference to preceding values: Higher (H), Same (S), or Lower (L);
• symbols representing small changes in relative values defined locally with reference to preceding values: Upstepped (U) or Downstepped (D).

The INTSINT symbols are designed to represent intonational events at the surface phonological level of description, i.e. they are intended to transcribe all and only linguistically relevant pitch information. Figure 4.11 gives the original F0 trace, while Figure 4.12 gives the Momel-derived curve, the turning points, and the associated INTSINT symbols for the same utterance. The example shows that most of the turning points are located in syllables which function as the locus of an intonational contrast (usually accented syllables or syllables at phrase boundaries, e.g. a rising accent (H) as opposed to a falling one (L) on North in the example), but some turning points occur in positions which do not typically carry meaningful changes in pitch (e.g. L on the).

FIGURE 4.11 F0 curve associated with the utterance The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.

FIGURE 4.12 F0 trace, Momel pitch curve, turning points, and INTSINT labels for the same utterance.

Momel-INTSINT's main strength is that it allows large quantities of speech data to be annotated semi-automatically, requiring only manually inserted Intonation Units as input. Since equating Intonation Units with interpausal units usually gives good results (where a pause is defined as a silent interval of more than 200 ms), and pauses can be detected automatically, very little manual labelling is required with this system in practice. This makes the system very robust, since automatically generated transcriptions leave no room for human error.

The second advantage is that both the naturalness and the appropriateness of the smoothed Momel curves and the INTSINT encoding can be empirically tested, because Momel turning points can be translated into INTSINT symbols and vice versa. This provides an easily accessible and powerful tool to test, for instance, the perceptual effects of
the tonal sequences that are theoretically allowed in INTSINT, and hence their phonotactic validity.

The third and probably most interesting advantage from the point of view of the analysis of phonological corpora is that the transcription of the suprasegmental information can take place without recourse to the phonological system of the language in question. This characteristic of Momel-INTSINT facilitates crosslinguistic typological comparisons (see Hirst and Di Cristo 1998).

However, since accented syllables and phrasal boundaries other than those of the Intonation Unit are not always detected and identified, the system may not be suitable for research that involves prosodic phrasing and prominence. Perhaps more importantly, as with Mertens' system, the question arises to what extent the system is a sophisticated F0 stylization tool rather than a properly discrete symbolic annotation system that reflects all and only the linguistically relevant information in the signal.17

The linguistic status of the INTSINT symbols as surface phonological tones raises a number of theoretical issues. If the symbols represent phonological tones, they should be discrete, and they should combine into configurations which carry contrastive linguistic meaning (e.g. Gussenhoven 2004). Although they clearly meet the former requirement, it is not the case that an automatically derived tonal sequence, which represents perceptually relevant changes in pitch rather than contrastively meaningful changes in pitch, necessarily meets the second requirement. For instance, in some cases in which an M is generated instead of a B, it is not clear that substitution of the former with the latter would lead to a different interpretation of the signal.18 In fact, one could argue that the transcriber's linguistic brain is required to determine whether a variant is phonologically contrastive or not, especially when the language has not been prosodically analysed before.

17 Unlike Mertens' system, Momel-INTSINT only takes macroprosodic effects into account and not more general perceptual effects.
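For illustration, the logic of INTSINT labelling can be sketched as follows: absolute symbols (T, M, B) are anchored to the speaker's register, and relative symbols (H, S, L, U, D) are assigned with reference to the preceding target. All thresholds in this sketch are our own illustrative choices and do not reproduce the actual INTSINT estimation procedure:

# Sketch of INTSINT-style labelling of a sequence of turning-point pitch
# values (Hz). Absolute labels are used for the first target and for large
# jumps; otherwise a relative label is chosen. Thresholds are illustrative.
import math

def intsint_labels(targets_hz, key_hz, range_st=12.0):
    """Label each turning point; key_hz = the speaker's mid register."""
    top = key_hz * 2 ** (range_st / 2 / 12)        # register top (Hz)
    bottom = key_hz * 2 ** (-range_st / 2 / 12)    # register bottom (Hz)
    labels, prev = [], None
    for f in targets_hz:
        step = None if prev is None else 12 * math.log2(f / prev)  # ST
        if prev is None or abs(step) > 4.0:        # anchor to the register
            if f >= top * 0.95:      labels.append("T")
            elif f <= bottom * 1.05: labels.append("B")
            else:                    labels.append("M")
        elif abs(step) < 0.5:        labels.append("S")   # same
        elif step > 2.0:             labels.append("H")   # higher
        elif step > 0:               labels.append("U")   # upstepped
        elif step < -2.0:            labels.append("L")   # lower
        else:                        labels.append("D")   # downstepped
        prev = f
    return labels

print(intsint_labels([120, 190, 180, 150, 95], key_hz=130))
# -> ['M', 'T', 'D', 'L', 'B']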
4.5 Conclusion

We have presented various systems that are commonly used to provide a discrete symbolic representation of the phonetic and phonological content of a speech sample. Such a representation abstracts away from the signal in all its complexity for several reasons. First, it usually provides information on the linguistically relevant events that occur in the speech signal, leaving aside any other elements. Secondly, it relies on a segmentation of the speech continuum into intervals to which labels are assigned, segments and labels generally being defined on a language-specific basis, and theory-internally.

The systems most commonly used to provide information about the segmental content of the signal presented here are the orthographic system and the IPA. Both are clearly language-dependent and phonological by nature, in the sense that any segmental transcription relies on an interpretation of the linguistic content of the signal. An orthographic transcription has several advantages, among which we may mention its readability, which facilitates data exchange. It also has some limitations: it cannot capture exactly how the words were pronounced by the speakers. By contrast, the IPA (and SAMPA) provides representations that can be more detailed, depending on the level at which the data have been transcribed (phonemic vs narrow phonetic). This is clearly one of its main advantages. Note, however, that broad phonemic and narrow phonetic transcriptions are difficult to achieve in terms of accuracy, and are very time-consuming. To our knowledge, no systematic empirical studies have been carried out to evaluate intertranscriber agreement in the context of atypical data (regional varieties, L1 acquisition data, L2 learner data, pathological speech, etc.).

Several systems have been proposed to encode suprasegmental features. In contradistinction to systems used to encode segmental information, prosodic transcription systems are not necessarily language-dependent, in particular when they provide an abstract representation of tonal features, as in INTSINT and in the symbolic extension of Prosogram (Mertens 2011). This results from the fact that the representations provided by these systems consist of discrete and stylized versions of the melodic curve. As soon as prosodic transcriptions represent phrasing and prominence structures, in addition to tonal events, they are language-dependent and, to a certain extent, theoretically driven, as they rely on analyses of the relation between phrasing, accentuation, and tonal events.

18 The same objection can be levelled at ToBI and IViE/IV to the extent that any tonal configuration that is allowed by the transcription system will need to be shown to be legal and contrastive according to the intonational grammar of the language that is being transcribed.
In any case, when one wants to annotate phonological or phonetic features in a corpus, one has to keep in mind that an ideal system does not exist, either at the segmental or at the suprasegmental level. All systems have their strengths and weaknesses, or advantages and disadvantages, depending on research contexts and objectives. To choose one system over another, it is crucial to evaluate which level of analysis is required (phonetic vs phonological), which units are relevant for the research question at hand, and which types of label and representation are the most adequate in the context of the research.
Acknowledgements

We are very grateful to Ulrike Gut and an anonymous reviewer for their helpful comments on an earlier version of this chapter. This work was supported by a joint project grant from the British Academy and the CNRS ('A Transcription System for Prosody and Phonological Modelling'). The first author would also like to acknowledge support from the Labex EFL, 'Empirical Foundations in Linguistics' (ANR/CGI), and the second author support from the ESRC ('Categories and Gradience in Intonation', RES-061-25-0347).
CHAPTER 5

ON AUTOMATIC PHONOLOGICAL TRANSCRIPTION OF SPEECH CORPORA

HELMER STRIK AND CATIA CUCCHIARINI
5.1 Introduction

Within the framework of Corpus Phonology, spoken language corpora are used for conducting research on speakers' and listeners' knowledge of the sound systems of their native languages, and on the laws underlying such sound systems as well as their role in first and second language acquisition. Many of these studies require a phonological annotation of the speech data contained in the corpus. The present chapter discusses how such phonological annotations can be obtained (semi-)automatically.

An important distinction that should be drawn is whether or not the audio signals come with a verbatim (orthographic) transcription of the spoken utterances. If an orthographic annotation is available, the (semi-)automatic phonological annotation could in theory be derived directly from the verbatim transcription through a simple conversion procedure, without an automatic analysis of the original speech signals. In this case, the strings of symbols representing the graphemes in words are replaced by corresponding strings of symbols representing phonemes. This can be achieved by resorting to a grapheme–phoneme conversion algorithm, through a lexicon look-up procedure in which each individual word is replaced by its phonological representation as found in a pronunciation dictionary, or by a combination of the two (Binnenpoorte 2006). The term 'phonological representation' is often used for this kind of annotation, as suggested by Oostdijk and Boves (2008: 650). It is important to realize that such a phonological representation does not provide information on the speech sounds that were actually realized, but relies on what is already known about the possible
ways in which words can be pronounced (pronunciation variants). Such pronunciation variants may also refer to sandhi phenomena, and can be used to model processes such as cross-word assimilation or phoneme intrusion, but the choice of the variants will be based on phonological knowledge and not on an analysis of the speech signals.

Alternatively, a phonological annotation of the words in a spoken language corpus can be obtained automatically through an analysis of the original speech signals by means of automatic speech recognition algorithms. In this case, previously trained acoustic models of the phonemes to be identified are used together with the speech signal and the corresponding orthographic transcription, if available, to provide the most likely string of phonemes reflecting the speech sounds that were actually realized. In this chapter we will reserve the term 'automatic phonological transcription' for this latter type of analysis, as suggested by Oostdijk and Boves (2008: 650). It is also this form of phonological annotation, and the various (semi-)automatic procedures to obtain, evaluate, and optimize it, that will be the focus of the present chapter.

Phonological transcriptions have long been used in linguistic research, for both explorative and hypothesis-testing purposes. More recently, phonological transcriptions have proven to be very useful for speech technology too, for example for automatic speech recognition and for speech synthesis. In addition, the development of multi-purpose speech corpora that we have witnessed in recent decades, such as TIMIT (Zue et al. 1990), Switchboard (Godfrey et al. 1992), Verbmobil (Hess et al. 1995), the Spoken Dutch Corpus (Oostdijk 2002), the Corpus of Spontaneous Japanese (Maekawa 2003), Buckeye (Pitt et al. 2005), and 'German Today' (Brinckmann et al. 2008), has underlined the importance of phonological transcriptions of speech data, because these considerably increase the value of such corpora for scientific research and application development.

Both orthographic transcriptions and phonological transcriptions are known to be time-consuming and costly. In general, the more detailed the transcription, the higher the cost. Orthographic transcriptions appear to be produced at speeds varying from three to five times real-time, depending on the quality of the recording, the speech style, the quality of the transcription desired, and the skill of the transcriber (Hazen 2006). Highly accurate transcriptions that account for all speech events (filled pauses, partial words, etc.) as well as other meta-data (speaker identities and changes, non-speech artefacts and noises, etc.) can take up to 50 times real-time, depending on the nature of the data and the level of detail of the meta-data (Barras et al. 2001; Strassel and Glenn 2004). Making phonological transcriptions from scratch can take up to 50–60 times real-time, because in this case transcribers have to compose the transcription and choose a symbol for every single speech sound.

An alternative, less time-consuming procedure consists in having transcribers correct an example transcription, i.e. a transcription of the word in question taken from a lexicon or a dictionary, which transcribers can edit and improve after having listened to the corresponding utterance. Both for Switchboard and for the Spoken Dutch Corpus, transcription costs were restricted by presenting trained students with an example transcription. The students were asked to
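The lexicon look-up route to a phonological representation can be sketched as follows; the miniature SAMPA-style lexicon and the grapheme–phoneme fallback stub are invented for illustration, but the sketch makes the key point visible: the output is driven entirely by prior phonological knowledge, not by the speech signal:

# Sketch: derive a phonological representation from an orthographic
# transcription by lexicon look-up, with a stubbed grapheme-phoneme
# converter as a fallback for out-of-vocabulary words.
LEXICON = {
    "the": ["D @", "D i:"],        # pronunciation variants
    "north": ["n O: T"],
    "wind": ["w I n d", "w I n"],  # final /d/ may be deleted
}

def g2p_fallback(word):
    """Stub for a grapheme-phoneme conversion algorithm (OOV words)."""
    return " ".join(word)          # placeholder only, not a real G2P

def phonological_representation(utterance, variant=0):
    out = []
    for word in utterance.lower().split():
        variants = LEXICON.get(word)
        out.append(variants[min(variant, len(variants) - 1)]
                   if variants else g2p_fallback(word))
    return " # ".join(out)         # '#' marks word boundaries

print(phonological_representation("the North wind"))
# -> D @ # n O: T # w I n d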
verify this transcription rather than transcribing from scratch (Greenberg et al. 1996; Goddijn and Binnenpoorte 2003). Although such a check-and-correct procedure is very attractive in terms of cost reduction, it has been suggested that it may bias the resulting transcriptions towards the example transcription (Binnenpoorte 2006). In addition, the costs involved in such a procedure are still quite substantial: Demuynck et al. (2002) reported that the manual verification process took 15 minutes for one minute of speech recorded in formal lectures, and 40 minutes for one minute of spontaneous speech.

Because of the problems involved in obtaining phonological transcriptions (the time required, the high costs incurred, the often limited accuracy obtained, and the need to transcribe large amounts of data), researchers have been looking for ways of automating this process, for example by employing speech recognition algorithms.

The advantages of automatic phonological transcriptions become really evident when it comes to exploring large speech databases. First, automatic phonological transcriptions make it possible to achieve uniformity in transcription. With manual transcription this aim would be utopian: large amounts of speech data cannot possibly be transcribed by one person, and the more transcribers are involved, the less uniform the transcriptions are going to be. Eliminating part of this subjectivity in transcriptions can be very advantageous, especially when analysing large amounts of data. Second, with automatic methods it is possible to generate phonological transcriptions of huge amounts of data that would otherwise remain unexplored. The fact that large amounts of material can be analysed in a relatively short time, and at relatively low cost, makes automatic phonological transcription even more interesting. The importance of this aspect for the generalizability of the results cannot be overestimated. And although the automatic procedures used to generate automatic phonological transcriptions are not infallible, the advantages of a very large dataset might very well outweigh the errors introduced by the mistakes the automatic procedures make.

In this chapter we provide an overview of the state of the art in automatic phonological transcription, paying special attention to the most relevant methodological issues and the ways in which they have been approached.
5.1.1 Types of Phonological Transcription

Before we discuss the various ways in which phonological transcriptions can be obtained (semi-)automatically, it is worth clarifying a number of terms that will be used in the remainder of this chapter. First of all, a distinction should be drawn between segmental and suprasegmental phonological transcriptions. In this chapter, the focus will be on the former, and in particular on (semi-)automatic ways of obtaining segmental annotations of speech; but this is not to say that there has been no research on (semi-)automatic transcription of suprasegmental processes. For instance, important work in this direction was carried out in the 1990s within the framework of the Multext project (Gibbon and Llisterri 1994; Llisterri 1996), and more recently by Mertens (2004b), Tamburini and
Caini (2005), Obin et al. (2009), Avanzi, Lacheret et al. (2010), and Lacheret et al. (2010). A discussion of these approaches is, however, beyond the scope of this chapter.

Even within the category of segmental phonological transcriptions, different types can be distinguished depending on the symbols used and the degree of detail recorded in the transcriptions. In the literature, various classifications have been provided on the basis of these parameters (for a brief overview, see Cucchiarini 1993). With respect to the notation symbols, in this chapter we will be concerned only with alphabetic notations, in particular with computer-readable notations, as this is a prerequisite for automatic transcription.

Over the years, different computer-readable notation systems have been developed. A widely used one in Europe is SAMPA (Wells 1997; http://www.phon.ucl.ac.uk/home/sampa/), a mapping between symbols of the International Phonetic Alphabet and ASCII codes, which was established through a consultation process among international speech researchers. X-SAMPA is an extended version of SAMPA intended to cover every symbol of the IPA Chart, so as to make it possible to provide a machine-readable phonetic transcription for every known human language. However, many other systems have been introduced. Arpabet was developed by the Advanced Research Projects Agency (ARPA) and consists of a mapping between the phonemes of General American English and ASCII characters. Worldbet (Hieronymus 1994) is a more extended mapping between ASCII codes and IPA symbols intended to cover a wider set of the world's languages. Besides the computer phonetic alphabets mentioned here, many others exist (see e.g. Hieronymus 1994; EAGLES 1996; Wells 1997; Draxler 2005).

With regard to the degree of detail to be recorded in transcription, two types of transcription are generally distinguished: broad phonetic, or phonemic, and narrow phonetic, or allophonic. A broad phonetic transcription indicates only the distinctive units of an utterance, thus presupposing knowledge of the precise realization of the sounds transcribed; a narrow phonetic transcription attempts to provide such details. For transcriptions made by human transcribers, it holds that the more detailed the transcription, the more time-consuming and costly it will be. In addition, more detailed transcriptions are likely to be less consistent. For automatically generated transcriptions, too, it holds that recording more details requires more effort, albeit not in the same way and to the same extent as for manually generated transcriptions.

The type of phonological transcription that is generally contained in spoken language corpora is broad phonetic, although narrow transcriptions are sometimes also provided, as for example in the Switchboard corpus (Greenberg et al. 1996). In general, the degree of detail required of a transcription will essentially depend on the aim for which the transcription is made. In the case of multi-purpose corpora of spoken language, it is therefore often decided to adopt a broad phonetic transcription, and to make it possible for individual users to add the details that are relevant for their own research at a later stage.

An important question that is partly related to the degree of detail recorded in phonological transcription concerns the accuracy of transcriptions. As a matter of fact, there
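A fragment of such a mapping can be sketched as follows; only a handful of English SAMPA symbols are shown (the full tables are available at the SAMPA website cited above), and the conversion function assumes space-delimited symbols:

# Illustrative fragment of the SAMPA-to-IPA mapping; symbols not listed
# simply pass through unchanged.
SAMPA_TO_IPA = {
    "T": "θ", "D": "ð", "S": "ʃ", "Z": "ʒ", "N": "ŋ",
    "I": "ɪ", "E": "ɛ", "{": "æ", "V": "ʌ", "@": "ə", "i:": "iː",
}

def sampa_to_ipa(transcription):
    """Convert a space-delimited SAMPA string to IPA, symbol by symbol."""
    return "".join(SAMPA_TO_IPA.get(sym, sym) for sym in transcription.split())

print(sampa_to_ipa("D @ n O: T w I n d"))
# -> ðənO:θwɪnd  (O: is not in this fragment and passes through)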
is a trade-off relation between degree of detail and accuracy (Gut and Bayerl 2004), defined in terms of reliability and validity (see section 1.3): the more details the transcription contains, the less likely it is to be accurate. This may imply that, although a high level of detail may be considered essential for a specific linguistic study, for instance on degree of devoicing or aspiration, it may nevertheless turn out to be extremely difficult or impossible to obtain transcriptions of that specific phenomenon that are sufficiently accurate.

This brings us to another important aspect of phonological transcription in general, and of automatic phonological transcription in particular: that of its accuracy. This will be discussed below.
5.1.2 Evaluation of Phonological Transcriptions

Before phonological transcriptions can be used for research or applications, it is important to know how accurate they are. The problem of transcription quality assessment is not new: for manual phonological transcriptions, too, it is important to establish their accuracy before using them for research or applications (Shriberg and Lof 1991; Cucchiarini 1993; 1996; Wesenick and Kipp 1996). Phonological transcriptions, whether they are obtained automatically or produced by human transcribers, are generally used as a basis for further processing (research, ASR (automatic speech recognition) training, etc.). They can be viewed as representations or measurements of the speech signal, and it is therefore legitimate to ask to what extent they achieve the standards of reliability and validity that are required of any form of measurement.

With respect to automatic transcriptions, the problem of quality assessment is complex, because comparison with human performance, which is customary in many fields, is not straightforward, owing to the subjectivity of human transcriptions and to a series of methodologically complex issues that will be explained below.
5.1.3 Reliability and Validity of Phonological Transcriptions

In general terms, the reliability of a measuring instrument represents the degree of consistency observed between repeated measurements of the same object made with that instrument; it is an indication of the degree of accuracy of a measuring device. Validity, on the other hand, is concerned with whether the instrument measures what it purports to measure. (The definitions of reliability and validity used in test theory are in fact much more complex, and will not be treated in this chapter.)

This description points to an important difference between the reliability of human-made and automatic transcriptions: human transcriptions suffer from intra-subject and inter-subject variation, so repeated measurements of the same object will differ from each other. With automatic transcriptions this can be prevented, because a machine can be programmed in such a way that repeated measurements of the same object always give the same
result, thus yielding a reliability coefficient of 1, the highest possible. It follows that with respect to the quality of automatic transcription, only one (albeit not trivial) question needs to be answered, viz. that concerning validity.
5.1.4 Defining a Reference Phonological Transcription

The description of validity given above suggests that any validation activity implies the existence of a correct representation of what is to be measured: a so-called benchmark or 'true' criterion score (as in test theory), a gold standard. The difficulties in obtaining such a benchmark transcription are well known, and it is generally acknowledged that there is no absolute truth of the matter as to what phones a speaker produced in an utterance (Cucchiarini 1993; 1996; Wester et al. 2001). For instance, in an experiment we asked nine experienced listeners to judge whether a phone was present or not in 467 cases (Wester et al. 2001). All nine listeners agreed in only 246 of the 467 cases, which is less than 53 per cent (see section 2.2.2.2). Furthermore, a substantial amount of variation was observed between the nine listeners: the values of Cohen's kappa varied from 0.49 to 0.73 for the various listener pairs.

It follows that one cannot establish the validity of an automatic transcription simply by comparing it with an arbitrarily chosen human transcription, because the latter would inevitably contain errors. Unfortunately, this seems to be the practice in many studies on automatic transcription. To circumvent as much as possible the problems due to the lack of a reference point, different procedures have been devised to obtain reference transcriptions. One possibility consists of using a consensus transcription, i.e. a transcription made by at least two experienced phoneticians after having reached a consensus on each symbol contained in the transcript (Shriberg et al. 1984). The fact that different transcribers are involved, and that they have to reach a consensus before writing down the symbols, can be seen as an attempt to minimize errors of measurement, thus approaching 'true' criterion scores. Another option is to have more than one transcriber transcribe the material, and to use only that part of the material on which all transcribers, or at least the majority of them, agree (Kuipers and van Donselaar 1997; Kessens et al. 1998).
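Agreement between transcriber pairs of the kind reported above can be quantified with Cohen's kappa, which corrects raw agreement for chance. A minimal sketch, with invented judgements for one listener pair:

# Cohen's kappa for two annotators judging whether a phone was present.
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two label sequences of equal length."""
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n          # raw agreement
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n ** 2  # chance
    return (p_obs - p_exp) / (1 - p_exp)

listener1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
listener2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(round(cohens_kappa(listener1, listener2), 2))   # -> 0.5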
5.1.5 Comparing Phonological Transcriptions

Another issue that has to be addressed in automatic phonological transcription is how to determine whether the quality of a given transcription is satisfactory. Once a reference phonological transcription has been defined, the obvious choice is to carry out some sort of alignment between the reference phonological transcription and the automatic phonological transcription, with a view to determining a distance measure which will also provide a measure of transcription quality.
For this purpose, dynamic programming (DP) algorithms with different weightings have been used by various authors (Wagner and Fischer 1974; Hanna et al. 1999; Picone et al. 1986); several of these DP algorithms are compared in Kondrak and Sherif (2006). In the current chapter we will refer to dynamic programming, agreement scores, error rates, and related issues; some explanation of these notions is provided here.

A standard (simple) DP algorithm is one in which the penalty for an insertion, deletion, or substitution is 1. However, when using such DP algorithms we often obtained suboptimal alignments, like the following example:
Tref = /A m s t @ R d A m/
Tbu = /A m s # @ t a: n #/ (# = insertion)
For this reason, we decided to make use of a more sophisticated dynamic programming (DP) alignment procedure. In this second DP algorithm, the distance between two phones is not just 0 (when they are identical) or 1 (when they are not), but more gradual: it is calculated on the basis of the articulatory features defining the speech sounds the symbols stand for. More details about this DP algorithm can be found in Cucchiarini (1993: 96) and Elffers et al. (2005). Using this second DP algorithm, the following alignment was found for the example mentioned above:
Tref = /A m s t @ R d A m/
Tbu = /A m s # @ # t a: n/
It is obvious that the second alignment is better than the first. Since, in general, the alignments obtained with DP algorithm 2 were more plausible than those obtained with DP algorithm 1, DP algorithm 2 was used to determine the alignments. Similar algorithms have been proposed and are used by others.

These dynamic programming algorithms can be used to align not only automatic phonological transcriptions and reference phonological transcriptions, but also other pairs of phonological transcriptions. Besides being used to assess the quality of phonological transcriptions, they can also be used to study in what respects the compared phonological transcriptions deviate from each other, and to obtain information on pronunciation variation. For instance, our DP algorithms compare the two transcriptions and return various data, such as an overall distance measure; the number of insertions, deletions, and substitutions of phonemes; and more detailed data indicating to which features the substitutions are related. This kind of information can be extremely valuable if one wants to know how the automatic phonological transcription differs from the reference phonological transcription, and how the
automatic phonological transcription could be improved (see e.g. Binnenpoorte and Cucchiarini 2003; Cucchiarini and Binnenpoorte 2002; Cucchiarini et al. 2001).
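To make the difference between the two algorithms concrete, the following sketch aligns two phone strings with a weighted DP (Levenshtein-style) algorithm in which the substitution penalty is graded. It is a minimal illustration written in Python, not the implementation used in our studies: the three-feature vectors below are invented for the example, whereas a real system would use a full articulatory feature table of the kind described in Cucchiarini (1993) and Elffers et al. (2005).

# Minimal sketch of DP alignment with a graded, feature-based substitution
# cost. The feature table is a toy illustration, not a real articulatory
# feature set.
FEATURES = {                       # hypothetical 3-feature vectors
    't': (0, 0, 1), 'd': (0, 0, 0), 's': (0, 1, 1),
    'n': (1, 0, 0), 'm': (1, 0, 1), '@': (0, 1, 0),
}

def sub_cost(a, b):
    """Graded distance: fraction of differing features (0 = identical)."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(x != y for x, y in zip(fa, fb)) / len(fa)

def align(ref, hyp, ins_del_cost=1.0):
    """Weighted Levenshtein alignment; returns (distance, aligned pairs)."""
    n, m = len(ref), len(hyp)
    # d[i][j] = cost of aligning ref[:i] with hyp[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * ins_del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + ins_del_cost,            # deletion
                          d[i][j - 1] + ins_del_cost,            # insertion
                          d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]))
    # Trace back to recover the aligned symbol pairs ('#' = gap)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]):
            pairs.append((ref[i - 1], hyp[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + ins_del_cost:
            pairs.append((ref[i - 1], '#')); i -= 1
        else:
            pairs.append(('#', hyp[j - 1])); j -= 1
    return d[n][m], list(reversed(pairs))

dist, pairs = align(['t', '@', 'n', 't'], ['d', '@', 'm'])
print(dist, pairs)

With a constant substitution cost of 1 the procedure reduces to the standard DP algorithm; with the graded cost it prefers substitutions between featurally similar phones, which is what yields the more plausible second alignment shown above.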
5.1.6 Determining when an Automatic Phonological Transcription is of Satisfactory Quality

After having established how much an automatic phonological transcription differs from a reference phonological transcription, one needs some reference data to determine whether the observed degree of distance is acceptable. In other words, how can we determine whether the quality of a given automatic phonological transcription is satisfactory? Again, human transcriptions could be used as a point of reference. For instance, one could compare the degree of agreement observed between the automatic phonological transcription and the reference phonological transcription with the degree of agreement observed between human transcriptions of the same utterances that are of the same level of detail and that were made under similar conditions, because this agreement level constitutes the upper bound, as in the study reported in Wesenick and Kipp (1996). If the degree of agreement between the automatic phonological transcription and the reference phonological transcription is comparable to what is usually observed between human transcriptions, one could accept the automatic phonological transcription as is; if it is lower, the automatic phonological transcription should first be improved. The problem with this approach, however, is that it is difficult to find data on human transcriptions that can be used as a reference (for more information on this point, see Cucchiarini and Binnenpoorte 2002).

Whether a transcription is of satisfactory quality will also depend on the purpose of the transcription. Some differences between transcriptions are important for one application but less important for another. Therefore, application, goal, and context should be taken into account for meaningful transcription evaluation (van Bael et al. 2003; 2007).
5.2 Obtaining Phonological Transcriptions

In the current section we look at (semi-)automatic methods for obtaining phonological transcriptions. We start with completely automatic methods, distinguishing between cases in which orthographic transcriptions are not available and cases in which they are. We then discuss comparing (combinations of) methods, including methods in which a (small) part of the material is manually transcribed and subsequently used to improve automatic phonological transcriptions for a larger amount of speech. Finally, we look
at automatic phonological transcription optimization. However, before discussing how automatic phonological transcriptions can be obtained, we provide a brief explanation of how automatic speech recognition works.
5.2.1 Automatic Speech Recognition (ASR)

Standard ASR systems are generally employed to recognize words. An ASR system consists of a decoder (the search algorithm) and three 'knowledge sources': the language model (LM), the lexicon, and the acoustic models. The LM contains probabilities of words and sequences of words. Acoustic models are models of how the sounds of a language are pronounced; in most cases so-called hidden Markov models (HMMs) are used, but it is also possible to use artificial neural networks (ANNs). The lexicon is the connection between the language model and the acoustic models: it contains information on how the words are pronounced, in terms of sequences of sounds. The lexicon therefore contains two representations for every entry: an orthographic and a phonological transcription. Most lexicons contain words with more than one entry, i.e. pronunciation variants.

ASR is a probabilistic procedure. In a nutshell, ASR (with HMMs) works as follows. The LM defines which sequences of words are possible; for each word the possible variants and their transcriptions are retrieved from the lexicon, and for each sound in these transcriptions the appropriate HMM is retrieved. Everything is represented by means of a huge probabilistic network: an LM is a network of words, each word is a network of pronunciation variants and their transcriptions, and for each of the sounds in these transcriptions the corresponding HMM is a network of its own. In this huge and complex network, paths have probabilities attached to them. For a given (incoming, unknown) speech signal, the task of the decoder is to find the optimal global path in this network, using all the probabilistic information. In standard word recognition the output then consists of the labels of the words on this optimal path: the recognized words. However, the optimal path can contain more information than just the word labels, e.g. information on pronunciation variants, the phone symbols in these pronunciation variants, and even the segmentation at phone level.

The description provided above is a short description of a standard ASR system, i.e. one that is used for word recognition. However, it is also possible to use ASR systems in other ways. For instance, it is possible to perform phone recognition by using only the acoustic models, i.e. without the top-down constraints of language model and lexicon. Alternatively, the lexicon may contain phones instead of words. If there are no further restrictions (in the LM), we are dealing with so-called free or unrestricted phone recognition, whereas if the LM contains a model with probabilities of phone sequences (a kind of phonotactic constraint), then we have restricted phone recognition. These phone LMs are generally trained using canonical phonological transcriptions. For instance, in some experiments on automatic phonological transcription that we carried out (see section 2.3) it turned out that 4-gram models outperformed 2-gram, 3-gram,
5-gram, and 6-gram models (van Bael et al. 2006, 2007). Such a 4-gram model contains the probabilities of sequences of 4 phones.
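As an illustration of what training such a phone LM involves, the sketch below estimates 4-gram phone probabilities from canonical transcriptions. It is a bare-bones, hypothetical example: it uses unsmoothed maximum-likelihood estimates, whereas a real system would smooth the counts and integrate the model into the decoder's search.

from collections import Counter

def train_phone_4gram(transcriptions):
    """Maximum-likelihood 4-gram phone model from canonical transcriptions.
    Each transcription is a list of phone symbols; '<s>'/'</s>' pad the
    edges. No smoothing is applied, so unseen 4-grams get probability 0."""
    counts4, counts3 = Counter(), Counter()
    for phones in transcriptions:
        seq = ['<s>'] * 3 + phones + ['</s>']
        for i in range(len(seq) - 3):
            counts4[tuple(seq[i:i + 4])] += 1
            counts3[tuple(seq[i:i + 3])] += 1
    def prob(p, history):          # P(p | last three phones)
        h = tuple(history[-3:])
        return counts4[h + (p,)] / counts3[h] if counts3[h] else 0.0
    return prob

# Toy canonical transcriptions (SAMPA-like symbols)
prob = train_phone_4gram([['m', 'a', 'r'], ['m', 'a', 'r', 't']])
print(prob('r', ['<s>', 'm', 'a']))    # 1.0 in this toy corpus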
5.2.2 Automatic Methods

5.2.2.1 Automatic Methods: No Orthography

In general, if orthographic transcriptions are present, it is better to derive the phonological transcriptions not from the audio signals alone, but from a combination of audio signals and orthographic transcriptions. But what to do if no orthographic transcriptions are present?

5.2.2.1.1 No Orthography, ASR

If no orthographic annotation is present, an obvious solution would be to use ASR to obtain one. However, since ASR is not flawless, this orthographic transcription is likely to contain ASR errors, and these errors would offset the positive effects of using the orthographic representation for obtaining automatic phonological transcriptions. Whether the net effect is positive depends on the task: for some tasks that are relatively easy for ASR, such as isolated digit recognition, it may be, but for most tasks this will probably not be the case.

5.2.2.1.2 No Orthography, Phone Recognition

Another option, if no orthographic representation is available, is to use phone recognition (see section 2.1 on ASR). For this purpose, completely unrestricted phone recognition can be used, but usually some (phonotactic) constraints are employed in the form of a phone language model. Phone accuracy is highly variable, roughly between 30 and 70 per cent, depending on factors such as speech style and the quality of the speech signal (amount of background noise) (see e.g. Chang 2002). For instance, for one of our ASR systems we measured a phone accuracy of 63 per cent for extemporaneous speech (Wester et al. 1998). In general, high accuracy values can be obtained for relatively easy tasks (e.g. carefully read speech), and by carefully tuning the ASR system for a specific task (i.e. speech style, dialect or accent, gender, or speaker). Such levels of phone accuracy are generally too low, and thus the resulting automatic phonological transcriptions cannot be used directly for most applications.

Still, phone recognition can be useful. For our ASR system with a phone accuracy of 63 per cent we examined the resulting phone strings by comparing them to canonical transcriptions (Wester et al. 1998). Canonical transcriptions can be obtained by means of lexicons, grapheme-to-phoneme conversion tools (for an overview, see Bisani and Ney 2008), or a combination of the two. Since the quality of the phonological transcriptions in lexicons is usually better than that of grapheme-to-phoneme conversion tools, in many applications one first looks up the phonological transcriptions of words in lexicons, and if these are
not found there, grapheme-to-phoneme conversion is applied. In Wester et al. (1998) it was found that the number of insertions (4 per cent) was much smaller than the number of deletions (17 per cent) and substitutions (15 per cent). Furthermore, the vowels remain identical more often than the consonants, mainly because they are deleted less often. Finally, we studied the most frequently observed processes, which were all deletions. It turned out that these frequent processes are plausible connected speech processes (see Wester et al. 1998), some of which are related to Dutch phonological processes that have been described in the literature (e.g. /n/-deletion, /t/-deletion, and /@/-deletion are described in Booij 1995), but also some others that could not be found in the literature. Phone recognition can thus be used for hypothesis generation (Wester et al. 1998).

However, owing to the considerable number of inaccuracies in unsupervised phone recognition, it is often necessary to check or filter the output of phone recognition. The latter can be done by applying decision trees (Fosler-Lussier 1999; van Bael et al. 2006; 2007) or forced recognition (Kessens and Strik 2001, 2004). The results of phone recognition can be described in terms of context-dependent rewrite rules, and various criteria can be employed for filtering these rewrite rules. Straightforward criteria are the absolute frequency with which changes (insertions, deletions, or substitutions) occur, or the relative frequency, i.e. the absolute frequency divided by the number of times the conditions of the rule are met (see e.g. Kessens et al. 2002). Of course, combinations of criteria are also possible, as in the sketch below.
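The following sketch illustrates such frequency-based filtering. The rule representation and the threshold values are illustrative assumptions, not those of the studies cited above.

# Sketch of filtering context-dependent rewrite rules by absolute and
# relative frequency. Each rule is keyed by (left context, focus, right
# context, change); 'applied' counts how often the change was observed,
# 'possible' how often the rule's context occurred. Thresholds are
# illustrative placeholders.
def filter_rules(rule_stats, min_abs=10, min_rel=0.05):
    kept = {}
    for rule, (applied, possible) in rule_stats.items():
        rel = applied / possible if possible else 0.0
        if applied >= min_abs and rel >= min_rel:
            kept[rule] = (applied, possible, rel)
    return kept

stats = {
    ('V', 'n', '#', 'delete'): (120, 800),   # /n/-deletion word-finally
    ('s', 't', '#', 'delete'): (4, 900),     # too rare: filtered out
}
print(filter_rules(stats))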
5.2.2.2 Automatic Methods: With Orthography

Above, different methods were described for deriving phonological transcriptions when no orthographic transcriptions are available. Such methods are often called bottom-up or data-driven methods. In the current subsection we describe methods for obtaining phonological transcriptions when orthographic transcriptions are present. In the latter case, top-down information can also be applied.

5.2.2.2.1 With Orthography, Canonical Transcriptions

Probably the simplest way to derive phonological transcriptions in this case is by using canonical transcriptions (see above). Once a phonological (canonical) transcription is obtained for every word, the orthographic representations are simply replaced by the corresponding phonological representations (Binnenpoorte and Cucchiarini 2003; van Bael et al. 2006, 2007).

5.2.2.2.2 With Orthography, Forced Recognition

Words are not always pronounced in the same way, and thus representing all occurrences of a word with the same phonological transcription will result in phonological transcriptions containing errors. The quality of the phonological transcriptions can be improved by modelling pronunciation variation. One way to do this is to use so-called forced recognition. In forced recognition the goal is not to recognize the string
of words that was spoken, as in standard ASR; on the contrary, in forced recognition this string of words (the orthographic transcription) has to be known. Given the orthographic transcription, and multiple pronunciations of some words, forced recognition automatically determines which of the pronunciation variants of a word best matches the audio signal corresponding to that word. The words are thus fixed, and for every word the recognizer is forced to choose one of the pronunciation variants of that word; hence the term 'forced recognition'. The search space can also be represented as a network or lattice of pronunciation variants. The goal then is to find the optimal path in that network, the optimal alignment, by means of the Viterbi algorithm; this is why this procedure is also referred to as 'Viterbi' or 'forced' alignment. In any case, if hypotheses (pronunciation variants) are present, forced recognition can be used for hypothesis verification, i.e. to find the variant that most closely matches the audio signal. It is important to note that through the use of pronunciation variants it is also possible to model sandhi and other cross-word phenomena such as assimilation or intrusion.

We evaluated how well forced recognition performs by comparing its performance to that of human annotators (Wester et al. 2001). Nine experts, who often carried out phonetic transcriptions for their own research, performed exactly the same task as the computer program: they had to indicate which pronunciation variant of a word best matched the audio signal. The variants of 467 cases were generated by means of five frequent phonological rules: /n/-, /r/-, /t/-, and /@/-deletion, and /@/-insertion (Booij 1995). In all 467 cases the machine and the human transcribers thus had to determine whether a phone was present or not. The results of these experiments were evaluated in different ways; some of the results are presented here.

Table 5.1 shows how often N of the 9 transcribers agree. For 5 out of 9 this is 100 per cent, obviously, but for larger N this percentage drops rapidly, and all 9 experts agree in only 53 per cent of the cases. Note that these results concern decisions on whether a phone was present or not, i.e. insertions and deletions, and not substitutions, where a phone could be substituted by many other phones and the number of possibilities is thus much larger. Determining whether a phone is present or not can be very difficult both for humans and for machines, because very often we are dealing with gradual processes in which phones are neither completely present nor completely absent, and even if a phone is (almost) not present some traces can remain in the context. Furthermore, human listeners may be biased by their knowledge of the language.

We then compared the transcriptions made by humans and machine to the same reference transcription: the majority vote with different degrees of strictness (N out of 9, N = 5-9) mentioned above. The results are shown in Figure 5.1. For the 246 cases in which all transcribers agree, the percentage agreement between listeners and reference transcription obviously is 100 per cent. As N decreases, the percentage agreement with the reference transcription decreases, both for the judgements of the listeners and for the forced recognition program. Note that the behaviour is similar: the average percentage agreement of the listeners runs almost parallel to that of the ASR system.
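The generation of pronunciation variants by making phones optional, as in the experiment just described, can be sketched as follows. The rule set is a simplified stand-in for the five Dutch rules: only deletions are modelled, and the context conditions of the real rules are ignored.

from itertools import product

def deletion_variants(phones, deletable=frozenset({'n', 'r', 't', '@'})):
    """Enumerate pronunciation variants by optionally deleting any phone
    in `deletable`. Context conditions of the real rules are ignored."""
    options = [(p, None) if p in deletable else (p,) for p in phones]
    variants = set()
    for choice in product(*options):
        variants.add(tuple(p for p in choice if p is not None))
    return sorted(variants)

# Variants of a toy transcription; forced recognition would then select
# the variant that best matches the audio signal.
for v in deletion_variants(['h', 'A', 'n', 'd', '@', 'x']):
    print(' '.join(v))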
Table 5.1 Forced Recognition: majority vote results, i.e. the number of times that N of the 9 transcribers agree

N out of 9      Number of cases (%)
5 out of 9      467 (100%)
6 out of 9      435 (93%)
7 out of 9      385 (82%)
8 out of 9      335 (72%)
9 out of 9      246 (53%)
FIGURE 5.1 Percentage agreement of the reference transcription compared to the transcriptions made by humans and machine. [Line plot; x-axis: reference transcriptions (5 of 9 to 9 of 9); y-axis: Agreement (%), 80-100; series: CSR, Listener, Average.]
We also carried out pairwise comparisons between transcriptions. We obtained inter-listener percentage agreement scores (for Dutch spontaneous speech) in the range of 75-87 per cent (with an average of 82 per cent) (Wester et al. 2001). Similar results were obtained for German spontaneous speech: 79-83 per cent (Kipp et al. 1997; Schiel et al. 1998), and for American English (Switchboard): 72-80 per cent (Greenberg 1999). The ASR-listener pairwise comparisons yielded slightly lower percentage agreement scores: 76-80 per cent (with an average of 78 per cent) for Dutch (Wester et al. 2001), and 72-80 per cent for German (Kipp et al. 1997; Schiel et al. 1998).
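For reference, pairwise percentage agreement and Cohen's kappa for two transcribers' binary decisions can be computed as follows; this sketch covers only the two-category (phone present/absent) case.

def agreement_and_kappa(a, b):
    """Percentage agreement and Cohen's kappa for two transcribers'
    binary decisions (1 = phone present, 0 = absent)."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each transcriber's marginal proportions
    pa1, pb1 = sum(a) / n, sum(b) / n
    p_exp = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp < 1 else 1.0
    return 100 * p_obs, kappa

t1 = [1, 1, 0, 1, 0, 1, 1, 0]
t2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(agreement_and_kappa(t1, t2))    # approx. (75.0, 0.47)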
Forced recognition thus appears to perform well: the results are comparable to those of human transcribers, and the percentage agreement scores for the ASR are only slightly lower than those between human annotators. Note, however, that these results were obtained by comparing the judgements of humans and machine to judgements made by human annotators; if we had based the reference transcription(s) on a combination of the judgements by listeners and by ASR systems, the differences would have been (much) smaller. In any case, forced recognition seems to be a useful technique for hypothesis verification, i.e. for obtaining information regarding the phonological transcription.

Forced recognition, for hypothesis verification, can thus be used in combination with other methods that generate hypotheses. Examples of the latter are phone recognition (see section 2.2.1.2) and rule-based methods (e.g. using phonological rules, such as the five rules mentioned above). A method to obtain information regarding reduction processes is simply to generate (many) variants by making (many) phones optional, and to use forced recognition to select among the variants. In Kessens et al. (2000) we showed that there is a large overlap between the results of the latter method and those obtained with phone recognition in combination with forced recognition. Both methods are useful for automatically detecting connected speech processes, and it turned out that only about half of the connected speech processes detected had been described in the literature at the time (Kessens et al. 2000).
5.2.3 Comparing (Combinations of) Methods

In the previous sections we have presented several methods for obtaining phonological transcriptions. The question that arises is which of these methods performs best. In addition, many of these methods can be combined. In the research by van Bael et al. (2006; 2007) several (combinations of) methods were implemented, tested, and compared. Ten automatic procedures were used to generate broad phonetic transcriptions of well-prepared speech (read-aloud texts) and spontaneous speech (telephone dialogues) from the Spoken Dutch Corpus (see Table 5.2). The resulting transcriptions were compared to manually verified phonetic transcriptions from the same corpus. These ten methods are briefly described here (for more information, see van Bael et al. 2006; 2007). In methods 3-10, multiple pronunciation lexicons were used, and the best variant was chosen by means of forced recognition (in methods 1 and 2 this was not the case).
5.2.3.1 Canonical Transcription: CAN-PT

The canonical transcriptions (CAN-PTs) were generated through a lexicon look-up procedure. Cross-word assimilation and degemination were not modelled. Canonical transcriptions are easy to obtain, since many corpora feature an orthographic transcription and a canonical lexicon of the words in the corpus.
Table 5.2 Accuracy of the ten transcription methods for read speech and telephone dialogues: percentage of Substitutions (Subs), Deletions (Dels), Insertions (Ins), and percentage disagreement (%dis, the summation of Subs, Dels, and Ins)

Comparison        Read speech                   Telephone dialogues
with RT           Subs   Dels   Ins    %dis     Subs   Dels   Ins    %dis
CAN-PT             6.3    1.2   2.6    10.1      9.1    1.1   8.1    18.3
DD-PT             16.1    7.4   3.6    27.0     26.0   18.0   3.8    47.8
KB-PT              6.3    3.1   1.5    10.9      9.0    2.5   5.8    17.3
CAN/DD-PT         13.1    2.0   4.8    19.9     21.5    6.2   7.1    34.7
KB/DD-PT          12.8    3.1   3.6    19.5     20.5    7.8   5.4    33.7
[CAN-PT]d          4.8    1.6   1.7     8.1      7.1    3.3   4.2    14.6
[DD-PT]d          15.7    7.4   3.5    26.7     26.0   18.6   3.8    48.3
[KB-PT]d           5.0    3.2   1.2     9.4      7.1    3.5   4.2    14.8
[CAN/DD-PT]d      12.0    2.3   4.3    18.5     20.1    7.2   5.5    32.8
[KB/DD-PT]d       11.6    3.1   3.1    17.8     19.3    9.4   4.5    33.1
5.2.3.2 Data-Driven Transcription: DD-PT

The data-driven transcriptions (DD-PTs) were derived from the audio signals through constrained phone recognition: an ASR system segmented and labelled the speech signal using as a language model a 4-gram phonotactic model trained on the reference transcriptions of the development data, in order to approximate human transcription behaviour. Transcription experiments with the data in the development set indicated that for both speech styles 4-gram models outperformed 2-gram, 3-gram, 5-gram, and 6-gram models.
5.2.3.3 Knowledge-Based Transcription: KB-PT

We generated so-called knowledge-based transcriptions (KB-PTs) in three steps.
a. First, a list of 20 prominent phonological processes was compiled from the linguistic literature on the phonology of Dutch (Booij 1995). These processes were implemented as context-dependent rewrite rules modelling both within-word and cross-word contexts in which phones from a CAN-PT can be deleted, inserted, or substituted with another phone.
b. In the second step, the phonological rewrite rules were ordered and used to generate optional pronunciation variants from the CAN-PTs of the speech chunks. The rules were applied to the chunks rather than to the words in isolation in order to account for cross-word phenomena.
c. In the third step of the procedure, chunk-level pronunciation variants were listed. The optimal knowledge-based transcription (KB-PT) was identified through forced recognition.
Methods 4 and 5 are combinations of data-driven transcription (DD-PT) with canonical transcription (CAN-PT) and knowledge-based transcription (KB-PT):
5.2.3.4 Combined CAN-DD Transcription: CAN/DD-PT

5.2.3.5 Combined KB-DD Transcription: KB/DD-PT

For each of these two methods, the variants generated by the two procedures were combined, and the optimal variant was chosen by means of forced recognition.

Methods 1-5 are completely automatic: no manual phonological transcriptions are used. However, manual phonological transcriptions may already be available, at least for a (small) subset of the corpus. The question then is whether these manual phonological transcriptions can be used to improve the quality of the automatic phonological transcriptions obtained for the rest of the corpus. A possible way to do this is to align automatic and manual phonological transcriptions for the subset of the corpus, and to use these alignments to train decision trees. Roughly speaking, these decision trees learn the (systematic) differences between manual and automatic phonological transcriptions. If the same decision trees are then used to transform the automatic phonological transcriptions of the rest of the corpus, the transformed automatic phonological transcriptions might be closer to the reference transcriptions. We applied such decision trees to each of the five methods described above, thus obtaining five new transcriptions, i.e. methods 6-10: for each of these methods, the decision trees and the automatic phonological transcriptions were used to generate new variants, and the optimal variants were selected by means of forced recognition (a sketch of the decision-tree step is given after Table 5.3 below). To summarize, the ten methods are:
1. Canonical transcription: CAN-PT
2. Data-driven transcription: DD-PT
3. Knowledge-based transcription: KB-PT
4. Combined CAN-DD transcription: CAN/DD-PT
5. Combined KB-DD transcription: KB/DD-PT
6-10 = 1-5 with decision trees.

The results are presented in Table 5.2. It can be observed that applying the decision trees improves the results. Therefore, if manual phonological transcriptions are available for part of the corpus, they can be used to improve the automatic phonological transcriptions for the rest of the corpus; and if no manual phonological transcriptions are available, one could consider obtaining them for (only a small) part of the corpus. The best results are obtained with method 6, [CAN-PT]d: a canonical transcription that, through the use of a small sample of manual transcriptions and decision trees,
Table 5.3 Examples of utterances with different phonological transcriptions (in SAMPA). From top to bottom: orthography, the manually verified phonetic transcription from the Spoken Dutch Corpus that is used as reference, CAN-PT (method 1), and [CAN-PT]d (method 6)

Orthog.     maar het is niet handig als je nou . . . verbinding
Reference   mar t Is nid hAnd@x A S@ nA+ . . . f@-bIndIN
CAN-PT      mar @t Is nit hAnd@x Als j@ nA+ . . . v@rbIndIN
            D S DD S S (3 Dels and 3 Subs)
[CAN-PT]d   mar t Is nit hAnd@x As j@ nA+ . . . f@-bIndIN
            S D S - (1 Del and 2 Subs)
was modelled towards the target transcription. This method does not require the use of an ASR system, only canonical transcriptions obtained by means of a lexicon look-up, some manual phonological transcriptions, and decision trees trained on these manual transcriptions. For these (best) transcriptions, the number and the nature of the remaining disagreements with the reference transcriptions are similar to inter-labeller disagreement values reported in the literature.

Some examples, including those for the best method (i.e. method 6, [CAN-PT]d), are provided in Table 5.3. It can be observed that the decision trees have 'learned' some patterns that lead to improvements: the deletion of the schwa (@) of /@t/, the deletion of the /l/ in /Als/, and the devoicing of /v/ (see the last word). In order to determine what the errors in the transcriptions are, they are aligned with the reference. The number of errors for [CAN-PT]d (1 Del and 2 Subs) is much smaller than for CAN-PT (3 Dels and 3 Subs).
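The decision-tree step of methods 6-10 can be pictured as follows: each phone of the automatic transcription, together with some context, is a training instance, and the target is what the manual transcription does with that phone. The sketch below is a schematic stand-in using scikit-learn rather than the actual setup of van Bael et al. (2006; 2007); the one-phone context window, the feature encoding, and the toy training data are assumptions made for illustration.

from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer

# Training instances: each automatic phone with one phone of left/right
# context (from the DP alignment); the target is the manually verified
# phone ('-' = deletion). Toy data modelled on the Table 5.3 patterns.
X_raw = [
    {'left': '#', 'focus': '@', 'right': 't'},   # @t -> t (schwa deletion)
    {'left': 'A', 'focus': 'l', 'right': 's'},   # Als -> As (/l/ deletion)
    {'left': '#', 'focus': 'v', 'right': '@'},   # v -> f (devoicing)
]
y = ['-', '-', 'f']

vec = DictVectorizer()
X = vec.fit_transform(X_raw)
tree = DecisionTreeClassifier().fit(X, y)

# Apply the learned transformation to a new automatic transcription
new = [{'left': '#', 'focus': '@', 'right': 't'}]
print(tree.predict(vec.transform(new)))          # ['-'] : delete the schwa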
5.2.4 Optimizing Automatic Phonological Transcriptions

Deriving automatic phonological transcriptions according to the methods described above is usually done with ASR systems. Since standard ASR systems are primarily intended for recognizing words, for automatic phonological transcription it is necessary to apply them in non-standard, modified ways (as was described above for the various methods). For many decades, efforts in ASR research were directed at reducing the word error rate (WER), a measure of the accuracy of ASR systems in recognizing words. If an ASR system is used for deriving automatic phonological transcriptions, one generally takes an ASR system for which the WER is low. However, it is questionable whether the ASR system with the lowest WER is also the best choice for obtaining automatic phonological transcriptions.

Given that automatic phonological transcriptions are increasingly used, it is remarkable that relatively little research has been conducted on optimizing automatic phonological transcriptions and on optimizing ASR systems for this purpose. In one of our studies we investigated the effect of changing the properties of the ASR system on the quality of the resulting transcriptions and on the WER (Kessens and Strik 2001).
As a criterion we used the percentage agreement between the automatic phonological transcriptions and reference phonological transcriptions. The study concerned 1,237 instances of the five Dutch phonological rules mentioned above (see section 2.2.2.2): the 467 cases mentioned in section 2.2.2.2, for which the reference phonological transcription was obtained by means of a majority vote procedure, and a further 770 cases, for which the reference phonological transcription was a consensus transcription. By means of a DP alignment of automatic phonological transcriptions with reference phonological transcriptions, we obtained agreement scores expressed as either percentage agreement or kappa; a higher value indicates better transcription quality.

We showed that the relation between WERs and transcription quality is not straightforward (Kessens and Strik 2001, 2004). For instance, using context-dependent HMMs usually leads to lower WERs, but not always to higher-quality transcriptions. In other words, lower WERs do not always guarantee better transcriptions. Therefore, in order to increase the quality of automatic phonological transcriptions, one should not simply take the ASR system with the lowest WER; instead, ASR systems have to be optimized specifically for this task (i.e. to generate optimal automatic phonological transcriptions). Our research made clear that by combining the right properties of an ASR system, the resulting automatic phonological transcriptions can be improved. In Kessens and Strik (2001) this was achieved by training the HMMs on read speech (instead of spontaneous speech), by shortening the topology of the HMMs, and by means of pronunciation variation modelling.

Related to the issue above, i.e. which ASR system to use to obtain automatic phonological transcriptions of high quality, is the issue of which phonological transcriptions to use to obtain an ASR system with a low WER. The question is whether higher-quality transcriptions (e.g. manual phonological transcriptions) always yield ASR systems with lower WERs. We used different phonological transcriptions for training ASR systems, measured transcription quality by comparing these transcriptions to a reference phonological transcription, and also measured the WERs of the resulting ASR systems (van Bael et al. 2006, 2007). The phonological transcriptions we used were a manual phonological transcription, a canonical transcription (APT1), and an improved transcription (APT2) obtained by modelling pronunciation variation. In this case too, no straightforward relation was observed between transcription quality and WER; for example, manual phonological transcriptions do not always yield ASR systems with lower WERs. The overall conclusion of these experiments is therefore that, since ASR systems with lower WERs do not always yield better phonological transcriptions, and better phonological transcriptions do not always yield lower WERs, ASR systems that are to be used to obtain automatic phonological transcriptions should be optimized for this specific task.
5.3 Concluding Remarks

In the previous sections we have discussed the possibilities and advantages offered by automatic methods for phonological annotation. Ceteris paribus, the quality of the transcriptions is likely to be higher for careful (e.g. read) than for sloppy
(e.g. spontaneous) speech, and also higher for high-quality audio signals than for lower-quality ones (more noise, distortions, lower sampling frequency, etc.). If there is no orthographic transcription, it will be difficult to obtain phonological transcriptions of high quality automatically, since the output of ASR and phone recognition generally contains a substantial number of errors. If there are orthographic transcriptions, a good possibility might be to use method 6 of section 2.3: obtain some manual transcriptions, use them to train decision trees, and apply these decision trees to the canonical transcriptions. Another good option is to use method 3 of section 2.3: use 'knowledge' (e.g. a list of pronunciation variants, or rules for creating them) to generate variants, and apply forced recognition to select the variants that best match the audio signal.

In practice, automatic phonological transcriptions can be used in all research situations in which phonological transcriptions have to be made by one person. Given that an ASR system does not suffer from tiredness and loss of concentration, it could assist a transcriber who is likely to make mistakes owing to concentration loss: by comparing his/her own transcriptions with those produced by the ASR system, a transcriber can spot possible errors that are due to absent-mindedness. This kind of comparison can be useful for other reasons too. For instance, a transcriber may be biased by his/her own hypotheses and expectations, with obvious consequences for the transcriptions, while the biases in automatic phonological transcription can be controlled; checking the automatic transcriptions may therefore help discover possible biases in the listener's data. In addition, automatic phonological transcription can be employed in situations in which more than one transcriber is involved, in order to resolve possible doubts about what was actually realized. It should also be noted that using automatic phonological transcription will be less expensive than having an extra transcriber carry out the same task.

Automatic phonological transcription could also play a useful role within the framework of agile corpus creation as proposed by Voormann and Gut (2008; see also the chapter on corpus design in this volume). Agile corpus creation advocates the adoption of a query-driven approach that envisages small, rapid iterations of the various cycles in corpus creation (querying, annotation schema development, corpus annotation, and corpus analysis) to enhance the quality of corpora. In this approach, automatic phonetic transcription can be employed in a step-by-step bootstrap procedure as proposed by Binnenpoorte (2006), so that improved automatic phonological transcriptions are obtained after each step.

Finally, we would like to reiterate the clear advantage of using automatic phonological transcription when it comes to transcribing large amounts of speech data that would otherwise probably remain unexplored.
APPENDIX
Phonetic Transcription Tools

Below, a list of (pointers to) some phonetic transcription tools is provided. Since much more is available for English than for other languages, we first list the tools for English, and then the tools for other languages.
English

http://project-modelino.com/english-phonetic-transcription-converter.php?site_language=english
This free online converter allows you to convert English text to its phonetic transcription using International Phonetic Alphabet (IPA) symbols. The database contains more than 125,000 English words, including 68,000 individual words and 57,000 word forms, such as plurals.

http://upodn.com/
Turn your text into fonetiks (actually, it is: fənɛtɪks).

http://www.brothersoft.com/phonetic-86673.html
Phonetic 1.0. A program that translates text to the phonetic alphabet.

http://www.brothersoft.com/phonetizer-428787.html
Phonetizer 2.0. Easily and quickly add phonetic transcription to any English text.

http://www.photransedit.com/
PhoTransEdit applications have been designed to help those who work with English phonetic transcriptions. Far from providing perfect automatic transcriptions, PhoTransEdit is aimed at just helping you save time when writing, publishing, or sharing English transcriptions.

http://www.speech.cs.cmu.edu/cgi-bin/cmudict
The CMU Pronouncing Dictionary.

http://www.filecrop.com/ipa-english-dictionary.html
IPA English Dictionary.

http://ipa.typeit.org/
Type IPA phonetic symbols, for English.
Other Languages

http://mickey.ifp.illinois.edu/speechWiki/index.php?title=Phonetic_Transcription_Tool&oldid=3011
This is a tool that maps strings of letters (words) to their phonetic transcriptions via a hidden Markov model. It can also give phonetic transcriptions for partial words or words not in a dictionary. If a transcription dictionary is provided, the tool can align letters with
their corresponding phones. It has been trained on American English pronunciations, but models for other languages can also be created.

http://tom.brondsted.dk/text2phoneme/
Tom Brøndsted: Phonemic transcription. An automated phonetic/phonemic transcriber supporting English, German, and Danish. Outputs transcriptions in the International Phonetic Alphabet (IPA) or the SAMPA alphabet designed for speech recognition technology.

http://ilk.uvt.nl/g2p-www-demo.html (last accessed 26/06/2011)
The TreeTalk demo converts Dutch or English words to their phonetic transcription in the SAMPA (Dutch) or DISC (English) phonetic alphabet, and also generates speech audio.

http://hstrik.ruhosting.nl/tqe/
Automatic Transcription Quality Evaluation (TQE) tool. Input is a corpus with audio files and phone transcriptions (PTs). Audio and PTs are aligned, phone boundaries are derived, and for each segment-phone combination it is determined how well they fit together: for each phone a TQE (confidence) measure is determined, e.g. ranging from 0 to 100 per cent, indicating how good the fit is, i.e. what the quality of the phone transcription is.

http://www.fon.hum.uva.nl/praat/
Praat: doing phonetics by computer.

http://latlcui.unige.ch/phonetique/
EasyAlign: a friendly automatic phonetic alignment tool under Praat.

http://korpling.german.hu-berlin.de/~amir/phon.php
Automatic Phonetic Transcription and Syllable Analysis for German and Polish.

http://www.webalice.it/sandro.carnevali2011/indice.htm
Program for IPA phonetic transcription of Italian, Japanese, and English.

http://www.ipanow.com/
PhoneticSoft automatically transcribes Latin, Italian, German, and French texts into IPA symbols.

http://billposer.org/Software/earm2ipa.html
This program translates Armenian in UTF-8 Unicode to the International Phonetic Alphabet, assuming that the dialect represented is Eastern Armenian.
CHAPTER 6
STATISTICAL CORPUS EXPLOITATION

HERMANN MOISL
6.1 Introduction

This chapter regards corpus linguistics (Kennedy 1998; McEnery and Wilson 2001; Baker 2009) as a methodology for creating collections of natural language speech and text, abstracting data from them, and analysing that data with the aim of generating or testing hypotheses about the structure of language and its use in the world. On this definition, corpus linguistics began in the late eighteenth century with the postulation of an Indo-European protolanguage and its reconstruction based on examination of numerous living languages and of historical texts (Clackson 2007). Since then it has been applied to research across the range of linguistics sub-disciplines and, in recent years, has become an academic discipline with its own research community and scientific apparatus of professional organizations, websites, conferences, journals, and textbooks.

Throughout the nineteenth and much of the twentieth century, corpus linguistics was mainly or exclusively paper-based. The linguistic material used by researchers was in the form of handwritten or printed documents, and analysis involved reading through the documents, often repeatedly, creating data by noting features of interest on some paper medium such as index cards, inspecting the data directly, and on the basis of that inspection drawing conclusions that were published in printed books or journals. The advent of digital electronic technology in the second half of the twentieth century and its evolution since then have rendered this traditional technology increasingly obsolete. On the one hand, the possibility of representing language electronically rather than as visual marks on paper or some other physical medium, together with the development of digital media and infrastructure and of computational tools for the creation, emendation, storage, and transmission of electronic text, has led to a rapid increase in the number and size of corpora available to the linguist, and these are now at or even beyond the limit of what an individual researcher can efficiently use in the traditional
way. On the other, data abstracted from large corpora can themselves be so extensive and complex as to be impenetrable to understanding by direct inspection. Digital electronic technology has, in general, been a boon to corpus linguistics, but, as with other aspects of life, it’s possible to have too much of a good thing. One response to digital electronic language and data overload is to use only corpora of tractable size or, equivalently, subsets of large corpora, but simply ignoring available information is not scientifically respectable. The alternative is to look to related research disciplines for help. The overload in corpus linguistics is symptomatic of a more general trend. Daily use of digital electronic information technology by many millions of people worldwide both in their professional and personal lives has generated and continues to generate truly vast amounts of electronic speech and text, and abstraction of information from all but a tiny part of it by direct inspection is an intractable task not only for individuals but also in government and commerce—what, for example, are the prospects of finding a specific item of information by reading sequentially through the huge number of documents currently available on the Web? In response, research disciplines devoted to information abstraction from very large collections of electronic text have come into being, among them Computational Linguistics (Mitkov 2003), Natural Language Processing (Manning and Schütze 1999; Dale et al. 2000; Jurafsky and Martin 2008; Cole et al. 2010; Indurkhya and Damerau 2010), Information Retrieval (Manning et al. 2008), and Data Mining (Hand et al. 2001). These disciplines use existing statistical methods supplemented by a range of new interpretative ones to develop tools that render the deluge of digital electronic text tractable. Many of these methods and tools are readily adaptable for corpus linguistics use, and, as the references in Section 3 demonstrate, interest in them has grown substantially in recent years. The general aim of this chapter is to encourage that growth, and the particular aim is to encourage it with respect to corpus-based phonetic and phonological research. The chapter is in three main parts. The first part motivates the selection of one particular class of statistical method, cluster analysis, as the focus of the discussion, the second describes fundamental concepts in cluster analysis and exemplifies their application to hypothesis generation in corpus-based phonetic and phonological research, and the third reviews the literature on the use of statistical methods in general and of cluster analysis more specifically in corpus linguistics.
6.2 Cluster Analysis: Motivation

'Statistics' encompasses an extensive range of mathematical concepts and techniques with a common focus: an understanding of the nature of probability and of its role in the behaviour of natural systems. Linguistically-oriented statistical analysis of a natural language corpus thus implies that the aim of the analysis is in some sense to interpret the probabilities of occurrence of one or more features of interest—phonetic, phonological, morphological, lexical, syntactic, or semantic—in relation to some research question.
The statistics literature makes a fundamental distinction between exploratory and confirmatory analysis. Confirmatory analysis is used when the researcher has formulated a hypothesis in answer to his or her research question about a domain of interest, and wants to test the validity of that hypothesis by abstracting data from a sample drawn from the domain and applying confirmatory statistical methods to those data. Exploratory analysis is, on the other hand, used when the researcher has not yet formulated a hypothesis and wishes to generate one by abstracting data from a sample of the domain and then looking for structure in the data on the basis of which a reasonable hypothesis can be formulated.

The present discussion purports to describe statistical corpus exploitation, and as such it should cover both these types of analysis. The range of material which this implies is, however, very extensive, and attempting to deal even with only core topics in a relatively short chapter would necessarily result in a sequence of tersely described abstract concepts with little or no discussion of their application to corpus analysis. Since the general aim is to encourage rather than to discourage, some selectivity of coverage is required.

The selection of material for discussion was motivated by the following question: given the proliferation of digital electronic corpora referred to in the Introduction, which statistical concepts and techniques would be most useful to corpus linguists for dealing with the attendant problem of analytical intractability? The answer was exploratory rather than confirmatory analysis. The latter is appropriate where the characteristics of the domain of interest are sufficiently well understood to permit formulation of sensible hypotheses; in corpus linguistic terms such a domain might be a collection of texts in the English language, which has been intensively studied, or one that is small enough to be tractable by direct inspection. Where the corpora are very large, however, or in languages/dialectal varieties that are relatively poorly understood, or both, exploratory analysis is more useful because it provides a basis for the formulation of reasonable hypotheses; such hypotheses can subsequently be tested using confirmatory methods.

The range of exploratory methods is itself extensive (Myatt 2006; Myatt and Johnson 2009), and further restriction is required. To achieve this, a type of problem that can be expected to occur frequently in exploratory corpus analysis was selected and the relevant class of analytical methods made the focus of the discussion. Corpus exploration implies some degree of uncertainty about what one is looking for. If, for example, the aim is to differentiate the documents in a collection on the basis of their lexical semantic content, which words are the best differentiating criteria? Or, if the aim is to group a collection of speaker interviews on the basis of their phonetic characteristics, which phonetic features are most important? In both cases one would want to take as many lexical/phonetic features as possible into account initially, and then attempt to identify the important ones among them in the course of exploration. Cluster analysis is a type of exploratory method that has long been used across a wide range of science and engineering disciplines to address this type of problem, and is the focus of subsequent discussion.
The remainder of this section gives an impression of what cluster analysis involves and how it can be applied to corpus analysis; a more detailed account is given in Section 2.
Observation of nature plays a fundamental role in science. But nature is dauntingly complex, and there is no practical or indeed theoretical hope of describing any aspect of it objectively and exhaustively. The researcher is therefore selective in what he or she observes: a research question about the domain of interest is posed, a set of variables descriptive of the domain in relation to the research question is defined, and a series of observations is conducted in which, at each observation, the quantitative or qualitative values of each variable are recorded. A body of data is therefore built up on the basis of which a hypothesis can be generated.

Say, for example, that the domain of interest is the pronunciation of the speakers in some corpus, and that the research question is whether there is any systematic variation in phonetic usage among the speakers. Table 6.1 shows data abstracted from the Newcastle Electronic Corpus of Tyneside English (NECTE) (Allen et al. 2007), a corpus of dialect speech from north-east England, which is described in Beal, Corrigan and Moisl, this volume. The speakers are described by a single variable ə1, one of the several varieties of schwa defined in the NECTE transcription scheme, and the values in the variable column of Table 6.1 are the frequencies with which each of the 24 speakers use that segment. It is easy to see by direct inspection that the speakers fall into two groups: those that use ə1 relatively frequently, and those that use it relatively infrequently. The hypothesis is, therefore, that there is systematic variation in phonetic usage among NECTE speakers. If two phonetic variables are used to describe the speakers, as in Table 6.2, direct inspection again shows two groups, those that use both ə1 and another variety of schwa, ə2, relatively frequently and those that do not, and the hypothesis remains the same.

There is no theoretical limit on the number of variables that can be defined to describe the objects in a domain. As the number of variables and observations grows, so does the difficulty of generating hypotheses from direct inspection of the data. In the NECTE case, the selection of ə1 and ə2 in Tables 6.1 and 6.2 was arbitrary, and the speakers could be described using more phonetic segment variables. Table 6.3 shows twelve. What hypothesis would one formulate from inspection of the data in Table 6.3, taking into account all the variables? There are, moreover, 63 speakers in the NECTE corpus and the transcription scheme contains 158 phonetic segments, so it is possible to describe the phonetic usage of each of 63 speakers in terms of 158 variables. What hypothesis would one formulate from direct inspection of the full 63 × 158 data? These questions are clearly rhetorical, and there is a straightforward moral: human cognitive makeup is unsuited to seeing regularities in anything but the smallest collections of numerical data. To see the regularities we need help, and that is what cluster analysis provides.

Cluster analysis is a family of mathematical methods for the identification and graphical display of structure in data when the data are too large either in terms of the number of variables or of the number of objects described, or both, for it to be readily interpretable by direct inspection. All the members of the family work by partitioning a set of objects in the domain of interest into disjoint subsets in accordance with how relatively similar those objects are in terms of the variables that describe them.
The objects of interest in the NECTE data are speakers, and each speaker’s phonetic usage is described
Table 6.1 Frequency data for ə1 in the NECTE corpus

Speaker    ə1
tlsg01       3
tlsg02       8
tlsg03       3
tlsn01     100
tlsg04      15
tlsg05      14
tlsg06       5
tlsn02     103
tlsg07       5
tlsg08       3
tlsg09       5
tlsg10       6
tlsn03     142
tlsn04     110
tlsg11       3
tlsg12       2
tlsg52      11
tlsg53       6
tlsn05     145
tlsn06     109
tlsg54       3
tlsg55       7
tlsg56      12
tlsn07     104
by a set of phonetic variables. Any two speakers’ phonetic usage will be more or less similar depending on how similar their respective variable values are: if the values are identical then so are the speakers in terms of their pronunciation, and the greater the divergence in values the greater the differences in usage. Cluster analysis of the NECTE data in Table 6.3 groups the 24 speakers in terms of how similar their frequency of usage of 12 of the full 158 phonetic segments is. There are various kinds of cluster analysis; Table 6.4 shows the results from application of two of them. Table 6.4a shows the cluster structure of the NECTE data in Table 6.3 as a hierarchical tree. To interpret the tree one has to understand how it is constructed, so a short intuitive
Table 6.2 Frequency data for ə1 and ə2 in the NECTE corpus

Speaker    ə1    ə2
tlsg01       3     1
tlsg02       8     0
tlsg03       3     1
tlsn01     100   116
tlsg04      15     0
tlsg05      14     6
tlsg06       5     0
tlsg07       5     0
tlsg08       3     0
tlsg09       5     0
tlsg10       6     0
tlsn03     142   107
tlsn04     110   120
tlsg11       3     0
tlsg12       2     0
tlsg52      11     1
tlsg53       6     0
tlsn05     145   102
tlsn06     109   107
tlsg54       3     0
tlsg55       7     0
tlsg56      12     0
tlsn07     104    93
account is given here; technical details are given later in the discussion. The labels at the leaves of the tree are the speaker-identifiers. These labels are partitioned into clusters in a sequence of steps. Initially, each speaker is interpreted as a cluster on his or her own. At the first step the data are searched to identify the two most similar clusters. When found, they are joined into a superordinate tree in which their degree of similarity is graphically represented as the length of the horizontal lines joining the subclusters: the more similar the subclusters, the shorter the lines. In the actual clustering procedure assessment of similarity is done numerically, but for present expository purposes a visual inspection of Table 6.4a is sufficient, and, to judge by the shortness of the horizontal lines, the singleton clusters tlsg01 and tlsg03 at the top of the tree are the most similar. These are
Table 6.3 Frequency data for a range of phonetic segments in the NECTE corpus

Speaker    ə1   ə2   o:   ə3    ī   eī    n  a:1  a:2   aī    r    w
tlsg01      3    1   55  101   33   26  193   64    1    8   54   96
tlsg02      8    0   11   82   31   44  205   54   64    8   83   88
tlsg03      3    1   55  101   33   26  193   64   15    8   54   96
tlsn01    100  116    5   17   75    0  179   64    0   19   46   62
tlsg04     15    0   12   75   21   23  186   57    6   12   32   97
tlsg05     14    6   45   70   49    0  188   40    0   45   72   49
tlsg06      5    0   40   70   32   22  183   46    0    2   37  117
tlsn02    103   92    7    5   87   27  241   52    0    1   19   72
tlsg07      5    0   11   58   44   31  195   87   12    4   28   93
tlsg08      3    0   44   63   31   44  140   47    0    5   43  106
tlsg09      5    0   30  103   68   10  177   35    0   33   52   96
tlsg10      6    0   89   61   20   33  177   37    0    4   63   97
tlsn03    142  107    2   15   94    0  234    4    0   61   21   62
tlsn04    110  120    0   21  100    0  237    4    0   61   21   62
tlsg11      3    0   61   55   27   19  205   88    0    4   47   94
tlsg12      2    0    9   42   43   41  213   39   31    5   68  124
tlsg52     11    1   29   75   34   22  206   46    0   29   34   93
tlsg53      6    0   49   66   41   32  177   52    9    1   68   74
tlsn05    145  102    4    6  100    0  208   51    0   22   61  104
tlsn06    109  107    0    7  111    0  220   38    0   26   19   70
tlsg54      3    0    8   81   22   27  239   30   32    8   80  116
tlsg55      7    0   12   57   37   20  187   77   41    4   58  101
tlsg56     12    0   21   59   31   40  164   52   17    6   45  103
tlsn07    104   93    0   11  108    0  194    5    0   66   33   69
joined into a composite cluster (tlsg01 tlsg03). At the second step the data are searched again to determine the next-most-similar pair of clusters. Visual inspection indicates that these are tlsg06 and tlsg56 about one-third of the way down the tree, and these are joined into a composite cluster (tlsg06 tlsg56). At step 3, the two most similar clusters are the composite cluster (tlsg06 tlsg56) constructed at step 2 and tlsg08. These are joined into a superordinate cluster ((tlsg06 tlsg56) tlsg08). The sequence of steps continues in this way, combining the most similar pair of clusters at each step, and stops when there is only one cluster remaining which contains all the subclusters. The resulting tree gives an exhaustive graphical representation of the similarity relations in the NECTE speaker data. It shows that there are two main groups of speakers,
Table 6.4 Two types of cluster analysis of the data in Table 6.3. [(a) Hierarchical tree (dendrogram), leaf order from top: tlsg01, tlsg03, tlsg04, tlsg55, tlsg07, tlsg11, tlsg06, tlsg56, tlsg08, tlsg10, tlsg53, tlsg05, tlsg09, tlsg52, tlsg02, tlsg12, tlsg54; tlsn01, tlsn04, tlsn07, tlsn02, tlsn06, tlsn03, tlsn05. (b) Scatter plot in which the tlsg speakers and the tlsn speakers form two spatially separated groups, labelled A and B respectively.]
labelled A and B, which differ greatly from one another in terms of phonetic usage, and, though there are differences in usage among the speakers in those two main groups, the differences are minor relative to those between A and B. Table 6.4b shows the cluster structure of the data in Table 6.3 as a scatter plot in which relative spatial distance between speaker labels represents the relative similarity of phonetic usage among the speakers: the closer the labels the closer the speakers. Labels corresponding to the main clusters in Table 6.4a have been added for ease of cross-reference, and show that this analysis gives the same result as the hierarchical one. Once the structure of the data has been identified by cluster analysis, it can be used for hypothesis generation (Romesburg 1984: chs 4 and 22). The obvious hypothesis in the present case is that the NECTE speakers fall into two distinct groups in terms of their phonetic usage. This could be tested by doing an analysis of the full NECTE corpus using all 63 speakers and all 158 variables, and by conducting further interviews and abstracting data from them for subsequent analysis. Cluster analysis can be applied in any research where the data consist of objects described by variables; since most research uses data of this kind, cluster analysis is very widely applicable. It can usefully be applied where the number of objects and descriptive variables is so large that the data cannot easily be interpreted by direct inspection, and the range of applications where this is the case spans most areas of science, engineering, and commerce (Everitt et al. 2001: ch. 1; Romesburg 1984: chs 4–6; detailed discussion of cluster applications in Jain et al. 1999: 296 ff). In view of the comments made in the introduction about text overload, cluster analysis is exactly what is required for hypothesis generation in corpus linguistics. The foregoing discussion of NECTE is an example in the intersection of phonetics, dialectology, and sociolinguistics: the set of phonetic transcriptions is extensive and the frequency data abstracted from them are far too large
to be in any sense comprehensible, but the structure that cluster analysis identified in the data made hypothesis formulation straightforward.
6.3 Cluster Analysis Concepts and Hypothesis Generation

6.3.1 Data

Data are abstractions of what we observe using our senses, often with the aid of instruments (Chalmers 1999), and are ontologically different from the world. The world is as it is; data are an interpretation of it for the purpose of scientific study. The weather is not the meteorologist's data—measurements of such things as air temperature are. A text corpus is not the linguist's data—measurements of such things as word frequency are. Data are constructed from observation of things in the world, and the process of construction raises a range of issues that determine the amenability of the data to analysis and the interpretability of the results. The importance of understanding such data issues in cluster analysis can hardly be overstated. On the one hand, nothing can be discovered that is beyond the limits of the data itself. On the other, failure to understand relevant characteristics of data can lead to results and interpretations that are distorted or even worthless. For these reasons, an overview of data issues is given before moving on to discussion of cluster analysis concepts; examples are taken from the NECTE corpus cited above.
6.3.1.1 Formulation of a Research Question

In general, any aspect of the world can be described in an arbitrary number of ways and to arbitrary degrees of precision. The implications of this go straight to the heart of the debate on the nature of science and scientific theories, but to avoid being drawn into that debate, this discussion adopts the position that is pretty much standard in scientific practice: the view, based on Karl Popper's philosophy of science (Popper 1959; 1963; Chalmers 1999), that there is no theory-free observation of the world. In essence, this means that there is no such thing as objective observation in science. Entities in a domain of inquiry only become relevant to observation in terms of a research question framed using the ontology and axioms of a theory about the domain. For example, in linguistic analysis, variables are selected in terms of the discipline of linguistics broadly defined, which includes the division into sub-disciplines such as sociolinguistics and dialectology, the subcategorization within sub-disciplines such as phonetics through syntax to semantics and pragmatics in formal grammar, and theoretical entities within each subcategory such as phonemes in phonology and constituency structures in syntax. Claims, occasionally seen, that the variables used to describe a corpus are
‘theoretically neutral’ are naive: even word categories like ‘noun’ and ‘verb’ are interpretative constructs that imply a certain view of how language works, and they only appear to be theory-neutral because of our familiarity with long-established tradition. Data can, therefore, only be created in relation to a research question that is defined using the ontology of the domain of interest, and that thereby provides an interpretative orientation. Without such an orientation, how does one know what to observe, what is important, and what is not? The research question asked with respect to the NECTE corpus, and which serves as the basis for the examples in what follows, is: Is there systematic phonetic variation in the Tyneside speech community, and, if so, what are the main phonetic determinants of that variation?
6.3.1.2 Variable Selection

Given that data are an interpretation of some domain of interest, what does such an interpretation look like? It is a description of entities in the domain in terms of variables. A variable is a symbol, and as such is a physical entity with a conventional semantics, where a conventional semantics is understood as one in which the designation of a physical thing as a symbol together with the connection between the symbol and what it represents are determined by agreement within a community. The symbol 'A', for example, represents the phoneme /a/ by common assent, not because there is any necessary connection between it and what it represents. Since each variable has a conventional semantics, the set of variables chosen to describe entities constitutes the template in terms of which the domain is interpreted. Selection of appropriate variables is, therefore, crucial to the success of any data analysis. Which variables are appropriate in any given case? That depends on the nature of the research question. The fundamental principle in variable selection is that the variables must describe all and only those aspects of the domain that are relevant to the research question. In general, this is an unattainable ideal. Any domain can be described by an essentially arbitrary number of finite sets of variables; selection of one particular set can only be done on the basis of personal knowledge of the domain and of the body of scientific theory associated with it, tempered by personal discretion. In other words, there is no algorithm for choosing an optimally relevant set of variables for a research question. The NECTE speakers are described by a set of 158 variables each of which represents a phonetic segment. These are described in (Allen et al. 2007) and, briefly, in Beal, Corrigan and Moisl, this volume.
6.3.1.3 Variable Value Assignment

The semantics of each variable determines a particular interpretation of the domain of interest, and the domain is 'measured' in terms of the semantics. That measurement constitutes the values of the variables: height in metres = 1.71, weight in kilograms = 70, and so on. Measurement is fundamental in the creation of data because it makes the link
between data and the world, and thus allows the results of data analysis to be applied to the understanding of the world. Measurement is only possible in terms of some scale. There are various types of measurement scale, and these are discussed at length in any statistics textbook, but for present purposes the main dichotomy is between numeric and non-numeric. Cluster analysis methods assume numeric measurement as the default case, and for that reason the same is assumed in what follows. For NECTE we are interested in the number of times each speaker uses each of the phonetic segment variables. The speakers are therefore 'measured' in terms of the frequency with which they use these segments.
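As a small illustration of this measurement step, here is a minimal Python sketch that counts segment frequencies in a transcription. The four-segment inventory and the token sequence are invented stand-ins for the 158-segment TLS scheme, not real NECTE data.

```python
from collections import Counter

SEGMENTS = ["i:", "I", "E", "eI"]          # stand-ins for the 158 TLS segments

# One speaker's transcription as a sequence of segment tokens (invented data)
transcription = ["i:", "eI", "i:", "I", "eI", "i:"]

counts = Counter(transcription)
frequency_vector = [counts.get(seg, 0) for seg in SEGMENTS]
print(frequency_vector)                    # [3, 1, 0, 2]
```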
6.3.1.4 Data Representation

If they are to be analysed using mathematically-based computational methods, the descriptions of the entities in the domain of interest in terms of the selected variables must be mathematically represented. A widely used way of doing this, and the one adopted here, is to use structures from a branch of mathematics known as linear algebra. There are numerous textbooks and websites devoted to linear algebra; a small selection of introductory textbooks is Fraleigh and Beauregard (1994), Poole (2005), and Strang (2009). Vectors are fundamental in data representation. A vector is a sequence of numbered slots containing numerical values. Table 6.5 shows a four-element vector each element of which contains a real-valued number: 1.6 is the value of the first element v1, 2.4 the value of the second element v2, and so on. A single NECTE speaker's frequency of usage of the 158 phonetic segments in the transcription scheme can be represented by a 158-element vector in which each element is associated with a different segment, as in Table 6.6. This speaker uses the segment i: 23 times, the segment ɪ four times, and so on.
Table 6.5 A vector

V =   1.6   2.4   7.5   0.6
       1     2     3     4
Table 6.6 A vector representing a NECTE speaker

Segment:     i:    ɪ    ε    eɪ   …    ʒ
Speaker =    23    4    0    34   …    2
Element:      1    2    3     4   …   158
Table 6.7 The NECTE data matrix

               i:    ɪ    ε    eɪ   …    ʒ
Speaker 1      23    4    0    34   …    2
Speaker 2      18   12    4    38   …    1
Speaker 3      21   16    9    19   …    5
…
Speaker 63     36    2    1    27   …    3
                1    2    3     4   …   158
The 63 NECTE speaker vectors can be assembled into a matrix M, shown in Table 6.7, in which the 63 rows represent the speakers, the 158 columns represent the phonetic segments, and the value at Mij is the number of times speaker i uses segment j (for i = 1..63 and j = 1..158). This matrix M is the basis of subsequent examples.
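The assembly of M can be sketched in a few lines of NumPy. Only the shape (63 speakers by 158 segments) follows the text; the random counts are placeholders for the real frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_speakers, n_segments = 63, 158
# M[i, j] = number of times speaker i uses segment j (placeholder values)
M = rng.integers(0, 40, size=(n_speakers, n_segments))
print(M.shape)          # (63, 158)
print(M[0, :4])         # speaker 1's counts for the first four segments
```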
6.3.1.5 Data Issues

Once the data are in matrix form they can in principle be cluster analysed. The data may, however, have characteristics that can distort or even invalidate the results, and any such characteristics have to be mitigated or eliminated prior to analysis. These include variation in document or speaker interview length (Moisl 2009), differences in variable measurement scale (Moisl 2011), data sparsity (Moisl 2008), and nonlinearity (Moisl 2007).
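As one concrete example of such an issue, variation in interview length can be mitigated by converting raw counts to rates. The sketch below shows a common normalization (rates per 1,000 segment tokens); it is offered as an illustration, not necessarily the specific procedure discussed in Moisl (2009).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.integers(1, 40, size=(63, 158)).astype(float)   # placeholder counts

# Convert each speaker's raw counts to rates per 1,000 segment tokens, so
# that longer interviews do not dominate the analysis purely through volume.
row_totals = M.sum(axis=1, keepdims=True)
M_norm = 1000.0 * M / row_totals
```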
6.3.2 Cluster Analysis

Once the data matrix has been created and any data issues resolved, a variety of computational methods can be used to group its row vectors, and thereby the objects in the domain that the row vectors represent. In the present case, those objects are the NECTE speakers.
FIGURE 6.1 Geometrical interpretation of a 2-dimensional vector: the point (30, 70).

FIGURE 6.2 Geometrical interpretation of a 3-dimensional vector: the point (40, 20, 60).

FIGURE 6.3 Distributions of multiple vectors in 2- and 3-dimensional spaces: (a) randomly generated, unstructured data; (b) two concentrations in two-dimensional space; (c) two concentrations in three-dimensional space.
6.3.2.1 Clusters in Vector Space

Although it is just a sequence of numbers, a vector can be geometrically interpreted (Fraleigh and Beauregard 1994; Poole 2005; Strang 2009). To see how, take a vector consisting of two elements, say v = (30, 70). Under a geometrical interpretation, the two elements of v define a two-dimensional space, the numbers at v1 = 30 and v2 = 70 are coordinates in that space, and the vector v itself is a point at the coordinates (30, 70), as shown in Figure 6.1. A vector consisting of three elements, say v = (40, 20, 60), defines a three-dimensional space in which the coordinates of the point v are 40 along the horizontal axis, 20 along the vertical axis, and 60 along the third axis shown in perspective, as in Figure 6.2. A vector v = (22, 38, 52, 12) defines a four-dimensional space with a point at the stated coordinates, and so on to any dimensionality n. Vector spaces of dimensionality greater than three are impossible to visualize directly and are therefore counterintuitive, but mathematically there is no problem with them; two- and three-dimensional spaces are useful as a metaphor for conceptualizing higher-dimensional ones.

When numerous vectors exist in a space, it may or may not be possible to see interesting structure in the way they are arranged in it. Figure 6.3 shows vectors in two- and three-dimensional spaces. In (a) they were randomly generated and there is no structure to be observed, in (b) there are two clearly defined concentrations in two-dimensional space, and in (c) there are two clearly defined concentrations in three-dimensional space. The existence of concentrations like those in (b) and (c) indicates relationships among the entities that the vectors represent. In (b), for example, if the horizontal axis measures
weight and the vertical one height for a sample human population, then members of the sample fall into two groups: tall, light people on the one hand, and short, heavy ones on the other. This idea of identifying clusters of vectors in vector space and interpreting them in terms of what the vectors represent is the basis of cluster analysis. In what follows, we shall be attempting to group the NECTE speakers on the basis of their phonetic usage by looking for clusters in the arrangement of the row vectors of M in 158-dimensional space.
6.3.2.2 Clustering Methods

Where the data vectors are two- or three-dimensional they can simply be plotted and any clusters will be visually identifiable, as we have just seen. But what about when the vector dimensionality is greater than 3—say 4, or 10, or 100? In such a case direct plotting is not an option; how exactly would one draw a six-dimensional space, for example? Many data matrix row vectors have dimensionalities greater than three—the NECTE matrix M has dimensionality 158—and to identify clusters in such high-dimensional spaces some procedure more general than direct plotting is required. A variety of such procedures is available, and they are generically known as cluster analysis methods. This section looks at these methods.

Where there are two or more vectors in a space, it is possible to measure the distance between any two of them and to rank them in terms of their proximity to one another. Figure 6.4 shows a simple case of a two-dimensional space in which the distance from vector A to vector B is greater than the distance from A to C.

FIGURE 6.4 Vector distances: the distance from A to C is smaller than the distance from A to B.
dist(AB) = √((5 − 1)² + (4 − 2)²)

FIGURE 6.5 Euclidean distance measurement between the points A(1,2) and B(5,4) in a two-dimensional space.
There are various ways of measuring such distances, but the most often used is the familiar Euclidean one, as in Figure 6.5. Cluster analysis methods use relative distance among vectors in a space to group the vectors. Specifically, for a given set of vectors in a space, they first calculate the distances between all pairs of vectors, and then group into clusters all the vectors that are relatively close to one another in the space and relatively far from those in other clusters. 'Relatively close' and 'relatively far' are, of course, vague expressions, but they are precisely defined by the various clustering methods, and for present purposes we can avoid the technicalities and rely on intuitions about relative distance.

For concreteness, we will concentrate on one particular class of methods: the hierarchical cluster analysis already introduced in Section 1, which represents the relativities of distance among vectors as a tree. Figure 6.6 exemplifies this. Column (a) shows a 30 × 2 data matrix that is to be cluster analysed. Because the data space is two-dimensional, the vectors can be directly plotted to show the cluster structure, as in the upper part of column (b). The corresponding hierarchical cluster tree is shown in the lower part of column (b). There are three clusters labelled A, B, and C in each of which the distances among vectors are quite small. These three clusters are relatively far from one another, though A and B are closer to one another than either of them is to C. Comparison with the vector plot shows that the hierarchical analysis accurately represents the distance relations among the 30 vectors in two-dimensional space. Given that the tree tells us nothing more than what the plot tells us, what is gained? In the present case, nothing. The real power of hierarchical analysis lies in its independence of vector space dimensionality. We have seen that direct plotting is limited to three or fewer dimensions, but there is no dimensionality limit on hierarchical analysis—it can determine relative distances in vector spaces of any dimensionality and represent those distance relativities as a tree like the one above.

To exemplify this, the 158-dimensional NECTE data matrix M was hierarchically cluster analysed (Moisl et al. 2006), and the result is shown in Figure 6.7. Plotting M in 158-dimensional space would have been impossible, and without cluster analysis one would have been left pondering a very large and incomprehensible matrix of numbers. With the aid of cluster analysis, however, structure in the data is clearly visible: there are two main clusters, NG1 and NG2; NG1 consists of large subclusters NG1a and NG1b; NG1a itself has two main subclusters NG1a(i) and NG1a(ii).
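The whole procedure can be sketched with SciPy: compute all pairwise Euclidean distances, build the cluster tree agglomeratively, and cut the tree into a chosen number of clusters. For the worked example in Figure 6.5, dist(AB) = √(4² + 2²) = √20 ≈ 4.47. The three Gaussian concentrations below loosely mimic the structure of the Figure 6.6 data; they are generated purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three concentrations in 2-D space, roughly like clusters A, B, C in Figure 6.6
data = np.vstack([rng.normal(centre, 3.0, size=(10, 2))
                  for centre in ([30, 50], [60, 11], [80, 71])])

distances = pdist(data, metric="euclidean")   # all pairwise distances
tree = linkage(distances, method="average")   # agglomerative cluster tree
labels = fcluster(tree, t=3, criterion="maxclust")
print(labels)   # cluster membership of each of the 30 rows
# scipy.cluster.hierarchy.dendrogram(tree) would draw the tree itself
```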
6.3.2.3 Hypothesis Generation

Given that there is structure in the relative distances of the row vectors of M from one another in the data space, what does that structure mean in terms of the research question? Is there systematic phonetic variation in the Tyneside speech community, and, if so, what are the main phonetic determinants of that variation?
Because the row vectors of M are phonetic profiles of the NECTE speakers, the cluster structure means that the speakers fall into clearly defined groups with specific interrelationships rather than, say, being randomly distributed around the phonetic space. A reasonable hypothesis to answer the first part of the research question, therefore, is that there is systematic variation in the Tyneside speech community. This hypothesis can be refined by examining the social data relating to the NECTE speakers, which show, for example, that all those in the NG1 cluster come from the Gateshead area on the south side of the river Tyne and all those in NG2 come from Newcastle on the north side, and that the subclusters in NG1 group the Gateshead speakers by gender and occupation (Moisl et al. 2006).

FIGURE 6.6 Data matrix and corresponding row-clusters. (a) A 30 × 2 data matrix with rows (27,46), (29,48), (30,50), (32,51), (34,54), (55,9), (56,9), (60,10), (63,11), (64,11), (78,72), (79,74), (80,70), (84,73), (85,69), (27,55), (29,56), (30,54), (33,51), (34,56), (55,13), (56,15), (60,13), (63,12), (64,10), (84,72), (85,74), (77,70), (76,73), (76,69). (b) The corresponding scatter plot, showing three clusters labelled A, B, and C, and the hierarchical cluster tree over rows 1–30.

FIGURE 6.7 Hierarchical analysis of the NECTE data matrix in Table 6.7, showing the two main clusters NG1 and NG2, the subclusters NG1a and NG1b, and the further subclusters NG1a(i) and NG1a(ii).
The cluster tree can also be used to generate a hypothesis in answer to the second part of the research question. So far we know that the NECTE speakers fall into clearly demarcated groups on the basis of variation in their phonetic usage. We do not, however, know why, that is, which segments out of the 158 in the TLS transcription scheme are the main determinants of this regularity. To identify these segments (Moisl and Maguire 2008), we begin by looking at the two main clusters NG1 and NG2 to see which segments are most important in distinguishing them. The first step is to create for the NG1 cluster a vector that captures the general phonetic characteristics of the speakers it contains, and to do the same for the NG2. Such vectors can be created by averaging all the row vectors in a cluster using the formula
v_j = ( Σ_{i=1…m} M_ij ) / m
where v_j is the jth element of the average or 'centroid' vector v (for j = 1 to the number of columns in M), M is the data matrix, Σ designates summation, and m is the number of row vectors in the cluster in question (56 for NG1, 7 for NG2). This yields two centroid vectors. Next, compare the two centroid vectors by co-plotting them to show graphically how, on average, the two speaker groups differ on each of the 158 phonetic segments; a plot of all 158 segments is too dense to be readily deciphered, so the six on which the NG1 and NG2 centroids differ most are shown in Figure 6.8.
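In code, the centroid computation and the ranking of segments by centroid difference can be sketched as follows. The counts are placeholders, and the assumption that the first 56 rows form NG1 and the last 7 form NG2 is made purely for the example; only the cluster sizes follow the text.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.integers(0, 40, size=(63, 158)).astype(float)   # placeholder counts

ng1, ng2 = M[:56], M[56:]            # 56 NG1 speakers, 7 NG2 speakers

v_ng1 = ng1.mean(axis=0)             # centroid: v_j = (sum_i M_ij) / m
v_ng2 = ng2.mean(axis=0)

diff = np.abs(v_ng1 - v_ng2)
top6 = np.argsort(diff)[::-1][:6]    # the six most differentiating segments
print(top6)
```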
FIGURE 6.8 Co-plot of centroid vectors for NG1 (black) and NG2 (grey) on the six segments on which the centroids differ most: three varieties of reduced vowel ə (ə1, ə2, ə3; keywords include 'standard', 'baker', and 'houses'), ɪ ('big'), ɔː ('smoke'), and eɪ ('knife').
The six phonetic segments most important in distinguishing cluster NG1 from NG2 are three varieties of [ə], [ɔː], [ɪ], and [eɪ]: the Newcastle speakers characteristically use ə1 and ə2 whereas the Gateshead speakers use them hardly at all, the Gateshead speakers use yet another variety of schwa, ə3, much more than the Newcastle speakers, and so on. A hypothesis that answers the second part of the research question is therefore that the main determinants of phonetic variation in the Tyneside speech community are three kinds of [ə], [ɔː], [ɪ], and [eɪ]. The subclusters of NG1 can be examined in the same way and the hypothesis thereby further refined.
6.4 Literature Review

The topic of this chapter cuts across several academic disciplines, and the potentially relevant literature is correspondingly large. This review is therefore highly selective. It also includes a few websites; as ever with the web, caveat emptor, but the ones cited seem to me to be reliable and useful.
6.4.1 Statistics and Linear Algebra

Using cluster analysis competently requires some knowledge both of statistics and of linear algebra. The following references to introductory and intermediate-level accounts provide this.
6.4.1.1 Statistics

In any research library there is typically a plethora of introductory and intermediate-level textbooks on probability and statistics. It is difficult to recommend specific ones on a principled basis because most of them, and especially the more recent ones, offer comprehensive and accessible coverage of the fundamental statistical concepts and techniques relevant to corpus analysis. For the linguist at any but advanced level in statistical corpus analysis, the choice is usually determined by a combination of what is readily available and presentational style. Some personal introductory favourites are (Devore and Peck 2005; Freedman et al. 2007; Gravetter and Wallnau 2008), and, among more advanced ones, (Casella and Berger 2001; Freedman 2009; Rice 2006).
Statistics websites

• Hyperstat Online Statistics Textbook: http://davidmlane.com/hyperstat/
• NIST-Sematech e-Handbook of Statistical Methods: http://www.itl.nist.gov/div898/handbook/index2.htm
• Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/index.htm
• Statistics on the Web: http://my.execpc.com/~helberg/statistics.html
• Statsoft Electronic Statistics Textbook: http://www.statsoft.com/textbook/
• SticiGui e-textbook: http://www.stat.berkeley.edu/~stark/SticiGui/index.htm
• John C. Pezzullo's Statistical Books, Manuals, and Journals links: http://statpages.org/javasta3.html
• Research Methods Knowledge Base: http://www.socialresearchmethods.net/kb/index.php
Statistics software

Contemporary research environments standardly provide one or more statistics packages as part of their IT portfolio, and these packages together with local expertise in their use are the first port of call for the corpus analyst. Beyond this, a web search using the keywords 'statistics software' generates a deluge of links from which one can choose. Some useful directories are:

• Wikipedia list of statistical software: http://en.wikipedia.org/wiki/List_of_statistical_packages
• Wikipedia comparison of statistical packages: http://en.wikipedia.org/wiki/Comparison_of_statistical_packages
• Open directory project, statistics software: http://www.dmoz.org/Science/Math/Statistics/Software/
• Stata, statistical software providers: http://www.stata.com/links/stat_software.html
• Free statistics: http://www.freestatistics.info/
• Statcon, annotated list of free statistical software: http://statistiksoftware.com/free_software.html
• Understanding the World Today, free software: statistics: http://gsociology.icaap.org/methods/soft.html
• The Impoverished Social Scientist's Guide to Free Statistical Software and Resources: http://maltman.hmdc.harvard.edu/socsci.shtml
• StatSci, free statistical packages: http://www.statsci.org/free.html
• Free statistical software directory: http://www.freestatistics.info/stat.php
• John C. Pezullo's free statistical software links: http://statpages.org/javasta2.html
• Statlib: http://lib.stat.cmu.edu/
6.4.1.2 Linear Algebra

Much of the literature on linear algebra can appear abstruse to the non-mathematician. Two recent and accessible introductory textbooks are (Poole 2005) and (Strang 2009); older, but still a personal favourite, is (Fraleigh and Beauregard 1994).
Linear algebra websites

• PlanetMath: Linear algebra: http://planetmath.org/encyclopedia/LinearAlgebra.html
• Math Forum: Linear algebra: http://mathforum.org/linear/linear.html
6.4.2 Cluster Analysis

As with general statistics, the literature on cluster analysis is extensive. It is, however, much more difficult to find introductory-level textbooks for cluster analysis, since most assume a reasonable mathematical competence. A good place to start is with (Romesburg 1984), a book that is now quite old but still a standard introductory text. More advanced accounts, in chronological order, are (Jain and Dubes 1988; Arabie et al. 1996; Gordon 1999; Jain et al. 1999; Everitt et al. 2001; Kaufman and Rousseeuw 2005; Gan et al. 2007; Xu and Wunsch 2008). Cluster analysis is also covered in textbooks for related disciplines, chief among them multivariate statistics (Kachigan 1991; Grimm and Yarnold 2000; Hair et al. 2007; Härdle and Simar 2007), data mining (Mirkin 2005; Nisbet et al. 2009), and information retrieval (Manning et al. 2008).
Cluster analysis websites

• P. Berkhin (2002), Survey of clustering data mining techniques: http://citeseer.ist.psu.edu/cache/papers/cs/26278/http:zSzzSzwww.accrue.comzSzproductszSzrp_cluster_review.pdf/berkhin02survey.pdf
• Journal of Classification: http://www.springer.com/statistics/statistical+theory+and+methods/journal/357
6.4.2.1 Cluster Analysis Software

Many general statistics packages provide at least some cluster analytical functionality. For clustering-specific software a web search using the keywords 'clustering software' or 'cluster analysis software' generates numerous links. See also the following directories:

• Classification Society of North America, cluster analysis software: http://www.pitt.edu/~csna/software.html
• Statlib: http://lib.stat.cmu.edu/
• Open Directory Project, cluster analysis: http://search.dmoz.org/cgi-bin/search?search=cluster+analysis&all=yes&cs=UTF-8&cat=Computers%2FSoftware%2FDatabases%2FData_Mining%2FPublic_Domain_Software
6.4.3 Statistical Methods in Linguistic Research

Mathematical and statistical concepts and techniques have long been used across a range of disciplines concerned in some sense with natural language, and these concepts and techniques are often relevant to corpus-based linguistics. Two such disciplines have just been mentioned: information retrieval and data mining. Others are natural language processing (Manning and Schütze 1999; Dale et al. 2000; Jurafsky and Martin 2008; Cole et al. 2010; Indurkhya and Damerau 2010), computational linguistics (Mitkov 2005), artificial intelligence (Russell and Norvig 2009), and the range of sub-disciplines
that comprise cognitive science (including theoretical linguistics) (Wilson and Keil 2001). The literatures for these are, once again, very extensive, and, to keep the range of reference within reasonable bounds, two constraints are self-imposed: (i) attention is restricted to the use of statistical methods in the analysis of natural language corpora for scientific as opposed to technological purposes, and (ii) only a small and, one hopes, representative selection of mainly, though not exclusively, recent work from 1995 onwards is given, relying on it as well as (Köhler and Hoffmann 1995) to provide references to earlier work.
6.4.3.1 Textbooks

Woods et al. 1986; Souter and Atwell 1993; Stubbs 1996; Young and Bloothooft 1997; Biber et al. 1998; Oakes 1998; Baayen 2008; Johnson 2008; Gries 2009; Gries et al. 2009.
6.4.3.2 Specific Applications

As with other areas of science, most of the research literature on specific applications of quantitative and more specifically statistical methods to corpus analysis is in journals. The one most focused on such applications is the Journal of Quantitative Linguistics; other important ones, in no particular order, are Computational Linguistics, Corpus Linguistics and Linguistic Theory, International Journal of Corpus Linguistics, Literary and Linguistic Computing, and Computer Speech and Language.

• Language classification: (Kita 1999; Silnitsky 2003; Cooper 2008)
• Lexis: (Andreev 1997; Lin 1998; Li and Ave 1998; Allegrini et al. 2000; Yarowsky 2000; Baayen 2001; Best 2001; Lin and Pantel 2001; Watters 2002; Romanov 2003; Oakes and Farrow 2007)
• Syntax: (Köhler and Altmann 2000; Gries 2001; Gamallo et al. 2005; Köhler and Naumann 2008)
• Variation: (Kessler 1995; Heeringa and Nerbonne 2001; Nerbonne and Heeringa 2001; 2010; Nerbonne and Kretzschmar 2003; Nerbonne et al. 2008; Nerbonne 2009; Kleiweg et al. 2004; Cichocki 2006; Gooskens 2006; Hyvönen et al. 2007; Wieling and Nerbonne 2010; Wieling et al. 2013)
• Phonetics/phonology/morphology: (Jassem and Lobacz 1995; Hubey 1999; Kageura 1999; Andersen 2001; Cortina-Borja et al. 2002; Clopper and Paolillo 2006; Calderone 2009; Mukherjee et al. 2009; Sanders and Chin 2009)
• Sociolinguistics: (Paolillo 2002; Moisl and Jones 2005; Moisl et al. 2006; Tagliamonte 2006; Moisl and Maguire 2008; Macaulay 2009)
• Document clustering and classification: (Manning and Schütze 1999; Lebart and Rajman 2000; Merkl 2000). Document clustering is prominent in information retrieval and data mining, for which see the references to these given above.

Many of the authors cited here have additional related publications, for which see their websites and the various online academic publication directories.
Corpus Linguistics Websites

• Gateway to Corpus Linguistics: http://www.corpus-linguistics.com/
• Bookmarks for Corpus-Based Linguistics: http://personal.cityu.edu.hk/~davidlee/devotedtocorpora/CBLLinks.htm
• Statistical natural language processing and corpus-based computational linguistics: an annotated list of resources: http://nlp.stanford.edu/links/statnlp.html
• Intute, Corpus Linguistics: http://www.intute.ac.uk/cgi-bin/browse.pl?id=200492
• Stefan Gries' home page links: http://www.linguistics.ucsb.edu/faculty/stgries/other/links.html
• Text Corpora and Corpus Linguistics: http://www.athel.com/corpus.html
• UCREL: http://ucrel.lancs.ac.uk/
• ELSNET: http://www.elsnet.org/
• ELRA: http://www.elra.info/
• Data-intensive Linguistics (online textbook): http://www.ling.ohio-state.edu/~cbrew/2005/spring/684.02/notes/dilbook.pdf
CHAPTER 7

CORPUS ARCHIVING AND DISSEMINATION

PETER WITTENBURG, PAUL TRILSBEEK, AND FLORIAN WITTENBURG
7.1 Introduction

The nature of data archiving and dissemination has been changing dramatically with the trend towards an all-digital data world, and it will continue to change due to the enormous innovation rate of information and communication technology. New sensor equipment enables all researchers to create large amounts of data; new software technology allows users to create data enrichments of all sorts; and internet technology allows users to virtually combine and utilize data. In addition, the ongoing innovation will require the continuous migration of data. Thus data management has to deal with continuous changes; the nature of collections is no longer static, and new channels of dissemination have become available. These changes have also resulted in a blurring of the term 'corpus' in modern data archiving and dissemination. Traditionally, a linguistic corpus has been defined as a large, coherent, and structured set of resources created to serve a certain research purpose and usually used for statistical processing and hypothesis testing. Motivated by advances in information technology, we now increasingly consider re-purposing existing resources and using them in different contexts, i.e. selecting resources from different corpora, merging them into virtual collections, doing completely unforeseen types of analysis, and enriching them in various ways. Thus, a single resource or a group of resources from a certain corpus can become part of a type of 'virtual corpus'. This is the reason why we prefer to speak about collections in this chapter.

In this chapter, we will first discuss the traditional model of archiving and dissemination; analyse in detail what has changed in these processes based on digital innovation;
discuss the curation1 and preservation requirements that need to be met; discuss briefly the need for advanced infrastructures; and finally present a specific archive that meets some of the modern requirements.
7.2 Traditional Archives of Corpus Primary Data

Traditional corpus archives, including those that store analogue recordings of sounds, are characterized by the close relationship between carrier and content. Every access action is associated with a loss of quality of the content, i.e. we can say that content and carrier are mutually intertwined. For modern digital technology, too, access is of course associated with quality reduction—a rotating disk has a short lifetime due to attrition. However, the big difference is that digital copying can be done without loss of information.

One of the principles explained in the guidelines for traditional media—be it paper or analogue media—is not to touch the original, and thus to restrict access even if it is for copying purposes. This affects workflows, ways of curation, dissemination, and business models. Originals need to be stored in special and expensive environments;2 only at restricted moments will master copies be taken, which will then be used to create the copies used for dissemination.3 For analogue media it is known that, due to electronic circuitry, there is a degradation of at least 3 dB for each copy, and additional damage to the carriers, e.g. due to mechanical factors, is possible. Consequently, business and dissemination models follow rather restrictive and static principles, i.e. on request, copies on a new carrier are created and disseminated by ground mail, requiring various activities from the archivists. The copying of content fragments to new carriers is also possible, but is even more time-consuming.

Curation of analogue recordings, for example, implies a transfer from one carrier format to another when it is announced that existing player technology is to be phased out. Often these transformations are started too late, so that old players are only to be found in specialized institutes and copying becomes expensive because of the manual operation required and increasing hardware maintenance costs. In proper traditional archives, metadata records that describe the history of a recording were maintained manually, either on paper or in a database, so that the validity of a certain operation or interpretation could be checked.

1 By (digital) curation we mean the process of selecting, preserving, and maintaining the accessibility of digital collections, which includes, for example, migration to new formats.
2 A UNESCO study (Schüller 2004) revealed that about 80 per cent of the recordings of cultures and languages are endangered due to carrier substrate deterioration. This fact indicates that in reality many originals are not treated appropriately, which will certainly lead to data loss.
3 A good example is given by the film industry in Hollywood, which is planning to store the originals of films in old salt mines. Experience shows that after 50 years about 50 per cent of the films can still be read (Cieply 2007).
7.3 Digital Corpus Archives and Dissemination

Digital technology is dramatically changing the rules in all respects—some speak of a revolution. Certainly the basic rule 'don't touch the original' is no longer valid. Digital copying, when done properly by maintaining integrity and authenticity, can be carried out automatically, without limits and without loss of information. The disadvantages of very short media lifetime and media fragility are more than compensated for. The opposite is true now: 'touch the stored objects regularly'.
7.3.1 Dynamic Online Archives

This principle is congruent with the wishes of researchers to access the stored data whenever they want, and it changes the basic rules for archives. The traditional model of managing data was based on two pillars, 'long-term preservation' and 'reusing the data', and, for the reasons mentioned above, they had to be tackled separately. Digital technology allows us to switch to online archives, i.e. there is no longer any need to maintain a separate archive where no one can access the 'originals'—in fact this is counterproductive. Automatic procedures allow us to create several copies, and software technology allows us to separate 'primary' resources from all kinds of enrichments, which then may become primary resources for other researchers. Therefore we can say that modern digital archives are 'live archives'—the stored objects are subject to continuous changes. These can be (1) migrations to new carriers, which need to be carried out on average every four years; (2) migrations to new encoding standards and formats; (3) the creation of a variety of presentation formats; (4) new versions of stored resources; and (5) enrichments in the form of added resources, new annotations, and extended relations between object fragments. Obviously such a digital archive includes an increasing complexity of relations between objects and object fragments that serve a wide variety of functions, as depicted in Figure 7.1.
7.3.2 Handling Complexity

Such complexity needs to be managed, and elements of a feasible solution have been consolidated in recent years. The major elements are:

• Each object needs to be identified by a persistent and unique identifier (PID)4 that needs to be registered at a dedicated external registry; such PIDs need to be associated with checksum-type information that can e.g. be used to check integrity, and can point to different instances (copies) of the same object; each new version of an object needs to receive its own PID to prove identity and integrity and to allow correct referencing.
• Each object (except metadata objects) needs to be associated with a metadata description (see chapter I.7) that includes stable references to all its closely related objects, i.e. a metadata description of an annotation should include the PID of the primary resource(s).
• Metadata descriptions need to include provenance information either directly or indirectly by referring to a separate object.
• It must be possible to create an unlimited number of virtual collections on the basis of the stored objects, and each such virtual collection is defined by its metadata description which will include all references, i.e. users must be able to create their own collection organization by creating hierarchies of virtual collections. One such organization is called the canonical organization,5 since it will be used by the depositors and archive managers to establish authority domains and to carry out management operations.
• All objects, in particular metadata descriptions, need to be represented as atomic objects in standard encodings and formats to make their readability as independent from layered software as possible. Encapsulation makes sense for indexes that are derived to support fast searching.
• Database and textual structures need to be specified by registered schemas,6 and all tag sets used should be registered in concept registries.7
• Each archive needs to have a well-maintained repository management system that has a gatekeeper function to guarantee the archive's consistency by checking encoding and format correctness, by creating a PID record, by checking the validity of the metadata descriptions, and by extending them to include relevant references.8 (A minimal code sketch of such gatekeeper checks is given at the end of Section 7.3.3 below.)
• A variety of access methods needs to be provided to support the naive as well as the professional user.

Figure 7.2 schematically illustrates the operations to be carried out when an annotation of an existing audio recording is added to a collection.

FIGURE 7.1 The complexity of relations which need to be managed. The object has a metadata description which points to the PID that has paths to all accessible object instances. It can be part of several collections—amongst them the original 'corpus' it was meant to be part of—each of which has its own metadata description. There can be new versions of an object that need to be bundled by the metadata description. There can be a variety of derived objects requiring the maintenance of the relations; and finally there will be unlimited relations between fragments of various objects.

FIGURE 7.2 The activities typically carried out by a proper repository system when a new annotation is added, which is a typical enrichment action. In general, stand-off principles need to be applied, i.e. existing objects may not be affected. In the case shown it could be an authorized user who creates an annotation and an updated version of the metadata description containing information and references to both the primary object and the new annotation. Nevertheless, the gatekeeper function will carry out a few checks, perform a few calculations, and then automatically request a PID by providing typical information. Once all operations have been successfully concluded, the annotation will be integrated into the collection and the metadata description will be updated.

4 Persistent IDentifier (PID) services are now being offered to register data collections and objects (EPIC: http://www.pidconsortium.eu/; DataCite: http://datacite.org).
5 This can be compared with the difference between the Unix directory structure and soft links.
6 Databases as well as textual structures such as those storing a lexicon can have complex structures. To be able to interpret such a complex structure correctly, a description of its syntax is required. For relational databases this is done by providing its logical data model with the help of the Data Definition Language, and for XML files its schema based on a schema definition. These structure definitions should be externally accessible via open registries. CLARIN is currently building such a schema registry.
7 While schema registries help in parsing the structure correctly, concept registries such as ISOcat (http://www.isocat.org) help the user to interpret the semantics of tag sets that are e.g. used to describe linguistic phenomena such as parts of speech.
8 Examples of such repository management systems are FEDORA (http://www.fedora-commons.org/), DSpace (http://www.dspace.org/), and LAMUS (http://tla.mpi.nl/tools/tla-tools/lamus/).

7.3.3 Data Management

As the amount of data grows also in the domain of linguistics, the principles described above are becoming increasingly relevant. Some experts already speak of a 'data deluge', others use the term 'data tsunami', to refer to the challenges we are facing due to the necessity of handling data on an enormous scale—something the natural sciences have been aware of for years. Multimedia archives containing digitized audio and video recordings, brain images, and other time-series data easily extend to some hundreds of terabytes and millions of related objects. Both together—scale and complexity—can no longer be handled in traditional ways, such as one researcher keeping all of his or her data in directories on a notebook. One professionally acting field researcher reported about 6,000 files created on his notebook, covering all materials of a certain language, i.e. objects that are related to each other in multiple ways; he was no longer able to manage this. We also see that researchers who upload resources without metadata to a server typically forget about potentially valuable content after a few months. As a consequence, researchers should take up their responsibilities and take concrete action urgently as part of a long-term strategy.

The basis for all strategic decisions is a three-layer model of responsibility and actions illustrated in Figure 7.3: (1) users preferably want to access the data universe from their notebooks independent of time and location; (2) domain-specific services will be organized that are backed by detailed knowledge of discipline-specific solutions; and (3) common data services such as long-term preservation will be offered by strong data centres. Figure 7.3 also shows that 'data curation' is a task which includes all experts involved in the whole data object lifecycle, and that the acceptance and seamless functioning of such a layered system is dependent on mutual trust.

FIGURE 7.3 The principal architecture suggested by the High-Level Expert Group on Scientific Data ('The collaborative data infrastructure: a framework for the future'). It covers mainly three layers: the users generating data or making use of stored data (user functionalities, data capture and transfer, virtual research environments); community-specific centres offering services that incorporate domain knowledge (data discovery and navigation, workflow generation, annotation, interpretability); and finally common data services that are the same for all research disciplines (persistent storage, identification, authenticity, workflow execution, mining). Of course this can only be seen as a reference model, i.e. we will see different types of realization.

The model stresses the fact that certain tasks cannot be done by the individual researcher in a useful way for a number of reasons. Researchers need to find ways to deposit their data in professional and trusted centres that offer appropriate access services and guarantee long-term preservation. Increasingly we see that data creation funded by government may no longer be seen as private capital, but as data that should be accessible to all interested researchers. Sharing with a larger community, however, requires that data be deposited with a recognized repository.
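As a concrete illustration of the gatekeeper function described in Section 7.3.2, and of the kind of validation a trusted repository performs on deposited data, here is a minimal, hypothetical Python sketch. The format whitelist, the metadata fields, and the mint_pid stub are invented for the example; a real archive would register identifiers with an external PID service such as EPIC or DataCite rather than generating them locally.

```python
import hashlib
import uuid
from pathlib import Path

ACCEPTED_SUFFIXES = {".wav", ".xml", ".txt"}   # illustrative format whitelist

def md5_of(path: Path) -> str:
    """Compute the MD5 checksum of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def mint_pid() -> str:
    """Stub: a real repository would register the object with a PID service."""
    return "hdl:0000/" + uuid.uuid4().hex

def ingest(path: Path, metadata: dict) -> dict:
    """Gatekeeper-style checks before an object enters the archive."""
    if path.suffix.lower() not in ACCEPTED_SUFFIXES:
        raise ValueError(f"unsupported format: {path.suffix}")
    if not metadata.get("title") or not metadata.get("primary_pid"):
        raise ValueError("metadata must include a title and the primary resource PID")
    # Record the checksum with the freshly minted PID for later integrity checks
    return {"pid": mint_pid(), "md5": md5_of(path), **metadata}
```

In a real repository this logic would sit behind the deposit workflow of a system such as FEDORA, DSpace, or LAMUS, which implement far more extensive checks.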
7.3.4 Costs

In the context of the discussions about the nature of common data services, the term 'Cloud' is being discussed intensively. In contrast to the Grid model, where data and in particular operations are distributed over a large number of nodes, the Cloud model concentrates data and operations in large compute and storage facilities; thus we can speak of a form of centralization of resources. Within the Cloud we also have a distributed scenario, but all covered by one authority domain and one technological solution. Such large facilities can offer almost unlimited storage capacities at low prices, since economies of scale apply. Since these facilities are set up at strategic places, they can also operate their services under ecologically optimized conditions, and are thus attractive ways to set up common data services. Since mirroring of data at different locations is a must for long-term preservation, there will always be some form of distribution outside a Cloud's authority domain. However, most of the costs for data lifecycle management are caused not by pure bit-stream preservation but by curation efforts and by maintaining access services, as Table 7.1 indicates.

Table 7.1 Annual operating costs of the archive at the Max Planck Institute for Psycholinguistics

Cost factor                                      Costs (1000 €/yr)   Comments
Local IT and storage infrastructure              80                  4–8 years innovation cycle
4 copies at large data centres                   10–20               All copying activity is automatic
Local system management                          40                  Shared for different activities
Archive management                               80                  Archive manager and student assistants
Repository software maintenance                  60                  Basic code maintenance
Utilization/exploitation software maintenance    >120                These costs can be a bottomless pit

The results reported in Table 7.1 can be compared with what Beagrie found in an overview of some data archives in the UK (Schüller 2004): (1) the costs for staff are much higher than for equipment, and (2) the costs for ingest (42 per cent) and access (35 per cent) are higher than for storage and preservation (23 per cent). The still relatively high sum for storage and preservation has to do with the fact that curation costs are included.

Table 7.1 indicates the costs for the archive at the Max Planck Institute for Psycholinguistics, which stores about 75 terabytes of data in an online accessible archive, stores about 200 terabytes of additional data, maintains a local hierarchical storage management system including multilayer hardware and software components, and automatically creates four dynamic copies at remote data centres. The costs of the remote storage are almost negligible compared to the other costs. Maintaining data and the basic software components9 which ensure that data are accessible and interpretable requires most expenditure. Investments in utilization and exploitation software depend on the level of sophistication required. Technological innovation means that the costs for local investments (row 1) may be slightly reduced in future.

At the Max Planck Institute we can observe another trend, which is indicated in Figure 7.4 by the difference between the circled line (total amount of data) and the starred line (organized and partly annotated data): an increasingly high percentage of the data collected is neither described nor annotated, i.e. it cannot be used for scientific purposes. There is an increasing demand for better automatic analysis and exploitation methods.
9 Once useful repository software supporting the basic requirements is commercially available, this cost factor may be reduced, but then considerable licence costs must be calculated, comparable to the costs for a professional hierarchical storage management system such as SAM-FS (Schüller 2004).

FIGURE 7.4 MPI digital archive: data volume in terabytes, 2000–2012. This diagram indicates that at the Max Planck Institute an increasing amount of data is collected but not described in a way that allows it to be used for analysis. The starred line indicates the amount of described (annotated) data, the circled line the total amount of observational data.

7.3.5 Data Dissemination

Digital technologies have not only changed the ways archives are organized and data are managed; they have also dramatically affected the channels of dissemination. As already indicated, all copying activity to strong data centres needs to be carried out automatically and dynamically. In general, only new data will be transferred, about 1 terabyte per month in the MPI case, which does not constitute a problem for current computer and network technology. We see only one scenario where an Internet-based exchange of data does not seem feasible at present: a user (group) needs fast access to large collections, e.g. to train a stochastic engine, and thus a whole data set of several terabytes or more needs to be transferred in a short time period. In such cases it may still be useful to send media such as tapes by ground mail. The worldwide film industry, which creates animated high-resolution films through the joint activity of highly specialized labs operating around the globe, exchanges modules rapidly via the Internet to achieve the required production speed (see the CineGrid project [2]). This may give an impression of how state-of-the-art network technology can be used to disseminate large data volumes, and how dissemination will evolve in the coming years to support modern production lines. For random access even to lengthy media files, the Internet is a very convenient platform when, for example, highly compressed media streaming is applied: only those fragments are downloaded which have been requested or which have a high probability of being analysed next. Traditional dissemination channels are thus not completely obsolete, but they will be widely replaced by methods using the Internet.

Another aspect of dissemination has to do with chunking and packaging. Traditionally the user ordered a certain 'corpus' such as the British National Corpus [3] or the Dutch Spoken Corpus [4]; this would be copied to a carrier and shipped to the user. For specific types of usage, this may still be the default way of acting. But as already indicated, increasingly often this static usage will be replaced by a more dynamic usage which also opens the way to continuously extended collections. Researchers, for example, may take a few resources from corpus A and another few resources from corpus B if they all cover interactions between mothers and children at a certain age relevant for the research question in focus. They create a 'virtual collection' to work on, including the creation of special annotations and links between elements of the two resource sets. In such dynamic scenarios it will no longer be as straightforward to do appropriate packaging. Exchange formats such as METS [5] and MPEG21 DIDL [6] allow bundling metadata descriptions and all sorts of relations an object has, but these are no longer static and can only be created on the basis of users' selection decisions. Thus dissemination, if seen only as one-directional, will become old-fashioned for most research activities.
7.4 Curation and Preservation

We have argued that, in the research domain in particular, we have dynamic archives, which is partly caused by the fact that encoding standards, structuring formats, and software components are continuously changing. Furthermore, with larger amounts of data the likelihood of bit errors—although small—may have an effect on data integrity. Thus on the one hand we need to make sure that bit-streams are maintained correctly, and on the other hand we need to ensure that interpretability is preserved while maintaining authenticity.

Maintaining bit-streams—thus taking care of data integrity—in a distributed environment requires identifying the object at data entry time by a persistent identifier (PID), a reliable checksum indicator such as MD5 associated with the PID, and a metadata description with some verified type and format specifications. Any operation at any site storing copies of the object needs to verify whether the object is still the same, i.e. whether the bit-stream has been conserved correctly. To date there is no general solution in place for open repository domains to verify correctness in the sketched way; some projects such as REPLIX [7] are currently working on this issue.

Much more problematic is to ensure interpretability and 'appropriate' presentation in a world where software technology is changing at dramatic speed and where new encoding standards and formats are emerging regularly. In the area of textual data we can now refer to UNICODE and XML as widely accepted standards offering basic stability for some decades.10 For sound we have linear PCM at HIFI norms (44.1/48 kHz/16 bit or 96 kHz/24 bit), which offers a solid base as master format. However, various compressed formats such as MP3 will continue to emerge for different purposes. In the area of video streams we have been faced with a dynamic development of encoding schemes starting from MPEG1 and continuing to MPEG2, MPEG4/H.264,11 and also MJPEG 2000—all introduced in little more than a decade. Since there is no guarantee that software will support the rendering of 'old' formats for long periods, we need to carry out transformations at regular intervals.

10 The schema of a resource may change over time, but it is still described by the same Syntax Description Language.
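The bit-stream integrity check described above can be sketched in a few lines: an MD5 checksum recorded at ingest is compared with freshly computed checksums of each stored copy. The layout of the PID record below is an assumption made for the example, not a standard.

```python
import hashlib
from pathlib import Path

def md5_of(path: Path) -> str:
    """Compute the MD5 checksum of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copies(pid_record: dict) -> dict:
    """pid_record is assumed to look like
    {'pid': 'hdl:...', 'md5': '...', 'copies': ['/site1/x.wav', '/site2/x.wav']}.
    Returns, per copy, whether its bit-stream still matches the ingest checksum."""
    return {copy: md5_of(Path(copy)) == pid_record["md5"]
            for copy in pid_record["copies"]}
```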
7.4.1 Compression and Transformation

It needs to be emphasized that with certain so-called 'lossy' compression codecs, such as MP3 or MPEG1/2/4, information which is claimed not to be relevant for our perception is simply filtered out and is not recoverable anymore. Thus 'lossy' compression raises the question of authenticity: despite the fact that many relevant acoustic features can be extracted from the compressed information, as van Son [8] has shown, archiving and compression are not compatible. Another problematic step is switching from one codec to another and creating a series: the concatenation of transformations can result in severe audible or visible artefacts which disturb the original information. For this reason, it is much better to keep the archival master file in an uncompressed format and to generate from it the various presentation formats, which can be compressed, e.g. for transmission reasons. For audio signals this problem has been solved; for video signals it seems that lossless MJPEG200012 will become the accepted standard. Intensive quality checks by the archivist in collaboration with community experts are necessary to guarantee that the authenticity of the original information is preserved in every operation from digitization to any subsequent encoding. Wrong choices can easily lead to loss or change of information: (1) information loss with MP3, as indicated in Figure 7.5, or (2) blocking deformation, as shown in Figure 7.6. These examples also underline the relevance of maintaining provenance information. It should also be obvious that decisions about transformations can only be made after the consequences have been analysed in detail.
7.4.2 Lifecycle Management

All transformations of resources lead to new content; thus they must be associated with a new PID, since otherwise identity and integrity cannot be controlled.13 It is a matter of repository policy whether a new metadata description is created or whether the new version is bundled into the existing structured metadata description. Whatever is done, provenance information needs to be stored, so that at any moment in time the user can trace what kinds of transformation have been carried out to arrive at the object as it is.14

11 Soon we can expect MPEG4/H.265 to be massively supported by industry.
12 With 'lossless' compression it is possible to reconstruct the original signal, but compression factors higher than 2 are not possible.
13 It is widely agreed that metadata about resources is more dynamic, in that information can be added without leading to new versions.
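As a minimal illustration of the lifecycle rule in section 7.4.2, the following sketch mints a new PID for each derived object and records a provenance entry linking it to its source; the registry structure and the mint_pid helper are hypothetical, not part of any standard.

import datetime
import uuid

def mint_pid():
    # illustrative stand-in for a real PID service such as Handle
    return 'hdl:example/' + uuid.uuid4().hex[:12]

def register_transformation(registry, source_pid, operation, new_md5):
    """Record a transformation: the derived object gets a new PID,
    and a provenance entry links it back to its source object."""
    new_pid = mint_pid()
    registry[new_pid] = {
        'derived_from': source_pid,
        'operation': operation,        # e.g. 'transcode MPEG2 -> H.264'
        'md5': new_md5,
        'timestamp': datetime.datetime.utcnow().isoformat(),
    }
    return new_pid

Following the derived_from links backwards reconstructs the full transformation history of any object.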
[Figure 7.5: plot of intensity (dB, 0–80) against frequency (0–16 kHz).]
FIGURE 7.5 Psycho-acoustic masking: a strong tone at 1 kHz (dark grey) masks all tones whose intensity lies below the dotted line, i.e. according to psycho-acoustic findings such masked tones would not be recognizable by humans. MP3 algorithms make use of this phenomenon and filter out the masked frequency components. Thus MP3 recordings reduce information content.
[Figure 7.6: two panels, (a) and (b).]
FIGURE 7.6 The same information as in Figure 7.4: on the left side encoded in an uncompressed way, and on the right side encoded with MPEG2 at 6 Mbit/s. In the right-hand image the blocking phenomenon can be seen. It is up to the researcher to decide whether this distortion is acceptable. For archiving it is not acceptable, since we do not know what kind of analysis will be done in the future. These images were generated by the Phonogrammarchiv Vienna.
7.5 Data Centres

In the previous sections we described how the area of archiving and dissemination has been changing, and indicated that further changes can be expected
14 Using metadata records to maintain context and provenance information about a resource can be seen as an improved versioning system offering linguistically relevant information.
due to the enormous rate of technological innovation (digital technology has, for example, revolutionized the nature of archiving). Obviously we need new types of centre that:

• implement mechanisms as described above;
• have an open deposit policy allowing users to deposit their corpora if they adhere to a number of requirements to establish trust (see below);
• allow users to build, store, and manipulate virtual collections; and
• allow users to add annotations of various sorts.

Obviously this can only be done if proper repository systems are used that implement the above-mentioned principles. These 'new' types of centre must be available to support individual researchers in their daily workflow, since researchers will not be able to carry out proper data lifecycle management themselves but will nevertheless need easy and flexible access to their collections. Therefore it is currently fairly common in different research disciplines to describe requirements for centres and to agree on Service Level Agreements, so as to ensure expected behaviour and high availability. One criterion for Google's success is that it is always available and operates as users expect it to. So-called research infrastructures—be it the Large Hadron Collider [9] in physics, ELIXIR [10] in bioinformatics, or CLARIN [11] in the linguistic community—work along the same lines: we need a structured network of professional centres that behave as expected. The CLARIN research infrastructure has established criteria for centres that mainly cover mechanisms as described above [12]. As indicated in the 3-layer diagram (Figure 7.3), these centre networks need to collaborate with other infrastructures that offer common services such as long-term archiving, to establish what may be called an ecosystem of infrastructure services.

Another major aspect of establishing trust, which is the key to wide acceptance, is clarifying the rights situation. This is an immensely complex issue, especially in an international setting where resources may have been created in one country while (web) users are located in others. Data-driven research will depend on free and seamless access to useful data and on the possibility of easily combining this data with data from other researchers. So far we cannot see how the rights issue will evolve, but it seems obvious that a sort of 'academic use' policy must come into place, based on clear user identities and ethically correct behaviour. Despite all wishes for unrestricted access, we will be faced with restrictions related to personality rights (recorded persons may not want their voices or faces exhibited to everyone) and licences. Repositories need to make sure that they are capable of handling such restrictions in a proper way.

Obviously we are entering a scenario where many users will need to rely on data centres and the quality of their data without having continuous personal contact. Thus, in a scenario where important actors remain anonymous with respect to each other, new ways of assessing the quality of repositories and their resources need to be established.
Three proposals have been worked out to assess a centre's quality: Repository Auditing and Certification (RAC) [13], Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) [14], and Data Seal of Approval (DSA) [15]. RAC was proposed by the MOIMS-Repository Audit and Certification Working Group based on the OAIS model for archiving (ISO 14721) [16], and is heading towards a new, more refined ISO standard for quality assessment. DSA describes a more lightweight procedure to ensure that in the future research data can still be processed in a high-quality and reliable manner, without this entailing new thresholds, regulations, or high costs. DRAMBORA offers a toolkit that facilitates internal audit by providing repository administrators with a means to assess their capabilities, identify their weaknesses, and recognize their strengths. Archiving centres need to follow one of these procedures to indicate that they adhere to certain quality guidelines.

Part of the quality assessment is the responsibility of the 'data producer', who needs to ensure (1) that there is sufficient information for others to assess the scientific and scholarly quality of the research data, and that there is compliance with disciplinary and ethical norms, (2) that the formats are in agreement with repository policies, and (3) that metadata of sufficient quality is provided. Thus, when collecting a corpus, it is important to anticipate these quality requirements.
7.6 MPI Archive

We want to briefly describe a concrete archive as an example that comes close to the principles described above. The digital language archive at the MPI for Psycholinguistics makes use of an in-house-developed archiving system that contains ingest, management, and access components. The ingest and management components are combined in a tool named LAMUS (Language Archive Upload and Management System) [17], which allows depositors to upload resources together with metadata descriptions into the archive and to create a canonical organization for their data. The software performs file-type checks to verify that the uploaded files are of the type they claim to be and are on the list of accepted file types for the archive. The metadata description files are in IMDI format, which allows resources to be described with categories enabling research-based selections (such as age, sex, and educational background of interviewees)15 [18], and are also validated upon upload. Linked to LAMUS there is an elaborate system that allows depositors to define access rules for their data. There are various levels of access: completely open, open for registered users, available upon request, and completely closed.

Resources and metadata descriptions that are ingested automatically receive a persistent identifier. The MPI uses the Handle System [19] for persistent identifiers
15 Examples can be found when looking at metadata descriptions in the open catalogue (http://corpus1.mpi.nl).
because it is widely used, has shown its reliability over the past years, and does not require payment per issued PID (instead, only a small annual fee for the Handle prefix is paid).16

Once the resources are ingested into the archive using the LAMUS software, the files are placed in a volume of a SAM-FS-based Hierarchical Storage Management (HSM) system. This system consists of three layers of storage: a layer of fast hard drives, a layer of slower hard drives, and a layer of LTO5 data tapes. Files are migrated dynamically back and forth between the storage layers depending on usage, demand, and rules that are defined for different file types. The complete HSM can store up to 1.2 petabytes of data in its current form. Consistency checks of the archived files and metadata are performed continuously, and reports of any errors are sent to the archivists automatically. Two copies are automatically created in the HSM system.

Upon ingest, the metadata records are indexed and made available in an IMDI-based online browse and search tool so that users can look for the data they need using a standard web browser. All archived files can in principle be downloaded via the web, provided that the user has the appropriate access rights. Connected to the metadata catalogue browse and search tool, there are online viewers for various types of files. Time-aligned annotations to media files, for example, can be viewed online via a tool called ANNEX [21]. This tool displays the annotations in synchrony with the audio or video streams in a web browser. It offers different views of the annotations, such as a timeline, a 'subtitle' view, and a plain text view. Uploaded annotation files are also indexed so that they can be searched with an integrated search tool called TROVA [22]. This tool allows users to search for occurrences of linguistic structures within the annotation files. It can also export the search results as comma-separated text files for further analysis in, for example, a statistics program.

As described in the previous section, long-term preservation of digital data poses some challenges. For safeguarding the data, the MPI archive creates automatic backup copies at two computer centres of the Max Planck Society, in Göttingen and in Garching. Each centre also has an off-site backup solution. To increase the chance of future interpretability of the data, only a limited set of file types is currently allowed in the archive—e.g. for video data we do not accept files in every one of the large number of formats and codecs that are around, but instead limit the formats to MPEG1, MPEG2, MPEG4/H.264, and MJPEG2000. For audio we only archive linear PCM WAV files in 16 bit 44.1 or 48 kHz resolution. Having a limited set of widely accepted formats should make conversions to other formats more feasible in the future if a format becomes obsolete.

The MPI archive has undertaken the Data Seal of Approval (DSA) self-assessment, and will most likely be awarded the Data Seal of Approval in the course of 2011, after a review by the DSA board and another external reviewer.
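The kind of file-type check performed at ingest can be sketched as follows; this is not the actual LAMUS implementation, and the accepted-type table with its magic-byte signatures is only an illustrative subset.

import os

# Accepted archive formats mapped to the 'magic bytes' that files of
# that type are expected to start with (illustrative subset only).
ACCEPTED_SIGNATURES = {
    '.wav': b'RIFF',                  # linear PCM WAV container
    '.mpg': b'\x00\x00\x01\xba',      # MPEG program stream
}

def check_file_type(path):
    """Verify that a file's extension is on the accepted list and that
    its content really starts with the corresponding signature."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ACCEPTED_SIGNATURES:
        return False
    signature = ACCEPTED_SIGNATURES[ext]
    with open(path, 'rb') as f:
        return f.read(len(signature)) == signature

Checking the file content rather than trusting the extension alone is what allows an archive to reject files that merely claim to be of an accepted type.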
16 The European PID Consortium [20] is now offering Handle registration to all registered research data centres. Since PIDs need to be persistent, there will be no removal of registered PIDs without consent.
7.7 Conclusions

In this chapter we have argued that the archiving and dissemination of corpora have been changing dramatically as a consequence of innovation and an all-digital world, and that dramatic changes still lie ahead. The most fundamental rule of traditional archiving, 'don't touch the original artefacts', has been reversed into the basic rule for digital archives: 'touch the digital resources frequently'. This fundamental change allows us to turn to live digital archives where we no longer distinguish between an archive for long-term preservation and a copy used for all sorts of access. Instead, we see a rich set of collections that are being extended as part of research workflows. The notion of a 'corpus' is therefore blurring, insofar as digital technology allows users to re-purpose parts of different corpora into new virtual collections used for research purposes that were not thought of at the time of creation of the corpora they originally belonged to.

Such changes do not come without risks. The close relation between carrier and information has meant that we can still look back on our history, for example in the form of ancient cuneiform tablets and even papyri. But this facility was bound to data of limited size and to production processes which can no longer be used. In digital technology the possibility of copying without information loss allows us to separate carrier and information, and thus to create any number of identical copies automatically, at high speed and low cost. However, this will only work if we can rely on proper digital archiving principles such as using stand-off principles and registered persistent identifiers, creating and maintaining metadata records, and setting up proper hardware and in particular software mechanisms in dedicated centres devoted to dealing with large amounts of data.

Likewise, the channels of dissemination have changed completely and will continue to change as network bandwidth grows. New generations of users are accustomed to the web-based paradigm, and there will be fewer and fewer cases where it is necessary to ship tapes or DVDs to customers. Increasingly often, access will be web-based, and the possibility of re-purposing resources in unforeseen ways means that researchers will often only want to access parts of corpora. The amount of downloading will depend on the services associated with resources. In this respect we will see enormous changes as a consequence of inexpensive cloud storage capacity, which can be extended by easily deployed services—i.e. we can expect that the need for downloading corpora will become less and less important.
Internet Sources

[1] https://wikis.oracle.com/display/SAMQFS/Home
[2] http://www.cinegrid.org
[3] http://www.natcorp.ox.ac.uk/
[4] http://lands.let.ru.nl/cgn/ehome.htm
[5] http://www.loc.gov/standards/mets/
[6] http://xml.coverpages.org/mpeg21-didl.html
[7] http://www.mpi.nl/research/research-projects/the-language-archive/projects/replix-1/replix
[8] Van Son, R. J. J. H., 'Can Standard Analysis Tools be Used on Decompressed Speech?', COCOSDA 2002, Denver; URL: http://www.cocosda.org/meet/denver/COCOSDA2002Rob.pdf
[9] http://lhc.web.cern.ch/lhc/
[10] http://www.elixir-europe.org/
[11] http://www.clarin.eu
[12] http://www.clarin.eu/content/center-requirements-revised-version
[13] http://cwe.ccsds.org/moims/default.aspx#
[14] http://www.repositoryaudit.eu/
[15] http://www.datasealofapproval.org/
[16] http://public.ccsds.org/publications/archive/650x0b1.PDF
[17] http://tla.mpi.nl/tools/tla-tools/lamus/
[18] http://www.mpi.nl/IMDI/
[19] http://www.handle.net/
[20] http://www.pidconsortium.eu/
[21] http://tla.mpi.nl/tools/tla-tools/annex/
[22] http://tla.mpi.nl/tools/tla-tools/trova/
CHAPTER 8

METADATA FORMATS

DAAN BROEDER AND DIETER VAN UYTVANCK
8.1 Metadata: What Is It and Why Should It Be Used?

The best (although not complete) definition of metadata is still 'data providing information about other data'. Examples of such information are the name of the data creator(s), the creation date, the data's purpose, the data formats used, etc. In general, three kinds of metadata can be distinguished (internet source 1 [NISO]): descriptive metadata, which is used to search for and locate data; structural metadata, which describes how the data is internally organized; and administrative metadata, such as the data format but also information on access rights and data preservation. In this chapter we use the term 'metadata' to refer to descriptive metadata; any other usage will be explicitly mentioned.

Different approaches and requirements with respect to specificity and terminology have resulted in the development of different sets of metadata classifiers. Table 8.1 gives an example of a metadata record describing an electronic poetry publication using six metadata classifiers from the Dublin Core metadata set. Having sufficiently rich metadata for any (electronic) resource or corpus is extremely important: without it, resources cannot be located, and, for instance, the resources making up a corpus could not be properly identified and classified without inspecting their contents.

When discussing the different metadata approaches, it is useful to explain some terminology used in this context. A metadata record consists of a limited number of different and sometimes repeatable metadata elements that represent specific characteristics of a resource. An element can have a value, which can either be a free value or be constrained by a controlled vocabulary of values or by some other requirement, e.g. a valid date. The constraint on an element's value is referred to as a value scheme. The set of rules for forming a metadata record is referred to as a metadata schema or metadata set, and this usually specifies the set of metadata elements, i.e. element names, the possible values for the elements, and the element semantics.
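These notions can be made concrete with a small sketch; the Element class and the example schema below are purely illustrative, with the Language vocabulary loosely inspired by the record in Table 8.1.

import re

class Element:
    """A metadata element whose value scheme is either a controlled
    vocabulary (a set of admissible values) or a regular expression;
    with neither constraint, any free value is accepted."""
    def __init__(self, name, vocabulary=None, pattern=None):
        self.name, self.vocabulary, self.pattern = name, vocabulary, pattern

    def accepts(self, value):
        if self.vocabulary is not None:
            return value in self.vocabulary
        if self.pattern is not None:
            return re.fullmatch(self.pattern, value) is not None
        return True

# An illustrative schema: Language is constrained by a vocabulary,
# Date by a pattern, Title is free text.
schema = [
    Element('Title'),
    Element('Language', vocabulary={'Serbo-Croatian', 'Slovene', 'English'}),
    Element('Date', pattern=r'\d{4}(-\d{2}-\d{2})?'),
]

def validate(record):
    """Check every (element name, value) pair of a record against the
    schema; pairs may repeat, mirroring repeatable metadata elements."""
    by_name = {e.name: e for e in schema}
    return all(by_name[name].accepts(value)
               for name, value in record if name in by_name)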
Table 8.1 Part of a metadata record using the Dublin Core metadata set

Identifier    http://ota.ox.ac.uk/headers/0382.xml
Title         Selected works [Electronic resource] / Mirko Petrović
Creator       Petrović, Mirko, 1820–1867
Subject       Serbo-Croatian poetry—19th century
Language      Serbo-Croatian
Rights        Use of this resource is restricted in some manner. Usually this means that it is available for non-commercial use only with prior permission of the depositor and on condition that this header is included in its entirety with any copy distributed.
Currently almost all metadata schemas are expressed as XML files or file fragments, whose format is governed by an XML Schema or another schema language.

The practical use of metadata for corpora that contain many individual resources, i.e. audio and video files as well as annotations, is twofold:
1. The corpus can be described as a whole and published, for instance, in the large catalogues of Language Resource distribution agencies such as LDC and ELRA, for users to identify suitable corpora; see e.g. the LDC catalogue.1
2. The individual parts of a corpus can be identified and classified by corpus exploitation tools (e.g. COREX, the exploitation environment of the Spoken Dutch Corpus: Oostdijk and Broeder 2003). If all parts of the corpus are made available online, the metadata for each part of the corpus should also be published. This facilitates easy reuse of the resources, since it removes the need to download the complete corpus when only a few of the corpus resources are needed. For an example of such a catalogue, see the Virtual Language Observatory2 of the CLARIN project.
Of course there is also the possibility of describing subsets of the whole corpus, i.e. datasets that in terms of their size are located between the complete corpus and the individual corpus components; the individual corpus components can also each be described by a metadata record. It is obvious that the requirements for proper metadata at each of these levels are very different and require the use of different metadata schemata.

Where metadata is used to allow potential corpus users to locate suitable corpora, the metadata records from different corpora should be collected in a central metadata catalogue that users can consult to find the corpus that fulfils their requirements. This requires the corpus maintainers to publish the corpus metadata,
1 http://www.ldc.upenn.edu/Catalog/
2 http://www.clarin.eu/vlo/
which can then be harvested by the maintainers of such a central metadata catalogue. A popular protocol for this process is the well-known OAI-PMH (2).

The origins of the concept of metadata, its usage, and its terminology come from the library world, where the problem of tagging and retrieving large amounts of resources has long existed. Because the technology and experience of librarians was already advanced compared to other disciplines, it was natural for them to take the lead in trying to develop metadata description systems, such as the Dublin Core Metadata Initiative (DCMI) (1995), that aim to incorporate other scientific domains as well. With this attempt, however, which originally advocated describing all objects with a system of fifteen classifiers or elements (although qualifiers for more specificity were allowed), too much focus was put on librarians' terminology and interests (IPR etc.) to allow wide acceptance in other domains. For interoperability between different scientific domains that require mutually intelligible resource descriptions, DCMI is still a solution of choice, even if much information is lost in translating domain-specific metadata into it. DCMI is also problematic in that it is a flat list of descriptors lacking any structure, which makes it difficult to express structured information.

As stated above, there are different approaches and traditions for using metadata, involving different metadata sets. Some are targeted towards a shallow global description of resources, requiring the specification of only a few metadata elements, while others are very detailed and require considerable effort to create the metadata records. Also, the terminology, i.e. the names and semantics of the metadata elements in the different metadata sets, is often incompatible. This leads to so-called interoperability problems, for instance when corpus exploitation tools cannot be used for corpora using different metadata sets, or when metadata of different corpora must be stored in the same catalogue. The tension between the need for adequate and sufficiently rich (domain-specific) terminology in order to correctly describe resources and, on the other hand, the need for interoperability, where terms have to be understood by people from different disciplines, has led to a kind of oscillating effect in description systems, moving from small sets of descriptors with broad applicability to large sets of highly specific descriptors and back again. Some (e.g. Baker 1998) have compared this to the linguistic theory of pidginization and creolization, where pidgin languages arise when mutual intelligibility is needed and pidgins are creolized to achieve richer semantics. How this tension is resolved is often a matter of purpose or pragmatism.
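As a minimal illustration of such harvesting, the following sketch requests Dublin Core records from an OAI-PMH endpoint using only the Python standard library; the endpoint URL is a placeholder, and the handling of OAI-PMH resumption tokens (needed for large result sets) is omitted.

import urllib.request
import xml.etree.ElementTree as ET

NS = {
    'oai': 'http://www.openarchives.org/OAI/2.0/',
    'oai_dc': 'http://www.openarchives.org/OAI/2.0/oai_dc/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

def harvest_titles(base_url):
    """Issue a ListRecords request for Dublin Core metadata and return
    the dc:title of every record in the first response page."""
    url = base_url + '?verb=ListRecords&metadataPrefix=oai_dc'
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    return [t.text for t in tree.findall('.//dc:title', NS)]

# e.g. harvest_titles('http://archive.example.org/oai')  # placeholder endpoint

A central catalogue would run such requests against every registered repository and merge the returned records into its own index.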
8.2 A Closer Look at Metadata Sets

This section presents some metadata standards that are often used in the construction of corpora.
8.2.1 Dublin Core and OLAC

The Dublin Core (DC) metadata set originates from the library community and serves as a digital lingua franca: it provides a simple and limited but widely understood set of elements:3

• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights
As such, it is mainly used as an export format for project-specific metadata formats and not as a primary means of description. Consider the following example of a DC record, encoded as an XML fragment (the element markup shown is a reconstruction, as the original tagging was lost in reproduction):

<dc:title>jaSlo: Japanese-Slovene Learner's Dictionary</dc:title>
<dc:identifier>http://nl.ijs.si/jaslo/</dc:identifier>
<dc:publisher>Dept. of Asian and African Languages, Uni. of Ljubljana</dc:publisher>
<dc:type>LexicalResource</dc:type>
<dc:language>slv</dc:language>
<dc:language>jpn</dc:language>
<dc:subject>machine readable dictionary</dc:subject>
<dc:subject>general</dc:subject>
<dc:description>Encoding TEI XML</dc:description>
<dc:description>Annotation Level: etymology</dc:description>
<dc:description>Annotation Level: phonology</dc:description>
<dc:description>Annotation Level: definition</dc:description>
<dc:description>Annotation Level: example of use</dc:description>

3 This list contains the elements of the DC simple set. For extensions (added later), see http://dcmi.kc.tsukuba.ac.jp/dcregistry/
OLAC (2000) was created as an application of Dublin Core to the linguistic domain that extends some of the DC elements with more detailed linguistic information. Controlled vocabularies4 for discourse types, languages, linguistic field, linguistic data type, and participant role were added to the standard DC set. An illustration of an OLAC record can be found below. The OLAC-specific extensions for indicating the linguistic field, the linguistic type, the discourse type, and the language are expressed through xsi:type and olac:code attributes (the attribute values shown are illustrative reconstructions, as the original formatting marking them was lost in reproduction):

<dc:title>AphasiaBank Legacy Chinese CAP Corpus</dc:title>
<dc:creator>Bates, Elizabeth</dc:creator>
<dc:subject>aphasia</dc:subject>
<dc:subject xsi:type="olac:linguistic-field" olac:code="psycholinguistics"/>
<dc:language xsi:type="olac:language" olac:code="zho"/>
<dc:description>Chinese aphasics describing pictures in the Given-New task for the CAP Project</dc:description>
<dc:publisher>TalkBank</dc:publisher>
<dc:date>2004-03-30</dc:date>
<dc:type>Text</dc:type>
<dc:type xsi:type="olac:linguistic-type" olac:code="primary_text"/>
<dc:type xsi:type="olac:discourse-type" olac:code="dialogue"/>
<dc:identifier>1-59642-201-7</dc:identifier>
<dc:identifier>http://talkbank.org/data-xml/AphasiaBank/Other/CAP/chinese.zip</dc:identifier>
<dc:identifier>http://talkbank.org/data/AphasiaBank/Other/CAP/chinese.zip</dc:identifier>
In total, about forty language archives are currently providing their metadata records in the OLAC format.5 The popularity of OLAC can be partially attributed to the fact that the OAI-PMH protocol for metadata harvesting requires the DC format. OLAC, as a linguistic extension of DC, keeps the simplicity of DC and adds some useful vocabularies, making it a good choice for the distribution of metadata for language resources.
8.2.2 TEI

The Text Encoding Initiative (TEI, 1990) has been very successful in establishing a widely accepted system for text annotation. The TEI format also includes an extendable metadata section, the TEI header. An example of such a TEI header is included below (the markup shown is a reconstruction, as the original tagging was lost in reproduction).6 One notices immediately the verbosity of the included metadata elements, which on the one hand is very readable but on the other can pose problems for machine processing, as plain text is not really suitable for such purposes.
4 A full list can be found at http://www.language-archives.org/REC/olac-extensions.html
5 See: http://www.language-archives.org/archives
6 Source: http://teibyexample.org/examples/TBED02v00.htm?target=wilde
<teiHeader>
 <fileDesc>
  <titleStmt>
   <title>The Importance of Being Earnest</title>
   <title type="sub">A trivial comedy for serious people</title>
   <title type="sub">An electronic edition</title>
   <author>Oscar Wilde</author>
   <respStmt><resp>compiled by</resp><name>Margaret Lantry</name></respStmt>
   <funder>University College, Cork</funder>
  </titleStmt>
  <editionStmt>
   <edition>First draft, revised and corrected.</edition>
   <respStmt><resp>Proof corrections by</resp><name>Margaret Lantry</name></respStmt>
  </editionStmt>
  <extent>19 648 words</extent>
  <publicationStmt>
   <publisher>CELT: Corpus of Electronic Texts: a project of University College, Cork</publisher>
   <address><addrLine>College Road, Cork, Ireland.</addrLine></address>
   <date>1997</date>
   <distributor>CELT online at University College, Cork, Ireland.</distributor>
   <idno type="CELT">E850003.002</idno>
   <availability><p>Available with prior consent of the CELT programme for purposes of academic research and teaching only.</p></availability>
  </publicationStmt>
  <seriesStmt><title>CELT: Corpus of Electronic Texts</title></seriesStmt>
  <sourceDesc><bibl>By Oscar Wilde (1854-1900). 1895</bibl></sourceDesc>
 </fileDesc>
 <encodingDesc>
  <editorialDecl>
   <p>All the editorial text with the corrections of the editor has been retained.</p>
   <p>Text has been checked, proof-read and parsed using NSGMLS.</p>
   <p>The electronic text represents the edited text. Compound words have not been hyphenated after CELT practice.</p>
   <p>Direct speech is marked q.</p>
   <p>The editorial practice of the hard-copy editor has been retained.</p>
   <interpretation><p>Names of persons (given names), and places are not tagged. Terms for cultural and social roles are not tagged.</p></interpretation>
  </editorialDecl>
  <refsDecl>
   <p>div0=the whole text.</p>
   <p>The n attribute of each text in this corpus carries a unique identifying number for the whole text.</p>
   <p>The title of the text is held as the first head element within each text.</p>
   <p>div0 is reserved for the text (whether in one volume or many).</p>
  </refsDecl>
 </encodingDesc>
</teiHeader>

Two microphones, standard frequency
12 Jan 2010
audio file