The Oxford Handbook of Corpus Phonology 9780199571932, 0199571937

This handbook presents the first systematic account of corpus phonology: the employment of corpora, especially purpose-built phonological corpora, in phonological research.


English, 689 pages, 2014


Table of contents:
Cover
The Oxford Handbook of Corpus Phonology
Series
Copyright
Contents
List of Contributors
1 Introduction
PART I Phonological Corpora: Design, Compilation, and Exploitation
2 Corpus Design
3 Data Collection
4 Corpus Annotation: Methodology and Transcription Systems
5 On Automatic Phonological Transcription of Speech Corpora
6 Statistical Corpus Exploitation
7 Corpus Archiving and Dissemination
8 Metadata Formats
9 Data Formats for Phonological Corpora
PART II Applications
10 Corpus and Research in Phonetics and Phonology: Methodological and Formal Considerations
11 A Corpus-Based Study of Apicalization of /s/ before /l/ in Oslo Norwegian
12 Corpora, Variation, and Phonology: An Illustration from French Liaison
13 Corpus-Based Investigations of Child Phonological Development: Formal and Practical Considerations
14 Corpus Phonology and Second Language Acquisition
PART III Tools and Methods
15 ELAN: Multimedia Annotation Application
16 EMU
17 The Use of Praat in Corpus Research
18 Praat Scripting
19 The PhonBank Project: Data and Software-Assisted Methods for the Study of Phonology and Phonological Development
20 EXMARaLDA
21 ANVIL: The Video Annotation Research Tool
22 Web-based Archiving and Sharing of Phonological Corpora
PART IV Corpora
23 The IViE Corpus
24 French Phonology from a Corpus Perspective: The PFC Programme
25 Two Norwegian Speech Corpora: NoTa-Oslo and TAUS
26 The LeaP Corpus
27 The Diachronic Electronic Corpus of Tyneside English: Annotation Practices and Dissemination Strategies
28 The LANCHART Corpus
29 Phonological and Phonetic Databases at the Meertens Institute
30 The VALIBEL Speech Database
31 Prosody and Discourse in the Australian Map Task Corpus
32 A Phonological Corpus of L1 Acquisition of Taiwan Southern Min
References
Index

THE OXFORD HANDBOOK OF

CORPUS PHONOLOGY

OXFORD HANDBOOKS IN LINGUISTICS

The Oxford Handbook of Applied Linguistics, Second edition. Edited by Robert B. Kaplan
The Oxford Handbook of Case. Edited by Andrej Malchukov and Andrew Spencer
The Oxford Handbook of Cognitive Linguistics. Edited by Dirk Geeraerts and Hubert Cuyckens
The Oxford Handbook of Compounding. Edited by Rochelle Lieber and Pavol Štekauer
The Oxford Handbook of Compositionality. Edited by Markus Werning, Edouard Machery, and Wolfram Hinzen
The Oxford Handbook of Computational Linguistics. Edited by Ruslan Mitkov
The Oxford Handbook of Field Linguistics. Edited by Nicholas Thieberger
The Oxford Handbook of Grammaticalization. Edited by Heiko Narrog and Bernd Heine
The Oxford Handbook of Historical Phonology. Edited by Patrick Honeybone and Joseph Salmons
The Oxford Handbook of the History of English. Edited by Terttu Nevalainen and Elizabeth Closs Traugott
The Oxford Handbook of the History of Linguistics. Edited by Keith Allan
The Oxford Handbook of Japanese Linguistics. Edited by Shigeru Miyagawa and Mamoru Saito
The Oxford Handbook of Laboratory Phonology. Edited by Abigail C. Cohn, Cécile Fougeron, and Marie K. Huffman
The Oxford Handbook of Language Evolution. Edited by Maggie Tallerman and Kathleen Gibson
The Oxford Handbook of Language and Law. Edited by Peter Tiersma and Lawrence M. Solan
The Oxford Handbook of Linguistic Analysis. Edited by Bernd Heine and Heiko Narrog
The Oxford Handbook of Linguistic Interfaces. Edited by Gillian Ramchand and Charles Reiss
The Oxford Handbook of Linguistic Minimalism. Edited by Cedric Boeckx
The Oxford Handbook of Linguistic Typology. Edited by Jae Jung Song
The Oxford Handbook of Sociolinguistics. Edited by Robert Bayley, Richard Cameron, and Ceil Lucas
The Oxford Handbook of Translation Studies. Edited by Kirsten Malmkjaer and Kevin Windle

THE OXFORD HANDBOOK OF

CORPUS PHONOLOGY

Edited by

JACQUES DURAND, ULRIKE GUT, and GJERT KRISTOFFERSEN

Great Clarendon Street, Oxford, OX2 6DP, United Kingdom

Oxford University Press is a department of the University of Oxford. It furthers the University's objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries

© editorial matter and organization Jacques Durand, Ulrike Gut, and Gjert Kristoffersen 2014
© the chapters their several authors 2014

The moral rights of the authors have been asserted

First Edition published in 2014
Impression: 1

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above

You must not circulate this work in any other form and you must impose this same condition on any acquirer

Published in the United States of America by Oxford University Press
198 Madison Avenue, New York, NY 10016, United States of America

British Library Cataloguing in Publication Data
Data available

Library of Congress Control Number: 2014933501

ISBN 978–0–19–957193–2

Printed and bound by CPI Group (UK) Ltd, Croydon, CR0 4YY

Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.

Contents

List of Contributors

1. Introduction (Jacques Durand, Ulrike Gut, and Gjert Kristoffersen)

PART I  PHONOLOGICAL CORPORA: DESIGN, COMPILATION, AND EXPLOITATION

2. Corpus Design (Ulrike Gut and Holger Voormann)
3. Data Collection (Bruce Birch)
4. Corpus Annotation: Methodology and Transcription Systems (Elisabeth Delais-Roussarie and Brechtje Post)
5. On Automatic Phonological Transcription of Speech Corpora (Helmer Strik and Catia Cucchiarini)
6. Statistical Corpus Exploitation (Hermann Moisl)
7. Corpus Archiving and Dissemination (Peter Wittenburg, Paul Trilsbeek, and Florian Wittenburg)
8. Metadata Formats (Daan Broeder and Dieter van Uytvanck)
9. Data Formats for Phonological Corpora (Laurent Romary and Andreas Witt)

PART II  APPLICATIONS

10. Corpus and Research in Phonetics and Phonology: Methodological and Formal Considerations (Elisabeth Delais-Roussarie and Hiyon Yoo)
11. A Corpus-Based Study of Apicalization of /s/ before /l/ in Oslo Norwegian (Gjert Kristoffersen and Hanne Gram Simonsen)
12. Corpora, Variation, and Phonology: An Illustration from French Liaison (Jacques Durand)
13. Corpus-Based Investigations of Child Phonological Development: Formal and Practical Considerations (Yvan Rose)
14. Corpus Phonology and Second Language Acquisition (Ulrike Gut)

PART III  TOOLS AND METHODS

15. ELAN: Multimedia Annotation Application (Han Sloetjes)
16. EMU (Tina John and Lasse Bombien)
17. The Use of Praat in Corpus Research (Paul Boersma)
18. Praat Scripting (Caren Brinckmann)
19. The PhonBank Project: Data and Software-Assisted Methods for the Study of Phonology and Phonological Development (Yvan Rose and Brian MacWhinney)
20. EXMARaLDA (Thomas Schmidt and Kai Wörner)
21. ANVIL: The Video Annotation Research Tool (Michael Kipp)
22. Web-based Archiving and Sharing of Phonological Corpora (Atanas Tchobanov)

PART IV  CORPORA

23. The IViE Corpus (Francis Nolan and Brechtje Post)
24. French Phonology from a Corpus Perspective: The PFC Programme (Jacques Durand, Bernard Laks, and Chantal Lyche)
25. Two Norwegian Speech Corpora: NoTa-Oslo and TAUS (Kristin Hagen and Hanne Gram Simonsen)
26. The LeaP Corpus (Ulrike Gut)
27. The Diachronic Electronic Corpus of Tyneside English: Annotation Practices and Dissemination Strategies (Joan C. Beal, Karen P. Corrigan, Adam J. Mearns, and Hermann Moisl)
28. The LANCHART Corpus (Frans Gregersen, Marie Maegaard, and Nicolai Pharao)
29. Phonological and Phonetic Databases at the Meertens Institute (Marc van Oostendorp)
30. The VALIBEL Speech Database (Anne Catherine Simon, Michel Francard, and Philippe Hambye)
31. Prosody and Discourse in the Australian Map Task Corpus (Janet Fletcher and Lesley Stirling)
32. A Phonological Corpus of L1 Acquisition of Taiwan Southern Min (Jane S. Tsay)

References
Index

List of Contributors

Joan C. Beal is Professor of English Language at the University of Sheffield. Her research interests are in the fields of sociolinguistics/dialectology and the history of the English language since 1700. She has published widely in both fields. Bruce Birch is currently Departmental Visitor in Linguistics at the Australian National University in Canberra. His research has focused on the development of a usage-based approach to the analysis of prosodic structure including intonation, as well as issues involved in the documentation of endangered languages. He has been collecting data and contributing to the building of online corpora for Iwaidja and other highly endangered languages of Northwestern Arnhem Land, Australia, since 1999. Paul Boersma received an MSc. in physics from the University of Nijmegen in 1988 and a Ph.D in linguistics from the University of Amsterdam in 1998 for a dissertation entitled ‘Functional phonology’. Since 2005 he has been Professor of Phonetic Sciences at the University of Amsterdam. Lasse Bombien works as a researcher at the Universities of Munich and Potsdam. In 2006 he received his MA in Phonetics at the University of Kiel, and in 2011 a D.Phil. in Phonetics at the University of Munich. His areas of interest include speech production, articulatory coordination, sound change, effects of prosodic structure on phonetic detail, phonetics and phonology of Scandinavian languages, techniques for speech production investigation, and software development. Caren Brinckmann studied computational linguistics and phonetics at Saarland University (Saarbrücken, Germany) and Tohoku University (Sendai, Japan). For her master’s thesis in 2004 she improved the prosody prediction module of a German speech synthesis system with corpus-based statistical methods. As a researcher at the national Institute for the German Language (IDS, Mannheim, Germany) she subsequently focused on regional phonetic variation, word phonology, and automatic text classification with statistical methods. In 2011 she left academia to apply her data-mining skills in a major German e-commerce company. Daan Broeder has a background in electrical engineering, is deputy head of the TLA unit at the Max Planck Institute for Psycholinguistics (Nijmegen, The Netherlands) and has been senior developer responsible for all infrastructure and metadata development for many years. He plays leading roles in European and national projects, such as all

metadata-related work in TLA and CLARIN, and is the responsible convener for ISO standards on metadata and persistent identifiers.

Karen P. Corrigan is Professor of Linguistics and English Language at Newcastle University. She researches language variation and change in dialects of the British Isles with a particular focus on Northern Ireland and northeast England. She was principal investigator on the research project that created the Newcastle Electronic Corpus of Tyneside English (2000–2005), and fulfilled the same role for the Diachronic Electronic Corpus of Tyneside English project (2010–2012) at Newcastle University.

Catia Cucchiarini obtained her Ph.D in phonetics from the University of Nijmegen. She worked at the Centre for Language and Education of K.U. Leuven in Belgium, and has been working at the University of Nijmegen on various projects on speech processing and computer-assisted language learning. She has supervised Ph.D students and has published many articles in international journals. In addition to her research activities, she has since 1999 been working at the Nederlandse Taalunie (Dutch Language Union) as a senior project manager for language policy and human language technologies.

Elisabeth Delais-Roussarie is a senior researcher at the CNRS, Laboratoire de Linguistique Formelle, Paris (Université Paris-Diderot). She has worked on several topics in sentence phonology, such as the modelling of intonation and accentual patterns in French, the phonology–syntax interface, and prosodic phrasing in French. Her recent work has focused on the development and evaluation of prosodic annotation systems and tools that facilitate a corpus-based approach in sentence phonology and in the L2 acquisition of prosody.

Jacques Durand is Emeritus Professor of Linguistics at the University of Toulouse II – Le Mirail and a Member of the Institut Universitaire de France. He was formerly Professor at the University of Salford, Director of the CLLE-ERSS research centre in Toulouse, and in charge of linguistics at CNRS headquarters. His publications are mainly in phonology (particularly within the framework of Dependency Phonology, in collaboration with John Anderson), but he also worked in machine translation in the 1980s and 1990s within the Eurotra project. Since the late 1990s he has coordinated two major research programmes in corpus phonology: Phonology of Contemporary French, with M.-H. Côté, B. Laks, and C. Lyche, and Phonology of Contemporary English, with P. Carr and A. Przewozny.

Janet Fletcher is Associate Professor of Phonetics at the University of Melbourne. She completed her Ph.D at the University of Reading and has held research positions at the University of Edinburgh, Ohio State University, and Macquarie University. Her research interests include articulatory and acoustic modelling of coarticulation, and prosody and intonation in Australian English and Australian Indigenous languages.

Michel Francard is Professor of Linguistics at the Catholic University of Louvain (Louvain-la-Neuve) and founder of the VALIBEL research centre in 1989. His main
research interests include linguistic variation (especially the lexicography of peripheral French varieties) and the evolution of endangered languages in the globalized linguistic market. His most recent book, Dictionnaire des belgicismes (2010) illustrates the emergence of an autonomous norm within a variety of French outside France. Frans Gregersen is Professor of Danish Language at the Department of Scandinavian Studies and Linguistics, University of Copenhagen, and director of the Danish National Research Foundation LANCHART (LANguage CHAnge in Real Time) Centre since 2005. The centre webpage with current publications may be found at www.lanchart.dk. Ulrike Gut holds the Chair of English Linguistics at the Westfälische WilhelmsUniversity in Münster. She received her Ph.D from Mannheim University and her postdoctoral degree (Habilitation) from Freiburg University. Her main research interests include phonetics and phonology, corpus linguistics, second language acquisition, and worldwide varieties of English. She has collected the LeaP corpus and is currently involved in the compilation of the ICE-Nigeria. Kristin Hagen is Senior Engineer at the Text Laboratory, Department of Linguistic and Scandinavian Studies at the University of Oslo. For many years she has worked with the development of speech corpora such as NoTa-Oslo and the Nordic Dialect Corpus. She has also worked in other language technology domains like POS tagging (the Oslo– Bergen tagger), parsing and grammar checking. Her background is in linguistics and in computer science. Philippe Hambye is Professor of French Linguistics at the University of Louvain. His research mainly includes work in sociolinguistics relating to variation of linguistic norms and practices in the French-speaking world, language practices in education and work, and language policies, with a special interest in questions of legitimacy, power, and social inequalities. Tina John is a lecturer in linguistics at the University of Kiel. After MA graduation in Phonetics, Linguistics and Computer Science in 2004, she joined the developer team of the EMU System. In 2012 she obtained her Ph.D in Phonetics and Computer Linguistics at the University of Munich. Her areas of interest, in addition to the development of the EMU System and algorithms in general, are all kinds of speech data analysis (e.g. finding acoustic correlates) as well as analyses of text corpora. Michael Kipp is Professor for Interactive Media at Augsburg University of Applied Sciences, Germany. Previously he was head of a junior research group at the German Research Center for AI (DFKI), Saarbrücken, and Saarland University. His research topics are embodied agents, multimodal annotation, and communication and interaction design. He developed and maintains the ANVIL video annotation tool. Gjert Kristoffersen is Professor of Scandinavian Languages at the University of Bergen. His research interests are synchronic and diachronic aspects of Scandinavian phonology,

especially Norwegian and Swedish prosody from a variationist perspective. He is the author of The Phonology of Norwegian, published by Oxford University Press in 2000.

Bernard Laks is Professor at Paris Ouest Nanterre University and senior member of the Institut Universitaire de France. Former director of the Nanterre linguistic laboratory, he is, with Jacques Durand, Chantal Lyche, and Marie-Hélène Côté, director of the 'Phonologie du français contemporain' corpus and research program (PFC). He has published intensively in phonology, corpus linguistics, variation, cognitive linguistics, the history of linguistics, and modelling.

Chantal Lyche is currently Professor of French Linguistics at the University of Oslo. She has been adjunct Professor at the University of Tromsø and an associate member of CASTL (Center for Advanced Studies in Theoretical Linguistics, Tromsø). She has published extensively on French phonology and is the co-founder of a research programme in corpus phonology: 'Phonology of contemporary French' (with Jacques Durand and Bernard Laks). Since the 1990s she has focused more specifically on varieties of French outside France, particularly in Switzerland, Louisiana, Mauritius, and Africa. In addition, she has worked on the study of large corpora from a prosodic point of view. She is also the co-author of a standard textbook on the phonology of French, and is actively involved in the teaching of French as a foreign language.

Brian MacWhinney, Professor of Psychology, Computational Linguistics, and Modern Languages at Carnegie Mellon University, has developed a model of first and second language acquisition and processing called the Competition Model which he has also applied to aphasia and fMRI studies of children with focal lesions. He has developed databases such as CHILDES, SLABank, BilingBank, and CABank for the study of language learning and usage. He is currently developing methods for second language learning based on mobile devices and web-based tutors and games.

Marie Maegaard holds a Ph.D from the University of Copenhagen. She is currently Associate Professor of Danish Spoken Language at the Department of Scandinavian Research, University of Copenhagen, and is in charge of the phonetic studies at the Danish National Research Foundation LANCHART (LANguage CHAnge in Real Time) Centre. The centre webpage with current publications may be found at www.lanchart.dk.

Adam J. Mearns was postdoctoral research associate on the Diachronic Electronic Corpus of Tyneside English project (2010–2012) at Newcastle University. His background is in the lexical semantics of Old English and the history of the English language.

Hermann Moisl is a Senior Lecturer in Computational Linguistics at the University of Newcastle, UK. His background is in linguistics and computer science, and his research interests and publications are in neural language modelling using nonlinear attractor dynamics, and in methodologies for preparation and cluster analysis of data abstracted from natural language corpora.


Francis Nolan is Professor of Phonetics in the Department of Linguistics at the University of Cambridge. He studied languages at Cambridge before specializing in phonetics. After a first post in Bangor (North Wales) he returned to Cambridge, developing research interests in phonetic theory, connected speech processes, speaker characteristics, variation in English, and prosody—the latter two united in the IViE project in the late 1990s. He has been active in forensic phonetic research and casework. He is currently President of the British Association of Academic Phoneticians. Marc van Oostendorp is a researcher at the Meertens Instituut of the Royal Netherlands Academy of Arts and Sciences, and a Professor of Phonological Microvariation at Leiden University. His main interests are models of geographical and social variation, the relation between language as a property of the mind and language as a property of a community, and alternatives to derivational relations in the phonology–morphology interface. Nicolai Pharao is Assistant Professor at the Danish National Research Foundation’s Centre for Language Change in Real Time, LANCHART. He received his Ph.D in linguistics with the dissertation ‘Consonant Reduction in Copenhagen Danish’ in 2010. His research includes corpus based studies of phonetic variation and change and experimental studies of the relationship between phonetic variation, social meaning, and language attitudes. He is particularly interested in how the usage and social evaluation of phonetic features influences the representation of word forms in the mental lexicon. Brechtje Post is Lecturer in Phonetics and Phonology at the University of Cambridge. Her research interests centre around the syntax–phonology interface and intonation, which she investigates from a phonetic, phonological, acquisitional, cognitive, and neural perspective. She has published in journals such as Linguistics, Journal of Phonetics, Language and Speech, Cognition, and Neuropsychologia. Laurent Romary is Directeur de Recherche at INRIA, France and guest scientist at Humboldt University in Berlin, Germany. He carries out research on the modelling of semi-structured documents, with a specific emphasis on texts and linguistic resources. He is the chairman of ISO committee TC 37/SC 4 on Language Resource Management, and has been active as member (2001–2007), then chair (2008–2011), of the TEI (Text Encoding Initiative) council. He currently contributes to the establishment and coordination of the European Dariah infrastructure for the arts and humanities. Yvan Rose is currently Associate Professor of Linguistics at Memorial University and co-director of the PhonBank Project within CHILDES. He received his Ph.D in Linguistics from McGill University. His research concentrates on the nature of phonological representations and of their acquisition by young children. He investigates these questions through software-assisted methods, implemented in the Phon program for the analysis of transcript data on phonology and phonological development. Thomas Schmidt holds a Ph.D from the University of Dortmund. His research interests are spoken language corpora, text and corpus technology, and computational

lexicography. He is one of the developers of EXMARaLDA and the author of the Kicktionary, a multilingual electronic dictionary of football language. He has spent most of his professional life as a researcher at the University of Hamburg. He also worked as a language resource engineer for a commercial company and as a research associate at ICSI Berkeley and at the Berlin-Brandenburg Academy of Sciences. Currently he heads the Archive for Spoken German (AGD) at the Institute for the German Language in Mannheim.

Anne Catherine Simon is Professor of French Linguistics at the Catholic University of Louvain (Louvain-la-Neuve, Belgium). She has been in charge of the VALIBEL research centre since 2009. Her research in French linguistics is in the areas of prosody and syntax of spoken speech, and their interaction in various speaking genres. Her dissertation, 'Structuration prosodique du discours en français', was published in 2004. She is co-author of La variation prosodique régionale en français (2012).

Hanne Gram Simonsen is Professor of Linguistics at the Department of Linguistics and Scandinavian Studies at the University of Oslo. Her research interests include language acquisition (in particular phonology, morphology, and lexicon) and instrumental and articulatory phonetics, as well as clinical linguistics (language disorders in children and adults). She has published on these topics in journals such as Journal of Child Language, Journal of Phonetics, and Clinical Linguistics and Phonetics.

Han Sloetjes is a software developer at the Language Archive, a department of the Max Planck Institute for Psycholinguistics. He has been involved in the development of the multimedia annotation tool ELAN since 2003. Currently he is the person mainly responsible for supporting, maintaining, and further developing this application.

Lesley Stirling is Associate Professor of Linguistics and Applied Linguistics at the University of Melbourne. She has a disciplinary background in linguistics and cognitive science, and has published work on a variety of topics in descriptive and typological linguistics, semantics, and discourse analysis. One research interest has been the relationship between dialogue structure and prosody, involving collaborative cross-disciplinary research funded by the Australian Research Council.

Helmer Strik received his Ph.D in physics from the University of Nijmegen, where he is now Associate Professor of Speech Science and Technology. His research addresses both human speech processing (voice source modelling, intonation, pronunciation variation, speech pathology) and speech technology (automatic speech recognition and transcription, spoken dialogue systems, and computer-assisted language learning and therapy). He has published over 150 refereed papers, has coordinated national and international projects, and has been an invited speaker at international events.

Atanas Tchobanov is a research engineer at the MoDyCo CNRS lab. He has been active in the field of oral corpora web implementations since 2001. His research interests also include data analysis and unsupervised learning of phonological invariants.


Paul Trilsbeek is currently head of archive management at the Language Archive, Max Planck Institute for Psycholinguistics, Nijmegen. He studied sonology at the Royal Conservatory in The Hague, after which he worked at the Radboud University in Nijmegen as a music technologist in the Music, Mind, Machine project. His experience in working with audiovisual media turned out to be of great value in the domain of language resource archiving, in which he has been working since 2003.

Jane S. Tsay received her Ph.D in Linguistics from the University of Arizona. She was a postdoctoral research fellow at the State University of New York at Buffalo 1993–1995. She is currently a Professor of Linguistics and the Dean of the College of Humanities at the National Chung Cheng University in Taiwan. Her research interests include phonology acquisition, experimental phonetics, corpus linguistics, and sign language phonology. She has constructed the Taiwanese Child Language Corpus, based on 330 hours of recordings of young children's spontaneous speech. She is also the co-director of the Sign Language Research Group at the University and has compiled a Taiwan Sign Language online dictionary. Her recent research, besides phonology acquisition, is on the phonological structure of spoken and signed languages.

Dieter van Uytvanck studied computer science at Ghent University and linguistics at the Radboud University, Nijmegen. After graduating he worked at the Max Planck Institute for Psycholinguistics in Nijmegen. Since 2008 he has been active in the technical setup of the CLARIN research infrastructure (www.clarin.eu) and as of 2012 he is director at CLARIN-ERIC.

Holger Voormann received a degree in computer science from the University of Stuttgart. He worked as a research associate at the IMS Stuttgart and held several positions in IT companies. He is now a freelance software developer and consultant, and is involved in the development of several open source projects, such as the Platform for Annotated Corpora in XML (Pacx).

Andreas Witt received his Ph.D in Computational Linguistics and Text Technology from Bielefeld University in 2002, and continued there for the next four years as an instructor and researcher in those fields. In 2006 he moved to Tübingen University, where he participated in a project on 'Sustainability of Linguistic Resources' and in projects on the interoperability of language data. Since 2009 he has headed the Research Infrastructure group at the Institute for the German Language in Mannheim.

Florian Wittenburg works in the Language Archive at the MPI in Nijmegen in collaboration with the Max Planck Society, the Berlin Brandenburg Academy of Sciences, and the Royal Dutch Academy of Sciences.

Peter Wittenburg has a diploma degree in Electrical Engineering from the Technical University Berlin and in 1976 became head of the technical group at the newly founded Max Planck Institute for Psycholinguistics. He has had leading roles in various European and national research projects including the DOBES programme, CLARIN, and EUDAT, as well as ISO initiatives. He is the head of the Language Archive, a collaboration between the Max Planck Society, the Berlin Brandenburg Academy of Sciences, and the Royal Dutch Academy of Sciences.

Kai Wörner holds a Ph.D in text technology from the University of Bielefeld. After finishing his university studies at Gießen University, he worked as a web developer in Hamburg. He is currently the managing director of the Hamburg Centre for Language Corpora and a research assistant in the language resource infrastructure project CLARIN-D. His research interests are corpus and computational linguistics. He is one of the developers of the EXMARaLDA system.

CHAPTER 1

INTRODUCTION

JACQUES DURAND, ULRIKE GUT, AND GJERT KRISTOFFERSEN

Corpus phonology is a new interdisciplinary field of research that has only begun to emerge during the last few years. It has grown out of the need for modern phonological research to be embedded within a larger framework of social, cognitive, and biological science, and combines methods and theoretical approaches from phonology, both diachronic and synchronic, phonetics, corpus linguistics, speech technology, information technology and computer science, mathematics, and statistics. In the past, phonological research comprised predominantly descriptive methods, but while new methods such as experimentation, acoustic-perceptual, and aerodynamic modelling, as well as psycholinguistic and statistical methods, have recently been introduced, the employment of purpose-built corpora in phonological research is still in its infancy. With the increasing number of phonological corpora being compiled all over the world, the need arises for the international research community to exchange ideas and find a consensus on fundamental issues such as corpus annotation, analysis, and dissemination as well as corpus data formats and archiving. The time seems right for the development of standards for phonological corpus compilation and especially corpus annotation and metadata. It is the aim of this Handbook to address these issues.

It offers guidelines and proposes international standards for the compilation, annotation, and analysis of phonological corpora. This includes state-of-the-art practices in data collection and exploitation, theoretical advances in corpus design, best practice guidelines for corpus annotation, and the description of various tools for corpus annotation, exploitation, and dissemination. It also comprises chapters on phonological findings based on corpus analyses, including studies in fields as diverse as the phonology–phonetics interface, language variation, and language acquisition. Moreover, an overview is provided of a large number of existing phonological corpora and tools for corpus compilation, annotation, and exploitation.

The Handbook is structured in four parts. The first part, 'Phonological Corpora: Design, Compilation, and Exploitation', contains contributions on general issues in phonological corpus compilation, annotation, analysis, storage, and dissemination. In chapter 2, Ulrike Gut and Holger Voormann describe the basic processes of phonological corpus design including data compilation, data selection, and data annotation as well as corpus storage, sustainability, and reuse. They address fundamental questions and decisions that compilers of a phonological corpus are inevitably faced with, such as questions of corpus representativeness and size, raw data selection, and corpus sharing. On the basis of these reflections, the authors propose a methodology for corpus creation.

Many of the issues raised in Chapter 2 are developed in the next three chapters. Chapter 3 is concerned with corpus-based data collection. In it, Bruce Birch discusses some key issues such as control over primary data, context, and contextual variation, and the observer's paradox. He further classifies various widespread data collection techniques in terms of the amount and type of control they assert over the production of speech, and gives a comprehensive overview of data collection techniques for purposes of phonological research. The task of phonological corpus annotation is described in chapter 4. Elisabeth Delais-Roussarie and Brechtje Post first discuss some theoretical issues that arise in the transcription and annotation of speech such as segmentation and the assignment of labels. Furthermore, they provide a comprehensive overview and evaluation of the various systems that are in use for the annotation of segmental and suprasegmental information in the speech signal. Chapter 5 provides an overview of the state of the art in automatic phonetic transcription of corpora. After introducing the most relevant methodological issues in this area, Helmer Strik and Catia Cucchiarini describe and evaluate the different techniques of (semi-)automatic phonetic corpus transcription that can be applied, depending on what kind of data and annotations are available to corpus compilers.

The next couple of chapters are concerned with the exploitation and archiving of phonological corpora. In chapter 6, Hermann Moisl presents statistical methods for analysing phonological corpora, focusing in particular on cluster analysis. Illustrating his account with the Newcastle Electronic Corpus of Tyneside English (which is presented in Part IV), he describes and discusses in a detailed way the process and benefits of applying the technique of clustering to phonological corpus data.
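As an illustrative aside, not drawn from chapter 6 itself, the general idea of clustering speakers by their phonological behaviour can be sketched in a few lines of code. The fragment below is a minimal sketch only: the speaker identifiers, variant labels, and counts are invented, agglomerative (Ward) clustering is just one of several possible techniques, and the NumPy and SciPy libraries are assumed to be available.

```python
# Illustrative sketch only: cluster speakers by the relative frequency of
# phonetic variants in their transcribed speech. All data here are invented;
# a real study would derive the counts from corpus annotations.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

speakers = ["sp01", "sp02", "sp03", "sp04"]
variants = ["t_glottal", "t_released", "r_tap", "r_approximant"]

# Raw token counts per speaker (rows) and variant (columns) -- hypothetical.
counts = np.array([
    [120,  30,  80,  20],
    [110,  40,  75,  25],
    [ 20, 130,  10, 140],
    [ 25, 120,  15, 130],
], dtype=float)

# Convert counts to proportions so that talkative speakers do not dominate.
profiles = counts / counts.sum(axis=1, keepdims=True)

# Agglomerative (Ward) clustering over the speaker profiles.
tree = linkage(profiles, method="ward")
groups = fcluster(tree, t=2, criterion="maxclust")

for spk, grp in zip(speakers, groups):
    print(spk, "-> cluster", grp)
```

In a sketch like this the interesting analytical work lies in choosing and normalizing the variables, which is precisely the kind of data-preparation issue the chapter discusses.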
Chapter 7 is concerned with corpus archiving and dissemination. Peter Wittenburg, Paul Trilsbeek, and Florian Wittenburg discuss how the traditional model of corpus archiving and dissemination is changing dramatically, with digital innovations opening up new possibilities. They examine various preservation requirements that need to be met, and illustrate the use of advanced infrastructures for data accessibility, archiving and dissemination.

The last two chapters of Part I are concerned with the concept of metadata and data formats. In chapter 8, Daan Broeder and Dieter van Uytvanck describe some of the major metadata sets in use for the compilation of corpora including OLAC, TEI, IMDI, and CMDI. They further give some practical advice on what metadata schema to use or how to design one's own if required. Finally, Chapter 9 addresses basic issues that are important for corpus compilers with regard to the choice of data format. Laurent Romary and Andreas Witt argue for providing the research community with a set of standardized formats that allow a high reuse rate of phonological corpora as well as better interoperability across tools used to produce or exploit them. They describe some
basic concepts related to the representation of annotated linguistic content, and offer some proposals for the annotation of spoken corpus data. The second part of this Handbook, ‘Applications’, is devoted to how speech corpora can be put to use. Each chapter considers how corpus-based methods may enrich and improve research within different subfields of phonology such as phonetics, prosody, segmental phonology, diachrony, first language acquisition, and second language acquisition. These topics and perspectives should by no means be regarded as exhaustive; they are but a few examples of many possible ones that are intended to show the usefulness of corpus-based methods. In chapter 10, Elisabeth Delais-Roussarie and Hiyon Yoo take as their starting point the various data and methods commonly used for research in phonetics and phonology. This leads them to a definition of what can be considered (1) a corpus and (2) a corpus-based approach to the two disciplines. The rest of the chapter is devoted to post-lexical phonology and prosody, such as liaison in French, suprasegmental phenomena such as phrasing or intonation, and the use of corpora in phonetic research. The topic of chapter 11, written by Hanne Gram Simonsen and Gjert Kristoffersen, is segmental phonology from a variationist point of view. An ongoing change where a formerly laminal /s/ is turned into apical /ʂ/ before /l/ in Oslo Norwegian is shown to be governed by a complex set of phonological and morphological constraints that could not have been identified without recourse to corpus-based methods. The corpora used in their analysis are described in Chapter 25 in this volume. Chapter 12 takes up again one of the topics of Chapter 10 in greater detail: French liaison. Based on the PFC corpus (see Chapter 24, this volume), Jacques Durand shows how recourse to corpora has contributed to a better understanding of perhaps one of the most thoroughly analysed phenomena in the phonology of French. Durand argues that previous analyses are to a certain extent flawed because they are based on data which are too scarce, occasionally spurious, and often uncritically adopted from previous treatments. The PFC corpus has helped to put the analysis on firmer empirical ground and to chart which areas are relatively stable across speakers and which are variable. The topic of chapter 13, written by Yvan Rose, is phonological development in children. Following a discussion of issues that are central to research in phonological development, the chapter describe some solutions, with an emphasis on the recently proposed PhonBank initiative (see also chapter 19 of this volume) within the larger CHILDES project. Finally, chapter 14 is concerned with second language acquisition. Here, Ulrike Gut shows how research on the acquisition and structure of L2 phonetics and phonology can profit from the analysis of phonological corpora of second language learner speech. A second objective of this chapter is to discuss how corpora can support the creation of teaching materials and teaching curricula, and how they can be employed in classroom teaching and learning of phonology. Part III of the Handbook concerns ‘Tools and Methods’. A number of tools, systems, or methods in this section have become standard in the field and are used in a large number of research projects. Thus, chapter 15 by Han Sloetjes provides an overview of ELAN, a stand-alone tool developed at the Max Planck Institute for Psycholinguistics in Nijmegen in the Netherlands. 
ELAN is a generic multimedia annotation tool which is not restricted to the analysis of spoken language, since it is also applied in sign language research, gesture research, and language documentation, to name just a few. It offers powerful descriptive strategies, since it supports time-aligned multilevel transcriptions and permits annotations to reference other annotations, allowing for the creation of annotation tree structures.

By contrast, EMU, presented in chapter 16 by Tina John and Lasse Bombien, is a database system for the specific analysis of speech, consisting of a collection of software tools for the creation, manipulation, and analysis of speech databases. EMU includes an interactive labeller which can display spectrograms and speech waveforms, and which allows the creation of hierarchical as well as sequential labels for a speech utterance. A central concern of the EMU project is the statistical analysis of speech corpora. To this end, EMU interfaces with the R environment for statistical computing.

Like EMU, Praat, devised by Paul Boersma and David Weenink at the University of Amsterdam, is a computer program for analysing, synthesizing, and manipulating speech and other sounds, and for creating publication-quality graphics. A speech corpus typically consists of a set of sound files, each of which is paired with an annotation file, and metadata information. Paul Boersma's introduction to Praat in chapter 17 demonstrates that the strengths of this tool lie in the acoustic analysis of the individual sounds, in the annotation of these sounds, and in browsing multiple sound and annotation files across the corpus. Moreover, corpus-wide acoustic analyses, leading to tables ready for statistical analysis, can be performed by the Praat scripting language, which is thoroughly described and illustrated by Caren Brinckmann in chapter 18. As stressed by this author, building a speech corpus and exploiting it to answer phonetic and phonological research questions is a very time-consuming process. Many of the necessary steps in the corpus-building process and the analysis stage can be facilitated by scripting. Caren Brinckmann demonstrates how scripts can be employed to support orthographic transcription, phonetic and prosodic annotation, querying, analysis, and preparation for distribution.
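Chapter 18 works in the Praat scripting language itself; purely as an illustration of the same workflow, the sketch below shows a scripted, corpus-wide analysis in Python. Everything in it is hypothetical: it assumes that segment annotations have already been exported to a CSV file named segments.csv with columns file, tier, label, start, and end, that the segmental tier is called "phones", and it writes a per-label duration table ready for statistical analysis.

```python
# Minimal sketch (not from chapter 18): turn exported segment annotations
# into a per-label duration table ready for statistical analysis.
# Assumes a hypothetical CSV with columns: file, tier, label, start, end.
import csv
from collections import defaultdict
from statistics import mean, stdev

durations = defaultdict(list)

with open("segments.csv", newline="", encoding="utf-8") as fh:
    for row in csv.DictReader(fh):
        if row["tier"] != "phones":          # keep only the segmental tier
            continue
        durations[row["label"]].append(float(row["end"]) - float(row["start"]))

with open("durations_by_label.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["label", "n", "mean_s", "sd_s"])
    for label, values in sorted(durations.items()):
        sd = stdev(values) if len(values) > 1 else 0.0
        writer.writerow([label, len(values), round(mean(values), 4), round(sd, 4)])
```

A table of this kind would then typically be loaded into R or a comparable environment for the actual statistical modelling, mirroring the division of labour described above for EMU and Praat.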
The contribution by Yvan Rose and Brian MacWhinney (chapter 19) is centred on the PhonBank project. The authors provide a description of the tools available through the PhonBank initiative for corpus-based research on phonological development as well as for data sharing. PhonBank is one of ten subcomponents of a larger database of spoken language corpora called TalkBank. Other areas in TalkBank include AphasiaBank, BilingBank, CABank, CHILDES, ClassBank, DementiaBank, GestureBank, Tutoring, and TBIBank. All of the TalkBank corpora use the CHAT data transcription format, which enables a thorough analysis with the CLAN programs (Computerized Language ANalysis). The PhonBank corpus is unique in that it can be analysed both with the CLAN programs and also with an additional program, called Phon, which is designed specifically for phonological analysis. The authors provide an introduction to Phon and then widen the discussion to methodological issues relevant to software-assisted approaches to phonological development, and to phonology, more generally.

In chapter 20, Thomas Schmidt and Kai Wörner provide an overview of EXMARaLDA.
This is a system for creating, managing, and analysing digital corpora of spoken language which has been developed at the University of Hamburg since 2000. From the outset, EXMARaLDA was planned to serve a variety of purposes and user communities. Today, the system is used, among other things, for corpus development in pragmatics and conversation analysis, in dialectology, in studies of multimodality, and for the curation of legacy corpora systems of its kind. This chapter foregrounds the use of EXMARaLDA for corpus phonology within the wider study of spoken interactions. As part of the overview, three corpora are presented—a phonological corpus, a discourse corpus and a dialect corpus—all constructed with the help of EXMARaLDA.

Chapter 21 by Michael Kipp is devoted to ANVIL, a highly generic video annotation research tool. Like ELAN (cf. chapter 15), ANVIL supports three activities which are central to contemporary research on language interaction: the systematic annotation of audiovisual media (coding), the management of the resulting data in a corpus, and various forms of statistical analysis. In addition, ANVIL also allows for audio, video, and 3D motion-capture data. The chapter provides an in-depth introduction to ANVIL's underlying concepts which is especially important when comparing it to alternative tools, several of which are described in other chapters of this volume. It also tries to highlight some of the more advanced features (like track types, spatial coding, and manual generation) that can significantly increase the efficiency and robustness of the coding process.

To conclude this part, chapter 22 by Atanas Tchobanov returns to an issue raised in Part I of this volume, namely web-based sharing of data for research, pedagogical, or demonstration purposes. Modern corpus projects are usually designed with a web-based administration of the corpus compilation process in mind. However, transferring existing corpora to the web proves to be a more challenging task. This chapter reviews the solutions available for designing and deploying a web-based phonological corpus. Furthermore, it gives a practical step-by-step description of how to archive and share a phonological corpus on the web. Finally, the author discusses a novel approach, based on the latest HTML5 and relying only on the now universal JavaScript language, implemented in any modern browser. The new system he presents runs online from any http server, but also offline from a hard drive, CD/DVD, or flash drive memory, thus opening up new possibilities for the dissemination of phonological corpora.

In line with its title, 'Corpora', Part IV of this Handbook aims at presenting a number of leading corpora in the field of phonology. Even a book of this size cannot remotely hope to reference all the worthwhile projects currently available. Our aim has therefore been a more modest one: that of giving an overview of some well-known speech corpora exemplifying the methods and techniques discussed in earlier parts of the book and covering different countries, different languages, different linguistic levels (from the segmental to the prosodic), and different perspectives (e.g. dialectology, sociolinguistics, and first and second language acquisition).

In chapter 23, Francis Nolan and Brechtje Post provide an overview of the IViE Corpus of spoken English. IViE stands for 'Intonational Variation in English' and refers to a collection of audio recordings of young adult speakers of urban varieties of English in the British Isles made between 1997 and 2002. These recordings were devised to facilitate the systematic investigation of intonational variation in the British Isles, and have served as a model for similar studies in other parts of the world. This chapter sets out by describing the reasoning behind the choices made in designing the corpus, and surveys some of the research applications in which recordings from IViE have been used.

Chapter 24 is devoted to an ongoing programme concerning spoken French which was set up in the late 1990s: the PFC Programme (Phonologie du Français Contemporain: usages, variétés et structure), which is by far one of the largest databases of spoken French of its kind. In their contribution, Jacques Durand, Bernard Laks, and Chantal Lyche attempt to show the advantages of a uniform type of data collection, transcription, and coding which has led to the construction of an interactive website integrating advanced search and analysis tools and allowing for the systematic comparison of varieties of French throughout the world. They also emphasize that, while the core of the programme has been phonological (and initially mainly segmental), the database permits applications ranging from speech recognition to syntax and discourse—a point made by many other contributors to this volume.

In chapter 25, Kristin Hagen and Hanne Gram Simonsen provide a description of two speech corpora hosted by the University of Oslo: NoTa-Oslo and TAUS (see also Chapter 11, where research based on these corpora is reported). Both corpora are based on recordings of spontaneous speech from Oslo residents, NoTa-Oslo speech recorded in 2005–2006 and TAUS speech recorded in 1972–1973. These two corpora permit a thorough synchronic and diachronic investigation of speech from Oslo and its immediate surroundings, which can be seen as representative of Urban East Norwegian speech. In both cases, the web search interface is relatively simple to use, and the transcriptions are linked to audio files (for both NoTa-Oslo and TAUS) and video files (for NoTa-Oslo). NoTa-Oslo and TAUS are both multi-purpose corpora, designed to support research in different fields, such as phonology, morphology, syntax, semantics, discourse, dialectology, sociolinguistics, lexicography, and language technology. This makes the corpora very useful for most purposes, but it also means that they cannot immediately meet the demands of every research task. The authors show how NoTa-Oslo and TAUS have been used for phonological research, but also discuss some of the limitations entailed by the types of interaction at the core of the corpus—a discussion most instructive for researchers wishing to embark on large-scale socio-phonological projects.

Chapter 26 by Ulrike Gut turns to another area where phonological corpora are proving indispensable—that of second language acquisition. The LeaP corpus was collected in Germany at the University of Bielefeld between 2001 and 2003 as part of the LeaP (Learning Prosody in a Foreign Language) project. The aim has been to investigate the acquisition of prosody by second language learners of German and English with a special focus on stress, intonation, and speech rhythm as well as the influencing factors on the acquisition process and outcome. The LeaP corpus comprises spoken language produced by 46 learners of English and 55 learners of German as well as recordings with 4 native speakers of English and 7 native speakers of German.
This chapter is particularly useful in providing a detailed discussion of methods concerning the compilation of a corpus designed for studying the acquisition of prosody: selection of speakers, recordings, types of speech, transcription issues, annotation procedures, data formats, and assessment of annotator reliability.
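Annotator (or inter-transcriber) reliability of the kind just mentioned is commonly quantified with a chance-corrected agreement measure such as Cohen's kappa. The snippet below is a minimal, self-contained sketch rather than the LeaP procedure itself; the two label sequences and the function name are invented for illustration.

```python
# Illustrative sketch: Cohen's kappa for two annotators labelling the same
# tokens (e.g. pitch-accent categories). The label sequences are invented.
from collections import Counter

annotator_a = ["H*", "H*", "L*", "H*", "L+H*", "L*", "H*", "L*"]
annotator_b = ["H*", "L*", "L*", "H*", "L+H*", "L*", "H*", "H*"]

def cohen_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    # Agreement expected by chance, given each annotator's label distribution.
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

print(round(cohen_kappa(annotator_a, annotator_b), 3))
```

Raw percentage agreement would overstate reliability here, since some matches are expected by chance alone; the kappa correction is what makes scores comparable across annotation schemes with different numbers of categories.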
In chapter 27 by Joan C. Beal, Karen P. Corrigan, Adam J. Mearns, and Hermann Moisl, the focus is on the Diachronic Electronic Corpus of Tyneside English (DECTE), and particularly annotation practices and dissemination strategies. The first stage in the development of DECTE was the construction of the Newcastle Electronic Corpus of Tyneside English (NECTE) between 2000 and 2005. NECTE is what is called a legacy corpus, based on data collected for two sociolinguistic surveys conducted on Tyneside, northeast England, in c.1969–1971 and 1994, respectively. The authors concentrate in particular on transcription issues relevant for addressing research questions in phonetics/phonology, and on the nature of and rationale for the text-encoding systems adopted in the corpus construction phase. They also offer some discussion of the dissemination strategy employed since completion of the first stage of the corpus in 2005. The exploitation of NECTE for phonetic/phonological analysis is described in Moisl's chapter in Part I of this Handbook. Insofar as the researchers behind NECTE have been pioneers in the construction of a unique electronic corpus of vernacular English which was aligned, tagged for parts of speech, and fully compliant with international standards for encoding text, the continuing work on the subcorpora now included within DECTE is of interest to all projects having to deal with recordings and metadata stretching back in time.

Interestingly, the following chapter on LANCHART by Frans Gregersen, Marie Maegaard, and Nicolai Pharao (chapter 28) focuses on similar issues concerning Danish. The authors give an outline of the corpus work done at the LANCHART Centre of the Danish National Research Foundation. The Centre has performed re-recordings of a number of informants from earlier studies of Danish speech, thus making it possible to study variation and change in real time. The chapter deals with the methodological problems posed by such a diachronic perspective in terms of data collection, annotation, and interpretation. Gregersen, Maegaard, and Pharao then focus on three significant examples: the geographical pattern of the (əð) variable, the accommodation to a moving target constituted by the raising of (æ) to [ɛ], and finally the covariation of three phonetic variables and one grammatical variable (the generic pronoun) in a single interview.

Chapter 29, written by Marc van Oostendorp, is devoted to phonological and phonetic databases at the Meertens Institute in Amsterdam. This centre was founded in 1930 under the name 'Dialect Bureau' (Dialectenbureau) before being named in 1979 after its first director, P. J. Meertens. Originally, the institute had as its primary goal the documentation of the traditional dialects as well as folk culture of the Netherlands. In the course of time, this focus has broadened in several ways. From a linguistic standpoint, the Institute has widened its scope to topics other than the traditional dialects. Currently it comprises two departments, one of Dutch Ethnology and one of Variation Linguistics. Although the documentation of dialects has made significant progress, considerable effort has recently gone into digitizing material and putting it online. Van Oostendorp's contribution seeks to describe the two most important databases on Dutch dialects which are available at the Meertens Institute: the Goeman–Taeldeman–Van Reenen Database and Soundbites. He concludes by presenting new research areas at the Meertens Institute and by pointing out some desiderata concerning them.

Chapter 30 is concerned with the VALIBEL speech database. Anne Catherine Simon, Philippe Hambye, and Michel Francard present the 'speech bank' which has been developed since 1989 at the Centre de recherche Valibel of the Catholic University of Louvain (Belgium). This speech database, which is one of the largest banks of spoken French in the world, is not a homogeneous corpus but rather a compilation of corpora, collected with a wide range of linguistic applications in mind and integrated into a system allowing for various kinds of investigation. The authors give a thorough description of the database, with special attention to the features that are relevant for research in phonology. Although the first aim of VALIBEL was not to build up a reference corpus of spoken French, but to collect data in order to provide a sociolinguistic description of the varieties of French spoken in Belgium, Simon, Hambye, and Francard show how the continuing gathering of data for various research projects has finally resulted in the creation of a large and controlled database, highly relevant for research in a number of fields, including phonetics and phonology.

In chapter 31, Janet Fletcher and Lesley Stirling focus on prosody and discourse in the Australian Map Task corpus. The Australian Map Task corpus is part of the Australian National Database of Spoken Language (ANDOSL), which was collected in the 1990s for use in general speech science and speech technology research in Australia. It is closely modelled on the HCRC Map Task, which was designed in the 1990s by a team of British researchers to elicit spoken interaction typical of everyday talks in a controlled laboratory environment. Versions of this task have been used successfully to develop or test models of intonation and prosody in a wide number of languages including several varieties of English (as illustrated by Nolan and Post in Chapter 23). The authors show how the Australian Map Task has proved to be a useful tool with which to examine different prosodic features of spoken interactive discourse. While the intonational system of Australian English shares many features with other varieties of English, tune usage and tune interpretation are argued to remain variety-specific, with the Map Task proving to be a rich source of information on this question. The studies summarized in this contribution also illustrate the flexibility of Map Task data in permitting correlations of both micro-level discourse units such as dialogue acts and larger discourse segments such as Common Ground Units, with intonational and prosodic features of Australian English. The chapter includes a detailed discussion of annotation and analytical techniques for the study of prosody, thus complementing the contribution of Nolan and Post at the beginning of this part of the Handbook.

The Handbook concludes with a chapter by Jane S. Tsay describing a phonological corpus of L1 acquisition of Taiwan Southern Min. In her contribution, Tsay outlines the data collection, transcription, and annotations for the Taiwanese Child Language Corpus (TAICORP), including a brief description of computer programs developed for specific phonological analyses. TAICORP is a corpus of spontaneous speech between young children growing up in Taiwanese-speaking families and their carers.
The target language, Taiwanese, is a variety of Southern Min Chinese spoken in Taiwan.


(Taiwanese and Southern Min are used interchangeably by the author.) Tsay shows how a well-designed phonological corpus such as TAICORP can be used to throw light on many issues beyond phonology such as the acquisition of syntax (syntactic categories, causatives, classifiers) and of pragmatic features. From a phonological point of view, much of the literature on child language acquisition has focused primarily on universal innate patterns (e.g. markedness constraints within Optimality Theory), but many contemporary studies have also argued that frequency factors are highly relevant and indeed more crucial than markedness. Only corpora such as TAICORP can allow investigators to test competing hypotheses in this area. As is argued in most chapters of this volume, the construction of corpora cannot be divorced from theory construction and evaluation. The idea for this Handbook was born during an ESF-funded workshop on phonological corpora held in Amsterdam in 2006. We would like to thank all participants of this event and of the summer school on Corpus Phonology held at Augsburg University in 2008 for their discussions, comments, and commitment to this emerging discipline of corpus phonology. Our thanks also go to Eva Fischer and Paula Skosples, who assisted us in the editing process of this Handbook. We hope that it will be of interest to researchers from a wide range of linguistic fields including phonology, both synchronic and diachronic, phonetics, language variation, dialectology, first and second language acquisition, and sociolinguistics.

PART I

PHONOLOGICAL CORPORA: DESIGN, COMPILATION, AND EXPLOITATION

CHAPTER 2

CORPUS DESIGN

ULRIKE GUT AND HOLGER VOORMANN

2.1 Introduction Corpus phonology is a new interdisciplinary field of research that has emerged over the last few years. It refers to a novel methodological approach in phonology: the use of purpose-built phonological corpora for studying speakers’ and listeners’ knowledge and use of the sound system of their native language(s), the laws underlying such sound systems, and the acquisition of these systems in first and second language learning. Throughout its history, phonological research has employed a number of different methods, including the comparative method, experimental methods (Ohala 1995), and acoustic-perceptual and aerodynamic modelling taken from phonetics and integrated into the approach of laboratory phonology (Beckman and Kingston 1990: 3). The usage of purpose-built corpora in phonological research, however, is still in its infancy (see the chapters in part II of The Oxford Handbook of Corpus Phonology). Corpus linguistics as a method for studying the structure and use of language can be traced back to the 18th century (Kennedy 1998: 13). Modern corpora began to be collected in the 1960s and have quickly developed into one of the main methods of linguistic inquiry. It is now widely acknowledged that corpus-based linguistic research allows the modelling and analysis of language and language use as a valid alternative to linguistic research based on isolated examples of language. The development of corpus linguistics has proceeded in several waves (e.g. Renouf 2007; Johansson 2008). The few corpora that were compiled in the 1960s and 1970s were relatively small in size and were mainly used for lexical studies. In the 1980s, the number of different types of corpora increased rapidly and the first multi-million word corpora were created; these were employed for a wide range of linguistic studies including lexis, morphosyntax, language change, language variation, and language acquisition. In the past few years, the World Wide Web has been increasingly used as a corpus for morphosyntactic studies, and an entirely new type of corpus has appeared: the first phonological corpora. Accordingly, reflecting the technological possibilities and respective purposes of the different periods, the term ‘corpus’ has

14   Ulrike Gut and Holger Voormann been defined in many different ways. In general, it is used to refer to a substantial collection of language samples in electronic form that was assembled for a specific linguistic purpose (Sinclair 1991: 171). No agreed definition of what makes a corpus a phonological corpus exists as yet. This chapter attempts such a definition by outlining what a phonological corpus consists of and in what way it differs from other types of corpus (section 2.2). Researchers have only begun collecting phonological corpora in the past few years. Precursors of phonological corpora were developed in the 1980s. These were so-called speech corpora that were assembled for technological applications such as text-to-speech, automatic speech recognition or the evaluation of language processing systems (see Gibbon et al. 1997). They are, however, of limited use for phonological inquiry as they typically contain recordings in a very restricted range of speaking styles (usually they contain only several sentences read out by different speakers or many sentences read by one speaker) and do not include time-aligned phonological annotations. The spoken language corpora of the 1990s (e.g. the London-Lund Corpus, Svartvik 1990, the IBM Lancaster corpus of spoken English, and the Bergen corpus of London Teenage Language: Breivik and Hasselgren 2002) were collected with the aim of studying grammatical aspects of speech. Thus, they contain a more representative sample of speaking styles, but typically do not contain time-aligned phonological annotations either. It is only in the past few years that corpora have begun to be compiled with the express purpose of studying phonological phenomena. Such phonological corpora include the PFC (Phonologie du Français Contemporain) (see Durand, Laks and Lyche, this volume), the IViE corpus comprising different regional British varieties (see Post and Nolan, this volume), and the LeaP corpus of learner German and English (see Gut, this volume). The small number of extant phonological corpora indicates how little experience has been gained so far in compiling this type of corpus. It is therefore the second aim of this chapter to describe the entire design process of phonological corpora and to suggest some best practice guidelines that will help future corpus compilers to avoid common pitfalls and problems. Moreover, a theory of corpus creation, agile corpus creation (Voormann and Gut 2008), will be presented. This chapter is structured in the following way: After a definition of a phonological corpus is presented in section 2.2, section 2.3 discusses the most important elements in the design of phonological corpora. These include corpus storage, sustainability, sharing and reuse (section 2.3.1), questions of corpus representativeness and size (2.3.2), and raw data selection (2.3.3), as well as the issue of time-aligned phonological annotations (2.3.4). The chapter concludes with a discussion of theories of the corpus creation process (section 2.4) and a conclusion and outlook (section 2.5).

2.2  Definition of a Phonological Corpus

No unanimously accepted definition of what constitutes a phonological corpus exists to date. In order to describe the essential properties and functions of a phonological corpus, first a general description of the term ‘corpus’ and its properties is given.


2.2.1  Characteristics of a Corpus A corpus comprises two types of data: raw (or primary) data and annotations. The raw data of a linguistic corpus consists of samples of language. The types of raw data range from handwritten or printed texts and electronic texts to audio and video recordings. Some researchers only accept ‘authentic language’, i.e. language that was produced in a real communicative situation, as raw data and exclude recordings of individual sentences or text passages that are read out or repeated by speakers (e.g. McCarthy and O’Keefe 2008: 1012). Corpora containing the latter type of raw data have been classified as peripheral corpora (Nesselhauf 2004: 128) or databases. What all types of raw data have in common is that they have been selected but not altered or interpreted by researchers, and are accessible in the original form in which they were produced by speakers and writers. The term ‘annotation’ refers to additional (or secondary) information about the raw data of the corpus that is added by the corpus compilers. It can be divided into linguistic and nonlinguistic information. Examples of linguistic annotations are orthographic, phonemic, and prosodic transcriptions as well as part-of-speech tagging, semantic annotations, anaphoric annotations, and lemmatization. Annotations are always products of linguistic interpretation (see also Lehmann 2007:  17). Even an orthographic transcription reflects researchers’ decisions—for example by either using the spelling gonna or using the form going to. By the same token, annotations of syntactic, morphological, semantic, or phonological phenomena are results of interpretive processes resulting from the application of particular theoretical frameworks and perspectives. Non-linguistic corpus annotations are generally referred to as metadata, and include information about the corpus as a whole (e.g. who collected it, where, when and for which purpose); about the language samples (e.g. where and when they were produced); about the speakers/writers (e.g. their age, native language, and other languages); about the situation in which the language samples were produced (e.g. addressee, audience, event, purpose, time, date, and location); and about the recording process (e.g. what microphones and recording devices were used, what were the recording conditions). Not every collection of raw language data with corresponding annotations constitutes a corpus. The language sample should be representative (see section 2.3.2; McEnery and Wilson 2001: 29; Sinclair 2005). Some researchers furthermore claim that every corpus ‘is assembled for a specific purpose’ and that in the case of linguistic corpora, the purpose is the study of language structures and language use (e.g. Atkins et al. 1992: 1; Wynne 2009: 11; McCarthy and O’Keefe 2008: 1012). Thus, both language archives and the World Wide Web do not qualify as corpora in this strict sense since they were not collected for a linguistic purpose (e.g. Atkins et al. 1992: 1). By contrast, other researchers argue that the World Wide Web, if an informed selection of web pages is made, can serve as a corpus for linguistic research (Renouf 2007; Hoffmann 2009). While the usage of corpora in the past has been mainly restricted to the study of the structures and use of language, new opportunities for the applications of corpora have recently opened up. These include technical applications such as the training of automatic translation systems and the employment of corpora in the development of

dictionaries and grammars (e.g. Biber et al. 1999) as well as in language teaching (e.g. Kettemann and Marko 2002; Sinclair 2004; Gut 2006; Römer 2008). Modern corpora are available in electronic form and are thus machine-readable, so that a rapid (semi-)automatic analysis of large amounts of data in a given corpus can be realized.

2.2.2  Definition of a Phonological Corpus No commonly agreed definition of the term ‘phonological corpus’ exists yet. Phonological corpora can be divided into two types: speech databases and phonological corpora proper (see also Wichmann 2008: 187). The raw data of speech databases typically consists of lists of individual words, sets of sentences, or text passages read out by speakers under laboratory conditions. These recordings are well suited for instrumental analyses but, because of their highly controlled nature, might be of restricted use for the study of phonological phenomena other than those which the corpus compilers had in mind. Since the speech they contain is produced in a highly specific communicative situation, its phonological properties differ from those of speech produced under more ‘natural’ conditions such as in informal conversations, during conference speeches, in radio discussions, or in interviews (Summers et al. 1988; Byrd 1994; see also Birch, this volume). Raw data collected in these authentic communicative situations, by contrast, might suffer from a lower recording quality owing to background noises and speaker overlaps. Phonological corpora, compared to speech databases, have a wider application, and are collected with the purpose of studying the phonological structures of a language or language variety and their usage and acquisition as a whole. In our view, speech databases and phonological corpora should not be seen as a dichotomy but rather as the two endpoints of a continuum (see also Birch, this volume) with many possible intermediate forms, such as corpora containing spontaneous speech elicited in a controlled way and covering a very restricted topic, as in Map Tasks (e.g. Stirling et al. 2001). We will therefore not include a specific purpose in our definition of a phonological corpus. Phonological corpora can be designed for different purposes, and speech databases might be later converted and reused as phonological corpora by adding time-aligned phonological annotations. A phonological corpus is thus defined here as a representative sample of language that contains • primary data in the form of audio or video data; • phonological annotations that refer to the raw data by time information (time-alignment); and • metadata about the recordings, speakers and corpus as a whole. This definition is thus very close to Gibbon et al.’s (1997: 79) definition of a spoken language corpus as ‘any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by scientists in other organisations’. In detail, according to


FIGURE 2.1  Orthographic and phonemic annotation of part of an utterance in Praat.

the above definition, phonological corpora always contain raw data in the form of audio or video data (thus excluding written or sign language corpora). Strategies for selecting such raw data are discussed in section 2.3.3 below. Time-aligned phonological annotations constitute the second prerequisite of a phonological corpus. They can include phonemic and phonetic transcriptions on the segmental level, and transcriptions of suprasegmental phenomena such as stress and accent, intonation, tone, pitch accents, pitch range, and pauses (see Post and Delais-Roussarie, this volume). The minimal requirement in terms of annotation for a corpus to be a phonological corpus is a time-aligned orthographic annotation plus one level of time-aligned phonological annotation. The term ‘time alignment’ refers to a technique that links linguistic annotations to the raw data. Figure 2.1 illustrates a time-aligned annotation carried out with the speech analysis software Praat of a part of an utterance that begins with ‘Well and the mouse . . . ’. Beneath the speech waveform, the annotation is displayed on two different tiers. The top tier shows the boundaries of each of the individual words together with an orthographic transcription. On the bottom tier, the speech is segmented into phonemes, and the individual phonemes are transcribed phonetically using SAMPA, the computer-readable adaptation of the IPA (Wells et al. 1992). For time-aligned annotations, the boundaries of annotated elements are defined by time stamps in the underlying text file that is created by the speech analysis software. This means that information about the exact beginning and end of each annotated phonological element is available in the corresponding file. The annotation illustrated in Figure 2.1 thus provides direct access from each annotated element to the corresponding primary data, i.e. the original recordings. By clicking on any annotated element, the matching part of the recording will be played back. This is not only useful for the annotation and analysis of the corpus, allowing for items in question to be listened to repeatedly, but it also facilitates automatic corpus analyses: on the basis of the text file, specifically designed software tools can calculate phonetic phenomena such as the mean length of words or phonemes and the exact alignment of pitch peaks and valleys with the phonemes in the speech signal.
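To illustrate how such time stamps support automatic analysis, the following short Python sketch computes mean word and phoneme durations from two annotation tiers. It assumes that the tiers have already been extracted from a Praat TextGrid into plain lists of (start, end, label) tuples (the parsing step is omitted), and the interval values and labels are invented for illustration.

# A minimal sketch: duration measures over time-aligned annotation tiers.
# Each tier is assumed to be a list of (start_time, end_time, label) tuples
# extracted from a Praat TextGrid; the example data below is invented.

word_tier = [
    (11.99, 12.18, "well"),
    (12.18, 12.30, "and"),
    (12.30, 12.39, "the"),
    (12.39, 12.61, "mouse"),
]

phoneme_tier = [
    (11.99, 12.07, "w"), (12.07, 12.18, "e"),
    (12.18, 12.24, "a"), (12.24, 12.28, "n"), (12.28, 12.30, "d"),
    (12.30, 12.34, "D"), (12.34, 12.39, "@"),
    (12.39, 12.46, "m"), (12.46, 12.55, "a"), (12.55, 12.58, "u"), (12.58, 12.61, "s"),
]

def mean_duration(tier):
    """Mean duration (in seconds) of all labelled intervals on a tier."""
    durations = [end - start for start, end, _ in tier]
    return sum(durations) / len(durations)

def phonemes_in_word(word_interval, phonemes):
    """All phoneme intervals whose midpoint falls inside the word interval."""
    w_start, w_end, _ = word_interval
    return [p for p in phonemes if w_start <= (p[0] + p[1]) / 2 <= w_end]

print("mean word duration:    %.3f s" % mean_duration(word_tier))
print("mean phoneme duration: %.3f s" % mean_duration(phoneme_tier))
print("phonemes in 'mouse':", [label for _, _, label in phonemes_in_word(word_tier[3], phoneme_tier)])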

The third requirement of a phonological corpus is that it includes metadata about the recording (e.g. date, place), speaker/s (e.g. age, dialect background), the recording situation (e.g. situational context), and the corpus as a whole (e.g. collectors, annotation schemas that were chosen).
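As a rough illustration of what such metadata might look like in machine-readable form, the sketch below stores the metadata for one hypothetical recording as a Python dictionary and writes it to a JSON file. The field names and values are invented and do not follow any particular metadata standard; an actual corpus should adopt one of the established metadata schemes discussed elsewhere in this volume.

# A sketch of per-recording metadata in machine-readable form. All field
# names and values are invented for illustration; real corpora should follow
# an established metadata scheme rather than this ad hoc layout.
import json

recording_metadata = {
    "recording": {
        "id": "rec_0042",
        "date": "2012-05-14",
        "place": "fieldwork site A",
        "equipment": "headset microphone, solid-state recorder, 44.1 kHz / 16 bit",
    },
    "speaker": {
        "id": "spk_017",
        "age": 34,
        "gender": "female",
        "native_language": "German",
        "other_languages": ["English"],
        "dialect_background": "Westphalian",
    },
    "situation": {
        "style": "map task",
        "addressee": "spk_018",
        "setting": "quiet office",
    },
    "corpus": {
        "name": "example corpus",
        "compilers": ["project team"],
        "annotation_schemas": ["orthographic", "SAMPA phonemic transcription"],
    },
}

with open("rec_0042.meta.json", "w", encoding="utf-8") as f:
    json.dump(recording_metadata, f, indent=2, ensure_ascii=False)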

2.3 Corpus Design

Designing any corpus requires careful planning that takes into account the entire life cycle of the corpus. All decisions that are made before and during raw data collection and annotation will determine the eventual usability of the corpus. The first considerations of corpus designers should therefore centre round the issues of corpus use, sustainability, sharing, and reuse.

2.3.1  Storage, Sustainability, Sharing, and Reuse One of the principal issues to be addressed before the compilation of a phonological corpus can begin is that of corpus storage and sustainability (see also Wynne 2009). If the corpus is to be used over a long period of time, its sustainability and preservation need to be ensured. Corpus storage firstly involves organizing the provision of institutionalized archiving facilities that guarantee continued access to the corpus. Infrastructures for the storage of language resources have, for example, been created by the LDC (http:// www.ldc.upenn.edu/) and the CLARIN initiative (http://www.clarin.eu/; see also Wittenburg, Trilsbeek and Wittenburg, this volume). Furthermore, in order to compile a sustainable corpus, corpus creators should choose a data format and annotation tools that will be able to keep up with future technical developments. Adaptation to future changes is easier when available standards are used during corpus creation and when adequate documentation is provided. In particular, a standardized data format should be chosen (see Romary and Witt, this volume) and annotation tools should be selected that have the prospect of being further developed and maintained in the future. The issues of standardization and documentation also apply to the sharing and reuse of phonological corpora. Although Johansson (2008:  35)  claims that the sharing of resources is a ‘rather novel aspect of the development of corpus linguistics’, it is a central requirement for theoretical advancement in linguistics as a whole and phonology in particular. Examples abound of corpora that cannot be reused because the compilers did not envisage sharing their data: the original recordings of the spoken part of the British National Corpus (BNC), for instance, cannot be made available because permission was only sought from the speakers for publishing the transcripts. Obtaining declarations of consent now is impossible due to the anonymization procedure, and because no lists of the recorded speakers seem to have been made (Burnard 2002). Lack of planning can also be seen in the example of the Spoken English Corpus, whose orthographic transcriptions


had to be time-aligned with the original recordings in retrospect, requiring great effort and cost of time. Together with automatically generated phonemic and intonation transcriptions, the corpus now constitutes a phonological corpus under the name of Aix-Marsec corpus (Auran et al. 2004). Currently, the reuse (and extension) of existing corpora is still impeded by the fact that the annotation tools used for corpus compilation have different data formats regarding both metadata and annotation, which results in limited interoperability. Moreover, there are as yet no commonly accepted ISO guidelines for the encoding of phonological corpora, and the existing TEI guidelines for the encoding of spoken language remain inadequate for collectors of phonological corpora. In fact, the only quasi-phonological corpus (the annotations are not time-aligned) encoded according to the TEI guidelines so far is the NECTE corpus (e.g. Moisl and Jones 2005). Legal issues requiring consideration at the outset of phonological corpus compilation include both licensing and permissions. For a publication of the corpus that provides access to all of its potential users, a declaration of consent that allows the subsequent distribution of the data has to be signed by every speaker who contributes raw data to the corpus. This declaration of consent should ideally include all possible forms of publication, even via media that cannot yet be envisaged at the point of corpus compilation. Most universities now provide standard forms for declarations of consent which contain a short description of the purpose of the study and descriptions of the intended use of the data by researchers. Surreptitious recordings such as those obtained for the spoken language sub-corpus of the BNC (Crowdy 1993) no longer meet ethical standards. It is no longer acceptable to record speakers and inform them only afterwards that they have been recorded. Consent must be obtained prior to the recording. The reuse and enhancement of corpora by researchers other than those directly involved in the compilation process is only possible given adequate documentation of both the corpus content and the corpus creation process. Publishing the corpus under a license such as the Creative Commons license (http://creativecommons.org) ensures that all subsequent modifications of the corpus by other researchers are required to be made available again under the same license. The corpus creation process itself can be made open, as is the case for the Nigerian component of the International Corpus of English (ICE Nigeria) that was compiled at the University of Münster (Wunder et al. 2010). Its creation process is documented on http://pacx.sf.net with a video tutorial that shows how to create a corpus from scratch in three minutes with the tool Pacx. Reuse of the corpus also requires the accessibility of appropriate annotation and search tools. Ideally, these are freely available open source tools that allow researchers both to add their own annotations to the corpus and to search the corpus automatically. ELAN (see Sloetjes, this volume) and Phon (see Rose and MacWhinney, this volume) are examples of such tools.

2.3.2  Representativeness and Size

Representativeness and a sufficient size are two commonly agreed-on requirements for corpora (e.g. McEnery and Wilson 2001: 29; Sinclair 2005; Lehmann 2007: 23). Yet

20   Ulrike Gut and Holger Voormann representativeness has been described as a ‘not precisely definable and attainable’ goal (Sinclair 2005) or as impossible (Atkins et al. 1992: 4), and many corpus compilers state that their corpus does not reach this goal (e.g. Johansson et al. 1978: 14 on the LOB corpus) or that representativeness has to be sacrificed for other needs (Hunston 2008: 156). The term ‘representativeness’ is usually used to refer to the objective that the raw data of a corpus should constitute a sample of a language or a language variety that includes its full range of variability. It should thus provide researchers with as accurate as possible a picture of the occurrence and variation of linguistic phenomena, and the potential to generalize the corpus-based findings to a language or language variety as a whole. It is of course never possible for a corpus to be representative in the strict sense, since this presupposes an exact knowledge of the distribution and frequency of linguistic structures in the language or language variety in question. It is extremely difficult to define what, for example, ‘British English’ is, let alone to decide which linguistic features it does and does not contain (see also Clear 1992: 21). Typically, corpus collectors try to solve this methodological problem by using an intelligent sampling technique. Sampling, the selection of raw data, can be carried out as simple random sampling—choosing raw data without any predefined categorization—or as stratified random sampling, for which units or categories are established from which random samples are subsequently drawn. It is generally believed that the second approach achieves higher representativeness (Biber 1993: 244). The sampling frame that is created for stratified random sampling can be either linguistically motivated or rely on demographic features. A  sampling frame based on linguistic criteria could consist of a predetermined set of different text types or speaking styles, usually illustrated as different cells of a table (e.g. Hunston 2008: 154). The aim of raw data collection then is to fill all such cells with an equal number of language samples (usually measured in words). This approach has some inherent problems. The first question to be asked is whether the different text types or speaking styles should be taken from language production or language perception. Several authors have argued that the types of language people hear do not match the types of language they produce (e.g. Atkins et al. 1992: 5; Clear 1992: 24ff.). For instance, while very few people ever address others in public speeches, these public speeches are heard by many. Some researchers argue that only production defines the language under investigation (Clear 1992: 26) and therefore suggest including the text types of language production rather than perception in a corpus. Others hold that the selection of text types should be based on the inclusion of both language production and perception (Atkins et al. 1992: 5). The second question is concerned with the distribution of the different text types in the corpus. If, as has been argued, the distributional characteristics of the text types should be proportionally sampled, i.e. should match their relative distribution in the target language (e.g. Biber 1993: 247), all phonological corpora would have to consist of 90 per cent of conversations, which is the estimated percentage of this text type to be produced by the average speaker. 
Biber (1993: 244) states that ‘identifying an adequate sampling frame [for spoken texts] is difficult’.
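The logic of a stratified sampling frame can be made concrete with a short sketch. In the following Python fragment, candidate recordings are grouped into cells defined by speaking style and speaker age group, and the same number of recordings is drawn at random from every cell; the cell labels and the list of candidates are invented for illustration.

# A sketch of stratified random sampling over a predefined sampling frame.
# The cells (speaking style x age group) and the candidate recordings are
# invented example data.
import random
from collections import defaultdict

candidates = [
    {"id": "r001", "style": "conversation", "age_group": "18-35"},
    {"id": "r002", "style": "conversation", "age_group": "36-60"},
    {"id": "r003", "style": "read speech", "age_group": "18-35"},
    {"id": "r004", "style": "read speech", "age_group": "36-60"},
    # ... in practice, many candidate recordings per cell
]

def stratified_sample(recordings, per_cell, seed=1):
    """Draw the same number of recordings from every (style, age group) cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for rec in recordings:
        cells[(rec["style"], rec["age_group"])].append(rec)
    sample = []
    for cell, members in sorted(cells.items()):
        if len(members) < per_cell:
            print("warning: cell %s has only %d candidates" % (cell, len(members)))
        sample.extend(rng.sample(members, min(per_cell, len(members))))
    return sample

for rec in stratified_sample(candidates, per_cell=1):
    print(rec["id"], rec["style"], rec["age_group"])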


A demographic sampling frame selects not samples of language but speakers, and does so on the basis of standard social criteria such as age, gender, social class, regional background, and socioeconomic status. Often, combined linguistic and demographic sampling frames are used in corpus compilation: the collectors of the spoken part of the BNC, for instance, attempted to include recordings with British speakers of all regions, socioeconomic groups, and ages, and to cover a wide range of communicative situations (Crowdy 1993: 259). The dimensions of variation that should be included in a representative phonological corpus need to have an empirical basis, i.e. they should be determined by empirical studies of the extent and type of phonological variation and its constraining factors. Dimensions of variability that have been identified in previous research include • speaker groups of different age, gender, regional background, socioeconomic status; • speaker states (physiological and emotional); • situational contexts (e.g. interactional partners); • communicative contexts; • speaking style. The systematic influence of topic as a dimension of phonological variability remains to be conclusively demonstrated and requires further empirical investigation (see Biber 1993: 247). It is obvious that true representativeness is not always possible in corpus design. Together with issues of availability and willingness of speakers, also at issue are external factors such as the number of project members involved in corpus compilation and the duration and extent of funding. Likewise, it is very difficult to determine the optimal size of a corpus. With the ever-increasing capacity of computers to store gigabytes of data, the formerly arbitrarily established ‘ideal’ corpus size of one million words—at least for corpora with written texts as primary data—has now been superseded by the motto ‘the larger the better’ (Sinclair 1991: 18, Clear 1992: 30). Methodological and technological limitations constraining corpus size have largely been eliminated. However, no convincing proposals offering rigorous arguments for ideal corpus size have been published yet, and systematic studies in this area have yet to be conducted. It is increasingly accepted that the representativeness of a corpus does not correlate with a particular corpus size (e.g. Clear 1992: 24; Biber 1993: 243; McCarthy and O’Keefe 2008: 1012). Currently available phonological corpora vary enormously in size. At the small end of the scale, specialized corpora such as the LeaP corpus of learner German and learner English (see Gut, this volume) or the NIESSC (National Institute of Education Spoken Singapore Corpus) corpus of spoken educated Singapore English (Deterding and Low 2001) consist of 12 hours and 3.5 hours of recorded speech respectively. Although both corpora aim to include a representative sample of speakers, the problem remains that the small corpus size may cause difficulties in the interpretation and generalization of the results. At the large end of the scale are corpora such as the IViE corpus, which contains 36

22   Ulrike Gut and Holger Voormann hours of speech data from 108 speakers whose intonation was transcribed (see Nolan and Post, this volume). The optimal size of a corpus is therefore one that requires a minimum of time, effort, and funding for corpus compilation but that, at the same time, guarantees that the distribution of all linguistic features is faithfully represented. Biber (1993) has shown for written language that some structural features are distributed linearly and others in a curvilinear way. For the former type of linguistic feature, an increased corpus size implies a linearly growing representation in the corpus, while for the latter type an increased corpus size results in an overrepresentation. Empirical research of the same nature appears not to be available for phonological features. Only future research can determine whether these and/or other types of distributional patterns exist for phonological structures and how such patterns interrelate with sample size. In fact, it will be the increasing compilation and availability of purpose-built phonological corpora that allow for the first research of this kind, contributing in turn to refinements in the design of future corpora.

2.3.3  Raw Data Selection The ‘authenticity’ of the language contained in a corpus is the central argument that is typically used to point out the advantages of corpus linguistics in comparison with linguistic research based on invented sentences. It has been claimed that only on the basis of language samples that were actually produced, i.e. utterances that were used in real life and have proven their status as communicative vehicles, can the structure and usage of language be studied. While this is possibly true for research on morphosyntax, lexicon, and pragmatics, the problem of ‘non-authentic’ language production for phonological investigation is a more intricate one. Even speech elicited in very controlled situations can be considered ‘natural’ since it is appropriate in the very specific communicative situation of phonetic experiments (see also Birch, this volume). Although it has been shown in many studies that the prosodic properties of speech vary with speaking style, and that read speech exhibits distinct phonological properties from spontaneous ‘natural’ speech (e.g. Gut 2009), this should be taken as an argument in favour of including as many different speaking styles as possible into a phonological corpus, as was the case for the collection of the IViE corpus (see Nolan and Post, this volume), the PFC (see Durand, Laks and Lyche, this volume), and the LeaP corpus (see Gut, this volume). Distinctions in the phonological domain between speech produced under laboratory conditions and ‘natural’ are a much-needed focus for future research. Data collection and selection should be driven by external rather than internal criteria (Clear 1992: 29; Sinclair 2005). External criteria are defined situationally and refer to different language registers, text types, or communicative situations. Internal criteria refer to the distribution of certain linguistic structures or the frequency of particular words or constructions. Raw data selected according to internal criteria might be biased by the researchers’ purposes, and may therefore fail to achieve a high level of representativeness (see section 2.3.2).


A further consideration to be made when embarking on data collection is the time and effort different types of raw data require to be gathered. It is well known, for example, that it is much more difficult to obtain language samples from speakers of lower socioeconomic status and educational level than from speakers with higher socioeconomic status and educational level (see Oostdijk and Boves 2008). Moreover, not all collected recordings will be usable, especially those that have been recorded in non-laboratory conditions. Crowdy (1993: 261) reports for the collection of the recordings for the spoken part of the BNC that the total amount of usable recordings was about 60 per cent of all recordings that were made. When, or preferably before, making recordings, potential difficulties with subsequent annotation should already be considered. These include the separation of the different speakers in multi-speaker conversations, and the overall intelligibility that might be reduced by speaker overlaps and background noise. It is generally a good idea to first collect a small pilot corpus and annotate and analyze this before proceeding to collect more raw data (see section 2.4 on the corpus creation process).

2.3.4  Time-Aligned Phonological Annotations Before embarking on corpus annotation, a suitable annotation tool needs to be chosen that fulfils the specific requirements of the corpus creation process. In order to ensure the usability and enhanceability of the corpus, the tool should store the created files in a standardized data format such as XML. Moreover, these files should be readable by other tools, e.g. corpus search tools, without requiring complex conversion routines. If the corpus annotators work in different locations, the tool should provide facilities for central storage and repeated modifications of the annotations. Pacx, for instance, which is used in the compilation of several corpora, is a tool that supports the entire corpus creation process including raw data storage, annotation and metadata storage in XML, collaborative annotation, and automatic checks of annotation consistency and annotation errors, as well as simple corpus searches (Wunder et al. 2010). The first type of annotation every phonological corpus needs to contain is a time-aligned orthographic transcription. This is indispensable for adding further annotations such as an automatic phonemic transcription or part-of-speech tagging and lemmatization. Due to the current lack of standardization, many decisions are required for an orthographic transcription, which include the spelling of colloquial forms, the representation of truncated words, and the usage of punctuation symbols. In the case of phonological corpora that contain non-native speech, this requirement is even more challenging since often a decision needs to be taken whether to transcribe the words and forms that were actually produced or those that might have been intended. While these decisions and the decision regarding what to annotate as the smallest unit of speech (words, turns, utterances) might differ according to the purpose of the specific corpus, it is essential for the use and reuse of the corpus that they are documented in a very detailed way, and that this documentation is accessible to all future users of the corpus. One way of documenting the decisions that were made during corpus annotation is to document them directly in the corpus or to set up a Wiki page that can be continuously updated by all annotators.
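The following sketch shows what a time-aligned orthographic tier might look like when written out as XML from Python. The element and attribute names form a deliberately simplified, ad hoc layout chosen purely for illustration; they do not reproduce the actual file formats of ELAN, Pacx, Praat, or any other tool mentioned in this chapter.

# A sketch of writing and re-reading a time-aligned orthographic tier as XML.
# The element and attribute names are an invented, simplified layout, not the
# native format of any existing annotation tool.
import xml.etree.ElementTree as ET

tier = ET.Element("tier", attrib={"id": "ortho", "speaker": "spk_017"})
for start, end, text in [(11.99, 12.18, "well"), (12.18, 12.39, "and the"),
                         (12.39, 12.61, "mouse")]:
    ET.SubElement(tier, "interval",
                  attrib={"start": "%.2f" % start, "end": "%.2f" % end}).text = text

ET.ElementTree(tier).write("example_ortho_tier.xml", encoding="utf-8",
                           xml_declaration=True)

# Reading the annotations back preserves the time alignment.
for interval in ET.parse("example_ortho_tier.xml").getroot():
    print(interval.get("start"), interval.get("end"), interval.text)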

24   Ulrike Gut and Holger Voormann The second type of annotation that makes a corpus a phonological corpus is phonological annotation, which includes phonetic and phonemic transcription as well as transcription of prosodic features such as intonation and stress. Phonological annotations can have different formats and should be—in order to allow the reuse of the corpus—both well documented and as theory-independent as possible (Oostdijk and Boves 2008: 644). The degree of phonetic detail represented in the annotations depends on the type and purpose of the corpus (see also Delais-Roussarie and Post, this volume and Cucchiarini and Strik, this volume). Annotation of a corpus compiled for the documentation of a language with which the corpus collectors have a limited familiarity, for example, requires a far greater level of detail than annotation of a phonological corpus for a language that has a well-described phonology. For the former case, Lehmann (2007: 22) suggests the following guidelines: • Before the description of the phonological system of a language is completed, all variations down to the smallest allophonic level should be annotated. Some variation previously considered irrelevant might turn out to have functional relevance or be theoretically interesting. • Transcriptions of the phonetic details should be annotated on different tiers so that all the variants are linked to the corresponding invariant on a more abstract level. Manual annotations and especially phonological annotations are very time-consuming, taking up to an hour for a minute of speech. Furthermore, several studies have shown that annotator inconsistencies and errors are inevitable (e.g. Stirling et al. 2001; Gut and Bayerl 2004; Kerswill and Wright 1990). It is therefore advisable to carry out repeated measurements of annotator consistency and accuracy, or to use a tool that carries them out automatically (see Oostdijk and Boves 2008: 663; Voormann and Gut 2008). One general requirement for all annotations is that they are separated into different tiers, with each tier representing one speaker, one type of linguistic level, or one event. Thus, non-linguistic events such as laughter, noises, and background voices should be annotated on a separate tier as well as all additional annotations to the orthographic tier (e.g. repairs, false starts, disfluencies and hesitations). However, there is as yet no common definition across the different annotation tools of what a tier is. Neither have the concept and form of an annotation yet been specified formally. In the future, a language similar to DTD or XML Schema for XML needs to be created in order to allow formal specification of phonological corpus design. At present, the fact that nearly every annotation tool uses a different data format hinders interoperability.
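One widely used measure of annotator consistency is Cohen's kappa, which corrects the observed proportion of agreement between two annotators for the agreement expected by chance. The sketch below computes kappa for two annotators who have labelled the same set of items; the label set and the example sequences are invented, and kappa is only one of several agreement measures in use.

# A sketch of measuring inter-annotator agreement with Cohen's kappa.
# The two label sequences (e.g. pitch accent categories assigned to the same
# words by two annotators) are invented example data.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_1 = ["H*", "H*", "L+H*", "none", "H*", "none", "L*", "H*"]
annotator_2 = ["H*", "L+H*", "L+H*", "none", "H*", "none", "H*", "H*"]

print("kappa = %.2f" % cohen_kappa(annotator_1, annotator_2))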

2.4  The Corpus Creation Process

Successful corpus creation is constrained by multiple factors. Restrictions on funding and time determine the corpus size and representativeness as well as the richness


of corpus annotation and the accuracy of the annotations. Currently available corpora seem to suggest that these properties stand in a trade-off relationship, in the sense that the improvement of one of them results in the weakening of another. Theories of corpus creation address the problem of simultaneously maximizing corpus size as well as the quality and quantity of the annotations while minimizing the time and cost involved in corpus creation. Traditionally, corpus creation has been divided into separate phases that are carried out in a sequential manner: when all data has been collected, corpus creators will devise an annotation schema or decide to use an already established one. This is followed by an annotation phase, and only after this has been completed will the corpus be searched (e.g. Wynne 2009: 15). Some modern theories of corpus creation suggest a new approach. Biber (1993) and Atkins et al. (1992), for example, propose that corpus compilation should proceed as a cyclic process. Feedback from corpus users or corpus searches of a small sample corpus should be employed to provide guidelines for further data collection. Taking this idea several steps further, Voormann and Gut (2008) suggest an iterative theory of corpus creation, agile corpus creation. The central ideas of agile corpus creation are the reorganization of the traditional linear and separate phases of corpus design and the recognition of potential sources of errors during corpus creation. Modelled on agile software development, the theory of agile corpus creation replaces the linear-phase approach with a cyclic and iterative small-step process that turns the traditional sequence on its head. The starting point is a corpus query that drives the compilation of a prototypical mini-corpus which contains the essential functions such as the data format, the preliminary annotation schemas, some of the annotations, and a search tool for the execution of a query. As the first step of corpus creation, a corpus query is formulated, which drives the development of a first version of the annotation schema. When the specification of the annotation schema is accomplished, a small part of the primary data is annotated accordingly. This is followed by the next corpus query and thus the second cycle. The theory of agile corpus creation claims that possible errors can occur at any time during the corpus creation process, and that it is therefore cheaper and quicker to incorporate opportunities for concomitant and continuous modifications in the cyclic corpus creation process. With the first corpus query, all previous aspects of the corpus creation are analysed: Is the annotation schema suitable for the analysis of the corpus query? Is the annotation consistent? Design errors, annotation errors, and conceptual inadequacies in the annotation schema thus become visible at a very early stage. Even more importantly, the early corpus queries constitute an evaluation of the annotation process. The annotations are checked for inter- and intra-annotator agreement as a measure of reliability. Subsequently, sources of inconsistency are identified and appropriate steps to improve the annotation schema are taken. Corpus queries thus monitor annotator and inter-annotator reliability, as well as structural inadequacies. Importantly, the method of agile corpus creation allows corpus creators to know when the corpus has reached a size sufficient for the specific research question. 
An open source tool that supports agile corpus creation is Pacx (http://pacx.sf.net).
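The cyclic logic of agile corpus creation can be illustrated with a deliberately simplified, runnable Python sketch: a small batch of data is annotated in each cycle, the driving corpus query is run over the growing corpus, and compilation stops once the query returns enough evidence. The data, the query, and the stopping criterion are invented; a real cycle would also include schema revision and the consistency checks described above.

# A toy sketch of the query-driven cycle of agile corpus creation: annotate a
# small batch, run the driving query, and stop when it is satisfied. All data
# and the stopping threshold are invented.

raw_batches = [
    [("liaison", "realized"), ("liaison", "absent"), ("liaison", "realized")],
    [("liaison", "absent"), ("liaison", "absent")],
    [("liaison", "realized"), ("liaison", "realized"), ("liaison", "absent")],
]

def corpus_query(corpus, phenomenon):
    """The driving query: which annotated tokens of the phenomenon exist so far?"""
    return [token for token in corpus if token[0] == phenomenon]

corpus = []
TARGET_TOKENS = 5  # invented stopping criterion for this research question

for cycle, batch in enumerate(raw_batches, start=1):
    corpus.extend(batch)  # stands in for annotating one small batch
    hits = corpus_query(corpus, "liaison")
    print("cycle %d: %d annotated tokens so far" % (cycle, len(hits)))
    if len(hits) >= TARGET_TOKENS:
        print("query satisfied; the corpus is large enough for this question")
        break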


2.5  Conclusion and Outlook

Phonological corpus design that has the aim of producing a valuable tool for phonological and phonetic research requires careful planning. Not only data collection and annotation principles but also considerations regarding software choice, licenses, documentation, and archiving need to be taken into account. Only a phonological corpus that makes use of both the available standards in terms of data format and annotation, along with state-of-the-art sampling techniques and corpus creation tools, will prove to be accessible from a long-term perspective and will be reusable and enhanceable for future researchers. For both existing and planned phonological corpora, following internationally recognized standards for data formats, annotations, and the integration of metadata is of the highest importance. The standardization of some aspects of phonological corpora has just begun, and important decisions regarding data model, data format, and annotation will have to be taken by the international community in the near future. The growing infrastructure and an increasing number of tools that support phonological corpus creation, storage, and distribution, however, paint a positive picture for future developments.

CHAPTER 3

DATA COLLECTION

BRUCE BIRCH

3.1 Introduction

In addition to giving a summary of techniques useful in the collection of data for the purposes of assembling phonological corpora, this chapter attempts to provide some theoretical background to the task of data collection, examining key issues including control over primary data, context and contextual variation, and the observer’s paradox. In section 2, I introduce the concept of a data collection continuum, suggesting that it is helpful to think of data collection techniques in terms of how much control, and the type of control, they exert over the production of speech. Section 3 deals with the problem of context: how the existence of different types of linguistic context and the interaction between contexts impact on primary data. In section 4, the closely related issues of ‘observer effects’ and the question of what is ‘natural’ data are examined in some detail. Sections 5–7 attempt to give an overview of data collection techniques, and Section 8 provides some practical advice on recording equipment and recording technique.

3.2  The Data Collection Continuum

It is perhaps useful to recognize two types of corpus which provide data for phonological analysis, and which differ in terms of their origin (see Gut and Voormann, this volume). One type results from defining a research question, and subsequently designing speech experiments or elicitation tools and stimuli explicitly intended to capture the speech data required for the exploration of the question. The resulting corpus, especially in the case of speech experiments, will tend to be relatively small, and intentionally limited in terms of the contexts in which target segments/words/phrases occur. The intentional limitation of the primary data in such approaches is what is referred to by the notion of ‘control’.

A second type is generated in order to make available a ‘representative’ sample of the language or languages being investigated. It does not result from the need to answer a specific research question, but rather is intended to be used by researchers investigating a broad range of questions. This type of corpus, as a result of its intention to be ‘representative’, will necessarily be large and richly complex in terms of the contexts in which yet-to-be-identified target segments/syllables/words, etc. occur. It is for the most part intentionally ‘uncontrolled’, the speech it contains exemplifying a wide spectrum of observable communicative events.

FIGURE 3.1  The Data Collection Continuum.
Controlled pole: controlled context; scripted; single research question; small sample; single communicative event.
Uncontrolled pole: uncontrolled context; unscripted; multiple research questions; large sample; broad range of communicative events.

Variation in the amount of control exercised over primary data may be expressed in terms of a data collection continuum (see Figure 3.1). At one pole of the continuum are scripted speech experiments and reading tasks which attempt, more or less successfully, to maximize control over contextual factors identified as relevant in order to eliminate ‘noise’. At the other pole are, for example, language documentation projects which aim to capture the richness and diversity of a particular language, exerting relatively little intentional control over contextual factors within any given text, being more concerned with capturing a diverse range of unscripted communicative events. In between these two poles sit data elicitation techniques which contain elements of intentional control and limiting, and which at the same time set out to capture a representative sample of unscripted language use. A well-known and often-employed example of such a technique is the HCRC Map Task (see section 6.3), which typically aims to capture multiple instantiations of target phenomena (e.g., phrases providing environments known to trigger t-deletion in English, or words composed entirely of sonorants and vowels, which provide reliable unbroken FO traces) by limiting the names of locations on a map in terms of their segmental composition, at the same time as aiming to elicit dialogue motivated by the communicative needs of participants negotiating a shared task, rather than by the prompting of an experimenter or script. At the controlled end of the continuum, no experiment is perfect, and noise will tend to seep into the data despite the best efforts of those involved in the design and collection processes.1 At the uncontrolled end, a multitude of unintended contextual factors

1  For a description of this process with regard to efforts to examine neutralization of voicing in final obstruents in German, see Kohler (2007).


are always present, such that (a) the data collected will inevitably be limited in ways of which the collector is unaware (potential target phenomena will be absent or under-represented), and (b) contexts crucial for the investigation of particular research questions will be absent (potential target phenomena may be present and well represented, but are absent in desired contexts). Regardless of where the corpus sits on the data collection continuum, therefore, it will exhibit gaps. These gaps are typically exposed as a result of attempts to use the corpus to answer a research question. As a result of the identification of such a gap, elicitation techniques may be employed in order to expand the corpus in a specific direction in order to fill the identified gap. At the uncontrolled end of the spectrum, where a corpus intended for multiple analyses consists mainly of recordings of communicative events occurring in the everyday lives of speakers, gaps may be filled via the addition of subcorpora which sit further toward the controlled end of the continuum. For example, in order to make a corpus of value for the investigation of the phonological correlates of contrastive focus, it may be necessary to employ or develop some nonverbal elicitation tools designed specifically for the purpose, perhaps along the lines of the QUIS2 stimuli set produced at the University of Potsdam. Data gathered through the use of these stimuli will form a subcorpus of your overall corpus, as well as a subcorpus of the total QUIS corpus generated as a result of the use of the same stimuli across a variety of the world’s languages. At the controlled end of the continuum, to fill identified gaps it may be necessary to modify the restrictions initially placed on the data by adding missing contexts, for instance through the addition of task-oriented elicited data to a corpus which consists only of read speech. The identification of gaps in your corpus by trying it out (if this is feasible) is probably the most reliable way to gain information on how to modify your data collection techniques to accommodate the research questions for which the corpus is intended (see the discussion of ‘agile corpus creation’ in Gut and Voormann, this volume). Both large and small corpora often include subcorpora from different places on the data collection continuum in order to achieve this aim. A corpus may contain, for example, a mixture of read speech or elicited words/paradigms, etc; data obtained through the use of stimuli such as the Map Task; and interviews with subjects about major events in their lives.

3.3  Context and Contextual Variation

Before beginning on the task of data collection for phonological corpora, an understanding of the issues surrounding context and contextual variation is an absolute

2  Skopeteas et al. (2006).

requirement. As Pierrehumbert (2000a) has pointed out in support of the approach taken by Bruce (1973) in his study of Swedish intonation,

Much early work on prosody and intonation (such as Fry 1958) takes citation forms of words as basic. Insofar as the intonation of continuous speech was treated at all, it was in terms of concatenation and reduction of word patterns which could have been found in isolation. Bruce, in contrast, adopted the working hypothesis that the ‘basic’ word patterns were abstract patterns whose character would be revealed by examining the full range of variation found when words are produced in different contexts. . . . The citation form is then reconstructed as the form produced in a specific prosodic context—when the word is both phrase-final and bears the main stress of the phrase. The importance of this point cannot be overemphasized. In effect there is no such thing as an intonation pattern without a prosodic context. The nuclear position and the phrase-final position are both particular contexts, and as such leave their traces in the intonation pattern. (Pierrehumbert 2000a: 17).

If we accept the claim that analysis of intonational phenomena and, by extension, of phonological and phonetic phenomena in general will be incomplete in the absence of examination of the ‘full range of variation found when words are produced in different contexts’, then it may seem to follow that the aim of data collection for the purposes of corpus-based phonological analysis must correspondingly be to capture language in as broad a range of contexts as possible. So, for example, capturing occurrences of the phoneme /r/ in some non-rhotic varieties of English in both prevocalic and preconsonantal position would be essential in order to provide, on the one hand, tokens where /r/ is realized (e.g., in the phrase butcher in town) and on the other, where it is deleted (as in the phrase the butcher shop). Further, it would be necessary to include examples which are word-final vs. word-initial, utterance-initial vs. utterance-final, and so on, in order to determine the impact of prosodic constituent boundaries on realization. However, an exhaustive study will require that contextual factors other than those of a purely phonological nature be taken into account. Grammatical, syntactic, and pragmatic contextual factors, as well as discourse genre and social context, have all been shown to impact on, or interact with, phonological structure. Moreover, we can also expect (or at least cannot rule out) that interactions between these different types of context exist—that, for example, social context may influence syntactic structure, which in turn will impact phonological structure, and so on. For example, it has been shown that speech elicited from speakers in the context of a laboratory reading task contrasts systematically with, on the one hand, reduced speech occurring in unobserved conversations (e.g. Summers et al 1988; Byrd 1994) and, on the other, hyperarticulated speech (e.g. Uchanski et al. 1996). It soon becomes apparent, therefore, that aiming to obtain a ‘complete’ set of contexts for a single phenomenon, let alone for a large range of phenomena, is to set oneself an infinite task, as there is ‘no principled upper limit’ (Himmelmann 2006) to the number


and type of discoverable contexts. Or, as Sinclair puts it, language is a ‘population without limits, and a corpus is necessarily finite at any one point’ (Sinclair 2008: 30). Given the complexity of language as an object of study, it has been necessary for linguistics as a discipline to break the study of language up into different subdisciplines, and in turn to create further divisions within these subdisciplines. Thus, within the subdiscipline of phonology, it is possible, for example, to specialize in the behaviour of certain segments, or in the study of intonation systems, and so on. Correspondingly, the corpus required by a specialist in intonation will vary considerably from that, say, required by someone studying the behaviour of post-alveolar segments. To extend the example given above, examination of the way in which the intonation system of a language signals narrow contrastive focus, i.e. whether the language has a tendency (like English) to de-accent repeated material or not (like Spanish), will require a corpus which includes sentence pairs containing such repeated material (e.g. pairs such as I don’t understand that. You DO understand that). Such context will be of little interest to the analyst focusing on the phonological or phonetic behaviour of post-alveolar segments, who may want multiple examples of the same word in contrasting carrier phrases or frame sentences containing a specific range of phonological contexts. It is the task of phonological analysis, like any other kind of linguistic analysis, to confront and deal with the complexity resulting from the presence of, and interactions between, contextual factors in order to proceed at all. The myriad of contextual factors affecting the realization of constituents such as syllables and words must be identified (as far as this is possible) as a first step. From that point, there are two directions in which to move. Towards the ‘controlled’ end of the data collection continuum, data collectors aim to artificially reduce the complexity of language by eliciting linguistic data for which certain contextual factors are controlled. This approach is exemplified by the experimental paradigm, in which subjects produce speech in a context which has been specifically designed to control for, or eliminate, certain contextual factors which would otherwise introduce ‘noise’ to the data collected. In such experiments speakers may be repeating, for instance, ‘nonce’ or ‘meaningless’ words; reading; or responding to grammatical but semi-nonsensical or totally nonsensical verbal stimuli. The same approach is also exemplified by a linguist eliciting word lists in the early stages of compiling a language documentation or grammar of an unknown language. In this situation, the linguist compels the subject to suspend the usual contextualization of words in discourse, eliciting from them instead sequences of single words or phrases uttered in isolation. A second direction to move in, rather than attempting to artificially reduce the complexity of linguistic data, is to develop approaches and tools which allow for the analysis of relatively ‘uncontrolled’ data. In other words, rather than focusing on modifying or controlling the data at the production end, such approaches attempt to develop sophisticated analytical techniques which can filter complex, less controlled data in a variety of ways in order to make it available to a wide spectrum of analyses. 
Sophisticated annotation techniques such as those developed within the EAGLES framework (http://www.ilc.cnr.it/EAGLES96/home.html) are one such tool, in which the mark-up of data allows

for the inclusion and exclusion of an array of contextual factors and combinations of factors. Clearly, your data collection activities need to reflect which of these two directions, or which elements of each, will be followed in order to analyse the data. Crucially, the complexity of language as an object of research guarantees that even the largest 'multi-purpose' corpus may well not contain enough tokens of a given phonological phenomenon to provide falsifiable results for a specific research question.3 Simply by amassing a large amount of data, the collector cannot hope to have as a result anything approaching a corpus which furnishes evidence for the infinite number of possible phonological research questions which it may be required to answer. If the intention of the data collector is to build a large multi-purpose corpus, the corpus should be viewed as an evolving process, gaining richness and complexity over time, rather than as a static object. The evolution of this richness and complexity will be facilitated greatly through attempts to use the corpus for the purposes of analysis. Users will inevitably identify gaps where required data is not present. These gaps can then be filled via more data collection. In this way, use of the corpus is crucial, feeding back into the structure of the corpus, gradually increasing its value for a wide range of phonological research questions.
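To make concrete how mark-up of this kind supports the inclusion and exclusion of contextual factors, the sketch below shows one minimal way of filtering annotated tokens. The record structure and field names are invented for illustration and do not correspond to the EAGLES mark-up itself.

```python
# Illustrative sketch only: hypothetical token records and field names,
# not the EAGLES specification. Each token of a target phenomenon
# (here, non-prevocalic /r/) carries labels for several contextual factors,
# so that analyses can include or exclude tokens by any combination of them.

tokens = [
    {"word": "butcher", "following": "vowel", "style": "conversation",
     "position": "word-final", "r_realized": True},
    {"word": "butcher", "following": "consonant", "style": "reading",
     "position": "word-final", "r_realized": False},
    # ... further tokens extracted from an annotated corpus
]

def select(tokens, **factors):
    """Return only the tokens whose annotations match every requested factor."""
    return [t for t in tokens
            if all(t.get(key) == value for key, value in factors.items())]

# e.g. all word-final tokens before a consonant, produced in conversation:
subset = select(tokens, position="word-final",
                following="consonant", style="conversation")
print(len(subset), "matching tokens")
```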

3.4  The Observer's Paradox

The use of the words 'natural' or 'naturalistic', or the phrase 'naturally occurring', in regard to speech is problematic. All speech is in some sense 'natural'. Wolfson suggests that the notion of 'natural' speech is 'properly equivalent to that of appropriate speech; as not equivalent to unselfconscious speech' (Wolfson 1976: 189), and that speech collected from native speakers in interviews or laboratory tasks is speech appropriate, and therefore natural, to those contexts. If subjects are selfconscious in such contexts, their speech will reflect this in an appropriate way, for example by being more careful in character, with consequences for phonological analysis. Wolfson's point is that careful speech produced under such conditions is in no way unnatural, suggesting further that the only reliable method of obtaining unselfconscious data is through unobtrusive observation in a diverse range of contexts. Recording speech unobtrusively while remaining within the bounds of ethically sound practice, however, presents its challenges. The difficulties involved in capturing unselfconscious speech led William Labov to coin the term 'observer's paradox', characterizing it in the following way: 'the aim of linguistic research in the community must be to find out how people talk when they are not being systematically observed; yet we can only obtain this data by systematic observation' (Labov 1970: 32). This has important

3  Biber (1993) provides some statistically supported estimates of required corpus size for various target phenomena.


consequences for the goal of collecting unselfconscious speech data, as large observer effects will skew the data, undermining the aim of capturing typical language use. Extreme observer effects are present, for example, in the process of direct transcription, where speakers are intentionally encouraged to focus on their own speech production in the process of orally disambiguating the segmental composition of words and phrases to allow for accurate written representations. A  transcriber’s inability to reproduce a word correctly may elicit from the speaker slow, hyperarticulated tokens of the word, long words being perhaps broken into separate intonation units syllable by syllable. Data collected under these conditions, while quite ‘natural’ given Wolfson’s clarification of the term, is highly atypical in the context of the speaker’s everyday use of language, produced under specific and perhaps infrequently occurring conditions. Subsequent claims made on the basis of phonological analysis of such data may therefore be limited to the specific context in which the data were collected rather than applying to the language as a whole. For this reason it has been crucial to develop techniques in phonological data collection which in some way distract the subject or subjects from the goal of the exercise, reducing their awareness of the fact that every sound they make may eventually be scrutinized in minute detail. Labov’s solution to this dilemma involved, among other things, systematic limitation of interview topics. As summarized by Jack K. Chambers, ‘of the topics used by Labov [who was eliciting narrative monologues], the most successful in making the subjects forget the unnaturalness of the situation were the recollection of street games and of life-threatening situations. Most reliable in eliciting truly casual speech were fortuitous interruptions by family members and friends while the tape recorder was turned on’ (Chambers 1995: 20). Labov found that when speakers narrated stories with which they were emotionally engaged, they would tend to lose the selfconsciousness brought about by the interview setting. (Section 6 will deal in more detail with techniques aimed at obtaining unselfconscious data.) On the other hand, although the obtaining of unselfconscious speech data is an important requirement for corpus-based phonology in general, this is not to say that obtaining selfconscious data is unimportant. Himmelmann (2006: 9), writing in the context of language documentation, identifies ‘observable linguistic behaviour’ as a major target of data collection in the language documentation context, suggesting that it encompasses ‘all kinds of communicative activities in a speech community, from everyday small talk to elaborate rituals’. Thought about in this way, the communicative events selected will depend on the goals of the data collector and the intended use of the resulting corpus. Under the heading of communicative events, however, we must include not only the usual range of events observable in a given speech community, but also the less usual, such as the transcription session exemplified above. To return to that example, the fact that a speaker is able to break single words into separate intonation units under certain conditions offers evidence regarding the relative independence of the intonation system of the language under investigation, suggesting that words in that language do not have a single corresponding prosodic structure, but rather have a number of variants. 
Such evidence would be difficult, if not impossible, to find in more frequently occurring communicative events.

In this way, the most controlled forms of speech elicitation nevertheless result in the collection of 'natural' data (though likely to be selfconscious) of certain value for phonological analysis. This suggests an amendment to Labov's statement quoted above, adapted to the specific aims of data collection for phonological analysis, where the aim must be to collect the speech of individuals 'when they are not being systematically observed', and also when they are being systematically observed. All data is 'naturally occurring', and the availability of data collected under highly controlled conditions on the one hand and relatively uncontrolled conditions on the other is of significant value to much phonological analysis.

In the following sections the data collection continuum provides an organizing principle. Section 5 deals with data collection techniques which are at the highly controlled end of the continuum, where every word spoken is dictated by a script provided by the collector. This section includes reading tasks and elicitation of word lists and paradigms. Section 6 focuses mainly on the use of nonverbal stimuli designed to give the collector a certain amount of control over the resulting spoken text, but at the same time eliciting unscripted responses from subjects. Section 7 then looks at techniques where the collector exerts virtually no control over the resulting text other than by attempting to ensure that it is representative of a certain type of communicative event.

3.5  Highly Controlled Techniques

At this point, it is important to make a distinction between two contrasting situations in which linguists find themselves. On the one hand, linguists may be recording data in a linguistic and cultural context with which they are familiar, and on the other hand they may be working in an alien cultural context, recording a language with which they have far less familiarity. In the first case, the data collector will typically have relatively less trouble implementing elicitation tasks, whereas in the second it may be (1) difficult to establish common ground, and (2) difficult to inspire enough confidence in speakers that you are understanding the information they are providing. It is ideal in this context to find a native speaker collaborator who can organize participants, conduct interviews, and make recordings. This allows you to take a back seat, and inspires confidence in subjects that the information they are providing is being understood and processed in a natural way, allowing them to produce the unselfconscious narrative and dialogue you are aiming for.

3.5.1  Reading Tasks

Read speech is an effective way of restricting primary data to the precise needs of a research question. Many corpora include read speech as a counterpoint to unscripted material. For many research questions it is useful to be able to contrast the behaviour


of target phenomena contextualized in a discourse with 'citation forms'. An example of a corpus which takes this approach is the IViE Corpus,4 produced by a team at the University of Oxford, which was created for the purposes of analysing intonation across nine varieties of English occurring within the British Isles. The read stimuli consisted of short sentences intended to elicit declarative intonation, question intonation, etc., as well as a short passage based on the Cinderella fairy tale. To this was added Map Task data and recordings of 'free conversation'. The obvious advantage of reading tasks is that they guarantee the inclusion of target phenomena previously identified by the researcher. It must be borne in mind, however, that reading is unlikely to produce unselfconscious speech, although the use of 'distractor tasks', such as assembling a simple construction while speaking, can alleviate this to some extent. An obvious limitation is that subjects must be literate, rendering reading tasks useless in work with non-literate individuals and cultures.

3.5.2  Elicitation of Word Lists

In situations where the linguist is not a native speaker of the target language, and is perhaps dealing with an oral culture, the elicitation of word lists, sentences, and paradigms is the rough equivalent of a reading task in terms of the degree of control exerted over the data. Target words and phrases are elicited via a simple translation exercise, with the subject being instructed previously as to the requirements of the task. For example, you may require three repetitions of a word or phrase, or you may provide a carrier phrase designed to meet the needs of a specific research question. As with read speech, this kind of task guarantees the recording of target phenomena, and will provide an example of speech from a specific type of communicative event which will be useful for comparison with unscripted speech.

3.6  The Use of Nonverbal Stimuli

In response to the problem of minimizing observer effects, linguists, including phonologists and phoneticians, have put effort over the years into the development of nonverbal stimuli, such as images, video, slideshows, and animations. Although the design of the majority of these stimuli is not necessarily motivated by the potential use of resultant corpora for phonological analysis, such tools nevertheless provide a reliable method of obtaining relatively unselfconscious data, either narrative or dialogue. They also provide models on the basis of which new stimuli can be designed to meet the needs of specific analysts. In this section, I present a few of the better-known examples of this genre, not claiming it to be an exhaustive review.

4  http://www.phon.ox.ac.uk/files/apps/IViE/


3.6.1  Film: The Pear Story

An early example of the use of nonverbal stimuli to elicit speech is The Pear Story, a film of around six minutes in length designed by Wallace Chafe (Chafe 1980) and a research team based at the University of California in the 1970s to test how a single filmed narrative would be reproduced verbally across different cultures and languages. By showing a man harvesting pears, which are stolen by a boy on a bike, the film was intended to reference 'universal' experiences (harvesting, theft, etc), thereby making it suitable for use in a wide range of cultural contexts. The film contains no dialogue or voiceover. Several scenes were included to elicit particular responses. For example, a scene showing a boy falling off a bike and spilling pears was intended to elicit how languages encode cause and effect. After viewing the film, subjects were interviewed individually by a fellow native speaker of similar social status, who asked them, 'You have just seen a film. But I have not seen it. Can you tell me what happens in the film?' The resulting recorded narratives were around two minutes long. One basic principle behind the design of The Pear Story is common to most nonverbal stimuli used for linguistic elicitation. Stimuli should be adapted to the culture of the speakers of the target language or languages. If the material is to be used in cross-linguistic, typological research, it needs to be intelligible to speakers of all of the languages involved. A second principle is the inclusion of material designed to stimulate responses useful for the exploration of a specific research question. While not specifically targeted at the collection of phonological data, The Pear Story provides a template on the basis of which such specifically targeted stimuli can be produced. By taking some of the design features as a starting point, a film-based stimulus can be created with phonological analysis in mind. In films where the target audience speaks a single language, characters, locations, and props can be chosen according to the phonological composition of their names; events or sequences of events can be selected in order to elicit specific information structure, with resultant predictable consequences for intonation patterns; and so on.

3.6.2  Animated Sequence: Fish Film

A highly successful example of a nonverbal stimulus designed to answer a specific research question, this time in the area of syntax, is the Fish Film (Tomlin 1997). This is a computer animation in which subjects are required to describe an unfolding drama enacted by two fish, one dark and one light-coloured, in real time. In each trial one fish is cued visually by an arrow. At a certain point, one fish eats the other fish. In cases where the cued fish was eaten, subjects tended to use the passive voice, responding with, 'The dark fish was eaten by the light fish'. In cases where the cued fish ate the other fish, subjects used the active voice: 'The light fish ate the dark fish'. The subjects' attention to the cue


influenced the choice of voice and correspondingly the syntactic subject. Tomlin’s results were robust, with 90 per cent of subjects choosing the cued fish as the syntactic subject. Animated sequences and slideshows provide a useful way of collecting unscripted but focused data, particularly when available resources do not permit the making of a film. The elicitation of a real-time description or commentary featured in this exercise is a technique which can be applied in the contexts of other tasks (viewing images, films, television series, etc).

3.6.3  Shared Task: Map Task

Another highly successful stimulus for the production of naturalistic corpora intended for phonological analysis is the HCRC Map Task (http://groups.inf.ed.ac.uk/maptask/), developed by the Human Communication Research Centre at the University of Edinburgh. The Map Task was devised specifically in response to the issues discussed in the first part of this chapter: on the one hand, as a result of the dominance of the experimental paradigm 'much of our knowledge of language is based on scripted materials, despite most language use taking the form of unscripted dialogue with specific communicative goals'; and on the other, samples of naturally occurring speech present 'the problem of context: critical aspects of both linguistic and extralinguistic context may be either unknown or uncontrolled'. Moreover, 'huge corpora may fail to provide sufficient instances to support any strong claims about the phenomenon under study'. The intention of the Map Task developers, therefore, 'was to elicit unscripted dialogues in such a way as to boost the likelihood of occurrence of certain linguistic phenomena, and to control some of the effects of context'. The Map Task website describes the resulting stimulus task in the following way:

The Map Task is a cooperative task involving two participants. The two speakers sit opposite one another and each has a map which the other cannot see. One speaker—designated the Instruction Giver—has a route marked on her map; the other speaker—the Instruction Follower—has no route. The speakers are told that their goal is to reproduce the Instruction Giver's route on the Instruction Follower's map. The maps are not identical and the speakers are told this explicitly at the beginning of their first session. It is, however, up to them to discover how the two maps differ.

The maps consist of named graphically depicted landmarks. Discrepancies between the landmarks on the two maps are of three types: in one type, a given landmark is present on only one of the maps; a second type involves two identically drawn and positioned landmarks having different names; and in the third type, a landmark appears twice on the Instruction Giver's map, and only once on the Follower's map. The differences between the two maps make the negotiation process more complex, thus tending to produce longer, more animated dialogues than would be the case if the maps were identical. The designers of the maps had control over the names of landmarks, which were guaranteed to occur frequently throughout the corpus, and could therefore tailor the outcome to some extent to meet their research needs. For example, a name like vast

meadows would provide evidence about t-deletion, or chestnut tree would be ideal for the measurement of glottalization, and so on. Unlike Chafe's film, The Pear Story, the Map Task was not originally designed to be used in a wide variety of cultural contexts. For instance, in its original form it could not be used with non-literate societies. However, the task does lend itself to adaptation for such contexts. An adaptation of the Map Task was used recently with Iwaidja speakers in Northwestern Arnhem Land, Australia (Birch and Edmonds-Wathen 2011). Although the basic principle of the task was retained, a map was created which did not rely on literate subjects. Landmarks were based on commonly occurring, easily recognizable objects such as creeks, pandanus trees, magpie geese, oysters, and so on. Participants were shown pictures of the landmarks beforehand and asked to call their names in order to record 'citation forms'. Instead of being placed face to face across a barrier, participants were placed side by side with a barrier between them, a more culturally appropriate seating configuration. Because one of the intended uses of the resulting data was acoustic analysis of vowel formants in relation to metrical structure, accent, and focus, landmarks with names containing low central vowels were chosen, since these vowels provide the most robust F0, formant, and intensity traces.
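The same logic can be applied when designing new maps: candidate landmark names can be screened in advance for the phonological contexts a study needs. The sketch below is purely illustrative; the candidate list and the crude orthographic test for a word-final /t/ before a consonant are invented for the example, and a real project would consult a phonemic lexicon rather than spelling.

```python
# Hypothetical helper for stimulus design: keep candidate landmark names that
# place a word-final /t/ immediately before a consonant-initial word (a likely
# t-deletion site). The word list and the orthographic test are illustrative.

VOWEL_LETTERS = set("aeiou")

def t_deletion_site(name: str) -> bool:
    words = name.lower().split()
    return any(first.endswith("t") and second[0] not in VOWEL_LETTERS
               for first, second in zip(words, words[1:]))

candidates = ["vast meadows", "chestnut tree", "burnt forest",
              "white cottage", "green bay", "old mill"]
print([name for name in candidates if t_deletion_site(name)])
# -> ['vast meadows', 'chestnut tree', 'burnt forest']
```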

3.6.4  Games and Other Tasks: MPI Nijmegen Language and Cognition Group

The Language and Cognition (L & C) Group based at the Max Planck Institute for Psycholinguistics at Nijmegen in The Netherlands under the direction of Stephen Levinson has nearly twenty years' experience in the production of elicitation tools aimed at exploring the cognitive infrastructure of human language. Over the last two decades they have produced an array of nonverbal elicitation tools aimed at exploring cross-linguistic semantic domains such as sound, time and space, emotion, and colour. They have also addressed areas such as the acquisition of social cognition by infants, and 'event packaging' across languages. The materials include films, card games, construction tasks, image sets, and much more, and are publicly available on agreement to conditions of use. As with the other stimuli discussed in this section, the L & C tools can be used in their current form to stimulate narrative and dialogue for phonological analysis, or may be used as inspiration for the development of new tools designed with phonological research questions in mind.

3.7  Collecting 'Uncontrolled' Data

At the uncontrolled end of the data collection continuum are simply recordings of people talking—about anything. Although such recordings have the disadvantage, discussed above, that they may well not contain enough instances of desired target


phenomena to be useful, they do have the advantage that, if data are captured in such a way as to eliminate or at least minimize observer effects, users of the resulting corpus may be confident that they are accessing unselfconscious speech in one of a range of communicative events engaged in by speakers in the course of their everyday language use. In both familiar and unfamiliar cultural contexts, where the intention is to record in some sense a representative sample of a language, it may be necessary to do some preparatory research in order to establish a typology, or partial typology, of communicative events, speech registers, etc. Hymes' S-P-E-A-K-I-N-G heuristic (setting/scene, participants, ends, act sequence, key, instrumentalities, norms, genre) provides a good starting point for this (Hymes 1974).

3.7.1  'Ready-Made' Corpora

At this end of the data collection spectrum, the collection of linguistic data for the purposes of phonological analysis is essentially indistinguishable from the collection of linguistic data for grammatical, semantic, pragmatic, or syntactic analysis, or even for nonlinguistic analysis. This falls out from the fact that the 'representative sample' the collector is aiming to obtain is not modified in any way by consideration of the research questions for which it is intended to provide raw data. For this reason, data which is collected with no future linguistic analysis in mind—for example, the archived recordings of a radio or television network—may do just as well as recordings made by a linguist for a language documentation project. They may even be better in that the people doing the recording are highly experienced in the field and will therefore typically make high quality recordings. This is undoubtedly the easiest form of data collection! A fine example of phonological research using a ready-made corpus is the excellent study by Shattuck-Hufnagel et al. (1994), who made use of a Massachusetts radio speech corpus for the exploration of variation in pitch-accent placement within words containing more than one metrical foot.

3.7.2  The Use of Video You have Recorded

Recording an event (a ceremony or other social occasion with a complex structure is a good choice, but there are many options) of importance or relevance to the speakers you are targeting is an excellent way of obtaining flowing conversation and narrative. Having shown the film to your speakers, you can follow the Pear Film procedure by having a native speaker who has not seen the film ask people to narrate the story. Alternatively, you can prepare interview questions regarding the content of the film, then conduct interviews (either in person or preferably via a native speaker collaborator) with one or more subjects who have knowledge about the content. A further technique is to record a

real-time commentary from different speakers, asking subjects to describe the action of the film as it unfolds.

3.7.3  The Use of Archival Visual Material

Asking people to view photographs or films of relevance to them, such as family photographs, photos, or film of places or events familiar to them, but perhaps which they have not seen for some time, is a reliable way of eliciting relatively unselfconscious speech, as subjects lose themselves in reliving places and events from their past. People will enjoy sharing with each other experiences they have in common, and will frequently have different interpretations of events which may result in animated discussions.

3.7.4  Interviews Without Nonverbal Stimuli

Following some preparatory research along the lines suggested in the introduction to this section, you may choose some topics for interviews or discussions to be conducted without the use of nonverbal stimuli. The topic clearly needs to be complex enough to stimulate discussion or narrative, and needs to be something subjects (a) like talking about and (b) are comfortable talking about in public. One of the areas of greatest complexity in the languages of Northwestern Arnhem Land in Northern Australia, for example—and also an area which I personally found people liked talking about—was kinship and kinship terminology. As I was working for a language documentation project at the time, I was interested in actually understanding the system and in collecting terms for entry into a dictionary. I found that getting two speakers talking together based on questions such as 'How is A related to B?'; 'Why does A call B [insert kin term]?'; or 'Why is A promised to B as a marriage partner?' was a reliable way to elicit long and convoluted conversations and explanations (which then took weeks to transcribe and translate).

3.7.5  Partnering Other Data Collection Activities

In the context where I have worked most (indigenous languages of Australia) I have found that collaborating with researchers in other fields has been an effective way to gather high-quality unscripted data. Most recently, for example, I have been working with researchers interested in documenting indigenous ecological knowledge both on land and at sea. Bringing on board an indigenous collaborator to act as translator and interpreter in interview situations, we have collected a large amount of narrative and dialogue. The fact that the researchers involved in the process are specialists in their field (marine biologists, ecologists, ethnobotanists, etc) inspires confidence in speakers


that they are being understood when discussing in depth specialized areas of knowledge which the average linguist cannot be expected to know about.

3.7.6  Ethical Considerations

Remembering Labov's finding (referred to above) that truly casual speech data was collected as a result of fortuitous interruptions by family members while the tape recorder was switched on, ethical considerations require that care be taken in this and similar situations to protect the interests of the people who have generously given their time to assist you. The collector may intend to reduce or even eliminate speakers' awareness of the fact that they are being recorded, yet every word they say will potentially be analysed in great detail; if in the process they say things which they don't really intend others to hear, it is only fair to make them aware of this possibility. For example, although recording subjects gossiping about other members of their own community 'behind their backs' may not be an obvious issue if the resultant text is used purely by researchers in an academic context, it certainly would be problematic if the recording or transcription reached the ears of the people who were the object of the gossip. In cases where potentially contentious material has been recorded, a good strategy is to replay the recording to the speaker or speakers involved, seeking their approval or otherwise to use the data.

3.8  Equipment and Recording Technique

3.8.1  Before You Start . . .

. . . you need to decide a few things. First, do you require video images as well as audio, or will a good audio recording satisfy the needs of the project? Assuming that the main focus of your project is to capture spoken language, a good audio record is clearly your first priority. However, an accompanying video record has several advantages. Video may help you disambiguate, for example, what speakers are talking about when they point to something, or refer to someone, say a child, who is in the room but is silent and therefore unmonitored by your audio device. Video will also display gestures which may be of interest for the analysis of prosody—for example, where the nod of a head may coincide with an accented syllable in a word. Sometimes simply being able to look at a speaker's mouth may help disambiguate speech sounds, especially in cases where transcribers have shaky knowledge of the dialect or language they are transcribing. If you decide to use video, however, be aware that you will perhaps need more personnel for your data collection activities than would be the case for an audio-only documentation. Managing a camera, an external microphone, and, in the case of an interview situation, the direction and development of the content will typically be too demanding for a single person. That said, in situations where people are stationary, and

are performing a task not requiring the intervention of the data collector, cameras and mikes can be set up such that minimal adjustment is required during a session, allowing the operator to simply monitor video and sound to ensure that both are functioning as desired. Be aware that if you choose to include video in your documentation, you will not be relying on the inbuilt microphone of the camera. Inbuilt mikes are intended for amateur or home use only, in situations where the picture, for example, recording the antics of the family pet, is more important than the sound—not your situation. Moreover, inbuilt mikes will pick up the sound of your hands moving about the camera body, and are also susceptible to even low wind conditions when recording outdoors.

3.8.2  What Format are You Archiving and Annotating?

You will need to think about, or get advice about, the required archive formats for the recordings you are making. Various video and audio formats are available, differing in quality and in terms of the amount of disk space they require. The generally accepted appropriate archive format for audio is uncompressed WAV, also referred to as 'linear PCM' (Pulse Code Modulation). Using a compressed format makes no sense in language documentation or data collection, as it results in loss of data and is designed for quick download and space-saving on disks. Video is a different matter. Video files take up far more disk space than audio files, and if they are secondary to your goal, you may decide to use a compressed format such as MPEG2 or MPEG4, which will reduce uncompressed video to a fraction of its original size while presenting it at a standard equivalent to the average commercially available DVD. On the other hand, the increasing affordability of large quantities of disk space may make it feasible for you to archive full-quality video. A further consideration will be ensuring the formats you archive are compatible with annotation and analysis tools such as ELAN5 or Praat.6 Having determined the amount of disk space available to you, and having researched which formats will be friendly to the programs you have selected, you will need to choose your equipment accordingly. Not every video camera or audio device will output the formats you need. Ensuring that your equipment is compatible with your required archive and annotation formats will save time and bother down the track when you discover, for instance, that all your files need to be converted before they can be used.
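Since uncompressed audio has a fixed data rate, the disk space it requires is easy to estimate in advance. The following back-of-the-envelope calculation is only illustrative: the sampling rate, bit depth, and card size are example values, not recommendations.

```python
# Back-of-the-envelope storage estimate for uncompressed (linear PCM) WAV.
# All values are illustrative; substitute your own recording parameters.

sample_rate_hz = 48_000          # samples per second
bit_depth = 16                   # bits per sample
channels = 2                     # stereo

bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels   # 192,000 B/s
megabytes_per_hour = bytes_per_second * 3600 / 1_000_000          # ~691 MB per hour

card_capacity_gb = 32            # e.g. one SD card
audio_hours_per_card = card_capacity_gb * 1000 / megabytes_per_hour

print(f"{megabytes_per_hour:.0f} MB per hour of uncompressed audio")
print(f"roughly {audio_hours_per_card:.0f} hours of WAV audio per {card_capacity_gb} GB card")
# Video archived alongside the audio will dominate the total; compressed
# MPEG2/MPEG4 video typically needs several GB per hour on top of this.
```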

3.8.3  Where are You Recording?

Having obtained high-quality equipment and set up your mikes appropriately, etc, you may be surprised to find that you are not achieving the sound quality you need. This

5  http://www.lat-mpi.eu/tools/elan/
6  http://www.fon.hum.uva.nl/praat/


may be because the acoustics of the space you are using for the recording are impacting adversely on the final product. Hard interior surfaces, or simply the shape or dimensions of a built environment, reflect and distort sound waves in ways that the human brain typically factors out, meaning that what sounds fine to you when having a conversation in a room may sound harsh and unpleasant when you monitor it through headphones or listen back to your recording. This means you will need to source a good room, and a good position within that room, by doing a trial recording before you begin using it as a data collection location proper. Always use headphones to monitor the sound: you cannot trust your ears in these situations. Recording outdoors typically avoids problems associated with room recording, although in the case of (for example) a naturally occurring rock configuration in the vicinity, you may have similar issues. Your worst enemy outdoors, however, is wind. Even a slight wind will cause distortion on unprotected mikes, and a simple foam pop-cover will not help much either. Writing as someone who mostly records outdoors, I find a professional quality windshield (such that you will have seen in use by media crews, film crews, etc) indispensable. I’ve used them in strong winds with excellent results. Whether indoors or outdoors, unless you are in a soundproof booth, your recordings can be compromised by extraneous sound such as traffic noise, the activities of people or animals in the vicinity, and household equipment such as fans, refrigerators, and air conditioners. It will be necessary to eliminate such noise as far as possible before pressing the record button. Switching off ceiling fans or air-conditioning units in hot climates may be counterintuitive (and uncomfortable), but it will make a huge difference to the quality of your recording.

3.8.4  Audio Recorders

There is now a fairly wide range of digital audio recording devices on the market, referred to as 'field recorders', or more technically 'Linear PCM recorders'. These devices have the capacity to record sound in uncompressed WAV format, the format which will produce the best results for you. Avoid recording in compressed formats such as the popular mp3 format. Compressed formats result in loss of data. The resultant files are much smaller than their uncompressed counterparts, but the quality will never compare favourably. Typically, these recorders have built-in stereo mikes. You may use these if necessary (they are typically of better quality than built-in mikes on video camcorders), but using a good-quality external microphone will produce better results. As was mentioned above in relation to video cameras, built-in mikes on field recorders will pick up the movement of your hands on the device. In addition, the use of an external mike (or mikes) allows the operator freedom of placement, or movement, during the recording session, while continuing to monitor the sound. As most good-quality microphones have an XLR connection, it is advisable to purchase an audio device which correspondingly has XLR sockets. Most of the cheaper field recorders have mini-jack sockets, rather than XLR sockets, and while there is no problem obtaining a cable to connect the mike to the

recorder, the mini-jacks themselves, and the sockets in the recorders, are often not strong enough for your needs, and may become noisy and unreliable after relatively little use. Your field recorder should have an SD (Secure Digital) Card slot, allowing you to record hours of audio before copying files from the card to a hard drive and erasing the card ready for the next recording session.

3.8.5  Video Camcorders

Video camcorders today are capable of directly outputting the files you need for archiving and annotation purposes, thus delivering you from the process of file conversion down the track. Whereas DV tapes required capturing and conversion, today's cameras record ready-to-archive files on SD Cards which are also ready to open in editing and annotation programs. There is now a bewildering array of cameras on the market, and a huge range of variation in terms of price, quality, and intended use. Your choice, as always, will be determined by your budget and your needs, and the compatibility of the camera with your computer, your archiving format, and your annotation and analysis programs. For example, a data collector who is archiving MPEG2 video and uncompressed 16-bit 48 kHz WAV and who uses a MacBook Pro laptop computer in the field may choose to use a camera such as the JVC GY-HM100, which records uncompressed WAV audio and Mac-friendly QuickTime-readable MPEG2 video files onto two standard (Class 6) SD Cards, currently available up to 32 GB each, allowing a total of around four hours of uninterrupted recording time. As soon as the recording is finished, access to it will be relatively seamless. The SD cards can be inserted into a slot in the computer and the video can be read instantly by the computer's software. If desired, the video can be edited immediately on the SD card, but copying the files onto an external hard drive first is good practice. In fact, ideally, a back-up copy will be stored on a second hard drive immediately, before reinserting the SD Cards into the camera and erasing the data in preparation for the next recording. An uncompressed WAV file, which was encoded by the camera along with the video, can now be exported using QuickTime on the Mac. Both the audio and video files are instantly ready for use by the annotation program ELAN, and can be edited if desired using Final Cut Pro. The video files are easily converted to smaller formats if required. Choosing the wrong camera will affect your workflow adversely, so it is definitely worth investing some time into research before you buy. Ensuring a good match between camera, computer, software, and archiving needs will lay the basis for an efficient workflow. Because you are interested in obtaining high-quality audio, you will be using a good-quality external microphone in conjunction with your camera. As was mentioned above in relation to audio devices, cheaper cameras have mini-jack rather than XLR connections, and while there is no problem obtaining a cable to connect the mike to the camera, a camera with stereo XLR sockets built in, or with the capacity to connect to an external XLR socket unit, is preferable. In general you should use a tripod for your camera where possible, as it will result in a steadier picture, and you should also try to avoid gratuitous zooming and panning as


it will distract the viewer. Obviously, you will practise using your eventual set-up before you begin data collection, and if you are inexperienced, you will obtain advice or training from experts at the outset.

3.8.6  Microphones

Good microphones are essential ingredients in linguistic data collection, especially when phonetic and phonological analyses are to be the principal uses of the corpus you are creating. Mikes vary in a number of ways. First, there is directionality. Unidirectional mikes are designed to minimize sensitivity to sound other than that coming from the direction in which they are pointed. For example, when pointed at a subject in a room full of people talking, the subject's voice will be privileged over all other voices in the room. Conversely, an omnidirectional mike would pick up more of the room noise, privileging those speakers in closest proximity. In general, unidirectional microphones will be more suitable for your purposes, although if recording a group conversation, for example, you may choose an omnidirectional microphone. The microphone should be placed as close as possible to the subject's head, without causing discomfort. If the mike is placed too close, you will be picking up air release (by plosives, for example). Ideally, the mike will be mounted on a stand or boom at an optimal distance from the speaker's head. You will be monitoring the sound as you move the mike into position so that you can find the optimal position. A good alternative is to use lapel mikes. These come in both wired and wireless designs, the wireless type being referred to as 'radio' mikes. These have the advantage of being less intrusive than their wired counterparts when the aim is to elicit unselfconscious speech, as the wearer will readily forget that they are wearing the mike. A disadvantage of both types of lapel microphone is that they will sometimes pick up the sound of the wearer adjusting their sitting or standing position, or accidentally making hand contact with the mike, or the clothing to which it is attached. This is especially the case with wired lapel mikes. There are sometimes compatibility issues between microphones and recording devices and cameras. It is therefore a good idea to seek advice on which microphones best suit your recording device or camera of choice before making a purchase.

3.8.7  A Note on Metadata

Recording metadata about your recordings is crucial, as with the passage of time, important contextual knowledge about the recordings risks being lost forever. Best practice is to record metadata for each recording more or less as soon as the recording is made. What you choose to record will partly depend on the nature of your project. However, details about the speaker, the recorder, the equipment used, the content of the recording, and the location should all be part of basic metadata.
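By way of illustration, a minimal per-session record covering the details just listed might look like the sketch below. The field names and values are invented for the example; real projects should follow an established metadata schema and controlled vocabularies.

```python
# A minimal, hypothetical metadata record for one recording session.
# Field names and values are invented for illustration only.

session_metadata = {
    "recording_id": "FIELD-2012-07-14-03",
    "date": "2012-07-14",
    "location": "community hall, field site",
    "speakers": [
        {"code": "SP01", "sex": "F", "year_of_birth": 1957, "l1": "target language"},
        {"code": "SP02", "sex": "M", "year_of_birth": 1964, "l1": "target language"},
    ],
    "recordist": "collector's name",
    "equipment": {
        "recorder": "linear PCM field recorder",
        "microphones": "two wired lapel mikes",
        "video": False,
    },
    "communicative_event": "adapted map task (Instruction Giver / Follower)",
    "content_summary": "route negotiation over non-literate map; landmark names elicited first",
    "consent_and_access": "recorded with informed consent; open for research use",
}
```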

CHAPTER 4

CORPUS ANNOTATION
Methodology and Transcription Systems

Elisabeth Delais-Roussarie and Brechtje Post

4.1  Introduction and Background

In the last thirty years, corpus-based approaches have become very important in language research, notably in syntax, morphology, phonology, and psycholinguistics, as well as in language acquisition (Rose, this volume and Gut, this volume) and in Natural Language Processing (e.g. tools development, corpus-based Text to Speech). The use of corpora opens new perspectives. First, it provides new insights into the way in which linguistic events can be captured and defined. Some aspects that have long gone unnoticed have been observed and described by looking at different types of data in large corpora (see e.g. Durand 2009 for various concrete examples in several domains of linguistics). Secondly, it offers new perspectives on the way in which structures and processes can be accounted for, e.g. through statistical modelling as proposed for phonological phenomena in probabilistic phonology (see e.g. Pierrehumbert 2003a) or in stochastic OT (e.g. Boersma and Hayes 2001).

Two major factors have allowed corpus-based approaches to become prominent in recent decades: the increase of available data and the development of new tools. The number of available corpora for language research has increased exponentially in recent decades. Here we consider as corpora any sets of machine-readable spoken or written data in electronic format which are associated with sufficient information/documentation to allow the data sets to be reused in different studies.1 Two distinct types of information need to be provided:

• metadata which specify the content of the corpus, the identity of the speakers, the ways the data have been collected, etc. (see e.g. Durand, Laks and Lyche, this



volume). The information provided in the metadata is crucial in order to evaluate the representativity and the homogeneity of the data. Moreover, it should be sufficiently specific to allow any studies that are based on the corpus to be replicated or compared with other corpus-based studies.
• metadata that describe the data format used (see Romary and Witt, this volume), the types of annotation proposed (Part of Speech (PoS) tagging, syntactic parsing, etc.) and the symbols and linguistic categories used in the annotation procedure, etc.

The second factor that explains the increase of corpus-based research in the discipline is the development of tools that facilitate the analysis and the annotation of large amounts of language data (e.g. taggers and parsers for PoS tagging) (see e.g. Valli and Véronis 1999 and 2000; Véronis 1998 and 2000), automatic speech recognition (ASR) for segmental annotation and alignment (see Cucchiarini and Strik, this volume), and various custom-built database tools designed to analyse large datasets (see e.g. Phon and Phonbank (Rose and MacWhinney, this volume), CLAN and Emu (John and Bombien, this volume)).

The task of annotating can be seen as consisting of assigning a label to an element or an interval in the data, where the label marks a specific event in the text or speech signal. If the event is linguistic in nature, the label normally represents a linguistic unit of some sort, such as a word, a phoneme or segment, a syntactic phrase, a prosodic phrase, a focused element, or a tone.2 However, labels can also represent communicatively meaningful nonlinguistic events occurring in speakers' productions, like pauses, hesitations, interruptions, and changes in loudness and tempo. Deciding what to annotate depends on the purposes of the annotation, and will typically be constrained by the needs of the users, the size of the corpus, and the tools and the manpower that are available to create them.

In this chapter we will focus on annotations that provide information about the phonological/phonetic dimension of the speech sample or dataset (i.e. annotation types that are relevant for corpus-based research in phonetics and phonology). As with other types of annotation, phonetic/phonological annotation can be seen as the assignment of a label to a specific unit in the data. In the case of speech data, however, several aspects have to be taken into consideration in the segmentation of the speech signal and the assignment of labels. In section 2 we will discuss some of the basic issues that arise in the transcription and annotation of speech. Sections 3 and 4 will be concerned with the encoding of segmental and suprasegmental information, respectively.

1  This definition echoes that of Gibbon et al. (1997) for speech and oral data: 'A corpus is any collection of speech recordings which is accessible in computer readable form and which comes with annotation and documentation sufficient to allow re-use of the data in-house, or by people in other organisations' (Gibbon et al. 1997).
2  This conception of annotation prevails in the formal representation proposed by Bird and Liberman (2001) under the name 'annotation graph'. Any annotation type can in fact be represented by an acyclic graph.


4.2  Basic Issues: Abstraction, Segmentation, and Discretization

Unlike written data, which are already an abstract linguistic representation of language, audio data can be captured either in the form of various types of representation that are physically derived directly from the speech signal or as a representation that abstracts information from the speech signal by taking into account the content that is transmitted. Examples of physically derived representations are the digital audio file itself, in a format like .wav, or the speech pressure wave, a spectrogram, a fundamental frequency (F0) trace, a smoothed version of the F0 trace in which microprosodic effects that are not relevant for perception are removed from the representation (e.g. Prosogram, see Mertens 2004a; 2004b; and section 4.2.2 below). Like the speech signal itself, these representations are continuous in nature. Examples of symbolic representations are phonemic transcriptions (e.g. in IPA, see International Phonetic Association 1999; and section 3.2.2 below) and intonational transcriptions (e.g. in ToBI, see Silverman et al. 1992, Beckman et al. 2005, and section 4.3.1), but also orthographic transcriptions (in which case the audio data are effectively transformed into written data). Such transcriptions are discrete and symbolic in nature. Hence, before starting the annotation, the transcriber has to decide for what purpose the data are being transcribed (or even whether a transcription is indeed the most efficient way of achieving the objectives), and what requirements this imposes on the annotator and the annotations and tools used. Annotations serve the purposes of making the data searchable, quantifiable for specific research purposes, and comparable with other datasets. In other words, they improve accessibility to the data for a large community of users.

In any case, the annotation will consist of assigning a label to a portion of the speech signal, where the label provides information in relation to the unit in a conventionalized way (the part of speech for a word, its function and/or category for a syntactic phrase, etc.). The convention will depend on theoretical and methodological considerations, and the objectives of the annotation will determine the choice of segmentation process and the choice of labels. When the annotation is abstract, symbolic, and discrete, the speech chain is analysed as a linear sequence of units or intervals to which labels are assigned. Assigning a label to an interval presupposes two sub-tasks, illustrated in the sketch below:

• A segmentation of the speech signal or text into units to which labels can be assigned. These units may take different forms depending on the type of data. In written data, they may consist of syllables, words, sequences of words, etc.; in speech data, they will be either an interval or a point in time.
• The definition of the labels or sets of symbols.
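The following sketch makes the two sub-tasks concrete. It is a simplified, hypothetical data structure, loosely analogous to the interval tiers of tools such as Praat or ELAN; all times and labels are invented and it is not the format of any particular tool.

```python
# A sketch of a discrete, symbolic annotation: a tier is a sequence of labelled
# intervals over the time axis of a recording. Segmentation (interval
# boundaries) and the label set are defined separately, as described above.

from dataclasses import dataclass

@dataclass
class Interval:
    start: float   # onset of the unit in the signal, in seconds (segmentation)
    end: float     # offset of the unit
    label: str     # symbol drawn from a predefined label set

# One tier per type of unit; tiers run in parallel over the same signal.
word_tier = [
    Interval(0.00, 0.31, "the"),
    Interval(0.31, 0.78, "butcher"),
    Interval(0.78, 1.12, "shop"),
]
phone_tier = [
    Interval(0.00, 0.08, "ð"), Interval(0.08, 0.31, "ə"),
    Interval(0.31, 0.40, "b"),  # ... and so on
]

# Point-type events (e.g. an intonational target) need only a time and a label.
tonal_events = [(0.55, "H*")]
```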


In some cases the annotation task appears to be quite trivial, as the labels and units are almost given by the data/linguistic form. Consider, for instance, annotation in parts of speech (PoS). Usually, the labels correspond to the PoS categories that are commonly agreed on (verb, noun, etc.), while the segments of speech or text that are being labelled correspond to the words or lexical entries present in the text. In other cases, segmentation and choice of labels are less straightforward. An example is annotation of syntactic categories and phrases, where segmentation and labelling are typically carried out relative to a specific theoretical paradigm, and the specifics of the theory need to be understood for the labelling to be carried out or interpreted. In any case, regardless of how theory-dependent the annotations are, they will provide an array of information that can be used to formulate the queries that are essential for research to be carried out with the corpus data.

For the annotation of speech data, the segmentation task is further complicated by the fact that the audio signal contains a wide array of simultaneously occurring information of a linguistic and nonlinguistic nature, covering more localized events like an individual sound segment in a word (generally referred to as segmental events) as well as more global events which extend over longer stretches of speech like intonation contours (generally referred to as suprasegmental or prosodic events). For example, in (1) the sound sample corresponds to a conversation between two speakers, who are sometimes speaking simultaneously (Figure 4.1).3 Several events occur in parallel in the signal:

• the sounds ([œ] and [e]) are produced at the same time, since the two speakers overlap;
• the realization of the sound [e] coincides with an overlap and a hesitation realized by the other speaker;
• prosodically, the segment [ɑ̃] at the end of évidemment coincides with the end of an F0 rise. Note however that the rise is realized over the entire syllable /mɑ̃/ and not just on the segment.

(1)  A: évidemment ça peut pas se faire à pied quoi euh
     B: et c'est loin du centre-ville

Different annotation systems take different approaches to segmentation. The events that are being annotated may be considered from a formal point of view, representing their acoustic or auditory properties, or from the point of view of their function in language. For any annotation to be successful, these properties and functions need to be explicitly identified and disentangled, at least to the extent that they are directly relevant

3  The audio data associated with the examples treated in the chapter can be downloaded from the following website: http://makino.linguist.univ-paris-diderot.fr/transcriptionpho (Last accessed 16/09/2013)

[Figure 4.1 here: waveform and F0 trace (Pitch in Hz against Time in s) aligned with orthographic and segmental transcription tiers for the two overlapping speakers in (1).]

FIGURE 4.1  Representation of the signal for (1). Corpus ACSYNT-COAE 8.

to the objectives that are to be met by the annotation. This will have consequences for how the speech continuum is discretized in order to enable a labelling of the units. Annotation systems vary in the degree to which they propose clear guidelines on what to encode, and how to proceed, but for oral corpora a standard has been proposed which specifies the events that have to be encoded such as speaker turns, overlaps, and background noise (see EAGLES 1996). The annotation of this type of information falls outside the scope of this chapter, so it will not be discussed here. The focus here is on the annotation of phonetic and phonological information in the speech signal,4 where we distinguish between segmental and suprasegmental phenomena, reflecting current practice in the field. Segmental information in speech corpora tends to be annotated—at least implicitly—at an abstract phonological level, while suprasegmental (or prosodic) transcription systems analyse the signal at phonological as well as auditory/acoustic levels.

In the hypothetical case in which the segmental transcription is carried out directly from the audio signal (i.e. regardless of the signal's linguistic content), segmentation and assigning labels would have to be done independently of the language being transcribed. Such an undertaking is probably impossible at the segmental level, since for any one label, the interpretation of the relevant properties of the signal (e.g. formant structure for vowels) as representing that label rather than another crucially depends on how an individual language exploits those properties in cueing different sound segments (e.g. for vowels, one language's /i:/ may be another language's /ɪ/). Another

Corpus Annotation  

51

example is sequences which contain sounds with double articulation such as [aΙ] and [ɛj] in French paille (straw) or soleil (sun), and which would generally be transcribed as [paj] (or [pɑj]) and [solɛj], respectively, rather than [paI] or [solɛj]. This reflects the fact that double articulation vowels and diphthongs do not contrast with other types of vowel in French, and are therefore not considered primitives in the French phonemic inventory. Just like context-dependent and predictable phonetic variation in the production of the phonemes, double articulation would only be transcribed when a narrow phonetic transcription is required (as opposed to what is termed a broad phonemic transcription). Similarly, what is transcribed as one phonological category, with its own label, in one language may be more appropriately considered as a sequence of two labels in another (e.g. affricates like /ts/ that could also be sequences of /t/ and /s/). By contrast, segmentation and assigning labels without phonological analysis is conceivable at the suprasegmental level to some extent, at least in the transcription of durational properties or melodic variation. For the transcription procedure to be crosslinguistically valid, segmentation would have to occur at the syllabic level, since this is the only unit of analysis that can reliably be claimed to have universal value (see e.g. Segui 1984; note that language-specific knowledge of syllabification will need to be referred to). The assignment of labels representing suprasegmental information cued by duration and melodic variation could be based either on local variation or on more general phonetic and psycho-acoustic information. That is, for the encoding of durational information, the speed of articulation and the internal composition of syllables may be taken into account (e.g. when automatically evaluating the degree of lengthening of a syllable in the speech stream). For melodic variation, information such as thresholds of glissando can be exploited. Note that this information is actually used in a number of stylization tools, most notably the Prosogram (Mertens 2004a; 2004b; section 4.2.2 below). In the following section, transcription and annotation at the segmental level are seen as consisting of the assignment of labels to grammatical units that range from segments to lexical entries. Hence, the segmentation task and the selection of the labels presuppose phonological and lexical knowledge and cannot be achieved language-independently. This has two consequences that will be discussed further: • It is difficult to transcribe data produced in a language that one does not master, as well as in a dialect or variety of the language that is poorly understood (e.g. because it has never been analysed phonologically, or because the system underlying the variety is in transition, as in bilingual speech or L1 and L2 acquisition data). • In any transcription, the accuracy and the granularity depend on the targeted representation level and on the procedure used, rather than on the underlying phonological assumptions: segments and intervals are usually the same, and they are defined according to the phonology of the language to transcribe. In section 4 it will be shown that exactly the same issues arise for abstract phonological annotations of suprasegmental information. In each section, the issues will be

52   Elisabeth Delais-Roussarie and Brechtje Post discussed in further detail in the context of the systems and methodologies that are used to annotate the speech signal for phonological (and phonetic) information. It will allow us to show how different systems take different approaches to address these issues.

4.3  Encoding Segmental Phonetic and Phonological Information

Systems that are commonly used to represent (or annotate) the segmental dimension of speech are presented here. In 3.1, a detailed description of the various levels of representation will be given. Section 3.2 will be devoted to a presentation of major systems, including an evaluation of their advantages and limitations.

4.3.1  Transcription Procedures and Levels of Representation

To represent the segmental dimension of the message, different levels of representation have been proposed for different types of transcription (see e.g. Gibbon et al. 1997). Two points are essential in the definition of the levels: (i) the degree of granularity or detail observed in the transcription, and (ii) the way in which it abstracts away from the reality of the speech signal.

4.3.1.1  Approaching Segmental Transcription at a Linguistic/Phonological Level

Among the six different levels of annotation proposed in the EAGLES Handbook on Standards and Resources for Spoken Language Systems (see Gibbon et al. 1997), three depend on taking into account the conceptual dimension of the message, i.e. its meaning. These levels of transcription will primarily give access to the words or lexical units that compose the message.

4.3.1.1.1  Transcription from Script

At this level, the transcription aligned with the speech signal consists of replicating the scripts or texts that were given to the speaker during the recording session. This level is thus only possible for read scripted speech (lists of words or sentences, text reading tasks, etc.) as in the PFC corpus (see Durand et al., this volume), and some of the IViE corpus (see Nolan and Post, this volume). In this case the transcription is not always accurate, as it does not necessarily represent what the speaker actually said. Repetitions, hesitations, and false starts are not encoded at all, as is shown in (2). Example (2b) provides an orthographic transcription from the script, while (2a) represents orthographically what was actually said by the speaker.


(2)  Extract from the PAC Reading task (Californian speaker)
a. Sequence produced by a speaker from California5
If television evangelists are anything like the rest of us, all they really want to do in Christmas week is snap at their families, criticize their friends and make their neighbours’ children cry by glaring at them over the garden fence. Yet society expects them to be as jovial and beaming as they are for the other fifty-one weeks of the year. If anything more. . . more so.
b. Transcription from the Reading script
If television evangelists are anything like the rest of us, all they really want to do in Christmas week is snap at their families, criticize their friends and make their neighbours’ children cry by glaring at them over the garden fence. Yet society expects them to be as jovial and beaming as they are for the other fifty-one weeks of the year. If anything more so.

4.3.1.1.2  Orthographic Transcription

At this level, the transcription consists of encoding in standard orthographic form the linguistic units that are part of the message. This level of transcription tends to be used for transcribing or annotating large amounts of speech data. Note that this type of transcription is recommended by the TEI or CES for oral corpora (see e.g. EAGLES 1996; Gibbon et al. 1997). Orthographic transcriptions allow for all the words produced by the speaker to be represented, even in cases of false starts, hesitations, and incomplete sentences or words. As a consequence, the transcriptions obtained are more accurate than any transcription based on the recording script alone. Moreover, orthography may be used to represent a wide range of data types (read scripted speech, spontaneous speech, etc.).

(3)  Extract from a formal conversation between a researcher and a female speaker (Lancashire, PAC data)6
er, I loved school when, and I loved primary school best and, and when I started school and er I went to school in Essex and er (silence) I always lo/ I had really good teachers, I was really lucky and because my sister had a few like dud teachers but mine were really nice and I used to like draw a picture and put: I love Mrs (name) (laughter) and sit on her knee and stuff. Er, so that was really lucky, and er, (it was) er, quite a sort of progressive school and you could wear jeans and, like in France, you could wear jeans, any clothes and any, any clothes er, you wanted and we

MD: Yeah,

5  The sound files of the examples given in this chapter may be downloaded from the following website: http://makino.linguist.univ-paris-diderot.fr/transcriptionpho/
6  Unscripted words or events such as false starts, hesitations, etc. are given in bold in the example.

did like cookery and woodwork and quite adv/, different things whereas when I moved to Bolton, I went to a school where you couldn’t, you had to wear a skirt or a dress if you’re a girl and er, boys had to wear short trousers (laughter) and er, you had to wear indoor shoes when you were indoor, like plimsolls or (tra/), like little slippers and we had nothing like cookery or woodwork or nothing sort of creative and it was all like textbooks and things so I didn’t like that when I moved there, but I made nice, good friends.

Even though this level of representation leads to accurate transcriptions of the speech chain (in terms of the words produced), it has its limitations: orthographic representation does not allow the transcriber to represent what the speaker says, for instance in the case of sandhi phenomena (e.g. intrusive [r] in American English, liaison or schwa deletion in French, etc.). In the orthographic transcription in (4a), for instance, there is no indication whether liaison between the underlined words occurs or not. Only a phonemic transcription as in (4b) provides such information.

(4)  Extract from an informal conversation between two colleagues (ACSYNT, COAE 16).
a. Orthographic transcription
Euh parce par exemple je suis allée à Lille bon c’était pour le boulot hein mais c’était une ville dont je me faisais une idée complètement euh l’horreur totale quoi et en fait j’ai été plutôt agréablement étonnée quoi hein
b. Phonemic transcription of the underlined sequences
[ʒəsɥizale], [setEynvil]

Transcribing data with standard orthography may be problematic when the speech samples do not fit the forms expected by standard language systems, for instance in the speech of bilinguals, language learners, and speakers with a speech pathology. In many cases, the forms that are produced are not in the lexicon, and hence difficult to transcribe accurately in standard orthography.

4.3.1.1.3  Citation-phonemic Representation

At this level, phonetic symbols are used to represent the linguistic signs that are present in the message. To a certain extent this representation is equivalent to an orthographic transcription, from which it can easily be automatically derived. That is, a citation-phonemic representation corresponds to a phonemic representation that results from the concatenation of the phonemic representations associated with each lexical item present in the message. Elision and sandhi phenomena are not represented at all, since the citation-phonemic representation ignores any variations in pronunciation that occur in continuous speech. For instance, for the sentence in (5), the citation-phonemic representation (5b) is equivalent to the concatenation of the representations of each word taken in isolation (5a).


(5)  Non je les ai trouvés un peu difficiles. Il y avait des mots qui euh qui se enfin je qui se suivaient un petit peu et j’arrivais pas trop à faire le la liaison quoi (ACSYNT, BAVE)
a. Concatenation of the representations of each word taken in isolation (using SAMPA, section 3.2.2.2). Note that latent consonants are given in parentheses in the symbolic representation.
n o~ + Z @ + l e (z) + e + t R u v e (z) + e~ (n) + p 2 + d i f i s i l @ (z) + i l + i + a v e (t) + d e (z) + m o t (z) + k i + 2 + k i + s @ + a~ f e~ + Z @ + k i + s @ + s H i v E/ (t) + e~ (n) + p @ t i (t) + p 2 + e + Z @ + a R i v E/ (z) + p a (z) + t R o (p) + a + f E R + l @ + l a + l j e z o~ + k wa
b. Citation-phonemic representation (using SAMPA)
n o~ Z @ l e (z) e t R u v e (z) e~ (n) p 2 d i f i s i l @ (z) i l i a v e (t) d e (z) m o t (z) k i 2 k i s @ a~ f e~ Z @ k i s @ s H i v E/ (t) e~ (n) p @ t i (t) p 2 e Z @ a R i v E/ (z) p a (z) t R o (p) a f E R l @ l a l j e z o~ k wa

As shown in (5), citation-phonemic representation is comparable to orthographic representation in terms of accuracy. However, it makes possible queries that need to refer to phonological units or events, such as potential liaison contexts.
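To give a concrete sense of how such a representation can be derived and queried, the following is a minimal sketch that concatenates citation forms drawn from a small pronunciation lexicon and then searches the result for potential liaison contexts. The lexicon, the SAMPA strings, and the convention of marking latent consonants in parentheses are invented for the purposes of the example and are not taken from any existing tool.

```python
# Minimal sketch (not an existing tool): derive a citation-phonemic string by
# concatenating lexicon entries, then locate potential liaison contexts.
import re

# Hypothetical pronunciation lexicon: orthographic word -> SAMPA citation form,
# with latent (liaison) consonants given in parentheses, as in example (5a).
LEXICON = {
    "les": "l e (z)",
    "ai": "e",
    "trouvés": "t R u v e (z)",
    "un": "e~ (n)",
    "peu": "p 2",
    "difficiles": "d i f i s i l @ (z)",
}

def citation_phonemic(words):
    """Look up each word's citation form; unknown words are flagged for checking."""
    return [LEXICON.get(w.lower(), f"<{w}?>") for w in words]

def potential_liaison_sites(forms):
    """Indices of words whose latent final consonant precedes a vowel-initial word."""
    latent_final = re.compile(r"\([a-z]\)$")        # form ends in "(z)", "(t)", ...
    vowel_initial = re.compile(r"^[aeiouyE29O@]")   # crude SAMPA vowel test
    return [i for i in range(len(forms) - 1)
            if latent_final.search(forms[i]) and vowel_initial.match(forms[i + 1])]

words = ["les", "ai", "trouvés", "un", "peu", "difficiles"]
forms = citation_phonemic(words)
print(" + ".join(forms))                 # citation-phonemic representation, as in (5a)
print(potential_liaison_sites(forms))    # -> [0, 2]: "les ai" and "trouvés un"
```

In a real pipeline the toy lexicon would be replaced by a full pronunciation dictionary or a grapheme-to-phoneme converter, but the principle of concatenating citation forms and then searching the symbolic string stays the same.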

4.3.1.2  Approaching Segmental Transcription at a Phonetic/Acoustic Level

The levels of representation that have been described so far do not allow a direct representation of the finer detail of what the speaker actually said beyond linguistic, phonologically contrastive information. The transcriptions focus on representing the linguistic content of the message, and the lexical items that compose it. Three further types of representation can be defined, which presuppose a segmentation in phonemes while enabling direct access to phonemic and/or subphonemic (phonetic) information. The construction of these phonological representations requires knowledge of the language, attentive listening to the signal, and even, in some cases, use of acoustic representations of the signal (spectrogram etc.).

4.3.1.2.1  Broad Phonetic or Phonemic (Phonotypic) Transcription

A broad phonetic transcription provides a phonemic representation of the speaker’s actual pronunciation of the words in the speech signal, at a contrastive phonological level. Unlike a citation-phonemic representation, it will indicate phenomena that are characteristic of connected speech such as liaison, intrusive /r/, consonant deletion, vowel reduction, and assimilation. Hence, the resulting transcription tends to be more detailed than a citation-phonemic representation. The beginning of the extract in (5) can be transcribed as (6) in a broad phonemic transcription.

(6)  Non je les ai trouvés un peu difficiles
n o~ Z l e z e t R u v e e~ p 2 d i f i s i l

Such a representation can easily be derived from an orthographic or a citation-phonemic representation of the data:

• automatically, by using ASR and forced alignment algorithms on the citation-phonemic representation, or even the orthographic transcription. All segments that appear in the citation-phonemic representation and that are not produced in the signal will be erased (Cucchiarini and Strik, this volume);
• semi-automatically, by listening to the contexts in which continuous speech phenomena may apply, and modifying the automatically derived citation-phonemic transcriptions accordingly.

Broad phonetic representation relies on the use of a clearly defined, limited set of symbols, like those of the IPA or its machine-readable extension SAMPA, but only symbols that have the status of phonemes are taken into consideration to encode the output of connected speech processes. By offering more phonetic detail than the citation forms, this level constitutes a balance between accuracy of representation and ease of derivation.

4.3.1.2.2  Narrow Phonetic Transcription

To achieve a transcription at this level of representation, the transcriber has to listen very carefully to the signal, sometimes combined with visual inspection of the waveform and spectrogram. All segments have to be represented by phonetic symbols or combinations of symbols and diacritics (e.g. in IPA, section 3.2.2 below), which correspond most closely to the sound sequence that is actually produced. Allophonic variants are encoded when necessary, necessitating the use of symbols and diacritics with both phonemic and subphonemic (often allophonic) status in the language. For instance, the aspiration of a plosive in onset position—one of the allophonic variants of voiceless plosives in English—would be encoded in the transcription, as is shown in (7) (transcribed as [ʰ] in the example), as well as processes like nasalization of a vowel (marked [ ̃]), regressive place assimilation (/n/ pronounced as [m]), and pre-glottalization of a stop in coda position (transcribed as [ʔ] in (7)).

(7)  Ten bikes were stolen from the depot yesterday evening.
[tʰɛ̃mbɑɪʔks . . .]

This level of representation is very accurate, but is also time-consuming to produce. Hence, narrow phonetic transcription should only be used in cases in which it is absolutely necessary:

It is better not to embark without good reason on this level of representation, which requires the researcher to inspect the speech itself, as this greatly increases the


resources needed (in terms of time and effort). If the broad phonetic (i.e. phonotypic) level is considered sufficient, then labelling at the narrow phonetic level should not be undertaken. (Gibbon et al. 1997)

4.3.1.2.3  Acoustic Phonetic Transcription

This level of transcription provides very detailed information about the various elements and phases that occur during the realization of a sound. For a plosive, an acoustic phonetic transcription will indicate all the phases that can be distinguished in its production, including the oral or glottal closure, any aspiration, the release burst, etc. Such a transcription can only be achieved when the transcriber refers to visual representations of the acoustic signal (e.g. spectrum, spectrogram, and speech pressure wave). The labels that encode the acoustic information would normally be aligned to the signal in a graphical representation.
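As an illustration of what such time-aligned labelling amounts to in practice, the sketch below stores acoustic-phonetic labels as intervals on a tier and retrieves the label that covers a given point in time. The tier, label names, and times are invented for the example; dedicated annotation tools of course store this kind of information in their own richer formats, and the sketch only illustrates the underlying data structure.

```python
# Minimal sketch (invented labels and times): acoustic-phonetic labels stored as
# time-aligned intervals on a tier, so that they can be lined up with the signal.
from dataclasses import dataclass

@dataclass
class Interval:
    start: float   # seconds from the beginning of the sound file
    end: float
    label: str

# Hypothetical acoustic-phonetic tier for the initial /t/ of "ten" plus the vowel:
tier = [
    Interval(0.112, 0.158, "t-closure"),
    Interval(0.158, 0.163, "t-burst"),
    Interval(0.163, 0.201, "t-aspiration"),
    Interval(0.201, 0.342, "E (nasalized)"),
]

def label_at(tier, t):
    """Return the label of the interval containing time t, or None."""
    for interval in tier:
        if interval.start <= t < interval.end:
            return interval.label
    return None

print(label_at(tier, 0.160))   # -> "t-burst"
```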

4.3.1.3  Deciding on a Level of Transcription

Six different levels of segmental transcription have been presented here which provide different information about phonemic and subphonemic events in the speech signal. To decide which one to use, a number of factors need to be taken into account: the size of the corpus, the objectives, and the degree of detail required by the user. Thus, if the corpus has been developed to carry out research on e.g. discourse and conversation, and more precisely on turn-taking, an orthographic transcription in which speaker turns are indicated is sufficient. By contrast, if the purpose of the research is to study pronunciation variants of certain phonemes, a narrow phonetic transcription may be required, at least for the phonemes of interest and their immediate context. The difficulty of the transcription task depends crucially on the level of representation, but also on the nature or genre of the speech samples that are to be transcribed. For instance, scripted data that are recorded in a soundproof booth are much easier to transcribe than informal spontaneous conversation between several speakers, because the former is much more predictable, and more likely to stay closer to citation-form pronunciations of the lexical items in the speech signal. Automatic procedures may also be more successful in formal speech, since they typically rely on citation-form speech (or derivations from citation-form speech).

Two distinct classes of systems can be used to provide a transcription of segmental information in the speech signal: orthographic systems and phonemic/phonetic systems. In languages with a segment-based orthographic system which closely follows the sound changes in the language, the two classes of representation may not be that far removed from each other, but the symbolic representations obtained in the systems will differ, and offer different research perspectives. However, since a number of tools can be used to transform an orthographic transcription into a citation-phonetic representation and vice versa, orthographic transcription can be considered as pivotal between alphabetical and phonotypic representations.


4.3.2  Most Commonly Used Systems

As mentioned in section 3.1, two distinct types of system can be used to annotate a speech corpus for phonetic/phonological research: the orthographic system of any given language, or a set of phonetic symbols that represent the sounds present in the signal. However, the two systems cannot be applied in all cases, as they may not offer the same degree of granularity in the way in which they represent the speech signal. In this section, the most commonly used systems will be presented, and their advantages and limitations will be discussed.

4.3.2.1  Representing the Segmental Dimension of the Speech Signal by Means of Alphabetical Writing Systems7

As mentioned in the previous section, making an orthographic transcription of a speech file consists of representing the linguistic content of the speech signal symbolically, in orthographic form. This necessarily implies that the signal is interpreted by the transcriber, at least to the extent that the orthography serves as a means not only to express a sound sequence but also to refer to a concept or an idea (de Saussure’s sign: see de Saussure 1916). This level of transcription—which is often applied in large speech corpora (GARS corpus of spoken French, spoken sections of the British National Corpus, etc.) and which is recommended by the TEI and EAGLES for the transcription of spoken corpora (EAGLES 1996; Burnard 1993; 1995; Sperberg-McQueen and Burnard 1999)—has a number of advantages:

• The system is easy to use (since it does not require any knowledge of special symbols or segmentation conventions);
• It allows transcription of large datasets;
• It provides transcriptions that are readable for all potential user groups;
• The transcriptions can serve as input to computational tools for automatic language processing, which can be used to automatically generate linguistic annotations from the text (e.g. syntactic parsing and tagging, phonemic representations, etc.).

Nevertheless, producing an accurate orthographic transcription can be an arduous task, in spite of the range of tools that are available to assist the transcriber (cf. e.g. Delais-Roussarie (2003a; 2003b) for a review; Sloetjes, this volume; Boersma, this volume; Kipp, this volume). In fact, a number of factors make the task difficult (cf. e.g. Blanche-Benveniste and Jeanjean 1987):

• It is often difficult to hear precisely what is being said due to distortions of the signal (background noise, etc.), especially in the case of spontaneous speech which is not recorded in a quiet environment.

7  Only orthographic alphabetical systems are covered here, since they represent segmental information more closely than other writing systems. This does not imply that other writing systems do not share some of the advantages and limitations of the orthographic systems discussed here.


• Transcribers will tend to auditorily reconstruct elements that they cannot readily interpret as part of the message, which may result in erroneous interpretations (unexpected words that are produced out of context, unfamiliar pronunciations in dialectal or non-native speech, etc.).
• A variety of preconceptions can cause transcription errors, linguistic preconceptions in particular (i.e. when a transcriber relies on his/her knowledge and representations of the language, and erroneously reinterprets what he or she hears).
• Sometimes the speech signal allows for multiple interpretations which need to be disambiguated in writing (e.g. auditory [lɑ̃dʁwaulɔ̃nɛ] in French can correspond with orthographic l’endroit où l’on est as well as l’endroit où l’on naît; note that such ambiguities are usually disambiguated by the context and/or phonetic detail in the signal).

A number of recommendations have been proposed (see e.g. EAGLES 1996; Burnard 1995) with a view to ensuring accuracy and ease of use (by human and machine), notably:

• Transcriptions need to observe standard orthographic conventions as much as possible; generally adopted conventions are also used for contractions (e.g. gonna or wanna in English; t’as in French).
• Hesitations, false starts, repetitions, and all other forms of self-correction need to be transcribed literally.
• Numbers, formulae, and other symbols need to be represented as written words.
• Abbreviations and acronyms are transcribed, distinguishing between abbreviations that are pronounced as words and those that are pronounced as a series of initials (NATO vs. U.S.A.).
• Only major punctuation marks are used (i.e. those used at the end of the sentence like question marks, exclamation marks and full stops).

Two of these recommendations are the subject of debate: the use of punctuation, and the reliance on standard orthographic conventions. A number of editors of oral corpora refuse to use punctuation on the grounds that (i) punctuation is part of the written code, and/or (ii) there are no sentences in speech (cf. e.g. Blanche-Benveniste and Jeanjean 1987). Nevertheless, many speech corpora use punctuation to segment speech into utterances (cf. the spoken sections of the BNC, the CHAT conventions, etc.). This choice is justified in various ways. French (1992), for instance, states clearly which indicators the transcriber should use to segment the speech signal into utterances:

Try to be guided by intonation—the rises and falls in the voice—as well as by the words themselves. If it sounds as though someone has finished a sentence and gone to another (their voice drops, they take a breath and start on a higher note), then it’s probably safe to start a new sentence.

Payne (1995) proposes a definition of the sentence in spoken language when he addresses the issue:

The resulting functional sentence is perhaps difficult to define precisely in linguistic terms, but as an essentially practical unit creates few problems for transcribers, as it is using their intuition about when speakers are starting, continuing and completing what they are saying on the one hand, and when they are abandoning an incomplete utterance on the other. (Payne 1995: 204)

As mentioned before, many guidelines and recommendations which aim to reach a standard in the design of oral corpora usually insist on the use of standard spelling to represent the linguistic content of the audio signal (e.g. TEI, EAGLES 1996). However, standard spelling cannot account for various realizations that occur in connected speech, and as such does not very accurately represent what was actually said by speakers. In any case, an orthographic transcription using standard spelling cannot represent the occurrence of phenomena such as liaison or optional schwa deletion in French, or intrusive /r/ in American English. In order to overcome this impossibility, some tricks which lead to a departure from standard spelling have been used in transcribing some oral data. For instance, in the orthographic transcription from an oral corpus of Acadian French (Péronnet-Kasparian: see Chevalier et al. 2003), standard spelling is not observed in transcribing some words. The sequences je suivais and je suis are transcribed chuivais and chuis, respectively, to account for the assimilation after schwa elision (8).

(8)  Orthographic transcription and tricks
y . . . fait-que j’ai euh travaillé deux ans à la firme comptable après mon bacc. / pis chuivais ces cours de correspondance-là / quand j’ai vu que ça marchait pas les cours / j’ai euh / appliqué à l’Université Laval pour la licence// j’ai faiT un an là pour euh obtenir ta licence pis là ton coaching de l’été qu(i) est de / juin/ juillet/ août / genre revue de tout c’que/ que t’as vu / euh de là / on fait note/ j’ai fait mon examen de comptable agréé // euhm // après mon / ma licence à Québec / là chuis déménagée à Montréal / pis j’ai travaillé pour une firme

Other tricks are used to account for schwa deletion (le p’tit instead of le petit) and other phonological and phonetic phenomena occurring in connected speech. In general, such tricks are problematic for several reasons:

• A lack of consistency may occur. In (8), for instance, the schwa deletion is encoded in c’que, but nothing is clear concerning the realization of sequences such as de l’été, genre revue de tout.
• Transcriptions using such tricks may not be very easy to read in comparison to any transcription using standard spelling.


• Transcriptions cannot be further annotated by using automatic tools such as a tagger or a parser. Many words or orthographic forms cannot be properly labelled, as they are not in the dictionary.

A clear advantage that the recommendations offer is that they facilitate the interpretation and automatic processing of the transcribed data or texts. In general, orthographic transcriptions—both from scripted speech and from careful listening—provide readily interpretable representations of the linguistic message, while allowing for the automatic generation of phonemic transcriptions. They are relevant for all further processing of the data, since they make systematic searches for specific phenomena possible (linguistic forms, phonemes, specific phonemic environments, etc.). However, certain types of data can prove difficult to transcribe when nonstandard forms are used, as in regional varieties, learner varieties, and child speech.
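As a simple illustration of the kind of systematic search such transcriptions support, the sketch below runs a phoneme-level concordance over a broad phonemic (SAMPA) string, re-using the transcription of example (6). The function and its conventions are illustrative assumptions rather than features of any existing corpus tool.

```python
# Minimal sketch (illustrative only): a phoneme-level concordance over a broad
# phonemic SAMPA string, returning each occurrence of a target in a given context.
import re

def phoneme_concordance(transcription, target, left="", right=""):
    """Find target with the given left/right context; return (index, snippet) pairs."""
    parts = []
    if left:
        parts.append(f"(?<={re.escape(left)})")
    parts.append(re.escape(target))
    if right:
        parts.append(f"(?={re.escape(right)})")
    pattern = re.compile("".join(parts))
    return [(m.start(), transcription[max(0, m.start() - 4):m.end() + 4])
            for m in pattern.finditer(transcription)]

# Broad phonemic transcription of example (6), with spaces removed:
sampa = "no~ZlezetRuvee~p2difisil"

# Every /e/ immediately preceded by /z/ (here, the realized liaison consonant):
print(phoneme_concordance(sampa, "e", left="z"))
```

A practical implementation would first tokenize the string into phoneme-sized units so that multi-character symbols such as "o~" or "e~" are not matched in part; the naive character-based search shown here is only meant to convey the principle.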

4.3.2.2  Representing the Segmental Dimension with Phonetic Symbols

To provide a transcription that represents the sounds produced, and not only the words pronounced by the speaker, the transcriber tends to use systems in which a symbol corresponds to a sound. Such systems have to be used to transcribe data at any of the following levels of representation (cf. in particular section 3.1.2): citation-phonemic representations, phonemic representation, and phonetic representation. The most commonly used sound–symbol systems are the IPA and some machine-readable extensions of the IPA: SAMPA, ARPABET, X-SAMPA. Some of these systems will be presented in detail in the next sections (see also International Phonetic Association 1999 and Wells 1997). Note that all these systems presuppose phonological knowledge in the sense that symbols are assigned to segments that correspond roughly to phonemes.

4.3.2.2.1  The International Phonetic Alphabet

The IPA was first developed in 1888 by phoneticians and teachers of French and English under the auspices of the International Phonetic Association (cf. International Phonetic Association 1999). Its development was intended to facilitate the teaching of pronunciation in foreign languages, at least in terms of the sound segments (phonemes) in words, avoiding the complications introduced by orthographic representation. According to this system, a limited set of symbols should make it possible to represent any and all of the sound segments that are used contrastively in the languages of the world. To achieve this aim, the IPA builds on three fundamental theoretical assumptions:

• The number of symbols is strictly limited (188 symbols, which represent vowels, pulmonic and non-pulmonic consonants, and 76 diacritics).
• The system can be used to transcribe any language or variety of a language; in other words, it is independent of any given language.
• The symbols are assigned to segments (not to phonemes), which presupposes a segmentation of the speech stream into pre-theoretical units.

These principles should facilitate data sharing, as well as transcription validation. Moreover, different languages or varieties of a language can be compared, as different symbols are assigned to different sounds.8 The symbols used to represent the vowels and the pulmonic and non-pulmonic consonants are given in (9) (reprinted with the permission of the International Phonetic Association).

(9)  The IPA Chart (Consonants and vowels)

[The International Phonetic Alphabet (revised to 2005), © 2005 IPA: chart not reproduced here. It comprises the pulmonic consonant chart (places of articulation from bilabial to glottal; manners from plosive to lateral approximant), the non-pulmonic consonants (clicks, voiced implosives, ejectives), and the vowel chart (front, central, back; close to open). Where symbols appear in pairs, the one to the right represents a voiced consonant (or, for vowels, a rounded vowel); shaded areas denote articulations judged impossible.]

8  Databases like UPSID (see Maddieson 1984) are constructed on the basis of comparisons between phonological inventories represented in IPA symbols. However, it should be borne in mind that one phonemic symbol can represent different phonetic realizations.


However, the principles just mentioned are in fact assumptions which run, to some extent, counter to fact: in its use the system is not truly independent of individual languages. First, an individual symbol does not always represent the same acoustic/phonetic reality. In fact, its precise realization varies from one language to another. For instance, the symbol [p] is used to transcribe a voiceless bilabial plosive in French and in English, even though from an auditory and acoustic point of view the two sounds are different, with more aspiration in English than in French. This illustrates the important role of the notion of contrast in determining the mapping between sound and symbols in this system, which implies a high level of abstraction already. Second, as mentioned in sections 2 and 3.1, for any transcription to be made, the message needs to be interpreted, and hence the transcription cannot be made independently of a given language. In spite of these potential limitations, the IPA is widely used, and often serves as a medium for exchanging and analysing linguistic data. Three factors contribute to its popularity:

• By representing the continuous sound signal as a string of segments, the IPA offers an intuitive way to represent the speech signal in a way that is to a certain extent comparable to orthography.
• As the locus of contrast, the segment represents a fundamental unit for segmentation of the speech signal which encodes only linguistically relevant aspects of speech (as well as some phonetic detail).
• By adopting the segment as its fundamental transcription unit while allowing for considerable flexibility in the precise realization of the transcription symbol, the IPA allows for more or less detailed transcriptions, depending on the level of abstraction at which the transcription is carried out—phonemic or phonetic.

Distinguishing between different levels of representation does not call into question the need for the segmentation and phonemic analysis of the speech signal before any transcription can take place. The phonetic detail is rendered by the symbol chosen to represent the sound actually realized. Thus, in examples (10) and (11), phonological knowledge allows us to segment the linguistic message, but the selection of the symbols can either be based on the abstract phonological form, as in examples (10a) for French and (11a) for English, or they can represent at a phonetic (or allophonic) level what the speaker actually produced, as in examples (10b) and (11b).

(10)  C’est parti pour sa quatrième campagne présidentielle. (FOR-LEC)
a. Phonemic representation
/sε paʁti puʁ sa katʁijεm kɑ̃paɲ pʁezidɑ̃sjεl/
b. Phonetic representation9
[sε paxti puχ sa cʰatxijεmə kɑ̃mpaɲə pxεzidɑ̃sçεl]

9  Blank spaces are inserted between orthographic words here to make the transcription easier to read.

(11)  Ten bikes were stolen from the depot yesterday evening.
a. Phonemic representation
/tɛnbɑɪks . . ./
b. Phonetic representation
[tʰɛ̃mbɑɪʔks . . .]

While examples (10a) and (11a) provide a phonemic transcription of the utterances, the phonetic transcriptions in examples (10b) and (11b) reveal how the sound segments were actually realized by the speaker. The French example in (10b) shows that some of the symbols that are used to transcribe the actual realization of the segments are not phonemes of French; they merely represent pronunciation variants. Thus, the /ʁ/ in parti is pronounced as a velar fricative instead of a uvular. Similarly, the velar plosive /k/ of quatrième is produced as a dorso-palatal aspirated plosive. By contrast, the English example (11b) shows that sometimes, when a different symbol is used to represent the pronunciation variant in the phonetic transcription, the alternative symbol also represents a contrastive segment in the phonemic inventory of the language, as in the case of alveolar nasal /n/ in ten, which is pronounced as bilabial [m] in connected speech. It should be borne in mind, though, that even though the transcriptions in examples (10b) and (11b) offer a high degree of detail, they do not directly correspond to the reality of the signal. Nevertheless, a number of points mentioned above can be distinguished, which can be considered an integral part of the transcription task:

• Transcription requires the discretization of a speech signal which is inherently continuous in nature.
• The units or segments that are chosen to discretize the signal represent loci of phonological contrast or opposition. Thus in example (10), the sound /s/ in c’est could enter into opposition with e.g. /t/ (t’es parti). If exchanging one phonetic segment for another (i.e. replacing one symbol with another) gives rise to a difference in meaning, the two segments are in opposition, and they correspond to variants of two distinct phonemes.
• A difference in the choice of symbols can be used to encode different degrees of granularity, ranging from phonemic citation-form transcriptions to acoustic-phonetic transcription.

To conclude, as a transcription system the IPA has a number of important characteristics. It is based on the assumption that (i) the speech signal can be analysed as a string of segments; (ii) segments are units that are relatively neutral from a theoretical point of view, while they represent possible loci for contrast and opposition; (iii) the set of labels (or symbols) is strictly limited (in spite of the use of diacritics); (iv) each symbol corresponds to a single sound segment in the speech signal; and (v) the sound signal can be


represented at a more or less abstract level, where the level of abstraction is marked by the symbol that is chosen to represent the realization of the sound segments in the signal. In general, the IPA allows for finer-grained transcriptions of phonetic realizations than are possible in orthography. However, the resulting transcriptions are difficult to read for untrained users. Moreover, they cannot be used as input for linguistic processing such as grammatical annotation of the data. Finally, the IPA does not resolve all the issues that arise when nonstandard speech data are transcribed orthographically, for instance in first and second language acquisition data, or pathological speech. This is because segmenting the speech signal into a phonemic string is a prerequisite for transcription into IPA symbols.

4.3.2.2.2  Machine-readable Extension of the IPA: SAMPA, X-SAMPA, and ARPABET10

Not all the IPA symbols are machine-readable alphanumeric symbols (or ASCII characters). Therefore, they cannot always be typographically represented. A number of systems were developed to address this issue. They have been mostly used in speech technology (TTS or Text-to-Speech synthesis, and speech recognition). They mainly consist of a fairly straightforward modification of IPA in which IPA symbols are translated into alphanumeric code, which makes it usable in computing. Examples of the best-known systems are given in examples (12) and (13) below. In (12), the transcriptions with SAMPA and X-SAMPA of two sentences are presented.11 In (13), transcription with ARPABET is illustrated.12

(12)  a. The North Wind and the Sun were disputing which was the stronger.
With the IPA
ðə ˈnɔɹθ ˌwɪnd ən (ð)ə ˈsʌn wɚ dɪsˈpjutɪŋ ˈwɪtʃ wəz ðə ˈstɹɑŋɡɚ
With SAMPA
D@ nO4T wind @n @ sVn w@` dIspju4IN witS w@z D@ st4ANg@`

10  Some other machine-readable systems have been developed in house for research purposes. However, they are not as widely accepted as the systems we present here. An example is the system that was developed in the 1980s in Orange Labs, and which is used to transcribe French data in speech technology:
(i)  La bise et le soleil se disputaient, chacun assurant qu’il était le plus fort.
a  / la biz e l@ sOlEj s@ dispyte SakE~ asyRA~kilete l@ ply fOR/ in SAMPA
b  / L A  B I Z EI L EU S AU L AI Y S EU D I S P U T AI CH A K UN A  S U R AN K I L EI T AI L EU P L U F AU R/ in Orange Lab’s annotation system for TTS
11  SAMPA and X-SAMPA have been mostly developed in Europe, X-SAMPA consisting of an extension of SAMPA, which was originally developed for individual languages. A complete description of the symbols used in SAMPA and X-SAMPA can be found, respectively, at: http://www.phon.ucl.ac.uk/home/sampa/ (Last accessed 16/09/2013) and http://en.wikipedia.org/wiki/Extended_Speech_Assessment_Methods_Phonetic_Alphabet#Summary (Last accessed 16/09/2013).
12  ARPABET is a phonetic transcription code developed by the Advanced Research Projects Agency (ARPA) during a project on speech understanding (1971–1976). The system is presented in great detail at http://en.wikipedia.org/wiki/Arpabet (Last accessed 16/09/2013).

With X-SAMPA
D@ nOr\T wind @n @ sVn ǀ w@`dIspju4IN witS w@z D@ str\ANg@

b. La bise et le soleil se disputaient, chacun assurant qu’il était le plus fort.
With the IPA
la biz e lə sɔlɛj sə dispytɛ ʃakɛ̃ asyʁɑ̃ kiletɛ lə ply fɔʁ
With SAMPA and X-SAMPA
la biz e l@ sOlEj s@ dispyte SakE~ asyRA~kilete l@ ply fOR

(13)  Transcriptions with ARPABET
a. Thursday is transcribed as /TH ER1 Z D EY2/
b. Thanks is transcribed as /TH AE1 NG K S/
c. Father is transcribed as / F AA1 DH ER/

As can be seen in (12), the IPA symbols that correspond to alphanumeric symbols remain in general unchanged in SAMPA and X-SAMPA. For instance, IPA /p/ is transcribed /p/ in SAMPA and X-SAMPA, and this holds for all languages in which a voiceless bilabial stop contrasts with other sound segments. However, symbols which do not correspond directly to ASCII characters, like the fricatives /ð/ and /β/, are translated as /D/ and /B/ respectively in SAMPA as well as X-SAMPA. The IPA diacritics also have their equivalents, but they are placed generally in the same typographical line as the symbol that they are modifying. However, suprasegmental information is encoded differently in the two types of system. In the IPA, prosodic symbols are normally inserted on the same typographical line as the segmental symbols, unlike in e.g. SAMPA, where they are used on a separate transcription tier (or line, in typographic terms). This difference allows for multiple use of the same symbol in SAMPA and SAMPROSA (the prosodic version of SAMPA) to represent two distinct sound objects. For instance, the symbol H represents the labio-palatal approximant /ɥ/ in SAMPA, as well as a High tone in SAMPROSA. In ARPABET, by contrast, the representation of IPA symbols follows very different principles, as can be seen in (13). Every phoneme is represented by one or two capital letters. For instance, /ɔ/ is represented by AO, and /ʒ/ by ZH. Stress is indicated by digits that are placed at the end of the stressed syllabic vowel. Intonational phenomena are represented by punctuation marks that are used in the same way as they are in the written language. In contradistinction to X-SAMPA, ARPABET is developed for American English only. Having been developed as machine-readable versions of the IPA, the systems implicitly adopt the phonological nature of the IPA system:

• They are language-dependent, which means that a full transcription can only be made if the sound system of the language is known.


• Each symbol that represents a phoneme in a language has its equivalent in the transcription system, where one symbol corresponds to one sound segment.
• The speech signal is treated as a sequence of segments (beads on a string).

SAMPA, X-SAMPA, and ARPABET offer the same advantages and drawbacks as the IPA, but they were primarily devised to make phonemic transcription in ASCII symbols possible.
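Because these systems are essentially symbol-for-symbol recodings of the IPA, the conversion can be sketched as a simple lookup table. The fragment below shows the idea for a handful of well-known SAMPA correspondences; the table is deliberately incomplete and illustrative, and a realistic converter would also have to tokenize multi-character symbols (e.g. r\ in X-SAMPA) and handle diacritics, which the character-by-character replacement below ignores.

```python
# Minimal sketch (illustrative, incomplete): symbol-for-symbol recoding of an IPA
# string into an ASCII representation in the spirit of (X-)SAMPA.
IPA_TO_ASCII = {
    "ð": "D", "θ": "T", "ʃ": "S", "ʒ": "Z", "ŋ": "N",
    "ə": "@", "ɛ": "E", "ɔ": "O", "ʌ": "V", "ɪ": "I",
}

def ipa_to_ascii(ipa_string):
    """Replace each known IPA character by its ASCII counterpart; keep the rest."""
    return "".join(IPA_TO_ASCII.get(ch, ch) for ch in ipa_string)

print(ipa_to_ascii("ðə sʌn"))   # -> "D@ sVn"
```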

4.4  Transcription Systems and Suprasegmental Information

Most issues that arise in the transcription and annotation of segmental information also arise with the transcription of suprasegmental, prosodic information (see section 2). However, unlike segmental transcription where the IPA—or SAMPA—is generally accepted as a useful standard, there is no commonly accepted transcription system for suprasegmental information. A number of different systems have been developed which are based on different theoretical approaches, pursuing different objectives. Hence, they deal in different ways with the issues that arise in the transcription of suprasegmental information. In this section, we will exemplify different approaches and objectives by presenting some commonly used transcription systems (see Llisterri 1994 for a more comprehensive survey of various prosodic transcription systems). We will focus on symbolic representation systems, i.e. systems that provide a discrete symbolic representation or a phonological representation of various prosodic phenomena (and to a certain extent their phonetic realization). This means that not all of the systems that have been developed to stylize the F0 curve or to decompose it into smaller components will be reviewed here, even though they can be a very helpful—and sometimes even crucial—tool for developing a linguistic analysis or a symbolic representation of a specific speech sample. Only two of these systems will in fact be included, but only because they are used to compute intermediate steps in an annotation process that yields a symbolic transcription (see e.g. sections 4.2.2 and 4.3.3 on Mertens’ transcription system and INTSINT, respectively).13 We will mostly concentrate on the transcription of intonational events, since a number of systems have been proposed in this domain that are used more widely, and

13  Examples of systems that provide a stylization of the F0 curve are TILT (Taylor 2000), Momel (Hirst and Espesser 1993), Prosogramme (Mertens 2004a; 2004b), and the stylization system based on the IPO framework (’t Hart et al. 1990). The system developed by Fujisaki provides an analysis of the F0 curve that allows it to be decomposed in two distinct components (e.g. Fujisaki 2006).

because intonation transcription systems tend to include devices for encoding the phenomena that are assumed to be closely related to intonation—phrasing and accentuation. Thus, the representation of suprasegmental phenomena like the evolution in time of voice quality and rhythm will not be dealt with in this chapter. We will distinguish two types of system, based on the way in which they segment the speech signal for encoding the suprasegmental information (see section 4.1.2): contour-based systems (IPA and Mertens’ system: section 4.2) and target-based systems (Momel-INTSINT, ToBI, and IViE/IV: section 4.3). A second difference between the systems, which cuts across this distinction, is between systems that allow for (semi-)automatic transcription (Mertens system and Momel-INTSINT) and those that are manual (IPA, ToBI, and IViE/IV). A third distinction, which also cuts across the contours/targets distinction, is between systems that are designed to be language-specific (ToBI, and to a certain extent IPA) and those that are applicable to any language, even if the language’s suprasegmental system is not known (Momel-INTSINT, and arguably IViE/IV). This distinction is related to the level of linguistic abstraction that the systems are designed to represent in the transcription, since an abstract phonological representation of the suprasegmental characteristics of the speech signal depends on a linguistic interpretation of the signal, which is of necessity language-specific, while a surface phonetic representation does not necessarily do so.14 Prosogram targets a perceptual phonetic level, ToBI an abstract phonological level, and IViE/IV was developed to do both; Momel-INTSINT and IPA arguably also fall into the latter category, but as we will see below, this is problematic. In section 4.1, the different suprasegmental phenomena that need to be transcribed will be introduced briefly, and the key theoretical issues that have given rise to different approaches in transcription will be summarized. In sections 4.2 and 4.3, contour-based and target-based systems will be reviewed, respectively, including a summary discussion of the main strengths and weaknesses of each system. These discussions will clarify why developing a single transcription system that would be accepted and used by the entire research community has proved difficult. For each type, automatic or semi-automatic systems will be presented (Mertens system and INTSINT) as well as manual systems (IPA and ToBI).

4.4.1  Suprasegmental Phenomena and Segmentation

4.4.1.1  Intertwined Phenomena

Suprasegmental information in the signal encompasses an array of different phenomena that are closely intertwined in the signal. Intonation, or the melody of speech, can be used to convey linguistic as well as nonlinguistic information (see e.g. Ladd 1996;

14  Leaving aside the perceptual effects of prior experience (e.g. Cumming 2011 for native language effects in the perception of speech rhythm).


Gussenhoven 2004). That is, an utterance’s intonation contour can signal its pragmatic function, for instance when a speaker produces a falling pitch movement for a declarative utterance, or rising pitch for an interrogative. Intonation can also cue when a new topic is broached by the speaker, it can be used to highlight information, and it conveys functions in conversational interactions like turn-taking and floor-holding. At the same time, intonation can convey nonlinguistic information like the speaker’s attitude or emotion (often referred to as paralinguistic information), or extralinguistic information about the speaker, like gender. Intonation can also be used to mark prosodic phrasing, or the chunking of speech into linguistically relevant units like words, phrases, and utterances (e.g. Truckenbrodt 2007a; 2007b). The edges of the units are typically cued by changes in pitch, loudness, and duration (e.g. Wightman et al. 1992; Streeter 1978). Rhythm and accentuation are closely intertwined with phrasing and intonation (see e.g. Beckman 1986; Liberman 1975). Intonation contours are usually analysed in terms of accentual and phrase-marking pitch movements, and phrasing and accentuation are important contributors to the perception of rhythm in a specific language (Prieto, Vanrell et al. 2012). In many languages, accent placement can signal aspects of information structure like focus distribution. The potential locations of accents in an utterance are usually determined by the metrical structure, which indicates the elements in words that can be stressed (syllables, morae or feet). In tone languages, words can also be marked by lexical tone, which is used to distinguish word meanings, and is part of the lexical representation of the word together with the vowels and consonants that define the word segmentally. A transcriber’s first task is to decide which phenomena need to be represented for the transcription to meet its objectives. For instance, if the transcription is carried out for a study of the phonetic correlates of turn-taking, the locations of accents and phrasal edges will be relevant, as well as the type of intonation contour, in addition to various other factors. The transcriber may also need to decide how the relationships between the phenomena at issue are to be represented (e.g. how intonation contours are associated with phrases). As the review below illustrates, the existing transcription systems for suprasegmental information differ in this respect.

4.4.1.2  Suprasegmental Segmentation Units and the Discretization of the Speech Flow

The segmentation of the speech stream into transcription units for suprasegmental labelling depends on the choice of phenomena that are being described (e.g. syllables for the transcription of individual accents or Intonation Phrases for the transcription of intonation contours; see Silverman et al. 1992), as well as the theoretical perspective that underlies the transcription system. Two approaches to the decomposition of intonation contours into discrete elements have been proposed (global vs atomistic: Bolinger 1951):

• contour-based analyses which decompose intonation in terms of configurations of pitch movements;

• target-based analyses which decompose intonation in terms of configurations of turning points with distinct pitch levels.

Examples of the former are the Dutch model developed at the IPO (Collier and ’t Hart 1981; ’t Hart et al. 1990), and what is generally referred to as the British tradition, e.g. Halliday (1967), O’Connor and Arnold (1961/1973), and Crystal (1969). Here, the intonation contour of an utterance is modelled to consist of a sequence of discrete pitch movements (e.g. rises and falls). The movements are the primitives of the intonation structure; at the phonological level, the inventory of the movements and the way in which they can be combined is given. The IPA has its roots in the British tradition of intonation analysis, and is therefore a clear example of a contour-based approach. Examples of the target-based approach to intonation analysis are INTSINT (Hirst and Di Cristo 1998) and the Autosegmental-Metrical framework (Bruce 1977; Pierrehumbert 1980), in which intonation contours are analysed as linear sequences of locally specified targets, which are linked up by means of phonetic transitions. The phonetic targets are fundamental frequency maxima and minima that correspond to tones (High or Low) in the phonological representation, and these tones associate with specific locations in the utterance (metrically strong syllables (T*) or the edges of prosodic phrases (T-, T%) in the Autosegmental-Metrical framework). The inventory of pitch accents and boundary tones may vary from one language to another, and the phonetic realization of their targets is defined by a set of language-specific implementation rules. These approaches originate in American structuralist descriptions that decompose the intonation contour into its component levels (e.g. Low, Mid, High, and Extra High: Pike 1945; Trager and Smith 1951). Momel-INTSINT, ToBI and IViE/IV are all target-based transcription systems (see Hirst et al. 2000 for Momel-INTSINT; Beckman et al. 2005 for ToBI; and Grabe et al. 2001 for IViE). Although Prosogram (Mertens 2004a; 2004b) uses symbols for pitch levels in its symbolic representation of intonation, it is classified as a contour-based approach here, because it does not decompose pitch movements into turning points, which are associated with certain locations in the utterance between which pitch is interpolated. Instead, it specifies pitch on a syllable-by-syllable basis while taking changes in pitch into account. The transcription of suprasegmental phenomena other than intonation (e.g. metrical structure and phrasing) is treated indirectly in the suprasegmental transcription systems presented here, as will become clear in sections 4.2 and 4.3.
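To give a concrete sense of the kind of perceptually motivated decision that a syllable-based stylization makes, the sketch below checks whether the pitch change over a syllabic nucleus exceeds a glissando threshold, so that it would be treated as an audible pitch movement rather than a level tone. The threshold formula G = 0.32/T² semitones per second (with T the nucleus duration in seconds) is a commonly cited value assumed here purely for illustration; it is not presented as the exact criterion of any particular tool, and actual systems let the analyst adjust it.

```python
# Minimal sketch (assumed threshold, not the exact criterion of any tool): decide
# whether the pitch change over a syllabic nucleus counts as an audible glissando.
import math

def semitones(f_hz):
    """Frequency expressed in semitones relative to 1 Hz."""
    return 12.0 * math.log2(f_hz)

def is_audible_glissando(f_start_hz, f_end_hz, duration_s, g=0.32):
    """Compare the pitch-change rate with the glissando threshold g/T**2 (ST/s)."""
    rate = abs(semitones(f_end_hz) - semitones(f_start_hz)) / duration_s
    threshold = g / duration_s ** 2
    return rate >= threshold

# A 120 ms nucleus rising from 180 Hz to 220 Hz: audible movement or level pitch?
print(is_audible_glissando(180, 220, 0.120))   # -> True under these assumptions
```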

4.4.2  Contour-based Transcription Systems and the Encoding of Intonational Phenomena

4.4.2.1  The IPA

Although the IPA was originally developed for the transcription of segmental information, it was further extended to include suprasegmental transcription symbols. The relevant symbols are given in the tables in (13) under the headings Suprasegmentals and Tones and word accents (in the computer-compatible version of the IPA, the


SAMPROSA subset of symbols for suprasegmental information is used on a separate transcription tier; see http://www.phon.ucl.ac.uk/home/sampa/samprosa.htm; last accessed 16/09/2013).

(13)  IPA symbols representing suprasegmental events

[Table not reproduced here. The SUPRASEGMENTALS panel lists symbols for primary stress, secondary stress, long, half-long, extra-short, minor (foot) group, major (intonation) group, syllable break, and linking (absence of a break). The TONES AND WORD ACCENTS panel lists level tones (extra high, high, mid, low, extra low) together with downstep and upstep, and contour tones (rising, falling, high rising, low rising, rising-falling) together with global rise and global fall.]

The symbols ‘|’ and ‘||’ represent minor and major intonation group boundaries, respectively. Their definition depends on the language and on the aims of the transcriber. Two levels of stress can be indicated by placing the symbols ‘ˈ’ for primary stress, and ‘ˌ’ for secondary stress immediately before the stressed (or accented) syllable in the word. Lexical tone can be transcribed with a set of symbols that covers all of the lexically distinctive movements that are associated with words in tone languages. A set of four symbols is provided for the transcription of intonational pitch movements: ‘↓’ and ‘↑’ stand for down- and upstep respectively, and ‘↘’ and ‘↗’ mark falling and rising movements.15 They represent intonation over the whole of the minor or major intonation group, and they are placed before the syllable on which the pitch movement or register change takes place. For instance, the transcription of a typical suprasegmental realization of the first sentence of ‘The North Wind and the Sun’ in American English is given in (14) (International Phonetic Association 1999: 44).

(14)  The North Wind and the Sun were disputing which was the stronger, when a traveller came along wrapped in a warm cloak.
ǁ ðə ˈnɔɹθ ˌwɪnd ən (ð)ə ˈsʌn ǀ wɚ dɪsˈpjutɪŋ ǀ ˈwɪtʃ wəz ðə ˈstɹɑŋɡɚ ↘ ǁ wɛn ə ˈtɹævlɚ ǀ ˌkem əˈlɑŋ ǀ ˈɹæpt ɪn ə ˌwɔɹm ˈklok ↘ ǁ

The IPA system offers a number of advantages. First, it allows the transcriber to encode a wide range of suprasegmental phenomena, including intonational phrasing, stress placement, intonation, and lexical tone. Also, the different phenomena can be encoded independently of one another. Second, transcription can be done at a pre-theoretical level, without necessarily anticipating a possible phonological analysis of the data, possibly with the exception of stress. That is, the symbols can be used to transcribe the observed prosodic realizations without requiring a full understanding of the phonological and prosodic system of the language. Nevertheless, since the transcriber is forced to place a symbol in a specific location, and since the symbol necessarily refers to a specific stretch of speech, the transcriber will have to decide at which locations changes in pitch that occur are relevant for the description of intonation contours in the language at issue, and over which domains the movements transcribed extend (i.e. does the movement which the symbol refers to extend over a single syllable, a group of syllables, or a word group, and what are the defining features of a minor and major intonation group in the language?). Compare, for instance, the overall rising–falling pitch movements marked in the solid boxes in Figures 4.2 and 4.3, which would be an acceptable production of the first accent in a

15  The symbols  and  may also be used to represent respectively a falling or a rising contour. In contradistinction to ↘ and ↗, which indicate a fall or a rise that spans over a whole prosodic unit,  and  represent a movement that occurs on a syllable.



FIGURE 4.2  Time-aligned orthographic transcription of The banana from Guatemala has an extra quality with an utterance-initial globally rising-falling movement (solid box), the segmentation unit over which the relevant accent extends (dotted box), with the middle of the accented vowel marked (double  arrow).

FIGURE 4.3  Time-aligned orthographic transcription of La banana de Guatemala és de bona qualitat with an utterance-initial globally rising-falling movement (solid box), the segmentation unit over which the relevant accent extends (dotted box), with the middle of the accented vowel marked (double  arrow).

neutral declarative in English and Catalan, and many other languages (data from the April Project: Prieto, Vanrell et al. 2012); the IPA transcriptions are given in (15a) and (15b).

(15) a. The banana from Guatemala has an extra quality. /baˈnana/
     b. La banana de Guatemala és de bona qualitat. /baˈnana/

As the example shows, although the same IPA symbols could in principle be used to transcribe the pitch movements in both cases (‘’ and ‘’), different linguistic categories are represented in the two languages, which is reflected in the alignment of the peak relative to the accented syllable (marked by the double arrow in the figures). In Catalan (15b), the rise continues through the accented syllable in banana to a peak in the final syllable of the word, followed by a falling movement that is the transition between the high point and the start of the following intonational event. The movement is normally analysed as a rising accent (L+H* in Cat_ToBI: Prieto et al. 2009). In British English (15a), by contrast, the rising part of the movement ends in a peak in the accented syllable, and is followed by a fall to the following accented syllable (ma in Guatemala, here). The pitch movement is analysed as a fall (e.g. Crystal 1969; H*L in IViE: Grabe et al. 2001) which is preceded by a rising movement from the beginning of the utterance (or phrase). The difference between these analyses arises primarily from a difference in segmentation. The segmentation unit that is considered central in the analysis of the example in Catalan is the accented syllable plus any pre-accentual syllables, while in British English it is the accentual syllable plus any post-accentual syllables (also referred to as the nuclear tone: Crystal 1969). In fact, the prosodic extension of the IPA has been criticized for being too heavily informed by a tonetic theory of stress-marking, which makes it relatively inflexible (see Gibbon 1990). Although its theoretical assumptions are rooted in the British tradition, the IPA is nevertheless nonprescriptive on these points. In fact, as the transcriptions of the text ‘The North Wind and the Sun’ recorded in different languages in the Handbook of the IPA illustrate, the conventions that are adopted for segmentation and symbol assignment diverge wildly. For instance, the major intonation group corresponds to the written sentence (usually marked by a full stop in the text) for some, but for others, it corresponds to the clause or Intonational Phrase (usually marked by a comma). This inconsistency makes it difficult to directly compare transcriptions produced by different transcribers in this system. A transcriber will need to refer to his/her knowledge of the language in order to reliably identify the accented syllables in the speech stream, since what is perceived as prominent is language-specific, depending on the way in which various acoustic correlates conspire to signal it. For instance, a bisyllabic word like railing with stress on the first syllable may be pronounced with a relatively prominent final syllable by a speaker of Punjabi English. To the average standard Southern British English listener, this is likely to sound like a stressed syllable. The IPA stress symbols are also problematic because they obscure the relationship between perceived prominence and the linguistic structure that it is associated with, which depends on the linguistic status of prominence (or accent) in the language. For languages with fixed stress (e.g. Finnish, in which stress always falls on the first syllable of the word; see Ladd 1986 for a discussion), or languages which can be argued not to have lexical stress at all (e.g.
French, in which prominences tend to mark the right or left edge of a word group: see among others Di Cristo 2000b), marking primary and secondary stress can only be meaningful if stress is not entirely predictable, and if the two can be distinguished on principled grounds. In French, for


instance, the opposition is locational (i.e. initial vs. final) rather than one of level, as the IPA labels primary and secondary suggest (Astesano 2001; Di Cristo 2011). We can conclude that, as is the case for segmental transcription (sections 2 and 3.2), neither the segmentation of the speech stream into transcription units nor the way in which symbols are assumed to associate with the units can be truly language-independent. As a consequence, only transcribers who are familiar with the linguistic system of the language will be able to provide valid and consistent transcriptions using the IPA. Also, the prosodic extension of the IPA does not offer the same flexibility as the segmental transcription system. Prosodic systems that are not yet known will be difficult to transcribe, because the systems require the transcriber to make linguistically motivated choices in segmentation and symbol assignment. Moreover, unlike the segmental IPA symbols, the suprasegmental system lacks transparency, which can lead to inconsistency in transcriptions, while its theoretical assumptions leave it relatively inflexible, and necessitate a linguistic interpretation of the speech stream during transcription.

4.4.2.2  Mertens’ Transcription System and Prosogram

The system proposed by Mertens can be seen as a contour-based transcription system that differs from IPA for two reasons: first, it is language-independent, and secondly, it is a semi-automatic system, and as such likely to be more robust than the IPA. The symbolic transcriptions proposed by the system are automatically generated from Prosograms. Prosograms are semi-automatic and language-independent stylized representations of intonation contours (Mertens 2004a; 2004b; 2011; http://bach.arts.kuleuven.be/pmertens/prosogram/; last accessed 16/09/2013). Transcription takes place in three stages. First, the speech signal is segmented into syllable-sized units, which are motivated by phonetic, acoustic, or perceptual properties. The automatic segmentation tool uses an estimation of variation in intensity to identify the boundaries of the segmentation units.16 The segmentation is indicated by vertical dotted lines in Figure 4.4. At the second stage, the F0 curves associated to the segmentation units serve as input to an F0 stylization procedure. The stylization is based on a model of human pitch perception, which takes into account a number of factors that have been shown to affect the perception of pitch for intonation. For instance, loudness and duration have been found to play a role in the perception of a change in pitch as a glissando movement as opposed to a sudden jump. If F0 changes more rapidly than a certain threshold in loudness and duration indicates, the F0 change will be perceived as a sudden jump rather than a glissando movement. Apart from the glissando threshold, the system also takes into account the differential glissando threshold for perceiving a change in slope when it is sufficiently large, as well as changes in spectral properties and signal amplitude as cues to boundaries (e.g. demarcating syllable nuclei).

16  Manually or semi-automatically generated segmentations into syllables or syllabic nuclei can also serve as input to Stage 2.

FIGURE 4.4 Prosograms of perceived pitch (in semitones, ST) for The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm  cloak.

Stylized F0 markers are level or show an inclination or curve expressed on a musical scale in semitones, indicated by the bold lines in Figure 4.4. Their relative height is calibrated globally with reference to the speaker’s pitch range, and locally to a three-syllable window to the left (capped at 450 ms). In Mertens (2011), the markers are translated into tonal symbols at a third stage, illustrated in Figures 4.5 and 4.6. Each syllable is assigned a single symbol for its pitch height (B = Bottom of the speaker’s range, L = Low, M = Mid, H = High, and T = Top of the speaker’s range) or glissandos within syllables that exceed the perception thresholds discussed above (R = Rise, F = Fall). Combinations of glissando symbols are used to transcribe complex pitch movements (RF or FR). Glissando symbols can also be combined with a symbol indicating level pitch when the slope of the change in F0 changes significantly (HF, MF, LR, HR, etc). The result is a symbolic transcription which provides more detail about perceptually relevant pitch in the signal than other symbolic transcription systems that make reference to pitch levels (e.g. INTSINT and ToBI) without requiring that the transcriber takes a position on the theoretical nature and the inventory of intonation units that are deemed to be relevant at the abstract linguistic level for the language that is being transcribed (i.e. which contours, are they analysable as tones, what phrasal units are involved, etc.). The advantage is that intonational events can be transcribed even if the intonation system of the language in question is not known, but the disadvantage is that, since the transcription is not categorical or discrete, it does not allow the transcriber to distinguish between perceptually relevant F0 information and information that is phonologically relevant in the language.
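The decisions just described (whether a within-syllable movement exceeds the glissando threshold, and which pitch symbol the syllable then receives) can be made concrete with a minimal sketch. This is not Mertens’ implementation: it ignores the local three-syllable window, and the threshold constant, the quantile boundaries of the speaker’s range, and the function names are illustrative assumptions.

def glissando_threshold(duration_s, g=0.32):
    # Minimal rate of F0 change (ST/s) for a movement of this duration to be heard as a
    # glissando; the constant g is an assumed, adjustable parameter of the sketch.
    return g / (duration_s ** 2)

def pitch_symbol(start_st, end_st, duration_s, speaker_range_st):
    # Return R/F for audible glissandos, otherwise one of B/L/M/H/T for level pitch,
    # placed within the speaker's range (bottom, top), all values in semitones.
    bottom, top = speaker_range_st
    slope = abs(end_st - start_st) / duration_s
    if slope >= glissando_threshold(duration_s):
        return "R" if end_st > start_st else "F"
    rel = (0.5 * (start_st + end_st) - bottom) / (top - bottom)
    for label, limit in (("B", 0.1), ("L", 0.35), ("M", 0.65), ("H", 0.9)):
        if rel < limit:
            return label
    return "T"

# e.g. a 180 ms nucleus rising from 88 to 93 ST within a speaker range of 84-96 ST
print(pitch_symbol(88.0, 93.0, 0.18, (84.0, 96.0)))   # -> R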


FIGURE 4.5  Prosograms of perceived pitch (in semitones, ST, top) with symbolic representations of pitch height added (bottom) for Et j’ai donc été à l’école Remington apprendre la sténodactylo (extract from Mertens  2011).

FIGURE 4.6  Prosograms of perceived pitch (in semitones, ST, top) with symbolic representations of pitch height added (bottom) for The North Wind and the Sun were disputing which was the stronger when a traveler came along wrapped in a warm  cloak.

Clear strengths of the system are that it takes into account acoustic parameters other than F0, while ensuring robustness by taking the syllable (or syllable nucleus) as the minimal unit as input for a (semi-)automatically generated symbolic transcription. Segmentation is error-prone when intensity at phrasal edges is low, while manual segmentation and segmentation corrections can be quite costly. Another potential weakness is that the system is not geared towards the analysis of prosodic boundaries and prominence relations. It remains to be seen whether the system is truly language-independent. In principle, if all but only auditory perceptual information is represented, language-specific effects on the interpretation of acoustic cues to suprasegmental features should be taken into account as well as effects that are attributable to the human auditory system.


4.4.3  Target-Based Transcription Systems and Intonational Phenomena

4.4.3.1  ToBI

The ToBI (Tones and Break Indices) transcription system was originally developed to transcribe intonation and prosodic phrasing in American English (Silverman et al. 1992; Beckman et al. 2005). The system is couched in the Autosegmental-Metrical framework (e.g. Bruce 1977; Pierrehumbert 1980) and was based on the analysis of American English proposed by Pierrehumbert (1980) and Pierrehumbert and Beckman (1988). It has since been adapted to many other languages and language varieties (Jun 2005a; 2014; Prieto and Roseano 2010), which has led to the introduction of some additional features to account for the diversity of the prosodic systems under study (e.g. separation of transcription tiers for underlying and surface phonology in Korean: Jun 2005b). In the original ToBI system, transcription takes place on four tiers (see Beckman et al. 2005; and also http://www.ling.ohio-state.edu/~tobi/ame_tobi/annotation_conventions.html; last accessed 16/09/2013):
• a tonal tier, which gives the pitch accents, phrase accents, and boundary tones that are realized in the speech stream (their number and nature depends on the intonational system of the language);
• a break index tier, which gives a rating of the degree of juncture between words and phrases (5 levels);
• an orthographic tier, which gives a transcription of orthographic words and phenomena such as filled pauses (the system does not impose any particular conventions);
• a miscellaneous tier for transcriber comments.
As the example in Figure 4.7 shows, the transcription tiers are time-aligned with the sound wave and the fundamental frequency trace of the speech sample, where each element on each tier receives its own time stamp(s). Pitch accents consist of one- or two-term labels (H*, L*, L*+H, L+H* and H+!H* for American English) in which the tone that is associated with the accented syllable is marked by a star (but see Arvaniti et al. 2000 for a discussion). Other pitch accent types, including three-tonal labels, have been introduced to transcribe pitch accents in other languages (see Jun 2005a). Downstep, the lowering of a high tone relative to a preceding high tone, is marked by placing the diacritic ! immediately before the high tone that is lowered according to the analysis. Phrase accents mark intermediate phrase boundaries (labelled 3 or higher on the break index tier) and can be High (H-) or Low (L-), and H- tones can also be downstepped. Boundary tones mark the edges of Intonation Phrases (Break Index level 4), and they can also be High (H%) or Low (L%). The Break Index values described in the original ToBI guidelines (http://www.ling.ohio-state.edu/~tobi/ame_tobi/annotation_conventions.html) are intended to


FIGURE 4.7  Time-aligned ToBI transcription for the utterance The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm  cloak.

represent the degree of juncture realized, but they are partly determined with reference to the morphosyntactic structure. For instance, a value of 0 is used for cases of clear phonetic marks of clitic groups, e.g. the medial affricate in contractions of did you or a flap as in got it, while level 1 is defined with reference to the word alone (most phrase-medial word boundaries), and levels 3 and 4 are defined with reference to intonational realization alone (e.g. 3 is marked by a single phrase tone affecting the region from the last pitch accent to the boundary). Disfluencies are also marked at this level. Various diacritics can be used to indicate uncertainty about the strength of a juncture or a tone in the relevant tiers. Wherever these conventions are followed, there is considerable redundancy among tiers (e.g. between boundary tone labels and break index label 4, between phrase tone labels and break index label 3, and between break index locations and the orthographic tier). As the authors of the guidelines point out, routines can be used for the automatic generation of redundant labels, which will improve consistency and save transcriber time (http://www.ling.ohio-state.edu/~tobi/ame_tobi/annotation_conventions.html). A more general issue with Break Indices is that they have proved quite elusive when determined on the basis of prosodic realization alone, resulting in low transcriber agreement. External criteria can be invoked to improve consistency, but in that case the question arises whether that makes transcribing Break Indices altogether superfluous. The example in Figure 4.7 also shows that transcription takes place at a phonemic level of representation, i.e. the symbols and segmentation units represent phrasal and intonational elements that function contrastively in the language that is being transcribed. This implies that the symbols and units used to transcribe a specific data set are

language-specific and, conversely, that the transcriber will need to work with a known set of symbols which represent the primitives of the prosodic system of that language. The second implication of transcription at a phonemic level is that the transcription symbols and segmentation units that are used cannot be theory-neutral, since they are in effect used to provide a phonological analysis of the data. That is, in using the ToBI transcription system one not only considers that the underlying principles of the Autosegmental-Metrical framework hold for the language that is being transcribed (phrasing and intonation are considered to be closely intertwined, and accented syllables and phrasal edges function as the loci of contrast; turning points rather than holistic contours are the primitives of analysis, etc.), but also accepts the language-specific phonological generalizations made in the definition of the symbols (phonotactic constraints that may apply to tonal configurations, or constraints on the syntax–phonology mapping which determine the permissible prosodic phrasing structures in the language). The language-specific, phonological nature of the transcription system can be considered as a strength when a linguistic analysis of the prosodic phrasing and the intonational events in the speech sample is required by the objectives of the transcription, and if the language at issue has been successfully analysed in an Autosegmental-Metrical framework. Since transcribers are forced to choose between contrasting categories, not only the form but also the linguistic interpretation of the form are simultaneously encoded. For instance, in a language that has a contrast between L+H* and L*+H, a rising pitch movement that is associated with an accented syllable has to be either L+H* or L*+H. By choosing one over the other, the transcriber records not only a possible difference in form (e.g. in the timing of the targets relative to the accented syllable) but also the associated difference in meaning. ToBI is not suitable for the transcription of suprasegmental phenomena other than phrasing and intonation. For instance, metrical or rhythmic structure may be realized in the signal by means of durational cues instead of pitch accents, or suprasegmental cues may occur in the signal for units that are larger than the Intonation Phrase (e.g. conversational turns). These phenomena may be phonological in nature, and they may not be marked by a specific tonal configuration, but in any case ToBI was not designed to deal with such data. The phonological nature of the system can be considered a weakness in the context of data sets in which the phonological system of the speaker is in flux or (partly) unknown, such as first and second language learner data, pathological speech data, and data from languages or language varieties that have not been fully analysed. Such data sets can only be transcribed in systems that allow for transcriptions at a phonetic level (see Gut, this volume), although a ToBI-style system can be used insightfully to draw up hypotheses about a speaker’s suprasegmental phonology.
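The guidelines’ remark that redundant labels can be generated automatically (for instance, deriving break indices from the tonal tier) can be illustrated with a small sketch. This is not an official ToBI tool: the data structure and the function name are illustrative assumptions, and only the label conventions described above are taken from the system.

def implied_break_index(tones_at_boundary):
    # Break index implied by the tones time-aligned with a word boundary: a boundary tone
    # (e.g. H%, L%) marks an Intonation Phrase edge (4); a phrase accent (H-, L-, !H-) marks
    # an intermediate phrase boundary (3 or higher; 3 is returned here); otherwise there is
    # no tonal evidence and the value is left to the transcriber.
    if any(t.endswith("%") for t in tones_at_boundary):
        return 4
    if any(t in ("H-", "L-", "!H-") for t in tones_at_boundary):
        return 3
    return None

boundaries = {"stronger": ["L-", "L%"], "disputing": ["L-"], "Wind": []}
for word, tones in boundaries.items():
    print(word, implied_break_index(tones))   # stronger 4, disputing 3, Wind None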

4.4.3.2  IViE / IV

The IViE system was developed for the transcription of intonational variation in English spoken in the British Isles for which the phonological systems are unknown (Grabe


FIGURE 4.8 Time-aligned IV transcription of The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm  cloak.

et al. 2001; see Nolan and Post, this volume). Originally modelled on ToBI, it offers three transcription tiers to record the prosodic information in the speech signal, separating auditory phonetic and abstract phonological levels of representation for the transcription of intonation, in addition to a rhythmic tier on which perceptual prominence and phrasing can be marked. The three tiers are motivated by research on crossvarietal differences in English intonation, which shows that they can differ in (i) the location of rhythmic prominences, (ii) the phonetic realization of particular pitch accents, and (iii) the inventory of contrasting intonation contours (Grabe and Post 2004; Grabe 2004). IV is an adaptation of IViE that can be used for other languages, and which includes a fourth prosodic tier, which allows for the transcription of global intonational events that operate across sequences of phrases or utterances (e.g. register shift at topic boundaries; see Post et al. 2006; Delais-Roussarie et al. 2006). The IV template consists of six tiers, as shown in Figure 4.8. On the highest tier, miscellaneous comments and alternative transcriptions can be recorded. On the lowest, the text of the utterance is transcribed, word by word, as in ToBI. Labellers begin a transcription by filling the orthographic tier. The middle tiers provide information about prosody. One (just above the orthographic tier) is dedicated to encode the location of rhythmically prominent syllables and boundaries. No distinction is made here between lexically stressed and intonationally accented syllables, and unlike in ToBI, no distinction in boundary weight is made. Hesitations and interruptions that affect the rhythmic properties of the utterance are also marked here. In the following tier (third tier from the lowest), the pitch movement surrounding rhythmically prominent syllables is determined, which can capture auditory phonetic differences in implementation between languages or

FIGURE 4.9  Different phonetic implementations of the same phonological category in British English transcribed on the local phonetic tier:  a nuclear fall (H*L %) is compressed in Cambridge and truncated in Liverpool (read passage data from the IViE project).

dialects of languages. For instance, in Cambridge English, pitch movements are compressed when they are realized on IP-final words with little scope for voicing, but they are truncated in Liverpool English in the same context (Grabe et al. 2000), as is illustrated in Figure 4.9. Figure 4.9 shows how a difference in phonetic realization is transcribed on the phonetic tier in IViE, while the symbols on the other prosodic tiers are the same for the two dialects in the figure. The realization of pitch accents is transcribed within Implementation Domains or IDs. In English, an ID contains (i) the preaccentual syllable, (ii) the accented syllable, and (iii) any following unaccented syllables (if any) up to the next accented syllable. Hence, IDs overlap by one syllable. Pitch is transcribed on three levels: h(igh), m(id), and l(ow). The levels are transcribed relative to each other, with capital letters indicating the pitch level of the prominent syllable relative to that of the surrounding syllables. Finally, on the fourth tier, and only after rhythmic structure and pitch movement have been transcribed, phonological generalizations are made and noted on the phonological tier. Labels are based on existing Autosegmental-Metrical models of the language in question, making the tier language-specific as well as theory-specific (e.g. Gussenhoven’s analysis of British English is adopted for the standard variety spoken in southern England: Gussenhoven 2004). Symbolically representing auditory phonetic suprasegmental information has the advantage that phenomena can be captured and quantified even when their linguistic relevance is not yet established. This can be advantageous in the analysis of crosslinguistic or crossdialectal phenomena like truncation and compression in intonational implementation, but also when the transcriber is confronted with learner data and pathological speech, for which surface realizations can be observed, but the underlying phonological system cannot. Since phonological interpretation can follow auditory interpretation, the intonational system does not need to be fully understood for the transcription to be possible. Figure 4.10 illustrates how the non-target-like production of a

FIGURE 4.10 IViE transcriptions of the intonation contours produced in the utterance You are what you are by a native speaker of standard Southern British English (top panel) and a Spanish L2 learner of English (bottom  panel).

Spanish L2 learner of English can be transcribed in IViE. A comparison of the realization of the first pitch accent produced by the native speaker and the learner shows a clear difference in realization in what globally looks like an overall rising–falling contour, with a peak that is timed much earlier in the accented syllable are for the former than for the latter. Also, the contour continues to rise after the accented syllable in the L2 English example, while it falls in the native speaker’s speech. These differences are transcribed on the local phonetic tier, while the phonological tier shows possible phonological analyses for the contours that are observed. The comments tier is used here to highlight the fact that the label on the phonological tier is tentative at this stage. It indicates that the learner’s intonation contour could either be a straightforward case of transfer of a phonological category from Spanish to English—since L+H* rises are very common in this position in Spanish, but they do not occur in this native variety of English—or that the pitch accent has been phonetically implemented in an un-target-like way, with an unusually late peak. Another strength of the system is that the dissociation of intonational and prominence tiers allows for the transcription of prominent syllables (and prosodic boundaries) which are not marked by a pitch movement. In addition, the multilinear time-aligned transcription of prosodic events at different levels allows for the integration of phonetic, phonological, and rhythmic/prosodic information in the transcription. However, the IV(iE) system also has the limitation that it relies on segmentation into Intonation Phrases and other intonational units, which are language-dependent, theory-internally defined, and not always robust because they are not always easy to identify. Moreover, like ToBI and the IPA, it is a manual system, which makes transcription time-consuming and liable to inconsistencies and errors.
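As a small illustration of the Implementation Domains described above (each domain runs from the pre-accentual syllable up to the syllable before the next accent, so that successive domains share one syllable), consider the following sketch; the input format and the function name are assumptions made for the example.

def implementation_domains(syllables, accented):
    # syllables: list of syllable strings; accented: sorted indices of accented syllables.
    domains = []
    for n, i in enumerate(accented):
        start = max(i - 1, 0)                                   # pre-accentual syllable
        end = accented[n + 1] - 1 if n + 1 < len(accented) else len(syllables) - 1
        domains.append(syllables[start:end + 1])
    return domains

sylls = ["the", "ba", "na", "na", "from", "gua", "te", "ma", "la"]
print(implementation_domains(sylls, accented=[2, 7]))
# -> [['ba', 'na', 'na', 'from', 'gua', 'te'], ['te', 'ma', 'la']]  (the domains overlap in 'te')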

4.4.3.3  Momel-INTSINT

Momel-INTSINT is a semi-automatic target-based transcription system. It generates a transcription of the intonation contour based on a series of turning points, which are automatically calculated from the fundamental frequency by the Momel algorithm (Hirst and Espesser 1993; Hirst et al. 2000). By discarding all microprosodic F0 variation in the signal (i.e. all the F0 variations that result from the phonetic characteristics of the various segments and apply under the syllable level), and linking the peaks and valleys that it detects in the derived curve (i.e. the turning points), the algorithm transforms the discontinuous raw F0 curve into a continuous curve, which is intended to function as its auditory equivalent. Subsequently, each turning point is automatically assigned one of the eight INTSINT symbols given in (16) (detection errors have to be corrected manually).

(16)  INTSINT symbols
symbols representing absolute values defined relative to speaker register: Top (T), Mid (M), Bottom (B);
symbols representing relative values defined locally with reference to preceding values: Higher (H), Same (S), or Lower (L);


symbols representing small changes in relative values defined locally with reference to preceding values: Upstepped (U) or Downstepped (D). The INTSINT symbols are designed to represent intonational events at the surface phonological level of description, i.e. they are intended to transcribe all and only linguistically relevant pitch information. Figure 4.11 gives the original F0 trace, while Figure 4.12 gives the Momel-derived curve, the turning points, and the associated INTSINT symbols for the same utterance. The example shows that most of the turning points are located in syllables which function as the locus of an intonational contrast (usually accented syllables or syllables at phrase boundaries, e.g. a rising accent (H) as opposed to a falling one (L) on North in the example), but some turning points occur in positions which do not typically carry meaningful changes in pitch (e.g. L on the). Momel-INTSINT’s main strength is that it allows large quantities of speech data to be annotated semi-automatically, requiring only manually inserted Intonation Units as input. Since equating Intonation Units to interpausal units usually gives good results (where a pause is defined as a silent interval of more than 200 ms), and pauses can be detected automatically, very little manual labelling is required with this system in practice. This makes the system very robust, since automatically generated transcriptions leave no room for human error. The second advantage is that both the naturalness and appropriateness of the smoothed Momel curves and the INTSINT encoding can be empirically tested, because Momel turning points can be translated into INTSINT symbols and vice versa. This provides an easily accessible and powerful tool to test, for instance, the perceptual effects of

FIGURE 4.11  F0 curve associated with the utterance The North Wind and the Sun were disputing which was the stronger, when a traveler came along wrapped in a warm  cloak.

FIGURE 4.12  F0 trace, Momel pitch curve, turning points, and INTSINT labels for the same utterance

the tonal sequences that are theoretically allowed in INTSINT—and hence their phonotactic validity. The third and probably most interesting advantage from the point of view of the analysis of phonological corpora is that the transcription of the suprasegmental information can take place without recourse to the phonological system of the language in question. This characteristic of Momel-INTSINT facilitates crosslinguistic typological comparisons (see Hirst and Di Cristo 1998). However, since accented syllables and phrasal boundaries other than those of the Intonation Unit are not always detected and identified, the system may not be suitable for research that involves prosodic phrasing and prominence. Perhaps more importantly, as with Mertens’ system, the question arises to what extent the system is a sophisticated F0 stylization tool rather than a properly discrete symbolic annotation system that reflects all and only the linguistically relevant information in the signal.17 The linguistic status of the INTSINT symbols as surface phonological tones raises a number of theoretical issues. If the symbols represent phonological tones, they should be discrete, and they should combine into configurations which carry contrastive linguistic meaning (e.g. Gussenhoven 2004). Although they clearly meet the former requirement, it is not the case that an automatically derived tonal sequence, which represents perceptually relevant changes in pitch rather than contrastively meaningful changes in pitch, necessarily meets the second requirement. For instance, in some cases in which an M is generated instead of a B it is not clear that substitution of the former

17  Unlike Mertens’ system, Momel-INTSINT only takes macroprosodic effects into account and not more general perceptual effects.


with the latter would lead to a different interpretation of the signal.18 In fact, one could argue that the transcriber’s linguistic brain is required to determine whether a variant is phonologically contrastive or not, especially when the language has not been prosodically analysed before.
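To make the symbol inventory in (16) concrete, the sketch below assigns symbols to a sequence of Momel-style targets. It is emphatically not the published INTSINT coding procedure, which optimizes the speaker's key and range over a whole passage; the thresholds, the decision order, and the function name are illustrative assumptions.

import math

def intsint_symbols(targets_hz, key_hz, range_st=12.0, same=0.5, small=1.5):
    # Code each target: T/B when it reaches the top/bottom of the assumed register,
    # M for the first target otherwise, and H/S/L (or U/D for small steps) relative
    # to the preceding target for the rest; all values in semitones re the key.
    st = [12 * math.log2(f / key_hz) for f in targets_hz]
    top, bottom = range_st / 2, -range_st / 2
    out = []
    for i, x in enumerate(st):
        if x >= top:
            out.append("T")
        elif x <= bottom:
            out.append("B")
        elif i == 0:
            out.append("M")
        else:
            d = x - st[i - 1]
            if abs(d) <= same:
                out.append("S")
            elif abs(d) <= small:
                out.append("U" if d > 0 else "D")
            else:
                out.append("H" if d > 0 else "L")
    return out

# e.g. five turning points of a declarative contour, assuming a key of 180 Hz
print(intsint_symbols([180, 240, 210, 190, 115], key_hz=180))   # -> ['M', 'H', 'L', 'L', 'B']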

4.5 Conclusion

We have presented various systems that are commonly used to provide a discrete symbolic representation of the phonetic and phonological content of a speech sample. Such a representation abstracts away from the signal in all its complexity for several reasons. First, it usually provides information on the linguistically relevant events that occur in the speech signal, leaving aside any other elements. Secondly, it relies on a segmentation of the speech continuum into intervals to which labels are assigned, segments and labels being generally defined on a language-specific basis, and theory-internally. The systems most commonly used to provide information about the segmental content of the signal presented here are the orthographic system and the IPA. Both are clearly language-dependent and phonological by nature, in the sense that any segmental transcription relies on an interpretation of the linguistic content of the signal. Any orthographic transcription has several advantages, among which we may mention its readability, which facilitates data exchange. It also has some limitations: it cannot account for how exactly the words were pronounced by the speakers. By contrast, the IPA (and SAMPA) provides representations that can be more detailed depending on the level at which data have been transcribed (phonemic vs narrow phonetic). This is clearly one of its main advantages. Note, however, that broad phonemic and narrow phonetic transcriptions are difficult to achieve in terms of accuracy, and are very time-consuming. To our knowledge, no systematic empirical studies have been carried out to evaluate intertranscriber agreement in the context of atypical data (regional varieties, L1 acquisition data, L2 learner data, pathological speech, etc.). Several systems have been proposed to encode suprasegmental features. In contradistinction to systems used to encode segmental information, prosodic transcription systems are not necessarily language-dependent, in particular when providing an abstract representation of the tonal features as in INTSINT and in the symbolic extension of Prosogram (Mertens 2011). This results from the fact that the representations provided by these systems consist of discrete and stylized versions of the melodic curve. As soon as prosodic transcriptions represent phrasing and prominence structures, in addition to tonal events, they are language-dependent and, to a certain extent, theoretically driven, as they rely on analyses of the relation between phrasing, accentuation, and tonal events.

18  The same objection can be levelled at ToBI and IViE/IV to the extent that any tonal configuration that is allowed by the transcription system will need to be shown to be legal and contrastive according to the intonational grammar of the language that is being transcribed.

In any case, when one wants to annotate phonological or phonetic features in a corpus, one has to keep in mind that an ideal system does not exist either at the segmental or at the suprasegmental level. They all have their strengths and weaknesses, or advantages and disadvantages, depending on research contexts and objectives. To choose one system over the other, it is crucial to evaluate which level of analysis is required (phonetic vs phonological), which units are relevant for the research question at hand, and which types of label and representation are the most adequate in the context of the research.

Acknowledgements

We are very grateful to Ulrike Gut and an anonymous reviewer for their helpful comments on an earlier version of this chapter. This work was supported by a joint project grant from the British Academy and the CNRS (‘A Transcription System for Prosody and Phonological Modelling’). The first author would also like to acknowledge support from the Labex EFL, “Empirical Foundations in Linguistics” (ANR/CGI), and the second author support from the ESRC (‘Categories and Gradience in Intonation’, RES-061-25-0347).

CHAPTER 5

ON AUTOMATIC PHONOLOGICAL TRANSCRIPTION OF SPEECH CORPORA

HELMER STRIK AND CATIA CUCCHIARINI

5.1 Introduction

Within the framework of Corpus Phonology, spoken language corpora are used for conducting research on speakers’ and listeners’ knowledge of the sound system of their native languages, and on the laws underlying such sound systems as well as their role in first and second language acquisition. Many of these studies require a phonological annotation of the speech data contained in the corpus. The present chapter discusses how such phonological annotations can be obtained (semi-)automatically. An important distinction that should be drawn is whether or not the audio signals come with a verbatim transcription (orthographic transcription) of the spoken utterances. If an orthographic annotation is available, the (semi-)automatic phonological annotation could in theory be derived directly from the verbatim transcription through a simple conversion procedure without an automatic analysis of the original speech signals. In this case, the strings of symbols representing the graphemes in words are replaced by corresponding strings of symbols representing phonemes. This can be achieved by resorting to a grapheme–phoneme conversion algorithm, through a lexicon look-up procedure in which each individual word is replaced by its phonological representation as found in a pronunciation dictionary or by a combination of the two (Binnenpoorte 2006). The term ‘phonological representation’ is often used for this kind of annotation, as suggested by Oostdijk and Boves (2008: 650). It is important to realize that such a phonological representation does not provide information on the speech sounds that were actually realized, but relies on what is already known about the possible

ways in which words can be pronounced (pronunciation variants). Such pronunciation variants may also refer to sandhi phenomena and can be used to model processes such as cross-word assimilation or phoneme intrusions, but the choice of the variants will be based on phonological knowledge and not on an analysis of the speech signals. Alternatively, a phonological annotation of the words in a spoken language corpus can be obtained automatically through an analysis of the original speech signals by means of automatic speech recognition algorithms. In this case, previously trained acoustic models of the phonemes to be identified are used together with the speech signal and the corresponding orthographic transcription, if available, to provide the most likely string of phonemes that reflects the speech sounds that were actually realized. In this chapter we will reserve the term ‘automatic phonological transcription’ for this latter type of analysis, as suggested by Oostdijk and Boves (2008: 650). It is also this form of phonological annotation, and the various (semi-)automatic procedures to obtain, evaluate, and optimize them, that will be the focus of the present chapter. Phonological transcriptions have long been used in linguistic research, for both explorative and hypothesis testing purposes. More recently, phonological transcriptions have proven to be very useful for speech technology too, for example, for automatic speech recognition and for speech synthesis. In addition, the development of multi-purpose speech corpora that we have witnessed in the last decades—e.g. TIMIT (Zue et al. 1990), Switchboard (Godfrey et al. 1992), Verbmobil (Hess et al. 1995), the Spoken Dutch Corpus (Oostdijk 2002), the Corpus of Spontaneous Japanese (Maekawa 2003), Buckeye (Pitt et al. 2005), and ‘German Today’ (Brinckmann et al. 2008)—has underlined the importance of phonological transcriptions of speech data, because these considerably increase the value of such corpora for scientific research and application development. Both orthographic transcriptions and phonological transcriptions are known to be time-consuming and costly. In general, the more detailed the transcription, the higher the cost. Orthographic transcriptions appear to be produced at speeds varying from three to five times real-time, depending on the quality of the recording, the speech style, the quality of the transcription desired, and the skill of the transcriber (Hazen 2006). Highly accurate transcriptions that account for all speech events (filled pauses, partial words, etc.) as well as other meta-data (speaker identities and changes, non-speech artefacts and noises, etc.) can take up to 50 times real-time depending on the nature of the data and the level of detail of the meta-data (Barras et al. 2001; Strassel and Glennky 2004). Making phonological transcriptions from scratch can take up to 50–60 times real-time, because in this case transcribers have to compose the transcription and choose a symbol for every single speech sound. An alternative, less time-consuming procedure consists in having transcribers correct an example transcription, i.e. a transcription of the word in question taken from a lexicon or a dictionary, which transcribers can edit and improve after having listened to the corresponding utterance. Both for Switchboard and the Spoken Dutch Corpus, transcription costs were restricted by presenting trained students with an example transcription. The students were asked to


verify this transcription rather than transcribing from scratch (Greenberg et al. 1996; Goddijn and Binnenpoorte 2003). Although such a check-and-correct procedure is very attractive in terms of cost reduction, it has been suggested that it may bias the resulting transcriptions towards the example transcription (Binnenpoorte 2006). In addition, the costs involved in such a procedure are still quite substantial. Demuynck et al. (2002) reported that the manual verification process took 15 minutes for one minute of speech recorded in formal lectures and 40 minutes for one minute of spontaneous speech. Because of the problems involved in obtaining phonological transcriptions—the time required, the high costs incurred, the often limited accuracy obtained, and the need to transcribe large amounts of data—researchers have been looking for ways of automating this process, for example by employing speech recognition algorithms. The advantages of automatic phonological transcriptions become really evident when it comes to exploring large speech databases. First, because automatic phonological transcriptions make it possible to achieve uniformity in transcription. With a manual transcription this aim would be utopian: large amounts of speech data cannot possibly be transcribed by one person, and the more transcribers are involved, the less uniform the transcriptions are going to be. Eliminating part of this subjectivity in transcriptions can be very advantageous, especially when analysing large amounts of data. Second, because with automatic methods it is possible to generate phonological transcriptions of huge amounts of data that would otherwise remain unexplored. The fact that large amounts of material can be analysed in a relatively short time, and with relatively low costs, makes automatic phonological transcription even more interesting. The importance of this aspect for the generalizability of the results cannot be overestimated. And although the automatic procedures used to generate automatic phonological transcriptions are not infallible, the advantages of a very large dataset might very well outweigh the errors introduced by the mistakes the automatic procedures make. In this chapter we provide an overview of the state of the art in automatic phonological transcription, paying special attention to the most relevant methodological issues and the ways they have been approached.
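The lexicon look-up procedure described at the start of this chapter (replacing each orthographic word by a canonical transcription, with a grapheme–phoneme converter as a fallback for out-of-vocabulary words) can be sketched as follows; the toy SAMPA-like lexicon and the function names are illustrative assumptions, not taken from any of the corpora discussed here.

LEXICON = {            # orthographic word -> canonical (SAMPA-like) transcription
    "het": "h @ t",
    "is": "I s",
    "goed": "x u t",
}

def phonological_representation(utterance, g2p=lambda w: "<oov:%s>" % w):
    # Look every word up in the pronunciation lexicon; back off to a grapheme-phoneme
    # converter (here a dummy placeholder) for words that are not in the lexicon.
    return " | ".join(LEXICON.get(w, g2p(w)) for w in utterance.lower().split())

print(phonological_representation("het is goed"))   # -> h @ t | I s | x u t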

5.1.1  Types of Phonological Transcription

Before we discuss the various ways in which phonological transcriptions can be obtained (semi-)automatically, it is worth clarifying a number of terms that will be used in the remainder of this chapter. First of all, a distinction should be drawn between segmental and suprasegmental phonological transcriptions. In this chapter, focus will be on the former, and in particular on (semi-)automatic ways of obtaining segmental annotations of speech; but this is not to say that there has been no research on (semi-)automatic transcription of suprasegmental processes. For instance, important work in this direction was carried out in the 1990s within the framework of the Multext project (Gibbon and Llisterri 1994; Llisterri 1996), and more recently by Mertens (2004b), Tamburini and

Caini (2005), Obin et al. (2009), Avanzi, Lacheret et al. (2010), and Lacheret et al. (2010). A discussion of these approaches is, however, beyond the scope of this chapter. Even within the category of segmental phonological transcriptions, different types can be distinguished depending on the symbols used and the degree of detail recorded in the transcriptions. In the literature, various classifications have been provided on the basis of these parameters (for a brief overview, see Cucchiarini 1993). With respect to the notation symbols, in this chapter we will be concerned only with alphabetic notations, in particular with computer-readable notations, as this is a prerequisite for automatic transcription. Over the years different computer-readable notation systems have been developed. A widely used one in Europe is SAMPA (Wells 1997; http://www.phon.ucl.ac.uk/home/sampa/), a mapping between symbols of the International Phonetic Alphabet and ASCII codes, which was established through a consultation process among international speech researchers. X-SAMPA is an extended version of SAMPA intended to cover every symbol of the IPA Chart so as to make it possible to provide a machine-readable phonetic transcription for every known human language. However, many other systems have been introduced. Arpabet was developed by the Advanced Research Projects Agency (ARPA) and consists of a mapping between the phonemes of General American English and ASCII characters. Worldbet (Hieronymus 1994) is a more extended mapping between ASCII codes and IPA symbols intended to cover a wider set of the world’s languages. Besides the computer phonetic alphabets mentioned here, many others do exist (see e.g. Hieronymus 1994; Wells 1997; EAGLES 1996; Draxler 2005). With regard to the degree of detail to be recorded in transcription, it appears that in general two types of transcription are distinguished: broad phonetic, or phonemic, and narrow phonetic, or allophonic. A broad phonetic transcription indicates only the distinctive units of an utterance, thus presupposing knowledge of the precise realization of the sounds transcribed. A narrow phonetic transcription attempts to provide such details. For transcriptions made by human transcribers, it holds that the more detailed the transcription, the more time-consuming and costly it will be. In addition, more detailed transcriptions are likely to be less consistent. Also, for automatically generated transcriptions it holds that recording more details requires more effort, albeit not in the same way and to the same extent as for manually generated transcriptions. The type of phonological transcription that is generally contained in spoken language corpora is broad phonetic, although narrow transcriptions are sometimes also provided, for example, in the Switchboard corpus (Greenberg et al. 1996). In general, the degree of detail required of a transcription will essentially depend on the aim for which the transcription is made. In the case of multi-purpose corpora of spoken language, it is therefore often decided to adopt a broad phonetic transcription, and to make it possible for individual users to add the details that are relevant for their own research at a later stage. An important question that is partly related to the degree of detail recorded in phonological transcription concerns the accuracy of transcriptions. As a matter of fact, there


is a trade-off relation between degree of detail and accuracy (Gut and Bayerl 2004), defined in terms of reliability and validity (see section 1.3): the more details the transcription contains, the less likely it is to be accurate. This may imply that, although a high level of detail may be considered essential for a specific linguistic study, for instance on degree of devoicing or aspiration, it may nevertheless turn out to be extremely difficult or impossible to obtain transcriptions of that specific phenomenon that are sufficiently accurate. This brings us to another important aspect of phonological transcription in general and of automatic phonological transcription in particular: that of its accuracy. This will be discussed below.

5.1.2  Evaluation of Phonological Transcriptions

Before phonological transcriptions can be used for research or applications, it is important to know how accurate they are. The problem of transcription quality assessment is not new, since for manual phonological transcriptions it is also important to know how accurate they are before using them for research or applications (Shriberg and Lof 1991; Cucchiarini 1993; Cucchiarini 1996; Wesenick and Kipp 1996). Phonological transcriptions, whether they are obtained automatically or produced by human transcribers, are generally used as a basis for further processing (research, ASR (automatic speech recognition) training, etc.). They can be viewed as representations or measurements of the speech signal, and it is therefore legitimate to ask to what extent they achieve the standards of reliability and validity that are required of any form of measurement. With respect to automatic transcriptions, the problem of quality assessment is complex because comparison with human performance, which is customary in many fields, is not straightforward, owing to the subjectivity of human transcriptions and to a series of methodologically complex issues that will be explained below.

5.1.3  Reliability and Validity of Phonological Transcriptions

In general terms, the reliability of a measuring instrument represents the degree of consistency observed between repeated measurements of the same object made with that instrument. It is an indication of the degree of accuracy of a measuring device. Validity, on the other hand, is concerned with whether the instrument measures what it purports to measure. In fact, the definitions of reliability and validity used in test theory are much more complex and will not be treated in this chapter. The description provided above indicates an important difference between the reliability of human-made as opposed to automatic transcriptions, and is related to the fact that human transcriptions suffer from intra-subject and inter-subject variation, and repeated measurements of the same object will differ from each other. With automatic transcriptions this can be prevented, because a machine can be programmed in such a way that repeated measurements of the same object always give the same

result, thus yielding a reliability coefficient of 1, the highest possible. It follows that with respect to the quality of automatic transcription, only one (albeit not trivial) question needs to be answered, viz. that concerning validity.

5.1.4  Defining a Reference Phonological Transcription

The description of validity given above suggests that any validation activity implies the existence of a correct representation of what is to be measured, a so-called benchmark or ‘true’ criterion score (as in test theory), a gold standard. The difficulties in obtaining such a benchmark transcription are well known, and it is generally acknowledged that there is no absolute truth of the matter as to what phones a speaker produced in an utterance (Cucchiarini 1993; 1996; Wester et al. 2001). For instance, in an experiment we asked nine experienced listeners to judge whether a phone was present or not for 467 cases (Wester et al. 2001). The results showed that all nine listeners agreed in only 246 of the 467 cases, which is less than 53 per cent (see section 2.2.2.2). Furthermore, a substantial amount of variation was observed between the nine listeners. The values of Cohen’s kappa varied from 0.49 to 0.73 for the various listener pairs. It follows that one cannot establish the validity of an automatic transcription simply by comparing it with an arbitrarily chosen human transcription, because the latter would inevitably contain errors. Unfortunately, this seems to be the practice in many studies on automatic transcription. To try as much as possible to circumvent the problems due to the lack of a reference point, different procedures have been devised to obtain reference transcriptions. One possibility consists of using a consensus transcription, which is a transcription made by at least two experienced phoneticians after having reached a consensus on each symbol contained in the transcript (Shriberg et al. 1984). The fact that different transcribers are involved and that they have to reach a consensus before writing down the symbols can be seen as an attempt to minimize errors of measurement, thus approaching ‘true’ criterion scores. Another option is to have more than one transcriber transcribe the material and to use only that part of the material for which all transcribers agree, or at least the majority of them (Kuipers and van Donselaar 1997; Kessens et al. 1998).
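For reference, Cohen’s kappa corrects the raw proportion of agreement p_o for the proportion of agreement p_e expected by chance:

κ = (p_o − p_e) / (1 − p_e)

The listener-pair values of 0.49–0.73 reported above therefore indicate agreement well above chance, but clearly below the ceiling of 1.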

5.1.5  Comparing Phonological Transcriptions

Another issue that has to be defined in automatic phonological transcription is how to determine whether the quality of a given transcription is satisfactory. Once a reference phonological transcription has been defined, the obvious choice would be to carry out some sort of alignment between the reference phonological transcription and the automatic phonological transcription, with a view to determining a distance measure which will also provide a measure of transcription quality.


For this purpose, dynamic programming (DP) algorithms with different weightings have been used by various authors (Wagner and Fischer 1974; Hanna et al. 1999; Picone et al. 1986). Several of these DP algorithms are compared in Kondrak and Sherif (2006). In the current chapter we will refer to dynamic programming, agreement scores, error rates, and related issues. Some explanation of these issues is provided here. A standard (simple) DP algorithm is one in which the penalty for an insertion, deletion, or substitution is 1. However, when using such DP algorithms we often obtained suboptimal alignments, like the following example:



Tref = /A m s t @ R d A m/
Tbu = /A m s # @ t a: n #/   (# = insertion)



For this reason, we decided to make use of a more sophisticated DP alignment procedure. In this second DP algorithm, the distance between two phones is not just 0 (when they are identical) or 1 (when they are not identical), but more gradual: it is calculated on the basis of the articulatory features defining the speech sounds the symbols stand for; more details about this DP algorithm can be found in Cucchiarini (1993: 96) and Elffers et al. (2005). Using this second DP algorithm the following alignment was found for the example mentioned above:



Tref = /A m s t @ R d A m /
Tbu  = /A m s # @ # t a: n /



It is obvious that the second alignment is better than the first. Since, in general, the alignments obtained with DP algorithm 2 were more plausible than those obtained with DP algorithm 1, DP algorithm 2 was used to determine the alignments. Similar algorithms have been proposed, and are used by others. These dynamic programming algorithms can be used to align not only automatic phonological transcriptions and reference phonological transcriptions, but also other phonological transcriptions. Besides being used to assess the quality of phonological transcriptions, they can be also used to study in what respects the compared phonological transcriptions deviate from each other, and to obtain information on pronunciation variation. For instance, our DP algorithms compare the two transcriptions and return various data such as an overall distance measure, the number of insertions, deletions, and substitutions of phonemes, and more detailed data indicating to which features substitutions are related. This kind of information can be extremely valuable if one is interested to know how the automatic phonological transcription differs from the reference phonological transcription, and how the

automatic phonological transcription could be improved (see e.g. Binnenpoorte and Cucchiarini 2003; Cucchiarini and Binnenpoorte 2002; Cucchiarini et al. 2001).
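
As an illustration of the kind of alignment involved, here is a minimal feature-based DP alignment in Python. It is a sketch, not the algorithm of Cucchiarini (1993) or Elffers et al. (2005): the toy feature vectors and the uniform insertion/deletion penalty are illustrative assumptions, and a real implementation would use a full articulatory feature set.

```python
# toy binary feature vectors (voice, nasal, continuant) -- purely illustrative
FEATURES = {
    "t": (0, 0, 0), "d": (1, 0, 0), "n": (1, 1, 0),
    "s": (0, 0, 1), "z": (1, 0, 1),
}

def feature_distance(p1, p2, features=FEATURES):
    """Graded substitution cost: the proportion of features on which two phones differ."""
    f1, f2 = features[p1], features[p2]
    return sum(a != b for a, b in zip(f1, f2)) / len(f1)

def align(ref, auto, indel_cost=1.0):
    """DP alignment of a reference and an automatic transcription.
    Returns the total distance and the aligned symbol pairs ('#' = gap)."""
    n, m = len(ref), len(auto)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel_cost
    for j in range(1, m + 1):
        d[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + indel_cost,                     # deletion
                          d[i][j - 1] + indel_cost,                     # insertion
                          d[i - 1][j - 1] + feature_distance(ref[i - 1], auto[j - 1]))
    # trace back to recover one optimal alignment
    i, j, pairs = n, m, []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and abs(d[i][j] - (d[i - 1][j - 1] + feature_distance(ref[i - 1], auto[j - 1]))) < 1e-9:
            pairs.append((ref[i - 1], auto[j - 1])); i -= 1; j -= 1
        elif i > 0 and abs(d[i][j] - (d[i - 1][j] + indel_cost)) < 1e-9:
            pairs.append((ref[i - 1], "#")); i -= 1
        else:
            pairs.append(("#", auto[j - 1])); j -= 1
    return d[n][m], pairs[::-1]

print(align(["d", "n", "s"], ["t", "n", "z"]))
```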

5.1.6  Determining when an Automatic Phonological Transcription is of Satisfactory Quality After having established how much an automatic phonological transcription differs from a reference phonological transcription, one would probably need some reference data to determine whether the degree of distance observed is acceptable or not. In other words, how can we determine whether the quality of a given automatic phonological transcription is satisfactory? Again, human transcriptions could be used as a point of reference. For instance, one could compare the degree of agreement observed between the automatic phonological transcription and the reference phonological transcription with the degree of agreement observed between human transcriptions of the same utterances that are of the same level of detail and that are made under similar conditions, because this agreement level constitutes the upper bound, as in the study reported in Wesenick and Kipp (1996). If the degree of agreement between the automatic phonological transcription and the reference phonological transcription is comparable to what is usually observed between human transcriptions, one could accept the automatic phonological transcription as is; alternatively, if the degree of agreement between the automatic phonological transcription and the reference phonological transcription is lower than what is usually observed between human transcriptions, the automatic phonological transcription should first be improved. However, the problem with this approach is that it is difficult to find data on human transcriptions to be used as reference (for more information on this point, see Cucchiarini and Binnenpoorte 2002). Whether a transcription is of satisfactory quality will also depend on the purpose of the transcription. Some differences in transcriptions can be important for one application, but less important for another. Therefore, application, goal, and context should be taken into account for meaningful transcription evaluation (van Bael et al. 2003, Van Bael et al. 2007).
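
Once an automatic transcription has been aligned with a reference, the agreement measures mentioned above are easy to compute. The short sketch below assumes that the two transcriptions have already been aligned symbol by symbol (with a placeholder such as '#' for insertions and deletions); that assumption is part of the sketch, not of the measures themselves.

```python
from collections import Counter

def percent_agreement(t1, t2):
    """Percentage of aligned positions on which two transcriptions agree."""
    return 100.0 * sum(a == b for a, b in zip(t1, t2)) / len(t1)

def cohens_kappa(t1, t2):
    """Chance-corrected agreement (Cohen's kappa) for two aligned label sequences."""
    n = len(t1)
    p_obs = sum(a == b for a, b in zip(t1, t2)) / n
    c1, c2 = Counter(t1), Counter(t2)
    p_exp = sum(c1[lab] * c2[lab] for lab in set(t1) | set(t2)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

ref  = ["m", "a", "r", "t", "I", "s"]   # aligned reference transcription (toy)
auto = ["m", "a", "r", "@", "I", "s"]   # aligned automatic transcription (toy)
print(percent_agreement(ref, auto), cohens_kappa(ref, auto))
```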

5.2  Obtaining Phonological Transcriptions In the current section we look at (semi-)automatic methods for obtaining phonological transcriptions. We start with completely automatic methods, distinguishing between cases in which orthographic transcriptions are not available and cases in which they are. We then discuss comparing (combinations of) methods, including methods in which a (small) part of the material is manually transcribed and subsequently used to improve automatic phonological transcriptions for a larger amount of speech. Finally, we look

at automatic phonological transcription optimization. However, before discussing how automatic phonological transcriptions can be obtained, we provide a brief explanation of how automatic speech recognition works.

5.2.1  Automatic Speech Recognition (ASR) Standard ASR systems are generally employed to recognize words. The ASR system consists of a decoder (the search algorithm) and three ‘knowledge sources’: the language model (LM), the lexicon, and the acoustic models. The LM contains probabilities of words and sequences of words. Acoustic models are models of how the sounds of a language are pronounced; in most cases so-called hidden Markov models (HMMs) are used, but it is also possible to use artificial neural networks (ANNs). The lexicon is the connection between the language model and the acoustic models. It contains information on how the words are pronounced, in terms of sequences of sounds. The lexicon therefore contains two representations for every entry: an orthographic and a phonological transcription. Most lexicons contain words for which more than one entry is present in the lexicon, i.e. the pronunciation variants. ASR is a probabilistic procedure. In a nutshell, ASR (with HMMs) works as follows. The LM defines which sequences of words are possible, for each word the possible variants and their transcriptions are retrieved from the lexicon, and for each sound in these transcriptions the appropriate HMM is retrieved. Everything is represented by means of a huge probabilistic network: an LM is a network of words, each word is a network of pronunciation variants and their transcriptions, and for each of the sounds in these transcriptions the corresponding HMM is a network of its own. In this huge and complex network, paths have probabilities attached to them. For a given (incoming, unknown) speech signal, the task of the decoder is to find the optimal global path in this network, using all the probabilistic information. In standard word recognition the output then consists of the labels of the words on this optimal path: the recognized words. However, the optimal path can contain more information than just that concerning the word labels, e.g. information on pronunciation variants, the phone symbols in these pronunciation variants, and even the segmentation at phone level. The description provided above is a short description of a standard ASR, i.e. one that is used for word recognition. However, it is also possible to use ASR systems in other ways. For instance, it is possible to perform phone recognition by only using the acoustic models, i.e. without the top-down constraints of language model and lexicon. Alternatively, the lexicon may contain phones, instead of words. If there are no further restrictions (in the LM). we are dealing with so-called free or unrestricted phone recognition, whereas if the LM contains a model with probabilities of phone sequence (a kind of phonotactic constraints), then we have restricted phone recognition. These phone LM models are generally trained using canonical phonological transcriptions. For instance, in some experiments on automatic phonological transcription that we carried out (see section 2.3) it turned out that 4-gram models outperformed 2-gram, 3-gram,

5-gram, and 6-gram models (van Bael et al. 2006, 2007). Such a 4-gram model contains the probabilities of sequences of 4 phones.
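
The phonotactic language model itself can be estimated very simply. Below is a minimal sketch of maximum-likelihood n-gram estimation over canonical phone transcriptions; real systems would add smoothing for unseen phone sequences, which is omitted here, and the symbols in the usage example are merely illustrative.

```python
from collections import Counter

def train_phone_ngram(transcriptions, n=4):
    """Maximum-likelihood n-gram probabilities over phone sequences, estimated
    from canonical phonological transcriptions (no smoothing)."""
    ngrams, histories = Counter(), Counter()
    for phones in transcriptions:
        padded = ["<s>"] * (n - 1) + list(phones) + ["</s>"]
        for i in range(len(padded) - n + 1):
            gram = tuple(padded[i:i + n])
            ngrams[gram] += 1
            histories[gram[:-1]] += 1
    return {g: c / histories[g[:-1]] for g, c in ngrams.items()}

# toy usage on one canonical transcription (SAMPA-like symbols)
model = train_phone_ngram([["A", "m", "s", "t", "@", "r", "d", "A", "m"]], n=4)
print(model[("A", "m", "s", "t")])
```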

5.2.2  Automatic Methods 5.2.2.1  Automatic Methods: No Orthography In general, if orthographic transcriptions are present, it is better to derive the phonological transcriptions not only from the audio signals, but from a combination of audio signals and orthographic transcriptions. But what to do if no orthographic transcriptions are present? 5.2.2.1.1  No Orthography, ASR If no orthographic annotation is present, an obvious solution would be to use ASR to obtain it. However, since ASR is not flawless, this orthographic transcription is likely to contain ASR errors. These errors in the orthographic transcription would counterbalance the positive effects of using the orthographic representation for obtaining automatic phonological transcriptions. Whether the net effect is positive depends on the task. For some tasks that are relatively easy for ASR, such as isolated digit recognition, the net effect may even be positive, but for most tasks this will probably not be the case. 5.2.2.1.2  No Orthography, Phone Recognition Another option, if no orthographic representation is available, is to use phone recognition (see section 2.1 on ASR). For this purpose, completely unrestricted phone recognition can be used, but usually some (phonotactic) constraints are employed in the form of a phone language model. Phone accuracy can be highly variable, roughly between 30 and 70 per cent, depending on factors such as speech style and quality of the speech (amount of background noise) (see e.g. Chang 2002). For instance, for one of our ASR systems we measured a phone accuracy level of 63 per cent for extemporaneous speech (Wester et al. 1998). In general, high accuracy values can be obtained for relatively easy tasks (e.g. carefully read speech), and by carefully tuning the ASR system for specific tasks (i.e. speech style, dialect or accent, gender, or speaker). In general, such levels of phone accuracy are too low, and thus the resulting automatic phonological transcriptions cannot be used directly for most applications. Still, phone recognition can be useful. For our ASR system with a phone accuracy of 63 per cent we examined the resulting phone strings by comparing them to canonical transcriptions (Wester et al. 1998). Canonical transcriptions can be obtained by means of lexicons, grapheme-to-phoneme conversion tools (for an overview, see Bisani and Ney 2008), or a combination of the two. Since the quality of the phonological transcriptions in lexicons is usually better than that of grapheme-to-phoneme conversion tools, in many applications one first looks up the phonological transcriptions of words in lexicons, and if these are

not found in existing lexicons the grapheme-to-phoneme conversion is applied. In Wester et al. (1998) it was found that the number of insertions (4 per cent) was much smaller than the number of deletions (17 per cent) and substitutions (15 per cent). Furthermore, the vowels remain identical more often than the consonants, mainly because in comparison to the consonants they are deleted less often. Finally, we studied the most frequently observed processes, which were all deletions. It turned out that these frequent processes are plausible connected speech processes (see Wester et al. 1998), some of which are related to Dutch phonological processes that have been described in the literature (e.g. /n/-deletion, /t/-deletion, and /@/-deletion are described in Booij 1995), but also some others that could not be found in the literature. Phone recognition can thus be used for hypothesis generation (Wester et al. 1998). However, owing to the considerable number of inaccuracies in unsupervised phone recognition, it is often necessary to check or filter the output of phone recognition. The latter can be done by applying decision trees (Fosler-Lussier 1999; van Bael et al. 2006; 2007) or forced recognition (Kessens and Strik 2001, 2004). The results of phone recognition can be described in terms of context-dependent rewrite rules. Various criteria can be employed for filtering these rewrite rules—for example, straightforward criteria are the absolute frequency with which changes (insertions, deletions, or substitutions) occur, or the relative frequency, i.e. the absolute frequency divided by the number of times the conditions of the rule are met (see e.g. Kessens et al. 2002). Of course, combinations of criteria are also possible.
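
The rule extraction and filtering step described here can be sketched as follows. The sketch assumes that each utterance has already been aligned (canonical against recognized) by a DP algorithm, with '#' marking insertions and deletions; the thresholds are arbitrary examples, not the values used in the studies cited.

```python
from collections import Counter

def collect_rules(aligned_utterances):
    """aligned_utterances: per utterance, a list of (canonical, realized) symbol
    pairs from a DP alignment, with '#' marking insertions and deletions.
    Counts context-dependent changes and the contexts in which they could apply."""
    changes, contexts = Counter(), Counter()
    for pairs in aligned_utterances:
        for i, (can, real) in enumerate(pairs):
            left = pairs[i - 1][0] if i > 0 else "<s>"
            right = pairs[i + 1][0] if i + 1 < len(pairs) else "</s>"
            contexts[(left, can, right)] += 1
            if can != real:
                changes[(left, can, right, real)] += 1
    return changes, contexts

def filter_rules(changes, contexts, min_abs=10, min_rel=0.1):
    """Keep only rules whose absolute frequency and relative frequency (changes
    divided by the number of times the context occurs) are both high enough."""
    return {rule: (n, n / contexts[rule[:3]])
            for rule, n in changes.items()
            if n >= min_abs and n / contexts[rule[:3]] >= min_rel}
```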

5.2.2.2  Automatic Methods: With Orthography Above different methods were described to derive phonological transcriptions when no orthographic transcriptions are available. Such methods are often called bottom-up or data-driven methods. In the current subsection we will describe methods to obtain phonological transcriptions when orthographic transcriptions are present. In the latter case, top-down information can also be applied. 5.2.2.2.1  With Orthography, Canonical Transcriptions Probably the simplest way to derive phonological transcriptions in this case is by using canonical transcriptions (see above). Once a phonological (canonical) transcription is obtained for every word, the orthographic representations are simply replaced by the corresponding phonological representations (Binnenpoorte and Cucchiarini 2003; van Bael et al. 2006, 2007). 5.2.2.2.2  With Orthography, Forced Recognition Words are not always pronounced in the same way, and thus representing all occurrences of a word with the same phonological transcription will result in phonological transcriptions containing errors. The quality of the phonological transcriptions could be improved by modelling pronunciation variation. One way to do this is to use so-called forced recognition. In forced recognition the goal is not to recognize the string

100   Helmer Strik and Catia Cucchiarini of words that was spoken, as in standard ASR. On the contrary, in forced recognition this string of words (the orthographic transcription) has to be known. Given the orthographic transcription, and multiple pronunciations of some words, forced recognition will determine automatically which of the pronunciation variants of a word best matches the audio signal corresponding to that word. The words are thus fixed, and for every word the recognizer is forced to choose one of the pronunciation variants of that word—hence the term ‘forced recognition’. The search space can also be represented as a network or lattice with pronunciation variants. The goal then is to find the optimal path in that network, the optimal alignment, by means of the Viterbi algorithm; this is why this procedure is also referred to as ‘Viterbi’ or ‘forced’ alignment. In any case, if hypotheses (pronunciation variants) are present, forced recognition can be used for hypothesis verification (i.e. to find the hypothesis, the variant that most closely matches the audio signal). It is important to note that through the use of pronunciation variants it is also possible to model sandhi phenomena and cross-word phenomena such as assimilation or intrusion. We evaluated how well forced recognition performs by comparing its performance to that of human annotators (Wester et al. 2001). Nine experts, who often carried out phonetic transcriptions for their own research, carried out exactly the same task as the computer program: they had to indicate which pronunciation variants of a word best matched the audio signal. The variants of 467 cases were generated by means of 5 frequent phonological rules: /n/-, /r/-, /t/-, /@/-deletion, and /@/-insertion (Booij 1995). In all 467 cases the machine and the human transcribers thus had to determine whether a phone was present or not. The results of these experiments were evaluated in different ways; some of these results are presented here. Table 5.1 shows how often N of the 9 transcribers agree. For 5 out of 9 this is 100 per cent, obviously, but for larger N values this percentage drops rapidly, and all 9 experts agree only in 53 per cent of the cases. Note that these results concern decisions on whether a phone was present or not, i.e. insertions and deletions, and not substitutions, where a phone could be substituted by many other phones and thus the number of possibilities is much larger. Determining whether a phone is present or not can also be very difficult both for humans and for machines, because very often we are dealing with gradual processes in which phones are neither completely present or absent, and even if a phone is (almost) not present some traces can remain in the context. Furthermore, human listeners could be biased by their knowledge of the language. We then compared the transcriptions made by humans and machines to the same reference transcription, the majority vote with different degrees of strictness (N out of 9, N = 5–9) mentioned above. The results are shown in Figure 5.1. For the 246 cases in which all transcribers agree, the percentage agreement between listeners and reference transcription obviously is 100 per cent. If N decreases, the percentage agreement with the reference transcription decreases, both for the judgements of the listeners and for the forced recognition program. Note that the behaviour is similar: the average percentage agreement of the listeners almost runs parallel to the agreement of the ASR system.

Table 5.1  Forced Recognition: majority vote results, i.e. the number of times that N of the 9 transcribers agree

    5 out of 9    467 (100%)
    6 out of 9    435 (93%)
    7 out of 9    385 (82%)
    8 out of 9    335 (72%)
    9 out of 9    246 (53%)

FIGURE 5.1  Percentage agreement of the reference transcription compared to the transcriptions made by humans and machine. [Line graph: y-axis 'Agreement (%)' from 80 to 100; x-axis reference transcriptions from '5 of 9' to '9 of 9'; data series for the individual listeners, their average, and the CSR.]

We also carried out pairwise comparisons between transcriptions. We obtained inter-listener percentage agreement scores (for Dutch spontaneous speech) in the range of 75–87 per cent (with an average of 82 per cent) (Wester et al. 2001). Similar results were obtained for German spontaneous speech: 79–83 per cent (Kipp et al. 1997; Schiel et al. 1998), and for American English (Switchboard): 72–80 per cent (Greenberg 1999). The ASR-listener pairwise comparisons yielded slightly lower percentage agreement scores: 76–80 per cent (with an average of 78 per cent) for Dutch (Wester et al. 2001), and 72–80 per cent for German (Kipp et al. 1997; Schiel et al. 1998).

Forced recognition appears to perform well: the results are comparable to those of human transcribers, and the percentage agreement scores for the ASR are only slightly lower than those between human annotators. However, note that these results were obtained by comparing the judgements by humans and machine to the judgements by human annotators. If we had based the reference transcription(s) on a combination of the judgements by listeners and by ASR systems, the differences would have been (much) smaller. In any case, forced recognition seems to be a useful technique for hypothesis verification, i.e. for obtaining information regarding the phonological transcription. Forced recognition, for hypothesis verification, can thus be used in combination with other methods that generate hypotheses. Examples of the latter are phone recognition (see section 2.2.1.2) and rule-based methods (e.g. by using phonological rules, such as the five rules mentioned above). A method to obtain information regarding reduction processes is to simply generate (many) variants by making (many) phones optional, and to use forced recognition to select variants. In Kessens et al. (2000) we showed that there is a large overlap between the results of the latter method and those obtained with phone recognition in combination with forced recognition. Both methods are useful for automatically detecting connected speech processes, and it turned out that only about half of these connected speech processes had already been described in the literature at that moment (Kessens et al. 2000).
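
To make the idea of forced recognition concrete, here is a highly simplified sketch. In a real system the score of a variant comes from HMM acoustic models; here frame_loglik is a hypothetical stand-in that scores one observation frame against one phone label, and the Viterbi-style dynamic programme simply finds the best monotonic assignment of frames to the phones of each pronunciation variant and picks the best-scoring variant.

```python
import math

def forced_alignment_score(frames, phones, frame_loglik):
    """Best total log-likelihood of aligning the frame sequence to the phone
    sequence, with frames consumed left to right and every phone covering
    at least one frame (a Viterbi-style forced alignment)."""
    n, m = len(frames), len(phones)
    best = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, min(i, m) + 1):
            ll = frame_loglik(frames[i - 1], phones[j - 1])
            best[i][j] = max(best[i - 1][j], best[i - 1][j - 1]) + ll
    return best[n][m]

def choose_variant(frames, variants, frame_loglik):
    """Forced recognition: the words are fixed, only the pronunciation
    variant that best matches the signal is selected."""
    return max(variants, key=lambda v: forced_alignment_score(frames, v, frame_loglik))

# Hypothetical stand-in for an acoustic model: each 'frame' is just a noisy
# phone label; matching labels are likely, mismatches unlikely.
def frame_loglik(frame, phone):
    return math.log(0.8) if frame == phone else math.log(0.05)

frames = ["A", "m", "s", "@", "d", "A", "m"]                      # toy observations
variants = [["A", "m", "s", "t", "@", "R", "d", "A", "m"],        # canonical form
            ["A", "m", "s", "@", "d", "A", "m"]]                  # reduced variant
print(choose_variant(frames, variants, frame_loglik))
```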

5.2.3  Comparing (Combinations of) Methods In the previous sections we have presented several methods for obtaining phonological transcriptions. The question that arises is which of these methods performs best. In addition, many of these methods can be combined. In the research by van Bael et al. (2006; 2007) several (combinations of) methods were implemented, tested, and compared. Ten automatic procedures were used to generate broad phonetic transcriptions of well-prepared speech (read-aloud texts) and spontaneous speech (telephone dialogues) from the Spoken Dutch Corpus (see Table 5.2). The resulting transcriptions were compared to manually verified phonetic transcriptions from the same corpus. These ten methods are briefly described here (for more information, see van Bael et al. 2006; 2007). In methods 3–10, multiple pronunciation lexicons were used, and the best variant was chosen by means of forced recognition (in methods 1 and 2 this was not the case).

5.2.3.1  Canonical Transcription: CAN-PT The canonical transcriptions (CAN-PTs) were generated through a lexicon look-up procedure. Crossword assimilation and degemination were not modelled. Canonical transcriptions are easy to obtain, since many corpora feature an orthographic transcription and a canonical lexicon of the words in the corpus.
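
In code, the look-up procedure amounts to very little. The sketch below is generic, not the Spoken Dutch Corpus procedure; the mini-lexicon and the optional grapheme-to-phoneme fallback for out-of-vocabulary words (see section 2.2.1.2) are illustrative assumptions.

```python
def canonical_transcription(words, lexicon, g2p=None):
    """Replace each orthographic word by its canonical phone sequence: lexicon
    look-up first, grapheme-to-phoneme conversion as a fallback for
    out-of-vocabulary words."""
    phones = []
    for word in words:
        if word in lexicon:
            phones.extend(lexicon[word])
        elif g2p is not None:
            phones.extend(g2p(word))
        else:
            raise KeyError(f"no transcription available for {word!r}")
    return phones

# toy usage with a hypothetical mini-lexicon (SAMPA-like symbols)
LEXICON = {"maar": ["m", "a", "r"], "het": ["@", "t"], "is": ["I", "s"]}
print(canonical_transcription(["maar", "het", "is"], LEXICON))
```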

Table 5.2  Accuracy of the ten transcription methods for read speech and telephone dialogues: percentage of Substitutions (Subs), Deletions (Dels), Insertions (Ins), and percentage disagreement (%dis, the summation of Subs, Dels, and Ins)

    Comparison with RT    Read speech                    Telephone dialogues
                          Subs   Dels   Ins    %dis      Subs   Dels   Ins    %dis
    CAN-PT                 6.3    1.2    2.6   10.1       9.1    1.1    8.1   18.3
    DD-PT                 16.1    7.4    3.6   27.0      26.0   18.0    3.8   47.8
    KB-PT                  6.3    3.1    1.5   10.9       9.0    2.5    5.8   17.3
    CAN/DD-PT             13.1    2.0    4.8   19.9      21.5    6.2    7.1   34.7
    KB/DD-PT              12.8    3.1    3.6   19.5      20.5    7.8    5.4   33.7
    [CAN-PT]d              4.8    1.6    1.7    8.1       7.1    3.3    4.2   14.6
    [DD-PT]d              15.7    7.4    3.5   26.7      26.0   18.6    3.8   48.3
    [KB-PT]d               5.0    3.2    1.2    9.4       7.1    3.5    4.2   14.8
    [CAN/DD-PT]d          12.0    2.3    4.3   18.5      20.1    7.2    5.5   32.8
    [KB/DD-PT]d           11.6    3.1    3.1   17.8      19.3    9.4    4.5   33.1

5.2.3.2  Data-Driven Transcription: DD-PT The data-driven transcriptions (DD-PTs) were derived from the audio signals through constrained phone recognition: an ASR system segmented and labelled the speech signal using as a language model a 4-gram phonotactic model trained with the reference transcriptions of the development data in order to approximate human transcription behaviour. Transcription experiments with the data in the development set indicated that for both speech styles 4-gram models outperformed 2-gram, 3-gram, 5-gram, and 6-gram models.

5.2.3.3  Knowledge-Based Transcription: KB-PT We generated so-called knowledge-based transcriptions (KB-PTs) in three steps.



a. First, a list of 20 prominent phonological processes was compiled from the linguistic literature on the phonology of Dutch (Booij 1995). These processes were implemented as context-dependent rewrite rules modelling both within-word and cross-word contexts in which phones from a CAN-PT can be deleted, inserted or substituted with another phone.

b. In the second step, the phonological rewrite rules were ordered and used to generate optional pronunciation variants from the CAN-PTs of the speech chunks. The rules applied to the chunks rather than to the words in isolation to account for cross-word phenomena.

c. In the third step of the procedure, chunk-level pronunciation variants were listed. The optimal knowledge-based transcription (KB-PT) was identified through forced recognition.
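
A much-reduced sketch of this variant-generation step is given below. The rule format (left context, focus phone, right context, replacement, with '*' as a wildcard and an empty string meaning deletion) and the two toy rules are assumptions for illustration; the actual KB-PT procedure used twenty ordered rules applied to whole speech chunks.

```python
def apply_rule_everywhere(variant, rule):
    """Apply one optional context-dependent rewrite rule (left, focus, right,
    replacement) at every matching position; '' as the replacement deletes the
    focus phone. Returns the resulting set of forms, including the unchanged one."""
    left, focus, right, repl = rule
    results = {variant}
    for i, p in enumerate(variant):
        l = variant[i - 1] if i > 0 else "<s>"
        r = variant[i + 1] if i + 1 < len(variant) else "</s>"
        if p == focus and left in ("*", l) and right in ("*", r):
            results.add(variant[:i] + ((repl,) if repl else ()) + variant[i + 1:])
    return results

def generate_variants(canonical, rules):
    """Apply an ordered list of optional rules to a canonical transcription,
    accumulating all pronunciation variants."""
    variants = {tuple(canonical)}
    for rule in rules:
        variants = {v2 for v in variants for v2 in apply_rule_everywhere(v, rule)}
    return sorted(variants)

# two toy rules: /n/-deletion after schwa, and word-final /t/-deletion
rules = [("@", "n", "</s>", ""), ("*", "t", "</s>", "")]
print(generate_variants(["l", "o:", "p", "@", "n"], rules))
```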

Methods 4 and 5 are combinations of data-driven transcription (DD-PT) with canonical transcription (CAN-PT) and knowledge-based transcription (KB-PT):

5.2.3.4  Combined CAN-DD Transcription: CAN/DD-PT 5.2.3.5  Combined KB-DD Transcription: KB/DD-PT For each of these two methods, the variants generated by the two procedures were combined, and the optimal variant was chosen by means of forced recognition. Methods 1–5 are completely automatic methods—no manual phonological transcriptions are used. However, manual phonological transcriptions may be already available, at least for a (small) subset of the corpus. The question then is whether these manual phonological transcriptions could be used to improve the quality of the automatic phonological transcriptions obtained for the rest of the corpus. A possible way to do this is to align automatic and manual phonological transcriptions for the subset of the corpus, and use these alignments to train decision trees. Roughly speaking, these decision trees learn the (systematic) differences between manual phonological transcriptions and automatic phonological transcriptions. If the same decision trees are then used to transform the automatic phonological transcriptions of the rest of the corpus, these transformed automatic phonological transcriptions might be closer to the reference transcriptions. We applied these decision trees in each of the five methods described above, thus obtaining five new transcriptions, i.e. methods 6–10. For each of these methods, these decision trees and the automatic phonological transcriptions were used to generate new variants. The optimal variants were selected by means of forced recognition. To summarize, the ten methods are:

1. Canonical transcription: CAN-PT
2. Data-driven transcription: DD-PT
3. Knowledge-based transcription: KB-PT
4. Combined CAN-DD transcription: CAN/DD-PT
5. Combined KB-DD transcription: KB/DD-PT

6–10 = 1–5 with decision trees

The results are presented in Table 5.2. It can be observed that applying the decision trees improves the results. Therefore, if manual phonological transcriptions are available for part of the corpus, they can be used to improve the automatic phonological transcriptions for the rest of the corpus. And if no manual phonological transcriptions are available, one could consider obtaining such transcriptions for (only a small) part of the corpus. The best results are obtained for method 6: [CAN-PT]d, a canonical transcription that, through the use of a small sample of manual transcriptions and decision trees,

was modelled towards the target transcription. This method does not require the use of an ASR system, only canonical transcriptions obtained by means of a lexicon look-up, some manual phonological transcriptions, and decision trees trained on these manual transcriptions. For these (best) transcriptions, the number and the nature of the remaining disagreements with the reference transcriptions are similar to inter-labeller disagreement values reported in the literature. Some examples, including those for the best method (i.e. 6: [CAN-PT]d), are provided in Table 5.3. It can be observed that the decision trees have 'learned' some patterns that lead to improvements: the deletion of the schwa (@) of /@t/, the deletion of the 'l' in /Als/, and the devoicing of 'v' (see the last word). In order to determine what the errors in the transcriptions are, they are aligned with the reference. The number of errors for [CAN-PT]d (i.e. 1 Del and 2 Subs) is much smaller than for CAN-PT (3 Dels and 3 Subs).

Table 5.3  Examples of utterances with different phonological transcriptions (in SAMPA). From top to bottom: orthography, the reference (the manually verified phonetic transcription from the Spoken Dutch Corpus), CAN-PT (method 1), and [CAN-PT]d (method 6)

    Orthog.      maar het is niet handig als je nou . . . verbinding
    Reference    mar t Is nid hAnd@x A S@ nA+ . . . f@-bIndIN
    CAN-PT       mar @t Is nit hAnd@x Als j@ nA+ . . . v@rbIndIN
                 D S DD S S (3 Dels and 3 Subs)
    [CAN-PT]d    mar t Is nit hAnd@x As j@ nA+ . . . f@-bIndIN
                 S D S - (1 Del and 2 Subs)
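
The decision-tree step behind methods 6–10 can be sketched roughly as follows (assuming scikit-learn is available). The feature set (just the focus phone and its immediate neighbours) and the symbol-by-symbol alignment of canonical and manual transcriptions are deliberate simplifications of the procedure described above.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def context_features(phones, i):
    """The focus phone and its immediate neighbours."""
    return {"left": phones[i - 1] if i > 0 else "<s>",
            "focus": phones[i],
            "right": phones[i + 1] if i + 1 < len(phones) else "</s>"}

def train_correction_tree(canonical_utts, manual_utts):
    """canonical_utts and manual_utts are aligned symbol by symbol, one list per
    utterance, with '#' marking a deletion in the manual transcription."""
    X, y = [], []
    for can, man in zip(canonical_utts, manual_utts):
        for i in range(len(can)):
            X.append(context_features(can, i))
            y.append(man[i])
    vectorizer = DictVectorizer()
    tree = DecisionTreeClassifier().fit(vectorizer.fit_transform(X), y)
    return vectorizer, tree

def apply_tree(canonical, vectorizer, tree):
    """Rewrite a canonical transcription with the learned corrections."""
    feats = [context_features(canonical, i) for i in range(len(canonical))]
    predicted = tree.predict(vectorizer.transform(feats))
    return [p for p in predicted if p != "#"]        # realize predicted deletions

# toy usage: train on one manually transcribed chunk, then apply the tree
vec, tree = train_correction_tree([["m", "a", "r", "@", "t"]],
                                  [["m", "a", "r", "#", "t"]])
print(apply_tree(["m", "a", "r", "@", "t"], vec, tree))   # -> ['m', 'a', 'r', 't']
```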

5.2.4  Optimizing Automatic Phonological Transcriptions Deriving automatic phonological transcriptions, according to the methods described above, is usually done by using ASR systems. Since standard ASR systems are primarily intended for recognizing words, for automatic phonological transcription it is necessary to apply the ASR systems in nonstandard, modified ways (as was described above, for various methods). For many decades efforts in ASR research were directed to reducing the word error rate (WER), a measure of the accuracy of ASR systems in recognizing words. If an ASR system is used for deriving automatic phonological transcriptions, one generally takes an ASR system for which the WER is low. However, it is questionable whether the ASR system with the lowest WER is also the best choice for obtaining automatic phonological transcriptions. Given that automatic phonological transcriptions are increasingly used, it is remarkable that relatively little research has been conducted on optimizing automatic phonological transcriptions and on optimizing ASR systems for this purpose. In one of our studies we investigated the effect of changing the properties of the ASR system on the quality of the resulting transcriptions and the WER (Kessens and Strik 2001).

106   Helmer Strik and Catia Cucchiarini As a criterion we used the percentage agreement between the automatic phonological transcriptions and reference phonological transcriptions. The study concerned 1,237 instances of the five Dutch phonological rules mentioned above (see section 2.2.2.2): the 467 cases mentioned in section 2.2.2.2, in which the reference phonological transcription was obtained by means of a majority vote procedure, and an extra 770 cases, in which the reference phonological transcription was a consensus transcription. By means of a DP alignment of automatic phonological transcriptions with reference phonological transcriptions, we obtained agreement scores which are expressed in either %agreement or Kappa. A higher %agreement or Kappa indicates better transcription quality. We showed that the relation between WERs and transcription quality is not straightforward (Kessens and Strik 2001, 2004). For instance, using context-dependent HMMs usually leads to lower WERs, but not always to higher-quality transcriptions. In other words, lower WERs do not always guarantee better transcriptions. Therefore, in order to increase the quality of automatic phonological transcriptions, one should not simply take the ASR system with the lowest WER. Instead, specific ASR systems have to be optimized for this task (i.e. to generate optimal automatic phonological transcriptions). Our research made clear that by combining the right properties of an ASR, the resulting automatic phonological transcriptions can be improved. In Kessens and Strik (2001) this was achieved by training the HMMs on read speech (instead of spontaneous speech), by shortening the topology of the HMMs, and by means of pronunciation variation modelling. Related to the issue above, i.e. which ASR system to use to obtain automatic phonological transcriptions with high transcription quality, is the issue of which phonological transcriptions to use to obtain an ASR system with low WER. The question is whether higher-quality transcriptions—e.g. manual phonological transcriptions—always yield ASR systems with lower WERs. We used different phonological transcriptions for training ASR systems, measured transcription quality by comparing these transcriptions to a reference phonological transcription, and also measured the WERs of the resulting ASR systems (Van Bael et al. 2006, 2007). The phonological transcriptions we used were: a manual phonological transcription, a canonical transcription (APT1), and an improved APT2 obtained by modelling pronunciation variation. In this case too, no straightforward relation was observed between transcription quality and WER; for example, manual phonological transcriptions do not always yield ASR systems with lower WERs. The overall conclusion of these experiments is therefore that, since ASR systems with lower WERs do not always yield better phonological transcriptions, and better phonological transcriptions do not always yield lower WERs, if ASR systems are to be used to obtain automatic phonological transcriptions, they should be optimized for this specific task.

5.3  Concluding Remarks In the previous sections we have discussed the possibilities and advantages offered by automatic methods for phonological annotation. Ceteris paribus, the quality of the transcriptions is likely to be higher for careful (e.g. read) than for sloppy

(e.g. spontaneous) speech, and also higher for high-quality audio signals than for lower-quality ones (more noise, distortions, lower sampling frequency, etc.). If there is no orthographic transcription, it will be difficult to automatically obtain phonological transcriptions of high quality, since the output of ASR and phone recognition generally contains a substantial number of errors. If there are orthographic transcriptions, a good possibility might be to use method 6 of section 2.3: obtain some manual transcriptions, use them to train decision trees, and apply these decision trees to the canonical transcriptions. Another good option is to use method 3 of section 2.3: use 'knowledge' (e.g. a list of pronunciation variants, or rules for creating them) to generate variants, and apply 'forced recognition' to select the variants that best match the audio signal. In practice, automatic phonological transcriptions can be used in all research situations in which phonological transcriptions have to be made by one person. Given that an ASR does not suffer from tiredness and loss of concentration, it could assist the transcriber, who is likely to make mistakes owing to concentration loss. By comparing his/her own transcriptions with those produced by the ASR a transcriber could spot possible errors that are due to absent-mindedness. Furthermore, this kind of comparison could be useful for other reasons. For instance, a transcriber may be biased by his/her own hypotheses and expectations, with obvious consequences for the transcriptions, while the biases in automatic phonological transcription can be controlled. Checking the automatic transcriptions may help discover possible biases in the listener's data. In addition, APT can be employed in those situations in which more than one transcriber is involved, in order to solve possible doubts about what was actually realized. It should be noted that using automatic phonological transcription will be less expensive than having an extra transcriber carry out the same task. Automatic phonological transcription could also play a useful role within the framework of agile corpus creation as proposed by Voormann and Gut (2008; see also the chapter on corpus design in this volume). Agile corpus creation advocates the adoption of a query-driven approach that envisages small, rapid iterations of the various cycles in corpus creation (querying, annotation schema development, corpus annotation, and corpus analysis) to enhance the quality of corpora. In this approach, automatic phonetic transcription can be employed in a step-by-step bootstrap procedure as proposed by Binnenpoorte (2006), so that improved automatic phonological transcriptions are obtained after each step. Finally, we would like to reiterate the clear advantage of using automatic phonological transcription when it comes to transcribing large amounts of speech data that otherwise would probably remain unexplored.


APPENDIX

Phonetic Transcription Tools Below a list of some (pointers to) phonetic transcription tools is provided. Since much more is available for English than for other languages, we first list the tools for English, and then the tools for other languages.

English

http://project-modelino.com/english-phonetic-transcription-converter.php?site_language=english
This free online converter allows you to convert English text to its phonetic transcription using International Phonetic Alphabet (IPA) symbols. The database contains more than 125,000 English words, including 68,000 individual words and 57,000 word forms, such as plurals.

http://upodn.com/
Turn your text into fonetiks (actually, it is: fənɛtɪks).

http://www.brothersoft.com/phonetic-86673.html
Phonetic 1.0. A program that translates text to the phonetic alphabet.

http://www.brothersoft.com/phonetizer-428787.html
Phonetizer 2.0. Easily and quickly add phonetic transcription to any English text.

http://www.photransedit.com/
PhoTransEdit applications have been designed to help those who work with English phonetic transcriptions. Far from providing perfect automatic transcriptions, PhoTransEdit is aimed at just helping you save time when writing, publishing, or sharing English transcriptions.

http://www.speech.cs.cmu.edu/cgi-bin/cmudict
The CMU Pronouncing Dictionary.

http://www.filecrop.com/ipa-english-dictionary.html
IPA English Dictionary.

http://ipa.typeit.org/
Type IPA phonetic symbols, for English.

Other Languages

http://mickey.ifp.illinois.edu/speechWiki/index.php?title=Phonetic_Transcription_Tool&oldid=3011
This is a tool that maps strings of letters (words) to their phonetic transcriptions via a Hidden Markov Model. It can also give phonetic transcriptions for partial words or words not in a dictionary. If a transcription dictionary is provided, the tool can align letters with their corresponding phones. It has been trained on American English pronunciations, but models for other languages can also be created.

http://tom.brondsted.dk/text2phoneme/
Tom Brøndsted: Phonemic transcription. An automated phonetic/phonemic transcriber supporting English, German, and Danish. Outputs transcriptions in the International Phonetic Alphabet (IPA) or the SAMPA alphabet designed for speech recognition technology.

http://ilk.uvt.nl/g2p-www-demo.html (last accessed date: 26/06/2011)
The TreeTalk demo converts Dutch or English words to their phonetic transcription in the SAMPA (Dutch) or DISC (English) phonetic alphabet, and also generates speech audio.

http://hstrik.ruhosting.nl/tqe/
Automatic Transcription Quality Evaluation (TQE) tool. Input is a corpus with audio files and phone transcriptions (PTs). Audio and PTs are aligned, phone boundaries are derived, and for each segment-phone combination it is determined how well they fit together, i.e. for each phone a TQE measure (a confidence measure) is determined, e.g. ranging from 0–100 per cent, indicating how good the fit is, what the quality of the phone transcription is.

http://www.fon.hum.uva.nl/praat/
Praat: doing phonetics by computer.

http://latlcui.unige.ch/phonetique/
EasyAlign: a friendly automatic phonetic alignment tool under Praat.

http://korpling.german.hu-berlin.de/~amir/phon.php
Automatic Phonetic Transcription and Syllable Analysis for German and Polish.

http://www.webalice.it/sandro.carnevali2011/indice.htm
Program for IPA phonetic transcription of Italian, Japanese and English.

http://www.ipanow.com/
PhoneticSoft automatically transcribes Latin, Italian, German and French texts into IPA symbols.

http://billposer.org/Software/earm2ipa.html
This program translates Armenian in UTF-8 Unicode to the International Phonetic Alphabet, assuming that the dialect represented is Eastern Armenian.

CHAPTER 6

STATISTICAL CORPUS EXPLOITATION

HERMANN MOISL

6.1 Introduction This chapter regards corpus linguistics (Kennedy 1998; McEnery and Wilson 2001; Baker 2009) as a methodology for creating collections of natural language speech and text, abstracting data from them, and analysing that data with the aim of generating or testing hypotheses about the structure of language and its use in the world. On this definition, corpus linguistics began in the late eighteenth century with the postulation of an Indo-European protolanguage and its reconstruction based on examination of numerous living languages and of historical texts (Clackson 2007). Since then it has been applied to research across the range of linguistics sub-disciplines and, in recent years, has become an academic discipline with its own research community and scientific apparatus of professional organizations, websites, conferences, journals, and textbooks. Throughout the nineteenth and much of the twentieth century, corpus linguistics was mainly or exclusively paper-based. The linguistic material used by researchers was in the form of handwritten or printed documents, and analysis involved reading through the documents, often repeatedly, creating data by noting features of interest on some paper medium such as index cards, inspecting the data directly, and on the basis of that inspection drawing conclusions that were published in printed books or journals. The advent of digital electronic technology in the second half of the twentieth century and its evolution since then have rendered this traditional technology increasingly obsolete. On the one hand, the possibility of representing language electronically rather than as visual marks on paper or some other physical medium, together with the development of digital media and infrastructure and of computational tools for the creation, emendation, storage, and transmission of electronic text have led to a rapid increase in the number and size of corpora available to the linguist, and these are now at or even beyond the limit of what an individual researcher can efficiently use in the traditional

way. On the other, data abstracted from large corpora can themselves be so extensive and complex as to be impenetrable to understanding by direct inspection. Digital electronic technology has, in general, been a boon to corpus linguistics, but, as with other aspects of life, it’s possible to have too much of a good thing. One response to digital electronic language and data overload is to use only corpora of tractable size or, equivalently, subsets of large corpora, but simply ignoring available information is not scientifically respectable. The alternative is to look to related research disciplines for help. The overload in corpus linguistics is symptomatic of a more general trend. Daily use of digital electronic information technology by many millions of people worldwide both in their professional and personal lives has generated and continues to generate truly vast amounts of electronic speech and text, and abstraction of information from all but a tiny part of it by direct inspection is an intractable task not only for individuals but also in government and commerce—what, for example, are the prospects of finding a specific item of information by reading sequentially through the huge number of documents currently available on the Web? In response, research disciplines devoted to information abstraction from very large collections of electronic text have come into being, among them Computational Linguistics (Mitkov 2003), Natural Language Processing (Manning and Schütze 1999; Dale et al. 2000; Jurafsky and Martin 2008; Cole et al. 2010; Indurkhya and Damerau 2010), Information Retrieval (Manning et al. 2008), and Data Mining (Hand et al. 2001). These disciplines use existing statistical methods supplemented by a range of new interpretative ones to develop tools that render the deluge of digital electronic text tractable. Many of these methods and tools are readily adaptable for corpus linguistics use, and, as the references in Section 3 demonstrate, interest in them has grown substantially in recent years. The general aim of this chapter is to encourage that growth, and the particular aim is to encourage it with respect to corpus-based phonetic and phonological research. The chapter is in three main parts. The first part motivates the selection of one particular class of statistical method, cluster analysis, as the focus of the discussion, the second describes fundamental concepts in cluster analysis and exemplifies their application to hypothesis generation in corpus-based phonetic and phonological research, and the third reviews the literature on the use of statistical methods in general and of cluster analysis more specifically in corpus linguistics.

6.2  Cluster Analysis: Motivation ‘Statistics’ encompasses an extensive range of mathematical concepts and techniques with a common focus: an understanding of the nature of probability and of its role in the behaviour of natural systems. Linguistically-oriented statistical analysis of a natural language corpus thus implies that the aim of the analysis is in some sense to interpret the probabilities of occurrence of one or more features of interest—phonetic, phonological, morphological, lexical, syntactic, or semantic—in relation to some research question.

112  Hermann Moisl The statistics literature makes a fundamental distinction between exploratory and confirmatory analysis. Confirmatory analysis is used when the researcher has formulated a hypothesis in answer to his or her research question about a domain of interest, and wants to test the validity of that hypothesis by abstracting data from a sample drawn from the domain and applying confirmatory statistical methods to those data. Exploratory analysis is, on the other hand, used when the researcher has not yet formulated a hypothesis and wishes to generate one by abstracting data from a sample of the domain and then looking for structure in the data on the basis of which a reasonable hypothesis can be formulated. The present discussion purports to describe statistical corpus exploitation, and as such it should cover both these types of analysis. The range of material which this implies is, however, very extensive, and attempting to deal even with only core topics in a relatively short chapter would necessarily result in a sequence of tersely described abstract concepts with little or no discussion of their application to corpus analysis. Since the general aim is to encourage rather than to discourage, some selectivity of coverage is required. The selection of material for discussion was motivated by the following question: given the proliferation of digital electronic corpora referred to in the Introduction, which statistical concepts and techniques would be most useful to corpus linguists for dealing with the attendant problem of analytical intractability? The answer was exploratory rather than confirmatory analysis. The latter is appropriate where the characteristics of the domain of interest are sufficiently well understood to permit formulation of sensible hypotheses; in corpus linguistic terms such a domain might be a collection of texts in the English language, which has been intensively studied, or one that is small enough to be tractable by direct inspection. Where the corpora are very large, however, or in languages/dialectal varieties that are relatively poorly understood, or both, exploratory analysis is more useful because it provides a basis for the formulation of reasonable hypotheses; such hypotheses can subsequently be tested using confirmatory methods. The range of exploratory methods is itself extensive (Myatt 2006; Myatt and Johnson 2009), and further restriction is required. To achieve this, a type of problem that can be expected to occur frequently in exploratory corpus analysis was selected and the relevant class of analytical methods made the focus of the discussion. Corpus exploration implies some degree of uncertainty about what one is looking for. If, for example, the aim is to differentiate the documents in a collection on the basis of their lexical semantic content, which words are the best differentiating criteria? Or, if the aim is to group a collection of speaker interviews on the basis of their phonetic characteristics, which phonetic features are most important? In both cases one would want to take as many lexical/phonetic features as possible into account initially, and then attempt to identify the important ones among them in the course of exploration. Cluster analysis is a type of exploratory method that has long been used across a wide range of science and engineering disciplines to address this type of problem, and is the focus of subsequent discussion. 
The remainder of this section gives an impression of what cluster analysis involves and how it can be applied to corpus analysis; a more detailed account is given in Section 2.

Observation of nature plays a fundamental role in science. But nature is dauntingly complex, and there is no practical or indeed theoretical hope of describing any aspect of it objectively and exhaustively. The researcher is therefore selective in what he or she observes: a research question about the domain of interest is posed, a set of variables descriptive of the domain in relation to the research question is defined, and a series of observations is conducted in which, at each observation, the quantitative or qualitative values of each variable are recorded. A body of data is therefore built up on the basis of which a hypothesis can be generated. Say, for example, that the domain of interest is the pronunciation of the speakers in some corpus, and that the research question is whether there is any systematic variation in phonetic usage among the speakers. Table 6.1 shows data abstracted from the Newcastle Electronic Corpus of Tyneside English (NECTE) (Allen et al. 2007), a corpus of dialect speech from north-east England, which is described in Beal, Corrigan and Moisl, this volume. The speakers are described by a single variable ə1, one of the several varieties of schwa defined in the NECTE transcription scheme, and the values in the variable column of Table 6.1 are the frequencies with which each of the 24 speakers use that segment. It is easy to see by direct inspection that the speakers fall into two groups: those that use ə1 relatively frequently, and those that use it relatively infrequently. The hypothesis is, therefore, that there is systematic variation in phonetic usage among NECTE speakers. If two phonetic variables are used to describe the speakers, as in Table 6.2, direct inspection again shows two groups, those that use both ə1 and another variety of schwa, ə2, relatively frequently and those that do not, and the hypothesis remains the same. There is no theoretical limit on the number of variables that can be defined to describe the objects in a domain. As the number of variables and observations grows, so does the difficulty of generating hypotheses from direct inspection of the data. In the NECTE case, the selection of ə1 and ə2 in Tables 6.1 and 6.2 was arbitrary, and the speakers could be described using more phonetic segment variables. Table 6.3 shows twelve. What hypothesis would one formulate from inspection of the data in Table 6.3, taking into account all the variables? There are, moreover, 63 speakers in the NECTE corpus and the transcription scheme contains 158 phonetic segments, so it is possible to describe the phonetic usage of each of 63 speakers in terms of 158 variables. What hypothesis would one formulate from direct inspection of the full 63 × 158 data? These questions are clearly rhetorical, and there is a straightforward moral:  human cognitive makeup is unsuited to seeing regularities in anything but the smallest collections of numerical data. To see the regularities we need help, and that is what cluster analysis provides. Cluster analysis is a family of mathematical methods for the identification and graphical display of structure in data when the data are too large either in terms of the number of variables or of the number of objects described, or both, for it to be readily interpretable by direct inspection. All the members of the family work by partitioning a set of objects in the domain of interest into disjoint subsets in accordance with how relatively similar those objects are in terms of the variables that describe them. 
The objects of interest in the NECTE data are speakers, and each speaker’s phonetic usage is described

Table 6.1  Frequency data for ə1 in the NECTE corpus

    Speaker     ə1
    tlsg01        3
    tlsg02        8
    tlsg03        3
    tlsn01      100
    tlsg04       15
    tlsg05       14
    tlsg06        5
    tlsn02      103
    tlsg07        5
    tlsg08        3
    tlsg09        5
    tlsg10        6
    tlsn03      142
    tlsn04      110
    tlsg11        3
    tlsg12        2
    tlsg52       11
    tlsg53        6
    tlsn05      145
    tlsn06      109
    tlsg54        3
    tlsg55        7
    tlsg56       12
    tlsn07      104

by a set of phonetic variables. Any two speakers’ phonetic usage will be more or less similar depending on how similar their respective variable values are: if the values are identical then so are the speakers in terms of their pronunciation, and the greater the divergence in values the greater the differences in usage. Cluster analysis of the NECTE data in Table 6.3 groups the 24 speakers in terms of how similar their frequency of usage of 12 of the full 158 phonetic segments is. There are various kinds of cluster analysis; Table 6.4 shows the results from application of two of them. Table 6.4a shows the cluster structure of the NECTE data in Table 6.3 as a hierarchical tree. To interpret the tree one has to understand how it is constructed, so a short intuitive

Table 6.2  Frequency data for ə1 and ə2 in the NECTE corpus

    Speaker     ə1     ə2
    tlsg01        3      1
    tlsg02        8      0
    tlsg03        3      1
    tlsn01      100    116
    tlsg04       15      0
    tlsg05       14      6
    tlsg06        5      0
    tlsg07        5      0
    tlsg08        3      0
    tlsg09        5      0
    tlsg10        6      0
    tlsn03      142    107
    tlsn04      110    120
    tlsg11        3      0
    tlsg12        2      0
    tlsg52       11      1
    tlsg53        6      0
    tlsn05      145    102
    tlsn06      109    107
    tlsg54        3      0
    tlsg55        7      0
    tlsg56       12      0
    tlsn07      104     93

account is given here; technical details are given later in the discussion. The labels at the leaves of the tree are the speaker-identifiers. These labels are partitioned into clusters in a sequence of steps. Initially, each speaker is interpreted as a cluster on his or her own. At the first step the data are searched to identify the two most similar clusters. When found, they are joined into a superordinate tree in which their degree of similarity is graphically represented as the length of the horizontal lines joining the subclusters: the more similar the subclusters, the shorter the lines. In the actual clustering procedure assessment of similarity is done numerically, but for present expository purposes a visual inspection of Table 6.4a is sufficient, and, to judge by the shortness of the horizontal lines, the singleton clusters tlsg01 and tlsg03 at the top of the tree are the most similar. These are

Table 6.3  Frequency data for a range of phonetic segments in the NECTE corpus

    Speaker    ə1    ə2    o:    ə3     ī     –     n   a:1   a:2     –     r     w
    tlsg01      3     1    55   101    33    26   193    64     1     8    54    96
    tlsg02      8     0    11    82    31    44   205    54    64     8    83    88
    tlsg03      3     1    55   101    33    26   193    64    15     8    54    96
    tlsn01    100   116     5    17    75     0   179    64     0    19    46    62
    tlsg04     15     0    12    75    21    23   186    57     6    12    32    97
    tlsg05     14     6    45    70    49     0   188    40     0    45    72    49
    tlsg06      5     0    40    70    32    22   183    46     0     2    37   117
    tlsn02    103    92     7     5    87    27   241    52     0     1    19    72
    tlsg07      5     0    11    58    44    31   195    87    12     4    28    93
    tlsg08      3     0    44    63    31    44   140    47     0     5    43   106
    tlsg09      5     0    30   103    68    10   177    35     0    33    52    96
    tlsg10      6     0    89    61    20    33   177    37     0     4    63    97
    tlsn03    142   107     2    15    94     0   234     4     0    61    21    62
    tlsn04    110   120     0    21   100     0   237     4     0    61    21    62
    tlsg11      3     0    61    55    27    19   205    88     0     4    47    94
    tlsg12      2     0     9    42    43    41   213    39    31     5    68   124
    tlsg52     11     1    29    75    34    22   206    46     0    29    34    93
    tlsg53      6     0    49    66    41    32   177    52     9     1    68    74
    tlsn05    145   102     4     6   100     0   208    51     0    22    61   104
    tlsn06    109   107     0     7   111     0   220    38     0    26    19    70
    tlsg54      3     0     8    81    22    27   239    30    32     8    80   116
    tlsg55      7     0    12    57    37    20   187    77    41     4    58   101
    tlsg56     12     0    21    59    31    40   164    52    17     6    45   103
    tlsn07    104    93     0    11   108     0   194     5     0    66    33    69

joined into a composite cluster (tlsg01 tlsg03). At the second step the data are searched again to determine the next-most-similar pair of clusters. Visual inspection indicates that these are tlsg06 and tlsg56 about one-third of the way down the tree, and these are joined into a composite cluster (tlsg06 tlsg56). At step 3, the two most similar clusters are the composite cluster (tlsg06 tlsg56) constructed at step 2 and tlsg08. These are joined into a superordinate cluster ((tlsg06 tlsg56) tlsg08). The sequence of steps continues in this way, combining the most similar pair of clusters at each step, and stops when there is only one cluster remaining which contains all the subclusters. The resulting tree gives an exhaustive graphical representation of the similarity relations in the NECTE speaker data. It shows that there are two main groups of speakers,


Table 6.4  Two types of cluster analysis of the data in Table 6.3: (a) a hierarchical cluster tree over the speaker identifiers; (b) a scatter plot of the same speakers in which spatial proximity represents similarity of phonetic usage.

labelled A and B, which differ greatly from one another in terms of phonetic usage, and, though there are differences in usage among the speakers in those two main groups, the differences are minor relative to those between A and B. Table 6.4b shows the cluster structure of the data in Table 6.3 as a scatter plot in which relative spatial distance between speaker labels represents the relative similarity of phonetic usage among the speakers: the closer the labels the closer the speakers. Labels corresponding to the main clusters in Table 6.4a have been added for ease of cross-reference, and show that this analysis gives the same result as the hierarchical one. Once the structure of the data has been identified by cluster analysis, it can be used for hypothesis generation (Romesburg 1984: chs 4 and 22). The obvious hypothesis in the present case is that the NECTE speakers fall into two distinct groups in terms of their phonetic usage. This could be tested by doing an analysis of the full NECTE corpus using all 63 speakers and all 158 variables, and by conducting further interviews and abstracting data from them for subsequent analysis. Cluster analysis can be applied in any research where the data consist of objects described by variables; since most research uses data of this kind, cluster analysis is very widely applicable. It can usefully be applied where the number of objects and descriptive variables is so large that the data cannot easily be interpreted by direct inspection, and the range of applications where this is the case spans most areas of science, engineering, and commerce (Everitt et al. 2001: ch. 1; Romesburg 1984: chs 4–6; detailed discussion of cluster applications in Jain et al. 1999: 296 ff). In view of the comments made in the introduction about text overload, cluster analysis is exactly what is required for hypothesis generation in corpus linguistics. The foregoing discussion of NECTE is an example in the intersection of phonetics, dialectology, and sociolinguistics: the set of phonetic transcriptions is extensive and the frequency data abstracted from them are far too large


to be in any sense comprehensible, but the structure that cluster analysis identified in the data made hypothesis formulation straightforward.
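A two-dimensional arrangement of the kind shown in Table 6.4b can be approximated with standard software. One common technique for this purpose is multidimensional scaling (MDS), which places objects in a low-dimensional space so that the distances between them reflect, as far as possible, the distances in the original high-dimensional data. The sketch below uses the scikit-learn implementation on a toy frequency matrix excerpted from Table 6.3; it only illustrates the idea and is not the procedure used to produce Table 6.4b.

```python
import numpy as np
from sklearn.manifold import MDS
import matplotlib.pyplot as plt

# Toy speaker-by-segment frequency matrix (values excerpted from Table 6.3).
labels = ["tlsg01", "tlsg02", "tlsn01", "tlsn03"]
M = np.array([
    [  3,   1, 55, 101],
    [  8,   0, 11,  82],
    [100, 116,  5,  17],
    [142, 107,  2,  15],
], dtype=float)

# Map the rows into 2 dimensions so that spatial distance between points
# approximates the original inter-speaker distances.
coords = MDS(n_components=2, random_state=0).fit_transform(M)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), name in zip(coords, labels):
    plt.annotate(name, (x, y))
plt.show()   # the tlsg and tlsn speakers end up in two separate regions
```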

6.3  Cluster Analysis Concepts and Hypothesis Generation 6.3.1  Data Data are abstractions of what we observe using our senses, often with the aid of instruments (Chalmers 1999), and are ontologically different from the world. The world is as it is; data are an interpretation of it for the purpose of scientific study. The weather is not the meteorologist’s data—measurements of such things as air temperature are. A text corpus is not the linguist’s data—measurements of such things as word frequency are. Data are constructed from observation of things in the world, and the process of construction raises a range of issues that determine the amenability of the data to analysis and the interpretability of the results. The importance of understanding such data issues in cluster analysis can hardly be overstated. On the one hand, nothing can be discovered that is beyond the limits of the data itself. On the other, failure to understand relevant characteristics of data can lead to results and interpretations that are distorted or even worthless. For these reasons, an overview of data issues is given before moving on to discussion of cluster analysis concepts; examples are taken from the NECTE corpus cited above.

6.3.1.1  Formulation of a Research Question In general, any aspect of the world can be described in an arbitrary number of ways and to arbitrary degrees of precision. The implications of this go straight to the heart of the debate on the nature of science and scientific theories, but to avoid being drawn into that debate, this discussion adopts the position that is pretty much standard in scientific practice: the view, based on Karl Popper’s philosophy of science (Popper 1959; 1963; Chalmers 1999), that there is no theory-free observation of the world. In essence, this means that there is no such thing as objective observation in science. Entities in a domain of inquiry only become relevant to observation in terms of a research question framed using the ontology and axioms of a theory about the domain. For example, in linguistic analysis, variables are selected in terms of the discipline of linguistics broadly defined, which includes the division into sub-disciplines such as sociolinguistics and dialectology, the subcategorization within sub-disciplines such as phonetics through syntax to semantics and pragmatics in formal grammar, and theoretical entities within each subcategory such as phonemes in phonology and constituency structures in syntax. Claims, occasionally seen, that the variables used to describe a corpus are


‘theoretically neutral’ are naive: even word categories like ‘noun’ and ‘verb’ are interpretative constructs that imply a certain view of how language works, and they only appear to be theory-neutral because of our familiarity with long-established tradition. Data can, therefore, only be created in relation to a research question that is defined using the ontology of the domain of interest, and that thereby provides an interpretative orientation. Without such an orientation, how does one know what to observe, what is important, and what is not? The research question asked with respect to the NECTE corpus, and which serves as the basis for the examples in what follows, is: Is there systematic phonetic variation in the Tyneside speech community, and, if so, what are the main phonetic determinants of that variation?

6.3.1.2  Variable Selection Given that data are an interpretation of some domain of interest, what does such an interpretation look like? It is a description of entities in the domain in terms of variables. A variable is a symbol, and as such is a physical entity with a conventional semantics, where a conventional semantics is understood as one in which the designation of a physical thing as a symbol together with the connection between the symbol and what it represents are determined by agreement within a community. The symbol ‘A’, for example, represents the phoneme /a/ by common assent, not because there is any necessary connection between it and what it represents. Since each variable has a conventional semantics, the set of variables chosen to describe entities constitutes the template in terms of which the domain is interpreted. Selection of appropriate variables is, therefore, crucial to the success of any data analysis. Which variables are appropriate in any given case? That depends on the nature of the research question. The fundamental principle in variable selection is that the variables must describe all and only those aspects of the domain that are relevant to the research question. In general, this is an unattainable ideal. Any domain can be described by an essentially arbitrary number of finite sets of variables; selection of one particular set can only be done on the basis of personal knowledge of the domain and of the body of scientific theory associated with it, tempered by personal discretion. In other words, there is no algorithm for choosing an optimally relevant set of variables for a research question. The NECTE speakers are described by a set of 158 variables each of which represents a phonetic segment. These are described in (Allen et al. 2007) and, briefly, in Beal, Corrigan and Moisl, this volume.

6.3.1.3  Variable Value Assignment The semantics of each variable determines a particular interpretation of the domain of interest, and the domain is ‘measured’ in terms of the semantics. That measurement constitutes the values of the variables: height in metres = 1.71, weight in kilograms = 70, and so on. Measurement is fundamental in the creation of data because it makes the link

between data and the world, and thus allows the results of data analysis to be applied to the understanding of the world. Measurement is only possible in terms of some scale. There are various types of measurement scale, and these are discussed at length in any statistics textbook, but for present purposes the main dichotomy is between numeric and non-numeric. Cluster analysis methods assume numeric measurement as the default case, and for that reason the same is assumed in what follows. For NECTE we are interested in the number of times each speaker uses each of the phonetic segment variables. The speakers are therefore 'measured' in terms of the frequency with which they use these segments.

6.3.1.4  Data Representation If they are to be analysed using mathematically-based computational methods, the descriptions of the entities in the domain of interest in terms of the selected variables must be mathematically represented. A  widely used way of doing this, and the one adopted here, is to use structures from a branch of mathematics known as linear algebra. There are numerous textbooks and websites devoted to linear algebra; a small selection of introductory textbooks is Fraleigh and Beauregard (1994), Poole (2005), and Strang (2009). Vectors are fundamental in data representation. A vector is a sequence of numbered slots containing numerical values. Table 6.5 shows a four-element vector each element of which contains a real-valued number: 1.6 is the value of the first element v1, 2.4 the value of the second element v2, and so on. A single NECTE speaker’s frequency of usage of the 158 phonetic segments in the transcription scheme can be represented by a 158-element vector in which each element is associated with a different segment, as in table 6.6. This speaker uses the segment at Speaker1 23 times, the segment at Speaker2 four times, and so on.

Table 6.5  A vector

V =  1.6   2.4   7.5   0.6
      1     2     3     4

Table 6.6  A vector representing a NECTE speaker

Speaker =   i:    ɪ    ε    eɪ   …    ζ
            23    4    0    34   …    2
             1    2    3     4   …  158


Table 6.7  The NECTE data matrix

              i:    ɪ    ε   eɪ   …    ζ
Speaker 1     23    4    0   34   …    2
Speaker 2     18   12    4   38   …    1
Speaker 3     21   16    9   19   …    5
…
Speaker 63    36    2    1   27   …    3
               1    2    3    4   …  158

The 63 NECTE speaker vectors can be assembled into a matrix M, shown in Table 6.7, in which the 63 rows represent the speakers, the 158 columns represent the phonetic segments, and the value at Mij is the number of times speaker i uses segment j (for i = 1..63 and j = 1..158). This matrix M is the basis of subsequent examples.
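For readers who want to experiment with this representation, the short sketch below shows how a small speaker-by-segment frequency matrix of this kind can be held as a NumPy array; the speakers and values are an excerpt from Table 6.3, and the ASCII segment names are merely stand-ins for the IPA symbols.

```python
import numpy as np

# Rows are speakers, columns are phonetic segments; cell (i, j) holds the
# number of times speaker i uses segment j, as in the matrix M described above.
speakers = ["tlsg01", "tlsg02", "tlsg03", "tlsn01"]
segments = ["@1", "@2", "o:", "@3"]        # ASCII stand-ins for the IPA symbols

M = np.array([
    [  3,   1, 55, 101],    # tlsg01
    [  8,   0, 11,  82],    # tlsg02
    [  3,   1, 55, 101],    # tlsg03
    [100, 116,  5,  17],    # tlsn01
])

# M[i, j] is the frequency with which speaker i uses segment j.
print(dict(zip(segments, M[0])))           # frequency profile of tlsg01
```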

6.3.1.5 Data Issues Once the data are in matrix form they can in principle be cluster analysed. The data may, however, have characteristics that can distort or even invalidate the results, and any such characteristics have to be mitigated or eliminated prior to analysis. These include variation in document or speaker interview length (Moisl 2009), differences in variable measurement scale (Moisl 2011), data sparsity (Moisl 2008), and nonlinearity (Moisl 2007).
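To give one concrete example of such an issue, differing interview lengths can be mitigated by converting each speaker's raw counts into proportions of that speaker's total before clustering. The sketch below shows this generic length-normalization; it illustrates the idea rather than the specific procedures described in the works just cited.

```python
import numpy as np

# Raw frequency matrix: rows = speakers, columns = segments (toy values).
M = np.array([
    [ 30,  10,  60],    # a short interview: 100 segments in total
    [300, 100, 600],    # a long interview with the same usage pattern
], dtype=float)

# Divide each row by its total so that speakers are compared by relative,
# not absolute, frequency of segment use.
row_totals = M.sum(axis=1, keepdims=True)
M_norm = M / row_totals

print(M_norm)   # both rows become [0.3, 0.1, 0.6]
```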

6.3.2  Cluster Analysis Once the data matrix has been created and any data issues resolved, a variety of computational methods can be used to group its row vectors, and thereby the objects in the domain that the row vectors represent. In the present case, those objects are the NECTE speakers.

FIGURE 6.1  Geometrical interpretation of a 2-dimensional vector.

FIGURE 6.2  Geometrical interpretation of a 3-dimensional vector.

FIGURE 6.3  Distributions of multiple vectors in 2- and 3-dimensional spaces.

6.3.2.1  Clusters in Vector Space Although it is just a sequence of numbers, a vector can be geometrically interpreted (Fraleigh and Beauregard 1994; Poole 2005; Strang 2009). To see how, take a vector consisting of two elements, say v = (30, 70). Under a geometrical interpretation, the two elements of v define a two-dimensional space, the numbers at v1 = 30 and v2 = 70 are coordinates in that space, and the vector v itself is a point at the coordinates (30, 70), as shown in Figure 6.1. A vector consisting of three elements, say v = (40, 20, 60), defines a three-dimensional space in which the coordinates of the point v are 40 along the horizontal axis, 20 along the vertical axis, and 60 along the third axis shown in perspective, as in Figure 6.2. A vector v = (22, 38, 52, 12) defines a four-dimensional space with a point at the stated coordinates, and so on to any dimensionality n. Vector spaces of dimensionality greater than three are impossible to visualize directly and are therefore counterintuitive, but mathematically there is no problem with them; two- and three-dimensional spaces are useful as a metaphor for conceptualizing higher-dimensional ones. When numerous vectors exist in a space, it may or may not be possible to see interesting structure in the way they are arranged in it. Figure 6.3 shows vectors in two- and three-dimensional spaces. In (a) they were randomly generated and there is no structure to be observed, in (b) there are two clearly defined concentrations in two-dimensional space, and in (c) there are two clearly defined concentrations in three-dimensional space. The existence of concentrations like those in (b) and (c) indicates relationships among the entities that the vectors represent. In (b), for example, if the horizontal axis measures


weight and the vertical one height for a sample human population, then members of the sample fall into two groups: tall, light people on the one hand, and short, heavy ones on the other. This idea of identifying clusters of vectors in vector space and interpreting them in terms of what the vectors represent is the basis of cluster analysis. In what follows, we shall be attempting to group the NECTE speakers on the basis of their phonetic usage by looking for clusters in the arrangement of the row vectors of M in 158-dimensional space.
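The situation in Figure 6.3(b) is easy to reproduce with synthetic data, which may help to build the geometric intuition. In the sketch below, two concentrations of two-dimensional vectors are generated and plotted; all values are invented purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Two concentrations of 2-dimensional vectors, e.g. weight (x) vs height (y).
group1 = rng.normal(loc=[60, 185], scale=5, size=(50, 2))   # lighter and taller
group2 = rng.normal(loc=[95, 160], scale=5, size=(50, 2))   # heavier and shorter

points = np.vstack([group1, group2])
plt.scatter(points[:, 0], points[:, 1], s=10)
plt.xlabel("weight")
plt.ylabel("height")
plt.show()   # two clearly separated concentrations appear in the plot
```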

6.3.2.2  Clustering Methods Where the data vectors are two- or three-dimensional they can simply be plotted and any clusters will be visually identifiable, as we have just seen. But what about when the vector dimensionality is greater than 3—say 4, or 10, or 100? In such a case direct plotting is not an option; how exactly would one draw a six-dimensional space, for example? Many data matrix row vectors have dimensionalities greater than three—the NECTE matrix M has dimensionality 158—and, to identify clusters in such high-dimensional spaces, some procedure more general than direct plotting is required. A variety of such procedures is available, and they are generically known as cluster analysis methods. This section looks at these methods. Where there are two or more vectors in a space, it is possible to measure the distance between any two of them and to rank them in terms of their proximity to one another. Figure 6.4 shows a simple case of a two-dimensional space in which the distance from vector A to vector B is greater than the distance from A to C.

FIGURE 6.4  Vector distances.

dist(AB) = √((5 − 1)² + (4 − 2)²)

FIGURE 6.5  Euclidean distance measurement between the points A(1,2) and B(5,4).
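The calculation illustrated in Figure 6.5 can be written out directly. The small sketch below computes the Euclidean distance between the two points shown; the same function applies unchanged to vectors of any dimensionality, which is what the clustering methods discussed below rely on.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

print(euclidean([1, 2], [5, 4]))   # sqrt((5-1)^2 + (4-2)^2) ≈ 4.472
# The same call works for 158-dimensional speaker vectors.
```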

There are various ways of measuring such distances, but the most often used is the familiar Euclidean one, as in Figure 6.5. Cluster analysis methods use relative distance among vectors in a space to group the vectors. Specifically, for a given set of vectors in a space, they first calculate the distances between all pairs of vectors, and then group into clusters all the vectors that are relatively close to one another in the space and relatively far from those in other clusters. 'Relatively close' and 'relatively far' are, of course, vague expressions, but they are precisely defined by the various clustering methods, and for present purposes we can avoid the technicalities and rely on intuitions about relative distance. For concreteness, we will concentrate on one particular class of methods: the hierarchical cluster analysis already introduced in Section 1, which represents the relativities of distance among vectors as a tree. Figure 6.6 exemplifies this. Column (a) shows a 30 × 2 data matrix that is to be cluster analysed. Because the data space is two-dimensional, the vectors can be directly plotted to show the cluster structure, as in the upper part of column (b). The corresponding hierarchical cluster tree is shown in the lower part of column (b). There are three clusters labelled A, B, and C in each of which the distances among vectors are quite small. These three clusters are relatively far from one another, though A and B are closer to one another than either of them is to C. Comparison with the vector plot shows that the hierarchical analysis accurately represents the distance relations among the 30 vectors in two-dimensional space. Given that the tree tells us nothing more than what the plot tells us, what is gained? In the present case, nothing. The real power of hierarchical analysis lies in its independence of vector space dimensionality. We have seen that direct plotting is limited to three or fewer dimensions, but there is no dimensionality limit on hierarchical analysis—it can determine relative distances in vector spaces of any dimensionality and represent those distance relativities as a tree like the one above. To exemplify this, the 158-dimensional NECTE data matrix M was hierarchically cluster analysed (Moisl et al. 2006), and the result is shown in Figure 6.7. Plotting M in 158-dimensional space would have been impossible, and without cluster analysis one would have been left pondering a very large and incomprehensible matrix of numbers. With the aid of cluster analysis, however, structure in the data is clearly visible: there are two main clusters, NG1 and NG2; NG1 consists of large subclusters NG1a and NG1b; NG1a itself has two main subclusters NG1a(i) and NG1a(ii).
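Readers wishing to try such an analysis themselves can do so with widely available scientific software. The sketch below applies the SciPy library's agglomerative clustering routines to a small excerpt of the frequency data in Table 6.3; it only illustrates the mechanics and is not the implementation used for the NECTE analysis reported in Moisl et al. (2006).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Toy excerpt of the speaker-by-segment frequency matrix (first four columns
# of Table 6.3 for six speakers); the full analysis would use the 63 x 158 matrix M.
labels = ["tlsg01", "tlsg02", "tlsg03", "tlsn01", "tlsn03", "tlsn04"]
M = np.array([
    [  3,   1, 55, 101],
    [  8,   0, 11,  82],
    [  3,   1, 55, 101],
    [100, 116,  5,  17],
    [142, 107,  2,  15],
    [110, 120,  0,  21],
], dtype=float)

# Agglomerative clustering: repeatedly join the two most similar clusters,
# using Euclidean distance between vectors (cf. the description above).
Z = linkage(M, method="average", metric="euclidean")

dendrogram(Z, labels=labels, orientation="left")
plt.show()   # the tlsg and tlsn speakers separate into two main clusters
```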

6.3.2.3  Hypothesis Generation Given that there is structure in the relative distances of the row vectors of M from one another in the data space, what does that structure mean in terms of the research question? Is there systematic phonetic variation in the Tyneside speech community, and, if so, what are the main phonetic determinants of that variation?

Because the row vectors of M are phonetic profiles of the NECTE speakers, the cluster structure means that the speakers fall into clearly defined groups with specific

      v1  v2          v1  v2          v1  v2
 1    27  46     11   78  72     21   55  13
 2    29  48     12   79  74     22   56  15
 3    30  50     13   80  70     23   60  13
 4    32  51     14   84  73     24   63  12
 5    34  54     15   85  69     25   64  10
 6    55   9     16   27  55     26   84  72
 7    56   9     17   29  56     27   85  74
 8    60  10     18   30  54     28   77  70
 9    63  11     19   33  51     29   76  73
10    64  11     20   34  56     30   76  69

FIGURE 6.6  Data matrix and corresponding row-clusters: column (a) is the 30 × 2 data matrix listed above; column (b) shows the plotted vectors and the corresponding hierarchical cluster tree, with three clusters labelled A, B, and C.

FIGURE 6.7  Hierarchical analysis of the NECTE data matrix in Table 6.7. The leaves of the tree are the speaker identifiers; the two main clusters are labelled NG1 and NG2, with NG1 divided into NG1a (comprising NG1a(i) and NG1a(ii)) and NG1b.

interrelationships rather than, say, being randomly distributed around the phonetic space. A reasonable hypothesis to answer the first part of the research question, therefore, is that there is systematic variation in the Tyneside speech community. This hypothesis can be refined by examining the social data relating to the NECTE speakers, which show, for example, that all those in the NG1 cluster come from the Gateshead area on the south side of the river Tyne and all those in NG2 come from Newcastle on the north side, and that the subclusters in NG1 group the Gateshead speakers by gender and occupation (Moisl et al. 2006).


The cluster tree can also be used to generate a hypothesis in answer to the second part of the research question. So far we know that the NECTE speakers fall into clearly demarcated groups on the basis of variation in their phonetic usage. We do not, however, know why, that is, which segments out of the 158 in the TLS transcription scheme are the main determinants of this regularity. To identify these segments (Moisl and Maguire 2008), we begin by looking at the two main clusters NG1 and NG2 to see which segments are most important in distinguishing them. The first step is to create for the NG1 cluster a vector that captures the general phonetic characteristics of the speakers it contains, and to do the same for the NG2. Such vectors can be created by averaging all the row vectors in a cluster using the formula 

vj = (Σi=1…m Mij) / m

where vj is the jth element of the average or ‘centroid’ vector v (for j = 1.. the number of columns in M), M is the data matrix, Σ designates summation, and m is the number of row vectors in the cluster in question (56 for NG1, 7 for NG2). This yields two centroid vectors. Next, compare the two centroid vectors by co-plotting them to show graphically how, on average, the two speaker groups differ on each of the 158 phonetic segments; a plot of all 158 segments is too dense to be readily deciphered, so the six on which the NG1 and NG2 centroids differ most are shown in Figure 6.8.

FIGURE 6.8  Co-plot of centroid vectors for NG1 (black) and NG2 (grey), showing the six phonetic segments on which the two centroids differ most.
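The centroid calculation given above, and the search for the segments on which two centroids differ most, can be sketched as follows; the cluster memberships and the tiny matrix are invented for illustration and do not reproduce the actual NG1/NG2 analysis.

```python
import numpy as np

# M: speakers-by-segments frequency matrix; segment_labels: one label per column.
# Both are tiny invented stand-ins for the real 63 x 158 NECTE matrix.
segment_labels = ["@1", "@2", "o:", "@3", "i", "ei"]
M = np.array([
    [  3,   1, 55, 101, 33, 26],
    [  8,   0, 11,  82, 31, 44],
    [100, 116,  5,  17, 75,  0],
    [142, 107,  2,  15, 94,  0],
], dtype=float)

ng1_rows = [0, 1]    # indices of speakers in cluster NG1 (illustrative)
ng2_rows = [2, 3]    # indices of speakers in cluster NG2 (illustrative)

# Centroid = column-wise mean over the rows belonging to the cluster,
# i.e. vj = (sum over i of Mij) / m for each column j.
centroid_ng1 = M[ng1_rows].mean(axis=0)
centroid_ng2 = M[ng2_rows].mean(axis=0)

# Segments on which the two centroids differ most, largest difference first.
diff = np.abs(centroid_ng1 - centroid_ng2)
for j in np.argsort(diff)[::-1]:
    print(segment_labels[j], round(centroid_ng1[j], 1), round(centroid_ng2[j], 1))
```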

The six phonetic segments most important in distinguishing cluster NG1 from NG2 are three varieties of [ə], [ɔː], [ɪ], and [eɪ]: the Newcastle speakers characteristically use ə1 and ə2 whereas the Gateshead speakers use them hardly at all, the Gateshead speakers use yet another variety of schwa, ə3, much more than the Newcastle speakers, and so on. A hypothesis that answers the second part of the research question is therefore that the main determinants of phonetic variation in the Tyneside speech community are three kinds of [ə], [ɔː], [ɪ], and [eɪ]. The subclusters of NG1 can be examined in the same way and the hypothesis thereby further refined.

6.4 Literature Review The topic of this chapter cuts across several academic disciplines, and the potentially relevant literature is correspondingly large. This review is therefore highly selective. It also includes a few websites; as ever with the web, caveat emptor, but the ones cited seem to me to be reliable and useful.

6.4.1  Statistics and Linear Algebra Using cluster analysis competently requires some knowledge both of statistics and of linear algebra. The following references to introductory and intermediate-level accounts provide this.

6.4.1.1 Statistics In any research library there is typically a plethora of introductory and intermediate-level textbooks on probability and statistics. It is difficult to recommend specific ones on a principled basis because most of them, and especially the more recent ones, offer comprehensive and accessible coverage of the fundamental statistical concepts and techniques relevant to corpus analysis. For the linguist at any but advanced level in statistical corpus analysis, the choice is usually determined by a combination of what is readily available and presentational style. Some personal introductory favourites are (Devore and Peck 2005; Freedman et al. 2007; Gravetter and Wallnau 2008), and, among more advanced ones, (Casella and Berger 2001; Freedman 2009; Rice 2006).

Statistics websites
• Hyperstat Online Statistics Textbook: http://davidmlane.com/hyperstat/
• NIST-Sematech e-Handbook of Statistical Methods: http://www.itl.nist.gov/div898/handbook/index2.htm
• Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/index.htm
• Statistics on the Web: http://my.execpc.com/~helberg/statistics.html


• Statsoft Electronic Statistics Textbook: http://www.statsoft.com/textbook/
• SticiGui e-textbook: http://www.stat.berkeley.edu/~stark/SticiGui/index.htm
• John C. Pezzullo's Statistical Books, Manuals, and Journals links: http://statpages.org/javasta3.html
• Research Methods Knowledge Base: http://www.socialresearchmethods.net/kb/index.php

Statistics software
Contemporary research environments standardly provide one or more statistics packages as part of their IT portfolio, and these packages together with local expertise in their use are the first port of call for the corpus analyst. Beyond this, a web search using the keywords 'statistics software' generates a deluge of links from which one can choose. Some useful directories are:
• Wikipedia list of statistical software: http://en.wikipedia.org/wiki/List_of_statistical_packages
• Wikipedia comparison of statistical packages: http://en.wikipedia.org/wiki/Comparison_of_statistical_packages
• Open directory project, statistics software: http://www.dmoz.org/Science/Math/Statistics/Software/
• Stata, statistical software providers: http://www.stata.com/links/stat_software.html
• Free statistics: http://www.freestatistics.info/
• Statcon, annotated list of free statistical software: http://statistiksoftware.com/free_software.html
• Understanding the World Today, free software: statistics: http://gsociology.icaap.org/methods/soft.html
• The Impoverished Social Scientist's Guide to Free Statistical Software and Resources: http://maltman.hmdc.harvard.edu/socsci.shtml
• StatSci, free statistical packages: http://www.statsci.org/free.html
• Free statistical software directory: http://www.freestatistics.info/stat.php
• John C. Pezzullo's free statistical software links: http://statpages.org/javasta2.html
• Statlib: http://lib.stat.cmu.edu/

6.4.1.2  Linear Algebra Much of the literature on linear algebra can appear abstruse to the non-mathematician. Two recent and accessible introductory textbooks are (Poole 2005) and (Strang 2009); older, but still a personal favourite, is (Fraleigh and Beauregard 1994).

Linear algebra websites
• PlanetMath: Linear algebra: http://planetmath.org/encyclopedia/LinearAlgebra.html
• Math Forum: Linear algebra: http://mathforum.org/linear/linear.html


6.4.2  Cluster Analysis As with general statistics, the literature on cluster analysis is extensive. It is, however, much more difficult to find introductory-level textbooks for cluster analysis, since most assume a reasonable mathematical competence. A  good place to start is with (Romesburg 1984), a book that is now quite old but still a standard introductory text. More advanced accounts, in chronological order, are (Jain and Dubes 1988; Arabie et al. 1996; Gordon 1999; Jain et al. 1999; Everitt et al. 2001; Kaufman and Rousseeuw 2005; Gan et al. 2007; Xu and Wunsch 2008). Cluster analysis is also covered in textbooks for related disciplines, chief among them multivariate statistics (Kachigan 1991; Grimm and Yarnold 2000; Hair et al. 2007; Härdle and Simar 2007), data mining (Mirkin 2005; Nisbet et al. 2009), and information retrieval (Manning et al. 2008).

Cluster analysis websites
• P. Berkhin (2002) Survey of clustering data mining techniques: http://citeseer.ist.psu.edu/cache/papers/cs/26278/http:zSzzSzwww.accrue.comzSzproductszSzrp_cluster_review.pdf/berkhin02survey.pdf
• Journal of Classification: http://www.springer.com/statistics/statistical+theory+and+methods/journal/357

6.4.2.1  Cluster Analysis Software
Many general statistics packages provide at least some cluster analytical functionality. For clustering-specific software a web search using the keywords 'clustering software' or 'cluster analysis software' generates numerous links. See also the following directories:
• Classification Society of North America, cluster analysis software: http://www.pitt.edu/~csna/software.html
• Statlib: http://lib.stat.cmu.edu/
• Open Directory Project, cluster analysis: http://search.dmoz.org/cgi-bin/search?search=cluster+analysis&all=yes&cs=UTF-8&cat=Computers%2FSoftware%2FDatabases%2FData_Mining%2FPublic_Domain_Software

6.4.3  Statistical Methods in Linguistic Research Mathematical and statistical concepts and techniques have long been used across a range of disciplines concerned in some sense with natural language, and these concepts and techniques are often relevant to corpus-based linguistics. Two such disciplines have just been mentioned: information retrieval and data mining. Others are natural language processing (Manning and Schütze 1999; Dale et al. 2000; Jurafsky and Martin 2008; Cole et  al. 2010; Indurkhya and Damerau 2010), computational linguistics (Mitkov 2005), artificial intelligence (Russell and Norvig 2009), and the range of sub-disciplines


that comprise cognitive science (including theoretical linguistics) (Wilson and Keil 2001). The literatures for these are, once again, very extensive, and, to keep the range of reference within reasonable bounds, two constraints are self-imposed: (i) attention is restricted to the use of statistical methods in the analysis of natural language corpora for scientific as opposed to technological purposes, and (ii) only a small and, one hopes, representative selection of mainly, though not exclusively, recent work from 1995 onwards is given, relying on it as well as (Köhler and Hoffmann 1995) to provide references to earlier work.

6.4.3.1 Textbooks Woods et al. 1986; Souter and Atwell 1993; Stubbs 1996; Young and Bloothooft 1997; Biber et al. 1998; Oakes 1998; Baayen 2008, Johnson 2008; Gries 2009, Gries et al. 2009.

6.4.3.2  Specific Applications As with other areas of science, most of the research literature on specific applications of quantitative and more specifically statistical methods to corpus analysis is in journals. The one most focused on such applications is the Journal of Quantitative Linguistics; other important ones, in no particular order, are Computational Linguistics, Corpus Linguistics and Linguistic Theory, International Journal of Corpus Linguistics, Literary and Linguistic Computing, and Computer Speech and Language. • Language classification: (Kita 1999; Silnitsky 2003; Cooper 2008) • Lexis: (Andreev 1997; Lin 1998; Li and Ave 1998; Allegrini et al. 2000; Yarowsky 2000; Baayen 2001; Best 2001; Lin and Pantel 2001; Watters 2002; Romanov 2003; Oakes and Farrow 2007) • Syntax: (Köhler and Altmann 2000; Gries 2001; Gamallo et al. 2005; Köhler and Naumann 2008) • Variation: (Kessler 1995; Heeringa and Nerbonne 2001; Nerbonne and Heeringa 2001; 2010; Nerbonne and Kretzschmar 2003; Nerbonne et  al. 2008; Nerbonne 2009; Kleiweg et al. 2004; Cichocki 2006; Gooskens 2006; Hyvönen et al. 2007; Wieling and Nerbonne 2010; Wieling et al. 2013) • Phonetics/phonology/morphology:  (Jassem and Lobacz 1995; Hubey 1999; Kageura 1999; Andersen 2001; Cortina-Borja et  al. 2002; Clopper and Paolillo 2006; Calderone 2009; Mukherjee et al. 2009; Sanders and Chin 2009) • Sociolinguistics:  (Paolillo 2002; Moisl and Jones 2005; Moisl et  al. 2006; Tagliamonte 2006; Moisl and Maguire 2008; Macaulay 2009) • Document clustering and classification: (Manning and Schütze 1999; Lebart and Rajman 2000; Merkl 2000). Document clustering is prominent in information retrieval and data mining, for which see the references to these given above. Many of the authors cited here have additional related publications, for which see their websites and the various online academic publication directories.


Corpus Linguistics Websites
• Gateway to Corpus Linguistics: http://www.corpus-linguistics.com/
• Bookmarks for Corpus-Based Linguistics: http://personal.cityu.edu.hk/~davidlee/devotedtocorpora/CBLLinks.htm
• Statistical natural language processing and corpus-based computational linguistics: an annotated list of resources: http://nlp.stanford.edu/links/statnlp.html
• Intute. Corpus Linguistics: http://www.intute.ac.uk/cgi-bin/browse.pl?id=200492
• Stefan Gries' home page links: http://www.linguistics.ucsb.edu/faculty/stgries/other/links.html
• Text Corpora and Corpus Linguistics: http://www.athel.com/corpus.html
• UCREL: http://ucrel.lancs.ac.uk/
• ELSNET: http://www.elsnet.org/
• ELRA: http://www.elra.info/
• Data-intensive Linguistics (online textbook): http://www.ling.ohio-state.edu/~cbrew/2005/spring/684.02/notes/dilbook.pdf

CHAPTER 7

CORPUS ARCHIVING AND DISSEMINATION

PETER WITTENBURG, PAUL TRILSBEEK, AND FLORIAN WITTENBURG

7.1 Introduction The nature of data archiving and dissemination has been changing dramatically with the trend towards an all-digital data world, and it will continue to change due to the enormous innovation rate of information and communication technology. New sensor equipment enables all researchers to create large amounts of data; new software technology allows users to create data enrichments of all sorts, and internet technology allows users to virtually combine and utilize data. In addition, the ongoing innovation will require the continuous migration of data. Thus data management has to deal with continuous changes; the nature of collections is no longer static, and new channels of dissemination have been made available. These changes have also resulted in a blurring of the term ‘corpus’ in modern data archiving and dissemination. Traditionally, a linguistic corpus has been defined as a large, coherent, and structured set of resources created to serve a certain research purpose and usually used for statistical processing and hypothesis testing. Motivated by advances in information technology, we now increasingly consider re-purposing existing resources and using them in different contexts, i.e. selecting resources from different corpora, merging them to virtual collections, doing completely unforeseen types of analysis and enriching them in various ways. Thus, a single resource or a group of resources from a certain corpus can become part of a type of ‘virtual corpus’. This is the reason why we prefer to speak about collections in this chapter. In this chapter, we will first discuss the traditional model of archiving and dissemination; analyse in detail what has changed in these processes based on digital innovation;

discuss the curation1 and preservation requirements that need to be met; discuss briefly the need for advanced infrastructures; and finally present a specific archive that meets some of the modern requirements.

7.2  Traditional Archives of Corpus Primary Data Traditional corpus archives, including those that store analogue recordings of sounds, are characterized by the close relationship between carrier and content. Every access action is associated with a loss of quality of the content, i.e. we can say that content and carrier are mutually intertwined. Also, for modern digital technology, access is of course associated with quality reduction—a rotating disk has a short lifetime due to attrition. However, the big difference is that digital copying can be done without loss of information. One of the principles explained in the guidelines for traditional media—be it paper or analogue media—is not to touch the original, and thus to restrict access even if it is for copying purposes. This affects workflows, ways of curation, dissemination, and business models. Originals need to be stored in special and expensive environments;2 only at restricted moments will master copies be taken which then will be used to create the copies used for dissemination.3 For analogue media, it is known that, due to electronic circuitry, there is at least a degradation of 3 dB for each copy, but additional damages of the carriers, e.g. due to mechanical factors, are possible. Consequently, business and dissemination models follow rather restrictive and static principles, i.e. on request copies to a new carrier are created and are disseminated by ground mail requiring various activities from the archivists. The copying of content fragments to new carriers is also possible, but is even more time-consuming. Curation of analogue recordings, for example, implies a transfer from one carrier format to another format when it is announced that existing player technology is to be phased out. Often these transformations are started too late, so that old players are only to be found in specialized institutes and copying becomes expensive because of the manual operation required and increasing hardware maintenance costs. In proper traditional archives, metadata records that describe the history of a recording were maintained manually, either on paper or in a database, so that the validity of a certain operation or interpretation could be checked. 1  By (digital) curation we mean the process of selecting, preserving and maintaining accessibility of digital collections, which includes its migration to new formats for example. 2  A UNESCO study (Schüller 2004) revealed that about 80 per cent of the recordings about cultures and languages are endangered due to carrier substrate deterioration. This fact indicates that in reality many originals are not treated appropriately, which will certainly lead to data loss. 3  A good example is given by the film industry in Hollywood, which is planning to store the originals of films in old salt mines. Experience shows that after 50 years about 50 per cent of the films can still be read (Cieply 2007).


7.3  Digital Corpus Archives and Dissemination Digital technology is dramatically changing the rules in all respects—some speak of a revolution. Certainly the basic rule ‘don’t touch the original’ is no longer valid. Digital copying, when done properly by maintaining integrity and authenticity, can be carried out automatically, without limits and without loss of information. The disadvantages of very short media lifetime and media fragility are more than compensated for. The opposite is true now: ‘touch the stored objects regularly’.

7.3.1  Dynamic Online Archives This principle is congruent with the wishes of researchers to access the stored data whenever they want, and it changes the basic rules for archives. The traditional model of managing data was based on two pillars, ‘long-term preservation’ and ‘reusing the data’, and, for the reasons mentioned above, they had to be tackled separately. Digital technology allows us to switch to online archives, i.e. there is no longer any need to maintain a separate archive where no one can access the ‘originals’—in fact this is counterproductive. Automatic procedures allow us to create several copies, and software technology allows us to separate ‘primary’ resources from all kinds of enrichments which then may become primary resources for other researchers. Therefore we can say that modern digital archives are ‘live archives’—the stored objects are subject to continuous changes. These can be (1) migrations to new carriers, which need to be carried out on average every four years; (2) migrations to new encoding standards and formats; (3) the creation of a variety of presentation formats; (4) new versions of stored resources; and (5) enrichments in form of added resources, new annotations, and extended relations between object fragments. Obviously such a digital archive includes an increasing complexity of relations between objects and object fragments that serve a wide variety of functions, as depicted in Figure 7.1.

7.3.2  Handling Complexity Such complexity needs to be managed, and elements of a feasible solution have been consolidated in the recent years. The major elements are • Each object needs to be identified by a persistent and unique identifier (PID4) that needs to be registered at a dedicated external registry; such PIDs need to 4  Persistent IDentifiers (PID) services are now being offered to register data collections and objects (EPIC: http://www.pidconsortium.eu/; DataCite: http://datacite.org).

FIGURE 7.1  The complexity of relations which need to be managed. The object has a metadata description which points to the PID that has paths to all accessible object instances. It can be part of several collections—amongst them the original 'corpus' it was meant to be part of—each of which has its own metadata description. There can be new versions of an object that need to be bundled by the metadata description. There can be a variety of derived objects requiring the maintenance of the relations; and finally there will be unlimited relations between fragments of various objects.

be associated with checksum type of information that can e.g. be used to check integrity and can point to different instances (copies) of the same object; each new version of an object needs to receive its own PID to prove identity and integrity and to allow correct referencing. • Each object (except metadata objects) needs to be associated with a metadata description (see chapter I.7) that includes stable references to all its closely related objects, i.e. a metadata description of an annotation should include the PID of the primary resource(s). • Metadata descriptions need to include provenance information either directly or indirectly by referring to a separate object. • It must be possible to create an unlimited number of virtual collections on the basis of the stored objects, and each such virtual collection is defined by its metadata description which will include all references, i.e. users must be able to create their own collection organization by creating hierarchies of virtual collections. One such organization is called the canonical organization,5 since it will be used by

5  This can be compared with the difference between the Unix directory structure and soft links.


FIGURE 7.2  The activities typically carried out by a proper repository system when a new annotation is added, which is a typical enrichment action. In general, stand-off principles need to be applied, i.e. existing objects may not be affected. In the case shown it could be an authorized user who creates an annotation and an updated version of the metadata description containing information and references to both the primary object and the new annotation. Nevertheless, the gatekeeper function will carry out a few checks, perform a few calculations, and then automatically request a PID by providing typical information. Once all operations have been successfully concluded, the annotation will be integrated into the collection and the metadata description will be updated.

the depositors and archive managers to establish authority domains and to carry out management operations; • All objects, in particular metadata descriptions, need to be represented as atomic objects in standard encodings and formats to make their readability as independent from layered software as possible. Encapsulation makes sense for indexes that are derived to support fast searching etc. • Database and textual structures need to be specified by registered schemas,6 and all tag sets used should be registered in concept registries.7

6  Databases as well as textual structures such as those storing a lexicon can have complex structures. To be able to interpret such a complex structure correctly, a description of its syntax is required. For relational databases this is done by providing its logical data model with the help of the Data Definition Language, and for XML files its schema based on a schema definition. These structure definitions should be externally accessible via open registries. CLARIN is currently building such a schema registry. 7  While schema registries help in parsing the structure correctly, concept registries such as ISOcat (http://www.isocat.org) help the user to interpret the semantics of tag sets that are e.g. used to describe linguistic phenomena such as parts of speech.

• Each archive needs to have a well-maintained repository management system that has a gatekeeper function to guarantee the archive's consistency by checking encoding and format correctness, by creating a PID record, by checking the validity of the metadata descriptions, and by extending them to include relevant references.8
• A variety of access methods needs to be provided to support the naive as well as the professional user.
Figure 7.2 schematically illustrates the operations to be carried out when an annotation of an existing audio recording is added to a collection.
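To make the gatekeeper idea concrete, the following sketch shows, in deliberately simplified form, the kind of checks such a system performs at ingest time: verifying the declared format, computing a checksum over the bit-stream, and assembling the record that would be submitted to a PID registry. All function and field names are invented for this illustration; real repository systems such as those mentioned in note 8 implement these steps in their own ways.

```python
import hashlib
from pathlib import Path

ACCEPTED_FORMATS = {".wav", ".xml", ".eaf"}    # illustrative whitelist only

def ingest(path: str) -> dict:
    """Simplified gatekeeper: check format, compute checksum, build a PID request."""
    p = Path(path)
    if p.suffix.lower() not in ACCEPTED_FORMATS:
        raise ValueError(f"format {p.suffix} not accepted by this archive")

    # Checksum over the bit-stream, later used to verify the integrity of copies.
    md5 = hashlib.md5(p.read_bytes()).hexdigest()

    # The record a repository might register with an external PID service
    # (field names are invented for this sketch).
    return {
        "checksum_md5": md5,
        "size_bytes": p.stat().st_size,
        "instances": [f"file://{p.resolve()}"],
    }

# Example use (illustrative file name): print(ingest("recording.wav"))
```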

7.3.3  Data Management As the amount of data grows also in the domain of linguistics, the principles described above are becoming increasingly relevant. Some experts already speak of a ‘data deluge’, others use the term ‘data Tsunami’ to refer to the challenges we are facing due to the necessity to handle data on an enormous scale—something natural sciences have been aware of for years. Multimedia archives containing digitized audio and video recordings, brain images, and other time-series data easily extend to some hundreds of terabytes and millions of related objects. Both together—scale and complexity—can no longer be handled in traditional ways, such as one researcher having all of his or her data on his or her notebook in directories. A professionally acting field researcher reported about 6,000 files he had created on his notebook, covering all materials of a certain language, i.e. objects that are related with each other in multiple ways. He was no longer able to manage this. Also we see that researchers who upload resources without metadata on a server typically forget about potentially valuable content after a few months. As a consequence, researchers should take up their responsibilities and take concrete actions urgently as part of a long-term strategy. The basis for all strategic decisions is a three-layer model of responsibility and actions illustrated in Figure 7.3. (1) Users preferably want to access the data universe from their notebooks independent of time and location. (2) Domain-specific services will be organized that are backed by detailed knowledge of the discipline specific solutions. (3) Common data services such as long-term preservation will be offered by strong data centres. Figure 7.3 also shows that ‘data curation’ is a task which includes all experts involved in the whole data object lifecycle, and that the acceptance and seamless functioning of such a layered system is dependent on mutual trust. The model stresses the fact that certain tasks cannot be done by the individual researcher in a useful way for a number of reasons. Researchers need to find ways to deposit their data in professional and trusted centres that offer appropriate access services and guarantee long-term preservation. Increasingly we see that data creation 8  Examples of such repository management systems are FEDORA (http://www.fedora-commons. org/), http://www.dspace.org/), and LAMUS (http://tla.mpi.nl/tools/tla-tools/lamus/).


FIGURE 7.3  The principal architecture suggested by the High-Level Expert Group on Scientific Data. It covers mainly three layers: the users generating data or making use of stored data, community-specific centres offering services that incorporate domain knowledge, and finally common data services that are the same for all research disciplines. Of course this can only be seen as a reference model, i.e. we will see different types of realization.

funded by government may no longer be seen as private capital, but as data that should be accessible to all interested researchers. Sharing with a larger community, however, requires that data be deposited with a recognized repository.

7.3.4 Costs In the context of the discussions about the nature of common data services, the term ‘Cloud’ is being discussed intensively. In contrast to the Grid model, where data and in particular operations are distributed on a large number of nodes, the Cloud model concentrates data and operations on large compute and storage facilities; thus we can speak about a form of centralization of resources. Within the Cloud we also have a distributed scenario, but all covered by one authority domain and one technological solution. Such large facilities can offer almost unlimited storage capacities at low prices, since economy of scale can be applied. Since these facilities are set up at strategic places, they can also operate their services at ecologically optimized conditions, and are thus attractive ways to set up common data services. Since mirroring of data at different locations is a must for long-term preservation, there will always be some form of distribution outside a Cloud’s authority domain. However, most of the costs for data lifecycle management are not caused by the pure bit-stream preservation, but by curation efforts and by maintaining access services, as Table 7.1 indicates. The results reported in Table 7.1 can be compared with what Beagrie found in an overview of some data archives in the UK (Schüller 2004): (1) the costs for staff are much higher

Table 7.1  Annual operating costs of the archive at the Max Planck Institute for Psycholinguistics

Cost factor                                     Costs (1000€/yr)   Comments
Local IT and storage infrastructure             80                 4–8 years innovation cycle
4 copies at large data centres                  10–20              All copying activity is automatic
Local system management                         40                 Shared for different activities
Archive management                              80                 Archive manager and student assistants
Repository software maintenance                 60                 Basic code maintenance
Utilization/exploitation software maintenance   >120               These costs can be a bottomless pit

than for equipment and (2) the costs for ingest (42 per cent) and access (35 per cent) are higher than for storage and preservation (23 per cent). The still relatively high sum for storage and preservation has to do with the fact that curation costs are included. Table 7.1 indicates the costs for the archive at the Max Planck Institute for Psycholinguistics, which stores about 75 terabytes of data in an online accessible archive, stores about 200 terabytes of additional data, maintains a local hierarchical storage management system including multilayer hardware and software components, and automatically creates four dynamic copies at remote data centres. The costs of the remote storage are almost negligible compared to the other costs. Maintaining data and basic software components9 which ensure that data is accessible and interpretable requires most expenditure. Investments in utilization and exploitation software depend on the level of sophistication required. Technological innovation means that the costs for local investments (row 1) may be slightly reduced in future. At the Max Planck Institute we can observe another trend which is indicated in Figure 7.4 by the difference between the circled line (total amount of data) and the starred line (organized and partly annotated data). An increasingly high percentage of the data collected is neither described nor annotated, i.e. it cannot be used for scientific purposes. There is an increasing demand for better automatic analysis and exploitation methods.

7.3.5  Data Dissemination Digital technologies have not only changed the ways archives are organized and data are managed; they have also dramatically affected the channels of dissemination. As already indicated, all copying activity to strong data centres needs to be carried out 9  Once useful repository software supporting the basic requirements is commercially available, this cost factor may be reduced, but then considerable licence costs must be calculated, comparable to the costs for a professional hierarchical storage management system such as SAM-FS (Schüller 2004).


FIGURE 7.4  This diagram indicates that at the Max Planck Institute an increasing amount of data is collected but not described in a way that it can be used for analysis. The starred line indicates the amount of described data, the circled line the total amount of observational data. (Vertical axis: data in terabytes; horizontal axis: year, 2000–2012.)

automatically and dynamically. In general, only new data will be transferred, about 1 terabyte per month (in the MPI case), which does not constitute a problem for current computer and network technology. We only see one scenario where an Internet-based exchange of data does not seem feasible at present: a user (group) needs to have fast access to large collections e.g. to train a stochastic engine, and thus a whole data set of several terabytes and more needs to be transferred in a short time period. In such cases it may still be useful to send media such as tapes by ground mail. The worldwide film industry, which creates animated high-resolution films through the joint activity of highly specialized labs operating around the globe, is quickly exchanging modules via the Internet to achieve the required production speed (see the CineGrid project [2]‌). This may give an impression of how state-of-the-art network technology can be used to disseminate large data volumes, and how dissemination will evolve in the coming years to support modern production lines, for instance. For random access even to lengthy media files, the Internet is a very convenient platform when e.g. highly compressed media streaming is applied. Only those fragments are downloaded which have been demanded or which have a high probability of being analysed next. Traditional dissemination channels are not completely obsolete, as shown, but will be widely replaced by methods using the Internet. Another aspect of dissemination has to do with chunking and packaging. Traditionally the user ordered a certain ‘corpus’ such as the British National Corpus [3]‌ or the Dutch Spoken Corpus [4]. This would be copied to a carrier and shipped to the

user. For specific types of usage, this may still be the default way of acting. But as already indicated, increasingly often this static usage will be replaced by a more dynamic usage which also opens the way to continuously extended collections. Researchers, for example, may take a few resources from corpus A and another few resources from corpus B if they all cover interactions between mothers and children of a certain lifetime relevant for the research question in focus. They create a 'virtual collection' to work on, including the creation of special annotations and links between elements of the two resource sets. In such dynamic scenarios it will no longer be as straightforward to do appropriate packaging. Exchange formats such as METS [5] and MPEG21 DIDL [6] allow bundling metadata descriptions and all sorts of relations an object has, but these are no longer static and can only be created on the basis of users' selection decisions. Thus dissemination, if seen only as one-directional, will become old-fashioned for most research activities.

7.4  Curation and Preservation

We have argued that in particular in the research domain we have dynamic archives, which is partly caused by the fact that encoding standards, structuring formats, and software components are continuously changing. Furthermore, with larger amounts of data the likelihood of bit errors—although small—may have an effect on data integrity. Thus on the one hand we need to make sure that bit-streams are maintained correctly, but on the other hand we need to ensure that interpretability is preserved while maintaining authenticity.

Maintaining bit-streams—thus taking care of data integrity—in a distributed environment requires identifying the object at data entry time by a persistent identifier (PID), a reliable checksum indicator such as MD5 associated with the PID, and a metadata description with some verified type and format specifications. Any operation at any site storing copies of the object needs to verify whether the object is still the same, i.e. whether the bit-stream has been conserved correctly. To date there is no general solution in place for open repository domains to verify correctness in the sketched way. Some projects such as REPLIX [7] are currently working on this issue.

Much more problematic is to ensure interpretability and 'appropriate' presentation in a world where software technology is changing at dramatic speeds and where new encoding standards and formats are emerging regularly. In the area of textual data we can now refer to UNICODE and XML as widely accepted standards offering basic stability for some decades.10 For sound we have linear PCM at HIFI norms (44.1/48 kHz/16 bit or 96 kHz/24 bit), which offers a solid base as master format. However, various compressed formats such as MP3 will continue to emerge for different purposes. In the area of video streams we were faced with a dynamic development of encoding schemes

10  The schema of a resource may change over time, but it is still described by the same Syntax Description Language.


starting from MPEG1 and continuing to MPEG2, MPEG4/H.264,11 and also MJPEG 2000—all being introduced in little more than a decade. Since there is no guarantee that software will support the rendering of ‘old’ formats for long periods, we need to carry out transformations at regular time periods.
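
To make the integrity requirement sketched above concrete, the following is a minimal sketch of such a check in Python. The PID, file path, and checksum are invented for illustration; a real repository would read this information from its catalogue and run the comparison automatically whenever a copy is made or verified.

    # Minimal sketch: verify that an archived file still matches the checksum that
    # was recorded with its persistent identifier (PID) at ingest time. The record
    # layout (a dict mapping PIDs to paths and MD5 sums) is hypothetical.
    import hashlib
    from pathlib import Path

    ingest_records = {
        "hdl:1839/00-0000-0000-0000-0000-A": {
            "path": "archive/session42/recording.wav",
            "md5": "9e107d9d372bb6826bd81d3542a419d6",
        },
    }

    def md5_of(path, chunk_size=1 << 20):
        """Compute the MD5 sum of a file without loading it into memory at once."""
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def verify(pid):
        """Return True if the stored bit-stream is still identical to the ingested one."""
        record = ingest_records[pid]
        return Path(record["path"]).exists() and md5_of(record["path"]) == record["md5"]

    for pid in ingest_records:
        print(pid, "OK" if verify(pid) else "CORRUPTED OR MISSING")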

7.4.1  Compression and Transformation

It needs to be emphasized that with certain so-called 'lossy' compression codecs, such as MP3 or MPEG1/2/4, information which is claimed to be not relevant for our perception is simply filtered out and is not recoverable anymore. Thus 'lossy' compression raises the question of authenticity: despite the fact that many relevant acoustic features can be extracted from the compressed information, as v. Son [8] has shown, archiving and compression are not compatible. Another problematic step is switching from one codec to another and creating a series. The concatenation of transformations can result in severe audible or visible artefacts which disturb the original information. For this reason, it is much better to keep the archival master file in an uncompressed format and from there to generate the various presentation formats, which can be compressed e.g. for transmission reasons. For audio signals this has been solved; for video signals it seems that MJPEG2000 lossless12 will become the accepted standard.

Intensive quality checks by the archivist in collaboration with community experts are necessary to guarantee that authenticity of the original information is preserved for any operation from digitization to any subsequent encoding. Wrong choices can easily lead to loss or change of information: (1) information loss with MP3 as indicated in Figure 7.5, or (2) blocking deformation as shown in Figure 7.6. The relevance of maintaining provenance information is underlined by this example as well. It should also be obvious that decisions about transformations can only be made after having analysed the consequences in detail.

7.4.2  Lifecycle Management

All transformations of resources lead to new content; thus they must be associated with a new PID, since otherwise identity and integrity cannot be controlled.13 It is a matter of policy of a repository whether a new metadata description is being created or whether the new version is bundled into the existing structured metadata description. Whatever is being

11  Soon we can expect MPEG4/H.265 to be massively supported by industry.
12  With 'lossless' compression it is possible to reconstruct the original signal, but compression factors higher than 2 are not possible.
13  It is widely agreed that metadata about resources is more dynamic where information can be added without leading to new versions.

FIGURE 7.5  Psycho-acoustic masking (intensity in dB against frequency in kHz): a high tone at 1 kHz (dark grey) would mask out all tones that have an intensity below the dotted line, i.e. according to psycho-acoustic findings the blue tone would not be recognizable by humans. MP3 algorithms make use of this phenomenon and filter out such frequency components. Thus MP3 recordings reduce information content.

FIGURE 7.6  The same information as in Figure 7.4: on the left side (a) encoded in an uncompressed way and on the right side (b) encoded with MPEG2 at 6 Mbit/s. In the right-hand image the blocking phenomenon can be seen. It is up to the researcher to decide whether this distortion is acceptable. For archiving it is not acceptable, since we do not know what kind of analysis will be done in future. These images were generated by the Phonogrammarchive Vienna.

done, provenance information needs to be stored, so that at any moment in time the user can trace what kinds of transformation have been carried out to arrive at the object as it is.14

7.5  Data Centres

In the previous sections we described how the area of archiving and dissemination has been changing, and indicated that further changes can be expected due to the enormous

14  Using metadata records to maintain context and provenance information about a resource can be seen as an improved versioning system offering linguistically relevant information.


technological innovation rate (e.g. digital technology has revolutionized the nature of acting). Obviously we need new types of centre that:

• implement mechanisms as described above;
• have an open deposit policy allowing users to deposit their corpora if they adhere to a number of requirements to establish trust (see below);
• allow users to build, store, and manipulate virtual collections; and
• allow users to add annotations of various sorts.

Obviously this can only be done if proper repository systems are being used that implement the above-mentioned principles. These types of 'new' centres must be available to support the individual researcher in his or her daily workflow, since he or she will not be able to carry out proper data lifecycle management but will nevertheless need easy and flexible access to his or her collections. Therefore it is currently fairly common in different research disciplines to describe requirements for centres and to agree on Service Level Agreements. This is to ensure expected behaviour and high availability. One criterion for Google's success is that it is always available and operates as users expect it to. So-called research infrastructures—be it the Large Hadron Collider [9] in physics, ELIXIR [10] in bioinformatics, or CLARIN [11] in the linguistic community—work along the same lines here: we need a structured network of professional centres that behave as expected. The CLARIN research infrastructure established criteria for centres that mainly cover mechanisms as described above [12]. As indicated in the 3-layer diagram (Figure 7.3), these centre networks need to collaborate with other infrastructures that offer common services such as long-term archiving to establish what may be called an ecosystem of infrastructure services.

Another major aspect for establishing trust, which is the key to wide acceptance, is to clarify the rights situation. This is an immensely complex issue, especially the legal situation in an international setting where resources may have been created and (web) users are located in different countries. Data-driven research will depend on free and seamless access to useful data and the possibility of easily combining this data with data from other researchers. So far, we cannot see how the rights issue will evolve, but it seems obvious that a sort of 'academic use' policy must come into place based on clear user identities and ethically correct behaviour. Despite all wishes for unrestricted access, we will be faced with restrictions related to personality rights (recorded persons do not want their voices or faces exhibited to everyone) and licences. Repositories need to make sure that they are capable of handling such restrictions in a proper way.

Obviously we are entering a scenario where many users will need to rely on data centres and the quality of their data without having continuous personal contact. Thus, in a scenario where important actors remain anonymous with respect to each other, new ways of assessing the quality of repositories and their resources need to be established. Three proposals have been worked out to assess a centre's quality: Repository

Auditing and Certification (RAC) [13], Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) [14], and Data Seal of Approval (DSA) [15]. RAC was proposed by the MOIMS-Repository Audit and Certification Working Group based on the OAIS model for archiving (ISO 14721) [16], and is heading towards a new, more refined ISO standard for quality assessment. DSA describes a more lightweight procedure to ensure that in the future research data can still be processed in a high-quality and reliable manner, without this entailing new thresholds, regulations, or high costs. DRAMBORA offers a toolkit that facilitates internal audit by providing repository administrators with a means to assess their capabilities, identify their weaknesses, and recognize their strengths. Archiving centres need to follow one of these procedures to indicate that they adhere to certain quality guidelines.

Part of the quality assessment is the responsibility of the 'data producer', who needs to ensure (1) that there is sufficient information for others to assess the scientific and scholarly quality of the research data, and that there is compliance with disciplinary and ethical norms, (2) that the formats are in agreement with repository policies, and (3) that metadata of sufficient quality is provided. Thus, when collecting a corpus, it is important to anticipate the quality requirements.

7.6  MPI Archive

We want to briefly describe a concrete archive as an example that comes close to the principles which we described above. The digital language archive at the MPI for Psycholinguistics makes use of an in-house-developed archiving system that contains ingest, management, and access components. The ingest and management components are combined in a tool named LAMUS (Language Archive Upload and Management System) [17], which allows depositors to upload resources together with metadata descriptions into the archive and to create a canonical organization for their data. The software performs file-type checks to verify that the uploaded files are of the type they claim to be and are on the list of accepted file types for the archive. The metadata description files are in IMDI format, which allows resources to be described with categories enabling research-based selections (such as age, sex, and educational background of interviewees)15 [18], and are also validated upon upload. Linked to LAMUS there is an elaborate system that allows depositors to define access rules for their data. There are various levels of access: completely open, open for registered users, available upon request, and completely closed.

Resources and metadata descriptions that are ingested automatically receive a persistent identifier. The MPI uses the Handle System [19] for persistent identifiers because it

15  Examples can be found when looking at metadata descriptions in the open catalogue (http://corpus1.mpi.nl).


is widely used, has shown its reliability over past years, and does not require payments per issued PID (instead, only a small annual fee for the Handle prefix is paid).16

Once the resources are ingested into the archive using the LAMUS software, the files are placed in a volume of a SAM-FS-based Hierarchical Storage Management system (HSM). This system consists of three layers of storage: a layer of fast hard drives, a layer of slower hard drives, and a layer of LTO5 data tapes. Files are migrated dynamically back and forth between the storage layers depending on usage, demand, and rules that are defined for different file types. The complete HSM can store up to 1.2 petabytes of data in its current form. Consistency checks of the archived files and metadata are performed continuously, and reports of any errors are sent to the archivists automatically. Two copies are automatically created in the HSM system.

Upon ingest, the metadata records are indexed and made available in an IMDI-based online browse and search tool so that users can look for the data they need using a standard web browser. All archived files can in principle be downloaded via the web, provided that the user has the appropriate access rights. Connected to the metadata catalogue browse and search tool, there are online viewers for various types of files. Time-aligned annotations to media files, for example, can be viewed online via a tool called ANNEX [21]. This tool displays the annotations in synchrony with the audio or video streams in a web browser. It offers different views for the annotations such as a timeline, a 'subtitle' view, and a plain text view. Uploaded annotation files are also indexed so that they can be searched with an integrated search tool called TROVA [22]. This tool allows users to search for occurrences of linguistic structures within the annotation files. It can also export the search results in comma-separated text files for further analyses in a statistics program, for example.

As described in the previous section, long-term preservation of digital data poses some challenges. For safeguarding the data, the MPI archive creates automatic backup copies at two computer centres of the Max Planck Society in Göttingen and in Garching. Each centre also has an off-site backup solution. To increase the chance of future interpretability of the data, only a limited set of file types are currently allowed in the archive—e.g. for video data we do not accept files in every one of the large number of formats and codecs that are around, but instead limit the formats to MPEG1, MPEG2, MPEG4/H.264, and MJPEG2000. For audio we only archive linear PCM WAV files in 16 bit 44.1 or 48 kHz resolution. Having a limited set of widely accepted formats should make conversions to other formats more feasible in the future if a format becomes obsolete.

The MPI archive has undertaken the Data Seal of Approval (DSA) self-assessment method, and will most likely be awarded the Data Seal of Approval in the course of 2011 after a review by the DSA board and another external reviewer.

16  The European PID Consortium [20] is now offering Handle registration to all registered research data centres. Since PIDs need to be persistent, there will be no removal of registered PIDs without consent.


7.7 Conclusions

In this chapter we have argued that archiving and dissemination of corpora has been changing dramatically as a consequence of innovation and an all-digital world, and that dramatic changes still lie ahead. The most fundamental rule in traditional archiving, 'don't touch the original artefacts', has been reversed to the basic rule for digital archives: 'touch the digital resources frequently'. This fundamental change allows us to turn to live digital archives where we no longer distinguish between an archive for long-term preservation and a copy used for all sorts of access. In contrast, we see a rich set of collections that are being extended as part of the research workflows. Therefore the notion of a 'corpus' is blurring insofar as digital technology allows users to re-purpose parts of different corpora to new virtual collections being used for some research purpose which was not thought of at the time of creation of the corpus it originally belonged to.

Such changes do not come without risks. The close relation between carrier and information has resulted in the fact that we still can look back to our history, for example, in the form of ancient cuneiforms and even papyri. But this facility was bound to data of limited size and production processes which can no longer be used. In digital technology the possibility of copying without information loss allows us to separate carrier and information, and thus to create any number of equal copies automatically, at high speed and low cost. However, this will only work if we can rely on proper digital archiving principles such as using stand-off principles and registered persistent identifiers, creating and maintaining metadata records, and setting up proper hardware and in particular software mechanisms in dedicated centres devoted to dealing with large amounts of data.

Likewise, the channels of dissemination have changed completely and will continue to change as network bandwidth grows. New generations of researchers are used to the web-based paradigm, and there will be fewer cases where it is necessary to ship tapes or DVDs to customers. Increasingly often, access will be web-based, and the possibility of re-purposing resources in unforeseen ways means that researchers will only want to access parts of corpora. The amount of downloading will depend on the services associated with resources. In this respect we will see enormous changes as a consequence of the available inexpensive storage capacity in Clouds which can be extended by easily deploying services—i.e. we can expect that the need for downloading corpora will become less and less important.

Internet Sources

[1]  https://wikis.oracle.com/display/SAMQFS/Home
[2]  http://www.cinegrid.org
[3]  http://www.natcorp.ox.ac.uk/
[4]  http://lands.let.ru.nl/cgn/ehome.htm
[5]  http://www.loc.gov/standards/mets/
[6]  http://xml.coverpages.org/mpeg21-didl.html
[7]  http://www.mpi.nl/research/research-projects/the-language-archive/projects/replix-1/replix
[8]  Van Son, R.J.J.H., 'Can Standard Analysis Tools be Used on Decompressed Speech?' COCOSDA, 2002, Denver; URL: http://www.cocosda.org/meet/denver/COCOSDA2002Rob.pdf
[9]  http://lhc.web.cern.ch/lhc/
[10] http://www.elixir-europe.org/
[11] http://www.clarin.eu
[12] http://www.clarin.eu/content/center-requirements-revised-version
[13] http://cwe.ccsds.org/moims/default.aspx#
[14] http://www.repositoryaudit.eu/
[15] http://www.datasealofapproval.org/
[16] http://public.ccsds.org/publications/archive/650x0b1.PDF
[17] http://tla.mpi.nl/tools/tla-tools/lamus/
[18] http://www.mpi.nl/IMDI/
[19] http://www.handle.net/
[20] http://www.pidconsortium.eu/
[21] http://tla.mpi.nl/tools/tla-tools/annex/
[22] http://tla.mpi.nl/tools/tla-tools/trova/

CHAPTER 8

METADATA FORMATS

DAAN BROEDER AND DIETER VAN UYTVANCK

8.1  Metadata: What Is It and Why Should It Be Used? The best definition of metadata (although not complete) is still ‘data providing information about other data’. Examples of such information are the name of the data creator(s), creation date, the data’s purpose, data formats used, etc. In general, three kinds of metadata can be distinguished (internet source 1 [NISO]): descriptive metadata that is used to search and locate data; structural metadata that describes how the data is internally organized; and administrative data such as the data format but also information on access rights and data preservation. In this chapter we use the term ‘metadata’ to refer to descriptive metadata; any other usage will be explicitly mentioned. Different approaches and requirements with respect to specificity and terminology have resulted in the development of different sets of metadata classifiers. Table 8.1 gives an example of a metadata record describing an electronic poetry publication using six metadata classifiers from the Dublin Core metadata set. Having sufficiently rich metadata for any (electronic) resource or corpus is extremely important: without it, resources cannot be located, and for instance the resources making up a corpus could not be properly identified and classified without inspecting their contents. When discussing the different metadata approaches, it is useful to explain some terminology used in this context. A metadata record consists of a limited number of different and sometimes repeatable metadata elements that represent specific characteristics of a resource. An element can have a value that can either be a free value or otherwise be constrained by a controlled vocabulary of values or by some other requirement, e.g. a valid date. The constraint of an element’s value is referred to as a value scheme. The set of rules for forming a metadata record is referred to as a metadata schema or metadata set, and this usually specifies the set of metadata elements, i.e. element names, the possible values for the elements, and the element semantics. Currently almost all metadata


Table 8.1  Part of a metadata record using the Dublin Core metadata set

Identifier   http://ota.ox.ac.uk/headers/0382.xml
Title        Selected works [Electronic resource]/Mirko Petrović
Creator      Petrović, Mirko, 1820–1867
Subject      Serbo-Croatian poetry—19th century
Language     Serbo-Croatian
Rights       Use of this resource is restricted in some manner. Usually this means that it is available for non-commercial use only with prior permission of the depositor and on condition that this header is included in its entirety with any copy distributed.

schemas are expressed as an XML file or file fragment, whose format is governed by XML schema or another schema language. The practical use of metadata for corpora that contain many individual resources, i.e. audio and video files as well as annotations, is twofold:

1. The corpus can be described as a whole and published for instance in the large catalogues of the Language Resource distribution agencies such as LDC and ELRA for users to identify suitable corpora. See e.g. the LDC catalogue.1
2. The individual parts of a corpus can be identified and classified by corpus exploitation tools (e.g. COREX: Oostdijk and Broeder 2003, the exploitation environment of the Dutch Spoken Corpus). If all parts of the corpus are made available online, the metadata for each part of the corpus should also be published. This will facilitate easy reuse of the resources, since it removes the need to first download the complete corpus when needing just a few of the corpus resources. For an example of such a catalogue, see the Virtual Language Observatory2 of the CLARIN project.



Of course there is also the possibility of describing subsets of the whole corpus, i.e. datasets that in terms of their size are located between the complete corpus and the individual corpus components. The individual corpus components can be also described as a metadata record. It is obvious that the requirements for proper metadata for each of these levels are very different and require the use of different metadata schemata. Where metadata is used to allow potential corpus users to locate suitable corpora, the metadata records from different corpora should be collected in a central metadata catalogue that users can use for finding the corpus that fulfils their requirements. This requires the corpus maintainers to publish the corpus metadata, which can then be

1  http://www.ldc.upenn.edu/Catalog/
2  http://www.clarin.eu/vlo/

harvested by the maintainers of such a central metadata catalogue. A popular protocol for this process is the well-known OAI-PMH (2).

The origins of the concept of metadata, its usage and terminology come from the library world, where the problem of tagging and retrieving large amounts of resources has long existed. Because the technology and experience of librarians was already advanced compared to other disciplines, it was natural for them to take the lead in trying to develop metadata description systems such as the Dublin Core Metadata Initiative (DCMI) (1995) that aim to also incorporate other scientific domains. With this attempt, however, which originally advocated describing all objects with a system of fifteen classifiers or elements (although qualifiers for more specificity were allowed), too much focus was put on librarians' terminology and interests (IPR etc.) to allow wide acceptance in other domains. For interoperability between different scientific domains that require mutually intelligible resource descriptions, DCMI is still a solution of choice, even if much information is lost in translating domain-specific metadata into it. DCMI is also problematic in that it is a flat list of descriptors lacking any structure and making it difficult to express structured information.

As stated above, there are different approaches and traditions for using metadata and using different metadata sets. Some are targeted towards a shallow global description of resources requiring the specification of only a few metadata elements, while others are very detailed and require considerable effort to create the metadata records. Also the terminology, i.e. the names and semantics of the metadata elements in the different metadata sets, is often incompatible. This leads to so-called interoperability problems, when for instance corpus exploitation tools cannot be used for corpora using different metadata sets, or when metadata of different corpora must be stored in the same catalogue. The tension between the need for adequate and sufficiently rich (domain-specific) terminology in order to correctly describe resources and, on the other hand, the need for interoperability where terms have to be understood by people from different disciplines has led to a kind of oscillating effect of description systems, moving from small sets with descriptors with broad applicability to large sets with highly specific descriptors and back again. Some (e.g. Baker 1998) have compared this to the linguistic theory of pidginization and creolization, where pidgin languages arise when mutual intelligibility is needed and pidgins are creolized to achieve richer semantics. How this tension is resolved is often a matter of purpose or pragmatism.
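
To illustrate the harvesting step mentioned above, the following is a minimal sketch, in Python, of a client that pulls Dublin Core records from an OAI-PMH endpoint. The endpoint URL is a placeholder; the verbs, the oai_dc metadata prefix, and the resumption-token paging are defined by the OAI-PMH protocol itself.

    # Minimal sketch: harvest Dublin Core records over OAI-PMH with the standard library.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE_URL = "http://www.example.org/oai"   # hypothetical OAI-PMH endpoint
    OAI = "{http://www.openarchives.org/OAI/2.0/}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    def list_records(base_url):
        """Yield (identifier, title) pairs, following resumption tokens for paging."""
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI + "record"):
                header = record.find(OAI + "header")
                identifier = header.findtext(OAI + "identifier")
                title = record.findtext(".//" + DC + "title")
                yield identifier, title
            token = tree.find(".//" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    for identifier, title in list_records(BASE_URL):
        print(identifier, "-", title)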

8.2  A Closer Look at Metadata Sets

This section presents some metadata standards that are often used for the construction of corpora.


8.2.1  Dublin Core and OLAC

The Dublin Core (DC) metadata set originates from the library community and serves as a digital lingua franca: it provides a simple and limited but widely understood set of elements3:

• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights

As such, it is mainly used as an export format for project-specific metadata formats and not as the primary description means. Consider the following example of a DC record, encoded as an XML fragment:

    <dc:title>jaSlo: Japanese-Slovene Learner's Dictionary</dc:title>
    <dc:identifier>http://nl.ijs.si/jaslo/</dc:identifier>
    <dc:publisher>Dept. of Asian and African Languages, Uni. of Ljubljana</dc:publisher>
    <dc:type>LexicalResource</dc:type>
    <dc:language>slv</dc:language>
    <dc:language>jpn</dc:language>
    <dc:subject>machine readable dictionary</dc:subject>
    <dc:subject>general</dc:subject>
    <dc:description>Encoding TEI XML</dc:description>
    <dc:description>Annotation Level: etymology</dc:description>
    <dc:description>Annotation Level: phonology</dc:description>
    <dc:description>Annotation Level: definition</dc:description>
    <dc:description>Annotation Level: example of use</dc:description>

3  This list contains the elements of the DC simple set. For extensions (added later), see http://dcmi.kc.tsukuba.ac.jp/dcregistry/

OLAC (2000) was created as an application of Dublin Core to the linguistic domain that extends some of the DC elements with more detailed linguistic information. Controlled vocabularies4 for discourse types, languages, linguistic field, linguistic data type, and participant role were added to the standard DC set. An illustration of an OLAC record can be found below. The OLAC-specific extensions for indicating the linguistic field, the linguistic type, the discourse type, and the language code have been marked in bold.

    <dc:title>AphasiaBank Legacy Chinese CAP Corpus</dc:title>
    <dc:creator>Bates, Elizabeth</dc:creator>
    <dc:subject>aphasia</dc:subject>
    <dc:description>Chinese aphasics describing pictures in the Given-New task
        for the CAP Project</dc:description>
    <dc:publisher>TalkBank</dc:publisher>
    <dc:date>2004-03-30</dc:date>
    <dc:type>Text</dc:type>
    <dc:identifier>1-59642-201-7</dc:identifier>
    <dc:identifier>http://talkbank.org/data-xml/AphasiaBank/Other/CAP/chinese.zip</dc:identifier>
    <dc:identifier>http://talkbank.org/data/AphasiaBank/Other/CAP/chinese.zip</dc:identifier>

In total about forty language archives are currently providing their metadata records in the OLAC format5. The popularity of OLAC can be partially attributed to the fact that the OAI-PMH protocol for metadata harvesting requires the DC format. OLAC, as a linguistic extension of DC, keeps the simplicity of DC and adds some useful vocabularies, making it a good choice for the distribution of metadata for language resources.
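
As a small illustration of how the typed OLAC descriptors mentioned above are expressed, the following Python sketch generates a skeletal OLAC record with the standard library. The resource described and the code values are invented; the namespaces and the xsi:type/olac:code pattern follow the OLAC 1.1 conventions.

    # Minimal sketch: build an OLAC record programmatically with xml.etree.ElementTree.
    import xml.etree.ElementTree as ET

    OLAC = "http://www.language-archives.org/OLAC/1.1/"
    DC = "http://purl.org/dc/elements/1.1/"
    XSI = "http://www.w3.org/2001/XMLSchema-instance"

    for prefix, uri in [("olac", OLAC), ("dc", DC), ("xsi", XSI)]:
        ET.register_namespace(prefix, uri)

    record = ET.Element(f"{{{OLAC}}}olac")

    def add(tag, text=None, xsi_type=None, code=None):
        """Append a dc: element; typed OLAC descriptors get xsi:type and olac:code."""
        element = ET.SubElement(record, f"{{{DC}}}{tag}")
        if text is not None:
            element.text = text
        if xsi_type is not None and code is not None:
            element.set(f"{{{XSI}}}type", xsi_type)
            element.set(f"{{{OLAC}}}code", code)
        return element

    # invented example resource
    add("title", "Example spoken corpus")
    add("creator", "Doe, Jane")
    add("language", xsi_type="olac:language", code="nld")
    add("type", xsi_type="olac:linguistic-type", code="primary_text")

    print(ET.tostring(record, encoding="unicode"))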

8.2.2 TEI

The Text Encoding Initiative (TEI, 1990) has been very successful in establishing a widely accepted system for text annotation. The TEI format also includes an extendable metadata section, the TEI header. An example of such a TEI header is included below.6 One can notice immediately the verbosity of the included metadata elements, which on the one hand is very readable but

4  A full list can be found at http://www.language-archives.org/REC/olac-extensions.html

5 See: http://www.language-archives.org/archives

6 Source: http://teibyexample.org/examples/TBED02v00.htm?target=wilde


can also pose problems for machine processing, as plain text is not really suitable for such purposes.

    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>The Importance of Being Earnest</title>
          <title>A trivial comedy for serious people</title>
          <title>An electronic edition</title>
          <author>Oscar Wilde</author>
          <respStmt>
            <resp>compiled by</resp>
            <name>Margaret Lantry</name>
          </respStmt>
          <funder>University College, Cork</funder>
        </titleStmt>
        <editionStmt>
          <edition>First draft, revised and corrected.</edition>
          <respStmt>
            <resp>Proof corrections by</resp>
            <name>Margaret Lantry</name>
          </respStmt>
        </editionStmt>
        <extent>19 648 words</extent>
        <publicationStmt>
          <publisher>CELT: Corpus of Electronic Texts: a project of University College, Cork</publisher>
          <address>
            <addrLine>College Road, Cork, Ireland.</addrLine>
          </address>
          <date>1997</date>
          <distributor>CELT online at University College, Cork, Ireland.</distributor>
          <idno>E850003.002</idno>
          <availability>
            <p>Available with prior consent of the CELT programme for purposes of
              academic research and teaching only.</p>
          </availability>
        </publicationStmt>
        <sourceDesc>
          <p>There is not as yet an authoritative edition of Wilde's works.</p>
          <listBibl>
            <head>The edition used in the digital edition.</head>
            <biblStruct>
              <analytic>
                <author>Oscar Wilde</author>
                <title>The Importance of Being Earnest</title>
              </analytic>
              <monogr>
                <title>Plays, Prose Writings and Poems</title>
                <imprint>
                  <pubPlace>London</pubPlace>
                  <publisher>Everyman</publisher>
                  <date>1930</date>
                  <biblScope>450-509</biblScope>
                </imprint>
              </monogr>
            </biblStruct>
          </listBibl>
        </sourceDesc>
      </fileDesc>
      <encodingDesc>
        <projectDesc>
          <p>CELT: Corpus of Electronic Texts</p>
        </projectDesc>
        <editorialDecl>
          <correction>
            <p>All the editorial text with the corrections of the editor has been retained.</p>
            <p>Text has been checked, proof-read and parsed using NSGMLS.</p>
          </correction>
          <normalization>
            <p>The electronic text represents the edited text. Compound words have not
              been hyphenated after CELT practice.</p>
          </normalization>
          <quotation>
            <p>Direct speech is marked q.</p>
          </quotation>
          <hyphenation>
            <p>The editorial practice of the hard-copy editor has been retained.</p>
          </hyphenation>
          <segmentation>
            <p>div0=the whole text.</p>
          </segmentation>
          <interpretation>
            <p>Names of persons (given names), and places are not tagged. Terms for
              cultural and social roles are not tagged.</p>
          </interpretation>
        </editorialDecl>
        <refsDecl>
          <p>The n attribute of each text in this corpus carries a unique identifying
            number for the whole text.</p>
          <p>The title of the text is held as the first head element within each text.</p>
          <p>div0 is reserved for the text (whether in one volume or many).</p>
        </refsDecl>
      </encodingDesc>
      <profileDesc>
        <creation>By Oscar Wilde (1854-1900). <date>1895</date></creation>
        <langUsage>
          <language ident="en">Whole text in English.</language>
          <language ident="fr">One word occurring twice in Anglo-French.</language>
        </langUsage>
      </profileDesc>
      <revisionDesc>
        <change>Text captured by scanning.</change>
      </revisionDesc>
    </teiHeader>

  

Although the TEI format is very popular among the creators of textual and historical corpora, it is less frequently used for phonological corpora. Only the Newcastle Electronic Corpus of Tyneside English (NECTE) has so far been described with the TEI header.
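
Machine processing of such headers is nevertheless straightforward with standard XML tooling. The following Python sketch, assuming a TEI P5 document in the usual TEI namespace, pulls a few descriptive fields out of a header; the file name is a placeholder.

    # Minimal sketch: extract a few descriptive fields from a TEI P5 header with the
    # Python standard library. The file name is hypothetical.
    import xml.etree.ElementTree as ET

    TEI = "{http://www.tei-c.org/ns/1.0}"

    tree = ET.parse("corpus_text.xml")          # hypothetical TEI document
    header = tree.find(TEI + "teiHeader")

    title = header.findtext(f".//{TEI}titleStmt/{TEI}title")
    author = header.findtext(f".//{TEI}titleStmt/{TEI}author")
    availability = header.findtext(f".//{TEI}availability/{TEI}p")

    print("Title:       ", title)
    print("Author:      ", author)
    print("Availability:", (availability or "").strip())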

8.2.3 IMDI

The [IMDI] standard7 was developed by a broad cooperation of several language resource providers as a detailed and structured metadata set. It has been used to describe resources from field linguistics [DOBES], for large corpora of spoken data [CGN, IFA – see Table 8.2 for abbreviations] and for several sign language corpora [CNGT].

7  Please see Table 8.2 for references of the (bracketed) abbreviations.

Table 8.2  Projects and abbreviations

Reference       Abbreviation of                                          Link
[APA]           Alliance for Permanent Access                            http://www.alliancepermanentaccess.eu
[CGN]           Corpus Gesproken Nederlands                              http://lands.let.ru.nl/cgn/
[CHAT]          Child Language Exchange System                           http://childes.psy.cmu.edu
[CNGT]          Corpus Nederlandse Gebarentaal                           http://www.ru.nl/corpusngt/
[DC]            Dublin Core                                              http://dublincore.org/
[DCAM]                                                                   http://dublincore.org/documents/abstract-model/
[DC-DS-XML]                                                              http://dublincore.org/documents/dc-ds-xml/
[DC-TEXT]                                                                http://dublincore.org/documents/dc-text/
[DFKI]                                                                   http://www.language-archives.org/archive/dfki.de
[DOBES]         Dokumentation Bedrohter Sprachen                         http://www.mpi.nl/dobes
[EAD]           Encoded Archival Description                             http://en.wikipedia.org/w/index.php?title=Encoded_Archival_Description&oldid=250469911
ECHO            European Cultural Heritage Online                        http://echo.mpiwg-berlin.mpg.de/home
[ELDA UC]       Universal Catalogue                                      http://universal.elra.info/
[ENABLER]                                                                http://www.ilsp.gr/enabler/
[ESF]           European Science Foundation Second Learner Study        http://books.google.de/books?id=g292tXMX4tgC&pg=PA1&lpg=PA1&dq=esf+Second+learner+perdue&source=bl&ots=WKi3GUQQP6&sig=n7QSWy3StXvD06nMfAzY7GBbm9w&hl=de&sa=X&oi=book_result&resnum=3&ct=result#PPP1,M1
[FIDAS]         Fieldwork Data Sustainability Project                    http://www.apsr.edu.au/fidas/fidas_report.pdf
[ICONCLASS]                                                              http://en.wikipedia.org/wiki/Iconclass
[IFA]           IFA spoken language corpus                               http://www.fon.hum.uva.nl/IFAcorpus/
[IMDI]          ISLE Metadata Initiative                                 http://www.mpi.nl/IMDI
[INTERA]        Integrated European language data Repository Area        http://www.mpi.nl/intera/
[ISLE]          International Standards of Language Engineering         http://www.ilc.cnr.it/EAGLES/isle/ISLE_Home_Page.htm
[ISOcat]                                                                 http://www.isocat.org
[LAT] tools     Language Archive Technology                              http://tla.mpi.nl/
[LMF]           Lexical Markup Framework                                 http://en.wikipedia.org/w/index.php?title=Lexical_Markup_Framework&oldid=255448197
[METATAG]                                                                http://en.wikipedia.org/w/index.php?title=Meta_element&oldid=256779491
[METS]          Metadata Encoding and Transmission Standard              http://en.wikipedia.org/wiki/METS
[MILE]                                                                   http://www.mileproject.eu/
[MPEG7]                                                                  http://en.wikipedia.org/w/index.php?title=MPEG-7&oldid=241494600
[OAIS]          Open Archival Information System                         http://en.wikipedia.org/wiki/Open_Archival_Information_System
[OASIS]         Organization for the Advancement of Structured           http://www.oasis-open.org/
                Information Standards
[ODD]           One Document Does all                                    http://www.tei-c.org/wiki/index.php/ODD
[OLAC]          Open Language Archives Community                         http://www.language-archives.org/
[OAI-PMH]       Open Archive Initiative—Protocol for Metadata            http://www.openarchives.org/pmh/
                Harvesting
[SCHEMAS]                                                                http://www.schemas-forum.org/
[SRU]           Search/Retrieve via URL                                  http://www.loc.gov/standards/sru/
[SRW]           Search/Retrieve Web Service                              http://en.wikipedia.org/wiki/Search/Retrieve_Web_Service
[TEI]           Text Encoding Initiative                                 http://www.tei-c.org/
[UDDI]          Universal Description Discovery and Integration          http://en.wikipedia.org/wiki/UDDI
[WSDL]          Web Services Description Language                        http://www.w3.org/TR/wsdl20
[Z39.50]                                                                 http://en.wikipedia.org/wiki/Z39.50; http://www.loc.gov/z3950/agency/

160   Daan Broeder and Dieter van Uytvanck defined to accommodate these needs. For sign language research, for instance, fields like the hearing aids used by the participants in a video recording were added.8 Example of an IMDI file for the phonological IFA corpus (with links to Praat TextGrid files): http://hdl.handle.net/1839/00-0000-0000-0003-4BF4-D Example of an IMDI file for the Dutch sign language corpus: http://hdl.handle.net/1839/00-0000-0000-0009-2D60-A

An important advantage of IMDI, when it is compared to other metadata formats for the creation of corpora, is the large toolset that is available to create, maintain, and publish structured (meta)data. Using the hosted The Language Archive (www.mpi.nl/tla), environment corpus compilers can upload their resources and IMDI metadata via a web application, and then set access rules and search through metadata and annotation files. Alternatively, one can install the software on a separate server (all TLA software is open source) and use all tools that come with the IMDI system (http://tla.mpi.nl).

8.2.4 CMDI Work on the Component Metadata Infrastructure (CMDI) started in 2008 in the context of the European CLARIN9 research infrastructure. Most existing metadata schemas for language resources seemed to be too superficial (e.g. OLAC) or too much tailored towards specific research communities or use cases (e.g. IMDI). CMDI addresses this by leaving it to the metadata modeller how a schema should look. It is based on the use of so-called metadata components. These elementary building blocks contain one or more elements describing a resource. For instance, an Actor component10 can group elements like First Name, Last Name, and Language. A component can also contain one or more other components, allowing a lego brick approach, where many small components together form a larger unit (see Figure 8.1). To continue with the Actor example, such a component could include a subcomponent called ActorLanguage, containing a set of fields describing the language(s) a person can speak. Ultimately a set of components is grouped into a profile—this ‘master component’ thus contains all fields that can be used to describe a language resource. In order to promote the reuse and sharing of components and profiles, the CMDI component registry was created.11 This web application (see Figure 8.2) allows metadata modellers to browse through all existing components and profiles and to create new ones, with the possibility of including existing components. 8 

A full overview of the sign language profile can be found at http://www.ru.nl/sign-lang/technology/ imdi_sl_profile/ 9 See http://www.clarin.eu 10 See http://hdl.handle.net/1839/00-DOCS.CLARIN.EU-101 11 See http://www.clarin.eu/cmdi

Metadata Formats  

161

Actor firstName lastName ActorLanguage actorLanguageName

FIGURE 8.1 A  component describing an actor, consisting of two elements (first name and last name) and an embedded ActorLanguage component

FIGURE

8.2  The CMDI component registry.

After creating or choosing a profile, the user can generate an XML W3C schema (also known as an XSD file) that contains a formal specification of the structure of the metadata descriptions that are to be created. This schema too can be accessed from the component registry, with a right click on the profile, choosing the ‘Download as XSD’ option. From then on it can be used to check the formal correctness of the CMDI metadata descriptions. A CMDI metadata file exists of three main parts:

162   Daan Broeder and Dieter van Uytvanck • A fixed Header, containing information about the author of the file, the creation date, a reference to the unique profile code and a link to the metadata file itself. • A  fixed Resources section, containing links to the described resources or other CMDI files. • A flexible Components section, containing all of the components that belong to the specific profile that was chosen as a basis. In our earlier example there would be one Actor component immediately under the Components tag. An illustration of such a CMDI file, including these three parts, can be accessed at: http://hdl.handle.net/1839/00-DOCS.CLARIN.EU-102 Another CMDI file, basically containing the same information from the IFA corpus with the TextGrid links, can be found at: http://hdl.handle.net/1839/00-DOCS.CLARIN.EU-103 CMDI files can be edited with a standard XML editor (like 12) or with Arbil,13 a program that can edit both IMDI and CMDI files. The Virtual Language Observatory14 facet browser and the CLARIN Metadata Browser currently support searching and exploring collections of CMDI files. Yet, as part of an infrastructure that is still in development, the exploitation software does not currently reach the level of completeness that the IMDI/TLA tools have achieved after years of significant efforts. However, the TLA software will be made fully CMDI-compatible in the short term. The component approach brings with it a certain risk: one element could exist in several components with a different name. When searching for a particular name, how do users know whether the metadata modeller used the term Family Name, Last Name or Surname? This problem is addressed by defining links (‘Concept Links’, e.g. http://www. isocat.org/datcat/DC-4195) from each element in a component to the ISOcat data category registry. The search software can thus, later on, automatically detect that Last Name and Family Name are basically the same concepts, and return all relevant results whatever label the profile uses for this field.

8.3  Practical Matters When Designing and Creating Metadata In the previous sections we have made a case for providing sufficiently rich metadata not only for the complete corpus but also for all parts of the corpus. Furthermore, we have

12 See http://www.oxygenxml.com/ 13 

Downloads available via http://www.clarin.eu/cmdi

14 See http://www.clarin.eu/vlo

Metadata Formats  

163

given examples of some of the metadata sets currently in use or under development. In this section we give some practical advice on what metadata schema to use or how to design one’s own if required (although we strongly encourage the reuse of existing metadata schema as much as possible. New metadata schemas cause problems with the interoperability between tools that are rarely compensated by the advantage of having one’s own metadata schema). Existing metadata schemas usually come with their own documentation, tools and community. New approaches using flexible component metadata (CMDI) allow the creation of new metadata schema based on existing ones while also providing semantic interoperability. This practice is just emerging, and its usability for a specific case should be checked.

Specificity Creating metadata is expensive when it is necessary to collect many detailed resource properties for a large number of corpus data. However, the temptation of limiting oneself to a subset of the available information and postponing providing a full set should be avoided. Compiling more metadata at a later stage is usually much more expensive, since a new editorial effort has to be started and typically the original corpus compilers have moved on and their knowledge has been lost. Thus, it is advisable to compile as much and as detailed metadata as possible, even if at the creation stage of the corpus the usefulness of some of the information is not clear. This might change in the future, and also it is not essential to publish the available metadata in all its details. Of course, if only limited information is available, it does not make much sense to use a much more detailed set. Having to specify the values of many elements as empty will not motivate the metadata creators, and the users may question the quality of the metadata.

Granularity In general, it is useful to create one metadata record for the entire corpus that will be published. Language resource (LR) distribution centres usually require a very specific metadata record, which you have to create if you decide to distribute your data through those centres. To make the corpus available to other users, it is best to create an OLAC/DC record, which is currently the most widely used interoperable metadata set for this purpose, and to have that harvested by (one of the) the OLAC service providers.15 When the corpus is deposited at a bona fide LR archive, this will usually be taken care of automatically. Information on the individual parts of a corpus should be made available to the corpus users or exploitation software so that they can make appropriate sub-selections of the corpus resources for e.g. an analysis of all recordings with females or all read texts. Producing this information in one of the existing metadata formats has the following advantages: (1) a well-described data format, (2) increasingly efficient use of the data by publishing the corpus metadata records, and (3) allowing cross-corpus analysis using other corpora using the same metadata set.

15 

http://www.language-archives.org/

164   Daan Broeder and Dieter van Uytvanck Furthermore, if the metadata records are provided as XML records, an additional advantage will be that this is considered a persistent format suitable for archiving. The question whether there should also be separate metadata records for each corpus component can best be answered by considering the status of those components and their possible usage. If a corpus component has a separate provenance and existence from the main corpus, e.g. if the corpus consists of an aggregation of older smaller corpora, it is advisable to have a separate metadata record. This will enable users to manage and access the corpus component as a separate entity, and allow future merging within other corpus configurations.

Interoperability From the viewpoint of maximal metadata interoperability, of using tools as metadata editors, and the visibility within LR catalogues, it is advisable to choose the most widely used metadata set for your particular domain: • IMDI (or its future successor CMDI) for multi-modal/multi-media corpora; • OLAC for describing aggregated resources as whole corpora and describing publications on LRs and in linguistics; • TEI for text resources. Any of these choices should probably be considered ‘safe’ from the viewpoint of future interoperability, since they each have large installed bases of corpora described, with these types of metadata supported by large communities. Future metadata infrastructures are expected to ensure interoperability with these existing installed bases, although this is of course not an absolute guarantee.

8.4 Outlook Creating a corpus entails more than just selecting an appropriate metadata set. The environment in which the corpus and the associated metadata will be stored is another important aspect to consider. With the rise of the Internet, creating a DVD that contains all corpus data is no longer considered the best way to distribute a corpus. If done in a well-organized way, electronic publication can offer quite some advantages; to mention a few: • Visibility. The number of potential users is significantly higher if the data is available on the Internet. • Network effect. With resources addressable online, a researcher can cite relevant data and use it in local and web applications. • Sustainability. Data hosted at a trustworthy digital (language) archive will be kept available and accessible for a long time. Ad hoc approaches often result in

Metadata Formats  

165

digital obsolescence (broken links and data formats that are no longer supported). A long-term vision is thus required, and that is exactly what data archives are for. On the other hand, it is not easy to set up a long-lasting repository and maintain all necessary services around it. With these observations in mind, the CLARIN research infrastructure started setting up a distributed network of centres. Relying on the economy of scale, researchers can deposit their data and tools with one of the centres and serve the whole scientific community without needing to reinvent the infrastructural wheel. Alternatively, academic institutions can set up their own centre and join the CLARIN network. In any case this ensures a better interoperability, given the use of common standards (e.g. CMDI for metadata) and protocols. All in all, it is certainly worth looking at the broader context of e-infrastructures such as CLARIN when creating or upgrading a corpus. Doing so could significantly improve the preservation of and access to the data while keeping costs at a reasonable level.

Internet Sources (1) b NISO. ‘Understanding Metadata’. NISO Press. http://www.niso.org/publications/press/UnderstandingMetadata.pdf. Retrieved 5 January 2010. (2) OAI-PMH. http://www.openarchives.org/pmh/ (3) Alliance for Permanent Access. http://www.alliancepermanentaccess.eu

C HA P T E R  9

DATA F O R M AT S F O R PHONOLO GICAL CORPORA L AU R E N T ROM A RY A N D A N DR E AS W I T T

9.1  Representing Annotated Spoken Corpora The annotation of linguistic resources has long-standing traditions (see Cole et al. 2010, Witt and Metzing 2010) and, as the other chapters of this book have shown, is a laborious, time-consuming, and expensive task. In theory, we want to make these resources available in such a way that they can be reused by as many scholars as possible (see Ide and Romary 2002). However, a large variety of annotation formats have been developed in the previous decades, each one created for a specific research task, and the resulting resources are frequently only usable by members of the individual research projects. The goal of the present chapter is to explore the possibility of providing the research and industrial communities that commonly use spoken corpora with a set of well-documented standardized formats that allow a high reuse rate of annotated spoken resources and, as a consequence, better interoperability across tools used to produce or exploit such resources. We hope to identify standards that cover all possible aspects of the management of spoken data, from the actual representation of raw recordings and transcriptions to high-level content-related information at a semantic or pragmatic level. Most of the challenges here are similar to those for textual resources, except for, on the one hand, the grounding relation that spoken data has to illocutionary circumstances (time, place, speakers and addressees) and, on the other, the specific annotation layers that correspond to speech-related information (e.g. prosody), comprising multimodal aspects such as gestures. We should also not forget, as is well illustrated in this book, the importance of legacy practices in the spoken corpora community, most of them resulting from the existence of specific tools at various representation layers, ranging from basic transcription tools (Transcriber, Praat, see Boersma, this volume) to generic score-based annotation environments (TASX, Elan, CLAN/CHAT (CHILDES), EMU, see chapters X, III.1, III.2,

Data Formats for Phonological Corpora  

167

this volume). By definition these various tools do not have the same maintenance rate and capacity, and it is therefore essential to think about standardized formats as offering the possibility to be embedded within existing practices. This implies that we have two basic scenarios in mind: • We want to be able to project existing data into a range of standardized representations that bear as little specificity to the original format as possible but as much faithfulness as necessary. • We want standardized formats to have the capacity to be used for the development of new technical platforms, thus allowing the integration of new requirements and new features. These two general requirements both necessitate standards that can incorporate features and data we have not yet envisioned. To do this, the standards should provide specification or customization mechanisms that do not hinder their ability to improve interoperability. That said, it is clear that such a thorough set of standards cannot be fully described in a single book chapter. Moreover, we acknowledge that there is still some work to be done before we have a convincing portfolio of standards that can cover all aspects of annotated spoken corpora. For these reasons, we are adopting an intentionally selective (and hence subjective) strategy, with the goal of laying out a foundation that can serve as a basis to complete the standardization picture step by step. After a brief introduction1 to existing standardization activities for language resources in general, we will describe some basic concepts related to the representation of annotated linguistic content. We will present in detail some of the proposals that can be used for the transcription and annotation of spoken data, along with the possibility of defining precise semantics for the corresponding representations.

9.2  Standards and Standardization Processes It has become common to speak of two kinds of standards: de facto standards, which arise through the practices of active communities and are adopted over the years, and de jure standards, which are created ‘from scratch’ and are promulgated by official standardization bodies. Such a dichotomy is misleading, since the actual development of standards is usually accomplished by cooperation from both sides. Indeed, we suggest that standardization is a process with three essential components: • consensus building within a technical community, including the involvement of reliable experts and the consideration of existing practices and developments; 1 

For a precise presentation of background activities which lead to the current standardization picture, see Ide and Romary 2007.

168   Laurent Romary and Andreas Witt • wide availability of the standard so that any potential user may determine how much he or she is complying with it; • a maintenance process, through which existing defects or necessary improvements may be implemented in further revisions of the standard, while taking care of backward compatibility issues. These processes are the basis for most standardization bodies, including official national and international organizations such as the ISO or IETF, or consortium-based bodies such as the W3C, OASIS, or TEI. Many standard proposals that do not arise from these processes (usually those initiated within dedicated research and development projects) have failed or suffered due to the lack of community support that could provide dissemination and maintenance of the standards. For language resources, we can identify three main organizations that play the most important role in standards: • The World Wide Web Consortium (W3C) provides horizontal standards (called recommendations) for the management of Internet-based communication, and in particular XML technologies,2 which are widely used for representing all sorts of semi-structured information. The W3C also carries out language-oriented activities regarding internationalization, in particular. • The International Organization for Standardization (ISO), a confederation of national standardization bodies that covers nearly all areas of industrial activities. Beyond generic IT-relevant projects carried out in ISO-IEC JTC1 (from character encoding with ISO 10646-Unicode to document representation with SGML), technical committee 37 (TC 37) of ISO provides guidance for linguistic content management. In particular, sub-committee 2 (SC 2) of TC 37 is in charge of language codes, SC3 of computer based terminologies and SC4 of language resources. • The Text Encoding Initiative (TEI), a consortium that has taken up the responsibility of providing the digital humanities community at large with a wide range of XML-based representations covering most of the possible useful genres from prose text to dictionaries.

9.3  Which Standards for Linguistic Annotation of Spoken Corpora? To understand standards for language resources, it is important to understand the various activities that the standardization organizations mentioned above are pursuing. 2 

In the remaining text of this chapter, we assume that the reader has some basic understanding of XML technologies, and in particular has no difficulty in reading through the XML samples we introduce. See also Bray et al. (1998).


In the following paragraphs, we suggest a possible overall strategy to achieve the best standard-based approach to the management of linguistic data, and justify the biased approach taken up in the rest of the chapter.

9.3.1  Various User Scenarios—Various Standards It is important to consider how standardization relates to possible organization levels of spoken corpora. In general, these organization levels include: • The first important level of representation in phonological corpora is the transcription, where the source signal, in the form of an audio or video file, as well as any additional information provided by specific sensors (e.g. articulatory) is segmented and classified as a set of symbolic codes. Such codes may be phonetic or orthographic ones, but may also correspond to any kind of features or patterns that are deemed useful for further analysis of the primary source. Transcription is understood as a process which theoretically should be independent of further annotation steps. • Anchored to the transcription layers (also referred to as tiers in phonological corpora), but also to other prior annotations, a given annotation layer is identified as providing a certain type of interpretation of the primary source, whether this is linguistic (e.g. the identification of syntactic constructs) or of any other possible kind (e.g. identification of pathological features in the speaker’s voice). As we shall see, the specification of an annotation layer relies on the provision of its internal logic (meta-model) and the corresponding elementary descriptors (data categories). • Finally, an important aspect of corpus annotation relies on the proper management of the combination of annotation layers as well as the corpus of primary sources used within a given transcription and annotation project. Tool implementers and project managers are usually those who consider these specific aspects. The second important aspect to consider is the ecology within which a given corpus creation project will take place and how much this may impact the issue of formats. In general, specific standards for representing a given transcription or annotation layer are chosen based on a wide variety of factors: • In some cases, the choice will simply be dependent on the formats employed by the software used for the annotation task and, to a lesser degree, on how the tool exports data and files. • The targeted representation format of an annotated corpus may depend on the kinds of operation carried out with the data. The capacity, for instance, of a query environment to have a more or less deep understanding of complex annotations or of combinations of various mark-up schemas might or might not increase the actual requirements on the data formats. • One has to consider in which data structure the final corpus will be recorded and archived in the long run. Indeed, combining too many heterogeneous formats,

which might not all have the same level of stability and documentation, may hinder the further exploitation of the data outside (in time and space) the initial production locus. • Finally, an important factor is the culture that a given community shares in relation to standards and how difficult it is for individual community members (and groups of them) to change their practices. This learning curve effect usually explains why communities tend to design their own formats, to be able to progressively add layers of complexity.

9.4  Basic Components of an Annotation Schema As explained in the various contributions to this book, each annotation tool tends to come with its own annotation schema; and in turn, each annotation schema is defined according to its own technical principles, mostly resulting both from legacy practices in the corresponding research environment and from the actual preferences of the implementer. As a whole, it is seldom the case that an annotation schema results from a clear conceptual analysis where, in particular, the modelling (e.g. based on a UML specification) and representation (e.g. in the form of an XML schema) levels are clearly differentiated (cf. Zipser and Romary 2010). If we want, in this context, to move towards better interoperability across the existing initiatives within the spoken corpora community, it is necessary for us to introduce some basic elements that will act as references for comparing existing schemas and above all for mapping them onto common principles and standards. The first stage for us is to define what is meant by an annotation and identify its various components. As illustrated in Figure 9.1, we consider an annotation to be a combination of three components, a source, a range and a qualifier, that have the following characteristics: • The source3 is the information upon which some additional statement is made in the context of the annotation. It is considered a fixed object from the point of view of the annotation (i.e. changing the source invalidates the annotation). • The range identification characterises a portion of the source (a markable) that is being qualified by the annotation, either as one or several already identified parts of the source, or by reference to a certain identification scale (e.g. a temporal or spatial reference) that maps onto the source.

3  One may want to distinguish between a primary source, which is not anchored on any previous information layers, and a secondary source, when this can be seen as being derived from or built upon another source.


FIGURE 9.1  Generic structure of an annotation (source information, range identification, qualification).

• The qualification expresses a constraint on the actual portion of the source as elicited by the range. This constraint is made up of an elementary piece of information, mostly expressible as a feature-value pair. ‘Range’ is defined here abstractly because existing annotation schemas implement ranging mechanisms in different ways. These can be classified along the following lines: • Direct reference to a (generally temporal) scale that transitively relates to the source. This is the basic mechanism provided by simple models such as annotation graph (Bird and Liberman 2001). In such situations, there is no possibility of expressing an explicit co-occurrence relation between two annotations, except for identifying that one temporal reference is, for instance, the same. This is typically the strategy implemented in tools such as ANVIL or Praat (see Schmidt 2011). • Reference to reified objects on a scale. In the case of a temporal scale, this corresponds to the identification of events to which more than one qualifier may refer (i.e. a timeline, as in EXMARaLDA and ELAN: see Schmidt 2011). • Reference to explicit components of the source, allowing one to skip in some sense the actual ranging mechanism, or at least to make it boil down to a simple pointer or group of pointers. The important part of this last possibility is that it allows annotations to be about any kind of entity, including annotations themselves. This is usually the case for all annotation tools adopting a pure stand-off strategy such as MMAX (Müller and Strube 2001). An annotation is defined at a very low level of granularity, so that each elementary statement upon a source (e.g. an elementary provision of some part of speech information about a word) is potentially embedded within a single annotation. Naturally, this does not prevent specific implementations from providing explicit factorizations that may facilitate the reduction of redundant information across annotations. For instance, a morphosyntactic annotation schema may want to combine all information relevant to a given word by conflating all descriptors associated to a single range as a tagset label. Because there are so many types of linguistic annotation, annotations are often

FIGURE 9.2  Example of a transcription using EXMARaLDA (cf. Schmidt 2011).

grouped according to different criteria. The reason for the grouping is technical and/or conceptual. To distinguish between these two different groupings, a distinction between annotation layer and annotation level has been introduced (see e.g. Goecke et al. 2010). A short description of this distinction can also be found in Witt (2004): To avoid confusion when talking about multiply structured text and text ideally organized by multiple hierarchies, the terms ‘level’ or ‘level of description’ are used when referring to a logical unit, e. g. visual document structure or logical text structure. When referring to a structure organizing the text technically in a hierarchically ordered way, the terms ‘layer’ or ‘tier’ are used. A level can be expressed by means of one or more layers and a layer can include markup information on one or more levels.

Furthermore, it is possible to conceptualize the underlying coherence that is required when optimizing an annotation schema by defining the notion of annotation level as a coherent set of annotation types sharing the following characteristics: • same underlying source, or set of sources (in the case of a corpus); • same ranging mechanism, by which we mean not only the same referring mechanisms (component or scale), but also a coherent description of ranges from the point of view of their linearity, possible overlapping or alternation; • precisely defined and comprehensive data category selection that is applicable for qualifiers. We predict a general notion of tagset as such a selection.


With this general analysis in mind, we can now take a more detailed look at the current state of standardization processes for language resources.

9.5  Providing a Reference Semantics for Linguistic Annotation One important aspect of representing any kind of annotation is the capacity to provide a clear and reliable semantics for the various descriptors that are being used, either in the form of features and feature values or directly as objects in a representation that is expressed, for instance, in XML. In order to be shared across various annotation schemas and encoding applications, such a semantics should be implemented as a centralized registry of concepts, which we will henceforth refer to as data categories. As such, data categories should satisfy the following constraints: • From a technical point of view, they must provide unique and stable references (implemented as persistent identifiers) such that the designer of a specific encoding schema can refer to them in his or her specification. By doing so, two annotations will be considered as equivalent when they are actually defined in relation to the same data categories (as feature and feature-value). • From a descriptive point of view, each unique semantic reference should be associated with precise documentation combining a full text elicitation of the meaning of the descriptor with the expression of specific constraints that bear upon the category. In recent years, ISO has developed a general framework for representing and maintaining such a registry of data categories, encompassing all domains of language resources. This work, carried out in the context of ISO project 12620 (ISO 12620), has led to the implementation of an online environment providing access to all data categories which have been standardized in the context of the various language resource-related activities within ISO, or specifically as part of the maintenance of the data category registry. It also provides access to the various data categories that individual language technology practitioners have defined in the course of their own work and decided to share with the community. The ISO data category registry, as available through the ISOCat implementation, is meant to be a ‘flat’ marketplace of semantic objects, providing only a limited set of ontological constraints. The objective is to facilitate the maintenance of a comprehensive descriptive environment where new categories are easily inserted and reused without requiring any strong consistency check with the registry at large. Indeed, the following basic constraints are actually part of the data category model, as defined in ISO 12620:

• Simple generic-specific relations, where these are useful for the proper identification of interoperability descriptors between data categories. For instance, the fact that /properNoun/ is a sub-category of /noun/ allows one to compare morphosyntactic annotations which are based on different descriptive levels of granularity. • Description of conceptual domains, in the sense of ISO 11179, to identify, when known or applicable, the possible value of so-called complex data categories.4 For instance, this can be used to record that possible values of /grammaticalGender/ (limited to a small group of languages: see Romary 2011) could be a subset of {/masculine/, /feminine/ and /neuter/}. • Language-specific constraints, either in the form of specific application notes or as explicit restrictions bearing upon the conceptual domains of complex data categories. For instance, one could express explicitly that /grammaticalGender/ in French can only take the two values: {/masculine/ and /feminine/}. In this section, we have tried to delineate a comprehensive view on annotations that, as it were, encompasses all types of representation within a multi-tier annotated corpus. Indeed, any kind of information added to a bare primary source (like an audio recording), from low-level segmentation markers to high-level discourse relation identification, can be seen as an annotation in the sense presented here.
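As an illustration of how such references can be attached to concrete descriptors, the following sketch uses the feature structure notation discussed in Section 9.10 below together with the data category reference attributes (dcr:datcat and dcr:valueDatcat) associated with ISO 12620; the namespace URI and, above all, the persistent identifiers shown are placeholders rather than actual registry entries:

<fs xmlns:dcr="http://www.isocat.org/ns/dcr">
  <!-- the persistent identifiers below are placeholders, not real ISOCat entries -->
  <f name="grammaticalGender" dcr:datcat="http://www.isocat.org/datcat/DC-XXXX">
    <symbol value="feminine" dcr:valueDatcat="http://www.isocat.org/datcat/DC-YYYY"/>
  </f>
</fs>

Two annotations produced with different tools can then be treated as equivalent whenever their descriptors resolve to the same persistent identifiers, whatever local names each schema uses.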

9.6  Language Resource Management: An ISO Perspective 9.6.1  Specific ISO Models and Formats for Linguistic Annotation ISO committee TC 37/SC 4, launched in 2002, focuses on the definition of models and formats for the representation of annotated language resources. To this end, ISO/TC 37/SC 4 has generalised the modelling strategy initiated by its sister committee SC 3 for the representation of terminological data (Romary 2001), and through which linguistic data models are seen as the combination of a generic data pattern (a meta-model) which is further refined through a selection of data categories that provide the descriptors for this specific annotation level. Such models are defined more or less independently from any specific formats (not even bound to an XML framework), and ensure that an implementer has the necessary tool to design and compare formats with regard to their degrees of interoperability. In the rest of this section, we will survey several projects5 from ISO/TC 37/SC 4 that are important for phonological corpora. 4  Complex data categories will typically be implemented as placeholders (or features), whereas simple data categories will be implemented as values. 5  In the ISO sense.


One of the early proposals of ISO/TC 37/SC 4 has been to outline a possible standard for morphosyntactic (also referred to as part-of-speech) annotation. Such an annotation level corresponds more or less to the first linguistic abstraction level for a corpus and, depending on the language to be annotated and the actual characteristics of the tool that is being used, can vary enormously in structure and complexity. In order to deal with the complex issues of ambiguity and determinism in morphosyntactic annotation, ISO 24611/MAF makes a clear distinction between the two levels of tokens (representing the surface segmentation of the source) and word forms (identifying lexical abstractions associated to groups of tokens). These two levels have the specificities that, on the one hand, they can be represented as simple sequences as well as local graphs (e.g. multiple segmentations, ambiguous compounds, etc.) and, on the other hand, any n-to-n-combination can stand between word forms and tokens.6 Associated with this meta-model, MAF provides a default XML syntax, but as we shall see later in this chapter, it is also possible to contemplate a TEI-based implementation for it. For syntactic annotation, however, ISO committee TC 37/SC 4 did not reach an early consensus on a possible XML syntax that would cover the variety of possible syntactic frameworks (constituency- or dependency-based, theory-specific) that can be observed either within existing treebanks (Abeillé 2003) or as export formats of syntactic parsers (Ide and Romary 2003). The published standard (SynAF, ISO 24615) is thus centred on a comprehensive meta-model informing the whole spectrum of syntactic representation practices, coupled with an extensive list of data categories that are now available within ISOCat (Broeder et al. 2008). The standard can presently be used to specify new formats or make interoperability checks,7 and a reference serialization of SynAF that would cover the kind of features now available in such formats as Tiger2 (Romary et al., n.d.)8 is planned. Work carried out within ISO project 24617-2 provided a comprehensive framework for the annotation of dialogue acts (Bunt et al. 2010), applicable to any kind of multimodal interaction. ISO/DIS 24617-2 (dialogue acts) can be seen at various levels of abstraction. It first provides a well-defined theoretical framework where the basic concepts of dialogue act, semantic content and communicative function are defined. Building upon the numerous initiatives and projects9 that have taken place in the last twenty years, it defines a domain-independent meta-model providing a multidimensional description of dialogue act phenomena, coupled with data categories registered in the ISOCat registry. Finally it offers a default XML serialization that fully implements the features of the intended model10. 6 

One token can correspond to several word forms, and vice versa. 7  Usually to assess the conformity of a data set with an expected input of a tool, and design a possible filter accordingly. 8  See the proposals by Głowińska and Przepiórkowski (2010) and Erjavec et al. (2010) for the encoding of SynAF compliant annotations by means of the TEI framework. 9  Cf. annotation schemes defined in such projects as TRAINS, HCRC Map Task, Verbmobil, DIT, SPAAC, C-Star, MUMIN, MRDA, AMI, and more recent attempts towards domain-independence, interoperability, and standardization in DAMSL, MATE, DIT++ or the EU project LIRICS. 10  Even if space prevents us from providing further details on this, this serialization is inspired by the annotation framework provided by the TEI guidelines.

As the preceding examples make clear, the focus on modelling and interoperability issues facilitates the design of a given corpus as the combination of basic standardization building blocks which can then be adapted by projects to handle legacy data or tools. It also allows us to anticipate possible transitions to make existing data more and more compliant with international standards when they are adopted in a scholarly community.

9.6.2  Genericity Made a Principle: LAF—GRAF In cases where no standardization activity for a specific annotation level exists, or (as is usually the case) when a variety of annotation levels have to be merged within one single information pool in order to carry out cross-level queries or visualization, there is a need for a high-level representation that basically unifies all types of specific annotation structure. Various proposals have been suggested to address this situation, including projects such as ATLAS (Bird et al. 2000), Mate (McKelvie et al. 2001), or more recently the American National Corpus ANC (Ide and Macleod 2001). The American National Corpus project was an opportunity to experiment and finalize the principles enunciated in the ISO LAF project, on the basis of a generic graph representation where nodes represent the reification of linguistic annotation components and edge relations between them. Based on the ISO-TEI feature structure standard for the further qualification of nodes and edges, LAF offers a default format (called GraF) for the serialization of any type of linguistic structure. LAF was created to provide easy mapping with similar past and present initiatives such as annotation graphs, or PAULA. It is also an important step in contemplating generic query mechanisms and perhaps a standardized query language for language resources.

9.6.3  Linguistic Annotation with the TEI In many respects, the TEI appears to be a very appropriate method to (a) describe primary transcription of phonological corpora and (b) implement the models provided by ISO standards (Romary 2009). Indeed, the Text Encoding Initiative can be a good entry point for anyone looking for a general purpose XML vocabulary, which in turn may be connected to—and thus be made interoperable with many other corpora and encoding initiatives. In the rest of the chapter, we will show how the TEI guidelines already offer a variety of constructs and mechanisms to cope with many issues relevant to spoken corpora and their annotations. When applicable, we will make the necessary links with ongoing ISO/TC 37 activities so that some clues are given as to how a possible transition to more elaborate annotation schemas, or possibly a mapping from basic TEI representations to other annotation schemas, could be implemented.


9.7  The TEI Framework for Transcribing Spoken Corpora The Text Encoding Initiative (TEI) began to propose approaches to annotate different types of textually represented resource in the late 1980s. Beginning with the 3rd major edition of the TEI Guidelines (Sperberg-McQueen and Burnard 1994), the TEI also addresses the topic of annotating transcribed speech. After a revision of the guidelines in 2002 that mainly switched from an SGML- to a fully XML-compliant syntax of the annotation, the most recent version of the TEI-annotation scheme was published as TEI P5 in 2009 as a ‘living document’ that is continuously updated. This section describes TEI’s approach to transcribing spoken language according to P5 (TEI 2011). However, as the TEI consortium has been very careful with their updates and changes—especially the chapter on the transcription of spoken languages, which has only seen a few minor changes over the years—older TEI-based annotations are still usable without much loss. The general structure of the TEI encoding framework is highly modularized. About 30 specialized TEI modules exist, for instance for dictionaries, verse text, dramas, linguistic analysis, and speech transcriptions. Moreover, it is also possible to define freely specialized tagsets for all purposes not addressed by existing TEI tags. Independent of the type of the annotated document, i.e. regardless of the used TEI modules, all TEI documents are subdivided into two major parts: the TEI-Header containing the metadata of the annotated resource, e.g. information on the time and place a dialogue took place; and the annotated resource itself, for instance the transcription of the spoken dialogue (see listing).                
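Schematically, and using standard TEI P5 element names, such a document has the following two-part shape (the comments are placeholders only; this is a sketch rather than a complete file):

<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- metadata of the annotated resource, e.g. information on the time and place a dialogue took place -->
  </teiHeader>
  <text>
    <body>
      <!-- the annotated resource itself, e.g. the transcription of the spoken dialogue -->
    </body>
  </text>
</TEI>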

The following sections describe the TEI-metadata and TEI-annotations with a strong focus on options to deal with spoken language. This entails the omission of many aspects of TEI. The complete guidelines, some 1,300 pages, are available on the TEI website (http://www.tei-c.org).


9.8  The TEI-Header The header of the TEI document contains all the metadata associated with a spoken text. This information is subdivided into four different major classes: (1) the file description, (2) the encoding description, (3) the profile description, and (4) the revision description. While the revision description does not contain information specifically relevant to phonological resources, the other three do. Apart from the file description, all other parts of the header can be omitted.
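A skeletal sketch of this subdivision, with standard TEI P5 element names for the four classes (the comments are placeholders; only the file description is compulsory):

<teiHeader>
  <fileDesc>
    <!-- (1) file description (compulsory) -->
  </fileDesc>
  <encodingDesc>
    <!-- (2) encoding description -->
  </encodingDesc>
  <profileDesc>
    <!-- (3) profile description -->
  </profileDesc>
  <revisionDesc>
    <!-- (4) revision description -->
  </revisionDesc>
</teiHeader>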

9.8.1  Information About the File There are only three compulsory parts to a TEI header. All of them must be included as children of the file description, annotated as <fileDesc>. These compulsory elements are used to provide information about the title (<titleStmt>), a publication statement (<publicationStmt>), and a description of the source of the annotated text (<sourceDesc>). In some respects, the file description contains information that is usually regarded as metadata. In the case of annotated speech resources, this class also allows the representation of information about the source of the transcription, almost always a recording. Technical data of a speech recording can be included in the information contained in <sourceDesc>. Such data include file format information (e.g. uncompressed WAV, compressed MP3 or OGG, the sampling frequency), specifications of the audio equipment (e.g. the number and the type(s) of microphone(s)), and the source of the recording (e.g. original recording, broadcast transmission). For this kind of information the <recordingStmt> (recording statement) with its subelement


<recording> (recording event) are available in the header of a TEI document that contains the transcription of speech.

<recordingStmt>
  <recording type="audio">
    <!-- reconstructed sketch; the exact element layout of the original example is assumed -->
    <equipment>
      <p>Two microphones, standard frequency</p>
      <p>44.1 KHz sampling</p>
    </equipment>
    <date>12 Jan 2010</date>
  </recording>
</recordingStmt>

The type of recording could also be ‘video’. In addition to the description of the <equipment> used to prepare the <recording>, the <broadcast> element could be used if the source were recorded from radio or TV. Of course, since the broadcast speech was also recorded before transmission, it is possible to include a <recording> element in <broadcast>, as well. This exemplifies how rich the TEI’s metadata description can be when needed.
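As a rough sketch of this possibility (the bibliographic details are invented placeholders, and the exact nesting chosen here is an assumption rather than the chapter’s own example), a transcription made from an off-air recording of a broadcast might be documented as follows:

<recordingStmt>
  <recording type="audio">
    <!-- the off-air recording used for the transcription -->
    <broadcast>
      <bibl>
        <title>Morning discussion programme</title>
        <date>12 Jan 2010</date>
      </bibl>
      <recording type="audio">
        <!-- the recording made by the broadcaster before transmission -->
        <equipment>
          <p>Studio equipment of the broadcasting company</p>
        </equipment>
      </recording>
    </broadcast>
  </recording>
</recordingStmt>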

9.8.2  Information About the Encoding The encoding declaration <encodingDesc> ‘documents the relationship between an electronic text and the source or sources from which it was derived’ (TEI P5). Besides other information, this element allows a tagging declaration (<tagsDecl>) to provide detailed information about the tagset used in the document, a feature system declaration (<fsdDecl>) that could be used when applying feature structures, and the <geoDecl> element for the declaration of the geographic coordinates. Because a lot of transcriptions of spoken language are prepared (semi-)automatically, for instance with the tools described in this volume, one might want to mention which tools have been used for this task in the metadata. The <appInfo> element allows the specification of a list of applications used for preparing the transcription.

<appInfo>
  <application ident="EXMARaLDA" version="1.4.4">
    <!-- reconstructed sketch; the ident value and the pointer targets are assumed -->
    <label>EXMARaLDA Partitur-Editor</label>
    <ptr target="#dialogue1"/>
    <ptr target="#dialogue2"/>
  </application>
</appInfo>

This example defines the application EXMARaLDA Partitur-Editor 1.4.4, and specifies two dialogues that have been transcribed with this tool.

9.8.3  Information About the Profile A comprehensive description of the languages used by the speakers, information about the situation in which the speech recording took place, and other non-bibliographic metadata can be specified in a profile description. One important component for the transcription of speech, especially when elicited in an experiment, is the <settingDesc>. By means of this element it is possible to provide information about the place, date, activities, etc. of the speech interaction. It could also be used to refer to controlled settings as, e.g. in Maptask (Anderson et al. 1991) and Tinkertoy (Senft 1994) experiments. It is possible to provide very fine-grained metadata with very detailed specifications of a participant in a dialogue. Within the <particDesc>, the <listPerson> element can be used to include information about participants in a conversation by means of a list of <person> elements. This element enables personal data for a person to be included, e.g.:
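A sketch of such a <person> entry; the date, place, and language are taken from the example referred to above, while the surrounding structure and the identifier are assumptions:

<particDesc>
  <listPerson>
    <person xml:id="spk1">
      <!-- structure and identifier assumed; only the textual content comes from the original example -->
      <birth>
        <date>12 Jan 2010</date>
        <placeName>Berlin, Germany</placeName>
      </birth>
      <langKnowledge>
        <langKnown tag="de">German</langKnown>
      </langKnowledge>
    </person>
  </listPerson>
</particDesc>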

9.9  TEI-Based Transcription In this section, we discuss how the TEI can be used for spoken data, using a dialogue from Thomas Schmidt’s chapter in this volume (see Schmidt, this volume) as an example. In this example the speakers communicate verbally in French as well as through gestures. A translation into English and additional information are also provided. Furthermore,


the alignment of the characters and the timeline indicate the sequence and the overlap of information. Whereas the metadata of a speech transcription are embedded in the <teiHeader>, the actual transcriptions are part of the <text> of a TEI document. The <body> embeds one or more ‘utterances’ (<u>). Within a <u> element an orthographic or a phonetic transcription is included. Since this element may contain text, it is possible to include annotations in non-XML-based conventions. The following example uses the convention GAT (see Schmidt and Wörner, this volume) to mark a nonlinguistic event.

<u>Alors ça dépend ((cough)) un petit peu.</u>

Such an approach allows researchers to continue to use conventions they are used to. At least, they can do so to a certain extent, as long as the annotation conventions do not contradict constraints pertaining to text data in XML documents. This means, in particular, that characters like ‘<’ and ‘&’, which have a special meaning in XML, cannot be used directly in the transcription text.

<!-- reconstructed sketch of the original example: speaker identifiers and anchor placement are assumed -->
<u who="#X" xml:id="u1"><!-- earlier words of this utterance omitted --><anchor xml:id="a1"/>bien.</u>
<u who="#Y" start="#a1">Alors ça dépend <vocal><desc>cough</desc></vocal> <anchor xml:id="a2"/>un petit peu.</u>
<kinesic who="#X" start="#a1" end="#a2">
  <desc>right hand raised</desc>
</kinesic>
<u who="#X">Ah oui?</u>

In this example, the overlapping speech of the two speakers is indicated by the inclusion of an anchor within the first utterance at the point where the second speaker starts his or her first utterance. At this very point, the first speaker starts a gesture that ends when the second speaker begins the phrase un petit peu. Besides this explicit information about the temporal relations of the different utterances and gestures, implicit temporal information is also included in the XML file, simply due to the serialization of the XML document. If there is no explicit information about overlaps, then it is implied that the communication events (speech, gestures, etc.) were produced sequentially one after the other. In the example above, this means that the last utterance Ah oui? starts after the completion of its previous speech turn Alors ça dépend ((cough)) un petit peu. The most precise approach to keep the temporal information is referencing each event to relative or absolute time points. This can be done by including the TEI element <timeline>, the definition of relevant time points and linking from utterances etc. to them. In the appendix to this chapter a complete example that makes use of this technique is given. One of the most interesting benefits of using a TEI-based approach for annotating speech corpora is the possibility of including elements from all other TEI modules. One of these modules is described in the TEI guidelines in chapter 17, ‘Linking, Segmentation, and Alignment’. It not only provides elements for a highly sophisticated addressing and linking mechanism, but also an element, <seg>, that allows the grouping of text fragments as long as the XML constraints are met. So, naturally, it is not possible to split elements in a way that results in overlapping markup. The <seg> element might be used with the attribute ‘xml:id’ to provide unique identifiers. This allows, whenever needed, a direct referencing to arbitrary text segments. Another example of the use of the <seg> element given in the TEI Guidelines (TEI 2011: 464f.) is shown below in simplified form:


<seg type="sentence">
  <!-- simplified reconstruction of the Guidelines example; the original also records phrase functions and parts of speech -->
  <seg type="phrase">
    <seg type="word">Literate</seg> <seg type="word">and</seg> <seg type="word">illiterate</seg> <seg type="word">speech</seg>
  </seg>
  <seg type="phrase">
    <seg type="word">in</seg> <seg type="word">a</seg> <seg type="word">language</seg> <seg type="word">like</seg> <seg type="word">English</seg>
  </seg>
  <seg type="phrase">
    <seg type="word">are</seg> <seg type="word">plainly</seg> <seg type="word">different</seg>
  </seg>.
</seg>

In this example the <seg> element is used to segment a sentence into phrases and words and to associate more detailed information like the phrase type or the part of speech with the segments. However, the guidelines also make it clear that a more appropriate annotation of linguistic information is available in the module ‘Simple Analytic Mechanisms’, because this module defines specialized elements not only for sentences (<s>), phrases (<phr>), and words (<w>) but also for morphemes (<m>) and syllables.

9.10  Annotating Corpora with the Core Mechanisms of the Tei 9.10.1  Using Feature Structures within an Annotation Scheme In this section we address the implementation of what we named the ‘qualification level’ by means of feature structures, and compare it with the general model for elementary annotations described above. Feature structures (Pollard and Sag 1987) are formal structures which combine a basic representation mechanism by means of a possibly recursive combination of feature-value pairs, where values can in turn be feature structures and associated operations in order to access, filter or unify such structures. Feature structures have been used as the reference mechanism for various

unification-based formalisms and also as a descriptive tool in order to attach basic properties to a linguistic segment (e.g. for phonetic descriptions Bird and Klein 1994). Complementing this well-established scientific background, since the early days of the Text Encoding Initiative an XML-based representation for feature structures has been developed by Terry Langendoen and Gary Simons (1995), and has been further improved and stabilized in the context of a joint TEI-ISO activity (ISO 24610-1, henceforth ISO-TEI-FSR). The representation of feature structures in ISO-TEI-FSR is based upon two central elements: • <f>, which contains a single feature-value pair; • <fs>, which groups together one or several feature-value pairs. A simple feature-value pair is described by means of the name of the feature (attribute @name) and its value, expressed as the content of the <f> element. In the canonical ISO-TEI-FSR this value is systematically typed by means of an embedded element, which can, for instance, be <binary> (with attribute @value=true/false), <symbol>, <numeric>, or <string>. For instance, the expression of a part of speech value for a noun would typically look like this:

<f name="pos">
  <!-- reconstructed in canonical ISO-TEI-FSR notation -->
  <symbol value="noun"/>
</f>

When combined, several feature-value pairs should be embedded within a feature structure, which can optionally be further typed, e.g. to provide direct access to all feature structures associated with the same annotation level. For instance, a basic morphosyntactic qualification block could be represented as:

<fs type="morphosyntax">
  <!-- reconstructed sketch; the @type value is illustrative -->
  <f name="pos"><symbol value="noun"/></f>
  <f name="gender"><symbol value="masculine"/></f>
  <f name="number"><symbol value="plural"/></f>
</fs>


As an illustration of the way feature structures can be used to describe the basic components of an annotation schema, let us show how a tagset can be covered with this framework.

9.10.2  Creating Tagsets Through Feature Structure Libraries 9.10.2.1 Rationale The main issue regarding tagsets11 as reference descriptions for morphosyntactic annotations is that they can be shared across corpora and annotation tools. In particular, a tagset articulates the relation between a concrete syntactic representation within a set of annotations and a reference semantics that may allow one to interpret the annotation further when exploring the annotated data. To this end, the ISO-TEI standard provides mechanisms for declaring feature and feature-value libraries that perfectly match the objective stated here. In the following section we will briefly outline a possible method for declaring tagsets in the feature structure framework, in order to show that such a method could be used as reference to actually document, record and compare the various tagsets used within the linguistic and computational linguistic communities.

9.10.2.2  Description of an Elementary Tag The first step in the process of declaring a tagset is the ability to describe elementary features. This can easily be achieved with the ISO-TEI standard by combining elementary feature statements such as those seen above within a feature library (fLib), together with a systematic identification of each feature (by means of an xml:id attribute). In the following example, the three elementary features corresponding to the grammatical gender possibilities in German are described accordingly.                                               11 

See e.g. (Monachini and Calzolari 1994) for the corresponding work carried out within the Multext project.
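A sketch of what such a feature library could look like in the ISO-TEI-FSR notation introduced above; the xml:id values are illustrative assumptions:

<fLib n="grammatical gender">
  <f xml:id="gender.masculine" name="gender"><symbol value="masculine"/></f>
  <f xml:id="gender.feminine" name="gender"><symbol value="feminine"/></f>
  <f xml:id="gender.neuter" name="gender"><symbol value="neuter"/></f>
</fLib>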

It can be noted here that if desired, one may fragment the various types of features (grammatical category, gender, number, etc.) within separate constructs or just group them all together within a single one. For instance, and in order to have all the illustrative material at hand, we could have the following series of declarations for grammatical categories:
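Again as a sketch with assumed identifiers, such a series of declarations for grammatical categories might look as follows:

<fLib n="grammatical categories">
  <f xml:id="cat.commonNoun" name="pos"><symbol value="commonNoun"/></f>
  <f xml:id="cat.properNoun" name="pos"><symbol value="properNoun"/></f>
  <f xml:id="cat.verb" name="pos"><symbol value="verb"/></f>
  <f xml:id="cat.adjective" name="pos"><symbol value="adjective"/></f>
</fLib>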

9.10.2.3  Description of a complete tagset Once all the elementary declarations are made, the ISO-TEI framework allows one to combine them to declare feature-value libraries (fvLib), within which a feature structure combining elementary morphosyntactic features corresponds to a tag in the tagset in a one-to-one manner. In the following (simplified) example, for instance, the tag for a masculine singular common noun is declared and provides the appropriate identifier for further reference:          
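A sketch of such a declaration, reusing the kind of elementary feature declarations shown above (the tag identifier and the references in @feats are illustrative, and a parallel declaration for number is assumed):

<fvLib n="tagset">
  <fs xml:id="Ncms" feats="#cat.commonNoun #gender.masculine #number.singular"/>
</fvLib>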

FIGURE 27.7  The structure of the DECTE interview content.

FIGURE 27.8  Excerpt of a DECTE interview (decten1tlsg01).

FIGURE 27.9  Excerpt of a DECTE interview text file (decten1tlsg01).

The multiplicity of XML tags would make any DECTE interview file very difficult to read. However, this is not really a problem in principle or in practice. TEI-conformant corpora are not intended for direct human inspection but rather for use with XML-aware application software like Xaira (http://www.oucs.ox.ac.uk/rts/xaira), which interprets the tags and uses them in whatever analysis is specified by the user and in content presentation to the user in accessible formats. Nevertheless, plain text file versions of the DECTE interviews are also available for those users who prefer them. As Figure 27.9 illustrates, these greatly simplify the tagging of the main XML files, while retaining enough markup to allow them to be used in a straightforward manner with text analysis software, such as AntConc (http://www.antlab.sci.waseda.ac.jp/antconc_index.html).

27.2.4 Dissemination Dissemination has been a key concern in shaping the DECTE project’s approach to updating and developing the earlier NECTE resource. This enhanced version of the corpus is intended to engage a much broader range of user groups, with different interests and requirements. To this end, the material is presented online via two different web portals, each with its own particular focus. The DECTE research website (http:// research.ncl.ac.uk/decte) is configured for academic purposes. As noted above, it


provides comprehensive accounts of the history and development of the corpus, the background details of its constituent interviews, and the structure and format of its audio and XML files. It also contains the download area through which the corpus can be obtained. The project’s other portal is a public-facing website called The Talk of the Toon (http://research.ncl.ac.uk/decte/toon).6 This site is geared towards users in school and museum contexts, as well as members of the general public. With this in mind, it integrates the DECTE files with still and moving images related to the sociocultural themes and aspects of local history that informants touch upon in their interviews. As a resource for schools, the site provides support for National Curriculum and A-Level syllabus subject areas. The study of characteristic features of spoken English and aspects of regional accent and dialect, for example, is supported in particular through the design of the main text/audio interface, which interactively links the playback of the interview recordings to their orthographic transcriptions, a feature made possible through the inclusion in the XML files of the time-alignment anchor tags mentioned above. Another important initiative as regards long-term sustainability, tools for analysis, and interoperability, as well as dissemination, is the DECTE team’s involvement in the JISC-funded ENROLLER project (http://www.gla.ac.uk/enroller). This initiative is a collaboration between DECTE at Newcastle University, the National e-Science Centre (NeSC), Scottish Language Dictionaries Limited (SLD), and the STELLA Project at the University of Glasgow, where ENROLLER is based. The project has established ‘An Enhanced Repository for Language and Literature Researchers’ that combines different kinds of linguistic datasets (thesauri and dictionaries, as well as corpora) into an integrated, interoperable online resource. The aim in doing so has been to address a need for corpus researchers who currently deal with distributed, non-interoperable data repositories that are often licence protected. ENROLLER demonstrates that secure access to distributed data resources with targeted analysis and collaboration tools can be delivered in a unified framework producing a greatly enhanced research repository for phonologists as well as others more widely in the arts, humanities, and social sciences.

6  Toon reflects a relic pre-GVS pronunciation of the word town as [tu:n], which is characteristic of Tyneside. It has become a synonym for Newcastle itself, and is particularly associated with the city in the context of football, with supporters of Newcastle United being known both locally and nationally as ‘The Toon Army’.

CHAPTER 28

THE LANCHART CORPUS

FRANS GREGERSEN, MARIE MAEGAARD, AND NICOLAI PHARAO

28.1  Aim and Basic Terminology The purpose of establishing the LANCHART Centre was and is to apply sociolinguistic methods to study variation and change in a speech community at large in real time (Sankoff 2005). Any study of language change in real time is dependent on prior studies. In acquisition studies the spacing between the recordings is small. In the study of change in real time it is usually larger. But in all cases there are at least two (rounds of) recordings which are compared in order to establish what has changed and what has remained constant. In order to create a general terminology, we suggest and here adopt the convention of abbreviating the first (i.e. in most cases the ‘original’) study S1 while the ‘repetition’ or equivalent new study, or in fact any later recording using the same set up in some specified sense, is designated as (an instance of an) S2 (Gregersen 2009). This makes it possible to speak about any relationship between first and later recordings.

28.1.1 Size The LANCHART corpus at present (May 2012) contains 600 transcribed files with around 6,577,029 filled intervals (roughly equalling ‘word tokens’; the tokens represent 77,719 ‘word types’) and a total of 1,814 recordings. The transcribed corpus thus includes only a small part of the total number of available recordings.

28.1.2 Design A fair number of studies have analysed the speech of a stratified section of informants in the Danish speech community in the late 1970s through the 1980s and these


studies have been taken as our S1s. Since the new recordings have all taken place from 2005 and onwards there is in the case of the LANCHART corpus a time distance of around twenty years between the S1 and the S2. The core corpus thus consists of recordings from

• Vinderup in Western Jutland (S1 from 1978)
• Odder in Eastern Jutland (S1 from 1986 to 1989)
• Næstved, Southern Sealand (two different S1s, both of them from 1986ff)
• Copenhagen, Central Sealand (S1 from 1986 to 1988)

In addition, there are separate corpora from a longitudinal study of Danish and Danish-Turkish speaking children from a public comprehensive school in Køge immediately South of Copenhagen (Jørgensen 2003; Møller and Jørgensen 2009), a study of real-time change at Vissenbjerg, Funen (Pedersen 1994), and a study from 2001 of Danish and Swedish in the Øresund region (Gregersen 2003). The Copenhagen material has been supplemented by including a number of shorter recordings from 1970 to 1971 to enlarge the time frame. Designs for real time studies are of two types. Panel studies record the same persons in the S1 and the S2, whereas trend studies construct an S2 sample which is equivalent in some predefined respect to the original S1 sample; the speakers, however, are not the same as in the S1. We have used the panel as our primary design. This means that we have located and recorded a number of informants from the various S1s. These re-recorded informants make up the panels. But the S1s in all cases included more informants than those used for the panels. Even in the S2s we recorded enough new informants so that the various data sets would also qualify as trend studies. Having both the panel and trend designs at our disposal is important in order to compare the behaviour of individuals in real time to the general patterns of change in the surrounding speech community (cf. Sankoff 2006; Sundgren 2009).

28.1.3  Speaker Variables Four speaker variables are controlled in the LANCHART corpus: gender, age defined as a broader section of time, a time slice (Eckert 1997), social class, and geographical location. Gender is here defined simply as a biological category. As to age, we had to define our samples in two dimensions: First, the birth year of an informant is the key to his or her assignment to a specific generation. We took the Copenhagen core generation of the then between 25-year-old and 40-year-old informants as our starting point and searched for informants to fill the cells further defined by gender and social class in the other S1s. Thus birth year is the key to the structure of generations of which we have three plus two: Apart from generation 1 and 2 defined by birth years in the original Copenhagen S1 (1946–63 and 1964–73 respectively) we recorded

536   Frans Gregersen, Marie Maegaard, and Nicolai Pharao ninth-graders in the four localities as generation 3 (birth years 1987–96). Between generations 2 and 3 we identified a number of Næstved and Odder informants with birth years 1981–86 as generation 2.5 and all those born before 1945 as generation 0. Birth year is one dimension. But obviously, recording year is also relevant. A person recorded in 1978 as a 15-year-old school pupil of course in many respects differs from the grown up 25-year-old who was recorded in 1988. He or she might, for example, be more similar to a 15-year-old recorded in 1988 as to choice of lexicon and use of genres. Thus time of recording is the other dimension of time in the corpus. We fare slightly worse on the matter of social class. Again, we took the Copenhagen criteria for class assignment as the point of departure (Gregersen 2009:  9f) only to realize that one of the important differences between small provincial towns the size of Odder and Vinderup and a metropolis such as Copenhagen is precisely their social differentiation. The categories are not (necessarily) immediately transferable and the precise distribution characteristic of Copenhagen (and in particular of the neighbourhood of the original Copenhagen S1) was not to be found anywhere else. This is a general problem of comparison: either you control for independent factors and risk ending up with a sample which is grossly inadequate as a representation of the particular speech community, or you do not control and risk ending up with portraits of avowedly completely different communities. Finally geographical location: The four main sites are nicely divided into one metropolis (Copenhagen), one city under the dominance of the metropolis (Næstved), and two smaller provincial cities, one very close to a larger conurbation and partly being integrated into this (Odder, close to Aarhus) and one more isolated and placed in a traditionally dialect speaking area (Vinderup). This geographical dimension in the corpus makes it possible to test models of diffusion from a norm centre, in this case Copenhagen. That Copenhagen, and only Copenhagen, functions as the norm centre for Danish speech has been documented in the language attitude tests carried out at all the LANCHART sites (Kristiansen 2009).

28.1.4  The Informant Database All speaker variable data are stored in a separate MySQL informant database. The link between the informant database and the transcripts is the ID code for each informant consisting of three capital letters. This ensures that the informant database may be continuously updated, for example, with information from the recordings themselves without any change in the actual transcripts.

28.1.5  Transcription and Coding In the transcription process we make use of the Transcriber program and a detailed manual is available (in Danish) from the website. Alignment between sound files


(24-bit wave files) and transcripts is done at the utterance level. This means that the transcription is only roughly aligned—not precisely enough to carry out automatic analysis of, for example, vowel formants but enough to identify where a word occurs in the transcripts. All mark-up, annotation, and coding of linguistic variables is done in Praat so that each type of coding is stored in a separate tier. In Praat there is no limit to the number of tiers which may be linked to the primary tier, which in this case is invariably the orthographical one.1 In order to maximize the potential for comparison with written language, all transcriptions follow the strict orthographical norm institutionalized by the Danish Language Board in its orthographical dictionary (Retskrivningsordbogen). The number of additional features that needed to be indicated, for example, pauses and false starts, has been kept to an absolute minimum (details of the transcription conventions can be found in the online manual).

1  This is one of the reasons why it is cumbersome to the level of unbearable to update the orthography in the transcripts. It would mean disrupting all structure in the text grids. Obviously, this creates an urgent need for proof-reading before the transcription is allowed to enter the corpus proper.

28.1.6  Important Connections In general, the LANCHART corpus is not a closed unit which contains only a specific number of recordings; it is still growing not only because we still record new informants but also because various existing recordings are incorporated and access is gained to other corpora through cooperation. The LANCHART Centre participates in both the CLARIN and LARM research infrastructures. The Danish CLARIN project coordinates a number of Work Packages in a research infrastructure for the humanities and social sciences (www.clarin.eu). Among them is a Work Package (3) which will give CLARIN users (i.e. researchers and students) access to a number of transcribed and annotated recordings of spoken Danish. The Danish State Radio archives participate in another research infrastructure, the LARM project, which from 2010 until the end of 2013 will make both old and new radio recordings accessible for a broad range of research purposes (cf. www.larm.hum.ku.dk).

28.2  Comparability and the Discourse Context Analysis Comparability is a major issue for any study of real-time change. S1s are always different from the proposed S2s and in our case the S1s varied considerably in scope, aims, and methods (Gregersen 2009). In order to ensure comparability we developed a six-dimensional description of each file so that a simple style analysis (Albris 1991) could be supplanted by this Discourse Context Analysis (abbreviated DCA) (Gregersen et al. 2009). A framework such as the DCA enables us to find comparable passages in recordings made for different purposes and using different methods so that we actually compare only comparable sections from S1 and S2. The DCA forms the basis for the phonetic analysis: only passages delimited as belonging to the Macro Speech Act of Exchange of Information have been coded for phonetic variation.

28.3  The Phonetic Annotation All the sociolinguistic interviews contained in the LANCHART corpus have been phonetically annotated, using automatic text-to-phone generation.2 This facilitates the study of real time changes in pronunciation. The phonetic annotation obviously cannot be used as primary data in the study of real-time sound change, but it greatly facilitates the search for tokens of phonetic variables at the segmental level. The annotation is based on distinct Copenhagen-based speech, and thus circumvents the problems of searching for tokens of variables on the basis of orthographic forms, which is particularly useful for a language such as Danish, where we find a very opaque relationship between written elements (‘letters’) and phones. The level of representation is thus roughly similar to a phonemic representation of every running word in this part of the corpus. Because the annotation is based on this highly distinct style of speech it becomes possible to study coarticulation phenomena such as schwa assimilation and connected speech processes like weakening and deletion of segments, as well as the more traditional sociophonetic variables that have been included in the studies of on-going sound change in different varieties of Danish (Pharao 2010). The segmental annotation is given in two styles: a modified version of SAMPA developed by Peter Juel Henrichsen and a representation using IPA symbols through the use of codes in Praat. The former representation is easy to search through, since each symbol in the annotation represents one phonetic segment, but naturally the IPA symbols are easier to read for researchers not familiar with Danish and the modified version of SAMPA. An overview of the symbols used in the modified SAMPA representation can be found at this url: http://www.isvcbs.dk/~pjuel/TtT/phontable.html. In addition to the segmental annotation of the interviews, two prosodic features are also included: primary stress and the Danish syllable accent stød (a laryngealization which can occur in certain segmental sequences and which is used for lexical contrast (Fischer-Jørgensen 1989, Basbøll 2005: 83ff )).

2 

The phonetic annotation was originally carried out by Peter Juel Henrichsen but it is now part of the LANCHART family of programmes.


Since the phonetic annotation of the LANCHART corpus has been done automatically, stress is also assigned by rule. An actual investigation of the accuracy of this rule-based assignment of stress in the annotation is still pending, but it is clear (unsurprisingly) that the rules are not perfectly accurate, that is, syllables marked as stressed in the annotation are not necessarily realized as stressed, nor are canonically unstressed syllables always realized without stress. Furthermore, tokens with secondary stress will often be of interest to researchers, but there is no way of including such tokens in a search and mark-up of a particular variable without also including all unstressed tokens, in effect any and all tokens of the variable. This is because stress has been annotated as a binary feature, and the level of secondary stress has been marked (or, rather, not marked) in the same way as null stress. However, since the automatic generation of the marking of primary stress needs to be evaluated by human listeners before a token can be included in an analysis, the coarse grained nature of the stress marking in the annotation is of little practical consequence.

28.3.1  Coding for Phonetic Variation

The search and mark-up of particular phonetic variables is done using Praat scripts specifically written for use with the files in the LANCHART corpus. Users are asked at the beginning of a search which segment or sequence of segments is to be included. The basic search script then reads the labels of each interval in the tier containing the phonetic annotation and evaluates the label for a possible match with the segment or sequence of segments specified by the user. When a match is found for a particular variable, an interval is created which matches the interval for the word containing the token, and this interval is marked in a separate tier used only for mark-up of the variable being investigated. These intervals are then labelled with the variable in question, and the search and mark-up continues until the end of the TextGrid file is reached. Only passages delimited as examples of the Macro Speech Act Exchange of Information are coded, and only up to 40+ instances of any variable.
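The logic of this search-and-mark-up procedure can be sketched in a few lines of code. The sketch below is a Python illustration only, not the actual LANCHART Praat script: tiers are represented as plain lists of (start, end, label) intervals, only single-segment targets are handled, and all labels and time values are invented for illustration.

```python
# Minimal sketch of the search-and-mark-up procedure described above.
# Tiers are simplified to lists of (start, end, label) tuples; the real
# LANCHART scripts operate on Praat TextGrids.

def mark_up_variable(phone_tier, word_tier, target, variable_name):
    """Return a mark-up tier with one labelled interval per matching token."""
    markup_tier = []
    for p_start, p_end, p_label in phone_tier:
        if p_label != target:                       # evaluate each label for a match
            continue
        # copy the boundaries of the word interval containing the token
        for w_start, w_end, w_label in word_tier:
            if w_start <= p_start and p_end <= w_end:
                markup_tier.append((w_start, w_end, variable_name))
                break
    return markup_tier

# Hypothetical example: one word whose first vowel matches the target label "e".
phones = [(0.00, 0.05, "p"), (0.05, 0.15, "e"), (0.15, 0.22, "N"), (0.22, 0.30, "@")]
words = [(0.00, 0.30, "penge")]
print(mark_up_variable(phones, words, "e", "(eng)"))
# -> [(0.0, 0.3, '(eng)')]
```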

28.3.2  The Sociophonetic Variables

The LANCHART corpus has been marked up and coded for nine sociophonetic variables (and a number of morphological and grammatical variables; cf. Jensen 2009); they are described below. Following the conventions of variationist sociolinguistics, regular parentheses are used to represent the sociophonetic variables, whereas phonemes and allophones are given in slanted and square brackets respectively.

28.3.2.1  The (æ) and (Α) Variables

The most thoroughly investigated variation in the sociolinguistic literature on Danish concerns the changes which have occurred in the realization of the /a/ phoneme throughout

the twentieth century (cf. Brink and Lund 1975; Holmberg 1991; Gregersen et al. 2009). A process of fronting and raising began in the early twentieth century in which original [ɑ] and the innovation [æ] achieved near-perfect complementary distribution: [æ] occurring only before alveolars and syllable boundaries, and [ɑ] occurring only before labials and dorsals. For example, ‘han’ /han/ he is pronounced [hæn], that is with an [æ], but ‘ham’ /ham/ him is pronounced [hɑ̈m], i.e. with an [ɑ̈]. The fronted [æ] only occurs before a labial or a dorsal if a syllable boundary intervenes, for example, ‘kapel’ /kaʔpɛl/ chapel [khæˈphɛlʔ] and ‘magi’ /maʔgiː/ magic [mæˈgiːʔ]. Conversely, [ɑ̈] only occurs before alveolars if an underlying /r/ occurs in the next syllable of the word, for example, ‘altid’ /ʔalʔtið/ always [ˈalʔtsiðʔ] but ‘aldrig’ /aldri/ never [ˈɑ̈ldʁi]. Due to the imperfect complementarity of the distribution, the possibility for minimal pairs arose (and a few exist (Basbøll 2005: 51)), hence leading to two separate phonemes /a/ and /ɑ/. Both of these have since developed into the sociophonetic variables (æ) and (Α). The variable (æ) thus concerns the raising of expected [æ] to [ɛ] before alveolars and syllable boundaries, that is, occurrences of /a/ in all segmental contexts except in words where an /r/ occurs in the following syllable. The variable (Α) concerns the retraction of [ɑ̈] to [ɑ] in realizations of the phoneme /ɑ/. It is restricted to all tokens of expected [ɑ̈] before labials and dorsals; that is, once again, instances occurring in words with an underlying /r/ in the following syllable are excluded from the variable. Occurrences of either /a/ or /ɑ/ in words with underlying /r/ in subsequent syllables are often studied together as the variable (ær) (cf. Holmberg 1991: 158ff), due to the variation between the two phonemes in some words; for example, ‘alder, altså’ /alər alsɔ/ age, thus may be realized as either [ˈɑ̈lɐ] and [ˈɑ̈lsʌ] or [ˈælɐ] and [ˈælsʌ], with the latter forms being the most widely used in modern speech.

28.3.2.2  The (Αj) Variable

Variation between back [ɑ̈] and front [a] is also found in the diphthong /Αj/. Throughout the twentieth century, the standard pronunciation of this diphthong has varied between [ɑ̈j] and [aj]. Currently, the standard pronunciation is [ɑ̈j], and realization as [aj] is highly salient and carries clear social connotations (of using a ‘posh’ style). In other words, the variable (Αj) concerns the fronting and raising of the nucleus of the /Αj/ diphthong relative to the standard target realization as [ɑ̈j].

28.3.2.3  The (e̞ŋ) Variable

A more recent innovation than those involving changes in the pronunciation of /a/ and /ɑ/ is the raising of [e̞] to [e] before the velar nasal [ŋ]. That is, in words like ‘penge, engelsk, tænke’ (Eng. money, English, think), the vowel in the first syllable is variably realized as [e̞] or [e]. Note that the distinction between the two vowels is phonemic, that is, all varieties of Danish have both an /e̞/ and an /e/ phoneme. The standard pronunciation of all word forms is with [e̞], i.e. the words ‘penge, engelsk, tænke’ are realized as [phe̞ŋə e̞ŋʔəlsg tse̞ŋgə] in (distinct) standard speech, whereas realizations as [pheŋə eŋʔəlsg tseŋgə] are mainly found in the speech of young people.


This variable provides a good example of the usefulness of the automatic phonetic annotation of the corpus. All relevant tokens of the (e̞ŋ) variable will have been annotated as [e̞] + [ŋ] in the corpus. As is apparent from the examples above, the orthographic representation of the relevant sequence of phones is quite variable; using the phonetic annotation as the basis for the search is therefore not only easier than searching orthographic forms, but also ensures that all relevant tokens are identified and marked up for subsequent analysis in any given interview.

28.3.2.4  The (ru) Variable

Preceding /r/ has initiated a process of lowering for several vowel qualities in Danish, and most phonological analyses of standard Danish operate with a general process of /r/-colouring (Basbøll 2005). More recently, this lowering has spread to both short and long /u/. The (ru) variable therefore concerns the lowering of [u] to [o] after [ʁ]. The variable is not very frequent in running speech, and has mainly been studied in elicited material, for example, shadowing tasks (Holmberg 1991: 204ff).

28.3.2.5  The (əð) Variable

The phoneme sequence /əð/ occurs mainly in two grammatical forms: the past participle and the neuter definite article. Regional West-Danish (Jutland/Funen) has [əd], whereas in Regional East-Danish, and hence also in the norm centre of Copenhagen, the standard pronunciation is [əð]. According to the literature, the [əd]-pronunciation is not specific to one Jutland dialect but has become a regional variant at least for the Eastern part of Jutland (e.g. Brink et al. 1991; Nielsen 1998; Nielsen and Nyberg 1993).

28.3.2.6  The (ðəð) Variable

The (ðəð) variable constitutes a special case of the (əð) variable. It deals with the hypothesized dissimilation of the final [ð] from the initial [ð] in the sequence, leading to realizations such as [ðəd] even in Regional East-Danish. It concerns only inflected forms where the suffix /əð/ (regardless of morphological category) is added to stems ending in [ð]; for example, ‘brødet’ /'brøðəð/ bread (def.sg.) is hypothesized to be realized as ['bʁœðəd].

28.3.2.7  The (wəð) Variable

An ongoing process of consonant deletion is the loss of [w] before syllabic [ð], a process which originated in Copenhagen Danish early in the twentieth century and is still spreading (Pharao 2010). The variable has been marked up for all of the sub-corpora in the LANCHART corpus in order to enable us to investigate to what extent the process has spread to other varieties of Danish.


28.4  Sample Studies

In what follows we will present brief examples of two types of sociophonetic analyses that have been carried out using the LANCHART corpus. The first two examples concern the overall distributions of variants across the different LANCHART communities; the third concerns analyses of co-variation between phonetic and grammatical variables in a conversation.

28.4.1  The Distribution of (əð)

The distributional patterns of this variable across the Jutland communities are studied as a possible example of ongoing regionalization. If a regionalization process is taking place, we would expect a pattern where Eastern Jutland is the centre. Thus, we would assume speakers from Eastern Jutland to have the highest frequency of [əd]-variants, younger speakers to have a higher frequency than older speakers, and possibly that all speakers will have higher frequencies of [əd]-variants in the new recordings than in the old ones.

The two Jutland towns, Odder (Eastern Jutland) and Vinderup (Western Jutland), are situated in different dialect areas. Odder is situated in a part of the country where the traditional dialect has [əd] in the past participle, whereas Vinderup is situated in a part where the traditional dialect has [əɹ] (according to Rasmussen et al. 2010). So, we would expect the Odder informants to have a higher incidence of pronunciations as [əd] than the Vinderup speakers, at least when it comes to past participles. The same is the case with respect to the definite article. Odder is situated in the area that traditionally has [əd], whereas Vinderup is situated in the area where the article is actually not a suffix at all but a pre-posed particle. Again, we would expect this to have an influence on the variation between the two populations, so that informants from Odder would have a higher frequency of [əd]-variants.

Our analyses do not, however, support the regionalization hypothesis. In Odder, the oldest speakers (Generation 1) use the regional variant more (around 70 per cent), and the Eastern Danish variant less, than the younger speakers in Generation 2 (who have a frequency of 50 per cent). This is the case in both the old and the new recordings, and the difference between the two generations is statistically significant both now and then (p < 0.05 according to a chi-square test). When we include Vinderup (Western Jutland) in the analyses, we find that the Vinderup speakers (Generation 3) have a very high incidence of [əð]-variants—more than 90 per cent. Hence, there are no signs of an ongoing regionalization process regarding this variable. If such a process has taken place, it must have occurred earlier. Future analyses of generations 1 and 2 in Vinderup will enable us to tell whether these earlier generations have taken up the regional variant, kept the local dialect variant, or taken up the Eastern Danish variant. Judging from these data, we would expect that in Vinderup, speakers have changed directly from the local variant to the Eastern or standard variant.
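For readers who wish to reproduce this kind of significance test, the following sketch shows a chi-square test on a 2×2 generation-by-variant table using SciPy. The token counts are invented purely for illustration and are not the actual LANCHART figures.

```python
# Chi-square test of the kind reported above, computed on an invented 2x2 table
# (generation x variant). The counts are hypothetical, NOT the real Odder data.
from scipy.stats import chi2_contingency

#            [əd] (regional)   [əð] (Eastern/standard)
observed = [[70, 30],    # Generation 1 (hypothetical token counts)
            [50, 50]]    # Generation 2 (hypothetical token counts)

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
# A p-value below 0.05 would be reported as a significant difference
# between the two generations, as in the analysis described above.
```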


Change in the use of the two variants [əd] and [əð] resembles the classical pattern of dialect levelling, where the regional form is used less and less from one generation to the next. Thus, this is not an example of regionalization, but of standardization from Copenhagen / Eastern Danish (Jensen and Maegaard 2012).

28.4.2  The Raising of (æ)

The next example manifests a different pattern, although the analyses lead to a similar conclusion. The raising of (æ) has long been assumed to be spreading from Low Copenhagen to the speech of speakers in the rest of the country. Our analysis is based only on generation 1 (i.e. the oldest speakers in the sample), but in three communities: Copenhagen, Næstved, and Odder, in both the old and the new recordings.

In the old recordings, contrary to expectations, the frequency of raised variants in Copenhagen is much lower (9 per cent) than in the two provincial towns (20–22 per cent). This means that there is no sign of the use of this variant progressing in Copenhagen and spreading from there to other speech communities in the country. Actually, the pattern could be interpreted as a change where Odder and Næstved are in the lead, and Copenhagen is lagging behind. However, a different pattern appears in the new recordings: the frequency of [ɛ]-variants is stable in Copenhagen, whereas the informants from Odder and Næstved have now decreased their use of [ɛ] to 12 per cent, and thus have approximated to the level of Copenhagen.

This result is unexpected given the predictions from earlier research. We interpret the result as a spread from Copenhagen, as in the example above. Earlier studies (Jørgensen 1980) clearly show that raising of (æ) was increasing among Copenhagen speakers (up to around 80 per cent for speakers born around 1935–45, recorded in 1978), and if we take these results into account, we see a development where the use of [ɛ] peaked in Copenhagen prior to the time of the LANCHART S1. Later, speakers from other parts of the country increased their use of [ɛ], and finally, after a decrease of [ɛ]-variants in Copenhagen had taken place, speakers in the rest of the country also decreased their use of raised variants. It seems that this is an ‘accommodation to a moving target’, which means that the changes have happened as shown in Figure 28.1. This is a recurring pattern in analyses of the LANCHART Corpus (both in phonetic and grammatical variation): a variant increases and decreases in Copenhagen over time, and speakers from other parts of the country follow the developments in Copenhagen some decades later (Gregersen et al. 2009, Jensen 2009).

28.4.3  Co-Variation of Variables

Since all occurrences of the variables have been coded for the entire interview in a small sample of files, the LANCHART corpus may also be used for studying co-variation of variables. Figure 28.2 (from Jensen and Pharao 2007) shows the use of innovative variants of three phonetic variables and one grammatical variable (denoted ‘du’ in the diagram) in a single conversation.

FIGURE 28.1  Accommodation to a moving target. [Schematic figure with three panels (earlier studies, LANCHART S1, LANCHART S2) showing the relative percentage of [ɛ] for Copenhagen, Odder, and Næstved at each stage.]

FIGURE 28.2  The use of innovative variants during one sociolinguistic interview with a male middle-class speaker from the Copenhagen S1. [Line chart entitled ‘Share of innovative variants TNI 87’: percentage of innovative variants plotted across intervals 1–13 of the interview.]

The diagram shows, for one informant, the change in frequency of innovative variants throughout a sociolinguistic interview. The interview has been divided into arbitrary intervals of 2,000 words each. Each point on the curves marks the percentage of innovative variants in one such interval. There are on average around 20 occurrences of each variable in each interval. There are a number of peaks and troughs in all curves, which indicates that while there is an overall tendency for an increase in the use of innovative variants of the phonetic variables, this increase is not monotonic through the interview. Rather, the share of innovative variants varies widely across the entire conversation, and it therefore matters a great deal which passages are selected for quantitative analysis. Furthermore, the grammatical variable does not necessarily follow the pattern of the phonetic variables (cf. interval no. 4 in Figure 28.2), nor do the phonetic variables necessarily co-vary; that is, an increase in the use of the variant [ɛ] of the (æ)-variable from one interval to


another is not necessarily accompanied by an increase in the use of the variant [e] of the (e̞ŋ)-variable (cf. intervals 8 and 9 in Figure 28.2).
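A sketch of how such interval-by-interval curves can be derived from coded tokens is given below. The flat token representation and the counts are hypothetical; the sketch simply illustrates the 2,000-word windowing and the percentage computation described above.

```python
# Sketch of the interval analysis underlying Figure 28.2: the interview is cut into
# fixed-size word intervals and the share of innovative variants is computed per
# interval and per variable. Token records and values are invented for illustration.
from collections import defaultdict

INTERVAL = 2000  # words per interval, as in the analysis described above

def share_of_innovative(tokens):
    """tokens: iterable of (word_index, variable, is_innovative) tuples."""
    counts = defaultdict(lambda: [0, 0])          # (interval, variable) -> [innovative, total]
    for word_index, variable, is_innovative in tokens:
        key = (word_index // INTERVAL + 1, variable)
        counts[key][0] += int(is_innovative)
        counts[key][1] += 1
    return {key: 100.0 * inn / tot for key, (inn, tot) in counts.items()}

# Hypothetical coded tokens: word position, variable, innovative or not.
tokens = [(120, "(ae)", True), (450, "(ae)", False), (2310, "(eng)", True), (2500, "(ae)", True)]
for (interval, variable), pct in sorted(share_of_innovative(tokens).items()):
    print(f"interval {interval}, {variable}: {pct:.0f}% innovative")
```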

28.5  Conclusion

The LANCHART corpus makes it possible to study language change in real time in a socio-geographical perspective. The coding has been done at various linguistic levels, and the corpus thus affords many possibilities for closer study of how and why spoken language varies in time and space. The phonetic annotation facilitates the study of sound change, and the storing of raw data in TextGrid files allows us to examine the relationship between utterance phonology and other levels of linguistic and sociolinguistic description.

CHAPTER 29

PHONOLOGICAL AND PHONETIC DATABASES AT THE MEERTENS INSTITUTE

MARC VAN OOSTENDORP

29.1  Introduction

The Meertens Institute in Amsterdam was founded in 1930 under the name ‘Dialect Bureau’ (Dialectenbureau), and became an official institute of the Royal Netherlands Academy of Arts and Sciences (KNAW) in 1952. In 1979 it was named after its first director, P. J. Meertens (1899–1985), a student of seventeenth-century Dutch literature. Currently it comprises two departments, Dutch Ethnology and Variational Linguistics. Originally, the Institute had as its primary goal the documentation of the traditional dialects (as well as folk culture) of the Netherlands. In the course of time this focus has broadened in several ways; for instance, the Institute now also does its own research, and the linguistic department has widened its scope to topics other than traditional dialects. At the same time, the documentation of dialects itself has progressed. Over the past fifteen years, considerable effort has gone into digitizing material and putting it online. This brief contribution seeks to describe the two most important databases on Dutch dialects which are available at the Meertens Institute: the Goeman-Taeldeman-Van Reenen Database and Soundbites; I will conclude by pointing out some future plans and desiderata.

29.2  The Goeman–Taeldeman–Van Reenen Database

At the core of phonological research at the Meertens Institute, we find the database of the so-called Goeman-Taeldeman-Van Reenen Project (GTRP; available at http://www.meertens.knaw.nl/mand/database/).


It contains data about the phonology and morphology of 613 dialects spoken in the Dutch language area of Europe—that is to say, in the Netherlands, in Flanders, and in French Flanders. The locations are more or less evenly spread geographically on the basis of a hexagonal grid with one location per hexagon, as Figure 29.1 shows. In known transitional dialect areas the sampling of locations is denser. In most cases, there is one speaker per location; typically this is somebody close to the traditional NORM for classical dialectological fieldwork: a non-mobile older rural male, as Table 29.1 shows (the reason for choosing such informants was that they are considered to be the most representative speakers of ‘traditional’ variants).

FIGURE 29.1  Informants for the GTRP database.

Table 29.1  Informants for the GTRP database

                    Gender (% of females)   Mean age   Age range
Flanders            22%                     65.2       37–91 (most speakers around 65)
The Netherlands     30%                     61.7       25–84 (speakers more evenly spread)

The fieldwork for this database was mostly done in the 1980s and 1990s. The core of the interview consisted of a questionnaire of 1,876 items (mostly individual words, but also a few paradigms and sentences) in Standard Dutch, which the informants were invited to translate into their own dialect. Since the emphasis of the project was on phonology and morphology from the outset, the design of the questionnaire and the actual interview were set up in such a way that translations into etymologically different items were avoided as much as possible. All interviews were recorded on audiotape and subsequently transcribed—the Dutch items at the Meertens Institute and the Flemish items at the University of Ghent. (To be precise, over 700 recordings were made, but in the end there was only money for the transcription of 613; these are the ones discussed here. The other recordings are digitized and may be added to the database in due course; the problem will be the transcription, which is specialized work.)

One of the problems with the database is that the work in the Netherlands and Flanders was done separately in the two different countries, by two different teams, and at different times. This has the effect that there are quite a few differences between the data from the two areas which cannot necessarily be understood in terms of dialect geography. For instance, the fieldwork in the Netherlands was mostly done in the 1980s, while that in Flanders was performed almost a decade later. Also, as we can learn from Table 29.1 above, the selection of informants was not carried out in exactly the same way in the two locations, with the Belgian and French informants adhering slightly closer to the concept of the non-mobile old rural male (NORM). Furthermore, the Dutch transcribers used a rather narrow phonetic transcription, which was notated in a computer-legible ASCII format called KIPA (‘keyboard IPA’, a precursor to SAMPA which was developed specifically for this project). The Flemish transcriptions were more ‘phonological’ and furthermore notated directly in IPA; these files were later converted to KIPA, while the overall database was later provided with an automatic translation into IPA.

The name of the GTRP project refers to its three founders: Ton Goeman from the Meertens Institute, who led the Dutch part of the fieldwork; Johan Taeldeman, from Ghent University, who did the same for Flanders and French Flanders; and Piet van Reenen, who has been mostly involved in setting up the database. The resulting database has been the source of two linguistic atlases (on paper): the Fonologische Atlas van de Nederlandse Dialecten (FAND: Phonological Atlas of Dutch Dialects, Goossens et al. 1998; 2000; de Wulf et al. 2005) and the Morfologische Atlas van


de Nederlandse Dialecten (MAND: Morphological Atlas of Dutch Dialects, Goeman et al. 2005; 2008). The former comprises three volumes and was produced in Flanders around the turn of the century, whereas the latter consists of two volumes that appeared in 2005 and 2008. FAND is a monolingual publication in Dutch, whereas MAND exists in both a Dutch and an English version.

The database can also be consulted directly on the web in an interface built by Jan-Pieter Kunst of the Meertens Institute, in collaboration with the author of this article (http://www.meertens.knaw.nl/mand/database/). Unfortunately, this interface is currently available in Dutch only, although this problem can probably be overcome by using some online translation tools. The database gives access to the original transcription files in a variety of ways (the files are also available on request from the Meertens Institute for those researchers who want to run their own scripts on the material; these files are simple UNIX text files in which every record occurs on a separate line, and all the fields are separated by commas). During recent years most of these data have been enriched by (highly trained) volunteers with the original sound files on which the transcriptions are based, and this work will continue over the next few years. Furthermore, the interface allows the user to save selections of the data, and to draw maps of the language area based on such selections. For example, one could select all the words which have a fronted pronunciation of the vowel in aap ‘monkey’, and see whether these form a coherent region.

One can search the transcriptions either in the KIPA form or in a simplified transcription, which is based on Standard Dutch orthography (although the data can be presented in an IPA form, it is unfortunately not possible to search for them in that way). Thus, one can search for all occurrences of the high fronted rounded vowel [y] in all dialects. One can also search for items in the questionnaire, and thus get all translations of the item brood (bread). Another possibility is to search within word categories, and thus get all past tense forms of verbs (this obviously seems more useful for morphological searches), or to search for ‘word endings’, e.g. all words ending in a velar fricative. The reason for providing a specific option for searching word endings rather than beginnings is that it is known from the historical phonology of Dutch that the initial segments of words have been more stable than the final ones (Goossens 1974); one can thus expect more variation towards the end of the word. Another search option inspired by historical phonology is that one can search for words which historically belong to the same ‘vowel class’ (van Bree 1987). Finally, one can also search for speaker properties, such as their town or province of origin, their age, and their gender. It is also possible to combine various search dimensions and thus look for all instances of brood containing an [y] and spoken by men (there are 21 such items, almost all of them in East Flanders).

The GTRP database is not finished yet. As has been noted above, the audio files are currently being added to the database. At the time of writing (2012), the Institute is trying to integrate the database with (at least) the database for the Syntactic Atlas of Dutch Dialects (SAND) and possibly with other dialectological databases. Another idea, obviously, is to have the remaining approximately 100 recordings transcribed and

550   Marc van Oostendorp added to the database; but so far we have not been successful in finding money for this project.

29.3  Soundbites

A second valuable tool for (phonetic) research into Dutch dialects is the so-called Soundbites project, available at http://www.meertens.knaw.nl/soundbites/. (Again, the interface is in Dutch, but will be fairly transparent to the foreign user with some simple translation tools.) This website contains sound files with, in total, approximately 1,000 hours of spontaneous conversation in a variety of Dutch dialects (mostly from the Netherlands) spoken in the second half of the twentieth century. In the 1950s, researchers from the Meertens Institute started to more or less systematically record such spontaneous conversations from all over the language area on audiotape. (It should be noted that the definition of a spontaneous conversation was rather loose, and the dataset also contains monologues and interviews with the researcher.) They continued to record materials until some point in the 1980s, when it was decided that a representative coverage of the language area had been attained.

The recordings were (and are) stored in a climatized room at the Meertens Institute, but around the turn of the century the Institute digitized all of its recordings, including those of this project (which never had an official name, as far as I know, but was commonly referred to within the Institute as banden vrij gesprek, ‘tapes [of] free conversation’). The digital copies were stored on CDs, which were then also put into a climatized room. In principle, researchers from outside the Institute could ask for copies of these recordings, but this has seldom happened. As a matter of fact, it was rather difficult, even for researchers within the Institute, to find out which recordings existed for which town. This changed in 2009, when money was obtained for putting the contents of all CDs on a single server, which was made accessible to the outside world through the Internet.

The result is a website which displays all the material with a simple interface. One can search either in an alphabetic list of names of provinces and place names or visually on a two-dimensional map of the Netherlands, with red dots denoting all locations where dialect recordings can be found. Because of the latter interface, the website is also referred to as the ‘Speaking Map’. The recordings themselves can be listened to on the website or downloaded, and are in many cases provided with metadata, although the quality and quantity of those metadata vary rather wildly from one example to the next. In any case, some of these metadata—age, town of origin, gender, year of recording—can also be searched for. Furthermore, the website shows a number of recordings from Flanders, from other regions in the world such as Brazil and Indonesia (these are either about Dutch emigrants or about colonial varieties), as well as a category of ‘other’ material, which contains e.g. recordings of radio programmes. It goes without saying that the Soundbites material is potentially relevant not just for phonological and phonetic research. It could also be used to study other levels of


linguistic structure, but its contents are also potentially interesting, for example, to those studying oral history. At present, the site only presents raw material. The Meertens Institute has a set of transcriptions covering about 10 per cent of the recordings; however, these transcriptions at present exist only in a paper version. Ideally, we would also release all of these transcriptions in a digital version, preferably in some way aligned with the sound material.

29.4  Desiderata

In addition to the databases we have already put online, the Meertens Institute still has a lot of material which it may try to publish on the Internet during the next few years.

First, more or less from the beginning until 2008, the Institute sent out a yearly written questionnaire to a group of informants, asking them about all kinds of details in their dialects, including in many cases information about phonological topics. These questions were always formulated by researchers from the Meertens Institute. The answers are currently being digitized, and will be put online in the course of the next few years. (During the last few years the written questionnaire has been replaced by an Internet forum with the same function.)

Secondly, the Institute hosts a large set of dialect grammars—books published by both professional and lay linguists describing fragments of the structure of individual dialects. There are plans for putting these online as well, possibly in a Wikipedia format, so that they can be edited and adjusted by readers.

Thirdly, the Institute has started experimenting with types of data gathering related to social networks. In particular, it has started a website, Meldpunt Taal (‘Reporting Language’), together with a consortium of other Dutch language-related institutions, on which language users can report on any development in the Dutch language which they consider relevant. Data mining techniques will be applied to study the data from this project in more detail.

Another concrete step we are now taking is developing tools to make the geographical aspects of these data more directly relevant for research. Broadly speaking, the research focus at the Meertens Institute is on formal (‘generative’) grammar and on sociolinguistics. Although both disciplines are interested in linguistic (micro-)variation, neither of them has been particularly interested in the way in which linguistic phenomena are spread out on the map. Still, such geographic patterns may sometimes be taken as evidence, e.g. when a certain linguistic phenomenon A only occurs in areas which also have phenomenon B (which may show that the two phenomena are not completely independent). In 2012 a research project started, involving several of the researchers, to explore such issues in more detail.

CHAPTER 30

THE VALIBEL SPEECH DATABASE

ANNE CATHERINE SIMON, MICHEL FRANCARD, AND PHILIPPE HAMBYE

30.1  Introduction

The ‘corpus’ created by the Centre de recherche Valibel—Discourse & Variation—is one of the largest speech banks of spoken French. As will be explained below, this speech bank is not a homogeneous corpus, tailored for a single purpose, but rather a compilation of corpora, collected with a wide range of linguistic applications in mind and integrated into a database allowing for various kinds of investigation. In this contribution, we give a thorough description of this database, with special attention to the features that are of interest for research in phonology.

30.2  VALIBEL: From Corpora to Database

The VALIBEL database1 was initiated in 1989 by linguists at the Catholic University of Louvain (Louvain-la-Neuve, Belgium). The first aim was not to build up a reference corpus of spoken French, but to collect data in order to learn more about the varieties of French spoken in Belgium, using sociolinguistic methods. In this section, we explain how the continuing gathering of data for various research projects finally resulted in the creation of a large and controlled database.

1  The acronym VALIBEL stands for Variétés linguistiques du français en Belgique (linguistic varieties of French in Belgium): www.uclouvain.be/valibel.


30.2.1  Spoken Data Collection for Specific Purposes

During the period 1989–2002, researchers’ efforts were mainly devoted to investigating: (i) mental representations and attitudes towards language, e.g. linguistic insecurity (Francard et al. 1993–1994); and (ii) linguistic variation in spoken French in Belgium (Francard 1995; Francard et al. 2002). The data consisted of sociolinguistic interviews about attitudes towards French in Belgium (with teachers, journalists, students, politicians, and businessmen). On the other hand, the need for natural data was crucial in order to propose accurate descriptions of the varieties of spoken French in Belgium and to overcome the normative attitude that had prevailed in earlier linguistic studies.2 The emphasis was placed on the geographic and social diversity of the language varieties collected, in order to take into account the fact that Belgian French is the Oïl variety most exposed to regional languages (Walloon, Flemish), especially in working-class circles (Francard 2009: 107). At the same time, VALIBEL hosted more specific studies, and students or young researchers gathered a variety of small corpora.3

More recently (2002–09), new types of data were added to the available set of corpora: we collected data within secondary schools following ethnographic fieldwork methods; we recorded a group of speakers in both formal (at their workplace) and informal (at home) situations; and we participated in the PFC (Phonologie du français contemporain: Durand et al., this volume) and CIEL-F (Corpus international écologique de la langue française: Gadet et al. 2012) projects, which aim to cover all Francophone areas and to collect speech samples from various situations (see Dister et al. 2009). We took advantage of all this empirical work to build a large database of spoken French containing recordings of speakers from different regions, age groups, education levels, and social status, thereby contributing to the illustration of French variation in Belgium.

30.2.2  Organization of the Corpora within a VALIBEL Speech Bank

What makes the difference between a corpus and a speech bank? A corpus, according to Sinclair (1996: 4), is ‘a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language’.

Some Belgian scholars used to hunt down dialectal features (vocabulary, accent) and to recommend the exclusive use of ‘Parisian standard’ French. See e.g. the so-called chasses aux belgicismes (lit. ‘Belgian-French word hunting’; Hanse et al. 1971) or the lists of Ne dites pas . . . mais dites . . . (Don’t say . . . say instead . . .). 3  e.g. collections of data for studying French liaison, code switching, or regional vocabulary.

In this sense, the VALIBEL speech bank is neither a reference nor a balanced corpus, although it is intended to represent the contemporary Belgian varieties of spoken French. Biber (1993: 243) says that representativeness refers to the extent to which a corpus includes the full range of variability in a language (or language variety). The VALIBEL speech bank is therefore representative, since it contains speech produced by speakers coming from different parts of Wallonia and Brussels, and belonging to nearly all ‘social profiles’ to be found in Belgium,4 in a wide range of situations (elicited speech, formal interview, informal interview, conversation, media, education, and workplace interaction).5 Representativeness also comes from the continuous updating of the speech bank with new data. VALIBEL is not a finite collection of texts but is continuously enriched with new recordings—as is a monitor corpus. This practice encourages the reuse of data prepared for other purposes (Sinclair 1996: 172).

The VALIBEL speech bank hosts 44 corpora, collected from 1988 to 2008. A corpus comprises from 1 to 118 recordings (20 recordings on average). With more than 900 audio recordings representing 900 speakers, and amounting to approximately 700 hours and 5 million words of transcription, VALIBEL provides a rich and complex picture of the French spoken in Belgium over the last twenty years.

30.2.3  The [moca] Database

Once a corpus has been exploited for a specific study or project, its insertion into the speech bank allows the data to be reused for other purposes. However, the interest of carrying out new research on the basis of previously constituted corpora depends heavily on the possibility of knowing precisely what these corpora represent, namely what the sociolinguistic profile of the speakers is, which type of interaction they are engaged in, under what conditions the recordings were made, etc. So it is crucial for researchers to have these metadata available, to use them to select the type of data in which they are interested, and thereafter to have easy access to the recordings and transcriptions needed. This is why the [moca] software was designed by a group of research teams.6 This web-based interface is designed for archiving, indexing, and glancing through a large amount of sound, text, and metadata (Table 30.1). Using [moca] allows the possibility of creating a specific corpus (a collection of recordings) and exporting the selected sound files and transcriptions. One can create a balanced

4 

Speakers are from 18 to 85 years old, and have various levels of education. We do not have recordings with children, nor recordings with Belgian speakers whose mother tongue is not French.
5  Due to the heterogeneity of labels, one recording may be indexed for more than one category.
6  Originally created by Peter Gilles under the name of ProsoDB, the [moca] database system was jointly developed by colleagues from three universities: Freiburg (Daniel Alcón, Peter Auer, Oliver Ehmer, Stefan Pfaender), Louvain (Anne Catherine Simon, Isabelle Lecroart), and Luxembourg (Peter Gilles).

The VALIBEL Speech Database  

555

Table 30.1  Metadata available for creating a specific corpus within the speech bank

Information about . . .   Examples of available information
The speaker               Age, gender, education level, marital status, place of birth and locations, spoken languages, languages spoken by the parents, etc.
The situation             Formality degree, number of participants and relationships between the participants, location and situational context (professional, scholar, private), etc.
The recording             Speech ‘genre’ (classroom interaction, interview, elicited data, conversation, etc.), length, sound quality, etc.
The corpus                Name and details of the researcher who conducted the data collection, purpose of the study, material used for recording, etc.
The annotations           Types of transcription available (none, orthographic, phonetic, POS-annotated, etc.) and conditions for reusing the data (authorization given or not by the speakers)

corpus intended for a specific study. Thanks to an exhaustive indexation of the metadata, sound files and transcriptions can be used for multiple research purposes, although they were collected with a specific range of applications in mind.
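This kind of metadata-driven selection can be pictured with a small sketch. The fields, values, and recording identifiers below are invented and do not reflect the actual [moca] data model; the sketch only illustrates how indexed metadata make it possible to assemble a sub-corpus matching chosen criteria.

```python
# Sketch of metadata-driven sub-corpus selection of the kind [moca] provides.
# Recordings are represented as dictionaries; all fields and values are invented.
recordings = [
    {"id": "rec001", "speaker_age": 34, "region": "Brussels", "genre": "interview"},
    {"id": "rec002", "speaker_age": 67, "region": "Wallonia", "genre": "conversation"},
    {"id": "rec003", "speaker_age": 52, "region": "Wallonia", "genre": "interview"},
]

def select(recordings, **criteria):
    """Keep recordings whose metadata match every criterion exactly."""
    return [r for r in recordings
            if all(r.get(field) == value for field, value in criteria.items())]

subcorpus = select(recordings, region="Wallonia", genre="interview")
print([r["id"] for r in subcorpus])   # -> ['rec003']
```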

30.3  Transcription

Within the VALIBEL speech bank, transcription is carried out according to three main principles: the use of standard orthographic spelling for French; the importance of transcribing temporal phenomena linked to interaction (overlapping speech, pauses, interruptions, etc.); and the alignment of text to sound files.

From the beginning, the VALIBEL transcription system has advocated the use of standard orthographic spelling (Francard and Péronnet 1989; Dister and Simon 2007) and avoided any kind of trucages orthographiques (Blanche-Benveniste and Jeanjean 1987). We therefore refuse to adapt spelling to indicate a specific pronunciation.7 First, modified spelling seriously complicates the automatic processing of texts, like part-of-speech tagging or concordance indexing. Secondly, standard spelling avoids a stigmatization effect in the transcribed spoken language: using nonstandard spelling may suggest that speakers use a nonstandard variety of language.8 And finally, the use of

7  We use the same transcription qui a (‘who has’) both for the phonetic utterance [kija] and its colloquial variant [ka]; similarly, tu as (‘you have’) stands for [tya] (formal) and [ta] (informal). 8  See Dister and Simon (2007: 55–59) for a thorough discussion of that point.

Table 30.2  Transcription symbols within the VALIBEL transcription system

Temporal and sequential relationships
  /                                Short pause
  //                               Long pause
  (silence)                        Silence

Overlapping or simultaneous talk
  |-                               Point of overlap onset
  -|                               Point of overlap end



                                   Point of simultaneous conversations onset
  §|                               Point of simultaneous conversations end

Speech delivery
  cou/                             Cut-off or self-interruption within a unit
  cou/ -pure                       Cut-off followed by self-continuation
  ?                                Question with a declarative form and a rising intonation
  (cough), (laugh), (whispering)   Description of a communicative event (rather than transcription of it)

Others
  (x) (xx) (xxx)                   Inaudible or incomprehensible stretch of talk (one syllable, various syllables or more)
  {choice1, choice2}               Alternative hearings of the same strip of talk (multi-transcriptions)
  {uncertain}                      Uncertainty on the transcriber’s part representing a likely possibility

adapted spelling increases the risk that the transcribers make mistakes, since it requires paying full and accurate attention to every pronunciation detail.

Punctuation marks (like the comma or the question mark) are not used within the transcription process. According to Blanche-Benveniste and Jeanjean (1987: 139, our translation), ‘punctuation marks, when they are used too early, suggest a syntactic analysis beforehand and impose a segmentation which it is hard to review afterwards’. In addition, the written-language logic that goes hand in hand with the punctuation system would prevent transcriptions from rendering the specific organization of spoken, interactive language.

VALIBEL conventions have been designed so that specificities of interactive speech, like overlap or repair procedures, can be expressed in sufficient detail (see Table 30.2). As far as interactivity and speech delivery are concerned, our transcription system is not as sophisticated as Gail Jefferson’s system used in conversation analysis.9

9  See ICOR (2006) for an adaptation of Jefferson’s system to French.


Nevertheless, it offers a compromise between readability (for the linguist as well as for the non-specialist reader) and accuracy, that is, fidelity to the communicative event transposed into a written text.

30.4  Annotations

Orthographic transcription is the first step in studying speech. In this section, we first explain how the transcription can be used for corpus phonology (30.4.1). We then present the semi-automatic annotation tools we use for phonological, phonetic, and prosodic investigation (30.4.2). We finally present a tool available in [moca] for quick annotation (30.4.3).

30.4.1  Searching the Sound Through the Written Text

Our choice of orthographic spelling has been reinforced by the fact that we transcribe speech within a software environment that aligns the written text to the sound.10 The well-known Praat software (see Boersma, this volume), originally devoted to phonetics, is also used for creating annotation files with multiple annotation tiers (TextGrid files). On the one hand, transcription within Praat increases the precision of the annotation of the temporal organization of talk (start and end points of overlapping speech are better detected, the length of pauses can be measured, etc.). On the other hand, alignment allows the user to simultaneously read the text and listen to the sound.11

Orthographic transcription, when aligned to the sound file, is also a first step within the annotation process. Corpus phonology can profit from orthographic alignment since it allows exploratory research on pronunciation. One can easily retrieve all the occurrences of a word and listen to each speaker’s pronunciation; other specific phenomena, like the pronunciation of liquids in French or the vanishing of the nasal vowel [œ̃] in contemporary French, can also be retrieved from the orthographic spelling. As is shown in Figure 30.1, the [moca] interface allows us to search the transcriptions of any collection of corpora, and to listen directly to the associated sound in order to check the speaker’s pronunciation (which can then be labelled according to personal categories: see section 30.4.3 below). The use of [moca] thus constitutes an entirely new way of accessing the linguistic data, which considerably changes what can be expected from a written transcription. The written text is no longer the only material available: it first serves as a track for ‘navigating’ within the sound data. The [moca] software provides the user with the basic functionalities of any concordancer.
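The kind of exploratory search made possible by text-to-sound alignment can be sketched in a few lines. In the sketch below, the aligned word tier is reduced to a list of timed intervals, and the words and time stamps are invented; in practice one would read the intervals from the project’s TextGrid files.

```python
# Sketch of word retrieval over an aligned orthographic tier: every occurrence of a
# word is returned with its time stamps, so the corresponding sound can be played
# back and the pronunciation checked. The tier content below is invented.
word_tier = [
    (0.31, 0.58, "brun"),
    (0.58, 0.90, "foncé"),
    (4.12, 4.40, "brun"),
]

def occurrences(word_tier, target):
    return [(start, end) for start, end, word in word_tier if word == target]

for start, end in occurrences(word_tier, "brun"):
    print(f"'brun' from {start:.2f} s to {end:.2f} s")
```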

About 70 recordings are aligned to the orthographic transcription. Transcription is also available in a raw text format, generated from TextGrids using the software Transformer. 11 


FIGURE 30.1 Fragments of transcription and sound displayed in [moca] as a result of a search for the sequence  ‘brun’.

This leads us to the next point of this contribution, which explains how further annotation levels are built up from the orthographic transcription, and how these can be used for corpus phonology.

30.4.2  Semi-Automatic Annotations

Phonetic and prosodic annotation techniques range from manual (qualitative) mark-up (Grabe et al. 2000) to fully automatic stylization and coding of the acoustic signal (Hirst and Di Cristo 1998). This kind of annotation requires a fine-grained alignment of the sound signal to a phonetic transcription. We use the EasyAlign software (Goldman 2007), which takes the orthographic transcription as a point of departure for generating a word-by-word phonetic transcription. This phonetic transcription is manually corrected (for elision, liaison, vowel reduction, devoicing, etc.) and then automatically aligned to the sound signal. A further step creates a new tier with the syllabic segmentation. This results in the multiple-tier annotation illustrated in Figure 30.2 (syntactic annotation is presented below).

Phonetic alignment helps in processing the acoustic signal itself. A wide range of tools are available nowadays for analysing prosody. For example, Prosogram (Mertens 2004b) is a tool for stylizing intonation and providing a visual representation of duration, intensity, and fundamental frequency. With the help of other Praat scripts, one can automatically detect prominent syllables (Goldman et al. 2007), which may correspond to initial and final accents (Simon et al. 2008). This processing results in a rich prosodic analysis (see Figure 30.3). The information displayed in the Prosogram allows quantitative measurements and comparisons between recordings. We combine prosodic annotation with morphosyntactic annotation in order to capture complex phonological phenomena such as ‘accentuation’. For example, the prominence on the word attention is interpreted as a final accent once we have the information that the prominence affects the last syllable of a noun.
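The cross-tier reasoning just described can be illustrated with a short sketch. The representation of the syllable and word tiers, the time values, and the restriction to nouns are simplifying assumptions made only for this illustration; the actual annotation is carried out on Praat TextGrids with the tiers shown in Figure 30.2.

```python
# Sketch of the combination of prosodic and morphosyntactic tiers described above:
# a detected prominence counts as a final accent when it falls on the last syllable
# of a word of the relevant category (here, for simplicity, a noun).
# All data structures and values are invented for illustration.
syllables = [  # (start, end, prominent)
    (0.00, 0.18, False),
    (0.18, 0.35, False),
    (0.35, 0.62, True),
]
words = [  # (start, end, word, pos)
    (0.00, 0.62, "attention", "Noun"),
]

def final_accents(syllables, words):
    accents = []
    for w_start, w_end, word, pos in words:
        in_word = [s for s in syllables if w_start <= s[0] and s[1] <= w_end]
        if in_word and in_word[-1][2] and pos == "Noun":
            accents.append((word, in_word[-1][0], in_word[-1][1]))
    return accents

print(final_accents(syllables, words))   # -> [('attention', 0.35, 0.62)]
```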

The VALIBEL Speech Database  

559

FIGURE 30.2  TextGrid with multi-level annotation:  PHONES (sounds labeled with SAMPA symbols12), SYLL (syllables), WORDS, GRA (POS-tags), NBSYLL (number of syllables within the word), PREPOST (position within a grammatical  chunk).

FIGURE 30.3  Prosogram (Mertens 2004b) enriched with detection of prominent syllables (plotted in red) and a prosodic report for each unit (f0 = mean fundamental frequency in semitones; Δf0 = mean f0 range; déb = speech rate, in number of syllables per second; P = proportion of prominent syllables).

Morphosyntactic annotation, also referred to as part-of-speech (POS) tagging, means assigning a part-of-speech tag to each word in a corpus.13 It forms the basis for further forms of analysis. This annotation can of course be of great interest for the analysis of phonological phenomena partly ruled by morphosyntactic constraints (e.g. French liaison; see Durand and Lyche 2008).

12  SAMPA is a computer-readable phonetic alphabet (http://www.phon.ucl.ac.uk/home/sampa/).
13  Tags are as follows: Adjective, Adverb, Coordination/Subordination Conjunction, Determiner, Interjection, Noun, Proper Name, Prefix, Preposition, Pronoun, Verb.


FIGURE 30.4  Screenshot from the VALIBEL [moca] database: each transcription line is annotated with one or several ‘labels’ (illustrated in the last column).

POS tagging has been carried out by Dister (2007) on a subcorpus amounting to about 440,000 words.14 The subcorpus has been tagged using the Unitex software (Paumier 2006). A similar annotation has been carried out on a smaller corpus15 (13,000 words), using another procedure (Beaufort et al. 2002; Roekhaut et al. 2010). The resulting tagging is exported within a TextGrid and is aligned to the sound (see Figure 30.2, above).

30.4.3  Manual Labelling for Exploratory Purposes

The syntactic and prosodic annotations presented so far rely on the extensive coding of a recording. This annotation process, in spite of being semi-automatic, is extremely time-consuming. Imagine one wants to scan a corpus for particular examples of a not-yet-described phenomenon: a full phonetic and morphosyntactic annotation would be too heavy for that purpose. The [moca] software, used for archiving our data, allows the annotation of sporadic phenomena spread over large quantities of recordings. One has to listen to a recording ‘line by line’ and annotate each line using a specific set of labels created by the users. The advantage of this labelling system is that it does not require any prior annotation.16 One can glance over a transcription, mark off interesting phenomena, and index them by the use of ‘labels’. Labels can be subsequently modified, merged, or subdivided. The inconvenience of the system lies in the fact that each line of transcription (as it has been segmented during the transcription process) is the domain for attributing labels. A search form is then used to retrieve the collection of labelled fragments and export them into a spreadsheet for further analysis or counting. The corresponding sound fragments can be exported as well. For example, Simon and Degand (2007) retrieved from

14  60 recordings, 40 hours of speech. This annotation requires a complex pre-processing of the text, for identifying non-analysable units (hesitations, repetitions, interruptions, etc.) which are frequent in spoken speech.
15  The C-PROM corpus, partly containing data from the VALIBEL speech bank (see Simon et al. 2008).
16  The [moca] system even allows the user to directly assign labels to sound fragments, without any orthographic transcription.


the database a large collection of utterances combined by means of the connectors car and parce que. Each occurrence was annotated pragmatically (using a scale of subjectivity for describing the discourse relation) and prosodically (indicating the degree of prosodic cohesion between the two utterances). As a result, specific prosodic profiles could be linked to a range of causal relations signalled by both connectors.

30.5  Developments to Come

The VALIBEL speech bank today is the result of twenty years of data collection. New developments introduced during the last few years aimed to offer the possibility of searching and exploiting these data within a real database. In the years to come, this database will continue to be enriched by new recordings, while new annotations will be added.

As explained before, the options which have prevailed so far (transcription principles, etc.) and the tools we have set up all follow the objective of making our corpora available for a large range of research in a variety of fields in linguistics. This advantage has of course a price, since it means that the data are less suited to specific purposes than they could be if all our development policy were directed towards a single goal. It is clear, for instance, that the VALIBEL corpora do not respect all the guidelines that should orient the tailoring of corpora for phonological studies. They constitute, however, a rich resource for exploring many linguistic phenomena, including those related to phonological questions.

Providing access to this linguistic database is thus both a necessity and a technical challenge. In the early days of the VALIBEL database, this question was solved through direct contact between the Centre and researchers interested in spoken French in Belgium. Most of the time, it resulted in the sharing of speech samples or other data strictly restricted to scientific (and personal) use. But the substantial increase in this kind of request makes it impossible to go further in that direction, and requires a more professional approach. In the near future, we intend to improve the accessibility of the database in two ways: first, by providing online access to the VALIBEL recordings that are aligned to their orthographic transcription; secondly, by putting on the market a CD-ROM containing a representative sample of the database (with recordings, transcriptions, and metadata). In both cases, enlarging the accessibility of our data involves specific measures to protect the confidentiality of the speakers, and requires checking the audio quality of the recordings to be selected, reviewing the transcriptions in search of potential errors, etc. Since this work is time-consuming, it always depends on the financial resources available. This is why it is crucial for the development of the VALIBEL database to benefit from collaborative projects like the one which gave birth to the [moca] interface or the PFC project. The future of our database, and certainly also the future of corpus linguistics and corpus phonology, lies in the sharing of experience, tools, and data, allowing us to construct collectively what we cannot achieve individually.

CHAPTER 31

PROSODY AND DISCOURSE IN THE AUSTRALIAN MAP TASK CORPUS

JANET FLETCHER AND LESLEY STIRLING

31.1  Introduction

The Australian Map Task is part of the Australian National Database of Spoken Language (ANDOSL), which was collected in the 1990s for use in general speech science and speech technology research in Australia (Vonwiller et al. 1995). The Australian Map Task is closely modelled on the HCRC1 Map Task, which was designed by a team of British researchers to elicit spoken interaction typical of everyday talk in a controlled laboratory environment (Anderson et al. 1991). Versions of this task have been used successfully to develop or test models of intonation and prosody in various languages, including Dutch (e.g. Caspers 2003), German (Grice et al. 2005), Italian (Grice and Savino 2003), Danish (Grønnum 2008), Mandarin (Tseng 2008), Japanese (Koiso et al. 1998), and varieties of English including Glasgow English (e.g. Mayo et al. 1997) and several other varieties of the British Isles (see Nolan and Post, this volume), New Zealand English (Warren 2005b), General American English (e.g. the MIT American English map task, http://dspace.mit.edu/handle/1721.1/32533), and Australian English (Fletcher and Harrington 1996; 2001).

The Map Task is an exercise in controlled quasi-spontaneous talk between two participants, and therefore constitutes a rich resource for the investigation of the interaction between intonation and spoken discourse. The advantage of the Map Task is that

1  This version of the map task is named after the Human Communication Research Centre, an interdisciplinary research centre based at the Universities of Edinburgh and Glasgow where the task was developed in the 1990s.


The advantage of the Map Task is that it mediates between uncontrolled spontaneous conversational speech on the one hand and, on the other, the rigorously scripted read material that is regularly used in traditional experimental phonetics paradigms (see Harrington 2010a for an overview). From a discourse perspective also, instruments like the Map Task are useful for investigating the ways in which interactive discourse is structured, because of the way the task is constructed and the ways in which the knowledge states of participants are controlled and managed. The structure of the Map Task can be summarized as follows: participants work in pairs, each with a map in front of them that the other participant cannot see. The maps contain a number of landmarks, which are not all identical (see Figure 31.1 for an example of a pair of maps used in the Australian Map Task).

FIGURE 31.1  Two maps mismatched for certain landmarks that are used by participants in the Australian Map Task.

One participant (the 'instruction-giver' IG) has a route marked on his/her map and is required by the task to instruct the other

(the ‘instruction-follower’ IF) in drawing the correct route onto their own map. The mismatching landmarks on each map ensure that a range of queries, checks, and negotiation talk will be elicited. This is one reason why it is a more effective tool to examine intonational variation than read speech, given the well-known relationship of intonation to utterance modality and discourse segmentation (see Himmelmann and Ladd 2008 for an overview). The Map Task is arguably a more controlled source of intonational variation than spontaneous discourse within a constrained timeframe. The analyst may need to analyse a much longer stretch of spontaneous conversational discourse than a typical map task interaction in order to yield the range of intonational variation that is typically observed in the latter due to the structure of the task. Ideally, spontaneous conversational discourse should also be analysed where possible to ensure that intonational patterns and discourse/tune interaction observed in map task dialogues are ecologically valid for the language or language variety in question.


Another important advantage of corpora like the ANDOSL Australian Map Task is that they are studio-recorded, and so avoid the often poor quality of field recordings of conversational data (Grønnum 2008). This is particularly valuable if the researcher wishes to perform an acoustic phonetic analysis of F0 and/or duration and intensity variation as part of an intonational study, for example. The sheer scale of map task corpora like the HCRC Map corpus and the ANDOSL Australian Map Task (and other large corpora) is an additional bonus. The ANDOSL corpus includes 216 dialogues from native speakers of Australian English of various ages who have been classified as belonging to one of the three main dialectal groupings for Australian English: broad, general, and educated/cultivated (Mitchell and Delbridge 1965).

It should also be made clear that many types of corpus have been used in large-scale analyses of intonation and prosody for many languages (see Nolan and Post, this volume, for a discussion of the IViE corpus, which, in addition to a map task, includes a story reading and subsequent re-telling as well as read sentences; Bruce 2005 for a discussion of Swedish intonational typology based on analyses of the SWEDIA 2000 spontaneous speech database; Gussenhoven 2004, which includes a description of intonational variation in Dutch dialects; Warren 2005b for a discussion of New Zealand English; and Kohler et al. 2005 for various papers based on the Kiel Corpus of German spontaneous speech). There are a number of prosody and intonation studies of American English based on data from the SWITCHBOARD corpus (Godfrey et al. 1992), the Boston Radio News corpus (Ostendorf et al. 1996, and see Dainora 2006), and the Boston Directions Corpus (e.g. Hirschberg and Nakatani 1996).

In the sections that follow, we present an overview of research on Australian English intonation and prosody and its relationship to different aspects of discourse structure based on the ANDOSL Map Task corpus. There are also other cooperative game tasks such as the SPOT game task (see Warren et al. 2000 for a full description), which are particularly useful for investigating specific issues to do with the relationship between prosodic and syntactic structure.

31.2  Prosodic Structure, Intonation, and Discourse Segmentation in Australian English

Australian English is typologically related to Southern British English (SBE) insofar as it is a non-rhotic variety and shares the same phonemic inventory (e.g. Mitchell and Delbridge 1965; Cox and Palethorpe 2007). Like its close neighbour, New Zealand English, it has a distinct accent, which is carried mainly by the phonetics of its vowel system. It has been widely assumed that the basic intonational inventory of Australian English is essentially the same as SBE. Halliday's (1967) (nuclear) tone model was particularly influential in the 1970s and 1980s, and was used in early sociolinguistic analyses of Australian English intonation (e.g. Guy and Vonwiller 1989).

Table 31.1  Tonal categories and their general pitch description, and major Break Indices from AuE ToBI (Australian English Tones and Break Indices), after Beckman et al. (2005)

Intonation events        Pitch description                                 AuE ToBI label
Pitch Accents            Simple High                                       H*
                         Simple Low                                        L*
                         Rising                                            L+H*
                         'Scooped'                                         L*+H
                         Downstepped High                                  !H*
                         Downstepped Rising                                L+!H*
                         Downstepped 'scooped'                             L*+!H
                         Downstepped High from preceding H tone            H+!H*
Phrase Accents           High                                              H-
                         Low                                               L-
                         Downstepped High-Mid                              !H-
Boundary Tones           High                                              H%
                         Low                                               L%
Additional Pitch labels  Highest pitch value for intermediate phrase       HiF0
                         (excluding phrase accents or boundary tones)

Break Indices            Prosodic Structure
                         Word                                              BI1
                         Intermediate Phrase                               BI3
                         Intonational Phrase                               BI4

Since the 1990s, research on Australian English intonation and prosody has largely been within the autosegmental-metrical (AM) framework, a term that is widely used these days to refer to tone sequence models (e.g. see Ladd 2008 for an overview). One version of an AM model, ToBI ('Tones and Break Indices'—see Beckman et al. 2005) has been adopted in laboratory phonology research on Australian English (e.g. Fletcher and Harrington 2001; Fletcher et al. 2002; Fletcher et al. 2005; Harrington et al. 2000; McGregor and Palethorpe 2008). As in other models within an AM framework, tunes are made up of a sequence of separate H (high) and L (low) tone targets or tone 'autosegments' that align with different parts of the prosodic structure of an utterance, i.e. to rhythmically prominent syllables at the lower levels of the prosodic hierarchy (i.e. intonational pitch accents), and to phrase edges at the higher levels of the hierarchy (i.e. phrase accents and boundary tones). Table 31.1 summarizes the major annotation categories used in ToBI analyses of Australian English intonation.

As in MAE ToBI (Mainstream American

English), two levels of intonational constituency are annotated: a minor intonational or intermediate phrase, and the intonational phrase. The Break Indices component of ToBI focuses more on the transcription of prosodic constituency. Criteria identical to those used in the analysis of Mainstream American English have been applied to prosodic analyses of Australian English. Briefly, index ‘1’ marks a word boundary, index ‘3’ marks an intermediate (intonational) phrase boundary (equivalent to the Phonological Phrase in Nespor and Vogel 2007), and ‘4’ marks a full intonational phrase. It is generally well accepted that it is important to have some independent model of discourse segmentation that will then allow a rigorous examination of the intersection between the intonational and discourse properties of an utterance (e.g. Pierrehumbert and Hirschberg 1990). ToBI (Tones and Break Indices) provides one framework for prosodic and intonational annotation of map task corpora (and other kinds of speech materials), and there is a range of annotation systems for transcribing discourse segments in dialogues (e.g. see Cooper et al. 1999; Traum 2000). Identification and coding of discourse structure and function varies according to the theoretical perspective of the coder but, more importantly, is a function of the genre of discourse being segmented (e.g. monologic narrative or expository text, compared with naturalistic ‘everyday’ conversation and other kinds of dialogic interaction). For example, both the intentional structure of the discourse and the effect of ‘recipient design’ factors influence monologic discourse, whether pre-scripted or not, differently from the way they operate within dynamically unfolding dialogic interaction. For this reason, different analytical categories have been found to apply in different genres (see also Fox 1987). The two utterance-tagging schema used in our Map Task research were designed specifically to analyse dialogic discourse (see below). However, in addition to sequential aspects of discourse including the patterned contribution of turns at talk, it is normally acknowledged that most discourse exhibits a hierarchic structure, leading to the possibility of considering both smaller units (‘micro-structure’) and larger (‘macro-structure’). A number of studies have examined the interaction between higher-level prosodic structure and discourse segmentation at different levels. Many studies have focused on the sequential and micro-structural (e.g. Hirschberg and Nakatani 1996; Swerts 1997; Grice et al. 1995), or have sought to use various acoustic parameters associated with prosodic structure such as duration, F0, pause length, and speaking rate to automatically classify utterances in terms of their discourse function e.g. as specific kinds of dialogue acts (e.g. Shriberg et al. 1998). In the study reported in Stirling et al. (2001), two major dialogue-coding systems were applied to the same corpus of map task dialogues, which were also annotated for Break Indices 3 and 4 (i.e. intermediate and intonational phrases, respectively). By applying the discourse coding systems to the same dialogues, we were able to evaluate them against one another both in terms of the kinds of information they were able to capture and in terms of their correlation with prosodic phrase boundaries based on the Break Index annotation. The two dialogue act coding schemes used were the HCRC map task coding scheme (Carletta et al. 
1997; Isard and Carletta 1995) and the ‘Switchboard’ DAMSL (Dialog Act Markup in Several Layers) scheme (after Allen and Core 1997; Jurafsky et al. 1997; Cooper et al. 1999).
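To make the relationship between the two tagging schemes concrete, the sketch below represents a small subset of the correspondences (those summarized below in Table 31.2) as a simple lookup table and uses it to re-express HCRC moves as broad DAMSL-style tags. The subset of mappings, the tag spellings, and the toy move sequence are illustrative assumptions only, not the hand-coding procedure actually used in Stirling et al. (2001).

# Sketch: a partial HCRC-move -> SWBD-DAMSL mapping (cf. Table 31.2),
# applied to a toy sequence of moves. Real coding was done manually;
# this only illustrates how the two schemes can be lined up for comparison.
HCRC_TO_DAMSL = {
    "Instruct": ["ad"],          # action-directive
    "Explain": ["sd", "sv"],     # statement non-opinion / opinion
    "Query-yn": ["qy"],          # yes-no question
    "Reply-y": ["ny"],           # yes answer
    "Acknowledge": ["b"],        # acknowledgement / backchannel
}

dialogue = ["Instruct", "Query-yn", "Reply-y", "Acknowledge"]

for move in dialogue:
    damsl = "/".join(HCRC_TO_DAMSL.get(move, ["?"]))
    print(f"{move:12s} -> {damsl}")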

In the HCRC system, dialogue moves are grouped into two larger categories, 'initiations' and 'responses', and in the DAMSL scheme the broad categories are equivalently 'forward-looking functions'—i.e. dialogue acts that influence the upcoming discourse or how the current utterance 'constrains the beliefs and actions of the participants, and affects the discourse' (Allen and Core 1997: 4)—and 'backward-looking'—i.e. how the current utterance 'relates to the previous discourse' (Allen and Core 1997: 4). The HCRC scheme was specifically developed to analyse Map Task interactions, whereas the SWBD-DAMSL scheme augments the original Allen and Core (1997) DAMSL set of dialogue act annotation tags for the Switchboard corpus of telephone interactions (Godfrey et al. 1992; Jurafsky et al. 1997). The hypotheses behind both annotation schemes are similar, in that their ultimate aim is to capture the relevant discourse goals at a number of levels, in order to support automatic speech recognition and spoken dialogue systems. While both schemes have been used to investigate dialogue correlations with prosody and intonation, SWBD-DAMSL in particular was designed with such research goals in mind.

The results of the comparison of the two utterance tagging schemes are summarized in Table 31.2. As a general rule, the DAMSL system offers a wider range of labels to choose from, as well as a finer granularity in applying labels to the dialogue. In certain cases there was almost complete correlation between the major broad categories such as HCRC Instruct and DAMSL action-directive, HCRC Explain and DAMSL statement, the question categories of various kinds, and the answer categories of various kinds. The granularity of the DAMSL scheme is clearly shown when we compare the corresponding DAMSL tags to the HCRC Explain and Instruct codes. For example, the Explain code can map onto either sd (statement non-opinion) or sv (statement opinion) dialogue acts. Conversely, a few HCRC categories are specifically designed to capture characteristics of instructional dialogue. For example, the HCRC category Align forms a coherent functional category in HCRC (questions which check the interlocutor's attention, agreement, or readiness for the next move), but is not distinguished from other yes/no-questions and tags in DAMSL. The HCRC categories Check and Clarify are also not represented in the DAMSL coding used in Stirling et al. (2001). As shown in Table 31.2, the DAMSL tags that correspond to Check, for example, also overlap with the corresponding tags for Align. Also, what is not really shown in Table 31.2 is that Instruct and Explain HCRC moves corresponded to larger discourse chunks than did DAMSL coding for the same sequences of the dialogue. Stirling et al. (2001) concluded that the finer granularity of the DAMSL scheme for shorter stretches of spoken discourse afforded a finer micro-level analysis of the Map Task than the HCRC moves.

In the Stirling et al. (2001) study the analysis of the correspondence between higher-level prosodic structure (intermediate and intonational phrase boundaries) and dialogue acts showed that the latter usually coincided with intonational boundaries, with matches of 88 per cent for HCRC moves and 84 per cent for DAMSL dialogue acts. Differences were most likely due to the mismatches in the way the two discourse annotation schemes deal with certain dialogue acts or moves.
For example, the higher match level for HCRC was most likely due to the greater correspondence of HCRC reply options to intonational phrase boundaries in our corpus.

Table 31.2  Some of the mappings between HCRC and SWBD-DAMSL initiating and response categories, based on Stirling et al. (2001)

HCRC                   SWBD-DAMSL
Preparatory move
  Ready
Initiating moves       Forward-looking functions
  Explain              Statement: sd statement non-opinion; sv statement opinion
  Query-yn             Information-request: qy yes-no question; ^d declarative question; ^g tag question
  Check                qy yes-no question; ^d declarative question; ^g tag question; bf summarize/reformulate
  Align                qy yes-no question; ^d declarative question; ^g tag question
  Query-w              qw wh-question; qo open question; qr or-question
  Instruct             Influencing addressee future action: ad Action-directive; oo Open-option
                       Conventional (mostly forwards): fp conventional opening; fc conventional closing; ft thanking; fw you're welcome; fa apology
Response moves         Backward-looking functions
  Reply-y              Answer: ny yes answers; na affirmative non-yes answers; ny^e yes plus expansion
  Reply-n              nn no answers; ng negative non-no answers; nn^e no plus expansion
  Reply-w              no other answers
  Clarify              –
                       nd dispreferred answers
                       sd^e statement expanding y/n answer
  Acknowledge          Signal-understanding: b acknowledge; bh backchannel in question form; bk acknowledge answer; bf summarize/reformulate; Agreement: aa accept; aap accept-part; am maybe; ^h hold
  Object               Agreement: ar reject; arp reject-part
                       br Signal-non-understanding
                       bc Correct-misspeaking

Nevertheless, the correspondences between Break Indices (i.e. Breaks 3 and 4) and the majority of dialogue act boundaries are in line with results from previous studies (e.g. Lehiste 1975; Nakatani et al. 1995; Grice et al. 1995; Swerts 1997; Shriberg et al. 1998). Major categories such as statements, questions, and instructions had the highest level of correspondence with intonational boundaries (BI3, BI4). The larger discourse segments corresponding to Explain or Instruct moves often included many BI4 (and therefore BI3) units. Pitch range reset (measured using the ToBI HiF0 label assigned to the highest pitch value associated with an intonational pitch accent in an intermediate phrase) was also correlated with discourse boundary strength. Pitch range reset was also a reasonable indicator of a new discourse event. Statement, question, and acknowledge categories were usually associated with a small but significant level of local pitch range reset, with initiation segments (i.e. where a new topic or 'landmark' was the focus of the interaction) being associated with an upward reset of pitch range and response segments with a downwards resetting of pitch range for the same speaker. However, only a small proportion of dialogue acts were preceded or followed by a silent pause, suggesting that other junctural phenomena such as presence of an intonational boundary (BI3 or BI4) or pitch range reset were more robust predictors of dialogue act boundaries in these map task dialogues. This finding differs from other studies based on monologues where pause duration was a good indicator of either discourse boundary strength (e.g. Swerts 1997) or discourse segment edges (Nakatani et al. 1995). However, silent pauses often coincided with speaker turn boundaries as well as dialogue act boundaries, in keeping with other studies (see also Shriberg et al. 1998; Dankovičová et al. 2004).
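Correspondence figures of the kind reported above (e.g. 88 per cent of HCRC move boundaries coinciding with an intonational boundary) can be computed with very little code once both annotation tiers are available as time stamps. The sketch below is a minimal illustration under assumed data structures: lists of boundary times in seconds, an arbitrary 20 ms tolerance, and invented toy values. It is not the analysis script used in the studies cited.

# Sketch: estimate how often dialogue-act boundaries coincide with
# intermediate/intonational phrase boundaries (BI3/BI4). Times are in
# seconds; the tolerance and the toy data are illustrative assumptions.
def boundary_match_rate(act_ends, break_times, tolerance=0.02):
    """Proportion of dialogue-act end times within `tolerance` of a break."""
    matched = sum(
        any(abs(t - b) <= tolerance for b in break_times) for t in act_ends
    )
    return matched / len(act_ends) if act_ends else 0.0

if __name__ == "__main__":
    # End times (s) of dialogue acts in one dialogue (toy values)
    act_ends = [1.82, 3.10, 4.55, 6.20, 7.95]
    # Times (s) of annotated BI3/BI4 boundaries (toy values)
    break_times = [1.81, 3.11, 4.20, 6.21, 7.50, 7.96]
    print(f"match rate: {boundary_match_rate(act_ends, break_times):.0%}")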


In summary, the combination of ToBI prosodic annotation with the version of DAMSL micro-level coding and HCRC utterance tags used in Stirling et  al. (2001) proved to be an effective tool with which to examine dialogue acts, discourse segmentation, and prosodic boundaries in a corpus setting. This methodology formed the basis of subsequent research exploring intonational tune and discourse segmentation that is summarized in the following section.

31.3  Intonational Categories and Discourse Structure in Australian English

31.3.1  Intonation and Discourse 'Micro-Structure'

While the summary in section 31.2 outlines how the Australian English Map Task has been used to compare different discourse segmentation schema and their correlation with aspects of prosodic segmentation, the ANDOSL Australian Map Task corpus has also been used in quantitative analyses of intonational categories in Australian English. One important implication of an AM intonational model like ToBI is that it is possible for tone targets to combine in different ways to create intonational contours or tunes that may (or may not) be categorically distinct in the language or language variety. While it is a concern that the intonational categories in one variety may not be identical in another (Beckman et al. 2005; Warren 2005a), it has been possible to annotate a range of tune-types in the ANDOSL Australian Map Task corpus that are relatively similar to those found in either Mainstream American English or Standard Southern British English. The ToBI model as applied to Australian English has afforded extra granularity in terms of annotation categories. For example, whereas some British School contour-based analyses of intonation only have two simple rises (low and high), at least three kinds of simple rise can be transcribed within ToBI: L* L–H% (low rise with low phrase accent, more or less equivalent to Halliday's Tone 3), L* H–H% (low rise, expanded range, with high phrase accent), and H* H–H% (high rise, equivalent to Halliday's Tone 2).

Figure 31.2 shows a summary of the distribution of the main annotated tunes from 20 dialogues from the Australian Map Task, including these three simple rises (based on analyses in Fletcher and Harrington 2001; Fletcher et al. 2002; Fletcher 2005; Fletcher and Loakes 2006). Around 43 per cent of intonational phrases terminate with L–L% boundaries (most with nuclear H* pitch accents), with the remaining 57 per cent comprising various kinds of non-falling tunes, including L–H%, H–L%, and H–H% tunes. The three simple rises, L* L–H%, H* H–H%, and L* H–H%, constitute 36 per cent of all tunes found in the 20-dialogue corpus analysed to date.

A further 8 per cent of tunes are H* L–H% (i.e. 'fall–rise') tunes, and a further 13 per cent are mid-level tunes (H* H–L%), often referred to as 'stylized rises' by Ladd (2008) for other varieties of English such as SBE and General American English. It has been argued elsewhere that this is a highly frequent tune in certain kinds of adolescent discourse in Australian English (e.g. Fletcher and Loakes 2006). The fact that rising intonation is more widespread in the Australian Map Task data analysed to date is not that surprising given the interactive nature of the task and the fact that in our earlier analyses of the corpus (e.g. Stirling et al. 2001) a large number of 'question' dialogue acts were also noted, which is typical of Map Task discourse (e.g. Grice et al. 2005). Much has also been written about the distinctive rising statement intonation used by many Australian English speakers in interactive discourse (e.g. see discussion in Fletcher et al. 2005). It is also relatively widespread in New Zealand (and variants are noted in Canadian, South African, and other varieties of English) and frequently draws public comment (e.g. Warren 2005b). The Australian Map Task corpus has proved to be an excellent tool with which to examine this kind of distinctive intonational pattern, given the wide variety of dialogue-act types that are typically noted in dialogic interaction (see above).

Regardless of the different theoretical implications that underpin different intonation models (e.g. see Gussenhoven 2004 and Ladd 2008 for good overviews of these differences), the relationship between intonational tune and discourse function can be explored effectively using the kind of methodology employed in Stirling et al. (2001). With this in mind, Fletcher et al. (2002) examined the correspondence between a range of intonational tunes, including the three broad rise-types, and DAMSL-annotated dialogue acts in a subset of the ToBI-labelled map task interactions. A broad set of HCRC codes were also compared (e.g. Query-yn, Check, Align, Instruct), but there was a high level of correspondence between the two sets for the particular utterance tags we were most interested in for the purposes of this study. Almost 97 per cent of rises that were labelled H* H–H% were associated with information requests (i.e. 'qy' labelled DAMSL dialogue acts, or Query-yn or Check HCRC moves listed in Table 31.2), whereas rises that were labelled L* L–H% corresponded to acknowledgment/answer and acceptance dialogue acts, or back channels. Few (3 per cent) H* H–H% tunes were associated with non-questioning dialogue acts. Rises labelled L* H–H% were more likely to conclude statement directives or instruction dialogue acts rather than information requests, although a proportion of these terminated Align HCRC moves. Interestingly, almost 80 per cent of L* H–H% tunes in our map task analyses could also be analysed as the second portion of a so-called split or compound fall–rise, which could also account for why relatively few L* H–H% tunes (compared to H* H–H% tunes) were associated with DAMSL 'qy' dialogue acts (e.g. Fletcher et al. 2005). A split fall–rise is where the final rise occurs on material later than the initial H* accent, as shown in (1):

(1)  'You go underneath the spruce trees.'
             H*             L*     H–H%


FIGURE 31.2  Distribution (%) of ToBI-annotated Intonational Phrase-final tunes from twelve Australian Map Task dialogues (based on Fletcher and Harrington 2001; Fletcher et al. 2002; Fletcher 2005). Tune categories shown: H* H-H%, L* H-H%, L* L-H%, H* L-H%, H* H-L%, H* L-L%, L+H* L-L%.

In this and many examples in our corpus, the final low pitch target on spruce was deemed by the corpus annotators to be sufficiently prominent to merit an L* pitch accent tag, even though the pitch target on 'underneath' sounds more prominent (in accordance with earlier analyses of the 'split fall–rise' by Halliday 1967, and also discussed in Grice et al. 2000). An alternative analysis would place a nuclear H* pitch accent on 'underneath' and a final L–H% boundary to account for the rising tune across spruce trees (the L- phrase tone in this case would be anchored to spruce, following recent arguments for secondary association of phrase accents as suggested in Grice et al. 2000). Regardless of the two different intonational analyses, very few of these 'compound' tunes were associated with information request dialogue acts. They were more likely to be correlated with the kinds of dialogue acts that were associated with simple fall–rise tunes (i.e. H* L–H%), which included non-questioning forward-looking dialogue acts like sd or ad. There were, however, some examples of simple fall–rise tunes that coincided with HCRC Check moves, particularly when speakers were in the 'Follower' role in the Map Task interaction. In the Australian Map Task interactions, the majority of these tunes seemed to be associated with the 'continuative' as well as 'encouraging' semantic nuances generally attached to fall–rises in, for example, Southern British English (e.g. see Cruttenden 1997).

It has been suggested elsewhere that the differences between L* H–H% and H* H–H% (contours summarized in Figure 31.2) may reflect variation in pitch span and/or pitch key of the rise (i.e. wide pitch span, high pitch key, respectively), and that the pitch accents in these contours (i.e. L* versus H*) may not necessarily be phonologically contrastive (see Ladd 2008 for more discussion on this topic). Nevertheless, the tunes clearly have categorically different dialogue act associations in our DAMSL- and HCRC-tagged Map Task corpus. In particular, the componential nature of the ToBI model together with the granularity of the DAMSL annotation schema has allowed us to capture important differences in the way different tone-target combinations are exploited in these interactive dialogues.
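Associations such as the link between H* H–H% rises and 'qy' information requests come from cross-tabulating each phrase-final tune against the dialogue act it terminates. The sketch below shows the shape of such a cross-tabulation; the tag spellings follow the ToBI and DAMSL labels used above, but the paired annotations and the tabulation code are invented for illustration and are not the analysis reported in Fletcher et al. (2002).

# Sketch: cross-tabulate Intonational Phrase-final tune against the
# dialogue-act tag of the phrase it terminates. Toy data only.
from collections import Counter

pairs = [
    ("H* H-H%", "qy"), ("H* H-H%", "qy"), ("L* L-H%", "b"),
    ("L* H-H%", "ad"), ("H* L-L%", "sd"), ("H* L-H%", "sd"),
]

table = Counter(pairs)
tunes = sorted({t for t, _ in pairs})
acts = sorted({a for _, a in pairs})

print("tune".ljust(10) + "".join(a.rjust(6) for a in acts))
for tune in tunes:
    print(tune.ljust(10) + "".join(str(table[(tune, a)]).rjust(6) for a in acts))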


31.3.2  Intonation and Discourse 'Macro-Structure'

While there is a clear correspondence between intonational tune in Australian English and discourse micro-structure, there is also a well-known correspondence between pitch patterning and discourse macro-structure (e.g. Lehiste 1975; Swerts 1997). Map tasks present an excellent source of data with which to examine the potential correspondence between certain prosodic features and a larger discourse construct beyond an individual dialogue act (e.g. Mushin et al. 2003). The Common Ground Unit or CGU is a unit of macro-level discourse structure relevant to collaborative interaction which itself feeds into higher-level intentional (goal-directed) structures (Nakatani and Traum 1999). Based on the idea that participants in an interaction continually negotiate and signal their common understanding ('common ground'), a CGU represents a basic unit of collaborative dialogue consisting of all the linguistic material involved in achieving mutual understanding of an initial contribution of information by one participant. CGUs thus typically include several dialogue acts and possibly multiple initiation–response pairings of acts to the point where both participants make it clear that they understand the propositional information at issue.

A subset of dialogues from the same corpus of map task interactions used in the analyses reported above was coded for CGUs in the study described in Mushin et al. (2003); inter-coder reliability is known to be variable for CGUs (cf. Traum and Nakatani 1999), and for this study the coding was conducted independently by two researchers who then identified and resolved any points of difference to produce a consensual version. An example of a simple CGU and a more complex CGU from this study are shown in (2) and (3) ([ ] indicate overlapping segments of talk).

(2)  IF:  Have you got a cross
     IG:  Yes

(3)  IG:  so you're] swee[ping east]
     IF:  [so am I sweep]ing right around [am I going east]
     IG:  [you're sweeping] east yeh okay well don't~~
     IF:  yeh okay alright I'm going east

The initial contributions in CGUs (called the 'initiation phase' in Mushin et al. 2003) are the utterances presenting the information to be understood, and these tended not to differ across types of CGU in terms of distribution of intonational tunes or pause behaviour. In a further analysis of the map tasks examined in Stirling et al. (2001) and Fletcher et al. (2002), it was found that the only respect in which initiations of simple and complex CGUs differed was that initiations of simple CGUs had a higher proportion of rising boundary tones than those of complex CGUs (43 per cent vs. 29.8 per cent H–H% and 29.9 per cent vs. 18.5 per cent L–H%), while complex CGUs had a higher proportion of falling (L–L%) boundary tones (42.1 per cent vs. 29.5 per cent). This was possibly a reflection of the kind of dialogue act likely to lead to a simple grounding negotiation (e.g. questions vs. statements).


The major finding of Mushin et al. (2003) was that the first responses to these initiations did differ in their intonational characteristics depending on whether the CGU was going to be a simple or a complex one: in other words, responses had particular intonational boundaries if they flagged that they would be 'finishing' an unproblematic CGU, and different ones if they were launching a more complex set of contributions negotiating mutual understanding. Complex CGUs displayed more overlapped first responses, indicating complexity in the CGU. They also showed a higher proportion of low falling (L–L%) boundary tones (generally indicating completion) (30.3 per cent vs. 17.5 per cent). The structure of complex CGUs typically contained a number of initiation–response dialogue act pairs, and the higher proportion of L–L% tunes at the end of the first response echoes the widespread finding (crosslinguistically) that falling intonational boundaries signal finality or a conclusion to the particular discourse exchange (e.g. see Pierrehumbert and Hirschberg 1990; Wichmann 2004; Vermillion 2006). However, regardless of whether they were in simple or complex CGUs, responses had a higher proportion of non-final (continuing) intonational tunes (e.g. H–L%, H-, L–H%) than initiating contributions (32.5 per cent vs. 7.8 per cent). The higher proportion of 'continuing' intonation contours in responses overall reflects the fact that these responses were often a simple acknowledgement or yes/no answer followed by further talk by the same speaker who made the response. These results indicate a complex but discernible relationship between intonation, prosody, and the ways speakers work together towards mutual understanding within the structure of a dialogue (see also Vermillion 2006).
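Proportions like those just cited amount to conditional frequency counts over the annotated first responses, split by CGU type. The sketch below shows one way such counts can be computed; the record structure and the handful of invented responses stand in for the real CGU and ToBI annotations and are assumptions for illustration only.

# Sketch: boundary-tone distributions for first responses in simple vs.
# complex Common Ground Units. Invented placeholder data.
from collections import Counter

responses = [
    {"cgu": "simple", "boundary": "H-H%"},
    {"cgu": "simple", "boundary": "L-H%"},
    {"cgu": "simple", "boundary": "L-L%"},
    {"cgu": "complex", "boundary": "L-L%"},
    {"cgu": "complex", "boundary": "H-L%"},
    {"cgu": "complex", "boundary": "L-L%"},
]

for cgu_type in ("simple", "complex"):
    tones = Counter(r["boundary"] for r in responses if r["cgu"] == cgu_type)
    total = sum(tones.values())
    dist = ", ".join(f"{t}: {n / total:.0%}" for t, n in sorted(tones.items()))
    print(f"{cgu_type:8s} first responses  {dist}")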

31.4 Conclusion

The Australian Map Task has proved to be a useful tool with which to examine different prosodic features of spoken interactive discourse. While the intonational system of Australian English shares many features with other varieties of English, tune usage and tune interpretation will always be variety-specific, and the Map Task has been a rich source of data on this question. The studies summarized in this chapter also illustrate the flexibility of Map Task data in permitting correlations of both micro-level discourse units, such as dialogue acts, and larger discourse segments, such as Common Ground Units, with intonational and prosodic features of Australian English. There are great benefits in adopting separate annotational and analytical devices when analysing the relationship between prosody and discourse. The DAMSL dialogue-act schema, HCRC move annotation, and CGU analysis, combined with intonation and prosodic annotation, have allowed for a more rigorous assessment of the intersection between aspects of discourse and prosody in spoken dialogue.

CHAPTER 32

A PHONOLOGICAL CORPUS OF L1 ACQUISITION OF TAIWAN SOUTHERN MIN*

JANE S. TSAY

32.1 Introduction

This chapter describes data collection, transcription, and annotations for the Taiwanese Child Language Corpus. Computer programs developed for specific phonological analyses will also be described briefly. The Taiwanese Child Language Corpus (TAICORP hereafter) is a corpus of spontaneous speech between young children growing up in Taiwanese-speaking families and their carers. The target language Taiwanese is a variety of Southern Min Chinese spoken in Taiwan.1 Taiwanese and Southern Min are used interchangeably in this chapter.

In the literature on child language acquisition, studies have focused primarily on universal innate patterns, described in current phonological theories with markedness constraints. For example, Optimality-theoretic (OT) models of child language acquisition make specific predictions about markedness (e.g. Prince and Smolensky 1997; Tesar and Smolensky 1998).

*  This project has been supported by research grants from the National Science Council, Taiwan, for more than ten years (NSC87/88/89-2411-H-194-019, NSC89/90/91-2411-H-194-067, NSC92/93/94-2411-H-194-015, NSC95-2411-H-194-022-MY3, NSC98-2410-H-194-086-MY3). We thank the children and their families for their participation. Without the research assistants over the years, especially Rose Huang, Joyce Liu, and Kay Chen, this project would not have gone so far. 1  Southern Min (or Minnan) originally referred to the southern area of Min (Fujian Province, China), including Xiamen (or Amoy), Zhangzhou, and Quanzhou. Most of the early immigrants to Taiwan more than 300 years ago came from the Zhangzhou and Quanzhou areas.

A number of studies have followed this line (e.g. Barlow and Gierut

1999; Barlow 2001; Boersma and Hayes 2001; Dinnsen 2001; Dinnsen and Gierut 2008; among others). However, learning phonology also requires learning the language-specific sound patterns found in the adult language’s specific lexicon. In particular, the lexicon contains crucial information about frequency. Therefore, we expect both universal markedness and language-specific lexical properties to be available for the child. This point of view has also been recognized in recent years (e.g. Gierut 2001; Zamuner et al. 2005; Lleó 2006; Fikkert and de Hoop 2009; Rose 2009; Tessier 2009). Beckman and colleagues have also emphasized different levels of phonological abstractness and the role of vocabulary size in phonological acquisition (Edwards et  al. 2004; Beckman et al. 2007; Edwards and Beckman 2008a, b; Beckman and Edwards 2010; Edwards et al. 2011). However, vocabulary size and frequency information regarding sound patterns in child language are difficult to obtain due to methodological limitations. To obtain frequency information, we need child language corpora that are very large, but that also present a great amount of phonological detail. While a lexicon provides type frequency information, a corpus can provide token frequencies. It is therefore very desirable to adopt a corpus-based approach towards child phonology acquisition (e.g. Zuraw 2007, and works cited therein). Although the significance of a large-scale collection of longitudinal child language data for linguistic studies goes without saying, there was an additional motivation for studying the acquisition of Taiwan Southern Min. Until almost the turn of the century, for over forty years, Mandarin was the only official language for instruction in schools in Taiwan in spite of the fact that about 73 per cent of the population belonged to the Southern Min ethnic group (Huang 1993). Young children in kindergartens and elementary schools were not allowed to speak Southern Min at school even though it was the language spoken at home. Although the situation has changed in recent years and languages other than Mandarin, including Southern Min, Hakka, and the aboriginal (Formosan) languages have been included in the curriculum of elementary schools, there still is a serious concern about the dwindling numbers of native Southern Min speakers. This concern can be supported by a more recent report by Tsay (2005) which found that, in a survey of all 8th graders in Chiayi City in Southern Taiwan, only about 26 per cent of these 14-year-olds used Southern Min in their daily life, compared with results showing that over 70 per cent of their parents were native Southern Min speakers. Given these considerations, we saw some urgency in the study of the first language acquisition of Southern Min, which motivated the construction of our corpus, beginning more than ten years ago. Data collection was done through regular home visits during a period of three years. A total of fourteen children (four 1-year-olds, seven 2-year-olds, three 3-year-olds) participated in this longitudinal study during the three-year data collection phase. There were about 330 hours of recordings from the 431 recording sessions (see below for details on the recording sessions). These recordings were transcribed into 431 text files which

together contain 1 646 503 word tokens (about 2 million syllables/morphemes/characters)2 in 497 426 lines (utterances). The recordings were transcribed into machine-readable text in the format of the Child Language Data Exchange System (CHILDES) (MacWhinney and Snow 1985, MacWhinney 1995, Rose and MacWhinney, this volume). Annotations in the text files include part of speech, narrow phonetic transcription of the child speech, and syllable types. For some young children, the sound files were also synchronized with the annotated text using a function in CLAN of CHILDES. Discourse annotations were only minimally coded due to manpower limitations. The following sections describe data collection, transcription, text files in CHILDES format, annotated phonological information, and data analysis programs.

32.2  Data Collection

Data collection took place over a period of three years. Fourteen children (9 boys and 5 girls) were recruited from Southern Min-speaking families in Min-hsiung Township, Chiayi County in Southern Taiwan. Home visits were conducted at two-week intervals at the child's home by one investigator of the project, accompanied by a carer (parent or grandparent for most children, the nanny for one child). The children's spontaneous speech while at play or interacting with the carer and/or the investigator was recorded using a digital MiniDisc recorder and a microphone. Each recording session lasted from 40 to 60 minutes. The activities were children's daily life at home—playing games or playing with toys, reading picture books, or just talking without any specific topics.

Three research assistants participated in this project, each being responsible for the longitudinal recording of three to four children during the three-year data collection phase. These research assistants were also the first-round transcribers of the recordings. Each child had his/her own recording schedule and had different participation durations. One child, YJK, was recorded only twice because he was speaking more Mandarin than the target language Southern Min. Three children, LJX, YCX, and YDA, were recorded for half a year until they went to preschool and started picking up Mandarin. The other ten were recorded for at least one year and seven months. Among them, six children were recorded for more than two years. The gender, age range, participation duration, number of recording sessions, and recording time of each child are given in Table 32.1. There were a total of 431 recording sessions. Each session was saved as a separate sound file.

2  Like other Sinitic languages, most morphemes in Taiwanese are monosyllabic, each corresponding to one Chinese character in the orthography.


Table 32.1  Information about the children and their recording sessions

Name    Sex   Age range          Duration of participation   Recording Sessions   Length (min.)
LYC     F     1;2.13–3;3.29      2yr 2mo                     48                   2255
HYS     M     1;2.28–3;4.12      2yr 3mo                     51                   2280
TWX     F     1;5.12–3;6.15      2yr 2mo                     44                   1829
YSW     M     1;7.17–2;7.14      1yr 1mo                     21                   1210
LWJ     F     2;1.08–3;7.03      1yr 7mo                     36                   1777
WZX     M     2;1.17–4;3.15      2yr 3mo                     44                   1757
HBL     M     2;1.22–4;0.03      2yr 0mo                     45                   1889
CEY     F     2;1.27–3;10.00     1yr 10mo                    37                   1728
YJK     M     2;6.11–2;6.26      0yr 1mo                     2                    105
LMC     F     2;8.07–5;3.21      2yr 8mo                     50                   2045
CQM     M     2;9.07–4;6.22      1yr 10mo                    30                   1584
LJX     M     3;9.20–4;2.24      0yr 6mo                     8                    530
YCX     M     3;10.16–4;0.16     0yr 6mo                     6                    285
YDA     M     3;11.02–4;4.26     0yr 6mo                     9                    540
Total   M = 9, F = 5                                         431                  330 hours

Because the recordings were done in a natural setting, long periods of silence and background noises were inevitable. So the sound files were first edited to delete the long silences and noisy parts. In order to permit easier searching and locating of the content on the recordings, each sound file was segmented into several tracks, which were then tagged.

32.3 Text Files

32.3.1 Transcription

All sound files were transcribed into text files in both orthographic transcription and phonetic transcription.

32.3.1.1  Orthographic Transcription

There are two kinds of systems used in orthographic transcription: the logographic orthography (i.e. Chinese characters—the traditional Chinese writing system) used in the main tier, and a spelling-based romanization system for Taiwan Southern Min (Minnan Pinyin) used in the dependent tier %ort. (The names and descriptions of the tiers will be explained shortly.)

It should be noted that some problems were encountered in transcribing the speech into Chinese characters. Although all Sinitic languages use Chinese characters as the writing system, only Mandarin has an almost perfect mapping between the spoken words and the written words. This is probably the consequence of Mandarin being the 'official' spoken language assigned by the government since the turn of the twentieth century. The mass media (especially the newspapers) have also helped in conventionalizing the written form. By contrast, Taiwan Southern Min does not yet have as conventionalized an orthography and quite a few words in Taiwan Southern Min do not have a consistent written form. In order to increase the consistency in Taiwan Southern Min writing conventions, several Southern Min dictionaries were consulted. A program for checking the inter-transcriber consistency of the orthography was developed by Galvin Chang, James Myers, and Jane Tsay (see Tsay 2007 for more detailed discussion).

32.3.1.2  Phonetic Transcription

Phonetic transcription was first done by the investigator who made the recording for a particular session. Another investigator would do a second-round transcription. A third investigator then checked discrepancies between the first two transcriptions. Segments were transcribed in IPA in the %pho tier and tones were transcribed using a 5-point scale (where 1 = lowest pitch, 5 = highest pitch) in the %ton tier. The child speech in about 180 out of the 330 hours of total recordings was transcribed phonetically.

32.3.2  The Text Format

The format of the text files in CHILDES is introduced very briefly in this chapter. For details, including the tools of transcription, please refer to the official CHILDES website at http://childes.psy.cmu.edu/. The main components of text files in the CHILDES format are headers and tiers. There are three kinds of headers: obligatory headers, constant headers, and changeable headers. Obligatory headers are necessary for every file. They mark the beginning and the end of the file. Constant headers mark the name of the file and background information of the children, while changeable headers contain information that may change across files, such as language, participant ID, recording date, transcribers, and so on. These headers all begin with @.

Headers
@Begin
@Languages:zho-nan, zho
@Participants:CHI LYC Target_Child, IN1 Kay Investigator, IN2 Rose Investigator, SIS Ci Sister
@ID:zho-nan|Tsay|CHI|3;3.29|female|||Target_Child||
@ID:zho-nan|Tsay|IN1|||||Investigator||


@ID:zho-nan|Tsay|IN2|||||Investigator||
@ID:zho-nan|Tsay|SIS|||||Sister||
@Transcriber:Rose, Kay, Joyce
@Date:31-MAY-2000
@Media: LYC30329, audio
@Tape Location:Yi D17-1-10
@Comment:Time duration is 40
@Location:Chiayi, Taiwan
@Transcriber:Kay
@Comment:Track number is D17-1

The content of the speech is presented in tiers: the main tier and the dependent tiers. The main tier, marked with *, contains the utterance of the speaker, for example, *CHI the target child, *MOT the mother, and *INV the investigator. The speech in the main tier in TAICORP is transcribed in Chinese characters. The dependent tiers, marked with %, are for additional information about the utterance in the main tier. The names and descriptions of the dependent tiers in TAICORP are given below.

%ort: utterance in Southern Min romanization (Minnan Pinyin)
%cod: part-of-speech coding
%eng: English gloss of the words
%syl: syllable types (in CV notation) of the target (phonemic) pronunciation
%pho: phonetic transcription of the child speech
%syc: syllable types of the actual production of the child speech
%ton: phonetic transcription of the tones on a 5-point scale

The following example is from the child HYS at 2;3.9.3

Tiers
*CHI: 你 食.
%ort: li2 ciah8.
%cod: Nh VC
%eng: you eat
%syl: CV CVVK
%pho: i kia
%syc: V CVV
%ton: 55 32

3  Digits in the %ort tier denote the lexical tone categories (to be explained in the next section) of the syllable, while digits in the %ton tier are pitch values of the tone categories in the child's actual pronunciation.
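A record in this format can be turned into a simple data structure by splitting each line at the tier marker. The sketch below is an illustrative stand-in for the CLAN tools that CHILDES itself provides; the toy record mirrors the example above, and the parsing function is an assumption for illustration rather than part of TAICORP.

# Sketch: read one CHAT-style record (main tier plus dependent tiers)
# into a dictionary keyed by tier marker. Illustrative only.
def parse_record(lines):
    record = {}
    for line in lines:
        marker, _, content = line.partition(":")
        record[marker.strip()] = content.strip()
    return record

example = [
    "*CHI: 你 食.",
    "%ort: li2 ciah8.",
    "%cod: Nh VC",
    "%eng: you eat",
    "%syl: CV CVVK",
    "%pho: i kia",
    "%syc: V CVV",
    "%ton: 55 32",
]

record = parse_record(example)
print(record["*CHI"], "->", record["%pho"], record["%ton"])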

Table 32.2  Lexical tones in Taiwanese

Tone Category   Traditional terms      Tone values in juncture or isolation   Example   Gloss
Tone 1          Yinping                55                                     si55      Poem
Tone 2          Yinshang               53                                     si53      Death
Tone 3          Yinqu                  21                                     si21      Four
Tone 4          Yinru (short tone)     33                                     sik33     Colour
Tone 5          Yangping               13                                     si13      Time
Tone 7          Yangqu                 33                                     si33      Yes
Tone 8          Yangru (short tone)    5                                      sik5      Ripe

32.4  Phonological Information

32.4.1  The Sound System of Taiwanese

Like all languages in the Sinitic (Chinese) family, Taiwanese has the following characteristics regarding syllable and tone: (1) virtually all morphemes in Taiwanese are monosyllabic; (2) lexically contrastive tones occur with almost all syllables, except for some function words like particles which do not have an underlying tone and usually surface with a neutral tone. Syllable structure is relatively simple in Taiwanese. No consonant clusters are allowed. The consonant in the coda position of a syllable is very restricted—only nasals and stops are allowed.

There are seven lexical tones in Taiwan Southern Min. Tone categories are referred to by digits, including five long tones (T1, T2, T3, T5, and T7) and two short (abrupt) tones (T4 and T8).4 Short tones only occur with the so-called checked syllables (which end with an unreleased obstruent coda -p, -t, -k, or glottal stop) and are called Rusheng or Entering Tone in the Chinese philology tradition. Tone values in 5-point scale notation for syllables/morphemes in juncture position (including in isolation) are given in Table 32.2 with short tones underlined.5 In addition to the seven lexical tones described in the previous paragraph, there is also a neutral tone, labelled Tone 0, which occurs in underlyingly toneless words such as particles.

Tone 6 does not exist in Taiwanese any more due to historical sound change. For the purpose of diachronic comparison as well as synchronic dialectal studies, this gap is still respected and preserved in the numbering of the tone categories. 5  Surface tone values are different in juncture (including in isolation) vs. non-juncture (context) positions, a phenomenon called tone sandhi in the literature.

A Phonological Corpus of L1 Acquisition of Taiwan Southern Min  

583

as particles. These particles usually serve pragmatic functions. The actual realization of the particles might vary according to pragmatic situations. Another non-lexical tone, labelled Tone 9, is a non-contrastive tone, usually derived from tone contraction due to the coalescence of two adjacent syllables into one syllable. The consonants in Taiwanese are: /p, ph, b, m, t, th, l, n, k, kh, g, ŋ, h, ʔ, ts, tsh, s, dz/. As mentioned above, the coda stops /-p/, /-t/,. /-k/, and /-ʔ/ are unreleased and are the only obstruents that can appear in the coda position. The other coda consonants are nasals /-m/, /-n/, and /-ŋ/. The labial nasal and velar nasal can also be syllabic as in /am̩/ ‘aunt’ and / ŋ̩/ ‘yellow’, respectively. There are six single vowels: /i, e, a, ɔ, o (or ə in some dialects), u/. Single vowels can be combined into diphthongs and triphthongs. Except for /o/, all single vowels also have a nasalized counterpart.

32.4.2  Phonological Coding and Phonological Analysis Regarding phonological analysis, syllable types and tone types have been the main concern in our research. Two programs SYLLABLE and ToneFreq were developed to study these issues.

32.4.2  Phonological Coding and Phonological Analysis

Regarding phonological analysis, syllable types and tone types have been the main concern in our research. Two programs, SYLLABLE and ToneFreq, were developed to study these issues.

In addition to standard C and V codes for consonants and vowels, coda Cs could be further coded as N (nasal coda) or K (obstruent coda) when this distinction in coda becomes relevant.


32.4.2.2  Tone Distribution

Another program, ToneFreq, also developed by Chienyu Hsu, was designed to count tone frequencies at both the syllable level and the word level. Both type frequencies and token frequencies of the tone categories can provide interesting information about the characteristics of lexical tone. For example, it was found that syllables with certain onsets do not occur with certain tones, a potential phonotactic constraint very likely caused by historical sound change. Whether this kind of sound pattern plays a role in tone acquisition is an issue worth pursuing.
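Token frequencies of the tone categories can likewise be read directly off the %ort tier, since every syllable ends in its tone digit. The sketch below is a minimal illustration of that counting step, with a two-line toy 'corpus' standing in for the real transcripts; it is not the ToneFreq program.

# Sketch: count tone-category tokens from the digits that close each
# syllable in %ort tiers. Toy data only.
import re
from collections import Counter

ort_lines = [
    "li2 ciah8.",
    "gua2 ai3 ciah8 png7.",
]

tone_tokens = Counter()
for line in ort_lines:
    tone_tokens.update(re.findall(r"[a-z](\d)", line))

for tone, count in sorted(tone_tokens.items()):
    print(f"Tone {tone}: {count} tokens")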

32.4.2.3  Digitized Audio Linkage

Using a tool in CLAN of CHILDES, the sound file and the text file of a recording can be synchronized. A command in the phonic mode called 'insert bullet' can link the sounds to the computerized transcript line by line. This makes it much easier for the user to hear the sounds while reading the transcript. It also makes acoustic analysis much easier, although the manual insertion of the bullets remains labour-intensive. In TAICORP, this work has been completed for the three youngest children (about 106 hours).

32.5  POS and Discourse Annotations

32.5.1  Automatic Word Segmentation and POS Tagging

Constructing a speech-based corpus requires a lot more steps than constructing a corpus based on written texts. The most labour-intensive and time-consuming work is transcribing the sound files into text files. In the first stage of the construction of TAICORP, every step was done manually. After a lexical bank was built based on the manual transcription, automatic word segmentation and part-of-speech (POS) tagging became possible. The POS coding system used in TAICORP is a revised version of the system used in the Sinica Corpus of Mandarin (see various technical reports by the Chinese Knowledge Information Processing Group (CKIP) (CKIP 1993, 1998; Chen et al. 1996)). For details about the POS coding system of TAICORP, see Tsay (2007).

A sample of the lexical bank is given in Table 32.3, which contains the following information for each lexical entry: the orthography (both in Chinese characters and in Minnan Pinyin), English gloss, POS tag(s), synonyms in Mandarin, and an example from the corpus. The programs listed in Table 32.4 were developed by Galvin Chang for the automatization of the transcription and coding processes. It should be noted that morphological (including both derivational and inflectional) marking is very limited (with very few exceptions, e.g. plural marker men for a limited set of nouns).


Table 32.3  A sample of the lexical bank of TAICORP

Chinese   Minnan Pinyin     English Gloss   POS Tag   Synonyms in Mandarin     Example from the corpus
公        kang1             male            A                                  你這尾魚仔是公e0抑母e0?
抑無      ah8bo5^ah4bo5     otherwise       Cbb       要不然, 否則, 不然, 要不    抑無你看這張
大概      tai7khai3         approximately   D         大概                      大概是按呢la0.
畫粧      ue7cong1          make up         VA        化妝、化粧                 早起起來畫(粧) [/] 畫粧e0時陣伊創啥?

Table 32.4  Programs developed to automatize the transcription and coding processes

Program                       Functions                                                                                                Programmer
Spell-Checker                 Check the accuracy of the orthography, including the spelling in Minnan Pinyin and Chinese characters.   Galvin Chang
Automatic Word Segmentation   Segment an utterance into words and automatically insert Minnan Pinyin for each word in Chinese characters   Galvin Chang
Automatic POS Tagging         Code the POS tags automatically after word segmentation                                                  Galvin Chang

Therefore, for word forms that have more than one syntactic category (e.g. water could be a noun or a verb in English), the category is ambiguous when such forms occur without context. This might cause problems in word frequency counts, as will be discussed in the next section. For the automatization procedure for transcription and coding, and details about some issues associated with transcribing into Chinese characters and Minnan Pinyin, see Tsay (2007).

32.5.2  Data Analysis Programs

A program called WordClassAgent was developed by Ziyang Wang mainly for word frequency counts (Table 32.5). There are two sub-programs in WordClassAgent: one selects words in a specific main tier (i.e. speech for that specific speaker only); the other matches the words with their syntactic categories in the %cod tier where POS is coded. The former gives the word frequency counts for a specific speaker and the latter gives the word frequency counts for a specific syntactic category.
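The two sub-programs correspond to two simple filtering steps: keep only the utterances of one speaker, then pair each word in the main tier with its tag in the %cod tier and count. The sketch below illustrates the idea on an invented miniature record (the speaker codes and POS tags follow the conventions shown earlier); it is an assumption-laden illustration, not the WordClassAgent code itself.

# Sketch: word frequency counts restricted to one speaker and,
# optionally, to one part-of-speech tag. Invented miniature data.
from collections import Counter

records = [
    {"speaker": "*CHI", "words": ["你", "食"], "pos": ["Nh", "VC"]},
    {"speaker": "*IN1", "words": ["你", "食"], "pos": ["Nh", "VC"]},
    {"speaker": "*CHI", "words": ["食", "飯"], "pos": ["VC", "Na"]},
]

def word_freq(records, speaker, pos_filter=None):
    counts = Counter()
    for rec in records:
        if rec["speaker"] != speaker:
            continue
        for word, pos in zip(rec["words"], rec["pos"]):
            if pos_filter is None or pos == pos_filter:
                counts[word] += 1
    return counts

print(word_freq(records, "*CHI"))          # all word tokens of the child
print(word_freq(records, "*CHI", "VC"))    # the child's verbs only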

Table 32.5  Programming for word frequency counts

Program                           Functions                                                                                                       Programmer
WordClassAgent                    Mainly for word frequency counts                                                                               Ziyang Wang
Sub-program 1: SpeakerSelection   Sort data by different participants/speakers (main tier)
Sub-program 2: ChildWordClass     Count word frequencies with specific constraints, e.g. verbs vs. nouns (i.e. words with specific syntactic categories at the %cod tier)

32.5.3  Discourse Annotation

Due to manpower limitations, discourse annotation was minimal. Discourse codes used in TAICORP are from the conventions provided in CHILDES.

(1) Codes for unidentifiable material
    (a) xxx/xx: unintelligible speech (utterance/word).
    (b) yyy/yy: unintelligible speech at the phonetic level.
    (c) www/ww: untranscribed speech, to be used in conjunction with a note to explain the situation.

(2) Repetition
    [/]: repetition of either one or more words.

(3) Basic utterance terminators
    The basic utterance terminators are the period, the question mark, and the exclamation mark. Each utterance must end with one of these three utterance terminators.

(4) Special utterance terminators: these terminators all begin with the + symbol and end with one of the three basic utterance terminators. For example,
    (a) +...   Incomplete but not interrupted utterance
    (b) +/.    Incomplete utterance due to interruption
    (c) +//.   Self-interruption: breaking off an utterance and starting up another by the same speaker
    (d) +?.    Interruption of a question: the utterance being interrupted is a question
    (e) +,     Self-completion: to mark the completion of an utterance after an interruption


(5) Scoped symbols

    (a) [=! text]   Paralinguistic material: marking paralinguistic events or actions, such as coughing, laughing, crying, singing, and whispering.
    (b) [>]   Overlap follows
    (c) [