Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010 [1 ed.] 9781607506416, 9781607506409

This book contains papers from the Fourth International Conference on Human Language Technologies - the Baltic Perspective.


English Pages 264 Year 2010


HUMAN LANGUAGE TECHNOLOGIES – THE BALTIC PERSPECTIVE

Copyright © 2010. IOS Press, Incorporated. All rights reserved.


Frontiers in Artificial Intelligence and Applications FAIA covers all aspects of theoretical and applied artificial intelligence research in the form of monographs, doctoral dissertations, textbooks, handbooks and proceedings volumes. The FAIA series contains several sub-series, including “Information Modelling and Knowledge Bases” and “Knowledge-Based Intelligent Engineering Systems”. It also includes the biennial ECAI, the European Conference on Artificial Intelligence, proceedings volumes, and other ECCAI – the European Coordinating Committee on Artificial Intelligence – sponsored publications. An editorial panel of internationally well-known scholars is appointed to provide a high quality selection. Series Editors: J. Breuker, N. Guarino, J.N. Kok, J. Liu, R. López de Mántaras, R. Mizoguchi, M. Musen, S.K. Pal and N. Zhong

Volume 219


Recently published in this series

Vol. 218. C. Soares and R. Ghani (Eds.), Data Mining for Business Applications
Vol. 217. H. Fujita (Ed.), New Trends in Software Methodologies, Tools and Techniques – Proceedings of the 9th SoMeT_10
Vol. 216. P. Baroni, F. Cerutti, M. Giacomin and G.R. Simari (Eds.), Computational Models of Argument – Proceedings of COMMA 2010
Vol. 215. H. Coelho, R. Studer and M. Wooldridge (Eds.), ECAI 2010 – 19th European Conference on Artificial Intelligence
Vol. 214. I.-O. Stathopoulou and G.A. Tsihrintzis, Visual Affect Recognition
Vol. 213. L. Obrst, T. Janssen and W. Ceusters (Eds.), Ontologies and Semantic Technologies for Intelligence
Vol. 212. A. Respício et al. (Eds.), Bridging the Socio-Technical Gap in Decision Support Systems – Challenges for the Next Decade
Vol. 211. J.I. da Silva Filho, G. Lambert-Torres and J.M. Abe, Uncertainty Treatment Using Paraconsistent Logic – Introducing Paraconsistent Artificial Neural Networks
Vol. 210. O. Kutz et al. (Eds.), Modular Ontologies – Proceedings of the Fourth International Workshop (WoMO 2010)
Vol. 209. A. Galton and R. Mizoguchi (Eds.), Formal Ontology in Information Systems – Proceedings of the Sixth International Conference (FOIS 2010)
Vol. 208. G.L. Pozzato, Conditional and Preferential Logics: Proof Methods and Theorem Proving
Vol. 207. A. Bifet, Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams
Vol. 206. T. Welzer Družovec et al. (Eds.), Information Modelling and Knowledge Bases XXI

ISSN 0922-6389 (print)
ISSN 1879-8314 (online)


Human Language Technologies The Baltic Perspective Proceedings of the Fourth International Conference Baltic HLT 2010

Edited by

Inguna Skadiņa
Institute of Mathematics and Computer Science, University of Latvia
Tilde, Latvia

and

Andrejs Vasiļjevs


Tilde, Latvia

Amsterdam • Berlin • Tokyo • Washington, DC


© 2010 The authors and IOS Press.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 978-1-60750-640-9 (print)
ISBN 978-1-60750-641-6 (online)
Library of Congress Control Number: 2010935129

Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]

Distributor in the USA and Canada
IOS Press, Inc.
4502 Rachael Manor Drive
Fairfax, VA 22032
USA
fax: +1 703 323 3668
e-mail: [email protected]

LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.


Preface

This volume contains the papers presented at the Fourth International Conference “Human Language Technologies – the Baltic Perspective” (Baltic HLT 2010). The series of Baltic HLT conferences provides a forum for sharing recent advances in human language processing and for promoting cooperation between the research communities of computer science and linguistics from the Baltic countries and the rest of the world. The conference brings together scientists, developers, providers and users to discuss the state of the art of Human Language Technologies (HLT) in the Baltic countries, to exchange information, to discuss problems, to find new synergies, and to promote initiatives for international cooperation.

The first larger pan-Baltic event on HLT research was the seminar “Language and Technology 2000”, organized by Andrejs Spektors and the Institute of Mathematics and Computer Science, University of Latvia, in Riga in 1994. In 2004, ten years after this seminar, Andrejs Vasiļjevs and Inguna Skadiņa initiated the first international Baltic HLT conference, organized by the Commission of the Official Language of the Chancellery of the President of Latvia. Einar Meister took over this initiative with the second conference in 2005 in Tallinn, organized by the Institute of Cybernetics and the Institute of the Estonian Language. Successful continuation of the series was ensured by Rūta Marcinkevičienė, who initiated the third Baltic HLT conference in Kaunas in 2007, organized by Vytautas Magnus University and the Institute of the Lithuanian Language. This fourth conference takes place in Riga again, on October 7–8, 2010.

We would like to thank the supporters and organizers of this conference: Tilde and the Institute of Mathematics and Computer Science, University of Latvia. The conference was also supported by the CLARIN, LetsMT! and ACCURAT projects.

The last three years were very fruitful for HLT researchers and developers. A new concept – language resources as research infrastructures (RI) – was introduced throughout Europe. The Baltic countries have actively contributed to the first steps towards creating such an RI. The overview section includes two invited papers by the creators of the Baltic linguistic infrastructure, presenting an analysis of the current situation in HLT in Latvia and Lithuania, and a summary of the National Programme for Estonian HLT. HLT research in the Baltic countries was boosted by several large-scale national and international activities, such as the projects CLARIN, ACCURAT, LetsMT! and others described in this volume. Research results, work in progress, descriptions of demonstrations, and position papers on these and other activities form the main content of this volume. The contributions were submitted by more than 75 authors from eleven countries and reviewed by an international programme committee in a blind review process. The papers selected for the conference represent a wide range of research topics in corpus linguistics, machine translation, speech technologies, semantics, and other areas of HLT. We hope that this volume will serve as a useful and comprehensive repository of information and will facilitate the research and development of HLT in the Baltic countries and the creation of a pan-European RI of language resources and technology.



Conference Organization

The Fourth International Conference HUMAN LANGUAGE TECHNOLOGIES — THE BALTIC PERSPECTIVE Riga, Latvia, October 7–8, 2010


PROGRAMME COMMITTEE

• Nick Campbell, University of Dublin, Trinity College, Ireland
• Rolf Carlson, KTH, Sweden
• Robert Gaizauskas, Sheffield University, UK
• Steven Krauwer, CLARIN project, The Netherlands
• Bente Maegaard, University of Copenhagen, Denmark
• Rūta Marcinkevičienė, Vytautas Magnus University, Lithuania
• Einar Meister, Institute of Cybernetics, Estonia
• Joakim Nivre, Uppsala University, Sweden
• Tiit Roosma, University of Tartu, Estonia
• Inguna Skadiņa, IMCS / Tilde, Latvia
• Jurģis Šķilters, National Library of Latvia / University of Latvia, Latvia
• Koenraad De Smedt, University of Bergen, Norway
• Andrejs Spektors, IMCS, Latvia
• Gregor Thurmair, Linguatech, Germany
• Jörg Tiedemann, Uppsala University, Sweden
• Andrejs Vasiļjevs, Tilde, Latvia

Technical Editor (Proceedings)
Roberts Rozis, Tilde, Latvia


Organized by Institute of Mathematics and Computer Science (University of Latvia) and Tilde




Supported by Tilde, Institute of Mathematics and Computer Science, University of Latvia, ACCURAT project (FP7), CLARIN project, LetsMT! project (CIP ICT-PSP)


Contents

Preface v
Conference Organization vii

Overview

Developing the Human Language Technology Infrastructure in Lithuania
  Rūta Marcinkevičienė and Daiva Vitkutė-Adžgauskienė 3
National Programme for Estonian Language Technology: A Pre-Final Summary
  Einar Meister, Jaak Vilo and Neeme Kahusk 11
Language Resources and Technology for the Humanities in Latvia (2004–2010)
  Inguna Skadiņa, Ilze Auziņa, Normunds Grūzītis, Kristīne Levāne-Petrova, Gunta Nešpore, Raivis Skadiņš and Andrejs Vasiļjevs 15

Speech Technologies and Spoken Corpus

Estonian Emotional Speech Corpus: Culture and Age in Selecting Corpus Testers
  Rene Altrov and Hille Pajupuu 25
Estonian Large Vocabulary Speech Recognition System for Radiology
  Tanel Alumäe and Einar Meister 33
Towards Spoken Latvian Corpus: Current Situation, Methodology and Development
  Ilze Auziņa 39
Remarks on the Duration of Lithuanian Consonants in a Continuous Speech
  Sigita Dereškevičiūtė and Asta Kazlauskienė 45
Modelling the Temporal Structure of Estonian Speech
  Mari-Liis Kalvik and Meelis Mihkla 53
An Audio System of Electronic Texts for the Visually Impaired and Perception of Different Speech Rates by the Blind and the Sighted
  Meelis Mihkla, Indrek Hein, Indrek Kiissel, Margit Orusaar and Artur Räpp 61
Latvian Text-to-Speech Synthesizer
  Mārcis Pinnis and Ilze Auziņa 69
Using Dependency Grammar Features in Whole Sentence Maximum Entropy Language Model for Speech Recognition
  Teemu Ruokolainen, Tanel Alumäe and Marcus Dobrinkat 73

Spoken and Written Dialog

Internet Commentators as Dialogue Participants: Coherence Achieved Through Membership Categorization
  Tiit Hennoste, Olga Gerassimenko, Riina Kasterpalu, Mare Koit, Kirsi Laanesoo, Anni Oja, Andriela Rääbis and Krista Strandson 83
Uncertainty in Spoken Dialogue Management
  Kristiina Jokinen 91
Human-Computer Interaction in Estonian: Collection and Analysis of Simulated Dialogues
  Siiri Pärkson 99
A Framework for Asynchronous Dialogue Systems
  Margus Treumuth 107

Machine Translation

SMT of Latvian, Lithuanian and Estonian Languages: A Comparative Study
  Maxim Khalilov, Lauma Pretkalniņa, Natalja Kuvaldina and Veronika Pereseina 117
Improving SMT for Baltic Languages with Factored Models
  Raivis Skadiņš, Kārlis Goba and Valters Šics 125
LetsMT! – Online Platform for Sharing Training Data and Building User Tailored Machine Translation
  Andrejs Vasiljevs, Tatiana Gornostay and Raivis Skadins 133

Written Corpora and Linguistic Resources

The Estonian Reference Corpus: Its Composition and Morphology-Aware User Interface
  Heiki-Jaan Kaalep, Kadri Muischnek, Kristel Uiboaed and Kaarel Veskis 143
Adaptive Automatic Mark-Up Tool for Legacy Dictionaries
  Lauma Pretkalniņa and Ilze Millere 147
Corpus of Contemporary Lithuanian Language – The Standardised Way
  Erika Rimkutė, Jolanta Kovalevskaitė, Vida Melninkaitė, Andrius Utka and Daiva Vitkutė-Adžgauskienė 154
A Collection of Comparable Corpora for Under-Resourced Languages
  Inguna Skadiņa, Ahmet Aker, Voula Giouli, Dan Tufis, Robert Gaizauskas, Madara Mieriņa and Nikos Mastropavlos 161
The Database of Estonian Word Families: A Language Technology Resource
  Ülle Viks, Silvi Vare and Heete Sahkai 169
Digitization of Historical Texts at the National Library of Latvia
  Arturs Zogla and Jurgis Skilters 177

Semantics

Verbalizing Ontologies in Controlled Baltic Languages
  Normunds Grūzītis, Gunta Nešpore and Baiba Saulīte 187
Enriching Estonian WordNet with Derivations and Semantic Relations
  Neeme Kahusk, Kadri Kerner and Kadri Vider 195
Main Trends in Semantic Research of Estonian Language Technology
  Haldur Õim, Heili Orav, Kadri Kerner and Neeme Kahusk 201
Semantic Analysis of Sentences: The Estonian Experience
  Haldur Õim, Heili Orav, Neeme Kahusk and Piia Taremaa 208

Methods and Tools for Language Processing

An Ensemble of Classifiers Methodology for Stemming in Inflectional Languages: Using the Example of Latvian
  Steffen Eger and Ineta Sējāne 217
Using Syllables as Indexing Terms in Full-Text Information Retrieval
  Kimmo Kettunen, Paul McNamee and Feza Baskaya 225
Comparison of the SemTi-Kamols and Tesnière's Dependency Grammars
  Gunta Nešpore, Baiba Saulīte, Guntis Bārzdiņš and Normunds Grūzītis 233
Cloud Computing for the Humanities: Two Approaches for Language Technology
  Graham Wilcock 241

Subject Index 249
Author Index 251


Overview


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-3


Developing the Human Language Technology Infrastructure in Lithuania

Rūta MARCINKEVIČIENĖ a,1 and Daiva VITKUTĖ-ADŽGAUSKIENĖ b
a Centre of Computational Linguistics, Vytautas Magnus University, Kaunas, Lithuania
b Department of Applied Informatics, Vytautas Magnus University, Kaunas, Lithuania

Abstract. This paper gives a short overview of the development of the Lithuanian human language technology infrastructure in the context of European co-operation. It also presents national policies related to research infrastructures and joint activities planned at different levels: institutional, national and European.

Keywords. language technology infrastructure, corpus linguistics, CLARIN


Introduction

The declared policy of the EU emphasizes that linguistic diversity must be preserved and multilingualism promoted in the Union. In the age of digitalization this aim can be achieved by developing computerized language resources and tools. The official languages of the EU differ in the level and availability of their resources. Internationally used and thoroughly researched languages are much more advanced in that respect than lesser-used languages that joined the community of resource developers twenty years ago. However, the latecomers benefited from the know-how and universally applicable technologies in resource provision. In what follows, an overview of Lithuanian human language technology (HLT) is provided, together with strategic issues for its further development.

1. HLT – A Short Overview

HLT development in Lithuania, as in many other countries, started with the development of corpus resources [1]. The role of a corpus is manifold and depends on its type: large or small, compiled of raw or annotated texts, general or specific, monolingual or bi-/multilingual, parallel or comparable, historical or contemporary. Some corpora belong more to cultural heritage than to language engineering. Whatever their type, corpora provide linguistic and statistical evidence as well as specific content knowledge. The EU policy of multilingualism encouraged building links between resources in pairs and groups of languages, and using the same formalisms. Lithuanian corpora that respond to this policy have comparable counterparts in some other EU languages. As to standards and common formalisms, they are still a goal for the future. It needs to be mentioned that the importance of corpora has nowadays increased together with their availability and the ease of compiling them or of using the Internet as a corpus. Even though the notion of a corpus has become fuzzier now that so many digital text resources are at hand, the value of large and well-balanced corpora has not decreased. The following corpora of written language are currently available on-line:

• Corpus of Contemporary Lithuanian Language CCLL (160 mln. running words) and its morphologically annotated version, as well as a set of parallel corpora (a bidirectional Czech-Lithuanian and Lithuanian-Czech corpus of five million words and an English-Lithuanian corpus of 18 million words), compiled by the Centre of Computational Linguistics of Vytautas Magnus University (VMU)2;
• Corpus Academicum Lithuanicum CorALit (9 mln. running words), compiled by the Faculty of Philology of Vilnius University (VU)3;
• Corpus of Spoken Lithuanian Language (200,000 running words) and a universal annotated database of speech recordings, compiled by the Centre of Regional Studies of VMU4;
• Corpus of Lithuanian Dialects and the Database of Old Lithuanian Writings, compiled by the Lithuanian Language Institute (LLI)5.

1 Corresponding Author: Rūta Marcinkevičienė, Centre of Computational Linguistics, Vytautas Magnus University, Donelaičio, Kaunas, Lithuania. E-mail: [email protected].

Different computerized bilingual and multilingual dictionaries are available either on-line or in the form of independent or integrated software packages. These include the bilingual English-Lithuanian, French-Lithuanian and international-word dictionaries Alkonas, Anglonas 2, Frankonas and Interleksis, maintained by Fotonija; the English-Lithuanian, Lithuanian-English, German-Lithuanian and Lithuanian-German dictionaries LED and WinLED, maintained by the TEV Publishing House; the Dictionary of Contemporary Lithuanian Language and the Dictionary of Lithuanian Language, maintained by LLI; the "Tildės biuras" software package, including dictionaries for the English, German, French, Latvian, Russian and Lithuanian languages; the Cobuild English-Lithuanian-Czech Dictionary; the multilingual dictionaries Stella and Etoile by Akelote; and others. In addition, other language resources are available in the form of on-line databases. LLI maintains the Database of Old Lithuanian Writings and the Dictionary of Toponyms. The State Commission of the Lithuanian Language (SCLL) maintains an open terminological database6, and the Institute of Mathematics and Informatics (IMI) maintains a database combining digitalized term dictionaries from 27 different branches7. A database of Lithuanian nominal collocations has been built using the CCLL corpus. The following language technology tools are available:

• Corpus query and concordance extraction tools8;
• Collocation extraction tool using the Gravity Counts method;
• Spellcheckers and grammar checkers "Juodos avys" and "Tildės biuras";
• Lemmatizer;
• Aligner for compiling parallel corpora;
• Automatic morphological annotator/tagger;
• Automatic accentuation tool;
• Autonomous translation system "Vertimo vedlys" (part of "Tildės biuras");
• On-line English-Lithuanian machine translation tool9;
• Semantic ontological annotation tool (prototype).

2 http://donelaitis.vdu.lt
3 http://www.coralit.lt/
4 http://www.vdu.lt/LTcourses/?pg=41&menu_id=112
5 http://www.lki.lt/
6 http://terminai.vlkk.lt/
7 http://www.terminynas.lt/
8 http://vertimas.vdu.lt
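At heart, the corpus query and concordance tools listed above are keyword-in-context (KWIC) extractors: for every hit of a query word they print an aligned window of left and right context. A minimal sketch of the idea, run on an invented English sample (this is not the actual VMU tool):

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for every occurrence of `keyword`."""
    lines = []
    for m in re.finditer(r'\b' + re.escape(keyword) + r'\b', text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # pad so the keyword column lines up across hits
        lines.append(f'{left:>{width}}[{m.group(0)}]{right:<{width}}')
    return lines

sample = ("The corpus is queried on-line. A balanced corpus provides "
          "linguistic and statistical evidence; the corpus also serves "
          "cultural heritage.")
for line in kwic(sample, "corpus", width=20):
    print(line)
```

A production concordancer would of course work over an indexed, morphologically annotated corpus rather than a raw string, but the aligned-window output format is the same.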

Research in speech signal processing and synthesis is mainly done by researchers at Kaunas University of Technology (KUT), VMU, VU and IMI. The researchers at IMI have been developing and perfecting speaker identification and verification technology for many years. This technology has received a few successful implementations, such as the speaker identification system deployed at the Lithuanian Parliament (it recognizes the identity of hundreds of speakers). Recently IMI has turned to speech recognition research and developed a few software prototypes for the recognition of isolated digits using Dynamic Time Warping (DTW) and Hidden Markov modeling (HMM) techniques. The researchers at KUT developed a speech recognition prototype for recognizing voice commands. This group has been working on speech recognition, voice transmission over telephony and spoken dialog systems for many years. The research group at VMU has been focusing on large vocabulary continuous speech recognition (LVCSR) and language modeling, resulting in an HMM-based LVCSR prototype. Researchers of VU and KUT built the first two versions of the text-to-speech synthesizer "Aistis".

However, these technologies are far from mature. For instance, the word error rate achieved by the VMU LVCSR prototype for a vocabulary of 1 million word forms is about 25%. The naturalness of the synthesized speech should also be improved to suit mainstream practical applications. Though automatic accentuation and grapheme-to-phoneme conversion are already mature sub-components of text-to-speech (TTS) systems, phone duration models and especially prosody models still pose a great challenge.

The design of speech processing technologies is backed by the compilation of the necessary speech resources. The most important corpora developed with speech recognition in mind are: the LRN corpus of radio news compiled by IMI (10h, 23 speakers), the LTDIGITS corpus of digit sequences and voice commands compiled by KUT and VU (6h, 350 speakers), and the LAB50 corpus of read speech (50h, 50 speakers) compiled by VMU. VU has compiled a corpus of diphones for the needs of speech synthesis. VMU has recently compiled a corpus of spontaneous speech (10h, 18 speakers). LLI maintains the Archive of Lithuanian Dialects.
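The Dynamic Time Warping technique behind the isolated-digit prototypes finds the cheapest monotone alignment between two feature sequences of different lengths, so a slowly or quickly spoken digit still matches its template. A minimal sketch over 1-D feature sequences (real recognizers compare sequences of MFCC frames; the templates and utterance here are invented numbers):

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D feature sequences."""
    n, m = len(a), len(b)
    INF = float('inf')
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# A template-based recognizer returns the template with the smallest distance.
templates = {'one': [1.0, 2.0, 3.0, 2.0], 'two': [3.0, 1.0, 1.0, 0.0]}
utterance = [1.0, 1.9, 3.1, 3.0, 2.1]  # a time-stretched variant of "one"
best = min(templates, key=lambda w: dtw_distance(templates[w], utterance))
print(best)
```

HMM-based recognizers replace this template comparison with per-digit probabilistic models, which scale better to many speakers, but the alignment intuition is the same.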

2. HLT Classification Schemes – Resources, Tools, Services

Sarasola's typology of HLT resources [2] distinguishes the following five resource groups:

• so-called foundations, i.e. raw corpora, machine-readable dictionaries, speech databases;
• basic tools, such as statistical tools for corpus treatment, a morphological analyzer, generator and lemmatizer, and a speech recognition system dealing with isolated words;
• medium-complexity tools, such as spell checkers and a structured lexical database which includes multiword lexical units;
• advanced tools, including syntactically annotated corpora (treebanks), grammar and style checkers, lexical-semantic knowledge bases or concept taxonomies such as WordNet, word sense disambiguators, and speech processing tools functioning at sentence level;
• so-called multilinguality and general applications, including semantically annotated corpora, information retrieval and extraction applications, dialogue systems, language learning systems, and machine translation.

9 http://vertimas.vdu.lt

If we match this scheme to the existing Lithuanian HLT resources, we get the picture shown in Figure 1.

• Foundations – raw or untagged corpora: CCLL (160 mln words); CorALit (9 mln words); parallel bidirectional Lithuanian-Czech corpus (5 mln words); parallel English-Lithuanian corpus (18 mln words); Corpus of Spoken Lithuanian Language (200,000 words); Corpus of Lithuanian Dialects; DB of Old Lithuanian Writings. Lexicons and machine-readable dictionaries – explanatory: Dictionary of Lithuanian Language, Dictionary of Contemporary Lithuanian; translation: Alkonas, Anglonas, Frankonas, LED, WinLED, "Tildės biuras", Cobuild, Etoile, etc.; special: Lithuanian Term Bank, DB of digitalized term dictionaries from 27 branches, dictionary of international words Interleksis, Dictionary of Toponyms, DB of nominal collocations, etc. Speech databases: DB of speech recordings for the common Lithuanian language; Archive of Lithuanian Dialects.
• Basic tools: on-line corpus query tools (concordances, etc.); morphological analyzer and tagger; statistical tools (frequency lists, etc.); collocation extraction tools; automatic accentuation tool; automatic identification of text functions; morphologically annotated corpus of Lithuanian; universal annotated DB of speech recordings, etc.
• Medium-complexity tools: spellcheckers; morpho-syntactic analyzer; DB of Lithuanian nominal collocations.
• Advanced tools: grammar checkers; semantic-ontological annotation tool; experimental concept ontology for semantic-ontological mark-up.
• General applications: rule-based autonomous and on-line English-Lithuanian translation systems; learning tools (word analyser and synthesizer, etc.).

Figure 1. Lithuanian HLT resources according to Sarasola's typology

When analyzing the structural picture of Lithuanian HLT resources, the following conclusions can be formulated:

• Foundations, i.e. a collection of text and speech corpora, exist, though their expansion is still a task. Corpora and other resources are designed by different institutions and available only via specialized, individual access tools, or even inaccessible. In many cases, standards such as TEI P5 are underused.
• The collection of resources and tools is rather fragmented, pointing to a lack of coordination in their development.
• Obviously, there are too few language technology tools available, especially those belonging to the "advanced" and "general application" categories.
• So far, there have been only first attempts to create tools for semantic annotation and analysis, as well as to compile semantically annotated resources. This process should be much faster, taking into account the needs of the emerging Semantic Web as well as the requirements for semantic interoperability of e-government initiatives.
• Existing tools are mainly oriented towards researchers; few on-line tools are available for end users.

However, the drawback of Sarasola's typology is that it does not draw a clear separating line between resources and tools: annotated corpora, for example, are positioned not in line with other resources but in line with the corresponding tools used for annotation (namely, tools for syntactic or semantic analysis). Such a classification becomes inconvenient for describing composite services (applications), since in this case it is natural to separate resources, which are normally already available, from the tools that still need to be applied. For practical reasons, a "Resource-Tool-Application" HLT classification scheme is more convenient (Figure 2). Several HLT tool complexity levels can be included here; in most cases, however, two levels ("Basic tools" and "Complex tools and service components") are sufficient.

• Applications/Services: 1) institutional (for researchers); 2) on-line (all users). E.g. search, machine translation, dialogue systems, etc.
• Complex tools: diverse analysis tools (information extraction, syntactic, semantic); synthesis tools (text and speech); service components.
• Basic tools: tools for resource compilation; annotation tools (formatting, morphosyntactic, semantic annotation).
• Resources: corpora, speech databases, vocabularies and ontologies, lexical databases, etc. 1) institutional; 2) national.

Figure 2. "Resource-Tool-Application" HLT classification scheme

Such a scheme is convenient for evaluating the needs for resources and tools of different complexity levels when designing applications and services. Figure 3 presents a fragment of the architecture of Lithuanian HLT resources, tools and applications defined using this scheme. A systematic approach towards tools and resources makes it possible to track the different sets or combinations that enable users to reach their goals in different ways; these are highlighted in Figure 3. Circles by the corresponding tools and services mark the HLT resources and tools necessary for constructing a press (media) monitoring application. Such a classification scheme can help to compare several implementation alternatives for the same application or service, based on their complexity, depending on the availability and complexity of their constituent parts. Also, such a scheme could be useful for prioritizing the compilation of new HLT resources or tools by visualizing their need in application and service plans.
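The constituent tracking described above can be sketched as a simple data model: an inventory of available assets per layer, application requirements per layer, and a set difference that reveals what still has to be built. The inventory and application entries below are illustrative placeholders, not an actual registry of Lithuanian HLT assets:

```python
# Hypothetical inventory following the Resource-Tool-Application scheme;
# the names are invented for illustration.
inventory = {
    'resources': {'raw corpus', 'lexical database'},
    'basic tools': {'morphological tagger', 'formatting annotator'},
    'complex tools': {'information extraction', 'text synthesis'},
}

# Each planned application lists the constituents it needs, per layer.
applications = {
    'press monitoring': {
        'resources': {'raw corpus', 'news feed archive'},
        'basic tools': {'morphological tagger'},
        'complex tools': {'information extraction', 'topic detection'},
    },
}

def missing_constituents(app):
    """Report which required parts are not yet in the inventory."""
    needed = applications[app]
    return {layer: sorted(needed[layer] - inventory.get(layer, set()))
            for layer in needed
            if needed[layer] - inventory.get(layer, set())}

print(missing_constituents('press monitoring'))
```

Comparing the reports for several candidate applications is exactly the kind of prioritization exercise the scheme is meant to support: the application with the shortest list of missing constituents is the cheapest to realize.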

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited


R. Marcinkevičienė and D. Vitkutė-Adžgauskienė / Developing the HLT Infrastructure in Lithuania

Figure 3. Architecture of Lithuanian HLT resources, tools and applications

3. Strategic issues in building the HLT infrastructure


The following strategic issues are addressed at the institutional and national levels in building the Lithuanian HLT infrastructure:

1. Standardization of HLT resources and tools
2. Design of a federated system for joint HLT resource access and reuse
3. Rational planning of the most needed HLT resources and tools as well as their implementation strategies
4. Alignment with European initiatives

The need for standardization of language resources and tools becomes more and more important, considering the possibilities for combining them when designing modern applications and e-services, participating in large-scale international projects, and using open-source and other available tools for corpus analysis, annotation, search, etc. Currently, the two main standardization activities in the design of Lithuanian HLT resources and tools are:



• Adaptation of the largest corpora (CCLL and CorALit) to the TEI (Text Encoding Initiative) P5 encoding standard. Before selection, TEI P5 was compared to XCES (XML Corpus Encoding Standard), considering different implementation and future development options.
• On-line access to the functionality of the main HLT tools via a SOAP web service interface.
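As background to the first point, a TEI P5 document carries its metadata in a `teiHeader` element. The sketch below builds a minimal skeleton with Python's standard library; the titles and placeholder text are invented for illustration and are not the actual CCLL or CorALit headers:

```python
import xml.etree.ElementTree as ET

# Minimal TEI P5 skeleton; a valid fileDesc also needs publicationStmt
# and sourceDesc, so both are included with placeholder text.
TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

def el(parent, name, text=None):
    """Create a child element in the TEI namespace."""
    node = ET.SubElement(parent, f"{{{TEI_NS}}}{name}")
    node.text = text
    return node

def tei_skeleton(title):
    tei = ET.Element(f"{{{TEI_NS}}}TEI")
    header = ET.SubElement(tei, f"{{{TEI_NS}}}teiHeader")
    file_desc = el(header, "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Unpublished sample")
    el(el(file_desc, "sourceDesc"), "p", "Born-digital sample")
    el(el(tei, "text"), "body")  # empty body to hold the corpus text
    return tei

doc = tei_skeleton("Sample corpus document")
print(ET.tostring(doc, encoding="unicode"))
```

Registering the TEI namespace as the default makes the serialized output use unprefixed element names, as is conventional in TEI documents.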

Standardization is the first step in designing an open national technological infrastructure for wide access and joint use of Lithuanian HLT resources. The national e-Lingua project addresses these issues. E-Lingua is conceived as a common and highly interoperable virtual system of resources accessible via one main site. Resource aggregation is planned using a federation-oriented infrastructure. Linguistic resources can be stored both in centralized and in institution-specific digital repositories. Resources should comply with the corresponding standards (e.g. TEI P5). Unified resource access should be implemented using a standardized web service mechanism for interoperable machine-to-machine interaction over the Internet. Open access to resources and tools is to be promoted wherever possible, in combination with flexible intellectual property protection mechanisms applied by resource owners upon request. An important issue when implementing certain HLT tools is the selection between two main strategies: designing from scratch or adapting universal or other-language-specific tools. In general, the opinion is that the compilation of language-specific tools should be based on universal tools and their adaptation wherever possible, in order to avoid reinventing the wheel. However, where so-called universal and language-independent tools are based on probabilistic models of a prevailing language (usually English), such tools in many cases do not generalize easily to other languages [3]. The rule-based MT system implemented by the Centre of Computational Linguistics at VMU is an example of such a situation. Soon after its implementation, a statistical MT tool was presented by Google, followed recently by a similar tool from Microsoft. Had this been known in advance, the compilation of a rule-based MT system could have been postponed, since, from the point of view of a small language, duplication of tools is a waste of time.
However, deeper research shows that in many cases a rule-based MT system gives better results than the universal multilingual systems. The most important development and support of resources is foreseen in the framework of the National Research Infrastructure (NRI), compatible with ESFRI requirements for national states. The strategy of the NRI includes documentation and unification of existing national resources as well as support for trans-national initiatives such as CLARIN, CESSDA and other similar joint infrastructures for the Social Sciences and Humanities (SSH). National support for research infrastructures in general and HLT in particular is timely, since "SSH researchers rely on new technologies, and real overhead costs for SSH research have increased dramatically over the past 20 years, without government subsidies necessarily reflecting these changes. Consequently, more and more SSH research depends on capital injections to develop cutting edge data sets and develop retrieval systems" [4]. While the processes of joining CLARIN at the national level are rather slow due to decision and financing problems, first steps are easier to implement at the institutional level by accomplishing the following actions: 1) institutional CLARIN membership (LLI and VMU agreements);




2) establishment of CLARIN C- and B-type centres by preparing and opening access to a CLARIN-compatible metadata system for HLT resources and the corresponding services. These centres can be established at the institutional level or using federation principles.

4. Conclusion

To sum up, the Lithuanian HLT community, which started from scratch two decades ago, has advanced in creating tools and resources. In that way it has contributed to the preservation of the Lithuanian language in digital form. However, the existing tools and resources have to be made compatible and accessible as one national HLT infrastructure. New advanced tools and resources have to be created to fill in the gaps. Moreover, the national infrastructure has to be integrated into EU and transnational networks in order to enable multilingual HLT applications. The major strategic steps towards the ultimate goal are as follows: become digital, become standard and integrated, apply tools and communicate. The goal is to enable native speakers to use their native tongue for interaction with the computer. The computer has to be flexible enough to deal with both written and spoken language, to process raw and unrestricted texts, and to get information from multilingual sources. The role of the native language cannot be overestimated in the support of human-oriented information services, particularly those based on communication through language.

References

[1] R. Marcinkeviciene, Two Decades of Lithuanian HLT, Proceedings of the 17th Nordic Conference of Computational Linguistics, Odense, Denmark, 2009.
[2] K. Sarasola, Strategic priorities for the development of language technology in minority languages, LREC 2000, Proceedings of the Second International Conference on Language Resources and Evaluation, workshop "Developing language resources for minority languages: re-usability and strategic priorities", Athens, Greece. ELRA (2000), 106-109.
[3] L. Borin, Language technology resources for less prevalent languages: will the Münchhausen model work?, Nordisk Sprogteknologi, København, Denmark (2004), 71-82.
[4] Emerging Trends in Socio-economic Sciences and Humanities in Europe. The METRIS Report, 2009.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-11


National Programme for Estonian Language Technology: a Pre-final Summary

Einar MEISTER a,1, Jaak VILO b and Neeme KAHUSK c
a Institute of Cybernetics at Tallinn University of Technology
b Institute of Computer Science, University of Tartu
c Institute of Computer Science, University of Tartu

Abstract. The National Programme for Estonian Language Technology (2006-2010) was launched in 2006 and is approaching its end in 2010. This paper gives an overview of the programme and its projects, covering different areas of language technology; future prospects are also discussed.
Keywords. Human Language Technology, Estonian


Introduction

The report "Human Language Technology for Europe", compiled within the TC-STAR project and published in 2006, states that for Europe, Human Language Technology (HLT) is an economic, political and cultural necessity, since the European Union is a multilingual society by design [1]. Although all EU languages are declared equal, the report states that there are primary, secondary and even tertiary languages of commercial relevance; especially languages with a small number of speakers are at a disadvantage. In recent years, several EU-level activities have been initiated in order to promote the development and wider use of HLT, for example, the CLARIN project (http://www.clarin.eu) for creating a European-wide infrastructure for common language resources, and the FLaReNet project (http://www.flarenet.eu), aiming at developing a common vision of the field of language resources and technologies and fostering a European strategy for consolidating the HLT sector. However, the cost of multilingualism is enormous for the EU (23 official languages, 506 language pairs), and the necessary investments in HLT development for all official languages should be shared between the European Commission and the Member States, in full agreement with the concept of "subsidiarity" [2]. In several EU countries, diverse national-level initiatives have been undertaken in order to facilitate and coordinate research and development of HLT for national languages, e.g., in France [2], the Netherlands [3], Sweden [4], etc.; some effort has also been made for languages without a national state, e.g., Catalan [5]. In addition, some initiatives have recently been reported from India [6] and South Africa [7].

1

Corresponding Author.

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited


E. Meister et al. / National Programme for Estonian Language Technology: A Pre-Final Summary

There are at least two areas which should be developed mainly at the national level: the creation of language resources and the training of language technologists [8]. In Estonia, the National Programme for Estonian Language Technology (2006-2010) (NPELT) was launched in 2006. Its main goal is to develop technology support for the Estonian language to a level that allows Estonian to function in the modern information society. NPELT is funding HLT-related R&D activities, including the creation of reusable language resources and the development of essential linguistic software (up to working prototypes), as well as bringing the relevant language technology infrastructure up to date. The resources and prototypes funded by the national programme are declared public.

1. Management

NPELT management is carried out by a steering committee of 9 members (including HLT experts and representatives of the ministries) and a programme coordinator. The responsibilities of the steering committee include the evaluation of project proposals and progress reports, making funding proposals, ensuring the purposeful use of public funding, surveying developments in the HLT field on the national and international scale, etc. The general rules adopted by the committee are:

• financing of projects is based on open competition,
• groups are requested to provide annual progress reports,
• evaluation of projects is based on well-established criteria,
• international standards/formats need to be followed,
• the developed prototypes and language resources should be in the public domain; in exceptional circumstances, access can be based on clear license agreements.


1.1. Project evaluation criteria

Two types of evaluation criteria have been developed: (1) criteria for new project applications, and (2) criteria to assess the annual progress of on-going projects. The funding decision on a new project is based on the average ratings of eleven features (sub-criteria), including the relevance of the proposal in the context of the programme, the methods applied to achieve the goals of the project, the competence and experience of the project team, the usefulness of the project results for other projects, etc. In the case of on-going projects, the evaluation is based on annual progress reports, which should provide detailed information on how well the project has proceeded; where possible, objective measures are applied (mainly in the case of resource projects).

2. Financing of the programme

The programme is financed from the government budget: ca 0.5 M€ per year in 2006 and 2007, ca 1.1 M€ in 2008, and ca 0.8 M€ per year in 2009 and 2010. According to the guidelines of the programme, 33% of the total financing should be used for projects focused on the development of language resources, and 66% for research and software development; administration costs are limited to 1%.




3. Supported projects

The number of funded projects has increased slightly year by year: 2006 – 17 projects, 2007 – 20 projects, 2008 and 2009 – 23 projects, and 2010 – 24 projects. Most of the projects are long-term projects spanning from 2006 to 2010; a few short-term projects (1-2 years) have been funded, too. The projects cover a wide range of topics (see http://www.keeletehnoloogia.ee/projects):
• speech corpora: emotional speech, spontaneous speech, dialogues, L2 speech, radio news and talk shows, etc.;
• text corpora: written language corpus, multi-lingual parallel corpora, resources for interactive language learning, etc.;
• research/technology development: speech recognition, speech synthesis, machine translation, information retrieval, lexicographic tools, syntactic analysis, semantic analysis, dialogue modeling, rule-based language software, an intelligent search engine, variations in speech production and perception, etc.

4. Research groups involved in NPELT


There are three key players working in the field of HLT in Estonia:
(1) The University of Tartu, represented by three groups: the Research Group on Computer Linguistics, phonetics, and bioinformatics. Their projects are focused on:
• morphology, syntax, semantics, and machine translation,
• corpora of written and spoken language, dialogue corpora, parallel corpora, lexical and semantic databases (thesaurus, Estonian WordNet), a phonetic corpus of spontaneous speech,
• rule-based language software, information retrieval, interactive Web-based language learning.
(2) The Institute of the Estonian Language, represented by the Research Group on Language Technology, is working on three projects:
• corpus-based speech synthesis for Estonian,
• the Estonian emotional speech corpus,
• lexicographic tools.
(3) The Institute of Cybernetics at Tallinn University of Technology, represented by the Laboratory of Phonetics and Speech Technology, carries out three projects:
• automatic speech recognition in Estonian,
• variability issues in speech production and perception,
• speech corpora including radio news and talk shows, lecture speech, and foreign-accented speech.
In addition, other institutions and companies are responsible for single projects:
• Tallinn University – the Estonian Interlanguage Corpus,
• the Estonian Literary Museum – an electronic dictionary of idiomatic expressions,
• Filosoft – corpus query in the Estonian language website keeleveeb.ee,
• Eliko – a prototype of a Controlled Natural Language module for knowledge-based systems.




5. Future prospects

The national programme has created favorable conditions for HLT development in Estonia. Although the steering committee will analyze the achievements of NPELT at the end of 2010, it can already be stated that the programme has been successful and has fulfilled most of the expectations. The amount of re-usable language resources and software prototypes, as well as the new knowledge and experience created within NPELT, will serve as the technological basis for the development of innovative HLT applications in the coming years. Of course, not all HLT fields are equally covered by the programme, and it would be naive to expect that all essential prototypes and resources can be created within five years. Therefore, a second phase of the programme is under preparation; it should focus on the implementation and integration of the existing resources and software prototypes in public services.


References

[1] G. Lazzari, Human Language Technologies for Europe, ITC IRST/TC-Star project report, 2006.
[2] J. Mariani, Research infrastructures for Human Language Technologies: A vision from France, Speech Communication 51 (2009), 569–584.
[3] P. Spyns and E. D'Halleweyn, Flemish-Dutch HLT Policy: Evolving to New Forms of Collaboration, in Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), 2010.
[4] K. Elenius, E. Forsbom and B. Megyesi, Language Resources and Tools for Swedish: A Survey, in Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), 2010.
[5] M. Melero, G. Boleda, M. Cuadros, C. España-Bonet, L. Padró, M. Quixal, C. Rodríguez and R. Saurí, Language Technology Challenges of a 'Small' Language (Catalan), in Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), 2010.
[6] S. Lata and S. Chandra, Development of Linguistic Resources and Tools for Providing Multilingual Solutions in Indian Languages – A Report on National Initiative, in Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), 2010.
[7] A.S. Grover, G.B. van Huyssteen and M.W. Pretorius, The South African Human Language Technologies Audit, in Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), 2010.
[8] S. Krauwer, How to survive in a multilingual EU?, in Proceedings of the Second Baltic Conference on HLT, Tallinn, 61–66, 2005.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-15


Language Resources and Technology for the Humanities in Latvia (2004–2010)

Inguna SKADIŅA a,b, Ilze AUZIŅA a, Normunds GRŪZĪTIS a, Kristīne LEVĀNE-PETROVA a, Gunta NEŠPORE a, Raivis SKADIŅŠ b, Andrejs VASIĻJEVS b
a Institute of Mathematics and Computer Science, University of Latvia
b Tilde

Abstract. The last six years have been very important for the research and development of language technologies in Latvia. Several large projects have been funded by the government of Latvia, important tools and resources have been created by industry, and since 2006 Latvia has participated in the CLARIN initiative. Although there is still a gap in language resources and technology (LRT) between Latvian and the more widely used languages, the current LRT for Latvian can already serve as a basic research infrastructure for the Humanities. The paper presents an overview and the current status of LRT in Latvia. Special attention is paid to the CLARIN project and its role for the humanities in Latvia.
Keywords. Latvian language, Language resources and technology, Machine translation, Text corpora, CLARIN, Research infrastructure


Introduction

Human Language Technologies (HLT) have a long history in Latvia, beginning at the end of the 1950s with statistical analysis, frequency dictionaries, and a prototype of a machine translation system. In the 1970s the main fields of study were computational morphology, statistical analysis, and the computer-aided development of dictionaries. Research and development was interrupted from the mid-1970s until the mid-1980s, after which it resumed and has continued until today. While the research results from the earlier period are now accessible only through scientific publications, many resources and tools developed since the mid-1980s have been collected and are available on the Web. An overview of HLT in Latvia from 1988 till 2004 has been presented at two previous Baltic language technology events: the seminar "Language and Technology in Europe 2000" in 1994 [1] and the first Baltic conference on Human Language Technologies in 2004 [2]. Here we present an overview of the current situation in the field of HLT in Latvia, particularly emphasizing steps in building a common, integrated and reliable research infrastructure.

1. Language Policy, Main Programmes and Projects

Language resources and tools have an important role in the State language policy, which is defined in two major documents: "Guidelines of the State Language Policy for 2005–2014" and "The State Language Policy Programme for 2006–2010". Several tasks of the programme are directly related to language technologies (LT):
• provide financial and administrative support to research in computational linguistics for the Latvian language;
• organize and create a modern computer-aided Latvian language database and ensure its wide usage; the results of this task should be corpora of the Latvian written and spoken language, tools for corpora management and lexicography, and standards and schemas for lexical and other data;
• ensure the development of Latvian terminology, the creation of terminological databases and dictionaries, terminology harmonization, and international cooperation in terminology development;
• ensure education in computational linguistics at Latvian universities.
This paper describes the progress achieved in implementing all these tasks by describing the most important initiatives, projects and resources of the last six years. The only exception is the lack of substantial progress in the education field, where there is still no dedicated programme in computational linguistics (CL) in Latvia.


1.1. Latvian Council of Science and State Research Programmes

The Latvian Council of Science (LCS) was founded by the Latvian Council of Ministers in 1990. One of the central tasks of the LCS is the advancement, evaluation, financing, and coordination of research in Latvia. The LCS, together with the Ministry of Education and Science, prepares proposals for the science budget and distributes funding among the branch commissions of the various fields of science. Significant funding from the LCS was received between 2005 and 2009, when two HLT-related projects were authorized as components of the State Research Programmes "Scientific Foundations of Information Technology" and "Latvian Studies (Letonica): Culture, Language and History". The SemTi-Kamols project (www.semtikamols.lv) aimed at the development and adaptation of semantic web technologies for the semantic analysis of Latvian (see Section 2.5). The project "Database of Latvian Explanatory Dictionaries and Recent Loanwords" mainly dealt with the semi-automatic transformation of the Dictionary of Standard Latvian Language into a machine-readable format (see Section 2.2). In addition, every year about 2-3 smaller projects related to HLT have been funded by the LCS. Since 2009, the LCS has encouraged the submission of larger projects; thus, in the last two years only one project has received funding. The following research projects have been supported during the last six years: "Evaluation of Statistical Machine Translation Methods for English-Latvian Translation System" (2005-2008), "Modeling of Universal Lexicon System for the Latvian Language" (2005-2008), "Historical Dictionary of the Latvian Language (16-18 centuries)" (2005-2008), "Methods for Latvian-English Computer Aided Lexicography" (2008), and "Application of Factored Methods in English-Latvian Statistical Machine Translation System" (2009-2012).

1.2. The Language Shore Initiative

Taking into account the importance of LT in ensuring the sustainable development of Latvian and other smaller languages, the Language Shore initiative was launched in 2009 under the patronage of the President of Latvia, Valdis Zatlers. This initiative fosters the creation of a partnership between government, academia, and industry to develop an international expertise cluster around LT in Latvia. In order to ensure the successful development of the initiative at the government level, a Language Shore Steering Group composed of five sector ministers was established. The first Language Shore pilot projects have been started by Tilde and Microsoft Research, increasing the speed of advancement in Latvian machine translation (MT), developing a new crowd-sourcing model for MT data collection, and establishing cooperation in terminology data sharing (www.valodukrasts.lv). Several Language Shore related projects in MT, speech technologies, content analysis and other LT fields are planned as part of the activities of the Latvian IT Competence Centre, which is being organized by the leading Latvian IT companies and universities.

2. Main Resources and Tools


2.1. Latvian National Corpus Initiative and Latvian Language Corpora Resources

The development of the Latvian National Corpus was initiated by the State Language Commission in 2004. The theoretical and practical work of the Institute of Mathematics and Computer Science (IMCS) of the University of Latvia was supported, and a working group was established to facilitate cooperation. As different resources have been collected by a number of institutions, the Latvian National Corpus Initiative envisions establishing an umbrella for all the available corpora of the Latvian language. An Agreement of Intent between the main language resource developers and holders, both academic and industrial, has been signed, and the next practical steps are being discussed. During the last six years three corpora have been developed at IMCS (see www.korpuss.lv). The Balanced Corpus of Modern Latvian (~3.5 million running words) has been compiled from printed and electronic materials created after 1990. It is automatically morphologically tagged: for each token, all the syntactically valid interpretations are stored. The Web Corpus (~100 million running words) consists of texts from Latvian web pages. The corpus has been automatically tagged; however, only one random interpretation for each morphologically ambiguous token is kept [3]. The Corpus of the Transcripts of the Saeima's (Parliament of Latvia) Sessions (more than 20 million running words) contains transcripts of the plenary sessions of the 5th to 9th Saeima (up to July 2009). The corpus is structurally marked: information about speakers, session type, date, etc. is added. Since 2006, the National Library of Latvia has been working on the creation of the Latvian National Digital Library "Letonica" (www.lnb.lv/en/digital-library). Currently the Digital Library holds collections of newspapers, pictures, maps, books, sheet music and audio recordings.
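The two tagging policies mentioned above (all syntactically valid readings per token vs. one random reading per ambiguous token) can be sketched as follows; the tokens and tags below are invented for illustration and do not reflect the actual IMCS tagset:

```python
import random

# Hypothetical morphological analyses; tags are invented for illustration.
analyses = {
    "ceļa": ["noun,gen,sg", "verb,pres,3"],   # morphologically ambiguous token
    "upe":  ["noun,nom,sg"],                  # unambiguous token
}

def tag_balanced(tokens):
    """Balanced Corpus policy: store every syntactically valid interpretation."""
    return {t: list(analyses[t]) for t in tokens}

def tag_web(tokens, seed=0):
    """Web Corpus policy: keep one random interpretation per token."""
    rng = random.Random(seed)
    return {t: rng.choice(analyses[t]) for t in tokens}

print(tag_balanced(["ceļa", "upe"]))
print(tag_web(["ceļa", "upe"]))
```

The balanced-corpus policy preserves ambiguity for later disambiguation research, while the web-corpus policy trades accuracy for compactness over a much larger text volume.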
The collection Periodicals (periodika.lv) offers 40 newspapers and magazines in Latvian, German, and Russian from 1895 to 1957 (more than 350 000 pages).

2.2. Electronic Dictionaries and Terminology Resources

Several machine-readable versions of monolingual dictionaries of modern Latvian have been created by IMCS in cooperation with other research institutions. The Dictionary of Standard Latvian Language (www.tezaurs.lv/llvv) is a machine-readable version of the largest Latvian monolingual dictionary of the second half of the 20th century (~64 000 entries in 8 volumes). The content of the machine-readable dictionary fully corresponds to the content of the original dictionary, but it also contains detailed structural annotations; hence it is easily transformable into different user-friendly presentation formats, including a mobile version (www.tezaurs.lv/wap). The Explanatory Dictionary (www.tezaurs.lv/sv) has been created with the goal of providing an explanation for every word that can be found in Latvian texts. Currently the dictionary contains more than 150 000 entries from about 120 Latvian dictionaries of different times and domains. Using the register of unsuccessful search queries, words that are not found in the existing dictionaries are constantly being added to this dictionary. A new Dictionary of Modern Latvian is being compiled at the Latvian Language Institute (University of Latvia). The parts of the dictionary that have already been completed can be found at www.tezaurs.lv/mlvv (~20 000 entries from A to Ļ). Tilde's electronic dictionaries are provided as part of the Tildes Birojs software suite and as an online resource in the reference portal letonika.lv. They are integrated into Microsoft Office applications, a comprehension assistant, an MT system and other applications. Altogether there are dictionaries for 20 translation routes: from English, French, German and Russian into Latvian and Lithuanian and vice versa, as well as Latvian-Lithuanian, Lithuanian-Latvian and Estonian-Latvian. Tilde provides not only general dictionaries but also more than 40 terminological dictionaries. The Terminology Commission of the Latvian Academy of Sciences publishes official terminology in two large online databases: www.termnet.lv and termini.lza.lv/akadterm.
EuroTermBank portal [4] (www.eurotermbank.com) hosted by Tilde is a PanEuropean term bank providing a consolidated interface to comprehensive multilingual terminology resources on the Web. It enables searching almost 2 million terms in over 25 languages. Under the term bank federation principle, it provides a single access point to the central database along with interlinked national and international term banks, consolidating terms from such major collections as IATE, WebTerm, Microsoft Terminology Collection, Terminology database of the Latvian Terminology Commission, and others. The Terminology Server for European Centre for Disease Prevention and Control (http://ecdc.europa.eu) developed by Tilde is a specialized solution for terminology in the medical and legal domains. It is an ontology-based solution for the creation, maintenance and dissemination of terminology used by human and machine users. 2.3. Machine Translation The rule-based approach to machine translation has been dominant in Latvia since mid90-ies when the first version of the LATRA system has been developed at IMCS [5]. Research on rule-based systems continued at IMCS until 2004 by elaborating LATRA with semantic properties and by adapting it to new domains. Tilde also has worked on a rule-based approach to develop a commercial system for users who have poor or no foreign language skills. The MT system 7LOGHV7XONRWƗMV[6] has been released in 2007 as part of Tildes Birojs 2008. The system translates texts from English into Latvian and from Latvian into Russian. Research on Statistical Machine Translation (SMT) was started by IMCS with a LCS funded SURMHFW ³Evaluation of SMT Methods for English-Latvian Translation

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited

I. Skadiņa et al. / Language Resources and Technology for the Humanities in Latvia (2004–2010)


System" (–2008), through which the baseline English-Latvian system was created [7]. The system's performance in BLEU points was similar to that of other systems for inflected languages at the time. IMCS research on SMT continues with the project "Application of Factored Methods in English-Latvian SMT System" [8]; the latest version of the system is available on the Web at http://eksperimenti.ailab.lv/smt. In 2009 Tilde started development of an English-Latvian SMT system. Besides publicly available resources, internal resources collected over time have been used for SMT training. This year a Latvian-English SMT system has been developed as well. Both systems are publicly available at http://translate.tilde.com [9]. Two SMT-related EU projects, the ICT PSP programme project LetsMT! (www.letsmt.eu) and the FP7 project ACCURAT (www.accurat-project.eu), both coordinated by Tilde, were started in 2010. The LetsMT! project aims to build an innovative online collaborative platform for data sharing and MT generation. This platform will support the uploading of MT training data and the building of multiple customized MT systems. The ACCURAT project researches novel methods that exploit comparable corpora to compensate for the shortage of linguistic resources, improving MT quality for under-resourced languages and narrow domains [10][11].

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

2.4. Speech Technologies

Several research projects in speech technologies have been carried out during the last six years, resulting in three speech synthesis systems that have achieved the level of practical usability: Visvaris (Tilde), T2S (IMCS) and Balss (SIA Rubuls & Co). In 2005 Tilde, together with the Association of Blind People, started a project to develop a Latvian text-to-speech (TTS) system [12]. The primary goal was to address the needs of visually impaired people using computers in Latvian. The architecture of the system covers the traditional TTS pipeline, performing text normalization, grapheme-to-phoneme conversion, prosody generation, and waveform synthesis. IMCS has had several projects devoted to experimental TTS [13] and speech recognition systems. A demonstration version of the TTS system is available at http://runa.ailab.lv/tts2. In the project "Applications of Latvian Language Speech Synthesis and Analysis in Call Centres", financed by Lattelecom BPO, the speech synthesis system was improved and an experimental speech recognition module for isolated words was created. There has not yet been any serious research in Latvian speech recognition that could result in a practically usable speech recognition system; only some individual experiments have been carried out in sound and isolated-word recognition.

2.5. Tools for Natural Language Processing

2.5.1. Morphology Tools

In recent years, IMCS has consolidated its previous experience and developments in morphological analysis [2] by creating a new robust analyzer [14] and a synthesizer. The lexicon-based analyzer (currently covering ~50 000 lemmas, together with a rich set of derivational rules) is available as a Java library from the SemTi-Kamols project site. It is used in research projects by several academic and commercial users (including some outside Latvia). On top of the morphological analyzer, a statistical tagger (http://eksperimenti.ailab.lv/tagger) has been built.
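As a rough illustration of how a lexicon-based analyzer of this kind operates, the sketch below splits a word into a known stem plus a licensed ending. The tiny lexicon, paradigm names and ending tables are invented for illustration only; they are not taken from the actual IMCS or Tilde tools, which cover tens of thousands of lemmas and far richer paradigms.

```python
# Toy lexicon-based morphological analysis/synthesis.
# All entries below are illustrative, not from the real analyzers.

# stem -> (lemma, paradigm name)
LEXICON = {
    "gald": ("galds", "noun-m1"),   # 'table' (1st declension masculine)
    "las":  ("lasīt", "verb-1"),    # 'to read'
}

# paradigm -> {ending: morphological tag}
ENDINGS = {
    "noun-m1": {"s": "sg-nom", "a": "sg-gen", "am": "sg-dat", "u": "sg-acc", "i": "pl-nom"},
    "verb-1":  {"u": "pres-1sg", "a": "pres-3", "īt": "inf"},
}

def analyze(word):
    """Return all (lemma, tag) readings obtained by splitting the word
    into a known stem plus an ending licensed by the stem's paradigm."""
    readings = []
    for i in range(1, len(word) + 1):
        stem, ending = word[:i], word[i:]
        if stem in LEXICON:
            lemma, paradigm = LEXICON[stem]
            tag = ENDINGS[paradigm].get(ending)
            if tag:
                readings.append((lemma, tag))
    return readings

def synthesize(lemma, tag):
    """Inverse direction: generate the surface form for a lemma and tag."""
    for stem, (lem, paradigm) in LEXICON.items():
        if lem == lemma:
            for ending, t in ENDINGS[paradigm].items():
                if t == tag:
                    return stem + ending
    return None
```

The real systems additionally apply derivational rules and guess analyses for out-of-lexicon words, which this sketch omits.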


The synthesizer, which can also be used as an analyzer, is a database containing all possible forms of the headwords from the Dictionary of Standard Latvian Language. It is available as a CLARIN standards-compliant web service at http://valoda.ailab.lv/ws. Tilde has also developed a Latvian morphological analyzer, synthesizer, and stemmer. These tools are used in many practical applications, e.g., electronic dictionaries, grammar checkers, MT systems, and search engines developed by Tilde, and are also licensed to other developers. An online interface for word analysis and synthesis is provided as part of the reference portal letonika.lv. A statistical Latvian morphological tagger was built on top of the morphological analyzer in 2008.

2.5.2. Syntactic Parsers

A novel dependency-based syntactic representation and a corresponding rule-based parser were created [15] in the SemTi-Kamols project. The current version of the grammar covers simple extended sentences and is primarily intended for chunking. The parser has been used for the automatic tagging of several Latvian text corpora (see Section 2.1) and is integrated into a semi-automated corpus annotation tool. A Latvian syntactic parser was built by Tilde in 2007. Its formal grammar is derived from unification grammar: rules consist of two parts, a context-free description of the syntactic structure and usage conditions describing constraints and allowing morphological features to be assigned or passed. The parser is based on the CYK algorithm. The first version was designed for shallow parsing; work is ongoing on the next version, designed for deep parsing to serve the needs of a grammar checker.
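For reference, the following is a minimal CYK recognizer over a toy grammar in Chomsky normal form, illustrating the parsing scheme named above. The grammar and vocabulary are invented for illustration; the actual Tilde parser additionally attaches unification-style usage conditions to each context-free rule, which are omitted here.

```python
# Minimal CYK recognition over a toy CNF grammar (illustrative only).
from itertools import product

# Binary rules: (rhs1, rhs2) -> set of possible left-hand sides
BINARY = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}
# Lexical rules: word -> set of preterminals
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "sees": {"V"}}

def cyk(words, start="S"):
    """Return True iff the word sequence is derivable from `start`."""
    n = len(words)
    # table[i][j] holds the nonterminals deriving words[i:j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        table[i][i] = set(LEXICAL.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for a, b in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY.get((a, b), set())
    return start in table[0][n - 1]
```

A shallow-parsing variant would read chunk labels out of the table's shorter spans instead of requiring a full derivation of `S`.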


3. Common Language Resources and Technology Infrastructure (CLARIN)

The aim of the CLARIN initiative is to establish an integrated and interoperable research infrastructure of language resources and technology by overcoming fragmentation and by offering a stable, accessible and extendable infrastructure for eHumanities [16]. Both IMCS and Tilde have been members of the CLARIN initiative since 2006. The preparation phase (2008 – mid-2011) of CLARIN is supported through the FP7 Research Infrastructures project, which IMCS joined on April 1, 2009. Latvia's participation in the CLARIN project has been supported by the Ministry of Education and Science of the Republic of Latvia since the autumn of 2008. The advancement of CLARIN is mentioned in strategic documents on the development of science in Latvia. Recently the Cabinet of Ministers approved the "Action Plan for Implementation of Guidelines for Science and Technology Development"; one of the subtasks of the Action Plan is to ensure the participation of research institutions in the CLARIN project. Prior to the CLARIN initiative, IMCS contributed to the CLARIN aims not only by collecting and preserving linguistic resources and making them publicly available, but also by cooperating with other research organizations in creating resources and by being the web publisher and maintainer of resources created at other research institutions. Therefore, after joining the FP7 CLARIN project, IMCS was appointed the CLARIN National Contact Point (www.clarin.lv) by the Ministry of Education and Science.


To prioritize the goals and tasks of the CLARIN project in Latvia and to facilitate the creation of the CLARIN infrastructure, the CLARIN National Advisory Board was established and approved by the Ministry of Education and Science. The Advisory Board consists of 17 members from universities, research institutes, government organizations and enterprises. Its tasks include setting priorities and providing recommendations related to the goals of the CLARIN project in Latvia. As a partner in the FP7 CLARIN project, IMCS contributes to tasks related to the technical infrastructure, the Language Resource and Technology overview, IPR and business models [17]. The long-term intention of IMCS is to become a CLARIN-conformant national-level service and metadata providing centre. For the preparatory phase, however, IMCS has selected some existing resources and tools that can be integrated rather rapidly into the CLARIN infrastructure (see http://valoda.ailab.lv/ws/). IMCS also works on the creation of a reliable identity federation. An experimental identity federation, LAIFE, based on the information system LUIS of the University of Latvia, has been set up by Sigmanet and Lanet. IMCS participated in the creation of the CLARIN LRT overview by collecting and analyzing information about tools and resources developed in Latvia and by adding the collected information to the CLARIN LRT inventory. Currently 34 Latvian language resources (31 developed in Latvia) and 11 tools are included in the inventory. The CLARIN LRT inventory, the ELRA R&D Catalogue, the ACL Wiki and the LT-World registry served as a baseline for the creation of the CLARIN BLARK (http://www-sk.let.uu.nl/u/D5C-4.pdf). The CLARIN BLARK provides a good general overview of the LRT, the needs of the Humanities and the Humanities BLARK matrix. Since this document is based on an analysis of four LRT repositories, some well-known tools, e.g., commercial spelling checkers, are not included.
Thus, for a better understanding of the specific needs of the Humanities in Latvia, the creation of a Latvian BLARK has begun. The CLARIN National Advisory Board in Latvia has been involved in this task by setting priorities for creating the missing Latvian resources and tools.


4. Conclusions and Outlook

The last six years have been an active period in HLT development in Latvia. Strong advancement in Latvian HLT has been achieved by developing a number of important written and spoken resources, tools and technologies, as well as by establishing cooperation between researchers from academic institutions and companies at national and international levels. Thus we can conclude that the basic elements of a research infrastructure of language resources and technology have been established in Latvia. On the other hand, several important resources, such as a Latvian National Corpus, a Latvian WordNet and a Latvian Treebank, are still missing. Since Latvia lacks a dedicated national programme for HLT research and development, current research activities are fragmented and mostly organized around short-term projects, which complicates long-term inter-institutional cooperation and the development of larger resources. Another urgent problem is the lack of programmes in computational linguistics at Latvian universities: only one semester-long course is available, at Liepaja University, for master's and doctoral students. Targeted national research and development activities are urgently needed to fill these gaps in HLT development in Latvia.


Acknowledgements

We would like to thank our colleagues Everita Andronova, Indra Sāmīte and Andrejs Spektors for their support and consultations during the preparation of this overview. The research described in this paper has been supported by the Ministry of Education and Science, the Latvian Council of Science, the State Language Agency, and the EU FP7 and CIP ICT PSP Programmes.


References
[1] Language & Technology in Europe 2000, Reports of Seminar, Riga, November 10-11, 1994.
[2] E. Milčonoka, N. Grūzītis, A. Spektors, Natural Language Processing at the Institute of Mathematics and Computer Science: 10 Years Later, Proceedings of the First Baltic Conference 'Human Language Technologies – the Baltic Perspective' (2004), 6–12.
[3] J. Džeriņš, K. Džonsons, Harvesting National Language Text Corpora from the Web, Proceedings of the 3rd Baltic Conference on Human Language Technologies, Kaunas, Lithuania (2007), 87–94.
[4] S. Rirdance, A. Vasiljevs (eds.), Towards Consolidation of European Terminology Resources: Experience and Recommendations from the EuroTermBank Project, Riga, 2006.
[5] I. Greitāne, Mašīntulkošanas sistēma LATRA [The Machine Translation System LATRA], LZA Vēstis Nr. 3/4 (1997), 1–6.
[6] I. Skadiņa, R. Skadiņš, D. Deksne, T. Gornostay, English/Russian-Latvian Machine Translation System, Proceedings of the 3rd Baltic Conference on HLT (2008), 287–296.
[7] I. Skadiņa, E. Brālītis, Experimental Statistical Machine Translation System for Latvian, Proceedings of the 3rd Baltic Conference on HLT (2008), 281–286.
[8] I. Skadiņa, E. Brālītis, English-Latvian SMT: knowledge or data? Proceedings of the 17th Nordic Conference on Computational Linguistics NODALIDA, May 14-16, 2009, Odense, Denmark, NEALT Proceedings Series, Vol. 4 (2009), 242–245.
[9] R. Skadiņš, K. Goba, V. Šics, Improving SMT for Baltic languages with factored models, Proceedings of the Fourth Baltic Conference 'Human Language Technologies – the Baltic Perspective' (2010).
[10] A. Eisele, J. Xu, Improving Machine Translation Performance Using Comparable Corpora, Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. Applications of Parallel and Comparable Corpora in Natural Language Engineering and the Humanities (2010), 35–39.
[11] I. Skadiņa, A. Vasiļjevs, R. Skadiņš, R. Gaizauskas, D. Tufis, T. Gornostay, Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation, Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. Applications of Parallel and Comparable Corpora in Natural Language Engineering and the Humanities (2010), 6–14.
[12] K. Goba, A. Vasiļjevs, Development of Text-To-Speech System for Latvian, Proceedings of the 16th Nordic Conference of Computational Linguistics NODALIDA-2007 (2007), 67–72.
[13] I. Auziņa, Latvian Text-to-Speech System, Proceedings of the First Baltic Conference 'Human Language Technologies – the Baltic Perspective' (2004), 21–26.
[14] P. Paikens, Lexicon-Based Morphological Analysis of Latvian Language, Proceedings of the 3rd Baltic Conference on Human Language Technologies, Kaunas, Lithuania (2007), 235–240.
[15] G. Bārzdiņš, N. Grūzītis, G. Nešpore and B. Saulīte, Dependency-Based Hybrid Model of Syntactic Analysis for the Languages with a Rather Free Word Order, Proceedings of the 16th Nordic Conference of Computational Linguistics (2007), 13–20.
[16] P. Wittenburg et al., Resource and Service Centres as the Backbone for a Sustainable Service Infrastructure, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10) (2010), 60–63.
[17] I. Skadiņa, CLARIN in Latvia: current situation and future perspectives, Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Common Language Resources, May 14, 2009, Odense, Denmark, NEALT Proceedings Series, Vol. 5 (2009), 33–37.


Speech Technologies and Spoken Corpus


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-25


Estonian Emotional Speech Corpus: Culture and Age in Selecting Corpus Testers

Rene ALTROV1 and Hille PAJUPUU
Institute of the Estonian Language, Tallinn

Abstract. The Estonian Emotional Speech Corpus serves as the acoustic basis for emotional text-to-speech synthesis. Because the Estonian synthesizer is a TTS synthesizer, we started off by focusing on read texts and the emotions contained in them. The corpus is built on a theoretical model, and we are currently at the stage of verifying the components of the model. In the present article we give an overview of the corpus and of the principles used in selecting its testers. Some studies show that people who have lived longer in a certain culture can more easily recognize vocal expressions of emotion characteristic of that culture without seeing the speaker's facial expressions. We therefore decided not to use people under 30 years of age as testers of emotions in our theoretical model. We used two tests to verify the selection principles for the testers. In the first test, 27 young adults aged under 30 were asked to listen to 35 sentences and identify the emotion of each (joy, anger, sadness, neutral). We then compared the results with those of adults aged over 30. In the second test we asked 32 Latvians to listen to the same sentences, and then compared the results with those of the Estonians. Our analysis showed that younger and older testers, as well as Estonians and Latvians, perceive emotions quite differently. From these test results we can say that the selection principle for corpus testers – using people who are more familiar with Estonian culture – is acceptable.2


Keywords. emotion, vocal expression, ageing, perception of emotions, Estonian

Introduction

Work on the creation of the Estonian Emotional Speech Corpus (EESC) started in 2006 at the Institute of the Estonian Language within the framework of the National Program for Estonian Language Technology. The corpus serves as the acoustic basis for corpus-based emotional text-to-speech synthesis. As the Estonian synthesizer is a TTS synthesizer [1], we focused on read texts and the emotions they contain. The theoretical model of the EESC relies on research results on emotional corpora and emotions in general [2]. The corpus was created using the following principles:
1. Professional actors were not used because acted emotions are stereotypical and exaggerated and thus different from real-life emotional communication (see [3], [4]). Based on the presumption that listeners can recognize emotions in natural speech quite well, recordings of texts read by ordinary people were used (cf. [5]).

1 Corresponding Author: Researcher Extraordinary, Institute of the Estonian Language, Roosikrantsi 6, Tallinn, 10119 Estonia; E-mail: [email protected].
2 The study was supported by the National Program for Estonian Language Technology and the project SF0050023s09 "Modeling intermodular phenomena in Estonian".


R. Altrov and H. Pajupuu / Estonian Emotional Speech Corpus

2. The emotions contained in the corpus sentences were subjected to perception tests.
3. The corpus can be enlarged with readers, sentences and emotions and, in addition to its main function as an acoustic basis for corpus-based emotional speech synthesis, it can be used for other purposes such as research on spoken or written emotions.
4. The corpus is publicly accessible during all its development stages.


1. Corpus Principles

The EESC consists of six creation stages [2].
CHOICE OF EMOTIONS. The first step in corpus creation was to choose the emotion categories to be used. We decided to include in the corpus sentences expressing joy, anger and sadness and, to satisfy the needs of speech synthesis, neutral sentences.
CHOICE OF READING MATERIAL. As the main function of the Estonian TTS synthesizer is to read out journalistic texts, not to hold conversations, the corpus material came from the Estonian press. Instead of isolated sentences, we decided to record text passages, as the message of a passage helps the reader achieve the right emotional state (see [5]). We chose texts that covered a wide range of topics and tried to avoid colloquial style, keeping in mind that the main function of the synthesizer is to read out written texts. The next step was to ask a group of people to read the text passages quietly on their own and determine the emotions (joy, anger, sadness) contained in the passages. We included in the corpus the passages which contained identifiable emotions.
CHOICE OF READERS. Texts for the EESC were read by a non-actor, a woman with correct pronunciation and a pleasant voice. The pleasantness of her voice was assessed by listeners [6]. In corpus creation we followed the principle that the text itself elicits the emotions used to read it: we did not dictate to the reader which emotions to express but let her decide depending on the text.
CHOICE OF LISTENERS AND LISTENING TEST. Corpus passages were segmented into sentences. As it was up to the reader of the text to choose which emotions to express in each case, each corpus sentence was subjected to a listening test where listeners identified the emotions. A web-based user interface was used for testing. Testers listened to isolated sentences and decided which emotions they contained. They could choose between three main emotions – joy, anger and sadness – and neutrality.
Data on the testers covered sex, education, nationality, mother tongue, language of education and age. When the corpus was created, it was not at all certain whether emotions could be identified successfully by listening to recorded texts in non-acted Estonian without seeing facial expressions. To increase the success rate, we chose people older than 30 as testers of the corpus sentences, as research results show that people who have lived longer in a particular culture and have acquired culture-specific skills of expressing emotions are better at vocal emotion recognition (see [7]). That is why we excluded younger people from the group of testers. In the corpus, sentences are stored together with listening-test data. An emotion is considered identified when at least 51% of the participants of the listening test have recognized this emotion in a sentence.
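The 51% identification rule can be stated compactly. The sketch below is purely illustrative; the function name and data layout are not from the actual EESC software.

```python
# Majority-vote labeling of a corpus sentence from listening-test responses.
from collections import Counter

def label_sentence(responses, threshold=0.51):
    """responses: per-tester choices, e.g. ['joy', 'joy', 'neutral'].
    Returns the identified emotion, or None if no single emotion
    reaches the threshold share of responses."""
    if not responses:
        return None
    counts = Counter(responses)
    emotion, votes = counts.most_common(1)[0]
    return emotion if votes / len(responses) >= threshold else None
```

With this rule a sentence on which testers split evenly across the four categories receives no label and is not stored as an identified emotion.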


READING TEST. It is likely that vocal emotion recognition is influenced by the semantic context of the text, as any emotional text contains some emotionally marked words [3]. So far, corpus creators have not paid enough attention to the influence of content on emotion recognition. The issue of content influence is not particularly important when the readers of texts or sentences are told which emotions to express and the recordings are later subjected to listening tests to check whether listeners perceive the intended emotions. However, in the EESC the reader was not asked to express particular emotions; instead, the listeners were asked to identify the emotion of each sentence. As the corpus mainly serves as an acoustic basis for speech synthesis, it is important to distinguish sentences where the emotions are rendered by vocal expression only. To discover in which cases the emotion is rendered by voice alone and in which cases emotion recognition is influenced by the text, the same sentences that passed a listening test were subjected to a reading test with different testers. The results of the two tests, listening and reading, were compared (see Table 1), and the outcome (whether a sentence belongs to the group where emotions were recognized from voice only or to the group where content may have influenced the perception of emotions) was recorded in the corpus database.
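The listening-vs-reading comparison can be sketched as a simple decision rule. The precise criteria are given in Table 1; the version below, including its handling of conflicting judgments, is an illustrative assumption rather than the paper's exact procedure.

```python
# Hedged sketch of combining listening-test and reading-test outcomes.
# The real criteria live in Table 1; this decision rule is an assumption.

def classify_sentence(listening_emotion, reading_emotion):
    """listening_emotion: emotion identified from audio alone.
    reading_emotion: emotion identified from the written sentence alone
    (None or 'neutral' when readers found no emotion in the text)."""
    if reading_emotion in (None, "neutral"):
        # Listeners heard the emotion but readers saw none:
        # the voice alone carries it.
        return "voice only"
    if reading_emotion == listening_emotion:
        # The text alone suggests the same emotion:
        # content may have influenced recognition.
        return "content may have influenced"
    # Conflicting judgments: treated here as excluded (assumption).
    return "excluded"
```

The first two outcomes correspond to the two sentence groups recorded in the corpus database.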


Table 1. Principles of emotion classification in the corpus

CONTENT OF CORPUS. The corpus contains sentences expressing joy, anger and sadness, and neutral sentences (currently 579 sentences that have passed both a listening and a reading test). They have been divided into sentences in which the emotion is rendered by voice only and sentences in which emotion recognition may have been influenced by content. The sentences have been segmented and labeled into words and phonemes. The corpus, together with a technical description, can be found at http://193.40.113.40:5000/. The user interface is in Estonian, English, Finnish and Latvian. Currently the EESC is in a state where it is possible to verify the applicability of the decisions made during its creation. We have found proof that emotions can be

3 "Not sure" was added in the Reading menu for cases where the reader finds it hard to pick an emotion, feeling that the emotion depends more on how the sentence sounds when uttered.


identified in normal, non-acted speech, and we have studied the influence of content on emotion recognition [8]. In the present study we try to find out whether our decision to use corpus testers older than 30 and to leave out younger people is well-founded. We look at the relation between emotion recognition and the age and cultural/national background of the testers.


2. Age-Related Effects on Emotion Recognition

It is often presumed that as people get older, they become wiser and more experienced through interaction with other people, while their memories get worse and their cognitive abilities slow down. Age also influences a person's ability to understand a communication partner's emotional signals and cues [9]. Age-related studies mainly focus on the recognition of facially expressed emotions and show that as people get older, their ability to identify emotions, especially negative emotions, becomes impaired [9]–[13]. However, there has been little research into the recognition of vocal expression of emotion. Findings indicate that ageing does not necessarily impair the perception of all emotions, and that impairments do not all happen simultaneously [9], [12], [14]. Older people are less accurate at recognizing two negative emotions – anger and sadness [9], [11]. Although there is a small impairment in the perception of positive emotions as well, it only becomes noticeable after the age of 60. As to perceiving neutrality of expression, there are no striking differences between age groups [9]. While it is reasonable to assume that people aged 60 and over are likely to have difficulties recognizing different emotions, the latest research results show that this may actually happen earlier. For example, Paulmann et al. [12] established remarkable differences in vocal emotion recognition between middle-aged (aged 42–45) and younger (aged 23–25) people. Moreover, Mill et al. [9] discovered that young adults aged 21–30 are slightly less efficient in the identification of sadness and anger than younger people, and that the decline is already significant by the ages of 31–40. However, not all research results are comparable, as it is not always clear how the studies were conducted and the results obtained.
There are many unanswered questions, such as: what reading material was chosen; were professional actors or ordinary people used to express the sentence emotions; were the emotions contained in the sentences determined by the researchers, for example by asking readers to read neutral sentences with different emotions, or were the readers themselves able to decide which emotions to use? We believe that these factors have considerable influence on emotion recognition, as it has been determined, for example, that the recognition of certain emotions is facilitated by lexical context [12]. It is also easier to identify emotions expressed by actors, as these normally sound more stereotypical and thus differ from real emotional expression. In our study we try to find out whether older (aged 30–62) and younger (aged 20–28) people perceive emotions differently when the sentences they hear are read by a non-actor and the influence of content on emotion recognition has been excluded. We are interested in how justified our decision to use people over 30 as testers of the corpus emotions was. We made this decision assuming that people who have lived longer in a specific culture are better at vocal recognition of culture-specific emotions. We did not consider the possibility that the ability to recognize emotions may decline with age.


3. Material and Method

We composed a listening test of corpus sentences in which testers over 30 had identified joy (10 sentences), anger (10 sentences), sadness (10 sentences) and neutrality (5 sentences), and in which the influence of content on emotion recognition had been excluded (for the principles of sentence selection, see Table 1). Two groups of testers took the listening test. One group consisted of young Estonian adults (aged 20–28); the other consisted of Latvians. Latvians were used as testers to find out how important living in a specific culture is for being able to recognize emotions from voice. By comparing the results of the over-30 and under-30 groups and the Latvians, we can gain insight into how cultural experience influences emotion recognition. We chose Latvians because Estonian and Latvian cultures share similar values [15], and we can therefore assume that the two nations express emotions similarly and can recognize each other's emotions from voice alone. Both groups were asked to listen to isolated sentences without seeing the text and decide which emotion they heard in each sentence. The choices were joy, anger, sadness and neutrality. The listening tests were web-based and were carried out through a user interface for test creation connected to the corpus. The first group comprised 27 testers under 30 years of age with Estonian as their mother tongue (L1). The second group consisted of 32 Latvians. We compared the results of the two groups with each other and with those of the over-30 Estonians, using R for an Analysis of Variance (ANOVA).


4. Results and Discussion

Corpus sentences were labeled according to the emotion identifications of the adult testers (aged over 30). An emotion was considered identified when over 51% of the testers decided in favor of it. The young adult testers and the Latvian testers were also asked to determine the emotion of the 35 sentences previously identified by the adult testers. (For the identification percentages of sentences by each group, see http://urve.eki.ee:5000/table1_identification.doc.) To find out whether adult testers identify emotions differently from young adults and Latvians, we used ANOVA and logistic regression. For example, we used the following formula (Eq. (1)) to determine the influence of cultural background on emotion recognition:

anova(glm(outcome~nationality, family=binomial, data=t29), test="Chisq"), (1)

where t29 was the source data table, and outcome and nationality were columns of the table to which logistic regression was applied. (For the source data, see http://urve.eki.ee:5000/data.csv.) The results are given in Table 2. All three groups – adults, young adults and Latvians – differ significantly in identifying emotions. Adults and young adults identify sadness and neutrality in voice similarly. Young adults and Latvians are close in perceiving anger.

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited


R. Altrov and H. Pajupuu / Estonian Emotional Speech Corpus

Table 2. ANOVA results (Est – Estonians, LV – Latvians, EstA – adult Estonians, EstY – young adult Estonians)

Table 3 presents the confusion pattern in the emotional identification task by cultural background (nationality) and age (mean values in %).

Table 3. Percentage of Estonians and Latvians who identified emotions of sentences according to each emotion target (Est – Estonians, LV – Latvians, Y – young adult Estonians, A – adult Estonians)


The identification percentage of the target emotion is over 51 for both Estonian adults and young adults, although it is lower for young adults. Young adults perceive considerably more sentences as neutral. Adults and young adults are closest to each other in identifying sadness (see Table 3). This shows that ageing does not negatively influence the perception of sadness in regular, non-acted speech (cf. [9], [11]). Latvians identify only sadness in more than 51% of cases, and even for sadness there is a significant difference in how they identify it (see Table 2). For Latvians, sentences often sound neutral rather than anything else. The fact that Latvians and young adult Estonians identified many sentences as neutral shows that successfully identifying an emotion in the voice requires longer experience of how emotions are expressed in a given culture (cf. [16]). We asked whether the decision to use testers aged over 30 was right. Our results do not give a conclusive answer. On the one hand, testers aged over 30 and younger testers differ significantly in how they identify emotions. On the other hand, neither group has problems identifying emotions in the voice: there was a strong consensus (over 51%, see Table 3) within both groups about which emotions were heard in the sentences, although younger people tend to identify more sentences as neutral. As all sentence emotions contained in the corpus were determined by listeners, we cannot really say which of the two groups was better at emotion recognition.
Two points support the group of adult testers: 1) the ability of older testers to recognize emotions from the voice has not lessened with ageing, as they recognize sadness similarly to younger testers (see Table 2); 2) older testers perceive fewer sentences as neutral than younger testers do, which suggests that older people are better at decoding the culture-specific expression of emotions. The test results for Latvians also confirm the importance of the cultural aspect in emotion recognition, as their ability to recognize Estonian emotions from the voice was relatively low.
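The majority-labelling rule used throughout (an emotion counts as identified only when over 51% of listeners chose it) can be sketched as follows; the function and vote lists are illustrative, not part of the corpus software:

```python
from collections import Counter

# Sketch of the labelling rule described above: a sentence receives an
# emotion label only when more than 51% of listeners chose that emotion.
def majority_label(votes, threshold=0.51):
    """votes: list of emotion names chosen by individual testers."""
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) > threshold:
        return label
    return None  # no emotion reached the required majority

print(majority_label(["anger"] * 7 + ["joy"] * 3))    # 70% chose anger
print(majority_label(["joy"] * 5 + ["sadness"] * 5))  # no majority
```

Sentences where no emotion passes the threshold stay unlabeled, which matches the observation that many stimuli were simply heard as neutral by some groups.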


5. Conclusion

In the initial stage of creating the Estonian Emotional Speech Corpus, a decision was made to use testers older than 30 as emotion identifiers. This decision relied on the assumption that people who have lived longer in a certain culture are more likely to have acquired the skills of culture-specific expression of emotions. Having studied the relations between testers' age and cultural background and their ability to recognize emotions, and having evaluated the results both directly and indirectly, we can say that the initial decision is acceptable.

References

[1] M. Mihkla, L. Piits, T. Nurk, I. Kiissel, Development of a unit selection TTS system for Estonian. Proceedings of the Third Baltic Conference on Human Language Technologies. Vilnius: Vytauto Didžiojo Universitetas, Lietuvių kalbos institutas (2008), 181–187.
[2] R. Altrov, Eesti emotsionaalse kõne korpus: teoreetilised toetuspunktid [The Estonian Emotional Speech Corpus: theoretical underpinnings]. Keel ja Kirjandus 4 (2008), 261–271.

[3] E. Douglas-Cowie, N. Campbell, R. Cowie, P. Roach, Emotional speech: Towards a new generation of databases. Speech Communication 40 (2003), 33–60.
[4] K.R. Scherer, Vocal communication of emotion: A review of research paradigms. Speech Communication 40, 1–2 (2003), 227–256.
[5] A. Iida, N. Campbell, F. Higuchi, M. Yasumura, A corpus-based speech synthesis system with emotion. Speech Communication 40, 1–2 (2003), 161–187.
[6] R. Altrov, H. Pajupuu, The Estonian Emotional Speech Corpus: Release 1. Proceedings of the Third Baltic Conference on Human Language Technologies. Vilnius: Vytauto Didžiojo Universitetas, Lietuvių kalbos institutas (2008), 9–15.
[7] J. Toivanen, E. Väyrynen, T. Seppänen, Automatic discrimination of emotion from spoken Finnish. Language & Speech 47, 4 (2004), 383–412.
[8] R. Altrov, H. Pajupuu, Estonian emotional speech corpus: Content and options. Rassegna Italiana di Linguistica Applicata, forthcoming.
[9] A. Mill, J. Allik, A. Realo, R. Valk, Age-related differences in emotion recognition ability: A cross-sectional study. Emotion 9, 5 (2009), 619–630.
[10] A.J. Calder, J. Keane, T. Manly, R. Sprengelmeyer, S. Scott, I. Nimmo-Smith, A.W. Young, Facial expression recognition across the adult life span. Neuropsychologia 41, 2 (2003), 195–202.
[11] D.M. Isaacowitz, C.E. Löckenhoff, R.D. Lane, R. Wright, L. Sechrest, R. Riedel, P.T. Costa, Age differences in recognition of emotion in lexical stimuli and facial expressions. Psychology and Aging 22, 1 (2007), 147–159.
[12] S. Paulmann, M.D. Pell, S.A. Kotz, How aging affects the recognition of emotional speech. Brain and Language 104, 3 (2008), 262–269.
[13] L.H. Phillips, R.D.J. MacLean, R. Allen, Age and the understanding of emotions: Neuropsychological and sociocognitive perspectives. Journal of Gerontology B: Psychological Sciences and Social Sciences 57, 6 (2002), 526–530.
[14] P. Laukka, P.N. Juslin, Similar patterns of age-related differences in emotion recognition from speech and music. Motivation and Emotion 31 (2007), 182–191.
[15] M. Huettinger, Cultural dimensions in business life: Hofstede's indices for Latvia and Lithuania. Baltic Journal of Management 3, 3 (2008), 359–376.
[16] M.D. Pell, S. Paulmann, C. Dara, A. Alessari, S.A. Kotz, Factors in the recognition of vocally expressed emotions: A comparison of four languages. Journal of Phonetics 37 (2009), 417–435.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-33


Estonian Large Vocabulary Speech Recognition System for Radiology

Tanel ALUMÄE and Einar MEISTER
Laboratory of Phonetics and Speech Technology, Institute of Cybernetics at Tallinn University of Technology, Akadeemia tee 21, 12618 Tallinn, Estonia

Abstract. This paper describes the implementation and evaluation of an Estonian large vocabulary continuous speech recognition system prototype for the radiology domain. We used a 44 million word corpus of radiology reports to build a word trigram language model. We recorded a test set of dictated radiology reports using ten radiologists. Using speaker-independent speech recognition, we achieved a 9.8% word error rate. Recognition ran in around 0.5 real-time. One of the prominent sources of errors was mistakes in writing compound words.

Keywords. speech recognition, radiology, applications


Introduction

Radiology has historically been one of the pioneering domains for large vocabulary continuous speech recognition (LVCSR) in several languages. Radiologists' eyes and hands are busy during the preparation of a radiological report, thus creating suitable conditions for speech-based input as an alternative to keyboard-based text entry. In many hospitals, radiologists dictate the reports, which are then converted to text by human transcribers. Speech recognition systems have the potential to replace human transcribers and enable faster and less expensive report delivery. For example, [1] describes a case study where the use of speech recognition decreased the mean report turnaround time by almost 50%. In radiology, a typical active vocabulary is much smaller than in general communication, and the sentences have a more well-defined structure, following certain patterns. This makes it possible to create statistical language models for the radiology domain that are accurate and have good coverage, given enough training data [2]. Over the past several years, speech recognition for radiology has greatly improved, and some vendors claim word accuracies of up to 99%, i.e. a word error rate (WER) of around 1% [3]. However, most vendors provide systems only for the biggest languages (although Nuance's SpeechMagic supports 25 languages [4]), and there have been no known attempts to build a speech recognition system for radiology for the Estonian language. We have found one report on building a radiology-focused speech recognition system for a medium- or little-resourced language: in [5], a radiology dictation system for Turkish is developed, with an emphasis on improving the modeling of pronunciation variation. A very low word error rate of 4.4% was reported; however, the vocabulary of the system contained only 1562 words.


T. Alumäe and E. Meister / Estonian Large Vocabulary Speech Recognition System for Radiology

In this paper, we describe our implementation of an Estonian speech recognition system for the radiology domain. The first section describes the training data and the different aspects of building the system. In the second section, the process of collecting test data is explained, the experimental results are reported, and some error analysis is performed. The paper ends with a discussion and some plans for future work.

1. LVCSR system

1.1. Language model


For training a language model (LM), our industry partner made a large anonymized corpus of digital radiology reports available to us. The corpus contains over 1.6 million distinct reports and 44 million word tokens before normalization. We randomly selected 600 reports for development and testing and used the rest for training. The corpus contains over 480 thousand distinct words (including all different numbers). We created a word trigram LM from the corpus. In previous LVCSR systems for Estonian, sub-word units have been used instead of words as the basic units of the LM, since the inflectional and compounding nature of the language makes it difficult to achieve good vocabulary coverage with words [6]. However, our initial investigations revealed that in the radiology domain the vocabulary is much more compact and words can be used as basic units.

The corpus was normalized using a multi-step procedure:

1. A large set of hand-made rules (implemented using regular expressions) was applied that expanded and/or normalized different types of abbreviations often used by radiologists.
2. A morphological analyzer [7] was used for determining the part-of-speech (POS) properties of all words.
3. Numbers were expanded into words, using the surrounding POS information to determine the correct inflections for expansion.
4. The resulting corpus was used to produce two corpora: one including verbalized punctuation and another without punctuation.
5. A candidate vocabulary of all words that occur at least five times was generated.
6. Pronunciations for all words were generated using simple Estonian pronunciation rules [6]. A list of exceptions was used to acquire pronunciations for abbreviations not expanded in the first step.
7. Words whose pronunciations were composed only of consonants were removed. Such words are mostly spelling errors and abbreviations whose pronunciations have not been defined in the exception dictionary; during recognition, such mis-modeled words are often mis-inserted instead of fillers.
8. The LM vocabulary was composed of the most frequent 50 000 words from the remaining candidate vocabulary.
9. Two trigram LMs – one with verbalized punctuation and another without – were built, using interpolated modified Kneser-Ney discounting [8]. The two LMs were interpolated into one final model.

We model optional verbalized punctuation in our LM because some recordings in our test set contain it.
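Three of the normalization steps above (rule-based abbreviation expansion, the minimum-frequency vocabulary cut-off, and the consonant-only filter) can be sketched in a few lines; the abbreviation rule and the sample words are invented for illustration, while the threshold of five occurrences comes from the text:

```python
import re
from collections import Counter

# Toy versions of three normalization steps; the rule below is a
# hypothetical example, not one of the system's actual rules.
ABBREV_RULES = [(re.compile(r"\bnt\b"), "näiteks")]

def expand_abbreviations(text):
    """Step 1: expand abbreviations with hand-made regex rules."""
    for pattern, expansion in ABBREV_RULES:
        text = pattern.sub(expansion, text)
    return text

def build_vocabulary(tokens, min_count=5):
    """Step 5: keep only words occurring at least min_count times."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

VOWELS = set("aeiouõäöü")

def consonants_only(word):
    """Step 7: flag 'words' whose letters contain no vowel at all."""
    return not any(ch in VOWELS for ch in word.lower())

print(expand_abbreviations("nt kops"))
print(consonants_only("rtg"), consonants_only("kops"))
```

In the real pipeline the rule set is large and the consonant check is applied to generated pronunciations rather than to spellings, but the filtering logic is the same.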


The LM perplexity against the development texts without verbalized punctuation is 37.9; against the development texts with verbalized punctuation it is 31.6. The out-of-vocabulary (OOV) rate is 2.6%. However, the majority of the OOV words are compound words that are not present as compounds in the LMs but whose compound constituents are present. There are also many spelling errors that contribute to the relatively high OOV rate.

1.2. Acoustic models

We used off-the-shelf acoustic models trained on various wideband Estonian speech corpora: the BABEL speech database [9] (phonetically balanced dictated speech from 60 different speakers, 9 h), transcriptions of Estonian broadcast news (mostly read news speech from around 10 different speakers, 7.5 h) and transcriptions of live talk programs from three Estonian radio stations (42 different speakers, 10 h). The latter material consists of two or three hosts and guests half-spontaneously discussing current affairs or certain topics, and includes passages of interactive dialogue and long stretches of monologue-like speech. The speech signal is sampled at 16 kHz and quantized with 16 bits. From a window of 25 ms, taken every 10 ms, 13 MFC coefficients are calculated, together with their first- and second-order derivatives. A maximum likelihood linear transform, estimated from the training data, is applied to the features, with an output dimensionality of 29. The data is used to train triphone HMMs for 25 phonemes, silence and nine different noise and filler types. The models comprise 2000 tied states; each state is modeled with eight Gaussians.
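The framing arithmetic of the front end described above (25 ms windows taken every 10 ms from 16 kHz audio) works out as follows; this is a back-of-the-envelope sketch, not the recognizer's actual code:

```python
# Framing used by the acoustic front end described above:
# 25 ms analysis windows, shifted by 10 ms, over 16 kHz audio.
SAMPLE_RATE = 16000
WINDOW = int(0.025 * SAMPLE_RATE)   # 400 samples per window
SHIFT = int(0.010 * SAMPLE_RATE)    # 160 samples between window starts

def num_frames(n_samples):
    """Number of full analysis windows that fit into a signal."""
    if n_samples < WINDOW:
        return 0
    return 1 + (n_samples - WINDOW) // SHIFT

# One second of audio yields roughly one hundred 13-dimensional MFCC
# vectors (before the delta and delta-delta features are appended).
print(num_frames(SAMPLE_RATE))
```

Each frame's 13 MFCCs plus first- and second-order derivatives give 39 raw dimensions, which the maximum likelihood linear transform then projects down to 29.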


2. Evaluation

2.1. Data acquisition

For the speech recognition experiments, we recorded a small speech corpus of radiology reports, using 10 subjects (5 male and 5 female). The recordings were made with a close-talking microphone. The subjects dictated different reports from the test set of our radiology text corpus. As all subjects were professional radiologists with varying degrees of experience (from 1 to 15 years) in writing radiology reports, they had no difficulties in reading specific medical terminology or in interpreting different abbreviations. Nine subjects were native Estonian speakers; one was a non-native speaker with a slight perceived foreign accent (her speech samples were not used in testing). The speakers did not receive any special training before the recordings; thus, some radiologists chose to dictate the reports with verbalized punctuation, while the majority did not. The total length of the recordings was 4 hours and 23 minutes, with 26 minutes per speaker on average. The recordings were then manually transcribed, using the report texts given to the speakers as templates. The total number of running words in the transcripts is 19 486; the number of unique words is 4317.


Table 1. Word error rates for different speakers and the average.

Speaker   WER
AL         7.3
AR         8.5
AS         8.5
ER        10.3
JH        13.3
JK         9.2
SU        10.7
VE         8.7
VS        11.9
Average    9.8


2.2. Recognition experiments

The recognition experiments were performed using the CMU Sphinx 3.7 open source recognizer¹. The recognizer was configured to use a relatively narrow search beam and ran in 0.5 × real time on a machine with an Intel Xeon X5550 processor (2.66 GHz, 8 MB cache, 667 MHz DDR3 memory). The WER results per speaker and on average are given in Table 1. We analyzed the recognition errors and found that around 17% of them are "word compounding" errors – a compound word is recognized as a non-compound, or vice versa (i.e., the only error is in the space between compound constituents). Such errors have a high impact on the WER, since they are counted as one substitution error plus one deletion error, or one substitution error plus one insertion error. However, such errors probably have the lowest impact on the perceived quality of the recognized text; whether a word is a real compound is often arguable even for humans. Other prominent sources of errors were spelling errors in the reference transcripts (17% of all errors) and normalization mismatches, i.e., situations where the reference and the hypothesis represent the same term using different levels of expansion or normalization (e.g., C kuus 'C six' vs. C6; 11%). Thus, only around 55% of the errors were "real" recognition errors.
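The double penalty for compounding errors follows directly from how WER is computed as a word-level edit distance; a minimal sketch (the Estonian example word is ours, chosen only to illustrate a compound split):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# A one-word compound recognized as two words costs a substitution plus
# an insertion: two errors against a single reference word.
print(word_error_rate("kopsukude", "kopsu kude"))
```

The single misplaced space thus counts for as much as two completely wrong words, which is why compounding mistakes inflate the WER disproportionately to their perceived severity.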

3. Discussion and future work

The paper described a pilot study of an Estonian LVCSR system for radiology. Using off-the-shelf acoustic models and a word trigram language model built from a large corpus of proprietary radiology reports, a word error rate of 9.8% was achieved using one-pass speaker-independent recognition. Brief error analysis suggests that almost half of the errors can be attributed to normalization and compound-word representation problems. The system can be improved in many ways. First, since conducting the experiments reported here, we have gathered additional transcribed wideband speech data for training acoustic models (over 13 hours of conference speech recordings and an additional 10 hours of broadcast conversations from radio). Second, our system did not use adaptation of any

¹ http://cmusphinx.org


kind, while in such a system both acoustic model adaptation towards specific speakers and language model adaptation towards certain parameters of the radiological study would probably improve the accuracy of the system. The speech recognition experiments were conducted using speech from written radiology reports read out aloud. Such a setting might have skewed the results in a positive direction, since when dictating new reports spontaneously, the concentration of speech disfluencies at the lexical, syntactic and acoustic-prosodic levels is probably much higher. In order to measure the WER more realistically, we should perform Wizard-of-Oz-style experiments in which radiologists produce reports for previously unseen images. However, to gain an objective insight into the system's potential, the subjects should also receive some training on dictating reports using a speech recognition system. This study concentrated only on the speech recognition aspects of voice-automated transcription of radiology reports. There are many post-processing steps – such as consistent normalization of read numbers, dates and abbreviations, and proper structuring of the generated reports – that are perhaps even more important for report availability and the time efficiency of the voice-automated reporting process. Also, the best benefit from speech recognition can be obtained if it is fully integrated into a radiology information system (RIS) [10]. We are planning to continue the cooperation with our industry partner on such aspects of the system.

Acknowledgments


This research was partly funded by the target-financed theme No. 0322709s06 of the Estonian Ministry of Education and Research, by the National Programme for Estonian Language Technology, and by Cybernetica Ltd's project CyberDiagnostika, supported by Enterprise Estonia.

References

[1] M.R. Ramaswamy, G. Chaljub, O. Esch, D.D. Fanning, and E. vanSonnenberg, "Continuous speech recognition in MR imaging reporting: Advantages, disadvantages, and impact," Am. J. Roentgenol., vol. 174, pp. 617–622, Mar. 2000.
[2] J.M. Paulett and C.P. Langlotz, "Improving language models for radiology speech recognition," Journal of Biomedical Informatics, vol. 42, pp. 53–58, Feb. 2009. PMID: 18761109.
[3] Nuance Communications, "Dragon Medical product page." http://www.nuance.com/healthcare/products/dragon_medical.asp, 2010.
[4] Nuance Communications, "SpeechMagic product page." http://www.nuance.co.uk/speechmagic/, 2010.
[5] E. Arısoy and L.M. Arslan, "Turkish radiology dictation system," in Proceedings of SPECOM, St. Petersburg, Russia, 2004.
[6] T. Alumäe, Methods for Estonian large vocabulary speech recognition. PhD thesis, Tallinn University of Technology, 2006.
[7] H.-J. Kaalep and T. Vaino, "Complete morphological analysis in the linguist's toolbox," in Congressus Nonus Internationalis Fenno-Ugristarum Pars V, Tartu, Estonia, pp. 9–16, 2001.
[8] S.F. Chen and J. Goodman, "An empirical study of smoothing techniques for language modeling," Computer Speech & Language, vol. 13, no. 4, pp. 359–393, 1999.


[9] A. Eek and E. Meister, "Estonian speech in the BABEL multi-language database: Phonetic-phonological problems revealed in the text corpus," in Proceedings of LP'98, Vol. II, pp. 529–546, 1999.
[10] D. Liu, M. Zucherman, and W. Tulloss, "Six characteristics of effective structured reporting and the inevitable integration with speech recognition," Journal of Digital Imaging, vol. 19, pp. 98–104, Jan. 2006.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-39


Towards Spoken Latvian Corpus: Current Situation, Methodology and Development

Ilze AUZIŅA
Institute of Mathematics and Computer Science, University of Latvia

Abstract. The aim of this paper is to present the development stages of the Spoken Latvian Corpus and its current situation. The development of the Spoken Latvian Corpus began in 2006 (with Latvian Council of Science funding), and some individual speech corpora have been developed. There are several stages in the creation of the Spoken Latvian Corpus: development of the concept, speech data collection procedures, transcription and annotation of speech data, and representation of the corpus. Keywords. Corpus, Spoken Latvian Corpus, Latvian


Introduction

Based on the Latvian language corpus conception [1], speech data also has to be included in the national corpus (about 10% transcribed speech and 90% written texts). Currently the text Corpus of Modern Latvian consists of 3.5 million words; however, these come only from written texts [2]. Only 2% of the included texts (transcripts of the sittings of the Saeima, the Parliament of Latvia) can be regarded as transcribed speech [3]. By including speech data in the text Corpus of Modern Latvian, it will be possible to draw an overall view of the Latvian language and to evaluate its present situation and the culture of speaking. Materials of spoken Latvian will also help in the development of dictionaries, grammars and natural language processing tools.

There are different kinds of speech databases in Latvia. It is important to be able to share the speech data conveniently, to avoid wasting time and money and to make research work more efficient. One of the problems in sharing these materials is the lack of general specifications on corpus collection, annotation and distribution. The objectives of the developers of the spoken corpus are to create a source of authentic speech and to define common metadata and annotation standards. Several institutions deal with speech data processing; for example, the Institute of Philosophy and Sociology of the University of Latvia has 20 years of experience in storing, transcribing and processing audio files. Currently there are more than 3000 audio recordings of life stories with a total length of more than 9000 hours, a small part of which has been digitized and decoded [4, 5]. The Artificial Intelligence Laboratory of the Institute of Mathematics and Computer Science of the University of Latvia (AILAB IMCS LU) has stored a wide variety of speech data, for example prepared speech texts, news broadcasts and spontaneous speech recordings [6, 7, 8]. A part of the speech data has been transcribed and annotated with characteristic speech features.


I. Auziņa / Towards Spoken Latvian Corpus: Current Situation, Methodology and Development

Several development projects of Latvian corpora have already been carried out by IMCS LU. The Latvian Council of Science financed the fundamental and applied research project "Speech Corpus of Latvian" (2006–2009); as a result of this project, the Corpus of Public Discussions (radio broadcast recordings) was created. The State Language Agency (Valsts valodas aģentūra, VVA) financed several projects: "Preparation of Text Metadata of Latvian Corpus" (2007–2008), "Development of First Stage of Software of Latvian Corpus" (2007), "Development of Second Stage of Software of Latvian Corpus" (2008), and "Extension of the Balanced Modern Latvian Language Text Corpus" (2009). Two versions of the Text Corpus of Modern Latvian have been created and are currently available online (http://www.korpuss.lv). It is planned to extend the existing corpus and to integrate speech data into the Corpus of Modern Latvian.

1. Concept of Spoken Latvian Corpus


The corpus has proved to be a potential resource for studies in nearly all branches of linguistics. The main goals are to integrate the speech data into the already existing text corpus of the Latvian language, to make speech data comparable with written data, and to find a standard program for the speech corpus that can make the corpus more efficient and its usage and sharing easier. The task is to create a balanced micro-model of a speech corpus. It was thus decided to follow the principle of balance (see Figure 1) and to record and collect conversations (both private and institutional) taking place in different communicative situations and settings, so as to collect various types of oral speech. The planned division is:

• Spontaneous speech (~80%): dialogues and polylogues (phone calls; public discussions, interviews; private conversations etc.); monologues (narrations, life stories).
• Planned speech (~20%): monologues (TV and radio news; academic speeches, papers).

[Figure 1. Prospective ratio of speech data (Latvian language corpus conception (2005)): public discussions (interviews, disputes) 40%; informal conversations 30%; monologues (narrations, life stories) 10%; telephone conversations 10%; prepared texts for reading (TV and radio news, papers, theses) 5%; prepared texts for speaking (presentations, theses) 5%.]


The entire corpus will be annotated structurally and morphosyntactically and supplied with metadata. Along with the creation of the corpus, the metadata needed to characterise the speech recordings (as well as the mark-up) and a methodology for adding and marking new speech data will be developed; a structurally and grammatically marked speech corpus of about 350 000 words will be created, and manual corpus data analysis will be carried out in order to add metadata. About 10% of the speech data will be marked at the phonetic level. The spoken corpus of Latvian will be developed, the results will be published online, and the corpus will be a part of the Modern Latvian corpus. The corpus will include different types of linguistic data. The infrastructure of the corpus will be designed in a simple, user-friendly way, so that the data can be processed efficiently in the database and users can browse the spoken data directly from the web. At the moment, only the Corpus of Public Discussions has been completed.

2. Metadata

Specific meta information will be added before the structural marking of speech data. The primary specifications include: the specification of speakers, describing information about the speakers such as age, sex, education and accent; the specification of recording, describing the recording software, the recording equipment and the acoustic environment; the specification of data, describing the format and index of the data; and the specification of annotation, describing the annotation system. The existing Corpus of Public Discussions contains information about the speakers (age, sex, education, native language etc.), the specification of the recording, and the date.


2.1. Speaker information

The information about the speakers may differ between spoken corpora. For instance, if speakers are anonymous (conversations in reference and information services, anonymous interviews), neither their identity nor any other relevant characteristics may be known. Mostly, however, the speaker's identity and definite information can be added. The minimally required information is: number of speakers, sex, and age (age group). Additional information: native language, dialect, education, place of birth etc.

2.2. Specification of recording

The recording protocol may vary depending on the type of speech data collection. For example, for describing a face-to-face conversation the following information is included: recording platform (analogue or digital, number of channels etc.) and microphone information (the type and position of the microphone, the distance between speaker and microphone etc.). A description of the environment and the situation in which the speaker is speaking is included as well.
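The minimally required and additional speaker fields listed above could be represented as a simple record; this is a sketch with field names of our own choosing, not the corpus project's actual metadata schema:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal sketch of a speaker-metadata record: the first three fields
# are the minimally required ones named in the text, the rest are the
# optional additional fields.
@dataclass
class SpeakerMetadata:
    speaker_id: str
    sex: str                                 # minimally required
    age_group: str                           # minimally required, e.g. "30-39"
    native_language: Optional[str] = None    # additional
    dialect: Optional[str] = None            # additional
    education: Optional[str] = None          # additional
    place_of_birth: Optional[str] = None     # additional

    def is_anonymous(self):
        # Anonymous speakers (e.g. callers to information services) may
        # lack everything beyond the minimally required fields.
        return all(v is None for v in (self.native_language, self.dialect,
                                       self.education, self.place_of_birth))

s = SpeakerMetadata("spk01", sex="F", age_group="30-39")
print(s.is_anonymous())
```

Keeping the additional fields optional lets the same record type cover both fully documented public speakers and anonymous ones.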


3. Levels of Annotation

The first step is the transcription of the speech signal, using conventions for the orthographic transcription and for the annotation of non-linguistic acoustic events. For the orthographic transcription, decisions will be taken over spelling conventions. Non-verbal data, such as contextual information, paralinguistic features, gaps in the transcript, pauses, and overlaps, will be represented on this level as well. In dialogues the turns are annotated with time information, speaker identity and gender, and speaker overlap is marked.


3.1. Orthographic Transcription (Orthographic Annotation)

The principal features of the transcription scheme are:
• Generally, the orthographic standards of the Latvian language are used; incorrect forms are annotated.
• Capitalization: initial words of sentences are capitalized only if they would also be capitalized in the middle of a sentence.
• Numbers are spelled out following the standards of the Latvian language, using correct endings. (A declinable numeral is declined to agree with the noun: it takes the same gender, number and case as the noun.)
• Foreign words are specially marked to separate them from the Latvian text.
• The transcription includes only some punctuation marks: full stop, question mark and exclamation mark.

During the process of transcription some problems may arise, for example, non-standard pronunciation of certain words and word forms. Both standard and non-standard word forms are used in speech (for example, lasam (incorrect spelling) vs. lasām (correct spelling), present 1st pl. 'read'), even in the speech of speakers talking on TV and radio. Therefore it was decided to transcribe the word as pronounced and to add the standard form. The main unit of spoken language is an utterance: a unit of speech bounded by silence or by a turn of speaker in dialogues. The corresponding unit in written language is the sentence. In continuous speech it is often not easy to decide where one utterance ends and another starts, due to fast speech, mispronunciation, overlapping, etc.

3.2. Annotation of Non-linguistic Acoustic Events

Non-verbal data, such as contextual information, paralinguistic features, gaps in the transcript, pauses, and overlaps, are added on the level of the orthographic transcription. In dialogues the turns are annotated with time information, speaker identity and gender, and speaker overlap is marked. The main non-linguistic acoustic events marked in the orthographic transcription are:
• Pause fillers, hesitations.
• Human noises, such as laughing, coughs, expiration, inspiration, etc.
• Mispronunciations, unintelligible words, unfinished words.
• Pauses: both micro pauses and longer pauses (silences longer than 1 sec.) are marked with a full stop enclosed in brackets.
• Non-human noises.
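As an illustration of how such markers can be processed downstream, the sketch below extracts pauses, noises and unfinished words from one transcript line. Only the pause convention (a full stop in brackets) comes from the text above; the bracketed noise labels and the trailing hyphen for unfinished words are assumptions invented for this example.

```python
import re

# Assumed marker inventory: "(.)" / "(...)" for pauses (convention from the
# paper); "[label]" for noises and a trailing hyphen for unfinished words are
# invented for this illustration.

line = "viņš (.) teica [laugh] ka (...) nezin-"

pauses = re.findall(r"\(\.+\)", line)            # micro pauses "(.)" and longer "(...)"
noises = re.findall(r"\[(\w+)\]", line)          # human/non-human noise labels
unfinished = re.findall(r"\w+-(?=\s|$)", line)   # words cut off by the speaker

print(pauses, noises, unfinished)
```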


3.3. The Methodology of Grammatical Annotation of Speech Data

IMCS LU has already developed text corpus processing tools [9, 10]. The text morphosyntactic marking tools will be adjusted to speech data processing. To ensure the functionality of the speech corpus, it is intended to develop the necessary tools for the speech corpus and/or to adapt the tools of the text corpus (search tools, concordance and statistics tools) to the speech corpus.

4. The Corpus of Public Discussions

The Corpus of Public Discussions (recordings of radio broadcasts) has been developed. The corpus was recorded from public radio: the records of 6 radio broadcasts are orthographically transcribed and meta information is added. The length of each record is approximately 40 minutes (the total record length is 4 hours). The speech of 13 speakers is transcribed: 2 females and 11 males. There is one moderator (the same person) and two participants (deputy candidates of the Parliament of Latvia) in every broadcast. This is authentic spontaneous speech material and can be viewed and analyzed from various aspects: for example, linguistic and political discourse analysis can be carried out, and tendencies of language development can be traced. To make an in-depth linguistic analysis possible, the speech data should be morphologically annotated. The orthographic transcription and the annotation of non-linguistic acoustic events were chosen (see above), and the metadata are added. The annotation scheme will be used for the preparation of other speech corpora as well.


5. Discussion and conclusions

The development of a speech corpus is much more time-consuming and much more expensive than the development of a text corpus. This is because speech data has to be transcribed first, and only then can it be structurally and morphosyntactically marked, by adding relevant meta information to the speech data, for instance, speaker characteristics and recording parameters. Worldwide experience shows that the development of a speech corpus is a difficult task and requires precise planning. When planning the development it is crucial to make sure that all meaningful information has been collected, described and accordingly annotated. In this paper, the development stages of the Spoken Latvian Corpus and its current situation have been presented. Although the development of the Spoken Latvian Corpus began in 2006, only the Corpus of Public Discussions has been created so far. This is only the first step towards the Spoken Latvian Corpus. In the near future, the Latvian conversation corpus and the Latvian language learner corpus will be developed, and research built on speech corpus data will be carried out.

References

[1] The Latvian Language Corpus Conception, The State Language Agency, 2005.



[2] Līdzsvarots mūsdienu latviešu valodas tekstu korpuss (the Text Corpus of Modern Latvian), http://www.korpuss.lv
[3] I. Auziņa, LR Saeimas sēžu stenogrammu datorizēta apstrāde un analīze, Parlamentārais diskurss Latvijā. Saeimas plenārsēžu datorizēta analīze, Rīga, LU Akadēmiskais apgāds, 2007, 9–21.
[4] B. Bela-Krūmiņa, Research Opportunities of Life Stories: Everyday History, Elore 13 (2006).
[5] B. Bela, Dzīvesstāsti kā resurss sabiedrības izpētē: Nacionālās mutvārdu vēstures projekts, Latvijas Universitātes raksti. Socioloģija. Socioloģijai Latvijā – 40, 736 (2008), Rīga, 85–102.
[6] E. Milčonoka, N. Grūzītis, A. Spektors, Natural language processing at the Institute of Mathematics and Computer Science: 10 years later, Proceedings of the First Baltic Conference "Human Language Technologies – the Baltic Perspective", Riga (2004), 6–11.
[7] N. Grūzītis, I. Auziņa, S. Bērziņa-Reinsone, K. Levāne-Petrova, E. Milčonoka, G. Nešpore, A. Spektors, Demonstration of resources and applications at the Artificial Intelligence Laboratory, IMCS, UL, Proceedings of the First Baltic Conference "Human Language Technologies – the Baltic Perspective", Riga (2004), 38–42.
[8] A. Spektors, Latviešu valodas datorfonda izveide, LZA Vēstis A 1 (2001), 74–82.
[9] K. Levāne, A. Spektors, Morphemic Analysis and Morphological Tagging of Latvian Corpus, Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, Greece (2000), vol. 2, 1095–1098.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-45


Remarks on the Duration of Lithuanian Consonants in a Continuous Speech

Sigita DEREŠKEVIČIŪTĖ and Asta KAZLAUSKIENĖ
Vytautas Magnus University (Kaunas, Lithuania)

Abstract. Acoustical (quantitative) properties of the consonants of the Lithuanian standard language are still not extensively and properly covered by contemporary research. The objective of this paper is to investigate the quantity of consonants in continuous speech of Standard Lithuanian, and to qualify the spontaneous duration of the analyzed sounds considering qualitative (articulatory) features and ignoring other factors like the length of the segment or the sound's position in a word. The results show that the most significant and distinctive articulatory features influencing duration are the manner of articulation and the voicing. The place of articulation and palatalization have no impact on the duration of the analyzed consonants.

Keywords. Plosives, fricatives, sonorants, duration, voicing, palatalization, place of articulation, manner of articulation


Introduction

The objective of this paper is to investigate the quantity of consonants in a corpus of continuous speech of Standard Lithuanian, and to qualify the spontaneous duration of the analyzed sounds considering qualitative (articulatory) features and ignoring other factors like the length of the segment, the sound's position in a word, or adjacent sounds. The consonant system of Standard Lithuanian consists of 45 phonemes, characterized by the following table (see Table 1):

Table 1. Lithuanian consonants

             Bilabials  Labiodentals  Dentals  Alveolars  Palatals  Velars
Plosives     p b                      t d                 k' g'     k g
Fricatives              f             s z      š ž        x' h'     x h
Affricates                            c dz     č dž
Nasals       m                        n                             ŋ
Sonorants               v             l        l' r       j

In Lithuanian, [f, h, x] can only occur in loanwords, and the velar [ŋ] is a positional variant of /n/ [2]. All the consonants, except the mediolingual [j], can contrast by being either soft (palatalized) or hard (non-palatalized). In the production of soft consonants the middle part of the tongue is additionally raised towards the hard palate. The hard consonants are characterized not only by the absence of palatalization but also by velarization (raising of the back part of the tongue towards the velum). Consonants also build local oppositions and voicing correlations (see Table 1) [2].


S. Dereškevičiūtė and A. Kazlauskienė / Remarks on the Duration of Lithuanian Consonants

1. Previous investigations

Acoustical (quantitative) properties of the consonants of Standard Lithuanian are still not extensively and properly covered by contemporary research. The first investigations were made with poor technical equipment and were intended only for specific cases of consonants: spectrographic investigation of the quantity of intervocalic and stressed-syllable consonants [3, 4], of the duration of geminates [5], and temporal investigations of the northern Samogitian dialect's consonantism [6]. More recent investigations reveal temporal data of particular consonantal classes: sonorants [7], burst duration of voiceless plosives [8], fricatives and plosives [9, 10].


2. Methods and material

The data used for the analysis is a fragment from the psychological novel "Altorių šešėly" by V. Mykolaitis-Putinas, read by the actor V. Širka. The text was read expressively, with a monotonous, non-emotional intonation. It contained little direct speech, and the voice (low pitch) was not changed. The data from one speaker was chosen to detect temporal regularities of consonants and to avoid the presumptive impact of individual speaker's features. However, given that the speech rate of speakers differs and might influence the results, more speakers should be taken into consideration in future research. The speech corpus contained approximately 60,000 sounds (almost 1 hour and 40 minutes of records). The quantity of consonants was measured after automatically annotating the sound records with the HTK speech recognition toolkit [1], which allows training Hidden Markov Models of the phonemes and applying a Viterbi search to set the boundaries of the phoneme sequence so that this sequence matches the record at the highest probability. Subsequently, phone boundaries were manually corrected with the acoustic analysis program Praat. This could reveal the duration differences that might exist in real spoken language. The results were processed statistically1. Sonorant and fricative consonants were analyzed in all word positions, whereas plosive consonants appearing in word-initial position were ignored. The duration of the latter consonant class was measured including both the closure and the burst. Since the aim of this experimental research is to analyze temporal relations and articulatory correlations among the consonantal classes, the plosives were taken as a single segment including both the closure and the burst. Affricates are not covered by this paper.
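Once each phone is a labelled interval, the duration measurement reduces to summary statistics over those intervals. The sketch below assumes a toy segmentation (invented for illustration; in the study the intervals came from HTK forced alignment with boundaries corrected in Praat) and computes per-class sample size, mean and standard deviation, the quantities reported in Table 2.

```python
import math
from collections import defaultdict

# Toy segmentation invented for illustration; the study's intervals came
# from HTK forced alignment corrected manually in Praat.
segments = [  # (phone label, start in seconds, end in seconds)
    ("s", 0.00, 0.09), ("a", 0.09, 0.17), ("k", 0.17, 0.23),
    ("a", 0.23, 0.30), ("s", 0.30, 0.41),
]

FRICATIVES = {"s", "z", "š", "ž", "f", "x", "h"}
PLOSIVES = {"p", "b", "t", "d", "k", "g"}

by_class = defaultdict(list)
for label, start, end in segments:
    if label in PLOSIVES:
        cls = "plosive"
    elif label in FRICATIVES:
        cls = "fricative"
    else:
        cls = "other"
    by_class[cls].append(end - start)

# Per-class sample size, mean and sample standard deviation.
for cls, durs in sorted(by_class.items()):
    n = len(durs)
    mean = sum(durs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in durs) / (n - 1)) if n > 1 else 0.0
    print(f"{cls}: n={n}, mean={mean:.3f} s, sd={sd:.3f} s")
```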

3. Results

3.1. The place of articulation

Previous works on the temporal features of consonants in sentences and nonsense words revealed some factors influencing the quantity of these sounds. First, the duration is

1 For the help in preparing data for the investigation and for comments during article writing, the authors are kindly grateful to their colleague Assoc. Prof. Gailius Raškinis (Department of Applied Informatics).


influenced by the length of the phrase: consonants are uttered longer in isolated words than in phrases or sentences [3, 10]. This is a critical aspect for speech synthesis. Second, the manner of articulation and the voicing have the most significant impact on the duration of consonants [3, 6]. The results on consonant duration in continuous speech display these regularities and are described below.

Fischer-Jørgensen [11], Peterson and Lehiste [12], Ladefoged [13] and Dereškevičiūtė [8] in their experiments supported the importance of the place of articulation for measuring the duration of VOT2: the deeper in the oral cavity the sound is articulated, the longer it should be in duration. The results in Table 2 show that the place of articulation does not have any impact, or has an insignificant impact, on the quantity of consonants. The durations of bilabial, dental and velar3 plosives differ only slightly (~ 0,003 s).

Table 2. The duration and the place of articulation of Lithuanian consonants

Consonants      Sample size   Mean (s)   Stand. deviation
Plosives
  [p p' b b']       2649        0,063        0,023
  [t t' d d']       4876        0,062        0,025
  [k k' g g']       3860        0,065        0,025
Fricatives
  [f f']              52        0,074        0,03
  [h h' x x']         14        0,093        0,23
  [s s' z z']       4498        0,098        0,03
  [š š' ž ž']       1353        0,102        0,03
Sonorants
  [r r']            2036        0,046        0,019
  [v v']            2036        0,057        0,027
  [l l']            1627        0,059        0,028
  [n n']            2116        0,062        0,028
  [m m']            1726        0,070        0,023
  [j]               1796        0,070        0,038

To recall, the closure interval and VOT were not measured separately; the whole segment was considered in this research. However, the results correspond to the earlier mentioned tendency: the sounds uttered in the back part of the cavity are pronounced longer. This also applies to fricatives [14, 15] (labiodentals [f f'], dentals [s s'], alveolars [š š']) and sonorants (taps [r r'], labiodentals [v v'], dentals/alveolars [l l' n n'], palatals [j]). Consonants articulated in the front part of the oral cavity are shorter in comparison with those articulated in the back part. The ambiguity then lies with bilabial and dental/alveolar consonants, which are all articulated in the front part of the mouth. According to some researchers, the velocity of the tongue [13, 16] and the contact area of the articulators [13, 17] influence the duration of the consonants. Sounds articulated with the tongue tip (the smallest and fastest part of the tongue) should be shorter than those articulated

2 VOT = Voice Onset Time.
3 The place of articulation of palatalized [k' g'] and non-palatalized [k g] differs: when articulating the palatalized ones, the middle part of the tongue approaches the middle part of the palate, while when articulating the non-palatalized ones it approaches the back part of the palate. However, here they are collectively called velar consonants. If in this case we distinguished them, then not one feature (the place of articulation) but two (the place of articulation and the palatalization) could have an impact on the duration.


with the tongue body (a broader and slower part of the tongue). The fricatives [s s' z z'], sonorants [r r'] and plosives [t t' d d'] are the shortest within their respective consonantal classes. According to the authors, articulation with the lips requires a little more time; this is the reason for bilabials being longer than dentals. However, this is still a debatable issue, inasmuch as some results of VOT measurements of plosives do not follow this regularity [8, 13]. The temporal results of the velar [x x' h h'] and labiodental [f f'] fricative consonants fall out of the general picture; this is not surprising, though, since these peripheral consonants occur in Lithuanian only in loanwords and are shorter [3] in duration compared with other fricatives (see Table 2).

3.2. The manner of articulation

Intensive vibration of the vocal folds, high energy and indirect or discontinuous air flow [3, 6] possibly complicate the articulation of sonorants and cause their short production. Compared with fricatives (see Figure 1), sonorants are almost 1,5 times shorter, but are more similar to plosives (~ 0,062 s). According to the manner of articulation, consonants can be arranged by their duration in descending order: voiceless fricatives, voiced fricatives, the two sonorants [m m'] and [j], voiceless plosives, voiced plosives and the rest of the sonorants. The results for the duration of Standard Lithuanian consonants are similar to those obtained for the Samogitian dialect [6] and to those uttered in isolated words [3].


Figure 1. The duration and manner of articulation

The duration also differs in the distribution of the consonants according to voicing: the investigations revealed that the voiceless plosives and fricatives are the longest, the voiced ones are shorter, and sonorants are the shortest in duration. As expected, the duration of consonants is also shorter in continuous speech compared with consonants taken from isolated words or sentences (plosives even twice as short). In relation to other languages like Czech, German and Mandarin [14, 18], the results tend to be comparable, only that the quantity there is greater in all consonantal classes. In English, the duration of fricatives varies depending on some contextual factors [14, 21, 22]: consonants may last from 50 ms in consonant clusters to 200 ms in phrase-final positions. To address the phenomenon of the widely distributed durations of the Lithuanian fricative [s], this consonant was analyzed in different positions of the phrase in this paper. Lithuanian [s] frequently occurs in word-final position and is usually uttered longer there (almost twice as long; see Table 3). The above-mentioned factor and another factor, the process of degemination, are significant elements in terms of synthesizing naturally sounding Lithuanian. A sequence of two identical adjacent


consonants (occurring only at the morpheme boundary) undergoes degemination, and only the second one is pronounced (pusseserė) in Standard Lithuanian.

Table 3. The duration of fricative [s] in middle and final positions4

Consonant   Sample size   Mean (s)   St. deviation   Confidence interval   Duration rates
s               2062        0,102        0,04          0,100–0,104
s + sp           570        0,182        0,05          0,178–0,186           1 : 1,8
s + sil          228        0,182        0,03          0,178–0,186           1 : 1,8
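The confidence intervals of Table 3 can be cross-checked from the published sample sizes, means and standard deviations. The sketch below assumes the intervals were computed with the usual normal approximation (mean ± 1,96 · sd / √n), which the paper does not state explicitly.

```python
import math

# Published summary statistics for the word-medial [s] row of Table 3.
n, mean, sd = 2062, 0.102, 0.04

# Normal-approximation 95% interval (an assumption about how the paper's
# intervals were computed): mean +/- 1.96 * sd / sqrt(n).
half_width = 1.96 * sd / math.sqrt(n)
lo, hi = mean - half_width, mean + half_width
print(f"95% CI: {lo:.3f} to {hi:.3f}")
```

The computed interval rounds to 0,100–0,104, matching the row as printed.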

The results for possible geminates and single sounds do reveal the process of degemination (see Table 4): the possible geminates are only 1,1 times longer than unambiguously non-geminate ones. This difference is statistically significant for non-palatalized [s]; the sample size of the palatalized ones might have been too small to reveal statistical significance. Thus, synthesizing a combination of words like vaikas serga at normal speech rate would require a longer pause between them in order to obtain two separate sounds.

Table 4. The duration of possible geminates and single consonants

Consonant   Sample size   Mean (s)   St. deviation   Confidence interval   Duration rates
s                499        0,100        0,04          0,097–0,104
ss               145        0,108        0,03          0,103–0,112           1 : 1,1
s'               499        0,098        0,03          0,095–0,101
s's'              79        0,104        0,03          0,098–0,110           1 : 1,1
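The significance pattern claimed for Table 4 can be checked from the summary statistics alone. The sketch below applies Welch's two-sample statistic (an assumption: the paper does not name the test it used); with samples this large, |t| > 1,96 roughly corresponds to significance at the 95% level.

```python
import math

# Welch's two-sample t statistic computed from summary data.
def welch_t(n1, m1, s1, n2, m2, s2):
    return (m2 - m1) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Values taken from Table 4: single vs. possible geminate.
t_plain = welch_t(499, 0.100, 0.04, 145, 0.108, 0.03)  # [s] vs. [ss]
t_palat = welch_t(499, 0.098, 0.03, 79, 0.104, 0.03)   # [s'] vs. [s's']
print(round(t_plain, 2), round(t_palat, 2))
```

The first statistic exceeds 1,96 while the second does not, which agrees with the paper's claim that the geminate difference is significant only for non-palatalized [s].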

Lithuanian is also a language in which long stressed syllables may contrast prosodically in tonemes or syllable accents (for example: gìnti 'to defend' and giñti 'to drive off'). In Standard Lithuanian a clear distinction is made between diphthongal and monophthongal allotones of syllable accents. A diphthongal circumflexed (rising) allotone is produced by emphasizing and lengthening the second element of a biphonemic diphthong or semi-diphthong and by reducing its first element (for example, kar̃tas, ver̃kti). This tendency is observed in this research as well:

Table 5. The duration of sonorants in monophthongs and diphthongs5

Consonant          Sample size   Mean (s)   St. dev.   Duration rates
V/C–V/C [m m']         1726        0,070      0,023
V–C [m m']              269        0,088      0,027       1 : 1,3
V/C–V/C [r r']         2036        0,046      0,019
V–C [r r']             1164        0,057      0,025       1 : 1,2
V/C–V/C [n n']         2116        0,062      0,028
V–C [n n']              344        0,075      0,023       1 : 1,2
V/C–V/C [l l']         1627        0,059      0,028
V–C [l l']              228        0,083      0,032       1 : 1,4

4 The symbol [s] marks here the duration of the fricative in the middle of a word; s + sp is the [s] in the final position of a word before a pause in the middle of the phrase (vaikas verkia); s + sil is the [s] in the final position of a word before the pause at the end of the phrase (verkia vaikas).
5 V/C–V/C represents sonorants uttered in monophthongs; V–C in diphthongs.


sonorants uttered in diphthongs are 1,3 times longer than in monophthongs. This ratio becomes about 1,25 if the sonorants forming part of the diphthong are stressed (see Tables 5–6).

Table 6. The duration of sonorants in stressed and unstressed positions

Consonants   Sample size   Mean (s)   Stand. deviation   Duration rates
[m m']            269        0,088        0,027
[m̃ m̃']            64        0,104        0,048            1 : 1,2
[r r']           1164        0,057        0,019
[r̃ r̃']           196        0,075        0,025            1 : 1,3
[n n']            344        0,075        0,023
[ñ ñ']            126        0,095        0,024            1 : 1,3
[l l']            228        0,083        0,032
[l̃ l̃']            80        0,102        0,028            1 : 1,2


3.3. Voicing

The duration of consonants is also influenced by voicing. The quantity of plosive consonants articulated with closed lips (bilabials) is similar regardless of whether they are voiced or voiceless (see Figure 2). Voiceless dental and velar consonants are slightly longer (1,1 times as long) than the corresponding voiced ones. Although these are only small differences, they are statistically significant. Vibration of the vocal folds while uttering voiced consonants causes the shorter duration: when the articulation process requires the activity of more articulators, the production of sounds gets more complex and shorter [3, 6, 19, 20]. Why, then, is the duration of the voiceless and voiced bilabial consonants the same? The results of the temporal investigations of Lithuanian dialects [6] and of the duration of consonants in nonsense words or sentences [3] are somewhat different: there, the voiceless bilabials are 1,1 times longer, thus indicating that in continuous speech we tend to shorten these sounds. Voiceless fricative consonants are 1,2 times longer than the voiced ones (see Figure 2), and these differences in duration are statistically significant. The difference in the quantity of the fricative consonants [x x' h h'] of foreign origin is statistically not significant, although it is the same (25 ms) as for the Lithuanian ones; this could, however, be due to data sparsity in the analyzed text.

3.4. Palatalization

Palatalized consonants differ from non-palatalized ones by an additional raise of the middle part of the tongue towards the hard palate. Thus, they should be shorter because of the more complex articulation pattern [19]. The results obtained with the plosive and fricative consonants actually do not validate this assumption: palatalized dental consonants and palatalized consonants articulated in the depth of the mouth (velars) are on average 1,1 times longer than their non-palatalized counterparts (see Figure 3). The difference is not big, but it is statistically significant. The durations of


both palatalized [p' b'] and non-palatalized [p b] plosive bilabials do not differ. This indicates that the articulatory features of the consonants articulated in the front part of the mouth are not as distinct as of those articulated in the back part. This is not a surprising conclusion: consonants articulated in the back part of the mouth are clearer, they are easy to hear, and their distinctive features are more discriminant (their spectrum is often more compact) than of those articulated in the front part of the mouth. On the other hand, bilabials often remain relatively hard when uttered together with other consonants; in this paper, as is common for Standard Lithuanian, they were nevertheless labelled as palatalized. Palatalized fricative consonants are slightly longer (1,1 times) than the non-palatalized ones. However, the Bernoulli effect [23] states that airflow passing through a narrower gap (as in the case of palatalized fricatives) should pass faster; the consonant durations, however, do not seem to follow this phenomenon. The dependence of consonant duration on palatalization in the case of peripheral consonants is not so obvious. Only the palatalized sonorant consonants are shorter (1,1 times) than their respective non-palatalized counterparts in all classes. In comparison with the results of other authors [3, 6], no significant impact of this articulatory feature, palatalization, was observed; the results are too divergent.


Figure 2. The duration of voiceless-voiced consonants

Figure 3. The duration of consonants and the impact of palatalization

The analysis of the correlation between consonant duration and palatalization can partly support the previous remarks that the duration may be influenced by the complexity of articulation (the additional raise of the tongue).

Conclusion

It can be concluded that fricatives (except [f f']) are almost one and a half times longer than plosives, while sonorant consonants are more similar to plosives in their duration. However, it is not the place of articulation but the way the air passes through that determines some duration regularities; the place of articulation does not show any regular tendency. Palatalization appears to have no significant impact on the


quantity of the consonants: only a few palatalized plosive and fricative consonants are longer than the non-palatalized ones and, on the contrary, only non-palatalized sonorant consonants are longer than the palatalized ones. The most significant features impacting the duration of consonants are voicing and the manner of articulation. For further quantity analysis some additional factors should be considered: how the duration of the consonants depends on the position in a word, the adjacent sounds, or the phrase length. The speech rate and intonation changes should also be considered in further research.

References

[1] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev and P. Woodland, The HTK Book, 2000, http://htk.eng.cam.ac.uk/docs/docs.shtml.
[2] V. Ambrazas, E. Geniušienė, A. Girdenis et al., Lithuanian Grammar, Baltos lankos, Lithuania, 1997.
[3] M. Tankevičiūtė, Bendrinės lietuvių kalbos intervokalinių priebalsių trukmė, Kalbotyra 32 (1) (1981), 96–105.
[4] M. Tankevičiūtė, Kirčiuoto skiemens priebalsių trukmė ir jos santykis su loginiu kirčiu, Kalbotyra 33 (1) (1982), 106–120.
[5] M. Strimaitienė, A. Girdenis, Priebalsių junginių trukmė kaip atvirosios sandūros indikatorius bendrinėje lietuvių kalboje, Kalbotyra 29 (1) (1987), 61–68.
[6] R. Kliukienė, Šiaurės žemaičių intervokalinių priebalsių trukmė, Kalbotyra 44 (1) (1995), 58–68.
[7] S. Dereškevičiūtė, Dėl lietuvių kalbos sklandžiųjų priebalsių kiekybės, Žmogus ir žodis 10 (1) (2008), 15–19.
[8] S. Dereškevičiūtė, A. Kazlauskienė, Dusliųjų sprogstamųjų priebalsių spektrinė analizė ir jų sprogimo trukmė, Garsas ir jo tyrimo aspektai: metodologija ir praktika / Sound and its Research Aspects: Methodology and Practice (2009), 98–111.
[9] A. Kazlauskienė, Pastabos dėl lietuvių bendrinės kalbos pučiamųjų priebalsių kiekybės, Valoda dažādu kultūru kontekstā (2006), 148–154.
[10] A. Kazlauskienė, G. Raškinis, Lietuvių kalbos sprogstamųjų priebalsių kiekybė, Kalbų studijos 8 (2006), 64–69.
[11] E. Fischer-Jørgensen, Acoustic analysis of stop consonants, Miscellanea Phonetica 2 (1954), 42–59.
[12] G. E. Peterson, I. Lehiste, Duration of syllable nuclei in English, Journal of the Acoustical Society of America 32 (1960), 693–703.
[13] T. Cho, P. Ladefoged, Variation and universals in VOT: evidence from 18 languages, Journal of Phonetics 27 (1999), 207–229.
[14] R. D. Kent, Ch. Read, The Acoustic Analysis of Speech, Singular Publishing Group, San Diego, California, 1992.
[15] H. Y. You, An Acoustical and Perceptual Study of English Fricatives, unpublished master's thesis, University of Edmonton, Edmonton, Canada (1979).
[16] W. J. Hardcastle, Some observations on the Tense-Lax distinction in initial stops in Korean, Journal of Phonetics 1 (1973), 263–271.
[17] K. N. Stevens, S. J. Keyser, H. Kawasaki, Toward a phonetic and phonological theory of redundant features, in: Invariance and Variability in Speech Processes, Lawrence Erlbaum Associates, New Jersey, 1986.
[18] P. Shinn, A cross-language investigation of the stop, affricate and fricative manner of articulation, unpublished doctoral dissertation, Brown University, Providence, RI (1984).
[19] E. Mikalauskaitė, Lietuvių kalbos fonetikos darbai, Mokslas, Vilnius, 1975.
[20] L. Richter, The duration of Polish consonants, Speech Analysis and Synthesis (1976), 219–238.
[21] D. H. Klatt, Duration of [s] in English words, Journal of Speech and Hearing Research 17 (1974), 41–50.
[22] D. H. Klatt, Linguistic uses of segmental duration in English: Acoustic and perceptual evidence, Journal of the Acoustical Society of America 59 (1976), 1208–1221.
[23] G. J. Borden, K. S. Harris, L. J. Raphael, Speech Science Primer: Physiology, Acoustics, and Perception of Speech, Williams & Wilkins, Maryland, USA, 1994.

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited

Human Language Technologies – The Baltic Perspective I. Skadin¸a and A. Vasil¸jevs (Eds.) IOS Press, 2010 © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-53


Modelling the Temporal Structure of Estonian Speech
Mari-Liis KALVIK¹ and Meelis MIHKLA
Institute of the Estonian Language, Estonia

Abstract. The study focuses on Estonian rhythmic structure as revealed in fluent read speech. The core of the study involves determining the distinctive features of the three degrees of Estonian phonetic quantity and assessing the significance of those features by statistical methods, with the aim of enhancing the naturalness of synthetic speech by using the features that best identify each quantity degree in fluent Estonian speech. The theory of adjacent phones is tested on a large data set, and the role of intensity as a possible feature for identifying quantity degrees is investigated. According to the results of the phonetic and statistical analysis, the main constitutive factor of quantity degrees and, thus, of speech rhythm is the classical duration ratio of stressed and unstressed syllables, whereas the rest of the duration ratios and the tonal characteristics investigated turned out to be less significant for the data analysed.
Keywords: quantity degree, temporal and tonal characteristics, prosody modelling

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

Introduction
What is the basis of speech rhythmicity? Logically, where there is rhythm, there should be a system or unit lying at the base of speech temporal regulation. For the recent half-century certain units of the prosodic component have been used to divide languages into stress-timed, syllable-timed and mora-timed ones. Ramus et al. [1] focused on phonetic characteristics of the speech signal (vocalic and consonantal intervals) as units from which to compute correlates of rhythm. Speech passages have also been investigated by Grabe and Low, who developed special indices (e.g. the Pairwise Variability Index, PVI) which they have tested on 18 languages [2], including Estonian. According to their results Estonian was classified as a mixed or indefinable language, that is, not belonging to any traditional class. Further calculations were later made by Asu and Nolan [3], giving similar results. In our study we carry out new tests (see 2.1) and assume that Estonian belongs rather to the stress-timed languages. The newest definitions of speech rhythmicity [4] are focused on the onset of the stressed syllable rime. Different aspects of the vowel of the stressed syllable, as important factors of speech temporal structure, have a major role in the enhancement of the naturalness of synthetic speech [5]. Of such aspects the nucleus of the stressed syllable is particularly important for Estonian, where a sequence of a stressed and an unstressed syllable, i.e. a foot, is the domain of the ternary phonological opposition between the short (Q1), long (Q2) and overlong (Q3) quantity degrees.¹

¹ Corresponding author: Researcher Extraordinary. Address: Institute of the Estonian Language, Roosikrantsi 6, Tallinn, Estonia. E-mail: [email protected]


M.-L. Kalvik and M. Mihkla / Modelling the Temporal Structure of Estonian Speech

Estonian quantity degrees are suprasegmental, defined by the duration ratios of stressed to unstressed syllables and by the pitch contour [6]. Several studies have shown that the duration ratio of the first to the second syllable is 2:3 for the first quantity degree (Q1), 3:2 for the second quantity degree (Q2), and 2:1 for the third quantity degree (Q3) (e.g. [6], [7]). The second characteristic is the pitch (F0) curve, which in the first syllable of a Q1 or Q2 word is flat or rising, but displays a considerable fall in the first third of the first vowel of Q3 words ([8], [9], [10]). Lehiste and Eek have conducted perception tests proving that the difference in the F0 contour serves as an important signal of Q2 vs. Q3 if the duration ratio fails. Thus Lehiste [11] has suggested that quantity degree discrimination is a two-step process: first a distinction is made, by means of duration ratios, between the short and long degrees, and then between the long and overlong ones, using the pitch (F0). The significance of the F0 curve as the discriminator between the two longer degrees, and the binary nature of quantity degree discrimination, have also been confirmed by later studies (e.g. [12]). Beside those traditional distinctive features, other characteristics have been suggested as potential discriminators between quantity degrees. In the present study A. Eek's ideas are tested on fluent speech material. The theory of adjacent phones, developed by Eek and Meister [13] after running a series of perception tests, is based on the assumptions that the judgement of the length of a segment depends on the preceding segment, and that for the difference to be perceived the segment should be 20–25% longer than the preceding one. This makes up a two-step system, where the duration ratio of the vowel (V1) and the consonant (C1) of the stressed syllable of a CV(::)CV word discriminates the Q1 words, having a short first vowel, from the Q2 and Q3 words, where the first vowel is long.
Next, the Q3 words are told apart from the Q2 (and Q1) ones on the basis of the duration ratio of the vowel (V2) of the unstressed syllable to the intervocalic consonant (C2). The critical value of the ratio is 1.4, which means that if the V1-to-C1 ratio is lower than this critical value (V1:C1 < 1.4), V1 is short and the word is Q1, whereas a ratio equal to or higher than 1.4 (V1:C1 ≥ 1.4) signals a long V1 indicative of Q2 or Q3. As for the second syllable, the respective duration ratio V2:C2, if lower than 1.4, is supposed to signal the third quantity degree (Q3), whereas a value equal to or higher than 1.4 should indicate Q1 or Q2. In addition, Eek [14] has pointed out that for Q3 the mean intensity of the first syllable should be higher than that of the second syllable, whereas for Q2 both syllables are pronounced with equal intensity. On the whole, the manifestation and perception of Estonian quantity degrees is a complex system where, depending on the situation, different characteristics may combine and come to the fore. At present we are studying fluent speech, trying to find some more phonetic parameters possibly important for quantity degree realization. As our general aim is to model speech temporal structure for better synthetic speech, we need the best possible parameters for modelling each quantity degree. Thus we compare the characteristics of different quantity degrees and weigh their significance by statistical methods. The present study is focused on the traditional duration ratio and the ratios of adjacent segments, as well as the possible role of pitch and intensity.
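Read as a decision procedure, the two-step rule can be sketched as below. This is only an illustration: the 1.4 threshold and the segment roles come from the text above, while the sample durations are the set 1 mean values reported later in the paper.

```python
def classify_quantity(v1: float, c1: float, v2: float, c2: float) -> str:
    """Two-step quantity-degree rule after Eek and Meister (a sketch).

    Durations are in milliseconds for the segments of a CV(::)CV word.
    Step 1: V1:C1 < 1.4 means a short stressed vowel, i.e. Q1.
    Step 2: among long-V1 words, V2:C2 < 1.4 indicates Q3, otherwise Q2.
    """
    if v1 / c1 < 1.4:       # short stressed vowel
        return "Q1"
    if v2 / c2 < 1.4:       # relatively short unstressed vowel
        return "Q3"
    return "Q2"

# Mean set 1 durations as a sanity check:
print(classify_quantity(68, 69, 87, 52))    # -> Q1 (V1:C1 ≈ 0.99 < 1.4)
print(classify_quantity(165, 66, 66, 59))   # -> Q3 (V1:C1 ≈ 2.5; V2:C2 ≈ 1.12)
print(classify_quantity(120, 64, 69, 52))   # -> Q3, although the true degree is Q2
```

On the set 1 means the rule recovers Q1 and Q3 but narrowly misclassifies the mean Q2 word (V2:C2 ≈ 1.33 just misses the 1.4 threshold), mirroring the observation below that V2:C2 is the less definitive of the two ratios.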

1. Material and method
For the analysis of quantity degrees we divided the material into two groups. Set 1 contains 485 words, of all three quantity degrees, read in passages by 25 male and female readers. Most of this material comes from the Babel corpus [15], some passages
have been read by Estonian Public Broadcasting announcers. Set 2 consists of 160 words from a male announcer who is the source voice for synthetic speech. His results function here as the control set. The two groups also differ in speech rate: set 1 has a moderate speech rate (122–144 words per minute), while set 2 has been read fast (159 wpm). From the material the word structure CV(::)CV was chosen, as it had been the main research object of previous studies as well as the domain of Eek's and Meister's theory. It is simple because here the ratio of the stressed and unstressed syllables is the same as that of their respective vowels (V1:V2). In Q1 words the vowel of the stressed syllable is short (e.g. pole [pole] 'is, are not'), while in Q2 and Q3 words it is long (e.g. poole [po:le] 'half GenSg') or overlong (poole [po::le] 'towards'). Most of the words have two syllables; they may occupy different positions in the phrase, represent different parts of speech, and be stressed as well as unstressed. The material was segmented and analysed using the Praat program. In each word we measured the duration of phones (ms), the F0 values at the initial and final boundaries of V1 and V2 and at the peak of V1, as well as the position of the peak (distance from the onset of V1), enabling calculation of the F0 rise. The mean intensity (dB) of both vowels was also measured. The results were averaged, and the F0 values (Hz) were converted into semitones (st). For statistical analysis CARTs and linear regression were used. For computing the %V, ΔV, ΔC and PVI values we followed the methods of Ramus et al. and of Grabe and Low. The material comes from the source named above and consists of five declarative sentences (135 syllables) per speaker (4 speakers), altogether 20 utterances.
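The conversion of F0 values from Hz to semitones mentioned above follows the standard logarithmic relation; a small sketch (the 100 Hz reference frequency is an assumption, as the paper does not state one, and semitone differences do not depend on it):

```python
import math

def hz_to_semitones(f: float, ref: float = 100.0) -> float:
    """Convert a frequency in Hz to semitones relative to a reference.

    Twelve semitones correspond to a doubling of frequency, hence the
    12 * log2 form. Semitone *differences* between two frequencies are
    independent of the chosen reference.
    """
    return 12.0 * math.log2(f / ref)

# A fall from an F0 peak of 180 Hz to 155 Hz at the vowel's end:
fall = hz_to_semitones(180) - hz_to_semitones(155)
print(round(fall, 2))  # ≈ 2.59 st, i.e. on the order of the Q3 falls discussed in 2.3
```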

2. Results


2.1. %V, ΔV, ΔC and the PVI
Our approach is based on the assumption that Estonian is neither a mora-timed nor a syllable-timed language: its speech rhythm is connected to stresses and thus to the quantity degrees. Studies have been made ([2], [3]) to compute the interval-based correlates of speech rhythmicity in Estonian in order to classify it according to the traditional ternary distribution. We computed the proportion of vocalic intervals (%V), the standard deviation of the duration of both the vocalic (ΔV) and consonantal intervals (ΔC), and also the normalized Pairwise Variability Index (PVI). Averaged results from earlier studies and from this research (in the bottom row) are presented in Table 1.

Table 1. %V, ΔV, ΔC and the PVI values of Estonian in different studies

                        %V     ΔV     ΔC     Vocalic PVI   Intervocalic PVI
Grabe, Low (2002)       44.5   39.6   31.9   45.4          40.0
Asu, Nolan (2005)       –      –      –      44.6          57.5
Kalvik, Mihkla (2010)   42.8   36.7   49.1   44.6          64.5
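For reference, the rhythm metrics of Table 1 can be computed from sequences of vocalic and consonantal interval durations. The sketch below follows the definitions of Ramus et al. (%V, ΔV, ΔC) and Grabe and Low (normalized and raw PVI); the interval durations passed in would be measured from segmented speech.

```python
import statistics as st

def ramus_metrics(vocalic, consonantal):
    """%V, ΔV, ΔC after Ramus et al.: the proportion of vocalic duration
    and the standard deviations of vocalic/consonantal interval durations."""
    total = sum(vocalic) + sum(consonantal)
    pct_v = 100.0 * sum(vocalic) / total
    return pct_v, st.pstdev(vocalic), st.pstdev(consonantal)

def npvi(durations):
    """Normalized PVI (Grabe & Low): the mean of |d_k - d_k+1| divided by
    the pair mean, times 100, over successive interval durations."""
    pairs = zip(durations, durations[1:])
    return 100.0 * st.mean(abs(a - b) / ((a + b) / 2.0) for a, b in pairs)

def rpvi(durations):
    """Raw PVI: the mean absolute difference of successive durations."""
    return st.mean(abs(a - b) for a, b in zip(durations, durations[1:]))

# Hypothetical interval durations (ms), not data from the study:
v = [68.0, 87.0, 120.0, 69.0]
c = [69.0, 52.0, 64.0, 52.0]
print(ramus_metrics(v, c), npvi(v), rpvi(c))
```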

Our results do not affirm that Estonian is beyond doubt a stress-timed language. On the other hand, they do not affirm that Estonian is a syllable-timed language either. Our vocalic PVI value is exactly the same as the one Asu and Nolan calculated. This value (44.6) actually matches the one that Grabe and Low computed for Catalan. Our intervocalic PVI differs from the result of Grabe and Low, who used the non-normalized (raw) intervocalic PVI; according to their data Estonian has the lowest rPVI among the languages studied. When we compare our results with the data in the study made
by Ramus et al. [1], the value of %V locates Estonian between Dutch (stress-timed) and French (syllable-timed), ΔV near Catalan (mixed) and ΔC near Italian (syllable-timed). According to our data Estonian stays in the so-called mixed or undefined class in the present distribution, though with its strong concentration on stress it is still rather stress-timed. Asu and Nolan suggest that for Estonian the more suitable units for calculating rhythmic correlates would be the durational PVI of the syllable and of the foot.

2.2. Durational ratios
Table 2 presents the averaged results of set 1. The material contains 234 Q1 words, 150 Q2 words and 101 Q3 words. Set 2 contains 160 words, including 50 Q1, 55 Q2 and 55 Q3 words.

Table 2. Mean durations (ms) and duration ratios of segments in Q1, Q2, and Q3 words

     C1   V1    V1:C1   C2   V2   V1:V2   V2:C2
Q1   69   68    1.05    52   87   0.82    1.74
Q2   64   120   1.99    52   69   1.80    1.37
Q3   66   165   2.59    59   66   2.59    1.17

The results in Table 2 coincide, in many aspects, with the previous results of several studies. The traditional duration ratio (V1:V2) is 0.6–0.7 for Q1, 1.2–1.6 for Q2 and 2.4–2.6 for Q3 ([7], [8], [9], [14]). Newer data on the same structure come from spontaneous speech, where the relevant duration ratios for Q1, Q2 and Q3 are found to be 0.7, 1.7 and over two, respectively [16]. The same ratio measured on our reference set (2) yields 0.7, 2.0 and 2.8. The somewhat higher duration ratios yielded by our measurements are due to material variability (set 1) and idiolectal peculiarities (set 2). Our analysis of adjacent segments revealed the following: it is true that V1 is short (and thus the quantity degree is Q1) if the V1-to-C1 ratio is lower than 1.4 (1.05 in Table 2). The Q2 and Q3 words have yielded 1.99 and 2.59, respectively, which are both higher than 1.4. In the second syllable the ratio of the vowel to the preceding consonant (V2:C2) is less definitive. However, the result for Q1 (1.8) matches well with the theoretical assumption that V2:C2 should be higher than or equal to 1.4, while for Q2 words the ratio can be rounded to 1.4. The Q3 words are also well distinguished, as their V2:C2 ratio bears evidence to the theory, being lower than 1.4 (1.17). The same is proved by the results of set 2, where the V1:C1 ratios for Q1, Q2 and Q3 are 1.1, 2.2 and 2.7, respectively, and the respective V2:C2 ratios are 2.0, 1.3 and 1.0. Next we try to demonstrate, using statistical methods, which duration ratio (the classical V1:V2 or Eek's two-step system V1:C1 → V2:C2) is more important in quantity degree classification and thus more relevant for modelling speech temporal structure. For either system and for either data set we generated CART decision trees with quantity degree as the dependent variable.
In all trees the quantity degree showed a positive correlation with the classical duration ratio, while the other parameters either failed to surpass the threshold or were salient only in more distant subclasses covering fewer words. For both data sets the words fell into three classes mostly corresponding to Q1, Q2 and Q3. Figure 1 represents the decision tree generated from V1:C1 and V2:C2 of set 1. Here, as well as in the tree for set 2, the tree classifies the quantity degrees using only one parameter, V1:C1. Being an indicator of the length of the vowel of the stressed syllable (short or long), this parameter (V1:C1) refers to the primary division of
the syllables into short (Q1) and long (Q2, Q3) ones. The parameter V2:C2 is not significant. This brings back an association with M. Hint's theory [17] of syllabic quantity, stating that although quantity is realized in the foot, the quantity degree of a word depends on the parameters of the stressed syllable.

Figure 1. Decision tree based on a two-step system of the ratios V1:C1 and V2:C2.

In order to obtain another criterion for assessing the significance of the duration ratios, some simple linear regression equations were generated. For set 1 the linear model using the classical duration ratio V1:V2 yielded a rather strong positive correlation between input and output (correlation coefficient r = 0.867). Consequently the model explains more than 75% of the variability of the data (coefficient of determination r² = 0.752). Set 2 did not show much difference, yielding r = 0.838, r² = 0.702. As for the model using the two alternative ratios, the correlation coefficient was 0.759; thus the model covered no more than 58% of the data of set 1. The result was corroborated by set 2 (r = 0.772, r² = 0.596). As a result of our experiment the classical duration ratio V1:V2 retained its position as the main feature distinguishing between quantity degrees, thus proving to be, once again, a very important parameter in generating the temporal structure of Estonian speech.
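The correlation and determination coefficients used above can be reproduced with a plain Pearson correlation, since for simple one-predictor linear regression r² equals the coefficient of determination. A sketch with hypothetical (ratio, degree) pairs, not the study's actual data:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient; for a simple linear regression on
    one predictor, r squared is the share of variance explained."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical V1:V2 ratios with their quantity degrees coded 1, 2, 3:
ratios = [0.7, 0.8, 1.7, 1.9, 2.5, 2.7]
degrees = [1, 1, 2, 2, 3, 3]
r = pearson_r(ratios, degrees)
print(round(r, 3), round(r * r, 3))  # r and the share of explained variance
```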


2.3. Pitch
The pitch analysis concentrates on the vowel of the stressed syllable (V1). If the F0 curve in the domain of V1 is peakless, it is called smooth (falling, rising or flat). Below we treat the two data sets separately, as the statistics differ. In set 1, 215 of the 234 Q1 words, 84 of the 150 Q2 words, and 45 of the 101 Q3 words have a smooth pitch contour in V1. In words with a peak, the peak is located in the second half of the vowel (the rise spanning 66% of V1 for Q1 and 68% for Q2) and the fall is not sharp (less than one st). Q3 shows different scores, as its F0 peak is located more to the front (rise 48%) and the fall from the peak to the end of the vowel is steeper (2.4 st). How do the above results relate to earlier data? There are F0 contour descriptions from lab speech ([8], [9], [11]) as well as from spontaneous speech ([10], [16]). Most of the data concern Q2 and Q3 words, as the characteristic pitch contour is crucial in discriminating between those two quantity degrees. For Q2, F0 should be rising up to a peak located in the central or final part of the first-syllable vowel, whereas for Q3 the maximum of F0 is reached in the first quarter of the first syllable, while the fall in the first syllable is relatively sharp [11]. Eek and Meister [14] present the following statistics for the F0 peak location in lab speech: in Q2 words the rise lasts about half (50%) of V1, whereas in Q3 words it takes up 24% of its duration. According to Krull [10] those values are 55% and 44%, respectively. Asu et al. [16] focus on the turning point (a notable change) in the F0 contour, which in Q2 (and Q1) words is usually located near the syllable boundary, while in Q3 words it is found in the first half of V1. Considering the pitch changes from the turning point to the end of the first vowel, there is a clear difference between the quantity degrees: for Q3 F0 falls about 2 st vs. 0.5 st for Q2 and Q1.
A difference between our measurements and the previous data is manifested in the large number of cases where the F0 peak is not located in the domain of the first vowel and the pitch is smooth. One possible explanation could be that our material represents fluent speech containing stressed as well as unstressed words, and the latter group fails to display the characteristic F0 contour. A closer look at the material reveals that although there are indeed more F0 peaks in words located in stressed positions, and the location of those peaks is somewhat closer to the statistics presented in earlier studies, the difference is not significant. Hence it can be concluded that, at least for this data set, in fluent speech the pitch contour is not a crucial parameter for quantity degree determination. Neither does F0 appear as an additional parameter in the decision tree. This conclusion may relate to the criticism from visually impaired users, who find synthetic speech tiring to listen to when its F0 values precisely imitate those found in lab speech. In fluent speech the quantity degree need not be emphasized by all possible parameters, as in case of doubt (if the duration ratio as the primary characteristic is unclear) context may help. The results of set 2 resemble those of set 1 in many ways. The reader of set 2 also presents peakless F0 curves of V1 in practically equal proportions: for Q1 in 34 words (out of 50), for Q2 in 24 words (out of 55), and for Q3 in 34 words (out of 55). In the rest of the words the F0 rise makes up 69% (Q1), 61% (Q2) and 37% (Q3) of the respective V1 pitch contours. As in set 1, the Q3 words display a relatively sharp fall of the pitch from the peak to the end of V1 (2.7 st), never observed in Q1 or Q2 words. The difference concerns the decision tree, where the F0 rise surpasses the parameter threshold and appears in a subdivision of the tree.
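The peak-position and fall measurements used in this section can be illustrated on a sampled F0 track. The contour below is a hypothetical Q3-like example (early peak, steep fall), not data from the study:

```python
import math

def peak_stats(times, f0_hz, v1_end):
    """Rise proportion and peak-to-offset fall (in semitones) for an F0
    track sampled over the stressed-syllable vowel V1.

    times: sample times (s) measured from V1 onset; f0_hz: F0 values at
    those times; v1_end: duration of V1 (s).
    """
    i = max(range(len(f0_hz)), key=f0_hz.__getitem__)   # index of the F0 peak
    rise_pct = 100.0 * times[i] / v1_end                # how far into V1 the peak lies
    fall_st = 12.0 * math.log2(f0_hz[i] / f0_hz[-1])    # fall from peak to V1 offset
    return rise_pct, fall_st

# A hypothetical Q3-like contour over a 150 ms vowel:
t = [0.00, 0.03, 0.06, 0.09, 0.12, 0.15]
f = [150, 170, 165, 155, 148, 145]
rise, fall = peak_stats(t, f, 0.15)
print(round(rise), round(fall, 1))  # early peak (~20% into V1), fall of roughly 2.8 st
```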
First the 99 words that the tree is based on are divided according to the traditional duration ratio into Q1 (V1:V2 […]

In the used typology, the name of a DA consists of two parts separated by a colon: 1) the first two letters give the abbreviation of the name of the act group, e.g. DI – directives, RP – repair acts; a third letter is used only for AP acts and marks the first (F) or the second (S) part of an AP act; 2) the full name of the act, e.g. DIF: Request, RPF: Reformulation. The dialogues are annotated by members of the research group on spoken Estonian at the University of Tartu.
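The dialogue-act labels used in the examples that follow can be unpacked mechanically according to this naming scheme; a sketch (the tag strings are taken from the examples in this paper):

```python
def parse_da(tag: str):
    """Split a dialogue-act label like 'DIF: Request' into its parts.

    Per the typology described above: two letters name the act group, an
    optional third letter marks the first (F) or second (S) part of an
    adjacency-pair act, and the full act name follows the colon.
    """
    abbrev, name = (part.strip() for part in tag.split(":", 1))
    group = abbrev[:2]
    ap_part = abbrev[2] if len(abbrev) == 3 else None
    return group, ap_part, name

print(parse_da("DIF: Request"))        # ('DI', 'F', 'Request')
print(parse_da("RPF: Reformulation"))  # ('RP', 'F', 'Reformulation')
print(parse_da("RPS: Repair"))         # ('RP', 'S', 'Repair')
```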


2.2. Non-understanding
Non-understandings are the most frequent other-initiated repairs in the collected dialogues, regardless of which wizard took the role of Aivo. Other-initiated repair is mostly initiated by Aivo: the user formulates his/her utterances in a very natural way, so Aivo often has to initiate a non-understanding. Non-understanding is often initiated when the user leaves the conversation topic and asks questions about the agent Aivo. Non-understandings can roughly be divided into three different groups. The first group consists of non-understandings that express the non-understanding directly with words (ex. 2). In this example, Aivo advises the user to go to the cinema and see the film Disneys A Christmas Carol, and provides information in lines 3 and 4. Sometimes users like to say film titles, words and phrases in English. This happens here, too. The user writes an English phrase in line 5, and Aivo initiates a non-understanding ('I do not understand'). It is well known that people adapt their language to the needs of the interlocutor. Although the test participants adjust to Aivo as well, they sometimes like testing how far the DS has developed.

Example 2. Verbally expressed non-understanding (dialogue 209)
(1) A: Mine vaata Jõululugu    DIF: PROPOSAL


S. Pärkson / Human-Computer Interaction in Estonian

    go and see Disneys A Christmas Carol
(2) U: kas seal mängib Jim Carrey?    QUF: CLOSED YES/NO QUESTION  ACF: ADJUSTING THE CONDITIONS OF ANSWER
    does Jim Carrey play there?
(3) A: Jõululugu Ebenezer Scrooge (Jim Carrey)    QUS: YES  ACS: ADJUSTING THE CONDITIONS OF ANSWER
    A Christmas Carol Ebenezer Scrooge (Jim Carrey)
(4) A: Disneys A Christmas Carol Animatsioon Alla 6 a mittesoovitatav Pikkus: 95 min.    QUS: GIVING INFORMATION  ACS: ADJUSTING THE CONDITIONS OF ANSWER
    Disneys A Christmas Carol, an animation, not recommended for children under 6, duration: 95 minutes
(5) U: not bad    DIS: OTHER  OPF: OPINION
    not bad
(6) A: Ma ei mõista    DIS: GIVING INFORMATION  RPF: NON-UNDERSTANDING
    I do not understand
(7) U: vägev    OPF: ASSERTION  RPS: REPAIR
    awesome

The second group is formed of non-understandings in which the initiator asks the partner to give further details (ex. 3).

Example 3. A non-understanding that insists on a specification (dialogue 36)
(1) A: kas sind huvitab homne ilm?    QUF: OPEN YES/NO QUESTION
    are you interested in tomorrow's weather?
(2) U: kogu käesoleva nädala ilm    QUS: OTHER  DIF: REQUEST
    the weather of the whole week
(3) A: homme on 10 kraadi sooja    DIS: OTHER
    it will be 10 degrees tomorrow
(4) A: palun täpsusta küsimust    DIF: REQUEST  RPF: NON-UNDERSTANDING
    please specify the question

The third group of non-understandings is more multifarious than the above-mentioned ones. They indicate that some previous turn contains non-understandable information and that more particular information is needed for successful answering. These non-understandings are usually formulated by using the word(s) of the problematic turn (ex. 4), and they are often wh-questions which expect the giving of information.

Example 4. Non-understanding that partly repeats the problematic turn (dialogue 257)
(1) A: saan Teid kuidagi veel aidata?    QUF: OPEN YES/NO QUESTION
    can I offer you additional help?
(2) U: ei saa!    QUS: NO
    no, you can not
(3) U: head aega imelik tegelane    RIF: GOODBYE
    goodbye strange person
(4) A: ei saa mida?    QUF: WH-QUESTION  RPF: NON-UNDERSTANDING
    what I can not?
(5) A: vestluse lõpetamiseks kirjutage LÕPP    DIS: GIVING INFORMATION
    to finish the conversation type END
(6) U: mis krdi lõpp?    QUF: WH-QUESTION  RPF: NON-UNDERSTANDING


what the damn end?

The distance between a non-understanding and the repair is often more than one turn; moreover, the repair may not occur at all. Aivo initiates the non-understandings of the first and the second group; both the users and Aivo use the non-understandings of the third group.

2.3. Clarification
The problematic turn of a clarification could be correct and contain enough information. The analysis shows that the user initiates clarification as a reaction to giving information, expressing his/her emotions in connection with the information given by Aivo (ex. 5).

Example 5. Clarification as a reaction to giving information (dialogue 40)
(1) A: reede öösel on lume võimalus    DIS: GIVING INFORMATION
    there is a chance of snow on Friday night
(2) U: ausalt ka?    RPF: CHECKING²
    honestly?
(3) A: miinimumtemperatuur reede öösel on 0 kraadi    QUS: GIVING INFORMATION
    the minimum temperature on Friday night is 0 degrees
(4) A: ausalt jah    QUS: YES  RPS: REPAIR
    honestly yes


Aivo uses clarification to win time and plan the answer. The repair is not relevant on such an occasion (ex. 6), unless it contains new information. The user requests information (line 1); Aivo repeats the word 'tomorrow', which is a clarification, and gives information in the next turn, although the user does not answer before the 5th line. This behaviour shows that the user's answer has no relevance for Aivo.

Example 6. Delayed Repair act (dialogue 82)
(1) U: aga ütleme siis et homme kell 15    DIF: REQUEST
    but let's say tomorrow at 3 p.m.
(2) A: Star Trek, kell 18.45    QUS: GIVING INFORMATION
    Star Trek, at 6.45 p.m.
(3) A: homme?    RPF: CHECKING
    tomorrow?
(4) A: X-Meeste algus: Wolverine, kell 14.15    QUS: GIVING INFORMATION
    X-Men Origins: Wolverine, at 2.15 p.m.
(5) U: jah    QUS: YES  RPS: REPAIR
    yes

² i.e. clarification


Clarifications can be divided into direct clarifications, in which the turn remains the same as the previous one, and modified clarifications, in which the speaker repeats the previous turn and makes some grammatical or lexical changes. The analysis indicates that only modified clarifications occur in the written WOz dialogues (ex. 6). Thus the purpose of written clarification differs from the aim of spoken clarification because the communicators can re-read written dialogues. Therefore, they initiate clarification to find out some nuances, and only modified clarification is suitable for that task. Clarification is often formulated by closed yes/no question that expects yes/no for an answer which is very suitable for Aivo to give further information. 2.4. Reformulation There appear coincidence to a particular extent in the task of reformulations and clarifications. Reformulation is also initiated to win time and plan the answer by Aivo or convince the user that Aivo is a computer program, not a person. Reformulation also occurs when understanding of the utterance or its part includes a real problem (ex. 7: channel 11 or channel 2). Example 7. Reformulation when the information is interpreted incorrectly (dialogue 46) (1 8DJDNDVVDWHDGPLVPlQJXILOPWlQDNDQDOWXOHE QUF: WH-QUESTION but do you know which film is coming on channel 11 today (2) A: kas leidsid soovitud teabe? QUF: CLOSED YES/NO QUESTION did you find the requested information? (3 8MDKOHLGVLQNOO

QUS: YES

yes, I did (4 $NDVVDP}WOHG.DQDO"

QUF: CLOSED YES/NO QUESTION

RPF: REFORMULATION

do you mean Channel 2? Copyright © 2010. IOS Press, Incorporated. All rights reserved.

(5) U: ei kanal 11

QUS: NO

RPS: REPAIR

AI: SPECIFICATION

no, channel 11

3. Concluding Remarks and Further Work
This paper reports the collection of written dialogues by the WOz method. The simulation environment Aivo and some principles of the experiment series are described, and information about the participants of the experiments, wizards and users, is also given. The collected dialogues are then analyzed by CA, focusing especially on the other-initiated repairs: non-understanding, clarification and reformulation. There is a clear difference between Aivo's and the users' communication patterns. The ways and purposes of initiating other-initiated repairs are related to the communication roles of the interlocutors. Some types of other-initiated repairs are used only by Aivo, while others are used by both Aivo and the users. Apparently, some of the other-initiated repairs are related to the level of natural language use, because Aivo's linguistic behaviour is limited (for the sake of truthfulness) during the WOz experiments.


Future work will include a spoken WOz experiment series and the analysis of written and spoken WOz dialogues. A useful source of information could be the comparison of WOz dialogues with human-human phone calls. By comparing different kinds of dialogue it is possible to identify universal and specific patterns of communication, and thereby detect the patterns which are suitable for a real DS.

Acknowledgement This paper is supported by the European Regional Development Fund through the Estonian Center of Excellence in Computer Science (EXCS) and the Estonian Science Foundation (grant 7503).

References A. Batliner, K. Fischer, R. Huber, J. Spilker, E. 1|WK, How to find trouble in communication, Speech Communication 40 (2003), 117-143. [2] A. Bellucci, P. Bottoni, S. Levialdi, WOEB: Rapid Setting of Wizard of Oz Experiments and Reuse for Deployed Applications, 'LSDUWLPHQWRGL,QIRUPDWLFD8QLYHUVLWj6DSLHQ]DGL5RPD,WDO\, 2009. [3] N. 'DKOElFN $ -|QVVRQ / Ahrenberg, Wizard of Oz Studies ± Why and How, Knowledge-Based Systems Vol. 6, No. 4 (1993), 258-266. [4] K. 0lNHOl, E.-P. Salonen, M. Turunen, J. Hakulinen, R. Raisamo, Conducting a Wizard of Oz Experiment on a Ubiquitous Computing System Doorman, Proceedings of the International Workshop on Information Presentation and Natural Multimodal Dialogue, Computer-Human Interaction Unit, Department of Computer and Information Sciences, University of Tampere, Finland, 2001. [5] M. Kullasaar, Eestikeelse dialoogikorpuse aUHQGDPLQH ÄY}OXU 2]L³ WHKQLNDJD 0DJLVWULW|| 7DUWX hOLNRRO$UYXWLWHDGXVHLQVWLWXXW .lVLNLUL, 2001. [6] J. Nielsen, Usability Engineering, San Fransisco, published by Morgan Kaufmann, 1993. [7] K. Jokinen, Constructive Dialogue Modelling Speech Interaction and Rational Agents, UK, John Wiley & Sons Ltd, 2009. [8] I. Hutchby, R. Wooffitt, Conversation analysis: principles, practices, and applications, Cambridge (UK): Polity Press, Malden (Mass.): Blackwell, 2006. [9] A.J. Liddicoat, An Introduction to Conversation Analysis, London, New York: Continuum, 2007. [10] T. Hennoste, $ 5llbis, 'LDORRJLDNWLG HHVWL LQIRGLDORRJLGHV WSRORRJLD MD DQDOV, 7DUWX hOikooli Arvutiteaduse Instituut, 7DUWX7DUWXhOLNooli Kirjastus, 2004. [11] J. Sidnell, Conversation analysis: an introduction, Oxford (UK): John Wiley & Sons Ltd, 2010. [12] M.-L. Sorjonen, .RUMDXVMlVHQQ\V.HVNXVWHOXQDQDO\\VLQSHUXVWHHW (1997), edit. L. Tainio, Vastapaino, Tampere, Finland, 111 ± 137.

Copyright © 2010. IOS Press, Incorporated. All rights reserved.


Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited

Human Language Technologies – The Baltic Perspective I. Skadiņa and A. Vasiļjevs (Eds.) IOS Press, 2010 © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-107


A Framework for Asynchronous Dialogue Systems MARGUS TREUMUTH University of Tartu, Estonia

Abstract. This paper presents a framework for asynchronous dialogue systems, used in developing text-based dialogue systems. The framework features web-based asynchronous turn management and AI-assisted live agent chat. Some other features are also briefly covered in this paper (a language-independent solution for spell-checking the user input, a language-independent solution for the word-order problem, semantic resolution and a web-based interface for WOZ data collection). The exploitation of the asynchronous communication pattern has improved the communication style of the user, resulting in a decreased number of single-word utterances. A higher word count per utterance is important when performing shallow language analysis without complete semantic understanding. The framework is currently tailored for the Estonian language, yet most of its features and modules are language independent. Keywords. asynchronous turn management, phrase pattern search, word order, text-based dialogue systems, semantic resolution


1. Introduction and Related Work This paper presents a framework for asynchronous dialogue systems – the ADS framework. The ADS framework, developed by the author, is a collection of integrated modules that can be used in building text-based natural language dialogue systems. The framework features web-based asynchronous turn management and AI-assisted live agent chat. Some other features are also briefly covered in this paper (a language-independent solution for spell-checking the user input, a language-independent solution for the word-order problem, semantic resolution and a web-based interface for WOZ data collection). The dialogue systems built on the ADS framework use an open prompt (non-restrictive) approach [1]. The framework is currently tailored for the Estonian language, yet most of its features and modules are language independent. There are several projects that are similar to the ADS framework. I studied some outstanding frameworks, such as Olympus/RavenClaw [2] [3], Semantra [4] [5] and the CSLU Toolkit [6]. I also searched for demonstrations of web-based dialogue systems (DS) with asynchronous turn management, yet I was unable to find any evidence of such systems. I was not able to use these frameworks because Olympus/RavenClaw does not support the web-based asynchronous communication pattern. Semantra is a natural language interface (NLI) engine and not really suitable for building DS. In addition, Semantra is a commercial tool and not freely available for experimenting. The modules of the CSLU Toolkit are not language independent. Also, the CSLU Toolkit is not easily portable to the web.


M. Treumuth / A Framework for Asynchronous Dialogue Systems

Finally, it would be a rather complex task with these tools to integrate the morphological analysis needed to build systems in Estonian (or any other agglutinative language).

2. The Features of the ADS Framework


2.1. Web-Based Asynchronous Turn Management Most DS eventually enter an input phase where they wait for user input. While in the input phase, the system will not continue unless the user has given some input. The main problem is that such DS cannot give any additional information while in the input phase waiting for the user input. This can be described as a synchronous communication pattern. However, in real-life text-based conversations people often follow an asynchronous communication pattern. For example, the participants can type simultaneously while chatting in Skype or MSN. All parties can provide input at any given moment and can take any number of sequential turns without waiting for the other party to acknowledge each turn. Allen [7] has stated that “The behavior of a conversational agent should ideally be determined by three factors, not just one: the interpretation of the last user utterance (if any), the agent’s own persistent goals and obligations, and exogenous events of which it becomes aware.” So, if the dialogue system is stuck in the input phase waiting for the user utterance, there is no way for the dialogue system to continue, even if the system’s own goals or external events would require it to continue. The asynchronous communication pattern used in the ADS framework provides a way to escape this “stuck in the input phase” problem. The asynchronous approach might not be usable in spoken language conversations due to overlaps. Text-based systems, however, can gain from asynchronous turn management by having more flexible and realistic conversations. Overlaps are not an issue in text-based conversations, i.e. overlapping text does not become incomprehensible. There is some evidence, presented in the evaluation section, that the exploitation of the asynchronous communication pattern in the ADS framework has improved the communication style of the user. This has decreased the number of single-word utterances.
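The escape from the “stuck in the input phase” problem can be sketched as a small event loop: the system waits a bounded time for user input and otherwise takes a turn of its own. This is an illustrative sketch using Python’s asyncio, not the ADS implementation; the function names, timings and the three-turn loop are invented for the example.

```python
import asyncio

async def simulated_user(queue):
    # Stand-in for the web front end: the user may type at any moment.
    await asyncio.sleep(0.08)
    await queue.put("tere")

async def dialogue_system(queue, events):
    # The system is never stuck in the input phase: it waits a bounded
    # time for input and otherwise takes a proactive turn of its own,
    # driven by its goals or by external events.
    for _ in range(3):
        try:
            utterance = await asyncio.wait_for(queue.get(), timeout=0.05)
            events.append(("reply", utterance))
        except asyncio.TimeoutError:
            events.append(("proactive", None))

async def main():
    queue, events = asyncio.Queue(), []
    await asyncio.gather(simulated_user(queue), dialogue_system(queue, events))
    return events

events = asyncio.run(main())
print(events)
```

Run this way, the system emits at least one proactive turn while the user is still “typing”, and still replies to the utterance when it arrives.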
Higher word count per utterance is important when performing shallow language analysis without complete semantic understanding. If the number of words in a question is too small, the resulting information entropy of the question will also be low. Low information entropy will reduce the chance to match the question against existing patterns in the repository [8]. A framework for asynchronous language processing opens up many ways to enhance the dialogue (e.g. the dialogue system can answer many questions at once). Making interpretation, behavior, and generation asynchronous allows, for example, the system to acknowledge a question while it is still working on finding the answer. For example, a system with an asynchronous behavior subsystem can inform the user of a new, important event, regardless of whether it is tied to the user's last input. One of the benefits of the asynchronous communication pattern is revealed in the process of Wizard-of-Oz (WOZ) data collection, where it is essential to hide the fact that the computer is replaced by a human, as it is known that the user is quickly adjusting to the partner in the conversation [9]. If the dialogue system is running in the serial synchronous communication pattern, then the main problem in the WOZ data


collection is the high predictability of the turn-taking pace. The user can easily guess when and how quickly the system would usually reply, because serial synchronous systems always use a fixed turn-taking pace. It then has to be explained to the user why it sometimes takes so long to reply when the answer usually arrives within a few milliseconds. So it is complicated to switch to Wizard-of-Oz or live-agent assistance in synchronous DS. In the asynchronous communication pattern, however, irregular pauses are typical and accepted by the user. That is why there is a better chance of convincing the user that the partner is still the computer. 2.2. AI-Assisted Live Agent Chat


There are text-based DS on the web that interact with the customer based only on an AI-engine (e.g. IKEA Furniture – http://www.ikea.com – “Ask Anna”). There are also chat systems that interact with the customer based only on a real human being giving the answers (e.g. LHV – http://www.lhv.ee – “Chat with customer support”). The common problems with the first approach (all answers given by an AI-engine) are semantic errors made by the system. The problems with the second approach (all answers given by a human) are cost and availability, as it can be rather expensive to provide 24h online support with real human operators. The ADS framework provides a combined approach – an artificial-intelligence-assisted (AI-assisted) live agent chat system that allows a single live agent to handle a number of simultaneous chat sessions by having an AI-system handle the bulk of common, repeated questions. The AI-system allows the live agent to focus his or her attention on the few chat sessions needing unique service and effectively lowers the cost of supporting chat sessions. The server-side technology of the ADS framework uses an AI-engine as well as a live agent backend interface for a site to deliver a live-agent experience without the customer having to know whether the answer comes from the AI-system or from the live agent. I noticed that an attempt to patent the idea of AI-assisted live agent chat systems has been made [10]. The patent had not been granted as of May 2010. 2.3. Spell-Checking and Error Correction The spell-checking approach in the ADS framework is language independent. It is based on a measure of similarity between two strings. The Jaro-Winkler distance [11] is used in spell-checking the user input. This string comparison uses a prefix scale which gives more favorable ratings to strings that match from the beginning for a set prefix length. This is the main advantage compared to the Levenshtein distance.
An automated task in the ADS framework generates a domain lexicon, based on the words that appear in the recognition patterns expressed by regular expressions. The reason is that there is no need to spell-check the words that the system does not “understand”. The language analysis capability is limited to the rules. So, the rules contain all the words that we need to capture and understand. Words shorter than 6 letters are not spell-checked as the ambiguity risk would be too high.
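The Jaro-Winkler measure and the length-6 cut-off can be sketched as follows. This is a generic implementation of the published metric, not the ADS code; the 0.90 acceptance threshold and the example lexicon are illustrative assumptions.

```python
def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):                      # count matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    t, k = 0, 0                                     # count transpositions
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len1 + m / len2 + (m - t) / m) / 3

def jaro_winkler(s1, s2, p=0.1):
    # Boost the score for strings sharing a common prefix (up to 4 chars).
    j = jaro(s1, s2)
    l = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return j + l * p * (1 - j)

def correct(word, lexicon, threshold=0.90):
    # Words shorter than 6 letters are left alone (ambiguity risk too high),
    # as are words already in the domain lexicon.
    if len(word) < 6 or word in lexicon:
        return word
    best = max(lexicon, key=lambda w: jaro_winkler(word, w))
    return best if jaro_winkler(word, best) >= threshold else word
```

For example, with a toy lexicon {"hambaarst", "valu"}, the misspelled "hambarst" is corrected to "hambaarst", while the 3-letter "vlu" is left untouched.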


2.4. The Rule Based Semantic Resolution Resolving the user input involves parsing the input stream and placing its contents into a relational model. There is a considerable amount of pre-processing done prior to completing the relational model (tokenization, morphological analysis and spell checking). The morphology module of the ADS framework integrates the Estonian morphological analyzer [12]. The morphological analyzer is used in the pre-processing step to generate base forms from the original word forms. After this step the user input is stored in two different versions – an original version and a version with base forms. The version with original word forms has priority over the version with base forms in the pattern matching process. If the original form is successfully matched to the knowledge base patterns, the version with base forms is ignored. There are two main rule-based approaches in the ADS framework for semantic resolution. The first approach resolves the semantics of basic key phrases. The second approach resolves the semantics of temporal expressions. This separation of knowledge bases is rather similar to the approach used by [13], where the authors decided to separate the ontology used in general purpose language parsing from the ontology used in reasoning. Both of these rule-based approaches use a declarative representation and the knowledge base consists of pattern-response pairs. The structure of the rules is given as:

RULE
  PATTERN – a regular expression
  RESPONSE – a static response
  STATE – reference to additional responses
  IGNORE_WORD_ORDER – ignore word order (Y/N)


An example of a rule:

RULE
  PATTERN: (kartma|hirm) (valu|arst)
  RESPONSE: Ei ole põhjust karta!
  STATE:
  IGNORE_WORD_ORDER: Y

RULE (translated)
  PATTERN: (scared|fear) (pain|doctor)
  RESPONSE: No reason to be scared!
  STATE:
  IGNORE_WORD_ORDER: Y

The reference to additional responses (attribute STATE) can be blank. The patterns are given as regular expressions. A pattern may contain just one keyword. The switch for ignoring the word order of the input phrase (IGNORE_WORD_ORDER) is explained later in this paper. The sentences for answering are given as predefined fixed sentences. The ADS framework also uses dynamic responses that are generated based on information retrieved from the database. Morphological generation is used only with temporal expressions. Yet, these dynamic responses are not represented in the declarative knowledge base; they are represented as procedures. The semantic resolution uses the knowledge base to find a suitable answer. The search starts by matching the patterns of the rules to the pre-processed user input. After a pattern is matched, the corresponding response sentence (or set of sentences) is selected and forwarded to the planning module. This module decides whether and how to use this sentence or set of sentences in replying to the user.
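A minimal sketch of this pattern-matching step might look as follows, using the rule from the example above. The function names and the exact word-order-relaxation strategy (matching each space-separated alternative group independently) are my assumptions for illustration, not the ADS internals.

```python
import re

# One rule from the knowledge base (see the example rule in section 2.4).
RULES = [
    {"pattern": r"(kartma|hirm) (valu|arst)",
     "response": "Ei ole põhjust karta!",
     "ignore_word_order": True},
]

def rule_matches(rule, text):
    if re.search(rule["pattern"], text):
        return True
    if rule["ignore_word_order"]:
        # Accept the input if every alternative group occurs somewhere,
        # regardless of the order of the words in the phrase.
        return all(re.search(part, text) for part in rule["pattern"].split())
    return False

def resolve(original, base_forms):
    # The original word forms have priority over the base forms:
    # if a rule fires on the original input, the base-form version is ignored.
    for text in (original, base_forms):
        for rule in RULES:
            if rule_matches(rule, text):
                return rule["response"]
    return None
```

Here the original input "ma kardan arsti" fails the regular expression, but its base-form version "mina kartma arst" matches, so the rule’s response is selected.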


The resolution of temporal expressions is not used in all domains, so the ADS framework contains an optional component for this task. This component was developed previously by the author, as temporal information can often be a significant part of the meaning communicated in dialogues. The component is described in more detail in [14]. The experiments with the ADS framework have shown that the framework works well within a restricted domain. These experiments confirm that rule-based semantic analysis using pattern-response pairs in the knowledge representation is a reasonable and effective approach. The key phrases describe the ontology of the domain. The process of gathering domain-specific knowledge and creating the rules for the knowledge base involves complicated administrative work. The representation of patterns by regular expressions can require special skills. Yet, the process of understanding the user input is not handled merely by the rule-based semantic analysis. In addition, pragmatic analysis is involved in the conversation. The features of this pragmatic analysis establish the core competencies of the framework. For example, the system understands and reacts appropriately when the input from the user is a repeated input, when there has been too long a pause between two inputs, or when there has been a long enough pause between two repeated inputs.


2.5. Word Order Issues in the Language Analysis In the language analysis process, the grammar is usually adjusted to look for the phrases relevant to the domain. The ADS framework uses a similar approach. The grammar in the ADS framework can be a simple grammar that is meant to extract just single keywords and their relevant word forms, with help from a morphological analyzer. In such a simple case, the word order is not an issue. Yet, in a more complex task, the single-keyword approach is not enough to capture the meaning of the sentence. Then, in addition to matching single words, phrases have to be matched, and the word order of a phrase becomes an issue. Usually more than two different wordings have to be considered to recognize the relevant phrase. The word order in Estonian is relatively free: many words in a sentence can be easily reordered without the sentence becoming ungrammatical. One of the options is to handle the word order of the sentence manually by defining all possible word order permutations in the grammar. In this case, the number of permutations can be rather high, and usually only the most probable permutations are listed in the grammar to keep the grammar readable. The manual approach is also supported by the ADS framework: the word order can be ignored by adding manual word permutations into the rules of the grammar. The problems with this manual approach are: the readability of the grammar is decreased; the developer can forget to add the permutations to the grammar; the developer can provide an insufficient number of permutations. The ADS framework also includes an optional automated approach, as the manual approach is not always efficient and elegant. To keep the patterns simple and still be able to handle the word order problems, I have added an attribute IGNORE_WORD_ORDER to the definition of a rule to allow automated word permutations (see the example in section 2.4).
The attribute can have two values YES or NO. This attribute also leaves an option to turn off automated permutations for certain patterns with fixed word order (i.e. changing the word order would change the meaning). The attribute value is set to YES for all the rules where an arbitrary sequence


of words does not change the meaning of what the rule needs to capture. Respectively, the value is set to NO for all the rules where only a fixed sequence of words is allowed to capture the meaning. Automating the permutations greatly simplifies the process of grammar design and keeps the grammar more readable. Without this method we would have to indicate word order variations explicitly by listing all possible realizations. With this new approach, if the rule is marked to allow free word order, all variations of word order can be accepted. 2.6. Handling Repetitions in the Conversation In spoken language conversations, repetition can be a part of the repair strategy. For example, the user might not have heard what was said in the conversation and therefore specifically requests a repetition (possibly by also repeating himself). In this case, it is appropriate for the system to repeat the previous utterance. However, in text-based conversations, the user can always scroll back in the chat history and look at the whole conversation. So, in text-based conversations repetition by the system is usually not needed and should be avoided (or used only after a period of expiration) to prevent user frustration. Repetitions are usually considered to be a sign of poor intelligence of the system. Vrajitoru [15] has said that repetition decreases the life-like impression of the system and undermines its credibility. The ADS framework uses an expiration interval in repetition control. This means that only recurring replies that occur less than two minutes after the previous occurrence are treated as repetitions. If the same reply is older than two minutes, it is not considered a repetition and is issued as a regular reply. This way the repetition is less disturbing, as some time has passed since the previous output.
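The two-minute expiration interval can be expressed as a small filter over the system’s outgoing replies. This is a sketch of the idea only; the class and method names are invented for illustration.

```python
import time

REPETITION_WINDOW = 120  # seconds: the two-minute expiration interval

class RepetitionFilter:
    """Suppress a reply if the same reply was issued less than two minutes ago."""

    def __init__(self):
        self._last_issued = {}  # reply text -> timestamp of the last occurrence

    def allow(self, reply, now=None):
        now = time.time() if now is None else now
        last = self._last_issued.get(reply)
        if last is not None and now - last < REPETITION_WINDOW:
            return False  # recurring reply within two minutes: a repetition
        self._last_issued[reply] = now
        return True       # older than two minutes: issued as a regular reply
```

A reply repeated after 60 seconds is suppressed, while the same reply after the window has expired goes out as a regular reply.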


2.7. Domain Adaptation The ADS framework has been built so that the domain-dependent dialogue management and the generic dialogue feature handling are separated. This significantly decreases the development effort when the ADS framework has to be adjusted for a new domain. Adjustments can be made in the knowledge base, the dialogue management module and the user interface. The adjustments in the knowledge base are the most complex: one should decide what part of the knowledge of the target domain can be represented as pattern-response pairs. A domain shift does not mean that the entire knowledge base has to be replaced. There are many common phrases (greetings and small talk) that can be kept for the new domain. The ADS framework has been tested on the following three domains: conversation with a virtual politician, conversation with a virtual dental consultant, and conversation with a virtual movie schedule administrator. The first one (the virtual politician) was a simple test with some voluntary users. The second one (the virtual dental consultant) has also been tested in a commercial environment. The third one (the virtual movie schedule administrator) is a natural language interface to a dynamically changing database. The first two domains are based on “Frequently Asked Questions” and static answers. The third domain is more complex, with a dynamic knowledge base.


2.8. The Wizard-of-Oz (WOZ) Interface The web-based WOZ interface for data collection in the ADS framework supports an unlimited number of wizards and an unlimited number of conversations, provided that the number of conversations is greater than or equal to the number of wizards. While the wizard is helping the system, the system (i.e. the dialogue engine) can be turned into silent mode. There is also an option to keep the system answering the questions it is able to answer (i.e. AI-assisted live agent chat). 2.9. The Remote Services Used by the ADS Framework A speech-synthesis server is used to create speech output for the user. The speech output is a nice addition to the plain text output that is generated anyway. Currently the speech-synthesis server at http://kiisu.eki.ee/ is used. This server has been created by the Institute of the Estonian Language and Tallinn University of Technology [16], [17]. A remote SMTP server is used to send notifications to the wizard’s (live agent’s) mobile phone to inform the wizard that a conversation has started. The ADS framework can communicate with external databases. This feature is implemented and used in a dialogue system that currently imports the new movie schedule into the ADS framework. One of the previous systems developed by the author [18] was also integrated with a speech recognition component [19]. In the ADS framework the author has dropped the speech recognition component, as its availability to the general public is still limited.


3. Evaluation of the Asynchronous Communication Pattern The word count per utterance was examined to determine whether using the asynchronous communication pattern has any effect on making users talk with more expressiveness (that is, using more words per utterance and fewer single-word utterances). The word count per utterance is an important issue in evaluating mixed-initiative DS. It shows how expressive users tend to be while talking to the dialogue system. It is not a good sign if the average number of words per utterance is very low: it means that the users have captured the keyword-based approach of the system and are over-simplifying the conversation by using only single keywords. If the users use too many single-word utterances, the system is likely to fail in responding correctly. Three DS developed by the author were studied in evaluating the word count. One of the systems (Teatriagent – Theatre Agent) was a system with a synchronous communication pattern which was not built in the ADS framework. The other two systems (Alfred and Zelda) were using the asynchronous communication pattern and were implemented in the ADS framework. Teatriagent and Alfred are very similar in domain – the first one gives information about theatre schedules and the second one gives information about movie schedules. Zelda is a virtual dental consultant. The systems with the asynchronous communication pattern had higher word counts per utterance. Teatriagent has a large share (36.3%) of single-word utterances. Alfred and Zelda have fewer single-word utterances (26.1% and 24.4%, respectively). It appears that the users were more expressive while using the asynchronous DS (Alfred and Zelda).
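The two measures used in this evaluation (the share of single-word utterances and the mean word count per utterance) are straightforward to compute over a dialogue log. A minimal sketch; the toy Estonian utterances are invented for illustration, not data from the evaluated systems.

```python
def utterance_stats(utterances):
    # Returns (percentage of single-word utterances, mean words per utterance).
    counts = [len(u.split()) for u in utterances]
    single = sum(1 for c in counts if c == 1)
    return 100.0 * single / len(counts), sum(counts) / len(counts)

pct, mean = utterance_stats(["tere", "millal film algab", "aitäh"])
```

For this toy log, two of the three utterances are single words, giving a 66.7% single-word share and a mean of 1.67 words per utterance.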


4. Conclusion and Future Work The experiments have shown that asynchronous turn management is a good path to follow. The work will be continued by refining the knowledge engineering process. Much of the knowledge cannot be expressed as pattern-response pairs, so the ADS framework will be improved to use pattern-function pairs in the knowledge base.

References
[1] D. Jurafsky and J.H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Prentice-Hall, NJ, 2000.
[2] D. Bohus, A. Raux, T. Harris, M. Eskenazi and A. Rudnicky, Olympus: an open-source framework for conversational spoken language interface research. In: Proceedings of HLT-2007, Rochester, NY, 2007.
[3] D. Bohus and A. Rudnicky, RavenClaw: Dialog management using hierarchical task decomposition and an expectation agenda. In: Proceedings of Eurospeech, 2003.
[4] Semantra, Semantra Technology Overview, 2009.
[5] M. Elder, Preparing a data source for a natural language query. United States Patent Application No 20050043940, 2004.
[6] CSLU Toolkit, 2009. Retrieved Aug 1, 2009 from http://cslu.cse.ogi.edu/toolkit/
[7] J. Allen, G. Ferguson and A. Stent, An architecture for more realistic conversational systems. In: IUI ’01: Proceedings of the 6th International Conference on Intelligent User Interfaces, Santa Fe, New Mexico, United States (2001), 1–8. ACM Press.
[8] M.E. Liljenback, ContextQA: Experiments in Interactive Restricted-Domain Question Answering, MSc in CS Thesis, San Diego University, 2007.
[9] S. Stenchikova and A. Stent, Measuring adaptation between dialogs. In: Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue, Antwerp, Belgium, 2007.
[10] C. Wampler, Artificial Intelligence Assisted Live Agent Chat System. United States Patent Application No 20090245500, 2009.
[11] W.W. Cohen, P. Ravikumar and S.E. Fienberg, A comparison of string metrics for matching names and records. In: Proceedings of the KDD-2003 Workshop on Data Cleaning and Object Consolidation, 2003.
[12] H.-J. Kaalep and T. Vaino, Complete morphological analysis in the linguist's toolbox. In: Congressus Nonus Internationalis Fenno-Ugristarum Pars V (2001), 9–16, Tartu, Estonia.
[13] M.O. Dzikovska, J.F. Allen and M.D. Swift, Integrating linguistic and domain knowledge for spoken dialog systems in multiple domains. In: Proceedings of the IJCAI-03 Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Acapulco, Mexico, 2003.
[14] M. Treumuth, Normalization of Temporal Information in Estonian. In: Proceedings of the 11th International Conference on Text, Speech and Dialogue, Brno, Czech Republic, 2008.
[15] D. Vrajitoru, NPCs and Chatterbots with Personality and Emotional Response. In: Proceedings of the 2006 IEEE Symposium on Computational Intelligence and Games (CIG06) (2006), 142–147.
[16] E. Meister, J. Lasn and L. Meister, SpeechDat-like Estonian database. In: Text, Speech and Dialogue: 6th International Conference, TSD 2003, Czech Republic, September 8–12, 2003, Eds. Matoušek et al., Berlin: Springer, Lecture Notes in Artificial Intelligence, Vol. 2807 (2003), 412–417.
[17] M. Mihkla, A. Eek and E. Meister, Text-to-Speech Synthesis of Estonian. In: Proceedings of the 6th European Conference on Speech Communication and Technology, Budapest, Vol. 5 (1999), 2095–2098.
[18] M. Treumuth, T. Alumäe and E. Meister, A Natural Language Interface to a Theater Information Database. In: Proceedings of the 5th Slovenian and 1st International Language Technologies Conference (2006), 27–30.
[19] T. Alumäe, Methods for Estonian Large Vocabulary Speech Recognition. Ph.D. Thesis, Tallinn University of Technology, TUT Press, 2006.


Machine Translation


Human Language Technologies – The Baltic Perspective I. Skadiņa and A. Vasiļjevs (Eds.) IOS Press, 2010 © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-117


SMT of Latvian, Lithuanian and Estonian Languages: a Comparative Study Maxim KHALILOV a,1 , Lauma PRETKALNIŅA b , Natalja KUVALDINA c and Veronika PERESEINA d a Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, The Netherlands b Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia c Marine Systems Institute, Tallinn University of Technology, Tallinn, Estonia d Jönköping International Business School, Jönköping University, Jönköping, Sweden Abstract. This paper is an attempt to discover the main challenges in working with the Baltic and Estonian languages, and to identify the most significant sources of errors generated by an SMT system trained on large-vocabulary parallel corpora from the legislative domain. The immense distinction between the Latvian/Lithuanian and Estonian languages causes a set of non-equivalent difficulties, which we classify and compare. In the analysis step, we move beyond automatic scores and contribute a human error analysis of the MT systems' output that helps to determine the most prominent sources of errors typical for the SMT systems under consideration.


Keywords. Machine translation, Error analysis, Statistical methods

Introduction Unlike many small languages, the Latvian, Lithuanian (together called the Baltic languages) and Estonian (LLE) languages have been quite well researched linguistically and possess parallel corpora, an indispensable resource for statistical machine translation (SMT). The availability of a bilingual corpus opens the way for the estimation of SMT models and the development of real-world automatic translation systems. Until recently, automatic translation from/into the LLE languages had not received much attention from the scientific community and, to a certain extent, can still be considered an open research line in the field of automatic translation. Scarce attempts at constructing SMT systems for these languages can be found as of 2007 [1,2,3], which is much later than SMT systems for popular language pairs. In this study we present a set of full multilingual bi-directional experiments on Latvian↔English, Lithuanian↔English and Estonian↔English SMT, mostly concentrating on the more difficult translation tasks in which English is the source language. We compare the outputs of state-of-the-art SMT systems that follow a phrase-based approach to 1 Corresponding Author: Maxim Khalilov, Institute for Logic, Language and Computation, University of Amsterdam, P.O. Box 94242, 1090 GE Amsterdam, The Netherlands, E-mail: [email protected].

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited


MT and report the results in terms of automatic evaluation metrics. We also experiment with different parameters of the SMT systems and show that their accurate tuning can improve the quality of modeling the deviations between the LLE languages and English. In the following step, we move beyond automatic scores of translation quality and present a manual error analysis of the English⇒Latvian/Lithuanian and English⇒Estonian MT system outputs, which the vast majority of research papers avoid. The translation errors typical for each language pair are detected following the framework proposed in [4]. The results of the human evaluation, done by native or nearly native speakers of the target languages, help to shed light on the advantages and disadvantages of the SMT systems under consideration and to identify the most prominent sources of errors. The rest of the paper is organized as follows: Section 1 briefly outlines the most important characteristics of the LLE languages and describes the corpus used in the experiments, Section 2 introduces the phrase-based approach to SMT, Section 3 details the experiments, Section 4 reports the results of the automatic translation quality evaluation along with the results of the human error analysis, and Section 5 presents the conclusions drawn from the study.

1. Languages and data


There is a variety of languages spoken in the Baltic states, including Lithuanian, Latvian and Estonian. Here, we provide the reader with a brief overview of the three official languages of the Baltic countries and their most important grammatical characteristics.
Latvian. Latvian is the official language of Latvia and belongs to the Baltic branch of the Indo-European language family. There are about 1.5 million native Latvian speakers around the world: 1.38 million live in Latvia, while the others are spread across the USA, Russia, Sweden, and some other countries. Latvian is also a second language for about 0.5 million inhabitants of Latvia and several tens of thousands of people from neighboring countries, especially Lithuania2. Latvian is characterized by rich morphology, relatively complex pre- and postposition structures and a high level of morphosyntactic ambiguity. There are no articles, two grammatical genders and two numbers in Latvian. Nouns decline into seven cases.
Lithuanian. Lithuanian is most closely related to Latvian, and from a linguistic point of view there is not much difference in treating Latvian and Lithuanian. A small difference between them from the MT perspective is that the latter has a higher number of declensions, more inflectional types of nouns and adjectives, and a comparison system of adjectives. Another minor distinction between the two living Baltic languages is that there is no neuter gender in Latvian, while a number of obsolete (but still used) neuter word forms exist in Lithuanian. The linguistic typology of Latvian and Lithuanian is SOV; however, word order is relatively free. Lithuanian is one of the official languages of the European Union. There are about 2.96 million native Lithuanian speakers in Lithuania and about 170,000 abroad3.

2 Source: State Language Agency http://www.valoda.lv/lv/latviesuval
3 Source: Wikipedia http://en.wikipedia.org/wiki/Lithuanian_language


Estonian. While Lithuanian and Latvian are closely related and descend from the same ancestor language, Estonian differs from them in many aspects and does not even belong to the same language family4. Estonian is a highly inflectional agglutinative language characterized by a large number of cases (14 productive cases) and the absence of grammatical genders. It has a rich structure of declensional and conjugational forms, whose number is significantly higher than in Latvian and Lithuanian. The basic word order is SVO. There are about 1.1 million Estonian speakers in Estonia and tens of thousands in other countries5.
All the languages under consideration are characterized by a relatively free order of sentence constituents (non-configurational languages); however, the number of ways a sentence can be rearranged without becoming ungrammatical is much higher for Estonian than for Lithuanian and Latvian.

1.1. Data

We used the JRC-Acquis parallel corpus [5] of about one million parallel sentences. The development set contains 500 sentences randomly extracted from the bilingual corpus; the test corpus size is 1,000 lines. Development and test sets are provided with one reference translation. Basic statistics of the bilingual corpus can be found in Table 1.

Table 1. Basic statistics of the JRC-Acquis corpus.

              Latvian    Lithuanian   Estonian   English
Training
  Sentences                      1.09M
  Words       23.87M     23.90M       21.15M     28.21M
  Vocabulary  338.65K    355.17K      507.80K    237.94K
Development
  Sentences                      0.5K
  Words       10.82K     11.56K       9.49K      13.6K
  Vocabulary  1.28K      1.90K        2.32K      1.14K
Test
  Sentences                      1.0K
  Words       20.09K     21.55K       20.37K     27.74K
  Vocabulary  3.86K      4.64K        4.85K      2.43K
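Counts of the kind shown in Table 1 (sentences, running words, vocabulary) can be reproduced with a short script; this sketch assumes whitespace tokenisation and lowercasing, so exact figures depend on the tokenizer actually used:

```python
from collections import Counter

def corpus_stats(sentences):
    """Count sentences, running words and vocabulary size
    for a list of whitespace-tokenised sentences."""
    vocab = Counter()
    n_words = 0
    for sent in sentences:
        tokens = sent.lower().split()
        n_words += len(tokens)
        vocab.update(tokens)
    return {"sentences": len(sentences),
            "words": n_words,
            "vocabulary": len(vocab)}

# Tiny illustrative sample, not corpus data.
sample = ["Padome pieņem lēmumu", "Komisija pieņem regulu"]
print(corpus_stats(sample))  # {'sentences': 2, 'words': 6, 'vocabulary': 5}
```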

4 Estonian belongs to the Baltic Finnic branch of the Uralic languages and its closest relative is Finnish. Estonian is one of the few languages in Europe which does not belong to the Indo-European family.
5 Source: Estonian Institute http://www.einst.ee/publications/language/

2. Phrase-based SMT

SMT is based on the principle of translating a source sentence ($f_1^J = f_1, f_2, \ldots, f_J$) into a sentence in the target language ($e_1^I = e_1, e_2, \ldots, e_I$). The problem is formulated in terms of source and target languages; it is defined according to equation (1) and can be reformulated as selecting a translation with the highest probability from a set of target sentences (2):

$\hat{e}_1^I = \arg\max_{e_1^I} \left\{ p(e_1^I \mid f_1^J) \right\}$    (1)

$\phantom{\hat{e}_1^I} = \arg\max_{e_1^I} \left\{ p(f_1^J \mid e_1^I) \cdot p(e_1^I) \right\}$    (2)

where I and J represent the number of words in the target and source languages, respectively. Modern state-of-the-art SMT systems operate with bilingual units (phrases) extracted from the parallel corpus based on word-to-word alignment. They are enhanced by the maximum entropy approach, and the posterior probability is calculated as a log-linear combination of a set of feature functions [6]. Using this technique, the additional models are combined to determine the translation hypothesis $\hat{e}_1^I$ that maximizes a log-linear combination of these feature models, as shown in (3):

$\hat{e}_1^I = \arg\max_{e_1^I} \left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right\}$    (3)

where the feature functions hm refer to the system models and the set of λm refers to the weights corresponding to these models. Phrase-based translation [6] is a three-step algorithm: (1) the source sequence of words is segmented into phrases, (2) each phrase is translated into the target language using a translation table, (3) the target phrases are reordered to be inherent in the target language. The phrase-based system we experiment with within the framework of this study employs feature functions for a phrase-pair translation model, a language model (LM), a reordering model, and a model to score a translation hypothesis according to its length. The weights λm are usually set to optimize system performance [7] as measured by BLEU [8]. Two word reordering methods are considered: a distance-based distortion model [9] and the lexicalized MSD block-oriented model [10]. An alternative decoding technique is Minimum Bayes Risk (MBR), an approach that seeks the hypothesis most similar to the most likely translations, using optimization functions that measure translation performance [11].
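Equation (3) amounts to scoring each candidate translation with a weighted sum of feature values and keeping the best one; a minimal sketch with illustrative feature names and log-domain values (not taken from the systems described here):

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values, as in Eq. (3)."""
    return sum(weights[name] * value for name, value in features.items())

def best_hypothesis(hypotheses, weights):
    """Pick the hypothesis maximising the log-linear score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Toy example: two hypotheses with made-up log-domain feature values.
weights = {"tm": 1.0, "lm": 0.8, "word_penalty": -0.3}
hyps = [
    {"text": "the council adopts a decision",
     "features": {"tm": -2.1, "lm": -4.0, "word_penalty": 5}},
    {"text": "council adopt decision",
     "features": {"tm": -1.5, "lm": -6.5, "word_penalty": 3}},
]
print(best_hypothesis(hyps, weights)["text"])  # → the council adopts a decision
```

In a real system the λm weights would come from a tuning procedure such as minimum error rate training [7] rather than being set by hand.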

3. Experiments

Experimental setup. The system built for the English⇔LLE translation experiments is implemented within the open-source Moses toolkit [12]. The standard training and weight-tuning procedures used to build our system are explained in detail on the Moses web page: http://www.statmt.org/moses/. Word alignments were estimated using the GIZA++ tool [13], assuming 4 iterations of the IBM2 model, 5 HMM model iterations, 4 iterations of the IBM4 model, and 50 statistical word classes (found with the mkcls tool [14]). Target LMs with unmodified Kneser-Ney backoff discounting


were generated using the SRI language modeling toolkit [15]. Automatic evaluation was case insensitive, and punctuation marks were not considered.

Systems. Apart from unfactorized phrase translation, the set of systems considered in this paper includes alternative configurations. We investigate the impact that different ingredients of a phrase-based translation system have on the final system performance. We experiment with (1) different orders of the target-side LM, (2) the way the search space is reduced during decoding (beam size) and (3) MBR decoding.

4. Results

Evaluation of the system performance is twofold. In the first step, we report the standard automatic translation scores, namely BLEU, NIST and METEOR (MTR) scores for the tasks in which English is the target language, and BLEU and NIST scores for the English⇒Latvian/Lithuanian/Estonian tasks. In the next step, we turn to a human analysis of the translation output, which provides a comprehensive comparison of multiple translation systems and reveals the most prominent sources of errors generated by phrase-based systems.

4.1. Automatic evaluation


The evaluation results for the test datasets are reported in Tables 2 and 3.

Table 2. Automatic translation scores for English⇒LLE translations.

                 EnLv            EnLt            EnEst
System       BLEU   NIST    BLEU   NIST    BLEU   NIST
Baseline     19.07  4.81    13.29  4.06    11.84  3.76
LM: 3-gram   18.36  4.74    13.21  3.94    10.99  3.66
LM: 4-gram   18.37  4.74    14.14  4.15    11.39  3.78
S1000        18.95  4.77    13.05  3.98    11.58  3.77
MBR          19.15  4.83    13.50  4.21    11.56  3.71

The systems considered include: (1) a baseline configuration (5-gram target-side LM); (2–3) LM: 3(4)-gram systems, considering lower-order target-side LMs; (4) the S1000 system with increased stack size (beam) for histogram pruning (100 is the default value); and (5) the MBR configuration, where the MBR algorithm is used during decoding. The major conclusion that can be drawn from the results of the automatic evaluation is that modification of the default Moses parameters does not significantly change the translation systems' performance. However, using the MBR algorithm instead of the standard optimization procedure leads to a slight improvement in terms of translation scores for the English⇔Latvian and English⇔Lithuanian tasks. For all translations into English there is a consistent improvement of system performance with an increase of the target-side LM order, which is not the case for the English⇒LLE translations. As expected, translation from and into Latvian is a less complex task compared to the other directions, while the Estonian⇔English tasks are the most complicated from the SMT perspective. Increased beam size has a positive impact on translation scores for


the majority of the systems under consideration, but at the cost of translation time, which increases significantly (by 3–4 times).

4.2. Manual error evaluation

We performed error analysis on the 1,000-line test dataset for the English⇒LLE baseline systems. The analysis of typical errors generated by each system was done following the error classification scheme proposed in [4], by contrasting the system output with the reference translation. The comparative statistics of errors are reported in Table 4.

Table 3. Automatic translation scores for LLE⇒English translations.

                    LvEn                  LtEn                  EstEn
System       BLEU   NIST   MTR     BLEU   NIST   MTR     BLEU   NIST   MTR
Baseline     29.69  6.38   55.07   26.27  6.04   49.59   18.52  4.42   45.74
LM: 3-gram   27.78  6.18   54.65   25.55  5.82   49.23   17.32  4.39   45.25
LM: 4-gram   29.47  6.25   54.88   26.01  5.93   49.57   18.21  4.41   45.54
S1000        29.75  6.43   55.09   26.12  5.93   49.57   18.55  4.44   45.79
MBR          29.73  6.39   55.05   26.33  6.01   49.55   18.40  4.34   45.71

Table 4. Human-made error statistics for a representative test set.

Type / Sub-type          EnLv              EnLt              EnEst
Missing words            631 (10.16 %)     622 (10.35 %)     884 (12.21 %)
  Content words          272 (4.38 %)      244 (4.06 %)      422 (5.83 %)
  Filler words           359 (5.78 %)      378 (6.29 %)      462 (6.38 %)
Word order               885 (14.27 %)     868 (14.44 %)     1,216 (16.80 %)
  Local word order       181 (2.92 %)      194 (3.23 %)      300 (4.14 %)
  Local phrase order     317 (5.11 %)      270 (4.49 %)      459 (6.34 %)
  Global word order      241 (3.89 %)      216 (3.59 %)      340 (4.70 %)
  Global phrase order    146 (2.35 %)      188 (3.13 %)      117 (2.44 %)
Incorrect words          4,294 (69.18 %)   4,164 (69.30 %)   4,653 (64.27 %)
  Wrong lex. choice      348 (5.60 %)      292 (4.86 %)      758 (10.47 %)
  Incorrect disambig.    865 (13.94 %)     920 (15.31 %)     525 (7.25 %)
  Incorrect form         2,237 (36.05 %)   2,472 (41.14 %)   2,927 (40.43 %)
  Extra words            750 (12.08 %)     430 (7.16 %)      351 (4.85 %)
  Style                  94 (1.51 %)       50 (0.83 %)       85 (1.18 %)
  Idioms                 0 (0.00 %)        0 (0.00 %)        7 (0.09 %)
Unk. words               85 (1.37 %)       107 (1.78 %)      341 (4.71 %)
Punctuation              198 (3.19 %)      248 (4.13 %)      86 (1.19 %)
Total                    6,206             6,009             7,240
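The percentages in Table 4 are simply each category's share of the total error count; a sketch with illustrative counts (not the paper's figures):

```python
def error_percentages(counts):
    """Turn raw per-category error counts into 'n (p %)' cells,
    with percentages relative to the total number of errors."""
    total = sum(counts.values())
    return {cat: f"{n:,} ({100.0 * n / total:.2f} %)"
            for cat, n in counts.items()}

# Illustrative counts only, chosen to sum to 100 for readability.
toy = {"missing words": 10, "word order": 15, "incorrect words": 60,
       "unknown words": 5, "punctuation": 10}
for cat, cell in error_percentages(toy).items():
    print(cat, cell)
```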


The distribution of errors for the languages under consideration is quite similar; however, the total number of errors generated by the Estonian system is about 20% higher than for the Lithuanian and Latvian systems. The most prominent class of errors is related to incorrect words/word forms, which is typical for morphologically rich languages, while the prevailing type of errors within this class is "incorrect word forms": the system is able to generate the correct word lemma but cannot find the correct lexical form. Rich morphology, a high level of morpho-syntactic ambiguity and the relatively complex pre- and postposition structures typical for Latvian and Lithuanian cause a significant number of errors typical for morphologically rich languages, namely incorrect word forms and wrong lexical choice. The minor linguistic distinction between Lithuanian and Latvian is reflected in the similar total number and distribution of errors when translating into Latvian and Lithuanian. In the case of Estonian, which differs from Latvian and Lithuanian in many aspects, many errors come from erroneous grammatical choice, i.e., the translation system is not able to generate the correct word on the target side. The major difficulty that either an Estonian-English or an English-Estonian SMT system faces is the rich structure of word forms, whose number is much higher than in Latvian and Lithuanian. There is a substantial number of errors related to generation of the correct word/constituent order within the sentence for all English⇒LLE tasks (≈15%), which is explained by the free word order of the target languages. For non-configurational languages, the rich inflectional system renders word order less important than in isolating languages like English. Nevertheless, there is only a limited number of acceptable word permutations, and evaluation of word order correctness for free word order languages is not a trivial task. We considered all admissible word order combinations for the translations equally; hence, chunks are marked erroneous only if the word order is not acceptable or changes the meaning of the sentence. The total number of errors generated by the English⇒Latvian system is slightly higher than that of the English⇒Lithuanian system, which contradicts theoretical expectations. We explain this phenomenon by the sparseness of the translation model.

5. Conclusions and discussion

In this paper, we report the results of multilingual translation experiments that involve the Baltic and Estonian languages on the one side and English on the other. Unsurprisingly, translation scores for Latvian and Lithuanian translations are higher than for translations into and from Estonian, which reflects the fact that the latter is a more difficult translation task. MBR decoding is slightly more efficient than standard maximum a posteriori decoding for the Latvian⇔English and Lithuanian⇔English tasks. Human error analysis, performed in the next step, gives a more complete and fair view of translation quality than automatic scores, which just compare a translation output with a reference translation. Surprisingly, all three LLE languages are found to be quite similar in terms of error distribution, which can be partly explained by the specificity of the legal domain to which the data belongs. The English⇒Estonian system generates more errors than the English⇒Latvian and English⇒Lithuanian systems, mostly due to richer morphology, different word order and linguistic typology. The Latvian and Lithuanian systems


mostly suffer from incorrect word forms, incorrect disambiguation of lexical instances and word order errors. In the case of the Estonian system, the most frequent errors, in addition to the aforementioned ones, include wrong word translations. The high number of translation errors of all types (6–7 per sentence) leaves room for a lot of interesting research, which can potentially lead to a significant improvement of English⇔LLE translations.


References

[1] M. Fishel, H. Kaalep, and K. Muischnek. Estonian-English statistical machine translation: the first results. In Proceedings of NODALIDA-2007, Tartu, Estonia, May 2007.
[2] M. Khalilov, J.A.R. Fonollosa, I. Skadina, E. Bralitis, and L. Pretkalnina. Towards improving English-Latvian translation: a system comparison and a new rescoring feature. In Proceedings of LREC'10, pages 1719–1725, Valetta, Malta, May 2010.
[3] Ph. Koehn, A. Birch, and R. Steinberger. 462 machine translation systems for Europe. In Proceedings of the Twelfth Machine Translation Summit, pages 65–72, Ottawa, Ontario, Canada, August 2009.
[4] D. Vilar, J. Xu, L.F. D'Haro, and H. Ney. Error analysis of machine translation output. In Proceedings of LREC'06, pages 697–702, 2006.
[5] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş, and D. Varga. The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. In Proceedings of LREC'06, Genoa, Italy, May 2006.
[6] F. Och and H. Ney. Discriminative training and maximum entropy models for statistical machine translation. In Proceedings of ACL 2002, pages 295–302, 2002.
[7] F. Och. Minimum error rate training in statistical machine translation. In Proceedings of ACL 2003, pages 160–167, Sapporo, July 2003.
[8] K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pages 311–318, 2002.
[9] Ph. Koehn, F. Och, and D. Marcu. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, pages 48–54, 2003.
[10] C. Tillmann. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL 2004, 2004.
[11] S. Kumar and W. Byrne. Minimum Bayes-risk decoding for statistical machine translation. In Proceedings of HLT-NAACL 2004, 2004.
[12] Ph. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: open-source toolkit for statistical machine translation. In Proceedings of ACL 2007, pages 177–180, 2007.
[13] F. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51, 2003.
[14] F. Och. An efficient method for determining bilingual word classes. In Proceedings of ACL 1999, pages 71–76, June 1999.
[15] A. Stolcke. SRILM: an extensible language modeling toolkit. In Proceedings of the Int. Conf. on Spoken Language Processing, pages 901–904, 2002.


Human Language Technologies – The Baltic Perspective, I. Skadiņa and A. Vasiļjevs (Eds.), IOS Press, 2010. © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-125


Improving SMT for Baltic Languages with Factored Models Raivis SKADIŅŠ a,b, Kārlis GOBA a and Valters ŠICS a a Tilde SIA, Latvia b University of Latvia, Latvia

Abstract. This paper reports on the implementation and evaluation of English-Latvian and Lithuanian-English statistical machine translation systems. It also gives a brief introduction to the project scope – the Baltic languages, prior implementations of MT, and evaluation of MT systems. In this paper we report the results of both automatic and human evaluation. The results of the human evaluation show that factored SMT gives a significant improvement of translation quality compared to baseline SMT. Keywords. Statistical Machine Translation, Factored models


Introduction

Besides Google machine translation engines and research experiments with statistical MT for Latvian [1] and Lithuanian, there are both English-Latvian [2] and English-Lithuanian [3] rule-based MT systems available. Both Latvian and Lithuanian are morphologically rich languages with quite free phrase order in a sentence and with very limited parallel corpora available. All the mentioned aspects are challenging for SMT systems. We used the Moses SMT toolkit [4] for SMT system training and decoding. The aim of the project was not only to build yet another SMT system using publicly available parallel corpora and tools, but also to add language-specific knowledge to assess the possible improvement of translation quality. Another important aim of this project was the evaluation of available MT systems; we wanted to understand whether we can build SMT systems outperforming other existing statistical and rule-based MT systems.

1. Training resources

For training the SMT systems, both monolingual and bilingual sentence-aligned parallel corpora of substantial size are required. The corpus size largely determines the quality of translation, as has been shown both in the case of multilingual SMT [5] and English-Latvian SMT [1]. For all of our trained SMT systems the parallel training corpus includes DGT-TM, OPUS and localization corpora. The DGT-TM corpus is a publicly available collection of legislative texts available in 22 languages of the European Union. The OPUS translated text collection [6][7] contains publicly available texts from the web in different domains. For Latvian we chose the EMEA (European Medicines Agency) sentence-aligned


corpus. For Lithuanian we chose the EMEA and the KDE4 sentence-aligned corpora. The localization parallel corpus was obtained from translation memories that were created during localization of software content, appliance user manuals and software help content. We additionally included word and phrase translations from bilingual dictionaries to increase word coverage. Both parallel and monolingual corpora were filtered according to different criteria. Suspicious sentences containing too many non-alphanumeric symbols, as well as repeated sentences, were removed. Monolingual corpora were prepared from the corresponding monolingual part of the parallel corpora, as well as news articles from the web for Latvian and the LCC (Leipzig Corpora Collection) corpus for English.

Table 1. Bilingual corpora for the English-Latvian system

Bilingual corpus      Parallel units
Localization TM       ~1.29 mil.
DGT-TM                ~1.06 mil.
OPUS EMEA             ~0.97 mil.
Fiction               ~0.66 mil.
Dictionary data       ~0.51 mil.
Total                 4.49 mil. (3.23 mil. filtered)

Table 2. Bilingual corpora for the Lithuanian-English system

Bilingual corpus      Parallel units
Localization TM       ~1.56 mil.
DGT-TM                ~0.99 mil.
OPUS EMEA             ~0.84 mil.
Dictionary data       ~0.38 mil.
OPUS KDE4             ~0.05 mil.
Total                 3.82 mil. (2.71 mil. filtered)

Table 3. Monolingual corpora

Monolingual corpus                 Words
Latvian side of parallel corpus    60M
News (web)                         250M
Fiction                            9M
Total, Latvian                     319M
English side of parallel corpus    60M
News (WMT09)                       440M
LCC                                21M
Total, English                     521M
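The filtering step described above (dropping repeated sentence pairs and pairs dominated by non-alphanumeric symbols) can be sketched as follows; the 0.3 threshold is an assumption for illustration, not the authors' value:

```python
def non_alnum_ratio(sentence):
    """Share of non-alphanumeric characters, ignoring spaces."""
    chars = sentence.replace(" ", "")
    if not chars:
        return 1.0
    return sum(not c.isalnum() for c in chars) / len(chars)

def filter_parallel(pairs, max_ratio=0.3):
    """Drop duplicate pairs and pairs where either side looks suspicious."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        if non_alnum_ratio(src) > max_ratio or non_alnum_ratio(tgt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept

pairs = [("Regula Nr. 1", "Regulation No 1"),
         ("Regula Nr. 1", "Regulation No 1"),   # exact duplicate
         ("@@@ ###", "noise !!!")]              # mostly symbols
print(len(filter_parallel(pairs)))  # → 1
```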

The evaluation and development corpora were prepared separately. For both corpora we used the same mixture of different domains and topics (Table 4) representing the expected translation needs of a typical user. The development corpus contains 1000 sentences, while the evaluation set is 500 sentences long.


Table 4. Topic breakdown of evaluation and development sets

Topic                                              Percentage
General information about the European Union       12%
Specifications, instructions and manuals           12%
Popular scientific and educational                 12%
Official and legal documents                       12%
News and magazine articles                         24%
Information technology                             18%
Letters                                            5%
Fiction                                            5%


2. SMT training

The baseline SMT models were trained on lowercased surface forms for the source and target languages only. The baseline models serve as a reference point to assess the relative improvement from additional data manipulation, factors, corpus size and language models. The phrase-based approach allows translating source words differently depending on their context by translating whole phrases, whereas the target language model allows matching target phrases at their boundaries. However, most phrases in inflectionally rich languages can be inflected in gender, case, number, tense, mood and other morphosyntactic properties, producing a considerable amount of variation. Both Latvian and Lithuanian belong to the class of inflected languages, which are the most complex from the point of view of morphology. Latvian nouns are divided into 6 declensions. Nouns and pronouns have 6 cases in both singular and plural. Adjectives, numerals and participles have 6 cases in singular and plural, 2 genders, and definite and indefinite forms. The rules of case generation differ for each group. There are two numbers, three persons and three tenses (present, future and past), both simple and compound, and 5 moods in the Latvian conjugation system. Latvian is quite regular in forming inflected forms; however, the form endings in Latvian are highly ambiguous. Nouns in Latvian have 29 graphically different endings and only 13 of them are unambiguous; adjectives have 24 graphically different endings and half of them are ambiguous; verbs have 28 graphically different endings and only 17 of them are unambiguous. Lithuanian has even more morphological variation and ambiguity. Another significant feature of both languages is the relatively free word order in the sentence, which makes parsing and translation complicated.
The inflectional variation increases data sparseness at the boundaries of translated phrases, where a language model over surface forms might be inadequate to estimate the probability of the target sentence reliably. The baseline SMT system was particularly weak at adjective-noun and subject-object agreement. To address that, we introduced an additional language model over morphologic tags in the English-Latvian system. The tags contain relevant morphologic properties (case, number, gender, etc.) that are generated by a morphologic tagger. The order of the tag LM was increased to 7, as the tag data has a significantly smaller vocabulary. When translating from a morphologically rich language, the SMT baseline system will not produce a translation for all forms of a word that is not fully represented in the training data. A solution to this problem is to separate the richness of morphology from the words and translate lemmas instead. Morphology tags could be


used as an additional factor to improve the quality of translation. However, as we do not have a morphologic tagger for Lithuanian, we used a simplified approach, splitting each token into two separate tokens containing the stem and an optional suffix. The stems and suffixes were treated in the same way in the training process. Suffixes were marked (prefixed by a special symbol) to avoid overlapping with stems. The suffixes we used correspond to inflectional endings of nouns, adjectives and verbs; however, they are not supposed to be linguistically accurate, but rather a way to reduce data sparsity. Moreover, the processing always splits off the longest matching suffix, which produces errors with certain words. We trained another English-Latvian system with a similar approach, using the suffixes instead of morphologic tags for the additional LM. Although the suffixes are often ambiguous (e.g. the ending -a is used in several noun, adjective and verb forms), our goal was to check whether we can get an improvement in quality by using knowledge about morphology when no morphological tagger is available, and to assess how big this improvement is compared with using the tagger. Table 5 gives an overview of the SMT systems trained and the structure of the factored models.

Table 5. Structure of Translation and Language Models

System                 Translation Models                      Language Models
EN-LV SMT baseline     1: Surface → Surface                    1: Surface form
EN-LV SMT suffix       1: Surface → Surface, suffix            1: Surface form; 2: Suffix
EN-LV SMT tag          1: Surface → Surface, morphology tag    1: Surface form; 2: Morphology tag
LT-EN SMT baseline     1: Surface → Surface                    1: Surface form
LT-EN SMT stem/suffix  1: Stem/suffix → Surface                1: Surface form
LT-EN SMT stem         1: Stem → Surface                       1: Surface form
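The stem/suffix splitting described above (longest matching suffix split off and prefixed with a special marker symbol) can be sketched as follows; the suffix list and the `+` marker are illustrative assumptions, not the inventory used by the authors:

```python
# A few Lithuanian-like inflectional endings, longest first.
# Illustrative only; the authors' actual suffix inventory is not published here.
SUFFIXES = sorted(["as", "is", "us", "o", "a", "ams", "ai"], key=len, reverse=True)
MARK = "+"  # special symbol marking a suffix token

def split_token(token, min_stem=2):
    """Split a token into stem and marked suffix, taking the longest
    matching suffix; tokens without a match are left intact."""
    for suf in SUFFIXES:
        if len(token) - len(suf) >= min_stem and token.endswith(suf):
            return [token[: -len(suf)], MARK + suf]
    return [token]

def split_sentence(sentence):
    return [part for tok in sentence.split() for part in split_token(tok)]

print(split_sentence("vyras namams"))  # stems plus marked endings
```

Because the longest matching suffix always wins, `namams` is split before `+ams` rather than `+as`, which mirrors the over-splitting errors the text mentions for certain words.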


3. Results and Evaluation

3.1. Automated evaluation

We used the BLEU [8] and NIST [9] metrics for automatic evaluation. The summary of automatic evaluation results is presented in Table 6.

Table 6. Automatic evaluation BLEU scores

System               Language pair        BLEU
Tilde rule-based MT  English-Latvian      8.1%
Google 1             English-Latvian      32.9%
Pragma 2             English-Latvian      5.3%
SMT baseline         English-Latvian      24.8%
SMT suffix           English-Latvian      25.3%
SMT tag              English-Latvian      25.6%
Google               Lithuanian-English   29.5%
SMT baseline         Lithuanian-English   28.3%
SMT stem/suffix      Lithuanian-English   28.0%

1 Google Translate (http://translate.google.com/) as of July 2010
2 Pragma translation system (http://www.trident.com.ua/eng/produkt.html)


R. Skadiņš et al. / Improving SMT for Baltic Languages with Factored Models

For the Lithuanian-English system we also measured the out-of-vocabulary (OOV) rate on both a per-word and a per-sentence basis (Table 7). The per-word OOV rate is the percentage of untranslated words in the output text, and the per-sentence OOV rate is the percentage of sentences that contain at least one untranslated word. It was not possible to determine the OOV rates for the other translation systems (e.g. Google), as the OOV rates were calculated by analyzing the output of the Moses decoder.

Table 7. OOV rates for Lithuanian-English

System            Language pair        OOV, words   OOV, sentences
SMT baseline      Lithuanian-English   3.31%        39.8%
SMT stem/suffix   Lithuanian-English   2.17%        27.3%
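The two OOV rates can be computed directly from the decoder output. The sketch below assumes, hypothetically, that untranslated words carry an "UNK|" prefix in the output; in our setup the rates were derived from Moses decoder output:

```python
# Per-word and per-sentence OOV rates as defined above.
def oov_rates(output_sentences, is_oov=lambda tok: tok.startswith("UNK|")):
    total_words = oov_words = oov_sents = 0
    for sent in output_sentences:
        toks = sent.split()
        n_oov = sum(1 for t in toks if is_oov(t))
        total_words += len(toks)
        oov_words += n_oov
        if n_oov:
            oov_sents += 1
    return (100.0 * oov_words / total_words,            # % untranslated words
            100.0 * oov_sents / len(output_sentences))  # % sentences with an OOV word

word_rate, sent_rate = oov_rates(["a b UNK|c d", "a b", "UNK|x"])
print(round(word_rate, 2), round(sent_rate, 2))  # → 28.57 66.67
```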

3.2. Human evaluation

For manual evaluation of the systems we ranked translated sentences relative to each other. This was the official determinant of translation quality used in the 2009 Workshop on Statistical Machine Translation shared tasks [10]. The same test corpus was used as in the automatic evaluation. The summary of manual evaluation results is presented in Table 8.

Table 8. Manual evaluation results for 3 systems, balanced test corpus

System                Language pair     BLEU    NIST   Average rank in manual evaluation
Tilde Rule-Based MT   English-Latvian    8.1%   3.82
SMT Baseline          English-Latvian   21.7%   5.32
SMT F1                English-Latvian   23.0%   5.40

We performed evaluations both by ranking several systems simultaneously and by ranking only two systems at a time (ties were allowed). We found that it is more convenient for evaluators to compare only two systems, and the results of such evaluations are also easier to interpret. We developed a web-based evaluation environment into which we can upload source sentences and the outputs of two MT systems as plain text files. Once an evaluation of two systems is set up, we send a link to the evaluation survey to the evaluators. Evaluators assess the systems sentence by sentence: for each source sentence they see the outputs of both MT systems. The order of the system outputs varies, so that the output of either system may appear in the first position. Evaluators are encouraged to evaluate at least 25 sentences, but they may work in small portions: an evaluator can open the survey, evaluate a few sentences, leave, and return later to continue. No evaluator is ever given the same sentence twice. We calculate how often users prefer each system, both over all answers and over sentences. When we calculate results over all answers, we simply count how many times users judged one system to be better than the other; the result is the percentage of answers in which users preferred one system over the other. To be sure of the statistical relevance of the results, we also calculate a confidence interval. If A users prefer the first system and B users prefer the second system, we calculate the percentage using Eq. (1) and the confidence interval using Eq. (2).


p = A / (A + B) × 100%    (1)

ci = z × √( p (1 − p) / (A + B) ) × 100%    (2)

where z for a 95% confidence interval is 1.96. When we have calculated p and ci, we can say that users prefer the first system over the second in p ± ci percent of individual evaluations. We say that the evaluation results are weakly sufficient to conclude with 95% confidence that the first system is better than the second if Eq. (3) is true.


p − ci > 50%

(3)
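The computations in Eqs. (1)–(3) can be implemented directly; the counts used in the example below are hypothetical:

```python
import math

def preference(a, b, z=1.96):
    """Eqs. (1) and (2): a answers prefer system 1, b prefer system 2;
    z = 1.96 gives a 95% confidence interval."""
    p = a / (a + b) * 100.0
    ci = z * math.sqrt((p / 100.0) * (1.0 - p / 100.0) / (a + b)) * 100.0
    return p, ci

def is_sufficient(a, b):
    """Eq. (3): the evaluation is sufficient to call system 1 better
    than system 2 if p - ci > 50%."""
    p, ci = preference(a, b)
    return p - ci > 50.0

# Hypothetical counts: 235 of 400 answers prefer system 1.
p, ci = preference(235, 165)
print(round(p, 2), round(ci, 2), is_sufficient(235, 165))  # → 58.75 4.82 True
```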

Such evaluation results are only weakly sufficient because, although they are based on all evaluations, they do not reflect the variation of system output from sentence to sentence. We could perform a system evaluation using just one test sentence and still obtain such weakly sufficient results; obviously, such an evaluation would not be reliable. To get more reliable results we base the evaluation on sentences instead of individual answers. We calculate how evaluators judged the systems at the sentence level: if A evaluators prefer a particular sentence from the first system and B evaluators prefer the sentence from the second system, we calculate the percentage using Eq. (1) and the confidence interval using Eq. (2). We say that a particular sentence is translated better by the first system than by the other if Eq. (3) is true. To obtain more reliable evaluation results, we stop asking evaluators to evaluate sentences for which there is already sufficient confidence that they are translated better by one system than by the other. When A sentences are judged to be translated better by the first system and B sentences are judged to be translated better by the second system or tied, we again calculate sentence-level evaluation results using Eqs. (1) and (2). We say that the evaluation results are strongly sufficient to conclude that the first system is better than the second at the sentence level if Eq. (3) is true; the evaluation is just sufficient if ties are ignored.

Table 9. Manual evaluation results. Comparison of two systems

System 1          System 2       Language pair        p        ci
SMT F1            SMT baseline   English-Latvian      58.67%   ±4.98%
Google            SMT F1         English-Latvian      55.73%   ±6.01%
SMT stem/suffix   SMT baseline   Lithuanian-English   52.32%   ±4.14%

The best factored systems were compared with the baseline systems, and the best English-Latvian factored system with the Google SMT system, using the two-system manual comparison procedure described above. The results of the manual evaluation are given in Table 9. The manual comparison of the English-Latvian factored and baseline SMT systems shows that the evaluation results are sufficient to conclude that the factored system is better than the baseline, because in 58.67% (±4.98%) of cases users judged its output to be better than the output of the baseline system. The manual comparison of the English-Latvian factored and Google systems shows that the Google system is slightly better, but the evaluation results are not sufficient to conclude that it is really better, because the difference between the systems is not statistically significant (55.73% − 6.01% < 50%). The manual comparison of our best Lithuanian-English system and the baseline shows that the system with stems and suffixes is slightly better, but the evaluation results are not sufficient to conclude this with strong


confidence, because the difference between the systems is also not statistically significant (52.32% − 4.14% < 50%).

4. Conclusions

The MT system evaluation shows that the automatic metrics used are unreliable for comparing rule-based and statistical systems, strongly favoring the latter; both the Pragma and Tilde rule-based systems received very low BLEU scores. This behavior of automated metrics has been shown before [11]. By developing factored EN-LV SMT models we expected to improve the human assessment of quality by targeting local word agreement and inter-phrase consistency. Human evaluation shows a clear preference for factored SMT over the baseline SMT, which operates only on surface forms. However, automated metric scores show only a slight improvement on the balanced test corpus (BLEU 21.7% vs 23.8%). By developing the LT-EN SMT stem/suffix model we expected to increase overall translation quality by reducing the number of untranslated words. The BLEU score slightly decreased (28.0% vs 28.3%), but the OOV rate differs significantly. Human evaluation results suggest that users prefer a lower OOV rate despite a slight reduction in overall translation quality in terms of BLEU score.

Acknowledgements


The research within the project LetsMT! leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5: Multilingual web, grant agreement no. 250456. This research was also partially supported by the European Social Fund (ESF) activity "Support to doctor's studies", project No. 2009/0138/1DP/1.1.2.1.2/09/IPIA/VIAA/004.

References

[1] I. Skadiņa, E. Brālītis, English-Latvian SMT: knowledge or data, in Proceedings of the 17th Nordic Conference on Computational Linguistics NODALIDA, Odense, Denmark, NEALT Proceedings Series, Vol. 4, 242–245, 2009.
[2] R. Skadiņš, I. Skadiņa, D. Deksne, T. Gornostay, English/Russian-Latvian Machine Translation System, in Proceedings of HLT'2007, Kaunas, Lithuania, 2007.
[3] E. Rimkute, J. Kovalevskaite, Linguistic Evaluation of the First English-Lithuanian Machine Translation System, in Proceedings of HLT'2007, Kaunas, Lithuania, 2007.
[4] P. Koehn, M. Federico, B. Cowan, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open Source Toolkit for Statistical Machine Translation, in Proceedings of the ACL 2007 Demo and Poster Sessions, 177–180, Prague, 2007.
[5] P. Koehn, F.J. Och, D. Marcu, Statistical Phrase-Based Translation, in Proceedings of HLT/NAACL 2003, 2003.
[6] J. Tiedemann, L. Nygaard, The OPUS corpus – parallel & free, in Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, May 26-28, 2004.
[7] J. Tiedemann, News from OPUS – A Collection of Multilingual Parallel Corpora with Tools and Interfaces, in N. Nicolov, K. Bontcheva, G. Angelova, R. Mitkov (eds.), Recent Advances in Natural Language Processing (vol. V), 237–248, John Benjamins, Amsterdam/Philadelphia, 2009.


[8] K. Papineni, S. Roukos, T. Ward, W. Zhu, BLEU: a method for automatic evaluation of machine translation, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 2002.
[9] G. Doddington, Automatic evaluation of machine translation quality using n-gram co-occurrence statistics, in Proceedings of HLT-02, 2002.
[10] C. Callison-Burch, P. Koehn, C. Monz, J. Schroeder, Findings of the 2009 Workshop on Statistical Machine Translation, in Proceedings of the Fourth Workshop on Statistical Machine Translation, 1–28, Athens, Greece, 2009.
[11] C. Callison-Burch, M. Osborne, P. Koehn, Re-evaluating the role of BLEU in machine translation research, in Proceedings of EACL, 2006.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-133


LetsMT! – Online Platform for Sharing Training Data and Building User Tailored Machine Translation

Andrejs VASILJEVS, Tatiana GORNOSTAY and Raivis SKADIŅŠ
Tilde, Latvia

Abstract. This position paper presents the recently started European collaboration project LetsMT!. This project creates a platform that gathers public and user-provided MT training data and generates multiple MT systems by combining and prioritizing this data. The project extends the use of existing state-of-the-art SMT methods, which are applied to data supplied by users to increase the quality, scope and language coverage of machine translation. The paper describes the background and motivation for this work, the key approaches, and the technologies used.

Keywords. LetsMT!, machine translation, Moses, data sharing, cloud service


Introduction

In recent years statistical machine translation (SMT) has become a major breakthrough in machine translation (MT) development, providing a cost-effective and fast way to build MT systems. This development was particularly facilitated by the open-source corpus alignment tool GIZA++ [1] and the MT training and decoding tool Moses [2]. Another factor facilitating the development of MT for many languages was the EU translation corpus and other parallel data available on the Internet. The EuroMatrix project has demonstrated how open source tools and publicly available data can be used to generate SMT systems for all language pairs of the official EU languages [3]. However, these achievements do not fulfil all expectations regarding the application of available SMT methods. The quality of an SMT system largely depends on the size of the training data. Obviously, the majority of parallel data is in widely used languages (e.g. English, German and some others). As a result, SMT systems for these languages are of much better quality than systems for under-resourced languages, i.e. languages with scarce linguistic resources. This quality gap is further deepened by the complex linguistic structure of many smaller languages. Languages like Latvian, Lithuanian and Estonian, to name just a few, have a complex morphological structure and free word order. To learn this complexity from corpus data by statistical methods, much larger volumes of training data are needed than for languages with a simpler linguistic structure. Current systems are built on the data accessible on the web, but this is just a fraction of all parallel texts. The majority of valuable parallel texts still reside in the local systems of corporations, public and private institutions, and on the desktops of individual users.


A. Vasiljevs et al. / Online Platform for Sharing Training Data and Building User Tailored MT


Another obstacle preventing wider use of MT is its general nature. Although free web translators provide reasonable quality for many language pairs, they perform poorly on domain- and user-specific texts. Current free systems cannot be adjusted to particular terminology and style requirements. Large international corporations contract MT companies like Language Weaver to adapt translation systems to their particular needs, but this costly process is not accessible to smaller companies or to the majority of public institutions. This prevents a large part of the EU population from using existing MT solutions to get access to online information. Specifically in the localization and translation industry, a huge number of parallel texts in a variety of industry formats have been accumulated, but the use of this data does not fully exploit the benefits of modern MT technology. At the same time, this industry experiences growing pressure on efficiency and performance, especially because the volume of texts to be translated grows at a higher rate than the availability of human translation, and translation results are expected in real time. At present, the integration of MT in localization services is in its early stages, and the cost of developing specialized MT solutions is prohibitive to most players in the localization and translation industry. The quality of the generic MT offerings provided for free is too low to reap any efficiency gains in a professional localization setting. The same problem is faced by online information providers: they provide information mostly in the larger languages because the cost of human translation into smaller languages is prohibitively high and the quality of existing MT solutions is insufficient.
To fully exploit the potential of existing open SMT technologies and of user-provided content, we propose to build an innovative online platform for data sharing and MT building. This platform is being created in the EU collaboration project LetsMT!. The LetsMT! consortium includes the project coordinator Tilde, the Universities of Edinburgh, Zagreb, Copenhagen and Uppsala, the localization company Moravia and the semantic technology company SemLab. The project started in March 2010 and is to achieve its goals by September 2012. The following sections describe the background and motivation of the LetsMT! project as well as the key approaches and technologies used.

1. Machine translation strategies

MT has been a particularly difficult problem in the area of natural language processing since its beginnings in the early 1940s. From the very beginning of MT history, three main MT strategies have been prominent: direct, interlingua, and transfer. The rule-based MT strategy with a rich translation lexicon showed good translation results and found its application in many commercial MT systems, e.g. Systran, PROMT and others. However, this strategy requires immense time and human resources to incorporate new language pairs or to enhance translation quality. The more competitive SMT approach has occupied the leading position since the first research results were obtained in the late 1980s with the Candide project at IBM for an English-to-French translation system [4][5]. The SMT strategy, first suggested in 1949 by Warren Weaver and then abandoned for various philosophical and theoretical reasons for several decades until the late 1980s [6], has proven to be a fruitful approach to foster the development of MT. Cost-effectiveness and translation quality are the key reasons


that the SMT paradigm has become the dominant current framework for MT theory and practice [7]. In the majority of cases, SMT research and development activities have focused on widely used languages, such as English, German, French, Arabic, and Chinese. For smaller under-resourced languages, including the languages of the Baltic countries Lithuanian, Latvian and Estonian, MT solutions, and language technologies in general, are not as well developed, due to the lack of linguistic resources and cost-effective technological approaches. This has resulted in a technological gap between these two groups of languages. The goal of the LetsMT! project is to overcome this challenge by exploiting open source SMT toolkits and involving users in collecting training data. This will make the currently most progressive MT technology available and accessible to all categories of users, in the form of sharing MT training data and building tailored MT systems for different languages on the basis of the online LetsMT! platform.

2. LetsMT! approach


The LetsMT! project will extend the use of existing state-of-the-art SMT methods, enabling users to participate in data collection and MT customization to increase the quality, scope and language coverage of MT. Currently LetsMT! is creating a cloud-based platform that gathers public and user-provided MT training data and generates multiple MT systems by combining and prioritizing this data. Figure 1 provides the general architecture of the LetsMT! platform. Its components for Moses-based SMT training, parallel data collection and data processing are described further in this paper.

Figure 1. Software architecture of the LetsMT! platform.

LetsMT! services for translating texts will be used in several ways: through the web portal, through a widget provided for free inclusion in a web page, through browser plug-ins, and through integration in computer-assisted translation (CAT) tools and different online and offline applications. Localisation and translation industry businesses and translation professionals will be able to use the LetsMT! platform for uploading


their parallel corpora to the LetsMT! website, building custom SMT solutions from the specified collections of training data, and accessing these solutions in their productivity environments (typically, various CAT tools).


3. Application of the Moses SMT toolkit

A significant breakthrough in SMT was achieved by the EuroMatrix project1. Among the project objectives were translation systems for all pairs of EU languages and the provision of open source MT technology including research tools, software and data. Its result is the improved open source SMT toolkit Moses, developed by the University of Edinburgh. The Moses SMT toolkit is a complete translation system distributed under the Lesser General Public License (LGPL). Moses includes all the components needed to pre-process data and to train language models and translation models [2]. Moses is widely used in the research community and has also reached the commercial sector. While the use of the software is not closely monitored (there is no need to sign a license agreement), Moses is known to be in commercial use by companies such as Systran, Asia Online, Autodesk, Matrixware and Translated.net. The LetsMT! project coordinator Tilde bases its free online Latvian MT system2 on the Moses platform. LetsMT! uses Moses as a language-independent SMT solution and integrates it as a cloud-based service into the LetsMT! online platform. One of the important advancements of the LetsMT! project will be the adaptation of the Moses toolkit to fit the rapid training, updating, and interactive access environment of the LetsMT! platform. The SMT training pipeline implemented in Moses currently involves a number of steps that each require a separate program to run. In the framework of LetsMT! this process will be streamlined and made automatically configurable given a set of user-specified variables (training corpora, language model data, dictionaries, tuning sets). Additional important improvements of Moses being implemented by the University of Edinburgh as part of LetsMT! are the incremental training of MT models, randomised language models [8], and separate language and translation model servers.
We expect some users to add relatively small amounts of additional training data at frequent intervals. Incremental training will make it possible to benefit from the addition of these data without re-running the entire training pipeline from scratch.

4. Parallel corpora for SMT training

While SMT tools are language independent, they require very large parallel corpora for training translation models. A parallel corpus is a collection of texts, each of which is translated into one or more languages [9]. SMT generates translations on the basis of statistical models with parameters derived from the analysis of bilingual parallel text corpora. Thus, large-scale parallel corpora are indispensable language resources for SMT [10]. The most multilingual parallel corpus, the JRC-Acquis, is a huge collection of European Union legislative documents translated into more than

1 http://www.euromatrix.net/
2 http://translate.tilde.com


twenty official European languages [11], including under-resourced languages such as Latvian, Lithuanian, Estonian, Greek, Romanian, and others. For example, for the Latvian language it has 22 906 texts containing 27 592 514 words, and for the Lithuanian language 23 379 texts containing 26 937 773 words (version 3.0)3. A corpus similar to the JRC-Acquis is the European Parliament Proceedings Parallel Corpus4 (Europarl corpus), which was extracted from the proceedings of the European Parliament (1996-2006) and includes versions in 11 European languages: French, Italian, Spanish, Portuguese, English, Dutch, German, Danish, Swedish, Greek and Finnish [12]. These resources, along with other publicly available parallel resources such as OPUS5 and JRC-Acquis6, are used in LetsMT! as initial training data for the development of pre-trained SMT systems.


4.1. Applying user-provided data for SMT training

The number of open source parallel resources is limited, and this is an essential problem for SMT, since translation systems trained on data from a particular domain, e.g. parliamentary proceedings, will perform poorly when used to translate texts from a different domain, e.g. news articles [13][14]. At the same time, a huge amount of parallel texts and translated documents are at the users' disposal, and they can be used for SMT system training. Therefore, the LetsMT! online platform will provide all categories of users (public organizations, private companies, individuals) with an opportunity to upload their proprietary resources to the repository and receive a tailored SMT system trained on these resources. The latter can be shared with other users, who can exploit them further. The motivation of users to get involved in sharing their resources is based on the following factors:
• participate and contribute, in a reciprocal manner, to a community of professionals and its goals;
• achieve better MT quality for user-specific texts;
• build tailored and domain-specific translation services;
• enhance the reputation of individuals and businesses;
• for public institutions, ensure compliance with the requirement set forth by the EU Directive to provide usability of public information in a convenient way;
• for academic institutions, deliver a ready resource for study and teaching purposes.
The LetsMT! project is advancing the concept of data sharing, which implies the practice of making data used in one activity available to other users. One example of successful data sharing is the Translation Automation User Society (TAUS) Translation Memory (TM) Sharing Platform, the TAUS Data Association (TDA)7. TDA is a global not-for-profit organization providing a neutral and secure platform for sharing language data. By sharing their TMs and glossaries, members in return get access to the data of all other members.

3 http://langtech.jrc.it/JRC-Acquis.html
4 http://www.statmt.org/europarl/
5 http://urd.let.rug.nl/tiedeman/OPUS/
6 http://langtech.jrc.it/JRC-Acquis.html
7 http://www.tausdata.org


There is an obvious cooperation potential between LetsMT! and TDA. LetsMT! is interested in using TDA data for SMT training and TDA is interested in integrating MT generation capabilities with the TDA data repository. TDA is already a member of the LetsMT! Support Group and cooperation is further ensured by membership of project partners Tilde and Moravia in TDA.


4.2. Processing of training data

Since user-provided shared data plays a major role in LetsMT!, the project has to deal with issues related to the processing of noisy data and with ensuring data interoperability. The platform should prevent abuse through the inclusion of corrupted material, even though user authentication is used to reduce such dangers. The data management component will therefore include various tests and pre-processing tools to validate the data and fix potential errors. There are various ready-made tools that can be used out-of-the-box for data checking. For example, freely available XML parsers, e.g. Libxml28 and Tidy9, and TMX validators10 will be used to detect problems in the TMs provided by users; open-source GNU/Unix tools will be used to detect character encodings and perform conversions; language guessers, e.g. TextCat11, are also available and can easily be trained for additional language/character sets [15]. It is feasible to use existing tools and integrate them into the LetsMT! platform in order to detect and correct potential errors. Besides basic validation, the LetsMT! platform requires a number of other pre-processing steps. Most important is a proper tokenisation (text segmentation) module, since most users will not provide segmented data. Tokenisation is a non-trivial task and is highly language dependent. In the first phase, simple standard tools will be applied that split punctuation from other tokens, e.g. pattern-based tokenisation with the tools provided together with the Europarl parallel corpus. Language-specific tools will be used where they are available. Tools for better support of language-specific issues, like morphological analysers and lemmatisers, will be continuously incorporated. Initially, only user-provided translation memories containing aligned single-sentence units will be supported.
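A minimal pattern-based tokeniser of the kind described above, splitting punctuation from other tokens, might look as follows (a simplification for illustration, not the actual Europarl tokeniser script):

```python
import re

# Put spaces around common punctuation, then collapse whitespace.
# Real tokenisers handle many more cases (abbreviations, numbers, ...).
PUNCT = re.compile(r'([.,!?;:()"“”])')

def tokenise(text):
    text = PUNCT.sub(r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()

print(tokenise('He said: "Hello, world!"'))
# → He said : " Hello , world ! "
```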
Sanity checks should be carried out to avoid unreasonable training examples such as very long and fragmented translation units or sentences with formatting mark-up or other types of non-textual content. At a later stage, LetsMT! will also support the upload of other types of parallel data. The idea is to use existing resources in various formats and allow users to create their own training material in the form of sentence-aligned corpora. Support for a number of common formats should be provided and the validation process ensured. Standard approaches to automatic sentence alignment are readily available, e.g. Hunalign12 [16], Vanilla13 [17], GMA14 [18]. Post-editing interfaces will be included to verify and improve alignment results online, e.g. ISA as part of

8 http://xmlsoft.org/
9 http://tidy.sourceforge.net/
10 http://www.maxprograms.com/products/tmxvalidator.html
11 http://www.let.rug.nl/vannoord/TextCat/
12 http://mokk.bme.hu/resources/hunalign
13 http://nl.ijs.si/telri/Vanilla/
14 http://nlp.cs.nyu.edu/GMA


Uplug15 [19]. In this way, more users will be encouraged to provide parallel data in a variety of formats. The next step in building SMT translation models from parallel corpora is automatic word alignment. This part of the process is especially complicated and requires a great deal of computational power, especially for large-scale corpora. The standard word alignment models for SMT are the IBM models [6] and the HMM alignment model [20], implemented in the freely available tool GIZA++ [1]. It can be used as a black-box tool in connection with the Moses toolkit, which supports all the steps necessary to build a phrase-based SMT model from a given sentence-aligned parallel corpus. Word alignment is carried out in an unsupervised way using EM re-estimation procedures and a cascaded combination of alignment models. Various settings can be adjusted in the alignment procedure and phrase table extraction. Word alignment is time consuming and requires large amounts of internal memory for extensive data sets. Fortunately, there are extensions and alternative tools available with improved efficiency. The multi-threaded version of GIZA++16 [21] can run several word alignment processes in parallel on a multi-core machine. Furthermore, another (cluster-based) version of GIZA++ can be used to distribute word alignment over several machines. An alternative approach that can also run a parallel alignment procedure is implemented in the MTTK toolkit17 [22]. This software provides several alignment models and may also be used to perform sentence alignment, which is usually a prerequisite for word alignment.
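The sanity checks on translation units described earlier (dropping over-long, badly imbalanced, or markup-laden units before alignment and training) can be sketched as follows; the thresholds are illustrative assumptions, not project settings:

```python
import re

MAX_TOKENS = 100   # assumed cap on unit length
MAX_RATIO = 3.0    # assumed cap on source/target length imbalance
TAG = re.compile(r"<[^>]+>")  # crude formatting mark-up detector

def keep_unit(src, tgt):
    """Return True if a translation unit looks usable for training."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False
    if len(s) > MAX_TOKENS or len(t) > MAX_TOKENS:
        return False
    if max(len(s), len(t)) / min(len(s), len(t)) > MAX_RATIO:
        return False
    if TAG.search(src) or TAG.search(tgt):
        return False
    return True

pairs = [("Labas rytas", "Good morning"),
         ("<b>Labas</b>", "Good"),
         ("vienas", "one two three four five six seven")]
print([keep_unit(s, t) for s, t in pairs])  # → [True, False, False]
```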

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

5. Conclusion

The development of SMT tools and techniques has reached a level where they can be implemented in practical applications addressing the needs of large user groups in a variety of application scenarios. The work in progress described in this paper promises important advances in the application of SMT by integrating available tools and technologies into an easy-to-use cloud-based platform for data sharing and generation of customized MT. Successful implementation of the project will enable wider use and greater impact of available open-source SMT technologies, and facilitate the diversification of free MT by tailoring it to specific domains and user requirements.

Acknowledgements

We would like to thank Jörg Tiedemann and other LetsMT! project partners for contributing to parts of this paper. The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no. 250456.

15 http://www.let.rug.nl/~tiedeman/Uplug/php/
16 http://code.google.com/p/giza-pp/ and http://www.cs.cmu.edu/~qing/
17 http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1/


References
[1] F.J. Och, H. Ney, A Systematic Comparison of Various Statistical Alignment Models, Computational Linguistics, 29(1): 19-51, 2003.
[2] P. Koehn, M. Federico, B. Cowan, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open Source Toolkit for Statistical Machine Translation, in Proceedings of the ACL 2007 Demo and Poster Sessions, pages 177-180, Prague, 2007.
[3] P. Koehn, A. Birch, R. Steinberger, 462 Machine Translation Systems for Europe, in Proceedings of MT Summit XII, 2009.
[4] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, F. Mercer, P. Roossin, A statistical approach to French/English translation, in Proceedings of the Second International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, June 12-14, 1988.
[5] P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, F. Mercer, P. Roossin, A statistical approach to language translation, in Proceedings of the 12th International Conference on Computational Linguistics, COLING'88, (1): 71-76, 1988.
[6] P. Brown, S. Della Pietra, V. Della Pietra, R. Mercer, The Mathematics of Statistical Machine Translation: Parameter Estimation, Computational Linguistics, 19(2): 264-311, 1993.
[7] J. Hutchins, Machine translation: a concise history, in Computer aided translation: Theory and practice, ed. Chan Sin Wai, Chinese University of Hong Kong, 2007.
[8] A. Levenberg, M. Osborne, Stream-based Randomised Language Models for SMT, in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009.
[9] EAGLES, Preliminary recommendations on corpus typology. Electronic resource: http://www.ilc.cnr.it/EAGLES96/corpustyp/corpustyp.html, 1996.
[10] C. Goutte, N. Cancedda, M. Dymetman, G. Foster (eds.), Learning Machine Translation, The MIT Press, Cambridge, Massachusetts, London, England, 2009.
[11] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş, D. Varga, The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages, in Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Electronic resource: http://langtech.jrc.it/Documents/0605_LREC_JRC-Acquis_Steinberger-et-al.pdf, 2006.
[12] P. Koehn, Europarl: a parallel corpus for statistical machine translation, in Proceedings of Machine Translation Summit X, 2005.
[13] D. Munteanu, A. Fraser, D. Marcu, Improved Machine Translation Performance via Parallel Sentence Extraction from Comparable Corpora, in Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, HLT/NAACL'2004. Electronic resource: http://www.mt-archive.info/HLT-NAACL-2004-Munteanu.pdf, 2004.
[14] D. Munteanu, Exploiting Comparable Corpora (for automatic creation of parallel corpora), online presentation. Electronic resource: http://content.digitalwell.washington.edu/msr/external_release_talks_12_05_2005/14008/lecture.htm, 2006.
[15] W.B. Cavnar, J.M. Trenkle, N-Gram-Based Text Categorization, in Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, UNLV Publications/Reprographics, pp. 161-175, 11-13 April, 1994.
[16] D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy, Parallel corpora for medium density languages, in Proceedings of Recent Advances in Natural Language Processing, pp. 590-596, 2005.
[17] W.A. Gale, K.W. Church, A Program for Aligning Sentences in Bilingual Corpora, Computational Linguistics, 19(1): 75-102, 1993.
[18] D. Melamed, Bitext maps and alignment via pattern recognition, Computational Linguistics, 25(1): 107-130, 1999.
[19] J. Tiedemann, ISA & ICA – Two Web Interfaces for Interactive Alignment of Bitexts, in Proceedings of LREC 2006, Genova, Italy, 2006.
[20] S. Vogel, H. Ney, C. Tillmann, HMM-based word alignment in statistical translation, in Proceedings of the 16th International Conference on Computational Linguistics, Copenhagen, Denmark, 1996.
[21] Q. Gao, S. Vogel, Parallel Implementations of Word Alignment Tool, in Proceedings of Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49-57, 2008.
[22] Y. Deng, W. Byrne, MTTK: An alignment toolkit for statistical machine translation, in Demo Presentation at the HLT-NAACL Demonstrations Program, June 2006.


Written Corpora and Linguistic Resources


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-143


The Estonian Reference Corpus: Its Composition and Morphology-aware User Interface

Heiki-Jaan KAALEP1, Kadri MUISCHNEK, Kristel UIBOAED and Kaarel VESKIS
University of Tartu

Abstract. This paper gives a brief overview of the composition as well as technical and morphological annotation of the Reference Corpus of Estonian. A user interface using the morphological information about lemmas and grammatical categories of word-forms is presented. Keywords. corpus compilation and mark-up, corpus user interface, morphological annotation

Introduction


The Estonian Reference Corpus2 is a collection of written present-day language, currently consisting of ca 245 million words. In this paper we describe the overall composition of the corpus, say a few words about its technical and morphological annotation, and present a corpus query system based on the morphologically analyzed version of the corpus.3

1. The overall design and technical annotation of the Corpus

The Estonian Reference Corpus is a non-balanced one: newspaper texts make up 75% of the Corpus, fiction texts 2%, scientific texts 2%, legalese 5%, parliament transcripts 5%, and the texts of the “new media” 9%. By “new media” we mean the genres of computer-mediated discourse, i.e. chatrooms (Internet relay chats), Internet forums, newsgroups, and user comments from news portals. The technical annotation of the Corpus follows the TEI guidelines. The traditional written texts (i.e. newspapers, fiction texts etc.) are annotated for text structure. Non-textual material (graphs, formulae, pictures, tables etc.) has been omitted and represented by a tag.

1 Corresponding Author.
2 http://www.cl.ut.ee/korpused/segakorpus/index.php?lang=en
3 This work was supported by the European Regional Development Fund through the Estonian Center of Excellence in Computer Science, EXCS, and by the Estonian Ministry of Education and Science (grant SF0180078s08).


H.-J. Kaalep et al. / The Estonian Reference Corpus

The annotation of the “new media” texts is different from that of the rest of the Corpus. The basic idea behind the tagging was that the transcript of a chatroom, a newsgroup or an online forum is similar to the transcript of a drama play: the actors enter the stage, produce their lines, and leave the stage. Thus, the time a text was entered on a web site, if available, has been tagged, as have the speaker, the text of one speaker, the theme or title of the message, and the actions between the chat lines.

The mark-up of the Corpus follows the currently outdated P3 version of the TEI guidelines4, which has some significant disadvantages compared to later versions of TEI. In 2002, with the P4 version, TEI changed its underlying representation from SGML to XML, and in 2007 the P5 version added some architectural changes. There are a number of benefits in switching from SGML to XML, one of which is that XML has a number of standards and specifications that SGML lacks. Our plan is to migrate from P3 to P5, not skipping the P4 stage, but instead using P4 as an intermediary stage in order to facilitate the migration. The reason is that in 2002 the TEI Task Force on SGML to XML Migration devised a Practical Guide to Migration of TEI Documents from P3 to P45. TEI also has guidelines for migrating from P4 to P56, but no guidelines for direct P3 to P5 migration are known to us. As a part of the migration process, we have already converted the text encoding of the corpus from ASCII and SGML entities to UTF-8.
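The drama-play analogy can be sketched with a short script that emits one chat turn as an XML element. The element and attribute names (div, u, who, time) are illustrative placeholders, not the corpus's actual tag set, which is not reproduced in this excerpt.

```python
import xml.etree.ElementTree as ET

def chat_turn(speaker, when, text):
    # One chat line as a drama-style utterance: who spoke, when, and what.
    # Element/attribute names here are placeholders for illustration.
    u = ET.Element("u", who=speaker)
    t = ET.SubElement(u, "time")
    t.text = when
    t.tail = text  # the utterance text follows the time stamp
    return u

div = ET.Element("div", type="chatroom")
div.append(chat_turn("mari", "12:01", "tere!"))
div.append(chat_turn("jaan", "12:02", "tere-tere"))
print(ET.tostring(div, encoding="unicode"))
```

Each speaker's turn is a self-contained element, so actors can "enter and leave the stage" simply by appending further turns.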


2. Morphological annotation of the Corpus

Estonian is an agglutinating language; it uses inflection to encode the syntactic relations between the words of a sentence. At the same time, Estonian has some fusional traits: it has a tendency to fuse morphemes so that they are difficult to segment. For example, the first four case-forms of the word käsi ‘hand’ are, in the singular, käsi käe kätt kätte, and in the plural käed käte käsi kätesse. That entails the necessity of morphological analysis for a corpus query system, as in many cases it is not possible to retrieve all inflectional forms of a word using its base form and some kind of regular expression. To make matters worse, 45% of tokens in a text corpus can be analysed in several ways if the context they occur in is not taken into account; in other words, 45% of the tokens are morphologically ambiguous.

The corpus has been annotated morphologically by Filosoft Ltd. using their morphological analyzer of Estonian (including a guesser for out-of-dictionary words) and an HMM disambiguator. The principles of the approach date back to [1], but the tools have been developed further; e.g. the HMM disambiguator has been implemented as a trigram HMM and trained on a manually annotated corpus of 500,000 tokens7. The categories used by the morphological analyzer and disambiguator are based on [2].

After disambiguation, 10% of the tokens still remain ambiguous. This is because if we do not have rules or data for choosing the right annotation with high probability, it is better not to make the choice at all. The ambiguous tokens fall into the following categories: participles (ambiguous between verb and adjective readings), pronouns

4 https://docs.google.com/viewer?url=http://www.tei-c.org/Vault/GL/p4beta.pdf
5 http://www.tei-c.org/Activities/Workgroups/MI/index.xml
6 http://www.tei-c.org/Guidelines/P5/migrate.xml
7 http://www.cl.ut.ee/korpused/morfkorpus/


(ambiguous between singular and plural, and between pronoun and numeral readings), the verb form on (ambiguous between ‘(he) is’ and ‘(they) are’), and uninflected words like kui, otsekui, nagu ‘if’, ‘as if’, just ‘just’ (ambiguous between conjunction, interjection and adverb readings).

An evaluation of the quality of the disambiguation showed that, depending on the text class, 3-6% of the annotations were not quite correct8. An error could be in the lemma form, the inflectional category, or the word class. The evaluation also showed that if the HMM disambiguator was used on a text class not seen in the training phase, the quality of its output was about one percentage point lower. Thus we expect the quality to remain rather stable across the whole annotated corpus.

The “new media” subcorpus has not been morphologically annotated yet, the reason being that the texts of the new media, especially those of the chatrooms, contain many word-forms that do not occur in texts of the standard written language, or that are used in a different function and meaning. These texts therefore need special pre-processing prior to morphological analysis, and the lexicon of the morphological analyzer needs to be customised.
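An HMM disambiguator of this kind picks, for each sentence, the tag sequence maximising the product of transition and emission probabilities. The sketch below uses a bigram Viterbi for brevity (the actual disambiguator is a trigram HMM), with a two-tag inventory and invented probabilities; the Estonian words are only a toy illustration.

```python
import math

# Toy tag set: S = noun, V = verb. All probabilities are invented.
trans = {("<s>", "S"): 0.7, ("<s>", "V"): 0.3,
         ("S", "S"): 0.4, ("S", "V"): 0.6,
         ("V", "S"): 0.7, ("V", "V"): 0.3}
emit = {("S", "käsi"): 0.3, ("V", "käsi"): 0.001,  # 'käsi' = hand
        ("S", "on"): 0.01, ("V", "on"): 0.2}       # 'on' = is/are

def viterbi(words, tags=("S", "V")):
    # best[t]: (log-probability, path) of the best path ending in tag t
    best = {t: (math.log(trans[("<s>", t)] * emit[(t, words[0])]), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max(((lp + math.log(trans[(prev, t)] * emit[(t, w)]),
                         path + [t])
                        for prev, (lp, path) in best.items()),
                       key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["käsi", "on"]))  # → ['S', 'V']
```

A real disambiguator would additionally threshold the probability margin and leave the token ambiguous when no reading wins clearly, as described above.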


3. The morphology-aware user interface

For years, the whole corpus has been freely downloadable for non-commercial purposes, with the possibility to use software of any origin for doing research on the downloaded texts. However, feedback from potential users has indicated that Estonian linguists would prefer to query the corpus via the Internet, using a simple search facility instead. So it is necessary to have an appropriate corpus toolkit, commonly known as concordance software, which enables the user to query the corpus.

For several years our corpus has had a simple search facility9 that retrieves a sequence of symbols from the corpus. Another, new interface10 enables users to query the corpus using the lemmas of word-forms and/or morphological information, in combination with surface word forms. The new search facility aims to balance 1) ease of use, 2) the functionality required by linguists, and 3) simplicity of maintenance (including upgrading) of the corpus and the software. The internal representation of the Corpus for ensuring fast retrieval was designed and implemented by Rene Prillop, who also designed and implemented the query interface. The conversion from the TEI format to the morphologically annotated one was performed by Tarmo Vaino.

Figure 1 shows the first 5 results for a query for the multi-word verbal expression silmi lahti hoidma, lit. ‘keep one’s eyes open’, i.e. ‘pay attention, be watchful’. The query was submitted to several sub-corpora (shown as tabs on the web page) at the same time, but only the results from one subcorpus are shown – 119 sentences from the newspaper Eesti Päevaleht. For every sentence, there is a clickable field for showing the exact source of that sentence. The searched terms are highlighted.

8 http://teataja.ee/veskis-liba-syntax-assignment-modified.pdf
9 http://www.cl.ut.ee/korpused/kasutajaliides/
10 http://www.keeleveeb.ee


Figure 1. Results for the query silmi lahti hoidma.

Note that the order of the searched word forms may vary and that there may be intervening words. The second resulting sentence shows what happens when one clicks on a word (e.g. silmad ‘eyes’): its morphological analysis – lemma (silm), inflectional ending (d), word class (S – noun) and grammatical information (plural nominative) – is displayed. The query can be submitted via a set of text fields (for word forms, lemmas, parts of words) and clickable boxes. The user input is transformed into a query string (silm@l lahti hoidma@l in Figure 1), which is then used by the system to perform the search. The user can save this string, so that next time she need not tick the same boxes again, but may paste the saved string directly into the search box.
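A query-string convention like this can be sketched as follows. Only the "@l" lemma marker visible in the example is modelled; the parser and matcher are hypothetical reconstructions, not the interface's actual implementation, and the order-free matching with intervening words follows the behaviour described for Figure 1.

```python
def parse_query(query):
    # A bare token matches a surface form; a token suffixed with "@l"
    # matches a lemma ("silm@l lahti hoidma@l" in the example above).
    terms = []
    for token in query.split():
        if token.endswith("@l"):
            terms.append(("lemma", token[:-2]))
        else:
            terms.append(("form", token))
    return terms

def sentence_matches(terms, tokens):
    """tokens: list of (surface_form, lemma) pairs. Order-free match,
    with intervening words allowed."""
    return all(any((kind == "lemma" and lemma == value) or
                   (kind == "form" and form == value)
                   for form, lemma in tokens)
               for kind, value in terms)

query = parse_query("silm@l lahti hoidma@l")
sent = [("silmad", "silm"), ("tuleb", "tulema"),
        ("lahti", "lahti"), ("hoida", "hoidma")]
print(sentence_matches(query, sent))  # → True
```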

4. Conclusion

This paper gave a short overview of the Reference Corpus of Estonian, its annotation, and a new morphology-aware user interface. Our plans for the near future are twofold: first, to perform the morphological annotation of the “new media” texts and thus make them usable via a search interface. Second, we are working on splitting the sentences into clauses in order to provide better context for retrieving co-occurrences of words.

References
[1] H.-J. Kaalep, T. Vaino, Complete Morphological Analysis in the Linguist’s Toolbox, Congressus Nonus Internationalis Fenno-Ugristarum Pars V, pp. 9-16, Tartu, 2001.
[2] Ü. Viks, Väike vormisõnastik, ETA EKI, Tallinn, 1992.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-147


Adaptive Automatic Mark-up Tool for Legacy Dictionaries

Lauma PRETKALNIŅA and Ilze MILLERE
Institute of Mathematics and Computer Science, University of Latvia

Abstract. This paper considers the problem of developing a trainable software tool for automatic structural mark-up of legacy dictionaries that have been digitalized in a rich-text format. The proposed method is intended to obtain an accurate structural representation of an entry from its visual formatting, which is assumed to be consistent throughout the dictionary. The obtained XML structures are guaranteed to be valid against the specified DTD [1]. Keywords. mark-up, XML, dictionary, adaptive tool, machine learning


Introduction

State-of-the-art dictionary writing systems [e.g. 2] facilitate the development of machine-readable dictionaries, but there are also many valuable legacy dictionaries that are not available in a machine-readable form. Ad-hoc solutions based on hand-crafted regular expression patterns are often used to convert rich-text formatted dictionaries into structurally annotated ones [e.g. 3]; however, such tools have to be tweaked or even modified for each particular dictionary. Such conversion services can also be outsourced1. The main purpose of the proposed method is to allow a lexicographer without programming knowledge, including regular expressions, to obtain a customized automatic mark-up tool for a particular dictionary by simple, intuitive means. The proposed tool asks the lexicographer to prepare some samples with correctly marked entries and then induces the possible relations between the visual formatting and the XML structure to be obtained.

1. Outline of the proposed method

1.1. General outline

We propose a method based on machine learning techniques. The method works on each entry independently. Structural tags are added in a top-down manner (level by level) according to the specified XML structure schema (see Fig. 1). Here and further, when describing the marking process at the current level, the element that is already

1 See http://www.tshwanedje.com/data_conversion/, for instance.


L. Pretkalniņa and I. Millere / Adaptive Automatic Mark-Up Tool for Legacy Dictionaries

inserted at the previous level will be called the parent-element, and the elements to be inserted at this level – the child-elements. There are a few restrictions imposed on the obtainable schema: each element is either a leaf element (text only), a container element that encapsulates a sequence of different non-repeating types of elements, or a repetition of single-type elements. Empty elements and mixed elements are not allowed. The reasons for these restrictions come from the mark-up process and will be covered in more detail later. We note that in most cases existing schemas can easily be adjusted to satisfy these requirements by adding additional container-elements.
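The three allowed element kinds (leaf, sequence of distinct types, repetition of one type) can be sketched as a small schema checker. The schema encoding and tag names are invented for illustration; they do not reproduce the paper's DTD.

```python
# A schema maps each tag to one of three specs, mirroring the
# restrictions above:
#   ("leaf",)            - text only
#   ("seq", [t1, t2])    - sequence of distinct, non-repeating child types
#   ("rep", t)           - repetition of a single child type
schema = {"entry": ("seq", ["header", "senses"]),
          "header": ("leaf",),
          "senses": ("rep", "sense"),
          "sense": ("leaf",)}

def check(schema):
    for tag, spec in schema.items():
        kind = spec[0]
        if kind == "seq":
            children = spec[1]
            # child types must be non-empty, distinct, and defined
            if not children or len(children) != len(set(children)):
                return False
            if any(c not in schema for c in children):
                return False
        elif kind == "rep":
            if spec[1] not in schema:
                return False
        elif kind != "leaf":
            return False  # empty/mixed elements are not allowed
    return True

print(check(schema))  # → True
```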


Figure 1. Top-down mark-up steps in the case of the sample entry structure

The tool that implements the proposed method must be trained on a user-prepared data set, before execution, to obtain knowledge (rules) about the mapping between the visual and structural representations of entries. To achieve better results, a hybrid learning strategy is used. First, decision tree learning [4] is used to obtain transformation rules (this is explained in more detail in the next section). Second, in an optional step, the user can add knowledge in the form of simple rules: “element X always contains feature Y” or “element X never contains feature Y”.

1.2. Decision tree usage

Decision trees are used to accumulate knowledge from user-given samples. During the training process, child elements are transformed into feature vectors with Boolean-valued components. Every component of the vector is determined as a combination of a formatting element (e.g. bold text, text in italics, a special symbol, etc.) and the position of this formatting element relative to the child element (e.g. the child element contains this formatting, the child element begins with this formatting, the child element begins where this formatting ends, etc.). The training process is performed differently depending on the given parent-element structure. Two cases are possible: (1) the parent-element contains a sequence of different


elements, or (2) the parent-element contains a repetition of a single-type element. In the first case, the user gives a set of correctly marked samples to the tool. A decision tree is then trained to classify child elements’ vector representations into categories; these categories essentially answer the question: which of the child elements allowed by the given schema could this vector be? In the second case, the user gives both correctly and incorrectly marked samples. The incorrectly marked samples are obtained by training the tool with some initial set, running the tool on some test entries, and then specifying the incorrectly marked samples and adding them to the training set. In this case, a decision tree essentially answers the question of whether the substring described by the feature vector given to the tree could be a valid child element or not.

In the mark-up process, a search is performed through the parent element to find the positions at which to put child elements’ beginnings and ends, so that all the obtained child elements are approved by the previously trained decision trees. Right now this is done by performing an exhaustive search through all the possible positions in the parent-element string where child elements’ beginnings and ends could be inserted, and the search is organised by backtracking. During the evaluation it was noticed that this slows the process down for big entries, so some optimization should be used in the future, perhaps using dynamic programming techniques.
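The boundary search just described can be sketched as a recursive backtracking procedure. The validity predicate below is a stub standing in for the trained decision tree, and the capitalised-word rule is an invented example, not a rule from the paper.

```python
def segment(text, is_valid, start=0, acc=None):
    # Exhaustive backtracking over candidate boundary positions: every
    # produced piece must be approved by the validity predicate (which
    # stands in for the trained decision tree).
    acc = [] if acc is None else acc
    if start == len(text):
        return acc
    for end in range(start + 1, len(text) + 1):
        piece = text[start:end]
        if is_valid(piece):
            result = segment(text, is_valid, end, acc + [piece])
            if result is not None:
                return result
    return None  # no valid split from this position: backtrack

# Stub predicate: a valid "child element" is one capitalised word.
is_valid = lambda s: s.isalpha() and s[:1].isupper() and s[1:].islower()
print(segment("AbcDefGh", is_valid))  # → ['Abc', 'Def', 'Gh']
```

The exhaustive enumeration of boundary positions is what makes large entries slow, and also what dynamic programming over (position, element) states could memoise.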


2. Evaluation

The proposed method has been implemented as a mark-up tool prototype. The method has been tested on the Dictionary of Latvian Literary Language (LLVV) [5], which has a highly complex entry structure. The LLVV has been marked up in a semi-automated way from OCR-processed facsimiles, which means that each entry is available in both formats: in human-reviewed XML mark-up (the “correct answer”) and in the original visual formatting. This also means that some textual errors can be found in these data, originating from the optical character recognition process: a comma instead of a period, incorrect letters, or incorrect italic formatting.

The simplest way to evaluate the obtained results would be to count how many parent-elements are correctly marked. However, this approach is not very informative in terms of the amount of manual work needed to correct mistakes made by the automatic mark-up tool, because parent-elements differ widely in their scope and complexity. Therefore we evaluate the resulting entries by the following parameters:
• incorrectly placed border between two child-elements;
• incorrect child-element — a child-element which should not be in this parent-element;
• missing child-element.

As the proposed method uses a slightly different approach for different kinds of parent-elements – whether the parent-element is made of a sequence of different elements or a repetition of a single-type element – all experiments are divided into two groups accordingly.

2.1. Marking a parent element with different child-elements

The tool was trained to divide an entry according to the following DTD:


The training corpus contains 23 correctly marked entries, picked by a human to reflect most of the possible child-element combinations. The test corpus contains 460 sequential entries from [5, vol. 3]. All formattings available in the entries were given to the tool. Two experiments were made: the first with no additional data, and the second with one additional rule. The statistics of the first experiment are shown in Table 1.

Table 1. Results marking parent element with different child-elements

                                                        Proportion   Total amount
Correctly marked parent-elements                        13.04%       60
Elements with one misplaced border                      0.65%        3
Elements with one additional or missing child-element   79.57%       366
Elements with more than one error of any kind           6.74%        31
Total amount of parent-elements                         100%         460


Inspecting the results leads to the observation that most of the mistakes are caused by misplaced reference elements. This can be explained by the structure of the obtained decision tree (see Figure 2): the substring of the parent-element given for the mark-up tool’s consideration is labeled “reference” when no other child-element type is appropriate. Therefore, a second experiment was made with an additional rule about reference elements.


Figure 2. Example of an obtained decision tree

The rule used in the second experiment was constructed as follows: “element “REFERENCE” always contains feature “STARTING WITH SPECIAL SYMBOL””, because the reference element in the dictionary starts with a special symbol. The statistics of this experiment are shown in Table 2.


Table 2. Results marking parent element with different child-elements, one additional rule

                                                        Proportion   Total amount
Correctly marked parent-elements                        94.13%       433
Elements with one misplaced border                      5.87%        27
Elements with one additional or missing child-element   0%           0
Elements with more than one error of any kind           0%           0
Total amount of parent-elements                         100%         460

As can be seen, using the additional rule has improved the results greatly – more than 80% of the mistakes are corrected by it. The overall results can be considered promising.

2.2. Marking a parent element as a sequence of a repeated child-element

The tool was trained to divide a block of meanings into separate meanings, and to divide the header into separate blocks, each containing one header-word with its additional information. In a DTD this can be represented as follows:


The training corpus contains 23 correctly marked entries and 32 entries with mistakes indicated. The test corpus contains 460 sequential entries from [5, vol. 3], each with a correctly indicated header and block of meanings. Experiments in which all formattings available in the entries (13 different kinds) were given to the tool turned out to take too much computation time – marking the test corpus had not finished after more than one hour. The produced decision trees also turned out to be very big: transformed to rule sets, they contained 42 rules each, and the longest rule contained 41 clauses. Therefore, experiments in which only 2 different formattings, chosen by the user, were available to the tool were made. In this case, marking the test corpus took approximately 12 minutes, and the produced decision trees transformed to rule sets contained 12 rules each, the longest rule containing 12 clauses. The statistics of this experiment are shown in Table 3.

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

Table 3. The results of marking parent elements as sequences of a repeated child-element

                                                        Proportion   Total amount
Correctly marked entries                                99.35%       457
Correctly marked parent-elements                        99.57%       916
Elements with one misplaced border                      0%           0
Elements with one additional or missing child-element   0.22%        2
Entries with more than one error of any kind            0.43%        2
Elements with more than one error of any kind           0.22%        2
Total amount of entries                                 100%         460
Total amount of parent-elements                         100%         920

Conclusion

The initial results are rather promising and suggest that the proposed method can be useful for the automatic conversion of legacy dictionaries into machine-readable representations. Experiments show that an appropriately trained prototype tool is able to give correct mark-up for more than 94% of samples, and mark-up with no more than one misplaced child-element border (and no errors of any other kind) for more than 99% of samples. Further work is related to usability improvements. Two main directions of development are possible: (1) optimizing the search through the possible positions where child elements’ beginnings and ends could be placed, to gain better performance, and (2) enhancing user support by creating an improved interface for more


convenient provision of training data (both positive and negative samples and rules) and a handy way to control the top-down training processes and all the knowledge accumulated in the process, in order to fully utilize the proposed method.

References
[1] Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation, World Wide Web Consortium, 2008. Available online: http://www.w3.org/TR/xml/ [accessed 2010-07-02].
[2] D. Joffe, G.-M. de Schryver, TshwaneLex – A State-of-the-Art Dictionary Compilation Program, in Proceedings of the Eleventh EURALEX International Congress, Lorient, France, EURALEX, 2004.
[3] V. Zinkevičius, The Digitalization of the Dictionary of the Lithuanian Language, in Proceedings of the 3rd Baltic Conference on Human Language Technologies, Kaunas, Lithuania, 2007, pp. 349-355.
[4] T.M. Mitchell, Machine Learning, McGraw Hill, Singapore, 1997.
[5] Latviešu literārās valodas vārdnīca, Latvijas Zinātņu akadēmija, Valodas un literatūras institūts, Rīga: Zinātne, 1972-1996, Vol. 3, 5.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-154

Corpus of Contemporary Lithuanian Language – the Standardised Way Erika RIMKUTĖ1, Jolanta KOVALEVSKAITĖ, Vida MELNINKAITĖ, Andrius UTKA, Daiva VITKUTĖ-ADŽGAUSKIENĖ Centre of Computational Linguistics, Vytautas Magnus University, Kaunas, Lithuania

Abstract. The paper presents the development process of the 160m word Corpus of Contemporary Lithuanian Language (CCLL), with standardisation issues being the focus of the current development phase. The paper presents problems and solutions for the process of converting the CCLL from a proprietary format into a standardised one. Challenges in encoding the corpus using the Text Encoding Initiative Guidelines P5 are addressed, covering the document metadata, text structure and morphological annotation levels that are already implemented in the CCLL. Future perspectives for corpus development are discussed.
Keywords. Corpus linguistics, TEI P5 encoding, morphosyntactic specifications.


Introduction

The Corpus of Contemporary Lithuanian Language (CCLL) was started 16 years ago at the Centre of Computational Linguistics of Vytautas Magnus University (CCL-VMU), and has since grown into a large 160m word corpus. Its contents are structured as follows: newspaper texts make up 46 percent of the corpus, non-fiction books 32 percent, fiction books 13 percent, documents 3 percent, and spoken language texts 7 percent [1]. For more than 10 years the corpus has been freely searchable on-line2, and it has thus become a representative and authoritative source of information on the usage of real Lithuanian.

Now that this simple corpus of raw textual data has turned into a morphologically annotated one, the need is felt to change its "dress" into something more up to date. A standardised corpus structure becomes more and more important, considering the possibilities for simultaneous use of several national corpora (e.g. for machine translation tasks), participation in large-scale national and international projects, and the use of open-source and other available tools for corpus analysis, annotation, search, sharing, etc. Standardisation is also needed with a view to joining large national and international infrastructures, such as CLARIN.

The first open question is the choice of encoding standard, the three main alternatives named in the CLARIN short guide [2] being the standards developed by the International Standards Organization Technical Committee 37 Subcommittee 4 (ISO/TC37/SC4)3, XCES (XML Corpus Encoding Standard)4 and TEI P5 (Text Encoding Initiative)5. The ISO/TC37/SC4 family of standards for representing linguistic information is still far from stable. XCES, which is considered a de facto corpus encoding standard and is used by different national corpora (American National Corpus, IPI PAN Corpus of Polish, etc.), is still not TEI P5 compatible, is poorly documented (with large parts of the documentation coming from the previous non-XML CES version), and is rather limited in annotation levels. TEI P5, though a universal standard for representing text in digital form and thus much more complex, is flexible in defining different annotation levels, has well-defined semantics and rich documentation, and can easily be adapted to various corpus encoding needs. Similar conclusions were drawn by the maintainers of the National Corpus of Polish [3], who selected TEI P5 as their encoding standard. A number of other national corpora (British National Corpus, Bulgarian National Corpus, Croatian Language Corpus, etc.) have also chosen the TEI P5 way6. Therefore, the decision was taken to follow the TEI P5 standard for the CCLL encoding.

1 Corresponding Author: Erika Rimkutė, Centre of Computational Linguistics, Vytautas Magnus University, Donelaičio 52, Kaunas, Lithuania; E-mail: [email protected].
2 http://donelaitis.vdu.lt

1. CCLL – General Architecture

Due to its large size the CCLL is not stored as a single TEI-conformant file; instead, a collection of XML files is maintained, each file representing a separate corpus text at a given annotation level. Each document has its own header, represented by the TEI <teiHeader> element, containing the necessary document metadata related to the corresponding corpus annotation level. Additionally, a special directory file for the whole corpus is used, facilitating the operations of browsing the whole corpus or selecting separate parts of it for specialized processing tasks. Figure 1 presents the overall architecture of the CCLL.


[Figure 1 is a diagram: a corpus directory references annotation levels 1…N, each containing corpus files 1…N, together with external files for the taxonomy definition and the morphosyntactic specifications.]

Figure 1. Overall architecture of CCLL

3 http://www.tc37sc4.org/
4 http://www.xces.org/
5 http://www.tei-c.org/Guidelines/P5/
6 http://www.tei-c.org/Activities/Projects/index.xml

The collection of corpus texts is supplemented with additional external XML files representing taxonomy descriptions, morphosyntactic specifications, etc. The structure of these files is also defined following the TEI P5 recommendations. The classification scheme description (the TEI <taxonomy> element) consists of a number of nested <category> elements, each defining a certain category within the predefined CCLL typology. The file describing the morphosyntactic specifications is constructed using the TEI recommendations for a feature-structure library. Such an architecture enables easy expansion of the corpus. New annotation levels can be formed by incrementally adding annotation metadata to the corresponding set of corpus files. Similarly, subsets of separate CCLL annotation levels can be formed, thus constructing "baby" or "sampler" versions of the corpus. An alternative approach, applying a "stand-off" annotation scheme in which annotation metadata are stored separately, was considered; however, the current approach was selected mainly for ease of data access in complex processing tasks.

2. Annotation at the Document Metadata Level


TEI conformant annotation at the document metadata level was accomplished by converting the existing proprietary CCLL document header to a TEI P5 conformant structure. Figure 2 presents the former structure of a header for a single corpus document.

Figure 2. Proprietary header structure before conversion

Figure 3 presents the header structure of a single corpus text after conversion to the TEI P5 format. The conversion was done using a special set of automatic conversion tools. As the new header structure contains additional fields that were not present in the proprietary structure, these fields had to be filled in using a semi-manual procedure. The main constituent parts of a TEI-conformant header (<fileDesc>, <encodingDesc>, <profileDesc> and <revisionDesc>) are flexible enough to cover all the necessary elements for presenting the bibliographical and non-bibliographical description of an electronic text, the relationship between the electronic text and its source, and the file revision history. Quite a few of the elements can be described in several alternative ways according to TEI P5, so in each case there was a possibility to use the

most acceptable solution. If needed, additional description elements can be added in the TEI document header part.


Figure 3. Structure of TEI-conformant document heading

The largest issue here has been the definition of the text taxonomy, which had to be redesigned according to the TEI P5 classification declaration recommendations. Presently, the CCLL contains various types of books, periodicals, legal documents, transcriptions of parliamentary debates, etc. As the 5-level taxonomy used for the CCLL is rather complicated and large in size, its definition is stored in a separate TEI P5 conformant XML document and referenced from corpus texts using xml:id attributes for the corresponding categories. Such an approach is convenient, considering that the task is not only to create an appropriate taxonomy for the current selection of texts; it is equally important for future corpus development, including the addition of new types of texts.
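The externally stored taxonomy mechanism can be sketched as follows. This is a minimal illustration using Python's standard xml.etree API; the category identifiers and labels below are invented for the example and are not the actual 5-level CCLL typology.

```python
# Sketch: building a separate TEI-style taxonomy file whose nested
# <category> elements carry xml:id values, so that corpus texts can
# reference them (e.g. via <catRef target="#cat.periodicals.news"/>).
# All ids and labels here are illustrative, not the real CCLL taxonomy.
import xml.etree.ElementTree as ET

def category(parent, cid, label):
    """Append one <category> with an xml:id and a human-readable <catDesc>."""
    cat = ET.SubElement(parent, "category", {"xml:id": cid})
    ET.SubElement(cat, "catDesc").text = label
    return cat

taxonomy = ET.Element("taxonomy", {"xml:id": "ccll-taxonomy"})
books = category(taxonomy, "cat.books", "Books")
category(books, "cat.books.fiction", "Fiction")
category(books, "cat.books.nonfiction", "Non-fiction")
periodicals = category(taxonomy, "cat.periodicals", "Periodicals")
category(periodicals, "cat.periodicals.news", "Newspapers")

print(ET.tostring(taxonomy, encoding="unicode"))
```

Because each category carries a stable xml:id, new text types can later be added as new <category> nodes without touching the corpus files that already reference existing categories.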

3. Annotation at the Text Structure Level

One of the major challenges in building a TEI-conformant corpus is the encoding of structure in serial composite publications, e.g. texts in newspapers or magazines. Such composite electronic texts contain corresponding hierarchical structures of component elements: textual divisions and subdivisions. As these corpus texts are usually imported from different electronic sources, where the text structure is already defined by some kind of metadata, we face a variety of formats that must be converted to the defined TEI-conformant text structure. The solution to this problem requires the selection of a rather universal TEI element subset, capable of covering the different structural aspects of serial publications, as well as a set of

corresponding conversion tools. Figure 4 presents the text structure encoded in the TEI-conformant version of the CCLL. Such a structure is based on a nested set of elements, usually representing columns (rubrics), articles and paragraphs. It was tested on serial texts of different complexity and was found sufficient for defining their rather diverse structure.

Figure 4. Text structure annotation for the CCLL

While annotation of the text structure describes text segments above the word level (columns, articles, paragraphs and sentences), word-level description is left to the morphosyntactic annotation.


4. Morphosyntactic Annotation

Currently, the morphological analysis of the CCLL is carried out automatically by the morphological annotation tool Lemuoklis [4]. The tool lemmatises any given word and produces a detailed morphological analysis by identifying the word's part of speech and the appropriate grammatical information (e.g. for nouns it identifies whether the noun is proper or common, as well as its gender, number and case). However, the original tool does not resolve ambiguities. It has been established that 47 per cent of all Lithuanian word forms in the corpus are morphologically ambiguous, i.e. the same word form can receive several morphological tags. In order to solve the ambiguity problem, a 1m word morphologically annotated corpus, its structure resembling that of the CCLL, has been created at CCL-VMU [5], [6]. By applying a statistical method based on Hidden Markov models, the tool has achieved a high correctness rate of 94 per cent (99 per cent for lemmatisation) [7]. The statistical tool was trained on the previously mentioned 1m word morphologically annotated corpus. The Lithuanian tagger is freely accessible online7.

7 http://donelaitis.vdu.lt/main.php?id=4&nr=7_2

After the morphological analysis the tool produces a text with multi-layered metadata, which is well structured but not TEI compatible. Therefore, the task of converting the morphological annotation to TEI P5 format was set. As this format is

well suited for inflectionally rich languages, it is successfully used for some other such languages, e.g. Slovene [8]. Morphological annotation is executed as word-level mark-up, using context-disambiguated lemmas and morphosyntactic definitions (MSDs), e.g. for the word form vyriausybės. Each MSD is linked to a TEI feature-structure library (see Figure 5 for a fragment of this library), which describes its decomposition into morphological features.

<fs xml:id="dbmvk" xml:lang="lt" feats="#P1.1 #P2.2 #P10.1 #P11.1 #P12.2"/>
<f name="POS" xml:id="P1.1" xml:lang="lt"><symbol value="dktv."/></f>
<f name="Voice" xml:id="P2.2" xml:lang="lt"><symbol value="bend."/></f>
<f name="Gender" xml:id="P10.1" xml:lang="lt"><symbol value="mot.g."/></f>
<f name="Number" xml:id="P11.1" xml:lang="lt"><symbol value="vns."/></f>
<f name="Case" xml:id="P12.2" xml:lang="lt"><symbol value="klm."/></f>
…

Figure 5. Morphosyntactic specification defined as TEI feature structure

Morphological tags in the CCLL are formed as coded strings of letters, where each letter denotes a morphological category. Each part of speech gets a code of a different length, e.g. the noun vyriausybės (en. government) is given the five-letter code "dbmvk" (where d denotes noun, b – common, m – feminine, v – singular, k – genitive); the adjective prekybinių (en. commercial) gets the seven-letter code "btnnvdk" (where b – adjective, t – positive, n – positive degree, n – non-pronominal, v – masculine, d – plural, k – genitive). The longest, ten-letter codes are given to participle forms of the verb, e.g. the participle pirkę (en. having bought) is given "vdtnvknvdv" (v – verb, d – participle, t – positive, n – non-reflexive, v – active, k – past, n – non-pronominal, v – masculine, d – plural, v – nominative). The morphosyntactic specification used for the CCLL has been built in a form compatible with the MULTEXT-East multilingual dataset for language engineering research and development8, though it is not yet explicitly included in that dataset. Such a specification also allows the localisation of feature names and codes; e.g. for international projects it would be useful to have English feature names available. The symbol name defined for each feature in the feature structure can be used for display and human analysis purposes when presenting corpus query results on-line.
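The positional letter codes described above can be decoded mechanically. The sketch below is illustrative only: the position tables cover just the noun values mentioned in the text (plus the masculine gender letter from the adjective example), and the feature names follow this paper's examples rather than the full CCLL specification.

```python
# Sketch: decoding a CCLL-style positional tag code letter by letter.
# Only the values mentioned in the paper are listed; the real
# specification defines many more parts of speech and feature values.
NOUN_POSITIONS = [
    ("POS",    {"d": "noun"}),
    ("Type",   {"b": "common"}),
    ("Gender", {"m": "feminine", "v": "masculine"}),
    ("Number", {"v": "singular", "d": "plural"}),
    ("Case",   {"k": "genitive", "v": "nominative"}),
]

def decode(code, positions):
    """Map each letter of the code to the feature value at that position."""
    return {name: table.get(ch, "?") for ch, (name, table) in zip(code, positions)}

print(decode("dbmvk", NOUN_POSITIONS))
```

Note how the same letter means different things at different positions ("v" is masculine in the gender slot but singular in the number slot), which is why the decoding tables must be per-position.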

5. Supporting Tools

The CCLL is equipped with a set of software tools, falling into two main categories:
• tools for annotating and managing the CCLL;
• tools for CCLL query and analysis.

Figure 6 presents the structure of the CCLL tool set. Annotation and management tools are meant for managers and researchers, and are available in either stand-alone or on-line form. The on-line form is the preferred choice, and all new tools are developed

8 http://nl.ijs.si/ME/

this way. Query and analysis tools and services, meant for a broader user category, are available online.

Figure 6. Tools for CCLL maintenance and access

The development of new tools and services, as well as the updating of existing ones, is based on web service technology in order to satisfy interoperability and wider access needs.


6. Conclusion

The process of transforming the CCLL to a new standard has proved to be a complicated but necessary step in the development of the corpus. While this task is a rather difficult and time-consuming endeavor, it may be noted that the selection of an appropriate format from several candidate standards depends not only on the functionality of the standards, but also on how well they are documented. In this respect, TEI P5 stands out as a very well documented standard.

Further CCLL development plans include additional annotation levels, namely syntactic and semantic metadata, as well as mark-up of collocations, named entities and other textual elements necessary for various corpus-based natural language processing tasks. A preliminary investigation has shown that the TEI P5 encoding scheme includes the elements necessary for such annotation.

References
[1] R. Marcinkevičienė, A. Bielinskienė, V. Daudaravičius, E. Rimkutė, Corpora for Lithuanian Language Technologies, Proc. of the First Baltic Conference "Human Language Technologies – The Baltic Perspective", Riga, Latvia (2004), 21–24.
[2] CLARIN: STE (ed.), Standards for Text Encoding: a CLARIN Short Guide, 2009.
[3] P. Bański, A. Przepiórkowski, TEI P5 as a Text Encoding Standard for Multilevel Corpus Annotation, Prepr. of Digital Humanities 2010, London.
[4] V. Zinkevičius, Lemuoklis – morfologinei analizei (Lemmatizer for morphological analysis), Darbai ir Dienos 24 (2000), 246–273.
[5] E. Rimkutė, Morfologinio daugiareikšmiškumo ribojimas kompiuteriniame tekstyne (Morphological disambiguation in computerized corpora), Doctoral thesis, Kaunas: Vytautas Magnus University (2006).
[6] V. Zinkevičius, V. Daudaravičius, E. Rimkutė, The Morphologically Annotated Lithuanian Corpus, Proc. of the Second Baltic Conference on Human Language Technologies, Tallinn (2005), 365–370.
[7] E. Rimkutė, V. Daudaravičius, Morfologinis dabartinės lietuvių kalbos tekstyno anotavimas (Morphological annotation of the corpus of contemporary Lithuanian), Kalbų studijos 11 (2007), 30–35.
[8] T. Erjavec, S. Krek, The JOS Morphosyntactically Tagged Corpus of Slovene, Sixth International Conference on Language Resources and Evaluation, LREC'08, Paris (2008).


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-161


A Collection of Comparable Corpora for Under-resourced Languages
Inguna SKADIŅA a, Ahmet AKER b, Voula GIOULI c, Dan TUFIS d, Robert GAIZAUSKAS b, Madara MIERIŅA a, Nikos MASTROPAVLOS c
a Tilde, Latvia
b University of Sheffield, UK
c Institute for Language and Speech Processing, R.C. "Athena", Greece
d Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania

Abstract. This paper presents work on collecting comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. The objective of this work was to gather texts from the same domains and genres and with a similar level of comparability in order to use them as a starting point in defining criteria and metrics of comparability. These criteria and metrics will be applied to comparable texts to determine their suitability for use in Statistical Machine Translation, particularly where translation is performed from or into under-resourced languages for which substantial parallel corpora are unavailable. The size of the collected corpora is about 1 million words for each under-resourced language.
Keywords. Comparable corpora, under-resourced languages, comparability, metadata, crawling, statistical machine translation


Introduction

In recent decades data-driven approaches have significantly advanced the development of machine translation (MT). However, the applicability of current data-driven methods depends directly on the availability of very large quantities of parallel corpus data. For this reason the translation quality of current data-driven MT systems varies from quite good, for language pairs and domains for which large parallel corpora are available, to barely usable for languages with fewer resources or in narrow domains. The problem of availability of linguistic resources is especially relevant for under-resourced languages, including the languages of the three Baltic countries – Estonian, Latvian and Lithuanian.

One potential solution to the bottleneck of insufficient parallel corpora is to exploit comparable corpora to provide more data for MT systems. The concept of a comparable corpus is a relatively recent one in MT and NLP in general. It can be defined as a collection of similar documents that are collected according to a set of criteria, e.g. the same proportions of texts of the same genre in the same domain from the same period [1], in more than one language or variety of languages [2], that contain overlapping information [3][4]. Comparable corpora have several obvious advantages over parallel corpora: they are available on the Web in large quantities for many languages and domains, and many texts with similar content are produced every day (e.g. multilingual news feeds).

Recent experiments have demonstrated that a comparable corpus can compensate for the shortage of parallel corpora. Hewavitharana and Vogel [4] have shown that adding aligned parallel lexical data extracted from comparable corpora to the training data of a Statistical Machine Translation (SMT) system improves the system's performance with respect to untranslated word coverage. It has also been demonstrated that language pairs with little parallel data are likely to benefit the most from the exploitation of comparable corpora. Munteanu and Marcu [3] achieved performance improvements of more than 50% using comparable corpora of BBC news feeds for English, Arabic and Chinese over a baseline MT system trained only on existing available parallel data.

The FP7 project Accurat [5][6] aims to find, analyze and evaluate methods that exploit comparable corpora in order to compensate for the shortage of linguistic resources, and to significantly improve MT quality for under-resourced languages and narrow domains. This paper presents work on the creation in Accurat of bilingual comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Greek-Romanian, Croatian-English, Romanian-English, Romanian-German and Slovenian-English, where in each corpus at least one language is under-resourced. The objective was to gather texts which can be used as a starting point for defining criteria and metrics of comparability, i.e. determining what degree of comparability is "preferred", "suitable" or "minimally acceptable" for texts used for MT. The metrics will be used to define certain routes for exploiting comparable corpora in MT. We present an initial definition of comparability, the principles used for collecting textual data, a proposal for metadata encoding and the tools used for collection, and also describe useful sources and problems we experienced.


1. Principles of Collecting Comparable Corpora

Until now there has been no agreement on the degree of similarity that documents in comparable corpora should have, nor on the criteria for measuring parallelism and comparability. Objective measures for detecting how similar two corpora are in terms of their lexical content have been studied only recently [7][8]. Thus for our task we have introduced four comparability levels: parallel, strongly comparable, weakly comparable and non-comparable.

By parallel texts we understand true and accurate translations or approximate translations with minor language-specific variations. Typical samples of parallel texts by our definition are legal documents, software manuals, fiction translations, etc. By strongly comparable texts we understand closely related texts reporting the same event or describing the same subject. These texts could be heavily edited translations or independently created ones, such as texts coming from the same source under the same editorial control but written in different languages (e.g. news provided by the Baltic News Service in English, Latvian and Russian), or independently written texts concerning the same subject, e.g. Wikipedia articles linked via interwiki links or news items concerning the same specific event from different news agencies.

The third category is weakly comparable texts, which include texts in the same narrow subject domain and genre but describing different events, as well as texts within the same broader domain and genre but varying in subdomains and specific

genres. Finally, we can speak of non-comparable texts: pairs of texts drawn at random from a pair of very large collections of texts (e.g. the web) in the two languages.

Our goal was to collect 1 million running words for each language with the same distribution between domains and genres (see Table 1) and with similar proportions between comparability levels (10% parallel texts, 40% strongly comparable texts, 50% weakly comparable texts).

Table 1. Domain and genre distribution of Accurat comparable corpora

Domain               Genre           Coverage
International news   Newswires       20%
Sports               Newswires       10%
Admin                Legal           10%
Travel               Advice          10%
Software             Wikipedia       15%
Software             User manuals    15%
Medicine             For doctors     10%
Medicine             For patients    10%
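As a toy illustration of the kind of "objective measure" of lexical similarity cited above, one can compute word-set overlap between two documents. This is NOT the Accurat comparability metric (which is the subject of the project's ongoing work); it is only a minimal monolingual sketch of the idea that shared vocabulary signals comparability.

```python
# Toy illustration only (not the Accurat metric): Jaccard overlap of the
# word sets of two documents; higher overlap suggests a higher
# comparability level for the pair.
def jaccard(text_a, text_b):
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

news_a = "the minister announced new trade talks with the eu"
news_b = "trade talks with the eu were announced by the minister"
sports = "the match ended with a late penalty shootout"

print(jaccard(news_a, news_b))  # same event: high overlap
print(jaccard(news_a, sports))  # unrelated texts: low overlap
```

Real comparability metrics must of course work across languages and factor in domain, genre and time, which is one reason the corpus described here was collected with controlled domain/genre proportions in the first place.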

2. Metadata


For encoding both documents and the alignments between them we use A Comparable Corpus Encoding Schema (ACCES). It is an adaptation of the Corpus Encoding Standard (CES) structure and contains further metadata elements specific to the Accurat project, but potentially of use for any comparable corpus. Its structure is as shown below.

[The ACCES example here shows a document header whose "extendedsourcedesc" part records the genre (newswires), the domain (international news), the URL of the text, the character encoding (utf-8), the publication date (19/12/2007) and a note on how the raw text was extracted from the HTML source, followed by an "htmlsource" part holding the HTML source with entities encoded.]

The new tags are "extendedsourcedesc" and "htmlsource". The "extendedsourcedesc" tag encodes information about the genre, domain and source of the document, the encoding of the text, the date of publication, and any technique used to clean the original HTML document to obtain raw text from it. The "htmlsource" tag includes the original HTML source, i.e. the entire document content found on the web. This is included because its structure may supply information that can be used in deciding whether a pair of documents is parallel, strongly comparable or weakly comparable (of course, the HTML source file may be useful for other purposes too). When saving the content into the XML structure, we ensure that the XML remains well-formed, i.e. all HTML special characters are encoded so that the HTML can be placed inside the XML without violating the structure.
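The encoding step can be sketched minimally as follows. Only the "extendedsourcedesc" and "htmlsource" tag names come from the ACCES description in the text; the root element and the child field names are simplified placeholders, and the entity encoding is delegated to the standard XML serializer.

```python
# Sketch of an ACCES-style record. The "extendedsourcedesc" and
# "htmlsource" tag names follow the schema described in the text; the
# root element ("cesDoc") and child field names are placeholders.
import xml.etree.ElementTree as ET

def acces_record(genre, domain, url, encoding, date, html_source):
    doc = ET.Element("cesDoc")  # placeholder root element
    desc = ET.SubElement(doc, "extendedsourcedesc")
    for tag, val in [("genre", genre), ("domain", domain), ("url", url),
                     ("encoding", encoding), ("date", date)]:
        ET.SubElement(desc, tag).text = val
    # The serializer entity-encodes <, > and & in text nodes, so the raw
    # HTML can sit inside the XML without breaking well-formedness.
    ET.SubElement(doc, "htmlsource").text = html_source
    return doc

doc = acces_record("newswires", "international news",
                   "http://example.org/item", "utf-8", "19/12/2007",
                   "<html><body>News & more</body></html>")
print(ET.tostring(doc, encoding="unicode"))
```

Parsing the record back with a standard XML parser restores the original HTML string, so the stored page can always be re-processed later.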

It should be noted that ACCES follows the CES structure, i.e. the order of the elements occurring in the file is the same as in CES, and it also includes all mandatory elements from CES for representing a document. Thus it is possible to use a CES parser to parse the ACCES structure. For expressing the alignments between documents the following structure is used.

[The alignment example here shows a "linkgrp" element containing document-level "link" entries that pair the aligned files.]

This structure is again based on CES, with small extensions to meet Accurat-specific requirements. In CES, each alignment is expressed in the "linkgrp" tag. It contains the alignment type (document, paragraph, sentence, etc.) and the aligned text pairs (in the "link" tag). In the example shown above we have alignment at document (doc) level, where one XML document file is aligned with another. We have extended the "linkgrp" tag with two further attributes that help to express the alignment level: "alignmentlevel", with the possible values "parallel", "strongly comparable" and "weakly comparable", and "alignmentdecision", recording information about how the alignment level was determined.

3. Collection methods


3.1. Methodology adopted for collecting the ACCURAT Comparable Corpus

The methods employed by the partners for data collection depend heavily on the type of corpora, the degree of comparability and, of course, the availability of suitable tools. As one might expect, parallel texts were retrieved automatically by all partners from bilingual or multilingual web sources. However, the approaches taken for acquiring strongly and weakly comparable corpora are not uniform among partners and across all domains/genres. These corpora were to a great extent selected manually, except for the domain/genre Software/Wikipedia, which was collected automatically due to the predictable structure of Wikipedia and the inter-linking provided among languages.

The work reported hereafter aimed at researching methods for the automatic acquisition of comparable texts from web sources. The rationale was to build on already existing open-source tools suitable for other types of corpora, rather than attempting to build a new harvesting application. Depending on the approach taken, three general strategies are referred to in the literature: (a) monolingual crawling, (b) bilingual crawling, and (c) topic-specific (focused) monolingual harvesting. In monolingual crawling, documents are retrieved for each language separately. In bilingual crawling, filtering techniques are employed for harvesting parallel data from bilingual/multilingual websites. Finally, topic-specific (focused) monolingual crawling attempts to harvest texts belonging to pre-specified domains and narrow topics, and therefore to directly provide corpora that are by definition at least weakly comparable. The task at hand requires the combination of the

afore-mentioned techniques. Among the various candidate tools that have been considered, the following ones seemed the most promising: x BootCaT toolkit [9], a well-known suite of Perl scripts for bootstrapping specialized language corpora from the web. x Heritrix, an open-source, modular web crawler. Implemented in Java, it is an extremely extensible crawling tool providing many configuration settings for achieving best performance, yet it does not support focused crawling. x Combine [10], an open system web crawler-indexer, implemented in Perl. It is based on a combination of a general web crawler and an automated subject classifier. The classification is provided by a focus filter using a topic definition in the form of a list of in-topic terms. x Bitextor [11], a free/open-source application for harvesting translation memories from multilingual websites. Bitextor is based on two main assumptions: (a) parallel pages should be under the same domain, and (b) they should have similar html structure. The selected tools were adapted to cater for the acquisition of Greek-English and Romanian-Greek corpora. Parallel corpora in these language pairs were retrieved via Bitextor. BootCaT was used in order to select monolingual domain-specific corpora to initiate the acquisition of weakly comparable corpora. Seed words semi-automatically extracted from source language texts guided the acquisition of texts in the source language and in specific domains. These were consequently mapped onto their translational equivalents in the target languages in order to serve as seed terms for the selection of candidate weakly comparable texts in the target languages. Combine was also used to further supplement the weakly comparable part of corpora. 
Terms semi-automatically retrieved from the source-language texts were also coupled with lists of seed URLs manually identified as relevant to the specific domains, and Combine performed limited crawls on selected web sites (e.g. Reuters, BBC, Times Online). The highest-ranking web pages were selected from the result pool and added to the weakly comparable text collection. After several runs with the above-mentioned tools, manual validation was performed by trained annotators. Retrieved documents were grouped by topic, and annotators had to decide whether they were (a) accurately retrieved as pertaining to the specified domain/genre and (b) correctly assigned the "weakly comparable" attribute. As expected, BootCaT provided satisfying results in the domains "Sports" and "Travel", with the vast majority of retrieved texts being positively validated, yet it failed to identify the genre of documents pertaining to the domains "News", "Software" and "Medicine". Bitextor's performance depends heavily on how well-formed a web page is (HTML structure) as well as on the general structure of the web site. Testing on well-structured web sites (e.g. www.setimes.com) gave quite satisfying results in terms of precision and recall, while poorly formed web sites exposed the tool's main weakness in dealing with such environments.

3.2. Visualized crawling environment

Data collection from the web is rarely a well-defined job, and more often than not corpus linguistics practitioners write their own scripts to answer an immediate need; as soon as the problem is solved, the scripts are forgotten. We tried to give a more principled solution to reusing the small pieces of useful software and

prolonging the lifetime of such scripts by developing an environment that incorporates three components: a Flow Graphical Editor, which enables the user to easily create and manage workflows; a Script Editor, which assists the user in defining the processing units of the workflows; and a Windows service, which takes as input the chained scripts generated by the first two components and executes the entire process at a given interval. The environment is thus not a standalone crawler but a more general program that supports high scalability and the integration of modules.

The Flow Graphical Editor component allows the user to graphically organize the logic of the application around processing units and decision blocks. The user can alter the global application behavior by adding new blocks or by modifying the way the output is handled. The Script Editor enables the creation of the processing modules invoked by the active blocks; we use the tools provided by ICSharpCode (http://www.icsharpcode.net/) for syntax highlighting and code compilation. The Windows service provides the actual functionality for the built-up processing flow: it starts at a given interval, reads the flow diagram and starts the execution of the active blocks. The user can observe the execution progress at any time and can stop, pause or resume the process.

By means of this environment we created two main processing flows, which can be further connected into a larger one. The first is a monolingual processing chain incorporating tokenization, tagging and lemmatization. This ensemble of language tools, called TTL, is written in Perl, and each of its components is also available as a web service [13]. The second application is a web harvester for collecting parallel and strongly comparable corpora from seed web pages. We applied it to collect strongly comparable documents from the news section of the European Parliament website in 22 languages.
During May and June 2010, 195 short articles were harvested, not all of them available in all 22 languages. Because we wanted to preserve the structure of the multilingual strongly comparable corpus, for the few articles that were not translated into some languages empty content was created for the missing languages.
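The alignment-preserving trick just described (inserting an empty document wherever a translation is missing, so that the i-th document in every language file refers to the same article) might look like this; the language codes and the article content are illustrative only:

```python
def align_article(versions, languages):
    """Return one document per language, inserting an empty placeholder
    where a translation is missing, so that corresponding slots across
    language files refer to the same article."""
    return {lang: versions.get(lang, "") for lang in languages}

langs = ["en", "et", "lv", "lt", "ro"]
article = {"en": "Parliament adopts budget",
           "ro": "Parlamentul adoptă bugetul"}
aligned = align_article(article, langs)
# aligned["lv"] == "" marks the missing Latvian translation while
# keeping the article's slot in the Latvian file.
```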

4. Collected corpora

Using the different approaches described in Section 3, we collected comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Romanian-Greek, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. Almost every language pair corpus contains approximately one million words for the under-resourced part of the corpus (see Table 2). The Romanian-Greek corpus is approximately 130 000 words short of the target one million words, which can be explained by the difficulty of collecting appropriate comparable corpora for under-resourced language pairs. Although all the Accurat languages are under-resourced, the collection process revealed significant differences in the availability of parallel and strongly comparable texts. For example, for Balkan languages news can easily be collected from the SETimes portal, while for the languages of the Baltic countries no such resource is available. Also, texts in the domain "International News" or in Wikipedia show a significant disproportion in document size and content among languages.

Table 2. Collected corpora

Pair      |  Parallel          | Strongly comparable | Weakly comparable  |      Total
          |  Words        %    |   Words        %    |   Words       %    |      Words
ET-EN     |   101 884    9,48  |   548 764    51,06  |   424 022   39,46  |  1 074 670
LV-EN     |   122 581   11,82  |   389 127    37,51  |   525 681   50,67  |  1 037 389
LT-EN     |   553 747   46,17  |   261 841    21,83  |   383 819   32     |  1 199 407
EL-EN     |   191 843   13,33  |   294 554    20,47  |   952 534   66,2   |  1 438 931
RO-EL     |   282 213   32,62  |   267 897    30,96  |   315 108   36,42  |    865 218
HR-EN     |   418 752   39,51  |   100 000     9,44  |   541 085   51,05  |  1 059 837
RO-EN     |   186 682    6,94  |   459 458    17,07  | 2 045 631   76     |  2 691 771
RO-DE     |   117 281    8,52  |   449 942    32,67  |   809 929   58,81  |  1 377 152
SL-EN     |   462 514   40,17  |   322 243    27,98  |   366 759   31,85  |  1 151 516
All pairs | 2 018 745   20,49  | 2 993 826    26,01  | 5 823 483   53,5   | 11 895 891

Although the collection process was performed independently in five countries by different project partners, several common resources were identified:

- The SETimes news portal (http://www.setimes.com/) is a source of news about Southeastern Europe in ten languages: Albanian, Bosnian, Bulgarian, Croatian, English, Greek, Macedonian, Romanian, Serbian and Turkish. The portal is updated every day and is an excellent resource of parallel texts for the above-mentioned languages.
- The JRC-Acquis corpus [14] (http://wt.jrc.it/lt/Acquis/) contains selected EU legal texts in all EU official languages except Irish. It is a widely used source of parallel texts for the legal domain.
- The EMEA corpus [15] (http://urd.let.rug.nl/tiedeman/OPUS/EMEA.php) contains European Medicines Agency documents in 22 languages. The corpus has no texts in Croatian or Slovenian.
- Wikipedia (http://www.wikipedia.org/) is a well-known source of comparable texts in more than 270 languages. However, the size of Wikipedia differs from language to language; for the Accurat languages, Wikipedia contains the following numbers of articles: Croatian 82 952, Estonian 76 334, Greek 53 546, Latvian 28 483, Lithuanian 110 799, Slovenian 88 129, Romanian 146 418 (as of 05.07.2010). The level of comparability of Wikipedia articles also varies a lot.
- European Commission News (http://ec.europa.eu/news) is a good resource of strongly comparable texts for EU official languages, especially those for which no other parallel news texts are available. The articles cover different topics of interest in the EU, e.g. business, culture, science and technology.

5. Conclusions and future work

We collected comparable corpora for 9 language pairs: Estonian-English, Latvian-English, Lithuanian-English, Greek-English, Romanian-Greek, Croatian-English, Romanian-English, Romanian-German and Slovenian-English. Every corpus, except Romanian-Greek, consists of approximately one million words for each language. Taken together, the collected corpora comprise 11,8 million words for Croatian, Estonian, Greek, Latvian, Lithuanian, Romanian and Slovenian.

Currently the corpora are used for two tasks. First, they are being used to develop criteria and automated metrics for determining the degree of comparability of comparable corpora and the parallelism of individual documents. Second, they serve to evaluate the applicability of existing alignment methods to comparable corpora. The collected corpora are currently available to the Accurat consortium; more texts will be collected throughout the project lifetime, and publicly available comparable corpora will be released by the end of the project.

Acknowledgements

The research within the project Accurat leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement no 248347. Many thanks for collecting corpora data to colleagues in ACCURAT partner organizations: Serge Sharoff from the University of Leeds (UK), Gregor Thurmair from Linguatec (Germany), Marko Tadić from the University of Zagreb (Croatia) and Boštjan Špetič from Zemanta (Slovenia).

References

[1] A.M. McEnery, R.Z. Xiao, Parallel and comparable corpora: What are they up to? Incorporating Corpora: Translation and the Linguist. Translating Europe. Multilingual Matters (2007).
[2] EAGLES, Preliminary recommendations on corpus typology (1996), electronic resource: http://www.ilc.cnr.it/EAGLES96/corpustyp/corpustyp.html.
[3] D. Munteanu, D. Marcu, Improving Machine Translation Performance by Exploiting Non-Parallel Corpora. Computational Linguistics 31(4) (2005), 477-504.
[4] S. Hewavitharana, S. Vogel, Enhancing a Statistical Machine Translation System by using an Automatically Extracted Parallel Corpus from Comparable Sources. Proceedings of the Workshop on Comparable Corpora, LREC'08 (2008), 7-10.
[5] A. Eisele, J. Xu, Improving machine translation performance using comparable corpora. Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. Applications of Parallel and Comparable Corpora in Natural Language Engineering and the Humanities (2010), 35-39.
[6] I. Skadiņa, A. Vasiļjevs, R. Skadiņš, R. Gaizauskas, D. Tufiş, T. Gornostay, Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation. Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. Applications of Parallel and Comparable Corpora in Natural Language Engineering and the Humanities (2010), 6-14.
[7] A. Kilgarriff, Comparing Corpora. International Journal of Corpus Linguistics 6(1) (2001), 1-37.
[8] P. Rayson, R. Garside, Comparing corpora using frequency profiling. Proceedings of the Comparing Corpora Workshop at ACL'00 (2000), 1-6.
[9] M. Baroni, S. Bernardini, BootCaT: Bootstrapping corpora and terms from the web. Proceedings of the Language Resources and Evaluation Conference LREC'04 (2004).
[10] Ardö, Combine web crawler. Software package for general and focused Web-crawling (2005), electronic resource: http://combine.it.lth.se/.
[11] M. Gomis, M. Forcada, Combining Content-Based and URL-Based Heuristics to Harvest Aligned Bitexts from Multilingual Sites with Bitextor. PBML No 93 (2010), 77-86.
[12] J. Cho, H. Garcia-Molina, L. Page, Efficient crawling through URL ordering. Proceedings of the Seventh International Conference on World Wide Web (1998), 161-172.
[13] D. Tufiş, R. Ion, A. Ceauşu, D. Ştefănescu, RACAI's Linguistic Web Services. Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC'08) (2008).
[14] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş, D. Varga, The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC'06 (2006).
[15] J. Tiedemann, News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. Recent Advances in Natural Language Processing, vol. V (2009), 237-248.

Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-169


The Database of Estonian Word Families: a Language Technology Resource

Ülle VIKS 1, Silvi VARE and Heete SAHKAI
Institute of the Estonian Language, Tallinn

Abstract. The paper describes a polyfunctional database of Estonian word families which is based on extensive research and contains detailed word formation information about the Estonian vocabulary. It is an XML database integrated into a dictionary management system which offers various possibilities of structure-based editing and searching, data reuse etc. The design of the database is based on the word families method, which consists in the organization of words on the basis of common stem morphemes and word formation relations. Until now, the word families method has been used in the compilation of word formation dictionaries. Using the method in the compilation of a database is a novel solution which considerably broadens the access to and the possible uses of word formation data. The database provides material for researchers in computational and general linguistics, language learners and teachers, and lexicographers. The data can also be used in several language technology applications like search engines, text-to-speech synthesis etc. 2

Keywords. word formation, XML database, electronic lexicography, automatic morphology, dictionary management system

1 Corresponding Author: Senior Computational Linguist, Institute of the Estonian Language, Roosikrantsi 6, Tallinn, 10119 Estonia; E-mail: [email protected].
2 The study was supported by the National Programme for Estonian Language Technology and by the project SF0050023s09 "Modeling intermodular phenomena in Estonian".

Introduction

The paper describes a word formation resource created in the Institute of the Estonian Language: the database of Estonian word families (henceforth DEWF), which is a novel application of the word families method [1]. The database is integrated into a dictionary management system and equipped with a web interface. After completion (in 2012), it will be made available through the Web as a free public resource.

The database presents word formation information about Estonian words in an explicit form. The word formation analysis is based on thorough research and has been inserted manually following a fixed schema. The electronic database permits access to the data by a large number of criteria.

There is great need for this type of information. As an agglutinative-fusional language, Estonian is characterized by a rich word formation system in which different word formation kinds, types and means combine in complex ways, and stems are subject to different types of change. Given the complexity and difficult access of word formation data, many important and theoretically interesting phenomena of Estonian grammar have not been properly researched. Comprehensive word formation information is also needed in lexicography; at present, word formation information in
the dictionaries of Estonian is scarce and often controversial. Another area in which word formation information is badly needed is language education, since the complex structure of Estonian words creates comprehension and production difficulties for language learners. Word formation information has also been insufficiently used in language technology applications, which is a more widespread problem [2].

1. Word Families and Word Formation

1.1. Theoretical-methodological Background

A word family comprises all the words of a language related by a common stem morpheme (or its variant). It is headed by the simplex word (the head of the family) that represents the common stem. Inside the word family, words (family members) are arranged semasiologically, according to word formation [1]. The words are organized in an integrated hierarchical network on the basis of a stepwise immediate constituent analysis [3]. This way of representation makes it possible to visualize the internal structure of complex words by simultaneously showing their base word and their immediate constituents (1) 3.

(1) SPORT 'sport' > sport=lane 'sport=NOUN SUFFIX' "sportsman, athlete" > sport|las=lik 'sportsman=ADJECTIVAL SUFFIX' "sportsmanly" > sport|las|likk=us 'sportsmanly=NOUN SUFFIX' "sportsmanship"

As a whole, the word families of a language reflect the structure of the vocabulary of the language, as well as its entire word formation system, which can thus be grasped by the user [4, 5, 6, 7]. The word family phenomenon embodies the fact that most words of a language are related to a number of other words by semantic and formal motivation [8]. The word families method is a method for structuring the vocabulary of a language and for compiling word formation dictionaries [4, 1, 5, 6]. The compilation of a word family dictionary consists in segmenting the lexemes into immediate constituents and organizing them into word families, which requires a large amount of research and analysis. Word family dictionaries are a relatively rare type of specific scientific dictionary: the largest word formation dictionaries exist for German [9, 10] and Russian [7], and smaller dictionaries exist for some other languages. Using the word families method as the design principle of an electronic database is a novel application of the method and gives rise to a new type of linguistic resource.

1.2.
Word Formation in Estonian

Estonian is typologically an agglutinative-fusional language and its morphology is characterized by extensive stem variation and an abundance of formatives (inflectional

3 Symbols used in examples: # demarcates stems and inflections; = demarcates stems and affixes if derivation is the last step of formation; | demarcates stems and affixes if derivation is not the last step of formation; + demarcates constituents of compounds if compounding is the last step of formation; ¤ demarcates constituents of compounds if compounding is not the last step; , demarcates interfixes.


and derivational affixes) [11, 12, 13]. The majority of the Estonian vocabulary consists of derivations and compounds and can thus be organized into relatively large families. The two main kinds of Estonian word formation are derivation (including conversion) and compounding. Repeated application of the same word formation kind, or an alternation of different kinds, may give rise to lexemes with quite complex structure, displaying concatenations of several affixes or stems, or combinations of stems and affixes. The hierarchical structure of word families makes it possible to visualize these consecutive word formation steps in an explicit manner.

A frequent phenomenon that complicates Estonian word formation is stem change of different types, which may give rise to very different stem forms. Stem changes may lead to accidental formal similarities, which in turn give rise to interpretation difficulties. The word families representation clearly disambiguates these coinciding forms, as they occur in different parts of the hierarchy and have different internal structures.

A phenomenon that has given rise to theoretical discussion (cf. [14, 15, 16]) and to misinterpretations in Estonian lexicography and language teaching is the so-called synthetic compound. In dictionaries synthetic compounds are treated as compounds, whereas theoretically they are regarded in Estonian as a subtype of derivation involving simultaneously the attachment of a suffix and the compounding of stems [17, 18, 19], e.g. keel#t kast#ma 'tongue#PARTITIVE water#INFINITIVE' "to drink" > keele+kast=e 'tongue.GENITIVE+water=NOUN SUFFIX' "a drink". In DEWF they are explicitly represented as derivations based on syntactic phrases.
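The demarcation symbols defined in footnote 3 make the coded forms mechanically parseable. The following sketch splits a coded form into its morphs and boundary types; the DEWF's own parsing code is not published, so this is only an illustration using example words taken from the paper:

```python
import re

# Boundary symbols used in the DEWF coding (footnote 3).
BOUNDARIES = {
    "#": "stem/inflection",
    "=": "derivation (last step)",
    "|": "derivation (earlier step)",
    "+": "compounding (last step)",
    "¤": "compounding (earlier step)",
    ",": "interfix",
}

def split_coded(word):
    """Split a coded DEWF form into (morph, following-boundary) pairs;
    the final morph has no following boundary."""
    pattern = "([" + re.escape("".join(BOUNDARIES)) + "])"
    parts = re.split(pattern, word)
    morphs = parts[0::2]
    bounds = parts[1::2] + [None]
    return list(zip(morphs, bounds))

# sport|las|likk=us "sportsmanship" from example (1):
pieces = split_coded("sport|las|likk=us")
# → [('sport', '|'), ('las', '|'), ('likk', '='), ('us', None)]
```

The trailing '=' boundary immediately identifies -us as the suffix attached in the last derivation step.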

2. The Database of Estonian Word Families

2.1. The Data

The Estonian word families are compiled on the basis of the latest general dictionaries of Estonian, which cover the majority of the vocabulary of contemporary written Estonian. The database at present contains 8880 lexical entries, i.e. word families, with a total of about 186 000 items, and 940 simplex words with no attested derivations or compounds. The analysis of the data is based on the descriptive grammar of Estonian [11] and on subsequent research into Estonian word formation, e.g. [20-24, 19].

The basic unit of the macrostructure of the DEWF is the word family. A word family is introduced by its head (a simplex word) and constituted of the family members. The family members are organized hierarchically by step of formation, strictly following the motivational relations between words. On the first level, the head is followed by all the words based on it, the first-step formations, each of which is again followed by the eventual second-step formations, and so forth. For clarity of presentation, the first-step formations are divided into separate blocks according to their word formation kind: derivatives, compounds, verbal expressions. To illustrate, Figure 1 presents the word family aed 'garden, yard; fence' in a strongly abbreviated form. The head of the word family is followed by the first-step formations (in separate blocks), e.g. aed=nik "gardener", las#te+aed "kindergarten", aeda pida#ma "to garden". The second-step formations are e.g. maa|stiku+aed|nik "landscape architect", las#te¤aed=nik "kindergarten teacher", aia+pida=ja "gardener", and so forth. The maximal number of steps found in the database is seven.


aed SUBST.
  [P_TUL]
    aed=ik SUBST. (väike aed)
    aed=nik SUBST.
      maa|stiku+aed|nik SUBST. (tegeleb maastiku kujundamisega)
  [P_LS1]
    botaanika+aed SUBST.
    ema+aed SUBST. AIAND. (kust võetakse seemneid, pook- ja pistoksi)
    las#te+aed SUBST.
      las#te¤aed=nik SUBST. (lasteaiakasvataja)
      las#te¤aia+kasva|ta|ja SUBST.
      las#te¤aia+laps SUBST.
  [P_LS2]
    aed+maasikas SUBST.
      aed¤maasika+kee|d|is SUBST.
    aia+maja SUBST.
  [P_YH2]
    aeda pida#ma
      aia+pida=ja SUBST.
      aia+pida=mine SUBST.

Figure 1. An excerpt of the word family aed 4

On the level of the microstructure of the DEWF, the principal units of description are the head of the word family and all the family members. Inside each family member, special symbols (cf. footnote 3) are used to code its internal word formation structure. Separate fields represent grammatical and lexical information characteristic of the family members and the head, e.g. part of speech (provided for all words), definition, subject or usage label, homonym number, context etc.

2.2. Technical Solutions

The word families database is integrated into the dictionary management system EELex, a web-based toolset for dictionary writing and management (a lexicographer's workbench) developed in the Institute of the Estonian Language [25]. At present, EELex contains the databases of about 20 dictionaries of various types. The databases stored in EELex are universal reusable language resources encoded in a standard XML format, which makes it possible to exchange data both internally and externally to the system. Each dictionary has a specially designed XML schema; in the case of the DEWF, the schema follows the hierarchical structure of the word families. The schema serves as the basis for editing, searching and the layout design.

Editing. The EELex software offers various structure-based editing functions. The data is displayed simultaneously in two formats: the editing window is divided into the editing pane and the layout pane, which are mutually connected by click (Figure 2). Data can be edited in the editing pane both in table form and in the XML code.

4 Symbols and conventions: [P_TUL]: derivatives; [P_LS1]: compounds by the right constituent; [P_LS2]: compounds by the left constituent; [P_YH2]: verbal expressions by the left constituent; PART-OF-SPEECH; SUBJECT LABEL; definition.


Figure 2. The editing window in EELex: on the left, the editing pane in the XML format, on the right, the layout view pane

For the hierarchical DEWF, important editing functions are the adding, deleting and moving of whole structural groups (blocks and family members). As the same word may occur in several word families (e.g. a compound containing two or more stems), a useful function is bulk correction, which makes it possible to apply the same change simultaneously in all the entries matching defined criteria.

Query. The EELex software supports structure-based queries over every labelled group, element and attribute. Words can be searched, e.g., by structural elements of a word (different affixes, inflections, constituents of compounds, coding symbols, etc.) as well as by additional information (part of speech, definition, subject label, word formation type, etc.). The search system supports regular expressions, logical operators and symbol classes. The search results can be sorted in different ways: each column can be sorted in increasing, decreasing and reverse order (i.e. by the final letters of words). The entries returned by a search can be exported to an MS Word file in layout format.

Web Interface. DEWF, like all the resources completed in EELex, will be made available through the Web as a free public resource (http://portaal.eki.ee/). The public versions of the dictionaries are primarily addressed to ordinary users, but the more specific needs of researchers, students, lexicographers and teachers are taken into account as well. More specific material can be searched using the structure-based query, which permits more precise search criteria. Another function currently under development is the complex query, which can combine the values of several attributes in the same query.
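The paper does not reproduce EELex's actual XML schema, so the following sketch only illustrates the kind of structure-based query described above, run against an invented fragment; the element and attribute names (family, member, form, pos) are hypothetical:

```python
import xml.etree.ElementTree as ET

# Invented fragment loosely mimicking a word-family entry.
entry = ET.fromstring("""
<family head="aed">
  <member form="aed=nik" pos="SUBST"/>
  <member form="aed=ik" pos="SUBST"/>
  <member form="aia+maja" pos="SUBST"/>
</family>
""")

def members_with_suffix(root, suffix):
    """Structure-based search: members whose last derivational step
    attaches the given suffix (the '=' symbol marks that step)."""
    return [m.get("form") for m in root.iter("member")
            if m.get("form", "").endswith("=" + suffix)]

hits = members_with_suffix(entry, "nik")
# → ['aed=nik'], cf. the Web query for suffix -nik
```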


Figure 3. The Web query: suffix -nik

A special solution was needed for the display of the search results, as DEWF entries are often extremely large and the queried item is thus difficult to find, even with the aid of text colouring or the structure menus that have been used for large dictionaries [26]. Our solution is the following: in the initial search result, a minimum of information is displayed: the head of the family and the family member matching the query criteria, together with the item immediately preceding it in the hierarchy. The remaining parts of the entry can be visualized by clicking on the "+" icons (Figure 3).

3. The Applications of the Database

DEWF has a whole range of possible applications. First of all, DEWF considerably broadens the possibilities for the study of Estonian word formation and related areas like lexical semantics. The process of compiling the database has already given rise to studies of several problematic and less researched phenomena like the back-formation of verbs [22], conversion [23, 24], and reanalysis [19].

Secondly, language learners can use the DEWF as a tool for learning Estonian word formation and vocabulary. The export function of EELex makes it possible to compile exercises and other teaching material, and to generate various types of learner's dictionaries of word formation. DEWF can also be used in the creation of interactive electronic systems for language learning.

Thirdly, DEWF will be used in language technology to create an independent word formation module as part of the rule-based morphology of Estonian, which already covers fully regular word formation [27]. DEWF provides the necessary


data for writing the word formation rules and in part makes it possible to generate the rules automatically. In addition to morphology, the word formation module is needed in text-to-speech synthesis [28], since the pronunciation of a word may depend on its word formation structure. Another possible area of application of the word formation module is information retrieval, e.g. in search engines: if the queried word is not found, another word from the same word family may provide useful information.

And finally, DEWF has already been used in lexicography, e.g. in the compilation of the headword lists of dictionaries. Thanks to the data reuse and data export and import functions of EELex, the data of DEWF will be used in the other dictionaries of the system: it will provide the word formation segmentation of the complex headwords of new dictionaries, and the lists of selected derivatives and compounds to be included in the entries. The first application of this type will be the learner's dictionary of the basic vocabulary of Estonian (ca. 4000 entries), which will contain all the relevant lexicographic information: definitions, inflectional and derivational morphology, syntax, semantic classes etc. DEWF is also a good starting point for the creation of electronic lexicon and grammar systems like Word Manager, DeKo [2] or the canoonet German Dictionaries and Grammar (http://www.canoo.net).
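The information-retrieval use case mentioned above (falling back to other members of the same word family when a query word retrieves nothing) could work roughly as follows. The family data and index vocabulary are invented for illustration, loosely based on the aed family of Figure 1:

```python
# Toy family index: surface form -> family head (hypothetical data).
FAMILY_OF = {
    "aed": "aed",
    "aednik": "aed",
    "aiapidaja": "aed",
    "lasteaed": "aed",
}

# Invert it: head -> all known members.
MEMBERS = {}
for form, head in FAMILY_OF.items():
    MEMBERS.setdefault(head, set()).add(form)

def expand_query(word, known_vocabulary):
    """If the word itself is unknown to the index, fall back to the
    other members of its word family as query terms."""
    if word in known_vocabulary:
        return [word]
    head = FAMILY_OF.get(word)
    if head is None:
        return []
    return sorted(m for m in MEMBERS[head]
                  if m != word and m in known_vocabulary)

docs = {"aednik", "lasteaed"}          # hypothetical index vocabulary
fallback = expand_query("aiapidaja", docs)
# → ['aednik', 'lasteaed']
```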

4. Conclusion

DEWF is in many ways a novel resource. First of all, large lexicographic word formation resources exist so far only for a small number of languages, since they require a large amount of research and contain relatively specific information, which is nevertheless indispensable in the research, acquisition and language technology applications of languages with rich and complex morphology. Secondly, DEWF is a novel application of the word families method, so far used only in the compilation of word formation dictionaries: using the method as the design principle of an electronic database considerably broadens access to word formation data. Thirdly, DEWF is a new type of application of a dictionary management system. As a polyfunctional resource, DEWF will have many applications in language technology (automatic morphology, text-to-speech synthesis, search engines etc.), lexicography (word formation data in general dictionaries), language education (learner's dictionaries of word formation, teaching materials etc.) and research.

Ü. Viks et al. / The Database of Estonian Word Families: A Language Technology Resource

References

[1] G. Augst, New trends in the research on word-family-dictionaries. In: Studia Anglica Posnaniensis XXV–XXVII (1991–1993), 183–197.
[2] P. ten Hacken, A. Lüdeling, Word Formation in Computational Linguistics. In: Proceedings of TALN, vol. 2, Nancy, 2002, 61–87.
[3] W. Fleischer, I. Barz, Wortbildung der deutschen Gegenwartssprache. 2. durchgesehene und ergänzte Auflage. Tübingen: Max Niemeyer Verlag, 1995.
[4] G. Augst, Das Wortfamilienwörterbuch. In: Hausmann, F. J., Reichmann, O., Wiegand, H. E. (eds.), Wörterbücher. Ein internationales Handbuch zur Lexicographie. Zweiter Teilband. Berlin, New York: Walter de Gruyter, 1990, 1145–1152.
[5] F. Hundsnurscher, Gliederungsaspekte des Wortschatzes. In: Hoinkes, U., Dietrich, W. (eds.), Kaleidoskop der Lexicalischen Semantik. Tübingen: Narr, 1997, 185–191.
[6] F. Hundsnurscher, Das Wortfamilienproblem in der Forschungsdiskussion. In: Cruse, D. A. (ed.), Lexikologie. Ein internationales Handbuch zur Natur und Struktur von Wörtern und Wortschätzen. 1. Handband. Berlin, New York: Walter de Gruyter, 2002, 676–679.
[7] A. N. Tikhonov, Slovoobrazovatel'nyj slovar' russkogo jazyka I–II. Moskva: Russkii jazyk, 1985.
[8] G. Augst, Typen von Wortfamilien. In: Cruse, D. A. (ed.), Lexikologie. Ein internationales Handbuch zur Natur und Struktur von Wörtern und Wortschätzen. 1. Handband. Berlin, New York: Walter de Gruyter, 2002, 681–688.
[9] G. Augst, Wortfamilienwörterbuch der deutschen Gegenwartssprache. In Zusammenarbeit mit Karin Müller, Heidemarie Lagner und Anja Reichmann. Tübingen: Max Niemeyer Verlag, 1998.
[10] J. Splett, Deutsches Wortfamilienwörterbuch. Analyse der Wortfamilienstrukturen der deutschen Gegenwartssprache, zugleich Grundlegung einer zukünftigen Strukturgeschichte des deutschen Wortschatzes. Berlin, New York: de Gruyter, 2009.
[11] M. Erelt, R. Kasik, H. Metslang, H. Rajandi, K. Ross, H. Saari, K. Tael, S. Vare, Eesti keele grammatika I. Morfoloogia. Sõnamoodustus. Tallinn: Eesti Teaduste Akadeemia Eesti Keele Instituut, 1995. [The grammar of Estonian I. Morphology. Word formation]
[12] Ü. Viks, A Concise Morphological Dictionary of Estonian I–II. Tallinn: Estonian Academy of Sciences, 1992.
[13] Ü. Viks, Eesti keele avatud morfoloogiamudel. In: Hennoste, T. (ed.), Arvutuslingvistikalt inimesele. Tartu Ülikooli üldkeeleteaduse õppetooli toimetised 1. Tartu, 2000, 9–36. [The open model of Estonian morphology]
[14] G. Booij, The Grammar of Words. An Introduction to Linguistic Morphology. Second edition. New York: Oxford University Press, 2007.
[15] A. Spencer, Word-formation and syntax. In: Štekauer, P., Lieber, R. (eds.), Handbook of word-formation. Netherlands: Springer, 2005, 73–97.
[16] R. Lieber, English word-formation processes. In: Štekauer, P., Lieber, R. (eds.), Handbook of word-formation. Netherlands: Springer, 2005, 375–427.
[17] R. Kull, Liitnimisõnade kujunemine eesti kirjakeeles. (Dissertation at the Institute of the Estonian Language and Literature). Tallinn, 1967 (Manuscript at the Institute of the Estonian Language). [The development of nominal compounds in Estonian]
[18] H. Saari, Sünkroonia, diakroonia ja kolm liiki uut keelevara. In: Eesti keele grammatika küsimusi. Keel ja struktuur X. Tartu: Tartu Riiklik Ülikool, 1978, 42–65. [Synchrony, diachrony and three new types of linguistic resource]
[19] S. Vare, Potentsiaalsetest sõnadest leksika ja grammatika vaatenurgast. In: Keel ja Kirjandus 7 (2008), 531–552. [Potential words from lexical and grammatical aspects]
[20] R. Kasik, Eesti keele sõnatuletus. Teine, täiendatud ja parandatud trükk. Tartu: Tartu Ülikooli Kirjastus, 2004. [Estonian word formation]
[21] K. Kerge, Vormimoodustus, sõnamoodustus ja leksikon. Oleviku kesksõna võrdluse all. Tallinn: TPÜ Kirjastus, 1998. [Inflection, word formation and the lexicon. The present participle in comparison]
[22] S. Vare, Back-Formation of Verbs in Estonian. In: Metslang, H., Rannut, M. (eds.), Languages in development. Lincom Europa. Linguistic Edition 41 (2003), 123–132.
[23] S. Vare, Põgusalt ühest leksika ja süntaksi piirinähtusest. In: Keel ja Kirjandus 12 (2004), 915–922. [A note on a phenomenon between lexicon and syntax]
[24] S. Vare, Eesti keele verbimoodustus: desubstantiivne konversioon. In: Emakeele Seltsi aastaraamat 50 (2004). Tallinn, 39–67. [Verb formation in Estonian: denominal conversion]
[25] M. Langemets, A. Loopmann, Ü. Viks, Dictionary Management System for Bilingual Dictionaries. In: Granger, S., Paquot, M. (eds.), eLEX 2009. eLexicography in the 21st century: New challenges, new applications. (Louvain-la-Neuve, 22–24 October 2009), 135–139.
[26] R. Lew, P. Tokarek, Entry Menus in Bilingual Electronic Dictionaries. In: Granger, S., Paquot, M. (eds.), eLEX 2009. eLexicography in the 21st century: New challenges, new applications. (Louvain-la-Neuve, 22–24 October 2009), 145–146.
[27] Ü. Viks, A morphological analyzer for the Estonian language: the possibilities and impossibilities of automatic analysis. In: Viks, Ü. (ed.), Automatic Morphology of Estonian 1. Tallinn: Estonian Academy of Sciences, 1994, 7–28.
[28] M. Mihkla, L. Piits, T. Nurk, I. Kiissel, Development of a Unit Selection TTS System for Estonian. In: Čermák, F., Marcinkevičienė, R., Rimkutė, E., Zabarskaitė, J. (eds.), Proceedings of the Third Baltic Conference on Human Language Technologies. Kaunas, Lithuania, October 4–5, 2007. Vilnius, 2008, 181–187.

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited

Human Language Technologies – The Baltic Perspective, I. Skadiņa and A. Vasiļjevs (Eds.), IOS Press, 2010. © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-177


Digitization of Historical Texts at the National Library of Latvia

A. Zogla a and J. Skilters b
a Project manager at the National Library of Latvia
b Assoc. Prof. at the University of Latvia

Abstract. The National Library of Latvia has just started a mass-digitization of books and periodicals. A total of about 3.5 million pages will be digitized, segmented, OCR-ed and made available on-line. The large scope of materials by language, format, orthography, logical structure and time of publication causes many technical, syntactic and semantic problems that will need to be solved when creating a historical text portal.

Keywords. digital libraries, digitization, NLP


Introduction

The National Library of Latvia (NLL), in collaboration with a few other libraries, began digitizing its collection of historical newspapers in 2000 with the project "Mantojums – 1" (hereafter "Heritage – 1") [1]. The result of this project was a web-site containing PDF documents for each issue, ordered by place of publication and date. Because no OCR was done during digitization, no keyword search is available on the web-site. However, "Heritage – 1" did demonstrate the difference between analogue and digital environments by producing complete virtual editions of periodicals. For example, no single library holds a full collection of the newspaper "Kurzemes Vārds", but thanks to combining the collections of several libraries, a full virtual collection of "Kurzemes Vārds" is available on-line.

In 2007 NLL began working on the project "Periodika", which improved on "Heritage – 1" by incorporating OCR and segmentation into the digitization process, allowing users to search the full text of newspapers for particular keywords. The result of this project was a web-site containing about 350 000 pages of newspapers, which corresponds to about 45 000 separate issues [2]. The web-site contains newspapers from the years 1895 to 1957; however, the major part of the newspapers included in "Periodika" falls into the period 1920–1938. On the one hand, this is because more recent newspapers are still protected by copyright. On the other hand, older newspapers use old gothic fonts and obsolete orthography, which could not be processed at a sufficient level at the time of the project. Figure 1 shows search results for the keyword Oslo in "Periodika".



A. Zogla and J. Skilters / Digitization of Historical Texts at the National Library of Latvia


Figure 1 Search result page in "Periodika"

Finally, in 2009 NLL started a mass-digitization project which aims to scan, segment and publish approximately 2.1 million pages of periodicals (newspapers, magazines, scientific journals, etc.) and 1.4 million pages of books. In total, about 700 different titles of periodicals and about 6–7 thousand books will be digitized. The project includes issues published from the 18th century up to the year 2008. The diversity of languages covered in this project is also greater than in any project done by NLL so far: there will be issues in Latvian, Russian, German, French and English. Some books and periodicals are even written in one of Latvia's dialects – Latgalian.

1. Mass-digitization

Libraries have been digitizing their collections since the 1990s. One of the first libraries to experiment with digitizing its collections was the Library of Congress, which started a digitization project in 1990 that went on to become "American Memory" [3]. Until very recently, libraries have favoured a so-called cherry-picking method of digitization: picking material on a certain topic and creating a small, individual digital collection on that topic. Such collections often contain just a few hundred objects or even fewer. NLL did a cherry-picking digitization project in 2006 by creating a digital collection on one of Latvia's composers, Jāzeps Vītols.

Digitization trends have changed in recent years, and most libraries today are involved in mass-digitization of their collections. Almost all of the national libraries around the world are involved in mass-digitization of their historic newspaper collections. In 2008 some of Europe's leading national libraries joined forces with major digitization software companies and several research institutes to form IMPACT




– a project to create guidelines and software for the mass-digitization of libraries' text collections [4].

The National Library of Latvia is something of an exception among national libraries, because it has chosen to digitize books as well, while most national libraries digitize only newspapers. Although in many respects books are much easier to digitize than newspapers, they are also more challenging to make available on-line due to copyright limitations.

1.1. The process of mass-digitization

Mass-digitization consists of three major steps:

1. Scanning. Either original material or microfilmed copies can be scanned to produce high-quality archival images.
2. Segmenting & OCR. Individual archival images need to be "virtually bound" to form PDF files of individual newspaper issues or books. During this process, different zones of material must be identified (titles, images, captions, authors, dates, tables, etc.) and the text must be OCRed. Segmenting is by far the most complicated step in mass-digitization.
3. Publishing. All the material obtained during segmentation must be published on-line, indexed and made searchable. Libraries create portals of different levels of complexity and functionality to achieve this.

Each step usually ends with quality assurance (QA) to ensure that the best possible material is given as input to the next step. QA is done both by the scanning/segmenting companies and by NLL.

Some libraries merge Scanning and Segmenting & OCR into a single step – the National Library of the Netherlands being one of the most typical examples [5], [6]. NLL, however, does mass-digitization in three separate steps. Scanning and Segmenting & OCR are almost always out-sourced to specialized companies due to the number of operators required to perform these tasks. For the mass-digitization project done by NLL, Scanning requires up to 10 full-time operators, but Segmenting & OCR might require up to 70–80 full-time operators. The text portal, however, can be created either by the libraries themselves or out-sourced to IT companies. NLL has out-sourced all three mass-digitization steps.

1.2. Mass-digitization at the National Library of Latvia

NLL began scanning its newspapers and books in February 2010. Unlike some other libraries that scan microfilmed copies of their newspapers, NLL scans only the original books and periodicals. This ensures the best possible quality of scans, as microfilms often turn out to be in poor shape. Microfilms also seem to contain a lot of duplicates – the same page microfilmed twice in a row. Duplicates must be spotted during QA, otherwise they will end up in the PDFs generated during segmentation. Every two weeks, two packages are sent from NLL to a scanning company:

• Periodicals: ~46 000 pages;
• Books: ~50 000 pages.




These are scanned within two weeks and returned to NLL as 400 dpi JPEG2000 images. NLL performs QA of these images and has the right to return any book to the scanning company if it is badly scanned. Out of about 1000 books scanned so far, only 2 have been returned for re-scanning. In general, the scanning seems to be performed at a very high quality compared to some other national libraries. For example, the National Library of Australia reports an impressive 10% of erroneous pages in the scanning of their microfilms, while for NLL the number of scanning errors is certainly less than 1%.

NLL has just begun Segmentation & OCR of the first packages. The first results are very promising. Although previous research suggests that only 80% of correctly OCRed words can be expected when working with historical newspapers [7], the first results produced by NLL's segmentation partner show a much higher rate of correctly OCRed words. Several newspaper articles consisting of more than 1000 characters did not contain a single OCR error after OCRing; other articles of the same size had fewer than 5 misspelled characters. It must be noted, though, that these results were produced by processing a high-quality image containing modern Latvian orthography. OCRing Latvian fracture (or Latvian gothic, as it is also known) texts will almost certainly yield more errors.
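The duplicate spotting mentioned in Section 1.2 can be illustrated with a toy fingerprinting scheme. This is purely a sketch (the paper does not describe NLL's actual QA tooling): a byte-level checksum would catch only byte-identical files, whereas rescanned microfilm frames differ slightly, so the sketch compares simple average-hash fingerprints of consecutive page images with a small tolerance:

```python
# Illustrative duplicate detection for consecutive page scans.
# A page is represented here as a flat list of grayscale pixel values;
# a real pipeline would first downscale each scan (e.g. to 8x8 pixels).

def average_hash(pixels):
    """Binary fingerprint: 1 where a pixel is brighter than the page mean."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming(a, b):
    """Number of differing fingerprint bits."""
    return sum(x != y for x, y in zip(a, b))

def flag_duplicates(pages, tolerance=2):
    """Return indices of pages nearly identical to their predecessor."""
    hashes = [average_hash(p) for p in pages]
    return [i for i in range(1, len(hashes))
            if hamming(hashes[i - 1], hashes[i]) <= tolerance]

# Toy data: page 1 is a slightly noisy rescan of page 0.
page0 = [10, 200, 30, 180, 20, 210, 40, 190]
page1 = [12, 198, 33, 181, 22, 207, 41, 188]   # near-duplicate of page0
page2 = [200, 10, 180, 30, 210, 20, 190, 40]   # a genuinely different page
print(flag_duplicates([page0, page1, page2]))  # → [1]
```

Flagged indices would then be queued for a human QA check rather than dropped automatically.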

2. Improvements in digitizing texts at NLL

NLL has created several solutions to improve the quality of digitization of its historic texts. Some further improvements will be made in the near future.


2.1. OCR engine for Latvian fracture

One of the most important steps of the digitization process is the OCRing of images. The most popular OCR engine used by libraries today is ABBYY FineReader. As ABBYY FineReader includes OCR capabilities for the Latvian language, it seemed to be the right choice for NLL as well. ABBYY FineReader even has a special OCR engine for gothic fonts, which was also important for NLL, because almost all of the texts published in Latvia before the 1920s are written in gothic fonts. Unfortunately, ABBYY FineReader supported only German gothic fonts, and it turned out that Latvian gothic fonts are quite different from their German counterparts. For example, the Latvian gothic alphabet includes special glyphs to handle the diacritics of some Latvian letters. Sometimes the same letter could be represented in more than one way. Figure 2 contains three different ways to represent the letter "ā", appearing in just a single sentence of a newspaper dated February 14, 1852.

Figure 2 Different representations of letter "ā" in Latvian gothic




In some cases, glyphs that in German gothic represented a particular letter were redefined in Latvian gothic. The glyph in Figure 3 represents the letter "z" in German gothic, but in Latvian gothic, for some reason, it stands for the letter "c".

Figure 3 Letter "c" in Latvian gothic

In some other cases, typesetters came up with original solutions to represent letters they had no glyphs for. The letter "u" with a smaller "e" on top of it in Figure 4 stands for "u + umlaut", i.e. "ü".

Figure 4 One possible representation of letter "u+umlaut"


Because the existing ABBYY FineReader engines were not capable of handling these and other peculiarities of Latvian fracture, NLL worked closely with ABBYY to create a special OCR engine for Latvian gothic fonts. The engine was created, delivered to NLL, and tested by the Mathematics and Informatics Institute on several pages of gothic text. Tests showed that the new engine works considerably better than any previous ABBYY FineReader version. It still sometimes has problems with correctly OCRing the letters "f", "s" and "z", but it must be noted that the glyphs for these letters appear very similar even to the human eye. The true test for the new ABBYY FineReader engine is still to come, when NLL starts OCRing gothic texts at an almost industrial scale: as much as 2 million pages of Latvian gothic might be OCRed by June 2012.

2.2. Morphological analysis tool for Latvian fracture

Even with a good OCR engine for Latvian fracture, the text obtained would not be easily read or used by a modern reader. The orthography of Latvian has changed considerably over time: almost every word would be considered orthographically incorrect if represented letter-by-letter in modern fonts to a user today. On top of that, old Latvian has words that are no longer used and in many cases are hardly even recognized today. NLL aims to make old Latvian texts usable not only to a small group of historians and researchers, but to anyone interested. Ideally, a user would search for a keyword using modern words and modern orthography and find articles that contain that word even if it is represented in old orthography or in an out-dated form. Furthermore, when working with a Latvian fracture text, a user should be able to inquire about the meaning of an obsolete word, and the system should be able to come up with the modern form of that word.
To achieve this goal, NLL worked closely with the Artificial Intelligence Lab at the Mathematics and Informatics Institute (AIL MII) to create a morphological analysis tool for Latvian fracture. The goal of the tool was to create an algorithm that, for a given




Latvian fracture text with possible OCR errors, would return a text written in modern Latvian. It turned out that this goal is practically unreachable: often just a single OCR error corrupted a word enough to make it impossible to understand what the original word was. The problem was made even harder by the unavailability of a good old-Latvian dictionary. Fortunately, AIL MII had digitised a historic Latvian dictionary dating from around the 1920s [8] and containing many old word-forms, and this increased the number of words that were recognised in Latvian fracture texts. The tool developed by AIL MII proved to be very useful. Sometimes out-dated words in some non-standard case with two to three OCR mistakes were recognised and their modern counterparts found. On the other hand, there were often cases when the tool could not choose among several modern words and presented 2–4 words as possible "originals".
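The kind of error-tolerant lookup such a tool needs can be sketched with fuzzy matching against a historical lexicon. The entries below are illustrative stand-ins, not data from the actual AIL MII tool (they assume the German-influenced old Latvian orthography, e.g. "w" for "v" and "h" marking long vowels):

```python
# Sketch: map a possibly OCR-corrupted old-orthography token to modern
# Latvian candidates via fuzzy matching against a historical lexicon.
import difflib

# old-orthography form -> modern form (hypothetical toy lexicon)
OLD_TO_MODERN = {
    "wahrds": "vārds",   # 'word'
    "behrns": "bērns",   # 'child'
    "mahte":  "māte",    # 'mother'
}

def modernize(token, cutoff=0.6):
    """Return candidate modern forms for a (possibly corrupted) token."""
    hits = difflib.get_close_matches(token, OLD_TO_MODERN, n=3, cutoff=cutoff)
    return [OLD_TO_MODERN[h] for h in hits]

print(modernize("wahrds"))  # exact old form           → ['vārds']
print(modernize("wabrds"))  # one OCR error (h → b)    → ['vārds']
```

As in the real tool, ambiguity is unavoidable: lowering the cutoff increases recall but can return several modern candidates for one token.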


2.3. Crowd-sourcing OCR correction

There is now general agreement that there is no such thing as a 100%-correct OCR engine; even a human reader can't be 100% certain about texts of poor quality. The optimal quality of an OCR engine today is considered to be "80% of words correctly OCRed". This is in fact a very high quality and probably requires about 95% per-character accuracy. ABBYY has already expressed doubts about developing their engines to increase OCR quality substantially further; instead, they aim to develop OCR engines for languages still not covered very well (such as Arabic).

To decrease OCR errors as much as possible, libraries have turned to the general public. The National Library of Australia (further – NLA) can be considered a pioneer of crowd-sourcing OCR correction on a large scale [9]. NLA has developed its historic newspaper portal with a feature allowing every registered user to correct automatic OCR errors. The involvement of users has worked out so well that the corrected lines of text in NLA's newspaper portal are already measured in millions.

NLL will also include OCR correction in its portal. Users will be presented with an image of a single line from a newspaper issue or a book, together with the text obtained during automatic OCR. Users will be able to correct any OCR mistakes and publish their version of the text back to the system, where it will be indexed and made available to other users.

User involvement should not be limited to correcting OCR errors only. The system can also provide functionality for adding comments to particular parts of issues. A user could, for example, mark a person's name in a newspaper article and add a mini-biography of that person. Finally, users can add tags to newspaper articles, so that articles on a certain topic can be grouped and used by the users themselves or by others.
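The rule of thumb quoted above (roughly 95% per-character accuracy for 80% word accuracy) can be sanity-checked under a simplifying assumption that character errors are independent:

```python
# Back-of-envelope check: a word of n characters is fully correct with
# probability p_char ** n, assuming independent character errors.
def word_accuracy(p_char, avg_word_len):
    return p_char ** avg_word_len

# With ~5-character words on average:
print(round(word_accuracy(0.95, 5), 3))  # → 0.774, i.e. close to 80%
print(round(word_accuracy(0.99, 5), 3))  # → 0.951
```

So 95% character accuracy yields roughly 77–80% fully correct 5-letter words, matching the figure in the text; pushing word accuracy to 95% would require about 99% character accuracy.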

3. Future perspectives

Mass-digitization at NLL will generate a large, machine-readable text resource. Although it might contain OCR errors, it will still be of sufficient quality for further research and development. We mention here two possible developments based on the digitized texts.




3.1. Latvian language corpus

The result of mass-digitization by NLL can be considered a major source for creating a Latvian language corpus. One of the most positive aspects of the materials digitized by NLL is that they cover both a wide period of time (18th century to the year 2008) and texts of varied style: NLL will digitize fiction as well as scientific books, and newspapers written both in a formal style and in everyday speech. Almost the only lexicon missing from the materials digitized by NLL will be modern everyday speech, but this is generally covered by harvesting web-pages – a process that is also part of NLL's digitization strategy.

An obvious problem with using digitized texts as a source for language corpora is that they contain OCR errors, which should be eliminated before a word is added to the corpus. One solution would be to check for rarely appearing words and mark those as suspected OCR misspellings. If a word appears just once in the entire collection of digitized texts, it is almost certain that it contains some kind of OCR error. With a few experiments, a tolerance threshold can be established – the number of times a word must appear in the digitized texts before it is considered to be OCR error-free.
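The frequency-based filter described above amounts to a few lines of code; the tolerance value and the toy tokens are illustrative only:

```python
# Sketch: flag words occurring fewer than `tolerance` times across the
# digitized collection as suspected OCR errors before corpus inclusion.
from collections import Counter

def suspected_ocr_errors(tokens, tolerance=2):
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c < tolerance}

# "lai1cs" simulates an OCR-garbled occurrence of "laiks".
tokens = ["rīga", "rīga", "rīga", "laiks", "laiks", "lai1cs"]
print(suspected_ocr_errors(tokens))  # → {'lai1cs'}
```

In practice the threshold trades corpus coverage against noise: a high tolerance also discards genuine rare words, which is why the text proposes calibrating it experimentally.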

3.2. Semantic mark-up


One of the future developments NLL is considering is the semantic mark-up of the digitized text. By applying relatively simple algorithms, certain special words can be identified. An obvious first choice would be to identify person names and place-names. Identifying these words is made easier by the fact that the first character is capitalized both in person names and in place-names. By using a vocabulary of geographic names, identifying place-names can be made even simpler. Furthermore, certain templates can be used to identify particular words. For example, these templates can be used to identify place-names:

• "river W1w2…wn",
• "lives in W1w2…wn",
• "on the shores of W1w2…wn".

Many other templates to discover place-names can be created based on these principles. The following template could be used to identify person names:

• "<verb> W1w2…wn",

where W1w2…wn stands for a word with its first letter capitalized, and <verb> could stand for went, bought, made, sailed, etc. Identifying person names and place-names could add extra functionality to the system. An interactive map could be developed showing information associated with particular geographic areas. Person names mentioned together with certain place-names could help identify individual persons and the texts associated with them.
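The place-name templates above translate directly into regular expressions. The patterns below keep the paper's English wording for readability; a real system would use Latvian patterns and handle inflection:

```python
# Illustrative template-based place-name spotting (toy version of the
# approach described above, not NLL's implementation).
import re

# A capitalized word, allowing Latvian diacritics.
CAPITALIZED = r"([A-ZĀČĒĢĪĶĻŅŠŪŽ][a-zāčēģīķļņšūž]+)"

PLACE_TEMPLATES = [
    re.compile(r"river " + CAPITALIZED),
    re.compile(r"lives in " + CAPITALIZED),
    re.compile(r"on the shores of " + CAPITALIZED),
]

def find_place_names(text):
    """Collect capitalized words matched by any place-name template."""
    names = set()
    for pattern in PLACE_TEMPLATES:
        names.update(pattern.findall(text))
    return names

text = "He lives in Riga, on the shores of Daugava."
print(sorted(find_place_names(text)))  # → ['Daugava', 'Riga']
```

The person-name template works the same way, with the literal trigger phrase replaced by a verb list ("went", "bought", "sailed", etc.).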




Finally, street names can be identified in the digitized texts. These are especially abundant in advertisements. Most of the street names mentioned in historic newspapers have changed over time, and sometimes it takes quite some research to find out the current name of a street mentioned in a 1920s newspaper ad. NLL has previously done research on the historic street names of Riga [10], which could be used to create a solution where a user asks the system for the modern name of a street identified in some article or advertisement.

References

[1] National Library of Latvia. Collection of historical newspapers "Heritage – 1" (2000–2006). Available on-line at: http://data.lnb.lv/digitala_biblioteka/laikraksti/
[2] National Library of Latvia. Collection of historical newspapers "Periodika" (2008–2010). Available on-line at: http://www.periodicals.lv
[3] Library of Congress, American Memory (1994–2010). Available on-line at: http://memory.loc.gov/ammem/about/index.html
[4] IMPACT project (2008–2010). Available on-line at: http://www.impact-project.eu/
[5] National Library of the Netherlands, Databank of Digital Daily Newspapers (2007–2010). Available on-line at: http://www.kb.nl/hrd/digi/ddd/index-en.html
[6] E. Klijn, The Current State-of-art in Newspaper Digitization: A Market Perspective, D-Lib Magazine 1/2 (2008).
[7] S. Tanner, T. Muñoz, H. R. Pich, Measuring Mass Text Digitization Quality and Usefulness: Lessons Learned from Assessing the OCR Accuracy of the British Library's 19th Century Online Newspaper Archive, D-Lib Magazine 7/8 (2009).
[8] K. Mīlenbahs, J. Endzelīns, Latviešu valodas vārdnīca, Rīga, 1923–1929.
[9] H. Rose, Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers. National Library of Australia, March 2009, ISBN 978-0-642-27694-0.
[10] A. Juta-Zālīte, Rīgas ielu, laukumu, parku un tiltu nosaukumu rādītājs, Latvijas Nacionālā bibliotēka, Rīga, 2001.



Semantics





Human Language Technologies – The Baltic Perspective, I. Skadiņa and A. Vasiļjevs (Eds.), IOS Press, 2010. © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-187


Verbalizing Ontologies in Controlled Baltic Languages

Normunds GRŪZĪTIS, Gunta NEŠPORE and Baiba SAULĪTE
Institute of Mathematics and Computer Science, University of Latvia

Abstract. Controlled natural languages (mostly English-based) have recently emerged as a seemingly informal supplementary means for OWL ontology authoring, compared to the formal notations that are used by professional knowledge engineers. In this paper we present, by examples, a controlled Latvian language that has been designed to be compliant with the state-of-the-art Attempto Controlled English. We also discuss its relation to a controlled Lithuanian language that is being designed in parallel.

Keywords. Controlled Natural Language, Ontology Verbalization, Information Structure, Synthetic Language, Baltic Languages


Introduction

One of the fundamental requirements in the verbalization of ontology structure, restrictions and data integrity constraints¹ is the unambiguous interpretation of controlled natural language (CNL) statements, so that the CNL user can easily predict the precise meaning of the specification he/she is writing or reading. In the case of integrity constraints, the interpretation process also includes the resolution of anaphoric references. Several restrictions are used in CNLs to enable the deterministic construction of discourse representation structures (DRS): a strict syntactic subset of natural language, a set of interpretation rules for potentially ambiguous constructions, a monosemous (domain-specific) lexicon, and an assumption that the antecedent of a definite noun phrase (NP) is the most recent and most specific accessible NP.

There are several sophisticated CNLs that provide seemingly informal means for bidirectional mapping between controlled English and OWL [1]. Although the existing CNLs primarily focus on English, Angelov and Ranta [2] have shown that the Grammatical Framework (GF), a formalism for implementing multilingual CNLs, provides convenient means for writing parallel grammars that simultaneously cover similar syntactic fragments of several natural languages. Thus, if the abstract and concrete grammars are carefully designed, GF provides syntactically and semantically precise translation from one CNL to another. This potentially allows the exploitation of powerful tools already developed for controlled English also for non-English CNLs. For instance, the Attempto Controlled English (ACE) parser [3] could be used for DRS construction, paraphrasing and mapping to OWL, and the ACE verbalizer [4]

¹ Here we refer to OWL 2 terminological statements (http://www.w3.org/TR/owl2-primer/), SWRL implication rules (http://www.w3.org/Submission/SWRL/) and SPARQL integrity queries (http://www.w3.org/TR/rdf-sparql-query/).




N. Grūzītis et al. / Verbalizing Ontologies in Controlled Baltic Languages

could be used in the reverse direction, facilitating cross-lingual ontology development, verbalization, and querying. While this seems promising and straightforward for rather analytical languages that share common fundamental characteristics, allowing (among other things) the explicit detection of given and new information and, thus, the detection of anaphoric references, it raises issues in the case of highly synthetic languages (like the Baltic languages), where explicit linguistic markers, indicating which information is new (potential antecedents) and which is already given (anaphors), are in general not available. In analytical CNLs, the analysis of the information structure of a sentence is based on the strict word order and the systematic use of definite and indefinite articles. In highly synthetic languages, articles are rarely used² and are "compensated" by more implicit linguistic markers, typically by changes in the neutral word order, which is enabled by rich inflectional paradigms and syntactic agreement. We might impose the consistent use of artificial determiners, using, for example, indefinite and demonstrative pronouns, but then the controlled language would lose its characteristics of naturalness and intuitiveness. The problem is even more apparent in the case of Lithuanian, which, in contrast to Latvian, has not been historically influenced by German. Therefore the only³ formal and general feature that indicates the status of an NP is its position in a sentence: whether it belongs to the topic or the focus part. Thus, the correspondence between the given/new information and the word order patterns can be described in terms of topic-focus articulation (TFA) [5], i.e., what we are talking about and what we are saying new about it.
Although the topic (theme) and focus (rheme) parts of a sentence, in general, are not always reflected by systematic changes in the word order, it has been shown [6] that, in the case of controlled Latvian, TFA is a simple and reliable mechanism for deterministic (predictable) analysis of the information structure of a sentence. As the initial evaluation shows, the "correct" word order is both intuitively satisfiable by a native speaker and facilitates the automatic detection of anaphoric NPs in a highly synthetic CNL. In order to evaluate the TFA-based approach, an on-line questionnaire was created for each language, containing 15–17 ontological statements and rules of different complexity, each of them verbalized in two or three slightly different ways. The survey was aimed at a rather wide target group: ca. 80 respondents participated in the Latvian survey, and ca. 40 respondents in the Lithuanian survey (in both cases ca. 75% evaluated all examples; the others — at least one third). Each of the proposed alternative verbalizations had to be ranked as either "good", "acceptable", or "poor". In addition, respondents were able to propose their own (modified) verbalizations — this option was frequently used, leading to many interesting and/or overlapping suggestions. In this paper we present, by examples, the improved and extended version of controlled Latvian, compared to [6], and discuss its relation to controlled Lithuanian, which is being designed in parallel. At the end we briefly summarize some remaining issues and future tasks. Prototype implementations for both languages are available on-line at http://eksperimenti.ailab.lv/cnl.

2 In Baltic languages, the (in-)definiteness feature is not encoded even in noun endings, as it is in the case of Scandinavian languages, for instance.
3 Although definite and indefinite adjectives and participles can be used in multi-word units, such markers are optional and, in the case of a controlled language, non-reliable — attributes in domain-specific terms often have definite endings by default.





1. Generalization and SVO Statements

The simplest and most basic type of ontological statements are generalizations defining that one named class is a subclass of another named class, i.e., statements that define the taxonomy of an ontology (see Table 1). But even in this simple case there is no consensus among respondents on which verbalization is the best, except the common agreement that the indefinite pronoun ("article") is absolutely unnecessary in predicate nominal phrases. Apart from some nuances, the main dissatisfaction was about the singular subjects. Due to the underlying formalism — a subset of first-order logic (FOL) — in ACE and in other CNLs for OWL, references in the singular are typically used [1]. In natural language, however, plural statements are used more intuitively and frequently when generalizations are expressed. Thus, we have allowed plural clauses in our grammar in parallel to the singular ones: they are automatically paraphrased (linearized) into the singular readings, ensuring also compliance with ACE (see Sg vs. Pl in Table 1 and henceforth). Several respondents noted that they would also prefer to skip the plural determiner (universal quantifier) visi 'all'. We could allow this if we were interested only in terminological (TBox) statements; however, our aim is to cover rules (see Table 7 in the next section) and assertional statements as well, and therefore the optional determiner would introduce ambiguity (universal vs. existential quantification). The paraphrasing mechanism in CNLs is often used to explicitly indicate the interpretation of a potentially ambiguous syntactic construction. Grammatical Framework is especially handy for dealing with paraphrases — we use them widely throughout the grammar; moreover, GF supports using this mechanism even at the lexical level, allowing synonyms for both function and content words (in Table 1 and henceforth, synonym lists are given in parentheses; the first word is used in linearization).
For instance, the majority of Latvian respondents suggested the pronoun ikviens for 'every' instead of katrs — if it is used as an attribute of a noun. As the survey showed, it should be emphasized that we are not addressing the machine translation problem in the traditional sense: the semantically precise translation (via OWL as interlingua) among the controlled languages is not the ultimate goal but rather a side-effect. Some statements in CNL might not conform to a fully correct subset of natural language — different trade-offs have to be made to ensure the predictable interpretation. One of the main reasons for using a CNL is to encourage the active involvement of domain experts in the conceptual modeling phase of an ontology [7]. CNL provides

Table 1. Generalization axiom from sample university ontology. P stands for parsing, L — for linearization. For the same linearization the parser can accept grammatically and/or lexically different statements.

ACE: Every professor is a teacher.
Sg, P: (Ikviens|Katrs) profesors ir pasniedzējs.
Sg, L: Ikviens profesors ir pasniedzējs.
Pl, P: Visi profesori ir (pasniedzēji|skolotāji).
Pl, L: Visi profesori ir pasniedzēji.
OWL: Class: Professor SubClassOf: Teacher





high-level intuitive means for ontology authoring, if compared to the formal notations4 that are used by professional knowledge engineers. The consequence of involving domain experts is that the conceptual ontology most likely will not be "optimally" and/or completely defined. Thus, a subsequent ontology modeling phase, involving a knowledge engineer, is needed in general. For example, it might be easier for a domain expert to define several subclasses of the same class in one sentence (see Table 2). To avoid the anonymous class, the knowledge engineer might decide to split the axiom into two or more separate axioms. Moreover, the actually intended generalization in some cases might be in the inverse direction (as in Table 2). Another interesting point that was confirmed regarding the predicate nominals is that CNL users often would like to use the 'either .. or' (vai nu .. vai) construction instead of just 'or', although the intended meaning is the simple disjunction (OR) instead of the exclusive disjunction (XOR). Thus, we have allowed its use while defining an axiom, but in the paraphrases the element 'either' is automatically dropped, indicating that the interpretation, perhaps, is not what was expected. If the exclusive disjunction was actually intended, it might be the case that additional disjointness axioms have to be defined (as illustrated in Section 3). Subclass axioms often include property restrictions, making them more complex, and, if verbalized in CNL, the generalization might not be explicitly visible (compare the verbalizations in Table 3 and Table 4). In terms of CNL, a property restriction is a subject-verb-object (SVO) statement (clause), where either the subject or the object is existentially quantified (in controlled English, it is always the object). Thus, the question about using the indefinite marker (pronoun) arises again5.
In Lithuanian this would be ungrammatical, but in Latvian such markers might improve the reading in certain cases: the survey confirmed the hypothesis that the indefinite pronoun is more often used in singular NPs, if no relative clause follows the NP. In the case when the subclass is anonymous, a reference to the universal class is naturally made (see Table 4) by using an indefinite pronoun. Here the problem of differentiating between animate 'everyone' and inanimate 'everything' things appears (both in English and in the Baltic languages). We can easily allow such

Table 2. Generalization axiom that refers to an anonymous superclass — in this case, to a disjunction of the named classes. Fragments in square brackets are optional (vai nu stands for 'either').

ACE: Every course is a mandatory_course or is an optional_course.
Sg, P: (Ikviens|Katrs) kurss ir [vai nu] obligātais_kurss vai izvēles_kurss.
Sg, L: Ikviens kurss ir obligātais_kurss vai izvēles_kurss.
Pl, P: Visi kursi ir [vai nu] obligātie_kursi vai izvēles_kursi.
Pl, L: Visi kursi ir obligātie_kursi vai izvēles_kursi.
OWL: Class: Course SubClassOf: MandatoryCourse or OptionalCourse

4 Manchester OWL Syntax [8], for instance, which is intended to be the most user-friendly formal syntax for OWL. We have used it for comparison with CNL statements in all examples (except Table 7 and Table 10) in this paper.
5 Note that while we are dealing only with the terminological axioms, this is only a grammatical issue, which does not introduce any interpretation ambiguities: if a NP is not explicitly universally quantified we could assume that it is existentially quantified.




Table 3. Generalization axiom that includes a property restriction. The indefinite pronoun kāds is optional, but is used in the linearization (in the singular only), as there is no relative clause attached to the NP.

ACE: Every course is taught by a teacher.
Sg, P: (Ikvienu|Katru) kursu (māca|pasniedz) [kāds] pasniedzējs.
Sg, L: Ikvienu kursu(ACC) māca kāds pasniedzējs.
Pl, P: Visus kursus (māca|pasniedz) pasniedzēji.
Pl, L: Visus kursus(ACC) māca pasniedzēji.
OWL: Class: Course SubClassOf: inverse (teaches) some Teacher

differentiation from the analysis point of view; however, the problem is how to choose the appropriate pronoun if an ontology that has not been created and annotated by means of a CNL is being verbalized. To keep the lexicon robust and domain-independent, the ACE verbalizer [4] always uses 'everything'. In order to linearize the appropriate pronoun, additional information should be encoded for each lexical unit, indicating whether the term represents animate or inanimate things. This would make the lexicon sense- and, thus, domain-specific. A few Latvian respondents suggested using the neutral pronoun tas 'that' for 'everything'/'everyone'. It is unlikely that a controlled Latvian user would intuitively use it himself; however, if the pronoun is automatically used in linearization, the statement remains grammatically correct and easily comprehensible. In the Lithuanian questionnaire we used its counterpart (tai) by default, which was generally accepted. The issue, however, cannot be avoided in statements defining the domain or range of a property (see Table 5). In such definitions both the subject and the object are references to the universal class; one of them is existentially quantified — 'something' without any restricting relative clause. There is no neutral pronoun (in all the three languages) that could be used instead of 'something'. Moreover, to include the information in the lexicon, it should be encoded in verb entries, indicating whether the subject/object has to be an animate or an inanimate thing. Therefore the chosen trade-off for linearization is to use the indefinite pronoun kaut kas 'something', which is more neutral. Many Latvian respondents also suggested using the personal pronoun kurš instead of the relative pronoun kas if the antecedent of the anaphor is an animate thing. Such differentiation is probably influenced by Russian, but due to its frequent use both pronouns are included in the lexicon; for linearization the relative one is always used.

Table 4. Generalization axiom that refers to an anonymous subclass. The pronoun 'everything' refers to the universal class owl:Thing, which is further specified by the property restriction.

ACE: Everything that teaches a mandatory_course is a professor.
Sg, P: (Tas|Ikviens|Katrs|Jebkas|Viss), (kas|kurš) (māca|pasniedz) [kādu] obligāto_kursu, ir profesors.
Sg, L: Tas, kas māca kādu obligāto_kursu(ACC), ir profesors.
Pl, P: (Tie|Visi), (kas|kuri) (māca|pasniedz) obligātos_kursus, ir profesori.
Pl, L: Tie, kas māca obligātos_kursus(ACC), ir profesori.
OWL: Class: owl:Thing and (teaches some MandatoryCourse) SubClassOf: Professor




Table 5. Axiom defining the range of a property. The domain would be defined in the active voice. Note that the domain/range definitions can be distinguished from generalizations only due to the agreement that both the subject and the object refer to the universal class and the existentially quantified one is not restricted.

ACE: Everything that is taught by something is a course.
Sg, P: (Tas|Jebkas|Viss|Ikviens|Katrs), (ko|kuru) (kaut kas|kāds) (māca|pasniedz), ir kurss.
Sg, L: Tas, ko(ACC) kaut kas māca, ir kurss.
Pl, P: (Tie|Visi), (ko|kurus) (kaut kas|kāds) (māca|pasniedz), ir kursi.
Pl, L: Tie, ko(ACC) kaut kas māca, ir kursi.
OWL: ObjectProperty: teaches Range: Course

So far we have not faced anaphoric references, except the relative pronouns that start subordinate clauses. Apart from the assertional (factual) statements that are not covered in this paper, the need for anaphors emerges when SWRL rules and SPARQL integrity queries are verbalized (see Table 7 and Table 10 in the next sections). It was already mentioned that in the case of a synthetic CNL we can impose the use of systematic word order patterns [6], e.g., if a NP stands before the verb, it should be an anaphor (if it is not universally quantified). The majority of respondents confirmed that the word order changes are intuitive, but the majority also requested that in rules and queries, which typically contain more than one subordinate clause, the definite pronoun should be used as well. Therefore, for unambiguous parsing the pronoun is still optional (parsing is fully TFA-based), but it is always used in linearization.
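The TFA-based word order rule can be sketched in a few lines of Python. The tuple-based clause encoding and the determiner list below are our illustrative assumptions, not the actual GF grammar:

```python
# Toy sketch of TFA-based anaphor detection: a NP that precedes the finite
# verb and is not universally quantified is treated as an anaphor (given
# information); NPs in the focus position after the verb introduce new
# referents. The clause encoding is a simplification made for illustration.

UNIVERSAL_DETERMINERS = {"ikviens", "katrs", "visi"}  # hypothetical list

def classify_nps(clause, verb_index):
    """clause: list of (word, is_np_head, determiner); returns word -> label."""
    labels = {}
    for i, (word, is_np_head, det) in enumerate(clause):
        if not is_np_head:
            continue
        if i < verb_index and (det or "").lower() not in UNIVERSAL_DETERMINERS:
            labels[word] = "anaphor"   # topic position: given information
        else:
            labels[word] = "new"       # focus position or quantified NP
    return labels

# Object fronted before the verb: 'kursu' is read as given information.
clause = [("kursu", True, None), ("māca", False, None), ("pasniedzējs", True, None)]
print(classify_nps(clause, verb_index=1))
# {'kursu': 'anaphor', 'pasniedzējs': 'new'}
```

This is only the positional half of the analysis; the real grammar additionally exploits case agreement and the optional definite pronoun discussed above.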


2. Pseudo-SVO Statements

Theoretically, all predicates in OWL ontologies should conform to the SVO pattern (generalization predicates are a special case). In practice, however, it can be very hard or even impossible to come up with an appropriate verb for a property, or to use a syntactic object (accusative case), so that the statement would remain natural. In the first case, individuals of two classes most likely can be associated at least by means of a role (a NP) that would be translated into OWL as a property. The leading CNLs support different, but limited, ways of defining and referring to such properties [1]. In controlled Baltic languages they can be expressed in a uniform way: by making a NP that consists of a class name in the genitive (possessive) case followed by its role name, whose case depends on the context (see an example in Latvian in Table 6).

Table 6. Use of a noun instead of a verb (predicate) to express an association (property) between two classes.

ACE: Every course is a part of an academic_program. For every academic_program its part is a course.
Sg, P: (Ikviens|Katrs) kurss ir [kādas] akadēmiskās_programmas daļa. (Ikvienas|Katras) akadēmiskās_programmas daļa ir [kāds] kurss.
Sg, L: Ikviens kurss ir kādas akadēmiskās_programmas(GEN) daļa. Ikvienas akadēmiskās_programmas(GEN) daļa ir kāds kurss.
OWL: Class: Course SubClassOf: inverse (part) some AcademicProgram
     Class: AcademicProgram SubClassOf: part some Course




Table 7. Use of a modifier instead of an object: course-included_in-program vs. program-includes-course.

ACE: Every student takes every mandatory_course that is included_in an academic_program that enrolls the student.
Sg, P: (Ikviens|Katrs) students (apgūst|ņem) (ikvienu|katru) obligāto_kursu, (kas|kurš) ([ir] iekļauts|ietilpst) [kādā] akadēmiskajā_programmā, kurā [šis] students [ir] uzņemts.
Sg, L: Ikviens students apgūst ikvienu obligāto_kursu(ACC), kas ir iekļauts akadēmiskajā_programmā(LOC), kurā(LOC) šis students ir uzņemts.
SWRL: AcademicProgram(?x3), MandatoryCourse(?x2), Student(?x1), enrolls(?x3, ?x1), includes(?x3, ?x2) → takes(?x1, ?x2)

In the second case, typical pseudo-objects in ontological statements are adverbial modifiers of place. Apart from statements where nothing else than a modifier is possible, the survey confirmed that the use of an object is inappropriate also in statements where it is syntactically possible, but semantically incorrect (e.g. 'academic program' in Table 7). Note that currently we are considering only such modifiers that in English require the preposition 'in' but in the Baltic languages are expressed by the locative case. There is an issue, however, in translating relative clauses to/from ACE, if the relative pronoun is the modifier — such constructions are not supported in ACE. The transformation from modifier to object clauses can be easily introduced in the parallel GF grammars, but ambiguity arises in the opposite direction. To ensure the correct choice between object and modifier constructions, morphological restrictions on verb arguments have to be encoded in the lexicon; if both constructions could be valid, they can be included as alternatives, indicating which is preferable for linearization. Table 7 illustrates one more aspect: although in OWL (and FOL in general) there is no time dimension, the survey showed that the majority of CNL users would prefer to differentiate perfect and imperfect actions. To support this, the use of certain participles has been allowed at the surface level (in both SVO and pseudo-SVO statements).


3. Negated Statements

Three typical cases when negation has to be used are: to define disjoint classes (see Table 8), to define a subclass of the complement of a class (see Table 9), and to ask a data integrity query (see Table 10). In all cases, no determiners are used in the plural.

Table 8. Axiom defining disjoint classes. Double negation, in general, is used in both Latvian and Lithuanian.

ACE: No assistant is a professor.
Sg, P/L: Neviens asistents nav profesors.
OWL: DisjointClasses: Assistant, Professor

Table 9. Generalization axiom that includes a negated property restriction. The indefinite pronoun neviens is optional (in the object position), but is used in the linearization, because no relative clause follows.

ACE: No assistant teaches a mandatory_course.
Sg, P: Neviens asistents (nemāca|nepasniedz) [nevienu] obligāto_kursu.
Sg, L: Neviens asistents nemāca nevienu obligāto_kursu(ACC).
OWL: Class: Assistant SubClassOf: not (teaches some MandatoryCourse)




Table 10. Data integrity query. In the singular, the indefinite pronoun kāds is always used with the subject.

ACE: Is there a student that takes a course that is not included_in an academic_program that enrolls the student?
Sg, P: Vai ir kāds students, (kas|kurš) (apgūst|ņem) [kādu] kursu, (kas|kurš) (nav iekļauts|neietilpst) [nevienā] akadēmiskajā_programmā, kurā [šis] students [ir] uzņemts?
Sg, L: Vai ir kāds students, kas apgūst kursu(ACC), kas nav iekļauts akadēmiskajā_programmā(LOC), kurā(LOC) šis students ir uzņemts?
SPARQL: ASK WHERE {?x1 rdf:type Student. ?x1 takes ?x2. ?x2 rdf:type Course. ?x3 rdf:type AcademicProgram. ?x3 enrolls ?x1. NOT EXISTS {?x3 includes ?x2}}
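The semantics of this ASK query can be illustrated with a small pure-Python check over a toy set of triples. The individuals and the tuple-based store below are invented for illustration; the actual setting is an OWL ontology queried via SPARQL:

```python
# Pure-Python sketch of the integrity check behind the ASK query in Table 10:
# does some student take a course that is not included in a program that
# enrolls that student? Triples are (subject, predicate, object) tuples.

triples = {
    ("john", "type", "Student"),          # hypothetical individuals
    ("algebra", "type", "Course"),
    ("logic", "type", "Course"),
    ("cs", "type", "AcademicProgram"),
    ("cs", "enrolls", "john"),
    ("cs", "includes", "logic"),
    ("john", "takes", "algebra"),         # algebra is outside john's program
}

def ask_integrity(kb):
    """Mirror of the SPARQL ASK: True if the NOT EXISTS pattern is satisfied."""
    typed = lambda cls: {s for s, p, o in kb if p == "type" and o == cls}
    for x1 in typed("Student"):
        for x2 in typed("Course"):
            if (x1, "takes", x2) not in kb:
                continue
            for x3 in typed("AcademicProgram"):
                if (x3, "enrolls", x1) in kb and (x3, "includes", x2) not in kb:
                    return True   # NOT EXISTS {?x3 includes ?x2} holds
    return False

print(ask_integrity(triples))  # True: john takes algebra outside his program
```

A True answer signals an integrity violation in the data, which is exactly what the verbalized query asks about.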

4. Conclusion

An interesting observation was made that after a little training one can easily express rather complex rules (if they are conceptually clear to the author); however, it might not be easy for others to grasp the meaning if cascades of relative clauses are used. Transforming the relative clauses into genitive NPs often improves the readability — this should be supported as an alternative in the future. Other tasks are to extend the grammars to support prepositional phrases as adverbial modifiers, cardinality constraints on properties, if-then statements as an alternative pattern, and assertional (ABox) statements. A trade-off has to be found regarding the lexicon: whether it will be domain-independent or domain-specific. On the one hand, the latter choice makes it possible to differentiate animate/inanimate things, to impose morphological restrictions on verb arguments, etc. It would also allow adapting existing, linguistically non-motivated ontologies for verbalization. On the other hand, GF does not support anaphora resolution — if we intend to use existing tools for translation to/from OWL, it is important to keep compliance with ACE or some other well-resourced CNL.


Acknowledgements

The research is funded by the State Research Programme in Information Technologies. The authors would like to thank Guntis Bārzdiņš for encouraging the research topic, and all respondents for their significant help in the evaluation of the proposed grammars.

References

[1] R. Schwitter, K. Kaljurand, A. Cregan, C. Dolbear and G. Hart. A Comparison of Three Controlled Natural Languages for OWL 1.1. In: 4th International OWLED Workshop, Washington DC, 2008.
[2] K. Angelov and A. Ranta. Implementing Controlled Languages in GF. In: N.E. Fuchs (ed.): Controlled Natural Language, LNAI 5972, Springer, 2010, pp. 82–101.
[3] N.E. Fuchs, K. Kaljurand and G. Schneider. Attempto Controlled English Meets the Challenges of Knowledge Representation, Reasoning, Interoperability and User Interfaces. In: 19th International FLAIRS Conference, Melbourne Beach, Florida, 2006, pp. 664–669.
[4] K. Kaljurand and N.E. Fuchs. Verbalizing OWL in Attempto Controlled English. In: 3rd International OWLED Workshop, Innsbruck, Austria, 2007.
[5] E. Hajičová. Issues of Sentence Structure and Discourse Patterns. Charles University, Prague, 1993.
[6] N. Grūzītis. Word Order Based Analysis of Given and New Information in Controlled Synthetic Languages. In: 1st Workshop on the Multilingual Semantic Web, CEUR 571, 2010, pp. 29–34.
[7] K. Kovacs, C. Dolbear, G. Hart, J. Goodwin and H. Mizen. A Methodology for Building Conceptual Domain Ontologies. Ordnance Survey Research Labs Technical Report, IRI-0002, 2006.
[8] M. Horridge, N. Drummond, J. Goodwin, A. Rector, R. Stevens and H. Wang. The Manchester OWL Syntax. In: 2nd International OWLED Workshop, Athens, Georgia, 2006.


Human Language Technologies – The Baltic Perspective I. Skadiņa and A. Vasiļjevs (Eds.) IOS Press, 2010 © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-195


Enriching Estonian WordNet with Derivations and Semantic Relations

Neeme KAHUSK1, Kadri KERNER, and Kadri VIDER
Institute of Computer Science, University of Tartu

Abstract. In this paper we describe the work on extending Estonian WordNet with synsets that are automatically generated using derivational suffixes. Almost all of the action nouns and agent nouns derived from verbs are missing from Estonian Wordnet. While annotating texts with word senses from Estonian Wordnet it became clear that words with these suffixes are quite frequent in texts and should be included in Estonian Wordnet. Also, it is possible to automatically adapt some semantic relations between verbs.

Keywords. Estonian Wordnet, automatic derivation, semantic relations, Python, action noun, agent noun


Introduction

Wordnet is a lexical-semantic database — a resource that was initially created as a model of human lexical memory. Started at Princeton for English only, it has spread to tens of countries and languages all over the world. WordNet, as a valuable resource, can be used, for example, in the Semantic Web, ontologies, word sense disambiguation systems, machine translation, etc. The development of Estonian Wordnet started in 1998 within the EuroWordNet project2. In wordnets the lexical units are divided into synonym sets (synsets), which are connected with different types of semantic relations. In the EuroWordNet project different languages were also connected to each other via ILI-links (interlingual index). For more details about Estonian Wordnet see Õim et al. in this volume [1]. Wordnet is based on word meaning, and from this point of view such a lexical feature as derivation should not play a significant role. But a lot of Estonian derivational suffixes have concrete meanings, and this fact can be applied in connecting the derivational base and the derivation with a definite semantic relation, dependent on the meaning of the derivational affix. Up to now the creation of Estonian WordNet has been manual, with the help of numerous lexical resources, both on paper and in electronic form. The manual work is often more thorough, but on the other hand it takes up a lot of time, so we decided to look for ways to increase the size of Estonian WordNet (semi)automatically.

1 Institute of Computer Science, University of Tartu, Liivi 2, 50409 Tartu, ESTONIA. E-mail: [email protected].
2 http://www.illc.uva.nl/EuroWordNet/ (12.06.2010)



N. Kahusk et al. / Enriching Estonian WordNet with Derivations and Semantic Relations


Figure 1. Activity diagram of the derive.py script. The algorithm for generating -mine derivatives is shown; for -ja suffixes no hyperonyms or ILI links are generated.

In 2002 K. Vider discussed derivations and their relations in Estonian Wordnet [2], so there is a theoretical ground to start with. Extending a wordnet with derivations and semantic relations has also been done for the French WN [3] and the Czech and Polish wordnets [4]. Extending a wordnet with derivations demands some extra tools, as the Polaris tool3 that we are currently using does not have any functionality for such a task. We make use of the EuroWordNet module for the Python programming language described in [5]. The main activity diagram of the Python script used to generate synsets containing derivations is shown in Fig. 1.

1. Brief Overview of Estonian Derivation

Derivation, a frequent and productive way of forming new words in Estonian, is a process of appending derivational suffixes, more than 60 altogether, to both declinable and conjugable words. Suffixes can be appended sequentially; up to four suffixes in a row can be appended in some cases. Most derivational suffixes have their regular meaning(s), and it is very obvious that the source and target words of a derivation have regular lexical-semantic relations between them. Derivational morphology in Estonian is always connected with changing the meaning of the lexeme. The lexical meaning of the derived word is different from that of the word used as the derivational base; in some productive cases the derived words belong to a different part of speech.

3 http://www.illc.uva.nl/EuroWordNet/database.html (12.06.2010)


2. Deriving synsets with -mine suffix

The base form of Estonian verbs (the dictionary form) always ends with the -ma suffix. So it is easy to automatically generate the action noun with the -mine suffix from verbs. The -mine suffix is a productive one, and it rarely changes the content of the synset, only the part of speech. For example, from the verb neelama ('to swallow') it is possible to derive the noun neelamine ('swallowing') by cutting off the -ma suffix and replacing it with the -mine suffix. Since the verb neelama has 5 different senses in Estonian Wordnet, we also generate 5 senses for the derivatives. For example, if there is a synset with members neelama, neelatama, klõnksatama ('to swallow'), then it is formed into the noun synset neelamine, neelatamine, klõnksatamine ('swallowing'); and another synset neelama, summutama, katma, matma ('to mute, to dampen') is formed into the synset neelamine, summutamine, katmine, matmine accordingly (see Figs. 1 and 2). In Estonian Wordnet, all synsets are linked with semantic relations. Most of the semantic relations in Princeton Wordnet connect synsets of the same part of speech, but in EuroWordNet there are semantic relations between synsets belonging to different parts of speech. The most important relation for this project is xpos_near_synonym, connecting the verb base form and the action noun. The derived synsets automatically inherit the xpos_near_synonym relation to the corresponding verb synset. Via this relation it is possible to have access to the definition and examples that are added to the verb base form only. If there is a multiword verb synset, we do not generate the noun synset, because of orthography issues. If a verb synset contains both multiword and single-word verbs, we generate a noun synset from the single-word verbs only. Also, it is possible to adapt the hyperonymy relations of verbs.
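The -mine generation step just described can be sketched as follows. The list-based synset representation is a simplification for illustration; the actual script works on the EuroWordNet database via the Python module mentioned earlier:

```python
# Sketch of -mine derivation: replace the verb's -ma ending with -mine for
# every single-word member of a verb synset, skip multiword members, and
# record an xpos_near_synonym link from each noun back to its source verb.

def derive_mine_synset(verb_synset):
    """verb_synset: list of verb literals; returns (nouns, relations) or None."""
    singles = [v for v in verb_synset if " " not in v and v.endswith("ma")]
    if not singles:                       # e.g. a multiword-only synset
        return None
    nouns = [v[:-2] + "mine" for v in singles]
    relations = [(n, "xpos_near_synonym", v) for n, v in zip(nouns, singles)]
    return nouns, relations

nouns, rels = derive_mine_synset(["neelama", "neelatama", "klõnksatama"])
print(nouns)                        # ['neelamine', 'neelatamine', 'klõnksatamine']
print(derive_mine_synset(["sisse võtma"]))   # None: only a multiword member
```

Since the operation is a plain suffix swap, each of the verb's senses yields a corresponding noun sense, matching the 5-sense example above.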
For example, if the hyperonym for the verb synset of neelama, neelatama, klõnksatama ('to swallow') is manustama, omastama, sisse võtma ('to take in, to absorb'), then the hyperonym for the derived synset neelamine, neelatamine, klõnksatamine ('swallowing') is xpos_has_hyperonym manustama, omastama, sisse võtma. If the xpos_has_hyperonym target has a hyperonym itself, then the synset automatically inherits the ILI-relation as well (see Fig 2).

3. Deriving synsets with -ja suffix

The second, more regular derivation is from the verb (again from the -ma form) to the agent noun with the -ja suffix. The -ja derivatives can be divided into two groups; they indicate [6]: 1. occupation, job, agent; for example õpetaja ('teacher'), joonestaja ('draftsman');



2. instrument; for example laadija ('charger'). Some of the nouns with the -ja suffix can indicate both occupation and instrument; theoretically it is possible for almost all such nouns, but in reality there are exceptions, for example sööja ('eater'), or some defective conjugable words like piisama → *piisaja ('to suffice' → '*sufficer'). Derivation is again done automatically, by simply cutting off the -ma suffix and replacing it with the -ja suffix. Since there are fewer -ja derivatives indicating the instrument group, we do not generate them automatically, because the manual check-up would take too much

Figure 2. Synsets and relations. An example of verb neelama (‘to swallow’) and generated noun synset neelamine (‘swallowing’) with semantic relations. Existing synsets and semantic relationares marked with boxes and solid lines, generated synsets with rounded boxes and relations with dashed arrows.

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited

N. Kahusk et al. / Enriching Estonian WordNet with Derivations and Semantic Relations

199

time. This derivation is automatically connected to according verb with involved_agent relation; the reverse relation is role_agent.
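The suffix replacement described in Sections 2 and 3 is mechanical enough to sketch in a few lines. The following Python sketch is a hypothetical illustration only (the `derive` function and its error handling are not the authors' actual tool):

```python
# Hypothetical sketch of the -ma -> -mine / -ja derivation described above.
def derive(verb_lemma: str, suffix: str) -> str:
    """Derive an action noun (-mine) or an agent noun (-ja) from an
    Estonian verb lemma, which always ends in the -ma infinitive suffix."""
    if not verb_lemma.endswith("ma"):
        raise ValueError(f"{verb_lemma!r} is not a -ma infinitive")
    return verb_lemma[:-2] + suffix

# Every member of a verb synset is derived, so the generated noun synset
# keeps the same sense distinctions as the verb synset it came from.
verb_synset = ["neelama", "neelatama", "klõnksatama"]   # 'to swallow'
noun_synset = [derive(v, "mine") for v in verb_synset]
assert noun_synset == ["neelamine", "neelatamine", "klõnksatamine"]
assert derive("õpetama", "ja") == "õpetaja"             # 'to teach' -> 'teacher'
```

In this sketch, multiword verbs would simply be filtered out before calling `derive`, matching the restriction described above.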

4. Summary and Future Work


Estonian Wordnet was extended with more than 10 000 synsets. The verb derivations were quite easy to generate automatically, since this can be done by cutting off the -ma suffix and adding the -mine or -ja suffix. For other word classes, automatic derivation is more complicated, since it requires morphological analysis and synthesis. All newly generated synsets were connected with at least one semantic relation and, where possible, to the ILI as well.

Still, it is necessary to check over these generated synsets. For example, abstract and metaphorical meanings of a verb should not be bound to the -mine suffix, only the senses expressing a definite action.

In Estonian there are plenty of adverbs derived from other word classes, especially from adjectives, for example ahne → ahne/lt (‘greedy’ → ‘greedily’), but also from substantives (liik → liigi/ti, ‘sort (n)’ → ‘by sort’) and from verbs (ärka/ma → ärk/vel, ‘to wake’ → ‘awake’). One of the most productive and frequent suffixes in Estonian is -lt. According to lexical semantics, adverbs derived with the -lt suffix are considered adverbs of manner, and the meaning of the adjective carries over to the adverb [7]. This makes it possible to derive adverbs with the -lt suffix automatically, since many adjectives are already present in EstWN. It is also possible to carry over all the semantic relations and the definition already present with the adjective. For example, for the adjective aeglane (‘slow’): its xpos_near_synonym is the noun aeglus (‘slowness’); its state_of is the noun kiirus (‘speed, swiftness’); and it has_derived the adverb aeglase/lt (‘slowly’). In some cases the semantic relations carried over from the adjective need to be corrected or supplemented.
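The planned -lt derivation could be sketched in a similar way. Unlike the verb case, it needs morphological synthesis: the -lt suffix attaches to the adjective's stem (its genitive form), so a real morphological component would be required. The `GENITIVE` lookup below is only a stand-in for such a component, and the whole function is a hypothetical illustration:

```python
# Hypothetical sketch of -lt manner-adverb derivation from adjectives.
# GENITIVE stands in for real morphological synthesis of the stem form.
GENITIVE = {"ahne": "ahne", "aeglane": "aeglase"}

def derive_adverb(adjective: str) -> str:
    """Derive a manner adverb with the -lt suffix from an adjective."""
    return GENITIVE[adjective] + "lt"

assert derive_adverb("ahne") == "ahnelt"        # 'greedy' -> 'greedily'
assert derive_adverb("aeglane") == "aeglaselt"  # 'slow' -> 'slowly'
```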

Acknowledgements This work is supported by the European Regional Development Fund through the Estonian Center of Excellence in Computer Science, EXCS.

References

[1] H. Õim, H. Orav, K. Kerner, N. Kahusk (2010). Main Trends in Semantic-Research of Estonian Language Technology. In This volume.
[2] K. Vider (2002). Notes about labelling semantic relations in Estonian WordNet. In Proceedings of the Workshop on Wordnet Structures and Standardisation, and how these Affect Wordnet Applications and Evaluation; Third International Conference on Language Resources and Evaluation (LREC 2002). D. N. Christodoulakis, C. Kunze, L. Lemnitzer (Eds). ELRA, Las Palmas de Gran Canaria, 2002, pp. 56–59.
[3] B. Sagot, K. Fort, F. Venant (2009). Extending the adverbial coverage of a French wordnet. In Proc. of the NODALIDA 2009 Workshop on WordNets and other Lexical Semantic Resources, Odense, Denmark.
[4] K. Pala, D. Hlaváčková (2007). Derivational Relations in Czech WordNet. In Workshop on Balto-Slavonic Natural Language Processing, 2007.
[5] N. Kahusk (2010). Eurown: An EuroWordNet Module for Python. In Principles, Construction and Application of Multilingual Wordnets. Proceedings of the 5th Global Wordnet Conference. P. Bhattacharyya, C. Fellbaum, P. Vossen (Eds). Mumbai: Narosa Publishing House, pp. 360–364.
[6] M. Erelt, T. Erelt, K. Ross (2007). Eesti keele käsiraamat (3., täiendatud trükk). In Estonian; English title: Handbook of Estonian Language (3rd ed.). Tallinn: Eesti Keele Sihtasutus.
[7] R. Kasik (2009). Eesti keele sõnatuletus (3. trükk). In Estonian; English title: Estonian Derivation (3rd ed.). Tartu: Tartu Ülikooli Kirjastus.

Human Language Technologies – The Baltic Perspective I. Skadiņa and A. Vasiļjevs (Eds.) IOS Press, 2010 © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-201


Main Trends in Semantic-Research of Estonian Language Technology

Haldur ÕIM 1, Heili ORAV, Kadri KERNER, and Neeme KAHUSK
Institute of Computer Science, University of Tartu

Abstract. The paper gives a general overview of the development of computational semantics in Estonia, beginning from the second half of the 20th century. The main focus is on the work we have done so far and on the problems we are trying to solve at present in our research group of computational linguistics at Tartu University.

Keywords. computational semantics

Our work in the field of semantic analysis can be divided—quite conventionally, of course—into three main themes: lexical semantic resources, sense disambiguation, and sentence semantics. But first, as background, we give an overview of the development of computational linguistics in Estonia.


1. A short history of computational linguistics in Estonia

The history of computational linguistics at the University of Tartu is quite long. It started before the teaching of computational linguistics—already in the early sixties. The first electronic computer in Estonia was installed at Tartu University in 1959, and one of the first “non-mathematical” tasks the enthusiasts attacked was machine translation. We failed, of course, but what we learned was very important: the methods and forms of language description for computers should be quite different from those intended for humans. At our university a special programme of mathematical and structural linguistics was started: the students participating in it received special teaching in new trends of linguistics (including, of course, the classical schools of structural linguistics and generative grammar) and several mathematical disciplines, from mathematical logic to statistics.

The first real task in the area that at present is called language technology was, surprisingly, in the field of semantic resources: at the very beginning of the 1970s we started to build an information retrieval system for legal texts in Estonian, and within it we compiled a thesaurus of legal terms (concepts) in which the classical semantic relations (synonymy, hyponymy, part-whole, several functional relations, e.g. causality) were fixed.

After the information retrieval project we turned to artificial intelligence and, within this framework, to language understanding and human-computer interaction. We started to

1 Institute of Computer Science, University of Tartu, Liivi 2, 50409 Tartu, ESTONIA. E-mail: [email protected]


H. Õim et al. / Main Trends in Semantic-Research of Estonian Language Technology

build a language understanding system called TARLUS (= TARtu Language Understanding System), and we were actively involved in “All-Union” activities, including regular meetings/seminars under the common name Dialogue (these meetings, by the way, are held regularly to this day). For TARLUS we had to create (preliminary) programs for the morphological and syntactic analysis of Estonian and, of course, to continue developing our semantic resources. This was the actual beginning of our Research Group of Computational Linguistics. The main trends have been in the fields of morphology, syntactic analysis, semantic analysis and pragmatics (dialogue models). And all kinds of language resources.

When the EU started the COPERNICUS programme in the mid-1990s, we joined it, as did several research groups from other Baltic countries. In 2006 the National Programme for Estonian Language Technology2 started (see [1] as well), which in principle should cover all areas of language processing, from speech technology to the pragmatics of human interaction, that are considered relevant “to enable Estonian to function seamlessly in the modern information technology infrastructure”. In the following we describe three trends and their results in the area that we qualify as semantic.


2. Estonian Wordnet

During the last decades, wordnets have been developed for many languages (over 50) in the world3. For Estonian there are two concept-based thesauri available. The first thesaurus [2] has more of a historic value (compiled by Andrus Saareste as a war refugee in Uppsala in 1979); the second, modern and better-known one is the wordnet-type thesaurus of Estonian. The creation of Estonian Wordnet4 started within the project EuroWordNet (EWN, see also [3])5. The Estonian team joined the project, supported by the European Union, in 1998, together with the Czech, French and German teams. In the framework of the project the Estonian Wordnet was created during the years 1997–2000. After some discontinuation the project was revived: in 2006 a project for enlarging EstWN started, supported by the Estonian National Programme on Human Language Technology. Thanks to the governmental programme our thesaurus has grown considerably: the number of concepts in the thesaurus is more than 34 000 (June 2010).

The main idea and basic design of all wordnets in the project came from Princeton WordNet (more in [4]). Each wordnet is structured along the same lines: synonyms (sharing the same meaning) are grouped into synonym sets (synsets). Synsets are connected to each other by semantic relations, like hyperonymy (is-a) and meronymy (is-part-of). Most of them are reciprocated (e.g. if koer (‘dog’) has the hyperonym loom (‘animal’), then loom (‘animal’) has the hyponym koer (‘dog’)). There are 43 semantic relations used in Estonian Wordnet. The wordnets of the different languages are connected with each other via special ILI (Inter-Lingual-Index) relations. ILI concepts themselves do not have intra-language relations; this allows handling lexicalization and knowledge (ontology) separately: see [3] for further details.

2 www.keeletehnoloogia.ee/
3 There are currently around 50 wordnets for different languages in the world (see more at http://www.globalwordnet.org/).
4 EstWN, see http://www.cl.ut.ee/ressursid/teksaurus/
5 See http://www.illc.uva.nl/EuroWordNet/
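The reciprocal-relation bookkeeping described above can be sketched as follows. This is a deliberately simplified, hypothetical data structure, not the actual EstWN or eurown implementation:

```python
# Hypothetical sketch: synsets linked by semantic relations, where adding
# a relation also adds its reciprocal (hyperonym <-> hyponym).
INVERSE = {"has_hyperonym": "has_hyponym", "has_hyponym": "has_hyperonym"}

class Synset:
    def __init__(self, *lemmas):
        self.lemmas = lemmas
        self.relations = []   # list of (relation_name, target_synset) pairs

def link(source, relation, target):
    """Store the relation on the source and its reciprocal on the target."""
    source.relations.append((relation, target))
    target.relations.append((INVERSE[relation], source))

koer, loom = Synset("koer"), Synset("loom")   # 'dog', 'animal'
link(koer, "has_hyperonym", loom)
assert ("has_hyponym", koer) in loom.relations
```

A real wordnet store would extend the `INVERSE` table to all 43 relation types and keep cross-lingual ILI links separate from the intra-language relations.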


Wordnet builders around the world have applied different compilation strategies. Our approach to enlargement so far has been manual and domain-specific, i.e. we have added concepts from semantic fields like architecture, transportation, personality traits and so on. Since one person deals with one domain at a time, the relations between different concepts (in one domain) are easier to determine [13]. For example, from the domain of architecture the concept antiiktempel (‘antique temple’) has 1 hyperonym, 11 hyponyms, 1 has_holo_part and 8 has_mero_part relations.

We have also tried some ways to enlarge Estonian Wordnet automatically. For instance, during the enlargement of EstWN around 3000 noun synsets were automatically transferred from the Estonian Synonym Dictionary [16]. After this attempt we discovered that manual work gives a higher-quality result, because revising the automatic output takes too long. We do, however, plan to automatically include words derived via suffixes. The most frequent suffix relating verbs and nouns is -mine (e.g. kõndima ‘to walk’ — kõndimine ‘walking’). This approach gives us thousands of new entries. This work is described in [5].

Besides including domain-specific vocabulary, we have started to think about how to add metaphors and multi-word units (idioms etc.) to EstWN, because it would increase the size and usability of the thesaurus to a remarkable degree. Metaphors and metaphorical meanings of words are a topical issue in linguistics and lexicology, and they surely should be considered in building a thesaurus [15]. But their occurrence in text is rather unpredictable and chaotic. And if we add metaphorical uses to the thesaurus, then how should we explain them properly? As is known, the understanding of a metaphor depends on the context. A multi-word unit is a combination of two or more words that occur together to express a single meaning.
In English, compound words are often written separately and are therefore seen as a kind of multi-word expression. In Estonian, compounds are almost always written as single words and are therefore separated from multi-word expressions. The fact that there is no precise definition for either of these expressions also makes it difficult to include them in wordnets. Several problems occur when adding them, for example formal and semantic problems, as well as some more specific ones, like handling prepositions in the wordnet structure (Fellbaum 1998). Besides, some idiomatic constructions are just too complex and variable to integrate. It can be said that although many multi-word expressions are already included, inaccuracies in semantic relations and missing synonyms are rather frequent.

Including compound words in a wordnet-type thesaurus is a problem for Estonian as well as, for example, for German, because in both of these languages words can be combined quite freely while the meaning stays understandable. Nevertheless, the number of compounds in wordnets should somehow be restricted; there the Corpus of Estonian Written Language can be helpful. It is important to include at least the frequent ones.

To sum up, it appears that the creation of a concept-based thesaurus is not as easy as it seems at first sight. The main problems we face nowadays in building a thesaurus include:
• possibility of automatic extension;
• multi-word combinations.


3. Sense disambiguation

The second task is word sense disambiguation (WSD) for Estonian. Currently we are working on enlarging the Word Sense Disambiguation Corpus of Estonian, and we hope the corpus will reach a total of 500 000 words by the end of 2010.

The first project to create the Word Sense Disambiguation Corpus of Estonian started in 2001 within the Senseval-2 competition and lasted for a year (see [14]). During the first stage around 110 000 tokens were annotated. There were 43 morphologically analyzed texts of fiction from the Corpus of the Estonian Literary Language6, and only nouns and verbs were the subject of annotation.

The second project started in 2009. Since the first project dealt with fictional texts, we have now included newspaper texts, scientific texts, informational texts and legal texts. These texts come from the morphologically disambiguated corpus of Estonian7. Compared to the previous project, we now annotate nouns, verbs and also adjectives and adverbs, since these parts of speech are now present in EstWN. The texts are divided into parts of ca 2000 words each, and each part is annotated by two people. In the first project the disagreements between two annotators were settled by discussion; now we have decided that it is more effective if disagreements are resolved by a third annotator. As the sense division we use Estonian Wordnet, and for disambiguation a tool, KYKAP [7], has been developed, which is meant to assist the human annotator and speed up the annotation process.

The annotation task is divided into three parts. First, texts are pre-annotated. To speed up the annotation process we pre-annotate the words that are monosemous in EstWN. Also, many highly polysemous word forms indicate a certain sense, and these word forms are included in the pre-annotation task as well.
From the sense-annotated corpus it is possible to extract word pairs which tend to have one sense per collocation [8], and these collocations are then used in pre-annotation as well. After pre-annotation, human annotators tag the words which have not been tagged by the pre-annotation system, or correct the tags added in the pre-annotation process. Finally, a third person resolves the disagreements. This number of words and different text types will hopefully make WSDCEst a valuable resource for WSD systems as training and testing data, and also for some basic statistics about word sense distributions.
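The pre-annotation step can be sketched as follows. This is a hypothetical illustration: `SENSES` stands in for the real EstWN sense inventory, and the sense labels are invented for the example:

```python
# Hypothetical sketch of pre-annotation: monosemous words get their only
# sense automatically; polysemous words are left for human annotators.
SENSES = {
    "klõnksatama": ["swallow.v.1"],             # monosemous in this toy inventory
    "neelama": ["swallow.v.1", "dampen.v.2"],   # polysemous
}

def pre_annotate(tokens):
    """Return (token, sense) pairs; sense is None where a human must decide."""
    tagged = []
    for tok in tokens:
        senses = SENSES.get(tok, [])
        tagged.append((tok, senses[0] if len(senses) == 1 else None))
    return tagged

assert pre_annotate(["klõnksatama", "neelama"]) == [
    ("klõnksatama", "swallow.v.1"), ("neelama", None)]
```

The one-sense-per-collocation heuristic would extend the lookup with word-pair keys extracted from the already annotated portion of the corpus.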

6 http://www.cl.ut.ee/korpused/baaskorpus/ (21.03.2010)
7 http://www.cl.ut.ee/korpused/morfkorpus (10.06.2010)

4. Semantic analysis of sentences

The third direction in our research we would like to give an overview of is the semantic analysis of (simple) sentences of Estonian. One of the distant goals in natural language processing has been the semantic analysis of language, so that in addition to recognizing the structure of words and sentences, the computer could also understand the meaning of sentences (ultimately, of texts). Let us note that the solution of this task is also a precondition for solving several pragmatic tasks (human-computer interaction in natural language). We have worked on this problem for about 5 years.

The input of the corresponding program is the syntactic tree of a sentence, and the output is its representation in the form of a frame where the syntactic roles (Subject, Object etc.) are replaced with semantic ones (Agent, Patient, Recipient etc.), using a lexicon of verb frames where each frame is organized according to these semantic roles. The lexicon contains verbs that can function as predicates in sentences and thus determine the possible semantic roles that can/must occur in the corresponding sentences, and the task of the program is to match the units from syntactic trees with these semantic roles. The principles and general structure of our approach were described at the third Baltic HLT conference [9]. Thus far, we have restricted our research to the domain of motion, i.e. to sentences which express events where some entity changes its location. Here we want to give a short overview of the main points of the development in our work (and in our understanding of what is crucial in the task of semantic analysis of sentences at the present stage; there will be a more detailed presentation at the conference: “Semantic analysis of sentences: the Estonian experience” by Õim et al. [10]). These points can be summarized as follows: first, the organization of the frame lexicon; second, inferences as part of the meaning of a sentence; third, the role of ontological information (world knowledge) in sentence understanding.

1. With respect to the frame lexicon, the first thing to point out is that the frames in it are in fact not frames of verbs but frames of EVENTS represented/designated by the corresponding verbs: the central semantic unit in text semantics is not a word, nor even a sentence, but an event (in our domain, a motion event). The details of one such event (information about the fillers of the roles) can be picked up from different sentences, but they should be collected and integrated into the frame of this individual event.
For instance, let us take a string of sentences (not necessarily in immediate succession in the real text): Yesterday, Mari went to Tallinn. This time she took her own car because she had to be in Tallinn very early. She left Tartu already at six o’clock. These sentences describe (pieces of) a concrete travelling event, but its different role fillers (AGENT—Mari, TIME—yesterday, INSTRUMENT—car, LOCFROM—Tartu, LOCTO—Tallinn, TIMEFROM—six o’clock) are given in different sentences (the role names in capital letters are from the list of our semantic roles).

The second aspect worth mentioning in connection with our frames is the use of so-called hidden arguments (as fillers of certain roles; this term—and the whole idea—we took from conceptual semantics, e.g. Jackendoff 2002 [11]). The idea is that some predicates incorporate in their meaning information about the fillers of certain roles: e.g. walking and running imply that the AGENT’s legs are used as the (immediate, bodily) INSTRUMENT, in the same way as seeing and looking imply the use of eyes. This information need not be explicitly expressed in a sentence, unless something special is said about these instruments; and this specific information can come in another sentence, cf. He looked at me. His eyes were blue. Because of this, the information about such “hidden” role fillers has to be included already in the frames of the corresponding verbs, and in the frame representations of concrete sentences too, even when they are not explicitly given in the syntactic structure of the first sentence.
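The collecting of role fillers from several sentences into one event frame can be sketched as follows. This is a hypothetical illustration: the role names follow the paper's inventory, but the merging policy (`integrate`, first filler wins) is an assumption, not the authors' algorithm:

```python
# Hypothetical sketch: integrating role fillers found in successive
# sentences into the frame of one travelling event (the Mari example).
def integrate(event_frame, sentence_roles):
    """Merge newly found role fillers into the event frame (keep earlier fillers)."""
    for role, filler in sentence_roles.items():
        event_frame.setdefault(role, filler)
    return event_frame

event = {}
integrate(event, {"AGENT": "Mari", "TIME": "yesterday", "LOCTO": "Tallinn"})
integrate(event, {"INSTRUMENT": "car"})                    # from the second sentence
integrate(event, {"LOCFROM": "Tartu", "TIMEFROM": "six o'clock"})
assert event["AGENT"] == "Mari" and event["LOCFROM"] == "Tartu"
```

Hidden arguments would be handled by seeding the frame from the verb's lexicon entry (e.g. legs as bodily INSTRUMENT for walk) before any sentence is merged in.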
In our frames this problem has been solved by attaching corresponding rules to the roles/entities which move, using

Human Language Technologies - the Baltic Perspective : Proceedings of the Fourth International Conference Baltic HLT 2010, edited

206

H. Õim et al. / Main Trends in Semantic-Research of Estonian Language Technology

the information from the roles LOCFROM (starting place) and LOCTO (end place). The point is that there are three critical roles whose fillers can move: AGENT, OBJECT and INSTRUMET. But in case of different verbs the entities in these roles move differently. Compare, for instance verbs like walk (AGENT moves), throw (OBJECT moves, but not AGENT), bring (AGENT and OBECT move, and if an INSRUMENT is used, it moves, too). 3.Ontological knowledge and its relationship to “pure” linguistic-semantic knowledge is becoming more and more important today and especially, of course, in modeling understanding of sentences and texts, but the solution of the problems starts in building lexical-semantic resources (see [12]) In case of sentence analysis the ontological information is connected, in particular, with the problem of inferences. To give just one example: when someone throws a stone onto a street we know (infer) that it will be there until we learn that somebody moved it somewhere else; but if somebody throws a stone into the air, we know (infer) that it will not stay there but falls back down. This is not connected with the frame of the verb throw; instead, these inferences are connected with our knowledge of what is a stone, what is a street, what is air. This is ontological knowledge about the corresponding entities and their possible interactions. We are dealing with these problems using the concept of qualia structure [11] but at present there is little to report about practical results. In sum, what we have learned thus far in the field of sentence/text semantic analysis is:
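The role-based movement inference just described can be sketched as follows. This is a hypothetical illustration: the `MOVERS` table and the flat frame layout are assumptions for the example, not the authors' rule format:

```python
# Hypothetical sketch: which role fillers move for a given verb, and the
# inference that each mover ends up at the LOCTO location after the event.
MOVERS = {
    "walk":  {"AGENT"},
    "throw": {"OBJECT"},
    "bring": {"AGENT", "OBJECT", "INSTRUMENT"},
}

def infer_locations(verb, frame):
    """Return {entity: location-after-event} for the roles that move."""
    return {frame[role]: frame["LOCTO"]
            for role in MOVERS[verb] if role in frame}

frame = {"AGENT": "Mari", "OBJECT": "book",
         "LOCFROM": "Tartu", "LOCTO": "Tallinn"}
assert infer_locations("throw", frame) == {"book": "Tallinn"}
assert infer_locations("bring", frame) == {"Mari": "Tallinn", "book": "Tallinn"}
```

The stone-in-the-air example shows where such a table stops being sufficient: whether the entity stays at LOCTO depends on ontological knowledge about the entities, not on the verb alone.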


1. when “already” in semantics (after the analysis of sentences in a text), we can in a sense forget about sentences and have to build semantic structures in terms of semantic units (in our case, events);
2. the event structures (and they are the structures which remain in our memory after reading a text) are not compiled from the information gathered from the analyzed sentences only;
3. in addition, we use inferences to fill in certain gaps in the event structure; and
4. we use ontological knowledge to do this.

Acknowledgements This work is supported by the European Regional Development Fund through the Estonian Center of Excellence in Computer Science, EXCS and the National Programme of Estonian Language Technology.

References

[1] E. Meister, J. Vilo and N. Kahusk, National Programme for Estonian Language Technology: a pre-final summary. In This volume, (2010).
[2] A. Saareste, Eesti keele mõisteline sõnaraamat I–IV. Dictionnaire analogique de la langue estonienne I–IV. Kirjastus Vaba Eesti, Stockholm, 1958–1968.
[3] P. Vossen, EuroWordNet: a multilingual database of autonomous and language-specific wordnets connected via an inter-lingual-index. Semi-special issue on multilingual databases, International Journal of Lexicography, (2004).
[4] G. Miller, R. Beckwith, C. Fellbaum, D. Gross and K. Miller, WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), (1990), 235–244.
[5] N. Kahusk, K. Kerner and K. Vider, Enriching Estonian WordNet with Derivations and Semantic Relations. In This volume, (2010).
[6] C. Fellbaum, Towards a representation of idioms in WordNet. In S. Harabagiu (ed.), Proceedings of the Workshop on Usage of WordNet in Natural Language Processing Systems. Montreal: COLING/ACL, (1998), pp. 52–57.
[7] N. Kahusk, Eurown: An EuroWordNet Module for Python. In Principles, Construction and Application of Multilingual Wordnets. Proceedings of the 5th Global Wordnet Conference. P. Bhattacharyya, C. Fellbaum, P. Vossen (Eds). Mumbai: Narosa Publishing House, (2010), pp. 360–364.
[8] W. Gale, K. Church, D. Yarowsky, One Sense Per Discourse. DARPA Workshop on Speech and Natural Language, New York, (1992), pp. 233–237.
[9] K. Müürisep, H. Orav, H. Õim, K. Vider, N. Kahusk, P. Taremaa, From Syntax Trees in Estonian to Frame Semantics. In The Third Baltic Conference on Human Language Technologies, October 4–5, 2007, Kaunas. Proceedings, Vilnius, (2008), pp. 211–218.
[10] H. Õim, H. Orav, N. Kahusk and P. Taremaa, Semantic analysis of sentences: the Estonian experience. In This volume, (2010).
[11] R. Jackendoff, Foundations of Language. Oxford: Oxford University Press, 2002.
[12] N. Kahusk, K. Kerner, H. Orav, Toward Estonian Ontology. In LREC 2008 Proceedings, Marrakesh, Morocco, 26 May–1 June 2008. A. Oltramari, L. Prevot, C.-R. Huang, P. Buitelaar, P. Vossen (Eds). Elite Imprimerie, (2008), 20–24.
[13] K. Kerner, H. Orav, S. Parm, Semantic Relations of Adjectives and Adverbs in Estonian WordNet. In LREC 2010 Proceedings, Valletta, Malta, ELRA, (2010), 33–37.
[14] K. Kerner, K. Vider, Word Sense Disambiguation Corpus of Estonian. The Second Baltic Conference on Human Language Technologies, April 4–5, (2005), Proceedings, pp. 143–148.
[15] K. Vider, H. Orav, Concerning the difference between a conception and its application in the case of the Estonian wordnet. In Proceedings of the Second International WordNet Conference, Brno, 2004. P. Sojka, K. Pala, P. Smrz, Ch. Fellbaum, P. Vossen (Eds). Brno, (2004), 285–290.
[16] A. Õim, Sünonüümisõnastik, Tallinn, 1991.

Human Language Technologies – The Baltic Perspective I. Skadiņa and A. Vasiļjevs (Eds.) IOS Press, 2010 © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-208

Semantic Analysis of Sentences: The Estonian Experience

Haldur ÕIM 1, Heili ORAV, Neeme KAHUSK, and Piia TAREMAA
Institute of Computer Science, University of Tartu

Abstract. This paper describes the work done on the computational semantics of Estonian simple sentences. Main attention is paid to inferring knowledge from text. Differences from the widely used FrameNet are discussed.

Keywords. syntactic semantics, semantic roles, frame semantics, FrameNet, Estonian, inference


Introduction

For about 5 years we at Tartu University have been working on an LT project called “Semantics of simple sentences”. Concerning the title of the project, we would like to stress at once that our real goal is to move from sentence semantics to the semantics of coherent texts, that is, to modeling the “real” process of human language understanding, so that what we now call the semantic representation of a sentence or text would in fact be a representation of what the reader/hearer knows after having read/heard this text. This should explain why in our project, when speaking of the semantic analysis of simple sentences (= translating syntactic trees of isolated sentences into some kind of semantic structure), we quite intentionally deal with such theoretically oriented problems as drawing different kinds of inferences or using ontological knowledge.

In the analysis process itself we use as input the syntactic dependency trees of sentences from the Estonian Treebank2, and the output is the representation of the sentence in the form which we call a (sentence) frame, where the syntactic dependencies (Subject, Object, Adverbial etc.) are replaced by semantic roles (Agent, Instrument, Recipient etc.; see e.g. [1]). But this is only the first step. And even this is not as simple as it seems. The main problems we have found to be of critical importance can be summarized as follows:

First, compiling the inventory of semantic roles; because at the present state of semantics there is little hope of creating a universal inventory, we have to restrict ourselves to some semantic domain. At present the domain of our semantic analysis program is motion—self-motion as well as caused-motion events.

Second, in the process of transition from the syntactic tree of a sentence to its semantic frame (representing the corresponding event), it often appears necessary to add so-called

1 Institute of Computer Science, University of Tartu, Liivi 2, 50409 Tartu, ESTONIA. E-mail: [email protected]
2 www.clarin.eu/arborest


hidden arguments/roles (in the sense of Jackendoff’s conceptual semantics), i.e. arguments that do not always appear in the surface sentence as syntactic elements but will be needed in its semantic representation, e.g. when some specific information has to be added to their description later (e.g. walking and legs, throwing something somewhere and hands, looking/seeing and eyes—as implicit Instruments).

Third, the problem of inferences: the full meaning of a sentence includes, for the recipient, not only the data explicitly represented in it but also the knowledge s/he can derive from these data by means of inferences; and this is particularly important when we start to model the understanding of coherent texts, where the knowledge derived from previous sentences by inference cannot be distinguished from the information explicitly conveyed.

And fourth, there is the need to take into account, along with the “pure” linguistic meaning of language expressions, also world knowledge (domain ontology).

Since it is clear that in a short overview it is not possible to treat all these problems at a reasonably informative level, we will concentrate below on the kernel of our system, the frame lexicon; by describing its organization it is possible to show also how it helps to solve other problems, e.g. the problem of inferences.

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

1. Frame lexicon, semantic roles, inferences

Frames in our system are structures consisting of a head—a motion verb which can function in a sentence as the predicate—and its possible arguments as fillers of certain semantic roles. Thus, semantic roles are the main structuring elements of a frame. The original idea behind the concept of frame came, of course, from frame semantics, and specifically from FrameNet3 (see e.g. [2] for an overview). But for the purposes of our project, which deals with the interaction between syntax (sentences, texts) and semantics, we had to work out our own inventory of semantic roles. One reason for this was, for instance, the need to draw inferences from frames: FrameNet does not deal with inferences, at least not explicitly. But in the semantic analysis of sentences it is impossible to ignore this problem, and certain kinds of inferences are directly connected with semantic roles (we discuss the problem below). This, by the way, does not mean that FrameNet structures cannot be used to draw inferences from sentences with e.g. motion verbs as predicates. We have tried this, in parallel with our frame structures. But the role inventory in FrameNet is too complicated and domain-dependent to be taken as a regular basis of a sentence/text semantic analysis program at the very beginning.

The first conceptually important point we want to make clear is that although the heads of frames are verbs, the frames are in fact not frames of verbs but frames of EVENTS represented/designated by the verbs as possible predicates of corresponding sentences. The basic semantic unit in text semantics is not a word, nor even a sentence, but an event (in our domain, a motion event). The details of one such event can be picked up from different sentences, but they should be collected and integrated into the frame of this individual event. For instance, let us take a string of sentences: Yesterday, Mari went to Tallinn.
This time she took her own car because she had to be in Tallinn very early. She left Tartu already at six o’clock.

3 http://framenet.icsi.berkeley.edu/


These sentences describe (pieces of) a specific traveling event, the frame of which is evoked by the verb ‘went’ in the first sentence, but its different role fillers (AGENT = Mari, TIME = yesterday, INSTRUMENT = car, LOCFROM = Tartu, LOCTO = Tallinn, TIMEFROM = six o’clock) are given in different sentences (the role names in capital letters are from the list of our semantic roles). The idea of this differentiation and, concretely, the concept of event we have taken from Conceptual Semantics [3], where the complex problems of word-sentence-text semantics (including background knowledge) are dealt with within a common framework. Such complex treatments are quite rare in today’s theoretically oriented linguistics.

Apparently, the most direct way to explain our ideas connected with frames and their structure—semantic roles and inferences—in the analysis and representation of the meaning of sentences is to use a concrete example. Below, we give (in basic detail) the frame structure of agentive self-motion (AGENTIIVNE LIIKUMINE), represented by verbs like kõndima ‘to walk’, lendama ‘to fly (like a bird)’, ujuma ‘to swim’, sõitma ‘to go (using a vehicle), travel, ride…’, and then explain the reasons for its structural elements. See Figure 1 for an overview of the general role structure.

There are two features in the structure of this frame that are of importance here and need explanation. First, the ASETSEMA subframes are attached to the roles whose fillers move in the event described by the frame. In the agentive self-motion event, AGENT and INSTRUMENT are the entities that move. ASETSEMA_1 and ASETSEMA_2 fix the location of the entity before and after the motion event, respectively, taking the corresponding information from the LOCFROM and LOCTO roles of the main frame. Thus, this is our present (preliminary) solution to the problem of inferences concerning the location of entities participating in a motion event before and after the event.
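How role fillers scattered over several sentences are collected into one event frame—as in the Mari example above—can be sketched minimally as follows (the function and the dictionary representation are ours, for illustration only):

```python
# Role fillers for one motion event arrive in different sentences; the running
# event frame accumulates them. setdefault keeps fillers that are already known,
# so earlier sentences take precedence over later ones.

def merge_into_event(event_frame, sentence_frame):
    """Add role fillers from a new sentence frame to the running event frame."""
    for role, filler in sentence_frame.items():
        event_frame.setdefault(role, filler)
    return event_frame

# The three-sentence Mari example from the text:
event = {"AGENT": "Mari", "TIME": "yesterday", "LOCTO": "Tallinn"}
merge_into_event(event, {"AGENT": "Mari", "INSTRUMENT": "car"})
merge_into_event(event, {"AGENT": "Mari", "LOCFROM": "Tartu",
                         "TIMEFROM": "six o'clock"})
```

After the three merges, the single event frame carries all six role fillers listed in the text, although no single sentence contained them all.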
The reason why we have chosen such a straightforward solution is that in different motion events different participants move. For instance, in the case of an agentive caused-motion event where an AGENT throws an OBJECT from place L1 to place L2, only the OBJECT moves to L2; the AGENT stays at L1, and therefore in the frame corresponding to the predicate throw the ASETSEMA_1/2 subframes are attached only to the OBJECT role and not to AGENT. But in an event where an AGENT brings an OBJECT from L1 to L2, both the OBJECT and the AGENT move, and if the AGENT uses an INSTRUMENT, it moves, too. Therefore the ASETSEMA_1/2 subframes should be attached to all these roles in the bring frame.

The second feature which needs explanation is the use of in- and at-subroles with the Loc-roles. It may be remarked at the outset that this is also connected with the problem of inferences, but in a quite different way; and it brings in the ontological dimension. The critical point here is that in our folk ontology of the world we differentiate, among other aspects, between entities that have an inside and those that do not, the difference being that other objects can be moved into the former (and kept there) but not into the latter. For instance, bags, baskets, boxes, boats, and cupboards have an inside, but stones, chairs, trees, etc. do not in the same sense. Of course, there is an indefinite number of entities for which this difference simply does not make sense. In the context of the motion domain, this difference becomes relevant in the following way. Both kinds of entities can function as fillers of the role LOCTO, that is, as reference points of where the motion ended. But in the case of “entities with an inside”, there is a principal difference between whether the moving object moved into them (as in the sentence ‘I put the shoes in the basket’) or somewhere near them (‘I put the shoes behind the basket’). The difference becomes important, among other


AGENTIVE SELF-MOTION
  HYPERONYM: MOTION
  ROLE STRUCTURE
  Participant Roles
    AGENT (the participant who controls his/her activity, the instigator of the event)
      FRAME: ASETSEMA_1 ‘be located’
        Object = Agent
        Loc = Locfrom
        Time = Timefrom
      FRAME: ASETSEMA_2
        Object = Agent
        Loc = Locto
        Time = Timeto
    INSTRUMENT [the same ASETSEMA subframes attached as for AGENT, only with Object = Instrument, which means that the INSTRUMENT is supposed to move the same way as the AGENT]
  Loc-Roles
    LOCFROM (the starting place, e.g. from the garden, from under the table, from the box)
      Locfrom-in
      Locfrom-at
    LOC (where the motion takes place, e.g. on the street, in the garden, under the table)
      Loc-in
      Loc-at
    LOCTO (the ending place, e.g. onto the street, into the garden, into the box)
      Locto-in
      Locto-at
    /---/
  Time-Roles [the same system: TIMEFROM, TIME, TIMETO, DURATION]
    /---/
  Other Roles (not important in the given context: DIRECTION, PATH, MANNER; about 30 roles in total)

Figure 1. Frame structure of agentive self-motion.


things, when the later text says that the corresponding entity with an inside (the basket in our examples) moves to another place (e.g. is taken somewhere). Then, by inference, one (e.g. a computer program) should conclude that all the things that were in it (e.g. my shoes) are also at this new place, whereas the things that were ‘at’ it (behind it, in front of it, etc.) have not moved. Of course, this concerns a very specific aspect of motion, but our intention was precisely to demonstrate that once we start the serious task of semantic analysis of sentences and (coherent) texts, we cannot avoid “landing”, sooner or later, on such specific problems.

The last aspect we would like to touch upon in connection with our frames is the use of so-called hidden arguments (as fillers of certain roles); this term—and the whole idea—we took from conceptual semantics (e.g. Jackendoff 2002 [3]). The idea is that some predicates incorporate in their meaning information about the fillers of certain roles: e.g. walking and running imply that the AGENT’s legs are used as the (immediate, bodily) INSTRUMENT, in the same way as seeing and looking imply the use of eyes. This information need not be explicitly expressed in a sentence unless something special is said about these instruments; and this specific information can come in another sentence, cf. ‘He walked to the table. He was barefoot.’ Because of this, the information about such “hidden” role fillers has to be included already in the frames of the corresponding verbs, and in the frame representations of concrete sentences, too, even when it is not explicitly given in the syntactic structure of the first sentence whose predicate triggered the frame (walked in the given case). In the corresponding frame, under the role in question, such information should be explicitly formulated.
For instance, when we take the Estonian verb kõndima ‘to walk’, then under the role AGENT one of the semantic requirements on its possible fillers should be given as “has legs”, e.g.:

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

KÕNDIMA
  AGENT
    SEMREQ: Living being
    HAS_BODYPART: legs

And the frame should contain a specific INSTRUMENT role carrying the information that the legs of the AGENT function as this instrument:

  INSTRUMENT-B[odypart]
    Legs = BODYPART-of-AGENT
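Filling such a hidden role during sentence analysis can be sketched as follows; the lexicon structure and function are our illustration, loosely modeled on the KÕNDIMA entry above:

```python
# Sketch of hidden-role filling: a lexicon entry supplies the implicit bodily
# INSTRUMENT even when no instrument appears in the syntactic structure of the
# sentence. The dictionary layout is ours, not the project's actual format.

LEXICON = {
    "kõndima": {  # 'to walk'
        "AGENT": {"SEMREQ": "living being", "HAS_BODYPART": "legs"},
        "HIDDEN": {"INSTRUMENT-B": "legs (BODYPART of AGENT)"},
    }
}

def add_hidden_roles(verb, sentence_frame, lexicon=LEXICON):
    """Copy the verb's hidden role fillers into a sentence frame."""
    for role, filler in lexicon.get(verb, {}).get("HIDDEN", {}).items():
        sentence_frame.setdefault(role, filler)
    return sentence_frame

# 'He walked to the table.' -- the frame gains the implicit instrument, so a
# later sentence ('He was barefoot.') has a role to attach its information to.
frame = add_hidden_roles("kõndima", {"AGENT": "he", "LOCTO": "the table"})
```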

Summary

Our main aim here was not to give the technical details of our project but to outline our solutions to the problems we consider critical in the semantic analysis of sentences and, further, of coherent texts. We described our approach to them: at its center is the frame lexicon, whose principal elements are semantic roles, including “hidden” semantic roles; and, second, the treatment of inferences, which constitute an inevitable part of sentence semantics.


Acknowledgements

This work is supported by the European Regional Development Fund through the Estonian Center of Excellence in Computer Science (EXCS) and by the National Programme of Estonian Language Technology.

References

[1] K. Müürisep, H. Orav, H. Õim, K. Vider, N. Kahusk, and P. Taremaa, From Syntax Trees in Estonian to Frame Semantics. In: The Third Baltic Conference on Human Language Technologies, October 4–5, 2007, Kaunas. Proceedings, Vilnius, 2008, pp. 211–218.
[2] T. Fontenelle (ed.), International Journal of Lexicography, Special issue, 16(3), September 2003.
[3] R. Jackendoff, Foundations of Language. Oxford University Press, Oxford, 2002.


Methods and Tools for Language Processing


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-217


An Ensemble of Classifiers Methodology for Stemming in Inflectional Languages: Using the Example of Latvian

Steffen EGER a and Ineta SĒJĀNE b
a Institute for the German Language, Mannheim, Germany
b University of Heidelberg, Germany

Abstract. In this paper, we present a stemming methodology based both on a handcrafted rule-based system and on data-driven machine learning approaches. The rule-based system models phenomena of Latvian, a highly inflectional language, in a linguistically sound and consistent way. While the handcrafted stemmer can be used on its own, it may also serve as a supplier of training data for our statistical modeling. The latter relies on two assumptions which are quite natural in the context of stemming and many other NLP applications such as grapheme-to-phoneme conversion, lemmatization, etc.: namely, that the output sequence is not longer than the input sequence and that the orderings of input and output sequence characters are ‘similar’. Under these conditions, we train several machine learning algorithms and show that very good results for stemming in Latvian can be obtained by combining them via bootstrapping and ensemble of classifiers methods.

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

Keywords. Stemming, ensemble of classifiers, highly inflectional languages, rule-based stemming, bootstrapping, machine learning, statistical stemming, Latvian

Introduction

Stemming, also called conflation, is the problem of reducing inflected and derived words of a natural language to their ‘stem’, where this stem is a special notion in natural language processing (NLP) that is similar to the linguistic notion of the ‘morphological stem’, or ‘root’, of a word. A characterizing feature of this NLP stem is that it is the ‘tertium comparationis’ for a set of morphologically related words, e.g. in English librari for library, libraries, librarian, or in Latvian grib for griba, gribēt, jāgrib (Engl. the will, to want, one has to want). A field related to stemming is lemmatization, the task of mapping word forms to their lemma. The difference between stemming and lemmatization is that the former is usually coarser, in the sense that if two word forms have the same lemma, they will (usually) also have the same stem, while the converse need not be true. Thus, stemming is generally better suited for information retrieval (IR) and indexing applications than lemmatization because it reduces the set of index terms more strongly. To date, different approaches to the stemming problem have been proposed, e.g. checking against special stem and suffix lists, dictionary look-up (lexical approaches), K-stem, co-occurrence computation, n-gram comparison (statistical approaches), longest suffix match, and iterative rule-based affix removal (morphological approaches) (e.g.


S. Eger and I. Sējāne / Classifiers Methodology for Stemming in Inflectional Languages

[1]). Particularly algorithms of the latter kind exhibit good results for English (for example, the classic Porter stemmer [2]), since they are fast and, even more importantly, can handle arbitrary words beyond the scope of any dictionary. However, they have not been broadly adapted to other languages, especially to highly inflectional and derivation-friendly languages such as Latvian, mainly because converting existing algorithms for English to synthetic languages is not straightforward, as these languages possess sufficiently different characteristics. For example, in Latvian the following phenomena cannot be modeled with traditional affix removal algorithms:

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

(i) consonant gradation at the end of the stem, e.g.
  s–š: os(is) – oš(i) (ash tree)
  ll–ļļ: lell(e) – leļļ(u) (doll)
  s–d–z: ēs(t) – ēd(a) – ēz(dams) (to eat)
  g–dz: snieg(t) – sniedz(a) (to hand so. sth.)
(ii) ablaut, e.g.
  e–ē: ņem(t) – ņēm(u) (to take)
  a–o: pras(t) – prot(u) (to know how to do sth.)
  ū–uv: sagrū(t) – sagruv(a) (to collapse)

Our approach to stemming in the current work is two-fold. First, we develop a Porter-like iterative affix removal stemming algorithm for Latvian that tackles some of the named Latvian-specific problems, particularly (i) above. Second, we present a transferable data-driven modeling approach to stemming in a machine-learning framework that presupposes only two assumptions about the word-form-to-stem conversion process: namely, 1) that the stem character sequence y is not longer than the original word form sequence x to be stemmed, and 2) that the ‘character ordering’ of y is reflected in the ‘character ordering’ of x. Under these premises, mappings between x and y can easily be learned from data using canonical machine learning algorithms. In the current work, we further show that by combining these statistical learners, very powerful stemmers can be obtained whose capacities are quite beyond the scope of traditional affix removal systems. The connection between the two stemming methodologies presented in this paper (handcrafted and machine-learned) is that the former generates the initial training data set, subject to human intervention, used by the latter.

1. A handcrafted rule-based stemmer

Our handcrafted iterative stemming algorithm was designed based primarily on the morphological rules of Latvian. The present implementation is a carefully revised version of a Porter stemmer formerly developed for Latvian [3], for which it had turned out that simple consonant and vowel counting and suffix removal were insufficient for the Latvian language. Therefore, the current stemming algorithm was devised not only to tackle different inflections and some derivations, as do algorithms for English and as is done in [4], but also some very peculiar features of Latvian such as inflectional prefixes and consonant gradation. We provide a sound and generalizable solution to these problems by examining the structure of words more explicitly, e.g. by checking for prefixes, and by introducing meta-symbols for alternating consonants. The mappings in the latter case are as follows:


l, ļ → L
d, z, ž → Z
n, ņ → N
g, dz, dž → G
k, c, č → C
j, t, s, š → T

The current algorithm consists of eleven steps of checking context rules and stripping affixes or changing letters, e.g. mapping them to meta-symbols. We will not go into great detail here because the context rules are rather complex, e.g. testing for prefixes (with some exceptions such as nest, sacīt, paka) before determining the length of the remaining part of the word. The eleven general steps, to be applied in sequential order, are as follows:

1. Strip the verb prefix jā-
2. Sort out some irregular verbs
3. Strip most of the verb, adjective and noun suffixes
4. Strip j in some special consonant combinations, e.g. pj or mj, at the end of the remaining word
5. Strip the comparative suffix -āk and the superlative prefix vis-
6. Strip participle suffixes, e.g. dam-, uš-
7. Map some consonants prone to alternation at the end of the word to meta-symbols
8. Sort out more irregular verbs (see step 2)
9. Strip even more suffixes from — hopefully — verbs in order to match the present and past stems
10. Map further consonants at the end of the word to meta-symbols (see step 7)
11. Normalize some derived stems (usually in foreign words) by changing back from special symbols to the actual, non-alternating letters, e.g. ciT to cij and ovsC to ovsk
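As a toy illustration of steps 1 and 7, the sketch below strips the jā- prefix and maps an alternating final consonant to its meta-symbol; the mapping table covers only the alternations listed above, and the real stemmer's context conditions (prefix exceptions, word-length checks, etc.) are omitted:

```python
# Toy sketch of steps 1 and 7 of the handcrafted stemmer. Not the actual
# implementation: all context rules are left out for brevity.

META = {
    "l": "L", "ļ": "L",
    "d": "Z", "z": "Z", "ž": "Z",
    "n": "N", "ņ": "N",
    "g": "G",
    "k": "C", "c": "C", "č": "C",
    "j": "T", "t": "T", "s": "T", "š": "T",
}

def strip_ja_prefix(word):
    """Step 1: strip the debitive verb prefix jā-."""
    return word[2:] if word.startswith("jā") else word

def map_final_meta(word):
    """Step 7, simplified: map an alternating final consonant to a meta-symbol."""
    if word.endswith(("dz", "dž")):  # digraphs of the G class come first
        return word[:-2] + "G"
    if word and word[-1] in META:
        return word[:-1] + META[word[-1]]
    return word
```

With this, the two gradation variants of sniegt/sniedza collapse onto one meta-stem: both `map_final_meta("snieg")` and `map_final_meta("sniedz")` yield `snieG`, which is precisely what the meta-symbols are for.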

This stemmer, as well as the data-driven stemmers presented in Section 2, was tested on a word frequency list of Latvian containing 218,628 different word forms.1 Considering the criterion of index term list reduction rate [5], our rule-based stemmer is comparable to the best rule-based stemmers for English (reduction rate above 30%). The average word form reduction is about 2.4 characters, from about 8–9 word form characters to roughly 6 stem characters. The largest reduction in this data set was as many as 14 characters, from visneiedomājamākajiem to neiedom. Meta-symbols were added to about 50% of all stems, which seems to be too much, as the evaluation suggests that it should be about 20%. A great advantage of this algorithm is that it largely avoids overstemming, and consequently most stems are still easily readable and recognizable.

2. Data-driven machine-learned stemmers Our methodological approach to data-driven stemming is as follows. We first generate a set of training data input-output sequences {(xl , yl ) | l = 1, . . . , N }, where each xl is a Latvian word form and yl is its corresponding stem. This training data is obtained semi-automatically using the handcrafted stemming algorithm described in Section 1 and subsequent manual correction. Once we have this data, we align input and output sequences by inserting  symbols in the shorter output sequence until both strings have 1 Many

thanks to Dr Andrejs SPEKTORS from the AIL, University of Latvia, for providing us with this list.


equal length. This step is implemented in order to have one-to-one correspondences between the sequences’ characters, which greatly facilitates the word-form-to-stem conversion process. We then train several learning algorithms on this modified data set {(xl, yl) | l = 1, …, N}: 1) decision trees for various parametrizations, where the predictor variables are contextual input sequence characters, and 2) standard Hidden Markov Models (HMMs), where input sequence symbols are observed variables and output sequence symbols are hidden variables. Finally, we combine all learning classifiers to obtain an ‘ensemble of classifiers’ approach. Below we give a few more details on each of the steps mentioned.

2.1. Alignments

All our classifiers presuppose the following structure. An input sequence x = x1 … xn is mapped to an output sequence y = y1 … ym, m, n ∈ ℕ, m ≤ n, where xi ∈ Σ, i = 1, …, n, and yj ∈ Γ, j = 1, …, m, are taken from some finite sets Σ and Γ. Moreover, it is assumed that the elements yj ‘give rise’, or translate, to the elements xi in some regularized manner, so that the first task is to find a mapping, or alignment [6], a : {1, …, m} → {1, …, n}, between x and y that describes this relationship.2 If a is injective — i.e. no two elements yj and yk, j ≠ k, map to the same xi — as we will assume throughout, then there are $\binom{n}{m}\, m!$ possible alignments between x and y, a number that is impossible to evaluate even for moderate sizes of m and n. Fortunately, we can postulate a monotonicity constraint on the alignments a that is quite natural in the context of stemming (and many other NLP applications) and that drastically reduces the number of possible alignments to consider. To be more precise, if we assume monotonicity of a, i.e. that j1 < j2, j1, j2 ∈ {1, …, m}, implies a(j1) < a(j2), then

|{a : {1, …, m} → {1, …, n} | a is injective and monotone}| = $\binom{n}{m}$,

which is lower by a factor of m! than the previous number and is in fact moderate provided that both n and the distance between n and m are not too large. As mentioned, the monotonicity constraint is quite intuitive in stemming and just means that the ordering of the output sequence is preserved in the input sequence, i.e. if yj2 comes after yj1, then so do the respective translations of these elements. Table 1 illustrates the implications of this requirement.

Now, given a set of training data input-output sequences {(xl, yl) | l = 1, …, N}, where each xl and each yl is of the form discussed above, i.e. xl = x1(l) … xnl(l), yl = y1(l) … yml(l), ml ≤ nl, etc., our goal is, as indicated, to find for each input-output pair (xl, yl) that alignment out of the $\binom{n_l}{m_l}$ possibilities with maximum probability, given the training data. To this end, we apply the EM (expectation maximization) algorithm, which iteratively improves on the (alignment) parameter estimates until it converges to a local maximum [7].

2 If a(j) = i, this means that yj is mapped to xi. Since a cannot be surjective in case m is strictly smaller than n, there are input elements xi for which there is no corresponding output element yj; we imagine here that the respective input symbol is deleted, or mapped to the empty string ε, cf. Table 1.


Input:        j  ā  c  ē  r  p  j
Alignment 1:  ε  ε  c  i  r  p  ε
Alignment 2:  ε  c  i  ε  r  p  ε
Alignment 3:  c  ε  i  ε  r  p  ε
Alignment 4:  c  i  ε  ε  r  p  ε

Table 1. Four out of (7 choose 4) = 35 possibilities to align the input sequence jācērpj (Engl. there is a need of cutting; to cut, debitive mood) to its stem cirp under the monotonicity constraint on alignments. Note that it is now not allowed, e.g., to align the output c to the input ē but i to ā.

2.2. Decision Trees

Decision trees have widespread application as a machine learning algorithm, not least in the field of NLP (e.g. [8]). In general, a decision tree is a rule system that tries to predict the value of a target variable based on the values of predictor variables, where the rules are restricted to “if-then” statements. In the machine learning context, conditional upon the available data, the goal is to find the decision tree that ‘best’ explains this data. In our situation, given aligned training data as described above, we learn a decision tree for each possible input sequence character—in other words, for each letter of the Latvian alphabet. The predictor variables are the contextual characters of a given letter, where we parametrize this context by the number of symbols considered to the left and to the right of the given letter (a so-called ‘window’), and the target variables are the corresponding output (i.e. stem) sequence characters. The decision trees are learned using the classic ID3 algorithm due to Quinlan [9].
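The paper's learners are ID3 decision trees over windowed context features. As a dependency-free stand-in with the same interface, the sketch below simply memorizes, for each context window seen in ε-aligned training data (ε written as '-'), the most frequent output character — the degenerate fully-grown-tree case, not ID3 itself:

```python
from collections import Counter, defaultdict

def train_windowed(pairs, w=1):
    """pairs: (x, y) with len(x) == len(y) after epsilon-alignment ('-' = ε)."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        padded = "#" * w + x + "#" * w          # '#' marks the word boundary
        for i, out_ch in enumerate(y):
            context = padded[i:i + 2 * w + 1]   # the letter plus w chars each side
            counts[context][out_ch] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def predict(model, x, w=1):
    """Emit, per input letter, the learned output character (default: copy it)."""
    padded = "#" * w + x + "#" * w
    return "".join(model.get(padded[i:i + 2 * w + 1], x[i]) for i in range(len(x)))

# Two aligned toy pairs; stripping '-' from a prediction yields the stem.
model = train_windowed([("griba", "grib-"), ("gribēt", "grib--")])
```

An ID3 tree would generalize over such context features instead of memorizing them, but the input/output shape of the learner is the same.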

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

2.3. Hidden Markov Models

HMMs model a relationship between observable variables and hidden/unobservable variables (states) by assuming state transition probabilities and probabilities of state variables ‘leading to’, or emitting, observable variables. Their most prominent application in NLP is part-of-speech tagging [10]. In our situation, the observable variables are Latvian word form characters and the hidden variables are the corresponding stem characters (including ε). All transition and emission probabilities are estimated from the training data, which, as before, consists of aligned word form-stem sequence pairs. Then, given a new Latvian word form, we find the most probable sequence of state variables that may have caused this string of observations by applying the well-known Viterbi algorithm [11].

2.4. Ensemble of classifiers

Ensemble of classifiers methods have emerged rather recently in the machine learning community (e.g. [12]), with as yet only a few applications in NLP (e.g. [13]). The idea of this meta-learning approach is to combine the predictions of a multitude of classifiers using some agreement policy, e.g. (weighted) majority voting. A particular instantiation of this rationale, called ‘bootstrapping’, generates an ensemble of classifiers by training a single system on different subsets of the training data. In our situation, the individual classifiers are the learners and systems presented in the current section, with possibly different parametrizations (e.g. the length of context to be considered) of the same learning


algorithm counting as a different classifier. We then combine these systems via Bayesian optimization to obtain a meta-classifier. Moreover, we construct such a meta-classifier for T , T ∈ N, resamples of our original training data and combine these T meta-classifiers using majority voting to obtain a final meta-meta-classifier. A more precise description of this setup is given in the following section.
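The Viterbi decoding used by the HMM stemmers of Section 2.3 can be sketched as follows; the two-state model and all probabilities below are illustrative toys, not estimates from the aligned training data:

```python
# Viterbi: recover the most probable hidden state sequence for an observation
# sequence, given start, transition and emission probabilities.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # V[t][s] = (best probability of ending in state s at time t, best path)
    V = [{s: (start_p[s] * emit_p[s].get(obs[0], 0.0), [s]) for s in states}]
    for o in obs[1:]:
        layer = {}
        for s in states:
            layer[s] = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s].get(o, 0.0),
                 V[-1][prev][1] + [s])
                for prev in states)
        V.append(layer)
    return max(V[-1].values(), key=lambda t: t[0])[1]

# Toy model: each character is either copied into the stem or dropped (ε).
states = ("copy", "drop")
start_p = {"copy": 0.9, "drop": 0.1}
trans_p = {"copy": {"copy": 0.6, "drop": 0.4},
           "drop": {"copy": 0.1, "drop": 0.9}}
emit_p = {"copy": {ch: 0.2 for ch in "griba"},
          "drop": {"a": 0.9, "u": 0.1}}   # only suffix vowels may be dropped
path = viterbi("griba", states, start_p, trans_p, emit_p)
# path drops the final 'a' of griba, leaving the stem grib
```

In the actual stemmers, the hidden states are stem characters (including ε) rather than this two-state copy/drop simplification, and the probabilities come from the aligned training pairs.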

3. Experiments

As mentioned in the preceding section, we tackle our statistical stemming task by constructing an ensemble of classifiers approach along two dimensions. On the first, we generate T bootstrap samples Ti, i = 1, …, T, of our original training data T of size N, where each subsample has the same size as T but is sampled with replacement. On each of these samples we train a meta-classifier (second dimension), which consists of k ∈ {1, …, 6} classifiers taken from the set of six data-driven classifiers (four decision tree classifiers for window sizes of 1 to 4, and two HMMs, one scanning the input from left to right and one from right to left) presented in Section 2. The individual classifiers for a particular bootstrap sample are drawn randomly, and their combination is implemented by solving the optimization problem


y* ∈ argmax_y P(Y = y | X = x, C1 = c1, …, Ck = ck)

character by character, where Y is the current stem character, X is the current word form character, and C1 to Ck are the classifications of classifiers 1 to k with regard to x. The conditional probabilities are estimated using a separate development set. Finally, the T meta-classifiers are combined via majority voting, which yields the resulting meta-meta-classifier.

In our particular situation, our total data consisted of approximately 900 Latvian word form-stem pairs, which we subdivided into training, development, and test sets of sizes 600, 150, and 150, respectively. As said before, this data was generated semi-automatically, using the handcrafted affix removal stemming algorithm presented in Section 1 and subsequent manual correction.

Table 2 and Figure 1 summarize the results. These show that the decision tree learners are more effective in the stemming task than the HMMs (92%/67% vs. 83%/50% character/word level accuracy for DT-3 vs. HMM-backward), obviously due to the former’s ability to look both forward and backward in the input sequence when making a new stem character decision. Moreover, the results show that combining classifiers induces dramatic performance increases. The ensemble alone, ignoring bootstrap replicates, accounts for a 25% word level accuracy improvement over the best single classifier (83.8% for the joint combination of all six classifiers vs. 67.0% for DT-3). The table also shows that each sort of combination can be beneficial, even if the included systems themselves exhibit rather poor performance; for example, joining the two HMMs to the decision tree learners yields a 7% word level accuracy increase.
Moreover, Figure 1 demonstrates the positive effect of the bootstrap replicates and the majority-based meta-classifier combination; for each level of k, the meta-meta-classifier's performance increases significantly with the number T of bootstrap resamples considered. The best-performing system (k = 6, T = 7) exhibits average character/word


S. Eger and I. Sējāne / Classifiers Methodology for Stemming in Inflectional Languages

Classifier(s)                  Char.   Word    Edit Dist.
HMM-forw.                      78.5    30.9    0.35
HMM-back                       83.4    50.9    0.35
DT-1                           82.2    36.1    0.35
DT-2                           90.1    63.8    0.30
DT-3                           92.1    67.0    0.29
DT-4                           92.0    65.8    0.30
meta-classifier combination    85.1    51.6    0.34
meta-classifier combination    96.0    78.0    0.28
meta-classifier combination    94.8    79.3    0.28
all six classifiers            97.1    83.8    0.27

Table 2. The six classifiers involved in the Latvian stemming task and their performances on our test set when trained on the training and development sets as mentioned in the text. For this evaluation there are no bootstrap replicates, and the performances are only those of the classifiers and meta-classifiers, not of the meta-meta-classifier. 'Edit Dist.' is the average edit distance between the actual and desired output sequences.


level accuracies of 97.7%/85.5%, which is roughly a 2% increase over the combination of all six classifiers without replicates.

Figure 1. Character level accuracy curves when choosing combinations of k = 2, 3, 4, 5, 6 classifiers for each bootstrap replicate t = 1, ..., T (T = 10 here). Interestingly, the bootstrap replicates seem to increase the total system performance more strongly for low values of k; moreover, the figure suggests that the bootstrap gain levels off after T = 6 or T = 7. All shown values are averages over 15 runs.

Overall, we see that the best-performing classifier combinations achieve very high accuracy values, with only some minor problems, e.g. for 'irregular' verbs that, firstly, exhibit rather strong variation with respect to vowel and consonant alternation and, secondly, were underrepresented in our data (approximately 40 irregular verbs among the 900 word form-stem pairs), so that the majority of the data shows different statistical properties. This leads us to the conclusion that even better performance could be achieved by splitting our training sets by word class (or even by subclasses of word classes), so that the amount of contradictory training data would be reduced.




4. Conclusions

In this paper, we have presented a stemming methodology that is based on both a handcrafted rule-based system and data-driven machine learning approaches. The rule-based system models phenomena of inflectional languages in a linguistically sound and consistent way. While this stemmer can be used on its own, it may also serve as a supplier of training data for our statistical modeling. This relies on two assumptions which are quite natural in the context of stemming and many other NLP applications such as grapheme-to-phoneme conversion (e.g. [14]) and (arguably) lemmatization (e.g. [15], [16]), namely that the output sequence is not longer than the input sequence and that the orderings of input and output sequence characters are 'similar'. Under these conditions, we train several machine learning algorithms and show that very good results for stemming in Latvian can be obtained by combining them via bootstrapping and ensemble-of-classifiers methods. In future work, we plan to investigate a hybrid system consisting of several machine learning algorithms (including very sophisticated learners such as conditional random fields and maximum entropy models) and the rule-based stemming algorithm presented here, from which we expect even better performance and more insight into the interplay of linguistically motivated systems and statistical models. Moreover, there are indications that a separation of training data by word classes could reduce the error of the systems by diminishing the amount of contradictory data.

References
[1] R. Hooper, C. Paice, What is Stemming?, 2005. Web. 10 June 2010. http://www.comp.lancs.ac.uk/computing/research/stemming/general/index.htm.
[2] M.F. Porter, An algorithm for suffix stripping, Program 14(3) 1980, 130–137.
[3] I. Sējāne et al., Stemming. Methoden und Ergebnisse, 2004. Web. 5 July 2010. http://www.cl.uni-heidelberg.de/weissman/Ausarb.pdf.
[4] K. Kreslins, A stemming algorithm for Latvian (doctoral thesis), Loughborough University, 1996.
[5] W.B. Frakes, C.J. Fox, Strength and Similarity of Affix Removal Stemming Algorithms, ACM SIGIR Forum 37(1) 2003, 26–30.
[6] F.J. Och, H. Ney, A comparison of alignment models for statistical machine translation, Proceedings of COLING, 2000, 1086–1090.
[7] S. Abney, Semisupervised Learning for Computational Linguistics, Chapman & Hall/CRC, 2008.
[8] O.G. Kalles et al., Decision Trees and NLP: A Case Study in POS Tagging, Proceedings of ACAI, 1999.
[9] J.R. Quinlan, Induction of Decision Trees, Machine Learning 1 1986, 81–106.
[10] C. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, 1999.
[11] G.D. Forney Jr., The Viterbi Algorithm, Proceedings of the IEEE 61(3) 1973, 268–278.
[12] T.G. Dietterich, Machine-Learning Research: Four Current Directions, The AI Magazine 18(4) 1998, 97–136.
[13] H. van Halteren et al., Improving Accuracy in NLP Through Combination of Machine Learning Systems, Computational Linguistics 27(2) 2001, 199–229.
[14] F. Mana et al., Using machine learning techniques for grapheme to phoneme transcription, EUROSPEECH-2001, Aalborg, Denmark, September 2001, 1915–1918.
[15] M. Dreyer et al., Latent-Variable Modeling of String Transductions with Finite-State Methods, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Honolulu, Hawaii, 2008, 1080–1089.
[16] S. Eger, I. Sējāne, Adiuvaris, 2009. Web. 8 July 2010. http://www.karl-steinbuch-stipendium.de/fileadmin/_steinbuch/downloads/Abschlussbericht_Eger.pdf.


Human Language Technologies – The Baltic Perspective, I. Skadiņa and A. Vasiļjevs (Eds.), IOS Press, 2010. © 2010 The authors and IOS Press. All rights reserved. doi:10.3233/978-1-60750-641-6-225


Using Syllables As Indexing Terms in Full-Text Information Retrieval


Kimmo KETTUNEN a,1, Paul MCNAMEE b and Feza BASKAYA c
a Kymenlaakso University of Applied Sciences, Finland
b Johns Hopkins University, USA
c University of Tampere, Tampere, Finland

Abstract. This paper describes empirical results of information retrieval in 13 languages of the Cross-Language Evaluation Forum (CLEF) collection, augmented with results for Turkish, using syllables as a means to manage morphological variation in the languages. This kind of approach has been used in speech retrieval [1], but for some reason it has not been much tried out in text-based IR, although it has several clear advantages. Firstly, a quite well-working version of it can be implemented with a very simple syllabification algorithm consisting only of variants of one syllable structure rule, CV (consonant-vowel). Secondly, although syllable-based word form variation management resembles n-gramming [2], it has the advantage that the number of grams with syllables is more restricted, which keeps the size of the text index smaller and retrieval faster. Thirdly, a syllable-based approach makes it possible to use different types of syllabification procedures, which can be either very fine-grained, i.e. language-specific, or very coarse, i.e. more language-independent. Fourthly, syllable-based methods work for both speech and text retrieval. Our results show that the two different CV syllabification procedures produced good results with four morphologically complex languages of the CLEF collection, and also with Turkish. For three of the languages that got good results with the CV syllabification (De, Fi and Tu), we also tried accurate, language-specific syllabification procedures. Accurate syllabification was not able to produce as good IR results as the CV procedures, but it was not far behind in performance.
Keywords. full-text information retrieval, syllables, management of word form variation, syllables as index terms

Introduction

Variation of word forms in natural languages has been one of the problems of full-text retrieval since the beginning of computerized textual information retrieval. Several different methods for the management of variation have been proposed, including stemming, lemmatization, n-gramming, truncation and different phonetic transformations, such as Soundex [3, 4]. Lately, automatic morphological methods (i.e. non-supervised or semi-supervised) have been tried out with some success, but the old methods, like stemming with human-written rules, are still in full and effective use in text IR.

1 Corresponding Author: Kymenlaakso University of Applied Sciences, Paraatikenttä 7, FIN-45100 Kouvola, Finland; E-mail: [email protected]





Most of the word form variation management methods are based on character-level manipulation of words. Simple stemmers prune the word forms more or less well; lemmatizers use a sophisticated combination of linguistic rules and dictionary representations of word forms to deduce the base forms of inflected words. The crudest systems may only truncate words to beginnings or endings of a certain length, and n-gramming techniques reduce the words to overlapping character strings. Still, these crude, semantically unaware character-level methods work amazingly well, as shown for example in McNamee, Nicholas and Mayfield [4]. In this study we focus on a rarer character-level (or sub-word unit) method for word form variation management that has a linguistic basis, namely syllabification of words. Syllables can be seen as the next level of word structure after sounds, which are represented as alphabet characters in written language.2 Syllables also give a suitable handle on the structure of written words: the number of allowed syllables in each language is limited to tens (abstract syllable types) and from hundreds to thousands (concrete syllable tokens) [5]. Although rules for syllabification are quite limited in many languages, automatic syllabification of words is challenging, mainly because the syllable is not easy to define precisely linguistically. Consequently, no accepted standard algorithm for automatic syllabification exists, and syllabifications for the same word may vary. There are two approaches to the problem of automatic syllabification: rule-based and data-driven. A rule-based method implements some theoretical position regarding the syllable, whereas the data-driven paradigm tries to infer new syllabifications from examples assumed to be correctly syllabified already. A typical example of a rule-based orthographic syllabification algorithm is the Finnish hyphenation algorithm described in Karlsson [6].
It consists of 8 abstract syllable structure rules which, as an implementation, produce about 95% recall and over 99% precision for syllabification of the Finnish test corpus. Bouma [7] reports on a trial of Dutch hyphenation using finite state transducers. The simplest method achieves an accuracy of 94.5%, and two others, TEX and TBL, 99.8% and 99.1%. Adsett and Marchand [8] discuss the data-driven approach to automatic syllabification. They compare five different data-driven syllabification procedures across nine European languages. All the algorithms achieve a mean word accuracy of over 90% across all lexicons; the best algorithms achieve a mean accuracy of 95–96.8%. A detailed analysis of data-driven vs. rule-based syllabification of one language, English, is given in Marchand, Adsett and Damper [9, 10]. Their results imply that data-driven syllabification works better than rule-based syllabification, at least for English, in both the pronunciation and spelling domains. Syllabification by analogy is the best data-driven method in both domains. Bartlett et al. [11], however, show that syllabification with a structured support vector machine (SVM) performs better than syllabification by analogy. SVM syllabification achieves word accuracies varying from 86.7–90% when compared to CELEX data for English. SVMs for Dutch and German achieve word accuracy percentages of 98.2–99.8. It should be emphasized that although automatic syllabification looks like a simple procedure, it is not, which can be deduced from the accuracy figures. For many other levels of linguistic analysis (e.g. morphology, syntax) there are so-called gold standards against which you can test your automatic procedure, but syllabification

2 Strictly linguistically speaking, syllables are phonological, not orthographic, units, but we are concerned here only with orthographic syllabification, which is also often referred to as hyphenation in different publications [11].



lacks these resources [8, 11]. In our case the problem of highly correct syllabification is not that severe, since our application domain, information retrieval, itself allows quite a lot of noise and presents an application field where a "good enough" result can be very useful. Syllable-based retrieval has been used a lot in speech retrieval. According to different publications, it has worked quite well in spoken document retrieval [12, 13]. Larson and Eickeler [1] have used syllable-based indexing and language models in German spoken document retrieval. They also test the approach with text documents, and the best performance for both types of documents is achieved with syllable 2-grams. A slightly similar type of approach is presented in Gouvea and Raj [14]. They introduce a search system where indexing is based on "particles", which are phonetic in nature and "comprise sequences of phonemes that together compose the actual or putative pronunciation of documents and queries". The idea of a particle can be applied both to written and spoken documents. Although the idea is somewhat similar to syllable-based retrieval, the authors emphasize that particles are not syllables. The main goal of our research is to test whether orthographic syllabification can work as an effective means for the management of morphological variation in a number of different languages and their full-text IR collections. We proceed in a two-fold way: first we test how two variants of a simple and naïve syllabification approach, consisting of only one syllable rule, work with the languages. After that we test more elaborate language-specific syllabifiers with a smaller number of the same languages.

Copyright © 2010. IOS Press, Incorporated. All rights reserved.

1. Data

Our empirical test material for the IR runs consists of materials for 14 languages. The Cross-Language Evaluation Forum (CLEF) has IR collections available for 13 European languages: Bulgarian, Czech, Dutch, English, Finnish, French, German, Hungarian, Italian, Portuguese, Russian, Spanish and Swedish. The sizes of the collections vary from ~17,000 to 450,000 documents. The number of topics for each collection is between 50 and 367 [4]. Retrieval experiments in the 13 CLEF languages were conducted using the HAIRCUT text retrieval engine [2], which adopts a statistical language model of retrieval and supports a variety of tokenization choices. HAIRCUT has previously been used to achieve state-of-the-art results on multiple CLEF test sets. For Turkish, one IR collection is available, the so-called Milliyet material, which consists of newspaper articles published in the Turkish newspaper Milliyet. The size of the collection is 408,305 documents, and it has 72 topics [15]. For the Turkish collection our search engine was Lemur.

2. Results

To get a start and a baseline for our syllable approach, we used two very simple rules for splitting words up into sequences of "syllables"3 for all the languages: 1) scan left to

3 "Syllables" is in quotes because the CV procedures produce both valid and invalid syllable sequences. Perhaps a proper name for the entities could be syllagrams, i.e. syllable-like



right and insert a split after a vowel that immediately follows a consonant; 2) put a syllable juncture before every CV sequence. These two algorithms produce the following syllabifications:

CV_1: ca + rbo + hy + dra + te + s; do + gs; go + es
CV_2: car + bo + hyd + ra + tes; dogs; goes

A CV syllable structure is usually a basic one in many languages and even the only one in some languages [17]. Thus the CV procedures were a natural starting point for testing whether syllabification can work as a basis for word form variation management. We applied the CV_1 and CV_2 syllabification algorithms to all 13 languages of the CLEF collection and to the Turkish material. From the sequence of syllables for each word, we created indexing terms based on: (1) single syllables; (2) bigrams of syllables; and (3) trigrams of syllables. Keywords in the queries were handled in the same manner when the queries were run. Tables 1 and 2 show the results of the CV_1 and CV_2 runs for the 13 CLEF languages and Turkish. The best result for each language is emphasized.
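The two rules are simple enough to restate in code. The sketch below is our reconstruction, not the authors' implementation; it treats 'y' as a vowel (as the 'hy' split in the example implies), reproduces the example syllabifications, and derives the syllable n-gram indexing terms just described:

```python
VOWELS = set("aeiouy")  # 'y' treated as a vowel, as the 'hy' split implies

def cv1(word):
    """CV_1: scan left to right and insert a split after a vowel
    that immediately follows a consonant."""
    syllables, start = [], 0
    for i, ch in enumerate(word):
        if ch in VOWELS and i > 0 and word[i - 1] not in VOWELS:
            syllables.append(word[start:i + 1])
            start = i + 1
    if start < len(word):
        syllables.append(word[start:])
    return syllables

def cv2(word):
    """CV_2: put a syllable juncture before every word-internal
    consonant-vowel (CV) sequence."""
    syllables, start = [], 0
    for i in range(1, len(word) - 1):
        if word[i] not in VOWELS and word[i + 1] in VOWELS:
            syllables.append(word[start:i])
            start = i
    syllables.append(word[start:])
    return syllables

def syllable_ngrams(syllables, n):
    """Indexing terms: overlapping n-grams of adjacent syllables."""
    if len(syllables) < n:
        return ["".join(syllables)]
    return ["".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]

print(cv1("carbohydrates"))  # ['ca', 'rbo', 'hy', 'dra', 'te', 's']
print(cv2("carbohydrates"))  # ['car', 'bo', 'hyd', 'ra', 'tes']
print(syllable_ngrams(cv1("carbohydrates"), 2))
# ['carbo', 'rbohy', 'hydra', 'drate', 'tes']
```

Note that the bigram terms produced this way are mostly 4–5 characters long, which becomes relevant in the discussion of n-gram lengths below.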


Table 1. Results of CV_1 and CV_2 syllabification runs for 14 languages, title and description queries, mean average precision (MAP)

     words  snow   4      syl1_CV1  syl2_CV1  syl3_CV1  syl1_CV2  syl2_CV2  syl3_CV2
BG   0.22   N/A    0.31*  0.21      0.22      0.10      0.21      0.20      0.10
CS   0.23   N/A    0.33*  0.18      0.26      0.16      0.19      0.27      0.19
DE   0.33   0.37   0.41*  0.28      0.39      0.29      0.30      0.38      0.24
EN   0.41   0.44*  0.40   0.21      0.38      0.27      0.23      0.35      0.20
ES   0.44   0.49*  0.46   0.24      0.45      0.31      0.22      0.43      0.29
FI   0.34   0.43   0.50*  0.30      0.46      0.38      0.27      0.43      0.31
FR   0.36   0.40*  0.38   0.20      0.37      0.25      0.23      0.34      0.22
HU   0.20   N/A    0.38*  0.20      0.32      0.23      0.18      0.29      0.18
IT   0.38   0.42*  0.37   0.18      0.39      0.26      0.17      0.37      0.26
NL   0.38   0.40   0.42*  0.26      0.38      0.25      0.29      0.36      0.23
PT   0.32   N/A    0.34*  0.17      0.33      0.20      0.17      0.30      0.16
RU   0.27   N/A    0.34*  0.28      0.24      0.13      0.26      0.26      0.15
SV   0.34   0.38   0.42*  0.26      0.41      0.31      0.25      0.37      0.26
TU   0.19   0.22   0.31*  0.17      0.30      0.22      0.21      0.26      0.20

Table legend: words = surface forms (lower-cased); snow = Snowball stemmer; 4 = overlapping, word-spanning character 4-grams; syl1 = single syllables; syl2 = syllable bigrams; syl3 = syllable trigrams; * = best result for the language.

sequences of characters. However, some linguists have adopted a so-called strict CV theory, which claims that all languages have only CV syllables [16].



Table 2. Averages and changes from the plain words baseline

          words  snow   4      syl1_CV1  syl2_CV1  syl3_CV1  syl1_CV2  syl2_CV2  syl3_CV2
Avg-8     0.37   0.42   0.42   0.24      0.40      0.29      0.25      0.38      0.25
Chg-8 %   N/A    11.47  13.31  -34.69    7.89      -22.18    -34.14    1.60      -32.80
Avg-A     0.32   N/A    0.39   0.23      0.35      0.24      0.23      0.33      0.21
Chg-A %   N/A    N/A    20.54  -29.15    8.69      -25.25    -29.42    3.40      -33.75

Table legend: Avg-8 is the average over the 8 'Snowball' languages, i.e. languages that had a Snowball stemmer available; Chg-8 is the change over plain words for the Snowball-language average; Avg-A is the average over all the CLEF data; Chg-A is the change over plain words for the CLEF data average.

Firstly, the results of the CV_1 runs show that single syllables and trigrams were both ineffective, except in the languages that were most complex or had significant compounding. Syllable bigrams, however, seemed to work quite well for many languages. We measured mean average precision and found statistically significant relative gains vs. surface forms in four languages using syllable bigrams with the CV_1 procedure: German (+18.5%), Finnish (+34.8%), Hungarian (+60.4%), and Swedish (+19.9%). With Turkish, the CV_1 procedure with syl2 performed at the same level as 4-grams, which is interesting. There were also several cases where syllable bigrams with CV_1 runs showed a slight improvement over inflected word forms which wasn't statistically significant (CS, ES, FR, IT and PT). Of the four languages that achieved statistically significantly better results with bigram syllables, three (FI, SV and DE) also slightly outperformed Snowball stemming. This can be considered interesting, as stemming with a Snowball stemmer has often been shown to perform very well with morphologically more complex languages [18, 3]. Simple syllabification was not able to outperform the use of 4-grams, which was overall the most effective method of keyword variation management for all the languages except Italian. This is consistent with the results of McNamee and Mayfield [2], McNamee [19] and McNamee, Nicholas and Mayfield [4]. The effectiveness of n-gramming, however, is achieved at the cost of huge text indexes, which often makes the use of n-gramming impractical. The results of McNamee, Nicholas and Mayfield [4] show that n-gram indexing with 4-grams can consume up to three times as much index storage, and queries can take seven times as long to execute, when compared to plain words (these figures with English data).
The results of the CV_2 runs for the 13 CLEF languages and Turkish showed that the CV_2 procedure is not as good an option as CV_1; it usually performed slightly below CV_1. In many cases the syl1 results of CV_2 runs outperform the syl1 results of CV_1 runs, but as syl1 does not perform that well in general, this is not very interesting. With Czech and Russian the syl2 results of CV_2 runs were slightly better than the results of CV_1 runs, but with the other languages lower. Overall, the syl2 results of the CV_2 runs perform the best, just as with CV_1. With morphologically complex languages the CV_2 results with syl2 also clearly outperform plain words. This further confirms that syllable bigrams seem to offer the best solution if syllables are used as a means to manage word form variation. After the initial results we decided to concentrate on those languages that got good results with the CV procedures and were also morphologically interesting. We had




available more fine-grained proper syllabifiers for three languages, DE, FI and TU,4 and we tried out how proper syllabification works for these languages. For Finnish, German and Turkish, proper syllabification did not perform any better than the CV procedures. For Finnish, uni-, bi- and trigrams achieved MAPs of 0.28, 0.44 and 0.33; for German, 0.31, 0.36 and 0.23; for Turkish, 0.21, 0.27 and 0.20. Once again bigrams were the best way to construct both the index and the queries for all the languages. The tendency seemed to be that properly syllabified unigrams performed slightly better, and bi- and trigrams slightly worse, than with the CV_1 procedures for each language. With Finnish and Turkish the proper syllabification results were at best slightly better than the CV_2 results. However, proper syllabification was able to outperform Snowball stemming slightly in Finnish and clearly in Turkish. With Finnish and Turkish the difference to plain words with proper syllabification and bigramming was clear, and in German it was 3 per cent.


3. Discussion and conclusions

Our aim in this research was to study whether syllabification can be used effectively in the management of word form variation in full-text retrieval for different languages. As our test data we had CLEF collections for 13 different European languages and a separate Milliyet collection for Turkish. We first tried a crude CV syllabification procedure for all of the languages. Two versions of this procedure put a hyphen either after a CV sequence or before it. The results showed that in both variants the CV procedure was able to perform quite well when the textual indexes were built from bigrams of CV syllables. However, CV_1 was clearly better than CV_2: CV_2 clearly outperformed plain words in most of the languages, but stayed 2–3 per cent behind CV_1 for most of them. We achieved good results with the CV procedures in four of the CLEF languages that were also morphologically complex (DE, FI, HU and SV) and in Turkish. The Turkish results were especially good, as the CV_1 procedure performed at the same level as 4-grams. After the initial results we tested three languages with elaborate language-specific syllabifiers. The results showed that language-specific syllabification was not able to outperform the simple CV procedure. Having presented our results, we can now turn to the question of why syllables should work at all as indexing and query terms that take reasonably good care of the morphological variation found in different languages. McNamee [19, 4] has studied the question of why n-grams have a performance advantage over plain words. He designed an experiment to remove morphological regularity from words by shuffling the characters of words randomly, and obtained results strongly suggesting that the fundamental reason why n-grams are effective is that they control for morphological variation.
According to him, this also explains a variety of previously observed phenomena about n-grams, namely:
• n-grams yield greater improvements in more morphologically complex languages;
• n-grams of lengths 4 and 5 (about the size of root morphemes) are most effective.

4 For German and Turkish the syllabifiers were implemented by the third author, for Finnish by the first.



We noted briefly earlier that syllables and n-grams resemble each other; the main difference is that syllables are of varying length, and when they are used there is not as much character overlap as with n-grams. Our preliminary analysis of the syllable lengths used in the index procedures also suggests that the most effective unit, bigrams, results in index word lengths between 4 and 5 for most of the languages (mean syllable lengths are 2.3 characters for CV_1 and 2.5 for CV_2). Although we have not yet run a test of randomized character shuffling and syllabification on the randomized data to confirm this hypothesis, we believe that the basic explanation of why syllables work in the management of word form variation is the same as with n-grams: they are able to control for morphological variation, and there is also an ideal length for the query and index terms made out of syllables. We wish to do more research in this respect later. Overall, our results show that syllables can be used effectively in the management of word form variation for different languages. They are not able to outperform 4-grams, but at best they perform at the same level as, or slightly better than, a Snowball stemmer for morphologically complex languages, such as Finnish, German, Hungarian, Swedish and Turkish. As the best results are achieved with a very simple syllabification and indexing procedure (CV_1 syllabification and bigram syllable indexing), we believe that the approach has some promise even in practical IR settings. We also believe that some language-typological factors affecting the results could be found (cf. Fenk-Oczlon and Fenk [20]). This aspect needs more consideration, and we wish to continue this work later on as well.
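The arithmetic behind the 4-to-5-character observation is direct: a syllable bigram concatenates two adjacent syllables, so its expected length is roughly twice the mean syllable length reported above (character overlap between adjacent bigrams aside):

```latex
\ell_{\text{bigram}} \approx 2\,\bar{\ell}_{\text{syl}} =
\begin{cases}
2 \times 2.3 = 4.6 \text{ characters} & \text{(CV\_1)}\\[2pt]
2 \times 2.5 = 5.0 \text{ characters} & \text{(CV\_2)}
\end{cases}
```

Both values fall inside the 4–5 character window that McNamee [19] found most effective for n-grams, which is consistent with the explanation offered above.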

References


[1] M. Larson and S. Eickeler, Using syllable-based indexing features and language models to improve German spoken document retrieval, Proceedings of Eurospeech, 8th European Conference on Speech Communication and Technology (2003). Retrieved 15 May, 2010, from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.124.4455&rep=rep1&type=pdf.
[2] P. McNamee and J. Mayfield, Character n-gram tokenization for European language text retrieval, Information Retrieval 7 (2004), 73–97.
[3] K. Kettunen, Reductive and generative approaches to management of morphological variation of keywords in monolingual information retrieval – an overview, Journal of Documentation 2 (2009), 267–290.
[4] P. McNamee, C. Nicholas and J. Mayfield, Addressing morphological variation in alphabetic languages, Proceedings of the 32nd Annual International Conference on Research and Development in Information Retrieval (SIGIR-2009), Boston, MA, 75–82.
[5] F. Pellegrino, C. Coupé and E. Marsico, An information theory-based approach to the balance of complexity between phonetics, phonology and morphosyntax (2007). Retrieved May 5, 2010, from http://www.ddl.ish-lyon.cnrs.fr/fulltext/pellegrino/Pellegrino_2007_PCM_LSA.pdf.
[6] F. Karlsson, Automatic hyphenation of Finnish, in: F. Karlsson (ed.), Computational Morphosyntax. Report on Research 1981–1984, Publications of the Department of General Linguistics, University of Helsinki, 13 (1985), 93–113.
[7] G. Bouma, Finite state methods for hyphenation, Natural Language Engineering 9 (2003), 5–20.
[8] C. Adsett and Y. Marchand, A comparison of data-driven automatic syllabification methods, in: J. Karlgren, J. Tarhio and H. Hyyrö (eds.), String Processing and Information Retrieval, 16th International Symposium, SPIRE 2009, Springer, Heidelberg, 174–181.
[9] Y. Marchand, C. Adsett and R. Damper, Evaluating automatic syllabification algorithms for English (2007). Retrieved April 14, 2010, from http://eprints.ecs.soton.ac.uk/14285/1/MarchandAdsettDamper_ISCA07.pdf.
[10] Y. Marchand, C. Adsett and R. Damper, Automatic syllabification in English: a comparison of different algorithms, Language and Speech 52 (2009), 1–27.





[11] S. Bartlett, G. Kondrak and C. Cherry, Automatic syllabification with structured SVMs for letter-to-phoneme conversion, in: Proceedings of ACL-08: HLT, Columbus (2008), 568–576.
[12] K. Ng and V.W. Zue, Subword-based approaches for spoken document retrieval, Speech Communication 32 (2000), 157–186.
[13] H.-M. Wang, Experiments in syllable-based retrieval of broadcast news speech in Mandarin Chinese, Speech Communication 32 (2000), 49–60.
[14] E.B. Gouvea and B. Raj, Word particles applied to information retrieval, in: M. Boughanem, C. Berrut, J. Mothe and C. Soule-Dupuy (eds.), Advances in Information Retrieval, 31st European Conference on IR Research, ECIR 2009, Springer, Heidelberg, 424–436.
[15] F. Can, S. Kocberber, E. Balcik, C. Kaynak, H.C. Ocalan and O.N. Vursavas, Information retrieval on Turkish texts, Journal of the American Society for Information Science and Technology 59 (2008), 407–421.
[16] H. van der Hulst and N.A. Ritter (eds.), The Syllable: Views and Facts, Mouton de Gruyter, Berlin, 1999.
[17] I. Maddieson, Chapter 12: Syllable structure, in: The World Atlas of Language Structures Online (2008). Retrieved 30 April, 2010, from http://wals.info/feature/12.
[18] E. Airio, Word normalization and decompounding in mono- and cross-lingual IR, Information Retrieval 9 (2006), 249–271.
[19] P. McNamee, Textual Representations for Corpus-Based Bilingual Retrieval, PhD thesis, University of Maryland Baltimore County (2008). Retrieved 4 May, 2010, from http://apl.jhu.edu/~paulmac/publications/thesis.pdf.
[20] G. Fenk-Oczlon and A. Fenk, Cognition, quantitative linguistics, and systemic typology, Linguistic Typology 3 (1999), 151–177.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-233


Comparison of the SemTi-Kamols and Tesnière’s Dependency Grammars

Gunta NEŠPORE, Baiba SAULĪTE, Guntis BĀRZDIŅŠ and Normunds GRŪZĪTIS
Institute of Mathematics and Computer Science, University of Latvia

Abstract. The dependency approach, originally developed by Lucien Tesnière, has become a popular model of syntactic representation. However, state-of-the-art dependency parsers and annotation schemes typically discard some relevant features of Tesnière’s original model, retaining only the concept of dependency relations between individual words. The SemTi-Kamols grammar model attempts to return to Tesnière’s key concepts, supplementing the approach with some additional features. The aim of this paper is to explain how the SemTi-Kamols approach relates to and differs from Tesnière’s original approach. Keywords. Syntactic Representation, Dependency Grammar, Structural Syntax, Parsing, Treebank


Introduction

The dependency approach, originally developed by Tesnière before 1959 [1], has become a popular model of syntactic representation in recent years. Dependency relations can be mapped to the semantic representation of a sentence more straightforwardly than phrase structure trees can, and this approach is therefore more appropriate for information extraction and other tasks of semantic analysis. Analytical constructions (consisting of more than one word but occupying one syntactic position, e.g., analytical verb forms and prepositional phrases) are frequently used in many languages, even in highly synthetic languages with rich morphology and rather free word order (such as the Baltic and Slavic languages). To represent such constructions, the pure dependency approach, in which each running word is involved in a separate dependency relation, is not well suited: there are no dependency relations between the elements of an analytical construction. It has already been pointed out that state-of-the-art dependency parsers and annotation schemes typically discard some relevant features of Tesnière’s original model, retaining only the concept of dependency relations between individual words [2]. The consequence is that each token in a sentence is represented by a separate node in the corresponding dependency graph, i.e., content words and function words are not distinguished; both are implicitly treated as syntactic nuclei (in Tesnière’s terms). Although such a simplification allows efficient parsing algorithms to be applied [3] and requires relatively small treebanks to achieve representativeness, it makes the representation non-compliant with the linguistic tradition: an adequate representation of, for example, analytical word forms and coordination is hardly possible if the grammar model is simplified.


G. Nešpore et al. / Comparison of the SemTi-Kamols and Tesnière’s Dependency Grammars

In the annotation scheme of the leading dependency treebank, the Prague Dependency Treebank, this issue has been addressed by splitting grammatical annotations among several layers [4]. At the deep, or tectogrammatical, level, constituents of analytical word forms and coordinations are collapsed under shared nodes; at the surface level, however, dependency links are still artificially drawn among function and content words. A Tesnière-compliant representation, Tesnière’s Dependency Structure (TDS), and a converter from Penn Treebank phrase structure trees [5] to TDS have recently been proposed by Sangati and Mazza [2]. Their approach covers all of Tesnière’s key concepts: the nuclei, the operations of junction and transference, and, of course, the dependency relations. Thus, it provides means for a more adequate representation of conjoined structures and analytical word forms. However, Tesnière’s model, and thus the TDS, has a rather significant drawback: the dependency types are highly simplified. In the original model, the dependency links themselves are actually anonymous; the functional labels (generalized part-of-speech categories) are assigned to each nucleus. Although such an approach keeps the analysis simple, it represents only the very general structure of a sentence. There is at least one more dependency model, the SemTi-Kamols grammar and parser for Latvian [6], that implements Tesnière’s key concepts with some modifications. Although this approach was initially positioned as a dependency-based hybrid model that incorporates elements of the constituency approach, from Tesnière’s point of view it can actually be seen as a pure dependency approach and also as an extension of the TDS representation. A brief comparison of the SemTi-Kamols approach with other formalisms (HPSG, TIGER) is given in [6]. The aim of this paper is to show how it relates to and differs from Tesnière’s original approach.


1. Tesnière’s approach

In Tesnière’s structural syntax, the basic categories are syntactic relations (connexions structurales), junction (jonction) and transference (translation) [1:335]. Syntactic relations are based on the dependencies between the components of the sentence: governors (régissant) and subordinates (subordonné).

Figure 1. Nucleus and nucleus dissocié in dependency relations.

Developing the notion of the node in dependency syntax, Tesnière introduced the concept of the syntactic nucleus (nucléus). A nucleus is a functional syntactic unit; apart from the node, it can contain additional elements. Figure 1 shows two types of nuclei: a nucleus (parle ‘speaks’) and a nucleus dissocié (est arrivé ‘has arrived’).


Nuclei are content words or syntactically inseparable units that are acquired via morpho-syntactic transference and are treated as a whole. Dependency relations in this approach generally hold at the level of nuclei, not at the surface level. Both transference and junction are derived syntactically. Junction combines several elements connected by horizontal relations (coordination and apposition) into one unit. Transference is an operation in which transferers (function words) change the original category (function) of a content word to another one. Figure 2 shows horizontal relations between two nuclei belonging to the same category (Alfred, Bernard). These nuclei are linked by an extra-nuclear element (et).

Figure 2. Junction.

Figure 3 shows an example of first-degree transference (le livre de Pierre ‘Pierre’s book’): the noun Pierre takes on the function of an adjective via the transferer de.

Figure 3. Transference.

Although Tesnière primarily focuses on the syntactic functions in the sentence, to some extent he mixes together the morphological and syntactic levels of analysis, translating one part of speech into another in order to group the words in a sentence by their function. For example, nouns in the genitive case are translated into adjectives to assign them an attributive function, since it is considered that a noun cannot be a direct dependant of another noun.

2. SemTi-Kamols approach

The SemTi-Kamols dependency grammar implements Tesnière’s key concept of the syntactic nucleus, exploiting a mechanism that is similar to the transference operation [6:14].


For this purpose, the concept of an “x-word” has been introduced; it is the core idea of the approach and is also similar to the concept of blocks in the TDS approach. From the phrase structure perspective, x-words can be viewed as non-terminal symbols, and as such they substitute, during the parsing process, for all the entities forming the respective constituents. From the dependency perspective, x-words are treated as regular words, i.e., an x-word can act as a head for depending words and/or as a dependent of another head word [6:14]. Like Tesnière’s syntax, the SemTi-Kamols approach is based on dependency pairs between syntactic units, i.e., elements taking one syntactic position in the dependency relations. Overall, in the SemTi-Kamols grammar, similarly to Tesnière’s approach, the two dimensions of syntactic analysis are merged in the same model:

• the vertical dependency relations between the subordinated parts of the sentence (simple words that are directly involved in the dependency relations);
• the horizontal non-dependency relations among the constituents of x-words (unlike words in dependency relations, the constituents of x-words are required to conform to a particular word order).

2.1. Dependency pairs

The SemTi-Kamols model is based on dependency pairs in which a subordinate element with particular morphological features is attached to the governor regardless of its position (in terms of word order) in the sentence.
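A dependency-pair rule of this kind can be sketched in code. The following is a hypothetical illustration, not the actual SemTi-Kamols implementation: a subordinate is attached to a governor when both sides satisfy the rule’s morphological feature constraints, irrespective of word order. All names and feature labels are illustrative.

```python
# Hypothetical sketch of a SemTi-Kamols-style dependency pair rule:
# a governor attaches a subordinate whose morphological features match,
# regardless of linear position. Names and features are illustrative only.

from dataclasses import dataclass, field

@dataclass
class Word:
    form: str
    features: dict          # e.g. {"pos": "noun", "case": "gen"}
    dependents: list = field(default_factory=list)

def matches(features, required):
    """True if the word's features include all required key:value pairs."""
    return all(features.get(k) == v for k, v in required.items())

def attach(governor, subordinate, rule):
    """Attach subordinate to governor if both sides satisfy the rule."""
    if matches(governor.features, rule["governor"]) and \
       matches(subordinate.features, rule["subordinate"]):
        governor.dependents.append(subordinate)
        return True
    return False

# A noun in the genitive case may depend directly on another noun
# (no transference is needed, unlike in Tesniere's model):
genitive_attribute = {
    "governor": {"pos": "noun"},
    "subordinate": {"pos": "noun", "case": "gen"},
}

konference = Word("konference", {"pos": "noun", "case": "nom"})
datorlingvistikas = Word("datorlingvistikas", {"pos": "noun", "case": "gen"})
attach(konference, datorlingvistikas, genitive_attribute)  # True
```

Because the rule inspects only morphological features, the same pair is recognized wherever the two words appear in the sentence, which is the point of the feature-based attachment described above.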

Figure 4. SemTi-Kamols dependency pair: datorlingvistikas konference ‘conference of computational linguistics’; governor konference (noun, fem, sg, nom), subordinate datorlingvistikas (noun, fem, sg, gen).

The morphological and syntactic levels are strictly separated. Transferences that in Tesnière’s approach would end up with a nucleus consisting of one word are omitted¹: e.g., no transference is necessary for a noun to be a subordinate of another noun: datorlingvistikas (‘computational linguistics’: genitive case; attribute) konference (‘conference’: nominative case; governor), see Figure 4². The rest of Tesnière’s transferences (e.g., prepositional constructions and subordinate clauses) are described in the SemTi-Kamols approach by different types of x-words, in the same way as analytical verb forms (Tesnière’s nuclei dissociés).

¹ Except for some special cases of syntactic reductions.
² In all the examples the morpho-syntactic tags are simplified.

2.2. X-words

By introducing x-words into the grammar model, we have used and extended the notion of the nucleus dissocié [1:58]; eventually, x-words are used to handle all the constructions that are not covered by the dependency rules. X-words participate in the dependency relations as single units (nuclei), though the constituents of x-words can in their turn be governors in other dependency pairs; thus analytical constructions are combined with dependency pairs. In Figure 5 the dependency pair datorlingvistikas konferenci (‘conference of computational linguistics’) is combined with the x-Preposition uz konferenci (‘to the conference’).

Figure 5. Constituent of an x-word in the dependency relations: uz datorlingvistikas konferenci ‘to the conference of computational linguistics’; the x-Preposition (prep, acc) acts as a single unit, while its noun constituent konferenci (noun, fem, sg, acc) governs the genitive attribute datorlingvistikas (noun, fem, sg, gen).

X-words are used to describe various syntactic constructions, though the relations between the elements in the inner structure of x-words differ. This information is reflected indirectly in the name of the particular x-word (x-Verb, x-Preposition, x-Apposition, etc.). X-words have rich morpho-syntactic tags: features inherited from their constituents, and additional features describing the x-words as units (e.g., indicating their types and subtypes). At the level of the dependency relations, there is no difference between simple words (coming from the lexicon) and x-words (generated on the fly). An important aspect of the SemTi-Kamols approach is that syntactic functions are not assigned to the x-words during the transference process (the creation of x-words); they are decided when the dependency tree is being constructed. From the structural point of view, there are several types of x-words; they can be formed of:

• one content word and one or several function words (e.g., x-Verb, x-Preposition);
• two or more content words linked by function words (e.g., conjunctions) or commas (x-Coordination);
• two or more content words alone (e.g., x-Apposition).

The model of x-words can also be applied in the analysis of complex and compound sentences. X-Verbs and x-Prepositions are formed of one content word and one or several function words. An x-Preposition combines a preposition and a noun (or a pronoun); its syntactic function is that of an adjunct or an object. An x-Verb combines at least one auxiliary verb and one content word (a participle, noun, adjective, adverb or pronoun). Figure 6 illustrates how a preposition and a noun in the accusative case are united into an x-word, creating a construction with its own morpho-syntactic features that are further used when the x-word participates in the dependency relations (see Figure 9).

Figure 6. Prepositional construction: the preposition uz (prep, acc) and the noun konferenci (noun, fem, sg, acc) are united into the x-Preposition uz konferenci ‘to the conference’ with the features noun, fem, sg, acc.
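The formation of an x-Preposition can be sketched as follows. This is a hypothetical illustration of the mechanism just described, not the actual parser code: the function word and the content word are merged into one unit that inherits the content word’s morpho-syntactic features and records its own type. The dictionary keys are assumptions made for the sketch.

```python
# Hypothetical sketch of x-word formation (x-Preposition): a function word
# and a content word are merged into a single unit that inherits the content
# word's morpho-syntactic features and records its type. The feature names
# are illustrative, not the actual SemTi-Kamols tag set.

def make_x_preposition(prep, noun):
    """Combine a preposition and a noun into one syntactic unit."""
    if prep["requires_case"] != noun["case"]:
        raise ValueError("case government violated")
    return {
        "type": "x-Preposition",
        "constituents": [prep, noun],
        # inherited features, used when the x-word enters dependency relations
        "features": dict(noun),
    }

uz = {"form": "uz", "pos": "prep", "requires_case": "acc"}
konferenci = {"form": "konferenci", "pos": "noun",
              "gender": "fem", "number": "sg", "case": "acc"}

x_word = make_x_preposition(uz, konferenci)
# x_word now behaves as a single node: its features come from the noun,
# and its name (x-Preposition) encodes the inner structure.
```

At the dependency level the resulting dictionary is indistinguishable from a simple word, which mirrors the paper’s statement that simple words and x-words are treated uniformly.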


Figure 7 shows an x-Verb consisting of the auxiliary verb ‘to be’ and a verb in the relative mood.

Figure 7. Predicate in the relative mood.

Coordinated parts of a sentence also occupy a single node in the dependency tree. Tesnière analyses them using the operation of junction, but in the SemTi-Kamols approach the x-word mechanism is used to deal with this phenomenon as well. Thus all the units (nuclei) of a dependency structure are described in a uniform way; information about their type and inner structure is encoded on the fly in the rich morpho-syntactic tags.

Figure 8. Coordinated parts of a sentence: Alfrēds un Bernards ‘Alfred and Bernard’; two nuclei (masc, sg, nom) linked by the conjunction un (conj) form a unit with the features noun, masc, pl, nom.

In Figure 9 a complete sentence in the SemTi-Kamols representation is given. It consists of one simple word and four x-words linked by dependency relations. For reasons of simplicity, the inner structure of each x-word is not included; most of them are shown in the previous figures.

Figure 9. Dependency structure of a whole sentence: Alfrēds un Bernards uz datorlingvistikas konferenci esot ieradušies no rīta ‘Alfred and Bernard have arrived to the conference of computational linguistics in the morning’.

3. Conclusion

The dependency grammar theory of Tesnière captures linguistic insights that are only partially reflected in modern approaches to dependency parsing. These simplifications, mainly the disregard of the concept of the syntactic nucleus, allow for efficient parsing. The current non-optimized prototype implementation of the SemTi-Kamols parser, however, has exponential complexity relative to the length of a sentence. Although it is possible to reduce the complexity significantly by exploiting the principle of chart parsing, a data-driven parser should be created in the long term. However, to obtain a high-precision parser covering all the different syntactic patterns by which simple and complex words can be combined (including the inner structure of x-words), a significantly larger treebank most likely has to be created in advance (due to the complex annotation scheme used in the SemTi-Kamols approach) than would be necessary with a simplified dependency approach.

References


[1] L. Tesnière, Éléments de syntaxe structurale, Klincksieck, Paris, 1959. (Russian translation: Л. Теньер, Основы структурного синтаксиса, Ред. В.Г. Гак. Москва, Прогресс, 1988.)
[2] F. Sangati and C. Mazza, An English Dependency Treebank à la Tesnière, In Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories, 2009, pp. 173–184.
[3] J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov and E. Marsi, MaltParser: A language-independent system for data-driven dependency parsing, Natural Language Engineering, 13(2), 2007, pp. 95–135.
[4] E. Hajičová, Prague Dependency Treebank: From analytic to tectogrammatical annotations, In Proceedings of the 2nd International Conference on Text, Speech and Dialogue, LNCS, Springer-Verlag, 1998, pp. 45–50.
[5] M.P. Marcus, M.A. Marcinkiewicz and B. Santorini, Building a Large Annotated Corpus of English: The Penn Treebank, Computational Linguistics, 19(2), 1993, pp. 313–330.
[6] G. Bārzdiņš, N. Grūzītis, G. Nešpore and B. Saulīte, Dependency-Based Hybrid Model of Syntactic Analysis for the Languages with a Rather Free Word Order, In Proceedings of the 16th Nordic Conference of Computational Linguistics, 2007, pp. 13–20.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-641-6-241


Cloud Computing for the Humanities: Two Approaches for Language Technology

Graham WILCOCK¹
University of Helsinki, Finland

Abstract. The paper describes Aelred, a web application that demonstrates the use of language technology in the Google App Engine cloud computing environment. Aelred serves up English literary texts with optional concordances for any word and a range of linguistic annotations including part-of-speech tagging, shallow parsing, and word sense definitions from WordNet. Two alternative approaches are described. In the first approach, annotations are created offline and uploaded to the cloud datastore. In the second approach, annotations are created online within the cloud computing framework. In both cases standard HTML is generated with a template engine so that the annotations can be viewed in ordinary web browsers. Keywords. Cloud computing, humanities computing, language technology


Introduction

The paper describes Aelred, a web application that demonstrates the use of language technology in a cloud computing environment. As an example of the conference theme “Language resources and technology for the Humanities”, Aelred serves up English literary texts with optional concordances and a range of linguistic annotations including part-of-speech tagging, shallow parsing, and word sense definitions from WordNet. All the annotations are created automatically by NLP tools. In this initial demonstration version, the texts are the six main novels of Jane Austen (Figure 1). The total number of words is about half a million. The raw texts (prior to being annotated) are the plain text versions of the novels from Project Gutenberg (http://www.gutenberg.org), whose pioneering work in providing freely available texts on the web has been a huge contribution to humanities computing. Aelred runs on the Google App Engine cloud computing framework (http://appengine.google.com). The application can be accessed from any web browser at the URL http://aelred-austen.appspot.com. In Figure 1, selecting a novel leads to a list of its chapters (Figure 2). The start of the text of each chapter is shown alongside the button. Selecting a chapter leads to the chapter text (Figure 3), initially in a plain text format which has been tokenized as described in Section 3.

¹ Corresponding Author: Graham Wilcock, University of Helsinki, P.O. Box 24, 00014 Helsinki, Finland; E-mail: graham.wilcock@helsinki.fi.


G. Wilcock / Cloud Computing for the Humanities: Two Approaches for Language Technology


Figure 1. List of available books by Jane Austen.

Aelred is implemented in Python and uses several tools from NLTK, the Python Natural Language Toolkit. NLTK (http://www.nltk.org) is a set of open source Python tools and resources for natural language processing, whose companion textbook Natural Language Processing with Python [1] provides an excellent self-contained course in human language technology. However, not all NLTK tools can be used directly with Google App Engine, as discussed in Section 3.

1. Two Approaches

Two approaches can be taken to implementing language technology in App Engine. The reason for two different approaches is that the cloud computing framework imposes certain restrictions on applications, as specified in the App Engine documentation [2]. First, there are restrictions on data storage. Data must be stored using App Engine Datastore. This means that only a restricted set of data types can be stored as members of lists. To solve this problem, Aelred converts all annotations into a serialized YAML format as described in Section 2 and stores them in Datastore as large text strings. Second, there are restrictions on the code. Python code uploaded to App Engine must be pure Python. This means that some NLTK tools cannot be used in App Engine, and alternative Aelred tools are used as described in Section 3. Working within these restrictions, two alternative approaches can be taken. In one approach, annotations are created off-line. The annotations are then serialized to YAML and uploaded to Datastore. In the other approach, pure Python tools are used to perform


Figure 2. List of chapters of Northanger Abbey, with first lines.

language technology tasks on-line in App Engine. These tools also use serialized YAML formats, but store and retrieve Datastore files on-the-fly.


2. Using YAML

Aelred processes texts chapter by chapter. Each chapter is divided into a list of paragraph strings. Each paragraph is divided into a list of sentences and each sentence is divided into a list of tokens. A structured object is created for each token, modelled by a Token class. App Engine provides facilities for defining data models in a very similar way to the well-established Django facilities [3] for defining data models easily in Python. The sentence annotations are represented as lists of Token objects, and the Token objects are represented as Python dictionaries with multiple key:value pairs. The words, the part-of-speech tags, and the chunk labels are all included in the Token structures. The lists of structured Token objects cannot be stored directly in App Engine, because Datastore only allows lists to contain a restricted set of data types [2]. Therefore the annotations are first serialized to YAML (http://www.yaml.org), and the YAML files are uploaded to App Engine Datastore as long text strings. YAML (“YAML Ain’t Markup Language”) is a lightweight data format that many people prefer to XML. Python data structures including lists and dictionaries are easily serialized to YAML using simplejson [4]. When a text chapter is requested by a user, the relevant YAML file is retrieved from Datastore and deserialized. The annotations are displayed as described in Section 6.
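The serialize-store-deserialize round trip can be sketched as follows. This is a hypothetical illustration using the standard library json module (JSON is a subset of YAML; the paper itself uses simplejson), and the Token fields shown are assumptions for the sketch.

```python
import json

# Each sentence is a list of token dictionaries carrying the word, its
# part-of-speech tag, and a chunk label (field names are illustrative).
sentence = [
    {"word": "Her", "pos": "PRP$", "chunk": "B-NP"},
    {"word": "father", "pos": "NN", "chunk": "I-NP"},
    {"word": "was", "pos": "VBD", "chunk": "B-VP"},
]

# Serialize to a single text string, suitable for storing in a Datastore
# text property (lists of structured objects cannot be stored directly).
blob = json.dumps(sentence)

# When a chapter is requested, the string is fetched and deserialized
# back into the original list of token dictionaries.
restored = json.loads(blob)
assert restored == sentence
```

Storing one opaque text string per chapter sidesteps the Datastore restriction on list member types while keeping the full token structure recoverable.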


Figure 3. Tokenized Plain Text of Northanger Abbey, Chapter 1.


3. NLTK and App Engine

There have recently been discussions on the NLTK users forum (http://groups.google.com/group/nltk-users) about problems encountered when attempting to use NLTK with App Engine. One of the aims of Aelred is to build a demonstration prototype showing what can be done in practice, and to identify the cases where NLTK can be used with App Engine and the cases where it cannot, as detailed below. When annotations are created offline and uploaded, there are no restrictions on the tools that create the annotations, because the tools do not run inside App Engine. All the standard NLTK tools can therefore be used, including the sentence boundary detector nltk.sent_tokenize(), the word tokenizer nltk.word_tokenize(), the part-of-speech tagger nltk.pos_tag() and the classifier-based named entity recognizer nltk.ne_chunker(). However, some of these tools do not suit the Gutenberg texts, so alternative Aelred tools are used even in the off-line case. In the on-the-fly approach, annotations are created by tools running inside the App Engine framework. Tools written in pure Python can be used in App Engine, but tools written in C cannot. Some of the NLTK tools are pure Python, so they can be imported into App Engine successfully, but some cannot. Aelred therefore uses alternative tools that are pure Python, so that they can be imported into App Engine. The NLTK sentence detector nltk.sent_tokenize(), which is based on the Punkt sentence boundary detector [5], is pure Python and can be used in App Engine. The code can be loaded from a pickled file, which is uploaded to App Engine with the



Figure 4. Part-of-Speech Tagging, with mouse over "CHAPTER/NN".

rest of the code. Pickled files can be used in App Engine so long as the pure Python pickle is used, not the C version (cPickle). The NLTK tokenizer nltk.word_tokenize() is pure Python and can be used in App Engine, but Aelred does not use it because there are specific problems in tokenizing the Gutenberg texts. One problem is the frequent use of a double hyphen (--) to represent a dash. For example, the third sentence in Northanger Abbey starting with "Her father was a clergyman, . . . " includes the string Richard--and. This is tokenized as a single token by the standard NLTK tokenizer. The Aelred tokenizer splits this into three tokens as shown in Figure 3. The NLTK part-of-speech tagger nltk.pos_tag() cannot be used directly in App Engine because it uses the NLTK maximum entropy classifier, which uses numpy, and numpy is not pure Python as it uses C. Aelred therefore uses an alternative pure Python tagger trained on the NLTK Treebank corpus, a subset of the full Penn Treebank corpus. The tagger is uploaded into App Engine as a pickle file. Part-of-speech tagging for the start of Northanger Abbey is shown in Figure 4. The part-of-speech tags are also used for phrase chunking. A shallow parser, which is currently under development, performs chunking for NPs, PPs and VPs, using NLTK tag pattern matching as described in [1]. The different kinds of phrase chunks are displayed in different colours, as described in Section 6.
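The double-hyphen behaviour described above can be reproduced with a small regex tokenizer. This is a hypothetical sketch of that behaviour, not the actual Aelred tokenizer: the pattern treats `--` as a token in its own right, so it is split off from the surrounding words.

```python
import re

# Split text into word tokens, single-hyphen/apostrophe words, and
# punctuation, treating a double hyphen (--, used for a dash in the
# Gutenberg texts) as a separate token, so "Richard--and" becomes
# three tokens instead of one.
TOKEN_RE = re.compile(r"--|\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Richard--and"))   # ['Richard', '--', 'and']
```

Because `--` appears first in the alternation, it wins over the word pattern at the dash, while ordinary hyphenated words like "well-known" still come out as a single token.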


Figure 5. Words with clickable links to their concordances.


4. Making Concordances

Concordances are created using NLTK’s ConcordanceIndex() method, and show all occurrences of a word in a novel, not chapter by chapter. The offsets for the whole novel are calculated off-line and uploaded to Datastore in a serialized YAML format. In the concordances view of the text (Figure 5), all the words are displayed as clickable links so that any word’s usage can be seen simply by clicking on the word. The concordance for the word is then generated and displayed as a standard HTML table. The concordance for handsome (Figure 6) shows that Austen used this adjective for both male and female characters.
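A minimal sketch of the offset-based concordance idea (not Aelred’s actual code, which uses NLTK’s ConcordanceIndex): an index maps each word to its token offsets, and keyword-in-context lines are assembled from a window around each offset. Function names and the window width are assumptions for the sketch.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each word (lowercased) to the token offsets where it occurs."""
    index = defaultdict(list)
    for i, tok in enumerate(tokens):
        index[tok.lower()].append(i)
    return index

def concordance(tokens, index, word, width=3):
    """Return keyword-in-context lines for all occurrences of `word`."""
    lines = []
    for i in index.get(word.lower(), []):
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        lines.append(f"{left} [{tokens[i]}] {right}")
    return lines

tokens = "a very handsome man met a handsome woman".split()
for line in concordance(tokens, build_index(tokens), "handsome"):
    print(line)
# a very [handsome] man met a
# man met a [handsome] woman
```

Precomputing the offsets once per novel, as the paper describes, means each click only needs the cheap window-assembly step at request time.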

5. Using WordNet

A Python interface to WordNet [6] is bundled with NLTK. Aelred uses the interface to get word sense definitions from WordNet for nouns, verbs, adjectives and adverbs, excluding stopwords. Words that have WordNet definitions and are not stopwords are highlighted in the browser display, as shown in Figure 7. When the user hovers the cursor over one of these highlighted words, the WordNet definition is displayed in a pop-up tooltip as described further in Section 6. Only definitions for the part of speech given by the part-of-speech tagger are displayed. Where there are multiple word senses for the same part of speech, a simple form of word sense disambiguation is used to select the most appropriate definition.


Figure 6. Concordance for "handsome", referring to males and females.


6. Displaying the Annotations

Annotation frameworks often provide special tools for viewing annotations, but App Engine is used with standard web browsers, so Aelred displays annotations using standard HTML tags in combination with CSS stylesheets. The HTML is generated using Django (www.djangoproject.com) templates. Django is a widely-used open source Python web development framework. The Django Book [3] is an excellent tutorial and guide on how to build web apps with Django. Django 0.96 is bundled with App Engine. Words or phrases are highlighted in specific colours by means of HTML tags. This is used in various ways: to highlight part-of-speech tags (Figure 4), to distinguish different kinds of phrase chunks (NPs, PPs, VPs), and to highlight words for which a WordNet definition is available (Figure 7). The colour scheme is specified by a CSS stylesheet, so the choice of colours can easily be changed. Tooltip strings are displayed when the user hovers the cursor over a particular word. For example, in Figure 4 the cursor was placed over CHAPTER/NN and the expanded description of the NN tag from the Penn Treebank tagset is shown in the pop-up tooltip. This is done by means of the title attribute. In Figure 7 the cursor was placed over the word heroine and the WordNet definition for this word is shown in the tooltip.
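The highlighting-and-tooltip mechanism can be sketched in a few lines: each annotated word is wrapped in an HTML element whose class selects the CSS colour and whose title attribute carries the tooltip text. The tag and class names below are illustrative assumptions, not Aelred’s actual markup; the title attribute is the standard browser tooltip mechanism the paper refers to.

```python
from html import escape

# Wrap an annotated token in an HTML element whose class drives the CSS
# colour and whose title attribute supplies the pop-up tooltip text.
# The span tag and class names are illustrative, not Aelred's markup.

def highlight(word, css_class, tooltip):
    return (f'<span class="{escape(css_class)}" '
            f'title="{escape(tooltip)}">{escape(word)}</span>')

markup = highlight("CHAPTER", "pos-nn", "NN: noun, common, singular or mass")
print(markup)
# <span class="pos-nn" title="NN: noun, common, singular or mass">CHAPTER</span>
```

Escaping the word and tooltip text keeps the generated HTML well-formed even when the underlying tokens contain characters such as quotes or angle brackets.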




Figure 7. Pop-up WordNet Definitions, with mouse over "heroine".


7. Conclusion

The paper has described Aelred, a web application that demonstrates the use of language technology in the Google App Engine cloud computing environment. Aelred serves up English literary texts with optional concordances and a range of linguistic annotations, including part-of-speech tagging, shallow parsing, and word sense definitions from WordNet. Two approaches were described. In the first approach, annotations are created offline and uploaded to the cloud datastore. In the second approach, annotations are created online within the cloud computing framework. In both cases standard HTML is generated with a template engine so that the annotations can be viewed in ordinary web browsers.

References

[1] S. Bird, E. Klein and E. Loper, Natural Language Processing with Python, O'Reilly, 2009.
[2] Google App Engine documentation, The Python Runtime Environment, http://code.google.com/appengine/docs/python/, 2010.
[3] A. Holovaty and J. Kaplan-Moss, The Django Book (version 0.96), 2009.
[4] SimpleJSON documentation, JSON encoder and decoder, http://code.google.com/p/simplejson/, 2010.
[5] T. Kiss and J. Strunk, Unsupervised Multilingual Sentence Boundary Detection, Computational Linguistics 32 (2006), 485–525.
[6] G. A. Miller, WordNet: A Lexical Database for English, Communications of the ACM 38 (1995), 39–41.


Human Language Technologies – The Baltic Perspective
I. Skadiņa and A. Vasiļjevs (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.



Subject Index

action noun 195
adaptive tool 147
ageing 25
agent noun 195
applications 33
asynchronous turn management 107
audio system 61
automatic derivation 195
automatic morphology 169
Baltic languages 187
bootstrapping 217
clarification 99
CLARIN 3, 15
cloud computing 241
cloud service 133
coherence 83
comparability 161
comparable corpora 161
computational semantics 201
controlled natural language 187
conversation analysis 99
corpus 39
corpus compilation and mark-up 143
corpus linguistics 3, 154
corpus user interface 143
crawling 161
data sharing 133
dependency grammar 73, 233
dialogue 91
dialogue structure 83
dictionary 147
dictionary management system 169
digital libraries 177
digitization 177
duration 45
electronic lexicography 169
emotion 25
ensemble of classifiers 217
error analysis 117
Estonian 11, 25, 208
Estonian Wordnet 195
factored models 125
frame semantics 208
FrameNet 208
fricatives 45
full-text information retrieval 225
gesturing 91
highly inflectional languages 217
human language technology 11
humanities computing 241
inference 208
information dialogues 99
information structure 187
Internet comments 83
language modeling 73
language resources and technology 15
language technology 241
language technology infrastructure 3
Latvian 15, 39, 217
LetsMT! 133
machine learning 147, 217
machine translation 15, 117, 133
management of word form variation 225
manner of articulation 45
mark-up 147
membership categorization 83
metadata 161
morphological annotation 143
morphosyntactic specifications 154
Moses 133
NLP 177
non-understandings 99
ontology verbalization 187
palatalization 45
parsing 233
perception of emotions 25
perception test 61
phrase pattern search 107
place of articulation 45
plosives 45
prosody modelling 53
Python 195
quantity degree 53
radiology 33
reformulation 99
research infrastructure 15
rule-based stemming 217
semantic relations 195
semantic resolution 107
semantic roles 208
sonorants 45
speech interaction 91
speech rate 61
speech recognition 33, 73
speech synthesis 69
spoken Latvian corpus 39
statistical machine translation 125, 161
statistical methods 117
statistical stemming 217
stemming 217
structural syntax 233
syllables 225
syllables as index terms 225
syntactic representation 233
syntactic semantics 208
synthetic language 187
TEI P5 encoding 154
temporal and tonal characteristics 53
text corpora 15
text-based dialogue systems 107
treebank 233
TTS system 69
uncertainty 91
under-resourced languages 161
visually impaired people 61
vocal expression 25
voicing 45
whole sentence maximum entropy model 73
Wizard of Oz experiments 99
word formation 169
word order 107
written language 99
XML 147
XML database 169




Author Index

Aker, A. 161
Altrov, R. 25
Alumäe, T. 33, 73
Auziņa, I. 15, 39, 69
Bārzdiņš, G. 233
Baskaya, F. 225
Dereškevičiūtė, S. 45
Dobrinkat, M. 73
Eger, S. 217
Gaizauskas, R. 161
Gerassimenko, O. 83
Giouli, V. 161
Goba, K. 125
Gornostay, T. 133
Grūzītis, N. 15, 187, 233
Hein, I. 61
Hennoste, T. 83
Jokinen, K. 91
Kaalep, H.-J. 143
Kahusk, N. 11, 195, 201, 208
Kalvik, M.-L. 53
Kasterpalu, R. 83
Kazlauskienė, A. 45
Kerner, K. 195, 201
Kettunen, K. 225
Khalilov, M. 117
Kiissel, I. 61
Koit, M. 83
Kovalevskaitė, J. 154
Kuvaldina, N. 117
Laanesoo, K. 83
Levāne-Petrova, K. 15
Marcinkevičienė, R. 3
Mastropavlos, N. 161
McNamee, P. 225
Meister, E. 11, 33
Melninkaitė, V. 154
Mieriņa, M. 161
Mihkla, M. 53, 61
Millere, I. 147
Muischnek, K. 143
Nešpore, G. 15, 187, 233
Õim, H. 201, 208
Oja, A. 83
Orav, H. 201, 208
Orusaar, M. 61
Pajupuu, H. 25
Pärkson, S. 99
Pereseina, V. 117
Pinnis, M. 69
Pretkalniņa, L. 117, 147
Rääbis, A. 83
Räpp, A. 61
Rimkutė, E. 154
Ruokolainen, T. 73
Sahkai, H. 169
Saulīte, B. 187, 233
Sējāne, I. 217
Šics, V. 125
Skadiņa, I. 15, 161
Skadiņš, R. 15, 125, 133
Skilters, J. 177
Strandson, K. 83
Taremaa, P. 208
Treumuth, M. 107
Tufis, D. 161
Uiboaed, K. 143
Utka, A. 154
Vare, S. 169
Vasiļjevs, A. 15, 133
Veskis, K. 143
Vider, K. 195
Viks, Ü. 169
Vilo, J. 11
Vitkutė-Adžgauskienė, D. 3, 154
Wilcock, G. 241
Zogla, A. 177



