English Pages [271] Year 2018
Current Issues in Linguistic Theory 341
Multiword Units in Machine Translation and Translation Technology edited by Ruslan Mitkov Johanna Monti Gloria Corpas Pastor Violeta Seretan
JOHN BENJAMINS PUBLISHING COMPANY
MULTIWORD UNITS IN MACHINE TRANSLATION AND TRANSLATION TECHNOLOGY
CURRENT ISSUES IN LINGUISTIC THEORY AMSTERDAM STUDIES IN THE THEORY AND HISTORY OF LINGUISTIC SCIENCE – Series IV
issn 0304-0763
General Editor JOSEPH C. SALMONS
University of Wisconsin–Madison
Founder & General Editor (1975-2015) E.F.K. KOERNER
Leibniz-Zentrum Allgemeine Sprachwissenschaft, Berlin
Current Issues in Linguistic Theory (CILT) is a theory-oriented series which welcomes contributions from scholars who have significant proposals that advance our understanding of language, its structure, its function and especially its historical development. CILT offers an outlet for meaningful contributions to current linguistic debate. A complete list of titles in this series can be found on http://benjamins.com/catalog/cilt
Editorial Board Claire Bowern (New Haven, Ct.) Sheila Embleton (Toronto) Elly van Gelderen (Tempe, Ariz.) John E. Joseph (Edinburgh) Matthew Juge (San Marcos, Tex.) Martin Maiden (Oxford) Martha Ratliff (Detroit, Mich.) E. Wyn Roberts (Vancouver, B.C.) Klaas Willems (Ghent)
Volume 341
Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan (eds.) Multiword Units in Machine Translation and Translation Technology
MULTIWORD UNITS IN MACHINE TRANSLATION AND TRANSLATION TECHNOLOGY Edited by RUSLAN MITKOV
University of Wolverhampton
JOHANNA MONTI
“L’Orientale” University of Naples
GLORIA CORPAS PASTOR University of Málaga
VIOLETA SERETAN University of Geneva
JOHN BENJAMINS PUBLISHING COMPANY AMSTERDAM & PHILADELPHIA
The paper used in this publication meets the minimum requirements of the American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ansi z39.48-1984.
doi 10.1075/cilt.341 Cataloging-in-Publication Data available from Library of Congress: lccn 2017058783 (print) / 2018001455 (e-book) isbn 978 90 272 0060 0 (Hb) isbn 978 90 272 6420 6 (e-book)
© 2018 – John Benjamins B.V. No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher. John Benjamins Publishing Company · https://benjamins.com
Table of contents

About the editors

Multiword units in machine translation and translation technology
Johanna Monti, Violeta Seretan, Gloria Corpas Pastor & Ruslan Mitkov

Part 1. Multiword units in machine translation

Analysing linguistic information about word combinations for a Spanish-Basque rule-based machine translation system
Uxoa Iñurrieta, Itziar Aduriz, Arantza Díaz de Ilarraza, Gorka Labaka & Kepa Sarasola

How do students cope with machine translation output of multiword units? An exploratory study
Joke Daems, Michael Carl, Sonia Vandepitte, Robert Hartsuiker & Lieve Macken

Aligning verb + noun collocations to improve a French-Romanian FSMT system
Amalia Todiraşcu & Mirabela Navlea

Part 2. Multiword units in multilingual NLP applications

Multiword expressions in multilingual information extraction
Gregor Thurmair

A multilingual gold standard for translation spotting of German compounds and their corresponding multiword units in English, French, Italian and Spanish
Simon Clematide, Stéphanie Lehner, Johannes Graën & Martin Volk

Dutch compound splitting for bilingual terminology extraction
Lieve Macken & Arda Tezcan

Part 3. Identification and translation of multiword units

A flexible framework for collocation retrieval and translation from parallel and comparable corpora
Oscar Mendoza Rivera, Ruslan Mitkov & Gloria Corpas Pastor

On identification of bilingual lexical bundles for translation purposes: The case of an English-Polish comparable corpus of patient information leaflets
Łukasz Grabowski

The quest for Croatian idioms as multiword units
Kristina Kocijan & Sara Librenjak

Corpus analysis of Croatian constructions with the verb doći ‘to come’
Goranka Blagus Bartolec & Ivana Matas Ivanković

Anaphora resolution, collocations and translation
Eric Wehrli & Luka Nerima

Index
About the editors While Prof Ruslan Mitkov is best known for his seminal contributions to anaphora resolution and automatic generation of multiple-choice tests, his original contributions and extensively cited research (250 publications) cover many other topics including but not limited to translation technology and NLP applications for language disabilities. Mitkov is author of the monograph Anaphora resolution (Longman) and Editor of the Oxford Handbook of Computational Linguistics (Oxford University Press). Current prestigious projects include his role as Executive Editor of the Journal of Natural Language Engineering (Cambridge University Press) and Editor-in-Chief of the NLP book series of John Benjamins publishers. Prof Mitkov is regular keynote speaker at international conferences and often acts as Programme Chair of scientific events; he has been principal investigator of projects with funding (exceeding €20,000,000) provided by UK funding agencies, the EC and the industry. Dr Mitkov received his MSc from the Humboldt University in Berlin and his PhD from the Technical University in Dresden. Ruslan Mitkov is Professor of Computational Linguistics and Language Engineering at the University of Wolverhampton and Director of the Research Institute in Information and Language Processing. Mitkov is a Fellow of the Alexander von Humboldt Foundation, Germany, Distinguished Visiting Professor at the University of Franche-Comté in Besançon, France and Vice-Chair for the prestigious EC funding programme ‘Future and Emerging Technologies’. In recognition of his outstanding professional/research achievements, Prof Mitkov was awarded the title of Doctor Honoris Causa at Plovdiv University in 2011; he was also conferred Professor Honoris Causa at Veliko Tarnovo University in 2014. 
Johanna Monti is currently Associate Professor of Modern Languages Teaching at the L’Orientale University of Naples where she teaches Translation Studies, Specialised Translation, Machine Translation and CAT tools. She was the Computational Linguistics Research manager of the Thamus Consortium in Salerno (FINMECCANICA Group) from 1988 till 2001 and took part in various national and international research projects (CLIA, EUROLANG, CLIPS). She was research fellow from 2008 till 2012 in Computational Linguistics at the Department of Political, Social and Communication Sciences of the University of Salerno. She received her PhD in Theories, Methodologies and Advanced applications for Communication,
Computer Science and Physics with a thesis in Computational Linguistics at the University of Salerno, Italy. She has acted as Programme Chair of various international conferences on Natural Language Processing (NLP), Machine Translation, Translation Technology, Computational Linguistics and is asked on a regular basis to review for leading Italian and international funding bodies and organisations. She was an active member of the PARSEME Cost Action and is currently acting as Principal Investigator of the PARSEME-IT project. She is member of the governing board of the I-Land inter-university research center and of the standing committee of the SIGLEX-MWE Interest Group. Her current research activities are in the field of hybrid approaches to MT, development of Linguistic Resources for NLP applications, evaluation of Translation technologies and finally new trends in translation. Gloria Corpas Pastor has a BA in German Philology (English) from the University of Malaga and a PhD in English Philology from the Universidad Complutense de Madrid (1994). Visiting Professor in Translation Technology at the Research Institute in Information and Language Processing (RIILP) of the University of Wolverhampton, UK (since 2007), and Professor in Translation and Interpreting at the University of Malaga (since 2008). Published and cited extensively, member of several international and national editorial and scientific committees. Spanish delegate for AEN/CTN 174 and CEN/BTTF 138, actively involved in the development of the UNE-EN 15038:2006 and currently involved in several ISO Standards (ISO TC37/SC2-WG6 “Translation and Interpreting”). Regular evaluator of University programmes and curriculum design for the Spanish Agency for Quality Assessment and Accreditation (ANECA) and various research funding bodies. Director of the research group “Lexicography and Translation” since 1997 (http:// www.lexytrad.en). 
Director of the Department of Translation and Interpreting of the University of Malaga, former President and member of the Scientific Committee of AIETI (Iberian Association of Translation and Interpreting Studies), Board member and member of the Advisory council of EUROPHRAS (European Society of Phraseology) and Vice-President of AMIT-A (Association of Women in Science and Technology of Andalusia). Her research fields include corpus linguistics, phraseology, lexicography and translation technology.

Violeta Seretan (PhD in Linguistics, University of Geneva, 2008) is a Senior Research Associate and Lecturer at the Universities of Geneva and Lausanne. She has extensive experience in the fields of Computational Linguistics and Multilingual Natural Language Processing. Since 2000, when she earned her MSc degree in Computer Science from the University of Iasi, she has conducted research on topics as varied as language analysis, lexical semantics, phraseology acquisition, information extraction, text simplification, machine translation, pre-editing, post-editing, and machine translation evaluation (among others). She has authored a book (“Syntax-Based Collocation Extraction”, Springer, 2010) and over 50 peer-reviewed publications in international journals and conferences in the field of Human Language Technology. She has been a coordinator of the FP7 European project accEPT; a member of the COST Action PARSEME; and a participant in a number of nationally and locally funded research projects. She has held positions in several institutions, including the University of Edinburgh (UK), the Fuji Xerox Palo Alto Laboratory (USA), and the University of Lausanne (CH).
Multiword units in machine translation and translation technology Johanna Monti, Violeta Seretan, Gloria Corpas Pastor & Ruslan Mitkov
“L’Orientale” University of Naples, Italy / University of Geneva, Switzerland / University of Malaga, Spain / University of Wolverhampton, United Kingdom The correct interpretation of Multiword Units (MWUs) is crucial to many applications in Natural Language Processing but is a challenging and complex task. In recent years, the computational treatment of MWUs has received considerable attention but we believe that there is much more to be done before we can claim that NLP and Machine Translation (MT) systems process MWUs successfully. In this chapter, we present a survey of the field with particular reference to Machine Translation and Translation Technology. Keywords: multiword units, multiword expressions, natural language processing, machine translation, translation technology
1. Introduction

As the late and inspiring John Sinclair (1991, 2007) observed, knowledge of vocabulary and grammar is not sufficient for someone to express himself/herself idiomatically or naturally in a specific language. One has to have the knowledge and skill to produce effective and naturally-phrased utterances which are often based on multiword units (the idiom principle). This contrasts with the traditional assumption or open choice principle that lies at the heart of generative approaches to language. As Pawley and Syder (1983) stated more than three decades ago, the traditional approach cannot account for nativelike selection (idiomaticity) or fluency.

Multiword units or multiword expressions[1] are meaningful lexical units made of two or more words in which at least one of them is restricted by linguistic conventions in the sense that it is not freely chosen. For example, in the
1. Henceforth and as explained later in the text, for the purpose of this chapter multiword units and multiword expressions will be used interchangeably as synonyms.
doi 10.1075/cilt.341.01mon © 2018 John Benjamins Publishing Company
expression to smell a rat the word rat cannot be replaced with similar words such as mouse or rodent. Perhaps most scholars would agree that multiword units are habitual recurrent word combinations of everyday language (Firth 1957) and carry some degree of idiomaticity in that their meaning does not correspond to the literal meaning represented by their individual lexical items (parts or components). In other words, the meaning of a multiword unit is not always derivable from the meaning of its components, nor can it be determined by the rules used to combine them.[2] Baldwin & Kim (2010) write that multiword expressions have ‘surprising properties not predicted by their component words’. On the other hand, Sag et al. (2002) define multiword expressions more formally as lexical items which (i) can be decomposed into multiple lexemes and (ii) display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity.[3]

Ramisch (2015) presents two linguistic tests for deciding whether a set of words can be regarded as a multiword expression. The first test is whether it is possible to replace one word in the expression with a synonym. If we take the multiword unit full moon, for instance, it would sound quite awkward if someone said entire moon, whole moon, total moon or complete moon. Another test for detecting multiword expressions (MWEs) is word-for-word translation into another language.[4] If the translation sounds strange, unnatural or even ungrammatical, the original expression is probably a multiword unit. By way of illustration, Ramisch (2015) explains that the expression prince charming is translated into Portuguese as príncipe encantado, that is, enchanted prince. It has to be borne in mind that an MWU in one language may be translated as a
2. The Fregean principle of compositionality is clearly flouted in the case of idioms, as observed long ago by Katz & Postal (1963), Chafe (1968) and Chomsky (1980), among many others.
3. Sag et al. (2002) provide a summary of the idiosyncratic features traditionally attributed to idioms, together with their various clines or degrees of fixity, e.g. semantic opacity and compositionality, institutionalisation, variation and variability, continuity and discontinuity of components, syntactic frozenness, lexical and syntactic flexibility, idiomatic processing and literal counterparts, etc. On these issues, see the early works by Fraser (1970), Makkai (1972), Arnold (1973), Cowie (1981), Fernando & Flavell (1981), Moon (1988), Carter (1998), Fellbaum (1993), Nunberg et al. (1994), Jackendoff (1997), Cacciari & Tabossi (1988), Gibbs & Nayak (1989), etc.
4. A considerable amount of work has been devoted to the identification of multiword units since early works by Choueka et al. (1983) and Smadja (1993). Both detection tests proposed by Ramisch (2015) are, in fact, reminiscent of previous attempts based on substitutability (cf. Lin, 1998; Pearce, 2001) and on identification through language translation mismatches (cf. Melamed, 1997). Other relevant approaches involve latent semantic analysis (Katz & Giesbrecht, 2006) and asymmetric directed graphs (cf. Widdows & Dorow, 2005).
single word in another language (en. shape up → it. progredire) and that there are regional variations.

Multiword units (MWUs) are ubiquitous and pervasive in language. Jackendoff (1997) observes that the number of MWUs in a speaker’s lexicon is of the same order of magnitude as the number of single words. Biber et al. (1999) argue that they constitute up to 45% of spoken English and up to 21% of academic prose in English. Sag et al. (2002) note that they are overwhelmingly present in terminology and 41% of the entries in WordNet 1.7 are reported to be multiword units.

Language is indeed phraseological and MWUs are a fundamental linguistic concept which is central to a wide range of Natural Language Processing and Applied Linguistics applications including Phraseology, Terminology, Translation and Lexicography. Phraseology is the discipline which studies MWUs or their related concepts referred to (and regarded as largely synonymous) by scholars as, for example, multiword units, multiword expressions, fixed expressions, set expressions, phraseological units, formulaic language, phrasemes, idiomatic expressions, idioms, collocations, and polylexical expressions. For the purposes of this chapter, we shall consider the above terms largely synonymous.[5] Given the variety of views and high number of definitions, we shall not attempt to provide a comprehensive list of definitions or different viewpoints.

The successful computational treatment of MWUs is essential for Natural Language Processing, including Machine Translation and Translation Technology; the inability to detect MWUs automatically may result in incorrect (and even unfortunate) automatic translations and may jeopardise the performance of applications such as text summarisation and web search. The study of multiword units in Natural Language Processing has been gaining prominence and in recent years the number of researchers and projects focusing on them has increased dramatically. Ramisch et al.
(2013) count the number of papers mentioning “multiword”, “collocation” or “idiom” over the total number of papers published in a year in the ACL Anthology between 1995 and 2006. They report that while in 1995 fewer than 1% of the papers contain the above terms, by 2006 this percentage is almost 10. We recently conducted a similar experiment in which the occurrences of “multiword” and “multi-word” were counted in the papers available in the ACL Anthology Reference Corpus, version 20160301,[6] from the year 1980 to the year 2015.[7] We plot the actual numbers of papers in which the words “multiword” or “multi-word” appear with a frequency of more than 1 (Figure 1). We also plot the ratio of these numbers to the total number of papers in each year (Figure 2). The increasing use of the above terms reflects the growing importance of multiword units in Natural Language Processing.[8]

Figure 1. Number of papers mentioning multiword or multi-word in their body (x-axis: year, 1980–2015; y-axis: 0–60)

Figure 2. Proportion of papers mentioning multiword or multi-word in their body (x-axis: year, 1980–2015; y-axis: 0–7)

5. There are subtle nuances, though. For instance, terms like set phrase, fixed expression, set expression or formulaic language imply that the basic criterion of differentiation is stability of the lexical components and grammatical structure of the multiword unit (preferred terms in the early papers in Lexicology and Applied Linguistics); multiword expression, multiword unit, and polylexical expression stress the polylexical nature of these units (preferred terms within the NLP and Computational Linguistics community); idiomatic expression and idiom imply that the essential feature of these units is idiomaticity, non-compositionality or lack of motivation (preferred terms among English and American linguists); phraseological unit and phraseme imply stability, idiomaticity and gradability (preferred terms among Continental phraseologists). Collocation is a polysemous term which is used to denote a partially restricted type of multiword unit, a mode of semantic analysis and frequent, habitual co-occurrence (cf. the papers in the recent volumes edited by Orlandi & Giacomini, 2016; and by Torner & Bernal, 2017). For an attempt to clarify the different terminology, see also Granger & Paquot (2008).
6. http://acl-arc.comp.nus.edu.sg
7. We exclude workshop papers and only consider the papers in main conferences.
8. We are indebted to Shiva Taslimipoor for carrying out the experiments.
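The counting procedure behind Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not the authors' script: the function name `yearly_mentions`, the `(year, text)` input format and the toy data are assumptions made for illustration, and "a frequency of more than 1" is read here as at least two occurrences per paper.

```python
import re
from collections import Counter

# Matches "multiword" and "multi-word", case-insensitively.
TERM = re.compile(r"\bmulti-?word", re.IGNORECASE)


def yearly_mentions(papers, min_freq=2):
    """Count, per year, the papers that mention the term at least min_freq times.

    papers: iterable of (year, full_text) pairs (hypothetical input format).
    Returns {year: (matching_papers, total_papers, percentage)}.
    """
    totals, hits = Counter(), Counter()
    for year, text in papers:
        totals[year] += 1
        if len(TERM.findall(text)) >= min_freq:
            hits[year] += 1
    return {
        year: (hits[year], totals[year], 100.0 * hits[year] / totals[year])
        for year in sorted(totals)
    }


# Toy data standing in for the corpus (invented for illustration only).
toy_papers = [
    (1995, "Parsing with unification grammars."),
    (2006, "Multiword expressions matter. We extract multi-word units."),
    (2006, "A multiword lexicon is mentioned once."),
]
stats = yearly_mentions(toy_papers)
```

The first element of each tuple corresponds to the kind of quantity plotted in Figure 1 and the last to the kind plotted in Figure 2, under the assumption that the proportion is expressed as a percentage.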
Multiword units do not only play a crucial role in the computational treatment of natural languages. Often terms are multiword units (and not single words), which makes them highly relevant to terminology. The requirement for correct rendering of MWUs in translation and interpretation highlights their importance in these fields. Given the pervasive nature of MWUs, they play a key role in the work of lexicographers who study and describe both words and MWUs. Lastly, MWUs are vital in the study of language, which includes not only language learning, teaching and assessment, but also more theoretical linguistic disciplines such as Pragmatics, Cognitive linguistics and Construction grammars, which are nowadays aided by (and in fact often driven by) corpora. MWUs are very relevant for corpus linguists, too. As a result, MWUs provide an excellent basis for interdisciplinary research and for collaboration between researchers across different areas of study, which for the time being is underexplored.

The remainder of this chapter is structured as follows. Section 2 discusses the role and computational treatment of MWUs in Natural Language Processing. Section 3 covers the processing of MWUs in one particular NLP application – that of Machine Translation. Section 4 outlines the so far underrepresented area of MWUs in Translation Technology and, more specifically, in tools for translators. The chapter finishes with a section outlining further reading materials and relevant resources.

2. Multiword units in natural language processing

While it is largely agreed that successful NLP applications need to identify and treat MWUs appropriately (see also Section 1), the empirical validation of this claim has received relatively little attention from the scientific community. Many years after MWU research was established as a field in itself (see, in this respect, Ramisch et al. 2013), the largest body of work in this area continues to be devoted to the issues of automatic acquisition of MWUs from corpora and their classification, with the actual exploitation of MWUs in real-world applications remaining the focus of isolated efforts. Recent years, however, have seen the emergence of research work devoted to MWU integration in various NLP applications, in particular, machine translation.

In this section, we review some of the work which illustrates the integration of MWU knowledge in NLP tasks and applications other than machine translation (for a review of MT-related work, see Section 3). We begin with a brief historical overview, and then present the integration approaches proposed in areas as diverse as POS tagging, parsing, word sense disambiguation, information retrieval, information extraction, question answering, sentiment analysis, and text mining, among others.
2.1 Historical notes

At the start of the MWE workshop series, which symbolically coincides with the formal establishment of the MWE community in the NLP field, the situation was such that “most real-world applications tend to ignore MWEs or address them simply by listing”.[9] A couple of years later, the importance of multiword units still appeared to be underestimated in computational applications. In tasks like machine translation, word alignment, cross-language information retrieval, and computer-assisted language learning, the focus tends to be on individual words, instead of dealing with a “level of units larger than the word”.[10]

Despite the community’s repeated calls for work that exploits MWUs in NLP applications, such work has remained largely underrepresented over the years. The issue of extrinsic evaluation of MWU knowledge and resources by assessing the impact of their integration into NLP pipelines was, and still is, relatively little explored. In more recent years, there have still been “not enough successful applications in real NLP problems”[11] like question answering, machine translation, information retrieval, information extraction, and textual entailment, despite this avenue of research being considered as “the key for the advancement of the field”.[12]

One of the main reasons for this is, certainly, the problem of how to represent MWUs in order to allow for integration in NLP applications, given the ambivalent status of MWUs at the intersection of lexis and grammar: “In the pipeline of linguistic analysis and/or generation, where should we insert MWEs? And even more important: how? Because all the effort that has been put in automatic MWE extraction would not be useful if we do not know how to employ these rich resources in our real-life NLP applications!”[13] And in fact, the MWE workshop organisers (MWE2013) emphasised the underrepresentation of MWUs in NLP resources: “MWEs are not nearly as frequent in NLP resources as they are in real-word text, and this problem of coverage may impact the performance of many NLP tasks”.[14]

It was not until 2014 that the annual MWE workshop series saw a special track devoted to the integration of MWUs in parsing, thanks to the collaboration of the European COST Action PARSEME – PARSing and Multi-word Expressions: Towards linguistic precision and computational efficiency in natural language processing (2013–2017). The PARSEME community, which counts more than 200 members, fostered research on the integration of MWUs in parsing and translation. As mentioned earlier, recent years have seen an increased interest in the integration of MWUs in many important NLP applications. We summarise this work in the remainder of this section.

9. http://www.cl.cam.ac.uk/~alk23/mwe/mwe.html
10. http://ucrel.lancs.ac.uk/EACL06MWEmc/
11. http://multiword.sourceforge.net/mwe2009/
12. http://multiword.sourceforge.net/mwe2009/
13. http://multiword.sourceforge.net/mwe2011/
14. http://multiword.sourceforge.net/mwe2013/

2.2 POS tagging and parsing

The incorporation of MWUs into POS-tagging and parsing systems is expected to benefit the performance of these major language analysis applications. MWE knowledge acts as a set of constraints that guide the analysis process, leading to a reduction in the number of alternatives pursued and to a re-ranking which favours MWU-compatible analyses. There is, indeed, significant evidence in the literature on POS tagging and parsing that integrating knowledge about MWUs improves the performance of these applications. For instance, Leech et al. (1994) describe the “idiom tagging” approach used for tagging the British National Corpus (BNC). This approach relies on a lexicon of MWUs and is used for correcting, in a second tagging pass, the errors from the first pass concerning the parts of speech extending over more than one orthographic word (e.g., complex prepositions and conjunctions such as according to and so that, general idioms such as as much as, complex names and foreign expressions). Wehrli (2014) reports an increase in the POS tagging accuracy when the underlying language analysis procedure is modified such that it gives preference to alternatives involving the attachment of the component items of MWUs in a lexicon. Constant and Sigogne (2011) show that by integrating MWU recognition and external MWU resources, their POS system outperforms the version of the system that is not MWU-aware. Also, Shigeto et al.
(2013) report that their MWU-aware tagger achieved better performance in POS-tagging and MWU recognition, when evaluated on an MWU-annotated version of the Penn Treebank.

Positive results have also been reported in a large number of studies concerned with the impact of integrating MWU knowledge into parsing systems. For instance, Brun (1998) compared the versions of a parser with and without pre-recognition of terminology, and found that by using a glossary of nominal terms in the pre-processing component of the parser, the number of parsing alternatives is significantly reduced. Similarly, Alegría et al. (2004) reported a significant improvement in the POS tagging precision when using an MWU processor in the pre-processing stage of their parser for Basque. Nivre & Nilsson (2004) investigated the impact of MWU pre-recognition on a data-driven Swedish dependency parser. They reported a significant improvement in parsing accuracy when the parser is trained on a treebank in which MWUs are represented as a single token. They also observed an increase in the parsing coverage, as reflected in the number of complete trees built by the parser. Zhang & Kordoni (2006) used a similar words-with-spaces pre-processing approach in which they treated fixed units as single tokens, and obtained a significant increase in the coverage of an English parser. A significant increase in parsing coverage was also observed by Villavicencio et al. (2007) when adding as few as 21 new MWUs into the lexicon of a parser.

In a similar vein, Wehrli et al. (2010) presented a method for incorporating MWU knowledge in a constituency parser during the analysis process, rather than at the pre-processing stage. They found a positive impact on both MWU recognition and parsing coverage for English and Italian. Later, Wehrli (2014) further evaluated this method and not only confirmed the positive impact of MWU knowledge on parsing coverage, but also showed that the parsing accuracy is improved in most cases, as reflected by the POS tagging accuracy. Wehrli et al.’s (2010) approach was particularly suited to deal with flexible MWUs, in which the component items are found in long-distance dependencies. Using the much more common words-with-spaces approach, Korkontzelos and Manandhar (2010) found, in line with previous results, that treating MWUs as singletons leads to an increase in the shallow parsing accuracy. Their work was specifically aimed at phrase chunking, and relied on data annotated at the level of compound nouns and proper nouns. Recent reports (e.g., Constant et al. 2013a and 2013b) have also shown that a positive impact in terms of parsing accuracy can be obtained if MWUs are recognised in the input text.
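The words-with-spaces pre-processing idea mentioned above, in which fixed MWUs are merged into single tokens before tagging or parsing, can be sketched as follows. This is a minimal illustration under stated assumptions and does not reproduce any of the cited systems: the function `merge_mwus`, the greedy longest-match strategy, the underscore joiner and the toy lexicon are all hypothetical choices; only the example expressions (according to, so that, as much as) come from the text.

```python
def merge_mwus(tokens, lexicon, joiner="_"):
    """Merge known fixed MWUs into single tokens (greedy, longest match first).

    tokens:  list of word tokens.
    lexicon: set of lowercase token tuples, e.g. {("according", "to")}.
    """
    max_len = max((len(entry) for entry in lexicon), default=1)
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lexicon:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWU starts here: keep the single token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Toy lexicon built from the examples given in the text.
lexicon = {("according", "to"), ("so", "that"), ("as", "much", "as")}
tokens = "According to the report , it costs as much as a car".split()
merged = merge_mwus(tokens, lexicon)
# merged == ['According_to', 'the', 'report', ',', 'it', 'costs',
#            'as_much_as', 'a', 'car']
```

A contiguous matcher of this kind only covers fixed expressions. It cannot capture flexible MWUs whose components are separated by other words, which is the case the text associates with approaches that integrate MWU knowledge during the analysis itself.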
Summing up, the relevant literature has largely shown the benefit of using MWU knowledge – either before, during or after the analysis process – for improving the performance of language analysis applications, from POS tagging and chunking to full parsing. It is expected that further research in this area will overcome the limitations of the predominant words-with-spaces approach and will also provide a more in-depth evaluation of the impact including error analysis details.

2.3 Word sense disambiguation

Word sense disambiguation is another major NLP task in which MWU knowledge has proved very useful. This is particularly the case with collocations. Yarowsky (1993) introduced the famous “one sense per collocation” hypothesis, according to which “nearby words provide strong clues to the sense of a target word” (Yarowsky, 1995: 266). Yarowsky (1995) showed that the performance of WSD algorithms harnessing this property of language rivals that of supervised algorithms that require sense-tagged training corpora.

Strategies for MWU-aware WSD have since been proposed in a number of studies dealing with specific disambiguation tasks. Focusing on disambiguating the semantics of noun compounds, Girju et al. (2005) created annotation data which they then used in their WordNet-based WSD method. They found that their supervised model performs better than previous models which use less semantic information. Arranz et al. (2005) dealt with the task of disambiguating WordNet glosses. They proposed an approach which integrated a pre-processing step of MWU identification on the basis of matching against MWUs present in WordNet 2.0. The MWU identification has been found to lead to gains in precision and recall of WSD. Finlayson & Kulkarni (2011) further extended Arranz et al.’s work (2005) by integrating MWUs present in the SemCor corpus and by proposing two new MWU detection algorithms. They found non-trivial improvements over the baseline, although these remained comparable to those of previous work (Arranz et al. 2005) despite the more sophisticated MWU detectors used.

2.4 Information extraction and information retrieval

Given that the unit of meaning spans beyond the orthographic word, MWUs are expected to play an important role in information-seeking applications, such as Information Extraction and Information Retrieval. In the field of Information Extraction, Lin (1998) proposed an approach in which he exploited collocations extracted from a corpus to build a named entity classifier, and showed that collocation statistics greatly improved the performance of the system.

In the field of Information Retrieval, the applications of MWUs are very numerous. For instance, Salton & Smith (1989) discussed the identification of “complex identifying units” such as noun phrases and prepositional phrases in documents and search queries.
In the same context, Lewis & Croft (1990) explored the use of phrase clusters in document representation for information retrieval. Riloff (2005) provided evidence in favour of incorporating phrases containing prepositions for more effective document representation, arguing against the use of stopword lists. Mandala et al. (2000) considered the use of co-occurrence-based, automatically constructed thesauri for query expansion, and showed that this strategy, coupled with appropriate weighting methods, results in improved retrieval performance. Wacholder & Song (2005) showed the advantage of using complex noun phrase chunks and technical terms for indexing, providing evidence that users prefer long queries, since these are more specific. Finally, recent work by Acosta et al. (2011) gave further evidence that using automatically constructed MWU thesauri and
Johanna Monti, Violeta Seretan, Gloria Corpas Pastor & Ruslan Mitkov
indexing MWUs as single units are good strategies for improving information retrieval performance.
2.5 Other applications
MWU knowledge is arguably particularly useful in the field of Text Mining, since it allows for the identification of multiword concept terms (Rayson, 2010). MWUs have been used, for instance, to help ontologists build knowledge maps (Venkatsubramanyan & Perez-Carballo, 2004), and for knowledge acquisition from technical documents (Manrique-Losada et al., 2013). Successful computational treatment of MWUs has also been reported in keyword extraction (e.g. Tomokiyo & Hurst 2003), topic modelling (e.g. Baldwin, 2011; Lau et al., 2013; Nokel & Loukachevitch, 2015) and cognitive processing of MWUs based on gaze data (Rohanian et al., 2017; Yaneva et al. 2017). MWUs have also been used in various other NLP tasks and applications which include paraphrase recognition (e.g. Boonthum et al. 2005; Ullman & Nivre, 2014), question answering (e.g. Dowdall et al., 2003; Sanjuan et al., 2005), text summarisation (Seretan, 2011), word alignment (Venkatapathy & Joshi, 2006), sentiment analysis/classification (Rentoumi et al., 2009; Wang & Yu, 2010; Beigman Klebanov et al., 2013; Moreno-Ortiz et al., 2013; Williams et al., 2015), literary analysis (Cook & Hirst, 2013), and event categorisation (Marvel & Koenig, 2015). Finally, a promising avenue of research is the exploitation of MWUs in the field of language learning technology, in which they are expected to have an important impact. Examples of previous work in this area include the use of MWUs in developing a reading assistant for Japanese (Hazelbeck & Saito, 2010), automated essay evaluation methods (Burstein, 2013), and multiword learning systems (Brooke et al., 2015).
The large number of successful applications of MWUs reviewed in this section shows the advantage of integrating MWU knowledge in NLP and language learning applications, and we expect that the future will see even wider benefits from the enhanced treatment of MWUs, going beyond the words-with-spaces approaches to more flexible approaches capable of handling the morphosyntactic variation potential of complex lexical units.
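To make the contrast concrete, the following minimal Python sketch illustrates a words-with-spaces treatment, in which known MWUs are merged into single tokens by greedy longest-match lookup. The toy lexicon, the function name `group_mwus` and the underscore joining convention are assumptions made for this illustration; they are not taken from any of the systems cited above.

```python
def group_mwus(tokens, lexicon, max_len=4):
    """Merge contiguous MWUs listed in `lexicon` into single tokens.

    `lexicon` is a set of lower-cased token tuples (a hypothetical toy resource).
    At each position the longest matching entry wins.
    """
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word units.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lexicon:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No MWU starts here: keep the single word.
            out.append(tokens[i])
            i += 1
    return out


LEXICON = {("member", "state"), ("in", "accordance", "with")}

fixed = group_mwus("Each Member State acts in accordance with the treaty".split(), LEXICON)
variant = group_mwus("All Member States agreed".split(), LEXICON)
print(fixed)
print(variant)
```

In this sketch the plural form Member States is left unmerged because only the exact string is listed, which is the kind of morphosyntactic variation that a pure words-with-spaces treatment cannot handle without additional lemmatisation or rules.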
3. Multiword unit processing in machine translation
In this section, we review some of the work which illustrates the main approaches to MWU processing in Machine Translation (MT). We begin with a brief historical overview, and then present the main processing approaches in Rule-based Machine
Translation (RBMT), Example-based Machine Translation (EBMT), Statistical Machine Translation (SMT) and finally Neural Machine Translation (NMT).
3.1 Historical notes
Translation of MWUs has always represented a great challenge for MT. Already with the first experiments in MT in the late 50s, it became immediately apparent that MWUs pose serious problems for MT (Bar-Hillel, 1952), and after more than six decades MT still fails to translate them correctly, as highlighted by several recent contributions such as Barreiro et al. (2014), Monti (2013), and Ramisch et al. (2013), among others. For instance, if we try to translate He’s given up the ghost into Italian with one of the most widely used MT systems, namely Google Translate, we obtain Ha abbandonato il fantasma (lit. ‘He left the ghost’), which does not make sense because the system is not able to distinguish between the literal and the idiomatic meaning of this expression. The importance of the correct processing of MWUs in MT has been highlighted by several authors (e.g. Thurmair 2004, Villavicencio et al. 2005, Váradi 2006, Hurskainen, 2008, Rayson et al., 2010). However, in spite of the acknowledgement of the importance of the proper processing and translation of MWUs in MT, until very recently little attention has been paid to the main difficulties that this linguistic phenomenon poses to MT: MWUs are ubiquitous and there are very few bilingual or multilingual lexical resources which include them;¹⁵ they may have different degrees of compositionality (from free lexical combinations, as in clean water, to frozen ones such as fish out of water); their morphosyntactic properties allow, in some cases, a certain number of formal variations (Member State → Member States); MWUs can be contiguous or discontiguous, i.e. with the possibility of dependencies of elements which are distant from each other in the sentence (take sth.
away); MWUs are very often non-compositional and therefore have unpredictable, non-literal translations (such as kick the bucket); MWUs may have idiomatic or non-idiomatic meanings according to the context (Fazly 2009, Taslimipoor et al. 2017). To make things even more difficult, not all MWUs share the same semantic and syntactic properties (Monti, 2013). The problem of MWU processing and translation in MT has been discussed from several viewpoints according to the different MT modelling approaches, i.e. rule-based MT (RBMT), example-based MT (EBMT) or statistical MT (SMT). However, it was not until 2013 that the first workshop on Multiword units in Machine Translation and Translation Technology was established as an important
15. A recent survey on multilingual MWU resources (2016) has been conducted in PARSEME: https://awesome-table.com/-KMxGtOyp8q3fqjwlR3w/view
forum for the MT community to interact, share resources and tools, and collaborate in efforts to improve the computational treatment of MWUs in Machine Translation and Translation Technology. The aim of this section is therefore to present an overview of the state of the art of the various MT approaches to MWU processing and translation.
3.2 Multiword unit processing in RBMT
Rule-based Machine Translation (RBMT) is based on the use of linguistic resources both for the source and the target languages, such as dictionaries and rules, which may range from simple grammar and reordering rules to more complex syntactic and semantic rules, according to the different approaches (direct, transfer and interlingua approaches). In RBMT, the identification and translation of MWUs is mainly based on two different methods: a lexical and a compositional approach. In the lexical approach, MWUs are considered as single lemmas and lemmatised as such in the system dictionaries. Expressions such as in accordance with, water resistant, and by way of illustration can be lemmatised in this way, since they are contiguous units with no internal variability in co-occurrence. In the compositional approach, MWU processing is obtained by means of POS tagging and syntactic analysis of the different components of an MWU. The compositional approach complements the lexical one, since it is designed to detect and disambiguate MWUs not included in the MT system dictionaries and for which specific rules to handle non-contiguous or compositional MWUs are needed. An example of how MWUs are processed and translated in an RBMT system can be found in Scott (2003), Scott & Barreiro (2009), and Barreiro et al. (2016) with reference to OpenLogos,¹⁶ the former Logos System.
This transfer system is grounded in the premise that the syntactic structure on which an RBMT system is based should be essentially merged with the semantic structure to help resolve ambiguities at every linguistic level (lexical, syntactic or semantic). Every natural language string is translated into an intermediate and abstract representation language, the Semanto-syntactic Abstraction Language (SAL), before parsing. Its key element lies in the description of the verbs which are the main means for the production and comprehension of natural languages. The main linguistic knowledge bases are machine-readable dictionaries (MRDs), syntactic rules and SEMTAB rules, which interact (i) during the different phases through which natural language ambiguities are simplified and
16. https://sourceforge.net/projects/openlogos-mt/
reduced in an incremental way, and (ii) in the different modules (analysis, transfer and generation). In order to process and translate MWUs, Logos MT combines the lexical and the compositional approaches. System dictionaries contain fixed MWUs (in particular adverbs, conjunctions, verbs and noun compounds) with information about the head word (HEAD), which can be used during the generation phase and to correct machine translation problems related to agreement within multiword structures or within larger units, such as the agreement between nominal multiwords and the verb or agreement within verbal multiwords, such as in support verb constructions (Barreiro et al. 2014). The compositional approach comes into play when MWUs cannot be lexicalised, since they present a certain degree of variability or because they are discontinuous. The processing of this type of MWU in the OpenLogos system is performed by means of contextual-pattern rules called SEMTAB rules (Scott, 2003; Scott & Barreiro, 2009; and Barreiro et al., 2016). They disambiguate the meaning of words in the source text by identifying the semantic and syntactic structures underlying each meaning and provide the correct equivalent translation in the target language. The SEMTAB rules are deep structure patterns that match on/apply to a great variety of surface structures, since they are also able to handle variants of MWUs. They are invoked after dictionary look-up and during the execution of source and/or target syntactic rules (TRAN rules) at any point in the transfer phase in order to solve various ambiguity problems: (i) homographs such as bank which can be a transitive and intransitive verb or a noun; (ii) verb dependencies such as the different argument structures, [speak to], [speak about], [speak against], [speak of], [speak on], [speak on N (radio, TV, television, etc.)], [speak over N1(air) about N2], for the verb speak; (iii) MWUs of a different nature. 
In order to explain the role of this type of rule and how it operates, we can use the English phrasal verb mix up as an example. This MWU assumes different meanings according to the words and the nature of the words it occurs with. In (1), it means to change the order or arrangement of a group of things, especially by mistake or in a way that you do not want. In (2), it means to prepare something by combining two or more different substances. In (3), it means to think wrongly that somebody/something is somebody/something else and in (4), it means to be in a state of confusion.
(1) try not to mix up all the different problems together.
(2) mix up the ingredients in the cookie mix.
(3) Tom mixes John up with Bill.
(4) I’m all mixed up.
All these different meanings of mix up represented in (1)–(4) correspond to different translations into Italian and, possibly, into any other language. SEMTAB rules capture the different semantico-syntactic properties of each verb (also called linguistic constraints). For example, the SEMTAB rule mix up (vt) N (human, info) with describes the meaning (3) of the verb mix up, by generalising to an abstract level of representation the nature of its direct object and classifying it under the Information or Human noun superset of the SAL ontology. This type of abstraction allows coverage of a number of different sentences in which different types of human nouns occur, as illustrated in (5).
(5) Tom mixed John/him/the brother/the man/the buyer/the Professor up … with Bill.
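The idea behind a rule such as mix up (vt) N (human, info) with can be illustrated with a small Python sketch. This is not the SEMTAB formalism or any OpenLogos code: the word list standing in for the SAL Human/Information superset, the list of verb forms and the function name are all hypothetical, and the matching is deliberately naive.

```python
# Toy stand-in for the SAL "Human / Information" noun superset (hypothetical).
HUMAN_OR_INFO = {"tom", "john", "bill", "him", "brother", "man", "buyer", "professor"}
MIX_FORMS = {"mix", "mixes", "mixed", "mixing"}


def matches_mix_up_with(sentence):
    """Return True if the sentence fits the pattern: mix + N(human/info) + up + with,
    with the particle placed either after or before the direct object."""
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    for i, tok in enumerate(tokens):
        if tok not in MIX_FORMS:
            continue
        rest = tokens[i + 1:]
        if "up" not in rest or "with" not in rest:
            continue
        up, with_pos = rest.index("up"), rest.index("with")
        if with_pos < up:
            continue
        # Direct object: between verb and particle, or between particle and "with".
        obj = rest[:up] or rest[up + 1:with_pos]
        if obj and obj[-1] in HUMAN_OR_INFO:
            return True
    return False


for s in ["Tom mixes John up with Bill.",
          "Tom mixed the buyer up with Bill.",
          "mix up the ingredients in the cookie mix.",
          "I'm all mixed up."]:
    print(s, "->", matches_mix_up_with(s))
```

Because the constraint is stated on a semantic class rather than on individual words, the single pattern covers all the variants in (5) while leaving senses (2) and (4) to other rules.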
This approach largely resembles Sinclair’s hypothesis (1996) about the existence of a clear correlation between the textual environment of a word in one language and its translation equivalent in another language. The Multilingual Lexicography Project confirmed this hypothesis, as the relevant co-textual features (words, phrases and structural patterns) were found to be frequent and clear enough to enable the automation of important parts of the translation process and to create a multilingual lexical database as a resource for human translation.
3.3 Multiword unit processing in EBMT
EBMT relies on the analogy principle and therefore re-uses translations already stored in the system to translate MWUs. MWU processing in EBMT has been discussed by several scholars over the last decades (Sumita et al. 1990 and 1991; Nomiyama, 1992; Franz et al., 2000; Gangadharaiah & Balakrishnan, 2006 and more recently Anastasiou, 2010). Basically, the EBMT approach to MWUs uses examples of possible translations of MWUs, supplemented in many cases by linguistic rules. This is the case in Franz et al. (2000), and in Gangadharaiah & Balakrishnan (2006). The work by Anastasiou (2010) presents an exhaustive study of idiom processing in EBMT and a concrete application within the data-driven METIS-II system. The idiom linguistic resources used in the system are:
–– a dictionary, consisting of 871 German idioms together with their translations into English;
–– a corpus compiled from a subset of the Europarl corpus, a mixture of manually constructed data and examples extracted from the Web and, additionally, a part of the DWDS, a digital lexicon of the German language;
–– a set of rules to identify continuous and discontinuous idioms.
Idiom processing in METIS-II is divided into six stages: SL analysis; dictionary look-up; syntactic matching rules to identify idioms as lexical units; the use of Expander, a tool that formalises the German sentence into the corresponding English target sentence by changing its word order; the use of a ranking tool, Ranker, to choose the most appropriate target translation; and, finally, a stage in which the system generates the target sentence.
In PRESEMT, another example of EBMT, Tambouratzis et al. (2012) propose an approach to MWUs that relies on the use of a large monolingual and a small parallel bilingual corpus with a few hundred sentences aligned at sentence level to identify subsentential segments in both SL and TL and thus transfer structural information between languages. Alignment is therefore a crucial aspect for EBMT. Alignment is an unsupervised methodology, i.e. a methodology that uses raw (unannotated) input data to extract correspondences from large parallel corpora. Originally, alignment was used in Translation Memories (TMs) and took place at sentence level in order to provide translators with ready solutions extracted from previous translations stored in the TM database. TMs either return sentence pairs with identical source segments (exact matches) to translators, or sentences that are similar, but not identical, to the sentence to be translated (fuzzy matches). First-generation TM systems, based on sentence alignment, showed severe shortcomings since the full repetition of a sentence only occurs in a very limited number of texts, e.g. technical documents, and texts with related content or text revisions. To overcome these limitations, research in this area is now addressing the possibility of alignment on a subsentential level, and there are already several systems which are based on this approach (see the next section of this chapter, “Multiword Units in Translation Technology”, where MWUs and TM systems are discussed).
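The distinction between exact and fuzzy matches in a sentence-based TM can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the similarity measure is the token-level ratio of the standard-library `difflib.SequenceMatcher`, used here as a stand-in for whatever scoring a commercial TM system applies, and the memory content, threshold and function name are invented for the example.

```python
from difflib import SequenceMatcher


def tm_lookup(source, memory, threshold=0.75):
    """Return (match_type, stored_source, stored_target) for a source segment.

    `memory` maps stored source sentences to their translations (toy data).
    """
    if source in memory:
        return "exact", source, memory[source]
    best, best_score = None, 0.0
    for stored in memory:
        # Compare token sequences rather than characters.
        score = SequenceMatcher(None, source.lower().split(), stored.lower().split()).ratio()
        if score > best_score:
            best, best_score = stored, score
    if best is not None and best_score >= threshold:
        return "fuzzy", best, memory[best]
    return "none", None, None


# Invented example segment pair, not taken from any real TM.
MEMORY = {
    "Press the red button to start the machine.":
        "Premere il pulsante rosso per avviare la macchina.",
}

print(tm_lookup("Press the red button to start the machine.", MEMORY)[0])
print(tm_lookup("Press the green button to start the machine.", MEMORY)[0])
print(tm_lookup("He kicked the bucket.", MEMORY)[0])
```

The last query shows the shortcoming discussed above: a sentence that shares no full segment with the memory retrieves nothing, even if the memory contained the MWU it is built around, which is what motivates subsentential matching.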
Several scholars have focused their research on the possibility of automatically producing subsentential alignments from parallel bilingual corpora, both to recover text chunks which have a higher occurrence probability than the sentence and to cope efficiently with the problem of translating MWUs. In Groves et al. (2004), for instance, the methodology foresees the development of an automatic algorithm that aligns bilingual context-free phrase-structure trees at sub-structural level and its application to a subset of the English-French section of the HomeCentre corpus. More recently, in Ozdowska (2006), the syntactic information has been used in a heuristics-based method that expands anchor alignment using a set of manually defined syntactic alignment rules. Subsentential alignment seems to be a more suitable solution for the alignment of MWUs, especially if it takes into account the divergences between languages which can occur on the lexical, syntactic and semantic levels, i.e. if the method adopted is able to cope with the asymmetries between languages which
concern the translation of MWUs. For instance, if we consider the English verbal expression “blow the whistle on”, the Italian translation is denunciare and it is immediately clear that a one-to-one word mapping between the two text segments is not possible and that a different solution should be found. Recently, Barreiro et al. (2016) address this problem by proposing a set of linguistically informed and motivated guidelines for aligning multilingual texts. The guidelines are based on the alignment of bilingual texts of the test set of the Europarl corpus covering all possible combinations between English, French, Portuguese and Spanish. This contribution specifically analyses, and proposes guidelines which take into account, MWUs and semantico-syntactic unit alignments. In particular, it offers alignment solutions for four different classes: lexical and semantico-syntactic (MWUs, including support verb constructions, compound verbs and prepositional predicates), morphological (lexical versus non-lexical realisation such as articles and zero articles, the pro-drop phenomenon including subject pronoun dropping and empty relative pronoun, and contracted forms), morpho-syntactic (free noun adjuncts), and semantico-discursive (emphatic linguistic constructions such as pleonasm and tautology, repetition and focus constructions). Other types of MWUs have also been taken into account with reference to alignment problems, and in particular (i) bilingual terminology by Claveau (2009), whose method relies on syntax to extract patterns such as Noun-Verb, Adjective-Noun, Prepositional Noun Phrase, etc.; and (ii) collocations by Seretan (2009) through bilingual alignments where POS tags are equivalent or close (even with distant words). With regard to collocations, Segura & Prince (2011) propose an alignment process between pairs of sentences, strongly based on syntax.
It relies on an alignment memory, consisting of a learnt set of good alignments as well as a rule-based process that asynchronously combines alignment constraints in order to maximise coverage.
3.4 Multiword unit processing in SMT
In Statistical Machine Translation (SMT), which evolved from the IBM word-based models (Brown et al., 1988, 1990) to phrase-based models (Zens et al., 2002; Koehn et al., 2003; Tillmann & Xia, 2003), the problem of MWU processing is not specifically addressed. The traditional approach to word alignment following IBM Models (Brown et al., 1993) shows many shortcomings related to MWU processing, especially due to its inability to handle many-to-many correspondences. Since alignment is performed only between single words, i.e. one word in the source language only corresponds to one word in the target language, these models are not able to handle MWUs properly.
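A small data-structure sketch may help to show why word-to-word alignment is a poor fit for MWUs. Here an alignment is represented, purely for illustration, as a set of (source index, target index) links; the representation and the helper name `is_one_to_one` are assumptions of this example rather than the internals of any alignment toolkit.

```python
from collections import Counter


def is_one_to_one(links):
    """True if every source word and every target word takes part in exactly one link."""
    src_counts = Counter(s for s, _ in links)
    tgt_counts = Counter(t for _, t in links)
    return (all(c == 1 for c in src_counts.values())
            and all(c == 1 for c in tgt_counts.values()))


# Free combination: each word has its own counterpart (with reordering).
free_src, free_tgt = ["clean", "water"], ["acqua", "pulita"]
free_links = {(0, 1), (1, 0)}

# Idiomatic MWU: four source words correspond to a single target verb.
mwu_src, mwu_tgt = ["blow", "the", "whistle", "on"], ["denunciare"]
mwu_links = {(0, 0), (1, 0), (2, 0), (3, 0)}

print(is_one_to_one(free_links))
print(is_one_to_one(mwu_links))
```

The second alignment can only be expressed if several source words are allowed to link to one target word as a unit, which is exactly the kind of correspondence that a strictly word-based model does not provide.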
The phrase-based alignment approach to MT (PB-SMT) does not take the problem of MWUs into account either: even though it considers many-to-many alignments, some combinations of words or n-grams have limited linguistic significance (e.g., the war) while others are more linguistically meaningful (e.g., cold war). In SMT, phrases are therefore sequences of contiguous words which are not linguistically motivated and do not implicitly capture all useful MWU information. In state-of-the-art PB-SMT systems, the correct translation of MWUs therefore occurs only on a statistical basis, if the constituents of MWUs are marked and aligned as parts of consecutive phrases (n-grams) in the training set; it is not generally treated as a special case where correspondences between source and target may not be so straightforward, i.e. where they do not consist of consecutive many-to-many source-target correspondences. MWU processing and translation in SMT started being addressed only very recently and different solutions have been proposed so far, but basically they are considered either as a problem of automatically learning and integrating translations or as a problem of word alignment as already described for EBMT. The most used methodology is the following:
1. Identification of possible monolingual MWUs. This phase can be accomplished using different approaches: (i) morphosyntactic patterns (Okita et al., 2010; Dagan & Church, 1994); (ii) statistical methods (Vintar & Fišer, 2008); and (iii) hybrid approaches (Wu & Chang, 2004; Seretan & Wehrli, 2007; Daille, 2001; Boulaknadel et al., 2008).
2. Alignment to extract and attribute the equivalent translations of the identified monolingual MWUs according to the different alignment approaches, such as alignment based on co-occurrence, syntax-based models, models with linguistic annotation, discriminative word alignment, alignment combination, alignment of subsentential units and finally alignment evaluation.¹⁷
Recently, an increasing amount of attention has been paid to MWU processing in SMT since it has been acknowledged that large-scale applications cannot be created without proper handling of MWUs of all kinds (Salehi et al. 2015). Current approaches to MWU processing move towards the integration of phrase-based models with linguistic knowledge and scholars are starting to use linguistic resources, either hand-crafted dictionaries and grammars or data-driven ones, in order to identify and process MWUs as single units (Costa-Jussà & Farrús, 2014).
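Phase 1 of the methodology above, in its statistical variant, can be sketched with a simple association measure. The example below ranks bigrams by pointwise mutual information (PMI), one commonly used measure; it is not claimed to be the measure used in the works cited, and the toy corpus, the `min_count` cut-off and the function name are assumptions of this illustration.

```python
import math
from collections import Counter


def pmi_bigrams(sentences, min_count=2):
    """Rank bigrams by PMI = log2( p(x, y) / (p(x) * p(y)) ), highest first."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = sent.lower().split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:  # very rare pairs give unreliable PMI values
            continue
        p_xy = count / n_bi
        p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


CORPUS = [
    "the cold war ended",
    "the cold war divided the continent",
    "the war ended the old order",
    "the treaty ended the war",
]

ranked = pmi_bigrams(CORPUS)
for pair, score in ranked:
    print(pair, round(score, 2))
```

On this toy corpus cold war is ranked above the war although both occur equally often, reflecting the contrast drawn above between linguistically meaningful combinations and frequent but uninformative n-grams. In practice such scores would typically be combined with morphosyntactic filters, as in the hybrid approaches listed under (iii).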
17. For a comprehensive overview refer to http://www.statmt.org/survey/Topic/Word Alignment.
A first possible solution is the incorporation of machine-readable dictionaries and glossaries into the SMT system, for which there are several straightforward approaches. One is to introduce the lexicon as phrases in the phrase-based table (Lambert & Banchs, 2005). Unfortunately, the words coming from the dictionary have no context information. A similar approach is to introduce them to substitute the unknown words in the translation, but this poses the same problem as before. Okuma et al. (2008) present a more sophisticated approach where the lexicon words are introduced in the training corpus to enlarge their corpus. The criterion that they use is basically a Named Entity Recognition classification which allows them to substitute the named entity in the original corpus with any named entity from their lexicon. Note that their lexicon contains only proper nouns but it could be extended to any word, given the appropriate tagging of the original corpus. To deal with out-of-vocabulary words, Aziz et al. (2010) use entailment rules, in this case obtained from WordNet, and scored by different methods, including distributional similarity. The different scores are combined in an ‘active learning’ fashion and the expert model is applied/learnt in such a way that it never harms the performance of the original model.
Another solution for overcoming translation problems in MT and in SMT in particular is based on the idea that MWUs should be identified and bilingual MWUs should be grouped prior to statistical alignment (Lambert & Banchs, 2005). These authors adopted a method in which a bilingual MWU corpus was used to modify word alignment in order to improve the translation quality. In their work, bilingual MWUs were grouped as one unique token before training alignment models. They showed on a small corpus that both alignment quality and translation accuracy were improved.
However, in their further study, they reported even lower BLEU scores after grouping MWUs by part of speech on a large corpus (Lambert & Banchs, 2006). As demonstrated by Ren et al. (2009), experiments show that the integration of bilingual domain MWUs in SMT could significantly improve translation performance. Wu et al. (2008) propose the construction of phrase tables using a manually-made translation dictionary in order to improve SMT performance. Korkontzelos & Manandhar (2010) highlight that knowledge about multiword units leads to an increase of between 7.5% and 9.5% in accuracy. Finally, Bouamor et al. (2011) show that integration of contiguous MWUs and their translations in Moses¹⁸ improves translation quality and propose a hybrid approach for extracting contiguous MWUs and their translations in
18. http://www.statmt.org/moses/
a French-English parallel corpus. They use distributional-similarity-based approaches to find translation equivalents. Other solutions seek to integrate syntactic and semantic structures (Chiang, 2005; Marcu et al., 2006; Zollmann & Venugopal, 2006) in order to obtain better translation results, but the solutions undoubtedly vary according to the different degrees of compositionality of the MWU. Carpuat & Diab (2010), for instance, conducted an English-Arabic translation pilot study for task-orientated evaluation of MWUs in SMT using manually defined WordNet MWUs and a dictionary-matching approach to MWU detection. They proposed two different integration strategies for monolingual MWUs in SMT, considering different degrees of MWU semantic compositionality, i.e. (i) a static integration strategy that segments training and test sentences according to the MWU vocabulary and (ii) a dynamic integration strategy that adds a new MWU-based feature in SMT translation lexicons. The first strategy allows a source text to be segmented in such a way that MWUs are recognised and frozen as single lexical units. In this way, during the training and decoding phases, MWUs are handled as distinct words regardless of their compositionality. In the dynamic strategy, the SMT system decides at decoding time how to segment the input sentence and attempts to translate compositional MWUs on the basis of a count feature in the translation lexicon that represents the number of MWUs in the input language phrase. On the basis of the positive outcome of their pilot study, Carpuat & Diab (2010) conclude that it would be interesting to use more general MWU definitions such as automatically learned collocations (Smadja, 1993) or verb-noun constructions (Diab & Bhutada, 2009) on a larger scale. In the wake of this latter study, different scholars have analysed this problem in more depth from different points of view. Pal et al. 
(2010) show how single tokenisation of two types of MWUs, namely named entities (NEs) and compound verbs, as well as their prior alignment can boost the performance of PB-SMT (4.59 BLEU points absolute, 52.5% relative improvement) on an English-Bengali translation task. This model is further implemented by Pal et al. (2011), who propose to preprocess a parallel corpus to identify noun-noun MWUs, reduplicated phrases, complex predicates and phrasal prepositions. Single tokenisation of noun-noun MWUs, phrasal prepositions (source side only) and reduplicated phrases (target side only) provides significant gains (6.38 BLEU points absolute, 73% relative improvement) over the PB-SMT baseline system on an English-Bengali translation task.
More recently, Neural Machine Translation (NMT), based on the use of deep learning methods which try to resemble human reasoning and more specifically on neural networks (NNs) (Kalchbrenner & Blunsom, 2013; Cho et al., 2014;
Cho, forthcoming; Luong et al., 2014), is emerging as the most advanced technology in the field of MT which potentially overcomes many shortcomings of the state-of-the-art SMT. At the time of writing, there are to the best of our knowledge only very few papers specifically addressing the problems related to MWU processing and translation, such as Tang et al. (2016), who propose phraseNet, a Chinese-English NMT with a phrase memory able to generate MWUs, and Rikters & Bojar (2017), who investigate NMT processing of MWUs and propose a method to improve automated translation of sentences that contain MWUs with reference to English-Latvian and English-Czech NMT systems.
It is worth noting that there are studies which report automatic translation of MWUs not necessarily as part of an MT system but as an independent methodology which extracts MWUs and their translations from parallel corpora (Bouamor et al. 2012, Taslimipoor 2015) or comparable corpora (e.g. Mitkov 2016, Taslimipoor et al. 2016, Rapp & Sharoff, 2014) using association measures for extraction and distributional similarity approaches for finding translation equivalents. In addition, some studies have exploited alignments in parallel corpora (Cap et al. 2015) or translation asymmetries of MWUs (Monti et al., forthcoming) to better identify them. Such studies could potentially underpin not only MT systems, but also the tools used by translators and lexicographers.
4. Multiword units in translation technology
This section discusses how MWUs are dealt with in Translation Memory (TM) systems and other support tools for translators.¹⁹ Little work has been reported on the treatment of MWUs in Translation Technology. Macken (2009) compares two different types of TM systems (first-generation and second-generation TM systems) with regard to their treatment of terms and frequent multiword expressions.
The findings of this study suggest that second-generation, subsentential TM systems such as Similis work better than the traditional first-generation, sentence-based TM systems such as SDL at providing translations for terms and frequent multiword expressions. In her PhD thesis, Fernández Parra (2011) investigated how Translation Memory systems and CAT tools in general perform automatic term extraction and automatic term recognition when applied to formulaic expressions instead of terms. She analysed the following CAT tools: SDL Trados, Déjà Vu, Wordfast, STAR Transit, Araya, ExtPhr32, OmegaT, MemoQ, Swordfish, Fusion Translate,
19. See also the previous section (Section 3) where TM systems are briefly discussed.
Similis, Google Translator Toolkit and Lingotek. Fernández Parra concludes that CAT tools can be productively used in the treatment of formulaic expressions but that some improvements can be made both in the short and in the longer term. In the short term, and more specifically without changing the software specifications of the CAT tools, she makes recommendations related to their settings and combinations of settings which would deliver better results in the computational treatment of formulaic expressions. Fernández Parra also makes more general and longer term recommendations as to the design of the CAT tools and their software specifications to include specific features for formulaic expressions.
Jian et al. (2004) outline the online tool TANGO which performs translation of collocations. The tool is based on bilingual collocation extraction from a parallel corpus. Their method exploits statistical and linguistic information which includes parts of speech, syntactic chunks and clauses to obtain extended lists of collocations from monolingual corpora such as the BNC. By using word alignment techniques in a parallel corpus, they compile a ‘phrasal’ translation memory consisting of pairs of bilingual collocations.
Huet & Langlais (2012) describe an experiment where they seek to identify the translations of a number of idioms in the translation memory²⁰ of the new version of the bilingual concordancer TransSearch which, in addition to being a bilingual concordancer, also functions as a translation finder. The user can search for a specific translation in the system and the output is the query and its counterpart in French.²¹ The system allows queries of more than one word and, more specifically, multiword expressions. By way of example, the MWE query is still in its infancy would return the French equivalent expression en est encore à ses premiers balbutiements.
TransSearch also caters for the morphological derivations of words and, in this sense, is more advanced than the traditional TM methods. In their experiment, three types of queries were submitted to the system: queries in English, queries in French and bilingual queries. The initial results showed that the user could retrieve no more than 28% of the French expressions by simply querying them verbatim. They then changed the queries following various
20. The translation memory is derived from the Canadian Hansards, a collection of the official proceedings of the Canadian Parliament. For their experiments, they used an in-house sentence aligner to align 8.3 million French–English sentence pairs extracted from the 1986 to 2007 period of the Hansards.
21. The authors provide the following interesting statistics. During a period of 6 years, TransSearch received 7.2 million queries and 87% of these queries contained at least two words, with the most frequent such queries being ‘in light of’ and ‘out of the blue’.
transformations such as removing pronouns, replacing verbs with their lemmas, removing auxiliary verbs, etc., which resulted in a higher retrieval rate.
The Sketch Engine (Kilgarriff et al. 2014) offers the functionality of producing a one-page summary of the collocational behaviour of a particular word (its so-called ‘word sketch’), statistically derived from corpus data and structured according to the grammatical patterns in which the collocations occur. Originally, the system was designed only for monolingual corpora. Due to the emergence of large volumes of bilingual corpora, a functionality providing bilingual sketches became increasingly relevant; this feature was subsequently implemented and reported in Kovář et al. (2016).
Colson (2016; forthcoming) developed the IdiomSearch tool, which is based on the so-called Corpus Proximity Ratio (cpr),²² a measure he proposed for the automatic extraction of phraseological units. Colson reports that a preliminary evaluation by native speakers indicates that the cpr-score reaches a precision of about 90%, but that recall is impossible to determine for longer n-grams, because no agreement can be reached on the exact number or boundaries of longer phraseological units contained in a text or a corpus.²³ Colson (forthcoming) goes on to conclude that IdiomSearch could be useful for making language learners and translators aware of a whole host of ‘weakly idiomatic phraseological units’ which might go largely unnoticed if not highlighted by other computational phraseology tools.
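Colson's description of the cpr-score, as the ratio between the exact frequency of an n-gram and its frequency when a window W of intervening tokens is allowed between the grams, can be illustrated with a minimal Python sketch. This is one simplified reading of that description and not the IdiomSearch implementation: the function names `cpr_score` and `_matches_from`, the whitespace tokenisation, the toy sentence, and the choice to count each starting position at most once are all assumptions made for illustration.

```python
from typing import Sequence


def _matches_from(tokens: Sequence[str], ngram: Sequence[str],
                  pos: int, k: int, window: int) -> bool:
    """Return True if ngram[k:] can be matched after index `pos`, allowing
    at most `window` intervening tokens between consecutive grams."""
    if k == len(ngram):
        return True
    last = min(pos + 1 + window, len(tokens) - 1)
    for nxt in range(pos + 1, last + 1):
        if tokens[nxt] == ngram[k] and _matches_from(tokens, ngram, nxt, k + 1, window):
            return True
    return False


def cpr_score(tokens: Sequence[str], ngram: Sequence[str], window: int) -> float:
    """Hypothetical cpr-like ratio: exact n-gram frequency divided by the
    frequency of the same grams in order with gaps of up to `window` tokens."""
    if not ngram:
        raise ValueError("ngram must contain at least one gram")
    n = len(ngram)
    target = list(ngram)
    # Exact frequency: the grams occur contiguously.
    exact = sum(1 for i in range(len(tokens) - n + 1)
                if list(tokens[i:i + n]) == target)
    # Windowed frequency: count start positions from which a gapped match exists.
    windowed = sum(1 for i, tok in enumerate(tokens)
                   if tok == ngram[0] and _matches_from(tokens, ngram, i, 1, window))
    return exact / windowed if windowed else 0.0


# Toy data (invented for this sketch, not taken from any corpus).
tokens = "at the heart of the matter and at the very heart of it".split()
print(cpr_score(tokens, ("at", "the", "heart", "of"), window=1))  # 0.5
print(cpr_score(tokens, ("at", "the", "heart", "of"), window=0))  # 1.0
```

With a window of 1, the gapped occurrence ‘at the very heart of’ is also counted, so the ratio drops from 1.0 to 0.5. As the source notes, the window has to be set experimentally for each corpus and language.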
Further reading and additional resources
The computational treatment of multiword expressions is an area of intensive research. The vibrant MWE community is organised around the MWE section of the Special Interest Group on the Lexicon of the Association for Computational Linguistics, SIGLEX-MWE. Since 2003, SIGLEX-MWE has held regular workshops in conjunction with major computational linguistics, machine translation
22. The author writes that the cpr-score measures the ratio between the exact frequency of an n-gram in a corpus, and the frequency of the n-gram given a certain window between the grams; he also explains that this window (W) must be set experimentally according to the corpus and the language.
23. By way of illustration, Colson (forthcoming) provides the example, ‘At the heart of the current turmoil is a decision by Saudi Arabia and other leading voices in the Opec oil cartel to get drawn into a turf war with the new generation of US shale producers’, where annotators will disagree as to which sequence among the following to mark as a phraseological unit: ‘heart of’, ‘at the heart’, ‘at the heart of’, ‘at the heart of the’. He also writes that annotators may not agree as to whether ‘Saudi Arabia’ is a phraseological unit or not.
and phraseology conferences (e.g., ACL, COLING, LREC, MT Summit, EUROPHRAS). Most of the papers published in workshop proceedings are available online at the SIGLEX-MWE website²⁴ and some are included in the ACL Anthology.²⁵ Three collections of papers have been published to date as special issues of journals:
–– ACM Transactions on Speech and Language Processing special issue on multiword expressions: “From theory to practice and use” (Ramisch et al., 2013);
–– Language Resources and Evaluation special issue on multiword expressions: “Hard going or plain sailing” (Rayson et al., 2010);
–– Computer Speech & Language special issue on multiword expressions: “Having a crack at a hard nut” (Villavicencio et al., 2005).
See also the proceedings of the two editions of the Computational and Corpus-Based Phraseology Conference (under the aegis of EUROPHRAS) held so far (Corpas Pastor, 2016; Mitkov, 2017).
PhD theses on multiword expressions include Daille (1994), Krenn (2000), Evert (2004), Fazly (2007), Seretan (2008), Pecina (2008), Anastasiou (2009), Tsvetkov (2010), Straňák (2010), Fernández Parra (2011), Ramisch (2012), Monti (2013), Schneider (2014), Nagy (2014) and Cap (2014), among others. Three of these are also available as monographs (Anastasiou, 2010; Seretan, 2011; Ramisch, 2015). Early influential books include Sinclair (1991), Moon (1998), and Fellbaum (2007).
Useful resources include software (Evert, 2004; Seretan, 2008; Ramisch, 2012) and datasets such as the ones available on the SIGLEX-MWE page of resources²⁶ and those produced in the two shared tasks organised by the community to date: the LREC 2008 Shared Task for Multiword Expressions (Grégoire et al., 2008), and the PARSEME shared task on automatic identification of verbal MWEs (Savary et al., 2017).²⁷ Other relevant datasets are the CMWE corpus²⁸ developed by researchers at CMU (Schneider et al., 2014) and the DiMSUM²⁹ dataset prepared for SemEval 2016 Task 10, Detecting Minimal Semantic Units and Their Meanings.
24. http://multiword.sourceforge.net/
25. http://aclweb.org/anthology/
26. http://multiword.sourceforge.net/PHITE.php?sitesig=FILES&page=FILES_20_Data_Sets
27. The PARSEME shared task corpus, which contains texts in 18 languages manually annotated with verbal multiword expressions using universal guidelines, is freely available under various flavours of Creative Commons licences (http://hdl.handle.net/11372/LRT-2282).
28. http://www.ark.cs.cmu.edu/LexSem/
29. https://dimsum16.github.io/
For introductory surveys, the reader is referred to Sag et al. (2002), the most influential paper on MWEs according to bibliographic indicators; the introduction to the second special issue on MWEs (Villavicencio et al., 2005); the ‘Multiword Expressions’ chapter of the Handbook of Natural Language Processing (Baldwin & Kim, 2010); and Ramisch’s book Multiword Expressions Acquisition: A Generic and Open Framework (2014). The latter provides an excellent introduction to the topic of multiword expressions in general and their computational treatment. Chapter 7 of this book discusses the problem of MWEs in machine translation. More specifically, the author experiments with the translation of English phrasal verbs into French, employing Statistical Machine Translation systems, and concludes that this task is a real challenge for SMT. For an up-to-date survey of the computational treatment of MWEs, the reader is referred to the chapter “Computational treatment of multiword expressions” (Ramisch & Villavicencio, forthcoming) in the forthcoming second, substantially revised edition of the Oxford Handbook of Computational Linguistics (Mitkov (Ed.), forthcoming) and to “Multiword expression processing: a survey” (Constant et al., 2017).
Research on the computational treatment of MWEs has been fostered by a number of past projects, including: XMELLT: Cross lingual Multiword Expression Lexicons for Language Technology (https://www.cs.vassar.edu/~ide/XMELLT.html), Collocations in the German Language of the 20th Century (http://www.bbaw.de/forschung/kollokationen/index.html), and the Multiword Expression Project (http://mwe.stanford.edu/). More recently, the previously mentioned European network PARSEME – a consortium of 230 members from 33 countries representing 31 languages (EU IC1207 COST Action) – has acted as a powerful catalyst for work on the integration of MWEs into advanced levels of linguistic processing, notably parsing.
The network has held regular general meetings with peer-reviewed poster sessions, tutorials, and invited talks; organised several workshops and training schools; and produced a large number of publications, including one book, four surveys on the main research axes of the network (MWE resources, processing, annotation, and classification), and other important outcomes. In 2016, the Action launched the “Phraseology and Multiword Expressions” (PMWE) Open Access book series at the Language Science Press. The network’s website (http://www.parseme.eu) gives access to all the results of this unprecedented scientific cooperation, including documentation and training material such as video lectures.
Beyond the computational linguistics realm, important theoretical and experimental work on multiword expressions, including work based on computational approaches, has been carried out in the fields of phraseology, lexicography, and second language learning – in particular, under the auspices of the European
Association for Phraseology (EUROPHRAS). A forthcoming edited volume by Corpas Pastor, Colson & Heid, to be published by John Benjamins, will cover different aspects of computational phraseology. The EUROPHRAS webpage³⁰ provides pointers to useful material and gives an overview of the activities of the association.
Finally, the MUMTTT workshop series “Multiword Units in Machine Translation and Translation Technology” (Monti et al., 2013; Corpas Pastor et al., 2016; Monti et al., 2018), jointly organised by SIGLEX-MWE, PARSEME and EUROPHRAS, provides the opportunity for researchers and practitioners in the fields of (Computational) Linguistics, (Computational) Phraseology, Translation Studies and Translation Technology to coordinate research efforts across disciplines in order to improve the treatment of multiword expressions in human language technologies.
Acknowledgements
The authors would like to thank Shiva Taslimipoor for providing useful comments and references, and for carrying out the experiment outlined in Section 1. Preparation of this chapter was supported in part by research grants FFI2016-75831-P (VIP) and HUM2754 (TERMITUR); this funding is acknowledged for the involvement of both R. Mitkov and G. Corpas Pastor in both projects.
References
Abeillé, A., Clément, L., & Toussenel, F. (2003). Building a treebank for French. In A. Abeillé (Ed.), Treebanks (pp. 165–187). Dordrecht: Kluwer. doi: 10.1007/978-94-010-0201-1_10
Acosta, O., Villavicencio, A., & Moreira, V. (2011). Identification and treatment of multiword expressions applied to Information Retrieval. In Proceedings of the workshop on multiword expressions: From parsing and generation to the real world (pp. 101–109). Portland, Oregon, USA.
Alegría, I., Ansa, O., Artola, X., Ezeiza, N., Gojenola, K., & Urizar, R. (2004). Representation and treatment of multiword expressions in Basque. In Second ACL workshop on multiword expressions: Integrating processing (pp. 48–55). Barcelona, Spain. doi: 10.3115/1613186.1613193
Anastasiou, D. (2009). Idiom treatment experiments in machine translation (Unpublished doctoral dissertation). Saarland University.
Anastasiou, D. (2010). Idiom treatment experiments in machine translation. Newcastle upon Tyne: Cambridge Scholars Publishing.
30. http://europhras.org/
Arnold, I. V. (1973). The English Word. Moscow: Higher School Publishing House.
Arranz, V., Atserias, J., & Castillo, M. (2005). Multiwords and word sense disambiguation. In Proceedings Computational linguistics and intelligent text processing: 6th international conference, CICLING 2005, Mexico City, Mexico, February 13–19, 2005 (pp. 250–262). Mexico City, Mexico.
Aziz, W., Dymetman, M., Mirkin, S., Specia, L., Cancedda, N., & Dagan, I. (2010). Learning an expert from human annotations in statistical machine translation: The case of out-of-vocabulary words. In Proceedings of the 14th annual meeting of the European Association for Machine Translation (EAMT) (pp. 28–35). Saint-Raphaël, France.
Baldwin, T. (2011). MWEs and topic modelling: Enhancing machine learning with linguistics. In Proceedings of the workshop on multiword expressions: From parsing and generation to the real world (p. 1). Portland, Oregon, USA.
Baldwin, T., & Kim, S. N. (2010). Multiword expressions. In N. Indurkhya & F. J. Damerau (Eds.), Handbook of Natural Language Processing, Second Edition (pp. 267–292). Boca Raton, USA: Chapman and Hall/CRC.
Bar-Hillel, Y. (1952). The Treatment of ‘idioms’ by a Translating Machine, presented at the Conference on Mechanical Translation at Massachusetts Institute of Technology, June 1952.
Barreiro, A., & Batista, F. (2016). Machine translation of non-contiguous multiword units. In Proceedings of Workshop on Discontinuous Structures in Natural Language Processing (DiscoNLP) (pp. 22–30). San Diego, California, USA.
Barreiro, A., Monti, J., Orliac, B., Preuß, S., Arrieta, K., Ling, W., Batista, F., & Trancoso, I. (2014). Linguistic evaluation of support verb constructions by OpenLogos and Google Translate. In Proceedings of Ninth International Conference on Language Resources and Evaluation (LREC2014) (pp. 35–40). Reykjavik, Iceland.
Barreiro, A., Raposo, F., & Luís, T. (2016). CLUE-Aligner: An alignment tool to annotate pairs of paraphrastic and translation units. In Proceedings of the LREC 2016 Workshop “Translation Evaluation: From Fragmented Tools and Data Sets to an Integrated Ecosystem” (pp. 7–13). Portorož, Slovenia.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Grammar of spoken and written English. Edinburgh: Pearson Education Limited.
Boonthum, C., Toida, S., & Levinstein, I. (2005). Sense disambiguation for preposition with. In Proceedings of the second ACL–SIGSEM workshop on the linguistic dimensions of prepositions and their use in computational linguistic formalisms and applications (pp. 153–162). Colchester, United Kingdom.
Bouamor, D., Semmar, N., & Zweigenbaum, P. (2012). Automatic Construction of a MultiWord Expressions Bilingual Lexicon: A Statistical Machine Translation Evaluation Perspective. In Proceedings of the 3rd Workshop on Cognitive Aspects of the Lexicon (CogALex-III), COLING 2012 (pp. 95–108). Mumbai, India.
Bouamor, D., Semmar, N., & Zweigenbaum, P. (2011). Improved statistical machine translation using multiword expressions. In Proceedings of the International Workshop on Using Linguistic Information for Hybrid Machine Translation (LIHMT 2011) (pp. 15–20). Barcelona, Spain.
Boulaknadel, S., Daille, B., & Aboutajdine, D. (2008). A multi-word term extraction program for Arabic language. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08) (pp. 1485–1488). Marrakech, Morocco.
Brooke, J., Hammond, A., Jacob, D., Tsang, V., Hirst, G., & Shein, F. (2015). Building a lexicon of formulaic language for language learners. In Proceedings of the 11th workshop on multiword expressions (pp. 96–104). Denver, Colorado, USA.
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R. L., & Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79–85.
Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.
Brown, P., Cocke, J., Pietra, S. D., Pietra, V. D., Jelinek, F., Mercer, R., & Roossin, P. (1988). A statistical approach to language translation. In Proceedings of the 12th conference on Computational linguistics, Volume 1 (pp. 71–76). Budapest, Hungary.
Brun, C. (1998). Terminology finite-state preprocessing for computational LFG. In Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics (pp. 196–200). Morristown, New Jersey, USA.
Burstein, J. (2013). The far reach of multiword expressions in educational technology. In Proceedings of the 9th workshop on multiword expressions (p. 138). Atlanta, Georgia, USA.
Cacciari, C., & Tabossi, P. (1988). The comprehension of idioms. Journal of Memory and Language, 27(6), 668–683.
Cap, F., Nirmal, M., Weller, M., & Schulte im Walde, S. (2015). How to Account for Idiomatic German Support Verb Constructions in Statistical Machine Translation. In Proceedings of the 11th Workshop on Multiword Expressions (MWE) at NAACL (pp. 19–28). Denver, Colorado, USA.
Cap, F. (2014). Morphological processing of compounds for statistical machine translation (Unpublished doctoral dissertation). University of Stuttgart.
Carpuat, M., & Diab, M. (2010). Task-based evaluation of multiword expressions: A pilot study in statistical machine translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 242–245). Los Angeles, California, USA.
Carter, R. (1998). Vocabulary: Applied Linguistics Perspectives (2nd ed.). London and New York: Routledge.
Chafe, W. (1968). Idiomaticity as an anomaly in the Chomskyan paradigm. Foundations of Language, 4, 109–127.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 263–270). Ann Arbor, Michigan, USA.
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of Conference on Empirical Methods on Natural Language Processing (EMNLP 2014) (pp. 1724–1734). Doha, Qatar.
Cho, K. (forthcoming). Deep Learning. In R. Mitkov (Ed.), The Oxford Handbook of Computational Linguistics (2nd ed.). Oxford: Oxford University Press.
Chomsky, N. (1980). Rules and representations. Behavioral and Brain Sciences, 3(1), 1–15.
Choueka, Y., Klein, S. T., & Neuwitz, E. (1983). Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus. Journal of the Association for Literary and Linguistic Computing, 4(1), 34–38.
Claveau, V. (2009). Translation of biomedical terms by inferring rewriting rules. In V. Prince (Ed.), Information Retrieval in Biomedicine: Natural Language Processing for Knowledge Integration, IGI-Global (pp. 106–123). doi: 10.4018/978-1-60566-274-9.ch006
Colson, J. P. (forthcoming). Computational phraseology and translation studies: from theoretical hypotheses to practical tools. In G. Corpas Pastor, J. P. Colson, & U. Heid (Eds.), Computational Phraseology. Amsterdam & New York: John Benjamins.
Colson, J. P. (2016). Set phrases around globalization: an experiment in corpus-based computational phraseology. In F. A. Almeida, I. Ortega Barrera, E. Quintana Toledo, & M. E. Sanchez Cuervo (Eds.), Input a Word, Analyze the World. Selected Approaches to Corpus Linguistics (pp. 141–152). Newcastle: Cambridge Scholars Publishing.
Constant, M., & Sigogne, A. (2011). MWU-aware part-of-speech tagging with a CRF model and lexical resources. In Proceedings of the workshop on multiword expressions: From parsing and generation to the real world (pp. 49–56). Portland, Oregon, USA.
Constant, M., Candito, M., & Seddah, D. (2013b). The LIGM-Alpage Architecture for the SPMRL 2013 Shared Task: Multiword Expression Analysis and Dependency Parsing. In Shared task track of the EMNLP Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL’13) (pp. 46–52). Seattle, Washington, USA.
Constant, M., Eryiğit, G., Monti, J., Van Der Plas, L., Ramisch, C., Rosner, M., & Todirascu, A. (2017). Multiword expression processing: a survey. Computational Linguistics, 43(4), 837–892.
Constant, M., Roux, J. L., & Sigogne, A. (2013a). Combining compound recognition and PCFGLA parsing with word lattices and conditional random fields. ACM Transactions on Speech and Language Processing (TSLP), 10(3), 8:1–8:24.
Cook, P., & Hirst, G. (2013). Automatically assessing whether a text is clichéd, with applications to literary analysis. In Proceedings of the 9th workshop on multiword expressions (pp. 52–57). Atlanta, Georgia, USA.
Corpas Pastor, G. (Ed.). (2016). Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives (Full papers). Geneva: Tradulex. http://www.tradulex.com/varia/Europhras2015.pdf
Corpas Pastor, G., Colson, J. P., & Heid, U. (Eds.). (forthcoming). Computational Phraseology. Amsterdam & New York: John Benjamins.
Corpas Pastor, G., Monti, J., Seretan, V., & Mitkov, R. (Eds.). (2016). Workshop proceedings: Multi-word units in machine translation and translation technologies (MUMTTT 2015), Malaga, Spain. Geneva: Editions Tradulex.
Costa-Jussà, M. R., & Farrús, M. (2014). Statistical machine translation enhancements through linguistic levels: A survey. ACM Computing Surveys (CSUR), 46(3), 42. doi: 10.1145/2518130
Cowie, A. P. (1981). The treatment of collocations and idioms in learners’ dictionaries. Applied Linguistics, 2(3), 223–235.
Dagan, I., & Church, K. (1994). Termight: Identifying and translating technical terminology. In Proceedings of the fourth conference on Applied natural language processing (pp. 34–40). Stuttgart, Germany.
Daille, B. (1994). Approche mixte pour l’extraction automatique de terminologie : statistiques lexicales et filtres linguistiques (Unpublished doctoral dissertation). Université Paris 7.
Daille, B. (2001). Extraction de collocation à partir de textes. In Actes de la 8ème conférence sur le Traitement Automatique des Langues Naturelles (TALN’2001) (pp. 3–8). Tours, France.
Diab, M. T., & Bhutada, P. (2009). Verb noun construction MWE token supervised classification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (pp. 17–22). Suntec, Singapore.
Dowdall, J., Rinaldi, F., Ibekwe-SanJuan, F., & SanJuan, E. (2003). Complex structuring of term variants for Question Answering. In Proceedings of the ACL 2003 workshop on multiword expressions: Analysis, acquisition and treatment (pp. 1–8). Sapporo, Japan.
Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations (Unpublished doctoral dissertation). University of Stuttgart.
Fazly, A., Cook, P., & Stevenson, S. (2009). Unsupervised type and token identification of idiomatic expressions. Computational Linguistics, 35(1), 61–103. doi: 10.1162/coli.08-010-R1-07-048
Fazly, A. (2007). Automatic acquisition of lexical knowledge about multiword predicates (Unpublished doctoral dissertation). University of Toronto.
Fellbaum, C. (1993). The Determiner in English Idioms. In C. Cacciari & P. Tabossi (Eds.), Idioms: Processing, Structure, and Interpretation (pp. 271–295). Hillsdale, NJ: Erlbaum.
Fellbaum, C. (2007). Idioms and collocations: Corpus-based linguistic and lexicographic studies. Bloomsbury Academic.
Fernández Parra, M. A. (2011). Formulaic Expressions in Computer-Assisted Translation. A specialised translation approach (Unpublished doctoral dissertation). Swansea University.
Fernando, C., & Flavell, R. (1981). On Idiom: Critical Views and Perspectives. Exeter Linguistic Studies vol. 5. Exeter: University of Exeter.
Finlayson, M., & Kulkarni, N. (2011). Detecting multi-word expressions improves Word Sense Disambiguation. In Proceedings of the workshop on multiword expressions: From parsing and generation to the real world (pp. 20–24). Portland, Oregon, USA.
Firth, J. R. (1957). Papers in Linguistics 1934–1951. London: Oxford University Press.
Franz, A., Horiguchi, K., Duan, L., Ecker, D., Koontz, E., & Uchida, K. (2000). An integrated architecture for example-based machine translation. In Proceedings of the 18th conference on Computational linguistics, Volume 2 (pp. 1031–1035). Saarbrücken, Germany.
Fraser, B. (1970). Idioms within a transformational grammar. Foundations of Language, 6, 22–42.
Gangadharaiah, R., & Balakrishnan, N. (2006). Application of linguistic rules to generalized example based Machine Translation for Indian languages. In Proceedings of first National symposium on modeling and shallow parsing of Indian languages (MSPIL). Mumbai, India.
Geoffrey Leech, R. G., & Bryant, M. (2011). CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING-94) (pp. 622–628). Kyoto, Japan.
Gibbs, R., & Nayak, N. (1989). Psycholinguistic Studies on the Syntactic Behavior of Idioms. Cognitive Psychology, 21, 100–138.
Girju, R., Moldovan, D., Tatu, M., & Antohe, D. (2005). On the semantics of noun compounds. Journal of Computer Speech and Language – Special Issue on Multiword Expressions, 19(4), 479–496. doi: 10.1016/j.csl.2005.02.006
Granger, S., & Meunier, F. (2008). Disentangling the phraseological web. In S. Granger & F. Meunier (Eds.), Phraseology: An interdisciplinary perspective (pp. 27–49). Amsterdam: John Benjamins. doi: 10.1075/z.139.07gra
Grégoire, N., Evert, S., & Krenn, B. (Eds.). (2008). Proceedings of the LREC workshop towards a shared task for multiword expressions (MWE 2008). Marrakech, Morocco.
Groves, D., Hearne, M., & Way, A. (2004). Robust sub-sentential alignment of phrase-structure trees. In Proceedings of the 20th international conference on Computational Linguistics (pp. 1072–1078). Geneva, Switzerland.
Hazelbeck, G., & Saito, H. (2010). A hybrid approach for functional expression identification in a Japanese reading assistant. In Proceedings of the 2010 workshop on multiword expressions: From theory to applications (pp. 81–84). Beijing, China.
Huet, S., & Langlais, Ph. (2011). Identifying the translations of idiomatic expressions using TransSearch. In Proceedings of the 8th International NLPCS Workshop (Human-Machine Interaction in Translation) (pp. 45–56). Copenhagen, Denmark.
Huet, S., & Langlais, Ph. (2012). Translation of idiomatic expressions across different languages: A study of the effectiveness of TransSearch. In A. Neustein & J. A. Markowitz (Eds.), Where Humans Meet Machines. Innovative Solutions for Knotty Natural-Language Problems (pp. 185–209). New York: Springer.
Hurskainen, A. (2008). Multiword expressions and machine translation. Technical Reports in Language Technology, Report No 1.
Jackendoff, R. (1997). The Architecture of the Language Faculty. Cambridge, Mass.: MIT Press.
Jian, J. Y., Chang, Y. C., & Chang, J. S. (2004). Collocational translation memory extraction based on statistical and linguistic information. In ROCLING 2004, Conference on Computational Linguistics and Speech Processing (pp. 329–346). Taipei, Taiwan.
Kalchbrenner, N., & Blunsom, P. (2013). Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality (pp. 119–126). Sofia, Bulgaria. arXiv preprint arXiv:1306.3584.
Katz, G., & Giesbrecht, E. (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties (pp. 12–19). Association for Computational Linguistics.
Katz, J., & Postal, P. (1963). The semantic interpretation of idioms and sentences containing them. MIT Research Laboratory of Electronics Quarterly Progress Report, 70, 275–282.
Kilgarriff, A., Jakubíček, M., Kovář, V., Rychlý, P., & Suchomel, V. (2014). Finding terms in corpora for many languages with the Sketch Engine. In Proceedings of the Demonstrations at the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 53–56). Gothenburg, Sweden.
Klebanov, B. B., Burstein, J., & Madnani, N. (2013). Sentiment Profiles of multiword expressions in test-taker essays: The case of noun-noun compounds. ACM Transactions on Speech and Language Processing, Special Issue on Multiword Expressions: From Theory to Practice, 10(3), 12:1–12:15.
Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1 (NAACL ’03) (pp. 48–54). Edmonton, Canada.
Korkontzelos, I., & Manandhar, S. (2010). Can recognising multiword expressions improve shallow parsing? In Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 636–644). Los Angeles, California, USA.
Kovář, V., Baisa, V., & Jakubíček, M. (2016). Sketch Engine for bilingual lexicography. International Journal of Lexicography, 29(3), 339–352. doi: 10.1093/ijl/ecw029
Krenn, B. (2000). The usual suspects: Data-oriented models for identification and representation of lexical collocations (Vol. 7). Saarbrücken, Germany: German Research Center for Artificial Intelligence and Saarland University Dissertations in Computational Linguistics and Language Technology.
Lambert, P., & Banchs, R. (2006). Grouping multi-word expressions according to part-of-speech in statistical machine translation. In Proceedings of the EACL Workshop on Multi-word expressions in a multilingual context (pp. 9–16). Trento, Italy.
Lambert, P., & Banchs, R. (2005). Data inferred multi-word expressions for statistical machine translation. In Proceedings of Machine Translation Summit X (pp. 396–403). Phuket, Thailand.
Lau, J. H., Baldwin, T., & Newman, D. (2013). On collocations and topic models. ACM Transactions on Speech and Language Processing, 10(3), 10:1–10:14. doi: 10.1145/2483969.2483972
Lewis, D. D., & Croft, W. B. (1990). Term clustering of syntactic phrases. In Proceedings of 13th international ACM-SIGIR conference on research and development in information retrieval (SIGIR’90) (pp. 385–404). Brussels, Belgium.
Lin, D. (1998). Using collocation statistics in information extraction. In Proceedings of the seventh message understanding conference (MUC-7). Fairfax, Virginia, USA.
Luong, M. T., Sutskever, I., Le, Q. V., Vinyals, O., & Zaremba, W. (2014). Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (pp. 11–19). Beijing, China. arXiv preprint arXiv:1410.8206.
Macken, L. (2009). In search of the recurrent units of translation. In W. Daelemans & V. Hoste (Eds.), Evaluation of Translation Technology (pp. 195–212). Brussels: Academic and Scientific Publishers.
Mandala, R., Tokunaga, T., & Tanaka, H. (2000). Query expansion using heterogeneous thesauri. Information Processing and Management, 36(3), 361–378. doi: 10.1016/S0306-4573(99)00068-0
Manrique-Losada, B., Zapata-Jaramillo, C. M., & Burgos, D. A. (2013). Exploring MWEs for knowledge acquisition from corporate technical documents. In Proceedings of the 9th workshop on multiword expressions (pp. 82–86). Atlanta, Georgia, USA.
Makkai, A. (1972). Idiom structure in English (Janua Linguarum, series maior, 48). The Hague: Mouton.
Marcu, D., Wang, W., Echihabi, A., & Knight, K. (2006). SPMT: Statistical machine translation with syntactified target language phrases. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 44–52). Sydney, Australia.
Marvel, A., & Koenig, J.-P. (2015). Event categorization beyond verb senses. In Proceedings of the 11th workshop on multiword expressions (pp. 77–86). Denver, Colorado, USA.
Melamed, I. D. (1997). A word-to-word model of translational equivalence. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (pp. 490–497). Association for Computational Linguistics.
Mitkov, R. (Ed.). (forthcoming). The Oxford handbook of computational linguistics. Oxford University Press. doi: 10.1093/oxfordhb/9780199276349.001.0001
Mitkov, R. (2016). Computational Phraseology light: automatic translation of multiword expressions without translation resources. Yearbook of Phraseology, 26(7), 149–166.
Monti, J. (2013). Multi-word unit processing in Machine Translation: developing and using language resources for multi-word unit processing in Machine Translation (Unpublished doctoral dissertation). University of Salerno, Italy.
Monti, J., Arhan, M., & Sangati, F. (forthcoming). Translation asymmetries of Multiword Expressions in Machine Translation: An analysis of the TED-MWE corpus. In G. Corpas Pastor, J. P. Colson, & U. Heid (Eds.), Computational Phraseology. Amsterdam & New York: John Benjamins.
Monti, J., Elia, A., Postiglione, A., Monteleone, M., & Marano, F. (2012). In search of knowledge: text mining dedicated to technical translation. In Proceedings of ASLIB 2011 – Translating and the Computer Conference. London, United Kingdom.
Monti, J., Mitkov, R., Seretan, V., & Corpas Pastor, G. (Eds.). (2018). Workshop proceedings: Multiword units in Machine Translation and Translation Technology (MUMTTT2017). London, United Kingdom. Geneva: Editions Tradulex.
Monti, J., Mitkov, R., Corpas Pastor, G., & Seretan, V. (Eds.). (2013). Workshop proceedings: Multi-word units in machine translation and translation technologies. Nice, France: The European Association for Machine Translation.
Moon, R. (1998). Fixed expressions and idioms in English: A corpus-based approach (Oxford studies in lexicography and lexicology). Oxford: Clarendon Press.
Moreno-Ortiz, A., Perez-Hernandez, C., & Del-Olmo, M. (2013). Managing multiword expressions in a lexicon-based sentiment analysis system for Spanish. In Proceedings of the 9th workshop on multiword expressions (pp. 1–10). Atlanta, Georgia, USA.
Nivre, J., & Nilsson, J. (2004). Multiword units in syntactic parsing. In MEMURA 2004 – Workshop on Multi-word-expressions in a Multilingual Context held in conjunction with the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006) (pp. 39–46). Trento, Italy.
Nagy, I. (2014). Detecting Multiword Expressions and Named Entities in Natural Language Texts (Doctoral dissertation). University of Szeged.
Nokel, M., & Loukachevitch, N. (2015). A method of accounting bigrams in topic models. In Proceedings of the 11th workshop on multiword expressions (pp. 1–9). Denver, Colorado, USA.
Nomiyama, H. (1992). Machine translation by case generalization. In Proceedings of the 14th conference on Computational linguistics – Volume 2 (pp. 714–720). Nantes, France.
Nunberg, G., Sag, I. A., & Wasow, T. (1994). Idioms. Language, 70(3), 491–538.
Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology – Volume 1 (pp. 48–54). Edmonton, Canada.
Okita, T., Guerra, M. A., Graham, Y., & Way, A. (2010). Multi-word expression-sensitive word alignment. In Proceedings of the 4th International Workshop on Cross Lingual Information Access at COLING 2010 (pp. 26–34). Beijing, China.
Okuma, H., Yamamoto, H., & Sumita, E. (2008). Introducing a translation dictionary into phrase-based SMT. IEICE Transactions on Information and Systems, 91(7), 2051–2057. doi: 10.1093/ietisy/e91-d.7.2051
Orlandi, A., & Giacomini, L. (Eds.). (2016). Defining collocations for lexicographic purposes: From linguistic theory to lexicographic practice (Series ‘Linguistic Insights’). Frankfurt: Peter Lang.
Ozdowska, S. (2006). ALIBI, un système d’ALIgnement BIlingue à base de règles (Doctoral dissertation). Université de Toulouse 2.
Pal, S., Chakraborty, T., & Bandyopadhyay, S. (2011). Handling multiword expressions in phrase-based statistical machine translation. In Machine Translation Summit XIII (pp. 215–224). Xiamen, China.
Pal, S., Kumar Naskar, S., Pecina, P., Bandyopadhyay, S., & Way, A. (2010). Handling named entities and compound verbs in phrase-based statistical machine translation. In Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications (pp. 46–54). Beijing, China.
Pawley, A., & Syder, F. H. (1983). Two puzzles for linguistic theory: Native like selection and native like fluency. In J. J. Richards, & R. R. W. Schmidt (Eds.), Language and Communication (pp. 191–225). Harlow: Longman.
Pearce, D. (2002). A Comparative Evaluation of Collocation Extraction Techniques. In Proceedings of Ninth International Conference on Language Resources and Evaluation (LREC2002) (pp. 1530–1536). Las Palmas, Spain.
Pecina, P. (2008). Lexical association measures: Collocation extraction (Unpublished doctoral dissertation). Charles University.
Ramisch, C. (2012). A generic and open framework for multiword expressions treatment: from acquisition to applications (Unpublished doctoral dissertation). University of Grenoble and Federal University of Rio Grande do Sul.
Ramisch, C. (2015). Multiword expressions acquisition: A generic and open framework (Vol. XIV). Springer.
Ramisch, C., & Villavicencio, A. (forthcoming). Computational treatment of multiword expressions. In R. Mitkov (Ed.), The Oxford handbook of computational linguistics. Oxford University Press.
Ramisch, C., Villavicencio, A., & Kordoni, V. (2013). Introduction to the special issue on multiword expressions: From theory to practice and use. ACM Transactions on Speech and Language Processing, 10(2), 3:1–3:10. (Special issue on Multiword Expressions). doi: 10.1145/2483691.2483692
Rapp, R., & Sharoff, S. (2014). Extracting multiword translations from aligned comparable documents. In Proceedings of the 3rd Workshop on Hybrid Approaches to Translation (HyTra) (pp. 87–95). Gothenburg, Sweden.
Rayson, P., Piao, S., Sharoff, S., Evert, S., & Moirón, B. V. (2010). Multiword expressions: hard going or plain sailing? Language Resources and Evaluation, 44(1–2), 1–25. (Special issue on Multiword Expressions: Hard going or plain sailing).
Ren, Z., Lü, Y., Cao, J., Liu, Q., & Huang, Y. (2009). Improving statistical machine translation using domain bilingual multiword expressions. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (pp. 47–54). Suntec, Singapore.
Rikters, M., & Bojar, O. (2017). Paying Attention to Multi-Word Expressions in Neural Machine Translation. In MT Summit XVI Proceedings, September 18–22, 2017, vol. 1: Research Track (pp. 86–95). Nagoya, Japan.
Riloff, E. (2005). Little words can make a big difference for text classification. In Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval (pp. 130–136). Seattle, Washington, USA.
Rohanian, O., Taslimipoor, S., Yaneva, V., & Ha, L. A. (2017). Using Gaze Data to Predict Multiword Expressions. In Proceedings of the 11th Conference on Advances in Natural Language Processing (RANLP 2017). Varna, Bulgaria.
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the third international conference on intelligent text processing and computational linguistics (CICLING 2002) (pp. 1–15). Mexico City, Mexico.
Salehi, B., Mathur, N., Cook, P., & Baldwin, T. (2015). The impact of multiword expression compositionality on machine translation evaluation. In Proceedings of the 11th Workshop on MWEs (MWE 2015) (pp. 54–59). Denver, Colorado, USA.
Salton, G., & Smith, M. (1989). On the application of syntactic methodologies in automatic text analysis. In Proceedings of the 12th annual international ACM SIGIR conference on research and development in information retrieval (pp. 137–150). New York, USA.
Sanjuan, E., Dowdall, J., Ibekwe-Sanjuan, F., & Rinaldi, F. (2005). A symbolic approach to automatic multiword term structuring. Journal of Computer Speech and Language – Special Issue on Multiword Expressions, 19(4), 524–542.
Savary, A., Ramisch, C., Cordeiro, S., Sangati, F., Vincze, V., Qasemi Zadeh, B., Candito, M., Cap, F., Giouli, V., Stoyanova, I., & Doucet, A. (2017). The PARSEME shared task on automatic identification of verbal multiword expressions. In Proceedings of the 13th workshop on multiword expressions (MWE 2017) (pp. 31–47). Valencia, Spain.
Schneider, N. (2014). Lexical Semantic Analysis in Natural Language Text (Doctoral dissertation). Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
Schneider, N., Onuffer, S., Kazour, N., Danchik, E., Mordowanec, M. T., Conrad, H., & Smith, N. A. (2014). Comprehensive annotation of multiword expressions in a social web corpus. In Proceedings of the International Conference on Language Resources and Evaluation (LREC’14) (pp. 455–461). Reykjavik, Iceland.
Schneider, N., Hovy, D., Johannsen, A., & Carpuat, M. (2016). Semeval-2016 task 10: Detecting minimal semantic units and their meanings (dimsum). In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016) (pp. 546–559).
Scott, B. (2003). The Logos model: An historical perspective. Machine Translation, 18(1), 1–72. doi: 10.1023/B:COAT.0000021745.20402.59
Scott, B., & Barreiro, A. (2009). OpenLogos MT and the SAL representation language. In Proceedings of the First International Workshop on Free/Open-Source Rule-Based Machine Translation (pp. 19–26). Alacant, Spain.
Segura, J., & Prince, V. (2011). Using Alignment to detect associated multiword expressions in bilingual corpora. Tralogy. Paris, France.
Seretan, V. (2008). Collocation extraction based on syntactic parsing (Unpublished doctoral dissertation). University of Geneva.
Seretan, V. (2009). Extraction de collocations et leurs équivalents de traduction à partir de corpus parallèles. TAL, 50(1), 305–332.
Seretan, V. (2011). A collocation-driven approach to text summarization. In Actes de la 18e conférence sur le traitement automatique des langues naturelles (TALN 2011) (pp. 9–14). Montpellier, France.
Seretan, V. (2011). Syntax-based collocation extraction (Vol. 44). Dordrecht: Springer. doi: 10.1007/978-94-007-0134-2
Seretan, V., & Wehrli, E. (2007). Collocation translation based on sentence alignment and parsing. In Proceedings of Traitement Automatique des Langues Naturelles (TALN) (pp. 401–410). Toulouse, France.
Shigeto, Y., Azuma, A., Hisamoto, S., Kondo, S., Kouse, T., Sakaguchi, K., Yoshimoto, A., Yung, F., & Matsumoto, Y. (2013). Construction of English MWE dictionary and its application to POS tagging. In Proceedings of the 9th workshop on multiword expressions (pp. 139–144). Atlanta, Georgia, USA.
Sinclair, J. M. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, J. M. (1996). The search for units of meaning. Textus, 9(1), 75–106.
Sinclair, J. M. (2007). Collocation reviewed. Manuscript. Italy: Tuscan Word Centre.
Sinclair, J. M. (2008). Preface. In S. Granger, & F. Meunier (Eds.), Phraseology. An interdisciplinary perspective. Amsterdam: John Benjamins publishers.
Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), 143–177.
Straňák, P. (2010). Annotation of multiword expressions in the Prague Dependency Treebank (Unpublished doctoral dissertation). Charles University.
Sumita, E., & Iida, H. (1991). Experiments and prospects of example-based machine translation. In Proceedings of the 29th annual meeting on Association for Computational Linguistics (pp. 185–192). Berkeley, California.
Sumita, E., Iida, H., & Kohyama, H. (1990). Translating with examples: a new approach to machine translation. In The Third International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Language (pp. 203–212). Austin, Texas, USA.
Tambouratzis, G., Troullinos, M., Sofianopoulos, S., & Vassiliou, M. (2012). Accurate phrase alignment in a bilingual corpus for EBMT systems. In Proceedings of the 5th BUCC Workshop, held within the International Conference on Language Resources and Evaluation (LREC2012), Vol. 26 (pp. 104–111). Istanbul, Turkey.
Tang, Y., Meng, F., Lu, Z., Li, H., & Yu, P. L. (2016). Neural machine translation with external phrase memory. arXiv preprint arXiv:1606.01792.
Taslimipoor, S., Rohanian, O., Mitkov, R., & Fazly, A. (2017). Investigating the opacity of verb-noun multiword expression usages in context. In Proceedings of the 13th Workshop on Multiword Expressions, MWE@EACL 2017 (pp. 133–138). Valencia, Spain.
Taslimipoor, S., Mitkov, R., Mitkov, R., & Fazly, A. (2016). Bilingual Contexts from Comparable Corpora to Mine for Translations of Collocations. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing2016). Konya, Turkey.
Taslimipoor, S. (2015). Cross-lingual Extraction of Multiword Expressions. In G. Corpas Pastor (Ed.), Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives (Full papers). Geneva: Tradulex. http://www.tradulex.com/varia/Europhras2015.pdf
Thurmair, G. (2004). Multilingual Content Processing. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC2004) (pp. XI–XVI). Lisbon, Portugal.
Tillmann, C., & Xia, F. (2003). A phrase-based unigram model for statistical machine translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology: companion volume of the Proceedings of HLT-NAACL 2003 – short papers (pp. 106–108). Edmonton, Canada.
Tomokiyo, T., & Hurst, M. (2003). A language model approach to keyphrase extraction. In Proceedings of the ACL 2003 workshop on multiword expressions: Analysis, acquisition and treatment (pp. 33–40). Sapporo, Japan.
Torner, S., & Bernal, E. (Eds.). (2017). Collocations and Other Lexical Combinations in Spanish. Theoretical and Applied Approaches. London: Routledge.
Tsvetkov, Y. (2010). Extraction of multi-word expressions from small parallel corpora (Unpublished doctoral dissertation). University of Haifa.
Ullman, E., & Nivre, J. (2014). Paraphrasing Swedish compound nouns in Machine Translation. In Proceedings of the 10th workshop on multiword expressions (MWE) (pp. 99–103). Gothenburg, Sweden.
Váradi, T. (2006). Multiword Units in an MT Lexicon. In Proceedings of the EACL Workshop on Multi-Word Expressions in a Multilingual Contexts (pp. 73–78). Trento, Italy.
Venkatapathy, S., & Joshi, A. K. (2006). Using information about multi-word expressions for the word-alignment task. In Proceedings of the workshop on multiword expressions: Identifying and exploiting underlying properties (pp. 20–27). Sydney, Australia.
Venkatsubramanyan, S., & Perez-Carballo, J. (2004). Multiword expression filtering for building knowledge. In T. Tanaka, A. Villavicencio, F. Bond & A. Korhonen (Eds.), Second ACL workshop on multiword expressions: Integrating processing (pp. 40–47). Barcelona, Spain. doi: 10.3115/1613186.1613192
Villavicencio, A., Bond, F., Korhonen, A., & McCarthy, D. (2005). Introduction to the special issue on multiword expressions: Having a crack at a hard nut. Computer Speech & Language, 19(4), 365–377. (Special issue on Multiword Expressions). doi: 10.1016/j.csl.2005.05.001
Villavicencio, A., Kordoni, V., Zhang, Y., Idiart, M., & Ramisch, C. (2007). Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CONLL) (pp. 1034–1043). Prague, Czech Republic.
Vintar, S., & Fiser, D. (2008). Harvesting Multi-Word Expressions from Parallel Corpora. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08) (pp. 1091–1096). Marrakech, Morocco.
Wacholder, N., & Song, P. (2005). Toward a task-based gold standard for evaluation of NP chunks and technical terms. In Proceedings of the 2003 Human Language Technology conference of the North American Chapter of the Association for Computational Linguistics (pp. 130–136). Edmonton, Canada.
Wang, L., & Yu, S. (2010). Construction of Chinese idiom knowledge-base and its applications. In Proceedings of the 2010 workshop on multiword expressions: From theory to applications (pp. 11–18). Beijing, China.
Wehrli, E. (2014). The relevance of collocations for parsing. In Proceedings of the 10th workshop on multiword expressions (MWE 2014) (pp. 26–32). Gothenburg, Sweden.
Wehrli, E., Seretan, V., & Nerima, L. (2010). Sentence analysis and collocation identification. In Proceedings of the workshop on multiword expressions: from theory to applications (MWE 2010) (pp. 27–35). Beijing, China.
Widdows, D., & Dorow, B. (2005). Automatic extraction of idioms using graph analysis and asymmetric lexicosyntactic patterns. In Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition (pp. 48–56). Association for Computational Linguistics.
Williams, L., Bannister, C., Arribas-Ayllon, M., Preece, A., & Spasić, I. (2015). The role of idioms in sentiment analysis. Expert Systems with Applications, 42(21), 7375–7385.
Wu, C. C., & Chang, J. S. (2004). Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses. Computational Linguistics and Chinese Language Processing, 9(1), 1–20.
Wu, H., Wang, H., & Zong, C. (2008). Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the 22nd International Conference on Computational Linguistics – Volume 1 (pp. 993–1000). Manchester, United Kingdom.
Yaneva, V., Taslimipoor, S., Rohanian, O., & Ha, L. A. (2017). Cognitive Processing of Multiword Expressions in Native and Non-native Speakers of English: Evidence from Gaze Data. In R. Mitkov (Ed.), Computational and Corpus-based Phraseology. Springer: Heidelberg, New York, London.
Yarowsky, D. (1993). One sense per collocation. In Proceedings of ARPA Human Language Technology workshop (pp. 266–271). Princeton, New Jersey, USA. doi: 10.21236/ADA458621
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivalling supervised methods. In Proceedings of the 33rd annual meeting of the Association for Computational Linguistics (ACL 1995) (pp. 189–196). Cambridge, Massachusetts, USA.
Zens, R., Och, F. J., & Ney, H. (2002). Phrase-based statistical machine translation. In Annual Conference on Artificial Intelligence (pp. 18–32). Edmonton, Canada.
Zhang, Y., & Kordoni, V. (2006). Automated deep lexical acquisition for robust open texts processing. In Proceedings of 5th International Conference on Language Resources and Evaluation (LREC2006) (pp. 275–280). Genoa, Italy.
Zollmann, A., & Venugopal, A. (2006). Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation (pp. 138–141). New York City, USA.
part 1
Multiword units in machine translation
Analysing linguistic information about word combinations for a Spanish-Basque rule-based machine translation system
Uxoa Iñurrieta¹, Itziar Aduriz², Arantza Díaz de Ilarraza¹, Gorka Labaka¹ & Kepa Sarasola¹
¹IXA NLP group, University of the Basque Country / ²Department of Linguistics, University of Barcelona
This paper describes an in-depth analysis of noun + verb combinations in Spanish-Basque translations. Firstly, we examined noun + verb constructions in the dictionary, and confirmed that this kind of MWU varies considerably from language to language, which justifies the need for their specific treatment in MT systems. Then, we searched for those combinations in a parallel corpus, and we selected the most frequently-occurring ones to analyse them further and classify them according to their level of syntactic fixedness and semantic compositionality. We tested whether adding linguistic data relevant to MWUs improved the detection of Spanish combinations, and we found that, indeed, the number of MWUs identified increased by 30.30% with a precision of 97.61%. Finally, we also evaluated how an RBMT system translated the MWUs we analysed, and concluded that at least 44.44% needed to be corrected or improved.
Keywords: Basque, Spanish, Rule-Based Machine Translation, Multiword Units, morphosyntactic fixedness, semantic compositionality
doi 10.1075/cilt.341.02inu © 2018 John Benjamins Publishing Company
1. Introduction
Multiword Units (MWUs) are word combinations that pose difficulties for many research areas, as they do not usually follow the common grammatical and lexical rules of languages. Although they are made up of more than one lexeme, they are often used as a single unit in a sentence, and sometimes their meaning is not even transparent, which makes them particularly tricky for Natural Language Processing (NLP).
(1) a. She always ends up spilling the beans.
       (She always ends up giving information away.)
    b. They buried the hatchet.
       (They stopped arguing.)
This kind of word combination is also highly variable cross-linguistically and, as it is a very common phenomenon in all types of texts, it presents an additional challenge to multilingual systems like Machine Translation (MT), especially if the source and target languages are from different language families.
(2) ‘to kid/trick someone’
    English: pull [someone]’s leg
    Spanish: tomar el pelo [a alguien]
             ‘take the hair [to someone]’
             take.inf art.m.sg hair [prep someone]
    Basque:  [norbait-i] adar-a jo
             ‘[someone-to] horn-the play’
             [someone-dat] horn-art.sg play.inf
The work presented in this chapter has been done within the framework of Computational Linguistics, and therefore involves both a linguistic analysis and an experiment aimed at improving a computer application. More specifically, our aim is to analyse the translation of Spanish word combinations into Basque, in order to improve the existing MT system, which is based on linguistic rules and, up to now, has used a very basic method to process MWUs. As we believe that linguistic data particular to MWUs is necessary in order to obtain good processing results, we undertook an in-depth analysis of a set of word combinations and their possible translations, with the aim of adding information both to the Spanish parser and to the Basque generation process.
In this paper, we will first give an overview of the challenges posed by MWUs to MT systems, and will discuss different techniques that have been used to meet those challenges (Section 2). Secondly, we will describe our linguistic analysis: the features we focused on, a selection of statistical data, and the criteria we followed for the classification of the combinations (Section 3). Finally, we will present our experiment and will show how it improves our system (Section 4).
2. Definitions, challenges and treatment of MWUs in MT
Although authors usually agree when it comes to the most important features of MWUs, there are almost as many definitions as researchers in the field. The broadest definition is probably the one given by Sag et al. (2002), who define MWUs as lexical items that can be decomposed into multiple lexemes and that display some kind of idiomaticity, which, according to Baldwin & Kim (2010), can be of several types: lexical (ad hoc), syntactic (by and large), semantic (kick the bucket), pragmatic (good morning), or statistical (immaculate performance, black and white). In fact, idiomaticity is understood as a key factor of this kind of word combination by other authors too (Gurrutxaga & Alegria, 2011), and forms the basis of a number of classifications. Howarth (1998), for example, proposes a three-layer grouping in which the last layer corresponds to the division between idiomatic and non-idiomatic combinations (see Table 1).
Table 1. Howarth’s classification of word combinations
Functional expressions: non-idiomatic | idiomatic
Composite units:
    Grammatical composites: non-idiomatic | idiomatic
    Lexical composites: non-idiomatic | idiomatic
Other classifications follow different criteria to sort MWUs, like the one created by Corpas Pastor (1997) for Spanish combinations, which has later been reused and adapted to other languages, including Basque (Urizar, 2012). Its main focus is upon two features of what she terms Phraseological Units: whether they are complete speech acts or not, and the nature of their fixedness (see Table 2).
Table 2. Corpas Pastor’s classification of Phraseological Units
Type | Fixedness | Speech act status
Phraseological statements | fixed in speech | complete speech acts
Collocations | fixed in norms of usage | not complete speech acts
Idioms | fixed in the system | not complete speech acts
As regards the computational treatment of MWUs, however, it is essential to take into account their level of syntactic fixedness. While some approaches focus solely on word combinations that are indivisible, if we take a look at real texts, it soon becomes evident that a large number of them can be separated by other words, and sometimes even the word order can be changed. Therefore, this can be a determining feature in the adequate processing of a given combination. Sag et al. (2002), for example, make a distinction between institutionalised and lexicalised phrases, and rank the latter as fixed, semi-fixed or syntactically free.
Table 3. Classification of multiword expressions by Sag et al. (2002)
Institutionalised phrases
Lexicalised phrases:
    Fixed expressions
    Semi-fixed expressions
    Syntactically-flexible expressions
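The role of syntactic fixedness in processing can be illustrated with a minimal Python sketch. It is not the method of any system cited here: the `MwuEntry` structure, the `max_gap` and `free_order` fields and the `find_mwu` function are hypothetical names introduced only for illustration, and the input is assumed to be an already lemmatised sentence. A fixed expression is modelled as allowing no intervening tokens, while a more flexible one tolerates a limited gap and, optionally, a changed word order.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class MwuEntry:
    """Hypothetical lexicon entry: component lemmas plus matching constraints."""
    lemmas: Tuple[str, ...]    # components in canonical order
    max_gap: int               # tokens allowed between consecutive components
    free_order: bool = False   # whether the components may be reordered


def _match_from(lemmas: Tuple[str, ...], sentence: List[str],
                start: int, max_gap: int) -> Optional[List[int]]:
    """Match `lemmas` in order with the first component exactly at `start`.

    Backtracks over the possible positions of each following component.
    """
    if sentence[start] != lemmas[0]:
        return None
    if len(lemmas) == 1:
        return [start]
    last = min(len(sentence), start + 2 + max_gap)
    for nxt in range(start + 1, last):
        rest = _match_from(lemmas[1:], sentence, nxt, max_gap)
        if rest is not None:
            return [start] + rest
    return None


def find_mwu(entry: MwuEntry, sentence: List[str]) -> Optional[List[int]]:
    """Return the token indices of a match of `entry`, or None if there is none."""
    orders = permutations(entry.lemmas) if entry.free_order else [entry.lemmas]
    for order in orders:
        for i in range(len(sentence)):
            hit = _match_from(tuple(order), sentence, i, entry.max_gap)
            if hit is not None:
                return hit
    return None


# Lemmatised toy sentence (an assumption, not corpus data).
sentence = ["él", "tomar", "siempre", "el", "pelo", "a", "su", "hermano"]
fixed = MwuEntry(("tomar", "el", "pelo"), max_gap=0)
flexible = MwuEntry(("tomar", "pelo"), max_gap=3, free_order=True)

print(find_mwu(fixed, sentence))     # contiguous matching misses the combination
print(find_mwu(flexible, sentence))  # gap-tolerant matching finds it
```

In this toy sentence the adverb between the verb and the noun phrase defeats the strictly contiguous entry, whereas the gap-tolerant entry still identifies the combination. A real system would derive such constraints from linguistic analysis rather than from a fixed token window.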
These kinds of expressions are used very frequently both in oral and written texts, and are hence important linguistic phenomena to be borne in mind for NLP systems. Jackendoff (1997) estimates that the number of MWUs in an English speaker’s lexicon is of the same order of magnitude as the number of single words, and, indeed, 41% of the entries in WordNet 1.7 (Fellbaum, 1998) consist of more than one word. Thus, word combinations pose an important challenge to NLP in general (Sag et al., 2002; Villavicencio et al., 2005), and an even greater one when the language to be processed has a rich morphology, as with Basque (Alegria et al., 2004). Furthermore, the difficulties multiply when it comes to multilingual systems, as MWUs vary a great deal from one language to another, especially when the languages are very different. As stated in Baldwin & Kim (2010):
    There is remarkable variation in MWEs across languages (…) There are of course many MWEs which have no direct translation equivalent in a second language. (…) Equally, there are terms which are realised as MWEs in one language but single-word lexemes in another.
As a matter of fact, Simova & Kordoni (2013) studied the translation of English phrasal verbs into Bulgarian, and found that asymmetry is a major problem when translating word combinations:
    MWEs constitute a major challenge, since it is very often the case that they do not receive exact translation equivalents. (…) In Bulgarian, phrasal verbs do not occur as multiword units, but are usually translated as single verbs.
On the other hand, regarding MT systems, there are two major issues to be addressed: (1) the identification of MWUs in the source language, and (2) their adequate transfer into and correct generation in the target language.
Concerning the identification process, the most basic method is probably the words-with-spaces strategy, which consists in searching solely for sequential word combinations (Zhang et al., 2006; Alegria et al., 2004). Nonetheless, as previously mentioned, non-sequential combinations are as frequent as the sequential ones, and this approach does not allow us to find them. It is important to use a flexible method which allows the detection of as many combinations as possible, but also to impose some restrictions, so that only real MWUs are detected. The tendency of recent years has been to combine computational methods, like association measures, with linguistic features (Dubremetz & Nivre, 2014; Pecina, 2008). For example, information obtained from deep parsers has been proved to be very helpful (Baldwin et al., 2004; Blunsom, 2007).
It must be noted, however, that, while a lot of detection and extraction work has been done, not very much research has been conducted on MWU integration into MT systems. Most reports explain experiments in which combinations are added to Statistical Machine Translation (SMT) systems (Bouamor et al., 2012; Tsvetkov & Wintner, 2012), all of which greatly improve translation quality. As is pointed out in Seretan (2013):
    Phrase-based SMT systems already incorporate MWE/collocational knowledge as an effect of training their language and translation models on large (parallel) corpora. These systems are successful in dealing with local collocations, but are arguably ill-suited for handling collocations whose components are not in close proximity to one another.
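As an illustration of how an association measure can be combined with a linguistic feature, the following Python sketch scores adjacent lemma pairs with pointwise mutual information (PMI) and keeps only pairs matching a part-of-speech pattern. It is a simplified sketch under stated assumptions, not the procedure of any work cited above: the function name `pmi_scores`, the tag labels `VERB` and `NOUN`, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (lemma, part-of-speech tag); the tag labels are assumed


def pmi_scores(sentences: List[List[Token]],
               pos_pattern: Tuple[str, str] = ("VERB", "NOUN")
               ) -> Dict[Tuple[str, str], float]:
    """Score adjacent lemma pairs by PMI, keeping only pairs that match `pos_pattern`.

    PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from raw counts.
    """
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    candidates = set()
    for sent in sentences:
        unigrams.update(lemma for lemma, _ in sent)
        for (l1, p1), (l2, p2) in zip(sent, sent[1:]):
            bigrams[(l1, l2)] += 1
            if (p1, p2) == pos_pattern:   # the linguistic filter
                candidates.add((l1, l2))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for w1, w2 in candidates:
        p_xy = bigrams[(w1, w2)] / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy lemmatised and tagged corpus (invented for illustration only).
corpus = [
    [("tener", "VERB"), ("afecto", "NOUN")],
    [("tener", "VERB"), ("afecto", "NOUN")],
    [("tener", "VERB"), ("casa", "NOUN")],
    [("ver", "VERB"), ("casa", "NOUN")],
]
scores = pmi_scores(corpus)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Because the sketch only counts adjacent tokens, it shares the limitation noted in the quotation above: components that are not in close proximity are missed, which is one reason syntactic information from a parser is often preferred for candidate selection.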
Meanwhile, integration experiments on Rule-Based Machine Translation (RBMT) systems have also been confirmed to have a very positive effect. Wehrli et al. (2009), for instance, replaced the parsing strategy in an RBMT system with a new one which integrated collocation identification, and obtained much better results regarding MWU translation adequacy. It must be mentioned that, according to studies, even the simplest treatment of MWUs improves translation quality, although, of course, more complex processing methods will obtain better results, especially concerning non-sequential word combinations (Copestake et al., 2002).
3. Linguistic analysis of Basque and Spanish noun + verb combinations
As previously mentioned, our aim is to study MWUs and their translations, in order to establish the linguistic grounds for their appropriate treatment in MT systems. So, we focused on several features of noun + verb combinations in Basque and Spanish, and we analysed how they were translated. First of all, we gathered noun + verb combinations from bilingual dictionaries, and we looked at their morphological composition and some semantic features (see Section 3.1). Secondly, we searched for these combinations in a parallel corpus, so that we could check to what extent they were used in real texts and how they were translated (see Section 3.2). Thirdly, we chose the most frequent combinations in the corpus, and classified them according to their syntactic flexibility and their semantic compositionality (see Section 3.3).
All of our results are collected in a public database: Konbitzul.¹ It is now available online, and it allows users to search for the appropriate translation of a given combination, along with all the linguistic data we garnered from our in-depth analysis.
3.1 Noun + verb combinations in bilingual dictionaries
Although it was clear to us that parallel corpora were the most useful resource for extracting frequently-used word combinations, we decided to take a look at bilingual dictionaries first, in order to get a general idea of the translation challenges the combinations can pose. To that end, we used the Elhuyar dictionaries² (Spanish into Basque and Basque into Spanish), from which we gathered 2,954 Basque combinations (along with 6,392 Spanish equivalents) and 2,650 Spanish combinations (along with 6,587 Basque equivalents).
All of the Basque combinations we analysed consisted of just a noun and a verb. However, it is important to note that Basque is an agglutinative language and, as such, constructs phrases by attaching elements, typically at the end of the phrase (Laka, 1996). This means that the nouns that are used in MWUs can also be marked by different grammatical cases and postpositions.
(3) a. lan egin
       work do
       work.abs do.inf
       ‘to work’
    b. deabru-a-k hartu
       devil-the take
       devil-art.sg-erg take.inf
       ‘the devil take [someone/something]’
    c. joko-a-n jarri
       game-the-in put
       game-art.sg-loc put.inf
       ‘to risk’
    d. buru-tik egon
       head-the.from be
       head-art.sg.abl be.inf
       ‘to be crazy’
Spanish, on the other hand, uses prepositions instead of postpositions and grammatical cases, and determiners in Spanish are not morphemes attached to the
1. http://ixa2.si.ehu.es/konbitzul
2. http://hiztegiak.elhuyar.org
Analysing linguistic information about word combinations
phrases, but always separate words. Therefore, of the Spanish combinations we selected for this study, each one consisted of at least a verb and a noun, but many of them also contained prepositions and/or determiners in-between.

(4) a. tener afecto
       have affection
       have.inf affection
       ‘to have affection’
    b. ser una pena
       be a pity
       be.inf art.f.sg pity
       ‘to be a pity’
    c. saber de memoria
       know by memory
       know.inf prep memory
       ‘to know by heart’
    d. dejar a un lado
       leave to a side
       leave.inf prep art.m.sg side
       ‘to leave aside/to one side’
We focused on the combinations in each language separately first, without taking their translations into account (see Section 3.1.1). Then, we examined their translations (see Section 3.1.2), paying special attention to those combinations that are also translated by noun + verb constructions (see Section 3.1.3).

3.1.1 Basque and Spanish noun + verb combinations in the dictionary

To begin our analysis, we focused on the morphological composition of the combinations in the Elhuyar dictionaries. As we mentioned earlier, the Basque combinations we chose for this project consisted of a noun and a verb (see Example 3), while the Spanish combinations were of four types:

–– verb + noun (Example 4a)
–– verb + determiner + noun (Example 4b)
–– verb + preposition + noun (Example 4c)
–– verb + preposition + determiner + noun (Example 4d)
Concerning the Basque combinations in our list, we found many kinds of morphemes attached to the end of the nouns: three grammatical cases, and ten different postpositional marks. However, not all of them were used equally often. As a matter of fact, 76.18% of the nouns were in the absolutive case, and the rest of the cases and postpositional marks were hardly used. On the other hand, there was no such
difference among the Spanish structures, even though the combinations of the type verb + determiner + noun were slightly more common than the rest (37.70%). It is also interesting to note that a large number of the verbs in the combinations are very common, both in Basque and in Spanish. In addition, the most frequent verbs in both languages are equivalent to each other: egin – hacer (‘do’), izan – ser/estar/tener (‘be/have’), eman – dar (‘give’), hartu – tomar (‘take’) and so on. This is no surprise though, as light verb constructions are very frequent among MWUs (Butt, 2010; Sag et al., 2002).

3.1.2 Translations of noun + verb combinations in the dictionary

As a second step, we looked at the dictionary translations of the combinations we had extracted. When translating between languages from the same family, most word combinations in the source language are also word combinations in the target one. However, this is not the case in translations between Spanish and Basque, where asymmetry is much more evident. In fact, of the Spanish translations of Basque combinations we analysed, 58.07% were single verbs, while just 30.85% contained a noun and a verb. This was to be expected, given that in Basque it is very common to use two-word verbs to represent some actions that are expressed with single verbs in most European languages (see Example 5). On the other hand, this asymmetry was slightly less prominent but still significant when Spanish was the source language, as fewer than half of the Basque equivalents (48.54%) were noun + verb combinations (see Example 6).

(5) ‘to work’
    Basque:  lan egin
             work do
             work.abs do.inf
    Spanish: trabajar
             work
             work.inf

(6) ‘to open one’s eyes’
    Spanish: abrir los ojo-s
             open the eye-s
             open.inf art.m.pl eye-pl
    Basque:  begi-ak ireki
             eye-s open
             eye-art.pl.abs open.inf
3.1.3 Equivalences of noun + verb constructions in translations

Before finishing our dictionary-based study, we considered it worth analysing syntactically-symmetrical translations further. So, we selected those noun + verb
constructions that were also translated by other noun + verb constructions, and we found that there was a link between the morphological composition of the combinations in both languages. As previously mentioned, the natural equivalents of Basque postpositions are prepositions in Spanish. Our study has found that, despite their high idiosyncrasy, MWUs are not always an exception to this rule, as most Spanish combinations containing a preposition in our list were translated by combinations with a postposition into Basque, and vice versa.

(7) ‘to eat hungrily’
    Spanish: comer con apetito
             eat with appetite
             eat.inf prep appetite
    Basque:  gogo-z jan
             desire-with eat
             desire-ins eat.inf

(8) ‘to be a case in point’
    Basque:  hari-ra etorri
             string-to.the come
             string-art.sg.all come.inf
    Spanish: venir al caso
             come to.the case
             come.inf prep.art.m.sg case
This symmetry, however, is not consistent when it comes to the (in)definiteness and singularity/plurality of noun phrases, which are usually highly irregular cross-linguistically. The only exceptions are indefinite Basque nouns, which mostly remain indefinite when the combinations are translated into Spanish (80.72%).

To conclude, we found it pertinent to make a comparison between the noun phrases and verbs in the source language and those in the target language. As we had expected, very few combinations were translated by substituting each component with an equivalent (see Example 9). Most of the time, at least one of the components was translated by a word that was not its equivalent in the dictionary (see Example 10).

(9) ‘to leave [sb/st] alone’
    Spanish: dejar en paz
             leave in peace
             leave.inf prep peace
    Basque:  bake-a-n utzi
             peace-the-in leave
             peace-art.sg-loc leave.inf
(10) ‘to make noise’
     Spanish: armar bulla
              build racket
              build.inf racket
     Basque:  zarata egin
              noise make
              noise.abs make.inf
3.2 Contrasting information with parallel corpora

The dictionary-based analysis provided us with a general view of the high complexity of MWU translation, but in order to learn about the actual use of these units, we needed to look at real texts. To do this, we used a parallel corpus of Spanish into Basque translations, consisting of 491,853 sentences from many different sources. Out of the 2,650 combinations we had gathered from the dictionary, just 200 were found within the corpus. However, we did not search for whole word sequences, but for noun lemmas and verb lemmas only, accepting any preposition and/or determiner in between. This allowed us to find many other variants of the combinations we had already analysed (see Example 11) and to add new combinations that could be worth examining. These variants and extra combinations numbered 698 in all.

(11) Previously examined:
     alzar la voz
     raise the voice
     raise.inf art.f.sg voice
     ‘to raise the voice’

     New variant 1:
     alzar su voz
     raise his/her voice
     raise.inf pos.3sg voice
     ‘to raise one’s voice’

     New variant 2:
     alzar voces
     raise voices
     raise.inf voice.pl
     ‘to raise voices’
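The lemma-based search described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the tooling used for the study: the `Token` type, the simplified tag names (`VERB`, `NOUN`, `DET`, `PREP`, `ADV`) and the hand-written lemmas are all hypothetical, and a real run would take its lemmas and tags from a morphological analyser.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str   # surface word form
    lemma: str  # dictionary form (hand-written here, an assumption)
    pos: str    # simplified, hypothetical tag: VERB, NOUN, DET, PREP, ADV


# Only these categories may stand between the verb and the noun.
ALLOWED_BETWEEN = {"DET", "PREP"}


def find_variants(sentence: List[Token], verb_lemma: str, noun_lemma: str) -> List[str]:
    """Return every surface span in which verb_lemma is followed by noun_lemma
    with nothing but prepositions and/or determiners in between."""
    hits = []
    for i, tok in enumerate(sentence):
        if tok.pos != "VERB" or tok.lemma != verb_lemma:
            continue
        j = i + 1
        # Skip over any run of prepositions/determiners after the verb.
        while j < len(sentence) and sentence[j].pos in ALLOWED_BETWEEN:
            j += 1
        if j < len(sentence) and sentence[j].pos == "NOUN" and sentence[j].lemma == noun_lemma:
            hits.append(" ".join(t.form for t in sentence[i:j + 1]))
    return hits


# The three variants of Example 11, tagged by hand for illustration.
variants = [
    [Token("alzar", "alzar", "VERB"), Token("la", "el", "DET"), Token("voz", "voz", "NOUN")],
    [Token("alzar", "alzar", "VERB"), Token("su", "su", "DET"), Token("voz", "voz", "NOUN")],
    [Token("alzar", "alzar", "VERB"), Token("voces", "voz", "NOUN")],
]
for sent in variants:
    print(find_variants(sent, "alzar", "voz"))
```

All three variants are matched because the comparison is made on lemmas rather than word forms. A sentence with an adverb between the verb and the noun would be skipped by this sketch, since only prepositions and determiners are accepted in the gap.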
On the other hand, while the aforementioned 200 combinations had no more than 385 Basque equivalents in the dictionary, they were translated in as many as 1,641 different ways in the corpus, which enabled us to feed new translations into our database.

3.3 Classification of the Spanish MWUs

For the next study, we ranked all the combinations extracted from the corpus by their number of occurrences, and we selected the most frequently-used ones: a total of 150. Our aim this time was to analyse linguistic information that could be
useful for MT systems, so we focused on two main features of the Spanish combinations: (1) their syntactic flexibility, and (2) their semantic compositionality.

3.3.1 Syntactic flexibility

In order to measure how flexible the combinations were, we asked the following questions about each of them:

–– Was the noun phrase definite or indefinite? Was this consistent for every occurrence?
–– Was the noun phrase singular or plural? Was this consistent?
–– Could the noun phrase include a modifier? Adjectives, prepositional phrases and so on.
–– Was it possible to add something between the noun phrase and the verb? An adverb, an extra phrase etc.
–– Could the order of the components be changed? In passive sentences, for example.

As our judgement was that syntactic information was a key element for the adequate treatment of a given MWU, we used that information to sort the combinations into three groups, following Sag et al. (2002): fixed, semi-fixed and free (see Table 4).

Table 4. Syntactic classification of Spanish MWUs

Fixed expressions                 0%
Semi-fixed expressions            33.33%
Syntactically free expressions    66.67%
We call fixed expressions those word combinations that are always used together, with the same word forms (except for the verb, which can be inflected) and in the same word order. Therefore, the MWUs in this group should be easy to detect, simply by searching for the lemma of a given verb and the word sequence that follows it. Not surprisingly, none of the combinations we analysed was classified in this group, since verb-noun combinations that accept no morphosyntactic modification at all are extremely rare.

Semi-fixed expressions, on the other hand, are more problematic for automatic detection tools. The components of these kinds of MWUs are often separated by other words (Example 12), and even the word order can be changed, for example when the sentence is in the passive voice. They are not completely free though, as they have certain syntactic restrictions, such as the impossibility of inserting modifiers and/or determiners. It is important to take those restrictions into account in order
to detect only the combinations we are interested in, as in Examples 13a and 13b, where the first one is an MWU whereas the second one is not.

(12) dar paso [a algo] (‘to give rise [to sth]’)
     a. Las elecciones dieron paso a un nuevo gobierno.
        ‘The elections gave rise to a new government.’
     b. Las elecciones darán quizás paso a un nuevo gobierno.
        ‘The elections may give rise to a new government.’

(13) hacer memoria (‘to try to remember’ vs. ‘to do a report’)
     a. Haz memoria, ¿qué hiciste ayer?
        ‘Try to remember: what did you do yesterday?’
     b. Harán una memoria exhaustiva sobre su labor.
        ‘They will do a comprehensive report on their activities.’
The combinations we classified as free expressions do not seem to have any syntactic restriction. As a result, the MWUs in this group are probably the most difficult ones to detect.

(14) fijar un plazo (‘to set a deadline’)
     a. Hemos fijado el plazo de inscripción.
        ‘We have set the enrolment deadline.’
     b. El plazo de inscripción ha sido fijado.
        ‘The enrolment deadline has been set.’
     c. ¿Cuál es el plazo de inscripción que se ha fijado?
        ‘What deadline has been set for enrolment?’
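To see why contiguity matters for detection, the simple strategy described for fixed expressions (a verb lemma plus the word sequence that follows it) can be sketched as below. This is a minimal illustration under stated assumptions rather than any system's actual code: tokens are hand-written (form, lemma) pairs, and `match_fixed` is a hypothetical helper name.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word form, lemma); lemmas are hand-written assumptions


def match_fixed(sentence: List[Token], verb_lemma: str, tail_forms: List[str]) -> bool:
    """True if a token with lemma verb_lemma is immediately followed by
    exactly the word forms in tail_forms (case-insensitive)."""
    tail = [f.lower() for f in tail_forms]
    for i, (_, lemma) in enumerate(sentence):
        if lemma != verb_lemma:
            continue
        following = [form.lower() for form, _ in sentence[i + 1:i + 1 + len(tail)]]
        if following == tail:
            return True
    return False


# Example 12a: verb and noun are adjacent.
ex_12a = [("Las", "el"), ("elecciones", "elección"), ("dieron", "dar"), ("paso", "paso"),
          ("a", "a"), ("un", "uno"), ("nuevo", "nuevo"), ("gobierno", "gobierno")]
# Example 12b: an adverb separates verb and noun.
ex_12b = [("Las", "el"), ("elecciones", "elección"), ("darán", "dar"), ("quizás", "quizás"),
          ("paso", "paso"), ("a", "a"), ("un", "uno"), ("nuevo", "nuevo"), ("gobierno", "gobierno")]

print(match_fixed(ex_12a, "dar", ["paso"]))  # True
print(match_fixed(ex_12b, "dar", ["paso"]))  # False
```

The second call fails only because of the intervening adverb, which is the kind of case that semi-fixed and free expressions produce regularly.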
3.3.2 Semantic compositionality

Apart from analysing the syntax of the combinations, we also considered it important to look at their meaning. We sorted the combinations into four groups, depending on their degree of semantic idiomaticity (Table 5).

Table 5. Semantic classification of Spanish MWUs

Non-compositional expressions (idioms)                                        2%
Figurative expressions                                                        10.67%
Semi-compositional expressions (collocations and light verb constructions)    52%
Compositional expressions (free)                                              35.33%
Non-compositional expressions are word combinations in which the meaning is not derivable from the separate meanings of their components. They are also called opaque expressions.
(15) llevar a cabo
     take to ending
     take.inf prep ending
     ‘to carry out’, ‘to do’
Figurative expressions, on the other hand, are combinations which can have a figurative sense in addition to the canonical one.

(16) poner [algo] sobre la mesa
     put on the table
     put prep art.f.sg table
     ‘to put [something] on the table’ or ‘to draw attention to [something]’
In the case of semi-compositional expressions, one of the components keeps its literal meaning, while the other one adopts a new sense (in collocations) or is emptied of meaning to work as a supporting element for the other word (in light verb constructions). In verb + noun combinations, the component which keeps its original meaning is usually the noun.

(17) cumplir su palabra
     fulfil his/her word
     fulfil.inf pos.3sg word
     ‘to keep one’s word’

(18) tener dificultad [para algo]
     have difficulty
     have.inf difficulty
     ‘to have difficulty [doing something]’, ‘to find [something] difficult’
Finally, compositional expressions are completely regular in terms of semantics, as their meaning is made up of the separate meanings of the components. Hence, the constructions in this group are not semantically idiomatic, and most of them do not need any special computational treatment, as their literal translation is usually correct.

(19) ir a un lugar
     go to a place
     go.inf prep art.m.sg place
     ‘to go to a place’
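One way to picture the two classification axes of Section 3.3 is as a small record per combination. The sketch below is purely illustrative: the class and field names are hypothetical and do not reflect how the authors' database is organised, and the labels marked "assumed" in the comments are ours, not annotations from the study. It also shows a simple filter that sets aside combinations that are idiomatic on neither axis.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Syntactic(Enum):
    FIXED = "fixed"
    SEMI_FIXED = "semi-fixed"
    FREE = "free"


class Semantic(Enum):
    NON_COMPOSITIONAL = "non-compositional"
    FIGURATIVE = "figurative"
    SEMI_COMPOSITIONAL = "semi-compositional"
    COMPOSITIONAL = "compositional"


@dataclass(frozen=True)
class Combination:
    text: str
    syntactic: Syntactic
    semantic: Semantic


def is_idiomatic(c: Combination) -> bool:
    """True unless the combination is both syntactically free and
    semantically compositional, i.e. idiomatic on neither axis."""
    return not (c.syntactic is Syntactic.FREE and c.semantic is Semantic.COMPOSITIONAL)


# Illustrative records. Labels marked "assumed" are NOT taken from the study.
sample: List[Combination] = [
    Combination("llevar a cabo", Syntactic.SEMI_FIXED, Semantic.NON_COMPOSITIONAL),   # syntactic label assumed
    Combination("hacer memoria", Syntactic.SEMI_FIXED, Semantic.SEMI_COMPOSITIONAL),  # semantic label assumed
    Combination("ir a un lugar", Syntactic.FREE, Semantic.COMPOSITIONAL),             # syntactic label assumed
]

for c in sample:
    print(c.text, "->", "keep" if is_idiomatic(c) else "skip")
```

Keeping the two axes as separate fields means a combination can be selected for special treatment on syntactic grounds, semantic grounds, or both.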
4. Evaluation of MWU detection and translation adequacy

As we mentioned earlier, the aim of our work is to establish the linguistic basis for the treatment of MWUs in MT systems. The experiment we will explain here was
carried out with an RBMT system, namely Matxin³ (Mayor et al., 2011), which translates from Spanish into Basque. Matxin works in three phases: (1) analysis, (2) transfer and (3) generation. In the first phase, it analyses the Spanish text syntactically, based on the information given by Freeling 3.0 (Padró & Stanilovsky, 2012). In the second, it transfers the structure of the sentences to be translated, as well as the lexicon, which is gathered from wide-coverage dictionaries. In the third, the words and phrases are re-ordered and the necessary morphological information is added to them.

Before we used our linguistic data, the system already had an MWU processing method, but it was based on the words-with-spaces approach (see Section 2) and was thus unable to identify non-sequential word combinations (see Example 12b). The old MWU detection system was part of the analysis process, and searched only for the lemmas of the verbs and the forms of the rest of the words, which made it impossible to find combinations in which the components were non-adjacent and/or used a different order or different word forms.

The new system, however, is based on all the data we acquired from the linguistic analysis presented in Section 3.3. It is much more flexible, but, at the same time, it has many restrictions that prevent the identification of free combinations as MWUs. If a given combination is marked as a fixed expression, the system employs the old strategy, as this kind of MWU is always sequential and unchangeable (except for the verb inflection, which is also taken into account). If the unit is marked as semi-fixed, on the other hand, the system looks at the linguistic data we provided. For the expression cambiar de tema (‘change the topic’), for example, the system identifies those word combinations in which:

–– The noun phrase is singular and indefinite, and preceded by the preposition de. According to this constraint, Example 20a would be accepted, whereas 20b would not.

   (20) a. Cambiemos de tema.
           ‘Let’s change the topic.’
        b. *Cambiemos los temas.
           *‘Let’s change the topics.’

–– The noun phrase has no modifier.

   (21) *Cambiemos de aburrido tema.
        *‘Let’s change the boring topic.’

–– There may be more words between the verb and the prepositional phrase.
3. www.opentrad.com
   (22) Cambiemos inmediatamente de tema.
        ‘Let’s change the topic immediately.’

–– The word order cannot be changed (see Example 23).

   (23) *De tema han cambiado.
        *‘The topic, they changed.’
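The constraints listed for cambiar de tema can be turned into a small checker to make their combined effect visible. The sketch below is an assumption-laden illustration, not Matxin's implementation: the `Tok` record, the simplified tags, the hand-written analyses and the `max_gap` limit of two intervening words are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    form: str
    lemma: str
    pos: str          # simplified, hypothetical tags: VERB, AUX, PREP, DET, NOUN, ADJ, ADV
    number: str = ""  # "sg" or "pl" (only filled in for nouns here)


def is_cambiar_de_tema(sent: List[Tok], max_gap: int = 2) -> bool:
    """Check the constraints illustrated by Examples 20-23 (max_gap is an assumed limit)."""
    for i, tok in enumerate(sent):
        if tok.pos != "VERB" or tok.lemma != "cambiar":
            continue
        # Fixed order: the prepositional phrase must follow the verb,
        # with at most max_gap other words in between.
        for j in range(i + 1, min(i + 2 + max_gap, len(sent))):
            if sent[j].pos != "PREP" or sent[j].lemma != "de":
                continue
            if j + 1 >= len(sent):
                continue
            # The noun comes right after "de": no determiner, no pre-modifier.
            noun = sent[j + 1]
            if noun.pos != "NOUN" or noun.lemma != "tema" or noun.number != "sg":
                continue
            # No adjective directly after the noun either.
            if j + 2 < len(sent) and sent[j + 2].pos == "ADJ":
                continue
            return True
    return False


V = Tok("Cambiemos", "cambiar", "VERB")
DE = Tok("de", "de", "PREP")
TEMA = Tok("tema", "tema", "NOUN", "sg")

ex_20a = [V, DE, TEMA]
ex_20b = [V, Tok("los", "el", "DET"), Tok("temas", "tema", "NOUN", "pl")]
ex_21 = [V, DE, Tok("aburrido", "aburrido", "ADJ"), TEMA]
ex_22 = [V, Tok("inmediatamente", "inmediatamente", "ADV"), DE, TEMA]
ex_23 = [Tok("De", "de", "PREP"), TEMA, Tok("han", "haber", "AUX"), Tok("cambiado", "cambiar", "VERB")]

for name, s in [("20a", ex_20a), ("20b", ex_20b), ("21", ex_21), ("22", ex_22), ("23", ex_23)]:
    print(name, is_cambiar_de_tema(s))
```

Only 20a and 22 are accepted, which mirrors the starred and unstarred examples above. In practice each semi-fixed combination would carry its own set of restrictions rather than a hard-coded function.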
Here again, we used the linguistic analysis provided by Freeling 3.0, combined with the linguistic data we had manually analysed. This was very helpful for limiting, on the one hand, the number of words between the verb and the noun phrase that constitute the MWUs, and on the other hand, the modifiers that could be inside the noun phrase. In the following sections, we will compare the results of the old and new detection systems, and will also evaluate the performance of our MT system when it comes to translating the MWUs we analysed.

4.1 Evaluation of MWU detection

To test whether or not our new detection method was useful, we used 15,182,385 sentences in Spanish, taken from the parallel English-Spanish corpus made public for the shared task of the WMT 2013 workshop.⁴ Out of the 150 word combinations we analysed, we discarded those which were neither syntactically nor semantically idiomatic, that is, the ones classified as both free and compositional expressions (see Section 3.3). In all, the set we used for the experiment consisted of 117 MWUs.

We ran the detection experiment both with the old system and with the new one, and we found that, as we had expected, the method based on linguistic data was able to identify quite a large number of additional combinations. As a matter of fact, of the 433,092 MWUs detected in total, 27.80% were combinations that the old system did not manage to detect (see Table 6).

Table 6. Comparison of the old and new MWU detection systems

MWUs identified by both systems           311,966
MWUs identified by the new system only    120,362
MWUs identified by the old system only    764
4. http://www.statmt.org/wmt13/translation-task.html
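The figures in Table 6 can be cross-checked with a few lines of arithmetic. The sketch below only re-derives totals and one share from the three table rows; the variable names are ours.

```python
# Counts copied from Table 6.
both_systems = 311_966
new_only = 120_362
old_only = 764

total = both_systems + new_only + old_only    # detections by either system
new_system_total = both_systems + new_only    # everything the new system found
share_new_only = 100 * new_only / total       # percentage found only by the new system

print(f"all detections:        {total:,}")
print(f"new system detections: {new_system_total:,}")
print(f"share found only by the new system: {share_new_only:.1f}%")
```

The three rows add up to the 433,092 detections mentioned in the text, and the combinations found only by the new system make up roughly 27.8% of them.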
Our next step was to evaluate the combinations that were identified by just one of the systems, so that we could see (1) whether the 120,362 extra combinations detected by our method produced a real improvement, and (2) why we failed to identify 764 combinations that the old system did manage to detect. The evaluation was undertaken manually by linguists, on a representative set of sentences containing MWUs detected by one of the systems only.

Out of the evaluation set, all but one of the MWUs extracted with the words-with-spaces method were correct (99.80%), and the hit rate obtained by the new system was 92.60%. Assuming that the accuracy of the old system would still be 99.80% for the combinations detected by both methods, the total hit rate of our new system would be 98%,⁵ which would be a very satisfactory result. Therefore, this confirms that linguistic data specific to MWUs does improve the detection process, as the number of identified combinations increased by 27.80% with a very high degree of precision.

On the other hand, when evaluating the correct MWUs that were detected by the old system but not by the new one, we realised that most of them had parsing errors that prevented our method from working correctly. Thus, taking into consideration that the words-with-spaces method is extremely accurate, we decided to use both systems from now on: the old one first, in order to detect all sequential MWUs, and then the new one, which allows us to identify a large number of additional non-sequential combinations. More details about this experiment and its results can be found in Inurrieta et al. (2016).

4.2 Evaluation of MWU translation quality in an RBMT system

Apart from evaluating the detection quality, we also wanted to get a general picture of the improvement our data would make to Matxin, an RBMT system created by the IXA NLP group. So, we translated all 99 MWUs (see Section 4.1) using Matxin, and we also provided a manual translation for each of them.
We sorted the results into three groups (see Table 7): correct, improvable and incorrect.

Table 7. Evaluation of MWU translations given by Matxin

The MT is as good as the manual translation                      55.55%
The MT is not incorrect, but the manual translation is better    33.33%
The MT is incorrect                                              11.11%
5. (311,966 × 99.80/100 + 120,362 × 92.60/100) / (311,966 + 120,362)
The results we obtained in this evaluation show that much improvement remains to be made concerning MWU translation in Matxin, as 44.44% of the MTs were incorrect or improvable. In addition, it must be considered that we undertook this test without any context, and this percentage would surely be much higher if the combinations were used in the context of sentences, especially if they were separated by other words or if a non-canonical word order was used. Inurrieta et al. (2017) explain how MWU-specific linguistic data helps to improve translation quality in Matxin.

5. Conclusions and future work

In order to establish the grounds for the computational treatment of MWUs in MT systems, we undertook an in-depth linguistic analysis of some word combinations and their translations. First of all, we extracted combinations containing nouns and verbs from bilingual dictionaries: Spanish into Basque, and Basque into Spanish. We examined the morphological and semantic features of both the combinations (5,604) and their translations (12,979) and, as we had expected, we confirmed that MWUs cannot usually be translated word for word and morpheme for morpheme, as this kind of expression varies considerably from language to language.

Secondly, we searched for the combinations analysed in a parallel corpus, which allowed us (1) to know to what extent each combination was used in real texts, and (2) to obtain a large number of additional translations that were not in the dictionary. All of our results were included in our database, Konbitzul,⁶ which is now available for public use. Then, we selected the 150 most frequent combinations in Spanish, analysed them further and classified them according to their syntactic fixedness and their semantic compositionality, which helped us determine the kind of treatment that each MWU needed.
As we wanted to carry out an experiment with a Spanish-into-Basque RBMT system, we ran an identification experiment to establish whether the data we provided had a real effect on MWU identification, and we obtained very satisfactory results. Firstly, the number of MWUs identified increased by 27.80% with our data, and secondly, our method achieved a precision of 98% according to a manual evaluation undertaken by linguists. Finally, we also evaluated the translations given by our RBMT system for the MWUs we analysed, and we concluded that at least 44.44% of them were either
6. http://ixa2.si.ehu.es/konbitzul
incorrect or improvable, which underscores the need for specific techniques to process MWUs in MT systems. We are currently working on semi-automatising the whole linguistic analysis explained here, so that this methodology can be applied to a larger number of word combinations more easily. In addition, there would be merit in analysing semantic data about MWUs, as we believe this information could further improve both the detection and the generation processes.
Acknowledgements

Uxoa Iñurrieta’s work is funded by a PhD scholarship from the Ministry of Economy and Competitiveness (BES-2013–066372). This research was undertaken as part of the SKATeR (TIN2012–38584-C06–02) and QTLeap (FP7-ICT-2013.4.1–610516) projects.
References

Alegria, I., Ansa, O., Artola, X., Ezeiza, N., Gojenola, K., & Urizar, R. (2004, July). Representation and treatment of multiword expressions in Basque. In Proceedings of the Workshop on Multiword Expressions: Integrating Processing (pp. 48–55). Association for Computational Linguistics.
Baldwin, T., Bender, E. M., Flickinger, D., Kim, A., & Oepen, S. (2004, May). Road-testing the English Resource Grammar Over the British National Corpus. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal.
Baldwin, T., & Kim, S. N. (2010). Multiword expressions. Handbook of Natural Language Processing (2nd ed.). Morgan and Claypool.
Blunsom, P. (2007). Structured classification for multilingual natural language processing (Doctoral dissertation, University of Melbourne, Melbourne, Australia).
Bouamor, D., Semmar, N., & Zweigenbaum, P. (2012, May). Identifying bilingual Multi-Word Expressions for Statistical Machine Translation. In LREC 2012, Eighth International Conference on Language Resources and Evaluation (pp. 674–679). Istanbul, Turkey.
Butt, M. (2010). The light verb jungle: Still hacking away. Complex predicates in cross-linguistic perspective (pp. 48–78).
Corpas Pastor, G. (1997). Manual de Fraseología Española. Gredos.
Copestake, A., Lambeau, F., Villavicencio, A., Bond, F., Baldwin, T., Sag, I., & Flickinger, D. (2002). Multiword Expressions: Linguistic Precision and Reusability. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC 2002 (pp. 1941–1947). Las Palmas, Spain.
Dubremetz, M., & Nivre, J. (2014). Extraction of Nominal Multiword Expressions in French. In Proceedings of the 10th Workshop on Multiword Expressions (MWE) (pp. 72–76). Gothenburg, Sweden.
Fellbaum, C. (1998). WordNet. Blackwell Publishing Ltd.
Gurrutxaga, A., & Alegria, I. (2011, June). Automatic extraction of NV expressions in Basque: basic issues on cooccurrence techniques. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (pp. 2–7). Association for Computational Linguistics.
Heylen, D., & Maxwell, K. (1994). Lexical Functions and the Translation of Collocations. In Proceedings of Euralex.
Howarth, P. (1998). Phraseology and second language proficiency. Applied linguistics, 19(1), 24–44. doi: 10.1093/applin/19.1.24
Inurrieta, U., Aduriz, I., Diaz de Ilarraza, A., Labaka, G., Sarasola, K., & Carroll, J. (2016). Using linguistic data for English and Spanish verb-noun combination identification. In Proceedings of the 26th International Conference on Computational Linguistics (COLING 2016): Technical Papers (pp. 857–867).
Inurrieta, U., Aduriz, I., Diaz de Ilarraza, A., Labaka, G., & Sarasola, K. (2017). Rule-based translation of Spanish verb-noun combinations into Basque. In Proceedings of the 13th Workshop on Multiword Expressions, in EACL 2017 (pp. 149–154).
Jackendoff, R. (1997). The architecture of the language faculty (No. 28). MIT Press.
Laka, I. (1996). A brief grammar of Euskara, the Basque language. Universidad del País Vasco.
Mayor, A., Alegria, I., De Ilarraza, A. D., Labaka, G., Lersundi, M., & Sarasola, K. (2011). Matxin, an open-source rule-based machine translation system for Basque. Machine Translation, 25(1), 53–82. doi: 10.1007/s10590-011-9092-y
Padró, L., & Stanilovsky, E. (2012). Freeling 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012). ELRA. Istanbul, Turkey.
Pecina, P. (2008, June). A machine learning approach to multiword expression extraction. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008) (pp. 54–61).
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing (pp. 1–15). Springer Berlin Heidelberg.
Seretan, V. (2013, October). On collocations and their interaction with parsing and translation. In Informatics (Vol. 1, No. 1, pp. 11–31). Multidisciplinary Digital Publishing Institute.
Simova, I., & Kordoni, V. (2013, September). Improving English-Bulgarian statistical machine translation by phrasal verb treatment. In Proceedings of MT Summit XIV Workshop on Multi-word Units in Machine Translation and Translation Technology, Nice, France.
Tsvetkov, Y., & Wintner, S. (2012). Extraction of multi-word expressions from small parallel corpora. Natural Language Engineering, 18(04), 549–573. doi: 10.1017/S1351324912000101
Urizar, R. (2012). Euskal lokuzioen tratamendu konputazionala (Doctoral dissertation, Faculty of Computer Science, University of the Basque Country).
Villavicencio, A., Bond, F., Korhonen, A., & McCarthy, D. (Eds.). (2005). Computer Speech & Language (Special issue on Multiword Expressions), volume 19. Elsevier.
Wehrli, E., Seretan, V., Nerima, L., & Russo, L. (2009, May). Collocations in a rule-based MT system: A case study evaluation of their translation adequacy. In Proceedings of the 13th Annual Meeting of the European Association for Machine Translation (pp. 128–135).
Zhang, Y., Kordoni, V., Villavicencio, A., & Idiart, M. (2006, July). Automated multiword expression prediction for grammar engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties (pp. 36–44). Association for Computational Linguistics. doi: 10.3115/1613692.1613700
How do students cope with machine translation output of multiword units? An exploratory study

Joke Daems¹, Michael Carl², Sonia Vandepitte¹, Robert Hartsuiker³ & Lieve Macken¹

¹Ghent University, Department of Translation, Interpreting and Communication / ²Renmin University of China, Beijing & Copenhagen Business School, Department of Management, Society and Communication / ³Ghent University, Department of Experimental Psychology
In this chapter, we take a closer look at students’ post-editing of multiword units (MWUs) from English into Dutch. The data consists of newspaper articles post-edited by translation students as collected by means of advanced keystroke logging tools. We discuss the quality of the machine translation (MT) output for various types of MWUs, and compare this with the final post-edited quality. In addition, we examine the external resources consulted for each type of MWU. Results indicate that contrastive MWUs are harder to translate for the MT system, and harder to correct by the student post-editors than non-contrastive MWUs. We further find that consulting a variety of external resources helps student post-editors solve MT problems.

Keywords: machine translation, post-editing, search strategies, external resources, multiword units, translation process, translation quality
1. Introduction

Phrase-based statistical machine translation (SMT) systems are generally better suited to translate multiword units (MWUs) than rule-based machine translation (RBMT) systems, as SMT systems can extract MWUs from corpora and therefore have a better coverage of MWUs than RBMT systems. Still, these SMT systems often lack the semantico-syntactic knowledge required to properly translate MWUs (Monti et al., 2011). While there is an abundance of research into the identification and classification of MWUs within machine translation (Mendoza Rivera et al., 2013), research into the effects of MWUs on subsequent post-editing is limited.

doi 10.1075/cilt.341.03dae © 2018 John Benjamins Publishing Company
If post-editing is used in MWU research, it is often as a measure of the machine translation quality. The greater the contrast between the post-edited text and the machine translated text, the lower the quality of the machine translation output (Koehn & Germann, 2014). In this case, the assumption is made that human post-editors detect and correct all machine translation errors, which is, unfortunately, not always true. In this chapter, we therefore first look at the quality of machine translated MWUs, after which we analyse how these MWUs are subsequently processed by student post-editors. Only then do we look at the quality of the final product.

In addition, even for post-edited texts of high quality, the edit distance between machine translation output and post-edited text does not give the full picture. Word order or other grammatical issues, for example, might require the post-editor to reorganise the entire sentence, which will lead to a high edit rate, whereas a word sense error is more demanding to solve, but will not inflate the edit rate. To compensate for these issues, we propose using the same quality assessment approach for the raw MT output and the post-edited texts, as well as taking the external resources consulted during the post-editing process into account. The latter can give a better idea of post-editors’ problem-solving strategies (Göpferich, 2010) and are as such an indication of their uncertainty while post-editing.

The addition of post-editing to the equation requires another point of view entirely. Whereas people interested in improving machine translation systems have come up with various ways of classifying MWUs according to semantic, lexical or grammatical features (Monti et al., 2011), they have looked at MWUs solely from the ‘how is it processed by the machine translation system’ point of view.
This information is then used to improve the processing of MWUs by machine translation systems via linguistic pre-processing, POS-pattern definition, and syntactic and semantic processing (Mendoza Rivera et al., 2013). While this type of research is certainly important, it sometimes ignores the fact that machine translation output is often subsequently post-edited. In a post-editing scenario, it makes sense to ask ‘which types of MWUs are problematic for a post-editor?’ in addition to ‘which types of MWUs are problematic for the machine translation system?’ and ‘how can we improve the machine translation quality?’.

With that in mind, we tried to keep our MWU classification as simple as possible, but we added the factor ‘contrast with the target language’ to the classification. This distinction is of course language-dependent, but we believe it to be a necessary addition. If, for example, an idiom is not contained in the corpus a statistical machine translation system was trained on, the system will resort to word-by-word translation. Depending on the target language, that
How do students cope with machine translation output of multiword units?
idiom might actually still make sense. While the English idiom ‘it’s raining cats and dogs’ cannot be translated into Dutch as het regent katten en honden, the English ‘the apple does not fall far from the tree’ can literally be translated into Dutch as de appel valt niet ver van de boom and retain its idiomatic meaning. The first would be highly confusing for the post-editor and would not make sense in Dutch at all; the second would not even require post-editing. Even though these examples belong to the same category (idioms), they have a different effect on subsequent post-editing because of their degree of contrast with the target language.

We believe that the contrast with the target language and the subsequent post-editing process can provide interesting new insights into the processing of MWUs in a machine translation scenario. These insights can, in turn, be used to improve the translation interface to better aid post-editors with their work by, for example, highlighting specific types of MWUs that post-editors often find problematic.

On the basis of the abovementioned assumptions, we decided to classify MWUs for this study in two ways: by category (compound, collocation, multiword verb) and by contrastiveness: if a direct translation of the English MWU would be correct in Dutch, the unit was classified as ‘non-contrastive’, whereas MWUs that could not be translated literally into Dutch were classified as ‘contrastive’. An explanation plus example of each type can be found below.

1. Compound: lexical units with more than one base functioning grammatically and semantically as a single unit
(1) a. high-rise  hoogstijger  ‘hoogbouw’  CONTRASTIVE
    b. climate warming  klimaatopwarming  ‘klimaatopwarming’  NON-CONTRASTIVE
2. Collocation: a semantically autonomous base with an additional element which makes its own semantic contribution to the whole, but the selection of this element is dependent on the base
(2) a. for centuries  voor eeuwen  ‘eeuwenlang’  CONTRASTIVE
    b. steep mountains  steile bergen  ‘steile bergen’  NON-CONTRASTIVE
3. Multiword verb: a MWU functioning like a single verb. This term encompasses phrasal verbs, prepositional verbs, phrasal-prepositional verbs and other multiword verb constructions
(3) a. listen up  luister op  ‘luister’  CONTRASTIVE
    b. count on  rekenen op  ‘rekenen op’  NON-CONTRASTIVE
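The two-way classification above can be sketched as a small data structure. This is only an illustration: the class name `MWU`, its field names and the string comparison are assumptions introduced here, whereas in the study the decision was a human judgement on whether a direct translation would be correct in Dutch. The six entries are the examples (1) to (3) given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MWU:
    """One multiword unit with its category and two Dutch renderings (hypothetical structure)."""
    source: str    # English multiword unit
    literal: str   # direct (word-by-word) Dutch rendering
    correct: str   # acceptable Dutch translation
    category: str  # "compound", "collocation" or "multiword verb"

    @property
    def contrastive(self) -> bool:
        # Simplification: contrastive when the literal rendering differs
        # from the acceptable translation.
        return self.literal != self.correct


EXAMPLES = [
    MWU("high-rise", "hoogstijger", "hoogbouw", "compound"),
    MWU("climate warming", "klimaatopwarming", "klimaatopwarming", "compound"),
    MWU("for centuries", "voor eeuwen", "eeuwenlang", "collocation"),
    MWU("steep mountains", "steile bergen", "steile bergen", "collocation"),
    MWU("listen up", "luister op", "luister", "multiword verb"),
    MWU("count on", "rekenen op", "rekenen op", "multiword verb"),
]


def label(mwu: MWU) -> str:
    """Return the contrastiveness label used in the examples."""
    return "contrastive" if mwu.contrastive else "non-contrastive"


for m in EXAMPLES:
    print(f"{m.source:16} {m.category:15} {label(m)}")
```

A plain string comparison is of course too crude for real data (several acceptable translations may exist), so it should be read as a stand-in for the manual judgement.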
2. Experimental set-up

We conducted an experiment with ten master’s students of translation. They had no previous experience with post-editing and could only participate if they had passed their general translation exam. We registered the students’ translation and post-editing process during two sessions, though we only look at the post-editing process in this chapter. Each session consisted of two regular translation tasks and two post-editing tasks, with additional tests and surveys to provide us with some background information. It remains of course to be seen whether or not our results will generalise to professional translators. We have conducted the same experiment with professional translators, and we found no significant difference in post-editing time and overall quality between the two groups (Daems et al., 2017).

The texts to be translated were eight fragments taken from news website Newsela.¹ We originally selected fifteen texts with comparable Lexile® scores (between 1160L and 1190L),² and further filtered those texts on the basis of readability metrics, translation problems, and machine translation problems so that the final texts were of comparable complexity. Texts had to be translated from English into Dutch. For the post-editing task, the texts were translated by Google Translate. For the regular translation task as well as the post-editing task, the goal was to obtain target texts of publishable quality.
1. http://www.newsela.com
2. The authors would like to thank MetaMetrics® for their permission to publish Lexile scores in the present chapter. https://www.metametricsinc.com/lexile-framework-reading
We registered the translation process by means of keystroke logging and eye-tracking. Our two keystroke logging tools were Casmacat (Alabau et al., 2013) and Inputlog (Leijten & Van Waes, 2013). The first provides the professional look of an actual translation tool and is compatible with the eye-tracker, whereas the second logs applications outside of the Casmacat interface as well, making it easier to analyse the usage of external resources during the translation process. The eye-tracker was an EyeLink 1000 with chin rest, which we chose for its compatibility with Casmacat and its accuracy.

In total, we found 99 MWUs in the texts; 52 were collocations, 37 could be classified as compounds and ten were multiword verbs. Though we assumed it would be interesting to compare these three categories due to their varying degrees of semantic autonomy, we are mainly interested in the translation of MWUs. Since translation is language-dependent, we classified our MWUs further as contrastive or non-contrastive with Dutch (can the MWU be translated literally or not?). The collocations are fairly evenly divided across categories, with 24 collocations being contrastive and 28 being non-contrastive. Of the compounds, 13 belong to the ‘contrastive’ category, the other 24 to the ‘non-contrastive’ category. Most of the multiword verbs (seven) are contrastive and only three of them are non-contrastive.
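As a quick consistency check, the counts reported above can be laid out as a cross-tabulation. The dictionary layout and function names below are assumptions made for illustration; the numbers are the ones given in the text.

```python
# MWU counts per category and contrastiveness, as reported in the text.
COUNTS = {
    "collocation": {"contrastive": 24, "non-contrastive": 28},
    "compound": {"contrastive": 13, "non-contrastive": 24},
    "multiword verb": {"contrastive": 7, "non-contrastive": 3},
}


def category_totals(counts):
    """Sum the contrastive and non-contrastive cells for each category."""
    return {cat: sum(cells.values()) for cat, cells in counts.items()}


def contrastive_share(counts):
    """Fraction of contrastive MWUs within each category."""
    return {
        cat: cells["contrastive"] / sum(cells.values())
        for cat, cells in counts.items()
    }


totals = category_totals(COUNTS)
grand_total = sum(totals.values())

for cat, share in contrastive_share(COUNTS).items():
    print(f"{cat:15} n={totals[cat]:2d}  contrastive={share:.0%}")
print("total MWUs:", grand_total)
```

The per-category totals (52, 37 and 10) add up to the 99 MWUs mentioned above, and the shares make the imbalance visible: multiword verbs are mostly contrastive, compounds mostly non-contrastive.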
3. Analysis

In order to analyse the effect of machine translation quality on post-editing, we first need to establish how problematic each type of MWU is for machine translation. Rather than comparing the MT output with a reference translation, we decided to compare the MT output with the source text, as suggested by Monti et al. (2011). We annotated the raw MT output for quality on the basis of our two-step translation quality assessment approach (Daems et al., 2013). In a first step, all errors related to the target language and the translation as a target text are marked and labelled. In a second step, the target text is compared with the source text and all adequacy issues (differences in meaning between source and target text) are marked and labelled. As such, we get a detailed impression of the problems in the machine translated text, regarding both the number of problems and the type of problems. Figure 1 shows how common MT errors are for each type of MWU. A value of zero means that there were no errors in the MT version of the MWU, a value of one means that there was one error, and so on up to three errors.
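The percentages summarised in Figure 1 amount to a frequency distribution of error counts per MWU type. A minimal sketch of that computation follows; the function name, the tuple layout and the toy annotations are hypothetical and do not reproduce the study's data.

```python
from collections import Counter, defaultdict


def error_distribution(annotations):
    """Percentage of MWUs with 0, 1, 2, ... errors per (category, contrast) group.

    annotations: iterable of (category, contrast, n_errors) tuples, where
    n_errors is the number of errors annotated in the MT version of one MWU.
    """
    counts = defaultdict(Counter)
    for category, contrast, n_errors in annotations:
        counts[(category, contrast)][n_errors] += 1
    return {
        group: {
            n: 100 * c / sum(counter.values())
            for n, c in sorted(counter.items())
        }
        for group, counter in counts.items()
    }


# Hypothetical toy annotations, NOT the values behind Figure 1.
TOY = [
    ("compound", "contrast", 0),
    ("compound", "contrast", 1),
    ("compound", "contrast", 2),
    ("compound", "contrast", 1),
    ("compound", "no_contrast", 0),
    ("compound", "no_contrast", 0),
    ("compound", "no_contrast", 0),
    ("compound", "no_contrast", 1),
]

dist = error_distribution(TOY)
for group, percentages in dist.items():
    print(group, percentages)
```

Here `n_errors` would be the sum of the errors marked in both annotation steps (target-language errors and adequacy issues) for a given MWU.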
[Figure 1. Y-axis: percentage of MWUs (0–100%). X-axis: collocation, compound and multi-word verb, each split into contrast and no_contrast. Legend: number of errors (0, 1, 2, 3).]
Figure 1. Frequency of zero, one, two or three errors occurring in MWUs processed by MT, for each type of MWU
In line with expectations, MT performs much better with non-contrastive MWUs than with contrastive MWUs, with 60 to 80% (depending on the MWU category) of non-contrastive MWUs being unproblematic for MT. In the contrastive conditions, only 30 to 45% of MWUs are processed correctly by the MT system. We also see a higher percentage of MWUs containing two or three errors after MT in the contrastive condition. Figure 2 gives an overview of the most common error types found in the MT output.

As can be seen in Figure 2, unclassified adequacy errors (Ad_Other) are most common in the MT output, with logical problems and wrong collocations occurring frequently as well. The adequacy issues (word sense or others) often cause logical problems.

(4) … the volunteers keep us company
    … de vrijwilligers houden ons gezelschap
    … ‘de vrijwilligers houden ons bedrijf’ (MT)
[Figure 2. Y-axis: number of errors (0–20). X-axis (error types): Ad_Other, logical_problem, wrong_collocation, Word_sense, word_order, gram_other, structure, word_nonexistent.]
Figure 2. Error types occurring more than once in machine translated MWUs
In Example 4, the word ‘company’ is mistranslated as bedrijf by the MT system. From an adequacy perspective, this is a word sense error (‘company’ can be translated as bedrijf, just not in this context), but it is also a logical problem from an acceptability perspective (bedrijf makes no sense in this sentence). As expected, grammatical errors also occur in the machine translation output (word order, structure, and other types of grammatical problems).

After establishing how problematic the various MWUs are for a statistical machine translation system, we take a closer look at the effectiveness of post-editing (PE). We anticipated four scenarios, which we briefly discuss with examples below.

1. No problem: the MWU was not problematic after MT, and none of the student post-editors introduced errors during the post-editing process.
(5) In exchange for…
    ‘In ruil voor’ … (MT)
    ‘In ruil voor’ … (PE)
2. Not solved: the MWU was problematic in MT, and is still problematic after post-editing for at least one student post-editor.
(6) Families are holding tight to their cash …
    ‘Gezinnen zijn strak vast te houden aan hun geld’ … (MT)
    ‘Gezinnen houden zich vast aan hun geld’ … (PE)
3. Solved: the MWU was problematic in MT, and is no longer problematic in any of the post-edited versions.
(7) self-described
    ‘zelf-beschreven’ (MT)
    ‘zoals ze zichzelf noemen’ (PE)
4. Problem introduced: the MWU was not problematic in MT, but errors were introduced by at least one student post-editor during the post-editing process.
(8) finger-painting
    ‘vingerverven’ (MT)
    ‘vingerverfschilderijen’ (PE)
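The four scenarios above reduce to a simple decision rule over two pieces of information: whether the MT version of the MWU contained an error, and whether the version of any individual post-editor did. The sketch below is an illustration only; the function name and the boolean encoding are assumptions, not part of the original annotation procedure.

```python
def classify_outcome(mt_has_error, pe_has_error):
    """Map one MWU to one of the four post-editing scenarios.

    mt_has_error: True if the MT output of the MWU contained at least one error.
    pe_has_error: one boolean per post-editor, True if that post-editor's
                  final version of the MWU still contains (or now contains) an error.
    """
    any_pe_error = any(pe_has_error)
    if mt_has_error:
        # Problematic in MT: solved only if no post-edited version has an error.
        return "not_solved" if any_pe_error else "solved"
    # Unproblematic in MT: a single erroneous version counts as an introduced problem.
    return "problem_introduced" if any_pe_error else "no_problem"


ten_clean = [False] * 10            # all ten post-editors deliver an error-free MWU
one_slip = [True] + [False] * 9     # one post-editor's version contains an error

print(classify_outcome(False, ten_clean))  # no_problem
print(classify_outcome(True, one_slip))    # not_solved
print(classify_outcome(True, ten_clean))   # solved
print(classify_outcome(False, one_slip))   # problem_introduced
```

Note that the rule is deliberately strict: a single post-editor with a remaining error is enough to place a MWU in the ‘not solved’ or ‘problem introduced’ scenario.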
Figure 3 gives an overview of the outcome after post-editing for MWUs in each category. Quite a few errors found in the machine translated MWUs are not solved after post-editing, especially in the contrastive condition.

[Figure 3. Y-axis: 0–30. X-axis: collocation, compound and multi-word verb, each split into contrast and no_contrast. Legend: Problem_Intro, Solved, Not_Solved, No_Problem.]
Figure 3. Occurrence of the four possible scenarios after post-editing (problem introduced, solved, not solved, no problem) for each type of multiword unit

Example 9 shows a contrastive collocation that was mistranslated by MT and was still problematic for most post-editors. Post-editors three (P3) and ten (P10) provide correct translations, whereas post-editor one (P1) does correct the mistranslation of ‘pulled’, but not the mistranslation of ‘together’. Post-editors five (P5) and eight (P8) correctly interpret the ‘together’ as belonging with ‘pulled’, but fail to choose a Dutch verb that completely covers the meaning of ‘pull together’.

(9) Researchers have pulled together data…
    ‘Onderzoekers hebben samen gegevens getrokken’… (MT)
    ‘Onderzoekers hebben samen gegevens geanalyseerd’ (P1)
    ‘Onderzoekers hebben gegevens verzameld’ (P5, P8)
    ‘Onderzoekers hebben gegevens … samengebracht’ (P3)
    ‘Onderzoekers hebben gegevens…samengebracht en onderzocht’ (P10)
In addition, collocations seem to be more problematic than compounds, which might be explained by the fact that in collocations, each part is semantically more autonomous than in a compound, so it might be harder for post-editors to spot errors there. Example 10 shows a case where a collocation is translated literally by the MT system. Yet, in context, the ‘trip’ is an indication of distance and not an actual ‘trip’, so the literal translation does not work in Dutch. This error goes unnoticed by one post-editor. An error in a compound noun, as shown in Example 11, is much more obvious and is solved by all post-editors.

(10) … the museum is just a short trip for
     … ‘het museum is slechts een korte reis voor’ (MT)
     … ‘het museum is slechts een korte reis voor’ (P9)

(11) museum exhibit
     ‘museumstuk’ (MT)
     ‘tentoonstelling’ (PE)
Only in a few cases did the post-editors introduce errors that were not there in the MT output, most notably in the non-contrastive condition.

(12) urban lifestyles
     ‘stedelijke levensstijl’ (MT)
     ‘straatcultuur’ (PE)
This could be because student post-editors do not trust the MT output and feel like they should change something, even when it is correct. Another explanation might be that student post-editors correct certain MT errors, but introduce errors of their own. To verify this assumption, we take a closer look at the error types found in the ‘not solved’ condition, so those MWUs that were problematic in machine translation and still problematic for at least one student post-editor (Figure 4).
[Figure 4. Y-axis: number of errors (0–12). X-axis (error types): untranslated, missing, gram_other, verb form, word order, structure, word sense, missing info, Part of Speech, Quantity, Spelling mistake, Deletion, style other, compound, register, Addition, logical problem, Ad_other, wrong collocation. Legend: MT, PE.]
Figure 4. Error types in the ‘not solved’ condition for MT and PE
From Figure 4, we can derive that ‘wrong collocation’ errors are abundant, both in the machine translation output and after post-editing. Unclassified adequacy errors (Ad_Other) are more common in the MT output than in the final post-edited version, as are logical problems. Some (mostly grammatical) error types can only be found in the MT output, whereas adequacy and style issues only appear after post-editing. It can be assumed that grammatical errors are easily spotted and solved by a student post-editor, whereas some adequacy errors (i.e., contrasts between source and target text) are easily overlooked, perhaps because the text itself is fluent. In line with expectations, student post-editors treat the text more freely than the MT system, as evidenced by the number of additions, deletions and stylistic problems.

Though looking at quality tells us something about how difficult the processing of MWUs is for MT systems and subsequent post-editing, it does not tell the whole story. We therefore take a closer look at the external resources consulted during the post-editing process. The first question is ‘for which types of MWUs do student post-editors consult external resources?’ Figure 5 gives an overview of the MWUs and whether or not resources were consulted when post-editing that particular type of MWU. ‘Yes’ indicates that at least one post-editor consulted external resources during the translation of a MWU, ‘no’ indicates that no post-editors looked up external resources for a MWU.

[Figure 5. Y-axis: percentage of MWUs (0–100%). X-axis: outcome after post-editing (No_problem, Solved, Not_solved, Problem_introduced) within collocation, compound and multi-word verb, each split into contrast and no_contrast. Legend: yes, no.]
Figure 5. Proportion of multiword units per category for which resources have been consulted by at least one post-editor

As can be seen in Figure 5, student post-editors look up external resources for each type of MWU. What is remarkable is that it is more common for post-editors to consult external resources when translating compounds than when translating collocations, even if the collocations are incorrect in the MT output and have been solved by post-editing. This might, in part, be due to the type of errors. As shown in Figure 4, many MT errors consist of grammatical errors, which presumably can be solved without consulting external resources. We further expect grammatical errors to occur more frequently within collocations than within compounds. Then again, there is still an abundance of collocations that were not solved by all post-editors, and it seems that post-editors consulted external resources for less than half of those cases. A comparison of contrastive and non-contrastive MWUs shows comparable results for compounds, but when looking at collocations it seems that, overall, resources are consulted more frequently in the non-contrastive condition, which again is a little odd.

To better understand these findings, we take a closer look at the time spent in the various types of external resources, and a few examples of search strategies. Figure 6 gives a general overview of the total time spent in each type of external resource for each type of MWU. Even though there are more collocations than compounds in the data, a lot more time is spent looking up external resources when post-editing compounds than when post-editing collocations. It is also remarkable that, for collocations, more time is spent in external resources
for non-contrastive collocations than for contrastive collocations. Perhaps the latter can be considered ‘false friends’, where the student post-editor believes that nothing should be changed, when actually the MT output is incorrect. This might also explain the abundance of ‘not solved’ MT problems in the contrastive collocations condition. Perhaps the frequency of MWUs also plays a part. If a MWU is not frequently used in the target language, the post-editor might want to verify that it is a correct translation, even for the non-contrastive MWUs. We did not look at frequency in the present study, as it is rather hard to get accurate frequency information for collocations, especially when they are split up in the sentence.

[Figure 6. Y-axis: total time (0–1200000). X-axis: collocation, compound and multi-word verb, each split into contrast and no_contrast. Legend: Termbank, Synonym, Spelling, Search, MT, Encyclopedia, Dictionary, Conversion, Concordancer.]
Figure 6. Total time spent in external resources for each type of MWU
Overall, we see that most time is spent in dictionaries, concordancers and search engines, the latter being used more for compounds than for collocations. Perhaps student post-editors use a search query to verify that a particular compound exists, and if the search engine returns enough results, it is no longer necessary
to consult other resources. The common choice of dictionaries for collocations is counterintuitive, as the words in collocations are more independent than compounds, and so we would not expect those to appear in dictionaries. A possible explanation is that student post-editors do not realise that a word is part of a collocation and they try to solve the problems by looking up parts of the collocation rather than the whole. Closer inspection of the categories ‘conversion’ and ‘spelling’ reveals that both were used for only one MWU each (‘conversion’ for ‘183-square-foot’ and ‘spelling’ for ‘21st-century’), so these should not be considered as typical.

Table 1. Search strategy for MWU ‘low interest payments’

| Source descriptor | Time | Dur | Keystrokes | Type |
|---|---|---|---|---|
| Van Dale | 758460 | 42500 | interst[..]est m[.] pauy[..]yment | dictionary |
| Nieuw tabblad - Google Chrome | 815038 | 4235 | linguee | navigation |
| Linguee \| Nederlands-Engels woordenboek | 819273 | 5749 | interest payments | concordancer |
| interest payments - Nederlandse vertaling – Linguee woordenboek | 825022 | 22485 | | concordancer |
| Nieuw tabblad - Google Chrome | 847507 | 4672 | iate.europa.eu | navigation |
| IATE - De veeltalige databank van de EU | 852179 | 18703 | “interest paymne[..] ents” | termbank |
| IATE - Zoekresultaat | 870882 | 30422 | “interes [.]t payment” | termbank |
An example of a search strategy can be seen in Table 1. During the twelfth minute of post-editing, the post-editor navigates towards the Dutch dictionary Van Dale and types in the words ‘interest payment’. The post-editor remains on this page for 42 seconds before opening a new tab and looking up ‘interest payments’ on concordancer website Linguee. This page remains in focus for 22 seconds, after which the post-editor navigates to the European multilingual term base IATE to look up the plural ‘interest payments’ as well as the singular ‘interest payment’. An additional question here is: how effective is the time spent in external resources? Does spending a lot of time in external resources equal better quality, and how much time is spent on passages that were not problematic to begin with? Figure 7 shows the average time spent in external resources for MWUs that were incorrectly translated by the MT system and were correctly translated by all student post-editors.
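The log entries of Table 1 can be aggregated per resource type to see where the time went. The sketch below is illustrative: the tuple layout and function names are assumptions, times and durations are taken to be in milliseconds (as in the Figure 7 caption), and the type of the Linguee result page is assumed to be ‘concordancer’.

```python
from collections import defaultdict

# (source descriptor, start time in ms, duration in ms, resource type), after Table 1.
# Assumption: the Linguee result page (fourth row) is of type "concordancer".
LOG = [
    ("Van Dale", 758460, 42500, "dictionary"),
    ("Nieuw tabblad - Google Chrome", 815038, 4235, "navigation"),
    ("Linguee | Nederlands-Engels woordenboek", 819273, 5749, "concordancer"),
    ("interest payments - Nederlandse vertaling – Linguee woordenboek",
     825022, 22485, "concordancer"),
    ("Nieuw tabblad - Google Chrome", 847507, 4672, "navigation"),
    ("IATE - De veeltalige databank van de EU", 852179, 18703, "termbank"),
    ("IATE - Zoekresultaat", 870882, 30422, "termbank"),
]


def time_per_type(log):
    """Total duration in ms for each resource type."""
    totals = defaultdict(int)
    for _source, _start, duration, resource_type in log:
        totals[resource_type] += duration
    return dict(totals)


def as_clock(ms):
    """Format a millisecond timestamp as mm:ss since the start of the session."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes:02d}:{seconds:02d}"


type_totals = time_per_type(LOG)
for resource_type, ms in type_totals.items():
    print(f"{resource_type:13} {ms / 1000:6.1f} s")
print("first lookup starts at", as_clock(LOG[0][1]))
```

For this single MWU, the term bank and the dictionary account for most of the time, with the two Linguee pages together taking roughly 28 seconds.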
[Figure 7. Y-axis: average time in ms (0–100000). X-axis: collocation contrast, collocation no_contrast, compound contrast, compound no_contrast, multi-word verb contrast. Legend: Termbank, Synonym, Search, Encyclopedia, Dictionary, Concordancer.]
Figure 7. Average time (in ms) spent in external resources per MWU incorrectly translated by the MT system and correctly translated by all student post-editors, for each category
It seems that, on average, between 25 and 95 seconds are needed to look up enough information to correctly post-edit MWUs that were incorrectly translated by the machine translation system. In the contrastive condition, much more time on average is spent in external resources when post-editing compounds or multiword verbs than when post-editing collocations. In addition, there is less variety in the types of resources consulted when solving collocations than when solving compounds and multiword verbs. When solving contrastive compounds, a lot of time seems to be spent in termbanks as well, though closer inspection reveals this search to be conducted by one post-editor only for one specific MWU (interest payments), the one shown in Table 1.

Figure 8 then gives an overview of the sources consulted while translating those MWUs that were incorrectly translated by the machine translation system, and incorrectly translated by at least one post-editor. We compare the sources used by the post-editors that did not correctly post-edit the MWU (‘not_solved’ in the graph) with the sources used by post-editors that did manage to correctly post-edit the MWU (‘solved’ in the graph).
[Figure 8. Y-axis: average time in ms (0–90000). X-axis: NOT_SOLVED (collocation contrast, compound contrast) and SOLVED (collocation contrast, collocation no_contrast, compound contrast, multi-word verb contrast). Legend: Termbank, Synonym, Spelling, Search, MT, Dictionary, Conversion, Concordancer.]
Figure 8. Average time (in ms) spent in external resources for MWUs that were incorrectly translated by MT and not corrected by at least one student post-editor
It can be derived from this graph that far less time is spent in external resources by the post-editors that failed to solve the problems compared to those post-editors who corrected the machine translation errors. This seems to indicate that consulting external resources can help student post-editors solve problems related to MWUs. There were only four cases where post-editors who consulted external resources when trying to correct an incorrectly translated MWU failed to provide a good translation. Two of these were by the same post-editor, and in one of the cases, the error that remained after post-editing was a spelling error rather than the original logical problem found in the MT output.

In contrast with those MWUs that were correctly translated by all post-editors, we see a larger variety of sources consulted and more total time needed when translating collocations than when translating multiword verbs, with the translation of contrastive compounds demanding the most time in external resources. There is also less variety in the types of resources consulted when translating multiword verbs. In addition, termbanks are only used when translating contrastive collocations, and not contrastive compounds, as was the case for MWUs that were correctly translated by all post-editors. Table 2 shows the search strategy
for the contrastive collocation ‘fail their polygraph tests’. The collocation was translated correctly by all post-editors with the exception of post-editor nine (P9).

Table 2. Search strategies of five different student post-editors for MWU ‘fail their polygraph tests’

| Participant | Source descriptor | Time | Dur | Keystrokes | Type |
|---|---|---|---|---|---|
| P2 | Van Dale | 368148 | 9484 | polydrap[….] graph | dictionary |
| P4 | Van Dale | 395901 | 6859 | polygraaf | dictionary |
| P4 | Van Dale | 430381 | 3563 | | dictionary |
| P4 | Nieuw tabblad - Google Chrome | 505189 | 4531 | polygraaf | navigation |
| P4 | polygraaftest - Google zoeken | 510236 | 4391 | [….] | search |
| P6 | meer wel dan niet - Google zoeken | 574400 | 8672 | polygraaftest | search |
| P6 | polygraaftest - Google zoeken | 583072 | 1281 | | search |
| P6 | polygraaftest - Google zoeken | 591737 | 4422 | | search |
| P6 | polygraaftest - Google zoeken | 919262 | 2235 | | search |
| P6 | Nieuw tabblad - Google Chrome | 921497 | 3203 | ]groene boekje | navigation |
| P6 | groene boekje - Google zoeken | 924700 | 1719 | | search |
| P6 | Woordenlijst Nederlandse Taal - Officiële Spelling | 926419 | 5765 | test | spelling |
| P6 | Nieuw tabblad - Google Chrome | 932184 | 6360 | testente[..] tests | navigation |
| P6 | tests testen taaladvies Google zoeken | 938544 | 3046 | | search |
| P6 | Testen / tests | 941590 | 5688 | | spelling |
| P7 | Van Dale - Google Chrome | 320232 | 7938 | polygraph | dictionary |
| P9 | polygraph tests - Nederlandse vertaling – Linguee woordenboek | 300124 | 9453 | | concordancer |
| P9 | Nieuw tabblad - Google Chrome | 309577 | 2828 | polygrra[..]aaf | navigation |
| P9 | Google - Google Chrome | 312405 | 1360 | | navigation |
| P9 | polygraaf - Google zoeken | 313765 | 5406 | | search |
Participants number two and seven simply look up the word ‘polygraph’ in a dictionary (Van Dale). Participant number four has a slightly more elaborate search strategy, looking up the Dutch word ‘polygraaf’ in the same dictionary and then navigating to Google to search for the word ‘polygraaftest’. Participant number six has the most elaborate search strategy. He looks up ‘polygraaftest’ in Google search, and switches back to consult the results of the search query throughout his translation process. The first time is at around nine and a half minutes, with a few checks quickly following the first, and then there is another check at around fifteen minutes. The post-editor then switches to ‘Groene Boekje’, which is the official word list of the Dutch language, to look up the correct spelling. The same query is given to the site ‘taaladvies’, which is another website for checking Dutch spelling. Judging by the keystrokes, the post-editor wanted to know whether the Dutch plural of test is tests or testen. The last post-editor, also the only post-editor that made a mistake in the final translation of this MWU, is the only person to use a concordancer (Linguee) to look up ‘polygraph tests’, after which she also consults Google Search to look up polygraaf.

What is remarkable in this example is that the main issue in the machine translation output was the translation of ‘fail’ rather than the translation of ‘polygraph test’, yet all post-editors focus on ‘polygraph test’ in their searches. Though most post-editors correctly translate ‘fail’ as well, post-editor nine does not. It might be possible that in cases like this, where a compound (polygraph test) is part of a collocation (fail a test), post-editors focus on the compound rather than the collocation as a whole.
Though the above findings indicate that looking up external resources can help student post-editors correct errors made by the MT system, the success of looking up external resources is also determined by knowing when to look things up. A key post-editing skill is knowing when the machine translation is correct, and when it is not. From Figure 9, we can derive that students spend a lot of time looking up external resources when post-editing MWUs, even when the MWUs have been correctly translated by the machine translation system. The time spent on contrastive compounds in particular is striking, as is the variety of sources consulted for both types of compounds. This finding is a little counterintuitive, as we would expect post-editors to not think twice about correctly translated compounds, whereas collocations, being compositionally freer than compounds, might require additional searches for verification. This could perhaps be a sign that post-editors do not consider collocations to be a whole, whereas they are more accustomed to compounds. Or, as mentioned before, this might be due to the frequency of the compounds being low.
Table 3. Search strategy for MWU ‘high-rise’

| Source descriptor | Time | Dur | Keystrokes | Type |
|---|---|---|---|---|
| cinema - Nederlandse vertaling - bab.la Engels-Nederlands woordenboek | 1029750 | 6469 | hih[.]gh-rise | dictionary |
| high rise - Nederlandse vertaling - bab.la Engels-Nederlands woordenboek | 1036219 | 30687 | | dictionary |
| dagelijkse levensstijl - Google zoeken | 1066906 | 3281 | high-rise | search |
| high-rise - Google zoeken | 1070187 | 4063 | | search |
| main document | 1074250 | 2687 | | |
| high-rise - Google zoeken | 1076937 | 3688 | hoogbouw | search |
| hoogbouw - Google zoeken | 1080625 | 6969 | een | search |
| een hoogbouw - Google zoeken | 1087594 | 3172 | | search |
| main document | 1090766 | 48526 | | |
| … | … | … | … | … |
| een hoogbouw - Google zoeken | 1330352 | 1297 | | search |
In Table 3, we see an example of a student post-editor looking up ‘high-rise’, a contrastive compound that was correctly translated by the machine translation system as ‘hoogbouw’. The post-editor first looks up ‘high-rise’ in the English-Dutch dictionary ‘bab.la’ and via Google Search. He then returns to the main document for two seconds, and continues to use Google Search, this time to look up the Dutch word ‘hoogbouw’. He adds the article ‘een’ to the search query and returns to the main document for almost an entire minute. The post-editor proceeds with the rest of the text for a while (omitted from example) and checks the search results again four minutes later. In total, the post-editor spent almost two minutes verifying a correct translation. 4. Conclusion In this chapter, we discussed the machine translation quality of various types of MWUs, the subsequent post-editing process by students and the final quality of the product after post-editing. We suggest adding ‘contrast with the target language’ as a new factor in the evaluation and analysis of MWUs in machine translation. Contrastive MWUs were found to be more difficult than non-contrastive MWUs for Google Translate to process as well as for the post-editors to correct. We further found collocations to be harder to post-edit than compounds. Fine-grained error analysis shows that grammatical errors and logical problems are usually corrected
by the post-editors, whereas wrong collocation errors and adequacy issues remain after post-editing. A closer look at the resources consulted during the post-editing of MWUs showed that students consult resources more frequently and spend considerably more time looking up external resources when post-editing compounds than when post-editing collocations, which might indicate that they need to be made more aware of collocations occurring in the text. We found that, if sources are consulted, the machine translation errors are usually corrected by the post-editor. More time is needed to successfully process contrastive MWUs than non-contrastive MWUs, with the exception of collocations. The limited time spent in external resources when post-editing contrastive collocations might be the reason that so many contrastive collocations remain problematic after post-editing. In addition, post-editors spend quite some time looking up MWUs that were correctly translated by the machine translation system. We can conclude that the distinction between contrastive and non-contrastive MWUs is a useful new way of classifying MWUs with regard to machine translation and subsequent post-editing. While post-editors’ search strategies seem to be successful, they need to be made aware of contrastive collocations, and they could further benefit from some form of MT quality estimation to prevent them from spending a lot of time looking up resources for correctly translated MWUs.
5. Future work
In the future, we wish to compare these findings with those of experienced translators in order to assess the generalisability of this study. In addition, we will compare students’ post-editing performance with their processing of MWUs during translation from scratch. By examining gaze data and post-editors’ and translators’ perceived difficulty, it will be possible to determine the best method to handle MWUs in a translation scenario.
Ideally, our findings will be used to improve translation tools for post-editors. Seeing how problematic contrastive collocations are for the post-editors, it could be useful to highlight contrastive collocations in the machine translation output so that post-editors know to double-check them. Machine translation suggestions that are most likely correct can be indicated as well, so that post-editors know not to spend too much time on those. Finally, frequency information might also help predict what post-editors find problematic to solve. Perhaps adding extra information (dictionary and/or concordancer) whenever a low-frequency MWU occurs can also aid post-editors.
Joke Daems, Michael Carl, Sonia Vandepitte, Robert Hartsuiker & Lieve Macken
Aligning verb + noun collocations to improve a French-Romanian FSMT system
Amalia Todiraşcu & Mirabela Navlea
FDT (Fonctionnements Discursifs et Traduction), LiLPa (Linguistique, Langues, Parole), Université de Strasbourg
We present several Verb + Noun collocation integration methods using linguistic information, aiming to improve the results of a French-Romanian factored statistical machine translation system (FSMT). The system uses lemmatised, tagged and sentence-aligned legal parallel corpora. Verb + Noun collocations are frequent word associations, sometimes discontinuous, related by syntactic links and with non-compositional sense (Gledhill, 2007). Our first strategy extracts collocations from monolingual corpora, using a hybrid method which combines morphosyntactic properties and frequency criteria. The second method applies a bilingual collocation dictionary to identify collocations. Both methods transform collocations into single tokens before alignment. The third method applies a specific alignment algorithm for collocations. We evaluate the influence of these collocation alignment methods on the results of the lexical alignment and of the FSMT system.
Keywords: MWE, FSMT, hybrid collocation identification, lexical alignment, MWE-aware MT systems, collocation dictionary
1. Context and motivation
Multiword expressions (MWEs) are defined as lexical combinations composed of two or more words, with specific lexical, syntactic or semantic behaviour (Baldwin & Kim, 2010). Some categories of MWEs, such as idiomatic expressions, named entities or prepositional phrases, are characterised by their high degree of fixedness (preference for specific morphosyntactic properties or for fixed syntactic patterns), but also by their non-compositional sense. For example, terms and named entities, two very productive categories in specialised domains, have non-compositional sense and should not be translated word for word into the target language. Other MWEs (collocations, noun or verb compounds) are highly productive, vary in their syntactic structures (accepting variation in mood or number, several determiners or modifiers occurring inside
doi 10.1075/cilt.341.04tod © 2018 John Benjamins Publishing Company
the structure) and provide various degrees of compositionality. A word-for-word translation strategy fails to handle MWEs, due to their strong lexical preferences (poser une question ‘ask a question’, but not *demander une question). Thus, a significant category of translation errors made by SMT systems is due to multiword expressions (particle verbs, nominal compounds, terms, named entities, collocations). They still represent a challenge in statistical machine translation (Kordoni & Simova, 2014; Ramisch et al., 2013; Schottmüller & Nivre, 2014), due to specific properties of MWEs which are difficult to model. Various research projects (such as PARSEME)¹ aim to propose effective strategies to handle MWEs in machine translation. An SMT system uses three components: a translation model, a target language model and a decoder. First, the SMT system builds translation models from lexically and sentence-aligned parallel corpora, as well as language models from target monolingual corpora. Then, it uses a decoder to propose the best translations from several alternatives (built using the translation model). Finally, the system reorders the output according to the target language model. The lexical alignment systems integrated by SMT methods use word frequency in monolingual corpora and bilingual sentence-aligned corpora to compute possible word-for-word translation equivalents. Statistical methods usually fail to align MWEs without complex syntactic or semantic information. Indeed, if MWEs are not identified in the training corpora, the statistical lexical alignment systems generally propose wrong or partial alignments for these multiword expressions. Some frequent alignment errors are due to a wrong lexical choice, inappropriate for the given context. Moreover, flexible MWEs accept modifiers such as adverbs or relative clauses between their components, and these modifiers might not be aligned by the system.
Lexical alignment strategies also fail to detect long-distance dependencies. In addition, differences in word order between the source and the target language decrease the quality of MWE alignment, mainly in the case of monotonic alignments. Despite their high frequency, MWEs identified as single tokens might be translated by various equivalents (multiword or simple lexical units). Each of these translation equivalents has a low probability and is therefore rarely proposed as a valid translation. To illustrate erroneous collocation alignments, we give below the collocation definition adopted in our project and then discuss several alignment examples. We consider collocations as multiword, sometimes discontinuous expressions, presenting specific morphosyntactic and semantic properties (Gledhill, 2007; Gledhill & Todiraşcu, 2008), and we focus here on the Verb + Noun class of collocations (Gledhill, 2007; see Section 3, where the definition of collocations and their properties are presented). However, multiple links (many-to-many alignments) are not identified by systems such as the popular, purely statistical GIZA++ aligner (Och & Ney, 2003), used in our project. In Example (1), obtained with GIZA++ from the parallel corpus DGT-TM (Steinberger et al., 2012) (like the other examples presented in this paper), the French (FR) Verb + Noun collocation prendre les mesures ‘take measures’ is not completely aligned with its Romanian (RO) translation equivalent a lua măsurile ‘take measures’. Thus, prend ‘takes’ and ia ‘takes’ are aligned, as well as mesures ‘measures’ and măsurile ‘measures-the’, but the French definite determiner les ‘the’ is not aligned with the Romanian collocation ia măsurile ‘takes measures’. This alignment error is due to a morphosyntactic difference between French and Romanian. Indeed, the French definite determiner always occurs before the noun as a separate word, while in Romanian it is a suffix of the noun. The French definite determiner les should therefore be aligned with both the noun and the verb in Romanian. The solid lines displayed in the following example represent the alignment found by GIZA++. The dashed lines represent missing alignments, to be completed by the alignment algorithm.
(1) Chaque partie prend les mesures appropriées. (FR)
    Fiecare parte ia măsurile care se impun. (RO)
¹ Parsing and multi-word expressions. Towards linguistic precision and computational efficiency in natural language processing (http://www.cost.eu/domains_actions/ict/Actions/IC1207)
In Example (2), the French collocation donner lieu ‘to give place’ is not aligned at all with its corresponding Romanian collocation a da naștere ‘to give birth’. This alignment error might be due to the low frequency of the collocation elements and their translation equivalents in the training parallel corpus. The dashed lines show how the alignment should be completed.
(2) L’aide prévue dans le règlement CEE n° 1308/70 ont donné lieu, dans certains Etats membres, ... (FR)
    Subvenția menționată în Regulamentul (CEE) nr. 1308/70 a dat naștere, în unele state membre, ... (RO)
Other alignment errors are due to the semantic properties of MWEs. Non-compositional MWEs should be translated via an underlying semantic representation (a concept for a domain-specific term or for an idiomatic expression, a physical or an abstract entity for a named entity (NE)). This information requires large external knowledge bases and is too complex to be used by current MT systems. To avoid these problems, different methods for handling MWEs in machine translation have been developed, as presented in the next section.
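Returning to Example (1), the alignment can be pictured as a set of (source index, target index) links. The following is a minimal sketch rather than the authors' implementation: the 0-based token indices and the helper name `attach` are assumptions introduced for illustration, and only the links involving the collocation are taken from the description of Example (1).

```python
# Tokens of Example (1); indices are 0-based positions in each sentence.
fr = ["Chaque", "partie", "prend", "les", "mesures", "appropriées", "."]
ro = ["Fiecare", "parte", "ia", "măsurile", "care", "se", "impun", "."]

# Links reported in the text as found by GIZA++ for the collocation:
# prend-ia and mesures-măsurile. The determiner "les" (index 3) is left unaligned.
found = {(2, 2), (4, 3)}


def attach(links, src_index, tgt_span):
    """Return the links plus one link from src_index to every target position
    in tgt_span (hypothetical helper, not part of GIZA++)."""
    return set(links) | {(src_index, t) for t in tgt_span}


# Completion described in the text: "les" is linked to both "ia" and "măsurile".
completed = attach(found, src_index=3, tgt_span=[2, 3])

for s, t in sorted(completed - found):
    print(fr[s], "->", ro[t])
# prints:
# les -> ia
# les -> măsurile
```

Storing links as a set of index pairs makes it easy to compare an automatic alignment with a completed one by simple set difference.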
2. Handling MWEs for MT
Several methods have been proposed to identify MWEs at the alignment level. Some of them are defined for specific MWE categories. Thus, to reduce the number of unknown words, noun compounds are decomposed into simple words and the alignment is then applied at the simple word level (Cap et al., 2014). Other strategies use bilingual lists of translation equivalents (at word level) as a starting point and complete multiple alignments if at least one link is found in the area (Rapp & Sharoff, 2014). Specific MWEs, such as NEs (Tan & Pal, 2014) or terms (Wu et al., 2008), are identified by named-entity recognition (NER) tools or term extractors and transformed into single tokens before the alignment process. Various approaches aim to align monolingual MWE candidates using statistical measures (Tan & Pal, 2014) or linguistic information (Venkatapathy & Joshi, 2006; de Gispert et al., 2006) from parallel corpora. These collocation extraction methods from monolingual and multilingual corpora improve lexical alignment results (Ren et al., 2009). To identify MWEs, external resources such as dictionaries or terminological databases are also applied. Thus, some methods identify MWEs in a segmentation step (Lambert & Banchs, 2006) before starting the alignment process, and show significant improvements in the SMT system. Other approaches use collocation lists or machine-readable dictionaries to complete multiple alignments (Tiedemann, 1999; Wehrli et al., 2009; Okita et al., 2010), but these resources are not always complete or available for all domains or languages. To compensate for the lack of complete lexical resources, several projects propose rules to describe the syntactic variability of MWEs for morphologically rich languages (Deksne et al., 2008). Furthermore, some MWE integration strategies extract collocation candidates (frequent n-grams) from the source texts and search for MWE translation candidates in the target text (Melamed, 1997).
External dictionaries of idioms explaining their literal meanings are used to replace MWEs in the source language with these literal meanings. Then the translation model is built, and the literal meaning obtained after translation is replaced by the idiomatic MWE in the target language (Salton et al., 2014).
In contrast to these approaches, other strategies integrate MWE alignment at a later stage of the MT system in order to avoid MWE translation errors. Thus, phrase-based SMT systems exploit n-grams (frequent contiguous sequences of words), generally extracted from lexically aligned parallel corpora by applying heuristics such as the intersection and/or the union of bidirectional alignments (Koehn et al., 2003). As this method uses phrases instead of simple words, it is effective at handling specific contiguous MWEs (fixed idioms or prepositional phrases), but it is not able to treat discontinuous or flexible MWEs. Moreover, n-grams are not always
linguistically motivated in order to map specific MWE categories or discontinuous constituents. Phrase-based systems are improved by integrating bilingual lists of MWEs to complete training parallel corpora (Ren et al., 2009), or by retraining the translation model from parallel corpora containing MWEs as single units (Okita et al., 2010; Pal et al., 2013; Ramisch et al., 2013). The alignment of MWEs may also be done by computing the probabilities of the translation equivalents and using linguistic information (Bouamor et al., 2012), by using similarity scores (Pal et al., 2013) or a measure of compositionality (Venkatapathy & Joshi, 2006). This alignment then depends mainly on the volume of the training parallel data available for each language pair. However, the quality of the automatic translations provided by phrase-based SMT methods can be significantly improved by building factored statistical machine translation systems (FSMT) (Koehn & Hoang, 2007; Birch et al., 2007; Avramidis & Koehn, 2008; Ceauşu & Tufiş, 2011). These methods use linguistic factors (lemmas, morphosyntactic properties, part-of-speech tags, etc.) in the translation process. Even if the quality of the translation is enhanced by the linguistic information associated with each token, FSMT systems also fail to deal with discontinuous or flexible MWEs.
In this article, we study the impact of several MWE alignment methods on the translation results of a French-Romanian FSMT system, by taking into account the specific class of Verb + Noun collocations (Gledhill, 2007; Todiraşcu et al., 2008). As done by Wehrli et al. (2009), we use linguistic information in our lexical alignment system to identify MWEs, but also to align them. We experiment with several methods of MWE integration for French and Romanian, two morphologically rich, less-resourced languages:
(1) Preprocessing MWEs before alignment. We identify MWE candidates before the alignment step by applying two different extraction methods: (a) a hybrid method to extract collocation candidates from large monolingual corpora (Todiraşcu et al., 2009); (b) a search against an external French-Romanian collocation dictionary (Todiraşcu et al., 2008);
(2) MWE alignment. We first apply the standard GIZA++ lexical alignment (Och & Ney, 2003) and then our own alignment algorithm, which uses the French-Romanian collocation dictionary.
Thus, we run several experiments to evaluate the results of the FSMT system after the identification of MWEs. First, we include MWE units as single tokens in the training corpus and we rebuild the translation models. Then, we apply a specific MWE alignment to complete the initial lexical alignment provided by GIZA++, without transforming the MWEs into single units.
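The first experiment, in which MWEs become single tokens before alignment, can be illustrated with a short sketch. It rests on several assumptions that do not come from the chapter: the function name `merge_collocations`, the underscore used as a joiner, and the restriction to contiguous lemma sequences (discontinuous collocations would need extra handling).

```python
def merge_collocations(lemmas, collocations, sep="_"):
    """Join each contiguous occurrence of a listed collocation into one token.

    lemmas: list of lemma strings for one sentence.
    collocations: iterable of lemma sequences, e.g. [["tenir", "compte"]].
    sep: joiner for the merged token (an arbitrary choice for this sketch).
    """
    # Ignore empty entries and try longer collocations first,
    # so that a longer match wins over a shorter one.
    patterns = sorted((tuple(c) for c in collocations if c), key=len, reverse=True)
    merged, i = [], 0
    while i < len(lemmas):
        for pattern in patterns:
            if tuple(lemmas[i:i + len(pattern)]) == pattern:
                merged.append(sep.join(pattern))
                i += len(pattern)
                break
        else:
            # No collocation starts at position i: keep the lemma unchanged.
            merged.append(lemmas[i])
            i += 1
    return merged


sentence = ["le", "règlement", "entrer", "en", "vigueur", "demain"]
known = [["entrer", "en", "vigueur"], ["tenir", "compte"]]
print(merge_collocations(sentence, known))
# ['le', 'règlement', 'entrer_en_vigueur', 'demain']
```

After such a step, a word aligner sees the collocation as one unit, which is the effect the preprocessing strategy aims for.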
In the next section, we present our definition of collocations, while the problems related to their translation and their alignment are discussed in Section 4. We present the architecture of our FSMT system in Section 5. The method of preprocessing MWEs and the dictionary are presented in Sections 6 and 7, respectively. In Section 8, we describe our specific algorithm for aligning MWEs. In Section 9, we show the influence of the MWEs on the results of the lexical alignment and of the FSMT systems. The last section presents the conclusions and future work.
3. Collocation definition
As mentioned previously, we consider collocations as multiword, sometimes discontinuous expressions, presenting specific morphosyntactic and semantic properties (Gledhill, 2007; Gledhill & Todiraşcu, 2008). Thus, to identify collocations, we apply three criteria (Gledhill, 2007; Gledhill & Todiraşcu, 2008; Todiraşcu et al., 2009):
a. Frequency. Collocations are frequent word associations;
b. Syntactic relations. The elements composing a collocation are related by strong syntactic dependencies. Several classes of collocations are identified (Hausmann, 2004): Noun + Noun (the second noun is the modifier of the first one: prise de décisions ‘decision making’), Verb + Noun (the noun is the direct object of the verb: prendre des mesures ‘take measures’), and Adverb + Adjective (the adverb is a modifier of the adjective: grièvement malade ‘very ill’);
c. Semantic relations. The sense of a collocation is more or less compositional.
The specific class of Verb + Noun collocations is very frequent in several genres or types of texts, but it is domain independent. Applying the three criteria, Verb + Noun collocations can be divided into two main classes (Gledhill & Todiraşcu, 2008):
– Complex predicators, which are characterised by their non-compositional sense and a high degree of fixedness (tenir compte ‘take into account’). This class has strong preferences for some contextual morphosyntactic properties: the noun is always in the singular or always in the plural, the determiner is always definite or absent, no modifiers occur between the verb and the noun, and the verb is present
only in the active voice. Some collocations contain prepositions (faire l’objet de ‘be the subject of’, entrer en vigueur ‘come into force’). They generally express a relational or a mental process and the noun specifies the range of this process (Gledhill, 2007).
– Complex predicates, which present morphosyntactic variability both for the noun and for the verb. Their sense is more compositional, but they manifest specific lexical preferences (prendre des mesures ‘take measures’, but not *faire des décisions ‘make decisions’). These constructions generally express mental processes and the range is expressed by the noun (Gledhill, 2007).
4. Translation problems
Verb + Noun collocations are highly frequent and their automatic translation is a difficult task. For example, in large parallel corpora, we identify several ways in which a Verb + Noun collocation is translated (Navlea, 2014):
– direct equivalent Verb + Noun collocations (prendre des mesures ‘to take some measures’ vs. a lua măsuri ‘to take measures’);
– synonym collocations (prendre des mesures ‘to take some measures’ vs. a adopta măsuri ‘to adopt measures’). In other contexts, the verbs are not synonyms;
– one lexical unit (avoir l’intention ‘to have the intention’ vs. a intenționa ‘to intend’);
– Noun + Noun collocations (remplir sa mission ‘to fulfill its mission’ vs. îndeplinirea misiunii ‘mission fulfillment’);
– paraphrases (prendre effet ‘to take effect’ vs. a fi valabil ‘to be valid’).
For the Verb + Noun constructions, modifiers (adverbs, adjectives) might occur between the verb and the noun. Passive constructions are also possible. Example (3) (Navlea, 2014) illustrates such an insertion. In French, an adverb (également ‘also’) and a prepositional group (après consultation de cet Etat ‘after consultation of this State’) occur between the verb (prendre ‘to take’) and the noun (les dispositions ‘measures-the’), while in Romanian, the prepositional group (după consultarea acestui stat ‘after consultation-the this State’) is outside the border of the collocation.
(3) Il prend également, après consultation de cet Etat, les dispositions prévues auxdits paragraphes. (FR)
    El ia, de asemenea, măsurile prevăzute la alineatele menţionate, după consultarea acestui stat. (RO)
Other contexts show extended collocations in translation, as in Example (4) (Navlea, 2014). In this case, maintenir ‘maintain’ is translated by the Verb + Noun collocation a menţine în vigoare ‘to maintain in force’, followed by its co-occurrent măsuri ‘measures’.
(4) […] les Etats membres […] ne maintiennent aucune mesure […] (FR)
    […] statele membre nu menţin în vigoare măsuri […] (RO)
These various problems explain why alignment systems generally fail to align Verb + Noun collocations in parallel corpora. In the next section, we present the overall architecture of our FSMT system, including the collocation dictionary and the various strategies for handling MWEs.
5. The architecture of the FSMT system and Verb + Noun collocation integration
To build our FSMT system, we use the open-source Moses system (Koehn et al., 2007) trained on a French-Romanian parallel corpus of legal texts extracted from DGT-TM (Steinberger et al., 2012). This training corpus is composed of 64,918 pairs of parallel sentences (approximately 1.5 million tokens per language). To test and to optimise our FSMT system, we also extract from DGT-TM two separate small parallel corpora, each containing 300 bilingual pairs of sentences. The training corpora are lemmatised and tagged with TTL, a POS tagger available for Romanian (Ion, 2007) and for French (Todiraşcu et al., 2011). TTL uses the MULTEXT MSD tagset (Ide & Veronis, 1994) for French and MULTEXT-East for Romanian (Erjavec, 2004). Thus, each token is accompanied by linguistic factors such as the lemma (followed by the first two characters of the morphosyntactic tag, which morphologically disambiguate the lemma), the part of speech, and the morphosyntactic tag. To train our baseline factored translation model (Navlea, 2014), we use lemmas and morphosyntactic properties as linguistic factors. The training heuristic is grow-diag-final (Koehn et al., 2003). Concerning the target language models, we exploit existing Romanian language models (Tufiş et al., 2013) and we develop our own French language models with the SRILM toolkit (Stolcke, 2002). They are built on surface word forms or on different linguistic factors (lemmas, morphosyntactic tags) and are both based on the legal corpus JRC-Acquis (Steinberger et al., 2006).
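The factors attached to each token can be pictured in the pipe-separated notation that Moses reads for factored models, where the factors of one token are joined by `|` and tokens are separated by spaces. The sketch below is illustrative only: the order of the factors, the `^` joiner between the lemma and the two-character tag prefix, and the example tags are assumptions, not the authors' actual configuration.

```python
def to_factored(tokens):
    """Build one factored line from (form, lemma, pos, msd) tuples.

    Each token becomes form|lemma^XX|pos|msd, where XX is the first two
    characters of the morphosyntactic tag (msd). The factor order and the
    "^" joiner are assumptions made for this sketch.
    """
    parts = []
    for form, lemma, pos, msd in tokens:
        lemma_factor = f"{lemma}^{msd[:2]}"  # lemma disambiguated by tag prefix
        parts.append("|".join([form, lemma_factor, pos, msd]))
    return " ".join(parts)


# MULTEXT-style tags used purely as illustrative values.
tagged = [
    ("prend", "prendre", "V", "Vmip3s"),
    ("mesures", "mesure", "N", "Ncfp"),
]
print(to_factored(tagged))
# prend|prendre^Vm|V|Vmip3s mesures|mesure^Nc|N|Ncfp
```

Keeping the tag prefix on the lemma distinguishes, for instance, a verb lemma from a homographic noun lemma, which is the disambiguation role described above.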
To study the impact of MWE integration in our baseline FSMT system, we experimented with several strategies for their exploitation in the translation process (see Figure 1). First, we identify MWEs and then they are transformed into single lexical units. In one experiment, the MWE identification is performed by
using a hybrid extraction method from monolingual corpora (Todiraşcu et al., 2009). In another, we apply an existing French-Romanian collocation dictionary (Todiraşcu et al., 2008). In both cases, the lexical alignment is performed by applying GIZA++ to the corpus containing MWEs transformed into single units. The last strategy completes the GIZA++ results with multiple alignments. For this purpose, we use our specific collocation alignment algorithm and the collocation dictionary as an external resource. We try several configurations of the FSMT system integrating MWEs.
[Figure 1 diagram. Components: I. MWE identification with a dictionary (Todiraşcu et al., 2008); II. MWE extraction (Todiraşcu et al., 2009); III. MWE dictionary alignment; source text; French-Romanian parallel corpus, POS-tagged and lemmatised; word alignment with GIZA++ (Och & Ney, 2003); translation model; language model; monolingual corpus; SRILM (Stolcke, 2002); Moses decoder (Koehn et al., 2007); translated text.]
Figure 1. The architecture of the FSMT system integrating three successive modules (I, II, III) of MWE identification
6. Preprocessing Verb + Noun collocations
The first strategy consists of identifying the Verb + Noun collocations in each monolingual corpus before alignment. For this purpose, we use a hybrid method combining statistical and linguistic filters to extract candidates from large monolingual corpora (Todiraşcu et al., 2009). Since we consider Verb + Noun constructions to be frequent word associations related by specific syntactic relations, our extraction method combines these two criteria. First, to find frequent word associations, we apply the statistical module, which implements the log-likelihood measure LL (Dunning, 1993). For this task, we use large lemmatised and
tagged monolingual corpora. The POS tagger TTL (Ion, 2007; Todiraşcu et al., 2011) assigns lemmas and detailed morphosyntactic properties to each token. For Romanian, TTL identifies some domain-specific terms (Noun + Noun) from the legal domain as single units, while this feature is not available for French. In the statistical module, we select the pairs of verb and noun lemmas that co-occur within a window of 11 words and have an LL score of at least 9. Then, we apply several linguistic filters, defined on the basis of the morphosyntactic properties characterising Verb + Noun collocations. These filters are able to identify strong preferences for specific properties (determiner, number, voice). Linguistic filters use only surface information to detect morphosyntactic fixedness; no semantic information is available to distinguish between non-compositional and compositional candidates. For example, complex predicators have no determiner, the number is always singular and only active voice forms are accepted. An example of a French pattern used to detect complex predicators with a high degree of fixedness is:
[tag = “Vm(.*)”] [tag = “Nc.s”] [lemma = “de|à”]
where Vm is a main verb and Nc.s a singular common noun, followed by a specific preposition (de ‘of’ or à ‘to’). Some filters implement heuristics to eliminate invalid candidates, such as those containing too many prepositions or conjunctions between the verb and the noun (40.15% of the candidates are deleted for Romanian and 39.43% for French). All collocation candidates (and their properties) are extracted from an independent monolingual corpus, which was already tagged and lemmatised. Some parts of this corpus were used to build the dictionary, so we expect to have similar or comparable data for the two experiments. We obtain 14,191 candidates from a French corpus composed of legal texts (JRC-Acquis (Steinberger et al., 2006)), newspapers (Le Monde and L’Est Républicain) and scientific articles from the medical domain (Todiraşcu et al., 2012).
The French corpus is composed of approx. 1 million tokens. Additionally, we extract 9,141 candidates from a Romanian corpus composed of legal texts (extracted from JRC-Acquis) and newspaper texts. The Romanian corpus is composed of approx. 0.5 million tokens. For our experiments, we use a reduced list of candidates obtained after filtering (we select only candidates with a log-likelihood higher than 5,000). In this list of candidates, we found complex predicators and complex predicates, but also Verb + Noun pairs in which the noun is the subject rather than the direct object. Moreover, this hybrid strategy adds some irrelevant candidates which do not respect the semantic criteria (the noun does not express the range of the process (Gledhill, 2007)). It is not possible to distinguish between collocations and simple co-occurrences using surface linguistic information alone (Gledhill, 2007).
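The statistical filter of this section relies on the log-likelihood measure. As a hedged illustration, the sketch below computes Dunning's log-likelihood ratio from a 2×2 contingency table of co-occurrence counts. The function name, the argument layout and the toy counts are assumptions; only the threshold of 9 comes from the text, and the counting of verb and noun lemmas inside the 11-word window is assumed to have been done beforehand.

```python
import math


def log_likelihood(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: windows containing both the verb and the noun
    k12: windows containing the verb without the noun
    k21: windows containing the noun without the verb
    k22: windows containing neither
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2.0 * total


LL_THRESHOLD = 9  # minimum score quoted in the text

# Toy counts, invented for illustration only.
score = log_likelihood(30, 10, 10, 950)
print(score >= LL_THRESHOLD)
```

A pair whose components co-occur exactly as often as chance predicts receives a score of 0, so a threshold on this value keeps only pairs that are associated more strongly than expected.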
The list of candidates is applied to the monolingual corpus in order to transform each candidate into a single unit. The FSMT system adds the MWEs and their translation equivalents to the translation model. In the next section, we present another method of MWE identification using an external bilingual dictionary.
7. The MWE dictionary
The dictionary contains 250 French-Romanian Verb + Noun constructions (Todiraşcu et al., 2008), built from monolingual corpora (JRC-Acquis but also newspaper texts, as specified in the previous section). For each Verb + Noun construction, the dictionary gives the translation equivalent, which may be a Verb + Noun collocation or a single word (verb or noun). For each construction, we represent a complex set of morphosyntactic properties specific to the noun (the determiner, the number, the gender, the modifiers) and to the verb (the passive and the modifiers of the verb), with their frequency. For example, the French entry for entrer en vigueur ‘come into force’ (see Figure 2) contains: a section representing the properties of the collocation itself, giving information about the arguments and the class of the collocation (complex predicator or complex predicate); a section containing the noun properties (preference for the determiner or for the number, the possibility of using modifiers and their frequency); a section with the verb properties (modifiers, voice, auxiliaries).
[Figure 2 content: dictionary entry for entrer en vigueur; values shown: null, sg; class: complex predicator; further properties omitted]
Figure 2. An example of a French entry (Todiraşcu et al., 2008)
Amalia Todiraşcu & Mirabela Navlea
The detailed information about morphosyntactic properties is used to detect MWEs in the parallel corpus and to properly align the Verb + Noun constructions. Then, to evaluate the influence of this method on the overall performance of the FSMT system, we transform the collocations into single tokens before alignment (Ramisch et al., 2013).

8. The collocation alignment algorithm

The third method proposes a different MWE integration strategy. In this case, we complete the lexical alignment performed by GIZA++ (Och & Ney, 2003) with many-to-many and many-to-one alignments produced by a dedicated alignment algorithm. This algorithm exploits the information found in the dictionary in order to align the noun, the verb, and also specific determiners or modifiers of the noun or of the verb. Some syntactic structures might occur between the verb and the noun without being part of the collocation. The algorithm completes the alignment by taking into account the collocation behaviour, such as strong preferences for a class of determiners, for a given preposition, or for a specific mood or voice. The algorithm proceeds as follows:

1. for each pair of sentences, we search for pairs of verb and noun lemmas from the dictionary;
2. if the pair in the source language and its equivalent are found in the sentence pair, we align them as many-to-many links;
3. for each set of properties of the bilingual pairs of the verb and of the noun, we search for the matching determiners, modifiers or specific prepositions (with a frequency higher than 80) and we align them with the equivalent collocation.

The resulting alignments are added to the intersection of the bidirectional GIZA++ alignments, but MWEs are not transformed into single lexical units.

9. Experiments

We apply several strategies aiming to integrate Verb + Noun constructions into an FSMT system.
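Returning briefly to the collocation alignment algorithm of Section 8, its three steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Entry` structure, the function names, the toy sentences and the restriction of step 3 to tokens lying between the verb and the noun are all hypothetical, and the frequency cut-off of 80 is used as given in the text, whose unit is not specified.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class Entry:
    """Hypothetical, simplified view of one bilingual dictionary entry."""
    src_core: tuple                                  # verb and noun lemmas, source side
    tgt_core: tuple                                  # lemmas of the translation equivalent
    src_extras: dict = field(default_factory=dict)   # function-word lemma -> frequency
    tgt_extras: dict = field(default_factory=dict)


def core_positions(sentence, lemmas):
    """Token indices whose lemma belongs to the collocation core."""
    return [i for i, lemma in enumerate(sentence) if lemma in lemmas]


def extra_positions(sentence, core, extras, min_freq):
    """Frequent determiners, prepositions or modifiers inside the core span."""
    frequent = {lemma for lemma, freq in extras.items() if freq > min_freq}
    return [i for i in range(min(core), max(core) + 1)
            if i not in core and sentence[i] in frequent]


def collocation_links(src, tgt, dictionary, min_freq=80):
    links = set()
    for entry in dictionary:
        # Step 1: look for the verb and noun lemmas on both sides.
        if not all(lemma in src for lemma in entry.src_core):
            continue
        if not all(lemma in tgt for lemma in entry.tgt_core):
            continue
        src_core = core_positions(src, entry.src_core)
        tgt_core = core_positions(tgt, entry.tgt_core)
        # Step 3: add frequent function words attached to the collocation.
        src_all = src_core + extra_positions(src, src_core, entry.src_extras, min_freq)
        tgt_all = tgt_core + extra_positions(tgt, tgt_core, entry.tgt_extras, min_freq)
        # Step 2: link every source position to every target position.
        links.update(product(src_all, tgt_all))
    return links


def complete_alignment(intersection_links, src, tgt, dictionary):
    """Add the collocation links to the intersected GIZA++ links."""
    return set(intersection_links) | collocation_links(src, tgt, dictionary)


# Toy lemmatised sentence pair and entry, invented for illustration.
src = ["le", "règlement", "entrer", "en", "vigueur"]
tgt = ["regulament", "intra", "în", "vigoare"]
entry = Entry(("entrer", "vigueur"), ("intra", "vigoare"), {"en": 95}, {"în": 90})
links = complete_alignment({(1, 0)}, src, tgt, [entry])
```

The sketch works on sentence-level lemma lists and links positions only; a fuller version would also check part-of-speech tags and the other morphosyntactic properties stored in the dictionary.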
In order to evaluate the FSMT system, we use the BLEU (Bilingual Evaluation Understudy) score (Papineni et al., 2002), which compares the output translation with a reference translation. BLEU 1 is the BLEU score before system optimisation, and BLEU 2 the score after this step. Optimisation is performed with the MERT application (Bertoldi et al., 2009). In the following subsections, we study the influence of these strategies on the final results of these systems.
9.1 MWEs and the lexical alignment system

In this experiment, we study the influence of the identification and alignment of MWEs on the results of the lexical alignment system. For this first experiment, we use a test corpus composed of 1,000 pairs of bilingual aligned sentences (containing approximately 30,000 tokens per language) (Navlea, 2014). This test corpus is lemmatised and tagged with TTL (Ion, 2007; Todiraşcu et al., 2011). It is also manually aligned at the lexical level, following the guidelines proposed by Melamed (1998), which were adapted for French and Romanian to take into account their morphosyntactic differences. Moreover, additional annotation rules specific to the legal domain and style of the corpus were defined. All Verb + Noun collocations are also fully aligned. In this reference corpus, 470 collocation equivalents were identified (Navlea, 2014). This corpus is used to evaluate Verb + Noun collocation alignment, compared to the baseline alignment system. In order to build the baseline system (Navlea, 2014), we apply GIZA++ in both directions of the alignment process and we intersect the two resulting alignments. We keep only the sure links, that is, those detected by the intersection operation (Koehn et al., 2003). Then, we apply our specific alignment algorithm to obtain multiple alignments. We evaluate the baseline and the new alignment (baseline alignments plus the collocation links) against the reference corpus, in terms of the AER (Alignment Error Rate) score (Och & Ney, 2003). All alignments of the reference corpus are considered sure, so the AER score is computed as follows:

AER = 1 − F-measure
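The general definition of Och & Ney (2003) is AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted alignment, S the sure links and P the possible links. When every reference link is sure (S = P), this reduces to 1 − F-measure, as stated above. The sketch below illustrates this special case; the function name and the toy link sets are hypothetical.

```python
def aer_all_sure(predicted, reference):
    """AER when every reference link is sure: 1 - F-measure."""
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        raise ValueError("both alignments must contain at least one link")
    correct = len(predicted & reference)
    if correct == 0:
        return 1.0
    precision = correct / len(predicted)
    recall = correct / len(reference)
    f_measure = 2 * precision * recall / (precision + recall)
    return 1.0 - f_measure


# Toy links as (source index, target index) pairs, invented for illustration.
reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 3)}
score = aer_all_sure(predicted, reference)  # 2 of 3 predicted links are correct
```

Because the score depends only on link counts, adding a small number of correct collocation links to a large alignment can only move it slightly.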
The AER score obtained for the baseline is 34.61% (Navlea, 2014), while the AER score obtained for the new alignment is 33.84%. The improvement of the AER score is low (0.77 percentage points). This result is due to the fact that the dictionary contains few collocations found in the evaluation corpus: 35 Romanian collocations with 127 occurrences and 39 French collocations occurring 96 times in this corpus. Similar studies involving a larger dictionary are still necessary to measure more precisely the impact of such a resource on the alignment process. For this reason, for the moment we do not apply this alignment in the FSMT system.

9.2 MWEs and FSMT system

In the first experiment, we use the bilingual dictionary of Verb + Noun constructions presented in Section 7. As mentioned previously, in order to build our FSMT system we use a larger French-Romanian lemmatised and tagged parallel corpus from the law domain, composed of 64,918 pairs of sentences presented in
Section 5. We identify the MWE entries and transform them into single units. We proceed with experiments in three configurations, following Ramisch et al. (2013). First, we apply the dictionary to both corpora. Then, the other two experiments concern the identification of MWEs in only one language. The baseline is represented by the factored system, without MWE identification (Navlea, 2014). We obtain a significant improvement in the BLEU 2 score (3.22 points) by applying the dictionary as an external resource for the French → Romanian FSMT system, transforming MWEs into single units in both source and target languages (see Table 1). In the opposite translation direction (Romanian → French), the BLEU 2 score is comparable with the baseline system score. When we translate from Romanian to French, we benefit from the rich morphological information, which helps to choose the appropriate translation equivalent. Moreover, the Romanian POS tagger already recognises some multiword terms from the law domain, which explains why BLEU 2 is very high for the baseline and why the BLEU 2 scores are quite comparable (48.23 compared to 48.34 for the baseline).

Table 1. Results of the Moses system with dictionary-based MWE identification and transformation into a single unit (BLEU 1: BLEU before tuning; BLEU 2: BLEU after tuning)

| Direction | Baseline BLEU 1 | Baseline BLEU 2 | MWE in source and target languages BLEU 1 | MWE in source and target languages BLEU 2 | MWE in source language BLEU 1 | MWE in source language BLEU 2 | MWE in target language BLEU 1 | MWE in target language BLEU 2 |
|---|---|---|---|---|---|---|---|---|
| FR => RO | 23.15 | 25.33 | 25.33 | 28.55 | 23.47 | 27.11 | 25.60 | 29.10 |
| RO => FR | 47.05 | 48.34 | 46.76 | 48.23 | 47.49 | 48.29 | 46.80 | 47.49 |
For MWEs recognised only in the source language, the BLEU 2 score is comparable with that of the previous system when translating into Romanian (27.11). In the opposite translation direction, the results are comparable with the baseline results (48.29 vs. 48.34). Concerning MWEs recognised only in the target language, the BLEU 2 score is significantly improved when translating into Romanian (3.77 points), compared to the baseline system. However, in the opposite translation direction no improvement is observed.

9.3 MWE identification before aligning

The second set of experiments applies the MWE candidates to the existing corpora and transforms them into single units, as presented in Section 6. We use the same corpus as in Section 9.2. We study the influence of the MWEs automatically extracted from monolingual corpora on the performance of the FSMT system.
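The transformation of MWEs into single units, used in both sets of experiments, can be illustrated with a minimal sketch that merges contiguous occurrences of known MWEs into one token. The function name, the underscore joiner and the toy sentence are assumptions made for illustration; the sketch does not handle material inserted between the verb and the noun, which a real identification module has to deal with.

```python
def merge_mwes(tokens, mwes, joiner="_"):
    """Replace each contiguous MWE occurrence by a single token.

    Matching is greedy, left to right, and tries longer MWEs first.
    """
    by_length = sorted({tuple(m) for m in mwes if m}, key=len, reverse=True)
    merged, i = [], 0
    while i < len(tokens):
        for mwe in by_length:
            if tuple(tokens[i:i + len(mwe)]) == mwe:
                merged.append(joiner.join(mwe))
                i += len(mwe)
                break
        else:
            # No MWE starts at this position: copy the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Toy lemmatised sentence, invented for illustration.
sentence = ["le", "règlement", "entrer", "en", "vigueur", "demain"]
merged = merge_mwes(sentence, [("entrer", "en", "vigueur")])
```

Each merged token is a new vocabulary item, so rare MWEs become rare tokens in the training data.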
We apply a set of 1,840 candidates for French and 2,400 for Romanian, sorted by their log-likelihood (with log-likelihood > 5,000). The baseline is the same as for the first experiment. The FSMT system integrating MWEs in source and target language significantly outperforms the baseline after optimisation (by 2.69 points) when translating into Romanian (see Table 2). However, in the opposite direction, we obtain slightly worse results than for the baseline system.

Table 2. Results of Moses obtained with the MWE extractor, transforming MWEs into single-unit tokens (BLEU 1: BLEU before tuning; BLEU 2: BLEU after tuning)

| Direction | Baseline BLEU 1 | Baseline BLEU 2 | MWE in source and target languages BLEU 1 | MWE in source and target languages BLEU 2 | MWE in source language BLEU 1 | MWE in source language BLEU 2 | MWE in target language BLEU 1 | MWE in target language BLEU 2 |
|---|---|---|---|---|---|---|---|---|
| FR => RO | 23.15 | 25.33 | 25.30 | 28.02 | 22.95 | 25.73 | 23.19 | 26.45 |
| RO => FR | 47.05 | 48.34 | 47.02 | 47.39 | 46.48 | 48.05 | 46.78 | 48.04 |
The candidates identified as MWEs are obtained with the extractor and some of them are not valid collocations. Moreover, the identification module transforms MWEs into low-frequency single lexical units (including their significant properties such as determiners or modifiers), which are not very effective when building the translation model.

10. Conclusions and future work

We present here several methods for integrating MWEs into FSMT systems for less-resourced pairs of languages. These methods exploit linguistic information (both to extract and to align MWEs) and language-specific resources and tools. The MWE integration strategies we apply are not new: MWEs are transformed into single units in the training corpora (Ramisch et al., 2013), and they are identified either with an external resource (Todiraşcu et al., 2008) or with an MWE-extractor-based method (Todiraşcu et al., 2009). However, no FSMT system exploiting MWEs is available for the studied languages. As a novelty, we use our specific alignment algorithm, which relies on collocations and linguistic information from a dictionary; however, due to the small alignment improvement obtained during the evaluation step, this strategy is not finally used in the FSMT system. Few common collocations were found between the dictionary and the manually aligned corpus (35 Romanian collocations and 39 French collocations), which explains the small impact of the dictionary on the overall alignment result. The collocation alignment
and its integration in the translation process should be done in the future, with a larger collocation dictionary. We set up several experiments aiming to study the influence of a specific class of MWEs (Verb + Noun collocations) on the results of a French-Romanian baseline FSMT system. We use a small bilingual Verb + Noun dictionary as an external resource and, alternatively, a hybrid collocation extractor to identify MWEs, which are transformed into single tokens before alignment. These strategies are more effective for the French → Romanian translation direction, obtaining a significant improvement of the BLEU score for both methods (almost 4 points for the dictionary and 2.69 for the hybrid method). For the Romanian → French translation direction, the baseline system obtained the best result. The Romanian corpus contains rich morphosyntactic information used to generate the correct translations. Moreover, the training corpus already contains some Romanian domain-specific terms which have been recognised during the tagging and chunking process, while the French corpus does not contain domain-specific terms. As a consequence, the identification of Verb + Noun collocations does not significantly affect the BLEU score for this translation direction. For our data, MWE identification before alignment with the dictionary, followed by transformation into single lexical units, performs better than MWE identification using the extractor tool (by 1 point). However, these results should be validated with larger corpora for the two methods. In addition, to evaluate more precisely the impact of the dictionary on the automatic translation quality, the dictionary-based method should also be evaluated with other test corpora containing a significant number of Verb + Noun collocations.
The dictionary is quite small (250 bilingual entries) and has low coverage on the current test corpus (14 French collocations from the dictionary have 27 occurrences in the test corpus, while 15 Romanian collocations were found 34 times in the Romanian test corpus). Future experiments will be done with the dictionary completed by nominalisations of Verb + Noun collocations and by a set of about 230 Verb + Noun collocations, manually validated from parallel corpora of law texts (Navlea, 2014).
References

Avramidis, E., & Koehn, P. (2008). Enriching Morphologically Poor Languages for Statistical Machine Translation. In Proceedings of ACL-08: HLT (pp. 763–770). Columbus (USA). Stroudsburg (USA, PA): Association for Computational Linguistics.
Baldwin, T., & Kim, S. N. (2010). Multiword expressions. In N. Indurkhya, & F. J. Damerau (Eds.), Handbook of Natural Language Processing (2nd ed., pp. 267–292). Boca Raton (USA, FL): CRC Press, Taylor and Francis Group.
Bertoldi, N., Haddow, B., & Fouet, J.-B. (2009). Improved Minimum Error Rate Training in Moses. Prague Bulletin of Mathematical Linguistics (PBML), 91, 7–16.
Birch, A., Osborne, M., & Koehn, P. (2007). CCG Supertags in factored Statistical Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation (pp. 9–16). Prague (Czech Republic). Stroudsburg (USA, PA): Association for Computational Linguistics.
Bouamor, D., Semmar, N., & Zweigenbaum, P. (2012). Identifying bilingual multi-word expressions for statistical machine translation. In Proceedings of Eighth International Conference on Language Resources and Evaluation (pp. 674–679). Istanbul, Turkey: ELRA.
Cap, F., Fraser, A., Weller, M., & Cahill, A. (2014). How to Produce Unseen Teddy Bears: Improved Morphological Processing of Compounds in SMT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp. 579–587). Gothenburg, Sweden.
Ceauşu, A., & Tufiş, D. (2011). Addressing SMT Data Sparseness when Translating into Morphologically-Rich Languages. In B. Sharp, M. Zock, M. Carl, & A. Lykke Jakobsen (Eds.), Proceedings of the 8th international NLPCS workshop: Human-machine interaction in translation (pp. 57–68). Copenhagen Business School (Denmark). Copenhagen (Denmark): Samfundslitteratur.
de Gispert, A., Gupta, D., Popović, M., Lambert, P., Mariño, J., Federico, M., Ney, H., & Banchs, R. (2006). Improving Statistical Word Alignments with Morpho-syntactic Transformations. In Proceedings of 5th International Conference on Natural Language Processing, FinTAL’06 (pp. 368–379).
Deksne, D., Skadiņš, R., & Skadiņa, I. (2008). Dictionary of Multiword Expressions for Translation into Highly Inflected Languages. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, & D. Tapias (Eds.), Proceedings of the Sixth International Conference on Language Resources and Evaluation (pp. 1401–1405). Marrakech, Morocco: ELRA.
Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1), 61–74.
Erjavec, T. (2004). MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (pp. 1535–1538). Paris: ELRA.
Gledhill, C. (2007). La portée : seul dénominateur commun dans les constructions verbo-nominales. In P. Frath, J. Pauchard, & C. Gledhill (Eds.), Actes du 1er colloque, Res per nomen, pour une linguistique de la dénomination, de la référence et de l’usage (pp. 113–125), Université de Reims-Champagne-Ardenne.
Gledhill, C., & Todiraşcu, A. (2008). Collocations en contexte : extraction et analyse contrastive. In Texte et corpus, 3, Actes des Journées de la linguistique de Corpus 2007 (pp. 137–148).
Hausmann, F. J. (2004). Was sind eigentlich Kollokationen? In K. Steyer (Ed.), Wortverbindungen - mehr oder weniger fest (pp. 309–334). Institut für Deutsche Sprache Jahrbuch.
Ide, N., & Véronis, J. (1994). Multext (multilingual tools and corpora). In Proceedings of the 15th CoLing (pp. 90–96). Kyoto (Japan).
Ion, R. (2007). Metode de dezambiguizare semantică automată. Aplicaţii pentru limbile engleză şi română [Semantic disambiguation methods. Application for English and Romanian Languages]. Ph.D. thesis. Bucharest (Romania): Romanian Academy.
Koehn, P., & Hoang, H. (2007). Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 868–876). Prague (Czech Republic).
Koehn, P., Hoang, H., Birch, A., Callison-Burch, Ch., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, Ch., Zens, R., Dyer, Ch., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL 2007 Demo and Poster Sessions (pp. 177–180). Prague.
Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical Phrase-Based Translation. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 48–54). Edmonton (Canada). Stroudsburg (USA, PA): Association for Computational Linguistics.
Kordoni, V., & Simova, I. (2014). Multiword Expressions in Machine Translation. In Proceedings of the International Conference on Language Resources and Evaluation (pp. 1208–1211). Reykjavik, Iceland: ELRA.
Lambert, P., & Banchs, R. (2006). Grouping multi-word expressions according to Part-Of-Speech in statistical machine translation. In Proceedings of the EACL Workshop on Multi-word expressions in a multilingual context (pp. 9–16). Trento, Italy.
Melamed, I. D. (1997). Automatic Discovery of Non-Compositional Compounds in Parallel Data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (pp. 97–108). Providence (USA, RI).
Melamed, I. D. (1998). Manual annotation of translational equivalence: The Blinker project. Cognitive Science Technical Report. University of Pennsylvania.
Navlea, M. (2014). La traduction automatique statistique factorisée : une application à la paire de langues français – roumain. Thèse de doctorat, Université de Strasbourg, Strasbourg.
Och, F. J., & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), 19–51. doi: 10.1162/089120103321337421
Okita, T., Guerra, A. M., Graham, Y., & Way, A. (2010). Multi-Word Expression-Sensitive Word Alignment. In Proceedings of the 4th International Workshop on Cross Lingual Information Access at COLING 2010 (pp. 26–34). Beijing.
Pal, S., Naskar, S. K., & Bandyopadhyay, S. (2013). MWE Alignment in Phrase Based Statistical Machine Translation. In K. Sima’an, M. L. Forcada, D. Grasmick, H. Depraetere, & A. Way (Eds.), Proceedings of the XIV Machine Translation Summit (pp. 61–68).
Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of 40th Annual meeting of the Association for Computational Linguistics (ACL) (pp. 311–318), Philadelphia (USA, PA). Stroudsburg (USA, PA): Association for Computational Linguistics.
Ramisch, C., Besacier, L., & Kobzar, A. (2013). How hard is it to automatically translate phrasal verbs from English to French? In J. Monti, R. Mitkov, G. Corpas Pastor, & V. Seretan (Eds.), Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technology (pp. 53–61), Nice (France).
Rapp, R., & Sharoff, S. (2014). Extracting Multiword Translations from Aligned Comparable Documents. In Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra) (pp. 87–95). Gothenburg, Sweden. doi: 10.3115/v1/W14-1016
Ren, Z., Lü, Y., Cao, J., Liu, Q., & Huang, Y. (2009). Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions. In Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP2009 (pp. 47–54).
Salton, G., Ross, R., & Kelleher, J. (2014). Evaluation of a Substitution Method for Idiom Transformation in Statistical Machine Translation. In Proceedings of the 10th Workshop on Multiword Expressions (MWE) (pp. 38–42), EACL 2014. Göteborg, Sweden: Association for Computational Linguistics.
Schottmüller, N., & Nivre, J. (2014). Issues in Translating Verb-Particle Constructions from German to English. In Proceedings of the 10th Workshop on Multiword Expressions (MWE) (pp. 124–131), EACL 2014. Göteborg, Sweden: Association for Computational Linguistics.
Steinberger, R., Eisele, A., Klocek, S., Pilos, S., & Schlüter, P. (2012). DGT-TM: A freely Available Translation Memory in 22 Languages. In Proceedings of the 8th international conference on Language Resources and Evaluation (LREC’2012) (pp. 454–459). Istanbul (Turkey): ELRA.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (2006). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006) (pp. 2142–2147). Genoa (Italy). Paris (France): ELRA.
Stolcke, A. (2002). SRILM – An Extensible Language Modeling Toolkit. In Proceedings of the International Conference Spoken Language Processing (pp. 901–904). Denver (USA, Colorado).
Tan, L., & Pal, S. (2014). Manawi: Using Multi-Word Expressions and Named Entities to Improve Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation (pp. 201–206). Baltimore, Maryland USA.
Tiedemann, J. (1999). Word alignment - step by step. In Proceedings of the 12th Nordic Conference on Computational Linguistics (pp. 216–227). University of Trondheim, Norway.
Todiraşcu, A., Gledhill, C., & Stefanescu, D. (2009). Extracting Collocations in Contexts. In Z. Vetulani, & H. Uszkoreit (Eds.), Responding to Information Society Challenges: New Advances in Human Language Technologies, LNAI 5603 (pp. 336–349). Berlin Heidelberg: Springer-Verlag.
Todiraşcu, A., Heid, U., Stefanescu, D., Tufiş, D., Gledhill, C., Weller, M., & Rousselot, F. (2008). Vers un dictionnaire de collocations multilingue. Cahiers de Linguistique, 33(1), 171–185.
Todiraşcu, A., Ion, R., Navlea, M., & Longo, L. (2011). French text preprocessing with TTL. In Proceedings of the Romanian Academy, Series A, 12(2), 151–158.
Todiraşcu, A., Pado, S., Krisch, J., Kisselew, M., & Heid, U. (2012). French and German Corpora for Audience-based Text Type Classification. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12) (pp. 1591–1597). Istanbul, Turkey: ELRA.
Tufiş, D., Ion, R., & Dumitrescu, Ș. (2013). Wikipedia as an SMT Training Corpus. In Proceedings of the International Conference on Recent Advances on Language Technology (RANLP 2013) (pp. 702–709). Hissar (Bulgaria).
Venkatapathy, S., & Joshi, A. (2006). Using Information about Multi-word Expressions for the Word-Alignment Task. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties (pp. 20–28). Sydney.
Wehrli, E., Seretan, V., Nerima, L., & Russo, L. (2009). Collocations in a Rule-Based MT System: A Case Study Evaluation of Their Translation Adequacy. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT) (pp. 128–135). Barcelona: EAMT.
Wu, H., Wang, H., & Zong, C. (2008). Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008) (pp. 993–1000). Manchester.
Part 2
Multiword units in multilingual NLP applications

Multiword expressions in multilingual information extraction
Gregor Thurmair
Linguatec
Multilingual Information Extraction requires significant Multiword Expressions (MWE) processing as many such items are multiwords. The lexical representation of MWEs supports large bilingual lexicons (for Persian, Pashto, Turkish, Arabic); multiwords are represented like single words, extended by two annotations: MWE head, and lemma plus part of speech for the MWE parts. In text analysis, MWEs are recognised as part of the parsing process, not as pre- or post-processing components. The analysis design extends the X-bar scheme by a level for multiword rules. In transfer, MWEs are translated as elementary nodes like single word lemmata, to present key concepts for relevance judgement in Information Extraction. Evaluation shows that 90% of the MWE patterns in the lexicon can be analysed with about 150 MWE-specific rules, and that more than 90% of text document tokens are covered by the proposed integrated single and multiword processing.

Keywords: Multiword expressions, Multilingual Indexing, Multilingual Information Extraction, Machine Translation, lexical analysis, Arabic, Persian, Pashto, morphological analyser, lexical representation
1. Introduction

Multiword expressions (MWEs) pose the problem that they are semantic units built from constituents with an internal morphosyntactic structure. Applications of Language Technology, like (crosslingual) Search, Information Extraction, or Machine Translation, deal with elements which are semantic units. So, it would be logical to treat multiwords in the same way as other semantic units (single words): why should the multiwords planetary gear set or Salt Lake City differ in translation contexts from single words like diacid or Chattanooga? It is a fundamental requirement that the specific challenges posed by
doi 10.1075/cilt.341.05thu © 2018 John Benjamins Publishing Company
MWEs be kept transparent in NLP applications, to provide an integrated way of processing. On the other hand, multiword expressions (MWEs) have an internal morphosyntactic structure. From the point of view of linguistic descriptions, MWEs belong to quite different classes (Sag et al., 2002, Hurskainen, 2008), which sometimes require different processing (Vincze et al., 2011). In the present context, the most relevant MWE types are compound terms, light verb constructions, split verbs, and named entities. They differ in the degrees of variance and in the level of compositionality. The challenge when constructing an application for multilingual information extraction consists in finding a uniform and transparent treatment for all of these kinds of lexical units.
2. Application context

The approach to MWE treatment presented here was developed in the context of multilingual indexing and information extraction: foreign language documents are processed to extract key terms and named entities. These elements are translated and presented as lists in the user’s native language, to help determine whether or not the document is relevant with regard to a given profile of interest. If it is, in-depth processing is triggered; if not, the document is filtered out. The task involves named entity recognition, key term identification (to provide a list of relevant terms) and key term translation (to provide them in the user’s native language). Most of the key elements (terms and names) are multiwords. The languages involved are both European and Middle Eastern (Persian, Turkish, Arabic, Pashto). Some properties of these languages underline the relevance of the MWE aspect.

–– Arabic-Persian script languages show much greater variation in the insertion of spaces into words than Latin script languages, e.g. for calligraphic reasons; this blurs the borderline between single words and multiwords, and makes a uniform treatment of single words and multiwords even more necessary.
–– Persian has a significant number of light verb constructions (Family, 2006), formed with verbs like کردن (kardan ‘do’) or شدن (šodan ‘get’). In translation, the whole light verb construction must be translated and not just the head verb, e.g. شتاب کردن (štāb kardan) = ‘make hurry’ = ‘hurry up’; شروع شدن (šoru’ šodan) = ‘get a start’ = ‘begin’.
–– Term formation follows principles similar to those in European languages. Terms have evolved from compositional constructions, so they follow linguistic patterns just like in European languages (noun-noun, noun-adjective, etc., e.g. دفتر اطلاعات (daftar ettela’āt ‘information office’)), and can be analysed accordingly (cf. Khozani & Bayat (2011) for Persian, and Attia et al. (2010) for Arabic). As in other languages, there is a large borderline area between compositional and non-compositional readings.

A key factor in the application is comprehensive lexicon coverage. As the lexicon consists of both single word and multiword units, an integrated approach to lexical analysis is required to provide the best possible coverage for the analysis of incoming documents.
3. MWEs in multilingual information processing

The current research in MWE focuses on three areas:

–– MWE extraction tries to find hitherto unknown lexical units in corpus texts, often as part of term extraction.
–– MWE representation deals with the lexical representation of MWEs, and with aspects of their internal structure.
–– MWE analysis deals with identifying MWEs in texts during parsing.

The link between these aspects is obvious: the MWE lexical representation must match the requirements of MWE analysis, translation and generation, and ideally contains only information which can be provided by MWE extraction from corpora.

3.1 MWE extraction

There is significant literature on MWE extraction, and even toolkits (for Python, see De Araujo et al. (2011), and for Java, Kulkarni & Finlayson (2011)) and Web Service offers (Quocchi et al., 2012) are available. The standard procedure is to identify candidates which show certain features, and apply filters to them to increase precision. Such features can be statistical (Tu & Roth, 2011), contextual (Fahramand & Martins, 2014), or linguistic (e.g. patterns of part-of-speech sequences). Additional features are also proposed; Martens & Vandeghinste (2010), for example, use dependency treebanks. Combinations of features are also tried (Dubremetz & Nivre, 2014; Tu & Roth, 2011). A challenge for MWE extraction research lies in the definition of their output, and, as a consequence, in the evaluation of such approaches.
–– If MWEs are used for term extraction, the challenge is that there is no clear consensus on what should be considered as a term, and it seems that the definition is dependent on the corpus used (Bonin et al., 2010), the domain, and the interests of the users.
–– If MWEs are extracted to add new entries to a lexicon, the distinction between collocations and semantic units will have to be drawn. However, this definition contains a certain degree of vagueness in itself.

In both cases it seems to be difficult to find reliable criteria to build gold standards for evaluation. In the application presented here, MWE extraction from corpus data was not in the focus, because large bilingual lexicons created by human translators were already available. Due to the fact that incoming documents which passed the relevance filter were later translated, lexicons built by translators and terminologists were ready to be used. These lexicons are bilingual, with German as the target language; they contain both general vocabulary and specialised terminology; and their entries are collected both from specialised dictionaries and from corpus data. They also contain proper names (such as place names). It is important to note that 30–50% of the entries are MWEs, which underlines the importance of MWE treatment.1 Table 1 gives the size and composition of the lexicons. In using ‘real-world’ lexicons it could be observed that the linguistic MWE patterns show a much greater variability than the ones used in MWE extraction research. More than 700 patterns were identified in some of the lexicons (cf. Table 3). The current pattern selection approach is also questioned in Nissim et al. (2014).

3.2 MWE lexical representation

3.2.1 Design

Different solutions have been put forward on how to represent MWEs in lexicons. The simplest approach is to describe MWEs as a sequence of single words (Moreno Ortiz et al., 2013; Arun & Keller, 2005). However, this leaves the question of how
[1] In many projects, multiword entries are not treated at all, cf. the Shiraz project (Amtrup et al., 2000), or Sagot and Walther (2010).
Multiword expressions in multilingual information extraction
Table 1. Dictionary size and composition

                              Turkish > German   Persian > German   Pashto > German   Arabic > German
Size
  No. entries                 89,200             82,300             60,800            70,000
  No. source lemmata          62,000             59,100             42,500            52,000
  No. translations per lemma  1.43               1.39               1.43              1.34
Composition
  General words               58,000             33,700             27,800            47,000
  Domain specific terms       14,000             38,600             23,900            23,000
  Proper names                14,000             9,900              9,100             320
Entry type
  Single words                58,500             34,300             37,500            39,400
  Multiwords                  30,700             38,700             22,500            30,600
  Percentage of multiwords    34.5%              47.0%              37.0%             43.7%
the annotations for the MWE as a whole can be stored.[2] Villavicencio et al. (2004) propose to extend the single word descriptions with additional tables, describing the MWE elements and their relationships. This proposal has influenced the representation outlined in the Lexical Markup Framework (LMF) (Francopoulo et al., 2006). Many proposals try to describe MWEs by a tree structure, be it phrase-structure trees (Grégoire, 2007, 2009) or dependency trees (Bejček et al., 2013). Lee (2011) relates lexicon representations of light verb constructions to the qualia structure of a lexicon entry; Fotopoulou et al. (2014) propose a hierarchy of the SIGNIFIER class for MWEs. Graliński et al. (2010) compare two approaches to coding MWEs in highly inflecting languages. A review of existing standards (Term Base Exchange Format (TBX), Lexical Markup Framework (LMF)) and a new proposal are given in Escartín et al. (2013).

Most of these contributions focus on MWEs as monolingual items; only a few proposals also include the multilingual aspects. For instance, Deksne et al. (2008) define lexical representations containing mapping operations between source and target tree representations of MWEs; Escartín et al. (2013) incorporate translation information into the lexicon entries.

[2] For instance, the position of the head of an MWE cannot always be predicted from its POS sequence; cf. the switch operation proposed by Constant & Tellier (2012) to overcome this problem.
Table 2. Mix of single words and multiwords in translation

Source             Transliteration     Translation     Type
احتیاط کردن        (ehtijāt kardan)    be careful      multi → multi
اصول حسابداری      (osul hesābdāri)    accounting      multi → single
امشب               (emšab)             today evening   single → multi
طناب               (tanāb)             rope            single → single
The following conclusions seem to result from this discussion:
1. MWE representation is not a translation issue but a monolingual issue. Firstly, in translation, MWEs interchange with single words quite freely, as seen in Table 2. This shows that transfer is best described as the replacement of elementary units, whether or not these units are MWEs (cf. Anastasiou, 2010). In whichever way the units are built, transfer just replaces source units with target units. Secondly, there are monolingual applications which also need to identify MWEs, like information retrieval (Acosta et al., 2011) or sentiment analysis (Moreno-Ortiz et al., 2013), where MWEs are involved but no translation is carried out. These two facts indicate that MWE representation is a monolingual task, not a transfer task.
2. MWEs have properties which are idiosyncratic. They will therefore need a lexical description as a whole unit, independent of their parts; part of speech, number, verb frames, semantic descriptions, domain, etc. will have to be represented for the MWE as a whole. For instance, take cannot have a that-clause complement, but take into account can. However, all such annotations are also needed for single word descriptions. This means that single words and multiwords can share many lexical descriptions as they are both lexical units; in turn, no special representation of the kind proposed by, for example, Fotopoulou et al. (2014) is required for MWEs.
3. However, in addition to the feature annotations they share with single words, MWEs need to provide information about their constituents; this information must be sufficient to analyse and generate the complete MWE. Beyond the pure sequence of lemmata, some linguistic description must be given for the MWE parts, such as part of speech, optionality (Villavicencio et al., 2004), inflection vs. invariance,[3] and others. It is still debated whether the lexical description of MWEs needs structural information like tree representations.
Many proposals exist, such as those of Grégoire (2009), Bejček et al. (2013), and Francopoulo et al. (2006); Francopoulo et al. (2006) foresee a special table for the syntactic relations between the parts of the MWE tree. However, if MWEs are considered as (semantic) units, internal information about their parts is only needed to analyse and generate them; once this is done the whole MWE structures can in principle be replaced by single nodes (cf. Hurskainen, 2008; Anastasiou, 2010). It then depends on the analysis and generation components how much internal information on the MWEs is required; the smarter the components are, the less information needs to be stored in the lexicon. The current proposal does not store trees but just their terminal elements, i.e. sequences of part-of-speech information. From them, plus the part of speech of the head, the MWEs can be analysed as well as generated.
4. The result of these considerations is that MWE entries in the lexicon should look like other lexical units, and in addition have information on their members. This design follows the LMF proposal, but is adapted to practical requirements, for which LMF is less suited (Escartín et al., 2013).

[3] In the current system, invariant MWE parts have a special part-of-speech annotation, ‘MultiwordPart’, and a lemma which is identical to the text form.

3.2.2 The lexicon

In the application described here, the lexicons mentioned in Table 1 needed to be linguistically annotated to make them usable for automatic processing. For large-scale use, the question was, “what is the minimum amount of annotation required to analyse a multiword in a text?” In general, all lexical entries in the lexicons are represented as typed feature structures. Feature types specify the scope of a feature (analysis, transfer, generation). In addition, all entries share some basic annotations.

3.2.2.1 General annotations. Such annotations refer to the definition of a lexicon entry, and to information needed for processing.
1. The definition of a lexical entry holds for every entry, whether single word or multiword: entries are defined by a lemma, a part of speech, a semantic description, and an entry type. Lemmata give the canonical spelling of a concept. Part of speech information is based on a ‘universal’ basic tagset of 12 categories (like noun, adjective, pronoun, cf. Table 3), extended by subcategories. The semantic description is implemented as a reading number; it reflects polysemic lemmata, often coinciding with different domain tags; in Arabic script languages it is also used for entries which are vowelised in different ways.[4] Entry types separate single words from multiwords.

[4] E.g. in Persian: پلک = pelk ‘eyelid’ or palak ‘socket’.
2. In addition to their definitory features, lexicon entries are annotated with linguistic descriptions required for processing. On the morphological level, the most important features are:
–– Allomorphs with morphological information: Words and affixes can have several allomorphs, to cover broken plurals or other irregular forms. Every lemma has a list of allomorphs.
–– Inflection class: For Turkish, only vowel harmony information is given. For Persian, about 11 noun and two verb classes are assumed;[5] for Pashto, a classification of nine main noun classes, four main adjective classes, and two main verb classes is used.[6]
Beyond morphology, only a few annotations are coded in the lexicons. As the system does not do full parsing, only syntactic function (subcategorisation) is coded; syntactic relations are not coded systematically. Some semantic features are coded, mainly for named entities. Dialect is coded to distinguish between Dari and Farsi origin of words. Subject area is coded to assign words and translations to specific domains; the classification consists of about 15 rather general domains. Each entry also has information on its source, and possibly a comment. Some of these annotations are used for human lookup (e.g. subject area, dialect, source), less for machine treatment.
Idiomatic features of the MWEs (like plurale tantum, or specific transitivity types) would be coded in this section, and be treated just like features of single words. No additional machinery is required for MWEs here.

[5] Following Newid & Mumm (2007).
[6] Basically following Lorenz (1962).

3.2.2.2 MWE extensions. Multiword expressions share the above-mentioned annotations with the other lexicon entries. The features which define the entry are as follows: the value of the lemma feature is the lemma of the multiword (e.g. take into account);[7] the value of the part of speech is the part of speech of the multiword (take into account → Vb).[8] The allomorph feature gives the list of the allomorphs of the head (take, took, taken). It is important to note that the allomorph differs from the lemma; it only refers to a part of it (namely the head). Other lexical features of MWEs (like gender) can be inherited from their heads in many cases; however, there will be differences in the syntactic-semantic description (syntactic frames, semantic features, and domain tags), reflecting the non-compositional content of the MWE.

[7] English examples are used for easier readability. Examples in Persian, Turkish, etc. would be possible as well.
[8] Usually the POS of the MWE is the POS of the head. For terms this would be Noun, and for split verbs and light verbs it would be Verb (from a parsing point of view, it is better to use the verb as head in light verb constructions, to avoid verbless sentences). In practice this part of speech can deviate from the POS of the head, contrary to linguistic theory, e.g. where PPs are used as adverbials (in addition → Adverb).

In addition to these basic annotations, MWEs have special features. The basic distinction between single words and multiwords is expressed by a feature called ‘entry type’, values being ‘single word’, ‘compound’,[9] ‘multiword-2’, ‘multiword-3’, etc.[10] Every entry in the lexicon has an ‘entry type’ feature; it is part of the definition of a lexical entry. In addition, MWE entries have the following feature decoration:
–– Head position. This is simply an integer pointing to the head position. For an MWE like take into account, this would be 1.[11] The head of the MWE must be known, as it inflects.
–– Sequence of parts of speech of the multiword: take into account would have the sequence Vb,Ap,No; Salt Lake City would be coded as No,No,No.
–– Sequence of the lemmata forming the multiword: take into account would have the sequence take,into,account.
Additional MWE-specific information is not foreseen in the system: each multiword part is described by its lemma and part of speech; the semantics of each part, and also their internal syntactic structure, is irrelevant, as it is only a multiword part and has no semantics of its own. Therefore a simple list of the (terminal) MWE elements has turned out to be sufficient for analysis, and no tree structure needs to be stored.
It should be mentioned that the multiword-specific additional features for MWE entries can be automatically produced by tools extracting MWEs from corpora: Thurmair & Aleksić (2012) use POS patterns as a feature for term extraction; these patterns also contain information about the MWE head.
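The entry layout described in this section can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: the class name `LexEntry`, its field names, and the helper `mwe_entry_from_candidate` are hypothetical and do not reproduce the typed feature structures of the actual system. The sketch shows that the three MWE-specific features (head position, POS sequence, lemma sequence) and the entry type can be derived mechanically from a POS-tagged candidate whose head is known.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass
class LexEntry:
    """Illustrative lexicon entry; field names are assumptions, not the system's schema."""
    lemma: str                              # canonical form, e.g. "take into account"
    pos: str                                # part of speech of the whole unit, e.g. "Vb"
    reading: int = 1                        # reading number (semantic description)
    entry_type: str = "single word"         # "single word", "multiword-2", "multiword-3", ...
    allomorphs: List[str] = field(default_factory=list)  # allomorphs of the head
    # MWE-only extensions; left empty for single words
    head_position: Optional[int] = None     # 1-based index of the inflecting head
    pos_sequence: List[str] = field(default_factory=list)
    lemma_sequence: List[str] = field(default_factory=list)


def mwe_entry_from_candidate(
    parts: Sequence[Tuple[str, str]],
    head_position: int,
    head_allomorphs: Sequence[str] = (),
) -> LexEntry:
    """Derive the MWE-specific features from a tagged candidate.

    parts: (lemma, POS) pairs, as a term extractor might deliver them.
    head_position: 1-based position of the head within the candidate.
    """
    if not 1 <= head_position <= len(parts):
        raise ValueError("head position lies outside the candidate")
    lemmata = [lemma for lemma, _ in parts]
    tags = [pos for _, pos in parts]
    return LexEntry(
        lemma=" ".join(lemmata),
        pos=tags[head_position - 1],        # default assumption: POS of the head
        entry_type=f"multiword-{len(parts)}",
        allomorphs=list(head_allomorphs),
        head_position=head_position,
        pos_sequence=tags,
        lemma_sequence=lemmata,
    )


entry = mwe_entry_from_candidate(
    [("take", "Vb"), ("into", "Ap"), ("account", "No")],
    head_position=1,
    head_allomorphs=("take", "took", "taken"),
)
print(entry.entry_type, entry.pos_sequence, entry.lemma_sequence)
```

In this sketch the part of speech of the unit simply defaults to that of the head; an entry such as in addition (Adverb) would need the value to be overridden by hand.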
As a result, only minimal additional coding effort is required for multiword expressions once they are extracted. This is important for large-scale applications.

[9] This is for agglutinating compounds such as in German. They get feature annotations like MWEs, to have access to the compound parts.
[10] Giving the number of its parts. There are multiwords of more than ten parts in the lexicon, mostly paraphrases if a term is not translatable; however, they almost never occur in corpus texts and would not need to be identified in analysis.
[11] Position was preferred to lemma as a means of identification, as lemmata can be duplicated, e.g. in many Turkish MWEs (Oflazer et al., 2004).

3.3 MWE analysis and identification

Once the lexicons are coded, the next challenge is to identify lexicon entries in texts. Recent approaches try to improve (lexicalised) parsing by including MWEs (Arun & Keller, 2005; Green et al., 2011), or combine POS tagging with MWE detection (Constant & Tellier, 2012). The approach taken in the system described here is to run a probabilistic parser (Charniak, 1997), organised as an active chart, using augmented phrase structure rules;[12] whether or not lexicalised grammars show better performance in the analysed languages remains to be seen (Arun & Keller, 2005). The current analyser uses grammar rules with a phrase structure head and a feature section where features can be tested, set, unified, etc.

3.3.1 Design

There are several suggestions on how to integrate MWEs in an analysis component: MWE treatment has been proposed before, after, and during analysis.

3.3.1.1 MWE processing after analysis. Early attempts in MT systems like LMT (McCord, 1989) or METAL (Thurmair, 1990) only analysed single words, and used the transfer component for MWE treatment: the transfer lexicon contained entries with tests on the MWE parts, like ‘plant → Kernkraftwerk’ IF modified by an adjective ‘nuclear’ and a noun ‘power’. Later proposals use the output of the analysis component, i.e. the analysis tree, try to find a match between subtrees in source and target language, and offer subtree transformation operations in the transfer lexicon for the MWE descriptions (Deksne et al., 2008). This approach avoids weird transfer lexicon entries like ‘plant → Kernkraftwerk’ as in the approaches above and provides the MWEs as proper units in the transfer lexicon.
However, the weakness of the attempt to treat MWEs after analysis is that their idiosyncratic properties (like specific verb frames, semantic features) are not taken into account in analysis. This fact increases the probability of incorrect parses (e.g. wrong PP attachment), which in turn tends to produce failures in the following MWE analysis. In addition, the parse trees themselves must be expressive enough to permit MWE detection.[13] The consequence is that MWE processing after analysis is too late; the analysis component itself must be aware of MWEs, and use their specific annotations (Wehrli et al., 2010: 28).[14]

[12] Phrase structures seem to be more adequate for MWE representation as they can provide nonterminal nodes for the MWE as a whole; this is more difficult in dependency trees (cf. Candito & Constant, 2014).
[13] Indicated in Bejček et al. (2013), where it is stated that the kind of dependency trees used in their experiments seems to abstract away from some of the information items needed for MWE identification.
[14] They also favour an “approach in which collocations are identified in a sentence as soon as possible during the analysis of that sentence, rather than at the end of the analysis” (Wehrli et al., 2010: 28).

3.3.1.2 MWE processing before analysis. An alternative approach is to do MWE processing before the analysis. A special component looks over the input string and marks MWE candidates. Analysis then starts with such MWEs as single nodes in the input. Such approaches have been proposed by Samaridi & Markantonatou (2014), Oflazer et al. (2004), and Hurskainen (2008), where the ‘replace’ option of Constraint Grammar rules is used. The result of the MWE treatment is a ‘words-with-spaces’ constituent which serves as input for the parser. However, there are drawbacks to this approach:
1. The multiword filter in itself needs a kind of formalism to identify MWEs in an input text, for instance in cases of discontinuous constituents; as a result the input is parsed twice, be it with different formalisms or the same one.
2. In most cases, a found MWE reading is taken as the only input, and alternative (compositional) readings are removed. However, it is only in the analysis process itself that it can be decided whether an MWE candidate is really an MWE or not, so the removal is premature and can lead to failures.
The conclusion is that MWE treatment must be an integral part of the analysis, and not be treated as a pre- or post-analysis component (cf. also Wehrli, 2014).

3.3.2 MWE treatment in analysis

The system described here uses MWE treatment rules as an integral part of the analysis grammar. The analysis component consists of several steps:

3.3.2.1 Preprocessing. Incoming texts are deformatted, and then split into sentences. Next, instead of tokenisation, an allomorph-based segmentation component
is used; as already mentioned, tokenisation would have to address the fact that there is much more variability in spacing in the Arabic script: sometimes there is no space between a preposition and a proper noun in Pashto (e.g. دعراق d-Iraq ‘of-Iraq’), but occasionally there is a space between a noun and its inflection in Persian (as in اردک ها ordak hā ‘duck-pl’). For Turkish, as an agglutinative language, space-centred tokenisation is known to underperform (Çetinoğlu & Oflazer, 2006). Unlike the ‘space-centred’ tokeniser approaches,[15] a segmenter was written to identify lexical strings (allomorphs of stems and affixes), controlled by a state-transition type segmentation grammar. The result of the segmentation is a sequence of allomorphs.

3.3.2.2 Chart initialisation. The process of chart initialisation looks up each allomorph in the lexicon. In the case of MWEs, an allomorph instantiates all entries of which it is part, be they single words or multiwords. In the example above, the allomorph took would instantiate two entries, namely take and take into account, and so would into and account create single-word and multiword readings in the chart. This way, the chart initialisation produces both a compositional and a non-compositional option for such segments; it provides all lexical information available for single words and multiwords, and leaves the decision on what to take to the analysis component.
The lookup step instantiates many MWE candidates which are not helpful. For instance, the allomorph Saint in the sentence He came to Saint Helena would, in addition to Saint Helena, also instantiate other MWE lemmata of which Saint is part, such as Saint Petersburg, Saint Barthélemy, Saint Christopher-Nevis-Anguilla, Saint Lucia, Saint Kitts, and many others. To avoid spending parsing time on them, a lexicon erasure procedure follows the chart initialisation; it eliminates all MWE candidates which cannot show all their parts in the input allomorph sequence.

3.3.2.3 Analysis design. The analysis component is designed as an X-bar scheme, extended such that the MWE readings are integrated in the earliest possible steps. Both single words and multiwords can take modifiers and specifiers to form XP constituents; such rules should be transparent with respect to the entry type of their heads, and MWE treatment should be done beforehand.
[15] Proposed e.g. by Shamsfard et al. (2010) for Persian.
Figure 1. Extended X-bar scheme
In an X-bar scheme, this involves extending the X-bar by one level, and ordering parsing rules as follows:
–– On the X-0 level, word stems and affixes/inflections are composed into words; this would not be changed.
–– X-1 rules attach multiword parts; single words would just pass this level (X1 → X0).
–– X-2 rules attach modifiers, as usual, and
–– XP rules build complete constituents by adding specifiers (cf. Figure 1).
In such a setup, the idiosyncratic properties of MWE entries can be used in the higher levels of syntactic analysis (X2 and XP levels) just like those of single words. Compositional and non-compositional readings would be computed in parallel: in the case of take into account, the multiword reading would be built by rules on the X1 level, and the compositional reading by rules on the X2 level.[16]
Of course, this is just a general outline. There are many cases where these heuristics need to be refined, such as in cases of discontinuous split verb constructions. But most of the MWEs in terminology, named entity and light verb areas can be analysed this way.

[16] The rules might share their PSG structure but differ in the feature part.

3.3.2.4 Multiword analysis rules. Multiwords are processed on the X1 level. But before the multiword rules fire, basic morphology rules are applied (on the X0 level) which attach inflections to the lemmata, as in No → No N-Flex and Vb → Prefix Vb V-Flex. This is the same as in single word analysis; no special MWE treatment is needed here, and the morphological variants of the MWE are covered by the respective features (allomorph, inflection class).[17]
The grammar for MWE treatment on the X1 level is based on the POS-sequence features in the lexicon. For each POS sequence found in the lexicon there is a phrase structure rule, e.g.:

Vb1 → Vb Ap No    // take into account
No1 → No No No    // Salt Lake City
Such rules give the MWEs the chance to be considered as potential multiword constituents. Then the feature part of these rules, among others,[18] takes care of validating the multiword reading, by performing two tests:
–– The value of the lemma feature of all MW parts must unify: all nodes must have take into account or Salt Lake City respectively as value of the lemma.
–– The value of the entry type feature must match the number of constituents (mw3 in our case) to ensure completeness of the MWE hypothesis, and not return the lemma Salt Lake City Memorial Hospital for an mw3 MWE rule.
If these operations succeed, the parser builds a new non-terminal single node with an X1 level label, the MWE lemma, and features percolated from the MWE source. This node is used as a single/elementary unit on higher syntactic levels; the rest of the analysis does not need to take into account the fact that this node represents a multiword.[19] An example is given in Fig. 2.
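The path from chart initialisation through lexicon erasure to the two X1 validation tests can be sketched in a few lines of Python. This is a toy illustration under several assumptions, not the parser described here: the lexicon, the `FORMS` allomorph table, the rule table and all function names are hypothetical, only contiguous multiwords are handled, and the probabilistic ranking of readings is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entry:
    lemma: str                        # e.g. "take into account"
    pos: str                          # part of speech of the whole unit
    entry_type: str                   # "sw", "mw2", "mw3", ... (labels are illustrative)
    pos_seq: Tuple[str, ...] = ()     # POS of the parts
    lemma_seq: Tuple[str, ...] = ()   # lemmata of the parts


def sw(lemma: str, pos: str) -> Entry:
    # A single word is stored as a one-part sequence so that lookup is uniform.
    return Entry(lemma, pos, "sw", (pos,), (lemma,))


def mw(lemma: str, pos: str, pos_seq: List[str]) -> Entry:
    parts = tuple(lemma.split())
    return Entry(lemma, pos, f"mw{len(parts)}", tuple(pos_seq), parts)


# Hypothetical toy lexicon, allomorph table and X1 rule table.
LEXICON = [
    sw("he", "Pn"), sw("take", "Vb"), sw("into", "Ap"), sw("account", "No"),
    mw("take into account", "Vb", ["Vb", "Ap", "No"]),
    mw("take into custody", "Vb", ["Vb", "Ap", "No"]),
]
FORMS = {"took": "take"}              # allomorph -> lemma; other forms map to themselves
X1_RULES = {("Vb", "Ap", "No"): "Vb1", ("No", "No", "No"): "No1"}


def init_chart(allomorphs):
    """Each allomorph instantiates every entry (single or multiword) it is part of."""
    lemmas = [FORMS.get(a, a) for a in allomorphs]
    chart = [[e for e in LEXICON if lem in e.lemma_seq] for lem in lemmas]
    return lemmas, chart


def erase(lemmas, chart):
    """Drop candidates that cannot show all their parts in the input sequence."""
    present = set(lemmas)
    return [[e for e in cell if set(e.lemma_seq) <= present] for cell in chart]


def apply_x1_rules(lemmas, chart):
    """Return (start, end, label, lemma) for every validated multiword reading."""
    nodes = []
    for rhs, label in X1_RULES.items():
        n = len(rhs)
        for i in range(len(lemmas) - n + 1):
            window = chart[i:i + n]
            # Test 1: the lemma feature must unify, i.e. one and the same
            # entry must be available as a reading on every part.
            shared = set(window[0]).intersection(*map(set, window[1:]))
            for e in shared:
                # Test 2: the entry type must match the number of constituents.
                if e.entry_type != f"mw{n}" or e.pos_seq != rhs:
                    continue
                if e.lemma_seq == tuple(lemmas[i:i + n]):
                    nodes.append((i, i + n, label, e.lemma))
    return nodes


lemmas, chart = init_chart(["he", "took", "into", "account"])
chart = erase(lemmas, chart)
print(apply_x1_rules(lemmas, chart))
```

The single-word readings of took, into and account stay in the chart next to the new X1 node, so a later step is still free to choose between the compositional and the multiword reading.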
[17] Derivational phenomena on the X0 level were a major concern in Turkish morphology, in which the agglutination of very long and complex words can be observed. Intermediate nonterminals are used, similar to Çetinoğlu & Oflazer (2006).
[18] For instance, checking the casing of the proper noun.
[19] In a transformational setup, the whole tree underneath the X1 node could be deleted as soon as all relevant features are percolated to the new multiword node. In dependency representations (Candito & Constant, 2014), a similar solution could be tried; the internal structure of the MWE is irrelevant for higher-level analysis.
Figure 2. Analysis of the plural form of بودجه های انعطافی budjeh hāje en’etāfi ‘flexible budget’: plural added on the X0 level, Ezafe[20] and Adjective by an X1 rule (tree shown in right → left mode). Lemma originated in the NoC node, with the number in the PtFN node, and percolated up to the No1 node.
The number of these X1 MWE rules is between 100 and 200 per grammar, depending on the languages and the intended coverage. They are defined by the MWE patterns in the lexicon. As these lexicons were made by humans, many more patterns can be observed there, leading to lexicon entries which can never be analysed.[21] A trade-off needed to be found.
For non-adjacent MWEs like split verb constructions, the same strategy of MWE processing is used; however, these rules interact with rules on the X1 and X2 levels.
[20] Ezafe is a grammatical particle which links an attribute to its head noun in Persian languages.
[21] This is just a practical constraint. In principle, as many rules as patterns can be written.
3.3.2.5 Analysis output. As the system does not intend to do full parsing, the output structure is a flat collection of XP constituents under an S node, created by recursive rules like S → S XP. The XP nodes can contain named entities, terms (built from single words or multiwords), or irrelevant material. As the parser builds several such structures, the one with the best probability is taken, as in Wehrli et al. (2010). MWE readings are preferred in analysis.

3.4 MWE translation and generation

3.4.1 Transfer

The parser produces shallow syntactic structures, their leaves being lexical nodes. These nodes are looked up in the transfer dictionary. In transfer, source language semantic units are transferred into target language semantic units. Whether such units are single words or multiwords is not relevant. Therefore, all types of transfer between single words and multiwords (cf. Table 2) are represented in the lexicons in a uniform way. A simple lookup of lexical entries for their translation into target lexical entries is sufficient; no additional tests and actions for MWEs need to be coded: راضی کردن (rāzi kardan) = ‘zufrieden stellen’. This proposal matches human intuition, and is also put forward by Anastasiou (2010).
Of course, as in the case of single words, there are selection mechanisms for multiple translations (1:n translations). They prefer translations belonging to a given subject area, then translations which are marked as preferred in the lexicon. But no additional provisions for MWE transfer need to be made; transfer is just the replacement of a source node by a target node.

3.4.2 Generation

The definition of MWE annotations in the lexicon holds for both the source and target language entries.
Again, there is information for both single words and multiwords, like part of speech, entry type, inflection class, gender, subcategorisation etc.; in the case of multiwords, features for head position, part-of-speech sequence, and lemma sequence are added. If the Target Language (TL) lexical unit is a multiword, the system would expand the unit into a flat tree under the TL-MWE node, based on the head and the POS-sequence feature of the TL entry. The main transfer features (number, case) would be percolated down to the head, where proper inflection would be triggered. Additional operations would have to be foreseen for some elements, e.g. to inflect adjectives in agreement with the head, or movement operations for split verb constructions. Again, MWE treatment in generation is a monolingual operation; examples of MWE generation are given e.g. in Anastasiou (2010).[22]
In the present application, there was no need for a full translation as only keyword translation was required, so the TL lemma form of the entries was sufficient to display in the output list, without further generation effort.

4. Evaluation

The overall performance of the system depends on several factors, some of which were evaluated. In the MWE context, the most important issues are:

4.1 MWE coverage: rule vs. lexicon compatibility

The MWEs in the lexicons comprise many part-of-speech patterns, because they were created by humans (automatic term extraction would have a much lower variance). Because a special analysis rule would have to be created for each POS sequence, a trade-off had to be reached between lexicon coverage and grammar size. The data are given in Table 3 (for the Persian lexicon).

Table 3. MWE coverage: MWE patterns for two- and three-part elements in Persian

POS of the MWE   No. POS patterns   No. entries showing   No. POS patterns   No. entries showing   Coverage
                 in the lexicon     these patterns        covered by rules   these patterns
Adjectives       135                1,905                 19                 1,538                 80.7%
Adpositions      19                 78                    7                  66                    85%
Adverbs          78                 355                   9                  201                   57%
Conjunctions     42                 125                   18                 96                    77%
Numerals         9                  68                    5                  64                    94%
Common nouns     204                8,587                 32                 7,956                 92.7%
Proper nouns     117                1,802                 25                 1,561                 86.6%
Phrases          9                  12                    3                  6                     50%
Pronouns         25                 62                    14                 51                    82%
Particles        7                  8                     1                  2                     25%
Verbs            59                 5,040                 15                 4,913                 97.5%
Total            704                18,042                148                16,454                91.1%
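The trade-off between grammar size and lexicon coverage can be made concrete with a short Python sketch. The first part recomputes the coverage column from the counts in Table 3. The second part shows one plausible way of choosing which patterns receive a rule, namely taking the most frequent patterns first until a target share of entries is covered; the chapter does not state how the selection was actually made, and the per-pattern counts in `demo` are invented purely for illustration.

```python
# Per POS, from Table 3: (patterns in lexicon, entries, patterns covered by rules, entries covered)
TABLE3 = {
    "Adjectives":   (135, 1905, 19, 1538),
    "Adpositions":  (19, 78, 7, 66),
    "Adverbs":      (78, 355, 9, 201),
    "Conjunctions": (42, 125, 18, 96),
    "Numerals":     (9, 68, 5, 64),
    "Common nouns": (204, 8587, 32, 7956),
    "Proper nouns": (117, 1802, 25, 1561),
    "Phrases":      (9, 12, 3, 6),
    "Pronouns":     (25, 62, 14, 51),
    "Particles":    (7, 8, 1, 2),
    "Verbs":        (59, 5040, 15, 4913),
}


def pct(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


for pos, (n_patterns, n_entries, n_rules, n_covered) in TABLE3.items():
    print(f"{pos:<13} {pct(n_covered, n_entries):5.1f}%")

# Column totals: patterns, entries, patterns with a rule, entries covered
totals = [sum(column) for column in zip(*TABLE3.values())]
print("Total", totals, f"{pct(totals[3], totals[1]):.1f}%")


def select_patterns(pattern_counts, target):
    """Greedy choice: most frequent patterns first, until `target` (0..1) of the entries is covered."""
    total = sum(pattern_counts.values())
    chosen, covered = [], 0
    for pattern, count in sorted(pattern_counts.items(), key=lambda kv: -kv[1]):
        if covered >= target * total:
            break
        chosen.append(pattern)
        covered += count
    return chosen, covered / total


# Invented per-pattern entry counts, for illustration only.
demo = {"No No": 500, "Aj No": 300, "No Ap No": 120, "Vb No": 50, "No Cj No": 20, "Aj Aj No": 10}
chosen, share = select_patterns(demo, target=0.9)
print(chosen, share)
```

Recomputed from the row counts, the overall figure is 16,454 of 18,042 entries, which rounds to 91.2% and is in line with the 91.1% reported in the table.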
[22] She proposes a generation rule for every MWE, whereas here, only a rule for every POS pattern would be needed.
It can be seen that about 700 multiword structures were used by human lexicon coders. Furthermore, 21% of the MWE patterns in the lexicon cover more than 90% of its MWE entries.[23] So it seemed acceptable to write analysis rules only for these cases: with a rule set of about 150 rules, 91.1% of all multiword entries could be covered.[24] Pashto shows similar behaviour. The Turkish lexicon had much less variance in multiwords, and therefore fewer analysis rules.

4.2 Lexicon coverage

The important question for the application as a whole is the lexicon coverage, i.e. to what extent incoming texts can be analysed using the components provided. To determine this, test corpora were created from different sources, mainly from news texts. Table 4 shows the results.

Table 4. Text coverage evaluation

                             Turkish > German   Persian > German   Pashto > German
No. documents selected       60                 350                177
No. tokens in test corpus    10,930             401,600            84,308
Non-translated tokens        385                37,783             8,327
Coverage tokens (%)          95.4               90.6               90.1
It can be seen that more than 90% of the input text tokens are covered by the dictionary and identified by the analyser in all languages; they can be analysed and translated. This is sufficient to decide on the relevance of a given input text.
It should be noted that not all of the tokens[25] require translations by themselves, like stand-alone inflectional affixes; due to the different segmentation in Arabic-Persian script, many tokens flagged as ‘no-translate’ do not in fact need a translation. To investigate this, some Persian documents were evaluated manually, to find the ‘real’ unknown words; the results are given in Table 5. Only 129 of the 4,849 evaluated words (2.66%) were not translated. In turn, the coverage of the Persian system would increase to about 97%.[26] This seems an acceptable result.
[23] Difficult POS, with large variance, are adverbs (Av) and phrases (Phr), as can be expected.
[24] Irrespective of the single words, which still hold the majority in the lexicons.
[25] Tokens here were counted on the basis of space segmentation.
[26] The value for Pashto would be lower, due to a higher variance and inconsistency in lexicon entries (spelling differences, etc.).
Table 5. Manual coverage evaluation for Persian

No. documents inspected        14
No. tokens inspected           4,849
No. words ‘really’ not found   129
Coverage                       97.34%
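The coverage figures in Tables 4 and 5 appear to follow the usual definition, i.e. the share of tokens that are not flagged as untranslated. A minimal Python check under that assumption (the function name `coverage_pct` is illustrative only):

```python
def coverage_pct(total_tokens, untranslated):
    """Coverage as the percentage of tokens that received an analysis and translation."""
    return 100.0 * (total_tokens - untranslated) / total_tokens


# (tokens in test corpus, non-translated tokens), from Table 4
table4 = {
    "Turkish": (10_930, 385),
    "Persian": (401_600, 37_783),
    "Pashto": (84_308, 8_327),
}
for language, (tokens, missing) in table4.items():
    print(f"{language:<8} {coverage_pct(tokens, missing):.1f}")

# Manual evaluation for Persian, from Table 5
print(f"Persian (manual) {coverage_pct(4_849, 129):.2f}")
```

Under this definition the Persian and Pashto values in Table 4 (90.6 and 90.1) and the manual Persian value in Table 5 (97.34) are reproduced. The Turkish counts give about 96.5 rather than the 95.4 shown in the table; the chapter does not say how that figure was derived.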
5. Conclusion

This contribution has shown that it is important to provide uniform treatment of lexical units in natural language applications. The MWE challenge is located on the monolingual side; the translation step simply replaces source language units with target language units, and is transparent to the entry type (single word or multiword).
In the monolingual lexicons, MWEs share all feature annotations with single words. The additional information needed for MWEs consists just of the specification of the head, and of the description of the MWE parts as sequences of part of speech and lemma; no more complex structures are required.
Analysis treats MWEs neither in pre-processing (the words-with-spaces approach), nor in a post-analysis step where the analysis output is interpreted, but as an integral part of the analysis itself. The approach is to extend the X-bar scheme by a level which applies the multiword rules; the decision to use a compositional or a non-compositional reading is left to the probability calculation of the parser.
Such an approach shows good coverage for the identification of multiword expressions in the lexicon, as well as for the identification and translation of key information items in documents; in both cases evaluation produces results above 90%.
Acknowledgements

The team involved in this research consisted of Mehr Newid (Univ. Kabul), Vera Aleksić (Linguatec), Thilo Will (Linguatec) and Fulya Adatürk (LMU Munich).
A multilingual gold standard for translation spotting of German compounds and their corresponding multiword units in English, French, Italian and Spanish

Simon Clematide, Stéphanie Lehner, Johannes Graën & Martin Volk
Institute of Computational Linguistics, University of Zurich
This article describes a new word alignment gold standard for German nominal compounds and their multiword translation equivalents in English, French, Italian, and Spanish. The gold standard contains alignments for each of the ten language pairs, resulting in a total of 8,229 bidirectional alignments. It covers 362 occurrences of 137 different German compounds randomly selected from the corpus of European Parliament plenary sessions, sampled according to the criteria of frequency and morphological complexity. The standard serves for the evaluation and optimisation of automatic word alignments in the context of spotting translations of German compounds. The study also shows that in this text genre, around 80% of German noun types are morphological compounds, indicating potential multiword units in their parallel equivalents.

Keywords: gold standard, word alignment, compounding, multilinguality, German, English, Spanish, Italian, French
1. Introduction

This article describes the creation of a multilingual gold standard dedicated to the alignment of German compounds and their parallel multiword units (MWUs) in English, French, Italian and Spanish. To the best of our knowledge, a resource dedicated to the parallel equivalents of German compounds in one Germanic and several Romance languages is a novelty. Most word alignment resources aim at a complete alignment of all tokens of parallel text segments. In contrast, our work concentrates on partial but high-precision word alignment, sometimes referred to as translation spotting (Tiedemann, 2011:60). This term was introduced by Véronis & Langlais (2000) and is often used in the context of bilingual lexicon extraction tasks.

doi 10.1075/cilt.341.06cle © 2018 John Benjamins Publishing Company
Morphological compounding is a productive and prominent means of word formation in German. According to Baroni, Matiasek, & Trost (2002), about 47% of the different word types in newswire texts are compounds. The fact that these types account for only 7% of all tokens indicates that many compounds occur rarely. Our study confirms these findings: about 80% of German noun types are morphological compounds according to our manually validated sample of about 4,500 types. For this reason, we take German compounds as a starting point to investigate the multilingual translation behaviour of MWUs in languages without morphological compounding. Even though some German compounds do not have a parallel MWU equivalent, in the sense of a semantically opaque construction, we most often encounter cases in which one orthographical unit in German is translated by several words in the other languages.

1.1 Related work

A broad range of approaches exists for exploiting bitexts in order to establish translation relations between words or MWUs (Tiedemann 2011:104). Additionally, several gold standards for bilingual word alignments have been created in the past, for instance for English-French (Och & Ney, 2003), English-Spanish (Lambert, De Gispert, Banchs, & Mariño, 2005), for several language pairs with scarce resources (Martin, Mihalcea, & Pedersen, 2005), or English-Swedish (Holmqvist & Ahrenberg, 2011). The latter also contains a compact discussion of bilingual alignment guidelines and evaluation techniques. Multilingual gold standards are less common; a gold standard comprising 100 sentences in English, French, Portuguese and Spanish has been created by Graça, Paulo Pardal, Coheur, & Caseiro (2008).

Other gold standards do not aim at complete alignment of all words in a sentence. For instance, the collection of German compounds and their Spanish equivalents from a parallel corpus (Parra Escartín & Héctor Martínez, 2014) can be regarded as a gold standard for translation spotting or for bilingual lexicon extraction. Another type of aligned resource is the parallel syntactic treebank, such as SMULTRON (Volk, Göhring, Marek, & Samuelsson, 2010), which consists of several hundred pairs of constituent trees (German, English, French, Swedish) with manually inserted word alignments and phrase alignments (so-called sub-sentential alignments). The English-Chinese treebank (Deng & Xue, 2014) takes an interesting approach to dealing with translation divergences of function words. A fully automated approach for building syntactic trees with sub-sentential alignments is described in Tinsley, Hearne, & Way (2009); they use automatic parses and a tree-to-tree aligner (Zhechev, 2010) to produce sparse and precision-oriented alignments in order to improve phrase-based machine translation.

The remainder of this article is organised as follows: the next section describes our data sources, the automatic linguistic annotations and the alignment process for sentences and words. Then we explain the selection and sampling criteria for the German compounds, followed by a detailed account of our annotation guidelines, the automatic pre-annotations and the annotation workflow. The section “Evaluation and Discussion” presents the performance of our baseline model measured against our gold standard, and further optimised alignment models. We focus on the effects of frequency, morphological complexity and lexicalisation. Finally, we describe the distribution of the gold standard, its potential uses and possible improvements.

2. Resources

2.1 Selection and preprocessing of the gold standard material

Multilingual Sentence Alignment of a Parallel Corpus. The data for our gold standard come from a preliminary version of the CoStEP¹ corpus (Graën, Batinic, & Volk, 2014). This corpus provides a large subset of the widely used Europarl Corpus (Koehn, 2005) extended with aligned speaker turns, i.e. the same particular speaker’s contribution to the plenary debates translated into several languages. The number of speaker turns varies between translations of the same original turn into different languages. In order to obtain a sentence alignment of our five languages, we first carried out a pairwise alignment for all ten language pairs using hunalign (Varga et al., 2005). Subsequently, we combined the pairwise alignments into a multilingual alignment graph, to which we applied cleaning rules with the objective of removing those alignments that were only supported by a minority of the language pairs.

Tagging, Mapping and Correction of Part-of-Speech (POS) Tags.
For tokenisation and POS tagging we relied on the TreeTagger (Schmid, 1994) and the language-specific parameter files that are available on its website.² The tagsets of these parameter files vary widely across languages. Since we wanted to compare the POS categories of aligned words across all five languages, we mapped the language-specific tags to the so-called Universal POS tagset (UPOS) as described in Petrov, Das, & McDonald (2012). During the manual alignment validation process, we manually corrected tagging errors of aligned words at the granularity of UPOS tags.

1. http://pub.cl.uzh.ch/purl/costep
2. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Automatic Word Alignment. We used the popular tool GIZA++ (Och & Ney, 2003) for word alignment. Unlike sentence alignment, word alignment is performed unidirectionally, so we had to run the alignment once in each direction for every language pair. Every token of the source language can be aligned to at most one token of the target language (it does not need to be aligned at all). Consequently, a compound that corresponds to an MWU in another language will not be aligned to all parts of that expression, but only to the ‘most probable’ part of it (according to the model that the word aligner has learned from the data). In the other direction, each part of an MWU may be aligned with the corresponding compound.

Selection and Sampling of German Compounds. In order to sample nominal compounds from the German corpus (53,964,639 tokens), we extracted all nouns via their automatically assigned POS tag NN (common noun). The resulting set of 283,733 different word forms³ was analysed by the broad-coverage morphological tool GERTWOL (Haapalainen & Majorin, 1994).⁴ GERTWOL computes a linear (that is, non-hierarchical) segmentation of the lemmas, marking strong compounding boundaries by the symbol #, weak boundaries by | and derivation boundaries by ∼. For instance, the analysis of the complex German compound Ver|brauch∼er#schutz#dienst#leist∼ung (‘consumer protection services’) results in 4 segments (only strong boundaries are taken into account for the segmentation). For 249,602 out of 283,733 noun types (88%), GERTWOL computed one or more compound boundaries. This clearly illustrates the prominence of compounds (and their potential parallel MWUs) in this text genre. However, these numbers might be slightly biased due to errors in the automatic preprocessing stage.
The TreeTagger sometimes misclassifies nouns, and GERTWOL cannot analyse all compounds and sometimes overanalyses them. In order to avoid a sampling bias due to GERTWOL, we decided to manually extract and validate a subset of 4,000 compounds. We randomly shuffled the full list of 283,733 noun types (most of them enriched with their GERTWOL lemmatisation) and then validated from the top of this list, including only true compounds in our subset. This meant excluding compounds that consisted entirely of foreign words (open-skies) or proper names (Knörr-Borràs). We also excluded alternatives separated by a slash (Lernender/Wissen/Lehrkraft, ‘learner/knowledge/teacher’) and problematic tokenisations such as AKP-EU-)Abkommen (‘AKP-EU) agreement’). However, hyphenated nominal compounds (Duty-Free-Sektor, ‘duty-free sector’) were included.
3. Types in the sense of distinct letter strings (Baayen, 2001:2).
4. See Parra Escartín (2014) for a comparative study of tools that perform only compound splitting.
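The segment counting described above for GERTWOL analyses can be illustrated with a minimal Python sketch. It assumes that the analysis has already been reduced to a lemma string carrying only the three boundary markers mentioned in the text; the function names are hypothetical and are not part of GERTWOL.

```python
def count_segments(analysis: str) -> int:
    """Count compound segments in a boundary-marked lemma string.

    Only strong compounding boundaries ('#') separate segments; weak
    boundaries ('|') and derivation boundaries ('∼' or '~') are ignored.
    """
    return analysis.count("#") + 1


def split_segments(analysis: str) -> list[str]:
    """Return the segments with weak and derivation markers stripped."""
    segments = analysis.split("#")
    return [s.replace("|", "").replace("∼", "").replace("~", "") for s in segments]


example = "Ver|brauch∼er#schutz#dienst#leist∼ung"
print(count_segments(example))   # 4
print(split_segments(example))   # ['Verbraucher', 'schutz', 'dienst', 'leistung']
```

Counting the marker rather than parsing a hierarchy mirrors the linear (non-hierarchical) segmentation described in the text.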
We had to examine 4,559 types in order to identify 4,000 valid compounds. Of these, 326 had no GERTWOL analysis, which in turn means that GERTWOL’s coverage for noun compounds is about 92%. We manually segmented these missed words. The most frequent word of our sample of 4,000 compound types is Wohlstand (‘prosperity’), which occurs 2,703 times in the corpus. The frequency distribution of the sample roughly follows Zipf’s law (LNRE distribution; Baayen, 2001:54): 57% of the compounds are hapax legomena, 14.5% occur twice, and 6.1% occur three times. Table 1 contains the manually validated number of morphological segments and the frequency in the corpus for the sample of 4,000 compounds.

For the selection of the 137 compound types of our gold standard, we considered the frequency class m and the morphological complexity (= number of segments). Basically, each combination is represented by ten different types in order to ensure enough material for more detailed statistical evaluations. For the hapax legomena, we decided to include more items to avoid an insufficient number of annotated tokens. Our sample contained only seven types with at least four morphological segments that occur at least eight times. The gold standard contains two types with five segments (EU-Trinkwasserrichtlinie, ‘EU Drinking Water Directive’; Einhüllen-Öltankschiffe, ‘single-hull oil tankers’) and one type with six (Wertpapierdienstleistungs-Richtlinie, ‘directive on investment services’).

Table 1. Statistics of the subset of 4,000 manually validated noun compounds and of the sampling of the Gold Standard (GS) in terms of compound types and annotated compound occurrences (tokens)

          Frequency class m
                 Subset (N = 4000)      GS types (N = 137)     GS tokens (N = 362)
Segments     m = 1     2–7     ≥ 8   m = 1     2–7     ≥ 8   m = 1     2–7     ≥ 8
2             1582     927     382      50      10      10      50      26      88
3              604     276      35      20      10      10      20      23      77
≥ 4             94      38      12      10      10       7      10      21      47
Total         2281    1241     478      80      30      27      80      70     212
For each compound, we randomly selected at most ten occurrences from the corpus. However, we admitted only items for which (a) parallel MWUs existed in all four languages, (b) no obvious translation errors could be found, and (c) GIZA++ word alignments were available between all language pairs.⁵

5. For practical reasons, the application of GIZA++ is normally restricted to sentences with at most 100 tokens.
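The selection of occurrences can be sketched as follows. This is only an illustration under simplifying assumptions: criteria (a) and (b) are human judgements, which the sketch represents as pre-recorded boolean flags, and the order of filtering and sampling is our assumption, since the text does not specify it. All names are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    sentence_id: int
    mwu_in_all_languages: bool   # criterion (a), judged manually
    translation_error: bool      # criterion (b), judged manually
    giza_for_all_pairs: bool     # criterion (c)


def is_admissible(occ: Occurrence) -> bool:
    """An occurrence is admitted only if it satisfies (a), (b) and (c)."""
    return (occ.mwu_in_all_languages
            and not occ.translation_error
            and occ.giza_for_all_pairs)


def sample_occurrences(occurrences, max_items=10, seed=None):
    """Randomly pick at most `max_items` admissible occurrences."""
    rng = random.Random(seed)
    admissible = [occ for occ in occurrences if is_admissible(occ)]
    if len(admissible) <= max_items:
        return admissible
    return rng.sample(admissible, max_items)


# Toy data: 25 occurrences, every fifth one lacks GIZA++ alignments.
occurrences = [Occurrence(i, True, False, i % 5 != 0) for i in range(25)]
sample = sample_occurrences(occurrences, seed=42)
print(len(sample))  # 10
```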
Description of the Morphological Categories and the Lexicalisation of German Compounds. For each of the 137 compounds in our gold standard, we determined the universal POS tags of each morphological segment. 109 compounds (80%) contained only nominal segments; the remaining 20% also contained verbs, adjectives, numerals or adverbs.

Does the alignment quality of lexicalised compounds differ from that of non-lexicalised compounds? In order to be able to answer this question with the help of our gold standard, we recorded lexicalisation as a binary feature, similar to Roth (2014, 48). A compound was regarded as lexicalised if it was listed in the online lexicon DUDEN.⁶ In total, only 17 compounds (12.4%) were listed: 13 of them had two morphological segments and four had three segments.

2.2 Annotation guidelines and annotation process

Every gold standard must be documented with objective, comprehensible guidelines (Koehn 2010:115). We used Och & Ney (2003) as well as the detailed guidelines of Lambert et al. (2005) and Graça et al. (2008) as a starting point. Since we do not aim at complete word alignment but at translation spotting, certain adjustments to these guidelines proved to be necessary.

Principle I. All nouns and proper nouns (NOUN), verbs (VERB), adjectives (ADJ) and adverbs (ADV) that are part of a translation equivalent of a German compound will be bidirectionally aligned with the compound as well as with each other. The compound Doppelfinanzierung serves as a suitable example of this principle since all four categories (NOUN, VERB, ADJ and ADV) appear, as Figure 1 shows. The French (FR) equivalent is double/ADJ financement/NOUN, in Italian (IT) it is duplice/ADJ finanziamento/NOUN, and in Spanish (ES) doble/ADJ financiación/NOUN. All adjectives and nouns (ADJ, NOUN) are bidirectionally aligned with the German compound as well as with their corresponding parts in all other languages. English (EN), on the other hand, realises the German (DE) compound with the verbal phrase funding is [not] given twice:⁷ here, not only funding/NOUN and twice/ADV are aligned, but also the auxiliary verb is and the past participle given, both of which are tagged as VERB in the UPOS tagset. The adverb not remains unaligned since it is not part of the compound in a semantic sense.

6. http://www.duden.de
7. Note that unaligned tokens (function words) appear in brackets.
[de]  […] eine Doppelfinanzierung/NOUN auszuschließen […]
[en]  […] that funding/NOUN is/VERB not given/VERB twice/ADV . […]
[fr]  […] un double/ADJ financement/NOUN , […]
[it]  […] un duplice/ADJ finanziamento/NOUN . […]
[es]  […] una doble/ADJ financiación/NOUN . […]

Figure 1. Gold standard example Doppelfinanzierung ‘double financing’: bidirectional alignments between nouns, adverbs, adjectives and verbs
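The part of Principle I that links the compound to its equivalents can be sketched in Python as shown below. The sketch assumes that the annotator has already delimited the translation equivalent (so that, for example, not is excluded on semantic grounds) and covers only the links between the compound and the content words; the alignments between corresponding words of different target languages require a human judgement and are not modelled. The names are hypothetical.

```python
# UPOS categories that Principle I admits for alignment.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def compound_links(compound, equivalent):
    """Return the unidirectional links for one compound and one equivalent.

    `equivalent` is a list of (token, upos) pairs. Each content word
    yields two links, one in each direction, i.e. one bidirectional
    alignment; tokens of any other category are skipped.
    """
    links = set()
    for token, upos in equivalent:
        if upos in CONTENT_TAGS:
            links.add((compound, token))
            links.add((token, compound))
    return links


en = [("funding", "NOUN"), ("is", "VERB"), ("given", "VERB"), ("twice", "ADV")]
fr = [("un", "DET"), ("double", "ADJ"), ("financement", "NOUN")]

print(len(compound_links("Doppelfinanzierung", en)))  # 8
print(len(compound_links("Doppelfinanzierung", fr)))  # 4
```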
Principle II. Function words (articles, pronouns, prepositions and particles) will not be aligned, in order to keep the alignment effort at a reasonable level. The only exception to this principle is a preposition (ADP) that is a semantic part of the translation equivalent, given that no other lexical category expresses its meaning. The Spanish equivalent órgano contra [la] corrupción of the compound Korruptionsbekämpfungseinheit (‘anti-corruption body’) is the only example in the gold standard where a preposition (contra) is aligned.

Principle III. According to Lambert et al. (2005, 275), as few words as possible and as many words as necessary that carry the same meaning should be aligned. If no correspondence between single words (1:1 alignment) exists, then the smallest possible word groups should be aligned. We regard German compounds as one unit (like a word group) that is bi-aligned with all corresponding parts of its translation equivalents. A 1:n alignment of a German compound thus means that all words aligned with it represent its meaning as a whole.

Principle IV. Word alignment gold standards often distinguish two different types of alignments, namely sure (s) and fuzzy (f, or probable) alignments (Holmqvist & Ahrenberg, 2011). By default, unmarked alignments are classified as sure alignments, or s-alignments. Two different types of fuzziness are distinguished in our context, namely morphosyntactic and semantic fuzziness. Morphosyntactic fuzziness can be detected automatically, since all 1:1 alignments between different UPOS tags are fuzzy by definition. Semantic fuzziness is a challenging yet rare phenomenon that is difficult to handle consistently. We did not annotate fuzziness in the current version; however, we would like to briefly address a few problematic cases.
Semantic overspecification occurs in the alignment of Orgelsaal (‘organ hall’), which has more specific correspondences in all four languages: EN organ recital hall, FR salle [pour] récitals [d’] orgue, IT sala concerti [per] organo and ES sala [de] conciertos [de] órgano. In contrast, we find semantic underspecification, for example, in the alignment of Finanzmarktpolitik (‘financial market policy’), which is abridged in all other languages: EN financial market, FR marché financier, IT mercato finanziario, and ES mercado financiero. In some rare cases, under- and overspecification occur for the same compound: Euribor-Interbankzinssatzes (‘Euribor interbank interest rate’) is only fully expressed in Spanish (tipo [de] interés interbancario Euribor). All other languages reduce ‘interest rate’ to a single word: EN Euribor interbank rate, FR Euribor [, le] taux interbancaire and IT tasso interbancario [trimestrale] Euribor. In such cases, principle III applies: 1:n alignments are made, for example between English rate and Spanish tipo [de] interés.

Preannotations, Alignment Software and Annotation Workflow. Producing consistent word alignments between five languages is a non-trivial task. Our annotation workflow was structured as follows: as a starting point, all word alignments computed by GIZA++ were used as pre-alignments. These pre-alignments were automatically filtered according to the following criteria: (1) all alignments directed to any other word but the German compound were removed, (2) all alignments which involved a function word were removed, (3) only alignment paths that were connected to the German compound either directly or via a single intermediate alignment were allowed (path length restriction).

After this process, we carried out a manual validation with the help of text-based software that we developed. First, all unidirectional alignments were either deleted or bidirectionalised using short commands in a textual interface. All bidirectional pre-alignments were then either deleted or validated. Finally, all missing bidirectional alignments were added. All alignments were then visualised as an SVG graph, which helped to quickly identify remaining errors. In our experience, this workflow enabled an efficient validation of thousands of alignments across five languages.

The annotation was carried out by a trained linguist and advanced student of English and computational linguistics. The annotator’s mother tongue is German; she is fluent in English and has a good command of French (B1 level on the CEFR scale). Her language skills in Spanish correspond to an A2 level. She has limited passive knowledge of Italian, yet made sure to check Italian translation equivalents thoroughly with the help of online dictionaries and, on occasion, a native speaker of Italian.
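The automatic filtering of the GIZA++ pre-alignments can be sketched in Python as follows. The sketch covers criteria (2) and (3) only; criterion (1) is left out because its implementation depends on how the direction of an alignment is represented, which is not specified here. Alignments are modelled as directed pairs of token identifiers, the set of function-word tags follows Principle II, and all names are hypothetical.

```python
# UPOS tags treated as function words (articles, pronouns,
# prepositions, particles), following Principle II.
FUNCTION_TAGS = {"DET", "PRON", "ADP", "PRT"}


def filter_prealignments(edges, upos, compound):
    """Filter directed alignments (source, target) around one compound.

    `upos` maps every token id to its UPOS tag.
    """
    # Criterion (2): drop every alignment that involves a function word.
    content = {(s, t) for s, t in edges
               if upos[s] not in FUNCTION_TAGS and upos[t] not in FUNCTION_TAGS}

    # Tokens linked directly to the compound, in either direction.
    direct = {s if t == compound else t
              for s, t in content if compound in (s, t)}

    # Criterion (3): keep direct links, plus links that reach the
    # compound through exactly one intermediate alignment.
    return {(s, t) for s, t in content
            if compound in (s, t) or s in direct or t in direct}


de = "de:Doppelfinanzierung"
upos = {de: "NOUN", "fr:un": "DET", "fr:double": "ADJ",
        "fr:financement": "NOUN", "it:duplice": "ADJ", "es:doble": "ADJ"}
edges = {
    (de, "fr:financement"), ("fr:financement", de),
    ("fr:un", de),                    # function word: removed by (2)
    ("it:duplice", de),
    ("fr:double", "it:duplice"),      # one intermediate step: kept
    ("es:doble", "fr:double"),        # two intermediate steps: removed by (3)
}
kept = filter_prealignments(edges, upos, de)
print(len(kept))  # 4
```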
3. Evaluation and discussion

3.1 Quality of universal part-of-speech (UPOS) tagging

We manually corrected all UPOS tags of the aligned MWU tokens and found 104 errors (3.23%). In Spanish, VERB tags were wrong most often (21); these tags had to be changed into either NOUN or ADJ tags. A typical example is cotejo [de] datos [del] ADN ‘DNA-matching’: the TreeTagger tagged cotejo as a VERB (cotejo, ‘I compare’), yet a NOUN tag would be correct here (el cotejo, ‘the comparison’). In all other translation languages, NOUN and ADJ tags rather than VERB tags had to be corrected most often, changing NOUN tags into ADJ tags and vice versa.

3.2 Aligned UPOS tags across languages

In total, 3,218 lexical content words were aligned with the German gold standard compounds. Across all languages, the compounds were most frequently aligned with nouns (2,552, about 80% of all MWU tokens), followed by adjectives (625, about 20%). Verbs and numerals were rare, with only 20 and 17 tokens respectively. Overall, only four instances of ADV (adverbs), ADP (prepositions) and X (symbols, foreign material) were aligned. The number of tokens per language is roughly the same, with English having the lowest (792) and French the highest number of tokens (814). In total, 5,005 bidirectional alignments between all MWU tokens exist, of which only 580 (11.6%) align tokens of different word classes; among these mixed alignments, NOUN-ADJ alignments were most common (513 alignments, 88.5%), followed by NOUN-VERB alignments (25, 4.3%).

3.3 Complexity of the compounds and aligned MWUs

Table 2 lists the average number of MWU tokens, grouped by the three morphological complexity classes which we distinguish: two morphological segments, three segments, four or more segments. As the table illustrates, the average number of tokens (Ø) of corresponding MWUs is always lower than the number of segments of the German compounds.
In the case of compounds with two segments – henceforth referred to as bi-compounds – the difference is small: the majority of equivalents does indeed consist of two tokens. Yet when it comes to compounds with four or more segments (4+-compounds), the average number of tokens is lower than three for all languages. All languages exhibit the same pattern: increasing morphological complexity consistently raises the average number of tokens and the standard deviation. Lexicalised units, ellipses and other linguistic factors reduce the number of
Simon Clematide, Stéphanie Lehner, Johannes Graën & Martin Volk
Table 2. Relationship between the complexity of German compounds and the average number of tokens (Ø, with standard deviation σ) of their equivalent MWUs
              EN            FR            IT            ES
Complexity    Ø     σ       Ø     σ       Ø     σ       Ø     σ
2             1.82  0.47    1.82  0.44    1.84  0.42    1.8   0.43
3             2.28  0.63    2.4   0.65    2.4   0.67    2.34  0.65
4+            2.95  0.85    2.95  0.88    2.84  0.92    2.98  0.92
corresponding MWU tokens. Energiesteuerrichtlinie (‘energy taxation directive’) is an example of a compound with four segments; it contains the strongly lexicalised sub-unit richt#linie which consists of two morphological segments, yet is realised as one token in all other languages (EN directive, FR directive, IT direttiva, ES directiva). This means that there is not necessarily a correspondence between the morphologically motivated segments of a compound and the semantics of the lexicalised units of that compound. An extreme example is the longest compound in our gold standard, Wert#papier#dienst#leistungs-#Richt#linie (‘investment services directive’): in one instance, it corresponds to a single-word acronym in all languages, for example ISD in English.
3.4 Evaluation of the quality of the automatic GIZA++ word alignment
A typical raw graph of all unidirectional and bidirectional word alignments which are directly or indirectly related to a German compound is shown in the left-hand part of Figure 2. It is evident that we cannot use the closure of all bilingual word alignments computed by GIZA++ – not even for a baseline model. We therefore imposed the following restrictions on the alignments of baseline model 1 (BM1): (1) only alignments between NOUN, ADJ, and ADV are allowed, (2) only paths that connect to the German compound either directly or via one other alignment are allowed (path length restriction). As can be seen in the right-hand part of Figure 2, these restrictions considerably improve the automatic alignment. In the context of translation spotting, where only partial word alignments are available, it makes sense to use the standard evaluation metrics precision (P), recall (R) and F-measure (F). Since we currently do not have fuzzy/probable alignments in the sense of Tiedemann (2011:22), we treat each unidirectional alignment as one item. Bidirectional alignments are represented by two unidirectional items.
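The two BM1 restrictions can be illustrated with a small sketch. It is not the authors' implementation: the function name `bm1_filter`, the token identifiers of the form `lang:word`, and the toy edges are hypothetical. It also assumes that the path length restriction means at most two alignment edges away from the compound, and that edge direction is ignored when searching for paths.

```python
from collections import defaultdict

# Restriction 1: only these UPOS tags may take part in an alignment.
ALLOWED_UPOS = {"NOUN", "ADJ", "ADV"}


def bm1_filter(edges, upos, compound, max_hops=2):
    """Keep directed alignment edges that satisfy both BM1-style restrictions.

    edges:    iterable of (source, target) token ids (directed alignments)
    upos:     dict mapping token id -> universal POS tag
    compound: token id of the German compound
    max_hops: assumed maximum path length (2 = direct or via one alignment)
    """
    # Restriction 1: both endpoints must carry an allowed tag.
    kept = {(s, t) for s, t in edges
            if upos[s] in ALLOWED_UPOS and upos[t] in ALLOWED_UPOS}

    # Restriction 2: breadth-first search from the compound,
    # treating the remaining edges as undirected (an assumption).
    neighbours = defaultdict(set)
    for s, t in kept:
        neighbours[s].add(t)
        neighbours[t].add(s)
    dist = {compound: 0}
    frontier = [compound]
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for node in frontier:
            for other in neighbours[node]:
                if other not in dist:
                    dist[other] = hop
                    next_frontier.append(other)
        frontier = next_frontier

    # An edge lies on a short enough path if its nearer endpoint is
    # fewer than max_hops edges away from the compound.
    return {(s, t) for s, t in kept
            if s in dist and t in dist and min(dist[s], dist[t]) < max_hops}


# Toy data, invented for illustration only.
upos = {
    "de:Doppelfinanzierung": "NOUN",
    "en:funding": "NOUN",
    "en:of": "ADP",
    "fr:financement": "NOUN",
    "it:finanziamento": "NOUN",
}
edges = [
    ("de:Doppelfinanzierung", "en:funding"),  # direct alignment
    ("en:funding", "fr:financement"),         # via one other alignment
    ("fr:financement", "it:finanziamento"),   # three alignments away
    ("de:Doppelfinanzierung", "en:of"),       # ADP endpoint
]
kept = bm1_filter(edges, upos, "de:Doppelfinanzierung")
print(sorted(kept))
```

In this toy graph only the first two edges survive: the third is too far from the compound and the fourth has a disallowed tag.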
[Figure 2 (graphic not reproduced): word alignment graphs for the German compound Doppelfinanzierung and the related tokens in English [en], French [fr], Italian [it] and Spanish [es]. Left-hand part: raw GIZA++ alignments; right-hand part: alignments after the BM1 restrictions. Legend: bidirectional alignment, unidirectional alignment.]
Formally, if |A| denotes the number of automatic unidirectional alignments, |AG| the number
of unidirectional gold standard alignments, and |A ∩ AG| the true positive alignments, we define precision P = |A ∩ AG|/|A|, recall R = |A ∩ AG|/|AG|, and their harmonic mean, the F-measure F = (2 × P × R)/(P + R). First we examine the performance of our baseline model 1 (BM1) on a global evaluation level across all languages (all–all) against our gold standard. Table 3 shows an overall precision of 61.0%, meaning that 61.0% of all automatically computed directed alignments are correct. Recall is 71.6%, which says that 71.6% of the gold standard annotations were identified. This results in an F-measure of 65.9%.
Table 3. Baseline model 1 (BM1): global precision, recall and F-measure. The highest values are shown in bold, the lowest values in italics
          P     R     F
all-all   61.0  71.6  65.9
DE-all    72.7  87.3  79.3
EN-all    57.8  69.4  63.0
FR-all    58.9  68.0  63.1
IT-all    57.9  69.1  63.0
ES-all    57.9  64.9  61.2
all-DE    81.7  29.3  43.1
all-EN    61.6  81.3  70.1
all-FR    59.3  84.5  69.7
all-IT    58.3  84.4  69.0
all-ES    59.8  77.5  67.5
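The evaluation scheme defined above (each unidirectional alignment is one item, each bidirectional alignment two items) can be sketched in a few lines. The function names and the toy alignments below are illustrative assumptions and are not taken from the gold standard distribution.

```python
def directed_items(unidirectional=(), bidirectional=()):
    """Expand alignments into a set of directed (source, target) items.

    A bidirectional alignment is represented by two unidirectional items.
    """
    items = set(unidirectional)
    for a, b in bidirectional:
        items.add((a, b))
        items.add((b, a))
    return items


def precision_recall_f(auto, gold):
    """Compute P, R and F over two sets of directed alignment items."""
    true_positives = len(auto & gold)
    p = true_positives / len(auto) if auto else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


# Toy data, invented for illustration only.
gold = directed_items(bidirectional=[
    ("de:Doppelfinanzierung", "en:duplication"),
    ("de:Doppelfinanzierung", "en:funding"),
])
auto = directed_items(
    unidirectional=[
        ("de:Doppelfinanzierung", "en:duplication"),
        ("de:Doppelfinanzierung", "en:fraud"),
        ("en:fraud", "de:Doppelfinanzierung"),
    ],
    bidirectional=[("de:Doppelfinanzierung", "en:funding")],
)
p, r, f = precision_recall_f(auto, gold)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

Here three of the five automatic items are correct and three of the four gold items are found, so precision and recall differ even though the number of true positives is the same in both ratios.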
Analysing each single language and its alignments to all other languages (all), some differences in alignment quality are noticeable: it is striking that alignments from Spanish have the lowest recall (64.9%) and F-measure (61.2%) whereas the other two Romance languages achieve better recall. Alignments of German to all other languages, however, achieve by far the best results, with values that surpass the evaluation measures of EN, FR, IT and ES all by 15 to 18 percentage points. This result was to be expected because the alignment between multiple elements of an MWU is more difficult, and our restriction to exactly one compound on the German side excludes alignment errors to other German words. When it comes to the language direction from all to a specific language, it is German (all-DE) which has the highest precision of all pairs (81.7%), yet it is alignments directed to English tokens that achieve the highest F-measure (all-EN: 70.1%) due to the very low recall of all-DE (29.3%).
Table 4 shows the performance of all 20 language pairs separately. The language pair DE-EN achieves the highest recall (90.7%) as well as F-measure (85.0%), while EN-DE has the highest overall precision of 84.3%. Recall for EN-DE is low at 33.3%, yet this is still the highest recall of all single language pairs towards German. The low recall values for alignments towards German do not come as a surprise since the settings of GIZA++ only allow one incoming edge per token, when most of the time, at least two tokens form a corresponding MWU. The low F-measure of Spanish is mostly due to low recall of alignments to Spanish, even from the Romance languages French and Italian.
Table 4. Baseline model 1: precision, recall and F-measure of all language pairs. The highest values are shown in bold, the lowest in italics
        P     R     F
DE-EN   80.0  90.7  85.0
DE-FR   71.0  89.2  79.1
DE-IT   67.8  89.7  77.2
DE-ES   72.9  79.9  76.3
EN-DE   84.3  33.3  47.8
EN-FR   56.5  83.9  67.5
EN-IT   53.8  82.4  65.1
EN-ES   56.2  75.8  64.6
FR-DE   81.1  28.5  42.2
FR-EN   58.1  80.3  67.4
FR-IT   56.9  86.3  68.6
FR-ES   56.4  75.8  64.7
IT-DE   79.9  29.1  42.7
IT-EN   55.7  80.7  65.9
IT-FR   55.9  86.8  68.0
IT-ES   56.7  78.5  65.9
ES-DE   81.2  26.2  39.7
ES-EN   56.6  74.3  64.3
ES-FR   55.9  78.3  65.2
ES-IT   56.0  79.6  65.7
3.5 Optimisation of the directed word alignments through symmetrisation
The global performance analysis of BM1 has shown that the alignments from German to all other languages achieve the highest values for P, R and F; for this
reason, German alignments are considered a suitable starting point for an optimisation of the overall alignment quality. Hence, all unidirectional alignments between the German compounds and their corresponding tokens were made bidirectional for our model 2 (M2). Table 5 lists the language directions of model 2 which include German; for direct comparison, the increase or decrease in relation to BM1 is listed in a separate column (Δ BM1).
Table 5. Model 2 (M2): global P, R and F values of language directions including DE. Difference (Δ) to baseline model 1 included
          P                R                F
          M2    Δ BM1      M2    Δ BM1      M2    Δ BM1
all-all   61.2  +0.2       83.3  +11.7      70.6  +4.7
DE-all    70.0  −2.7       88.2  +0.9       78.1  −1.2
all-DE    70.0  −11.7      88.2  +58.9      78.1  +35.0
EN-DE     76.5  −7.8       91.0  +57.7      83.2  +35.4
FR-DE     68.7  −12.4      90.3  +61.8      75.9  +33.7
IT-DE     65.5  −14.4      90.3  +61.2      75.9  +33.2
ES-DE     70.4  −10.8      81.9  +55.7      75.7  +36.0
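The symmetrisation step of model 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `language` and `symmetrise_pivot`, and the convention that token identifiers start with a language prefix such as `de:`, are hypothetical.

```python
def language(token):
    """Return the language prefix of a token id such as 'de:Doppelfinanzierung'."""
    return token.split(":", 1)[0]


def symmetrise_pivot(edges, pivot="de"):
    """Add the reverse of every directed edge that touches a pivot-language token.

    Edges between two non-pivot languages are left unchanged.
    """
    edges = set(edges)
    reversed_pivot_edges = {
        (target, source)
        for source, target in edges
        if pivot in (language(source), language(target))
    }
    return edges | reversed_pivot_edges


# Toy data, invented for illustration only.
bm1_edges = {
    ("de:Doppelfinanzierung", "en:funding"),      # only German -> English
    ("de:Doppelfinanzierung", "en:duplication"),  # already bidirectional
    ("en:duplication", "de:Doppelfinanzierung"),
    ("en:funding", "fr:financement"),             # no German token involved
}
m2_edges = symmetrise_pivot(bm1_edges)
print(sorted(m2_edges))
```

The only new item is the edge from `en:funding` back to the compound. This is the kind of item that a one-incoming-edge-per-token setting cannot produce when several tokens correspond to one compound, which is consistent with the large recall gains towards German reported in Table 5.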
Across all languages (all–all), model 2 achieves a significant increase of recall, leading to an increase of F-measure by almost five percentage points. EN, FR, IT and ES all profit from the bidirectionalisation: recall is between 56 and 62 percentage points higher, and F values are around 35 percentage points higher than in BM1. Under these circumstances, the somewhat lower precision is perfectly acceptable. For test purposes, all edges between all languages were made bidirectional in a third model. On the global all-all level, this symmetrisation resulted in an increased recall of 86.6% (Δ M2 + 3.3%). However, the marked reduction of precision to 57.5% (Δ M2 − 3.7%) leads to a lower F-measure of 69.2% (Δ M2 − 1.4%). Of all three models, model 2 thus proves to be optimal as measured by F, and therefore, the following evaluations are all based on it.
3.6 Frequency effects
Next, we investigate whether frequent compounds are more reliably aligned by GIZA++. P, R and F of the three frequency classes m = 1 (hapax legomena), 2 ≤ m ≤ 7 and m ≥ 8 are compared. GIZA++ being a statistically based word aligner, we expect that the alignment quality of compounds depends greatly on their frequency in a corpus, with more frequent compounds being more reliably aligned.
As expected, there are clear differences between the frequency classes on all evaluation levels: the class of hapax legomena (m = 1) consistently has the lowest precision, recall and F-measure. With increasing frequency, all results increase as well, with the difference between the classes 2 ≤ m ≤ 7 and m ≥ 8 being most pronounced (all-all: P + 18.7%, R + 5.0%, F + 14.0%). Altogether, the results exhibit a strong correlation between frequency and alignment quality. For detailed results, see Table 9 in the Appendix.
3.7 Effects of morphological complexity
Another research question that needs to be addressed is whether the morphological complexity of German compounds affects alignment performance. The results shown in Table 6 do not support the intuitive assumption that MWUs corresponding to compounds with fewer segments are aligned more reliably. In fact, the class of compounds with two segments (bi-compounds) has the lowest global F-measure, lower than the F-measure of the classes of compounds with three segments and compounds with four or more segments. F is practically the same for compounds with three or more segments, yet recall drops to 80.5% with increasing complexity.
Table 6. Global performance of model 2 grouped by morphological complexity classes
          2                   3                   ≥ 4
          P     R     F       P     R     F       P     R     F
all-all   55.3  83.5  66.5    64.0  85.4  73.2    67.1  80.5  73.2
These overall results can be linked to the fact that the average number of tokens of MWUs is always lower than the number of segments of German compounds; various complexity reduction factors are at play, particularly for compounds with many segments. Another reason for the low performance of bi-compounds might be that more than half of them are hapax legomena; a multifactorial analysis is needed to discern the separate influence of frequency and complexity on performance. Table 7 combines both factors across all classes. Bi-compound hapax legomena have the lowest F-measure (55.9%) while the class of tri-compounds of the highest frequency class m ≥ 8 reaches the highest F-measure (81.2%). General tendencies are: the alignment quality for the bi-compounds as well as the 4+-compounds improves with increasing frequency; the values for the tri-segmented compounds first drop and then rise sharply between the frequency classes 2 ≤ m ≤ 7 and m ≥ 8, reaching the highest P, R and F values of all classes.
Table 7. Performance of the combined frequency and complexity classes across all languages of model 2. The highest F value is shown in bold, the lowest in italics
             m = 1               2 ≤ m ≤ 7           m ≥ 8
Complexity   P     R     F       P     R     F       P     R     F
2            42.6  81.5  55.9    53.1  76.7  62.7    69.1  87.2  77.1
3            52.1  81.3  63.5    48.7  80.2  60.6    75.3  88.2  81.2
≥ 4          59.0  62.4  60.6    60.1  86.1  70.8    72.7  82.5  77.3
All          47.0  78.0  58.6    53.8  81.2  64.7    72.5  86.2  78.7
3.8 Lexicalisation and variability
As GIZA++ does not rely on external dictionaries, the factor of lexicalisation is not expected to have an influence on alignment quality. Lexicalised and non-lexicalised compounds are both solely aligned based on co-occurrence probabilities. However, the numbers for the global all-all level in Table 8 do not support this assumption. The precision of GIZA++ for lexicalised compounds is considerably higher than for non-lexicalised compounds. The difference could even be higher with a word aligner that makes use of bilingual dictionaries.
Table 8. Performance of lexicalised (+ lex.) and non-lexicalised (− lex.) compounds. Breakdown for compounds with 2 segments by frequency class, and overall statistics (all).
          Compounds with 2 segments                                   all
Freq.     m = 1               2 ≤ m ≤ 7           m ≥ 8
          P     R     F       P     R     F       P     R     F       P     R     F
− lex.    42.8  81.7  56.2    52.3  76.9  62.3    66.4  85.7  74.8    58.7  82.3  68.6
+ lex.    40.6  80.0  53.9    56.8  76.1  65.1    72.2  88.8  79.7    72.2  87.3  79.0
Since lexicalised compound types are rare in our gold standard – it contains 17 lexicalised compounds out of 137 compounds – we only consider the class of bi-compounds for a more detailed analysis.⁸ We observe no difference in performance except in the frequency class of m ≥ 8: the higher performance for lexicalised compounds in this class is most likely due to the fact that many of these compounds are far more frequent than the non-lexicalised compounds; examples include the lexicalised compound Redebeitrag (‘contribution’) with m = 800, in contrast with, for example, the non-lexicalised compound Visumliberalisierung (‘visa liberalisation’) with m = 12. The interplay of lexicalisation and frequency should be further investigated to reach a final conclusion about the influence of lexicalisation.
8. The following 13 out of 70 bi-compounds in our gold standard are lexicalised: Werftanlagen, Spielhölle, Leitungssystem, Lageplan, Machtantritts, Reitvereine, Musikproduzenten, Planziele, Dankesworten, Leistungsfähigkeit, Wahlsystems, Regenwald, Redebeitrag.
As regards variability, the frequency class of m ≥ 8 has been examined in terms of type-token ratio. On average, one compound of the frequency class m ≥ 8 is realised by four different MWU variants; this holds true for all languages, with English only just having the highest (0.47), and Spanish the lowest (0.43) average type-token ratio. For example, Europa-Mittelmeer-Kooperation (‘Euro-Mediterranean cooperation’) has one of the lowest type-token ratios, with both English and Italian only having one variant, and French and Spanish each having two variants. Frauenerwerbsquote (‘female employment rate’), on the other hand, is highly variable, having seven different realisations in both Italian and Spanish, six in French, and five in English. Nevertheless, the alignment of Frauenerwerbsquote measured over all languages achieves a respectable mean F-measure of 88.3%, which is close to the mean F-measure of 88.5% for Europa-Mittelmeer-Kooperation.
4. Conclusion
We built a multilingual gold standard for the alignment of German compounds with English, French, Italian and Spanish translations. This gold standard is freely available.⁹ In addition to the alignment information, the following content is stored in the gold standard distribution: (1) UPOS tags for each token (manually corrected for aligned words), and (2) the frequency of GIZA++ alignments for each content word pair collected from the whole corpus.
This extra information will allow users to test and optimise automatic methods of multilingual word alignment and to evaluate multilingual term extraction processes. We distribute the data in a simple¹⁰ TEI P5 XML format. According to Simões & Fernandes (2011), the XCES¹¹ format – which would also be suited for linguistically-annotated multi-parallel resources – is not well established. Additionally, we publish our gold standard as SVG graphics on the web in order to facilitate quick inspection.
9. http://pub.cl.uzh.ch/purl/compal_gs
10. http://www.tei-c.org
11. http://www.xces.org
Further Work. The performance of automatic word alignments from GIZA++ measured against our gold standard revealed some systematic differences across languages – Spanish, especially, showed consistently lower quality. Alignments to and from German turned out to be better than any other direction if all alignments were bidirectionalised. This is evidence that German compounds can be used as pivotal elements in order to improve the alignment of MWUs (and translation spotting) in multi-parallel corpora. This may even be applied to scenarios in the translation industry where morphological compounds within multilingual translation memories can serve as pivotal elements to extract translations of MWUs in other languages. The current version does not include annotations of semantic fuzziness. Although we found that they are rarely needed, this should be further investigated. Another continuation of our work could involve a morphological splitting of the German compounds into segments and a subsequent alignment of these segments with parallel words. An interesting extension of our work could be the inclusion of more parallel languages in our gold standard. The European Parliament corpus supplies data for another 15 languages. However, the required language expertise increases and the number of cross-lingual alignments grows quadratically with the number of languages. Therefore, a focus on a non-Romance language or a further language with morphological compounding will be most valuable.
Acknowledgements
This research was supported by the Swiss National Science Foundation under grant 105215_146781/1 through the project “SPARCLING – Large-scale Annotation and Alignment of Parallel Corpora for the Investigation of Linguistic Variation”.
References
Baayen, R. (2001). Word frequency distributions. Kluwer Academic Publishers. doi: 10.1007/978-94-010-0844-0
Baroni, M., Matiasek, J., & Trost, H. (2002). Predicting the components of German nominal compounds. In Proceedings of the 15th European Conference on Artificial Intelligence, ECAI’2002, Lyon, France, July 2002 (pp. 470–474).
Deng, D., & Xue, N. (2014). Building a hierarchically aligned Chinese-English parallel treebank. In Proceedings of COLING 2014, the 25th international conference on computational linguistics: Technical papers (pp. 1511–1520). Dublin, Ireland. Retrieved from http://www.aclweb.org/anthology/C14-1143
Graça, J., Paulo Pardal, J., Coheur, L., & Caseiro, D. (2008). Building a Golden Collection of Parallel Multi-Language Word Alignment. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco.
Graën, J., Batinic, D., & Volk, M. (2014). Cleaning the Europarl Corpus for Linguistic Applications. In Konvens 2014. Stiftung Universität Hildesheim. Retrieved from http://dx.doi.org/10.5167/uzh-99005
Haapalainen, M., & Majorin, A. (1994). GERTWOL: Ein System zur automatischen Wortformerkennung deutscher Wörter (Tech. Rep.). Lingsoft, Inc.
Holmqvist, M., & Ahrenberg, L. (2011). A Gold Standard for English-Swedish Word Alignment. In S. P. Bolette, G. Nešpore, & I. Skadina (Eds.), Proceedings of the 18th nordic conference of computational linguistics nodalida (Vol. 11, pp. 106–113). Riga.
Koehn, P. (2005). Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of the MT Summit 2005 (pp. 79–86). Retrieved from http://www.iccs.inf.ed.ac.uk/~pkoehn/publications/europarl-mtsummit05.pdf
Koehn, P. (2010). Statistical machine translation. New York, NY, USA: Cambridge University Press.
Lambert, P., De Gispert, A., Banchs, R., & Mariño, J. B. (2005). Guidelines for Word Alignment Evaluation and Manual Alignment. Language Resources and Evaluation, 39(4), 267–285. doi: 10.1007/s10579-005-4822-5
Martin, J., Mihalcea, R., & Pedersen, T. (2005). Word alignment for languages with scarce resources. In Proceedings of the ACL Workshop on Building and Using Parallel Texts (pp. 65–74).
Och, F. J., & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), 19–51. doi: 10.1162/089120103321337421
Parra Escartín, C. (2014). Chasing the perfect splitter: A comparison of different compound splitting tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) (pp. 3340–3347).
Parra Escartín, C., & Héctor Martínez, A. (2014). Compound Dictionary Extraction and WordNet. A Dangerous Liaison. Retrieved from http://typo.uni-konstanz.de/parseme/images/Meeting/2014-03-11-Athens-meeting/PostersA4/WG3-Parra_Martinez-posterA4.pdf
Petrov, S., Das, D., & McDonald, R. (2012). A Universal Part-of-Speech Tagset. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12) (pp. 2089–2096). Istanbul, Turkey.
Roth, T. (2014). Wortverbindungen und Verbindungen von Wörtern: Lexikografische und distributionelle Aspekte kombinatorischer Begriffsbildung zwischen Syntax und Morphologie (No. 94). Tübingen: A. Francke Verlag.
Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of international conference on new methods in natural language processing (nemlap) (Vol. 12, pp. 44–49).
Simões, A., & Fernandes, S. (2011). XML schemas for parallel corpora. In Xata 2011 – 9a conferência nacional em xml, aplicações e tecnologias associadas, vila do conde, portugal (pp. 59–69).
Tiedemann, J. (2011). Bitext Alignment. Synthesis Lectures on Human Language Technologies, 4(2). Morgan & Claypool.
Tinsley, J., Hearne, M., & Way, A. (2009). Exploiting parallel treebanks to improve phrase-based statistical machine translation. In Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Mexico City, Mexico, March 1–7, 2009 (pp. 318–331). Retrieved from http://dx.doi.org/10.1007/978-3-642-00382-0_26
Varga, D., Halácsy, P., Kornai, A., Nagy, V., Németh, L., & Trón, V. (2005). Parallel Corpora for Medium Density Languages. Proceedings of the Recent Advances in Natural Language Processing (RANLP) (pp. 590–596).
Volk, M., Göhring, A., Marek, T., & Samuelsson, Y. (2010). SMULTRON (version 3.0) – The Stockholm MULtilingual parallel TReebank. electronic. Retrieved from http://www.cl.uzh.ch/research/parallelcorpora/paralleltreebanks.html (An English-French-German-Spanish-Swedish parallel treebank with sub-sentential alignments).
Véronis, J., & Langlais, P. (2000). Evaluation of parallel text alignment systems. In Parallel text processing (Vol. 13, pp. 369–388). Dordrecht: Kluwer Academic Publishers. doi: 10.1007/978-94-017-2535-4_19
Zhechev, V. (2010). Unsupervised Generation of Parallel Treebanks through Sub-Tree Alignment. The Prague Bulletin of Mathematical Linguistics, 91, 89–98.
Appendix
Table 9. Performance grouped by the frequency classes m = 1, 2 ≤ m ≤ 7 and m ≥ 8. The highest F-measure of the 3 levels (all-all, languages aligning with all and the 20 single language directions) is printed in bold; the lowest F-measure is shown in italics. Due to the symmetric alignments of German in model 2, we omit redundant rows
          m = 1               2 ≤ m ≤ 7           m ≥ 8
          P     R     F       P     R     F       P     R     F
all-all   47.0  78.0  58.6    53.8  81.2  64.7    72.5  86.2  78.7
DE-all    56.4  84.0  67.5    62.0  84.9  71.6    80.9  91.0  85.7
EN-all    45.3  76.7  57.0    52.5  81.5  63.9    71.2  86.4  78.1
FR-all    46.2  79.2  58.3    52.6  81.2  63.9    70.5  85.0  77.0
IT-all    44.1  79.2  56.7    50.1  81.1  62.0    70.2  86.9  77.7
ES-all    44.1  71.2  54.5    52.8  77.4  62.8    70.3  81.8  75.6
all-EN    46.4  76.7  57.8    54.6  80.6  65.1    72.3  83.6  77.6
all-FR    44.8  78.1  56.9    51.8  82.6  63.7    69.6  87.8  77.7
all-IT    44.5  79.9  57.2    50.0  81.7  62.1    68.8  87.5  77.0
all-ES    44.1  71.6  54.6    51.7  76.4  61.7    71.8  81.1  76.2
DE-EN     63.6  86.6  73.3    69.6  88.3  77.8    86.3  93.9  89.9
DE-FR     56.2  87.3  68.4    60.5  84.9  70.7    78.6  91.9  84.7
DE-IT     51.0  87.0  64.3    56.8  86.6  68.6    77.9  93.0  84.8
DE-ES     56.0  75.4  64.2    62.2  79.8  69.9    81.5  85.4  83.4
EN-FR     41.9  75.7  54.0    50.0  83.3  62.5    67.2  87.4  76.0
EN-IT     40.5  77.0  53.1    46.7  80.6  59.1    64.9  85.1  73.6
EN-ES     40.1  68.0  50.5    48.1  73.8  58.2    69.8  79.7  74.4
FR-EN     45.0  77.8  57.0    51.1  79.8  62.3    68.8  81.4  74.6
FR-IT     44.1  82.3  57.4    49.3  83.0  61.9    67.2  89.1  76.6
FR-ES     40.9  69.7  51.6    50.6  77.1  61.1    67.9  77.8  72.5
IT-EN     40.7  74.9  52.8    49.4  80.0  61.1    67.4  83.2  74.5
IT-FR     43.3  81.7  56.6    47.1  83.0  60.1    66.9  90.1  76.8
IT-ES     42.1  73.5  53.6    47.8  74.8  58.4    69.3  81.8  75.0
ES-EN     40.6  68.0  50.9    51.2  74.4  60.7    68.9  76.8  72.6
ES-FR     39.5  68.1  50.0    51.4  78.9  62.2    66.8  82.2  73.7
ES-IT     43.3  73.5  54.5    48.1  76.7  59.1    66.2  83.0  73.7
Dutch compound splitting for bilingual terminology extraction Lieve Macken & Arda Tezcan
Ghent University, Department of Translation, Interpreting and Communication
As compounds pose a problem for applications that rely on precise word alignments, we developed a state-of-the-art compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists. As compounds are not always translated compositionally, we developed a novel methodology for word alignment. We train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts. Experiments show that the compound splitter combined with the novel word alignment technique considerably improves bilingual terminology extraction results.
Keywords: compound splitting, bilingual terminology extraction, word alignment, multiword units, translation, Dutch
1. Introduction
Compounding is a highly productive process in Dutch that poses a challenge for various NLP applications that rely on automated word alignment such as machine translation and bilingual terminology extraction. In Dutch, a compound is usually not separated by means of white space characters and hence constitutes one single word. Examples are slaap + zak ‘sleeping bag’, hoofd + pijn ‘headache’ and [post + zegel] + verzamelaar ‘stamp collector’. Compounds written as one word are problematic for statistical word alignment as on the one hand they drastically increase the vocabulary size and on the other hand lead to one-to-many word alignments, which are more difficult to model, as is the case in slaapzak, which corresponds to two words, and regeringshoofd, which corresponds to three words in English (‘head of government’).
doi 10.1075/cilt.341.07mac © 2018 John Benjamins Publishing Company
Lieve Macken & Arda Tezcan
Numerous studies have shown that splitting compounds prior to translation model training improves the translation quality of statistical machine translation systems (Fritzinger & Fraser, 2010; Koehn & Knight, 2003; Stymne & Holmqvist, 2008). However, the impact of compound splitting on bilingual terminology extraction is less studied. Most compound splitting approaches are corpus-based and use corpus frequencies to find the optimal split points of a compound (Koehn & Knight, 2003). Adding linguistic knowledge in the form of part-of-speech restrictions (Stymne & Holmqvist, 2008) or morphological information (Fritzinger & Fraser, 2010) reduces the number of erroneous split points. As terminology extraction systems typically work with much smaller corpora than the training corpora of Machine Translation systems, and as the accuracy of the compound splitter depends on the size and the quality of the training corpus, we trained a stand-alone data-driven compound splitting tool on the basis of a frequency list derived from Wikipedia. The tool determines a list of eligible compound constituents (so-called ‘heads’ and ‘tails’) on the basis of word frequency information and uses part-of-speech (POS) information as a means to restrict this list of possible heads and tails. As a drop in recall can be expected on domain-specific test sets, we use domain-adaptation techniques to combine the large out-of-domain data set (Wikipedia) with the smaller in-domain data sets.
2. Dutch compound splitter
To ensure a broad coverage of topics, we compiled a frequency list of token-POS-tag-tuples for Dutch derived from a part-of-speech-tagged Dutch Wikipedia dump of 150 million words. We used a coarse-grained POS tag set of ten categories that are relevant for compound splitting: plural noun, singular noun, adjective, numeral, adverb, preposition, past participle, present participle, infinitive and verb stem.
We followed the implementation of Réveil & Martens (2008) and stored all possible heads and tails (together with the frequency and POS information) in two prefix trees. Possible heads or tails are defined as words of at least three characters, containing a minimum of one vowel. Heads belong to one of the abovementioned POS categories; tails belong to the same set without adverbs, prepositions and numerals. As the Wikipedia files were automatically parsed, tokenised and POS-tagged, they inevitably contain errors. To avoid the problem of error percolation, a minimum frequency threshold was experimentally set at 20. Unfortunately, the frequency threshold could not fully prevent non-words being stored in the prefix
trees. Therefore, non-words due to spelling mistakes (e.g. vor instead of voor (‘for’)) or tokenisation problems (e.g. ste or ata) were manually filtered out on the basis of tests on the development corpus (see the Section ‘Data Sets and Experiments’). We also compiled a list of non-productive prepositions and adverbs (e.g. hoe, dan and per (‘how’, ‘then’, ‘per’)) and a list of frequent Dutch derivational suffixes on the basis of the ANS,¹ an authoritative Dutch grammar book, which are discarded as possible heads or tails. With the minimum frequency threshold of 20 and after applying the filters described above, the head prefix tree contains 71,147 possible heads and the tail prefix tree 70,189 possible tails.
The compound splitter searches the head and tail prefix trees for all possible split points. The compound splitter allows a linking-s between the head and the tail as is the case in e.g. aanwezigheid + s + lijst ‘attendance list’. Head and tail combinations are considered valid if the POS combination is included in a predefined list of valid POS combinations. This list was compiled on the basis of the development set:
–– Noun tails can be combined with nouns, adjectives, adverbs and verb stems as heads;
–– Adjective tails can be combined with singular nouns, prepositions, adverbs, adjectives and verb stems as heads;
–– Infinitive tails can be combined with prepositions, adverbs, adjectives and past participles;
–– Past and present participles as tail can be combined with prepositions, adverbs and adjectives as head.
Two other restrictions limit the number of possible split points. A first restriction blocks two identical consonants followed by the ending –en. This rule prevents the erroneous splitting of Dutch plural forms as in e.g. boodschappen ‘groceries’ into boodschap ‘message’ and pen ‘pen’. The second restriction regulates the linking –s.
The linking –s is not allowed if the head is a preposition, adverb or adjective and can only be split off in certain contexts. A list of 2,585 possible contexts (defined as two letters to the left and two letters to the right) was compiled on the basis of a set of 50,000 Dutch compounds (see the Section ‘Data Sets and Experiments’). The compound splitter generates all possible split points and retrieves the frequency information of the token-POS-tag tuples from the prefix trees, after which
1. http://ans.ruhosting.nl/e-ans/
Lieve Macken & Arda Tezcan
the split with the highest geometric mean of word frequencies of its parts (Koehn & Knight, 2003) is chosen as the best solution:

$$\left( \prod_{i=1}^{n} \mathrm{freq}_{p_i} \right)^{1/n}$$

in which n is the number of component parts in the candidate split and freq_{p_i} is the frequency of the i-th component part. The following example shows the possible split points for the word staatsbankroet and the geometric mean calculated for the different splits:

Split                                  Geometric mean   English gloss
staat (51657) + s + bankroet (257)     3643.60          ‘state’ + ‘bankruptcy’
staats (146) + bankroet (257)          193.70           ‘of the state’ + ‘bankruptcy’
staatsbank (24) + roet (328)           88.72            ‘state-owned bank’ + ‘soot’

Note that original tokens without split points are also considered, as is the case in databank:

Split                                  Geometric mean   English gloss
data (2535) + bank (4226)              3273.06          ‘data’ + ‘base’
databank (224)                         224.00           ‘database’
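The selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: a plain dictionary stands in for the prefix trees, only binary splits are generated, and the POS-combination list, the vowel constraint and the linking –s context list are left out. The function names are hypothetical; the frequencies are the ones quoted in the examples above.

```python
from math import prod

# Toy frequency list holding only the counts quoted in the examples above.
FREQ = {
    "staat": 51657, "staats": 146, "bankroet": 257,
    "staatsbank": 24, "roet": 328,
    "data": 2535, "bank": 4226, "databank": 224,
}

MIN_PART_LEN = 3  # heads and tails must have at least three characters


def candidate_splits(word, freq):
    """Return the unsplit word (if known) plus every binary head/tail split."""
    candidates = []
    if word in freq:
        candidates.append((word,))
    for i in range(MIN_PART_LEN, len(word) - MIN_PART_LEN + 1):
        head, tail = word[:i], word[i:]
        if head not in freq:
            continue
        if tail in freq:
            candidates.append((head, tail))
        # Optional linking -s: dropped from the analysis, not counted as a part.
        if tail.startswith("s") and len(tail) > MIN_PART_LEN and tail[1:] in freq:
            candidates.append((head, tail[1:]))
    return candidates


def geometric_mean(parts, freq):
    """Geometric mean of the frequencies of the component parts."""
    return prod(freq[p] for p in parts) ** (1.0 / len(parts))


def best_split(word, freq):
    """Pick the candidate with the highest geometric mean (None if no candidate)."""
    candidates = candidate_splits(word, freq)
    if not candidates:
        return None
    return max(candidates, key=lambda parts: geometric_mean(parts, freq))


for w in ("staatsbankroet", "databank"):
    for parts in candidate_splits(w, FREQ):
        print(w, parts, round(geometric_mean(parts, FREQ), 2))
    print("best:", best_split(w, FREQ))
```

Under these assumptions the sketch ranks staat + s + bankroet above the two competing analyses and prefers data + bank over the unsplit databank, in line with the figures given above.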
Compounds can be nested and especially in technical texts, compounds of more than two components frequently occur, as is the case in e.g. satelliet + [navigatie + systeem] ‘satellite navigation system’ and [[baar + moeder] + hals] + kanker ‘cancer of the cervix uteri’. Therefore, the compound splitter can further split the component parts into their underlying parts.

2.1 Domain adaptation

As mentioned above, we compiled a frequency list using Wikipedia to ensure a broad coverage of topics. The Wikipedia frequency list is static and forms the core part of the compound splitter. However, as we aim to integrate the compound splitter in a terminology extraction system, we foresee a mechanism to extend the large static Wikipedia frequency list with a smaller dynamically compiled frequency list derived from the extraction corpus. To account for differences in corpus size, the in-domain frequencies are estimated on the basis of their relative frequencies.

2.2 Data Sets and Experiments

We compiled three different Gold Standard data sets on the basis of Celex (Baayen, Piepenbrock, & van Rijn, 1993): a set of 50,000 Dutch compounds, a set of 5,000 monomorphemic Dutch words and a development set of 5,550 compounds and 2,886 monomorphemic words. To evaluate the performance of the compound splitter on a more technical domain, we used a set of 5,000 Dutch compounds
belonging to the automotive domain that had been compiled for earlier research (Lefever, Macken, & Hoste, 2009) and an in-domain frequency list derived from an automotive corpus of 2.7 million words. To evaluate the compound splitter, we compared its output with the Gold Standard data and used precision, recall and accuracy as evaluation metrics. These metrics are commonly used in the field (Fritzinger & Fraser, 2010; Koehn & Knight, 2003; Parra Escartín, 2014) and can be defined as follows:

$$\text{Precision} = \frac{\#\,\text{CorrectlySplit}}{\#\,\text{WordsSplit}}, \qquad \text{Recall} = \frac{\#\,\text{CorrectlySplit}}{\#\,\text{CompoundsInCorpus}}, \qquad \text{Accuracy} = \frac{\#\,\text{CorrectWords}}{\#\,\text{WordsInCorpus}}$$
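As a rough illustration of how these three metrics relate to each other, the following Python sketch computes them from a gold standard and a system output. The data representation (a dictionary from word to a tuple of parts, where a one-element tuple means "not split") and the function name are assumptions made for this example; the four toy entries are not taken from the evaluation data.

```python
def evaluate_splitter(gold, predicted):
    """Compute precision, recall and accuracy as defined above.

    gold, predicted: dict mapping each word to a tuple of its parts;
    a one-element tuple means the word is left unsplit.
    """
    words_split = sum(1 for w in gold if len(predicted[w]) > 1)
    compounds = sum(1 for w in gold if len(gold[w]) > 1)
    correctly_split = sum(
        1 for w in gold if len(gold[w]) > 1 and predicted[w] == gold[w]
    )
    correct_words = sum(1 for w in gold if predicted[w] == gold[w])

    precision = correctly_split / words_split if words_split else 0.0
    recall = correctly_split / compounds if compounds else 0.0
    accuracy = correct_words / len(gold) if gold else 0.0
    return precision, recall, accuracy


# Toy data: one correct split, one missed compound, one wrongly split
# monomorphemic word and one correctly untouched word.
gold = {
    "staatsbankroet": ("staat", "bankroet"),
    "databank": ("data", "bank"),
    "mopperen": ("mopperen",),
    "boodschappen": ("boodschappen",),
}
predicted = {
    "staatsbankroet": ("staat", "bankroet"),
    "databank": ("databank",),
    "mopperen": ("mop", "peren"),
    "boodschappen": ("boodschappen",),
}

print(evaluate_splitter(gold, predicted))
```

In this toy setting a wrongly split monomorphemic word lowers precision and accuracy but not recall, whereas a missed compound lowers recall and accuracy but not precision.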
We experimented with different minimum frequency thresholds and we also defined a minimum length threshold (expressed in the number of characters) for words to be sent to the compound splitter. As expected, raising the minimum frequency threshold and the minimum length threshold has a positive impact on precision, but lowers recall. The results reported in Table 1 use a minimum frequency threshold of 20 and a minimum length threshold of seven characters.

Table 1. Precision and recall scores on two test sets using different frequency lists

Test corpus and frequency information used          Precision   Recall
One-level compound splitting
Celex (Wikipedia freq. list)                         98.5        80.3
Automotive (Wikipedia freq. list)                    97.8        76.6
Automotive (Automotive freq. list)                   97.2        75.7
Automotive (Wikipedia and automotive freq. list)     96.4        88.5
Two-level compound splitting
Celex (Wikipedia freq. list)                         94.9        77.4
Automotive (Wikipedia freq. list)                    88.6        69.4
Automotive (Automotive freq. list)                   83.9        65.3
Automotive (Wikipedia and automotive freq. list)     86.5        79.4
On the test set of 5,000 monomorphemic words, the compound splitter reaches an accuracy of 98.3. It wrongly split 84 monomorphemic words, of which 60 are Dutch infinitives such as mopperen ‘grumble’, which is wrongly split into mop + peren ‘joke’ + ‘pears’.
On the Celex test set of 50,000 Dutch compounds the compound splitter has a precision of 98.5 and a recall of 80.3 if the words are split at the highest level (one-level compound splitting). These figures drop to 94.9 and 77.4 in the case of two-level compound splitting. Please note that we adopt a very strict evaluation method. If we ignore the linking –s and append it to the head in both the Gold Standard data set and the output of the compound splitter (as in varken + s + snuit → varkens + snuit, ‘pig’s snout’), this operation solves 47% of the wrongly split words. Precision and recall scores on the test set consisting of 5,000 compounds of the automotive domain are slightly lower (a precision score of 97.8 and recall score of 76.6 for one-level compound splitting and a precision score of 88.6 and recall score of 69.4 for two-level compound splitting). The lower two-level scores for the automotive test set can be attributed to the higher percentage of nested compounds in the technical data set (22.8% vs. 5.8% in the Celex data set). We also tested the compound splitter using the in-domain frequency list of the automotive corpus of 2.7 million words instead of the Wikipedia frequency list. Precision and recall scores are slightly lower for one-level compound splitting and remarkably lower for two-level compound splitting. These figures demonstrate that the Wikipedia frequency list indeed has a good coverage of more technical domains. By combining both frequency lists, best recall scores are obtained while the precision scores only slightly decrease. In a real terminology extraction scenario, however, much smaller extraction corpora are available. We therefore created two smaller parallel data sets to test the compound splitter and the impact on word alignment and subsequent terminology extraction. 
The first one is an English-Dutch corpus belonging to the medical domain, and consists of four European public assessment reports (EPARs) extracted from the Dutch Parallel Corpus (Macken, De Clercq, & Paulussen, 2011). It is a relatively small corpus and contains 4,333 English and 4,332 Dutch tokens. Manual word alignments are available for this data set in the Dutch Parallel Corpus. The second corpus is a French-Dutch parallel corpus belonging to the automotive domain of 14,087 French and 13,133 Dutch tokens. It is a subset of the data set used in (Lefever et al., 2009) for which manual word alignments are also available. Again, we contrast the performance of the compound splitter using the Wikipedia frequency list with one using a combined version of the Wikipedia frequency list and a frequency list derived from the Dutch part of the in-domain parallel
corpus. Despite the fact that the in-domain frequency lists are much smaller than in our previous experiments, using additional in-domain data drastically increases precision and recall scores for the medical domain and increases the recall scores in the automotive domain. We POS-tagged the parallel corpora and evaluated the performance of the compound splitter only on nouns and adjectives. The basic underlying assumption is that especially nouns and adjectives are important for terminology extraction. Limiting compound splitting only to nouns and adjectives yields the best overall results.

Table 2. Precision and recall scores on the medical data set in different settings

Setting                                                                Precision   Recall   Accuracy
One-level compound splitting
Wikipedia freq. list                                                   72.7        56.2     96.3
Wikipedia and medical freq. list                                       77.7        74.9     97.3
Wikipedia and medical freq. list, restricted to nouns and adjectives   80.5        74.1     98.0
Two-level compound splitting
Wikipedia freq. list                                                   72.2        55.8     96.3
Wikipedia and medical freq. list                                       76.9        74.1     97.3
Wikipedia and medical freq. list, restricted to nouns and adjectives   79.5        73.1     98.0
Analysing the output of the compound splitter on the medical data set, we found that the splitter misses compounds such as injectie + flacons ‘vials’ whose compound parts are not present in the Wikipedia frequency list and do not occur as a single word in the Dutch part of the small parallel data set. Wrongly split words are monomorphemic words such as receptoren ‘receptors’, which was erroneously split into recept + oren ‘recipe’ + ‘ears’, or besloot ‘concluded’, which was erroneously split into bes + loot. Limiting compound splitting only to nouns and adjectives solved the last case. A manual inspection of the missed compounds reveals a phenomenon that frequently occurred in the automotive data set and that cannot be handled by the current system. The head of compounds such as opberg + vak ‘stowage box’ or aandrijf + tandwiel ‘drive pinion’ consists of a verb form that never occurs as such in a corpus, as the prefix of a separable verb is separated from the verb (berg…op, drijf…aan). Only in the infinitive and the past participle is the separable verb written as one word (opbergen, aandrijven). Rules for such types of transformations are currently lacking in the system.
Table 3. Precision and recall scores on the automotive data set in different settings

Setting                                                                   Precision   Recall   Accuracy
One-level compound splitting
Wikipedia freq. list                                                      88.9        62.1     93.6
Wikipedia and automotive freq. list                                       88.8        66.9     94.2
Wikipedia and automotive freq. list, restricted to nouns and adjectives   90.1        65.9     94.7
Two-level compound splitting
Wikipedia freq. list                                                      81.8        57.1     92.9
Wikipedia and automotive freq. list                                       79.8        59.9     93.2
Wikipedia and automotive freq. list, restricted to nouns and adjectives   80.0        58.4     93.8
At this moment, the compound splitter does not use the POS code of the compound. However, as the compound inherits the POS category of the tail, putting a restriction on the tail’s POS category to match the compound’s POS category will avoid errors such as stekkers ‘plugs’, which is wrongly split into stek + kers ‘spot’ + ‘cherry’ and the case of mop + peren described above.

3. Impact on word alignment

In statistical machine translation, translational correspondences are estimated from bilingual corpora on the basis of statistical word alignment models that are based on the assumption of co-occurrence: words that are translations of each other co-occur more often in aligned sentence pairs than would be expected by chance. In the context of statistical machine translation, GIZA++ is one of the most widely used word alignment toolkits. GIZA++ implements the IBM models 1–5 (Brown et al., 1993) and is used in Moses (Koehn et al., 2007), an open-source statistical machine translation system.

One of the shortcomings of the IBM models is that they only allow one-to-many word mappings as they take the source word as their starting point to estimate conditional probabilities (i.e. the probability that a target word is a translation of a source word, given the source word). Multiword units (e.g. the Dutch word regeringsleider ‘Head of Government’) are problematic for the alignment models, as every word (Head, of and Government) is treated as a separate entry. To overcome this problem, the IBM models are used in two directions: from source to target and from target to source, after which a symmetrisation heuristic (Koehn et al., 2005) combines the alignments of both translation directions. Intersecting the two alignments results in an overall
alignment with a higher precision, while taking the union of the alignments results in an overall alignment with a higher recall. The default symmetrisation heuristic applied in Moses (grow-diag-final) starts from the intersection points and gradually adds alignment points of the union to link unaligned words that neighbour established alignment points. The main problem with the union and the grow-diag-final heuristics is that the gain in recall causes a substantial loss in precision, which poses a problem for applications such as terminology extraction in which precision is important.

Apart from the one-to-many word alignment problem, compounds also lead to data sparseness. The compounding process is highly productive and can create a potentially infinite number of valid Dutch words, which as a consequence occur infrequently in the data sets that are used to train the word alignment models. As terminology extraction systems typically work with much smaller corpora than machine translation systems, the problem of data sparseness becomes even more apparent.

A solution to overcome the problems of data sparseness and the one-to-many alignments is to split compounds into their component parts prior to word alignment (Koehn & Knight, 2003; Stymne & Holmqvist, 2008). However, the underlying assumption that compounds are translated compositionally is not always valid. The assumption holds for examples such as injectie + oplossing ‘solution for injection’, but not for doktervoorschrift ‘prescription’ or werkbank ‘workbench’. This observation led us to investigate a new approach in which we train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts. We then apply the normal intersection heuristics on both data sets, after which we merge all alignment points.
We then expand this model by adding alignment points from the grow-diag-final output of the word alignment model trained on the split compounds data set. An alignment point is added to the merged alignments if the following conditions are met:

–– The source alignment point is new (the source language is the language which is not subject to compound splitting in our experiments);
–– The target alignment point is a compound.

3.1 Data sets and experiments

To evaluate the impact of compound splitting (one- and two-level splitting) and the different word alignment scenarios, we used the two terminology extraction corpora described above, for which we have manual word alignments available.
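The merging procedure described before this subsection can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: alignment points are (source index, target index) pairs, `split_to_orig` is a hypothetical mapping from token positions in the split target sentence back to positions in the original target sentence, and "the source alignment point is new" is read as "the source word has no link yet in the merged intersection set".

```python
def project(points, split_to_orig):
    """Map target positions of the split corpus back to original positions."""
    return {(s, split_to_orig[t]) for s, t in points}


def merge_alignments(inter_orig, inter_split, gdf_split,
                     split_to_orig, compound_positions):
    """Merge two intersection sets, then enrich them with selected gdf points.

    inter_orig:         intersection alignment on the original data
    inter_split:        intersection alignment on the split-compound data
    gdf_split:          grow-diag-final alignment on the split-compound data
    compound_positions: original target positions that hold a compound
    """
    merged = set(inter_orig) | project(inter_split, split_to_orig)
    aligned_sources = {s for s, _ in merged}
    extra = {
        (s, t)
        for s, t in project(gdf_split, split_to_orig)
        if s not in aligned_sources and t in compound_positions
    }
    return merged | extra


# Toy sentence pair (indices only):
#   source tokens:          0 "starting", 1 "dose", 2 a further token
#   original target tokens: 0 "aanvangsdosis" (compound), 1 a non-compound token
#   split target tokens:    0 "aanvang", 1 "dosis", 2 the same non-compound token
split_to_orig = {0: 0, 1: 0, 2: 1}
inter_orig = {(1, 0)}
inter_split = {(1, 1)}
gdf_split = {(0, 0), (1, 1), (2, 2)}

result = merge_alignments(inter_orig, inter_split, gdf_split,
                          split_to_orig, compound_positions={0})
print(sorted(result))
```

In the toy example the link between source token 0 and the compound is added from the grow-diag-final set, whereas the link of source token 2 is rejected because its target is not a compound.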
To evaluate the system’s performance, we used the evaluation methodology of Och and Ney (2003), who introduced the following redefined precision and recall measures,

$$\text{precision} = \frac{|A \cap P|}{|A|}, \qquad \text{recall} = \frac{|A \cap S|}{|S|}$$

and the alignment error rate:

$$\text{AER}(S, P, A) = 1 - \frac{|A \cap P| + |A \cap S|}{|A| + |S|}$$
in which S refers to sure alignments, P to possible alignments (which also include the sure alignments) and A to the set of alignments generated by the system.

We built different word alignment systems and compared systems with no compound splitting (NC) with manual compound splitting (MC), level 1 (L1) and level 2 (L2), and automatic compound splitting (AC), level 1 and level 2. As a first experiment we use the methodology that is commonly used in machine translation and split compounds into their component parts prior to word alignment, after which we apply the different symmetrisation heuristics (intersection, union and grow-diag-final) provided in Moses. To avoid error percolation, we work with the manually split compounds. In Table 4 we see that the best precision, recall and AER scores are all obtained with the word alignment models after compound splitting, for both data sets, from which we can conclude that high-quality compound splitting improves word alignment quality.

Table 4. Precision, recall and AER on the medical and automotive data set trained on the original data (NC) and the data set in which compounds are split manually (MC)

                    Medical En-Nl             Automotive Fr-Nl
Setting             Prec.   Rec.    AER       Prec.   Rec.    AER
NC intersect        93.69   68.52   20.58     93.77   55.65   30.15
NC gdf              75.00   82.02   21.78     76.43   76.08   23.73
NC union            71.22   84.29   23.07     73.81   78.13   24.90
MC L1 intersect     93.91   70.36   19.03     95.09   58.74   27.37
MC L1 gdf           74.95   82.72   21.52     76.84   78.33   22.41
MC L1 union         70.78   84.58   23.24     74.30   79.65   23.21
MC L2 intersect     93.96   70.05   19.18     94.82   58.88   27.34
MC L2 gdf           74.50   82.43   21.89     77.41   78.98   21.82
MC L2 union         70.68   84.39   23.37     75.09   80.32   22.37
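For readers who want to recompute scores of this kind, the alignment measures of Och and Ney (2003) used in this section can be written down directly. The sketch below is only an illustration: the function name and the toy alignment sets are assumptions, and alignment points are represented as (source index, target index) pairs.

```python
def alignment_scores(sure, possible, predicted):
    """Return (precision, recall, AER) for a set of predicted alignment points.

    sure, possible, predicted: sets of (source_index, target_index) pairs.
    The possible set is extended with the sure points, since P includes S.
    """
    possible = set(possible) | set(sure)
    a_and_p = len(predicted & possible)
    a_and_s = len(predicted & sure)

    precision = a_and_p / len(predicted) if predicted else 0.0
    recall = a_and_s / len(sure) if sure else 0.0
    denominator = len(predicted) + len(sure)
    aer = 1.0 - (a_and_p + a_and_s) / denominator if denominator else 0.0
    return precision, recall, aer


# Toy example: two sure links, one additional possible link,
# and a system output with one sure hit, one possible hit and one error.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
predicted = {(0, 0), (2, 1), (3, 3)}

precision, recall, aer = alignment_scores(sure, possible, predicted)
print(round(precision, 3), round(recall, 3), round(aer, 3))
```

Because possible links count towards precision but not towards recall, a system can raise its precision with links that annotators marked as merely possible without changing its recall.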
Next we merge (MRG) all high-quality alignment points obtained by the intersection heuristic on both data sets (NC intersect and MC L1 intersect or MC L2 intersect). As can be seen in Table 5, merging the two sets of intersected alignment points improves recall and AER scores for both data sets compared to the intersection on the original data set (NC intersect in Table 4) or the intersection on the data set in which the compounds are split (MC L1 intersect or MC L2 intersect in Table 4). For the automotive data set, using a second level of compound splitting further improves the recall scores.

Table 5. Precision, recall and AER on the merged intersected alignment points (medical and automotive data set)

                    Medical En-Nl             Automotive Fr-Nl
Setting             Prec.   Rec.    AER       Prec.   Rec.    AER
MRG MC L1-NC        92.17   73.42   18.01     92.53   62.45   25.41
MRG MC L2-NC        93.96   70.05   19.18     92.25   63.01   25.12
Finally, as a lot of alignment points are still missing in the data, we add additional alignment points taken from the grow-diag-final set trained on the split compounds corpus, provided that they meet the requirements explained above (new alignment point for source word, target word is a compound). Adding these additional alignment points improves the recall scores for both data sets further, while marginally reducing precision.

The results in Tables 4, 5 and 6 demonstrate that high-quality (manual) compound splitting improves word alignment quality. We now repeat the experiments by using the automatically split compounds. Table 7 presents the results of the merged intersected alignment points (original data and automatically split data) enriched with alignment points taken from the grow-diag-final set of the automatically split data. On the medical data set, the obtained scores for one-level splitting approximate the scores of the manual compound splitting, while two-level splitting seems to work best for the automotive data set.

Table 6. Precision, recall and AER on the merged intersected alignment points enriched with alignment points taken from the grow-diag-final set trained on data sets in which the compounds were manually split (medical and automotive data set)

                        Medical En-Nl             Automotive Fr-Nl
Setting                 Prec.   Rec.    AER       Prec.   Rec.    AER
MRG + GDF MC L1-NC      89.46   74.49   18.50     88.81   68.92   22.38
MRG + GDF MC L2-NC      90.23   74.97   17.89     89.28   68.78   22.29
Table 7. Precision, recall and AER on the merged intersected alignment points enriched with alignment points taken from the grow-diag-final set trained on data sets in which the compounds were automatically split, without and with POS filtering (medical and automotive data set)

                          Medical En-Nl             Automotive Fr-Nl
Setting                   Prec.   Rec.    AER       Prec.   Rec.    AER
MRG + GDF AC L1-NC        90.46   73.74   18.52     90.13   66.45   23.49
MRG + GDF AC L2-NC        90.08   73.14   19.08     90.11   66.60   23.40
MRG + GDF AC L1-NC F      90.65   72.87   18.98     90.34   66.29   23.49
MRG + GDF AC L2-NC F      90.88   72.89   18.86     90.60   66.32   23.40
The lower part of the table (the settings marked F) presents the results when limiting compound splitting only to nouns and adjectives. This has a minor positive impact on the medical data set. As the automotive data set contains highly technical texts, the POS tagger probably introduces too many errors to be fully reliable. From the experiments with the automatically split compounds we can conclude that even with imperfect compound splitting, high-precision word alignment can be obtained with reasonable recall scores, even when trained on small parallel data sets.
4. Impact on terminology extraction

We evaluated the different word alignment scenarios in the TExSIS terminology extraction system, which is a more advanced version of the system described in Macken, Lefever, & Hoste (2013). The TExSIS system is a hybrid system that uses both linguistic and statistical information. The bilingual terminology extraction system first generates monolingual term lists for the source and target part of the extraction corpus, after which source and target terms are paired on the basis of word alignments.

The monolingual term extraction component produces a list of term candidates on the basis of predefined morpho-syntactic patterns. Two statistical filters are then used to create the final term list: Log-Likelihood ratio is applied on all single-word terms to filter out general vocabulary words; C-value (Frantzi & Ananiadou, 1999) is calculated for all multiword terms to determine unithood (Kageura & Umino, 1996).

The bilingual term extraction component uses the word alignments to pair source and target terms. Term pairs are valid if for all source and target content
words alignment points are found within the term pair and if there are no alignment points from words within the term pair to words outside the term pair. As such, the success of the term pairing process heavily depends on the quality of the word alignments. A high precision is extremely important to pair single-word terms, whereas a high recall is also important to pair multiword terms.

The TExSIS terminology extraction system integrates the word alignments of Moses described above. We created three baseline systems with TExSIS (without the compound splitter) using the three different symmetrisation heuristics, viz. intersection, grow-diag-final, and union. Compounds are problematic in this framework as they are often erroneously paired with a partial translation due to missing word alignments. A typical example is the erroneous term pair dose - aanvangsdosis, which should have been paired as starting dose - aanvangsdosis.

4.1 Experiments

To evaluate the impact of compound splitting on bilingual terminology extraction, we created Gold Standard bilingual term lists for the two domain-specific parallel corpora described above. As the aim of the Gold Standard term lists is to test the impact of compound splitting on the bilingual term extraction module, the term lists only contain valid term pairs, so source or target terms for which no valid counterpart is found in the translation are discarded. The English-Dutch medical bilingual term list contains 369 term pairs, of which 96 Dutch paired terms contain a compound (26%), and the French-Dutch automotive term list contains 1,909 term pairs, of which 1,109 Dutch paired terms contain a compound (58%).

We evaluated different word alignment scenarios in the TExSIS bilingual term extraction module and report precision, recall and the harmonic mean F. The results are presented in Table 8. As upper bound we used the manually created word alignments described above.
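The term-pair validity condition stated at the start of this section (every content word is aligned within the pair, and no alignment point leaves the pair) can be made concrete with a short Python sketch. It is an illustrative reading of that condition rather than the TExSIS implementation; the function name and the index-based representation of terms are assumptions.

```python
def is_valid_term_pair(src_span, tgt_span, alignments, src_content, tgt_content):
    """Check the term-pair condition on one aligned sentence pair.

    src_span, tgt_span:       token indices of the candidate source/target term
    alignments:               set of (source_index, target_index) pairs
    src_content, tgt_content: indices of content words in each sentence
    """
    src_span, tgt_span = set(src_span), set(tgt_span)

    # No alignment point may link a word inside the pair to a word outside it.
    for s, t in alignments:
        if (s in src_span) != (t in tgt_span):
            return False

    # Every content word inside the pair needs a link within the pair.
    aligned_src = {s for s, t in alignments if s in src_span and t in tgt_span}
    aligned_tgt = {t for s, t in alignments if s in src_span and t in tgt_span}
    return (src_span & set(src_content) <= aligned_src
            and tgt_span & set(tgt_content) <= aligned_tgt)


# Toy sentence pair: source 0 "starting", 1 "dose"; target 0 "aanvangsdosis".
content_src, content_tgt = {0, 1}, {0}

# With the link for "starting" missing, the partial pair is accepted.
sparse = {(1, 0)}
print(is_valid_term_pair({1}, {0}, sparse, content_src, content_tgt))
print(is_valid_term_pair({0, 1}, {0}, sparse, content_src, content_tgt))

# With both links present, only the full pair is accepted.
full = {(0, 0), (1, 0)}
print(is_valid_term_pair({1}, {0}, full, content_src, content_tgt))
print(is_valid_term_pair({0, 1}, {0}, full, content_src, content_tgt))
```

The toy example mirrors the dose - aanvangsdosis case: a single missing alignment point is enough to make the partial pair look valid and the full pair invalid, which is why alignment recall matters for multiword terms.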
On the medical data set, the upper bound precision score is 54.76 and the upper bound recall score is 57.72. Precision scores are higher on the automotive data set (66.93), but recall scores are lower (45.15). The upper bound figures demonstrate that (bilingual) terminology extraction is a difficult task. A manual inspection of the wrong and missed term pairs using the manual word alignments shows that most wrong term pairs are either terms that are not specific enough, such as the term pair ingredient - stof, or larger multiword terms that are not part of the Gold Standard data set as such, but whose parts are included in the Gold Standard data set, e.g. masse du piston - massa van de zuiger (the smaller parts masse - massa and piston - zuiger are included in the
Table 8. Term extraction results: precision, recall and F-score using different word alignment scenarios (medical and automotive data set)

                            Medical En-Nl             Automotive Fr-Nl
Setting                     Prec.   Rec.    F         Prec.   Rec.    F
Manual word alignments      54.76   57.72   56.20     66.93   45.15   53.93
NC intersect                48.52   48.78   48.65     47.78   29.91   36.79
NC gdf                      51.50   42.01   46.27     61.59   32.84   42.84
NC union                    53.23   37.94   44.30     65.62   30.59   41.73
MRG + GDF MC L1-NC          53.49   53.93   53.71     63.70   38.24   47.79
MRG + GDF MC L2-NC          52.80   53.66   53.23     65.23   38.82   48.67
MRG + GDF AC L1-NC          50.93   52.03   51.47     59.31   36.72   45.36
MRG + GDF AC L2-NC          51.19   52.57   51.87     59.93   36.83   45.62
MRG + GDF AC L1-NC F        51.77   51.49   51.63     59.46   37.04   45.64
MRG + GDF AC L2-NC F        50.80   51.76   51.28     59.75   37.24   45.89
Gold Standard). Missed term pairs are, among others, adjectives and verbs that are currently not extracted, e.g. spread - uitzaaien and unresectable - niet-operabel.

If we look at the results of the standard TExSIS system without compound splitting (NC intersect, NC gdf and NC union), we observe different behaviour on the two data sets. Intersection (NC intersect) yields the best results on the medical data set but the worst on the automotive data set. Substantial improvements can be achieved by using the proposed word alignment technique described above on the data set containing the manually split compounds (MRG + GDF MC L1/L2-NC). Two-level compound splitting gives the best overall results on the automotive data set. Automatic compound splitting also improves the results considerably. On both data sets best results are obtained using two-level compound splitting. Filtering on POS code only leads to a minor improvement on the automotive data set.

5. Conclusion

We described here a compound splitting method for Dutch, which uses frequency information and linguistic knowledge to determine the split points. To optimise the performance of the compound splitter on domain-specific data sets, we combine a dynamically compiled in-domain frequency list with the large static Wikipedia frequency list. To account for nested compounds, the compound splitter can split compounds at different levels. We experimented with one- and two-level splitting.
We developed a novel methodology to incorporate compound splitting in word alignment. Rather than choosing between data sets with or without split compounds, we train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts. We merge the intersected alignment sets to obtain high-precision alignment points, which are then further enriched by adding selected alignment points from the grow-diag-final set of the split compounds corpus.

The obtained (precise) word alignments are integrated in the TExSIS bilingual terminology extraction system. The novel word alignment technique substantially improves terminology extraction results if manually split compounds are used and considerably improves the results when the compounds are split automatically. As the compound splitting tool can still be improved by implementing a POS restriction so that the POS code of the tail matches the POS code of the compound and by allowing morphological operations on separable verb forms, we are confident that the results on terminology extraction can still be improved.

For machine translation purposes, one-level compound splitting is considered to be sufficient (Fritzinger & Fraser, 2010). In our experiments, two-level compound splitting led to the best results. In future work, we will implement a recursive call in the system and experiment with all possible levels. We will also evaluate whether machine translation benefits from our novel word alignment method.
References

Baayen, R. H., Piepenbrock, R., & van Rijn, H. (1993). The CELEX lexical database on CD-ROM. Philadelphia, PA: Linguistic Data Consortium.
Brown, P. F., Della Pietra, V. J., Della Pietra, S. A., & Mercer, R. L. (1993). The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics, 19(2), 263–311.
Frantzi, K., & Ananiadou, S. (1999). The C-value / NC-value domain independent method for multiword term extraction. Journal of Natural Language Processing, 6(3), 145–179. doi: 10.5715/jnlp.6.3_145
Fritzinger, F., & Fraser, A. (2010). How to avoid burning ducks: combining linguistic analysis and corpus statistics for German compound processing. In Proceedings of the ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR (pp. 224–234). Uppsala, Sweden.
Kageura, K., & Umino, B. (1996). Methods of automatic term recognition. A review. Terminology, 3(2), 259–289. doi: 10.1075/term.3.2.03kag
Koehn, P., Axelrod, A., Birch Mayne, A., Callison-Burch, C., Osborne, M., & Talbot, D. (2005). Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation: Evaluation Campaign on Spoken Language Translation (IWSLT 2005). Pittsburgh, PA, USA.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL 2007 Demo and Poster Sessions (pp. 177–180). Prague, Czech Republic.
Koehn, P., & Knight, K. (2003). Empirical methods for compound splitting. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003) (pp. 187–193). Budapest, Hungary.
Lefever, E., Macken, L., & Hoste, V. (2009). Language-independent bilingual terminology extraction from a multilingual parallel corpus. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (pp. 496–504). Athens, Greece.
Macken, L., De Clercq, O., & Paulussen, H. (2011). Dutch Parallel Corpus: a Balanced Copyright-Cleared Parallel Corpus. Meta, 56(2), 374–390. doi: 10.7202/1006182ar
Macken, L., Lefever, E., & Hoste, V. (2013). TExSIS. Bilingual terminology extraction from parallel corpora using chunk-based alignment. Terminology, 19(1), 1–30. doi: 10.1075/term.19.1.01mac
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51. doi: 10.1162/089120103321337421
Parra Escartín, C. (2014). Chasing the Perfect Splitter: A Comparison of Different Compound Splitting Tools. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) (pp. 3340–3347). Reykjavik, Iceland.
Réveil, B., & Martens, J.-P. (2008). Reducing speech recognition time and memory use by means of compound (de-)composition. In Proceedings of the Annual Workshop on Circuits, Systems and Signal Processing (ProRISC 2008) (pp. 348–352). Utrecht, The Netherlands.
Stymne, S., & Holmqvist, M. (2008). Processing of Swedish compounds for phrase-based statistical machine translation. In Proceedings of the 12th annual conference of the European Association for Machine Translation (EAMT 2008) (pp. 182–191). Hamburg, Germany.
part 3
Identification and translation of multiword units
A flexible framework for collocation retrieval and translation from parallel and comparable corpora

Oscar Mendoza Rivera, Ruslan Mitkov & Gloria Corpas Pastor
Research Group in Computational Linguistics, University of Wolverhampton
This paper outlines a methodology and a system for collocation retrieval and translation from parallel and comparable corpora, developed with translators and language learners in mind. It is based on a phraseology framework, applies statistical techniques, and employs source tools and online resources. The collocation retrieval and translation has proved successful for English and Spanish and can be easily adapted to other languages. The evaluation results are promising and future goals are proposed. Furthermore, conclusions are drawn on the nature of comparable corpora and how they can be better exploited to suit particular needs of target users.

Keywords: collocation retrieval, collocation translation, parallel corpora, comparable corpora, phraseology
doi 10.1075/cilt.341.08riv © 2018 John Benjamins Publishing Company
1. Introduction
Multiword expressions (MWEs) are lexical units made up of several words, at least one of which is restricted by linguistic conventions. One example is the expression fast food, in which the choice of the word fast is arbitrary, in that it cannot be replaced with synonyms such as quick, speedy or rapid. It is thought that a significant part of a language’s vocabulary is made up of these expressions: as noted by Biber et al. (1999), MWEs account for between 30% and 45% of spoken English and 21% of academic prose, while Jackendoff (1997) goes as far as to claim that their estimated number in a lexicon is of the same order of magnitude as its number of single words. Furthermore, these numbers are probably underestimated: MWEs appear in all text genres, but specialised domain vocabulary, such as terminology, “overwhelmingly consists of MWEs” (Sag et al., 2002:2).
Collocations represent the highest proportion of MWEs (Lea and Runcie, 2002; Seretan, 2011). As such, collocation retrieval has sparked interest in the NLP community (Smadja, 1993; Sag et al., 2002; Lü & Zhou, 2004; Sharoff et al., 2009; Gelbukh & Kolesnikova, 2013). Several methods have been adopted to measure the association strength of collocations, and these have achieved favourable results with increases in accuracy (Seretan, 2011). However, a much more limited number of studies have dealt with the post-processing of collocations from the perspective of their practical use. Collocation translation, for instance, while a natural follow-up to collocation extraction in this line of research, still poses a problem for computational systems (Seretan, 2011). Furthermore, while several collocation resources have been put together, such as the multilingual collocation dictionary MultiCoDiCT (Cardey et al., 2006), approaches to collocation retrieval and translation lack, in general, the solid theoretical basis of phraseology (Corpas Pastor, 2013).
To address this problem, the present paper describes the development and implementation of a computational tool that allows language learners and translators to retrieve collocations in a source language (SL) and their translations in a target language (TL) from bilingual parallel and comparable corpora. The project focuses on English and Spanish, but the methodology is designed to be flexible enough to be applied to other pairs of languages as well.
The remainder of this paper is organised as follows: Section 2 discusses the phraseology basis of our project and presents two collocation typologies (one for English and one for Spanish) as well as a comparative grammar. Section 3 provides a brief review of existing techniques for the extraction and translation of collocations. Section 4 presents a new methodology and outlines the implementation of a computational tool based on it. Finally, Section 5 details the results of the experiments set up to evaluate our system, and discusses opportunities for future work.
2. Phraseology
Collocations are compositional and statistically idiomatic MWEs (Baldwin & Kim, 2010).
Like idioms, collocations belong to the set phrases of a language. However, while the meaning of an idiom is generally incomprehensible if not previously heard in context (to pay through the nose, cold turkey), collocations are compositional: their meanings can be deduced from the meanings of their component words (to pay attention, roast turkey). The choice of those component words is nevertheless arbitrary. For example, the expression I did my homework is correct in English, but the expression I made my homework is not. The choice of the verb to do rather than the verb to make in this particular example can be thought of as an arbitrary convention. In addition, some collocates exhibit delexical and metaphorical meanings (to make an attempt, to toy with an idea). Collocations are also cohesive lexical clusters.
This means that the presence of one or several component words of a collocation in a phrase often suggests the existence of the remaining component words of that collocation. This property attributes particular statistical distributions to collocations (Smadja, 1993). For example, in a sample text containing the words bid and farewell, the probability of the two words appearing together is higher than would be expected if each occurred independently of the other.
2.1 Typologies of collocations
Hausmann (1985) argued that the components of a collocation are hierarchically ordered: while the base can be interpreted outside of the context of a collocation and can therefore be considered as semantically autonomous, the collocatives depend on the base in order to get their full meaning. He also presented a typology of collocations in English based on their syntax (see Table 1). Similarly, Corpas Pastor (1995, 1996) studied, classified, and contrasted collocations for Spanish and English and proposed her own typology of collocations in these two languages (see Tables 1 and 2). These tables show the base of collocations in bold and use abbreviations borrowed from the tag set of TreeTagger (Schmid, 1994): VB stands for verb, NN for noun, RB for adverb, JJ for adjective, and IN for preposition. Furthermore, these typologies have been helpful in the development of this project’s underlying methodology to extract collocations (see Sections 4.1 and 4.2).
Table 1. Typology of collocations in English
Type                          Examples
1. VB + NN (direct object)    to express concern, to bid farewell
2. NN or JJ + NN              traumatic experience, copycat crime
3. NN + of + NN               pinch of salt, pride of lions
4. RB + JJ                    deadly serious, fast asleep
5. VB + RB                    to speak vaguely, to sob bitterly
6. VB + IN + NN               to take into consideration, to jump to a conclusion
7. VB + NN (subject)          to break out, to crow
2.2 Transfer rules
Bradford and Hill (2000) studied the differences between the grammars of English and Spanish. Based on their work, we have developed a set of transfer rules (see Table 3) between these two languages which help us translate collocations (see Section 4.4).
Table 2. Typology of collocations in Spanish
Type                          Examples
1. VB + NN (direct object)    conciliar el sueño, entablar conversación
2. NN + JJ or NN              lluvia torrencial, visita relámpago
3. NN + de + NN               grano de arroz, enjambre de abejas
4. RB + JJ                    profundamente dormido, estrechamente relacionado
5. VB + RB                    trabajar duro, jugar sucio
6. VB + IN + NN               tomar en consideración, poner a prueba
7. VB + NN (subject)          ladrar, estallar
Table 3. English-Spanish syntax comparison
English             Spanish
VB + NN             VB + NN
NN or JJ + NN       NN + JJ or NN
NN + of + NN        NN + de + NN
RB + JJ             RB + JJ
VB + RB             VB + RB
VB + IN + NN        VB + IN + NN
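The correspondences in Table 3 lend themselves to a simple lookup structure. The following Python fragment is a minimal sketch of that idea and not the implementation used in the system: the names TRANSFER_RULES and apply_transfer, the encoding of word order as index tuples, and the tiny word-for-word glossary are all assumptions made for illustration.

```python
# Illustrative encoding of Table 3. Each English POS pattern maps to the
# Spanish pattern and to the order in which the English slots are emitted.
# "of" and "de" are literal words rather than POS tags.
TRANSFER_RULES = {
    ("VB", "NN"): (("VB", "NN"), (0, 1)),
    ("JJ", "NN"): (("NN", "JJ"), (1, 0)),   # modifier follows the noun
    ("NN", "NN"): (("NN", "NN"), (1, 0)),   # head noun comes first
    ("NN", "of", "NN"): (("NN", "de", "NN"), (0, 1, 2)),
    ("RB", "JJ"): (("RB", "JJ"), (0, 1)),
    ("VB", "RB"): (("VB", "RB"), (0, 1)),
    ("VB", "IN", "NN"): (("VB", "IN", "NN"), (0, 1, 2)),
}


def apply_transfer(words, pattern, glossary):
    """Reorder word-for-word translations according to Table 3.

    Returns None when no rule covers the pattern or a word has no gloss,
    which is how exceptions to the rules would surface in this sketch.
    """
    rule = TRANSFER_RULES.get(tuple(pattern))
    if rule is None:
        return None
    _target_pattern, order = rule
    translated = [glossary.get(words[i]) for i in order]
    if any(word is None for word in translated):
        return None
    return translated


# Hypothetical glossary, for demonstration only.
glossary = {"torrential": "torrencial", "rain": "lluvia",
            "grain": "grano", "of": "de", "rice": "arroz"}
print(apply_transfer(["torrential", "rain"], ["JJ", "NN"], glossary))
print(apply_transfer(["grain", "of", "rice"], ["NN", "of", "NN"], glossary))
```

Keeping the rules as data rather than as branching code would make it straightforward to add a further language pair by supplying another table.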
It is worth noting that these transfer rules are designed to aid us in our own approach to the task of syntactic processing, but they are not all-inclusive. As is often the case, there are exceptions to the rules. For example, English collocations such as copycat crime (delito inspirado en uno precedente or que trata de imitarlo in Spanish) and to commit suicide (suicidarse in Spanish) cannot be translated using the proposed approach.
3. Related work
This section presents a brief review of existing techniques for the extraction and translation of collocations. It starts by outlining collocation extraction and then moves to translation.
3.1 Collocation retrieval
Early work on collocation extraction focused on statistical processing. Choueka et al. (1983) developed an approach to retrieve sequences of words whose co-occurrence frequency in their corpora exceeded a threshold. Similarly, Church & Hanks (1989) proposed a correlation method based on the notion of mutual information. Smadja (1993), however, highlighted the importance of combining statistical and linguistic methods. In recent years, advances have been made (Ramisch et al., 2010; Seretan, 2011), many of them advocating rule-based and hybrid approaches (Hoang, Kim & Kan, 2009), and many of them based on language-specific syntactic structures (Santana et al., 2011) or the machine learning of lexical functions (Gelbukh & Kolesnikova, 2013).
3.2 Parallel corpora
Classic approaches to translation using parallel corpora exploited the concepts of alignment and correspondence at sentence level (Brown et al., 1991; Gale & Church, 1993). Two methods were developed: length-based and translation-based (Varga et al., 2005). Collocation translation using parallel corpora has also been approached using transfer systems that rely on generative grammars, because of the notion that the base of a collocation determines its collocatives (Wehrli et al., 2009) and the assumption that source and target MWEs share their syntactic relation (Lü & Zhou, 2004).
3.3 Comparable corpora
Parallel resources are generally scarce and in many cases not available at all. The wider availability of comparable texts offers new opportunities to both researchers and translators. While these do not allow for bridging between languages (Sharoff et al., 2009), research suggests (Rapp, 1995) that a word is closely associated with words in its context and that the association between a base and its collocatives is preserved across languages. Fung & Yuen (1998), for instance, argued that the first clue to the similarity between a word and its translation is the number of common words in their contexts. Similarly, Sharoff et al. (2009) proposed a methodology that relies on similarity classes.
4. System
The system¹ employs the following three language-independent tools: TreeTagger to POS-tag corpora, the MWE Toolkit (Ramisch et al., 2010) to extract collocations according to specific POS-patterns, and Hunalign (Varga et al., 2005) to align corpora at sentence level. Furthermore, it connects online to WordReference and uses it as a multilingual translation dictionary and thesaurus. Figure 1 illustrates the architecture of the system; its main modules are described in greater detail in the following sections.
¹ The system consists of a series of Python scripts which handle text and XML representations, and was implemented using the wxPython development environment for Mac OS X.
Figure 1. Architectural scheme of the system
4.1 Candidate selection module
This module processes the SL corpus in order to format it to comply with the input requirements of the modules that follow in the system pipeline. It represents the linguistic component of the hybrid approach to collocation retrieval. It makes use of both TreeTagger and the MWE Toolkit to perform linguistic pre-processing in the form of lemmatisation and POS-tagging on the input data, as well as POS-pattern definition.
Linguistic processing aims at transforming the input data from a stream of alphanumeric characters into sequences of words, which can be grouped into n-grams. It is important to work with lemmas instead of inflected words in order to identify collocations; otherwise, surface forms such as committing murder and committed murder would be treated as separate collocations, even though both are instances of the same one (whose lemma is commit murder). The system relies on TreeTagger to annotate text sentences with both lemma and POS-tagging information. Its output is then transformed into the XML format (see Figure 2) by running a Python script that is part of the MWE Toolkit.
Figure 2. Sample XML output of TreeTagger
POS-pattern definition aims at applying syntactic constraints on collocation candidates. This stage is language-dependent: as long as a language can be POS-tagged and a typology of the most commonly occurring collocations exists for it, POS-patterns can be defined. This task is simplified because the MWE Toolkit supports the definition of syntactic patterns of collocations to extract. These can include repetitions, negation, and optional elements, much like regular expressions (see Figure 3, a definition of the English POS-pattern NN or JJ + NN). When retrieving collocations, each sentence in the corpus is matched against this set of patterns, and all n-grams which do not comply with any of them are ignored. Patterns that correspond exactly to the typologies of collocations in English and Spanish presented above have been defined (see Section 2.1).
Figure 3. Example of POS-pattern definition
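To make the effect of lemmatisation and POS-pattern filtering concrete, the following Python fragment gives a minimal sketch of candidate selection over already tagged sentences. It does not use or reproduce the MWE Toolkit's pattern format; the function name extract_candidates, the restriction to adjacent words, and the hand-tagged sample sentences are assumptions made for illustration.

```python
from collections import Counter

# Adjacent-word versions of the English patterns in Table 1 (types 1 to 6).
PATTERNS = {
    ("VB", "NN"), ("NN", "NN"), ("JJ", "NN"),
    ("RB", "JJ"), ("VB", "RB"), ("VB", "IN", "NN"),
}


def extract_candidates(sentence):
    """Return lemma n-grams whose tag sequence matches a defined pattern.

    `sentence` is a list of (lemma, tag) pairs, as a tagger might produce.
    """
    found = []
    for n in (2, 3):
        for start in range(len(sentence) - n + 1):
            gram = sentence[start:start + n]
            lemmas = tuple(lemma for lemma, _ in gram)
            tags = tuple(tag for _, tag in gram)
            # NN + of + NN carries a lexical constraint on the middle word.
            is_of_pattern = tags == ("NN", "IN", "NN") and lemmas[1] == "of"
            if tags in PATTERNS or is_of_pattern:
                found.append(lemmas)
    return found


# Hand-tagged sample sentences (hypothetical tagger output).
SENTENCES = [
    [("he", "PP"), ("commit", "VB"), ("murder", "NN")],        # committed murder
    [("commit", "VB"), ("murder", "NN"), ("be", "VB"), ("wrong", "JJ")],  # committing murder
    [("a", "DT"), ("pinch", "NN"), ("of", "IN"), ("salt", "NN")],
]

counts = Counter()
for sentence in SENTENCES:
    counts.update(extract_candidates(sentence))
print(counts.most_common())
```

Because the n-grams are built from lemmas, the two inflected forms of commit murder are counted as one candidate, which is the behaviour motivated in the text above.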
4.2 Candidate filtering module
This module computes collocation candidates and assigns a weight to each of these according to its probability of representing a collocation. It corresponds to the statistical component of our hybrid approach to collocation retrieval and relies on the MWE Toolkit to perform n-gram selection and statistical processing. The toolkit receives two XML files as input: a representation of all sentences in the corpus with all words described by linguistic properties (see Figure 2), and a set of user-defined POS-patterns (see Figure 3). It performs n-gram selection by matching each sentence in the corpus against all defined POS-patterns, producing a set of collocation candidates. Once candidates have been extracted, it performs statistical processing by computing the frequencies of each candidate’s word components from the SL corpus. This information is used to calculate a log-likelihood score for each candidate. Candidates are then ranked according to their scores. Figure 4 presents a sample collocation candidate in English. As can be observed, the toolkit extracts not only the lemma form of a collocation (lemon drop), but also the different surface forms in which it appears (lemon drops).
Figure 4. Sample collocation candidate
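The ranking step can be illustrated with the standard log-likelihood ratio for a two-word candidate, computed from a 2×2 contingency table of observed against expected frequencies. The Python fragment below is a sketch under that assumption; the exact formula and options used by the MWE Toolkit are not specified in this paper, and the frequency counts are invented for demonstration.

```python
import math


def log_likelihood(joint, first, second, total):
    """Log-likelihood ratio (G2) for a bigram.

    joint:  frequency of the two words together
    first:  frequency of the first word
    second: frequency of the second word
    total:  number of bigrams in the corpus
    """
    observed = [
        [joint, first - joint],
        [second - joint, total - first - second + joint],
    ]
    row_totals = [first, total - first]
    col_totals = [second, total - second]
    score = 0.0
    for i in range(2):
        for j in range(2):
            if observed[i][j] > 0:
                expected = row_totals[i] * col_totals[j] / total
                score += observed[i][j] * math.log(observed[i][j] / expected)
    return 2 * score


# Hypothetical counts: (joint, first word, second word, corpus size).
counts = {
    ("lemon", "drop"): (30, 40, 35, 10000),
    ("lemon", "thing"): (2, 40, 35, 10000),
}
ranked = sorted(counts, key=lambda c: log_likelihood(*counts[c]), reverse=True)
for candidate in ranked:
    print(candidate, round(log_likelihood(*counts[candidate]), 1))
```

A score close to zero indicates that the two words co-occur about as often as chance would predict, so candidates of that kind end up at the bottom of the ranking.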
4.3 Dictionary look-up module
This module connects to the online translation dictionary WordReference to attempt a direct translation of a collocation from its SL into its TL. WordReference translation entries include two tables: one for one-word direct translations (principal translations), and another for translations of MWEs (compound forms). Furthermore, the dictionary lists its translation entries in order from the most common to the least common. A Python script was written to handle the connection to the WordReference API. Our task is to look at the compound forms table and attempt to find a match for our collocation. If such a match is found, its translation is extracted from the HTML and presented to the user. If no match is found, translation is based on the bilingual corpora provided by the user as input, triggering the parallel corpora or the comparable corpora module accordingly.
4.4 Parallel corpora module
This module first employs Hunalign to align the input corpora. Next, after semantic processing and syntactic processing, transfer rules are applied in order to identify the TL translations of all collocations extracted from the SL. A sample output of Hunalign is presented in Figure 5: the first column refers to a sentence number in the SL corpus, the second column refers to a sentence number in the TL corpus, and the third column represents a confidence value, or the estimated certainty of the SL-TL pairing.
101    96     0.336927
102    97     0.583117
103    98     0.228
104    99     0.229412
105    100    0.226056
Figure 5. Sample Hunalign output
Semantic processing consists in identifying the base of a collocation in the SL and finding its translation in the TL. The POS-tags of the components of the collocation (see Figure 4) help to determine its base completely. This is because the POS-pattern of the collocation should adhere to one of the set of POS-patterns defined previously (see Figure 3). Next, the components representing the base of the collocation are identified following their linguistic model (see Section 2.1). Finally, WordReference is employed to retrieve the first three translation entries that match the POS-tag of our base from its principal translations table.
Similarly, syntactic processing consists in finding the translations of the collocatives in the TL corpus. It requires the output of the candidate filtering module, which is an XML file containing a set of SL collocations (see Figure 4), and that of Hunalign presented above (see Figure 5). It also requires, as input, the TL corpus, which is a translation of the SL corpus. We implemented an algorithm that first reads the SL corpus and finds all sentences where a collocation appears, and then performs these tasks for each of the retrieved SL sentences:
– Read the output of Hunalign and match the SL sentence to its TL counterpart, where the translation of the collocation should appear.
– Expand this TL sentence to a window of five sentences to be extracted and analysed, to make up for any Hunalign precision error.
– For each of the TL translations of the collocation’s base, obtained during semantic processing, go through our window of sentences, one sentence at a time, and look for the presence of the translation within it. If a match is found, it means the translation of the collocatives in the TL should also be present within the sentence.
– POS-tag the matching TL sentence using TreeTagger.
– Apply a transfer rule (see Table 3) to obtain the translation of the collocatives in the TL.
4.5 Comparable corpora module
This module computes similarity classes in order to find the TL translations of all extracted SL collocations via query expansion, query translation, and context generalisation.
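Before turning to the three stages of this module, the first three steps of the sentence-window search listed in Section 4.4 can be sketched in Python as follows. This is an illustration rather than the system's code: the function names are hypothetical, the alignment is assumed to be available as whitespace-separated three-column lines as in Figure 5, the window is assumed to be centred on the aligned sentence, and matching is done on lower-cased surface tokens rather than lemmas.

```python
# Sample alignment lines copied from Figure 5.
ALIGNMENT_TEXT = """\
101 96 0.336927
102 97 0.583117
103 98 0.228
104 99 0.229412
105 100 0.226056
"""


def parse_alignment(text):
    """Map each SL sentence number to (TL sentence number, confidence)."""
    alignment = {}
    for line in text.strip().splitlines():
        source, target, confidence = line.split()
        alignment[int(source)] = (int(target), float(confidence))
    return alignment


def sentence_window(centre, corpus_length, size=5):
    """Indices of a window of `size` sentences centred on `centre`."""
    half = size // 2
    return list(range(max(0, centre - half),
                      min(corpus_length, centre + half + 1)))


def find_target_sentence(sl_index, alignment, tl_sentences, base_translations):
    """Return (TL sentence index, matched base translation) or None."""
    if sl_index not in alignment:
        return None
    target_index, _confidence = alignment[sl_index]
    window = sentence_window(target_index, len(tl_sentences))
    for translation in base_translations:      # one candidate at a time
        for index in window:                   # one sentence at a time
            if translation in tl_sentences[index].lower().split():
                return index, translation
    return None


# Invented TL corpus in which the match lies one sentence off the alignment.
tl_corpus = ["..."] * 101
tl_corpus[99] = "Harry prestó atención"
alignment = parse_alignment(ALIGNMENT_TEXT)
print(find_target_sentence(103, alignment, tl_corpus, ["atención"]))
```

In this toy run the aligner pairs SL sentence 103 with TL sentence 98, while the base translation actually occurs in TL sentence 99; the five-sentence window is what allows the match to be recovered.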
Query expansion produces a generalisation of the SL collocation’s context by computing two different similarity classes, one centred on the base of the collocation, and another on its context (two open-class words that appear before it, and two after it). Computing similarity classes requires the use of a thesaurus. For English, we use WordNet, and obtain the first five synsets of the same POS-tag of any given open-class word. As for Spanish, WordReference is used. Our first similarity class, the one centred on the base of the collocation, will thus consist of up to six words: the original base itself and up to five synonyms. Correspondingly, our second similarity class will consist of up to 24 words: the four context words we retrieved, and up to five synonyms for each of them.
Next in the pipeline is query translation, which computes a translation class, an expansion of the target language translations of the words that make up our original similarity class. Here again, we rely on WordReference as our de facto bilingual dictionary and thesaurus. For each of our two similarity classes, we iterate through all of their words, look up each via the WordReference API and retrieve up to five translation entries that match their POS-tags, and then further expand these by retrieving up to five thesaurus entries for each. This means that our first translation class, the one centred on the base of the collocation, will contain up to 30 translations for each of the (up to) six words of its similarity class, which totals up to 180 words. Similarly, our second translation class, centred on context words, will contain up to 720 words.
Finally, context generalisation aims at finding TL translations of an SL collocation by comparing context similarities. We first determine the POS-pattern of our SL collocation, and then check whether any of the words in the translation class of its base corresponds to the base of any of the TL collocations of the same POS-pattern. If a match is found, we compute a similarity class for the context of the matched TL collocation and check whether it has any elements in common with the context of the SL collocation. If it does, we present it to the user as a potential translation of the collocation from the original text.
5. Evaluation
The choice of our experimental corpora was based entirely on the profiles of the target users of our system: language learners and translators. Reading in a target language is an integral component of any language-learning process. We chose Harry Potter and the Philosopher’s Stone and its translation into Spanish, Harry Potter y la Piedra Filosofal, to exemplify this. Similarly, professional translators usually specialise in a certain domain of translation, and therefore must translate technical terminology on a regular basis. Thus, we chose the Ecoturismo corpus,² a collection of multilingual parallel and comparable corpora on tourism and tourism law, as it represents a real-life example of the technical documents a translator works with.
² Compiled in the framework of the R&D project Espacio Único de Sistemas de Información Ontológica y Tesaurus sobre el Medio Ambiente: ECOTURISMO (Spanish Ministry of Education, FFI2008–06080-C03–03).
5.1 Experimental setup
Two bilingual annotators, fluent in English and Spanish, reviewed the output of our system after processing both experimental corpora. They assigned a score to the translations the system offered for each collocation according to a five-point scale (with 5 representing an excellent translation). Precision and recall are estimated from these scores for each case study.
5.2 Experimental results
A total of 100 English collocations were retrieved from the Harry Potter corpus. Twelve collocations, such as total rubbish, to speak calmly, fast asleep, and to lean against the wall, were successfully translated directly using WordReference. Out of the remaining 88 collocations, ten could not be translated at all, and 78 were translated using our approach to processing parallel corpora. Table 4 summarises these results (A stands for annotator, WR for WordReference, and AVG for average).
Table 4. Parallel corpora result scores
A     WR    1    2    3     4     5     AVG
#1    12    0    0    11    16    51    4.51
#2    12    0    0    8     17    53    4.58
As can be observed, we obtained a high average score of 4.55 for the quality of translations retrieved from parallel corpora. Moreover, only ten collocations out of the original 100 could not be translated, yielding an equally high recall of 90%. Similarly, 100 Spanish collocations were retrieved from the Ecoturismo corpus. Fifteen of them were translated using WordReference; all of these were of the Spanish POS-patterns NN + JJ or NN + de + NN, such as transporte público, asistencia técnica, and viaje de negocios. Out of the 85 remaining collocations, 15 could not be translated at all, and for the other 70, translation suggestions were found in the comparable corpora. Table 5 summarises the results.
Table 5. Comparable corpora result scores
A     WR    1    2     3     4     5     AVG
#1    15    7    14    19    17    13    3.21
#2    15    8    15    21    15    11    3.09
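The per-annotator averages in Tables 4 and 5 can be reproduced from the score counts as weighted means, and the reported recall figures follow from the number of collocations left untranslated. The short Python check below is offered as a worked example of that arithmetic; the function names are illustrative and the data are copied from the two tables.

```python
def mean_score(counts):
    """Weighted mean of a {score: number of collocations} distribution."""
    total = sum(counts.values())
    return sum(score * n for score, n in counts.items()) / total


def coverage(retrieved, untranslated):
    """Share of retrieved collocations that received a translation."""
    return (retrieved - untranslated) / retrieved


# Score counts per annotator, copied from Tables 4 and 5.
parallel = {
    "#1": {1: 0, 2: 0, 3: 11, 4: 16, 5: 51},
    "#2": {1: 0, 2: 0, 3: 8, 4: 17, 5: 53},
}
comparable = {
    "#1": {1: 7, 2: 14, 3: 19, 4: 17, 5: 13},
    "#2": {1: 8, 2: 15, 3: 21, 4: 15, 5: 11},
}

for name, table in (("parallel", parallel), ("comparable", comparable)):
    for annotator, counts in table.items():
        print(name, annotator, sum(counts.values()), round(mean_score(counts), 2))

print(coverage(100, 10), coverage(100, 15))
```

Each annotator's counts sum to the number of corpus-translated collocations (78 and 70 respectively). The overall figure of 3.15 for the comparable corpora is the mean of the two annotator averages; for the parallel corpora, the reported 4.55 corresponds to the mean of the two rounded averages (4.545), whereas averaging the unrounded values gives approximately 4.54.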
Despite the lower average score of 3.15 for the quality of translations, we managed to provide translation suggestions for 85% of the collocations. We can conclude that by imposing flexible constraints on the matching process performed during the task of context generalisation, we obtain translations of average quality for a high number of collocations. These constraints refer to the size of our context window and the number of thesaurus entries we retrieve for each original word during query expansion. Improving our precision score would mean strengthening these constraints, but this would also result in lower recall. Moreover, in this particular case, recall is more relevant than precision because our suggested translations, even if not always excellent, might offer translators a useful hint for correctly translating collocations.
5.3 Discussion and future work
Against the background of the limitations of the current version of our system, we propose the following future improvements. First, we exploit the nature of collocations as cohesive lexical clusters, but disregard the linguistic property of semantic idiomaticity that differentiates them from other MWEs, such as idioms. Our system cannot, therefore, differentiate between collocations and other MWEs in terms of compositionality. Secondly, we would like to provide better integration between the stages of collocation extraction and collocation translation. Currently, the former relies on TreeTagger and the MWE Toolkit, while the latter makes use of Hunalign. This means that all users also need access to these three tools; this poses no significant problem because all of them are open source and readily available online, but it would be simpler to integrate the tasks performed by these tools into our system in order to increase its ease of use. Finally, we would like to investigate the use of the web as a corpus in order to find effective ways of using the information offered by search engines.
The expected final users of our system are professional translators and language learners. However, as mentioned previously, further fine-tuning of the system might be worthwhile in order to better address the specific needs of these particular user groups. Working with comparable corpora is not highly reliable because of their noisy nature. We opted to impose flexible constraints on the matching process performed during the last stage of comparable corpora processing, context generalisation, in order to increase the recall of our system. As stated before, this is better suited to translators, who could benefit from the translation suggestions offered by our system to find the most adequate translation of a collocation. Language learners, however, are probably more interested in learning very precise translations for several collocations, rather than translation suggestions for a large number of collocations. A way forward would be to adjust the comparable corpora algorithm so that it can impose stronger constraints during the task of context generalisation, to the benefit of language learners. Future research goals could include (1) providing better integration between the different stages of the project, (2) finding a way to further exploit the use of the web as a corpus to aid in the processes of collocation retrieval and translation, (3) demonstrating the flexibility of our framework by adjusting our system to work with several other languages, and (4) tailoring the constraints imposed by our system to better meet the needs of our final users.
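One way to picture such adjustable constraints is sketched below in Python. The first function recomputes the upper bounds on class sizes given in Section 4.5 from the retrieval limits; the second applies the context generalisation test with a tunable overlap threshold. All names, the min_overlap parameter, the use of the type numbers of Tables 1 and 2 to compare patterns across languages, and the toy data are assumptions made for illustration and are not part of the system as described.

```python
def class_size_bounds(synonyms=5, translations=5, thesaurus=5, context_words=4):
    """Upper bounds on class sizes as a function of the retrieval limits."""
    per_word = translations * (1 + thesaurus)      # 5 entries, each expanded by 5
    base_similarity = 1 + synonyms                 # the base plus its synonyms
    context_similarity = context_words * (1 + synonyms)
    return {
        "base_similarity": base_similarity,
        "context_similarity": context_similarity,
        "base_translation": base_similarity * per_word,
        "context_translation": context_similarity * per_word,
    }


def suggest_translations(sl_type, base_translations, context_translations,
                         tl_collocations, min_overlap=1):
    """Context generalisation with an adjustable strictness.

    min_overlap=1 mirrors the flexible setting (any shared context word);
    larger values trade recall for precision.
    """
    suggestions = []
    for collocation in tl_collocations:
        if collocation["type"] != sl_type:
            continue
        if collocation["base"] not in base_translations:
            continue
        shared = context_translations & collocation["context_class"]
        if len(shared) >= min_overlap:
            suggestions.append(collocation["text"])
    return suggestions


# Toy data for the Spanish collocation lluvia torrencial (type 2 in Table 2).
tl_collocations = [
    {"text": "torrential rain", "type": 2, "base": "rain",
     "context_class": {"storm", "flood", "night"}},
    {"text": "heavy rain", "type": 2, "base": "rain",
     "context_class": {"traffic"}},
]
base_translations = {"rain", "rainfall"}
context_translations = {"storm", "wind", "flood"}

print(class_size_bounds())
print(suggest_translations(2, base_translations, context_translations, tl_collocations))
print(suggest_translations(2, base_translations, context_translations,
                           tl_collocations, min_overlap=3))
```

With the default limits the bounds come out at 6, 24, 180 and 720 words, matching the figures in Section 4.5. Raising min_overlap, or lowering the retrieval limits, would be one simple way to offer language learners fewer but more reliable suggestions.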
Acknowledgements
This project was supported by the European Commission, Education & Training, Erasmus Mundus: EMMC 2008–0083, Erasmus Mundus Masters in NLP & HLT programme and also partially supported by the LATEST (Ref: 327197-FP7-PEOPLE-2012-IEF) project.
References
Baldwin, T., & Kim, S. N. (2010). Multiword Expressions. In Handbook of Natural Language Processing (2nd ed.). Boca Raton, FL.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Grammar of spoken and written English. Edinburgh: Pearson Education Limited.
Bradford, W., & Hill, S. (2000). Bilingual Grammar of English-Spanish Syntax. University Press of America.
Brown, P., Lai, J., & Mercer, R. (1991). Aligning Sentences in Parallel Corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (pp. 169–176). Berkeley, California.
Cardey, S., Chan, R., & Greenfield, P. (2006). The Development of a Multilingual Collocation Dictionary. In Proceedings of the Workshop on Multilingual Language Resources and Interoperability (pp. 32–39). Sydney.
Choueka, Y., Klein, T., & Neuwitz, E. (1983). Automatic Retrieval of Frequent Idiomatic and Collocational Expressions in a Large Corpus. Journal for Literary and Linguistic Computing, 4(1), 34–38.
Church, K. W., & Hanks, P. (1989). Word Association Norms, Mutual Information, and Lexicography. In Proceedings of the 27th annual meeting on Association for Computational Linguistics (pp. 76–83).
Corpas Pastor, G. (1995). Un Estudio Paralelo de los Sistemas Fraseológicos del Inglés y del Español. Málaga: SPICUM.
Corpas Pastor, G. (1996). Manual de Fraseología Española. Madrid: Gredos.
Corpas Pastor, G. (2013). Detección, Descripción y Contraste de las Unidades Fraseológicas mediante Tecnologías Lingüísticas. Manuscript submitted for publication. In I. Olza, & E. Manero (Eds.), Fraseopragmática. Berlin: Frank & Timme.
Fung, P., & Yuen, Y. (1998). An IR Approach for Translating New Words from Nonparallel, Comparable Texts. In Proceedings of the 17th International Conference on Computational Linguistics (pp. 414–420).
Gale, W., & Church, K. (1993). A Program for Aligning Sentences in Bilingual Corpora. Journal of Computational Linguistics, 19, 75–102.
Gelbukh, A., & Kolesnikova, O. (2013). Expressions in NLP: General Survey and a Special Case of Verb-Noun Constructions. In S. Bandyopadhyay, S. K. Naskar, & A. Ekbal (Eds.), Emerging Applications of Natural Language Processing: Concepts and New Research (pp. 1–21). Hershey: Information Science Reference. IGI Global.
Hausmann, F. (1985). Kollokationen im deutschen Wörterbuch. Ein Beitrag zur Theorie des lexikographischen Beispiels. In H. Bergenholtz, & J. Mugdan (Eds.), Lexikographie und Grammatik (Lexicographica, series maior 3, pp. 175–186). Tübingen: Niemeyer.
Hoang, H. H., Kim, S. N., & Kan, M. Y. (2009). A Re-examination of Lexical Association Measures. In Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP (pp. 31–39). Singapore: ACL and AFNLP.
Jackendoff, R. (1997). The Architecture of the Language Faculty. Cambridge, Mass.: MIT Press.
Jackendoff, R. (2007). Language, Consciousness, Culture: Essays on Mental Structure. Cambridge, Mass.: The MIT Press.
Lea, D., & Runcie, M. (2002). Oxford Collocations Dictionary for Students of English. Oxford University Press.
Lü, Y., & Zhou, M. (2004). Collocation Translation and Acquisition Using Monolingual Corpora. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (ACL ’04) (pp. 167–174).
Ramisch, C., Villavicencio, A., & Boitet, C. (2010). MWEToolkit: A Framework for Multiword Expression Identification. In Proceedings of LREC’10 (7th International Conference on Language Resources and Evaluation).
Ramisch, C. (2012). A Generic Framework for Multiword Expressions Treatment: from Acquisition to Applications. In Proceedings of ACL 2012 Student Research Workshop (pp. 61–66).
Rapp, R. (1995). Identifying Word Translations in Nonparallel Texts. In Proceedings of the 35th Conference of the Association of Computational Linguistics (pp. 321–322). Boston, Massachusetts.
Sag, I. (2002). Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002) (pp. 1–15).
Santana, O. (2011). Extracción Automática de Colocaciones Terminológicas en un Corpus Extenso de Lengua General. Procesamiento del Lenguaje Natural, 47, 145–152.
Schmid, H. (1994). Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of International Conference on New Methods in Language Processing. Manchester, UK.
Seretan, V. (2011). Syntax-Based Collocation Extraction (Text, Speech and Language Technology) (1st ed.). Springer. doi: 10.1007/978-94-007-0134-2
Sharoff, S., Babych, B., & Hartley, A. (2009). “Irrefragable answers” using comparable corpora to retrieve translation equivalents. Language Resources and Evaluation, 43(1), 15–25. doi: 10.1007/s10579-007-9046-4
Sinclair, J., & Jones, S. (1974). English Lexical Collocations: A study in computational linguistics. Cahiers de lexicologie, 24(2), 15–61.
Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), 143–177.
Varga, D. et al. (2005). Parallel corpora for medium density languages. In Proceedings of the RANLP 2005 (pp. 590–596).
Wehrli, E., Nerima, L., & Scherrer, Y. (2009). Deep linguistic multilingual translation and bilingual dictionaries. In Proceedings of the Fourth Workshop on Statistical Machine Translation (pp. 90–94).
On identification of bilingual lexical bundles for translation purposes
The case of an English-Polish comparable corpus of patient information leaflets
Łukasz Grabowski
University of Opole
Grounded in phraseology and corpus linguistics, this paper aims to explore the use of bilingual lexical bundles to improve the degree of naturalness and textual fit of translated texts. More specifically, this study attempts to identify lexical bundles, that is, recurrent sequences of 3–7 words with similar discursive functions, in a purpose-designed comparable corpus of English and Polish patient information leaflets, with 100 text samples in each language. Because of cross-linguistic differences, we additionally apply a number of formal criteria in order to filter out the bundles in each subcorpus. The results show that bilingual lexical bundles with overlapping discourse functions in texts and extracted from comparable corpora hold unexplored potential for machine translation, computer-assisted translation and bilingual lexicography.
Keywords: lexical bundles, comparable corpora, translation quality, translation universals, patient information leaflets
1. Introduction
Multiword units (short MWUs) and word co-occurrence patterns constitute one of the perennial problems, if not bottlenecks, of machine translation (short MT) or computer-assisted translation (CATs). The main reasons for this are, among others, varying degrees of fixedness, pattern variability, syntactic flexibility, and semantic compositionality (Sag et al., 2002). Depending on the context in which it is used, the same sequence of words may also constitute either a complete and fixed MWU or a construction generated by means of grammatical rules
doi 10.1075/cilt.341.09gra © 2018 John Benjamins Publishing Company
(Cobb, 2003, p. 107).1 Furthermore, MWUs differ, to varying extents, with respect to their length, frequency and distribution in texts produced in languages with incompatible phraseological systems.2 One of the practical problems with the translation of recurrent MWUs is sense disambiguation, where a sequence of words may convey different pragmatic meanings depending on the context of use. For example, the Polish noun phrase zły pies ‘bad dog’ could be rendered into English as a bad dog if used, for example, in a narrative text, or as Beware of the dog! if used as a warning nailed to a gate or fence. Each translation is justified in its own context, yet the former would be less natural as a security gate sign. Likewise, the English MWU this medicine is for you only could be rendered into Polish as ten lek przepisano tylko Tobie or, more literally, ten lek jest tylko dla Ciebie. However, neither translation is used in real exemplars of Polish patient information leaflets, where one finds the more impersonal expression lek ten przepisano ściśle określonej osobie ‘this medicine has been prescribed for a specific person’.3 Unless MWUs are enriched with meta-situational data, it is hardly possible to overcome such problems (that is, the variation of pragmatic meanings of MWUs across contexts of language use) using traditional rule-based machine translation systems (short RBMT); likewise, statistical machine translation systems (short SMT) will not solve this problem unless their databases of parallel texts include a sufficiently high number of texts representing security gate signs or patient information leaflets.4 This, however, may be crucial for the quality and textual fit of translated texts, because proper identification of the meanings and functions of MWUs ultimately determines the choice of the most natural translation of the MWUs in a given context, thereby affecting the overall quality of translation.5
1. Cobb (2003, p. 107) provides an example whereby the sequence of words Shut your mouth can be either a complete standalone MWU (when someone wants another person to stop talking) or a construction formed by applying grammar rules (when used as a directive by a dentist to mark the end of a surgery).
2. Because of these differences, it is often possible to find one-to-many or many-to-many correspondences between MWUs used in texts written in different languages.
3. This is expected since language, in general, is used with specific purposes and functions in mind, which is why language use varies across different situations (Kilgarriff 2005).
4. Barreiro et al. (2013, p. 26) argue that with respect to translation of MWU “RBMT systems fail for lack of multiword coverage, while SMT systems fail for not having linguistic (semantico-syntactic) knowledge to process them, leading to serious structural problems.”
5. This problem has somewhat lower impact in the case of computer-assisted translation. In fact, translation memories or terminological databases, created by translators using CAT software, usually correspond to relevant specialist domains of language use (legal, medical, pharmaceutical etc.).
Hence, in this linguistically-oriented study, which may also be useful for bilingual lexicography, an attempt is made to improve the degree of naturalness and textual fit of English-to-Polish translations of so-called lexical bundles (Biber et al., 1999), a particular type of MWU, extracted semi-automatically from patient information leaflets written originally in the two languages. The results, in the form of tables with functionally-aligned MWUs, can be readily integrated into MT systems, notably phrase-based ones, or computer-assisted translation software (CATs). As lexical bundles have been used primarily for descriptive purposes (most often treated as linguistic data to be further used in foreign/second language, ESP or EAP teaching), another aim of the study is to show that these MWUs can also be used for other applied purposes, e.g. for improving translation quality.
2. Background and related work
When reading translations of documents representing specific text types or genres (an obituary, patient information leaflets, a legal contract, etc.), one often has the impression that despite being grammatically correct, the translation sounds somewhat unnatural or reads with some difficulty. The benchmark is then the reader’s intuition or their earlier experience with reading non-translated native texts.6 This problem has been previously addressed in translation studies, notably by corpus linguists. Using comparable corpora, that is, collections of original texts in two or more languages and similar in terms of time of composition, text type or genre, target audience and/or size (Granger, 2010, p. 15), researchers formulated a number of hypotheses called translation universals designed to capture the specificity of translational texts as compared with native texts produced in the same language (Baker, 1993).7 Two hypotheses are particularly relevant for this study. The so-called ‘levelling-out’ hypothesis posits that translated texts, in their totality, are linguistically similar to each other irrespective of source and target languages
6. This is in agreement with the lexical priming hypothesis, originally put forward by Hoey (2005), whereby we are primed to use words in one way or the other depending on the contexts in which we previously encountered the words (Hoey, 2007, p. 8).
7. In fact, to the knowledge of the author, there have been no studies conducted so far on translation universals in machine-translated texts. However, one may assume that capturing features typical of machine-translated texts may contribute to a better understanding of the linguistic features of such texts, which may further help improve the quality of MT output.
(Baker, 1996, p. 184), while the ‘textual fit’ hypothesis refers to the degree of linguistic similarity of translated texts to non-translated texts written in the same language (Chesterman, 2004, p. 6). However, despite a few important descriptive studies (e.g. Laviosa, 1998; Olohan & Baker, 2000; Kajzer-Wietrzny, 2012; Grabowski, 2013; Biel, 2014), no progress has been made so far on the application of these concepts and research findings for practical translation purposes, notably in terms of improving the naturalness and textual fit of texts translated using machine translation or computer-assisted translation systems. This is particularly surprising since freely available comparable corpora are far more abundant than parallel corpora,8 which typically constitute the databases of MT systems or the translation memories fed into CATs.
There is therefore a need to explore the potential of using comparable corpora as the source of linguistic data in order to improve the quality of translation of various types of interrupted or uninterrupted MWUs. Functioning as form-meaning mappings, MWUs in texts are either contiguous or non-contiguous sequences of words, e.g. n-grams, concgrams (Cheng et al., 2006), lexical frames or formulaic frames (Biber, 2009; Gray & Biber, 2013), to name a few labels proposed recently by corpus linguists. In this study, the emphasis is on a specific type of MWU known as lexical bundles (short LBs) that are identified based on their length, frequency and distribution (range) in texts. More specifically, LBs are sequences of three or more words that occur frequently in natural discourse and constitute lexical building blocks used frequently by language users in different situational and communicative contexts, e.g. as a result, the nature of the, as well as (Biber et al., 1999, pp. 990–991). Depending on their composition, LBs are either multiword collocations or multiword formulaic sequences (Biber, 2009, pp. 286–290).9 Typically represented by technical terms, the former are composed solely of content words, have high MI-scores and have relatively low frequencies in texts (ibid.), e.g. chronic renal failure, severe subcutaneous tissue disorders. By contrast, the latter consist of both function and content words, have low MI-scores and relatively high frequencies (ibid.), e.g. if you miss a dose, contact your doctor or pharmacist. Although LBs are usually not perceptually salient and constitute incomplete grammatical units (Biber et al., 2003; Stubbs & Barth, 2003; Hyland, 2008), they often convey domain-specific terminology and phraseology. Also, as demonstrated by many studies (e.g. Biber et al., 2004; Biber, 2006; Hyland, 2008;
8. This observation is also made by Granger (2010).
9. According to Biber (2009, p. 290), the two types of LBs should not be viewed as two disparate types of multiword sequences, but rather as two poles on the continuum.
Goźdź-Roszkowski, 2011; Grabowski, 2015), LBs tend to perform specific textual/discourse functions across the whole variety of text types and genres. Notwithstanding the fact that most linguistically-oriented studies of LBs were conducted on texts written originally in English, there have also been some recent attempts to explore these MWUs in other languages (e.g. Grabowski, 2014) or in cross-linguistic contexts (e.g. Forchini & Murphy, 2008; Granger, 2014). Some of the insights from the latter studies have shed light on the ways the apparently corresponding MWUs are used in texts written in different languages. Forchini & Murphy (2008) showed that synonymic (i.e. formally and semantically similar) n-grams in English and Italian differ with respect to their frequency and collocational relations in a specialised comparable corpus of financial newspapers. Granger (2014) revealed, among other things, that functionally similar LBs (the ones with first-person pronouns and those expressing stance) are more than twice as numerous in the English parliamentary debates as in the French ones, a finding that suggests the existence of one-to-many or many-to-many correspondence between this type of bilingual MWU.10 Granger (2014) also revealed that functionally similar bilingual LBs often differ with respect to their length, a finding that translates into problems with their automatic identification and alignment.
In this paper, we aim to capitalise on these insights by exploring LBs in patient information leaflets produced independently in two languages, that is, English and Polish, with a view to improving the naturalness and textual fit of specialist texts translated using machine translation or computer-assisted translation systems.11 More specifically, we will first identify the most frequent LBs in a custom-designed comparable corpus of English and Polish patient information leaflets, followed by identification of the LBs’ discourse functions. 
The study will conclude by matching any functionally similar LBs in English and Polish patient information leaflets. The obtained data can be further integrated into an MT system or CAT tool. Due to length constraints, however, in this paper we only suggest possible implementation of the findings into domain-specific MT systems. This problem is discussed in greater detail in other studies (e.g. Ren et al., 2009). The following section describes in greater detail the research material, methodology and stages of this study.
10. In view of that, Granger (2014, p. 68) argues that “lexical bundles are powerful window onto pragmatics and rhetoric”.
11. The motivation behind this research was the results of earlier studies of the use, distribution and discourse functions of LBs in Polish and English pharmaceutical texts, including patient information leaflets (Grabowski, 2014, 2015).
3. Research material and methodology
This study focuses on one of the most frequently used genres in the healthcare sector, that is, patient information leaflets (short PILs), which are found in sales packages of drugs or medicines. Nowadays, PILs are produced (or computer-generated)12 by pharmaceutical companies in accordance with specific guidelines issued by regulatory authorities. The major communicative function of this domain-specific pharmaceutical text type is to present specific information on proper and safe use and administration of medicines. More specifically, PILs contain information about medical conditions, doses and side effects associated with the use of drugs or medicines (Montalt & Gonzalez, 2007, p. 69). The research material encompasses a purpose-designed comparable corpus of English and Polish PILs (see Table 1). The English subcorpus (short EPILs) includes 100 leaflets extracted from the Patient Information Leaflet (PIL) Corpus 2.0 (Bouayad-Agha, 2006), originally compiled at the Natural Language Technology Group at the University of Brighton.13 The Polish subcorpus (PPILs) includes 100 leaflets produced by ten pharmaceutical companies operating on the Polish market.14 For both subcorpora, the selection of text samples was meant to ensure that each medicine is represented in the collection only once and that there are no overlapping text samples (i.e. the ones describing the same medicine in English and Polish or the ones being translations from either English or Polish).
Table 1. Study corpus: descriptive statistics
Text type | Number of texts | Size (word tokens) | Size (word types) | STTR* (per 1,000 running words)
English patient information leaflets (EPILs) | 100 | 111,509 | 5,316 | 6.03
Polish patient information leaflets (PPILs) | 100 | 196,757 | 12,754 | 6.47
* Standardized type/token ratio per 1,000 words
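As an illustration of the STTR column in Table 1, the following minimal Python sketch shows one common way of computing a standardised type/token ratio: the type/token ratio is calculated for each consecutive full chunk of running words and the chunk values are averaged. The chapter does not spell out the exact procedure or scale used by WordSmith Tools, so the chunking, the percentage scale and the function name `sttr` are assumptions, and the sketch is not claimed to reproduce the figures in Table 1.

```python
from typing import List


def sttr(tokens: List[str], chunk_size: int = 1000) -> float:
    """Mean type/token ratio (in %) over consecutive full chunks of tokens.

    Assumption: a trailing chunk shorter than `chunk_size` is ignored.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Types are distinct word forms; no lemmatisation is applied.
        ratios.append(100.0 * len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)
```

Averaging over fixed-size chunks makes the measure less sensitive to corpus size than a raw type/token ratio, which matters here because the two subcorpora differ considerably in their number of word tokens.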
12. An attempt at development of an automatic multilingual authoring system for patient information (PILLS) is discussed by Scott et al. (2001).
13. The PIL corpus is readily available at: http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL.
14. The same corpus was used in Grabowski (2014).
As mentioned earlier, this paper focuses on a specific type of MWU called lexical bundles (LBs). In order to obtain an analysable sample of LBs, it is required that specific selection criteria be determined. Based on Biber et al. (1999) and
Biber (2006), these criteria, usually set arbitrarily, include the length of an LB, a frequency cut-off point (frequency per 1 million words; short pmw), and the distribution range (the number of texts with an LB). In view of typological and systemic interlingual differences, Granger (2014) argues that these criteria should be adequately fine-tuned when used to extract LBs from texts written in different languages. Furthermore, since multilingual LBs may vary in length and frequency, Granger (2014, p. 60) suggests including MWUs of several lengths or analysing a similar percentage of LBs in each language respectively.15 For example, in English-Polish pairs of functionally-similar LBs, such as used in the treatment of and stosowane w leczeniu, or the ability to drive and use machines and zdolność prowadzenia pojazdów, the LBs in Polish, a more morphologically synthetic language than English, are often shorter than their English counterparts; one may also find some examples to the contrary, e.g. contact your doctor and należy zwrócić się do lekarza, as some of the differences are due to stylistic conventions. In fact, EPILs are written in a plainer, less formal, and more direct/straightforward style, while PPILs reveal features of a more formal and impersonal style. Also, one should remember that, as a rule, the longer the LB, the lower its frequency in texts, which is why frequency thresholds for shorter LBs should be higher than for longer LBs. Taking these suggestions into account, this study focuses on recurrent n-grams consisting of 3–7 words generated using WordSmith Tools 4.0 (Scott, 2007). Furthermore, the following normalised frequency (f) thresholds were established: f = 200 pmw for 3-word grams (applicable to PPILs only); f = 175 pmw for 4-word grams; f = 150 pmw for 5-word grams; f = 125 pmw for 6-word grams; and f = 100 pmw for 7-word grams. 
As regards distribution, it was decided to include only those n-grams which occur in at least 15% of the text samples in EPILs and PPILs. This first stage of filtering out the data is intended to remove any idiosyncratic items from the analyses. Next, additional exclusion criteria, based on the ones employed by Chen & Baker (2010, p. 33) and Salazar (2011, pp. 48–50), were applied in order to further limit the number of potential bilingual LBs. More specifically, those LBs which occurred on the clause or phrase boundaries (e.g. not sure ask your doctor to in EPILs, or tę ulotkę aby w ‘this leaflet in order to’ in PPILs) were deleted from the list. Also, any LBs ending in prepositions or conjunctions were dropped, likewise any LBs with acronyms or abbreviations (e.g. Sp. z o. o., an abbreviation for a Polish private limited company).
15. Granger (2014, p. 60) also noticed that in more inflectional languages one typically finds more recurrent n-grams, unless this effect is countered by lemmatisation. However, in this study, which aims to improve the quality of, for example, domain-specific SMT systems, lemmatisation will not be used because such systems do not take advantage of linguistic annotation (Barreiro et al. 2013, p. 26).
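The first filtering stage described above (n-gram length, normalised frequency and distribution range) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure implemented in WordSmith Tools: the texts are assumed to be already tokenised and lower-cased, frequencies are normalised by the total number of word tokens, and both thresholds are treated as inclusive. The names `PMW_THRESHOLDS`, `MIN_RANGE` and `candidate_bundles` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Bundle = Tuple[str, ...]

# Thresholds per million words (pmw) by n-gram length, as given in the text;
# the 3-word threshold was applied to the Polish subcorpus only.
PMW_THRESHOLDS: Dict[int, float] = {3: 200, 4: 175, 5: 150, 6: 125, 7: 100}
MIN_RANGE = 0.15  # minimum share of texts in which an n-gram must occur


def ngrams(tokens: Sequence[str], n: int) -> List[Bundle]:
    """All contiguous n-word sequences in one text."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_bundles(
    texts: List[List[str]],
    lengths: Sequence[int] = (4, 5, 6, 7),
    thresholds: Dict[int, float] = PMW_THRESHOLDS,
    min_range: float = MIN_RANGE,
) -> Dict[Bundle, Tuple[int, int]]:
    """Return {n-gram: (raw frequency, number of texts)} for n-grams that
    meet both the normalised frequency and the range threshold."""
    total_tokens = sum(len(tokens) for tokens in texts)
    freq: Counter = Counter()
    doc_freq: Counter = Counter()
    for tokens in texts:
        for n in lengths:
            grams = ngrams(tokens, n)
            freq.update(grams)           # every occurrence counts
            doc_freq.update(set(grams))  # each text counts once
    kept = {}
    for gram, count in freq.items():
        pmw = count * 1_000_000 / total_tokens
        share = doc_freq[gram] / len(texts)
        if pmw >= thresholds[len(gram)] and share >= min_range:
            kept[gram] = (count, doc_freq[gram])
    return kept
```

Under these assumptions, and with the corpus sizes reported in Table 1, the 175 pmw threshold for 4-word grams in EPILs (111,509 tokens) corresponds to about 19.5 raw occurrences, so a sequence would need at least 20 occurrences; the 15% range threshold corresponds to 15 of the 100 texts in each subcorpus.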
In the third stage of filtering out the data, those overlapping LBs which are fragments of other, typically longer LBs were removed from the list (e.g. of the reach of or out of the reach of in EPILs, which are fragments of the longer LB out of the reach of children, or się z treścią ulotki ‘self with the contents of the leaflet’ in PPILs, a fragment of the longer LB należy zapoznać się z treścią ulotki ‘one should familiarise him-/herself with the contents of the leaflet’). In the final stage, in order to identify the discourse functions of the remaining LBs, a functional typology largely based on the one originally proposed by Biber et al. (2004) was applied. After aligning English lexical bundles with their discourse functions, an attempt was made to identify any functionally similar LBs in PPILs and align them in a table with their functional counterparts in EPILs.
In what follows, we present the study results and discuss various possibilities regarding implementation of the findings into machine translation or computer-assisted translation systems with the aim of improving the degree of naturalness and textual fit of translations of recurrent MWUs, which in this study were operationalised as lexical bundles.
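Two of the exclusion steps described in this section lend themselves to a simple illustration: dropping LBs that end in a preposition or conjunction, and removing LBs that are fragments of longer LBs. The Python sketch below rests on several assumptions: the function-word list `FINAL_WORD_BLOCKLIST` is a hypothetical, deliberately tiny sample rather than the list used in the study, fragments are checked against the bundles that survive the first step, and the exclusion of bundles spanning clause or phrase boundaries or containing abbreviations is not modelled at all.

```python
from typing import Iterable, List, Set, Tuple

Bundle = Tuple[str, ...]

# Hypothetical mini-list of English and Polish prepositions/conjunctions,
# for illustration only.
FINAL_WORD_BLOCKLIST: Set[str] = {"of", "to", "in", "and", "or", "w", "z", "aby", "lub"}


def ends_in_blocked_word(bundle: Bundle, blocklist: Set[str] = FINAL_WORD_BLOCKLIST) -> bool:
    """True if the last word of the bundle is a listed function word."""
    return bundle[-1] in blocklist


def is_fragment(short: Bundle, long: Bundle) -> bool:
    """True if `short` is a contiguous sub-sequence of the strictly longer `long`."""
    n = len(short)
    return n < len(long) and any(
        long[i:i + n] == short for i in range(len(long) - n + 1)
    )


def filter_bundles(bundles: Iterable[Bundle]) -> List[Bundle]:
    """Drop bundles ending in a blocked word, then drop fragments of longer bundles."""
    no_final_function_word = [b for b in bundles if not ends_in_blocked_word(b)]
    return [
        b for b in no_final_function_word
        if not any(is_fragment(b, other) for other in no_final_function_word)
    ]
```

For instance, given out of the reach of children, the reach of children and out of the reach of, the sketch keeps only the first: the third ends in a preposition and the second is a fragment of the first.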
4. Results
As explained earlier, the n-grams, generated using WordSmith Tools 4.0 (Scott, 2007), were subsequently filtered in the stages described in the previous section. Consequently, we obtained a list of 78 LBs of 4–7 words in EPILs, called pivot lexical bundles (see Appendix 1).16 The corresponding figure for 3–7-word LBs in PPILs is 314 items.17 Considerably more LBs were thus found in PPILs, a finding due to systemic differences between the English and Polish languages as well as to stylistic differences between EPILs and PPILs. More specifically, the highly inflectional morphology of the Polish language, the lack of lemmatisation of linguistic data and fewer repetitions in PPILs are among the reasons that may account for the discrepancy.18
16. These include 30 4-word LBs, 29 5-word LBs, 12 6-word LBs and 7 7-word LBs identified in the EPILs corpus.
17. These include 85 3-word LBs, 103 4-word LBs, 49 5-word LBs, 39 6-word LBs and 38 7-word LBs in the PPILs corpus.
18. For example, in PILs the English verbs to tell and to ask, in such LBs as ask/tell your doctor or pharmacist, correspond to five distinct verbs in Polish, that is poradzić się, zwrócić się, skontaktować się, powiedzieć and powiadomić.
As the English LBs were treated as pivot LBs in this study, they were divided into three functional categories, namely referential, discourse-organising and expressing stance (Biber et al. 2004).19 Next, the corresponding Polish LBs with the same discourse functions were manually searched for among the total of 314 3–7-word LBs in PPILs. For example, the stance bundle tell your doctor or pharmacist (53 occurrences in 44 texts in the EPILs corpus), a directive targeted at patients, was found to have three functional counterparts among the LBs identified in the PPILs corpus (see Table 2). In other words, at this stage of the study, the lists of Polish LBs were checked for potential equivalents and, next, the texts were analysed, using the Concord facility of WordSmith Tools, to find out whether a pair of bilingual LBs have approximately the same discourse functions in EPILs and PPILs (see Table 3).
Table 2. Examples of LBs (in context) extracted from EPILs and PPILs
EPILs | PPILs
tell your doctor or pharmacist | należy powiedzieć lekarzowi lub farmaceucie; należy powiadomić lekarza lub farmaceutę; należy powiedzieć o tym lekarzowi lub farmaceucie
Table 3. Examples of use of functionally-similar bilingual LBs in EPILs and PPILs
EPILs:
If the answer is YES to any of these questions, and if you have not already discussed them with your doctor, tell your doctor or pharmacist before you take this medicine. (Zovirax)
If you suffer from either of the following, stop treatment and tell your doctor or pharmacist as soon as possible (Climagest)
If the answer is YES to any of these questions tell your doctor or pharmacist as soon as possible if you have not already done so. (Buscopan tablets)
If the answer to any of these questions is YES, tell your doctor or pharmacist. (Anturan)
PPILs:
Jeśli nasili się którykolwiek z objawów niepożądanych lub wystąpią jakiekolwiek objawy niepożądane niewymienione w ulotce, należy powiedzieć o tym lekarzowi lub farmaceucie lub pielęgniarce. (Anafranil)
Jeśli wystąpią jakiekolwiek objawy niepożądane, w tym wszelkie możliwe objawy niepożądane niewymienione w ulotce, należy powiedzieć o tym lekarzowi lub farmaceucie. (Brilique)
Należy powiedzieć lekarzowi lub farmaceucie o wszystkich przyjmowanych aktualnie lub ostatnio lekach, również tych, które wydawane są bez recepty oraz lekach ziołowych. (Cazaprol)
19. Grabowski (2015) discusses certain problems related to functional analyses of lexical bundles.
As some of the English LBs had no functional counterparts among the LBs identified in PPILs, we obtained a list of 38 functional pairs of bilingual LBs. Some pairs were found to have more than one LB in either English or Polish, or both (i.e. with one-to-many, many-to-one or many-to-many correspondence).20 The results are summarised below. Firstly, 11 pairs with bilingual LBs with primarily referential functions were identified and aligned (see Table 4). In short, referential LBs are used in texts to refer to abstract or physical entities or to identify an attribute of an entity as particularly important (Biber et al. 2004, p. 384), such as information about side-effects, contents of the packaging, information on indications or counter-indications.
Table 4. Referential LBs in EPILs and PPILs
EPILs | PPILs
before you start to take your medicine | przed zastosowaniem leku; przed zastosowaniem jakiegokolwiek leku
this medicine is for you; this medicine is for you only | lek ten przepisano ściśle określonej osobie
the name of your medicine; the name of your medicine is | to jest lek
a summary of the information | spis treści ulotki
out of the reach of children | w miejscu niedostępnym i niewidocznym dla dzieci; w miejscu niewidocznym i niedostępnym dla dzieci
belongs to a group of medicines; belongs to a group of medicines called | należy do grupy leków
your doctor or pharmacist | lekarzem lub farmaceutą
after the expiry date | po upływie terminu ważności
in a safe place | w miejscu niedostępnym i niewidocznym
after taking your medicine | podczas stosowania leku
information about your medicine | informacje ważne przed zastosowaniem leku; informacje co zawiera lek; informacja dla użytkownika
20. In the course of a manual comparison of linguistic data, it was also revealed that some English LBs had a single word rather than an MWU counterpart in Polish (e.g. as soon as possible vs. niezwłocznie or natychmiast, that is, ‘forthwith’ or ‘immediately’). In such cases, the functional pair was not taken into consideration in the analysis since one of the counterparts was not an MWU. This is primarily because this research focuses on lexical bundles only. However, adding single-word units to the list may further contribute to the improvement of the naturalness and textual fit of translations.
Next, 11 bilingual pairs with discourse-organising LBs were matched in EPILs and PPILs (see Table 5). Also referred to as text-oriented bundles (Hyland 2008, p. 13), these items are generally concerned with textual organisation (Biber et al., 2004, p. 391). Their main functions in PILs are to introduce various conditions pertaining to the use of drugs or medicines, convey warnings and precautions, elaborate on or specify topics introduced earlier in texts, or introduce new information.
Table 5. Discourse-organising LBs in EPILs and PPILs
EPILs | PPILs
if you have any questions | w razie jakichkolwiek dalszych wątpliwości
if you are not sure | w przypadku wątpliwości należy
if you miss a dose | w celu uzupełnienia pominiętej dawki
if their symptoms are the same | jeśli objawy jej choroby są takie same
to a group of medicines called | do grupy leków
if you are pregnant | jeśli pacjentka jest w ciąży
for the treatment of | stosowane w leczeniu
is used to treat | jest stosowany w leczeniu
do you suffer from | jeśli u pacjenta stwierdzono; jeśli u pacjenta występują; jeśli u pacjenta występuje
as with all medicines | jak każdy lek
the following inactive ingredients | którykolwiek z pozostałych składników leku
Finally, the study revealed 13 pairs with bilingual LBs expressing stance (see Table 6). In PILs, such items typically direct patients to carry out specific actions in the event of any problems with the use of drugs or medicines, or help PILs writers express attitudes, degree of certainty or uncertainty as well as value judgments or assessments with respect to the use of medicines (based on Biber et al., 2004, p. 384).
Table 6. Stance LBs in EPILs and PPILs
EPILs | PPILs
please ask your doctor or pharmacist | należy poradzić się lekarza lub farmaceuty
ask your doctor or pharmacist | należy zwrócić się do lekarza lub farmaceuty; należy skontaktować się z lekarzem lub farmaceutą; należy poradzić się lekarza lub farmaceuty
tell your doctor or pharmacist | należy powiedzieć lekarzowi lub farmaceucie; należy powiadomić lekarza lub farmaceutę; należy powiedzieć o tym lekarzowi lub farmaceucie
please read this leaflet | należy uważnie zapoznać się z treścią ulotki
please read this leaflet carefully | należy zapoznać się z treścią ulotki
what you should know about | ważne informacje o niektórych składnikach leku
as directed by your doctor | zgodnie z zaleceniami lekarza; stosować zgodnie z zaleceniami lekarza; należy zawsze stosować zgodnie z zaleceniami lekarza; zawsze stosować zgodnie z zaleceniami lekarza
never give it to someone else; you never give it to someone else | nie należy go przekazywać innym
tell your doctor immediately | należy natychmiast skontaktować się z lekarzem; niezwłocznie skontaktować się z lekarzem; natychmiast skontaktować się z lekarzem
it may harm them | lek może zaszkodzić innej osobie
check with your doctor | należy skontaktować się z lekarzem
stop taking the tablets | nie należy przyjmować; nie należy stosować; nie stosować leku
talk to your doctor | należy powiedzieć o tym lekarzowi; powiedzieć o tym lekarzowi
you may need to | aby w razie potrzeby; aby w razie potrzeby móc
you should not take | nie stosować tego leku; nie zaleca się stosowania leku
5. Discussion and conclusions
This study aimed to explore the potential of using bilingual lexical bundles to improve the naturalness and textual fit of automatic (using MT software) or semi-automatic (using CATs) translation of recurrent MWUs. Although this descriptive and exploratory research focused on patient information leaflets, the methods described in it may also be applicable to other specialist text types or genres where multiword units are employed in constrained ways for a limited range of discursive functions. In this chapter, we showed how to identify and align bilingual lexical bundles, a specific type of MWU, in terms of the same discourse functions (e.g. referential, discourse-organising or expressing stance) performed in a purpose-designed comparable corpus of English and Polish patient information leaflets. However, the drawback of the methods described in this paper is that they are relatively time-consuming and labour-intensive, notably the manual analysis and alignment of the discourse functions of MWUs. The study also revealed that the majority of bilingual lexical bundles, extracted semi-automatically from the comparable corpus (independently from its English and Polish subcorpora), do not represent technical terms.21
The findings of this study can be integrated into machine-translation or computer-assisted translation software in a number of ways. First, the tables with functionally-aligned MWUs can be integrated into domain-specific SMT systems, notably phrase-based ones, where “contiguous segments of words in the input sentence are mapped to contiguous segments of words in the output sentence” (Hoang & Koehn, 2008, p. 58).22 For example, Ren et al. (2009) show how to implement bilingual MWU tables into Moses, an open-source phrase-based SMT system based on large amounts of parallel texts (Koehn et al., 2007; Hoang & Koehn, 2008).23 Secondly, as for RBMT systems, which make ample use of linguistic annotation, additional functional tagging of bilingual pairs of domain-,
21. Frantzi, Ananiadou & Mima (2000) present a method, called C-value/NC value, custom-designed for automatic extraction of multiword terms. Combining linguistic and statistical information, the method may be complementary with respect to the lexical bundles approach (Biber et al., 1999).
22. In fact, in phrase-based SMT systems, n-grams, that is uninterrupted sequences of words, have limited or no meta-linguistic information (Barreiro et al., 2013, p. 28), let alone meta-textual or meta-situational information.
23. In their study, Ren et al. (2009) extract MWUs from parallel corpora with texts from the domain of traditional medicine and chemical industry.
Łukasz Grabowski
text type- or genre-specific n-grams may facilitate MWU function disambiguation in various contexts of language use. This, however, requires that an interoperable comprehensive taxonomy of text types, genres or specialist domains be developed, together with sets of domain-specific ontologies of text type-specific or genre-specific discourse functions performed by MWUs (di Buono et al., 2013). Thirdly, the functionally-aligned bilingual lexical bundles identified in this paper represent a readily available descriptive dataset to be integrated into domain-specific glossaries or terminological databases created by translators using computer-assisted translation software (CATs). The findings of the study may also be useful for bilingual lexicography.
All in all, the approach presented in this paper allows one to process bilingual English-Polish MWUs as text type-specific or genre-specific pairs of functional units. This is in agreement with Wilks (2009, p. 198), who argues that an intermediate representation (IR) of MWUs, a model for the representation of communicative events originally designed as part of the ULTRA MT system (Farwell et al., 1993), should include “the explicit information in the context of the expression being processed. This includes the referential, stylistic and communicative aspects of an utterance in any language in so far as these are reflected by the form of the expression uttered, and not the information inferable from such explicit information” (Wilks, 2009, p. 198). This is also in line with the claims made by those scholars who see the need for integrating insights from pragmatics into the field of MT (Wilks, 2009). On these grounds, we can argue that the information on pragmatic meanings/functions of MWUs acquired from comparable corpora is essential for improving the degree of naturalness and textual fit of translated texts as compared with native texts produced in the same language.
Since pragmatic meaning is linked with the purposes and contexts of using particular MWUs, the approach presented in this linguistically-motivated paper is particularly suitable for domain-specific text types or genres. The results therefore suggest the need to consider annotating MWUs with meta-situational (corresponding to text type and specialist domain) as well as functional (corresponding to textual/discourse functions of MWUs) information in order to improve the performance of MT systems.24
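The annotation proposed above can be pictured as a small record that stores a bilingual bundle pair together with its functional and meta-situational labels. The following Python sketch is purely illustrative: the class name, the field names, the function labels, the Polish equivalents and the score are assumptions introduced here, not data or notation from this study, and the `|||`-delimited line only imitates the general layout of a phrase-table entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BundlePair:
    """A functionally aligned bilingual lexical bundle (hypothetical schema)."""
    source: str     # English bundle
    target: str     # Polish bundle (illustrative equivalent, not from the study)
    function: str   # discourse function, e.g. "referential" or "stance"
    text_type: str  # meta-situational label: text type or genre
    domain: str     # meta-situational label: specialist domain

    def to_phrase_table_line(self, score: float = 1.0) -> str:
        """Render the pair as a '|||'-delimited line; the score is a placeholder."""
        return f"{self.source} ||| {self.target} ||| {score:.2f}"


def by_function(pairs, function):
    """Return only the pairs annotated with the given discourse function."""
    return [p for p in pairs if p.function == function]


# Toy data; the labels and Polish strings are assumptions for illustration.
pairs = [
    BundlePair("ask your doctor or pharmacist",
               "należy zwrócić się do lekarza lub farmaceuty",
               "stance", "patient information leaflet", "pharmaceutical"),
    BundlePair("belongs to a group of medicines",
               "należy do grupy leków",
               "referential", "patient information leaflet", "pharmaceutical"),
]

for pair in by_function(pairs, "referential"):
    print(pair.to_phrase_table_line())
```

Keeping the functional and meta-situational labels on the record itself means that the same dataset could, in principle, be filtered by function before being exported to a glossary, a terminological database or an MT resource.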
24. Needless to say, it is necessary to verify the absolute improvement gained by employing the data obtained through the methods described in this paper, for example by calculating BLEU scores on English-to-Polish translation, compared with the baselines using general multi-domain corpora (Papineni et al., 2002), or by using other methods to evaluate machine translation quality (e.g. White 2003; Callison-Burch et al., 2007, 2008). Although conducting such tests is beyond the scope of this paper, this would enable one to determine how effective the results and methods presented in this study are for improving the quality and textual fit of recurrent phraseologies in machine translated texts.
On identification of bilingual lexical bundles for translation purposes
In conclusion, the results of this study provide an insight into how certain domain-specific MWUs are used in native non-translational texts produced in two languages: English and Polish. That is why the pairs of functionally-equivalent MWUs extracted from comparable corpora may be used for developing domain-specific multilingual resources dedicated to improving the naturalness and textual fit of translated texts. Hence, this research showed that apart from being used as descriptive data, the findings obtained from studies on lexical bundles may also be turned into actionable knowledge useful for practitioners of translation.

References
Baker, M. (1993). Corpus linguistics and translation studies: Implications and applications. In M. Baker, G. Francis, & E. Tognini-Bonelli (Eds.), Text and Technology. In Honor of John Sinclair (pp. 233–250). Amsterdam: John Benjamins. doi: 10.1075/z.64.15bak
Baker, M. (1996). Corpus-based translation studies: The challenges that lie ahead. In H. Somers (Ed.), Terminology, LSP and Translation: Studies in Language Engineering. In Honour of Juan C. Sager (pp. 175–186). Amsterdam: John Benjamins. doi: 10.1075/btl.18.17bak
Barreiro, A., Monti, J., Batista, F., & Orliac, B. (2013). When Multiwords Go Bad in Machine Translation. In J. Monti, R. Mitkov, G. Corpas-Pastor, & V. Seretan (Eds.), Workshop Proceedings: Multi-Word Units in Machine Translation and Translation Technologies (pp. 26–33). Allschwil: The European Association for Machine Translation. Available at: http://www.academia.edu/4319501/Proceedings_MT_Summit_2013_Workshop_on_Multiword_units_in_Machine_Translation_and_Translation_Technology (accessed November 2014).
Biber, D. (2006). University Language. A corpus-based study of spoken and written registers. Amsterdam/Philadelphia: John Benjamins. doi: 10.1075/scl.23
Biber, D. (2009). A corpus-driven approach to formulaic language in English: multi-word patterns in speech and writing. International Journal of Corpus Linguistics, 14(3), 275–311. doi: 10.1075/ijcl.14.3.08bib
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). The Longman Grammar of Spoken and Written English. London: Longman.
Biber, D., Conrad, S., & Cortes, V. (2003). Lexical bundles in speech and writing: An initial taxonomy. In A. Wilson, P. Rayson, & T. McEnery (Eds.), Corpus Linguistics by the Lune: A Festschrift for Geoffrey Leech (pp. 71–92). Frankfurt am Main: Peter Lang.
Biber, D., Conrad, S., & Cortes, V. (2004). If you look at…: Lexical bundles in university teaching and textbooks. Applied Linguistics, 25(3), 371–405. doi: 10.1093/applin/25.3.371
Biel, Ł. (2014). Lost in the Eurofog. The Textual Fit of Translated Law. Frankfurt am Main: Peter Lang. doi: 10.3726/978-3-653-03986-3
Bouayad-Agha, N. (2006). The Patient Information Leaflet (PIL) corpus. Available at: http://mcs.open.ac.uk/nlg/old_projects/pills/corpus/PIL/ (accessed May 2012).
Callison-Burch, Ch., Fordyce, C., Koehn, P., Monz, Ch., & Schroeder, J. (2007). (Meta-) Evaluation of Machine Translation. StatMT ’07 Proceedings of the Second Workshop on Statistical Machine Translation, Association for Computational Linguistics, 136–158. Available at: http://dl.acm.org/ft_gateway.cfm?id=1626373&type=pdf&CFID=624242940&CFTOKEN=26744291 (accessed February 2015).
Callison-Burch, Ch., Fordyce, C., Koehn, P., Monz, Ch., & Schroeder, J. (2008). Further Meta-Evaluation of Machine Translation. StatMT ’08 Proceedings of the Third Workshop on Statistical Machine Translation, Association for Computational Linguistics, 70–106. Available at: http://dl.acm.org/ft_gateway.cfm?id=1626403&type=pdf&CFID=624242938&CFTOKEN=97170002 (accessed February 2015).
Chen, Y.-H., & Baker, P. (2010). Lexical bundles in L1 and L2 academic writing. Language Learning and Technology, 14(2), 30–49.
Cheng, W., Greaves, C., & Warren, M. (2006). From n-gram to skipgram to concgrams. International Journal of Corpus Linguistics, 11(4), 411–433. doi: 10.1075/ijcl.11.4.04che
Chesterman, A. (2004). Hypothesis about translation universals. In G. Hansen, K. Malmkjaer, & D. Gile (Eds.), Claims, Changes and Challenges in Translation Studies (pp. 1–13). Amsterdam: John Benjamins. doi: 10.1075/btl.50.02che
Cobb, T. (2003). Review: Alison Wray. 2001. Formulaic Language and the Lexicon. Cambridge: Cambridge University Press. xi + 332pp. Canadian Journal of Applied Linguistics, 6(1), 105–110.
di Buono, M., Monti, J., Monteleone, M., & Marano, F. (2013). Multiword processing in an ontology-based Cross-Language Information Retrieval model for specific domain collections. In J. Monti, R. Mitkov, G. Corpas-Pastor, & V. Seretan (Eds.), Workshop Proceedings: Multi-Word Units in Machine Translation and Translation Technologies (pp. 43–52). Allschwil: The European Association for Machine Translation. Available at: http://www.mt-archive.info/10/MTS-2013-W4-Buono.pdf (accessed November 2014).
Farwell, D., Guthrie, L., & Wilks, Y. (1993). Automatically Creating Lexical Entries for ULTRA, a Multilingual MT System. Machine Translation, 8, 127–145. doi: 10.1007/BF00982636
Forchini, P., & Murphy, A. (2008). N-grams in comparable specialized corpora. Perspectives on phraseology, translation and pedagogy. International Journal of Corpus Linguistics, 13(3), 351–367. doi: 10.1075/ijcl.13.3.06for
Frantzi, K., Ananiadou, S., & Mima, H. (2000). Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2), 115–130. doi: 10.1007/s007999900023
Goźdź-Roszkowski, S. (2011). Patterns of Linguistic Variation in American Legal English. A Corpus-Based Study. Frankfurt am Main: Peter Lang Verlag. doi: 10.3726/978-3-653-00659-9
Grabowski, Ł. (2013). Interfacing corpus linguistics and computational stylistics: translation universals in translational literary Polish. International Journal of Corpus Linguistics, 18(2), 254–280. doi: 10.1075/ijcl.18.2.04gra
Grabowski, Ł. (2014). On Lexical Bundles in Polish Patient Information Leaflets: A Corpus-Driven Study. Studies in Polish Linguistics, 9(1), 21–43.
Grabowski, Ł. (2015). Keywords and lexical bundles within English pharmaceutical discourse: a corpus-driven description. English for Specific Purposes, 38, 23–33. doi: 10.1016/j.esp.2014.10.004
Granger, S. (2010). Comparable and translation corpora in cross-linguistic research. Design, analysis and applications. Journal of Shanghai Jiaotong University, 2, 14–21. Available at: http://sites.uclouvain.be/cecl/archives/Granger_Crosslinguistic_research.pdf (accessed November 2014).
Granger, S. (2014). A lexical bundle approach to comparing languages. Stems in English and French. In M.-A. Lefer, & S. Vogeleer (Eds.), Genre- and register-related discourse features in contrast. Special issue of Languages in Contrast, 14(1), 58–72.
Gray, B. & Biber, D. (2013). Lexical frames in academic prose and conversation. International Journal of Corpus Linguistics, 18(1), 109–135. doi: 10.1075/ijcl.18.1.08gra
Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London: Routledge.
Hoey, M. (2007). Lexical priming and literary creativity. In M. Hoey, M. Mahlberg, M. Stubbs, & W. Teubert (Eds.), Text, Discourse and Corpora. London: Continuum, 7–30.
Hoang, H. & Koehn, P. (2008). Design of the Moses Decoder for Statistical Machine Translation. Software Engineering, Testing, and Quality Assurance for Natural Language Processing (pp. 58–65). Columbus, Ohio, USA, June 2008. Association for Computational Linguistics. Available at: http://www.aclweb.org/anthology/W08-0510 (accessed November 2014).
Hyland, K. (2008). As can be seen: Lexical bundles and disciplinary variation. English for Specific Purposes, 27, 4–21. doi: 10.1016/j.esp.2007.06.001
Kajzer-Wietrzny, M. (2012). Interpreting Universals and Interpreting Style. Unpublished PhD dissertation. Adam Mickiewicz University, Poznań, Poland. Available at: https://repozytorium.amu.edu.pl/jspui/bitstream/10593/2425/1/Paca%20doktorska%20Marty%20KajzerWietrzny.pdf (accessed September 2012).
Kilgarriff, A. (2005). Language is never ever ever random. Corpus Linguistics and Linguistic Theory, 1(2), 263–276. doi: 10.1515/cllt.2005.1.2.263
Koehn, P., Hoang, H., Birch, A., Callison-Burch, Ch., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, Ch., Zens, R., Dyer, Ch., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), Prague, Czech Republic, June 2007. Available at: https://www.cs.jhu.edu/~ccb/publications/moses-toolkit.pdf (accessed November 2014).
Laviosa, S. (1998). Core patterns of lexical use in a comparable corpus of English narrative prose. Meta, 43(4), 557–570. doi: 10.7202/003425ar
Montalt Resurreccio, V., & Gonzalez Davies, M. (2007). Medical Translation Step by Step. Translation Practices explained. Manchester: St. Jerome Publishing.
Olohan, M. (2004). Introducing Corpora in Translation Studies. London/New York: Routledge.
Olohan, M., & Baker, M. (2000). Reporting that in translated English: Evidence for subconscious processes of explicitation? Across Languages and Cultures, 1, 141–172 (cited in Olohan 2004: 94). doi: 10.1556/Acr.1.2000.2.1
Papineni, K., Roukos, S., Ward, T., & Zhu, W-J. (2002). BLEU: A method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, July 2002 (pp. 311–318). Available at: http://aclweb.org/anthology/P/P02/P02-1040.pdf (accessed November 2014).
Ren, Z., Lu, Y., Cao, J., Liu, Q., & Huang, Y. (2009). Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions. Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications. MWE’ 09 (pp. 47–54). Stroudsburg: Association for Computational Linguistics. Available at: http://www.aclweb.org/anthology/W09-2907 (accessed November 2014).
Sag, I., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. Computational Linguistics and Intelligent Text Processing: Third International Conference (CICLing 2002), 1–15. Available at: http://lingo.stanford.edu/pubs/WP-2001-03.pdf (accessed May 2013).
Salazar, D. (2011). Lexical bundles in scientific English: A corpus-based study of native and non-native writing. Unpublished PhD dissertation. University of Barcelona. Available at: http://www.tdx.cat/bitstream/handle/10803/52083/DJLS_DISSERTATION.pdf (accessed March 2013).
Scott, D., Bouayad-Agha, N., Power, R., Shultz, S., Beck, R., Murphy, D., & Lockwood, R. (2001). PILLS: A Multilingual Authoring System for Patient Information. In Proceedings of the 2001 Meeting of the American Medical Informatics Association (AMAI’01), Washington, D.C., USA. Available at: http://mcs.open.ac.uk/rp3242/papers/amia01.pdf (accessed May 2013).
Scott, M. (2007). WordSmith Tools 4.0. Liverpool: Lexical Analysis Software.
Stubbs, M. & Barth, I. (2003). Using recurrent phrases as text-type discriminators: a quantitative method and some findings. Functions of Language, 10(1), 65–108. doi: 10.1075/fol.10.1.04stu
White, J. (2003). How to evaluate machine translation. In H. Somers (Ed.), Computers and Translation: A Translator’s Guide (pp. 211–244). Amsterdam: John Benjamins. doi: 10.1075/btl.35.16whi
Wilks, Y. (2009). Machine Translation: Its Scope and Limits. New York: Springer.
Appendix 1. Pivot LBs in EPILs

4-word LBs
your doctor or pharmacist
please read this leaflet
as soon as possible
it is important to
after the expiry date
if you are pregnant
tell your doctor immediately
it may harm them
in a safe place
check with your doctor
for the treatment of
after taking your medicine
it is important that
is used to treat
do you suffer from
an allergic reaction to
are you taking any
do not stop taking
if you notice any
stop taking the tablets
talk to your doctor
you may need to
you may want to
nearest hospital casualty department
if your doctor decides
information about your medicine
as with all medicines
go on as before
if you accidentally take
the following inactive ingredients

5-word LBs
ask your doctor or pharmacist
if you have any questions
tell your doctor or pharmacist
if you are not sure
if you forget to take
before you start to take
please read this leaflet carefully
a group of medicines called
only a doctor can prescribe
what you should know about
a doctor can prescribe it
as soon as you remember
this medicine is for you
the name of your medicine
if you miss a dose
to any of these questions
what to do if you
if you are taking any
as directed by your doctor
of a group of medicines
had an allergic reaction to
this leaflet does not contain
a summary of the information
then go on as before
or you are not sure
are you taking any other
to any of the following
unless your doctor tells you
to a group of medicines

6-word LBs
only a doctor can prescribe it
if you forget to take a
out of the reach of children
the name of your medicine is
if their symptoms are the same
belongs to a group of medicines
if your doctor decides to stop
please ask your doctor or pharmacist
you may want to read it
never give it to someone else
to any of the following questions
this medicine is for you only

7-word LBs
a doctor can prescribe it for you
before you start to take your medicine
if you forget to take a dose
one of a group of medicines called
you may want to read it again
you never give it to someone else
if your doctor tells you to stop
The quest for Croatian idioms as multiword units
Kristina Kocijan & Sara Librenjak
Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb

Idiomatic expressions are types of MWUs in which the meaning of the unit does not equal the meaning of its parts. They are culturally dependent, so the translation cannot be inferred from the expression itself. The Croatian language has a very rich idiomatic structure. A few such expressions can be understood in direct translation, but most differ from their literal translations. As idioms are rooted in the tradition of the language and society from which they hail, they need special treatment in computational linguistics. Using NooJ as an NLP tool, we describe different types of Croatian idioms in a way that will help us recognize them in texts. Idiom recognition should be given special treatment, as it is a major task in translation.
Keywords: Multiword units, idiomatic expressions, translation, Croatian, NooJ
doi 10.1075/cilt.341.10koc © 2018 John Benjamins Publishing Company

1. Introduction
Multiword units (MWUs) present a challenge when it comes to computer-assisted translation. Among them, idiomatic expressions pose a significant problem in the process of translation, since their meaning is different from the sum of their parts and, furthermore, many of them are culturally dependent. Croatian is a highly inflected language with seven noun cases and a rich verbal structure. When preparing resources for computer-assisted translation, a merely statistical approach is not sufficient. Instead, we propose a rule-based approach in order to digitally capture the complex case and tense structure, verb valency and non-literal expressions such as idioms. As idioms are rooted in the tradition of the language and the society from which they hail (Menac, 1978; Fink & Menac, 2008), they need special treatment in computational linguistics. As recently as the end of the 20th century, linguists realized that idioms are a basic and widespread linguistic phenomenon and not an exception (Machonis, 2010).
In this work, we use NooJ as an NLP tool to propose a rule-based model that describes different types of Croatian MWUs. By treating them as frozen expressions in NooJ, we use the power of its local grammars to describe the syntactic behavior and recognize all the possible contexts of each expression (Silberztein, 2003). This way we are able to recognize idiomatic expressions found in text as continuous and discontinuous MWUs. Recognition of idioms, especially discontinuous types, is very important for translation projects. NooJ, as the NLP tool of our choice, has already proved very efficient in dealing with different types of MWUs (Bekavac & Tadić, 2008; Todorova, 2008; Machonis, 2010, 2012; Gavriilidou et al., 2012; Vietri, 2012) and the results justify our selection.
The paper is organized as follows. First, we describe the theoretical background, where we characterize Croatian idioms and contrast them with MWUs in general, discuss the importance of idiom detection in translation and give some previous references in this area. Secondly, we present the methodology applied in building the corpora we used for training and testing our grammars. Then, our proposed classification of Croatian idioms, with examples from the NooJ dictionary and syntactic grammars, is explained in more detail, followed by the results for recognizing idioms in text. We finish the paper with some concluding remarks and suggestions for future work.
2. Theoretical background
In this chapter we introduce the linguistic theory behind idioms in the Croatian language. Secondly, we discuss the importance of idiom detection in translation. Finally, similar work concerning MWUs in Croatian is presented.

2.1 Idioms as a type of MWU in Croatian language
Idioms are a form of multiword expression in which the meaning of the expression as a whole does not equal the sum of its parts. For example, to give somebody a present is a literal expression, whereas to give somebody one’s heart does not refer to the physical act of giving, but to a metaphorical gift of affection. In English, MWUs are usually divided into a variety of linguistic categories. The most common division is into idioms (cost an arm and a leg, miss the boat), fixed phrases (up and about, larger than life, et cetera), compound nouns (nail polish, TV screen, crash course) and compound verbs or phrasal verbs (take somebody on, come around). Linguistically speaking, this division is somewhat unclear both syntactically and semantically. Each of the categories contains a mixture of literal and non-literal units. The issue this raises for translation is that non-literal expressions tend to be culturally dependent and are translated differently into other languages, whereas literal expressions can be translated with greater facility. For example, compound nouns like nail polish or TV screen are translated into Croatian as lak za nokte and TV ekran respectively, which poses no great problem in machine-assisted translation. On the other hand, compound nouns like crash course or hot potato would produce disastrously humorous results if they were translated literally. There are also many cases where a multiword unit corresponds to a single word in another language (Vitas et al., 2007). Thus, it is of great importance that such expressions be properly detected in texts and paired with either a corresponding idiom or a non-idiomatic expression in Croatian (or any other target language).
Therefore, we propose a different method of categorizing idiomatic expressions in this work. First, we apply the semantic filters:
1. if an expression is a compound noun with a literal meaning, it is not considered an idiom;
2. fixed phrases with a non-literal meaning are considered idioms;
3. compound verbs are considered an idiom only if the expression in question is used metaphorically;
4. a proverb is considered an idiom (mostly fixed), but since the focus of the research is on more widely used expressions, proverbs will not be covered fully.
After an idiom passes the semantic filters, it is categorized by its syntactic properties and added to the dictionary accordingly, where it is described using its lexical, syntactic and semantic features. The notation used at the dictionary level is used for the final phase of idiom detection and annotation, i.e. in syntactic grammars (both the NooJ dictionary and the syntactic grammars will be explained in more detail in the section on Dictionaries and Syntactic Grammars).
Croatian idiom structure is described in Matešić (1982), Menac et al. (2003, 2007) and Menac-Mihalić (2007). Among the literature resources on Croatian phraseology, we found the Croatian Dictionary of Idioms (Menac et al., 2003) to be the most comprehensive source currently available. It incorporates as many as 2,878 Croatian idioms, most of which are verbal structures with a corresponding object in a particular case. There are also many noun phrases (usually a noun and its attribute) and fixed expressions which do not change case, tense or person. We used this dictionary as the main reference for the extraction of idioms in our work.
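The four semantic filters listed above can be read as a simple decision procedure. The Python sketch below is only one possible reading, written for illustration: it is not the NooJ implementation, the `Candidate` type and its fields are hypothetical, and the literal/metaphorical judgements are assumed to be supplied by a human annotator.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A multiword candidate with manually assigned labels (hypothetical schema)."""
    text: str
    kind: str                       # "compound_noun", "fixed_phrase", "compound_verb", "proverb"
    literal: bool                   # True if the meaning is compositional
    metaphorical_use: bool = False  # only consulted for compound verbs


def is_idiom(c: Candidate) -> bool:
    if c.kind == "compound_noun":
        # Filter 1: literal compound nouns are excluded.
        # Assumption: non-literal ones (e.g. "hot potato") are kept.
        return not c.literal
    if c.kind == "fixed_phrase":
        # Filter 2: only non-literal fixed phrases count as idioms.
        return not c.literal
    if c.kind == "compound_verb":
        # Filter 3: compound verbs count only when used metaphorically.
        return c.metaphorical_use
    if c.kind == "proverb":
        # Filter 4: proverbs are idioms (even if not covered fully).
        return True
    return False


candidates = [
    Candidate("nail polish", "compound_noun", literal=True),
    Candidate("hot potato", "compound_noun", literal=False),
    Candidate("larger than life", "fixed_phrase", literal=False),
    Candidate("come around", "compound_verb", literal=True),
]
print([c.text for c in candidates if is_idiom(c)])
```

Expressions that pass such a filter would then move on to the syntactic categorization and dictionary encoding described in the text.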
2.2 Importance of idiom detection in translation
A few idioms can be understood in direct translation to e.g. English:
(1) a. biti na vrhu jezika
       en: to be at the tip of the tongue
    b. osvojiti [kome] srce
       en: to win [someone’s] heart
    c. naoružan do zuba
       en: armed to the teeth

Many more are only similar, but not direct translations:
(2) a. komu se steže grlo
       en lit.: the throat is being clenched
       en: have a lump in one’s throat
    b. nositi srce na dlanu
       en lit.: carry the heart in the palm of one’s hand
       en: wear the heart on one’s sleeve

or completely different:
(3) a. grlom u jagode
       en lit.: rush with your throat to the strawberries
       en: jump the gun
    b. ruku na srce
       en lit.: hand to the heart
       en: truth be told
Since the machine does not possess the linguistic knowledge of a bilingual person, it cannot differentiate between these types. There is a vast number of similar expressions where one is idiomatic and the other literal. Without proper idiom-recognition software, an automatic system cannot tell idioms apart from literal expressions. For example, two syntactically identical expressions like morski vuk and morski pas are virtually indistinguishable even if the text has proper POS and syntactic tags. The first one is an idiom meaning a seasoned sailor, while the second one simply means a shark. The purpose of our work is to provide proper distinctions between would-be idiomatic expressions (which are in fact only compound expressions with literal meaning) and real idioms, using both semantic and syntactic knowledge in the NooJ linguistic environment.
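The morski vuk / morski pas contrast shows why lexical knowledge is needed on top of POS and syntactic tags. As a rough illustration (not the NooJ grammars used in this work), the Python sketch below looks up lemma sequences in a small idiom lexicon and tolerates a limited number of intervening tokens, so that discontinuous realizations such as osvojiti [kome] srce can also be matched. The function name, the `max_gap` parameter and the assumption that the input is already lemmatized are all introduced here for the example.

```python
# Toy idiom lexicon keyed by lemma sequences; glosses follow the text above.
IDIOMS = {
    ("morski", "vuk"): "seasoned sailor",
    ("osvojiti", "srce"): "to win [someone's] heart",
}


def find_idioms(lemmas, lexicon=IDIOMS, max_gap=2):
    """Return (start, end, key) for every lexicon entry whose lemmas occur in
    order, allowing at most max_gap tokens between consecutive components."""
    hits = []
    for key in lexicon:
        for start, lemma in enumerate(lemmas):
            if lemma != key[0]:
                continue
            pos, matched = start, True
            for part in key[1:]:
                # Look for the next component within the allowed window.
                window = lemmas[pos + 1: pos + 2 + max_gap]
                if part not in window:
                    matched = False
                    break
                pos = pos + 1 + window.index(part)
            if matched:
                hits.append((start, pos, key))
    return hits


print(find_idioms(["stari", "morski", "vuk"]))     # idiom found
print(find_idioms(["morski", "pas"]))              # literal compound, no hit
print(find_idioms(["osvojiti", "njezin", "srce"])) # discontinuous match
```

A lexicon lookup of this kind cannot decide whether a matched sequence is used idiomatically in a given context; it only narrows the candidates to expressions that have been listed as idioms.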
2.3 Previous work
So far, there has been little work on the computational semantic distinction between idioms and other types of MWUs in Croatian. Every effort to detect an MWU is significant in the quest for more successful machine-assisted translation systems, and previous work should be considered in any future efforts in this field. Firstly, Tadić & Šojat (2003) wrote about a statistical approach to the detection of candidates for MW noun compounds. It is possible to statistically approximate the chance of a digram or trigram being a collocative MWU rather than an accidental combination of words. This method provided useful information about the syntactic structure of noun-based MWUs in Croatian, most of them being a combination of adjectives and nouns in a specific order. For example, in the case of bigrams almost two thirds of the results were a combination + (noun phrase with an attribute), while the remaining third was + (noun phrase with an apposition). A similar distribution was found in trigrams and tetragrams as well. Still, we found that statistics serve better when a semantic distinction poses no issue, i.e. when there is no need to distinguish between collocational MWUs Hrvatski predsjednik and idiomatic ones mačji kašalj

GDEX 920,430 (453.7 per million), search restricted to one collexeme to the right of doći as a node (with option P.DUPLICATE 0)

No.  lemma    frequency
1.   do       195,423
2.   biti     94,411
3.   u        86,221
4.   na       76,134
5.   i        36,084
6.   s        15,304
7.   htjeti   13,661
8.   iz       12,376
9.   vrijeme  9,902
10.  da       8,460
11.  ja       7,048
12.  po       6,576
13.  kao      6,166
14.  k        5,706
15.  kraj     5,041
16.  zbog     4,089
17.  sebe     4,058
18.  samo     4,007
19.  doma     3,992
20.  taj      3,619
21.  kuća     3,585
22.  nakon    3,572
. In this paper, the frequency is taken as the main statistical measure. Considering the high frequency of the construction doći do ‘to come to’ and doći na ‘to come onto’, according to T-score and MI the results for both constructions are very similar, as expected. The construction doći do ‘to come to’ in both hrWaC and HNK has the highest T-scores: doći do 437.133 (hrWaC) / 154.567 (HNK). The construction doći na ‘to come onto’ is the second in hrWaC (229.556), and the third one in HNK (72.704) according to T-score. According to MI, the construction doći do ‘to come to’ is listed as the thirty-first in hrWaC (6.485) and twenty-first in HNK (6.980). The construction doći na ‘to come onto’ has very low MI – it is not among the top two hundred collexemes both in hrWaC and HNK.
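For readers unfamiliar with the two association measures mentioned in this note, the Python sketch below shows how T-score and MI are commonly computed from raw frequencies. It is a minimal illustration under stated assumptions: the formulas are the widely used textbook definitions, the corpus tool behind the reported values may apply a different variant, and the counts in the usage example are invented, not taken from hrWaC or HNK.

```python
import math


def expected(f_node: int, f_coll: int, n: int) -> float:
    """Expected co-occurrence frequency if node and collocate were independent."""
    return f_node * f_coll / n


def t_score(f_pair: int, f_node: int, f_coll: int, n: int) -> float:
    """T-score: (observed - expected) / sqrt(observed)."""
    return (f_pair - expected(f_node, f_coll, n)) / math.sqrt(f_pair)


def mi_score(f_pair: int, f_node: int, f_coll: int, n: int) -> float:
    """Mutual information: log2(observed / expected)."""
    return math.log2(f_pair / expected(f_node, f_coll, n))


# Invented counts: pair seen 100 times, node 1,000 times, collocate 2,000 times,
# in a corpus of 1,000,000 tokens.
print(round(t_score(100, 1_000, 2_000, 1_000_000), 3))
print(round(mi_score(100, 1_000, 2_000, 1_000_000), 3))
```

Because T-score grows with the observed frequency while MI rewards pairs that are rare outside the combination, a very frequent function word such as a preposition can rank high on T-score and low on MI, which is the pattern described in the note.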
Corpus analysis of Croatian constructions with the verb doći ‘to come’
Table 2. (Continued)

No.  lemma    frequency
23.  za       3,572
24.  kod      3,532
25.  jer      3,455
26.  kada     3,436
27.  neki     3,242
28.  od       3,105
29.  pred     2,633
30.  pod      2,437
31.  li       2,215
32.  nov      2,171
33.  oko      2,026
34.  ovdje    1,931
35.  jedan    1,918
36.  tek      1,905
37.  on       1,868
38.  prije    1,778
39.  još      1,776
40.  ni       1,664
41.  preko    1,602
42.  bez      1,555
43.  tamo     1,500
44.  sav      1,479
45.  mi       1,457
(The results from HNK are very similar. They differ in eight candidates: HNK has kad, tijek, trenutak, već, vidjeti, ovamo, sam and kako, but excludes ja, doma, taj, jedan, on, preko and tamo).
3.1.1 Corpus analysis of the construction doći do ‘to come to’
The simple preposition do ‘to’ is one of the oldest prepositions common in all Slavic languages, and it has many functions: as a preposition, as an adverb, and as a prefix as well. The preposition do ‘to’ has many meanings, and one of its basic ones is that of directionality – it refers to the endpoint of an action in a spatial or non-spatial sense. It bears the same meaning as the prefix in the verb doći:

(…) prefiks do- kad znači adlativnost i dolazi uz glagole kretanja pretpostavlja uporabu (ponavljanje) prijedloga do, npr. doći do, došetati do, dopuzati do, dovući se do, doprijeti do i sl. Katkada se u tom značenju prijedlog upotrebljava i kad je riječ o prijelaznim glagolima, npr. dotjerati (stoku) do rijeke, dogurati (ormar) do zida i sl… (Pranjković, 2009).
‘(…) prefix do-, in referring to adlative meaning and when it comes with verbs of motion, presumes the usage (repetition) of the preposition do, e.g. to come to, to walk to, to crawl to, to crawl up to, to reach to, etc. Sometimes this preposition with the same meaning is used with transitive verbs, e.g. to lead (the cattle) to the river, to push (the cabinet) to the wall, etc.’

Goranka Blagus Bartolec & Ivana Matas Ivanković
In exploring how space is structured in language from a typological aspect, Talmy (1985) noticed that Slavic languages have a strong repetitive pattern of using a prefixed verb of movement followed by the preposition identical to the prefix (from Brala-Vukanović & Rubinić, 2011, p. 24), so the high frequency of do after doći is not surprising. The search of the preposition do ‘to’ as a node (with a restriction of -1 to the left, the presumption being that the verb and preposition come in sequence as parts of a multiword unit) showed that doći is the second most frequent collocation candidate within this span. The verb biti ‘to be’ is the most frequent (with a slight advantage), but, as mentioned earlier, it is most often used as an auxiliary verb in the formation of complex verb forms (such as perfect, pluperfect, future II). Doći and do form the collostruct doći do, which has 195,396 occurrences and attracts different collexemes, shown in Table 3.

Table 3. Collocation candidates of collostruct doći do ‘to come to’
Query doći, do > GDEX 195,396 (96.3 per million), search restricted to one collexeme to the right of doći do ‘to come to’ as the node (with the option P.DUPLICATE 0).

No.  lemma      frequency
1.   taj        7,678
2.   izražaj    4,682
3.   zaključak  3,693
4.   neki       3,210
5.   svoj       2,814
6.   promjena   2,740
7.   pobjeda    2,601
8.   on         2,486
9.   velik      2,481
10.  nov        1,933
11.  ovaj       1,531
12.  prvi       1,450
13.  kraj       1,447
14.  cilj       1,272
15.  ja         1,056
Table 3. (Continued) 16.
oni
1,000
17.
takav
981
18.
onaj
960
19.
sukob
893
20.
određen
875
21.
podatak
845
22.
grlo
841
23.
mi
831
24.
drugi
794
25.
željen
793
26.
novac
787
27.
pad
785
28.
informacija
777
29.
oštećenje
776
30.
finale
758
31.
rješenje
758
32.
problem
750
33.
prav
719
34.
bod
690
35.
značajan
667
36.
jedan
651
37.
spoznaja
603
38.
prekid
588
39.
vodstvo
578
40.
smanjenje
574
41.
povećanje
565
42.
još
558
43.
njihov
555
44.
sav
551
45.
njegov
541
46.
točka
536
47.
potpun
535
48.
rezultat
521
49.
saznanje
500
50.
izjednačenje
499
(The results from HNK are very similar. They differ in 11 candidates: HNK has povreda, ustavan, bitan, smjena, eksplozija, dogovor, nekoji, pun, biti, njihov, ozbiljan, ovakav, but excludes ja, oni, onaj, mi, željen, problem, jedan, još, točka, rezultat, izjednačenje.)
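The search behind Table 3 (a fixed node, one position to the right, counts per lemma) can be illustrated with a minimal Python sketch. It assumes a corpus that is already lemmatised and available as a flat list of lemmas; the function names and the toy input are hypothetical and do not reproduce the corpus tool's own query engine. The second function merely back-computes the corpus size implied by the raw and per-million figures quoted in the caption.

```python
from collections import Counter


def right_collexemes(lemmas, node=("doći", "do")):
    """Count the lemma that immediately follows each occurrence of `node`."""
    width = len(node)
    counts = Counter()
    for i in range(len(lemmas) - width):
        if tuple(lemmas[i:i + width]) == node:
            counts[lemmas[i + width]] += 1
    return counts


def implied_corpus_size(hits, hits_per_million):
    """Corpus size in tokens implied by a raw and a per-million frequency."""
    return hits / hits_per_million * 1_000_000


# Hypothetical lemmatised toy input, not corpus data.
toy = ["on", "doći", "do", "zaključak", "i", "doći", "do", "novac",
       "pa", "doći", "do", "zaključak"]
print(right_collexemes(toy).most_common(2))

# Figures quoted in the caption of Table 3.
print(round(implied_corpus_size(195_396, 96.3) / 1e9, 2))
```

On the caption's figures the implied corpus size comes out at roughly two billion tokens, which is only an inference from the two quoted numbers and not a figure stated in the text.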
Further analysis compared the frequency list of the first 50 collocation candidates of the construction doći do ‘to come to’ as a node, of which 26 were nouns: izražaj ‘expression’, zaključak ‘conclusion’, promjena ‘change’, pobjeda ‘victory’, kraj ‘end’, cilj ‘goal’, sukob ‘conflict’, podatak ‘data’, grlo ‘throat’, novac ‘money’, pad ‘fall’, informacija ‘information’, oštećenje ‘damage’, finale ‘finals’, rješenje ‘solution’, problem ‘problem’, bod ‘point’, spoznaja ‘knowledge’, prekid ‘break’, vodstvo ‘lead’, smanjenje ‘decrease’, povećanje ‘increase’, točka ‘point’, rezultat ‘result’, saznanje ‘knowledge’, izjednačenje ‘equation’. Only one of them – novac ‘money’ – is registered as a collexeme in RHJ, while the others are not registered in dictionaries. RHJ uses novac ‘money’ and položaj ‘position’ to exemplify the figurative meaning of doći do, defined as ‘to obtain, to acquire’ (doći do novca ‘to attain money’, doći do položaja ‘attain a position’). The collexeme novac ‘money’ (f (doći do + novca ‘to come to + money-gen.sg’) = 787) is the 26th collocational candidate of the construction doći do. The noun položaj ‘position’ is not listed within the first 26 noun collexemes of construction doći do.

Only two collexemes from this list can be understood in the primary meaning of ‘reaching a destination’: kraj ‘end’ and cilj ‘goal’. In Croatian, spatial meaning represents only the partial semantic potential of these nouns – they are polysemous and have non-spatial meanings, as well as a primary spatial meaning. The noun kraj ‘end’ as the collexeme of the construction doći do initiates another noun in the genitive. The whole construction can have spatial meaning:

(11) do-ći do kraj-a put-a
     come-inf to end-gen.sg road-gen.sg
     ‘to come to the end of the road’
It can also refer to the meaning of ending something that lasted for a certain time:

(12) do-ći do kraj-a radn-og vijek-a
     come-inf to end-gen.sg working-gen.sg life-gen.sg
     ‘to come to the end of working life’

(13) do-ći do kraj-a proces-a privatizacij-e
     come-inf to end-gen.sg process-gen.sg privatization-gen.sg
     ‘to come to the end of the privatization process’

The noun cilj ‘goal’ has manifold semantic potential with the construction doći do – it can have spatial meaning:

(14) prv-i do-ći do cilj-a
     first-nom.sg come-inf to goal-gen.sg
     ‘to reach the goal first’

or it can refer to a non-spatial meaning:

(15) Kako bi do-šli do cilj-a,
     That-cop aux-pst.3pl come-act.3pl to goal-gen.sg
     služ-e se kradljivc-i svakakv-im metoda-ma.
     use-prs.3pl refl thief-nom.pl various.obl-ins.pl method-ins
     ‘To reach their goals, thieves use all kinds of methods’.
Other highly frequent noun collexemes co-occurring with the construction doći do, like izražaj ‘expression’, zaključak ‘conclusion’, pobjeda ‘victory’, promjena ‘change’, rješenje ‘solution’, etc., do not have spatial meaning, and they do not refer to ‘termination of moving in the space’. The verb doći in these constructions is periphrastic, and it forms a fixed lexicalized construction with a prepositional noun phrase. Corpus analysis shows the high frequency of this kind of periphrastic construction, although no Croatian dictionary notes them. Some of these constructions can be observed as decomposed verbs (Silić & Pranjković, 2005, p. 188), because they can be replaced with a single-word verb. When translating these examples into English, it is possible to use the verb to come, but they can also be translated with a verb of the same origin as a noun (e.g. doći do rješenja / riješiti ‘to come to a solution / to solve’, doći do zaključka / zaključiti ‘to come to a conclusion / to conclude’, doći do pada / pasti ‘to come to a decline / to decline’, doći do oštećenja / oštetiti se ‘to come to damage / to damage’…). The construction doći do and its collexemes can refer to a result or a consequence to which a certain action led (e.g. doći do pada ‘to decline’). These constructions are impersonal, with no subject, and the construction doći do can be replaced with the construction dovesti do ‘to lead to’, but in this case the subject which caused the result must be expressed. As translated examples show, in English the verb to come is not used in such examples. The construction doći do with certain collexemes can refer to obtaining, getting something: e.g. doći do rezultata ‘to come to a result – reach a result’, doći do boda/bodova ‘to attain points’, doći do novca ‘to attain money’, doći do pobjede ‘to win’, doći do informacija ‘to get information’, doći do podatka/podataka ‘to get data’. 
English constructions with the same meaning would rarely contain the verb come; they would mostly be translated with get, obtain, reach or another synonymous verb followed by the noun referring to the object. The high frequency of noun collexemes of the construction doći do which do not have a primary spatial meaning calls for a more detailed definition of the verb doći, and especially of the construction doći do. It should be defined as a prepositional verb, since the collostruct doći do attracts different noun collexemes and the whole collostruction neither has the basic meaning nor is an idiom. This
usage is not adequately described in Croatian dictionaries (only RHJ lists examples doći do novca, doći do položaja which do not have primary meaning).

3.1.2 Corpus analysis of the construction doći na ‘to come onto’

As the corpus analysis shows, the preposition na has an extremely high frequency. It can come with either the locative or the accusative in Croatian. With the locative it determines where an action is taking place – usually the top or outer surface of something:

(16) staja-ti na pod-u
     stand-inf on floor-loc.sg
     ‘to stand on the floor’

With the accusative it refers to the goal of the movement or action while still retaining the meaning of the top or outer surface:

(17) pas-ti na pod
     fall-inf onto floor-acc.sg
     ‘to fall onto the floor’.
A corpus search of the preposition na shows that collexeme doći is in 18th place on the list of collocation candidates of na, and it has fewer occurrences than with the preposition do. This is not so unusual, however, since na, as already mentioned, comes with two cases, the locative and the accusative, meaning that its semantic and collocational capacity is greater. The collostruct doći na occurs 76,045 times, and it attracts different collexemes, shown in Table 4.

Table 4. Collocation candidates of collostruct doći na ‘to come onto’
Query doći, na > GDEX 76,045 (37.5 per million), search restricted to one collexeme to the right of doći na as a node (with option P.DUPLICATE 0)

Rank  Lemma          Frequency
1.    vlast          5,936
2.    svoj           4,327
3.    red            3,934
4.    ideja          2,153
5.    ovaj           1,527
6.    svijet         1,301
7.    posao          1,274
8.    mjesto         1,258
9.    naplata        1,225
10.   taj            1,204
11.   čelo           743
12.   utakmica       712
13.   isto           682
14.   naš            566
15.   vrijeme        539
16.   koncert        533
17.   prvi           505
18.   neki           478
19.   dnevan         418
20.   zemlja         415
21.   sastanak       403
22.   tržište        401
23.   vrata          389
24.   sud            388
25.   forum          387
26.   poziv          367
27.   vrh            367
28.   razgovor       356
29.   jedan          343
30.   kraj           337
31.   stadion        328
32.   moj            320
33.   vidjelo        314
34.   sjednica       303
35.   drugi          296
36.   samo           296
37.   stranica       289
38.   pregled        288
39.   njegov         287
40.   trening        279
41.   korak          248
42.   isti           241
43.   njihov         235
44.   odmor          224
45.   razina         216
46.   cilj           209
47.   trg            208
48.   otok           205
49.   sam            190
50.   poljud         189
(The results from HNK differ in 14 candidates: HNK has rasprava, glavni, birački, temelj, ročište, otvorenje, uvid, priprema, intervencija, biralište, list, proba, teren and Kosovo, but excludes isto, zemlja, vrata, forum, jedan, kraj, moj, stranica, isti, njihov, cilj, trg, otok and sam.)
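The comparison between the two corpora reported in the note above amounts to a set difference over the two top-50 lists. The following Python sketch checks the stated figure of 14 differing candidates against Table 4. The HNK list is not given in full in the text, so `hnk_reconstructed` is an assumption: it is rebuilt from the hrWaC list and the two sets of differing lemmas named in the note, and says nothing about the actual ranking in HNK.

```python
# Top-50 lemmas of Table 4 (hrWaC), in rank order.
hrwac_top50 = [
    "vlast", "svoj", "red", "ideja", "ovaj", "svijet", "posao", "mjesto",
    "naplata", "taj", "čelo", "utakmica", "isto", "naš", "vrijeme",
    "koncert", "prvi", "neki", "dnevan", "zemlja", "sastanak", "tržište",
    "vrata", "sud", "forum", "poziv", "vrh", "razgovor", "jedan", "kraj",
    "stadion", "moj", "vidjelo", "sjednica", "drugi", "samo", "stranica",
    "pregled", "njegov", "trening", "korak", "isti", "njihov", "odmor",
    "razina", "cilj", "trg", "otok", "sam", "poljud",
]

# Lemmas named in the note as present only in HNK or only in hrWaC.
hnk_only = {
    "rasprava", "glavni", "birački", "temelj", "ročište", "otvorenje",
    "uvid", "priprema", "intervencija", "biralište", "list", "proba",
    "teren", "Kosovo",
}
hrwac_only = {
    "isto", "zemlja", "vrata", "forum", "jedan", "kraj", "moj", "stranica",
    "isti", "njihov", "cilj", "trg", "otok", "sam",
}

# Reconstruction (an assumption, see above): shared lemmas plus HNK-only ones.
hnk_reconstructed = (set(hrwac_top50) - hrwac_only) | hnk_only
shared = set(hrwac_top50) & hnk_reconstructed

print(len(hrwac_only), len(hnk_only), len(shared), len(hnk_reconstructed))
```

Every lemma the note lists as excluded from HNK does occur in Table 4, and the reconstructed list again has 50 members, so the note and the table are mutually consistent.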
34 of the 50 most frequent collexemes of construction doći na ‘to come onto’ are nouns. None of these noun collexemes is listed as an example within the entry doći ‘to come’ in any Croatian dictionary. The construction doći na is listed only twice in RHJ: it exemplifies the primary meaning of the verb doći ‘to complete the movement of arrival at a destination’ (doći na more ‘to come to the sea’) and it exemplifies the definition ‘to arrive by movement, reach’ (doći na vlak ‘to come onto the train’). Unlike the construction doći do ‘to come to’ (which according to the corpus results usually initiates noun collexemes with either figurative or periphrastic meaning), 16 of the 34 most frequent noun collexemes of the construction doći na refer to the primary meaning of the verb doći: doći na posao ‘to come to work’, doći na Zemlju ‘to come to Earth’, doći na utakmicu ‘to come to the match’, doći na koncert ‘to come to the concert’, doći na sastanak ‘to come to the meeting’, doći na vrata ‘to come to the door’, doći na sud ‘to come to the court’, doći na razgovor ‘to come to talk’, doći na stadion ‘to come to the stadium’, doći na sjednicu ‘to come to the session’, doći na pregled ‘to come to an examination’, doći na trening ‘to come to training’, doći na odmor ‘to come for a holiday’, doći na cilj ‘to come to the goal’, doći na tržište ‘to come onto the market’, doći na otok ‘to come onto the island’, doći na Poljud ‘to come to Poljud (stadium in Split)’. Three collexemes, mjesto ‘place’, kraj ‘end’ and vrh ‘top’, alongside the construction doći na, express both the primary and figurative meaning of the verb doći, e.g.
doći na mjesto nesreće ‘to come to the site of an accident’ (primary meaning) / doći na mjesto predsjednika ‘to come to the position of president’ (figurative meaning); doći na kraj puta ‘to come to the end of the path’ (primary meaning) / doći na kraj mandata ‘to come to the end of the mandate’ (figurative meaning); doći na vrh brijega ‘to come to the top of the hill’ (primary meaning) / doći na vrh ljestvice ‘to come to the top of the charts’ (figurative meaning).
The fifteen noun collexemes with the highest frequency of the collostruct doći na ‘to come onto’ express the figurative meaning of the verb doći: doći na vlast ‘to come to power’, doći na red ‘to come to one’s turn’, doći na ideju ‘to come to an idea’, doći na svijet ‘to come into the world’, doći na naplatu ‘to come to payment’, doći na čelo ‘to come to the head’, doći na tržište ‘to come onto the market’, doći na vrijeme ‘to come on time’, doći na poziv ‘to come by invitation’, doći na vidjelo ‘to come to light’, doći na stranicu ‘to come onto the site’, doći na forum ‘to come to the forum’, doći na razinu ‘to come to the level’, doći na korak ‘to come into step’. Of all of these, only the periphrastic construction doći na vlast ‘to come to power’ can be replaced with a single-word verb zavladati ‘to govern’. In comparison with the construction doći do, the construction doći na does not have great periphrastic potential. The nouns vrijeme ‘time’, korak ‘step’, and poziv ‘invitation’ are not collexemes of doći na, but are rather parts of adverbials: na vrijeme ‘in time’, na korak (do) ‘a step (from)’, na poziv ‘by invitation’.

Corpus analysis has shown the great usage and semantic potential of the construction doći na and noun collexemes on its right-hand side. Although the analysis included the 34 most frequent noun collexemes of the construction doći na ‘to come onto’, contemporary Croatian dictionaries do not record such examples in any sense of the entry doći ‘to come’, which indicates a need for a clearer and more detailed treatment of verbs with such rich semantic and syntactic potential.⁴

⁴ One of the reasons is technical – most examples and meanings in Croatian dictionaries are listed under noun entries. Thus, for example, the construction doći na čelo ‘to come to the head’ in ŠKRJ is listed only as an idiom under the noun entry čelo ‘forehead’. However, such examples are rare, e.g. in the same dictionary, doći na vidjelo ‘to come to light’ or doći na vlast ‘to come to power’ are not listed.

4. Conclusion

This study has described the semantic and syntactic potential of the Croatian verb doći based on a corpus analysis, comparing it with the description of the same verb in the contemporary Croatian dictionaries. Meanings derived from the corpus results significantly surpass the lexicographic description, in which only the primary meaning of the verb or idiomatic usage prevails. The results of the corpus search of the verb doći showed the following:

1. According to the frequency data, the verb doći is most frequent in the Croatian language as a prepositional construction. This feature is realised both in the primary spatial meanings of the verb doći, as described in Croatian dictionaries (doći na more ‘to come onto the sea’) and in non-spatial (figurative) meanings (doći do zaključka ‘to come to a conclusion’, doći na ideju ‘to come to an idea’). Some meanings can be realised only as verb-prepositional-noun constructions (V + prep + N): doći do problema ‘to come to a problem’, doći na vidjelo ‘to come to light’, while some structures can be viewed as periphrastic verbs that can be replaced with a single-word verb: doći do zaključka / zaključiti ‘to come to a conclusion / to conclude’, doći na vlast / zavladati ‘to come to power / to govern’. This prepositional usage must be introduced into dictionary descriptions of the verb doći.

2. Some other Croatian verbs also have manifold prepositional structures, such as dovesti ‘to lead’ (dovesti u iskušenje ‘to tempt’, dovesti do katastrofe ‘to lead to disaster’), držati ‘to keep’ (držati na oku ‘to keep an eye on’, držati pod kontrolom ‘keep under control’), pasti ‘to fall’ (pasti u očaj ‘fall into despair’, pasti na pamet ‘to come to mind’), so this description should be universal for all such verbs. The grammatical descriptions in dictionaries should be expanded and, in addition to transitivity/intransitivity, prepositionality should be added as a structural verb characteristic.

3. Prepositional structures must be taken into consideration when translating Croatian into other languages, e.g. into English, since they cannot be translated literally. Some constructions can have doći / ‘to come’ in both languages (doći do rješenja ‘to come to a solution’, doći na vlast ‘to come to power’), while other Croatian constructions with doći cannot be translated by using to come (doći do pada ‘to decline’; doći do novca ‘to attain money’).

4. The use of Croatian corpora in the analysis of the lemma doći ‘to come’ indicated some deficiencies in corpus tools, for example, the lack of the ability to tag complex verb forms with the auxiliary verbs biti ‘to be’ and htjeti ‘to want’, and the need for the more precise differentiation of homographs (e.g. prepositions and adverbs, particles and adverbs, nouns and adverbs which are often formed by conversion).

The paper has shown that the practical use of corpora can contribute to their improvement. Although some contemporary Croatian dictionaries relied on corpora available at the time they were written, advanced corpus tools recently made available for the Croatian language allow more sophisticated searches and improvement of lexicographic work and also easier detection of multiword units in corpora.
References

Birtić, M., & al. (2012). Školski rječnik hrvatskoga jezika [School dictionary of Croatian]. Zagreb: Institut za hrvatski jezik i jezikoslovlje, Školska knjiga.
Brala-Vukanović, M., & Rubinić, N. (2011). Prostorni prijedlozi i prefiksi u hrvatskome jeziku [Spatial prepositions and prefixes in Croatian]. Fluminensia, 23(2), 21–37.
Brozović Rončević, D., & Ćavar, D. (2010). Riznica: The Croatian language corpus. SlaviCorp2010 conference, Warsaw. http://riznica.ihjj.hr/CLC-Slavicorp.pdf.
Hrvatski jezični portal. http://hjp.novi-liber.hr/.
Longman Dictionary of Contemporary English Online. http://www.ldoceonline.com
Ljubešić, N., & Klubička, F. (2014). {bs,hr,sr}WaC – Web corpora of Bosnian, Croatian and Serbian. In F. Bildhauer, & R. Schäfer (Eds.), Proceedings of the 9th Web as Corpus Workshop (WaC-9) (pp. 29–35). Gothenburg: Association for Computational Linguistics. doi: 10.3115/v1/W14-0405
Macmillan English Dictionary. http://www.macmillandictionary.com/.
Pranjković, I. (2009). Prostorna značenja u hrvatskome jeziku [Spatial meanings in Croatian]. In K. Mićanović (Ed.), Prostor u jeziku / Književnost i kultura šezdesetih (pp. 11–19). Zagreb: Filozofski fakultet, Zagrebačka slavistička škola, Hrvatski seminar za strane slaviste. http://www.hrvatskiplus.org/article.php?id=1830&naslov=prostorna-znacenja-uhrvatskome-jeziku.
Silić, J., & Pranjković, I. (2005). Gramatika hrvatskoga jezika za gimnazije i visoka učilišta [Grammar of Croatian for gymnasiums and universities]. Zagreb: Školska knjiga.
Stefanowitsch, A., & Gries, S. Th. (2003). Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics, 8(2), 200–243. doi: 10.1075/ijcl.8.2.03ste
Šonje, J. (Ed.). (2000). Rječnik hrvatskoga jezika [Dictionary of Croatian]. Zagreb: Leksikografski zavod Miroslav Krleža, Školska knjiga.
Tadić, M. (2009). New version of the Croatian National Corpus. In D. Hlaváčková, A. Horák, K. Osolsobě, & P. Rychlý (Eds.), After Half a Century of Slavonic Natural Language Processing (pp. 199–205). Brno: Masaryk University.
Talmy, L. (1985). Lexicalization patterns: semantic structure in lexical forms. In T. Shopen (Ed.), Language Typology and Syntactic Description: Grammatical Categories and the Lexicon, 3, (pp. 57–149). Cambridge: Cambridge University Press.
Anaphora resolution, collocations and translation Eric Wehrli & Luka Nerima LATL-CUI, University of Geneva
Collocation identification and anaphora resolution are widely recognised as major issues for natural language processing, and particularly for machine translation. This paper focuses on their intersection domain, that is verb-object collocations in which the object has been pronominalised. To handle such cases, an anaphora resolution procedure must link the direct object pronoun to its antecedent. The identification of a collocation can then be made on the basis of the verb and its object or its antecedent. Results obtained from the translation of a large corpus will be discussed, as well as an evaluation of the precision of the anaphora resolution procedure for this specific task. Keywords: collocations, anaphora resolution, translation, pronominalized collocations, syntax based translation, syntax based collocation detection
doi 10.1075/cilt.341.12weh © 2018 John Benjamins Publishing Company

1. Introduction

Collocation identification and anaphora resolution (henceforth AR) are widely recognised as major issues for natural language processing, and particularly for machine translation. An abundant literature has been dedicated to each of those issues (see in particular Mitkov (2002) for AR, Wehrli et al. (2010) and Seretan (2011) for collocation identification), but to the best of our knowledge their intersection domain – a collocation in which the base term has been pronominalised – has hardly been treated yet. This paper intends to be a modest contribution towards filling this gap, focusing on the translation from English to French of collocations of the type verb-direct object, also called light-verb constructions, with and without pronominalisation of the complement.

The paper is organised as follows. The next section (Section 2) will give a brief overview of the translation problems with respect to both collocations and anaphors. We will also show how current MT systems fail to handle such cases successfully. In Section 3 our treatment of collocations and anaphora resolution will be presented, along with some preliminary results. Finally, in Section 4, we will try to address the issue of the frequency of those phenomena, presenting the results of our collocation extraction system for both English and French over a corpus of approximately 10,000 articles from the news magazine The Economist totalling over 8,000,000 words, for English, while the French corpus, taken from the Swiss daily newspaper “Le Temps”, amounts to about 5,000 words.

2. Collocations in Translation

The importance of collocations in translation has long been recognised, both by human translators and by developers of MT systems. For one, collocations tend to be ubiquitous in natural languages. Furthermore, it is often the case that they cannot be translated literally, as illustrated below. One of the characteristic features of collocations is that the choice of the collocate may be quite arbitrary and therefore cannot be safely derived from the meaning of the expression, and for that matter be translated literally. Consider, for instance, the examples in (1)–(3), where for each English phrase (a), (b) is the literal translation and (c) a more adequate translation. In each example, the collocate which cannot be translated literally is in italics.

(1) a. heavy smoker
    b. French: lourd fumeur; German: schwerer Raucher
    c. gros/grand fumeur “big/large smoker”; starker Raucher “strong smoker”

(2) a. Paul broke a record.
    b. French: Paul a brisé un record.
    c. Paul a battu un record. “Paul has beaten a record”

(3) a. He made an appointment with her.
    b. French: Il a fait un rendez-vous avec elle.
    c. Il lui a donné un rendez-vous. “he gave her an appointment”
The adjective heavy in the collocation (1) heavy smoker cannot be translated literally into French or into German. Both of those languages have their own equivalent collocation, which in turn could not be translated literally into English. Similarly, the verbal collocate in a verb-object collocation can usually not be translated literally, as illustrated in (2)–(3). For instance, while one says in English “to break a record”, in French one says “to beat a record”. In most cases, a literal translation, though sometimes understandable, would be felt as “non-idiomatic” or “awkward” by native speakers. Even though this state of affairs does not apply to all collocations, it is widespread across languages and requires a proper treatment of collocations. Commercial MT systems usually have a good handling of collocations of the type “word-with-spaces”, such as adjective-noun, noun-noun, noun-preposition-noun, and the like. With respect to collocations which display a certain amount of syntactic flexibility and in which the two constituents can be arbitrarily far away
from each other, commercial MT systems do relatively poorly, as illustrated in the few examples given at the end of the next section.

3. Translating collocations with Its-2

In this section, we describe how collocations are handled in the Its-2 translation system (cf. Wehrli et al. 2009a, 2009b), which is based on the Fips multilingual parser (cf. Wehrli, 2007; Wehrli & Nerima, 2015). The proposed treatment relies on the assumption that collocations are “pervasive” in NL (cf. Jackendoff, 1997; Mel’čuk, 2003), which calls for a “light” and efficient treatment – perhaps in contrast to true idiomatic expressions, which are far less numerous and may require and justify a much heavier treatment.¹

Let us first consider again Example (2), which involves a verb-object collocation, both in the source language (break-record) and in the target language (battre-record “beat record”). The structure assigned to this sentence by the Fips parser is identical to the structure of a non-collocational sentence such as

(4) Jean a mangé un biscuit ‘Jean has eaten a cookie’
Ideally, therefore, we would like to say that the only difference between the two examples boils down to a lexical difference: the verb and the object head noun correspond to a collocation in (2), but not in (4). Based on this observation, we developed a transfer procedure and a generation procedure which are almost identical for the two cases, except for the lexical transfer.

¹ See Sag et al. 2002 for a thorough and enlightening discussion of multiword expressions.

The general transfer algorithm of Its-2 recursively traverses the syntactic tree structure generated by the parser in the following order: head, left sub-constituents, right sub-constituents. Lexical transfer occurs during the transfer of a non-empty head. At that time, the bilingual dictionary is consulted and the target language item with the highest score among all the possible translations of the source language lexical item is selected. If a collocation is identified in the source sentence, as in our example, the lexical item associated with the verb break will also specify that collocation. In such a case, lexical transfer occurs on the basis of the collocation and not on the basis of the lexeme. The collocate is marked with its specific translation, which is used when lexical transfer treats it. To make things explicit, consider our example break-record → battre-record. When the syntactic head of the collocation (the verb) is handled by lexical transfer, the verb of the associated target collocation is used, hence break → battre (instead of the more standard correspondences briser or casser). The second term of the source collocation (record) is marked to be translated as the second term of the target collocation (record). When the transfer procedure treats the source direct object, this information is used instead of the “normal” lexical transfer, thus avoiding translations such as fichier, dossier or disque or any other of the numerous possible translations of the English noun record.

This procedure yields very satisfactory results, as illustrated by the following simple example of translation, which we compare with outputs from some commercial MT systems, both statistical and rule-based.² A few more examples, with sentences taken from the magazine The Economist, are given in the last section:
(5) a. The record that Paul set is likely to be broken.
    b. Its-2: Le record que Paul a établi est susceptible d’être battu.
    c. Google Translate: Le dossier que Paul ensemble est susceptible d’être rompu.
    d. Systran: Le disque que l’ensemble de Paul est susceptible d’être cassé.
    e. Reverso: Le rapport(record) que Paul met va probablement être cassé.
Example (5) contains two collocations, to set a record and to break a record. The first one occurs in a relative clause, while the latter is in the passive voice. As a result, in neither of them does the direct object follow the verb. For that reason, the three commercial MT systems that we considered fail to identify the presence of those collocations and, thus, yield a poor translation. Its-2, thanks to the Fips parser, is quite capable of identifying verb-object collocations even when complex grammatical processes disturb the canonical order of constituents, and it translates them correctly by means of the equivalent French collocations établir un record and battre un record.
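The transfer procedure described in this section (head first, then left and right sub-constituents, with collocation-based lexical transfer taking priority over the bilingual dictionary) can be sketched in Python as follows. This is only an illustration under stated assumptions and not the Its-2 implementation: the `Node` class, the dictionary contents, the scores and the way a collocation is recorded on a node are all hypothetical, and generation (word order, determiners, inflection) is left out.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical bilingual dictionary: source lemma -> scored candidates.
BILINGUAL = {
    "break": [("casser", 0.6), ("briser", 0.4)],
    "record": [("dossier", 0.5), ("disque", 0.3), ("fichier", 0.2)],
}

# Hypothetical collocation dictionary: source pair -> target pair.
COLLOCATIONS = {
    ("break", "record"): ("battre", "record"),
    ("set", "record"): ("établir", "record"),
}


@dataclass
class Node:
    head: Optional[str]                   # source lemma, None for an empty head
    left: list = field(default_factory=list)
    right: list = field(default_factory=list)
    # Set when the parser has identified a collocation:
    # (collocation key, 0 for the syntactic head or 1 for the collocate).
    colloc: Optional[tuple] = None


def lexical_transfer(node):
    """Pick the target lemma for a non-empty head."""
    if node.colloc is not None:
        key, position = node.colloc
        return COLLOCATIONS[key][position]      # collocation wins over lexeme
    candidates = BILINGUAL.get(node.head)
    if not candidates:
        return node.head                        # unknown word: copy it over
    return max(candidates, key=lambda c: c[1])[0]


def transfer(node, out=None):
    """Traverse head, left sub-constituents, right sub-constituents."""
    if out is None:
        out = []
    if node.head is not None:
        out.append(lexical_transfer(node))
    for child in node.left:
        transfer(child, out)
    for child in node.right:
        transfer(child, out)
    return out


key = ("break", "record")
collocational = Node("break", left=[Node("Paul")],
                     right=[Node("record", colloc=(key, 1))], colloc=(key, 0))
plain = Node("break", left=[Node("Paul")], right=[Node("record")])

print(transfer(collocational))
print(transfer(plain))
```

Because the collocation is recorded on both nodes before transfer starts, the sketch gives the same result whether the object is visited before or after the verb, which is one simple way to cope with the relative-clause and passive cases in (5).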
(6) a. The world record will be broken.
    b. Its-2: Le record du monde sera battu.
    c. Google Translate: Le record du monde sera brisé.
    d. Systran: Le record mondial sera cassé.
    e. Reverso: Le record du monde sera cassé.

² The commercial MT systems are Google Translate (translate.google.fr), Systran (www.systranet.com) and Reverso (www.reverso.net), last accessed on March 30, 2015.
Example (6) also exhibits two collocations, world record and to break a record. The first one is of the “noun-with-spaces” variety, and therefore is well-translated by all the systems. The second one is in the passive form and, as in the previous example, commercial systems fail to recognise it.

3.1 Anaphora resolution

Not only do verb-object collocations exhibit a large amount of syntactic flexibility as we have just seen in the previous section, they can also occur with the direct object pronominalised. This means, for instance, that an occurrence of breaks it is an occurrence of the collocation break-record if it refers to the noun record, as illustrated in (7).
(7) Paul set a new world record last year and he is hoping to break it later this year.
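The point made with (7) is that collocation identification should be attempted on the verb together with the antecedent of a pronominalised object. A minimal Python sketch of that lookup is given below; the lexicon entries and the function name are hypothetical, and the antecedent is assumed to have been supplied already by an anaphora resolution step.

```python
# Hypothetical verb-object collocation lexicon (pairs of lemmas).
COLLOCATION_LEXICON = {("break", "record"), ("set", "record"), ("make", "case")}


def find_collocation(verb_lemma, object_lemma, antecedent_lemma=None):
    """Return the (verb, noun) collocation, using the antecedent if the
    object is a pronoun that has been resolved, else the object itself."""
    noun = antecedent_lemma if antecedent_lemma is not None else object_lemma
    pair = (verb_lemma, noun)
    return pair if pair in COLLOCATION_LEXICON else None


# "... he is hoping to break it ...", with "it" resolved to "record".
print(find_collocation("break", "it", antecedent_lemma="record"))
# Without anaphora resolution the collocation goes undetected.
print(find_collocation("break", "it"))
```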
To properly handle such cases, and more generally to determine which constituent can be interpreted as the antecedent of the pronoun, a process usually called anaphora resolution (cf. Mitkov, 2002) must be undertaken. As a first step towards a proper treatment of anaphora, we developed a simple procedure that allows the Fips parser to handle personal pronouns, by far the most widespread type of anaphora,³ restricted to third person.⁴ A second limitation of our AR procedure is that it only covers cases of anaphoric pronouns with an antecedent within the same sentence or within the preceding sentence. As reported by Laurent (2001) on the basis of a French corpus, these two cases cover nearly 89% of the cases (67% and 22%, respectively).

³ According to Tutin (2002), personal pronouns range from 60% to 80% of anaphoric expressions, based on a large, well-balanced French corpus. Russo et al. (2011) report relatively similar results for English, Italian, German and French.
⁴ First and second person pronouns are left out, since they do not have any linguistic antecedent. Rather, their interpretation is usually set by the discourse situation.

Roughly speaking, our AR procedure adopts the Lappin & Leass (1994) algorithm, adapted to the grammatical representations and other specificities of the Fips parser. First, the AR procedure must distinguish between anaphoric and non-anaphoric occurrences. For English, this concerns mainly the singular pronoun it, which can have an impersonal reading, as in (8). Identifying impersonal pronouns is achieved by taking advantage of the rich lexical and grammatical information available to our parser.

(8) a. It is raining.
    b. It turned out that Bill was lying.
    c. To put it lightly.
    d. It is said that they have been cheated.
    e. It is obvious that Paul was lying.
So-called weather verbs in English (to rain, to snow, etc.) are lexically marked [ + impersonal subject] and so is the verb to turn out when selecting a tensed sentence (when selecting an infinitival sentence, turn out is a raising verb). In sentence (8c) we have an expression which involves a non-referential instance of it. This information must be part of the lexical specification of the expression. The presence of an impersonal pronoun in sentence (8d)–(e) is not explained by lexical knowledge but by grammatical one (d) or lexical and grammatical (e). In the case of sentence (8d) we have a passive sentence which involves a verb that selects a sentence rather than a direct object. In such cases, in English, the subject takes the form of an expletive pronoun, usually the impersonal it. Finally, in the (8e) sentence, obvious is an adjective that takes a sentential subject (lexical knowledge). When the sentential subject gets extraposed, as in our example, an expletive pronoun appear in the subject position (grammatical knowledge). The next step concerns anaphors in the stricter sense of Chomsky’s binding theory (cf. Chomsky, 1981), that is reflexive and reciprocal pronouns, which must be bound in their governing category. Our somewhat simplified interpretation of principle A of the binding theory states that a reflexive/reciprocal pronoun must be linked to (ie. agrees with and refers to) the subject of its minimal clause.5 Finally, in the third step, we consider referential pronouns, such as personal pronouns (he, him, it, she, her, them, etc.), still using the insight of binding theory, which states according to principle B that pronouns must be free (ie. not bound) in their governing category. 
Here again, our simplified interpretation of principle B prevents a pronoun from referring to any c-commanding noun phrase in the same minimal clause.⁶ In a nutshell, c-command is a relation between nodes in a phrase structure: a given node c-commands its siblings as well as all of their subconstituents. Note that the binding theory is not an AR method per se, in the sense that it does not say what the antecedent of a pronoun is. What it does, though, is filter out possible but irrelevant candidates. To illustrate, consider the simple sentences in (9), where the indices represent the co-indexing relation between a pronominal element and its antecedent.
5. The minimal clause containing a constituent X is the first sentential node (tensed or untensed) which dominates X in the phrase structure.
6. See Reinhart (1983) and references there for the complex history and motivation of the c-command relation.
Anaphora resolution, collocations and translation
(9) a. Peterᵢ watches himselfᵢ in the mirror.
    b. Peterᵢ watches himₖ in the mirror.
    c. *Peterᵢ watches himᵢ in the mirror.
Sentence (9a) is well-formed because the anaphor himself is bound by the subject Peter. Given principle A of binding theory, we can conclude that the only possible antecedent of himself is Peter. Following the same reasoning, binding theory validates (9b) and rules out (9c). Since him is a pronoun, it cannot be bound by (i.e. find its antecedent in) a c-commanding noun phrase within the same minimal clause. Therefore, it cannot refer to Peter.
Our implementation of a simple but efficient AR procedure makes use of a stack of noun phrases, which the parser stores for each analysis and maintains across sentence boundaries. When a pronoun is read, the parser first determines whether it is a reflexive/reciprocal pronoun (in which case, by virtue of principle A, it must co-refer with the subject of its minimal clause) or a third person pronoun. In the latter case, the parser distinguishes between referential and non-referential it, as discussed above. As we mentioned, that distinction can be made on the basis of the lexical and grammatical information available to the parser, in connection with the grammatical environment of the pronoun. For referential third person personal pronouns, the procedure selects all the noun phrases stored on the stack which agree in person, number and gender with the pronoun. If more than one is selected, preference goes first to subject arguments and second to non-subject arguments, a heuristic inspired in part by Centering theory (cf. Grosz et al., 1986, 1995; Kibble, 2001). Needless to say, the procedure sketched above is merely a first attempt at tackling the AR problem.
4. Results and evaluation
The examples discussed above are all simple sentences constructed for the purpose of the present research. Let us now turn to “real” sentences taken, respectively, from the July 2, 2002 and February 7, 2004 issues of The Economist. Consider the English collocation to make a case, as illustrated by Examples (10)–(11).
A literal translation into French of this collocation would give something like faire un cas, which is hardly understandable and certainly fails to convey the meaning of that collocation. A more appropriate translation would use the collocation présenter un argument. In the first example, the collocation occurs in a tough-movement construction, a peculiar grammatical construction in which an adjective of the tough-class (tough, difficult, easy, hard, fun, etc.) governs an infinitival complement whose direct object cannot be lexically realised, but is understood as the subject of the sentence – in our
example the phrase such a case.⁷ Following a standard generative linguistics analysis of that construction, we assume that the direct object position of the infinitival verb is occupied by an abstract anaphoric pronoun linked to the subject noun phrase. We can observe that Google Translate chooses a literal translation of the collocation (10b), while Its-2 correctly identifies the presence of the collocation and translates it appropriately with the corresponding French collocation présenter un argument (10c).
(10) a. Such a case would not be at all difficult to make.
     b. Google Translate: Un tel cas ne serait pas du tout difficile a faire.
     c. Its-2: Un tel argument ne serait pas du tout difficile à présenter.
In our second example (11), the collocation make a case occurs twice (making this case, makes it). Notice that in the second occurrence, the base term of the collocation has been pronominalised, with its antecedent in the previous sentence. Thanks to the AR procedure, Its-2 correctly identifies the collocation and translates it appropriately (11d), which is not the case for Google Translate (11b).
(11) a. Every Democrat is making this case. But Mr Edwards makes it much more stylishly than Mr Kerry.
     b. Google Translate: Chaque démocrate fait ce cas. Mais M. Edwards rend beaucoup plus élégamment que M. Kerry.
     c. Bing: Tout démocrate rend cette affaire. Mais M. Edwards, il est beaucoup plus élégant que M. Kerry.
     d. Its-2: Chaque démocrate présente cet argument. Mais M. Edwards le présente beaucoup plus élégamment que M. Kerry.
4.1 Two experiments
To measure the accuracy of our collocation identification procedure as well as the impact of the anaphora resolution algorithm, we ran two experiments, one for English and one for French. The global results are given in Figure 1 below.
7. See Chomsky (1977) for a detailed analysis of this construction.
Experiment 1: English
We parsed and translated a corpus taken from The Economist (years 2002 to 2015) amounting to over 8,000,000 words (463,173 sentences). 14,663 occurrences (tokens) of verb-object collocations were identified, corresponding to 553 types.⁸ In 68 cases, the direct object had been pronominalised, as in the next two examples. In each of them, (a) gives the source sentence(s), in which both the collocation (verb + pronoun) and the antecedent of the pronoun are emphasised; (b) gives the Its-2 translation with the AR procedure turned off; (c) gives the Its-2 translation with the AR procedure turned on; and (d) gives the translation obtained with Google Translate.
(12) a. The golden rule also turns slithery under close inspection. On an annual basis, the government is breaking it.
     b. [-AR] Sur une base annuelle, le gouvernement le casse.
     c. [+AR] Sur une base annuelle, le gouvernement l’enfreint.
     d. [Google] Sur une base annuelle, le gouvernement est le casser.
The best result is (c), the only one where the collocation break-rule is correctly identified, thanks to the AR procedure, which connects the direct object pronoun to the subject of the preceding sentence, golden rule. The translation of that collocation yields the French verb enfreindre rather than casser.
(13) a. In Spain the target is mainly symbolic, since companies will not face financial penalties if they do not meet it.
     b. [-AR] En Espagne la cible est principalement symbolique, depuis que les sociétés n’affronteront pas des pénalités financières si ils ne le rencontrent pas.
     c. [+AR] En Espagne la cible est principalement symbolique, depuis que les sociétés n’affronteront pas des pénalités financières si elles ne l’atteignent pas.
     d. [Google] En Espagne, la cible est surtout symbolique, puisque les entreprises ne seront pas passibles de sanctions financières si elles ne répondent pas.
In that last example, the source sentence contains two pronouns, they referring to companies and it referring to target. In (c), both of them have been correctly handled by the AR procedure and with the latter the collocation meet-target has been identified, yielding the correct collocation translation atteindre(-cible).
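To make the role of the AR procedure in Example (12) concrete, the following Python fragment gives a minimal sketch of the selection step described in Section 3: agreement filtering over a stack of noun phrases, preference for subjects, and a lookup of the resulting verb-object pair in a collocation lexicon. It is not the Fips implementation. The class names, the `clause_id` field, the toy lexicon and the tie-breaking by recency are all assumptions introduced for illustration, and principle B is approximated coarsely by excluding every noun phrase of the pronoun's own minimal clause rather than only the c-commanding ones.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass(frozen=True)
class NounPhrase:
    lemma: str
    person: int
    number: str       # "sg" or "pl"
    gender: str       # "m", "f" or "n"
    is_subject: bool
    clause_id: int    # hypothetical identifier of the minimal clause


@dataclass(frozen=True)
class Pronoun:
    form: str
    person: int
    number: str
    gender: str
    reflexive: bool
    clause_id: int


def agrees(np: NounPhrase, pro: Pronoun) -> bool:
    """Agreement in person, number and gender."""
    return (np.person, np.number, np.gender) == (pro.person, pro.number, pro.gender)


def resolve(pro: Pronoun, stack: List[NounPhrase]) -> Optional[NounPhrase]:
    """Pick an antecedent from the stack (oldest first, most recent last)."""
    if pro.reflexive:
        # Simplified principle A: the subject of the same minimal clause.
        pool = [np for np in stack if np.clause_id == pro.clause_id and np.is_subject]
    else:
        # Coarse stand-in for principle B: skip the pronoun's own minimal clause.
        pool = [np for np in stack if np.clause_id != pro.clause_id]
    candidates = [np for np in pool if agrees(np, pro)]
    if not candidates:
        return None
    # Subjects are preferred over non-subjects; taking the most recent
    # candidate among equals is an assumption of this sketch.
    subjects = [np for np in candidates if np.is_subject]
    return (subjects or candidates)[-1]


def find_collocation(verb: str, pro: Pronoun, stack: List[NounPhrase],
                     lexicon: Set[Tuple[str, str]]) -> Optional[Tuple[str, str]]:
    """Return (verb, antecedent lemma) if that pair is a lexicalised collocation."""
    antecedent = resolve(pro, stack)
    if antecedent is not None and (verb, antecedent.lemma) in lexicon:
        return (verb, antecedent.lemma)
    return None


# Toy lexicon and a hand-built stack for Example (12a); annotations are illustrative.
LEXICON = {("break", "rule"), ("meet", "target"), ("make", "case")}
stack = [
    NounPhrase("rule", 3, "sg", "n", True, 1),
    NounPhrase("inspection", 3, "sg", "n", False, 1),
    NounPhrase("basis", 3, "sg", "n", False, 2),
    NounPhrase("government", 3, "sg", "n", True, 2),
]
it = Pronoun("it", 3, "sg", "n", False, 2)
print(find_collocation("break", it, stack, LEXICON))  # ('break', 'rule')
```

In this toy setting both rule and inspection agree with it, and the subject preference selects rule, so the pair break-rule is found in the lexicon and the translation can use enfreindre instead of casser.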
8. The most frequent collocations are to take place (529 occurrences), to make sense (407), to play a role (323), to make money (304) and to make a difference (266). Among the collocations with pronominalised objects, the most frequent are to spend money (7) and to solve a problem (5).
Experiment 2: French
The Le Temps corpus (years 2000 to 2002) is a little smaller, totalling about 5,000,000 words (284,006 sentences). 16,216 occurrences (tokens) of verb-object collocations were identified, corresponding to 1,088 types.⁹ In 48 cases, the direct object had been pronominalised, as in the next two examples:
(14) a. Nous ne sommes pas pressés: si une opportunité se présente comme ça a été le cas avec CRC, nous la saisirons.
        “We are not in a hurry: if an opportunity turns up as was the case with CRC, we will seize it”
     b. La loi fédérale le prescrit mais tout indique qu’elle sera mal appliquée.
        “The federal law prescribes it but everything indicates that it will be badly enforced”
In Example (14a), the sentence contains one pronoun, la, referring to opportunité. In Example (14b), the sentence contains the pronoun elle, referring to loi. In both examples, the pronouns have been correctly handled by the AR procedure, and the collocations saisir-opportunité (to seize an opportunity) and appliquer-loi (to enforce the law), respectively, have been identified.
Global results of experiments
Figure 1, below, summarises for both languages the corpus size and the number of identified verb-object collocations, as well as the number of verb-object collocations with a pronominalised object.
                               EN          FR
Corpus size (words)            8,000,000   5,000,000
Sentences                      463,173     284,006
V-O collocations (tokens)      15,949      16,216
V-O collocations (types)       664         1,088
Colloc. with pronom. object    68          48
Figure 1. Corpora and verb-object collocation figures
9. The most frequent collocations are avoir besoin ‘to need’ (597 occurrences), jouer un rôle ‘to play a role’ (426 occurrences), avoir lieu ‘to take place’ (402), avoir une chance ‘to stand a chance’ (353), and prendre une décision ‘to make a decision’ (283). Among the collocations with pronominalised objects, the most frequent are appliquer la loi ‘to enforce the law’ (4) and prendre une décision ‘to make a decision’ (3).
Although not very frequent, collocations with a direct object pronoun should not be overlooked if one aims at a high-quality translation, as illustrated by Examples (12)–(14). Enriching the collocation lexicon and extending the AR procedure to a larger set of pronouns, as we intend to do in future work, is likely to increase the number of pronominalised collocations detected by the system.
4.2 Evaluation of the precision of the AR procedure
The following evaluation aims at measuring the performance of the Fips parser in identifying verb-direct object collocations in which the object has been pronominalised. The two examples below, taken from the corpus of The Economist and containing the collocation to solve a problem, illustrate the most frequent cases.
(15) a. Africa has, to put it mildly, a lot of problems; even a hyperpower cannot solve them all.
     b. Clearly, a problem exists. But it is unlikely to be solved by driving people out of Rome for others to deal with.
In sentence (15a), the pronoun them, which is the direct object of the verb solve, refers to the noun problems. Therefore we have an occurrence of the collocation solve-problem. In sentence (15b), the collocation is in passive form, which means that the subject pronoun it corresponds to the deep direct object of the verb to solve. Since the antecedent of the pronoun it is the noun problem in the preceding sentence, we have again an occurrence of the collocation. As pointed out in Section 3.1, our AR procedure can deal with antecedents either within the sentence where the pronoun occurs or in the preceding sentence.
Evaluation settings
The Fips multilingual parser has a lexical database comprising a lexicon of collocations for each language treated. For instance, the English collocation lexicon has about 10,000 entries while the French one has 17,000. For this study we considered only these collocations, i.e. the collocations that are lexicalised. The aim of the experiment is to measure the accuracy of Fips in identifying collocations that meet the following three criteria: (i) the type of the collocation is verb-object; (ii) the collocation is lexicalised; (iii) the object of the collocation has been pronominalised. The underlying aim is to measure the reliability of Fips in performing anaphora resolution in such cases. The corpora used are The Economist and Le Temps, described above. We used the FipsCoView collocation extraction tool (Seretan & Wehrli, 2011), which is based on the Fips parser. For each identified collocation, it displays the collocation itself as well as the context in which the collocation
appears. For the precision measure, we asked a native speaker to manually evaluate the collocation and the anaphora resolution for each language.¹⁰
Results
Fips identified 68 occurrences of verb-object collocations with pronominalised object for English, and 48 for French. They correspond, respectively, to 54 and 34 types of collocations. The evaluation results, displayed in Figure 2 below, show that the precision is excellent for English and very good for French.
                                          EN      FR
Identified colloc. with pronom. object    68      48
Precision                                 92.5%   84.6%
Figure 2. Precision of the identification of the pronominalised verb-object collocations
We explain these good figures by the fact that the constraints on resolving the anaphora and identifying a collocation are very strict. In order to identify an erroneous collocation, Fips would have to determine a wrong antecedent which, combined with the verb, gives a verb-object collocation that exists in the lexicon. Even if the probability of such an error is low, it does occur sometimes. In English, this error occurred in our experiment only with collocations containing the verb to make. For instance, in Example (16) below, the anaphora resolution wrongly determined that the antecedent of it is money. As the collocation to make money exists in the lexicon, it is erroneously identified.
(16) Those good at running money rarely run companies well. The stellar hedge funds of the 1990s failed to make it to the top-ten this decade.
Recall
Recall is harder to measure. In previous work we made a preliminary evaluation on a similar English corpus (Nerima & Wehrli, 2013), computing recall manually on the 18 most frequent pronominalised verb-object collocations. The observed recall was 48%, which corresponds roughly to the state of the art in anaphora resolution. However, a larger evaluation of the recall should be undertaken to confirm that figure.
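For readers less familiar with the two measures used in this section, the short Python fragment below shows how precision and recall are conventionally computed from a set of identified items and a set of items judged correct. It is only an illustration: the evaluation reported here relied on manual judgements by a native speaker, and the item format as well as all the data in the example are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical item format: (sentence id, verb lemma, object lemma).
Item = Tuple[int, str, str]


def precision(identified: Set[Item], gold: Set[Item]) -> float:
    """Share of identified collocations that are correct."""
    if not identified:
        raise ValueError("no identified items")
    return len(identified & gold) / len(identified)


def recall(identified: Set[Item], gold: Set[Item]) -> float:
    """Share of the correct collocations that were identified."""
    if not gold:
        raise ValueError("no gold items")
    return len(identified & gold) / len(gold)


# Invented toy data, not taken from the corpora described above.
identified = {(1, "solve", "problem"), (2, "make", "money"), (3, "break", "rule")}
gold = {(1, "solve", "problem"), (3, "break", "rule"),
        (4, "meet", "target"), (5, "spend", "money")}

print(round(precision(identified, gold), 3))  # 0.667
print(recall(identified, gold))               # 0.5
```

The toy data mirrors the situation discussed above: one identified item is a false positive of the make money kind, which lowers precision, while two correct collocations are missed, which lowers recall.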
10. As we only take into account lexicalised collocations, that is, collocations validated by a lexicographer, we considered that the task of validating a pronominalised collocation is easy enough for a native speaker, and thus does not justify the use of several annotators.
5. Conclusion
Collocations constitute a fundamental aspect of natural language that cannot be overlooked in NLP. We have presented here the method used to handle collocations in the Its-2 translation system. Focusing on verb-object collocations, we have shown that this method is quite successful and vastly superior to the usual handling of such expressions by commercial systems, particularly when the two constituents of the collocation are not adjacent (or near-adjacent) or not in the expected order. Additionally, we turned to the particular case of verb-object collocations where the object has been pronominalised and showed how, with a relatively simple AR procedure, we can detect the antecedent and recover the collocation. The evaluation of the precision of this method shows excellent results. Although not very frequent, collocations with a direct object pronoun should not be overlooked if one aims at a high-quality translation, as illustrated by Examples (12)–(14). Extending the collocation lexicon and the AR procedure to a larger set of pronouns, as we intend to do in future work, is likely to increase the number of pronominalised collocations detected by our parsing and translation systems.
References
Chomsky, N. (1977). On Wh-Movement. In P. W. Culicover, T. Wasow, & A. Akmajian (Eds.), Formal Syntax (pp. 71–132). New York, Academic Press.
Chomsky, N. (1981). Lectures on Government and Binding, Foris Publications.
Grosz, B., Joshi, A., & Weinstein, S. (1995). Centering: A Framework for Modeling the Local Coherence of Discourse. Computational Linguistics, 21(2), 203–225.
Grosz, B., & Sidner, C. L. (1986). Attention, Intention, and the Structure of Discourse. Computational Linguistics, 12(3), 175–204.
Jackendoff, R. (1997). The Architecture of the Language Faculty. Cambridge, Mass.: MIT Press.
Kibble, R. (2001). A Reformulation of Rule 2 of Centering Theory. Computational Linguistics, 27(4), Cambridge, Mass., MIT Press. doi: 10.1162/089120101753342680
Lappin, Sh., & Leass, H. (1994). An Algorithm for Pronominal Anaphora Resolution. Computational Linguistics, 20(4), 535–561.
Laurent, D. (2001). De la résolution des anaphores, Rapport interne, Synapse Développement.
Mel’čuk, I. (2003). Collocations : définition, rôle et utilité. In F. Grossmann & A. Tutin (Eds.), Les collocations : analyse et traitement (pp. 23–32). Amsterdam, De Werelt.
Mitkov, R. (2002). Anaphora Resolution, Longman.
Nerima, L., & Wehrli, E. (2013). Résolution d’anaphores appliquée aux collocations: une évaluation préliminaire. In Proceedings of TALN 2013 Conference, Les Sables d’Olonne, France.
Reinhart, T. (1983). Coreference and Bound Anaphora: A Restatement of the Anaphora Questions. Linguistics and Philosophy, 6, 47–88. doi: 10.1007/BF00868090
Russo, L., Scherrer, Y., Goldman, J.-Ph., Loaiciga, S., Nerima, L., & Wehrli, E. (2011). Etudes inter-langues de la distribution et des ambiguités syntaxiques des pronoms. In Proceedings of TALN 2011 Conference, Montpellier, France.
Sag, I., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of Computational Linguistics and Intelligent Text Processing (CICLing-2002) (pp. 1–15), Lecture Notes in Computer Science, 2276.
Seretan, V. (2011). Syntax-Based Collocation Extraction. Springer Verlag. doi: 10.1007/978-94-007-0134-2
Seretan, V., & Wehrli, E. (2011). FipsCoView: On-line Visualisation of Collocations Extracted from Multilingual Parallel Corpora. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (pp. 125–127). Association for Computational Linguistics.
Tutin, A. (2002). A Corpus-based Study of Pronominal Anaphoric Expressions in French. In Proceedings of DAARC 2002, Lisbonne, Portugal.
Wehrli, E. (2007). Fips, a Deep Linguistic Multilingual Parser. In Proceedings of the ACL 2007 Workshop on Deep Linguistic Processing (pp. 120–127). Prague, Czech Republic.
Wehrli, E., Nerima, L., & Scherrer, Y. (2009a). Deep Linguistic Multilingual Translation and Bilingual Dictionaries. In Proceedings of the Fourth Workshop on Statistical Machine Translation (pp. 90–94). Athens, Greece.
Wehrli, E., Seretan, V., Nerima, L., & Russo, R. (2009b). Collocations in a Rule-based MT System: A Case Study Evaluation of their Translation Adequacy. In Proceedings of the 13th Annual Meeting of the European Association for Machine Translation (pp. 128–135). Barcelona, Spain.
Wehrli, E., Seretan, V., & Nerima, L. (2010). Sentence Analysis and Collocation Identification. In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010) (pp. 27–35). Beijing, China.
Wehrli, E., & Nerima, L. (2015). The Fips Multilingual Parser. In N. Gala, R. Rapp, & G. Bel-Enguix (Eds.), Language Production, Cognition and The Lexicon, Text, Speech and Language Technology (Volume 48, pp. 473–490). Springer Verlag. doi: 10.1007/978-3-319-08043-7_26
The correct interpretation of Multiword Units (MWUs) is crucial to many applications in Natural Language Processing but is a challenging and complex task. In recent years, the computational treatment of MWUs has received considerable attention but there is much more to be done before we can claim that NLP and Machine Translation (MT) systems process MWUs successfully. This volume provides a general overview of the field with particular reference to Machine Translation and Translation Technology and focuses on languages such as English, Basque, French, Romanian, German, Dutch and Croatian, among others. The chapters of the volume illustrate a variety of topics that address this challenge, such as the use of rule-based approaches, compound splitting techniques, MWU identification methodologies in multilingual applications, and MWU alignment issues.
isbn 978 90 272 0060 0
JOHN BENJAMINS PUBLISHING COMPANY