COMPUTATIONAL DEPENDENCY THEORY
Frontiers in Artificial Intelligence and Applications FAIA covers all aspects of theoretical and applied artificial intelligence research in the form of monographs, doctoral dissertations, textbooks, handbooks and proceedings volumes. The FAIA series contains several sub-series, including “Information Modelling and Knowledge Bases” and “Knowledge-Based Intelligent Engineering Systems”. It also includes the biennial ECAI, the European Conference on Artificial Intelligence, proceedings volumes, and other ECCAI – the European Coordinating Committee on Artificial Intelligence – sponsored publications. An editorial panel of internationally well-known scholars is appointed to provide a high quality selection. Series Editors: J. Breuker, N. Guarino, J.N. Kok, J. Liu, R. López de Mántaras, R. Mizoguchi, M. Musen, S.K. Pal and N. Zhong
Volume 258

Recently published in this series:

Vol. 257. M. Jaeger, T.D. Nielsen and P. Viappiani (Eds.), Twelfth Scandinavian Conference on Artificial Intelligence – SCAI 2013
Vol. 256. K. Gibert, V. Botti and R. Reig-Bolaño (Eds.), Artificial Intelligence Research and Development – Proceedings of the 16th International Conference of the Catalan Association for Artificial Intelligence
Vol. 255. R. Neves-Silva, J. Watada, G. Phillips-Wren, L.C. Jain and R.J. Howlett (Eds.), Intelligent Decision Technologies – Proceedings of the 5th KES International Conference on Intelligent Decision Technologies (KES-IDT 2013)
Vol. 254. G.A. Tsihrintzis, M. Virvou, T. Watanabe, L.C. Jain and R.J. Howlett (Eds.), Intelligent Interactive Multimedia Systems and Services
Vol. 253. N. Cercone and K. Naruedomkul (Eds.), Computational Approaches to Assistive Technologies for People with Disabilities
Vol. 252. D. Barbucha, M.T. Le, R.J. Howlett and L.C. Jain (Eds.), Advanced Methods and Technologies for Agent and Multi-Agent Systems
Vol. 251. P. Vojtáš, Y. Kiyoki, H. Jaakkola, T. Tokuda and N. Yoshida (Eds.), Information Modelling and Knowledge Bases XXIV
Vol. 250. B. Schäfer (Ed.), Legal Knowledge and Information Systems – JURIX 2012: The Twenty-Fifth Annual Conference
Vol. 249. A. Caplinskas, G. Dzemyda, A. Lupeikiene and O. Vasilecas (Eds.), Databases and Information Systems VII – Selected Papers from the Tenth International Baltic Conference, DB&IS 2012
ISSN 0922-6389 (print) ISSN 1879-8314 (online)
Computational Dependency Theory

Edited by

Kim Gerdes
LPP (CNRS), ILPGA, Sorbonne Nouvelle, Paris

Eva Hajičová
Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University, Prague

and

Leo Wanner
Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Barcelona and Institució Catalana de Recerca i Estudis Avançats (ICREA)

Amsterdam • Berlin • Tokyo • Washington, DC
© 2013 The authors and IOS Press.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without prior written permission from the publisher.
ISBN 978-1-61499-351-3 (print)
ISBN 978-1-61499-352-0 (online)
Library of Congress Control Number: 2013953717
Publisher: IOS Press BV, Nieuwe Hemweg 6B, 1013 BG Amsterdam, Netherlands, fax: +31 20 687 0019, e-mail: [email protected]
Distributor in the USA and Canada: IOS Press, Inc., 4502 Rachael Manor Drive, Fairfax, VA 22032, USA, fax: +1 703 323 3668, e-mail: [email protected]
LEGAL NOTICE The publisher is not responsible for the use which might be made of the following information. PRINTED IN THE NETHERLANDS
Preface

Kim GERDES a, Eva HAJIČOVÁ b and Leo WANNER c
a LPP (CNRS), ILPGA, Sorbonne Nouvelle, Paris
b Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Charles University, Prague
c Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Barcelona and Institució Catalana de Recerca i Estudis Avançats (ICREA)
In the past decade, dependencies, i.e., directed labeled graph structures representing hierarchical relations between morphemes, words and semantic units, have become the near-standard representation in many fields of computational linguistics, including, e.g., parsing, generation, and machine translation. The linguistic significance of these structures often remains vague, however, and the need for the development of common notational and formal grounds is felt strongly by many people working in these fields.

Historically, the generative grammatical tradition, which in its origins solely attempted to construct a system that distinguishes grammatical from ungrammatical sentences, left linguistics in a state where the outcome of the grammatical analysis, namely phrase structure, was difficult to connect to deeper (semantic, conceptual) structures. The result was a complete separation between, on one side, Natural Language Processing (NLP), which needed deeper analyses for translation, classification, generation, etc., and, on the other side, generative linguistics, which built structures that grew more and more complex as languages farther away from English began to be addressed, with the declared goal of modeling Language as a whole. In the second half of the 20th century, only a few linguists, often referring to Lucien Tesnière, continued to describe language in terms of dependency, mainly because they were working on free word order languages, where phrase structure is obviously not appropriate.

Since the 1990s, NLP has been turning towards dependency analysis, and in the past five years dependency has become quasi-hegemonic: The very large majority of parsers presented at recent NLP conferences are explicitly dependency-based. It seems, however, that the connection between computational linguists and dependency linguists remains sporadic. A very common procedure is that an existing phrase structure treebank is converted into a dependency format that fits the computational linguist's needs, and other researchers then attempt to reproduce this annotation with statistical or rule-based grammars. This is not to say that the situation was any better when parsers still derived phrase structures and linguistics discussed "move alpha". Yet we believe that the circumstances are different today and that dependency linguists and computational linguists have a lot to share: We know that statistical parsing gives better results if we have a linguistically coherent corpus analysis. We need to know what the differences are between surface and deep dependency. What are the units that appear in dependency analysis? What kind of analysis works for which application? How can dependency structures be linked to the lexicon and to semantics?
The Dependency Linguistics Conference Depling 2011 in Barcelona brought together a number of scholars from the domain of Natural Language Processing as well as from theoretical and applied linguistics. All submissions to the conference were critically reviewed and commented upon by internationally well-known reviewers, three to four for each paper. Their comments were an important contribution to the final versions of the papers. This volume unites the formal theoretically and NLP-oriented articles from the conference (in their revised, updated, and extended forms) and gives a general overview of the current state of the art in formal and computational aspects of dependency linguistics.

The volume starts out with what may be the first formal definition of dependency structure, mathematically derived from the simpler notions of fragments and connections, by Kim Gerdes and Sylvain Kahane. Alicia Burga, Simon Mille and Leo Wanner then show how to develop a complete and coherent set of surface syntactic functions for corpus annotation, and Katri Haverinen, Filip Ginter, Veronika Laippala, Samuel Kohonen, Timo Viljanen, Jenna Nyblom, and Tapio Salakoski demonstrate how to detect dependency annotation errors in a treebank. The two following papers address the interface of syntactic structures with semantics: Michael Hahn and Detmar Meurers present the computation of meaning on syntactically annotated learner corpora, and Xinying Chen extracts valency patterns from dependency treebanks using network analysis tools. Bernd Bohnet, Leo Wanner, and Simon Mille's paper shows how dependency-oriented semantic structures can be mapped to the text surface by means of statistical language generation. The formalization of dependency grammars is the subject of Federico Gobbo and Marco Benini's work, which introduces a grammar closely oriented towards Tesnière's original intuitions, and Alexander Dikovsky presents a formal grammar that links categorial and dependency grammars. A complete grammar realization for this formalism is then presented by Denis Béchet, Alexander Dikovsky, and Ophélie Lacroix.

The last section of the volume is devoted to recent advances in dependency parsing. Bernd Bohnet describes the important differences between graph- and transition-based statistical dependency parsers, and Niels Beuck, Arne Köhn and Wolfgang Menzel address the important frontier of parsing incomplete or partial utterances. The volume closes with two contributions on the link between statistical and rule-based dependency parsing: Julia Krivanek and Walt Detmar Meurers evaluate the results of the two approaches, while Igor Boguslavsky, Leonid Iomdin, Leonid Tsinman, Victor Sizov, and Vadim Petrochenkov present a parser that combines the best of the two worlds.

This comprehensive collection of papers gives a coherent overview of recent advances in the interplay of linguistics and natural language engineering around dependency grammars, ranging from definitional challenges of syntactic functions to formal grammars, treebank development, and parsing issues.
List of Authors

Denis Béchet, LINA CNRS, UMR 6241, Université de Nantes, France, [email protected]
Marco Benini, University of Insubria, Varese, Italy, [email protected]
Niels Beuck, Department Informatik, Universität Hamburg, Germany, [email protected]
Igor Boguslavsky, Institute for Information Transmission Problems of the Russian Academy of Sciences, 127994, GSP-4 Moscow, [email protected]
Bernd Bohnet, School of Computer Science, University of Birmingham, United Kingdom, [email protected]
Alicia Burga, Universitat Pompeu Fabra, Barcelona, Spain, [email protected]
Xinying Chen, Jiaotong University, Xi'an, China, [email protected]
Alexander Dikovsky, LINA CNRS, UMR 6241, Université de Nantes, France, [email protected]
Kim Gerdes, LPP, ILPGA, Sorbonne Nouvelle, Paris, France, [email protected]
Filip Ginter, Department of Information Technology, University of Turku, Turku, Finland, [email protected]
Federico Gobbo, University of Insubria, Varese, Italy, [email protected]
Michael Hahn, Seminar für Sprachwissenschaft, Wilhelmstraße 19, 72074 Tübingen, [email protected]
Katri Haverinen, Turku Centre for Computer Science (TUCS), Department of Information Technology, University of Turku, Turku, Finland, [email protected]
Leonid Iomdin, Institute for Information Transmission Problems of the Russian Academy of Sciences, 127994, GSP-4 Moscow, [email protected]
Sylvain Kahane, Modyco, Université Paris Ouest, [email protected]
Arne Köhn, Department Informatik, Universität Hamburg, Germany, [email protected]
Samuel Kohonen, Department of Information Technology, University of Turku, Finland, [email protected]
Julia Krivanek, Seminar für Sprachwissenschaft, Wilhelmstraße 19, 72074 Tübingen, [email protected]
Ophélie Lacroix, LINA CNRS, UMR 6241, Université de Nantes, France, [email protected]
Veronika Laippala, Department of French, University of Turku, Turku, Finland, [email protected]
Wolfgang Menzel, Department Informatik, Universität Hamburg, Germany, [email protected]
Detmar Meurers, Seminar für Sprachwissenschaft, Wilhelmstraße 19, 72074 Tübingen, [email protected]
Simon Mille, Universitat Pompeu Fabra, Barcelona, [email protected]
Jenna Nyblom, Department of Information Technology, University of Turku, Finland, [email protected]
Vadim Petrochenkov, Institute for Information Transmission Problems of the Russian Academy of Sciences, 127994, GSP-4 Moscow, [email protected]
Tapio Salakoski, Turku Centre for Computer Science (TUCS) and Department of Information Technology, University of Turku, Finland, [email protected]
Victor Sizov, Institute for Information Transmission Problems of the Russian Academy of Sciences, 127994, GSP-4 Moscow, [email protected]
Leonid Tsinman, Institute for Information Transmission Problems of the Russian Academy of Sciences, 127994, GSP-4 Moscow, [email protected]
Timo Viljanen, Department of Information Technology, University of Turku, Finland, [email protected]
Leo Wanner, Universitat Pompeu Fabra and Institució Catalana de Recerca i Estudis Avançats, Barcelona, [email protected]
Contents

Preface (Kim Gerdes, Eva Hajičová and Leo Wanner) v
List of Authors vii
Defining Dependencies (and Constituents) (Kim Gerdes and Sylvain Kahane) 1
Looking Behind the Scenes of Syntactic Dependency Corpus Annotation: Towards a Motivated Annotation Schema of Surface-Syntax in Spanish (Alicia Burga, Simon Mille and Leo Wanner) 26
A Dependency-Based Analysis of Treebank Annotation Errors (Katri Haverinen, Filip Ginter, Veronika Laippala, Samuel Kohonen, Timo Viljanen, Jenna Nyblom and Tapio Salakoski) 47
On Deriving Semantic Representations from Dependencies: A Practical Approach for Evaluating Meaning in Learner Corpora (Michael Hahn and Detmar Meurers) 62
Valence Patterns of Parts of Speech in Chinese Language Networks (Xinying Chen) 78
One Step Further Towards Stochastic Semantic Sentence Generation (Bernd Bohnet, Simon Mille and Leo Wanner) 93
Dependency and Valency: From Structural Syntax to Constructive Adpositional Grammars (Federico Gobbo and Marco Benini) 113
Structural Bootstrapping of Large Scale Categorial Dependency Grammars (Alexander Dikovsky) 136
"CDG Lab": An Integrated Environment for Categorial Dependency Grammar and Dependency Treebank Development (Denis Béchet, Alexander Dikovsky and Ophélie Lacroix) 153
Graph-Based and Transition-Based Dependency Parsers with Hash Kernels (Bernd Bohnet) 170
Predictive Incremental Parsing and Its Evaluation (Niels Beuck, Arne Köhn and Wolfgang Menzel) 186
Comparing Rule-Based and Data-Driven Dependency Parsing of Learner Language (Julia Krivanek and Detmar Meurers) 207
A Case of Hybrid Parsing: Rules Refined by Empirical and Corpus Statistics (Igor Boguslavsky, Leonid Iomdin, Vadim Petrochenkov, Victor Sizov and Leonid Tsinman) 226
Subject Index 241
Author Index 243
doi:10.3233/978-1-61499-352-0-1
Defining dependencies (and constituents)

Kim GERDES a and Sylvain KAHANE b
a LPP, Sorbonne Nouvelle, Paris
b Modyco, Université Paris Ouest Nanterre
Abstract. The paper proposes a mathematical method of defining dependency and constituency provided linguistic criteria to characterize the acceptable fragments of an utterance have been put forward. The method can be used to define syntactic structures of sentences, as well as discourse structures for texts, or even morphematic structures for words. Our methodology leads us to put forward the notion of connection, simpler than dependency, and to propose various representations which are not necessarily trees. We also study criteria to refine and hierarchize a connection structure in order to obtain a tree. Keywords. Connection graph, dependency tree, phrase structure, syntactic representation, syntactic unit, catena
Introduction

Syntacticians generally agree on the hierarchical structure of syntactic representations. Two types of structures are commonly considered: constituent structures and dependency structures (or mixed forms of both, like headed constituent structures, sometimes even with functional labeling; see for example Tesnière's nucleus [34], Kahane's bubble trees [21], or the Negra corpus' trees [6]). However, these structures are often more intuition-based than well-defined and linguistically motivated, a point that we will illustrate with some examples. Even the basic assumptions concerning the underlying mathematical structure of the considered objects (ordered constituent tree, unordered dependency tree) are rarely motivated (why should syntactic structures be trees to begin with?).

In this paper, we propose a definition of syntactic structures that supersedes constituency and dependency, based on a minimal axiom: If an utterance can be separated into two fragments, we suppose the existence of a connection between these two parts. We will show that this assumption is sufficient for the construction of rich syntactic structures. The notion of connection stems from Tesnière, who says in the very beginning of his Éléments de syntaxe structurale that "Any word that is part of a sentence ceases to be isolated as in the dictionary. Between it and its neighbors, the mind perceives connections, which together form the structure of the sentence." Our axiom is less strong than Tesnière's here, because we do not presuppose that the connections are formed between words only. In the rest of the paper, we use the term connection to designate a non-directed link (the — dog; the and dog are connected) and
the term dependency to designate a directed link (dogs ← slept; dogs depends on slept or slept governs dogs). We will investigate the linguistic characteristics defining the notion of "fragment" and how this notion leads us to a well-defined graph-based structure, to which we can apply further conditions leading to dependency or constituency trees. We will start with a critical analysis of some definitions in the field of phrase structure and dependency-based approaches (Section 1). Connection structures are defined in Section 2, and are applied to discourse, morphology, and deep syntax in Section 3. The case of surface syntax is explored in Section 4. Dependency structures are defined in Section 5 and refined in Section 6. In Section 7, we show how to derive constituent structures from dependency structures.
1. Previous definitions

1.1. Defining dependency

Tesnière 1959 [34] does not go any further in his definition of dependency and remains on a mentalist level ("the mind perceives connections"). The first formal definition of dependency stems from Lecerf 1960 [24] and Gladkij 1966 [13] (see also [21]), who showed that it is possible to infer a dependency tree from a constituent tree with heads (what is commonly called phrase structure). Later authors have tried to overcome these first constituency-based definitions.

Mel'čuk 1988 [25] states that two words are connected as soon as they can form a fragment, and he gives criteria for characterizing acceptable two-word fragments. But it is not always possible to restrict the definition to two-word fragments. Consider:

(1) The dog slept.

Neither the slept nor dog slept are acceptable syntactic fragments. Mel'čuk resolves the problem by connecting slept with the head of the dog, which means that his definitions of fragments and heads are mingled. Moreover, Mel'čuk's definition of the head is slightly circular: "In a sentence, wordform w1 directly depends syntactically on wordform w2 if the passive [surface] valency of the phrase w1+w2 is (at least largely) determined by the passive [surface] valency of wordform w2." However, the concept of passive valency presupposes the recognition of a hierarchy, because the passive valency of a word or a fragment designates the valency towards its governor (see Section 5).1

Garde 1977 [12] does not restrict his definition of dependency to two-word fragments but considers more generally "significant elements", which allows him to construct the dependency between slept and the dog. However, he does not show how to reduce such a dependency between arbitrary "significant elements" to links between
1 The valency of an element is the set of all the connections it commonly awaits. The idea is that a connection is generally controlled by the governor, which is then active, while the dependent is passive. In the case of modifiers, however, the contrary holds, as they rather control the connection with their syntactic governor. The passive valency of an element can also be seen as its distribution. This supposes that when looking at the distribution of an element we exclude all elements depending on it, which supposes that we already know which slot of the valency is the governor's slot (see Section 5).
words. The goal of this article is to formalize and complete Garde's and Mel'čuk's definitions.

Schubert 1987 [29] attempts to define dependency as "directed co-occurrence" while explicitly including co-occurrence relations between "distant words". He explains the directedness of the co-occurrence by stating that the "occurrence of certain words [the dependent] is made possible by the presence of other words," the governor. However, "form determination should not be the criterion for establishing cooccurrence lines." This adds up to lexical co-occurrences, which can describe relationships between words on a semantic or on a discourse level. Consider the relation between radio and music in:

(2) The radio is playing my favorite music.
(3) I heard a great piece of music on the radio this morning.

It is clear that radio and music co-occur frequently in a statistical sense of the word "co-occur", i.e. the occurrence of one word is highly correlated with the occurrence of the other word within the same sentence. However, in both sentences, music and radio do not form an acceptable text fragment. Moving closer to syntax, consider the relationship between radio and play in sentence (2). This relation describes something we would name a "semantic dependency", a type of dependency that Hudson [18] precisely proposes to show in his dependency structures. For our part, we want to restrict a connection and a dependency to couples of elements that can form an acceptable text fragment in isolation (which is not the case of the radio playing and even less so of music and radio). We do not disagree that some sort of dependency exists between these words, but we consider this link as a lexical or semantic dependency (see Mel'čuk [25], [27]) rather than as a surface syntactic one.

1.2. Defining constituency

In order to evaluate the cogency of a definition of dependency based on a pre-existing definition of constituency, we have to explore how constituents are defined. Bloomfield 1933 [4] does not give a complete definition of syntactic constituents. His definition of the notion of constituent is first given in the chapter Morphology, where he defines the morpheme. In the chapter on syntax, he writes: "Syntactic constructions are constructions in which none of the immediate constituents is a bound form. […] The actor-action construction appears in phrases like: John ran, John fell, Bill ran, Bill fell, Our horses ran away. […] The one constituent (John, Bill, our horses) is a form of a large class, which we call nominative expressions; a form like ran or very good could not be used in this way. The other constituent (ran, fell, ran away) is a form of another large class, which we call finite verb expressions." Bloomfield does not give a general definition of constituents: They are only defined by the previous examples as instances of distributional classes. The largest part of the chapter is dedicated to the definition of the head of a construction. We think that in
some sense Bloomfield should rather be seen as a precursor of the notions of connection (called construction) and dependency than as the father of constituency.

For Chomsky, a constituent exists only inside the syntactic structure of a sentence, and he never gives precise criteria for what should be considered a constituent. In Chomsky 1986 [9], quarreling with the behaviorist claims of Quine [31], he refutes it as equally absurd to consider the fragmentation of John contemplated the problem into John contemplated – the problem or into John contemp – lated the problem instead of the "correct" John – contemplated the problem. No further justification for this choice is provided.

Gleason 1961 [14] proposes criteria for defining constituents (like substitution by one word, or the possibility to be a prosodic unit) and to build a constituent structure bottom up: "We may, as a first hypothesis, consider that each of [the words of the considered utterance] has some statable relationships to each other word. If we can describe these interrelationships completely, we will have described the syntax of the utterance in its entirety. […] We might start by marking those pairs of words which are felt to have the closest relationship." But he makes the following assumption without any justification: "We will also lay down the rule that each word can be marked as a member of only one such pair." Gleason then declares the method of finding the best among all the possible pairings to be "the basic problem of syntax", and he notes himself that his method is "haphazard", as his "methodology has not as yet been completely worked out" and lacks precise criteria.

We are not far from agreeing with Gleason, but we do not think that one has to choose between various satisfactory pairings. For instance, he proposes the following analysis for the NP the old man who lives there:

the old man who lives there
the graybeard who survives
the graybeard surviving
the survivor
he

We think that other analyses are possible, such as

the old man       who lives there
the graybeard     there living
someone           surviving
                  he
and these analyses are not in competition, but complementary; both (and others) can be exploited to find the structure of this NP.
Today, the definition of 'constituent' no longer seems to be a significant subject in the contemporary syntax literature. Even pedagogical books in phrase structure based frameworks tend to skip the definition of constituency; for example, Haegeman 1991 [17] simply states that "the words of the sentence are organized hierarchically into bigger units called phrases" and takes constituency for granted. Commonly proposed tests for constituency include proform substitution tests (including interrogation and clefting), the "stand-alone test", meaning that the segment can function as an "answer" to a question, "movement tests" (including insertion and suppression), and coordinability, the latter being fraught with the confounding factors of multiple constituents, gapping, and right node raising (RNR). However, the application of these criteria to our previous example (the old man who lives there) does not clearly favor the first decomposition over the second one.

In phrase structure frameworks, constituents are nothing but a global approach to the extraction of regularities, the only goal being the description of possible constructions with as few rules as possible. However, it is never actually shown that the proposed phrase structure really is the most efficient way of representing the observed utterances.

We see that the notion of constituency is either not defined at all or defined in an unsatisfactory way, often based on the notion of one element, the head, being linked to another, its dependent, modifying it. It is clear that the notion of dependency cannot be defined as a notion derived from constituency, as the definition of the latter presupposes head-daughter relations, making such a definition of dependency circular. Conversely, we will see that, as soon as the dependency relations are constructed, it is possible to select some fragmentations from among all those that are possible, these fragmentations being the ones that are aimed at in phrase structure based approaches.

1.3. Intersecting analyses

An interesting result of the vagueness of the definitions of constituency is the fact that different scholars invent different criteria that allow choosing among the possible constituent structures. For example, Jespersen's lexically driven criteria select particle verbs as well as idiomatic expressions. For instance, sentence (4) is analyzed as "S W O", where W is called a "composite verbal expression" (Jespersen 1937 [16]).

(4) She [waits on] us.

As a point of comparison, Van Valin & LaPolla 1997 [35] oppose the core and the periphery of every sentence and obtain another unconventional segmentation of sentences, as in example (5).

(5) [John ate the sandwich] [in the library]

Assuming one of these various fragmentations necessitates that one put forward additional statements (all legitimate) based on different types of information like head-daughter relations (for X-bar approaches), idiomaticity (for Jespersen) or argument
structure or information packaging (for Van Valin & LaPolla) and serve merely for the elimination of unwanted fragments.

For us, multiple decompositions of an utterance are not a problem. There is no reason to restrict ourselves to one particular fragmentation, as is done in phrase-structure based approaches. On the contrary, we think that the best way to compute the syntactic structure of an utterance is to consider all its possible fragmentations, and this is the idea we want to explore now. Steedman [33] may have been one of the first linguists to develop a formal grammar that allows various groupings of words. Steedman's articles corroborate the multi-fragment approach to syntactic structure which is pursued here.
2. Fragmentation and connection

2.1. Fragments

We will relax the notion of syntactic constituent and define a new syntactic notion: We call a part of an utterance a fragment if it is a linguistically acceptable phrase with the same semantic contribution as in the initial utterance. Let us take an example:

(6) Peter wants to read the book.

We consider the acceptable fragments of (6) to be: Peter, wants, to, read, the, book, Peter wants, wants to, to read, the book, Peter wants to, wants to read, read the book, Peter wants to read, to read the book, wants to read the book. We will not justify this list of fragments at this point; for the moment we simply point to the fact that wants to read, just like waits on in (4), fulfills all the commonly considered criteria of a constituent: It is a "significant element", "functions as a unit", and can be replaced by a single word (reads).2 In the same way, Peter wants could be a perfect utterance. Probably the most unnatural fragment of (6) is the VP wants to read the book, which, together with the corresponding subject, is traditionally considered as the main constituent of a clause in a phrase structure analysis.

Our fragments correspond more or less to the catenae of Osborne et al. [28] (see Section 2.8 for details). Like them, we think that fragments and catenae are very relevant units (more relevant than constituents in particular), but we consider that the relationship between fragments and dependencies goes the other way around: Osborne et al. [28] define the catenae from the dependency tree as the connected subparts of this tree, but do not say how they define dependency. We think that the best way to define dependency is to define fragments first.3
2 The substitution test is generally limited to proforms. But this test does not work for adjectival or adverbial phrases, for instance. And although English has the verbal proform DO, such a proform does not exist in many other languages, including closely related languages such as French (for instance, Pierre a parlé à Marie 'Peter talked to Mary' gives Pierre l'a fait 'Peter did (it)', where an accusative clitic is obligatory even if the verb PARLER has only an indirect object). The possible substitution by a single lexical item can be a useful relaxation of the substitution test.
3 It may seem paradoxical that we do not think that fragments are syntactic primitives. In some sense we agree with Tesnière when he says that "the mind perceives connections". A fragment is a witness to a possible connection and it allows us to postulate one connection or another.
2.2. Fragmentations

A fragmentation (tree) of an utterance U is a recursive partition of U into acceptable fragments, that is, a tree-like decomposition of the utterance into fragments. Figure 1 shows two of the various possible fragmentations of (6).4
Figure 1. Two fragmentation trees of (6)

More formally, if X is a set of minimal units (for instance the words of (6)), fragments are subsets of X and a fragmentation F is a subset of the powerset5 of X (F ⊆ P(X)) which is well-partitioned, that is, which verifies the two following properties:
1. For every f1, f2 ∊ F, either f1 ⊆ f2, f2 ⊆ f1, or f1 ∩ f2 = ∅;
2. Each fragment is partitioned by its immediate sub-fragments.
Written out less formally, this means that a fragmentation is just a selection of subsets composed of minimal units (the fragments) such that:
1. The fragments cannot overlap strangely: If two fragments overlap, then one must be completely contained in the other.
2. Each (non-minimal) fragment can be decomposed into fragments.
A fragmentation whose fragments are constituents is nothing else than a constituent tree.6 A fragmentation is binary if every fragment is partitioned into 0 or 2 fragments.

4 We represent a constituency tree as a bracketing. See Figure 15 for the equivalence to the traditional representation. Another equivalent representation, introduced by Hockett at the end of the 50s, was used above when we presented Gleason 1961's decomposition of the old man who lives there.
5 Every set S has a powerset, noted P(S). It is the set of all subsets of S. For example, if S = {a,b,c} then P(S) contains the following elements: the whole set S itself, all subsets of two elements {a,b}, {a,c}, {b,c}, all subsets of one element {a}, {b}, {c}, and ∅, the empty set. This gives the identity P(S) = { {a,b,c}, {a,b}, {a,c}, {b,c}, {a}, {b}, {c}, ∅ }. A powerset is thus a set of sets, as is a fragmentation, which is a set of fragments, which are sets of words (or whatever minimal units we have chosen). And if such a set of sets has good properties (i.e. if it is well-partitioned), we can represent it by a tree or a bracketing.
6 There are two views on constituents. From a purely formal point of view, every fragmentation tree is a constituent tree, that is, a tree where each node C represents a subpart of X and the daughters of C are subparts of C partitioning C. From a linguistic point of view, only some subparts of X are (syntactic, prosodic, …) constituents, and this is why some linguists would consider that only some of our fragmentation trees are constituent trees.
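Since the definition is purely set-theoretic, it can be checked mechanically. The following sketch is our own illustration, not part of the paper: it encodes fragments as Python frozensets of word positions (Peter = 0, …, book = 5) and verifies the two properties for a fragmentation assembled from the fragment list of (6) given in Section 2.1.

```python
# Sketch (assumed encoding, not the authors' code): fragments are frozensets
# of word positions; a fragmentation is a set of such fragments.
from itertools import combinations

def is_well_partitioned(fragments):
    """Check the two defining properties of a fragmentation F."""
    frags = set(fragments)
    # Property 1: two fragments either nest or are disjoint.
    for f1, f2 in combinations(frags, 2):
        if f1 & f2 and not (f1 <= f2 or f2 <= f1):
            return False
    # Property 2: every fragment with proper sub-fragments is partitioned by
    # its immediate (maximal proper) sub-fragments.  By property 1 these are
    # pairwise disjoint, so checking that their union restores the fragment
    # suffices.
    for f in frags:
        subs = [g for g in frags if g < f]
        immediate = [g for g in subs if not any(g < h for h in subs)]
        if immediate and frozenset().union(*immediate) != f:
            return False
    return True

# A fragmentation of (6) "Peter wants to read the book" (words 0..5),
# assembled from the fragment list of Section 2.1:
F = {frozenset(s) for s in
     [{0}, {1}, {2}, {3}, {4}, {5}, {2, 3}, {4, 5},
      {2, 3, 4, 5}, {1, 2, 3, 4, 5}, {0, 1, 2, 3, 4, 5}]}
assert is_well_partitioned(F)
```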
2.3. Connection structure and fragmentation hypergraph

We consider that each partition of a fragment into two pieces induces a connection between these two pieces.7 This allows us to define the graph of connections between the fragments of a set X. Such a graph, defined on a set of sets (that is, on a subset of the powerset of X), is called a hypergraph. More formally, a hypergraph H on X is a triplet (X,F,φ) where F ⊆ P(X) (F is the set of fragments) and φ is a graph on F. If F is only composed of singletons, H corresponds to an ordinary graph on X. For each binary fragmentation F on X, we will define a fragmentation hypergraph H = (X,F,φ) by introducing a connection between every couple of fragments which partitions another fragment. Let us illustrate this with an example:

(7) Little dogs slept.

There are two natural fragmentations of (7) whose corresponding hypergraphs are given in Figure 2.8
Figure 2. The two fragmentation hypergraphs of (7)

These two hypergraphs show both constituents and connections (i.e. non-hierarchized dependency). This is redundant, and we will now see how to keep the connections only. We remark that little is connected to dogs in H1 and dogs to slept in H2. H2 also shows a connection between little and dogs slept, but in some sense, this is just a rough version of the connection between little and dogs in H1. The same observation holds for the connection between little dogs and slept in H1, which corresponds to the connection between dogs and slept in H2. In other words, the two hypergraphs contain the same connections (in more or less precise versions). We can thus construct a finer-grained hypergraph H with the finest version of each connection (Figure 3). We will call this hypergraph (which is equivalent to a graph on the words in this case) the connection structure of the utterance. We will now see how to define the connection structure in the general case.
7 The restriction of the connections to binary partitions can be traced back all the way to Becker (1827:469 [3]), who claims that "every organic combination within language consists of no more than two members." (Jede organische Zusammensetzung in der Sprache besteht aus nicht mehr als zwei Gliedern). Although we have not encountered irreducible fragments of three or more elements in any linguistic phenomena we looked into, this cannot be a priori excluded. It would mean that we encountered a fragment XYZ where no combination of any two elements forms a fragment, i.e. is autonomizable in any way without the third element. Our formal definition does not exclude this possibility at any point and a connection can in theory be ternary.
8 It is possible that, for most readers, H1 seems to be more natural than H2. From our point of view, that is not the case: dogs slept is a fragment just as valid as little dogs. Nevertheless, see footnote 12.
Figure 3. Connection structure of (7)

2.4. A complete partial order on hypergraphs

We saw with our example that the connection structure is a finer-grained version of the different fragmentation hypergraphs of the utterance. So we propose to define the connection structure as the infimum9 of the fragmentation hypergraphs for a natural order of fineness. The definition we present in this subsection requires some mathematical background and can be skipped without loss of continuity.

A connection f — g is finer than a connection f' — g' if f ⊆ f' and g ⊆ g'. For instance the connection [dogs] — [slept] is finer than the connection [little dogs] — [slept]. A connection is minimal when it cannot be refined. Intuitively, the fineness order, henceforth noted ≤, represents the precision of the hypergraph, i.e. H1 ≤ H2 if H1 is a finer-grained analysis than H2. A hypergraph H1 is finer than a hypergraph H2 (that is, H1 ≤ H2) if every connection in H2 has a finer connection in H1. In other words, H1 must have more connections than H2, but H1 can have some connections pointing to a smaller fragment than in H2; in this case the bigger fragment can be suppressed in H1 (if it carries no other connections), so H1 can have fewer fragments than H2. This is illustrated with the following schemata:
In case (a), H1 is finer because it has one connection more. In case (b), H1 is finer because it has a finer-grained connection and the dotted fragment can be suppressed. It is suppressed when it carries no further connection. We think that this partial order on hypergraphs is complete (see note 9). We have not proven this claim, but it appears to be true for all the configurations we have investigated. If we have an utterance U and linguistic criteria characterizing the acceptable fragments of U, we define the connection structure of U as the infimum of all its fragmentation hypergraphs.
9 If ≤ is a partial order on X and A is a subset of X, a lower bound of A is an element b in X such that b ≤ x for each x in A. The infimum of A, noted ∧A, is the greatest lower bound of A. A partial order for which every subset has an infimum is said to be complete. (As a classical example, consider the infimum for divisibility on the natural integers, which is the greatest common divisor: 9 ∧ 12 = 3.)
2.5. Constructing the connection structure

Our definitions are perhaps complicated. In practice, however, it is easy to build the connection graph of an utterance as soon as you have decided what the acceptable fragments of an utterance are. Indeed, because the fineness order on hypergraphs is complete, one can begin with any fragmentation and refine its connections until there are no further refinements to be made. The connection structure is obtained when all the connections are minimal. The completeness ensures, due to the uniqueness of the greatest lower bound, that one always obtains the same structure.10

Let us see what happens with example (6). Suppose the first step of the fragmentation is:

f1 = Peter wants to
f2 = read the book

One has a connection here between f1 and f2 that will correspond to a link between two minimal fragments in the final connection structure, possibly words. Now, one wants to discover these minimal fragments. To accomplish that, one seeks the minimal fragment g overlapping both f1 and f2: g = to read. The fragment g can be decomposed into to and read. Therefore the connection between f1 and f2 is finally a connection between to and read. It now remains to calculate the connection structures of f1 and f2 in order to obtain the complete connection structure of the whole sentence (Figure 4).
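The refinement procedure lends itself to a brute-force implementation. The sketch below is again our own illustration under the same frozenset encoding: a connection is an unordered pair of disjoint fragments whose union is itself a fragment, and the connection structure keeps exactly the minimal connections, those that no strictly finer connection refines.

```python
# Sketch (assumed encoding): the connection structure as the set of minimal
# connections over a fragment set.
def connections(fragments):
    """All unordered pairs {f, g} of disjoint fragments with f ∪ g in F."""
    frags = set(fragments)
    return {frozenset((f, g)) for f in frags for g in frags
            if f != g and not (f & g) and (f | g) in frags}

def minimal_connections(fragments):
    """Drop every connection that some strictly finer connection refines."""
    conns = connections(fragments)
    def finer(c1, c2):
        a, b = tuple(c1)
        x, y = tuple(c2)
        return c1 != c2 and ((a <= x and b <= y) or (a <= y and b <= x))
    return {c for c in conns if not any(finer(d, c) for d in conns)}

# The full fragment list of (6) from Section 2.1 (words numbered 0..5):
F6 = {frozenset(s) for s in
      [{0}, {1}, {2}, {3}, {4}, {5}, {0, 1}, {1, 2}, {2, 3}, {4, 5},
       {0, 1, 2}, {1, 2, 3}, {3, 4, 5}, {0, 1, 2, 3}, {2, 3, 4, 5},
       {1, 2, 3, 4, 5}, {0, 1, 2, 3, 4, 5}]}
# Yields Peter—wants, wants—to, to—read, the—book, and read—[the book]:
# the connection structure of Figure 4.
print(minimal_connections(F6))
```

Run on the fragments of (8), the same procedure keeps a connection between slept and the whole fragment the dog, since neither dog slept nor the slept is a fragment; this is exactly the irreducible fragment discussed in Section 2.6 below.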
Figure 4. Connection structure of (6)

2.6. Irreducible fragment

The connection structure of (6) is not equivalent to a graph on its words because some fragments are irreducible. An irreducible fragment is a fragment bearing connections which cannot be attributed to one of its parts. For instance, the book in (6) is irreducible because there is no fragment overlapping the book and including only the or only book (neither read the nor read book are acceptable).

(8) The little dog slept.

Example (8) poses the same problem: little can be connected to dog (little dog is acceptable), but slept must be connected to the dog and this connection cannot be refined (neither dog slept nor the slept is acceptable). One easily verifies that (8) has the fragmentation hypergraphs F1, F2, and F3 of Figure 5 and the connection graph H (which is their infimum). Note that the fragment the dog persists in the final connection graph H because it carries the link with slept, but little is connected directly to dog and not to the whole fragment the dog.

10 This is more complicated if the connection structure contains cycles (Section 2.7). The previous process, starting with a (fragmentation) tree, gives us acyclic structures, and we must verify that no connection can be added to obtain the whole connection structure.
Figure 5. The fragmentation hypergraphs of (8) and its connection structure H = F1 ∧ F2 ∧ F3

Irreducible fragments are quite common with grammatical words. We have seen the case of determiners, but conjunctions, prepositions, or relative pronouns can also cause irreducible fragments:

(9) I think [ that [ Peter slept ] ]
(10) Pierre parle [ à Marie ]11
     Peter speaks [to Mary]
(11) [ the (old) man ] [ who lives ] there

2.7. Cycles

Usually the connection graph is acyclic (and could be transformed into a tree by choosing a node as the root, as we have shown for example (7)). But we can have a cycle when a fragment XYZ can be fragmented into XY+Z, YZ+X, and XZ+Y. This can happen in examples like:

(12) Mary gave advice to Peter.
(13) I saw him yesterday at school.
(14) the rise of nationalism in Catalonia

In (12), gave advice, gave to Peter, and advice to Peter are all acceptable. We encounter a similar configuration in (13) with saw yesterday, saw at school, and yesterday at school (It was yesterday at school that I saw him). In (14), in Catalonia can be connected both with nationalism and with the rise, and there is no perceptible change of meaning. We can suppose that the hearer of these sentences constructs one connection or the other (or even both) and does not need to favor one.12
11 Preposition stranding (*Pierre parle à 'Pierre speaks to') is impossible in French.
12 The fact that we cannot always obtain a tree structure, due to irreducible fragments and cycles, suggests that we could add weights on fragments indicating that a fragment (or a fragmentation) is more likely than another. We do not pursue this idea here, but we think that weighted connection graphs are certainly cognitively motivated linguistic representations. Note also that the preferred fragmentation is not necessary to the constituent structure. For instance, the most natural division in (i) occurs right before the relative clause, which functions as a second assertion in this example and can be preceded by a major prosodic break (Deulofeu et al. [11]).
(i) He ran into a girl, who just after that entered the shop.
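Cycles of this kind fall out of the minimal_connections sketch given in Section 2.5. Here is a hypothetical word-level encoding of (12), with the fragments named above plus to Peter and the complete sentence:

```python
# Hypothetical encoding of (12) "Mary gave advice to Peter" (words 0..4):
C12 = {frozenset(s) for s in
       [{0}, {1}, {2}, {3}, {4}, {0, 1}, {1, 2}, {3, 4}, {1, 3, 4},
        {2, 3, 4}, {1, 2, 3, 4}, {0, 1, 2, 3, 4}]}
# Yields Mary—gave, to—Peter, and the cycle gave—advice, gave—[to Peter],
# advice—[to Peter], matching Figure 6.
print(minimal_connections(C12))
```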
Figure 6. The cyclic connection structure of (12)13

Personal names are another interesting example.14

(15) I met Fernando Manuel Rodriguez Pérez

For such a Spanish name the possible fragments are Fernando, Rodriguez, Fernando Manuel, Fernando Manuel Rodriguez, Fernando Rodriguez, and Fernando Rodriguez Pérez, giving us the following connection graph with a cycle, because both Fernando and Rodriguez can be connected to the verb (Figure 7).

13 The irreducibility of to Peter is conditioned by the given definition of fragments. If we consider relativization as a criterion for fragments, the possibilities of preposition stranding in English may induce the possibility to affirm that gave and advice are directly linked to the preposition to.
14 We would like to thank Orsolya Vincze and Margarita Alonso Ramos for presenting us with this data.
Figure 7. The cyclic connection structure of (15)

2.8. Connection structures and fragments

We have seen that the connection structure is entirely defined from the set of fragments. Conversely, the set of fragments can be reconstructed from the connection graph. We extend the notion of catena that Osborne et al. [28] define for dependency trees. For a dependency tree T on X, a catena of T is a subset of X composed of elements that are connected together by dependencies in T. In other words, catenae of T are supports of connected subparts of T (the support of a structure is just the set of its vertices). For instance, if we take the previous example (Figure 7), Manuel, Fernando and met are connected and form a connected subpart of the complete connection structure: the support of this subgraph, met Fernando Manuel, is a catena.

For hypergraphs, we need a slightly more complex definition, in particular because the notion of "connectedness" is not immediate in this case. In a (well-partitioned) hypergraph, we say that two fragments are weakly connected if there is a connection
between them or if one is included in the other (they cannot intersect if the hypergraph is well-partitioned). A hypergraph is weakly connected if its fragments are weakly connected. For instance, the connection structure of Figure 5 is weakly connected because little is connected to dog, which is included in [the dog], which is connected to slept. But it is not connected in a strong sense, because there is no chain of connections from little to slept without considering the inclusion between dog and [the dog].

A catena of a well-partitioned hypergraph H is the support of a weakly connected sub-hypergraph of H. A weakly connected sub-hypergraph of H is a subpart of H, the vertices of which are weakly connected together. If H is a hypergraph on X, the support of a sub-hypergraph of H is a subset of X; the support of a sub-hypergraph is not exactly the set of its vertices, because its vertices are subsets of X, but the union of its vertices. For instance, read — [the book] is a weakly connected sub-hypergraph of the connection structure of Figure 4; this sub-hypergraph has two vertices, read and the book, and its support, read the book, is their union. But read book is not a catena because read and book are not weakly connected.

Every catena can be obtained by cutting connections in the structure and keeping the segment of the utterance corresponding to continuous pieces of the connection structure. For instance, in the connection structure of (6), cutting the connection between to and read gives the segment read the book. But the segment read book cannot be obtained because even when cutting the connection between the and book, read remains connected to the entire group the book.

We have the following result: If F is a set of fragments and C is the connection structure associated with F, then the set of catenae of C is F. This means that the connection structure contains the memory of all the fragments and all the fragmentation trees that this set of fragments allows us to construct. In this sense, the connection structure is a very powerful structure and the most economical way we can imagine to encode a set of fragments. Considering that our fragments are representative syntactic units, the connection structure they define is a representative syntactic structure (and maybe the most motivated syntactic structure we can imagine). We will now see that different criteria allow us to define different sets of fragments, which in turn define connection structures for different domains.
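The claimed equivalence between fragments and catenae can be checked mechanically on small examples. The sketch below is our own brute-force illustration, reusing F6 and minimal_connections from Section 2.5: it enumerates the supports of weakly connected sub-hypergraphs and recovers exactly the fragment set of (6).

```python
# Sketch (assumed encoding): catenae as supports of weakly connected
# sub-hypergraphs; `vertices` are the fragments carrying connections.
from itertools import combinations

def catenae(vertices, conns):
    def linked(u, v):  # weakly connected: share a connection or nest
        return frozenset((u, v)) in conns or u < v or v < u
    found = set()
    for r in range(1, len(vertices) + 1):
        for sub in combinations(vertices, r):
            todo, seen = [sub[0]], {sub[0]}  # flood-fill connectedness test
            while todo:
                u = todo.pop()
                for v in sub:
                    if v not in seen and linked(u, v):
                        seen.add(v)
                        todo.append(v)
            if len(seen) == len(sub):  # sub is weakly connected
                found.add(frozenset().union(*sub))
    return found

# Vertices of the connection structure of (6): the six words plus the
# irreducible fragment "the book" ({4, 5}).
V6 = [frozenset(s) for s in [{0}, {1}, {2}, {3}, {4}, {5}, {4, 5}]]
assert catenae(V6, minimal_connections(F6)) == F6
```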
3. Discourse, morphology, semantics

Dependency structures are usually known to describe the syntactic structures of sentences, i.e. the organization of the sentence's words. In the next sections, we will give a precise definition of fragments for surface syntax in order to obtain a linguistically motivated connection structure and to transform it into a dependency tree. Let us first apply our methodology to construct connection structures for discourse, morphology, and the syntax-semantics interface.

3.1. Discourse

Nothing in our definition of connection graphs is specific to syntax. We obtain syntactic structures if we limit our maximal fragments to sentences and our minimal fragments to words. But if we change these constraints and begin with a whole text, taking "discourse units" as minimal fragments, we obtain a discourse connection graph. This strategy can be applied to define discourse relations and discourse
structures such as RST or SDRT. Of course, to obtain linguistically motivated structures, we need to define what an acceptable sub-text of a text is (generally this means preserving coherency and cohesion).

(16) (π1) A man walked in. (π2) He sported a hat. (π3) Then a woman walked in. (π4) She wore a coat. (Asher & Pogodalla [2])

We have the fragments π1π2, π1π3, π3π4, but we do not have π2π3 (no coherency) nor π1π4 (no cohesion). This gives us the connection graph of Figure 8.

Figure 8. Connection structure of discourse (16): π1 — π2, π1 — π3, π3 — π4
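The minimal_connections sketch from Section 2.5 applies unchanged at the discourse level. Below is a hypothetical encoding of (16), with π1–π4 as units 0–3 and only the sub-texts listed above (plus the units and the whole text) as fragments.

```python
# Hypothetical discourse-level encoding of (16): units π1..π4 as 0..3.
D = {frozenset(s) for s in
     [{0}, {1}, {2}, {3}, {0, 1}, {0, 2}, {2, 3}, {0, 1, 2, 3}]}
# Yields exactly π1—π2, π1—π3, π3—π4: the discourse graph of Figure 8.
print(minimal_connections(D))
```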
3.2. Morphology

At the other end of the scale, we can fragment words into morphemes. To define the acceptable fragmentations of a word, we need linguistic criteria like the commutation test. As an example from constructional morphology, consider the word unconstitutionally, which has two possible fragmentations, presented in Figure 9. These two possible decompositions can be summarized in a unique structure, i.e. the connection structure induced by the two possible fragmentations.
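Under a hypothetical segmentation un + constitution + al + ly (morphemes 0–3), the same procedure reproduces a connection structure consistent with the two decompositions summarized in Figure 9:

```python
# Hypothetical morpheme-level encoding of "unconstitutionally":
# un=0, constitution=1, al=2, ly=3.
M = {frozenset(s) for s in
     [{0}, {1}, {2}, {3}, {1, 2}, {0, 1, 2}, {1, 2, 3}, {0, 1, 2, 3}]}
# Yields constitution—al, un—[constitutional], [constitutional]—ly, with
# "constitutional" as an irreducible fragment.
print(minimal_connections(M))
```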
Figure 9. Two fragmentation trees and the resulting connection structure for a word decomposition

3.3. Deep Syntax

The deep syntactic representation is the central structure of the semantics-syntax interface (Mel'čuk [25], Kahane [22]). If we take compositionality as a condition for fragmentation, we obtain a structure that resembles Mel'čuk's deep syntactic structure. In other words, the deep syntactic structure is obtained by the same method as the
surface syntactic structure, except that idioms are not fragmented and semantically empty grammatical words are not considered as fragments (Figure 10).

(17) Pierre donne du fil à retordre à ses parents.
     lit. Peter gives thread to twist to his parents
     'Peter has become a major irritant to his parents'
Figure 10. Deep-syntactic connection structure

4. Fragmentations for surface syntax

4.1. Criteria for syntactic fragments

The connection structure we obtain depends completely on the definition of acceptable fragments. We are now interested in the linguistic criteria we need in order to obtain a connection structure corresponding to a usual surface syntactic structure. As a matter of fact, these criteria are more or less the criteria usually proposed for defining constituents. A surface syntactic fragment of an utterance U
• is a subpart of U (generally in its original order),15
• is a linguistic sign and its meaning is the same when it is taken in isolation and when it is part of U,16
• can stand alone (for example as an answer to a question),17
• belongs to a distributional class (and can for instance be replaced by a single word).
Mel'čuk [26] proposes, in his definition of wordforms, to weaken the stand-alone property (or autonomizability). For instance in (8), the or slept are not autonomizable, but they can be captured by subtraction of two autonomizable fragments: slept = Peter slept \ Peter, the = the dog \ dog.18 We call such fragments weakly autonomizable.19
15 For instance, the NP a wall 3 meters high has the subfragment a high wall and not *a wall high.
16 This condition has to be relaxed for the analysis of idiomatic expressions as they are precisely characterized by their semantic non-compositionality. The fragments are in this case the elements that appear autonomizable in the paradigm of parallel non-idiomatic sentences.
17 Mel'čuk (1988 [25], 2011 [27]) proposes a definition of two-word fragments. Rather than the stand-alone criterion, he proposes that a fragment must be a prosodic unit. This is a less restrictive criterion, because the possibility to stand alone supposes being a speech turn and therefore being a prosodic unit. For instance little dog can never be a prosodic unit in the little dog, but it is a prosodic unit when it stands alone. We think that this criterion is interesting, but not easy to use, because the delimitation of prosodic units can be very controversial and seems to be a graded notion. Note also that clitics can form prosodic units which are unacceptable fragments in our sense, like in:
(i) the king | of England's | grandmother
(ii) Je crois | qu'hier | il n'est pas venu 'I think | that yesterday | he didn't come'
18 Note that singular bare nouns like dog are not easily autonomizable in English, but they can for instance appear in titles.
19 Some complications arise with examples like Fr. il dormait 'he slept'. Neither il (a clitic whose strong form lui must be used in isolation) nor dormait is autonomizable. But if we consider the whole distributional class of the element which can commute with il in this position, containing for example Peter, we can consider il to be autonomizable by generalization over the distributional class.
Of course, even if our approach resolves most of the problems arising when trying to directly define constituents, some problems remain. For instance, if you consider the French noun phrase le petit chien 'the little dog', the three fragments le chien, petit chien, and le petit 'the little one' are acceptable. Eliminating the last fragment le petit necessitates that one assume nontrivial arguments: le petit, when it stands alone, is an NP (it commutes with NPs), but it cannot commute with NPs like for example la fille 'the girl' in le petit chien, as *la fille chien 'the girl dog' is ungrammatical. Many exciting questions posed by other phenomena like coordination or extraction cannot be investigated here for lack of space.

4.2. Granularity of the fragmentation

Syntactic structures can differ in their minimal units. Most authors consider that wordforms are the basic units of dependency structure, but some authors propose to consider dependencies only between chunks, and others between lexemes and grammatical morphemes. The following figure shows representations of various granularities for the same sentence (18).

(18) A guy has talked to him.

[Figure: Syntactic trees of various granularities for (18)]
Tree A depicts an analysis in chunks (Vergne [36]), tree B in words, and tree D in lexemes and inflectional morphemes (it can be compared to an X-bar structure with an IP, governed by agreement and tense). Tree C (corresponding to the surface syntactic structure of Mel'čuk [25]) can be understood as an underspecified representation of D. These various representations can be captured by our methods. The only problem is to impose appropriate criteria to define what we accept as minimal fragments. For instance, trees C and D are obtained if we accept parts of words which commute freely to be "syntactic" fragments (Kahane [22]). Conversely, we obtain tree A if we only accept strongly autonomizable fragments.
5. Heads and dependencies

Most syntactic theories suppose that the syntactic structure is hierarchized.20 This means that connections are directed. A directed connection is called a dependency. For a dependency from A to B, A is called the governor of B; B, the dependent of A; and A, the head of the fragment AB.21 The introduction of the term "head" into syntax is commonly attributed to Henry Sweet (1891–96, I:16, Sections 40 and 41 [30]):
"The most general relation between words in sentences from a logical point of view is that of adjunct-word and head-word, or, as we may also express it, of modifier and modified. […] The distinction between adjunct-word and head-word is only a relative one: the same word may be a head-word in one sentence or context, and an adjunct-word in another, and the same word may even be a head-word and an adjunct-word at the same time. Thus in he is very strong, strong is an adjunct-word to he, and at the same time head-word to the adjunct-word very, which, again, may itself be a head-word, as in he is not very strong."
Criteria for the recognition of the direction of relations between words have been proposed by Bloomfield [4], Zwicky [37], Garde [12], and Mel'čuk [25]. The most common criterion is that the head of a constituent is the word controlling its distribution, that is, the word that is the most sensitive to a change in its context. But for any fragment, its distribution does not depend only on its head (and, as we have said in the introduction, constituents cannot easily be defined without using the notion of head). As an example, consider the fragment little dogs in (19):
(19) Very little dogs slept.
As little is connected to very and dogs to slept, little dogs has the distribution neither of dogs nor of little in (19), as very dogs slept and very little slept are both unacceptable. Determining the head of the fragment little dogs (i.e. the direction of the

20 The only dependency-based grammar we know that uses non-hierarchized connections (and even cycles) is Link Grammar [32], which has developed one of the most efficient parsers of its time.
21 Dependency relations are sometimes called head-daughter relations in phrase structure frameworks. Note the distinction between head and governor. For a fragment f, the governor of f is necessarily outside f, while the head of f is inside f. The two notions are linked by the fact that the governor x of f is the head of the upper fragment composed of the union of f and x.
relation between little and dogs) is equivalent to the identification of the governor of this fragment (between very and slept). But, as soon as we have identified the governor of the fragment, the head of the fragment is simply the word of the fragment which is connected to the fragment's governor, i.e., to the main word outside the fragment. For example, in (19), the identification of slept as the governor of the fragment little dogs also establishes dogs as the head of little dogs. Problems occur only if we are dealing with an irreducible fragment like the determiner–noun connection.22 To sum up: in order to identify the directedness of the connections and to define a dependency structure for a sentence, it is central to define the head of the whole sentence (and to resolve the case of irreducible fragments if we want a dependency tree). We consider that the head of the sentence is the main finite verb, because it bears most of the illocutionary marks: interrogation, negation, and mood morphemes are linked to the main finite verb. In English, interrogation changes the verbal form (20a), and in French, interrogation (20b), negation (20c), or mood (20d) can be marked by adding clitics or inflectional morphemes on the finite verb even if it is an auxiliary verb.
(20) a. Did very little dogs sleep?
b. Pierre a-t-il dormi? lit. Peter has-he slept? 'Did Peter sleep?'
c. Pierre n'a pas dormi. lit. Peter neg. has neg. slept 'Peter didn't sleep'
d. Pierre aurait dormi. lit. Peter have-COND slept 'Peter would have slept'
Once the head of the sentence has been determined, most of the connections can be directed by a top-down strategy. Consequently, the main criterion to determine the head of a fragment f is to check whether one of the words of f can form a fragment with the possible governors of f, that is, whether one of the words of f can be connected with the possible governors of f. If not, we are confronted with an irreducible fragment, and other criteria must be used, which will be discussed in the next section (see also Mel'čuk [25], [27]).23 Nevertheless, it is well known that in many cases the head is difficult to find (Bloomfield [4] called such configurations exocentric). It could be
22 Various criteria have been proposed in favor of considering either the noun or the determiner as the head of this connection, in particular in the generative framework (Principles and Parameters, Chomsky (1981 [8]), remains with NP, and, starting with Abney (1986 [1]), DP is preferred). It seems that the question is triggered by the assumption that there has to be one correct directionality of this relation, in other words that the syntactic analysis is a (phrase structure) tree. This overly simple assumption leads to a debate whose theoretical implications do not reach far, as any DP analysis has an isomorphic NP analysis. The NP/DP debate was triggered by the observation of a parallelism in the relation between the lexical part of a verb and its inflection (reflected by the opposition between IP and VP in the generative framework). This carries over to dependency syntax: the analysis D of sentence (18) captures the intuition that the inflection steers the passive valency of a verb form.
23 Conversely, whenever the fragmentation tests do not give clear results on whether or not a connection must be established, criteria used to determine the head can be helpful to confirm the validity of the connection.
advocated not to attempt to direct such connections and thus settle with an only partially directed connection structure.24
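The top-down strategy mentioned above can be stated as a small graph traversal: starting from the word identified as the head of the sentence, each connection is directed away from the part of the structure already reached. The following is a minimal sketch of this idea, our own illustration rather than the authors' implementation; it assumes the connection structure is given as an undirected graph and ignores irreducible fragments and cycles:

```python
from collections import deque

def direct_connections(nodes, connections, head):
    """Orient an undirected connection structure top-down from the
    sentence head: each edge is directed away from the node reached first."""
    neighbours = {n: set() for n in nodes}
    for a, b in connections:
        neighbours[a].add(b)
        neighbours[b].add(a)
    dependencies, seen, queue = [], {head}, deque([head])
    while queue:
        governor = queue.popleft()
        for dependent in neighbours[governor] - seen:
            dependencies.append((governor, dependent))  # governor -> dependent
            seen.add(dependent)
            queue.append(dependent)
    return dependencies

# (19) Very little dogs slept: connections very-little, little-dogs, dogs-slept;
# taking the finite verb "slept" as head directs all three connections.
print(direct_connections(
    ["very", "little", "dogs", "slept"],
    [("very", "little"), ("little", "dogs"), ("dogs", "slept")],
    "slept"))
# [('slept', 'dogs'), ('dogs', 'little'), ('little', 'very')]
```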
6. Refining the dependency structure

Even when the connection structure is completely directed, the resulting dependency structure is not necessarily a tree, due to irreducible fragments and cycles. We can use new principles to refine the dependency structure and to get closer to a dependency tree. The situation is as follows: C is connected to AB, and neither AC nor BC is an acceptable fragment. We thus have the following configuration: [ A — B ] — C. A first case can be solved by structural considerations alone: the case where C governs AB and B governs A, which means [ A ← B ] ← C. In this case there is only one solution for obtaining a tree: it is not possible to connect C and A, and we necessarily have A ← B ← C. This can be illustrated by the sentence Peter thinks Mary snored, with A = Mary, B = snored, and C = thinks. In any other case of A — B being irreducible, C can be connected either to A or to B, and two solutions are structurally possible. We need an additional linguistic principle to decide. To train our intuition, let us consider an example: the most famous of the world can be analyzed as [ [the most] ← [famous] ] → [of the world], and neither famous of the world nor the most of the world is acceptable.25 But we think that [of the world] is selected by the superlative marker the most rather than by the adjective famous, because for any adjective X we have the most X of the world, while the most cannot commute with other adjective modifiers (*very famous of the world). Generalizing this idea, we propose the following principle.
Principle of selection: If in the configuration [ A — B ] — C, B commutes less freely than A, then C can be connected directly to B, which gives us the configuration A — B — C.26
In our previous example, famous commutes more freely than the most in the most famous of the world, because famous can commute with every adjective while the most cannot commute with most other adjective modifiers. This means that it is the most and not famous that selects of the world, and if we refine the connection between the most famous and of the world, this connection must be attributed to the most. We can give another example: in the sentence Mary snored, if we segment snored into a verbal lexeme SNORE and an inflection -ed, the question arises to which element the subject Mary must be connected (Figure 11). As SNORE can commute with most verbal lexemes but -ed cannot commute with non-finite inflections (non-finite forms of the
24 Equally, the problem of PP attachment in parsing is certainly partially based on true ambiguities, but in many cases, it is an artificial problem of finding a tree structure where the human mind sees multiple connections, as for instance in He reads a book about syntax or in the examples (12) to (13). We can assume that a statistical parser will give better results when trained on a corpus that uses the (circular) graph structure, reserving the simple tree structures for the semantically relevant PP attachments.
25 Most English speakers prefer the most famous in the world, which may have another structure because famous in the world is an acceptable fragment. French has only one construction, le plus célèbre du monde, lit. 'the most famous of the world', for which the analysis discussed here also holds true.
26 The term selection is often used in linguistics for selectional restrictions. This is the same idea here: C selects B rather than A because C restricts the distributional paradigm of B more than the one of A.
verb cannot fill the subject slot of the verbal valency), it follows that it is the inflection and not the lexical part of the verb that selects the subject.27
Figure 11. A dependency structure H1 of Mary snored and its refinement H2
Selection must not be confused with subcategorization. Following Tesnière [34], we take the subject to be a verbal actant just like the direct and the indirect object. The connection of the subject with the inflection can be compared with other cases of raising. For instance, in Peter seems to snore, Peter is clearly the subject of seems even if it is subcategorized for by SNORE and not by SEEM. The connection between parle and à Marie in (10) can also be refined using the principle of selection. As Marie can commute with any noun while the preposition à cannot commute with other prepositions, the connection is attributed to the preposition (Figure 12).
Figure 12. Dependency structure of (10) and its refinement H2
Sometimes the principle of selection is not discriminating. This is the case for the choice between determiner and noun in The dog slept. Indeed, the as well as dog can commute freely: the/a/each/this… dog/girl/man/idea… slept. In such a case another principle must be introduced in order to obtain a tree, but we will not discuss this point further.
Figure 13. Dependency structure of The dog slept
Other principles can be proposed in order to decide which of the two categories, determiner or noun, is the head of this fragment. As discussed before (Note 22), we are not convinced that such a decision is linguistically relevant. Nevertheless, from a computational point of view it can be desirable to manipulate trees rather than hypergraphs, and an arbitrary decision can be taken. The most common decision in
27 X-bar syntax makes the same assumption: the subject is a daughter of InflP while other actants are daughters of VP. Note also that, as discussed in Section 5, the inflection is clearly the head of the fragment composed of the verbal lexeme and its inflection.
dependency-based theory (Tesnière [34], Mel'čuk [25], but see Hudson [19] for the reverse choice) is to choose the noun as the head. This choice privileges the semantic dependency between the noun and the governor of the NP (I bought two books means 'I bought books, the number of which is two'). We will conclude this section by discussing the consequences of the refinement of the structure. Let us recall that the set of possible fragments can be recovered from the connection structure (Section 2.8). As soon as we refine the structure, we increase the number of catenae: [ A — B ] — C has only two catenae (AB and ABC), while A — B — C has three catenae (AB, BC and ABC). It is possible to label the connections in order to indicate which connection has been refined by the principle of selection and thus does not correspond to a fragment: each time connections external to a fragment are attributed to internal nodes of this fragment, we label e the external connections that have been refined and i the internal connections of the fragment. For instance, if the connection between C and the fragment f = AB is refined and attributed to B, we obtain the labeled graph A –i– B –e– C. The initial hypergraph can be reconstructed by attributing each e-connection to the fragments obtained by aggregating all the nodes connected by adjacent i-connections (Figure 14). It is sometimes necessary to coindex e-links and i-links obtained from the refinement. Let us, for instance, consider the configuration [ A — B ] — [ C — D ], where we have two irreducible fragments AB and CD and a connection between B and CD. If we refine both AB and CD, the connection between B and CD will become an e-link, but we must indicate that this e-link corresponds to the i-link between C and D and not to the i-link between A and B. See also the second refinement of Figure 14.
Figure 14. A dependency structure and two successive refinements (with encoding of fragments by labels)
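The reconstruction just described, aggregating the nodes linked by adjacent i-connections and re-attributing each e-connection to the resulting fragments, can be sketched with a union-find pass. This is our own encoding, not the authors' implementation, and the co-indexation of e- and i-links mentioned above is not modeled:

```python
def reconstruct_fragments(edges):
    """edges: list of (node1, node2, label) with label 'i' (internal) or
    'e' (refined external). Nodes joined by i-edges form one fragment;
    each e-edge is then re-attributed to the fragments of its endpoints."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    for a, b, label in edges:
        if label == 'i':
            parent[find(a)] = find(b)       # union the two i-connected nodes
        else:
            find(a); find(b)                # register isolated endpoints
    fragment = {x: frozenset(y for y in parent if find(y) == find(x)) for x in parent}
    return [(fragment[a], fragment[b]) for a, b, label in edges if label == 'e']

# A -i- B -e- C: the i-edge makes AB one fragment, so the e-edge is
# reconstructed as the hypergraph connection [A - B] - C.
print(reconstruct_fragments([('A', 'B', 'i'), ('B', 'C', 'e')]))
# [(frozenset({'A', 'B'}), frozenset({'C'}))]
```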
7. Constituency

We saw in Section 2.8 that any fragmentation can be recovered from the connection structure. As soon as the connections are directed, some fragmentations can be favored and constituent structures can be defined. Let us consider nodes A and B in a dependency structure. A dominates B if A = B or if there is a path from A to B starting with a dependency whose governor is A. The fragment of elements dominated by A is called the maximal projection of A (following [24] and its definition of projectivity). Maximal projections are major constituents (XPs
in X-bar syntax). The maximal projection of A can be fragmented into {A} and the maximal projections of its dependents. This fragmentation gives us a flat constituent structure (with possibly discontinuous constituents). Discontinuous constituents are often postulated in phrase structure grammars and are particularly difficult to characterize because they require a relaxation of the defining criteria. The difficulty is to authorize discontinuous constituents without obtaining an explosion of the units that can possibly acquire the status of constituent. In our approach we accept discontinuous fragments from the beginning, and the constituents are selected among the fragments using the hierarchization provided by the heads. Partial projections of A are obtained by considering only a part of the dependencies governed by A. By defining an order on the dependencies of each node (for instance by deciding that the subject is more "external" than the object), we can privilege some partial projections and obtain our favorite binary fragmentation, equivalent to the phrase structure trees we prefer. In other words, a phrase structure for a given utterance is just one of the possible fragmentations, and this fragmentation can only be identified if the notion of head is considered. We can thus say that phrase structure contains a definition of dependency at its very base, a fact that already appears in the work of Bloomfield, who spends much more time on defining head-daughter relations than on the notion of constituency. Jackendoff [20]'s X-bar theory is based on a head-centered definition of constituency, as each XP contains an X being the (direct or indirect) governor of the other elements of XP. If we accept a mix of criteria for identifying fragments and heads, it is possible to directly define a constituent structure without considering all the fragmentations. The strategy is recursive and top-down (beginning with the whole sentence as first constituent); each step consists of first identifying the head of the constituent we want to analyze and then looking at the biggest fragments of the utterance without its head: these biggest fragments are constituents.28 Let us exemplify this with sentence (4): wants is the head of the sentence, and the biggest remaining fragments (i.e., in the sense of inclusion: fragments not contained in other fragments) are Peter and to read the book. At each step we first take the head off and go on with the subfragments we obtain, which gives us successively to and read the book, read and the book, and the and book. The resulting constituent structure is given in Figure 15.29
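Both notions translate directly into operations on a dependency tree. Here is a minimal sketch, our own; the dict encoding of the tree for Peter wants to read the book follows the heads argued for above:

```python
def maximal_projection(deps, node):
    """All nodes dominated by `node` in the dependency tree `deps`
    (a dict governor -> list of dependents), including `node` itself."""
    return {node}.union(*(maximal_projection(deps, d) for d in deps.get(node, [])))

def constituents(deps, head):
    """Recursive top-down fragmentation: the constituent headed by `head`
    splits into {head} and the constituents headed by its dependents."""
    return (head, [constituents(deps, d) for d in deps.get(head, [])])

# Dependency tree of "Peter wants to read the book":
deps = {"wants": ["Peter", "to"], "to": ["read"], "read": ["book"], "book": ["the"]}
print(sorted(maximal_projection(deps, "to")))   # ['book', 'read', 'the', 'to']
print(constituents(deps, "wants"))
# ('wants', [('Peter', []), ('to', [('read', [('book', [('the', [])])])])])
```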
Conclusion

We have shown that it is possible to formally define a syntactic structure solely on the basis of fragmentations of an utterance. The definition of fragments does not have to keep the resulting constituent structure in mind, but can be based on simple observable
28 If the head of the constituent is a finite verb, clefting can be a useful test for characterizing sub-constituents. But clefting can only capture some constituents, and only if the head of the constituent has been identified and is a finite verb. As noted by Croft [10] from the typological point of view, such constructions can only be used to characterize the constituents once we have defined them. We know that constructions like clefting select constituents because we were able to independently define constituents with other techniques. We cannot, inversely, define constituents by use of such language-specific constructions.
29 Our constituent tree does not contain a VP. Indeed, the maximal projection of the main verb is the whole sentence. The VP can be obtained as a maximal projection only if we separate the verbal lexeme from its inflection (see Figure 11).
Figure 15. Two equivalent representations of the constituent structure of (6)
criteria like different forms of autonomizability. Even (and especially) if we obtain intersecting fragmentations, we can obtain a connection graph. This operation can be applied to any type of utterance, yielding connections from the morphological to the discourse level. This delegates the search for the head of a fragment to a secondary, optional operation. It is again possible to apply the known criteria for heads only when they provide clear-cut answers, leaving us with partially unresolved connections, and thus with a hypergraph and not necessarily a tree structure. It is possible, and even frequent, that the syntactic structure is a tree, but our definition does not presuppose that it must be one. This two-step definition (connection and directionality) allows for a more coherent definition of dependency as well as of constituency, avoiding the commonly encountered circularities. It takes connection as a primary notion, preliminary to constituency and dependency. Another interesting feature of our approach is that it does not presuppose a segmentation of the sentence into words, and does not even suppose the existence of words as an indispensable notion. In this paper, we could explore neither the concrete applicability of our approach to other languages nor the interesting interaction of this new definition of dependency with recent advances in the analysis of coordination in a dependency-based approach, like the notion of pile put forward in Gerdes & Kahane [15]. It also remains to be shown that the order on hypergraphs is really complete, i.e., that we can actually always compute a greatest connection graph refining any set of fragmentation hypergraphs. We also leave it to further research to explore the inclusion of weights on the connections, which could replace the binary choice of presence or absence of a connection.
Acknowledgments

We would like to thank Igor Mel'čuk, Timothy Osborne, Federico Sangati, and our three anonymous reviewers.
References
[1] S. Abney, The English Noun Phrase in its Sentential Aspect, unpublished Ph.D. thesis, MIT, 1986.
[2] N. Asher, S. Pogodalla, SDRT and Continuation Semantics, Logic and Engineering of Natural Language Semantics 7 (LENLS VII), 2010.
[3] K. F. Becker, Organismus der Sprache, 2nd edition, Verlag von G.F. Kettembeil, Frankfurt am Main, 1841 [1827].
[4] L. Bloomfield, Language, Allen & Unwin, New York, 1933.
[5] R. Bod, Beyond grammar: an experience-based theory of language, CSLI Publications, Stanford, CA, 1998.
[6] Th. Brants, W. Skut, and H. Uszkoreit, Syntactic annotation of a German newspaper corpus, in: Treebanks, pp. 73–87, Springer Netherlands, 2003.
[7] A. Carnie, Modern Syntax: A Coursebook, Cambridge University Press, 2011.
[8] N. Chomsky, Lectures on Government and Binding, Foris, Dordrecht, 1981.
[9] N. Chomsky, New horizons in the study of language and mind, Cambridge University Press, 1986.
[10] W. Croft, Radical construction grammar: syntactic theory in typological perspective, Oxford University Press, 2001.
[11] J. Deulofeu, L. Dufort, K. Gerdes, S. Kahane, P. Pietrandrea, Depends on what the French say, The Fourth Linguistic Annotation Workshop (LAW IV), 2010.
[12] P. Garde, Ordre linéaire et dépendance syntaxique : contribution à une typologie, Bull. Soc. Ling. Paris, 72:1, 1–26, 1977.
[13] A. V. Gladkij, Lekcii po matematičeskoj lingvistike dlja studentov NGU, Novosibirsk, 1966 (French translation: Leçons de linguistique mathématique, fasc. 1, Dunod, Paris, 1970).
[14] H. A. Gleason, An Introduction to Descriptive Linguistics, Holt, Rinehart & Winston, New York, 1955; revised edition 1961.
[15] K. Gerdes, S. Kahane, Speaking in piles: Paradigmatic annotation of a French spoken corpus, Corpus Linguistics 2009, Liverpool, 2009.
[16] O. Jespersen, Analytic syntax, Copenhagen, 1937.
[17] L. M. V. Haegeman, Introduction to Government and Binding Theory, Blackwell Publishers, 1991.
[18] R. Hudson, Discontinuous phrases in dependency grammars, UCL Working Papers in Linguistics, 6, 1994.
[19] R. Hudson, Language Networks: The new Word Grammar, Oxford University Press, 2007.
[20] R. Jackendoff, X-Bar Syntax: A Study of Phrase Structure, MIT Press, Cambridge, MA, 1977.
[21] S. Kahane, Bubble trees and syntactic representations, Proceedings of MOL5, Saarbrücken, 70–76, 1997.
[22] S. Kahane, Defining the Deep Syntactic Structure: How the signifying units combine, Proceedings of MTT 2009, Montreal, 2009.
[23] S. Kahane, Why to choose dependency rather than constituency for syntax: a formal point of view, in J. Apresjan, M.-C. L'Homme, L. Iomdin, J. Milićević, A. Polguère, L. Wanner, eds., Meanings, Texts, and other exciting things: A Festschrift to Commemorate the 80th Anniversary of Professor Igor A. Mel'čuk, Languages of Slavic Culture, Moscow, 257–272.
[24] Y. Lecerf, Programme des conflits, module des conflits, Bulletin bimestriel de l'ATALA, 4–5, 1960.
[25] I. Mel'čuk, Dependency Syntax: Theory and Practice, SUNY Press, Albany, NY, 1988.
[26] I. Mel'čuk, Aspects of the Theory of Morphology, de Gruyter, Berlin/New York, 2006.
[27] I. Mel'čuk, Dependency in language, Proceedings of Dependency Linguistics 2011, Barcelona, 2011.
[28] T. Osborne, M. Putnam, Th. Groß, Catenae: Introducing a novel unit of syntactic analysis, Syntax, 15:4, 354–396, 2012.
[29] K. Schubert, Metataxis: Contrastive dependency syntax for machine translation, http://www.mtarchive.info/Schubert-1987.pdf, 1987.
[30] H. Sweet, A New English Grammar, 2 vols., Clarendon Press, Oxford, 1891–1896.
[31] W. V. Quine, Reply to Gilbert H. Harman, in E. Hahn and P.A. Schilpp, eds., The Philosophy of W.V. Quine, Open Court, La Salle, 1986.
[32] D. Sleator, D. Temperley, Parsing English with a Link Grammar, Carnegie Mellon University Computer Science technical report CMU-CS-91-196, 1991.
[33] M. Steedman, Dependency and coordination in the grammar of Dutch and English, Language, 61:3, 525–568, 1985.
[34] L. Tesnière, Éléments de syntaxe structurale, Klincksieck, Paris, 1959.
[35] R. D. Van Valin, R. J. LaPolla, Syntax: Structure, meaning, and function, Cambridge University Press, 1997.
[36] J. Vergne, A parser without a dictionary as a tool for research into French syntax, in: Proceedings of the 13th Conference on Computational Linguistics, Volume 1, pp. 70–72, Association for Computational Linguistics, 1990.
[37] A. M. Zwicky, Heads, Journal of Linguistics, 21, 1–29, 1985.
Computational Dependency Theory
K. Gerdes et al. (Eds.)
IOS Press, 2013
© 2013 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-352-0-26
Looking Behind the Scenes of Syntactic Dependency Corpus Annotation: Towards a Motivated Annotation Schema of Surface-Syntax in Spanish

Alicia BURGA a, Simon MILLE a, and Leo WANNER a,b
a Universitat Pompeu Fabra, Barcelona
b Institució Catalana de Recerca i Estudis Avançats, Barcelona

Abstract. Over the last decade, the prominence of statistical NLP applications that use syntactic rather than only word-based shallow clues has increased very significantly. This prominence triggered the creation of large-scale treebanks, i.e., corpora annotated with syntactic structures. However, a look at the annotation schemata used across these treebanks raises some issues. Thus, it is often unclear how the set of syntactic relation labels has been obtained and how it can be organized so as to allow for different levels of granularity in the annotation. Furthermore, it appears questionable that, despite the linguistic insight that syntax is very much language-specific, multilingual treebanks often draw upon the same schemata, with little consideration of the syntactic idiosyncrasies of the languages involved. Our objective is to detail the procedure for establishing an annotation schema for the surface-syntactic annotation of Spanish and to present a restricted set of easy-to-use criteria and a methodology which facilitate the decision process of the annotators, but which can also accommodate the elaboration of a more or less fine-grained tagset. The procedure has been tested on a Spanish 3,513-sentence corpus, a fragment of the AnCora newspaper corpus.

Keywords. corpus, annotation, dependency, methodology, Spanish
Introduction

Over the last decade, the prominence of statistical Natural Language Processing (NLP) applications (among others, machine translation, parsing, and text generation) that use syntactic rather than only word-based shallow clues has increased very significantly. This prominence triggered, in turn, the creation of large-scale treebanks, i.e., corpora annotated with syntactic structures, needed for the training of statistical algorithms; see, among others, the Penn Treebank [1] for English, the Prague Dependency Treebank [2] for Czech, the Swedish Talbanken05 [3], the Tiger corpus [4] for German, and the Spanish, Catalan, and Basque AnCora treebanks [5]. Even though this is certainly a very positive tendency, a look at the annotation schemata used across the treebanks of different languages raises some issues. Thus, despite the linguistic insight that syntax is very much language-specific, many of them draw upon the same more or less fine-grained
annotation schemata, i.e., sets of syntactic (dependency) relations, with little consideration of the languages themselves. Often, it is unclear how the individual relations in these sets have been determined and in which linguistic theory they are grounded, and occasionally it is not obvious that the annotation schema in question uses only syntactic (rather than also semantic) criteria. Our objective is to detail the process of elaboration of an annotation schema for the surface-syntactic annotation of Spanish corpora,1 which is based on syntactic criteria and has already been used to annotate a 3,513-sentence corpus of Spanish [7].2 In the next section, we analyze the state of affairs in some of the well-known dependency treebanks and justify why we set out to write this paper. In Section 2, we present the notion of surface-syntactic structure and the general principles of dependency as defined in the Meaning-Text Theory (MTT), the theoretical framework we base our work on. Section 3 outlines the annotation schema we propose and the principles used to distinguish between different relations. Section 4 illustrates two complementary ways of using the criteria during annotation. Section 5, finally, summarizes the paper and draws some conclusions.
1. A Glance Behind the Scenes

It is well known that surface-syntactic relations (SSyntRels) as usually used in dependency treebanks are language-specific. Therefore, a dependency relation annotation schema should, on the one hand, facilitate the annotation of all language-specific syntactic idiosyncrasies but, on the other hand, offer a motivated generalization of the relation tags such that it could also serve for applications that prefer small generic dependency tag sets. However, as already mentioned above, in a number of dependency treebanks containing corpora in different languages, the same arc tag set is used for all languages involved—no matter whether the languages in question are related or not. For instance, AnCora [5] contains the related Spanish and Catalan, but also Basque; the treebank described in [8] contains Swedish and Turkish, etc. This makes us think that not enough work has been done concerning the definition of the relation labels. In general, for all parallel and non-parallel treebanks that we found—the Czech PDT2.0-PDAT ([2] and [9]) and PCET [10], the English–German FuSe [11], the English–Swedish LinEs [12], the English Penn Treebank [1], the Swedish Talbanken [3], the Portuguese Bosque [13], the Dutch Alpino [14], etc.—the justification of the choice of dependency relation labels is far from being central and is largely avoided. This may lead to the conclusion that the selection of the relations is not of great importance, or that linguistic research already provides sets of relations for a significant number of languages. Neither of these two conclusions is correct. In our work, we found the question of the determination of SSyntRels crucial, and we observed the lack of an appropriate description of the language through a justified description of the SSyntRels used, even for languages for which treebanks are available and widely used. In MTT, significant work has been carried out on SSyntRels, particularly for English and French. Thus, Mel'čuk and Percov [15] and Mel'čuk [16] present a detailed

1 "Surface-syntactic" is used here in the sense of the Meaning-Text Theory [6].
2 The Spanish corpus is a fragment of the AnCora corpus, which consists of newspaper material. It is downloadable from http://www.taln.upf.edu/content/resources/495.
inventory of SSyntRels for English, and Iordanskaja and Mel'čuk [17] suggest criteria for establishing an inventory of labeled SSyntRels governed by verbs, as well as a preliminary inventory of relations for French. However, we believe that both inventories are not designed for large-scale corpus annotation to be used in statistical NLP applications, given that the criteria are generally difficult to apply and do not separate thoroughly enough surface-syntactic phenomena from phenomena at other levels of linguistic description. For instance, one important distinction in [17] is whether a dependent is actantial or not—in other words, whether a dependent is part of the definition of its governor or not—which is a clearly semantic distinction. We attempt to avoid recourse to deep semantic criteria. Instead, we replace semantic criteria by a list of strictly syntactically motivated, easy-to-use criteria in order to make their application efficient on a large scale, and we detail the process from the very beginning. This list is as reduced as possible, but still sufficient to capture fine-grained idiosyncrasies of Spanish. Obviously, we make intensive use of the cited works on SSyntRels in MTT as a source of inspiration.
2. MTT Guide to SSynt Dependencies

The prerequisite for the discussion of the compilation of a set of SSyntRels for a particular language is a common understanding of (i) the notion of a surface-syntactic dependency structure (SSyntS) that forms the annotation of a sentence in the corpus; (ii) the principles underlying the determination of a dependency relation, i.e., when a dependency relation between two lexical units in a sentence holds, and what the direction of this dependency is (in other words, who is the governor and who is the dependent).

2.1. Definition of SSyntS

In MTT, an SSyntS is defined as follows:
Definition 1 (Surface-Syntactic Structure, SSyntS) Let L, Gsem and Rssynt be three disjoint alphabets, where L is the set of lexical units (LUs) of a language L, Gsem is the set of semantic grammemes, and Rssynt is the set of names of surface-syntactic relations (or grammatical functions). An SSyntS of L, S_SSynt, is a quintuple over L ∪ Gsem ∪ Rssynt of the following form:
S_SSynt = ⟨N, A, λ_ls→n, ρ_rs→a, γ_n→g⟩
where
– the set N of nodes and the set A of directed arcs (or branches) form an unordered dependency tree (with a source node n_s and a target node n_t defined for each arc),
– λ_ls→n is a function that assigns to each n ∈ N an l_s ∈ L,
– ρ_rs→a is a function that assigns to each a ∈ A an r_s ∈ Rssynt,
– γ_n→g is a function that assigns to the name of each LU associated with a node n_i ∈ N, l_i ∈ λ_ls→n(N), a set of corresponding grammemes G_t ∈ Gsem.
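Definition 1 can be rendered directly as a data structure. The following is a minimal sketch of our own (names and example content are hypothetical, not taken from the paper), with dictionaries playing the role of the functions λ, ρ and γ:

```python
from dataclasses import dataclass, field

@dataclass
class SSyntS:
    """Surface-syntactic structure: an unordered dependency tree with
    lexical node labels, relation-labeled arcs, and grammemes."""
    nodes: set    # N
    arcs: set     # A: (source, target) pairs
    lex: dict     # lambda_{ls->n}: node -> lexical unit
    rel: dict     # rho_{rs->a}:    arc  -> SSyntRel name
    gram: dict = field(default_factory=dict)  # gamma_{n->g}: node -> grammemes

# Hypothetical encoding of "Juan come manzanas" 'Juan eats apples':
s = SSyntS(
    nodes={1, 2, 3},
    arcs={(1, 2), (1, 3)},
    lex={1: "COMER", 2: "JUAN", 3: "MANZANA"},
    rel={(1, 2): "subject", (1, 3): "direct objectival"},
    gram={3: {"number=PL"}},
)
assert s.rel[(1, 2)] == "subject"
```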
Figure 1. SSyntS of the sentence Las primeras víctimas fueron trabajadores que pedían regularmente días de recuperación a sus patrones 'The first victims were employees who regularly asked days-off to their bosses'.
For illustration, consider the SSyntS of a Spanish sentence in Figure 1.3 We are particularly interested in the assignment of surface-syntactic relation labels to the arcs (i.e., the function ρ_rs→a). These labels are of the same nature as those used by many other treebanks: subject, direct/indirect object, copulative, modificative, determinative, adverbial, etc., i.e., grammatical functions. We want to determine when to use each of them and how to build the tag set such that it can be enriched or reduced in a prescribed way under clearly defined conditions. For instance, in Figure 1, the indirect object of the verb pedían 'askedPL' is introduced by the preposition a 'to'. However, in Spanish, direct objects can also be introduced by this preposition. So, obviously, looking at the units of the sentence is not enough to establish the dependency relations. Each relation has to be associated with a set of central properties. These properties must be clearly verifiable. For instance, a direct object is cliticizable by an accusative pronoun, an indirect object by a dative pronoun, and a subject triggers number and person agreement with its governor.

2.2. Principles for the determination of SSynt-dependencies

The central question faced during the establishment of the SSyntS as defined above for each sentence of the corpus under annotation is related to:
– the elements of A: When is there a dependency between two nodes labeled by the LUs l_i and l_j, and what is the direction of this dependency?
– the elements of Rssynt: What are the names of the dependencies, how are they to be assigned to a ∈ A, and how are they to be distinguished?

3 The nominal node labels reflect the number (víctimas 'victims', trabajadores 'workers', patrones 'bosses') only to facilitate the reading; semantic plural is encoded as a grammeme in terms of an attribute/value pair on the node: number=PL. Note also that we consider each node label to be a disambiguated word, i.e., a lexical unit (LU). For details on grammemes, see [18].
or, in short, to the determination of SSynt-dependencies. In what follows, we address this question in terms of two corollaries.
Corollary 1 (Dependency between nodes) Given any two unordered nodes n1 and n2, labeled by the LUs l1 and l2 respectively in the sentence S of the corpus, there is a dependency between n1 and n2 if either
(a) in order to position l_i in S, reference must be made to l_j, with i, j = 1, 2 and i ≠ j (linear correlation criterion), and
(b) between l_i and l_j or between syntagms of which l_i and l_j are heads (i, j = 1, 2 and i ≠ j), a prosodic link exists (prosodic correlation criterion), or
(c) l_i triggers agreement on l_j (i, j = 1, 2 and i ≠ j) (agreement criterion).
Thus, in John has slept well today, John has to be positioned before the auxiliary has (or after it, in a question), and a prosodic link exists between John and the syntagm headed by has. This means that John and has are likely to be linked by a dependency relation. Well has to be positioned relative to slept (not relative to has); hence there is a dependency between slept and well. With respect to agreement, we see that the verb is has and not have, as it would be if we had The boys instead of John. This verbal variation in person, which depends on the preverbal element, implies that a dependency links John and has. Once the dependency between two nodes has been established, one must define which node is the governor and which one is the dependent, i.e., the direction of the SSynt arc linking those two nodes. The following corollary handles the determination of the direction of the dependency:
Corollary 2 (Direction of a dependency relation) Given a dependency arc a between the nodes n1 and n2 of the SSyntS of the sentence S in the corpus, n1 is the governor of n2, i.e., n1 is the source node and n2 is the target node of a, if
(a) the passive valency (i.e., distribution) of the group formed by the LU labels l1 and l2 of n1/n2 and the arc between n1 and n2 is the same as the passive valency of l1 (passive valency criterion), or
(b) l1 as lexical label of n1 can be involved in a grammatical agreement with an external element, i.e., a label of a node outside the group formed by the LU labels l1 and l2 of n1/n2 and the arc between n1 and n2 (morphological contact point criterion).
If neither (a) nor (b) applies, the following weak criteria should be taken into account:
(c) if upon the removal of n1 the meaning of S is reduced AND restructured, n1 is more likely to be the governor than n2 (removal criterion),
(d) if n1 is not omissible in S, it is more likely to be the governor than n2 (omissibility criterion),
(e) if l2 as label of n2 needs ("predicts") l1 as label of n1, n2 is likely to be a dependent of n1 (predictability criterion).
As an illustration of the passive valency criterion,4 consider the nominal phrase the cats. It has the same distribution as cats: both can be used in exactly the same paradigm in a sentence. On the other hand, the cats does not have the distribution of the. We conclude that cats is the head of the phrase the cats. It is important to note that, for instance, in the case of prepositional phrases, the preposition does not have its own passive valency, since it always needs an element directly after it. This does not prevent the passive valency criterion from applying since, e.g., the distribution of from [the] house is not the same as the distribution of house. It is the presence of the preposition that imposes a particular distribution on the group. The morphological contact point criterion is used as follows: considering the pair sólo felinos 'only felines' in sólo felinos ronronean 'only felines purrPL', felinos 'felines' is the unit which is involved in the agreement with an external element, ronronean 'purrPL'. As a consequence, felinos is more prone to be the governor of sólo. We refrain from elaborating here on the weak criteria (c–e); see [6] for more details.5

2.3. Labelling the dependencies

With the two corollaries from above at hand, we should be able to state when there is a dependency arc between two nodes, and which node governs which other node. Now, labels need to be assigned to the dependency arcs. The assignment may be very intuitive and straightforward (as, e.g., the assignment of object to the arc between Sp. tiran '[they] throw' and bolas 'balls' in tiran bolas, lit. '[they] throw balls') or less clear (as, e.g., the assignment of a label to the dependency arc between caen '[they] fall' and bolas 'balls' in caen bolas, lit. '[it] falls balls': is it some kind of object, a regular subject, or some other kind of dependent?). The following corollary addresses the question whether two given dependency arcs are to be assigned the same or different labels:
Corollary 3 (Different labels) Given an arc a1 and an arc a2 such that
• a1 holds between the nodes n_sa1 (labeled by l_sa1) and n_ta1 (labeled by l_ta1), with the property set P_a1 := {p_a1,1, p_a1,2, ..., p_a1,i, ..., p_a1,n},
• a2 holds between the nodes n_sa2 (labeled by l_sa2) and n_ta2 (labeled by l_ta2), with the property set P_a2 := {p_a2,1, p_a2,2, ..., p_a2,j, ..., p_a2,m},
then ρ_rs→a(a1) ≠ ρ_rs→a(a2), i.e., a1 and a2 are assigned different labels, if
(a) ∃p_k : (p_k ∈ P_a1 ∧ p_k ∉ P_a2) ∨ (p_k ∈ P_a2 ∧ p_k ∉ P_a1), and p_k is a central property, or
(b) one of the following three conditions applies; cf. [6]:
1. semantic contrast condition: l_sa1 and l_sa2 and l_ta1 and l_ta2 are pairwise the same wordforms, but either l_sa1 and l_sa2 or l_ta1 and l_ta2 have different meanings.
2. prototypical dependent condition (quasi-Kunze property): given the prototypical dependents d_p1 of a1 and d_p2 of a2, when l_ta1 in l_sa1 –a1→ l_ta1 is substituted by

4 For the definition of the notion "passive valency", see [6].
5 We actually use the omissibility criterion for labelling dependencies, as shown in Section 3.2.
d_p2, the grammaticality of l_sa1 –a1→ l_ta1 is affected, or when l_ta2 in l_sa2 –a2→ l_ta2 is substituted by d_p1, the grammaticality of l_sa2 –a2→ l_ta2 is affected.
3. SSyntRel repeatability criterion: l_ta1 and its dependency a1 from l_sa1 can be repeated while l_ta2 and its dependency a2 from l_sa2 cannot (or vice versa).
First of all, condition (a) entails that a relation should have clear properties associated with it. Associating properties to a relation is exactly what it means to define a relation. This can only be done in opposition to other relations, which means that it is the result of numerous iterations after the inspection of numerous examples. As a consequence, paradoxically, the list of properties of a relation is one of the last things to be defined.6 The semantic contrast condition (b1) states that for a given relation and a given minimal pair of LUs, there must not be any semantic contrast; the arc orientation has to be the same for both members of the minimal pair, and the deep-morphological representation should be different (different possible orders or different case on the dependent, for instance). Both members of the pair have the property of being able to occupy the same syntactic role in a sentence. Consider the two LUs comer '[to] eat' and (los) gatos '(the) cats': they can form an ambiguous sentence Comen los gatos, lit. 'Eat cats', 'Cats eat' vs. '[They] eat cats'. The ambiguity cannot be explained by a difference in meaning of the components of the sentence (since they are the same). Hence, the semantic contrast criterion prevents both dependencies from being the same: in one case, gatos is the subject of comer, and in the other case, it is its object. The semantic contrast condition does not apply to una casa 'a house' / una casa 'one house' because una does not have the same meaning (i.e., is not the same lexeme) in both cases. The quasi-Kunze criterion (b2) states that any SSyntRel must have a prototypical dependent, that is, a dependent which can be used for ANY governor of this SSyntRel; see [16]. Consider, for illustration, poder 'can'–R→caer 'fall' vs. cortar 'cut'–R→pelo 'hair': it is not possible to have an N as dependent of poder 'can' nor a Vinf as dependent of cortar 'cut'. More generally, no element of the same category can appear below both poder and cortar. This implies that the prototypical dependents in the two cases do not coincide, so that it is not the same relation. The SSyntRel repeatability criterion (b3) indicates that a particular SSyntRel should be, for any dependent, either always repeatable or never repeatable. If one dependent can be repeated and another one cannot, then we have two different relations. In a concrete case, we can start with the hypothesis that we have ONE relation R for which we want to know whether it is suitable to handle two dependents with different properties (in particular, two different Part of Speech (PoS) tags). If the same relation R can be used to represent the relation, for instance, between a noun and an adjective and, on the other side, between a noun and a numeral quantifier, R should be either repeatable or not repeatable in both cases. We observe that R is repeatable for adjectives but not for quantifiers and conclude, thus, that R should be split into two relations (namely modifier and quantificative).

6 For instance, a restricted property set of the direct objectival relation in Spanish includes: the direct object (1) is cliticizable, (2) by an accusative pronoun, (3) can be promoted, (4) does not receive any agreement, and (5) is typically a noun.
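Corollary 2 above reads as an ordered decision procedure: the strong criteria (a) and (b) decide outright, while the weak criteria (c)–(e) only suggest. The following is a minimal sketch of this control flow, our own and not part of the paper; the linguistic judgments themselves are supplied as pre-judged booleans, since the tests are not mechanizable here:

```python
def direction(strong, weak):
    """Decide which of n1/n2 governs, following Corollary 2.
    `strong` and `weak` map criterion names to True if the criterion
    designates n1 as the governor (as judged by the annotator)."""
    # Strong criteria: (a) passive valency, (b) morphological contact point.
    for criterion in ("passive_valency", "morph_contact_point"):
        if criterion in strong:
            return "n1" if strong[criterion] else "n2"
    # Weak criteria (c)-(e) are only suggestive; take a simple vote.
    yes = sum(weak.values())
    no = len(weak) - yes
    if yes > no:
        return "n1"
    if no > yes:
        return "n2"
    return "undecided"  # leave the connection undirected

# "the cats": the group distributes like "cats", not like "the", so the
# passive valency criterion already makes the noun (n1) the governor.
print(direction({"passive_valency": True}, {}))   # n1
```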
3. Towards a SSynt Annotation Schema for Spanish

In Section 2, the general principles have been presented that allow us to decide when two units are involved in a dependency relation and which one is the governor. Furthermore, some generic cases have been identified in which it seems clear whether a new relation should be created or not. With these principles at hand, we can set out for the definition of a motivated SSyntS annotation schema. To be taken into account during this definition is that (a) (unlike the available MTT SSyntRel sets) the schema should cover only syntactic criteria; (b) the granularity of the schema should be balanced in the sense that it should be fine-grained enough to capture language-specific syntactic idiosyncrasies, but still be manageable by the annotator team (we are thinking here of decision making and inter-annotator agreement rate). The latter led us to target a set of 50 to 100 SSyntRels.

3.1. Principles for the criteria to distinguish between different SSynt-relations

The following properties are particularly important:
• Applicability: The criteria should be applicable to the largest possible number of cases. For instance, a governor and a dependent always have to be ordered, such that a criterion implying order can be applied to every relation, whatever it is. One advantage here is to keep a set of criteria of reasonable size, in order to avoid the necessity of handling a large number of criteria which could only be applied in very specific configurations. The other advantage of favouring generic criteria is that it makes the classification of dependency relations more readable: if a relation is opposed to another one using the same set of criteria, the difference between them is clearer.
• Visibility: When applying a criterion, an annotator should be able to see a modification or the presence of a particular feature. Indeed, we try to use only two types of criteria: criteria that transform a part of the sentence to annotate (promotion, mobility of an element, cliticization, etc.), and criteria that check the presence or absence of an element in the sentence to annotate (Is there an agreement on the dependent? Does the governor impose a particular preposition?, etc.). In other words, we avoid semantically motivated criteria. The main consequence of this is the absence of the opposition complement/attribute as a discriminating feature between syntactic relations.
• Simplicity: Once the annotator has applied a criterion, she must be able to make a decision quickly. This is why almost all criteria involve a binary choice.
All of the selected criteria presented in the next subsection have been used in one sense or the other in the long history of grammar design. However, what we believe has not been tackled to date is how to reconcile in a simple way fine-grained syntactic description and large-scale NLP applications. In what follows, we present a selection of the most important criteria that we use in order to assign a label to a dependency relation. Then, in Section 4, we show how we use these criteria for the annotation of a Spanish corpus with different levels of detail.
3.2. Main criteria to distinguish between different SSynt-relations

• Type of linearization: Some relations are characterized by a rigid order between the governor and the dependent (in either direction), whereas others allow more flexibility with respect to their positioning. Thus, e.g., the relations that connect an auxiliary with the verb imply a fixed linearization: the auxiliary (governor) always appears to the left of the verb (dependent):
He comido mucho, lit. '[I] have eaten a-lot.'
*Comido he mucho, lit. '[I] eaten have a-lot.'
On the other hand, even if Spanish is frequently characterized as an SVO language, the relation subject does allow flexibility between the governor and the dependent:
Juan come manzanas, lit. 'Juan eats apples.'
Come Juan manzanas, lit. 'Eats Juan apples.'
Come manzanas Juan, lit. 'Eats apples Juan.'
Given that it is possible to apply this criterion to all relations, the linearization criterion is very relevant to our purposes.
• Canonical order: As just stated, some relations are more flexible than others with respect to the order between governor and dependent. When the order is not restricted, there is usually a canonical order. Thus, although it is possible to have a postverbal subject, the canonical order between the subject and the verb is that the former occurs to the left of the latter. On the other hand, the relations introducing the non-clitic objects have the opposite canonical order, i.e., the object appears to the right of the verb (see Juan come manzanas above).
• Adjacency to the governor: Some relations require that the governor and the dependent be adjacent in the sentence, and therefore only accept a very restricted set of elements (namely, other adjacent elements) to be inserted between them. Other relations allow a larger variety of elements to appear between governor and dependent. The fact that a governor has to keep a dependent very close to itself is a distinctive syntactic feature. All the relations involving clitics belong to the first type, while a relation such as determinative belongs to the second type:
Cada día, lo miraba, lit. 'Every day, it [I] watched.'
*Lo cada día miraba, lit. 'It each day [I] watched.' 'I watched it every day.'
Un hombre muy bueno, lit. 'A man very good.'
Un muy buen hombre, lit. 'A very good man.' 'A very good man.'
• Cliticization: Cliticization refers to the possibility for the dependent to be replaced or duplicated by clitic pronouns, and thus concerns only elements for which the order between the verbal governor and its dependent is not restricted. For instance,
the relation indirect object allows cliticization, as opposed to the oblique object, which does not:
Miente '[He] lies' –iobj→ a 'to' Carla 'Carla.'
Le miente, lit. 'to-her [he] lies.' '[He] lies to her.'
A Carla le miente, lit. 'to Carla to-her [he] lies.' '[He] lies to Carla.'
Invierte '[He] invests' –obl obj→ en 'into' bolsa 'stock-market.'
*La invierte, lit. 'in-it [he] invests.'
*En bolsa la invierte, lit. 'into stock-market in-it [he] invests.'
• Promotion/demotion: Promotion and demotion refer to the possibility of moving an argument up (respectively down) the ordered syntactic actant list (subject > direct object > indirect object > …). Thus, the dependent of the relation direct object can be promoted to the dependent of the relation subject in a passive sentence, and, from the opposite point of view, the subject can be demoted to the dependent of the relation agent in a passive sentence:7
Juan compuso las canciones 'Juan wrote the songs.'
Las canciones fueron compuestas por Juan 'The songs were written by Juan.'
Cliticization and promotion/demotion can only be applied if the governor is a finite verb. From this perspective, they do not seem to comply with the Applicability principle. However, since there are many different relations that can hold on a verb, this is not totally true. In addition, those criteria are very efficient with respect to the other two principles, Visibility and Simplicity.
• Agreement: Agreement appears when governor and dependent share morphological features such as gender, number, person, etc., which one of the elements passes to the other. Agreement actually depends on two parameters. On the one hand, the target of the agreement must have a PoS which allows agreement. On the other hand, the dependency relation itself must allow it. For example, the copulative relation allows agreement, but if the dependent is not an adjective, it is not mandatory; cf. Pedro y Carla son relajados 'Pedro and Carla are relaxedPLU' as opposed to Pedro y Carla son una pareja 'Pedro and Carla are a coupleSING'. Inversely, the past participle in the perfect analytical construction is intrinsically prone to agreement (as the second example that follows shows), but the relation does not allow it: Carla está perdida 'Carla is lostFEM' as opposed to Carla ha perdido 'Carla has lostnoFEM'. This is why the notion of prototypical dependent is important (see next paragraph): if a relation licenses agreement, this does not mean that any dependent must show agreement, but, rather, that there is always agreement for its prototypical dependent. There are different types of agreement allowed by a syntactic relation:
– the dependent agrees with the governor (i.e., the dependent is the target of the agreement): sillas 'chairs'–modificative→ rotas 'brokenFEM.PL',
– the governor agrees with the dependent (i.e., the dependent controls the agreement): Juan 'Juan' ←subject–viene 'comes',

7 In Spanish, only direct objects can be promoted; English, for instance, also allows for the promotion of indirect objects: John sent a postcard to Paul vs. Paul was sent a postcard by John.
– the dependent agrees with another dependent: Juan 'Juan' ←subject–parece 'seems'–copulative→ enfermo 'sickMASC.SG'.
When there is agreement, secondary criteria concerning the type of inflection of the agreeing element can be applied. Thus, in some cases the agreement can vary, in other cases it cannot (see, e.g., the opposition between subject and quotative subject in the next section).
• Prototypical dependent: As mentioned in Section 2, every relation must have a prototypical dependent. This criterion is more useful for designing the set of dependency relations than for assigning a tag to a relation, since it involves a generalization over a large number of cases which are not accessible during the process of annotation. However, it can be used during annotation as well, especially in order to discard or confirm a relation: if a dependent of a SSyntRel cannot be replaced by the prototypical dependent of this relation, then the relation should be changed. It can also be useful when looking for a relation in the hierarchical representation of the criteria (see Table 1)—for instance, in combination with the Agreement criterion. If the pair son 'are'–??→ pareja 'couple' in the sentence Pedro y Carla son una pareja 'Pedro and Carla are a coupleSING' has to be annotated, there is no visible agreement; however, a native speaker annotator knows that the typical dependent of this verb is an adjective and should then consider that agreement is usually involved.
• Part of Speech of the Governor and Dependent: The actual PoS of the governor is relevant in that there are very few syntactic dependents that behave the same with governors of different syntactic categories once a certain level of detail has been reached in the annotation. For instance, prepositional objects can depend on verbs, adjectives or nouns; an object of a verb can very often appear to the right or to the left of its governor in a sentence, while it is almost always to the right when the governor is an adjective or a noun. Similarly, only objects of verbs can pronominalize: cliticization is impossible when the governor is a noun or an adjective. Different dependency labels encode these distinctions. Taking into account the PoS of the governor allows the annotator to reduce the number of candidates for one label. The PoS of the dependent can also rule out some labels (for example, a noun cannot be the dependent of a relation modif). This is taken into account during the annotation process.
• Governed Preposition/Conjunction/Grammeme (P/C/G): Some relations require the presence of a preposition, a subordinating conjunction or a grammeme. For instance, the relation oblique object implies the presence of a semantically empty preposition introducing the dependent (invierte en la Bolsa '[he/she] invests in the stock market'), and the relation subordinate conjunctive requires the presence of a feature in the verb indicating that it is finite.
• Dependent omissibility: This syntactic criterion is defined within an "out-of-the-blue" context, given that otherwise it is very difficult to determine whether a dependent is omissible or not: it is always possible to create pragmatic contexts in which the dependent can be perfectly omitted. There are two cases: on the one hand, relations such as prepositional always require the presence of the dependent
and, on the other hand, relations such as modifier do not require the presence of the dependent. Consider:

Juan viene para ‘Juan comes to’ –prepos→ trabajar ‘work.’
*Juan viene para, lit. ‘Juan comes to.’
Tiene ‘[He] has’ sillas ‘chairs’ –modif→ verdes ‘green.’ ‘[He] has green chairs.’
Tiene sillas. ‘[He] has chairs.’

• Left dislocation: Left dislocation (with or without a comma) is used in some cases to distinguish an object from an adverbial. If the dislocated element seems strongly focalized when it is positioned to the left of its governor, the relation is more probably an object. When applying this criterion, the dependency relation (henceforth: ‘DepRel’) should still stand after the dislocation. For instance, it seems possible to dislocate the apposed element in the case of apposition: el presidente Obama ‘the president Obama’ gives Obama, el presidente ‘Obama, the president’, but in the latter there would be an inversion of the dependency, in that el presidente ‘the president’ would now be the apposed element. As a result, the relation apposition does not react positively to this criterion.
4. Examples of Application of the Criteria

In this section, we illustrate two different ways of using the criteria determined above. One is based on a hierarchical layout, in which the criteria have to be examined one after the other in a given order; the other considers no such hierarchy in order to achieve more flexibility.

4.1. The hierarchical approach

We organized all criteria into a tree-like hierarchy such that an annotator who has identified a governor/dependent pair but wonders which relation holds between the two merely has to follow a path of properties that leads to the relation. The order in which the criteria are applied matters only for expressiveness: it keeps relations of the same type close to each other in the graphical representation, so that differences between similar relations can be visualized very easily. We present in this section only a part of the complete hierarchy, namely the relations governed by a verb which do not impose a rigid order between governor and dependent. Our complete hierarchy contains 79 different arc labels and covers the annotation of a 100,000-word corpus [7]. We use here nine criteria: (1) removability of the dependent, (2) possible cliticization, (3) agreement type, (4) inflection type, (5) PoS of the prototypical dependent, (6) promotion/demotion, (7) presence of a governed preposition, (8) presence of quotes, and (9) presence of parentheses or dashes. With this level of detail, we obtain sixteen different relations in which verbs are involved; cf. Table 1.
[Table 1 displays the criterion tree; its rightmost column contains the sixteen verbal SSyntRels exemplified below (adjunct, adv, compl1, compl2, copul, dobj, iobj, mod adv, obj copred, obl obj, quasi subj, quot copul, quot dobj, quot subj, subj, subj copred).]

Table 1. A partial hierarchy of syntactic criteria (¬: negation of a criterion; Fix Lin: governor and dependent always appear in the same order; Clitic: the dependent can be replaced by a clitic pronoun; Prom: the dependent can be promoted; Remove: the dependent can be removed; Quot: the dependent is quoted; Agree: the dependent is involved in an agreement; Target: (if Agree) the dependent is the target of the agreement; Control: (if Agree) the dependent controls the agreement on another word; Sibling: (if Agree) the dependent agrees with one of its siblings; Governor: (if Agree) the dependent agrees with its governor; External Elt: (if target of Agree) the dependent agrees with an element in another sentence; subj - dobj: (if Agree with sibling) the dependent agrees with the subject or the object; ProtD A: the prototypical dependent is an adjective; ProtD Adv: the prototypical dependent is an adverb; ProtD N: the prototypical dependent is a noun; Gov P: the dependent is a governed preposition; Parenthetical: the dependent appears between brackets or dashes)
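To make the traversal procedure concrete, the following sketch (in Python) shows how a criterion tree of this kind could be walked with yes/no answers from an annotator. It is an illustration only, not the authors' tool; the tree fragment, the criterion wordings and the interactive prompt are simplified stand-ins for the full hierarchy of Table 1.

# A criterion tree modeled as nested dictionaries: each internal node holds a
# criterion and one subtree per answer; each leaf holds a SSyntRel label.
# The fragment below is a hypothetical excerpt, not the complete Table 1.
TREE = {
    "criterion": "the dependent can be replaced by a clitic pronoun",
    True: {
        "criterion": "the dependent can be promoted (passivization)",
        True: "dobj",
        False: "iobj",
    },
    False: {
        "criterion": "the dependent is involved in an agreement",
        True: "subj",
        False: "obl obj",
    },
}

def identify(tree, ask):
    """Follow the path of properties until a leaf (a SSyntRel label) is reached."""
    while isinstance(tree, dict):
        tree = tree[ask(tree["criterion"])]
    return tree

if __name__ == "__main__":
    # Each answer corresponds to one syntactic test performed by the annotator.
    answer = lambda criterion: input(criterion + "? [y/n] ").strip().lower() == "y"
    print("SSyntRel:", identify(TREE, answer))

Note that in such a tree each exceptional configuration of properties requires an extra branch, which is precisely the readability problem, discussed below, that motivates the bag-of-properties approach of Section 4.2.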
In the following, we give an example for each relation; the governor of the relation appears in bold uppercase, the dependent in bold lowercase:
— adjunctive: Vale, VAMOS. lit. ‘ok, [we] go’ ‘Ok, let’s go!’
— adverbial: Hoy PASEÉ. lit. ‘today [I] went-for-a-stroll’ ‘Today, [I] took a walk.’
— completive 1: La frase RESULTÓ buena. lit. ‘the sentence turned-out-to-be fine’ ‘The sentence became good.’
— completive 2: Pedro CONSIDERA tontos a los gatos. lit. ‘Pedro considers stupid to the cats’ ‘Pedro considers the cats to be stupid.’
— copulative: El gato ES negro. ‘The cat is black.’
— direct objectival: CONSTRUYEN una casa. ‘[They] build a house.’
— indirect objectival: Les MOLESTA el ruido a los peces. lit. ‘to-them bothers the noise to the fish’ ‘The fish are bothered by the noise.’
— modificative adverbial: Llegados a ese extremo, el trabajo se VUELVE insoportable. lit. ‘arrived-MASC-PL to that extremity, the work becomes unbearable’ ‘When we get to this point, the work becomes unbearable.’
— object copredicative: Pedro VE felices a los gatos. lit. ‘Pedro sees happy-PL to the cats’ ‘Pedro sees the cats being happy.’
— oblique objectival: PASA de Pedro. lit. ‘[he] passes from Pedro’ ‘[He] ignores Pedro.’
— quasi subjectival: LLUEVE(N) ranas. lit. ‘[it/they] rain(s) frogs’ ‘[It] rains frogs.’
— quotative copulative: La pregunta ERA “¿Va a volver?” ‘The question was “Is [he] going to come back?” ’
— quotative direct objectival: GRITÉ “¡Cállate!” ‘[I] shouted “Shut up!” ’
— quotative subjectival: “Dogs” ES una palabra inglesa. ‘ “Dogs” is an English word.’
— subjectival: Pedro CORRE. ‘Pedro runs.’
— subject copredicative: Pedro VUELVE feliz. ‘Pedro comes back happy.’

By selecting only a few criteria, it is possible to reduce the number of relations and thus to tune the level of detail of the annotation. For example, keeping only four of the nine criteria presented above, we end up with only five relations instead of sixteen; see Table 2. In Tables 1 and 2, each cell corresponds to the application of one criterion; the rightmost column contains the SSyntRels. The path from the root of the tree to a leaf thus indicates a list of properties of the corresponding relation (note that not all properties are listed in these tables). It would also be possible to merge some properties: for instance, the canonical order can always be predicted by a particular property; elements that can be cliticized are usually linearized to the right of their governor; etc.

Table 3 shows the correspondence between the fine-grained relations displayed in the rightmost column of Table 1 and generalized relations of a different level of granularity.
[Table 2 displays the reduced criterion tree; its rightmost column contains the five relations Subj, Obj1, Obj2, Compl and Mod1.]

Table 2. A hierarchy with fewer criteria (¬: negation of a criterion; Fix Lin: governor and dependent always appear in the same order; Clitic: the dependent can be replaced by a clitic pronoun; Remove: the dependent can be removed; Agree: the dependent is involved in an agreement; Control: (if Agree) the dependent controls the agreement on another word; ProtD A: the prototypical dependent is an adjective; ProtD Adv: the prototypical dependent is an adverb; ProtD N: the prototypical dependent is a noun)
Although we use only syntax-based criteria, it is possible to reach the semantic level by indicating whether the dependent of a relation is accounted for in the valency of its governor (no (—), actant I, actant II, etc.), which is indicated by the numbers in the column to the right of the SSyntRel in Table 3.8 This helps to generalize the relations, as illustrated on the right side of the table. This second relation hierarchy is similar to the hierarchies proposed by, among others, [19], [20] or [21].

[Table 3 pairs each fine-grained SSyntRel of Table 1 with the deep-syntactic actant number of its dependent (I–VI, or ATTR for non-actantial dependents) and, on the right, groups the relations under increasingly general labels: Obj1, Obj2, Compl, Subj, Mod1 and Mod, which are further generalized to Arg1, Arg2, Arg, Mod and, ultimately, Dependent.]

Table 3. A possible generalization of dependency relations

8 We actually have a version of our corpus with such valency information.
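Such a mapping can be applied mechanically to relabel an annotated corpus at a coarser granularity. The Python sketch below illustrates the idea; the individual label pairs are a plausible reading of Table 3 rather than a verbatim copy of it, so the exact mapping should be taken as an assumption.

# Hypothetical excerpt of a fine-grained -> generalized label mapping in the
# spirit of Table 3; the individual pairs are illustrative assumptions.
GENERALIZE = {
    "dobj": "Obj1", "quot dobj": "Obj1",
    "iobj": "Obj2", "obl obj": "Obj2",
    "subj": "Subj", "quot subj": "Subj", "quasi subj": "Subj",
    "copul": "Compl",
    "subj copred": "Mod1", "obj copred": "Mod1",
    "adv": "Mod", "adjunct": "Mod",
}

def coarsen(arcs, mapping=GENERALIZE):
    """Relabel (governor, relation, dependent) triples at a coarser granularity."""
    return [(gov, mapping.get(rel, rel), dep) for (gov, rel, dep) in arcs]

# Example: the fine-grained analysis of 'CONSTRUYEN una casa', generalized.
print(coarsen([("construyen", "dobj", "casa")]))
# -> [('construyen', 'Obj1', 'casa')]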
The hierarchical approach has the advantage of displaying clearly what the differences between dependency relations are. For instance, in Table 1 one can see at a glance that the relations dobj and iobj allow the dependent to be moved around and cliticized, but that only dobj allows promotion. However, it is well known that natural languages cannot be described in their entirety by general rules without exceptions. Languages evolve independently of the rules formulated by linguists with the goal of capturing the observed syntax. In other words, all rules have more or less numerous exceptions. As a result, the criterion hierarchy as presented above has its limitations: not all instances of a DepRel necessarily exhibit all the properties that appear on the path from the root of the criterion tree. For example, if an annotator finds a dependent that has all the properties of an obl obj with the exception of one (for instance, that this dependent cannot be removed from the sentence it appears in), she will never arrive at the obl obj relation. One way to avoid this deadlock would be to add a branch to the criterion hierarchy so that another path arrives at obl obj with the property "not removable dependent". But if we do this for each configuration of properties of each DepRel, the resulting hierarchy becomes totally unreadable and loses its main purpose. Therefore, we decided to create a complementary approach that considers bags of properties for each DepRel instead of a hierarchy.

4.2. The bag-of-properties approach

As its name indicates, this approach simply consists in assigning to each DepRel a set of properties. This time, we do not focus on the properties that differentiate one DepRel from another; neither do we impose an order in the use of the criteria. Instead, we compile an inventory of all the possible values of each criterion (see Section 3.2) for each DepRel. For this purpose, we designed an SQL-based tool that allows the annotator to introduce one value for each criterion of her choice. The tool returns a classification of dependency relations ordered by (i) similarity based on the selected criteria, and (ii) frequency. Consider, for instance, in Table 4 the properties of the DepRel modif, which holds between a noun and a modifying adjective. The idea behind the inventory of all possible values of each criterion for each DepRel is that, whatever the configuration in the sentence to annotate, the target DepRel appears at (or close to) the top of the list when the annotator introduces the selected criteria. For example, in the case of the DepRel modif in Spanish, the dependent usually appears to the right of its governor and cannot be moved to its left. However, some adjectives (e.g., pequeño ‘small’) can appear both to the right and to the left of the governing noun: niño pequeño vs. pequeño niño, and some can only be found to the left of the noun (cf. quantificative adjectives such as poco ‘little’, which do not behave as numbers: poco aire ‘little air’ vs. *aire poco, lit. ‘air little’). In other words, some lexical properties can overrule syntactic properties. To handle this phenomenon, some criteria can be left unspecified (e.g., the value N/A for type of linearization and canonical order in Table 4), such that both YES and NO give a match for the DepRel in question.
In contrast, if we used the hierarchical schema to describe the most probable or prototypical properties of a DepRel, an unconventional construction would erroneously rule out the DepRel (see Section 4.1). Consider the sentence Tiene sillas verdes ‘[He] has green chairs’ as a simple use case.
Criterion                 Possible values
PoS Gov                   noun—date
prototypical Dep          A
PoS Dep                   VPart—A
governed preposition      NO
governed grammeme         NO
type of linearization     N/A
canonical order           N/A
left dislocation          NO
adjacency to Gov          NO
cliticization             NO
promotion                 NO
demotion                  NO
agreement                 TARGET
agreement with            Gov
variant inflection        YES
Dep omissibility          YES
comma                     NO

Table 4. Distinctive properties of the modif SSynt DepRel
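A property bag such as the one in Table 4 translates naturally into a flat mapping from criterion names to values. The Python sketch below shows one possible encoding (ours, not the authors'); N/A is rendered as None so that a query with either value matches.

# One possible encoding of Table 4; None stands for N/A, i.e. both YES and NO match.
MODIF = {
    "PoS Gov": "noun|date",
    "prototypical Dep": "A",
    "PoS Dep": "VPart|A",
    "governed preposition": False,
    "governed grammeme": False,
    "type of linearization": None,
    "canonical order": None,
    "left dislocation": False,
    "adjacency to Gov": False,
    "cliticization": False,
    "promotion": False,
    "demotion": False,
    "agreement": "TARGET",
    "agreement with": "Gov",
    "variant inflection": True,
    "Dep omissibility": True,
    "comma": False,
}

def matches(bag, criterion, value):
    """An unspecified criterion (None) matches any annotator answer."""
    expected = bag.get(criterion)
    return expected is None or expected == value

print(matches(MODIF, "canonical order", True))   # True: N/A matches both answers
print(matches(MODIF, "cliticization", True))     # False: a modif never cliticizes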
The adjective verdes ‘greenPL’ is positioned with respect to the noun sillas ‘chairs’, more precisely after it (if the noun goes in front of the verb, so does the adjective). Verdes forms a prosodic group with sillas and agrees with it, which indicates a dependency between these two words. The group behaves as a noun, and sillas triggers the agreement on verdes, which indicates that the latter is the dependent of the relation. An annotator has to perform simple syntactic tests, starting with the indication of the PoS of the governor and the dependent.9 Figure 2 shows a screenshot of the result of this query made by the annotator. The tool returns three lists: one with the n DepRels that match the two criteria, namely that noun is the governor and adjective the dependent, one with the DepRels that match only one of the two criteria, and one with the DepRels that match neither. Within each frame, the relations are ordered from the most (top) to the least (bottom) frequent. That is, in our example, the most likely label for the query in question is modif, while the least probable is the one at the bottom of the 0-criteria list. The annotator can discard candidates from the most to the least probable, based on the knowledge she has about the labels. She can also refine the query by adding criteria. Figure 3 shows a screenshot of the result of such a refined query. In this case, the annotator considered that it is not possible to move the dependent with respect to the governor (cf. the criterion fixed lin), that the dependent is found to the right of the governor (cf. the criterion canonical order), that it can be removed without causing meaning restructuring or agrammaticality (cf. the criterion dep removable), that it is involved in an agreement of some kind (cf. the criterion agreement involved), and that there is no comma between the noun and the adjective (cf. the criterion presence of comma in the “False” column).

9 We use letters as prefixes for criteria so as to order them in a way that helps the annotator: the most discriminative and easy-to-use criteria appear first in the list. However, the annotator is free to use the criteria in any order; the output will always be the same.
Figure 2. Sample query in the DepRel identifier tool with two criteria
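The ranking behaviour of the tool can be approximated in a few lines: every DepRel is scored by the number of annotator-supplied criteria it matches, and ties are broken by corpus frequency. In the Python sketch below, the property bags and the frequency counts are invented for illustration; the authors' actual tool runs SQL queries over a database of criterion values.

from collections import namedtuple

DepRel = namedtuple("DepRel", ["name", "frequency", "properties"])

# Invented property bags and frequencies for three relations (illustrative only).
RELATIONS = [
    DepRel("modif", 9000, {"PoS Gov": "noun", "PoS Dep": "A", "cliticization": False}),
    DepRel("det",   8000, {"PoS Gov": "noun", "PoS Dep": "D", "cliticization": False}),
    DepRel("dobj",  7000, {"PoS Gov": "verb", "PoS Dep": "N", "cliticization": True}),
]

def rank(relations, answers):
    """Order DepRels by number of matching criteria, then by corpus frequency."""
    def score(rel):
        return sum(1 for criterion, value in answers.items()
                   if rel.properties.get(criterion) == value)
    return sorted(relations, key=lambda r: (score(r), r.frequency), reverse=True)

# The annotator states that the governor is a noun and the dependent an adjective.
for rel in rank(RELATIONS, {"PoS Gov": "noun", "PoS Dep": "A"}):
    print(rel.name)
# modif (two matches) is ranked first, det (one match) second, dobj (none) last.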
At the bottom, we can see that only one relation matches all seven criteria, and that six relations match six of them. In this case, the correct label is indeed modif, but it can happen that the most probable label is discarded by the annotator because no answer was given for some criterion relevant to the DepRel in question. In practice, our experience is that this tool is not used much once the annotators have gained some routine, since the vast majority of dependency relations are easily identifiable. However, it is a considerable help when the annotator is confronted with a difficult case. Thus, even if the tool does not always give the correct DepRel, in the worst case it directs the annotator towards a restricted subset of dependencies in the detailed guidelines, which describe and illustrate every DepRel, so that she can see which one fits best.
Figure 3. Sample query in the DepRel identifier tool with seven criteria
Finally, let us mention again that the criteria described in this section do not represent an exhaustive list of the properties encoded by each relation. For instance, the syntactic relation det is also differentiated from the relation modif in that the two do not have the same combinatorics: a modif can combine with a det, but a det cannot combine with another det. Since such properties are usually not necessary for the annotator to obtain the right label, they are described in the complete guidelines but not spelled out here.
5. Conclusions

Even though dependency corpora exist for different languages and some of them are widely used for NLP applications, it is not yet clear how a set of syntactic relations can be obtained, nor how this set can be organized so as to allow for different levels of granularity in the annotation. In this paper, we attempted to fill this gap by detailing the procedure for establishing a tagset for a subset of Spanish verbal relations. We presented a restricted selection of easy-to-use criteria which facilitate the work of the annotators, but which can also accommodate the elaboration of a more or less fine-grained tagset. Another advantage of the approach is its potential applicability to any other language, although some criteria may not be needed for a specific language (e.g., linearization for order-free languages) or, on the contrary, new syntactic criteria may be required. We have already successfully begun to apply the method to a radically different language, namely Finnish, annotating a 2,000-sentence corpus with a restricted set of about 40 relations.10

We also showed two ways of using the criteria during the annotation process. A hierarchical criterion schema allows for an at-a-glance visualization of the differences between (groups of) dependency relations, since each relation is assigned only those criteria that are necessary for its identification. However, it is sometimes impossible to identify a non-prototypical dependency relation using the hierarchical schema. This speaks for a second approach, the bag-of-properties annotation schema, which records a large subset of properties for each relation. The bag-of-properties schema, queried via SQL, makes it easy to reach a particular relation while omitting criteria that the annotator does not want to handle in a particular case. Although it is possible to formulate queries that explicitly show the differences between two or more relations, it does not seem possible to cast all relations and their distinctive criteria into a single tree. This is why we believe that the two annotation schemata are complementary and should be used in parallel.

As shown in [7], thanks to our approach we achieve more than 92% inter-annotator agreement on the syntactic annotation of a 100,000-word corpus: the use of the fine-grained tagset and the application of criteria organized hierarchically and in sets has proven feasible and efficient.

10 The Finnish corpus is downloadable from http://www.taln.upf.edu/content/resources/417.
Acknowledgements

We express our heartfelt gratitude to Roberto Carlini, Anton Granvik and Igor Mel'čuk for their valuable help. The work described in this paper has been partially funded by the Ministry of Economy and Competitiveness (MINECO) and the FEDER Funds of the European Commission under contract number FFI2011-30219-C02-02.
References

[1] M.P. Marcus, B. Santorini and M.A. Marcinkiewicz. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2): 313–330 (1993). MIT Press, Cambridge, MA.
[2] J. Hajič, J. Panevová, E. Hajičová, P. Sgall, P. Pajas, J. Štěpánek, J. Havelka, M. Mikulová and Z. Žabokrtský. Prague Dependency Treebank 2.0 (2006). Linguistic Data Consortium, Philadelphia, Pennsylvania.
[3] J. Nivre, J. Nilsson and J. Hall. Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (2006). Genova, Italy.
[4] C. Thielen, A. Schiller, S. Teufel and C. Stöckert. Guidelines für das Tagging deutscher Textkorpora mit STTS (1999). Institute for Natural Language Processing, University of Stuttgart. http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/
[5] M. Taulé, M.A. Martí and M. Recasens. AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (2008). Marrakesh, Morocco.
[6] I.A. Mel'čuk. Dependency Syntax: Theory and Practice (1988). State University of New York Press, Albany, NY.
[7] S. Mille, A. Burga and L. Wanner. AnCora-UPF: A Multi-Level Annotation of Spanish. In Proceedings of the Second International Conference on Dependency Linguistics (2013). Prague, Czech Republic.
[8] B. Megyesi, B. Dahlqvist, E. Pettersson and J. Nivre. Swedish-Turkish Parallel Treebank. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (2008). Marrakesh, Morocco.
[9] J. Hajič and P. Zemánek. Prague Arabic Dependency Treebank: Development in Data and Tools. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools (2004).
[10] M. Čmejrek, J. Hajič and V. Kuboň. Prague Czech-English Dependency Treebank: Syntactically Annotated Resources for Machine Translation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (2004). Lisbon, Portugal.
[11] L. Cyrus, H. Feddes and F. Schumacher. FuSe – a Multi-Layered Parallel Treebank. In Proceedings of the Second Workshop on Treebanks and Linguistic Theories (2003). Växjö, Sweden.
[12] L. Ahrenberg. LinES: An English-Swedish Parallel Treebank. In Proceedings of the 16th Nordic Conference of Computational Linguistics (2007). Tartu, Estonia.
[13] S. Afonso, E. Bick, R. Haber and D. Santos. Floresta sintá(c)tica: A Treebank for Portuguese. In Proceedings of the Third International Conference on Language Resources and Evaluation (2002). Las Palmas de Gran Canaria, Spain.
[14] L. Van der Beek, G. Bouma, R. Malouf and G. Van Noord. The Alpino Dependency Treebank. In Proceedings of Computational Linguistics in the Netherlands (2001). Twente, The Netherlands.
[15] I.A. Mel'čuk and N.V. Percov. Surface Syntax of English (1987). John Benjamins Publishing Company, Amsterdam.
[16] I.A. Mel'čuk. Levels of Dependency in Linguistic Description: Concepts and Problems. In Dependency and Valency. An International Handbook of Contemporary Research, 1: 188–229 (2003). W. de Gruyter, Berlin.
[17] L. Iordanskaja and I.A. Mel'čuk. Establishing an Inventory of Surface-Syntactic Relations: Valence-Controlled Surface-Syntactic Dependents of the Verb in French. In I. Mel'čuk and A. Polguère (eds.), Dependency in Linguistic Description: 151–234 (2009). John Benjamins Publishing Company, Amsterdam.
[18] I.A. Mel'čuk. Aspects of the Theory of Morphology (2006). Mouton De Gruyter, Berlin.
[19] M.C. De Marneffe, B. MacCartney and C.D. Manning. Generating Typed Dependency Parses from Phrase Structure Parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (2006). Genova, Italy.
[20] S. Mille and L. Wanner. Syntactic Dependencies for Multilingual and Multilevel Corpus Annotation. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (2010). Valletta, Malta.
[21] S. Mille, A. Burga, G. Ferraro and L. Wanner. How Does the Granularity of an Annotation Scheme Influence Dependency Parsing Performance? In Proceedings of the 24th International Conference on Computational Linguistics (2012). Mumbai, India.
Computational Dependency Theory
K. Gerdes et al. (Eds.)
IOS Press, 2013
© 2013 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-61499-352-0-47
A Dependency-based Analysis of Treebank Annotation Errors

Katri HAVERINEN a,c, Filip GINTER c, Veronika LAIPPALA b, Samuel KOHONEN c, Timo VILJANEN c, Jenna NYBLOM c and Tapio SALAKOSKI a,c
a Turku Centre for Computer Science (TUCS)
b Department of French, University of Turku
c Department of Information Technology, University of Turku

Abstract. In this paper, we investigate errors in syntax annotation, using the Turku Dependency Treebank, a recently published treebank of Finnish, as study material. This treebank uses the Stanford Dependency scheme as its syntax representation, and its published data contains all data created in the full double annotation as well as timing information, both of which are necessary for this study. First, we examine which syntactic structures are the most error-prone for human annotators, and compare these results to those of two baseline parsers. We find that annotation decisions involving highly semantic distinctions, as well as certain morphological ambiguities, are especially difficult for both human annotators and the parsers. Second, we train an automatic system that offers for inspection sentences ordered by their likelihood of containing errors. We find that the system achieves a performance clearly superior to the random baseline: for instance, by inspecting 10% of all sentences ordered by our system, it is possible to weed out 25% of the errors.

Keywords. Finnish, treebank, annotation, parsing
Introduction

In the field of natural language processing (NLP), human-annotated training data is of crucial importance, regardless of the specific task. The creation of this data requires substantial resources, and its quality affects the applications built on it. It is therefore important to ensure, first, that the quality of the data is sufficiently high for the desired purpose and, second, that the amount of expensive manual work is kept reasonable. Considering the importance of manual annotation for NLP, studies on different aspects of the annotation process are of great interest.

This work examines the difficulty of syntax annotation in the context of Finnish. Our primary objective is to study human annotation and the errors in it, so as to make observations beneficial for future treebanking efforts. As dependency representations have been argued to be a good choice for evaluating the correctness of an analysis, as well as for the general intuitiveness of evaluation measures (see, for instance, the work of Lin [15] and Clegg and Shepherd [2]), and as there exists a recently published, dependency-based treebank for Finnish, this study, too, uses dependency-based evaluation.
[Figure: dependency analysis of Komission täytyy pyytää selvitystä ‘the Commission must ask for clarification’, with arcs labelled punct, dobj and cc.]