Coreference: Annotation, Resolution and Evaluation in Polish

Table of contents
Preface
Part I Introduction
1 Reference, anaphora, coreference
1.1 The concept of reference
1.2 The typology of reference
1.3 The real world versus the mental world
1.4 Reference, coreference and anaphora
1.5 Coreference and identity
2 Polish coreference-related studies
2.1 Terminology and characteristics of the problem
2.1.1 Klemensiewicz
2.1.2 Topolińska
2.1.3 Pasek
2.1.4 Fall
2.2 Text coherence, cohesion and intra-document linking
2.2.1 Pisarkowa
2.2.2 Bellert
2.2.3 Wajszczuk
2.2.4 Marciszewski
2.2.5 Stroińska, Szkudlarek and Trofimiec
2.2.6 Pisarek
2.3 Reference and anaphora in text understanding
2.3.1 Grzegorczykowa
2.3.2 Gajda
2.4 Text genres and language stylistics
2.4.1 Szwedek and Duszak
2.4.2 Honowska
2.4.3 Fontański
2.4.4 Dobrzyńska
2.5 Anaphora and first-order logic
2.5.1 Dunin-Kęplicz
2.5.2 Studnicki, Polanowska, Fall and Puczyłowski
2.6 Application of formal binding theories to Polish
2.6.1 Reinders-Machowska
2.6.2 Kupść and Marciniak
2.6.3 Trawiński
Part II Coreference Annotation
3 Related work
3.1 Coreference annotation abroad
3.2 Polish anaphora and coreference annotation
3.2.1 LUNA-related work
3.2.2 Pronominal anaphora annotation for information extraction
3.2.3 Anaphora representation in the KPWr corpus
4 Annotation models
4.1 Annotation requirements
4.1.1 Requirement 1: Linguistic compatibility
4.1.2 Requirement 2: Offline mode
4.1.3 Requirement 3: Standard, open, stand-off annotation format
4.1.4 Requirement 4: Simple, user-centered design
4.1.5 Requirement 5: Reliability
4.1.6 Requirement 6: Extensibility, adaptability and open source availability
4.2 Annotation tools review
4.2.1 Available tools
4.2.2 PALinkA
4.2.3 MMAX2
4.2.4 Conclusion
5 Annotation guidelines
5.1 Types of coreference
5.1.1 Identity of reference, identity of sense
5.1.2 Identity, quasi-identity
5.1.3 Dominant expression
5.2 Scope of annotation and typology of coreferential constructions
5.2.1 Anaphoric expressions in grammatical and lexical form
5.2.2 The borders of the nominal phrase
5.3 Particular annotation problems
5.3.1 Embedded phrases
5.3.2 Discontinuous phrases
5.3.3 Annotation of zero subject
5.3.4 Idioms
5.3.5 Indirect speech, direct speech and free direct speech
5.3.6 Gerunds
5.3.7 Definiteness and indefiniteness
5.3.8 Elective phrases
5.3.9 Extended nominal phrases
5.3.10 Enumerations
6 Annotation methodology
6.1 Initial annotation experiment
6.1.1 The SemEval-based format
6.1.2 Annotation data
6.1.3 Annotation procedure
6.1.4 First visualisation attempts
6.1.5 Findings from the process
6.2 Series versus parallel annotation
6.3 Annotation workflow
6.3.1 Design decisions
6.3.2 Corpus creation procedure
6.3.3 Text selection
6.3.4 Text preparation
6.3.5 Text distribution and acquisition
6.3.6 Manual text annotation
6.3.7 Corpus publication process
7 Annotation tools
7.1 Tools used in the CORE project
7.1.1 DistSys
7.1.2 MMAX4CORE
7.2 DistSys from annotator’s perspective
7.2.1 Downloading texts
7.2.2 Saving texts on server (optional operation)
7.2.3 Uploading finished texts
7.2.4 Checking the number of finished texts
7.2.5 Rejecting problematic texts
7.2.6 Working on more than one computer
7.3 MMAX4CORE from annotator’s perspective
7.3.1 Starting the program
7.3.2 Operations on files
7.3.3 Operations on mentions
7.3.4 Operations on clusters
7.3.5 Operations on links
7.3.6 Browsers
7.3.7 Program settings
7.3.8 Copying text fragments to clipboard
7.4 Adjudication of parallel annotations
7.4.1 Differences in DistSys
7.4.2 Differences in MMAX4CORE
8 Polish Coreference Corpus
8.1 Corpus composition
8.1.1 Short texts
8.1.2 Long texts
8.2 Corpus representation and visualisation
8.2.1 TEI format
8.2.2 MMAX format
8.2.3 Brat format
8.2.4 Brat visualization
8.3 Corpus statistics
8.3.1 Mentions
8.3.2 Coreference clusters
8.3.3 Cluster and mention count correlation
Part III Coreference Resolution
9 Resolution approaches
9.1 Resolution methodologies
9.1.1 Resolution models
9.1.2 Resolution strategies
9.1.3 Learning features
9.1.4 Resolution algorithms
9.2 Foreign state-of-the-art resolution tools
9.2.1 BART
9.2.2 Reconcile
9.2.3 Stanford Deterministic Coreference Resolution System
9.2.4 Berkeley Coreference Resolution System
9.3 Polish coreference resolution attempts
9.3.1 Knowledge-poor pronoun resolution
9.3.2 The analysis of anaphoric relations in Polsyn parser
9.3.3 Coreferencing for geo-tagging of Polish data
9.3.4 Pronominal anaphora resolution module for GATE
9.3.5 IKAR and anaphora representation in KPWr
9.3.6 English–Polish projection-based approach
10 Mention detection
10.1 Simple nouns and pronouns
10.2 Nominal groups
10.3 Nested mentions
10.3.1 Reorganisation of the grammar
10.3.2 Rule modification
10.3.3 Problematic cases
10.3.4 Reorganization results
10.4 Zero subjects
10.4.1 Null subject detection difficulties
10.4.2 Development and evaluation data
10.4.3 Development of the solution
10.4.4 Evaluation
10.4.5 Results
10.5 Named entities
10.6 Mention detection chain
11 Rule-based approach
11.1 Resolution process
11.2 Data sets and evaluation
11.3 Results
12 Statistical approach
12.1 First adaptation of BART for Polish
12.2 Second adaptation of BART for Polish
12.2.1 Feature categories
12.2.2 The final configuration
12.2.3 Summary
12.3 Third adaptation of BART for Polish
12.3.1 Bartek features
Part IV Evaluation
13 Manual annotation evaluation
13.1 Annotation agreement of mentions
13.2 Annotation agreement of heads
13.3 Annotation agreement of quasi-identity links
13.4 Annotation agreement of dominant expressions
13.5 Annotation agreement of coreference
13.5.1 Existing agreement scores
13.5.2 Results for PCC
14 Evaluation approaches
14.1 Evaluation exercises
14.1.1 Anaphora Resolution Exercise 2007
14.1.2 SemEval 2010
14.1.3 Evaluation metrics
14.2 Mention detection evaluation metrics
14.3 Coreference evaluation metrics
14.3.1 MUC
14.3.2 B3
14.3.3 ACE-value
14.3.4 CEAF
14.3.5 BLANC
15 Evaluation results
15.1 PCC evaluation methodology
15.1.1 Mention detection measures
15.1.2 Coreference resolution measures
15.1.3 Evaluation data
15.1.4 Evaluated tools
15.2 Mention detection evaluation results
15.3 Coreference resolution evaluation results
15.3.1 Gold mentions
15.3.2 System mentions
15.4 Conclusions and future directions
Part V Summary
16 Conclusions
16.1 Afterthoughts
16.1.1 Non-nominal pronouns
16.1.2 Possessive pronouns
16.1.3 Expanded phrases
16.1.4 Indirect speech
16.1.5 Quasi-identity
16.1.6 Context and semantics
16.1.7 Dominant expressions
16.2 Applications
16.2.1 Automatic summarization
16.2.2 Multiservice
16.3 Contribution summary
17 Perspectives
17.1 Annotation improvements
17.2 Mention detection improvements
17.2.1 Ignored words
17.2.2 Named-entity-related improvements
17.3 Resolution improvements
17.3.1 Detection of knowledge-linked mention pairs
17.3.2 The motivation for the knowledge base of periphrastic expressions
17.3.3 Data extraction sources
17.3.4 The experiment: knowledge extraction attempt
17.3.5 Data abstraction concept
17.3.6 Findings
17.4 Evaluation improvements
Acknowledgements
Bibliography


Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, Magdalena Zawisławska

Coreference

Annotation, Resolution and Evaluation in Polish

Authors:

Katarzyna Głowińska, Institute of Computer Science, Polish Academy of Sciences

Agata Savary, François Rabelais University Tours, Laboratoire d’informatique

Mateusz Kopeć, Institute of Computer Science, Polish Academy of Sciences

Magdalena Zawisławska, Institute of Polish Language, University of Warsaw

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences

ISBN 978-1-61451-835-8
e-ISBN 978-1-61451-838-9
e-ISBN (EPUB) 978-1-61451-995-9

Library of Congress Cataloging-in-Publication Data
A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2015 Walter de Gruyter Inc., Berlin/Boston/Munich
Cover image concept: Magdalena Zawisławska
Cover image: Inzyx / iStock / Thinkstock
Typesetting: le-tex publishing services GmbH, Leipzig
Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany
www.degruyter.com


Preface

This book is the result of a coreference-related project (see appendix for detailed information). When we were starting preparation for its funding application in 2010, we already knew what our main target was: the creation of innovative methods and tools (supplemented with necessary resources) for automated anaphora and coreference resolution in Polish texts, with planned quality comparable to that achieved for other languages. We were about to verify whether certain properties of a language highly different from English, such as rich inflection or flexible word order, can help us obtain better results – or, conversely, constitute additional obstacles in the task. We can now say that our findings, described below, shed new light on the coreference landscape.

Although with 40 million speakers worldwide and multiple national language engineering centers (see e.g. http://clip.ipipan.waw.pl) Polish is by no means an under-resourced language, the current state of language technology support for semantic text analysis is no better than fragmentary (see Miłkowski, 2012). We believe that the result of our work moves it a little forward. Moreover, by following the path outlined in this book, other less-supported languages can also find their way towards better semantic processing of content.

The book presents our work on coreference understanding, annotation and resolution in Polish. It discusses specifics of reference, anaphora and coreference, establishes an identity-of-reference annotation model and presents the methodology used to create the corpus of Polish general nominal coreference. Various resolution approaches are presented, followed by their evaluation. By presenting the subsequent steps of building a coreference-related component of the natural language processing (NLP) toolset and offering deeper explanation of the decisions taken, this volume might also serve as a reference book on state-of-the-art methods of carrying out coreference projects for new languages and a tutorial for NLP practitioners.

Apart from serving as a description of the first complete approach to annotation and resolution of direct nominal coreference for Polish, we believe that this book can be a useful starting point for further work on other types of anaphora/coreference, semantic annotation, cognitive linguistics (related to the topic of quasi-identity, discussed in the book) etc. With extended tutorial-like sections on important subtopics, such as evaluation metrics for coreference resolution, it can prove useful to both researchers and practitioners interested in semantic description of Balto-Slavic languages and their processing, engineers developing language resources and tools (LRTs) and linguistic processing chains, as well as computational linguists in general.

In our work, we decided to follow the corpus linguistics methodology (collecting and analysing real language data), which has been prevailing over theoretical linguistics (modelling the language competence of an ideal language user) for more than 50 years now. The motivation for such an approach is straightforward: with the invention of the computer and modern information technology, new language theories can be easily verified against recorded language usage on a large scale.

While for English such coreference-related resources have been publicly available since MUC-6 (the 6th Message Understanding Conference in 1995, see Grishman and Sundheim, 1996), for Polish the process has just started. The corpus-related approach implies a certain workflow, reflected throughout this book: starting from the definition of the basic notions of reference, anaphora and coreference, through selecting the type and scope of phenomena to be observed, establishing the rules for their identification in a corpus model of language and their human annotation, and ending with evaluation of the process performed in order to guarantee a high quality of work.

Although most of the examples in the book are originally in Polish, we have translated them into English for the sake of non-Slavic readers. Similarly, some of the theories and the formal description of reference, anaphora and coreference in Polish unavoidably come from Polish literature which is not always available in English, but we also did our best to translate all the quoted passages.

Maciej Ogrodniczuk
Warsaw, November 2014

Part I: Introduction

Magdalena Zawisławska

1 Reference, anaphora, coreference

1.1 The concept of reference

Coreference is usually defined as a phenomenon consisting in different expressions relating to the same referent. Therefore, this general definition requires us to first explain the basic concept of reference. Reference is, above all, a subject of interest of logical semantics, but also of linguistics, and it is understood in very different ways depending on the field. Classical logical semantics adopts, after Frege (1892), a distinction between two aspects of the meaning of natural language units. Frege described them as Bedeutung and Sinn; in the writings of other scholars they are called, respectively: denotation and connotation by Mill (1843), denotation and meaning by Russell (1905), extension and intension by Carnap (1947), and reference and sense by Black (1949). In this perspective, reference is a relation to extra-linguistic entities (referents), while sense is an intra-language relation – to other signs of the language system. This means that expressions can have the same reference, but a different sense, or conversely – the same sense, and a different reference, cf.:

(1.1) Zwycięzca spod Austerlitz jest przegranym spod Waterloo.
      ‘The winner from Austerlitz is the loser from Waterloo.’
(1.2) Jeden logik kłócił się z drugim logikiem o definicję referencji.
      ‘A logician argued with another logician about the definition of reference.’

In Example (1.1), both expressions zwycięzca spod Austerlitz ‘the winner from Austerlitz’ and przegrany spod Waterloo ‘the loser from Waterloo’ refer to the same referent – Napoleon Bonaparte – but they have a different sense: they distinguish different properties of Napoleon. In sentence (1.2), two isomorphic words logician have the same sense (name the same attributes), but different referents. However, such an understanding of reference is mistaken according to Padučeva (1992). She points out that reference is not an element of word meaning, but is realised in an utterance – therefore, it is not a quality of lexemes, but of the uses representing their forms in the text, e.g.:

Reference is, generally speaking, a relation to individual and, at the same time, new objects and situations. For this reason, reference applies not to words or linguistic expressions, but to their text uses – it relates to utterances and their elements. [...] Therefore, the question “What is the reference of the word man?” would be nonsensical, just as of any other word in the dictionary or of any word connection (for example, a young man, a tall man), or a sentence constructed according to the rules of grammar. The sentence A man entered has no reference on its own – it is not related to any specific situation, or any object. (Padučeva, 1992, p. 12)

That is why, as the scholar notes, every syntactically bound element of a sentence has a meaning, but reference characterises only some elements of an utterance. Thus, it is necessary to distinguish between reference as an attribute of a speech act (utterance) and lexeme denotation (extension), which constitutes the set of all potential referents. This means that a dictionary entry does not have reference, it only has denotation – a potential reference that can (but does not have to) be realised in the text. Lexeme denotation is connected with a phenomenon which Padučeva calls denotative ambiguity. This means that one word can have semantic variants and, depending on the context, realise a different reference type, e.g.:

(1.3) a. Przeczytał całego Hemingwaya.
         ‘He read all Hemingway.’
      b. Hemingway urodził się w 1899 r.
         ‘Hemingway was born in 1899.’
(1.4) a. Mieszkał w Berlinie.
         ‘He lived in Berlin.’
      b. Berlin naciska na bankructwo Grecji.
         ‘Berlin insists on the bankruptcy of Greece.’
(1.5) a. Zbiła nowy talerz.
         ‘She broke a new plate.’
      b. Zupa była tak dobra, że zjadł cały talerz.
         ‘The soup was so good that he ate the whole plate.’

Example (1.3.a) describes the writer’s legacy, and (1.3.b) the writer himself. Example (1.4.a) refers to a city, and (1.4.b) to the German government, whose seat is in Berlin. Sentence (1.5.a) is about a vessel, and (1.5.b) about its contents. Semantic variability of this type is highly systematic and often noted in dictionaries; however, it makes it very difficult to annotate coreferential expressions in a corpus. Moving reference to the utterance level and treating it independently of lexeme meaning is an essential matter when creating a corpus of coreferential expressions. This is because one cannot treat every recurrence of a word in the text as coreference, e.g.:

(1.6) Każdy szanujący się poseł ma asystenta. Asystentami są z reguły ludzie młodzi, ale nie brakuje również szczerze zaangażowanych emerytów. Poglądy polityczne asystenta powinny być zbieżne z linią szefa. Pracują jako wolontariusze tak jak Marek Hajbos, asystent Zyty Gilowskiej. Poseł Adam Bielan (rzecznik PiS) na przykład płaci asystentom za wysyłanie korespondencji. Obecny minister sprawiedliwości Grzegorz Kurczuk zaczynał partyjną działalność jako asystent Izabelli Sierakowskiej. W ministry poszedł też były asystent Józefa Oleksego Lech Nikolski. Posłowie nie poprzestają na jednym asystencie.
      ‘Every MP with some self-respect has an assistant. Assistants are usually young people, but one can also find many sincerely committed pensioners. The political views of the assistants should mirror the line of their supervisors. They work as volunteers just like Marek Hajbos, the assistant of Zyta Gilowska. For example, MP Adam Bielan (the spokesman of Law and Justice) pays his assistants for sending correspondence. The current Minister of Justice Grzegorz Kurczuk began his party activity as an assistant of Izabella Sierakowska. Similarly, Lech Nikolski, the former assistant of Józef Oleksy, also became a minister. MPs do not stop at one assistant.’

In the text (1.6), the author used the word assistant eight times. However, it does not have the same reference. We can see references to the class (‘Every MP with some self-respect has an assistant’; ‘The political views of the assistants should mirror the line of their supervisors’; ‘MPs do not stop at one assistant.’), to a part of this class (‘Assistants are usually young people, but one can also find many sincerely committed pensioners.’; ‘Adam Bielan’s assistants’), and to specific people (‘Marek Hajbos, Zyta Gilowska’s assistant’; ‘assistant of Józef Oleksy, Lech Nikolski’). Including all these expressions in one coreferential cluster would be a mistake.

1.2 The typology of reference

Another exceptionally important issue when studying coreference concerns the question of which elements of the utterance can be characterised by reference. In the classical view, reference can only be a quality of nominal phrases that name objects. This view is expressed by Topolińska (1984, p. 302), cf.:

The names of objects of material nominal groups encountered in texts are characterised by referential properties, that is, a relation to an object that they name.

Padučeva’s standpoint is not entirely clear, as she writes at one point (Padučeva, 1992, p. 113) that reference is not a property of predicates (and NPs used in predicative function), because they name attributes of the object instead of the objects themselves; at another point, on the other hand, she speaks of two types of reference: object and non-object, cf.:

In the tradition of logic since Frege, it has been assumed that object terms of reference [...] and propositions should be treated uniformly. Propositions – just as terms – have referents (the referent of a proposition is its truth value). In the papers dealing with pragmatics [...] reference is understood primarily as a property of object terms. For linguistics, it is more natural to treat object term reference and propositions in the same way: assigning a language utterance to reality is performed not only through object term reference, but also through reference of components with propositional meaning, which can refer to facts, events, and situations. (Padučeva, 1992, p. 15)

According to Padučeva, reference is an attribute of a whole sentence used in an utterance, of its propositional elements and of nominal groups. The author does not see words as means of non-object reference (that is, reference to events and situations), but instead locates those means in grammatical categories: tense and mood. A different solution of this matter can be found in Vater (2009, p. 104), who distinguishes four types of reference:
1. situational reference, which is a superordinate reference type, because sentences refer to a certain event, state or action
2. temporal reference, which relates to all time relations between situations
3. spatial reference, which can refer to the location of a given object in space or the direction of movement
4. object reference.

As we can see, the choice of reference definition significantly influences the creation of coreferential chains in the text. As a rule, reference in papers in the field of NLP is understood classically – as object reference exclusively. It seems, however, that this solution is effective only in the starting phase of work on the matter of text coreference. In order to fully describe this phenomenon, one should include all types of reference named by Vater, e.g.:

(1.7) Zapadał mrok. Bardzo ich to zaniepokoiło.
      ‘It was getting dark. They grew very anxious because of that.’
(1.8) Spotkali się w zeszłym roku. Wtedy właśnie się w sobie zakochali.
      ‘They met last year. It was then that they fell in love.’
(1.9) Mieszkał w samym centrum. Było tam dość głośno.
      ‘He lived in the very centre. It was quite loud there.’

In Example (1.7), the pronoun to ‘that’ refers to the whole predication Zapadał mrok. ‘It was getting dark.’ From the point of view of classical reference theory, one cannot put this sentence and the pronoun to ‘that’ into one cluster of coreferential expressions. Analogously, the adverbials w zeszłym roku ‘last year’, wtedy ‘then’, w samym centrum ‘in the very centre’, and tam ‘there’ will not be seen as coreferential, but if we adopt Vater’s view that reference is not restricted to object reference, we will have to include these examples in annotation as well.

Even if we define reference narrowly (as object reference only), we need to take into account its different types. In general, in papers on logical semantics reference has been understood merely as a relation to a specific, distinguished object. However, linguistics sees object reference also as a relation to a set of objects, which the addressor does not want to, or cannot, identify.

Padučeva (1992, p. 118–126) distinguishes three types of referential nominal groups:
1. defined nominal groups (with a single object or a set of objects), e.g. Ernest Hemingway urodził się w 1899 r. ‘Ernest Hemingway was born in 1899.’; Wszyscy moi studenci zaliczyli kolokwium. ‘All my students passed the test.’
2. under-defined nominal groups, e.g. Mam ci coś do powiedzenia. ‘I have something to tell you.’; Niektórzy referenci nie dojechali na konferencję. ‘Some speakers didn’t make it to the conference.’
3. nominal groups undefined for the speaker, e.g. Ktoś zjadł mój jogurt. ‘Somebody ate my yoghurt.’; Jacyś ludzie włamali się do jego mieszkania. ‘Some people broke into his apartment.’

Furthermore, Padučeva purports to single out a fourth group – with a neutralised definiteness opposition, that is, practically undefined nominal groups. This is very important for languages with no articles, such as Polish: the sentence Zatrzymał mnie policjant. ‘A/The policeman stopped me.’ is ambiguous – due to the lacking article, we do not know if it is some indefinite policeman (‘a policeman’) or a specific one (‘the policeman’). In a language with articles, our doubts would be dispelled by the article appearing before the nominal group.

Padučeva also points out a class of nominal groups without reference, which, in the author’s understanding of the term, means that they do not annotate any distinguished objects:
1. existential nominal groups, which refer to classes of objects, but do not distinguish any of them:
   (a) distributive nominal groups annotating “participants separated in a given set of events of one type” (Padučeva, 1992, p. 127), e.g. Czasami ktoś z nas go odwiedza. ‘Sometimes, somebody visits us.’; Do każdego wychowanka przyjechali jego krewni. ‘All pupils were visited by their relatives.’
   (b) unspecific nominal groups appearing in the context of a subdued assertion (namely, with verbs może ‘can’, chce ‘want’, powinien ‘should’, należy ‘must’, in imperative forms, questions, negations, performative verbs, etc.), e.g. Jan chce się ożenić z jakąkolwiek cudzoziemką. ‘Jan wants to marry any foreigner.’
   (c) general existential nominal groups referring to objects in a general way, without distinguishing a specific exemplar, e.g. Niektórzy ludzie mają alergię na gluten. ‘Some people are allergic to gluten.’
2. universal nominal groups referring to a whole, abstract class of objects, e.g. Kto rano wstaje, temu Pan Bóg daje. ‘The early bird catches the worm.’
3. attributive nominal groups referring to a single being, but the addressor does not think of a specific object, e.g. Najsilniejszy człowiek na świecie nie podniósłby 500 kg. ‘The strongest man on Earth would never lift 500 kg.’; Ten, kto wygra, otrzyma nagrodę. ‘The winner will get the prize.’
4. nominal groups annotating gender or species, e.g. On postąpił jak mężczyzna. ‘He acted like a man.’; Jaguary wymierają. ‘Jaguars are dying out.’

On the other hand, Topolińska (1984, p. 303–324) in her typology of nominal groups with a single referent distinguishes:
1. complete linguistically defined descriptions (of unequivocal reference), e.g. stolica Polski za Jagiellonów ‘the capital of Poland in Jagiellonian times’, autor “Pana Tadeusza” ‘the author of “Pan Tadeusz”’
2. incomplete linguistically defined descriptions (whose linguistic formalisation alone does not guarantee an unequivocal reference, or when reference changes with the speaking situation):
   – incomplete descriptions that annotate unequivocally in a certain situation, e.g. Swędzi mnie ręka. ‘My hand itches.’
   – incomplete descriptions correlated with an unequivocal gesture of reference, e.g. Daj mi ten nóż. ‘Give me this knife.’
3. nominal groups in the role of argumentative non-identifying expressions, e.g. Coś mi wpadło do oka. ‘Something is in my eye.’

Topolińska divides nominal groups with a collective referent into:
1. groups constituting an explicitly named collection (in a distributive or collective manner)
2. groups differentiating elements of a collection given explicitly (all elements, merely a part of them, or one of the elements is distinguished).

Both Padučeva and Topolińska emphasise the fact that nominal groups used predicatively do not have reference, because they annotate attributes of the object, not the object itself, e.g. Jan jest lekarzem. ‘Jan is a doctor.’ Padučeva also writes that one should pay attention to nominal groups that look like predicates, but which de facto constitute arguments of the identification predicate, e.g. Miasto, które widzimy, to Warszawa. ‘The city that we can see is Warsaw.’ In this case, the expression Warszawa ‘Warsaw’ is not used predicatively, and has reference. On the other hand, Topolińska (1984, p. 324) thinks that in sentences such as Na apel zgłosił się dziennikarz. ‘The meeting was attended by a journalist.’, Telefon przyjęła kobieta. ‘The call was answered by a woman.’, the nominal groups dziennikarz ‘a journalist’ and kobieta ‘a woman’ are predicative expressions, introduced in the position of arguments in a secondary manner. Topolińska believes that these sentences can be explicated as: Ten, kto się zgłosił na apel, był dziennikarzem. ‘The one who attended the meeting was a journalist.’, Ta, która przyjęła telefon, była kobietą. ‘The one who answered the call was a woman.’

However, one should note that predicatively used nominal groups can also form chains in texts that resemble coreferential clusters, e.g. Mama Jana była architektem i on też chce nim zostać, chociaż wie, że ten zawód jest bardzo wymagający. ‘Jan’s mum was an architect and he wants to become one too, even though he knows that this profession is very demanding.’ (architektem ‘architect’, nim ‘one’, ten zawód ‘this profession’). Therefore, if we adopt Padučeva’s and Topolińska’s viewpoints, chains of this type will not be included in the annotated texts, although, essentially, they are not formally different from “true” coreferential chains.

So, should we conclude that the definition of coreference does not overlap with the idea of reference? It would be a quite inconvenient solution, as it would require a substantial redefinition of coreference and a renewed decision on its relation to reference. Can we treat chains of expressions connected with a nominal group in predicate function as a completely separate phenomenon? This solution can, naturally, be adopted at the beginning of work on a coreference corpus, but, finally, something will need to be done with those “pseudocoreferential” sequences in annotated texts. Assigning these two phenomena to different categories is quite problematic as well, because there are no formal differences that would make it possible to unambiguously distinguish “true” coreference from the “fake” one, in which the antecedent is a predicatively used nominal group. Actually, setting a boundary this sharp between nominal groups used as argument names and the same groups used as predicates does not have any justification, assuming we adopt a broader definition of reference (that is, reference to place, time and situation as well). It is even less justified as the boundaries between arguments and predicates are blurred, which can be clearly seen in the example of gerunds, cf.:

(1.10) Nie mam żadnego pytania. ‘I have no questions.’ (Padučeva, 1992, p. 116)
(1.11) Bitwa zaczęła się o świcie. Trwała cały dzień i zginęło w niej wielu osadników.
       ‘The battle started at dawn. It lasted the whole day, and many settlers died in it.’
(1.12) W biegu wzięło udział stu zawodników. Nikomu jednak nie udało się go ukończyć.
       ‘The run had one hundred participants. However, nobody was able to finish it.’

In Example (1.10), the expression żadnego pytania ‘no questions’ has, according to Padučeva, a reference, although the lexeme pytanie ‘question’ names an activity, not an object (pytanie = ‘the act of asking’). The fact that it is used in the position of the argument does not automatically mean that it has a reference – since Topolińska believes that it is possible for a predicative expression to be used secondarily as an argument (cf. Nie mam żadnego pytania ‘I have no questions’ = O nic nie chcę zapytać ‘I don’t want to ask you about anything’). Analogously, one should not treat bitwa ‘the battle’ and biegu ‘the run’ in Examples (1.11) and (1.12) as referential nominal groups, although they might look like antecedents of anaphoric groups.

Distinguishing between nominal groups with and without reference is, to a large extent, arbitrary. If we look at Topolińska’s example: Jan chce być architektem. ‘Jan wants to be an architect.’, it is clear that a tiny formal change would be enough for a nominal group with analogous content to suddenly have a reference: Jan chce wykonywać zawód architekta. ‘Jan wants to get a job as an architect.’ Both expressions: być architektem ‘to be an architect’ and wykonywać zawód architekta ‘to get a job as an architect’ have identical meaning. Why should we treat the first group as non-referential? Semantic arguments put forward by Padučeva and Topolińska are hardly convincing. The authors refer to the semantics of expressions, although they actually focus more on their grammatical (class, part of speech) and/or syntactic characteristics (role in the sentence).

They do not say what to do with nominal names of activities – Padučeva apparently cannot see the difference between gerunds and typical nouns, see Example (1.10); Topolińska, on the other hand, ignores the matter completely, whereas her suggestion to treat sentences like Telefon odebrała kobieta. ‘The telephone was answered by a woman.’ as predicative uses seems totally unjustified. First of all, the interpretation of this utterance is context-dependent (e.g. Telefon odebrała kobieta. Mężczyzna, który jej towarzyszył, stanął zaś koło budki. ‘The telephone was answered by a woman. The man who accompanied her stood next to the booth.’ – in this case the reference is altogether concrete); secondly, practically most descriptive nominal groups can be explicated in the manner Topolińska proposes, e.g. Autor „Pana Tadeusza” przyjechał do Paryża. ‘The author of “Pan Tadeusz” came to Paris.’ = Ten, który był autorem „Pana Tadeusza”, przyjechał do Paryża. ‘The one who was the author of “Pan Tadeusz” came to Paris.’; Widział najdłuższą rzekę świata. ‘He saw the longest river in the world.’ = To, co widział, było najdłuższą rzeką świata. ‘What he saw was the longest river of the world.’ In fact, the solution of the reference problem depends on our definition of the world which the expressions of natural language refer to.

1.3 The real world versus the mental world

While musing about reference (and, as a consequence, coreference) it is impossible to disregard the problem of which extra-linguistic world we are talking about. In the classical understanding, reference applies to the real world (in Lyons’ or Searle’s writings). Of course, according to such a definition, expressions like Pegaz ‘Pegasus’, krasnoludek ‘imp’ or smok ‘dragon’ will never have a reference, because they do not refer to real beings. Padučeva remarks that such a limited understanding of reference has no application in linguistics, because a linguist does not care whether the text describes an object existing in the material world or in ancient Greek mythology. As an example of possible worlds colliding with the real world, Padučeva (1992, p. 14) quotes this text:

The Napoleon from the works of Tolstoy does not resemble the Napoleon in professor Tarlé’s books… Discussing which of these two Napoleons is closer to the historical prototype makes no sense.

In this case, the addressor introduces referents from three different worlds: from Tolstoy’s book, from Tarlé’s book, and the historical Napoleon who existed in the real world.

Kunz (2010, p. 31) points out that in recent writings it is assumed that the referents of language expressions do not have to exist as objects of reality. In the end, as the author underlines, we can talk about facts from the distant past (e.g. Platon napisał „Dialogi”. ‘Plato is the author of “Dialogues”.’) or some yet to come (Za rok wypłacą mi zysk z polisy. ‘In one year’s time, I will get profits from my insurance policy.’), abstract ideas and concepts (demokracja ‘democracy’, biel ‘whiteness’, mądrość ‘wisdom’), imagined or hypothetical beings (centaur, Harry Potter, czarna dziura ‘black hole’). Kunz writes about the world of discourse in this way:

During speech production, recipients establish a global mental representation on the basis of the linguistic unity of the text – a mental textual world. [...] The mental textual world is considered to be the cognitive input and output of the text processing. [...] It is widely accepted that construing a textual world is a dynamic cognitive process of text representation. It involves multiple composite steps of building, specifying, updating and shifting knowledge. Text recipients start processing a text with the very first textual elements they encounter and integrate new mental concepts and relations into the textual world on the basis of the linear progression of the linguistic structure. (Kunz, 2010, p. 31–32)

The world of the text is ruled by its own logic. It can have a connection with the real world, but just as often they will have nothing in common (frequent in literature and poetry). What lies at the foundation of the world construed by the addressee is the so-called coherence of the text, a semantic connectivity resulting from various types of conceptual and cognitive relations created in the text. According to de Beaugrande and Dressler (after Vater (2009)), those conceptual relations can be described as “property of something”, “place of something”, “coreferents of something”. The conclusion is that coreferential relations in the text occur primarily on its content level, although they can also be assigned by formal language means. Similarly, Langacker thinks that the expressions of natural language refer to the mental world, cf.:

We can talk about everything we can imagine. Only a limited part of our discourse is devoted to real situations, belonging to our world (in spite of their privileged status). Even when we actually speak about real-life situations, our descriptions are selective and schematic. Linguistic meanings not only reflect the described situations, but also emerge during an interactive process of construing and portraying those situations for the means of communication and expression. In the course of discourse development, the interlocutors cooperate in building, detailing, and modifying the conceptual structures, which, in the best case, only constitute a very partial representation of what is discussed. It is these conceptual representations of situations, and not their real nature, that are a direct basis for the meaning of expressions. (Langacker, 2008, p. 350)

Langacker points out that the concept of reference tends to be used in a very inconsistent manner. According to him, the classical interpretation of reference (as a relation to the real world) is not suitable as a general characterisation of nominal groups, because the designated objects of many nominal groups are abstract, or their objective existence is problematic (e.g. a supposed unimportance of moral scruples). Therefore, a linguist should be primarily interested in reference on the level of discourse. Actually, Langacker (2008, p. 353) questions the existence of non-referential nominal groups.

First of all, he does not restrict the reference of nominal groups to object reference. The fact that a given word is a noun does not stem from it naming an object; it is the product of two cognitive phenomena: grouping and reification. An “object” defined in this manner can, according to the author, emerge on every level of conceptual organisation. As an example, Langacker (2008, p. 147) gives the word recipe: a recipe can be written down, but it can also exist in space as an object. It is still an “object”, because in the course of preparation of given food items, we conceptualise them as a group and reify them as a uniform process, the effect of which is a finished dish. The same category of being an “object” applies to a plate, a pile of plates, a commission (comprised of several members) or a moment (a continuous period, whose elements – points in time – are grouped according to closeness in time). In this sense, there is no difference between a tree, a battle or running.

1.4 Reference, coreference and anaphora

The discussion above shows that the very concept of reference is complex and understood differently depending on the scholar. The same problem applies to the phenomena of coreference and anaphora. Coreference has never been at the centre of academic attention in linguistics. It is usually mentioned in the context of text coherence (as it is thought to be one of its indicators). Coreference tends to be defined as referential identity of expressions in the text, which means that two or more expressions refer to the same referent. Anaphora, on the other hand, contrary to reference, is seen as an intra-textual system. Topolińska writes that “anaphora is understood as a relation to matters and objects already mentioned in the text, and, as a consequence, known to the reader” (Topolińska, 1984, p. 327). Padučeva defines anaphora in a slightly different way, cf.:

An anaphoric reference is a relation between words in a sentence or in a text when the meaning of one word or expression contains a reference to another, while there is no syntactic relation between them. The first element of an anaphoric relation is called an antecedent, while the second – an anaphor (or anaphoric expression). (Padučeva, 1992, p. 187)

Padučeva seems to confuse anaphora with coreference, as she first writes that the content of an anaphoric relation can be a coreferentiality of nominal groups (defined by the author as a relation between nominal groups that annotate the same object, that is, have the same referent). However, further on she remarks that coreferentiality is possible between NPs which do not annotate a specific object (have no reference), e.g. Każdy człowiek chce, żeby go szanowano. ‘Every man wants to be respected.’ According to Padučeva, in this example we can see a phenomenon (which she calls coassignation) consisting in one group having a referent in one of the possible worlds, while the other is understood as a reference to the referent itself, cf.:

If a pronoun has a limited reference, a proposition constituting its direct context is devoid of truth value [. . . ]. The relation between the pronoun and the antecedent [. . . ] is a variation of coreferentiality [. . . ] called coassignation. In fact, the thesis that all coreferential nominal groups are referential (have a referent in the spoken world) should be rejected [. . . ]. (Padučeva, 1992, p. 134)

According to Padučeva, coreference cannot be defined as the identity of referents because of the denotative indefiniteness characteristic of many lexemes. The author believes that in the sentence: Hemingwaya czyta się najlepiej, gdy opisuje on polowanie. ‘Hemingway is most readable when he describes hunting.’ the pronoun and its antecedent are coreferential even though they do not annotate the same object (the phrase Hemingway annotates the writer’s works, and the pronoun refers to a person). The conclusion is that, in some passages of her papers, Padučeva identifies coreference with anaphora. Moreover, the scholar lists the limitations of coreference:
1. referential nominal groups are not coreferential if they apply to different objects
2. predicative NPs cannot be coreferential because they are located outside reference
3. two NPs of a different denotative type cannot be coreferential either, e.g. predicative and term status. (Padučeva, 1992, p. 188)

The non-equivalence of the reference and anaphora systems is also pointed out by Topolińska. Her remarks are largely similar to Padučeva’s, but Topolińska does not identify coreference with anaphora. She writes, namely, that anaphora is possible between nominal groups without reference, and, what is more, all sentence content can be anaphorised, e.g. Chcę być architektem i nim zostanę. ‘I want to be an architect and I will.’; Zapadał mrok. ‘It was getting dark.’; Powiększyło to nasze szanse. ‘It increased our chances.’ (Topolińska, 1984, p. 328)

Padučeva’s and Topolińska’s theses are quite problematic from the point of view of working on a coreference corpus, as it is quite common to find in texts chains of anaphoric expressions whose antecedents are nominal groups included by the author in the class of non-referential expressions, e.g.:

(1.13) Często tłumy ludzi przechodzą obok jakichś zjawisk i nie dostrzegają sedna sprawy, a dopiero gdy ktoś zbierze i opracuje rządzące nimi prawa, wszyscy dziwią się, że to jest takie proste i oczywiste, że nawet dziecko to zrozumie, a przez tyle lat nikt na to nie wpadł.
       ‘Often, crowds of people pass over a phenomenon and never notice the gist of the matter until somebody brings the laws governing it together and describes them, and then everybody wonders that it is so simple a child could understand it, though for so many years no one thought of it.’

In Example (1.13) we have exclusively nominal groups which Padučeva regards as non-referential. The expressions tłumy ludzi ‘crowds of people’ and jakichś zjawisk ‘a phenomenon’ would fall under the existential class, the phrases ktoś ‘somebody’ and nikt ‘no one’ under the attributive class, and the phrases rządzące nimi prawa ‘the laws governing it’, wszyscy ‘everybody’ and dziecko ‘a child’ would belong to the universal class. If they have no reference, one cannot say that these sequences are coreferential: tłumy ludzi ‘crowds of people’ and (oni) nie dostrzegają ‘(they) never notice’; jakichś zjawisk ‘a phenomenon’ and nimi ‘it’; rządzące prawa ‘governing laws’ and to ‘it’; ktoś ‘somebody’ and (on) opracuje ‘(he/she) describes’. Predicatively used nominal groups, too, are used as antecedents in sequences of anaphoric expressions. As a result, adopting Padučeva’s theses would substantially restrict the possibilities of annotation of a coreference corpus, especially as some of the author’s premises are objectionable. An example given by Padučeva: Hemingwaya czyta się najlepiej, gdy opisuje on polowanie. ‘Hemingway is most readable when he describes hunting.’ is an evidently elliptical phrase: (Książki) Hemingwaya czyta się najlepiej, gdy opisuje on polowanie. ‘Hemingway’s (books) are most readable when he describes hunting.’ It is quite clear that the pronoun on ‘he’ relates to the writer, and the instrumental in the first sentence has been omitted.

Langacker (2008) claims, on the other hand, that if a nominal group can become an antecedent of a pronoun which refers to it in the text, it means that it has a virtual point of reference in the world of discourse, or, in our perspective – a reference. This approach seems legitimate, especially from the point of view of research on natural language processing, because the sole existence of chains of anaphoric expressions in the text becomes a sign that a nominal group is referential on the level of the text world, and there is no need to introduce the artificial and problematic (even for a human) distinction between referential and non-referential nominal groups, which Topolińska and Padučeva had been proposing.

To sum up the musings above, it seems necessary to resolve the matter of these terms: reference and coreference concern relating expressions to phenomena created in the world of discourse, while anaphora is an intra-textual system which enables the text to set relations of coreference. It should be noted here that only certain types of expressions are coreferential outside the context by virtue of their meaning (e.g. Adam Mickiewicz, autor „Pana Tadeusza” ‘Adam Mickiewicz, the author of “Pan Tadeusz”’). Most linguistic expressions become coreferential only after they are put in context, among other things thanks to the mechanism of anaphora, e.g.:

(1.14) Ośmioletni Władimir, choć od zamachu minęło już trzynaście miesięcy, ciągle wspomina tamtą tragedię. Nie może zapomnieć, co terroryści wyrabiali w jego szkole. – Ma ranną duszę i ciało – mówi jego 40-letni ojciec Sergej Oziev. – Syn był postrzelony w rękę i głowę, którą operowano mu w Niemczech – dodaje. Tacie głos też się jeszcze trzęsie. Ośmiolatek to cała rodzina, jaka mu została po zabiciu w zamachu starszego syna i żony.
       ‘Although it’s been thirteen months since the attack, eight-year-old Wladimir still recalls this tragedy. He cannot forget what the terrorists did in his school. “His soul and body are wounded”, says his 40-year-old father Sergej Oziev. “My son was shot in the hand and the head, which was operated on in Germany”, he adds. The voice of the father still trembles as well. The eight-year-old is all the family he has left after his older son and wife were killed in the attack.’


was shot in the hand and in the head, which was operated on in Germany’, he adds. The voice of the father still trembles as well. The eight-year-old is all the family he has left after his older son and wife were killed in the attack.

In Example (1.14) we have sequences of coreferential expressions¹: (1) ośmioletni Władimir ‘eight-year-old Wladimir’, (on) nie może zapomnieć ‘(he) cannot forget’, syn ‘my son’, ośmiolatek ‘the eight-year-old’, (2) 40-letni ojciec Sergej Oziev ‘40-year-old father Sergej Oziev’, (on) dodaje ‘he adds’, tacie ‘the father’, mu ‘he’. Many of these expressions become coreferential thanks to the mechanism of anaphora, e.g. derivation of the anaphor (ośmioletni/ośmiolatek ‘eight-year-old’) or ellipsis (omission of the subject with personal forms of verbs). Of course, coreferential relations are established not only by anaphora. We know that the word syn ‘son’ refers to Władimir, because from the textual world we learn that Sergej Oziev is his father, and those two lexemes are connected by a semantic relation of conversion. From the point of view of a linguist, connecting coreference with the system of anaphora does not explain a lot, because it is not exactly clear how the mechanism of anaphora works. Usually, by anaphora we understand pronominal anaphora, but it is not the only means of signalling this type of relationship. Topolińska (1984) lists, apart from pronouns, such anaphoric means as: a full replica (e.g. Przyszli do nas wuj z synem. Wuj przyniósł nam cukierki. ‘We were visited by uncle and his son. Uncle brought us sweets.’) or a partial semantic replica (using synonyms, hyponyms or hypernyms, e.g. Spojrzał na dom. Budynek był strasznie zapuszczony. ‘He looked at the house. The building was horribly neglected.’). It is not easy to formalise anaphora, even narrowly understood pronominal anaphora. Analyses in the spirit of generative grammar bring little to the discussion, because the c-command analysis proposed by Reinhart does not extend beyond the scope of a sentence; it is, therefore, of little use for the analysis of coreference in text. Papierz (1995, p. 216) remarks that in the case of the analysis of anaphora in some Slavic languages (Polish, Slovak, or Czech) it is impossible to give an exception-free rule to “describe conditions of appearance of the anaphoric pronoun in case of a full replica for languages without articles”. An additional problem is that the means mentioned by Topolińska are not used in anaphora exclusively. In the following examples we can see a full replica, but it is not anaphora, cf.:

(1.15) Ten długopis jest mój, ale tamten długopis możesz sobie wziąć. ‘This pen is mine, but that pen you can take.’

(1.16) Człowiek człowiekowi wilkiem. ‘Man is man’s wolf.’

1 Selected mentions only.

In both cases the isomorphic expressions have different referents (two different pens, two different representatives of the human species), so we cannot treat them as a sequence of anaphoric expressions.

An interesting coreference mechanism is put forward by van Hoek (1995, 2003), who claims that anaphoric references in the text are dependent on the cognitive clarity of the objects of discourse and on the system of conceptual (semantic) coherence between the elements of the discourse. In her theory, the scholar uses two cognitive conceptions: the so-called theory of accessibility, created by Ariel (1990), and Langacker’s grammar. According to the theory of accessibility, different types of nominal phrases correspond to different levels of cognitive accessibility of the object which the nominal phrases refer to. Ariel believes that the accessibility scale looks as follows:

Zero < reflexive < agreement markers < cliticized pronouns < unstressed pronouns < stressed pronouns < stressed pronouns + gesture < proximal demonstrative (+NP) < distal demonstrative (+NP) < proximal demonstrative (+NP) + modifier < distal demonstrative (+NP) + modifier < first name < last name < short definite description < long definite description < full name < full name + modifier (Ariel, 1990, p. 73)
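Since the scale is simply an ordered ranking, it can be encoded directly as an ordered list; the sketch below is only an illustration (the simplified category labels and the helper function are ours, not part of Ariel’s proposal):

```python
# A rough sketch of Ariel's accessibility scale as an ordered list;
# the category labels are simplified and purely illustrative.
ACCESSIBILITY_SCALE = [
    "zero", "reflexive", "agreement marker", "cliticized pronoun",
    "unstressed pronoun", "stressed pronoun", "stressed pronoun + gesture",
    "proximal demonstrative (+NP)", "distal demonstrative (+NP)",
    "proximal demonstrative (+NP) + modifier", "distal demonstrative (+NP) + modifier",
    "first name", "last name", "short definite description",
    "long definite description", "full name", "full name + modifier",
]

def accessibility_rank(expression_type):
    """Lower rank = the referent is assumed to be more cognitively accessible."""
    return ACCESSIBILITY_SCALE.index(expression_type)

print(accessibility_rank("unstressed pronoun") < accessibility_rank("full name"))  # True
```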

Using the full name (e.g. Albert Einstein) is a sign that the referent is relatively less cognitively accessible to the author of the utterance. Using the first name only means a slightly higher accessibility. Pronouns signalise a yet higher level of cognitive accessibility of the referent – the addressor will use a pronoun in a text only when he or she is sure that the addressee knows who they are talking about (e.g. the referent is present or has been mentioned before (see van Hoek, 2003, p. 173)). This general rule seems universal for all languages. Van Hoek combines the concept of accessibility with the concept of conceptual points of reference. Langacker (2008, p. 682) observes that in an anaphoric relation the antecedent opens a mental access to the pronoun in the sense that it defines its reference. The scholar thinks that opening the access to the pronoun depends on the prominence of the nominal group fulfilling the role of the antecedent and on the closeness of their conceptual connection. Van Hoek defines the point of reference in the following way: A reference point is a conception that is prominent and therefore is used as a starting point from which to apprehend a larger conception of which it is a part, such as the meaning of an entire sentence. (van Hoek, 2003, p. 180)

The concept of prominence originates in Langacker’s theory that some elements of the sentence are more prominent than others, and that they function on the basis of the figure/ground opposition. In a sentence, the subject is the main, superordinate element – the figure. Therefore, it is the main point of reference in the utterance. A direct


object, on the other hand, will be the secondary figure – less prominent than the subject, but still more prominent than the remaining nominal phrases in the sentence. This means that a direct object may serve as a secondary point of reference in the sentence. The remaining nominal phrases can play peripheral roles (van Hoek, 2003, p. 180–183). An attempt to apply van Hoek’s theory to the description of Polish anaphora was made by Data-Bukowska (2008). The analyses of anaphora examples with insertions seem to confirm van Hoek’s theses in reference to the Polish language as well. Data-Bukowska sums up her research in this way:

First observations resulting from the analysis of the relatively small material, which nevertheless provided good foundations for the study, are quite surprising. The higher the semantic coherence between the objects and the simpler the relation allowing for a recognition of this coherence, the clearer are the differences on the conceptual prominence level between the competing objects of the discourse. This concerns, most of all, the most primary opposition between the subject and the object within the sentence and the syntactic accommodation, that is the relation of superordination / subordination between the sentences. In such cases deciding on the point of reference for the pronoun is performed by minimal cognitive effort on the part of the addressee, and it is usually unambiguous. Seemingly, the developed schemata are characterised by a considerable level of consolidation. (Data-Bukowska, 2008, p. 65)

Analysing the mechanisms of anaphora from the point of view of cognitive grammar seems to be promising, as it could let us discover some universal elements on the cognitive level. On the formal level, the mechanisms of anaphora are significantly different depending on the language. Even if the languages are very similar, there are major differences between the systems of anaphora – see the comparative research by Papierz (1995) on the Polish, Czech, and Slovak anaphora, or Kunz (2010) on English and German.

1.5 Coreference and identity

Defining coreference as identity or sameness of referents is a source of new problems. Of course, there are situations when we have absolutely no doubts that the expressions refer to the same referent. However, it can also happen that the matter is not as simple. The issue of assigning coreference is connected with two major problems: firstly, the concept of identity, and secondly, the fact that lexemes are also connected by all kinds of semantic relations, which makes it difficult for the addressee to decide whether he or she is dealing with the same referent, or a different one. Additionally, the Polish language does not have articles, which means that we do not have formal signals of whether the addressor has in mind a specific object, an indefinite object, or a class of objects.

According to Wierzbicka (2010, p. 61), identity is a universal, basic and indivisible semantic unit. This is seen differently by Fauconnier and Turner (2002), who perceive identity as the most important basic relation (vital relation). According to the authors, this does not mean that it is an uncomplicated relation, because it requires considerable imaginative effort from the addressees, who need to analyse and synthesise many mental spaces – for example, they need to connect the mental space of an infant, a child, a teenager, and an adult with an identity relation, in spite of the obvious differences between them, and then include into the integration process other relations like change, time, and cause and effect. On the other hand, various semantic relations between lexemes can cause the addressee to mistakenly take a set of expressions for a coreferential sequence, while the expressions are merely connected by some kind of lexical relation such as synonymy, holonymy/meronymy, or hyponymy/hypernymy, e.g.:

(1.17) Mity są niezastąpionym narzędziem dla psychologa, usiłującego prześledzić wzorce ludzkich zachowań. Wysiłki archeologów, religioznawców, antropologów doprowadziły z jednej strony do porzucenia eurocentrycznego spojrzenia na mitologię. . . ‘Myths are an irreplaceable tool for a psychologist trying to track the patterns of human behaviour. The efforts of archaeologists, religion scholars, and anthropologists have, on the one hand, caused a retreat from the Eurocentric view on mythology. . . ’

In the example above the expressions mitologię ‘mythology’ and mity ‘myths’ are connected by a whole–part relation, as, obviously, mythology consists of myths. It does not mean, however, that those expressions are coreferential. These obscurities led to Recasens’ introduction of an additional concept of near-identity. She claims that “coreference relations between DEs depend on criteria of identity largely determined by the linguistic and pragmatic context” (Recasens et al., 2011, p. 1142). The author continues that identity is blurred and gradable. Recasens (2010, p. 151) distinguishes the following types of near-identity:

A. Name metonymy: (a) Role, (b) Location, (c) Organization, (d) Information realization, (e) Representation, (f) Other
B. Meronymy: (a) Part_Whole, (b) Stuff_Object, (c) Set_Set
C. Class: (a) More specific, (b) More general
D. Spatio-temporal function: (a) Place, (b) Time, (c) Numerical function, (d) Role function

Distinguishing the concept of near-identity results from the fact that the author confuses various levels: the linguistic and the conceptual ones, and the level of the real world. It is beyond discussion that lexemes have some common semantic elements and that there are relations of various types between them. They are described in projects such as WordNet (Miller, 1995) or FrameNet (Baker et al., 1998). However, it is important to remember that reference (and coreference) is not a property of the linguistic system; it is a property of an utterance, which means that it is realised no sooner than on the pragmatic level. Coreferential expressions do not have to be isomorphic, synonymous or bound by a semantic relation, e.g.:

(1.18) Jedynym moim pożywieniem w ostatnich trzech dniach były węglowodany – powiedział trzykrotny złoty medalista. Sportowcy niemieccy są ogromnie zaskoczeni wiadomością o pozytywnym wyniku testu antydopingowego Muehlegga. ‘My only food in the last three days were carbohydrates – said the three-time gold medallist. German sportsmen are immensely surprised by the news of the positive result of Muehlegg’s anti-doping test.’

(1.19) Kruk krukowi oka nie wykole. ‘A crow does not pick crow’s eyes.’

In Example (1.18), the expressions Muehlegg and trzykrotny złoty medalista ‘three-time gold medallist’ are coreferential, although they are not connected by any semantic relation. However, the forms of the same lexeme, kruk ‘a crow’ and krukowi ‘crow’s’, in Example (1.19) are not coreferential, because they refer to two different referents that happen to belong to the same species. The examples of near-identity given by Recasens can, in most cases, be quite easily explained by various phenomena on the levels of: a) grammar, b) semantics, c) concepts. For example, many misunderstandings in defining coreference stem from the fact that Polish is a language without articles, and when full replicas are used, it is quite difficult for the addressee to judge whether he or she is dealing with the same object, or not, cf.:

(1.20) Anna D. od lat hoduje świnki morskie. Twierdzi, że to bardzo inteligentne zwierzątka. Kiedy mieszkała w maleńkiej garsonierze, „akwarium” świnek stało blisko okna; świnki nauczyły się naśladować głosy ptaków i w chwilach radości ćwierkały jak wróble. Teraz Anna mieszka w dużym domu za miastem, a jej świnki to już kolejne pokolenie tamtych „ćwierkających”. Najwyraźniej jednak przodkowie przekazali im umiejętność naśladowania ptasich treli, bo i te zwierzaki to potrafią, chociaż z miejsca, w którym teraz przebywają, ptaków nawet nie widzą. ‘Anna D. has bred guinea pigs for years. She claims that these animals are very intelligent. When she lived in a small bed-sitter, the guinea pigs’ “aquarium” stood next to the window; the animals learned to imitate the voices of birds, and in moments of happiness chirped like sparrows. Now Anna lives in a big house outside the city, and her guinea pigs are a subsequent generation of the “chirpers”. It seems, though, that their predecessors passed on the ability to imitate bird trills, because these animals can do it as well, despite the fact that they cannot even see birds from the place they are living in now.’

This text requires a rather careful reading to be able to say that the multiple occurrences of the expression świnki morskie ‘guinea pigs’ have different referents: the class ((hoduje) świnki morskie ‘(has bred) guinea pigs’), the first generation, which learnt to imitate sparrows, and the subsequent generation (or generations), which can chirp. One should note, however, that the lack of articles makes it very difficult to decide on the coreferential sequences. The next example Recasens would classify, according to her theory, as a type of near-identity ROLE, see:

(1.21) Są akcentami dużej wystawy plastycznej, prezentującej twórczość Jana Stępkowskiego, artysty ze Strzegowa. Właściwie artysty przedstawiać nie trzeba; to jeden z najwybitniejszych na północnym Mazowszu rzeźbiarzy, a jednocześnie znakomity malarz. Jako rzeźbiarz – sięga po różne techniki i materiały, nawet tak trudne do obróbki i kształtowania, jak bazalt i granit. ‘They are the highlight of a big art exhibition, which presents the works of Jan Stępkowski, an artist from Strzegowo. In fact, there is no need to introduce the artist; he is one of the most talented sculptors in the northern Mazovia, and, at the same time, an excellent painter. As a sculptor – he makes use of different techniques and materials, even as hard in working and shaping as basalt and granite.’

In Example (1.21), according to Recasens, the expressions jeden z najwybitniejszych na północnym Mazowszu rzeźbiarzy ‘one of the most talented sculptors in the northern Mazovia’, znakomity malarz ‘excellent painter’, and (jako) rzeźbiarz ‘(as) a sculptor’ should be connected with the antecedent Jan Stępkowski with a near-identity relation. The


problem is that these anaphoras are used in the role of predicates – they describe the properties of the artist, so they do not create any identity or near-identity with the expression Jan Stępkowski. It suffices to exchange these nominal phrases for verbs for it to become apparent: (1.22) Właściwie artysty przedstawiać nie trzeba; rzeźbi i jednocześnie maluje. Kiedy rzeźbi – sięga po różne techniki i materiały, nawet tak trudne do obróbki i kształtowania, jak bazalt i granit. ‘In fact, there is no need to introduce the artist: he makes sculptures and, at the same time, paints. When he makes sculptures – he makes use of different techniques and materials, even as hard in working and shaping as basalt and granite.’ Most of the examples named by Recasens are related to situations where there is some semantic relation between lexemes (units of the system), e.g.: (1.23) – „Od czasu okupacji. . . ” – „Ale tu je masz z powrotem, w metryce, i musisz ich używać w urzędowych papierach” – powiedział oschle dyrektor i podsunął mi nowy blankiet do wypełnienia. – „Kiedy to jest stara metryka, którą mi odtworzono zaraz po wojnie.” ‘“Since the occupation. . . ”, “But here you have them back, in the certificate, and you need to use them in official papers”, said the headmaster stiffly, and pushed a new form to fill out in my direction. “But this is the old certificate that was reconstructed for me, just after the war.”’ The lexemes wojna ‘the war’ and okupacja ‘the occupation’ are of course bound with some kind of a semantic relation (it is called entailment in WordNet), but it does not mean that we can automatically consider them near-identity. In the text, it does not look like the addressor treated those expressions coreferentially. In the creation of the textual world, addressors can use semantic properties of lexemes, but they do not have to do it. Coreferential expressions are not always related with any kind of semantic relation, cf.: Widziałaś dziś naszą księgową? Co ta debilka dziś na siebie założyła! ‘Have you seen our accountant? The clothes this idiot is wearing today!’ The lexemes księgową ‘accountant’ and debilka ‘idiot’ are not connected in the system by any lexical relation, and still they are coreferential in the text. Often, an incorrect identification of coreferential expressions is a result of the fact that addressors have different knowledge of the world from addressees, see: (1.24) Setnik Lipkowski, rzekomy Mellechowicz, jest synem groźnego Tatara Tuhaj-beja, który wyrządził wiele szkód Polsce. Azja podstępem próbuje porwać Basię. ‘Sotnik Lipkowski, the alleged Mellechowicz, is the son of the mighty Tatar Tuhay Bey, who dealt a lot of damage in Poland. Azja tries to kidnap Basia by deception.’

If the reader has not read the novel by Henryk Sienkiewicz, “Pan Wołodyjowski” (‘Fire in the Steppe’, 1888), he would not know that sotnik Lipkowski, alleged Mellechowicz and Azja all refer to the same referent – one of the literary characters in this book. There are of course more problematic cases which Recasens also considers near-identity, e.g.:

(1.25) W miejscu dawnej jezdni ryją buldożery. Bez problemu można dojechać ul. Bandurskiego, a że nawierzchnia Retkińskiej była znana jako jedna z najbardziej dziurawych w mieście, nikt nawet specjalnie nie skarży się na utrudnienia w ruchu. Nowa Retkińska będzie miała i sygnalizację u zbiegu z ul. Krzemieniecką, i chodnik (spory odcinek ulicy był go całkowicie pozbawiony). ‘In the place of the old road, bulldozers are digging. There is no problem to reach Bandurskiego St., and since the pavement of Retkińska was known to have more holes than anywhere in town, nobody is really complaining about the impediments in traffic. New Retkińska will have both traffic lights at the crossing with Krzemieniecka St. and a sidewalk (a large section of the street had no sidewalk at all).’

In this example, we have expressions such as Retkińska and nowa Retkińska ‘new Retkińska’ which seemingly refer to the same being in the real world – a specific street. However, in fact, the expression nowa Retkińska refers to a virtual being – a street from the future, after the roadworks are finished. Therefore, these expressions do not have the same referent in the world of discourse. The addressor wants the addressee to think of them as separate objects, which he or she signalises by adding the adjective new to the name of the street. The ambiguity regarding the referent’s identity most often comes from interferences in communication between the addressor and the addressee on different levels in the course of text decoding: interferences on the conceptual level (for example, the addressor and the addressee can have different knowledge of a certain object), interferences on the semantic level (e.g. the addressor does not know the word, or the addressee uses a wrong word), and interferences on the syntactic level (e.g. ellipsis, mental shortcut). In my opinion, there is no need to introduce an additional term of near-identity, which does not explain anything and rather disturbs the structure of annotated texts, as it mixes up separate levels of language – system and speech. In the case of coreference, the level of discourse is decisive – it is this level that decides what is, and what is not, coreferential. Some semantic properties of lexemes can be used there, and the world of the text can overlap to some degree with the real world, but it is not obligatory. The addressor may as well create virtual referent types (e.g. three Napoleons, or two Retkińska Streets), and in this case it does not matter that there was only one Napoleon or one Retkińska Street in the real world. The point is that in the world of discourse there are different referents, and it is only this level that should be taken into account during the annotation of coreferential expressions in the corpus.

Maciej Ogrodniczuk

2 Polish coreference-related studies

The research on discourse structure has had a long history in modern Polish linguistics, starting with Klemensiewicz (1937) discussing indicators of reference (Pol. wskaźniki nawiązania) as an element of his comprehensive model of syntactic description of Polish. It has been followed by a number of more specific works on referential and anaphoric constructs, shortly reviewed below and restricted to direct nominal coreference – running here to serve as illustrative examples and a repository of topics which must be taken into account in studies in the field. However, two important remarks must be made regarding the scope and applicability of these deliberations. Firstly, although reference and linking seem to be thoroughly investigated for Polish, coreference is a phenomenon which lacks theoretical clarity: some researchers use this term interchangeably with anaphora, some define coreference as its subtype, some avoid using this notion completely. Secondly, most of theoretical research has been carried out in pre-computer times, which hindered verification of theories on a large scale. With the rise of computer linguistics and increase of availability of affordable computing power, the ‘theoretical’ and ‘practical’ trends have been merging for some time now, providing new grounds for both groups of researchers. Practitioners can easily evaluate concepts previously created on paper while theoreticians can use big language data on demand as a playground for their research. Importantly, the underused advanced theoretical material described below still awaits deeper computational research. Our approach is just the first step in this direction.

2.1 Terminology and characteristics of the problem

2.1.1 Klemensiewicz

The first consistent analysis of reference and anaphora in Polish is by far the one in (Klemensiewicz, 1948)¹, stating that most utterances are semantically or formally interlinked by a grammatical, lexical or thematical relationship of external reference. Klemensiewicz introduces the basic Polish terminology of the field, defining e.g. człon nawiązany (CN, ‘reference element’, presently: anaphor) and podstawa nawiązania (PN, ‘reference basis’, presently: antecedent), both being whole clauses (or, in

1 Further developed in (Klemensiewicz, 1950) and reprinted in a more accessible (Klemensiewicz, 1982).

certain contexts, larger units such as whole paragraphs representing the general thread of the relevant part of the discourse) rather than more precisely specified link source and target. The scholar analyses syntactic relations between anaphors and antecedents, listing the general hierarchy of the most common indicators of reference:

1. grammatical:
(a) conjunctions: PN: Prawdą żywą staje się tylko przeżycie, pozadoświadczalne wyczucie, które się w samym fakcie życia objawia. ‘Only experience can become the real truth, an extra-empirical perception which is manifested by the sole fact of living.’ CN: Prawda zatem jest nieskończoną i objawiającą się, jak nieskończonym i objawiającym się jest życie. ‘The truth is, then, as infinite and manifesting as infinite and manifesting is life.’
(b) anaphoric pronouns: PN: Zadawał pytania starszy z oficerów, porucznik. ‘The older of the officers, the lieutenant, was asking the questions.’ CN: Jego ciemna twarz sportowca o rysach twardych i nieregularnych wyrażała chłód i pogardę. ‘His dark face of a sportsman, hard and irregular, expressed coolness and scorn.’
(c) verbal constructs, referring to the antecedent: PN: Dziewczyna zaśpiewała. ‘The girl sang.’ CN: Podobało się. ‘They liked it.’
(d) parts of the sentence (attributes, modifiers, complements): PN: Z seminarium duchownego idą klerycy. ‘Clerics are coming from the seminary.’ CN: Na spacer. ‘For a walk.’ CN: Po obiedzie. ‘After dinner.’
(e) interrogative pronouns: PN: Kto przyszedł? ‘Who’s that?’ CN: Piotr.
2. lexical:
(a) expressions with incomplete meaning: PN: Na wszystkie pytania leśniczy rudawickich lasów odpowiadał jednakowo. ‘The forester from Rudawice answered all questions in one way.’ CN: Broń, którą nieopodal. . . ‘The guns near the. . . ’ (with the intention of referring to the act of answering)
(b) synonyms
(c) corresponding expressions (po wtóre ‘secondly’ – po pierwsze ‘first’, naprzód ‘first’ – potem ‘then’ – w końcu ‘at last’)
(d) interrogative uninflected pronoun or particle: PN: Kiedy wyjeżdżasz? ‘When are you leaving?’ CN: Jutro. ‘Tomorrow.’
3. thematical: PN: Pójdziesz na koncert? ‘Are you coming to the concert?’ CN: Nie wiem. ‘I don’t know.’

and indicates two functions of the anaphor:
1. attachment: when both parts are independent and the second one extends the previous one, but preserves its integrity,
2. inclusion: when understandability of the anaphor requires the presence of the antecedent.


The paper provides the first and comprehensive model of anaphoric linking in Polish, being part of the general model of ‘text grammar’, the topic developed in modern times as discourse structure modelling, and further developed in a general book on Polish syntax by Klemensiewicz (1968).²

2.1.2 Topolińska

Topolińska (1976) uses the term wyznaczoność ‘designation’ to denote the ‘referential characteristics’ of nominal groups, i.e. reference to their designata, and provides an in-depth analysis of different types of relations formed by Polish primary and derived argument expressions. Topolińska (1977) further extends the topic by analysing differences between anaphora and coreference using the notions of connotation and denotation, assuming that anaphora is a relation based on the intersection of features connoted by the two markables in question (irrespective of their denotation) while coreference assumes (or is defined by) compatibility of denotation. A complete definition of anaphoric relations, already mentioned in the previous chapter, is offered by another work by Topolińska (1984). Two interesting topics which are discussed by the author are the influence of idiolectal characteristics on the impression of semantic difference, as in:

(2.1)

Pani A: Pokazały się ostatnio śliczne fajansowe kubki do mleka. ‘Mrs A: Recently there’ve been lovely faience mugs for milk.’ Pani B: Ach, takie filiżanki w kwiatki? Tak, widziałam gdzieś na wystawie. ‘Mrs B: Ah, the flowery cups? Yes, I saw them somewhere in the shop window.’

Example (2.1) shows that, when constructing an anaphor, apart from synonym or hyponym substitution, a more vague ‘closeness of meaning’ can also be used in strict identity-of-reference relations. Example (2.2) shows an even more dramatic distinction: (2.2)

Pani A: Włożę dziś tę szarą płócienną sukienkę. ‘Mrs A: Today I’ll put on that grey linen dress.’ Pani B: Ach, tę zieloną? ‘Mrs B: Ah, the green one?’

with two objects having the same reference, and the difference motivated by a divergence of perception between the speaker and the receiver.

2 First edition in 1953, still listing thematical indicator of reference, removed in subsequent editions.

2.1.3 Pasek

A modern synthetic approach to anaphora can be found in (Pasek, 1991). The work reviews the problems with theoretical description of anaphora, stating that correct anaphora resolution requires:
– semantic knowledge (categories of objects which can become arguments of a given predicate type), as in Example (2.3) (the tables can be sloping and pencils are inclined to fall)
– psychological knowledge (topic of the sentence), as in Example (2.4) (it is rare to call oneself using a pejorative term)
– awareness of the universally accepted norms, ways of perceiving certain situations and understanding of human behaviour etc., as in Examples (2.5) and (2.6) (defeating someone implies better play; telling somebody off results from bad behaviour).

(2.3) Położyłem ołówek na stole, ale Ø był pochyły i Ø się zsunął. ‘I put the pencil on the table, but it was sloping and it fell.’
(2.4) Jan powiedział Piotrowi, że Ø jest łobuzem. ‘Jan told Piotr that he is a rascal.’
(2.5) Maria pokonała Annę, ponieważ Ø lepiej grała. ‘Maria beat Anna because she played better.’
(2.6) Maria zbeształa Annę, ponieważ Ø postąpiła lekkomyślnie. ‘Maria told Anna off because she acted recklessly.’

2.1.4 Fall Another comprehensive introduction to the problem is offered by Fall (1994). The author investigates the influence of anaphora on the discourse model and presents a rough typology of the phenomenon with numerous examples in Polish. The factors influencing interpretation of anaphora comprise: inflectional compatibility, syntactic constraints, theme/rheme pair, immutability of roles, semantic constraints, recency and change of scene, implicit causality, hypothetical situations and sentence stress.

2.2 Text coherence, cohesion and intra-document linking

2.2.1 Pisarkowa

Pisarkowa (1969) follows the work of Klemensiewicz by investigating distribution of pronouns in Polish utterances and analysing inter-sentential functions of a pronoun (consolidating or dividing the text). On the textual level, anaphora is treated as


a means of reintegration of the series of sentences, following the semantic convention accepted by the receiver of the text. The lack of such linguistic means as anaphoric pronouns can lead to difficulties in maintaining the inter-utterance relation. Pisarkowa also shows an interesting distributional difference between the use of pronouns and their nominal counterparts in a text: when, particularly in longer passages which naturally contain a larger number of nominal elements, the traditional means of identity disambiguation (called “flexion abilities”) fail to work, full nominal denominations need to be repeated for clarity. What is more, such repetition requires neutralization of the full mark by addition of a demonstrative pronoun (such as ten, ta, to ‘this’) or by use of a synonym to denote that the designate is not “new”, but already known from the context.

2.2.2 Bellert Bellert (1971) introduces the notion of language indices (presently: mentions) as a common name for textual connectors such as nouns, nominal groups, named entities, personal, relative or reflexive pronouns or certain adverbs (here, there) serving as referential expressions in discourse.

2.2.3 Wajszczuk Wajszczuk (1978) treats binding as a concept deriving from a much broader theory of textual coherence, and investigates relevance and entailment of subsequent utterances in the process of development of consistent discourse. Binding text fragments by means of anaphoric relations is treated analogically to creating compound sentences by using conjunctions.

2.2.4 Marciszewski Marciszewski (1983) contrasts structural integrity (cohesion) of a text with its semantic integrity (coherence), and points out that the abundance of anaphoric links and presence of thematic continuation are insufficient to recognize a text as coherent, which he illustrates with the following example: Powyższy wywód jest przyczynkiem do następnego rozważenia w/w tematu, a mianowicie każdy człon kosmologiczny jest skończony i jego poznanie musi mieć stuprocentowy wynik wyczerpania w jego masie. (Marciszewski, 1983, p. 187) ‘The above elaboration contributes to a following take on the topic, that is, each cosmological element is finite and needs to have a 100 per cent mass exhaustion result.’

2.2.5 Stroińska, Szkudlarek and Trofimiec

Referential cohesion in scientific discourse has been investigated e.g. by Stroińska (1992), in Polish short-story writing by Szkudlarek-Śmiechowska (2003) and in Polish journalistic texts by Trofimiec (2007) (analysing anaphoric expressions with demonstrative pronoun ten ‘this’).

2.2.6 Pisarek In one of the most recent works, Pisarek (2012) investigates linguistic mechanisms of intra-document linking in a specific text genre – feature articles based on a limited number of texts from one of the Polish weekly magazines. The author differentiates inter-sentence relations from extra-document relations, analyses selected anaphoric mechanisms (pronoun anaphora, repetitions and quasi-anaphora) and confronts them with related phenomena: text deixis, context-aware syntactic ellipsis and time expressions.

2.3 Reference and anaphora in text understanding

2.3.1 Grzegorczykowa

The function of anaphora in the process of text understanding and anaphorisation techniques are analysed e.g. by Grzegorczykowa (1996). The text goes beyond the classical understanding of anaphora (i.e. referencing notions previously mentioned in the discourse) towards an as yet unnamed, coreference-like definition. Apart from information actually contained in the previous portion of an analysed text, it points to other sources of information responsible for the creation of anaphoric or quasi-anaphoric links, such as common knowledge of speaker and recipient (Przyszła do nas wczoraj Basia. Moja siostra stęskniła się za nami. ‘Basia paid us a visit yesterday. My sister missed us a lot.’), or general knowledge (Popsuła się nam lodówka. Trudno dziś o naprawę sprzętów gospodarstwa domowego. ‘Our fridge broke down. It is hard to repair houseware today.’), relating the task of text decoding to the application of the holistic conceptual system. Major anaphoricity techniques (such as pronominalization, repetition, synonymization etc.) are supplemented by using lexemes with embedded anaphoricity information which either require the presence of certain information or presuppose it. Examples of such lexemes are particles presupposing some previous facts or events (wreszcie ‘finally’, dopiero ‘not until’ etc.), adjectives (podobny ‘similar’, inny ‘other’, ten sam ‘the same’), numerals: oba, obie ‘both’ and derived adjectives: obopólny ‘mutual’, obustronny ‘bilateral’, verbs: przeprosić ‘to apologise’ (by using such a verb the speaker assumes that one did some wrong to another person etc.), nouns:


sąsiad ‘neighbour’, kolega ‘colleague’, przyjaciel ‘friend’ – they require reference to a non-subject entity in contrast to swój ‘oneself (his, her)’: The limitation of the approaches is explained by complexity of Polish reflexive pronouns (omitted from the current work), particularly swój, referring to the subject of the sentence, as in Jan kocha swoją pracę ‘Jan loves his job.’. In case of multiple subjects, as in Jan prosił Marię o sprzątnięcie swojego pokoju. ‘Jan asked Maria to clean his/her room.’, ambiguity makes it impossible to decode, in contrast to Jan prosił Marię o sprzątnięcie własnego pokoju. ‘Jan asked Maria to clean her own room.’, since własny refers to ‘the nearest’ subject.

The text also covers an interesting analysis of several Polish lexemes containing embedded anaphoric properties, such as the adjectives inny ‘other’ or cudzy ‘somebody else’s’, which unlink the following noun group from the subject: Jan przyjechał cudzym samochodem. ‘Jan arrived in somebody else’s car.’ (in contrast to e.g. ten sam ‘the same’).

2.3.2 Gajda

Gajda (1990) shows that the density of referential expressions varies across different genres of text, with scientific publications featuring a dense network of anaphoric relations and a much lower share found in literary texts. This obviously correlates with the advanced nominalisation of scientific texts, with the indicator of nominality (number of nouns divided by number of verbs used in the text) amounting to 4.2 for sciences, 3.3 for the arts, 1.1 for fiction and 0.8 for spoken texts (Gajda, 1982). The most common type of anaphoric links is lexical repetition, which the scholar explains by its high binding power and precision supporting correct understanding of the text.
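The indicator of nominality mentioned above is a simple ratio and can be computed directly from part-of-speech counts; the sketch below is only an illustration and assumes a pre-tagged text with illustrative tag names ("NOUN", "VERB"):

```python
# A rough sketch of the nominality indicator (nouns per verb), assuming the text
# is already POS-tagged; the tag names and the sample tokens are illustrative only.
def nominality(tagged_tokens):
    nouns = sum(1 for _, tag in tagged_tokens if tag == "NOUN")
    verbs = sum(1 for _, tag in tagged_tokens if tag == "VERB")
    return nouns / verbs if verbs else float("inf")

sample = [("mity", "NOUN"), ("są", "VERB"), ("narzędziem", "NOUN"),
          ("dla", "PREP"), ("psychologa", "NOUN")]
print(nominality(sample))  # 3.0 for this toy fragment
```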

2.4 Text genres and language stylistics 2.4.1 Szwedek and Duszak Szwedek (1975) and later Duszak (1986) investigate the relation of the English article to word order in Polish and analyse the role of word order in anaphoric processes. The authors state that coreferential properties of a noun depend on its position in the sentence, cf. Example (2.7): (2.7)

Na ulicy zorientowałam się, że nie mam torebki, więc wróciłam do sklepu. ‘In the street, I realised I didn’t have my bag on me, so I went back to the shop.’ a. Torebka leżała na ladzie. ‘The bag lay on the counter.’ b. Na ladzie leżała torebka. ‘On the counter lay the bag.’

Word order seems to imitate the function of an article and is of grave importance in all anaphoric processes, as it binds coreferentiality with sentence stress. It is observed that nouns with indefinite interpretation appear in the final position in the sentence while definite nouns can appear in positions other than final.

2.4.2 Honowska Honowska (1984) points out the difference between inter- and intra-sentential pronominal coreference, contrasting reflexive się ‘oneself’ and anaphoric go ‘him’; out of which only the second one is able to cross the clause border.

2.4.3 Fontański Fontański (1986) investigates the conditions adjectival anaphoric pronouns appear in texts with respect to two specifically defined text variants: evocative (illusion-forming, communicative, such as action short stories and stage direction in plays) and nonevocative (simple narrative). In evocative expressions, the anaphoric demonstrative adjectives ten ‘this’, tamten ‘that’, ów ‘such’ are used much less frequently than in straightforward narration.

2.4.4 Dobrzyńska Dobrzyńska (1996) presents coreference as a notion used in text stylistics; poverty of chains of coreferential expressions proves, in most cases, stylistic incapacity. The scholar points out that overuse of pronouns with indefinite reference can lead to unification of different discourse world objects which results in a fallacious disambiguation.

2.5 Anaphora and first-order logic

2.5.1 Dunin-Kęplicz

Dunin-Kęplicz (1989)³ makes a bridge between discourse and formal methods, based on the meta-text character of pronouns, which leads to an analogy between some uses of pronouns in discourse and variables in logical formulae, started with (Montague, 1973) and (Kamp, 1984).

3 See also (Dunin-Kęplicz, 1983, 1984, 1985).


pod(‘Jan’, ‘Jan poszedł do niego’)
uzup(‘niego’, ‘Jan poszedł do niego’)
NP(‘Jan’)
Z(‘niego’)
pod(x, v) ∧ uzup(y, v) ∧ Z(y) ⊃ NC(x, y)

Fig. 2.1. A sample first-order logic theory for coreference resolution

Dunin-Kęplicz builds coreference rules by expressing the flow of information with logical methods. Discourse is bound with a first-order logic theory describing syntactic relations between textual fragments. Such formalization of anaphoric mechanisms in logic reduces the problem of (partial) anaphora resolution to first-order logic proofs. The ambiguities of reference are marked by the presence of dominating and potential coreference, which cannot be expressed with classic logic due to the incomplete information available. This problem is resolved by successive approximations and application of non-monotonic or defeasible reasoning techniques. Coreferentiality of two expressions is motivated by their number-gender and semantic agreement (semantic types such as abstractness/concreteness, animacy, collectivity etc. – or their combinations). The formal description of rules covers a definition of predicates denoting the syntactic realisation of the expression, a number of auxiliary predicates (representing linear order of sentences and phrases) as well as referential and syntactic axioms (availability of syntactic information is assumed). A series of rules for simple and compound sentences is defined and the coreferential discourse structure is reconstructed in logic. Coreference resolution is then realised by reasoning investigating whether the clause representing the relation between a pronoun and its antecedent is a theorem of the formal theory formulated in first-order logic. Figure 2.1 shows a simple example of the theory definition for the sentence Jan poszedł do niego. ‘Jan went to him.’:
1. predicates defining Jan as the subject of the sentence Jan poszedł do niego and niego as a complement of the sentence,
2. predicates defining Jan as a noun phrase and niego as a non-reflexive pronoun,
3. a referential axiom stating that when x is the subject of a sentence v, y is a complement of the verb in v, and y is a non-reflexive pronoun, then x and y are non-coreferential.
Figure 2.2 shows an example of a linear resolution process from which it can be inferred that the set of formulae from Figure 2.1 implies NC(‘Jan’, ‘niego’) (their non-coreferentiality). In computational linguistics such an approach directly implies its application in declarative programming languages such as Prolog.

¬NC(‘Jan’, ‘niego’)　　(negated goal)
¬pod(x, z) ∨ ¬uzup(y, z) ∨ ¬Z(y) ∨ NC(x, y)　　(referential axiom in clausal form)
¬pod(‘Jan’, z) ∨ ¬uzup(‘niego’, z) ∨ ¬Z(‘niego’)　　(resolvent of the two clauses above)
pod(‘Jan’, S1)　　(fact, where S1 stands for the sentence Jan poszedł do niego)
¬uzup(‘niego’, S1) ∨ ¬Z(‘niego’)　　(resolvent)
uzup(‘niego’, S1)　　(fact)
¬Z(‘niego’)　　(resolvent)
Z(‘niego’)　　(fact)
□　　(empty clause – contradiction, hence NC(‘Jan’, ‘niego’) follows)

Fig. 2.2. Reasoning for coreference resolution

By representation of world knowledge in formal logic this approach can also unify linguistic and extra-linguistic knowledge in one system.
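As an illustration of the general idea (and not of Dunin-Kęplicz’s actual implementation), the referential axiom from Figure 2.1 can be checked mechanically once the syntactic facts for a sentence are given; in the sketch below the facts are hard-coded for the example sentence and the predicate names follow the figure (pod = subject, uzup = complement, Z = non-reflexive pronoun, NC = non-coreferential):

```python
# Minimal, illustrative sketch of applying the referential axiom from Fig. 2.1.
S1 = "Jan poszedł do niego"

subject_of = {("Jan", S1)}        # pod(x, v)
complement_of = {("niego", S1)}   # uzup(y, v)
nonrefl_pronoun = {"niego"}       # Z(y)

def non_coreferential(x, y):
    """NC(x, y) holds if, for some sentence v: pod(x, v), uzup(y, v) and Z(y)."""
    return y in nonrefl_pronoun and any(
        (x, v) in subject_of and (y, v) in complement_of
        for (_, v) in subject_of
    )

print(non_coreferential("Jan", "niego"))  # True – NC('Jan', 'niego') is derivable
```

In a declarative language such as Prolog the same axiom would simply be stated as a clause and the query answered by resolution, as in Figure 2.2.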

2.5.2 Studnicki, Polanowska, Fall and Puczyłowski Another series of similar approaches is presented by Studnicki and Polanowska (1983, application of the research to legal discourse), Fall (1988, 2001, analysis of anaphora in predicate logic) and Puczyłowski (2003, investigation of the problem of exchangeability of coreferential expressions in belief sentences).

2.6 Application of formal binding theories to Polish

2.6.1 Reinders-Machowska

The literature covers a series of papers applying modern formal theories, such as the Chomskian Binding Theory (Chomsky, 1980, 1981), to Polish, starting with (Kardela, 1985) and (Willim, 1995). Reinders-Machowska (1991) investigates short- and long-distance anaphora binding in Polish (relations within simplex clauses and across clause boundaries). Two binding domains are discussed: the NP, relevant for pronominal binding, and the Tensed S, relevant also for anaphoric binding, with the conclusion that Polish anaphors can be bound in both the local and the extended domain.

2.6.2 Kupść and Marciniak Kupść and Marciniak (1997) capture complementary distribution of pronouns and anaphora, local subject-binding of anaphors, and breaking down clause boundaries


as binding barriers. They also investigate the problems of possessive anaphora, gerunds and negative constructs. Marciniak (2002)⁴ investigates interpretation of Polish personal pronouns and reflexive anaphors (possessive and non-possessive) within different types of phrases, formulates elements of the HPSG (Head-Driven Phrase Structure Grammar) binding theory by Pollard and Sag (1994) for Polish and describes some aspects of its implementation.

2.6.3 Trawiński Trawiński (2007) investigates non-trivial referential relations in Polish with respect to several formal frameworks such as Government and Binding (Chomsky, 1981), Lexical Functional Grammar (Bresnan, 2001), Lexicalized Tree Adjoining Grammar (Ryant and Scheffler, 2006) or Head-Driven Phrase Structure Grammar (Pollard and Sag, 1994). Assuming that coindexation indicates coreference and agreement, the author investigates problematic examples of coindexing non-agreeing expressions and proposes a HPSG implementation of semantic representation of nominal objects involving a set of individual variables as value of an attribute serving for indexation. These approaches, however, were limited to a narrow set of constructs and, as one of the authors admits, “fail to capture a wide range of Polish data that should be licensed by the binding theory”.

4 See also (Marciniak, 1999, 2001) and (Przepiórkowski et al., 2002, Chapter 6).

Part II: Coreference Annotation

Agata Savary, Maciej Ogrodniczuk

3 Related work

As largely discussed in the previous chapters, one of the main challenges in linguistics is to understand how entities of the language refer to those of the discourse world. Modelling and studying this phenomenon – as many others – is frequently based on corpus annotation. Since discourse world referents are hard to represent, instead of representing reference phenomena directly, one usually builds coreference chains between linguistic entities and considers those chains (or clusters) abstract representatives of referents.
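In practice such chains are often represented simply as clusters of mentions; the sketch below is a minimal illustration of this data structure (the mention spans and the entity labels are purely hypothetical, not taken from any corpus):

```python
from collections import defaultdict

# Coreference chains as clusters: each chain id stands in for one discourse referent.
chains = defaultdict(set)

def link(entity_id, mention):
    """Assign a mention (here: a surface form and a token span) to a coreference chain."""
    chains[entity_id].add(mention)

link("e1", ("Jan", (0, 1)))
link("e1", ("on", (7, 8)))       # a pronoun pointing to the same discourse referent
link("e2", ("Maria", (10, 11)))

# Each chain (cluster) acts as an abstract stand-in for one discourse-world referent.
for entity_id, mentions in sorted(chains.items()):
    print(entity_id, sorted(mentions))
```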

3.1 Coreference annotation abroad In this section, we present an overview of some existing approaches to coreference annotation, and to their annotation schemes in particular. We briefly describe their main assumptions, with special attention to some phenomena which are important in our own annotation schema described in Chapter 1: the scope of a mention, mention’s head markup, and the typology of coreference relations. We also cite the inter-annotator agreement reported by the authors. We focus, in particular, on those coreference annotation schemata which were applied to corpora of about 200 thousand words or more – according to Recasens and Martí (2010, p. 10). We are particularly interested in languages that show coreference-relevant morphosyntactic similarities with Polish. Slavic languages are obviously of highest importance, but Spanish, Japanese, Arabic and Chinese are also relevant, in particular due to their frequent zero subjects (cf. Section 10.4). For obvious dominance reasons in NLP, we also address recent studies dedicated to English. Osenova and Simov (2004) describe BulTreeBank, a syntactically annotated corpus of Bulgarian based on an HPSG model. Coreferential chains link nodes in HPSG trees. Each noun phrase is linked to an (extra-linguistic) index representing, roughly, the discourse-world entity. Coreference is expressed by linking several phrases to the same index. In principle, only coreferential relations which cannot be inferred from the syntactic structure are annotated explicitly, however, some inferable ones are annotated too (it is unclear which ones). Zero subjects and other elliptical elements (e.g. headwords missing due to coordination) are represented whenever they belong to coreference chains. Syntactic trees may help represent split mentions but it is uncertain if they do. Possessive pronouns are considered mentions. Three relations are encoded: identity, member-of, and subset-of. Discourse deixis is probably taken

into account. It seems that the annotation concerns coreference occurring within one sentence only. No inter-annotator agreement results are given.

Hinrichs et al. (2005) describe the annotation of the 22-thousand-sentence Tübingen treebank of German newspaper texts (TüBa-D/Z)¹ – pre-annotated with morphology, syntax and semantics – with a set of 7 coreference relations (coreferential, anaphoric, cataphoric, bound, split antecedent, instance, and expletive). These are not equivalence relations: they are non-symmetric and mostly non-transitive, thus they do not divide the set of referents into disjoint clusters. For instance, the split antecedent relation holds between a plural expression and a mention of a single member. E.g. in ‘John and Mary were there. . . Both. . . ’, John and both, as well as Mary and both are coreferential but John and Mary are not. Potential markables are definite NGs, personal pronouns, relative, reflexive, and reciprocal pronouns, demonstrative, indefinite and possessive pronouns. All of them correspond to nodes of the already existing parse trees resulting from prior syntactic annotation. Unlike in our approach, predicative nominal groups (NGs) seem to be considered coreferential with subjects. Zero subjects are not an issue in German. Neither dominant expressions nor semantic heads are mentioned.

The series of ACE (Automatic Content Extraction) campaigns has been carried out from 1999 to 2008 for a varying number of languages, including Arabic, Chinese, English, and Spanish. It was meant to boost the development of automatic detection and characterization of meaning conveyed by human texts. The ACE-2007 annotation guidelines for Spanish (LDC, 2006) give the rules of annotating and disambiguating entities. The entity typology is rather fine-grained: it consists of 7 main types (person, organization, geopolitical entity, etc.) and several dozen subtypes (individual, group, governmental, commercial, etc.). Two coreference relations are considered: identity and apposition. The former is further subdivided into generic and non-generic. Mentions are NGs (including attached prepositional phrases and relative clauses) and can be nested (the president of Ford). Each NG should have its (syntactic) head marked (the syntactic head can be a multi-word unit, and then its last token is marked). Semantic heads different from syntactic ones are not an issue. The problem of the Spanish zero subject is not mentioned.

Magnini et al. (2006) address the construction of a 182-thousand word Italian Content Annotation Bank of newspaper texts (I-CAB). Mentions are NGs, possibly containing modifiers, prepositional complements or subordinate clauses, representing entities (persons, organizations, etc.) or temporal expressions. ACE-2003 annotation guidelines for English are adopted and extended, to cover notably clitics contiguous with verbs (vederlo) and coordinated expressions (John and Mary). The paper promises future annotation of relations between entities but it is unclear if this task has been performed.

1 The project’s webpage accessed on 22 May 2014 indicates the current size of the corpus equal to 1.5M words and 85,000 sentences.


Pradhan et al. (2007) describe OntoNotes, a system of multi-layered annotated corpora in English, Chinese and Arabic. It is supposed to make up for the drawbacks of previous annotation schemata, mainly MUC and ACE, in that the coreference annotation is not restricted to NGs and a larger set of entity types is considered. The English corpus consists of 300-thousand word newspaper texts, later completed by broadcast conversation data (Weischedel et al., 2010). All data have been previously annotated for syntax, verbal arguments and word senses (by linking words to nodes of an external ontology). Thus, mention candidates correspond to nodes of pre-existing syntax trees. As in ACE, two coreference relations are considered: identity and apposition. The main mention candidates are specific NGs, pronouns (they, their) and single-word verbs coreferent with noun phrases (e.g. the sales rose by 18% ← the strong growth). Expletive (it rains), pleonastic (there are) and generic (you need) pronouns are not marked. Generic, underspecified or abstract entities are only partly considered by identity: those cannot be interlinked among themselves, even if they can be linked with referring pronouns (parents ← they). Nested structures are generally marked but exceptions occur in dates (e.g. no subphrase of Nov. 2, 1999 is coreferent with November). Only intra-document coreference is annotated, thus dominant expressions (motivated in our approach notably by future inter-document coreference annotation) are not an issue. Zero subjects are addressed with respect to pro-drop Arabic and Chinese pronouns (Weischedel et al., 2010). Since such pronouns are materialized in parse trees as separate nodes, their inclusion in coreference chains is straightforward. The existence of semantic heads different from syntactic ones is not mentioned.

Iida et al. (2007) present NAIST, a 38-thousand sentence Japanese corpus annotated for coreference and predicate-argument relations (including nominal predicates relating to events). They consider identity-of-reference relations for the former, and both identity-of-reference and identity-of-sense relations for the latter. They pay special attention to zero anaphora, whose role – not only as a subject but as an object or a complement as well – is particularly visible when coreference and predicates’ arguments are annotated jointly. Namely, the frames for elided arguments have to be filled out with antecedents appearing in other sentences than the predicate itself. The reported inter-annotator agreement for coreference annotation is 0.893 for recall and 0.831 for precision. No mention of dominant expressions or semantic heads is made.

Poesio and Artstein (2008) present annotation efforts for a 94,000-word English corpus. Special attention is paid to two difficult phenomena: discourse-inherent coreference ambiguity and discourse deixis. The former yields an annotation scheme in which coreference is not an equivalence relation (one mention can appear in several chains). All nominal groups are considered mentions, but some are later marked as non-coreferential. A limited set of bridging relations is taken into account. Problems related to zero subjects, nested, split and attributive NGs, as well as semantic heads, are not discussed and are probably not addressed in the annotation scheme.

Hendrickx et al. (2008) describe a 200-thousand word coreference-annotated corpus of Dutch newspaper texts, transcribed speech and medical encyclopedia entries.

Its annotation schema is largely based on the MUC-7 annotation guidelines for English². Annotation focuses mainly on identity relations between NGs but other non-equivalence relations are also introduced: bound relations (everybody ← they), bridging relations (e.g. superset–subset or group–member), and predicative relations (e.g. John is a painter). Syntactic heads are pointed at but semantic heads different from syntactic ones do not seem to be an issue (e.g. tons is the head of 200,000 tons of sugar). The ideas of dominant expressions and zero subjects are not present. Predicative NGs and appositions are considered mentions coreferent with their subjects. Discontinuous NGs are taken into account. The inter-annotator agreement, measured as the MUC-like F-score, is 0.76 for identity relations, 0.33 for bridging relations and 0.56 for predicative relations.

Nedoluzhko et al. (2009) extend the coreference annotation in the Prague Dependency Treebank of Czech, a language rather close to Polish. It builds on previously constructed annotation layers including the so-called tectogrammatical layer, which provides ready mention candidates and (probably) their semantic heads. Mentions include nominal phrases and coreferential clauses (discourse deixis). Nested groups are delimited except in named entities (where only embedded groups which are NEs themselves are marked). Attributive phrases are not considered uniformly: appositions are marked as mentions even if they are never included in coreference chains, while predicate nominals are not considered at all. The notable contribution of this approach is addressing a wide range of bridging relations between nominals. The relatively low scores of the inter-annotator agreement might be evidence that coreference annotation is particularly difficult in Slavic languages.

Recasens et al. (2010a) describe coreference annotation in AnCora-CO, a 400K-word corpus of Spanish and Catalan, for which other annotation layers had previously been provided, including syntax. Thus, possible candidates for mentions had already been delimited. The annotation schema is rather complete. Three types of relations are considered: identity, predicative link and discourse deixis. Zero subjects are marked, clitic pronouns which get attached to the verb are delimited, embedded and discontinuous phrases are taken into account, and referential NGs are distinguished from attributive ones. Bridging references are not considered.

Korzen and Buch-Kromann (2011) address the anaphoric relations in a parallel, 5-language Copenhagen Dependency Treebank, in which unified annotation of morphology, syntax, discourse and anaphora is performed. It consists of a 100,000-word Danish corpus with its translations into English, German, Italian and Spanish. Possible specificities of mention detection are not addressed, however, relation typology is extensively discussed. Both coreference and bridging relations (called associative anaphora) are considered. The former are split into 6 categories, according to linguistic techniques used to express coreference, including discourse deixis. The latter count

2 See http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/co_task.html.


The inter-annotator agreement (expressed in percentages, i.e. not accounting for agreement by chance) varies widely among the relation types.

Muzerelle et al. (2013) offer a 435K-word French corpus of spontaneous speech annotated for anaphoric and coreferential relations. Mentions comprise nominal and pronominal phrases, possibly embedded. Identity relations are subcategorized according to lexical criteria (same/different heads or pronominal anaphora), and untyped associative (bridging) relations are marked.

Table 3.1 shows a contrastive study of some coreference annotation schemata and of our approach, whose in-depth description is presented in the following chapters.

3.2 Polish anaphora and coreference annotation

Several efforts towards annotating anaphora and coreference in Polish preceded or were carried out in parallel with the creation of the Polish Coreference Corpus. In this section we provide an overview of these efforts and a brief contrastive analysis.

3.2.1 LUNA-related work

A notable systematic approach to the annotation of anaphora in Polish was proposed for the LUNA European project³ (Marciniak, 2010). The data comprise 500 dialogues in the domain of Warsaw public transportation. They were collected at the Warsaw Transport Authority (Zarząd Transportu Miejskiego) information center, where operators of the call center provided information on public transport connections, schedules, fares, etc. The corpus was manually annotated at different levels – morphological, shallow syntactic (with chunks) and semantic. The semantic description includes:
– a new CityTransport ontology containing approximately 200 concepts related to different means of transportation (buses, trams, local trains and metro), their routes and transport-relevant urban topology concepts (stops, important buildings, street names)
– FrameNet-based predicate annotation
– anaphoric and cataphoric relations (1889 anaphors and 2051 antecedents). Only (pronominal and nominal) anaphora to pre-defined topics is indicated – mostly information about bus line numbers, bus stop names, etc., as in Example (3.1).

3 Spoken Language UNderstanding in multilinguAl communication systems, European (IST) Specific Targeted Research Project, contract number 033549, duration: 2006–2009, objective: creation of a robust natural spoken language understanding toolkit for multilingual dialogue services in French, Italian and Polish.

[Table 3.1. Contrastive analysis of coreference annotation schemata. For each approach – Osenova and Simov (2004), Hinrichs et al. (2005), LDC (2006), Magnini et al. (2006), Pradhan et al. (2007, 2012), Iida et al. (2007), Poesio and Artstein (2008), Hendrickx et al. (2008), Nedoluzhko et al. (2009), Nedoluzhko (2010), Recasens and Martí (2010), Korzen and Buch-Kromann (2011), Muzerelle et al. (2013) and our approach – the table lists the language and size of the corpus, the availability of underlying syntactic annotation, the mention scope (zero subjects, nested, split and attributive NGs), the treatment of discourse deixis, the head markup (syntactic or semantic), the annotated relation types and the reported inter-annotator agreement.]



The ann_coreference.xml file encodes coreference relations as segment elements, each representing either one identity-of-reference cluster or a single quasi-identity link between a pair of mentions. Mentions are referenced by pointer tags targeting the mention elements of the corresponding ann_mentions.xml file. Since coreferential chains can span sentences and paragraphs, no sentence-level tag is included; however, for conformance with the TEI file structure, a single paragraph element is preserved as an artificial container for the relation segments. To facilitate human reading, each segment representing a coreference cluster or a quasi-identity link carries an XML comment with its orthographic form. The relation type is distinguished by the value of the type feature – either ident or quasi-ident, denoting respectively an identity cluster or a quasi-identity link. For coreference clusters, the dominating expression is indicated by the value of the dominant feature, while for quasi-identity links, source mentions are marked with a type="source" attribute. The listing below illustrates the encoding of two relations: a coreference identity cluster (containing mention_27 and mention_29) and a quasi-identity relation (between mention_35 and mention_30).
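A minimal sketch of such an encoding, together with Python code reading it, is given below. The element and attribute names (seg, fs, f, symbol, ptr) follow general TEI conventions and the description above; they are illustrative assumptions rather than the exact PCC markup, and the concrete orthographic form is invented.

    import xml.etree.ElementTree as ET

    # Illustrative fragment only: tag and attribute names are assumed from the
    # description above, not copied from the actual PCC TEI files.
    SAMPLE = """
    <p>
      <!-- identity cluster: mention_27, mention_29 -->
      <seg xml:id="coreference_1">
        <fs type="coreference">
          <f name="type"><symbol value="ident"/></f>
          <f name="dominant">Maria Kowalska</f>
        </fs>
        <ptr target="ann_mentions.xml#mention_27"/>
        <ptr target="ann_mentions.xml#mention_29"/>
      </seg>
      <!-- quasi-identity link: mention_35 (source) to mention_30 -->
      <seg xml:id="coreference_2">
        <fs type="coreference">
          <f name="type"><symbol value="quasi-ident"/></f>
        </fs>
        <ptr type="source" target="ann_mentions.xml#mention_35"/>
        <ptr target="ann_mentions.xml#mention_30"/>
      </seg>
    </p>
    """

    for seg in ET.fromstring(SAMPLE).iter("seg"):
        rel_type = seg.find("fs/f[@name='type']/symbol").get("value")
        mentions = [ptr.get("target").split("#")[1] for ptr in seg.findall("ptr")]
        print(rel_type, mentions)
    # ident ['mention_27', 'mention_29']
    # quasi-ident ['mention_35', 'mention_30']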

Please consult http://nlp.ipipan.waw.pl/TEI4NKJP/ for more detailed examples referring to other annotation layers.


8.2.2 MMAX format

Another target format, which allows PCC data to be further processed in the MMAX annotation environment, is the MMAX2-based representation, with three XML files for each source text:
– a text metadata file (with the .mmax extension)
– a text source file (_words.xml)
– a mention and relation file (_mentions.xml).
The text metadata file contains information about the text type and a reference to the text source file (e.g. 124_words.xml), as well as the identifiers of the source NKJP text and a series of identifiers of the paragraphs containing the retrieved text.



The segmentation layer contains the source text segmented into words (with generated identifiers), enriched with morphological annotation (the lemma in the base tag, the NKJP POS in ctag and additional information in msd). Sentence and paragraph segmentation is stored using the lastInSent="true" and lastInPar="true" attribute values for the last words in the sentence and in the paragraph, respectively. Additional NKJP information related to the original tokenization – the lack of a space character between a punctuation character and the previous word – is also preserved for visualisation. Please refer to the NKJP documentation for reference on the structure and meaning of the morphosyntactic data.⁴

[The listing shows the word elements of the example sentence Unijne renty strukturalne powinny przyspieszyć proces powiększania się gospodarstw. ('EU structural pensions should accelerate the process of farm enlargement.'), one element per segment, with the morphosyntactic attributes described above.]

4 Concise information on the tagset can be retrieved from the online help file of Poliqarp (Janus and Przepiórkowski, 2007), a corpus search engine used by NKJP: http://nkjp.pl/poliqarp/help/ense2.html.


The last file contains uniquely identified markables – a representation of mentions as spans of comma-separated segments (i.e. word tags from the source file), potentially discontinuous (e.g. mention markable_19 in the example below). Continuous segment sequences can be represented with a double-dot symbol (..) between the tags representing the start and end segments (inclusively). Semantic heads are indicated for each mention (a single word selected from the mention's segments). Mentions grouped in one identity coreference cluster carry the same value of the mention_group attribute (in the form set_number, treated as a technical label, so no assumption is made regarding the order or completeness of the numbers). Singletons are represented by empty mention_groups. Each cluster carries (redundantly, since this value is repeated for each mention in the cluster) a dominating expression as the value of the dominant attribute. Quasi-identity links are represented by uni-directional references to the linked mentions, specified by their XML identifiers in the quasi_identity attribute. Absence of an outgoing quasi-identity link implies an empty value of this attribute.
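A minimal sketch of how such a span expression can be expanded into individual word identifiers is shown below. The attribute set follows the description above, while the concrete identifiers (word_3, set_7, etc.) and the sample markable are invented for illustration.

    def expand_span(span: str) -> list[str]:
        """Expand an MMAX span expression into a list of word identifiers.
        A comma separates discontinuous fragments; '..' denotes an inclusive
        range of consecutive words, e.g. 'word_3..word_5,word_9'."""
        words = []
        for fragment in span.split(","):
            if ".." in fragment:
                start, end = fragment.split("..")
                prefix, lo = start.rsplit("_", 1)
                _, hi = end.rsplit("_", 1)
                words.extend(f"{prefix}_{i}" for i in range(int(lo), int(hi) + 1))
            else:
                words.append(fragment)
        return words

    # A hypothetical discontinuous markable with the attributes described above.
    markable_19 = {
        "span": "word_3..word_5,word_9",
        "head": "word_4",
        "mention_group": "set_7",
        "dominant": "renty strukturalne",
        "quasi_identity": "",     # empty: no outgoing quasi-identity link
    }
    print(expand_span(markable_19["span"]))
    # ['word_3', 'word_4', 'word_5', 'word_9']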




Please refer to the MMAX2 manual (see mmax2.net) for a detailed description of the MMAX format.


8.2.3 Brat format

brat (Stenetorp et al., 2012) is an online collaborative annotation environment which uses a simple standoff annotation format described at http://brat.nlplab.org/standoff.html. Each text in this format is represented by two files: one containing the raw text, the other one with information about mentions (marked as spans of characters in the former file) and relations between them (both coreference and quasi-identity). The file with the .ann extension contains 4 types of rows, each one being tab-separated:
1. Rows starting with the letter T, representing mentions. Example:
   T24 Mention 370 398 Jeorjos Aleksandros Mangakis
   Tab-separated fields are:
   – T24 is the mention's identifier.
   – Mention 370 398 describes the offsets of the first and last characters of the mention (in the file with the .txt extension). In case of a discontinuous mention, a semicolon appears here, separating multiple continuous spans of the mention (for example 370 398;410 420).
   – Jeorjos Aleksandros Mangakis is the text form of the mention.
2. Rows starting with the letter R, marking quasi-identity links. Example:
   R4 Quasi Arg1:T337 Arg2:T628
   Tab-separated fields are:
   – R4 – the link identifier.
   – Quasi Arg1:T337 Arg2:T628 – describes the source (mention with id T337) and the target (mention with id T628) of the link.
3. Rows starting with the character *, marking non-singleton clusters. Example:
   * Coref T361 T362 T451
   Tab-separated fields are:
   – * – the identity relation symbol.
   – Coref T361 T362 T451 – T361, T362 and T451 are the identifiers of mentions in a coreference cluster. There may be any number of mentions in a cluster (but at least two).
4. Rows starting with the character #, marking heads of mentions or their dominant expressions (the latter where applicable, i.e. for non-singleton mentions). Examples:
   #2 empty T2 Head: "braci"
   #3 empty T2 Dominant: "7 tysięcy braci z całej Europy"
   Tab-separated fields are:
   – #2 or #3 – the identifier of the annotation.
   – empty T2 – T2 is the identifier of the mention whose head or dominant expression is described.
   – Head: "braci" or Dominant: "7 tysięcy braci z całej Europy" – the head of the mention or its dominant expression, in quotes.
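The format is simple enough to read with a few lines of code. The sketch below is a minimal reader for the four row types described above; it is an illustration based on that description, not part of the brat distribution, and file and variable names are assumptions.

    from collections import defaultdict

    def read_ann(path):
        """Minimal reader for the brat standoff (.ann) rows described above."""
        mentions, quasi_links, clusters, notes = {}, [], [], defaultdict(dict)
        with open(path, encoding="utf-8") as ann_file:
            for line in ann_file:
                line = line.rstrip("\n")
                if not line:
                    continue
                fields = line.split("\t")
                ident = fields[0]
                if ident.startswith("T"):      # mention: id, "Mention start end[;start end]", text
                    _, offsets = fields[1].split(" ", 1)
                    spans = [tuple(map(int, part.split())) for part in offsets.split(";")]
                    mentions[ident] = {"spans": spans, "text": fields[2]}
                elif ident.startswith("R"):    # quasi-identity link: "Quasi Arg1:Tx Arg2:Ty"
                    _, arg1, arg2 = fields[1].split()
                    quasi_links.append((arg1.split(":")[1], arg2.split(":")[1]))
                elif ident == "*":             # identity cluster: "Coref T1 T2 ..."
                    clusters.append(fields[1].split()[1:])
                elif ident.startswith("#"):    # head or dominant expression note
                    target = fields[1].split()[1]
                    key, _, value = fields[2].partition(": ")
                    notes[target][key] = value.strip('"')
        return mentions, quasi_links, clusters, notes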

8.2.4 Brat visualization

The whole corpus is visualized in a modified version of the brat annotation tool (visualization tweaks⁵ were needed for the readability of long coreference chains). The visualization is available at http://core.ipipan.waw.pl/pcc/browse; an example text is presented in Figure 8.2. Corpus texts are stored in a directory structure, allowing for easy navigation to a text with a particular identifier. When a text is opened, its content and manual annotations are presented on the screen. Placing the mouse cursor over an annotation shows further information about it. The only visible text annotations are mentions (yellow tags over text spans, with a "Mention", "Men" or "M" label). Hovering the mouse over a mention highlights its span in green. A popup also shows a note describing the head of the mention and its dominating expression (if it is not a singleton). All other mentions in the text that are in the same identity cluster as that mention are also highlighted in green, while mentions in the quasi-identity relation with it are highlighted in purple. More information is available in a mini-tutorial for basic usage, which is presented upon the first visit to the page and may be accessed at any time by clicking the "Corpus tutorial" button in the top right corner of the page.

Fig. 8.2. brat corpus visualization

5 Modified version available at http://zil.ipipan.waw.pl/brat4core.


8.3 Corpus statistics⁶

The analysis of the statistical properties of linguistic constructs in PCC can help assess how the frequency of certain linguistic phenomena compares to the effort put into their processing and the final success. As some of them are not yet considered in the automatic processing of Polish coreference, proving their high frequency can bring new motivation to the development of their resolution techniques. Other important questions we would like to answer concern the inter-annotator agreement of different linguistic phenomena and the current efficiency of the automatic methods used to discover them. This could help us estimate whether their careful annotation brings real value to the task and whether the task is sufficiently clear.

Note (cf. Section 8.1) that PCC features 1773 short texts (ST) and 21 long texts (LT), with 503,981 and 36,234 segments in total, respectively. Since the size and nature of both resources are different, we present their statistics comparatively. The average counts of the corpus building blocks are presented in Table 8.3. The average complexity of paragraphs and sentences in the long texts, stemming from newspapers, exceeds that in the balanced subcorpus of short texts.

Table 8.3. Average counts of units in PCC

Indicator                   Short texts   Long texts
Paragraphs per text                7.38        30.33
Sentences per paragraph            2.38         3.13
Segments per sentence             16.19        18.15
Segments per text                284.37      1725.43

Since long texts are complete documents rather than randomly extracted fragments thereof, it is interesting to see how the annotated data distribute over different text genres. As shown in Table 8.4, economic texts and domestic news have, respectively, the longest and the shortest paragraphs, as well as the largest and the smallest number of mentions per paragraph. The most intriguing finding from these statistics is the difference in the average count of singleton mentions across text types, ranging from 1.49 in legal texts to 14.16 in the economics section. A possible reason might be that a legal text usually concerns a relatively constrained topic at a time, and needs to refer, in a very precise manner, to well-defined notions touched upon in that document or other texts. Economic texts, on the other hand, frequently give an account of global phenomena and relate to many world entities such as locations, organizations, dates, measures, etc.

6 This section is an extended version of (Ogrodniczuk et al., 2014b).

Table 8.4. Average counts of basic properties by text types in long texts
(Sentences, Segments, Mentions, Singletons and Non-singleton clusters are given per paragraph.)

Domain               Paragraphs   Sentences   Segments   Mentions   Singletons   Non-singleton
                     per text                                                    clusters
Journalism              36.00        3.97       65.54      22.15       12.05        2.14
Law                     31.67        2.93       62.26      22.98        1.49        1.98
Economics               27.67        3.24       70.40      23.23       14.16        2.57
Domestic news           43.67        2.43       39.48      13.86        6.24        1.56
Sport                   24.33        3.88       59.23      19.42       11.19        1.55
Culture                 23.33        2.96       58.76      20.79       11.91        2.31
Science and tech.       25.67        2.75       49.21      17.78       10.40        1.92
Any                     30.33        3.13       56.88      19.72       11.25        1.98

8.3.1 Mentions

The total number of mentions identified in PCC is 167,679 in ST and 12,562 in LT, which amounts to 5.39 and 6.29 mentions per sentence on average, respectively. These numbers seem high, taking into account that sentences contain 16–18 segments on average, but can be explained by our strategy of annotating nested mentions (cf. Section 5.3.1). Maximisation of mention boundaries in PCC, e.g. by including relative clauses in the mention (according to the 'precise reference' principle – cf. 'the astronaut' vs. 'the astronaut who stayed in the command module while Armstrong and Aldrin walked on the Moon', cf. Section 5.2.2), resulted in large mention sizes (with the longest mention containing as many as 147 segments). Nevertheless, over 87% of the mentions consist of 5 segments or less. This can be explained by the fact that even potentially lengthy definitions are constrained so that texts remain understandable, and 5 segments seem enough to convey the complete nature of a mention. The average mention size is 2.67 and 3.03 segments in ST and LT, respectively (for singleton mentions: 3.19 and 3.64, for non-singletons: 1.85 and 2.20). Table 8.5 shows the ST–LT distribution of mentions of different lengths. Understandably, whatever the length (except for lengths 31–40), singleton mentions are relatively more frequent in ST than in LT. Interestingly enough, the proportion of one-segment mentions is over 9% higher in ST (49.60%) than in LT (40.83%), while, conversely, mentions of length greater than 1 are less frequent in ST than in LT.

Table 8.5. Mention length in short and long texts

Mention length    Count                 % of all mentions    % singleton
in segments       ST         LT         ST        LT         ST        LT
1                 83,161     5129       49.60     40.83      44.70     39.15
2                 36,946     3058       22.03     24.34      72.15     62.23
3                 15,957     1406        9.52     11.19      78.06     69.91
4                   8819      818        5.26      6.51      81.22     74.08
5                   5901      582        3.52      4.63      82.19     76.98
6                   3802      384        2.27      2.77      81.85     80.75
7                   2693      261        1.61      2.08      83.74     76.25
8                   2077      190        1.24      1.51      83.15     82.11
9                   1573      152        0.94      1.21      81.12     77.63
10                  1238      121        0.74      0.96      81.91     78.51
11–20               4479      400        2.67      3.18      81.56     72.75
21–30                736       69        0.44      0.55      78.26     78.26
31–40                179       16        0.10      0.13      77.65     93.75
41–50                 57        6        0.04      0.05      84.21     83.33
51–99                 57        6        0.04      0.05      85.96     83.33
100–147                4        0        0.00      n/a      100.00     n/a
Any              167,679   12,562      100.00    100.00      60.93     57.05

8.3.1.1 Pronouns and zero subjects

Two particularly interesting types of mentions are pronouns and zero subjects. The former are prototypical anaphoric techniques, while the latter are segments marked at finite verbs⁷ in sentences with no overt subject. As shown in Table 8.6, they account for about 14% of all mentions in ST. In LT this proportion is much lower – 8.5% – again, probably due to the nature of the texts (newspaper articles).

Table 8.6. Singleton vs. non-singleton mentions in short and long texts

                           Count                  % singleton
                           ST          LT         ST       LT
Mentions                   167,679     12,562     60.93    57.05
Verbs                       50,134       3161     –        –
Verb mentions               15,397        749     7.63     5.21
Pronouns                      8794        472     –        –
Pronoun mentions              7546        317     7.38     4.10
Discontinuous mentions        1328         64     82.23    70.31

Most (over 92–96%) mentions of these types appear in non-singleton clusters, which is often due to the presence of a more specific mention in the same cluster which introduces the referent. Notable exceptions include the impersonal use of the first person plural pronoun my, nas, etc. 'we, us' (see Example (8.1)), improper verbs such as należy '(one) should', chodzi (o kogoś/coś) 'lit. (it) goes (about somebody/something) = as far as (somebody/something) is concerned'⁸ (Examples (8.1–8.2)), the impersonal use of second and third person plural verbs (Examples (8.3–8.4)), and verbs contained in titles of works (Example (8.5)).

7 Verbs are segments marked at the morphosyntactic level with one of the following classes: bedzie, fin, aglt, praet.

(8.1) Należy (…) dać nową szanse każdorazowo, gdy nas o nią proszą.
      '(lit.) Should(sing.3pers) [...] give a new chance each time we are asked for it.'
      'One should [...] give a new chance each time one is asked for it.'
(8.2) Ale jeśli chodzi o Halinę...
      '(lit.) But if goes about Halina...'
      'As far as Halina is concerned...'
(8.3) Jest to wszystko zrozumiałe, gdy weźmiemy pod uwagę, że...
      'All that is understandable if (we) take(plur.2pers) into account that...'
(8.4) Już pana wyleli?
      '(lit.) Already fired(plur.3pers) you?'
      'Have they fired you already?'
(8.5) Druga kompozycja Czesława – „Czy mnie jeszcze pamiętasz?"
      'Another song by Czesław – (lit.) "Still remember(sing.2pers) me?"'
      'Another song by Czesław – "Do you still remember me?"'

Note also that over 30% and 23% of all verbal forms in ST and LT, respectively, are marked as mentions, which confirms the importance of the zero anaphora phenomenon in Polish. Finally, discontinuous mentions, which pose a particular challenge for automatic coreference resolution, account for a rather low percentage of mentions: 0.79% in ST and 0.5% in LT.

8.3.1.2 Nested and coordinated mentions

Note (cf. Section 5.3.1) that we delimit not only the outermost maximum-length mentions but also all mentions nested therein. As shown in Table 8.7, as many as 27% and 33% of all mentions in ST and LT, respectively, contain nested mentions, and at least 2 nested mentions appear in 13%/17% of them. Nesting is more pervasive in the long texts, probably due to the presence of many long and complex constructions typical of newspaper articles.

Table 8.7. Nesting in short text and long text mentions

Mention types                        Count               % of all mentions
                                     ST        LT        ST       LT
Multi-segment non-verb               84,169    7420      50.20    59.06
With at least 1 nested mention       45,030    4203      26.85    33.46
With at least 2 nested mentions      22,405    2138      13.36    17.02

8 Improper verbs in Polish are verbs occurring only in the 3rd person singular.

Some mention pairs (676 in ST and 36 in LT) overlap partly but are not embedded in one another. They are usually coordinated mentions with factorised elements or modifiers, which – according to our annotation guidelines – are annotated separately, as in Examples (8.6–8.7).

(8.6) Mogą zazdrościć nam lasów jodłowych i bukowych.
      'They can envy us our fir and beech forests.'
(8.7) pogarszającą się sytuacją materialną i życiową lokatorów
      'deteriorating material and existential situation of the tenants'

Finally, we have noticed some intriguing cases of mention pairs in which one mention is embedded in the other, and still both belong to the same coreference cluster, as in Example (8.8) (the underlined mention is coreferential with the whole phrase).

(8.8) Ewę Nowowiejską – siostrę Adama Nowowiejskiego, syna człowieka, który go wychował i skatował za amory do córki
      'Ewa Nowowiejska – the sister of Adam Nowowiejski, son of a man, who brought him up and tortured him for (his) love to (his) daughter'

8.3.2 Coreference clusters

In PCC we mark identity of reference in its strict sense (direct reference), with an extension to so-called quasi-identity. Direct reference, which is an equivalence relation, clusters mentions referring to the same discourse-world entity. Table 8.8 shows basic cluster statistics, while Table 8.9 presents cluster sizes. Obviously, the rate of clusters per text is much higher in LT than in ST. More interestingly, their rate per paragraph and per sentence is also significantly higher in LT than in ST. This might be due to the nature of LT (newspaper texts).

Table 8.8. Basic cluster statistics in short and long texts

                               Count               Per text           Per paragraph    Per sentence
                               ST        LT        ST       LT        ST      LT       ST      LT
All clusters                   119,796   8426      67.57    401.24    9.16    13.28    3.85    4.22
Singleton clusters             102,160   7167      57.62    341.29    7.81    11.25    3.28    3.59
Ns-clusters                     17,636   1259       9.95     59.95    1.35     1.98    0.57    0.63
Single paragraph ns-clusters     8369     433       4.72     20.62    0.64     0.68    0.27    0.22

Table 8.9. Cluster sizes in short and long texts

Cluster size    Count               % of all clusters    % of all mentions in
in mentions     ST         LT       ST        LT         clusters of that size
                                                         ST         LT
1               102,160    7167     85.28     85.06      60.93      57.05
2                  9459     622      7.90      7.38      11.28       9.90
3                  3230     241      2.70      2.86       5.78       5.76
4                  1509     122      1.26      1.45       3.60       3.88
5                   909      82      0.76      0.97       2.71       3.26
6                   570      42      0.48      0.50       2.04       2.01
7                   396      28      0.33      0.33       1.65       1.56
8                   336      22      0.28      0.26       1.60       1.40
9                   204      17      0.17      0.20       1.09       1.22
10                  173      15      0.14      0.18       1.03       1.19
11–20               688      46      0.57      0.55       5.84       5.33
21–30               138      11      0.12      0.13       1.96       2.18
31–41                24       2      0.02      0.02       0.47       0.58
43–129                0       9      0         0.10       0          4.68
Any             119,796    8426    100.00    100.00     100.00     100.00

Not surprisingly, a large majority (85% in both ST and LT) of all clusters (119,848 in ST and 8426 in LT) corresponds to singleton clusters (containing one mention only). The remaining 17,630 and 1259 non-singleton clusters (referred to as ns-clusters) are mostly composed of two (54% and 49%), three (18% and 19%) or four (9% and 10%) mentions. The average cluster size is 1.40 and 1.49 in ST and LT, respectively. For non-singleton clusters, it is only 3.72 and 4.29. Less than half (8364 and 433) of the non-singleton clusters are included in single paragraphs. This validates our choice of text fragments for the PCC construction, which are larger than single paragraphs (unlike the manually annotated 1-million-word NKJP subcorpus). The longest cluster contains 41 mentions in ST (in a spoken dialogue: mostly ja 'I' pronouns and 1st person zero subjects) and 129 mentions in LT (in a biographical text: mostly variants of the proper name św. Wojciech 'Saint Wojciech (Adalbert of Prague)' and zero-subject verbs). The distribution of clusters of different sizes (in Table 8.9, column % of all clusters) is similar in ST and in LT. Also, note that while exceptionally long clusters (containing over 10 mentions) are very few (only 0.71% and 0.8% of all clusters in ST and LT), they contain a substantial percentage of mentions: 8% and 13% in ST and LT, respectively.


8.3.2.1 Agreement in clusters

Number and gender agreement between mentions, as well as identity of headwords, are among the features frequently taken into account in automatic, both rule-based and probabilistic, coreference resolution tools. Table 8.10 presents the agreement statistics for non-singleton clusters in PCC. The agreement counts in the table represent the situations when all mentions in a cluster have the same value of the given parameter assigned by the Pantera tagger (Acedański, 2010)⁹. Relaxed gender stands for conflating the three Polish masculine genders: masculine human, masculine animate and masculine inanimate.

Table 8.10. Cluster agreement types in short and long texts

Ns-cluster type              Count             % ns-clusters
                             ST       LT       ST        LT
Same head number             14,116   968      80.04     76.89
Same head gender-relaxed     11,027   856      62.53     67.99
Same head gender             10,297   810      58.39     64.34
Same head base                7057    611      40.01     48.53
Same head orth                3710    313      21.04     24.86
Any                          17,636   1259    100.00    100.00

Clusters containing at least one mention with non-agreeing gender are very frequent (42% and 36% of all clusters in ST and LT, respectively; 37% and 32% in the case of relaxed gender agreement). This fact can be explained e.g. by the use of synonyms or hyper-/hyponyms to describe the same referent (e.g. to Volvo 'this Volvo(sing:neut)', ten samochód 'this car(sing:masc)'). The frequent disagreement in number (20% and 23% of clusters) is more surprising. Interesting cases of this type concern generic noun groups, as in Examples (8.9–8.10).

(8.9) wyleczenie z antysemityzmu
      'curation of antisemitism'(sing.neut)
(8.10) takie zmiany
      'such changes'(plur.fem)

These statistics show that using traditional strict gender/number agreement constraints for coreference clustering in an automated tool is not a good solution in the case of Polish.

9 When a mention head has no parameter of a given type, it can only agree with a mention which also misses this parameter.

Note also that only about 40% and 49% of clusters have the same head base form in all mentions, and about half of them (19% and 23%) contain graphically different inflected forms of the head. This fact confirms the necessity of using lemmatization or stemming techniques for automatic coreference resolution in highly inflected languages, notably those admitting declension (inflection for case of nouns, adjectives and pronouns). Interestingly enough, agreement of mention heads within clusters is significantly higher in LT than in ST for all features except number.
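A minimal sketch of how such agreement percentages can be computed over the annotated clusters might look as follows; the representation of a mention as a dictionary of tagger-assigned head attributes is an assumption.

    def head_agreement(ns_clusters, attribute):
        """Return the count and percentage of non-singleton clusters whose mentions
        all share the same value of a head attribute (e.g. 'number', 'gender',
        'base'). Mentions missing the attribute only agree with other mentions
        that also miss it (both carry the value None)."""
        agreeing = sum(
            1 for cluster in ns_clusters
            if len({mention.get(attribute) for mention in cluster}) == 1
        )
        return agreeing, 100.0 * agreeing / len(ns_clusters)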

8.3.2.2 Clusters with indefinite mentions

Traditionally, indefinite pronouns (ktokolwiek 'anyone', ktoś 'someone', cokolwiek 'anything', coś 'something'), negative pronouns (nic 'nothing', nikt 'nobody') or universal pronouns (wszystko 'everything', wszyscy 'everybody') are not regarded as coreferential, since they do not carry direct reference information. However, Table 8.11 shows that they sometimes do form coreferential chains, as in Example (8.11), even if the proportion of such cases is very low.

(8.11) Jak ktoś jest zazdrosny, znaczy, że Ø naprawdę kocha.
       'If someone is jealous, it means that he/she really loves.'

Note also that indefinite pronouns relate to indefinite descriptions, such as the one in Example (8.12)¹⁰, which corresponds to a prototypical construction introducing a new discourse referent.

(8.12) Pewien chłopiec oblał egzamin maturalny. Zdecydował on, że zrezygnuje z dalszych studiów.
       'A certain boy failed his maturity exam. He decided to resign from further studies.'

Table 8.11. Ns-clusters with at least one pronoun mention of a specific type in short and long texts

Mention type           Cluster count       % of all ns-clusters
                       ST        LT        ST         LT
Indefinite pronoun       44        2        0.25       0.16
Negative pronoun         11        0        0.06       0.00
Universal pronoun        46        6        0.26       0.48
Any pronoun            4181      182       23.71      14.46
Any                   17,636    1259      100.00     100.00

10 Example from (Bellert, 1971), the first Polish work mentioning the role of indefinite descriptions in a coherent text.


8.3.3 Cluster and mention count correlation

The distribution of cluster and mention counts, which was examined in short texts only, is depicted in Figure 8.3. Each dot corresponds to one short text from the corpus; the dot numbers indicate text identifiers, which enable us to track the outliers. The horizontal axis shows the number of mentions in a given text, divided by the length of the text in segments for normalization. The vertical axis shows the number of clusters in a given text, again divided by the number of segments. Figure 8.3 shows that both the normalized mention count and the normalized cluster count are similar for most texts, with certain outliers:
– text 246 (the rightmost dot) is a training programme with an extended number of mentions, due to the nominal-phrase-based character of such document types (with training places, instructor names and topics all being nominal phrases)
– text 1938 (the topmost dot) is a quasi-spoken parliament session transcript with an extensive list of clusters resulting from the discussed topics (multiple references to parliamentary committees, the voting procedure, bills, etc.)
– text 323 (the leftmost dot) is a short and highly fragmented report from a book promotional event, with a book excerpt, the book title, and information about the event time and place; it features a low number of clusters and hardly any nested mentions.

Fig. 8.3. Normalized cluster/mention ratio in short texts (scatter plot of normalized mention count, horizontal axis, against normalized cluster count, vertical axis; outliers 246, 1938 and 323 are labelled)

Fig. 8.4. "Typicality" of texts from different domains (boxplots of typicality positions for the 14 text domains: unclassified written; non-fiction literature; magazines; journalistic books; dailies; fiction literature (prose, poetry, drama); academic writing and textbooks; internet interactive (blogs, forums, usenet); spoken – conversational; spoken from the media; instructive writing and textbooks; quasi-spoken (parliamentary transcripts); misc. written (legal, ads, manuals, letters); internet non-interactive (static pages, Wikipedia))

We have also calculated the Mahalanobis distance from the mean point for these data in order to sort the texts from the most "atypical" regarding the mention/cluster proportion (number 0) to the most "typical" one (number 1772). Then, we have drawn the boxplots presented in Figure 8.4, which show the positions of texts from different domains on that list. We have 14 different text domains in the corpus and there is no strong evidence that any of them has an abnormal frequency of "atypical" texts regarding the cluster/mention ratio.
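A minimal sketch of this computation – normalized counts, the Mahalanobis distance of each text from the mean point, and the resulting atypicality ordering – is given below; the input structure (a list of (mentions, clusters, segments) triples) is an assumption.

    import numpy as np

    def atypicality_order(texts):
        """texts: list of (mention_count, cluster_count, segment_count) per text.
        Returns text indices sorted from the most 'atypical' (largest Mahalanobis
        distance from the mean of the normalized counts) to the most 'typical'."""
        X = np.array([[m / s, c / s] for m, c, s in texts])
        diff = X - X.mean(axis=0)
        cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
        d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)   # squared distances
        return np.argsort(-d2)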

Part III: Coreference Resolution

Maciej Ogrodniczuk, Mateusz Kopeć

9 Resolution approaches

9.1 Resolution methodologies

Since the problem of the automated resolution of anaphora and, more recently, coreference dates back some 50 years¹, several well-written introductions to the problem have already been compiled, either in cross-sectional articles such as, most recently, (Ng, 2010) or in numerous coreference-related PhD theses such as (Ng, 2004), (Uryupina, 2007) or (Recasens, 2010). This chapter does not pretend to offer a better overview of the domain. It is meant instead as a very concise summary of resolution methods, representations and algorithms.

Traditionally, the process of end-to-end coreference resolution is split into two parts:
1. mention detection, i.e. marking potentially coreferential elements
2. coreference resolution, i.e. linking mentions in groups referring to the same entity.

Since we limit our interest to nominal direct identity-of-reference only, the mention detection issue is strictly related to nominal phrase identification. Starting with simple nouns and pronouns, syntactic groups can be identified together with named entities and other language-dependent constructs interacting in coreferential relations, such as null subjects. This step is regarded as language-dependent and will be discussed in detail in Chapter 10, which describes the identification of mentions for Polish.

9.1.1 Resolution models

9.1.1.1 Mention-pair model

The mention-pair (or pairwise) resolution model (Aone and Bennett, 1995) combines two steps:
– classification (prediction), which, for pairs of mentions in a document, decides whether they are coreferent or not
– clustering, which, based on the pairwise decisions, constructs the partition of the mention set into equivalence classes – coreference clusters, each of which corresponds to a different discourse-world entity.

1 Mitkov (1999) reports on STUDENT (Bobrow, 1964), a high-school algebra problem answering system as one of the earliest attempts to resolve anaphors by a computer program.

Training instances are created for each pair of mentions from texts annotated with coreference information. Each mention is described with a vector of features representing the properties of the mention and its context. To reduce the skewness resulting from the fact that most training mention pairs are not coreferential, the training instance creation method must be optimized. The most popular optimization was suggested by Soon et al. (1999): a positive training instance is created between a given mention M and its closest preceding antecedent A, or – in a modification by Ng and Cardie (2002b) – the closest preceding non-pronominal antecedent. Negative instances, in turn, are built between M and all mentions occurring between A and M. Several clustering algorithms have also been proposed, including:
– closest-first (Soon et al., 2001)
– best-first (Ng and Cardie, 2002b)
– correlation clustering (Bansal et al., 2004)
– graph partitioning (McCallum and Wellner, 2005)
– Bell-tree-based clustering (Luo et al., 2004)
and many others. Some approaches are based on multi-pass processing, in which it is checked whether basic coreference conditions, such as gender–number agreement, have been preserved in the preceding steps (Cardie and Wagstaff, 1999). Moreover, dedicated methods for the identification of the pleonastic it during the pre-processing step (Bergsma and Yarowsky, 2011) can also improve global coreference resolution results.
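As an illustration of this scheme, the sketch below builds Soon-style training instances and performs closest-first clustering; mentions are assumed to be hashable identifiers given in textual order, and gold_cluster_of and is_coreferent are placeholders for the gold annotation and the trained pairwise classifier.

    def soon_training_instances(mentions, gold_cluster_of):
        """One positive instance between a mention and its closest preceding
        antecedent; negative instances between the mention and every mention
        lying between that antecedent and the mention itself."""
        instances = []
        for j, mention in enumerate(mentions):
            antecedent_idx = None
            for i in range(j - 1, -1, -1):
                if gold_cluster_of[mentions[i]] == gold_cluster_of[mention]:
                    antecedent_idx = i
                    break
            if antecedent_idx is None:       # non-anaphoric mention: no instances
                continue
            instances.append((mentions[antecedent_idx], mention, True))
            for i in range(antecedent_idx + 1, j):
                instances.append((mentions[i], mention, False))
        return instances

    def closest_first(mentions, is_coreferent):
        """Link each mention to the nearest preceding mention accepted by the
        classifier, then read the resulting partition off the link structure."""
        cluster = {m: {m} for m in mentions}
        for j, mention in enumerate(mentions):
            for i in range(j - 1, -1, -1):
                if is_coreferent(mentions[i], mention):
                    merged = cluster[mentions[i]] | cluster[mention]
                    for m in merged:
                        cluster[m] = merged
                    break
        return {frozenset(c) for c in cluster.values()}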

9.1.1.2 Entity-based model and ranking

The mention-pair model seems to result directly from the anaphora resolution point of view, by definition focused on linking pairs of mentions (instead of creating clusters). This brings about the common problem of the model: linking decisions are taken in isolation from other candidates, which prevents the classifier from taking into consideration the properties of other already linked mentions. To compensate for this weakness, the entity-based (or entity-mention) resolution model has been proposed as a classifier/clusterer which, for a given mention, computes the probability of coreference between the mention and all previously constructed, potentially incomplete, coreference clusters. With such a definition, cluster-level features can be used and exploited in the process. Both pairwise and entity-based models can also employ ranking algorithms instead of classification to determine which candidate mention or cluster best suits the mention being resolved (see Connolly et al., 1994; Rahman and Ng, 2009).
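A minimal sketch of such an entity-mention pass might look as follows; cluster_score stands for any model that scores the compatibility of a mention with a partial cluster (possibly using cluster-level features), and the threshold value is an assumption.

    def entity_mention_resolve(mentions, cluster_score, threshold=0.5):
        """Process mentions in textual order; each mention joins the best-scoring
        partial cluster above the threshold or starts a new cluster."""
        clusters = []
        for mention in mentions:
            scored = [(cluster_score(cluster, mention), cluster) for cluster in clusters]
            if scored:
                best_score, best_cluster = max(scored, key=lambda pair: pair[0])
                if best_score >= threshold:
                    best_cluster.append(mention)
                    continue
            clusters.append([mention])
        return clusters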


9.1.2 Resolution strategies

According to the classification by Ng (2004), we can distinguish several aspects of anaphora and coreference resolution systems:
1. knowledge-based vs. corpus-based: with a resolution procedure based on heuristically hand-crafted semantic models of discourse, as opposed to data-driven resolution, with knowledge features derived from labelled data and the prediction function acquired automatically by applying machine learning algorithms
2. knowledge-rich vs. knowledge-lean (or knowledge-poor): depending on the amount of information used in the process, with either sophisticated domain-specific knowledge, semantic and discourse properties, or just a few basic morphosyntactic/syntactic features
3. with semi-automatic vs. fully automatic pre-processing: depending on whether the resolution process runs on gold (error-free) mentions or on system mentions (automatically identified by existing state-of-the-art tools).

Contemporary writings describe a tendency to lean towards a corpus-based, knowledge-lean and fully automatic resolution process, not only due to the much less intensive manual work and linguistic expertise involved and the better reaction of such systems to incomplete and noisy input data, but also because they outperform their knowledge-based counterparts.²

9.1.3 Learning features

Selecting learning features is crucial for the resolution process; it has been shown by Bengtson and Roth (2008) that a simple pairwise system with a well-designed set of features can outperform state-of-the-art systems with complex models but more traditional, weak features. Below, we present a short overview of the feature classes used by current systems. The features can provide positive or negative (e.g. with WordNet-induced antonymy) constraints:
– positional features, such as the number of intervening noun phrases between the anaphor and its potential antecedent
– string-matching features derived from the lexicographic similarity of mentions, based on techniques ranging from normalization (case conversion, punctuation and determiner stripping, etc.), through substring match (first word, rarest word, any substring, etc.) to edit distance or other linguistic string distance metrics
– syntactic features – compatibility of gender, number, person, animacy, grammatical role, definiteness, etc.

2 See e.g. (McCarthy and Lehnert, 1995) and (Soon et al., 2001).

– semantic features – semantic class, modifier match, WordNet-based features (synonymy, hypernymy, sharing a common hypernym), Wikipedia-mined knowledge, etc.
– discourse-based features – candidate salience, proximity, etc.

Depending on the available information and the experimental setting, machine learning processes can also benefit from more advanced features or from the output of semi-dependent modules, such as anaphoricity determination (Ng and Cardie, 2002a) or non-referential pronoun detection (Bergsma and Yarowsky, 2011).
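For illustration, a handful of features of the classes listed above could be computed for a mention pair as follows; the mention representation (a dictionary with text, head, position and agreement attributes) is an assumption.

    def pair_features(antecedent, anaphor):
        """A few positional, string-matching and (morpho)syntactic agreement
        features for an (antecedent, anaphor) mention pair."""
        return {
            "sentence_distance": anaphor["sentence"] - antecedent["sentence"],
            "mention_distance": anaphor["index"] - antecedent["index"],
            "exact_string_match": antecedent["text"].lower() == anaphor["text"].lower(),
            "head_match": antecedent["head"].lower() == anaphor["head"].lower(),
            "head_substring": antecedent["head"].lower() in anaphor["text"].lower(),
            "gender_agreement": antecedent["gender"] == anaphor["gender"],
            "number_agreement": antecedent["number"] == anaphor["number"],
            "person_agreement": antecedent["person"] == anaphor["person"],
        }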

9.1.4 Resolution algorithms

Although all coreference resolution algorithms seem to be variants of the generic "antecedent generation – filtering – ranking – clustering" model, they differ with respect to their architecture, the layers of linguistic information taken into account, etc. Below, we present the highlights of the coreference resolution timeline:
– Hobbs (1986) describes a now classic syntax-based algorithm for resolving pronoun references by traversing full parse trees in search of the most recent antecedent obeying binding and agreement constraints
– Lappin and Leass (1994) propose a pronoun resolution algorithm using morphological and syntax-based constraints and salience-based preferences
– Grosz et al. (1995) describe an anaphora resolution algorithm based on centering theory, governed by maintaining discourse cohesion and salience
– Cardie and Wagstaff (1999) propose a clustering factor-based algorithm combining numerous weighted knowledge sources
– Harabagiu et al. (2001) describe a knowledge-minimalist simple selection algorithm based on mining coreference rules from an annotated corpus and transforming them into constraints used by partition search
– Soon et al. (2001) use a decision tree algorithm with a very limited number of surface features and closest-first clustering
– McCallum and Wellner (2005) introduce a sequence modelling algorithm with Conditional Random Fields
– Luo et al. (2004) present an algorithm redefining the coreference resolution problem into finding the best path in the Bell tree
– Haghighi and Klein (2007) and Ng (2008) represent a shift of paradigm from supervised to unsupervised approaches in coreference resolution.

After 2008, one can observe a tendency to develop extensive coreference resolution platforms which combine different methods, features and resolution strategies in a common machine learning architecture. The most successful coreference resolvers of this type for English are presented in the next section.


9.2 Foreign state-of-the-art resolution tools

As the main purpose of this section is to inspect some recent successful approaches to coreference resolution, the two most important aspects considered were the maturity of the tool and the public availability of its sources (or, at least, of a reasonably detailed description of the algorithm). Based on these guidelines, the following systems were selected for analysis (in chronological order, relative to the year of the first publication about the given tool):
– Beautiful Anaphora Resolution Toolkit (BART) – 2008
– Reconcile – 2010
– Stanford Deterministic Coreference Resolution System – 2010
– Berkeley Coreference Resolution System – 2013.
The above systems are described in the following sections.

9.2.1 BART

The first 1.0 release (described here) of the Beautiful Anaphora Resolution Toolkit (BART) by Versley et al. (2008) appeared in 2007, while version 2.0 was published in 2013. BART is a Java open-source (Apache license v. 2.0) framework for end-to-end coreference resolution for English, but it is designed to be modular enough to allow its use for other languages.

9.2.1.1 Annotation process

The tool processes texts in the following way:
1. Pre-processing – this stage includes all the necessary pre-processing steps for English, such as chunking, named entity recognition, tagging and parsing. A mention-building module extracts mentions based on the pre-processing results.
2. Training example selection – as BART implements a model for mention-to-mention binary classification, there is a need to extract positive and negative training examples from the training corpus, which contains clusters of mentions rather than explicit mention pairs. This process can be configured in BART in the encoder module. The simplest encoder extracts all mention pairs from the text as training examples: an example is positive if the two mentions belong to the same coreference cluster, and negative otherwise. However, this is not the only way, as one may e.g. restrict the training data to mention pairs in which the mentions are located textually close to each other, for example not further than 10 mentions away. Several encoders are implemented in BART.
3. Feature extraction – each mention pair is enriched with classification features by 64 feature extractors. Those are based on language-specific settings; therefore, using them in an out-of-the-box mode is problematic for languages other than English.
4. Classification – BART employs machine learning techniques – from WEKA (Hall et al., 2009), SVMlight (Joachims, 1999) and MaxEnt implementations – to train a model for mention-to-mention classification. The binary classification consists in deciding whether a mention pair is coreferent or not.
5. Clustering – pairwise decisions need to be consolidated into coreference clusters, which is again configurable, in the decoder module. As with the encoder, the simplest approach would be to cluster together all mentions from pairs labelled as coreferent by the classifier. However, again, this may not be the best strategy, and several decoders are implemented. The decoder needs to be compatible with the encoder, as the use of the mention-to-mention classifier in the prediction process should be similar to its use in the training process.
6. Scoring – BART 1.0 features only a MUC scorer, which is not sufficient for an in-depth evaluation.

BART's authors report on its ability to achieve state-of-the-art results for English.

9.2.1.2 Summary

BART is an interesting framework for multilingual coreference resolution. There have been attempts to use it, for example, for Italian (Poesio et al., 2010), Bengali (Sikdar et al., 2013), Arabic and Chinese (Uryupina et al., 2012), as well as for Polish (Kopeć and Ogrodniczuk, 2012). It is highly modular, and the only strict constraint in its architecture is that the machine learning module employs mention-to-mention comparisons. However, many features are language-dependent, so it is not easy to rapidly benefit from BART's full potential – it needs some adjustment for languages other than English. BART in version 2.0, published in 2013, is claimed to have better support for multilingualism and to implement more features. Unfortunately, it lacks documentation of the features and of the encoders/decoders, as well as of their evaluation.

9.2.2 Reconcile

Reconcile (Stoyanov et al., 2010a) is an automatic noun phrase coreference resolution framework, similar to BART, but focused on English only. The source language is Java, and it is freely available under the GPL. It allows for end-to-end coreference resolution.


9.2.2.1 Annotation process

The tool processes texts in the following way:
1. Pre-processing – this stage contains several components chosen from the following list: two sentence splitters, a tokenizer, three POS taggers, two deep parsers and a dependency parser, two NE recognisers and several mention detectors that conform to the different mention definitions occurring in data sets. Feature extractors are dedicated to noun phrases, including proper nouns and pronouns.
2. Feature generation – for each mention pair in the text, various features are generated. Reconcile implements 88 features; for their detailed description see (Stoyanov et al., 2010b). They are mostly binary (compatible/incompatible), and some are three-valued (compatible/incompatible/not available).
3. Classification – various mention-to-mention coreference classifiers can be trained using algorithms from WEKA (Hall et al., 2009), libSVM (Chang and Lin, 2011) and SVMlight (Joachims, 1999). Training is performed on all mention pairs in a text.
4. Clustering – three algorithms for the consolidation of pairwise decisions are available:
   – Single-link – for a given mention, we merge the clusters of all mentions classified as coreferent with it, and add the mention to the resulting cluster.
   – Most-recent-first – for a given mention, we add it to the cluster of the nearest preceding mention classified as coreferent.
   – Best-first – for a given mention, we add it to the cluster of the mention with the highest coreference score from the classifier, if that score is above a certain threshold. To use this strategy, we need a confidence measure from the classifier.
5. Scoring – Reconcile evaluates itself using system-mention MUC, B³ (with two approaches to dealing with twinless mentions) and entity-based CEAF scores.

The authors of the tool evaluate the following configuration, named Reconcile 2010:
1. Pre-processing: OpenNLP sentence splitter, tokenizer and POS tagger, the Berkeley parser and the Stanford NE recogniser.
2. Feature extraction: 60 features selected out of the 88 available ones.
3. Classification – Averaged Perceptron.
4. Clustering – Single-link, with the decision threshold cross-validated on the training set.

This configuration achieves state-of-the-art results (compared to two older approaches, from 2001 and 2002).
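A minimal sketch of two of these consolidation strategies is given below (most-recent-first is analogous to the closest-first sketch in Section 9.1.1.1); is_coreferent and score stand for the trained pairwise classifier and its confidence measure, and the threshold is an assumption.

    def single_link(mentions, is_coreferent):
        """Merge the clusters of all preceding mentions classified as coreferent
        with the current mention, and add the mention to the merged cluster."""
        cluster = {m: {m} for m in mentions}
        for j, mention in enumerate(mentions):
            for i in range(j):
                if is_coreferent(mentions[i], mention):
                    merged = cluster[mentions[i]] | cluster[mention]
                    for m in merged:
                        cluster[m] = merged
        return {frozenset(c) for c in cluster.values()}

    def best_first(mentions, score, threshold=0.5):
        """Attach each mention to the cluster of the preceding mention with the
        highest coreference score, provided the score exceeds the threshold."""
        clusters, cluster_of = [], {}
        for j, mention in enumerate(mentions):
            candidates = [(score(mentions[i], mention), mentions[i]) for i in range(j)]
            best = max(candidates, key=lambda pair: pair[0], default=None)
            if best is not None and best[0] > threshold:
                cluster_of[mention] = cluster_of[best[1]]
                cluster_of[mention].append(mention)
            else:
                new_cluster = [mention]
                clusters.append(new_cluster)
                cluster_of[mention] = new_cluster
        return clusters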

9.2.2.2 Summary

Reconcile largely resembles BART but – contrary to the latter – does not have a separate module for controlling the extraction of training mention pairs (it assumes that all pairs are always used). Its features are documented in the report, but the tool (contrary to BART) does not seem to be maintained (last code change in 2010) and is not as popular in the community.

9.2.3 Stanford Deterministic Coreference Resolution System

This system implements the multi-pass sieve coreference resolution (or anaphora resolution) approach described initially in (Raghunathan et al., 2010) and further extended in (Lee et al., 2011, 2013). Contrary to the tools presented previously, it does not employ machine learning techniques and is fully deterministic. The architecture of the Stanford system involves several independent coreference resolution modules, which are applied in a sequence, starting with the modules with the highest precision. Because of that, the modules (except the first one) receive partially clustered mentions as input and, therefore, the whole system may be considered entity-based rather than mention-based. Each module may use information from the previously created partial mention clusters to extract mention features.

9.2.3.1 Annotation process

The system proceeds in the following steps:
1. Pre-processing – as the system is a part of the Stanford CoreNLP suite, it uses Stanford pre-processing tools, such as the Stanford tagger (Toutanova et al., 2003), parser (Socher et al., 2013) and named entity recognizer (Finkel et al., 2005). Mention detection involves marking noun groups, pronouns and named entities. Then, these mention candidates are filtered by some simple heuristic rules (for example stop words, pleonastic it or numeric entities), tailored for particular evaluation tasks.
2. Coreference resolution – the version described by Lee et al. (2013) contains ten coreference resolution modules. Each module receives partially formed clusters and may merge some of them. Only mentions that currently come first in textual order in their clusters (the so-called cluster representatives) are considered, and they are processed sequentially. Each such mention is compared with the representatives occurring earlier in the text. The order of these pairwise comparisons is important (and is determined by the information from parsing), as the first compatible pair of representatives triggers the merging of their clusters and stops further processing of the current cluster. Although mention representatives are used as the basis for the comparisons, the sieves may use information from the already partially formed clusters of the compared mentions. The modules are applied in the following order:
   – Speaker Sieve – matches pronominal mentions that appear in a quotation block with the corresponding speaker.

   – String Match – searches for anaphoric antecedents that are textually identical with the mention under consideration.
   – Relaxed String Match – similar to String Match, but ignores the text following the mention head while performing the comparison.
   – Precise Constructs – searches for several high-precision syntactic constructs, such as appositive relations, predicate nominatives and others.
   – Strict Head Match A – clusters mentions that have the same head word and fulfil some other constraints.
   – Strict Head Match B – similar to Strict Head Match A, but taking another set of constraints into account.
   – Strict Head Match C – similar to Strict Head Match A, but taking yet another set of constraints into account.
   – Proper Head Noun Match – clusters mentions headed by the same proper noun and fulfilling some other constraints.
   – Relaxed Head Match – searches for antecedents containing the head of the mention under consideration; applies to named entities of the same type only.
   – Pronoun Match – enforces agreement of the following constraints between the pronoun and the antecedent: number, gender, person, animacy, NER label, and a distance not larger than 3 mentions.
3. Scoring – no internal scoring implementation is available.
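A minimal skeleton of this sequential, precision-ordered sieve application over partial clusters might look as follows; mentions are assumed to be integer positions in textual order, and each sieve is a placeholder function deciding whether an antecedent cluster and the current cluster are compatible (this is an illustration, not the actual CoreNLP code).

    def apply_sieves(mentions, sieves):
        """Multi-pass sieve skeleton: start from singleton clusters and let each
        sieve (highest precision first) merge compatible clusters. The cluster
        representative is its first mention in textual order."""
        clusters = [[m] for m in sorted(mentions)]
        for sieve in sieves:
            for cluster in list(clusters):
                if cluster not in clusters:          # already merged in this pass
                    continue
                representative = cluster[0]
                # candidate antecedent clusters: those starting earlier in the text;
                # the first compatible one triggers the merge and ends the search
                for antecedent in clusters:
                    if antecedent[0] >= representative:
                        continue
                    if sieve(antecedent, cluster):
                        antecedent.extend(cluster)
                        clusters.remove(cluster)
                        break
        return clusters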

The system was evaluated with the MUC, B³, entity-based CEAF, BLANC and CoNLL metrics during the CoNLL-2011 shared task (Pradhan et al., 2011) and proved to generally outperform the previous state of the art.

9.2.3.2 Summary

The Stanford system is a multi-sieve architectural framework which was adapted to other languages during the CoNLL-2012 shared task, focused on coreference resolution in a multilingual setting: English, Chinese and Arabic (Pradhan et al., 2012). The most recent version of the tool uses a machine learning module to detect singletons – see (Recasens et al., 2013). The tool is different from the other systems presented here as it employs an entity-centric model and no machine learning technique. The system is modular, yet not as simple as its authors claim – the sieve set contains a large number of carefully designed heuristics.

9.2.4 Berkeley Coreference Resolution System
The Berkeley Coreference Resolution System is among the newest tools for English coreference resolution, as it was described in (Durrett and Klein, 2013) and (Durrett et al., 2013). It is a system capable of pre-processing raw text, detecting mentions and clustering them into coreference clusters. Clustering is performed with a machine-learnt model, which predicts for each mention whether it has an antecedent earlier in the text, and, if so, which one it is. The system does not include any evaluation tool; the authors use the official CoNLL scorer instead.

9.2.4.1 Annotation process
The processing steps may be summarised as follows:
1. Pre-processing – the system is bundled with a pre-processor for English that can take raw text input, split it into sentences and tokens, and generate POS tags, syntactic parses and named entity information.
2. Mention detection – mentions of three types are chosen from the pre-processing data: proper name mentions from named entity chunks, pronominal mentions from POS tagging, and nominal mentions from maximal noun phrases obtained during syntactic parsing.
3. Feature generation – features are generated for each mention and for each (mention, antecedent candidate) pair. In their final setting, the authors used the following feature sets.
The surface feature set accounts for the following properties:
– mention type (nominal, proper, or pronominal)
– complete string of a mention
– semantic head of a mention
– the first and the last word of each mention
– the word immediately preceding and the word immediately following a mention
– mention length in words
– values of two distance measures between mentions (the number of sentences and the number of mentions).
The additional feature set contains the following data:
– information whether two mentions are nested
– ancestry of each mention head: dependency parent and grandparent POS tags and arc directions
– the speaker of each mention
– number and gender of each mention from (Bergsma and Lin, 2006).
Additionally, each feature is combined (by conjunction) with the type of the current mention and of its antecedent (in the case of the pronominal type, the type is equivalent to the citation form of the pronoun). For example, when we have a feature f describing the number of sentences between two mentions, we generate the following conjunction features: f ∧ mention_type and f ∧ mention_type ∧ antecedent_type. Because of this conjunction principle and also due to the lexicalization of features (although words occurring less than 20 times were replaced with their tags), the number of features is very large. Even simple surface properties led to over 400,000 features on the training data, while the application of all the above types yielded a total of almost 3,000,000 features (including over 2,500,000 features describing the ancestry of the mention head for the key mention and its candidate antecedent).
4. Classification – a log-linear model is used to select at most one antecedent for each mention. In other words, the model predicts the antecedent for each mention or marks it as having no antecedent (i.e. as the beginning of a new chain or a singleton). As the gold data contain clusters, not antecedent links, training examples need to be somehow extracted from the data. For that purpose, the Berkeley system generates all possible antecedent structures compatible with the gold clustering, i.e. structures in which a mention has another mention as its antecedent if, and only if, they are both in the same cluster. The loss function punishing wrong antecedent choices during training is a weighted average of three error types:
– FA (false anaphor) – occurs when a mention is chosen to be anaphoric when it should start a new cluster
– FN (false new) – occurs when a mention is wrongly indicated as the beginning of a new cluster when it should be anaphoric
– WL (wrong link) – occurs when the antecedent chosen for a mention is wrong (but the mention is indeed anaphoric).
The authors choose the weight 0.1 for FA, 3 for FN and 1 for WL to counterbalance the high number of singleton mentions and bias the system towards making more coreference linkages.
5. Clustering – antecedent links are easily converted into coreference clusters by grouping mentions connected by any (undirected) antecedent link path.
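The conjunction scheme from step 3 can be sketched in a few lines. The feature naming convention and the question of whether the raw feature is also emitted on its own are assumptions for illustration; the snippet is not the actual Berkeley implementation.

# Illustrative sketch of the conjunction scheme described above: every raw feature is
# emitted on its own, conjoined with the current mention type, and conjoined with both
# the mention and the antecedent types. Names and conventions are assumptions.

def conjoin(raw_features, mention_type, antecedent_type):
    """raw_features: dict of feature name -> value (e.g. {"sent_dist": 2});
    mention_type / antecedent_type: e.g. "NOMINAL", "PROPER", or a pronoun citation form."""
    out = {}
    for name, value in raw_features.items():
        out[name] = value
        out[f"{name}&mtype={mention_type}"] = value
        out[f"{name}&mtype={mention_type}&atype={antecedent_type}"] = value
    return out

# e.g. conjoin({"sent_dist": 2}, "on", "PROPER") yields three variants of the same feature,
# which is how a small set of surface properties blows up into millions of features.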

The authors claim that their approach of using surface features only gives state-of-the-art results, even better than the Stanford Deterministic Coreference Resolution System. It is worth noting that the features used in this system are simple, yet their combinations are able to account for some more complex features found in the literature (the authors show some examples of such features). Surface features, however, cannot capture all coreference relations. The system is most noticeably poor at resolving anaphoric mentions whose heads have not appeared before. The only way of resolving such cases involves semantics; therefore, there were experiments with incorporating semantic resources – such as WordNet hypernymy and synonymy, number and gender data for nominals and proper names from (Bergsma and Lin, 2006), named entity types and other data – into the feature generation process. Since this did not yield a significant improvement, the final features do not include these semantic properties, except the data mentioned above.

9.2.4.2 Summary
The Berkeley Coreference Resolution System is interesting as it employs a novel model for machine learning of coreference relations. At the point of occurrence of the key mention in the text, a decision has to be made as to which other mention is its antecedent, or that there is no antecedent present. The features are rather simple, but their number is very large – about 3,000,000. The final setting (although very accurate in terms of CR scores) does not answer the question about the efficient incorporation of semantic features.

9.3 Polish coreference resolution attempts

9.3.1 Knowledge-poor pronoun resolution
The first known computational approach to anaphora resolution in Polish is described by Mitkov and Styś (1997) and further supplemented by Mitkov et al. (1998). They present a robust, knowledge-poor approach to resolving pronouns in technical manuals in English, Polish and Arabic. Linguistic knowledge is limited to a small noun phrase grammar, a list of terms and a set of antecedent indicators. The part-of-speech tagged text is searched for the presence of antecedents in the two sentences preceding the occurrence of the anaphor, and the candidates are scored according to weighted tracking indicators. In case of equal scores, additional parameters such as lexical reiteration (see below), and then proximity of the antecedent, are taken into account.
The list of antecedent tracking indicators was collated based on studying hand-annotated technical manuals and consists of:
– definiteness – in case of Polish, which lacks definite articles, it is signalled by word order, demonstrative pronouns or repetition
– givenness – the property of being the first nominal phrase in the sentence
– term preference – whether a domain-dependent dictionary contains the candidate term
– indicating verbs – whether a verb preceding the candidate exists on the list of salient verbs such as discuss, present, illustrate etc. (for Polish, their equivalents and synonyms were used)
– indicating noun phrases – whether the NP preceding the candidate is from a fixed list of structure markers (such as chapter, section, table)
– lexical reiteration – whether an item is repeated within the same paragraph (including equal heads of subsequent NPs); boosted when the NP is repeated twice or more
– section heading preference – whether the NP occurs in the section heading
– collocation pattern preference – given to candidates which have an identical collocation pattern with a pronoun
– referential distance – with graded rating dependent on whether the candidate is in the preceding clause (for compound sentences), or one, two or three sentences back
– “non-prepositional” noun phrases – with higher preference given to NPs not being part of a prepositional phrase; in Polish with additional penalization of frequent genitive constructs
– “immediate reference” – in compound sentences linked with coordinating conjunctions, the NP after the first verb in the first sentence is the preferred antecedent of the NP after the first verb in the second sentence, etc.
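A minimal sketch of how such indicator-based scoring could work is given below. The indicator weights, the candidate attributes (reiteration_count, distance_to) and the tie-breaking details are illustrative assumptions, not the values or structures used in the original approach.

# Minimal sketch of indicator-based antecedent scoring; weights, attribute names and
# tie-breaking details are illustrative assumptions only.

ILLUSTRATIVE_WEIGHTS = {
    "definiteness": 1, "givenness": 1, "term_preference": 1,
    "indicating_verbs": 1, "lexical_reiteration": 2, "referential_distance": 1,
}

def score_candidate(candidate, pronoun, indicators, weights=ILLUSTRATIVE_WEIGHTS):
    """indicators: dict mapping indicator name -> predicate(candidate, pronoun) -> bool."""
    return sum(weights.get(name, 1)
               for name, holds in indicators.items()
               if holds(candidate, pronoun))

def resolve(pronoun, candidates, indicators):
    # highest indicator score wins; ties broken by lexical reiteration, then by proximity
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: (score_candidate(c, pronoun, indicators),
                              c.reiteration_count,          # assumed attribute
                              -c.distance_to(pronoun)))     # assumed method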

The reported preliminary evaluation for Polish over a small set of Polish technical manuals containing 180 pronouns³ showed a success rate of 93.3%, which is a considerable improvement over the baseline number/gender agreement model (23.7% in case of the subject and 68.4% in case of the most recent NP). The authors also report on the frequency of the indicators’ use: “definiteness” – 97.2% of the cases, “referential distance” – 94.4%, “givenness” – 61.1%, “non-prepositional” noun phrases – 52.8%, “indicating verbs” – 16.7%, “immediate reference” – 2.8% of the cases and “lexical reiteration” – 13.9%.

9.3.2 The analysis of anaphoric relations in the Polsyn parser
Since 1996 the Software Division of the current Institute of Computer Science at the Silesian University of Technology has been developing the Syntactical Groups Grammar of Polish (SGGP) and since 1997 – the Polsyn parser. One of the stages in its analysis is marking anaphoric relations, introduced by Kulików et al. (2004) i.a. to facilitate machine translation of Polish texts into the sign language or extractive text summarization (which involves replacing anaphors with their antecedents). The resolution process is described by Ciura et al. (2004). Only pronouns and selected conjunctions are taken into account as potential anaphors, with zero subjects replaced by pronouns in a separate step preceding anaphora resolution. Antecedents are individual words, belonging either to a nominal group (NG) identified in one of the previous steps of the analysis with SGGP or constituting one of the previously identified anaphors.

3 An earlier version of the paper reports on evaluating 143 pronouns with the achieved precision rate at 92.1% and considerably different share of preferences, which may imply their high dependence on the data used.

Antecedents are searched for among candidates carrying matching gender at most two sentences backwards, but due to the inclusion of anaphors in the process, the chains can indirectly reach a wider scope. In case more than one candidate is found, exactly one of them is selected by investigating the function of its superior NG, with preference given to subjects over objects and objects over ‘no function’. The resolution module is available as part of LAS 2.0, the Linguistic Analysis Server: http://las.aei.polsl.pl/las2/. Anaphora substitution results in the summarization process can be viewed at http://las.aei.polsl.pl/PolSum/.

9.3.3 Coreferencing for geo-tagging of Polish data
Another partial attempt at coreference resolution for Polish was carried out by Abramowicz et al. (2006) in the context of an entity detection task. They describe a linguistic suite for geo-tagging of free-text Polish data which involves a lightweight coreferencing technique based on the Jaro-Winkler string-similarity metric. The suite was built on top of SProUT (Drozdzynski et al., 2004) and a grammar interpreter with a 60K-word corpus conformant to DECADENT annotation guidelines built on texts from the daily Polish newspaper Rzeczpospolita, the financial magazine Tygodnik Finansowy and several local news portals. The authors refer to this component as a ‘partial coreference resolver’ and admit that ‘detecting entity mentions of any type and grouping them into full coreference chains [. . . ] is beyond the scope of our current work’.

9.3.4 Pronominal anaphora resolution module for GATE
Another systematic computational approach to anaphora resolution is presented by Filak (2006). The author describes the development of a pronominal anaphora resolution module for a GATE⁴-based information extraction system. The scope of the resolved relations was limited to nominal phrases and personal, possessive and demonstrative pronouns, and the machine learning resolution tool used the manually annotated corpus described in Section 3.2.2. The learning set was created using heuristic rules to indicate phrase heads; the author points out that no satisfactory parser was available for this purpose at that time. Positive examples were taken from the corpus, while the negative ones were created by linking the pronoun with all non-selected pronouns and nominal phrases located between the given pronoun and its true referent. 17 features were used, ranging from the traditional string match, sentence and token distance, through gender–number agreement, to the case of the anaphor or its accentability⁵.

4 See Cunningham et al. (2002).


In order to create an anaphoric classifier, a machine-learning approach was used, with the J48 decision tree inducer from WEKA (Hall et al., 2009) and various pruning parameters, a manually selected feature set and 10-fold cross-validation applied. Three disambiguation heuristics were used to select the antecedent from the candidates offered by the classifier: distance-based heuristics, classifier error-based heuristics and coincidence heuristics. The system achieved 50.7% precision and 53.5% recall (52.1% F-measure). The reasons for the low results were erroneous human annotation of the learning examples, tagging errors, classification errors (resulting from sparse learning data and phrase head detection errors) and problems with manual heuristics for relation disambiguation.

9.3.5 IKAR and anaphora representation in KPWr
IKAR (Broda et al., 2012) is a hybrid toolkit for anaphora resolution for Polish. It combines machine learning and rule-based methods. In that research, mentions are understood as:
– proper nouns
– noun phrases containing elements agreeing in number, gender and case
– pronouns.
IKAR performs coreference resolution over mentions created in a pre-processing step consisting of the following stages:
1. division into tokens and sentences
2. morphological analysis and disambiguation
3. proper names extraction
4. chunking.
These steps rely on a set of tools developed at the Wrocław University of Technology. The resolution scope concerns the so-called direct anaphora limited to proper nouns:
– coreference between two proper names (PN–PN type) – resolved by a ML algorithm
– coreference between a proper name and an agreeing (in number, gender and case) noun phrase (PN–AgP type) – resolved by a heuristic rule
– coreference between a proper name and a pronoun (PN–Pron type) – resolved by a heuristic rule.

5 See (Przepiórkowski, 2004, Chapter 3.3–3.4) for the repertoire of grammatical categories and classes used in the IPI PAN Corpus.

For each mention, the task is to join it with its proper noun antecedent or leave it without an antecedent. The algorithm works on mention pairs (with antecedents being proper nouns), for each such pair solving a binary classification problem based on a set of features extracted for the pair. Mention pairs necessary to train and test the tool were extracted from the KPWr Corpus (cf. Section 3.2.3). For each subtask, the algorithm was different. The automatic anaphora resolution was conducted in the sequence in which the algorithms are presented below (i.e. the PN–AgP algorithm is used after PN–PN resolution, and PN–Pron resolution comes after both other types). IKAR was evaluated by 10-fold CV on KPWr with the B³, BLANC and MUC measures and exceeded a simple majority baseline for each anaphora type. The algorithm is very simple and limited to proper noun antecedents, yet it shows some valuable insight into the division of distinct subtasks for anaphora resolution and introduces an interesting SemanticLink feature.

9.3.5.1 PN–PN algorithm
3006 positive examples were acquired by taking pairs consisting of a PN and its nearest PN antecedent. 14,676 negative examples were created by taking all PN–PN pairs in which a false antecedent candidate was closer than the true PN antecedent. The features extracted for the C4.5 decision tree classifier were based mostly on the similarity of both PN phrases:
– CosSimilarity: degree of token overlapping in both PNs (in terms of the tokens’ base forms)
– TokenCountDiff: difference in the number of tokens forming each PN
– SameChannName: feature indicating if the two PNs are of the same type
– number agreement
– gender agreement.
Based on binary pair classification, it may happen that a PN has more than one antecedent. In that case, if the antecedents are different entities, the entity with a greater number of references is chosen. If the number of references is the same, the one that occurred in the text closer to the mention is chosen.
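A minimal sketch of a token-overlap similarity in the spirit of the CosSimilarity feature is shown below. The exact formula used in IKAR is not given in the text, so the cosine-over-bags-of-lemmas variant is an assumption.

# Token-overlap similarity over base forms, in the spirit of CosSimilarity; the exact
# IKAR formula is not given in the text, so the cosine variant below is an assumption.

from collections import Counter
import math

def cos_similarity(pn1_lemmas, pn2_lemmas):
    """pn1_lemmas, pn2_lemmas: lists of base forms of the tokens of the two proper names."""
    c1, c2 = Counter(pn1_lemmas), Counter(pn2_lemmas)
    dot = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def token_count_diff(pn1_lemmas, pn2_lemmas):
    # the TokenCountDiff feature: difference in the number of tokens forming each PN
    return abs(len(pn1_lemmas) - len(pn2_lemmas))

# e.g. cos_similarity(["uniwersytet", "warszawski"], ["uniwersytet", "jagielloński"]) == 0.5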

9.3.5.2 PN–AgP algorithm
1077 positive examples were generated by taking an AgP and its antecedent PN from the same coreference chain. 1833 negative examples were generated by taking, for each AgP, any proper name that occurred in the text not further than 50 tokens earlier (but only up to two negative examples for a given AgP). In this case, only one rule, called SemanticLink, was used. For each name category, a representative synset from the Polish Wordnet (Piasecki et al., 2009) was selected manually.


Therefore, the procedure is as follows:
1. Find the matching synset for the AgP’s head. The search is capped at 10 similarity, hypernym and holonym edges.
2. If it cannot be found, switch the places of the category’s synset and the head’s synset and search again (in case the head’s synset is more general than the one of the category).
3. If it cannot be found, the distance is the minimal number of edges separating the head and the category synset. (Note that a head usually gets more than one synset, because its meaning is not disambiguated.)
4. The score – 1/distance – can be interpreted as how well the AgP’s head can refer to the PN from a given category. If there is no better antecedent candidate between the AgP and a given PN, then it is a positive match.
As in PN–PN, it may happen that an AgP has more than one PN antecedent. In that case, the SemanticLink score is calculated for each of them and the one with the best score is chosen as the antecedent. If there are two candidates with the same score, the one closer to the mention is chosen.

9.3.5.3 PN–Pron algorithm
450 positive and 596 negative examples were generated as in the PN–AgP case. Only one rule is used: Pronoun Link. The procedure is as follows:
1. Check if there is an AgP between the pronoun and the PN that meets the SemanticLink criteria for the given PN and agrees in gender and number with the given pronoun, and that there is no closer AgP which meets these criteria.
2. If the condition is fulfilled, there is a link between that pronoun and the PN.
3. Otherwise, check if the pronoun and the PN share the same gender and number, and if there is no closer PN that meets those criteria. If the condition is fulfilled, there is a Pronoun Link.
If there is more than one antecedent for a given mention, the closest one is chosen.

9.3.6 English–Polish projection-based approach
Ogrodniczuk (2013b) presented another perspective on tackling the coreference resolution problem: a translation-projection-based approach⁶ performed in the following steps:

6 A similar solution has also been proposed e.g. by Rahman and Ng (2012) and evaluated for Spanish and Italian with projection from English.

1. translating the plain text which needs to undergo coreference resolution from the source language (Polish) into the target language, for which coreference resolution tools are available (English)
2. word-alignment of both texts
3. running the English mention detector and coreference resolver on the plain text
4. pre-identifying mentions in the Polish part
5. transferring the produced annotations from English to Polish using the alignment information to create Polish coreference clusters.
Translation and alignment were produced using the Google Translate service⁷; Polish mention detection tools (see Chapter 10) were used for mention detection and Stanford CoreNLP (Lee et al., 2013) was used for English mention detection and coreference resolution. Evaluation of the results performed on 260 gold samples from the Polish Coreference Corpus (all available at the time of the experiment) showed the promising results reported in Table 9.1. The presented approach can be easily applied to languages still lacking the state-of-the-art component language tools necessary for detection of mentions and coreference resolution.

Table 9.1. Translation-projection based approach: experimental results

Evaluation metrics        P          R          F
MUC                    50.30%     29.62%     37.28%
B³                     93.34%     84.20%     88.53%
CEAF-M                 81.51%     81.51%     81.51%
CEAF-E                 81.06%     89.62%     85.12%
BLANC                  71.43%     60.51%     64.01%
CONLL                  74.90%     67.81%     70.31%
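The five steps can be sketched as a single high-level pipeline. The function names (translate_and_align, detect_polish_mentions, resolve_english) and the alignment format are placeholders standing in for the external services and tools mentioned in the text, not real APIs.

# High-level sketch of the projection pipeline described above; the callables passed in
# and the alignment format are illustrative placeholders, not real APIs.

def project_coreference(polish_text, translate_and_align, detect_polish_mentions, resolve_english):
    # 1-2. translate Polish to English and obtain a word alignment (e.g. via an MT service)
    english_text, alignment = translate_and_align(polish_text)    # alignment: {en_idx: pl_idx}
    # 3. run English mention detection and coreference resolution
    english_clusters = resolve_english(english_text)               # list of lists of token spans
    # 4. pre-identify Polish mentions with Polish tools
    polish_mentions = detect_polish_mentions(polish_text)          # list of (start, end) spans
    # 5. transfer clusters to Polish using the alignment
    polish_clusters = []
    for cluster in english_clusters:
        projected = []
        for en_start, en_end in cluster:
            pl_positions = [alignment[i] for i in range(en_start, en_end + 1) if i in alignment]
            match = find_covering_mention(pl_positions, polish_mentions)
            if match is not None:
                projected.append(match)
        if len(projected) > 1:
            polish_clusters.append(projected)
    return polish_clusters

def find_covering_mention(positions, mentions):
    # pick a pre-identified Polish mention that covers the projected token positions, if any
    for start, end in mentions:
        if positions and all(start <= p <= end for p in positions):
            return (start, end)
    return None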

7 Made available by the University Research Program for Google Translate, see http://research.google. com/university/translate/.

Maciej Ogrodniczuk, Mateusz Kopeć, Alicja Wójcicka

10 Mention detection

Detection of mentions is directly related to the mention model, which in our case was simplified on the grounds of a general assumption that our study should be restricted to nominal constructs only. The PCC annotation guidelines identify 4 major types of such expressions:
1. single-segment nouns and pronouns, identifiable with a simple morphological analyser
2. nominal groups – detected with a shallow parser with a grammar of Polish
3. zero subjects – detected with a custom solution
4. nominal named entities – detected with a named entity resolver.
All the above-mentioned tools are currently available for Polish and were used in the mention detection process. The following sections provide a short overview of those tools along with information on their use in the process.

10.1 Simple nouns and pronouns
The basic morphological analysis was performed with Morfeusz PoliMorf (Woliński, 2014),¹ the morphological analyser for Polish with the largest dictionary available (featuring approx. 415,000 lexemes, including 220,000 nominal ones). It was created in 2012 as a merger of linguistic data coming from The Grammatical Dictionary of Polish (Saloni et al., 2007) and the then competitive Morfologik (Miłkowski, 2010). It uses a positional tagset (Przepiórkowski and Woliński, 2003) offering complete morphosyntactic information for the segment. The ambiguities were resolved by Pantera (Acedański, 2010), a morphosyntactic tagger of Polish using an optimized version of the rule-based Brill algorithm adapted to inflectional languages. Tagging is performed in two steps, with a smaller set of morphosyntactic categories disambiguated in the first run (part of speech, case, person) and the remaining ones in the second run. Due to the free word order nature of Polish, the original set of rule templates as proposed by Brill has been extended to cover larger contexts.

1 See also http://sgjp.pl/morfeusz/index.html.












The following morphosyntactic categories were processed:²
– ppron3 – 3rd person pronoun, containing exactly one flexeme: on ‘he’, with lexically determined person (3rd), inflecting for number, case and gender; some forms additionally inflect for accentability and post-prepositionality (e.g., niego vs. go, ‘himacc’)
– ppron12 – non-3rd person pronoun; contains exactly four flexemes, which inflect for case and gender, but have lexically determined number and person: ja ‘I’, my ‘we’, ty ‘yousing’, wy ‘youplur’; some forms of the flexemes ja and ty additionally inflect for accentability (e.g., ci vs. tobie, ‘yousing.dat’)
– subst – noun; contains lexemes inflecting for number and case, with a lexically determined grammatical gender, which do not have the category of person, e.g., woda ‘water’, profesor ‘professor’, pięciokrotność ‘fivefoldness’; moreover, this class contains defective plurale tantum and singulare tantum lexemes, but not depreciative lexemes
– depr – depreciative form; contains depreciative flexemes, i.e., flexemes with fixed number (plural) and gender (animate masculine), defectively inflecting for case (only nominative and vocative), e.g., profesory ‘professorsdepr’, studenty ‘studentsdepr’
– ger – gerund; contains flexemes which inflect for number, case and negation, and have the lexical categories of gender (always neuter) and aspect, e.g., picie ‘drinkingimperf’ and wypicie ‘drinking upperf’.

10.2 Nominal groups
Nominal groups were identified with Spejd (Przepiórkowski and Buczyński, 2007), a universal engine for shallow parsing using cascade regular grammars, based on the results of the Polish tagger Pantera (Acedański, 2010) producing disambiguated tokenization, segmentation, lemmatization and morphological analysis. The parser was fitted with a grammar of the Polish language (Głowińska, 2012) encoded as a set of rules matching against orthographic forms or morphological interpretations of particular words. On the syntactic level, the grammar features two levels of description:
– syntactic words (10 types), linking simple segments into higher-level syntactically sufficient structures with a coarser granulation than simple segments (e.g. subst, ger and depr become Noun) and a representation of potential discontinuity
– syntactic groups (10 types), composed of syntactic words with marked syntactic and semantic heads.

2 Definitions taken from ISOCat, please see (Patejuk and Przepiórkowski, 2010).


Due to the identity-of-reference restriction, the established model of mentions in PCC is based on only one type of syntactic group detected by the grammar – nominal groups (NGs). The semantic and syntactic centre of the nominal group can be:
– a Noun
– a personal pronoun – represented by the syntactic words Ppron12 or Ppron3, directly corresponding to the pronominal morphosyntactic classes ppron12 and ppron3
– the pronoun siebie ‘self’ – corresponding to the syntactic word Siebie and the morphosyntactic class/element siebie.
The classification of the nominal group features a number of subtypes³:
– coordinated nominal group – NGk (Jan albo Maria ‘John or Mary’, rządu i parlamentu ‘of the government and the parliament’)
– nominal group with a nominal modifier:
  – apposition, with the same value of the case for both heads – NGs (terroryści samobójcy ‘suicide terrorists’)
  – with the head of the second component in the genitive – NGg (brat ojca ‘the father’s brother’, sala posiedzeń Senatu ‘the conference room of the Senate’, zabójca króla Henryka IV ‘King Henry IV’s assassin’)
– nominal group with a numeral modifier – NGn, with the head of the numeral group in the genitive (kurtki trojga dzieci ‘the jackets of three children’)
– nominal group with an adjectival modifier with case, gender and number agreement – NGa (miła dziewczyna ‘a nice girl’, bieżące wydarzenia polityczne ‘current political affairs’)
– nominal group with a particle modifier – NG (prawie geniusz ‘almost a genius’, (przed) niespełna rokiem ‘almost a year (ago)’)
– nominal group composed of an abbreviation and a nominal/adjectival compound – NGb (woj. mazowieckie ‘Masovian Voivedoship’, Prof. Nowak)
– nominal group with an indefinite pronoun (which is a noun in our tagset) as head of the elective construction – NGe (ktoś spośród nich ‘someone of them’)
– nominal group with a third-person pronoun and an adjective – NGx (coś specjalnego ‘something special’)
– quoted nominal group – NGc
– numeral group treated as nominal – NGl (dwa rowery ‘two bicycles’)
– nominal group with relative clauses – NGkg (szpieg, który mnie kochał ‘the spy who loved me’)
– special groups NGadres, NGgodz and NGdata for describing addresses, hours and dates.
Syntactic annotation in the National Corpus of Polish was limited to joining words together into constituents. The Spejd grammar used in the PCC annotation was a modified version of the NKJP grammar,

3 Examples from (Głowińska, 2012, p. 115–116).

but due to the fact that the NKJP nominal groups were different from the CORE nominal phrases (e.g. no nesting was allowed, no relative clauses were originally attached to nominal phrases, numeral groups were a separate group type etc.), certain modifications had to be made to the grammar to make it compatible with the annotation guidelines from our project. This adaptation is described separately in Section 10.3.

10.3 Nested mentions
The configuration for detecting nominal groups described above⁴ was successfully used for automatic tagging of NKJP. However, nested nominal groups with disparate referents (cf. prezes firmy ‘CEO of a company’) have never been targeted by the NKJP grammar. The reason for that was the extensive character of NGs in the project. They were constructed to contain as many elements of a given type as possible; for example, in a phrase composed of consecutive nouns in the genitive case such as propozycji wyznaczenia daty rozpoczęcia procesu wprowadzania reformy ustroju ‘proposal for setting the date of launching the process of introducing reform of the system’, the whole phrase was the only reported nominal group despite the fact that seven other nested nominal phrases with distinct referents were internally detected by the recognition process. Therefore, for the sake of coreference resolution compatible with our guidelines, the task of making such intermediate constructs available in the output seemed an important one.
Another reason for creating a separate version of the grammar was the difference in the borderlines of nominal phrases set in the coreference annotation guidelines as compared to NKJP, where, above all, syntactic criteria were taken into account. The PCC nominal phrase consists not only of adjectives, nouns, gerunds, conjunctions (coordinated groups) and subordinate numerals, but also of superior numerals (e.g., trzy dziewczynki ‘three girls’), relative subordinate clauses (e.g., kwiaty, które dostałam wczoraj ‘the flowers that I got yesterday’), prepositional phrases, as well as adjectival participles. The complexity of the task is further raised by PP-attachment or by similar ambiguities involving potentially post-modifying adjectival participles.

10.3.1 Reorganisation of the grammar
The original NKJP grammar does not always reveal the proper internal structure of nominal groups due to the order and structure of the rules, which are designed to detect the longest possible sequence irrespective of whether the group is nested or not.

4 This section reuses and extends parts of the material from (Ogrodniczuk and Przepiórkowski, 2014); please see the original for details of the nested grammar preparation process.


For example, the old version of the grammar detects the group bardzo małym druczkiem ‘in very small print’ as consisting of two parts: the adjectival group bardzo małym ‘very small’ and the noun druczkiem ‘print’; the structure of the group can be shown in this way: [[bardzo małym] druczkiem]. This division is not entirely correct, as the whole group is not nested (it is just a nominal phrase with an adjectival attribute) and should be interpreted as a group without children: [bardzo małym druczkiem]. Obtaining such an interpretation requires reconstruction of the grammar. On the other hand, the nested group usług firmy ‘services of the companygen’ is interpreted as a group without children, [usług firmy], by the original version of the grammar. The nested syntactic interpretation should consist of three groups: the whole phrase [[usług] [firmy]], with preserved information about the two component groups: usług (which is marked as the syntactic and semantic head of the group) and firmy. From the perspective of the coreference annotation guidelines (see Chapter 5), such an interpretation is excessive, since a nested nominal phrase should be marked as separate from the superior phrase only when its syntactic/semantic head is other than the head of the superior phrase, which is not the case with the component usług and the superior usług firmy. However, such cases can be easily handled in post-processing (described in Section 10.6).

10.3.2 Rule modification
In the first step, the rules in the grammar were reordered:
– rules for syntactic groups without nesting were placed in the initial section of the model
– within types, from the broadest to the narrowest
– more specialized rules (e.g. detecting addresses or dates) before less specialized ones
– less frequently used rules before the more frequent ones.
In case of ambiguities, certain decisions had to be made. For instance, nominal-nominal groups are nested in most cases, but there are exceptions such as proper names of people (Jan Kowalski) or appositions (malarz pejzażysta ‘landscape painter’). Their genitive versions (e.g. Matki Teresy) have two interpretations: ‘Mother Teresagen’ or ‘Mother of Teresagen’; the first not nested, unlike the second. Our solution consists in creating nested groups from two subsequent nouns only if both are in the genitive and their orthographical forms begin with a lower case letter (so our group from the previous example would be recognized as not nested).
There are four main types of nested groups: case-governed groups, prepositional groups, coordinated groups (conjunction-governed groups) and relative clauses. Prepositional groups were excluded from this attempt since they are often very difficult to distinguish – not only by parsers, but also by native speakers – from groups with prepositions governed by verbs or groups governed by other nominal groups.

For example, the text Jaś obserwuje Marysię przy jedzeniu can be interpreted as ‘John is watching Mary while eating’ or ‘John is watching how Mary eats’.
Since different types of nested groups can be embedded in all other types of groups (e.g., a coordinated group in a case-governed group and vice versa; a relative clause in a coordinated group and vice versa), the rules detecting various types of groups must be placed alternately. For example, the group bandy partyzantów i terrorystów ‘gangs of partisans and terrorists’ is made out of two smaller groups: the one-element group bandy ‘gangs’ and the coordinated group partyzantów i terrorystów ‘partisans and terrorists’. If the rules detecting coordinated groups were placed first, the grammar would find the group partyzantów i terrorystów, and in the second step the group bandy partyzantów i terrorystów would be created, which is the desirable result. However, the situation is more complex. There also exist groups such as naszego państwa oraz sposobu realizacji ‘(of) our state and way of realisation’. The internal structure of the group is [naszego państwa] oraz [sposobu realizacji], so there is a group with nesting within the coordinated group. If the rules for coordinated groups were at the beginning of this part of the grammar, an incorrect group such as państwa oraz sposobu ‘our state and way’ would have been created. In order to solve the problem, the first group of rules (detecting case-governed groups) is restricted only to contexts without a comma or conjunction on the right side of the given string (the group bandy partyzantów from bandy partyzantów i terrorystów is not found in the first step; on the other hand, the group sposobu realizacji being a part of naszego państwa oraz sposobu realizacji is detected). After this set of rules, the rules responsible for coordinated groups are placed, so the groups partyzantów i terrorystów and naszego państwa oraz sposobu realizacji are found. Then, the first set of rules must reappear so that the whole group bandy partyzantów i terrorystów can be detected. The whole procedure is repeated to detect longer groups and should be applied also to relative clauses (in the recent version of the grammar this method is used only for case-governed and coordinated groups).

10.3.3 Problematic cases
For syntactic groups of several types, reorganization of the grammar posed a serious problem due to their dual nature. Selected problems of this type are discussed below.

10.3.3.1 NP-NP groups
The problem concerns mainly groups composed of two or more nominatives. Most of them are nested, but there are also dubious and borderline cases. Linguistic analysis showed many annotation errors in this group (20% problematic expressions, which amounts to 490 groups among approx. 2100), which proves the difficulty of the mention detection task even for human annotators.


The first subtype of the problematic cases are appositions. As mentioned above, the annotation guidelines clearly define their constituents as being of equal status, although they were often wrongly marked as nested. Examples of not nested nominative groups include pies przewodnik ‘a guide dog’ or pan poseł ‘polite phrase sir + Member of Parliament’.
Another problematic group are named entities. From the coreference point of view, a named entity forms a single mention. Many named entities are, however, syntactically nested. Since it is very hard to automatically distinguish a named entity with nesting from a nested syntactic common group, the decision was taken that all nested named entities should be treated like common groups. There are nested common groups consisting of two named entities, e.g. Jan Marysi ‘Mary’s John’; on the other hand, a group with a common noun at the beginning of the sentence (and therefore capitalized) and a named entity looks like a named entity, e.g. Siostra Jana ‘John’s sister’. Due to this decision, the name Rada Europy ‘Council of Europe’ should be annotated as a nested group. Idioms are treated accordingly, e.g. słowo honoru ‘word of honour’ is annotated as consisting of two smaller groups: słowo and honor.
Most numeral groups are not nested, but there are groups consisting of a substantive (with numeral meaning) and a common noun, for example miliony dolarów ‘millions of dollars’. In the recent version of the grammar, such groups are described as nested.

10.3.3.2 PP-NP groups
From a strictly morphological point of view, possessive pronouns are adjectives. Therefore, groups consisting of a possessive pronoun and a noun are treated as not nested despite the fact that some possessive pronouns are annotated as personal pronouns in the genitive (there is a special part of speech for them in the NKJP tagset: ppron3). As opposed to adjectival-nominal groups, there is no agreement between the elements of this type of group (cf. jego córka ‘hisgen:masc daughternom:fem’ vs. ładna córka ‘prettynom:fem daughternom:fem’).

10.3.3.3 Dates, addresses and numbers
Some sequences in the corpus are automatically tagged as ‘ignored words’ (with the NKJP tag ign, see Section 17.2.1). They are not taken into consideration in the grammar – apart from all numbers written in digits, automatically annotated as ignored words. For that reason, the grammar contains some special rules responsible for creating groups with digits, both Arabic and Roman. For example, the NGdata rule cited below detects, among other things, the following groups: 3 XI 1943, 11 VIII, 5–6 VIII.

Rule    "NGdata"
Match:  [orth~"[0-9][0-9]?[--]?[0-9]?[0-9]?"]
        (ns? [orth~"-"] ns? [orth~"[0-9][0-9]?"])?
        [base~"I|II|III|IV|V|VI|VII|VIII|IX|X|XI|XII"]
        [orth~"[1-9][0-9]*"]?
        [base~"rok"]?;
Eval:   group(NGdata,3,3);

10.3.4 Reorganization results
In order to check the quality of the new grammar, ca. 10% of the mentions detected by the grammar, both nested and not nested, were checked by a linguist. The set comprised 4825 not nested groups and 3773 nested groups, all manually reviewed. Among not nested groups, 125 errors were found (ca. 2.6% of all reviewed groups), whereas among nested groups, 458 error occurrences were detected (ca. 12.1%). More details are shown in Tables 10.1 and 10.2.

Table 10.1. Simple groups

Rule group name                  Occurrence count        Error count   Proven occurrences
                                 in the whole corpus                   count
tytuły (titles)                  547                     7             55
NGadres (addresses)              96                      0             10
NGdata (dates)                   1434                    2             169
NGgodz (hours)                   181                     1             19
NGl (numeral-substantive)        4403                    30            417
NGs (substantive-substantive)    2707                    28            270
NGa (adjective-substantive)      33,851                  41            3652
NGx (pronoun-adjective)          176                     1             20
NGb (abbreviation-substantive)   323                     8             35

Table 10.2. Nested groups

Rule group name                  Occurrence count        Error count   Proven occurrences
                                 in the whole corpus                   count
NGadres (addresses)              74                      0             20
NG2 (NG with 2 nested elements)  21,822                  236           2259
NG3 (NG with 3 nested elements)  3743                    48            426
NG4 (NG with 4 nested elements)  626                     22            104
NG5 (NG with 5 nested elements)  73                      0             10
NGkg (relative clauses)          2581                    91            260
NGk (coordinated groups)         5692                    61            694

10.4 Zero subjects⁵
Zero subjects constitute a separate category of markables; they are commonly used in Polish, yet not as easy to identify as nominal groups (see Section 5.3.3). Omission of an explicit subject is very typical in all Balto-Slavic languages (e.g. Polish) and some Romance languages; our tests show that about 30% of verbs do not carry explicit subjects in Polish. Russo et al. (2012) report similar figures: 30.42% for Italian and 41.17% for Spanish. Moreover, null subjects often belong to large coreference clusters. Non-singleton coreference clusters containing at least one zero subject in our development corpus have 5.89 mentions on average, while for all non-singleton clusters the average size is 3.56 mentions.
Due to the high occurrence count of zero subjects, their identification is an important part of a high-quality mention detection module, and it heavily influences the final coreference resolution score. As reported in (Ogrodniczuk and Kopeć, 2011a), our coreference resolution module working on gold mentions achieved 82.90% F1 BLANC, as opposed to 38.13% for the same module working on system mentions (at the time when the zero subject detection module was not implemented).

10.4.1 Null subject detection difficulties
The problem of detecting a null subject can be stated as follows: when given a clause with a personal form, we need to decide whether the clause contains a nominal expression as its explicit subject.
Since the scope of even such traditional parts of speech as verbs or nouns is fuzzy (with e.g. gerunds on the borderline, sharing aspect and a productive relation with verbs, and declination with nouns), the most commonly used tagset of Polish uses more precisely delimited grammatical classes based on the notion of flexeme (Bień, 1991), which is finer-grained than traditional parts of speech. For the sake of the null subject detection task, we have assumed a coarser-grained part of speech definition indicating whether a word with a given part of speech may fulfil the role of a subject (Noun) or a predicate (Verb) in the sentence. Table 10.3 presents the assignment of the flexemic classes to our definition and the availability of selected morphosyntactic categories for each class.

5 This section is an extended and updated version of (Kopeć, 2014).

Table 10.3. Parts of speech

Coarse-grained POS   POS                                  Tag       Number   Case   Gender   Person
Noun                 Noun                                 subst     +        +      +
                     Depreciative form                    depr      +        +      +
                     Main numeral                         num       +        +      +
                     Collective numeral                   numcol    +        +      +
                     Gerund                               ger       +        +      +
                     Personal pronoun – 1st, 2nd person   ppron12   +        +      +        +
                     Personal pronoun – 3rd person        ppron3    +        +      +        +
Verb                 Non-past form                        fin       +                        +
                     Future być                           bedzie    +                        +
                     Agglutinate być                      aglt      +                        +
                     L-participle                         praet     +               +
                     winien-like verb                     winien    +               +

The assignment was inspired by Głowińska (2012) with a few differences:
– numerals, gerunds and pronouns are Nouns because they are frequently subjects of the sentence and have the same morphosyntactic information as ‘standard’ nouns
– the forms of siebie (“self”, traditionally treated as a pronoun) are not Nouns, as they cannot make a subject
– impersonal verbs belonging to the impt, imps, inf, pcon, pant, pact, ppas, pred classes are not Verbs because they cannot take a subject.
From now on we will use the words ‘noun’ and ‘verb’ to refer to Noun and Verb respectively. It is worth noting that Polish is a free word order language; therefore, there are many possible places for the subject to appear with respect to the position of the verb. It may come (not necessarily directly) before or after the verb.
In this study, we do not attempt to handle the cases of subjects not being nouns, as in the following example:
(10.1) Niestety znaleźli się tacy.
‘Unfortunately there were ones like thisadj.’
Judging from the analysis of the errors of a few classifiers (in the following subsections), it is a very rare case, as not a single example with a subject other than a noun was found.


Table 10.4. Zero subject study data statistics

Corpus        # texts   # sentences   # tokens   # verbs   # mentions   # verb mentions
Development   1264      23,066        378,107    37,860    126,214      11,263
Evaluation    530       10,066        162,111    15,847    54,027       4623
Total         1794      33,132        540,218    53,707    180,241      15,886

Furthermore, in Polish the subject is not always in the nominative case, as in the example:
(10.2) Pieniędzy nie starczy dla wszystkich.
‘There won’t be enough moneygen for everyone.’
This case is more frequent and was taken into account in our solution.
As the corpus has only automatic morphosyntactic information available (provided by the Pantera tagger (Acedański, 2010)), not corrected by the coreference annotators, the only verbs considered in this study are the ones found by the tagger. If such a verb was marked as a mention by the coreference annotator, it constituted a positive example for our machine learning study, otherwise a negative one. Sentence segmentation in the corpus was automatic too. We are aware that the corpus used for the study was not perfectly suited for the task – verbs with a zero subject are not marked there explicitly, but can only be found based on automatic tagging. However, the tagging error of detecting personal verbs⁶ is reported as not higher than 0.04%, so we consider the resource sufficiently correct.

10.4.2 Development and evaluation data
The presented research used all texts from the PCC corpus (version 0.91), and randomly split them into two parts: development and evaluation (see Table 10.4). 70% of the texts of each text type were taken for the development part, 30% (but not less than 1 text) for the evaluation part⁷.

10.4.2.1 Inter-annotator agreement
Estimation of the inter-annotator agreement for the task is a key indication of how close a system can get to the results of manual annotation. Based on the fact that

6 Represented with the fin tag; see (Acedański, 2010) for details.
7 Exact corpus split may be obtained with utils.Splitter class of Scoreference tool, available at http://zil.ipipan.waw.pl/Scoreference.

210 texts of the PCC were annotated independently by two annotators (see Section 6.2), this part of the corpus was analysed for the inter-annotator agreement of marking zero-subject mentions. Technically, the null subject mark was added to the verbal form following the position where the argument would have been expected – thus creating ‘verbal’ mentions (which are just a technical means for expressing the new artificial element). The inter-annotator agreement evaluation data contained 5879 verbs. One of the annotators marked 1812 of these as mentions, while the other annotator marked 1787. 1581 mentions were shared by both annotations, which yields an observed agreement of 92.57%. Agreement expected by chance (assuming a per-annotator chance annotation probability distribution) equalled 57.52%; therefore, the chance-corrected Cohen’s κ for the task equalled 82.51%.
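These figures follow directly from the counts above: both annotators marked 1581 verbs as mentions and both left 5879 − 1812 − 1787 + 1581 = 3861 verbs unmarked, hence

Po = (1581 + 3861) / 5879 ≈ 0.9257,
Pe = (1812/5879) · (1787/5879) + (4067/5879) · (4092/5879) ≈ 0.5752,
κ = (Po − Pe) / (1 − Pe) ≈ (0.9257 − 0.5752) / (1 − 0.5752) ≈ 0.8251.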

10.4.2.2 Results of full dependency parsing
As already mentioned, the main reason to treat zero-subject verb detection as a separate task is that most languages lack full parsers or their performance is far from satisfying. In the case of Polish, the first dependency parser was recently developed and described by Wróblewska (2012). The author reports 71% LAS⁸ and 75.2% UAS⁹ performance for this parser.
This parser was used to detect null subjects – every verb lacking a dependency relation of the subject type (subj) was marked as missing the subject. On the evaluation corpus, this baseline method achieved an accuracy of 68.18%, precision of 48.18%, recall of 90.35% and F1 equal to 62.82%. It is clear that a procedure of this sort too often predicts a verb as having a zero subject. These results are worse than those of a simple majority baseline classifier; therefore, the current state-of-the-art Polish dependency parsing is not a satisfactory solution to our task.

10.4.3 Development of the solution
A baseline model which predicts that a verb always has an explicit subject achieves 70.25% accuracy on the development data, as out of 37,860 verbs, 26,597 have an explicit subject. The upper bound, as stated in the previous subsection, is somewhere around 92.57% accuracy. First, two approximations of the final solution were evaluated only on part of the development set, as at the time the full data was not yet available.

8 Labelled attachment score – the percentage of tokens that are assigned a correct head and a correct dependency type.
9 Unlabelled attachment score – the percentage of tokens that are assigned a correct head.


10.4.3.1 First approximation – high recall study
Our first idea was to create a classifier which marks a verb in a clause as having no subject every time there is no obvious candidate for the subject in that clause. ‘Obvious’ means in the nominative case and agreeing with the verb in terms of morphosyntactic information. This agreement was checked based on the information available for both the verb and the noun, so, for example, if the verb was tagged as praet and the noun was tagged as ppron12, the agreement of number and gender was checked (see Table 10.3 for the details about the information available for each tag).
Sentences were split into clauses using the following heuristics: first, the sentence was split into candidate clauses at any of the following tokens: i ‘and’, albo ‘or’, lub ‘or’ and the characters: comma, semicolon, colon, parentheses, hyphen, en-dash, quote. Then, moving from the beginning of the sentence, each clause was checked for the presence of a verb. If none was found, it was merged with the next clause. Then, all nouns in the nominative case belonging to the clause with the analysed verb were selected as candidates for the subject. Next, a filtering step was performed: on the basis of the available morphosyntactic information about the verb and the nouns, some nouns were removed from the subject candidate set if they were found to be incompatible with the verb. Regardless of the verb tag, number information was always available; therefore, all nouns with a number different from the verb were discarded. In case of the keyword verb being of the praet or winien type, nouns having a different gender than the verb were also filtered out. In case of another verb type (fin, bedzie or aglt), all nouns not being ppron12 or ppron3 with person agreement with the verb were removed from the candidate set. If no candidate was left, the verb was marked as missing a subject.
Naturally, such a classifier tested on a part of the evaluation corpus (390 texts) achieved a very high recall of ≈ 97.7%, but a precision of only ≈ 22.56%. It generated only 40 false negatives out of all 10,719 examples in the development corpus, which are carefully analysed in Table 10.5.
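The heuristic just described can be condensed into the following sketch. The token attributes (orth, tag, case, number, gender, person) and the helper structure are illustrative assumptions, not the actual data structures used in the project.

# Condensed sketch of the high-recall heuristic described above; token attributes and
# the clause representation are illustrative assumptions.

SPLIT_TOKENS = {"i", "albo", "lub", ",", ";", ":", "(", ")", "-", "–", '"'}
VERB_TAGS = {"fin", "bedzie", "aglt", "praet", "winien"}
NOUN_TAGS = {"subst", "depr", "num", "numcol", "ger", "ppron12", "ppron3"}

def split_into_clauses(sentence_tokens):
    clauses, current = [], []
    for tok in sentence_tokens:
        if tok.orth.lower() in SPLIT_TOKENS:
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        clauses.append(current)
    # merge a clause containing no verb with the following clause
    merged = []
    for clause in clauses:
        if merged and not any(t.tag in VERB_TAGS for t in merged[-1]):
            merged[-1] = merged[-1] + clause
        else:
            merged.append(clause)
    return merged

def has_explicit_subject(verb, clause):
    for noun in (t for t in clause if t.tag in NOUN_TAGS):
        if noun.case != "nom" or noun.number != verb.number:
            continue
        if verb.tag in {"praet", "winien"}:
            if noun.gender == verb.gender:
                return True
        else:  # fin, bedzie, aglt: require a personal pronoun with person agreement
            if noun.tag in {"ppron12", "ppron3"} and noun.person == verb.person:
                return True
    return False

# a verb is marked as having a zero subject when has_explicit_subject() returns False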

Table 10.5. False negative error analysis of the high-recall classifier

Error category                  Type of error                                                Count
Not an error                    Annotator’s fault                                            8
Tagger error                    No sentence segmentation in spoken texts (no full stops)     14
                                Wrong case (nom instead of acc)                              3
Clause segmentation problems    Problems with narrator/speaker text division                 2
                                Clause split at only one of a () or quote pair               2
                                Missing commas in text                                       3
                                Clause split at comma between adjectives                     1
Hard problems                   jak/jako (as)                                                5
                                Gdy ocknął się drugi raz. . .                                1
                                Cały czas chodził . . .                                      1

Two improvements were proposed as the result of this experiment. The first was to enhance clause segmentation by forcing a split in the case of question and exclamation mark characters, regardless of the verbs found, and by not allowing a split inside a syntactic group found by the shallow parser Spejd (Przepiórkowski and Buczyński, 2007). Secondly, to overcome the problems with erroneous sentence segmentation in texts consisting of transcribed speech, the subject search can be restricted not to a clause, but alternatively to a window of words before and after the verb.
An interesting finding was the frequent case of the words jako or jak followed by a noun in the nominative case. In Example (10.3), the noun phrase głównodowodzący armii is a perfect morphological match for the subject of the verb phrase nie miał, yet it is rather a description of the role of the null subject on ‘he’.
(10.3) . . . jako głównodowodzący armii ⌀ nie miał prawa nawet startować.
‘. . . as the army’s commander-in-chief he didn’tsing:masc even have the right to start.’

10.4.3.2 Second approximation – high precision study
The second step towards the final solution was to investigate errors of a classifier focused on high precision – i.e. to analyse difficult cases of verbs which seem to have no explicit subject in the clause, but in fact do have one (or do not require one). This time, a cost-sensitive, sample-weighting version of the Random Forest algorithm was used (implemented in WEKA (Hall et al., 2009)), and the cost of a false positive was 5 times higher than the cost of a false negative. The features used included: the tag of the previous and next word, and the presence of a noun morphosyntactically compatible (to various extents) with the verb either in the clause, in the sentence or in a window around the verb (of various sizes). This time, the classifier tested on a 390-text sample of the evaluation corpus (the same as in the first approximation) achieved 78.44% accuracy, 97.46% precision and 26% recall. Only 21 false positives occurred, which are analysed in Table 10.6.

Table 10.6. False positive error analysis of the high-precision classifier

Error category                  Type of error                                                                Count
Not an error                    Annotator’s fault                                                            11
Tagger error                    Wrong gender/number/tag                                                      3
                                Unrecognized (unknown) noun                                                  1
Clause segmentation problems    Clause starting with a comma and the relative pronoun który/jaki (‘which’)   2
Other problems                  Pseudo-verbs (not requiring a subject)                                       3
                                Imperfect classifier learning                                                1


From this experiment, the conclusion was to find a way to incorporate knowledge about pseudo-verbs, i.e. verbs which do not require a subject, for example:
(10.4) Bywało, że niektórzy . . .
‘(It) happened that some of them . . . ’
Another area for improvement was clause segmentation – commas followed by the relative pronoun który/jaki ‘which’ should not trigger clause splitting.

10.4.3.3 The final setting

Based on the two experiments described above, we present the features which we developed for machine learning of null subject detection. We designed:
– 3 features of the verb:
  – is the verb on the pseudo-verb list extracted from (Świdziński, 1994) – i.e. it may not require a subject (boolean feature)
  – number of the verb – to help with cases of plural verbs having two or more singular nouns as subject (nominal feature)
  – tag of the verb – as it may happen that some verb types lack an explicit subject more frequently than others (nominal feature).
– 6 features of the tokens surrounding the verb:
  – tag of the next token (nominal feature)
  – tag of the previous token (nominal feature)
  – is the previous tag equal to praet – a feature redundant with the previous one, but it should help with cases like:

    (10.5) . . . była[praet] m[aglt:pri] . . .
    ‘. . . (I) was . . . ’

    when we split a word into an L-participle and an agglutinate. The annotation guidelines were to mark only the agglutinate as a mention when the verb does not have an explicit subject (boolean feature)
  – does one of the previous two tokens have the pred tag – should allow detecting examples similar to:

    (10.6) Można[pred] się było[praet] tego spodziewać.
    ‘It could have been expected.’
    (10.7) Trzeba[pred] było[praet] myśleć wcześniej.
    ‘(One) should have thought before.’

    when było ‘was’ cannot have a subject, as it is part of an impersonal construction (boolean feature)
  – is the next tag equal to inf – a similar role to the previous feature, as in the example:

    (10.8) Wtedy należy[fin] poprosić[inf].
    ‘(One) should then ask for it.’

    when należy ‘should’ cannot have a subject (boolean feature)
  – is the previous token a comma (boolean feature).
– 2 features about length (following the hypothesis that the shorter the sentence/clause, the less likely the subject is to appear):
  – number of tokens in the sentence (numerical feature)
  – number of tokens in the clause with the verb (numerical feature).

We also generated a large number of features examining the existence of a noun (or two nouns) in a given window around the verb. This noun (or these nouns) should also fulfil some requirements to be a good candidate for the subject of the given verb. The considered windows were:
– the clause containing the verb
– the whole sentence containing the verb
– windows of 1 to 5 tokens before the verb
– windows of 1 to 5 tokens after the verb
– windows of 1 to 5 tokens both before and after the verb.
For every window, we generated a number of boolean features describing whether this window contained a noun not preceded by jak/jako and fulfilling one of the following sets of conditions:
– case of the noun equal to nominative – NOM
– number agreement with the verb – NUM
– person or gender agreement – POG (depending on which was available to check, see Table 10.3)
– NUM and POG
– NUM and NOM
– POG and NOM
– NOM and NUM and POG
or at least two nouns, both fulfilling the conditions:
– NOM
– POG
– NOM and POG.
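To make the window-based feature generation concrete, the sketch below shows how such boolean features could be derived. It is an illustration under simplifying assumptions: the Token attributes (pos, case, number, base) and the person/gender agreement helper are placeholders, not the actual CORE code.

    # Minimal sketch of window-based noun features for a given verb.
    def noun_candidates(tokens, verb_idx, window):
        """Nouns inside a +/- window of tokens around the verb,
        skipping nouns directly preceded by 'jak'/'jako'."""
        lo = max(0, verb_idx - window)
        hi = min(len(tokens), verb_idx + window + 1)
        nouns = []
        for i in range(lo, hi):
            if i == verb_idx or tokens[i].pos != "subst":
                continue
            if i > 0 and tokens[i - 1].base in ("jak", "jako"):
                continue
            nouns.append(tokens[i])
        return nouns

    def window_features(tokens, verb_idx, window, verb):
        nouns = noun_candidates(tokens, verb_idx, window)
        nom = [n for n in nouns if n.case == "nom"]
        num = [n for n in nouns if n.number == verb.number]
        pog = [n for n in nouns if agrees_person_or_gender(n, verb)]  # hypothetical helper
        return {
            f"w{window}_NOM": bool(nom),
            f"w{window}_NUM": bool(num),
            f"w{window}_POG": bool(pog),
            f"w{window}_NUM_and_POG": any(n in num for n in pog),
            f"w{window}_two_NOM": len(nom) >= 2,
        }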

Fig. 10.1. Histogram of accuracy tested 100 times (horizontal axis: accuracy, 0.840–0.860; vertical axis: count)

10.4.3.4 Accuracy on the development corpus

The presented features were first tested on the full development corpus. We chose the JRip implementation of RIPPER (Cohen, 1995) from WEKA as the learning algorithm. We used 10-fold cross-validation, repeated 10 times with different random seeds for the train/test splits. The average over the total of 100 trials (each cross-validation split counted separately) was 84.97%, with a standard deviation of 0.517%. As the Shapiro-Wilk test (Shapiro and Wilk, 1965) for normality gives a p-value of 0.24 for this data, it may be assumed to follow a normal distribution. In that case, the 95% confidence interval for the accuracy is [84.87%, 85.07%]. The histogram is presented in Figure 10.1.
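The evaluation protocol can be sketched as follows in Python; this is not the original WEKA setup – scikit-learn's decision tree is used only as a stand-in for JRip, and X, y denote the feature matrix and labels.

    import numpy as np
    from scipy import stats
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier  # stand-in rule-like learner

    def repeated_cv_accuracy(X, y, repeats=10, folds=10):
        scores = []
        for seed in range(repeats):
            cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
            scores.extend(cross_val_score(DecisionTreeClassifier(random_state=seed),
                                          X, y, cv=cv, scoring="accuracy"))
        scores = np.array(scores)            # 100 per-fold accuracies
        _, p_normal = stats.shapiro(scores)  # normality check (p = 0.24 in the study)
        mean, sd = scores.mean(), scores.std(ddof=1)
        ci = stats.norm.interval(0.95, loc=mean, scale=sd / np.sqrt(len(scores)))
        return mean, sd, p_normal, ci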

10.4.4 Evaluation

The development corpus was used during the development of the final features; therefore, it is not suitable for the final evaluation of the proposed solution. The evaluation corpus, consisting of 530 texts (for detailed statistics see Table 10.4), was used only for the two experiments described below.

10.4.4.1 Accuracy on the evaluation corpus

We used the RIPPER model trained on the development corpus and tested it on the evaluation corpus, achieving 85.55% accuracy. A majority classifier would achieve 70.83% accuracy. The confusion matrix is depicted in Table 10.7. For finding null subjects, a recall of 76.74% and a precision of 72.42% give an F1 measure of 74.52%. As the result is similar to the one obtained for the development corpus, overfitting is not likely.

Table 10.7. Confusion matrix of the model trained on the development corpus and tested on the evaluation corpus

                                      True values
Predictions            Null subject       Explicit subject
Null subject           3348               1275
Explicit subject       1015               10,209
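As a quick sanity check, the reported scores for the null subject class can be recomputed directly from the confusion matrix:

    # Table 10.7 with predictions in rows and true values in columns.
    tp, fp = 3348, 1275    # predicted null subject
    fn, tn = 1015, 10209   # predicted explicit subject

    accuracy  = (tp + tn) / (tp + fp + fn + tn)                # ~0.8555
    precision = tp / (tp + fp)                                 # ~0.7242
    recall    = tp / (tp + fn)                                 # ~0.7674
    f1        = 2 * precision * recall / (precision + recall)  # ~0.7452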

Fig. 10.2. Learning curve (horizontal axis: fraction of training data used, 0.2–1.0; vertical axis: accuracy on test data, 0.82–0.85)

10.4.4.2 Learning curve

To test how the number of training examples influences the quality of the trained classifier, we used subsets of the development corpus of various sizes as training sets. The test set was the same in all cases (the evaluation corpus). The proportions of examples used ranged from 5% to 100% of the development corpus, and each proportion was tested 10 times to provide some estimate of the variance. For example, to evaluate the efficiency of the classifier trained on 5% of the training examples, we randomly sampled 5% of the examples, trained the classifier and tested it on the full evaluation corpus. Then we repeated this another 9 times, randomly choosing a different 5% portion of the examples for training. The Shapiro test was performed to assess the normality of the results for each proportion; out of the 19 proportions tested (the proportion of 1 was of course not tested for normality), only 3 had a p-value lower than 0.1; therefore, we assumed that the data is distributed approximately normally. The 95% confidence intervals of the classifiers trained on a given proportion of the development corpus are shown in Figure 10.2. The algorithm clearly benefits from having more training examples. When using the full training set – 1.0 on the horizontal axis – the accuracy reaches 85.55%, as stated in the previous section. Yet using 75% of the training set gives a mean accuracy of 85.33%, which may be considered a good compromise between manual annotation effort and efficiency. Moreover, the shape of the curve may suggest that the developed solution would not be able to significantly exceed 85%, even given more training examples.
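A compact sketch of this learning-curve protocol is shown below; again, the scikit-learn decision tree is only a stand-in for WEKA's JRip, and X_dev/y_dev and X_eval/y_eval are assumed to be NumPy arrays for the development and evaluation corpora.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier  # stand-in for WEKA's JRip

    def learning_curve(X_dev, y_dev, X_eval, y_eval, repeats=10):
        rng = np.random.default_rng(0)
        results = {}
        for frac in np.arange(0.05, 1.01, 0.05):
            accs = []
            for _ in range(repeats):
                idx = rng.choice(len(y_dev), size=int(frac * len(y_dev)), replace=False)
                clf = DecisionTreeClassifier().fit(X_dev[idx], y_dev[idx])
                accs.append(clf.score(X_eval, y_eval))
            results[round(float(frac), 2)] = (np.mean(accs), np.std(accs, ddof=1))
        return results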

10.4.5 Results

Our experiment resulted in the development of an efficient zero subject detection module for Polish. The technique of using cost-sensitive learning to find interesting examples of null subjects proved very successful – it was also helpful for detecting many errors in manual annotation. We highlighted the importance of correct clause splitting for the task and proposed a solution for the Polish language. The achieved accuracy of 85.55% significantly exceeds the baseline of majority tagging, equal to 70.25%, but there is still room for improvement, as an upper bound of 92.57% was computed.

10.5 Named entities

Named entity-based mentions were identified with Nerf (Waszczuk et al., 2013) – a statistical CRF-based named entity recognition tool trained on the 1-million-word NKJP subcorpus. Its annotation scope is defined to cover a hierarchy of names, including relational adjectives and basic temporal expressions. In certain cases (e.g. personal names), a two-level taxonomy of annotation features is used (to split personal names into forenames and surnames). For the sake of nominal mention detection, the process was limited to all detected named entities which contained at least one nominal token (i.e. marked as subst, depr or ger in the morphosyntactic description). The following named entity categories were investigated:
– personal names – real or fictitious (Maria Skłodowska-Curie¹⁰, Król Artur ‘King Arthur’), additionally split into forename (Jean-Marie, Krzysztofowie), surname and additional name – pseudonym, pen name, stage name, nickname, dynasty etc. (Stefan Grot Rowecki, Jan Bez Ziemi ‘John Lackland’, Iwan Groźny ‘Ivan the Terrible’, Zygmunt Waza ‘Sigismund III Vasa’)
– names of organizations and institutions (Parlament Europejski ‘the European Parliament’, Kancelaria Prezydenta RP ‘Chancellery of the President of the Republic of Poland’, Polska Akademia Nauk ‘Polish Academy of Sciences’, National Aeronautics and Space Organisation, Royal Air Force)
– geographical names (Al. Jerozolimskie, Tatrzański Park Narodowy ‘Tatra National Park’, Mount Everest, Downing Street, the Great Lakes)
– geopolitical names: districts (Za Żelazną Bramą, the Bronx), settlements (Nowa Słupia, Murton), regions (powiat bieruńsko-lędziński, Cheshire), countries (Republika Czeska ‘Czech Republic’, New Zealand) and blocs (Unia Europejska ‘European Union’, the North Atlantic Treaty Organization)
– time expressions: dates (24 października 1945 ‘October 24th, 1945’, XXI wiek ‘21st century’) and times (pięć po dwunastej ‘five past twelve’, 9 sekund i 58 setnych ‘nine seconds and fifty-eight hundredths’)
– derived names of persons (poznaniak, AK-owiec, Dubliner, Orangemen).

10 Examples from (Savary, 2012, p. 133–136).


10.6 Mention detection chain

The information coming from the different layers of mention description described above is used by the MentionDetector tool (http://zil.ipipan.waw.pl/MentionDetector), which operates in three steps:
1. collection of mention candidates from the available sources of information – morphosyntactic, shallow parsing, named entity and/or zero anaphora detection tools (lack of any of these layers simply results in fewer mentions discovered)
2. removal of redundant/unnecessary candidates
3. update of mention head information.
At the first stage of the process, mention candidates are extracted from the morphosyntactic level by taking all tokens with noun or personal pronoun tags assigned by the parser. From the shallow parsing level, all syntactic nominal groups and syntactic words with noun or personal pronoun ctags are taken. Finally, from the named entity level, all named entities that contain at least one noun or pronoun token also become mention candidates. To enable zero subject processing, MentionDetector marks as a mention each verb in a sentence that does not contain any noun/pronoun token in the nominative case¹¹.
At the second stage, redundant mentions are detected by removing one of two mentions with exactly the same boundaries and exactly the same heads. When one mention is the head of another mention or when two mentions intersect, a “less important mention” is selected for removal, which basically means removing the shorter one (or, arbitrarily, the first one in the case of equal length). For example, pre-processing of the phrase największa zagadka lotnictwa cywilnego ‘the biggest mystery of civil aviation’ would produce the following mention candidates:
– lotnictwa ‘aviation’ (based on token tag or syntactic word tag)
– zagadka ‘mystery’ (based on token tag or syntactic word tag)
– lotnictwa cywilnego ‘civil aviation’ (based on syntactic nominal group)
– największa zagadka lotnictwa cywilnego ‘the biggest mystery of civil aviation’ (based on syntactic nominal group).
The task of the second stage is then to remove all duplicates (e.g. zagadka ‘mystery’ could be found both as a token with a noun tag and as a one-word nominal group). Next, finding mentions with the same heads results in removing lotnictwa ‘aviation’, as there is a broader mention lotnictwa cywilnego ‘civil aviation’ with the same head. Similarly, zagadka ‘mystery’ is removed for the same reason. At the final stage of the mention detection process, for mentions lacking an automatically detected head, the first token is arbitrarily marked as the head.

11 Marking verbs instead of adding empty tokens representing zero subjects is just a technical measure implemented in PCC to maintain the original text unchanged.
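The pruning performed at the second stage can be sketched as follows; this is a simplified illustration consistent with the worked example above, not the actual MentionDetector code, and the Mention structure is an assumed stand-in for the tool's data model.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Mention:
        start: int   # index of the first token
        end: int     # index one past the last token
        head: int    # index of the head token

    def crosses(a, b):
        # partial overlap: the mentions intersect but neither contains the other
        overlap = a.start < b.end and b.start < a.end
        nested = (b.start <= a.start and a.end <= b.end) or \
                 (a.start <= b.start and b.end <= a.end)
        return overlap and not nested

    def deduplicate(candidates):
        # 1. exact duplicates (same boundaries and same head) are kept only once
        unique = list({(m.start, m.end, m.head): m for m in candidates}.values())
        # 2. of two same-head or crossing mentions, the shorter one is dropped
        kept = []
        for m in sorted(unique, key=lambda m: m.end - m.start, reverse=True):
            if any(m.head == k.head or crosses(m, k) for k in kept):
                continue
            kept.append(m)
        return kept

Run on the example phrase, this keeps największa zagadka lotnictwa cywilnego and the nested lotnictwa cywilnego, while dropping zagadka and lotnictwa, as described above.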

Maciej Ogrodniczuk, Mateusz Kopeć

11 Rule-based approach

The rule-based coreference resolution module for Polish – Ruler (Ogrodniczuk and Kopeć, 2011a,b) – is a simple baseline tool planned as the initial step of the coreference project. Its design followed the approach described in (Haghighi and Klein, 2007), with a few hand-written rules and a standard best-first entity-based model using syntactic constraints (elimination of nested nominal groups), syntactic filters (elimination of syntactically incompatible heads), semantic filters (WordNet-derived compatibility) and selection (weighted scoring).

11.1 Resolution process

The tool uses gold mentions produced in the process described in Section 6.1, with additional semantic properties based on the Polish WordNet (Piasecki et al., 2009). The resolution process starts with mention-to-mention comparisons and the calculation of a ‘compatibility score’ for a given pair of mentions, followed by clustering and evaluation with the SemEval scorer by Recasens and Martí (2010).
For a new mention candidate, its compatibility with all previously constructed mention clusters is calculated, and the best cluster is selected (only when the score exceeds a threshold value). When more than one cluster results in the best score, the one containing the closest mention is selected. The compatibility of the candidate mention and a given cluster is defined as the maximum of the compatibility scores of the mention tested against each of the cluster’s mentions.
The scoring of the compatibility of two mentions starts with the value 0.5 for the investigated mention (which corresponds to equal chances of compatibility/incompatibility in the pair) and consists in applying the following 5 rules:
1. the gender/number rule eliminates syntactically incompatible matches, i.e. prevents mentions with a different gender or number from being marked as coreferent (elimination means setting the compatibility score to 0)
2. the including rule eliminates nested groups, not allowing two mentions with a non-empty intersection to be put into one cluster
3. the lemma rule, turned on for nominal groups only (not pronouns), promotes matches with identical heads and lowers the total score for incompatible heads
4. the WordNet rule, valid only for nominal groups which have a WordNet representation, increases the score when the topic word sets (containing synonyms, hyperonyms and fuzzynyms¹) intersect in more than 3 entries, and decreases it otherwise
5. the pronoun rule, valid for pronouns only, increases the score of matching pronouns with any other mentions, because pronouns mostly appear in text after a non-pronoun coreferent and, therefore, should be part of a cluster (it also lowers the score for incompatible first- and second-person pronouns, because they sometimes occur in texts without non-pronoun coreferents).
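The best-first, entity-based clustering loop described above can be sketched as follows. This is a minimal illustration, not Ruler's actual code: pair_score stands for the combined rule-based compatibility score, the threshold value is assumed, and the tie-breaking by closest mention is only noted in a comment.

    def resolve(mentions, pair_score, threshold=0.5):
        """mentions in text order; pair_score(m, n) -> compatibility after the 5 rules."""
        clusters = []  # each cluster is a list of mentions
        for m in mentions:
            best, best_score = None, threshold
            for cluster in clusters:
                # cluster compatibility = max over the cluster's mentions
                score = max(pair_score(m, n) for n in cluster)
                # ties should be broken in favour of the cluster containing the
                # closest mention; omitted here for brevity
                if score > best_score:
                    best, best_score = cluster, score
            if best is not None:
                best.append(m)
            else:
                clusters.append([m])
        return clusters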

11.2 Data sets and evaluation

The implemented module was designed to provide an environment for testing coreference rule sets, which facilitated creating two common baselines – all-singletons and head-match – plus slightly more complex, although still very straightforward, settings: one with all 5 rules described above and, additionally, another run with a smaller set of four rules (the WordNet rule turned off) to illustrate an interesting discovery. The four implemented rule sets were evaluated against four well-known coreference resolution evaluation metrics: MUC (Vilain et al., 1995), CEAF (Luo, 2005), B3 (Bagga and Baldwin, 1998) and BLANC (Recasens and Hovy, 2010). Because the output of the system was generated in SemEval format, there was no need to implement a new comparator, as we were able to use the script provided by the organisers of Task 1: Coreference Resolution in Multiple Languages from the SemEval-2010 competition (see Recasens et al. (2010b)). This script compares two files (both encoded in SemEval format) – one containing the gold standard annotations and the other created by the system under test. The output contains the results for each of the four metrics mentioned earlier, in terms of precision and recall as well as the F1 measure.

11.3 Results

The formal results computed on data from the initial annotation experiment (see Section 6.1) are presented in Table 11.1. The most interesting finding is that the WordNet rule, although usually improving recall, lowers precision, which can result from the fact that the topic sets can contain rarely used senses, producing false positives in unexpected contexts.

1 Fuzzynymy is a relation adopted from EuroWordNet (Vossen, 1998) and indicates a clear semantic connection between a pair of lexical units which cannot fit into the existing classification, such as burmistrz ‘mayor’ and ratusz ‘town hall’ or rzeźba ‘sculpture’ and wystawa ‘exhibition’ (see Piasecki et al., 2009, Chapter 2.2.5).


Table 11.1. Rule-based resolution results

                              MUC                               CEAF
System type                   R        P        F1             R        P        F1
All-sing.                     –        –        –              93.10%   67.64%   78.35%
All-sing. + head m.           50.73%   61.16%   55.46%         84.22%   79.14%   81.60%
5 rules                       75.36%   59.46%   66.48%         78.62%   87.42%   82.79%
4 rules (no WordNet)          74.73%   65.13%   69.60%         83.45%   88.36%   85.84%

                              B3                                BLANC
System type                   R        P        F1             R        P        F1
All-sing.                     72.65%   100.00%  84.16%         50.00%   49.18%   49.58%
All-sing. + head m.           84.17%   90.05%   87.01%         69.64%   84.54%   74.97%
5 rules                       90.56%   82.56%   86.37%         81.99%   78.39%   80.08%
4 rules (no WordNet)          90.35%   86.66%   88.47%         81.94%   83.92%   82.90%

For instance, the word strona ‘part’ can be marked as coreferent with ziemia ‘land’, because the expression strony ojczyste ‘native land’ exists in Polish and appears in WordNet. A more sophisticated method of using WordNet needs to be developed, most likely involving word sense disambiguation. As the amount of available gold data was very limited, the results of the process should be treated as preliminary. However, they have shown that Ruler offers a useful baseline for further work and that it can be confidently used in the pre-annotation process (see Section 6.3.4).

Mateusz Kopeć, Bartłomiej Nitoń

12 Statistical approach

12.1 First adaptation of BART for Polish

Our first machine learning approach to coreference resolution for Polish was an adaptation of the multilingual BART system (version 1.0). We will further refer to it as Bart-pl-1. The whole experiment, described by Kopeć and Ogrodniczuk (2012), may be summarized as follows:
1. Pre-processing was executed outside BART, using a pipeline similar to the one used by Ruler (see Chapter 11) for acquiring morphological, named entity and shallow parsing information. Gold mentions were used.
2. Training example selection was performed with the SoonEncoder algorithm, presented in Figure 12.1.
3. Feature extraction – the original BART offered 64 feature extractors to transform the training examples into features; however, using them out-of-the-box for languages other than English was problematic due to their language-specific settings. Although some of them are extracted into Language Plugins, which are supposed to increase the modularity of the toolkit by separating the non-language-agnostic parts of BART, a large number of the feature extractors still contain settings specific to English. For example, a feature extractor may take into consideration a specific (English) substring of the mention or the English definite article, not to mention obvious cross-lingual tagset incompatibilities. Another difficulty, this time an objective one, arises from the lack of certain types of language processing tools for Polish. Taking these into account, only 13 pair feature extractors were selected for the Bart-pl-1 setting:
   – FirstMention – extracting information whether a given mention is the first one in its mention chain
   – FirstSecondPerson – checking if mentions are first or second person
   – Gender, Number – extracting compatibility of gender/number of two mentions
   – HeadMatch – comparing heads of mentions
   – MentionType, MentionType Anaphor, MentionType Salience – providing a number of features based on mention types (for example, whether they are personal or reflexive pronouns)
   – DistDiscrete, SentenceDistance – providing information about the text distance between mentions in sentences
   – StringKernel, StringMatch, LeftRightMatch – feature extractors based on the orthographic similarity of mentions.
4. Classification – a J48 decision tree from WEKA (Hall et al., 2009) was used to train a mention-to-mention model classifying whether a mention pair is coreferent or not.
5. Clustering – the SoonDecoder algorithm from Figure 12.2 was used to apply the trained model to new examples.
6. Scoring – the data was exported into SemEval format and evaluated with the SemEval evaluation suite described by Recasens et al. (2010b), using the MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), CEAF (Luo, 2005) and BLANC (Recasens, 2010) metrics.

Fig. 12.1. SoonEncoder training example selection algorithm:

    for each mention m:
        for each mention n preceding m (from closest to furthest):
            if m and n are coreferent:
                for each mention o between n and m (including n, excluding m):
                    if o and m are coreferent:
                        generate positive example from pair (o, m)
                    else:
                        generate negative example from pair (o, m)

Fig. 12.2. SoonDecoder clustering algorithm:

    for each mention m:
        for each mention n preceding m (from closest to furthest):
            if coreferent(m, n):
                mergeClusters(m, n)
                break

These experiments yielded promising scores, similar to the Ruler system (see Chapter 11) evaluated on the same (small) dataset. As with other machine learning solutions, it may be expected that using a larger corpus for training would improve resolution accuracy and achieve results superior to Ruler. The Bart-pl-1 setting is a quite simple baseline, but one that uses a powerful machine learning technique. The tool and its configuration were not published in their original setting, but the presented features and processing algorithm were reimplemented in the Bartek system, presented in the last section of this chapter, which allows for re-evaluation of the procedure on any new dataset.
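A runnable Python rendering of the two pseudocode figures is given below as an illustration. Mentions are assumed to be hashable objects given in text order, gold_cluster maps a mention to its gold entity identifier, and classify(m, n) stands for the trained pairwise model; the break in the encoder (stopping at the closest coreferent antecedent, as in Soon et al.'s original scheme) is an assumption made explicit here.

    def soon_encode(mentions, gold_cluster):
        examples = []  # (candidate antecedent, anaphor, label)
        for j, m in enumerate(mentions):
            for i in range(j - 1, -1, -1):              # from closest to furthest
                n = mentions[i]
                if gold_cluster[n] == gold_cluster[m]:
                    for o in mentions[i:j]:             # including n, excluding m
                        label = gold_cluster[o] == gold_cluster[m]
                        examples.append((o, m, label))
                    break                               # assumed: stop at closest antecedent
        return examples

    def soon_decode(mentions, classify):
        cluster_of = {m: {m} for m in mentions}
        for j, m in enumerate(mentions):
            for i in range(j - 1, -1, -1):              # from closest to furthest
                n = mentions[i]
                if classify(m, n):                      # mergeClusters(m, n)
                    merged = cluster_of[m] | cluster_of[n]
                    for x in merged:
                        cluster_of[x] = merged
                    break
        return {frozenset(c) for c in cluster_of.values()}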

12.2 Second adaptation of BART for Polish

The second approach to adapting BART to Polish used an extended feature set, enlarging the one described by Nitoń (2013). As it is very similar to Bart-pl-1, we will refer to this attempt as Bart-pl-2.


12.2.1 Feature categories

Feature candidates were inspired by Uryupina (2007), who describes over 350 linguistic features which can be used to recognize coreference. They are considered language-independent (this hypothesis was also verified for Polish in Batko (2012)); therefore, 147 of them were implemented and tested in various combinations in BART. Uryupina grouped the features into 5 types:
– surface similarity (e.g. linking an orthographic entity name with its abbreviation; 88 features implemented for Polish)
– syntactic knowledge (e.g. traditional gender/number agreement; 9 implemented for Polish)
– semantic compatibility (e.g. agreement between semantic classes of mention heads; left for further study)
– discourse structure and salience (e.g. salience of topics; 46 features implemented for Polish)
– anaphoricity and antecedenthood (4 features implemented for Polish).
Moreover, the analysis covered 16 additional features present in BART and customized for Polish (see Section 12.2.1.5) and 6 new features implemented specifically for our linguistic representation (see Section 12.2.1.6).

12.2.1.1 Surface features

Surface similarity features are based mostly on comparing mention strings or their specified fragments. Uryupina decomposes the surface similarity problem into three sub-tasks: normalization, specific substring selection and proper matching.
Normalization covers different spelling variants of the same name throughout a text. Uryupina describes three normalization functions: no_case, no_punctuation and no_determiner. The first function ignores case in strings, the second one strips the text of all punctuation marks and other auxiliary characters (like “-” or “#”), while the last one strips it of determiners.
Substring selection reflects the fact that some words in a mention are more informative and important than others. Uryupina describes four key words of a mention string: head, last, first and rarest.
The last sub-task, matching, is based on a string comparison function. There are five string comparison algorithms:
– exact_match, comparing the whole mention strings
– approximate_match, based on the minimum edit distance (MED) measure (Wagner and Fisher, 1974)
– matched_part, counting the overlap between strings in words or symbols
– abbreviation, using one of four abbreviation algorithms
– rarest(+contain), finding the rarest word in a mention string and checking whether it occurs in some other mention.
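The sketch below illustrates how these three sub-tasks could be combined in code; it is only an approximation of Uryupina's definitions (in particular, difflib's similarity ratio stands in for a minimum-edit-distance criterion, and the frequency dictionary is assumed to be available).

    import string
    from difflib import SequenceMatcher

    def no_case(s):
        return s.lower()

    def no_punctuation(s):
        return s.translate(str.maketrans("", "", string.punctuation))

    def exact_match(a, b, normalize=no_case):
        return normalize(a) == normalize(b)

    def approximate_match(a, b, threshold=0.8):
        # stand-in for a minimum-edit-distance (MED) criterion
        return SequenceMatcher(None, no_case(a), no_case(b)).ratio() >= threshold

    def matched_part(a, b):
        # overlap counted in words
        return len(set(no_case(a).split()) & set(no_case(b).split()))

    def rarest_contained(a, b, frequency):
        # rarest(+contain): is the rarest word of mention a contained in mention b?
        words = no_case(a).split()
        rarest = min(words, key=lambda w: frequency.get(w, 0))
        return rarest in no_case(b).split()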

Nitoń (2013) combined these three sub-tasks in an experiment implementing about 88 features which can be used to increase the Bart-pl-1 coreference resolution score. The experiment omitted functions using no_determiner normalization and the more complex abbreviation algorithms.

12.2.1.2 Syntactic features

While Uryupina presents about 61 different syntactic features, only the 9 implemented in Nitoń's experiment have been taken into account:
– post-modification (IPostModified, JPostModified features): checks whether the markable's head is not the last word in a mention
– number (INumber, JNumber features): checks the grammatical number of the anaphor or the antecedent
– person (IPerson, JPerson features): checks the grammatical person of the anaphor or the antecedent
– same number (Number feature): checks whether the anaphor and the antecedent share the same number
– same person (Person feature): checks whether the anaphor and the antecedent share the same person
– syntactic agreement (SyntAgree feature): checks whether the anaphor and the antecedent share the same number and person.

12.2.1.3 Discourse structure and salience features

Salient entities are those which are likely to be antecedents, are considered more important than the others and bring new objects onto the discourse scene. Although “salient entities” seem to be very important for text understanding, their recognition is very difficult to formalize. To approach this problem, one must use salience measures which are, on the one hand, expressible and implementable while, on the other, theoretically sound and making predictions appropriate for coreference resolution. Uryupina (2007) presents about 97 discourse and salience-based features; 46 of them were implemented in Nitoń's experiment.
To formalize the salience features, Uryupina observed that in short discourse pieces the main topics of a text should be introduced at the beginning. Locally, each paragraph starts with some of the main entities and then brings new information about them, represented as coreference chains or single markables. These new entities are related to the paragraph's topic and, thus, are not likely to be mentioned again later. These hypotheses are very simplistic and may not be relevant for longer texts. Uryupina describes five hypotheses used to define salience and discourse structure features:
1. Earlier fragments of a document contain fewer anaphors because new entities are introduced.
2. Earlier fragments of a document contain more antecedents.

3. Entities in the beginning of a document segment (paragraph) can be anaphors or antecedents, whereas markables closer to its end are more likely to be anaphors.
4. Short coreference chains are more likely to hold in a single paragraph than to span multiple text segments. Short chains, unlike the long ones, correspond to local entities and are fully discussed in a single paragraph.
5. Markables in close proximity are more likely to be coreferent than those at a greater distance.

The first two hypotheses were addressed in Nitoń's experiment by implementing paragraph number and sentence number features. They correspond to a simple paragraph/sentence count from the beginning of a text, normalized by the total number of paragraphs/sentences (the 10-bin discretization described by Uryupina was not used). The third hypothesis was addressed using paragraph rank and sentence rank features. They correspond to distances in markables to the beginning of the paragraph/sentence, normalized by the total number of markables (also without the 10-bin discretization). This group of features also includes IFirstInSentence, JFirstInSentence, IFirstInPara and JFirstInPara, describing whether the markable is the first one in a sentence or paragraph. The fourth hypothesis was omitted in Nitoń's experiment.
The fifth hypothesis was addressed by a set of proximity features. Although texts are in most cases devoted to one topic, it is described from different angles by introducing new entities and abandoning the ones introduced before. Therefore, anaphors and antecedents are usually close to one another. To examine this hypothesis, three different types of distance measures between a possible anaphor and antecedent are used: distance in markables, in sentences and in paragraphs. These distances were implemented as the ParaDistance, SentenceDistance and MarkableDistance features. This group also includes the features SameSentence and SameParagraph, describing whether the anaphor-antecedent pair is in the same sentence or paragraph.
Another approach to using discourse and salience features for coreference resolution is to represent each feature by a triple (proximity, salience, agreement):
– Proximity can be measured by the functions same (true if the anaphor and antecedent are in the same sentence), prev (true if the anaphor and antecedent are in adjacent sentences) and closest (true if the antecedent is the markable closest to the anaphor).
– Salience can be measured by the closest function (true if the antecedent is the markable closest to the anaphor), sfirst (true for the first markable in the sentence) and pfirst (true for the first markable in the paragraph). The subject, ssubject and cb functions described by Uryupina were omitted in this experiment.
– Number and person agreement (true if the anaphor and the antecedent share the same number and person).
Because the proximity factor is crucial for pronominal anaphora, Uryupina introduces the proana encoding (true if the anaphor is pronominal).

12.2.1.4 Anaphoricity and antecedenthood features

Anaphoricity- and antecedenthood-related features are responsible for discovering how likely it is for a given mention to be the antecedent of another mention. Besides surface, syntactic and salience features, Nitoń implemented the samehead feature group consisting of the ISameHeadExist, JSameHeadExist, ISameHeadDist and JSameHeadDist features. They represent coreference knowledge on a very basic level. SameHeadExist checks whether there is a mention with the same head in the preceding text, while SameHeadDist describes the distance between a given markable and another one with the same head in the preceding text.

12.2.1.5 BART features

Apart from the features from the Bart-pl-1 configuration, during the experiment we used a set of other features already implemented in the BART system:
– PL_Alias – a feature used to check whether two mentions are different spelling variants of the same name (it applies to acronyms, different notations etc.); this feature was only partially adjusted for the Polish language
– SemanticClass – a feature used to check the semantic compatibility of mentions, not fully compatible with the Polish language
– BothLocation – a feature used to check whether both mentions are locations.

12.2.1.6 Other features

While searching for the best configuration of the Bart-pl-2 system, a few additional features were also implemented:
– HeadPos – checks whether the anaphor and the antecedent share the same part-of-speech tag based on the NKJP tagset (Przepiórkowski et al., 2012)
– Contains – checks whether one mention is not contained by the other mention
– JGender – checks the antecedent's gender
– IGender – checks the anaphor's gender
– JNamedEntity – checks whether the antecedent is not a named entity
– INamedEntity – checks whether the anaphor is not a named entity.

12.2.2 The final configuration

The baseline for Bart-pl-2 were the features used in the Bart-pl-1 system. The feature set was then extended one feature at a time, and the system was checked with respect to whether the new feature affected its coreference resolution in a positive way. If so, the feature was added to the set and another feature was examined. Following, i.a., the CoNLL-2011 evaluation methodology (Pradhan et al., 2011), we used an average score of the MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998) and CEAF-E


(Luo, 2005) metrics, which track the influence of different coreference dimensions (the B3 measure being based on mentions, MUC on links and CEAF-E on entities). The best achieved BART setting used the following configuration:
1. Pre-processing – as in Bart-pl-1.
2. Training example selection – as in Bart-pl-1, with SoonEncoder.
3. Feature extraction – all features from Bart-pl-1 plus the new ones which were found to work well for Polish:
   – surface features:
     – IDigits_H – checking whether the anaphor's head contains digits
     – IContainRarest_B – testing if the anaphor's rarest word (base form) is contained in the antecedent
     – JAlphas – determining whether the antecedent contains letters
     – JDigits – determining whether the antecedent contains digits
     – ExactMatch_NP – comparing strings of mention pairs, ignoring punctuation
     – ExactMatch_NC_NP – comparing strings of mention pairs, ignoring case and punctuation
     – Abbrev2 – comparing the abbreviation of the anaphor with the head of the antecedent, and vice versa, taking into consideration during abbreviation only words starting with an upper-case letter, with no normalization used
   – the syntactic feature IPostModified – determining whether the anaphor's head is not the last token in the markable string
   – discourse structure and salience features:
     – ProanaPFirstSame – checking whether the antecedent is the first markable in the paragraph and the pair elements are in the same sentence
     – ProanaClClAgree – checking whether the pair elements are closest to one another and are syntactically compatible
     – ProanaClPrev – testing whether the pair elements are closest to one another and are in adjacent sentences
     – PFirstSameAgree – checking whether the antecedent is the first markable in the paragraph, the pair elements are in the same sentence and they are syntactically compatible.
4. Classification – as in Bart-pl-1, a J48 decision tree trained as a mention-to-mention classifier.
5. Clustering – as in Bart-pl-1, SoonDecoder from Figure 12.2 was used.
6. Scoring – the data was evaluated using a new coreference evaluation tool, Scoreference¹, with an average of the MUC, B3 and CEAF-E metrics.
This configuration was developed and tested on a 390-text sample from the Polish Coreference Corpus. It achieved better results than Ruler and Bart-pl-1 on that dataset.

1 http://zil.ipipan.waw.pl/Scoreference

12.2.3 Summary

Bart-pl-2 is an extension of the approach presented by Bart-pl-1, with significantly more features implemented and tested. Again, the original Bart-pl-2 is not publicly available, but all features used in its final setting were implemented in the Bartek system, presented in the next section.

12.3 Third adaptation of BART for Polish

The key point of the experiments with statistical coreference resolution was to create an efficient resolver for Polish, distributed in an easy-to-use package and compatible with data in various formats. This would allow for comparison with Ruler, and such a package could also be deployed to an online demo server. Because the most popular approaches for English employ mention-to-mention comparisons, this paradigm was a clear candidate to start with in order to obtain a state-of-the-art system for Polish before trying other models of coreference. As most of the work for Polish had consisted in adapting the BART tool, it was reasonable to go further in that direction and extend the settings presented in the two previous sections.
The BART architecture, however, was found not to be easily modifiable – starting with its design, which merges mention detection and coreference resolution, two steps that we consider separate. With such integration and the whole processing chain focused on English, even attaching pre-processing tools presented considerable problems when adjusting it for Polish. Another difficulty was BART's working format, based on MMAX (Müller and Strube, 2006). Because of these conflicts, we decided to reimplement BART for Polish, keeping BART's mention-to-mention comparison architecture, yet focusing on the features which turned out to be valuable in the experiments presented in the two previous sections of this chapter. That reimplementation is named Bartek and is available at http://zil.ipipan.waw.pl/Bartek.

12.3.1 Bartek features

Bartek (similarly to the original BART) allows configuring which features are used in machine learning models of pairwise coreference relations. In its initial implementation, the tool combines features of the Bart-pl-1 and Bart-pl-2 systems, to allow for comparison between them and Ruler using the data from the Polish Coreference Corpus.
The Bart-pl-1 and Bart-pl-2 experiments did not use knowledge-based features relying on external resources, such as Wikipedia (an online encyclopedia) or WordNet (a lexico-semantic network). These are widely used for English, yet adapting them for Polish is not straightforward. There are many differences between the Polish WordNet


(Piasecki et al., 2009) and the Princeton WordNet (Miller et al., 1990), which makes several applications developed for the Princeton WordNet incompatible with Polish. The rich inflection present in Polish makes it similarly difficult to search Wikipedia for an article about a given entity when only its inflected form appears in the text, as articles use lemmatized entity names. Despite these difficulties, such features should prove very valuable for resolving coreference, as otherwise even simple synonymy is hard to recognise. To propose and evaluate a set of semantic features, we decided to create a third feature setting, named Bart-pl-3, and implement it in Bartek.

12.3.1.1 WordNet features

The Polish WordNet – plWordNet – is already the second-largest WordNet in the world, and it keeps growing. It is reported by its authors to contain ca. 144,000 nouns, verbs and adjectives, ca. 203,000 word senses and ca. 500,000 relations. For this experiment, three binary features using WordNet were implemented:
– AnaIsHypernym – true if the anaphor is a hypernym of the antecedent
– AnteIsHypernym – true if the antecedent is a hypernym of the anaphor
– AreSynonyms – true if the anaphor and the antecedent are synonyms.
WordNet represents a particular sense of a phrase as a lexical unit. One phrase may have several lexical units if it is polysemous. These lexical units are grouped into sets representing synonyms, so-called synsets. Two synsets may be connected by a hypernymy relation. To test whether two mentions are synonyms, we need to find the lexical units representing the mentions and check whether they are in the same synset. We search for the lexical units of a mention by querying the WordNet database for lexical units with a lemma equal to the lemma of the mention's head. Since many lexical units may be found for one mention that way, we test whether any pair of lexical units (where one element of the pair represents the first mention and the other represents the second mention) is in the same synset. If so, we assume they are synonyms. A similar approach is taken to acquire the hypernymy data: if any lexical unit representing one mention is in a synset connected by the hypernymy relation with a synset containing a lexical unit representing the second mention, we assume that a hypernymy relation exists for this mention pair.
This study used plWordNet in version 2.2, downloaded from http://nlp.pwr.wroc.pl/plwordnet/download/?lang=eng. For the purpose of having a distributable package of the tool, Bartek stores the data extracted from WordNet inside its package, so the user does not need to install WordNet.
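The lookup procedure described above can be sketched as follows. The wordnet object with its lexical_units, synset and hypernyms calls is a hypothetical placeholder for queries against a local plWordNet database, not an existing API.

    def lexical_units_for(mention, wordnet):
        # lexical units whose lemma equals the lemma of the mention's head
        return wordnet.lexical_units(mention.head_lemma)

    def are_synonyms(m1, m2, wordnet):
        units1 = lexical_units_for(m1, wordnet)
        units2 = lexical_units_for(m2, wordnet)
        return any(wordnet.synset(u1) == wordnet.synset(u2)
                   for u1 in units1 for u2 in units2)

    def is_hypernym(m1, m2, wordnet):
        """True if some sense of m1 is a hypernym of some sense of m2."""
        units1 = lexical_units_for(m1, wordnet)
        units2 = lexical_units_for(m2, wordnet)
        return any(wordnet.synset(u1) in wordnet.hypernyms(wordnet.synset(u2))
                   for u1 in units1 for u2 in units2)

    # Feature values for a pair (anaphor, antecedent):
    #   AnaIsHypernym  = is_hypernym(anaphor, antecedent, wn)
    #   AnteIsHypernym = is_hypernym(antecedent, anaphor, wn)
    #   AreSynonyms    = are_synonyms(anaphor, antecedent, wn)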

12.3.1.2 Wikipedia features

Wikipedia is an online, multilingual, free-content encyclopedia that anyone can edit and contribute to. The Polish Wikipedia is one of the largest non-English Wikipedias, with over 1,000,000 articles (the English Wikipedia contains over 4,500,000 articles, but the second-largest one – Dutch – has fewer than 1,800,000 articles). The Polish Wikipedia was used to provide the following binary features:
– IsRedirect – true if the page about the anaphor redirects to the page about the antecedent, or the other way round
– AreLinked – true if the page about the anaphor has a link to the page about the antecedent, or the other way round
– AreMutuallyLinked – true if both the page about the anaphor has a link to the page about the antecedent and the page about the antecedent has a link to the page about the anaphor.
To explain these features in more detail, we first need to introduce Wikipedia data structures. The encyclopedia contains millions of pages, each identified by the name of the entity described on that page. Some pages do not have any content; they simply redirect to another page (for example, in the English Wikipedia, entering the page titled “Pages” redirects to the page titled “Page”). Using such redirect information, we may be able to find some synonyms, such as uczucie ‘feeling’ redirecting to emocja ‘emotion’. One page may contain many links to other pages. These links allow the encyclopedia's user to navigate to a page describing an entity found relevant to the currently browsed page. Clearly, such a link indicates some semantic relatedness between the two entities whose pages are linked. If such links exist in both directions, this may indicate an even stronger connection.
The most crucial part of using Wikipedia for finding such relations between mentions is finding the page describing a given mention. Pages are named after the base form of an entity, yet our mention may appear in the text in an inflected form. The lemmatization performed by the tagger is a token-by-token procedure, which does not necessarily produce a correct base form of a multi-word expression (MWE). One option would be to use a dictionary of MWE inflection forms, but we chose another: we simply lemmatized the titles of Wikipedia pages token by token, trying to match (case-insensitively) such forms with the token-by-token lemmatization provided for the mention occurrence in the text.
Since many requests to Wikipedia would be required to extract the presented Wikipedia-based features, it is not feasible to do this online. A much faster way is to download a snapshot of Wikipedia² and process it in a local database.

2 The latest Polish Wikipedia snapshot can be obtained at http://dumps.wikimedia.org/plwiki/latest/.


However, such a local database would be very difficult to distribute with the tool, as Wikipedia contains millions of pages and querying such a resource is time- and resource-demanding. Therefore, we decided to bundle only the information about the pages which describe mentions from the Polish Coreference Corpus, recognised via the procedure described earlier in this section. We know that such an approach limits the recall of our solution, yet it was necessary for performance reasons. At the same time, it does not invalidate the extensibility of our solution, since it may be extended with the contents of the whole Wikipedia at any time.
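The title-matching and feature-extraction steps can be sketched as follows; lemmatize_tokens stands in for the tagger's per-token lemmatization, and the redirects and links dictionaries represent data extracted from a Wikipedia snapshot – all of these are assumptions made for illustration.

    def title_key(title, lemmatize_tokens):
        return tuple(l.lower() for l in lemmatize_tokens(title.split()))

    def build_title_index(page_titles, lemmatize_tokens):
        return {title_key(t, lemmatize_tokens): t for t in page_titles}

    def find_page(mention_lemmas, title_index):
        return title_index.get(tuple(l.lower() for l in mention_lemmas))

    def wiki_features(ana_page, ante_page, redirects, links):
        """redirects: page -> redirect target; links: page -> set of linked pages."""
        if ana_page is None or ante_page is None:
            return {"IsRedirect": False, "AreLinked": False, "AreMutuallyLinked": False}
        is_redirect = (redirects.get(ana_page) == ante_page
                       or redirects.get(ante_page) == ana_page)
        a_to_b = ante_page in links.get(ana_page, set())
        b_to_a = ana_page in links.get(ante_page, set())
        return {"IsRedirect": is_redirect,
                "AreLinked": a_to_b or b_to_a,
                "AreMutuallyLinked": a_to_b and b_to_a}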

12.3.1.3 The final configuration

The Bart-pl-3 configuration implemented in Bartek worked exactly as the Bart-pl-2 setting, extended with the new knowledge-based features. The initial experiments have shown that the new semantic features improve results in all metrics (for a detailed comparison of the three Polish BART configurations see Chapter 15). The newly introduced features improve the verification of semantic compatibility for common nouns (thanks to WordNet) and named entities (thanks to Wikipedia). However, there are several challenges regarding this type of knowledge-based resources, mainly concerning the memory and time complexity of the task, as well as problems with distributing the tool as a standalone version. Bartek, with its clean design and public availability, leaves much room for further improvement through the implementation of new features. Importantly for the Polish computational linguistics community, it is easily integrated with the TEI annotation format, the de facto standard for Polish.

Part IV: Evaluation

Mateusz Kopeć, Maciej Ogrodniczuk

13 Manual annotation evaluation¹

Manual annotation in our project concerns mentions with their heads, identity clusters with dominant expressions, and quasi-identity links. Each of these subtasks can be evaluated in terms of inter-annotator agreement. In order to study this agreement, 210 texts (60,674 segments) were selected, representing all categories (14 text categories in sum) in equal measure (15 texts each). Each text was annotated independently on all layers of annotation by 2 annotators. In total, more than 2 annotators worked on the texts, but each text was always assigned to exactly two persons. The “first” annotator of a given text should always be understood as the annotator who started the annotation of the text in question first (we will denote him/her as annotator A, and his/her annotation as annotation A), and the “second” annotator should be understood analogically (annotator B, annotation B). For different texts, these may be entirely different persons. Notably, the number of texts visibly surpasses the hitherto prevailing standards for agreement studies, while the number of annotators independently annotating one text is minimal. For example, the agreement study for the AnCora-CO-Es corpus (Recasens, 2010) was performed by 8 annotators on 2 texts from the corpus (1100 segments in total). In the 420 annotations of the 210 texts, the annotators marked 41,006 mentions, 4410 clusters and 1009 quasi-identity links in total.

13.1 Annotation agreement of mentions

According to Artstein and Poesio (2008), annotators' agreement for the task of marking mentions (which can be nested, discontinuous, or inter-crossing) has not yet been measured in any established way. It is difficult to say how to estimate the probability of randomly marking a given mention. For this reason, the data given below describe only the observed agreement.
In the 210 texts, the annotation of the “A” annotators covered exactly 20,420 mentions (after removing false mentions, that is, mention duplicates with exactly the same boundaries). Analogically, the annotation of the “B” annotators comprised 20,560 mentions. Out of all mentions, 17,530 appeared in both annotations.

1 This chapter is based on (Kopeć and Ogrodniczuk, 2014).

The equality of two mentions is to be understood as boundary identity, including internal boundaries in the case of discontinuous mentions. Assuming that annotation A is “gold” and annotation B is the system output, we obtain 85.26% precision, 85.85% recall and an F1 score of 85.55%.
When the mentions in the 210 texts were compared with lesser exactness, considering only their heads, the annotation of the annotators A included exactly 19,394 mentions (after removing mentions with precisely the same heads). Analogically, the cleaned annotation of the annotators B consisted of 19,522 mentions. Out of them all, 18,317 mentions appeared in both annotations. Assuming that annotation A is “gold” and annotation B is the system output, we obtain 93.83% precision, 94.47% recall and an F1 score of 94.14%.

13.2 Annotation agreement of heads

Obviously, head annotation agreement can be studied only for mentions simultaneously appearing in both annotations of a given text, the number of which was 17,530. Out of them all, 17,363 mentions had the same heads annotated, which allows us to calculate the observed agreement (p_O^A):

    p_O^A = \frac{17{,}363}{17{,}530} \approx 0.9905

The chance agreement (p_E^A) was calculated in the following way: for each mention, the choice of head was the choice of one of the segments belonging to that mention. For a given mention, the probability of the two annotators choosing the same head equalled the inverse of the number of segments in the mention – e.g. if a mention consisted of three segments, the annotators would randomly choose the same head with probability 1/3. It is difficult to create a better estimation of the agreement resulting from accidental choice for each annotator, because each mention has its own pool of possible heads; it is not a constant set of categories for which one could calculate the probability distribution on the basis of observation. The total agreement resulting from accidental choice was calculated as the mean probability of this agreement over individual mentions:

    p_E^A \approx 0.6832

It is relatively high because of the numerous singletons, whose agreement probability from accidental choice is 1. Having computed these two values, one can calculate the S agreement score from (Bennet et al., 1954), which assumes a constant probability distribution over the categories chosen by annotators:

    S = \frac{p_O^A - p_E^A}{1 - p_E^A} = \frac{0.9905 - 0.6832}{1 - 0.6832} = \frac{0.3073}{0.3168} \approx 0.9700

This is a highly satisfactory agreement score.


13.3 Annotation agreement of quasi-identity links

Similarly as in the case of heads, agreement of link annotation was only studied for mentions which simultaneously appeared in both annotations of the same text. For each pair of such mentions, the annotators could decide whether to connect them with a link or not. The total agreement counts over all these pairs can be found in Table 13.1. One could calculate κ (Cohen, 1960) directly from this table; however, such an approach would be inappropriate: it would not take into account the fact that annotators cannot add a quasi-identity link between mentions in two different texts. In consequence, κ should be computed for each text separately and then averaged. We now present the method of calculating kappa for a single text. Table 13.2 shows the link agreement statistics for an exemplary text. The observed agreement (p_O^A) is then:

    p_O^A = \frac{1 + 2775}{1 + 2775 + 8 + 1} = \frac{2776}{2785} \approx 0.9968

The expected agreement (p_E^A), on the other hand, is:

    p_E^A = \frac{9}{2785} \cdot \frac{2}{2785} + \frac{2776}{2785} \cdot \frac{2783}{2785}
          = \frac{18 + 7{,}725{,}608}{2785^2} = \frac{7{,}725{,}626}{7{,}756{,}225} \approx 0.9961

On this basis, we can calculate κ:

    \kappa = \frac{p_O^A - p_E^A}{1 - p_E^A}
           = \frac{\frac{2776}{2785} - \frac{7{,}725{,}626}{7{,}756{,}225}}{1 - \frac{7{,}725{,}626}{7{,}756{,}225}}
           = \frac{\frac{7{,}731{,}160 - 7{,}725{,}626}{7{,}756{,}225}}{\frac{7{,}756{,}225 - 7{,}725{,}626}{7{,}756{,}225}}
           = \frac{5534}{30{,}599} \approx 0.1809

Whenever a text did not include any links assigned by either annotator, an agreement of 1 was adopted. When one annotator did not mark any link in the text and the other did, the described procedure was applied, which resulted in an agreement of 0 (because the expected agreement equals the observed one).

Table 13.1. Quasi-identity link agreement in total

                                        Annotator B
                                        Quasi-identical    Not quasi-identical
Annotator A   Quasi-identical           67                 367
              Not quasi-identical       306                741,584

Table 13.2. Quasi-identity link agreement in a single text

                                        Annotator B
                                        Quasi-identical    Not quasi-identical
Annotator A   Quasi-identical           1                  1
              Not quasi-identical       8                  2775

By applying the described procedure to all texts and then averaging the results, we obtain a κ equal to around 0.2220. Undoubtedly, this is not a good score, which demonstrates the difficulty of qualifying mentions as quasi-identical. Some pairs of mentions were marked as quasi-identical by one annotator and as identical by the other (that is, placed in one coreferential cluster). This happened 128 times.
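The per-text procedure, including the special cases described above, can be expressed compactly as follows (a sketch for a 2×2 link-agreement table; the cell names follow Table 13.2):

    def kappa_2x2(both, only_a, only_b, neither):
        """both: linked by A and B; only_a / only_b: linked by one annotator;
        neither: linked by neither annotator."""
        n = both + only_a + only_b + neither
        if both + only_a == 0 and both + only_b == 0:
            return 1.0                      # no links marked by either annotator
        p_o = (both + neither) / n
        p_e = ((both + only_a) * (both + only_b)
               + (only_b + neither) * (only_a + neither)) / (n * n)
        return 0.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

    def average_kappa(per_text_tables):
        return sum(kappa_2x2(*t) for t in per_text_tables) / len(per_text_tables)

    # the exemplary text from Table 13.2: kappa_2x2(1, 1, 8, 2775) ≈ 0.1809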

13.4 Annotation agreement of dominant expressions

This agreement is only computed for mentions appearing in the annotations of both annotators simultaneously. Dominant expressions were annotated exclusively for non-singleton clusters. The total number of common mentions with an annotated dominant expression was 6162. Out of them, 4115 mentions had the same dominant expression in both annotations, which gives a proportion of around 66.78%. If we include in the calculations only one representative of each cluster (which is sensible, as each cluster element has the same dominant expression), then 1146 out of 1818 cluster representatives have the same expressions in both annotations, which gives a score of around 63.04%. An analysis corrected for random agreement has not been performed, because the annotators could choose cluster elements as the dominant expression, but they could also insert arbitrary text, which does not allow for a valid estimation of the random agreement probability.

13.5 Annotation agreement of coreference

A study on annotators' agreement in computational linguistics can be found in the comprehensive paper by Artstein and Poesio (2008). Annotating coreference is singled out there as a specific task due to the fact that it relies on clustering, not classification (which is typical for this field).

13.5.1 Existing agreement scores

13.5.1.1 First Passonneau's scores

Passonneau (1997) describes two annotator agreement scores: κ by Cohen (1960) and the (unweighted) α by Krippendorff (2003):

    \kappa = \frac{p_O^A - p_E^A}{1 - p_E^A}
    \qquad
    \alpha = 1 - \frac{p_O^D}{p_E^D}


Table 13.3. Exemplary coincidence matrix

                             Annotator B
                             Label X    Label Y    Total
Annotator A   Label X        47         10         57
              Label Y        14         29         43
              Total          61         39         100

κ is described as the difference between the degree of the annotators' agreement (p_O^A – observed agreement) and the accidental agreement (p_E^A – expected agreement), while α is determined by the difference between the annotators' disagreement (p_O^D – observed disagreement) and the accidental disagreement (p_E^D – expected disagreement). By means of simple substitution, under the assumptions that

    1 = p_O^A + p_O^D \qquad 1 = p_E^A + p_E^D

Passonneau shows that the α and κ scores are equal. In the remainder of this section, the procedure of calculating κ will be described.
Supposing we have a coincidence matrix for the decisions made by multiple annotators, α and κ can always be calculated. For example, for a matrix such as the one in Table 13.3, which contains information on how often annotators A and B chose labels X and Y, we can calculate the probabilities:

    p_O^A = \frac{47 + 29}{100} = \frac{76}{100} = \frac{19}{25} = 0.76

    p_E^A = \frac{57}{100} \cdot \frac{61}{100} + \frac{43}{100} \cdot \frac{39}{100}
          = \frac{3477 + 1677}{10{,}000} = \frac{5154}{10{,}000} = \frac{2577}{5000} = 0.5154

The observed agreement can be seen on the diagonal of the coincidence matrix, so p_O^A equals 47 + 29 divided by the number of all decisions, that is 100. The expected agreement can be calculated on the basis of the probability distribution of a given annotator choosing a given label. For example, annotator A chose label X in 57 cases out of 100, and annotator B did so in 61 cases out of 100. Thus, the chance of both of them randomly choosing label X equals (57/100) · (61/100). After summing these values over all labels, we obtain the expected agreement. So, in the end:

    \kappa = \frac{p_O^A - p_E^A}{1 - p_E^A}
           = \frac{\frac{19}{25} - \frac{2577}{5000}}{1 - \frac{2577}{5000}}
           = \frac{\frac{3800 - 2577}{5000}}{\frac{5000 - 2577}{5000}}
           = \frac{1223}{2423} \approx 0.50

Table 13.4. Coincidence matrix for MUC score

                             Annotator B
                             Link+      Link−      Total
Annotator A   Link+          47         10         57
              Link−          14         29         43
              Total          61         39         100

13.5.1.2 Method à la MUC

There is still an unresolved but vital issue of how to create a coincidence matrix for the task of clustering, rather than classification. In (Passonneau, 1997), agreement is computed on the basis of the MUC score, which gives a coincidence matrix similar to the one in Table 13.4, where Link+ means an annotated (minimal) link between mentions, and Link− means no link. For details on MUC score calculations, see Section 14.3.1. Unfortunately, the MUC score cannot be regarded as the standard one because of its shortcomings (i.e. dismissing singletons), and this is probably the reason why this method of calculating annotators' agreement has not been adopted.

13.5.1.3 Weighted Krippendorff's α

Passonneau (2004) presented another approach, in which she used a weighted version of Krippendorff's α. In order to calculate it, one should view the annotator's decision concerning a given mention as annotating it with the set of mentions with which this mention is in a cluster (not counting the mention itself). So, if for five mentions with identifiers 1, 2, 3, 4, 5, annotator A annotated the clusters {1, 3}, {2, 4, 5}, and annotator B annotated the clusters {1}, {2, 4}, {3, 5}, their annotations under this scheme look as in Table 13.5. One can see there that, e.g., according to annotator A, mention 1 is in a cluster with mention 3 only, whereas according to annotator B it is a singleton.

Table 13.5. Cluster annotation depiction

Mention No.     1        2          3        4          5
Annotator A     {3}      {4, 5}     {1}      {2, 5}     {2, 4}
Annotator B     {}       {4}        {5}      {2}        {3}


Passonneau's idea is to penalize differences between annotators in different ways, depending on the size of the discrepancies between the clusters assigned to the same mentions. Therefore, she used Krippendorff's α in a version in which a certain weight is assigned to each error, according to a chosen distance score. Let us mark the distance between clusters e1 and e2 as δ(e1, e2). As a result, in the case of the first mention, the weight of the error will be δ({3}, {}); the remaining mentions are treated analogically. The set of all clusters used by the annotators will be marked as E (in the example, it has 9 elements, mirroring the number of rows and columns of the coincidence matrix). The equations for weighted α in the case of two annotators are as follows:

    p_O^D = \frac{1}{n} \sum_{e_1 \in E} \sum_{e_2 \in E} o_{e_1 e_2}\, \delta(e_1, e_2)

    p_E^D = \frac{1}{n(n-1)} \sum_{e_1 \in E} \sum_{e_2 \in E} n_{e_1} n_{e_2}\, \delta(e_1, e_2)

    \alpha = 1 - \frac{p_O^D}{p_E^D}
           = 1 - (n-1)\, \frac{\sum_{e_1, e_2 \in E} o_{e_1 e_2}\, \delta(e_1, e_2)}{\sum_{e_1, e_2 \in E} n_{e_1} n_{e_2}\, \delta(e_1, e_2)}

where o_{e_1 e_2} is the number of mentions annotated by one annotator with cluster e_1 and by the other with cluster e_2, while n_{e_1} is the number of all annotations with the label e_1; analogically, n_{e_2} is the number of all annotations with the label e_2. In the case of our example, e.g. o_{{3},{2,4}} = 1, o_{{3},{2,5}} = 0, and n_{{3}} = 2, n_{{2,5}} = 1. Passonneau (2004) defines the function δ(e1, e2) as follows (the first matching rule gives the score):
– δ(e1, e2) = 0, when e1 = e2
– δ(e1, e2) = 0.33, when e1 ⊂ e2 ∨ e2 ⊂ e1
– δ(e1, e2) = 0.67, when e1 ∩ e2 ≠ ∅
– δ(e1, e2) = 1, when e1 ∩ e2 = ∅.

e1 ∩ e2 e1 ∪ e2
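As a minimal illustration (not part of the original text), the two set distances can be implemented directly from the definitions above; clusters are represented as Python sets of mention identifiers.

```python
# Passonneau's (2004) delta and MASI (Passonneau et al., 2006) set distances.

def passonneau_delta(e1, e2):
    """The first matching rule determines the score."""
    e1, e2 = set(e1), set(e2)
    if e1 == e2:
        return 0.0
    if e1 < e2 or e2 < e1:          # proper subset in either direction
        return 0.33
    if e1 & e2:                     # overlapping but neither contains the other
        return 0.67
    return 1.0                      # disjoint sets

def masi(e1, e2):
    """MASI as defined above: delta multiplied by the Jaccard index."""
    e1, e2 = set(e1), set(e2)
    union = e1 | e2
    jaccard = len(e1 & e2) / len(union) if union else 1.0
    return passonneau_delta(e1, e2) * jaccard

if __name__ == "__main__":
    print(passonneau_delta({1, 3}, {3}))        # 0.33 (subset)
    print(passonneau_delta({2, 5}, {2, 4}))     # 0.67 (overlap)
    print(passonneau_delta({1}, {2, 4}))        # 1.0  (disjoint)
    print(round(masi({2, 4, 5}, {2, 4}), 2))    # 0.33 * 2/3 = 0.22
```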

13.5.1.4 Recasens' agreement study

Recasens (2010) describes an agreement study with 8 annotators on 2 texts from the AnCora-CO-Es corpus (around 1100 segments in total). The mentions presented to the annotators were assumed to be identical, and two aspects were tested:
1. The annotators' agreement on whether a mention belongs to a cluster, and to what type of cluster. The annotator could assign one of 4 categories to each mention:
– non-coreference
– discourse deixis
– predicative
– identity.
Thus, it was a standard classification task, studied by means of weighted α (0.85 for the first text, 0.89 for the second one). It is not entirely clear how the weights of individual errors were computed, although the weights themselves were reported in the thesis. They seem to have depended on the distance between the categories in the order given above, as well as on their multiplicity – see ordinal metric differences in (Krippendorff, 2007, p. 6).
2. The annotators' agreement on assigning each mention of the predicative or identity category to clusters. This time, cluster numbers were used as labels, so it was again a classification task, studied with κ (0.98 for the first text, 1 for the second text).

13.5.2 Results for PCC

The essential question at this stage is what constitutes a single annotator's decision during coreference annotation, and which agreement score should be chosen. Unfortunately, this amounts to choosing a particular coreference evaluation method, and, as discussed in Section 14.3, there is no consensus in the research community on the right score.

13.5.2.1 Agreement à la Passonneau

The agreement was calculated for each text separately, and then averaged. According to the method from (Passonneau, 2004), it amounted to 79.08%, while according to the procedure from (Passonneau et al., 2006), the result was 59.54%.

13.5.2.2 Agreement à la Recasens

Similarly to Marta Recasens, the agreement was computed on the following level: for each mention annotated by both annotators, it was checked whether each of them placed the mention in a (non-singleton) cluster or left it as a singleton. Thus, it was a binary decision, which we study by means of Cohen's κ. According to the data from Table 13.6, the observed agreement (pAO) amounts to:

pAO = (6238 + 9094) / (6238 + 9094 + 975 + 1223) = 15,332/17,530 ≈ 0.8746


Table 13.6. Decision agreement singleton/element of cluster in all texts in total

                               Annotator B
                         In cluster    Singleton
Annotator A  In cluster     6238          975
             Singleton      1223         9094

The expected agreement (pAE), on the other hand, is:

pAE = ((6238 + 1223)/17,530) ∗ ((6238 + 975)/17,530) + ((9094 + 1223)/17,530) ∗ ((9094 + 975)/17,530)
    = (7461 ∗ 7213 + 10,317 ∗ 10,069) / 17,530² = 157,698,066/17,530² ≈ 0.5132

Based on this, one can calculate κ:

κ = (pAO − pAE) / (1 − pAE) = (15,332/17,530 − 157,698,066/17,530²) / (1 − 157,698,066/17,530²)
  = (268,769,960 − 157,698,066) / (307,300,900 − 157,698,066) = 111,071,894/149,602,834 ≈ 0.7424

13.5.2.3 Agreement à la BLANC (our contribution)

Statistics of coreferential and non-coreferential link agreement (as in the BLANC score) for mentions marked in both the A and B annotations can be found in Table 13.7. It would be possible to calculate Cohen's κ directly for this table; however, a naïve application of the scheme described above to Table 13.7 is inappropriate, just as in the case of calculating the agreement of quasi-identical links, because it would ignore the fact that annotators cannot – even by accident – add a coreferential link between two different texts. Thus, κ should be calculated for each text separately, and then averaged. We shall now present the calculation of κ for a single text, whose link agreement statistics are shown in Table 13.8. The observed agreement (pAO) amounts to:

pAO = (37 + 2737) / (37 + 2737 + 1) = 2774/2775 ≈ 0.9996

Table 13.7. BLANC link agreement in all texts in total

                                        Annotator B
                               Coreferential   Non-coreferential
Annotator A  Coreferential        16,638            3448
             Non-coreferential     3353           718,822

Table 13.8. BLANC link agreement in a single text

                                        Annotator B
                               Coreferential   Non-coreferential
Annotator A  Coreferential          37               0
             Non-coreferential       1             2737

Table 13.9. κ values for texts grouped by type

Type                                                    κ
Academic writing and textbooks                          0.699
Instructive writing and textbooks                       0.727
Internet non-interactive (static pages, Wikipedia)      0.730
Dailies                                                 0.740
Quasi-spoken (parliamentary transcripts)                0.746
Internet interactive (blogs, forums, usenet)            0.764
Spoken – conversational                                 0.765
Magazines                                               0.772
Spoken from the media                                   0.785
Non-fiction literature                                  0.795
Misc. written (legal, ads, manuals, letters)            0.807
Journalistic books                                      0.817
Unclassified written                                    0.826
Fiction literature (prose, poetry, drama)               0.871
Any                                                     0.775

The expected agreement (pAE), on the other hand, is:

pAE = (37/2775) ∗ (38/2775) + (2738/2775) ∗ (2737/2775)
    = (1406 + 7,493,906) / 2775² = 7,495,312/7,700,625 = 202,576/208,125 ≈ 0.9733

On this basis, we can calculate κ:

κ = (pAO − pAE) / (1 − pAE) = (2774/2775 − 202,576/208,125) / (1 − 202,576/208,125)
  = (208,050 − 202,576) / (208,125 − 202,576) = 5474/5549 ≈ 0.9865

After applying this procedure to all texts and grouping the outcomes by text type, we obtain the results shown in Table 13.9.

Mateusz Kopeć, Agata Savary

14 Evaluation approaches

14.1 Evaluation exercises

The first part of this chapter describes evaluation procedures used in the two recent coreference resolution exercises: Anaphora Resolution Exercise (2008) and SemEval (2010). Next, we provide a detailed presentation of existing mention detection and coreference resolution evaluation measures.

14.1.1 Anaphora Resolution Exercise 2007

The Anaphora Resolution Exercise (ARE; Orăsan et al. (2008)) was an evaluation campaign meant to establish methods of evaluating coreference and anaphora resolution systems. Its first (and, so far, only) edition was held during the 6th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC2007)¹ in 2007. The details of the event can be found at http://clg.wlv.ac.uk/events/ARE/. Only three competitors joined the exercise and their results were anonymized.

14.1.1.1 Data

ARE focused on the English language only. The evaluation corpus contained newspaper articles extracted from the Reuters corpus (Rose et al., 2002), which were manually annotated for coreference. Mentions in this evaluation effort were understood as noun phrases (NPs). The XML format of the data allowed for marking nested mentions but not for taking discontinuity of mentions into account.

14.1.1.2 Tasks

Four tasks were organised: two of them concerned evaluating anaphora and coreference resolution when gold mentions were given on input, while the two others (in the so-called system mention setting) required the systems to identify the mentions on their own, as shown in Table 14.1. For tasks with gold mentions, the mention boundaries were given (for Task 1, pronouns were additionally distinguished from other mentions).

1 http://daarc2007.di.fc.ul.pt/

Table 14.1. Tasks in ARE (gold vs. system mentions × pronouns only vs. all mentions)
– Task 1 (gold mentions, pronouns only): all mentions in the texts are pre-identified (as spans of text); for each pre-identified pronoun, its antecedent (one of the other pre-identified mentions) is to be found.
– Task 2 (gold mentions, all mentions): all mentions in the texts are pre-identified (as spans of text); they are to be grouped into coreferential clusters.
– Task 3 (system mentions, pronouns only): raw text is given on input, with no pre-identified mentions; each pronoun and its antecedent are to be found.
– Task 4 (system mentions, all mentions): raw text is given on input, with no pre-identified mentions; all mentions are to be found and grouped into coreferential clusters.

For tasks with system mentions, so-called “nodes” with unique identifiers were marked before each space and punctuation mark, to allow the evaluation of mention boundaries found by the competitors. The competitors simply provided mentions as spans from one “node” to another, using the same token segmentation provided on input in terms of these “nodes”.

14.1.1.3 Evaluation metrics

Task 1 was evaluated using the success rate defined as the sum of scores for each pronoun divided by the number of pronouns in the test data. Score 1 was given if a pronoun was connected to one of the correct non-pronoun mentions or to another pronoun mention with score 1 (i.e. the whole chain contained at least one non-pronoun mention). Score 0.5 was given if the pronoun was correctly connected to another pronoun mention with a score different than 1. Score 0 was given otherwise. Task 2 was evaluated using the MUC-like precision, recall and F-measure (see Section 14.3.1). Task 3 was evaluated using modified versions of precision, recall and F-measure, based on the following overlap measure between two strings:

overlapRate(S1, S2) = length(overlap(S1, S2)) / max(length(S1), length(S2))

where length(overlap(S1, S2)) represents the length in words of the overlap between S1 and S2, and max(length(S1), length(S2)) is the number of words in the longer of the two strings. To calculate precision and recall, the following formulae are used:

Precision = score of the correctly resolved pronouns / number of system pronouns resolved as non-singletons
Recall = score of the correctly resolved pronouns / number of pronouns in the gold standard
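As a small illustration (not part of the original text), the overlapRate measure can be computed directly from the definition if mentions are represented as token-offset spans within the same text; the spans below are made up for the example.

```python
# overlapRate between two mention spans given as (start, end) token offsets,
# with end exclusive; both spans refer to the same tokenised text.

def overlap_rate(span1, span2):
    s1, e1 = span1
    s2, e2 = span2
    overlap = max(0, min(e1, e2) - max(s1, s2))   # number of tokens shared by both spans
    longest = max(e1 - s1, e2 - s2)               # length of the longer mention
    return overlap / longest if longest else 0.0

# a system mention over tokens 3..6 vs a gold mention over tokens 4..8:
# 3 shared tokens, longer span has 5 tokens, so overlapRate = 0.6
print(overlap_rate((3, 7), (4, 9)))
```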


The score of the correctly resolved pronouns is defined similarly as in Task 1, but modified to account for possible differences between mention boundaries in the system and gold outputs. First of all, a system pronoun receives a score of 0 if it is not present in the gold standard. Otherwise, as in Task 1, if a pronoun is resolved to another pronoun, the score is 1 if there is at least one non-pronominal antecedent in the coreference chain, and 0.5 if there are no non-pronominal elements in the chain or one of the pronouns in the chain is not correctly resolved. However, in the case of a non-pronominal antecedent in the chain (denoted by SAUTO), the score is multiplied by overlapRate(SAUTO, SGOLD), where SGOLD is the string from the gold standard which maximises the overlap score with SAUTO. Task 4 was evaluated using modified versions of MUC-like precision, recall and F-measure. Instead of counting the number of common pairs, the overlap rate measure proposed for Task 3 was used. This means that when a pair is compared, the overlap between its elements is calculated.

14.1.1.4 Summary

Since the ARE methodology is based on the MUC metric, it does not take singleton mentions into account. While this is not a major issue for the pronoun-related tasks (as it is rare for a pronoun not to have an antecedent), it does represent a serious drawback for coreference clustering (cf. the MUC metric description in Section 14.3.1). The ARE organisers also concluded that the overlap measure led to some unexpected results and needed to be improved. Even though the general idea of dividing the tasks into pronoun-based and general coreference resolution seems reasonable, it has not been taken up by other researchers. The gold and system mention settings, however, were largely adopted in further evaluation efforts.

14.1.2 SemEval 2010

The 2010 SemEval conference featured a task called Coreference Resolution in Multiple Languages (Recasens et al., 2010b). Its goal was to evaluate and compare automatic coreference resolution systems for six different languages: Catalan, Dutch, English, German, Italian and Spanish. Four evaluation settings and four different metrics were used (cf. Section 14.3):
– MUC (Vilain et al., 1995)
– B3 (Bagga and Baldwin, 1998)
– CEAF-M (Luo, 2005)
– BLANC (Recasens, 2010).
Six systems entered the competition; their details may be found at http://stel.ub.edu/semeval2010-coref. An extended analysis of the competition is presented by Màrquez et al. (2012).

Table 14.2. Tasks in SemEval (closed vs. open × gold vs. regular)
– Closed, Gold: gold-standard columns in the task data, including the true mention boundaries, are used; no external knowledge resources are allowed.
– Closed, Regular: only the automatically predicted columns from the task data are used; no external knowledge resources are allowed.
– Open, Gold: gold-standard columns from the task data, including the true mention boundaries, are used; external knowledge resources are allowed.
– Open, Regular: only the automatically predicted columns from the task data are used; external knowledge resources are allowed.

14.1.2.1 Data

Each of the 6 languages in the task had its training, development and test corpora. The data size for a single language varied from about 100,000 to about 500,000 tokens. The corpora were both automatically and manually pre-annotated with tagging, lemmatization, named entity recognition and dependency parsing information. The data format was column-based: each text token was encoded in a separate row, with its various properties given in several columns. Mention and NP boundaries were encoded using a bracketing notation, which, as in ARE, allowed mentions to be nested, but not to be discontinuous. Mentions were understood as NP constituents and possessive determiners only, and limited to referential expressions; thus, nominal predicates, appositives, expletive NPs, attributive NPs, NPs within idioms, etc. were excluded. Singletons were taken into account.

14.1.2.2 Tasks

Four tasks were organized, as shown in Table 14.2. Two of them allowed for using information provided in the task datasets only (closed setting), while the two others admitted external tools and resources, e.g., WordNet, Wikipedia, etc., to predict the preprocessing information (open setting). In each of these two types, one task involved the manual annotation in the datasets (gold setting), while the other was based on the automatic annotation only (regular setting).

14.1.3 Evaluation metrics

The coreference resolution task was evaluated with the MUC, B3, mention-based CEAF and BLANC measures. These were applied to the correctly recognized mentions only, i.e., to those recognized by a system and occurring in the gold set. The quality of the mention detection was separately measured with recall, precision, and F1.


A mention was rewarded with 1 point if its boundaries coincided with those of a gold NP, with 0.5 point if its boundaries were within a gold NP and included the gold NP's head, and with 0 otherwise.

14.1.3.1 Summary

During this SemEval task, mention detection and coreference resolution evaluation were performed separately, the latter taking only the correct mentions into account. Mention detection was rewarded for partial matches; coreference resolution was evaluated with 4 measures. An important conclusion from the task was that different evaluation metrics provide different system rankings. As the participating systems were not run for all languages and evaluation settings, their comparison was even more difficult. Because of that, it has become standard in coreference resolution research to report evaluation results for several measures. It is also pointed out that a meaningful comparison can hardly involve systems working on different datasets in different languages.

14.2 Mention detection evaluation metrics

A premise of the coreference resolution measures presented in the next section is that an automatic system does not recognise mentions itself but uses the mentions of the gold standard instead. This is an obviously unrealistic premise: a complete system will arguably not only have to group mentions into coreference clusters, but also detect them independently beforehand. In general, systems that detect mentions themselves were evaluated by removing all undetected or redundant mentions from the evaluation data. In this way, the evaluation results are twofold: first, it is shown how effective the system is in finding mentions; second, how accurately it clusters the (correctly identified) mentions into entities. There have also been other proposals for dealing with the so-called ‘twinless’ mentions, that is mentions without a pair (those that occur in the gold data but not in the system output, or conversely). The undetected twinless mentions from the gold data could be added to the system output, the twinless mentions from both the system output and the gold data could be removed under the condition that they are singletons, etc. Mention detection efficiency can be estimated in a standard way by the use of precision and recall scores. It is merely required to choose beforehand whether we consider mentions to be correctly annotated only when their boundaries are identical with those of the gold mentions, or whether it suffices, e.g., for them to include the appropriate fragment of a gold mention. Splitting the task of detecting mentions from the task of detecting entities seems reasonable, because the former is similar to syntactic parsing and can be defined relatively easily.


14.3 Coreference evaluation metrics

Before we proceed with the detailed explanations, let us recapitulate some of the basic notions used throughout the book. In all evaluation scores it is assumed that coreference is represented by clusters. In other words, the relation of being coreferent is an equivalence relation – a binary relation which is reflexive, symmetric and transitive – therefore, each mention belongs to exactly one cluster. A cluster of coreferent mentions can be understood as a representative of a discourse entity (i.e. of the referent which all mentions in the cluster refer to). Therefore, coreference clusters are also called entities, for short. If a relation of coreference occurs between two mentions, it means that they belong to the same entity. If a mention is the only reference to a given discourse entity in the text, it constitutes a one-element cluster called a singleton. While performing certain computations, we will also refer to the concept of links between mentions. Such a link should be understood as a visualization of the fact that two mentions belong to the same entity.

Moreover, we assume that a coreference resolution system operates on the gold standard set of mentions. As explained in Section 14.2, this may be achieved in several ways, but the choice of the way is not relevant to coreference resolution algorithms. We will annotate mentions with subsequent natural numbers, according to their order of appearance in the text. The gold standard resolution will be referred to as GOLD, the automatic resolution as SYS. In the figures, a mention occurring after the first mention of the same entity will always be assigned an arrow pointing to the preceding mention of this entity.

The measures discussed here cannot be used to evaluate an annotation that does not assume transitivity of coreference, that is one in which the existence of a link from mention m1 to m2 and from m2 to m3 does not imply the coreference relation between m1 and m3. Coreference defined in a non-transitive manner can be evaluated with the standard precision and recall measures by counting the number of links resolved correctly and dividing this score by the number of links in SYS and GOLD respectively. Unfortunately, such an approach does not account for singletons. This section describes only those scores that assume transitivity of coreference, i.e. that operate on clusters of mentions.

In the next sections we will denote the set of all entities in GOLD by G, and in SYS by S. Then

G = {G1, G2, . . . , G|G|}
S = {S1, S2, . . . , S|S|}

where S_i and G_i are the i-th entity in SYS and in GOLD, respectively. There are |S| entities in SYS and |G| in GOLD.


14.3.1 MUC

The MUC score (Vilain et al., 1995), the oldest one proposed, seemingly operates on the links between mentions; however, in reality, it takes clusters rather than binary relations into account. The MUC-like recall measure is defined as follows:

recall = R = ∑_{i=1}^{|G|} (|G_i| − |p(G_i)|) / ∑_{i=1}^{|G|} (|G_i| − 1)

For each i, we add |G_i| − |p(G_i)| to the numerator and |G_i| − 1 to the denominator of the formula for recall. Let us discuss the denominator first. |G_i| is the number of elements (mentions) in the entity G_i. Thus, |G_i| − 1 corresponds to the minimum number of links that should be introduced between mentions in G_i in order for G_i to be correctly defined. For example, if an entity consists of three mentions denoted by the natural numbers 1, 2, 3, in order to define it we only need two links: e.g. 1 → 2 and 2 → 3, or 1 → 3 and 3 → 2, etc. Let us now consider the numerator. p(G_i) is a partition of the set G_i according to SYS. This partition is created by choosing all entities in SYS including any of the elements from G_i. In this way |p(G_i)| − 1 represents the minimum number of coreference links that should be added to SYS so that all elements of G_i are included in the same set. For instance, if GOLD includes an entity {1, 2, 3}, and SYS includes the following entities: {1}, {2, 4}, {3}, then p({1, 2, 3}) = {{1}, {2, 4}, {3}} and |p({1, 2, 3})| = 3, so we need 2 links (e.g. 1 → 2 and 4 → 3) to merge the elements into one entity. In fact, the numerator can be rewritten as (|G_i| − 1) − (|p(G_i)| − 1) = |G_i| − |p(G_i)|, which is the minimum number of links forming a given entity diminished by the minimum number of missing links necessary for the creation of this entity in SYS. In other words, it is the maximum number of correctly annotated links. Upon summing these correct links for all GOLD entities, we receive the numerator of the recall formula. The denominator, on the other hand, is the number of minimal links required for defining the GOLD partition.

Precision is computed similarly, but the roles of the SYS and GOLD resolutions are reversed. This means that the numerator represents the number of correct links in SYS, while the denominator shows the total number of links in SYS:

precision = P = ∑_{i=1}^{|S|} (|S_i| − |p′(S_i)|) / ∑_{i=1}^{|S|} (|S_i| − 1)

S is the entity set in SYS, S_i is the i-th entity in S, and p′(S_i) is the partition of the set S_i according to GOLD.
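The definition translates straightforwardly into code. The following is a minimal sketch (not part of the original text), assuming – as in the formulas above – that the gold and system mention sets coincide; it reproduces the example of Figure 14.1.

```python
# MUC recall/precision; entities are collections of mention identifiers. The
# partition p(G_i) groups the mentions of G_i by the entity of the other
# resolution that contains them (missing mentions fall into singleton cells).

def muc_recall(key, response):
    cell_of = {m: e for e in map(frozenset, response) for m in e}
    num = den = 0
    for entity in map(frozenset, key):
        partition = {cell_of.get(m, frozenset([m])) for m in entity}
        num += len(entity) - len(partition)   # correctly annotated links
        den += len(entity) - 1                # minimal links needed for the entity
    return num / den if den else 0.0

def muc(gold, system):
    r = muc_recall(gold, system)
    p = muc_recall(system, gold)              # precision: roles of GOLD and SYS reversed
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

GOLD = [{1, 2, 3}, {4}, {5}, {6, 7}, {8}, {9}, {10}, {11, 12, 13, 14, 15}]
SYS = [{1, 2, 3, 9}, {4, 5}, {6, 7}, {8}, {10}, {11, 12}, {13, 14, 15}]
print(muc(GOLD, SYS))   # precision = 6/8, recall = 6/7, F1 = 0.8
```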

14.3.1.1 Sample calculations

Let us calculate the outcome of the MUC score for the example in Figure 14.1. Let us start with recall (R). In GOLD there are 8 entity sets, that is |G| = 8. We have, for instance, the following partitions:

p({1, 2, 3}) = {{1, 2, 3, 9}}
p({11, 12, 13, 14, 15}) = {{11, 12}, {13, 14, 15}}

Thus:

R = ∑_{i=1}^{8} (|G_i| − |p(G_i)|) / ∑_{i=1}^{8} (|G_i| − 1)
  = [(3 − 1) + (1 − 1) + (1 − 1) + (2 − 1) + (1 − 1) + (1 − 1) + (1 − 1) + (5 − 2)] / [(3 − 1) + (1 − 1) + (1 − 1) + (2 − 1) + (1 − 1) + (1 − 1) + (1 − 1) + (5 − 1)]
  = 6/7 ≈ 0.857

Precision (P) is calculated by reversing the roles of GOLD and SYS. The entity set in SYS, denoted by S, consists of 7 elements; subsequent entities in SYS are denoted by S_i. Sample partitions are as follows (this time the function p′ divides a set from SYS with respect to the sets from GOLD):

p′({1, 2, 3, 9}) = {{1, 2, 3}, {9}}
p′({11, 12}) = {{11, 12, 13, 14, 15}}

[Fig. 14.1. A sample resolution: (a) GOLD, (b) SYS – a diagram of mentions 1–15 and their coreference links; not reproduced here.]


P = ∑_{i=1}^{7} (|S_i| − |p′(S_i)|) / ∑_{i=1}^{7} (|S_i| − 1)
  = [(4 − 2) + (2 − 2) + (2 − 1) + (1 − 1) + (1 − 1) + (2 − 1) + (3 − 1)] / [(4 − 1) + (2 − 1) + (2 − 1) + (1 − 1) + (1 − 1) + (2 − 1) + (3 − 1)]
  = 6/8 = 0.75

14.3.1.2 Main characteristics

The MUC-like measure has the following properties:
– It does not take singletons into account. This results from the fact that the score was dedicated to a corpus recording only those mentions which belonged to coreference chains. As can be seen in the example above, each correctly annotated singleton contributes the value of 0 both to the numerator and to the denominator. As a consequence, the score does not depend on the number of correctly annotated singletons.
– It is not always intuitive, mostly because it counts the minimum number of missing/redundant links. E.g. assigning a mention to the wrong chain results in both a lower precision and a lower recall; conversely, wrongly conflating two chains (which is a far more severe error) changes only the precision. As a result, MUC gives overly favourable outcomes for systems which merge too many mentions. In particular, should a system conflate all mentions into one entity, it will receive a recall score of 1 at a non-zero precision (P = 7/14 for the example from Figure 14.1).
– It does not matter whether a spurious split of an entity results in two medium-sized entities or in a small and a big one. Both of these errors are assigned identical weights.

14.3.2 B3

The B3 score (Bagga and Baldwin, 1998) was designed to solve the problems of the MUC score. Both precision and recall are defined as a mean of outcomes for single mentions, i.e.:

recall = R = (1/N) ∑_{i=1}^{N} R_i
precision = P = (1/N) ∑_{i=1}^{N} P_i

where N is the number of mentions, and R i and P i are values assigned to recall and precision for mention i, respectively (see below).

The authors also describe another version of these measures with different weights for the outcome of each mention. They propose to first assign identical weights to each entity, and then to distribute these weights evenly between the mentions within the entity. This version of the algorithm penalizes precision errors to a lesser degree; nevertheless, the most commonly used version is the one with weights 1/N.

Let E_SYS(i) denote the entity (i.e. the set of mentions) in SYS which includes mention i. Analogically, let E_GOLD(i) refer to the entity containing i in GOLD. Recall and precision for a given mention are calculated in the following manner:

R_i = |E_SYS(i) ∩ E_GOLD(i)| / |E_GOLD(i)|
P_i = |E_SYS(i) ∩ E_GOLD(i)| / |E_SYS(i)|

Thus, recall represents the number of mentions correctly classified as coreferent with i (including i as coreferent with itself), divided by the number of mentions coreferent with i in GOLD (also including i). Precision is computed by analogy, except that the denominator represents the number of elements in the entity containing i in SYS rather than in GOLD.
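A minimal sketch (not part of the original text) of this score, with the uniform 1/N mention weights and under the assumption that GOLD and SYS contain the same mentions, follows; it reproduces the values of the sample calculation below.

```python
# B3 precision/recall over per-mention scores.

def b_cubed(gold, system):
    gold_of = {m: frozenset(e) for e in gold for m in e}
    sys_of = {m: frozenset(e) for e in system for m in e}
    mentions = list(gold_of)
    r = sum(len(gold_of[m] & sys_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    p = sum(len(gold_of[m] & sys_of[m]) / len(sys_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

GOLD = [{1, 2, 3}, {4}, {5}, {6, 7}, {8}, {9}, {10}, {11, 12, 13, 14, 15}]
SYS = [{1, 2, 3, 9}, {4, 5}, {6, 7}, {8}, {10}, {11, 12}, {13, 14, 15}]
p, r, f1 = b_cubed(GOLD, SYS)
print(round(r, 2), round(p, 3))   # 0.84 0.833, as in Section 14.3.2.1
```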

14.3.2.1 Sample calculation

Let us calculate the B3 score for the example in Figure 14.2. We have:

R = (1/N) ∑_{i=1}^{N} R_i = (1/15) ∑_{i=1}^{15} R_i
P = (1/N) ∑_{i=1}^{N} P_i = (1/15) ∑_{i=1}^{15} P_i

[Fig. 14.2. A sample resolution: (a) GOLD, (b) SYS – a diagram of mentions 1–15 and their coreference links; not reproduced here.]


We also have, for instance (it is worth noting that R_1 = R_2 = R_3):

R_1 = |E_SYS(1) ∩ E_GOLD(1)| / |E_GOLD(1)| = |{1, 2, 3, 9} ∩ {1, 2, 3}| / |{1, 2, 3}| = |{1, 2, 3}| / |{1, 2, 3}| = 3/3 = 1

R_11 = |E_SYS(11) ∩ E_GOLD(11)| / |E_GOLD(11)| = |{11, 12} ∩ {11, 12, 13, 14, 15}| / |{11, 12, 13, 14, 15}| = |{11, 12}| / |{11, 12, 13, 14, 15}| = 2/5 = 0.4

In total we have:

R = (1/15) ∗ (3/3 + 3/3 + 3/3 + 1/1 + 1/1 + 2/2 + 2/2 + 1/1 + 1/1 + 1/1 + 2/5 + 2/5 + 3/5 + 3/5 + 3/5)
  = (1/15) ∗ (10 + 13/5) = (1/15) ∗ (63/5) = 63/75 = 0.84

while

P = (1/15) ∗ (3/4 + 3/4 + 3/4 + 1/2 + 1/2 + 2/2 + 2/2 + 1/1 + 1/4 + 1/1 + 2/2 + 2/2 + 3/3 + 3/3 + 3/3)
  = (1/15) ∗ (9 + 9/4 + 2/2 + 1/4) = (1/15) ∗ (50/4) = 50/60 ≈ 0.833

14.3.2.2 Main characteristics

The B3 measure has the following properties:
– It takes singletons into account.
– It pays attention to the size of the wrongly conflated entities – the bigger the number of elements in these entities, the heavier the penalty.
– It is not always intuitive. For instance, a system conflating all mentions together always reaches 100% recall, while a system returning only singletons always reaches 100% precision. One would instead expect a 100% recall to mean that the system returned a superset of the entities from GOLD, and a 100% precision to mean that all entities from SYS are correct.
– Unfortunately, it is not resistant to large amounts of singletons in texts – its outcomes usually come very close to 100%.

14.3.3 ACE-value

The ACE-value score (Doddington et al., 2004) was used during the ACE tournament (see the introduction to Section 3.1). It is very task-specific and, consequently, unhelpful in general. The outcome is computed by summing up the errors: wrongly detected entities, undetected entities and entities with incorrect elements. Each error is assigned a different weight which depends on the entity type (e.g. LOCATION, PERSON), as well as on the mentions contained in the entity (NAME, NOMINAL, PRONOUN). For normalisation reasons, the achieved value is then divided by the sum of error weights of a hypothetical system which would return singletons only. The final outcome is obtained by subtracting this normalised error rate from one; a poor system can thus receive a negative outcome. Similarly to CEAF (see below), entities in SYS and GOLD are mapped onto each other on a one-to-one basis, which avoids some of the problems related to the fact that a single entity can otherwise be used multiple times for comparisons with other entities. However, the score is difficult to interpret. For instance, an outcome of 80% means solely that the joint weight of errors amounts to 20% of the weight of errors of a system returning only singletons, which does not say much about the overall quality of the system.

14.3.4 CEAF

The Constrained Entity-Alignment F-measure (Luo, 2005) was developed in order to overcome the shortcomings of the MUC and B3 scores. The main idea is to perform a one-to-one mapping of entities from SYS to entities in GOLD (up to differences in the number of elements in those entities). The author believes that the problems with counterintuitive outcomes of the MUC and B3 scores result from the fact that a single entity can be used multiple times for comparisons with other entities, because entity intersections are considered. Assigning every entity from SYS to at most one entity from GOLD is supposed to solve this problem. As the numbers of entities in GOLD and SYS are not necessarily identical, some of the SYS or GOLD entities have no counterparts. In consequence, the mapping can be performed only on a limited number of entities. Let us define m in the following way:

m = min{|G|, |S|}


A subset of m entities can be chosen from G in (|G| choose m) different ways. Let G^i_m ⊆ G denote the i-th of these subsets (where i = 1, . . . , (|G| choose m)). By analogy, let S^j_m ⊆ S (where j = 1, . . . , (|S| choose m)) be the j-th m-element subset of the set S.

Let H(G^i_m, S^j_m) denote the set of all possible one-to-one mappings of the entities in G^i_m onto those in S^j_m, and let H_m denote the union of all such sets of mappings over all i and j. Formally:

H(G^i_m, S^j_m) = {h : G^i_m → S^j_m | ∀ S ∈ S^j_m ∃! G ∈ G^i_m : h(G) = S}
H_m = ∪_{i,j} H(G^i_m, S^j_m)
|H_m| = (|G| choose m) × (|S| choose m) × m!

Now, let ϕ(G, S) denote the similarity between two entities G and S. For a given similarity measure ϕ and a mapping h, the joint similarity score for h is obtained by aggregating the similarities of the entity pairs linked by h:

Φ(h ∈ H(G^i_m, S^j_m)) = ∑_{G ∈ G^i_m} ϕ(G, h(G))

The mapping considered the best is the one maximising the joint similarity of all pairs in this mapping:

h∗ = argmax_{h ∈ H_m} Φ(h)

Finally, the CEAF score for the similarity function ϕ is defined as follows:

precision = p = Φ(h∗) / ∑_{i=1}^{|S|} ϕ(S_i, S_i)
recall = r = Φ(h∗) / ∑_{i=1}^{|G|} ϕ(G_i, G_i)
F1 = 2pr / (p + r)

The author proposed two similarity functions that can be used with the CEAF measure:

ϕ3(G, S) = |G ∩ S|
ϕ4(G, S) = 2 ∗ |G ∩ S| / (|G| + |S|)

ϕ3 is a mention-based similarity, while ϕ4 is entity-based. Each of them yields a particular instantiation of the CEAF score: the mention-based CEAF (called CEAF-M) and the entity-based CEAF (CEAF-E), respectively. Note that, in the case of CEAF-M, the denominator of the precision formula corresponds to the number of mentions in SYS, while in recall the denominator represents the number of mentions in GOLD. F1 can be understood as the percentage of mentions correctly assigned to entities. In CEAF-E, in turn, the denominator in the precision formula is equal to the number of entities in SYS, while in recall it is equal to the number of entities in GOLD. The intuition behind F1 is to represent the rate of correct entities within all entities.
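In practice the optimal alignment h∗ need not be found by enumerating H_m; it can be computed with the Hungarian algorithm. The following is a minimal sketch (not part of the original text) of CEAF-M under that assumption; it relies on SciPy's linear_sum_assignment and reproduces the 11/15 example below.

```python
# CEAF with the mention-based similarity phi3, using an assignment solver to
# find the optimal one-to-one entity alignment.

import numpy as np
from scipy.optimize import linear_sum_assignment

def ceaf_m(gold, system):
    gold = [set(e) for e in gold]
    system = [set(e) for e in system]
    sim = np.array([[len(g & s) for s in system] for g in gold])   # phi3 values
    rows, cols = linear_sum_assignment(sim, maximize=True)         # best alignment h*
    phi_h = sim[rows, cols].sum()
    recall = phi_h / sum(len(g) for g in gold)        # phi3(G_i, G_i) = |G_i|
    precision = phi_h / sum(len(s) for s in system)   # phi3(S_i, S_i) = |S_i|
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

GOLD = [{1, 2, 3}, {4}, {5}, {6, 7}, {8}, {9}, {10}, {11, 12, 13, 14, 15}]
SYS = [{1, 2, 3, 9}, {4, 5}, {6, 7}, {8}, {10}, {11, 12}, {13, 14, 15}]
print(ceaf_m(GOLD, SYS))   # precision = recall = F1 = 11/15 ≈ 0.733
```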

14.3.4.1 Example of calculations

Let us calculate the outcome of the CEAF-M score for the example from Figure 14.3. The first step is to find a mapping of entities from GOLD onto those from SYS which maximises the joint similarity of the entities linked by the mapping. One such mapping is the following:

G1 = {1, 2, 3}_GOLD ↔ {1, 2, 3, 9}_SYS = S1
G2 = {4}_GOLD ↔ {4, 5}_SYS = S2
G3 = {5}_GOLD
G4 = {6, 7}_GOLD ↔ {6, 7}_SYS = S3
G5 = {8}_GOLD ↔ {8}_SYS = S4
G6 = {9}_GOLD ↔ {11, 12}_SYS = S6
G7 = {10}_GOLD ↔ {10}_SYS = S5
G8 = {11, 12, 13, 14, 15}_GOLD ↔ {13, 14, 15}_SYS = S7

As can be seen, one entity in GOLD (G3) is assigned to no entity in SYS, because the number of elements in S is smaller by 1 than the number of elements in G. If we call the above mapping h∗, then aggregating the similarities of the mapped entities yields the value Φ(h∗) computed below.

[Fig. 14.3. A sample resolution: (a) GOLD, (b) SYS – a diagram of mentions 1–15 and their coreference links; not reproduced here.]


Φ(h∗) = ϕ3(G1, S1) + ϕ3(G2, S2) + ϕ3(G4, S3) + ϕ3(G5, S4) + ϕ3(G6, S6) + ϕ3(G7, S5) + ϕ3(G8, S7) = 3 + 1 + 2 + 1 + 0 + 1 + 3 = 11

Since

∑_i ϕ(G_i, G_i) = 3 + 1 + 1 + 2 + 1 + 1 + 1 + 5 = 15
∑_i ϕ(S_i, S_i) = 4 + 2 + 2 + 1 + 1 + 2 + 3 = 15

the precision, recall and F1-measure are equal to:

precision = p = Φ(h∗) / ∑_i ϕ(S_i, S_i) = 11/15
recall = r = Φ(h∗) / ∑_i ϕ(G_i, G_i) = 11/15
F1 = 2pr / (p + r) = 2 ∗ (11/15)² / (2 ∗ (11/15)) = 11/15

14.3.4.2 Main characteristics

The CEAF measure has the following properties:
– It is interpretable – e.g. with the similarity measure ϕ3, an outcome of 70% indicates that approximately 70% of mentions are in the right entities.
– It solves some of the problems of the MUC and B3 measures.
– It is computationally demanding – its complexity is O(m³ log m), where m is the number of mentions.
– It is not resistant to a large amount of singletons – the scores come close to 100%, leaving little space for comparisons.
– A correct merge of two mentions in SYS can be ignored if it is impossible to assign their entity to an entity in GOLD. This applies e.g. to the entity {11, 12} in the example above.
– Entities have the same weight regardless of their size (Stoyanov et al., 2009); therefore, a spurious conflation of two small entities and of a big and a small one are penalized equally.

14.3.5 BLANC

BLANC is the newest metric, described by Recasens (2010). It originates from the Rand measure designed for ranking clustering methods, modified so as to better resist large amounts of singletons.

Table 14.3. BLANC coincidence table

                                         SYS
                              Coreferent    Non-coreferent
GOLD    Coreferent               rc              wn
        Non-coreferent           wc              rn

BLANC takes all pairs of mentions into account: for each pair of mentions, it is checked whether the system considers them coreferent or not. In order to calculate the BLANC outcome for a given resolution, the following values from Table 14.3 must first be determined:
– rc (rightly coreferent) – the number of pairs of mentions marked as coreferent both in SYS and in GOLD
– wc (wrongly coreferent) – the number of pairs of mentions marked as coreferent in SYS and as non-coreferent in GOLD
– wn (wrongly non-coreferent) – the number of pairs of mentions marked as non-coreferent in SYS and as coreferent in GOLD
– rn (rightly non-coreferent) – the number of pairs of mentions marked as non-coreferent both in SYS and in GOLD.
The number of pairs correctly classified by the system amounts to rc + rn. The sum rc + rn + wn + wc is always equal to the number of all pairs of mentions, which means that for n mentions in the text it amounts to n(n − 1)/2. Having completed the table, we may proceed to the calculation of the final scores of the BLANC measure. The relevant equations are given in Table 14.4. As the table shows, the values of recall, precision and F-measure are calculated separately for the coreferent and for the non-coreferent pairs. The final results of the BLANC measure are the arithmetic means:
– BLANC-P – precision is the mean of Pc (precision of coreferent pairs) and Pn (precision of non-coreferent pairs)
– BLANC-R – recall is the mean of Rc (recall of coreferent pairs) and Rn (recall of non-coreferent pairs)
– BLANC-F – F-measure is the mean of Fc (F-measure of coreferent pairs) and Fn (F-measure of non-coreferent pairs).

Table 14.4. Formulas of the BLANC measure components

Measure        Coreference                  Non-coreference              BLANC
Precision-P    Pc = rc / (rc + wc)          Pn = rn / (rn + wn)          BLANC-P = (Pc + Pn) / 2
Recall-R       Rc = rc / (rc + wn)          Rn = rn / (rn + wc)          BLANC-R = (Rc + Rn) / 2
F-measure-F    Fc = 2PcRc / (Pc + Rc)       Fn = 2PnRn / (Pn + Rn)       BLANC-F = (Fc + Fn) / 2


Table 14.5. BLANC boundary cases

                                                   SYS
                        only singletons       one entity            otherwise
GOLD  only singletons   BLANC-P = R = F = 1   BLANC-P = R = F = 0   BLANC-P = Pn, BLANC-R = Rn,
                                                                    BLANC-F = Fn
      one entity        BLANC-P = R = F = 0   BLANC-P = R = F = 1   BLANC-P = Pc, BLANC-R = Rc,
                                                                    BLANC-F = Fc
      otherwise         BLANC-P = Pn/2,       BLANC-P = Pc/2,       according to Table 14.4
                        BLANC-R = Rn/2,       BLANC-R = Rc/2,
                        BLANC-F = Fn/2        BLANC-F = Fc/2

The authors also propose a generalised version of the measure called BLANC_α. It relies on an α parameter which allows us to assign different weights to the two components of the mean:

BLANC_α-F = α ∗ Fc + (1 − α) ∗ Fn

Thus, it is possible to give more importance to coreferent pairs than to non-coreferent ones, or vice versa. With α equal to 1/2, we get the default BLANC score. In boundary cases, the measure may lead to a division by 0. For example, when the system returns singletons only, rc + wc = 0 and the value of Pc is undefined. In such cases (when SYS or GOLD consist of singletons or of one global entity only), the authors put forward deviations from the general formula, shown in Table 14.5. Table 14.5 can be summarized as follows:
– If GOLD is identical with SYS, recall, precision and F-measure are equal to 1.
– If GOLD is opposite to SYS, recall, precision and F-measure are equal to 0.
– If GOLD contains no coreferent pairs (only singletons are present), the BLANC score is simply the score for non-coreferent pairs. Conversely, if all pairs in GOLD are coreferent (there is only one global entity), the BLANC score mirrors the score of coreferent pairs.
– If the general formula leads to a division by 0, the deviated score is 0.
The boundary cases are obviously problematic, since huge differences in evaluation scores may occur between systems whose results are close. For instance, whenever GOLD contains a single two-element entity and SYS includes singletons only, the system's score will not exceed 50%. If, however, GOLD contains no coreferent pair at all, the score of the same system is 100%.

14.3.5.1 Example of calculations

Let us calculate the score of the BLANC measure for the example from Figure 14.4.

[Fig. 14.4. A sample resolution: (a) GOLD, (b) SYS – a diagram of mentions 1–15 and their coreference links; not reproduced here.]

Because there are 15 mentions, there are 15 ∗ 14/2 = 105 pairs. How many coreferent pairs have been marked correctly? Note that the figure does not represent all coreference links. For example, GOLD includes the entity {1, 2, 3}; thus, three pairs are in the coreference relation ({1, 2}, {2, 3} and {1, 3}), although only two coreference links appear in the figure.

SYS contains the following 8 pairs rightly marked as coreferent (rc = 8):
{1, 2}, {1, 3}, {2, 3}, {6, 7}, {11, 12}, {13, 14}, {13, 15}, {14, 15}

SYS also contains 4 pairs wrongly marked as coreferent (wc = 4):
{1, 9}, {2, 9}, {3, 9}, {4, 5}

Furthermore, 6 pairs are wrongly non-coreferent in SYS (wn = 6):
{11, 13}, {11, 14}, {11, 15}, {12, 13}, {12, 14}, {12, 15}

The value that remains to be calculated is rn. As rc + wc + wn + rn = 105, it is easy to compute that rn = 87. All results are presented in Table 14.6. These values allow us to obtain the scores for the BLANC measure, as shown in Table 14.7.


Table 14.6. Sample BLANC coincidence table

                                         SYS
                              Coreferent    Non-coreferent
GOLD    Coreferent                8               6
        Non-coreferent            4              87

Table 14.7. Sample BLANC scores

Coreference                       Non-coreference                    BLANC
Pc = rc/(rc + wc) = 8/12          Pn = rn/(rn + wn) = 87/93          BLANC-P = (Pc + Pn)/2 = 149/186 ≈ 0.801
Rc = rc/(rc + wn) = 8/14          Rn = rn/(rn + wc) = 87/91          BLANC-R = (Rc + Rn)/2 = 139/182 ≈ 0.764
Fc = 2PcRc/(Pc + Rc) = 16/26      Fn = 2PnRn/(Pn + Rn) = 87/92       BLANC-F = (Fc + Fn)/2 = 1867/2392 ≈ 0.781
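The whole computation can be followed in code. The following is a minimal sketch (not part of the original text) of the general-case BLANC formulas; the boundary cases of Table 14.5 are deliberately not handled, and the example reproduces the counts and scores of Tables 14.6 and 14.7.

```python
# BLANC over all mention pairs (general case only; division by zero may occur
# in the boundary cases described in Table 14.5).

from itertools import combinations

def blanc(gold, system):
    gold_of = {m: i for i, e in enumerate(gold) for m in e}
    sys_of = {m: i for i, e in enumerate(system) for m in e}
    rc = wc = wn = rn = 0
    for m1, m2 in combinations(sorted(gold_of), 2):
        in_gold = gold_of[m1] == gold_of[m2]
        in_sys = sys_of[m1] == sys_of[m2]
        if in_gold and in_sys:
            rc += 1
        elif in_sys:
            wc += 1
        elif in_gold:
            wn += 1
        else:
            rn += 1
    pc, pn = rc / (rc + wc), rn / (rn + wn)
    rcl, rnl = rc / (rc + wn), rn / (rn + wc)
    fc = 2 * pc * rcl / (pc + rcl)
    fn = 2 * pn * rnl / (pn + rnl)
    return (pc + pn) / 2, (rcl + rnl) / 2, (fc + fn) / 2   # BLANC-P, BLANC-R, BLANC-F

GOLD = [{1, 2, 3}, {4}, {5}, {6, 7}, {8}, {9}, {10}, {11, 12, 13, 14, 15}]
SYS = [{1, 2, 3, 9}, {4, 5}, {6, 7}, {8}, {10}, {11, 12}, {13, 14, 15}]
print(tuple(round(x, 3) for x in blanc(GOLD, SYS)))   # (0.801, 0.764, 0.781)
```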

14.3.5.2 Main characteristics

The BLANC measure has the following properties:
– The version with the α parameter enables us to give different importance to coreferent and non-coreferent pairs. The final score then becomes a weighted mean instead of a simple arithmetic mean.
– The behaviour of the measure is problematic for boundary cases (when SYS or GOLD consist either of singletons only or of a single global entity). For instance, the system cannot exceed 50% if there is only one coreferent pair in GOLD and only singletons in SYS. In general, if the system does not find any correct coreferent link, its score will not exceed 50%.
– The measure is resistant to large amounts of singletons and can further be regulated with the α parameter.
– It accounts for the sizes of different entities, because it relies on pairs of coreferent mentions, whose number is quadratic in the entity size.

Mateusz Kopeć

15 Evaluation results

15.1 PCC evaluation methodology

The state-of-the-art survey of coreference evaluation, described in Chapter 14, led us to put forward an evaluation methodology within the CORE project. It resulted in Scoreference – a newly-developed coreference resolution evaluation tool¹. The tool takes as input two versions (manual and automatic) of the same set of texts in MMAX or TEI format and compares them. Henceforth, the mentions contained in the manual version are called manual or gold mentions, while those in the automatic version are referred to as automatic or system mentions. Depending on the experiment, coreference resolution systems may be evaluated with either gold or system mentions as input.

When system mentions are the input to the coreference system, there are several ways of mapping gold to system mentions in order to calculate the final scores of the coreference resolution evaluation measures. Our system incorporates 4 different strategies for this task, all of them being possible combinations of two binary decisions. The first decision is whether the full extent of mentions or only their heads should be taken into account when comparing gold and system mentions. The second decision concerns choosing one of two strategies of dealing with twinless mentions (present only in one of the text versions). A combination of these two choices gives one of the four available evaluation scores. These 4 results are only useful when a coreference resolution system is tested on system mentions. When gold mentions are used as input, there are no twinless mentions and there exists a one-to-one identity mapping between the mention output produced by the system and the gold mentions.

Our evaluation scheme is mostly in line with the SemEval approach: mention detection and coreference resolution can be evaluated separately, but an end-to-end system evaluation procedure is also available. The differences between our evaluation procedure and the one from SemEval 2010 are twofold: (i) for the system mention setting, we apply the exact mention matching and the head mention matching evaluation of mention detection (cf. Section 15.1.1) without combining them into one measure, (ii) we allow for an evaluation variant which takes “twinless” mentions into account (cf. Section 15.1.2). A separate evaluation of mention detection and coreference resolution for mentions of specific classes (such as zero subjects or pronouns) is not implemented; however, this may be easily achieved by filtering the gold and system mention sets.

1 The tool is downloadable at: http://zil.ipipan.waw.pl/PolishCoreferenceTools.

15.1.1 Mention detection measures

We evaluate mention detection using precision, recall and F-measure. Contrary to SemEval, we have decided not to reward partial matches but to provide two alternative mention detection scores instead:
– the score of exact boundary matches (an automatic and a manual mention match if they have exactly the same boundaries, i.e. they consist of the same tokens) (EXACT)
– the score of head matches (we reduce the automatic and the manual mentions to their single head tokens and compare them) (HEAD).

15.1.2 Coreference resolution measures

There is still no consensus about the single best coreference resolution metric (cf. Chapter 14); therefore, our evaluation tool provides results for 5 widely known measures: MUC, B3, mention- and entity-based CEAF, and BLANC. As these measures assume that system and gold mentions are the same, we implemented two options for evaluation settings in which this condition does not hold (i.e. in the system mention setting):
– we consider only the correct system mentions (i.e. the intersection between gold and system mentions) (INTERSECT)
– we transform the system and gold mentions according to the procedure described below (TRANSFORM).
The TRANSFORM procedure, meant to deal with the so-called “twinless” mentions (not belonging to the intersection of the system and gold mention sets), was presented by Màrquez et al. (2012) and consists of the following steps (a schematic sketch is given after this list):
1. inserting twinless true mentions into the system output as singletons,
2. removing twinless system mentions that are resolved as singletons,
3. inserting twinless system mentions that are resolved as coreferent into the gold mention set (as singletons).
This approach was also used in the CoNLL-2011 shared task (Pradhan et al., 2011).
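The following is a minimal sketch (not part of the original text) of the three TRANSFORM steps, under the assumption that mentions are hashable identifiers that can be aligned between the gold and system versions; function and variable names are illustrative only.

```python
# TRANSFORM strategy for twinless mentions: entities are sets of mention ids.

def transform(gold_entities, sys_entities):
    gold_mentions = {m for e in gold_entities for m in e}
    sys_mentions = {m for e in sys_entities for m in e}
    twinless_sys = sys_mentions - gold_mentions

    # 1. twinless true (gold) mentions are added to the system output as singletons
    new_sys = [set(e) for e in sys_entities]
    new_sys += [{m} for m in gold_mentions - sys_mentions]

    # 2. twinless system mentions resolved as singletons are removed from the output
    new_sys = [e for e in new_sys
               if not (len(e) == 1 and next(iter(e)) in twinless_sys)]

    # 3. twinless system mentions resolved as coreferent are added to the gold
    #    entity set as singletons
    new_gold = [set(e) for e in gold_entities]
    new_gold += [{m} for e in sys_entities if len(e) > 1
                 for m in e if m not in gold_mentions]

    return new_gold, new_sys
```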

15.1.3 Evaluation data

As some of the evaluated coreference resolution and mention detection tools used machine learning techniques, the full Polish Coreference Corpus (TEI version 0.92) was randomly split into two parts: the training part and the testing part. All results presented in this chapter are based on the testing part.


The splitting procedure randomly selected 30% of the texts for the testing part, while the remaining texts formed the training part. Text type balance was maintained in this division, with an additional constraint that at least one text of each type should be in the testing part. To replicate this text division, one may use the utils.Splitter class of the Scoreference tool. The test data contained 530 texts, which were automatically enriched with tagging, named entity and shallow parsing data (for details see Chapter 10). This set was then used to evaluate mention detection and coreference resolution.

15.1.4 Evaluated tools

In this chapter, we evaluate a single mention detection system, MentionDetector (cf. Section 10.6), and 4 coreference resolution systems:
– Ruler – described in Chapter 11
– Bartek with the Bart-pl-1 feature set (Bartek-1) – described in Section 12.1
– Bartek with the Bart-pl-2 feature set (Bartek-2) – described in Section 12.2
– Bartek with the Bart-pl-3 feature set (Bartek-3) – described in Section 12.3.
MentionDetector and Bartek were trained on the 70% of the corpus not used for evaluation. MentionDetector trained its zero subject detection model²; Bartek (all 3 feature sets) used gold mentions for training its coreference resolution model.

15.2 Mention detection evaluation results

As stated in the previous section, we evaluate mention detection using two procedures: EXACT and HEAD. The mention detection system is MentionDetector³ in version 1.2 (with a Spejd grammar for nested mentions, presented in Section 10.3). Precision, recall and F1 scores for both settings are presented in Table 15.1.

Table 15.1. Mention detection evaluation results

Setting    Precision    Recall    F1
EXACT      66.79%       67.21%    67.00%
HEAD       88.29%       89.41%    88.85%

2 MentionDetector is a deterministic rule-based tool, yet its zero subject detection model uses a machine learning method, for which training data is required. 3 See Section 10.6.

Since an important aspect of MentionDetector is zero subject detection, we also tested its performance on that task alone: it scored 77.15% precision, 72.47% recall and 74.73% F1, which is very similar to the results presented in Section 10.4.4.1.

15.3 Coreference resolution evaluation results

Coreference resolution tools were evaluated in two settings: using gold mentions and using system mentions provided by MentionDetector.

15.3.1 Gold mentions

As gold mentions were given as input to the coreference resolution tools, the problem of twinless mentions was not present in this experiment. The problem of mapping system mentions to gold mentions was not present either, as the two mention sets were identical; there was also no need to check the two variants of mention matching (exact boundary or head only), as they would give the same results. Therefore, we have only one result per competing system for each score, presented in Table 15.2, in which each row has the best result highlighted.

Table 15.2. Coreference resolution evaluation results – gold mentions

Score                 Ruler     Bartek-1   Bartek-2   Bartek-3
MUC      Precision    51.83%    57.85%     61.79%     61.82%
         Recall       66.39%    62.62%     66.94%     67.82%
         F1           58.21%    60.14%     64.26%     64.68%
B3       Precision    78.70%    83.48%     84.67%     84.47%
         Recall       85.44%    84.49%     85.89%     86.17%
         F1           81.94%    83.98%     85.27%     85.31%
CEAF-M   Precision    74.88%    77.88%     80.16%     80.22%
         Recall       74.88%    77.88%     80.16%     80.22%
         F1           74.88%    77.88%     80.16%     80.22%
CEAF-E   Precision    85.05%    84.55%     86.66%     86.95%
         Recall       75.59%    81.79%     83.80%     83.60%
         F1           80.04%    83.15%     85.20%     85.24%
BLANC    Precision    71.36%    74.40%     76.51%     76.39%
         Recall       70.15%    72.09%     73.88%     74.18%
         F1           70.74%    73.18%     75.12%     75.23%


Table 15.3. Coreference resolution evaluation results – system mentions, EXACT INTERSECT setting

Score                 Ruler     Bartek-1   Bartek-2   Bartek-3
MUC      Precision    61.25%    67.58%     67.02%     66.80%
         Recall       72.86%    67.11%     71.14%     72.04%
         F1           66.55%    67.35%     69.02%     69.32%
B3       Precision    83.25%    87.49%     86.51%     86.22%
         Recall       87.49%    85.97%     87.41%     87.69%
         F1           85.32%    86.72%     86.96%     86.95%
CEAF-M   Precision    79.42%    81.64%     82.09%     82.03%
         Recall       79.42%    81.64%     82.09%     82.03%
         F1           79.42%    81.64%     82.09%     82.03%
CEAF-E   Precision    87.24%    86.07%     87.49%     87.72%
         Recall       80.55%    86.31%     85.32%     84.93%
         F1           83.76%    86.19%     86.39%     86.30%
BLANC    Precision    76.36%    78.45%     78.76%     78.81%
         Recall       74.50%    75.23%     77.31%     77.58%
         F1           75.40%    76.74%     78.01%     78.18%

Table 15.4. Coreference resolution evaluation results – system mentions, EXACT TRANSFORM setting

Score                 Ruler     Bartek-1   Bartek-2   Bartek-3
MUC      Precision    38.50%    43.46%     43.32%     43.25%
         Recall       49.74%    45.81%     48.56%     49.18%
         F1           43.40%    44.61%     45.79%     46.02%
B3       Precision    77.51%    82.24%     81.21%     80.94%
         Recall       84.05%    82.78%     83.65%     83.84%
         F1           80.65%    82.51%     82.41%     82.36%
CEAF-M   Precision    70.71%    73.64%     73.73%     73.67%
         Recall       70.71%    73.64%     73.73%     73.67%
         F1           70.71%    73.64%     73.73%     73.67%
CEAF-E   Precision    79.57%    79.41%     80.25%     80.38%
         Recall       72.00%    77.96%     77.00%     76.70%
         F1           75.59%    78.68%     78.59%     78.50%
BLANC    Precision    65.75%    68.81%     69.25%     69.35%
         Recall       65.69%    66.19%     67.53%     67.71%
         F1           65.72%    67.39%     68.35%     68.49%

Table 15.5. Coreference resolution evaluation results – system mentions, HEAD INTERSECT setting

Score                 Ruler     Bartek-1   Bartek-2   Bartek-3
MUC      Precision    55.74%    60.78%     60.87%     60.77%
         Recall       72.10%    64.32%     69.10%     70.09%
         F1           62.87%    62.50%     64.72%     65.10%
B3       Precision    80.88%    85.55%     84.50%     84.23%
         Recall       88.10%    86.01%     87.54%     87.83%
         F1           84.33%    85.78%     85.99%     86.00%
CEAF-M   Precision    77.57%    80.02%     80.47%     80.43%
         Recall       77.57%    80.02%     80.47%     80.43%
         F1           77.57%    80.02%     80.47%     80.43%
CEAF-E   Precision    86.70%    85.45%     86.98%     87.22%
         Recall       77.11%    83.57%     82.55%     82.18%
         F1           81.62%    84.50%     84.71%     84.63%
BLANC    Precision    73.79%    76.58%     76.68%     76.76%
         Recall       74.30%    74.33%     76.36%     76.63%
         F1           74.04%    75.40%     76.52%     76.70%

Table 15.6. Coreference resolution evaluation results – system mentions, HEAD TRANSFORM setting

Score                 Ruler     Bartek-1   Bartek-2   Bartek-3
MUC      Precision    47.21%    50.82%     51.29%     51.27%
         Recall       59.83%    53.37%     57.34%     58.16%
         F1           52.77%    52.06%     54.15%     54.50%
B3       Precision    78.04%    82.48%     81.50%     81.24%
         Recall       84.65%    82.81%     84.06%     84.31%
         F1           81.21%    82.65%     82.76%     82.74%
CEAF-M   Precision    72.09%    74.49%     74.79%     74.74%
         Recall       72.09%    74.49%     74.79%     74.74%
         F1           72.09%    74.49%     74.79%     74.74%
CEAF-E   Precision    80.70%    80.10%     81.23%     81.39%
         Recall       72.43%    78.55%     77.54%     77.17%
         F1           76.34%    79.32%     79.34%     79.22%
BLANC    Precision    68.78%    71.09%     71.59%     71.71%
         Recall       68.08%    68.12%     69.65%     69.86%
         F1           68.43%    69.49%     70.58%     70.74%


15.3.2 System mentions

For system mentions, we have four results for each system:
– EXACT INTERSECT in Table 15.3
– EXACT TRANSFORM in Table 15.4
– HEAD INTERSECT in Table 15.5
– HEAD TRANSFORM in Table 15.6.
In each table, each row has the best result highlighted.

15.4 Conclusions and future directions

The mention detection results are satisfactory, especially if we take into account that mention detection includes zero subject detection. The F1 score of 88.85% in the HEAD setting, compared to 67.00% in the EXACT setting, implies that the focus should now be on improving exact mention boundary detection. The next step would be to analyse the mention types which are the main sources of errors and to improve the Spejd grammar or the zero subject detection algorithm so that it handles such difficult cases.

An analysis of the gold mention setting shows that Bartek, with any of its feature sets, is superior to Ruler, as it has higher F1 scores for all metrics. Another interesting finding in the gold mention setting lies in the comparison of the three Bartek feature settings. As presented in Chapter 12, the Bart-pl-3 feature set is a superset of Bart-pl-2, which is in turn a superset of Bart-pl-1. One would therefore expect that, with high probability, the trained models achieve better results when the feature set is extended. This is exactly what we observe in the results: Bart-pl-3 reports the best F1 scores among the Bartek system versions for all measures. This shows that SoonEncoder and SoonDecoder combined with a pairwise resolution model are optimised for all the metrics used.

The system mention settings show again that Bartek is the best system regarding F1 for all metrics, yet no single configuration outperforms the other ones: only for BLANC and MUC F1 are the best scores consistently achieved by the Bart-pl-3 setting. This situation may be explained by the fact that all Bartek versions were trained using gold mentions. It would be very interesting to retrain them using system mentions and see how this influences the results.

| Part V: Summary

Maciej Ogrodniczuk, Katarzyna Głowińska, Mateusz Kopeć, Agata Savary, Magdalena Zawisławska

16 Conclusions

16.1 Afterthoughts

Of course, the instruction for annotators had been prepared before the start of the corpus annotation, and we were forced to adopt in it some basic presumptions about the rules of describing the coreference phenomenon. In the course of work on the corpus and afterwards, some doubts and ambiguities connected with some of our presumptions arose, while other presumptions turned out to be very good solutions. It seems, therefore, that our instruction was substantially well thought out in its original shape, yet some of the annotation rules were worth modifying in the later stages of work on coreference.

16.1.1 Non-nominal pronouns

Initially, we had assumed that the head of a nominal phrase can only be constituted by a noun or a nominal pronoun, but we removed from clusters those phrases whose superordinate elements were reflexive, reciprocal, interrogative, relative, indefinite, generalising or negative pronouns. In the course of the annotation of the corpus, this turned out to be a wrong solution, as all the mentioned types of pronouns can form relatively long coreferential chains, cf.:

(16.1)

Takie usługi w Polsce może prowadzić każdy. Wystarczy, że Ø zarejestruje działalność gospodarczą i Ø wpisze się do ewidencji. ‘Everyone can conduct such services in Poland. It suffices that they register their business activity and sign into the register.’

In the example above, the generalising pronoun każdy ‘everyone’ forms a cluster with zero-anaphora represented by the verbs zarejestruje ‘register’ and wpisze się ‘sign into’. (16.2)

Rzadko ktoś decyduje się na to, by zrobić sobie wszystkie zęby porcelanowe w odcieniu, jaki sobie zażyczy. ‘It is seldom that somebody decides to make himself porcelain teeth in the colour he requests.’

On the other hand, the antecedent in the second example is evidently the indefinite pronoun ktoś ‘somebody’, which forms a coreferential cluster with the reflexive pronoun sobie ‘himself/he’.

16.1.2 Possessive pronouns

Our instruction rejected adjectival possessive pronouns like mój ‘my’, twój ‘your’, jego ‘his’, jej ‘her’ as heads of nominal phrases. However, annotators very often annotated those types of pronouns as nominal phrases and included them in coreferential clusters. Sometimes, this was the result of the homonymy of adjectival and personal pronoun forms, cf.:

(16.3)

Anna zgubiła teczkę i nigdy jej już nie odnalazła. ‘Anna lost her briefcase and never found it again.’

(16.4)

Anna zgubiła teczkę, a była to jej najdroższa pamiątka po ojcu. ‘Anna lost her briefcase, and that was her dearest remembrance of her father.’

In Example (16.3), the form jej ‘it’ is a personal pronoun which forms a cluster of coreferential expressions with the noun teczka ‘briefcase’. However, in Example (16.4) the identical form jej ‘her’ is an adjectival possessive pronoun; therefore, according to the instruction, it is not the head of a nominal phrase and does not form a coreferential chain with the noun Anna. In spite of that, annotators also treated as nominal phrases those possessive pronouns whose forms were not homonymous with the forms of personal pronouns, e.g.:

(16.5)

Wątpliwe jest, aby odpowiedzialna część opozycji poparła pakiet wicepremiera, który – wbrew jego zapowiedziom – nie ma szans przyczynić się do przyspieszonego rozwoju gospodarki. ‘It is doubtful that the responsible part of the opposition should support the vice prime minister’s package, which – contrary to his declarations – has no chance of contributing to a faster development of the economy.’

The annotator saw the possessive pronoun jego ‘his’ as a nominal phrase and included it in an appropriate coreferential cluster with the form wicepremiera ‘vice prime minister’. During the superannotation of the corpus, such an approach was treated as a mistake, and phrases of that type were excluded from clusters. Perhaps the commonness of this mistake is a sign that we have a strong impression of an identity relation between phrases of this kind. Let us observe that we treat as regular phrases in the corpus nouns functioning as attributes, e.g. dom ojca ‘father’s house’, „Fotoplastikon” Siesickiej ‘“Fotoplastikon” by Siesicka’, „Biały” Kieślowskiego ‘“Biały” by Kieślowski’. And yet, these phrases are analogous to expressions with possessive pronouns: ‘his house’, ‘my novel’, ‘their new film’, cf.:

(16.6)

Anna sprzedaje swój dom (= Anna sprzedaje dom Anny). ‘Anna is selling her house (= Anna is selling the house of Anna).’

Perhaps one should accept the distinctiveness of possessive pronouns and treat them, unlike typical adjectives, as heads of nominal phrases.

16.1.3 Expanded phrases

The next widely encountered problem during the annotation of the examples from the corpus concerns expanded phrases: with relative clauses or appositions, e.g.:

(16.7)

Pozostało mi tylko wspomnienie ogromnego zdziwienia, tak wielkiego, że nie mogło się ono pomieścić w moich oczach. ‘I was left with merely a memory of huge amazement, so great that it couldn’t fit in my eyes.’

(16.8)

Z klasztoru wyciągnął go przy pomocy fortelu jego wieloletni przyjaciel, Onufry Zagłoba. ‘He was pulled out of the monastery by a trick by his old friend, Onufry Zagłoba.’

In Example (16.7), there is a phrase which is simultaneously internally embedded and coreferential. A single cluster will thus include the whole expression ogromnego zdziwienia, tak wielkiego, że nie mogło się ono pomieścić w moich oczach ‘huge amazement, so great that it couldn’t fit in my eyes’, and the pronoun ono ‘it’ (the amazement), placed in the middle of the embedded phrase. Example (16.8), on the other hand, includes an apposition. The whole expression jego wieloletni przyjaciel, Onufry Zagłoba ‘his old friend, Onufry Zagłoba’ refers to the same referent and should be treated as one, internally embedded phrase.

16.1.4 Indirect speech

Another issue that we needed to correct only during the annotation of the corpus concerned the interpretation of expressions appearing within the same text as indirect and direct speech, cf.:

(16.9)

Ø Uważam, że powołanie kogoś, kto nie sprawdził się w szpitalu i jako szef spółdzielni mieszkaniowej, może budzić wątpliwości – mówi Wojciech Wenecki. ‘I believe that appointing somebody who has not proved himself in a hospital and as the leader of a housing co-operative can cause doubts – says Wojciech Wenecki.’

(16.10)

Ø Jestem zadowolony z wyniku meczu. Nic mnie nie zaskoczyło – powiedział trener, kiedy zapytaliśmy go, jak Ø ocenia grę zespołu. ‘I am glad about the result of the match. Nothing surprised me – said the coach when we asked him what he thinks about the team’s performance.’

In Example (16.9), the phrases in question are Uważam ‘I (believe)’ and Wojciech Wenecki; in Example (16.10), on the other hand: Jestem ‘I am’, mnie ‘me’, trener ‘the coach’ and go ‘him’. In the first version of the instruction we treated these phrases as connected by the quasi-identity relation, but in the second version we acknowledged them as connected by the identity relation, because they refer to the same referent. This problem was crucial most of all during the annotation of theatrical plays, where this phenomenon occurred at the level of the whole text.

16.1.5 Quasi-identity

The quasi-identity relation, which we decided to introduce in our corpus, turned out to be unnecessary. Our annotators demonstrated very high unanimity in determining identity relations. On the other hand, there was no such unanimity in the description of quasi-identity relations. Therefore, the hypothesis of Recasens et al. that identity is blurred and gradable (Recasens et al., 2010a, p. 151) seems to be questionable. The main reason for this divergence might be the unclear distinction between reference (which is a pragmatic phenomenon) and various lexical and semantic relations between lexemes (which are a natural property of the language system).

Undoubtedly, a considerable problem was adopting in our instruction the narrow understanding of reference after Topolińska (1984). The analysis of the examples from the corpus shows that the concept of reference should be significantly extended and the typology of Vater (2009) should be adopted; he distinguished four types of reference: object, situational, temporal and location reference. Thanks to this approach, one could accept the phrases underlined in Examples (16.11) and (16.12) as coreferential, cf.:

(16.11) Nominacja Tadeusza Matusiaka (SLD) wywołała poruszenie. Na początku tej kadencji samorządu został prezydentem Łodzi. Kiedy w ubiegłym roku od partyjnych kolegów nie otrzymał wotum zaufania, podał się do dymisji. Zadeklarował wówczas, że robi to w poczuciu odpowiedzialności za dobre imię SLD. ‘The nomination of Tadeusz Matusiak (SLD) caused a stir. In the beginning of this term of local government, he became president of Łódź. Last year, when he failed to achieve a vote of confidence from his party colleagues, he resigned. He declared then that he was doing it feeling responsible for SLD’s good name.’


(16.12) „Bóg ci daj, panie, dobry dzień i dobre zdrowie, ale ja człek wolny, nie niewolnik.” Winicjuszowi, który miał ochotę rozpytać Ursusa o ojczysty kraj Ligii, słowa te sprawiły pewną przyjemność. ‘“God grant you, my lord, good day and good health, but I am a free man, not a slave.” For Vinicius, who wished to ask Ursus about Ligia’s home country, those words were in a way pleasant.’

16.1.6 Context and semantics

At the beginning of our work on the coreferential corpus, we presumed that we would annotate all types of texts from the National Corpus of Polish. However, one should note that coreference is a substantially pragmatic phenomenon that is strongly context-dependent. As a result, some discourse types posed great difficulty to the annotators, as the context was missing – this concerned all spoken texts and fragments of discourse from internet forums (it is known that the language used on forums is a hybrid, quasi-spoken one, de facto closer to spoken language). Therefore, it seems advisable to omit those types of texts during later annotation work on the corpus.

A substantial part of our corpus consists of fragments of texts. However, we also have a sample of 21 full texts. This seems a very good solution, as full texts also provide the addressee with the full context, and it is much easier to see the coreferential links between phrases. Of course, it would not be possible to describe all samples from the National Corpus of Polish in this manner, as it would exceed the capacity of the annotators. However, a small subcorpus with exemplary short, fully discussed texts is, in our opinion, a very good idea that should be continued.

Building a coreferential corpus would, undoubtedly, require additional tools which could support automatic analysis. Contrary to appearances, semantic bases like WordNet are not too promising, because they operate on semantic, often quasi-anaphoric relations between lexemes, while using synonyms, hyponyms and hyperonyms as elements of a coreferential chain is not frequent at all, cf.:

(16.13) Po rezygnacji z pracy w szpitalu, były dyrektor zniknął z życia publicznego. Ø Wrócił dopiero, gdy starosta Andrzej Barański zaproponował mu współpracę. Posunięcie starosty wywołało ostrą reakcję kilku radnych. W trakcie ostatniej sesji kilkakrotnie pytano, czy nowy pracownik ma odpowiednie kwalifikacje, by zdobywać dla powiatu unijne środki pomocowe. ‘After having resigned from the job in the hospital, the former director disappeared from public life. He came back only when staroste Andrzej Barański offered him cooperation. The staroste’s move caused harsh reactions from some of the board members. In the course of the last session it was asked many times whether the new employee has appropriate qualifications to acquire EU aid resources for the province.’

In the coreference cluster formed by the expressions były dyrektor ‘the former director’, ⌀ ‘he’, mu ‘him’ and nowy pracownik ‘the new employee’, not a single element is connected by semantic relations with the last mention of the cluster. It would seem that knowledge corpora based on interpretation frameworks (like FrameNet) or even Wikipedia would be much more helpful.

16.1.7 Dominant expressions

In many situations, determining the dominant expression helped annotators order a vast set of pronouns denominating different persons (e.g. in fragments of plays or novels), or name the actor signalised by means of the verbs themselves, e.g.:

(16.14) Klaster: stwierdzili, powiedzieli
Wyrażenie dominujące: lekarze w Polsce
‘Cluster: stated, said
Dominant expression: doctors in Poland’

The process of assigning the dominant expression to the cluster is an issue which needs further elaboration. In most cases, one of the cluster elements was chosen by the annotator as the dominant one (with or without a modification consisting in changing inflected forms into base forms). It is still an open question how to automatically find the best candidate for the dominant expression, especially if there are no good candidates, as in the example above. The best candidates for dominant expressions are: proper names, descriptions with unequivocal reference (e.g. the author of “The World According to Garp”), and expressions with the richest semantics (hyponyms). It would seem that the longest expression in the cluster should be a good candidate, as it contains more information, but sometimes this additional information is irrelevant, e.g.:

(16.15) Klaster: Prezesa PKP InterCity S.A.; Szanowny Pan Prezes PKP InterCity
Wyrażenie dominujące: Prezes PKP InterCity S.A.
‘Cluster: (of the) CEO of PKP InterCity S.A.; Dear Sir, CEO of PKP InterCity
Dominant expression: CEO of PKP InterCity S.A.’

This very simplistic approach was experimentally used at one of the later stages of annotation, when we integrated automatic assignment of a dominant expression to the cluster by creating it from the sequence of base forms of the cluster’s longest mention.
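The longest-mention heuristic mentioned above can be expressed very compactly. The sketch below is only an illustration under our own assumptions (it is not the project code): each mention is given as a list of (orthographic form, base form) pairs, and the dominant expression is simply the lemma sequence of the longest mention.

    # A minimal sketch of the longest-mention heuristic for dominant expressions.
    # Assumption (not the project implementation): a mention is a list of
    # (orthographic form, base form) pairs.

    def dominant_expression(cluster):
        """Return the sequence of base forms of the longest mention in the cluster."""
        longest = max(cluster, key=len)          # longest mention by token count
        return " ".join(base for _, base in longest)

    cluster = [
        [("Prezesa", "prezes"), ("PKP", "PKP"), ("InterCity", "InterCity"), ("S.A.", "S.A.")],
        [("Szanowny", "szanowny"), ("Pan", "pan"), ("Prezes", "prezes"), ("PKP", "PKP"), ("InterCity", "InterCity")],
    ]
    # Picks the longer mention, illustrating the caveat above: the extra tokens
    # ("szanowny pan") carry no additional referential information.
    print(dominant_expression(cluster))  # -> "szanowny pan prezes PKP InterCity"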


16.2 Applications

16.2.1 Automatic summarization

The newly created Polish Summaries Corpus (Ogrodniczuk and Kopeć, 2014) contains a large number of single-document summaries of news articles from the web archive of Rzeczpospolita, a nationwide Polish daily newspaper. It consists of abstractive summaries (written in the annotator’s own words) and extractive summaries (composed by selecting fragments of the full document), with many independently created versions for each text. This resource allows for research on the Polish news summarization process, including studies of the impact of coreference resolution on automatic summarization. As an initial effort on that subject, we decided to design an experiment to test whether coreference information may be useful for an automatic summarizer which uses a sentence extraction technique.

16.2.1.1 Data

The Polish Coreference Corpus contains a subcorpus of long texts with 21 documents (see Section 8.1.2 for details). These texts are also present in the Polish Summaries Corpus, which allows us to test how manual coreference information may be used to form a summary similar to a manually created one. These 21 texts have both extractive and abstractive manual summaries of three sizes: 20%, 10% and 5% of the word count of the full text. For our experiment, we selected the extractive summaries with 20% of the words, as they are the longest available and, therefore, may contain the largest number of coreference relations. The reason for using extractive summaries is to have an easy way of finding which mentions were selected for the summary – in the case of abstractive summaries this would be difficult.

16.2.1.2 Experiment description

Let us start with a definition: we call two sentences coreferent if there exists a mention m1 in the first sentence and a mention m2 in the second sentence such that m1 and m2 are coreferent. This relation is an equivalence relation; therefore, it groups the sentences of a text into sentence clusters. Coreference between adjacent sentences improves text cohesion and makes the text easier to read. However, introducing a new entity may require a sentence not coreferent with the previous one.

An extractive summary needs to be coherent and focused on the main topic of the summarized text; therefore, one may expect that it will have a large number of coreferent sentences, as the main topic includes a limited number of entities. On the other hand, a summary needs to cover the original content in a limited number of words, which may require skipping some ‘bridges’ of coreferent sentences which were used in the original text to link different topics. In a summary, there may not be

enough room to include these sentences, leaving more sentences which (based on the content of the summary only) are not coreferent.

The purpose of the experiment was to verify whether the information about coreference between sentences may be useful for selecting a subset of them for the summary, and if so, in what way: either to have more coreferent or more non-coreferent sentences selected. For that task, we decided to create a single 20% gold sentence extraction summary for each of the 21 texts. In the Polish Summaries Corpus each text has 5 versions (made by independent annotators) of extraction summaries; however, they are unconstrained in terms of what can be extracted – not necessarily full sentences, but any subset of characters. Therefore, a transformation procedure was required. We automatically segmented the original text into sentences using the pipeline for mention detection (see Chapter 10 for details). Then, each sentence received points for every character in its body selected by any annotator for a summary. After normalizing the points for each sentence by the sentence length in characters, we obtained a ranking of sentences, in which the best sentence was the one having the highest percentage of characters selected by all annotators altogether. These sentences were added to the final gold summary according to that ranking, until the 20% word limit was reached.

Having a gold summary, we chose to test how many sentence clusters (created by the sentence coreference relation) are in it, normalised by the number of sentences in the summary. This number was then compared with the normalised number of clusters obtained by selecting a random set of sentences having the desired cumulative word count. The result of random sentence selection was averaged over 100 trials for each text. The sentence cluster counts were calculated in two settings: with gold coreference information and with system coreference information, obtained by using Ruler in the configuration described in Chapter 15.
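The procedure can be summarised in a short sketch. The code below is only an illustration under simplifying assumptions (names and data structures are ours, not the project’s): it ranks sentences by the share of annotator-selected characters, fills the summary up to the word limit, and counts the sentence clusters induced by the coreference relation with a simple union-find.

    # A minimal sketch (not the project code) of the summary experiment, assuming:
    #   sentences: list of sentence strings,
    #   selected:  list of sets of character offsets chosen by any annotator, one set per sentence,
    #   clusters:  list of coreference clusters, each given as the set of sentence indices it touches.

    def gold_summary(sentences, selected, word_limit):
        """Rank sentences by the share of annotator-selected characters; fill up to word_limit."""
        ranking = sorted(range(len(sentences)),
                         key=lambda i: len(selected[i]) / max(len(sentences[i]), 1),
                         reverse=True)
        chosen, words = [], 0
        for i in ranking:
            if words >= word_limit:
                break
            chosen.append(i)
            words += len(sentences[i].split())
        return sorted(chosen)

    def sentence_cluster_count(chosen, clusters):
        """Group the chosen sentences with union-find over the sentence coreference relation."""
        parent = {i: i for i in chosen}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for cl in clusters:
            members = [i for i in cl if i in parent]
            for a, b in zip(members, members[1:]):
                parent[find(a)] = find(b)
        return len({find(i) for i in chosen})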

16.2.1.3 Results

Table 16.1 shows the number of sentence clusters in three sentence sets: all sentences of the original text, the sentences selected for the gold summary using the procedure described earlier, and a random summary created by selecting random sentences (in that case, results were averaged over multiple trials). The number of sentence clusters is calculated using either gold or system coreference information.

The average number of sentence clusters is much higher for the random summary than for the manually created one, both in the gold setting (10.31 : 6.19) and in the system setting (7.96 : 5.10). After normalisation of the sentence cluster count by the number of sentences, we still observe the same phenomenon in the gold (0.4947 : 0.3291) and system (0.3820 : 0.2711) settings. This indicates that in 20%-word summaries of news articles humans tend to select sentences which are more coreferent with each other than a random subset of sentences of similar size. Such information may be used during automatic sentence selection to favour sentence sets which are highly coreferent.


Table 16.1. Summarization experiment results.

                  Full text                         Gold summary                      Random summary (avg.)
Text              # sent.  # clust.   # clust.      # sent.  # clust.   # clust.      # sent.  # clust.   # clust.
                           (gold)     (sys.)                 (gold)     (sys.)                 (gold)     (sys.)
199704210012       127      34         23            22       6          10            26.46    13.17      11.73
199704230043        97       9          8            11       1           2            20.43     4.87       5.97
199704240021       248      70         47            55      20          19            50.58    27.26      25.54
199801030068        96      33         19             6       6           1            19.44    12.57       3.71
199801310037       132      32         27            15      10           4            26.96    15.79       6.90
199802030127        96      12         21             8       3           4            19.76     6.55       4.36
199802040182       117      37         22            27      10           4            23.84    13.15      10.78
199901190115        63      13         15             9       4           5            13.13     6.31       4.79
199901290095        53      22         10             8       6           3            11.38     7.98       4.65
199901300101        63      16         13            10       9           6            13.09     8.21       5.33
199906240084        80      28         14            10       4           2            16.62    10.25       5.21
199906260027        90      19         20            19       9           7            18.84     7.45       8.33
199911220063       101      20         22            17       4           4            20.73     6.46       8.45
199912180034        70      17         16            14       5           6            14.43     7.22       6.69
199912280024        87      34         19            18       8           8            17.98    12.64       9.55
200001030029        64      18         13            14       4           2            13.66     5.83       6.50
200004290022        76      14         12            10       3           2            15.70     8.09       6.71
200108300032       191      39         28            28       2           4            39.00    13.97      11.87
200112280057        89      19         13            18       4           7            18.60     9.07       9.47
200202020036        98      18         18            12       8           5            20.23    10.67       6.34
200203190026        81      17         15             9       4           2            16.68     8.95       4.30
Average            100.90   24.81      18.81         16.19    6.19        5.10          20.84    10.31       7.96

At the same time, an automatic summary with a high number of coreferent sentences is more coherent and, therefore, easier to read. However, additional attention should be paid to post-processing the selected sentences to make sure that their composition does not introduce anaphoric chains which are false or difficult for the reader to resolve.

16.2.1.4 Conclusion

Our results show that a sentence selection summarization algorithm may benefit from information about coreference between sentences, even when using imperfect automatic coreference data. Further research is required to investigate how to incorporate this coreference information into the sentence selection procedure.

16.2.2 Multiservice

The Multiservice (Ogrodniczuk and Lenart, 2013) is a multi-purpose set of NLP tools for Polish, made available online in a common web service framework. The toolset currently comprises a disambiguating tagger with a morphological analyser, a named entity recognizer, a shallow parser, a dependency parser, various text summarizers and, thanks to the presently described work, a mention detector and coreference resolver.





"mentions": [ ...
  { "id": "m-70",
    "headIds": [ "seg-4.8.5" ],
    "childIds": [ "seg-4.8.5", "seg-4.8.6", "seg-4.8.7" ] },
  ...
  { "id": "m-72",
    "headIds": [ "seg-4.9.1" ],
    "childIds": [ "seg-4.9.1" ] },
  ... ],
...
"coreferences": [
  { "id": "c-13",
    "type": "ident",
    "dominant": "turniej",
    "mentionIds": [ "m-70", "m-72" ] },
  ... ],

Fig. 16.1. TEI P5 XML and JSON coreference representation formats in Multiservice


Additionally, a web application offering chaining capabilities, asynchronous processing of requests and a common brat-based (Stenetorp et al., 2012) presentation interface is available. In the following subsections we describe the parts of the Multiservice related to coreference.

16.2.2.1 Output formats

The coreference tools have been adjusted to the Multiservice interchange formats, supporting chaining and uniform presentation of linguistic results: TEI P5 XML (see Section 8.2.1) and its JSON equivalent (see Figure 16.1). The TEI P5 format used by the Multiservice is a packaged version of the standoff annotation used by NKJP (Przepiórkowski and Bański, 2011), extended with new annotation layers originally not available in NKJP. It stores a collection of annotation layers in an artificial root element, which allows for a uniform use of XML IDs in references.
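For readers who want to consume the JSON variant shown in Figure 16.1, the sketch below (our illustration, not part of the Multiservice distribution) groups mention identifiers into clusters and prints each cluster with its dominant expression; only the field names mirror the figure, everything else is an assumption.

    import json

    # A minimal sketch (not part of the Multiservice) for reading the JSON
    # structure from Figure 16.1; only "mentions" and "coreferences" are used.
    def print_clusters(result_json):
        data = json.loads(result_json)
        mentions = {m["id"]: m for m in data.get("mentions", [])}
        for cluster in data.get("coreferences", []):
            heads = [mentions[mid]["headIds"] for mid in cluster["mentionIds"] if mid in mentions]
            print(cluster["id"], cluster.get("type"), "dominant:", cluster.get("dominant"), "heads:", heads)

    sample = '''{"mentions": [{"id": "m-70", "headIds": ["seg-4.8.5"], "childIds": ["seg-4.8.5"]},
                              {"id": "m-72", "headIds": ["seg-4.9.1"], "childIds": ["seg-4.9.1"]}],
                 "coreferences": [{"id": "c-13", "type": "ident", "dominant": "turniej",
                                   "mentionIds": ["m-70", "m-72"]}]}'''
    print_clusters(sample)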

16.2.2.2 Usage and visualisation

Sample Python and Java clients have been implemented for using the service¹. The application allows for triggering a processing request and periodically checking its status. When the execution ends, the result is retrieved and displayed to the user. In case of a failure, an appropriate error message is presented. To facilitate non-programmatic experiments with the toolset, a simple Django-based web interface has been made available which allows users to create toolchains and enter texts to be processed.

Fig. 16.2. Coreference in the Multiservice

1 See http://glass.ipipan.waw.pl/redmine/projects/multiserwis/wiki/Usage for details.

By selecting the components to be included in the chain, users can test how the inclusion or exclusion of a certain component influences the resolution process. For instance, removing Spejd (the shallow parser used for nominal group detection) from the chain results in restricting mention detection to single segments only (in our case pronouns and simple nouns). The web interface features consistent visualization of the linguistic information, produced with the brat tool (Stenetorp et al., 2012), with mentions represented as boxes spanning text fragments and coreference clusters presented as links between nearest mentions (see Figure 16.2).
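The request–poll–fetch workflow described above can be illustrated with a generic client sketch. The endpoint paths and response fields below are purely hypothetical placeholders (the actual Multiservice interface is documented at the address given in the footnote), so this shows only the asynchronous pattern, not the real API.

    import time
    import requests  # third-party HTTP library, used here only for illustration

    # Hypothetical endpoints and field names; NOT the real Multiservice interface.
    BASE = "http://example.org/nlp-service"

    def process(text, chain=("tagger", "mentiondetector", "coreferencer")):
        job = requests.post(f"{BASE}/requests", json={"text": text, "chain": list(chain)}).json()
        while True:
            status = requests.get(f"{BASE}/requests/{job['id']}").json()
            if status["state"] == "DONE":
                return requests.get(f"{BASE}/requests/{job['id']}/result").json()
            if status["state"] == "FAILED":
                raise RuntimeError(status.get("error", "processing failed"))
            time.sleep(1)  # poll periodically, as requests are processed asynchronously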

16.3 Contribution summary

In the light of the state-of-the-art coreference-related work, it appears that the Polish Coreference Corpus presented in various aspects throughout this book is one of the largest corpora of general nominal coreference worldwide (over 0.5M tokens) and the first of its kind for the Polish language. Both the corpus and all related tools (used in the process of corpus composition and produced within the scope of the project) are available under a very liberal CC-BY licence (Creative Commons Attribution 3.0 Unported License), which supports both their non-commercial and commercial use. The publication of the corpus resulted in the development of several important interfaces: a TEI NKJP-based format for coreference representation, a large-scale visualisation with the brat framework and an integration with Multiservice, a web service offering chain-based access to state-of-the-art Polish linguistic tools.

To the best of our knowledge, before the completion of our work no full-fledged automated anaphora or coreference resolution system existed for Polish. Moreover, no systematic description of anaphoric constructs in Polish had been performed. Therefore, having achieved such a description together with a large, manually validated resource seems important for Polish linguistic studies. The annotation of the PCC introduces some novel aspects such as the verification of the quasi-identity concept (cf. Sections 1.5 and 5.1.2) on a large scale, the semantic head markup, the presence of clusters with indefinite mentions (cf. Section 8.3.2.2) as well as the selection of the dominant expression in each cluster (cf. Section 5.1.3). The resolution process includes a newly implemented nested mention detector and a zero subject detector which achieved 85.55% accuracy, significantly exceeding the baseline of majority tagging. Several resolution approaches were tested, from a preliminary rule-based experiment to the application of statistical models and the testing of translation-projection approaches. The best achieved results of overall resolution reached 92.86% on head mention detection, a 72.27% CoNLL F1 score² on gold head coreference resolution and 70.79% on system head coreference resolution.

2 The score used in the CoNLL-2012 Shared Task: Modelling Multilingual Unrestricted Coreference in OntoNotes; F1 = (MUC + B3 + CEAF-E)/3.
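As an illustration of this averaging (our own worked example, computed from Table 15.6, HEAD TRANSFORM setting, Bartek-3; not a score reported by the project):

\[
F_1^{\mathrm{CoNLL}} = \frac{F_1^{\mathrm{MUC}} + F_1^{\mathrm{B^3}} + F_1^{\mathrm{CEAF\text{-}E}}}{3} = \frac{54.50\% + 82.74\% + 79.22\%}{3} \approx 72.15\%.
\]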


The set of tools³ created by the project and made available to the community is long and extensive:
– MentionDetector – a mention detection tool covering simple nouns, pronouns, nominal groups, nominal named entities and zero subjects (see Chapter 10)
– Ruler – a rule-based coreference resolution tool (see Chapter 11)
– Bartek – a statistical coreference resolution tool (see Chapter 12)
– DistSys – a distribution system for manual annotation of texts (see Section 7.1.1)
– MMAX4Core – a modified version of the MMAX2 annotation tool, adjusted to the needs of the project (see Section 7.1.2)
– Scoreference – a mention detection and coreference resolution evaluation tool (see Chapter 15)
– TextSelector – a tool for manual text inspection and selection (see Section 8.1.1)
– brat4core – the brat online annotation tool with tweaks specific to coreference visualisation (see Section 8.2.4)
– numerous converters (between the MMAX, TEI and brat representation formats; see Chapter 8 for more detail).

Some of the tools (TextSelector, Brat4core, DistSys, Scoreference, TextExtractor, all converters) are suitable for direct use for languages other than Polish. Last but not least, some parts of this book (e.g. Section 14.3 about coreference resolution metrics or the parts of Chapter 13 related to inter-annotator agreement) are tutorial-like and can be used independently of the project results.

3 Available at http://zil.ipipan.waw.pl/PolishCoreferenceTools.

Maciej Ogrodniczuk, Katarzyna Głowińska

17 Perspectives

17.1 Annotation improvements

At the beginning of the project we had to make certain decisions concerning the scope of the representation of coreference and related phenomena in our corpus. Due to the pioneering character of our work for Polish, we decided to limit our efforts to direct nominal coreference, but now our resource is ready to be extended with other types of anaphora and coreference. Section 5.2 contains a comprehensive list of constructs which were excluded from annotation due to their non-nominal or indirect character. They include identity-of-sense, bridging or bound anaphora as well as different syntactic types of clustered mentions (e.g. verbal or adverbial constructs, references to relative clauses etc.).

We have also initiated research on the role of zero subjects, semantic heads and dominant expressions in coreference description. Among these topics, the most underexplored one seems to be the notion of the dominant expression (see Section 5.1.3), an expression which has the richest meaning or describes the referent in the most precise way. We believe that dominant expressions could facilitate cross-document annotation, as well as the creation of a semantic framework covering different expressions/descriptions of the same object.

17.2 Mention detection improvements

Although the tools currently used in the mention detection process constitute the state of the art for Polish, many mentions manage to slip through unnoticed at each detection level, either because of dictionary (or ‘knowledge’) shortages, or because of differences between the annotation models of the individual tools and our annotation principles. Even though for certain types of tools the mere application of new versions of machine learning algorithms, bigger amounts of training data or combinations of existing component tools (cf. the new PoliTa tagger (Kobyliński, 2014)) can dramatically improve mention detection results, the most important improvements we present below involve the manual development of linguistic tools and resources.

17.2.1 Ignored words

Since the morphosyntactic analyser marks unrecognized words with a special tag (ign – an ignored word in the case of Morfeusz), in real-world conditions this means that

such words cannot be used in further processing (e.g. to build syntactic groups) and hinder the overall process, e.g. because the Spejd grammar rules do not take ignored words into account (i.a. because the grammar was developed for the 1-million-word manually annotated subcorpus of the National Corpus of Polish, where all words have been disambiguated). In the corpus, there were 20,515 ignored words (approx. 3.8% of corpus tokens; 8487 distinct ignored words) of the following types:
– numbers represented with digits (24)
– compounds (12-letnia ‘12-year-old’, ekoturystyka ‘ecotourism’¹)
– abbreviations (PLK)
– diminutives not covered by PoliMorf (hipopotamek ‘little hippopotamus’)
– foreign words (adwords, airways) and their polonized versions (tłiter ‘Twitter’, łikend ‘weekend’)
– unrecognized named entities (Agniesiu ‘diminutive of Agnes’, Skrętkowskiej)
– other unrecognized but properly formed elements (neologisms, rare words, spelling variants, words lacking Polish diacritical marks, e.g. lumpenliberalizm, wordowskiej, zolc etc.)
– representation problems and errors (punctuation-related, e.g. allowed punctuation such as an apostrophe wrongly included in the segment, and other spelling-related problems, e.g. old-Polish variants: substancya old Polish ‘substance’, baaardzo ≈ ‘verrry’, przkro ≈ ‘sory’ etc.)

Their processing may require the integration of external vocabulary resources, mostly coming from websites registering the newest linguistic trends in Polish, fresh loan words and neologisms already popular but not yet covered by traditional dictionaries. The most popular sources for Polish are:
– miejski słownik slangu i mowy potocznej (http://www.miejski.pl/)
– słownik slangu, neologizmów i mowy potocznej (http://vasisdas.pl/)
– słownik polskiego slangu (http://www.ug.edu.pl/slang/)
– słownik slangu polsko-angielskiego (http://www.ponglish.org/).

A good source of general diminutives is the Polish WordNet (Piasecki et al., 2009)². Name diminutives can be found in (Marinković, 2004) or in internet-based sources such as lexicons of first names (e.g. http://de.wikipedia.org/wiki/Polnischer_Name). Rule-based online resources such as http://aztekium.pl/zdrobnienia.py are not credible due to the high degree of alternation in Polish word formation.
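As a simple illustration of how such external resources might be plugged in (a sketch under our own assumptions, not an existing project component), unknown tokens tagged ign could be looked up in a supplementary lexicon before syntactic grouping:

    # A minimal sketch (our assumption, not an existing CORE component): tokens are
    # (orth, tag) pairs, and supplementary_lexicon maps unknown forms to guessed tags.
    supplementary_lexicon = {
        "ekoturystyka": "subst:sg:nom:f",   # hypothetical entries and tags
        "tłiter": "subst:sg:nom:m3",
    }

    def patch_ignored(tokens):
        """Replace the 'ign' tag with a lexicon-based guess where possible."""
        patched = []
        for orth, tag in tokens:
            if tag == "ign":
                tag = supplementary_lexicon.get(orth.lower(), "ign")
            patched.append((orth, tag))
        return patched

    print(patch_ignored([("Ekoturystyka", "ign"), ("rośnie", "fin:sg:ter:imperf")]))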

1 Elements of these two types are currently recognized with a new version of Morfeusz, implemented after preparation of PCC.
2 Also known as plWordNet or Słowosieć; see http://plwordnet.pwr.wroc.pl/wordnet/.


17.2.2 Named-entity-related improvements

Out of the long list of nominal named entity expressions which are not identified by Nerf, the following seem important for mention detection (and cannot be easily identified by other means, such as shallow parsing, if not combined with named entity detection):
– titles and functions of people (dyrektor Adam Płocki, Prof. A. Kamiński)
– multi-word names of human creations – works of art or literary works („Człowiek z marmuru”)
– addresses (ul. Wyspiańskiego 32 m 12, www.msz.gov.pl).

Another interesting feature which must be taken into consideration when preparing the semantic description of named entities is that in Polish the names of fictitious characters inherit gender from the name, independently of the sex of the character. This results in resolution difficulties for names such as Czerwony Kapturek, Myszka Miki, Kopciuszek, Zośka (Tadeusz Zawadzki), where ‘semantic gender’ instead of grammatical gender must be used (cf. Kopciuszek zgubił pantofelek. ‘Cinderella lost[subst:m] her shoe.’, which is the syntactically correct version).

From the technical point of view, the identification of mentions could also be improved if named entity detection results were used as pre-defined nominal group components by the shallow parsing engine. In a sense, this approach is simulated in our mention detection chain, described in Section 10.6.
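One conceivable way of handling the gender issue (our sketch with made-up data, not a project component) is a small override table mapping fictitious-character names to their semantic gender, consulted before the agreement check between a mention and a candidate antecedent:

    # A minimal sketch (hypothetical data): override grammatical gender with
    # 'semantic gender' for known fictitious characters.
    SEMANTIC_GENDER = {
        "Czerwony Kapturek": "f",   # Little Red Riding Hood: grammatically masculine, semantically female
        "Kopciuszek": "f",          # Cinderella
        "Myszka Miki": "m",         # Mickey Mouse: grammatically feminine, semantically male
    }

    def effective_gender(mention_text, grammatical_gender):
        """Prefer the semantic gender of a known character over the grammatical one."""
        return SEMANTIC_GENDER.get(mention_text, grammatical_gender)

    # Agreement check between 'Kopciuszek' (grammatically masculine) and a feminine pronoun:
    print(effective_gender("Kopciuszek", "m") == "f")   # -> True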

17.3 Resolution improvements

Although comparisons of systems using numerous learning features³ show only moderate improvement over systems using a minimal number of rich features (Haghighi and Klein, 2009), and a high number of features is typically related to degraded performance, lower reliability and overfitting, it is generally accepted (see Cristea and Postolache, 2005) that the growing complexity of resolution cases requires a more elaborate, knowledge-intensive representation. The following subsections⁴ present some preliminary findings based on the contents of the PCC which may lay the groundwork for future improvements.

17.3.1 Detection of knowledge-linked mention pairs

To verify the extent to which the currently used methods (surface, syntactic, semantic or discourse-based) cannot bring successful results due to their knowledge-lean nature,

3 Ng and Cardie (2002b) report on using 53 features while Uryupina (2007) – 351 features.
4 Based on (Ogrodniczuk, 2013a).

we have compared the manual PCC annotation with the automatic annotation created with the tools developed in our project. By extracting coreferential links identified by human annotators but missed by the computer resolver, we could track cases where the current resolution methods could not create the association, but where there existed some additional level of understanding of the text which made it obvious to the human annotators.

Out of 1220 nominal clusters (with only nominal mentions), 73 mention pairs were selected for further processing. They constituted all the data for which coreference resolution was unfeasible with the above-mentioned means. Mentions which are currently not clustered, but could get resolved with additional semantic effort, were removed from the data set. Two examples of such semantics-intensive data are czternaście tysięcy złotych ‘fourteen thousand Polish zlotys’ – pierwsza tak duża dotacja ‘the first so huge subsidy’, where the cluster could have been created by comparing WordNet-based semantic classes, and marszałek ‘marshal’ – Marek Nawara, marszałek małopolski ‘Marek Nawara, the marshal of Małopolska’⁵, where the appositional components could have been inspected to create the link.

The contents of the set followed, to a great extent, the common classifications of named entities – among the 73 mention pairs included in the set, four out of the five following subclasses were named entity-related and followed the NKJP classification (Waszczuk et al., 2013):
– 29 personal names linked with a person’s role, function, occupation etc. (e.g. Jan Paweł II ‘John Paul II’ – polski papież ‘the Polish pope’, Rafał Blechacz – pianista ogromnie utalentowany i skromny ‘a pianist tremendously talented and modest’)
– 18 names of organisations – companies, sports clubs, political parties, music bands etc. (e.g. Ich Troje – zespół Michała Wiśniewskiego ‘Michał Wiśniewski’s band’, Wizzair – tania linia lotnicza ‘low-cost airline’)
– 14 geographical/geo-political names – here: only names of countries and cities (e.g. Irak ‘Iraq’ – kraj ‘country’, Aleksandrów Łódzki – miasto ‘city’)
– 6 ‘human creation’ names – movie, book and newspaper titles (e.g. Star Trek – dzieło filmowe ‘cinematographic work’, Wahadło Foucaulta ‘Foucault’s Pendulum’ – książka ‘a book’)
– 6 descriptive definitions, e.g. kot ‘cat’ – udomowiony ssak ‘domesticated mammal’, lekarze i pielęgniarki ‘doctors and nurses’ – personel szpitalny ‘hospital staff’.

The overrepresentation of named entities results in the first place from the absence or underrepresentation of such concepts in the data sources used. At the same time, certain cases of this type were identified since, e.g., the Polish WordNet contains 339 sample instances of the artificial synset miasto Polskie (‘a Polish city’), which corresponds to only 1/3 of the total number of all cities in Poland. Nevertheless, the structure and

5 In PCC appositions are treated as components of the main phrase.


contents of any WordNet cannot be subordinated to the ideology of representing the whole of world knowledge – cf. the Princeton WordNet, similarly far from representing company names or movie titles.
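The comparison described in this subsection (gold links present in the manual annotation but absent from the system output) amounts to a set difference over mention pairs; the sketch below is a simplified illustration of that step under our own data assumptions.

    from itertools import combinations

    # A simplified sketch (our assumptions): each clustering is a list of clusters,
    # and each cluster is a set of mention identifiers.
    def coreference_links(clusters):
        """All unordered mention pairs placed in the same cluster."""
        return {frozenset(pair) for cluster in clusters for pair in combinations(sorted(cluster), 2)}

    def missed_by_system(gold_clusters, system_clusters):
        """Gold links the resolver failed to reproduce - candidates for knowledge-linked pairs."""
        return coreference_links(gold_clusters) - coreference_links(system_clusters)

    gold = [{"Armstrong", "pierwszy człowiek na Księżycu"}, {"Aldrin"}]
    system = [{"Armstrong"}, {"pierwszy człowiek na Księżycu"}, {"Aldrin"}]
    print(missed_by_system(gold, system))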

17.3.2 The motivation for the knowledge base of periphrastic expressions

In the following example:

(17.1) Aldrin i Armstrong przyjaźnili się nadal, mimo że cała uwaga mediów skupiła się wyłącznie na pierwszym człowieku na Księżycu. ‘Aldrin and Armstrong stayed friends even though the whole attention of the media now focused on the first man on the Moon.’

the resolution going beyond a random guess is not possible when only knowledge-unaware features are applied. It is, nevertheless, true that linking Armstrong and the first man on the Moon could be easy for most human annotators – and even for some search engine-based systems, even using the simplest full-text search mode. Sometimes, the situation becomes complicated by the nature of the domain; the phrases Adam Mickiewicz, the husband of Celina Szymanowska, the poet, the lecturer at the Collège de France can be clustered together only with some (deeper) knowledge about the life of Adam Mickiewicz, a Polish 19th century poet. Without referencing the history of Polish literature, both a person and a computer system would experience difficulties in resolving coreference between those phrases. However, the border between common and specific knowledge is vague, especially given the availability of such resources as Wikipedia, offering ready-to-extract information on even less generally known topics.

We deliberately skip one more (rare) case which should be noted for completeness: the understanding of certain concepts in expert knowledge can be different from ‘the common knowledge’, which may hinder coreference resolution. For example, in the scientific sense, a tomato is the fruit (mature ovary) of the tomato plant, but in the common interpretation (e.g. in cooking) a tomato is a vegetable.

Since the nature of the coreference resolution problem is conceptual – establishing and decoding coreference is about sharing the same knowledge of discourse entities between the speaker (conveying some message in the text, the primary communication channel) and the recipient (decoding it) – we could make an attempt at establishing a common, reusable, updateable platform of understanding of the facts expressed in the analysed text. Within the scope of the resolution task, limited to nominal groups, such a platform could be conceived of as a pragmatic knowledge base composed of ‘seed’ nominal facts and their interpretations.

This type of information goes far beyond the semantic relations present e.g. in the WordNet, with its Polish version unable to maintain definitions such as pediatria

‘pediatrics’ – nauka o chorobach dziecięcych ‘the branch of medicine that deals with children’s diseases’. Similarly, this information cannot be inferred from investigating the syntactic heads of phrases, since człowiek ‘man’ carries a much different information capacity than the whole phrase pierwszy człowiek na Księżycu ‘the first man on the Moon’. The content of such a resource would cover established facts (such as, again, linking Neil Armstrong with his well-known attribute of being “the first man on the Moon”) and typical periphrastic realisations of frequent nominal phrases, including named entities (e.g. linking Napoleon Bonaparte with his nickname, “The Little Corporal”).
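A pragmatic knowledge base of this kind could be exposed to a resolver as one additional pairwise feature. The sketch below uses entirely hypothetical data and naming: it merely checks whether two mentions are linked by a stored ‘seed’ fact.

    # A hypothetical sketch of a periphrasis knowledge base used as a pairwise feature.
    # The entries below are illustrative 'seed' facts, not an existing resource.
    PERIPHRASES = {
        "Neil Armstrong": {"pierwszy człowiek na Księżycu", "the first man on the Moon"},
        "Napoleon Bonaparte": {"The Little Corporal"},
        "Kraków": {"gród Kraka"},
    }

    def knowledge_linked(mention_a, mention_b):
        """True if one mention is a stored periphrasis of the other."""
        a, b = mention_a.strip(), mention_b.strip()
        return (b in PERIPHRASES.get(a, set())) or (a in PERIPHRASES.get(b, set()))

    print(knowledge_linked("Armstrong", "pierwszy człowiek na Księżycu"))       # False: exact key needed, so name normalisation would still be required
    print(knowledge_linked("Neil Armstrong", "pierwszy człowiek na Księżycu"))  # True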

17.3.3 Data extraction sources

To boost the development of the knowledge base, existing data sources can be reused. Structured sources would be represented by existing data and knowledge repositories such as traditional dictionaries. For Polish, two adequate resources of this type are: The Dictionary of Periphrastic Constructions (Bańko, 2003) and The Great Dictionary of Polish – WSJP (Żmigrodzki, 2007), both prepared by the scientific community. On the other hand, there is a growing number of crowd-sourced dictionaries and definition bases, in most cases intended to be used for Internet games and crosswords (http://sjp.pl, http://krzyzowki.info). Processing data from these groups would consist in the automatic filtering of nominal definitions and passing them on for manual verification.

Digital ontologies (explicit specifications of conceptualization) could also be used as a source of periphrases, most likely with typical nominal instantiations of knowledge items generated in a human-readable form (to be later matched with textual content), but we deliberately omit this method, on the one hand, because of the mostly derivative nature of such resources and, on the other, due to their artificial character, abstracted from the realistic use of language. To illustrate this problem, let us analyse the relation between the phrase gród Kraka ‘Krak’s (fortified) town’ and its synonym, the city name Kraków ‘Cracow’. The former is frequently used in texts about Cracow to maintain cohesion, but we could hardly ever find it when looking in the available structured sources. Moreover, it will never be automatically generated from any ontology because of its collocational character and the atypical component gród, rarely used in contemporary texts when referring to a town.

Capturing phenomena of this type can only be achieved by processing unstructured sources, which represent the bottom-up approach to language and are likely to enrich dictionary data with real-life examples. In the long run, we plan to process both balanced corpora (such as NKJP (Przepiórkowski et al., 2012) or KPWr (Broda et al., 2012), providing a standard representation of the language), available content sources (such as the Gutenberg project) and sources of dynamic language – electronic media archives such as Korpus Rzeczpospolitej (Presspublica, 2002) or current parliamentary transcripts from the Polish Sejm Corpus (Ogrodniczuk, 2012).


17.3.4 The experiment: knowledge extraction attempt

Each of the mention pairs extracted in the process described in Section 17.3.1 was manually tested against one of the knowledge bases mentioned in Section 17.3.3 to provide a proof of concept that extracted data used as ‘pragmatic features’ would, to a large extent, help in the proper clustering of mentions in the coreference resolution process. Two sources were selected as the main supplies of pragmatic data: the Polish Wikipedia and the online crossword definition service http://krzyzowki.info. This decision was based on the assumption that Wikipedia is a reliable enough source of information about named entities, while crossword services should provide sufficient support for definitions. Table 17.1 provides statistics of the data sources used for resolving mention pair dependencies, showing the number of entity pairs which could be resolved using only Wikipedia, only the crossword definitions, both methods, some other algorithmically available method, or which could not be resolved by any pragmatic means.

Table 17.1. Sources of pragmatic information

                   Wikipedia   krzyzowki.info   Both   Other   None
Personal names     14          –                14     1       –
Organisations       8          –                 9     –       1
Geo names          13          1                 –     –       –
Creation names      5          –                 1     –       –
Definitions         1          –                 4     1       –

The first important finding is that all but one of the problematic assignments could be properly resolved; the missing one resulted from a manual annotation error (a wrong association between a soccer club name and a mountain name: Klimczok). Another striking fact is that for most mention pairs (all but three) the resolution process could be completed by using Wikipedia exclusively. The only definition-based case was the association between the country name Niemcy ‘Germany’ and its property zachodni sąsiad Polski ‘the western neighbour of Poland’, which could be resolved by a textual head match with the phrase present in the definition base. The ‘Other’ resolution source indicates that both of the main sources were insufficient to resolve the link, but another available online source could be used; the examples here are the diminutive and augmentative forms of the name Małgorzata ‘Margaret’: Gosia (dim.) and Gocha (aug.), and a common name for medical staff: lekarze i pielęgniarki ‘doctors and nurses’ – personel szpitalny ‘hospital staff’.

17.3.5 Data abstraction concept

Once the data is collected, separate sets of algorithms can be used to abstract nominal facts from nominal phrases, which we believe will boost coreference resolution recall while

maintaining storage efficiency. Apart from typical collocations, which should only be processed in a controlled manner, two abstraction components are currently envisaged: a syntactic one and a semantic one. The former would convert between different syntactic models of a phrase while maintaining its meaning, e.g. from relative clauses to participial phrases: osoba, która podpowiada aktorom ‘a person who feeds lines to actors’ – osoba podpowiadająca aktorom ‘a person feeding lines to actors’. The semantic component would use WordNet relations such as synonymy or hyponymy to neutralise the lexical meanings of phrase components: osoba, która podpowiada wykonawcom ‘a person who feeds lines to performers’.

Evaluation of both components would be a starting point for further investigation of several independent research problems, e.g.:
– how the alternation of verbal constructs influences the usage of phraseology (przejąć ‘to take over’ – dokonać przejęcia ‘to make a takeover’, człowiek, który przepłynął Atlantyk ‘a man who sailed across the Atlantic’ – człowiek, który przebył Atlantyk ‘a man who travelled across the Atlantic’)
– how far attributes can modify nominal syntactic constructs (mała niebieska pigułka ‘little blue pill’ – niebieska pigułka ‘blue pill’)
– which factors influence the syntactic stability of collocations (cf. Kraj Wschodzącego Słońca ‘Land of the Rising Sun’).
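The semantic component could, for instance, map every content word of a phrase to a canonical representative of its synonym set before comparison. The following sketch uses a hand-made synonym table purely for illustration; in practice the table would be derived from plWordNet.

    # A minimal sketch of the semantic abstraction step; the synonym table is
    # hand-made for illustration (in practice it would come from plWordNet).
    CANONICAL = {
        "aktor": "wykonawca",       # 'actor'     -> 'performer'
        "wykonawca": "wykonawca",   # 'performer' (canonical representative of the set)
    }

    def abstract_phrase(lemmas):
        """Replace each lemma with the canonical representative of its synonym set."""
        return tuple(CANONICAL.get(lemma, lemma) for lemma in lemmas)

    a = abstract_phrase(("osoba", "podpowiadać", "aktor"))
    b = abstract_phrase(("osoba", "podpowiadać", "wykonawca"))
    print(a == b)   # -> True: the two phrases are neutralised to the same form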

17.3.6 Findings

The experiments confirmed our original hypothesis that currently available data sources can provide pragmatic knowledge and, in this way, improve coreference resolution in Polish where the currently used algorithms fail. Apart from coreference resolution, the completed version of the database would also find other linguistic applications, such as the pragmatic analysis of text for smoothing the results of automatic text summarization, machine translation or readability improvements. Development of the knowledge base would seriously enrich the capabilities of independent IT systems performing text analysis, especially as current versions of such systems are insensitive to pragmatic facts vital for the correct interpretation of the text, while such information is freely available to all search engines, even in the simplest full-text search mode (cf. search results for “Nazi Propaganda Minister”). The knowledge base could also be made available independently, in a WolframAlpha-like interface offering search and visualisation.

Considering the incremental and volatile character of knowledge, expressed by the constant updating of the underlying resources by Internet users, extraction algorithms could be linked with data sources in a way that triggers updates of the knowledge base contents when the source data (e.g. a Wikipedia article) gets updated. The data pool could be extended with more linked data sets and tools traditionally used for ontological modelling, with a possibility of using ontological relations


to improve data abstraction (e.g. when Księżyc ‘the Moon’ is linked in an ontology to Srebrny Glob ‘the Silver Globe’, it could be used in the abstraction of phrases like pierwszy człowiek na Księżycu ‘the first man on the Moon’). Interfacing with WolframAlpha or the Google Knowledge Graph will also be investigated. Last but not least, foreign-language resources could be examined to import translated nominal representations of knowledge bits into the base.

17.4 Evaluation improvements

Certain improvements can also be envisaged in the evaluation process, starting with the observation that, e.g., the BLANC metric in its precision and recall calculation takes into account the number of links between mentions, which exhibits quadratic growth with respect to the number of mentions – which, in turn, makes it dependent on the size of the analysed text. A new evaluation metric could also take into account mention borders and other properties resulting from the specificities of annotation guidelines, such as the inclusion of relative clauses, the presence of discontinuities etc.

Coreference can also be further investigated in related research tasks such as measuring text readability based on the number and character of coreferential links (a negative correlation is shown between e.g. the average size of a coreference cluster in a document and traditional readability measures, such as the Gunning FOG index). One of the next steps will be a comparison of “NG coreference density” between Polish and other languages, based on data extracted from existing coreference-enabled corpora such as the Tübingen treebank of German newspaper texts (TüBa-D/Z) annotated with a set of coreference relations (Hinrichs et al., 2005), a coreference-annotated corpus of Dutch newspaper texts, transcribed speech and medical encyclopaedia entries (Hendrickx et al., 2008), NAIST, a Japanese corpus annotated for coreference and predicate-argument relations (Iida et al., 2007), AnCora-CO, coreferentially annotated corpora for Spanish and Catalan (Recasens and Martí, 2010), and many others. It seems that there exists no systematic evaluation of the statistical properties of such corpora going beyond a simple mention and cluster count. This probably results from the differences in annotation guidelines and approaches to certain coreference-related linguistic properties, such as appositions, predicates or relative clauses, hindering unification and comparison.
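To make the quadratic growth of the link count mentioned above explicit (our illustration): for n mentions, the number of mention pairs entering the BLANC calculation is

\[
\binom{n}{2} = \frac{n(n-1)}{2},
\]

so a document with twice as many mentions contributes roughly four times as many links to the score.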

Acknowledgements

The work reported here was carried out within the Computer-based methods for coreference resolution in Polish texts (CORE) project⁶ financed by the Polish National Science Centre (contract number 6505/B/T02/2011/40), aimed at the creation of innovative methods and tools for automated coreference resolution in Polish texts, with quality planned to be comparable to state-of-the-art tools available for other languages. Parts of the work described here were also contributed by other externally funded projects, carried out simultaneously with CORE:
– the work on the new version of the Polish grammar for Spejd by Alicja Wójcicka and Katarzyna Głowińska described in Section 10.3 was co-funded by the Polish Ministry of Science and Higher Education as an Investment in the CLARIN-PL Research Infrastructure and by the European Union from the resources of the European Social Fund
– the work related to the linguistic evaluation of the usefulness of Uryupina’s coreference features for Polish by Piotr Batko and the development of the adaptation of BART (Beautiful Anaphora Resolution Toolkit) for Polish by Bartłomiej Nitoń, described in Section 12.1, was co-funded by the European Union from the financial resources of the European Social Fund, project PO KL “Information technologies: Research and their interdisciplinary applications” (http://phd.ipipan.waw.pl/)
– the work related to the coreference-based approach to summarization described in Section 16.2.1 was carried out within the PhD studies of Mateusz Kopeć
– help with the adaptation of coreference tools to Multiservice (http://glass.ipipan.waw.pl/multiservice/, a Web service framework for Polish NLP tools described in Section 16.2.2) was offered by Michał Lenart, taking part in the CESAR project (Central and South-east European Resources; http://www.meta-net.eu/projects/cesar, part of META-NET, grant agreement 271022) financed from the European Competitiveness and Innovation Framework Programme, Information and Communication Technologies Policy Support Programme (CIP ICT-PSP)
– projection-based experiments were made possible by the University Research Program for Google Translate (http://research.google.com/university/translate/)
– contacts established with the parallel French coreference annotation project ANCOR (http://tln.li.univ-tours.fr/Tln_Ancor.html) were also beneficial for some of our scientific results and helped relate the CORE project more deeply to the international coreference community.

The core CORE project team consisted of (almost alphabetically):
– Maciej Ogrodniczuk – principal investigator
– Barbara Dunin-Kęplicz – formalization of coreference rules

6 PL. Komputerowe metody identyfikacji nawiązań w tekstach polskich, see http://core.ipipan.waw.pl/.

– Maria Głąbska – coreference annotation
– Katarzyna Głowińska – linguistic expertise related to anaphora, coreference and Polish syntax
– Anna Grzeszak – coreference annotation
– Mateusz Kopeć – technical leadership, implementation and IT design, development of the annotation environment and project tools
– Emilia Kubicka – coreference annotation
– Barbara Masny – coreference annotation
– Paulina Rosalska – coreference annotation
– Agata Savary – coreference annotation and annotation work expertise
– Magdalena Zawisławska – linguistic and semantic expertise, annotation management, adjudication of the annotation of Polish Coreference Corpus
– Sebastian Żurowski – coreference annotation

but there were numerous other people, mainly colleagues from the Linguistic Engineering Group at the Institute of Computer Science, Polish Academy of Sciences, who contributed to various stages of the project with their selfless help:
– Piotr Batko – coreference annotation, verification of coreference features for Polish (linguistic part)
– Łukasz Degórski – help related to processing NKJP data
– Łukasz Dębowski – statistical expertise
– Michał Lenart – help related to processing NKJP data, hardware expertise, Multiservice integration assistance
– Małgorzata Marciniak – HPSG anaphora expertise
– Bartłomiej Nitoń – verification of coreference features for Polish (implementation part)
– Adam Przepiórkowski – linguistic and natural language processing expertise, management of co-operation with the National Corpus of Polish
– Filip Skwarski – translation and proofreading
– Jakub Waszczuk – expertise related to annotation and named entity-related tools, versioning system management
– Joanna Wierucka – translation and proofreading
– Marcin Woliński – expertise on morphosyntactic annotation and typesetting
– Alicja Wójcicka – preparation of the nested version of the grammar of Polish
– Alina Wróblewska – dependency parsing expertise.

We would like to thank all project collaborators for hard and very productive work, impressive results, and friendly atmosphere!

Bibliography

Abramowicz W., Filipowska A., Piskorski J., Węcel K. and Wieloch K. (2006). Linguistic Suite for Polish Cadastral System. In: Calzolari et al. (2006), p. 2518–2523. Acedański S. (2010). A Morphosyntactic Brill Tagger for Inflectional Languages. In: Loftsson H., Rögnvaldsson E. and Helgadóttir S. (ed.), Advances in Natural Language Processing, vol. 6233 in Lecture Notes in Computer Science, p. 3–14. Springer. Aone C. and Bennett S. W. (1995). Evaluating Automated and Manual Acquisition of Anaphora Resolution Strategies. In: Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, ACL 1995, p. 122–129, Stroudsburg, PA, USA. Association for Computational Linguistics. Ariel M. (1990). Accessing Noun Phrase Antecedents. Routledge. Artstein R. and Poesio M. (2008). Inter-Coder Agreement for Computational Linguistics. “Computational Linguistics”, 34(4), p. 555–596. Bagga A. and Baldwin B. (1998). Algorithms for Scoring Coreference Chains. In: The 1st International Conference on Language Resources and Evaluation – Workshop on Linguistics Coreference, p. 563–566. Baker C. F., Fillmore C. J. and Lowe J. B. (1998). The Berkeley FrameNet Project. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics – Volume 1, ACL 1998, p. 86–90, Stroudsburg, PA, USA. Association for Computational Linguistics. Bańko M. (2003). Słownik peryfraz, czyli wyrażeń omownych. PWN Scientific Publishers, Warszawa. Bansal N., Blum A. and Chawla S. (2004). Correlation Clustering. “Machine Learning”, 56(1), p. 89–113. Batko P. (2012). Analysis of linguistic features based on (Uryupina, 2007). Technical report, Institute of Computer Science, Polish Academy of Sciences. Bellert I. (1971). O pewnym warunku spójności tekstu. In: Mayenowa M. R. (ed.), O spójności tekstu, vol. XXI, p. 47–76. Zakład Narodowy im. Ossolińskich, Wrocław. Bengtson E. and Roth D. (2008). Understanding the Value of Features for Coreference Resolution. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, p. 294–303, Stroudsburg, PA, USA. Association for Computational Linguistics. Bennet E. M., Alpert R. and Goldstein A. C. (1954). Communications Through Limited-Response Questioning. “Public Opinion Quarterly”, 18(3), p. 303–308. Bergsma S. and Lin D. (2006). Bootstrapping Path-Based Pronoun Resolution. In: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, p. 33–40, Sydney, Australia. Association for Computational Linguistics. Bergsma S. and Yarowsky D. (2011). NADA: A Robust System for Non-referential Pronoun Detection. In: Hendrickx I., Devi S. L., Branco A. H. and Mitkov R. (ed.), Anaphora Processing and Applications. Revised selected papers from the 8th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2011), vol. 7099 in Lecture Notes in Computer Science, p. 12–23. Springer. Bień J. S. (1991). Koncepcja słownikowej informacji morfologicznej i jej komputerowej weryfikacji. Rozprawy Uniwersytetu Warszawskiego, Dissertationes Universitatis Varsoviensis, ISSN 0509-7177 (383). Wydawnictwa Uniwersytetu Warszawskiego, Warszawa. Black M. (1949). Language and philosophy: Studies in method. Cornell University Press, Ithaca. Bobrow D. G. (1964). A Question-answering System for High School Algebra Word Problems.
In: Proceedings of the October 27-29, 1964, Fall Joint Computer Conference, Part I, (AFIPS 1964), p. 591–614, New York, NY, USA. ACM.

Bresnan J. (2001). Lexical-Functional Syntax. Blackwell Publishers. Broda B., Marcińczuk M., Maziarz M., Radziszewski A. and Wardyński A. (2012). KPWr: Towards a Free Corpus of Polish. In: Calzolari et al. (2012), p. 3218–3222. Calzolari N., Choukri K., Gangemi A., Maegaard B., Mariani J., Odijk J. and Tapias D., ed. (2006). Proceedings of the 5th Language Resources and Evaluation Conference (LREC 2006), Genoa, Italy. European Language Resources Association. Calzolari N., Choukri K., Maegaard B., Mariani J., Odijk J., Piperidis S. and Tapias D., ed. (2008). Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco. European Language Resources Association. Calzolari N., Choukri K., Maegaard B., Mariani J., Odijk J., Piperidis S., Rosner M. and Tapias D., ed. (2010). Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta. European Language Resources Association. Calzolari N., Choukri K., Declerck T., Dogan M. U., Maegaard B., Mariani J., Odijk J. and Piperidis S., ed. (2012). Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey. European Language Resources Association. Calzolari N., Choukri K., Declerck T., Loftsson H., Maegaard B., Mariani J., Moreno A., Odijk J. and Piperidis S., ed. (2014). Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), Reykjavík, Iceland. European Language Resources Association. Cardie C. and Wagstaff K. (1999). Noun Phrase Coreference as Clustering. In: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, p. 82–89, University of Maryland, MD. Association for Computational Linguistics. Carnap R. (1947). Meaning and Necessity. University of Chicago Press, Chicago. Chang C.-C. and Lin C.-J. (2011). LIBSVM: A Library for Support Vector Machines. “ACM Transactions on Intelligent Systems and Technology”, 2, p. 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm. Chomsky N. (1980). On Binding. “Linguistic Inquiry”, 11(1), p. 1–46. Chomsky N. (1981). Lectures on Government and Binding. Studies in generative grammar. Foris Publications. Ciura M., Grund D., Kulików S. and Suszczańska N. (2004). A System to Adapt Techniques of Text Summarizing to Polish. In: Okatan A. (ed.), International Conference on Computational Intelligence, p. 117–120, Istanbul, Turkey. International Computational Intelligence Society. Cohen J. (1960). A Coefficient of Agreement for Nominal Scales. “Educational and Psychological Measurement”, 20(1), p. 37–46. Cohen W. W. (1995). Fast Effective Rule Induction. In: Proceedings of the 12th International Conference on Machine Learning, p. 115–123. Morgan Kaufmann. Connolly D., Burger J. D. and Day D. S. (1994). A Machine Learning Approach to Anaphoric Reference. In: Proceedings of the International Conference on New Methods in Language Processing (NeMLaP). ACL. Cristea D. and Postolache O.-D. (2005). How to Deal with Wicked Anaphora? “Current Issues in Linguistic Theory”, 263, p. 17–46. Cunningham H., Maynard D., Bontcheva K. and Tablan V. (2002). GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics. Data-Bukowska E. (2008).
O funkcjonowaniu zaimkowych odniesień anaforycznych w języku polskim – analiza z perspektywy językoznawstwa kognitywnego. “Studia Linguistica Universitatis Iagellonicae Cracoviensis”, 125, p. 51–65. Dobrzyńska T. (1996). Tekst – w perspektywie stylistycznej. In: Dobrzyńska T. (ed.), Tekst i jego odmiany: zbiór studiów, p. 125–143. Instytut Badań Literackich PAN.

Doddington G., Mitchell A., Przybocki M., Ramshaw L., Strassel S. and Weischedel R. (2004). The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation. In: Lino et al. (2004), p. 837–840. Drozdzynski W., Krieger H.-U., Piskorski J., Schäfer U. and Xu F. (2004). Shallow Processing with Unification and Typed Feature Structures – Foundations and Applications. “Künstliche Intelligenz”, 18(1), p. 17–23. Dunin-Kęplicz B. (1983). Towards Better Understanding Of Anaphora. In: Zampolli A. and Ferrari G. (ed.), EACL, p. 139–143. The Association for Computer Linguistics. Dunin-Kęplicz B. (1984). Default Reasoning in Anaphora Resolution. In: ECAI, p. 315–324. Dunin-Kęplicz B. (1985). How To Restrict Ambiguity Of Discourse. In: King M. (ed.), EACL, p. 93–97, Geneva, Switzerland. The Association for Computer Linguistics. Dunin-Kęplicz B. (1989). Formalna metoda rozwiązywania pewnej klasy polskiej anafory zaimkowej. Ph.D. thesis, Uniwersytet Jagielloński. Durrett G. and Klein D. (2013). Easy Victories and Uphill Battles in Coreference Resolution. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Seattle, Washington. Association for Computational Linguistics. Durrett G., Hall D. and Klein D. (2013). Decentralized Entity-Level Modeling for Coreference Resolution. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), p. 114–124, Sofia, Bulgaria. Association for Computational Linguistics. Duszak A. (1986). Niektóre uwarunkowania semantyczne szyku wyrazów w zdaniu polskim. “Polonica”, XII(12), p. 59–74. Fall J. (1988). O anaforze i logicznych metodach jej interpretacji. Ph.D. thesis, Uniwersytet Jagielloński, Wydział Filozoficzno-Historyczny, Kraków. Fall J. (1994). Anafora i jej zatarte granice. “Studia Semiotyczne”, XIX/XX, p. 163–191. Fall J. (2001). Anafora – logiczne metody interpretacji. “Studia Semiotyczne”, XXIII, p. 65–97. Fauconnier G. and Turner M. (2002). The Way We Think: Conceptual Blending and the Mind’s Hidden Complexities. Basic Books, New York. Filak T. (2006). Zastosowanie metod automatycznego uczenia do rozstrzygania problemu anafory. Master’s thesis, Wydział Informatyki i Zarządzania Politechniki Wrocławskiej, Wrocław. Finkel J. R., Grenager T. and Manning C. (2005). Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL 2005, p. 363–370, Stroudsburg, PA, USA. Association for Computational Linguistics. Fontański H. (1986). Anaforyczne przymiotniki wskazujące w języku polskim i rosyjskim: problem użycia. Prace naukowe Uniwersytetu Śląskiego w Katowicach. Uniwersytet Śląski. Frege G. (1892). Über Sinn und Bedeutung. “Zeitschrift für Philosophie und philosophische Kritik”, 100, p. 25–50. Gajda S. (1982). Podstawy badań stylistycznych nad językiem naukowym. Państwowe Wydawnictwo Naukowe. Gajda S. (1990). Współczesna polszczyzna naukowa: język czy żargon? Instytut Śląski w Opolu. Głowińska K. (2010). Nawiązania. In: Marciniak (2010), p. 191–198. Głowińska K. (2012). Anotacja składniowa. In: Przepiórkowski et al. (2012), p. 107–127. Górski R. L. and Łaziński M. (2012). Reprezentatywność i zrównoważenie korpusu. In: Przepiórkowski et al. (2012), p. 25–36. Grishman R. and Sundheim B. (1996). Message Understanding Conference-6: A Brief History. In: Proceedings of the 16th Conference on Computational Linguistics – Volume 1, COLING 1996, p. 
466–471, Stroudsburg, PA, USA. Association for Computational Linguistics.

Grosz B. J., Weinstein S. and Joshi A. K. (1995). Centering: A Framework for Modeling the Local Coherence of Discourse. “Computational Linguistics”, 21(2), p. 203–226. Grzegorczykowa R. (1996). Polskie leksemy z wbudowaną informacją anaforyzacyjną. In: Grochowski M. (ed.), Anafora w strukturze tekstu, p. 71–77. Wydawnictwo Energeia, Warszawa. Gunning R. (1971). Technique of Clear Writing. McGraw-Hill. Haghighi A. and Klein D. (2007). Unsupervised Coreference Resolution in a Nonparametric Bayesian Model. In: Carroll J. A., van den Bosch A. and Zaenen A. (ed.), Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, p. 848–855. The Association for Computational Linguistics. Haghighi A. and Klein D. (2009). Simple Coreference Resolution with Rich Syntactic and Semantic Features. In: EMNLP 2009, p. 1152–1161. Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P. and Witten I. H. (2009). The WEKA Data Mining Software: An Update. “ACM SIGKDD Explorations Newsletter”, 11(1), p. 10–18. Handschuh S., Staab S. and Studer R. (2003). Leveraging Metadata Creation for the Semantic Web with CREAM. In: Günter A., Kruse R. and Neumann B. (ed.), KI 2003: Advances in Artificial Intelligence, vol. 2821 in Lecture Notes in Computer Science, p. 19–33. Springer Berlin Heidelberg. Harabagiu S. M., Bunescu R. C. and Maiorano S. J. (2001). Text and Knowledge Mining for Coreference Resolution. In: Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, NAACL 2001, p. 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics. Hendrickx I., Bouma G., Daelemans W., Hoste V., Kloosterman G., Mineur A.-M., Van der Vloet J. and Verschelde J.-L. (2008). A Coreference Corpus and Resolution System for Dutch. In: Calzolari et al. (2008), p. 144–149. Hinrichs E. W., Kübler S. and Naumann K. (2005). A Unified Representation for Morphological, Syntactic, Semantic, and Referential Annotations. In: Proceedings of the ACL Workshop on Frontiers In Corpus Annotation II: Pie In The Sky, p. 13–20, Ann Arbor, Michigan, USA. Hobbs J. R. (1986). Resolving Pronoun References. In: Grosz B. J., Sparck-Jones K. and Webber B. L. (ed.), Readings in Natural Language Processing, p. 339–352. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. Honowska M. (1984). Grzybnia zaimkowa. Przyczynek do zagadnień spójności tekstu. “Polonica”, X, p. 111–120. Iida R., Komachi M., Inui K. and Matsumoto Y. (2007). Annotating a Japanese Text Corpus with Predicate-Argument and Coreference Relations. In: Proceedings of the Linguistic Annotation Workshop (LAW 2007), p. 132–139, Stroudsburg, PA, USA. Association for Computational Linguistics. Jaccard P. (1908). Nouvelles recherches sur la distribution florale. “Bulletin de la Société Vaudoise des Sciences Naturelles”, 44, p. 223–270. Janus D. and Przepiórkowski A. (2007). POLIQARP 1.0: Some technical aspects of a linguistic search engine for large corpora. In: Waliński J., Kredens K. and Goźdź-Roszkowski S. (ed.), Proceedings of Practical Applications in Language and Computers Conference (PALC 2005), Frankfurt am Main. Peter Lang. Joachims T. (1999). Making Large-Scale SVM Learning Practical. In: Schölkopf B., Burges C. and Smola A. (ed.), Advances in Kernel Methods – Support Vector Learning, chapter 11, p. 169–184. MIT Press, Cambridge, MA. Kamp H. (1984). A Theory of Truth and Semantic Representation. In: Groenendijk J., Janssen T. M. V. and Stokhof M.
(ed.), Truth, Interpretation and Information: Selected Papers from the 3rd Amsterdam Colloquium, p. 1–41. Foris Publications, Dordrecht. Kardela H. (1985). A grammar of English and Polish reflexives. Uniwersytet Marii Curie-Skłodowskiej, Wydział Humanistyczny.

Klemensiewicz Z. (1937). Składnia opisowa współczesnej polszczyzny kulturalnej. Polska Akademia Umiejętności. Klemensiewicz Z. (1948). Syntaktyczny stosunek nawiązania. “Sprawozdania z Czynności i Posiedzeń PAU”, XLVIII(6), p. 214–217. Klemensiewicz Z. (1950). O syntaktycznym stosunku nawiązania. “Slavia”, XIX, p. 13–27. Klemensiewicz Z. (1968). Zarys składni polskiej. Państwowe Wydawnictwo Naukowe, wyd. 5. Klemensiewicz Z. (1982). O syntaktycznym stosunku nawiązania. In: Kałkowska A. (ed.), Składnia, stylistyka, pedagogika językowa, Biblioteka Filologii Polskiej: Językoznawstwo, p. 241–257. Państwowe Wydawnictwo Naukowe. Kobyliński Ł. (2014). PoliTa: A multitagger for Polish. In: Calzolari et al. (2014), p. 2949–2954. Kopeć M. (2014). Zero subject detection for Polish. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers, p. 221–225, Gothenburg, Sweden. Association for Computational Linguistics. Kopeć M. and Ogrodniczuk M. (2012). Creating a Coreference Resolution System for Polish. In: Calzolari et al. (2012), p. 192–195. Kopeć M. and Ogrodniczuk M. (2014). Inter-annotator Agreement in Coreference Annotation of Polish. In: Sobecki J., Boonjing V. and Chittayasothorn S. (ed.), Advanced Approaches to Intelligent Information and Database Systems, vol. 551 in Studies in Computational Intelligence, p. 149– 158. Springer International Publishing, Switzerland. Korzen I. and Buch-Kromann M. (2011). Anaphoric relations in the Copenhagen Dependency Treebanks. In: Proceedings of DGfS Workshop, p. 83–98, Göttingen, Germany. Krippendorff K. (2007). Computing Krippendorff’s Alpha Reliability. Technical report, University of Pennsylvania, Annenberg School for Communication. Krippendorff K. H. (2003). Content Analysis: An Introduction to Its Methodology. Sage Publications, Inc, wyd. 2. Kulików S., Romaniuk J. and Suszczańska N. (2004). A syntactical analysis of anaphora in the Polsyn parser. In: Kłopotek M. A., Wierzchoń S. T. and Trojanowski K. (ed.), Intelligent Information Processing and Web Mining, vol. 25 in Advances in Soft Computing, p. 444–448. Springer Berlin Heidelberg. Kunz K. A. (2010). Variation in English and German Nominal Coreference: A Study of Political Essays. Saarbrücker Beiträge zur Sprach- und Translationswissenschaft. Peter Lang, Frankfurt am Main, Berlin, Bern, Bruxelles, New York, Oxford, Wien. Kupść A. and Marciniak M. (1997). Some notes on HPSG binding theory for Polish. Langacker R. W. (2008). Cognitive Grammar: A Basic Introduction. Oxford University Press, USA. Lappin S. and Leass H. J. (1994). An Algorithm for Pronominal Anaphora Resolution. “Computational Linguistics”, 20(4), p. 535–561. LDC (2006). ACE (Automatic Content Extraction) Spanish Annotation Guidelines for Entities. Linguistic-Data-Consortium. Available at http://projects.ldc.upenn.edu/ace/docs/SpanishEntities-Guidelines_v1.6.pdf (accessed on Feb. 18, 2013). Lee H., Peirsman Y., Chang A., Chambers N., Surdeanu M. and Jurafsky D. (2011). Stanford’s Multipass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task. In: Proceedings of the 15th Conference on Computational Natural Language Learning: Shared Task, CONLL Shared Task 2011, p. 28–34, Stroudsburg, PA, USA. Association for Computational Linguistics. Lee H., Chang A., Peirsman Y., Chambers N., Surdeanu M. and Jurafsky D. (2013). Deterministic Coreference Resolution Based on Entity-centric, Precision-ranked Rules. “Comput. Linguist.”, 39(4), p. 885–916. Lino M. 
T., Xavier M. F., Ferreira F., Costa R. and Silva R., ed. (2004). Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal. European Language Resources Association.

Luo X. (2005). On Coreference Resolution Performance Metrics. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT 2005, p. 25–32, Vancouver, Canada. Association for Computational Linguistics. Luo X., Ittycheriah A., Jing H., Kambhatla N. and Roukos S. (2004). A Mention-synchronous Coreference Resolution Algorithm Based on the Bell Tree. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, ACL 2004, Stroudsburg, PA, USA. Association for Computational Linguistics. Magnini B., Pianta E., Girardi C., Negri M., Romano L., Speranza M., Lenzi V. B. and Sprugnoli R. (2006). I-CAB: the Italian Content Annotation Bank. In: Calzolari et al. (2006), p. 963–968. Marciniak M. (1999). Toward a Binding Theory for Polish. In: Borsley R. D. and Przepiórkowski A. (ed.), Slavic in Head-Driven Phrase Structure Grammar, p. 125–147. CSLI Publications, Stanford, CA. Marciniak M. (2001). Algorytmy implementacyjne syntaktycznych reguł koreferencji zaimków dla języka polskiego w terminach HPSG. Ph.D. dissertation, Institute of Computer Science, Polish Academy of Sciences, Warsaw. Marciniak M. (2002). Anaphor Binding in Polish. Theory and Implementation. In: Proceedings of DAARC 2002 – the 4th Discourse Anaphora and Anaphor Resolution Colloquium, Lisbon. Marciniak M., ed. (2010). Anotowany korpus dialogów telefonicznych. Problemy Współczesnej Nauki, Teoria i Zastosowania: Inżynieria Lingwistyczna. Akademicka Oficyna Wydawnicza EXIT, Warsaw. Marciszewski W. (1983). Spójność strukturalna a spójność semantyczna. In: Dobrzyńska T. and Janus E. (ed.), Tekst i zdanie, p. 183–189. Zakład Narodowy im. Ossolińskich. Marinković I. (2004). Wielka Księga Imion. Słownik Encyklopedyczny. Wydawnictwo Europa. Màrquez L., Recasens M. and Sapena E. (2012). Coreference Resolution: An Empirical Study Based on SemEval-2010 Shared Task 1. “Language Resources and Evaluation”, 47, p. 1–34. Maziarz M., Piekot T., Poprawa M., Broda B., Radziszewski A. and Zarzeczny G. (2012). Język raportów ewaluacyjnych. Ministerstwo Rozwoju Regionalnego. Departament Koordynacji Polityki Strukturalnej. McCallum A. and Wellner B. (2005). Conditional Models of Identity Uncertainty with Application to Noun Coreference. In: Advances in Neural Information Processing Systems, vol. 17, p. 905–912. MIT Press. McCarthy J. F. and Lehnert W. G. (1995). Using Decision Trees for Coreference Resolution. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 1995), p. 1050–1055, Montreal, Canada. Miłkowski M. (2010). Developing an open-source, rule-based proofreading tool. “Software: Practice and Experience”, 40(7), p. 543–566. Miłkowski M. (2012). The Polish Language in the Digital Age. Springer Publishing Company. Mill J. S. (1843). A System of Logic Ratiocinative and Inductive: Being a Connected View of the Principles of Evidence and the Methods of Scientific Investigation, vol. 1. John W. Parker. Miller G. A. (1995). WordNet: A Lexical Database for English. “Communications of the ACM”, 38(11), p. 39–41. Miller G. A., Beckwith R., Fellbaum C., Gross D. and Miller K. (1990). WordNet: An on-line lexical database. “International Journal of Lexicography”, 3, p. 235–244. Mitkov R. (1999). Anaphora Resolution: The State Of The Art. Technical report, University of Wolverhampton. Based on the COLING/ACL 1998 tutorial on anaphora resolution. Mitkov R. and Styś M. (1997).
Robust reference resolution with limited knowledge: high precision genre-specific approach for English and Polish. In: Recent Advances in Natural Language Processing (RANLP-97), p. 74–81.

Mitkov R., Belguith L. and Styś M. (1998). Multilingual Robust Anaphora Resolution. In: Proceedings of the 3rd International Conference on Empirical Methods in Natural Language Processing (EMNLP-3), p. 7–16, Granada, Spain. Montague R. (1973). The Proper Treatment of Quantification in Ordinary English. In: Suppes P., Moravcsik J. and Hintikka J. (ed.), Approaches to Natural Language, vol. 49, p. 221–242. Dordrecht. Müller C. and Strube M. (2006). Multi-level annotation of linguistic data with MMAX2. In: Braun S., Kohn K. and Mukherjee J. (ed.), Corpus Technology and Language Pedagogy: New Resources, New Tools, New Methods, p. 197–214. Peter Lang, Frankfurt a.M., Germany. Muzerelle J., Lefeuvre A., Antoine J.-Y., Schang E., Maurel D., Villaneau J. and Eshkol I. (2013). ANCOR, premier corpus de français parlé d’envergure annoté en coréférence et distribué librement. In: Proceedings of the 20th Conference Traitement Automatique des Langues Naturelles (TALN 2013), p. 555–563, Les Sables d’Olonne, France. Nedoluzhko A. (2010). Coreferential relationships in text – comparative analysis of annotated data. In: Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference Dialogue 2010, Issue 9 (16). RGGU, Moscow. Nedoluzhko A., Mírovský J., Ocelák R. and Pergler J. (2009). Extended Coreferential Relations and Bridging Anaphora in the Prague Dependency Treebank. In: Proceedings of the 7th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2009), p. 1–16, Goa, India. AU-KBC Research Centre, Anna University, Chennai. Ng V. (2004). Improving Machine Learning Approaches to Noun Phrase Coreference Resolution. Ph.D. thesis, Cornell University, Ithaca, NY, USA. Ng V. (2008). Unsupervised Models for Coreference Resolution. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), p. 640–649. Ng V. (2010). Supervised Noun Phrase Coreference Research: The First Fifteen Years. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL 2010, p. 1396–1411, Stroudsburg, PA, USA. Association for Computational Linguistics. Ng V. and Cardie C. (2002a). Identifying Anaphoric and Non-Anaphoric Noun Phrases to Improve Coreference Resolution. In: Proceedings of the 19th International Conference on Computational Linguistics – Volume 1, COLING 2002, p. 730–736, Stroudsburg, PA, USA. Association for Computational Linguistics. Ng V. and Cardie C. (2002b). Improving Machine Learning Approaches to Coreference Resolution. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL 2002, p. 104–111, Stroudsburg, PA, USA. Association for Computational Linguistics. Nitoń B. (2013). Evaluation of Uryupina’s coreference resolution features for Polish. In: Vetulani Z. (ed.), Proceedings of the 6th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2013), p. 122–126, Poznań, Poland. Ogrodniczuk M. (2012). The Polish Sejm Corpus. In: Calzolari et al. (2012), p. 2219–2223. Ogrodniczuk M. (2013a). Discovery of Common Nominal Facts for Coreference Resolution: Proof of concept. In: Prasath R. and Kathirvalavakumar T. (ed.), Mining Intelligence and Knowledge Exploration (MIKE 2013), vol. 8284 in Lecture Notes in Artificial Intelligence, p. 709–716. Springer-Verlag, Berlin, Heidelberg. Ogrodniczuk M. (2013b).
Translation- and Projection-Based Unsupervised Coreference Resolution for Polish. In: Kłopotek M. A., Koronacki J., Marciniak M., Mykowiecka A. and Wierzchoń S. T. (ed.), Proceedings of the 20th International Conference Intelligent Information Systems, vol. 7912 in Lecture Notes in Computer Science, p. 125–130. Springer-Verlag, Berlin, Heidelberg. Ogrodniczuk M. and Kopeć M. (2011a). End-to-end coreference resolution baseline system for Polish. In: Vetulani Z. (ed.), Proceedings of the 5th Language & Technology Conference: Human
Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2011), p. 167–171, Poznań, Poland. Ogrodniczuk M. and Kopeć M. (2011b). Rule-based coreference resolution module for Polish. In: Proceedings of the 8th Discourse Anaphora and Anaphor Resolution Colloquium (DAARC 2011), p. 191–200, Faro, Portugal. Ogrodniczuk M. and Kopeć M. (2014). The Polish Summaries Corpus. In: Calzolari et al. (2014), p. 3712–3715. Ogrodniczuk M. and Lenart M. (2013). A Multi-purpose Online Toolset for NLP Applications. In: Métais E., Meziane F., Saraee M., Sugumaran V. and Vadera S. (ed.), Proceedings of the 18th International Conference on Applications of Natural Language to Information Systems, vol. 7934 in Lecture Notes in Computer Science, p. 392–395. Springer-Verlag, Berlin, Heidelberg. Ogrodniczuk M. and Przepiórkowski A., ed. (2014). Advances in Natural Language Processing: Proceedings of the 9th International Conference on NLP, PolTAL 2014, vol. 8686 in Lecture Notes in Computer Science, Warsaw, Poland. Springer International Publishing. Ogrodniczuk M., Wójcicka A., Głowińska K. and Kopeć M. (2014a). Detection of Nested Mentions for Coreference Resolution in Polish. In: Ogrodniczuk and Przepiórkowski (2014), p. 270–277. Ogrodniczuk M., Kopeć M. and Savary A. (2014b). Polish Coreference Corpus in Numbers. In: Calzolari et al. (2014), p. 3234–3238. Orăsan C. (2003). PALinkA: a highly customizable tool for discourse annotation. In: Proceedings of the 4th SIGdial Workshop on Discourse and Dialog, p. 39–43, Sapporo, Japan. Orăsan C., Cristea D., Mitkov R. and Branco A. (2008). Anaphora Resolution Exercise: An Overview. In: Calzolari et al. (2008), p. 2801–2805. Osenova P. and Simov K. (2004). BTB-TR05: BulTreeBank Stylebook. BulTreeBank Version 1.0. Technical Report BTB-TR05, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Sofia, Bulgaria. Padučeva E. V. (1992). Wypowiedź i jej odniesienie do rzeczywistości. (Referencyjne aspekty znaczenia zaimków). PWN, Warszawa. [tłum. Z. Kozłowska]. Papierz M. (1995). Zaimki a mechanizmy anafory (w języku polskim, słowackim i czeskim). “Studia z Filologii Polskiej i Słowiańskiej”, 32, p. 215–223. Pasek J. (1991). Anafora. In: Pelc J. (ed.), Prace z pragmatyki, semantyki i metodologii semiotyki, Biblioteka myśli semiotycznej, p. 275–286. Ossolineum. Passonneau R., Habash N. and Rambow O. (2006). Inter-annotator Agreement on a Multilingual Semantic Annotation Task. In: Calzolari et al. (2006), p. 1951–1956. Passonneau R. J. (1997). Applying Reliability Metrics to Co-Reference Annotation. “CoRR”, cmp-lg/9706011. Passonneau R. J. (2004). Computing Reliability for Coreference Annotation. In: Lino et al. (2004), p. 1503–1506. Patejuk A. and Przepiórkowski A. (2010). ISOcat Definition of the National Corpus of Polish Tagset. In: LREC 2010 Workshop on LRT Standards, Valletta, Malta. ELRA. Piasecki M., Szpakowicz S. and Broda B. (2009). A Wordnet from the Ground Up. Oficyna Wydawnicza Politechniki Wrocławskiej. Pisarek J. (2012). Językowe mechanizmy nawiązania w tekstach publicystycznych na przykładzie felietonów “Tygodnika Powszechnego”. Ph.D. dissertation, Wydział Polonistyki Uniwersytetu Jagiellońskiego, Kraków. Pisarkowa K. (1969). Funkcje składniowe polskich zaimków odmiennych. Prace Komisji Językoznawstwa nr 22. Zakład Narodowy im. Ossolińskich. Polska Akademia Nauk, Oddział w Krakowie. Poesio M. and Artstein R. (2008). Anaphoric Annotation in the ARRAU Corpus. In: Calzolari et al. (2008), p. 1170–1174.

Poesio M., Uryupina O. and Versley Y. (2010). Creating a Coreference Resolution System for Italian. In: Calzolari et al. (2010), p. 713–716. Pollard C. and Sag I. A. (1994). Head-Driven Phrase Structure Grammar. Studies in Contemporary Linguistics. University of Chicago Press. Pradhan S., Ramshaw L., Marcus M., Palmer M., Weischedel R. and Xue N. (2011). CoNLL-2011 Shared Task: Modeling Unrestricted Coreference in OntoNotes. In: Proceedings of the 15th Conference on Computational Natural Language Learning: Shared Task, CONLL Shared Task 2011, p. 1–27, Stroudsburg, PA, USA. Association for Computational Linguistics. Pradhan S., Moschitti A., Xue N., Uryupina O. and Zhang Y. (2012). CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes. In: Proceedings of the 16th Conference on Computational Natural Language Learning (CoNLL 2012), Jeju, Korea. Pradhan S. S., Ramshaw L., Weischedel R., MacBride J. and Micciulla L. (2007). Unrestricted Coreference: Identifying Entities and Events in OntoNotes. In: Proceedings of the 1st IEEE International Conference on Semantic Computing (ICSC 2007), p. 446–453, Washington, DC, USA. IEEE Computer Society. Presspublica (2002). Rzeczpospolita Corpus. [on-line] http://www.cs.put.poznan.pl/dweiss/ rzeczpospolita. Przepiórkowski A. (2004). The IPI PAN Corpus: Preliminary version. Institute of Computer Science, Polish Academy of Sciences, Warsaw. Przepiórkowski A. (2005). The IPI PAN Corpus in Numbers. In: Vetulani Z. (ed.), Proceedings of the 2nd Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2005), p. 27–31, Poznań, Poland. Przepiórkowski A. and Buczyński A. (2007). Spejd: Shallow Parsing and Disambiguation Engine. In: Vetulani Z. (ed.), Proceedings of the 3rd Language & Technology Conference, p. 340–344, Poznań, Poland. Przepiórkowski A. and Woliński M. (2003). A Flexemic Tagset for Polish. In: Proceedings of Morphological Processing of Slavic Languages, EACL 2003. Przepiórkowski A., Kupść A., Marciniak M. and Mykowiecka A. (2002). Formalny opis języka polskiego: Teoria i implementacja. Akademicka Oficyna Wydawnicza EXIT, Warsaw. Przepiórkowski A., Bańko M., Górski R. L. and Lewandowska-Tomaszczyk B., ed. (2012). Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN, Warsaw. Przepiórkowski A. and Bański P. (2011). XML Text Interchange Format in the National Corpus of Polish. In: Goźdź-Roszkowski S. (ed.), Explorations across Languages and Corpora: PALC 2009, p. 55–65, Frankfurt. Peter Lang. Puczyłowski T. (2003). Problem wymienialności wyrażeń koreferencyjnych w zdaniach przekonaniowych. “Przegląd Filozoficzno-Literacki”, 4(6), p. 249–262. Raghunathan K., Lee H., Rangarajan S., Chambers N., Surdeanu M., Jurafsky D. and Manning C. (2010). A Multi-pass Sieve for Coreference Resolution. In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP 2010, p. 492–501, Stroudsburg, PA, USA. Association for Computational Linguistics. Rahman A. and Ng V. (2009). Supervised Models for Coreference Resolution. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing: Volume 2, EMNLP 2009, p. 968–977, Stroudsburg, PA, USA. Association for Computational Linguistics. Rahman A. and Ng V. (2012). Translation-Based Projection for Multilingual Coreference Resolution. 
In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (HLT-NAACL 2012), p. 720–730, Montréal, Canada. Association for Computational Linguistics. Recasens M. (2010). Coreference: Theory, Annotation, Resolution and Evaluation. Ph.D. thesis, Department of Linguistics, University of Barcelona, Barcelona, Spain.

Recasens M. and Hovy E. (2010). BLANC: Implementing the Rand index for coreference evaluation. “Natural Language Engineering”, 17, p. 485–510. Recasens M. and Martí M. A. (2010). AnCora-CO: Coreferentially annotated corpora for Spanish and Catalan. “Language Resources and Evaluation”, 44(4), p. 315–345. Recasens M., Hovy E. and Martí M. A. (2010a). A Typology of Near-Identity Relations for Coreference (NIDENT). In: Calzolari et al. (2010), p. 149–156. Recasens M., Màrquez L., Sapena E., Martí M. A., Taulé M., Hoste V., Poesio M. and Versley Y. (2010b). SemEval-2010 Task 1: Coreference Resolution in Multiple Languages. In: Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval 2010, p. 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics. Recasens M., Hovy E. and Martí M. A. (2011). Identity, non-identity, and near-identity: Addressing the complexity of coreference. “Lingua”, 121(6), p. 1138–1152. Recasens M., de Marneffe M.-C. and Potts C. (2013). The Life and Death of Discourse Entities: Identifying Singleton Mentions. In: Proceedings of NAACL-HLT 2013, p. 627–633, Atlanta, Georgia. The Association for Computational Linguistics. Reinders-Machowska E. (1991). Binding in Polish. In: Long Distance Anaphora, p. 137–150. Cambridge University Press. Rose T., Stevenson M. and Whitehead M. (2002). The Reuters Corpus Volume 1 – from Yesterday’s News to Tomorrow’s Language Resources. In: Proceedings of the 3rd International Conference on Language Resources and Evaluation, p. 329–350, Las Palmas de Gran Canaria. Russell B. (1905). On Denoting. “Mind”, 14, p. 479–493. Russo L., Loáiciga S. and Gulati A. (2012). Improving machine translation of null subjects in Italian and Spanish. In: Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics, p. 81–89, Avignon, France. Association for Computational Linguistics. Ryant N. and Scheffler T. (2006). Binding of anaphors in LTAG. In: Proceedings of the 8th International Workshop on Tree Adjoining Grammar and Related Formalisms, p. 65–72. Saloni Z., Gruszczyński W., Woliński M. and Wołosz R. (2007). Grammatical Dictionary of Polish – Presentation by the Authors. “Studies in Polish Linguistics”, 4, p. 5–25. http://www.ijppan.krakow.pl/sipl/saloni.pdf, see also http://www.info.univ-tours.fr/~savary/Polonium/Papers/prezentacja-SGJP-Tours.pdf. Savary A. (2012). Anotacja jednostek nazewniczych. In: Przepiórkowski et al. (2012), p. 129–168. Shapiro S. S. and Wilk M. B. (1965). An Analysis of Variance Test for Normality (Complete Samples). “Biometrika”, 52(3/4), p. 591–611. Sikdar U. K., Ekbal A., Saha S., Uryupina O. and Poesio M. (2013). Resolución de anáfora para el bengalí: un experimento con la aplicación al dominio. “Computación y Sistemas”, 17(2), p. 137–146. Socher R., Bauer J., Manning C. D. and Ng A. Y. (2013). Parsing with Compositional Vector Grammars. In: Fung P. and Poesio M. (ed.), ACL (1), p. 455–465. The Association for Computer Linguistics. Soon W. M., Ng H. T. and Lim C. Y. (1999). Corpus-Based Learning for Noun Phrase Coreference Resolution. In: 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, p. 285–291. Soon W. M., Ng H. T. and Lim D. C. Y. (2001). A Machine Learning Approach to Coreference Resolution of Noun Phrases. “Computational Linguistics”, 27(4), p. 521–544.
Stenetorp P., Pyysalo S., Topić G., Ohta T., Ananiadou S. and Tsujii J. (2012). BRAT: a web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2012, p. 102– 107, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stoyanov V., Gilbert N., Cardie C. and Riloff E. (2009). Conundrums in Noun Phrase Coreference Resolution: Making Sense of the State-of-the-Art. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, p. 656–664, Suntec, Singapore. Association for Computational Linguistics. Stoyanov V., Cardie C., Gilbert N., Riloff E., Buttler D. and Hysom D. (2010a). Coreference Resolution with Reconcile. In: Proceedings of the ACL 2010 Conference Short Papers, ACLShort 2010, p. 156–161, Stroudsburg, PA, USA. Association for Computational Linguistics. Stoyanov V., Cardie C., Gilbert N., Riloff E., Buttler D. and Hysom D. (2010b). Reconcile: A Coreference Resolution Research Platform. Technical report, Cornell University. Stroińska M. (1992). Styl bezosobowy a spójność referencjalna w dyskursie. In: Dobrzyńska T. (ed.), Typy tekstów: zbiór studiów, p. 15–25. Instytut Badań Literackich Polskiej Akademii Nauk. Studnicki F. and Polanowska B. (1983). Zautomatyzowane rozwiązywanie odesłań występujących w tekstach prawnych. “Studia Semiotyczne”, 13, p. 65–90. Świdziński M. (1994). Syntactic Dictionary of Polish Verbs. Szkudlarek-Śmiechowska E. (2003). Wskaźniki nawiązania we współczesnych tekstach polskich (na materiale współczesnej nowelistyki polskiej). Acta Universitatis Lodziensis: Folia linguistica. Wydawnictwo Uniwersytetu Łódzkiego. Szwedek A. (1975). Coreference and Sentence Stress in English and Polish. “Poznań Studies in Contemporary Linguistics”, 3, p. 209–213. Topolińska Z. (1976). Wyznaczoność (tj. charakterystyka referencyjna) grupy imiennej w tekście polskim. “Polonica”, 3(2), p. 33–72. Topolińska Z. (1977). “Referencja”, “koreferencja”, “anafora”. “Slavica Slovaca”, 12(3), p. 225–232. Topolińska Z. (1984). Składnia grupy imiennej. In: Grochowski M., Karolak S. and Topolińska Z. (ed.), Składnia, Gramatyka współczesnego języka polskiego, p. 301–389. Państwowe Wydawnictwo Naukowe. Toutanova K., Klein D., Manning C. D. and Singer Y. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In: IN PROCEEDINGS OF HLT-NAACL, p. 252–259. Trawiński B. (2007). Referential Relations in Polish. In: Book of Abstracts of 7th European Conference on Formal Description of Slavic Languages (FDSL-7), p. 109–111, Leipzig. University of Leipzig. Trofimiec S. (2007). Konstrukcje anaforyczne jako wskaźniki nawiązania w tekstach prasowych. “Język Polski”, LXXXVII(1), p. 24–28. Uren V., Cimiano P., Iria J., Handschuh S., Vargas-Vera M., Motta E. and Ciravegna F. (2006). Semantic Annotation for Knowledge Management: Requirements and a Survey of the State of the Art. “Web Semant.”, 4(1), p. 14–28. Uryupina O. (2007). Knowledge Acquisition for Coreference Resolution. Ph.D. thesis, Saarland University. Uryupina O., Moschitti A. and Poesio M. (2012). BART Goes Multilingual: The UniTN/Essex Submission to the CoNLL-2012 Shared Task. In: Joint Conference on EMNLP and CoNLL – Shared Task, CoNLL 2012, p. 122–128, Stroudsburg, PA, USA. Association for Computational Linguistics. van Hoek K. (1995). Conceptual Reference Points: A Cognitive Grammar Account of Pronominal Anaphora Constraints. “Language”, 71(2), p. 310–340. van Hoek K. (2003). Pronouns and Point of View: Cognitive Principles of Coreference. In: Tomasello M. (ed.), The New Psychology of Language: Cognitive and Functional Approaches to Language Structure, vol. 2, p. 169–194. Lawrence Erlbaum Associates, Mahwah, NJ. Vater H. (2009). 
Wstęp do lingwistyki tekstu. Struktura i rozumienie tekstów, vol. 2. Atut, Wrocław. tłum. Błachut E., Gołębiowski A. Versley Y., Ponzetto S. P., Poesio M., Eidelman V., Jern A., Smith J., Yang X. and Moschitti A. (2008). BART: A Modular Toolkit for Coreference Resolution. In: Association for Computational Linguistics (ACL) Demo Session.

Vilain M., Burger J., Aberdeen J., Connolly D. and Hirschman L. (1995). A Model-Theoretic Coreference Scoring Scheme. In: Proceedings of the 6th Message Understanding Conference (MUC-6), p. 45–52. Vossen P., ed. (1998). EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, Norwell, MA, USA. Wagner R. A. and Fischer M. J. (1974). The String-to-String Correction Problem. “Journal of the Association for Computing Machinery”, 21(1), p. 168–173. Wajszczuk J. (1978). Syntaktyczny stosunek nawiązania (na materiale współczesnego języka rosyjskiego). Ph.D. thesis, Uniwersytet Warszawski, Warszawa. Waszczuk J., Głowińska K., Savary A., Przepiórkowski A. and Lenart M. (2013). Annotation tools for syntax and named entities in the National Corpus of Polish. “International Journal of Data Mining, Modelling and Management”, 5(2), p. 103–122. Weischedel R., Pradhan S., Ramshaw L., Kaufman J., Franchini M. and El-Bachouti M. (2010). OntoNotes Release 4.0. Available at https://catalog.ldc.upenn.edu/docs/LDC2011T03/OntoNotes-Release-4.0.pdf (accessed on May 2, 2014). Wierzbicka A. (2010). Semantyka: jednostki elementarne i uniwersalne. Wydawnictwo Uniwersytetu Marii Curie-Skłodowskiej, Lublin. Willim E. (1995). In defence of the subject: Evidence from Polish. “Licensing in Syntax and Phonology”, 1, p. 147–164. Woliński M. (2006). Morfeusz – a practical tool for the morphological analysis of Polish. In: Kłopotek M. A., Wierzchoń S. T. and Trojanowski K. (ed.), Proceedings of the International Intelligent Information Systems: Intelligent Information Processing and Web Mining 2006 Conference, p. 511–520, Wisła, Poland. Woliński M. (2014). Morfeusz Reloaded. In: Calzolari et al. (2014), p. 1106–1111. Wróblewska A. (2012). Polish Dependency Bank. “Linguistic Issues in Language Technology”, 7(1), p. 1–18. Żmigrodzki P. (2007). O projekcie Wielkiego słownika języka polskiego. “Język Polski”, 5(LXXXVII), p. 265–267.