English Pages 288 [247] Year 2017
OUP CORRECTED PROOF – FINAL, 11/7/2017, SPi i
Quantitative Historical Linguistics
OXFORD STUDIES IN DIACHRONIC AND HISTORICAL LINGUISTICS

General editors: Adam Ledgeway and Ian Roberts, University of Cambridge

Advisory editors: Cynthia Allen, Australian National University; Ricardo Bermúdez-Otero, University of Manchester; Theresa Biberauer, University of Cambridge; Charlotte Galves, University of Campinas; Geoff Horrocks, University of Cambridge; Paul Kiparsky, Stanford University; Anthony Kroch, University of Pennsylvania; David Lightfoot, Georgetown University; Giuseppe Longobardi, University of York; George Walkden, University of Konstanz; David Willis, University of Cambridge

Recently published in the series
19 The Syntax of Old Romanian. Edited by Gabriela Pană Dindelegan
20 Grammaticalization and the Rise of Configurationality in Indo-Aryan. Uta Reinöhl
21 The Rise and Fall of Ergativity in Aramaic: Cycles of Alignment Change. Eleanor Coghill
22 Portuguese Relative Clauses in Synchrony and Diachrony. Adriana Cardoso
23 Micro-change and Macro-change in Diachronic Syntax. Edited by Eric Mathieu and Robert Truswell
24 The Development of Latin Clause Structure: A Study of the Extended Verb Phrase. Lieven Danckaert
25 Transitive Nouns and Adjectives: Evidence from Early Indo-Aryan. John J. Lowe
26 Quantitative Historical Linguistics: A Corpus Framework. Gard B. Jenset and Barbara McGillivray

For a complete list of titles published and in preparation for the series, see pp. 230–2
Quantitative Historical Linguistics A Corpus Framework
Gard B. Jenset and Barbara McGillivray
Great Clarendon Street, Oxford, ox2 6dp, United Kingdom Oxford University Press is a department of the University of Oxford. It furthers the University’s objective of excellence in research, scholarship, and education by publishing worldwide. Oxford is a registered trade mark of Oxford University Press in the UK and in certain other countries © Gard B. Jenset and Barbara McGillivray 2017 The moral rights of the authors have been asserted First Edition published in 2017 Impression: 1 All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, without the prior permission in writing of Oxford University Press, or as expressly permitted by law, by licence or under terms agreed with the appropriate reprographics rights organization. Enquiries concerning reproduction outside the scope of the above should be sent to the Rights Department, Oxford University Press, at the address above You must not circulate this work in any other form and you must impose this same condition on any acquirer Published in the United States of America by Oxford University Press 198 Madison Avenue, New York, NY 10016, United States of America British Library Cataloguing in Publication Data Data available Library of Congress Control Number: 2017933972 ISBN 978–0–19–871817–8 Printed and bound by CPI Group (UK) Ltd, Croydon, cr0 4yy Links to third party websites are provided by Oxford in good faith and for information only. Oxford disclaims any responsibility for the materials contained in any third party website referenced in this work.
Contents

Series preface
List of figures and tables
1 Methodological challenges in historical linguistics
  1.1 Aims of this book
  1.2 Context and motivation
    1.2.1 Empirical methods
    1.2.2 Models in historical linguistics
    1.2.3 A new pace
  1.3 Main claims
    1.3.1 The example-based approach
    1.3.2 The importance of corpus annotation
    1.3.3 Problems with certain quantitative analyses
    1.3.4 Problems with the research process
    1.3.5 Conceptual difficulties
  1.4 Can quantitative historical linguistics cross the chasm?
    1.4.1 Who uses new technology?
    1.4.2 One size does not fit all: the chasm
    1.4.3 Perils of the chasm
  1.5 A historical linguistics meta study
    1.5.1 An empirical baseline
    1.5.2 Quantitative historical research in
2 Foundations of the framework
  2.1 A new framework
    2.1.1 Scope
    2.1.2 Basic assumptions
    2.1.3 Definitions
  2.2 Principles
    2.2.1 Principle 1: Consensus
    2.2.2 Principle 2: Conclusions
    2.2.3 Principle 3: Almost any claim is possible
    2.2.4 Principle 4: Some claims are stronger than others
    2.2.5 Principle 5: Strong claims require strong evidence
    2.2.6 Principle 6: Possibly does not entail probably
    2.2.7 Principle 7: The weakest link
    2.2.8 Principle 8: Spell out quantities
    2.2.9 Principle 9: Trends should be modelled probabilistically
    2.2.10 Principle 10: Corpora are the prime source of quantitative evidence
    2.2.11 Principle 11: The crud factor
    2.2.12 Principle 12: Mind your stats
  2.3 Best practices and research infrastructure
    2.3.1 Divide and conquer: reproducible research
    2.3.2 Language resource standards and collaboration
    2.3.3 Reproducibility in historical linguistics research
    2.3.4 Historical linguistics and other disciplines
  2.4 Data-driven historical linguistics
    2.4.1 Corpus-based, corpus-driven, and data-driven approaches
    2.4.2 Data-driven approaches outside linguistics
    2.4.3 Data and theory
    2.4.4 Combining data and linguistic approaches
3 Corpora and quantitative methods in historical linguistics
  3.1 Introduction
  3.2 Early experiments
  3.3 A bad case of glottochronology
  3.4 The advent of electronic corpora
  3.5 Return of the numbers
  3.6 What’s in a number anyway?
  3.7 The case against numbers in historical linguistics
    3.7.1 Argumentation from convenience
    3.7.2 Argumentation from redundancy
    3.7.3 Argumentation from limitation of scope
    3.7.4 Argumentation from principle
    3.7.5 The pseudoscience argument
  3.8 Summary
4 Historical corpus annotation
  4.1 Content, structure, and context in historical texts
    4.1.1 The value of annotation
    4.1.2 Annotation and historical corpora
    4.1.3 Ways to annotate a historical corpus
  4.2 Annotation in practice
  4.3 Adding linguistic annotation to texts
    4.3.1 Annotation formats
    4.3.2 Levels of linguistic annotation
    4.3.3 Annotation schemes and standards
  4.4 Case study: a large-scale Latin corpus
  4.5 Challenges of historical corpus annotation
5 (Re)using resources for historical languages
  5.1 Historical languages and language resources
    5.1.1 Corpora and language resources
    5.1.2 Corpus-based and corpus-driven lexicons
  5.2 Beyond language resources
  5.3 Linking historical (language) data
    5.3.1 Linked data
    5.3.2 An example from the ALPINO Treebank
    5.3.3 Linked historical data
  5.4 Future directions
6 The role of numbers in historical linguistics
  6.1 The benefits of quantitative historical linguistics
    6.1.1 Reaching across to the majority
    6.1.2 The benefits of corpora
    6.1.3 The benefits of quantitative methods
    6.1.4 Numbers and the aims of historical linguistics
  6.2 Tackling complexity with multivariate techniques
  6.3 The rise of existential there in Middle English
    6.3.1 Data
    6.3.2 Exploration
    6.3.3 The choice of statistical technique
    6.3.4 Quantitative modelling
    6.3.5 Summary
7 A new methodology for quantitative historical linguistics
  7.1 The methodological framework
  7.2 Core steps of the research process
  7.3 Case study: verb morphology in early modern English
    7.3.1 Data
    7.3.2 Exploration
    7.3.3 The models
    7.3.4 Discussion
  7.4 Concluding remarks

References
Index
Series preface

Modern diachronic linguistics has important contacts with other subdisciplines, notably first-language acquisition, learnability theory, computational linguistics, sociolinguistics, and the traditional philological study of texts. It is now recognized in the wider field that diachronic linguistics can make a novel contribution to linguistic theory, to historical linguistics, and arguably to cognitive science more widely.

This series provides a forum for work in both diachronic and historical linguistics, including work on change in grammar, sound, and meaning within and across languages; synchronic studies of languages in the past; and descriptive histories of one or more languages. It is intended to reflect and encourage the links between these subjects and fields such as those mentioned above.

The goal of the series is to publish high-quality monographs and collections of papers in diachronic linguistics generally, i.e. studies focusing on change in linguistic structure, and/or change in grammars, which are also intended to make a contribution to linguistic theory, by developing and adopting a current theoretical model, by raising wider questions concerning the nature of language change or by developing theoretical connections with other areas of linguistics and cognitive science as listed above. There is no bias towards a particular language or language family, or towards a particular theoretical framework; work in all theoretical frameworks, and work based on the descriptive tradition of language typology, as well as quantitatively based work using theoretical ideas, also feature in the series.

Adam Ledgeway and Ian Roberts
University of Cambridge
List of figures and tables

Figures
1.1 Technology adoption life cycle modelled as a normal distribution (Moore, 1991)
1.2 Proportions of empirical studies appearing in Language (1960–2011)
1.3 MCA plot of the journals considered for the meta study and their attributes
1.4 The number of observations for various quantitative techniques in the selected studies, for LVC and other journals
2.1 Main elements of our framework for quantitative historical linguistics
3.1 Illustration of Moore’s law with selected corpora plotted on a base 10 logarithmic scale
3.2 Sizes of some selected corpora plotted on a base 10 logarithmic scale, over time
3.3 Log-linear regression model showing the relationship between the growth in computing power and the growth in corpus size for some selected corpora
3.4 Relative frequencies of linguistics terms every 1,000 instances of the word linguistics in the twentieth century taken from the BYU Google Corpus
4.1 Phrase-structure tree (left) and dependency tree (right) for Example (2)
4.2 The dependency tree of Example (3) from the Latin Dependency Treebank
5.1 Lexical entry for the verb impono from the lexicon for the Latin Dependency Treebank
5.2 Page containing information about the text of Chaucer’s Parson’s Tale from the Penn–Helsinki Parsed Corpus of Middle English
5.3 Part of the entry for Adriatic Sea in Pleiades
6.1 Geometric representation of Table 6.1 in a two-dimensional Cartesian space
6.2 Line that best fits the four points in Figure 6.1
6.3 Plot from MCA on the variables ‘construction’, ‘era’, ‘preverb’, ‘sp’, and ‘class’
6.4 Graph showing the shift in relative frequencies of existential there and empty existential subjects during the Middle English period
6.5 Distribution of V1 and V2 word-order patterns
6.6 Box-and-whiskers plot of conditional probabilities of elements following existential there and empty existential subjects
6.7 Box-and-whiskers plot of the maximum degree of embedded (phrase-structure) elements for sentences with there and empty existential subjects
6.8 Maximum degree of embedding for all sentences in the sample over time, with added non-parametric regression line
6.9 Bar plot of counts of existential subjects by genre
6.10 Bar plot of counts of existential subjects by dialect
6.11 Binned residuals plot of the logistic regression model, indicating acceptable fit to the data
7.1 Plot showing the shifting probabilities over time between -(e)s and -(e)th in the context of third person singular present tense verbs
7.2 Plots of the trends of lemma frequency over time for verb forms occurring with -(e)s and -(e)th
7.3 MCA plot of suffix, corpus sub-period, gender, and phonological context
7.4 Binned residuals plot for the mixed-effects logistic regression model described in Example (2)
7.5 Binned residuals plot for the mixed-effects logistic regression model described in Example (3)
7.6 Binned residuals plot for the mixed-effects logistic regression model described in Example (4)

Tables
1.1 Classification of sample papers according to whether they are corpus-based/quantitative
1.2 Classification of papers from Language (2012) according to whether they are corpus-based/quantitative
1.3 Confidence intervals (95%) for the percentage of quantitative papers in Language (2012) and the historical sample
1.4 Classification of sampled papers according to whether they are corpus-based/quantitative (excluding LVC)
4.1 The first four lines of Virgil’s Aeneid in tabular format, where each row corresponds to a line
4.2 Example of bibliographical information on a hypothetical collection of texts in tabular format
4.3 Example of metadata and linguistic information encoded for the first three word tokens of Virgil’s Aeneid
6.1 Example of a data set recording the century of the texts in which prefixed verbs were observed, and the proportion of their spatial arguments expressed as a PP out of all their spatial arguments
6.2 Subset of data frame used for study on Latin preverbs in McGillivray (2013)
6.3 Frequencies of there1 and Ø according to dialect in Middle English
6.4 Frequencies of there1 and Ø according to genre in Middle English
6.5 Coefficients for the binary logistic regression model showing the log odds ratio for switching from there1 to Ø
7.1 Part of the metadata extracted from the PPCEME documentation
7.2 Part of the data extracted from PPCEME
7.3 Frequencies of verb tokens in the sample from texts produced by female and male writers, broken down by corpus sub-period
7.4 Summary of fixed effects from the mixed-effects logistic regression model for E2 described in Example (3)
7.5 Summary of predictors from the mixed-effects logistic regression model for E3 described in Example (4)
1 Methodological challenges in historical linguistics

1.1 Aims of this book

The principal aims of this book are to introduce the framework for quantitative historical linguistics, and to provide some examples of how this framework can be applied in research. Ours is a framework and not a ‘theory’ in any of the senses commonly used in historical linguistics. For example, we do not take a position in favour of a particular formalism for corpus annotation, nor do we offer answers to metaphysical questions such as ‘what is language?’ or ‘how is it learned?’. What we are interested in is how corpus data can be employed to gather evidence that we can analyse quantitatively to model various historical linguistics phenomena. To this end, we set out principles for the research process as a whole. Ultimately, the aim of quantitative historical linguistics is to make it easier to settle disputes in historical linguistics by means of quantitative corpus evidence, and so to advance the field as a whole. The more concrete desirable outcome is the increased use of quantitative corpus methods in historical linguistics through the adoption of a systematic methodological framework.

Because the present book is about methodology in historical linguistics, it does not primarily explain specific, individual methods, but discusses the relationship between corpus data, aims, methods, and ways of doing research in historical linguistics. More specifically, given some desirable outcomes, we discuss the necessary steps that should be taken to achieve those outcomes (Andersen and Hepburn, 2015). There are three focal points in this discussion:

(i) Why should historical linguistics adopt quantitative corpus methods to a larger extent?
(ii) What are the obstacles to a more widespread adoption of such methods?
(iii) How ought such methods to be used in historical linguistics?
The first two points are addressed in the present chapter and in the next one, and set out the context for the original contribution of this publication; the last point is the focus of our framework, and is dealt with throughout the book.

Quantitative Historical Linguistics. First edition. Gard B. Jenset and Barbara McGillivray. © Gard B. Jenset and Barbara McGillivray 2017. First published 2017 by Oxford University Press.
1.2 Context and motivation

From what we have said so far it should be clear that this book is not an introduction to corpus linguistics, nor is it an introduction to quantitative techniques. There are already some very good introductions to corpus linguistics in print, such as McEnery and Wilson (2001), Gries (2009b), and McEnery and Hardie (2012). There are also good books introducing quantitative techniques to linguists, including Baayen (2008) and Johnson (2008). Our position is that core corpus linguistics concepts such as collocations, concordances, and frequency lists can be taught without necessarily referring to historical data and still be transposed to historical linguistics. Likewise, statistical techniques such as null-hypothesis testing, regression modelling, or correspondence analysis (CA) can be explained and illustrated using synchronic data just as well as with historical data.

So if corpus linguistics and quantitative techniques can be taught without specific reference to historical linguistics, is there a need for a quantitative corpus historical linguistics methodology? We believe that to be the case, as we explain here.

Historical linguistics is an endeavour that is highly data-centric, as Labov (1972, 100) observed when he described historical linguistics as making the best use of ‘bad data’, i.e. imperfect pieces of evidence riddled with gaps. We also agree with Rydén (1980, 38) that the ‘study of the past [. . .] must be basically empirical’, and with Fischer (2004, 57) that ‘[t]he historical linguist has only one firm knowledge base and that is the historical documents’. Moreover, we subscribe to what Penke and Rosenbach (2007b, 1) write: ‘nowadays most linguists will probably agree that linguistics is indeed an empirical science’, and the thorny questions are instead what kind of evidence ought to be used, and how it ought to be used.
In spite of the high-level awareness of historical linguistics as data-focused, quantitative corpus methods are still underused and often misused in historical linguistics, and an overarching methodological structure inside which to place such methods is missing, as we illustrate in sections 1.3 and 1.5. We believe that the question of what it means for historical linguistics to be empirical (in the corpus-driven quantitative sense that we define in our framework) is much less clear, as Penke and Rosenbach (2007b) acknowledge is also the case for linguistics in general. With the additional challenges faced by the special nature of historical language data, the concern with methodological development should certainly not be lesser in historical linguistics than in other linguistic disciplines. Therefore, the most pressing gap to fill is not a book introducing corpus methods or statistical techniques to historical linguists, but a book that tackles what it means to be empirical in historical linguistics research, and how to go about doing it. That is precisely what we want to achieve with the present book.
1.2.1 Empirical methods

The term ‘empirical’ is of use to us to the extent that practices covered by it will improve the precision of professional linguistic communication regarding data, evidence, hypotheses, and quantitative models in historical linguistics. Penke and Rosenbach (2007b, 3–9) show how the term ‘empirical’ is used to mean very different things in linguistics, including testing (i.e. attempting to falsify) hypotheses, rational enquiry by means of counter-evidence, as well as data-driven approaches that may rely on qualitative or quantitative evidence. We agree with Penke and Rosenbach (2007b, 4–5) that a strict Popperian falsificationist definition of empirical research (with the requirement that it collects data that can falsify a hypothesis or theory) is problematic, since it quickly runs into grey areas of the kind ‘exactly how many counter-examples does it take to falsify the hypothesis?’. Instead, we argue that a distinction conceptualized as a probabilistic continuum, where individual pieces of evidence can increase or reduce support for a given hypothesis, is more useful. Such a probabilistic approach is transparent to the extent that the data forming the basis for the continuum are objectively verifiable. For the same reason, we consider approaches based exclusively on intuitions about acceptability or grammaticality to be less useful, since what constitutes sound empirical proof of grammaticality is subject to individual judgements that vary greatly.

For the purposes of the present book, what it means to be ‘empirical’ in historical linguistics is thus a matter of transparency and objective verifiability. This is related to the point made by Geeraerts (2006) who argues that empirical methods are needed to decide between competing conclusions in linguistics. The ideal of transparency and objectivity can in principle be approached either by means of a categorical argument or a probabilistic one.
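The idea of a probabilistic continuum, in which each piece of evidence raises or lowers support for a hypothesis, can be illustrated with a small sketch. The sketch below uses Bayes' rule in odds form, which is only one possible way to formalize the idea and is an assumption of this illustration rather than a method prescribed in the text. The function name `update`, the evidence labels, and all numbers are hypothetical.

```python
# Toy illustration: support for a hypothesis H expressed as a probability
# that individual pieces of evidence raise or lower (Bayes' rule in odds form).
# All labels and numbers are invented for illustration.

def update(prior: float, likelihood_ratio: float) -> float:
    """Return P(H | evidence), given P(H) and the ratio
    P(evidence | H) / P(evidence | not H)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical pieces of corpus evidence: ratios above 1 favour H,
# ratios below 1 count against it.
evidence = [
    ("attestation in text A", 3.0),
    ("counter-example in text B", 0.5),
    ("attestation in text C", 2.0),
]

support = 0.5  # start undecided
for label, ratio in evidence:
    support = update(support, ratio)
    print(f"after {label}: P(H) = {support:.3f}")
```

In this sketch a single counter-example lowers the support for the hypothesis without falsifying it outright, which is the contrast with a strict falsificationist reading drawn above.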
In their discussion of how to set up a linguistic argument, Beavers and Sells (2014) point out that at the end of the day linguistic argumentation is about classification: is item x an instance of morpheme/phoneme/construction/etc. y or some other morpheme/phoneme/ construction/etc. z? This is a prime example of categorical argumentation. In categorical terms, an item x cannot partially belong to a class, or belong to it to some degree. This contrasts with a probabilistic approach where arguments based on probabilities derived from e.g. corpus frequencies can be used to establish a graded classification scheme whereby x is an instance of y with a given probability. Probabilistic approaches have become increasingly popular, especially in the computational linguistics community; for example, Lau et al. (2015) describe unsupervised language models for predicting human acceptability judgements in a probabilistic way, and argue: ‘it is reasonable to suggest that humans represent linguistic knowledge as a probabilistic, rather than as a binary system. Probability distributions provide a natural explanation of the gradience that characterises acceptability judgements. Gradience is intrinsic
to probability distributions, and to the acceptability scores that we derive from these distributions’ (Lau et al., 2015, 1619). The opposition between categorical and probabilistic approaches corresponds in many ways to the distinction between a classical category structure based on necessary and sufficient features, and a category structure based on degrees of resemblance to a prototype, as discussed in Croft and Cruse (2004, 76–81). To some extent, a qualitative approach is necessary, especially when dealing with the clear, central cases. For instance, the morphemes in or at can clearly function as prepositions. However, the marginal cases such as concerning, regarding, following, given are more difficult to place. Should they be considered as prepositions in some cases, or not? Or only to some degree? A probabilistic approach might answer the question differently by stating that some morphemes occur more often than others in certain grammatical contexts, allowing us to establish a probabilistic membership of the class. It should be clear that such a probabilistic approach to description and classification does not (and is not intended to) completely do away with qualitative linguistic judgements. For instance, how to decide what counts as a grammatical context? A strictly probabilistic approach might run the risk of descending into an infinite regression problem of probability estimates that rely on other probability estimates, without any clear starting point for practical investigation of the phenomena of interest. Therefore, we are content to take as axiomatic certain statements about language and the conceptual framework for analysing language. At first glance, this might seem like a half-way solution at best; at worst it may suggest quantitative methods as a form of freeloading. 
Or as Campbell (2013, 484) phrases it: quantitative methods appear ‘to involve methods that depend on the results of the prior application of linguistic methods, made to masquerade as numbers and algorithms’. However, this view is far too negative, and grossly overstates the differences between quantitative and qualitative models in linguistics, as we will explain in the next section.

1.2.2 Models in historical linguistics

A model is a representation, and any kind of linguistics is about creating models. Zuidema and de Boer (2014) argue that, although all kinds of linguistics involve modelling of some kind, the nature of the models differs. The model might be a representation of a genealogical relationship between languages, or it might represent a particular part of a grammatical system. Zuidema and de Boer (2014) discuss four main types: symbolic models, statistical models, memory-based models, and connectionist models. We will only discuss the first two here, since they are of particular interest in the context of our framework. The key differences between symbolic models and statistical models are how they deal with variation and complexity. In a symbolic model, such as the phrase-structure
tree in Example (1), no reference is typically given to how many times the individual parts occur together in a corpus. The model operates with discrete, qualitative categories (such as S, VP, and NP), and the model provides the rules to connect the categories in specific ways.

(1) [S [NP Trees] [VP [V are] [NP symbolic models]]]

As Zuidema and de Boer (2014) point out, such symbolic models tend to be vulnerable to linguistic variation and performance factors. A statistical model, on the other hand, is crucially reliant on quantitative information about how often combinations of words, categories, or features are found. Since statistical models by default assume a certain amount of variation in the data, they are very well equipped to deal with variation, and they are uniquely able to disentangle very complex patterns of probabilistic dependence between categories or features. This is particularly suited to the case of corpus data, which always contain a frequency or quantitative dimension. However, a purely statistical model may struggle with other types of complexity. Zuidema and de Boer (2014) mention long-distance syntactic dependencies as one example.

As Zuidema and de Boer (2014) point out, when the two types of models are combined, they can complement each other by allowing a probabilistic analysis that builds on the symbolic model. Manning (2003) discusses one way to build the statistical modelling on the symbolic model. For instance, rather than adhering to a hard distinction between different argument patterns for verbs, Manning (2003, 303) gives the example of representing the different subcategorization patterns for the English verb retire as probabilities like this:

(2) P(NP[obj] | V = retire) = 0.52
    P(PP[from] | V = retire) = 0.05

The annotation expresses that with the verb retire there is a probability of 0.52 of encountering an NP functioning as an object, and a probability of 0.05 of encountering a PP headed by the preposition from; this way, we do not have to choose only one option for the argument patterns of this verb. This annotation keeps the same symbolic (or qualitative) categories as the one above (NP, V, PP), but uses probabilities to encode the relations between them.
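Conditional probabilities of this kind are typically estimated as relative frequencies from an annotated corpus. The following is a minimal sketch of that estimation step, not Manning's procedure. The counts are invented and chosen only so that the two frames from Example (2) come out at 0.52 and 0.05; the `other` category and the function name `frame_probabilities` are hypothetical.

```python
from collections import Counter

# Hypothetical frame counts for a single verb, as they might be extracted
# from a syntactically annotated corpus. The counts are invented; they are
# chosen so that the two frames from Example (2) yield 0.52 and 0.05.
frame_counts = Counter({
    "NP[obj]": 52,
    "PP[from]": 5,
    "other": 43,  # hypothetical catch-all for every remaining frame
})

def frame_probabilities(counts: Counter) -> dict[str, float]:
    """Relative-frequency estimate of P(frame | verb)."""
    total = sum(counts.values())
    return {frame: n / total for frame, n in counts.items()}

probs = frame_probabilities(frame_counts)
for frame, p in probs.items():
    print(f"P({frame} | V = retire) = {p:.2f}")
```

The symbolic categories (the frame labels) are taken as given, and only the relation between verb and frame is expressed as a probability, which mirrors the division of labour described above.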
Alternatively, the statistical modelling may take the form of a statistical analysis of frequency information derived from a collection of symbolic models, without the intention of feeding the probabilities back into the grammatical model as was done in Example (2). A typical instance of this approach is the statistical analysis of annotated corpora that are enriched with part-of-speech information or syntactic annotation, in order to draw conclusions about usage, grammar, or language change. Clearly, the scepticism expressed by Campbell (2013, 484) about quantitative models being qualitative models ‘masquerading as numbers’ is not warranted. On the contrary: investigating the same phenomenon by means of different types of models (what Zuidema and de Boer (2014) call ‘model parallelization’) can lead to rich new insights that combine the best qualities of both types of models. Thus, there is no real opposition between qualitative (or symbolic) models and quantitative models. The real question is how to achieve this in practice, as we discuss next.

1.2.3 A new pace

Although there certainly are concrete challenges in building corpora and adopting specific quantitative methods, we believe that the main obstacle is not concrete. In a discussion about the French eighteenth-century scholar Pierre Louis Maupertuis (who formulated a theory stating that material particles from the seminal fluids of both the mother and the father were responsible for forming the foetus), Gould (1985, 151) makes the following observation:

We often think, naïvely, that missing data are the primary impediments to intellectual progress—just find the right facts and all problems will dissipate. But barriers are often deeper and more abstract in thought. We must have access to the right metaphor, not only to the requisite information. Revolutionary thinkers are not, primarily, gatherers of facts, but weavers of new intellectual structures.
Ultimately, Maupertuis failed because his age had not yet developed a dominant metaphor of our own time-coded instructions as the precursor to material complexity.
This quote very effectively stresses the important role of metaphors in preparing the ground for true innovations in a field. Returning to historical linguistics, we believe that the availability of historical corpora and statistical techniques alone is insufficient to achieve the methodological shift that we propose here. What is required is just as much a conceptual change of pace, whereby linguistic problems are reformulated as complex interplays of factors that can be addressed quantitatively by means of corpus data. Such a reconceptualization has a knock-on effect in terms of what we consider as data and evidence, as well as the status of theoretical concepts. This is why the present treatment goes beyond a collection of best practices for doing historical corpus linguistics, although such advice is also discussed both in the present and in the subsequent chapters.
Some might argue that the change we are discussing here is unnecessary, since historical linguistics is already making use of corpora and quantitative techniques. After all, Labov (1972) commended historical linguists for what he considered their superior methodological rigour compared to synchronic linguists. It might also be argued that the change is already well under way, and that corpus methods and quantitative techniques are becoming more important in the historical linguist’s toolbox. Hilpert and Gries (2009) state that large corpora are increasingly being used in historical linguistics; and with growing corpus size comes the need for statistical techniques to handle large and complex data.

The first argument raises an empirical question about the present: to what extent are historical linguists already using quantitative techniques and corpus methods, and are they using them more or less than some relevant level of comparison? This is a question we return to in more detail in Chapter 3, along with a discussion of how quantitative methods have been used in historical linguistics previously.

The second argument, that the change we are advocating is already well under way, is more subtle, since it is in fact a prediction. It assumes that we can observe some changes and that those changes will continue until their natural completion. However, as with any prediction, the result is only as good as the assumptions it builds on. In this case, the assumption that the adoption of a specific set of technologies (corpus methods and quantitative techniques) will continue at the present rate may not be warranted. In section 1.4 we discuss some of the dynamics involved with the adoption of new technologies, which we will argue also apply in the case of quantitative historical linguistics. Of course, the conceptual difficulties should not completely overshadow the practical obstacles involved in doing quantitative historical linguistics.
However, the distinction can sometimes be hard to make. This is the reason for our efforts in compiling a proper methodology which constitutes a framework within which to discuss these matters. Specifically, sections 2.1, 2.2, and 2.3 set out a series of definitions, principles, and best practices for quantitative historical linguistics. With the fundamentals we set out acting as a common ground, the impetus for solving the practical obstacles becomes all the stronger. In summary, there is a real need for a methodological treatment of quantitative corpus methodology in historical linguistics that sketches the place of such methods in the broader historical linguistics landscape, and that offers a link between the more conceptual level and the concrete computational and quantitative techniques taught in general courses for linguists. The present book takes on this challenge by first acknowledging the conceptual hurdles represented by a required shift in thinking as much as in doing. In the spirit of Gould (1985), we take seriously the need for appropriate metaphors to help make concrete the changes involved.
Methodological challenges in historical linguistics
In addition to the metaphor already mentioned above, namely seeing the spread of quantitative corpus methods in historical linguistics as analogous to a technology adoption process (further discussed in section 1.4), we see the following as some of the governing metaphors of the approach we propose here. We do not claim uniqueness or novelty in conceptualizing language via the metaphors below, but we do consider them to be central to our approach:
• language change phenomena as outcomes modelled by a set of predictors;
• language data as multidimensional;
• historical linguistics as a humanistic field that not only analyses the particular, but also looks for patterns and extends these to include probabilistic patterns.
This chapter will add more meat to the bone of the suggested methodological approach. However, before that is discussed, the next section will elaborate on some of the main claims involved in our argument.
1.3 Main claims
This section highlights the methodological gaps in historical linguistics and how our proposal addresses them.
1.3.1 The example-based approach
As shown by the evidence we have collected and which we will illustrate in section 1.5.2, historical linguistics generally does not make full use of corpora. This is not to say that research in historical linguistics disregards primary sources of evidence, nor that there is not a growing number of exceptions to this statement. However, historical linguistics is still far from considering corpora as the default or preferred source of evidence. Not using corpora is justified in a limited number of circumstances. In some cases, for example, the only evidence sources for a historical language are so limited that it is not possible to build a corpus; examples include languages not attested in written form (like Proto-Indo-European), or languages for which we only have access to an extremely limited number of fragments. Apart from such particular instances, corpora should be built for historical languages and diachronic phenomena, when they are not already available, and should be used as an integral part of the research process. In the literature review reported on in section 1.5.2 we will observe that the proportion of historical linguistics research articles employing corpus data is lower than in general linguistics. When texts or corpora are the source of evidence, we often enter the realm of example-based approaches. Example-based approaches do not aim at an exhaustive account of the data and can be suitable to show whether or not a particular construction or form is attested, which is in line with a qualitative view of language. However, if we want to quantify how much a particular
form or construction is used, we need to resort to a larger pool of data that have been collected systematically. As we discuss in Chapter 6, such a quantitative approach may (but need not) be coupled with a deeper view of language as inherently probabilistic. In any case, the example-based approach is not appropriate as a basis for probabilistic conclusions about language and comes with a full range of problems, which we discuss in this section. Let us consider Rovai (2012) as a methodological case study; this article is very clear and detailed, and we will use it as an illustration of the example-based methodology. The paper analyses Latin gender doublets, i.e. those nouns that occur as both neuter and masculine nouns. To support his statements, the author lists illustrative examples (97–100), such as:
Corium ‘skin’ is currently attested as a thematic neuter at all stages of the languages, but in Plautus’ plays (e.g. Poen. 139: ∼ 197 bc) and in Varro’s Menippeae (Men. 135: 80–60 bc) there also occurs the masculine gender
The examples are taken from a canonical body of texts, whose critical editions are listed in the bibliography. However, it is not clear how the author selected the examples provided. This matters all the more when the examples reported in the research publication are not meant for illustration purposes only, but are the object of the analysis itself. We do not know whether the author left out occurrences that contradict the hypothesis, which brings with it the risk of so-called ‘confirmation bias’ (see Risen and Gilovich, 2007, 110–30 and Kroeber and Chrétien, 1937). Generally, the lack of transparency on the selection criteria for the examples presented has negative implications for the replicability of the studies. If another researcher were to go through the same texts, due to the lack of clear selection criteria, he or she would probably choose a different set of examples, and potentially reach different conclusions. When the examples constitute the main basis of the argumentation, and no more details about the rest of the evidence are given, the research conclusions themselves may rest on unstable grounds. Another problem with the example-based approach is that it limits the range of questions that can be addressed in the research task. In fact, by not explicitly stating the total number of words or instances from which the examples were drawn, this approach cannot give a good sense of the quantitative value of the phenomena illustrated, and cannot draw quantitative generalizations beyond the examples given. The role of examples is limited to providing evidence that a linguistic phenomenon is attested or not. Questions relating to the variation in the data, like ‘How many times is corium attested as a thematic neuter?’ or ‘How many times does corium occur as a masculine noun in Plautus and Varro?’ cannot be answered by an example-based methodology. Another case of the example-based approach is Bentein (2012).
The main object of study here is the function of the periphrastic perfect in Ancient Greek according
to the period and according to the discourse primitives as theorized in the mental spaces theory. His evidence basis consists of 784 examples taken from previous studies. Although the author says that ‘[t]aken together, these studies comprise a large part of Ancient Greek literature, both prose and poetry’ (175), it is not clear which texts he did not analyse nor how many instances he selected them from, which makes it impossible to place the data into their correct quantitative context. The example-based approach raises further issues. As we will see more extensively in Chapter 6, the example-based approach does not allow for a quantitative analysis, or, when it does, it typically has too few data to obtain statistically significant results and large enough effects. This is accompanied by a lack of formal hypothesis testing, as we will motivate further in section 1.3.3. Moreover, analyses from example-based studies are not easily reproducible. Negative evidence for an argument is as critical as positive evidence. Which factors were considered, which ones were found to be important, and which ones were not found important? Also, this approach allows the researcher to perform the analysis of the published examples on an ad hoc basis, according to criteria that vary depending on the specific examples being analysed. An example may be used to show the relevance of a particular feature (say, animacy for word order) and another one to demonstrate another feature (say, the case of the object), but we are not given a full overview of all the relevant features for all examples. This is what we call the practice of ‘post hoc analysis’, and we will explain it further in section 1.3.3.
1.3.2 The importance of corpus annotation
García García (2000, 121) says: ‘[a]n exhaustive analysis of any linguistic issue in a corpus language should be based, ideally, on a study of all the available texts in that language or that period of the language [. . .]
This is obviously a task that exceeds the possibilities of any individual. Therefore, any feasible study must necessarily be based on a limited and therefore incomplete corpus.’ This claim is justified if we assume that the data need to be collected and analysed manually, which is not the only way, as we discuss in this section. When corpora are available and when the phenomena studied fall into the scope discussed in section 2.1.1, corpora should constitute the source of linguistic data, and larger corpora should be preferred to smaller corpora, all other things being equal. Fortunately, it is not necessary to analyse all corpus data if the corpus has been annotated. Let us consider the example of a study on word order in Latin. Word-order change is a complex phenomenon where morphological, syntactic, semantic, and pragmatic factors play a role. Let us assume that our study focuses on morphosyntactic aspects of word-order change. For this purpose, a morphosyntactically annotated corpus (treebank) is the ideal evidence source (for an illustration of treebanks, see section 4.3.1). The example-based approach would imply analysing a set of texts to identify, for
example, the different word-order patterns used (SVO, OVS, etc.). Instead, Passarotti et al. (2015) systematically used the data from the Latin Dependency Treebank and the Index Thomisticus Treebank (via the Latin Dependency Treebank Valency Lexicon and the Index Thomisticus Treebank Valency lexicon, see McGillivray 2013, 31–60) to automatically retrieve such patterns, together with the metadata information relative to the authors of the texts where each pattern was observed. After the phase of data extraction from the corpus sources, the authors carried out a quantitative analysis of the distribution of every word-order pattern by author, identifying a trend that has a diachronic component and a genre component. Passarotti et al. (2015) kept the phase of data collection and the phase of data analysis completely separate, as the data were collected from corpora that had been annotated by independent research projects. This has the advantage of eliminating the bias that could arise when the researcher aims at proving a particular theoretical statement and may unconsciously select examples that support that statement. Because the authors conducted a systematic analysis of all available corpus data from the treebanks, there was no option to analyse only specific examples. Also, the presence of the annotation meant that they could use a much larger evidence base than they would have used if they had had to manually analyse every instance. If we search a corpus that has not been annotated, our search may have low precision, because we may find a large number of irrelevant instances. Imagine that we are interested in the uses of the English determiner that. If we search a corpus for the string ‘that’, we will find a high number of occurrences of conjunctions. If we have a corpus annotated by part of speech, however, we can limit our searches to include only determiners whose lemma is ‘that’ and avoid a very time-consuming manual postselection. 
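The precision problem described above can be made concrete with a minimal sketch. The three toy sentences, the tuple layout (word form, lemma, part-of-speech tag), and the tag labels `DET` and `SCONJ` are all assumptions made for illustration; they do not come from any of the corpora discussed here.

```python
# Toy corpus: each token is (word form, lemma, part-of-speech tag).
# The sentences and the tag labels are hypothetical, for illustration only.
TAGGED_CORPUS = [
    [("I", "I", "PRON"), ("know", "know", "VERB"), ("that", "that", "SCONJ"),
     ("she", "she", "PRON"), ("left", "leave", "VERB")],
    [("That", "that", "DET"), ("train", "train", "NOUN"), ("was", "be", "AUX"),
     ("delayed", "delay", "VERB")],
    [("They", "they", "PRON"), ("said", "say", "VERB"), ("that", "that", "SCONJ"),
     ("that", "that", "DET"), ("book", "book", "NOUN"), ("sold", "sell", "VERB")],
]


def string_search(corpus, form):
    """Positions (sentence, token) of every case-insensitive match of a word form."""
    return [(s, t) for s, sent in enumerate(corpus)
            for t, (word, _, _) in enumerate(sent) if word.lower() == form]


def annotated_search(corpus, lemma, pos):
    """Positions (sentence, token) of tokens with the given lemma and POS tag."""
    return [(s, t) for s, sent in enumerate(corpus)
            for t, (_, lem, tag) in enumerate(sent) if lem == lemma and tag == pos]


raw_hits = string_search(TAGGED_CORPUS, "that")
det_hits = annotated_search(TAGGED_CORPUS, "that", "DET")

# Precision of the raw string search with respect to the determiner reading.
precision = len(set(raw_hits) & set(det_hits)) / len(raw_hits)
print(f"raw hits: {len(raw_hits)}, determiners: {len(det_hits)}, precision: {precision:.2f}")
```

In this toy setting half of the string matches are conjunctions, which is the manual post-selection burden that the annotation removes.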
Another risk in using an unannotated corpus concerns low recall. Imagine that we want to identify relative clauses not introduced by relative pronouns, as in the train they took was delayed. A corpus that annotated clause type would make it easy for us to obtain those instances; conversely, if the corpus does not have this kind of annotation, it is very difficult to find the relevant patterns. Another advantage of using annotated corpora has to do with the research methodology, particularly the distinction between the annotation phase and the analysis phase, and the relationship between annotation and linguistic theoretical framework, as we discuss more extensively in section 2.4.3. Let us consider the case of verbal phrases in Old English. The York–Helsinki Parsed Corpus of Old English (Pintzuk and Plug, 2002) annotates a number of linguistic features, but does not annotate verb phrases (VPs) specifically, for a number of reasons, including the fact that the boundaries of VPs in Old English are still disputed. If we were interested in using this corpus to further the research on VPs in Old English, we could use the annotation of the corpus to investigate the elements that define VPs, so as to obtain a corpus-based distributional definition of VPs. This way, the corpus analysis could support
the definition of VPs themselves, thus leading to empirically informed theoretical statements.
1.3.3 Problems with certain quantitative analyses
We have talked about the problems with manually collecting and analysing examples for a study on historical linguistics. Even when a corpus is used, there can be methodological problems in the subsequent phase, the analysis of the data. This book addresses this point in detail in Chapter 6. Here we will summarize two main aspects: the use of raw frequency counts and the practice of what we called ‘post hoc analysis’.
Letting numbers speak for themselves
In the literature review we present in section 1.5 we will see that, in the cases where quantitative evidence is used in the historical linguistics publications we examined, there is a large variability in the statistical techniques used, ranging from simple interpretation of raw frequency counts or percentages, to null-hypothesis testing and multivariate statistical techniques. This highlights a lack of standardization and best practices on which techniques are best suited to study the particular phenomenon at hand, and we will cover this in more detail in Chapter 6. Here, we will focus on the problems caused by the practice of using raw frequencies and ‘letting the numbers speak for themselves’. Let us consider the example of Bentein (2012, 186–187). After introducing the previous literature and his theoretical framework, and after analysing a series of examples, the author introduces some quantitative data in terms of frequency counts of Ancient Greek periphrastic perfect forms, broken down by author and person/number features. He uses the raw frequency counts to argue for ‘a general increase of the periphrastic perfect’, which ‘must have been – at least partially – morpho-phonologically motivated’. The frequency data are summarized as follows: ‘almost all examples occur with the 3sg/pl’.
It is not at all clear how such a diachronic trend was detected, since the frequency counts presented do not even follow a monotonic distribution; moreover, the author gives no indication of the relevance of those forms with respect to the overall amount of data available for each author, making it impossible to assess the raw frequencies in any meaningful way. As for the predominance of third person singular or plural, the statement seems to be purely based on the raw frequencies as well. In other words, letting the raw frequencies ‘speak for themselves’ is problematic, as we further explain below. Let us take the example of McGillivray (2013, 57), who collected the frequencies of the Latin word-order patterns VO and OV in two corpus-driven lexicons, one based on classical Latin authors and one based on St Thomas’s and St Jerome’s texts. OV has a higher frequency than VO in the classical data set (152 vs 52 occurrences) and VO is more frequent than OV in the later age data set (107 vs 38). A simple inspection of the raw frequencies would lead us to conclude that OV is preferred by the classical authors and VO by the later authors. However, we may not have enough
data to rule out that the differences are due to chance. The principled way to settle this is by performing a statistical significance test. The author used Pearson’s chi-square test (illustrated in section 6.3.3) and found a significant result,¹ which points to a difference between the two groups of authors with respect to their choice of word-order pattern; more precisely, the probability of finding frequencies at least as extreme as those observed, under the assumption that the two variables (author group and word-order pattern) are independent, is less than 1 per cent. After we have established whether the association between the two variables is statistically significant, it is important to consider the size of the detected difference, the effect size, since high frequencies tend to magnify small deviations (Mosteller, 1968). Effect sizes provide a standardized measure of how large the detected difference is. In the case of Latin word-order patterns mentioned above, the author found a large effect size,² which justifies the conclusion that the two groups of authors have indeed very different preferences for word-order patterns. Adger (2015, 133) presents a special case of the argument that numbers should speak for themselves.³ He starts by quoting Cohen’s (1988) rule of thumb that a large effect size is one that can be identified with the naked eye. However, he then commits a logical fallacy when he conflates the estimation of the size of an effect with the problem of establishing whether or not we are faced with a meaningful difference or correlation. To speak of a ‘large effect’ implies that we have enough data to speak of such an effect to begin with. This is precisely the purpose of statistical testing, and only after this step is it meaningful to speak of effect size. As Adger (2015, 133) puts it: ‘most syntacticians feel justified in not subjecting [data] to statistical testing’. We cannot help but conclude that such confidence is misplaced.
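The significance test and effect size discussed above for the Latin word-order counts (152 OV vs 52 VO in the classical data set, 38 OV vs 107 VO in the later one) can be sketched in a few lines. This is a minimal illustration, not the original analysis: the function name `chi_square_2x2` is hypothetical, and the use of Yates' continuity correction is an assumption, made because the corrected statistic agrees with the reported value of 77.79 to two decimals. The ϕ value is derived here from the same corrected statistic and comes out at about 0.47, close to the reported figure; conventions for computing ϕ differ.

```python
import math


def chi_square_2x2(table, yates=True):
    """Pearson chi-square test and phi coefficient for a 2x2 table of counts.

    `yates=True` applies Yates' continuity correction (an assumption here).
    Returns (chi2, p_value, phi).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    chi2 = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    # Phi: effect size for a 2x2 table, here based on the statistic above.
    phi = math.sqrt(chi2 / n)
    return chi2, p_value, phi


# Rows: classical authors, later authors; columns: OV, VO.
counts = [[152, 52], [38, 107]]
chi2, p_value, phi = chi_square_2x2(counts)
print(f"chi2(1) = {chi2:.2f}, p = {p_value:.2e}, phi = {phi:.3f}")
```

The order of the two steps mirrors the argument in the text: the test first establishes that the association is unlikely under independence, and only then is the effect size read as a measure of how strong the association is.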
Dealing with linguistic complexity
The second problem affecting quantitative analyses that we will examine here is the tendency towards what we call ‘post hoc analysis’, which is related to the example-based approach covered in section 1.3.1. The post hoc analysis consists in collecting occurrence counts of a phenomenon in a set of texts or a corpus, and then focusing the analysis on specific examples drawn from this evidence basis, highlighting the role played by certain variables, which are analysed in a nonsystematic way and are introduced after the data collection. This approach attempts to account for the multidimensional nature of the phenomenon at hand, but it does so without employing techniques from multivariate statistical analysis (see section 6.2 for more details). This is an instance of the search for particular elements, as opposed to recurrent, general patterns (see discussion in section 1.3.5). For instance, we may say that in a particular example the choice of word-order pattern seems to be related to a particular grammatical case of the object, and explain why this is the case based
¹ $\chi^2(1) = 77.79$, $p < 0.01$.
² $\phi = 0.474$.
³ We are grateful to Kristian Rusten for bringing this publication to our attention.
on our theoretical statements. Then, we may argue for the role played by semantics by illustrating it with an example showing a particular semantic class of the subject. For example, in Bentein (2012, 187–8) the author introduces the following post hoc variables to analyse periphrastic perfects in fifth-century classical Greek: passive voice, object-oriented and resultative nature, and telicity of the verbs. As part of the discursive description of the examples, the author adds some quantitative details in a footnote on page 188. Such details further specify the statement ‘especially in Sophocles and Euripides one can find relatively more subject-oriented resultatives than in the historians’ (Bentein, 2012, 187–8). The author provides frequencies and percentages of active vs medio-passive forms in poetry and in prose, but does not test the statistical significance of such effects, nor their size. Next, the author introduces the placement of temporal/locational adverbials in the verbal group. However, the role of this variable is not measured, and only a few examples are given. This is a missed opportunity to add a quantitative dimension to the analysis. Similarly, the author argues for the diachronic shift from resultative perfect to anterior perfect. However, the argumentation stands on underspecified quantitative statements like ‘the active transitive perfect (with an anterior meaning) is indeed rather uncommon in fifth-century writers’ (Bentein, 2012, 189). Phrases like ‘various examples’, ‘several examples’, and ‘many cases’ (Bentein, 2012, 190) indicate attempts to argue for the quantitative relevance of the phenomenon described, but the lack of precise measures undermines the efficacy of the arguments.
In general, the argumentation develops throughout the article adding more variables to the picture (such as the telicity of the predicates and the agentivity of the clauses) in a post hoc fashion, and keeping them outside the scope of the frequency-based analysis. The practice of post hoc analysis may be coupled with an argumentation strategy that relies heavily on anecdotal evidence. In this respect, a very instructive example is again given in Bentein (2012, 192), where four examples are considered sufficient to show a diachronic development of the periphrastic perfect towards an increased degree of agentivity. Let us consider another case of post hoc analysis, this time used in the context of the presence vs absence of Latin gender doublets over time. Rovai (2012, 120) performs a quantitative analysis by counting the occurrences of the feminine and neuter forms in a given set of texts. The quantitative data are thus frequency counts according to one variable (gender). After presenting the count data, the article contains a detailed analysis of each of the sixteen lemmas, specifying the declension class, stem, and number features of the forms found in the texts (Rovai, 2012, 102–3). This is a well-motivated step, because obviously counting the number of occurrences of each gender form is not sufficient for a good analysis of the phenomenon at hand, and more factors need to be taken into consideration. It is also a step that we can consider part of a qualitative analysis, because it goes into the detail of each instance. This analysis is followed by a summary of the data according to the time variable, showing the cases
where the feminine forms are more ancient than the neuter ones (eight out of sixteen) and those where the feminine forms do not occur after the archaic age (but with five exceptions), while the neuter forms are attested in the later centuries. From these observations the author draws the conclusion that the feminine forms ‘seem to be the last occurrences of unproductive remnants already in early times’ (Rovai, 2012, 103). To support this claim, he provides the example of the fossilized form ramentā and the form caementa, mainly occurring in the conservative context of public law texts. Therefore, the type of texts where the feminine forms are attested and their fossilized nature are used as arguments for proving the fact that such forms are more ancient than the neuter ones. In this case, the main analysis focused on the gender of the forms and the age of the texts; however, later on, text type and formulaic features are considered as well, but with respect to only two of the sixteen nouns (rament- and caement-). It is natural to ask: how many times does each of the sixteen nouns occur in fossilized forms or in legal texts? Including such variables in the original analysis would make the approach systematic and appropriate to the multidimensional nature of the phenomenon studied. Another variable considered in a post hoc fashion in the article is related to lexical connectionism (Rovai, 2012, 106). Limited to a subset of the nouns analysed, this is used as an argument supporting the hypothesis that ancient feminine forms were later on reanalysed as thematic neuter forms. According to this argument, some feminine nouns shared the same semantic field as some second-declension neuter nouns, and therefore occurred in the same contexts. To support this, the author provides two examples. However, it is not clear how to quantify the role played by lexical connectionism in the phenomenon under investigation.
How many counter-examples can be found that contrast with the two examples provided? What is the relevance of these two examples in the context of all occurrences of the nouns considered? As the author says in Rovai (2012, 107–11), lexical connectionism cannot account for the development of ten of the sixteen nouns analysed. For this reason, the author analyses constructions that are ambiguous between the personal passive and the impersonal interpretation (e.g. dicitur ‘it is said’), and uses the fact that the latter gradually became more common over time to argue for the original first-declension feminine forms (such as menda ‘error’) to be reanalysed as second-declension neuter forms (mendum ‘error’). However, no measure of the relevance of this argument is given as to how many times these ambiguous constructions occur out of all occurrences of the nouns considered, how many instances are available to support it, and how this account compares quantitatively to the other factors considered for explaining the reanalysis.
1.3.4 Problems with the research process
We have seen some of the problems affecting the data collection and analysis phases. Here we want to focus on the research process as a whole, and in section 2.3 we will summarize the main claims of our proposal in this respect.
Traditionally, access to and automatic processing of large amounts of text have been difficult, due to technological limitations, as we illustrate later on in this section. These constraints had an impact on how research was carried out, leading researchers in historical linguistics to focus on relatively small data sets and publish the final results of their investigations, typically in the form of articles or monographs. As we have noted, the fact that the analyses were not published meant that they would not be easily reproducible. In spite of the technological advances of the past decades, the focus on the final results of the analysis and the lack of documentation of the intermediate phases of the research process is still a given, both in scientific disciplines and in the humanities. Following an increasingly popular line of thought (Candela et al., 2015), in our proposed framework we argue that more emphasis should be placed on documenting, publishing, and sharing all phases of the research process, from data collection to interpretation. In section 2.3 we will outline our suggestions in this area.
New technologies
The dramatic increase in digitization projects in the late 1990s made it possible to encode documents in digital formats, and the growing computing power has allowed computers to store more and more data at increasingly lower costs. A number of projects aimed at digitizing historical material have led to large amounts of data being available to the academic community, such as the Internet Archive,⁴ Europeana (Bülow and Ahmon, 2011),⁵ and Project Gutenberg,⁶ just to mention a few. This has meant that archives and libraries can make their collections more accessible and can preserve them in a better way. In parallel, the development of disciplines like computational linguistics and its applied field of natural language processing has made it possible to analyse large amounts of text automatically.
Let us imagine that we were interested in studying the usage of a as a preposition (meaning ‘in’ as in We go there twice a week) in English in the seventeenth century. We would not be able to read all texts written in the seventeenth century and note all usages of a as a preposition. In the pre-digital era, we would have probably selected a sample of the texts, checked existing theories, possibly formulated a hypothesis and checked it against the selected texts. This way, we would be less likely to find patterns that contradict our intuition, and if we did, we would only be able to collect a very limited number of examples, and we would not have an idea of how common the evidence contrasting our intuition is. With the wealth of digitized texts we have at our disposal nowadays (especially for English), we are able to resort to a much broader evidence basis, and this triggers new research questions that were not conceivable before. Such increasingly larger text collections cannot be tackled with the so-called ‘close-reading’ approach. On the
⁴ https://archive.org/index.php.
⁵ http://www.europeana.eu/portal/.
⁶ https://www.gutenberg.org.
other hand, simply searching for a in a raw-text collection leads to a high number of spurious results, including all cases where a is used as a determiner. Even if we searched for certain patterns (such as instances preceding ‘day’, ‘week’, or ‘year’), we would only capture a subset of the relevant occurrences. Instead, if we are able to automatically analyse all texts of interest by part of speech with appropriate natural language processing (NLP) tools, we would be able to identify the cases where a is a preposition. This way, we would be in the position to answer questions like ‘how has the relative frequency of a as a preposition and a as a determiner changed?’ or ‘which factors might have driven this change?’. As we have suggested in the example above, the new possibilities offered by digital technologies have had profound implications on research practice and methodologies. In addition to historical linguistics, numerous other areas of human knowledge have witnessed an explosion in the size of the data sets available. Ranging from market analysis to traffic data, the phenomenon of ‘big data’ (generally referred to as data sets characterized by large volume, variety, and velocity) has become a reality that organizations cannot afford to ignore (Mayer-Schonberger and Kenneth, 2013). In this book we argue that historical linguistics has not taken full advantage of this technological and cultural change, and we suggest a framework which supports the transition of this field to a new state that is more in harmony with the current scientific landscape. This transition does not only consist of a set of new techniques applied to traditional research questions or the ability to carry out traditional analyses on a larger scale. We believe that this transition allows a whole set of new questions to be answered. 
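The first of the questions raised above, how the relative frequency of a as a preposition and as a determiner has changed, can be sketched once tokens carry a part-of-speech tag and a date. Everything in the sketch below is an assumption made for illustration: the tokens are invented, the two periods are arbitrary, and the tag labels `ADP` and `DET` stand in for whatever tag set an actual NLP tool would produce.

```python
from collections import Counter

# Hypothetical tagged tokens: (period, lemma, part-of-speech tag).
# Invented for illustration; not drawn from any real corpus.
TOKENS = [
    ("1600-1649", "a", "ADP"), ("1600-1649", "a", "DET"), ("1600-1649", "a", "DET"),
    ("1600-1649", "week", "NOUN"),
    ("1650-1699", "a", "ADP"), ("1650-1699", "a", "DET"), ("1650-1699", "a", "DET"),
    ("1650-1699", "a", "DET"), ("1650-1699", "day", "NOUN"),
]


def share_by_period(tokens, lemma, target_pos):
    """For each period, the proportion of tokens of `lemma` tagged `target_pos`."""
    hits = Counter()
    totals = Counter()
    for period, lem, pos in tokens:
        if lem == lemma:
            totals[period] += 1
            if pos == target_pos:
                hits[period] += 1
    return {period: hits[period] / totals[period] for period in sorted(totals)}


shares = share_by_period(TOKENS, "a", "ADP")
for period, share in shares.items():
    print(f"{period}: {share:.2f} of occurrences of 'a' are prepositions")
```

Reporting a proportion per period, rather than a raw count, keeps periods with different amounts of text comparable; whether an observed change is more than chance variation would still need a significance test.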
In their abstract, Bender and Good (2010, 1) summarize the need for linguistics to scale up its approach as follows:

The preeminent grand challenge facing the field of linguistics is the integration of theories and analyses from different levels of linguistic structure and aspects of language use to develop comprehensive models of language. Addressing this challenge will require massive scaling-up in the size of data sets used to develop and test hypotheses in our field as well as new computational methods, i.e., the deployment of cyberinfrastructure on a grand scale, including new standards, tools and computational models, as well as requisite culture change. Dealing with this challenge will allow us to break the barrier of only looking at pieces of languages to actually being able to build comprehensive models of all languages. This will enable us to answer questions that current paradigms cannot adequately address, not only transforming Linguistics but also impacting all fields that have a stake in linguistic analysis.
This extract applies to the whole field of linguistics and the authors identify the main challenges ahead of linguistics today as consisting of data sharing, collaboration, and interdisciplinarity, as well as standards and scaling up of data sets used for formulating and testing hypotheses on language (with the help of NLP tools for the automatic analysis). They also underline the need for overcoming such challenges to allow higher goals to be achieved. We fully support this view, and in the present book we will
combine it with further points pertaining specifically to historical linguistics, in the context of a general methodological framework.

1.3.5 Conceptual difficulties

Why has historical linguistics not yet fully embraced the methodological shift we outline in this book? There are many reasons for this. Inadequate technical means and skills, and insufficient computing power and storage capabilities, are certainly concrete obstacles that have been in the way of a complete transition of historical linguistics into the empirical, data-driven quantitative science we argue for in this book, as we have seen in section 1.3.4. Here we want to briefly discuss other, more serious obstacles which concern the place of historical linguistics and the humanities in general in the scientific landscape. Bod (2014) offers a comprehensive overview of the history of the humanities, while at the same time taking the opportunity to discuss the defining elements of the humanities and their relationship with the sciences. The humanities have been defined as ‘the disciplines that investigate the expressions of the human mind’ (Dilthey, 1991); however, this definition is not unproblematic: for example, it would apply to mathematics as well. In fact, Bod chooses a more pragmatic one according to which the humanities are ‘the disciplines that are taught and studied at humanities faculties’ (Bod, 2014, 2). From Bod’s (2014) overview it is clear that a radical dichotomy between the humanities and the sciences is not supported by historical evidence. In fact, he finds a unifying feature shared by scientific and humanistic disciplines in the development of methodological principles and the search for patterns (Bod, 2014, 355), which in the case of the humanities focus on humanistic material (texts, language, art, music, and so on). The nature of such patterns varies across disciplines, with examples of local and approximate patterns found both in the humanities and, for example, in biology.
According to Bod (2014, 300): linguistics is the humanistic field that is ideally suited to the pattern-seeking nomothetic method, which has indeed become common currency [. . .] Despite its general pattern-seeking character, present-day linguistics displays a striking lack of unity [. . .] In one cluster we see the approaches that champion a rule-based, discrete method, whereas in the other cluster an example-based, gradient method is advocated.
This perspective is in contrast with the view according to which the humanities are not concerned with finding general patterns, and instead are only concerned with analysing particular human artefacts, whether they are texts, or manuscripts’ transmission histories, or works of art. Instead of stressing a strict opposition between scientific and humanistic disciplines, hence, it is helpful to appreciate the differences that exist within the sciences themselves and opt for a more nuanced approach. In this book we propose a methodological framework that encompasses a large portion of
the practice of historical linguistics (for the scope of our framework, see section 2.1.1), and is concerned with empirical, corpus-driven quantitative approaches. In our framework, historical linguistics research looks for patterns and tests hypotheses in historical language data, mainly historical corpora, and builds models of historical language phenomena.
1.4 Can quantitative historical linguistics cross the chasm?

A fundamental assumption for this book is that historical linguists already work with technology. The Greek root tekhnē can refer to any acquired or specialized skill. In the more conventional sense of technology as some invented means by which we achieve something (books are also a technology), historical linguistics is a technological field, or at the very least not an atechnological one. Therefore, it is anachronistic to create an artificial contradiction between historical linguistics on the one hand, and technology on the other. For the purposes of this discussion, we will consider ‘technology’ as having the broadest possible scope, pointing out that historical linguists already use ‘technologies’. Under this conceptualization of technology, a symbolic analytical framework (such as X-bar annotation) counts as a ‘technology’ just as much as a software platform like R. This broad use of technology can then be distinguished from the very advanced and possibly more recent high-tech type of technologies, such as cutting-edge lab equipment or statistical and computational software or algorithms. It is probably safe to say that historical linguistics is not typically or commonly associated with high-tech approaches, and this impression will be further discussed in later chapters. Above, we indicated that a more high-tech approach could benefit historical linguistics. Since such an approach is already in use in other branches of linguistics, it is clearly technically possible to adopt it, and there are examples of historical linguists who have already made use of state-of-the-art techniques from computational and corpus linguistics, and applied statistics. What we are more concerned with here is the possibility of making these approaches mainstream.
The present section deals with the problem of disseminating such a methodology beyond a small group of linguists who have already adopted it, and making it available to a much larger share of historical linguists. To do this, we will base our discussion on a much-touted model of technology adoption in the world of business, the problem of crossing the chasm (Moore, 1991). The technology adoption life cycle we have in mind is based on Moore (1991), and views technology adoption as a process of diffusion. The market is viewed as consisting of relatively distinct groups who will adopt a new technology or product for very different reasons. Crucially, the different market segments will act as reference points for each other, so that a product or a technology can seemingly be transmitted from one group to the next. As we will see, this highly idealized model can bring some real insights regarding the adoption of quantitative corpus methods in historical
linguistics, as much as it can inform marketers about how to push the latest high-tech gadgets to consumers. The key to the insight is not in the details of the model as such, but in the way it throws light on people’s motivation for deciding to make use of a specific technology. It is this pivotal insight that we think merits the model’s application to the problem of how to advocate a more widespread adoption of quantitative corpus methods in historical linguistics, and what the obstacles are.

1.4.1 Who uses new technology?

To better understand why people adopt a technology, the model operates with five groups of highly idealized technology users. These groups can in turn be grouped together into two broad types, namely the early adopters and the mainstream adopters. Early adopters will typically have very different motivations for picking up a new technology compared to mainstream users. From this simple observation follows the conclusion that a technology that appeals to early users may fall flat on its face when presented to the mainstream. The gap in expectations and requirements that separates the early adopters from the majority of potential users of the technology is what constitutes the metaphorical chasm. But before tackling how to cross the chasm, we will look into what defines the different groups of users. We have adapted the business-oriented examples from Moore (1991) and situated them in a linguistic context where needed.

The innovators

The innovators are the technology enthusiasts. These are people who are interested in new technology for its own sake, and they will eagerly pick up something simply because the new technology appeals to them. They are typically not deterred by cost, and since they have a high level of technological competence, they are not put off by prototypes and a lack of formalized user support.
If a new technology, such as a piece of software, requires modification or configuration to function, they will be able to do this themselves, or find out how to do it via technology discussion forums on the web. In a more linguistic context, innovators are linguists who introduce new technologies from other fields, or even create their own. This idealized user type might remind us of the caricature of the quantitative corpus linguist from Fillmore (1992) who is mostly concerned about corpora, tools, and corpus frequencies for their own sake.

The visionaries

The next group of users, the visionaries, are also technologically savvy, but unlike the innovators, they are not primarily interested in the technology for its own sake. The visionaries have a strategic interest in the technology and are primarily interested in the subject matter, i.e. historical linguistics. To the visionaries the exact properties of the new technology are subordinate to what it can help them achieve in linguistics. Such achievements could be anything from answering a linguistic question that has hitherto been considered too hard to be adequately answered, to gaining an advantage in the academic job market by mastering a new, trendy
technology. The visionaries and the innovators make up the early adopters, and the visionaries will have the innovators as a reference group. If the innovators can demonstrate that, in principle, a new technology can tackle new questions, or answer old questions on a new scale, then the visionaries are happy to start making use of that new technology in order to gain an advantage.

The early majority

With the visionaries, we leave behind the early adopter groups and enter into the mainstream territory. Here we find the early majority. This group will constitute a large share of the overall users, and the defining characteristic of the group, according to Moore (1991), is pragmatism. They will adopt a new technology when it is both convenient and beneficial for them to do so. They are more interested in incremental improvements than in huge leaps forward, and will avoid the risks associated with new technology by finding out how others, typically the visionaries, have fared with it (Moore, 1991, 31). This means that the early majority are much slower to adopt new technologies than the early adopters, but they are more likely to stick to their new technology once it has caught on. Moore (1991, 31) points out that the early majority is difficult to characterize, but we can think of them as linguists who have adopted corpus linguistics methods or quantitative tools as a purely pragmatic measure after seeing that the visionaries have successfully used the same tools to answer questions in a new way, but only after those tools have reached a sufficient level of maturity and user-friendliness.

The late majority

The next large segment of technology users is the conservatives. The conservative users are not concerned about the latest high-tech tools; indeed, they might be wary of them (Moore, 1991, 34). Conservatives are highly focused on ease of use, and will stick with their chosen technology for as long as possible.
They are reluctant to change it for another technology, and will do so only when the new technology has become a virtual standard, is easy to use, and covers all their needs in the area it is meant to cover. A hypothetical example might be a linguist who adopts a new technology because it has become so widely adopted that it is a near requirement. Incentives could be negative, as in the loss of support for an older technology, with the new one being introduced as the standard; or they could be positive, e.g. some journals favouring articles that make use of the technology in question.

The sceptics

The sceptics, or ‘laggards’, as Moore (1991, 39) also calls them, make up a small tail end of the technology adoption cycle. The sceptics, as the label implies, do not adopt new technology and will instead stick to their tried-and-trusted methods, no matter what the cost in lost productivity or lack of perceived coolness is. The linguistic example in this case might be the caricature of the ‘armchair’ linguist from Fillmore (1992) who has access to relevant data purely based on introspection and intuition. As Moore (1991, 40) points out, there are important lessons to be learned from this group, since they are more prone to seeing the flaws in any new technology, and are
sensitive to the hyperbole that inevitably accompanies a new technology making its way into an academic field. Thus, while we are fundamentally in disagreement with the sceptics regarding the value of new technology in historical linguistics, we are also highly interested in their arguments, since we can learn much from them about the discrepancy between how a new technology is marketed and its actual capabilities. We remain committed to the idea that introducing new technologies can benefit historical linguistics, but only to the extent that such technologies are fairly evaluated on their actual merits, not hyperbole.

1.4.2 One size does not fit all: the chasm

As the characterization of the types of technology users above should make clear, these are idealizations that do not necessarily fit any one person, and one person might fit in several idealized groups to some degree. However, each idealized user type captures very different motivations for taking on a new technology, and the broad differentiation between early adopters and the mainstream captures the fact that some of these motivations are more closely aligned than others. The key insight that they confer is that motivations for adopting new technology differ. Essentially, one size does not fit all. This means that although a new technology might be outright attractive to the innovators, the visionaries might fail to see how it can be used in a meaningful way to answer the linguistic questions they care about. In that case, the technology in question is likely to remain a niche phenomenon. Alternatively, the technology itself might appeal to the innovators and at the same time offer the visionaries the strategic advantage they seek in answering linguistic questions. In this case, the technology will have fully engaged the early adopters. However, the technology might still not permeate the mainstream market, because it fails to cross the metaphorical chasm.
The idealized segments of user types are not continuous, hence there are gaps separating them. However, one gap stands out as larger than the others: the chasm that separates the early users from the mainstream users. This is illustrated in Figure 1.1, which shows the relatively larger gap between early adopters and the mainstream as a noticeable discontinuity. As the figure also makes clear, the chasm separates the relatively small number of early adopters from the bulk of users who are found in the mainstream part of the model. Thus, the chasm not only separates qualitatively different users from each other, it also represents a quantitative difference that separates a minority of users from the vast majority of users. We consider the chasm model a useful basis for analysing the status of quantitative approaches to historical linguistics for two reasons. First, it covers technology in the broad sense (even new analytic frameworks). Thus it provides a way to understand not only the point that this book is trying to make, but also a tool for understanding the current situation. Second, the model provides some insights about what can be done to change the situation, provided that our argument in favour of increasing historical linguistics’ reliance on high-tech approaches is accepted.
[Figure 1.1. Technology adoption life cycle modelled as a normal distribution, based on Moore (1991, 13). Segments labelled along the technology adoption curve: innovators, early adopters, early majority, late majority, sceptics. The ‘chasm’ separates the early adopters from the majority.]
Although the chasm model can be used in many ways, we will focus on the key component, namely the insight about the chasm that divides the early adopters from the majority of users. To understand why some technologies never go mainstream, we must consider what prevents them from crossing the chasm, as we will see in the next section.

1.4.3 Perils of the chasm

There are a number of reasons why a technology might never arrive at the mainstream segment of users. For instance, it might never reach the chasm at all, because it fails to catch on among the innovators and visionaries. According to Moore (1991), this is likely to happen if the vision behind the technology is marketed before the technology itself is actually viable. For example, the vision of large-scale data-driven corpus approaches to study language crucially depends on specific types of computer technology. Without a suitably mature version of this technology, the vision may be appealing, but the practical problems would prevent it from really catching on. As we shall see later, we might find parallels to this in historical linguistics. In the case of quantitative corpus methods, these are at the very least a technology that has been embraced by early adopters (innovators and visionaries) in historical linguistics. We argue that it has not yet fully entered the mainstream of users in
historical linguistics, and we will further substantiate this claim in section 1.5. However, as we consider some of the potential reasons for this failure to cross the chasm, it is useful to look into potential general pitfalls for new technologies crossing the chasm, adapted from Moore (1991, 41–3):

(i) lack of respect for established expertise and experience;
(ii) forgetting the linguistics and only focusing on the technology;
(iii) lack of concern for established ways of working;
(iv) practical problems such as missing standards, lack of training opportunities, or educational practices.
Point (i) prevents a new technology from crossing the chasm because it alienates the majority of mainstream users. For all the high-tech buzz about disruptive technology, it is clear that technologies that are able to adapt to existing practices have an advantage when it comes to crossing the chasm. The majority of users are pragmatic and the disruptive, innovative aspects of a new technology are simply not what appeals to them. This brings us to point (ii), which is the insight that for the majority of users, such high-tech approaches must present a better option for doing historical linguistics. Without that perspective, we would hardly expect any attempt to push a new technology to the majority of mainstream users to succeed. Point (iii) captures the fact that historical linguists, like any users of technology, are interested in tools that work. Established tools, such as the qualitative methods of historical comparative linguistics, clearly work. Thus, the chasm model suggests that innovative technology ought to work best where the established methods have their weakest points. Finally, point (iv) addresses all the practical or financial problems associated with a new technology, such as acquiring the technology itself, learning new skills (and transferring them to students), finding new ways to integrate the existing technology with the new, establishing standards (e.g. for annotation), and best practices (e.g. for peer review). None of these points needs to be fatal for a new technology attempting to enter the mainstream, but in combination they would seriously impede its chances of reaching out. In the case of high-tech approaches to historical linguistics, we can easily find examples of all four problems which taken together would prevent full adoption of the technology advocated here. As the following sections and chapters will make clear, our aim is to provide a roadmap for how these potential problems can be avoided.
Specifically, we seek to address points (i) to (iii) by presenting quantitative historical linguistics in an accessible, and relatively jargon-free manner, with the aim of highlighting how this particular approach in many ways can exist alongside established ways of working. We also aim to illustrate how the approach we advocate in some cases will result in better or perhaps more interesting results, which we believe make the investment in the technology well worth it from a historical linguistics point of view. The final point dealing with practical problems lies to some extent outside the scope of the book.
There are a large number of books and courses teaching these skills and aimed at linguists, and the software we advocate is free. However, we do tackle the problem of standards and some terminology, thus hoping to also help to ease some of the practical problems associated with the new technology. A crucial step for achieving these aims is to have a clear understanding of the situation in historical linguistics, as we do in section 1.5.
1.5 A historical linguistics meta study

In this section we focus on the level of adoption of quantitative corpus methods in historical linguistics compared to linguistics in general. We also report on a quantitative study we have carried out on a selection of publications from existing literature in historical linguistics.

1.5.1 An empirical baseline

Before looking into the current use of corpora and quantitative methods in historical linguistics, it is worth considering just how quantitative we expect historical linguistics to be. A reasonable benchmark is the field of linguistics overall. After all, those linguists working on contemporary languages have a wider spectrum of methods available to them that are out of reach for most historical linguists: native speaker intuitions, surveys, recordings, interviews, controlled experiments, and so on. Given that these methods are all considered acceptable in mainstream linguistics (see Podesva and Sharma 2014 for an overview), and given that the primary source of data for most historical linguists is textual, our position is that historical linguistics should not be using corpora and quantitative methods to a lesser degree than linguistics overall. For this benchmark we have relied on data from Sampson (2005 and 2013). These two studies analyse research articles (excluding editorials and reviews) published in the journal Language between 1960 and 2011. Sampson wanted to know the extent to which mainstream linguistics relied on empirical and usage data. To this end he sampled the volumes of what is arguably the leading linguistics journal, Language, at regular intervals between 1960 and 2011. As a baseline he chose the journal’s 1950 volume, so as to reflect the period prior to the increased reliance on intuition-based methods in the 1960s. Sampson devised a three-way classification system to label articles as ‘empirical’, ‘intuition-based’, or ‘neutral’.
The last category was designed to cover papers that did not readily fit into the two first categories, such as methodological papers or papers dealing with the history of linguistics. To classify articles he used a number of rules of thumb, including an admittedly arbitrary threshold of two usage-based or corpus-based examples to classify a paper as evidence-based. However, he also employed positive criteria for labelling papers as intuition-based, notably the presence of grammaticality judgements. Thus, while the criteria for being
evidence-based might seem liberal, the presence of additional criteria ensures a reasonable classification accuracy. For full details about the sampling, procedure, and criteria, see Sampson (2005). Although the data in Sampson (2005) indicated a trend towards an increasing number of evidence-based papers, his main conclusion was cautious, suggesting that linguistics still had some way to go before empirical scientific methods were fully accepted in this field. The proportion of evidence-based papers (calculated from the total number of non-neutral articles) was only growing slowly and showing some signs of dipping. Picking up the thread from the previous study, Sampson (2013) continued the exercise and found that what had appeared as a downward trend around 2000 was simply due to fluctuations. The addition of more data confirmed a continued upward trend since the nadir in the 1970s. Figure 1.2, based on data from Sampson (2013), illustrates this trend. Since 2005 the proportion of evidence-based studies has exceeded the 1950 baseline, represented by the horizontal line in the plot. As Figure 1.2 shows, empirical methods (according to Sampson’s criteria) have made a remarkable comeback. Already in the 1980s approximately half the research articles published in Language were based on empirical evidence (in Sampson’s sense of the word), with a rapid increase setting off in the 1990s. This is perhaps not surprising, since it coincides with the availability of electronic corpora around the same time, as discussed in section 3.5, and Sampson (2005) also makes the link to the availability of corpora explicitly. It is clear that searchable, electronic corpora foster empirical research. However, it is all too easy to mistake correlation for causation, and corpora are only one piece of the puzzle, as shown by the fact that what is published in Language is still (as we believe it should be) a mixture of empirical papers (in the sense of Sampson 2005) and other studies. Corpora do not determine what kind of research is published. Thus, we cannot simply assume that a similar situation is found in historical linguistics. To complement the picture, we therefore surveyed the field of historical linguistics.

[Figure 1.2. Proportions of empirical studies appearing in the journal Language between 1960 and 2011. The horizontal dotted line represents the baseline of the 1950 volume. After figure 1 in Sampson (2013). Vertical axis: proportion of empirical articles (0.0 to 1.0); horizontal axis: year (1960 to 2010).]

1.5.2 Quantitative historical research in 2012

Our meta study differs from those in Sampson (2005) and Sampson (2013) in that we surveyed several journals published in one particular year, as opposed to a single journal over several decades. We found this to be a reasonable approach, since our aim was to present a snapshot of the field of historical linguistics as it currently appears. For the literature survey we carefully read a selection of research articles published in 2012, taken from six journals. These six journals are clearly a small sample of all that is published within historical linguistics in a given year, but should nevertheless provide some insight into the breadth of research currently being published. To make the effort feasible, we restricted the selection to journals that met all the following criteria:

1. research journals (excluding monographs, yearbooks, and edited books);
2. journals published in English;
3. journals focusing specifically on historical linguistics and/or language change;
4. journals with a general coverage, excluding those focusing on specific languages or subfields (like historical pragmatics or syntax);
5. linguistics journals (excluding interdisciplinary ones).

Applying these criteria resulted in the following final list of journals:

• Diachronica
• Folia Linguistica Historica (FLH)
• Journal of Historical Linguistics (JHL)
• Language Dynamics and Change (LDC)
• Language Variation and Change (LVC)
• Transactions of the Philological Society
From these journals we selected only the full-length research papers, thus excluding book reviews, editorials, and squibs. This left us with sixty-nine papers, a number which was pruned down to sixty-seven, after removing two papers that were deemed out of scope. We then read and classified the final set of papers. The data and the code for this study are available on the GitHub repository https://github.com/gjenset.
For each paper, a number of variables were recorded, including journal, the type of techniques employed in the analysis, whether or not corpora were used, and whether or not the paper could be classified as quantitative, qualitative, or neutral. For the neutral classification we employed the criteria from Sampson (2005). Five papers with a mainly methodological or overview focus were included in the neutral category, leaving us with sixty-two papers for the quantitative vs qualitative categories. Our classification differs from that in Sampson (2005) in that it relies on more variables, notably recording the use of corpus data, but also other data sources such as word lists (e.g. in phylogenetic studies). Furthermore, we decided to distinguish between the source of data (such as corpora vs quoted examples), and the use to which they were put (e.g. if they were treated quantitatively or qualitatively). This was done to obtain a classification that was both more fine-grained and easier to operationalize for historical linguistics than the criteria from Sampson (2005), since none of the papers relied on native speaker intuitions. Whether or not a paper was corpus-based was judged based on the discussion of the data in the paper. We relied on the accepted definition of a corpus as a machine-readable collection of naturalistic language data aiming at representativity (with obvious allowances being made for historical data with their gaps and genre bias). This excluded sources of data such as the World Atlas of Language Structures or word lists. Furthermore, we required the corpus to be published or at least in principle accessible to others, which excluded private, purpose-built collections made for a specific study, but we accepted as corpus-based those studies relying on a subset of data from a corpus that would otherwise fulfil these criteria. 
The distinction between quantitative and qualitative studies was made by assessing whether the conclusion, as presented by the article’s author(s), relied on quantitative evidence. Essentially, we considered whether the author(s) argued along qualitative or quantitative lines by looking for phrases that would imply a quantitative proof of the article’s point, such as ‘x is frequent/infrequent/statistically correlated with y’. Qualitative papers were thus mostly defined as non-quantitative ones, but we also applied positive criteria. We judged arguments based on the presence or absence of a feature or phenomenon to be indicative of a qualitative line of argumentation. Phylogenetic studies, while not typically based on frequency data, were counted as quantitative, since the underlying assumptions are based on computing distances between features or clusters of features. Applying these criteria we found that thirty-seven papers (60 per cent) were qualitative, while the remaining twenty-five (40 per cent) were quantitative. Table 1.1 lists the number of papers grouped according to whether or not they are corpus-based and whether they are qualitative or quantitative. A Pearson chi-square test of independence reveals that there is a statistically significant, medium-to-strong association between corpus use and the use of quantitative methods ($\chi^2(1) = 12.68$, $p = 0.0004$, $\phi = 0.49$). Perhaps unsurprisingly, corpus-based studies tend to favour a
Table 1.1 Classification of sample papers according to whether or not they are corpus-based, and whether or not they are quantitative

                    Qualitative   Quantitative   Total
Not corpus-based    33 (53)       11 (18)        44 (71)
Corpus-based         4 (6)        14 (23)        18 (29)
Total               37 (60)       25 (40)        62 (100)
Table 1.2 Classification of papers from Language 2012 according to whether or not they are corpus-based, and whether or not they are quantitative

                    Qualitative   Quantitative
Not corpus-based    3             6
Corpus-based        0             6
quantitative approach, although four qualitative corpus studies were also identified, which illustrates that there is no simple one-to-one relationship between corpus data and quantitative methods. Of the quantitative studies we see that a little over half (fourteen out of twenty-five) were corpus-based. Comparing this to the benchmark from Sampson (2013), it seems that the leading linguistics journal Language has gone further than historical linguistics in adopting quantitative methods. Recall that around 80 per cent of papers in the most recent samples studied by Sampson were classified as empirical, whereas we only found 40 per cent. Some caution is required in the interpretation, since the criteria used by Sampson differ subtly from ours due to Sampson’s focus on the use of native speaker intuitions and authentic examples as the minimum criteria for what he terms empirical. To investigate how well Sampson’s classification corresponds to our own, we classified the 2012 volume of Language according to our own criteria. This classification, based on fifteen research articles, yielded a similar result to Sampson’s conclusion for recent articles in his sample: we deemed twelve out of fifteen (i.e. 80 per cent) to be quantitative, with six out of fifteen (i.e. 40 per cent) being corpus-based. As Table 1.2 shows, there were no qualitative corpus-based articles. This indicates that, although the minimum criteria employed by Sampson differ from our criteria, both sets of criteria in fact point towards the same conclusion. A sample of fifteen articles is obviously tiny, and the relative frequencies above are offered as an easy means of comparison with the sampled historical linguistics articles,
and not as some generalized prediction about linguistics overall. However, we take the numbers in Table 1.2 to indicate that our classification is at least comparable to that made by Sampson, even if our exact criteria differ. Although the sample of fifteen articles from Language 2012 is too small for a direct comparison with the historical and diachronic linguistics sample via statistical methods, it certainly strengthens the case for our main claim that historical linguistics journal articles are using quantitative methods less than the articles in Language. The claim can be further strengthened if we take into consideration the likely variation, or error margins, around these estimates. Based on the comparisons above, it seems fair to compare Sampson’s estimate of 80 per cent empirical articles in Language with the 40 per cent quantitative papers identified in our historical sample, since our own classification of the 2012 volume of Language showed that the two correspond. It is clear that 80 per cent is a higher percentage than 40 per cent, but how much should we really read into this difference? One way to better understand the difference between the two numbers is to think of them as estimates from an underlying distribution, where we must account for some measurement error. Put differently: our estimates might be incorrect, and the two samples might in fact be exaggerating the differences. We can calculate the range or interval around each of these percentages using the normal distribution as a model. The range of variation we calculate is a 95 per cent confidence interval, which is taken to indicate that 95 per cent of the observations from the underlying population (i.e. articles from the journals) would fall into this range, if our sample is representative. The intervals are listed in Table 1.3. If the error margin around our percentages was excessive, we would expect to see the 95 per cent confidence intervals overlapping, i.e. 
we would expect to see the upper range of variation for the historical sample reaching into the range surrounding the estimate from Language. As the numbers in Table 1.3 show, this is not the case, however. Even if we have underestimated the percentage of true quantitative papers in the historical sample, and overestimated the percentage of true quantitative papers in the Language 2012 sample, we see that the two samples are still likely to be different. The likely theoretical maximum percentage of quantitative papers in the historical sample is 52 per cent, whereas the theoretical minimum for Language is 60 per cent,
Table 1.3 95% confidence intervals for the percentage of quantitative papers in Language 2012 and the historical sample. Note that the confidence intervals do not overlap

                     Proportion of quantitative papers (%)   95% confidence interval
Language             80                                      [60, 100]
Historical sample    40                                      [28, 52]
so Language is clearly different even under this worst-case scenario. Using the same logic, we can test this formally using the prop.test() function in R. The p-value returned by the test is extremely small ($\chi^2(1) = 58.6$, $p < 0.001$), which shows that a sample of sixty-two (the size of the historical sample) is sufficiently large to establish that the percentage of quantitative papers (40 per cent) is statistically different from the percentage reported by Sampson (2013) and found in our Language 2012 sample (80 per cent).
However, there is another aspect to the question of how similar the two samples are, namely the percentage of corpus-based papers, which is the same (40 per cent). Looking only at the proportion of corpus-based papers, we might assume that the situation in historical linguistics journals is very similar to the one in Language. However, we also need to consider how dispersed the corpus-based papers are in the historical sample before attempting to draw further conclusions. Our aggregated results may hide different research traditions and methodological conventions within the field of historical and diachronic linguistics. If this were the case, then we would expect to see some differentiation among types of studies depending on the journal. In fact, this is what we find if we group the classifications by journal. We carried out an exploratory multiple correspondence analysis (MCA) to look for the links between journals, evidence source type (corpus-based or not), and the quantitative–qualitative distinction. MCA is an exploratory multivariate technique that seeks to compress the variation in a large set of data into a smaller number of dimensions that can be visualized in a two-dimensional plot (Greenacre, 2007). The MCA (shown in Figure 1.3) found that the first dimension (represented by the horizontal axis) explained virtually all the variation in the data, accounting for 90.9 per cent of the total variation. 
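The statistics reported so far can be checked against the published counts. The following Python sketch is an illustration, not the code used for the study (the text refers to R): it recomputes a continuity-corrected chi-square test of independence for Table 1.1, normal-approximation 95 per cent intervals for Table 1.3, and a one-sample test of proportions. It assumes that the historical sample (25 quantitative papers out of 62) was tested against the fixed proportion 0.80; that set-up and all function names are assumptions made for this example. Under these assumptions the results come out close to the reported values, although the last digit may differ.

```python
import math


def chi2_sf_df1(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def independence_2x2(a, b, c, d):
    """Test of independence for the 2x2 table [[a, b], [c, d]].

    Returns the continuity-corrected chi-square statistic, its p-value,
    and the phi coefficient (computed without the correction).
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    det = abs(a * d - b * c)
    stat = n * max(det - n / 2.0, 0.0) ** 2 / margins
    phi = det / math.sqrt(margins)
    return stat, chi2_sf_df1(stat), phi


def wald_interval(p, n, z=1.96):
    """Normal-approximation interval for a proportion, clipped to [0, 1]."""
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def one_sample_prop_test(x, n, p0):
    """Continuity-corrected chi-square test of x successes in n trials against p0."""
    diff = abs(x - n * p0)
    diff -= min(0.5, diff)
    stat = diff ** 2 / (n * p0 * (1.0 - p0))
    return stat, chi2_sf_df1(stat)


# Table 1.1: rows = not corpus-based / corpus-based,
# columns = qualitative / quantitative.
stat, p, phi = independence_2x2(33, 11, 4, 14)
print(f"Table 1.1: chi2(1) = {stat:.2f}, p = {p:.4f}, phi = {phi:.2f}")

# Table 1.3: intervals around the rounded proportions quoted in the text.
for label, prop, size in [("Language 2012", 0.80, 15), ("Historical sample", 0.40, 62)]:
    low, high = wald_interval(prop, size)
    print(f"{label}: [{100 * low:.0f}, {100 * high:.0f}]")

# Assumed set-up: 25 quantitative papers out of 62, tested against 0.80.
stat, p = one_sample_prop_test(25, 62, 0.80)
print(f"Historical sample vs 0.80: chi2(1) = {stat:.1f}, p = {p:.2e}")
```

In this sketch the phi coefficient is taken from the uncorrected table, since that is the reading under which a value of about 0.49 is recovered. The same `one_sample_prop_test` call can be repeated with 13 out of 50 to examine the sample without Language Variation and Change.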
This means that the plot can simply be read from left to right (or right to left), as a continuum where the leftmost journal is maximally different from the rightmost journal. We can interpret how the data relate to this first dimension by looking at the projection of the points on the horizontal axis in Figure 1.3. We can see that the journals can be grouped along a continuum from non-corpus-based and qualitative, to corpus-based and quantitative. On the qualitative/non-corpus-based extreme we find Transactions of the Philological Society, followed by Language Dynamics and Change, Diachronica, Folia Linguistica Historica, and Journal of Historical Linguistics. The other, i.e. quantitative, end of the continuum is represented by Language Variation and Change. The results are hardly surprising for someone familiar with the scope of these journals, and it is obvious that this picture represents some form of mutual self-selection between journals and scholars: journals attract submissions that are in line with their explicitly stated profile. However, the conclusion that Language and the historical linguistics data set as a group are similar in their use of corpora is clearly not warranted. Instead, what we see in Figure 1.3 is that Language Variation and Change is (not surprisingly) different from the other historical linguistics journals, and that
[Figure 1.3 plot: the journals (JHL, Diachronica, FLH, TrPhilSoc, LangVarChange, LangDynChange) and the attribute levels corpus.based TRUE/FALSE and quantitative TRUE/FALSE, plotted against Dim 1 (axis labelled 93.3%) and Dim 2.]
Figure 1.3 MCA plot of the journals considered for the meta study and their attributes. Dim 1 is the dimension with the most explanatory value; Dim 2 is the dimension with the second most explanatory value.
Table 1.4 Classification of sampled papers according to whether or not they are corpus-based, and whether or not they are quantitative, with LVC left out

                    Qualitative   Quantitative
Not corpus-based    33 (66)       6 (12)
Corpus-based         4 (8)        7 (14)
it is Language Variation and Change that is primarily associated with both corpora and quantitative methods. We can still observe a continuum among the remaining historical linguistics journals, reflecting the degree to which we find quantitative and corpus-based articles in their 2012 publications. However, once Language Variation and Change is excluded, the numbers change substantially, as Table 1.4 shows. Once the data from Language Variation and Change are set aside, we find that the corpus-based studies account for only 22 per cent of the articles, and the quantitative
methods account for only 26 per cent. Without Language Variation and Change, the sample is down to fifty articles, but the test of equal proportions (prop.test() in R) tells us that we still have enough data to distinguish the historical sample from Language.
Having established that the historical sample seems to use quantitative methods and corpus methods less than the state-of-the-art journal in general linguistics (Language) does, we can turn to how this is related to the life-cycle model of adopting new technology that we introduced in section 1.4. It is worthwhile reiterating the theoretical proportions accounted for by the different adopter groups in the technology adoption model, with the chasm situated between early adopters and the early majority:
• Early adopters: 16 per cent
• Early majority: 34 per cent (cumulative percentage: 50 per cent)
• Late majority: 34 per cent (cumulative percentage: 84 per cent)
• Sceptics: 16 per cent (cumulative percentage: 100 per cent)
If we make the working assumption that the published articles more or less correspond to the research technologies adopted by their authors, we can compare the observed proportion of quantitative and corpus-based articles with the theoretical proportions predicted by the technology adoption model. Of course, this assumption cannot be taken literally, since mastery of quantitative corpus research techniques does not preclude using qualitative methods. However, we consider this a useful approximation, since the sampled journals can select their articles from a larger set of submissions. Based on this, at least for our purposes, we can assimilate journal authors to users of research technology. Comparing the technology adoption model to the data from Language collected by Sampson (2013), we see that the proportion of articles employing quantitative methods in that journal (around 80 per cent) is close to what we would see with full adoption of such technologies by the late majority. In our historical linguistics and language change sample we found that 40 per cent of the studies were quantitative, which would suggest that those methods extend to the early majority. However, as we saw above, this is a little too optimistic due to the effect of papers from Language Variation and Change. If that journal is excluded, the proportion of quantitative papers drops to 26 per cent which, although still among the early majority, suggests a less widespread adoption. If we look more specifically at the intersection of corpus data and quantitative methods in the sample of historical and diachronic change articles, we see that 23 per cent are both corpus-based and quantitative (Table 1.1). However, if we again exclude Language Variation and Change, we see that the percentage drops to 14 (Table 1.4), which is in the early adopter range, according to the technology adoption model. 
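The mapping from an observed percentage to an adopter group can be made explicit. The Python sketch below is illustrative only: the function name and the decision to treat each cumulative percentage as an inclusive upper bound are assumptions, while the boundaries are those of the adoption model and the counts are those of Tables 1.1 and 1.4.

```python
from bisect import bisect_left

# Cumulative upper bounds (per cent) of the adopter groups in the adoption model.
SEGMENTS = [
    (16, "early adopters"),
    (50, "early majority"),
    (84, "late majority"),
    (100, "sceptics"),
]


def adopter_segment(percent):
    """Return the adopter group whose cumulative range contains `percent`.

    Assumption: each cumulative percentage is an inclusive upper bound.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    bounds = [upper for upper, _ in SEGMENTS]
    return SEGMENTS[bisect_left(bounds, percent)][1]


# (number of papers with the property, sample size)
observations = {
    "Quantitative, all journals": (25, 62),
    "Quantitative, LVC excluded": (13, 50),
    "Corpus-based and quantitative, all journals": (14, 62),
    "Corpus-based and quantitative, LVC excluded": (7, 50),
}

for label, (count, total) in observations.items():
    percent = round(100 * count / total)
    print(f"{label}: {percent} per cent -> {adopter_segment(percent)}")
```

Treating the boundaries as hard cut-offs is a simplification; a proportion close to a boundary, especially one estimated from a small sample, could plausibly fall on either side.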
This position is corroborated by taking into account the actual quantitative techniques employed by the quantitative articles in our historical sample. Figure 1.4 shows
[Figure 1.4 bar chart: number of observations (vertical axis, 0 to 14) for the categories Linear models, Freq, %, NHT, Trees, and PCA, shown separately for LVC and Other.]
Figure 1.4 The number of observations for various quantitative techniques in the selected studies, for LVC and other journals. Some studies employed more than one of the techniques.
the number of times different quantitative techniques were encountered. Multivariate techniques such as linear regression models (including Varbrul) are clearly the largest single group of techniques. However, Language Variation and Change is again intimately involved in the details. The majority of the uses of linear models is found in that journal, as is the majority of uses of null-hypothesis tests. The numbers in Figure 1.4 are small, but sufficient to give us the impression of Language Variation and Change as a methodological outlier among the quantitative papers in the sample. Thus, we can conclude that, based on our sample, articles from the journals specializing in historical linguistics and language change that we considered (published in 2012) use quantitative methods to a lesser degree than a relevant comparison journal in general linguistics (Language). Furthermore, if we exclude Language Variation and Change, which is biased towards quantitative methods, we see that the percentage of quantitative papers and corpus-based papers drops even further. If we consider historical papers that use both quantitative methods and corpora (excluding Language Variation and Change), the percentage is low enough to be compared to the early
adopter section of the curve from the technology adoption model presented in section 1.4. Having described the state of adoption of quantitative corpus methods in historical linguistics, we are ready to carve out a niche for this technology in historical linguistics, and we will do this in Chapter 2.
2 Foundations of the framework
2.1 A new framework
In this chapter we outline the foundations of the new methodological framework we propose. This framework is not meant to replace all existing ways of doing historical linguistics. Instead, we present a carefully scoped framework for doing certain parts of historical linguistics that we think would benefit from this approach. Other areas of historical linguistics might not require the kind of innovation we propose, or they might require innovations of a different kind. However, we strongly believe that the approach outlined in this book is the right choice for what we define in the scope of quantitative historical linguistics. We think many, if not most, historical linguists would agree with us that corpora and frequencies are potentially very informative in answering questions in historical linguistics. Our aim is to take this intuition one step further by proposing principles and guidelines for best practices, essentially an agreement as to what constitutes quantitative historical linguistics. The next section addresses the question of scope for the framework.
2.1.1 Scope
We submit that the principles of quantitative historical linguistics pertain to any branch or part of historical linguistics. These principles are not only meant as guides to carrying out quantitative research, but also establish a hierarchy of claims about evidence which also encompasses non-quantitative data. In this respect, quantitative historical linguistics is just as much a framework for evaluating research as for doing research. The basic assumptions and principles laid out in sections 2.1.2 and 2.2 establish a basis for evaluating and comparing research in historical linguistics, whether quantitative or qualitative. Our main focus is nevertheless the methodological implications of these assumptions and principles for how to do historical linguistics research. 
The principles and guidelines of quantitative historical linguistics can be applied within any conventional area of historical linguistics, such as phonology, morphology, syntax, and semantics. In line with Gries (2006b), we argue that corpora serve as the best source of quantitative evidence in linguistics, and by extension also in historical
linguistics. This might at first glance seem to exclude e.g. historical phonology from quantitative historical linguistics; however, this is a practical consideration based on available corpus resources, not an inherent feature of quantitative historical linguistics. In fact, historical corpora can be illuminating when it comes to questions of sound change, as demonstrated by recent studies using the Origins of New Zealand English corpus (Hay et al., 2015; Hay and Foulkes, 2016). Nevertheless, we stress that in some areas, such as phonology, quantitative historical linguistics is to a large extent complementary to traditional historical comparative linguistics (see section 2.1.2). In the following chapters we give examples and case studies from morphology, syntax, and semantics, with some discussion on phonology. Our focus in this book is predominantly on corpus linguistics, since corpora constitute the best source of quantitative evidence. However, quantitative does not automatically entail corpus-based. For instance, historical phylogenetic modelling attempts to establish relationships, classification, and chronology of languages based on historical data by probabilistic means (Forster and Renfrew, 2006; Campbell, 2013, 473–4). Phylogenetic models may employ typological traits, such as Dunn et al. (2005), or lexical data, such as Atkinson and Gray (2006), or corpus data (Pagel et al., 2007). Quantitative historical linguistics is deliberately agnostic regarding the use of specific statistical techniques, since such techniques must reflect the specific research question. The caveat here is that the choice of statistical technique should reflect best practices in applied statistics and be sufficiently advanced to tackle the full complexity of the research problems (see section 2.2.12). 
Thus, although we consider corpora the preferred and recommended source of quantitative evidence, quantitative historical linguistics does not necessarily equate to corpus linguistics. In addition to the source of data, quantitative historical linguistics relies on a number of other principles and basic assumptions, which we turn to next.
2.1.2 Basic assumptions
The scope of our framework builds on a number of premises and relies on different levels of analysis. As in other historical disciplines, different skills are needed for different stages of the problem-solving process. Historians must judge sources in light of the physical documents, their literary genre, and the context of the source, which might require very different sets of skills, as discussed in chapter 2 of Carrier (2012). Similarly, quantitative historical linguistics must make a number of assumptions, some of which rely on other scholarly disciplines. Thus, the approach is not all-inclusive, but rests on and interacts with other pursuits of knowledge, by means of the following assumptions. We are indebted to Carrier (2012) for inspiration, but have reworked the material to match the case of historical linguistics.
The historical linguistic reality is lost
Whether we study the history of particular languages, the relationship between languages and language families over time, or
how language change proceeds in general, we all face the same inescapable problem: whatever reality we wish to describe, understand, or approximate is irrecoverably lost. It cannot be directly accessed and hence we can only study it indirectly. Because of this inaccessibility, our models of the past historical linguistic reality will always be imperfect. However, they may still be useful. A key aim of the present book is to show what we think constitutes a useful model, and in which circumstances it is useful.
Philological and text-critical research is fundamental
No corpus is better than the quality of what goes into it. Consequently, sound groundwork in terms of philological, paleographical, and text-critical research must be assumed. Put differently, the proposed approach cannot replace these pursuits of knowledge. Instead, it complements them and relies on them to critically study the physical manuscripts and the philological and stemmatological context of the text contained in them. Based on such research, critical editions can be created, and these critical editions can subsequently form the basis for corpora.
Grammars and dictionaries are indispensable
Another level in the research process is the creation of grammars and dictionaries that make it possible to annotate historical corpora. Of course, such research is not only a means to create corpora, but it illustrates the degree to which quantitative historical linguistics rests on other approaches to historical linguistics. We would again like to emphasize that the present approach is in many respects complementary to existing approaches, although, as we explain in Chapter 5, it is desirable to create corpus-driven dictionaries. Reaching back to the extended notion of technology that we introduced in section 1.4, we see no reason to replace existing approaches where they work well. As the levels of analysis outlined here illustrate, several approaches can and must coexist. 
Qualitative models
We agree with Gries (2006b) that corpora provide one type of evidence only: quantitative evidence. It follows from this that quantitative claims or hypotheses are best addressed by corpus evidence. However, not all hypotheses are quantitative, as we illustrate in section 2.4.3. Qualitative approaches in historical linguistics have more than proved their worth in establishing genealogical relationships between languages, especially through the study of regular sound correspondences. Although such qualitative correspondences might be a simplification (i.e. imperfect models), they might nevertheless be useful and successful. Similarly, the simplifications and generalizations involved in establishing paradigmatic grammatical patterns might be useful without being a one-to-one correspondence to the lost historical linguistic reality. Where we do see the limits of qualitative approaches is in distributional claims, especially as they relate to claims or hypotheses about syntagmatic patterns. In the following sections we will elaborate the terminology and basic tenets of our framework.
2.1.3 Definitions
In the present section we define the core terminology based on which we will formulate the principles of our framework.
Evidence
By evidence we mean facts or properties that can be observed, independently accessed, or verified by other researchers. Such facts can be pre-theoretical or based on some hypotheses. A pre-theoretical fact could be the observation that in English the word the is among the most frequent ones, alongside words such as you. We can observe facts in light of a hypothesis by assuming grammatical classes of articles and pronouns that group words together. Based on this hypothesis we can gather facts that constitute evidence that the classes article and pronoun are among the most frequent ones in English. It follows from this definition that empirical evidence is a pleonasm, since all evidence conforming to it must be empirical. The definition above explicitly excludes the intuitions of the researcher as evidence in historical linguistics. Such intuitions are problematic as evidence even for languages where native speakers can judge them; for extinct languages and language varieties we consider such intuitions inadmissible as evidence. This position does not imply that intuitions are without value. For instance, intuitions are undoubtedly valuable in formulating research questions and hypotheses, and when collecting and evaluating data, as we stress in section 2.4.3. Thus, we think intuitions can and should play a role in the research process, but we do not consider them as evidence. We can distinguish between different types of evidence, namely quantitative evidence and distributional evidence. Quantitative evidence is based on numerical or probabilistic observation or inference. The quantification must be precise enough to be independently verifiable. As a consequence, quantifying the observations by means of e.g. the words many or few will not suffice, since these terms are underspecified. 
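To illustrate what a sufficiently precise quantification might look like, the Python sketch below reports raw counts together with frequencies normalized per 1,000 tokens, instead of labels such as many or few. The sample sentence, the tokenization rule, and the function name are invented for this illustration and are not drawn from any corpus.

```python
import re
from collections import Counter


def relative_frequencies(text, per=1000):
    """Return ({word: (count, frequency per `per` tokens)}, total token count)."""
    # Naive tokenization: lower-cased runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    table = {word: (count, per * count / total) for word, count in counts.items()}
    return table, total


# Invented sample sentence, used only to show the shape of the output.
sample = "The cat saw the dog and the dog saw you"
table, total = relative_frequencies(sample)
for word, (count, rate) in sorted(table.items(), key=lambda item: -item[1][0]):
    print(f"{word}: {count} of {total} tokens ({rate:.0f} per 1,000)")
```

Reporting the count, the sample size, and the normalization base together is what allows another researcher to verify the figure independently.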
In the classic linguistic sense, distributional evidence is empirical in the sense that it can be independently verified that certain linguistic units (be they phonemes, morphemes, or other units) do or do not (tend to) occur in certain contexts. To the extent that such distributional patterns can be reduced to hard, binary rules (e.g. x does/does not occur in context y), distributional evidence is qualitative. However, we also keep the option open that such distributional evidence may be recast in probabilistic terms. Finally, we need to consider criteria for strong and weak evidence, since independent verifiability is a necessary but not sufficient criterion for evidence. We can establish the following hierarchy of evidence:
(i) More is better: a larger sample will yield better evidence than a small one, other things being equal.
(ii) Clean is better than noisy: clean, accurate, and well-curated data will yield better evidence than noisy data, i.e. data with (more) errors.
(iii) Direct evidence is better than evidence by proxy: it is better to measure or observe directly what is being studied, rather than through some proxy or stand-in for the object of study.
(iv) Evidence that rests on fewer assumptions (be they linguistic, philological, or mathematical) is preferable, other things being equal.
It is obvious from the list above that some of the statements in the hierarchy will conflict. For instance, the ‘more is better’ requirement (i) will almost always conflict to some degree with the requirement for precise, well-curated data (ii). This implies that the end result will always be some kind of compromise, which entails that perfect, incontrovertible evidence is a goal that can be approximated, but never fully reached. We believe this is an important consideration, since no numerical method can salvage bad data. Instead, the realization that all data sets are imperfect to some degree breeds humility and ushers along the need to explicitly argue for the strength of the evidence, independently of the strength of the claims being made on the evidence. The next section deals with claims.
Claim
We follow Carrier (2012) in considering anything that is not evidence a claim. A claim can be small or large in its scope, and it may rest directly on evidence, or it may rest on other claims. A claim must always rest on evidence, directly or indirectly, to be valid. The following are examples of different types of scientific claims of variable complexity and scope:
• Classification: x is an instance of class y.
• Hypothesis: we assume x to be responsible for an observed change y.
• Model interpretation: based on the model w, x is related to y by mechanism z.
• Conclusion: we conclude that x was responsible for bringing about z.
All claims are subject to a number of constraints discussed further in section 2.2. However, we want to stress the distinction between evidence and claims, as it is fundamental to the subsequent principles. In particular, we consider linguistic frameworks (sometimes called ‘linguistic theories’) to be series of claims which cannot be admitted as evidence for other claims. It also implies that such frameworks are subject to the same standards of evaluation as other claims (see section 2.2).
Truth and probability
Following chapter 2 of Carrier (2012) we consider a claim, be it a classification or a hypothesis, to be a question of truth. However, the truth value of such a claim, e.g. x belongs to class y, can be stated in categorical or probabilistic terms. We choose to think of the truth value of claims about the past in probabilistic terms, since there is always a risk that we are mistaken, even in the most well-established claims. For sure, the probability may be vanishingly small, but it may still exist. Furthermore, such probabilities about the truth value of claims can be interpreted in at least two ways:
(i) As facts about the world (i.e. physical probabilities).
(ii) As beliefs about the world (i.e. epistemic probabilities).
Carrier (2012) makes the distinction above in the context of an explicitly Bayesian statistical framework. Bayesian statistics is a branch of statistics that considers probabilities as subjective and as degrees of belief (Bod, 2003, 12). So-called frequentist statistics will tend to conceptualize probabilities as long-term relative frequencies of events in the world. The distinction is discussed in more depth in Hájek (2012). For our purposes it is sufficient to say that when we talk about the probability of some claim being true, we are talking about the epistemic probability, i.e. how likely we are to be correct when we claim that x belongs to class y. Although the difference between (i) and (ii) is sometimes overstated, there is a real difference between claiming that ‘8 out of 10 times in the past verb x belonged to conjugation class y’, versus claiming that ‘if we assign verb x to conjugation class y, the probability that we are making the correct classification is 0.8’. The latter statement is explicitly made contingent on our knowledge and our argumentation in a manner that is different from and better than the former case.
Historical corpus
In this book, we are concerned with historical corpora and define them as any set of machine-readable texts collected systematically from earlier stages of extant languages or from extinct languages. We follow Gries (2009b, 7–9) in defining a corpus as a collection of texts with some or all of these characteristics:
(i) Machine-readable: the corpus is stored electronically and can be searched by computer programs.
(ii) Natural language: the corpus consists of authentic instances of language use for communicative purposes, not texts created for making a corpus.
(iii) Representative: representativity is taken to refer to the language variety being investigated.
(iv) Balanced: the corpus should ideally reflect the physical probabilities of use within some language variety.
These characteristics are ideals, even for corpora based on extant languages. To create a balanced and representative corpus of extinct language varieties is in most cases not a realistic aim. Therefore we do not take these to be necessary and sufficient features for what constitutes a (historical) corpus. In fact, we agree with Gries and Newman (2014) who consider the notion of a ‘corpus’ to be a prototype-based category, with some corpora being more prototypical than others. However, the definition above is clearly also too broad, since it extends to other types of text collections that are not normally considered corpora in the strict sense (Gries, 2009b, 9). For instance, a text archive containing the writings of a single author would fulfil criterion (i) and criterion (ii), but could not lay claim to representativity beyond the author in question. Gries (2009b, 9) argues that in practice the distinction
between corpora and text archives can be diffuse, and for our purposes we take the representativity criterion in (iii) to be sufficient to rule out many text archives from the definition. A more pressing exclusion are perhaps collections of examples. As Gries (2009b, 9) points out, any such collection is prone to errors and omissions, and it is doubtful what it can be taken to be representative of. For this reason, we follow Gries (2009b) in excluding example collections from the definition of what constitutes a corpus. This exclusion also applies to collections based on examples from historical corpora or quotations, since such text fragments are by definition handpicked and presented outside the communicative context they can be said to be representative of. Thus, our definition of a corpus only includes machine-readable, natural, representative (within limits) text that has been systematically sampled for the purpose of the corpus. This is not to say that example collections cannot be useful, but we exclude them for the purpose of terminological clarity. Finally, we exclude word lists (or sememe lists of cognates based on the Swadesh lists, see section 3.3) since they fall short of the requirement of texts collected for natural, communicative purposes. The notion of ‘historical corpus’ can also be problematic, since it is not clear exactly how historical a corpus needs to be in order to count as ‘historical’. We are inclined to take a pragmatic approach to this question and consider as historical in the broad sense any corpus that either covers an extinct language (or language variety), or that covers a sufficient time span of an extant language variety that it can be used diachronically, i.e. to detect trends (see also section 2.2). We would also stress that annotation of corpora and analysis of data are two separate and independent steps of the research process. 
The annotation step could for instance involve enrichment from other linked external resources, not necessarily corpora. The relationship between data, corpora, and annotation is discussed further in Chapter 4.

Linguistic annotation scheme

By linguistic annotation scheme we mean the set of guidelines that instruct annotators on how to annotate linguistic phenomena occurring in a corpus according to a specific format. Such schemes rely on certain theoretical assumptions and usually contain a set of categories (tags) that are to be applied to the corpus text. An example of a linguistic annotation scheme is the set of guidelines for the annotation of the Latin Dependency Treebank and the Index Thomisticus Treebank (Bamman et al., 2008). Section 4.3.3 gives a full description of annotation schemes. In our framework we do not impose constraints on the particular schemes to be used, as long as they are explicit, and allow the annotators to interpret the text consistently and map it to the predefined categories.

Hypothesis

By hypothesis we mean a claim that can be tested empirically, i.e. through statistical hypothesis testing on corpus data. Hypotheses can come from previous research, logical arguments, or intuition, and, as long as they can be tested empirically, they have a place in our framework.
An example of a hypothesis is the statement ‘there is a statistically significant difference in the relative distribution of the -(e)th and -(e)s endings of early modern English verbs by gender of the speaker in corpus X’. This formulation is a technical one, and there is usually some work involved in going from an under-specified hypothesis, such as ‘the verbal endings -(e)th and -(e)s in early modern English vary by gender’, to an operationalized one as in the example above. For a fuller explanation of hypothesis testing and concrete examples, see section 6.3.4. Generating hypotheses is one of the main steps in the research process, and it helps focus the efforts in the analysis. Instead of considering all possible variables that might remotely affect the phenomenon under study (the so-called strategy of ‘boiling the ocean’), we can concentrate our attention on those factors that are promising, based on what we know of the phenomenon. If the hypothesis is generated from data exploration, it can be defined as data-driven, although the process itself of exploring the data will have relied on some theoretical assumptions, as we explain in section 2.4.3.

Model

As we explained in section 1.2.2, by ‘model’ we mean a representation of a linguistic phenomenon, be it statistical or symbolic. Not all models, however, are allowed in our framework: only those that derive from hypotheses tested quantitatively against corpus data or from statistical analysis of corpus data. An example of such a model is given in Jenset (2013), where the use of the morpheme there in Early English is modelled as a function of the presence of the verb be followed by an NP, and the complexity of the sentence. In section 7.3.3 we provide a full description of a model for historical English verb morphology.
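As a minimal sketch of what testing the operationalized hypothesis about -(e)th and -(e)s might involve, the following Python fragment computes Pearson's chi-square for a 2 × 2 table of verbal ending by speaker gender. The counts are invented purely for illustration and do not come from any corpus, and the function name is hypothetical. The sketch is not the procedure described in section 6.3.4; it only shows how a claim of 'statistically significant difference' reduces to an explicit computation that others can repeat.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson's chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the test statistic and its p-value on 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the upper tail of chi-square is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts (NOT real corpus data): rows = speaker gender, columns = ending.
#        -(e)th  -(e)s
women = (120, 180)
men = (160, 140)

chi2, p_value = chi_square_2x2(*women, *men)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

A single test of this kind considers only one explanatory variable (gender); a model in the sense defined above would normally include further variables.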
Trend

We define a trend as the directional change in the probability of some linguistic phenomenon x over time that is detectable and verifiable by means of statistical procedures (Andersen, 1999). In other words, a trend cannot be established by impressions or intuitions. Furthermore, it can only be counted as a trend if reliable and appropriate statistical evidence can be presented to back it up. By ‘trend’, we mean the combination of innovation and spread of a linguistic phenomenon. For a linguistic change to happen, a speaker (or a group of speakers) needs to create a new form (innovation), and for this to be more than a nonce formation, the use of such a form needs to spread and be adopted more broadly. For example, the use of ‘like’ in quoted speech must have been an innovation at first, and was then adopted by a broader set of people until it became established in current spoken English. We believe that linguistic innovation is best dealt with probabilistically, although this does not mean that our framework is incompatible with categorical views of language innovation. When a new linguistic form is used for the first time (or according to the terminology of Andersen (1999), it is ‘actualized’ in a speaker’s usage), it will differ from the old form in some aspect, for example in a semantic
sense, or in a morphological realization. According to a categorical view of language, this difference will be displayed as an opposition between the ‘old’ and the ‘new’ category; for example, the English affix -ism may be used as a noun as well (see ‘ism’ in the sentence We will talk about the isms of the 20th century). The innovative usage consists of the nominal use and the opposition is between the two categories. According to a non-categorical view of language, the innovative form could for example be characterized by a ‘fuzzy’ nature which can be described as more noun-like than affix-like. We argue that both the categorical view and the non-categorical view are compatible with a probabilistic modelling of the linguistic innovation. In the categorical view, we can describe the innovative use of ‘ism’ in terms of a low probability of the affix category and a higher probability of the noun category. In the non-categorical view, we can describe this innovation as change along a continuum, so that, for example, the innovative form ‘ism’ is found in contexts more similar to those of ‘theory’ (e.g. following a determiner) than those of ‘-ian’ (e.g. as a morpheme following ‘Darwin’).

On the other hand, the spread of new linguistic behaviours among speakers through genres, linguistic environments, and social contexts, is a time-dependent phenomenon. The old form and the new form will coexist for a period of time, thus realizing synchronic variation, and there will be a more or less rapid adoption of the new form by the language communities. This can be described as a shift in probabilities, and it is clear that language spread should be dealt with in probabilistic terms. Quantitative multivariate analysis of corpus data allows us to measure the evidence for the spread of a linguistic phenomenon, and the effect of different variables on it.
This way, it is in principle possible to model the way an innovation is increasingly used by a community; in section 7.3 we will provide a concrete example of this in a study on English verb morphology.
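To make the idea of a statistically verifiable 'shift in probabilities' more concrete, the sketch below applies the Cochran-Armitage test for a linear trend in proportions to counts of an innovative form across successive periods. This test is chosen here only because it is compact; it is one of several possible procedures and is not presented as the method used in section 7.3. The counts are hypothetical, as are the function and variable names.

```python
import math

def cochran_armitage(new_counts, totals, scores=None):
    """Cochran-Armitage test for a linear trend in proportions over ordered periods.

    new_counts[i] = occurrences of the new form in period i
    totals[i]     = occurrences of old + new form in period i
    Returns the z statistic and a two-sided p-value (normal approximation).
    """
    if scores is None:
        scores = list(range(len(totals)))  # equally spaced periods: 0, 1, 2, ...
    n_all = sum(totals)
    p_bar = sum(new_counts) / n_all  # pooled proportion of the new form
    t_stat = sum(t * (r - n * p_bar) for t, r, n in zip(scores, new_counts, totals))
    s1 = sum(n * t for t, n in zip(scores, totals))
    s2 = sum(n * t * t for t, n in zip(scores, totals))
    variance = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n_all)
    z = t_stat / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: the same proportions (20%, 30%, 40%) with small and large samples.
z_small, p_small = cochran_armitage([2, 3, 4], [10, 10, 10])
z_large, p_large = cochran_armitage([200, 300, 400], [1000, 1000, 1000])
print(f"small samples: z = {z_small:.2f}, p = {p_small:.3f}")
print(f"large samples: z = {z_large:.2f}, p = {p_large:.3g}")
```

The two calls use identical proportions, so a plot of either would show the same rising line; only the sample sizes differ, and with them the strength of the evidence for a trend.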
2.2 Principles

Figure 2.1 shows the diagram of the research process in our framework, and is based on the entities defined in section 2.1.3 and the principles illustrated in that section. As shown in Figure 2.1, the aim of quantitative historical linguistics is to arrive at models of language that are derived quantitatively from evidence. Such a definition of ‘model’ includes statistical models and their linguistic interpretation. Section 7.2 will outline the steps of this process in a linear way, and we will describe these steps in more detail throughout this book. In the present section we describe the basic principles of quantitative historical linguistics, which are valid within the scope defined above. The principles are inspired by, and in some cases adapted from chapter 2 of Carrier (2012), a work advocating the use of statistical methods in history. However much history and historical linguistics
[Figure 2.1: diagram. Box labels: Historical linguistic reality; Primary sources (documents etc.); Secondary sources (grammars, dictionaries, etc.); Linguistic annotation schemes*; Models*; Intuition; Hypotheses; Examples; Annotated corpora*; Quantitative distributional evidence*]
Figure 2.1 Main elements of our framework for quantitative historical linguistics. Boxes are entities, arrows are actions or processes; asterisks mark terms for which we use our definitions (see section 2.1.3). The dashed line from models to the (lost) historical linguistic reality implies an approximation.
have in common, the differences are nevertheless sufficiently great to warrant a reframing of the issues to fit into the context of historical linguistics. The adoption of those principles allows for improved communication between scholars regarding claims and evidence, which in turn will make it easier to resolve contentious claims by means of empirical evidence. However, such a resolution is only possible to the extent that historical linguists agree with and adhere to the principles presented below. For this reason, the first issue deals with the question of consensus in the historical linguistics community.

2.2.1 Principle 1: Consensus

To achieve the aim of quantitative historical linguistics research, it is necessary to reach consensus among those scholars who accept the premises of quantitative historical linguistics.

The basic premise for all the following principles is that the aim, indeed the duty, of historical linguists is to seek consensus. However, consensus is only valuable to the extent that it reflects an empirical evidence base. We therefore limit the consensus to those scholars who accept the basic premises of empirical argumentation, as it is grounded in the concepts of evidence and claims (section 2.1.3). Since we consider
these principles fundamental to empirical reasoning about historical linguistics, no consensus would be possible without them, even in theory. This means that the effort of creating consensus without a common ground of fundamental principles is probably going to be futile. The requirement of seeking consensus might seem overly optimistic and even a negative constraint to the development of the field. However, all serious scholars already abide by the consensus principle to some limited extent, by submitting their research articles to a scientific peer review. This particular type of consensus does not necessarily extend beyond the peer reviewers, the editors, and the scope of journals, but the principle remains the same: all research is ultimately an attempt to influence others by making claims grounded in some form of evidence. Without the requirement to seek consensus, any claim could in principle be made and defended by resorting to some private standard of evidence and argumentation. In contrast, the consensus requirement provides an impetus to follow the principles of quantitative historical linguistics as closely as possible, since this will help to persuade other scholars of the validity of the claims being made. However, the principle cannot be understood as an injunction to achieve consensus, only to seek it, since consensus by definition must involve more than one researcher. A hypothetical objection to the principle might be that it constrains creativity and development of the field. However, we view the matter differently. We agree with the argument made in chapter 2 of Carrier (2012) that when we have no direct access to historical realities, our best approximation must be the consensus among the experts in the field, in this case historical linguists. Naturally, experts may be mistaken, but on the whole we must assume that their beliefs and claims are accurate, given the current state of knowledge in the field. 
This final refinement of the point is crucial, since the consensus by definition must rest on what has been discovered and argued up until the present. Hence, new claims will always be in a position to challenge the consensus. But to challenge the consensus is to seek its amendment. When facing a new, possibly controversial, claim that goes against the current consensus, the experts in the field must evaluate the claim according to the empirical principles. If the claim is solid enough, the consensus will shift to accommodate it. Similarly, any claim might have gaps that require fixing before other historical linguists will accept it. After such modifications, the claim might be strong enough to alter the reigning consensus. We consider those claims that are too weak to persuade other experts in the field to be of no interest. If a creative, controversial claim cannot persuade those who are experts in the field, then it is questionable whether it can bring the field forward. Thus, we do not consider a plurality of claims regarding historical linguistics to be an aim in itself, but only a means of providing suggestions for altering the current consensus.

2.2.2 Principle 2: Conclusions

All conclusions in quantitative historical linguistics must follow logically from shared assumptions and evidence available to the historical linguistics community.
Following the definition of evidence in section 2.1.3, a piece of research is empirical if it relies on empirical evidence that is observable and verifiable independently of the researcher and her attitudes and beliefs. Above all, intuitions (even those stemming from long, in-depth study of the material under scrutiny) are inadmissible as evidence. This principle is supported by the previous one: beliefs and intuitions are not independently verifiable, hence they do not form a good basis on which to build consensus. This is not to say that intuitions do not belong in historical linguistics research; quite the contrary. Such intuitions can be a very valuable starting point for insightful research. However, the intuitions can never be more than a starting point, or guidance, for creating hypotheses and deriving empirically testable predictions from them.

2.2.3 Principle 3: Almost any claim is possible

Every claim has a non-zero probability of being true, unless it is logically or physically impossible.

We consider this insight from Carrier (2012) to be a key principle when evaluating claims regarding historical linguistics. Carrier (2012) points out that almost any claim about the past has some probability of truth to it, with the exception of claims that are logically impossible (such as ‘Julius Caesar was murdered on the Ides of March and lived happily ever after’) or physically impossible (such as ‘Julius Caesar moved his army from Gaul into Italy on a route that took them via the Moon’). We consider this statement equally applicable to historical linguistics as to history. Another way of phrasing the principle is that identifying sufficient conditions is not enough to establish a strong claim. A very similar point is made by Beavers and Sells (2014) who argue that since linguistic data can support many conclusions, it is not enough to find data that support the claims we wish to make.
It is also necessary to consider all the other claims those same data might support, that is, what is the evidence against our chosen interpretation of the data. The take-home message in both cases is that the set of all possible claims (i.e. physically and logically possible) contains both profitable and misleading claims, but both these types of claims can be supported by historical linguistic data, albeit to different degrees. It follows from this principle that a claim that ‘fits the data’ in historical linguistics is nearly worthless unless it is further substantiated. Such a claim could be a very strong one, or it might have an associated probability so small that it would be indistinguishable from zero for all practical purposes. The subsequent section discusses the problem of ranking claims that all have a non-zero probability of being true.

2.2.4 Principle 4: Some claims are stronger than others

There is a hierarchy of claims from weakest to strongest.

It follows from principle 3 that all possible claims in historical linguistics have some probability of being true, ranging from completely implausible to extremely well attested and likely. In other words, there exists a hierarchy of claims where some
claims stand above others. For instance, the claim by Emonds and Faarlund (2014) that Old English simply died out and was replaced by a variant of Norse (making modern English genealogically North Germanic) has very little support in the data and is hence an extremely weak claim, as demonstrated by Bech and Walkden (2016). Since the claim that Middle English evolved from Old English (albeit from other dialects than the dominant West Saxon variety of Old English) is based on much stronger evidence, it takes precedence over the replacement argument. Essentially, not all claims are created equal, and even if some kind of historical linguistic data can be made to fit a claim, this is in itself unsurprising and constitutes an insufficient ground for accepting that claim. The key question then becomes what distinguishes a weak claim from a strong one. The following principle will dig further into the problem of how to rank claims.

2.2.5 Principle 5: Strong claims require strong evidence

The strength of any claim is always proportional to the strength of evidence supporting it.

Section 2.1.3 dealt with how we can judge the strength of the evidence. Here we spell out the relationship this has to claims and their strength. Carrier (2012) argues, correctly in our view, that evidence based only on a small number of examples is very weak. Furthermore, when a claim is a generalization, its supporting evidence must consist of more than one example. That is, the evidence for any generalization that goes beyond the observed piece of data must consist of more than one observation. Such arguments follow from the principle that the strength of a claim is proportional to the evidence backing it up. Since no claim is stronger than the evidence supporting it, the nature of the supporting evidence is key. Other things being equal, more evidence implies stronger support for a claim, as we stated in section 2.1.3. However, the principle is not only about finding strong evidence.
The opposite also applies: if your evidence is weak, your claims ought to reflect this fact. In some cases a weak claim is all that can be supported by a body of evidence. In this situation, we feel that the adage ‘better an approximate answer to an exact question, than an exact answer to an approximate question’ applies. That is, if the combination of a research question and some data only allows a weak or tentative conclusion, then this should be explicitly acknowledged without attempts to overstate the results. In historical linguistics this means that in some cases certain generalizations might be impossible. As the statistician John Tukey phrased it, ‘the combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data’ (Tukey, 1977). This applies to historical linguistics as much as to statistics. The example from section 2.2.4 about the typological status of English within the Germanic language family is also relevant here. Since the evidence provided by Emonds and Faarlund (2014) is narrowly focused on one area (syntax) and is also very sparse, the evidence is clearly not proportional to the claims being made, as demonstrated by Bech and Walkden (2016).
2.2.6 Principle 6: Possibly does not entail probably

The inference from ‘possibly’ to ‘probably’ is not logically valid.

In section 2.2.3 we argued that merely fitting the data is not sufficient for accepting a claim. A special case that deserves its own principle is the logical fallacy that Carrier (2012) describes as ‘possibly, therefore probably’. The principle can be made clearer when recast as probabilities, where the notation P(x) means ‘probability of x’ for some claim x:

• If P(x) > 0, x is possible.
• If P(x) is close to 1, x is probable.
• If P(x) = 0.01, x is possible but not probable.
• If P(x) = 0.99, x is both possible and probable.
Put differently, all probable claims are possible, but not all possible claims are probable. The example-based approach described in section 1.3.1 should only be associated with claims about events being possible or not; in order to state anything about their probability value, quantitative data and systematic analysis are required. We turn again to the claim that Old English died out and that Middle English descended from Norse. We certainly agree with Emonds and Faarlund (2014) that this is a possible scenario. The process of languages falling out of use and being substituted by others, possibly with some substrate influence from the language falling out of use, is clearly possible. However, since all logically and physically possible claims have a non-zero probability of being true, it is trivial to state that Old English might have died out and been replaced with a variant of Norse. The possible does not automatically entail the probable because probable claims are only a subset of all the possible claims. Thus the argument, ‘this might have been the case, therefore it was probably the case’ is logically invalid without further supporting evidence. It also follows from section 2.2.3 that the set of possible claims is very, very large since it is constrained only by the physically and logically impossible. This in turn raises the question: in the absence of stronger corroborating evidence, why privilege one particular possible claim out of the much larger set of other possible claims? To present a possible claim as probable without sufficient evidence, whether by arbitrariness or sheer wishful thinking on the part of the researcher, does not support that claim. In particular, such an inference cannot adequately support a conclusion, as discussed in the next section.

2.2.7 Principle 7: The weakest link

The conclusion is only as strong as the weakest premise it builds on.
This principle entails that any conclusion will be evaluated by its weakest point, not its strongest. This may sound counter-intuitive, because surely we want the strongest evidence to inform our claims. The reason can be traced back to the principle that
any claim that is physically or logically possible has a non-zero probability of being true (section 2.2.3). The great number of possible interpretations of evidence from the linguistic past thus enables us to find individual strong arguments in favour of a conclusion. However, the conclusion might nevertheless be undermined by a number of weak premises.

2.2.8 Principle 8: Spell out quantities

Implicitly quantitative claims are still quantitative and require quantitative evidence.

One of the key aims of the principles outlined here is to enable a fair evaluation of claims about historical linguistics in terms of quantities and frequencies. However, such an evaluation is only possible when the quantification is spelled out. Terms such as those in the following list are ambiguous and should be avoided when presenting evidence in historical corpus linguistics: few, little, rare, scarce, uncommon, infrequent, some, common, frequent, normal, recurrent, numerous, many, much. The list is obviously not exhaustive, but it illustrates words that represent quantities and frequencies in a subjective and coarse-grained manner. They are subjective because what counts as few or many depends on the circumstances and the person doing the evaluation. They are coarse-grained because it is difficult to compare the quantities they designate. Is an ‘uncommon’ phenomenon equally scarce as something that is ‘infrequent’? Or is it perhaps more common? Or less? Such quantification is hard to evaluate and verify independently and hence violates the fundamental requirement that the evidence for a claim must be objectively accessible to all researchers in the field. This is not to say that such words cannot be used, but they render an argument less powerful by making it opaque.

2.2.9 Principle 9: Trends should be modelled probabilistically

Quantitative historical linguistics can rely on different types of evidence, but only quantitative evidence can serve as evidence for trends.
In section 2.1.3, we defined trends in explicitly probabilistic terms. The approach defined here is deliberately agnostic about whether language is inherently based on probabilities, or categorical rules, or some combination of the two. However, a trend should be modelled as a probabilistic, quantitative entity since it denotes a directed shift in variation over time. Sample sizes may vary at different points along a time line, which makes statistical tools the correct choice for identifying and evaluating trends. Linked to this point is the question of adequate statistical power. Thus, having three points connected by a straight line pointing upwards does not qualify as a trend unless this line can be shown to be both statistically significant and a good fit to the data. Like any claim in historical linguistics, a proposed trend is subject to the principle in section 2.2.3 that any claim has a greater than zero probability of being true, provided that the claim is not logically or physically impossible. Any claim about a possible
trend is consequently liable to a number of errors: the trend might not be a trend at all, but merely random variation. Or the claimed trend might represent wishful thinking or biased attention on the part of the researcher (as pointed out by Kroeber and Chrétien 1937, 97); the data for the claimed trend might give the appearance of a trend due to inadequate sampling procedures, and so on. In other words, the requirement that a trend be verified by statistical means is an insurance against overstating the case beyond what the data can back up.

2.2.10 Principle 10: Corpora are the prime source of quantitative evidence

Corpora are the optimal sources of quantitative evidence in quantitative historical linguistics.

Above we defined corpora as sources of quantitative data (section 2.1.2). We also defined quantitative variation (including variation implicitly stated by means of words like much or few) as subject to quantitative evidence (principles 8 and 9). However, we reserve a separate principle for the statement that quantitative evidence in historical linguistics should come from corpora. This is not to say that quantitative evidence cannot come from other sources; there are clearly other possible sources for quantitative evidence (see section 2.1.3). However, when available, corpora should always be the preferred source of quantitative evidence for a number of reasons:

(i) Corpora (as defined in section 2.1.3) have a better claim to be representative than other text collections, other things being equal.
(ii) Publicly available corpora allow reproducibility, to the extent that they are available to the research community.

Thus, following principle 4 (section 2.2.4), we consider a claim based on quantitative evidence coming from corpora stronger than a claim that is not based on corpus evidence, as long as the two claims are equally capable of accounting for the relevant facts.

2.2.11 Principle 11: The crud factor

Language is multivariate and should be studied as such.
For the purpose of historical linguistic research, we consider language, and language use, to be an inherently multivariate object of study. Bayley (2002, 118) explains a similar ‘principle of multiple causes’ as the need to include multiple potentially explanatory factors in an analysis, since it is likely that a single factor can explain only some of the observed variation in the data. In other words, it is essential to be open to a potentially large number of explanatory variables for any linguistic phenomenon. This principle does not imply that this is inherent in language, only in language as an object of study. From this principle it follows that a large number of potential explanatory variables should be considered. This is consonant with principle 3 (section 2.2.3), since finding a single variable that is correlated with the phenomenon being
studied is trivial. The real aim in quantitative historical linguistics is to find one or more variables that are more strongly correlated with the phenomenon being studied, compared to other potential variables. In essence, this guards against spuriously positive results, since we aim to build counter-arguments into the quantitative model. Doing so protects against what Meehl (1990, 123–7) calls the ‘crud factor’, or ‘soft correlation noise’, since many factors involved in language will be correlated with each other at some level. Stacking them up against each other helps separate the wheat from the chaff.

2.2.12 Principle 12: Mind your stats

Quantitative analyses of language data must adhere to best practices in applied statistics.

From principle 11 it follows that statistical methods are required to distinguish the more important correlations from the less important ones. Bayley (2002, 118) describes this as the ‘principle of quantitative modeling’, which implies calculating likelihoods for linguistic forms given context features. This implies that multivariate statistical methods, such as regression models or dimensionality reduction techniques, are typically required. For instance, a single multivariate regression model with all relevant variables is superior to a series of individual null-hypothesis tests, since the latter do not take the simultaneous effect of all the other variables into account and are vulnerable to false positive results by testing the same data several times over. Testing the same data over and over again with a null-hypothesis test such as Pearson’s chi-square is a little like having several attempts to hit the bull’s eye in darts: more attempts make it more likely to get a statistically significant result, but the approach artificially inflates the strength of the claim.
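The darts analogy can be given a rough numerical form. The sketch below computes the probability of obtaining at least one false positive when several null-hypothesis tests are each run at the same significance level. It assumes, for simplicity, that the tests are independent, which repeated tests on the same data typically are not, so the figures are illustrative rather than exact. The function names are hypothetical, and the Bonferroni threshold is included only as one well-known correction.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive in n_tests independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

for k in (1, 5, 10, 20):
    rate = familywise_error_rate(0.05, k)
    print(f"{k:2d} tests: P(at least one false positive) = {rate:.3f}, "
          f"Bonferroni threshold = {bonferroni_threshold(0.05, k):.4f}")
```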
Furthermore, as Gelman and Loken (2014) make clear, null-hypothesis tests are often under-specified, a point also raised by Baayen (2008, 236), which means that in practice the hypotheses being tested can often be supported or refuted by the data in more than one way. Moreover, comparing null-hypothesis tests is conceptually difficult. Although the p-values may look comparable, they actually represent a series of alternative hypotheses, each of which has been compared against a null hypothesis (Gelman and Loken, 2014). This is not to say that we proscribe the use of simple null-hypothesis tests in quantitative historical linguistics; we merely consider them to provide weaker evidence than multivariate techniques in those cases where a multivariate approach is possible and gainful. Similarly, the direct interpretation of raw counts, or what Stefanowitsch (2005) calls ‘the raw frequency fallacy’, constitutes the weakest form of quantitative evidence, since such numbers are void of context. Without a frame of reference, it is impossible to judge objectively (see the requirement that all evidence be accessible to all linguists) whether an integer is large or small. The direct interpretation of proportions or percentages also needs to be done with care. Proportions can be misleading since they can inflate small quantities unless accompanied by the actual number of observed instances. Furthermore, the proportion constitutes a point estimate, i.e. a single number
that in reality comes with an error margin attached. When we perform a formalized null-hypothesis test and the test compares observed and expected frequencies, we account for such error margins. If we interpret proportion data directly, they should be accompanied by a confidence interval, as we exemplify in section 1.5.
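To illustrate why a proportion should be reported with its error margin, the sketch below computes a Wilson score interval for a binomial proportion. The Wilson interval is one common choice and is used here as an assumption; it is not necessarily the method exemplified in section 1.5. The function name and the counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    successes: number of observations showing the feature of interest
    n: total number of observations
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1.0 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # The same proportion (30 per cent) based on very different sample sizes.
    print(wilson_interval(3, 10))
    print(wilson_interval(300, 1000))
```

With these hypothetical counts, 3 out of 10 gives an interval of roughly 0.11 to 0.60, whereas 300 out of 1,000 gives roughly 0.27 to 0.33: the point estimate is identical, but the evidence behind it is not.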
2.3 Best practices and research infrastructure

In section 1.3.4 we highlighted some of the problems with common practices in the historical linguistics research process. In this section we will outline our proposed solutions, which are meant to accompany the principles outlined above and create the context for an infrastructure that facilitates and optimizes research achievements in quantitative historical linguistics.

2.3.1 Divide and conquer: reproducible research

As we will see in more detail in Chapter 5, documentation and sharing of processes and data are at the core of our framework. Transparency in the research process facilitates reproducibility of the research results, as well as their generalization to other data sets, thus advancing the field itself. Moreover, if the process is transparent, it is easier to credit all the people who participated in it, including those responsible for gathering and cleaning the data, and building language resources like corpora and lexicons, an aspect that is still undervalued in the historical linguistics community. Replicability is also aligned with principle 1 (section 2.2.1), which stresses the importance of consensus in quantitative historical linguistics. Transparency (and therefore replicability and reproducibility) is achieved by documenting the data and phases of the research process and by making them available. In addition to being transparent about the research methodology used, corpora, data sets, metadata, and computer code¹ should be made publicly available whenever possible and appropriate. In the case of historical data, questions of privacy are rarely a problem, so compared to other fields of study historical linguistics is in a fortunate position in this respect. Once we have taken all steps to ensure transparent and reproducible results, and have made the data openly available, the research practice can move beyond the scope of an individual study to that of a larger, collaborative effort.
Each study may still concentrate on just one aspect of the process (design of a resource or generalization of previous results, for example), while aiming to document the tools and data sets and make them available to the community. Efforts in this direction have already had some success, for example in the case of the Perseus Digital Library² and the Open Greek and Latin Project.³ The Open Science Framework (https://osf.io/) offers a platform for managing research projects in an open way, facilitating reproducibility and data sharing; see the page https://osf.io/ek6pt/ for one such project dealing with Latin usage in the Croatian author Paulus Ritter Vitezović. We believe that such an approach will allow the field of historical linguistics to move forward in a less fragmented way than it has so far.

¹ Generally speaking, using code/scriptable tools like Python and open formats like csv instead of tools with graphical user interfaces and proprietary formats like Excel is essential for reproducibility.
² http://www.perseus.tufts.edu/hopper/.
³ http://www.dh.uni-leipzig.de/wo/projects/open-greek-and-latin-project/.

2.3.2 Language resource standards and collaboration

Since the time of the comparative philologists, historical linguists have often resorted to gathering their own data. Although this is sometimes warranted (and even the only available option), historical linguistics as a scientific endeavour would benefit from a greater reliance on reuse of existing resources, and on the creation of publicly available standardized corpora and resources, whenever reuse is not an option. Electronic resources like lexical databases (WordNet, FrameNet, valency lexicons) provide valuable information complementary to corpora. Such resources are still not widely used in historical linguistics, partly for epistemological reasons and partly for technical reasons, as argued by Geeraerts (2006). Our framework provides historical linguists with the methodological scaffolding to incorporate computational resources into their research practice. As we will argue more extensively in Chapter 5, the design, creation, and maintenance of language resources should be a crucial component of the work of historical linguists, and in order to maximize their reuse and compatibility, language resources should be developed in the spirit of linked open data (Freitas and Curry, 2012), when possible. Reusing resources means that conclusions and results can more easily be replicated and tested by other researchers, which is a crucial point of our framework (see section 2.3.1).

Moreover, if a study on a specific linguistic phenomenon is carried out on a resource built in an ad hoc fashion, there is always the lingering doubt that the results were influenced by the choice of data. Conversely, if the results are obtained from a pre-existing resource or corpus, they are less likely to have been influenced by factors directly related to the research in question. A greater reliance on reuse gives an impetus to creating corpora also for less-resourced languages (McGillivray, 2013). In spite of these benefits, gathering and annotating data is costly in terms of time and resources. The labour-intensive tasks involved in creating language resources often involve technical expertise which is not normally part of standard linguistics training; therefore, the development of language resources is an interdisciplinary team effort and is at the core of the collaborative approach to research that we propose. If a group creates a resource that is well documented, has a standard format, and is compatible
and interoperable with other resources (for example a historical lexicon of named entities), this makes it possible for another group to build on this work and either enrich the resource itself, or use its data (alone or in combination with data from other sources) for further analyses. If such analyses are well documented, they will be more likely to be reproduced by others, who can check the validity of the results, generalize them further to other data sets, or add more components to the analysis. However, researchers do not currently have sufficient incentives to spend time on building a corpus or other language resources. We believe that the publication of such resources ought to carry substantial weight in terms of academic merit, as much as the publication of studies carried out on them.

2.3.3 Reproducibility in historical linguistics research

In sections 1.3.1 and 1.3.3 we considered the major weaknesses concerning certain research practices in historical linguistics. In this section we will broaden the perspective to cover the issues of transparency, replicability, and reproducibility and their impact on the field of historical linguistics in general. Section 1.3.1 dealt with the negative effects of the lack of transparency in the evidence sources employed in historical linguistics research. As a matter of fact, the issue of transparency concerns all phases of the research process, from data collection to annotation and analysis. Making all phases of the research process more transparent has a number of benefits. First, it makes it possible to replicate the research results obtained by a study in the context of other studies dealing with the same data, method, and research questions. This increases the chances of detecting omissions and correcting errors.
Second, transparency forms the basis for generalizing the research results, thus advancing the field itself: this generalization can involve applying the same method to a different data set or extending the approach. For example, a researcher can test alternative approaches based on the data from a reproducible piece of research. Third, transparency ensures that the work involved in building a data set (for example a historical language resource) is visible, and therefore acknowledged and credited appropriately. Considering the emphasis on publishing research articles that report on analyses of particular phenomena or formulation of theories, this level of transparency on the data behind the analysis would encourage more researchers to dedicate their time to building language resources, which play an essential role in advancing the field.

The issue of lack of transparency is, of course, not unique to (historical) linguistics, and has very negative consequences that in some fields like medicine extend well beyond the academic community and directly affect people’s lives.⁴ Although it does not affect human lives, this issue is current in linguistics research as well, as demonstrated, for example, by the recent special issue of the journal Language Resources and Evaluation dedicated to the replicability and reproducibility of language technology experiments. Transparency (and therefore replicability and reproducibility) is achieved in two main steps: by describing the data and phases of the research process, and by making such data and processes available, which we will discuss in more detail in the next sections.

⁴ For an example of how current this issue is in medicine and psychology, see https://www.nature.com/news/first-results-from-psychology-s-largest-reproducibility-test-., respectively.

Documentation

As we saw in section 1.3.1, research papers in historical linguistics dedicate a lot of space and attention to the theoretical framework(s) and the final results of the research, as well as to linguistic examples, either as illustration of the phenomenon studied or as the evidence base of the analysis. However, little attention is usually dedicated to the following aspects, in spite of the crucial role they play in the research process: how the data were collected, how the hypotheses were formulated and tested, which variables were measured (if any), how the analysis was carried out. For example, Bentein (2012) presents the details of his data collection criteria in the footnotes, and describes the corpus used in four lines (Bentein, 2012, 175). As there are no agreed standards on how to build and annotate a corpus, how to carry out the analysis, and how to report the results, we argue that the following guidelines would significantly increase the level of transparency in historical linguistics research.

• Include references to the resources (including corpora) used, with exact locations and URL links.
• Specify the size of the corpus or linguistic sample(s) used.
• Describe how the corpus/sample was collected by detailing the inclusion/exclusion criteria.
• Detail the annotation schema used, even when the researcher performed the annotation as a by-product of the subsequent analysis.
• Add information about the analysis methods employed and their motivation, as well as the statistical techniques, programming languages and software used (with version number).
• Give details of the different analyses performed (including the ones that did not lead to the desired results), to eliminate the risk of ‘cherry-picking’ results that conform to the researcher’s expectations.
• Add all relevant information to allow the reader to interpret and reproduce the data visualizations.

Sharing and publishing research objects

Being transparent about the research methodology used is very important, but may not ensure full replicability of the results when the work is complex. Therefore, it is important that the corpora, data
sets, metadata, and computer code on which the research was based are made publicly available whenever this is possible and appropriate from an ethical point of view. Evidence from a study on 516 biology articles published between 1991 and 2011, reported in Vines et al. (2014), has shown that informally stored data associated with published works disappear at a rate of around 17 per cent per year. Even though we do not have evidence of this kind for (historical) linguistics research, it would not be surprising if a similar pattern were found. Access to the data and the process behind a research work is essential, and should be ensured in a systematic way and through platforms that researchers can use. There are a number of repositories for language resources, including corpora, lexica, terminologies, and multimedia resources. One (non-free) catalogue of such resources is available through the European Language Resources Association (ELRA).⁵ Another example is CLARIN (Common Language Resources and Technology Infrastructure),⁶ a large repository of language resources. Examples of research data repositories which are not specific to linguistics but are widely used in the sciences are Figshare⁷ and Dryad.⁸ Figshare allows researchers to upload figures, data sets, media, papers, posters, and presentations; data deposited in Figshare receive a digital object identifier (DOI), which makes them citable. The most commonly used repository designed to track versions of computer code, attribute it to its authors, and share it is GitHub.⁹ Specific to the humanities, Humanities Commons¹⁰ is a platform for sharing data and work in progress and constitutes a positive example of this sharing attitude.

An interesting publishing model that is gaining popularity among the scientific community is concerned with so-called ‘data journals’. Such peer-reviewed publications collect descriptions of data sets rather than traditional article publications reporting on theoretical considerations or the results of particular studies. Such citable ‘data descriptors’ or ‘data papers’ receive persistent identifiers and give publication credits for the authors. The methodological importance of such data publications consists in allowing other researchers to use the data described and benefit from them, and ensuring that scientists who collect and share data in a reusable fashion receive credit for that. Examples of open access data journals in the scientific domain are Scientific Data¹¹ and Gigascience.¹² One notable example in the humanities is the Research Data Journal for the Humanities and Social Sciences¹³ published by Brill in collaboration with Data Archiving and Networked Services.

⁵ http://catalog.elra.info/.
⁶ http://clarin.eu/.
⁷ http://figshare.com/.
⁸ http://datadryad.org/.
⁹ https://github.com/.
¹⁰ https://hcommons.org/.
¹¹ http://www.nature.com/sdata/.
¹² http://www.gigasciencejournal.com/.
¹³ http://dansdatajournal.nl/.
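The documentation guidelines listed earlier in this section could also be captured in a machine-readable record that is deposited alongside the data and code. The following is only a sketch of one possible shape for such a record, written with Python's standard library; every field name, value, and URL in it is a hypothetical placeholder rather than an established standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Resource:
    name: str
    url: str  # exact location of the resource


@dataclass
class Software:
    name: str
    version: str  # version numbers matter for reproducibility


@dataclass
class AnalysisRecord:
    description: str
    method: str
    reported_in_paper: bool  # False for analyses that 'did not work'


@dataclass
class StudyDocumentation:
    resources: List[Resource]
    corpus_size_tokens: int
    inclusion_criteria: List[str]
    annotation_schema: str
    software: List[Software]
    analyses: List[AnalysisRecord] = field(default_factory=list)


# Hypothetical example values, for illustration only.
doc = StudyDocumentation(
    resources=[Resource("Example Historical Corpus", "https://example.org/corpus")],
    corpus_size_tokens=250000,
    inclusion_criteria=["prose texts only", "texts with a known date of composition"],
    annotation_schema="dependency treebank annotation (placeholder description)",
    software=[Software("Python", "3.11")],
    analyses=[
        AnalysisRecord("verb-argument frequencies", "regression model", True),
        AnalysisRecord("first attempt with a smaller variable set", "regression model", False),
    ],
)

# Plain JSON is an open format that can be archived together with the data.
record = json.dumps(asdict(doc), indent=2)

if __name__ == "__main__":
    print(record)
```

Listing the analyses that were not reported next to those that were is one simple way of addressing the ‘cherry-picking’ concern raised in the guidelines.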
2.3.4 Historical linguistics and other disciplines

In spite of their clear connections, historical linguistics and historical disciplines like history and archaeology have largely followed separate paths (Faudree and Hansen, 2014). To take a concrete example, with the exception of corpora created for historical sociolinguists, early historical corpora contained very limited metadata about the texts themselves, and focused primarily on annotating linguistic features. In section 5.2 we argue for a stronger interaction between historical linguistics and history, and make a case for a stronger connection between historical language resources and other resources (like collections of information on people or places). This strengthened link has the potential to enrich historical linguistics research by accounting for the sociohistorical context of the language data in a direct way. Linked data provides a valid way of achieving this because it allows linguistic corpora to be connected with general resources on various aspects of the historical context of the texts. This enables a more historically accurate investigation of the language and facilitates interdisciplinary efforts, which would benefit historical linguistics research. In Chapter 4, we also make a case for cooperation between historical linguistics and digital humanities. In particular, the Text Encoding Initiative has established standards for annotating a range of information on texts and their contexts. This type of annotation would make traditional corpus annotation more exhaustive and therefore allow corpus analyses to consider a wider range of properties of texts and their context; this, in turn, would make the linguistic results more comprehensive.
2.4 Data-driven historical linguistics

In section 1.3 we stressed the importance of using corpora as the evidence base for research in historical linguistics. We dedicate this section to defining ‘corpus-driven’ and ‘data-driven’ in the context of the methodological framework we propose, and to explaining how this approach interacts with linguistic theory.

2.4.1 Corpus-based, corpus-driven, and data-driven approaches

Once we have established the necessity of using corpus data as evidence sources for the historical linguistics investigation, we need to clarify how this evidence (which by definition relates to individual instances of language use, or parole in Saussurian terms) relates to more general statements about language as a system (or langue, to follow Saussurian terminology). Are corpus data going to support general claims about language, or will they determine them? Will the investigation start from corpus data, or from theoretical statements, or a combination of the two? According to a terminology that is well established in corpus linguistics (Tognini-Bonelli, 2001, 65), corpus-based approaches involve starting from a research question
and testing it against (annotated) corpus data, often by analysing selected examples. Theoretical hypotheses play a prominent role in this approach, and corpus data are used to support (or more rarely refute) them; therefore, we could categorize such approaches as ‘confirmatory’. On the other hand, ‘corpus-driven’ approaches (Tognini-Bonelli, 2001, 85) rely on unannotated corpus data with the aim of revising existing assumptions about language that pre-dated electronic corpora; in fact, annotated corpora are seen as sources of skewed data because they reflect such pre-existing assumptions. In corpus-driven approaches the corpus evidence is the starting point of all analyses and needs to be reflected in the theoretical statements, which makes the primary focus of such approaches exploratory. The researcher draws generalizations from the observation of individual instances of unannotated corpus data to theoretical statements about the language system. In other words, corpus-driven approaches aim ‘to derive linguistic categories systematically from the recurrent patterns and the frequency distributions that emerge from language in context’ (Tognini-Bonelli, 2001, 87).

Rayson (2008) proposes the use of the term ‘data-driven’ as a compromise between the ‘corpus-based’ and the ‘corpus-driven’ approaches contrasted above. His starting point is the automatic annotation of two corpora by part-of-speech and semantic fields; then, he conducts a quantitative analysis of the keywords extracted from the two corpora. At this point, in his model the researcher’s contribution consists in examining qualitatively ‘concordance examples of the significant words, POS and semantic domains’ (Rayson, 2008, 528) to formulate research questions. This way, the research questions arise from the qualitative analysis of quantitatively processed data from automatically annotated corpora, rather than from theoretical hypotheses, as in corpus-based approaches.
In this book we will employ the term ‘corpus-driven’ in a sense that is different from the ones outlined above. We will accept the confirmatory view according to which corpus analyses can test hypotheses from previous theories (and we discuss the term ‘theory’ in section 2.4.3), but we also allow for exploratory views in which such hypotheses emerge directly from corpus data. Moreover, unlike in the traditional definition of ‘corpus-driven’, we consider annotated corpora as legitimate sources of evidence. Finally, we do not consider it acceptable to analyse selected examples from corpora in order to test theoretical statements, as done in large part in corpus-based research. In our definition, ‘corpus-driven’ will refer to those approaches whereby evidence from (annotated) corpus data is collected systematically, usually with automatic means. This evidence (whose size is typically relatively large) undergoes a systematic and exhaustive quantitative analysis. Such analysis aims at testing theoretical hypotheses (in confirmatory studies) or at formulating new ones (in exploratory studies). ‘Data-driven’ refers to the same procedure as ‘corpus-driven’ defined above, but extends it to other types of data in addition to corpus data, for example metadata on authors,
genres, geography, or data from language resources and other resources. Because historical corpora necessarily contain some form of metadata, any ‘corpus-driven’ methodology is also ‘data-driven’. Doing linguistics research in this data-driven quantitative way accounts for the variability in language use and lends itself to a usage-based and probabilistic view of language, whereby frequency distributions are built into the language models (Penke and Rosenbach, 2007b, 20). However, as explained by de Marneffe and Potts (2014, 8), as we discussed in section 1.2.1, and as we will see in Chapter 6, corpus research is compatible with non-probabilistic approaches as well, because the statistical evidence collected from corpora may reflect the interaction of various discrete phenomena.

2.4.2 Data-driven approaches outside linguistics

The emphasis on data-driven approaches is common to a general movement affecting a range of disciplines, particularly in the sciences and in business. The terms ‘data-intensive’ (Hey et al., 2009) and ‘data-centric’ science (Leonelli, 2016) have acquired specific senses in today’s scientific context. They refer to approaches characterized by large-scale networks of scientists, a focus on open data sharing and data publishing, and a drive towards large collections of data employed as evidence sources in research. Following a similar trend, the business world has witnessed an exponential growth in the demand for data scientists and a general shift towards data-centred attitudes in organizations in recent years. Data-driven approaches are increasingly employed in designing business strategy by relying on large-scale analyses of data on users’ behaviour and preferences, as well as data from internal systems, including workflow and sales databases (Redman, 2008). Mason and Patil (2015, 10) define the ‘data scientific method’ for organizations as follows:

1. Start with data.
2. Develop intuitions about the data and the questions it can answer.
3. Formulate your question.
4. Leverage your current data to better understand if it is the right question to ask. If not, iterate until you have a testable hypothesis.
5. Create a framework where you can run tests/experiments.
6. Analyse the results to draw insights about the question.
This list of steps highlights the importance of data exploration in the initial phases (steps 1 and 2) and largely overlaps with the exploratory approaches we referred to in section 2.4.1. Examples from the scientific world and the business world highlight a general trend in society, which may be explained by the cultural innovations driven by new technologies, as we discussed in section 1.3.4. Since linguistics research does not
happen in isolation, but interacts with this changing external context, we believe that keeping in mind this broader perspective can help us to understand and frame data-driven approaches in historical linguistics as well. However, one important difference between business applications and research in historical linguistics is the ultimate aim of the investigation. Historical linguistics, understood as an empirical enterprise, aims to model and ultimately explain language phenomena in the past. This theoretical aim has implications for the data-driven research process we propose, as we see in the next section.

2.4.3 Data and theory

Given the explanatory aim of historical linguistics, the corpus-driven framework we propose must be compatible with the creation of a historical linguistics theory, understood as a system of laws for historical languages and language change which allows us to explain and predict phenomena affecting these languages. As a matter of fact, the term ‘theory’ is used quite generously in linguistics to refer to annotation schemes like HPSG, X-BAR, LFG, dependency grammar, construction grammar, approaches like distributional semantics, or other formalisms; we agree with Köhler (2012, 21) when he states:

there is no linguistic theory as of yet. The philosophy of science defines the term ‘theory’ as a system of interrelated, universally valid laws and hypotheses [. . . ] which enables to derive explanations of phenomena within a given scientific field.
Data and theory interact in complex ways in corpus-driven historical linguistics. In this section we will examine individually what we consider the main aspects of this interaction in the context of the data-driven approach to historical linguistics research we propose. We will present these different aspects one by one for reasons of clarity, although we recognize that they often occur together and interact.

Theory in data representation

In spite of their intended objectivity, whenever some data are collected as part of a research study, by necessity they reflect a specific way of understanding, representing, and encoding the recorded entities or events. Therefore, they are tied to a particular historical moment and theoretical views. As Tognini-Bonelli (2001, 85) summarizes very effectively, ‘[t]here is no such thing as pure induction’ and even corpus-driven approaches (in the sense she defines) acknowledge this. Let us imagine that we have taken records of daily measurements of the temperature. In order to make sense of the pairs of numbers and characters collected, we would need to read them as temperatures (e.g. in degrees centigrade) and day–month–year triples, if that is the way we decided to represent dates. When it comes to linguistics research, the notations chosen for representing and collecting the corpus data play an important role in any subsequent analysis. In the case of annotated corpus data, the annotation is always performed with reference
to a specific notational framework and, therefore, annotated data will reflect that viewpoint, which we may call ‘theoretical’ (de Marneffe and Potts, 2014, 17). If the annotation includes part of speech, for instance, it has to rely on definitions for the different part of speech categories and how these labels apply to the language or variety in question. If the corpus is a treebank, we may want to choose a phrase-structure model or a dependency-based model for representing syntactic structures, and this choice will depend on our preferences, the features of the language annotated, as well as other considerations. In a similar vein, a corpus that has not been annotated will still need to be interpreted according to a specific theoretical perspective in order to form the basis for any subsequent linguistic analysis.

Theoretical assumptions

In addition to the way we represent the entities that we want to analyse and their context, whenever we carry out a data analysis we rely on a set of assumptions, which we may call ‘theoretical’, too. Let us go back to the example of daily temperatures. When we collect and then interpret the data, we need to keep in mind that they are limited to a specific range, so that if we spot a measurement of −400, for instance, we can quickly identify it as an error. In this case, any data-driven analysis would only make sense if we have access to the domain knowledge concerning temperatures on the earth. When we annotate and then analyse a corpus, in addition to the notational framework chosen, we rely on a set of assumptions on which there is general consensus among linguists: for example that nouns in French are inflected by number and gender, that verbs in Latin can display different endings depending on their person, number, tense, voice, etc.
When we analyse verb data in a treebank, for instance, we assume that verbs do not occur with their arguments in a random way, but that they display specific syntactic and lexical–semantic preferences according to their argument structure. This kind of domain knowledge also supports the design and interpretation of exploratory analyses. The choice of which variables we decide to study will need to make sense according to this domain knowledge or in the context of specific hypotheses we want to test. To take a slightly absurd example, we may collect data relative to a number of events happening by a beach and we may find a strong correlation between the number of shark attacks in a day and the amount of ice cream sold in the same day. However, our domain knowledge tells us that, rather than concluding that buying more ice cream increases the chances of being attacked by a shark, we could hypothesize that both variables are correlated with (and possibly caused by) the number of visitors to the beach in that day. As Köhler (2012, 15) states, even exploratory approaches need some theoretical grounding: It is impossible to ‘find’ units, categories, relations, or even explanations by data inspection— statistical or not. Even if there are only a few variables, there are principally infinitely many formulae, categories, or other models which would fit in with the observed data.
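The beach example can be sketched numerically. The figures below are invented toy data, constructed so that both ice cream sales and shark attacks rise with the number of visitors while their leftover variation is unrelated; the helper functions are hypothetical names, and the residual-based partial correlation is one simple way (an assumption on our part, not a method taken from the sources cited) of ‘controlling for’ a third variable.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, x):
    """What is left of y after removing a straight-line effect of x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Invented toy data for eight days at the beach.
visitors = [100, 200, 300, 400, 500, 600, 700, 800]
ice_creams_sold = [60, 90, 140, 210, 260, 290, 340, 410]
shark_attacks = [2, 3, 2, 3, 4, 5, 8, 9]

# Raw association between the two 'effects'.
raw = pearson(ice_creams_sold, shark_attacks)

# Association that remains once the number of visitors is accounted for.
partial = pearson(residuals(ice_creams_sold, visitors),
                  residuals(shark_attacks, visitors))

if __name__ == "__main__":
    print("raw correlation:", round(raw, 3))
    print("partial correlation given visitors:", round(partial, 3))
```

In this constructed data set the raw correlation is strong, but it vanishes once visitors are taken into account. Note that the code cannot tell us that visitor numbers are the variable worth controlling for: that choice comes from domain knowledge.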
For instance, in McGillivray (2013, 127–78), the author studied the change of the argument structure of Latin verbs prefixed with spatial preverbs, and particularly the opposition between prepositional constructions and bare-case constructions. This phenomenon involves the interplay of a range of variables, including morphological, lexical, and semantic features such as the type of verbal prefix, the particular spatial relation expressed in conjunction with the verb, the semantics of the verb, and the case of the verbal argument. McGillivray (2013, 169–72) employed an exploratory approach to deal with the complexity of the phenomenon and to measure the contribution of various variables to it. She resorted to exploratory data analysis (Tukey, 1977; specifically CA, see section 6.2), which aims at letting the model ‘emerge’ from the data. However, the author chose the set of variables based on a combination of findings from previous research and linguistic domain knowledge.

Data and theoretical hypotheses

We have seen that exploratory approaches to historical linguistics analysis need access to domain knowledge and need to be theoretically grounded. But, of course, theory plays a crucial role in confirmatory approaches as well, which are essential to the progress of any empirical research. When approaching corpus data with a theoretical hypothesis, it is important to avoid the risk of confirmation bias, which would lead us to only find positive evidence of the claims we intend to make. To address this issue, McEnery and Hardie (2012, 15) define the principle of total accountability according to which we should always aim at using the entire corpus, or at least random samples when the corpus is too large. This way we can satisfy the criterion of falsifiability, identified by Popper (1959) as the defining feature of the scientific method.
If we follow the principle of total accountability, we are very likely to employ quantitative analysis techniques, as manual analysis is often inadequate to deal with the size and complexity of the data. By relying on the systematic evidence from a corpus, corpus-driven approaches can address the question of whether a phenomenon is attested by finding occurrences of certain patterns or constructions. However, finding few examples of such patterns in a corpus in itself does not guarantee that these are not annotation errors, typos, or other anomalies; this is also true in the case of historical texts, for which spelling and other elements often do not follow standards and cannot be easily categorized because they are captured at the moment in which they undergo diachronic change. A systematic quantitative account of corpus evidence based on all available data, together with a theoretical model and the effort to make our results consistently replicable (Doyle, 2005), can help to avoid spurious conclusions in these situations and increase their validity in an empirical context (McEnery and Hardie, 2012, 16). Corpus data may also support statements about possible but unseen phenomena, by relying on seen events and statistical estimation, coupled with domain knowledge (Stefanowitsch, 2005; de Marneffe and Potts, 2014, 11). Corpora, paired with a theoretical model, can also predict that a phenomenon is impossible (or has a negligible
probability of occurring). For example, Pereira (2000) used a statistical model trained on a newspaper corpus to predict that colourless green ideas sleep furiously is about 200,000 times more probable than furiously sleep ideas green colourless, thus addressing Chomsky’s (1957) challenge. As Pereira’s study illustrates, corpora can also address probabilistic hypotheses about language, as well as binary ones. This is explored in more depth in Chapter 6. As an example of such a probabilistic hypothesis from historical linguistics, in the context of the diachronic change in the argument structure of Latin prefixed verbs, McGillivray (2013, 127–78) formulated various hypotheses including the following, concerning one of the constructions that pertain to these verbs, specifically the bare-case construction:

Construction 1 [ . . . ] is significantly more frequent in the archaic age and in works by poets than in the later ages and in prose writers.
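As a rough illustration of how a frequency hypothesis of this kind can be confronted with counts, the following Python sketch runs a Pearson chi-square test of independence on a 2×2 table (period by construction) and reports an odds ratio as a simple effect size. The counts are invented for the example and are not McGillivray's data; the function name is hypothetical, and the choice of the odds ratio as effect-size measure is an assumption, since the text does not say which measure was used.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts.

    Returns (chi2, p_value, odds_ratio). No continuity correction is applied.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X >= chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Invented counts, for illustration only.
# Rows: archaic age, later ages. Columns: Construction 1, other constructions.
counts = [[60, 40],
          [90, 210]]

chi2, p_value, odds_ratio = chi_square_2x2(counts)
print(f"chi-square = {chi2:.2f}, p = {p_value:.2g}, odds ratio = {odds_ratio:.2f}")
```

A small p-value on its own says only that the association is unlikely under independence; the odds ratio indicates how strong the association is, which is why the two are reported together here.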
This hypothesis operationalizes a generalizing statement in terms that can be addressed by corpus data. McGillivray (2013) tested the hypothesis above with a statistical significance test (chi-square test; see section 6.3.3 for details on this test), and obtained a confirmation of the hypothesis, together with a measure of the size of the detected effect. The process involved all available corpus data and therefore fulfilled the principle of total accountability, which is fundamental to corpus-driven approaches. This way, the results contributed new quantitative evidence which can support more general theoretical models.

Combining data and linguistic approaches

Following Köhler (2012, 2), we define linguistic theory as a series of connected claims from which predictions about historical languages can be made. As we have seen, our framework includes theoretical hypotheses, properly tested against corpus data. From a series of such contingent statements corresponding to tested theoretical hypotheses, we can proceed towards formulating theoretical models of the historical linguistic phenomenon at hand. By this term we mean those generalized explanations of observed phenomena that some linguists call ‘theories’. Our framework does not impose restrictions on which particular models can be derived from this process, nor on the ontological setup that allows this generalization step from contingent claims to theoretical models. Our main concern is with the way this process is performed. In the rest of this book we will provide more details of this process; in particular, Chapter 6 will give some concrete examples of how this can be realized in practice. Thus, our framework is not meant to replace other approaches to historical linguistics rooted in e.g. generative theory or traditional comparative linguistics. We consider work on X-bar theory, grammaticalization, and language history equally compatible with our framework.
A linguistic description can be characterized as a hypothesis (Carnie, 2012, §3.1). As mentioned above, the key characteristic of a good hypothesis
is that it is falsifiable and that it has predictive power; it also needs to be tested against alternative hypotheses, as stressed by Beavers and Sells (2014). The reason for this is captured in our principle number 3 (section 2.2.3): since almost any claim is possible, merely fitting a hypothesis to the data is insufficient. Instead, the hypotheses must be compared and tested against data. This ought to be uncontroversial, and both Carnie (2012) and Beavers and Sells (2014) are rooted in generative theory, thus demonstrating that such hypothesis testing is not restricted to probabilistic approaches to linguistics. We go beyond Carnie (2012) and Beavers and Sells (2014) in insisting that such hypothesis testing and comparison in historical linguistics ought to be done quantitatively using corpus data, whenever possible. Furthermore, we argue that multivariate techniques for quantitative modelling are superior to others, due to the complex nature of language. We also see this focus on quantitative techniques and corpus data as a means to compare results across linguistic frameworks, hence our emphasis on an empirically based consensus (see section 2.2.1) informed by appropriate statistical techniques (see section 2.2.12). In short, the present framework extends commonly accepted guidelines for constructing linguistic arguments. The framework takes explicit issue with hypothesis testing in historical linguistics by means of intuitions and qualitative judgements about frequencies (as well as quantitative arguments that do not follow state-of-the-art standards). It is precisely this methodological focus that makes our framework compatible with different paradigms in historical linguistics.
Corpora and quantitative methods in historical linguistics

Introduction

Historical linguistics and quantitative research have enjoyed a long and tangled coexistence over the years. It must be stressed that any attempt to paint a picture of a gradual, one-directional diachronic shift from qualitative to quantitative methods in historical linguistics is an oversimplification; not even a particularly useful one. Instead, we would like to repeat the image of the chasm separating the early innovators and visionaries from the majority or mainstream, discussed in Chapter 1. Looking back at the history of quantitative and corpus methods in historical linguistics through the lens of the chasm model, we can compare the degree to which quantitative corpus methods are used within the groups defined in the chasm model. For instance, the early adopters would correspond to roughly 16 per cent of the potential users. A technology adopted by the early majority would bring the total up to about 50 per cent, whereas including the late majority too would mean that the technology has reached more than 80 per cent of potential users. This is essentially an empirical question (contingent on the validity of the chasm model). As this chapter will show, in the case of historical linguistics, quantitative corpus technologies have not transitioned much beyond the early stages of the adoption curve. However, we also want to better understand why these methods have failed to transition from the ranks of early innovators to the majority of linguists practising historical linguistics. It is indisputable that the early models of linguistic change associated with the development of the comparative method, such as the family-tree model and wave theory, largely fall under the rubric of qualitative methodology (Campbell, 2013, 187–90).
The comparative method remains a vital approach to historical linguistics, and Campbell (2013, 471) argues that what he calls ‘mathematical solutions’ to historical linguistic problems are neither necessary nor sufficient, implying that historical linguistics can do without quantitative methods. The chapter on quantitative methods only appeared as a full chapter in the third edition of Campbell’s book, suggesting perhaps a growing need for addressing these methods in historical linguistics, albeit largely to refute
them. Yet the casualness of the refutation also points to the status of the qualitative approaches, and particularly the comparative method, as the hegemon of historical linguistics. That state of affairs undoubtedly stems at least partially from both the success and age of the comparative method (McMahon and McMahon, 2005, 5–14). However, as Campbell’s treatment of quantitative approaches to historical linguistics illustrates, a certain antagonism can also be traced back to early attempts at statistical approaches to historical linguistics, leading to the perhaps surprising conclusion that the full acceptance of quantitative methods in historical linguistics is not only hampered by the novelty of the methods, but also by a somewhat painful previous exposure.
Early experiments

The success of the comparative method notwithstanding, it is possible to find examples of researchers proposing ‘mathematical solutions’ to problems in historical linguistics at least as far back as the nineteenth century. Köhler (2012, 12) claims that ‘in linguistics, the history of quantitative research is only 60 years old’, a claim that appears to be founded on his view that quantitative linguistics in the modern sense began with the work of George K. Zipf in the late 1940s, although Köhler does point out that studies based on ‘statistical counting’ can be found as far back as the nineteenth century (Köhler, 2012, 13). Gorrell (1895), to take but one example, certainly took a quantitative approach in his study of indirect discourse in Old English, with tables displaying counts of constructions appearing every few pages. However, it is paramount to avoid simplistic generalizations, and the overall picture of historical linguistics a century or so ago is above all one of variation. McGillivray (2013, 144–7) discusses a 1914 study of Latin preverbs by Bennett, in which the author chose to classify occurrences above ten as ‘frequent’ without giving the reader access to the actual numbers of occurrences, leaving the reader to guess exactly what evidence underpins such unquantified yet implicitly numerical distinctions as ‘many’ vs ‘most’. Statistics beyond word counts also enjoys some seniority within historical linguistics. Kroeber and Chrétien (1937) calculated correlation coefficients for linguistic features in order to arrive at a statistically based classification of Indo-European languages. Their work was criticized by Ross (1950), who (although generally sympathetic to the approach) took issue with their calculations, favouring instead the chi-square statistic, a test with its own inherent problems when employed in linguistics.
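To give a sense of what a correlation over linguistic features can look like, here is a minimal Python sketch that computes the phi coefficient between two languages described by the presence (1) or absence (0) of a set of features. The text does not say which coefficients Kroeber and Chrétien actually computed, so the choice of phi is an assumption; the function name and the feature vectors are likewise invented for illustration.

```python
import math


def phi_coefficient(features_a, features_b):
    """Phi coefficient between two equally long lists of 0/1 feature values."""
    if len(features_a) != len(features_b):
        raise ValueError("both languages must be coded for the same features")

    # Cells of the 2x2 table: feature present in both, one, or neither.
    both = sum(1 for x, y in zip(features_a, features_b) if x and y)
    only_a = sum(1 for x, y in zip(features_a, features_b) if x and not y)
    only_b = sum(1 for x, y in zip(features_a, features_b) if y and not x)
    neither = sum(1 for x, y in zip(features_a, features_b) if not x and not y)

    denominator = math.sqrt(
        (both + only_a) * (only_b + neither) * (both + only_b) * (only_a + neither)
    )
    if denominator == 0:
        raise ValueError("phi is undefined when a language has constant values")
    return (both * neither - only_a * only_b) / denominator


# Invented feature codings for two hypothetical languages (50 features each).
language_a = [1] * 20 + [1] * 5 + [0] * 5 + [0] * 20
language_b = [1] * 20 + [0] * 5 + [1] * 5 + [0] * 20

phi = phi_coefficient(language_a, language_b)
# For a 2x2 table the Pearson chi-square statistic equals n * phi squared,
# so the two measures are closely linked.
chi2 = len(language_a) * phi ** 2
print(f"phi = {phi:.2f}, chi-square = {chi2:.2f}")
```

Note that the value depends entirely on which features were chosen for coding, which is the point Ellegård stresses about similarity being relative to specifically chosen traits.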
However, both Kroeber and Chrétien and Ross were somewhat pessimistic in their conclusions, prompting a rebuttal from Ellegård (1959). While taking care to emphasize that statistical measures of linguistic similarity refer to similarity with respect to some specifically chosen traits or features (rather than a global, a-theoretical similarity), Ellegård proposed an alternative approach to interlanguage correlations. Ellegård’s conclusion is mainly methodological, but rounds off with the insight that
the application of statistical methods in historical linguistics is not merely a methodological choice. Instead, he outlines a dynamic relationship where quantitative methods spur on theoretical developments, and where statistical methods ‘will require a linguistic taxonomy, will help to establish it, and can be used for bringing taxonomic and developmental studies into fruitful contact’ (Ellegård, 1959, 156). In hindsight it is clear, however, that the impact of statistical methods on historical linguistic theorizing remained limited. The limited impact of statistical methods is perhaps best judged by the insistent tone in the publications advocating their use, which (following the logic from historical studies that what is frequently prescribed by law generally reflects common actual behaviour) quickly raises the image of a besieged minority. At the same time, the arguments are often pithy and convey a message which in many cases remains relevant today. Take the point made by Kroeber and Chrétien (1937, 97), who suggested that the linguist working only with intuition easily becomes biased when the linguist observes a certain affiliation which is real enough, but perhaps secondary; thereafter he notes mentally every corroborative item, but unconsciously overlooks or weighs more lightly items which point in other directions.
The quote, which is a polite way of saying that non-quantitative studies are prone to bias by over-emphasizing rare or unexpected phenomena, has held up well and is in tune with more recent critiques of qualitative methods, such as that raised by Sandra and Rice (1995). The simple psychological fact that the human mind is not well equipped at dealing objectively with relative frequencies in an intuitive way remains a key objection to non-quantitative work in historical linguistics. However, the fact that similar critiques are being made decades after Kroeber and Chrétien says something of their impact, or more precisely lack of such. Occasional arguments for the virtues of a fully quantitative linguistics can be found around the middle of the twentieth century, but their relative rareness as well as their timbre are testament to a lack of impact. Consider the acerbic yet slightly despondent tone in the following observation by Guiraud (1959, 15), published over twenty years after Kroeber and Chrétien: La linguistique est la science statistique type; les statisticiens le savent bien; la plupart des linguistes l’ignorent encore. (‘Linguistics is the typical statistical science; the statisticians know this well; most linguists are still ignorant of it.’)
If the quote from Guiraud suggests ignorance of (and hence lack of involvement in) quantitative research on the part of the linguists, the following passage from Ellegård (1959, 151–2) has the air of a well-rehearsed response to familiar criticism: ‘Even intuitive judgments must be based on evidence. Now if that evidence turns out to be insufficient statistically, it will be insufficient also for an intuitive judgment.’ The comment is poignant, the tone is one of calm reason; however, the implications went
largely unheeded by historical linguistics as a discipline, suggesting that Guiraud and Ellegård were early adopters of a technology that did not quite catch on. There are clearly several reasons for this: first and foremost is probably the undoubted success of the comparative method mentioned earlier, which (following the old adage that if it ain’t broke don’t fix it), must have made the rather tedious mathematical calculations seem subject to diminishing returns. Second, the lack of electronic corpora, desktop computers, and statistical software meant that quantitative work was slow and almost impossible to perform at the large scale where it really comes into its own. Third, the advent of generative linguistics, vividly chronicled in Harris (1993), heralded a period where numerical approaches to linguistics generally were no longer in vogue, or were even regarded with some hostility (Pullum, 2009). And finally, there were the stains left behind by a specific, much revered (and later much reviled) method: glottochronology.
A bad case of glottochronology

In the 1950s linguistics was both changing and expanding with a mature and optimistic sense of security, enjoying ‘measured dissent, pluralism, and exploration’ (Harris, 1993, 37). Such exploration was also taking place in historical linguistics, where Morris Swadesh launched the term glottochronology in the early 1950s; see e.g. Swadesh (1952) and Swadesh (1953). Glottochronology was proposed by Swadesh as an approach to lexicostatistics more generally. The distinction is worth making since lexicostatistics is generally taken to mean statistical treatments of lexical material for the purposes of studying historical linguistics. McMahon and McMahon (2005, 33) offer the following definition of lexicostatistics: ‘the use of standard meaning lists to assess degrees of relatedness among languages’. Campbell (2013, 448), like McMahon and McMahon (2005, 33–4), notes that ‘glottochronology’ and ‘lexicostatistics’ are frequently used interchangeably, but Campbell goes on to claim that ‘in more recent times scholars have called for the two to be distinguished’. However, the attempts at making the distinction are as old as the confusion itself. To Hockett (1958, 529) the two terms appear to have been synonyms, whereas Hymes (1960, 4) argues for a distinction:

Glottochronology is the study of rate of change in language, and the use of the rate for historical inference, especially for the estimation of time depths, and the use of such time depths to provide a pattern of internal relationships within a language family. Lexicostatistics is the study of vocabulary statistically for historical inference. . . . Lexicostatistics and glottochronology are thus best conceived as intersecting fields.
Hymes goes on to point out that lexicostatistics could in fact refer to any numerical study of lexical material, synchronic or diachronic, but that the term has received a ‘specialized association with historical studies’.
At its core, glottochronology operates with three basic elements: word lists, or strictly speaking lists of sememes or ‘meaning lists’ (McMahon and McMahon, 2005, 34), of ‘basic’ vocabulary for the languages to be compared, the number of cognate items within the list, and the retention rate over time (Hymes, 1960, 3). Two lists predominate in the literature, one containing 100 items, the other 200 (Campbell, 2013, 448–51). As subsequent criticism would show, all three variables turned out to have their particular trapdoors, including the problem of defining culturally neutral and replicable versions of the lists themselves, what to count as ‘basic’ (which again had an impact on the cognates), but also the assumption of a constant rate of change. The constant retention rate over 1,000 years, argued to be 86 per cent for the 100-word list and 81 per cent for the 200-word list, was boldly presented as a real, mathematical fact (a physical probability, see section 2.1.3) with evidence ‘sufficient to eliminate the possibility of chance’ (Swadesh, 1952, 455). Glottochronology was met with an initial rush of enthusiasm (Hymes, 1960, 32), and made it into the introductory-course university curriculum in linguistics (Hockett, 1958, 526–35). However, some methodological problems were pointed out both by Swadesh himself and by others. Ellegård (1959, 155) criticized the lexico-statistical method used by Swadesh (1953), commenting that the latter seemed ‘somewhat rash in assuming a uniform rate of development’. Hockett questioned the assumption of a ‘basic vocabulary’, but nevertheless rounded off his introduction of the approach to undergraduate students rather optimistically by stating that ‘no development in historical linguistics in many decades has showed such great promise’ (1958, 534). Many, however, went further, possibly out of confusion, enthusiasm, or both.
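The arithmetic behind these retention rates can be made concrete with a short Python sketch. The dating formula used here, t = ln(c) / (2 ln(r)), with c the proportion of shared cognates and r the retention rate per millennium, is the one usually attributed to Swadesh and Lees; it is not spelled out in the text above, and the function name and the cognate proportion below are invented purely for illustration.

```python
import math


def time_depth_millennia(shared_cognates, retention_rate):
    """Estimated time since two languages split, in millennia.

    Assumes the classic glottochronological formula
    t = ln(c) / (2 * ln(r)), i.e. each language independently retains
    a proportion r of the basic vocabulary per millennium.
    """
    if not 0 < shared_cognates <= 1:
        raise ValueError("shared_cognates must be in (0, 1]")
    if not 0 < retention_rate < 1:
        raise ValueError("retention_rate must be in (0, 1)")
    return math.log(shared_cognates) / (2 * math.log(retention_rate))


# Retention rates per 1,000 years reported for the two standard lists.
RATE_100_WORD_LIST = 0.86
RATE_200_WORD_LIST = 0.81

# Hypothetical proportion of cognates shared by two languages.
shared = 0.50

for label, rate in [("100-word list", RATE_100_WORD_LIST),
                    ("200-word list", RATE_200_WORD_LIST)]:
    years = 1000 * time_depth_millennia(shared, rate)
    print(f"{label}: r = {rate:.2f}, estimated split {years:.0f} years ago")
```

Running the sketch with the same cognate proportion under the two rates gives a feel for how strongly the estimated date depends on the assumed retention rate, and hence on the assumption that the rate is constant.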
Hymes (1960, 4) notes that some academics leaped from the method’s treatment of a narrowly circumscribed basic vocabulary, to endorsing it for tackling the problem of language change at large. In this we can recognize the pitfall pointed out by Moore (1991) regarding any new technology, namely the risk in overselling the ‘vision’ of the new technology before it is sufficiently mature to back up that vision with concrete results. The detailed contemporary critiques summarized in Hymes (1960) cover the now-familiar criticisms against glottochronology, namely problems with basic lists, problems with judging sameness, problems with cultural bias, problems with synonyms, problems with borrowings, problems with taboo words, the problematic assumption of a constant rate of change, as well as specific mathematical problems.1 Although critical, Hymes (1960, 15) nevertheless argued that more research should go into the method, and he approved of its continued application.

1 The core criticism against glottochronology is concisely presented in McMahon and McMahon (2005) and Campbell (2013).

However, problems were mounting. In addition to the problems listed above, different linguists were reporting different results for glottochronological studies of the same languages, as discussed in e.g. Bergsland and Vogt (1962), suggesting that
the method was introducing more vagueness rather than more objective replicability. The central tenet of glottochronology, a universal constant rate of change in basic vocabulary, did not hold empirical water, as shown by Fodor (1961) and Bergsland and Vogt (1962). In one of the studies conducted by Bergsland and Vogt (1962), the authors found that lexical replacement rates in the basic vocabulary list were either far higher (Icelandic) or far lower (Riksmål/Norwegian Bokmål) than those predicted by the model. Fodor (1961), on the other hand, found split dates for Slavic languages that were not only at odds with the comparative method, but also with well-attested historical facts. Add to this further criticism of the mathematics involved (Chrétien, 1962), and the result was a predictable and considerable dampening of the initial enthusiasm. In 1964 Lunt, in an editorial quoted in McMahon and McMahon (2005, 45), declared glottochronology an ‘idle delusion’ and bluntly denied the usefulness of continuing the project. As the 1960s and 1970s went on, glottochronology, and quantitative methods in linguistics more generally, largely fell out of favour, and well-known exceptions such as William Labov’s quantitative studies of sociolinguistic variation implied some opposition to the orthodoxy (Sampson, 2003; Lüdeling et al., 2011). Glottochronology did not reduce the interest for quantitative methods in historical linguistics on its own. As we have seen, structuralist approaches to linguistics were sceptical towards statistical evidence, and that scepticism was inherited and refined by mainstream transformational-generative grammar in subsequent decades (Sampson, 2003; Lüdeling et al., 2011; Gelderen, 2014). Cognitive–functional approaches (Sandra and Rice, 1995) also displayed a lack of attention to statistical methods (Deignan, 2005), which suggests a general tenor of linguistic research that went beyond historical linguistics. 
However, judging from the fact that glottochronology is still being discussed in the context of quantitative approaches to historical linguistics (McMahon and McMahon, 2005; Campbell, 2013; Pereltsvaig and Lewis, 2015), it seems clear that the negative perception of quantitative methods stemming from the failure of glottochronology has endured beyond the method itself. According to Moore (1991), such negative impressions could be a contributing factor when a technology fails to cross the chasm. As we pointed out in section 2.2.6, we cannot go logically from this possibility to concluding that this is in fact the case. We need to take a much richer context into account. In the next section, we turn from quantitative methods to the use of corpora.
The advent of electronic corpora

From a methodological point of view, it is interesting that some of the early publications referred to here, notably Kroeber and Chrétien (1937), Ross (1950), and Ellegård (1959), while predominantly concerned with statistical methods, kept returning to the question of data. Statistical methods in themselves will not yield answers without
appropriate quantitative data, which means that methodological advances in one area are contingent on the other. Ross (1950) built upon the data from Kroeber and Chrétien (1937) and added more data, whereas Ellegård (1959) returned to the question of ‘relative frequencies’ from ‘random samples’ several times when discussing the methodological shortcomings of statistical methods that only deal with binary features. Without reading too much into this, it is clear that even if they do not phrase it in those terms, these authors were intimately aware of the problems caused by the lack of (and to some extent solved by the presence of) today’s electronic annotated corpora. That is not to say that text corpora were something new in the mid-twentieth century. Käding published his 11-million corpus concordance in 1897 (McEnery and Wilson, 2001, 12), and the first half of the twentieth century saw a string of studies relying on corpus linguistic methods, with the early 1950s witnessing both Firth’s work on collocations (Gries, 2006a, 3) as well as Fries’s corpus study of spoken American English (Gries, 2011, 81–2). The 1960s saw the introduction of the so-called first generation of machine-readable corpora whose characteristics today are the defining hallmarks of corpora: electronically stored, searchable, possibly annotated, and with an aim at representativeness. In the field of historical language studies, the Index Thomisticus corpus coevolved alongside the technological development from punched cards to magnetic storage, and finally online publication over its thirty-year construction phase (Busa, 1980). Pioneering work on corpus linguistics continued from the 1950s to the 1980s (McEnery and Wilson, 2001, 20). However, with notable exceptions such as the Index Thomisticus corpus and the Helsinki Corpus of English Texts, these efforts were mainly directed at contemporary languages. 
Today it is perhaps easy to underestimate the financial and technical difficulties facing early corpus builders. As Baayen (2003, 229–30) points out, early computers were few and expensive, which provided both a positive incentive for a formal approach to language, as well as a negative incentive against statistical investigation of large corpora. The case of the Index Thomisticus corpus proves an interesting illustration of the difficulties: it took some thirty years to complete (including adaptation to changing technologies along the way), and it was reliant on large-scale funding from the IBM corporation (Busa, 1980). In the face of such financial and technical obstacles, perhaps it is not surprising that historical linguistics (with its data being considerably less interesting from a commercial point of view) lagged behind in corpus creation, or at least in the modern sense of general, representative, machine-readable corpora. We have already mentioned Käding’s large late nineteenth-century corpus; however, the usefulness of such a corpus would be severely limited by the available (manual) search technology. Thus, the pragmatic pressures imposed by technology for creating and searching relatively large corpora, alongside the financial costs, would naturally favour smaller collections of purpose-built corpora that could be collected and searched manually. For all their merits, such corpora are nevertheless limited in
their usefulness. Organized as lists of sentences, they are difficult to search, except by manually reading each sentence. Organized as collections of index cards, they can take a lot of space and are not easily distributed. The size limitation imposed by storage and searching points naturally in the direction of relatively small, purpose-built specialized corpora, rather than large general ones. Although such specialization can be valuable, it might also limit the potential for reuse. The lack of shared, reusable resources would also mean that each corpus in the worst-case scenario would have to be created afresh for each new project. This view of the situation is perhaps slightly too sombre. As the studies in Kroeber and Chrétien (1937), Ross (1950), and Ellegård (1959) attest to, collections of data could be shared and expanded gradually. Some specialized early corpora have enjoyed much longevity, perhaps most notably the data on the history of the English periphrastic do from Ellegård (1953), which have been reanalysed by Kroch (1989) and Vulanović and Baayen (2007), among others. Nevertheless, the central critiques of such early corpora remain: their specialized nature leads to a proliferation of isolated resources, rather than general ones that are suited for at least a majority of research questions. Furthermore, idiosyncrasies in sampling and annotation might make comparing or merging data sets difficult, a difficulty which would be compounded by a lack of standardized annotation. Although quantitative work based on a corpus methodology was being carried out in historical linguistics prior to the emergence of electronic historical corpora, reduced costs and improved computing power (together with the availability of lessons learned from the efforts to build corpora of contemporary language) meant that by the 1990s the scene was set for mainstream electronic historical corpora.
Return of the numbers

By the end of the 1980s, the stage was set for a growing interest in corpora. The two decades that had passed since the release of the Brown corpus in 1967 had seen a gradual growth in corpus size, as well as a growth in corpus use, including in commercial projects like the Cobuild Dictionary. The evolution of a scientific community which refined and promoted the building and use of corpora was undoubtedly vital. So was another development taking place: the growth of computing power. In computer science, the power of computing hardware (measured by the number of transistors that could fit into an integrated circuit) has been argued to follow what is commonly known as Moore’s law, a prediction made in the late 1960s that computing power would double every two years, that is, grow at an exponential rate. Especially since the computing industry has to some extent calibrated its development efforts to match the law, the law itself is perhaps less interesting than the result, namely a massive growth in computing power at a greatly reduced cost. Figure 3.1 illustrates Moore’s law as a regression line showing the growth in computing power over time on a logarithmic
Figure 3.1. Illustration of Moore’s law with selected corpora plotted on a base 10 logarithmic scale (horizontal axis: year; vertical axis: computing power). Corpora marked with an asterisk (*) are historical.
scale, with some corpora added to the plot according to the year of their release. The data and the code for the figures in this chapter are available on the GitHub repository https://github.com/gjenset. Unsurprisingly, we see a cluster of corpora from the year 2000 onwards. It would be grossly simplistic to claim that computing power alone powered this growth. Corpora are created for a number of reasons, and typically require established research projects (which again require a certain intellectual climate), long-term funding, an ecosystem of tools and standards, and so on. However, keep in mind the observations from Baayen (2003) about how the technological bottleneck of early computing provided an incentive towards formal and non-corpus based approaches to linguistics. Clearly, at the very least, we can hypothesize an interplay between intellectual development and new technological possibilities (see also section 1.3.4). The historiographical problem of deciding the exact causality of this development is obviously outside the scope of this book. It is also secondary to what we consider far more important: the growth in
Figure 3.2. Sizes of some selected corpora plotted on a base 10 logarithmic scale, over time (horizontal axis: year released; vertical axis: corpus size). Corpora marked with an asterisk (*) are historical.
computing power, coupled with easier access and lower price, obviously removed an important bottleneck that was present in the 1950s and 1960s. It is instructive to consider the growth in corpus size, which has also followed an exponential curve during the same period. Figure 3.2 illustrates this by means of a bubble plot. The vertical axis shows the size of the plotted corpora on a logarithmic scale, whereas the bubbles (each representing a corpus) are scaled to be proportional to the corpus size. As the plot shows, there has been a rapid growth in the potential for building large corpora since the 1990s. A caveat is in order here, since the potential for building large corpora does not prevent small corpora from being built. Take for example the syntactically annotated historical corpora in the lower-right corner of the plot. These corpora have remained small for a number of reasons unrelated to computing power: dealing with historical texts, there is only a finite set of data to base the corpus on. Furthermore, the annotation step requires manual coding, since machine-learning
Figure 3.3. Log-linear regression model showing the relationship between the growth in computing power and the growth in corpus size for some selected corpora (horizontal axis: computing power; vertical axis: corpus size). Corpora marked with an asterisk (*) are historical.
algorithms for adding annotation to corpora cannot normally be used with good results on historical texts, without some manually annotated historical data as training material. However, if we remove these historical, syntactically annotated corpora, and fit a log-linear regression model (see section 6.2 for an introduction to linear regression models) relating computing power (on a base 10 logarithmic scale) to corpus size (also on a base 10 logarithmic scale), we find a significant relationship between the two.2 According to this model, every 1 per cent increase in computing power corresponds to a 44 per cent increase in corpus size. The model is illustrated in Figure 3.3. As mentioned earlier, it is impossible to claim that the increases in computing power directly caused corpora to grow, since the creation of corpora depends on much more than computing power alone. However, cheaper and faster computers with more 2
Fdf (1,8) = ., p
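The log-linear model described above, with both variables on a base 10 logarithmic scale, can be sketched as follows. This is only an illustration under stated assumptions: the data points are invented, the function names `fit_loglog` and `pct_change_in_y` are hypothetical, and the fit uses plain ordinary least squares; it does not reproduce the authors' data or code. In a model of the form log10(size) = a + b · log10(power), a 1 per cent increase in computing power multiplies the expected corpus size by 1.01 raised to the power b.

```python
import math

def fit_loglog(xs, ys):
    """Fit log10(y) = a + b * log10(x) by ordinary least squares; return (a, b)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def pct_change_in_y(b, pct_change_in_x=1.0):
    """Percentage change in y implied by a given percentage change in x."""
    return ((1 + pct_change_in_x / 100) ** b - 1) * 100

# Invented (computing power, corpus size) pairs, NOT the data behind Figure 3.3.
power = [1e4, 1e5, 1e6, 1e7, 1e8, 1e9]
size = [1e6, 3e6, 1e7, 3e7, 1e8, 3e8]

intercept, slope = fit_loglog(power, size)
print(f"slope = {slope:.3f}")
print(f"1% more computing power -> {pct_change_in_y(slope):.3f}% larger corpus")
```

The slope is the quantity to read off such a model: it expresses the relative growth in corpus size per unit of relative growth in computing power, independently of the units in which either is measured.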
<line identifier="1">Arma virumque cano, Troiae qui primus ab oris</line>
<line identifier="2">Italiam, fato profugus, Laviniaque venit</line>
<line identifier="3">litora, multum ille et terris iactatus et alto</line>
<line identifier="4">vi superum saevae memorem Iunonis ob iram;</line>
After the opening tag, the tag for the book has an attribute identifier with value 1, which indicates that this is the first book of the work. Then, every line of the poem is enclosed between the opening tag <line> and the closing tag </line>. This example shows how it is possible to annotate structural information in the text. In the next section we will focus on linguistic information.
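As a minimal sketch of why such structural markup is useful, the following Python fragment retrieves a line by its book and line identifiers using the standard-library XML parser. The wrapper tags `<work>` and `<book>` and the helper `get_line` are assumptions made for illustration; only the `<line identifier="...">` elements are taken from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper tags (<work>, <book>); only the <line> elements come
# from the example in the text.
xml_text = """
<work title="Aeneid">
  <book identifier="1">
    <line identifier="1">Arma virumque cano, Troiae qui primus ab oris</line>
    <line identifier="2">Italiam, fato profugus, Laviniaque venit</line>
    <line identifier="3">litora, multum ille et terris iactatus et alto</line>
    <line identifier="4">vi superum saevae memorem Iunonis ob iram;</line>
  </book>
</work>
"""

def get_line(root, book_id, line_id):
    """Return the text of one line, addressed by book and line identifiers."""
    path = f"./book[@identifier='{book_id}']/line[@identifier='{line_id}']"
    element = root.find(path)
    return None if element is None else element.text

root = ET.fromstring(xml_text.strip())
print(get_line(root, "1", "3"))
```

Because the identifiers are explicit attributes, the same lookup works however the text is laid out on the page or in the file.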
Adding linguistic annotation to texts

So far, we have focused on how to represent structural and contextual metadata. Of course, it is essential to represent the content of the text as well, as we will see in this section. Humans are very skilled at identifying implicit layers of linguistic information in language. For example, native speakers of English can easily recognize that the word book in the sentence They are going to book the flight tonight is a verb, and it is a noun in She read that book in one day, although their ability to make the distinction explicit may depend on the level of their grammatical training. When the texts are analysed by a computer, we need to make such information explicit in order to interpret the text and retrieve its elements. For instance, in Example
Table 4.3. Example of metadata and linguistic information encoded for the first three word tokens of Virgil’s Aeneid

Work ID  Title   Token ID  Token form  Lemma  Part of speech  Case        Number
IX001    Aeneid  T00101    Arma        arma   noun            accusative  plural
IX001    Aeneid  T00102    virum       vir    noun            accusative  singular
IX001    Aeneid  T00103    que         que    conjunction     –           –
(1), discussed on page 104, arma is the accusative of the plural noun arma ‘weapons’; que is an enclitic which means ‘and’ and is attached to the end of the word virum, which is the accusative of the noun vir ‘man’. Because this type of morphological information is at the level of individual words (more precisely, tokens in corpus linguistics terms), rather than phrases or larger segments, one way to encode it is to define each row as the minimal analytical unit, i.e. the token, and add new fields called ‘lemma’, ‘part of speech’, ‘case’, and ‘number’, as in Table 4.3. Once we have the information for the whole text, we can run searches on any combination of the fields; for instance, we can retrieve all occurrences of the singular accusative of vir. Alternatively, if we choose to use XML, we can embed every token in the XML presented on pages 105–6 in a new tag, and add the attributes tokenID, lemma, part of speech, case, and number to it, as shown below.
<collection>
<line identifier="1"> Arma virum que ...
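To illustrate how token-level attributes make such searches possible, here is a small Python sketch. The tag name `<token>`, the attribute name `pos` (XML attribute names cannot contain spaces, so ‘part of speech’ has to be abbreviated somehow), and the helper `search` are all assumptions made for this example; the attribute values come from Table 4.3.

```python
import xml.etree.ElementTree as ET

# The tag name <token> and the attribute name "pos" are assumptions made for
# this sketch; the values are those of Table 4.3.
xml_text = """
<line identifier="1">
  <token tokenID="T00101" lemma="arma" pos="noun" case="accusative" number="plural">Arma</token>
  <token tokenID="T00102" lemma="vir" pos="noun" case="accusative" number="singular">virum</token>
  <token tokenID="T00103" lemma="que" pos="conjunction">que</token>
</line>
"""

def search(root, **criteria):
    """Return the forms of all tokens whose attributes match every criterion."""
    return [
        token.text
        for token in root.iter("token")
        if all(token.get(name) == value for name, value in criteria.items())
    ]

root = ET.fromstring(xml_text.strip())
# All occurrences of the singular accusative of vir:
print(search(root, lemma="vir", case="accusative", number="singular"))
```

Any combination of fields can be passed as criteria, which mirrors the idea of searching a table on several columns at once.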
We could also decide to encode other types of linguistic information, like the English translation of every word, their syntactic relations, or their synonyms. In any case, this added information contributes to making such elements searchable; for
example, we can retrieve all instances of the lemma vir in the text by simply limiting the search to the lemma attribute of the tag.

Annotation formats

There are different ways to include annotation in a corpus. In the so-called embedded format, annotation is included in the original text and is displayed in the form of tags. The example below indicates that reading is a participle form: the tag ‘PARTICIPLE’ follows the form reading and is separated from it by a forward slash:

reading/PARTICIPLE

When the units being annotated span more than one token, we need some way of grouping together their elements; this is sometimes achieved by bracketing or nesting tags, as in phrase-structure syntactic annotation. The example below shows a parse tree from the Early Modern English Treebank (Kroch et al., 2004).

( (IP-MAT (NP-SBJ (D The) (N Chancelor))
          (VBD saide)
          (CP-THT (C that)
                  (IP-SUB (PP (P after) (NP (ADJ long) (N debating)))
                          (NP-SBJ (PRO they))
                          (VBD departyd)
                          (PP (P for) (NP (D that) (N tyme)))
                          (, ,)
                          (IP-PPL (CONJ nedyr)
                                  (IP-PPL (VAG falling) (PP (P to) (NP (Q any) (N poynt))))
                                  (CONJP (CONJ nor)
                                         (ADJP (ADJ lyke)
                                               (IP-INF (TO to) (VB com) (PP (P to) (NP (Q any)))))))))
          (. .))
  (ID AMBASS-E1-P2,3.2,25.20))

The phrase structure of the sentence is represented with embedded bracketing corresponding to syntactic constituents, and the leaf nodes consist of tags followed by word forms. ‘IP-MAT’ signals the whole sentence, ‘NP-SBJ’ the subject noun phrase, consisting of a determiner node (‘D’) and a noun node (‘N’); ‘VBD’ is the past tense
verb and ‘CP-THT’ is a complementizer phrase introduced by the conjunction that, while the last node contains the ID of the sentence.8 Another way to represent embedded structures in corpus annotation is by using the XML format, introduced in section 4.3. An example of embedded dependency annotation in XML format is given below and is taken from the Latin Dependency Treebank (Bamman and Crane, 2006):
alexlessie sneil01 millermo
The tags <sentence> and </sentence> indicate respectively the beginning and end of the sentence being annotated; the attributes of the tag indicate various properties of the sentence: the sentence’s unique identifier (id), the identifier of the text (document_id), the portion of the text containing the sentence (subdoc), and the first and last words of the chunk of text (span). Inside the tag sentence, we find the names of the primary and secondary annotators, followed by the words making up the sentence. The tag word marks every word of the sentence. Inside the tag, the attribute id uniquely identifies the word in the corpus, form represents the word form, lemma its lemma, and postag contains a series of codes for morphological features such as part of speech, gender, mood, case, and number. In this example the type of syntactic annotation is relational, as opposed to the structural type of the phrase-structure example from the Early Modern English Treebank. The head attribute contains the ID of the dependency head of each word, while the relation attribute indicates the syntactic dependency relation between the word and its head. For example, the first word of the sentence above is ergo, and is a sentence adverbial (‘AuxY’) depending on the third word of the sentence, i.e. potuit. Its lemma is ergo1
8 For a full description of the tags, see https://www.ling.upenn.edu/hist-corpora/annotation/index.html.
as it is the first (and only) homograph of the lemma ergo in the Lewis–Short Latin dictionary (Lewis and Short, 1879).

In addition to linguistic information, as we noted in section 4.1, it is important to record contextual information about a text; this is sometimes included as part of the corpus annotation itself, as in the Helsinki Corpus. McEnery and Wilson (2001, 39–40) list a document header from this corpus, where, for example, tags indicate an author’s name and gender. Such metadata can then be used by corpus programs to restrict searches by the texts’ attributes as well as by their linguistic content.

So far, we have examined examples of embedded annotation. Standalone annotation retains the annotation information in a separate document, which is linked to the original text. The American National Corpus (Ide and Macleod, 2001) has followed this approach (Gries and Berez, 2015). For example, each word of the sentence We then read is assigned an identifier:
We
then
read
.
Each word is then associated with its part-of-speech tag in the standalone annotation by means of identifiers:
PRONOUN
ADVERB
VERB
PUNCTUATION
Standalone annotation makes it possible to have multiple formats or levels of annotation for the same text. Although standalone annotation is recommended by the standard for corpus annotation (the Corpus Encoding Standard), most corpora have embedded annotation; therefore, in the rest of this chapter we will refer to this type of annotation.

Levels of linguistic annotation

Linguistic annotation is typically performed incrementally, by adding successive layers to the original text, from the most basic ones, such as lemma or part of speech, to the most advanced ones, with semantic and pragmatic information. In this section we will cover these main levels of annotation, with particular attention to the peculiarities of historical corpora.

The challenges of text pre-processing

When building a historical corpus, researchers usually acquire texts held in non-electronic formats. Optical character recognition
(OCR) and direct transcription are the most popular ways to convert the texts into a digital format. Alternatively, manual transcription is an option when automatic methods are not able to reach an acceptable level of accuracy. Automatic and manual transcription are not mutually exclusive options, as the results of an automatic process can be further refined by manual intervention. This approach was the one chosen by the Impact Centre of Competence in Digitization,9 a collaborative network of libraries, industry partners, and researchers working towards the goal of digitizing historical printed texts from Europe’s cultural heritage material. Concerning OCR, Impact has developed OCR software whose results are further improved through the involvement of volunteers via a crowdsourcing interface. Historical texts also present challenges regarding their characters, which typically span a much larger set than those of modern texts. In the history of historical text processing, the lack of a common framework for encoding texts has meant that customized processing tools were created which could not be shared across different systems. Over the past decades, the character encoding standard Unicode has gradually become the universal standard, and its code space now has room for more than one million characters. New characters often need to be added to the Unicode repository, especially to deal with historical scripts, and this is achieved via the Script Encoding Initiative.10 As Piotrowski (2012, 53–60) points out, the wide coverage of Unicode facilitates the sharing of tools and texts across different projects. For an overview of the issues concerning the digitization of historical texts and historical character encoding, see Piotrowski (2012). In the next sections, we focus on the levels of linguistic annotation that can be performed on historical corpora, stressing their features and challenges.
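One practical consequence of adopting Unicode is that the same visible character can be encoded in more than one way, so texts from different projects may need to be normalized before they can be compared. The sketch below uses Python's standard `unicodedata` module; the helper name `same_text` and the sample strings are illustrative assumptions, not part of any project mentioned above.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after Unicode normalization."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

print(composed == decomposed)           # the raw code point sequences differ
print(same_text(composed, decomposed))  # they are equal after NFC normalization

# The long s (U+017F), common in early printed books, is mapped to 's' only
# under compatibility normalization (NFKC), which discards the distinction.
print(unicodedata.name("\u017f"))
print(unicodedata.normalize("NFKC", "\u017faide"))
```

Whether a distinction such as the long s should be preserved or folded away depends on the research question, so the choice of normalization form is itself an editorial decision worth documenting.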
Tokenization

The first step in automatically processing the language in a corpus usually consists of tokenization. Tokenization segments a running text into linguistic units such as words, punctuation marks, or numbers. Once we have identified such units (called tokens), we can perform further levels of annotation. The task of word segmentation is more complex for those East Asian languages like Chinese, Japanese, Korean, and Thai, which do not use white spaces to separate words. This is relevant also to those historical languages that were written in scriptio continua, such as classical Greek and classical Latin, for which the word separation is sometimes disputed by different philological interpretations. Even in languages like English, Italian, or French, where white spaces are used to separate tokens in many cases, we can find several exceptions. For example, the English sequence I’m, the French l’oiseau ‘the bird’, and the Italian l’anguilla ‘the eel’ comprise two tokens each; on the other hand, the English name New York should count as one
9 http://www.digitisation.eu/.
10 http://linguistics.berkeley.edu/sei/.
single token. Moreover, compounds may not require spaces, as in the German compound Computerlinguistik ‘computational linguistics’. Another challenge in tokenization is posed by the different possible uses of hyphens, for example to split a word at the end of a line for typesetting or to join elements of a single complex unit like forty-two. What counts as a token, therefore, depends on the language, the context of use, and further processing. For languages that use a Latin-based, Cyrillic-based, or Greek-based writing system, tokenization is often performed by a combination of rules that rely on white spaces and punctuation marks as delimiters of token boundaries. In addition to applying these general rules, we need to take into account language-specific exceptions drawn from lists of acronyms and abbreviations. For example, such lists for English should contain Dr. and Mrs., because in these cases the dot should be considered part of the token. One challenge with abbreviations is that the same string may be a full word in certain contexts and an abbreviation in others, like in. for inches. Sentence segmentation is another crucial task related to tokenization and can present challenges for historical texts which do not employ punctuation marks consistently. For an overview of such challenges in Latin and Armenian, and the respective solutions adopted to build the PROIEL Project corpus, see Haug et al. (2009, 24–6).

Morphological annotation

From a historical perspective, researchers have expended much effort on written texts, and therefore the morphological, syntactic, semantic, and pragmatic levels have received most of the attention, compared to other levels of annotation such as phonetic/phonemic and prosodic annotation. In this section we will describe morphological annotation in more detail. Morphological annotation is the first layer of annotation that is normally added to raw corpora.
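Returning briefly to tokenization, the rule-based procedure described above (white space and punctuation as delimiters, plus a list of abbreviations) can be sketched in a few lines of Python. This is a deliberately minimal illustration: the abbreviation list, the regular expression, and the function name `tokenize` are assumptions made for the example, not the rules of any particular tool.

```python
import re

# Tiny illustrative abbreviation list; a real one would be language-specific
# and much longer.
ABBREVIATIONS = {"Dr.", "Mrs."}

# A word (optionally with internal hyphens), or any single character that is
# neither a word character nor white space.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split on white space, then peel punctuation off each chunk."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the dot as part of the token
        else:
            tokens.extend(TOKEN_PATTERN.findall(chunk))
    return tokens

print(tokenize("Dr. Smith paid forty-two pounds, Mrs. Jones nothing."))
```

The sketch leaves the harder cases mentioned above unsolved: it cannot tell in. ‘inches’ from the preposition in at the end of a sentence, it does not treat New York as one token, and it splits clitic sequences such as I’m at the apostrophe in a way that would need language-specific rules.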
This layer usually involves spelling variant processing, lemmatization, part-of-speech tagging, and annotation of other morphological features such as number, gender, animacy, and case for nouns and adjectives, degree for adjectives, and mood, voice, aspect, tense, and person for verbs. In this section we will examine the main challenges posed by morphological annotation of historical texts, and how current or past projects have tackled them.

Tackling spelling variation

One major challenge of historical texts relates to the amount of spelling variation they typically contain. First, many historical corpora cover large time spans, during which spelling standards were often lacking and spelling conventions changed. Second, data capture errors and philological issues sometimes make spelling uncertain. For these reasons, a uniform approach to spelling is often not viable for historical texts. The field of NLP has developed tools that generally assume consistent spelling and consequently work well for modern languages, which normally have a much smaller degree of spelling variation than their historical counterparts. When applying NLP tools to historical texts, a common strategy is to normalize historical spellings to their modern equivalents. Normalization can be acceptable for certain applications, such
as retrieving information from historical documents, where the user wants to find the relevant content by searching for a limited number of terms. Normalization requires a set of mappings between the historical variants and the modern ones (if available) or rules that prescribe how to infer one from the other (for example -yng or -ynge endings in place of the modern -ing ending of verbs in English). This approach was adopted by the designers of VARD,11 a spelling analysis tool for early modern English. When such lexicons or rules are not available, we can adopt several different approaches to identify the relationship between spelling variants. One example is the so-called edit distance, which measures the ‘distance’ between two strings as the minimum number of deletions, insertions, and replacements of characters needed to transform one into the other. We can employ similar methods when correcting OCR errors, a common challenge with digitized historical documents. For an overview of this topic, see Piotrowski (2012, 69–83). As an example of the challenges of historical spelling for English, Archer et al. (2003) present a historical adaptation of USAS, the Semantic Analysis System developed by UCREL, the University Centre for Computer Corpus Research on Language of Lancaster University. Because USAS was designed for present-day English, when it was applied to early modern English texts it failed to part-of-speech-tag a number of items. The issues concerned spelling, because some historical variants were not present in the lexicon used by USAS. A straightforward modification of the lexicon to include historical variants would have led to incorrect results; for example, one historical spelling of the verb be is bee, which is also a noun in present-day English.
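The edit distance mentioned above can be sketched with the standard dynamic-programming formulation (Levenshtein distance). The mini-lexicon, the sample variants, and the helper `closest_modern_forms` are invented for illustration; this is not how VARD or USAS are implemented.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum number of single-character deletions,
    insertions, and replacements needed to turn source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement or match
        previous = current
    return previous[-1]

def closest_modern_forms(variant, lexicon):
    """Return the lexicon entries at the smallest edit distance from variant."""
    distances = {word: edit_distance(variant, word) for word in lexicon}
    best = min(distances.values())
    return sorted(word for word, d in distances.items() if d == best)

modern_lexicon = ["be", "bee", "saying", "said", "time"]  # invented mini-lexicon
for variant in ["sayyng", "saide", "tyme", "bee"]:
    print(variant, closest_modern_forms(variant, modern_lexicon))
```

The last variant shows the limitation noted in the text: bee matches the present-day noun exactly, so string similarity alone cannot reveal that it may be a historical spelling of be; contextual rules are needed for that.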
Because of such clashes, the authors decided to keep the present-day lexicon separate from the early modern lexicon, and to create the historical lexicon manually by analysing the items that were not tagged by the part-of-speech tagger. The system then assigned the correct tags to such items based on rules; for example, it would analyse bee as a form of the verb be if it was preceded by a modal verb.

Tagging by part-of-speech

Part-of-speech tagging is a crucial step in annotating corpora. As with other levels of annotation, automatic part-of-speech taggers exist alongside manual systems; however, compared with part-of-speech tagging for modern languages, tagging historical languages presents some specific challenges, as Piotrowski (2012, 86–96) explains. Here we will summarize some of the main solutions devised to perform part-of-speech tagging for historical languages. Machine learning algorithms for part-of-speech tagging have become increasingly popular in recent years. A typical so-called supervised machine learning system for linguistic annotation relies on an annotated corpus used as a training set; the model learns the patterns observed in the training set and subsequently uses these patterns to annotate a new corpus. Following this approach, Passarotti and Dell’Orletta
11 http://ucrel.lancs.ac.uk/vard/about/.
(2010) trained a part-of-speech tagger on the morphologically annotated data from the Latin Index Thomisticus Treebank (61,024 tokens), and automatically disambiguated the lemmas of the Index Thomisticus. Scholars have adopted different solutions with the aim of improving the accuracy of part-of-speech taggers for historical languages. Some, such as Rayson et al. (2007), have used part-of-speech taggers for modern language varieties to analyse historical varieties by modernizing their spelling. Another approach consists in using a part-of-speech tagger for the modern variety of the historical language being studied and expanding its lexicon with historical forms, as Sanchez-Marco et al. (2011) did for Spanish. An alternative method is to first use a modern-language tagger and then incrementally correct it for historical data. This was the approach followed by Resch et al. (2014), who used the modern-German version of TreeTagger (Schmid, 1995) to tag the Austrian Baroque Corpus, a corpus of printed German-language texts dating from the Baroque era (particularly from 1650 to 1750). Given the high number of incorrectly tagged and lemmatized items, they manually corrected a portion of the output of the tagger; they then retrained TreeTagger on the additional training set. This procedure was sufficient to increase the performance of TreeTagger significantly. Bamman and Crane (2008) use a similar approach and report on experiments on part-of-speech tagging of classical Latin with TreeTagger (Schmid, 1994), trained on a treebank for classical Latin.

Lemmatization and morphological annotation

Lemmatization associates every word form with its lemma, together with its homograph number, where needed. We can perform lemmatization both on inflected forms and on spelling variants; for example, if we want to use a list of lemmas from British English, we can lemmatize the American variant color as colour.
Lemmatization is closely related to morphological analysis and part-of-speech tagging. In fact, if we know the part of speech of a given form in a given context, we can often assign the correct lemma to it. For example, the Latin form rosa can be an inflected form of the noun rosa ‘rose’, but also the feminine past participle of the verb rodo ‘gnaw’, and its correct lemma will depend on the context. For this reason, lemmatization is often coupled with part-of-speech tagging in corpus annotation. Just like other levels of linguistic annotation, lemmatization can be performed either manually or automatically, through tools called lemmatizers. Examples of historical corpora which have been manually lemmatized are treebanks, which we will introduce later in this section. Although automatic lemmatization of historical corpora is possible, attempts in this direction have on the whole been rare. One method for automatic lemmatization is based on a set of rules that prescribe how to analyse a given word form depending on which category it falls into. Examples of rule-based systems are LGeRM (Souvay and Pierrel, 2009), which identifies the dictionary entry of a given form in Middle French, and the morphological model built by Borin and Forsberg (2008) for Old Swedish. Along similar lines, several software systems are available
for performing automatic lemmatization and morphological analysis of Latin and Ancient Greek. For example, CHLT-LEMLAT (Passarotti, 2007a)12 is a lemmatizer and morphological analyser for Latin created at the Institute of Computational Linguistics (ILC-CNR) in Pisa. Another morphological analyser for Latin and Ancient Greek is Morpheus (Crane, 1991).13 Morpheus contains rules for generating inflected forms automatically and allows the users to search the digital library by word forms and lemmas. Kestemont et al. (2010) propose a machine-learning approach to lemmatization of Middle Dutch.

Syntactic annotation

Syntactic annotation consists of assigning a syntactic role to each element of the sentences in a corpus. Given the complexity of the task of syntactic annotation, historical corpora with this type of annotation are quite small, and attempts in the direction of automatic annotation have been rare. In this section we will give a brief overview of the research in this area.

Manual syntactic annotation and treebanks

Syntactically annotated corpora are usually called treebanks because we can represent syntactically annotated sentences as trees. For an overview of existing treebanks for modern and historical languages and some methodological points, see Abeillé (2003). Here we will focus on methodological issues specific to historical treebanks. There are two main kinds of syntactic annotation: constituency annotation and dependency annotation. In constituency annotation, phrases are identified and marked so that it is clear which phrase each element belongs to. Constituency annotation makes use of bracketing to represent the syntactic embedding of constituents and is the style followed by the early treebanks. We presented an example of this kind of annotation in section 4.3.1. Examples of constituency-based historical treebanks are the Penn Corpora of Historical English (Kroch and Taylor, 2000; Kroch and Delfs, 2004; Kroch and Diertani, 2010).
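To make the bracketing convention concrete, the following sketch reads a Penn-style bracketed string into nested Python lists and then extracts the tagged word forms from the leaf nodes. The input is a simplified fragment of the Early Modern English example shown earlier (the embedded clause is omitted), and the function names `parse_brackets` and `leaves` are hypothetical; this is not the tooling distributed with the Penn corpora.

```python
import re

def parse_brackets(text):
    """Parse a Penn-style bracketed string into nested lists.

    Each constituent becomes [label, child, ...]; a leaf is [tag, word_form].
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    position = 0

    def read_node():
        nonlocal position
        position += 1                      # skip the opening "("
        node = []
        while tokens[position] != ")":
            if tokens[position] == "(":
                node.append(read_node())   # nested constituent
            else:
                node.append(tokens[position])
                position += 1
        position += 1                      # skip the closing ")"
        return node

    return read_node()

def leaves(node):
    """Collect (tag, word_form) pairs from the leaf nodes, left to right."""
    if len(node) == 2 and all(isinstance(item, str) for item in node):
        return [(node[0], node[1])]
    pairs = []
    for child in node:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs

# Simplified fragment of the example sentence (embedded clause omitted).
tree = parse_brackets("(IP-MAT (NP-SBJ (D The) (N Chancelor)) (VBD saide) (. .))")
print(tree[0])
print(leaves(tree))
```

Once the tree is available as a nested structure, searches for particular configurations, for instance every noun phrase labelled as subject, reduce to simple recursive traversals.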
On the other hand, dependency annotation is based on the theoretical assumptions of Dependency Grammar (Tesnière, 1959), which represents the syntactic structure of a sentence with the dependency relations between its words. In a Dependency Grammar annotation, each lexical element corresponds to a node in the syntactic tree of the sentence; in order to tag its syntactic role in the sentence, we assign to each node a label (such as ‘predicate’, ‘object’, ‘attribute’) and the node by which it is governed. Figure 4.1 shows the phrase-structure tree and the dependency tree for Example (2):
(2) She ate the apple.
In the dependency tree of Figure 4.1 we can see the nodes corresponding to the words of the sentence, and the edges representing the dependencies between the words (‘Pred’ for predicate, ‘Sb’ for subject, ‘Obj’ for object, and ‘Det’ for determiner). In
12 http://webilc.ilc.cnr.it/ ruffolo/lemlat/index.html.
13 http://www.Perseus.tufts.edu/hopper/morph.jsp.
Figure 4.1. Phrase-structure tree (left) and dependency tree (right) for Example (2).
the constituent tree we can see the terminal nodes corresponding to the words of the sentence, and non-terminal symbols corresponding to the constituents (e.g. noun phrases ‘NP’ and verb phrases ‘VP’ in Figure 4.1) or parts of speech (such as pronouns (‘PR’), verbs (‘V’), determiners (‘DET’), and nouns (‘N’) in Figure 4.1). Dependency annotation has become increasingly popular among treebank creators. One common model of annotation is that of the Prague Dependency Treebank (Böhmová et al., 2003), developed under the Dependency Grammar theoretical framework of Functional Generative Description (Sgall et al., 1986). This treebank contains part of the Czech National Corpus annotated at three levels: morphological, so-called ‘analytical’ (with dependency trees of all sentences), and semantic, so-called ‘tectogrammatical’. Dependency annotation is generally considered to be very suitable for morphologically rich languages with free word order such as Czech and Latin. Examples of historical treebanks that followed this framework are: the Ancient Greek Dependency Treebank (Bamman and Crane, 2011), the PROIEL Treebank (Haug and Jøndal, 2008), the Latin Dependency Treebank (Bamman and Crane, 2007), and the Index Thomisticus Treebank (Passarotti, 2007b). Let us consider an example. Figure 4.2, from McGillivray (2013, 45), shows the dependency tree of the Latin sentence in Example (3), where movet and pervenit are coordinated predicates, governing respectively the direct object castra, and the adverbial diebus and the indirect object fines introduced by the preposition ad.

(3) Re frumentaria        provisa                   castra
    provisions:abl.f.sg   provide:ptcp.pf.abl.f.sg  camp:acc.n.pl
    movet             diebus        -que  circiter   XV       ad
    move:ind.prs.3sg  day:abl.m.pl  and   about:adv  fifteen  to
    fines            Belgarum          pervenit
    border:acc.m.pl  Belgian:gen.m.pl  arrive:ind.pf.3sg
    ‘After providing his provisions, he moved his camp, and in about fifteen days reached the borders of the Belgae’
Figure 4.2. The dependency tree of Example (3) from the Latin Dependency Treebank. [Nodes and labels: #1 AuxS; que-6 Coord; provisa-3 Adv; movet-5 Pred_Co; Re-1 Sb; castra-4 Obj; frumentaria-2 Atr; diebus-7 Adv; XV-9 Atr; circiter-8 Adv; pervenit-13 Pred_Co; ad-10 AuxP; fines-11 Obj; Belgarum-12 Atr.]
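A dependency tree of this kind can be stored as a list of tokens, each pointing to its head. The sketch below is only an illustration: it encodes the attachments that the description of Example (3) mentions (castra under movet, diebus and ad under pervenit), and it assumes, in the Prague style, that the two coordinated predicates hang from the conjunction -que and that fines hangs from the preposition ad. All other heads are left out rather than guessed, and the names `Token`, `TOKENS`, and `dependents` are hypothetical.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    idx: int              # position of the word in the sentence
    form: str             # word form
    label: str            # syntactic function label from the tree
    head: Optional[int]   # index of the governing token; None = not encoded here


# Partial encoding of Example (3): only attachments described in the text
# (plus the Prague-style assumptions named in the lead-in) are included.
TOKENS = [
    Token(4, "castra", "Obj", 5),
    Token(5, "movet", "Pred_Co", 6),
    Token(6, "que", "Coord", None),
    Token(7, "diebus", "Adv", 13),
    Token(10, "ad", "AuxP", 13),
    Token(11, "fines", "Obj", 10),
    Token(13, "pervenit", "Pred_Co", 6),
]


def dependents(head_idx, label=None):
    """Return the forms governed by head_idx, optionally filtered by label."""
    return [
        t.form
        for t in TOKENS
        if t.head == head_idx and (label is None or t.label == label)
    ]


coordinated_predicates = dependents(6, "Pred_Co")
objects_of_movet = dependents(5, "Obj")
print(coordinated_predicates, objects_of_movet)
```

A query such as `dependents(13)` then retrieves everything governed by pervenit, which is the kind of search by syntactic function that treebanks make possible.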
As we have seen from Example (3), the high level of complexity of the annotation in treebanks makes them very valuable resources for linguistic analyses, allowing for complex searches involving syntactic functions. Treebanks can also help linguists test their theories, as they can provide examples and counter-examples for illustrating linguistic phenomena in qualitative research. As a matter of fact, empirical linguistic analysis provided the main motivation behind the creation of the early treebanks (and corpora in general). Treebanks can also constitute the basis for corpus-driven analyses as defined in section 2.4. This latter use is the one that makes the most of the potential of treebanks, because they offer the kind of systematic information and frequency data that is needed in this type of linguistic analysis. Moreover, there is a significant educational potential in the use of treebanks, as testified by the Visual Interactive Syntax Learning project at the University of Southern Denmark,14 which contains syntactically annotated sentences and games for modern and historical languages (Latin and Ancient Greek). Finally, treebanks have recently been used as gold-standard resources for historical NLP, as they can be used to train automatic syntactic analysers or parsers, as explained in the next section.

14 http://beta.visl.sdu.dk/.
Automatic annotation of syntax: parsing

Parsing consists in automatically annotating a corpus from a syntactic point of view. Parsing is a very important field of research in NLP, and has a variety of practical and commercial applications, ranging from machine translation to natural language understanding and lexicography. Parsing can be achieved in two main ways: rule-based and statistical. Rule-based parsers rely on manually constructed rules to parse a sentence; statistical parsers, on the other hand, are based on machine-learning techniques and are trained on treebanks, from which they learn patterns of linguistic regularities that can then be applied when analysing new unannotated texts.

As with any other automatic method, parsing involves a number of errors, which we must take into consideration when using parsed data directly. Depending on the end use of the annotated corpus, this margin of error may constitute a problem, as traditionally historical linguists and philologists have aimed at an almost perfect level of analysis and often require the same accuracy to carry out further analyses based on annotated corpora. When the historical corpora are so small that it is possible to manually check the annotation, semi-automatic annotation is often the preferred solution.

As illustrated in Piotrowski (2012, 98–100), parsing experiments for historical languages have highlighted interesting challenges and have often originated from adaptations of parsers developed for modern languages. For example, comparing classical Chinese and modern Chinese, Huang et al. (2002) report an accuracy of 82.3 per cent for a parser trained on a 1,000-word treebank. The challenges involved in segmenting the text are less serious for classical Chinese, which has a higher number of single-character words compared to modern Chinese; on the other hand, part-of-speech ambiguity is more extreme for classical Chinese and therefore makes part-of-speech tagging more difficult.
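One common way to quantify the errors of a dependency parser is through attachment scores, the family of measures mentioned later in this section in connection with Kolachina and Kolachina (2012). The sketch below is a minimal illustration under stated assumptions: each token is represented as a (head, label) pair, the gold and predicted analyses are invented toy data, and the function name `attachment_scores` is hypothetical. It is not a description of how any of the studies cited here computed their figures.

```python
def attachment_scores(gold, predicted):
    """Compare two analyses given as lists of (head, label) pairs, one per token.

    UAS: share of tokens with the correct head.
    LA:  share of tokens with the correct label.
    LAS: share of tokens with both the correct head and the correct label.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n
    la = sum(g[1] == p[1] for g, p in zip(gold, predicted)) / n
    las = sum(g == p for g, p in zip(gold, predicted)) / n
    return {"UAS": uas, "LAS": las, "LA": la}


# Invented four-token example: the parser gets one label and one head wrong.
gold = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (3, "Atr")]
predicted = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (2, "Atr")]

scores = attachment_scores(gold, predicted)
print(scores)
```

Because the labelled score requires both head and label to be right, it can never exceed either of the other two, which is worth keeping in mind when comparing accuracy figures reported under different measures.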
Given the historical importance of Latin in Western culture, it is not surprising that significant efforts have been devoted to parsing this language. Koch (1993) describes a first attempt at parsing Latin. McGillivray et al. (2009), Passarotti and Ruffolo (2009), and Passarotti and Dell’Orletta (2010) report on more recent experiments on parsing Latin corpora using machine learning. For example, following the approach described earlier, which consists in adapting parsers developed for modern languages to historical languages, Passarotti and Dell’Orletta (2010) applied the DeSR parser (Attardi, 2006) to medieval Latin, and designed some specific features for this language. English is the other language for which considerable research has been done on parsing historical texts. Considering that modern English is the language with the highest number of language processing tools, it is not surprising that such tools have also been tested on historical varieties of this language. One such tool is the Pro3Gres parser (Schneider, 2008), a hybrid dependency parser for modern English. Pro3Gres is based on a combination of handwritten rules and statistical disambiguation, and can be adapted to historical language varieties. Schneider (2012) evaluated Pro3Gres on the
historical corpus ARCHER (A Representative Corpus of Historical English Registers, Biber and Atkinson 1994), constructed by Douglas Biber and Edward Finegan in the 1990s and consisting of British and American English texts written between 1650 and 1999. Before parsing, the text was normalized in a preprocessing step with the spelling normalization tool VARD2 (Baron, 2009). Schneider’s evaluation results range from 70 per cent on seventeenth-century texts to 80 per cent on early-twentieth-century texts for the unadapted parser. If we compare these results with the state-of-the-art parsers for modern English we can see that the difference is not as great as one might expect. For example, Kolachina and Kolachina (2012) evaluated a number of dependency and phrase-structure parsers for English and found accuracy ranging between 70 per cent and 90 per cent.15

15 The authors first converted the parses of a constituency parser into dependency structures. Then, they measured labelled attachment score (LAS), unlabelled attachment score (UAS), and label accuracy (LA).

Semantic, pragmatic, and sociolinguistic annotation

Semantic annotation often builds on syntactic annotation and involves interpreting a variety of different linguistic phenomena. These include indicating the semantic fields of a text like sport or medicine, for example, but also tagging named entities such as names of people or places, indicating whether an entity is animate or inanimate, whether it is an event or an abstract entity, and so on. Sense tagging is another important way to semantically annotate a corpus and consists in associating every word with its correct sense in context, based on an external ontology such as WordNet (Miller et al., 1990). WordNet is a lexical–semantic database for the English lexicon. Lexical items are assigned to sets of synonyms (synsets) representing lexical concepts, which are linked through semantic and lexical relations like hyponymy, hyperonymy, and meronymy. An example of an English synchronic semantically annotated corpus is SemCor (Fellbaum, 1998).

Semantic annotation of historical corpora also covers the automatic detection of named entities such as people, organizations, locations, and time expressions, which are of particular relevance to historical research (Toth, 2013). This section will focus on the semantic annotation of historical corpora, and provide some examples.

Semantically annotated historical corpora

Annotating a historical corpus at the semantic level is challenging for a variety of reasons, including the complexity of the task, the high degree of linguistic interpretation required, the scarcity of annotation standards, and the diachronic change of meaning. Some historical corpora have successfully attempted this kind of annotation and have approached it from different points of view. The PROIEL corpus, introduced in section 4.3.2, contains a semantic annotation for its Ancient Greek portion (Haug et al., 2009, 40–3), in addition to morphological and syntactic annotation. The semantic annotation in PROIEL has the form of type-level
animacy tagging, and follows the framework developed by Zaenen et al. (2004). Every Greek noun lemma is associated with one category taken from the following set: HUMAN, ORG (for organizations), ANIMAL, VEH (for vehicles), CONC (for concrete entities), PLACE, NONCONC (for non-concrete, inanimate entities), and TIME. These tags provide a ‘flat’ annotation, because they are not organized in any hierarchy. The treebank annotators tagged nouns in the corpus; then, thanks to anaphoric links, the tags were transferred from nouns to pronouns. Since the annotation is generally done at the level of the lemma rather than at the token level, it represents the animacy values of the majority of the corpus tokens, rather than a strictly context-specific identification of animacy. Moreover, this corpus-driven approach means that every lemma is annotated based on the collection of its tokens and not on its general meaning. Therefore the noun kardia ‘heart’, for example, is labelled as NONCONC because none of its corpus occurrences refer to physical hearts.

Another type of semantic annotation of historical texts is that of Declerck et al. (2011), who report on the semantic annotation of the Viennese Danse Macabre Corpus, consisting of a digital collection of printed German texts from 1650 to 1750. The aim of the annotation is to identify different conceptualizations of the theme of death, and hence the annotation specifically concerns this domain, and uses a tagset which conforms to the Text Encoding Initiative (TEI). Below we give an example of the annotation, taken from Declerck et al. (2011):

Mors, Tod, Todt General Haut und Bein, Menschenfeind

This example shows two instances of the tag , which is used for general-purpose names or strings; in this case the two tags annotate two personifications of violent death. This annotation allows for semantically informed searches on the corpus; for example, we can retrieve the personifications of death as a figure.
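Returning to the lemma-level animacy tagging described for PROIEL, the idea that a lemma's tag reflects the majority of its corpus tokens can be sketched as a simple vote. This is an illustration under stated assumptions, not the procedure the PROIEL annotators followed: the token-level judgements are invented, the lemma oikos and its counts are made up for the example, and the function name `lemma_level_tags` is hypothetical. Only the tagset and the NONCONC label for kardia come from the text above.

```python
from collections import Counter, defaultdict

# Flat animacy tagset listed in the text (no hierarchy among the tags).
TAGSET = {"HUMAN", "ORG", "ANIMAL", "VEH", "CONC", "PLACE", "NONCONC", "TIME"}


def lemma_level_tags(token_judgements):
    """Assign each lemma the tag carried by the majority of its tokens.

    token_judgements: iterable of (lemma, tag) pairs, one per corpus token.
    Ties are resolved in favour of the tag encountered first.
    """
    counts_by_lemma = defaultdict(Counter)
    for lemma, tag in token_judgements:
        if tag not in TAGSET:
            raise ValueError(f"unknown animacy tag: {tag}")
        counts_by_lemma[lemma][tag] += 1
    return {
        lemma: counts.most_common(1)[0][0]
        for lemma, counts in counts_by_lemma.items()
    }


# Invented token-level judgements. All kardia tokens are non-concrete,
# mirroring the example in the text; the oikos counts are made up.
judgements = [
    ("kardia", "NONCONC"), ("kardia", "NONCONC"), ("kardia", "NONCONC"),
    ("oikos", "PLACE"), ("oikos", "CONC"), ("oikos", "PLACE"),
]

tags = lemma_level_tags(judgements)
print(tags)
```

The invented oikos case shows what is lost at the type level: the minority reading is no longer visible once the lemma receives a single tag.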
A different approach to semantic annotation of historical corpora focuses on the historical context of the texts. The Hansard Corpus, which contains 1.6 billion words from 7.6 million speeches in the British Parliament from the period 1803–2005, is semantically tagged, which allows for powerful meaning-based searches. Users can create ‘virtual corpora’ by speaker, time period, house of parliament, and party in power, and make comparisons across these corpora.

Semantic annotation can also be performed automatically with the support of computational tools. For instance, Archer et al. (2003) present a tool for semantic annotation of English historical corpora based on USAS (see section 4.3.2 for an introduction to USAS), which was designed and initially implemented for present-day English. USAS assigns semantic labels based on a thesaurus consisting of over 45,000 words and almost 19,000 multi-word expressions. It works on a set of rules that rank the most likely analysis of a word based on some context-specific disambiguation rules and a frequency lexicon which records all semantic analyses of a word in order
of frequency. Archer et al. (2003) adapted USAS to make it possible to tag every word of a historical corpus, thus allowing meaning-based searches on the texts. The analysis referred to the Historical Thesaurus of English compiled at the University of Glasgow, which contains almost 800,000 words from Old English to the present day, arranged into fine-grained hierarchies primarily based on the second edition of the Oxford English Dictionary and Its Supplements, and the Thesaurus of Old English. Given the hierarchical structure of the thesaurus, the semantic analysis tool allows for concept-based searches on the texts.

Pragmatically and sociolinguistically annotated corpora

At the beginning of the history of corpus linguistics, the annotation of language-internal phenomena like lemmatization, part of speech, or syntax, received a great amount of attention. However, language use is best understood when analysed together with its context, as a discursive and social practice. Sociolinguistic research is interested in such contextual information, which covers social categories like gender and class, but also the knowledge possessed by the participants of the communicative event and situational aspects such as the relationships between the participants and the purpose of their communication (Biber, 2001). Recording the macro-social components of language, as well as the situational aspects of the individual communicative events, is very important to explain the role of language in society, and corpus data constitute crucially important evidence sources for this type of investigation. Sociolinguistic research is the background to the Corpus of Early English Correspondence, a family of historical corpora compiled with the aim of testing sociolinguistic theories on historical data.
In addition to morphological and syntactic annotation, these corpora are linked to a database containing information about letter writers, which allows the users to search sociolinguistic information about writers and recipients like age, gender, and family roles, and thus study the relation between language use and its context.

One way to capture pragmatic and social characteristics of language is through the specific type of annotation employed in the Sociopragmatic Corpus (Archer and Culpeper, 2003), a section of the Corpus of English Dialogues 1560–1760 (Kytö and Walker, 2006) covering the years 1640–1760. This corpus contains more than 240,000 words from trial proceedings and drama, annotated with characteristics of the speakers and the addressees. Here is an example from Culpeper and Archer (2008):

Look upon this Book; Is this the Book?

The example shows that a male speaker (indicated by spsex="m"), identified by the code s4franc001, acts here as a prosecutor, belongs to the social status ‘gentry’, and is classified as an older adult. His addressee is a male witness, identified by the code s4franc003, of social status commoner, and an adult. All this information is encoded
in terms of attributes of the tag , which includes a speaker’s conversational turn directed to a specific addressee, in an item of direct speech.

Another way to perform pragmatic annotation is by marking discursive elements in language data. This is the approach chosen by the PROIEL project and the tectogrammatical annotation of the Index Thomisticus Treebank, as we will see now. The PROIEL Project (Pragmatic Resources in Old Indo-European Languages) has developed a parallel corpus of the Greek text of the New Testament and its translations into the old Indo-European languages Latin, Gothic, Armenian, and Old Church Slavonic. Specifically, the Greek gospels have been annotated for information structure and discourse structure, in addition to the morphological, syntactic, and semantic annotations (Haug et al., 2009). This kind of annotation records information status and anaphoric distance, covering givenness tags based on the context used by the hearer to establish the reference, situational information, encyclopaedic knowledge, and tags to express information new to the context, as well as anaphoric links between discourse referents.

The annotation scheme chosen by the Index Thomisticus Treebank is the tectogrammatical annotation of the Prague Dependency Treebank (Passarotti, 2010, 2014), which refers to the Functional Generative Description framework (Sgall et al., 1986). This level of annotation builds on the so-called ‘analytical’ (i.e. syntactic) layer, where every token is a node in the dependency tree. However, the tectogrammatical annotation resolves ellipsis by reconstructing elided nodes, and represents the dependency relations between the elements which have semantic meaning, thus excluding nodes like conjunctions, prepositions, and auxiliaries. The dependency relations are represented in terms of semantic roles thanks to so-called ‘functors’, such as actor.
The pragmatic content of the annotation involves anaphoric references, as well as the information structure of sentences, distinguishing between topic and focus.

Annotation schemes and standards

We have seen the major levels of linguistic annotation and discussed their application to historical corpora. In this section we will concentrate on some recommended procedures to conduct a historical corpus annotation project, and stress the infrastructural implications of corpus annotation.

In order for an annotation to be consistent throughout the corpus, it is essential that it follows some predefined parameters. An annotation scheme defines the architecture of an annotation in terms of the tags that are allowed in it, and how they should be used. Good annotation schemes should allow us to describe (rather than explain) the phenomena observed in the corpus, and should be based on theory-neutral, widely agreed principles, as far as this is possible. An example of an annotation scheme is Bamman et al. (2008), where the authors describe all the tags employed in the annotation of the Latin Dependency Treebank. This is also an interesting example of a collaborative approach to defining annotation
guidelines, because these guidelines are shared with another Latin treebank, the Index Thomisticus Treebank (Passarotti, 2007b). Moreover, both treebanks follow the overall theoretical framework of the Prague Dependency Treebank (Böhmová et al., 2003). In addition, another Latin treebank, the PROIEL project Latin treebank, is compatible with both the Index Thomisticus Treebank and the Latin Dependency Treebank, since automatic conversion processes are available from one format to the other, and this increases the range of opportunities for linguistic analyses that span the data from all three treebanks.

A similar example of a shared approach to annotation is given by the Penn Corpora of Historical English, which include the Penn–Helsinki Parsed Corpus of Middle English (Kroch and Taylor, 2000), the Penn–Helsinki Parsed Corpus of Early Modern English (Kroch and Delfs, 2004), and the Penn Parsed Corpus of Modern British English (Kroch and Diertani, 2010). Following the same schema designed for the Penn Corpora of Historical English, a whole constellation of corpora has been built over the years: the York–Helsinki Parsed Corpus of Old English Poetry (Pintzuk and Plug, 2002), the York–Toronto–Helsinki Parsed Corpus of Old English Prose (Taylor et al., 2003), the York–Helsinki Parsed Corpus of Early English Correspondence (Taylor et al., 2006), the Tycho Brahe Corpus of Historical Portuguese (Galves and Britto, 2002), Corpus MCVF (parsed corpus), Modéliser le changement: les voies du français (Martineau and Morin, 2010), and the Icelandic Parsed Historical Corpus (Wallenberg et al., 2011a).

Covering a larger set of languages, Universal Dependencies16 is a project aimed at developing treebank annotation for many languages (including historical ones), with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. [. . .]
The general philosophy is to provide a universal inventory of categories and guidelines to facilitate consistent annotation of similar constructions across languages, while allowing language-specific extensions when necessary.17
Unfortunately, collaborations such as the ones mentioned above are not as frequent as we would wish. In historical corpus research, as well as in corpus linguistics in general, there are several schemes for corpus annotation, and no prevailing one. This has to do with historical reasons, as especially the older projects often originated within different theoretical frameworks to address specific needs and goals, and therefore developed their own (often peculiar) approaches to annotation; see, for example, the original annotation used for the Index Thomisticus (Busa, 1980). While this is partially justified by the fact that different languages require different annotation schemes, and that each level of annotation has its own features, it becomes increasingly important to aim at a more harmonized state, especially given the growth in the number of annotated historical corpora.

16 http://universaldependencies.org/.
17 http://universaldependencies.org/introduction.html.
Although no annotation scheme should be considered as a standard a priori, since the beginning of corpus linguistics it has gradually become clear that standards commonly agreed through practice and consensus are necessary. Such standards make corpora processable by a variety of software systems, thus facilitating the comparison, sharing, and linking of annotated corpora, avoiding duplication of effort, while at the same time enhancing the evidence basis for historical linguistic analyses. To this end, TEI18 has published the Guidelines for Electronic Text Encoding and Interchange, which document ‘a markup language for representing the structural, display, and conceptual features of texts’.19 TEI has modules for different text types (drama, dictionaries, letters, poems, and so on), and its annotation guidelines cover a range of palaeographic, linguistic, and historical features. For an overview of TEI for historical texts, see Piotrowski (2012, 60–7). Here we will look at one example of a historical text annotated following TEI conventions, the Bodleian First Folio.20 The following is an excerpt from the beginning of Shakespeare’s A Midsummer Night’s Dream.

Enter Theseus, Hippolita, with others.

Theseus.

NOw faire Hippolita, our nuptiall houre
Drawes on apace: foure happy daies bring in
Another Moon: but oh, me thinkes, how slow
This old Moon wanes; She lingers my desires
Like to a Step-dame, or a Dowager,
Long withering out a yong mans reuennew.
. . .
The element ‘stage’ contains stage directions, ‘cb’ marks the beginning of a column of text, ‘sp’ marks the speech text, ‘speaker’ gives the name of the speaker in the dramatic text, and ‘l’ indicates the verse line. For a complete explanation of the tags and attributes, see TEI Consortium (2014).

18 http://www.tei-c.org.
19 Unlike annotation, which typically adds linguistic information to the text, markup is usually concerned with marking information relative to the structure and context of the texts, such as author names or speakers in a drama, for example.
20 http://firstfolio.bodleian.ox.ac.uk/.
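To make the role of these elements concrete, the sketch below parses a small TEI-like fragment and groups verse lines by speaker. It is an illustration under stated assumptions: the fragment is a simplified reconstruction that uses only the element names listed above, with no attributes and no TEI namespace, so it is not the actual encoding of the Bodleian First Folio. The names `SAMPLE` and `speeches` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment using the element names described above.
# The real Bodleian encoding carries attributes and a namespace not shown here.
SAMPLE = """<div>
  <stage>Enter Theseus, Hippolita, with others.</stage>
  <cb/>
  <sp>
    <speaker>Theseus.</speaker>
    <l>NOw faire Hippolita, our nuptiall houre</l>
    <l>Drawes on apace: foure happy daies bring in</l>
  </sp>
</div>"""


def speeches(xml_text):
    """Return a list of (speaker, [verse lines]) pairs, one per <sp> element."""
    root = ET.fromstring(xml_text)
    result = []
    for sp in root.iter("sp"):
        speaker = (sp.findtext("speaker") or "").strip()
        lines = [(line.text or "").strip() for line in sp.findall("l")]
        result.append((speaker, lines))
    return result


parsed = speeches(SAMPLE)
stage_direction = ET.fromstring(SAMPLE).findtext("stage")
print(parsed)
print(stage_direction)
```

Keeping the speaker and the stage direction as separate elements is what allows a search to be restricted to, say, the verse spoken by one character while ignoring everything else on the page.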
TEI is a very positive initiative which addresses the need for standardization in the markup and annotation of texts in the humanities and social sciences; it is very widespread in the field of digital humanities. The Medieval Nordic Text Archive aims to preserve, disseminate and publish medieval texts in digital form, and to develop the standards required for this. The archive includes texts in the Nordic languages and in Latin (http://www.menota.org/), and its texts are encoded in TEI.

Generally speaking, TEI is not very widely used for historical corpora, where there is a stronger emphasis on linguistic annotation than on palaeographic and historical markup. Moreover, most programs for automatic annotation (the NLP tools introduced in section 4.3) strip out all forms of markup contained in the texts, as it is not relevant to the automatic processing they perform. However, in the case of historical texts, the information contained in these tags can be crucial to the interpretation of the text and should be considered by the language processing tools. A related difficulty is the fact that historical texts typically contain a number of nonlinear elements, such as alternative readings or corrected and erroneous text, which are heavily dependent on the specific edition of the text. A challenge for the future will certainly be to have the NLP community interact more with the TEI community and make it possible to apply NLP to complex TEI documents while preserving their tagging structure for further analysis.
Case study: a large-scale Latin corpus

We have seen how annotation makes it possible for researchers to search historical corpora for simple and complex linguistic entities. As the size of the corpora increases, automatic annotation becomes more and more of a necessity. This is especially true when we consider the increasing amount of texts that are being digitized as part of digital humanities projects, and that constitute very valuable sources of data for historical linguistics research. The case study illustrated in this section, an interesting application of historical NLP tools to Latin, shows an example of a very fruitful interchange between these disciplines.

LatinISE (McGillivray and Kilgarriff, 2013) is a Latin corpus containing 13 million word tokens, available through the corpus query tool Sketch Engine (Kilgarriff et al., 2004). Similarly to corpora compiled for modern languages like ukWac (Ferraresi et al., 2008), the texts making up LatinISE were collected from web pages. However, the process of data extraction was controlled by selecting three specific online digital libraries: LacusCurtius,21 IntraText,22 and Musisque Deoque.23

21 http://penelope.uchicago.edu/Thayer/I/Roman/home.html.
22 http://www.intratext.com.
23 http://www.mqdq.it.

These websites contain Latin texts covering a wide range of chronological eras, from the archaic age to the beginning of the current century, all editorially curated,
which meant that the quality of the raw material is superior to that of general web resources.

Another important observation concerns the metadata that the texts were provided with. As we discussed in section 4.1, this is an essential property of historical corpora, since it allows for further corpus-based studies that analyse the language in its historical context. In the case of LatinISE, the metadata were inherited from the original online libraries and include information on the names of authors, titles, books, sections, paragraphs, and line boundaries for poetry. After removing HTML tags and irrelevant content from the web pages, the corpus compiler converted them into the verticalized format required by Sketch Engine, where each line contains only one token or punctuation mark.

In addition to being provided with rich metadata, LatinISE is also lemmatized and part-of-speech-tagged. The lemmatization relies on the morphological analyser of the PROIEL Project, developed by Dag Haug’s team,24 complemented with the analyser Quick Latin.25 As an example, consider the following phrase:

(4) sumant exordia fasces
    take:sbjv.prs.3pl beginning:acc.n.pl fasces:nom.m.pl
    ‘let the fasces open the year’

This phrase was automatically analysed as follows:

> sumant sumo
> exordia exordium exordium exordium
> fasces no result for fasces

For each word form, the morphological analyser generated all possible analyses, which included an empty result for fasces. These multiple analyses needed to be disambiguated so as to assign the most likely lemma and part of speech to each token in context. This disambiguation was achieved with a machine-learning approach, by relying on existing Latin treebanks: the Index Thomisticus Treebank, the Latin Dependency Treebank, and the PROIEL Project’s Latin treebank.
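The situation the analyser leaves behind for Example (4) can be sketched as a mapping from word forms to candidate analyses, from which it is easy to see which forms need disambiguation and which received no analysis at all. This is only an illustration: the data structure, the morphological feature strings attached to exordium, and the function name `status` are assumptions introduced here, and the sketch does not reproduce the machine-learning disambiguation itself.

```python
# Candidate analyses per word form, modelled on the analyser output for
# Example (4). The feature strings are assumed for illustration only; the
# output shown in the text lists just the lemmas.
ANALYSES = {
    "sumant": [("sumo", "sbjv.prs.3pl")],
    "exordia": [
        ("exordium", "nom.n.pl"),
        ("exordium", "acc.n.pl"),
        ("exordium", "voc.n.pl"),
    ],
    "fasces": [],  # the analyser returned no result for this form
}


def status(candidates):
    """Classify a word form by the number of analyses it received."""
    if not candidates:
        return "unknown"
    if len(candidates) == 1:
        return "unambiguous"
    return "ambiguous"


summary = {form: status(candidates) for form, candidates in ANALYSES.items()}
print(summary)
```

Forms marked ambiguous or unknown are the ones for which a tagger trained on treebank data has to supply the most likely lemma and part of speech in context.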
At the time of the creation of LatinISE, these corpora contained a total of 242,000 lemmatized and morphosyntactically annotated words; this set was used to train TreeTagger (Schmid, 1995), a statistical part-of-speech tagger developed by Helmut Schmid at the University of Stuttgart.

24 http://www.hf.uio.no/ifikk/english/research/projects/PROIEL/
25 http://www.quicklatin.com/.

McGillivray and Kilgarriff (2013) describe how
TreeTagger was run on the analyses of the morphological analyser to obtain the most likely part of speech and lemma for each word form in the corpus. In Example (4), the corresponding corpus occurrences are:

sumant   V  sumo
exordia  N  exordium
fasces   N  fascis
Every line contains the word form, followed by the part-of-speech tag (‘N’ for ‘noun’ and ‘V’ for ‘verb’) and the lemma. LatinISE is currently in its first version and an evaluation of the automatic lemmatization and part-of-speech tagging is the necessary next step to assess the usability of the corpus, especially on the texts from those eras whose language differs significantly from that of the training set. With its ongoing development, this corpus testifies to the challenges of applying NLP tools to historical language data, and of dealing with texts from very different time periods. At the same time, a large diachronic annotated corpus is what is needed to conduct a study of language change. Of course, some may discount the period when Latin was not spoken by native speakers; we believe that this corpus is nevertheless a valuable resource for Latin (diachronic) studies. Following principle 9 (section 2.2.9) and principle 10 (section 2.2.10), quantitative evidence is the only type of evidence for detecting trends, and this evidence comes primarily from corpora. A corpus like LatinISE, which was annotated automatically, can be improved by successively refining the training set for the automatic annotation. Hence, it is a resource that can serve the community both by being the empirical basis for quantitative analyses and by being subject to further incremental developments leading to better and better language resources.
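The one-token-per-line layout described for Example (4) lends itself to very simple processing. The sketch below reads such lines into (form, part of speech, lemma) triples and counts the tags. It rests on assumptions that are not stated in the text: that the three columns are separated by tabs, and that any structural markup occupies its own line beginning with an angle bracket. The names `VERTICAL` and `read_vertical` are hypothetical.

```python
from collections import Counter

# The three corpus lines for Example (4), assuming tab-separated columns.
VERTICAL = "sumant\tV\tsumo\nexordia\tN\texordium\nfasces\tN\tfascis\n"


def read_vertical(text):
    """Parse one-token-per-line text into (form, pos, lemma) triples."""
    rows = []
    for line in text.splitlines():
        # Skip blank lines and (assumed) structural markup lines.
        if not line.strip() or line.startswith("<"):
            continue
        form, pos, lemma = line.split("\t")
        rows.append((form, pos, lemma))
    return rows


rows = read_vertical(VERTICAL)
pos_counts = Counter(pos for _, pos, _ in rows)
lemma_of = {form: lemma for form, _, lemma in rows}
print(rows)
print(pos_counts)
```

Frequency counts of this kind, computed over lemmas or tags and broken down by the metadata attached to each text, are the raw material for the quantitative analyses of trends that the chapter argues for.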
Challenges of historical corpus annotation

So far we have stressed the merits of corpus annotation and have seen how annotated historical corpora can serve the scholarly community. However, some scholars have criticized annotation, and in this section we will dedicate some space to their arguments, and to more general considerations about annotated corpora in historical linguistics research. Sinclair (2004, 191) called corpus annotation ‘a perilous activity’, which negatively affected the text’s ‘integrity’ and caused researchers to miss ‘anything the tags [are] not sensitive to’. Hunston (2002, 93) evokes a similar danger, in which researchers may tend to forget that the categories used to search the corpora partially shape their research questions:

the categories used to annotate a corpus are typically determined before any corpus analysis is carried out, which in turn tends to limit, not the kind of question that can be asked, but the kind of question that usually is asked.
It is indeed true that if we choose an annotation scheme that is too firmly bound to a specific linguistic paradigm, we risk only finding results supporting that paradigm. Moreover, depending on the annotation available for a corpus, certain research questions may not be answerable with that corpus. For example, imagine that our annotation scheme contained a tag for ‘noun’ and our annotation guidelines specified when an element was to be annotated as a noun; then, we would be constrained by these choices when retrieving nouns from the corpus. If our research aim is to define the characteristics of nouns, then our results would be heavily influenced by the corpus guidelines. One partial solution to this is to be very precise in specifying the corpus compilation principles and the assumptions made during the annotation phase, so that any research results can be interpreted in light of this, and results from differently annotated corpora can be compared.

In any case, the dependence of annotation on the schema is an unavoidable consequence of the practice of annotation itself. We can think of annotation as a pragmatic (in the common, non-linguistic sense of the word) solution to the problem of representing linguistic categories and their properties. Annotation tags are convenience representations of theoretical entities and should not be confused with the linguistic entities themselves. Annotated corpora are examples of the symbolic modelling of language introduced in section 1.2.2. They impose discrete categorizations on linguistic elements. This symbolic representation is compatible with both categorical and non-categorical views of language, precisely because it is a model and is not the linguistic reality directly. In other words, a corpus annotation that contains the categories of noun and verb can coexist with a view whereby such categories sit along a probability distribution.
Equally, such annotation is compatible with a view according to which words belong to discrete part-of-speech classes. Archer (2012) discusses some of the objections to corpus annotation and explores the question of whether annotation can be seen simply as a useless exercise that does not add anything to the data that is not already contained in them. In line with Archer’s (2012) view, we believe that corpus annotation is an essential step in the research process, and that, in spite of its limits, it contributes to a transparent way of drawing empirical conclusions from language data. Historical corpus linguistics can certainly hope to gain an independent status from corpus linguistics for modern languages by developing more and more sophisticated tools for annotating historical texts (following past and current research directions) and by emphasizing the unique features of historical texts. Annotated corpora are not fixed and immutable objects, and the issue of maintenance is critical in corpus building. Corpora need continuous updates for a range of reasons, as new linguistic theories emerge and as we discover new properties of language, or simply as more people contribute to the annotation by various means, including crowdsourcing. In the case of historical corpora, this is particularly
important. Due to the lack of native speakers and the philological complexities that affect many historical texts, it is advisable to support a more flexible type of annotation, which allows for multiple interpretations of the texts by different annotators. This model of annotation is particularly appreciated by classicists and philologists, who are interested in displaying the different variants of the original text as a consequence of the transmission of the text over time. Along these lines, Bamman and Crane (2009) propose a model of annotation that takes into account the scholarly tradition developed on the texts and gives the annotators scholarly credit for their work. Bamman and Crane (2009) applied this model to the portion of the Ancient Greek Dependency Treebank containing Aeschylus’ plays. This case displays an example of a highly debated text, both in terms of its philological transmission and its syntactic interpretation (which are linked, of course). In this respect, this model of scholarly annotation corresponds to the traditional practice of compiling critical editions and will, it is hoped, encourage philologists to engage with it alongside corpus and computational linguists. Another way in which corpora should be updated is in conjunction with the research process itself. Historical corpora are often used to study particular linguistic phenomena. Once the researcher has extracted the patterns of interest from the corpus, he or she may carry out further analyses. For example, in the case study on early modern English third person verbal ending described in section 7.3, we collected all instances of third-person ending of verbs from the Penn–Helsinki Corpus of Early Modern English. Then, we added lemma information on each verbal form, as we wanted to measure the effect of the lemma frequency on the type of morphological ending realized. 
The lemma information was not available in the original corpus, so this work enriched the corpus material, which we made available for reuse by other scholars. One way to maintain data sets like the one we built for that case study, which are important outputs of the research process, is to make it possible to incorporate additional annotation into the user’s personal working copy of the corpora, as allowed by the Penn–Helsinki Early English Treebanks. Additionally or alternatively, the analysis can be made publicly accessible by publishing it in a repository, as we chose to do. This way, other researchers can make use of this work in combination with the original corpus data, provided that some linking mechanism is in place. In this specific case, we published the list of verb form types and their associated lemmas, thus effectively providing a linking facility, which is in line with the requirement of reproducibility highlighted in section 2.3.1. All these approaches point towards a view of annotated corpora as the results of collaborative efforts based on which research can make progress in an incremental way. We believe that, thanks to this collaborative attitude, historical linguistics research can achieve access to larger data sets that allow us to reach more ambitious goals, well beyond what is possible in the context of small-scale studies. In the next chapter we will expand on this further.
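To make the idea of a linking mechanism concrete, the following is a minimal sketch in Python, not the procedure used in the case study itself. It assumes a published table mapping verb form types to lemmas and a list of tokens extracted from a corpus; the table contents, the token list, the text identifiers, and all function names are hypothetical and serve only as illustration.

```python
from collections import Counter, defaultdict

# Hypothetical published linking table: verb form type -> lemma.
FORM_TO_LEMMA = {
    "hath": "have",
    "has": "have",
    "doth": "do",
    "does": "do",
    "maketh": "make",
    "makes": "make",
}

# Hypothetical tokens extracted from a corpus: (text_id, verb form).
TOKENS = [
    ("text1", "hath"),
    ("text1", "maketh"),
    ("text2", "has"),
    ("text2", "hath"),
    ("text3", "does"),
    ("text3", "makes"),
    ("text3", "xyzeth"),  # invented form that is missing from the linking table
]


def ending_type(form):
    """Classify a third-person singular form by its ending (-th vs -s)."""
    return "-th" if form.endswith("th") else "-s"


def link_tokens(tokens, form_to_lemma):
    """Attach a lemma to each token; collect forms the table cannot link."""
    linked, unlinked = [], []
    for text_id, form in tokens:
        lemma = form_to_lemma.get(form.lower())
        if lemma is None:
            unlinked.append((text_id, form))
        else:
            linked.append((text_id, form, lemma))
    return linked, unlinked


def endings_by_lemma(linked):
    """Count ending types per lemma."""
    counts = defaultdict(Counter)
    for _text_id, form, lemma in linked:
        counts[lemma][ending_type(form)] += 1
    return counts


linked, unlinked = link_tokens(TOKENS, FORM_TO_LEMMA)
lemma_frequency = Counter(lemma for _text_id, _form, lemma in linked)
ending_counts = endings_by_lemma(linked)
```

Keeping the unlinked forms in a separate list makes gaps in the linking table visible, so that another researcher reusing the table can see which corpus tokens it does not cover.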
5 (Re)using resources for historical languages

5.1 Historical languages and language resources

In Chapter 4 we saw that annotated corpora are essential to quantitative historical linguistics research. Of course, they are not the only source we can rely on. Indirect sources of language data like dictionaries and lexicons have been and still are of great importance. Unlike corpora, where words are organized in their context of occurrence, traditional language resources store general information about lexical items out of context, and in some cases link this information back to their occurrences in the texts (section 2.1.3). In this chapter we will support a view according to which such links between lexical entries and their occurrences in context (i.e. in corpora) should be made more systematic and explicit; we will therefore argue that the gap between corpora and other language resources can be closed thanks to a corpus-driven approach paired with a quantitative practice, and show the benefits of this perspective for research in historical linguistics. We will also turn our attention beyond language resources, towards the wider landscape of historical and cultural heritage resources, and make a case for synergies that can benefit research on historical languages. Finally, we will make a case for building language resources in a way that makes them easy to maintain and compatible with other resources, and for reusing existing resources when that is possible, thus increasing the level of transparency and replicability that are among the most important elements of our methodology (sections 1.1 and 4.1.1).

5.1.1 Corpora and language resources

Traditional language resources like dictionaries are very useful in historical linguistics research. However, even when they are based on corpora, if they are qualitative in nature it is not possible to draw quantitative arguments from them, apart from basic type frequencies extracted from the resource itself. Conversely, corpus-driven language resources like computational lexicons offer more potential for integration with corpora and therefore allow researchers to add a quantitative dimension to their analysis, as we will show in this section.
Let us start with an example of a psycholinguistic phenomenon that is relevant to historical linguistics: local syntactic ambiguity. Consider Example (1), a Latin sentence from Ovid, Metamorphoses 1.736:

(1) et   Stygias           iubet                hoc            audire          paludes
    and  Stygian:acc.f.pl  command:ind.prs.3sg  this:acc.n.sg  listen:inf.prs  water:acc.f.pl
    ‘He commands the Stygian waters to listen to this.’

Example (1) contains an instance of the general pattern [V1 ARG V2 ], where V1 is the verb iubet, ARG is the pronoun hoc, and V2 is the verb audire. According to the valency properties of the two verbs, ARG could be an argument of both V1 and V2 . Example (1) is a case of local syntactic ambiguity, which is resolved once the sentence is read out in full. This is in line with the online nature of oral language comprehension, whereby the hearer perceives one word at a time and incrementally interprets the partial input, even before the sentence is complete (Schlesewsky and Bornkessel, 2004; Van Gompel and Pickering, 2007, 289; Levy, 2008, 1129). McGillivray and Vatri (2015) investigated this phenomenon in Latin and Ancient Greek, taking the opportunity to apply some principles from psycholinguistics to historical languages, for which experiments on native speakers are, of course, not possible. Before it is read in full, Example (1) may be taken to mean ‘he commands the Stygian waters this’, indicating an order given to the waters; however, after reading audire, it becomes clear that this verb governs hoc and therefore the sentence unambiguously means ‘he commands the Stygian waters to listen to this’. In order to classify Example (1) as ambiguous, we need to know that both iubeo and audio can govern hoc. In other words, we need to answer the question: are iubeo ‘to command’ and audio ‘to listen’ transitive verbs? Traditional language resources like dictionaries and lexicons can help to answer this question, as they contain vast amounts of information about lexical items, including verbs’ transitivity. 
The Latin-to-English dictionary by Lewis and Short (1879)1 records that sense 1α of iubeo can occur ‘with an object clause’, as in (2) from Terence’s Eunuchus 3, 2, 16, where istos foras exire ‘that they come out’ is the object clause of the imperative iubete ‘order’:

(2) iubete             istos          foras  exire
    order:imp.prs.2pl  that:acc.m.pl  out    come out:inf.prs
    ‘order them to come out’

On the other hand, the first sense of the entry for the verb audio in Lewis and Short (1879) records aliquid ‘something’ (i.e. accusative direct object) as the first of
1 Accessed from the Perseus Project’s page http://www.perseus.tufts.edu.
the possible argument structure configurations for this verb. Therefore, from the information contained in the dictionary, we know that the two verbs in Example (1) are transitive, and thus we can hypothesize that the accusative hoc in Example (1) can be the argument of either iubet (‘He commands this to the Stygian waters’)2 or audire (‘He commands the Stygian waters to listen to this’). The argument structure information contained in a dictionary can certainly help to consider the different possible syntactic interpretations of a sentence like Example (1). However, if we want to be able to identify all locally ambiguous sentences from a corpus of texts without manually checking each instance, we need to combine the corpus data with a machine-readable resource. Such a resource can be automatically queried by a computer algorithm in order to detect those sentences where two verbs occur with a noun phrase that is compatible with the valency properties of both verbs, making it possible for both verbs to govern that phrase. This is the approach followed by McGillivray and Vatri (2015), who relied on corpus-driven computational valency lexicons for Latin and Ancient Greek verbs. In the next section we will cover the difference between corpus-based and corpus-driven lexicons, and briefly illustrate the valency lexicon in question.

5.1.2 Corpus-based and corpus-driven lexicons

As noted in McGillivray (2013, 32–6), traditional historical dictionaries are qualitative resources. They are compiled based on large collections of examples usually taken from the canon of texts of a historical language. In this sense they may be called ‘corpus-supported’ resources in a loose sense, if we broaden the term ‘corpus’ to cover any collection of texts, independently of their format, and the selection criteria and annotation features of modern corpus linguistics.
In other words, the texts constitute the evidence source on which the historical lexicographer relies to prepare the summary contained in a dictionary entry. That this is the case is evident from the number of examples included to support most statements about grammatical and lexical-semantic properties in a dictionary. However, the process leading from the whole collection of texts to the selected examples that appear in a lexical entry is the result of the subjective judgement of the dictionary’s compilers, and cannot always be reliably reproduced. A similar argument holds for other historical dictionaries and thesauri like the Oxford English Dictionary3 and the Historical Thesaurus of English.4 Such a qualitative approach makes the dictionaries supported by a complete corpus good resources for answering qualitative questions such as ‘Is verb X found with a dative object in historical language Y?’ (assuming that verb X is included among the examples presented in the dictionary), but not quantitative questions like ‘Has the
2 This interpretation is only acceptable if we consider the online processing of the sentence up to the word hoc (et Stygias iubet hoc).
3 http://www.oed.com.
4 http://historicalthesaurus.arts.gla.ac.uk.
proportion of animate senses of noun X over inanimate senses increased over time?’. The reason for this is rooted in the original purpose of printed dictionaries, which suited an era of information scarcity. They aimed to ‘provide information in a manner which is accessible to the reader . . . The reader should . . . regard the Dictionary as a convenient guide to the history and meaning of the words of the English language, rather than as a comprehensive and exhaustive listing of every possible nuance’ (Jackson, 2002, 60). With the potential offered today by digitized text collections and computational tools, we can raise our ambitions to a more systematic account of the behaviour of words in texts; then, this information can be queried by programs as well as humans, as we will see in the next sections.

Historical valency lexicons

In the field of computational linguistics there have been several successful attempts at building lexical resources from corpora for modern languages in a radically different way from traditional dictionaries. One example is the Italian historical dictionary TLIO (Tesoro della Lingua Italiana delle Origini),5 which is directly associated with a corpus of texts. If we focus on valency lexicons, we find examples like PDT-Vallex (Hajič et al., 2003), FrameNet (Baker et al., 1998), and PropBank (Kingsbury and Palmer, 2002), to name just a few. All these lexicons have in common the fact that they are based on syntactically annotated corpora. This makes it possible to maintain an explicit relation between the corpus and the lexicon: once the corpus has been annotated (for example by marking all arguments and their dependency on verbs), human compilers create the lexicon by summarizing the corpus occurrences into the lexical entries (for example by describing argument patterns found for each verb) and recording the link between the entries and the corpus.
Moving from a corpus-based to a corpus-driven approach, computational lexicons like Valex (Korhonen et al., 2006) for English and LexSchem (Messiant et al., 2008) for French systematically describe the valency behaviour of all verbs in the corpora they are linked to. These lexicons are automatically extracted from annotated corpora and therefore display frequency information about each valency pattern, which can be traced back to the original corpus occurrences it was derived from. For example, it is possible to know how many times a verb occurs with a subject and direct object, and retrieve all corpus instances of this pattern. Attempts to apply this approach to Latin data have resulted, for example, in the lexicon described by Bamman and Crane (2008), which was automatically extracted from a Latin corpus consisting of 3.5 million words from the Perseus Digital Library and automatically parsed (i.e. syntactically analysed).
5 http://tlio.ovi.cnr.it/TLIO.
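The corpus-driven extraction described above can be illustrated with a small Python sketch. This is not the procedure behind Valex, LexSchem, or the Latin lexicon of Bamman and Crane (2008); it only assumes a toy dependency-annotated corpus in which each token records its head and relation. The tuple layout, the example tokens, and the function name are hypothetical, and the relation labels Sb and Obj are used here as assumed argument labels.

```python
from collections import defaultdict

# Hypothetical dependency-annotated tokens:
# (sentence_id, token_id, lemma, pos, case, head_id, relation)
TOKENS = [
    (1, 1, "puer", "noun", "nom", 2, "Sb"),
    (1, 2, "impono", "verb", None, 0, "Pred"),
    (1, 3, "onus", "noun", "acc", 2, "Obj"),
    (2, 1, "impono", "verb", None, 0, "Pred"),
    (2, 2, "onus", "noun", "acc", 1, "Obj"),
    (3, 1, "miles", "noun", "nom", 2, "Sb"),
    (3, 2, "impono", "verb", None, 0, "Pred"),
    (3, 3, "corona", "noun", "acc", 2, "Obj"),
]

ARGUMENT_RELATIONS = {"Sb", "Obj"}  # assumed labels that count as arguments


def extract_lexicon(tokens):
    """Map each verb lemma to its argument patterns and attesting sentences.

    Simplification: verbs that are themselves arguments of another verb
    (e.g. infinitives) are not recorded as arguments in this sketch.
    """
    verbs = {}  # (sentence_id, token_id) -> verb lemma
    arguments = defaultdict(list)  # (sentence_id, head_id) -> ["Obj[acc]", ...]
    for sent, tok, lemma, pos, case, head, rel in tokens:
        if pos == "verb":
            verbs[(sent, tok)] = lemma
        elif rel in ARGUMENT_RELATIONS:
            arguments[(sent, head)].append(f"{rel}[{case}]")
    lexicon = defaultdict(lambda: defaultdict(list))
    for (sent, tok), lemma in verbs.items():
        pattern = ",".join(sorted(arguments[(sent, tok)]))
        lexicon[lemma][pattern].append(sent)
    return lexicon


lexicon = extract_lexicon(TOKENS)
# Frequency of each pattern for one verb, derived from the stored sentence ids.
frequencies = {pattern: len(sents) for pattern, sents in lexicon["impono"].items()}
```

Because each pattern keeps the list of sentence identifiers that attest it, the frequency of a pattern and the corpus occurrences behind it are available from the same data structure, and the extraction can simply be run again when the corpus grows.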
Figure 5.1 Lexical entry for the verb impono from the lexicon for the Latin Dependency Treebank. The pattern is called ‘scc2_voice_morph’ because it shows the voice and the morphological features of the arguments.
McGillivray (2013, 31–60) describes a corpus-driven lexicon automatically derived from the Latin Dependency Treebank (Bamman and Crane, 2007) and the Index Thomisticus Treebank (Passarotti, 2007b). Figure 5.1 shows the lexicon entry for the Latin verb impono. Each entry in the lexicon corresponds to a verb occurrence in the corpora, identified by an ID number for the verb (second column) and the unique sentence number from the corpus (last column); in addition, the lexicon entry displays the author of the text in which that occurrence is found (first column), the verb lemma (third column), and the argument pattern corresponding to that verb token (fourth column). For example, the pattern ‘A_Obj[acc],Sb[nom]’ in the first row indicates that the verb impono in sentence 845 occurs in the active voice, with an accusative direct object and a nominative subject. Applying the same database queries developed to create the Latin lexicons to data from the Ancient Greek Dependency Treebank (Bamman and Crane, 2009), which follows the same annotation guidelines and format as the two Latin treebanks previously mentioned, McGillivray and Vatri (2015) describe a corpus-driven valency lexicon for Ancient Greek, which they used to study the phenomenon of local syntactic ambiguity. The advantages of automatically built lexicons like the Latin and Greek ones described above are numerous. First of all, as we have seen, they contain frequency
information which is directly linked to the corpus, thus allowing for corpus-based quantitative studies, as prescribed by principle 8 (section 2.2.8). Second, they are easy to maintain because, as the corpus grows in size, the automatic processes for obtaining the lexicon can be executed again without starting a new process from scratch. This is exemplified by the Ancient Greek lexicon described in McGillivray and Vatri (2015), built using the same approach developed for the Latin lexicon, as we have seen. Third, the creation of these lexicons is independent from the corpus annotation phase, which minimizes the risks of biased results. In traditional studies (as we have seen from the survey reported on in Chapter 1), the phase of data/text collection and the phase of data analysis are often performed jointly, in the context of a specific study and with a particular set of theoretical hypotheses in mind. By resorting to corpus-driven resources, the phases are kept separate, because the text collection phase happens at the point in which the corpus compilers build the corpus; then the persons responsible for the language resource extract the corpus data to create the lexicon via automatic techniques. Only at this stage does the researcher pull the relevant data from the language resource to address a specific research question. For example, McGillivray (2013, 127–78) describes a study on the argument structure of Latin verbs prefixed with spatial preverbs. The study relies on the corpus-driven valency lexicon described in McGillivray (2013, 31–60). Hence, the decision of what counts as a verbal argument and what is an adjunct was made by the corpus annotators and therefore was not influenced by the specific purpose of the study on preverbs. This guarantees a higher level of consistency (as noted in section 6.1), and facilitates the reproducibility of the study, as recommended in our best practices (section 2.3.1). 
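As an illustration of how a researcher might pull data from such a lexicon at query time, the sketch below parses pattern strings of the kind shown in Figure 5.1 (for example ‘A_Obj[acc],Sb[nom]’) and checks whether two verbs are both attested with an object in a given case, which is the kind of condition needed to flag candidates for local syntactic ambiguity such as Example (1). It is a simplified Python sketch under stated assumptions, not the procedure of McGillivray and Vatri (2015): the rows, their author and ID values, and the function names are hypothetical.

```python
import re

# Hypothetical lexicon rows: (author, verb_id, lemma, pattern, sentence_id)
LEXICON_ROWS = [
    ("Ovid", 101, "iubeo", "A_Obj[acc],Sb[nom]", 845),
    ("Ovid", 102, "iubeo", "A_Obj[acc]", 846),
    ("Ovid", 103, "audio", "A_Obj[acc]", 846),
    ("Caesar", 104, "venio", "A_Sb[nom]", 12),
]

ARGUMENT_RE = re.compile(r"(\w+)\[(\w+)\]")


def parse_pattern(pattern):
    """Split 'A_Obj[acc],Sb[nom]' into ('A', [('Obj', 'acc'), ('Sb', 'nom')])."""
    voice, _, rest = pattern.partition("_")
    return voice, ARGUMENT_RE.findall(rest)


def lemmas_taking(rows, relation, case):
    """Return the lemmas attested at least once with the given argument."""
    found = set()
    for _author, _verb_id, lemma, pattern, _sentence_id in rows:
        _voice, arguments = parse_pattern(pattern)
        if (relation, case) in arguments:
            found.add(lemma)
    return found


def is_candidate_ambiguity(v1, arg_case, v2, rows):
    """True if both verbs are attested with an object in the case of ARG."""
    takers = lemmas_taking(rows, "Obj", arg_case)
    return v1 in takers and v2 in takers


candidate = is_candidate_ambiguity("iubeo", "acc", "audio", LEXICON_ROWS)
```

A check like this only identifies candidates: whether a given sentence is in fact locally ambiguous still depends on word order and on the rest of the syntactic annotation, which this sketch does not model.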
Other historical lexicons

Valency lexicons are very useful for studies that require information on verbs’ syntactic arguments. For other purposes, different types of lexicons are available for historical languages. One such type of resource is the set of lexicons developed in the context of the IMPACT (Improving Access to Text) project,6 which aims at developing a framework for digitizing historical printed texts written in European languages. One common issue with performing OCR on historical texts is that it requires a large lexicon containing all possible spellings and inflections of words over time, as the OCR algorithm uses the lexicon to assign the most likely transcription to each word. Another challenge with searching historical texts concerns retrieval: ideally, users should find occurrences of old spellings or inflections of words by searching for the modern variants. For example, the user may search for ‘water’ and be presented with corpus occurrences of ‘weter’, ‘waterr’, ‘watre’, and so on. Moreover, lists of proper names (so-called ‘named-entities’, typically for locations, persons, and organizations) drastically improve the accuracy of OCR systems for historical texts. To address all
6 http://www.digitisation.eu/.
these needs, the researchers of the IMPACT project have developed computational morphological lexicons which display both spelling variants and inflected forms for modern lemmas, as well as named-entities lexicons, for Bulgarian, Czech, Dutch, English, French, German, Polish, Slovene, and Spanish (Depuydt and de Does, 2009). The morphological lexicons were created from corpora, collections of quotations contained in historical dictionaries, and/or modern dictionaries provided with historical variants. These morphological lexicons usually contain frequency information as well. The named-entities lexicons were created by training named-entity recognition algorithms on manually curated sets tagged with various types of named-entities labels (the so-called ‘gold standards’) and then running the named-entity recognizers on new, unannotated data. Let us consider one of the historical lexicons developed as part of the IMPACT project, the lexicon for German.7 This lexicon was extracted from a corpus of 3.5 million words from the Early New High German (1350–1650) and New High German (since 1650) periods. Each entry in the lexicon has the following structure: historical word form, followed by the corresponding modern lemma and its attestations in the corpora. The lexicon was created with Lextractor, a web-based tool with a graphical user interface designed for lexicographers. This tool contains a modern morphological lexicon, a lemmatizer, and an algorithm that uses rules to generate historical forms from modern lemmas. Therefore, the tool is able to suggest the linguistic interpretation for some of the historical word forms, in terms of their modern lemmas, part-of-speech information, and their possible attestations in corpora. The lexicographer has the option of accepting or rejecting the automatic suggestions, and difficult cases are handled collaboratively (Gotscharek et al., 2009).
For example, the following rules are among those for generating modern forms from historical forms in German:

1. th → t
2. ei → ai
3. ey → ei
4. l → ll

In addition, the morphological lexicon for modern German maps the inflected form teile to the noun teil ‘part’ (plural) and to the verb teilen ‘to share’ (first person singular present indicative). When presented with the historical form theile, Lextractor can suggest the lemmas of teile (teil and teilen) by combining the first rule listed above with the modern morphological information; Lextractor can also suggest the lemma taille by applying the second and fourth rules in addition to the first, together with the modern morphological lexicon entry for taille ‘waist’. At this point the lexicographer can confirm or reject the automatic suggestions;
7 http://www.digitisation.eu/tools-resources/language-resources/historical-and-named-entities-lexicaof-german/.
moreover, he or she can classify the historical form as one or more of the following cases:

• historic form without modern equivalent;
• historic abbreviation;
• pattern matcher failed;
• named-entity;
• missing in modern lexicon.

Next, Lextractor provides the lexicographer user with a list of candidate corpus attestations of the word form, with their context in the form of concordances, as well as the frequencies of all forms of the lemma being analysed and their time stamps. The user can then select the correct occurrences.
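The rule-based suggestion step described above can be sketched in a few lines of Python. This is an illustration of the general technique only, not Lextractor’s actual algorithm: it assumes that each rewrite rule may be either applied or skipped, that rules are applied in the listed order to every occurrence of their left-hand side, and that a candidate is kept only if it appears in a modern lexicon. The two-entry lexicon and the function names are hypothetical.

```python
from itertools import combinations

# Rewrite rules from historical to modern spelling, in the order listed above.
RULES = [("th", "t"), ("ei", "ai"), ("ey", "ei"), ("l", "ll")]

# Hypothetical fragment of a modern morphological lexicon: form -> analyses.
MODERN_LEXICON = {
    "teile": ["teil (noun, plural)", "teilen (verb, 1sg present indicative)"],
    "taille": ["taille (noun)"],
}


def candidate_forms(historical_form, rules=RULES):
    """Generate every form obtained by applying any subset of the rules."""
    candidates = set()
    for size in range(len(rules) + 1):
        for subset in combinations(rules, size):
            form = historical_form
            for old, new in subset:
                form = form.replace(old, new)
            candidates.add(form)
    return candidates


def suggest(historical_form, lexicon=MODERN_LEXICON):
    """Keep only the candidates attested in the modern lexicon."""
    return {
        form: lexicon[form]
        for form in sorted(candidate_forms(historical_form))
        if form in lexicon
    }


suggestions = suggest("theile")
```

For theile, the candidate teile is reached through the first rule alone, while taille requires the first, second, and fourth rules together. All other generated strings (such as thaile or teille) are discarded because they are absent from the toy lexicon, which mirrors the role of the modern morphological lexicon as a filter before the lexicographer confirms or rejects a suggestion.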
5.2 Beyond language resources

Historical sociolinguists have emphasized the relationship between language and its social context for a long time (Romaine, 1982; Nevalainen, 2003; McColl Millar, 2012). As we said in section 4.3.2, recording the macro-social components of language, as well as the situational aspects of the individual communicative events, is very important to explain the history of language in society (both in terms of the history of individual languages and of language change), and corpus data constitute important evidence sources for this type of investigation. As we noted in section 4.3.3, the field of digital humanities has been concerned with contributing to humanities research by addressing research questions of humanistic disciplines with the support of digital tools. One important project in this area is the TEI, which establishes a standard for annotating a wide range of textual and contextual information for a large number of text types and formats. Unfortunately, the academic communities of digital humanities and historical linguistics have not always shared approaches and tools, and TEI markup is still not usually employed in corpus annotation. However, this tendency is gradually changing. In recent years, the collaboration between historical linguists and scholars from other historical areas of the humanities has received a new impulse thanks to their shared interests in the analysis of cultural heritage data. The LaTeCH (Language Technology for Cultural Heritage) workshops testify to the increased popularity of this area of research (see, for example, Kalliopi Zervanou, 2014). We argue that this collaboration presents a number of benefits for all research fields involved, as we explain here. On the one hand, historical linguistics can gain more insight into how language changed over time by explicitly placing language data into their historical context.
One way to achieve that is by adding information on social features of the texts, and the work done by (historical) sociolinguists is a good model for such efforts. Metadata on where the language data were composed (or uttered) are certainly an
essential piece of knowledge that needs to be recorded; in addition, annotating social features of the authors/speakers and the location allows the researchers to investigate how language and such social factors interact, thus adding an additional level of depth to the analysis. One example of this approach is the project for creating the British Telecom Correspondence Corpus at Coventry University (Morton, 2014), which annotated business letters written over the years 1853–1982 with TEI-compliant XML. The metadata elements recorded for this corpus include: date, author’s name, occupation, gender, and location; recipient’s name and location; general topic of the letter, whether the letter was part of a chain or not, format (handwritten, printed, etc.), and company/department, in addition to text-internal annotation marking quotes, letter openings, paragraphs, letter closings and salutations, as well as a pragmatic annotation of the letter’s function (application, complaint, query, etc.). On the other hand, from the point of view of historical research, texts and archives are among the various sources from which we can come to new interpretations of historical facts or we discover new relations between events. Detailed linguistic analyses grounded on language data, particularly texts, can definitely support and enrich this work. For instance, social history and marginalized groups are best investigated by a corpus-based register and lexical analysis of the language of certain official documents, as exemplified by the study of prostitution based on judicial records from the seventeenth century described in McEnery and Baker (2014). The authors analysed nearly one billion words from the seventeenth-century section of the Early English Books Online corpus.8 The texts underwent variant spelling annotation, lemmatization, part-of-speech tagging, and semantic tagging. 
After processing the corpus data, historians and linguists in the project team carried out the collection of relevant linguistic data in an iterative fashion. These data concerned the change in meaning and discourse features of a set of lexical items recognized as pertinent to the topic through literature review and corpus data inspection. This phase was followed by the corpus work, which investigated semantic and pragmatic change through collocation analysis. The analysis also concerned place names associated with the nouns of interest (synonyms of prostitute). This research shed new light on certain aspects of language change, and offered insights into the society and culture of that historical period, which would have been more limited without access to large corpora, linguistic knowledge, and historical expertise. As the experience of McEnery and Baker (2014) shows, while more and more historical texts become available, the traditional approach involving close reading of the texts becomes less and less feasible, making room for the so-called ‘distant reading’ approach, which coexists with it. This is where the experience of corpus and computational linguistics can make a substantial contribution to historical research, thanks to the vast set of tools for language processing and examination that these
8 http://www.textcreationpartnership.org/tcp-eebo/.
disciplines have developed. Such tools allow the researchers to scale up their analyses and give a faithful representation of the language used in the texts, as well as their content. In addition, large corpora make it possible to investigate rare language usages that are simply not found in corpora of the size manageable by hand. Some examples of this line of thinking are the research outlined in Toth (2013), the HiCor (History and Corpus Linguistics) research network funded by the Oxford Research Centre in the Humanities,9 the Network for Digital Methods in the Arts and Humanities (NeDiMAH),10 and the Collaborative European Digital Archive Infrastructure (CENDARI).11 When applying corpus methods to historical archives and documents, however, we need to keep in mind an important difference between corpora and archives (and digital resources in general). This difference concerns the well-known issue of ‘representativity’, which is far from being resolved, especially in historical contexts, where the corpus compilers often can only include texts or fragments that have survived historical accidents, and cannot aim at so-called ‘balanced’ corpora (see discussion in Chapters 2 and 4). Archives, in particular, usually group together records relating to certain events, thus making it difficult to identify individual text types in them. Attention should also be paid to ensuring that documents on less prominent individuals are included as well, so as to best reflect linguistic variation. A number of software tools are now available to support historians’ interpretative work by using the traditional corpus linguistics tools such as concordances and keyword-in-context, as well as language technology techniques, including morphological tagging, part-of-speech tagging, syntactic parsing, named-entity recognition, semantic relation extraction, temporal and geographical information processing, semantic similarity, and sentiment analysis.
Just to mention a few examples, ALCIDE (Analysis of Language and Content in a Digital Environment) was specifically developed for historians at the Fondazione Bruno Kessler in Trento, and combines data visualization techniques with information extraction tools to make it possible to view and select the relevant information from a document, including a semantic analysis of the content (Menini, 2014). Another example concerns the synergy between geographic systems and language technology, specifically named-entity recognition. Geographic information systems (GIS) help to investigate the role played by different places in social phenomena over time by analysing their mentions (both overt and implicit) and their collocations in historical documents; see, for example, Joulain et al. (2013). In section 5.3.3 we will give an example of a resource created in the context of geographical historical data.

From this brief overview it will be clear, we hope, that our position supports synergies and collaborations between historical linguistics and other historical disciplines, which requires historical linguistics to develop a stronger commitment to non-language-related resources. This way, it will be possible to combine multidisciplinary expertise to cover more research ground and achieve further goals that could not be achieved in the context of the individual disciplines. Such synergies and exchanges do not only affect people; they also have an important implementation in linking the data resources employed in research, as we will see in section 5.3.

9 http://www.torch.ox.ac.uk/hicor/.
10 http://www.nedimah.eu/.
11 http://www.cendari.eu/.
5.3 Linking historical (language) data

In section 5.2 we argued that historical corpora should be more integrated with other linguistic and non-linguistic resources in order to give a fuller account of language change over time. One way to achieve that is to enrich the corpus annotation with metadata information recording the historical context of the texts and social features of authors, characters, and places (usually at the beginning of the corpus or in a separate file), as well as pragmatic functions of the speech acts (typically with in-line annotation). Traditionally, this has been the standard approach in corpus-based historical sociolinguistics, and has allowed researchers to study the interplay of linguistic phenomena and external factors by extracting the data directly from the corpora.

Along these lines, the compilers of the Penn–Helsinki Parsed Corpus of Middle English, second edition (PPCME2) created a series of files containing a range of metadata information about each text of the corpus. For instance, Figure 5.2 shows the page of the Parson’s Tale by Chaucer. In addition to indicating the details of the manuscript (name, date, edition, and sampled portion for the corpus), the page contains the genre and dialect of the text, as well as other information from the original Helsinki corpus from which the PPCME2 was derived, such as the relationship to the original text and its language, and the sex, age, and social rank of the author.

Enriching the annotation with such information makes the size of the corpus files much larger. This does not need to be a problem, especially given the low cost of data storage nowadays. However, there is another, more serious disadvantage in this approach. Maintaining this kind of annotation is time-consuming and not particularly efficient, because it involves creating copies of information already available in other sources.
Let us consider the example of a study on the relationship between the determiners a and an over time and the social rank of the author. The researcher would need to run a search of the corpus and then associate each occurrence of a/an in each text with the social rank of the author and the date of the text as given by the corpus pages exemplified in Figure 5.2. Let us now imagine that a new discovery reveals that the manuscript of the Parson’s Tale used by the compilers of PPCME2 was in fact produced ten years earlier than was thought previously. In order for the linguistic analysis on a/an to be updated, the corpus compilers would need to be informed and they would have to correct the corpus page (both for the date of the text and the age of the author); the data for the sociolinguistic study would then need to be re-extracted.
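The workflow just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two toy texts, their identifiers, and the metadata values are invented for the example and are not taken from PPCME2. The point is that the metadata are a copy kept alongside the corpus, so a redating means editing that copy and running the extraction again.

```python
import re
from collections import Counter

# Hypothetical toy corpus: text identifier -> text (invented, not PPCME2 data).
texts = {
    "text_a": "an apple and a pear and an hour",
    "text_b": "a king and a queen",
}

# Hypothetical copy of per-text metadata kept alongside the corpus.
metadata = {
    "text_a": {"date": 1390, "social_rank": "PROF HIGH"},
    "text_b": {"date": 1420, "social_rank": "OTHER"},
}


def count_determiners(text):
    """Count the word forms 'a' and 'an' (whole tokens, case-insensitive)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tok for tok in tokens if tok in ("a", "an"))


def extract(texts, metadata):
    """Pair the a/an counts of each text with its date and social rank."""
    rows = []
    for text_id, text in sorted(texts.items()):
        counts = count_determiners(text)
        meta = metadata[text_id]
        rows.append((text_id, meta["date"], meta["social_rank"],
                     counts["a"], counts["an"]))
    return rows


rows = extract(texts, metadata)

# A redating must be applied to the metadata copy, and the data re-extracted.
metadata["text_a"]["date"] -= 10
rows_after_correction = extract(texts, metadata)
```

Every study that extracted its data before the correction still holds the old date, which is the weakness of keeping copies.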
Helsinki Corpus information

File name                         CMCTPROS
Text identifier                   M3 NI FICT CTMEL
Text name                         CT MELIBEE
Author                            CHAUCER GEOFFREY
Period                            M3
Date of original                  1350–1420
Date of manuscript                1350–1420
Contemporaneity                   X
Dialect                           EML
Verse or prose                    PROSE
Text type                         FICTION
Relationship to foreign original  TRANSL
Foreign original                  FRENCH
Relationship to spoken language   WRITTEN
Sex of author                     MALE
Age of author                     40–60
Social rank of author             PROF HIGH
Audience description              X
Participant relationship          X
Interaction                       X
Setting                           X
Prototypical text category        NARR IMAG
Sample                            SAMPLE X

Figure 5.2 Page containing information about the text of Chaucer’s Parson’s Tale from the Penn–Helsinki Parsed Corpus of Middle English, https://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-3/info/cmctpars.m3.html (accessed 22 March 2015).
This process is prone to errors and requires a number of people to be aware of the new discovery. An alternative solution would involve having the data stored in one single place (a ‘knowledge base’), to which they would be linked from the corpus, for example, a repository of all manuscripts (or all Middle English manuscripts). In the scenario imagined above, such a repository would be the only resource requiring a change. As the corpus would link to it, those responsible for the corpus would just need to update the links to the repository in order to get the corrected metadata on which to base a sociolinguistic analysis. Linked data is a growing area of research and development in computing which offers the model for realizing this link, as we will see in section 5.3.1.

5.3.1 Linked data

The term ‘Linked Data’ refers to a way of representing data so that it can be interlinked. Bizer et al. (2008) define linked data as follows:

Linked Data is about employing the Resource Description Framework (RDF) and the Hypertext Transfer Protocol (HTTP) to publish structured data on the Web and to connect data between different data sources, effectively allowing data in one data source to be linked to data in another data source.
In simple terms, the World Wide Web consists of a large number of pages interlinked via HTML links. These links express a very rudimentary form of relationship between webpages: we know that one page is related to another, but we do not know the nature of this relationship, at least not explicitly from the link itself.12 In contrast, the approach of linked data assumes a ‘web of data’ whereby entities (and not just webpages) are connected through semantic links; these links identify the two entities being linked and express explicitly the type of link between them; moreover, this is done in such a way as to allow the information to be automatically read by computers.

In the RDF data model, links are expressed in the form of triples where a subject is connected to an object via a predicate that indicates the nature of the relationship between the two. Triples are an example of structured data (see section 4.3) that can be automatically retrieved by computer algorithms.

In order to illustrate RDF, we will take an example from DBPedia, which is a large resource of linked data derived from Wikipedia, representing one of the hubs of the emerging web of data. DBPedia organizes a subset of the Wikipedia entries into an ontology of over 4 million entities, covering persons, places, creative works, organizations, species, and diseases, together with the links between them. The DBPedia entry for Geoffrey Chaucer (the subject)13 lists a series of attributes pertaining to this writer (the predicates) and their respective values (the objects). For example, Chaucer is related to the date ‘1343-01-01’ through the predicate ‘Birth date’, to the place ‘Westminster_Abbey’ through the predicate ‘RestingPlace’, and to ‘Philippa_Roet’ via ‘spouse’. The two latter entities (‘Westminster_Abbey’ and ‘Philippa_Roet’) also have their own entries, thus creating an interlinked network of information. Using this knowledge base, it is possible to run searches that are not possible on Wikipedia, thus allowing for a much wider discoverability of the content of this resource. One example is a search for all authors who were born in the fourteenth century and whose spouses died in the fourteenth century.

Linked data collections may be open according to the Open Definition,14 which in its concise version states:

Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).

12 We could consider the context in which the link appears, for example the words surrounding it, and perform a distributional semantics analysis on that. However, what we are concerned with here is the explicit type of relationship between the two entities being linked.
13 http://dbpedia.org/page/Geoffrey_Chaucer.
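To make the triple model and the example search concrete, here is a minimal sketch in plain Python rather than in an RDF toolkit. The three Chaucer triples follow the description above, with the birth date written in ISO form. Everything else is an assumption made for illustration: the predicate spellings, the `type` triples, the death date given for Philippa_Roet, and the two `Author_X` entities are hypothetical and are not taken from DBPedia.

```python
# Each triple is (subject, predicate, object).
triples = [
    ("Geoffrey_Chaucer", "type", "Writer"),
    ("Geoffrey_Chaucer", "birthDate", "1343-01-01"),
    ("Geoffrey_Chaucer", "restingPlace", "Westminster_Abbey"),
    ("Geoffrey_Chaucer", "spouse", "Philippa_Roet"),
    ("Philippa_Roet", "deathDate", "1387-01-01"),    # hypothetical value
    ("Author_X", "type", "Writer"),                  # invented entity
    ("Author_X", "birthDate", "1450-01-01"),         # invented value
    ("Author_X", "spouse", "Spouse_of_X"),           # invented entity
    ("Spouse_of_X", "deathDate", "1390-01-01"),      # invented value
]


def objects(subject, predicate):
    """Return all objects linked to `subject` through `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def in_fourteenth_century(iso_date):
    """True if the year of a YYYY-MM-DD date falls between 1301 and 1400."""
    return 1301 <= int(iso_date[:4]) <= 1400


def authors_with_fourteenth_century_spouses():
    """Writers born in the 14th century whose spouse died in the 14th century."""
    result = []
    for subject, predicate, obj in triples:
        if predicate != "type" or obj != "Writer":
            continue
        if not any(in_fourteenth_century(d)
                   for d in objects(subject, "birthDate")):
            continue
        for spouse in objects(subject, "spouse"):
            if any(in_fourteenth_century(d)
                   for d in objects(spouse, "deathDate")):
                result.append(subject)
                break
    return result


matches = authors_with_fourteenth_century_spouses()
```

The query has to follow the `spouse` link from one entity to another before it can check the death date, which is the kind of traversal that explicit, typed links make possible.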
Linked open data are by definition easier for a wide audience to access, which offers new avenues of research for a large number of scientific fields. Linguistics is certainly one such field, and one unquestionable advantage of developing and using linked open data in linguistics is that resources can be combined to improve specific linguistic processing tasks. For example, combining a dictionary with a part-of-speech tagger makes it possible to perform dictionary-based part-of-speech tagging; another example is the integration of dictionaries and corpora, which allows the lexicographer to refer to corpus examples from lexical entries, and therefore place each example in its corpus context. Linking language resources in this way makes them at the same time integrated and interoperable. This means that the resources are not only provided with links to allow exchange of information, but that the interpretation of this information is consistent across the linked resources.

5.3.2 An example from the ALPINO Treebank

Let us take the example of a treebank (see section 4.3.2 for an illustration of treebanks). The ALPINO Treebank is a syntactically annotated corpus of Dutch (Van der Beek et al., 2002) with over 150,000 words from the newspaper part of the Eindhoven corpus. Its original format is in XML (illustrated in section 4.2), as shown below for the syntactic tree of the phrase In principe althans ‘In principle, at least’.
14 http://opendefinition.org/.
[XML listing of the syntactic tree for the sentence: In principe althans .]
The nodes of the dependency tree are tagged as <node>, and the attributes cat, rel, and pos stand for categories/phrase types, dependency relations, and part-of-speech tags, respectively. For example, the word in is a preposition (pos=‘prep’). Moreover, it is the first word of the sentence, so it begins at position 0 (begin=‘0’) and ends at position 1 (end=‘1’), and the lexical head of its phrase is in position 1 (hd=‘1’). The node corresponding to in is part of a prepositional phrase, so its parent node (which starts at position 0 and ends at position 2 because it includes principe as well) has cat=‘pp’.

Let us imagine that we want to make sure that the inventory of part-of-speech tags is consistent with an external tagset. The Linked Data approach to this is to link the corpus to another resource through RDF. One such resource is the General Ontology for Linguistic Description (Farrar and Langendoen, 2003).15 Here, we will consider the linguistic ontology LexInfo (Cimiano et al., 2011).16 The linking between the treebank and LexInfo allows us to connect the treebank with another corpus that uses the LexInfo tagset; moreover, if the tagset is updated, the part-of-speech information in the treebank will not need to be changed. Let us have a closer look at ontologies through the case of LexInfo in the next section.

The LexInfo ontology

In computer science an ontology formally defines the entities of a particular domain, together with their properties and relationships. OWL (Web Ontology Language) is the standard language used to represent ontologies. OWL defines classes and subclasses, which classify individuals into groups which share common characteristics; an ontology in OWL also specifies the types of relationships permitted between these individuals. As far as language resources are concerned, LEMON (LExicon Model for ONtologies, McCrae et al., 2012) is an RDF model specifically designed for lexicons and machine-readable dictionaries.
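Returning to the treebank example, the idea of linking part-of-speech tags to an external ontology can be sketched as follows. This is a hedged illustration, not the treebank's actual markup or an official mapping. The XML fragment is hypothetical and only mirrors the attribute values described above for in and its parent node; the entry for principe (including pos="noun") and the rel values are assumptions. The LexInfo namespace and the names partOfSpeech, preposition, and noun are likewise assumed for the example, as is the sentence identifier under example.org.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the description of the ALPINO format.
ALPINO_LIKE_XML = """
<node begin="0" end="2" cat="pp" rel="mod">
  <node begin="0" end="1" hd="1" pos="prep" rel="hd" word="In"/>
  <node begin="1" end="2" pos="noun" rel="obj1" word="principe"/>
</node>
"""

# Assumed namespace and term names for the external tagset.
LEXINFO = "http://www.lexinfo.net/ontology/2.0/lexinfo#"
POS_TO_LEXINFO = {
    "prep": LEXINFO + "preposition",
    "noun": LEXINFO + "noun",
}


def link_tokens(xml_text, sentence_id):
    """Return (subject, predicate, object) triples linking each word node
    to the external part-of-speech category that matches its pos tag."""
    root = ET.fromstring(xml_text.strip())
    triples = []
    for node in root.iter("node"):
        pos = node.get("pos")
        if pos is None:
            continue  # phrase nodes carry cat, not pos
        subject = f"{sentence_id}#token{node.get('begin')}"
        triples.append((subject, LEXINFO + "partOfSpeech", POS_TO_LEXINFO[pos]))
    return triples


# Hypothetical identifier for the sentence within the corpus.
links = link_tokens(ALPINO_LIKE_XML, "http://example.org/alpino/s1")
```

Because the treebank side stores only links, a change in the definition of a category is made once in the ontology, and the corpus annotation itself stays untouched.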
LexInfo is a model for relating linguistic information (such as part of speech, subcategorization frames) to ontology elements (such as concepts, relations, individuals), following the LEMON model. The following example shows the portion of the LexInfo ontology relative to the category of adverbs.17

15 http://www.linguistics-ontology.org/.
16 http://www.lexinfo.net.
17 The line numbers were added by us.
[RDF listing: portion of the LexInfo ontology for the category of adverbs, lines 1–10]
Let us examine each element of this RDF snippet.
•